Main Configuration
Main Configuration File
The main configuration file (in YAML format) is designed to provide flexible access to various data sources utilised within the Transit Corridor Analytics process. A customised Python module accompanies this YAML configuration file, enabling validation of the provided data. The file is structured into several high-level keys and can be expanded as needed for future applications.
Each key contains specific configuration information as outlined below:
Folder Locations
The folder_locations key defines the structure of the project’s folder hierarchy. It contains several sub-keys, each representing one of the main folder locations within the project:
src_path (Required): Specifies the absolute path to the folder where the project’s source code is stored.
- Example: C:/path/to/transit_corridor_analytics/3_etl/src
raw_folder (Required): Contains the relative path to the location where raw data (e.g. Ticketing data, HASTUS data, GTFS data) are stored. This path is relative to the src_path.
- Example: ../../1_raw_data/
schema_folder (Required): Contains the relative path to the location where data schema files (in YAML format) are stored. This path is relative to the src_path.
- Example: ../assets/schema_files
input_folder (Required): Contains the relative path to the location where input data (processed/ingested by the Transit Corridor Analytics processes) are stored. This path is relative to the src_path.
- Example: ../../2_input_data/
output_folder (Required): Contains the relative path to the location where output data (produced by the Transit Corridor Analytics processes) are stored. This path is relative to the src_path.
- Example: ../../4_outputs/
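As a rough illustration of how the relative folder entries relate to src_path, the sketch below resolves each of them against it. The folder_locations dictionary mirrors the example values above, and resolve_folders is a hypothetical helper, not part of the project's configuration module. posixpath is used only so that the forward-slash example paths normalise the same way on any operating system.

```python
import posixpath

# Example values copied from the folder_locations descriptions above.
folder_locations = {
    "src_path": "C:/path/to/transit_corridor_analytics/3_etl/src",
    "raw_folder": "../../1_raw_data/",
    "schema_folder": "../assets/schema_files",
    "input_folder": "../../2_input_data/",
    "output_folder": "../../4_outputs/",
}


def resolve_folders(locations):
    """Hypothetical helper: join each relative folder onto src_path."""
    src_path = locations["src_path"]
    resolved = {"src_path": src_path}
    for key, value in locations.items():
        if key != "src_path":
            # normpath collapses the ".." segments and drops the trailing slash.
            resolved[key] = posixpath.normpath(posixpath.join(src_path, value))
    return resolved


resolved_folders = resolve_folders(folder_locations)
for key, path in resolved_folders.items():
    print(f"{key}: {path}")
```

With these example values, raw_folder resolves to C:/path/to/transit_corridor_analytics/1_raw_data and schema_folder to C:/path/to/transit_corridor_analytics/3_etl/assets/schema_files.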
Data Items
The data_items key lists the data items under predefined category sub-keys. These categories correspond to the folder structure defined above:
raw (Required): Lists all of the data items in the raw_folder.
inputs (Required): Lists all of the data items in the input_folder.
outputs (Required): Lists all of the data items in the output_folder.
Each data item under any of these categories may require various parameters. This configuration file allows parameterising paths and files. The sub-keys defined within each data item are explained below. Optional fields (sub-keys) are indicated and examples of their application are provided accordingly.
name (Required): A custom name defined by the user, assigned to each data item. This name is used to access the data items programmatically and should be unique within each category (raw, inputs, outputs).
- Example: raw_transaction, ref_gtfs_time_periods, stop_to_stop_measure
storage_type (Required): Specifies the type of storage for the data item. It can be any of the predefined types: local_drive, db_local, or db_cloud. Currently, all data processed within the Transit Corridor Analytics are stored on local_drive.
- Example: local_drive
file_type (Required for local_drive): Specifies the file type for locally stored data. Predefined data types include csv, shapefile, hyper, geopackage, zip, parquet. These file types are defined in src.configuration.file_types.
- Example: csv, parquet
relative_path (Required if absolute_path is not provided): Defined for each data item stored on a local drive and is relative to the base folder in the category where the data item is defined. For example, the relative path for items under the inputs key is relative to the input_folder. If both relative_path and absolute_path are provided, absolute_path takes precedence.
- Example: 2_ticketing/{version}/transactions_daily_{version}.zip
absolute_path (Optional): Optionally defined for each data item stored locally. If both relative_path and absolute_path are provided, absolute_path takes precedence.
- Example: C:/Modelling_Projects/Data/transactions_daily_oct_2023.zip
schema_file_name (Optional): Similar to the relative path, the schema file name is a relative path to where the schema file for the corresponding data item is stored. It is relative to the schema_folder defined in folder_locations.
- Example: raw_transactions_schema_{schema_version}.yaml
is_partitioned (Optional): A boolean indicating whether the data is partitioned. Partitioning is currently based on ‘Year’, ‘Month’ and ‘Day’ and is used for reading ticketing data (e.g. transaction data and Trip Stop Timing reports). When is_partitioned is set to true, the process updates the path for this data item accordingly.
- Example: True or False
schema_version (Required for main input and output tables read from local storage): Specifies the schema version for the data item. If it is not provided, the reading and writing process treats all fields as strings. The version number follows a Semantic Versioning-style format with a leading year (YYYY.MAJOR.MINOR).
- Example: 2023.0.0
requires_region (Optional): A boolean indicating whether the data item requires a specific region in its path or processing.
- Example: True or False
layer_name (Optional): Specifies the name of the layer when dealing with geospatial data files, such as geopackage.
- Example: staged
Additionally, default_variables can be defined for data items to set default values for placeholders used in paths or other keys.
- Example: {'version': '2023.0.0', 'schema_version': '2021.0.0'}
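The interaction between relative_path, absolute_path and default_variables can be sketched as follows. This is a minimal illustration under assumptions, not the module's implementation: resolve_item_path, the layout of the item dictionary, the placement of default_variables inside the data item, and the use of str.format for placeholder substitution are all hypothetical. The base folder is the raw_folder example resolved against the example src_path.

```python
import posixpath

# Hypothetical data item, built from the example values in this section.
item = {
    "name": "raw_transaction",
    "storage_type": "local_drive",
    "file_type": "zip",
    "relative_path": "2_ticketing/{version}/transactions_daily_{version}.zip",
    "default_variables": {"version": "2023.0.0", "schema_version": "2021.0.0"},
}


def resolve_item_path(item, base_folder, **overrides):
    """Hypothetical helper: pick the path template and fill its placeholders."""
    # Defaults come first so that explicit keyword arguments override them.
    variables = {**item.get("default_variables", {}), **overrides}
    if item.get("absolute_path"):
        # absolute_path takes precedence over relative_path when both are set.
        template = item["absolute_path"]
    else:
        template = posixpath.join(base_folder, item["relative_path"])
    # Variables without a matching placeholder are ignored by str.format.
    return template.format(**variables)


raw_folder = "C:/path/to/transit_corridor_analytics/1_raw_data"
default_path = resolve_item_path(item, raw_folder)
override_path = resolve_item_path(item, raw_folder, version="2024.0.0")

item_with_absolute = {
    **item,
    "absolute_path": "C:/Modelling_Projects/Data/transactions_daily_oct_2023.zip",
}
absolute_result = resolve_item_path(item_with_absolute, raw_folder)

print(default_path)
print(override_path)
print(absolute_result)
```

In this sketch a keyword argument such as version replaces the default for a single call, while the stored defaults remain unchanged.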
Connection
The connection section contains configuration details for database connections (this is currently not used within the Transit Corridor Analytics as all of the data are stored on local drives). Each connection is identified by a connection_name; tables that specify the same connection_name use that connection's details. Attributes of each connection vary based on specific requirements, but some common keys include:
connection_name (Required): Identifies the correct connection details for a given table.
- Example: local_postgres
host (Required): The hostname or IP address of the database server.
- Example: localhost
port (Required): The port number for the database server.
- Example: 5432
database_name (Required): The name of the database to connect to.
- Example: sampledb
user (Required): The username for the database connection.
- Example: user
password (Required): The password for the database connection.
- Example: password
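For illustration only, the sketch below shows one way a table could look up its connection details by connection_name. The list layout of connections, the find_connection and to_postgres_url helpers, and the URL built from the result are all assumptions. The source does not describe how connections are consumed, and the section is currently unused.

```python
from urllib.parse import quote

# Hypothetical connection entries using the example values above.
connections = [
    {
        "connection_name": "local_postgres",
        "host": "localhost",
        "port": 5432,
        "database_name": "sampledb",
        "user": "user",
        "password": "password",
    },
]

REQUIRED_KEYS = ("connection_name", "host", "port", "database_name", "user", "password")


def find_connection(connections, connection_name):
    """Hypothetical helper: return the entry matching connection_name."""
    for entry in connections:
        if entry.get("connection_name") == connection_name:
            missing = [key for key in REQUIRED_KEYS if key not in entry]
            if missing:
                raise ValueError(f"Connection '{connection_name}' is missing: {missing}")
            return entry
    raise KeyError(f"No connection named '{connection_name}'")


def to_postgres_url(entry):
    """Build a PostgreSQL-style connection URL; credentials are percent-encoded."""
    user = quote(str(entry["user"]), safe="")
    password = quote(str(entry["password"]), safe="")
    return (
        f"postgresql://{user}:{password}@{entry['host']}:{entry['port']}"
        f"/{entry['database_name']}"
    )


connection = find_connection(connections, "local_postgres")
connection_url = to_postgres_url(connection)
print(connection_url)
```

Failing early on a missing key or an unknown connection_name keeps configuration mistakes separate from database errors raised later.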
Configuration Module
The configuration module provides a structured way to manage and access the configuration settings defined within the main configuration file for different data types (raw, input, output). It revolves around the ConfigReader class, which is responsible for loading, validating and interpreting the YAML config file. This module also defines entities like data types, storage types and regions, which collectively determine how data is processed, stored and accessed.
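The kind of validation mentioned here can be illustrated with the required and optional sub-keys from the Data Items section. The sketch below is not the ConfigReader implementation: validate_data_item and its messages are hypothetical, and only rules stated in this document are checked. The storage and file types are copied from the descriptions above rather than imported from src.configuration.file_types.

```python
STORAGE_TYPES = {"local_drive", "db_local", "db_cloud"}
FILE_TYPES = {"csv", "shapefile", "hyper", "geopackage", "zip", "parquet"}


def validate_data_item(item):
    """Hypothetical check: return a list of problems found in one data item."""
    problems = []
    for key in ("name", "storage_type"):
        if not item.get(key):
            problems.append(f"missing required key: {key}")
    storage_type = item.get("storage_type")
    if storage_type and storage_type not in STORAGE_TYPES:
        problems.append(f"unknown storage_type: {storage_type}")
    if storage_type == "local_drive":
        # file_type is required for locally stored data.
        if item.get("file_type") not in FILE_TYPES:
            problems.append(f"missing or unknown file_type: {item.get('file_type')}")
        # relative_path is required unless absolute_path is provided.
        if not item.get("relative_path") and not item.get("absolute_path"):
            problems.append("either relative_path or absolute_path is required")
    return problems


valid_item = {
    "name": "raw_transaction",
    "storage_type": "local_drive",
    "file_type": "zip",
    "relative_path": "2_ticketing/{version}/transactions_daily_{version}.zip",
}
invalid_item = {"name": "stop_to_stop_measure", "storage_type": "local_drive"}

valid_problems = validate_data_item(valid_item)
invalid_problems = validate_data_item(invalid_item)
print(valid_problems)
print(invalid_problems)
```

Returning a list of problems, rather than raising on the first one, lets every issue in a configuration file be reported in a single pass.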
ConfigReader Class
The main class that users interact with is ConfigReader. When instantiated, it takes a path to a YAML configuration file and loads the necessary settings. The class provides an interface for retrieving configurations for raw data, input data and output data, based on the structure defined in the YAML file. The different stages of the ETL tools initialise this class as follows, loading the configuration from the specified YAML file:
from src.configuration.config_reader import ConfigReader
config_file_path = r"path/to/transit_corridor_analytics/3_etl/assets/config_files/config_v2024.0.xx.yaml"
config_reader = ConfigReader(config_file_path)
- Key Method:
get_table_config: This method retrieves the configuration of a particular data item (raw, input, or output). It returns details like file paths, storage types, partitioning information and schema details, depending on the data type being queried. Example:
config = config_reader.raw.get_table_config(name="itinerary")
For data that requires regional and version-based partitioning, the module allows specifying default variables like region, version and schema_version. These are replaced dynamically when constructing file paths or querying data configurations. For example, accessing one of the outputs of the ETL process that is partitioned and stored locally is shown below:
append_corridor_config = config_reader.outputs.get_table_config(
    "appended_corridors",
    corridor_type='Infrastructure',
    region='SEQ',
    corridor_id='1041',
    version='20240305',
)