.pkl files are included in the repo. When new quarterly data becomes available, use the update scripts to retrain.
Air fare
Download a new DB1B.asc file from the BTS website, then:
- Parse the .asc file via setup_datasets/ingest_db1b.py
- Rebuild the historical OD pairs dataset (2023–present)
- Retrain the HGB + RF ensemble with time-decay weighting
- Back up the old model to fare_model_historical.pkl.bak before overwriting
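
As an illustration of the time-decay weighting step, the sketch below assigns each training row a weight that halves every few quarters. It is not taken from the update script: the function names and the four-quarter half-life are assumptions, and the real script may use a different decay curve.

```python
from datetime import date


def quarter_index(d: date) -> int:
    """Map a date to a running quarter count so differences are whole quarters."""
    return d.year * 4 + (d.month - 1) // 3


def time_decay_weights(dates, reference, half_life_quarters=4.0):
    """Exponential decay: a row `half_life_quarters` old gets weight 0.5.

    `half_life_quarters` is an assumed tuning knob, not a value from the repo.
    Rows dated after `reference` are clamped to weight 1.0.
    """
    ref_q = quarter_index(reference)
    return [
        0.5 ** (max(ref_q - quarter_index(d), 0) / half_life_quarters)
        for d in dates
    ]


weights = time_decay_weights(
    [date(2023, 1, 15), date(2024, 1, 15), date(2025, 1, 15)],
    reference=date(2025, 1, 15),
)
print(weights)
```

Weights of this kind can be passed as the sample_weight argument of a scikit-learn estimator's fit method, which both HistGradientBoostingRegressor and RandomForestRegressor accept.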
Air time
T100_Domestic_Segment/ raw data. Add new T-100 .zip files to that folder first.
Drive time
The drive time model is built from OSRM routing results. The matrix build queries the public OSRM API and is resumable, so it is safe to interrupt.
--demo uses routing.openstreetmap.de (~1 req/sec, no API key needed). Omit it to use a local OSRM server.
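
One common way to make a build like this resumable is to checkpoint results after every request and skip pairs that are already stored. The sketch below shows that pattern only; it is not the repo's implementation. build_matrix, fetch_duration, and the JSON checkpoint file are hypothetical names, and the fetch function is a stub standing in for a real routing request.

```python
import json
import tempfile
import time
from pathlib import Path


def build_matrix(pairs, fetch_duration, checkpoint_path, delay_s=1.0):
    """Fetch a duration for each (origin, dest) pair, resuming from a checkpoint.

    `fetch_duration` is any callable (origin, dest) -> seconds; in a real build
    it would wrap the routing request. `delay_s` throttles to about 1 req/sec.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for origin, dest in pairs:
        key = f"{origin}|{dest}"
        if key in done:
            continue  # fetched in an earlier run, skip it
        done[key] = fetch_duration(origin, dest)
        # Write to a temp file, then swap it in, so an interrupt
        # cannot leave a half-written checkpoint behind.
        tmp = path.with_suffix(path.suffix + ".tmp")
        tmp.write_text(json.dumps(done))
        tmp.replace(path)
        time.sleep(delay_s)
    return done


def fake_fetch(origin, dest):
    """Stub in place of a network call; always reports one hour."""
    return 3600.0


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "drive_matrix.json"
    # First run is "interrupted" after one pair; the second run resumes.
    build_matrix([("JFK", "BOS")], fake_fetch, checkpoint, delay_s=0)
    result = build_matrix(
        [("JFK", "BOS"), ("JFK", "PHL")], fake_fetch, checkpoint, delay_s=0
    )
print(result)
```

Writing the checkpoint after every pair costs some I/O, but at roughly one request per second the request itself dominates.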
Full local OSRM pipeline
For a local build (more coverage, faster queries):
- osmium-tool and osrm-backend installed
- ~16 GB RAM
- ~15 GB disk
- Several hours of processing time
Use --skip-download or --skip-process to resume from a partial build.
Train time
data/amtrak_gtfs.zip and retrains. Download a fresh GTFS feed from Amtrak’s open data page and replace the zip before retraining.
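
Before retraining on a freshly downloaded feed, it may be worth checking that the zip still contains the core GTFS tables. The sketch below is a hypothetical pre-flight check, not part of the repo. The file list comes from the GTFS Schedule specification; which of these tables the trainer actually reads is not stated here.

```python
import zipfile

# Files the GTFS Schedule spec marks as required in every feed.
REQUIRED_GTFS_FILES = {
    "agency.txt",
    "stops.txt",
    "routes.txt",
    "trips.txt",
    "stop_times.txt",
}


def missing_gtfs_files(zip_path):
    """Return the required GTFS files that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        # Compare base names so feeds zipped inside a subfolder still pass.
        names = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return sorted(REQUIRED_GTFS_FILES - names)
```

Calling missing_gtfs_files("data/amtrak_gtfs.zip") returns an empty list when the feed looks complete.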
Trainer architecture
All trainers inherit from BaseTrainer in src/air_travel_model/trainers/base.py:
- load_data(): reads CSV training data
- engineer_features(df): produces (X, feature_names, y, sample_weights)
- build_route_stats(df): per-OD-pair lookup dict (distance, volume, averages)
- build_aggregate_lookups(df): per-origin/dest fallback stats
- train(): fits HGB + RF, evaluates on 20% holdout, picks best by R², saves .pkl
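
The methods listed above suggest a template-method layout, where the base class owns train() and subclasses supply the data hooks. The sketch below outlines that idea with the standard library only and is not the code in base.py. TrainerSketch, make_models, and the two toy models are hypothetical stand-ins for the real HGB and RF estimators, and the lookup builders and .pkl saving are left out.

```python
import random
from abc import ABC, abstractmethod


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


class TrainerSketch(ABC):
    """Outline only; NOT the repo's BaseTrainer."""

    @abstractmethod
    def load_data(self): ...

    @abstractmethod
    def engineer_features(self, rows): ...  # -> (X, feature_names, y, weights)

    @abstractmethod
    def make_models(self): ...  # hypothetical hook -> {name: model}

    def train(self, holdout=0.2, seed=0):
        X, _names, y, w = self.engineer_features(self.load_data())
        idx = list(range(len(y)))
        random.Random(seed).shuffle(idx)
        n_test = max(1, int(len(idx) * holdout))
        test, train = idx[:n_test], idx[n_test:]
        scores = {}
        for name, model in self.make_models().items():
            model.fit([X[i] for i in train], [y[i] for i in train],
                      sample_weight=[w[i] for i in train])
            preds = model.predict([X[i] for i in test])
            scores[name] = r2([y[i] for i in test], preds)
        best = max(scores, key=scores.get)  # highest holdout R² wins
        return best, scores


class MeanModel:
    """Toy baseline: always predicts the weighted training mean."""

    def fit(self, X, y, sample_weight):
        total = sum(sample_weight)
        self.mean = sum(w * t for w, t in zip(sample_weight, y)) / total
        return self

    def predict(self, X):
        return [self.mean for _ in X]


class SlopeModel:
    """Toy model: weighted least-squares line through the origin."""

    def fit(self, X, y, sample_weight):
        num = sum(w * r[0] * t for w, r, t in zip(sample_weight, X, y))
        den = sum(w * r[0] ** 2 for w, r in zip(sample_weight, X))
        self.slope = num / den
        return self

    def predict(self, X):
        return [self.slope * r[0] for r in X]


class ToyTrainer(TrainerSketch):
    def load_data(self):
        return [(float(x), 2.0 * x) for x in range(1, 11)]  # y = 2x

    def engineer_features(self, rows):
        X = [[x] for x, _ in rows]
        y = [t for _, t in rows]
        return X, ["distance"], y, [1.0] * len(rows)

    def make_models(self):
        return {"mean": MeanModel(), "slope": SlopeModel()}


best, scores = ToyTrainer().train()
print(best, scores)
```

Keeping the split, scoring, and model selection in the base class means a new travel mode only has to supply its own data loading and feature engineering.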