Forecasting Flight Delays Monthly is a machine learning project that applies exploratory data analysis, statistical analysis, feature engineering, and machine learning regression to predict monthly airline delay volumes.
This repository walks through a complete forecasting workflow, from raw data through model evaluation, and delivers actionable insights for operational planning and airline decision support.
Flight delays are a persistent challenge for airlines, airports, and passengers. Accurately forecasting delay volumes at a monthly level can help with:
- Staff planning
- Resource allocation
- Customer service readiness
- Budgeting and risk planning
This project uses historical delay data and regression modeling to estimate future monthly delay counts; the final model reaches a test R² of about 0.80.
.
├── Airline_Delay_Cause.csv # Original monthly delay dataset
├── cleaned_Airline_Delay_Data.csv # Cleaned and preprocessed data
├── notebooks/ # Jupyter notebooks with analysis & modeling
├── requirements.txt # Python dependencies
└── README.md # Project overview
The dataset includes monthly summaries of flights and delays from all U.S. carriers and airports, with features such as:
- Year and month
- Total flights
- Delay counts
- Delay types and causes
Preprocessing steps include cleaning, feature extraction, and cyclical encoding for time features.
We investigate:
- Monthly and yearly trends in delay volumes
- Seasonal patterns using cyclic encodings (sin/cos)
- Correlation between features and delay outcomes
Visualizations are provided via notebooks to illustrate patterns and support feature engineering decisions.
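The three questions above can be sketched with a few pandas calls. This is a minimal illustration on a toy table, not the notebooks' actual code: the column names (`year`, `month`, `total_flights`, `delay_count`) and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the cleaned monthly data.
# Column names and values are hypothetical, chosen for illustration only.
df = pd.DataFrame({
    "year":          [2022, 2022, 2022, 2023, 2023, 2023],
    "month":         [1, 6, 12, 1, 6, 12],
    "total_flights": [900, 1200, 1100, 950, 1300, 1150],
    "delay_count":   [180, 300, 260, 170, 340, 250],
})

# Seasonal profile: average delay volume per calendar month.
monthly_profile = df.groupby("month")["delay_count"].mean()

# Long-run trend: total delay volume per year.
yearly_totals = df.groupby("year")["delay_count"].sum()

# Pearson correlation of each feature with the target.
correlations = df.corr()["delay_count"].drop("delay_count")

print(monthly_profile)
print(yearly_totals)
print(correlations.sort_values(ascending=False))
```

On real data the same calls would run against the cleaned CSV loaded with `pd.read_csv`, using whatever column names that file actually contains.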
Time features are encoded cyclically (e.g., month via sine/cosine) to capture seasonality, so that December and January are treated as neighbors rather than as opposite ends of a 1 to 12 scale. Additional engineered variables help the models pick up temporal patterns.
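A minimal sketch of the sine/cosine month encoding, using only the standard library. The helper names `encode_month` and `circle_distance` are hypothetical, and shifting the month to start at zero is an assumption; the notebooks may scale the angle differently.

```python
import math


def encode_month(month: int) -> tuple[float, float]:
    """Map a calendar month (1-12) to a (sin, cos) point on the unit circle."""
    # Assumption: January sits at angle 0; each month advances 1/12 of a turn.
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def circle_distance(month_a: int, month_b: int) -> float:
    """Euclidean distance between two months in the encoded space."""
    sin_a, cos_a = encode_month(month_a)
    sin_b, cos_b = encode_month(month_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# December to January is as close as January to February,
# while months half a year apart are maximally distant.
print(circle_distance(12, 1))
print(circle_distance(1, 2))
print(circle_distance(1, 7))
```

Both components are needed: the sine alone cannot tell apart months that mirror each other on the circle (for example February and June under this convention), while the pair identifies each month uniquely.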
Bagging and boosting regression models, including Random Forest and CatBoost, are trained and validated, with hyperparameters tuned using Optuna.
Final Test Results (Random Forest):
| Metric | Value |
|---|---|
| R² (Test) | 0.801 |
| MAE | 25.46 flights |
| RMSE | 80.27 flights |
| Median AE | 5.50 flights |
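The model explains about 80% of the variance in test-set delay counts. The RMSE is well above the MAE, which in turn is well above the median absolute error, so most predictions are close while a small number of large misses account for much of the total error. Because the errors are expressed in flights, they are easy to interpret for planning and forecasting.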
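For readers who want to see how the four metrics in the table relate, here is a small standard-library sketch. The function name `regression_metrics` and the sample values are made up for illustration and are not the project's data; the toy series includes one large miss to show how RMSE, MAE, and median absolute error can diverge.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, RMSE, and median absolute error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mean_true = statistics.fmean(y_true)
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": statistics.fmean(abs_errors),
        "rmse": math.sqrt(ss_res / len(errors)),
        "median_ae": statistics.median(abs_errors),
    }


# Hypothetical values: four close predictions and one large miss.
y_true = [100, 120, 90, 300, 110]
y_pred = [102, 117, 95, 220, 108]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy case the single large miss pulls RMSE far above MAE, while the median absolute error stays small, which is the same ordering seen in the results table.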
The repository includes visualizations for:
- Actual vs. predicted delay timelines
- Train vs. test comparisons
- Feature importance plots
These visuals help stakeholders quickly grasp performance and trends.
This model can be used to support:
- Operational staffing forecasts
- Seasonal capacity planning
- Budget forecasting for delay‑related costs
- Early warning for high‑demand or high‑delay months
With about 80% of variance explained on the test set, the forecasts give airlines and airports a useful input for strategic planning.
Potential enhancements include:
- Integrating weather forecast data
- Extending to weekly or daily forecasts
- Developing flight‑level prediction models
- Deploying as a production service or API
Install the required Python libraries:
pip install -r requirements.txt
🧾 Licensing & Credits
Data is provided in the repository. Ensure compliance with any original data usage terms.
Thanks to all data contributors and open‑source tooling used throughout this project.
🙌 Acknowledgments
This project was developed as part of a comprehensive time‑series forecasting and machine learning workflow, showcasing real‑world modeling, validation, and business reasoning.
Happy forecasting!