Introduction

In this project, our goal is to predict flight delays for departures from Pittsburgh International Airport (PIT) in 2024, using publicly available flight data from 2022 and 2023. We aim to predict the variable DEP_DEL15, a binary indicator of whether a flight departs 15 minutes late or more.
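As a concrete sketch of the target variable, DEP_DEL15 can be derived from the raw departure delay in minutes. The helper below is our own illustration (the field name DEP_DELAY is an assumption based on the standard on-time dataset), not part of the original pipeline:

```python
def dep_del15(dep_delay_minutes):
    """Return 1 if the flight departed 15 or more minutes late, else 0.

    Assumes dep_delay_minutes is the raw DEP_DELAY value (negative for
    early departures). Illustrative only.
    """
    return 1 if dep_delay_minutes >= 15 else 0
```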

Data Exploration

We began by cleaning and exploring both the 2022 and 2023 datasets. After filtering to flights departing from PIT, we removed variables that leak the outcome we are predicting and thus should not be used for training. We also removed redundant information: for instance, since we already have the full date, there is no need for a separate month column.
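The cleaning step described above can be sketched as follows. This is a minimal illustration, not the original cleaning code, and the column names (ORIGIN, DEP_DELAY, MONTH, etc.) are assumptions based on the standard on-time dataset:

```python
# Columns known only after departure would leak the outcome into training.
LEAKY = {"DEP_DELAY", "DEP_TIME", "ARR_DELAY", "CANCELLED"}
# Columns recoverable from the full flight date are redundant.
REDUNDANT = {"YEAR", "MONTH"}

def clean(rows):
    """Keep only PIT departures, dropping leaky and redundant columns."""
    out = []
    for row in rows:
        if row.get("ORIGIN") != "PIT":
            continue
        out.append({k: v for k, v in row.items()
                    if k not in LEAKY and k not in REDUNDANT})
    return out
```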

The variables that remain are:

Weather

After removing information unrelated to the weather forecast and redundant information, the remaining columns are:

Exploratory Data Analysis

Around 15% of flights departing from PIT between 2022 and 2023 were delayed. The plot of delays over time shows a clear increase during the winter months, indicating that the date could be a potential predictor of flight delays. Since the winter months have higher numbers of delays, we decided to also take a look at weather data.

From these boxplots, we see that delayed flights tend to have higher temperature, dew point, and wind gust, and lower sea level pressure and station pressure, suggesting these might be predictors of delay status. We also examined the relationship between departure delay status and flight distance.

From this plot, we see that delayed flights generally covered longer distances. A t-test confirmed a significant difference between the two groups, with delayed flights averaging 709 miles, compared to 618 miles for non-delayed flights. This difference in mean distance suggests that distance could be a potential predictor of flight delays.
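The distance comparison above rests on a two-sample t-test. As a sketch of the underlying computation (the original analysis presumably used a statistical package; this version computes only the Welch t statistic, omitting the degrees of freedom and p-value):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples a and b.

    Positive when mean(a) > mean(b), e.g. when delayed flights
    travel farther on average than on-time flights.
    """
    na, nb = len(a), len(b)
    se2 = variance(a) / na + variance(b) / nb  # squared standard error
    return (mean(a) - mean(b)) / se2 ** 0.5
```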

Some airlines have a larger proportion of delayed flights than others. Specifically, Frontier had the largest proportion of delayed flights at 27% of its flights, while Republic had the lowest at 8%. This suggests that the specific airline operating a flight could be a potential predictor of delays.
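The per-airline delay proportions quoted above come from a simple grouped rate; a minimal sketch of that computation (field names are illustrative assumptions):

```python
from collections import defaultdict

def delay_rate_by_carrier(flights):
    """Map each carrier to its fraction of flights with DEP_DEL15 == 1."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for f in flights:
        totals[f["carrier"]] += 1
        delayed[f["carrier"]] += f["DEP_DEL15"]
    return {c: delayed[c] / totals[c] for c in totals}
```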

Supervised Analysis

Our predictions were generated using a random forest model. From the original dataset, we made predictions based on the flight date, the carrier, the tail number of the plane, the destination airport, the destination state name, the scheduled departure time, the expected flight time, and the distance the plane travels. We additionally gathered data from the National Climatic Data Center Global Surface Summary of the Day, with Pittsburgh weather data taken from the Pittsburgh Allegheny Co Airport station. We did not include weather data for the destinations of these flights in the interest of time. However, we expect the Pittsburgh weather to help, since conditions at the origin airport can determine when a plane is able to depart.
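The weather attachment described above amounts to a join of flight records with the daily GSOD summary by calendar date. A minimal sketch of that merge (key and field names are illustrative assumptions, not the exact columns of either dataset):

```python
def merge_weather(flights, weather_by_date):
    """Attach the day's PIT weather summary to each flight by date.

    Flights with no matching weather record are dropped, a simplifying
    choice for this sketch.
    """
    merged = []
    for f in flights:
        w = weather_by_date.get(f["FL_DATE"])
        if w is not None:
            merged.append({**f, **w})  # flight fields plus weather fields
    return merged
```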

We opted for a random forest model because of its lack of tuning parameters save for the number of trees fitted (we chose 1000, seeing no improvement in the results as we increased the number of trees beyond this point). Additionally, a random forest gives us a measure of how useful each variable is for making predictions. We find that the most significant variables are the date the flight is scheduled to depart, the tail number of the plane, and the destination airport. However, because a random forest aggregates a large number of small trees, it is difficult to say anything precise about the relationship between the predictors and the predictions beyond this measure of importance (the mean decrease in Gini impurity when splitting on a given variable). Furthermore, alongside estimating our test accuracy with an explicit train/test split, we can estimate generalization error from the model's performance on the out-of-bag data, i.e., the observations left out of each bootstrap sample when training the forest.

Analysis of results

The measure of variable importance that a random forest naturally provides (mean decrease in Gini impurity) is shown below. It suggests that the scheduled departure time is one of the most important predictors of a flight being delayed, with the tail number of the plane second.
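For concreteness, the mean decrease in Gini averages, over every split on a given variable across all trees, the impurity reduction that split achieves. A sketch of the per-split quantity for our binary (delayed / not delayed) setting:

```python
def gini(p):
    """Gini impurity of a node where fraction p of flights are delayed."""
    return 2 * p * (1 - p)

def gini_decrease(n_parent, p_parent, n_left, p_left, n_right, p_right):
    """Impurity decrease for one split; the forest's 'mean decrease in
    Gini' for a variable averages these over all splits on it."""
    wl = n_left / n_parent
    wr = n_right / n_parent
    return gini(p_parent) - (wl * gini(p_left) + wr * gini(p_right))
```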

Producing a confusion matrix of our model’s performance on the out-of-bag data, we see that we are more likely to give a false positive than a false negative, with a 12% false positive rate and a 1.7% false negative rate. This is unsurprising, given that the majority of the flights (~85%) are not delayed by 15 minutes or more.
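The two rates quoted above follow directly from the confusion matrix counts; a minimal sketch of that calculation (labels assumed to be 1 = delayed, 0 = not delayed):

```python
def fpr_fnr(y_true, y_pred):
    """False positive rate and false negative rate for binary labels.

    FPR = fraction of truly on-time flights predicted delayed;
    FNR = fraction of truly delayed flights predicted on-time.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    neg = sum(1 for t in y_true if t == 0)
    pos = sum(1 for t in y_true if t == 1)
    return fp / neg, fn / pos
```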

Analyzing a dataframe consisting of only the misclassified flights in the test split, we can note trends in which kinds of flights we misclassify by creating barplots for categorical variables and histograms for continuous ones.

Regarding destination location, we see that the majority of the misclassified flights have their destination in Florida, with New York and Illinois coming second. This mirrors the distribution of flights in the test data, though relative to the number of flights heading to Illinois or New York, the proportion misclassified for those destinations is lower than the overall rate across all states.

Regarding flight distance, the distribution of misclassified flights is similar to the distribution of flights overall, so visually, distance appears to have little influence on whether a flight is misclassified. The same holds for the expected duration of the flight and all of the weather variables, suggesting that errors are not concentrated in particular ranges of these variables and that they remain reliable for making predictions.

For example, plotted below are the distribution of temperatures in the test data and the distribution of temperatures for the misclassified test points (in pink).

Regarding airlines, the most commonly misclassified airlines are Southwest, Republic, and American. While these are also the most common airlines in the test set, Southwest is misclassified more often despite appearing less frequently than Republic, suggesting that Southwest's delay behavior is less consistent and harder to predict.

If we had to optimize for AUC rather than misclassification rate, we would need a score rather than a hard label; our fitted random forest reports only a classification, though the fraction of trees voting for a delay could in principle serve as such a score. Either way, we can still rely on the measures of variable importance from training the random forest to choose which variables to include in a model that outputs probabilities directly, such as a generalized additive model or kernel regression.
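If per-flight scores were available (for example, vote fractions from a forest, or fitted probabilities from another model; an assumption here, since our model reported only hard labels), AUC itself is straightforward to compute: it is the probability that a randomly chosen delayed flight outscores a randomly chosen on-time flight, with ties counted as half. A sketch:

```python
def auc(scores, labels):
    """AUC as the pairwise win rate of positive over negative scores.

    labels: 1 = delayed, 0 = not delayed. O(n_pos * n_neg); fine as an
    illustration, though rank-based formulas scale better.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```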