Airline Performance Comparison with R/Shiny

Posted on Aug 2, 2015

In this project I set out to build an interactive app that lets a user compare airline performance between two cities in the continental US and make better-informed flying decisions. Metrics beyond price are useful when choosing an airline with which to build frequent flyer miles, get an airline-affiliated credit card, and so on. For example, I currently live in New York but my immediate family is in Nashville. Since that is the route I fly most often, I want to know which airlines offer the most flights on it and how they perform in terms of delays and cancellations.

The Shiny package in R provides a really nice and intuitive web application framework, so I was able to take advantage of its features to build an interactive app driven by US Department of Transportation airline on-time performance data. The app is linked above so feel free to try it out! In this post I'll show some selections of the code that were used to build the app and explain them in more detail.

The source code is separated into three files, each with a different function:

  1. A UI file (ui.R) which determines the look and feel of the app and provides inputs for the user to select.
  2. A "server" file (server.R) that generates output in response to the user's input.
  3. A global file (global.R) to include the required R libraries, load the data and get it into the proper format (via merging, sorting, arranging, etc), and define any functions used in the server file.
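As a rough idea of the kind of preparation global.R does, the sketch below builds an origin lookup table from toy data using base R. The `airports` table and its coordinate values are hypothetical stand-ins; only the names `flights`, `origin`, `ORIGIN_CITY_NAME`, `CITY_NAME`, `LAT`, and `LONG` come from the code shown in this post.

```r
# Toy stand-in for the flight records (the real app loads USDOT data).
flights <- data.frame(
  ORIGIN_CITY_NAME = c("New York, NY", "New York, NY", "Nashville, TN"),
  DEST_CITY_NAME   = c("Nashville, TN", "San Francisco, CA", "New York, NY"),
  stringsAsFactors = FALSE
)

# Hypothetical coordinate table; the values are approximate city centers.
airports <- data.frame(
  CITY_NAME = c("New York, NY", "Nashville, TN", "San Francisco, CA"),
  LAT  = c(40.71, 36.16, 37.77),
  LONG = c(-74.01, -86.78, -122.42),
  stringsAsFactors = FALSE
)

# One row per distinct origin city, with its coordinates attached.
origin <- merge(
  data.frame(CITY_NAME = unique(flights$ORIGIN_CITY_NAME),
             stringsAsFactors = FALSE),
  airports,
  by = "CITY_NAME"
)

# Sort alphabetically so the dropdown choices appear in a predictable order.
origin <- origin[order(origin$CITY_NAME), ]
```

A `dest` table could be built the same way from `DEST_CITY_NAME`.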

First, here is the code for ui.R:

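[code language="R"]
# Wrapper calls were restored from context; the input ids
# match those referenced in server.R.
shinyUI(pageWithSidebar(
    headerPanel('Airline Comparison Tool'),
    # make sidebar with user inputs
    sidebarPanel(
        selectInput(inputId = 'origin_select',
                    label = 'Origin',
                    choices = origin$CITY_NAME,
                    selected = 'New York, NY'),
        selectInput(inputId = 'dest_select',
                    label = 'Destination',
                    choices = dest$CITY_NAME,
                    selected = 'San Francisco, CA'),
        dateRangeInput(inputId = 'dateRange',
                       label = 'Date Range',
                       start = min(flights$FL_DATE),
                       end = max(flights$FL_DATE)),
        selectInput(inputId = 'early_time',
                    label = 'Departure Time (earliest)',
                    choices = times,
                    selected = '00:00'),
        selectInput(inputId = 'late_time',
                    label = 'Departure Time (latest)',
                    choices = times,
                    selected = '24:00')
    ),
    # output plots to main panel with tabs to select type
    mainPanel(
        tabsetPanel(type = 'tabs',
                    tabPanel('Flights', plotOutput('countPlot')),
                    tabPanel('Delays', plotOutput('delayPlot')),
                    tabPanel('Reason for Delay', plotOutput('typePlot')),
                    tabPanel('Cancellations', plotOutput('cancelPlot'))
        ),
        # show flight path in main panel
        plotOutput('mapPlot')  # output id assumed; the original line is missing
    )
))
[/code]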

Here we see how the basic layout of the app is defined. There is a sidebar with user inputs, including origin/destination as well as some filters based on date and departure time.

There are also a few tabs being created here to show the output plots from server.R in the main panel along with a flight path map for reference.
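The `times` vector that feeds the two departure-time dropdowns comes from global.R and is not shown in this post. One way it could be built, assuming hourly steps and assuming scheduled departure times arrive as `hhmm` integers (both are assumptions, and `format_hhmm` is a hypothetical helper), is to zero-pad everything to `HH:MM` so that plain string comparison matches chronological order:

```r
# Hourly choices from "00:00" to "24:00" (assumed granularity).
times <- sprintf("%02d:00", 0:24)

# Hypothetical helper: convert hhmm integers such as 730 to "07:30".
format_hhmm <- function(t) {
  sprintf("%02d:%02d", t %/% 100, t %% 100)
}

dep <- format_hhmm(c(5, 730, 1545, 2359))

# Zero-padded strings of equal length sort the same way the clock does,
# so >= and <= can be used directly when filtering by departure time.
keep <- dep >= "06:00" & dep <= "18:00"
```

With this representation, a comparison like `CRS_DEP_TIME >= input$early_time` behaves as expected even though both sides are character strings.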

Next is a selection of the code from server.R:

[code language="R"]
# show flight delays plot in main panel
output$delayPlot = renderPlot({
    # get delay by carrier
    subset_delay = filter(flights, 
                          ORIGIN_CITY_NAME == input$origin_select & 
                          DEST_CITY_NAME == input$dest_select &
                          FL_DATE >= input$dateRange[1] & 
                          FL_DATE <= input$dateRange[2] &                           
                          CRS_DEP_TIME >= input$early_time &
                          CRS_DEP_TIME <= input$late_time)          

    medians = group_by(subset_delay, CARRIER_NAME) %>% summarise(median(ARR_DELAY, na.rm = T))
    medians_sorted = sort(unlist(medians[2]))
    delayTitle = paste("Arrival delay from", input$origin_select, "to", input$dest_select)
    # make plot
    p = ggplot(subset_delay, aes(x = reorder(CARRIER_NAME, ARR_DELAY, na.rm = TRUE, FUN = median), y = ARR_DELAY))
    p + geom_boxplot(middle = medians_sorted, aes(fill = CARRIER_NAME)) +
        scale_fill_brewer(palette = "Set2", name = "Carriers") +
        ylim(-50, 50) +
        xlab('') +
        ylab('Arrival Delay (minutes)') +
        ggtitle(delayTitle) +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1),
              text = element_text(size = 16))
})
[/code]

The code above is only one segment of server.R, shown to demonstrate how it takes the user's input to filter the data (with dplyr) and then visualize it (with ggplot2). This example is for the delay time boxplots, but the output for the other tabs is generated in a similar fashion.
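To see the filter-then-summarise pattern in isolation, here is a small sketch on made-up data. The rows and carrier names are invented; the column names follow the server.R excerpt, and `origin_select` and `dest_select` are ordinary variables standing in for the Shiny inputs.

```r
library(dplyr)

# Invented rows for illustration only.
flights <- data.frame(
  CARRIER_NAME     = c("Carrier A", "Carrier A", "Carrier A",
                       "Carrier B", "Carrier B", "Carrier B"),
  ORIGIN_CITY_NAME = c(rep("New York, NY", 5), "Boston, MA"),
  DEST_CITY_NAME   = "Nashville, TN",
  ARR_DELAY        = c(-5, 10, NA, 20, 40, 3),
  stringsAsFactors = FALSE
)

# Stand-ins for input$origin_select and input$dest_select.
origin_select <- "New York, NY"
dest_select   <- "Nashville, TN"

by_carrier <- flights %>%
  # keep only flights on the selected route
  filter(ORIGIN_CITY_NAME == origin_select,
         DEST_CITY_NAME == dest_select) %>%
  # one summary row per airline
  group_by(CARRIER_NAME) %>%
  summarise(n_flights    = n(),
            median_delay = median(ARR_DELAY, na.rm = TRUE)) %>%
  # order airlines from best to worst median delay
  arrange(median_delay)
```

Counts like `n_flights` are the kind of figure the Flights tab displays, and the medians correspond to the ordering used for the boxplots.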

The first tab in the application's main panel is a bar chart showing the number of flights broken down by carrier:


The next tab shows boxplots of the arrival delay time for each airline, where a negative delay time means that the flight arrived ahead of schedule:

The next tab breaks these delay times down into proportions by type of delay. Here is a description of each type:

  • Late Aircraft: previous flight arrived late.
  • National Aviation System: non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.
  • Weather: extreme weather that prevents or delays operation of a flight.
  • Security: evacuation of a terminal, security breach, inoperative screening equipment, and long lines in excess of 29 minutes at screening areas.
  • Carrier: circumstances within the airline's control such as maintenance or crew problems, aircraft cleaning, baggage loading, fueling, etc.
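As a sketch of how such proportions could be computed, the example below assumes the delay minutes are stored in one column per type, named `CARRIER_DELAY`, `WEATHER_DELAY`, `NAS_DELAY`, `SECURITY_DELAY`, and `LATE_AIRCRAFT_DELAY`. The numbers are invented, and measuring the share by minutes rather than by number of flights is also an assumption.

```r
# Invented delay minutes for three flights; NA means no breakdown reported.
delays <- data.frame(
  CARRIER_DELAY       = c(10, 0, NA),
  WEATHER_DELAY       = c(0, 0, NA),
  NAS_DELAY           = c(5, 20, NA),
  SECURITY_DELAY      = c(0, 0, NA),
  LATE_AIRCRAFT_DELAY = c(0, 15, NA)
)

# Total minutes per delay type, ignoring flights with no breakdown.
minutes_by_type <- colSums(delays, na.rm = TRUE)

# Share of all delay minutes attributable to each type.
share_by_type <- minutes_by_type / sum(minutes_by_type)
```

Running the same calculation per carrier (for example after splitting the data by `CARRIER_NAME`) would give one set of proportions per airline.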

The last tab is for number of cancellations, also broken down by the same types described above:
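A minimal sketch of the counting step, assuming the cancellation reason is stored as a one-letter code in a `CANCELLATION_CODE` column alongside a 0/1 `CANCELLED` flag, with A = carrier, B = weather, C = National Aviation System, and D = security. The column names and code mapping are assumptions about the data layout, and the rows are invented.

```r
# Invented records; CANCELLED is 1 for a cancelled flight, 0 otherwise.
flights <- data.frame(
  CARRIER_NAME      = c("Carrier A", "Carrier A", "Carrier B",
                        "Carrier B", "Carrier B"),
  CANCELLED         = c(1, 0, 1, 1, 0),
  CANCELLATION_CODE = c("B", NA, "A", "B", NA),
  stringsAsFactors  = FALSE
)

# Map the one-letter codes to readable labels (assumed mapping).
code_labels <- c(A = "Carrier", B = "Weather",
                 C = "National Aviation System", D = "Security")

# Keep cancelled flights only and attach the readable reason.
cancelled <- flights[flights$CANCELLED == 1, ]
cancelled$REASON <- unname(code_labels[cancelled$CANCELLATION_CODE])

# Cancellations per carrier and reason.
cancel_counts <- table(cancelled$CARRIER_NAME, cancelled$REASON)
```

The resulting table has one row per carrier and one column per reason that actually occurs in the selection.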

The last section of code is contained in the global.R file. I won't go into much detail here, but the code in this file handles the loading, merging, sorting, and formatting of the data used by the application. However, I will highlight a function that is called from server.R to plot a map of the flight path:

[code language="R"]
# requires the maps and geosphere packages (loaded in global.R)
map_plot = function(from, to){
    # get longitude/latitude at origin/destination
    lat_o <- origin$LAT[origin$CITY_NAME == from]
    long_o <- origin$LONG[origin$CITY_NAME == from]
    lat_d <- dest$LAT[dest$CITY_NAME == to]
    long_d <- dest$LONG[dest$CITY_NAME == to]
    # create map
    xlim = c(-125, -62.5)
    map('state', col = '#f2f2f2', fill = TRUE, xlim = xlim, boundary = TRUE, lty = 0)
    # draw the great-circle path between the two cities
    inter <- gcIntermediate(c(long_o, lat_o), c(long_d, lat_d), n = 50, addStartEnd = TRUE)
    lines(inter, col = 'red', lwd = 2)
    # label and mark both endpoints
    text(long_o, lat_o, from, col = 'blue', adj = c(-0.1, 1.25))
    text(long_d, lat_d, to, col = 'blue', adj = c(-0.1, 1.25))
    points(long_d, lat_d, cex = 1.5)
    points(long_o, lat_o, cex = 1.5)
}
[/code]

Here I'm using the maps and geosphere packages in R to draw a reference map of the flight path, which appears below the data plots in the main panel.
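To make the geosphere step more concrete, the sketch below calls `gcIntermediate` on its own, outside the plotting function. The coordinates are approximate, illustrative values rather than the ones in the app's data, and the distance calculation is an extra that the app itself does not show.

```r
library(geosphere)

# Approximate coordinates as c(longitude, latitude); values are illustrative.
nyc       <- c(-74.01, 40.71)
nashville <- c(-86.78, 36.16)

# 50 intermediate points plus the two endpoints along the great circle.
# The result is a two-column matrix of longitudes and latitudes.
path <- gcIntermediate(nyc, nashville, n = 50, addStartEnd = TRUE)

# Great-circle distance in meters (haversine formula on a spherical Earth).
dist_m <- distHaversine(nyc, nashville)
```

Because `path` is a plain two-column matrix, it can be passed straight to `lines()` on top of the base map, which is what `map_plot` does.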

Currently, the main limitation of this application is that I was only able to reasonably use one month of flight data. I used the most recently available USDOT data, from May 2015, which contains info on nearly 500,000 flights. To include more than a single month, I would need to handle a much larger volume of data (using Hadoop, for example). Getting that part incorporated will be the next step in the development of this app.


