Data Visualisation of US Domestic Flights in Year 2009

Posted on Jul 24, 2016


"The earth has become one big village, with telephone laid on from one end to the other, and air transport, both speedy and safe."

- Wyndham Lewis

Introduction

Air travel is one of the main ways people move around the world. Last year's data show that more than 798 million air passengers, domestic and international, were carried by air carriers registered in the U.S., an increase of around 72% compared with 1990.

The air transport industry, a vital engine of global socio-economic growth, matters considerably for economic development: it creates direct and indirect employment, supports tourism and local businesses, and stimulates foreign investment and international trade. It also plays a crucial part in population movement and world exploration. Because such a large volume of historical flight records is available, we can analyse population flow to gain insight into how people move.

This report uses the historical records of US domestic flights in 2009 for the following analyses. First, it visualises the air traffic at each airport to identify the larger hub or transit airports. Second, it checks how air traffic fluctuates through 2009. Finally, it describes the pattern of population movement by air within the US.


I. Original Data Set Description

Abstract:

The original data set contains over 3.5 million monthly domestic flight records from 1990 to 2009. The data are arranged as an adjacency list with metadata.

Data Source(s)

1. US Census Bureau
2. RITA/Transtats, Bureau of Transportation Statistics

Year: 2009

URL: http://academictorrents.com/details/a2ccf94bbb4af222bf8e69dad60a68a29f310d9a


II. Data Manipulation & Methodology

Basically, I subset the original data set to the single year 2009 and removed the rows in which the origin and destination airports are the same. Additionally, I deleted the rows with a zero value in the “Passengers” column.

Data set of Year 2009

tbl_df(year2009)
## # A tibble: 178,268 x 13
##     Year Month Origin Destination     Origin.City Destination.City
##    <int> <int> <fctr>      <fctr>          <fctr>           <fctr>
## 1   2009     6    LAX         RDM Los Angeles, CA         Bend, OR
## 2   2009     3    LAX         RDM Los Angeles, CA         Bend, OR
## 3   2009    12    LAX         RDM Los Angeles, CA         Bend, OR
## 4   2009     1    LAX         RDM Los Angeles, CA         Bend, OR
## 5   2009    11    LAX         RDM Los Angeles, CA         Bend, OR
## 6   2009     5    LAX         RDM Los Angeles, CA         Bend, OR
## 7   2009     2    LAX         RDM Los Angeles, CA         Bend, OR
## 8   2009     7    LAX         RDM Los Angeles, CA         Bend, OR
## 9   2009     4    LAX         RDM Los Angeles, CA         Bend, OR
## 10  2009     2    LAX         RDM Los Angeles, CA         Bend, OR
## # ... with 178,258 more rows, and 7 more variables: Passengers <int>,
## #   Seats <int>, Flights <int>, Distance <int>, Fly.Date <int>,
## #   Origin.Population <int>, Destination.Population <int>
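
The filtering described in this section could look roughly like the following. This is only a sketch in base R, not the author's code: the data frame name `flights` is hypothetical, the toy rows are made up, and the columns are kept as character vectors (factor columns like those in the tibble above would need `as.character()` before comparing origin and destination).

```r
# Toy stand-in for the full 1990-2009 table; every value here is made up.
flights <- data.frame(
  Year        = c(2008, 2009, 2009, 2009, 2009),
  Month       = c(12, 1, 1, 2, 2),
  Origin      = c("LAX", "LAX", "ATL", "ATL", "MIA"),
  Destination = c("RDM", "RDM", "ATL", "MIA", "ATL"),
  Passengers  = c(50, 40, 10, 0, 70),
  stringsAsFactors = FALSE
)

# Keep 2009 only, drop rows whose origin and destination coincide,
# and drop rows that carried no passengers.
year2009 <- subset(
  flights,
  Year == 2009 & Origin != Destination & Passengers > 0
)

print(year2009)
```

The same three conditions could equally be written with dplyr's `filter()`; base R is used here only so the sketch runs without extra packages.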

III. Data Exploration and Visualisation

3.1 Cumulative Passenger Flow at Each Airport in 2009

First, I calculate the cumulative outflow of passengers with the summarise function in dplyr and arrange the data in descending order. However, there are still too many observations to analyse: the result has 445 distinct airports, one per row.

str(year2009_2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    445 obs. of  3 variables:
##  $ Origin        : Factor w/ 445 levels "1B1","ABE","ABI",..: 31 316 112 233 231 332 90 202 263 293 ...
##  $ Passengers_out: int  34404073 24489548 22563679 19095716 16800451 16304031 14974084 14189884 13485397 13441406 ...
##  $ flights_out   : int  391849 329665 260597 184692 145011 161834 199212 197054 114203 171772 ...

Thus, I filter the origin airports down to the top 15 by number of passengers and plot them as a bar chart. Likewise, I filter the 2009 data to get the destination airports with the top 15 total inflow of passengers.

[Figure: bar chart of the top 15 origin airports by total passenger outflow, 2009]

[Figure: bar chart of the top 15 destination airports by total passenger inflow, 2009]

Comparing the two bar charts above, it is easy to see that the names of the origin airports are exactly the same as the names of the destination airports. In other words, the net passenger flow for each airport may be approximately zero. Taking a closer look at Hartsfield-Jackson Atlanta International Airport (ATL), it ranks first in both total outflow and total inflow, which makes sense as ATL is a large hub airport and also a transit airport.
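
The aggregation and ranking in this subsection can be sketched as below. This is an illustration in base R rather than the dplyr code the post refers to; the toy numbers are made up, and the column names `Passengers_in` and `net` are assumptions added for the example (`Passengers_out` and `flights_out` follow the `str()` output above).

```r
# Toy 2009 records; airport codes are real, the numbers are made up.
year2009 <- data.frame(
  Origin      = c("ATL", "ATL", "ORD", "MIA", "ORD"),
  Destination = c("ORD", "MIA", "ATL", "ATL", "MIA"),
  Passengers  = c(100, 80, 90, 85, 30),
  Flights     = c(2, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Total outbound passengers and flights per origin airport.
outflow <- aggregate(cbind(Passengers, Flights) ~ Origin,
                     data = year2009, FUN = sum)
names(outflow) <- c("Airport", "Passengers_out", "flights_out")

# Total inbound passengers per destination airport.
inflow <- aggregate(Passengers ~ Destination, data = year2009, FUN = sum)
names(inflow) <- c("Airport", "Passengers_in")

# Rank by outbound passengers and keep at most the top 15, as in the post.
top_out <- head(outflow[order(-outflow$Passengers_out), ], 15)

# Net flow = inbound minus outbound; an airport missing on one side counts as 0.
flow <- merge(outflow, inflow, by = "Airport", all = TRUE)
flow[is.na(flow)] <- 0
flow$net <- flow$Passengers_in - flow$Passengers_out

print(top_out)
print(flow[, c("Airport", "net")])
```

Computing `net` directly, rather than comparing the two rankings by eye, gives a way to check how close each airport's net flow really is to zero.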

3.2 Variation of Air Traffic Volume during the Year 2009

Second, we explore the fluctuation of air traffic volume during 2009. According to the monthly traffic chart below, the traffic volume peaks in July. The reason may be Independence Day and the summer holiday season.

[Figure: monthly air traffic volume in 2009]
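
A monthly total like the one charted here can be computed with a single grouped sum. The sketch below uses base R and made-up numbers chosen only so that the toy data also peak in month 7; it does not reproduce the post's actual figures, and measuring traffic by `Passengers` is an assumption.

```r
# Toy records; the passenger counts are made up for illustration.
year2009 <- data.frame(
  Month      = c(1, 1, 7, 7, 12),
  Passengers = c(100, 50, 200, 120, 90)
)

# Sum passengers within each month.
monthly <- aggregate(Passengers ~ Month, data = year2009, FUN = sum)

# Month with the largest total.
peak_month <- monthly$Month[which.max(monthly$Passengers)]

print(monthly)
print(peak_month)
```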

3.3 Analysis of Population Movement by Month

Next, I zoom into each month's data and calculate the net flow of passengers to explore how passengers move around. For simplicity and efficiency, I select the airports with the top 10 and bottom 10 net flows in each month. The most interesting finding is that in December people travel from the North to the South of the US: the top 3 airports by net inflow are MIA (Miami International Airport), FLL (Fort Lauderdale-Hollywood International Airport) and MCO (Orlando International Airport), all located in Florida, the southernmost state of the contiguous US.

Meanwhile, the bottom three airports, DCA, ORD and BOS, are located in Washington, Chicago and Boston respectively, all in the northern part of the US. However, when we look back at January 2009, the movement is the opposite: people go from the South to the North. In January, the top 3 airports by net flow are DCA (Ronald Reagan Washington National Airport), BOS (General Edward Lawrence Logan International Airport) and SFO (San Francisco International Airport), serving Washington, Boston and San Francisco, while the bottom 3 airports, with negative net flows, are FLL, MCO and MIA, all located in Florida in the south of the US. (See the comparison graphs below.)

[Figures: net passenger flow by airport, January (Jan) and December (Dec) 2009]
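
The monthly net flow ranking can be sketched as a small helper. This is a base R illustration under assumptions: the function name `monthly_netflow` is hypothetical, net flow is taken as inbound minus outbound passengers, and the toy rows are made up to echo the December and January directions described above.

```r
# Toy records; airport codes are real, the numbers are made up.
year2009 <- data.frame(
  Month       = c(12, 12, 12, 1, 1),
  Origin      = c("BOS", "ORD", "MIA", "MIA", "FLL"),
  Destination = c("MIA", "FLL", "BOS", "BOS", "ORD"),
  Passengers  = c(300, 150, 100, 250, 150),
  stringsAsFactors = FALSE
)

# Net flow per airport for one month, sorted from largest net inflow
# to largest net outflow.
monthly_netflow <- function(data, month) {
  m <- data[data$Month == month, ]
  inbound  <- aggregate(Passengers ~ Destination, data = m, FUN = sum)
  outbound <- aggregate(Passengers ~ Origin, data = m, FUN = sum)
  names(inbound)  <- c("Airport", "In")
  names(outbound) <- c("Airport", "Out")
  flow <- merge(inbound, outbound, by = "Airport", all = TRUE)
  flow[is.na(flow)] <- 0  # airport seen only as origin or only as destination
  flow$Netflow <- flow$In - flow$Out
  flow[order(-flow$Netflow), c("Airport", "Netflow")]
}

dec <- monthly_netflow(year2009, 12)
jan <- monthly_netflow(year2009, 1)

print(head(dec, 10))  # top 10 by net inflow
print(tail(dec, 10))  # bottom 10, i.e. largest net outflow
```

With the full data, `head()` and `tail()` on the sorted result give the top 10 and bottom 10 airports for each month.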

 

3.4 Analysis of Population Movement by Airport


The pattern of net passenger flow can also be analysed airport by airport. I chose 4 airports to show how the flow of passengers varies during the year. For Newark Liberty International Airport, month 7 stands out, as more people leave New York or New Jersey than come in, maybe because of the long Independence Day holiday. This pattern is not seen at Chicago's ORD airport, where more people arrive in April and May. Both airports are in the North of the US, so in winter we see more people leaving for other places.

As I found earlier, people prefer going to the southern part of the US in December. At MIA, the net flow of passengers fluctuates the most in December, as more people come into Miami from the northern US. With the last plot, I wanted to check the pattern at the Atlanta airport: as I mentioned before, ATL is the biggest hub and transit airport in the US, so its net flow should not fluctuate so strongly, and the fluctuation should be smaller than at the other airports mentioned. However, this cannot be confirmed visually; a statistical analysis is needed to show that the variance is significantly smaller.

[Figures: monthly net passenger flow by airport (netflow_airport, ORD, ATL, MIA)]
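
One possible way to make the variance comparison concrete is sketched below. It is not part of the original analysis: the monthly net flow values are made up, and the F test from `var.test()` assumes roughly normal, independent observations, which monthly series may not satisfy, so the result should be read as indicative only.

```r
# Hypothetical monthly net flows for two airports; every number is made up.
netflow <- data.frame(
  Month   = rep(1:12, times = 2),
  Airport = rep(c("ATL", "MIA"), each = 12),
  Netflow = c(
    c(-2, 1, 0, 2, -1, 1, 0, -2, 1, 0, 2, -1) * 1000,         # ATL
    c(-30, -20, 10, 5, 0, -5, 8, -10, 4, 12, 20, 40) * 1000   # MIA
  )
)

# Sample variance of the monthly net flow for each airport.
variances <- tapply(netflow$Netflow, netflow$Airport, var)

# One-sided F test: is the ATL variance smaller than the MIA variance?
atl <- netflow$Netflow[netflow$Airport == "ATL"]
mia <- netflow$Netflow[netflow$Airport == "MIA"]
result <- var.test(atl, mia, alternative = "less")

print(variances)
print(result$p.value)
```

A small p-value here would support the claim that the hub's net flow fluctuates less; a test that does not rely on normality would be a reasonable alternative.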

IV. Conclusion

To summarise, Hartsfield-Jackson Atlanta International Airport (ATL) is a large hub and transit airport according to its rank in total outflow and inflow of flights. The population size of a city may be positively correlated with how busy its airports are. Moreover, air traffic volume varies by month and peaks in July, maybe because of the long holiday (Independence Day) in that month.

Additionally, people tend to travel from the North to the South of the US in December, especially to Florida, maybe because of the weather. The pattern of population movement differs by month and by airport, and it may be affected by geography and by differences in airport type. ATL is a transit airport, so the variation in its net passenger flow appears smaller than at the other airports, although this still needs to be confirmed statistically.

In the future, this report could add statistical correlation analysis to see how the variables interact or relate to one another. Furthermore, we could look up the latitude and longitude of each city and build a density map to visualise how people move by month and by year.

You are welcome to view the raw code here.

About Author

Yunrou Gong

Yunrou Gong worked as a Business Analyst for Sanity Lighting Co., an LED manufacturer. Through her work as an analyst, she developed an interest in data-driven approaches to problem solving. She holds an M.S. in Operational Research, specializing...