Data Visualisation of US Domestic Flights in Year 2009
"The earth has become one big village, with telephone laid on from one end to the other, and air transport, both speedy and safe."
The airplane is one of the main means of transport for carrying people around the world. Data show that last year over 798 million air passengers, both domestic and international, were carried by air carriers registered in the U.S. That is an increase of around 72% compared with 1990.
The air transport industry, a vital engine of global socio-economic growth, is important for economic development: it creates direct and indirect employment, supports tourism and local businesses, and stimulates foreign investment and international trade. It also plays a crucial part in population movement and world exploration. Thanks to the large volume of historical flight records, we can analyse population flows to gain insight into how people move.
This report uses the historical records of US domestic flights in 2009 for the following analyses. First, it visualises the air traffic at each airport to identify the larger hub or transit airports. Second, it checks how air traffic fluctuated through 2009. Finally, it describes the pattern of population movement by air within the US.
I. Original Data Set Description
The original data set contains over 3.5 million monthly domestic flight records from 1990 to 2009. The data are arranged as an adjacency list with metadata.
1. US Census Bureau
2. RITA/Transtats, Bureau of Transportation Statistics
II. Data Manipulation & Methodology
Basically, I subset the original data set to the year 2009 and removed the rows in which the origin airport and the destination airport are the same. Additionally, I deleted the rows with a zero value in the “Passengers” column.
Data set of Year 2009
## # A tibble: 178,268 x 13
##     Year Month Origin Destination     Origin.City Destination.City
##    <int> <int> <fctr>      <fctr>          <fctr>           <fctr>
##  1  2009     6    LAX         RDM Los Angeles, CA         Bend, OR
##  2  2009     3    LAX         RDM Los Angeles, CA         Bend, OR
##  3  2009    12    LAX         RDM Los Angeles, CA         Bend, OR
##  4  2009     1    LAX         RDM Los Angeles, CA         Bend, OR
##  5  2009    11    LAX         RDM Los Angeles, CA         Bend, OR
##  6  2009     5    LAX         RDM Los Angeles, CA         Bend, OR
##  7  2009     2    LAX         RDM Los Angeles, CA         Bend, OR
##  8  2009     7    LAX         RDM Los Angeles, CA         Bend, OR
##  9  2009     4    LAX         RDM Los Angeles, CA         Bend, OR
## 10  2009     2    LAX         RDM Los Angeles, CA         Bend, OR
## # ... with 178,258 more rows, and 7 more variables: Passengers <int>,
## #   Seats <int>, Flights <int>, Distance <int>, Fly.Date <int>,
## #   Origin.Population <int>, Destination.Population <int>
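The filtering described above can be sketched in a few lines. The original analysis was done in R; the version below is a Python illustration only, and the FlightRecord type, the keep_2009_routes function and the sample rows are all hypothetical (the passenger counts are made up, not taken from the data set).

```python
from typing import List, NamedTuple


class FlightRecord(NamedTuple):
    """Hypothetical, reduced view of one row of the flight data."""
    year: int
    month: int
    origin: str
    destination: str
    passengers: int


def keep_2009_routes(records: List[FlightRecord], year: int = 2009) -> List[FlightRecord]:
    """Keep rows from the given year with distinct airports and at least one passenger."""
    return [
        r for r in records
        if r.year == year
        and r.origin != r.destination   # drop rows where origin equals destination
        and r.passengers > 0            # drop rows with zero passengers
    ]


# Made-up rows, one per filtering rule.
sample = [
    FlightRecord(2009, 6, "LAX", "RDM", 120),  # kept
    FlightRecord(2008, 6, "LAX", "RDM", 95),   # dropped: wrong year
    FlightRecord(2009, 3, "ATL", "ATL", 40),   # dropped: same origin and destination
    FlightRecord(2009, 1, "MIA", "BOS", 0),    # dropped: zero passengers
]
filtered = keep_2009_routes(sample)
print(filtered)
```

Only the first sample row survives all three conditions.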
III. Data Exploration and Visualisation
3.1 Cumulative Flow of Passengers at Each Airport in 2009
First, I calculate the cumulative outflow of passengers using the summarise function in dplyr and arrange the data in descending order. However, with 445 distinct airports in the rows, there are still too many observations to analyse directly.
## Classes 'tbl_df', 'tbl' and 'data.frame': 445 obs. of 3 variables:
## $ Origin        : Factor w/ 445 levels "1B1","ABE","ABI",..: 31 316 112 233 231 332 90 202 263 293 ...
## $ Passengers_out: int 34404073 24489548 22563679 19095716 16800451 16304031 14974084 14189884 13485397 13441406 ...
## $ flights_out   : int 391849 329665 260597 184692 145011 161834 199212 197054 114203 171772 ...
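The aggregation step described above (total passengers and flights per origin airport, sorted in descending order) could look roughly like this. It is a Python sketch, not the dplyr code used for the report; the outflow_by_origin function and the sample rows are hypothetical and the numbers are made up.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def outflow_by_origin(rows: Iterable[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Sum passengers and flights per origin airport, largest passenger total first.

    Each input row is (origin, passengers, flights).
    """
    totals = defaultdict(lambda: [0, 0])
    for origin, passengers, flights in rows:
        totals[origin][0] += passengers
        totals[origin][1] += flights
    # Sort by total passengers, descending, like arranging by descending value.
    return sorted(
        ((origin, p, f) for origin, (p, f) in totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Made-up rows for illustration only.
sample_rows = [("ATL", 300, 3), ("ORD", 250, 2), ("ATL", 200, 2), ("LAX", 100, 1)]
ranking = outflow_by_origin(sample_rows)
print(ranking)
```

The same function gives the inflow ranking if the destination airport is passed in place of the origin.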
Comparing the two bar charts above, it is easy to see that the origin airports listed are exactly the same as the destination airports. In other words, the net passenger flow at each airport may be approximately zero. Looking more closely at Hartsfield Jackson Atlanta International Airport (ATL), it ranks first in both total outflow and total inflow, which makes sense because ATL is a large hub and also a transit airport.
3.2 Variation of Air Traffic Volume during the Year 2009
Secondly, we want to explore how air traffic volume fluctuated during 2009. According to the monthly traffic chart below, traffic volume peaks in July, possibly because of Independence Day and the summer holiday season.
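The monthly totals behind such a chart, and the peak month, can be computed as in the sketch below. This is a Python illustration with hypothetical function names (monthly_passengers, peak_month) and made-up sample values; it is not the code or data used for the report.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def monthly_passengers(rows: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Sum passengers per month. Each input row is (month, passengers)."""
    totals = Counter()
    for month, passengers in rows:
        totals[month] += passengers
    return dict(sorted(totals.items()))  # ordered by month number


def peak_month(totals: Dict[int, int]) -> int:
    """Return the month with the largest passenger total."""
    return max(totals, key=totals.get)


# Made-up rows for illustration only.
sample = [(1, 100), (7, 180), (7, 60), (12, 150), (1, 20)]
totals = monthly_passengers(sample)
peak = peak_month(totals)
print(totals, peak)
```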
3.3 Analysis of Population Movement by Month
Next, I zoom into each month's data and calculate the net flow of passengers to explore how passengers move around. For simplicity and efficiency, I select the airports with the top 10 and bottom 10 net passenger flows in each month. The most interesting finding is that in December people travel from the North to the South of the US: the top three airports by net inflow are MIA (Miami International Airport), FLL (Fort Lauderdale-Hollywood International Airport) and MCO (Orlando International Airport), all located in Florida, the southernmost state in the contiguous US.
Meanwhile, the bottom three airports, DCA, ORD and BOS, are located in Washington, Chicago and Boston respectively, in the northern part of the US. However, looking back at January 2009, the movement is the opposite: people travel from the South to the North of the US. In January, the top three airports by net flow are DCA (Ronald Reagan Washington National Airport), BOS (General Edward Lawrence Logan International Airport) and SFO (San Francisco International Airport), located in Washington, Boston and San Francisco. The bottom three airports, with negative net flows, are FLL, MCO and MIA, all located in Florida, in the south of the US. (See the comparison graphs below.)
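The net flow calculation can be sketched as follows, assuming net flow means passengers arriving minus passengers departing, which matches the use of "net inflow" and "negative netflows" above. This is a Python illustration; the function names and the sample rows are hypothetical and the passenger counts are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def net_flow_by_month(rows: Iterable[Tuple[int, str, str, int]]) -> Dict[int, Dict[str, int]]:
    """Return {month: {airport: arrivals - departures}}.

    Each input row is (month, origin, destination, passengers).
    """
    net = defaultdict(lambda: defaultdict(int))
    for month, origin, destination, passengers in rows:
        net[month][destination] += passengers  # arrivals count as positive
        net[month][origin] -= passengers       # departures count as negative
    return {month: dict(airports) for month, airports in net.items()}


def top_and_bottom(net_for_month: Dict[str, int], n: int = 10) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return the n airports with the largest and the n with the smallest net flow.

    Assumes n is positive; the two lists overlap if there are fewer than 2 * n airports.
    """
    ranked = sorted(net_for_month.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n], ranked[-n:]


# Made-up December rows for illustration only.
sample = [
    (12, "BOS", "MIA", 300),
    (12, "MIA", "BOS", 100),
    (12, "ORD", "MCO", 150),
    (12, "MCO", "ORD", 50),
    (12, "DCA", "FLL", 80),
]
net = net_flow_by_month(sample)
top, bottom = top_and_bottom(net[12], n=2)
print(top, bottom)
```

Because every passenger leaves one airport and arrives at another, the net flows within a month always sum to zero across all airports.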
3.4 Analysis of Population Movement by Airport
The pattern of net passenger flow can also be analysed airport by airport. I chose four airports to show how the flow of passengers varies during the year. For Newark Liberty International Airport, July is unusual: more people leave New York or New Jersey than arrive, perhaps because of the long Independence Day holiday. This pattern is not seen at Chicago's ORD airport, where more people arrive in April and May. Both airports are located in the north of the US, so in winter we see more people leaving for other places.
As analysed earlier, people prefer travelling to the southern part of the US. At MIA, the net flow of passengers fluctuates most in December, as more people arrive in Miami from the northern US. With the last plot, I wanted to check the pattern at the Atlanta airport: as mentioned before, ATL is the biggest hub and transit airport in the US, so its net flow should not fluctuate as strongly, and the fluctuation does appear smaller than at the other airports mentioned. However, this is hard to judge visually; a statistical analysis is needed to show that the variance is significantly smaller.
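As a first step towards that statistical analysis, the sample variance of the monthly net flow can be computed per airport and compared. The sketch below is a Python illustration only: the net_flow_variance function and the airport labels "HUB" and "SUN" are hypothetical, and the values are made up. A formal test of equal variances (for example Levene's test) would still be needed to claim significance and is not shown here.

```python
from statistics import variance
from typing import Dict, List


def net_flow_variance(monthly_net: Dict[str, List[float]]) -> Dict[str, float]:
    """Sample variance of the monthly net flow for each airport.

    Each airport needs at least two monthly values.
    """
    return {airport: variance(values) for airport, values in monthly_net.items()}


# Made-up monthly net flows for two hypothetical airports.
sample_net = {
    "HUB": [10.0, -10.0, 10.0, -10.0],      # small swings, as expected for a transit hub
    "SUN": [300.0, -300.0, 300.0, -300.0],  # large seasonal swings
}
variances = net_flow_variance(sample_net)
print(variances)
```

Raw variances depend on airport size, so dividing the net flow by each airport's total traffic before comparing may be a fairer basis; that choice is left open here.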
To summarise, Hartsfield Jackson Atlanta International Airport (ATL) is a large hub and transit airport according to its rank in total outflow and inflow. The population of a city may be positively correlated with how busy its airport is. Moreover, air traffic volume varies by month and peaks in July, possibly because of the long Independence Day holiday in that month.
Additionally, people tend to travel from the North to the South of the US in December, especially to Florida, possibly because of the weather. The pattern of population movement differs by month and by airport, and it may be affected by differences in geography and airport type. ATL, for example, is a transit airport, so the variation in its net passenger flow is much smaller than at the other airports.
In the future, this report could include more statistical correlation analysis to see how the variables interact or relate to one another. Furthermore, we could look up the latitude and longitude of each city and build a density map to visualise how people move by month and by year.