Airlines Five Star: A web scraping project from Skytrax

Posted on Jul 23, 2019

Introduction

In an earlier R Shiny project, I found that most Internet users did not enjoy their experiences with major U.S. airlines, and that the top three reasons for negative reviews were bad customer service, delays, and cancellations. I wanted to learn how U.S. airlines can improve, so I started this project. Skytrax is a United Kingdom-based consultancy that runs an airline and airport review and ranking site. Its 5-Star Airline rating is a mark of quality achievement, with a select group of just 10 airlines certified in this category. I wanted to see what those 10 airlines do better to deserve a spot on the list. The code is available at https://github.com/freddy90503/SkyTrax_Scraping

The Data

The data I scraped are individual reviews left by verified customers only. A full review includes title, name, country, review date, comments, type of traveler, seat type, route, date flown, overall score, seat comfort score, cabin staff service score, food & beverage score, inflight entertainment score, ground service score, value for money score, and whether or not the person recommends the airline. Not all reviewers answered every question; some answered only a few, and I handled those gaps when cleaning the data.

Data scraping with Scrapy

I used Scrapy to collect the reviews and ended up with 10 files, one per airline. In each file, every row represents one review and every column holds one field of that review.

Data cleaning with Pandas

After I got all the raw data, I used the pandas package in Python to clean it. I first combined the 10 airline files into one, then dropped duplicates and removed parentheses and colons that were not useful to my analysis. I cast the date information to a date type, then filled missing values (N/A) with either 0 or NaN, depending on the column.
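The cleaning steps can be sketched roughly as follows. The column names and the choice of which columns get 0 versus NaN are assumptions for illustration, and two tiny in-memory frames stand in for the per-airline files (which would normally be loaded with `pd.read_csv`).

```python
import pandas as pd


def clean_reviews(frames, text_cols, date_col, zero_fill_cols, nan_cols):
    """Combine per-airline frames and apply the cleaning steps in order."""
    # 1. Stack all airline frames into one and drop exact duplicate rows.
    df = pd.concat(frames, ignore_index=True)
    df = df.drop_duplicates().reset_index(drop=True)

    # 2. Strip parentheses and colons from the text columns.
    for col in text_cols:
        df[col] = df[col].str.replace(r"[():]", "", regex=True).str.strip()

    # 3. Parse dates; values that cannot be parsed become NaT.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")

    # 4. Convert scores to numbers; "N/A" strings become NaN.
    for col in zero_fill_cols + nan_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # 5. Fill the chosen columns with 0 and leave the others as NaN.
    df[zero_fill_cols] = df[zero_fill_cols].fillna(0)
    return df


# Invented sample rows (not the scraped data); one row is duplicated across files.
airline_a = pd.DataFrame({
    "country": ["(United States)", "(Japan)"],
    "review_date": ["2019-07-01", "2019-06-15"],
    "overall": ["9", "N/A"],
    "wifi": ["N/A", "4"],
})
airline_b = pd.DataFrame({
    "country": ["(Japan)", "(Qatar):"],
    "review_date": ["2019-06-15", "2019-05-02"],
    "overall": ["N/A", "10"],
    "wifi": ["4", "N/A"],
})

cleaned = clean_reviews(
    [airline_a, airline_b],
    text_cols=["country"],
    date_col="review_date",
    zero_fill_cols=["wifi"],   # assumption: which columns get 0
    nan_cols=["overall"],      # assumption: which columns stay NaN
)
print(cleaned)
```

Leaving a score as NaN keeps it out of averages, while filling with 0 counts it as the lowest possible value, so the choice per column affects later statistics.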

Overall

First, I created a bar chart of the overall ratings for all 10 airlines. All of them are well rated, with scores of 9 and 10 in the majority, except for Lufthansa.

By travel type

I made another chart to see how different types of travelers rate their flights, and found that solo travelers give the highest ratings on average, while couples give the lowest.
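A comparison like this comes down to a group-by and a mean. The sketch below uses invented rows and assumed column names (`traveller_type`, `overall_score`), not the scraped data; the same pattern applies to grouping by seat type or aircraft model.

```python
import pandas as pd

# Invented sample rows for illustration only.
reviews = pd.DataFrame({
    "traveller_type": [
        "Solo Leisure", "Couple Leisure", "Solo Leisure",
        "Business", "Couple Leisure",
    ],
    "overall_score": [10, 7, 9, 8, 6],
})

# Average overall score per traveler type, highest first.
avg_by_type = (
    reviews.groupby("traveller_type")["overall_score"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_type)
```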

By Aircraft Model

I also analyzed ratings by aircraft model and found that the Boeing 777 and 787 are the most highly rated models. Hainan Airlines' Boeing fleet has the highest ratings overall.

By Seat Type

This chart shows how travelers in different seat types rate their flights. First Class travelers give the highest ratings on average, and Premium Economy travelers the lowest.

By other methods

To analyze the scores for individual categories such as wifi, entertainment, food, ground service, seat comfort, and value, I made these stacked histograms. For most of the airlines, more than 75% of reviewers gave a score of 3 or above out of 5 in almost all categories.
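The "share of reviewers giving 3 or above" figure can be computed per airline and category along these lines. The airline labels, column names, and scores are invented for illustration, and unanswered questions (NaN) are excluded from the denominator, which is one possible choice rather than necessarily the one behind the charts.

```python
import pandas as pd

# Invented sample scores; None marks a question the reviewer skipped.
ratings = pd.DataFrame({
    "airline": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "food": [5, 4, 2, 3, 1, 5, 4, 4],
    "seat_comfort": [4, 4, 5, None, 3, 2, 2, 5],
})
score_cols = ["food", "seat_comfort"]

answered = ratings[score_cols].notna()   # True where a score was given
good = ratings[score_cols].ge(3)         # True where the score is 3 or above

# Share of answered reviews with a score of 3+, per airline and category.
share = (
    good.groupby(ratings["airline"]).sum()
    / answered.groupby(ratings["airline"]).sum()
)
print(share)
```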

Word clouds

I also used word clouds to see which words appeared most often in the comments for each airline. Most of the words were positive, and phrases like "Good service", "Excellent", "Friendly", "Great experience", and "Comfortable" really stood out.
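Underneath a word cloud is a simple word-frequency count. The sketch below shows that counting step with the standard library only; the stopword list and sample comments are made up, and it counts single words rather than phrases.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; real lists are longer).
STOPWORDS = {
    "the", "a", "and", "was", "were", "to", "of",
    "in", "is", "it", "on", "with", "very",
}


def word_frequencies(comments):
    """Count how often each non-stopword appears across all comments."""
    counts = Counter()
    for text in comments:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


# Invented comments for illustration only.
sample = [
    "The crew was friendly and the seat was comfortable.",
    "Excellent food, friendly service.",
    "Comfortable seat and excellent service.",
]
freqs = word_frequencies(sample)
print(freqs.most_common(5))
```

A plotting library such as the `wordcloud` package can then size each word by its count, for example through `WordCloud.generate_from_frequencies`.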

By Correlation with Overall Score

After all of these charts and analyses, I found that most of the top 10 airlines do very well in every category, so I wanted to see which category matters most to reviewers. I checked the correlation between each category score and the overall score, and found that reviewers usually give a high overall score when they think the flight was good value and when they received good service in flight and on the ground. Entertainment and wifi have the lowest correlation with the overall score.
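A correlation check of this kind can be sketched with pandas as below. The numbers are invented so that the pattern is visible and the column names are assumptions; `DataFrame.corr` uses Pearson correlation by default.

```python
import pandas as pd

# Invented scores for six reviews, for illustration only.
scores = pd.DataFrame({
    "overall": [10, 9, 3, 8, 2, 7],
    "value": [5, 5, 2, 4, 1, 4],
    "cabin_staff": [5, 4, 2, 4, 1, 3],
    "wifi": [3, 5, 4, 2, 3, 4],
})

# Correlation of each category with the overall score, strongest first.
corr_with_overall = (
    scores.corr()["overall"]
    .drop("overall")
    .sort_values(ascending=False)
)
print(corr_with_overall)
```

Correlation only shows which category scores move together with the overall score; it does not by itself show that improving a category would raise the overall rating.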

Conclusion

The project gave me several insights:

  • First-class passengers are more likely to give positive feedback.
  • Solo travelers are more likely to give positive feedback.
  • Boeing 787 & 777 passengers are more likely to give positive feedback. 
  • The most important elements are value and customer service, which are areas that US airlines are lacking in.
  • US airlines are good at entertainment and wifi, but those are not crucial elements.
  • The top 10 airlines in the world are all very good at multiple areas instead of just one.

This project helped me answer the questions I had before, and I look forward to expanding it in the future.

About Author

Fred Zeng

Fred received his M.S. in management and systems with a concentration in database technology from New York University. He was also a business and website analyst intern at NYU. For his research, he designed and conducted data testing...