What can we learn from Five-star airlines: A web scraping project from Skytrax

Fred Zeng
Posted on Jul 23, 2019

Introduction

From an earlier R Shiny project, I found that most Internet users did not enjoy their experiences with major U.S. airlines, and that the top three reasons for negative reviews were bad customer service, delays, and cancellations. I wanted to learn how U.S. airlines could improve, so I started this project. Skytrax is a United Kingdom-based consultancy that runs an airline and airport review and ranking site. Its 5-Star Airline rating is a mark of quality achievement, with a select group of just 10 airlines certified in this category. I wanted to see what those 10 airlines do better to deserve a spot on the list. The code is available at https://github.com/freddy90503/SkyTrax_Scraping

The Data

The data I scraped are individual reviews left by verified customers. A full review includes title, name, country, review date, comments, type of traveler, seat type, route, date flown, overall score, seat comfort score, cabin staff service score, food & beverage score, inflight entertainment score, ground service score, value for money score, and whether or not the person recommends the airline. Not all reviewers answered every question, though. Some answered only a few, and I handled those gaps during cleaning after collecting the data.

Data scraping with Scrapy

I used Scrapy to collect the reviews and ended up with 10 files, one per airline. In each file, every row represents one review and every column holds one field of that review.

Data cleaning with Pandas

After I had the raw data, I used the pandas package in Python to clean it. I first combined the 10 airline files into one, then dropped duplicates and removed parentheses and colons that were not useful to my analysis. I cast the date fields to a date type, then filled missing values with either 0 or NaN, depending on the column.
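The cleaning steps could be sketched roughly as follows. The column names (`country`, `review_date`, `overall_score`, `wifi_score`) and the choice of which column gets 0 versus NaN are assumptions made for illustration; the actual names and rules are in the project repository.

```python
import pandas as pd


def clean_reviews(frames):
    """Combine per-airline review tables and apply basic cleaning."""
    # Stack the per-airline tables and drop exact duplicate rows.
    reviews = pd.concat(frames, ignore_index=True).drop_duplicates()
    reviews = reviews.reset_index(drop=True)

    # Strip leftover parentheses and colons (assumed here to appear in "country").
    reviews["country"] = (
        reviews["country"].str.replace(r"[():]", "", regex=True).str.strip()
    )

    # Parse dates; values that cannot be parsed become NaT instead of raising.
    reviews["review_date"] = pd.to_datetime(reviews["review_date"], errors="coerce")

    # Keep unanswered overall scores as NaN so they do not distort averages.
    reviews["overall_score"] = pd.to_numeric(reviews["overall_score"], errors="coerce")

    # Assumed choice for this sketch: fill unanswered wifi scores with 0.
    reviews["wifi_score"] = pd.to_numeric(reviews["wifi_score"], errors="coerce").fillna(0)

    return reviews
```

With the 10 scraped files on disk, the input could be built with something like `frames = [pd.read_csv(path) for path in paths]`, where `paths` is a hypothetical list of the file names.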

Overall

First, I created a bar chart of the overall ratings for all 10 airlines. As shown, all of them are well rated: scores of 9 and 10 make up the majority for every airline except Lufthansa.

By travel type

I made another chart to see how different types of travelers rate their flights and found that solo travelers give the highest ratings on average, while couples give the lowest.
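A comparison like this boils down to a grouped average. Here is a small sketch, assuming a cleaned table with hypothetical columns `traveler_type` and `overall_score`, and assuming unanswered scores are stored as NaN:

```python
import pandas as pd


def mean_score_by(reviews, group_col, score_col="overall_score"):
    """Average score per group, highest first; NaN scores are skipped."""
    return reviews.groupby(group_col)[score_col].mean().sort_values(ascending=False)
```

Calling `mean_score_by(reviews, "traveler_type")` returns one average per traveler type, and the same helper works for any other grouping column, such as seat type or aircraft model.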

By Aircraft Model

I also analyzed ratings by aircraft model and found that the Boeing 777 and 787 are the most highly rated models. Hainan Airlines' Boeing fleet has the highest ratings overall.

By Seat Type

This chart shows how ratings differ by seat type. First Class travelers give the highest ratings on average, and Premium Economy travelers the lowest.

By other methods

To better analyze the individual categories (wifi, entertainment, food, ground service, seat comfort, and value), I made these stacked histograms. For most of the airlines, more than 75% of reviewers gave a score of 3 or higher out of 5 in almost every category.
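The "75% rated 3 or above" figure can be checked numerically as well as visually. The sketch below assumes a hypothetical `airline` column and category columns on a 1 to 5 scale, with unanswered ratings stored as NaN (ratings filled with 0 would be counted as answered and would lower the share):

```python
import pandas as pd


def share_rated_at_least(reviews, score_cols, threshold=3):
    """Percentage of answered ratings at or above threshold, per airline and category."""
    # Count how many reviewers answered each category, per airline.
    answered = reviews[score_cols].notna().groupby(reviews["airline"]).sum()
    # Count how many of those answers meet the threshold (NaN compares as False).
    high = (reviews[score_cols] >= threshold).groupby(reviews["airline"]).sum()
    return 100 * high / answered
```

The result has one row per airline and one column per category, so any cell below 75 points to an airline and category worth a closer look.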

Word clouds

I also used word clouds to see which words appeared most often in the comments for each airline. Most of the words were positive, and words and phrases like "Good service", "Excellent", "Friendly", "Great experience", and "Comfortable" really stood out.
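A word cloud is essentially a picture of word frequencies, so the counting behind it can be sketched with the standard library alone. This is not necessarily how the clouds above were produced; the stop-word list is a tiny illustrative one, and a real analysis would use a fuller list.

```python
import re
from collections import Counter

# Small illustrative stop-word list; an assumption, not the one used in the project.
STOP_WORDS = {
    "the", "a", "an", "and", "was", "were", "is", "to", "of",
    "in", "on", "i", "it", "with", "for", "very",
}


def top_words(comments, n=10):
    """Return the n most frequent non-stop-words across all comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)
```

Running this on each airline's comments separately gives the word sizes a cloud would display, and makes it easy to compare airlines side by side.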

By Correlation with Overall Score

After all these charts, it was clear that most of the top 10 airlines do very well in every category, so I wanted to see which category matters most to reviewers. I checked the correlation between each category score and the overall score and found that reviewers usually give a high overall score when they think the flight was good value and when they received good service in flight and on the ground. Entertainment and wifi have the lowest correlation with the overall score.
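The correlation check can be expressed in a few lines of pandas. The sketch assumes Pearson correlation (the default in pandas; the post does not say which measure was used) and hypothetical column names for the category scores:

```python
import pandas as pd


def correlation_with_overall(reviews, score_cols, overall_col="overall_score"):
    """Correlation of each category score with the overall score, strongest first."""
    correlations = reviews[score_cols].corrwith(reviews[overall_col])
    return correlations.sort_values(ascending=False)
```

For example, `correlation_with_overall(reviews, ["value_score", "wifi_score"])` would list the categories from most to least closely tied to the overall score. A high correlation only shows that the two scores move together, not that one causes the other.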

Conclusion

After the project, I was able to gain some insight:

  • First-class passengers are more likely to give positive feedback.
  • Solo travelers are more likely to give positive feedback.
  • Boeing 787 & 777 passengers are more likely to give positive feedback. 
  • The most important elements are value and customer service, which are areas where U.S. airlines are lacking.
  • U.S. airlines are good at entertainment and wifi, but those are not crucial elements.
  • The top 10 airlines in the world are all very good in multiple areas instead of just one.

This project helped me answer the questions I had going in, and I look forward to expanding it in the future.

About Author

Fred Zeng

Fred received his M.S. in management and systems with a concentration in database technology from New York University. He was also a business and website analyst intern at NYU. For his research, he designed and conducted data testing...