Data Analysis on Job Satisfaction

Posted on May 2, 2020
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


Being happy at work and satisfied with our position and responsibilities is a real issue. From the company's point of view, having happy employees increases productivity and improves the company's image, which attracts talented workers at a lower cost. Meanwhile, employees are trying to maximize their own job satisfaction.

The following analysis is based on the assumption that Glassdoor's company ratings are not biased and are locally independent.

Glassdoor provides information on each company, such as the list of benefits given to its employees, along with employee reviews and ratings. Each rating may include the location and position of the employee. We will examine how the ratings depend on those features.


The data was scraped from Glassdoor using Scrapy. Over a million observations were collected across 2,500 companies. The dataset contains:

  • Name of the company
  • Industry
  • Revenue
  • List of benefits
  • Reviews:
    • Ratings:
      • Overall
      • Career opportunities
      • Compensation and benefits
      • Work-life balance
      • Senior management
      • Culture and values
    • Location (city, state)
    • Position
    • Former/Current employee
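As an illustration only, the fields listed above could be represented with a few small record types. This is a sketch: the class names, field names, and types below are assumptions made for the example, not the structure used by the original scraper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ratings:
    """Star ratings attached to one review (hypothetical field names)."""
    overall: float
    career_opportunities: Optional[float] = None
    compensation_and_benefits: Optional[float] = None
    work_life_balance: Optional[float] = None
    senior_management: Optional[float] = None
    culture_and_values: Optional[float] = None


@dataclass
class Review:
    """One review; location and position may be missing."""
    ratings: Ratings
    city: Optional[str] = None
    state: Optional[str] = None
    position: Optional[str] = None
    current_employee: Optional[bool] = None  # False means former employee


@dataclass
class Company:
    """One company with its benefits and reviews."""
    name: str
    industry: Optional[str] = None
    revenue: Optional[str] = None  # kept as text; the source format is not specified
    benefits: List[str] = field(default_factory=list)
    reviews: List[Review] = field(default_factory=list)
```

Most fields are optional because, as noted earlier, a rating may or may not include the location and position of the employee.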

What does the data look like?

The following graph shows the densities and boxplots of the mean ratings per company.


This graph shows two things. First, people seem to give lower ratings to senior management. Second, the career opportunities rating seems to have a smaller variance, with very few extreme low values.
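The per-company means behind this kind of plot could be computed roughly as follows. This is a minimal sketch that assumes the reviews are available as `(company, rating_name, value)` tuples; that layout and the function name are hypothetical, not the author's actual pipeline.

```python
from collections import defaultdict
from statistics import mean


def mean_ratings_per_company(reviews):
    """Average each rating type per company.

    `reviews` is an iterable of (company, rating_name, value) tuples
    (an assumed layout). Missing values (None) are skipped.
    Returns a dict mapping (company, rating_name) to the mean value.
    """
    grouped = defaultdict(list)
    for company, rating_name, value in reviews:
        if value is not None:
            grouped[(company, rating_name)].append(value)
    return {key: mean(values) for key, values in grouped.items()}
```

The resulting means, one per company and rating type, are what a density plot or boxplot would then summarize.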


Improving the overall rating by changing benefits

The most obvious strategy to improve a company's rating is to give more benefits to its employees. Does this strategy have any impact on the different ratings? What is the relationship between the overall rating and the other ones?

The following graphs help us better understand the relationships between the different ratings and each benefit. I selected the professional development benefit because it shows a linear dependence with some of the ratings.

Although we can observe a linear dependence for the compensation and benefits rating and the career opportunities rating, it might not be significant. As the correlation matrix shows, most of the benefits have no obvious relationship with the ratings. Another issue is that some of the benefits are unbalanced (most of the data is concentrated around the same value), which biases the regression.



The different features are apparently not very correlated (all correlations are below 30%). Still, the compensation and benefits rating is more correlated with the benefits than the other ratings are, which has two possible explanations. Either there is a causal link between the two, so that the more benefits a company offers, the better this rating is, or a company that offers more benefits can also afford to offer better salaries.
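For reference, a correlation matrix of this kind can be built from pairwise Pearson coefficients. The sketch below uses only the standard library and assumes the data is held as a dict of equal-length numeric columns; the function names and the data layout are illustrative assumptions, not the code used for the article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {(name_a, name_b): correlation} for every pair of columns.

    `columns` maps a column name to a list of numbers (assumed layout).
    """
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}
```

A constant column raises an error here because its correlation is undefined, which is one way a heavily unbalanced benefit can make the matrix hard to interpret.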

The correlations between the different ratings are all very high. This is an inherent property of the data, which may come from the psychology of rating one thing along several criteria: if one rating is low, the others might be lower than they should be, for the sake of consistency. Unfortunately, I do not have any relevant data to confirm or refute that theory.

To summarize, the benefits seem to have a very low correlation with the ratings overall. They are, however, slightly more correlated with the compensation and benefits rating.

Improving the overall rating differently

One observation we can make about the data is that the overall rating is not the mean of the other ratings but is provided by the user "independently".

Knowing this, we might wonder: what is the relationship between the different ratings? Does one of them explain the overall rating better than the others?

Although some ratings have more variance around the regression line (e.g., compensation and benefits), there appears to be a very strong linear dependence between the different ratings, which is consistent with our previous theory.

The slopes of the career opportunities and senior management regression lines are greater than the others, which leads to the hypothesis that these ratings better explain the overall rating.
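A slope comparison of this kind can be reproduced with one simple least-squares fit per rating. The following is a sketch under the assumption that each rating is available as a list of per-company values aligned with the overall rating; `ols_fit`, `slopes_against_overall`, and the data layout are hypothetical names introduced for the example.

```python
def ols_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x is constant; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def slopes_against_overall(overall, other_ratings):
    """Fit the overall rating against each other rating separately.

    `other_ratings` maps a rating name to a list aligned with `overall`.
    Returns {rating_name: slope}.
    """
    return {name: ols_fit(values, overall)[0] for name, values in other_ratings.items()}
```

Each fit is a separate one-variable regression, so a larger slope only suggests that the overall rating moves more with that rating; it does not account for the strong correlations among the ratings themselves.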

Ratings and locations

This map displays the mean overall rating per state. It could be improved by adding data, as the state means still have a high variance.
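One way to see which state means are least reliable is to report the number of reviews and a standard error next to each mean. This is a minimal sketch that assumes the data is a list of `(state, overall_rating)` pairs; the function name and layout are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def state_summary(rows):
    """Summarize overall ratings per state.

    `rows` is an iterable of (state, rating) pairs (assumed layout).
    Returns {state: (mean, count, standard_error)}; the standard error
    is None when a state has a single rating.
    """
    by_state = defaultdict(list)
    for state, rating in rows:
        by_state[state].append(rating)

    summary = {}
    for state, ratings in by_state.items():
        n = len(ratings)
        std_err = stdev(ratings) / sqrt(n) if n > 1 else None
        summary[state] = (mean(ratings), n, std_err)
    return summary
```

States with few ratings get a large or undefined standard error, which makes it easier to tell where more data would help most.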


Based on the collected data, we can conclude that a company's overall rating is most sensitive to the career opportunities and senior management ratings. A good strategy would therefore be to focus on improving those ratings. However, investing in the benefits the company provides to its employees also has an effect on the overall rating. In particular, we observed a linear dependence between the professional development benefit and most of the ratings. The same holds for a few other benefits, such as company social events, diversity programs, and, surprisingly, gym membership.

Adding more data to the study could help reduce the selection bias induced by the fact that we scraped the first n pages of Glassdoor's company list, which is sorted by popularity. It would also balance our data by including more companies per industry and per state.

A next step toward a better understanding of this question would be to include the remaining features in our analysis.


About Author

Dan Toledano

Dan has a background in applied mathematics and quantitative finance, with a master's degree in applied mathematics from Sorbonne University in Paris. He specialized in random modeling, with relevant experience as a quantitative researcher. He is passionate...