Web-scraping Indeed: Exploring the US Job Market

Posted on Aug 1, 2020

When I started my job search before entering this data bootcamp, I didn't know that becoming a quantitative trader requires programming skills. I had little knowledge of the skills required for this type of job.

The project was a great opportunity to explore in depth which skills are required to enter the industry, and it was very exciting to build a scraper from scratch, clean the data, and gain experience scraping websites. There were many websites to look at, e.g. Indeed, Glassdoor, or LinkedIn. I decided to focus on Indeed. You can find the scraped data in this all_link, and the analysis and cleaning can be found on the GitHub repository.

Indeed is the number one job posting site worldwide, with over 250 million unique visitors every month and 10 jobs added every second. It's a free platform where both recruiters and job applicants can find what they need.


How do salaries differ across the USA?
Does money contribute to happiness in the workplace?
How do tech job salaries compare to those in other industries?
Who earns more: data scientists or data analysts?
Which type of developer has the highest salary?
What skills are needed to become a data scientist?

Data Gathering:

To answer my questions, I went to indeed.com and scraped around 2,000 pages of job postings. I searched for all types of jobs so that my sample would cover different industries. For each posting, I collected the job title, company, salary, review, location, remote option, and job description.

Biggest challenges scraping Indeed:

The salary format on Indeed varies depending on the job position, and sometimes the salary is given as a range, which makes it imprecise. I had to standardize my results.
Also, my sample is very small. I couldn't scrape all the job postings because of time limitations.
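
The standardization step could look something like the sketch below. It is not the code used for this project: the salary string formats, the conversion factors (40 hours a week, 52 weeks a year), and the choice to collapse a range to its midpoint are all assumptions made for illustration, and `annual_salary` is a hypothetical helper name.

```python
import re

# Assumed conversion factors to a yearly figure (40 h/week, 52 weeks/year).
PERIODS_PER_YEAR = {"hour": 2080, "day": 260, "week": 52, "month": 12, "year": 1}

def annual_salary(text):
    """Return an estimated yearly salary in dollars, or None if unparseable."""
    # Pull out every dollar amount, e.g. "$20" and "$25" from "$20 - $25 an hour".
    amounts = [
        float(a.replace(",", ""))
        for a in re.findall(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
    ]
    # Find the pay period the amounts refer to.
    period = re.search(r"\b(hour|day|week|month|year)\b", text.lower())
    if not amounts or period is None:
        return None
    # A range collapses to its midpoint; a single amount stays as it is.
    midpoint = sum(amounts) / len(amounts)
    return midpoint * PERIODS_PER_YEAR[period.group(1)]

print(annual_salary("$20 - $25 an hour"))   # 46800.0
print(annual_salary("$50,000 a year"))      # 50000.0
print(annual_salary("Competitive pay"))     # None
```

Using the midpoint of a range is a simplification, which is one reason the resulting wage figures should be read as estimates.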

Data Analysis:

At first, I checked the distribution of salaries. The results are shown below:

As you can see, the distribution is skewed to the right, with a median salary of $43,000.
This points to income inequality in the US: the top 1% in my sample earn at least $126,000.
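
For readers who want to reproduce figures like the median and the top-1% threshold, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration and `salary_summary` is a hypothetical helper, not part of the original analysis.

```python
import statistics

def salary_summary(salaries):
    """Return the median, mean, and 99th-percentile cut-off of yearly salaries."""
    # n=100 yields 99 cut points; index 98 is the 99th percentile.
    # method="inclusive" interpolates within the observed range only.
    cut_points = statistics.quantiles(salaries, n=100, method="inclusive")
    return {
        "median": statistics.median(salaries),
        "mean": statistics.mean(salaries),
        "p99": cut_points[98],
    }

# Made-up sample with one very high salary to mimic a right-skewed distribution.
sample = [30000, 35000, 40000, 43000, 50000, 60000, 200000]
summary = salary_summary(sample)
print(summary["median"])                    # 43000
print(summary["mean"] > summary["median"])  # True: a sign of right skew
```

A mean well above the median is a quick numeric check for right skew, complementing the histogram.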

Next, I looked at the number of job offerings per state, which required grouping the job postings by state.
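
The grouping can be sketched as follows, assuming each scraped location is a string such as "Austin, TX 78701" with a two-letter state code after the comma. That format and the helper name `jobs_per_state` are assumptions; locations without a state code (for example "Remote") are simply skipped.

```python
import re
from collections import Counter

def jobs_per_state(locations):
    """Count postings per two-letter state code, most common first."""
    counts = Counter()
    for location in locations:
        # Look for a state code right after the city, e.g. ", TX".
        match = re.search(r",\s*([A-Z]{2})\b", location)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()

# Made-up locations for illustration.
locations = ["Austin, TX 78701", "San Francisco, CA", "Remote", "Los Angeles, CA 90001"]
print(jobs_per_state(locations))  # [('CA', 2), ('TX', 1)]
```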

California has the most job offerings, probably because it has the largest population. I was surprised that Texas is almost equal to California.
This may be due to the low cost of living there and to remote jobs soaring during COVID-19.
Now let's take a look at the proportion of remote jobs and the businesses that are hiring the most during this pandemic.

I looked up these businesses, and 3 of them are in the health sector, which makes sense in these times.

Then I wanted to investigate the claim that income is related to happiness. I drew a scatter plot and found little to no correlation between the two.
I was expecting some sort of positive correlation, but the Pearson correlation results showed that I was wrong.
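
As a reminder of what the Pearson coefficient measures, here is a small self-contained sketch that computes it by hand. The function name and the toy numbers are illustrative only; in the analysis the two inputs would be salaries and whatever happiness score is attached to each posting.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)

# Toy data: a perfectly linear relationship gives a coefficient close to 1.
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))  # 1.0
```

A value near 0, as found here, means there is no linear relationship; it does not rule out a non-linear one.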

Then I turned my focus to tech-related jobs. First of all, I wanted to verify the claim that data scientists earn more than data analysts.
On average, data scientists earn around $105,000 and data analysts around $66,000.
One limitation of my study is that the wage figures are not very accurate, since many postings give the salary only as a range.
Then, I wanted to know which type of developer has the highest income.
Surprisingly, I found that quantitative developers rank first, followed by back-end developers.

I wanted to conclude by focusing on the FANG companies (Facebook, Amazon, Netflix, Google), so I scraped entry-level data scientist positions to see if a candidate like me has a chance there.

Python is the skill to have as a data scientist. You should also be proficient in another language such as C++ or Java, and of course SQL. The postings also show that a data scientist must have soft skills and, for the FANG companies, a master's degree or a PhD.
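
One simple way to arrive at a skill ranking like this is to count how many job descriptions mention each skill. The sketch below assumes the descriptions are plain strings; the skill list and the `skill_counts` helper are hypothetical, not the ones used for this post.

```python
import re
from collections import Counter

# Hypothetical list of skills to look for.
SKILLS = ["python", "sql", "java", "c++"]

def skill_counts(descriptions, skills=SKILLS):
    """Count the number of descriptions that mention each skill at least once."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for skill in skills:
            # Lookarounds stop "java" from matching inside "javascript".
            pattern = r"(?<![a-z0-9+#])" + re.escape(skill) + r"(?![a-z0-9+#])"
            if re.search(pattern, lowered):
                counts[skill] += 1
    return counts

# Made-up descriptions for illustration.
descriptions = [
    "Python and SQL required, Java a plus.",
    "JavaScript and C++ experience",
    "Strong SQL skills",
]
print(skill_counts(descriptions).most_common(1))  # [('sql', 2)]
```

Counting each description once per skill keeps a single keyword-heavy posting from dominating the ranking.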


In summary, the tech industry pays high incomes compared to other industries. This is likely because there is a lot of competition and a long list of requirements.

Some limitations of my analysis:
No historical data, which I believe would be interesting for seeing how salaries have evolved over the years.
No information about shareholders or CEOs, which I believe would make the distribution even more right-skewed.

If I have time in the future, I will include other countries, such as France or my home country, and compare incomes using PPP (purchasing power parity).
I would also like to build a machine learning model that reads a candidate's resume and matches the applicant to a suitable role.

