Web-scraping Indeed: Exploring the US Job Market

Posted on Aug 1, 2020

When I started my job search before entering this bootcamp, I didn't know that a quantitative trader must have programming skills. I had little knowledge of the skills required for this type of job.

This project was a great opportunity to explore in depth the skills required to enter the industry, and it was exciting to build it from scratch, clean the data, and gain hands-on experience in scraping websites. There were many websites to look at, e.g. Indeed, Glassdoor, or LinkedIn. I decided to focus on Indeed. You can find the scraped data in this all_link, and the cleaning and analysis can be found in the GitHub repository.

Indeed is the number one job posting site worldwide, with over 250 million unique visitors every month and 10 jobs added every second. It's a free platform where both recruiters and job applicants can find what they need.


How do salaries differ in the USA?
Does money contribute to happiness in the workplace?
How do tech job salaries compare to those in other industries?
Who earns more: data scientists or data analysts?
Which type of developer has the highest salary?
What skills are needed to become a data scientist?

Data Gathering:

To answer my questions, I went to indeed.com and scraped around 2,000 pages of job postings. I searched for all types of jobs so that my sample would cover different industries. For each posting, I collected the job title, company, salary, review, location, remote option, and job description.

Biggest challenges scraping Indeed:

Salary formats on Indeed vary depending on the position, and sometimes the salary is given as a range, which makes it imprecise, so I had to standardize my results.
Also, my sample is very small: I couldn't scrape all the job postings because of time limitations.
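
The standardization step could look something like the sketch below. It is not the code from the repository: the function name `annual_salary`, the example salary strings, and the conversion factors (a 40-hour week, 52 weeks per year, the midpoint of a range) are all assumptions made for illustration.

```python
import re

# Assumed conversion factors to annual pay (40 h/week, 52 weeks/year).
# The original analysis may have used different ones.
PERIODS_PER_YEAR = {
    "hour": 40 * 52,
    "day": 5 * 52,
    "week": 52,
    "month": 12,
    "year": 1,
}


def annual_salary(text):
    """Estimate an annual salary in dollars from a salary string.

    Returns None when no dollar amount or pay period can be found.
    A range such as "$15 - $20 an hour" is reduced to its midpoint.
    """
    # Dollar amounts such as "$15", "$50,000" or "$17.50".
    amounts = [
        float(raw.replace(",", ""))
        for raw in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    ]
    period = re.search(r"hour|day|week|month|year", text.lower())
    if not amounts or period is None:
        return None
    bounds = amounts[:2]  # lower and upper bound, or a single value
    midpoint = sum(bounds) / len(bounds)
    return midpoint * PERIODS_PER_YEAR[period.group(0)]


print(annual_salary("$50,000 - $70,000 a year"))
print(annual_salary("$15 - $20 an hour"))
print(annual_salary("Competitive pay"))
```

Taking the midpoint is only one option; keeping the lower and upper bounds as separate columns would preserve more information at the cost of a more involved analysis.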

Data Analysis:

First, I checked the distribution of salaries. The results are shown below:

As you can see, the distribution is skewed to the right, with a median salary of $43,000.
We can see that there is inequality in the US, with the top 1% earning at least $126,000.
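
For readers who want to reproduce this kind of summary, here is a minimal sketch using only Python's standard library. The `sample` list is made-up illustrative data, not the scraped dataset, and `summarize` is a hypothetical helper name. A mean larger than the median is a quick sign of right skew, and the 99th percentile is the threshold for the top 1%.

```python
import statistics


def summarize(salaries):
    """Return the median, mean and 99th percentile of a list of salaries."""
    # quantiles(..., n=100) returns 99 cut points; the last one is the
    # 99th percentile. "inclusive" treats the data as the whole population.
    cuts = statistics.quantiles(salaries, n=100, method="inclusive")
    return {
        "median": statistics.median(salaries),
        "mean": statistics.mean(salaries),
        "p99": cuts[98],
    }


# Made-up annual salaries, for illustration only.
sample = [24_000, 30_000, 33_000, 38_000, 41_000, 46_000, 52_000, 68_000, 140_000]
stats = summarize(sample)
print(stats)
print("right-skewed:", stats["mean"] > stats["median"])
```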

Next, a dive into job offerings per state. I had to group the job offerings by state.
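
A grouping like this can be sketched as follows, assuming the scraped location strings look like "Austin, TX 78701" (the exact format on Indeed may differ). The helper names `state_of` and `postings_per_state` are hypothetical.

```python
import re
from collections import Counter


def state_of(location):
    """Extract a two-letter state code from a string like 'Austin, TX 78701'."""
    match = re.search(r",\s*([A-Z]{2})\b", location)
    return match.group(1) if match else None


def postings_per_state(locations):
    """Count postings per state, skipping locations with no state code."""
    states = (state_of(location) for location in locations)
    return Counter(state for state in states if state is not None)


# Made-up locations, for illustration only.
locations = ["Austin, TX 78701", "San Francisco, CA", "Remote", "Dallas, TX"]
counts = postings_per_state(locations)
print(counts.most_common())
```

Postings listed only as "Remote" carry no state code and are dropped here; they could instead be counted in their own bucket.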

California is the state with the most job offerings, which is expected since it has the largest population. I was surprised that Texas is almost equal to California.
This may be because of the low cost of living there and because remote jobs are soaring during COVID-19.
Now let's take a look at the proportion of remote jobs and the businesses that are hiring the most during this pandemic.

I looked up these businesses, and 3 of them are from the health sector, which makes sense in these times.

Then I wanted to investigate the claim that income is related to happiness. I drew a scatter plot and found little to no correlation between the two.
I was expecting some positive correlation, but the Pearson correlation results showed that I was wrong.
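
For reference, the Pearson correlation coefficient can be computed by hand as in the sketch below. The function name `pearson_r` and the sample numbers are illustrative, and using a company's review rating as the happiness measure is an assumption made here for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up data: annual salary versus company review rating (1 to 5).
salaries = [35_000, 48_000, 60_000, 75_000, 110_000]
ratings = [3.9, 3.4, 4.1, 3.6, 3.8]
print(round(pearson_r(salaries, ratings), 3))
```

A value near 0 indicates little linear relationship, while values near 1 or -1 indicate a strong positive or negative one.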

Then I turned my focus to tech-related jobs. First of all, I wanted to verify the claim that data scientists earn more than data analysts.
On average, a data scientist earns around $105,000 and a data analyst $66,000.
One limitation of my study is that it doesn't capture wages very accurately.
Then, I wanted to know which type of developer has the highest income.
Surprisingly, I found that quantitative developers rank first, followed by back-end developers.
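
A comparison of this kind can be sketched as below, assuming each posting has been reduced to a (title, annual salary) pair and that roles are identified by a keyword in the title. The keyword list, the helper name `mean_salary_by_title`, and the sample postings are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical role keywords; the first keyword found in a title wins.
TITLE_KEYWORDS = ["data scientist", "data analyst"]


def mean_salary_by_title(postings, keywords=TITLE_KEYWORDS):
    """Average salary per role keyword, ignoring postings without a salary."""
    buckets = defaultdict(list)
    for title, salary in postings:
        if salary is None:
            continue
        lowered = title.lower()
        for keyword in keywords:
            if keyword in lowered:
                buckets[keyword].append(salary)
                break
    return {keyword: mean(values) for keyword, values in buckets.items()}


# Made-up postings, for illustration only.
postings = [
    ("Senior Data Scientist", 120_000),
    ("Data Scientist I", 100_000),
    ("Data Analyst", 60_000),
    ("Junior data analyst", None),
    ("Registered Nurse", 70_000),
]
averages = mean_salary_by_title(postings)
print(averages)
```

Since many postings list no salary at all, the averages rest on whatever subset does, which is one reason the wage figures should be read with caution.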

I wanted to conclude by focusing on the FANG companies (Facebook, Amazon, Netflix, Google), so I scraped entry-level data scientist positions to see if a candidate like me has a chance there.

Python is the skill to have as a data scientist. You must also be proficient in another language such as C++ or Java and, of course, SQL. We also notice that a data scientist must have soft skills and, for the FANG companies, a master's degree or a PhD.
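
One simple way to surface the most requested skills is to count how many job descriptions mention each one. The sketch below assumes a hand-picked skill list and made-up descriptions; `skill_pattern` and `skill_counts` are hypothetical helper names, not functions from the repository.

```python
import re
from collections import Counter

# Hypothetical skill list; extend it as needed.
SKILLS = ["Python", "C++", "Java", "SQL"]


def skill_pattern(skill):
    """Match the skill as a standalone token, e.g. 'Java' but not 'JavaScript'."""
    # Plain \b word boundaries do not work after 'C++', so the neighbours
    # are checked explicitly for letters and digits instead.
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(skill) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


PATTERNS = {skill: skill_pattern(skill) for skill in SKILLS}


def skill_counts(descriptions):
    """Count the number of descriptions that mention each skill at least once."""
    counts = Counter()
    for text in descriptions:
        for skill, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[skill] += 1
    return counts


# Made-up descriptions, for illustration only.
descriptions = [
    "Strong Python and SQL skills; JavaScript a plus.",
    "Experience with C++ or Java, plus python.",
]
counts = skill_counts(descriptions)
print(counts)
```

Dividing each count by the number of descriptions gives the share of postings asking for that skill.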


In summary, we can say that the tech industry pays a high income compared to other industries. This is likely because there is a lot of competition and there are many requirements.

Some limitations of my analysis:
No historical data, which I believe would be interesting for seeing how salaries have evolved over the years.
No information about shareholders or CEOs, which I believe would make the distribution more right-skewed.

If I have time in the future, I will include other countries like France or my home country and compare income using PPP (purchasing power parity).
I would also create a machine learning model that reads a candidate's resume and matches the applicant to a suitable role.


About Author


Marc Medawar

MSc in mathematical finance, looking for a quant researcher/data scientist position.