NYC Data Science Academy| Blog

Landing my Dream Job by Scraping Glassdoor.com

Diego De Lazzari
Posted on Aug 21, 2016

In this post I use web scraping to build a simple recommendation system for data scientists looking for new employment. The idea is to use the resources published on Glassdoor.com to create a dataset covering the US job market for data scientists and data analysts. A web application developed in R Shiny allows the user to upload a resumé and receive a list of recommended job posts, depending on the candidate's skills, education and work experience.

Introduction

Data scientists are tech savvy. We do not like manually browsing through job postings. We would rather have our profile and skills matched with the best available positions. Aren't there dozens of companies providing similar services? Sure, but you have to sign into multiple websites, share your personal information, set an alert, check the company's reputation, filter out useless recommendations ... Way too much manual work!

Data scientists are also creative. While it may be difficult and time consuming to match a resumé with multiple sources, it is possible to reverse the approach and create a single job posting database combining the results of multiple queries for a specific position. The database is then matched to a resumé and the best recommendations are provided on the basis of a given set of skills.

About the dataset

For this project, I will limit the scope to Glassdoor.com, a California-based company founded in 2007 that provides a database of millions of company reviews, salary reports and interview reviews. About 1300 job posts for data scientists and data analysts published in the US within the last month were collected and analyzed. The resulting database shows the expected concentration of job vacancies in the well-known hubs for data science, San Francisco and New York.

Figure 1: Top 10 cities in the US by number of job posts for data scientists.

Figure 2: Preferred or required education for a data scientist. Please note that a post accepting more than one title (e.g. a master's OR a bachelor's) is counted twice in the chart.

Figure 3: List of skills most frequently requested in job posts for data scientist positions.

Let's now have a look at the candidate profile: as shown in Figures 1-3, the ideal "Data Scientist" holds a Master's or PhD degree and is proficient in a number of tools for scripting, database management, visualization and parallel computing. While Python and R are by far the preferred scripting languages, more "traditional" programming languages such as Java and C++ remain popular. The historical competition between Matlab and Mathematica seems to have been decisively won by the former, although its popularity is well below that of R. As for data manipulation and modeling tools, SAS appears to be the most frequently used software, followed by Excel and SPSS.

Once the keywords are defined for the prospective data scientist, in terms of skills, education and experience, it is possible to move to the next step and match a resumé with any given job post. Before looking into the technical details, it is important to realize that both records typically lack a uniform structure. CVs follow a variety of templates and may or may not be complete; similarly, job posts vary quite significantly in length and structure. Despite the template defined by the service provider (Glassdoor, in this case), even the most basic parameters (i.e. company name, location, rating) are often inconsistent.

Under these conditions, I chose to treat both resumé and job posts as unstructured text, filtered by a set of common keywords and then compared according to their similarity. Among the various measures of similarity available in the literature, I will be using the so-called "Jaccard similarity", defined as the ratio between the size of the intersection and the size of the union of two "bags of words". This metric is meant to reward a good match of keywords while penalizing those CVs (or job posts) containing a very broad description. In these cases, we expect the union of the two "bags of words" to grow more than the intersection, thereby decreasing the similarity.
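To make the definition concrete, here is a minimal sketch of the Jaccard similarity on two keyword sets, J(A, B) = |A ∩ B| / |A ∪ B|. The function name and the example keyword sets are illustrative assumptions, not taken from the project code.

```python
def jaccard_similarity(bag_a, bag_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of keywords."""
    a, b = set(bag_a), set(bag_b)
    union = a | b
    if not union:
        # Two empty bags share nothing; define the similarity as 0.
        return 0.0
    return len(a & b) / len(union)


# Illustrative keyword sets (hypothetical, not from the scraped data).
cv = {"python", "r", "sql"}
focused_post = {"python", "r", "sql", "spark"}
broad_post = focused_post | {"java", "c++", "sas", "excel"}

print(jaccard_similarity(cv, focused_post))  # 3 shared / 4 total = 0.75
print(jaccard_similarity(cv, broad_post))    # 3 shared / 8 total = 0.375
```

The broad post contains every keyword of the focused one, yet it scores lower because its extra keywords enlarge the union without adding to the intersection. This is the penalty for very broad descriptions mentioned above.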

As mentioned in the first paragraph, the job search is deployed as a Shiny app (see the screenshot below), which receives a resumé as input, processes the text and produces a list of recommendations with the corresponding links to the website. The user can filter the list by location, company or company rating and select the preferred link to the job post. The reliability of the system is closely related to the number of keywords (currently over 50) used for the match.

[Screenshot of the Shiny app]

The prototype application presented in this post gives relevant recommendations for most of the resumés tested among bootcamp participants at NYC Data Science Academy. Its main limitations are the number of jobs scraped and the limited bag of words chosen as a filter. In order to become a valuable tool for job search, the app will need to include an "update" function that allows the user to rebuild the database with the latest job posts published within the last 2-3 weeks. Furthermore, the keywords currently hard coded in the search functions (see helperP3.py below) should evolve into a dynamic list of skills obtained from the same database. This will allow me to improve the similarity measure and to reflect the evolution of the labor market.




Appendix: the code at a glance

Developed in Python. Deployed using R Shiny.
View Github: Github

Packages used:

  • Python
    • selenium
    • nltk
    • pandas
    • numpy
    • re
    • collections
    • pickle
    • csv
    • time
    • wordcloud
  • R
    • shiny/shinydashboard
    • rPython
    • dplyr
    • tidyr
    • plotly
    • pdftools

Part 1: Scraping the website

Although Glassdoor.com provides an API to retrieve information on job posts, this project relies on manual web scraping. For this task, I chose Selenium for Python, which allows one to browse through a website by driving Chrome programmatically. Despite the relatively simple structure of the page, the script requires a few tricks, as Glassdoor tends to throw a number of pop-ups and CAPTCHA messages intended to limit, if not prevent, the presence of bots. The scraping is performed in two stages: first recording a brief description of each post and its link, then browsing through the list of links and retrieving the full post description. Both operations are handled by the same function, glassdoorScrape(), by setting the parameter get_short = True for the short description and get_short = False for the complete post. The function is reported below.
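The original glassdoorScrape() listing is not reproduced in this text. As a rough illustration of the two-stage idea only, the sketch below takes any driver object exposing get() and page_source, as a Selenium WebDriver does. The function name scrape_sketch, the "jobLink" class in the pattern and the returned fields are hypothetical; Glassdoor's real markup differs and changes over time, and the pop-up and CAPTCHA handling is omitted.

```python
import re
import time

# Hypothetical markup: each job link is assumed to be an <a> tag whose
# class attribute ("jobLink") appears before its href attribute.
LINK_PATTERN = re.compile(
    r'<a[^>]*class="jobLink"[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S
)
TAG_PATTERN = re.compile(r"<[^>]+>")


def scrape_sketch(driver, urls, get_short=True, pause=0.0):
    """Stage 1 (get_short=True): collect titles and links from result pages.
    Stage 2 (get_short=False): collect the full text of each linked post."""
    results = []
    for url in urls:
        driver.get(url)            # navigate to the page
        html = driver.page_source  # HTML of the page currently loaded
        if get_short:
            for href, title in LINK_PATTERN.findall(html):
                clean_title = TAG_PATTERN.sub("", title).strip()
                results.append({"link": href, "title": clean_title})
        else:
            text = TAG_PATTERN.sub(" ", html)  # drop tags, keep the text
            results.append({"link": url, "description": " ".join(text.split())})
        time.sleep(pause)  # wait between requests to avoid hammering the site
    return results
```

In use, the first call would receive the search result pages with get_short=True, and the second call would receive the links collected by the first with get_short=False. Passing the driver in as an argument keeps the parsing logic easy to try out on saved pages without opening a browser.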

Part 2: Parsing and checking data

Once the scraping is complete, the dataset needs to be processed in order to obtain a list of relevant features for each post. The processing is done by several functions, displayed below in the helperP3.py script. The script also contains a few functions used for scraping, calculating the Jaccard similarity, checking data consistency and I/O. For data cleaning and building the keyword dictionary, I found the work presented by Jesse Steinweg particularly useful.
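The helperP3.py script itself is not reproduced in this text. As an illustration of the keyword-filtering step only, the sketch below reduces a block of free text to the subset of a hard-coded keyword list that it mentions. The function name, the tokenizing pattern and the short keyword list are assumptions made for the example; the project uses over 50 keywords.

```python
import re

# Illustrative keyword list (hypothetical; the real list is much longer).
KEYWORDS = {"python", "r", "sql", "java", "c++", "sas", "excel", "hadoop", "spark"}


def extract_keywords(text, keywords=KEYWORDS):
    """Return the set of known keywords that appear in a block of free text."""
    # Lowercase the text and split it into tokens; '+' and '#' are kept
    # so that names such as "c++" survive tokenization.
    tokens = set(re.findall(r"[a-z+#]+", text.lower()))
    return tokens & keywords


print(extract_keywords("Experience with Python, R and C++ required."))
```

Applying the same function to a resumé and to every job post yields the "bags of words" that the similarity measure compares.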

Part 3: Getting the analytics

In the third part, the function get_analytics() is used to produce the data frames used for visualization. Depending on the number of job postings found for a given category, the function returns the top 10 locations, top employers, and the most frequently requested skills, education and languages (besides English). The data frames can be plotted as a word cloud by calling the getWordCloud() function.
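The kind of ranking that get_analytics() produces can be approximated with collections.Counter, one of the packages listed above. This is only a sketch under assumptions: the function name top_n and the representation of each post as a dictionary with a "location" field are hypothetical, and the original works with data frames.

```python
from collections import Counter


def top_n(posts, field, n=10):
    """Count the values of one field across posts and return the n most common."""
    # Posts where the field is missing or empty are skipped.
    counts = Counter(post[field] for post in posts if post.get(field))
    return counts.most_common(n)


# Hypothetical records standing in for the scraped posts.
posts = [
    {"location": "New York"},
    {"location": "San Francisco"},
    {"location": "New York"},
    {"location": None},
]
print(top_n(posts, "location", n=2))
```

The same helper could be pointed at an employer or education field to obtain the other rankings.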

Part 4: Building the Shiny App

In the final part of the project, a simple Shiny app is built to match the user's resumé against the database. The interface between R and Python is managed by the rPython package, which allows one to load the helperP3.py script in R. The Python function get_bestMatch(myCV) accepts the input text, which is uploaded in PDF format and converted into a single string by means of pdftools. As a result, the function produces a csv file containing the recommendations, which is read by R and displayed dynamically by Shiny.
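The hand-off described above, ranking the posts against the resumé and writing the result as csv for R to read, could look roughly like the sketch below. It is not the original get_bestMatch(): the function names, the post fields and the column names are assumptions, and it writes to any file-like object rather than to a fixed file.

```python
import csv
import io


def jaccard(a, b):
    """Jaccard similarity of two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches_csv(cv_keywords, posts, out, top=5):
    """Rank posts by similarity to the CV keywords and write the best as CSV."""
    ranked = sorted(
        posts, key=lambda p: jaccard(cv_keywords, p["keywords"]), reverse=True
    )
    writer = csv.writer(out)
    writer.writerow(["title", "link", "similarity"])
    for post in ranked[:top]:
        score = jaccard(cv_keywords, post["keywords"])
        writer.writerow([post["title"], post["link"], f"{score:.2f}"])


# Hypothetical posts, already reduced to keyword sets.
posts = [
    {"title": "Analyst", "link": "/a", "keywords": {"excel", "sql"}},
    {"title": "Scientist", "link": "/b", "keywords": {"python", "r", "sql"}},
]
buffer = io.StringIO()
best_matches_csv({"python", "sql", "r"}, posts, buffer)
print(buffer.getvalue())
```

On the R side, a file written this way could be loaded with read.csv() and shown in a table that Shiny filters by location, company or rating.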


Contributed by Diego De Lazzari. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which took place from July 5th to September 23rd, 2016. This post is based on his third class project, Python Web Scraping (due in the 6th week of the program). The R code can be found on GitHub.

About Author

Diego De Lazzari

Researcher, developer and data scientist. Diego De Lazzari is an applied physicist with a rather diverse background. He spent 8 years in applied research, developing computational models in the field of Plasma Physics (Nuclear Fusion) and Geophysics. As...

Comments

TahorSuiJuris April 29, 2019
Excellent!
Aadish Chopra October 23, 2017
If you used R Shiny to develop this app, did you consider using RSelenium for the web scraping? I am asking because I built a web scraper using RSelenium and now want to deploy it using Shiny. I am struggling with it, as I haven't come across many articles that discuss R Shiny together with RSelenium.
Robert September 19, 2017
I've installed all the dependencies listed in the project. I saved all the files and I need to know how to put it all together. Where can I go to learn how to put all these things together to make them work? I feel like I am really close but am missing some basic information that everyone who reads this learned in CS10 or takes for granted as being common programmer knowledge. Can you point me in the right direction to figure out what I am missing?
Emad May 3, 2017
Good Job Mate... Brilliant Actually. Especially the similarity part.