Landing my Dream Job by Scraping

Posted on Aug 21, 2016

In this post I use web scraping to build a simple recommendation system for data scientists looking for new employment. The idea is to use the resources published on Glassdoor to create a dataset covering the US job market for data scientists and data analysts. A web application developed in R Shiny allows the user to upload a resumé and receive a list of recommended job posts, depending on the candidate's skills, education and work experience.


Data scientists are tech savvy. We do not like manually browsing through job postings. We would rather have our profile and skills matched with the best available positions. Aren't there dozens of companies providing similar services? Sure, but you have to sign into multiple websites, share your personal information, set an alert, check the company's reputation, filter out useless recommendations ... Way too much manual work!

Data scientists are also creative. While it may be difficult and time-consuming to match a resumé against multiple sources, it is possible to reverse the approach and create a single job posting database combining the results of multiple queries for a specific position. The database is then matched to a resumé and the best recommendations are provided on the basis of a given set of skills.

About the dataset

For this project, I will limit the scope to Glassdoor, a California-based company founded in 2007, providing a database of millions of company reviews, salary reports and interview reviews. About 1300 job posts for data scientist/data analyst positions published within the last month in the US were collected and analyzed. The resulting database shows an expected concentration of job vacancies in the well-known hubs for data science, San Francisco and New York.


Figure 1: Top 10 cities in the US by number of job posts for Data Scientist.

Figure 2: Preferred or required education for a data scientist. Please note that a post accepting more than one title (e.g. a master OR a bachelor) is counted twice in the chart.

Figure 3: List of skills most frequently requested in job posts for data scientist positions.

Let's now have a look at the candidate profile: as shown in figures 1-3, the ideal "Data Scientist" holds a Master's or PhD degree and is proficient in a number of tools for scripting, database management, visualization and parallel computing. While Python and R are by far the preferred scripting languages, the more "traditional" programming languages such as Java and C++ remain popular. The historical competition between Matlab and Mathematica seems to have been won by the former, although its popularity is well below that of R. As for data manipulation and modeling tools, SAS appears to be the most frequently used software, followed by Excel and SPSS.

Once the keywords are defined for the prospective data scientist, in terms of skills, education and experience, it is possible to move to the next step and match a resumé with any given job post. Before looking into the technical details, it is important to realize that both records typically lack a uniform structure. CVs follow a variety of templates and they may or may not be complete; similarly, job posts vary quite significantly in terms of length and structure. Despite the template defined by the service provider (Glassdoor, in this case), even the most basic parameters (i.e. company name, location, rating) are often inconsistent.
Under these conditions, I chose to treat both resumé and job posts as unstructured text, filtered by a set of common keywords and then compared according to their similarity. Among the various measures of similarity available in the literature, I will be using the so-called "Jaccard similarity", defined as the ratio between the size of the intersection and the size of the union of two "bags of words". This metric is meant to reward a good match of keywords while penalizing those CVs (or job posts) containing a very broad description. In these cases, we expect the union of the two "bags of words" to grow more than the intersection, thereby decreasing the value of the similarity.
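
As a minimal sketch of this measure (not the code used in the project), the snippet below treats each document as a set of keywords and computes J(A, B) = |A ∩ B| / |A ∪ B|. The function name jaccard_similarity and the example keyword sets are assumptions made for illustration only.

```python
def jaccard_similarity(words_a, words_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of keywords."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        # Two empty bags have nothing to compare; define the score as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Illustrative keyword sets (assumed, not taken from the dataset).
cv = {"python", "r", "sql"}
focused_post = {"python", "sql", "spark"}
broad_post = {"python", "sql", "spark", "java", "excel", "sas", "tableau"}

focused_score = jaccard_similarity(cv, focused_post)  # 2 shared / 4 in the union
broad_score = jaccard_similarity(cv, broad_post)      # 2 shared / 8 in the union
```

Both posts share the same two keywords with the CV, but the broader post scores lower because its longer keyword list inflates the union.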

As mentioned in the first paragraph, the job search is deployed as a Shiny app (see the screenshot below), receiving a resumé as input, processing the text and producing a list of recommendations, each with the corresponding link to the website. The user can filter the list by location, company or company rating and select the preferred link to the job post. The reliability of the system is closely related to the number of keywords (currently over 50) used for the match.


The prototype application presented in this post returns relevant recommendations for most of the resumés tested among bootcamp participants at NYC Data Science Academy. Its limitations stem mostly from the number of jobs scraped and from the limited bag of words chosen as a filter. In order to become a valuable tool for job search, the app will need to include an "update" function that allows the user to rebuild the database with the latest job posts published within 2-3 weeks. Furthermore, the keywords currently hard coded in the search functions (see below) should evolve into a dynamic list of skills obtained from the same database. This will allow me to improve the similarity measure and to reflect the evolution of the labor market.


Appendix: the code at a glance

Developed in Python. Deployed using R Shiny.
View Github: Github

Packages used:

  • Python
    • selenium
    • nltk
    • pandas
    • numpy
    • re
    • collections
    • pickle
    • csv
    • time
    • wordcloud
  • R
    • shiny/shinydashboard
    • rPython
    • dplyr
    • tidyr
    • plotly
    • pdftools

Part 1: Scraping the website

Although Glassdoor provides an API to retrieve information on job posts, the project requires manual web scraping. For this task, I chose Python Selenium, which allows a script to browse a website through an automated Chrome session. Despite the relatively simple structure of the page, the script requires a few tricks, as Glassdoor tends to throw a number of pop-ups and CAPTCHA messages intended to limit, if not prevent, the presence of bots. The scraping is performed in two stages: first recording a brief description of each post and the corresponding link, then browsing through the list of links and retrieving the full post description. Both operations are performed by the same function, glassdoorScrape(), setting the parameter get_short = True for the short description and get_short = False for the complete post. The function is reported below.

Part 2: Parsing and checking data

Once the scraping is complete, the dataset needs to be processed in order to obtain a list of relevant features for each post. The processing is done by several functions, displayed below in the script. The script also contains a few functions used for scraping, calculating the Jaccard similarity, checking data consistency and I/O. For data cleaning and building the keyword dictionary, I found the work presented by Jesse Steinweg particularly useful.
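
To illustrate the kind of feature extraction described here, the following is a rough sketch (not the project's script) that reduces free text to the subset of known keywords it contains. The keyword set, the token pattern and the name extract_keywords are all assumptions for the example; the project uses its own dictionary of more than 50 terms.

```python
import re

# Hypothetical keyword dictionary, kept short for the example.
SKILL_KEYWORDS = {"python", "r", "sql", "java", "c++", "sas", "excel", "spark"}

# Tokens may contain letters, digits, '+' and '#', so "c++" survives intact.
TOKEN_PATTERN = re.compile(r"[a-z0-9+#]+")


def extract_keywords(text, keywords=SKILL_KEYWORDS):
    """Return the set of known keywords that occur in a block of free text."""
    tokens = set(TOKEN_PATTERN.findall(text.lower()))
    return tokens & set(keywords)


post = "We need strong Python and SQL skills; R or C++ is a plus. Excellent communication."
found = extract_keywords(post)
```

Matching on whole tokens rather than substrings avoids false hits, such as "excel" inside "Excellent", and lets single-letter names like "R" be detected safely.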

Part 3: Getting the analytics

In the third part, the function get_analytics() is used to produce the data frames used for visualization. Depending on the number of job postings found for a given category, the function returns the top 10 locations, top employers, most frequently requested skills, education and languages (besides English). The data frames can be plotted as a word cloud by calling the getWordCloud() function.
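
A "top 10" table of this kind boils down to a frequency count. The sketch below shows one way to do it with collections.Counter; it is not get_analytics() itself, and the helper name top_n and the sample records are assumptions for illustration.

```python
from collections import Counter


def top_n(values, n=10):
    """Return the n most frequent values as (value, count) pairs."""
    return Counter(values).most_common(n)


# Illustrative records (assumed); each post has one location and several skills.
posts = [
    {"location": "San Francisco, CA", "skills": ["python", "sql"]},
    {"location": "New York, NY", "skills": ["python", "r"]},
    {"location": "San Francisco, CA", "skills": ["python", "sas"]},
]

top_locations = top_n(post["location"] for post in posts)
top_skills = top_n((skill for post in posts for skill in post["skills"]), n=1)
```

The resulting (value, count) pairs can be loaded into a data frame or passed as frequencies to a plotting routine.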

Part 4: Building the Shiny App

In the final part of the project, a simple Shiny app is built to match the user's resumé against the database. The interface between R and Python is managed by the rPython package, which allows one to load the script in R. The Python function get_bestMatch(myCV) accepts the input text, which is uploaded in PDF format and converted into a single string by means of pdftools. As a result, the function produces a csv file containing the recommendations, which is read by R and displayed dynamically by Shiny.
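
The Python side of this hand-off can be pictured roughly as follows. This is a hypothetical sketch, not get_bestMatch(): the function names, the column names and the sample posts are assumptions, and an in-memory buffer stands in for the csv file on disk.

```python
import csv
import io


def rank_posts(cv_keywords, posts, top=5):
    """Sort posts by Jaccard similarity to the resumé keywords, best first."""
    cv_set = set(cv_keywords)
    scored = []
    for post in posts:
        post_set = set(post["keywords"])
        union = cv_set | post_set
        score = len(cv_set & post_set) / len(union) if union else 0.0
        scored.append(
            {"title": post["title"], "link": post["link"], "score": round(score, 3)}
        )
    scored.sort(key=lambda row: row["score"], reverse=True)
    return scored[:top]


def write_recommendations(rows, handle):
    """Write ranked rows as CSV so that another process (e.g. R) can read them."""
    writer = csv.DictWriter(handle, fieldnames=["title", "link", "score"])
    writer.writeheader()
    writer.writerows(rows)


# Illustrative data (assumed); the links are placeholders.
posts = [
    {"title": "Data Analyst", "link": "https://example.com/1", "keywords": ["excel", "sql"]},
    {"title": "Data Scientist", "link": "https://example.com/2", "keywords": ["python", "r", "sql"]},
]
ranked = rank_posts(["python", "r", "sql", "spark"], posts)

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_recommendations(ranked, buffer)
```

Exchanging a flat csv file keeps the interface between the two languages simple: R only needs to read a table, with no knowledge of how the scores were computed.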

Contributed by Diego De Lazzari. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which took place from July 5th to September 23rd, 2016. This post is based on his third class project, Python Web Scraping (due in the 6th week of the program). The R code can be found on GitHub.

About Author

Diego De Lazzari

Researcher, developer and data scientist. Diego De Lazzari is an applied physicist with a rather diverse background. He spent 8 years in applied research, developing computational models in the field of Plasma Physics (Nuclear Fusion) and Geophysics. As...