Landing my Dream Job by Scraping Glassdoor.com
In this post I am going to use web scraping to build a simple recommendation system for data scientists looking for new employment. The idea is to use the resources published on Glassdoor.com to create a dataset covering the US job market for data scientists and data analysts. A web application developed in R Shiny allows the user to upload a resumé and receive a list of recommended job posts based on the candidate's skills, education and work experience.
Introduction
Data scientists are tech savvy. We do not like manually browsing through job postings. We would rather our profile and skills be matched with the best available positions. Aren't there dozens of companies providing similar services? Sure, but you have to sign into multiple websites, share your personal information, set an alert, check the company reputation, filter out useless recommendations ... way too labor intensive!
Data scientists are also creative. While it may be difficult and time-consuming to match a resumé with multiple sources, it is possible to reverse the approach and create a single job posting database that combines the results of multiple queries for a specific position. The database is then matched to a resumé and the best recommendations are provided on the basis of a given set of skills.
About the dataset
For this project, I will limit the scope to Glassdoor.com, a California-based company founded in 2007 that provides a database of millions of company reviews, salary reports and interview reviews. About 1,300 job posts for data scientist/data analyst positions published in the US within the last month were collected and analyzed. The resulting database shows the expected concentration of job vacancies in the well-known hubs for data science, San Francisco and New York.
Let's now have a look at the candidate profile: as shown in figures 1-3, the ideal "Data Scientist" holds a Master's or PhD degree and is proficient in a number of tools for scripting, database management, visualization and parallel computing. While Python and R are by far the preferred scripting languages, the more "traditional" programming languages such as Java and C++ remain popular. The historical competition between Matlab and Mathematica seems to have been decisively won by the former, although its popularity is well below R's. As for data manipulation and modeling tools, SAS appears to be the most frequently used software, followed by Excel and SPSS.
Once the keywords are defined for the prospective data scientist, in terms of skills, education and experience, it is possible to move to the next step and match a resumé with any given job post. Before looking into the technical details, it is important to realize that both records typically lack a uniform structure. CVs follow a variety of templates and may or may not be complete; similarly, job posts vary quite significantly in terms of length and structure. Despite the template defined by the service provider (Glassdoor, in this case), even the most basic parameters (e.g. company name, location, rating) are often inconsistent.
Under these conditions, I chose to treat both the resumé and the job posts as unstructured text, filtered by a set of common keywords and then compared according to their similarity. Among the various measures of similarity available in the literature, I will be using the so-called "Jaccard similarity", defined as the ratio between the size of the intersection and the size of the union of two "bags of words". This metric is meant to reward a good match of keywords while penalizing those CVs (or job posts) containing a very broad description: in such cases, we expect the union of the two "bags of words" to grow faster than the intersection, therefore decreasing the similarity score.
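As a quick illustration of the metric, here is a minimal sketch; the function name and the two keyword sets are hypothetical examples, not taken from helperP3.py.

```python
def jaccard_similarity(cv_keywords, post_keywords):
    """Jaccard similarity between two bags of words, treated here as sets."""
    cv_set, post_set = set(cv_keywords), set(post_keywords)
    union = cv_set | post_set
    if not union:
        return 0.0
    return len(cv_set & post_set) / len(union)

# Example: a very broad job post grows the union faster than the intersection,
# which lowers the score.
cv = {"python", "r", "sql", "machine learning"}
post = {"python", "sql", "hadoop", "spark", "java", "excel"}
print(jaccard_similarity(cv, post))  # 2 shared keywords / 8 total = 0.25
```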
As mentioned in the first paragraph, the job search is deployed as a Shiny app (see the screenshot below), which receives a resumé as input, processes the text and produces a list of recommendations with the corresponding links to the website. The user can filter the list by location, company or company rating and select the preferred link to the job post. The reliability of the system is closely related to the number of keywords (currently over 50) used for the match.
The prototype application presented in this post produces relevant recommendations for most resumés tested among bootcamp participants at NYC Data Science Academy. Its limitations lie mostly in the number of jobs scraped and in the limited bag of words chosen as a filter. In order to become a valuable tool for job search, the app will need an "update" function that allows the user to rebuild the database with the latest job posts published within the last 2-3 weeks. Furthermore, the keywords currently hard coded in the search functions (see helperP3.py below) should evolve into a dynamic list of skills obtained from the same database. This will allow me to improve the similarity measure and to reflect the evolution of the labor market.
Appendix: the code at a glance
Developed in Python. Deployed using R Shiny.
View the code on GitHub.
Packages used:
- Python
  - selenium
  - nltk
  - pandas
  - numpy
  - re
  - collections
  - pickle
  - csv
  - time
  - wordcloud
- R
  - shiny/shinydashboard
  - rPython
  - dplyr
  - tidyr
  - plotly
  - pdftools
Part 1: Scraping the website
Although Glassdoor.com provides an API to retrieve information on job posts, the project requires manual web scraping. For this task, I chose Python Selenium, which allows one to browse through a website mimicking the behavior of Chrome. Despite the relatively simple structure of the page, the script requires a few tricks, as Glassdoor tends to throw a number of pop-up and CAPTCHA messages intended to limit, if not prevent, the presence of bots. The scraping is performed in two stages: first recording a brief description of each post and the corresponding link, then browsing through the list of links and retrieving the full post description. Both operations are handled by the same function, glassdoorScrape(), by setting the parameter get_short = True for the short description and get_short = False for the complete post. The function is reported below.
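The sketch below shows the gist of this two-stage approach with Selenium; the CSS selectors, waits and function signature are illustrative assumptions rather than the original glassdoorScrape() implementation.

```python
# A minimal two-stage Selenium scraper, assuming a Chrome driver and
# hypothetical CSS selectors (not the actual Glassdoor page structure).
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def glassdoor_scrape(search_url, get_short=True, links=None):
    driver = webdriver.Chrome()
    results = []
    try:
        if get_short:
            # Stage 1: collect short descriptions and links from the result list.
            driver.get(search_url)
            time.sleep(3)  # give pop-ups/CAPTCHAs a chance to appear
            for card in driver.find_elements(By.CSS_SELECTOR, "li.jobListing"):
                link_el = card.find_element(By.CSS_SELECTOR, "a")
                results.append({"title": link_el.text,
                                "link": link_el.get_attribute("href")})
        else:
            # Stage 2: visit each stored link and grab the full job description.
            for link in links or []:
                driver.get(link)
                time.sleep(2)
                desc = driver.find_element(
                    By.CSS_SELECTOR, "div.jobDescriptionContent").text
                results.append({"link": link, "description": desc})
    finally:
        driver.quit()
    return results
```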
Part 2: Parsing and checking data
Once the scraping is complete, the dataset needs to be processed in order to obtain a list of relevant features for each post. The processing is done by several functions, displayed below in the helperP3.py script. The script also contains a few functions used for scraping, calculating the Jaccard similarity, checking data consistency and handling I/O. For data cleaning and building the keyword dictionary, I found the work presented by Jesse Steinweg particularly useful.
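As an example of the kind of parsing involved, the sketch below reduces a block of free text to a bag of known keywords; the function name and the small keyword set are hypothetical stand-ins for the much larger dictionary hard coded in helperP3.py.

```python
# Illustrative keyword extraction from unstructured text (assumed approach,
# not the actual helperP3.py code). Requires nltk.download('stopwords').
import re
from nltk.corpus import stopwords

SKILL_KEYWORDS = {"python", "r", "sql", "spark", "hadoop", "sas", "tableau",
                  "excel", "java", "c++", "machine learning", "phd", "master"}

def extract_keywords(text, keywords=SKILL_KEYWORDS):
    """Return the subset of known keywords mentioned in a job post or resumé."""
    text = text.lower()
    # Keep '+' and '#' so that tokens like "c++" or "c#" survive cleaning.
    tokens = set(re.findall(r"[a-z+#]+", text)) - set(stopwords.words("english"))
    found = {kw for kw in keywords if " " in kw and kw in text}   # multi-word terms
    found |= {kw for kw in keywords if kw in tokens}              # single tokens
    return found
```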
Part 3: Getting the analytics
In the third part, the function get_analytics() produces the data frames used for visualization. Depending on the number of job postings found for a given category, the function returns the top 10 locations, top employers, and the most frequently requested skills, education levels and languages (besides English). The data frames can be plotted as a word cloud by calling the getWordCloud() function.
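A rough sketch of this kind of aggregation is shown below; the column names and function names are assumptions about the scraped data frame, not the actual get_analytics() / getWordCloud() code.

```python
# Hypothetical aggregation helpers: assumes a data frame with a "location"
# column and a "skills" column holding a list of keywords per job post.
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def top_locations(df, n=10):
    """Top n locations by number of job posts."""
    return df["location"].value_counts().head(n).reset_index(name="posts")

def plot_skill_cloud(df):
    """Word cloud of the most frequently requested skills."""
    counts = Counter(skill for skills in df["skills"] for skill in skills)
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate_from_frequencies(counts)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```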
Part 4: Building the Shiny App
In the final part of the project, a simple Shiny app is built to match the user's resumé against the database. The interface between R and Python is managed by the rPython package, which allows one to load the helperP3.py script in R. The resumé, uploaded in PDF format, is converted into a single string by means of pdftools and passed to the Python function get_bestMatch(myCV). The function produces a CSV file containing the recommendations, which is read back into R and displayed dynamically by Shiny.
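Putting the pieces together, a minimal sketch of the matching step exposed to R might look like the following; it reuses the hypothetical jaccard_similarity() and extract_keywords() sketches above, and the extra parameters, file names and columns are assumptions rather than the actual helperP3.py interface.

```python
# Hypothetical matching step: rank scraped job posts by Jaccard similarity
# with the resumé text and write the top results to a CSV for R to display.
import pandas as pd

def get_bestMatch(myCV, jobs_csv="job_posts.csv",
                  out_csv="recommendations.csv", top_n=20):
    cv_keywords = extract_keywords(myCV)          # sketch defined earlier
    jobs = pd.read_csv(jobs_csv)
    jobs["similarity"] = jobs["description"].apply(
        lambda d: jaccard_similarity(cv_keywords, extract_keywords(str(d))))
    best = jobs.sort_values("similarity", ascending=False).head(top_n)
    best.to_csv(out_csv, index=False)
    return out_csv
```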
Contributed by Diego De Lazzari. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp program that took place from July 5th to September 23rd, 2016. This post is based on his third class project - Python Web Scraping (due in the 6th week of the program). The R code can be found on GitHub.