Project 3: Web Scraping company data from Indeed.com and Dice.com
Contributed by Sung Pil Moon. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp, which ran from January 11th to April 1st, 2016. This post is based on his third class project, web scraping (due in the 4th week of the program).
Overview
Overall goal:
Provide overall insight into companies based on their current job openings, through an embedded interactive Shiny app.
Specific goals:
- Extracting job and company data using web scraping techniques from Indeed.com and Dice.com
- Exploring interactive K-means cluster analyses with the extracted data
- Sentiment analysis of reviews based on the extracted company data
- Identifying the most frequent keywords in job summaries
Packages (libraries) used:
- Python (for web scraping and data manipulation): BeautifulSoup, pandas, re
- R (for data manipulation and visualization): Shiny Dashboard, DT, ggplot2, dplyr, SnowballC, wordcloud, fmsb, kernlab
1. Extracting job and company data from Indeed.com
For data extraction (web scraping) and manipulation, I used Python with the BeautifulSoup library. Each step is described in detail below.
1.1. Get a full list of jobs from Indeed.com
The first step was to get a list of jobs from indeed.com before getting the detailed information for each company on the list. The code snippet below shows the preparation step for web scraping.
Then the script visits every results page that lists recent jobs and stores the detailed job information in a data frame.
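As a rough illustration of this step, here is a minimal sketch of parsing job cards out of results pages with BeautifulSoup. The markup, the class names (`result`, `jobtitle`, `company`, `location`, `date`) and the helper names `parse_jobs` and `scrape_all` are assumptions made up for the example; the real site's HTML differs and changes over time. The sketch parses an inline HTML string rather than fetching pages over the network.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one results page; the real site's
# tags and class names are different.
SAMPLE_PAGE = """
<div class="result">
  <a class="jobtitle" href="/job/1">Data Scientist</a>
  <span class="company"><a href="/cmp/Acme">Acme</a></span>
  <span class="location">New York, NY</span>
  <span class="date">3 days ago</span>
</div>
<div class="result">
  <a class="jobtitle" href="/job/2">Data Analyst</a>
  <span class="company">Globex</span>
  <span class="location">Boston, MA</span>
  <span class="date">1 day ago</span>
</div>
"""


def parse_jobs(html):
    """Return one dict per job card found in a results page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="result"):
        title = card.find("a", class_="jobtitle")
        company = card.find("span", class_="company")
        # Not every company name links to a company page.
        company_link = company.find("a")
        rows.append({
            "job_title": title.get_text(strip=True),
            "job_link": title["href"],
            "company": company.get_text(strip=True),
            "company_link": company_link["href"] if company_link else None,
            "location": card.find("span", class_="location").get_text(strip=True),
            "date_posted": card.find("span", class_="date").get_text(strip=True),
        })
    return rows


def scrape_all(pages):
    """Collect rows from every page (each page is an HTML string here)."""
    jobs = []
    for html in pages:
        jobs.extend(parse_jobs(html))
    return jobs


jobs = scrape_all([SAMPLE_PAGE])
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(jobs)` to obtain a data frame with one row per job.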
1.2. Get detailed company information from the job link
Detailed company information can be obtained through a link in the data table from the previous step. The code snippet below covers this step: following the link to the detailed company page if one exists, cleaning keywords with Python regular expressions (the re library), and storing six types of company ratings in the data frame.
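The cleaning and rating steps could look roughly like the sketch below. The function names, the field names in `RATING_FIELDS`, and the raw input strings are hypothetical; the six fields simply mirror the ratings listed in this post (overall, work-life balance, culture, compensation / benefit, job security, management).

```python
import re

# Assumed field names for the six ratings described in this post.
RATING_FIELDS = [
    "overall",
    "work_life_balance",
    "compensation_benefits",
    "job_security",
    "management",
    "culture",
]


def clean_keyword(text):
    """Replace anything other than letters, digits, spaces, &, / and -
    with a space, then collapse repeated whitespace."""
    text = re.sub(r"[^A-Za-z0-9&/\- ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def parse_ratings(raw_values):
    """Map raw rating strings to floats; a missing rating becomes None."""
    ratings = {}
    for field, raw in zip(RATING_FIELDS, raw_values):
        match = re.search(r"\d+(?:\.\d+)?", raw or "")
        ratings[field] = float(match.group()) if match else None
    return ratings


name = clean_keyword("  Acme,\n Inc.\t(NY) ")
ratings = parse_ratings(["4.2 stars", "3.5", "", None, "4", "n/a"])
```

Returning `None` for a missing rating, rather than 0, keeps companies without reviews from dragging averages down in later analysis.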
The resulting data frame looks like this:
This is the command to save the data frame in CSV format.
1.3. Exploring the extracted data in Shiny dashboard application
As a result of the previous steps, I obtained a data frame containing information on 900 jobs related to data scientist positions. To get an overall view of the data, I created a Shiny dashboard app. Each row contains the following information:
- Job-related: Job title, Job location, Date job posted, A link to the job description page
- Company-related: Company name, A link to company description page in indeed.com, Overall rating of a company, Work-life balance rating, Company culture rating, Benefit / compensation rating, Job security rating, Management / CEO rating.
2. Exploring interactive K-means cluster analyses with the extracted data
Interactive K-means cluster analysis is implemented and embedded in the Shiny application to illustrate the basic concepts of K-means clustering and unsupervised learning. Although the job and company data is not a perfect fit for K-means cluster analysis, I used it as the data source because the aim is not to predict anything but to identify what patterns exist in the data set. (Note that a cluster in this context means a group of companies whose ratings fall in the same range that a user filtered. For example, a cluster could be a list of companies with a work-life balance rating of 4.5 and a culture rating of 4.0.)
Basic questions that can be asked of the extracted data set include:
- What are the approximate sizes of subgroups (= clusters) in the data?
- What are the commonalities of companies in the same subgroup?
2.1. Scatter plot for K-means clusters
The first component is a scatter plot showing the data divided into k clusters. The k value and the 6 variables used for the analysis can be selected interactively by the user (the default value of k is 2).
* Note that each cluster is visually distinguished by a distinct color. However, the colors carry no statistical or analytical meaning.
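To make the clustering step concrete, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm) on two rating variables. It is not the code behind the Shiny app, which is written in R; the toy ratings and the choice to start the centroids at the first k points (for reproducibility) are assumptions of the example.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Centroids start at the first k points so that
    the result is reproducible (real implementations usually randomize)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy (work-life balance, culture) ratings for six hypothetical companies.
ratings = [(4.5, 4.0), (2.0, 2.5), (4.4, 4.2), (2.2, 2.4), (4.6, 3.9), (1.8, 2.6)]
labels, centroids = kmeans(ratings, k=2)
```

With k = 2 the highly rated and poorly rated companies end up in separate clusters, which is the kind of grouping the scatter plot colors show.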
2.2. Scree Plot and WCV table
As the number of clusters increases, the overall WCV (Within Cluster Variance) decreases continuously. However, a suitable k value should be chosen well before every data point becomes its own centroid. Therefore, an interactive scree plot is integrated to provide a way to visually inspect an appropriate k value.
When a user changes the k value in the options component, the dotted lines in the scree plot are updated. The WCV table, located to the right of the scree plot, displays the detailed WCV value for each number of clusters.
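The behavior of WCV can be checked on a tiny example. The sketch below computes WCV as the total within-cluster sum of squared distances to each cluster's centroid, which is an assumption about how the app calculates it. The points and the hand-written label lists for k = 1, 2 and 4 are made up for illustration.

```python
def within_cluster_variance(points, labels):
    """Total squared distance from each point to its cluster's centroid."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return total


points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]

# Hand-assigned labels for k = 1, 2 and 4 (one cluster per point).
wcv_by_k = {
    1: within_cluster_variance(points, [0, 0, 0, 0]),
    2: within_cluster_variance(points, [0, 0, 1, 1]),
    4: within_cluster_variance(points, [0, 1, 2, 3]),
}
```

WCV falls sharply from k = 1 to k = 2 and reaches zero once every point is its own cluster, so the useful k is where the curve bends, not where it bottoms out.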
2.3. Aggregated values of K-cluster and radarChart
The value boxes show the aggregated values for the cluster the user has chosen: overall rating, work-life balance rating, culture rating, compensation / benefit rating, job security rating, and management rating. The radar chart is embedded to provide an effective way to compare multivariate data between the chosen cluster and the overall data set.
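The numbers behind such a comparison are per-variable means for the chosen cluster and for the whole data set. A minimal sketch follows, assuming the mean is the aggregate used; the company rows, field names and function names are hypothetical, and only three of the six ratings are included to keep it short.

```python
from statistics import mean

FIELDS = ["overall", "work_life_balance", "culture"]

# Hypothetical rows: each company carries its cluster label and ratings.
companies = [
    {"cluster": 0, "overall": 4.0, "work_life_balance": 4.5, "culture": 4.0},
    {"cluster": 0, "overall": 4.2, "work_life_balance": 4.1, "culture": 4.4},
    {"cluster": 1, "overall": 2.8, "work_life_balance": 3.0, "culture": 2.6},
]


def mean_ratings(rows, fields=FIELDS):
    """Mean of each rating field, rounded to two decimals."""
    return {f: round(mean(r[f] for r in rows), 2) for f in fields}


def cluster_vs_overall(rows, cluster_id):
    """Aggregates for one cluster next to the aggregates for all rows."""
    selected = [r for r in rows if r["cluster"] == cluster_id]
    return {"cluster": mean_ratings(selected), "overall": mean_ratings(rows)}


summary = cluster_vs_overall(companies, cluster_id=0)
```

Each of the two dictionaries in `summary` corresponds to one polygon on a radar chart, with one axis per rating.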
2.4. Data Table of K Cluster
The data table shows the rows belonging to the cluster the user selected. The list is updated whenever the user changes the options.
3. Sentiment analysis of reviews based on the extracted company data
3.1. Frequency Table of Positive & Negative Review Words
The table below shows a list of company reviews extracted from indeed.com, including the company name, overall rating, review summary, and the pros and cons of each company.
3.2. Word cloud of Positive & Negative Review Words
A word cloud component highlights the most frequently used words in a simple, clear, and visually attractive way.
Two word clouds are embedded in the application:
- A word cloud of positive review words
- A word cloud of negative review words.
The R packages tm, SnowballC, and wordcloud are used for text mining to identify the most frequently used review words.
An interesting finding is the size difference between the two word clouds (the positive word cloud is smaller and the negative one is larger). This is in line with research findings that most people process positive and negative information differently and criticize more constructively when they are experiencing negative emotions (Related article: Tugend, A. (2012, March 23). Praise is fleeting, but brickbats we recall. The New York Times. Retrieved from https://www.nytimes.com/2012/03/24/your-money/why-people-remember-negative-events-more-than-positive-ones.html?_r=0).
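The app does this counting in R. Purely as an illustration of the idea, the Python sketch below counts word frequencies separately for the pros and cons text using only the standard library. The stop-word list and the review strings are made up, and stemming (the job SnowballC does in R) is left out.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "to", "of", "but",
              "in", "for", "with", "very", "was"}


def word_frequencies(texts):
    """Count lowercase words across texts, skipping stop words and
    words shorter than three letters."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts


# Hypothetical review snippets.
pros = ["Great benefits and flexible hours", "Flexible schedule, great people"]
cons = ["Long hours and poor management", "Management is poor, pay is low"]

positive_counts = word_frequencies(pros)
negative_counts = word_frequencies(cons)
```

The resulting counts are what a word cloud maps to font size: the higher the count, the larger the word.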
4. Identifying the most frequent keywords in job summaries
Extending the basic text mining technique used in the word cloud component above, I also tried to find which keywords are used most frequently in job descriptions. I thought it would benefit job seekers to include those keywords to attract recruiters' attention.
The code snippet shows how to pull a list of keywords from the job summaries on indeed.com.
The most frequently used keywords were extracted from a total of 819 companies and are displayed in the table and word cloud below.
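One possible way to rank keywords is to count how many job summaries mention each term from a candidate list. The sketch below takes that approach, which may differ from the method actually used for the table; the candidate keywords and the sample summaries are invented for the example.

```python
import re
from collections import Counter

# Hypothetical candidate keywords; multi-word terms are allowed.
CANDIDATE_KEYWORDS = ["python", "r", "sql", "machine learning", "hadoop"]


def keyword_document_frequency(summaries, keywords):
    """Count the number of summaries that mention each keyword at least
    once, matching whole terms only (so 'r' does not match 'required')."""
    counts = Counter()
    for summary in summaries:
        text = summary.lower()
        for kw in keywords:
            if re.search(r"(?<!\w)" + re.escape(kw) + r"(?!\w)", text):
                counts[kw] += 1
    return counts


summaries = [
    "Experience with Python, SQL and machine learning required.",
    "Strong R or Python skills; Hadoop a plus.",
    "SQL reporting and dashboard development.",
]

keyword_counts = keyword_document_frequency(summaries, CANDIDATE_KEYWORDS)
```

Counting each keyword once per summary keeps a single long posting that repeats a term many times from dominating the ranking.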
5. Conclusion
In this project, the application tested a web scraping technique using Python with the BeautifulSoup library to extract job- and company-related data, and used several word cloud components to find hidden insights. The application also provided an interactive K-means cluster analysis to better understand the concept of unsupervised learning based on the extracted data.
As next steps, I plan the following:
- Embedding another unsupervised learning method, such as hierarchical clustering analysis, with a different data set that is a better fit for the analysis
- More refined text-mining techniques for more accurate keyword and sentiment analyses
Please email me at monspo1@gmail.com if you have any questions or suggestions, or if you find any technical errors in the application.