Project 3: Web Scraping company data from Indeed.com and Dice.com

Posted on Mar 20, 2016

Contributed by Sung Pil Moon. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp, which ran from January 11th to April 1st, 2016. This post is based on his third class project, web scraping (due in the 4th week of the program).

Overview

Overall goal:

Provide an overall insight into companies based on their current job openings, using an embedded interactive Shiny app.

Specific goal:

Packages (libraries) used:

  • Python (for web scraping and data manipulation): BeautifulSoup, pandas, re
  • R (for data manipulation and visualization): Shiny Dashboard, DT, ggplot2, dplyr, SnowballC, wordcloud, fmsb, kernlab

1. Extracting job and company data from Indeed.com

For data extraction (web scraping) and manipulation, I used Python with the BeautifulSoup library. Each step is described in detail below.

1.1. Get a full list of jobs from Indeed.com

The first step was to get a list of jobs from Indeed.com, before getting the detailed information for each company on the list. The code snippet below shows the preparation step for web scraping.

Then, it accesses every page that lists recent jobs and stores the detailed job information in a data frame.
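As a rough illustration of this step, the sketch below parses a small inline HTML sample with BeautifulSoup and collects one record per job. The tag and class names (result, jobtitle, company, location, date) and the sample data are assumptions made for this example. They are not taken from the original code and may not match Indeed.com's actual markup, and no network request is made.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag and class names are assumptions for illustration,
# not Indeed.com's real page structure.
SAMPLE_HTML = """
<div class="result">
  <a class="jobtitle" href="/job/123">Data Scientist</a>
  <span class="company"><a href="/cmp/Acme">Acme Corp</a></span>
  <span class="location">New York, NY</span>
  <span class="date">3 days ago</span>
</div>
<div class="result">
  <a class="jobtitle" href="/job/456">Junior Data Scientist</a>
  <span class="company">Initech</span>
  <span class="location">Austin, TX</span>
  <span class="date">1 day ago</span>
</div>
"""


def parse_jobs(html):
    """Return one dict per job card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.find_all("div", class_="result"):
        title = card.find("a", class_="jobtitle")
        company = card.find("span", class_="company")
        # Not every company has a detail page, so the link may be missing.
        company_link = company.find("a")
        jobs.append({
            "job_title": title.get_text(strip=True),
            "job_link": title["href"],
            "company": company.get_text(strip=True),
            "company_link": company_link["href"] if company_link else None,
            "location": card.find("span", class_="location").get_text(strip=True),
            "date": card.find("span", class_="date").get_text(strip=True),
        })
    return jobs


jobs = parse_jobs(SAMPLE_HTML)
for job in jobs:
    print(job)
```

The resulting list of dictionaries could then be passed to pandas.DataFrame(jobs) to build the data frame.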

This is one example of the returned result:
codeSnippet_03_ResultOfQuery

1.2. Get detailed company information from the job link

Detailed company information can be obtained through a link in the data table from the previous step. The code snippet below describes this step: accessing the detailed company page from a link if one exists, cleaning keywords with Python regular expressions (the re library), and storing six types of company ratings in the data frame.

The returned result within the data frame looks like this:
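The cleaning part of this step might look roughly like the sketch below, which uses only the re library. The helper names (clean_text, parse_rating), the rating field names, and the raw strings are all hypothetical and chosen for illustration; the original code may have cleaned the data differently.

```python
import re

# Hypothetical names for the six ratings described in this post.
RATING_NAMES = [
    "overall", "work_life_balance", "culture",
    "compensation_benefits", "job_security", "management",
]


def clean_text(raw):
    """Remove HTML tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def parse_rating(raw):
    """Return the first number in the string as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


# Made-up raw strings standing in for scraped page fragments.
raw_ratings = ["4.2 stars", " 3.8\n", "4", "N/A", "3.5 / 5", "3.0"]
ratings = {name: parse_rating(raw) for name, raw in zip(RATING_NAMES, raw_ratings)}

print(clean_text("<b>Acme</b>\n  Corp "))
print(ratings)
```

Returning None for a missing rating keeps companies without a detail page in the data frame instead of dropping them.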

returned_data_company_detail

This is the command to save the data frame in CSV format.
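With pandas, saving a data frame is typically a single to_csv call. The sketch below is a minimal example; the column names, the values, and the file name indeed_jobs.csv are placeholders, not the ones used in the project.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the scraped data frame.
df = pd.DataFrame({
    "company": ["Acme Corp", "Initech"],
    "overall_rating": [4.2, 3.8],
})

# Hypothetical output path; index=False leaves out the row index column.
out_path = os.path.join(tempfile.gettempdir(), "indeed_jobs.csv")
df.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm that it round-trips.
restored = pd.read_csv(out_path)
print(restored)
```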

1.3. Exploring the extracted data in Shiny dashboard application

As a result of the previous steps, I obtained a data frame containing information on 900 jobs related to the data scientist position. To get an overall insight into the data, I created a Shiny dashboard app. Each row has the following information:

  • Job-related: Job title, Job location, Date job posted, A link to the job description page
  • Company-related: Company name, A link to company description page in indeed.com, Overall rating of a company, Work-life balance rating, Company culture rating, Benefit / compensation rating, Job security rating, Management / CEO rating.

1_indeedAppDataTable

2. Exploring interactive K-means cluster analyses with the extracted data

Interactive K-means cluster analysis is implemented and embedded in the Shiny application to convey the basic concept of how K-means cluster analysis and unsupervised learning work. Although the job and company data are not a perfect fit for K-means cluster analysis, they were used as the data source because the aim is not to predict anything but to identify what patterns exist in the data set. (Note that a cluster in this context means a group of companies whose ratings fall in the same range that a user filtered. For example, a cluster could be a list of companies that have a work-life balance rating of 4.5 and a culture rating of 4.0.)

Possible basic questions about the extracted data set include:

  • What are the approximate sizes of subgroups (= clusters) in the data?
  • What are the commonalities of companies in the same subgroup?

2.1. Scatter plot for K-means clusters

The first component is a scatter plot showing the data divided into k clusters. The user can interactively select the k value and the 6 variables used for the analysis (the default value of k is 2).

* Note that each cluster is visually distinguished by a distinct color. However, the colors themselves carry no statistical or analytical meaning.

2_ClusterAnalysis01

2.2. Scree Plot and WCV table

As the number of clusters increases, the overall WCV (within-cluster variance) decreases continuously. However, a proper k value should be chosen before every data point becomes its own centroid. Therefore, an interactive scree plot is integrated to provide a way to visually inspect an appropriate k value.
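To make the relationship between k and WCV concrete, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm) that computes the values a scree plot would show. It assumes WCV is measured as the total within-cluster sum of squared distances, which may differ from the exact quantity used in the app. The rating pairs are made up, and the app itself runs its clustering in R rather than with this code.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Basic Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def total_wcv(points, labels, centroids):
    """Total within-cluster sum of squared distances."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Made-up (work-life balance, culture) rating pairs.
points = [
    (4.5, 4.0), (4.4, 4.2), (4.6, 4.1), (4.3, 3.9),
    (2.5, 2.8), (2.7, 2.6), (2.4, 3.0), (2.6, 2.9),
]

wcv = {}
for k in range(1, len(points) + 1):
    labels, centroids = kmeans(points, k)
    wcv[k] = total_wcv(points, labels, centroids)

for k in sorted(wcv):
    print(k, round(wcv[k], 3))
```

When k equals the number of points, every point is its own centroid and the WCV is zero, which is why the scree plot is inspected for the point where the curve flattens rather than for its minimum.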

2_ClusterAnalysis02

When a user changes the k value in the options component, the dotted lines in the scree plot are updated. The table of within-cluster variance, located to the right of the scree plot, displays the detailed WCV value for each number of clusters.

2.3. Aggregated values of the selected cluster and radar chart

The value boxes show the aggregated values of the cluster the user chose, including the overall rating, work-life balance rating, culture rating, compensation / benefit rating, job security rating, and management rating. The radar chart is embedded to provide an effective way to compare multivariate data between the chosen cluster and the overall data set.
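The comparison behind the value boxes and the radar chart can be sketched as two sets of averages: one over the selected cluster and one over all rows. The snippet below assumes the aggregate is a simple mean, which the post does not state explicitly; the rating names, cluster labels, and values are made up.

```python
from statistics import mean

# Hypothetical names for the six ratings.
RATINGS = ["overall", "work_life", "culture",
           "compensation", "job_security", "management"]

# Made-up rows: (cluster label, ratings in the order of RATINGS).
rows = [
    (0, (4.0, 4.5, 4.0, 3.5, 3.0, 3.5)),
    (0, (5.0, 4.5, 4.0, 4.5, 4.0, 4.5)),
    (1, (3.0, 2.5, 3.0, 3.5, 3.0, 2.5)),
    (1, (2.0, 2.5, 2.0, 2.5, 3.0, 2.5)),
]


def average_ratings(rows, cluster=None):
    """Mean of each rating for one cluster, or for all rows if cluster is None."""
    selected = [vals for label, vals in rows if cluster is None or label == cluster]
    return {name: mean(col) for name, col in zip(RATINGS, zip(*selected))}


cluster_avg = average_ratings(rows, cluster=0)
overall_avg = average_ratings(rows)

# These two dictionaries are the two shapes a radar chart would overlay.
for name in RATINGS:
    print(name, cluster_avg[name], overall_avg[name])
```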

2_ClusterAnalysis03

2.4. Data Table of K Cluster

The data table shows the list of data belonging to the cluster the user selected. The list is updated whenever the user changes the options.

2_ClusterAnalysis04

3. Sentiment analysis of reviews based on the extracted company data

Review contents of the companies were extracted from Indeed.com and analyzed. Since the primary focus of this project is not a deep analysis of company reviews, only the featured review for each company was chosen for analysis. (Also, no company is duplicated in the data.)

3.1. Frequency Table of Positive & Negative Review Words

The table below shows a list of company reviews extracted from Indeed.com, including the company name, overall rating, review summary, and pros and cons of each company.

3_PositiveNegativeTable

3.2. Word cloud of Positive & Negative Review Words

A word cloud component highlights the most frequently used words in a simple, clear and visually attractive way.

Two word clouds are embedded in the application:

  1. A word cloud of positive review words
  2. A word cloud of negative review words

The R packages tm, SnowballC, and wordcloud are used for text mining to identify the most frequently used review words.
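The counting that sits behind a word cloud can be illustrated in a few lines of Python. This is only a sketch of the general idea, not a port of the R code: it skips stemming, uses a tiny hand-written stop word list, and runs on made-up review fragments.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real analysis would use a fuller one.
STOP_WORDS = {"and", "the", "for", "with", "but", "are", "was", "very"}


def word_frequencies(texts):
    """Count lowercase words across texts, skipping stop words and short tokens."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return counts


# Made-up "pros" and "cons" review fragments.
pros = ["Great people and great benefits", "Flexible hours, great culture"]
cons = ["Long hours and poor management", "Poor communication"]

pros_counts = word_frequencies(pros)
cons_counts = word_frequencies(cons)

print(pros_counts.most_common(3))
print(cons_counts.most_common(3))
```

In a word cloud, each word's font size would be scaled by these counts.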

3_PositiveNegative1

An interesting finding is the size difference between the two word clouds (the positive word cloud is smaller and the negative word cloud is bigger). This is in line with many research findings that most people process positive and negative information differently and criticize more constructively when they experience negative emotions (related article: Tugend, A. (2012, March 23). Praise is fleeting, but brickbats we recall. The New York Times. Retrieved from https://www.nytimes.com/2012/03/24/your-money/why-people-remember-negative-events-more-than-positive-ones.html?_r=0).

4. Identifying the most frequent keywords in job summaries

Extending the basic text mining technique used in the previous word cloud component, I also tried to find which keywords are used most frequently in job descriptions. I thought it would be beneficial for job seekers to include those keywords to attract recruiters' attention.

The code snippet shows how to pull a list of keywords from the job summaries on Indeed.com.

A list of the most frequently used keywords was extracted from a total of 819 companies and is displayed in the table and word cloud below.

4_KeywordTable

4_KeywordCloud
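One possible way to rank keywords like those in the table above is to count how many job summaries mention each one at least once. The sketch below assumes a predefined keyword list and made-up summaries; the original code may have derived its keywords differently.

```python
import re
from collections import Counter

# Hypothetical keyword list chosen for illustration.
KEYWORDS = ["python", "r", "sql", "machine learning", "hadoop"]


def keyword_document_frequency(summaries, keywords):
    """Count the number of summaries that mention each keyword at least once."""
    counts = Counter()
    for summary in summaries:
        text = summary.lower()
        for kw in keywords:
            # The lookarounds keep "r" from matching inside words like "research".
            pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
            if re.search(pattern, text):
                counts[kw] += 1
    return counts


# Made-up job summaries.
summaries = [
    "Experience with Python, R and SQL required.",
    "Strong SQL skills; machine learning a plus.",
    "Research background preferred.",
]

keyword_counts = keyword_document_frequency(summaries, KEYWORDS)
print(keyword_counts.most_common())
```

Counting each summary once per keyword keeps a single posting that repeats a term from dominating the ranking.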

5. Conclusion

In this project, I tested a web scraping technique using Python with the BeautifulSoup library to extract job and company-related data, and used several word cloud components to find hidden insights. The application also provides an interactive K-means cluster analysis component to better understand the concept of unsupervised learning based on the extracted data.

As next steps, I plan the following:

  • Embedding another unsupervised learning method, such as hierarchical clustering analysis, with a different data set that better fits the analysis
  • More refined text-mining techniques for more accurate keyword/sentiment analyses

Please send me an email at [email protected] if you have any questions or suggestions, or if you find any technical errors in the application.




About Author

Sung Pil Moon

Sung Moon is a recent graduate from the Ph.D. program in Human-Computer Interaction, School of Informatics, Indiana University (Indianapolis, IN). Through several startup activities and various research projects collaborating with MITRE, a research corporation, he found opportunities to...