Use Data Science to Boost Your Job Hunting

Posted on Aug 22, 2016


Job hunting can be very hard. A serious job seeker needs to spend dozens, if not hundreds, of hours on job market research: finding out who is hiring, what skills employers are looking for, and which industry to work in. More importantly, once they have that information, job seekers also need to make sure that their resumes and profiles are well polished, so that recruiters can see how well they fulfill the job requirements.
As a data scientist, how could one use data analytics skills to accelerate this time-consuming job search and draw meaningful insights from market data? This article gives an example.


A well-designed pipeline is the foundation of a successful data science project. As shown in Figure 1, a working pipeline for this project includes extracting data from a reliable source, cleaning and transforming the data into a format suitable for later analysis, and applying different data science techniques to obtain insights from the data.


[Figure 1. Project Pipeline]

Data Collection

LinkedIn, as the most popular social network for professionals, provides a massive amount of highly reliable and up-to-date recruiting information through its job posting platform, which makes it a suitable data source for this analysis.
With the help of the Requests, BeautifulSoup, and Regex packages, a Python script was written to search pre-defined job search keywords on LinkedIn and collect job listing URLs from the search results. Four keywords were used during this step: "Business Analytics," "Business Intelligence," "Data Scientist," and "Machine Learning." These are terms that are commonly used in industry when people talk about "data science." In addition, all searches were limited to the Greater New York City Area.
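The link-collection step can be sketched roughly as follows. This is a minimal illustration, not the script used in the project: it parses a made-up search results page with the standard library's `html.parser` (a stand-in for BeautifulSoup), and the `/jobs/view/<id>` link pattern and the `example.com` URLs are assumptions.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for a job listing link; the real URL layout may differ.
JOB_LINK = re.compile(r"/jobs/view/(\d+)")


class JobLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like job listing links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep each matching link once, in the order it appears on the page.
        if JOB_LINK.search(href) and href not in self.links:
            self.links.append(href)


def extract_job_links(html):
    """Return the job listing URLs found in one search results page."""
    parser = JobLinkParser()
    parser.feed(html)
    return parser.links


# Made-up search results page, used only for illustration.
sample_html = """
<ul>
  <li><a href="https://www.example.com/jobs/view/1001">Data Scientist</a></li>
  <li><a href="https://www.example.com/jobs/view/1002">BI Analyst</a></li>
  <li><a href="https://www.example.com/about">About</a></li>
  <li><a href="https://www.example.com/jobs/view/1001">Data Scientist</a></li>
</ul>
"""

links = extract_job_links(sample_html)
print(links)
```

In a real run, the HTML would come from an HTTP response for each keyword search rather than from a string literal.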
Next, another script opened all the job links, saved the HTML pages, and then parsed the posting details from the collected HTML files. This data collection process continued for two weeks and collected over 2,500 job postings. (The search results of different keywords might contain duplicate job postings, so the jobs in this dataset were not necessarily unique.)
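The dataset in this study was not deduplicated, but if unique counts were wanted, postings found under several keywords could be merged by job ID. The sketch below assumes the ID can be read from a hypothetical `/jobs/view/<id>` URL pattern; the records are made up.

```python
import re

# Hypothetical URL layout; the real job ID location may differ.
JOB_ID = re.compile(r"/jobs/view/(\d+)")


def unique_postings(records):
    """Merge (keyword, url) records by job ID.

    Keeps the first URL seen for each ID and remembers every keyword
    whose search returned that posting.
    """
    seen = {}
    for keyword, url in records:
        match = JOB_ID.search(url)
        if match is None:
            continue  # skip URLs that do not look like job postings
        entry = seen.setdefault(match.group(1), {"url": url, "keywords": []})
        if keyword not in entry["keywords"]:
            entry["keywords"].append(keyword)
    return seen


# Made-up records for illustration.
records = [
    ("Data Scientist", "https://www.example.com/jobs/view/1001"),
    ("Machine Learning", "https://www.example.com/jobs/view/1001?ref=search"),
    ("Business Analytics", "https://www.example.com/jobs/view/1002"),
]

postings = unique_postings(records)
print(len(records), "collected,", len(postings), "unique")
```

Keeping the list of keywords per posting preserves the per-keyword view used in the charts while still allowing a unique count.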

Visualization & Quick Insights

1. Results by Keywords

Figure 2 shows the number of jobs found with each keyword. "Business Analytics" returned the largest number of job listings, which is understandable because it is the least clearly defined term and has been overused by many companies. On the other hand, "Machine Learning" returned fewer than 300 jobs, less than half of the number returned by the other keywords. This might imply that there are not as many job opportunities that specifically require machine learning expertise; however, the observations may also be biased by the short time range of the job collection.


[Figure 2. Search Results by Keywords]

2. Most Popular Job Functions

Each job posting on LinkedIn is associated with one or more job functions. Figure 3 displays a word cloud that illustrates the most popular job functions along with their number of appearances. Information technology, with more than twice as many appearances as the others, is the number one job function among the sampled jobs. Other frequently appearing functions include analyst, marketing, research, and engineering.


[Figure 3. Job Function Word Cloud]

3. Top 3 Industries by Search Terms

Similar to job functions, each job posting is associated with one or more company industries. Figure 4 shows the top 3 industries for each search term, based on the percentage of jobs associated with each industry. The result shows that IT and Financial Services were both in the top 3 for all four keywords. Given that all of the jobs were within the NYC area, this is very reasonable. Computer Software, another highly data-driven industry, also appeared in the top 3 for every group except Business Analytics, where Marketing and Advertising took its place. Again, the observations are likely to be biased, but this chart suggests that Business Analytics is a concept more commonly used by marketing firms than by technology companies.
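The ranking behind a chart like Figure 4 can be sketched as below. This is an illustration only: the input format (a keyword paired with a list of industries per posting) and the sample records are assumptions, not the project's actual data structure.

```python
from collections import Counter, defaultdict


def top_industries(postings, n=3):
    """Return, per search keyword, the n industries attached to the largest
    share of postings, as (industry, percent of postings) pairs."""
    counts = defaultdict(Counter)
    totals = Counter()
    for keyword, industries in postings:
        totals[keyword] += 1
        # Count each industry at most once per posting.
        counts[keyword].update(set(industries))

    result = {}
    for keyword, counter in counts.items():
        # Sort by count (descending), then by name so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        result[keyword] = [
            (industry, round(100 * count / totals[keyword], 1))
            for industry, count in ranked
        ]
    return result


# Made-up postings for illustration.
sample_postings = [
    ("Data Scientist", ["Information Technology and Services", "Financial Services"]),
    ("Data Scientist", ["Information Technology and Services"]),
    ("Data Scientist", ["Computer Software"]),
    ("Data Scientist", ["Information Technology and Services", "Computer Software"]),
]

top3 = top_industries(sample_postings)
print(top3["Data Scientist"])
```

Because a posting can carry several industries, the percentages for one keyword can add up to more than 100.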


[Figure 4. Top 3 Industries]

4. Number of Listings by Weekdays

Lastly, a quick review of how many jobs were posted on each day of the week. Figure 5 illustrates that Wednesdays, followed by Mondays, are the best weekdays to look for new job postings, while not many job listings are released on Fridays.
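Counting listings per weekday takes only a few lines once each posting has a date. The sketch below assumes the posting dates have already been parsed into `datetime.date` objects; the sample dates are made up.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def listings_by_weekday(post_dates):
    """Return (weekday name, number of postings) pairs, Monday first."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in post_dates)
    return [(day, counts[day]) for day in WEEKDAYS]


# Made-up posting dates for illustration.
sample_dates = [
    date(2016, 8, 15),  # Monday
    date(2016, 8, 17),  # Wednesday
    date(2016, 8, 17),  # Wednesday
    date(2016, 8, 19),  # Friday
]

weekday_counts = listings_by_weekday(sample_dates)
for day, count in weekday_counts:
    print(f"{day:<10} {count}")
```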


[Figure 5. Number of Listings by Weekdays]

Find Popular Requirements & Skills

Compared with the labels shown above, job requirements and skills are harder to identify because they are not always clearly listed within job descriptions. For example, one job may list "SQL, Python and R" as required skills, while another may require the "ability to extract data using SQL." Studying these unstructured formats requires some more advanced techniques.

Use N-grams to Find Popular Requirements

An N-gram, by definition, is "a contiguous sequence of n items from a given sequence of text or speech." Listing the most frequent N-grams in job descriptions might reveal the most popular skills mentioned in those job listings. This study used only unigrams, bigrams, and trigrams, since longer N-grams appear less frequently and are thus more likely to be biased.
Taking Business Analytics and Data Scientist jobs as examples, Business Analytics jobs are more focused on business experience and soft skills: for example, the bigrams "(year, experience)" and "(cross, function)" are frequently mentioned. In comparison, Data Scientist jobs are more technology-oriented, with "(big, data)" and "(machine, learning)" mentioned more frequently. An interesting finding among the trigrams is that while "(5, year, experience)" appeared commonly in both groups, "(2, year, experience)" appeared commonly only in Data Scientist jobs. This is probably because data science is a relatively new discipline.
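Counting N-grams can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it tokenizes with a basic regular expression and skips the stop-word removal and word normalization that N-grams such as "(year, experience)" suggest the study applied; the sample descriptions are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def ngrams(tokens, n):
    """Yield every contiguous run of n tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(descriptions, n, k=5):
    """Return the k most frequent n-grams across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


# Made-up job description snippets for illustration.
descriptions = [
    "Experience with machine learning and big data tools",
    "Machine learning experience required",
    "Big data and machine learning",
]

top_bigrams = top_ngrams(descriptions, n=2, k=2)
print(top_bigrams)
```

N-grams are counted within each description separately, so no N-gram spans the boundary between two postings.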


[Figure 6. Top N-grams for Business Analytics Jobs]


[Figure 6. Top N-grams for Data Scientist Jobs]

Use Word2Vec to Find Similar Skills

Another model used in this study is Word2Vec, a text analytics technique that trains a shallow neural network, using either the continuous bag-of-words (CBOW) or the skip-gram architecture, to learn vector representations in which words that tend to appear in similar contexts end up close together. A Word2Vec model built from the collected job descriptions can take a list of skills as input and return other skills that are most likely to be mentioned along with the input skills in the same job descriptions. Users might therefore be able to use this model to find out what new skills they should acquire given their current skill sets.

The first part of the code below was used to train the Word2Vec model, and the second part is a quick demo of getting output from the Word2Vec model.



[Figure 7. Sample Output of Word2Vec Model]
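The intuition behind this use of the model, that skills mentioned together in the same postings are related, can also be illustrated with plain co-occurrence counts. The sketch below is a much simpler stand-in: it is not Word2Vec and not the code used in this study. The skill lists are made up and assumed to have been extracted from the job descriptions already.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(skill_lists):
    """Count how often each pair of skills appears in the same posting."""
    pair_counts = Counter()
    for skills in skill_lists:
        for a, b in combinations(sorted(set(skills)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def related_skills(pair_counts, known, k=3):
    """Rank skills not in `known` by how often they co-occur with known ones."""
    known = set(known)
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a in known and b not in known:
            scores[b] += count
        elif b in known and a not in known:
            scores[a] += count
    # Sort by score (descending), then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up skill lists, one per job posting.
skill_lists = [
    ["python", "sql", "spark"],
    ["python", "sql", "tableau"],
    ["python", "spark", "hadoop"],
    ["sql", "tableau", "excel"],
]

pairs = build_cooccurrence(skill_lists)
suggestions = related_skills(pairs, ["python"])
print(suggestions)
```

Unlike Word2Vec, this approach only sees exact co-mentions and cannot generalize to skills that appear in similar contexts without ever sharing a posting.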


This project demonstrated how a job seeker could use data science skills to quickly acquire knowledge of the job market, find hiring patterns, and build machine learning models to reveal popular skills that are in demand. Based on the results of this study, many further analyses are possible. For example, the same dataset used for building the Word2Vec model can also be used to train a Doc2Vec model, which extends the learning power to discover similarities among documents. Users could then use their text resume as input and find the best-matching jobs within the dataset.

The complete project source code is available on Jonathan's GitHub.

Jonathan's LinkedIn Profile: Jonathan Liu

About Author

Jonathan Liu

Through years of self-learning on programming and machine learning, Jonathan has discovered his interests and passion in Data Science. With his B.B.A. in accounting, M.S. in Business Analytics, and two years of experience as operation analyst, he is...