Use Data Science to Boost Your Job Hunting
Job hunting can be very hard. A serious job seeker needs to spend dozens, if not hundreds, of hours on job market research: finding out who is hiring, what skills employers are looking for, and which industry they want to work in. More importantly, once they have obtained that information, job seekers also need to make sure their resumes and profiles are well polished, so recruiters can see how well they meet the job requirements.
As a data scientist, how can one use data analytics skills to accelerate this time-consuming search process and extract meaningful insights from market data? This article gives one example.
A well-designed pipeline is the foundation of a successful data science project. As shown in Figure 1, the pipeline for this project consists of extracting data from a reliable source, cleaning and transforming the data into a format suitable for analysis, and applying different data science techniques to obtain insights from it.
[Figure 1. Project Pipeline]
LinkedIn, as the most popular social network for professionals, provides a massive amount of reliable, up-to-date recruiting information through its job posting platform. It therefore serves as the data source for this analysis.
With the help of the Requests, BeautifulSoup, and regex packages, a Python script was written to search predefined keywords on LinkedIn and collect job listing URLs from the search results. Four keywords were used in this step: "Business Analytics," "Business Intelligence," "Data Scientist," and "Machine Learning." These are the terms most commonly used in industry when people talk about "data science." In addition, all searches were limited to the Greater New York City Area.
Next, another script opened each job link, saved the HTML page, and parsed the posting details from the collected HTML files. This data collection process ran for two weeks and gathered over 2,500 job postings (the search results for different keywords could contain duplicate postings, so jobs in this dataset are not necessarily unique).
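The parsing step can be sketched as follows. This is a minimal illustration of the approach, not the author's actual script: the CSS class names, the sample HTML, and the URL pattern are all hypothetical, since LinkedIn's real markup differs and changes frequently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet of a saved search-results page; real LinkedIn
# markup uses different (and frequently changing) class names.
SAMPLE_HTML = """
<div class="job-card">
  <h3 class="job-title">Data Scientist</h3>
  <a class="job-link" href="https://www.linkedin.com/jobs/view/12345">View</a>
  <span class="company">Acme Analytics</span>
</div>
"""

def parse_job_cards(html):
    """Extract title, company, and URL for each job card on a saved page."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job-card"):
        url = card.select_one("a.job-link")["href"]
        jobs.append({
            # The numeric id in the URL can be used later to drop duplicates.
            "id": re.search(r"/jobs/view/(\d+)", url).group(1),
            "title": card.select_one("h3.job-title").get_text(strip=True),
            "company": card.select_one("span.company").get_text(strip=True),
            "url": url,
        })
    return jobs

print(parse_job_cards(SAMPLE_HTML))
```

Saving the raw HTML first and parsing it in a separate pass, as the project does, means pages never need to be re-downloaded when the parsing logic changes.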
1. Results by Keywords
Figure 2 shows the number of jobs found with each keyword. "Business Analytics" returned the largest number of listings, which is understandable because it is the least clearly defined term and is over-used by many companies. On the other hand, "Machine Learning" returned fewer than 300 jobs, less than half of what each of the other keywords returned. This might imply that fewer openings specifically require machine learning expertise; however, the observation may also be biased by the short collection window.
[Figure 2. Search Results by Keywords]
2. Most Popular Job Functions
Each job posting on LinkedIn is associated with one or more job functions. Figure 3 displays a word cloud illustrating the most popular job functions by number of appearances. Information technology, with more than twice as many appearances as any other function, is the no. 1 job function among the sampled jobs. Other frequently appearing functions include analyst, marketing, research, and engineering.
[Figure 3. Job Function Word Cloud]
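A word cloud like Figure 3 is just a rendering of frequency counts. A minimal sketch of the counting step, using hypothetical sample data (the real input would be the job-function field parsed from each posting):

```python
from collections import Counter

# Hypothetical sample: the job-function list parsed from each posting.
postings_functions = [
    ["Information Technology", "Engineering"],
    ["Information Technology", "Analyst"],
    ["Marketing", "Analyst"],
    ["Information Technology", "Research"],
]

# Flatten the per-posting lists and count appearances of each function.
counts = Counter(f for funcs in postings_functions for f in funcs)
print(counts.most_common(3))
```

The resulting `Counter` can be passed directly to the `wordcloud` package's `WordCloud.generate_from_frequencies` to draw the figure.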
3. Top 3 Industries by Search Terms
Similar to job functions, each job posting is associated with one or more company industries. Figure 4 shows the top 3 industries for each search term, based on the percentage of jobs associated with each industry. IT and Financial Services both rank in the top 3 for all four keywords; given that all of the jobs were within the NYC area, this is very reasonable. Computer Software, another highly data-driven industry, also appears in the top 3 for every group except Business Analytics, where Marketing and Advertising takes its place. Again, the observations are likely biased, but this chart suggests that "Business Analytics" is a term used more by marketing firms than by technology companies.
[Figure 4. Top 3 Industries]
4. Number of Listings by Weekdays
Lastly, a quick look at how many jobs were posted on each day of the week. Figure 5 illustrates that Wednesdays, followed by Mondays, are the best weekdays to look for new job postings, while few listings are released on Fridays.
[Figure 5. Number of Listings by Weekdays]
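Tallying listings by weekday is a one-liner once each posting's date has been parsed. A small sketch with hypothetical dates (the real dates would come from the scraped postings):

```python
from collections import Counter
from datetime import datetime

# Hypothetical posting dates scraped alongside each listing.
posted_dates = ["2018-04-02", "2018-04-04", "2018-04-04", "2018-04-06"]

# Parse each date and count postings per weekday name.
weekday_counts = Counter(
    datetime.strptime(d, "%Y-%m-%d").strftime("%A") for d in posted_dates
)
print(weekday_counts)
```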
Compared with the labels discussed above, job requirements and skills are harder to identify, because they are not always clearly listed within job descriptions. For example, one posting may list "SQL, Python and R" as required skills, while another may require the "ability to extract data using SQL." Studying these unstructured formats requires more advanced techniques.
Use N-grams to Find Popular Requirements
An N-gram, by definition, is "a contiguous sequence of n items from a given sequence of text or speech." Listing the most frequent N-grams in job descriptions can reveal the most popular skills mentioned in those listings. This study used only unigrams, bigrams, and trigrams, since longer N-grams appear less frequently and are thus more likely to be noisy.
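Extracting N-grams needs nothing more than a sliding window over the token list. A minimal sketch on a toy two-document corpus (the real input would be the cleaned, tokenized job descriptions):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus standing in for the tokenized job descriptions.
docs = [
    "5 year experience with machine learning and big data".split(),
    "experience using machine learning on big data platforms".split(),
]

# Count bigram frequency across all documents.
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))
print(bigram_counts.most_common(2))
```

Running `ngrams(doc, 1)` or `ngrams(doc, 3)` gives the unigram and trigram counts the same way.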
Using Business Analytics and Data Scientist jobs as examples: Business Analytics jobs focus more on business experience and soft skills; for instance, the bigrams "(year, experience)" and "(cross, function)" are frequently mentioned. In comparison, Data Scientist jobs are more technology-oriented, with "(big, data)" and "(machine, learning)" appearing more often. An interesting finding among the trigrams is that while "(5, year, experience)" appeared frequently in both groups, "(2, year, experience)" appeared frequently only in Data Scientist jobs. This is probably because data science is a relatively new discipline.
[Figure 6. Top N-grams for Business Analytics Jobs]
[Figure 6. Top N-grams for Data Scientist Jobs]
Use Word2Vec to Find Similar Skills
Another model used in this study is Word2Vec, an advanced text analytics technique that combines two text processing algorithms to build a series of small neural network models that learn relationships among words appearing close together in documents. A Word2Vec model trained on the collected job descriptions can take a list of skills as input and return the other skills most likely to be mentioned alongside them in the same descriptions. Users can therefore apply this model to find out which new skills they should acquire given their current skillset.
The first part of the code below trains the Word2Vec model, and the second part is a quick demo of querying the trained model.
[Figure 7. Sample Output of Word2Vec Model]
This project demonstrated how a job seeker can use data science skills to quickly acquire knowledge of the job market, find hiring patterns, and build machine learning models that reveal which skills are in demand. Many further analyses could build on these results. For example, the same dataset used to build the Word2Vec model could also train a Doc2Vec model, which extends the learning power to discover similarities among documents. Users could then supply their resume text as input and find the best-matching jobs within the dataset.
The complete project source code is available on Jonathan's GitHub.
Jonathan's LinkedIn Profile: Jonathan Liu