Job Salary Prediction with Python
The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by John Maiden. John took Vivian Zhang's "Data Science with Python: Machine Learning" class from November to December 2014. The post is based on his final project submission. Slides are available here.
Salaries are a complicated topic. In some cultures it's extremely impolite to ask someone how much they earn in a year. In other cultures it's a common topic of discussion you'd ask a friend or acquaintance (or when I lived in China, something a person sitting next to me on the bus would ask). When looking for a job, it's a major deciding factor for many people.
Sometimes the actual number is important, other times it's relative (“Am I being offered a salary that is below or above market?”). Most sites that offer salary information cover a limited range of jobs and industries (look for “Java Developer” in “New York, New York” at salary.com - 26 results, but only half look like actual developer jobs), and those that have data usually do not adjust for factors like location, industry, and company size.
Besides sites like Salary.com, there are companies that use data science to predict salaries for jobs within certain industries or geographic locations (see projectSHERPA or Adzuna by way of their Kaggle competition). This brings transparency to a murky side of the market and fills in the gaps where data is not available.
Using projectSHERPA’s database of jobs, the scope of this project was to use the techniques learned in “Data Science with Python: Machine Learning” to predict base salaries for data science jobs in NYC. Additional company information was scraped from LinkedIn, providing useful data such as company size, industry, and company specialties.
The data science jobs were identified through filtering for specific keywords in the job title and job descriptions and filtering out required skills that suggested the job wasn’t a technical position (e.g. requiring “Salesforce” experience).
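The filter itself can be sketched roughly like this; the keyword and exclusion lists below are illustrative guesses, not the project's exact lists:

```python
# Illustrative keyword filter - these particular lists are guesses,
# not the exact keywords used in the project.
DS_KEYWORDS = {"data science", "machine learning", "data scientist"}
EXCLUDE_SKILLS = {"salesforce", "cold calling"}

def is_data_science_job(title, description):
    """Keep a job only if it mentions a data-science keyword and none
    of the skills that suggest a non-technical position."""
    text = (title + " " + description).lower()
    if any(skill in text for skill in EXCLUDE_SKILLS):
        return False
    return any(kw in text for kw in DS_KEYWORDS)
```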
Using the filters I found 585 unique data science jobs in NYC across 262 companies. Unfortunately, only 67 of those had posted salaries - a very small sample size. I had previously developed a salary model for projectSHERPA, so I decided to split my testing into two groups: seeing if I could predict using the 67 jobs with posted salaries, as well as testing against the 584 salaries produced by "the other model". Here is the salary distribution of the posted jobs:
And here's the distribution of the 584 salaries as produced by "the other model":
Since most of the competitors in Adzuna’s Kaggle competition used a Random Forest model, I decided that would be a good place to start. I used Random Forest with a collection of parameters (via Scikit-Learn’s grid search functionality) on the data, with a 50% Training /50% Test split on the smaller data set and 80% Training / 20% Test split on the larger data set.
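A minimal version of that setup, on synthetic stand-in data (the real feature matrix came from the job posts and LinkedIn company data), might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real feature matrix (job-post plus
# LinkedIn company features); salaries here are in $K.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([40.0, 25.0, 10.0, 5.0, 2.0]) + 60.0

# 50/50 split mirrors what was used on the small 67-job set;
# the larger set used 80/20.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Grid search over a small collection of Random Forest parameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
grid.fit(X_train, y_train)
rf_test_score = grid.score(X_test, y_test)  # R^2 of the best estimator
```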
I was also inspired by Zygmunt Zając’s post where he said “More data beats a cleverer algorithm, especially when a cleverer algorithm is unable to handle all of data (on your machine, anyway)”.
Since I had more data than the original Kaggle competition (by pulling in the company data from LinkedIn), I could also see if linear models (Ordinary Linear Regression, Ridge Regression, and Lasso Regression) could perform as well as, if not better than, a Random Forest model. Results for the models were not encouraging - good scores on the Training data, but poor (if not lousy) scores on the Test data.
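A sketch of that comparison on toy data - the overfitting pattern shows up as a gap between the train and test scores:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature matrix; the real one mixed job-post
# and LinkedIn company features.
rng = np.random.RandomState(1)
X = rng.rand(100, 4)
y = X @ np.array([30.0, 20.0, 10.0, 5.0]) + 50 + rng.normal(0, 2, 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

linear_scores = {}
for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    model.fit(X_tr, y_tr)
    # Comparing train vs. test R^2 exposes overfitting:
    # good on Training, weaker on Test.
    linear_scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
```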
The next step was to look again at the data - I was rounding the salary numbers to the nearest 1K, but that is too granular given how people actually think about salaries (would 101K vs. 102K make a huge difference to you?). I rounded to the nearest 10K and expressed salaries in 10K units (101K → 10, 128K → 13), and added in new classification models (Linear Discriminant Analysis, K-Nearest Neighbors) to see if the results could improve. Here is the salary distribution for the posted jobs after rounding:
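The rounding rule can be expressed as a one-liner (a sketch; the project's actual binning code isn't shown here):

```python
def salary_to_bin(salary_k):
    """Round a salary in $K to the nearest 10K and express it in 10K
    units, e.g. 101 -> 10 and 128 -> 13, turning the continuous
    salary into a coarser class label for the classifiers."""
    return int(round(salary_k / 10.0))
```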
And here is the salary distribution after rounding for the numbers produced by "the other model":
The final results - smoothing the data did produce better numbers, but the results were still unimpressive. Random Forest, which produced the best numbers, had an accuracy of 58% on the Test set of the smoothed larger data set. Here are the results of the best performing models, where "Pay Rate" = jobs with posted salary (67 records), and "Estimated Salary" = salaries as predicted by "the other model" (584 records).
How could we improve the results in the future? First of all, I looked at 585 jobs from a collection of 70k+ jobs, so I could definitely go after more jobs instead of just looking at data science jobs in NYC. The shape of the data is also important - I rounded to the nearest 10K to get better numbers, but I could play with the relative size of the ranges to reflect job seeker behavior (to a job seeker, 20K vs. 30K is definitely more important than 120K vs. 130K).
Finally, pulling the data from LinkedIn required a fuzzy match between the name of the company in the database and whatever I could get from LinkedIn; looking through the matching results it looks like I got the company right about 90% of the time, so there is room for improvement.
So why do I think a project like this (predicting salaries from job posts) is a marketing gimmick? Most of the jobs that have posted salaries are from agency recruiters, which means that a) the jobs are usually entry level or mid-tier, and b) I don’t know anything about the actual company posting the job.
Based on my observations of the jobs in the projectSHERPA database, usually only 15% of the jobs posted include salaries, so we're always extrapolating from a small data set. Next, making predictions based on keywords in the job title or job description isn’t precise. Job titles vary from company to company, as well as industry to industry (“Vice President” is higher up in the management structure in Manufacturing than in Finance), and I think everyone will agree that job descriptions are mainly boilerplate.
Finally, this only works for jobs that don’t pay a significant amount of the compensation in bonus or benefits. Companies usually do not list non-base compensation in the job post. Startups are a great example of this - the base salary is usually lower than the market for a similar job at a more established company, but the upside from stock options can be huge.
I’d like to finish by saying that I do think predicting salaries is very useful - modeling total compensation (base + bonus + benefits) from data provided directly by companies would be valuable to job seekers and other market participants alike. What I’d discount is using job post data to predict salaries. It worked well for the Adzuna competition, where most of the salaries I saw were hourly rates or had compensation structures where base was a significant component of the total, but I don’t think it produces accurate numbers for jobs that require candidates with high-level technical skills.
The code I used is below - I can't share the code that I used to extract the job posts (🙁) but I can show you everything after that. For the record, here's the projectSHERPA job scraping process:
Let's start with how I came up with the keywords to retrieve the data science jobs. Originally I looked through a spreadsheet filled with the jobs that I had filtered based on a few initial keywords in the job description (e.g. "data science", "machine learning").
I extracted the job title and job description from those posts to look at common words, bigrams, and trigrams; this took a couple of iterations to find the right set of keywords that produced jobs that were "data science-y" enough (i.e. not developer jobs, not sales related jobs, not business analyst jobs). See function test_ds_words() for the text analysis outputs.
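The output of test_ds_words() isn't reproduced here, but the core n-gram counting step can be sketched like this (a simplified whitespace tokenizer, not the project's exact code):

```python
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    """Count the k most common word n-grams across a list of documents.
    Whitespace tokenization keeps the sketch simple."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

# Toy example: the most common bigram across two job titles.
bigrams = top_ngrams(["machine learning engineer",
                      "machine learning researcher"], n=2, k=1)
```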
Once I had a set of jobs that I was satisfied with, I used the Company search API that comes with Python-LinkedIn to extract additional data about the company that posted the job. I needed a fuzzy string function to match the parsed company name from the job post to the results returned from LinkedIn.
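The stdlib's difflib is one way to sketch that fuzzy match (the project's actual matching function isn't shown, so treat this as an illustration):

```python
import difflib

def best_company_match(name, candidates, cutoff=0.6):
    """Return the candidate company name most similar to `name`,
    or None if nothing clears the similarity cutoff."""
    lowered = [c.lower() for c in candidates]
    matches = difflib.get_close_matches(name.lower(), lowered,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    # Map back to the original-cased candidate.
    return candidates[lowered.index(matches[0])]

match = best_company_match("Goldman Sachs Group Inc",
                           ["Goldman Sachs", "Morgan Stanley", "JPMorgan Chase"])
no_match = best_company_match("Zzyzx", ["Goldman Sachs", "Morgan Stanley"])
```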
Now we arrive at the LinkedIn data retrieval code - I pulled in basic company data (e.g. company name, homepage, company type, company size), industry, and specialties (a collection of unstructured tags). If you want to test the code, you'll have to go to the LinkedIn Developer site to get an API key.
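The python-linkedin client itself needs an API key and a live connection, so here's just a sketch of pulling those fields out of a company record; the field names (website-url, employee-count-range, etc.) follow the old LinkedIn Company API, but treat the exact dict shape as an assumption:

```python
def extract_company_fields(company):
    """Pull the basic company fields described above out of a company
    record. The key names are assumptions modeled on the old LinkedIn
    Company API response, not a guaranteed schema."""
    return {
        "name": company.get("name"),
        "website": company.get("website-url"),
        "company_type": (company.get("company-type") or {}).get("code"),
        "size": (company.get("employee-count-range") or {}).get("code"),
        "industries": (company.get("industries") or {}).get("values", []),
        "specialties": (company.get("specialties") or {}).get("values", []),
    }

# Hypothetical company record for illustration.
sample_company = {
    "name": "Acme Analytics",
    "website-url": "http://acme.example",
    "company-type": {"code": "P"},
    "employee-count-range": {"code": "D"},
    "industries": {"values": [{"name": "Computer Software"}]},
    "specialties": {"values": ["analytics", "machine learning"]},
}
company_row = extract_company_fields(sample_company)
```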
The next step was to merge the data I retrieved from LinkedIn with the job data, and convert the LinkedIn codes into numeric values. See function prepare_and_merge_data() for the merge step.
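A sketch of that merge-and-encode step with pandas; the size-code mapping below is illustrative, not the project's exact table:

```python
import pandas as pd

# Toy versions of the two tables being merged.
jobs = pd.DataFrame({"company": ["Acme", "Globex"], "salary_k": [110, 95]})
companies = pd.DataFrame({"company": ["Acme", "Globex"],
                          "size_code": ["D", "F"]})

# LinkedIn reports company size as letter codes; map them to ordinal
# values so the models can use them. (Illustrative mapping.)
SIZE_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4,
              "F": 5, "G": 6, "H": 7, "I": 8}

merged = jobs.merge(companies, on="company", how="left")
merged["size_num"] = merged["size_code"].map(SIZE_ORDER)
```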
I wanted to test multiple models on different versions of the data, so I originally wrote a separate function for each model and collection of data. I quickly realized that was not scalable (I kept "fixing" the code every time I thought of a new way to test the data), so I wrote a collection of generic modeling functions. Here's the function for running a scikit-learn model and extracting the score and mean squared error:
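A sketch of such a helper (not the original listing):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def run_sklearn_model(model, X_train, y_train, X_test, y_test):
    """Fit any scikit-learn model and return its train/test scores
    plus the mean squared error on the test set."""
    model.fit(X_train, y_train)
    return {
        "train_score": model.score(X_train, y_train),
        "test_score": model.score(X_test, y_test),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }

# Quick check on a toy linear problem.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = 10 * X.ravel() + 50
demo_result = run_sklearn_model(LinearRegression(), X[:6], y[:6], X[6:], y[6:])
```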
Here's a function to return score and mean-squared-error for a scikit-learn model using a grid search:
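A sketch of the grid-search variant, again illustrative rather than the original code:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

def run_grid_search_model(model, param_grid, X_train, y_train,
                          X_test, y_test, cv=3):
    """Grid-search over param_grid, then report the best estimator's
    parameters, test score, and test MSE."""
    grid = GridSearchCV(model, param_grid, cv=cv)
    grid.fit(X_train, y_train)
    best = grid.best_estimator_
    return {
        "best_params": grid.best_params_,
        "test_score": best.score(X_test, y_test),
        "test_mse": mean_squared_error(y_test, best.predict(X_test)),
    }

# Toy usage: pick a Ridge alpha on synthetic linear data.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X @ np.array([10.0, 5.0, 2.0]) + 1.0
gs_result = run_grid_search_model(Ridge(), {"alpha": [0.01, 0.1, 1.0]},
                                  X[:48], y[:48], X[48:], y[48:])
```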
I encountered a problem with collinearity when using all of the data features with the LDA model, so I also wrote a function to retrieve the "Best-K" features for a scikit-learn model, using test score as the evaluation criterion.
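A sketch of the Best-K idea using scikit-learn's SelectKBest (the scoring function and toy data here are my choices, not the project's):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif

def best_k_features(model, X_train, y_train, X_test, y_test):
    """Try every k with SelectKBest and keep the k that maximizes the
    test score - the same evaluation criterion described above."""
    best_k, best_score = None, -np.inf
    for k in range(1, X_train.shape[1] + 1):
        selector = SelectKBest(f_classif, k=k).fit(X_train, y_train)
        model.fit(selector.transform(X_train), y_train)
        score = model.score(selector.transform(X_test), y_test)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy classification problem where only the first feature matters.
rng = np.random.RandomState(0)
X = rng.rand(80, 4)
y = (X[:, 0] > 0.5).astype(int)
k_best, k_score = best_k_features(LinearDiscriminantAnalysis(),
                                  X[:60], y[:60], X[60:], y[60:])
```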
Combining all of the above code, here is the entire set of code that I used to predict job salaries. The modeling component is kicked off in function run_model().
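The full listing isn't reproduced here, but the overall shape of a run_model() driver, heavily simplified, might be:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def run_model(X, y, model, test_size=0.2, seed=0):
    """Heavily simplified skeleton of a run_model() driver: split the
    data, fit the model, and report train/test accuracy. A sketch,
    not the project's original function."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

# Toy binned-salary classification problem.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
train_acc, test_acc = run_model(
    X, y, RandomForestClassifier(n_estimators=50, random_state=0))
```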