Web-scraping LinkedIn: Exploring the Background of a Data Scientist
Reason For Research:
When I first started changing career paths into coding, I heard about data science for the first time. Even after countless hours of research, I was still unclear about what the job actually consists of. What I did discover was that this job title is rapidly growing into an in-demand career option. We live in a time of enormous amounts of data, and more and more companies are looking to fill this position. I began to ask: how do you become a data scientist, what degree or level of education do you need, and what skills are employers looking for? This led me to explore the backgrounds, experiences, and skills that current data scientists possess. So the data analysis begins!
At first, I thought about scraping job sites such as Indeed, Glassdoor, and Monster. However, most of the information on these websites concerns job descriptions and salaries, whereas I am more interested in the individuals who land a job as a data scientist. LinkedIn is a social network for professionals, making it the Facebook for your career. It is built for networking and connecting with others within your industry or an industry you may be trying to enter. Not only is this platform great for social networking, but it's also great for job searches! So I decided to do my web-scraping project on LinkedIn.
Once I had chosen the website to scrape, I had to decide which company's current employees to pull information about. After some research, I decided to go with Uber due to the large number of open data scientist positions at that company. Besides ridesharing, Uber has branched into new areas including Uber Eats, Uber Freight, and Uber Health, as well as other modes of personal transportation such as bikes and buses. Recently, Uber has even started a project for a new mode of ridesharing... in the air! All of these projects rely on big data and create a big demand for data scientists, which makes this company perfect for my project.
I used Selenium and Beautiful Soup to web-scrape Uber's LinkedIn profile. However, I encountered some issues while building the script. When you search through the list of current employees on a company's profile, LinkedIn shows a number of pages with 10 employee profiles on each page. First, to continue scraping the next set of 10 profiles, you have to get to the next page, and the only way to do this is by clicking the "next" button located on the bottom right. Secondly, to gather the needed information about a current employee, you have to click on the employee's name, which is the link to their profile. Selenium helped me maneuver around both issues because it can click through the pages in a real browser. The trade-off is speed: scraping through a browser is much slower. On top of that, to avoid getting banned by LinkedIn, I had to use "sleep" statements in my code multiple times, which slowed things down further.
The second problem was the number of pages scraped. I had to rewrite my code so that the scraping stopped at the 100th page, before being "timed out", and no longer continued to look for the "next" button. Once this problem was solved, I was able to create a pandas table consisting of the scraped information: each employee's name, job title, location, and the link to their profile. I then saved the output to a csv file.
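The capped paging loop described above can be sketched roughly as follows. This is a minimal sketch and not the script used for this project: `get_rows` and `click_next` are hypothetical callables standing in for the Selenium and Beautiful Soup calls that read the 10 profiles on a page and click the "next" button, the column names are assumed, and it writes the csv with the standard library instead of pandas to stay self-contained.

```python
import csv
import time


def scrape_pages(get_rows, click_next, max_pages=100, delay=2.0):
    """Collect rows page by page, pausing between pages and stopping at max_pages.

    get_rows   -- hypothetical callable returning the profile dicts on the current page
    click_next -- hypothetical callable that clicks "next"; returns False if there is no button
    """
    rows = []
    for page in range(1, max_pages + 1):
        rows.extend(get_rows())
        if page == max_pages:
            break  # hard cap reached: do not look for the "next" button again
        time.sleep(delay)  # deliberate slow-down between page loads
        if not click_next():
            break  # ran out of pages before hitting the cap
    return rows


def save_rows(rows, path):
    """Write the scraped rows to a csv file (column names are assumptions)."""
    fields = ["name", "title", "location", "profile_link"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Passing the page actions in as callables keeps the cap and the sleep logic separate from the browser code, so the loop itself can be checked without opening LinkedIn.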
Once this csv file was created, I started a second scraping process that read the output csv file from the previous scrape and went into each profile link to grab the information needed for my analysis. This second scrape collected each employee's experience, education, and skills. Once I had retrieved this information, I narrowed the results down to only "data scientist" type roles at Uber, which let me make the observations needed to reach my conclusions.
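Narrowing the scraped rows down to "data scientist" type roles could look something like the sketch below. The keyword list, the column names, and the three sample rows are all made up for illustration; the exact rule used in this project is not shown here.

```python
import csv
import io

# Hypothetical keywords for "data scientist type" roles (an assumption, not the original list).
ROLE_KEYWORDS = ("data scien", "machine learning", "data analyst", "data engineer")


def is_data_role(title):
    """Return True if the job title contains any of the role keywords."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in ROLE_KEYWORDS)


def filter_data_roles(csv_file):
    """Read profile rows from an open csv file and keep only data-science-type titles."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if is_data_role(row["title"])]


# Made-up rows standing in for the csv produced by the first scrape.
sample = io.StringIO(
    "name,title,location,profile_link\n"
    "A,Senior Data Scientist,San Francisco,link-a\n"
    "B,Account Manager,New York,link-b\n"
    "C,Machine Learning Engineer,Seattle,link-c\n"
)
kept = filter_data_roles(sample)
print([row["name"] for row in kept])  # prints ['A', 'C']
```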
My first analysis was of the typical data scientist's educational background. I was curious about what education is needed to land a data science role, so I took a look at each employee's most recently completed education. After categorizing the degrees into a separate table and taking value counts, it turns out the majority have a Master's degree as their last completed education, with Ph.D.s following.
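As a rough illustration of the categorize-then-count step, here is a small sketch. The matching rules and the sample degree strings are assumptions made for this example, and `Counter` plays the role that value counts play on a pandas column.

```python
import re
from collections import Counter

MASTER_ABBREVIATIONS = {"ms", "msc", "ma", "mba", "meng"}
BACHELOR_ABBREVIATIONS = {"bs", "bsc", "ba", "beng"}


def degree_level(degree):
    """Map a free-text degree string to a coarse level (rules are illustrative)."""
    text = degree.lower().replace(".", "")  # "M.S." becomes "ms", "Ph.D." becomes "phd"
    words = re.findall(r"[a-z]+", text)
    first = words[0] if words else ""
    if "phd" in text or "doctor of philosophy" in text:
        return "Ph.D."
    if "master" in text or first in MASTER_ABBREVIATIONS:
        return "Master's"
    if "bachelor" in text or first in BACHELOR_ABBREVIATIONS:
        return "Bachelor's"
    return "Other"


# Made-up degree strings, one per profile (last completed education).
degrees = [
    "M.S. in Statistics",
    "Ph.D., Physics",
    "Master of Engineering",
    "B.S. Computer Science",
    "Juris Doctor (J.D.)",
]
level_counts = Counter(degree_level(d) for d in degrees)
print(level_counts.most_common())
```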
Looking at the results, I was curious about what types of Master's degrees these employees received. With the majority having Master's degrees, I felt this would be a valuable analysis. Once I created a separate table and categorized the degree names, I was able to make the pie chart below. As you can see, most of the degrees completed were either an Engineering degree or some type of Mathematics degree.
Once my analysis of the last degree was completed, I wanted to take a deeper look into what degrees the typical data scientist at Uber started out with. These days, it's pretty common for individuals, myself included, to change career paths after completing their first degree. So I was curious about where these employees started their careers. I created a separate table containing each profile's first education entry. The majority started with a bachelor's degree, so it made sense to analyze only the bachelor's degrees completed. Comparing the two, you can see there isn't much of a difference from the Master's degree results, with Engineering, Mathematics, and Computer Science being the top 3 fields.
Next, I wanted to look at which skill sets are most in demand from employers looking to fill these roles, and in particular which programming languages, as coding skills play a big role in a data scientist's duties. But first, I gathered each employee's skill set and organized it into categories: coding languages, data analytics (data, research, analysts), statistics skills (machine learning, modeling, stats), and other. Looking at the chart below, you can see that a coding language is a more common skill to have than the other categories, as suggested earlier. In the bar chart, 3 represents coding language, 1 represents data analytics, 2 represents statistics or machine learning skills, and 0 represents other.
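The 0 to 3 coding of skills used in the bar chart could be produced along these lines. This is only a sketch: the keyword fragments, the precedence between categories (statistics checked before analytics), and the sample skills are assumptions, while the language list is the one found on the Uber profiles.

```python
from collections import Counter

CODING_LANGUAGES = {"python", "r", "c++", "c", "java", "sql"}
STATS_WORDS = ("machine learning", "model", "stat")  # category 2
ANALYTICS_WORDS = ("data", "research", "analy")      # category 1


def skill_code(skill):
    """Return 3 for a coding language, 2 for statistics/ML, 1 for data analytics, 0 for other."""
    s = skill.strip().lower()
    if s in CODING_LANGUAGES:
        return 3
    if any(word in s for word in STATS_WORDS):
        return 2
    if any(word in s for word in ANALYTICS_WORDS):
        return 1
    return 0


# Made-up skills pooled from several profiles.
skills = ["Python", "R", "Machine Learning", "Data Analysis", "Leadership", "SQL"]
code_counts = Counter(skill_code(s) for s in skills)
print(sorted(code_counts.items()))  # prints [(0, 1), (1, 1), (2, 1), (3, 3)]
```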
With coding skills clearly being very important in the data science community, I looked into which programming language is most popular and in demand among employers. Within Uber, the coding skills listed on employees' LinkedIn profiles consist of Python, R, C++, C, Java, and SQL. Taking the same table and gathering only the coding skill value counts, you can see below that Python is clearly the most common coding skill.
The next set of information to be analyzed is the employees' experience. Here I decided to look at which companies current Uber employees typically worked for before their current position. After creating a separate table and cleaning up the data, I was able to compare the top 10 results. However, this comparison didn't give much insight. Looking at the results below, you can see that the counts weren't large enough to indicate the most popular companies Uber hires from; the numbers are even across the board for the most part. I was a little shocked, as I had assumed the most common companies would be Microsoft, Amazon, or even Facebook.
Since this set of information wasn't the best for analysis, I looked into how many years of experience the average employee had when hired by Uber. To gather this information, I created a table that combined information from the education and experience analyses to see how many years passed between the year an employee completed their education and the year they were hired by Uber. After cleaning and analyzing, the picture made a lot more sense: most of the current employees were hired within the first few years after finishing their education. Looking at the chart below, the highest peaks are at 0 to 3 years of experience.
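The years-of-experience figure is just the gap between two years, which a short sketch can make concrete. The year pairs below are made up, and clamping negative gaps to 0 (for someone who finished a degree after being hired) is an assumption of this example rather than a rule from the project.

```python
from collections import Counter


def years_before_hire(education_end_year, uber_start_year):
    """Years between finishing education and starting at Uber, clamped at 0 (assumption)."""
    return max(0, uber_start_year - education_end_year)


# Made-up (education end year, Uber start year) pairs, one per employee.
pairs = [(2015, 2017), (2016, 2016), (2012, 2018), (2018, 2017)]
experience_counts = Counter(years_before_hire(end, start) for end, start in pairs)
print(sorted(experience_counts.items()))  # prints [(0, 2), (2, 1), (6, 1)]
```

Counting how often each gap occurs gives the heights of the bars in a chart like the one described above.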
As mentioned earlier in this blog, Uber, along with other tech companies, has recently amped up several projects that require data science work. I was curious about when data scientist roles became more popular and in demand, so I looked at the count of hires per year. The count for this year, 2019, is not very accurate because it is still too early in the year. As suspected, the number of hires went up drastically in 2017 and 2018 compared to earlier years. Going from 4 hires to a total of 19 in 2017 is a pretty big jump, and this was the time when the projects started.
To take this analysis a step further, I looked into the current job titles of the Uber employees. With over 21 different job titles in the "Data Scientist" category, the results showed that over 60 percent were Data Scientist titles. Following Data Scientist, you have Software Engineer, data analyst, and data research or data engineer titles coming in at 8 - 10 percent. Last, you have Machine Learning Engineer along with product titles at 4 - 7 percent.
With the Data Scientist title taking more than 60 percent of the positions in this category, I wanted to look into what kinds of Data Scientists are in demand at this company. So I took the data scientist titles and created a separate table with all the information needed to categorize them. After cleaning, there are a total of 11 different data scientist titles within that 60 percent of positions, with Data Scientist being the most common and Data Scientist II and Senior Data Scientist coming in behind. The graphs are a little hard to read or analyze, as there are a good number of titles in this category.
Because these results were a little tricky to read and compare, I decided to dive deeper into the Data Scientist and Senior Data Scientist titles. I wanted to see what makes the difference between the two. What skill sets do senior data scientists have that data scientists don't? How many more years of experience do senior data scientists have than data scientists? What are the highest levels of education for each? To start this analysis, I once again created a separate table and gathered information for only these two titles. In total there are 17 Data Scientists and 9 Senior Data Scientists. First I looked into the education differences between the two titles; you can see my results in the graph below.
Comparing the education levels completed for each, there wasn't much of a difference between the two. For both positions, the Master's degree was the most common, with the Ph.D. following right behind. The only difference between the two graphs is that Senior Data Scientists have an "other" type of education while Data Scientists don't. This is only because one employee has a Juris Doctor (J.D.) degree. These results are not sufficient for distinguishing between the two titles.
Next, I compared the years of experience for the two job titles, and the results were more informative than the education comparison. For the Senior Data Scientist role, the years of experience ranged from 3 - 11, whereas for the Data Scientist role they ranged from 0 - 5, which makes sense, as qualifying for a "Senior" role should require more experience. This led me to conclude that to qualify for a Senior level role, you need at least 3 - 5 years of prior experience.
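Finding the experience range for each title is a group-then-min/max operation, sketched below. The records are made-up values chosen only so that the output matches the ranges reported above; they are not the scraped data.

```python
from collections import defaultdict


def experience_range_by_title(records):
    """Group (title, years) pairs by title and return {title: (min_years, max_years)}."""
    years_by_title = defaultdict(list)
    for title, years in records:
        years_by_title[title].append(years)
    return {title: (min(vals), max(vals)) for title, vals in years_by_title.items()}


# Made-up records; the real table has one row per employee.
records = [
    ("Data Scientist", 0),
    ("Data Scientist", 2),
    ("Data Scientist", 5),
    ("Senior Data Scientist", 3),
    ("Senior Data Scientist", 11),
]
ranges = experience_range_by_title(records)
print(ranges["Data Scientist"], ranges["Senior Data Scientist"])  # prints (0, 5) (3, 11)
```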
Lastly, I analyzed the skill set differences between the two job titles. I created separate tables for Senior Data Scientists and Data Scientists and gathered the total counts of skills across the profiles to see if there were any comparisons to be made. Just like with education level, there wasn't much of a difference in skill sets; they are practically the same. For both titles, Python and machine learning are the most common skills and a must to have under your belt, which makes sense, as machine learning plays a big role in data science positions and Python is the most commonly used coding language. Data analysis appears in both, which also makes sense, as a big part of these roles is analyzing data. The only difference between the two charts below is that for Data Scientists you have the skill "R", another popular coding language, along with "matlab", and for the Senior level you have "Algorithms" along with "Optimization Models", which make sense for every data scientist to know. The two charts show only the top 5 skills for each title, as employees added many different types of skills to their profiles. I wanted to gather only the most common ones and not add skills to my comparison that were not "data science" type skills.
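Picking the top 5 skills per title while leaving out non "data science" skills can be sketched as follows. The profile structure, the `allowed` set, and the sample profiles are hypothetical and only illustrate the counting.

```python
from collections import Counter


def top_skills(profiles, title, n=5, allowed=None):
    """Count skills across profiles with the given title and return the n most common.

    allowed -- optional set of lowercase skill names to keep (hypothetical allowlist)
    """
    counts = Counter()
    for profile in profiles:
        if profile["title"] != title:
            continue
        for skill in profile["skills"]:
            key = skill.strip().lower()
            if allowed is None or key in allowed:
                counts[key] += 1
    return counts.most_common(n)


# Made-up profiles for illustration.
profiles = [
    {"title": "Data Scientist", "skills": ["Python", "R", "Public Speaking"]},
    {"title": "Data Scientist", "skills": ["python", "Machine Learning"]},
    {"title": "Senior Data Scientist", "skills": ["Python", "Algorithms"]},
]
allowed = {"python", "r", "machine learning", "algorithms"}
print(top_skills(profiles, "Data Scientist", allowed=allowed))
```

Lowercasing each skill before counting means "Python" and "python" land in the same bucket, which matters because the skills are free text typed in by each employee.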
Comparing Data Scientists and Senior Data Scientists, the only major difference between the two titles is the years of experience, as the education and skill sets were basically the same. I conclude that in order to qualify for a senior role, you need at least 3 years of experience under your belt along with the skills identified in this analysis.
Concerning web scraping, this project was pretty challenging. LinkedIn constantly updates its site, which limits how long the scraper keeps working and means the code needs to be updated often for this analysis to run correctly. For this particular project, it would be interesting to continue gathering information on the current Uber employees to see where they end up in their next position. It would also be interesting to gather more data to compare the salary jumps between positions. For example, we could compare data scientist, machine learning engineer, and analyst positions to investigate how salary, education, or skill requirements differ across them. However, to compare salaries, we would have to bring in another source for this data, such as Glassdoor. I believe this could be a great idea for future projects: continuing to gather information from LinkedIn and adding salary comparisons from Glassdoor. These ideas could even lead to a possible machine learning project. For example, given a person's set of skills and educational background, a model could recommend which jobs to apply to.
You can view my code, data visualizations, and csv files on my GitHub page here.