Data Analytics on Human Resources
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Why did they stay? Why did they go? An exploratory analysis of Human Resource data provided by IBM.
Imagine that you are working at IBM. No, let's say you are running IBM. It seems like a lot of employees are leaving your organization for Microsoft and Apple. This drain of talent might threaten your organization's very survival. How would you determine to what degree this is a problem? How would you know what is going on in such a large company and what you should do about it? You could make decisions based on experience. You could make decisions by intuition. Or you could consult the cold, hard data.
In this data set provided by IBM, I wanted to explore which variables might be related to retention. Why are some people leaving while others are staying? I first downloaded the data from Kaggle (https://www.kaggle.com/c/sm/data). It came as an Excel document, which I imported into RStudio with relative ease. RStudio is a free development environment for R that lets you use a wide variety of visualization, data manipulation, and statistical packages for almost any data analysis imaginable.
Preparing the Data
The first course of action is to understand the nature of the data. This is necessary because if any variable is misinterpreted, there is a chance that the analysis and visualizations will be misinterpreted as well. In R, str(data) gives you a good basic idea of what kind of data you are dealing with. In this case, we have 35 columns and 1,470 rows. This is much smaller than many data sets, but for HR data it is probably massive.
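As a minimal sketch of that first inspection step, the snippet below builds a tiny stand-in data frame and looks at its structure and dimensions. The data frame name `hr`, its column names, and its values are assumptions made for illustration; they are not taken from the IBM file.

```r
# Toy stand-in for the HR data; names and values are assumptions.
hr <- data.frame(
  Age = c(35, 41, 28, 52),
  Attrition = c("Yes", "No", "No", "Yes"),
  MonthlyIncome = c(4200, 6100, 3900, 11000),
  stringsAsFactors = FALSE
)

# Type and first few values of every column.
str(hr)

# Number of rows and columns, in that order.
dim_hr <- dim(hr)
dim_hr
```

On the real data the same two calls would report the row and column counts quoted above.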
# Count the missing values in each column
sapply(WA_Fn_UseC_HR_Employee_Attrition, function(x) sum(is.na(x)))
I used this code to count the missing values in each column of the data frame. It appears that the HR folks at IBM are quite meticulous. Not one column has missing data. This data is almost a data scientist’s dream come true. There is one problem though. Even though each column is labeled, there is no information on Kaggle about what each feature/variable might entail. The meaning of each data point in most of the columns is self-evident. When you see the value “35” under the column Age, you can safely assume that “35” stands for approximately 35 years old.
On the other hand, the kind of scale used to measure life satisfaction remains a mystery. Does 5 mean super satisfied or super-duper unsatisfied? What is life satisfaction and how was this data collected/determined? Say that the feature life satisfaction is not related to attrition levels. We cannot say with confidence whether that reflects a problem with the validity/reliability of the measure or whether this ill-defined psychological construct truly is not related to attrition. It just makes it more difficult to judge the meaning of any relationship, or lack thereof, involving the scale-based (factor) features. In addition, some of the features are text-based, which makes it more difficult to run any kind of analysis.
Cleaning the Data
I cleaned the data by using “ifelse” to convert text-based variables such as Attrition (yes/no) into coded factors (1 = they did leave, 2 = they did not leave). The coding is easy to sanity-check because most people don’t leave the company: 237 rows are coded 1 and 1,233 are coded 2. So about 16 percent of the 1,470 employees left the company, although we don’t know the time span in which this occurred (wouldn’t it be great if we could benchmark this number relative to other companies!).
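A minimal sketch of that recoding step, on a toy column, might look like the following. The column name `Attrition` and the "Yes"/"No" labels are assumptions about the file, and `hr` is a hypothetical data frame name.

```r
# Toy attrition column; the real column name and labels are assumptions.
hr <- data.frame(Attrition = c("Yes", "No", "No", "Yes", "No"),
                 stringsAsFactors = FALSE)

# Same coding as in the text: 1 = left, 2 = stayed.
hr$AttritionCode <- ifelse(hr$Attrition == "Yes", 1, 2)

# Tabulate the codes to check them against the raw labels.
code_counts <- table(hr$AttritionCode)
code_counts

# Share of employees who left.
share_left <- mean(hr$AttritionCode == 1)
share_left
```

Tabulating the codes right after the conversion is a cheap way to confirm that the smaller count really is the "left" group.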
Visualizing the Data
The purpose of visualizing is to discover and/or communicate patterns in the data. A good place to start in this instance was a correlation plot. My initial correlation plot was not worth including in this analysis because most of the features had a near-zero correlation with attrition, which made it difficult to see the few features that were related to attrition.
From the initial spread, though, you can tell that overtime is the feature most positively correlated with attrition: employees who work overtime are more likely to leave. This seems to be relatively common sense even if the feature is not well defined. Other intuitive features such as Work-Life Balance and Performance Rating aren’t related to Attrition. I was also surprised to see that gender was unrelated. You can see that Job Involvement, Monthly Salary, Age, Job Level, Years With Current Manager, and Total Working Years all have slight positive correlations with attrition.
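One way to read such correlations without a crowded plot is to compute each feature's correlation with the attrition indicator and sort by absolute size. The sketch below uses synthetic data in which overtime is made to matter by construction; the column names and all values are assumptions. It codes attrition as 1 = left, 0 = stayed, so with the 1 = left, 2 = stayed coding used in the cleaning step the signs would be reversed.

```r
set.seed(1)  # make the synthetic example reproducible
n <- 200

# Synthetic stand-in data; none of these values come from the IBM file.
hr <- data.frame(
  Age = round(runif(n, 18, 60)),
  OverTime = sample(c(0, 1), n, replace = TRUE),  # 1 = works overtime
  Noise = rnorm(n)
)

# Leaving (1 = left, 0 = stayed) is made more likely with overtime.
hr$Left <- rbinom(n, 1, ifelse(hr$OverTime == 1, 0.6, 0.1))

# Correlation of every feature with the attrition indicator.
features <- c("Age", "OverTime", "Noise")
cors <- sapply(hr[, features], function(x) cor(x, hr$Left))

# Largest absolute correlations first.
cors[order(abs(cors), decreasing = TRUE)]
```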
Monthly Income by Age
I wanted to first look at the simplest of these. You can see below that the group of employees who answered "Yes" to working overtime had far more employees who also had "Yes" for leaving. Job involvement is a little less clear, but it is obvious that employees who only reached level 1 in job involvement were far more likely to leave compared with those who answered 4.
Several features are correlated with each other. It’s not surprising that Total Years Worked would be correlated with Age and that Age would be correlated with income. Features related to both Age and Income are correlated with attrition. Are people with a higher job level less likely to leave because they are being paid more or because they are older?
Since several of these features (Total Years Worked, Job Level, Number of Companies Worked, Years with Current Manager) are related, but we do not know whether age and income act on attrition independently of each other, I decided to test whether there is a difference between those who have a high monthly income and are older and those who have a lower monthly income and are younger.
The scatter plot above illustrates the relationship between income and age in total. I will also do this for those who have a lower income and are older and those who have a higher income and are younger. To do this, I will subset the data into four sections. First, though, I need to know how to divide the data in a proportional way so that such differences make sense.
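The comparison described here can also be read from a cross-tabulation. The sketch below is illustrative only: the toy data, the column names `OverTime` and `Attrition`, and the "Yes"/"No" labels are assumptions.

```r
# Toy data; labels and column names are assumptions.
hr <- data.frame(
  OverTime  = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  Attrition = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Raw counts of each overtime/attrition combination.
counts <- table(OverTime = hr$OverTime, Attrition = hr$Attrition)
counts

# Row-wise shares: within each overtime group, what fraction left?
rates <- prop.table(counts, margin = 1)
rates
```

Row-wise shares make groups of different sizes comparable, which raw counts in a bar chart do not.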
Histogram of Age and Monthly Income
I created histograms of age and monthly income to get a better idea of the distribution of each. Age is approximately normally distributed, with a median age of 36 and a mean age less than a year higher. It’s natural to divide age at 36 so that you get two representative groups (so that differences make sense). Income is more problematic. IBM would have a high Gini coefficient (a measure of inequality, not the Gini impurity used in decision trees): around half of all employees make 5,000 or less a month (the median is 4,919), but the mean is substantially higher (6,502.93).
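To illustrate why the mean sits well above the median in a right-skewed distribution, here is a small sketch with made-up incomes. The numbers are invented and are not the IBM values.

```r
# Toy right-skewed incomes (not the IBM values).
income <- c(2500, 3000, 3500, 4000, 4500, 5000, 5500, 9000, 15000, 19000)

median_income <- median(income)  # middle of the distribution
mean_income   <- mean(income)    # pulled upward by the few large salaries

# How many employees earn more than each statistic?
above_median <- sum(income > median_income)
above_mean   <- sum(income > mean_income)
c(median = median_income, mean = mean_income,
  above_median = above_median, above_mean = above_mean)
```

In this toy example half of the employees earn more than the median, but only three of ten earn more than the mean.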
Subsetting this feature by the mean is problematic because the two populations won’t be roughly equal in number of employees and more extreme values will dominate the “richer” group. I decided that it would be better to divide the group based on the median. This decision was made not only because of the drawbacks of dividing by the mean but also out of domain expertise. People tend to be happier if they are wealthier than those around them. In this case, I’m saying that such happiness is related to retention and that absolute income is not what is going to have the most pull. Anyone who earns more than 5,000 a month can look down on most of their coworkers.
After I filtered/subsetted the data, I decided to test whether attrition differed between these four groups and whether income or age played a stronger role in influencing retention. We would hypothesize that if young rich people had attrition rates similar to (and as low as) those of their older rich peers, and the opposite was the case for poor individuals, then income has a stronger relationship with attrition than age does. In that case, the only reason age is associated with attrition is that as one ages, one’s income increases with each promotion.
On the other hand, if the opposite is the case, then age is the variable most associated with attrition and income is just something that happens to increase without having any real direct association. This all has policy implications for what the company should do to reduce attrition: if income is the factor most strongly associated with attrition, then greater retention of employees could be accomplished simply by increasing their income to a certain level.
If age is most associated with attrition, then other variables either outside or inside the company's control are relevant (culture, opportunity elsewhere, the tendency for younger people to switch careers due to less attachment to a career). These, of course, are not captured by the data at hand and may involve the need for new data capture.
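A median split into the four groups could be sketched as follows. The data, the column names, and the choice to put values exactly equal to a median into the "Young" and "Poor" groups are all assumptions for illustration.

```r
# Toy data; column names and values are assumptions.
hr <- data.frame(
  Age           = c(25, 30, 33, 36, 38, 45, 50, 58),
  MonthlyIncome = c(2500, 7000, 3000, 4000, 9000, 3500, 12000, 15000)
)

# Cut points: the median of each variable.
age_cut    <- median(hr$Age)
income_cut <- median(hr$MonthlyIncome)

# Label each employee by which side of each median they fall on.
# Assumption: ties with the median go to "Young" and "Poor".
age_label    <- ifelse(hr$Age > age_cut, "Old", "Young")
income_label <- ifelse(hr$MonthlyIncome > income_cut, "Rich", "Poor")
hr$Group <- paste0(age_label, income_label)

# Size of each of the four groups.
table(hr$Group)
```

Because age and income are correlated, the four groups will generally not be the same size even though each variable is split at its median.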
Sampled Attrition Differences
Shown here are 4 randomly sampled groups drawn from the original data. Having a sample of randomly selected values from a population has its downsides. It doesn’t always reflect the actual population. In this case, each group has 250 individuals, so that is a large sample and is nearly the entire population for some of the original groups. The advantage of comparing equal-sized samples is that one can feel comfortable that the differences in question are not an artifact of group size.
The sample group YoungPoor2 doesn’t show higher attrition than the other groups because it is a larger group (and therefore has more values in general); every sample is the same size. As a random sample, it simply has higher attrition levels. You, of course, can also look at the ratio within each group to get a better understanding. I considered creating a stacked bar graph, but that can be confusing as well. What is clear is that the outcome is somewhat expected. The young poor group is the most likely to leave by far. All other groups are far less likely to leave, meaning that this is the most vulnerable population for attrition.
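Drawing the same number of rows from each group can be sketched in base R as below. The groups, their sizes, and the attrition probabilities are synthetic and chosen only for illustration; only the sample size of 250 comes from the text.

```r
set.seed(42)  # reproducible draw

# Synthetic groups of unequal size; attrition coded 1 = left, 0 = stayed.
hr <- data.frame(
  Group = rep(c("YoungPoor", "OldRich"), times = c(400, 300)),
  Left  = c(rbinom(400, 1, 0.3), rbinom(300, 1, 0.1))
)

sample_size <- 250  # same size as in the text

# Draw the same number of rows from each group, without replacement.
sampled <- do.call(rbind, lapply(split(hr, hr$Group), function(g) {
  g[sample(nrow(g), sample_size), ]
}))

# Every group now contributes the same number of rows.
table(sampled$Group)

# Share who left within each sampled group.
tapply(sampled$Left, sampled$Group, mean)
```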
Actual Ratios in Attrition
Total ratio for the company: 0.192
YoungRich: 0.144
YoungPoor: 0.337
OldPoor: 0.162
OldRich: 0.093
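A note on reading these figures: the company-wide value of 0.192 is consistent with the 237 leavers divided by the 1,233 stayers, which is a leaver-to-stayer ratio rather than a share of the whole workforce. Whether the four subgroup figures were computed the same way is an assumption. The arithmetic for the company-wide number is:

```r
left   <- 237    # employees who left, from the cleaning step
stayed <- 1233   # employees who stayed

# Leavers per stayer.
ratio_left_to_stayed <- left / stayed

# Leavers as a fraction of all employees.
share_of_workforce <- left / (left + stayed)

round(ratio_left_to_stayed, 3)
round(share_of_workforce, 3)
```

The two numbers answer different questions, so benchmarks from other companies should be compared against whichever definition they use.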
The next potential step would be to use a classification algorithm, most likely a logistic regression, to classify whether someone is going to leave or not. A regression model and a non-parametric ANOVA/t-test could also be used to make better sense of variables associated with income. To do a full-scale classification prediction, dimension reduction (PCA) could be used on the highly correlated variables such as (Income, Job Rank… to Age).
Ultimately, it is clear that IBM employees in the YoungPoor group, employees working overtime, and those with lower job involvement were more likely to leave their jobs at IBM. This isn’t too surprising, but I expected other variables, like gender, to be equally powerful. There is much more to explore in the dataset. For example, perhaps occupation doesn’t correlate with attrition directly but is indirectly related to it through monthly income.
Other next steps would be to benchmark these findings against other organizations to see whether they are similar, and to find ways to minimize attrition: pay young, valued employees more, discover the underlying issues with job involvement, and find ways to minimize overtime or its effect on attrition.
In summary, I found that the groups most likely to leave IBM were employees grouped as YoungPoor based on median income and median age, employees who worked overtime, and, to a certain degree, those with lower job involvement. The company can now know which employees are most vulnerable to leaving and can take measures to prevent the catastrophic costs of high attrition. Thus IBM can continue to survive and thrive well into the future.
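As a rough sketch of what the logistic regression step could look like, the example below fits a model with base R's glm() on synthetic data. This is not a model of the IBM data: the column names, the effect sizes, and the 0.5 cut-off are all assumptions made for illustration.

```r
set.seed(7)
n <- 500

# Synthetic data; the effect sizes below are invented for illustration.
hr <- data.frame(
  Age      = round(runif(n, 18, 60)),
  OverTime = rbinom(n, 1, 0.3)  # 1 = works overtime
)
log_odds <- -0.5 - 0.04 * (hr$Age - 36) + 1.2 * hr$OverTime
hr$Left  <- rbinom(n, 1, plogis(log_odds))  # 1 = left, 0 = stayed

# Logistic regression: model the probability of leaving.
fit <- glm(Left ~ Age + OverTime, data = hr, family = binomial)
summary(fit)$coefficients

# Predicted probability of leaving, then a 0.5 cut-off to classify.
hr$ProbLeft  <- predict(fit, type = "response")
hr$Predicted <- as.integer(hr$ProbLeft > 0.5)

# In-sample accuracy (optimistic; a held-out set would be more honest).
mean(hr$Predicted == hr$Left)
```

With roughly 16 percent of employees leaving in the real data, accuracy alone would be a weak yardstick, since always predicting "stays" already scores well.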