DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is the iterative process of developing and retaining employees; it may include workforce planning, employee training programs, identifying high-potential employees, and reducing or preventing voluntary employee turnover (attrition). To gain a competitive edge, DDSAnalytics plans to leverage data science for talent management, and its executive leadership has identified predicting employee turnover as the first application. Before the business green-lights the project, we have been tasked with analyzing existing employee data. This R Markdown document presents a detailed statistical analysis of the given dataset and contains the code, plots, and hypothesis tests with conclusions. It also contains the code and output used to build the predictive models requested by the talent management firm.
Along with this R code, we also built a Shiny app for EDA and interactive plots, hosted on the web. Please visit the app at:
https://sachinac.shinyapps.io/msds_rshiny_cs02/
The dataset contains records of 870 employees and 36 attributes that can be utilized to find patterns of attrition and to build a model to predict attrition. The data are clean, and our initial investigation found nothing suspicious. For more details, please keep reading the following sections.
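For context, the load-and-sanity-check step looks like the sketch below; the CSV file name is an assumption, not taken from this report.

# Minimal load-and-check sketch (file name assumed)
msds_cs02_ds <- read.csv("CaseStudy2-data.csv", stringsAsFactors = TRUE)
dim(msds_cs02_ds)          # expect 870 rows and 36 columns
sum(is.na(msds_cs02_ds))   # 0 would support the "data is clean" claim
str(msds_cs02_ds)          # structure shown below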
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
Let's take a closer look at the structure of the dataset.
We can categorize the fields as follows:
Several ordinal predictors are stored as numeric fields in the dataset; we convert them to factors so that they are treated as categorical variables.
msds_cs02_ds$Education <- as.factor(msds_cs02_ds$Education)
msds_cs02_ds$EnvironmentSatisfaction <- as.factor(msds_cs02_ds$EnvironmentSatisfaction)
msds_cs02_ds$JobInvolvement <- as.factor(msds_cs02_ds$JobInvolvement)
msds_cs02_ds$JobLevel <- as.factor(msds_cs02_ds$JobLevel)
msds_cs02_ds$JobSatisfaction <- as.factor(msds_cs02_ds$JobSatisfaction)
msds_cs02_ds$PerformanceRating <- as.factor(msds_cs02_ds$PerformanceRating)
msds_cs02_ds$RelationshipSatisfaction <- as.factor(msds_cs02_ds$RelationshipSatisfaction)
msds_cs02_ds$StockOptionLevel <- as.factor(msds_cs02_ds$StockOptionLevel)
msds_cs02_ds$WorkLifeBalance <- as.factor(msds_cs02_ds$WorkLifeBalance)
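The same nine conversions can be written more compactly as a loop; a sketch using the same column names:

# Loop form of the factor conversions above
ordinal_cols <- c("Education", "EnvironmentSatisfaction", "JobInvolvement",
                  "JobLevel", "JobSatisfaction", "PerformanceRating",
                  "RelationshipSatisfaction", "StockOptionLevel", "WorkLifeBalance")
msds_cs02_ds[ordinal_cols] <- lapply(msds_cs02_ds[ordinal_cols], as.factor)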
Final structure used to continue with the analysis:
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 2 1 1 3 1 3 1 4 2 2 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
# Create list for different types of variables for later use
non_predictors <- c('ID','EmployeeNumber','EmployeeCount','StandardHours','Over18')
nom_qual_predictors <- c('BusinessTravel','Department','EducationField',
'Gender','JobRole','MaritalStatus','OverTime')
ord_qual_predictors <- c('Education','EnvironmentSatisfaction','JobInvolvement','JobLevel',
'JobSatisfaction','PerformanceRating','RelationshipSatisfaction',
'StockOptionLevel','WorkLifeBalance')
num_predictors <- c('Age','DailyRate','DistanceFromHome','HourlyRate',
'MonthlyRate','NumCompaniesWorked','PercentSalaryHike','TotalWorkingYears',
'TrainingTimesLastYear','YearsAtCompany','YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager')
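As a quick sanity check, these four lists plus the two targets should account for all 36 columns; a sketch:

# Columns not covered by the four predictor lists
setdiff(names(msds_cs02_ds),
        c(non_predictors, nom_qual_predictors, ord_qual_predictors, num_predictors))
# expected leftovers: "Attrition" (the classification target) and
# "MonthlyIncome" (kept aside as the regression target later)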
Let's start with some basic insights from the dataset.
Number of employees per department: as the pie chart clearly shows, Research & Development is the largest department at DDSAnalytics, followed by Sales and Human Resources.
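The pie-chart chunk is not echoed above; a minimal reconstruction using the dplyr and plotly calls already loaded in this report:

# Sketch of the employees-per-department pie
dept_counts <- msds_cs02_ds %>% count(Department)
plot_ly(dept_counts, labels = ~Department, values = ~n, type = "pie") %>%
  layout(title = "Employees per Department")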
Mean age across the organization is 36.8.
Median age across the organization is 35.
Mean age of employees who left (attrition) is 33.8.
Median age of employees who left is 32.
age_Attrition <- msds_cs02_ds %>% filter(Attrition=='Yes') %>% select(Age)
age_non_attrition <- msds_cs02_ds %>% filter(Attrition=='No') %>% select(Age)
p <- plot_ly(alpha = 0.6) %>%
  add_histogram(x = ~age_non_attrition$Age) %>%
  add_histogram(x = ~age_Attrition$Age) %>%
  layout(title = 'Histogram',
         barmode = "overlay", showlegend = F,
         xaxis = list(title = "Attrition (Yellow) and No Attrition (Blue)"))
p
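The four values printed below are the mean and median ages, overall and for the attrition subset; the original chunk is not echoed, so here is a reconstruction:

mean(msds_cs02_ds$Age)      # overall mean age
median(msds_cs02_ds$Age)    # overall median age
mean(age_Attrition$Age)     # mean age among employees who left
median(age_Attrition$Age)   # median age among employees who left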
## [1] 36.82874
## [1] 35
## [1] 33.78571
## [1] 32
Let's analyze attrition in the dataset.
Total employees - 870
Employees with attrition - 140
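Both counts come straight from the Attrition factor:

table(msds_cs02_ds$Attrition)  # No: 730, Yes: 140 -- an attrition rate of ~16%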
The plot below shows the age-wise attrition pattern in the data. Attrition is higher for ages 18-30 and 55+.
df_RandD <- sqldf("select Attrition,count(*) Attrition_count from msds_cs02_ds where Department like 'Research%' group by Attrition")
df_RandD
df_Sales <- sqldf("select Attrition,count(*) Attrition_count from msds_cs02_ds where Department like 'Sales%' group by Attrition")
df_Sales
df_hr <- sqldf("select Attrition,count(*) Attrition_count from msds_cs02_ds where Department like 'Human%' group by Attrition")
df_hr
p <- plot_ly() %>%
add_pie(data = df_RandD ,labels = ~Attrition, values = ~Attrition_count,
title = "Research & Development", domain = list(x = c(0, 0.4), y = c(0.4, 1))) %>%
add_pie(data = df_Sales, labels = ~Attrition, values = ~Attrition_count,
title = "Sales", domain = list(x = c(0.6, 1), y = c(0.4, 1))) %>%
add_pie(data = df_hr, labels = ~Attrition, values = ~Attrition_count,
title = "Human Resource", domain = list(x = c(0.25, 0.75), y = c(0, 0.6))) %>%
layout(title = "Departmentwise Attrition", showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
This plot shows attrition by job role, taking department into account.
The following job roles are the top contributors to attrition.
df <- sqldf("select Department,JobRole,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by Department,JobRole order by Attrition_count desc")
p <- ggplotly(ggplot(data=df, aes(x=Department, y=Attrition_count, fill=JobRole)) +
geom_bar(stat="identity", position=position_dodge())+
geom_text(aes(label=Attrition_count),vjust = 2.1,position=position_dodge(width=0.8))+
theme_bw()+
theme(axis.text.x = element_text(angle=45, vjust=0.6)))
p
This plot shows attrition by job role without taking department into account.
df <- sqldf("select JobRole,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobRole order by Attrition_count desc")
p <- df %>%
plot_ly(labels = ~JobRole, values = ~Attrition_count) %>%
add_pie(hole = 0.6) %>%
layout(title = "Attrition by Job Role", showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
df <- sqldf("select Department,Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by Department,Gender order by Attrition_count desc")
df
p <- ggplotly(ggplot(data=df, aes(x=Department, y=Attrition_count, fill=Gender)) +
geom_bar(stat="identity", position=position_dodge())+
geom_text(aes(label=Attrition_count),vjust = 2.1,position=position_dodge(width=0.5))+
theme_bw()+
theme(axis.text.x = element_text(angle=45, vjust=0.5)))
p
df <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by Gender order by Attrition_count ")
df
df_RandD <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' and Department like 'Research%' group by Gender")
df_RandD
df_Sales <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' and Department like 'Sales%' group by Gender")
df_Sales
df_hr <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' and Department like 'Human%' group by Gender")
df_hr
p <- plot_ly() %>%
add_pie(data = df_RandD ,labels = ~Gender, values = ~Attrition_count,
title = "Research & Development", domain = list(x = c(0, 0.4), y = c(0.4, 1))) %>%
add_pie(data = df_Sales, labels = ~Gender, values = ~Attrition_count,
title = "Sales", domain = list(x = c(0.6, 1), y = c(0.4, 1))) %>%
add_pie(data = df_hr, labels = ~Gender, values = ~Attrition_count,
title = "Human Resource", domain = list(x = c(0.25, 0.75), y = c(0, 0.6))) %>%
layout(title = "Departmentwise Attrition", showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
Here are the top four factors that have contributed to attrition.
df_travel <- sqldf("select BusinessTravel,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by BusinessTravel ")
df_joblevel <- sqldf("select JobLevel,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobLevel")
rownames(df_joblevel) <- c('JobLevel1','JobLevel2','JobLevel3','JobLevel4','JobLevel5')
df_stocklevel <- sqldf("select StockOptionLevel,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by StockOptionLevel")
rownames(df_stocklevel) <- c('Stock0','Stock1','Stock2','Stock3')
df_overtime <- sqldf("select OverTime,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by OverTime")
p1 <- plot_ly() %>%
add_pie(data = df_travel ,labels = ~BusinessTravel, values = ~Attrition_count,
title = "BusinessTravel", domain = list(row = 0, column = 0)) %>%
add_pie(data = df_joblevel, labels = ~rownames(df_joblevel), values = ~Attrition_count,
title = "JobLevel", domain = list(row = 0, column = 1)) %>%
add_pie(data = df_overtime ,labels = ~OverTime, values = ~Attrition_count,
title = "Overtime", domain = list(row = 1, column = 0)) %>%
add_pie(data = df_stocklevel, labels = ~rownames(df_stocklevel), values = ~Attrition_count,
title = "StockOptionLevel", domain = list(row = 1, column = 1)) %>%
layout(title = "Attrition Rate", showlegend = T,
grid=list(rows=2, columns=2),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p1
df_jobinv <- sqldf("select JobInvolvement,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobInvolvement")
p <- plot_ly(df_jobinv, labels = ~JobInvolvement, values = ~Attrition_count, type = 'pie') %>%
layout(title = 'Attrition rate for job Involvement',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
The dataset contains a JobSatisfaction field, from which we can get a few insights about what makes employees more satisfied. We use proportion plots here to get a clear picture, so let's see.
Assumed meaning of the satisfaction levels:
Level 1 - highly dissatisfied
Level 2 - dissatisfied
Level 3 - satisfied
Level 4 - highly satisfied
Fisher's test on this feature tells us that at least one proportion is significantly different from the others (p-value < 0.05); see the hypothesis-test section later in this markdown. The first three levels contribute about 80% of the attrition, though, as the quick check below shows.
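A quick way to verify the 80% figure from the raw counts (a sketch):

# Cumulative share of attrition by job-satisfaction level
sat_attr <- table(msds_cs02_ds$JobSatisfaction[msds_cs02_ds$Attrition == "Yes"])
round(cumsum(prop.table(sat_attr)), 2)  # levels 1-3 should reach ~0.80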
df <- msds_cs02_ds %>% filter(Attrition=='Yes') %>% count(JobSatisfaction)
p <- plot_ly(df, labels = ~JobSatisfaction, values = ~n, type = 'pie') %>%
layout(title = 'Attrition by job satisfaction level',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
That is what the density plot shows: the highs and lows of the satisfaction levels across all income levels.
As one might expect, employees who live closer to the office are more satisfied than those who live far away. The spinogram and density plot show the same; a sketch of the spinogram follows.
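The spinogram chunk itself is not echoed; a minimal sketch with base graphics (spineplot accepts a factor response against a numeric predictor):

# Spinogram: job satisfaction vs. distance from home
spineplot(JobSatisfaction ~ DistanceFromHome, data = msds_cs02_ds,
          main = "Job satisfaction by distance from home")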
The lower the age, the higher the attrition.
The lower the MonthlyIncome, the higher the attrition.
The higher the NumCompaniesWorked, the higher the attrition.
The less the experience, the higher the attrition.
opar <- par(mfrow = c(2, 3))  # arrange the cd_plots in a 2 x 3 grid
vcd::cd_plot(Attrition ~ Age, data = msds_cs02_ds, main = "Age")
# (cd_plots for the remaining numeric predictors fill the other panels)
par(opar)                     # restore graphics settings after plotting
We run hypothesis tests to compare the levels of the different factors.
Null hypothesis: all levels of the factor have the same effect on attrition.
Alternative: at least one level has a different effect on attrition than the other levels.
# Fisher's exact test of each ordinal predictor against Attrition (column 3)
for (predictor in ord_qual_predictors) {
  index <- match(predictor, names(msds_cs02_ds))
  hypothesis_test <- fisher.test(table(msds_cs02_ds[,c(index,3)]), simulate.p.value = TRUE)
  print(paste("Feature ", predictor, " p-value ", round(hypothesis_test$p.value, 2)))
}
## [1] "Feature Education p-value 0.64"
## [1] "Feature EnvironmentSatisfaction p-value 0.01"
## [1] "Feature JobInvolvement p-value 0"
## [1] "Feature JobLevel p-value 0"
## [1] "Feature JobSatisfaction p-value 0.01"
## [1] "Feature PerformanceRating p-value 0.7"
## [1] "Feature RelationshipSatisfaction p-value 0.37"
## [1] "Feature StockOptionLevel p-value 0"
## [1] "Feature WorkLifeBalance p-value 0.01"
Fisher's exact test was performed; from the p-values above we conclude that the following factors are significant (p-value < 0.05), i.e., at least one level of each factor differs from the others: EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, StockOptionLevel, and WorkLifeBalance.
These variables need to be included in model selection.
# Same test for the nominal predictors (columns swapped so 'Yes' comes first)
for (predictor in nom_qual_predictors) {
  index <- match(predictor, names(msds_cs02_ds))
  hypothesis_test <- fisher.test(table(msds_cs02_ds[,c(index,3)])[,c(2,1)], simulate.p.value = TRUE)
  print(paste("Feature ", predictor, " p-value ", round(hypothesis_test$p.value, 2)))
}
## [1] "Feature BusinessTravel p-value 0.06"
## [1] "Feature Department p-value 0.01"
## [1] "Feature EducationField p-value 0.23"
## [1] "Feature Gender p-value 0.51"
## [1] "Feature JobRole p-value 0"
## [1] "Feature MaritalStatus p-value 0"
## [1] "Feature OverTime p-value 0"
Fisher's exact test was performed; from the p-values above we conclude that the following factors are significant (p-value < 0.05), i.e., at least one level of each factor differs from the others: Department, JobRole, MaritalStatus, and OverTime.
These variables need to be included in model selection as well.
Key takeaways from EDA
From the analysis, a few predictors such as Age, BusinessTravel, JobLevel, and OverTime appear highly important, as they are major contributors to attrition. Other fields also seem to show some variation with attrition (per the density plots), so we will keep all predictors to build the model and go from there.
This heatmap shows that multicollinearity exists among the numerical predictors. We examined the correlated variables and removed them manually (an automated check with caret follows the list):
* Age and TotalWorkingYears
* YearsAtCompany and YearsInCurrentRole
* YearsAtCompany and YearsWithCurrManager
* YearsWithCurrManager and YearsInCurrentRole
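caret's findCorrelation can automate the same check; a sketch (the removals in this report were done manually):

# Flag numeric predictors with pairwise |correlation| above 0.7
library(caret)
findCorrelation(cor(msds_cs02_ds[, num_predictors]), cutoff = 0.7, names = TRUE)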
# Correlation matrix of the numeric predictors, rounded to one decimal
corr <- round(cor(msds_cs02_ds[,num_predictors]), 1)
corrplot(corr)
trace1 <- list(
  uid = "b972e66a-6d9c-4cd1-a8fc-c533178a171b",
  type = "heatmap",
  x = num_predictors,
  y = rev(num_predictors),
  z = corr[rev(num_predictors), ]  # reverse the rows to match the reversed y labels
)
layout <- list(
  title = "Correlations Heat Map",
  xaxis = list(
    side = "top",
    ticks = "outside"
  )
)
p <- plot_ly()
p <- add_trace(p, uid = trace1$uid, type = trace1$type, x = trace1$x, y = trace1$y, z = trace1$z)
p <- layout(p, title = layout$title, xaxis = layout$xaxis)
p
Here are the top five predictors according to the random forest model:
* Overtime
* MonthlyIncome
* JobRole
* Age
* TotalWorkingYears
#
# Predictors to be retained from EDA
#
# Drop the non-significant nominal factors (BusinessTravel, EducationField, Gender)
fin_nom_qual_predictors <- nom_qual_predictors[-c(1,3,4)]
# Drop the non-significant ordinal factors (Education, PerformanceRating, RelationshipSatisfaction)
fin_ord_qual_predictors <- ord_qual_predictors[-c(1,6,7)]
# Drop NumCompaniesWorked and the collinear years/experience variables
fin_num_predictors <- num_predictors[-c(6,8,10,11,12,13)]
msds_cs02_mds <- msds_cs02_ds[, c(fin_nom_qual_predictors,fin_ord_qual_predictors,fin_num_predictors,"MonthlyIncome","Attrition")]
sample_n(msds_cs02_mds, 4)
# Split the data into training and test set
set.seed(123)
training.samples <- msds_cs02_mds$Attrition %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- msds_cs02_mds[training.samples, ]
test.data <- msds_cs02_mds[-training.samples, ]
# Class weights (computed for reference; the train() call below uses down-sampling instead)
model_weights <- ifelse(train.data$Attrition == "No",
                        (1/table(train.data$Attrition)[1]) * 0.5,
                        (1/table(train.data$Attrition)[2]) * 0.5)
# Fit the model on the training set
set.seed(123)
model <- train(
Attrition ~., data = train.data, method = "rf",
trControl = trainControl("cv", number = 10,sampling="down"),
preProcess = c("center","scale"),
# weights = model_weights,
importance = TRUE
)
# Best tuning parameter
model$bestTune
# The final fitted forest (its summary is printed below)
model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 32.14%
## Confusion matrix:
## No Yes class.error
## No 73 39 0.3482143
## Yes 33 79 0.2946429
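The class predictions and test-set accuracy printed below come from a chunk that is not echoed; a minimal reconstruction:

# Predict on the held-out test set and compute raw accuracy
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
mean(predicted.classes == test.data$Attrition)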
## [1] No No Yes No No No
## Levels: No Yes
## [1] 0.7241379
predicted.classes <- relevel(predicted.classes,ref="Yes")
test.data$Attrition <- relevel(test.data$Attrition ,ref="Yes")
CM <- confusionMatrix(table(predicted.classes, test.data$Attrition))
CM
## Confusion Matrix and Statistics
##
##
## predicted.classes Yes No
## Yes 23 43
## No 5 103
##
## Accuracy : 0.7241
## 95% CI : (0.6514, 0.7891)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3403
##
## Mcnemar's Test P-Value : 9.27e-08
##
## Sensitivity : 0.8214
## Specificity : 0.7055
## Pos Pred Value : 0.3485
## Neg Pred Value : 0.9537
## Prevalence : 0.1609
## Detection Rate : 0.1322
## Detection Prevalence : 0.3793
## Balanced Accuracy : 0.7635
##
## 'Positive' Class : Yes
##
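The importance table below is caret's ranking for the fitted rf model; the call (not echoed above) is simply:

varImp(model)  # scaled 0-100 variable importance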
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Importance
## StockOptionLevel1 100.00
## OverTimeYes 93.15
## JobLevel4 92.55
## Age 76.12
## JobRoleSales Representative 74.06
## MonthlyIncome 65.33
## MaritalStatusSingle 65.04
## DistanceFromHome 61.12
## DepartmentResearch & Development 54.44
## JobRoleManufacturing Director 53.75
## DepartmentSales 49.99
## JobLevel2 45.76
## StockOptionLevel2 44.41
## JobRoleManager 43.99
## JobInvolvement4 40.86
## DailyRate 40.06
## TrainingTimesLastYear 39.59
## JobRoleSales Executive 37.17
## JobRoleHuman Resources 34.66
## EnvironmentSatisfaction2 33.29
#
# This library required for one hot encoding to convert categorical variables into numeric
#
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
#
# Create separate dataframe of 30 predictors 1 target
#
knn_with_target_ds <- msds_cs02_ds[,c(fin_num_predictors,fin_nom_qual_predictors,fin_ord_qual_predictors,"MonthlyIncome","Attrition")]
knn_hot_encoded_ds <- dummy.data.frame(knn_with_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
set.seed(123)
training.samples <- knn_hot_encoded_ds$Attrition %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- knn_hot_encoded_ds[training.samples, ]
test.data <- knn_hot_encoded_ds[-training.samples, ]
#res.pca <- PCA(msds_cs02_ds[,fin_num_predictors], graph = FALSE,ncp=20)
#res.pca$eig
#corrplot(res.pca$var$cos2, is.corr=FALSE)
#fviz_contrib(res.pca, choice = "var", axes = 1:7)
# Fit the model on the training set
set.seed(123)
model <- train(
Attrition ~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 10,sampling="smote"),
preProcess = c("center","scale"),
tuneLength = 20
)
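The predictions printed below come from a chunk that is not echoed; a minimal reconstruction:

# Predict attrition for the held-out test set
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)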
## [1] No No No No No No
## Levels: No Yes
predicted.classes <- relevel(predicted.classes,ref="Yes")
test.data$Attrition <- relevel(test.data$Attrition,ref="Yes")
# Compute model accuracy rate
mean(predicted.classes == test.data$Attrition)
## [1] 0.7471264
## Confusion Matrix and Statistics
##
##
## predicted.classes Yes No
## Yes 19 35
## No 9 111
##
## Accuracy : 0.7471
## 95% CI : (0.6758, 0.8099)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 0.999336
##
## Kappa : 0.3191
##
## Mcnemar's Test P-Value : 0.000164
##
## Sensitivity : 0.6786
## Specificity : 0.7603
## Pos Pred Value : 0.3519
## Neg Pred Value : 0.9250
## Prevalence : 0.1609
## Detection Rate : 0.1092
## Detection Prevalence : 0.3103
## Balanced Accuracy : 0.7194
##
## 'Positive' Class : Yes
##
# ROC curve from the predicted classes (prediction/performance come from ROCR)
library(ROCR)
pred <- prediction(as.numeric(if_else(predicted.classes=='Yes',1,0)), as.numeric(if_else(test.data$Attrition=='Yes',1,0)))
nb.prff <- performance(pred, "tpr", "fpr")
plot(nb.prff, main = "ROC Curve")
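kNN has no model-specific importance measure, so caret falls back to an AUC-based ranking; the table below comes from the (not echoed) call:

varImp(model)  # prints as "ROC curve variable importance" for knn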
## ROC curve variable importance
##
## only 20 most important variables shown (out of 50)
##
## Importance
## OverTimeYes 100.00
## OverTimeNo 100.00
## StockOptionLevel0 88.84
## MonthlyIncome 87.95
## JobLevel1 81.27
## StockOptionLevel1 75.04
## Age 66.94
## MaritalStatusSingle 59.96
## JobLevel2 54.64
## DepartmentResearch & Development 51.45
## DepartmentSales 46.74
## JobRoleSales Representative 43.59
## DistanceFromHome 42.34
## MaritalStatusDivorced 40.72
## JobInvolvement1 35.14
## EnvironmentSatisfaction1 34.86
## JobInvolvement3 34.42
## TrainingTimesLastYear 33.22
## JobSatisfaction4 32.72
## JobRoleManufacturing Director 27.39
#
# Create separate dataframe of 30 predictors 1 target
#
nb_with_target_ds <- msds_cs02_ds[,c(fin_num_predictors,fin_nom_qual_predictors,fin_ord_qual_predictors,"Attrition")]
nb_hot_encoded_ds <- dummy.data.frame(nb_with_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
# Inspect the data
sample_n(nb_with_target_ds, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- nb_hot_encoded_ds$Attrition %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- nb_hot_encoded_ds[training.samples, ]
test.data <- nb_hot_encoded_ds[-training.samples, ]
# Fit the model
#model <- NaiveBayes(Attrition ~., data = train.data)
#model
# Make predictions
#predicted.classes <- model %>% predict(test.data)
#pred <- prediction(pred_nb[, 2], test_color$Style)
# Model accuracy
#mean(predicted.classes$class == test.data$Attrition)
set.seed(123)
model <- train(Attrition ~., data = train.data, method = "nb",
trControl = trainControl("cv", number = 10,sampling="down")
)
# Make predictions
predicted.classes <- model %>% predict(test.data)
predicted.classes <- relevel(predicted.classes,ref="Yes")
test.data$Attrition <- relevel(test.data$Attrition,ref="Yes")
# Compute model accuracy rate
mean(predicted.classes == test.data$Attrition)
## [1] 0.683908
## Confusion Matrix and Statistics
##
##
## predicted.classes Yes No
## Yes 23 50
## No 5 96
##
## Accuracy : 0.6839
## 95% CI : (0.6092, 0.7522)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2904
##
## Mcnemar's Test P-Value : 2.975e-09
##
## Sensitivity : 0.8214
## Specificity : 0.6575
## Pos Pred Value : 0.3151
## Neg Pred Value : 0.9505
## Prevalence : 0.1609
## Detection Rate : 0.1322
## Detection Prevalence : 0.4195
## Balanced Accuracy : 0.7395
##
## 'Positive' Class : Yes
##
## ROC curve variable importance
##
## only 20 most important variables shown (out of 49)
##
## Importance
## OverTimeYes 100.00
## OverTimeNo 100.00
## StockOptionLevel0 88.84
## JobLevel1 81.27
## StockOptionLevel1 75.04
## Age 66.94
## MaritalStatusSingle 59.96
## JobLevel2 54.64
## DepartmentResearch & Development 51.45
## DepartmentSales 46.74
## JobRoleSales Representative 43.59
## DistanceFromHome 42.34
## MaritalStatusDivorced 40.72
## JobInvolvement1 35.14
## EnvironmentSatisfaction1 34.86
## JobInvolvement3 34.42
## TrainingTimesLastYear 33.22
## JobSatisfaction4 32.72
## JobRoleManufacturing Director 27.39
## EnvironmentSatisfaction2 20.80
#
# Create separate dataframe of 30 predictors 1 target
#
knn_reg_target_ds <- msds_cs02_ds[,c(fin_num_predictors,fin_nom_qual_predictors,fin_ord_qual_predictors, "MonthlyIncome")]
kg_hot_encoded_ds <- dummy.data.frame(knn_reg_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
# Split the data into training and test set
set.seed(123)
training.samples <- kg_hot_encoded_ds$MonthlyIncome %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- kg_hot_encoded_ds[training.samples, ]
test.data <- kg_hot_encoded_ds[-training.samples, ]
# Fit the model on the training set
set.seed(123)
model_rg <- train( MonthlyIncome~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 20),
preProcess = c("center","scale"),
tuneLength = 20
)
# Plot model error (RMSE) vs. different values of k
plot(model_rg)
# Best tuning parameter k that minimizes the RMSE
model_rg$bestTune
# Make predictions on the test data
predictions <- model_rg %>% predict(test.data)
head(predictions)
## [1] 5265.4 18979.4 6759.0 3015.6 5334.8 15739.2
## [1] 1565.399
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$MonthlyIncome),
Rsquare = R2(predictions, test.data$MonthlyIncome)
)
p <- plot_ly( x = ~predictions, y = ~test.data$MonthlyIncome,
              type = "scatter", mode = "markers",
              marker = list(size = 10,
                            color = 'rgba(255, 182, 193, .9)',
                            line = list(color = 'rgba(152, 0, 0, .8)',
                                        width = 2))) %>%
  layout(title = 'Predicted vs. Actual MonthlyIncome (kNN)',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p
#
# Create separate dataframe of 30 predictors 1 target
#
mlp_reg_target_ds <- msds_cs02_ds[,c(fin_num_predictors,nom_qual_predictors,fin_ord_qual_predictors, "MonthlyIncome")]
#mlp_hot_encoded_ds <- dummy.data.frame(mlp_reg_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
# Split the data into training and test set
set.seed(123)
training.samples <- mlp_reg_target_ds$MonthlyIncome %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- mlp_reg_target_ds[training.samples, ]
test.data <- mlp_reg_target_ds[-training.samples, ]
# Predictor variables
x <- model.matrix(MonthlyIncome~., train.data)[,-1]
# Outcome variable
y <- train.data$MonthlyIncome
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
## [1] 38.74357
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
## 48 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3196.4231798
## Age 16.5648076
## DailyRate 0.1024272
## DistanceFromHome .
## HourlyRate .
## MonthlyRate .
## PercentSalaryHike .
## TrainingTimesLastYear .
## BusinessTravelTravel_Frequently .
## BusinessTravelTravel_Rarely 125.3324523
## DepartmentResearch & Development .
## DepartmentSales .
## EducationFieldLife Sciences .
## EducationFieldMarketing .
## EducationFieldMedical .
## EducationFieldOther .
## EducationFieldTechnical Degree -16.9864853
## GenderMale .
## JobRoleHuman Resources -1058.9637181
## JobRoleLaboratory Technician -1134.9735365
## JobRoleManager 3224.7908177
## JobRoleManufacturing Director 41.7020537
## JobRoleResearch Director 3330.2678378
## JobRoleResearch Scientist -898.9129480
## JobRoleSales Executive .
## JobRoleSales Representative -1163.9441813
## MaritalStatusMarried .
## MaritalStatusSingle -10.5105598
## OverTimeYes .
## EnvironmentSatisfaction2 .
## EnvironmentSatisfaction3 .
## EnvironmentSatisfaction4 .
## JobInvolvement2 .
## JobInvolvement3 .
## JobInvolvement4 14.7333612
## JobLevel2 1679.9360048
## JobLevel3 5162.7923389
## JobLevel4 8752.6620728
## JobLevel5 11517.9229502
## JobSatisfaction2 .
## JobSatisfaction3 .
## JobSatisfaction4 28.3873359
## StockOptionLevel1 5.9321437
## StockOptionLevel2 .
## StockOptionLevel3 -7.9468728
## WorkLifeBalance2 .
## WorkLifeBalance3 3.9392092
## WorkLifeBalance4 .
# Make predictions on the test data
x.test <- model.matrix(MonthlyIncome ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$MonthlyIncome),
Rsquare = R2(predictions, test.data$MonthlyIncome)
)
p <- plot_ly( x = ~predictions, y = ~test.data$MonthlyIncome,
              type = "scatter", mode = "markers",
              marker = list(size = 10,
                            color = 'rgba(255, 182, 193, .9)',
                            line = list(color = 'rgba(152, 0, 0, .8)',
                                        width = 2))) %>%
  layout(title = 'Predicted vs. Actual MonthlyIncome (LASSO)',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p
The top six factors were determined by the random forest model; three of them match the EDA findings.
Attrition was concentrated in a few job roles: 23 Research Scientists, most of the Laboratory Technicians, and 33 Sales Representatives and Sales Executives; several of these employees were both overloaded (working overtime) and held no stock options.
Our model captured that pattern accurately.