1 Introduction

DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees, and reducing or preventing voluntary employee turnover (attrition). To gain an edge over its competition, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. Before the business green-lights the project, they have tasked us with conducting an analysis of existing employee data. This R Markdown document presents a detailed statistical analysis of the given datasets and contains the code, plots, and all hypothesis tests with their conclusions. It also contains the code, and its output, written to build the predictive models requested by the talent management firm.

Along with this R code, we also built an app for exploratory data analysis (EDA) with interactive plots, and published it on the web.

The app is available at this link:
https://sachinac.shinyapps.io/msds_rshiny_cs02/

2 Data Description

The dataset contains records of 870 employees and 36 attributes that can be utilized to find patterns of attrition and to build a model to predict attrition. The data is clean, and our initial investigation found nothing suspicious. The following sections give more detail.

2.2 Original Structure

## 'data.frame':    870 obs. of  36 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
##  $ EnvironmentSatisfaction : int  2 3 3 3 1 4 2 4 3 4 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : int  3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : int  2 5 3 3 1 3 1 2 1 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
##  $ JobSatisfaction         : int  4 3 4 4 4 1 3 4 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : int  2 1 2 1 1 1 2 2 1 1 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : int  3 3 3 3 3 4 3 3 3 3 ...
##  $ RelationshipSatisfaction: int  3 1 3 3 3 3 1 3 4 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  1 0 0 2 0 2 0 3 1 1 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : int  3 2 2 3 2 4 5 5 2 3 ...
##  $ WorkLifeBalance         : int  2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...

2.3 Data fields and Types

Let’s take a closer look at the structure of the dataset.

We can categorize the fields as follows:

  • Non Predictors
    • ID
    • EmployeeNumber
    • EmployeeCount
    • StandardHours
    • Over18
  • Nominal Categorical Predictors :
    • BusinessTravel
    • Department
    • EducationField
    • Gender
    • JobRole
    • MaritalStatus
    • OverTime
  • Ordinal Categorical Predictors :
    • Education
    • EnvironmentSatisfaction
    • JobInvolvement
    • JobLevel
    • JobSatisfaction
    • PerformanceRating
    • RelationshipSatisfaction
    • StockOptionLevel
    • WorkLifeBalance
  • Numerical Predictors :
    • Age
    • DailyRate
    • DistanceFromHome
    • HourlyRate
    • MonthlyIncome
    • MonthlyRate
    • NumCompaniesWorked
    • PercentSalaryHike
    • TotalWorkingYears
    • TrainingTimesLastYear
    • YearsAtCompany
    • YearsInCurrentRole
    • YearsSinceLastPromotion
    • YearsWithCurrManager
  • Response Variable :
    • Attrition (for classification model)
    • MonthlyIncome (for regression model)


The ordinal predictors are stored as integer fields in the dataset, whereas the nominal predictors are already factors. We therefore convert the ordinal fields to factors so that they are treated as categorical variables.
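The conversion described above can be sketched as follows. This is a minimal illustration and not the report's actual code: the data frame `emp` and its values are made up to stand in for the real dataset, and only three of the ordinal columns are shown.

```r
# Toy data frame standing in for the real dataset (values are illustrative only).
emp <- data.frame(
  Education        = c(4L, 3L, 2L, 4L, 1L),
  JobLevel         = c(2L, 5L, 3L, 3L, 1L),
  StockOptionLevel = c(1L, 0L, 0L, 2L, 0L),
  Age              = c(32L, 40L, 35L, 32L, 24L)
)

# Columns that hold rating codes rather than measured quantities.
ordinal_cols <- c("Education", "JobLevel", "StockOptionLevel")

# Convert each listed column to a factor; Age stays an integer.
emp[ordinal_cols] <- lapply(emp[ordinal_cols], factor)

str(emp)
```

Plain `factor()` matches the unordered factors shown in the final structure. If the ordering of the levels should be kept, `factor(x, ordered = TRUE)` is an alternative.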

2.4 Final Structure

Final structure to continue with analysis

## 'data.frame':    870 obs. of  36 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 4 1 2 5 4 4 4 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 4 2 4 3 4 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : Factor w/ 4 levels "1","2","3","4": 3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 5 3 3 1 3 1 2 1 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "1","2","3","4": 4 3 4 4 4 1 3 4 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : int  2 1 2 1 1 1 2 2 1 1 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 1 3 3 3 3 1 3 4 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 2 1 1 3 1 3 1 4 2 2 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : int  3 2 2 3 2 4 5 5 2 3 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...

3 Explore the data

Let’s start by getting some basic insights from the dataset.

3.1 Total Employees

Number of employees per department: as the pie chart shows, R&D is the largest department for DDSAnalytics, followed by Sales and HR.

3.2 Mean Age

Mean age of the organization is 37 (36.8)
Median age of the organization is 35
Mean age of employees who left (attrition) is 34 (33.8)
Median age of employees who left (attrition) is 32

## [1] 36.82874
## [1] 35
## [1] 33.78571
## [1] 32
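The four figures above are a mean and a median taken over all employees and over the attrition subset. A minimal sketch of that computation, using a made-up data frame `emp` in place of the real dataset:

```r
# Toy data standing in for the real dataset (values are illustrative only).
emp <- data.frame(
  Age       = c(32, 40, 35, 24, 27, 41),
  Attrition = factor(c("No", "No", "Yes", "Yes", "No", "No"),
                     levels = c("No", "Yes"))
)

# Mean and median age over all employees.
overall <- c(mean = mean(emp$Age), median = median(emp$Age))

# The same statistics restricted to employees who left.
leaver_age <- emp$Age[emp$Attrition == "Yes"]
leavers    <- c(mean = mean(leaver_age), median = median(leaver_age))

round(rbind(overall, leavers), 2)
```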

3.3 Attrition

Let’s analyze attrition in the dataset.

3.3.1 Attrition Rate

Total Employees - 870
Employees Attrition - 140
Attrition Rate - 16.1% (140 / 870)

3.3.2 Age

The plot below shows the age-wise attrition pattern in the data. Attrition is higher for the age ranges 18-30 and 55+.

3.3.3 Department

  • As shown in the pie chart
    • Sales has the highest attrition rate (21.6%), i.e. 59 out of 273 employees
    • Research and Development has the largest number of leavers, 75 out of 562 employees, but a lower rate (13.3%)
    • HR lost only 6 employees; on its small headcount of 35 (870 - 273 - 562) that is a rate of 17.1%
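The department figures above can be re-derived with a few lines of arithmetic. The sketch below uses only the counts quoted in this section. The HR headcount is not stated in the text and is an assumption here, taken as the remainder of the 870 employees across the three departments.

```r
# Counts quoted in this section; HR headcount is derived as the remainder.
dept <- data.frame(
  Department = c("Sales", "Research & Development", "Human Resources"),
  Leavers    = c(59, 75, 6),
  Headcount  = c(273, 562, 870 - 273 - 562)
)

# Attrition rate within each department (leavers / headcount).
dept$RatePct <- round(100 * dept$Leavers / dept$Headcount, 1)

# Share of all attrition cases that each department accounts for.
dept$SharePct <- round(100 * dept$Leavers / sum(dept$Leavers), 1)

# Sort by within-department rate, highest first.
dept[order(-dept$RatePct), ]
```

Keeping the within-department rate and the share of all leavers in separate columns helps, because the two measures rank the departments differently.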

3.3.4 JobRole and Department

This plot shows attrition by job role, taking department into account.

The following job roles are the topmost contributors to attrition:

  • Sales Executive 33
  • Research Scientist 32
  • Laboratory Technician 30
  • Sales Representative 24

3.3.5 JobRole only

This plot shows attrition by job role without taking department into account. The topmost job roles contributing to attrition, as a share of the 140 attrition cases, are:

  • Sales Executive (23.6%)
  • Research Scientist (22.9%)
  • Laboratory Technician (21.4%)
  • Sales Representative (17.1%)

3.3.6 Gender

3.3.7 Gender by Dept

3.3.8 Top 4

Here are the top 4 factors that have contributed to attrition, with the share of attrition cases falling in each group:

  • Business Travel - those who travel rarely or frequently (92.1%)
  • StockOptionLevel - stock option levels 0 and 1 (89.3%)
  • Job Level - job levels 1 and 2 (81.8%)
  • Overtime - those who work overtime, i.e. put in extra time at work (57.1%)

# Packages used below: sqldf for the SQL queries, plotly for the pie charts.
library(sqldf)
library(plotly)

# Count attrition cases per level of each feature.
df_travel <- sqldf("select BusinessTravel, count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by BusinessTravel")

df_joblevel <- sqldf("select JobLevel, count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobLevel")
# Build the labels from the data so they stay correct if a level has no attrition.
rownames(df_joblevel) <- paste0('JobLevel', df_joblevel$JobLevel)

df_stocklevel <- sqldf("select StockOptionLevel, count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by StockOptionLevel")
rownames(df_stocklevel) <- paste0('Stock', df_stocklevel$StockOptionLevel)

# Column name spelled as in the dataset (OverTime) so that ~OverTime below resolves.
df_overtime <- sqldf("select OverTime, count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by OverTime")

# Arrange the four pies in a 2 x 2 grid.
p1 <- plot_ly() %>%
  add_pie(data = df_travel, labels = ~BusinessTravel, values = ~Attrition_count,
          title = "BusinessTravel", domain = list(row = 0, column = 0)) %>%
  add_pie(data = df_joblevel, labels = ~rownames(df_joblevel), values = ~Attrition_count,
          title = "JobLevel", domain = list(row = 0, column = 1)) %>%
  add_pie(data = df_overtime, labels = ~OverTime, values = ~Attrition_count,
          title = "Overtime", domain = list(row = 1, column = 0)) %>%
  add_pie(data = df_stocklevel, labels = ~rownames(df_stocklevel), values = ~Attrition_count,
          title = "StockOptionLevel", domain = list(row = 1, column = 1)) %>%
  layout(title = "Attrition Rate", showlegend = TRUE,
         grid = list(rows = 2, columns = 2),
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p1

3.3.9 Other

3.4 Job Satisfaction

The dataset contains a JobSatisfaction field, from which we can get a few insights about what makes employees more satisfied. We use proportion plots here to get a clear picture.

Assumed meaning of the satisfaction levels
Level-1 - Highly dissatisfied
Level-2 - Dissatisfied
Level-3 - Satisfied
Level-4 - Highly satisfied

3.4.1 Attrition by Satisfaction Level

Fisher’s test on this feature tells us that at least one proportion is significantly different from the others (p-value < 0.05); see the Hypothesis Tests section later in this document. The first three levels nevertheless contribute 80% of the attrition.

3.4.2 By Age

Interpretations from the spine plots
  • The proportion of highly satisfied employees is higher between ages 30-35
  • The proportion of highly dissatisfied employees is higher between ages 25-30, 40-45 and 55+

The density plot shows the same highs and lows of the satisfaction levels.

3.4.3 Monthly Income

Interpretations from the spine plots
  • The proportion of highly satisfied employees is highest in the monthly income range 6000-8000
  • The proportion of highly dissatisfied employees is highest in the monthly income ranges 1081-4000 and 15000-18000

The density plot shows the lows and highs of satisfaction across all incomes, starting from the dataset minimum, min(msds_cs02_ds$MonthlyIncome).

3.4.4 Distance from Home

As one would expect, employees who live closer to the office are more satisfied than those who live farther away. The spinogram and density plot show the same.

4 Hypothesis Tests

We run hypothesis tests to compare the levels of each factor.

Null hypothesis : All levels of the factor have the same effect on attrition.
Alternative : At least one level has a different effect on attrition than the other levels.

4.1 Ordinal Qualitative variables

## [1] "Feature  Education  p-value  0.64"
## [1] "Feature  EnvironmentSatisfaction  p-value  0.01"
## [1] "Feature  JobInvolvement  p-value  0"
## [1] "Feature  JobLevel  p-value  0"
## [1] "Feature  JobSatisfaction  p-value  0.01"
## [1] "Feature  PerformanceRating  p-value  0.7"
## [1] "Feature  RelationshipSatisfaction  p-value  0.37"
## [1] "Feature  StockOptionLevel  p-value  0"
## [1] "Feature  WorkLifeBalance  p-value  0.01"

Fisher’s exact test was performed, and from the p-values above we conclude that the following factors are significant (p-value < 0.05), i.e. at least one level of the factor differs from the others. A p-value shown as 0 has been rounded to two decimals.

  • EnvironmentSatisfaction
  • JobInvolvement
  • JobLevel
  • JobSatisfaction
  • StockOptionLevel
  • WorkLifeBalance

So these variables need to be included in model selection.
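The per-feature test above can be sketched with base R's `fisher.test()`, applied to the contingency table of each factor against Attrition. This is a hedged illustration and not the report's own code: the data frame `emp` is synthetic, so the p-values it prints do not correspond to the ones reported here.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the real dataset (random values, for illustration only).
emp <- data.frame(
  Attrition       = factor(rep(c("No", "Yes"), times = c(168, 32))),
  JobSatisfaction = factor(sample(1:4, n, replace = TRUE)),
  OverTime        = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

features <- c("JobSatisfaction", "OverTime")

# Fisher's exact test on the feature-by-Attrition contingency table.
p_values <- sapply(features, function(f) {
  fisher.test(table(emp[[f]], emp$Attrition))$p.value
})

# Print in the same style as the output shown in this section.
for (f in features) {
  print(paste("Feature ", f, " p-value ", round(p_values[[f]], 2)))
}
```

For tables with many levels or large counts, the exact computation can become expensive. `fisher.test(..., simulate.p.value = TRUE)` is one way to approximate the p-value in that case.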

4.2 Nominal Qualitative variables

## [1] "Feature  BusinessTravel  p-value  0.06"
## [1] "Feature  Department  p-value  0.01"
## [1] "Feature  EducationField  p-value  0.23"
## [1] "Feature  Gender  p-value  0.51"
## [1] "Feature  JobRole  p-value  0"
## [1] "Feature  MaritalStatus  p-value  0"
## [1] "Feature  OverTime  p-value  0"

Fisher’s exact test was performed, and from the p-values above we conclude that the following factors are significant (p-value < 0.05), i.e. at least one level of the factor differs from the others. BusinessTravel (p-value 0.06) falls just short of the threshold.

  • Department
  • JobRole
  • MaritalStatus
  • OverTime

So these variables need to be included in model selection.

Key takeaways from EDA

  • Mean age of attrition is about 34 (33.8)
  • Major attrition is at Job Level 1, i.e. Research Scientist, Laboratory Technician and Sales Representative
  • Males have a higher attrition rate than females, although the Gender test above is not significant (p-value 0.51)
  • Stock options make people happier and are associated with the lowest contribution to attrition
  • The higher the job involvement, the lower the attrition

From the analysis, a few predictors such as Age, BusinessTravel, JobLevel and OverTime look highly important, as they are major contributors to attrition. Other fields also seem to show some variation with attrition (as per the density plots), so we will keep all predictors to build the model and go from there.

4.3 Correlation Heatmap

This heatmap shows that multicollinearity exists among the numerical predictors. We examined the correlated variables and removed them manually. The correlated pairs are:

  • Age and TotalWorkingYears
  • YearsAtCompany and YearsInCurrentRole
  • YearsAtCompany and YearsWithCurrManager
  • YearsWithCurrManager and YearsInCurrentRole
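Correlated pairs like the ones named above can also be listed programmatically from the correlation matrix. The sketch below is an assumption-laden illustration: the data frame `num` is synthetic, and the cutoff of 0.7 is an arbitrary choice for the example, not a threshold taken from this analysis.

```r
set.seed(42)
n <- 100

# Synthetic numeric predictors: TotalWorkingYears is built to track Age closely.
age <- round(runif(n, 20, 60))
num <- data.frame(
  Age               = age,
  TotalWorkingYears = pmax(0, age - 20 + round(rnorm(n, 0, 2))),
  DistanceFromHome  = round(runif(n, 1, 30))
)

# Pairwise Pearson correlations.
cm <- cor(num)

# Keep each pair once (upper triangle) when |r| exceeds the illustrative cutoff.
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)

data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = round(cm[high], 2)
)
```

Which variable of each pair to drop remains a judgment call. The list only narrows down the candidates.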

## corrplot 0.84 loaded

5 Model Building

5.1 Classification Problem

5.1.1 Random Forest

Here are the top 5 predictors as per the random forest variable importance shown below:

  • StockOptionLevel
  • OverTime
  • JobLevel
  • Age
  • JobRole

## Loading required package: lattice
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 32.14%
## Confusion matrix:
##     No Yes class.error
## No  73  39   0.3482143
## Yes 33  79   0.2946429
## [1] No  No  Yes No  No  No 
## Levels: No Yes
## [1] 0.7241379
## Confusion Matrix and Statistics
## 
##                  
## predicted.classes Yes  No
##               Yes  23  43
##               No    5 103
##                                           
##                Accuracy : 0.7241          
##                  95% CI : (0.6514, 0.7891)
##     No Information Rate : 0.8391          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3403          
##                                           
##  Mcnemar's Test P-Value : 9.27e-08        
##                                           
##             Sensitivity : 0.8214          
##             Specificity : 0.7055          
##          Pos Pred Value : 0.3485          
##          Neg Pred Value : 0.9537          
##              Prevalence : 0.1609          
##          Detection Rate : 0.1322          
##    Detection Prevalence : 0.3793          
##       Balanced Accuracy : 0.7635          
##                                           
##        'Positive' Class : Yes             
## 
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                                  Importance
## StockOptionLevel1                    100.00
## OverTimeYes                           93.15
## JobLevel4                             92.55
## Age                                   76.12
## JobRoleSales Representative           74.06
## MonthlyIncome                         65.33
## MaritalStatusSingle                   65.04
## DistanceFromHome                      61.12
## DepartmentResearch & Development      54.44
## JobRoleManufacturing Director         53.75
## DepartmentSales                       49.99
## JobLevel2                             45.76
## StockOptionLevel2                     44.41
## JobRoleManager                        43.99
## JobInvolvement4                       40.86
## DailyRate                             40.06
## TrainingTimesLastYear                 39.59
## JobRoleSales Executive                37.17
## JobRoleHuman Resources                34.66
## EnvironmentSatisfaction2              33.29
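The test-set statistics reported for the random forest can be checked by hand from its confusion matrix (predicted classes in rows, actual classes in columns, "Yes" as the positive class). The sketch below only re-derives numbers already shown in the output above; the variable names are illustrative.

```r
# Random forest confusion matrix from the output above (filled column by column).
cm <- matrix(c(23, 5, 43, 103), nrow = 2,
             dimnames = list(predicted = c("Yes", "No"),
                             actual    = c("Yes", "No")))

tp <- cm["Yes", "Yes"]  # left and predicted to leave
fn <- cm["No",  "Yes"]  # left but predicted to stay
fp <- cm["Yes", "No"]   # stayed but predicted to leave
tn <- cm["No",  "No"]   # stayed and predicted to stay

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
balanced    <- (sensitivity + specificity) / 2

# Accuracy of always predicting the majority class ("No").
no_info_rate <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv,
        balanced = balanced, no_info_rate = no_info_rate), 4)
```

With roughly 16% positives in the test set, the accuracy (0.7241) is below the no-information rate (0.8391), so sensitivity and balanced accuracy are the more informative figures for this model.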

5.1.2 KNN classifier

## dummies-1.5.6 provided by Decision Patterns
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## [1] No No No No No No
## Levels: No Yes
## [1] 0.7471264
## Confusion Matrix and Statistics
## 
##                  
## predicted.classes Yes  No
##               Yes  19  35
##               No    9 111
##                                           
##                Accuracy : 0.7471          
##                  95% CI : (0.6758, 0.8099)
##     No Information Rate : 0.8391          
##     P-Value [Acc > NIR] : 0.999336        
##                                           
##                   Kappa : 0.3191          
##                                           
##  Mcnemar's Test P-Value : 0.000164        
##                                           
##             Sensitivity : 0.6786          
##             Specificity : 0.7603          
##          Pos Pred Value : 0.3519          
##          Neg Pred Value : 0.9250          
##              Prevalence : 0.1609          
##          Detection Rate : 0.1092          
##    Detection Prevalence : 0.3103          
##       Balanced Accuracy : 0.7194          
##                                           
##        'Positive' Class : Yes             
## 
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 50)
## 
##                                  Importance
## OverTimeYes                          100.00
## OverTimeNo                           100.00
## StockOptionLevel0                     88.84
## MonthlyIncome                         87.95
## JobLevel1                             81.27
## StockOptionLevel1                     75.04
## Age                                   66.94
## MaritalStatusSingle                   59.96
## JobLevel2                             54.64
## DepartmentResearch & Development      51.45
## DepartmentSales                       46.74
## JobRoleSales Representative           43.59
## DistanceFromHome                      42.34
## MaritalStatusDivorced                 40.72
## JobInvolvement1                       35.14
## EnvironmentSatisfaction1              34.86
## JobInvolvement3                       34.42
## TrainingTimesLastYear                 33.22
## JobSatisfaction4                      32.72
## JobRoleManufacturing Director         27.39

5.1.3 Naive Bayes Classifier

## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## [1] 0.683908
## Confusion Matrix and Statistics
## 
##                  
## predicted.classes Yes No
##               Yes  23 50
##               No    5 96
##                                           
##                Accuracy : 0.6839          
##                  95% CI : (0.6092, 0.7522)
##     No Information Rate : 0.8391          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2904          
##                                           
##  Mcnemar's Test P-Value : 2.975e-09       
##                                           
##             Sensitivity : 0.8214          
##             Specificity : 0.6575          
##          Pos Pred Value : 0.3151          
##          Neg Pred Value : 0.9505          
##              Prevalence : 0.1609          
##          Detection Rate : 0.1322          
##    Detection Prevalence : 0.4195          
##       Balanced Accuracy : 0.7395          
##                                           
##        'Positive' Class : Yes             
## 
## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 49)
## 
##                                  Importance
## OverTimeYes                          100.00
## OverTimeNo                           100.00
## StockOptionLevel0                     88.84
## JobLevel1                             81.27
## StockOptionLevel1                     75.04
## Age                                   66.94
## MaritalStatusSingle                   59.96
## JobLevel2                             54.64
## DepartmentResearch & Development      51.45
## DepartmentSales                       46.74
## JobRoleSales Representative           43.59
## DistanceFromHome                      42.34
## MaritalStatusDivorced                 40.72
## JobInvolvement1                       35.14
## EnvironmentSatisfaction1              34.86
## JobInvolvement3                       34.42
## TrainingTimesLastYear                 33.22
## JobSatisfaction4                      32.72
## JobRoleManufacturing Director         27.39
## EnvironmentSatisfaction2              20.80

5.2 Regression Problem

5.2.1 KNN Regression Model

## [1]  5265.4 18979.4  6759.0  3015.6  5334.8 15739.2
## [1] 1565.399

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

5.2.2 Penalized Regression Model (LASSO)

## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
## [1] 38.74357
## 48 x 1 sparse Matrix of class "dgCMatrix"
##                                             s0
## (Intercept)                       3196.4231798
## Age                                 16.5648076
## DailyRate                            0.1024272
## DistanceFromHome                     .        
## HourlyRate                           .        
## MonthlyRate                          .        
## PercentSalaryHike                    .        
## TrainingTimesLastYear                .        
## BusinessTravelTravel_Frequently      .        
## BusinessTravelTravel_Rarely        125.3324523
## DepartmentResearch & Development     .        
## DepartmentSales                      .        
## EducationFieldLife Sciences          .        
## EducationFieldMarketing              .        
## EducationFieldMedical                .        
## EducationFieldOther                  .        
## EducationFieldTechnical Degree     -16.9864853
## GenderMale                           .        
## JobRoleHuman Resources           -1058.9637181
## JobRoleLaboratory Technician     -1134.9735365
## JobRoleManager                    3224.7908177
## JobRoleManufacturing Director       41.7020537
## JobRoleResearch Director          3330.2678378
## JobRoleResearch Scientist         -898.9129480
## JobRoleSales Executive               .        
## JobRoleSales Representative      -1163.9441813
## MaritalStatusMarried                 .        
## MaritalStatusSingle                -10.5105598
## OverTimeYes                          .        
## EnvironmentSatisfaction2             .        
## EnvironmentSatisfaction3             .        
## EnvironmentSatisfaction4             .        
## JobInvolvement2                      .        
## JobInvolvement3                      .        
## JobInvolvement4                     14.7333612
## JobLevel2                         1679.9360048
## JobLevel3                         5162.7923389
## JobLevel4                         8752.6620728
## JobLevel5                        11517.9229502
## JobSatisfaction2                     .        
## JobSatisfaction3                     .        
## JobSatisfaction4                    28.3873359
## StockOptionLevel1                    5.9321437
## StockOptionLevel2                    .        
## StockOptionLevel3                   -7.9468728
## WorkLifeBalance2                     .        
## WorkLifeBalance3                     3.9392092
## WorkLifeBalance4                     .

6 Conclusion

The top 6 factors determined by the Random Forest model are listed below. Three of them (Stock Options, Overtime and Job Level) match the EDA findings.

  • Stock Options
  • Overtime
  • Job Level
  • Age
  • Job Role
  • MonthlyIncome

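Among the employees who left, 23 Research Scientists, most of the Laboratory Technicians, 33 Sales Representatives and Sales Executives, and a few others were both working overtime and without any stock options.

Our random forest model reflects this pattern: it identified 82.1% of the employees who left in the test set (sensitivity), with an overall accuracy of 72.4%.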