DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is the iterative process of developing and retaining employees; it may include workforce planning, employee training programs, identifying high-potential employees, and reducing or preventing voluntary employee turnover (attrition). To gain a competitive edge, DDSAnalytics plans to leverage data science for talent management, and its executive leadership has identified predicting employee turnover as the first application. Before the business green-lights the project, we have been tasked with analyzing existing employee data. This R Markdown document presents a detailed statistical analysis of the given dataset and contains the code, plots, and hypothesis tests with conclusions. It also contains the code and output used to build the predictive models requested by the talent management firm.
Along with this R code, we also built a Shiny app for EDA and interactive plots, hosted on the web. Please visit the app at:
https://sachinac.shinyapps.io/msds_rshiny_cs02/
The dataset contains records of 870 employees and 36 attributes that can be utilized to find patterns of attrition and to build a model to predict attrition. The data are clean, and our initial investigation found nothing suspicious. For more details, please keep reading the following sections.
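For context, the load-and-sanity-check step looks like the sketch below; the CSV file name is an assumption, not taken from this report.

# Minimal load-and-check sketch (file name assumed)
msds_cs02_ds <- read.csv("CaseStudy2-data.csv", stringsAsFactors = TRUE)
dim(msds_cs02_ds)          # expect 870 rows and 36 columns
sum(is.na(msds_cs02_ds))   # 0 would support the "data is clean" claim
str(msds_cs02_ds)          # structure shown below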
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : int 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : int 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : int 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ JobSatisfaction : int 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : int 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: int 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : int 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
Let's take a closer look at the structure of the dataset.
We can categorize the fields as follows:
Several ordinal predictors are stored as numeric fields in the dataset; we convert them to factors so that they are treated as categorical variables.
msds_cs02_ds$Education <- as.factor(msds_cs02_ds$Education)
msds_cs02_ds$EnvironmentSatisfaction <- as.factor(msds_cs02_ds$EnvironmentSatisfaction)
msds_cs02_ds$JobInvolvement <- as.factor(msds_cs02_ds$JobInvolvement)
msds_cs02_ds$JobLevel <- as.factor(msds_cs02_ds$JobLevel)
msds_cs02_ds$JobSatisfaction <- as.factor(msds_cs02_ds$JobSatisfaction)
msds_cs02_ds$PerformanceRating <- as.factor(msds_cs02_ds$PerformanceRating)
msds_cs02_ds$RelationshipSatisfaction <- as.factor(msds_cs02_ds$RelationshipSatisfaction)
msds_cs02_ds$StockOptionLevel <- as.factor(msds_cs02_ds$StockOptionLevel)
msds_cs02_ds$WorkLifeBalance <- as.factor(msds_cs02_ds$WorkLifeBalance)
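The same nine conversions can be written more compactly as a loop; a sketch using the same column names:

# Loop form of the factor conversions above
ordinal_cols <- c("Education", "EnvironmentSatisfaction", "JobInvolvement",
                  "JobLevel", "JobSatisfaction", "PerformanceRating",
                  "RelationshipSatisfaction", "StockOptionLevel", "WorkLifeBalance")
msds_cs02_ds[ordinal_cols] <- lapply(msds_cs02_ds[ordinal_cols], as.factor)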
Final structure used to continue with the analysis:
## 'data.frame': 870 obs. of 36 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : int 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 2 1 1 3 1 3 1 4 2 2 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : int 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
# Create list for different types of variables for later use
non_predictors <- c('ID','EmployeeNumber','EmployeeCount','StandardHours','Over18')
nom_qual_predictors <- c('BusinessTravel','Department','EducationField',
'Gender','JobRole','MaritalStatus','OverTime')
ord_qual_predictors <- c('Education','EnvironmentSatisfaction','JobInvolvement','JobLevel',
'JobSatisfaction','PerformanceRating','RelationshipSatisfaction',
'StockOptionLevel','WorkLifeBalance')
num_predictors <- c('Age','DailyRate','DistanceFromHome','HourlyRate',
'MonthlyRate','NumCompaniesWorked','PercentSalaryHike','TotalWorkingYears',
'TrainingTimesLastYear','YearsAtCompany','YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager')
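As a quick sanity check, these four lists plus the two targets should account for all 36 columns; a sketch:

# Columns not covered by the four predictor lists
setdiff(names(msds_cs02_ds),
        c(non_predictors, nom_qual_predictors, ord_qual_predictors, num_predictors))
# expected leftovers: "Attrition" (the classification target) and
# "MonthlyIncome" (kept aside as the regression target later)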
Let's start with some basic insights from the dataset.
Number of employees per department: as the pie chart clearly shows, Research & Development is the largest department at DDSAnalytics, followed by Sales and Human Resources.
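The pie-chart chunk is not echoed above; a minimal reconstruction using the dplyr and plotly calls already loaded in this report:

# Sketch of the employees-per-department pie
dept_counts <- msds_cs02_ds %>% count(Department)
plot_ly(dept_counts, labels = ~Department, values = ~n, type = "pie") %>%
  layout(title = "Employees per Department")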
Mean age across the organization is 36.8.
Median age across the organization is 35.
Mean age of employees who left (attrition) is 33.8.
Median age of employees who left is 32.
age_Attrition <- msds_cs02_ds %>% filter(Attrition=='Yes') %>% select(Age)
age_non_attrition <- msds_cs02_ds %>% filter(Attrition=='No') %>% select(Age)
p <- plot_ly(alpha = 0.6) %>%
  add_histogram(x = ~age_non_attrition$Age) %>%
  add_histogram(x = ~age_Attrition$Age) %>%
  layout(title = 'Histogram',
         barmode = "overlay", showlegend = F,
         xaxis = list(title = "Attrition (Yellow) and No Attrition (Blue)"))
p
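The four values printed below are the mean and median ages, overall and for the attrition subset; the original chunk is not echoed, so here is a reconstruction:

mean(msds_cs02_ds$Age)      # overall mean age
median(msds_cs02_ds$Age)    # overall median age
mean(age_Attrition$Age)     # mean age among employees who left
median(age_Attrition$Age)   # median age among employees who left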
## [1] 36.82874
## [1] 35
## [1] 33.78571
## [1] 32
Let's analyze attrition in the dataset.
Total employees - 870
Employees with attrition - 140
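Both counts come straight from the Attrition factor:

table(msds_cs02_ds$Attrition)  # No: 730, Yes: 140 -- an attrition rate of ~16%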
The plot below shows the age-wise attrition pattern in the data. Attrition is higher for ages 18-30 and 55+.
df_RandD <- sqldf("select Attrition,count(*) Attrition_count from msds_cs02_ds where Department like 'Research%' group by Attrition")
df_RandD
df_Sales <- sqldf("select Attrition,count(*) Attrition_count from msds_cs02_ds where Department like 'Sales%' group by Attrition")
df_Sales
df_hr <- sqldf("select Attrition,count(*) Attrition_count from msds_cs02_ds where Department like 'Human%' group by Attrition")
df_hr
p <- plot_ly() %>%
add_pie(data = df_RandD ,labels = ~Attrition, values = ~Attrition_count,
title = "Research & Development", domain = list(x = c(0, 0.4), y = c(0.4, 1))) %>%
add_pie(data = df_Sales, labels = ~Attrition, values = ~Attrition_count,
title = "Sales", domain = list(x = c(0.6, 1), y = c(0.4, 1))) %>%
add_pie(data = df_hr, labels = ~Attrition, values = ~Attrition_count,
title = "Human Resource", domain = list(x = c(0.25, 0.75), y = c(0, 0.6))) %>%
layout(title = "Departmentwise Attrition", showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
This plot shows attrition by job role, taking department into account.
The following job roles are the top contributors to attrition.
df <- sqldf("select Department,JobRole,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by Department,JobRole order by Attrition_count desc")
p <- ggplotly(ggplot(data=df, aes(x=Department, y=Attrition_count, fill=JobRole)) +
geom_bar(stat="identity", position=position_dodge())+
geom_text(aes(label=Attrition_count),vjust = 2.1,position=position_dodge(width=0.8))+
theme_bw()+
theme(axis.text.x = element_text(angle=45, vjust=0.6)))
p
This plot shows attrition by job role without taking department into account.
df <- sqldf("select JobRole,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobRole order by Attrition_count desc")
p <- df %>%
plot_ly(labels = ~JobRole, values = ~Attrition_count) %>%
add_pie(hole = 0.6) %>%
layout(title = "Attrition by Job Role", showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
df <- sqldf("select Department,Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by Department,Gender order by Attrition_count desc")
df
p <- ggplotly(ggplot(data=df, aes(x=Department, y=Attrition_count, fill=Gender)) +
geom_bar(stat="identity", position=position_dodge())+
geom_text(aes(label=Attrition_count),vjust = 2.1,position=position_dodge(width=0.5))+
theme_bw()+
theme(axis.text.x = element_text(angle=45, vjust=0.5)))
p
df <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by Gender order by Attrition_count ")
df
df_RandD <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' and Department like 'Research%' group by Gender")
df_RandD
df_Sales <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' and Department like 'Sales%' group by Gender")
df_Sales
df_hr <- sqldf("select Gender,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' and Department like 'Human%' group by Gender")
df_hr
p <- plot_ly() %>%
add_pie(data = df_RandD ,labels = ~Gender, values = ~Attrition_count,
title = "Research & Development", domain = list(x = c(0, 0.4), y = c(0.4, 1))) %>%
add_pie(data = df_Sales, labels = ~Gender, values = ~Attrition_count,
title = "Sales", domain = list(x = c(0.6, 1), y = c(0.4, 1))) %>%
add_pie(data = df_hr, labels = ~Gender, values = ~Attrition_count,
title = "Human Resource", domain = list(x = c(0.25, 0.75), y = c(0, 0.6))) %>%
layout(title = "Departmentwise Attrition", showlegend = T,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
Here are the top four factors that have contributed to attrition.
df_travel <- sqldf("select BusinessTravel,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by BusinessTravel ")
df_joblevel <- sqldf("select JobLevel,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobLevel")
rownames(df_joblevel) <- c('JobLevel1','JobLevel2','JobLevel3','JobLevel4','JobLevel5')
df_stocklevel <- sqldf("select StockOptionLevel,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by StockOptionLevel")
rownames(df_stocklevel) <- c('Stock0','Stock1','Stock2','Stock3')
df_overtime <- sqldf("select OverTime,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by OverTime")
p1 <- plot_ly() %>%
add_pie(data = df_travel ,labels = ~BusinessTravel, values = ~Attrition_count,
title = "BusinessTravel", domain = list(row = 0, column = 0)) %>%
add_pie(data = df_joblevel, labels = ~rownames(df_joblevel), values = ~Attrition_count,
title = "JobLevel", domain = list(row = 0, column = 1)) %>%
add_pie(data = df_overtime ,labels = ~OverTime, values = ~Attrition_count,
title = "Overtime", domain = list(row = 1, column = 0)) %>%
add_pie(data = df_stocklevel, labels = ~rownames(df_stocklevel), values = ~Attrition_count,
title = "StockOptionLevel", domain = list(row = 1, column = 1)) %>%
layout(title = "Attrition Rate", showlegend = T,
grid=list(rows=2, columns=2),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p1
df_jobinv <- sqldf("select JobInvolvement,count(*) Attrition_count from msds_cs02_ds where Attrition='Yes' group by JobInvolvement")
p <- plot_ly(df_jobinv, labels = ~JobInvolvement, values = ~Attrition_count, type = 'pie') %>%
layout(title = 'Attrition rate for job Involvement',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
The dataset contains a JobSatisfaction field, from which we can get a few insights about what makes employees more satisfied. We use proportion plots here to get a clear picture, so let's see.
Assumed meaning of the satisfaction levels:
Level 1 - highly dissatisfied
Level 2 - dissatisfied
Level 3 - satisfied
Level 4 - highly satisfied
Fisher's test on this feature tells us that at least one proportion is significantly different from the others (p-value < 0.05); see the hypothesis-test section later in this markdown. The first three levels contribute about 80% of the attrition, though, as the quick check below shows.
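A quick way to verify the 80% figure from the raw counts (a sketch):

# Cumulative share of attrition by job-satisfaction level
sat_attr <- table(msds_cs02_ds$JobSatisfaction[msds_cs02_ds$Attrition == "Yes"])
round(cumsum(prop.table(sat_attr)), 2)  # levels 1-3 should reach ~0.80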
df <- msds_cs02_ds %>% filter(Attrition=='Yes') %>% count(JobSatisfaction)
p <- plot_ly(df, labels = ~JobSatisfaction, values = ~n, type = 'pie') %>%
layout(title = 'Attrition by job satisfaction level',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
That is what the density plot shows: the highs and lows of the satisfaction levels across all income levels.
As one might expect, employees who live closer to the office are more satisfied than those who live far away. The spinogram and density plot show the same; a sketch of the spinogram follows.
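The spinogram chunk itself is not echoed; a minimal sketch with base graphics (spineplot accepts a factor response against a numeric predictor):

# Spinogram: job satisfaction vs. distance from home
spineplot(JobSatisfaction ~ DistanceFromHome, data = msds_cs02_ds,
          main = "Job satisfaction by distance from home")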
The lower the age, the higher the attrition.
The lower the MonthlyIncome, the higher the attrition.
The higher the NumCompaniesWorked, the higher the attrition.
The less the experience, the higher the attrition.
opar <- par(mfrow = c(2, 3))  # arrange the cd_plots in a 2 x 3 grid
vcd::cd_plot(Attrition ~ Age, data = msds_cs02_ds, main = "Age")
# (cd_plots for the remaining numeric predictors fill the other panels)
par(opar)                     # restore graphics settings after plotting
We run hypothesis tests to compare the levels of the different factors.
Null hypothesis: all levels of the factor have the same effect on attrition.
Alternative: at least one level has a different effect on attrition than the other levels.
# Fisher's exact test of each ordinal predictor against Attrition (column 3)
for (predictor in ord_qual_predictors) {
  index <- match(predictor, names(msds_cs02_ds))
  hypothesis_test <- fisher.test(table(msds_cs02_ds[,c(index,3)]), simulate.p.value = TRUE)
  print(paste("Feature ", predictor, " p-value ", round(hypothesis_test$p.value, 2)))
}
## [1] "Feature Education p-value 0.64"
## [1] "Feature EnvironmentSatisfaction p-value 0.01"
## [1] "Feature JobInvolvement p-value 0"
## [1] "Feature JobLevel p-value 0"
## [1] "Feature JobSatisfaction p-value 0.01"
## [1] "Feature PerformanceRating p-value 0.7"
## [1] "Feature RelationshipSatisfaction p-value 0.37"
## [1] "Feature StockOptionLevel p-value 0"
## [1] "Feature WorkLifeBalance p-value 0.01"
Fisher's exact test was performed; from the p-values above we conclude that the following factors are significant (p-value < 0.05), i.e., at least one level of each factor differs from the others: EnvironmentSatisfaction, JobInvolvement, JobLevel, JobSatisfaction, StockOptionLevel, and WorkLifeBalance.
These variables need to be included in model selection.
# Same test for the nominal predictors (columns swapped so 'Yes' comes first)
for (predictor in nom_qual_predictors) {
  index <- match(predictor, names(msds_cs02_ds))
  hypothesis_test <- fisher.test(table(msds_cs02_ds[,c(index,3)])[,c(2,1)], simulate.p.value = TRUE)
  print(paste("Feature ", predictor, " p-value ", round(hypothesis_test$p.value, 2)))
}
## [1] "Feature BusinessTravel p-value 0.06"
## [1] "Feature Department p-value 0.01"
## [1] "Feature EducationField p-value 0.23"
## [1] "Feature Gender p-value 0.51"
## [1] "Feature JobRole p-value 0"
## [1] "Feature MaritalStatus p-value 0"
## [1] "Feature OverTime p-value 0"
Fisher's exact test was performed; from the p-values above we conclude that the following factors are significant (p-value < 0.05), i.e., at least one level of each factor differs from the others: Department, JobRole, MaritalStatus, and OverTime.
These variables need to be included in model selection as well.
Key takeaways from EDA
From the analysis, a few predictors such as Age, BusinessTravel, JobLevel, and OverTime appear highly important, as they are major contributors to attrition. Other fields also seem to show some variation with attrition (per the density plots), so we will keep all predictors to build the model and go from there.
This heatmap shows that multicollinearity exists among the numerical predictors. We examined the correlated variables and removed them manually (an automated check with caret follows the list):
* Age and TotalWorkingYears
* YearsAtCompany and YearsInCurrentRole
* YearsAtCompany and YearsWithCurrManager
* YearsWithCurrManager and YearsInCurrentRole
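caret's findCorrelation can automate the same check; a sketch (the removals in this report were done manually):

# Flag numeric predictors with pairwise |correlation| above 0.7
library(caret)
findCorrelation(cor(msds_cs02_ds[, num_predictors]), cutoff = 0.7, names = TRUE)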
# Correlation matrix of the numeric predictors, rounded to one decimal
corr <- round(cor(msds_cs02_ds[,num_predictors]), 1)
corrplot(corr)
trace1 <- list(
  uid = "b972e66a-6d9c-4cd1-a8fc-c533178a171b",
  type = "heatmap",
  x = num_predictors,
  y = rev(num_predictors),
  z = corr[rev(num_predictors), ]  # reverse the rows to match the reversed y labels
)
layout <- list(
  title = "Correlations Heat Map",
  xaxis = list(
    side = "top",
    ticks = "outside"
  )
)
p <- plot_ly()
p <- add_trace(p, uid = trace1$uid, type = trace1$type, x = trace1$x, y = trace1$y, z = trace1$z)
p <- layout(p, title = layout$title, xaxis = layout$xaxis)
p
Here are the top five predictors according to the random forest model:
* Overtime
* MonthlyIncome
* JobRole
* Age
* TotalWorkingYears
#
# Predictors to be retained from EDA
#
# Drop the non-significant nominal factors (BusinessTravel, EducationField, Gender)
fin_nom_qual_predictors <- nom_qual_predictors[-c(1,3,4)]
# Drop the non-significant ordinal factors (Education, PerformanceRating, RelationshipSatisfaction)
fin_ord_qual_predictors <- ord_qual_predictors[-c(1,6,7)]
# Drop NumCompaniesWorked and the collinear years/experience variables
fin_num_predictors <- num_predictors[-c(6,8,10,11,12,13)]
msds_cs02_mds <- msds_cs02_ds[, c(fin_nom_qual_predictors,fin_ord_qual_predictors,fin_num_predictors,"MonthlyIncome","Attrition")]
sample_n(msds_cs02_mds, 4)
# Split the data into training and test set
set.seed(123)
training.samples <- msds_cs02_mds$Attrition %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- msds_cs02_mds[training.samples, ]
test.data <- msds_cs02_mds[-training.samples, ]
# Class weights (computed for reference; the train() call below uses down-sampling instead)
model_weights <- ifelse(train.data$Attrition == "No",
                        (1/table(train.data$Attrition)[1]) * 0.5,
                        (1/table(train.data$Attrition)[2]) * 0.5)
# Fit the model on the training set
set.seed(123)
model <- train(
Attrition ~., data = train.data, method = "rf",
trControl = trainControl("cv", number = 10,sampling="down"),
preProcess = c("center","scale"),
# weights = model_weights,
importance = TRUE
)
# Best tuning parameter
model$bestTune
# The final fitted forest (its summary is printed below)
model$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 32.14%
## Confusion matrix:
## No Yes class.error
## No 73 39 0.3482143
## Yes 33 79 0.2946429
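The class predictions and test-set accuracy printed below come from a chunk that is not echoed; a minimal reconstruction:

# Predict on the held-out test set and compute raw accuracy
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)
mean(predicted.classes == test.data$Attrition)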
## [1] No No Yes No No No
## Levels: No Yes
## [1] 0.7241379
predicted.classes <- relevel(predicted.classes,ref="Yes")
test.data$Attrition <- relevel(test.data$Attrition ,ref="Yes")
CM <- confusionMatrix(table(predicted.classes, test.data$Attrition))
CM
## Confusion Matrix and Statistics
##
##
## predicted.classes Yes No
## Yes 23 43
## No 5 103
##
## Accuracy : 0.7241
## 95% CI : (0.6514, 0.7891)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3403
##
## Mcnemar's Test P-Value : 9.27e-08
##
## Sensitivity : 0.8214
## Specificity : 0.7055
## Pos Pred Value : 0.3485
## Neg Pred Value : 0.9537
## Prevalence : 0.1609
## Detection Rate : 0.1322
## Detection Prevalence : 0.3793
## Balanced Accuracy : 0.7635
##
## 'Positive' Class : Yes
##
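The importance table below is caret's ranking for the fitted rf model; the call (not echoed above) is simply:

varImp(model)  # scaled 0-100 variable importance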
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Importance
## StockOptionLevel1 100.00
## OverTimeYes 93.15
## JobLevel4 92.55
## Age 76.12
## JobRoleSales Representative 74.06
## MonthlyIncome 65.33
## MaritalStatusSingle 65.04
## DistanceFromHome 61.12
## DepartmentResearch & Development 54.44
## JobRoleManufacturing Director 53.75
## DepartmentSales 49.99
## JobLevel2 45.76
## StockOptionLevel2 44.41
## JobRoleManager 43.99
## JobInvolvement4 40.86
## DailyRate 40.06
## TrainingTimesLastYear 39.59
## JobRoleSales Executive 37.17
## JobRoleHuman Resources 34.66
## EnvironmentSatisfaction2 33.29
#
# This library required for one hot encoding to convert categorical variables into numeric
#
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
#
# Create separate dataframe of 30 predictors 1 target
#
knn_with_target_ds <- msds_cs02_ds[,c(fin_num_predictors,fin_nom_qual_predictors,fin_ord_qual_predictors,"MonthlyIncome","Attrition")]
knn_hot_encoded_ds <- dummy.data.frame(knn_with_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts = FALSE):
## non-list contrasts argument ignored
set.seed(123)
training.samples <- knn_hot_encoded_ds$Attrition %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- knn_hot_encoded_ds[training.samples, ]
test.data <- knn_hot_encoded_ds[-training.samples, ]
#res.pca <- PCA(msds_cs02_ds[,fin_num_predictors], graph = FALSE,ncp=20)
#res.pca$eig
#corrplot(res.pca$var$cos2, is.corr=FALSE)
#fviz_contrib(res.pca, choice = "var", axes = 1:7)
# Fit the model on the training set
set.seed(123)
model <- train(
Attrition ~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 10,sampling="smote"),
preProcess = c("center","scale"),
tuneLength = 20
)
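The predictions printed below come from a chunk that is not echoed; a minimal reconstruction:

# Predict attrition for the held-out test set
predicted.classes <- model %>% predict(test.data)
head(predicted.classes)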
## [1] No No No No No No
## Levels: No Yes
predicted.classes <- relevel(predicted.classes,ref="Yes")
test.data$Attrition <- relevel(test.data$Attrition,ref="Yes")
# Compute model accuracy rate
mean(predicted.classes == test.data$Attrition)
## [1] 0.7471264
## Confusion Matrix and Statistics
##
##
## predicted.classes Yes No
## Yes 19 35
## No 9 111
##
## Accuracy : 0.7471
## 95% CI : (0.6758, 0.8099)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 0.999336
##
## Kappa : 0.3191
##
## Mcnemar's Test P-Value : 0.000164
##
## Sensitivity : 0.6786
## Specificity : 0.7603
## Pos Pred Value : 0.3519
## Neg Pred Value : 0.9250
## Prevalence : 0.1609
## Detection Rate : 0.1092
## Detection Prevalence : 0.3103
## Balanced Accuracy : 0.7194
##
## 'Positive' Class : Yes
##
# ROC curve from the predicted classes (prediction/performance come from ROCR)
library(ROCR)
pred <- prediction(as.numeric(if_else(predicted.classes=='Yes',1,0)), as.numeric(if_else(test.data$Attrition=='Yes',1,0)))
nb.prff <- performance(pred, "tpr", "fpr")
plot(nb.prff, main = "ROC Curve")
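kNN has no model-specific importance measure, so caret falls back to an AUC-based ranking; the table below comes from the (not echoed) call:

varImp(model)  # prints as "ROC curve variable importance" for knn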
## ROC curve variable importance
##
## only 20 most important variables shown (out of 50)
##
## Importance
## OverTimeYes 100.00
## OverTimeNo 100.00
## StockOptionLevel0 88.84
## MonthlyIncome 87.95
## JobLevel1 81.27
## StockOptionLevel1 75.04
## Age 66.94
## MaritalStatusSingle 59.96
## JobLevel2 54.64
## DepartmentResearch & Development 51.45
## DepartmentSales 46.74
## JobRoleSales Representative 43.59
## DistanceFromHome 42.34
## MaritalStatusDivorced 40.72
## JobInvolvement1 35.14
## EnvironmentSatisfaction1 34.86
## JobInvolvement3 34.42
## TrainingTimesLastYear 33.22
## JobSatisfaction4 32.72
## JobRoleManufacturing Director 27.39
#
# Create separate dataframe of 30 predictors 1 target
#
nb_with_target_ds <- msds_cs02_ds[,c(fin_num_predictors,fin_nom_qual_predictors,fin_ord_qual_predictors,"Attrition")]
nb_hot_encoded_ds <- dummy.data.frame(nb_with_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
# Inspect the data
sample_n(nb_with_target_ds, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- nb_hot_encoded_ds$Attrition %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- nb_hot_encoded_ds[training.samples, ]
test.data <- nb_hot_encoded_ds[-training.samples, ]
# Fit the model
#model <- NaiveBayes(Attrition ~., data = train.data)
#model
# Make predictions
#predicted.classes <- model %>% predict(test.data)
#pred <- prediction(pred_nb[, 2], test_color$Style)
# Model accuracy
#mean(predicted.classes$class == test.data$Attrition)
set.seed(123)
model <- train(Attrition ~., data = train.data, method = "nb",
trControl = trainControl("cv", number = 10,sampling="down")
)
# Make predictions
predicted.classes <- model %>% predict(test.data)
predicted.classes <- relevel(predicted.classes,ref="Yes")
test.data$Attrition <- relevel(test.data$Attrition,ref="Yes")
# Compute model accuracy rate
mean(predicted.classes == test.data$Attrition)
## [1] 0.683908
## Confusion Matrix and Statistics
##
##
## predicted.classes Yes No
## Yes 23 50
## No 5 96
##
## Accuracy : 0.6839
## 95% CI : (0.6092, 0.7522)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2904
##
## Mcnemar's Test P-Value : 2.975e-09
##
## Sensitivity : 0.8214
## Specificity : 0.6575
## Pos Pred Value : 0.3151
## Neg Pred Value : 0.9505
## Prevalence : 0.1609
## Detection Rate : 0.1322
## Detection Prevalence : 0.4195
## Balanced Accuracy : 0.7395
##
## 'Positive' Class : Yes
##
## ROC curve variable importance
##
## only 20 most important variables shown (out of 49)
##
## Importance
## OverTimeYes 100.00
## OverTimeNo 100.00
## StockOptionLevel0 88.84
## JobLevel1 81.27
## StockOptionLevel1 75.04
## Age 66.94
## MaritalStatusSingle 59.96
## JobLevel2 54.64
## DepartmentResearch & Development 51.45
## DepartmentSales 46.74
## JobRoleSales Representative 43.59
## DistanceFromHome 42.34
## MaritalStatusDivorced 40.72
## JobInvolvement1 35.14
## EnvironmentSatisfaction1 34.86
## JobInvolvement3 34.42
## TrainingTimesLastYear 33.22
## JobSatisfaction4 32.72
## JobRoleManufacturing Director 27.39
## EnvironmentSatisfaction2 20.80
#
# Create separate dataframe of 30 predictors 1 target
#
knn_reg_target_ds <- msds_cs02_ds[,c(fin_num_predictors,fin_nom_qual_predictors,fin_ord_qual_predictors, "MonthlyIncome")]
kg_hot_encoded_ds <- dummy.data.frame(knn_reg_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
# Split the data into training and test set
set.seed(123)
training.samples <- kg_hot_encoded_ds$MonthlyIncome %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- kg_hot_encoded_ds[training.samples, ]
test.data <- kg_hot_encoded_ds[-training.samples, ]
# Fit the model on the training set
set.seed(123)
model_rg <- train( MonthlyIncome~., data = train.data, method = "knn",
trControl = trainControl("cv", number = 20),
preProcess = c("center","scale"),
tuneLength = 20
)
# Plot model error (RMSE) vs. different values of k
plot(model_rg)
# Best tuning parameter k that minimizes the RMSE
model_rg$bestTune
# Make predictions on the test data
predictions <- model_rg %>% predict(test.data)
head(predictions)
## [1] 5265.4 18979.4 6759.0 3015.6 5334.8 15739.2
## [1] 1565.399
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$MonthlyIncome),
Rsquare = R2(predictions, test.data$MonthlyIncome)
)
p <- plot_ly( x = ~predictions, y = ~test.data$MonthlyIncome,
              type = "scatter", mode = "markers",
              marker = list(size = 10,
                            color = 'rgba(255, 182, 193, .9)',
                            line = list(color = 'rgba(152, 0, 0, .8)',
                                        width = 2))) %>%
  layout(title = 'Predicted vs. Actual MonthlyIncome (kNN)',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p
#
# Create separate dataframe of 30 predictors 1 target
#
mlp_reg_target_ds <- msds_cs02_ds[,c(fin_num_predictors,nom_qual_predictors,fin_ord_qual_predictors, "MonthlyIncome")]
#mlp_hot_encoded_ds <- dummy.data.frame(mlp_reg_target_ds,names=c(fin_nom_qual_predictors,fin_ord_qual_predictors))
# Split the data into training and test set
set.seed(123)
training.samples <- mlp_reg_target_ds$MonthlyIncome %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- mlp_reg_target_ds[training.samples, ]
test.data <- mlp_reg_target_ds[-training.samples, ]
# Predictor variables
x <- model.matrix(MonthlyIncome~., train.data)[,-1]
# Outcome variable
y <- train.data$MonthlyIncome
# Find the best lambda using cross-validation
set.seed(123)
cv <- cv.glmnet(x, y, alpha = 1)
# Display the best lambda value
cv$lambda.min
## [1] 38.74357
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
# Display regression coefficients
coef(model)
## 48 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3196.4231798
## Age 16.5648076
## DailyRate 0.1024272
## DistanceFromHome .
## HourlyRate .
## MonthlyRate .
## PercentSalaryHike .
## TrainingTimesLastYear .
## BusinessTravelTravel_Frequently .
## BusinessTravelTravel_Rarely 125.3324523
## DepartmentResearch & Development .
## DepartmentSales .
## EducationFieldLife Sciences .
## EducationFieldMarketing .
## EducationFieldMedical .
## EducationFieldOther .
## EducationFieldTechnical Degree -16.9864853
## GenderMale .
## JobRoleHuman Resources -1058.9637181
## JobRoleLaboratory Technician -1134.9735365
## JobRoleManager 3224.7908177
## JobRoleManufacturing Director 41.7020537
## JobRoleResearch Director 3330.2678378
## JobRoleResearch Scientist -898.9129480
## JobRoleSales Executive .
## JobRoleSales Representative -1163.9441813
## MaritalStatusMarried .
## MaritalStatusSingle -10.5105598
## OverTimeYes .
## EnvironmentSatisfaction2 .
## EnvironmentSatisfaction3 .
## EnvironmentSatisfaction4 .
## JobInvolvement2 .
## JobInvolvement3 .
## JobInvolvement4 14.7333612
## JobLevel2 1679.9360048
## JobLevel3 5162.7923389
## JobLevel4 8752.6620728
## JobLevel5 11517.9229502
## JobSatisfaction2 .
## JobSatisfaction3 .
## JobSatisfaction4 28.3873359
## StockOptionLevel1 5.9321437
## StockOptionLevel2 .
## StockOptionLevel3 -7.9468728
## WorkLifeBalance2 .
## WorkLifeBalance3 3.9392092
## WorkLifeBalance4 .
# Make predictions on the test data
x.test <- model.matrix(MonthlyIncome ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Model performance metrics
data.frame(
RMSE = RMSE(predictions, test.data$MonthlyIncome),
Rsquare = R2(predictions, test.data$MonthlyIncome)
)
p <- plot_ly( x = ~predictions, y = ~test.data$MonthlyIncome,
              type = "scatter", mode = "markers",
              marker = list(size = 10,
                            color = 'rgba(255, 182, 193, .9)',
                            line = list(color = 'rgba(152, 0, 0, .8)',
                                        width = 2))) %>%
  layout(title = 'Predicted vs. Actual MonthlyIncome (LASSO)',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))
p
The top six factors were determined by the random forest model; three of them match the EDA findings.
Attrition was concentrated in a few job roles: 23 Research Scientists, most of the Laboratory Technicians, and 33 Sales Representatives and Sales Executives; several of these employees were both overloaded (working overtime) and held no stock options.
Our model captured that pattern accurately.