This is an analysis of a dataset containing beer and brewery data. The dataset contains 2410 beers from 558 breweries in 50 states of the United States and the District of Columbia. The goal was to explore the data set and look for interesting features of the data. Once the data was explored, the analysis focused on how international bitterness unit (IBU) and alcohol by volume (ABV) of beer are distributed over states and the relationship between IBU and ABV. The analysis also includes a discussion of the missing data and how the missing data impacts making insights from the dataset.
breweries_ds <- read.csv('data/Breweries.csv',header = TRUE)
states_ds <- read.csv('data/state_abbreviations.csv',header = TRUE)
#Trim leading and trailing spaces
breweries_ds$State <- as.factor(str_trim(breweries_ds$State,side="both"))
states_ds$State <- as.factor(str_trim(states_ds$State,side="both"))
states_ds$Unabbreviated <- as.factor(capitalize(str_trim(states_ds$Unabbreviated,side="both")))
states_ds$CensusDivision <- as.factor(capitalize(str_trim(states_ds$CensusDivision,side="both")))
#Merge with states_ds for state names
breweries_ds <- merge(breweries_ds,states_ds,by.x="State",by.y="State")
beers_ds <- read.csv('data/Beers.csv' ,header = TRUE)
bmerged_ds <- merge(breweries_ds,beers_ds,by.x="Brew_ID",by.y="Brewery_id")
names(bmerged_ds)[3] <- "BreweryName"
names(bmerged_ds)[7] <- "BeerName"
str(breweries_ds)
## 'data.frame': 558 obs. of 6 variables:
## $ State : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Brew_ID : int 494 224 459 271 454 558 103 413 479 287 ...
## $ Name : Factor w/ 551 levels "10 Barrel Brewing Company",..: 98 318 276 16 161 442 279 464 219 41 ...
## $ City : Factor w/ 384 levels "Abingdon","Abita Springs",..: 8 8 317 165 342 8 8 157 34 127 ...
## $ Unabbreviated : Factor w/ 51 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 1 1 1 ...
## $ CensusDivision: Factor w/ 9 levels "East north central",..: 6 6 6 6 6 6 6 2 2 2 ...
str(beers_ds)
## 'data.frame': 2410 obs. of 7 variables:
## $ Name : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1638 577 1704 1842 1819 268 1160 758 1093 486 ...
## $ Beer_ID : int 1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
## $ ABV : num 0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
## $ IBU : int NA NA NA NA NA NA NA NA NA NA ...
## $ Brewery_id: int 409 178 178 178 178 178 178 178 178 178 ...
## $ Style : Factor w/ 100 levels "","Abbey Single Ale",..: 19 18 16 12 16 80 18 22 18 12 ...
## $ Ounces : num 12 12 12 12 12 12 12 12 12 12 ...
str(bmerged_ds)
## 'data.frame': 2410 obs. of 12 variables:
## $ Brew_ID : int 1 1 1 1 1 1 2 2 2 2 ...
## $ State : Factor w/ 51 levels "AK","AL","AR",..: 24 24 24 24 24 24 18 18 18 18 ...
## $ BreweryName : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 355 355 355 355 355 12 12 12 12 ...
## $ City : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 228 228 228 228 228 200 200 200 200 ...
## $ Unabbreviated : Factor w/ 51 levels "Alabama","Alaska",..: 24 24 24 24 24 24 18 18 18 18 ...
## $ CensusDivision: Factor w/ 9 levels "East north central",..: 8 8 8 8 8 8 2 2 2 2 ...
## $ BeerName : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1258 2185 802 1525 1640 1926 1116 71 1218 2017 ...
## $ Beer_ID : int 2691 2690 2692 2687 2689 2688 2676 2683 2685 2674 ...
## $ ABV : num 0.049 0.048 0.045 0.056 0.06 0.06 0.065 0.042 0.125 0.05 ...
## $ IBU : int 26 19 50 47 38 25 NA 42 80 20 ...
## $ Style : Factor w/ 100 levels "","Abbey Single Ale",..: 77 48 16 57 83 22 26 18 46 48 ...
## $ Ounces : num 16 16 16 16 16 16 16 16 16 16 ...
head(bmerged_ds)
tail(bmerged_ds)
As we can see from Summary, IBU and ABV have some missing values.We can either remove or impute these missing values with various techniques.
Let’s first find out pattern of missing data.
Please refer Imputation techniques for more details
Summary of merged dataset
summary(bmerged_ds)
## Brew_ID State BreweryName
## Min. : 1.0 CO : 265 Brewery Vivant : 62
## 1st Qu.: 94.0 CA : 183 Oskar Blues Brewery : 46
## Median :206.0 MI : 162 Sun King Brewing Company : 38
## Mean :232.7 IN : 139 Cigar City Brewing Company: 25
## 3rd Qu.:367.0 TX : 130 Sixpoint Craft Ales : 24
## Max. :558.0 OR : 125 Hopworks Urban Brewery : 23
## (Other):1406 (Other) :2192
## City Unabbreviated CensusDivision
## Grand Rapids: 66 Colorado : 265 East north central:528
## Portland : 64 California: 183 Mountain :448
## Chicago : 55 Michigan : 162 Pacific :428
## Indianapolis: 43 Indiana : 139 South atlantic :220
## San Diego : 42 Texas : 130 New england :198
## Boulder : 41 Oregon : 125 West north central:185
## (Other) :2099 (Other) :1406 (Other) :403
## BeerName Beer_ID ABV
## Nonstop Hef Hop : 12 Min. : 1.0 Min. :0.00100
## Dale's Pale Ale : 6 1st Qu.: 808.2 1st Qu.:0.05000
## Oktoberfest : 6 Median :1453.5 Median :0.05600
## Longboard Island Lager: 4 Mean :1431.1 Mean :0.05977
## 1327 Pod's ESB : 3 3rd Qu.:2075.8 3rd Qu.:0.06700
## Boston Lager : 3 Max. :2692.0 Max. :0.12800
## (Other) :2376 NA's :62
## IBU Style Ounces
## Min. : 4.00 American IPA : 424 Min. : 8.40
## 1st Qu.: 21.00 American Pale Ale (APA) : 245 1st Qu.:12.00
## Median : 35.00 American Amber / Red Ale : 133 Median :12.00
## Mean : 42.71 American Blonde Ale : 108 Mean :13.59
## 3rd Qu.: 64.00 American Double / Imperial IPA: 105 3rd Qu.:16.00
## Max. :138.00 American Pale Wheat Ale : 97 Max. :32.00
## NA's :1005 (Other) :1298
package naniar and visat package provides visualizations of missing data.
vis_miss(bmerged_ds)
Missing percentage:
Field Name | Pecentage(%) |
---|---|
ABV | 2.57 |
IBU | 41.7 |
Total Missing - 3.7%
Here is another color coded view of missing data with datatypes
vis_guess(bmerged_ds)
The MICE package (stands for Multiple Imputation by Chained Equations) can help in several ways - here, we’ll use it for finding the NA-patterns:
md.pattern(bmerged_ds)
## Brew_ID State BreweryName City Unabbreviated CensusDivision BeerName
## 1405 1 1 1 1 1 1 1
## 943 1 1 1 1 1 1 1
## 62 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## Beer_ID Style Ounces ABV IBU
## 1405 1 1 1 1 1 0
## 943 1 1 1 1 0 1
## 62 1 1 1 0 0 2
## 0 0 0 62 1005 1067
The function aggr() is an additional tool to visualize the missing values.
aggr_plot <- aggr(bmerged_ds, col=c('#41A9EC','red'), numbers=TRUE, sortVars=TRUE,
labels=names(data), cex.axis=.7, gap=3,
ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## IBU 0.41701245
## ABV 0.02572614
## Brew_ID 0.00000000
## State 0.00000000
## BreweryName 0.00000000
## City 0.00000000
## Unabbreviated 0.00000000
## CensusDivision 0.00000000
## BeerName 0.00000000
## Beer_ID 0.00000000
## Style 0.00000000
## Ounces 0.00000000
marginplot(bmerged_ds[c(9,10)])
For the plot, we only specifed to take ABV and IBU. Therefore, we can see here the pattern for these two columns as follows:
Under the MCAR (Missing Completely at Random) assumption, the red and the blue box should be identical.
In our case it is not and 62 values are missing in both IBU and ABV. There is no way we can relate this missing information with any other fields. So we can delete these 62 observations.
bmerged_final_ds <- bmerged_ds %>% filter(!is.na(ABV) | !is.na(IBU))
bmerged_final_ds
This is marginplot after deleting 62 observations.
marginplot(bmerged_final_ds[c(9,10)])
pbox(bmerged_final_ds,pos=5)
## Warning in createPlot(main, sub, xlab, ylab, labels, ca$at): not enough
## space to display frequencies
##
## Click in in the left margin to switch to the previous variable or in the right margin to switch to the next variable.
## To regain use of the VIM GUI and the R console, click anywhere else in the graphics window.
The MAR assumption states that missingness is based on the other observed vairable(s) but not of the missing variable(s) itself.
# Fetch records with non-missing IBU
# set indicator imputed to "N" to indicate these are original values.
non_missing_ibu <- bmerged_final_ds %>% filter(!is.na(IBU))
non_missing_ibu$imputed <- "N"
# Fetch records with missing IBU
# set indicator imputed to "Y" to indicate these are missing and will be imputed
missing_ibu <- bmerged_final_ds %>% filter(is.na(IBU))
missing_ibu$imputed <- "Y"
# Run regression model without log transform
model_ibu <- non_missing_ibu %>% lm(formula=IBU~ABV)
# Model Summary
summary(model_ibu)
##
## Call:
## lm(formula = IBU ~ ABV, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.849 -11.977 -0.721 13.997 93.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.099 2.326 -14.66 <2e-16 ***
## ABV 1282.037 37.860 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared: 0.4497, Adjusted R-squared: 0.4493
## F-statistic: 1147 on 1 and 1403 DF, p-value: < 2.2e-16
# Run regression model
model_log_ibu <- non_missing_ibu %>% lm(formula=log(IBU)~log(ABV))
# log transformed Model Summary
summary(model_log_ibu)
##
## Call:
## lm(formula = log(IBU) ~ log(ABV), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81374 -0.30156 0.08312 0.38514 1.35081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.99667 0.18115 49.66 <2e-16 ***
## log(ABV) 1.91683 0.06363 30.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.514 on 1403 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.3923
## F-statistic: 907.4 on 1 and 1403 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model_ibu)
par(mfrow=c(1,2))
plot(model_log_ibu)
bmerged_ds %>% filter(!is.na(IBU)) %>%
ggplot(aes(x=ABV,y=IBU))+
geom_point()+
theme_wsj()+geom_smooth(method="lm")+
theme(axis.text.x = element_text(angle=65, vjust=0.3))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=8),
axis.text.y = element_text(vjust=0.2,size=6),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.5)))
# Since IBU was log transformed. Below code transforms it back original scale
missing_ibu$IBU <- exp(predict(model_log_ibu,missing_ibu))
# Combine missing and non-missing datasets
imputed_df <- rbind(non_missing_ibu,missing_ibu)
imputed_df$imputed <- as.factor(imputed_df$imputed)
imputed_df %>% ggplot(aes(x=ABV,y=IBU,color=imputed))+
geom_point()+ggtitle("ABV Vs IBU") +
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.3))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=8),
axis.text.y = element_text(vjust=0.2,size=6),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.5)))
attach(imputed_df)
b_states <- breweries_ds %>% group_by(State) %>%
count(State,Unabbreviated) %>% arrange(desc(n))
b_states$Unabbreviated <- factor(b_states$Unabbreviated, levels = b_states$Unabbreviated)
ggplot(b_states, aes(x=Unabbreviated, y=n)) +
geom_bar(stat="identity", width=.8, fill="tomato3") +
labs(title="Ordered Bar Chart",
subtitle="State Vs Number of Breweries",
caption="Beer Analysis") +
xlab("State")+
ylab("Total Breweries")+
theme_wsj()+
theme(axis.text.x = element_text(angle=80, vjust=0.4))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=8),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.2)))
abv_by_states <- imputed_df %>% group_by(Unabbreviated) %>%
summarise(Median=as.numeric(median(ABV))) %>% arrange(desc(Median))
abv_by_states$Unabbreviated <- factor(abv_by_states$Unabbreviated, levels = abv_by_states$Unabbreviated)
ggplot(abv_by_states, aes(x=Unabbreviated, y=Median)) +
geom_bar(stat="identity", width=.8, fill="tomato3") +
labs(title="Ordered Bar Chart",
subtitle="Median ABV Vs State",
caption="Beer Analysis") +
xlab("State")+
ylab("Alcohol content")+
theme_wsj()+
theme(axis.text.x = element_text(angle=80, vjust=0.8))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.4)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.2)))
ibu_by_states <- imputed_df %>% group_by(Unabbreviated) %>%
summarise(Median=as.numeric(median(IBU))) %>% arrange(desc(Median))
ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = ibu_by_states$Unabbreviated)
ggplot(ibu_by_states, aes(x=Unabbreviated, y=Median)) +
geom_bar(stat="identity", width=.8, fill="tomato3") +
labs(title="Ordered Bar Chart",
subtitle="Median ABV Vs State",
caption="Beer Analysis") +
xlab("State")+
ylab("Alcohol content")+
theme_wsj()+
theme(axis.text.x = element_text(angle=80, vjust=0.8))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.4)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.2)))
ibu_by_states <- imputed_df %>% group_by(State) %>% arrange(desc(ABV))
ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = unique(ibu_by_states$Unabbreviated))
ibu_by_states %>% ggplot(aes(y=ABV,x=Unabbreviated)) +
geom_boxplot(aes(fill=Unabbreviated), alpha=0.8,show.legend = FALSE) +
labs(title="Alcohol Content",
caption="Beer Analysis") +
xlab("State")+
ylab("Alcohol content")+
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.2)))
ibu_by_states <- imputed_df %>% group_by(Unabbreviated) %>% arrange(desc(IBU))
ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = unique(ibu_by_states$Unabbreviated))
ibu_by_states %>% ggplot(aes(y=IBU,x=Unabbreviated)) +
geom_boxplot(aes(fill=Unabbreviated), alpha=0.8,show.legend = FALSE) +
labs(title="IBU",
caption="Beer Analysis") +
xlab("State")+
ylab("IBU")+
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=9),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.2)))
imputed_df %>% ggplot(aes(x=ABV)) +
geom_density(aes(fill=State), alpha=0.8) +
labs(title="Density Plot",
caption="Beer Analysis",fill="States") +
xlab("State")+
ylab("Alcohol content")+
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.5)))
## Warning: Groups with fewer than two data points have been dropped.
imputed_df %>% ggplot(aes(x=ABV,fill=State)) +
geom_histogram(bins=30) +
labs(title="ABV Distribution",
caption="Beer Analysis") +
ylab("Alcohol content")+
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.5)))
imputed_df$Style <- toupper(imputed_df$Style)
IPA <- imputed_df %>% filter(str_detect(Style, "IPA"))
IPA$Style <- "IPA"
ALE <- imputed_df %>% filter(str_detect(Style, "IPA" , negate = TRUE)) %>% filter(str_detect(Style, "ALE"))
ALE$Style <- "Ale"
mod_df <- rbind(IPA, ALE)
mod_df %>% ggplot(aes(x=ABV,y=IBU,color=Style))+geom_point() +
labs(title="IBU Vs ABV",
caption="Beer Analysis") +
xlab("State")+
ylab("Alcohol content")+
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.5)))
set.seed(7)
iterations = 100
k = 30
Acc_holder = matrix(nrow = iterations, ncol = k)
for(j in 1:iterations)
{
smp <- floor(0.75 * nrow(mod_df))
train_ind <- sample(seq_len(nrow(mod_df)), size = smp)
imputed_train <- mod_df[train_ind, ]
imputed_test <- mod_df[-train_ind, ]
for(i in 1:k)
{
classifications <- knn(imputed_train[,c(9,10)], imputed_test[,c(9,10)], imputed_train$Style, prob = TRUE, k = i)
CM <- confusionMatrix(table(classifications, imputed_test$Style))
Acc_holder[j,i] = CM$overall[1]
}
}
MeanAcc = colMeans(Acc_holder)
which.max(MeanAcc)
## [1] 5
k <- c(1:30)
Mean_Acc_df <- data_frame(MeanAcc)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
Mean_Acc_df <- cbind(k, Mean_Acc_df)
Mean_Acc_df %>% ggplot(aes(x = k, y= MeanAcc)) +
geom_line(color = "blue", alpha = .8) +
labs(title="Cross-Validating Different KNN Models",
caption="KNN Score Across K's") +
xlab("K's")+
ylab("Prediction Accuracy")+
theme_wsj()+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title = element_text(size = rel(0.5)),
plot.subtitle = element_text(size = rel(0.5)),
axis.text.x = element_text(vjust=0.6,size=7),
axis.title = element_text(size = rel(0.5)),
legend.position = "right",
legend.direction ="vertical",
legend.title = element_text(size = rel(0.2)))
classifications <- knn(imputed_train[,c(9,10)], imputed_test[,c(9,10)], imputed_train$Style, prob = TRUE, k = 6)
CM <- confusionMatrix(table(classifications, imputed_test$Style))
CM
## Confusion Matrix and Statistics
##
##
## classifications Ale IPA
## Ale 195 43
## IPA 29 117
##
## Accuracy : 0.8125
## 95% CI : (0.7698, 0.8503)
## No Information Rate : 0.5833
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6094
##
## Mcnemar's Test P-Value : 0.1255
##
## Sensitivity : 0.8705
## Specificity : 0.7312
## Pos Pred Value : 0.8193
## Neg Pred Value : 0.8014
## Prevalence : 0.5833
## Detection Rate : 0.5078
## Detection Prevalence : 0.6198
## Balanced Accuracy : 0.8009
##
## 'Positive' Class : Ale
##
imputed_df$Style <- toupper(imputed_df$Style)
Blonde <- imputed_df %>%filter(str_detect(Style, "BLONDE"))
Blonde$Style <- "Light"
Light <- imputed_df %>%filter(str_detect(Style, "LIGHT"))
Light$Style <- "Light"
Cream <- imputed_df %>%filter(str_detect(Style, "CREAM"))
Cream$Style <- "Light"
Shandy <- imputed_df %>%filter(str_detect(Style, "SHANDY"))
Shandy$Style <- "Light"
Porter <- imputed_df %>%filter(str_detect(Style, "PORTER"))
Porter$Style <- "Dark"
Stout <- imputed_df %>%filter(str_detect(Style, "STOUT"))
Stout$Style <- "Dark"
Dark <- imputed_df %>%filter(str_detect(Style, "DARK"))
Dark$Style <- "Dark"
Black <- imputed_df %>%filter(str_detect(Style, "BLACK"))
Black$Style <- "Dark"
color <- rbind(Blonde,Light,Porter,Stout,Dark,Black,Cream,Shandy)
iterations = 100
AccHolder = numeric(100)
SensHolder = numeric(100)
SpecHolder = numeric(100)
for(seed in 1:iterations)
{
set.seed(seed)
Indices = sample(seq(1:length(color$Style)),round(.75*length(color$Style)))
train_color = color[Indices,]
test_color = color[-Indices,]
model = naiveBayes(train_color[,c("ABV", "IBU", "State")],factor(train_color$Style, labels = c("Light", "Dark")))
CM = confusionMatrix(table(factor(test_color$Style, labels = c("Light", "Dark")),predict(model,test_color[,c("ABV", "IBU", "State")])))
AccHolder[seed] = CM$overall[1]
SensHolder[seed] = CM$byClass[1]
SpecHolder[seed] = CM$byClass[2]
}
mean(AccHolder)
## [1] 0.7713684
mean(SensHolder)
## [1] 0.9037857
mean(SpecHolder)
## [1] 0.6547825
which.max(AccHolder)
## [1] 100
which.max(SensHolder)
## [1] 2
which.max(SpecHolder)
## [1] 100
set.seed(38)
Indices = sample(seq(1:length(color$Style)),round(.75*length(color$Style)))
train_color = color[Indices,]
test_color = color[-Indices,]
model = naiveBayes(train_color[,c("ABV", "IBU", "State")],factor(train_color$Style, labels = c("Light", "Dark")))
CM = confusionMatrix(table(factor(test_color$Style, labels = c("Light", "Dark")),predict(model,test_color[,c("ABV", "IBU", "State")])))
CM
## Confusion Matrix and Statistics
##
##
## Light Dark
## Light 41 14
## Dark 5 35
##
## Accuracy : 0.8
## 95% CI : (0.7054, 0.8751)
## No Information Rate : 0.5158
## P-Value [Acc > NIR] : 8.836e-09
##
## Kappa : 0.602
##
## Mcnemar's Test P-Value : 0.06646
##
## Sensitivity : 0.8913
## Specificity : 0.7143
## Pos Pred Value : 0.7455
## Neg Pred Value : 0.8750
## Prevalence : 0.4842
## Detection Rate : 0.4316
## Detection Prevalence : 0.5789
## Balanced Accuracy : 0.8028
##
## 'Positive' Class : Light
##
pred_nb <- predict(model,test_color[,c("ABV", "IBU", "State")], type = 'raw')
pred <- prediction(pred_nb[, 2], test_color$Style)
nb.prff = performance(pred, "tpr", "fpr")
plot(nb.prff,main="ROC Curve")
Following are insights from analysis