Introduction

This is an analysis of a dataset containing beer and brewery data. The dataset contains 2410 beers from 558 breweries in 50 states of the United States and the District of Columbia. The goal was to explore the data set and look for interesting features of the data. Once the data was explored, the analysis focused on how international bitterness unit (IBU) and alcohol by volume (ABV) of beer are distributed over states and the relationship between IBU and ABV. The analysis also includes a discussion of the missing data and how the missing data impacts making insights from the dataset.

Load Libraries

  • ggplot2 - visualizations
  • ggthemes - Themes
  • tidyverse
  • naniar and visdat - visualizations for missing data
  • ggthemes - themes for ggplot

Load Data

  • Load Breweries from csv
  • Load Beers from csv
  • Merge Breweries and Beers
  • Rename Names column after merging two datasets

breweries_ds <- read.csv('data/Breweries.csv',header = TRUE)
states_ds    <- read.csv('data/state_abbreviations.csv',header = TRUE)
#Trim leading and trailing spaces
breweries_ds$State       <- as.factor(str_trim(breweries_ds$State,side="both"))
states_ds$State          <- as.factor(str_trim(states_ds$State,side="both"))
states_ds$Unabbreviated  <- as.factor(capitalize(str_trim(states_ds$Unabbreviated,side="both")))
states_ds$CensusDivision <- as.factor(capitalize(str_trim(states_ds$CensusDivision,side="both"))) 
#Merge with states_ds for state names
breweries_ds <- merge(breweries_ds,states_ds,by.x="State",by.y="State")

beers_ds     <- read.csv('data/Beers.csv'    ,header = TRUE)
bmerged_ds   <- merge(breweries_ds,beers_ds,by.x="Brew_ID",by.y="Brewery_id")

names(bmerged_ds)[3] <- "BreweryName"
names(bmerged_ds)[7] <- "BeerName"

Structures of the datasets

  • breweries_ds - Breweries in USA
  • beers_ds - Different beer styles with ABV and IBU
  • bmerged_ds - Merged dataset
str(breweries_ds)
## 'data.frame':    558 obs. of  6 variables:
##  $ State         : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Brew_ID       : int  494 224 459 271 454 558 103 413 479 287 ...
##  $ Name          : Factor w/ 551 levels "10 Barrel Brewing Company",..: 98 318 276 16 161 442 279 464 219 41 ...
##  $ City          : Factor w/ 384 levels "Abingdon","Abita Springs",..: 8 8 317 165 342 8 8 157 34 127 ...
##  $ Unabbreviated : Factor w/ 51 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 1 1 1 ...
##  $ CensusDivision: Factor w/ 9 levels "East north central",..: 6 6 6 6 6 6 6 2 2 2 ...
str(beers_ds)
## 'data.frame':    2410 obs. of  7 variables:
##  $ Name      : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1638 577 1704 1842 1819 268 1160 758 1093 486 ...
##  $ Beer_ID   : int  1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
##  $ ABV       : num  0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
##  $ IBU       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Brewery_id: int  409 178 178 178 178 178 178 178 178 178 ...
##  $ Style     : Factor w/ 100 levels "","Abbey Single Ale",..: 19 18 16 12 16 80 18 22 18 12 ...
##  $ Ounces    : num  12 12 12 12 12 12 12 12 12 12 ...
str(bmerged_ds)
## 'data.frame':    2410 obs. of  12 variables:
##  $ Brew_ID       : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ State         : Factor w/ 51 levels "AK","AL","AR",..: 24 24 24 24 24 24 18 18 18 18 ...
##  $ BreweryName   : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 355 355 355 355 355 12 12 12 12 ...
##  $ City          : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 228 228 228 228 228 200 200 200 200 ...
##  $ Unabbreviated : Factor w/ 51 levels "Alabama","Alaska",..: 24 24 24 24 24 24 18 18 18 18 ...
##  $ CensusDivision: Factor w/ 9 levels "East north central",..: 8 8 8 8 8 8 2 2 2 2 ...
##  $ BeerName      : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1258 2185 802 1525 1640 1926 1116 71 1218 2017 ...
##  $ Beer_ID       : int  2691 2690 2692 2687 2689 2688 2676 2683 2685 2674 ...
##  $ ABV           : num  0.049 0.048 0.045 0.056 0.06 0.06 0.065 0.042 0.125 0.05 ...
##  $ IBU           : int  26 19 50 47 38 25 NA 42 80 20 ...
##  $ Style         : Factor w/ 100 levels "","Abbey Single Ale",..: 77 48 16 57 83 22 26 18 46 48 ...
##  $ Ounces        : num  16 16 16 16 16 16 16 16 16 16 ...

Merged dataset

  • First and last 10 records from merged set - bmerged_ds
head(bmerged_ds)
tail(bmerged_ds)

Summary Statistics

  • breweris_ds - Breweries in USA
  • beers_ds - Different beer styles with ABV and IBU
  • bmerged_ds - Merged dataset

As we can see from Summary, IBU and ABV have some missing values.We can either remove or impute these missing values with various techniques.

Let’s first find out pattern of missing data.

Please refer Imputation techniques for more details

Summary of merged dataset

summary(bmerged_ds)
##     Brew_ID          State                          BreweryName  
##  Min.   :  1.0   CO     : 265   Brewery Vivant            :  62  
##  1st Qu.: 94.0   CA     : 183   Oskar Blues Brewery       :  46  
##  Median :206.0   MI     : 162   Sun King Brewing Company  :  38  
##  Mean   :232.7   IN     : 139   Cigar City Brewing Company:  25  
##  3rd Qu.:367.0   TX     : 130   Sixpoint Craft Ales       :  24  
##  Max.   :558.0   OR     : 125   Hopworks Urban Brewery    :  23  
##                  (Other):1406   (Other)                   :2192  
##            City         Unabbreviated             CensusDivision
##  Grand Rapids:  66   Colorado  : 265   East north central:528   
##  Portland    :  64   California: 183   Mountain          :448   
##  Chicago     :  55   Michigan  : 162   Pacific           :428   
##  Indianapolis:  43   Indiana   : 139   South atlantic    :220   
##  San Diego   :  42   Texas     : 130   New england       :198   
##  Boulder     :  41   Oregon    : 125   West north central:185   
##  (Other)     :2099   (Other)   :1406   (Other)           :403   
##                    BeerName       Beer_ID            ABV         
##  Nonstop Hef Hop       :  12   Min.   :   1.0   Min.   :0.00100  
##  Dale's Pale Ale       :   6   1st Qu.: 808.2   1st Qu.:0.05000  
##  Oktoberfest           :   6   Median :1453.5   Median :0.05600  
##  Longboard Island Lager:   4   Mean   :1431.1   Mean   :0.05977  
##  1327 Pod's ESB        :   3   3rd Qu.:2075.8   3rd Qu.:0.06700  
##  Boston Lager          :   3   Max.   :2692.0   Max.   :0.12800  
##  (Other)               :2376                    NA's   :62       
##       IBU                                    Style          Ounces     
##  Min.   :  4.00   American IPA                  : 424   Min.   : 8.40  
##  1st Qu.: 21.00   American Pale Ale (APA)       : 245   1st Qu.:12.00  
##  Median : 35.00   American Amber / Red Ale      : 133   Median :12.00  
##  Mean   : 42.71   American Blonde Ale           : 108   Mean   :13.59  
##  3rd Qu.: 64.00   American Double / Imperial IPA: 105   3rd Qu.:16.00  
##  Max.   :138.00   American Pale Wheat Ale       :  97   Max.   :32.00  
##  NA's   :1005     (Other)                       :1298

Missing values and Imputations

package naniar and visat package provides visualizations of missing data.

Fields with missing values

vis_miss(bmerged_ds)

Missing percentage:

Field Name Pecentage(%)
ABV 2.57
IBU 41.7

Total Missing - 3.7%



Here is another color coded view of missing data with datatypes

vis_guess(bmerged_ds)

mice::md.pattern()

The MICE package (stands for Multiple Imputation by Chained Equations) can help in several ways - here, we’ll use it for finding the NA-patterns:

md.pattern(bmerged_ds)

##      Brew_ID State BreweryName City Unabbreviated CensusDivision BeerName
## 1405       1     1           1    1             1              1        1
## 943        1     1           1    1             1              1        1
## 62         1     1           1    1             1              1        1
##            0     0           0    0             0              0        0
##      Beer_ID Style Ounces ABV  IBU     
## 1405       1     1      1   1    1    0
## 943        1     1      1   1    0    1
## 62         1     1      1   0    0    2
##            0     0      0  62 1005 1067

We see that there are 3 patterns:

  • 1405 observations with complete information
  • 943 observations have a NA in IBU
  • 62 observations have a NA in ABV

VIM::aggr()

The function aggr() is an additional tool to visualize the missing values.

aggr_plot <- aggr(bmerged_ds, col=c('#41A9EC','red'), numbers=TRUE, sortVars=TRUE, 
                  labels=names(data), cex.axis=.7, gap=3, 
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##        Variable      Count
##             IBU 0.41701245
##             ABV 0.02572614
##         Brew_ID 0.00000000
##           State 0.00000000
##     BreweryName 0.00000000
##            City 0.00000000
##   Unabbreviated 0.00000000
##  CensusDivision 0.00000000
##        BeerName 0.00000000
##         Beer_ID 0.00000000
##           Style 0.00000000
##          Ounces 0.00000000

Missing Pattern Interpretation

  • Left plot shows fields with missing values
  • Right plot shows missing pattern
  • 58.3% observations with complete information
  • 2.6% observations have a NA in both ABV and IBU
  • 3.91% observations have a NA in both IBU

VIM::marginplot()

marginplot(bmerged_ds[c(9,10)])

Marginplot Interpretation

For the plot, we only specifed to take ABV and IBU. Therefore, we can see here the pattern for these two columns as follows:

  • The blue dots represent Beers that have observed values in both ABV and IBU.
  • The blue boxes are their box plots.
  • The red dots represent missing values for either ABV but observed for IBU (left vertical margin) or vice versa: missing values for IBU but observed for ABV (bottom horizontal margin).
  • and their box plots.


Under the MCAR (Missing Completely at Random) assumption, the red and the blue box should be identical.
In our case it is not and 62 values are missing in both IBU and ABV. There is no way we can relate this missing information with any other fields. So we can delete these 62 observations.

Below is the table after deleting 62 observations

  • Now final dataset that we are using for analysis contains 2348 records.
  • This dataset contains missing NAs only in IBU
  • We will check again missingness and then apply imputations to populate missing data by using regression.
bmerged_final_ds <- bmerged_ds %>% filter(!is.na(ABV) | !is.na(IBU))
bmerged_final_ds

VIM::marginplot() - trimmed

This is marginplot after deleting 62 observations.

marginplot(bmerged_final_ds[c(9,10)])

Interpretations

  • Now data contains NAs only in IBU
  • Total 943 obserbvations have NAs
  • boxplots are nearly identical. MCAR assumption is met to impute data.
  • regression methodoloy will be employed to get values in missing column.

VIM::pbox()

pbox(bmerged_final_ds,pos=5)
## Warning in createPlot(main, sub, xlab, ylab, labels, ca$at): not enough
## space to display frequencies

## 
## Click in in the left margin to switch to the previous variable or in the right margin to switch to the next variable.
## To regain use of the VIM GUI and the R console, click anywhere else in the graphics window.

pbox Interpretation

The MAR assumption states that missingness is based on the other observed vairable(s) but not of the missing variable(s) itself.

  • In the above plot, we are looking at the missing values and how they are distributed w.r.t. to BeerName (position 2 in our case).
  • Evidence for MAR would be if the missing-value distributions were much higher or much lower than those of the non-missing-values.
  • But because in our example, the boxplots seem to be overlapping, hence, the missing values in IBU and ABV seem to be randomly distributed w.r.t. to BeerName, hence, providing support for the MCAR assumption, and not the MAR assumption.
  • We will proceed with Imputation to populate IBU using regression

Imputations

  • 62 records doens’t have IBU and SBV both - these are deleted
  • Total records to process 2410-62 = 2348
  • Prediction is required on 943 missing IBU
  • Total records with IBU is 1405
# Fetch records with non-missing IBU
# set indicator imputed to "N" to indicate these are original values.
non_missing_ibu <- bmerged_final_ds %>% filter(!is.na(IBU))
non_missing_ibu$imputed <- "N"

# Fetch records with missing IBU
# set indicator imputed to "Y" to indicate these are missing and will be imputed
missing_ibu <- bmerged_final_ds %>% filter(is.na(IBU))
missing_ibu$imputed <- "Y"

# Run regression model without log transform
model_ibu <- non_missing_ibu %>% lm(formula=IBU~ABV)

# Model Summary
summary(model_ibu)
## 
## Call:
## lm(formula = IBU ~ ABV, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.849 -11.977  -0.721  13.997  93.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.099      2.326  -14.66   <2e-16 ***
## ABV         1282.037     37.860   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared:  0.4497, Adjusted R-squared:  0.4493 
## F-statistic:  1147 on 1 and 1403 DF,  p-value: < 2.2e-16

# Run regression model
model_log_ibu <- non_missing_ibu %>% lm(formula=log(IBU)~log(ABV))

# log transformed Model Summary
summary(model_log_ibu)
## 
## Call:
## lm(formula = log(IBU) ~ log(ABV), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.81374 -0.30156  0.08312  0.38514  1.35081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.99667    0.18115   49.66   <2e-16 ***
## log(ABV)     1.91683    0.06363   30.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.514 on 1403 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.3923 
## F-statistic: 907.4 on 1 and 1403 DF,  p-value: < 2.2e-16

Checking on assumptions

  • Residual plots before log transformation
par(mfrow=c(1,2))
plot(model_ibu)

Interpretations

  • Residual plots shows departure from constant variation assumption
  • QQplot shows normality
  • But since constant variation assumption is violated, We will have to try log transforamtion

Residual plots after log transformation

par(mfrow=c(1,2))
plot(model_log_ibu)

Interpretations

  • Tried log on response (not shown here) and log on predictor model constant variation assumption was still no met.
  • Residual plot vs fitted plot after log-log looks much better. No serious voilation of constant variance assumption
  • QQplot shows no departure from normality
  • log log model is selected for imputation.
  • Adjusted R-squared - 39.27%
  • Model is statistically significant p-value < 0.05
  • degrees of freedom - 1403

Orignal Scatterplot


bmerged_ds %>% filter(!is.na(IBU)) %>%
               ggplot(aes(x=ABV,y=IBU))+
               geom_point()+
               theme_wsj()+geom_smooth(method="lm")+
               theme(axis.text.x = element_text(angle=65, vjust=0.3))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=8),
               axis.text.y   = element_text(vjust=0.2,size=6),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))

Scatterplot after Imputation.

# Since IBU was log transformed. Below code transforms it back original scale
missing_ibu$IBU <- exp(predict(model_log_ibu,missing_ibu))

# Combine missing and non-missing datasets
imputed_df <- rbind(non_missing_ibu,missing_ibu)

imputed_df$imputed <- as.factor(imputed_df$imputed)

imputed_df %>% ggplot(aes(x=ABV,y=IBU,color=imputed))+
               geom_point()+ggtitle("ABV Vs IBU") + 
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.3))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=8),
               axis.text.y   = element_text(vjust=0.2,size=6),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))


attach(imputed_df)

Analysis

Total Breweries per state


b_states <- breweries_ds %>% group_by(State) %>% 
            count(State,Unabbreviated) %>% arrange(desc(n))

b_states$Unabbreviated <- factor(b_states$Unabbreviated, levels = b_states$Unabbreviated) 

ggplot(b_states, aes(x=Unabbreviated, y=n)) + 
  geom_bar(stat="identity", width=.8, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="State Vs Number of Breweries", 
       caption="Beer Analysis") + 
  xlab("State")+
  ylab("Total Breweries")+
  theme_wsj()+
  theme(axis.text.x = element_text(angle=80, vjust=0.4))+  
  theme(plot.title    = element_text(size = rel(0.5)),
  plot.subtitle = element_text(size = rel(0.5)),
  axis.text.x   = element_text(vjust=0.6,size=8),
  axis.title    = element_text(size = rel(0.5)),
  legend.position  = "right",
  legend.direction ="vertical",
  legend.title = element_text(size = rel(0.2)))

Median ABV Vs State

  • State Vs Median Alcohol content

abv_by_states <- imputed_df  %>% group_by(Unabbreviated) %>% 
                 summarise(Median=as.numeric(median(ABV))) %>% arrange(desc(Median))

abv_by_states$Unabbreviated <- factor(abv_by_states$Unabbreviated, levels = abv_by_states$Unabbreviated) 

ggplot(abv_by_states, aes(x=Unabbreviated, y=Median)) + 
  geom_bar(stat="identity", width=.8, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="Median ABV Vs State", 
       caption="Beer Analysis") + 
  xlab("State")+
  ylab("Alcohol content")+
  theme_wsj()+
  theme(axis.text.x = element_text(angle=80, vjust=0.8))+  
  theme(plot.title    = element_text(size = rel(0.5)),
  plot.subtitle = element_text(size = rel(0.5)),
  axis.text.x   = element_text(vjust=0.6,size=7),
  axis.title    = element_text(size = rel(0.4)),
  legend.position  = "right",
  legend.direction ="vertical",
  legend.title = element_text(size = rel(0.2)))

Interpretation

  • As seen from the ordered bar plot DC and Kentucky has highest median ABV
  • West Virginia has highest median IBU

Median IBU Vs State


ibu_by_states <- imputed_df %>% group_by(Unabbreviated) %>%  
                 summarise(Median=as.numeric(median(IBU))) %>% arrange(desc(Median))

ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = ibu_by_states$Unabbreviated) 

ggplot(ibu_by_states, aes(x=Unabbreviated, y=Median)) + 
  geom_bar(stat="identity", width=.8, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="Median ABV Vs State", 
       caption="Beer Analysis") + 
  xlab("State")+
  ylab("Alcohol content")+
  theme_wsj()+
  theme(axis.text.x = element_text(angle=80, vjust=0.8))+  
  theme(plot.title    = element_text(size = rel(0.5)),
  plot.subtitle = element_text(size = rel(0.5)),
  axis.text.x   = element_text(vjust=0.6,size=7),
  axis.title    = element_text(size = rel(0.4)),
  legend.position  = "right",
  legend.direction ="vertical",
  legend.title = element_text(size = rel(0.2)))

Interpretation

  • As seen from the ordered bar plot West Virginia has highest median IBU

Distribution of ABV

  • boxplot by Max ABV

ibu_by_states <- imputed_df %>%  group_by(State) %>% arrange(desc(ABV))

ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = unique(ibu_by_states$Unabbreviated)) 

ibu_by_states %>% ggplot(aes(y=ABV,x=Unabbreviated)) +
                  geom_boxplot(aes(fill=Unabbreviated), alpha=0.8,show.legend = FALSE) + 
                  labs(title="Alcohol Content", 
                  caption="Beer Analysis") + 
                  xlab("State")+
                  ylab("Alcohol content")+
                  theme_wsj()+
                  theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
                  theme(plot.title    = element_text(size = rel(0.5)),
                  plot.subtitle = element_text(size = rel(0.5)),
                  axis.text.x   = element_text(vjust=0.6,size=7),
                  axis.title    = element_text(size = rel(0.5)),
                  legend.position  = "right",
                  legend.direction ="vertical",
                  legend.title = element_text(size = rel(0.2)))

Interpretation

  • As shown above ordered bar plot State of Colorado has maximum alcoholic content with ABV=0.128

Distribution of IBU

  • boxplot by Max IBU

ibu_by_states <- imputed_df %>% group_by(Unabbreviated) %>%  arrange(desc(IBU))

ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = unique(ibu_by_states$Unabbreviated)) 

ibu_by_states %>% ggplot(aes(y=IBU,x=Unabbreviated)) +
                  geom_boxplot(aes(fill=Unabbreviated), alpha=0.8,show.legend = FALSE) + 
                  labs(title="IBU", 
                  caption="Beer Analysis") + 
                  xlab("State")+
                  ylab("IBU")+
                  theme_wsj()+
                  theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
                  theme(plot.title    = element_text(size = rel(0.5)),
                  plot.subtitle = element_text(size = rel(0.5)),
                  axis.text.x   = element_text(vjust=0.6,size=9),
                  axis.title    = element_text(size = rel(0.5)),
                  legend.position  = "right",
                  legend.direction ="vertical",
                  legend.title = element_text(size = rel(0.2)))

Interpretation

  • As shown above ordered bar plot State of Colorado has maximum bitterness IBU=156.99224
  • Note this is imputed value. Originally IBU was not populated for Colorado

Density Plot - ABV


imputed_df %>% ggplot(aes(x=ABV)) +
               geom_density(aes(fill=State), alpha=0.8) + 
               labs(title="Density Plot", 
               caption="Beer Analysis",fill="States") + 
               xlab("State")+
               ylab("Alcohol content")+
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=7),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))
## Warning: Groups with fewer than two data points have been dropped.

Interpretation

  • As per box plot Alcohol content varies by State

Histogram - ABV


imputed_df %>% ggplot(aes(x=ABV,fill=State)) +
               geom_histogram(bins=30) + 
               labs(title="ABV Distribution", 
               caption="Beer Analysis") + 
               ylab("Alcohol content")+
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=7),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))

Interpretation

  • Colorado is the state with highest alcohol content And Delaware is the lowest alcohol content.
  • Overall Distribution of ABV is approximately normal.

Model Building

IPAs Vs Other Ales

  • Relationship between IPAs and other Ales
imputed_df$Style <- toupper(imputed_df$Style)
IPA <- imputed_df %>% filter(str_detect(Style, "IPA")) 
IPA$Style <- "IPA"
ALE <- imputed_df %>% filter(str_detect(Style, "IPA" , negate = TRUE)) %>% filter(str_detect(Style, "ALE"))
ALE$Style <- "Ale"
mod_df <- rbind(IPA, ALE)
mod_df %>% ggplot(aes(x=ABV,y=IBU,color=Style))+geom_point() +
           labs(title="IBU Vs ABV", 
           caption="Beer Analysis") + 
           xlab("State")+
           ylab("Alcohol content")+
           theme_wsj()+
           theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
           theme(plot.title    = element_text(size = rel(0.5)),
                 plot.subtitle = element_text(size = rel(0.5)),
                 axis.text.x   = element_text(vjust=0.6,size=7),
                 axis.title    = element_text(size = rel(0.5)),
                 legend.position  = "right",
                 legend.direction ="vertical",
                 legend.title = element_text(size = rel(0.5)))

Interpretation

  • Visual evidence of Linear relationship between IPAs and other Ales
  • Clear distinction between two types of Ales
  • IPA tends to be more Bitter compared to Other Ales

Knn Classifier

  • This Model predicts Style usign ABV and IBU
set.seed(7)
iterations = 100
k = 30
Acc_holder = matrix(nrow = iterations, ncol = k)
for(j in 1:iterations)
{
smp <- floor(0.75 * nrow(mod_df))
train_ind <- sample(seq_len(nrow(mod_df)), size = smp)
imputed_train <- mod_df[train_ind, ]
imputed_test  <- mod_df[-train_ind, ]
  for(i in 1:k)
  {
  classifications <- knn(imputed_train[,c(9,10)], imputed_test[,c(9,10)], imputed_train$Style,     prob = TRUE, k = i)
  CM <- confusionMatrix(table(classifications, imputed_test$Style))
    Acc_holder[j,i] = CM$overall[1]
  }
}
MeanAcc = colMeans(Acc_holder)
which.max(MeanAcc)
## [1] 5
k <- c(1:30)
Mean_Acc_df <- data_frame(MeanAcc)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
Mean_Acc_df <- cbind(k, Mean_Acc_df)
Mean_Acc_df %>% ggplot(aes(x = k, y= MeanAcc)) +
               geom_line(color = "blue", alpha = .8) +
               labs(title="Cross-Validating Different KNN Models", 
               caption="KNN Score Across K's") + 
               xlab("K's")+
               ylab("Prediction Accuracy")+
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=7),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.2)))

classifications <- knn(imputed_train[,c(9,10)], imputed_test[,c(9,10)], imputed_train$Style,     prob = TRUE, k = 6)
CM <- confusionMatrix(table(classifications, imputed_test$Style))
CM
## Confusion Matrix and Statistics
## 
##                
## classifications Ale IPA
##             Ale 195  43
##             IPA  29 117
##                                           
##                Accuracy : 0.8125          
##                  95% CI : (0.7698, 0.8503)
##     No Information Rate : 0.5833          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6094          
##                                           
##  Mcnemar's Test P-Value : 0.1255          
##                                           
##             Sensitivity : 0.8705          
##             Specificity : 0.7312          
##          Pos Pred Value : 0.8193          
##          Neg Pred Value : 0.8014          
##              Prevalence : 0.5833          
##          Detection Rate : 0.5078          
##    Detection Prevalence : 0.6198          
##       Balanced Accuracy : 0.8009          
##                                           
##        'Positive' Class : Ale             
## 

Naive Bayes

Further exploration

  • Separating Beers into known Light and Dark
  • Only Beers that are definitely Light or Dark are used
imputed_df$Style <- toupper(imputed_df$Style)
Blonde <- imputed_df %>%filter(str_detect(Style, "BLONDE"))
Blonde$Style <- "Light"
Light <- imputed_df %>%filter(str_detect(Style, "LIGHT"))
Light$Style <- "Light"
Cream <- imputed_df %>%filter(str_detect(Style, "CREAM"))
Cream$Style <- "Light"
Shandy <- imputed_df %>%filter(str_detect(Style, "SHANDY"))
Shandy$Style <- "Light"
Porter <- imputed_df %>%filter(str_detect(Style, "PORTER"))
Porter$Style <- "Dark"
Stout <- imputed_df %>%filter(str_detect(Style, "STOUT"))
Stout$Style <- "Dark"
Dark <- imputed_df %>%filter(str_detect(Style, "DARK"))
Dark$Style <- "Dark"
Black <- imputed_df %>%filter(str_detect(Style, "BLACK"))
Black$Style <- "Dark"
color <- rbind(Blonde,Light,Porter,Stout,Dark,Black,Cream,Shandy)
  • Naive Bayes to classify Light Vs. Dark Beers
  • ROC Curve plot
  • Confusion Matrix
iterations = 100
AccHolder = numeric(100)
SensHolder = numeric(100)
SpecHolder = numeric(100)
for(seed in 1:iterations)
{
    set.seed(seed)
    Indices = sample(seq(1:length(color$Style)),round(.75*length(color$Style)))
    train_color = color[Indices,]
    test_color = color[-Indices,]
    model = naiveBayes(train_color[,c("ABV", "IBU", "State")],factor(train_color$Style, labels     = c("Light", "Dark")))
    CM = confusionMatrix(table(factor(test_color$Style, labels = c("Light",   "Dark")),predict(model,test_color[,c("ABV", "IBU", "State")])))
    AccHolder[seed] = CM$overall[1]
    SensHolder[seed] = CM$byClass[1]
    SpecHolder[seed] = CM$byClass[2]
}
mean(AccHolder)
## [1] 0.7713684
mean(SensHolder)
## [1] 0.9037857
mean(SpecHolder)
## [1] 0.6547825
which.max(AccHolder)
## [1] 100
which.max(SensHolder)
## [1] 2
which.max(SpecHolder)
## [1] 100
set.seed(38)
Indices = sample(seq(1:length(color$Style)),round(.75*length(color$Style)))
train_color = color[Indices,]
test_color = color[-Indices,]
model = naiveBayes(train_color[,c("ABV", "IBU", "State")],factor(train_color$Style, labels = c("Light", "Dark")))
CM = confusionMatrix(table(factor(test_color$Style, labels = c("Light", "Dark")),predict(model,test_color[,c("ABV", "IBU", "State")])))
CM
## Confusion Matrix and Statistics
## 
##        
##         Light Dark
##   Light    41   14
##   Dark      5   35
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.7054, 0.8751)
##     No Information Rate : 0.5158          
##     P-Value [Acc > NIR] : 8.836e-09       
##                                           
##                   Kappa : 0.602           
##                                           
##  Mcnemar's Test P-Value : 0.06646         
##                                           
##             Sensitivity : 0.8913          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.7455          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.4842          
##          Detection Rate : 0.4316          
##    Detection Prevalence : 0.5789          
##       Balanced Accuracy : 0.8028          
##                                           
##        'Positive' Class : Light           
## 
pred_nb <- predict(model,test_color[,c("ABV", "IBU", "State")], type = 'raw')
pred <- prediction(pred_nb[, 2], test_color$Style)
nb.prff = performance(pred, "tpr", "fpr")
plot(nb.prff,main="ROC Curve")

Conclusion

Following are insights from analysis

  • There is evidence that data is missing at random
  • There appears to be geographic clustering of median ABV and IBU
  • Most beers have moderate Alcohol content (5 to 6.7%)
  • There is evidence of linear relationship between IBU and ABV