Introduction

This is an analysis of a dataset containing beer and brewery data. The dataset contains 2410 beers from 558 breweries in 50 states of the United States and the District of Columbia. The goal was to explore the data set and look for interesting features of the data. Once the data was explored, the analysis focused on how international bitterness unit (IBU) and alcohol by volume (ABV) of beer are distributed over states and the relationship between IBU and ABV. The analysis also includes a discussion of the missing data and how the missing data impacts making insights from the dataset.

Load Libraries

ggplot2 - visualizations
ggthemes - Themes
tidyverse
naniar and visdat - visualizations for missing data
ggthemes - themes for ggplot

Load Data

Load Breweries from csv
Load Beers from csv
Merge Breweries and Beers
Rename Names column after merging two datasets


breweries_ds <- read.csv('data/Breweries.csv',header = TRUE)
states_ds    <- read.csv('data/state_abbreviations.csv',header = TRUE)
#Trim leading and trailing spaces
breweries_ds$State       <- as.factor(str_trim(breweries_ds$State,side="both"))
states_ds$State          <- as.factor(str_trim(states_ds$State,side="both"))
states_ds$Unabbreviated  <- as.factor(capitalize(str_trim(states_ds$Unabbreviated,side="both")))
states_ds$CensusDivision <- as.factor(capitalize(str_trim(states_ds$CensusDivision,side="both"))) 
#Merge with states_ds for state names
breweries_ds <- merge(breweries_ds,states_ds,by.x="State",by.y="State")

beers_ds     <- read.csv('data/Beers.csv'    ,header = TRUE)
bmerged_ds   <- merge(breweries_ds,beers_ds,by.x="Brew_ID",by.y="Brewery_id")

names(bmerged_ds)[3] <- "BreweryName"
names(bmerged_ds)[7] <- "BeerName"

Structures of the datasets

breweries_ds - Breweries in USA
beers_ds - Different beer styles with ABV and IBU
bmerged_ds - Merged dataset

str(breweries_ds)
## 'data.frame':    558 obs. of  6 variables:
##  $ State         : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Brew_ID       : int  494 224 459 271 454 558 103 413 479 287 ...
##  $ Name          : Factor w/ 551 levels "10 Barrel Brewing Company",..: 98 318 276 16 161 442 279 464 219 41 ...
##  $ City          : Factor w/ 384 levels "Abingdon","Abita Springs",..: 8 8 317 165 342 8 8 157 34 127 ...
##  $ Unabbreviated : Factor w/ 51 levels "Alabama","Alaska",..: 2 2 2 2 2 2 2 1 1 1 ...
##  $ CensusDivision: Factor w/ 9 levels "East north central",..: 6 6 6 6 6 6 6 2 2 2 ...
str(beers_ds)
## 'data.frame':    2410 obs. of  7 variables:
##  $ Name      : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1638 577 1704 1842 1819 268 1160 758 1093 486 ...
##  $ Beer_ID   : int  1436 2265 2264 2263 2262 2261 2260 2259 2258 2131 ...
##  $ ABV       : num  0.05 0.066 0.071 0.09 0.075 0.077 0.045 0.065 0.055 0.086 ...
##  $ IBU       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Brewery_id: int  409 178 178 178 178 178 178 178 178 178 ...
##  $ Style     : Factor w/ 100 levels "","Abbey Single Ale",..: 19 18 16 12 16 80 18 22 18 12 ...
##  $ Ounces    : num  12 12 12 12 12 12 12 12 12 12 ...
str(bmerged_ds)
## 'data.frame':    2410 obs. of  12 variables:
##  $ Brew_ID       : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ State         : Factor w/ 51 levels "AK","AL","AR",..: 24 24 24 24 24 24 18 18 18 18 ...
##  $ BreweryName   : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 355 355 355 355 355 12 12 12 12 ...
##  $ City          : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 228 228 228 228 228 200 200 200 200 ...
##  $ Unabbreviated : Factor w/ 51 levels "Alabama","Alaska",..: 24 24 24 24 24 24 18 18 18 18 ...
##  $ CensusDivision: Factor w/ 9 levels "East north central",..: 8 8 8 8 8 8 2 2 2 2 ...
##  $ BeerName      : Factor w/ 2305 levels "#001 Golden Amber Lager",..: 1258 2185 802 1525 1640 1926 1116 71 1218 2017 ...
##  $ Beer_ID       : int  2691 2690 2692 2687 2689 2688 2676 2683 2685 2674 ...
##  $ ABV           : num  0.049 0.048 0.045 0.056 0.06 0.06 0.065 0.042 0.125 0.05 ...
##  $ IBU           : int  26 19 50 47 38 25 NA 42 80 20 ...
##  $ Style         : Factor w/ 100 levels "","Abbey Single Ale",..: 77 48 16 57 83 22 26 18 46 48 ...
##  $ Ounces        : num  16 16 16 16 16 16 16 16 16 16 ...

Merged dataset

First and last 10 records from merged set - bmerged_ds

head(bmerged_ds)

tail(bmerged_ds)

Summary Statistics

breweris_ds - Breweries in USA
beers_ds - Different beer styles with ABV and IBU
bmerged_ds - Merged dataset

As we can see from Summary, IBU and ABV have some missing values.We can either remove or impute these missing values with various techniques.

Let’s first find out pattern of missing data.

Please refer Imputation techniques for more details

Summary of merged dataset

summary(bmerged_ds)
##     Brew_ID          State                          BreweryName  
##  Min.   :  1.0   CO     : 265   Brewery Vivant            :  62  
##  1st Qu.: 94.0   CA     : 183   Oskar Blues Brewery       :  46  
##  Median :206.0   MI     : 162   Sun King Brewing Company  :  38  
##  Mean   :232.7   IN     : 139   Cigar City Brewing Company:  25  
##  3rd Qu.:367.0   TX     : 130   Sixpoint Craft Ales       :  24  
##  Max.   :558.0   OR     : 125   Hopworks Urban Brewery    :  23  
##                  (Other):1406   (Other)                   :2192  
##            City         Unabbreviated             CensusDivision
##  Grand Rapids:  66   Colorado  : 265   East north central:528   
##  Portland    :  64   California: 183   Mountain          :448   
##  Chicago     :  55   Michigan  : 162   Pacific           :428   
##  Indianapolis:  43   Indiana   : 139   South atlantic    :220   
##  San Diego   :  42   Texas     : 130   New england       :198   
##  Boulder     :  41   Oregon    : 125   West north central:185   
##  (Other)     :2099   (Other)   :1406   (Other)           :403   
##                    BeerName       Beer_ID            ABV         
##  Nonstop Hef Hop       :  12   Min.   :   1.0   Min.   :0.00100  
##  Dale's Pale Ale       :   6   1st Qu.: 808.2   1st Qu.:0.05000  
##  Oktoberfest           :   6   Median :1453.5   Median :0.05600  
##  Longboard Island Lager:   4   Mean   :1431.1   Mean   :0.05977  
##  1327 Pod's ESB        :   3   3rd Qu.:2075.8   3rd Qu.:0.06700  
##  Boston Lager          :   3   Max.   :2692.0   Max.   :0.12800  
##  (Other)               :2376                    NA's   :62       
##       IBU                                    Style          Ounces     
##  Min.   :  4.00   American IPA                  : 424   Min.   : 8.40  
##  1st Qu.: 21.00   American Pale Ale (APA)       : 245   1st Qu.:12.00  
##  Median : 35.00   American Amber / Red Ale      : 133   Median :12.00  
##  Mean   : 42.71   American Blonde Ale           : 108   Mean   :13.59  
##  3rd Qu.: 64.00   American Double / Imperial IPA: 105   3rd Qu.:16.00  
##  Max.   :138.00   American Pale Wheat Ale       :  97   Max.   :32.00  
##  NA's   :1005     (Other)                       :1298

Missing values and Imputations

package naniar and visat package provides visualizations of missing data.

Fields with missing values

vis_miss(bmerged_ds)

Missing percentage:

Field Name	Pecentage(%)
ABV	2.57
IBU	41.7

Total Missing - 3.7%

Here is another color coded view of missing data with datatypes

vis_guess(bmerged_ds)

mice::md.pattern()

The MICE package (stands for Multiple Imputation by Chained Equations) can help in several ways - here, we’ll use it for finding the NA-patterns:

md.pattern(bmerged_ds)

##      Brew_ID State BreweryName City Unabbreviated CensusDivision BeerName
## 1405       1     1           1    1             1              1        1
## 943        1     1           1    1             1              1        1
## 62         1     1           1    1             1              1        1
##            0     0           0    0             0              0        0
##      Beer_ID Style Ounces ABV  IBU     
## 1405       1     1      1   1    1    0
## 943        1     1      1   1    0    1
## 62         1     1      1   0    0    2
##            0     0      0  62 1005 1067

We see that there are 3 patterns:

1405 observations with complete information
943 observations have a NA in IBU
62 observations have a NA in ABV

VIM::aggr()

The function aggr() is an additional tool to visualize the missing values.

aggr_plot <- aggr(bmerged_ds, col=c('#41A9EC','red'), numbers=TRUE, sortVars=TRUE, 
                  labels=names(data), cex.axis=.7, gap=3, 
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##        Variable      Count
##             IBU 0.41701245
##             ABV 0.02572614
##         Brew_ID 0.00000000
##           State 0.00000000
##     BreweryName 0.00000000
##            City 0.00000000
##   Unabbreviated 0.00000000
##  CensusDivision 0.00000000
##        BeerName 0.00000000
##         Beer_ID 0.00000000
##           Style 0.00000000
##          Ounces 0.00000000

Missing Pattern Interpretation

Left plot shows fields with missing values
Right plot shows missing pattern
58.3% observations with complete information
2.6% observations have a NA in both ABV and IBU
3.91% observations have a NA in both IBU

VIM::marginplot()

marginplot(bmerged_ds[c(9,10)])

Marginplot Interpretation

For the plot, we only specifed to take ABV and IBU. Therefore, we can see here the pattern for these two columns as follows:

The blue dots represent Beers that have observed values in both ABV and IBU.
The blue boxes are their box plots.
The red dots represent missing values for either ABV but observed for IBU (left vertical margin) or vice versa: missing values for IBU but observed for ABV (bottom horizontal margin).
and their box plots.

Under the MCAR (Missing Completely at Random) assumption, the red and the blue box should be identical.
In our case it is not and 62 values are missing in both IBU and ABV. There is no way we can relate this missing information with any other fields. So we can delete these 62 observations.

Below is the table after deleting 62 observations

Now final dataset that we are using for analysis contains 2348 records.
This dataset contains missing NAs only in IBU
We will check again missingness and then apply imputations to populate missing data by using regression.

bmerged_final_ds <- bmerged_ds %>% filter(!is.na(ABV) | !is.na(IBU))
bmerged_final_ds

VIM::marginplot() - trimmed

This is marginplot after deleting 62 observations.

marginplot(bmerged_final_ds[c(9,10)])

Interpretations

Now data contains NAs only in IBU
Total 943 obserbvations have NAs
boxplots are nearly identical. MCAR assumption is met to impute data.
regression methodoloy will be employed to get values in missing column.

VIM::pbox()

pbox(bmerged_final_ds,pos=5)
## Warning in createPlot(main, sub, xlab, ylab, labels, ca$at): not enough
## space to display frequencies

## 
## Click in in the left margin to switch to the previous variable or in the right margin to switch to the next variable.
## To regain use of the VIM GUI and the R console, click anywhere else in the graphics window.

pbox Interpretation

The MAR assumption states that missingness is based on the other observed vairable(s) but not of the missing variable(s) itself.

In the above plot, we are looking at the missing values and how they are distributed w.r.t. to BeerName (position 2 in our case).
Evidence for MAR would be if the missing-value distributions were much higher or much lower than those of the non-missing-values.
But because in our example, the boxplots seem to be overlapping, hence, the missing values in IBU and ABV seem to be randomly distributed w.r.t. to BeerName, hence, providing support for the MCAR assumption, and not the MAR assumption.
We will proceed with Imputation to populate IBU using regression

Imputations

62 records doens’t have IBU and SBV both - these are deleted
Total records to process 2410-62 = 2348
Prediction is required on 943 missing IBU
Total records with IBU is 1405

# Fetch records with non-missing IBU
# set indicator imputed to "N" to indicate these are original values.
non_missing_ibu <- bmerged_final_ds %>% filter(!is.na(IBU))
non_missing_ibu$imputed <- "N"

# Fetch records with missing IBU
# set indicator imputed to "Y" to indicate these are missing and will be imputed
missing_ibu <- bmerged_final_ds %>% filter(is.na(IBU))
missing_ibu$imputed <- "Y"

# Run regression model without log transform
model_ibu <- non_missing_ibu %>% lm(formula=IBU~ABV)

# Model Summary
summary(model_ibu)
## 
## Call:
## lm(formula = IBU ~ ABV, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.849 -11.977  -0.721  13.997  93.458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.099      2.326  -14.66   <2e-16 ***
## ABV         1282.037     37.860   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 1403 degrees of freedom
## Multiple R-squared:  0.4497, Adjusted R-squared:  0.4493 
## F-statistic:  1147 on 1 and 1403 DF,  p-value: < 2.2e-16

# Run regression model
model_log_ibu <- non_missing_ibu %>% lm(formula=log(IBU)~log(ABV))

# log transformed Model Summary
summary(model_log_ibu)
## 
## Call:
## lm(formula = log(IBU) ~ log(ABV), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.81374 -0.30156  0.08312  0.38514  1.35081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.99667    0.18115   49.66   <2e-16 ***
## log(ABV)     1.91683    0.06363   30.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.514 on 1403 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.3923 
## F-statistic: 907.4 on 1 and 1403 DF,  p-value: < 2.2e-16

Checking on assumptions

Residual plots before log transformation

par(mfrow=c(1,2))
plot(model_ibu)

Interpretations

Residual plots shows departure from constant variation assumption
QQplot shows normality
But since constant variation assumption is violated, We will have to try log transforamtion

Residual plots after log transformation

par(mfrow=c(1,2))
plot(model_log_ibu)

Interpretations

Tried log on response (not shown here) and log on predictor model constant variation assumption was still no met.
Residual plot vs fitted plot after log-log looks much better. No serious voilation of constant variance assumption
QQplot shows no departure from normality
log log model is selected for imputation.
Adjusted R-squared - 39.27%
Model is statistically significant p-value < 0.05
degrees of freedom - 1403

Orignal Scatterplot


bmerged_ds %>% filter(!is.na(IBU)) %>%
               ggplot(aes(x=ABV,y=IBU))+
               geom_point()+
               theme_wsj()+geom_smooth(method="lm")+
               theme(axis.text.x = element_text(angle=65, vjust=0.3))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=8),
               axis.text.y   = element_text(vjust=0.2,size=6),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))

Scatterplot after Imputation.

# Since IBU was log transformed. Below code transforms it back original scale
missing_ibu$IBU <- exp(predict(model_log_ibu,missing_ibu))

# Combine missing and non-missing datasets
imputed_df <- rbind(non_missing_ibu,missing_ibu)

imputed_df$imputed <- as.factor(imputed_df$imputed)

imputed_df %>% ggplot(aes(x=ABV,y=IBU,color=imputed))+
               geom_point()+ggtitle("ABV Vs IBU") + 
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.3))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=8),
               axis.text.y   = element_text(vjust=0.2,size=6),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))


attach(imputed_df)

Analysis

Total Breweries per state


b_states <- breweries_ds %>% group_by(State) %>% 
            count(State,Unabbreviated) %>% arrange(desc(n))

b_states$Unabbreviated <- factor(b_states$Unabbreviated, levels = b_states$Unabbreviated) 

ggplot(b_states, aes(x=Unabbreviated, y=n)) + 
  geom_bar(stat="identity", width=.8, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="State Vs Number of Breweries", 
       caption="Beer Analysis") + 
  xlab("State")+
  ylab("Total Breweries")+
  theme_wsj()+
  theme(axis.text.x = element_text(angle=80, vjust=0.4))+  
  theme(plot.title    = element_text(size = rel(0.5)),
  plot.subtitle = element_text(size = rel(0.5)),
  axis.text.x   = element_text(vjust=0.6,size=8),
  axis.title    = element_text(size = rel(0.5)),
  legend.position  = "right",
  legend.direction ="vertical",
  legend.title = element_text(size = rel(0.2)))

Median ABV Vs State

State Vs Median Alcohol content


abv_by_states <- imputed_df  %>% group_by(Unabbreviated) %>% 
                 summarise(Median=as.numeric(median(ABV))) %>% arrange(desc(Median))

abv_by_states$Unabbreviated <- factor(abv_by_states$Unabbreviated, levels = abv_by_states$Unabbreviated) 

ggplot(abv_by_states, aes(x=Unabbreviated, y=Median)) + 
  geom_bar(stat="identity", width=.8, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="Median ABV Vs State", 
       caption="Beer Analysis") + 
  xlab("State")+
  ylab("Alcohol content")+
  theme_wsj()+
  theme(axis.text.x = element_text(angle=80, vjust=0.8))+  
  theme(plot.title    = element_text(size = rel(0.5)),
  plot.subtitle = element_text(size = rel(0.5)),
  axis.text.x   = element_text(vjust=0.6,size=7),
  axis.title    = element_text(size = rel(0.4)),
  legend.position  = "right",
  legend.direction ="vertical",
  legend.title = element_text(size = rel(0.2)))

Interpretation

As seen from the ordered bar plot DC and Kentucky has highest median ABV
West Virginia has highest median IBU

Median IBU Vs State


ibu_by_states <- imputed_df %>% group_by(Unabbreviated) %>%  
                 summarise(Median=as.numeric(median(IBU))) %>% arrange(desc(Median))

ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = ibu_by_states$Unabbreviated) 

ggplot(ibu_by_states, aes(x=Unabbreviated, y=Median)) + 
  geom_bar(stat="identity", width=.8, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="Median ABV Vs State", 
       caption="Beer Analysis") + 
  xlab("State")+
  ylab("Alcohol content")+
  theme_wsj()+
  theme(axis.text.x = element_text(angle=80, vjust=0.8))+  
  theme(plot.title    = element_text(size = rel(0.5)),
  plot.subtitle = element_text(size = rel(0.5)),
  axis.text.x   = element_text(vjust=0.6,size=7),
  axis.title    = element_text(size = rel(0.4)),
  legend.position  = "right",
  legend.direction ="vertical",
  legend.title = element_text(size = rel(0.2)))

Interpretation

As seen from the ordered bar plot West Virginia has highest median IBU

Distribution of ABV

boxplot by Max ABV


ibu_by_states <- imputed_df %>%  group_by(State) %>% arrange(desc(ABV))

ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = unique(ibu_by_states$Unabbreviated)) 

ibu_by_states %>% ggplot(aes(y=ABV,x=Unabbreviated)) +
                  geom_boxplot(aes(fill=Unabbreviated), alpha=0.8,show.legend = FALSE) + 
                  labs(title="Alcohol Content", 
                  caption="Beer Analysis") + 
                  xlab("State")+
                  ylab("Alcohol content")+
                  theme_wsj()+
                  theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
                  theme(plot.title    = element_text(size = rel(0.5)),
                  plot.subtitle = element_text(size = rel(0.5)),
                  axis.text.x   = element_text(vjust=0.6,size=7),
                  axis.title    = element_text(size = rel(0.5)),
                  legend.position  = "right",
                  legend.direction ="vertical",
                  legend.title = element_text(size = rel(0.2)))

Interpretation

As shown above ordered bar plot State of Colorado has maximum alcoholic content with ABV=0.128

Distribution of IBU

boxplot by Max IBU


ibu_by_states <- imputed_df %>% group_by(Unabbreviated) %>%  arrange(desc(IBU))

ibu_by_states$Unabbreviated <- factor(ibu_by_states$Unabbreviated, levels = unique(ibu_by_states$Unabbreviated)) 

ibu_by_states %>% ggplot(aes(y=IBU,x=Unabbreviated)) +
                  geom_boxplot(aes(fill=Unabbreviated), alpha=0.8,show.legend = FALSE) + 
                  labs(title="IBU", 
                  caption="Beer Analysis") + 
                  xlab("State")+
                  ylab("IBU")+
                  theme_wsj()+
                  theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
                  theme(plot.title    = element_text(size = rel(0.5)),
                  plot.subtitle = element_text(size = rel(0.5)),
                  axis.text.x   = element_text(vjust=0.6,size=9),
                  axis.title    = element_text(size = rel(0.5)),
                  legend.position  = "right",
                  legend.direction ="vertical",
                  legend.title = element_text(size = rel(0.2)))

Interpretation

As shown above ordered bar plot State of Colorado has maximum bitterness IBU=156.99224
Note this is imputed value. Originally IBU was not populated for Colorado

Density Plot - ABV


imputed_df %>% ggplot(aes(x=ABV)) +
               geom_density(aes(fill=State), alpha=0.8) + 
               labs(title="Density Plot", 
               caption="Beer Analysis",fill="States") + 
               xlab("State")+
               ylab("Alcohol content")+
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=7),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))
## Warning: Groups with fewer than two data points have been dropped.

Interpretation

As per box plot Alcohol content varies by State

Histogram - ABV


imputed_df %>% ggplot(aes(x=ABV,fill=State)) +
               geom_histogram(bins=30) + 
               labs(title="ABV Distribution", 
               caption="Beer Analysis") + 
               ylab("Alcohol content")+
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=7),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.5)))

Interpretation

Colorado is the state with highest alcohol content And Delaware is the lowest alcohol content.
Overall Distribution of ABV is approximately normal.

Model Building

IPAs Vs Other Ales

Relationship between IPAs and other Ales

imputed_df$Style <- toupper(imputed_df$Style)
IPA <- imputed_df %>% filter(str_detect(Style, "IPA")) 
IPA$Style <- "IPA"
ALE <- imputed_df %>% filter(str_detect(Style, "IPA" , negate = TRUE)) %>% filter(str_detect(Style, "ALE"))
ALE$Style <- "Ale"
mod_df <- rbind(IPA, ALE)
mod_df %>% ggplot(aes(x=ABV,y=IBU,color=Style))+geom_point() +
           labs(title="IBU Vs ABV", 
           caption="Beer Analysis") + 
           xlab("State")+
           ylab("Alcohol content")+
           theme_wsj()+
           theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
           theme(plot.title    = element_text(size = rel(0.5)),
                 plot.subtitle = element_text(size = rel(0.5)),
                 axis.text.x   = element_text(vjust=0.6,size=7),
                 axis.title    = element_text(size = rel(0.5)),
                 legend.position  = "right",
                 legend.direction ="vertical",
                 legend.title = element_text(size = rel(0.5)))

Interpretation

Visual evidence of Linear relationship between IPAs and other Ales
Clear distinction between two types of Ales
IPA tends to be more Bitter compared to Other Ales

Knn Classifier

This Model predicts Style usign ABV and IBU

set.seed(7)
iterations = 100
k = 30
Acc_holder = matrix(nrow = iterations, ncol = k)
for(j in 1:iterations)
{
smp <- floor(0.75 * nrow(mod_df))
train_ind <- sample(seq_len(nrow(mod_df)), size = smp)
imputed_train <- mod_df[train_ind, ]
imputed_test  <- mod_df[-train_ind, ]
  for(i in 1:k)
  {
  classifications <- knn(imputed_train[,c(9,10)], imputed_test[,c(9,10)], imputed_train$Style,     prob = TRUE, k = i)
  CM <- confusionMatrix(table(classifications, imputed_test$Style))
    Acc_holder[j,i] = CM$overall[1]
  }
}
MeanAcc = colMeans(Acc_holder)
which.max(MeanAcc)
## [1] 5
k <- c(1:30)
Mean_Acc_df <- data_frame(MeanAcc)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
Mean_Acc_df <- cbind(k, Mean_Acc_df)
Mean_Acc_df %>% ggplot(aes(x = k, y= MeanAcc)) +
               geom_line(color = "blue", alpha = .8) +
               labs(title="Cross-Validating Different KNN Models", 
               caption="KNN Score Across K's") + 
               xlab("K's")+
               ylab("Prediction Accuracy")+
               theme_wsj()+
               theme(axis.text.x = element_text(angle=65, vjust=0.6))+  
               theme(plot.title    = element_text(size = rel(0.5)),
               plot.subtitle = element_text(size = rel(0.5)),
               axis.text.x   = element_text(vjust=0.6,size=7),
               axis.title    = element_text(size = rel(0.5)),
               legend.position  = "right",
               legend.direction ="vertical",
               legend.title = element_text(size = rel(0.2)))

classifications <- knn(imputed_train[,c(9,10)], imputed_test[,c(9,10)], imputed_train$Style,     prob = TRUE, k = 6)
CM <- confusionMatrix(table(classifications, imputed_test$Style))
CM
## Confusion Matrix and Statistics
## 
##                
## classifications Ale IPA
##             Ale 195  43
##             IPA  29 117
##                                           
##                Accuracy : 0.8125          
##                  95% CI : (0.7698, 0.8503)
##     No Information Rate : 0.5833          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6094          
##                                           
##  Mcnemar's Test P-Value : 0.1255          
##                                           
##             Sensitivity : 0.8705          
##             Specificity : 0.7312          
##          Pos Pred Value : 0.8193          
##          Neg Pred Value : 0.8014          
##              Prevalence : 0.5833          
##          Detection Rate : 0.5078          
##    Detection Prevalence : 0.6198          
##       Balanced Accuracy : 0.8009          
##                                           
##        'Positive' Class : Ale             
##

Naive Bayes

Further exploration

Separating Beers into known Light and Dark
Only Beers that are definitely Light or Dark are used

imputed_df$Style <- toupper(imputed_df$Style)
Blonde <- imputed_df %>%filter(str_detect(Style, "BLONDE"))
Blonde$Style <- "Light"
Light <- imputed_df %>%filter(str_detect(Style, "LIGHT"))
Light$Style <- "Light"
Cream <- imputed_df %>%filter(str_detect(Style, "CREAM"))
Cream$Style <- "Light"
Shandy <- imputed_df %>%filter(str_detect(Style, "SHANDY"))
Shandy$Style <- "Light"
Porter <- imputed_df %>%filter(str_detect(Style, "PORTER"))
Porter$Style <- "Dark"
Stout <- imputed_df %>%filter(str_detect(Style, "STOUT"))
Stout$Style <- "Dark"
Dark <- imputed_df %>%filter(str_detect(Style, "DARK"))
Dark$Style <- "Dark"
Black <- imputed_df %>%filter(str_detect(Style, "BLACK"))
Black$Style <- "Dark"
color <- rbind(Blonde,Light,Porter,Stout,Dark,Black,Cream,Shandy)

Naive Bayes to classify Light Vs. Dark Beers
ROC Curve plot
Confusion Matrix

iterations = 100
AccHolder = numeric(100)
SensHolder = numeric(100)
SpecHolder = numeric(100)
for(seed in 1:iterations)
{
    set.seed(seed)
    Indices = sample(seq(1:length(color$Style)),round(.75*length(color$Style)))
    train_color = color[Indices,]
    test_color = color[-Indices,]
    model = naiveBayes(train_color[,c("ABV", "IBU", "State")],factor(train_color$Style, labels     = c("Light", "Dark")))
    CM = confusionMatrix(table(factor(test_color$Style, labels = c("Light",   "Dark")),predict(model,test_color[,c("ABV", "IBU", "State")])))
    AccHolder[seed] = CM$overall[1]
    SensHolder[seed] = CM$byClass[1]
    SpecHolder[seed] = CM$byClass[2]
}
mean(AccHolder)
## [1] 0.7713684
mean(SensHolder)
## [1] 0.9037857
mean(SpecHolder)
## [1] 0.6547825
which.max(AccHolder)
## [1] 100
which.max(SensHolder)
## [1] 2
which.max(SpecHolder)
## [1] 100
set.seed(38)
Indices = sample(seq(1:length(color$Style)),round(.75*length(color$Style)))
train_color = color[Indices,]
test_color = color[-Indices,]
model = naiveBayes(train_color[,c("ABV", "IBU", "State")],factor(train_color$Style, labels = c("Light", "Dark")))
CM = confusionMatrix(table(factor(test_color$Style, labels = c("Light", "Dark")),predict(model,test_color[,c("ABV", "IBU", "State")])))
CM
## Confusion Matrix and Statistics
## 
##        
##         Light Dark
##   Light    41   14
##   Dark      5   35
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.7054, 0.8751)
##     No Information Rate : 0.5158          
##     P-Value [Acc > NIR] : 8.836e-09       
##                                           
##                   Kappa : 0.602           
##                                           
##  Mcnemar's Test P-Value : 0.06646         
##                                           
##             Sensitivity : 0.8913          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.7455          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.4842          
##          Detection Rate : 0.4316          
##    Detection Prevalence : 0.5789          
##       Balanced Accuracy : 0.8028          
##                                           
##        'Positive' Class : Light           
## 
pred_nb <- predict(model,test_color[,c("ABV", "IBU", "State")], type = 'raw')
pred <- prediction(pred_nb[, 2], test_color$Style)
nb.prff = performance(pred, "tpr", "fpr")
plot(nb.prff,main="ROC Curve")

Conclusion

Following are insights from analysis

There is evidence that data is missing at random
There appears to be geographic clustering of median ABV and IBU
Most beers have moderate Alcohol content (5 to 6.7%)
There is evidence of linear relationship between IBU and ABV

Beer_Analysis

Sachin Chavan, Jordan Salsman

10/13/2019

Introduction

Load Libraries

Load Data

Structures of the datasets

Merged dataset

Summary Statistics

Missing values and Imputations

Fields with missing values

mice::md.pattern()

We see that there are 3 patterns:

VIM::aggr()

Missing Pattern Interpretation

VIM::marginplot()

Marginplot Interpretation

Below is the table after deleting 62 observations

VIM::marginplot() - trimmed

Interpretations

VIM::pbox()

pbox Interpretation

Imputations

Checking on assumptions

Interpretations

Residual plots after log transformation

Interpretations

Orignal Scatterplot

Scatterplot after Imputation.

Analysis

Total Breweries per state

Median ABV Vs State

Interpretation

Median IBU Vs State

Interpretation

Distribution of ABV

Interpretation

Distribution of IBU

Interpretation

Density Plot - ABV

Interpretation

Histogram - ABV

Interpretation

Model Building

IPAs Vs Other Ales

Interpretation

Knn Classifier

Naive Bayes

Further exploration

Conclusion