Udacity Data Analyst Nanodegree

Project Overview

This report explores a dataset containing attributes for 4898 instances of the Portuguese “Vinho Verde” white wine.

The attributes are the following:

  1. fixed acidity (tartaric acid - g / dm3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
  2. volatile acidity (acetic acid - g / dm3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.
  3. citric acid (g / dm3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
  4. residual sugar (g / dm3): the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 g / dm3 and wines with more than 45 g / dm3 are considered sweet.
  5. chlorides (sodium chloride - g / dm3): the amount of salt in the wine.
  6. free sulfur dioxide (mg / dm3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine.
  7. total sulfur dioxide (mg / dm3): amount of free and bound forms of S02. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
  8. density (g / cm3): the density of wine is close to that of water depending on the percent alcohol and sugar content.
  9. pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
  10. sulphates (potassium sulphate - g / dm3): a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant.
  11. alcohol (% by volume) - the percent alcohol content of the wine.
  12. quality: score between 0 and 10 (based on sensory data).

Univariate Plots Section

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Main feature of interest: Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Quality follows a normal-like distribution with discrete integer values.

Regarding acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

There is an interesting peak at .49 and a smaller one at .74g / dm3. This suggests me that maybe a standard amount of citric acid is added to some of the wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH shows a bell shaped distribution. I wonder how it relates individually to the concentrations of acids.

Regarding SO2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide has some extreme outliers to the right of the curve.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
wines$bound.sulfur.dioxide <- with(wines, 
                                   total.sulfur.dioxide - free.sulfur.dioxide)
wines$sulfur.dioxide.ratio <- with(wines, 
                                   free.sulfur.dioxide / bound.sulfur.dioxide)

I created a bound sulfur dioxide variable by subtracting the free from the total sulfur dioxide. Then I created a feature consisting of the ratio between the free and bound sulfur dioxide present in the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

It looks very similar to the total sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02419 0.23600 0.33990 0.36750 0.46150 2.45500

I transformed the scale to log10 to better visualize the distribution. Maybe it will be useful when trying to predict the quality, or even give us some insight about the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates are a little positively skewed. Since it can contribute to sulfur dioxide levels, it can be valuable to plot relations between them.

Other attributes

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

I transformed the residual sugar to a log10 scale to better visualize its distribution. The transformed variable appears bimodal, with peaks around 1.3 and 8.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

I transformed the long tail distribution with a log10 scale so it could be better visualized. After the transformation, the chlorides histogram appears normal, with some outliers on the right side of the curve.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Most of the density values are between .99 and 1.00 g / cm3, but there are some outliers near 1.01 and 1.04.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol presents mostly discrete values, with intervals of .1%. There are a few exceptions though.

Univariate Analysis

What is the structure of your dataset?

There are 11 variables representing physicochemical measurements and 1 variable representing the median of at least 3 evaluations of quality made by wine experts, varying from 0 (very bad) to 10 (very excellent).

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest. The objective of the analysis is to determine the features that influence wine quality the most, and then building a predictive model of quality using these variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Most features have an aproximately normal distribution, just like the quality variable. It makes it hard to guess which features will have a greater impact on the prediction of quality.

Did you create any new variables from existing variables in the dataset?

I created the “sulfur.dioxide.ratio”, which consists of the ratio between “free.sulfur.dioxide” and “total.sulfur.dioxide”.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of citric acid presented two unusual peaks which standed out of an otherwise normal distribution.

I preformed a log transformation on the residual sugar and chlorides distributions, because they were very skewed, and the transformations allowed better visualizations of the data.

Bivariate Plots Section

This correlation matrix naturally shows strong correlations between free sulfur dioxide, total sulfur dioxide and the constructed variables bound sulfur dioxide and sulfur dioxide ratio.

It also shows interesting relations between residual.sugar vs density and alcohol vs density.

Density, residual sugar and alcohol

Density varies aproximately linearly with residual sugar (positive correlation) and with alcohol (negative correlation). It makes sense, taking into account the fermentation process of wine, in which sugar is consumed to generate alcohol. And since the residual sugar is more dense than alcohol, this inverse relation is presented.

Sulfur dioxide

The sulfur dioxide ratio increases along with the free sulfur dioxide, and wines with greater ratios tend to have smaller concentrations of bound sulfur dioxide. I wonder how quality varies related to these variables.

Acids

The only acid concentration that shows some considerable correlation with pH is the one regarding fixed acidity.

Main feature of interest: Quality

Better quality wines seem to have smaller fixed acidities on average.

The same seems to apply with volatile acidity, but nothing very conclusive.

The low correlation on the panel above can be seen on these charts of quality by citric acid.

Except for wines with quality score 3, the median pH increases along with quality score.

No clear relation between free sulfur dioxide and quality.

Here a trend can be seen. Overall, quality decreases as bound sulfur dioxide increases.

Here a slight correlation can be seen, somewhat similar to that of bound sulfur dioxide.

In general, quality increases as the ration of free and bound sulfur dioxide increases, but the correlation is weak.

Sulphates don’t seem to add much isolatedly.

Nothing very clear from these charts.

There is a curious amount of outliers for scores 5 and 6. I wonder why it happens.

A greater correlation is more evident here. This seems to be one of the most promising relations so far. Maybe it has something to do with the fact that density is highly correlated with residual sugar and alcohol concentration, features that may be more easily detected by the experts palate.

Alcohol is the variable with the greatest correlation with quality. It can be seen on the chart. Wines with grades 3 and 4 are going against the trend, but there are not many of those.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I analyzed the relations between quality and every other variable in the dataset. The two largest Pearson’s correlations found were with alcohol (.436) and density (-.307). With both variables, an aproximately linear relation exsited for wines with scores from 5 to 9. The same did not apply for scores 3 and 4.

Analyzing the wines with quality score 9, I observed that they have in average a high concentration of alcohol, a very low density, and also a low amount of residual sugar. I imagine it derives from a well adjusted fermentation process, in which the sugar from the grapes is almost completely consumed in the process of fermentation, generating an above average alcohol concentration and thus a smaller density.

There is also a curious amount of outliers for scores 5 and 6. I wonder why it happens.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Density is strongly correlated with two other variables: residual sugar (positively), and alcohol (negatively). It makes sense, taking into account the fermentation process of wine, in which sugar is consumed to generate alcohol. And since the residual sugar is more dense than alcohol, this inverse relation is presented.

Another relationship found was between fixed acidity and pH. Among the measures of acidity in the dataset, fixed acidity was the only one presenting at least a weak linear relationship with pH.

What was the strongest relationship you found?

The one between density and residual sugar. These features have a Pearson’s correlation coefficient of .839.

Multivariate Plots Section

I am dividing alcohol in bins to be able to plot density, alcohol, residual sugar and quality together and see how they relate with eachother:

It can be seen that the points corresponding to higher amounts of alcohol show wines of better quality in average, and, for a given residual sugar, quality increases as density decreases.

Revisiting this chart from the bivariate plots section, but now colored by quality score. None of the charts indicate that this factors are well fit for a good linear model for predicting quality. However, some regions with higher concentrations of good and bad quality wines are defined, although not very clearly.

Revisiting this charts and adding quality as color, nothin very useful seemed to appear.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wines)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wines)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar, 
##     data = wines)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio, data = wines)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio + sulphates, data = wines)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio + sulphates + density, data = wines)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio + sulphates + density + pH, data = wines)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity, 
##     data = wines)
## m9: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity + 
##     chlorides, data = wines)
## m10: lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     sulfur.dioxide.ratio + sulphates + density + pH + fixed.acidity + 
##     chlorides + citric.acid, data = wines)
## 
## ===============================================================================================================================================
##                            m1         m2         m3         m4         m5          m6          m7           m8           m9           m10      
## -----------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            2.582***   3.017***   2.356***   2.264***   2.014***   82.862***   102.754***   145.254***   143.910***   144.536***  
##                         (0.098)    (0.098)    (0.114)    (0.114)    (0.125)    (12.567)     (12.925)     (18.216)     (18.506)     (18.561)    
##   alcohol                0.313***   0.324***   0.375***   0.367***   0.368***    0.271***     0.242***     0.192***     0.192***     0.191***  
##                         (0.009)    (0.009)    (0.010)    (0.010)    (0.010)     (0.018)      (0.019)      (0.024)      (0.024)      (0.024)    
##   volatile.acidity                 -1.979***  -2.107***  -1.961***  -1.943***   -1.910***    -1.887***    -1.835***    -1.831***    -1.823***  
##                                    (0.110)    (0.109)    (0.111)    (0.110)     (0.110)      (0.110)      (0.111)      (0.111)      (0.113)    
##   residual.sugar                               0.027***   0.025***   0.026***    0.055***     0.065***     0.081***     0.081***     0.081***  
##                                               (0.002)    (0.002)    (0.002)     (0.005)      (0.005)      (0.007)      (0.007)      (0.007)    
##   sulfur.dioxide.ratio                                    0.384***   0.388***    0.319***     0.304***     0.308***     0.309***     0.308***  
##                                                          (0.056)    (0.056)     (0.057)      (0.057)      (0.056)      (0.057)      (0.057)    
##   sulphates                                                          0.463***    0.636***     0.588***     0.645***     0.644***     0.642***  
##                                                                     (0.095)     (0.098)      (0.098)      (0.100)      (0.100)      (0.100)    
##   density                                                                      -80.561***  -101.824***  -145.402***  -144.002***  -144.641***  
##                                                                                (12.522)     (12.938)     (18.457)     (18.767)     (18.823)    
##   pH                                                                                          0.472***     0.702***     0.695***     0.699***  
##                                                                                              (0.076)      (0.103)      (0.105)      (0.105)    
##   fixed.acidity                                                                                            0.068***     0.066**      0.065**   
##                                                                                                           (0.020)      (0.021)      (0.021)    
##   chlorides                                                                                                            -0.224       -0.251     
##                                                                                                                        (0.543)      (0.546)    
##   citric.acid                                                                                                                        0.043     
##                                                                                                                                     (0.095)    
## -----------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190      0.240      0.259      0.266      0.269      0.275        0.281        0.283        0.283        0.283   
##   adj. R-squared            0.190      0.240      0.258      0.265      0.268      0.274        0.280        0.281        0.281        0.281   
##   sigma                     0.797      0.772      0.763      0.759      0.758      0.754        0.752        0.751        0.751        0.751   
##   F                      1146.395    773.875    568.789    442.368    360.295    309.623      272.891      240.632      213.878      192.478   
##   p                         0.000      0.000      0.000      0.000      0.000      0.000        0.000        0.000        0.000        0.000   
##   Log-likelihood        -5839.391  -5681.776  -5622.083  -5598.647  -5586.778  -5566.139    -5547.023    -5541.549    -5541.464    -5541.364   
##   Deviance               3112.257   2918.264   2847.993   2820.870   2807.231   2783.672     2762.028     2755.862     2755.766     2755.653   
##   AIC                   11684.782  11371.552  11254.166  11209.295  11187.556  11148.278    11112.045    11103.098    11104.927    11106.728   
##   BIC                   11704.272  11397.538  11286.649  11248.274  11233.032  11200.250    11170.515    11168.064    11176.389    11184.687   
##   N                      4898       4898       4898       4898       4898       4898         4898         4898         4898         4898       
## ===============================================================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a very interesting relation between density, alcohol, residual sugar and quality. In general, quality increases as alcohol increases, density decreases and residual sugar decreases. These variables were amongst the most important predictors in the linear model built.

Were there any interesting or surprising interactions between features?

Since I did not have much knowledge of wine appraising before this exercise, I did not set expectations for the role of each variable, and therefore I was not surprised by the relations between them.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model for predicting quality. The R-squared value for the model was 0.283, which was a very low one. It indicates that a linear model probably is not the best fit for this dataset. Alcohol, volatile acidity and residual sugar were the most important prediction variables. Since there is a large correlation between some of the variables, some sort of feature selection would improve the model.


Final Plots and Summary

Plot One

Description One

This chart depicts the relation between alcohol concentration and quality score. For scores from 5 to 9, quality increases as alcohol increases, and for scores 3 and 4 the relation is the inverse. Alcohol has the largest correlation with quality among all the variables in the dataset, with a Pearson’s correlation coefficient of .436.

Plot Two

Description Two

A very interesting relation is shown in this chart. Given a value of residual sugar, density increases as alcohol decreases. This is in some extent due to the fermentation process of winemaking, in which sugar is consumed to generate alcohol. Since alcohol is less dense than water and sugar is more dense than water, this process makes the density of the wine decrease.

Plot Three

Description Three

This chart shows how quality relates with density and residual sugar. The two lowest and highest quality levels have been grouped to improve visibility.

It is noticeable that, for a given residual sugar concentration, quality increases as density increases. The same occurs if you fix density and increase residual sugar.


Reflection

This exploratory data analysis in which at first a univariate, then bivariate and finally multivariate examinations are performed allow for a progressive understanding of the dataset and the relations between its features.

Some interesting relations came up, like the one between alcohol, density, residual sugar and quality, that could be related to the fermentation process of wine. The correlation between pH and fixed acidity, while not correlating with volatile acidity and citric acids is also worth noting.

I strugled to find meaningful relations in the multivariate analysis section, and I have the feeling that some interesting relations may have been left aside among the many permutations of variables in the dataset. Anyway the whole analysis process was a very valuable experience, in which I practiced plotting various types of charts, handling overplotting and choosing the best chart type to convey the intended message.

A linear model for predicting quality was built, but it performed poorly, indicating that the dataset did not behave very much linearly. The process of evaluating wines is very subjective, and experts can be biased by their histories and preferences, making the relation between quality and the other variables too complex to be explained by a linear model. In the future, a diferent set of quality prediction models could be applied, and an evaluation of the best fit could be performed.

References