My project report will focus on my findings, analysis, and machine learning methods using the Python library scikit-learn. The dataset contains roughly 1,600 red wines with 11 features, such as alcohol concentration, acidity, and sulfur dioxide content, as well as an output variable of wine quality (scaled from 1 to 10). According to the curator of this dataset, these features are potential determinants of wine quality.
As stated in my project proposal, my research question is whether alcohol concentration correlates with wine quality. I would also like to know whether my machine learning model can generate accurate wine quality predictions from these factors. Overall, there are no major differences between my project proposal and the final project besides the addition of an analysis of the model's cross-validation score.
Methodology
I began my analysis by reading the CSV file from the source of the dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality). Minimal data cleaning was required: the values in the CSV are separated by semicolons rather than commas, which is handled simply by reading the file with pandas and setting the separator to ";" (see the sketch after the feature list below). The features used in this model are all 11 input variables:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulfates
- Alcohol
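A minimal sketch of this loading step is shown below; the filename winequality-red.csv matches the UCI download, though the exact local path is an assumption.

```python
import pandas as pd

# The UCI file uses ";" as its delimiter rather than commas,
# so pandas needs the sep argument when reading it.
df = pd.read_csv("winequality-red.csv", sep=";")

# Separate the 11 input variables from the quality target column.
X = df.drop(columns=["quality"])
y = df["quality"]
```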
Subsequently, KFold cross-validation is applied to split the data into training and testing folds reliably. The reason to use KFold cross-validation rather than splitting the dataset manually is that it evaluates the model on multiple train/test splits, which gives a more reliable estimate of performance than a single split. In my model, ten splits were used, as the dataset is rather large. A multiple linear regression is used to build the model, and an R² score is calculated to measure how well the model fits the data. On my first run, the R² results for both the training and testing data are:
- Train R2: 0.3611464368898285
- Train R2 SD: 0.005403689017665551
- Test R2: 0.34282454749521324
- Test R2 SD: 0.05069777254576919
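A sketch of how this cross-validated fit could be run with scikit-learn is shown below; the shuffle and random_state settings are assumptions, so the exact numbers will differ slightly from the figures above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

df = pd.read_csv("winequality-red.csv", sep=";")
X = df.drop(columns=["quality"])
y = df["quality"]

# Ten folds, matching the number of splits used in the report;
# shuffling with a fixed seed is an assumption for reproducibility.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Score each fold with R^2 and keep the training-fold scores as well.
cv = cross_validate(LinearRegression(), X, y,
                    scoring="r2", cv=kfold, return_train_score=True)

print("Train R2:", cv["train_score"].mean(), "SD:", cv["train_score"].std())
print("Test  R2:", cv["test_score"].mean(), "SD:", cv["test_score"].std())
```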
I also did some exploratory data analysis, creating data visualizations and simple calculations to show correlations between the variables. I first calculated the correlation coefficient of every variable with wine quality, as shown in Figure 1. This shows that alcohol concentration has the strongest correlation with wine quality.
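One way the numbers behind Figure 1 could be computed is with pandas' corr method; this is a sketch, assuming the dataframe loaded as above.

```python
import pandas as pd

df = pd.read_csv("winequality-red.csv", sep=";")

# Pearson correlation of every feature with quality, strongest first
# (the basis for Figure 1).
quality_corr = df.corr()["quality"].drop("quality").sort_values(ascending=False)
print(quality_corr)
```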
Furthermore, I visualized how every variable correlates with every other. Using a heatmap, Figure 2 displays these correlations as a colored grid. This visualization shows that alcohol concentration and volatile acidity are the two variables most strongly correlated with wine quality.
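A sketch of how such a heatmap could be produced with seaborn; the color map and annotation options are assumptions, not necessarily the settings used for Figure 2.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("winequality-red.csv", sep=";")

# Pairwise correlation matrix rendered as a colored grid (as in Figure 2).
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between red wine features")
plt.tight_layout()
plt.show()
```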
Another visualization is a scatterplot of wine quality against alcohol. Figure 3 shows that, roughly, higher-quality wines tend to have higher alcohol concentrations. A linear regression line is drawn with the seaborn library to show the line of best fit. In this visualization, I noticed several outliers that may affect the accuracy of the model.
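A sketch of how a plot like Figure 3 could be drawn with seaborn's regplot; the jitter and transparency settings are assumptions added here to make the integer quality scores easier to see.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("winequality-red.csv", sep=";")

# Scatterplot of alcohol vs. quality with a fitted regression line (as in Figure 3).
sns.regplot(data=df, x="alcohol", y="quality",
            y_jitter=0.1,
            scatter_kws={"alpha": 0.3},
            line_kws={"color": "red"})
plt.xlabel("Alcohol (% by volume)")
plt.ylabel("Quality score")
plt.show()
```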
To answer my research question, I believe that alcohol concentration has the most significant influence on wine quality among the factors available in the dataset. As shown in Figures 1 and 2, alcohol concentration has the strongest correlation with wine quality.
Conclusion
Despite the relatively low R² scores, I learned a lot about using linear regression. Since a linear regression model works by fitting a line of best fit through the points, a low R² value means the line explains little of the variance in quality, so the model's predictions are likely to carry large errors. Linear regression has its limitations and drawbacks, so we should not always expect it to make great predictions. While it is often used for simple models in the business field, I believe there may be other models that yield better results than linear regression for this dataset.