Wine quality dataset machine learning

Introduction

In this task, we will predict wine quality using the following classifier models from the “sklearn” library, along with XGBoost:

  1. Decision Tree
  2. Random Forest
  3. Nearest Neighbors
  4. Gaussian Naive Bayes
  5. Support Vector Classifier (SVC)
  6. XGBoost

We will use these models to classify the data into multiple “quality” buckets. We will also compare the different models and select the one with the best performance. To evaluate the performance of the models, we use the Mean Absolute Error (MAE) and the accuracy score.
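As a quick, self-contained illustration of how these two metrics behave on ordinal quality labels, here is a minimal sketch with made-up predictions (the values below are illustrative only):

# Toy example (made-up values): accuracy counts exact matches only,
# while MAE also gives credit for predictions that are merely close.
from sklearn.metrics import accuracy_score, mean_absolute_error
y_true = [5, 6, 6, 7]
y_pred = [5, 6, 7, 5]
print(accuracy_score(y_true, y_pred))       # 0.5 (2 of 4 exact matches)
print(mean_absolute_error(y_true, y_pred))  # (0 + 0 + 1 + 2) / 4 = 0.75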

This task is implemented in four steps. In step 1, we explore and analyse the dataset. In step 2, we implement the different models. In step 3, we evaluate and compare the performance of the models and select the best-performing one. In step 4, we tune parameters to improve the performance of the selected model.

The Wine Quality Data Set, along with additional information about it, can be found on the UCI Machine Learning Repository.

Prerequisites: You should first install Anaconda and all the necessary Python libraries (e.g. xgboost) on your computer that are imported below.
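If any of these libraries are missing, one common way to install them is with pip, which can also be run from inside the notebook (the package names below are the usual PyPI names; adjust for your environment):

# Uncomment and run if a package is missing in your environment
# !pip install xgboost seaborn imbalanced-learn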

Now let’s get started by importing the libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import tree
from sklearn.metrics import mean_absolute_error
import seaborn as sn
#from imblearn.over_sampling import RandomOverSampler

from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import xgboost
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt
%matplotlib inline

1.  Preprocessing and Data Analysis

In this section we will do the exploratory data analysis (EDA). There are 13 features in the data in total. The last feature, “quality”, is the class label which we want our model to learn and predict. The provided training dataset contains data for both red wine and white wine. The “train_data.csv” file should be in the same folder as your Jupyter notebook.

data = pd.read_csv('train_data.csv') # Loading dataset

We can see that there are 13 features in our data.

# Getting a peek at the data
data.head()

Checking data distribution with histograms

We will plot a histogram to see the data distribution. Here we can see that the quality groups for wine go from 3 to 9. We can also see that 6 is the largest group, then 5, then 7, and so on. We have seven outcome classes in total, with a lot of values in the middle and very few at the edges.

# Plotting histograms of data 
data['quality'].hist()
plt.suptitle('Histogram of Wine Quality')
plt.xlabel('Quality Groups')
plt.ylabel('Number of Votes')
plt.show()
Histogram of Wine Quality

Here we have plotted the class distribution with its counts. We can see that the majority of our data lies in wine quality 6, an average-quality wine with 2248 rows. The second-highest count belongs to wine quality 5; qualities 5 and 6 together make up the majority of the instances in the dataset. The best quality (9) has only 4 rows, and the worst quality (3) has 26 rows. This tells us a few things: we have multiple classes, and thus a multi-class classification problem, and there is an imbalance between the classes (a possible remedy is sketched after the counts below).

# Getting counts of instances of all classes
pd.value_counts(data['quality']).plot.bar()
plt.title('Quality class histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')
data['quality'].value_counts()

6    2248
5    1678
7     872
4     173
8     149
3      26
9       4
Name: quality, dtype: int64
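One possible remedy for this imbalance, not applied in this task (note the commented-out imblearn import above), is random oversampling of the minority classes. A minimal sketch, assuming the optional imbalanced-learn package is installed and that the features X and labels y have already been separated as in the splitting step below:

# Sketch only: duplicate minority-class rows until every class matches
# the majority class size. Requires the imbalanced-learn package.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(pd.value_counts(y_res))  # every quality class now has 2248 rows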

Checking for missing or null values

# check missing data: count null cells per column, then total them
missing_data = data.isnull().sum()
print("Overall empty data fields:", missing_data.sum())

Overall empty data fields: 0

# See if there are null values in the data 
data.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
type                    0
quality                 0
dtype: int64

Data distribution with the .info() and .describe() methods

Next we will run the info() function. We can see that there are 11 float columns and 2 integer columns. All the chemical readings are floats. The two integer columns are “type”, which tells us whether the wine is red or white, and the “quality” class label. The other interesting thing is that all 5150 rows are non-null in every column, confirming that there are no missing values. So the data is pretty much clean.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5150 entries, 0 to 5149
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         5150 non-null   float64
 1   volatile acidity      5150 non-null   float64
 2   citric acid           5150 non-null   float64
 3   residual sugar        5150 non-null   float64
 4   chlorides             5150 non-null   float64
 5   free sulfur dioxide   5150 non-null   float64
 6   total sulfur dioxide  5150 non-null   float64
 7   density               5150 non-null   float64
 8   pH                    5150 non-null   float64
 9   sulphates             5150 non-null   float64
 10  alcohol               5150 non-null   float64
 11  type                  5150 non-null   int64  
 12  quality               5150 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 523.2 KB

The describe() function below gives us an idea of the data distribution, which is always interesting to look at. The data seems clean, and we don’t see any extreme values that we would have to trim. Everything looks plausible here. Regarding the quality column, we can see that 3 is the minimum, i.e. the worst-quality wine, while the maximum is 9. The mean quality is around 6, exactly midway between 3 and 9.

# Data distribution
data.describe()

We can also take an interesting look at the distribution of each feature by using box plots. In the first plot below, we have box plots for all the features, and in the second, a box plot for quality alone. With the help of the box plots we can see that there are a lot of outliers in the data, far from the majority of the distribution. In the second box plot we can see that the values 8, 9 and 3 are very rare, which we also saw in the counts above. We have a lot of mediocre wines.

Boxplots

# Checking data distribution
fig, axes = plt.subplots(nrows=2, ncols=1)
fig.set_size_inches(15, 30)
sn.boxplot(data=data, orient="v", ax=axes[0])
sn.boxplot(data=data, y="quality", orient="v", ax=axes[1])

<AxesSubplot:ylabel='quality'>

Box plots for wine quality features and quality

Correlation matrix

In the correlation matrix, we can see whether two features are too closely coupled. Correlation values range from -1 to 1, and the color of a box is lighter when two features are more highly correlated. E.g. we can see that the features ‘free sulfur dioxide’ and ‘total sulfur dioxide’ have a high correlation with a value of 0.72.

# Correlation analysis
corrMatt = data.corr()
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False
fig,ax= plt.subplots()
fig.set_size_inches(20,10)
sn.heatmap(corrMatt, mask=mask,vmax=.8, square=True,annot=True)

<AxesSubplot:>

Splitting data into training and test sets

We will split our dataset to make it ready for modeling. First we separate the features from the target variable, and then we split the dataset into training and test sets: 70% of the data is used for training and 30% for testing (an optional stratified variation is sketched after the code).

# Separating features and labels
y = data.quality
X = data.drop('quality', axis=1)
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3)
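Because the quality classes are imbalanced, an optional variation (not used in the rest of this task) is to stratify the split so that each class keeps its proportion in both sets, and to fix random_state for reproducibility. A minimal sketch:

# Optional: stratified, reproducible split (the random_state value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)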

2. Implementing Different Models

In this section we try different classification models on our wine classification problem, and then select the one with the best performance based on its accuracy and MAE score.

print('Start Predicting...')

# Fit each classifier on the training set and predict on the test set
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,y_train)
tree_pred = decision_tree.predict(X_test)

rf = RandomForestClassifier(n_estimators = 50, class_weight="balanced")
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)

KN = KNeighborsClassifier()
KN.fit(X_train,y_train)
KN_pred = KN.predict(X_test)

Gaussian = GaussianNB()
Gaussian.fit(X_train,y_train)
Gaussian_pred = Gaussian.predict(X_test)

svc = SVC()
svc.fit(X_train,y_train)
svc_pred = svc.predict(X_test)

xgb = xgboost.XGBClassifier()
xgb.fit(X_train,y_train)
xgb_pred = xgb.predict(X_test)

print('...Complete')
Start Predicting...
/home/joe/anaconda3/lib/python3.7/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
[21:38:04] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
...Complete
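These warnings are harmless here, but following their own advice, one way to silence them is to encode the labels as integers starting at 0 and to pass the suggested options explicitly. A sketch (not applied above):

# Sketch: silence the XGBoost warnings per their own suggestions
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)  # maps quality 3..9 to 0..6
xgb = xgboost.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb.fit(X_train, y_train_enc)
xgb_pred = le.inverse_transform(xgb.predict(X_test))  # back to 3..9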

3. Evaluation

After implementing the models, we can evaluate their performance. Below we look at the accuracy and the MAE score of each model to see which one performed best. We can see from the results that Random Forest has the best accuracy and MAE scores, so we will use the Random Forest classifier algorithm for our task.

Accuracy

print('Decision Tree:', accuracy_score(y_test, tree_pred)*100,'%')
print('Random Forest:', accuracy_score(y_test, rf_pred)*100,'%')
print('KNeighbors:',accuracy_score(y_test, KN_pred)*100,'%')
print('GaussianNB:',accuracy_score(y_test, Gaussian_pred)*100,'%')
print('SVC:',accuracy_score(y_test, svc_pred)*100,'%')
print('XGB:',accuracy_score(y_test, xgb_pred)*100,'%')
Decision Tree: 56.310679611650485 %
Random Forest: 63.883495145631066 %
KNeighbors: 46.47249190938511 %
GaussianNB: 42.653721682847895 %
SVC: 44.983818770226534 %
XGB: 62.653721682847895 %
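The distance-based models (KNeighbors and SVC) were trained on unscaled features here; they usually benefit from standardization, which the preprocessing module imported earlier provides. A minimal sketch (not applied above, so the scores would differ):

# Sketch: standardize features (fit the scaler on the training set only,
# to avoid leakage), then retrain a distance-based model on scaled data
scaler = preprocessing.StandardScaler().fit(X_train)
KN_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print('KNeighbors (scaled):', accuracy_score(y_test, KN_scaled.predict(scaler.transform(X_test)))*100, '%')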

MAE Score

print('Decision Tree:', mean_absolute_error(tree_pred,y_test))
print('Random forest',mean_absolute_error(rf_pred,y_test))
print('Nearest neighbors',mean_absolute_error(KN_pred,y_test))
print('GaussianNB',mean_absolute_error(Gaussian_pred,y_test))
print('SVC', mean_absolute_error(svc_pred,y_test))
print('XGB',mean_absolute_error(xgb_pred,y_test))
Decision Tree: 0.5326860841423948
Random forest 0.4051779935275081
Nearest neighbors 0.6621359223300971
GaussianNB 0.6802588996763754
SVC 0.6330097087378641
XGB 0.42071197411003236
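With imbalanced classes, overall accuracy and MAE can hide weak performance on the rare qualities, so a per-class breakdown is a useful complement. A minimal sketch for the Random Forest predictions:

# Per-class precision/recall and the confusion matrix for Random Forest
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred, zero_division=0))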

4. Parameter tuning for Random forest and Kaggle submission

In this section, we will try to improve the selected Random Forest model by tuning its parameters. We will plot the MAE score for a range of values of the n_estimators parameter, and we can see that tuning improves the MAE score of the model.

Plotting the MAE score to estimate n_estimators

def get_mae_rf(num_est, predictors_train, predictors_val, targ_train, targ_val):

    # building a random forest with the given number of estimators
    model = RandomForestClassifier(n_estimators=num_est, random_state=0)

    # fitting the model on the training dataset
    model.fit(predictors_train, targ_train)

    # making predictions on the test dataset
    preds_val = model.predict(predictors_val)

    # calculating and returning the MAE
    mae = mean_absolute_error(targ_val, preds_val)
    return mae
plot_mae = {}
for num_est in range(2, 50):
    my_mae = get_mae_rf(num_est, X_train, X_test, y_train, y_test)
    plot_mae[num_est] = my_mae
plt.plot(list(plot_mae.keys()), list(plot_mae.values()))
plt.xlabel('n_estimators')
plt.ylabel('MAE')
plt.show()
# Selecting n_estimators=50 based on the plot above, and class_weight="balanced"

rf = RandomForestClassifier(n_estimators=50, class_weight="balanced")
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
# Checking if MAE score improves after parameter tuning

print('Random forest',mean_absolute_error(rf_pred,y_test))

Random forest 0.40129449838187703
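An alternative to the manual loop above is a cross-validated grid search over several parameters at once. A minimal sketch, with an illustrative (not tuned) parameter grid:

# Sketch: cross-validated search; the grid values here are illustrative only
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [25, 50, 100], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(class_weight="balanced", random_state=0),
                    param_grid, scoring='neg_mean_absolute_error', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)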

Getting test data ready for Kaggle

As our model is ready, we can now make our Kaggle submission. First we load the test dataset, then remove the “id” column, as it is only needed for the Kaggle submission file. After predicting the test features with our model, we put the predicted labels and the “id” column back together in a submission CSV file and make the submission to the Kaggle competition.

test = pd.read_csv("test_data.csv")
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1347 entries, 0 to 1346
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    1347 non-null   int64  
 1   fixed acidity         1347 non-null   float64
 2   volatile acidity      1347 non-null   float64
 3   citric acid           1347 non-null   float64
 4   residual sugar        1347 non-null   float64
 5   chlorides             1347 non-null   float64
 6   free sulfur dioxide   1347 non-null   float64
 7   total sulfur dioxide  1347 non-null   float64
 8   density               1347 non-null   float64
 9   pH                    1347 non-null   float64
 10  sulphates             1347 non-null   float64
 11  alcohol               1347 non-null   float64
 12  type                  1347 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 136.9 KB
# Dropping extra 'id' column
test_features = test.drop('id', axis=1)
# Predicting kaggle test data
rf_pred = rf.predict(test_features)

Making submission.csv

my_submission = pd.DataFrame({'id': test.id, 'prediction': rf_pred})
# you could use any filename; we choose submission1 here
my_submission.to_csv('submission1.csv', index=False)
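As an optional sanity check, the written file can be read back to confirm it has one prediction per test id (the expected row count comes from the test set size above):

# Optional sanity check: one row per test id, two columns (id, prediction)
print(pd.read_csv('submission1.csv').shape)  # expected: (1347, 2)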