1. Outlier detection in Iris data using DBSCAN
In this tutorial we will implement outlier detection with the DBSCAN algorithm on the Iris dataset using Python, Jupyter Notebook and Anaconda. We will walk through the whole data mining pipeline: preprocessing the data, implementing the DBSCAN model, detecting outliers in the Iris dataset, and evaluating the DBSCAN algorithm with adjusted_rand_score. Before starting this tutorial, we should already have Anaconda with Jupyter Notebook installed on our computer, and we should also have sufficient knowledge of Python.
First, we quickly import all the necessary libraries.
from sklearn import datasets
import seaborn as sns
import pandas as pd
from collections import Counter
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from sklearn.metrics import adjusted_rand_score
In the first line, we imported datasets from the sklearn library to load the Iris data. Next we imported seaborn for making plots and figures. On the third line we imported pandas, an open-source data analysis and manipulation library. On line 4 we imported Counter from the collections module to count the outliers. On line 5 we imported the DBSCAN model from sklearn. On the next line we imported matplotlib for making plots and figures and visualizing the data. And on the last line we imported the evaluation metric adjusted_rand_score to evaluate the DBSCAN model.
iris_data=sns.load_dataset("iris") #loading the iris dataset
After importing the libraries, we loaded the Iris dataset from the seaborn library. The Iris dataset ships with many libraries as a built-in dataset.
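As a side note, the datasets module we imported from sklearn can load the same data; a minimal sketch (the sklearn copy uses slightly different column names, and the rest of this tutorial sticks with the seaborn version):
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)  # e.g. 'sepal length (cm)'
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)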
iris_data.head(5)
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Next we have a quick look at the features of the Iris dataset with the head() function. The Iris dataset has 3 different classes of flowers, setosa, versicolor and virginica, as we can see in the species column. Each flower is described by 4 features: sepal_length, sepal_width, petal_length and petal_width. Every flower class has 50 examples in the data, which means the data has a total of 150 rows. The last column, species, contains the class labels.
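As a quick sanity check of those counts:
print(iris_data.shape)                      # (150, 5): 150 rows, 4 features plus the label
print(iris_data["species"].value_counts())  # 50 rows per class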
data = iris_data.iloc[:, 0:4]
In the line above, we dropped the species column and kept only the features of the dataset. DBSCAN does not need the data labels for model implementation, because it is an unsupervised algorithm. At this point our data is ready to fit into the model. Next we will implement the model by feeding these features to DBSCAN.
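An equivalent way to do the same thing, selecting by column name instead of position:
data = iris_data.drop(columns="species")  # same result as iloc[:, 0:4]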
2. Model
model = DBSCAN(eps = 0.8, min_samples=19).fit(data)
The model implementation step is easy. We set two parameters. The first parameter is eps (epsilon). As the sklearn documentation explains, “eps is the maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.” We set the value of epsilon to 0.8.
The second parameter is min_samples. Again from the documentation, “min_samples is the number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.” We set the value of min_samples to 19.
Next we simply fit our data features to the DBSCAN model with .fit(data), and that's it for model implementation (a common heuristic for choosing eps is sketched below).
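Since eps is the most important parameter, here is a minimal sketch of the usual k-distance heuristic for picking it: plot each point's distance to its k-th nearest neighbor (with k equal to min_samples) in ascending order and look for the elbow where the curve bends sharply, which suggests a reasonable eps.
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Distance from each point to its 19th nearest neighbor
# (n_neighbors=19 counts the point itself, matching min_samples=19).
neighbors = NearestNeighbors(n_neighbors=19).fit(data)
distances, _ = neighbors.kneighbors(data)
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to 19th nearest neighbor")
plt.show()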
3. Detecting outliers
outliers = pd.DataFrame(data)          # keep the features in a DataFrame
print(Counter(model.labels_))          # how many points got each cluster label
print(outliers[model.labels_ == -1])   # rows DBSCAN labeled as noise (-1)
|     | sepal_length | sepal_width | petal_length | petal_width |
|-----|--------------|-------------|--------------|-------------|
| 98  | 5.1 | 2.5 | 3.0 | 1.1 |
| 105 | 7.6 | 3.0 | 6.6 | 2.1 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 |
After model implementation, our third step is to count how many outliers DBSCAN found. To do this, we first count the labels predicted by DBSCAN, then check how many of them are outliers and print the results. From the printed label counts, we can see that DBSCAN predicted 2 clusters, cluster 1 and cluster 0. Cluster 1 has 94 points and cluster 0 has 50 points, while 6 points have the label -1. These are our outliers, and all of them are printed in the table above. So DBSCAN has predicted 6 outliers based on the parameter values we passed.
4. Visualizing the clusters
features = data                # the 4 feature columns
labels = model.labels_         # cluster labels predicted by DBSCAN
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection
import numpy as np
fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(features.iloc[:, 3], features.iloc[:, 0], features.iloc[:, 2],
           c=labels.astype(np.float64), edgecolor='k')
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
In this section, we visualized the clusters and outliers predicted by DBSCAN. We have a 3D plot of 3 features: petal length, petal width and sepal length. We can see the 2 big clusters, yellow and green, that DBSCAN found in the Iris data based on the parameter values we provided to the model. We can also notice a few black points at the top edge of the big yellow cluster. These black points are the outliers we saw in the last step.
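Seaborn, which we imported for plotting at the start, gives a quicker 2D view of the same result; a small sketch coloring each point by its DBSCAN label, with -1 marking the outliers:
sns.scatterplot(x=iris_data["petal_length"], y=iris_data["petal_width"],
                hue=model.labels_, palette="viridis")
plt.show()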
5. Evaluation
iris_data_labels = iris_data.species      # ground truth class labels
dbscan_predicted_labels = model.labels_   # cluster labels from DBSCAN
adjusted_rand_score(iris_data_labels, dbscan_predicted_labels)
Out[30]: 0.5560753044129076
In the last step, we evaluated DBSCAN's performance at predicting the clusters. We did that by comparing the ground truth labels in the Iris data with the labels predicted by DBSCAN, passing both to the adjusted_rand_score() function. The adjusted Rand score ranges from roughly -1 to 1: a value close to 1 indicates that the two labelings agree well, while a value around 0 indicates a random labeling. Note that the function compares how the points are grouped, not the label names themselves, which is why we can compare string species names against integer cluster labels.
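A tiny illustration of that last point, with made-up toy labels:
print(adjusted_rand_score([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0: the groupings are identical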
We can see that the score for DBSCAN is about 0.56. We cannot say outright whether this is a good or a bad score, because the interpretation of the result depends on what we are trying to achieve with the model.