Author: Pr Fabien MOUTARDE, Center for Robotics, MINES ParisTech, PSL Université Paris
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
# Get data
# Read a .lrn file: skip the 4 header lines, then parse each
# tab-separated row, dropping the first column (the example index)
def get_data(data_path):
    data = []
    with open(data_path, 'r') as file:
        for num, line in enumerate(file):
            if num >= 4:
                line_sep = line.strip('\n').split(sep='\t')
                single_example = [float(element) for element in line_sep][1:]
                data.append(single_example)
    return np.array(data)
data_atom = get_data('./clustering-examples/Atom.lrn')
data_lsun = get_data('./clustering-examples/Lsun.lrn')
data_wingnut = get_data('./clustering-examples/WingNut.lrn')
data_chainlink = get_data('./clustering-examples/Chainlink.lrn')
data_twodiamonds = get_data('./clustering-examples/TwoDiamonds.lrn')
def data_show(data):
    """Scatter-plot a dataset in 2D or 3D, depending on its dimensionality."""
    num_dim = data.shape[1]
    fig = plt.figure()
    if num_dim == 2:
        plt.scatter(data[:, 0], data[:, 1])
    else:
        ax = plt.axes(projection='3d')
        ax.scatter3D(data[:, 0], data[:, 1], data[:, 2])
    plt.show()
data_show(data_atom)
data_show(data_lsun)
data_show(data_wingnut)
data_show(data_chainlink)
data_show(data_twodiamonds)
Test the K-means clustering method, implemented in scikit-learn by the class sklearn.cluster.KMeans. First, read its documentation in detail: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Then experiment with K-means on each of the 5 datasets, using several values for K.
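As a starting point, a minimal sketch of the K-means loop is shown below. It uses a synthetic blob dataset from `make_blobs` as a stand-in so the cell runs on its own; in the lab you would replace `data` with `data_lsun`, `data_atom`, etc. from the cells above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for one of the .lrn datasets (replace with data_lsun, etc.)
data, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Try several values of K and visualize the resulting partitions
for k in (2, 3, 4):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(data)
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], c=labels)
    plt.scatter(*kmeans.cluster_centers_.T, marker='x', c='red')
    plt.title(f'K-means, K={k}, inertia={kmeans.inertia_:.1f}')
    plt.show()
```

Comparing the inertia (within-cluster sum of squares) across values of K is one common way to pick a reasonable number of clusters.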
Now, test Hierarchical Agglomerative Clustering, implemented in scikit-learn by the class sklearn.cluster.AgglomerativeClustering. First, read its documentation in detail: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Then experiment with Agglomerative Clustering on each of the 5 datasets, with either single-linkage or complete-linkage, and with several values for the requested number of clusters.
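A minimal sketch of the linkage comparison follows; it uses `make_moons` (two interleaved half-moons, a non-convex shape reminiscent of Chainlink or Atom) as a self-contained stand-in for the .lrn datasets loaded above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that K-means handles poorly
data, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Compare single-linkage (nearest-pair merging) with complete-linkage
for linkage in ('single', 'complete'):
    agglo = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = agglo.fit_predict(data)
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], c=labels)
    plt.title(f'Agglomerative clustering, linkage={linkage}')
    plt.show()
```

On such elongated, intertwined shapes, single-linkage typically recovers the two moons while complete-linkage (like K-means) tends to cut them into compact pieces.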
#Load the digits dataset
digits = datasets.load_digits()
Therefore, you should perform clustering with 10 or more clusters (as one class could correspond to more than one cluster) on the dataset WITHOUT USING LABELS. Then, analyze the distribution of labels of the examples in each of the obtained clusters, in order to measure how homogeneous each cluster is in terms of labels, and check whether it is possible to obtain a one-to-one (or one-to-few) correspondence between classes and clusters.
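One way to carry out this analysis is sketched below: cluster the digits images without their labels, then use the labels only afterwards to count, per cluster, how the true digits are distributed (the majority-label fraction, or "purity", is one simple homogeneity measure; it is an illustrative choice, not the only possible one).

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()

# Cluster the images WITHOUT using the labels
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)

# Now use the labels only to analyze each cluster: count the true digits
# of its members; the majority-label fraction measures homogeneity
for c in range(10):
    counts = np.bincount(digits.target[clusters == c], minlength=10)
    majority = counts.argmax()
    purity = counts[majority] / counts.sum()
    print(f'cluster {c}: majority digit {majority}, purity {purity:.2f}')

# Overall purity: fraction of examples matching their cluster's majority label
overall = sum(np.bincount(digits.target[clusters == c], minlength=10).max()
              for c in range(10)) / len(digits.target)
print(f'overall purity: {overall:.2f}')
```

If most clusters have a clearly dominant digit, the class/cluster correspondence is close to one-to-one; clusters shared between visually similar digits (e.g. 1 and 8 in this 8x8 representation) suggest a one-to-few mapping, which trying more than 10 clusters can help disentangle.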