K-means

 

K-means

Clustering: find the underlying structure of an unlabeled dataset by summarizing and grouping it. Clustering means grouping objects based on the information found in the data describing the objects or their relationships. Objects in one group should be similar to each other but different from objects in other groups.

K-means is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that points in different clusters are dissimilar while points within a cluster are similar. It is unsupervised because the points have no external classification.

Every centroid tries to place itself at the heart of a group of similar points while staying as far as possible from the other centroids and their points. (A toy implementation of this update loop is sketched after the list below.)

  • K-means is exclusive clustering: an item belongs to exactly one cluster.
  • All sorts of segmentation
  • Classifying handwritten digits
  • Can’t have missing values
  • Not easy to come up with metrics for measuring results (domain-specific)
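A toy sketch of the two alternating steps (assign points to the nearest centroid, then move each centroid to the mean of its points) on made-up 2-D data. In practice you would simply call kmeans(); this loop is only for illustration:

set.seed(1)
pts <- matrix(rnorm(100), ncol = 2)        # 50 hypothetical 2-D points
k <- 3
centroids <- pts[sample(nrow(pts), k), ]   # initialize centroids at k random points

for (iter in 1:10) {
  # squared distance from every point to every centroid
  d <- sapply(1:k, function(j) colSums((t(pts) - centroids[j, ])^2))
  assignment <- max.col(-d)                # index of the nearest centroid per point
  # move each centroid to the mean of the points assigned to it
  for (j in 1:k)
    if (any(assignment == j))
      centroids[j, ] <- colMeans(pts[assignment == j, , drop = FALSE])
}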

 

K-means clustering

 

[Figure: animated visualization of k-means clustering]


Example: each movie represented as a binary vector of genre flags:

  • Toy Story: (0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0)
  • Batman:    (0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0)
  • Euclidean distance: sqrt((0-0)^2 + (0-1)^2 + (0-1)^2 + …) (see the R check after this list)
  • Other popular distance metrics:
    • Manhattan distance: sum of absolute differences instead of squares
    • Maximum coordinate distance: only consider the coordinate in which the data points deviate the most
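A quick sanity check of these three metrics in R, using the two genre vectors from the bullets above:

toyStory <- c(0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0)
batman   <- c(0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0)

sqrt(sum((toyStory - batman)^2))   # Euclidean distance
sum(abs(toyStory - batman))        # Manhattan distance (counts differing genres for 0/1 data)
max(abs(toyStory - batman))        # maximum coordinate distance
dist(rbind(toyStory, batman), method = "euclidean")   # same Euclidean value via dist()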

 

Solving for c, the length of the hypotenuse is the square root of the sum of the squared lengths of a and b, where a and b are the orthogonal sides of the triangle (i.e. they meet at a 90-degree angle, pointing in perpendicular directions in space): c = sqrt(a^2 + b^2). (In the figure: green line = Euclidean distance, blue line = Manhattan distance.)
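For example, with hypothetical legs a = 3 and b = 4:

sqrt(3^2 + 4^2)   # Euclidean distance (hypotenuse, the "green line") = 5
3 + 4             # Manhattan distance (along the legs, the "blue line") = 7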

 

  • Distance between clusters (a tiny worked example follows this list):
    • Minimum distance: distance between clusters = distance between the two closest points
    • Maximum distance: distance between clusters = distance between the two farthest points
    • Centroid distance: distance between the centroids (centers, averages) of the clusters
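A tiny worked example with two made-up 1-D clusters, A = {1, 2} and B = {5, 8}:

A <- c(1, 2)
B <- c(5, 8)
d <- abs(outer(A, B, "-"))   # all pairwise distances between points in A and points in B

min(d)                  # minimum (single-linkage) distance: 3
max(d)                  # maximum (complete-linkage) distance: 7
abs(mean(A) - mean(B))  # centroid distance: |1.5 - 6.5| = 5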

 

Always normalize the data before clustering. Otherwise variables with larger values (like revenue) dominate the distance calculation over variables with smaller values (like years).
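A sketch using scale(), with a made-up data frame where revenue is in dollars and tenure is in years:

customers <- data.frame(revenue = c(1e6, 2e6, 5e5),   # hypothetical values
                        years   = c(3, 10, 1))
customers_norm <- as.data.frame(scale(customers))     # each column: mean 0, sd 1
customers_norm
# kmeans(customers_norm, centers = 2)                 # cluster on the normalized data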

 

Choosing the value of K:

[Figure: elbow method for choosing the number of clusters (k-means with canopy clustering)]
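A sketch of the elbow method, assuming a numeric matrix dat (random numbers here purely for illustration): run kmeans() over a range of k and plot the total within-cluster sum of squares; the bend ("elbow") in the curve suggests a reasonable k.

set.seed(1)
dat <- matrix(rnorm(200), ncol = 2)   # hypothetical data

wss <- sapply(1:10, function(k) kmeans(dat, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")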

 

[Figure: cluster distance measures: single linkage, complete linkage, average linkage]

 

 

 


R:

# kmeans() and hclust() are in base R (the stats package); no extra library() call is needed
fit <- kmeans(data, centers = 5)   # partition `data` into 5 clusters
summary(fit)                       # components of the fitted object
fit$cluster                        # cluster assignment for each observation

distances = dist(movies[2:20], method = "euclidean")

  • method = "ward.D" (for hclust) cares about the distance between clusters, using centroid distance, and also about the variance within each cluster
  • method = "euclidean" (for dist) is the standard straight-line distance between points

 

model <- kmeans(dataset_with_no_labels, centers = nr_of_clusters)
model$size     # number of points in each cluster
model$cluster  # cluster assignment for each observation
model$centers  # matrix of cluster centroids
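The cutree() call below refers to clusterMovies, which these notes never define; a plausible sketch, assuming it is the hierarchical clustering built from the distances computed above with Ward's method:

clusterMovies <- hclust(distances, method = "ward.D")   # hierarchical clustering (assumed step)
plot(clusterMovies)                                     # dendrogram, useful for choosing k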

clusterGroups = cutree(clusterMovies, k = 10)       # cut the dendrogram into 10 clusters
tapply(movies$Action, clusterGroups, mean)          # fraction of Action movies in each cluster
tapply(movies$Romance, clusterGroups, mean)         # fraction of Romance movies in each cluster
colMeans(subset(movies[2:20], clusterGroups == 1))  # centroid of cluster 1 across all genres
colMeans(subset(movies[2:20], clusterGroups == 2))  # centroid of cluster 2

subset(movies, Title == "Men in Black (1997)")   # look up the row for one movie
clusterGroups[257]                               # cluster of observation 257 (the row found above)
cluster2 = subset(movies, clusterGroups == 2)    # all movies in cluster 2
cluster2$Title[1:20]                             # first 20 titles in cluster 2

spl = split(movies[2:20], clusterGroups)
spl[[1]] #is the same as subset(movies[2:20], clusterGroups == 1)
colMeans(spl[[1]]) #will output the centroid of cluster 1.
#But an even easier approach uses the lapply function. The following command
#will output the cluster centroids for all clusters:
lapply(spl, colMeans)
#The lapply function runs the second argument (colMeans) on each element of
#the first argument (each cluster subset in spl). So instead of using 19 tapply
#commands, or 10 colMeans commands, we can output our centroids with just two
#commands: one to define spl, and then the lapply command.