K-Means 概念定義:
K-Means 是一種基于距離的排他的聚類劃分方法。
上面的 K-Means 描述中包含了幾個概念:
- 聚類(Clustering):K-Means 是一種聚類分析(Cluster Analysis)方法。聚類就是將數據對象分組成為多個類或者簇 (Cluster),使得在同一個簇中的對象之間具有較高的相似度,而不同簇中的對象差別較大。
- 劃分(Partitioning):聚類可以基于劃分,也可以基于分層。劃分即將對象劃分成不同的簇,而分層是將對象分等級。
- 排他(Exclusive):對于一個數據對象,只能被劃分到一個簇中。如果一個數據對象可以被劃分到多個簇中,則稱為可重疊的(Overlapping)。
- 距離(Distance):基于距離的聚類是將距離近的相似的對象聚在一起。基于概率分布模型的聚類是在一組對象中,找到能符合特定分布模型的對象的集合,他們不一定是距離最近的或者最相似的,而是能完美的呈現出概率分布模型所描述的模型。
K-Means 問題描述:
給定一個 n 個對象的數據集,它可以構建數據的 k 個劃分,每個劃分就是一個簇,并且 k ≤ n。同時還需滿足:
- 每個組至少包含一個對象。
- 每個對象必須屬于且僅屬于一個簇。
Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.
例如,有如下包含 10 條數據的集合。集合中每項描述了一個人的身高(Height: inches)和體重(Weight: kilograms)。
Height Weight ------------- (73.0, 72.6) (61.0, 54.4) (67.0, 99.9) (68.0, 97.3) (62.0, 59.0) (75.0, 81.6) (74.0, 77.1) (66.0, 97.3) (68.0, 93.3) (61.0, 59.0)
通過按照身高和體重的聚類,可以將上述 10 條數據分組成 3 類。
Height Weight ------------- (67.0, 99.9) (68.0, 97.3) (66.0, 97.3) (68.0, 93.3) (73.0, 72.6) (75.0, 81.6) (74.0, 77.1) (61.0, 54.4) (62.0, 59.0) (61.0, 59.0)
分類結果可以描述為:中等身高并且很重、很高并且中等體重、矮并且輕。如果用圖形來觀察分組狀況則結果一目了然。
K-Means 算法實現:
由于 K-Means 算法值針對給定的完整數據集進行操作,不需要任何特殊的訓練數據,所以 K-Means 是一種無監督的機器學習方法(Unsupervised Machine Learning Technique)。
K-Means 算法最常見的實現方式是使用迭代式精化啟發法的 Lloyd's algorithm。
- 給定劃分數量 k。創建一個初始劃分,從數據集中隨機地選擇 k 個對象,每個對象初始地代表了一個簇中心(Cluster Centroid)。對于其他對象,計算其與各個簇中心的距離,將它們劃入距離最近的簇。
- 采用迭代的重定位技術,嘗試通過對象在劃分間移動來改進劃分。所謂重定位技術,就是當有新的對象加入簇或者已有對象離開簇的時候,重新計算簇的平均值,然后對對象進行重新分配。這個過程不斷重復,直到各簇中對象不再變化為止。
randomly assign all data items to a cluster loop until no change in cluster assignments compute centroids for each cluster reassign each data item to cluster of closest centroid end
簡潔點兒的表述即為:
initialize clustering loop update centroids update clustering end loop
應用 K-Means 算法到上述身高與體重的示例,聚類過程如下圖所示。
K-Means 優缺點:
當結果簇是密集的,而且簇和簇之間的區別比較明顯時,K-Means 的效果較好。對于大數據集,K-Means 是相對可伸縮的和高效的,它的復雜度是 O(nkt),n 是對象的個數,k 是簇的數目,t 是迭代的次數,通常 k << n,且 t << n,所以算法經常以局部最優結束。
K-Means 的最大問題是要求先給出 k 的個數。k 的選擇一般基于經驗值和多次實驗結果,對于不同的數據集,k 的取值沒有可借鑒性。另外,K-Means 對孤立點數據是敏感的,少量噪聲數據就能對平均值造成極大的影響。
Basic K-Means - Lloyd's algorithm C# 代碼實現:
Code below referenced from Machine Learning Using C# Succinctly by James McCaffrey, and article K-Means Data Clustering Using C#.
1 using System; 2 3 namespace ClusterNumeric 4 { 5 class ClusterNumProgram 6 { 7 static void Main(string[] args) 8 { 9 Console.WriteLine("\nBegin k-means clustering demo\n"); 10 11 double[][] rawData = new double[10][]; 12 rawData[0] = new double[] { 73, 72.6 }; 13 rawData[1] = new double[] { 61, 54.4 }; 14 rawData[2] = new double[] { 67, 99.9 }; 15 rawData[3] = new double[] { 68, 97.3 }; 16 rawData[4] = new double[] { 62, 59.0 }; 17 rawData[5] = new double[] { 75, 81.6 }; 18 rawData[6] = new double[] { 74, 77.1 }; 19 rawData[7] = new double[] { 66, 97.3 }; 20 rawData[8] = new double[] { 68, 93.3 }; 21 rawData[9] = new double[] { 61, 59.0 }; 22 23 Console.WriteLine("Raw unclustered height (in.) weight (kg.) data:\n"); 24 Console.WriteLine(" ID Height Weight"); 25 Console.WriteLine("---------------------"); 26 ShowData(rawData, 1, true, true); 27 28 int numClusters = 3; 29 Console.WriteLine("\nSetting numClusters to " + numClusters); 30 31 Console.WriteLine("Starting clustering using k-means algorithm"); 32 Clusterer c = new Clusterer(numClusters); 33 int[] clustering = c.Cluster(rawData); 34 Console.WriteLine("Clustering complete\n"); 35 36 Console.WriteLine("Final clustering in internal form:\n"); 37 ShowVector(clustering, true); 38 39 Console.WriteLine("Raw data by cluster:\n"); 40 Console.WriteLine(" ID Height Weight"); 41 ShowClustered(rawData, clustering, numClusters, 1); 42 43 Console.WriteLine("\nEnd k-means clustering demo\n"); 44 Console.ReadLine(); 45 } 46 47 static void ShowData( 48 double[][] data, int decimals, 49 bool indices, bool newLine) 50 { 51 for (int i = 0; i < data.Length; ++i) 52 { 53 if (indices == true) 54 Console.Write(i.ToString().PadLeft(3) + " "); 55 56 for (int j = 0; j < data[i].Length; ++j) 57 { 58 double v = data[i][j]; 59 Console.Write(v.ToString("F" + decimals) + " "); 60 } 61 62 Console.WriteLine(""); 63 } 64 65 if (newLine == true) 66 Console.WriteLine(""); 67 } 68 69 static void ShowVector(int[] vector, bool newLine) 70 { 71 for (int i = 0; i < vector.Length; ++i) 72 Console.Write(vector[i] + " "); 73 74 if (newLine == true) 75 Console.WriteLine("\n"); 76 } 77 78 static void ShowClustered( 79 double[][] data, int[] clustering, 80 int numClusters, int decimals) 81 { 82 for (int k = 0; k < numClusters; ++k) 83 { 84 Console.WriteLine("==================="); 85 for (int i = 0; i < data.Length; ++i) 86 { 87 int clusterID = clustering[i]; 88 if (clusterID != k) continue; 89 Console.Write(i.ToString().PadLeft(3) + " "); 90 for (int j = 0; j < data[i].Length; ++j) 91 { 92 double v = data[i][j]; 93 Console.Write(v.ToString("F" + decimals) + " "); 94 } 95 Console.WriteLine(""); 96 } 97 Console.WriteLine("==================="); 98 } 99 } 100 } 101 102 public class Clusterer 103 { 104 private int numClusters; // number of clusters 105 private int[] clustering; // index = a tuple, value = cluster ID 106 private double[][] centroids; // mean (vector) of each cluster 107 private Random rnd; // for initialization 108 109 public Clusterer(int numClusters) 110 { 111 this.numClusters = numClusters; 112 this.centroids = new double[numClusters][]; 113 this.rnd = new Random(0); // arbitrary seed 114 } 115 116 public int[] Cluster(double[][] data) 117 { 118 int numTuples = data.Length; 119 int numValues = data[0].Length; 120 this.clustering = new int[numTuples]; 121 122 for (int k = 0; k < numClusters; ++k) // allocate each centroid 123 this.centroids[k] = new double[numValues]; 124 125 InitRandom(data); 126 127 Console.WriteLine("\nInitial random clustering:"); 128 for (int i = 0; i < clustering.Length; ++i) 129 Console.Write(clustering[i] + " "); 130 Console.WriteLine("\n"); 131 132 bool changed = true; // change in clustering? 133 int maxCount = numTuples * 10; // sanity check 134 int ct = 0; 135 while (changed == true && ct <= maxCount) 136 { 137 ++ct; // k-means typically converges very quickly 138 UpdateCentroids(data); // no effect if fail 139 changed = UpdateClustering(data); // no effect if fail 140 } 141 142 int[] result = new int[numTuples]; 143 Array.Copy(this.clustering, result, clustering.Length); 144 return result; 145 } 146 147 private void InitRandom(double[][] data) 148 { 149 int numTuples = data.Length; 150 151 int clusterID = 0; 152 for (int i = 0; i < numTuples; ++i) 153 { 154 clustering[i] = clusterID++; 155 if (clusterID == numClusters) 156 clusterID = 0; 157 } 158 for (int i = 0; i < numTuples; ++i) 159 { 160 int r = rnd.Next(i, clustering.Length); 161 int tmp = clustering[r]; 162 clustering[r] = clustering[i]; 163 clustering[i] = tmp; 164 } 165 } 166 167 private void UpdateCentroids(double[][] data) 168 { 169 int[] clusterCounts = new int[numClusters]; 170 for (int i = 0; i < data.Length; ++i) 171 { 172 int clusterID = clustering[i]; 173 ++clusterCounts[clusterID]; 174 } 175 176 // zero-out this.centroids so it can be used as scratch 177 for (int k = 0; k < centroids.Length; ++k) 178 for (int j = 0; j < centroids[k].Length; ++j) 179 centroids[k][j] = 0.0; 180 181 for (int i = 0; i < data.Length; ++i) 182 { 183 int clusterID = clustering[i]; 184 for (int j = 0; j < data[i].Length; ++j) 185 centroids[clusterID][j] += data[i][j]; // accumulate sum 186 } 187 188 for (int k = 0; k < centroids.Length; ++k) 189 for (int j = 0; j < centroids[k].Length; ++j) 190 centroids[k][j] /= clusterCounts[k]; // danger? 191 } 192 193 private bool UpdateClustering(double[][] data) 194 { 195 // (re)assign each tuple to a cluster (closest centroid) 196 // returns false if no tuple assignments change OR 197 // if the reassignment would result in a clustering where 198 // one or more clusters have no tuples. 199 200 bool changed = false; // did any tuple change cluster? 201 202 int[] newClustering = new int[clustering.Length]; // proposed result 203 Array.Copy(clustering, newClustering, clustering.Length); 204 205 double[] distances = new double[numClusters]; // from tuple to centroids 206 207 for (int i = 0; i < data.Length; ++i) // walk through each tuple 208 { 209 for (int k = 0; k < numClusters; ++k) 210 distances[k] = Distance(data[i], centroids[k]); 211 212 int newClusterID = MinIndex(distances); // find closest centroid 213 if (newClusterID != newClustering[i]) 214 { 215 changed = true; // note a new clustering 216 newClustering[i] = newClusterID; // accept update 217 } 218 } 219 220 if (changed == false) 221 return false; // no change so bail 222 223 // check proposed clustering cluster counts 224 int[] clusterCounts = new int[numClusters]; 225 for (int i = 0; i < data.Length; ++i) 226 { 227 int clusterID = newClustering[i]; 228 ++clusterCounts[clusterID]; 229 } 230 231 for (int k = 0; k < numClusters; ++k) 232 if (clusterCounts[k] == 0) 233 return false; // bad clustering 234 235 Array.Copy(newClustering, clustering, newClustering.Length); // update 236 return true; // good clustering and at least one change 237 } 238 239 // Euclidean distance between two vectors for UpdateClustering() 240 private static double Distance(double[] tuple, double[] centroid) 241 { 242 double sumSquaredDiffs = 0.0; 243 for (int j = 0; j < tuple.Length; ++j) 244 sumSquaredDiffs += (tuple[j] - centroid[j]) * (tuple[j] - centroid[j]); 245 return Math.Sqrt(sumSquaredDiffs); 246 } 247 248 // helper for UpdateClustering() to find closest centroid 249 private static int MinIndex(double[] distances) 250 { 251 int indexOfMin = 0; 252 double smallDist = distances[0]; 253 for (int k = 1; k < distances.Length; ++k) 254 { 255 if (distances[k] < smallDist) 256 { 257 smallDist = distances[k]; 258 indexOfMin = k; 259 } 260 } 261 return indexOfMin; 262 } 263 } 264 }
運行結果如下:
參考資料
- 基于 Apache Mahout 構建社會化推薦引擎
- 探索推薦引擎內部的秘密,第 1 部分: 推薦引擎初探
- 探索推薦引擎內部的秘密,第 2 部分: 深入推薦引擎相關算法 - 協同過濾
- 探索推薦引擎內部的秘密,第 3 部分: 深入推薦引擎相關算法 - 聚類
- 淺談Kmeans聚類
- Mahout學習——K-Means Clustering
- 基本Kmeans算法介紹及其實現
- 算法雜貨鋪——k均值聚類(K-means)
- K-Means Data Clustering Using C#
- K-Means Clustering Used in Intention Based Scoring Projects
- Multi-Threaded K-Means Clustering in .NET 4.0
- CS229 Lecture notes -- Andrew Ng
- Cluster analysis
- k-means clustering
- K-means++ clustering
- Lloyd's algorithm
- NP-hard
- Voronoi diagram
- K-Means 算法
- A Tutorial on Clustering Algorithms
- Unsupervised learning or Clustering – K-means Gaussian mixture models
本篇文章《K-Means 聚類算法》由 Dennis Gao 發表自博客園個人博客,未經作者本人同意禁止以任何的形式轉載,任何自動的或人為的爬蟲轉載行為均為耍流氓。
文章列表