Data Mining 資料探勘. 分群分析 (Cluster Analysis)

Size: px

Start display at page:

Download "Data Mining 資料探勘. 分群分析 (Cluster Analysis)"

Cameron Sims
10 years ago
Views:

1 Data Mining 資料探勘 Tamkang University 分群分析 (Cluster Analysis) DM MI Wed,, (:- :) (B) Min-Yuh Day 戴敏育 Assistant Professor 專任助理教授 Dept. of Information Management, Tamkang University 淡江大學資訊管理學系 tku.edu.tw/myday/ - -

2 課程大綱 (Syllabus) 週次 (Week) 日期 (Date) 內容 (Subject/Topics) // 資料探勘導論 (IntroducGon to Data Mining) // 關連分析 (AssociaGon Analysis) // 分類與預測 (ClassificaGon and PredicGon) // 分群分析 (Cluster Analysis) // 個案分析與實作一 (SAS EM 分群分析 ): Case Study (Cluster Analysis K- Means using SAS EM) // 個案分析與實作二 (SAS EM 關連分析 ): Case Study (AssociaGon Analysis using SAS EM) // 教學行政觀摩日 (Off- campus study) // 個案分析與實作三 (SAS EM 決策樹模型評估 ): Case Study (Decision Tree, Model EvaluaGon using SAS EM)

Study (Cluster Analysis K- Means using SAS EM) // 個案分析與實作二 (SAS EM 關連分析 ): Case Study (AssociaGon Analysis using SAS EM)

3 課程大綱 (Syllabus) 週次 (Week) 日期 (Date) 內容 (Subject/Topics) // 期中報告 (Midterm Project PresentaGon) // 期中考試週 (Midterm Exam) // 個案分析與實作四 (SAS EM 迴歸分析類神經網路 ): Case Study (Regression Analysis, ArGficial Neural Network using SAS EM) // 文字探勘與網頁探勘 (Text and Web Mining) // 海量資料分析 (Big Data AnalyGcs) // 期末報告 (Final Project PresentaGon) // 畢業考試週 (Final Exam)

Study (Regression Analysis, ArGficial Neural Network using SAS EM) // 文字探勘與網頁探勘 (Text and

4 Outline Cluster Analysis K- Means Clustering Source: Han & Kamber ()

5 Cluster Analysis Used for automagc idengficagon of natural groupings of things Part of the machine- learning family Employ unsupervised learning Learns the clusters of things from past data, then assigns new instances There is not an output variable Also known as segmentagon Source: Turban et al. (), Decision Support and Business Intelligence Systems

from past data, then assigns new instances There is not an output variable Also known

6 Cluster Analysis Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a +.) Source: Han & Kamber ()

7 Cluster Analysis Clustering results may be used to IdenGfy natural groupings of customers IdenGfy rules for assigning new cases to classes for targegng/diagnosgc purposes Provide characterizagon, definigon, labeling of populagons Decrease the size and complexity of problems for other data mining methods IdenGfy outliers in a specific domain (e.g., rare- event detecgon) Source: Turban et al. (), Decision Support and Business Intelligence Systems

populagons Decrease the size and complexity of problems for other data mining methods IdenGfy outliers in a

8 Example of Cluster Analysis Point P P(x,y) p a (, ) p b (, ) p c (, ) p d (, ) p e (, ) p f (, ) p g (, ) p h (, ) p i (, ) p j (, )

9 Cluster Analysis for Data Mining Analysis methods StaGsGcal methods (including both hierarchical and nonhierarchical), such as k- means, k- modes, and so on Neural networks (adapgve resonance theory [ART], self- organizing map [SOM]) Fuzzy logic (e.g., fuzzy c- means algorithm) GeneGc algorithms Divisive versus AgglomeraGve methods Source: Turban et al. (), Decision Support and Business Intelligence Systems

10 Cluster Analysis for Data Mining How many clusters? There is not a truly opgmal way to calculate it HeurisGcs are ojen used. Look at the sparseness of clusters. Number of clusters = (n/) / (n: no of data points). Use Akaike informagon criterion (AIC). Use Bayesian informagon criterion (BIC) Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items Euclidian versus Manhalan (recglinear) distance Source: Turban et al. (), Decision Support and Business Intelligence Systems

Use Bayesian informagon criterion (BIC) Most cluster analysis methods involve the use of a distance measure to calculate the

11 k- Means Clustering Algorithm k : pre- determined number of clusters Algorithm (Step : determine value of k) Step : Randomly generate k random points as inigal cluster centers Step : Assign each point to the nearest cluster center Step : Re- compute the new cluster centers RepeGGon step: Repeat steps and ungl some convergence criterion is met (usually that the assignment of points to clusters becomes stable) Source: Turban et al. (), Decision Support and Business Intelligence Systems

Re- compute the new cluster centers RepeGGon step: Repeat steps and ungl some convergence criterion is met (usually that

12 Cluster Analysis for Data Mining - k- Means Clustering Algorithm Step Step Step Source: Turban et al. (), Decision Support and Business Intelligence Systems

13 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular ones include: Minkowski distance: where i = (x i, x i,, x ip ) and j = (x j, x j,, x jp ) are two p- dimensional data objects, and q is a posigve integer If q =, d is Manhalan distance q q p p q q j x i x j x i x j x i x j i d )... ( ), ( =... ), ( p p j x i x j x i x j x i x j i d = Source: Han & Kamber ()

j,, x jp ) are two p- dimensional data objects, and q is a posigve integer If q =, d is Manhalan distance q q p p q q

14 Similarity and Dissimilarity Between Objects (Cont.) If q =, d is Euclidean distance: d( i, j) = ProperGes d(i,j) d(i,i) = ( x i d(i,j) = d(j,i) x j + x i d(i,j) d(i,k) + d(k,j) x j x i p p Also, one can use weighted distance, parametric Pearson product moment correlagon, or other disimilarity measures x j ) Source: Han & Kamber ()

15 Euclidean distance vs ManhaMan distance Distance of two point x = (, ) and x (, ) x (, ). x = (, ) Euclidean distance: = ((-) + (-) ) / = ( + ) / = ( + ) / = () / =. Manhattan distance: = (-) + (-) = + =

16 Example The K- Means Clustering Method Assign each objects to most similar center reassign Update the cluster means reassign K= Arbitrarily choose K object as initial cluster center Update the cluster means Source: Han & Kamber ()

cluster means reassign K= Arbitrarily choose K object as

17 K- Means Clustering Step by Step Point P P(x,y) p a (, ) p b (, ) p c (, ) p d (, ) p e (, ) p f (, ) p g (, ) p h (, ) p i (, ) p j (, )

18 m = (, ) K- Means Clustering Step : K=, Arbitrarily choose K object as initial cluster center M = (, ) Point P P(x,y) p a (, ) p b (, ) p c (, ) p d (, ) p e (, ) p f (, ) p g (, ) p h (, ) p i (, ) p j (, ) Initial m (, ) Initial m (, )

19 Step : Compute seed points as the centroids of the clusters of the current partition Step : Assign each objects to most similar center m = (, ) M = (, ) Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster K- Means Clustering Initial m (, ) Initial m (, )

. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).

20 Step : Compute seed points as the centroids of the clusters of the current partition Step : Assign each objects to most similar center m = (, ) Euclidean distance b(,) ß à m(,) = ((-) + (-) ) / = ( + (-) ) / = ( + ) / M = (, ) K- Means = () / Clustering =. Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster Euclidean distance b(,) ß à m(,) = ((-) + (-) ) / = ( + (-) ) / = ( + ) / = () / =. p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster Initial m (, ) Initial m (, )

Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).

21 Step : Update the cluster means, Repeat Step,, stop when no more new assignment m = (.,.) M = (.,.) K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

22 Step : Update the cluster means, Repeat Step,, stop when no more new assignment m = (.,.) M = (..,.) K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

23 stop when no more new assignment K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

24 K- Means Clustering (K=, two clusters) stop when no more new assignment K- Means Clustering Point P P(x,y) m distance m distance Cluster p a (, ).. Cluster p b (, ).. Cluster p c (, ).. Cluster p d (, ).. Cluster p e (, ).. Cluster p f (, ).. Cluster p g (, ).. Cluster p h (, ).. Cluster p i (, ).. Cluster p j (, ).. Cluster m (.,.) m (.,.)

25 Summary Cluster Analysis K- Means Clustering Source: Han & Kamber ()

26 References Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second EdiGon,, Elsevier Efraim Turban, Ramesh Sharda, Dursun Delen, Decision Support and Business Intelligence Systems, Ninth EdiGon,, Pearson.

Data Mining 5. Cluster Analysis

Data Mining 5. Cluster Analysis 5.2 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

Data Mining 資料探勘. 分群分析 (Cluster Analysis)