Library of Congress Cataloging-in-Publication Data
Mirkin, B. G. (Boris Grigorévich)
Clustering for data mining : a data recovery approach / Boris Mirkin.
p. cm. -- (Computer science and data analysis series ; 3)
Includes bibliographical references and index.
ISBN 1-58488-534-3
1. Data mining. 2. Cluster analysis. I. Title. II. Series.
QA76.9.D343M57 2005
006.3'12--dc22 2005041421
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
Taylor & Francis Group is the Academic Division of T&F Informa plc.
Computer Science and Data Analysis Series
The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
David Madigan, Rutgers University
Fionn Murtagh, Royal Holloway, University of London
Padhraic Smyth, University of California, Irvine

Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC
23-25 Blades Court
London SW15 2NU
UK

Published Titles
Bayesian Artificial Intelligence, Kevin B. Korb and Ann E. Nicholson
Pattern Recognition Algorithms for Data Mining, Sankar K. Pal and Pabitra Mitra
Exploratory Data Analysis with MATLAB®, Wendy L. Martinez and Angel R. Martinez
Clustering for Data Mining: A Data Recovery Approach, Boris Mirkin
Correspondence Analysis and Data Coding with JAVA and R, Fionn Murtagh
R Graphics, Paul Murrell
Contents
Preface
List of Denotations
Introduction: Historical Remarks

1 What Is Clustering
Base words
1.1 Exemplary problems
  1.1.1 Structuring
  1.1.2 Description
  1.1.3 Association
  1.1.4 Generalization
  1.1.5 Visualization of data structure
1.2 Bird's-eye view
  1.2.1 Definition: data and cluster structure
  1.2.2 Criteria for revealing a cluster structure
  1.2.3 Three types of cluster description
  1.2.4 Stages of a clustering application
  1.2.5 Clustering and other disciplines
  1.2.6 Different perspectives of clustering

Base words
2.1 Feature characteristics
  2.1.1 Feature scale types
  2.1.2 Quantitative case
  2.1.3 Categorical case
2.2 Bivariate analysis
  2.2.1 Two quantitative variables
  2.2.2 Nominal and quantitative variables
  2.2.3 Two nominal variables cross-classified
  2.2.4 Relation between correlation and contingency
  2.2.5 Meaning of correlation
2.3 Feature space and data scatter
  2.3.1 Data matrix
  2.3.2 Feature space: distance and inner product
  2.3.3 Data scatter
2.4 Pre-processing and standardizing mixed data
2.5 Other table data types
  2.5.1 Dissimilarity and similarity data
  2.5.2 Contingency and flow data

3 K-Means Clustering
Base words
3.1 Conventional K-Means
  3.1.1 Straight K-Means
  3.1.2 Square error criterion
  3.1.3 Incremental versions of K-Means
3.2 Initialization of K-Means
  3.2.1 Traditional approaches to initial setting
  3.2.2 MaxMin for producing deviate centroids
  3.2.3 Deviate centroids with Anomalous pattern
3.3 Intelligent K-Means
  3.3.1 Iterated Anomalous pattern for iK-Means
  3.3.2 Cross validation of iK-Means results
3.4 Interpretation aids
  3.4.1 Conventional interpretation aids
  3.4.2 Contribution and relative contribution tables
  3.4.3 Cluster representatives
  3.4.4 Measures of association from ScaD tables
3.5 Overall assessment

Base words
4.1 Agglomeration: Ward algorithm
4.2 Divisive clustering with Ward criterion
  4.2.1 2-Means splitting
  4.2.2 Splitting by separating
  4.2.3 Interpretation aids for upper cluster hierarchies
4.3 Conceptual clustering
4.4 Extensions of Ward clustering
  4.4.1 Agglomerative clustering with dissimilarity data
  4.4.2 Hierarchical clustering for contingency and flow data

Base words
5.1 Statistics modeling as data recovery
  5.1.1 Averaging
  5.1.2 Linear regression
  5.1.3 Principal component analysis
  5.1.4 Correspondence factor analysis
5.2 Data recovery model for K-Means
  5.2.1 Equation and data scatter decomposition
  5.2.2 Contributions of clusters, features, and individual entities
  5.2.3 Correlation ratio as contribution
  5.2.4 Partition contingency coefficients
5.3 Data recovery models for Ward criterion
  5.3.1 Data recovery models with cluster hierarchies
  5.3.2 Covariances, variances and data scatter decomposed
  5.3.3 Direct proof of the equivalence between 2-Means and Ward criteria
  5.3.4 Gower's controversy
5.4 Extensions to other data types
  5.4.1 Similarity and attraction measures compatible with K-Means and Ward criteria
  5.4.2 Application to binary data
  5.4.3 Agglomeration and aggregation of contingency data
  5.4.4 Extension to multiple data
5.5 One-by-one clustering
  5.5.1 PCA and data recovery clustering
  5.5.2 Divisive Ward-like clustering
  5.5.3 Iterated Anomalous pattern
  5.5.4 Anomalous pattern versus Splitting
  5.5.5 One-by-one clusters for similarity data
5.6 Overall assessment

Base words
6.1 Extensions of K-Means clustering
  6.1.1 Clustering criteria and implementation
  6.1.2 Partitioning around medoids PAM
  6.1.3 Fuzzy clustering
  6.1.4 Regression-wise clustering
  6.1.5 Mixture of distributions and EM algorithm
  6.1.6 Kohonen self-organizing maps SOM
6.2 Graph-theoretic approaches
  6.2.1 Single linkage, minimum spanning tree and connected components
  6.2.2 Finding a core
6.3 Conceptual description of clusters
  6.3.1 False positives and negatives
  6.3.2 Conceptually describing a partition
  6.3.3 Describing a cluster with production rules
  6.3.4 Comprehensive conjunctive description of a cluster
6.4 Overall assessment

7 General Issues
Base words
7.1 Feature selection and extraction
  7.1.1 A review
  7.1.2 Comprehensive description as a feature selector
  7.1.3 Comprehensive description as a feature extractor
7.2 Data pre-processing and standardization
  7.2.1 Dis/similarity between entities
  7.2.2 Pre-processing feature based data
  7.2.3 Data standardization
7.3 Similarity on subsets and partitions
  7.3.1 Dis/similarity between binary entities or subsets
  7.3.2 Dis/similarity between partitions
7.4 Dealing with missing data
  7.4.1 Imputation as part of pre-processing
  7.4.2 Conditional mean
  7.4.3 Maximum likelihood
  7.4.4 Least-squares approximation
7.5 Validity and reliability
  7.5.1 Index based validation
  7.5.2 Resampling for validation and selection
  7.5.3 Model selection with resampling
7.6 Overall assessment

Conclusion: Data Recovery Approach in Clustering
Bibliography
Preface
Clustering is a discipline devoted to finding and describing cohesive or homogeneous chunks in data, the clusters. Some exemplary clustering problems are:
- Finding common surf patterns in the set of web users
- Automatically revealing meaningful parts in a digitalized image
- Partition of a set of documents in groups by similarity of their contents
- Visual display of the environmental similarity between regions on a country map
- Monitoring socio-economic development of a system of settlements via a small number of representative settlements
- Finding protein sequences in a database that are homologous to a query protein sequence
- Finding anomalous patterns of gene expression data for diagnostic purposes
- Producing a decision rule for separating potentially bad-debt credit applicants
- Given a set of preferred vacation places, finding out what features of the places and vacationers attract each other
- Classifying households according to their furniture purchasing patterns and finding groups' key characteristics to optimize furniture marketing and production.

Clustering is a key area in data mining and knowledge discovery, which are activities oriented towards finding non-trivial or hidden patterns in data collected in databases. Earlier developments of clustering techniques have been associated, primarily, with three areas of research: factor analysis in psychology [55], numerical taxonomy in biology [122], and unsupervised learning in pattern recognition [21].

Technically speaking, the idea behind clustering is rather simple: introduce a measure of similarity between entities under consideration and combine similar entities into the same clusters while keeping dissimilar entities in different clusters. However, implementing this idea is less than straightforward.

First, too many similarity measures and clustering techniques have been invented with virtually no support to a non-specialist user in selecting among them. The trouble with this is that different similarity measures and/or clustering techniques may, and frequently do, lead to different results. Moreover, the same technique may also lead to different cluster solutions depending on the choice of parameters such as the initial setting or the number of clusters specified. On the other hand, some common data types, such as questionnaires with both quantitative and categorical features, have been left virtually without any substantiated similarity measure.

Second, use and interpretation of cluster structures may become an issue, especially when available data features are not straightforwardly related to the phenomenon under consideration. For instance, certain data on customers available at a bank, such as age and gender, typically are not very helpful in deciding whether to grant a customer a loan or not.

Specialists acknowledge peculiarities of the discipline of clustering. They understand that the clusters to be found in data may very well depend not only on the data but also on the user's goals and degree of granulation. They frequently consider clustering an art rather than a science. Indeed, clustering has been dominated by learning from examples rather than theory-based instructions. This is especially visible in texts written for inexperienced readers, such as [4], [28] and [115].

The general opinion among specialists is that clustering is a tool to be applied at the very beginning of investigation into the nature of a phenomenon under consideration, to view the data structure and then decide upon applying better suited methodologies.
Another opinion of specialists is that methods for finding clusters as such should constitute the core of the discipline; related questions of data pre-processing, such as feature quantization and standardization, definition and computation of similarity, and post-processing, such as interpretation and association with other aspects of the phenomenon, should be left beyond the scope of the discipline because they are motivated by external considerations related to the substance of the phenomenon under investigation. I share the former opinion and dispute the latter because it is at odds with the former: in the very first steps of knowledge discovery, substantive considerations are quite shaky, and it is unrealistic to expect that they alone could lead to properly solving the issues of pre- and post-processing.

This dissenting opinion has led me to believe that the discovered clusters must be treated as an "ideal" representation of the data that could be used for recovering the original data back from the ideal format. This is the idea of the data recovery approach: not only use data for finding clusters but also use clusters for recovering the data. In a general situation, the data recovered from aggregate clusters cannot fit the original data exactly, which can be used for evaluation of the quality of clusters: the better the fit, the better the clusters. This perspective would also lead to the addressing of issues in pre- and post-processing, which now becomes possible because parts of the data that are explained by clusters can be separated from those that are not.

The data recovery approach is common in more traditional data mining and statistics areas such as regression, analysis of variance and factor analysis, where it works, to a great extent, due to the Pythagorean decomposition of the data scatter into "explained" and "unexplained" parts. Why not try the same approach in clustering?

In this book, two of the most popular clustering techniques, K-Means for partitioning and Ward's method for hierarchical clustering, are presented in the framework of the data recovery approach. The selection is by no means random: these two methods are well suited because they are based on statistical thinking related to and inspired by the data recovery approach: they minimize the overall within-cluster variance of data. This seems to be the reason for the popularity of these methods. However, the traditional focus of research on computational and experimental aspects rather than theoretical ones has contributed to the lack of understanding of clustering methods in general and these two in particular. For instance, no firm relation between these two methods has been established so far, in spite of the fact that they share the same square error criterion.

I have found such a relation, in the format of a Pythagorean decomposition of the data scatter into parts explained and unexplained by the found cluster structure. It follows from the decomposition, quite unexpectedly, that it is the divisive clustering format, rather than the traditional agglomerative format, that better suits the Ward clustering criterion. The decomposition has led to a number of other observations that amount to a theoretical framework for the two methods. Moreover, the framework appears to be well suited for extensions of the methods to different data types such as mixed scale data including continuous, nominal and binary features.
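As a sketch of what such a decomposition looks like for a partition, the identity below is written in the notation of the List of Denotations (standardized data $y_{iv}$, clusters $S_k$ with $N_k$ entities and centroids $c_k$, squared Euclidean distance $d$). It assumes only that each centroid is the within-cluster mean; the labels under the braces are informal, and the formal treatment belongs to Chapter 5.

```latex
T(Y) \;=\; \sum_{i \in I} \sum_{v \in V} y_{iv}^2
\;=\; \underbrace{\sum_{k=1}^{K} N_k \sum_{v \in V} c_{kv}^2}_{\text{explained by the clusters}}
\;+\; \underbrace{\sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)}_{\text{unexplained (square error)}}
```

Since the left-hand side is fixed by the data, making the unexplained part smaller is the same as making the explained part larger, which is the sense in which a better fit means better clusters.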
In addition, a number of both conventional and original interpretation aids have been derived for both partitioning and hierarchical clustering based on contributions of features and categories to clusters and splits. One more strain of clustering techniques, one-by-one clustering, which is becoming increasingly popular, naturally emerges within the framework, giving rise to intelligent versions of K-Means that mitigate the need for user-defined setting of the number of clusters and their hypothetical prototypes. Most importantly, the framework leads to a set of mathematically proven properties relating classical clustering with other clustering techniques such as conceptual clustering and graph theoretic clustering as well as with other data mining concepts such as decision trees and association in contingency data tables.

These are all presented in this book, which is oriented towards a reader interested in the technical aspects of data mining, be they a theoretician or a practitioner. The book is especially well suited for those who want to learn WHAT clustering is by learning not only HOW the techniques are applied but also WHY. In this way the reader receives knowledge which should allow them not only to apply the methods but also to adapt, extend and modify them according to the reader's own ends.

This material is organized in five chapters presenting a unified theory along with computational, interpretational and practical issues of real-world data mining with clustering:
- What is clustering (Chapter 1)
- What is data (Chapter 2)
- What is K-Means (Chapter 3)
- What is Ward clustering (Chapter 4)
- What is the data recovery approach (Chapter 5).

But this is not the end of the story. Two more chapters follow. Chapter 6 presents some other clustering goals and methods such as SOM (self-organizing maps) and EM (expectation-maximization), as well as those for conceptual description of clusters. Chapter 7 takes on "big issues" of data mining: validity and reliability of clusters, missing data, options for data pre-processing and standardization, etc. When convenient, we indicate solutions to the issues following from the theory of the previous chapters. The Conclusion reviews the main points brought up by the data recovery approach to clustering and indicates potential for further developments.

This structure is intended, first, to introduce classical clustering methods and their extensions to modern tasks, according to the data recovery approach, without learning the theory (Chapters 1 through 4), then to describe the theory leading to these and related methods (Chapter 5) and, in addition, to see a wider picture in which the theory is but a small part (Chapters 6 and 7).

In fact, my prime intention was to write a text on classical clustering, updated to issues of current interest in data mining such as processing mixed feature scales, incomplete clustering and conceptual interpretation. But then I realized that no such text can appear before the theory is described.
When I started describing the theory, I found that there are holes in it, such as a lack of understanding of the relation between K-Means and the Ward method and in fact a lack of a theory for the Ward method at all, misconceptions in quantization of qualitative categories, and a lack of model based interpretation aids. This is how the current version has become a threefold creature oriented toward:
1. Giving an account of the data recovery approach to encompass partitioning, hierarchical and one-by-one clustering methods
2. Presenting a coherent theory in clustering that addresses such issues as (a) relation between normalizing scales for categorical data and measuring association between categories and clustering, (b) contributions of various elements of cluster structures to data scatter and their use in interpretation, (c) relevant criteria and methods for clustering differently expressed data, etc.
3. Providing a text in data mining for teaching and self-learning popular data mining techniques, especially K-Means partitioning and Ward agglomerative and divisive clustering, with emphases on mixed data pre-processing and interpretation aids in practical applications.

At present, there are two types of literature on clustering, one leaning towards providing general knowledge and the other giving more instruction. Books of the former type are Gordon [39], targeting readers with a degree of mathematical background, and Everitt et al. [28], which does not require mathematical background. These include a great many methods and specific examples but leave rigorous data mining instruction beyond the prime contents. Publications of the latter type are Kaufman and Rousseeuw [62] and chapters in data mining books such as Dunham [23]. They contain selections of some techniques reported in an ad hoc manner, without any concern for relations between them, and provide detailed instruction on algorithms and their parameters.

This book combines features of both approaches. However, it does so in a rather distinct way. The book does contain a number of algorithms with detailed instructions and examples for their settings. But selection of methods is based on their fitting to the data recovery theory rather than just popularity. This leads to the covering of issues in pre- and post-processing matters that are usually left beyond instruction. The book does contain a general knowledge review, but it concerns issues more than specific methods. In doing so, I had to clearly distinguish between four different perspectives: (a) statistics, (b) machine learning, (c) data mining, and (d) knowledge discovery, as those leading to different answers to the same questions.
This text obviously pertains to the data mining and knowledge discovery perspectives, though the other two are also referred to, especially with regard to cluster validation.

The book assumes that the reader may have no mathematical background beyond high school: all necessary concepts are defined within the text. However, it does contain some technical stuff needed for shaping and explaining a technical theory. Thus it might be of help if the reader is acquainted with basic notions of calculus, statistics, matrix algebra, graph theory and logic.

To help the reader, the book conventionally includes a list of denotations, in the beginning, and a bibliography and index, in the end. Each individual chapter is preceded by a boxed set of goals and a dictionary of base words. Summarizing overviews are supplied to Chapters 3 through 7. Described methods are accompanied with numbered computational examples showing the working of the methods on relevant data sets from those presented in Chapter 1; there are 58 examples altogether. Computations have been carried out with self-made programs for MATLAB®, the technical computing tool developed by The MathWorks (see its Internet web site www.mathworks.com).

The material has been used in the teaching of data clustering and visualization to MSc CS students in several colleges across Europe. Based on these experiences, different teaching options can be suggested depending on the course objectives, time resources, and students' background.

If the main objective is teaching clustering methods and there are very few hours available, then it would be advisable to first pick up the material on generic K-Means in sections 3.1.1 and 3.1.2, and then review a couple of related methods such as PAM in section 6.1.2, iK-Means in 3.3.1, Ward agglomeration in 4.1 and division in 4.2.1, single linkage in 6.2.1 and SOM in 6.1.6. Given a little more time, a review of cluster validation techniques from 7.6 including examples in 3.3.2 should follow the methods. In a more relaxed regime, issues of interpretation should be brought forward as described in 3.4, 4.2.3, 6.3 and 7.2.

If the main objective is teaching data visualization, then the starting point should be the system of categories described in 1.1.5, followed by material related to these categories: bivariate analysis in section 2.2, regression in 5.1.2, principal component analysis (SVD, the singular value decomposition) in 5.1.3, K-Means and iK-Means in Chapter 3, self-organizing maps (SOM) in 6.1.6 and graph-theoretic structures in 6.2.
Acknowledgments
Too many people contributed to the approach and this book to list them all. However, I would like to mention those researchers whose support was important for channeling my research efforts: Dr. E. Braverman, Dr. V. Vapnik, Prof. Y. Gavrilets, and Prof. S. Aivazian, in Russia; Prof. F. Roberts, Prof. F. McMorris, Prof. P. Arabie, Prof. T. Krauze, and Prof. D. Fisher, in the USA; Prof. E. Diday, Prof. L. Lebart and Prof. B. Burtschy, in France; Prof. H.-H. Bock, Dr. M. Vingron, and Dr. S. Suhai, in Germany. The structure and contents of this book have been influenced by comments of Dr. I. Muchnik (Rutgers University, NJ, USA), Prof. M. Levin (Higher School of Economics, Moscow, Russia), Dr. S. Nascimento (University Nova, Lisbon, Portugal), and Prof. F. Murtagh (Royal Holloway, University of London, UK).
Boris Mirkin is a Professor of Computer Science at the University of London, UK. He develops methods for data mining in such areas as social surveys, bioinformatics and text analysis, and teaches computational intelligence and data visualization. Dr. Mirkin first became known for his work on combinatorial models and methods for data analysis and their application in biological and social sciences. He has published monographs such as "Group Choice" (John Wiley & Sons, 1979) and "Graphs and Genes" (Springer-Verlag, 1984, with S. Rodin). Subsequently, Dr. Mirkin spent almost ten years doing research in scientific centers such as École Nationale Supérieure des Télécommunications (Paris, France), Deutsches Krebsforschungszentrum (Heidelberg, Germany), and Center for Discrete Mathematics and Theoretical Computer Science DIMACS, Rutgers University (Piscataway, NJ, USA). Building on these experiences, he developed a unified framework for clustering as a data recovery discipline.
List of Denotations
$I$: Entity set
$N$: Number of entities
$V$: Feature set
$V_l$: Set of categories of a categorical feature $l$
$M$: Number of column features
$X = (x_{iv})$: Raw entity-to-feature data table
$Y = (y_{iv})$: Standardized entity-to-feature data table; $y_{iv} = (x_{iv} - a_v)/b_v$, where $a_v$ and $b_v$ denote the shift and scale coefficients, respectively
$y_i = (y_{iv})$: $M$-dimensional vector corresponding to entity $i \in I$ according to data table $Y$
$(x, y)$: Inner product of two vector points $x = (x_j)$ and $y = (y_j)$, $(x, y) = \sum_j x_j y_j$
$d(x, y)$: Distance (Euclidean squared) between two vector points $x = (x_j)$ and $y = (y_j)$, $d(x, y) = \sum_j (x_j - y_j)^2$
$S = \{S_1, \ldots, S_K\}$: Partition of set $I$ in $K$ disjoint classes $S_k \subseteq I$, $k = 1, \ldots, K$
$K$: Number of classes/clusters in a partition $S = \{S_1, \ldots, S_K\}$ of set $I$
$N_k$: Number of entities in class $S_k$ of partition $S$ ($k = 1, \ldots, K$)
$c_k = (c_{kv})$: Centroid of cluster $S_k$, $c_{kv} = \sum_{i \in S_k} y_{iv} / N_k$, $v \in V$
$S_w, S_{w1}, S_{w2}$: Parent-children triple in a cluster hierarchy, $S_w = S_{w1} \cup S_{w2}$
$dw(S_{w1}, S_{w2})$: Ward distance between clusters $S_{w1}$, with centroid $c_{w1}$, and $S_{w2}$, with centroid $c_{w2}$, $dw(S_{w1}, S_{w2}) = \frac{N_{w1} N_{w2}}{N_{w1} + N_{w2}} d(c_{w1}, c_{w2})$
$N_{kv}$: Number of entities in class $S_k$ of partition $S$ ($k = 1, \ldots, K$) that fall in category $v \in V_l$; an entry in the contingency table between partition $S$ and categorical feature $l$ with set of categories $V_l$
$N_{k+}$: Marginal distribution: number of entities in class $S_k$ of partition $S$ ($k = 1, \ldots, K$) as related to a contingency table between partition $S$ and another categorical feature
$N_{+v}$: Marginal distribution: number of entities falling in category $v \in V_l$ of categorical feature $l$ as related to a contingency table between partition $S$ and categorical feature $l$
$p_{kv}$: Frequency $N_{kv}/N$
$p_{k+}$: $N_{k+}/N$
$p_{+v}$: $N_{+v}/N$
$q_{kv}$: Relative Quetelet coefficient, $q_{kv} = \frac{p_{kv}}{p_{k+} p_{+v}} - 1$
$T(Y)$: Data scatter, $T(Y) = \sum_{i \in I} \sum_{v \in V} y_{iv}^2$
$W(S_k, c_k)$: Cluster's square error, $W(S_k, c_k) = \sum_{i \in S_k} d(y_i, c_k)$
$\sum_{k=1}^{K} W(S_k, c_k)$: K-Means square error criterion, equal to the sum of $W(S_k, c_k)$, $k = 1, \ldots, K$
$(i, S_k)$: Attraction of $i \in I$ to cluster $S_k$
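The denotations above translate almost directly into code. The following is a minimal Python sketch, not the author's MATLAB programs: the toy table `X`, the choice of mean shift and range scaling, the example partition, and all function names are assumptions made for illustration. Besides computing the listed quantities, it checks numerically a standard property of the Ward distance: it equals the increase in the square error when the two clusters are merged.

```python
# Illustrative sketch only: toy data and helper names are assumptions,
# not taken from the book.

def standardize(X, shift, scale):
    """y_iv = (x_iv - a_v) / b_v for every entity i and feature v."""
    return [[(x - a) / b for x, a, b in zip(row, shift, scale)] for row in X]

def inner(x, y):
    """Inner product (x, y) = sum_j x_j * y_j."""
    return sum(xj * yj for xj, yj in zip(x, y))

def dist(x, y):
    """Squared Euclidean distance d(x, y) = sum_j (x_j - y_j)^2."""
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

def centroid(Y, cluster):
    """c_kv = sum_{i in S_k} y_iv / N_k; a cluster is a list of entity indices."""
    n = len(cluster)
    return [sum(Y[i][v] for i in cluster) / n for v in range(len(Y[0]))]

def square_error(Y, cluster):
    """W(S_k, c_k) = sum_{i in S_k} d(y_i, c_k)."""
    c = centroid(Y, cluster)
    return sum(dist(Y[i], c) for i in cluster)

def ward_distance(Y, s1, s2):
    """dw(S_w1, S_w2) = N_w1 * N_w2 / (N_w1 + N_w2) * d(c_w1, c_w2)."""
    n1, n2 = len(s1), len(s2)
    return n1 * n2 / (n1 + n2) * dist(centroid(Y, s1), centroid(Y, s2))

def data_scatter(Y):
    """T(Y) = sum_i sum_v y_iv^2."""
    return sum(inner(y, y) for y in Y)

# Toy raw table X: 5 entities, 2 features (made-up numbers).
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [8.0, 30.0], [9.0, 32.0]]
N, M = len(X), len(X[0])

# Assumed standardization: shift by the feature mean, scale by the feature range.
shift = [sum(row[v] for row in X) / N for v in range(M)]
scale = [max(row[v] for row in X) - min(row[v] for row in X) for v in range(M)]
Y = standardize(X, shift, scale)

S = [[0, 1, 2], [3, 4]]  # a partition of I into K = 2 classes
W_total = sum(square_error(Y, Sk) for Sk in S)  # K-Means square error criterion
dw = ward_distance(Y, S[0], S[1])
merge_increase = square_error(Y, S[0] + S[1]) - W_total

print("T(Y) =", round(data_scatter(Y), 4))
print("W    =", round(W_total, 4))
print("dw   =", round(dw, 4), "| increase on merging =", round(merge_increase, 4))
```

Clusters are represented as lists of row indices so that the same helpers work for a single class, a whole partition, or a merged pair without copying data.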
Introduction: Historical Remarks
Clustering is a discipline aimed at revealing groups, or clusters, of similar entities in data. The existence of clustering activities can be traced a hundred years back, in different disciplines in different countries.

One of the first was the discipline of ecology. A question the scientists were trying to address was that of the territorial structure of the settlement of bird species and its determinants. They did field sampling to count numbers of various species at observation spots; similarity measures between spots were defined, and a method of analysis of the structure of similarity dubbed Wrocław taxonomy was developed in Poland between WWI and WWII (see a later publication [32]). This method survives, in an altered form, in diverse computational schemes such as single-linkage clustering and minimum spanning tree (see section 6.2.1).

Simultaneously, phenomenal activities in differential psychology, initiated in the United Kingdom by the thrust of F. Galton (1822-1911) and supported by the mathematical genius of K. Pearson (1857-1936) in trying to prove that human talent is not a random gift but inherited, led to developing a body of multivariate statistics including the discipline of factor analysis (primarily, for measuring talent) and, as its offshoot, cluster analysis. Take, for example, a list of high school students and their marks in various disciplines such as maths, English, history, etc. If one believes that the marks are exterior manifestations of an inner quality, or factor, of talent, then one can assign a student $i$ a hidden factor score of his talent, $z_i$. Then the marks $x_{il}$ of student $i$ in different disciplines $l$ can be modeled, up to an error, by the product $c_l z_i$, so that $x_{il} \approx c_l z_i$, where the factor $c_l$ reflects the impact of the discipline $l$ over students. The problem is to find the unknown $z_i$ and $c_l$, given a set of students' marks over a set of disciplines. This was the idea behind a method proposed by K. Pearson in 1901 [106] that became the ground for later developments in Principal Component Analysis (PCA); see further explanation in section 5.1.3.

To do the job of measuring hidden factors, F. Galton hired C. Spearman, who developed a rather distinct method for factor analysis based on the assumption that no unique talent can explain various human abilities, but there are different, and independent, dimensions of talent such as linguistic or spatial ones. Each of these hidden dimensions must be presented by a corresponding independent factor so that the mark can be thought of as the total of factor scores weighted by their loadings. This idea proved fruitful in developing various personality theories and related psychological tests. However, methods for factor analysis developed between WWI and WWII were computationally intensive since they used the operation of inversion of a matrix of discipline-to-discipline similarity coefficients (covariances, to be exact). The operation of matrix inversion still can be a challenging task when the matrix size grows into thousands, and it was a nightmare before the electronic computer era even with a matrix size of a dozen. It was noted then that variables (in this case, disciplines) related to the same factor are highly correlated among themselves, which led to the idea of catching "clusters" of highly correlated variables as proxies for factors, without computing the inverse matrix, an activity which was referred to once as "factor analysis for the poor." The very first book on cluster analysis, within this framework, was published in 1939 [131]; see also [55].

In the 1950s and 1960s, with computing power made available at universities, cluster analysis research grew fast in many disciplines simultaneously. Three of these seem especially important for the development of cluster analysis as a scientific discipline.

First, machine learning of groups of entities (pattern recognition) sprang up to involve both supervised and unsupervised learning, the latter being synonymous with cluster analysis [21].
Second, the discipline of numerical taxonomy emerged in biology claiming that a biological taxon, as a rule, could not be defined in the Aristotelian way, with a conjunction of features: a taxon thus was supposed to be such a set of organisms in which a majority shared a majority of attributes with each other [122]. Hierarchical agglomerative and divisive clustering algorithms were supposed to formalize this. They were "polythetic" by the very mechanism of their action, in contrast to classical "monothetic" approaches in which every divergence of taxa was to be explained by a single character. (It should be noted that the appeal of numerical taxonomists left some biologists unimpressed; there even exists the so-called "cladistics" discipline that claims that a single feature ought always to be responsible for any evolutionary divergence.)

Third, in the social sciences, an opposite stance of building a divisive decision tree at which every split is made over a single feature emerged in the work of Sonquist and Morgan (see a later reference [124]). This work led to the development of decision tree techniques that became a highly popular part of machine learning and data mining. Decision trees actually cover three methods, conceptual clustering, classification trees and regression trees, that are usually considered different because they employ different criteria of homogeneity [58]. In a conceptual clustering tree, split parts must be as homogeneous as possible with regard to all participating features. In contrast, a classification tree or regression tree achieves homogeneity with regard to only one, so-called target, feature. Still, we consider that all these techniques belong in cluster analysis because they all produce split parts consisting of similar entities; however, this does not prevent them also being part of other disciplines such as machine learning or pattern recognition.

A number of books reflecting these developments were published in the 1970s describing the great opportunities opened in many areas of human activity by algorithms for finding "coherent" clusters in a data "cloud" placed in geometrical space (see, for example, Benzécri 1973, Bock 1974, Clifford and Stephenson 1975, Duda and Hart 1973, Duran and Odell 1974, Everitt 1974, Hartigan 1975, Sneath and Sokal 1973, Sonquist, Baker, and Morgan 1973, Van Ryzin 1977, Zagoruyko 1972). In the next decade, some of these developments were further advanced and presented in such books as Breiman et al. [11], Jain and Dubes [58] and McLachlan and Basford [82]. Still the common view is that clustering is an art rather than a science because determining clusters may depend more on the user's goals than on a theory. Accordingly, clustering is viewed as a set of diverse and ad hoc procedures rather than a consistent theory.

The last decade saw the emergence of data mining, the discipline combining issues of handling and maintaining data with approaches from statistics and machine learning for discovering patterns in data. In contrast to the statistical approach, which tries to find and fit objective regularities in data, data mining is oriented towards the end user.
That means that data mining considers the problem of useful knowledge discovery in its entire range, from database acquisition to data preprocessing to finding patterns to drawing conclusions. In particular, the concept of an interesting pattern as something that is unusual, far from normal, or anomalous has been introduced into data mining [29]. Obviously, an anomalous cluster is one that is further away from the grand mean or any other point of reference, an approach which is adopted in this text. A number of computer programs for carrying out data mining tasks, clustering included, have been successfully exploited, both in science and industry; a review of them can be found in [23]. There are a number of general purpose statistical packages which have made it through from earlier times: those with some cluster analysis applications, such as SAS [119] and SPSS [42], or those entirely devoted to clustering, such as CLUSTAN [140]. There are data mining tools which include clustering, such as Clementine [14]. Still, these programs are far from sufficient in advising a user on what method to select, how to pre-process data and, especially, what sense to make of the clusters. Another feature of this more recent period is that a number of application
areas have emerged in which clustering is a key issue. In many application areas that began much earlier, such as image analysis, machine vision or robot planning, clustering is a rather small part of a very complex task, so that the quality of clustering does not much matter to the overall performance: any reasonable heuristic would do. These areas do not require the discipline of clustering to develop and mature theoretically. This is not so in bioinformatics, the discipline which tries to make sense of the interrelation between structure, function and evolution of biomolecular objects. Its primary entities, DNA and protein sequences, are complex enough to have their similarity modeled as homology, that is, inheritance from a common ancestor. More advanced structural data such as protein folds and their contact maps are being constantly added to existing depositories. Gene expression technologies add to this an invaluable next step: a wealth of data on biomolecular function. Clustering is one of the major tools in the analysis of bioinformatics data. The very nature of the problem here makes researchers see clustering as a tool not only for finding cohesive groupings in data but also for relating the aspects of structure, function and evolution to each other. In this way, clustering is more and more becoming part of an emerging area of computer classification. It models the major functions of classification in the sciences: the structuring of a phenomenon and associating its different aspects. (Though, in data mining, the term "classification" is almost exclusively used in its partial meaning as merely a diagnostic tool.) Theoretical and practical research in clustering is thriving in this area.

Another area of booming clustering research is information retrieval and text document mining. With the growth of the Internet and the World Wide Web, text has become one of the most important mediums of mass communication.
The terabytes of text that exist must be summarized effectively, which involves a great deal of clustering in such key stages as natural language processing, feature extraction, categorization, annotation and summarization. In the author's view, clustering will become even more important as the systems for acquiring and understanding knowledge from texts evolve, which is likely to occur soon. There are already web sites that present web search results clustered according to automatically found key phrases (see, for instance, [134]).

This book is mostly devoted to explaining and extending two clustering techniques, K-Means for partitioning and Ward for hierarchical clustering. The choice is far from random. First, they represent the most popular clustering formats, hierarchies and partitions, and can be extended to other interesting formats such as single clusters. Second, many other clustering and statistical techniques, such as conceptual clustering, self-organizing maps (SOM), and contingency association measures, appear to be closely related to these. Third, both methods involve the same criterion, the minimum within-cluster variance, which can be treated within the same theoretical framework. Fourth, many data
mining issues of current interest, such as analysis of mixed data, incomplete clustering, and conceptual description of clusters, can be treated with extended versions of these methods. In fact, the book contents go far beyond these methods: the last two chapters, accounting for one third of the material, are devoted to the "big issues" in clustering and data mining that are not limited to specific methods.

The present account of the methods is based on a specific approach to cluster analysis, which can be referred to as data recovery clustering. In this approach, clusters are not only found in data but they also feed back into the data: a cluster structure is used to generate data in the format of the data table which has been analyzed with clustering. The data generated by a cluster structure are, in a sense, "ideal" as they reproduce only the cluster structure lying behind their generation. The observed data can then be considered a noisy version of the ideal cluster-generated data; the extent of noise can be measured by the difference between the ideal and observed data. The smaller the difference, the better the fit. This idea is not particularly new; it is, in fact, the backbone of many quantitative methods of multivariate statistics, such as regression and factor analysis. Moreover, it has been applied in clustering from the very beginning; in particular, Ward [135] developed his method of agglomerative clustering implicitly with this view of data analysis. Some methods were consciously constructed along the data recovery approach: see, for instance, the work of Hartigan [46], in which the single linkage method was developed to approximate the data with an ultrametric matrix, an ideal data type corresponding to a cluster hierarchy. Even more appealing in this capacity is a later work by Hartigan [47]. However, this approach has never been applied in full.
The very idea, following from models presented in this book, that classical clustering is but a constrained analogue of the principal component model has not achieved any popularity so far, though it has been around for quite a while [89], [90]. The unifying capability of data recovery clustering is grounded in convenient relations which exist between data approximation problems and geometrically explicit classical clustering. Firm mathematical relations found between different parts of cluster solutions and data lead not only to an explanation of the classical algorithms but also to the development of a number of other algorithms for both finding and describing clusters. Among the former, principal-component-like algorithms for finding anomalous clusters and divisive clustering should be pointed out. Among the latter, a set of simple but efficient interpretation tools, which are absent from the multiple programs implementing classical clustering methods, should be mentioned.
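The data recovery idea described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the cluster structure is a partition and that the "ideal" data replace every entity by the centroid of its cluster, which is the setting of the within-cluster variance criterion. The function names and the toy data are hypothetical and are not taken from the book.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def recovery_residual(data, labels):
    """Relative squared difference between the observed data and the
    'ideal' data generated by the partition given in `labels`."""
    # Shift the origin to the grand mean, the usual point of reference.
    grand_mean = centroid(data)
    centered = [[x - m for x, m in zip(row, grand_mean)] for row in data]

    # Ideal data: every entity is replaced by the centroid of its cluster.
    centroids = {}
    for k in set(labels):
        members = [row for row, lab in zip(centered, labels) if lab == k]
        centroids[k] = centroid(members)

    # Noise: squared difference between observed and ideal data.
    residual = sum((x - c) ** 2
                   for row, lab in zip(centered, labels)
                   for x, c in zip(row, centroids[lab]))
    # Data scatter: squared distance of all entities from the grand mean.
    scatter = sum(x ** 2 for row in centered for x in row)
    return residual / scatter


# Made-up toy data: two visibly separated pairs of points.
toy = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.2]]
good = recovery_residual(toy, [0, 0, 1, 1])  # partition along the gap
bad = recovery_residual(toy, [0, 1, 0, 1])   # partition across the gap
```

With these toy values the partition that follows the visible gap leaves only a small share of the data scatter unexplained, whereas the partition that cuts across the gap leaves almost all of it unexplained: the smaller the residual, the better the fit.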
What Is Clustering
After reading this chapter the reader will have a general understanding of:
1. What clustering is and its basic elements.
2. Clustering goals.
3. Quantitative and categorical features.
4. Main cluster structures: partition, hierarchy, and single cluster.
5. Different perspectives on clustering coming from statistics, machine learning, data mining, and knowledge discovery.
A set of small but real-world clustering problems will be presented.
Base words
Association: Finding interrelations between different aspects of a phenomenon by matching cluster descriptions in the feature spaces corresponding to the aspects.

Classification: An actual or ideal arrangement of entities under consideration in classes to shape and keep knowledge, capture the structure of phenomena, and relate different aspects of a phenomenon in question to each other. This term is also used in a narrow sense referring to any activities in assigning entities to prespecified classes.

Cluster: A set of similar data entities found by a clustering algorithm.

Cluster representative: An element of a cluster chosen to represent its "typical" properties. This is used for cluster description in domains about which knowledge is poor.

Cluster structure: A representation of an entity set I as a set of clusters that form either a partition of I or a hierarchy on I or an incomplete clustering of I.

Cluster tendency: A description of a cluster in terms of the average values of relevant features.

Clustering: An activity of finding and/or describing cluster structures in a data set.

Clustering goal: Types of problems of data analysis to which clustering can be applied: associating, structuring, describing, generalizing and visualizing.

Clustering criterion: A formal definition or scoring function that can be used in computational algorithms for clustering.

Conceptual description: A logical statement characterizing a cluster or cluster structure in terms of relevant features.

Data: A set of entities characterized by values of quantitative or categorical features. Sometimes data may characterize relations between entities, such as similarity coefficients or transaction flows.

Data mining perspective: In data mining, clustering is a tool for finding patterns and regularities within the data.

Generalization: Making general statements about data and, potentially, about the phenomenon the data relate to.

Knowledge discovery perspective: In knowledge discovery, clustering is a tool for updating, correcting and extending the existing knowledge. In this regard, clustering is but empirical classification.

Machine learning perspective: In machine learning, clustering is a tool for prediction.

Statistics perspective: In statistics, clustering is a method to fit a prespecified probabilistic model of the data generating mechanism.

Structuring: Representing data with a cluster structure.

Visualization: Mapping data onto a known "ground" image, such as the coordinate plane or a genealogy tree, in such a way that properties of the data are reflected in the structure of the ground image.
1.1 Exemplary problems
Clustering is a discipline devoted to revealing and describing homogeneous groups of entities, that is, clusters, in data sets. Why would one need this? Here is a list of potentially overlapping objectives for clustering.
1. Structuring, that is, representing data as a set of groups of similar objects.
2. Description of clusters in terms of features, not necessarily involved in finding the clusters.
3. Association, that is, finding interrelations between different aspects of a phenomenon by matching cluster descriptions in spaces corresponding to the aspects.
4. Generalization, that is, making general statements about data and, potentially, the phenomena the data relate to.
5. Visualization, that is, representing cluster structures as visual images.
These categories are not mutually exclusive, nor do they cover the entire range of clustering goals, but rather reflect the author's opinion on the main applications of clustering. In the remainder of this section we provide real-world examples of data and the related clustering problems for each of these goals. For illustrative purposes, small data sets are used in order to provide the reader with the opportunity of directly observing further processing with the naked eye.
1.1.1 Structuring
Structuring, that is, finding the principal groups of entities in their specifics, is the main goal of many clustering applications. The cluster structure of an entity set can be looked at through different glasses. One user may wish to aggregate the set in a system of nonoverlapping classes; another user may prefer to develop a taxonomy as a hierarchy of more and more abstract concepts; yet another user may wish to focus on a cluster of "core" entities, considering the rest as merely a nuisance. These are conceptualized in different types of cluster structures, such as a partition, a hierarchy, or a single subset.
Market towns
Table 1.1 presents a small portion of a list of about thirteen hundred English market towns, each characterized by its population and the services listed in the following box.

Market town features:
P     Population resident in 1991 Census
PS    Primary Schools
Do    Doctor Surgeries
Ho    Hospitals
Ba    Banks and Building Societies
SM    National Chain Supermarkets
Pe    Petrol Stations
DIY   Do-It-Yourself Shops
SP    Public Swimming Pools
PO    Post Offices
CA    Citizen's Advice Bureaux (cheap legal advice)
FM    Farmers' Markets

For the purposes of social monitoring, the set of all market towns should be partitioned into similarity clusters in such a way that a representative from each of the clusters may be utilized as a unit of observation. Those characteristics of the clusters that separate them from the others should be used to properly select representative towns. As further computations will show, the numbers of services on average follow the town sizes, so that the found clusters can be described mainly in terms of the population size. This set, as well as the complete set of almost thirteen hundred English market towns, consists of seven clusters that can be described as belonging to four tiers of population: large towns of about 17-20,000 inhabitants, two clusters of medium-sized towns (8-10,000 inhabitants), three clusters of small towns (about 5,000 inhabitants) and a cluster of very small settlements with about 2,500 inhabitants. The difference between clusters in the same population tier is caused by the presence or absence of some service features. For instance, each of the three small town clusters is characterized by the presence of a facility which is absent from the other two: a Farmers' market, a Hospital and a Swimming pool, respectively. The number of clusters is determined in the process of computations (see sections 3.3, 3.4.2). This data set is analyzed on pp. 52, 56, 68, 92, 94, 97, 99, 100, 101, 108.
Primates and Human origin
In Table 1.2, the data on genetic distances between Human and three genera of great apes are presented; the Rhesus monkey is added as a distant relative to certify the starting divergence event. It is well established that humans diverged from a common ancestor with chimpanzees approximately 5 million years ago, after a divergence from other great apes. Let us see how compatible the results of cluster analysis are with this conclusion.
Town: Ashburton, Bere Alston, Bodmin, Brixham, Buckfastleigh, Bugle/Stenalees, Callington, Dartmouth, Falmouth, Gunnislake, Hayle, Helston, Horrabridge/Yel, Ipplepen, Ivybridge, Kingsbridge, Kingskerswell, Launceston, Liskeard, Looe, Lostwithiel, Mevagissey, Mullion, Nanpean/Foxhole, Newquay, Newton Abbot, Padstow, Penryn, Penzance, Perranporth, Porthleven, Saltash, South Brent, St Agnes, St Austell, St Blazey/Par, St Columb Major, St Columb Road, St Ives, St Just, Tavistock, Torpoint, Totnes, Truro, Wadebridge
Table 1.1: Market towns: Market towns in the West Country, England.
Figure 1.1: A tree representing pair-wise distances between the primate species from Table 1.2.

The data form a square matrix of the dissimilarity values between the species from Table 1.2, as cited in [90], p. 30. (Only sub-diagonal distances are shown since the table is symmetric.) An example of analysis of the structure of this matrix is given on p. 192. The query is: what species belong to the same cluster as Humans? This obviously can be treated as a single cluster problem: one needs only one cluster to address the issue. The structure of the data is so simple that the cluster of chimpanzee, gorilla and human can be separated without any theory: distances within this subset are similar, all about the average 1.51, and far smaller than the other distances. In biology, this problem is traditionally addressed through evolutionary trees, which are analogous to genealogy trees except that species play the role of relatives. An evolutionary tree built from the data in Table 1.2 is shown in Figure 1.1. The closest relationship between human and chimpanzee is obvious, with gorilla branching off next. The subject of human evolution is treated in depth with data mining methods in [13].
Gene presence-absence profiles
Evolutionary analysis is an important tool not only for understanding evolution but also for analysis of gene functions in humans and other organisms, including medically and industrially important ones. The major assumption underlying the analysis is that all species are descendants of the same ancestor species, so that subsequent evolution can be depicted in terms of divergence only, as in the evolutionary tree in Figure 1.1. The terminal nodes, so-called leaves, correspond to the species under consideration, and the root denotes the common ancestor. The other interior nodes represent other ancestral species, each being the last common ancestor to the set of organisms in the leaves of the sub-tree rooted in the given node. Recently, this line of research has been supplemented by data on the gene content of multiple species as exemplified in Table 1.3. Here, the columns correspond to 18 simple, unicellular organisms, bacteria and archaea (collectively called
Table 1.4: Species: List of eighteen species (one eukaryota, then six archaea and then eleven bacteria) represented in Table 1.3.
Species                      Code
Saccharomyces cerevisiae     y
Archaeoglobus fulgidus       a
Halobacterium sp.NRC-1       o
Methanococcus jannaschii     m
Pyrococcus horikoshii        k
Thermoplasma acidophilum     p
Aeropyrum pernix             z
Aquifex aeolicus             q
Thermotoga maritima          v
Deinococcus radiodurans      d
Mycobacterium tuberculosis   r
Bacillus subtilis            b
Synechocystis                c
Escherichia coli             e
Pseudomonas aeruginosa       f
Vibrio cholerae              g
Xylella fastidiosa           s
Caulobacter crescentus       j
Code      Name
COG0090   Ribosomal protein L2
COG0091   Ribosomal protein L22
COG2511   Archaeal Glu-tRNAGln
COG0290   Translation initiation factor IF3
COG0215   Cysteinyl-tRNA synthetase
COG2147   Ribosomal protein L19E
COG1746   tRNA nucleotidyltransferase (CCA-adding enzyme)
COG1093   Translation initiation factor eIF2alpha
COG2263   Predicted RNA methylase
COG0847   DNA polymerase III epsilon
COG1599   Replication factor A large subunit
COG3066   DNA mismatch repair protein
COG3293   Predicted transposase
COG3432   Predicted transcriptional regulator
COG3620   Predicted transcriptional regulator with C-terminal CBS domains
COG1709   Predicted transcriptional regulators
COG1405   Transcription initiation factor IIB
COG3064   Membrane protein involved
COG2853   Surface lipoprotein
COG2951   Membrane-bound lytic murein transglycosylase B
COG3114   Heme exporter protein D
COG3073   Negative regulator of sigma E
COG3026   Negative regulator of sigma E
COG3006   Uncharacterized protein involved in chromosome partitioning
COG3115   Cell division protein
COG2414   Aldehyde:ferredoxin oxidoreductase
COG3029   Fumarate reductase subunit C
COG3107   Putative lipoprotein
COG3429   Uncharacterized BCR, stimulates glucose-6-P dehydrogenase activity
COG1950   Predicted membrane protein
Table 1.5: COG names and functions.
prokaryotes), and a simple eukaryote, yeast Saccharomyces cerevisiae. The list of species along with their one-letter codes is given in Table 1.4. The rows in Table 1.3 correspond to individual genes represented by the so-called Clusters of Orthologous Groups (COGs), which are supposed to include genes originating from the same ancestral gene in the common ancestor of the respective species [68]. COG names, which reflect the functions of the respective genes in the cell, are given in Table 1.5. These tables present but a small part of the publicly available COG database, currently including 66 species and 4857 COGs, posted on the web site www.ncbi.nlm.nih.gov/COG. The pattern of presence-absence of a COG in the analyzed species is shown in Table 1.3, with zeros and ones standing for absence and presence, respectively. This way, a COG can be considered a character (attribute) that is either present or absent in a species. Two of the COGs, in the top two rows, are present in each of the 18 genomes, whereas the others cover only some of the species. An evolutionary tree must be consistent with the presence-absence patterns.
Specifically, if a COG is present in two species, then it should be present in their last common ancestor and, thus, in all other descendants of the last common ancestor. This would be in accord with the natural process of inheritance. However, in most cases, the presence-absence pattern of a COG in extant species is far from the "natural" one: many genes are dispersed over several subtrees. According to comparative genomics, this may happen because of multiple loss and horizontal transfer of genes [68]. The hierarchy should be constructed in such a way that the number of inconsistencies is minimized. The so-called principle of Maximum Parsimony (MP) is a straightforward formalization of this idea. Unfortunately, MP does not always lead to appropriate solutions because of intrinsic and computational problems. A number of other approaches have been proposed, including hierarchical cluster analysis (see [105]). Especially appealing in this regard is divisive cluster analysis. It begins by splitting the entire data set into two parts, thus imitating the divergence of the last universal common ancestor (LUCA) into two descendants. The same process then applies to each of the split parts until a stop-criterion is reached to halt the division process. In contrast to other methods for building evolutionary trees, divisive clustering imitates the process of evolutionary divergence. Further approximation of the real evolutionary process can be achieved if the characters on which divergence is based are discarded immediately after the division of the respective cluster [96]. Gene profiles data are analyzed on p. 121 and p. 131. After an evolutionary tree is built, it can be utilized for reconstructing gene histories by mapping events of emergence, inheritance, loss and horizontal transfer of individual COGs on the tree according to the principle of Maximum Parsimony (see p. 126).
These histories of individual genes can be helpful in advancing our understanding of biological functions and drug design.
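The consistency requirement stated above (a character present in two species should be present in every descendant of their last common ancestor) can be checked mechanically. The following is a minimal sketch under stated assumptions: a tree is written as nested tuples of one-letter species codes, the tree shape used in the example is hypothetical rather than a published phylogeny, and the function names are invented for illustration. It merely lists the violating leaves; it is not an implementation of Maximum Parsimony.

```python
def leaves(tree):
    """Set of leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def last_common_ancestor(tree, present):
    """Smallest subtree whose leaves include every species in `present`."""
    if not isinstance(tree, str):
        for child in tree:
            if present <= leaves(child):
                return last_common_ancestor(child, present)
    return tree


def missing_descendants(tree, present):
    """Leaves under the last common ancestor of the species carrying
    the character that nevertheless lack the character."""
    present = set(present)
    if not present:
        return set()
    return leaves(last_common_ancestor(tree, present)) - present


# Hypothetical tree over five of the species codes of Table 1.4.
tree = ((("a", "m"), "k"), ("e", "b"))

consistent = missing_descendants(tree, {"a", "m"})   # one whole subtree
one_gap = missing_descendants(tree, {"a", "k"})      # "m" lacks the gene
dispersed = missing_descendants(tree, {"a", "e"})    # spread over the root
```

An empty result means the pattern agrees with pure inheritance on this tree; a non-empty result lists the species in which a loss (or, alternatively, a horizontal transfer elsewhere) would have to be assumed.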
1.1.2 Description
The problem of description is that of automatically deriving a conceptual description of clusters found by a clustering algorithm or supplied from a different source. The problem of cluster description belongs in cluster analysis because it is part of the interpretation and understanding of clusters. A good conceptual description can be used for better understanding and/or better predicting. The latter is possible because we can check whether an object in question satisfies the description or not: the more the object satisfies the description, the better the chances that it belongs to the cluster described. This is why conceptual description tools, such as decision trees [11, 23], have been conveniently used and developed mostly for the purposes of prediction.
Describing Iris genera
Table 1.6 presents probably the most popular data set in the machine learning research community: 150 Iris specimens, each measured on four morphological variables: sepal length (w1), sepal width (w2), petal length (w3), and petal width (w4), as collected by botanist E. Anderson and published in a founding paper of the celebrated British statistician R. Fisher in 1936 [7]. It is said that there are three species in the table, I Iris setosa (diploid), II Iris versicolor (tetraploid), and III Iris virginica (hexaploid), each represented by 50 consecutive entities in the corresponding column. The classes are defined by the genome (genotype); the features describe the appearance (phenotype). Can the classes be described in terms of the features in Table 1.6? It is well known from previous studies that classes II and III are not well separated in the variable space (for example, specimens 28, 33 and 44 from class II are more similar to specimens 18, 26, and 33 from class III than to specimens of the same species, see Figure 1.10 on p. 25). This leads to the problem of deriving new features from those that have been measured on the spot to provide for better descriptions of the classes. These new features could then be utilized for the clustering of additional specimens. Some non-linear machine learning techniques such as Neural Nets [51] and Support Vector Machines [128] can tackle the problem and produce a decent decision rule involving non-linear transformation of the features. Unfortunately, rules that can be derived with currently available methods are not comprehensible to the human mind and, thus, cannot be used for interpretation and description. The human mind needs somewhat less artificial logics that can reproduce and extend such botanists' observations as that the petal area, roughly expressed by the product of w3 and w4, provides for much better resolution than the original linear sizes.
A method for building cluster descriptions of this type, referred to as APPCOD, will be described in section 7.2. The Iris data set is analyzed on pp. 87, 211, 212, 213.
Body mass
Table 1.7 presents data on the height and weight of 22 males, of whom individuals p13-p22 are considered overweight and p1-p12 normal. As Figure 1.2 clearly shows, a line of best fit separating these two sets should run along the elongated cloud formed by entity points. The groups have been defined according to the so-called body mass index, bmi: those individuals whose bmi is 25 or over are considered overweight. The body mass index is defined as the ratio of the weight, in kilograms, to the squared height, in meters. The problem is to make a computer automatically transform the current height-weight feature space into such a format that would allow one to clearly distinguish between the overweight and normally-built individuals.
The best thing would be if a computer could derive the bmi-based decision rule itself, which may not necessarily be the case, since the bmi is defined universally whereas only a very limited data set is presented here. One would obviously have to consider whether a linear description could be derived, such as the following existing rule of thumb: a man is overweight if the difference between his height in cm and weight in kg is smaller than one hundred. A man 175 cm in height should normally weigh 75 kg or less according to this rule. Once again it should be pointed out that non-linear transformations supplied by machine learning tools for better prediction are not necessarily usable for the purposes of description. The Body mass data set is analyzed on pp. 205, 213, 242.
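The two decision rules discussed above can be written down directly. This is a minimal sketch: the function names are hypothetical, and the sample heights and weights are illustrative values, not rows of Table 1.7. The rule of thumb is coded in the form consistent with the 175 cm / 75 kg example, that is, a man counts as overweight when his height in cm minus his weight in kg falls below one hundred.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms over squared height in meters."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def overweight_by_bmi(weight_kg, height_cm, threshold=25.0):
    """Non-linear rule: bmi of 25 or over counts as overweight."""
    return bmi(weight_kg, height_cm) >= threshold


def overweight_by_rule_of_thumb(weight_kg, height_cm):
    """Linear rule: overweight when height (cm) minus weight (kg)
    is below one hundred."""
    return height_cm - weight_kg < 100


# Illustrative (weight in kg, height in cm) pairs, not data from Table 1.7.
for weight, height in [(75, 175), (80, 175), (62, 160)]:
    print(weight, height,
          round(bmi(weight, height), 1),
          overweight_by_bmi(weight, height),
          overweight_by_rule_of_thumb(weight, height))
```

The two rules agree on the first two illustrative individuals and disagree on the third (160 cm, 62 kg), which shows why a linear description can only approximate the bmi-based grouping.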
1.1.3 Association
Revealing associations between different aspects of phenomena is one of the most important goals of classification. Clustering, as a classification of empirical data, can also do the job. A relation between different aspects of a phenomenon in question can be established if the same clusters are well described twice, each description related to one of the aspects. Different descriptions of the same cluster are then obviously linked as those referring to the same contents, though possibly with different errors.

Figure 1.2: Twenty-two individuals on the height-weight plane.
Digits and patterns of confusion between them
Figure 1.3: Styled digits formed by segments of the rectangle.

The rectangle in the upper part of Figure 1.3 is used to draw numeral digits around it in a styled manner of the kind used in digital electronic devices. Seven binary presence/absence variables e1, e2,..., e7 in Table 1.8 correspond to the numbered segments on the rectangle in Figure 1.3. Although the digit character images may seem arbitrary, finding patterns of similarity in them may be of interest in training operators dealing with digital numbers.
Results of a psychological experiment on confusion between the segmented numerals are in Table 1.9. A digit appeared on a screen for a very short time (stimulus), and an individual was asked to report what the digit was (response). The response frequencies of digits versus shown stimuli stand in the rows of Table 1.9 [90]. The problem is to find general patterns in confusion and to interpret them in terms of the segment presence-absence variables in the Digits data of Table 1.8. If the found interpretation can be put in a theoretical framework, the patterns can be considered as empirical reflections of theoretically substantiated classes. Patterns of confusion would show the structure of the phenomenon. Interpretation of the clusters in terms of the drawings, if successful, would allow us to see what relation may exist between the patterns of drawing and confusion.

Figure 1.4: Visual representation of four Digits confusion clusters: solid and dotted lines over the rectangle show distinctive features that must be present in or absent from all entities in the cluster.

Indeed, four major confusion clusters can be distinguished in the Digits data, as will be found in section 4.4.2 and described in section 6.3 (see pp. 73, 129, 133 and 134 for computations on these data). In Figure 1.4 these four clusters are presented with distinctive features shown with segments defining the drawing of digits. We can see that all relevant features are concentrated on the left side and at the bottom of the rectangle. It remains to be seen if there is any physio-psychological mechanism behind this and how it can be utilized. Moreover, it appears that the attributes in Table 1.8 are quite relevant on their own, pinpointing the same patterns that have been identified as those of confusion. This can be clearly seen in Figure 1.5, which illustrates a classification tree for Digits found using an algorithm for conceptual clustering presented in section 4.3. On this tree, clusters are the terminal boxes and interior nodes are labeled by the features involved in classification. The coincidence of the drawing clusters with confusion patterns indicates that the confusion is caused by the segment features participating in the tree. These appear to be the same features in both Figure 1.4 and Figure 1.5.
Literary masterpieces
The data in Table 1.10 reflect the language and style features of eight novels by three great writers of the nineteenth century. The two language features are:
1) LenSent - average length of (number of words in) sentences;
2) LenDial - average length of (number of sentences in) dialogues. (It is assumed that longer dialogues are needed if the author uses dialogue as a device to convey information or ideas to the reader.)
Figure 1.5: The conceptual tree of Digits.

Table 1.10: Masterpieces: Masterpieces of the 19th century: the first three by Charles Dickens (1812-1870), the next three by Mark Twain (1835-1910), and the last two by Leo Tolstoy (1828-1910).

Title                   LenSent  LenDial  NChar  SCon  Narrative
Oliver Twist            19.0     43.7     2      No    Objective
Dombey and Son          29.4     36.0     3      No    Objective
Great Expectations      23.9     38.0     3      No    Personal
Tom Sawyer              18.4     27.9     2      Yes   Objective
Huckleberry Finn        25.7     22.3     3      Yes   Personal
Yankee at King Arthur   12.1     16.9     2      Yes   Personal
War and Peace           23.9     30.2     4      Yes   Direct
Anna Karenina           27.2     58.0     5      Yes   Direct

Features of style:
3) NChar - number of principal characters (the larger the number, the more themes raised);
4) SCon - Yes or No depending on the usage of the stream of consciousness technique;
5) Narrative - the narrative style, a qualitative feature categorized as: (a) Personal (if the narrative comes from the mouth of a character such as Pip in "Great Expectations" by Charles Dickens), or (b) Objective (if the subject develops mainly through the behavior of the characters and other indirect means), or (c) Direct (if the author prefers to directly intervene with comments and explanations).
As we have seen already with the Digits data, features are not necessarily quantitative. They also can be categorical, such as SCon, a binary variable, or Narrative, a nominal variable.
The data in Table 1.10 can be utilized to advance two of the clustering goals:
1. Structurization: To cluster the set of masterpieces and intensionally describe clusters in terms of the features. We expect the clusters to accord with the three authors and convey features of their style.
2. Association: To analyze interrelations between two aspects of prose writing: (a) linguistic (presented by LenSent and LenDial), and (b) the author's narrative style (the other three variables). For instance, we may find clusters in the linguistic features space and conceptually describe them in terms of the narrative style features. The number of entities that do not satisfy the description will score the extent of correlation. We expect, in this particular case, to have a high correlation between these aspects, since both must depend on the same cause (the author) which is absent from the feature list (see page 104).

This data set is used for illustration of many concepts and methods described further on; see pp. 61, 62, 78, 79, 80, 81, 84, 89, 104, 105, 162, 182, 193, 195, 197.
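The idea of scoring a conceptual description by the number of entities that do not satisfy it can be illustrated with a minimal Python sketch over the data of Table 1.10. The three author clusters are taken from the table caption; the descriptions in DESCRIPTIONS are hypothetical rules chosen by eye for this illustration, not results derived in the text.

```python
# Rows of Table 1.10: (title, LenSent, LenDial, NChar, SCon, Narrative, author).
# The author is not a feature of the table; it is taken from the table caption.
NOVELS = [
    ("Oliver Twist", 19.0, 43.7, 2, "No", "Objective", "Dickens"),
    ("Dombey and Son", 29.4, 36.0, 3, "No", "Objective", "Dickens"),
    ("Great Expectations", 23.9, 38.0, 3, "No", "Personal", "Dickens"),
    ("Tom Sawyer", 18.4, 27.9, 2, "Yes", "Objective", "Twain"),
    ("Huckleberry Finn", 25.7, 22.3, 3, "Yes", "Personal", "Twain"),
    ("Yankee at King Arthur", 12.1, 16.9, 2, "Yes", "Personal", "Twain"),
    ("War and Peace", 23.9, 30.2, 4, "Yes", "Direct", "Tolstoy"),
    ("Anna Karenina", 27.2, 58.0, 5, "Yes", "Direct", "Tolstoy"),
]

# Hypothetical conceptual descriptions in terms of the style features
# SCon (index 4) and Narrative (index 5); an assumption of this sketch.
DESCRIPTIONS = {
    "Dickens": lambda r: r[4] == "No",
    "Twain": lambda r: r[4] == "Yes" and r[5] != "Direct",
    "Tolstoy": lambda r: r[5] == "Direct",
}


def count_errors(rows, descriptions):
    """Count (entity, cluster) pairs where the description and the
    cluster membership disagree (false positives and false negatives)."""
    errors = 0
    for row in rows:
        author = row[6]
        for name, rule in descriptions.items():
            if rule(row) != (name == author):
                errors += 1
    return errors


print(count_errors(NOVELS, DESCRIPTIONS))
```

With these particular rules every novel satisfies the description of its own author's cluster and no other, so the error count is zero; a less fortunate description would produce a positive count.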
1.1.4 Generalization
Generalization, or overview, of data is a (set of) statement(s) about properties of the phenomenon reflected in the data under consideration. To make a generalization with clustering, one may need to do a multistage analysis: at first, structure the entity set; second, describe clusters; third, find associations between different aspects. Probably one of the most exciting applications of this type can be found in the newly emerging area of text mining [139]. With the abundance of text information flooding every Internet user, the discipline of text mining is flourishing. A traditional paradigm in text mining is underpinned by the concept of the key word. The key word is a string of symbols (typically corresponding to a language word or phrase) that is considered important for the analysis of a pre-specified collection of texts. Thus, first comes a collection of texts defined by a meaningful query such as "recent mergers among insurance companies" or "medieval Britain." (Keywords can be produced by human experts in the domain or from statistical analyses of the collection.) Then a virtual or real text-to-keyword table can be created with keywords treated as features. Each of the texts (entities) can be represented by the number of occurrences of each of the keywords. Clustering of such a table may lead to finding subsets of texts covering different aspects of the subject. This approach is being pursued by a number of research and industrial groups, some of which have built clustering engines on top of Internet search engines: given a query, such a clustering engine singles out several dozen of the most relevant web pages, resulting from a search by a search engine such as
Table 1.11 (fragment): categories of features VI-VIII, X and XI.
VI. Bribe level: 1. $10K or less; 2. Up to $100K; 3. $100K
VII. Type: 1. Infringement; 2. Extortion
VIII. Network: 1. None; 2. Within office; 3. Between offices; 4. Clients
X. Branch: 1. Government; 2. Law enforcement; 3. Other
XI. Punishment: 1. None; 2. Administrative; 3. Arrest; 4. Arrest followed by release; 5. Arrest with imprisonment
Google or Yahoo, finds keywords or phrases in the corresponding texts, clusters web pages according to the keywords used as features, and then describes clusters in terms of the most relevant keywords or phrases. Two top web sites which have been found from searching for "clustering engines" with Google on 29 June 2004 in London are Vivisimo at http://vivisimo.com and iBoogie at http://iboogie.tv. The former is built on top of ten popular search engines and can be used for partitioning web pages from several different sources such as "Web" or "Top stories"; the latter maintains several dozen languages and presents a hierarchical classification of selected web pages. In response to the query "clustering" Vivisimo produced 232 web pages in a "Web" category and 117 in a "Top news" category. Among top news the most populated clusters were "Linux" (16 items), "Stars" (12), and "Bombs" (11). Among general web sites the most numerous were "Linux" (25), "Search, Engine" (21), "Computing" (22), etc. More or less random web sites devoted to individual papers or scientists or scientific centers or commercial companies have been listed under categories "Visualization" (12), "Methods" (7), "Clustering" (7), etc. Such categories as "White papers" contained pages devoted to both computing clusters and cluster analysis. Similar results, though somewhat more favourable towards clustering as data mining, have been produced with iBoogie. Its cluster "Cluster" (51) was further divided into categories such as "computer" (10) and "analysis" (5). Such categories as "software for clustering" and "data clustering" have been presented too, referring to a random mix of 24 and 20 web sites respectively.

The activity of generalization so far mainly relies on human experts who supply understanding of a substantive area behind the text corpus. Human experts develop a text-to-feature data table that can be further utilized for generalization. Such is a collection of 55 articles on Bribery cases from central Russian newspapers 1999-2000 presented in Table 1.12 according to [97]. The features reflect the following fivefold structure of bribery situations: two interacting sides - the office and the client, their interaction, the corrupt service rendered, and the environment in which it all occurs. These structural aspects can be characterized by eleven features that can be recovered from the newspaper articles; they are presented in Table 1.11. To show how these features can be applied to a newspaper article, let us quote an article that appeared in a newspaper called "Kommersant" on 20 March 1999 (translated from Russian):

Thursday this week, Mr. Evgeny Parshukov, Mayor of town Belovo near Kemerovo, was arrested under a warrant issued by the region attorney, Mr. Valentin Simuchenkov. The mayor is accused of receiving a bribe and abusing his powers for wrongdoing. Before having been elected to the mayoral post in June 1997, he received a credit of 62,000 roubles from Belovo Division of KUZBASS Transport Bank to support his election campaign. The Bank then cleared up both the money and interest on it, allegedly because after his election Mr. Parshukov ordered the Finance Department of the town administration, as well as all municipal organisations in Belovo, to move their accounts into the Transport Bank. Also, the attorney office claims that in 1998 Mr. Parshukov misspent 700,000 roubles from the town budget. The money came from the Ministry of Energy specifically aimed at creating new jobs for mine workers made redundant because their mines were getting closed. However, Mr.
Parshukov ordered to lend the money at a high interest rate to the Municipal Transport agency. Mr. Parshukov doesn't deny the facts. He claims however that his actions involve no crime.

A possible coding of the eleven features in this case constitutes the contents of row 29 in Table 1.12. The table presents 55 cases that could be more or less unambiguously coded (from the original 66 cases [98]). The prime problem here is similar to those in the Market towns and Digits data: to see if there are any patterns at all. To generalize, one has to make sense of patterns in terms of the features. In other words, we are interested in getting a synoptic description of the data in terms of clusters which are to be found and described. At first glance, no structure exists in the data. Nor could the scientists
specializing in the research of corruption see any. However, after applying an intelligent version of the algorithm K-Means as described later in example 3.20, section 3.3, a rather simple core structure could be found that is defined by just two features and determines all other aspects. The results provide for a really short generalization: "It is the branch of government that determines which of the five types of corrupt services are rendered: Local government → Favors or Extortion; Law enforcement → Obstruction of Justice or Cover-Up; and Other → Category Change." A detailed discussion is given in examples on pp. 95, 106 and 147.
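The text-to-keyword table described earlier in this section, in which each text is represented by the number of occurrences of each keyword, can be sketched in a few lines of Python. The function name, the toy texts and the keyword list below are all made up for illustration, and only single-word keywords are handled; real systems would also need phrase matching and stemming.

```python
from collections import Counter
import re


def text_to_keyword_table(texts, keywords):
    """Return one row per text holding the occurrence count of each keyword."""
    table = []
    for text in texts:
        # Lower-case the text and split it into alphabetic tokens.
        words = Counter(re.findall(r"[a-z]+", text.lower()))
        table.append([words[k] for k in keywords])
    return table


# Hypothetical toy collection and keywords, invented for this sketch.
texts = [
    "Insurance merger announced; the merger creates a large insurance group.",
    "Medieval Britain: castles and kings of medieval times.",
]
keywords = ["merger", "insurance", "medieval"]
table = text_to_keyword_table(texts, keywords)
print(table)
```

Each row of the resulting table is an entity and each column a feature, so the table can be passed to any clustering method that accepts an entity-to-feature format.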
1.1.5 Visualization of data structure
Visualization is considered a rather vague but rapidly developing area involving psychology, cognitive sciences and other disciplines. In the current thinking, the subject of data visualization is defined as creation of mental images to gain insight and understanding [125]. This, however, seems too wide and includes too many non-operational images such as realistic and surrealistic paintings. In our presentation, we take on a more operational view and consider that data visualization is an activity related to mapping data onto a known ground image such as a coordinate plane, geography map, or a genealogy tree in such a way that properties of the data are reflected in the structure of the ground image. Among ground images, the following are the most popular: geographical maps, networks, 2D displays of one-dimensional objects such as graphs or pie-charts or histograms, 2D displays of two-dimensional objects, and block structures. Sometimes, the very nature of the data suggests what ground image should be used. All of these can be used with clustering, and we are going to review most of them except for geographical maps.
One-dimensional data
One-dimensional data over pre-specified groups or found clusters can be of two types: (a) the distribution of entities over groups and (b) values of a feature within clusters. Accordingly, there can be two types of visual support for these. Consider, for instance, groups of the Market town data defined by the population. According to Table 1.1 the population ranges approximately between 2000 and 24000 inhabitants. Let us divide the range into five equal intervals, or bins, which thus have size (24000 - 2000)/5 = 4400 and boundary points 6400, 10800, 15200, and 19600. In Table 1.13 the data of the groups are displayed: their absolute and relative sizes and also the average numbers of Banks and the standard deviations within them. For the definitions of the average and standard deviation see section 2.1.2.
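The bin arithmetic above can be reproduced with a short NumPy sketch. The range limits and the number of bins are the ones quoted in the text; the population values in the array are hypothetical, chosen only to show how entities would be assigned to groups I to V.

```python
import numpy as np

low, high, n_bins = 2000, 24000, 5      # range and bin count quoted in the text
width = (high - low) / n_bins           # common bin size
edges = np.linspace(low, high, n_bins + 1)

# Hypothetical population values, invented to show the assignment step.
population = np.array([2040, 7200, 11500, 18966, 23801])

# np.digitize against the four inner boundaries gives 0..4; add 1 for I..V.
group = np.digitize(population, edges[1:-1]) + 1

print(width, edges, group)
```

The four inner entries of edges are the boundary points 6400, 10800, 15200 and 19600 given in the text.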
Figure 1.6: Histogram (a) and pie-chart (b) presenting the distribution of Population over five equally sized bins in Market data.
Figure 1.6 shows two traditional displays for the distribution: a histogram (part (a) on the left) in which bars are proportional to the group sizes and a pie-chart in which a pie is partitioned into slices proportional to the cluster sizes (part (b) on the right). These two point to different features of the distribution. The histogram positions the categories along the horizontal axis, thus making it possible to see the distribution's shape, which can be quite useful when the categories have been created as interval bins of a quantitative feature, as in this case. The pie-chart points to the fact that the group sizes sum up to the total so that one can see what portions of the pie account for different categories.
One-dimensional data within groups
To visualize a quantitative feature within pre-specified groups, box-plots and stick-plots are utilized. They show within-cluster central values and their dispersion, which can be done in different ways. Figure 1.7 presents a box-plot (a) and stick-plot (b) of feature Bank within the five groups defined above as Population bins. The box-plot in Figure 1.7 (a) represents each group as a box bounded by its 10% and 90% percentile values, which separate the extreme 10% of cases at both the top and the bottom of the feature Bank range. The real within-group ranges are shown by "whiskers" that can be seen above and below the boxes of groups I and II; the other groups have no whiskers because of too few entities in each of them. A line within each box shows the within-group average. The stick-plot in Figure 1.7 (b) represents the within-group averages by "sticks," with their "whiskers" proportional to the standard deviations.

Since the displays are in fact two-dimensional, both features and distributions can be shown on a box-plot simultaneously. Figure 1.8 presents a box-plot of the feature Bank over the five bins with the box widths made proportional to the group sizes. This time, the grand mean is also shown by the horizontal dashed line.

Table 1.13: Population groups: Data of the distribution of population groups and numbers of banks within them.

Group  Size  Frequency, %  Banks  Std Banks
I        25          55.6   1.80       1.62
II       11          24.4   4.82       2.48
III       2           4.4   5.00       1.00
IV        3           6.7  12.00       5.72
V         4           8.9  12.50       1.12
Total    45         100     4.31       4.35

Figure 1.7: Box-plot (a) and stick-plot (b) presenting the feature Bank over the five bins in Market data.

Figure 1.8: Box-plot presenting the feature Bank over five bins in Market data along with bin sizes.
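The summary columns of Table 1.13 are tied together by simple arithmetic: the frequencies are the group sizes relative to the total, and the grand mean of Banks is the size-weighted average of the within-group means. A minimal Python check, using only the Size and Banks columns copied from the table, might look as follows.

```python
# Group sizes and within-group Bank averages, copied from Table 1.13.
sizes = [25, 11, 2, 3, 4]
bank_means = [1.80, 4.82, 5.00, 12.00, 12.50]

total = sum(sizes)

# Relative group sizes in percent, rounded as in the table.
frequencies = [round(100 * n / total, 1) for n in sizes]

# Grand mean as the size-weighted average of the group means.
grand_mean = sum(n * m for n, m in zip(sizes, bank_means)) / total

print(total, frequencies, round(grand_mean, 2))
```

The computed values agree with the Frequency column and with the Total row entry 4.31 for Banks; this grand mean is the value that Figure 1.8 marks with the horizontal dashed line.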
Figure 1.9: Box-plot of three classes of Iris specimens from Table 1.6 over the sepal length w1; the classes are presented by both the percentile boxes and within-cluster range whiskers; the choice of percentiles can be adjusted by the user. (Vertical axis: SepalLen; classes: Setosa, Versicolor, Virginica; legend: quantiles 20% through 80%, mean, range.)
A similar box-plot for the three species in the Iris data is presented in Figure 1.9. This time the percentiles are taken at 20% and 80%.
Two-dimensional display
A traditional two-dimensional display of this type is the so-called scatter-plot, representing all the entity points in a plane generated by two of the variables or linear combinations of the variables such as principal components (for a definition of principal components see section 5.1.3). A scatter-plot in the plane of two variables can be seen in Figure 1.2 for the Body mass data on page 13. A scatter-plot in the space of the first two principal components is presented in Figure 1.10: the Iris specimens are labelled by the class number (1, 2, or 3); centroids are gray circles; the most deviant entities (30 in class 1, 32 in class 2, and 39 in class 3) are shown in boxes. The scatter-plot illustrates that two of the classes are somewhat interwoven.
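The coordinates for a scatter-plot in the plane of the first two principal components can be obtained, for example, from the singular value decomposition of the centered data matrix. The sketch below is one common way of doing this and is not necessarily the computation used for Figure 1.10; the function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def first_two_components(X):
    """Project the rows of X onto the first two principal directions."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Synthetic 30 x 4 data table with unequal feature spreads (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

Z = first_two_components(X)
print(Z.shape)
```

The two columns of Z are the horizontal and vertical coordinates of the entity points; plotting them with class labels gives a display of the kind described above.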
Block-structure
A block-structure is a representation of the data table as organized in larger blocks of a specific pattern, obtained with permutations of rows and/or columns. In principle one can imagine various block patterns [47], of which the most common is a pattern formed by the largest entry values. Figure 1.11 presents an illustrative example (described in [125]). In part A, results of seven treatments (denoted by letters from a to g) applied to each of ten crops denoted by numerals are presented: gray represents a success and blank space a failure. The pattern of gray seems rather chaotic in table A. However, it becomes much more orderly when appropriate rearrangements of rows and columns are performed. Part B of the figure clearly demonstrates a visible block structure in the matrix, which can be interpreted as mapping specific sets of treatments to different sets of crops; this can be exploited, for instance, in specifying adjacent locations for crops. Visualization of block structures by reordering rows and/or columns is popular in the analysis of gene expression data [20] and ecology [53].

Figure 1.10: Scatter-plot of Iris specimens in the plane of the first two principal components.

Figure 1.11: Visibility of the matrix block structure with a rearrangement of rows and columns. (Part A lists crops in the order 1, 2, ..., 9, 0 and treatments in the order a, b, ..., g; part B lists crops in the order 1, 3, 8, 2, 6, 0, 4, 7, 9, 5 and treatments in the order a, d, c, e, g, b, f.)

A somewhat more realistic example is shown in Figure 1.12 (a) representing a matrix of value transferred between nine industries during a year: the (i,j)th entry is gray if the transfer from industry i to industry j is greater than a specified threshold and blank otherwise. Figure 1.12 (b) shows a block structure pattern that becomes visible when the order of industries from 1 to 9 is changed to the order 1-6-9-4-2-5-8-3-7, which is achieved with the reordering of both rows and columns of the matrix. The reordering is made simultaneously on both rows and columns because both represent the same industries, both as sources (rows) and targets (columns) of the value transfer. We can discern four blocks of different patterns (1-6-9, 4, 2-5-8, 3-7) in Figure 1.12 (b). The structure of these blocks is represented as a graph in Figure 1.12 (c).
Figure 1.12: Value transfer matrix presented with only entries greater than a threshold (a); the same matrix, with rows and columns simultaneously reordered, is in (b); in (c), the structure is represented as a graph.
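The simultaneous reordering of rows and columns described for the value transfer matrix can be sketched with NumPy index arrays. The order 1-6-9-4-2-5-8-3-7 is the one quoted in the text; the matrix itself is not reproduced in the text, so a random 0/1 matrix stands in for it here.

```python
import numpy as np

# Hypothetical thresholded transfer matrix: 1 stands for a gray entry, 0 for blank.
rng = np.random.default_rng(1)
A = (rng.random((9, 9)) > 0.6).astype(int)

# Industry order quoted in the text, converted to 0-based indices.
order = np.array([1, 6, 9, 4, 2, 5, 8, 3, 7]) - 1

# The same permutation is applied to rows (sources) and columns (targets).
B = A[np.ix_(order, order)]

print(B)
```

Entry (p, q) of B is the entry of A for the industries standing at positions p and q of the new order, so the reordering changes only the layout, not the content, of the matrix.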
Structure
A simple structure such as a chain or a tree or just a small graph, whose vertices (nodes) correspond to clusters and edges to associations between them, is a frequent tool in data visualization. Two examples are presented in Figure 1.13: a tree structure over clusters reflecting common origin is shown in part (a) and a graph corresponding to the block structure of Figure 1.12 (c) is shown in part (b) to reflect links between clusters of industries in the production process. A similar tree structure is presented in Figure 1.5 on page 16 illustrating a classification tree for Digits. Tree leaves, the terminal boxes, show clusters as entity sets; the features are shown along corresponding branches; the entire structure illustrates the relation between clusters in such a way that any combination of the segments can be immediately identified and placed into a corresponding cluster or not identified at all if it is not shown on the tree.
Visualization using an inherent topology
In many cases the entities come from an image themselves, such as in the cases of analysis of satellite images or topographic objects. For example, consider the Digits data set: all the integer symbols are associated with segments of the generating rectangle in Figure 1.3, page 13. Clusters of such entities can be visualized with the generating image. Figure 1.4 visualizes clusters of digits along with their defining features resulting from analyses conducted later in example 4.43 (page 134) as parts of the generating rectangle. There are four major confusion clusters in the Digits data of Figure 1.3 that are presented with distinctive features shown with segments defining the drawing of digits.

Figure 1.13: Visual representation of relations between clusters: (a) the evolutionary structure of the Primates genera according to distances in Table 1.2; (b) interrelation between clusters of industries according to Figure 1.12 (c).
1.2 Bird's-eye view
This section contains general remarks on clustering and can be skipped on the first reading.
1.2.1 Definition: data and cluster structure
After looking through the series of exemplary problems in the previous section, we can give a more formal definition of clustering than that in the Preface: Clustering is a discipline devoted to revealing and describing cluster structures in data sets. To animate this definition, one needs to specify the four concepts involved: (a) data, (b) cluster structure, (c) revealing a cluster structure, (d) describing a cluster structure.
Data
The concept of data refers to any recorded and stored information such as satellite images or time series of prices of certain stocks or survey questionnaires filled in by respondents. Two types of information are associated with data: the data entries themselves, e.g., recorded prices or answers to questions, and metadata, that is, legends to rows and columns giving meaning to entries. The aspect of developing and maintaining databases of records, taking into account the relations stored in metadata, is very important for data mining [23, 29, 44].

In this text, for the sake of linearity of presentation, we concentrate on a generic data format only, the so-called entity-to-variable table whose entries represent values of pre-specified variables at pre-specified entities. The variables are synonymously called attributes, features, characteristics, characters and parameters. Such words as case, object, observation, instance, record are in use as synonyms to the term, entity, accepted here. The data table format often arises directly from experiments or observations, from surveys, and from industrial or governmental statistics. This also is a conventional form for presenting database records. Other data types such as signals or images can be modelled in this format, too, via a digitized representation. However, a digital representation typically involves much more information than can be kept in a data table format. Especially important is the spatial arrangement of pixels, which is not, typically, maintained in the concept of data table. The contents of a data table are assumed to be invariant under permutations of rows and columns and corresponding metadata.

Another data type traditionally considered in clustering is the similarity or dissimilarity between entities (or features). The concept of similarity is most important in clustering: similar objects are to be put into the same cluster and dissimilar into different clusters. Dozens of (dis)similarity indices have been invented. Some of them nicely fit into theoretical frameworks and will be considered further in the text.

One more data type considered in this text is co-occurrence or flow tables that represent the same substance distributed between different categories, such as the Confusion data in Table 1.9, in which the substance is the scores of individuals. An important property of this type of data is that any part of the data table, corresponding to a subset of rows and/or a subset of columns, can be meaningfully aggregated by summing the part of the total flow within the subset of rows (and/or the subset of columns). The sums represent the total flow to, from, and within the subset(s). Thus, problems of clustering and aggregating are naturally linked here. Until recently, this type of data appeared as a result of data analysis rather than input to it. Currently it has become one of the major data formats. Examples are: distributions of households purchasing various commodities or services across postal districts or other regional units, counts of telephone calls across areas, and counts of visits in various categories of web-sites.
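The aggregation property of flow tables can be made concrete with a small NumPy sketch: summing the entries of each block defined by row and column groups yields a smaller flow table over the groups. The 4 x 4 table and the grouping below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical flow table: entry (i, j) is the flow from category i to category j.
flow = np.array([
    [5, 1, 0, 2],
    [1, 6, 2, 0],
    [0, 2, 7, 1],
    [3, 0, 1, 4],
])

# An assumed grouping of the four categories into two aggregate categories.
groups = [[0, 1], [2, 3]]

# Summing each block gives the flow between (and within) the groups.
aggregate = np.array([[flow[np.ix_(r, c)].sum() for c in groups]
                      for r in groups])

print(aggregate)
```

The aggregate table preserves the total flow, which is what makes such summation meaningful for this data type, in contrast to an entity-to-variable table whose columns may be measured in different units.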
Cluster structure
The concept of cluster typically refers to a set of entities that is cohesive in such a way that entities within are more similar to each other than to the outer entities.
Three major types of cluster structures are: (a) a single cluster considered against the rest or whole of the data, (b) a partition of the entity set into a set of clusters, and (c) a (nested) hierarchy of clusters. Of these three, partition is the most conventional, probably because it is relevant to both science and management, the major forces behind scientific developments. A scientist, as well as a manager, wants unequivocal control over the entire universe under consideration. This is why they may wish to partition the entity set into a set of nonoverlapping clusters.

In some situations there is no need for total clustering. The user may be quite satisfied with getting just a single (or few) cluster(s) and leaving the rest completely unclustered. Examples: (1) a bank manager wants to learn how to discern potential fraudsters from other clients, or (2) a marketing researcher separates a segment of customers prone to purchase a particular product, or (3) a bioinformatician seeks a set of proteins homologous to a query protein sequence. Incomplete clustering is a recently recognized addition to the body of clustering approaches, very suitable not only in the situations above but also as a tool for conventional partitioning via cluster-by-cluster procedures such as those described in section 5.5.

The hierarchy is the oldest and probably least understood of the cluster structures. To see how important it is, it should suffice to recall that the Aristotelian approach to classification encapsulated in library classifications and biological taxonomies is always based on hierarchies. Moreover, hierarchy underlies most advanced data processing tools such as wavelets and quadtrees. It is ironic then that as a cluster structure in its own right, the concept of hierarchy rarely features in clustering, especially when clustering is confined to the cohesive partitioning of geometric points.
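The difference between a partition and a nested hierarchy can be stated as two simple set-theoretic checks. The sketch below represents clusters as Python sets of entity indices; the function names and the example structures are assumptions made for illustration, not notation used in the text.

```python
def is_partition(clusters, entities):
    """True if the clusters are non-empty, pairwise disjoint and cover all entities."""
    seen = set()
    for c in clusters:
        if not c or seen & c:
            return False
        seen |= c
    return seen == set(entities)


def is_nested_hierarchy(clusters):
    """True if any two clusters are either disjoint or one contains the other."""
    return all(not (a & b) or a <= b or b <= a
               for a in clusters for b in clusters)


# Hypothetical structures over six entities 0..5.
entities = range(6)
partition = [{0, 1}, {2, 3, 4}, {5}]
hierarchy = [{0}, {1}, {0, 1}, {2, 3, 4}, {0, 1, 2, 3, 4, 5}]
overlapping = [{0, 1, 2}, {2, 3}]

print(is_partition(partition, entities), is_nested_hierarchy(hierarchy))
```

A single cluster considered against the rest needs no check of this kind: any subset of the entity set qualifies, which is why incomplete clustering leaves the remaining entities unconstrained.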
1.2.2 Criteria for revealing a cluster structure
To reveal a cluster structure in a data table means to find such clusters that allow the individual characteristics of entities to be substituted by aggregate characteristics of clusters. A cluster structure is revealed by a method according to a criterion of how well the data are represented by clusters. Criteria and methods are, to an extent, independent from each other so that the same method such as agglomeration or splitting can be used with different criteria. Criteria usually are formulated in terms of (dis)similarity between entities. This helps in formalizing the major idea that entities within clusters should be similar to each other and those between clusters dissimilar. Dozens of similarity based criteria developed so far can be categorized in three broad classes:
(1) Definition-based, (2) Index-based, and (3) Computation-based.

The first category comprises methods for finding clusters according to an explicit definition of a cluster. An example: A cluster is a subset S of entities such that for all i, j in S the similarity between i and j is greater than the similarities between these and any k outside S. Such a property must hold for all entities with no exceptions, which means that well isolated clusters are rather rare in real world data. However, when the definition of cluster is relaxed to include less isolated clusters, too many may then appear. This is why definition-based methods are not popular in practical clustering.

A criterion in the next category involves an index, that is, a numerical function that scores different cluster structures and, in this way, may guide the process of choosing the best. However, not all indices are suitable for obtaining reasonable clusters. Those derived from certain model-based considerations tend to be computationally hard to optimize. Optimizing methods are thus bound to be local and, therefore, heavily reliant on the initial settings, which involve, in the case of K-Means clustering, pre-specifying the number of clusters and the location of their central points. Accordingly, the found cluster structure may be rather far from the global optimum and, thus, must be validated. Cluster validation may be done according to internal criteria such as that involved in the optimization process, or external criteria comparing the clusters found with those known from external considerations, or according to its stability with respect to randomly resampling entities/features. These will be outlined in section 7.5 and exemplified in section 3.3.2.

The third category comprises computation methods involving various heuristics for individual entities to be added to or removed from clusters, for merging or splitting clusters, and so on.
Since operations of this type are necessarily local, they resemble local search optimization algorithms, though, typically, they have no unique guiding scoring index to follow and thus can include various tricks making them flexible. However, such flexibility is associated with an increase in the number of ad hoc parameters such as various similarity thresholds, which turns clustering from a reproducible activity into a kind of magic. Validation of a cluster structure found with a heuristic-based algorithm becomes a necessity.

In this book, we adhere to an index-based principle, which scores a cluster structure against the data from which it has been built. The cluster structure here is used as a device for reconstructing the original data table: the closer the reconstructed data are to the original ones, the better the structure. It is this principle that is called the data recovery approach in this book. Many index-based and computation-based clustering methods can be reinterpreted according to the principle, which allows us to see interrelations between different methods and concepts for revealing and analyzing clustering structures. New methods can be derived from the principle too (see Chapter 5, especially sections 5.4-5.6). It should be noted, though, that we will use only the most straightforward rules for reconstructing the data from cluster structures.
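The data recovery principle can be sketched for the case of a partition. The reconstruction rule assumed here, replacing every row of the data table by the centroid of its cluster and scoring the result by the sum of squared differences, is one straightforward choice made for this illustration; the function name and the toy data are hypothetical.

```python
import numpy as np


def recovery_error(X, labels):
    """Sum of squared differences between X and its reconstruction, in which
    every row is replaced by the centroid of the cluster it belongs to."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    recon = np.empty_like(X)
    for c in np.unique(labels):
        recon[labels == c] = X[labels == c].mean(axis=0)
    return float(((X - recon) ** 2).sum())


# Hypothetical one-feature data table with two obvious groups.
X = [[0.0], [1.0], [10.0], [11.0]]

good = recovery_error(X, [0, 0, 1, 1])   # groups match the data
bad = recovery_error(X, [0, 1, 0, 1])    # groups cut across the data

print(good, bad)
```

The partition that follows the structure of the data reconstructs the table far more closely than the one that cuts across it, which is the sense in which the closer reconstruction indicates the better structure.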
1.2.3 Three types of cluster description
Cluster descriptions help in understanding, explaining and predicting clusters. These may come in different formats of which the most popular are the following three: (a) Representative, (b) Tendency, (c) Conceptual description.

A representative, or a prototype, is an object such as a literary character or a sort of wine or mineral, representing the most typical features of a cluster. This format is useful in giving a meaning to entities that are easily available empirically but difficult to conceptually describe. There is evidence that some aggregate language constructs, such as "fruit," are mentally maintained via prototypes, such as "apple" [74]. In clustering, the representative is usually the most central entity in a cluster.

A tendency expresses a cluster's most likely features such as its way of behavior or pattern. It is usually related to the center of gravity of the cluster and its differences from the average. In this respect, the tendency models the concept of type in classification studies.

A conceptual description may come in the form of a classification tree built for predicting a class or partition. Another form of conceptual description is an association, or production, rule, stating that if an object belongs to a cluster then it must have such and such features. Or, vice versa, if an object satisfies the premise, then it belongs in the cluster. The simplest conceptual description of a cluster is a statement of the form "the cluster is characterized by the feature A being between values a1 and a2." The existence of a feature A which alone is sufficient to distinctively describe a cluster is a rare occurrence of luck in data mining. Typically, features in data are rather superficial and do not express essential properties of entities and thus cannot be the basis of straightforward descriptions.

The subject of cluster description overlaps that of supervised machine learning and pattern recognition. Indeed, given a cluster, having its description may allow one to predict, for new objects, whether they belong to the cluster or not, depending on how much they satisfy the description. On the other hand, a decision rule obtained with a machine learning procedure, especially, for example, a classification tree, can be considered a cluster description usable for interpretation purposes. Still the goals are different: interpretation in clustering and prediction in machine learning. However, cluster description is as important in clustering as cluster finding.
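The three description formats can be illustrated on a toy data table. In the sketch below the representative is taken as the cluster member nearest to the cluster centroid, the tendency as the difference between the centroid and the grand mean, and the conceptual description as an interval of a single feature; the data, the cluster and these specific operational choices are assumptions of the illustration.

```python
import numpy as np

# Hypothetical table: six entities, two features; the first three form a cluster.
X = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0],
              [8.0, 1.0], [9.0, 2.0], [7.0, 3.0]])
cluster = np.array([0, 1, 2])

centroid = X[cluster].mean(axis=0)

# (a) Representative: the cluster member nearest to the centroid.
dist = ((X[cluster] - centroid) ** 2).sum(axis=1)
representative = int(cluster[dist.argmin()])

# (b) Tendency: the centroid expressed as a difference from the grand mean.
tendency = centroid - X.mean(axis=0)

# (c) Conceptual description: "feature 0 lies between a1 and a2".
a1, a2 = X[cluster, 0].min(), X[cluster, 0].max()
satisfies = (X[:, 0] >= a1) & (X[:, 0] <= a2)

print(representative, tendency, (a1, a2), satisfies)
```

Here the interval description happens to single out exactly the cluster members, the lucky case mentioned above; applied to a new object, the same interval test acts as a prediction rule.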
1.2.4 Stages of a clustering application
Typically, clustering as a data mining activity involves the following five stages:
A. Developing a data set.
B. Data pre-processing and standardizing.
C. Finding clusters in data.
D. Interpretation of clusters.
E. Drawing conclusions.
To develop a data set one needs to de ne a substantive problem or issue, however vague it may be, and then determine what data set related to the issue can be collected from an existing database or set of experiments or survey, etc. Data pre-processing is the stage of preparing data processing by a clustering algorithm typically, it includes developing a uniform data set, frequently called a ` at' le, from a database, checking for missing and unreliable entries, rescaling and standardizing variables, deriving a uni ed similarity measure, etc. The cluster nding stage involves application of a clustering algorithm and results in a (series of) cluster structure(s) to be presented, along with interpretation aids, to substantive specialists for an expert judgement and interpretation in terms of features, both those utilized for clustering (internal features) and those not utilized (external features). At this stage, the expert may see no relevance in the results and suggest a modi cation of the data by adding/removing features and/or entities. The modi ed data is subject to the same processing procedure. The nal stage is the drawing of conclusions, with respect to the issue in question, from the interpretation of the results. The more focussed are the regularities implied by the ndings, the better the quality of conclusions. There is a commonly held opinion among specialists in data analysis that the discipline of clustering concerns only the proper clustering stage C while the other four are the concern of specialists in the substance of the particular issue for which clustering is performed. Indeed, typically, clustering results can not and are not supposed to solve the entire substantive problem, but rather relate to an aspect of it. On the other hand, clustering algorithms are supposedly most applicable to situations and issues in which the user's knowledge of the domain is more super cial than profound. 
What are the choices regarding data pre-processing, initial settings in clustering and interpretation of results facing the lay user who has only an embryonic knowledge of the domain? More studies and experiments? In most cases, this is not practical advice. Sometimes a more viable strategy would be to better utilize properties of the clustering methods at hand.
At this stage, no model-based recommendations can be made about the initial and final stages, A and E. However, the data recovery approach does allow us to use the same formalisms for tackling not only stage C, but also B and D; see sections 2.4, 4.3 and 6.3 for related prescriptions and discussions.
1.2.5 Clustering and other disciplines
The concepts involved make clustering a multidisciplinary activity on its own, regardless of its many applications. In particular,
1. Data relates to database, data structure, measurement, similarity and dissimilarity, statistics, matrix theory, metric and linear spaces, graphs, data analysis, data mining, etc.
2. Cluster structure relates to discrete mathematics, abstract algebra, cognitive science, graph theory, etc.
3. Revealing cluster structures relates to algorithms, matrix analysis, optimization, computational geometry, etc.
4. Describing clusters relates to machine learning, pattern recognition, mathematical logic, knowledge discovery, etc.
1.2.6 Different perspectives of clustering
Clustering is a discipline on the intersection of different fields and can be viewed from different angles, which may be sometimes confusing because different perspectives may contradict each other. A question such as "How many clusters are out there?", which is legitimate in one perspective, can be meaningless in the other. Similarly, the issue of validation of clusters may have different solutions in different frameworks. The author finds it useful to distinguish between the perspectives supplied by statistics, machine learning, data mining and classification.
Statistics perspective
Statistics tends to view any data table as a sample from a probability distribution whose properties or parameters are to be estimated with the data. In the case of clustering, clusters are supposed to be associated with different probabilistic distributions which are intermixed in the data and should be recovered from it. Within this approach, such questions as "How many clusters are out there?" and "How to preprocess the data?" are well substantiated and can be dealt with according to the assumptions of the underlying model.
In many cases the statistical paradigm suits quite well and should be applied as the one corresponding most closely to what is called the scientific method: make a hypothesis about the phenomenon in question, then look for relevant data and check how well the hypothesis fits them. The trouble with this approach is that in most cases clustering is applied to phenomena of which almost nothing is known, not only of their underlying mechanisms but of the very features measured or to be measured. Then any modelling assumptions about the data generation would necessarily be rather arbitrary, and so too would the conclusions based on them. Moreover, in many cases the set of entities is rather unique and cannot be considered a sample from a larger population, such as the set of European countries or single malt whisky brands.

Sometimes the very concept of a cluster as a probabilistic distribution seems not to fit a clustering goal. Look, for example, at a bell-shaped Gaussian distribution, which is considered a good approximation for such variables as the height or weight of young male individuals of the same ethnicity, so that they form a cluster corresponding to the distribution. However, when confronted with the practical issue of dividing people, for example, according to their fighting capabilities (such as in military conscription or in the sport of boxing), the set cannot be considered a homogeneous cluster anymore and must be further partitioned into more homogeneous strata. Some say that there must be a boundary between "natural" clusters and clusters to be drawn on purpose, so that a bell-shaped distribution corresponds to a natural cluster and a boxing weight category to an artificial one. However, it is not always easy to distinguish which situation is which. There will always be situations when a cluster of potentially weak fighters (or bad customers, or homologous proteins) must be cut out from the rest.
Machine learning perspective
Machine learning tends to view the data as a device for learning how to predict pre-specified or newly created categories. The entities are considered as coming one at a time so that the machine can learn adaptively in a supervised manner. To theorize, the flow of data must be assumed to come from a probabilistic population, an assumption which has much in common with the statistics approach. However, it is prediction rather than model fitting which is the central issue in machine learning. Such a shift in the perspective has led to the development of strategies for predicting categories such as decision trees and support vector machines as well as resampling methods such as the bootstrap and cross-validation for dealing with limited data sets.
Data mining perspective
Data mining is not much interested in reflection on where the data have come from nor how they have been collected. It is assumed that a data set or database has been collected already and, however badly or well it reflects the properties of the phenomenon in question, the major concern is in finding patterns and regularities within the data as they are. Machine learning and statistics methods are welcome here for their capacity to do the job. This view, which started as early as the sixties and seventies in many countries including France, Russia and Japan in such subjects as analysis of questionnaires or of inter-industrial transfers, was becoming more and more visible, but it did not come into prominence until the nineties. By that time, big warehouse databases had become available, which led to the discovery of patterns of transactions with the so-called association search methods. The patterns proved themselves correct when superstores increased profits by accommodating to them.

Data mining is a huge activity on the intersection of databases and data analysis methods. Clustering is a recognized part of it. The data recovery approach which is maintained in this book fits within data mining very well, because it is based only on the data available.

It should be added that the change of the paradigm from modeling of mechanisms of data generation to data mining has drastically changed requirements to methods and programs. According to the statistics approach, the user must know the models and methods he uses; if a method is applied wrongly, the results can be wrong too. Thus, application of statistical methods is limited to a small circle of experts. In data mining, it is the patterns, not the methods, that matter. This shifts the focus of computer programs from statistics to the user's substantive area and makes them user-friendly.
Similarly, the validation objectives seem to diverge here: in statistics and machine learning the stress goes on the consistency of the algorithms, which is not quite so important in data mining, in which it is the consistency of patterns, not algorithms, which matters the most.
Classification/knowledge-discovery perspective
The classification perspective is rarely discussed indeed. In data mining the term "classification" is usually used in a very limited sense: as an activity of assigning prespecified categories (classes) to entities, in contrast to clustering, which assigns entities to newly created categories (clusters). According to its genuine meaning, classification is an actual or ideal arrangement of the entities under consideration in classes to: (1) shape and keep knowledge; (2) capture the structure of phenomena; and (3) relate different aspects of a phenomenon in question to each other. These make the concept of classification a specific mechanism for knowledge discovery and maintenance.

Consider, for instance, the Periodic Chart of chemical elements. Its rows correspond to numbers of electron shells in the atoms, and its columns to the numbers of electrons in the external shell, thus capturing the structure of the phenomenon. These also relate to the most important physical properties and chemical activities of the elements, thus associating different aspects of the phenomenon. And this is a compact form of representing the knowledge; moreover, historically it is this form itself, developed rather empirically, that made possible rather fast progress to the current theories of matter.

In spite of the fact that the notion of classification as part of scientific knowledge was introduced by the ancient Greeks (Aristotle and the like), the very term "classification" seems a missing item in the vocabulary of current scientific discourse. This may have happened because in traditional sciences, classifications are defined within well developed substantive theories according to variables which are defined as such within the theories. Thus, there has been no need for specific theories of classification.

Clustering should be considered as classification based on empirical data in a situation when clear theoretical concepts and definitions are absent and the regularities are unknown. Thus, the clustering goals should relate to the classification goals above. This brings one more aspect to clustering. Consider, for example, how one can judge whether a clustering is good or bad. According to the classification/knowledge-discovery view, this is easy and has nothing to do with statistics: just look at how well clusters fit within the existing knowledge, and how well they allow updating, correcting and extending it.
Somewhat simplistically, one might say that two of the points stressed in this book, that of the data recovery approach and the need to not only find, but describe clusters, fit well into the two perspectives, the former into data mining and the latter into classification as knowledge discovery.
What Is Data
After reading through this chapter, the reader will know of:
1. Three types of data tables: (a) feature-to-entity, (b) similarity/dissimilarity and (c) contingency/flow tables, and ways to standardize them.
2. Quantitative, categorical and mixed data, and ways to pre-process and standardize them.
3. Characteristics of feature spread and centrality.
4. Bi-variate distributions over mixed data, correlation and association, and their characteristics.
5. Visualization of association in contingency tables with Quetelet coefficients.
6. Multidimensional concepts of distance and inner product.
7. The concept of data scatter.
Base words
Average The average value of a feature over a subset of entities. If the fea-
two sets of categories in a contingency table. The greater it is, the closer the association to a conceptual one.

Contingency table Given two sets of categories corresponding to rows and columns, respectively, this table presents counts of entities co-occurring at the intersection of each pair of categories from the two sets. When categories within each of the sets are mutually disjoint, the contingency table can be aggregated by summing up relevant entries.

Correlation The shape of a scatter-plot showing the extent to which two features can be considered mutually related. The (product-moment) correlation coefficient captures the extent to which one of the features can be expressed as a linear function of the other.

Data scatter The sum of squared entries of the data matrix; it is equal to the sum of feature contributions or the summary distance from entities to zero.

Data table Also referred to as flat file (in databases) or vector space data (in information retrieval), this is a two-dimensional array whose rows correspond to entities, columns to features, and entries to feature values at entities.

Distance Given two vectors of the same size, the (Euclidean squared) distance is the sum of squared differences of corresponding components, $d(x, y) = \sum_i (x_i - y_i)^2$. It is closely related to the inner product: $d(x, y) = (x - y, x - y)$.

Entity Also referred to as observation (in statistics) or case (in social sciences) or instance (in artificial intelligence) or object, this is the main item of clustering corresponding to a data table row.

Feature Also referred to as variable (in statistics) or character (in biology) or attribute (in logic), this is another major data item corresponding to a data table column. It is assumed that feature values can be compared to each other, at least, whether they coincide or not (categorical features), or even averaged over any subset of entities (quantitative feature case).

Inner product Given two vectors of the same size, the inner product is the sum of products of corresponding components, $(x, y) = \sum_i x_i y_i$. It is closely related to the distance: $d(x, y) = (x, x) + (y, y) - 2(x, y)$.

Quetelet index In contingency tables: a value showing the change in frequency of a row category when a column category becomes known. The greater the value, the greater the association between the column and row categories. It is a basic concept in contingency table analysis.

Range The interval in which a feature takes its values; the difference between the feature maximum and minimum over a data set.

Scatter plot A graph presenting entities as points on the plane formed by two quantitative features.

Variance The average of squared deviations of feature values from the average.
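The relation between distance, inner product and data scatter stated in the entries above can be checked numerically. The following is a minimal sketch in Python; the function names and the two vectors are hypothetical and serve only as an illustration.

```python
def inner(x, y):
    """Inner product: sum of products of corresponding components."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same size")
    return sum(a * b for a, b in zip(x, y))


def sq_distance(x, y):
    """Euclidean squared distance: sum of squared component differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same size")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def data_scatter(rows):
    """Data scatter: sum of squared entries of a data table given as a list of rows."""
    return sum(inner(row, row) for row in rows)


# Hypothetical vectors, chosen only for illustration.
x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 4.0]

d_direct = sq_distance(x, y)
# d(x, y) = (x, x) + (y, y) - 2(x, y)
d_via_inner = inner(x, x) + inner(y, y) - 2 * inner(x, y)
# d(x, y) = (x - y, x - y)
diff = [a - b for a, b in zip(x, y)]
d_via_diff = inner(diff, diff)
```

All three ways of computing the distance give the same value, and the data scatter of the two-row table is the summary squared distance of its rows from zero.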
2.1 Feature characteristics
The Masterpieces data in Table 1.10 will be used to illustrate data handling concepts in this section. For the reader's convenience, the table is reprinted here as Table 2.1. A data table of this type represents a unity of the set of rows, always denoted as $I$ further on, the set of columns denoted by $V$ and the table contents $X$, the set of values $x_{iv}$ in rows $i \in I$ and columns $v \in V$. The number of rows, or cardinality of $I$, $|I|$, will be denoted by $N$, and the number of columns, the cardinality of $V$, $|V|$, by $M$. Rows will always correspond to entities, columns to features. Whatever metadata of entities are known are all to be included as features, except for names, which may be maintained as a list associated with $I$. As to the features $v \in V$, it is assumed that each has a measurement scale assigned to it, and of course a name. All within-column entries are supposed to have been measured in the same scale and are thus comparable within the scale; this is not so over rows in $Y$. Three different types of scales that are present in Table 2.1 and will be dealt with in the remainder are quantitative (LenSent, LenDial, and NChar), nominal (Narrative) and binary (SCon). Let us elaborate on these scale types:

Table 2.1: Masterpieces: Masterpieces of 19th century: the first three by Charles Dickens (1812–1870), the next three by Mark Twain (1835–1910), and the last two by Leo Tolstoy (1828–1910).

Title                  LenSent  LenDial  NChar  SCon  Narrative
Oliver Twist           19.0     43.7     2      No    Objective
Dombey and Son         29.4     36.0     3      No    Objective
Great Expectations     23.9     38.0     3      No    Personal
Tom Sawyer             18.4     27.9     2      Yes   Objective
Huckleberry Finn       25.7     22.3     3      Yes   Personal
Yankee at King Arthur  12.1     16.9     2      Yes   Personal
War and Peace          23.9     30.2     4      Yes   Direct
Anna Karenina          27.2     58.0     5      Yes   Direct

2.1.1 Feature scale types
1. Quantitative: A feature is quantitative if the operation of taking its average is meaningful. It is quite meaningful to compare the average values of feature LenSent or LenDial for different authors in Table 2.1. Somewhat less convincing is the case of NChar, which must be an integer; some authors even consider such "counting" features a different scale type. Still, we can safely say that on average Tolstoy's novels have larger numbers of principal characters than those by Dickens or Twain. This is why counting features are also considered quantitative in this text.

2. Nominal: A categorical feature is said to be nominal if its categories are (i) disjoint, that is, no entity can fall in more than one of them, and (ii) not ordered, that is, they can only be compared with respect to whether they coincide or not. Narrative, in Table 2.1, is such a feature. Categorical features maintaining (ii) but not (i) are referred to as multichoice variables. For instance, Masterpieces data might include a feature that presents a list of social themes raised in a novel, which may contain more than one element. That would produce a one-to-many mapping of the entities to the categories, that is, social themes. There is no problem in treating this type of data within the framework described here. For instance, the Digit data table may be treated as representing a single multi-choice variable "Segment" which has the set of seven segments as its categories.

Categorical features that maintain (i) but have their categories ordered are called rank variables. Variable Bribe level in the Bribery data of Tables 1.11 and 1.12 is rank: its three categories are obviously ordered according to the bribe size. Traditionally, it is assumed for the rank variables that only the order of categories matters and intervals between them are irrelevant. That is, rank categories may accept any quantitative coding which is compatible with their ordering. This makes rank features difficult to deal with in the context of mixed data tables. We maintain a different view, going back to C. Spearman: the ranks are treated as numerical values and the rank variables are thus considered quantitative and processed accordingly. In particular, seven of the eleven variables in Bribery data (II. Client, IV. Occurrence, V. Initiator, VI. Bribe, VII. Type, VIII. Network, and XI. Punishment) will be considered ranked, with ranks assigned in Table 1.11, and treated as quantitative values.

There are two approaches to the issue of involving qualitative features into analysis. According to one, more traditional, approach, categorical variables are considered non-treatable quantitatively. The only quantitative operation admitted for categories is counting the number or frequency of their occurrences at various subsets. To conduct cluster analysis, categorical data, according to this view, can only be utilized for deriving an entity-to-entity (dis)similarity measure. Then this measure can be used for finding clusters. A different approach is maintained and further developed here: a category defines a quantitative zero-one variable on entities, with one corresponding to its presence and zero to its absence, which is then treated as such. We will see later that this view, in fact, does not contradict the former one but rather fits into it with geometrically and statistically sound specifications.

3. Binary: A qualitative feature is said to be binary if it has two categories which can be thought of as a Yes or No answer to a question, such as feature SCon in Table 2.1. A two-category feature can be considered either a nominal or a binary one, depending on the context. For instance, feature "Gender" of a human should be considered a nominal feature, whereas the question "Are you female?" a binary feature, because the latter assumes that it is the "female," not "male," category which is of interest. Operationally, the difference between these two types will amount to how many binary features should be introduced to represent the feature under consideration in full. Feature "Gender" cannot be represented by one column with Yes or No categories: two are needed, one for "Female" and one for "Male."
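The three scale types can be illustrated on the Masterpieces data of Table 2.1. The sketch below is only one possible encoding, in Python: the row layout, the `AUTHORS` list (derived from the table caption) and all variable names are assumptions made for the illustration.

```python
# Table 2.1 as rows: (title, LenSent, LenDial, NChar, SCon, Narrative).
MASTERPIECES = [
    ("Oliver Twist", 19.0, 43.7, 2, "No", "Objective"),
    ("Dombey and Son", 29.4, 36.0, 3, "No", "Objective"),
    ("Great Expectations", 23.9, 38.0, 3, "No", "Personal"),
    ("Tom Sawyer", 18.4, 27.9, 2, "Yes", "Objective"),
    ("Huckleberry Finn", 25.7, 22.3, 3, "Yes", "Personal"),
    ("Yankee at King Arthur", 12.1, 16.9, 2, "Yes", "Personal"),
    ("War and Peace", 23.9, 30.2, 4, "Yes", "Direct"),
    ("Anna Karenina", 27.2, 58.0, 5, "Yes", "Direct"),
]
# Authorship follows the table caption: three, three and two novels.
AUTHORS = ["Dickens"] * 3 + ["Twain"] * 3 + ["Tolstoy"] * 2


def average(values):
    return sum(values) / len(values)


# Quantitative (counting) feature NChar: within-author averages are meaningful.
nchar_by_author = {}
for author, row in zip(AUTHORS, MASTERPIECES):
    nchar_by_author.setdefault(author, []).append(row[3])
mean_nchar = {author: average(vals) for author, vals in nchar_by_author.items()}

# Binary feature SCon: a single zero-one column, Yes -> 1 and No -> 0.
scon = [1 if row[4] == "Yes" else 0 for row in MASTERPIECES]

# Nominal feature Narrative: one zero-one column per category.
categories = sorted({row[5] for row in MASTERPIECES})
narrative_columns = {
    c: [1 if row[5] == c else 0 for row in MASTERPIECES] for c in categories
}
```

Here `mean_nchar` confirms that Tolstoy's novels have the largest average number of principal characters, and each row has exactly one unity among the three Narrative columns because the categories are disjoint.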
2.1.2 Quantitative case
As mentioned, we consider the meaningfulness of taking the average to be a defining property of a quantitative variable. Given a feature $v \in V$ whose values $y_{iv}$, $i \in I$, constitute a column in the data table, its average over entity subset $S \subseteq I$ is defined by the formula

$$c_v(S) = \frac{1}{N_S} \sum_{i \in S} y_{iv} \tag{2.1}$$

where $N_S$ is the number of entities in $S$. The average $c_v = c_v(I)$ of $v \in V$ over the entire set $I$ is sometimes referred to as the grand mean. After the grand mean $c_v$ of $v \in V$ has been subtracted from all elements of the column-feature $v$, the grand mean of $v$ becomes zero. Such a variable is referred to as centered.

It should be mentioned that usually the quantitative scale is defined somewhat differently, not in terms of the average but in terms of the so-called admissible transformations $y = \phi(x)$. The scale type is claimed to depend on the set of transformations which are considered admissible, that is, do not change the scale contents. For the quantitative feature scales, the admissible transformations are those such as $y = ax + b$, converting all $x$ values into $y$ values by changing the scale factor $a$ times and shifting the scale origin by $b$. Transformations $y = \phi(x)$ of this type, with $\phi(x) = ax + b$ for some real $a$ and $b$, are referred to as affine transformations. For instance, the temperature Celsius scale $x$ is transformed into the temperature Fahrenheit scale with $\phi(x) = 1.8x + 32$. Standardizations of data with affine transformations are at the heart of our approach to clustering.

Our definition is compatible with the one given above. Indeed, if a feature $x$ admits affine transformations, it is meaningful to compare its average values over various entity sets. Let $x_J$ and $x_K$ be the averages of sets $\{x_j : j \in J\}$ and $\{x_k : k \in K\}$ respectively, and, say, $x_J \le x_K$. Does the same automatically hold for the averages of $y = ax + b$ over $J$ and $K$? To answer this question, we consider values $y_j = ax_j + b$, $j \in J$, and $y_k = ax_k + b$, $k \in K$, and calculate their averages, $y_J$ and $y_K$. Because averaging is a linear operation, $y_K = ax_K + b$ and $y_J = ax_J + b$, so that any relation between $x_J$ and $x_K$ remains the same for $y_J$ and $y_K$, up to the obvious reversal when $a$ is negative (which means that rescaling involves a change of the direction of the scale).

Other indices of "centrality" have been considered too; the most popular of them are:
(i) Midrange, the point in the middle of the range, that is, equidistant from the minimum and maximum values of the feature.
(ii) Median, the middle item in the series of elements of column $v$ sorted in ascending (or descending) order.
(iii) Mode, "the most likely" value, which is operationally defined by partitioning the feature range into a number of bins (intervals of the same size) and determining at which of the bins the number of observations is maximum: the center of this bin is the mode, up to the error related to the bin size.

Each of these has its advantages and drawbacks as a centrality measure. The median is the most stable with regard to change in the sample and, especially, to the presence of outliers. Outliers can drastically change the average, whereas they hardly affect the median. However, the calculation of the median requires sorting the entity set, which sometimes may be costly. Midrange is insensitive to the shape of the distribution and is highly sensitive to outliers. The mode is of interest when the distribution of the feature is far from uniform.
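The behaviour of the centrality indices described above can be tried on a small example. The following is a minimal sketch in Python using the standard `statistics` module; the helper names `midrange` and `binned_mode`, the choice of three bins and the sample values are assumptions made for illustration only.

```python
import statistics


def midrange(values):
    """Point equidistant from the minimum and maximum values."""
    return (min(values) + max(values)) / 2


def binned_mode(values, n_bins):
    """Center of the most populated of n_bins equal-width bins over the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # the maximum goes to the last bin
        counts[k] += 1
    best = counts.index(max(counts))  # the first bin wins ties
    return lo + (best + 0.5) * width


def summary(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "midrange": midrange(values),
    }


values = [2.0, 3.0, 3.0, 3.5, 5.0]   # hypothetical sample
with_outlier = values + [100.0]      # the same sample with one outlier added

clean = summary(values)
dirty = summary(with_outlier)
mode_estimate = binned_mode(values, 3)

# Averaging commutes with an affine transformation such as Celsius to Fahrenheit.
celsius = [0.0, 10.0, 25.0]
fahrenheit = [1.8 * c + 32 for c in celsius]
mean_f_direct = statistics.mean(fahrenheit)
mean_f_from_c = 1.8 * statistics.mean(celsius) + 32
```

With these values the single outlier moves the average from about 3.3 to above 19 and the midrange from 3.5 to 51, while the median changes only from 3.0 to 3.25.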
These may give a hint with respect to what measure should be used in a specific situation. For example, if the data to be analyzed have no specific properties at all, the average should be utilized. When outliers or data errors are expected, the median would be a better bet.

The average, median and midrange all fit within the following approximation model, which is at the heart of the data recovery approach. Given a number of reals, $x_1, x_2, \ldots, x_N$, find a unique real $a$ that can be used as their aggregate substitute so that for each $i$, $a$ approximates $x_i$ up to a residual $e_i$: $x_i = a + e_i$, $i \in I$. The smaller the residuals, the better the aggregate. To minimize the residuals $e_i = x_i - a$, they should be combined into a scalar criterion such as $L_1 = \sum_i |x_i - a|$, $L_\infty = \max_i |x_i - a|$, or $L_2 = \sum_i |x_i - a|^2$. It appears that $L_1$ is minimized by the median, $L_\infty$ by the midrange and $L_2$ by the average. The average fits best because it solves the least-squares approximation problem, and the least-squares criterion is the basis of all further developments in Chapter 5.

A number of characteristics have been defined to measure the features' dispersion or spread. Probably the simplest of them is the variable's range, the difference between its maximal and minimal values, which has been mentioned above already. This measure should be used cautiously as it may be overly sensitive to changes in the entity set. For instance, removal of Truro from the set of entities in the Market town data immediately reduces the range of variable Banks to 14 from 19. Further removal of St Blazey/Par further reduces the range to 13. Moreover, the range of variable DIY shrinks to 1 from 3, with these two towns removed. Obviously, no such drastic changes emerge when all thirteen hundred of the English Market towns are present.

Table 2.2: Summary characteristics of the Market town data.

       P        PS    Do   Ho   Ba    Su   Pe   DIY  SP   PO   CAB  FM
Mean   7351.4   3.0   1.4  0.4  4.3   1.9  2.0  0.2  0.5  2.6  0.6  0.2
Std    6193.2   2.7   1.3  0.6  4.4   1.7  1.6  0.6  0.6  2.1  0.6  0.4
Range  21761.0  12.0  4.0  2.0  19.0  7.0  7.0  3.0  2.0  8.0  2.0  1.0

A somewhat more elaborate characteristic of dispersion is the so-called (empirical) variance of $v \in V$, which is defined as

$$s_v^2 = \sum_{i \in I} (y_{iv} - c_v)^2 / N \tag{2.2}$$
where $c_v$ is the grand mean. That is, $s_v^2$ is the average squared deviation $L_2$ of $y_{iv}$ from $c_v$.

The standard deviation of $v \in V$ is defined as just $s_v = \sqrt{s_v^2}$, which also has a statistical meaning as the square-average deviation of the variable's values from its grand mean. The standard deviation is zero, $s_v = 0$, if and only if the variable is constant, that is, all the entries are equal to each other.

In some packages, especially statistical ones, denominator $N - 1$ is used instead of $N$ in definition (2.2) because of probabilistic consistency considerations (see any text on mathematical statistics). This should not much affect results because $N$ is assumed constant here and, moreover, $1/N$ and $1/(N - 1)$ do not differ much when $N$ is large.

For the Market town data, with $N = 45$ and $M = 12$, the summary characteristics are in Table 2.2. The standard deviations in Table 2.2 are at most half the ranges, which is true for all data tables (see Statement 2.2).

The values of the variance $s_v^2$ and standard deviation $s_v$ obviously depend on the variable's spread measured by its range. Multiplying the column $v \in V$ by $\alpha > 0$ obviously multiplies its range and standard deviation by $\alpha$, and the variance by $\alpha^2$.

The quadratic index of spread, $s_v^2$, depends not only on the scale but also on the character of the feature's distribution within its range. Can we see how? Let us consider all quantitative variables defined on $N$ entities $i \in I$ and ranged between 0 and 1 inclusive, and analyze at what distributions the variance attains its maximum and minimum values.

It is not difficult to see that any feature $v$ that minimizes the variance $s_v^2$ is equal to 0 at one of the entities, 1 at another entity, and $y_{iv} = c_v = 1/2$ at all other entities $i \in I$. The minimum variance $s_v^2$ is then $\frac{1}{2N}$.

Among the distributions under consideration, the maximum value of $s_v^2$ is reached at a feature $v$ which is binary, that is, has only the boundary points, 0 and 1, as its values. Indeed, if $v$ has any other value at an entity $i$, then the variance will only increase if we redefine $v$ in such a way that it becomes 0 or 1 at $i$ depending on whether $y_{iv}$ is smaller or greater than $v$'s average $c_v$. For a binary $v$, let us specify the proportion $p$ of values $y_{iv}$ at which the feature is larger than its grand mean, $y_{iv} > c_v$. Then, obviously, the average is $c_v = 0 \cdot (1 - p) + 1 \cdot p = p$ and, thus, $s_v^2 = (0 - p)^2 (1 - p) + (1 - p)^2 p = p(1 - p)$.

The choice of the left and right bounds of the range, 0 and 1, does have an effect on the values attained by the extremal variable but not on the conclusion of its binariness. That means that the following is proven.

Statement 2.1. With the range and proportion $p$ of values smaller than the average prespecified, the distribution at which the variance reaches its maximum is the distribution of a binary feature having $p$ values at the left bound and $1 - p$ values at the right bound of the range.

Among the binary features, the maximum variance is reached at $p = 1/2$, the maximum uncertainty. This implies one more property.

Statement 2.2. For any feature, the standard deviation is at most half its range.

Proof: Indeed, with the range being unity between 0 and 1, the maximum variance is $p(1 - p) = 1/4$ at $p = 1/2$, leading to the maximum standard deviation of just half of the unity range, q.e.d.

From the intuitive point of view, the range being the same, the greater the variance the better the variable suits the task of clustering.
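Two claims of this subsection lend themselves to a quick numerical check: that the median, midrange and average minimize $L_1$, $L_\infty$ and $L_2$ respectively, and that the standard deviation never exceeds half the range. The sketch below, in Python, uses a plain grid search and hypothetical data; it illustrates the statements under these assumptions and is not a proof.

```python
import statistics


def l1(xs, a):
    return sum(abs(x - a) for x in xs)


def l_inf(xs, a):
    return max(abs(x - a) for x in xs)


def l2(xs, a):
    return sum((x - a) ** 2 for x in xs)


def grid_argmin(criterion, xs, steps=1000):
    """Grid point between min(xs) and max(xs) with the smallest criterion value."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda a: criterion(xs, a))


def variance(xs):
    """Empirical variance with denominator N, as in (2.2)."""
    c = sum(xs) / len(xs)
    return sum((x - c) ** 2 for x in xs) / len(xs)


def std(xs):
    return variance(xs) ** 0.5


def value_range(xs):
    return max(xs) - min(xs)


xs = [1.0, 2.0, 4.0, 7.0, 11.0]  # hypothetical values

best = {
    "L1": grid_argmin(l1, xs),       # expected near the median
    "Linf": grid_argmin(l_inf, xs),  # expected near the midrange
    "L2": grid_argmin(l2, xs),       # expected near the average
}

half_range_bound_holds = std(xs) <= value_range(xs) / 2
std_of_balanced_binary = std([0, 0, 1, 1])                      # p = 1/2
min_variance_example = variance([0, 1, 0.5, 0.5, 0.5, 0.5])     # N = 6
```

The balanced binary feature attains the bound of Statement 2.2 exactly, and the last example reproduces the minimum variance $1/(2N)$ for $N = 6$.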
2.1.3 Categorical case
Let us first consider binary features and then nominal ones. To quantitatively recode a binary feature, its Yes category is converted into 1 and No into 0. The grand mean of the obtained zero/one variable will be $p_v$, the proportion of entities falling in the category Yes. Its variance will be $s_v^2 = p_v(1 - p_v)$.

In statistics, two types of probabilistic mechanisms for generating zero/one binary variables are considered, Bernoulli/binomial and Poisson. Each relies on having the proportion of ones, $p$, fixed. However, the binomial distribution assumes that every single entry has the same probability $p$ of being unity, whereas the Poisson distribution does not care about individual entries: just that the proportion $p$ of entries randomly thrown into a column must be unity. This subtle difference makes the variance of the Poisson distribution greater: the variance of the binomial distribution is equal to $s_v^2 = p(1 - p)$ and the variance of the Poisson distribution is equal to $p$. Thus, the variance of a one-zero feature considered as a quantitative feature corresponds to the statistical model of binomial distribution.

Turning to the case of nominal variables, let us denote the set of categories of a nominal variable $l$ by $V_l$. Any category $v \in V_l$ is conventionally characterized by its frequency, the number of entities, $N_v$, falling in it. The sum of frequencies is equal to the total number of entities in $I$, $\sum_{v \in V_l} N_v = N$. The relative frequencies, $p_v = N_v / N$, sum up to unity. The vector $p = (p_v)$, $v \in V_l$, is referred to as the distribution of $l$ (over $I$). A category with the largest frequency is referred to as the distribution's mode. The dispersion of a nominal variable $l$ is frequently measured by the so-called Gini coefficient, or qualitative variance:
$$G = \sum_{v \in V_l} p_v (1 - p_v) = 1 - \sum_{v \in V_l} p_v^2 \tag{2.3}$$
This is zero if all entities fall in one of the categories only. $G$ is maximum when the distribution is uniform, that is, when all category frequencies are the same, $p_v = 1/|V_l|$ for all $v \in V_l$. A similar measure referred to as entropy and defined as

$$H = -\sum_{v \in V_l} p_v \log p_v \tag{2.4}$$

with the logarithm's base 2 is also quite popular. This measure is related to so-called information theory [12]. Entropy reaches its minimum and maximum values at the same distributions as the Gini coefficient. Moreover, the Gini coefficient can be thought of as a linearized version of entropy since $1 - p_v$ linearly approximates $-\log p_v$ at $p_v$ close to 1 (up to the constant factor set by the logarithm's base). In fact, both can be considered averaged information measures, just that one uses $-\log p_v$ and the other $1 - p_v$ to express the information contents.

There exists a general formula to express the diversity of a nominal variable as $S_q = (1 - \sum_{v \in V_l} p_v^q)/(q - 1)$, $q > 0$ [132]. The entropy and Gini index are special cases of $S_q$ since $S_2 = G$ and $S_1 = H$, assuming $S_1$ to be the limit of $S_q$ when $q$ tends to 1.
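Formulas (2.3) and (2.4) and the general index $S_q$ can be computed directly from a list of category labels. The following is a minimal sketch in Python; the function names are hypothetical, and the Narrative column of Table 2.1 is used as sample data. Note that the limit of $S_q$ at $q \to 1$ is the entropy taken with the natural logarithm, which differs from the base-2 value only by a constant factor.

```python
import math
from collections import Counter


def distribution(labels):
    """Relative frequencies p_v of the categories occurring in labels."""
    counts = Counter(labels)
    n = len(labels)
    return {category: count / n for category, count in counts.items()}


def gini(p):
    """Gini coefficient (qualitative variance), formula (2.3)."""
    return 1.0 - sum(pv ** 2 for pv in p.values())


def entropy(p, base=2.0):
    """Entropy, formula (2.4); categories with zero frequency contribute nothing."""
    return -sum(pv * math.log(pv, base) for pv in p.values() if pv > 0)


def s_q(p, q):
    """General diversity index S_q, defined for q > 0 and q != 1."""
    return (1.0 - sum(pv ** q for pv in p.values())) / (q - 1.0)


# Narrative column of Table 2.1.
narrative = ["Objective", "Objective", "Personal", "Objective",
             "Personal", "Personal", "Direct", "Direct"]

p = distribution(narrative)
g = gini(p)
h_bits = entropy(p)                      # base 2, as in the text
h_nats = entropy(p, base=math.e)         # natural logarithm
s_near_one = s_q(p, 1.000001)            # numerically close to the limit q -> 1

uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}   # hypothetical uniform distribution
single = {"a": 1.0}                              # all entities in one category
```

For the Narrative feature the distribution is (3/8, 3/8, 1/4), which is close to uniform, so both indices come out slightly below their maxima for three categories.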
A nominal variable $l$ can be converted into a quantitative format by assigning a zero/one feature to each of its categories $v \in V_l$, coded by 1 or 0 depending on whether an entity falls into the category or not. These binary features are sometimes referred to as dummy variables. Unlike a binary feature, a two-category nominal feature such as "Gender" is converted into two columns, each corresponding to one of the categories, "Male" and "Female," of "Gender." This way of quantization is quite convenient within the data recovery approach, as will be seen further in section 4.3 and others. However, it is also compatible with the traditional view of quantitative measurement scales as expressed in terms of admissible transformations. Indeed, for a nominal scale $x$, any one-to-one mapping $y = \phi(x)$ is considered admissible. When there are only two categories, $x_1$ and $x_2$, they can be recoded into any $y_1$ and $y_2$ with an appropriate rescaling factor $a$ and shift $b$ so that the transformation of $x$ to $y$ can be considered an affine one, $y = ax + b$. It is not difficult to prove that $a = (y_1 - y_2)/(x_1 - x_2)$ and $b = (x_1 y_2 - x_2 y_1)/(x_1 - x_2)$ will do the recoding. In other words, nominal features with two categories can be considered quantitative. The binary features, in this context, are those with category Yes coded by 1 and No by 0 for which transformation $y = ax + b$ is meaningful only when $a > 0$.

The vector of averages of the dummy category features, $p_v$, $v \in V_l$, is nothing but the distribution of $l$. Moreover, the Gini coefficient appears to be but the summary Bernoullian variance of the dummy category features, $G = \sum_{v \in V_l} p_v (1 - p_v)$. In the case when $l$ has only two categories, this becomes just the variance of any of them doubled. Thus, the transformation of a nominal variable into a bunch of zero-one dummies conveniently converts it into a quantitative format which is compatible with the traditional treatment of nominal features.
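The properties of dummy variables stated above can be verified on a small example. The sketch below, in Python, uses hypothetical function names, the Narrative column of Table 2.1 and an invented two-category feature; the recoding targets 10 and -5 are arbitrary.

```python
def dummies(labels):
    """One zero-one column per category of a nominal feature."""
    return {c: [1 if x == c else 0 for x in labels] for c in sorted(set(labels))}


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Empirical variance with denominator N, as in (2.2)."""
    c = mean(xs)
    return sum((x - c) ** 2 for x in xs) / len(xs)


def affine_recoding(x1, x2, y1, y2):
    """Coefficients a, b of y = a*x + b that map x1 to y1 and x2 to y2."""
    a = (y1 - y2) / (x1 - x2)
    b = (x1 * y2 - x2 * y1) / (x1 - x2)
    return a, b


# Narrative column of Table 2.1.
narrative = ["Objective", "Objective", "Personal", "Objective",
             "Personal", "Personal", "Direct", "Direct"]
columns = dummies(narrative)

# Averages of the dummy columns form the distribution of the nominal feature.
p = {c: mean(col) for c, col in columns.items()}

# Gini coefficient as the summary variance of the dummy columns.
gini_from_variances = sum(variance(col) for col in columns.values())
gini_direct = 1.0 - sum(pv ** 2 for pv in p.values())

# Two-category case: the Gini coefficient is twice the variance of either dummy.
two_cat = dummies(["F", "M", "F", "F"])  # hypothetical labels
gini_two = sum(variance(col) for col in two_cat.values())

# Recoding the codes 0 and 1 of a two-category feature into 10 and -5.
a, b = affine_recoding(0.0, 1.0, 10.0, -5.0)
```

In the recoding example the factor `a` comes out negative, which reverses the direction of the scale; this is admissible for a two-category nominal feature but not for a binary one.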
2.2 Bivariate analysis
Statistical science in the pre-computer era developed a number of tools for the analysis of interrelations between variables, which will be useful in the sequel. In the remainder of this section, a review is given of the three cases emerging from the pair-wise considerations, with emphasis on the measurement scales: (a) quantitative-to-quantitative, (b) categorical-to-quantitative, and (c) categorical-to-categorical variables. The discussion of the latter case follows that in [93].
2.2.1 Two quantitative variables
Figure 2.1: Geometrical meaning of the inner product and correlation coefficient.

Mutual interrelations between two quantitative features can be caught with a scatter plot such as in Figure 1.10, page 24. Two indices for measuring association between quantitative variables have attracted considerable attention in statistics and data mining: those of covariance and correlation. The covariance coefficient between the variables $x$ and $y$ considered as columns in a data table, $x = (x_i)$ and $y = (y_i)$, $i \in I$, can be defined as

$\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i \in I} (x_i - \bar{x})(y_i - \bar{y})$    (2.5)
where $\bar{x}$ and $\bar{y}$ are the average values of $x$ and $y$, respectively. Obviously, $\mathrm{cov}(x, x) = s^2(x)$, the variance of $x$ defined in section 2.1.3. The covariance coefficient changes proportionally when the variable scales are changed. A scale-invariant version of the coefficient is the correlation coefficient (sometimes referred to as the Pearson product-moment correlation coefficient), which is the covariance coefficient normalized by the standard deviations:
$r(x, y) = \mathrm{cov}(x, y)/(s(x) s(y))$    (2.6)
A somewhat simpler formula for the correlation coefficient can be obtained if the data are first standardized by subtracting their average and dividing the results by the standard deviation: $r(x, y) = \mathrm{cov}(x', y') = (x', y')/N$ where $x'_i = (x_i - \bar{x})/s(x)$, $y'_i = (y_i - \bar{y})/s(y)$, $i \in I$. Thus, the correlation coefficient is but the mean of the component-to-component, that is, inner, product of feature vectors when both of the scales are standardized as above. The coefficient of correlation can be substantiated in different theoretic frameworks. These require some preliminary knowledge of mathematics and can be omitted at first reading, which is reflected in using a smaller font for explaining them.
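Formulas (2.5) and (2.6) and the standardized form can be illustrated with a short sketch. The function names are illustrative, not the author's, and the three value pairs are assumed to be the LenS and LenD values of items 1, 2 and 7 listed in Table 2.14.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Covariance (2.5): mean product of deviations from the averages."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def std(x):
    return math.sqrt(cov(x, x))

def corr(x, y):
    """Correlation (2.6): covariance normalized by the standard deviations."""
    return cov(x, y) / (std(x) * std(y))

def standardize(x):
    m, s = mean(x), std(x)
    return [(a - m) / s for a in x]

# LenS and LenD values of items 1, 2 and 7 as listed in Table 2.14.
x = [19.0, 29.4, 23.9]
y = [43.7, 36.0, 30.2]

r = corr(x, y)
xs, ys = standardize(x), standardize(y)
r_inner = sum(a * b for a, b in zip(xs, ys)) / len(x)   # (x', y') / N
x10 = [10 * a for a in x]                               # change of scale

print(r, r_inner)                 # the same value, computed in two ways
print(cov(x, y), cov(x10, y))     # covariance follows the scale
print(corr(x10, y))               # correlation does not
```

Rescaling $x$ by a factor of 10 multiplies the covariance by 10 but leaves the correlation unchanged, which is the scale invariance mentioned above.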
1. Cosine. A geometric approach, relying on concepts introduced later in section 2.3.2, offers the view that the covariance coefficient, as the inner product of feature column-vectors, is related to the angle between the vectors so that $(x, y) = \|x\| \|y\| \cos(x, y)$. This can be illustrated with Figure 2.1; the norms $\|x\|$ and $\|y\|$ are Euclidean lengths of intervals from 0 to $x$ and $y$, respectively. The correlation coefficient is the inner product of the corresponding normalized variables, that is, the cosine of the angle between the vectors.

2. Linear slope. The data recovery approach suggests that one of the features is modeled as a linear function of the other, say, $y$ as $ax + b$ where $a$ and $b$ are chosen to minimize the norm of the difference, $\|y - ax - b\|$. It appears that the optimal slope $a$ is proportional to $r(x, y)$ and, moreover, the square $r(x, y)^2$ expresses that part of the variance of $y$ that is taken into account by $ax + b$ (see details in section 5.1.2).

3. Parameter in Gaussian distribution. The correlation coefficient has a very clear meaning in the framework of probabilistic bivariate distributions. Consider, for the sake of simplicity, features $x$ and $y$ normalized so that the variance of each is unity. Denote the matrix formed by the two features by $z = (x, y)$ and assume a unimodal distribution over $z$, controlled by the so-called Gaussian, or normal, density function (see section 6.1.5), which is proportional to the exponent of $-z^T \Sigma^{-1} z/2$, where $\Sigma$ is the $2 \times 2$ matrix $\Sigma = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$. The parameter $r$ determines the distance between the foci of the ellipse $z^T \Sigma^{-1} z = 1$: the greater $|r|$, the greater the distance. At $r = 0$ the distance is zero so that the ellipse is a circle, and at $r$ tending to 1 or $-1$ the ellipse becomes ever more elongated and degenerates into a straight line. It appears that $r(x, y)$ is a sample-based estimate of this parameter.

These three frameworks capture different pieces of the "elephant." That of cosine is the most universal framework: one may always take that measure to see to what extent two features go in concert, that is, to what extent their highs and lows co-occur. As any cosine, the correlation coefficient is between $-1$ and 1, the boundary values corresponding to the coincidence of the normalized variables or to a linear relation between the original features. The correlation coefficient being zero corresponds to the right angle between the vectors: the features are not correlated! Does that mean they must be independent in this case? Not necessarily. The linear slope approach allows one to see how this may happen: the line best fitting the scatter-plot just has to be horizontal. According to this approach, the square of the correlation coefficient shows to what extent the relation between variables, as observed, is owed to linearity. The Gaussian distribution view is the most demanding: it requires a properly defined distribution function, a unimodal one, if not normal. Three examples of scatter-plots in Figure 2.2 illustrate some potential cases of correlation: (a) strong positive, (b) strong negative, and (c) zero correlation. Yet be reminded: in contrast to what is claimed on popular web sites, the correlation coefficient cannot capture all the cases in which variables are related; it captures only those of linear relation and those close enough to that.
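The linear slope view and the warning about zero correlation can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the book's procedure: it uses the standard least-squares solution $a = \mathrm{cov}(x, y)/s^2(x)$, $b = \bar{y} - a\bar{x}$, the same three LenS and LenD pairs as listed in Table 2.14, and a made-up pair of features $u$ and $w = u^2$.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

def fit_line(x, y):
    """Least-squares line y = a*x + b minimizing ||y - a*x - b||."""
    a = cov(x, y) / cov(x, x)
    b = mean(y) - a * mean(x)
    return a, b

# LenS and LenD values of items 1, 2 and 7 as listed in Table 2.14.
x = [19.0, 29.4, 23.9]
y = [43.7, 36.0, 30.2]

a, b = fit_line(x, y)
r = corr(x, y)
residual_var = mean([(yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)])
explained = 1.0 - residual_var / cov(y, y)   # share of s^2(y) covered by a*x + b

print(a, r * math.sqrt(cov(y, y) / cov(x, x)))   # slope equals r * s(y) / s(x)
print(explained, r ** 2)                         # explained share equals r squared

# A made-up pair that is perfectly related, yet uncorrelated: w = u^2.
u = [-2, -1, 0, 1, 2]
w = [ui ** 2 for ui in u]
print(corr(u, w))   # zero: the best-fitting line is horizontal
```

The second part shows a functional, but non-linear, dependence for which the correlation coefficient is zero.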
2.2.2 Nominal and quantitative variables
Figure 2.2: Three cases of a scatter-plot: (a) positive correlation, (b) negative correlation, (c) no correlation. The shaded area is supposed to be randomly covered by entity points.

How should one measure association between a nominal feature and a quantitative feature? By looking at whether specifying a category can lead to a better prediction of the quantitative feature or not. Let us denote the partition of the entity set $I$ corresponding to categories of the nominal variable by $S = \{S_1, ..., S_m\}$; subset $S_k$ consists of $N_k$ entities falling in the $k$-th category of the variable. The quantitative variable will be denoted by $y$ with its values $y_i$ for $i \in I$. The box-plot such as in Figure 1.7 on page 23 is a visual representation of the relationship between the nominal variable represented by the grouping and the quantitative variable represented by the boxes and whiskers.

Let us introduce the framework for prediction of $y$ values. Let the predicted $y$ value for any entity be the grand mean $\bar{y}$ if no other information is supplied, or $\bar{y}_k = \sum_{i \in S_k} y_i / N_k$, the within-class average, if the entity is known to belong to $S_k$. The average error of these predictions can be estimated in terms of the variances. To do this, one should relate within-class variances of $y$ to its total variance $s^2$: the greater the change, the lesser the error and the closer the relation between $S$ and $y$. An index, referred to as the correlation ratio, measures the proportion of the total feature variance that is taken into account by the classes. Let us denote the within-class variance of variable $y$ by
$s_k^2 = \sum_{i \in S_k} (y_i - \bar{y}_k)^2 / N_k$    (2.7)
where $N_k$ is the number of entities in $S_k$ and $\bar{y}_k$ the feature's within-cluster average. Let us denote the proportion of $S_k$ in $I$ by $p_k$ so that $p_k = N_k/N$. Then the average variance within partition $S = \{S_1, ..., S_m\}$ will be $\sum_{k=1}^{m} p_k s_k^2$. This can be proven to never be greater than the total variance $s^2$. However, the average within-partition variance can be as small as zero, when values $y_i$ all coincide with $\bar{y}_k$ within each category $S_k$, that is, when $y$ is piece-wise constant across $S$. In other words, all cluster boxes of a box-plot of $y$ over classes of $S$ degenerate into straight lines in this case. In such a situation partition $S$ is said to perfectly match $y$. The smaller the difference between the average within-class variance and $s^2$, the worse the match between $S$ and $y$. The relative value of the difference,

$\eta^2 = \frac{s^2 - \sum_{k=1}^{m} p_k s_k^2}{s^2}$    (2.8)

is referred to as the correlation ratio. The correlation ratio is between 0 and 1, the latter corresponding to the perfect match case. The greater the within-category variances, the smaller the correlation ratio. The minimum value, zero, is reached when all within-class variances coincide with the total variance.

Interrelation between two nominal variables is represented with the so-called contingency table. A contingency, or cross-classification, data table corresponds to two sets of disjoint categories, such as authorship and narrative style in the Masterpieces data, which respectively form rows and columns of Table 2.3. Entries of the contingency table are co-occurrences of row and column categories, that is, counts of numbers of entities that fall simultaneously in the corresponding row and column categories, such as in Table 2.3.

Table 2.3: Cross-classification of 8 masterpieces according to the author and Narrative in the format Count/Proportion.

                       Narrative
Author     Objective   Personal   Direct    Total
Dickens    2/0.250     1/0.125    0/0       3/0.375
Twain      1/0.125     2/0.250    0/0       3/0.375
Tolstoy    0/0         0/0        2/0.250   2/0.250
Total      3/0.375     3/0.375    2/0.250   8/1.000

In a general case, with the row categories denoted by $t \in T$ and column categories by $u \in U$, the co-occurrence counts are denoted by $N_{tu}$. The frequencies of row and column categories usually are called marginals (since they are presented on margins of contingency tables as in Table 2.3) and denoted by $N_{t+}$ and $N_{+u}$ since, when the categories within each of the two sets do not overlap, they are sums of co-occurrence entries, $N_{tu}$, in rows, $t$, and columns, $u$, respectively. The proportions, $p_{tu} = N_{tu}/N$, $p_{t+} = N_{t+}/N$, and $p_{+u} = N_{+u}/N$, are also frequently used as contingency table entries. The general contingency table is presented in Table 2.4. Contingency tables can be considered for quantitative features too, if they are preliminarily categorized as demonstrated in the following example.
Table 2.4: A contingency table or cross classification of two sets of categories, $t \in T$ and $u \in U$, on the entity set $I$.

Category   1         2         ...   |U|         Total
1          N_11      N_12      ...   N_1|U|      N_1+
2          N_21      N_22      ...   N_2|U|      N_2+
...        ...       ...       ...   ...         ...
|T|        N_|T|1    N_|T|2    ...   N_|T||U|    N_|T|+
Total      N_+1      N_+2      ...   N_+|U|      N

Table 2.5: Cross classification of the Bank related partition with FM feature at Market towns.

                 Number of banks
FMarket    10+    4+     2+     1-     Total
Yes        2      5      1      1      9
No         4      7      13     12     36
Total      6      12     14     13     45

Table 2.6: Frequencies, per cent, in the bivariate distribution of the Bank related partition and FM at Market towns.

                 Number of banks
FMarket    10+     4+      2+      1-      Total
Yes        4.44    11.11   2.22    2.22    20.00
No         8.89    15.56   28.89   26.67   80.00
Total      13.33   26.67   31.11   28.89   100.00
Example 2.1. A cross classification of Market towns

Let us partition the Market town set in four classes according to the number of Banks and Building Societies (feature Ba): class $T_1$ to include towns with Ba equal to 10 or more; $T_2$ with Ba equal to 4 or more, but less than 10; $T_3$ with Ba equal to 2 or 3; and $T_4$ to consist of towns with one or no bank at all. Let us cross classify partition $T = \{T_v\}$ with feature FM, presence or absence of a Farmers' market in the town. That means that we draw a table whose columns correspond to classes $T_v$, rows to presence or absence of Farmers' markets, and entries to their overlaps (see Table 2.5). The matrix of frequencies, or proportions, $p_{tu} = N_{tu}/N$ for Table 2.5 can be found by dividing all its entries by $N = 45$ (see Table 2.6).
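The passage from counts to marginals and percentage frequencies in Example 2.1 can be reproduced with a few lines. This is a minimal sketch with illustrative variable names; the counts are those of Table 2.5.

```python
# Counts of Table 2.5: rows are FM = Yes, No; columns are bank classes 10+, 4+, 2+, 1-.
counts = [[2, 5, 1, 1],
          [4, 7, 13, 12]]

N = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]          # marginals N_t+
col_totals = [sum(col) for col in zip(*counts)]    # marginals N_+u

# Frequencies in per cent, rounded to two decimals as in Table 2.6.
percent = [[round(100 * n / N, 2) for n in row] for row in counts]
row_percent = [round(100 * n / N, 2) for n in row_totals]
col_percent = [round(100 * n / N, 2) for n in col_totals]

print(N, row_totals, col_totals)
for row, total in zip(percent, row_percent):
    print(row, total)
print(col_percent)
```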
A contingency table gives a picture of interrelation between two categorical features, or partitions corresponding to them, which is not quite clear. Let us make the picture sharper by removing thirteen towns from the sample, those falling in the less populated cells of Table 2.5 (see Table 2.7).

Table 2.7: Cross classification of the Bank related partition with FM feature at a "cleaned" subsample of Market towns.

                 Number of banks
FMarket    10+    4+     2+     1-     Total
Yes        2      5      0      0      7
No         0      0      13     12     25
Total      2      5      13     12     32

Table 2.7 shows a very clear association between two features on the "cleaned" subsample: the Farmers' markets are present only in towns in which the number of banks is 4 or greater. A somewhat subtler relation: the medium numbers of banks are more closely associated with the presence of a Farmers' market than the higher ones. This clear picture is somewhat blurred in the original sample in Table 2.5 and, moreover, maybe does not hold at all.

Thus, the issue of relating two features to each other can be addressed by looking at mismatches. For instance, Table 2.3 shows that Narrative style is quite close to authorship, though they do not completely match: there are two mismatching entities, one by Dickens and the other by Twain. Similarly, there are 13 mismatches in Table 2.5 removed in Table 2.7. The sheer numbers of mismatching entities measure the differences between category sets rather well when the distribution of entities within each category is rather uniform, as it is in Table 2.3. When the proportions of entities in different categories drastically differ, as in Table 2.5, to measure association between category sets more properly, the numbers of mismatching entities should be weighted according to the frequencies of corresponding categories. Can we discover the relation in Table 2.5 without removing entities?

To measure association between categories according to a contingency table, a founding father of the science of statistics, A. Quetelet, proposed utilizing the relative or absolute change of the conditional probability of a category. The conditional probability $p(u/t) = N_{tu}/N_{t+} = p_{tu}/p_{t+}$ measures the proportion of category $u$ in category $t$. Quetelet coefficients measure the difference between $p(u/t)$ and the average rate $p_{+u}$ of $u \in U$. The Quetelet absolute probability change is defined as

$g_{tu} = p(u/t) - p_{+u} = (p_{tu} - p_{t+} p_{+u})/p_{t+}$    (2.9)

and the Quetelet relative change as

$q_{tu} = g_{tu}/p_{+u} = p_{tu}/(p_{t+} p_{+u}) - 1$    (2.10)
If, for instance, $t$ is an illness risk factor such as "exposure to certain allergens" and $u$ is an allergic reaction such as asthma, and $p_{tu} = 0.001$, $p_{t+} = 0.01$, $p_{+u} = 0.02$, that means that ten per cent of those people who have been exposed to the allergens, $p(u/t) = p_{tu}/p_{t+} = 0.001/0.01 = 0.1$, contract the disease while only two per cent on average have the disease. Thus, the exposure to the allergens multiplies the risk of the disease fivefold or increases the probability of contracting it by 400%. This is exactly the value of $q_{tu} = 0.001/0.0002 - 1 = 4$. The value of $g_{tu}$ expresses the absolute difference between $p(u/t) = 0.1$ and $p_{+u} = 0.02$; it is not that dramatic, just 0.08. The summary Quetelet coefficients (weighted by the co-occurrence values) can be considered as summary measures of association between two category sets, especially when distributions are far from uniform:
$G^2 = \sum_{t \in T} \sum_{u \in U} p_{tu} g_{tu} = \sum_{t \in T} \sum_{u \in U} \frac{p_{tu}^2}{p_{t+}} - \sum_{u \in U} p_{+u}^2$    (2.11)

and

$Q^2 = \sum_{t \in T} \sum_{u \in U} p_{tu} q_{tu} = \sum_{t \in T} \sum_{u \in U} \frac{p_{tu}^2}{p_{t+} p_{+u}} - 1$    (2.12)

The right-hand 1 in (2.12) comes as $\sum_{t \in T} \sum_{u \in U} p_{tu}$ when the categories $t$ are mutually exclusive and cover the entire set $I$, as do the categories $u$. In this case $Q^2$ is equal to the well known Pearson chi-squared coefficient $X^2$ defined by a different formula:

$X^2 = \sum_{t \in T} \sum_{u \in U} \frac{(p_{tu} - p_{t+} p_{+u})^2}{p_{t+} p_{+u}}$    (2.13)
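The arithmetic of the allergen illustration above can be retraced in a few lines. This is only a sketch: the function name is hypothetical, and the three proportions are those quoted in the text.

```python
def quetelet(p_tu, p_t, p_u):
    """Return the conditional probability p(u/t), the absolute change g_tu
    and the relative change q_tu for a single cell of a contingency table."""
    conditional = p_tu / p_t
    g = conditional - p_u
    q = g / p_u
    return conditional, g, q

# Proportions of the allergen illustration: p_tu, p_t+, p_+u.
conditional, g, q = quetelet(0.001, 0.01, 0.02)
print(conditional, g, q)   # approximately 0.1, 0.08 and 4
```

The relative change of 4 corresponds to the 400% increase, while the absolute change is only 0.08.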
The fact that $Q^2 = X^2$ can be proven with simple algebraic manipulations. Indeed, take the numerator in $X^2$: $(p_{tu} - p_{t+} p_{+u})^2 = p_{tu}^2 - 2 p_{tu} p_{t+} p_{+u} + p_{t+}^2 p_{+u}^2$. Divided by the denominator $p_{t+} p_{+u}$, this becomes $p_{tu}^2/(p_{t+} p_{+u}) - 2 p_{tu} + p_{t+} p_{+u}$. Summing up the first item over all $u$ and $t$ leads to $Q^2 + 1$. The second item sums up to $-2$ and the third item to 1, which proves the statement.

This coefficient is by far the most popular association coefficient. There is a probability-based theory describing what values of $N X^2$ can be explained by fluctuations of the random sampling from the population.

The difference between equivalent expressions (2.12) and (2.13) for the relative Quetelet coefficient $Q^2$ reflects deep epistemological differences. In fact, the Pearson chi-squared coefficient has been introduced in the format of $N X^2$, with $X^2$ in (2.13), to measure the deviation of the bivariate distribution in an observed contingency table from the model of statistical independence. Two partitions (categorical variables) are referred to as statistically independent if any entry in their relative contingency table is equal to the product of corresponding marginal proportions; that is, in our notation,

$p_{tu} = p_{t+} p_{+u}$    (2.14)

for all $t \in T$ and $u \in U$.

Expression (2.13) for $X^2$ shows that it is a quadratic measure of deviation of the contingency table entries from the model of statistical independence. This shows that (2.13) is good at testing the hypothesis of statistical independence when $I$ is an independent random sample: the statistical distribution of $N X^2$ has been proven to converge, when $N$ tends to infinity, to the chi-squared distribution with $(|T| - 1)(|U| - 1)$ degrees of freedom.

Statistics texts and manuals claim that, without relating the observed contingency counts to the model of statistical independence, there is no point in considering $X^2$. This claim sets a very restrictive condition for using $X^2$ as an association measure. In particular, it is quite cumbersome to substantiate presence of zero entries (such as in Tables 2.3 and 2.7) in a contingency table under this condition. However, expression (2.12) for $Q^2$ sets a very different framework that has nothing to do with the statistical independence. In this framework, $X^2$ is $Q^2$, the average relative change of the probability of a category $u$ when category $t$ becomes known. There is no restriction on using $X^2 = Q^2$ in this framework.

It is not difficult to prove that the summary coefficient $Q^2$ reaches its maximum value

$\max X^2 = \min(|U|, |T|) - 1$    (2.15)

in tables with the structure of Table 2.7, at which only one element is not zero in every column (or row, if the number of rows is greater than the number of columns) [93]. Such a structure suggests a conceptual relation between categories of the two features, which means that the coefficient is good in measuring association indeed. For instance, according to Table 2.7, Farmers' markets are present if and only if the number of banks or building societies is 4 or greater, and $Q^2 = 1$ in this table.
The minimum value of $Q^2$ is reached in the case of statistical independence between the features, which obviously follows from the "all squared" form of the coefficient in (2.13). Formula (2.12) suggests a way for visualization of dependencies in a "blurred" contingency table by putting the constituent items $p_{tu} q_{tu}$ as $(t, u)$ entries of a show-case table. The proportional but greater values $N p_{tu} q_{tu} = N_{tu} q_{tu}$ can be used as well, since they sum up to $N X^2$ used in the probabilistic framework.
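The equality of the Quetelet form (2.12) and the Pearson form (2.13), as well as the maximum value (2.15), can be verified numerically on Tables 2.5 and 2.7. The following is a sketch with illustrative function names, not a reference implementation.

```python
def proportions(counts):
    """Turn a table of counts into p_tu together with the marginals p_t+ and p_+u."""
    n = sum(sum(row) for row in counts)
    p = [[c / n for c in row] for row in counts]
    p_row = [sum(row) for row in p]
    p_col = [sum(col) for col in zip(*p)]
    return n, p, p_row, p_col

def q2_quetelet(counts):
    """Summary relative Quetelet coefficient, formula (2.12)."""
    _, p, p_row, p_col = proportions(counts)
    return sum(p[t][u] ** 2 / (p_row[t] * p_col[u])
               for t in range(len(p)) for u in range(len(p[0]))) - 1.0

def x2_pearson(counts):
    """Pearson chi-squared coefficient, formula (2.13)."""
    _, p, p_row, p_col = proportions(counts)
    return sum((p[t][u] - p_row[t] * p_col[u]) ** 2 / (p_row[t] * p_col[u])
               for t in range(len(p)) for u in range(len(p[0])))

table_2_5 = [[2, 5, 1, 1], [4, 7, 13, 12]]
table_2_7 = [[2, 5, 0, 0], [0, 0, 13, 12]]

for name, table in (("Table 2.5", table_2_5), ("Table 2.7", table_2_7)):
    n = sum(sum(row) for row in table)
    print(name, q2_quetelet(table), x2_pearson(table), n * x2_pearson(table))
```

For Table 2.7 both forms give 1, which is $\min(|U|, |T|) - 1$ for a table with two rows and four columns.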
WHAT IS DATA
Table 2.8: Relative Quetelet coefficients, per cent, for Table 2.5.

                 Number of banks
FMarket    10+      4+       2+       1-
Yes        66.67    108.33   -64.29   -61.54
No         -16.67   -27.08   16.07    15.38

Table 2.9: Items summed up in the chi-square contingency coefficient (times $N$) in the Quetelet format (2.12) for Table 2.5.

                 Number of banks
FMarket    10+     4+      2+      1-      Total
Yes        1.33    5.41    -0.64   -0.62   5.48
No         -0.67   -1.90   2.09    1.85    1.37
Total      0.67    3.51    1.45    1.23    6.86
Example 2.2. Highlighting positive contributions to the total association

The table of the relative Quetelet coefficients $q_{tu}$ for Table 2.5 is presented in Table 2.8 and that of items $N_{tu} q_{tu}$ in Table 2.9. It is easy to see that the positive entries in both of the tables express the same pattern as in Table 2.7 but without removing entities from the table. Table 2.9 demonstrates one more property of the items $p_{tu} q_{tu}$ summed up in the chi-square coefficient: their within-row or within-column sums are never negative.
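The entries of Tables 2.8 and 2.9 can be recomputed from the counts of Table 2.5 as sketched below. Variable names are illustrative; individual items may differ from the printed ones in the last digit because of rounding.

```python
# Counts of Table 2.5: rows are FM = Yes, No; columns are bank classes 10+, 4+, 2+, 1-.
counts = [[2, 5, 1, 1],
          [4, 7, 13, 12]]

n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Relative Quetelet coefficients q_tu = N_tu * N / (N_t+ * N_+u) - 1, as in Table 2.8.
q = [[counts[t][u] * n / (row_totals[t] * col_totals[u]) - 1.0
      for u in range(len(col_totals))] for t in range(len(row_totals))]
q_percent = [[round(100 * v, 2) for v in row] for row in q]

# Items N_tu * q_tu of the chi-square coefficient times N, as in Table 2.9.
items = [[counts[t][u] * q[t][u] for u in range(len(col_totals))]
         for t in range(len(row_totals))]
row_sums = [sum(row) for row in items]
col_sums = [sum(col) for col in zip(*items)]
total = sum(row_sums)   # equals N * X^2

for row in q_percent:
    print(row)
print([round(s, 2) for s in row_sums], [round(s, 2) for s in col_sums], round(total, 2))
```

The positive entries sit in the cells Yes/10+, Yes/4+, No/2+ and No/1-, which is the pattern of Table 2.7.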
Highlighting the positive entries $p_{tu} q_{tu} > 0$ (or $q_{tu} > 0$) can be used for visualization of the pattern of association between any categorical features [94]. An expression similar to (2.13), though asymmetric, can be derived for $G^2$:

$G^2 = \sum_{t \in T} \sum_{u \in U} \frac{(p_{tu} - p_{t+} p_{+u})^2}{p_{t+}}$    (2.16)
Though it also can be considered a measure of deviation of the contingency table from the model of statistical independence, $G^2$ has always been considered in the literature as a measure of association. A corresponding definition involves the Gini coefficient defined in section 2.1.3, $G(U) = 1 - \sum_{u \in U} p_{+u}^2$. Within a category $t$, the variation is equal to $G(U/t) = 1 - \sum_{u \in U} (p_{tu}/p_{t+})^2$, which makes, on average, the qualitative variation that cannot be explained by $T$: $G(U/T) = \sum_{t \in T} p_{t+} G(U/t) = 1 - \sum_{t \in T} \sum_{u \in U} p_{tu}^2/p_{t+}$. The difference $G(U) - G(U/T)$ represents that part of $G(U)$ that is explained by $T$, and this is exactly $G^2$ in (2.11).
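The three readings of $G^2$ (the Quetelet sum (2.11), the deviation form (2.16) and the reduction of the Gini coefficient) can be compared numerically. The sketch below uses the counts of Table 2.5 and assumes, for illustration only, that the rows (FM) play the role of $T$ and the columns (bank classes) the role of $U$.

```python
# Counts of Table 2.5; rows are taken as categories t, columns as categories u.
counts = [[2, 5, 1, 1],
          [4, 7, 13, 12]]

n = sum(sum(row) for row in counts)
p = [[c / n for c in row] for row in counts]
p_row = [sum(row) for row in p]            # p_t+
p_col = [sum(col) for col in zip(*p)]      # p_+u
cells = [(t, u) for t in range(len(p_row)) for u in range(len(p_col))]

# (2.11): sum of p_tu * g_tu with g_tu = p_tu / p_t+ - p_+u
g2_quetelet = sum(p[t][u] * (p[t][u] / p_row[t] - p_col[u]) for t, u in cells)

# (2.16): squared deviations from independence, normalized by p_t+ only
g2_deviation = sum((p[t][u] - p_row[t] * p_col[u]) ** 2 / p_row[t] for t, u in cells)

# Gini reduction: G(U) - G(U/T)
gini_u = 1.0 - sum(pc ** 2 for pc in p_col)
gini_u_given_t = sum(
    p_row[t] * (1.0 - sum((p[t][u] / p_row[t]) ** 2 for u in range(len(p_col))))
    for t in range(len(p_row)))
g2_gini = gini_u - gini_u_given_t

print(g2_quetelet, g2_deviation, g2_gini)   # three equal values
```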
2.2.4 Relation between correlation and contingency

Let us elaborate on the interrelation between the correlation and contingency. K. Pearson tackled the issue by proving that, given two quantitative features whose ranges have been divided into a number of equal intervals, under some standard mathematical assumptions, the value of $\sqrt{X^2/(1 + X^2)}$ converges to the correlation coefficient when the number of intervals tends to infinity [63].

To define a framework for experimentally exploring the issue in the context of a mixed scale pair of features, let us consider a quantitative feature $A$ and a nominal variable $A_t$ obtained by partitioning the range of $A$ into $t$ qualitative categories, with respect to a pre-specified partition $S = \{S_k\}$. The relation between $S$ and $A$ can be captured by comparing the correlation ratio $\eta^2(S, A)$ with corresponding values of contingency coefficients $G^2(S, A_t)$ and $Q^2(S, A_t)$. The choice of the coefficients is not random. As proven in section 5.2.3, $\eta^2(S, A)$ is equal to the contribution of $A$ and clustering $S$ to the data scatter. In the case of $A_t$, analogous roles are played by coefficients $G^2$ and $X^2 = Q^2$.

Relations between $\eta^2$, $G^2$ and $X^2$ can be quite complex depending on the bivariate distribution of $A$ and $S$. However, when the distribution is organized in such a way that all the within-class variances of $A$ are smaller than its overall variance, the pattern of association expressed in $G^2$ and $X^2$ generally follows that expressed in $\eta^2$.

To illustrate this, let us set an experiment according to the data in Table 2.10: within each of four classes, $S_1$, $S_2$, $S_3$ and $S_4$, a pre-specified number of observations is randomly generated with pre-specified mean and variance. The totality of 2300 generated observations constitutes the quantitative feature $A$ for which the correlation ratio $\eta^2(S, A)$ is calculated. Then, the range of $A$ is divided in $t = 5$ equally-spaced intervals (i.e., not necessarily intervals with an equal number of data) constituting categories of the corresponding attribute $A_t$, which is cross-classified with $S$ to calculate $G^2$ and $X^2$. This setting follows that described in [94].

Table 2.10: Setting of the experiment.

Class                    S1     S2     S3     S4
Number of observations   200    100    1000   1000
Variance                 1.0    1.0    4.0    0.1
Initial mean             0.5    1.0    1.5    2.0
Final mean               10     20     30     40

The initial within-class means are not much different with respect to the corresponding variances. Multiplying each of the initial means by the same factor value, $f = 1, 2, ..., 20$, the means are step by step diverged in such a way that the within-class samples become more and more distinguishable from each other, thus increasing the association between $S$ and $A$. The final means in Table 2.10 correspond to $f = 20$.

This is reflected in Figure 2.3 where the horizontal axis corresponds to the divergence factor, $f$, and the vertical axis represents values of the three coefficients for the case when the within-class distribution of $A$ is uniform (on the left) or Gaussian, or normal (on the right). We can see that the patterns follow each other rather closely in the case of a uniform distribution. There are small diversions from this in the case of a normal distribution. The product-moment correlation between $G^2$ and $X^2$ is always about 0.98-0.99 whereas they both correlate with $\eta^2$ on the level of 0.90. The difference in values of $G^2$, $X^2$ and $\eta^2$ is caused by two factors: first, by the coarse qualitative nature of $A_t$ versus the fine-grained quantitative character of $A$, and, second, by the difference in their contributions to the data scatter. The second factor scales $G^2$ down and $X^2$ up, to the maximum value 3 according to (2.15).

Figure 2.3: Typical change of the correlation ratio (solid line), G squared (dotted line) and chi-square (dash-dotted line) with increase of the class divergence factor in the case of uniform (left) and normal (right) within class distribution of the quantitative variable A. Horizontal axis: divergence factor, from 0 to 20; vertical axis: coefficient values, from 0 to 2.5.
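The experiment just described can be imitated with the rough sketch below. It is not the author's code: it covers only the Gaussian case, uses a fixed random seed and a handful of factor values, and assumes that the classes of $S$ play the role of $T$ and the intervals of $A_t$ the role of $U$ in $G^2$. All function names are hypothetical.

```python
import random

# (number of observations, variance, initial mean) per class, as in Table 2.10.
CLASSES = [(200, 1.0, 0.5), (100, 1.0, 1.0), (1000, 4.0, 1.5), (1000, 0.1, 2.0)]

def generate(f, rng):
    """Gaussian within-class samples, the means multiplied by the divergence factor f."""
    values, labels = [], []
    for k, (size, var, m) in enumerate(CLASSES):
        for _ in range(size):
            values.append(rng.gauss(f * m, var ** 0.5))
            labels.append(k)
    return values, labels

def correlation_ratio(values, labels):
    """Correlation ratio: (s^2 - sum_k p_k s_k^2) / s^2."""
    n = len(values)
    grand = sum(values) / n
    total = sum((v - grand) ** 2 for v in values) / n
    within = 0.0
    for k in set(labels):
        cls = [v for v, l in zip(values, labels) if l == k]
        mk = sum(cls) / len(cls)
        within += sum((v - mk) ** 2 for v in cls) / n   # equals p_k * s_k^2
    return (total - within) / total

def categorize(values, t=5):
    """Index of the equally spaced interval (0..t-1) each value falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / t
    return [min(int((v - lo) / width), t - 1) for v in values]

def g2_and_x2(rows, cols):
    """G^2 by (2.11) and X^2 = Q^2 by (2.12) from two parallel label lists."""
    n = len(rows)
    cells, row_n, col_n = {}, {}, {}
    for r, c in zip(rows, cols):
        cells[(r, c)] = cells.get((r, c), 0) + 1
        row_n[r] = row_n.get(r, 0) + 1
        col_n[c] = col_n.get(c, 0) + 1
    g2 = (sum(m * m / (n * row_n[r]) for (r, c), m in cells.items())
          - sum((m / n) ** 2 for m in col_n.values()))
    x2 = sum(m * m / (row_n[r] * col_n[c]) for (r, c), m in cells.items()) - 1.0
    return g2, x2

rng = random.Random(0)
results = {}
for f in (1, 5, 10, 20):
    a, s = generate(f, rng)
    results[f] = (correlation_ratio(a, s),) + g2_and_x2(s, categorize(a))
    print(f, results[f])   # (correlation ratio, G squared, chi-squared)
```

Only non-empty cells enter the sums, so an interval that happens to contain no observations does not cause a division by zero.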
2.2.5 Meaning of correlation
Correlation is a phenomenon which may be observed between two features co-occurring in the same observations: the features are co-related in such a way that change in one of them accords with a corresponding change in the other. These are frequently asked questions: Given a high correlation or association, is there any causal relation behind? Given a low correlation, are the features involved independent? If there is a causal relation, should it translate into a higher correlation? The answer to each is: no, not necessarily.

To make our point less formal, let us refer to a typical statistics news nugget brought to life by newspapers and BBC Ceefax 25 June 2004: "Children whose mothers eat fish regularly during pregnancy develop better language and communication skills. The findings are based on analysis of eating habits of 7400 mothers ... by the University of North Carolina, published in the journal Epidemiology."

At face value, the claim is simple: eat more fish while pregnant and your baby will be better off in the context of language and communication skills. The real value behind it is a cross classification of a mother's eating habits and her baby's skills over the set of 7400 mother-baby couples at which the cell combining "regular fish eating" with "better language skills" has accounted for a considerably greater number of observations than would be expected under statistical independence, that is, the corresponding Quetelet coefficient $q$ is positive. So what? Could it be just because of the fish? Very possibly: some say that the phosphorus which is abundant in fish is a building material for the brain. Yet some say that the phosphorus diet has nothing to do with brain development. They think that the correlation is just a manifestation of a deeper relation between family income, not accounted for in the data, and the two features: in richer families it is both the case that mothers eat more expensive food, fish included, and babies have better language skills. The conclusion: more research is needed to see which of these two explanations is correct. And more research may bring further unaccounted for and unforeseen factors and observations.

Table 2.11: Association between mother's fish eating (A) and her baby's language skills (B) and health (C).

Feature   B      not B   C      not C   Total
A         520    280     560    240     800
not A     480    720     840    360     1200
Total     1000   1000    1400   600     2000
To illustrate the emergence of "false" correlations and non-correlations, let us dwell on the mother-baby example above involving the following binary features: A, "more fish eating mother," B, "baby's better language skills," and C, "healthier baby." Table 2.11 presents artificial data on two thousand mother-baby couples relating A with B and C.

According to Table 2.11, the baby's language skills (B) are indeed positively related to mother's fish eating (A): 520 observations at cell AB rather than 400 expected if A and B were independent, which is supported by a positive Quetelet coefficient $q(B/A) = 30\%$. In contrast, no relation is observed between fish eating (A) and a baby's health (C): all A/C cross classifying entries on the right of Table 2.11 are proportional to the products of marginal frequencies. For instance, with $p(A) = 0.4$ and $p(C) = 0.7$ their product $p(A)p(C) = 0.28$ accounts for 28% of 2000 observations, that is, 560, which is exactly the entry at cell AC.

However, if we take into account one more binary feature, D, which assigns Yes to better off families, and break down the sample according to D, the data may show a different picture (see Table 2.12).

Table 2.12: Association between mother's fish eating (A) and baby's language skills (B) and health (C) with income (D) taken into account.

Feature D   Feature A   B      not B   C      not C   Total
D           A           480    120     520    80      600
            not A       320    80      300    100     400
            Total       800    200     820    180     1000
not D       A           40     160     40     160     200
            not A       160    640     540    260     800
            Total       200    800     580    420     1000
Total                   1000   1000    1400   600     2000

All turned upside down in Table 2.12: what was independent in Table 2.11, A and C, became associated within both D and not-D categories, and what was correlated in Table 2.11, A and B, became independent within both D and not-D categories! Specifically, with these artificial data, one can see that A accounts for 600 within D category and 200 within not-D category. Similarly, B accounts for 800 within D and only 200 within not-D. Independence between A and B within either stratum brings the numbers of AB to 480 in D and only 40 in not-D. This way, the mutually independent A and B within each stratum become correlated in the combined sample, because both A and B are concentrated mostly within D. Similar though opposite effects are at play with association between A and C: they are negatively related in not-D and positively related in D, so that combining these two strata brings the mutual dependence to zero.
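The reversal between the combined sample and the two income strata can be retraced with relative Quetelet coefficients. The sketch below uses the counts of Tables 2.11 and 2.12; the dictionary layout and function names are illustrative only.

```python
def quetelet(n_tu, n_t, n_u, n):
    """Relative Quetelet coefficient q = N_tu * N / (N_t+ * N_+u) - 1."""
    return n_tu * n / (n_t * n_u) - 1.0

# Counts within each stratum of D (Table 2.12).
strata = {
    "D":     {"AB": 480, "AC": 520, "A": 600, "B": 800, "C": 820, "N": 1000},
    "not D": {"AB": 40,  "AC": 40,  "A": 200, "B": 200, "C": 580, "N": 1000},
}
# Summing the strata gives the combined sample of Table 2.11.
combined = {key: sum(s[key] for s in strata.values()) for key in strata["D"]}

def report(c):
    """Return q(B/A) and q(C/A) for one set of counts."""
    return (quetelet(c["AB"], c["A"], c["B"], c["N"]),
            quetelet(c["AC"], c["A"], c["C"], c["N"]))

q_combined = report(combined)
q_d = report(strata["D"])
q_not_d = report(strata["not D"])
print("combined:", q_combined)   # B related to A, C unrelated
print("D:       ", q_d)          # B unrelated, C positively related
print("not D:   ", q_not_d)      # B unrelated, C negatively related
```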
A high correlation/association is just a pointer to the user, researcher or manager alike, to look at what is behind. The data on their own cannot prove any causal relations, especially when no timing is involved, as is the case in all our exemplary problems. A causal relation can be established only with a mechanism explaining the process in question theoretically, to which the data may or may not add credibility.
2.3 Feature space and data scatter
2.3.1 Data matrix
A quantitative data table is usually referred to as a data matrix. Its rows correspond to entities and columns to variables. Moreover, in most clustering computations, all metadata are left aside so that a feature and entity are represented by the corresponding column and row only, under the assumption that the labels of entities and variables do not change. A data table with mixed scales such as Table 2.1 will be transformed to
a quantitative format. According to rules described in the next section, this is achieved by pre-processing each of the categories into a dummy variable by assigning 1 to an entity that falls in it and 0 otherwise.
Example 2.4. Pre-processing Masterpieces data
Let us convert the Masterpieces data in Table 2.1 to the quantitative format. The binary feature SCon is converted by substituting Yes by 1 and No by 0. A somewhat more complex transformation is performed at the three categories of feature Narrative: each is assigned with a corresponding zero/one vector so that the original column Narrative is converted into three (see Table 2.13).
A data matrix row corresponding to an entity $i \in I$ constitutes what is called an $M$-dimensional point or vector $y_i = (y_{i1}, ..., y_{iM})$ whose components are the row entries. For instance, Masterpieces data in Table 2.13 is an $8 \times 7$ matrix, and the first row in it constitutes vector $y_1 = (19.0, 43.7, 2, 0, 1, 0, 0)$, each component of which corresponds to a specific feature and, thus, cannot change its position without changing the feature's position in the feature list. Similarly, a data matrix column corresponds to a feature or category with its elements corresponding to different entities. This is an $N$-dimensional vector. Matrix and vector terminology is not just fancy language but part of a well developed mathematical discipline of linear algebra, which is used throughout all data mining disciplines. Some of it will be used in Chapter 5.
2.3.2 Feature space: distance and inner product
Any $M$-dimensional vector $y = (y_1, ..., y_M)$ pertains to the corresponding combination of feature values. Thus, the set of all $M$-dimensional vectors $y$ is referred to as the feature space. This space is provided with interrelated distance and similarity measures. The distance between two $M$-dimensional vectors, $x = (x_1, x_2, ..., x_M)$ and $y = (y_1, y_2, ..., y_M)$, is defined as the sum of the squared component-wise differences:

$d(x, y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_M - y_M)^2$

This difference-based quadratic measure is what mathematicians call Euclidean distance squared. It generalizes the basic property of plane geometry, the so-called Pythagoras' theorem as presented in Figure 2.4. Indeed, distance $d(x, y)$ in it is $c^2$, and $c$ is the Euclidean distance between $x$ and $y$.
Example 2.5. Distance between entities
Let us consider three novels, 1 and 2 by Dickens and 7 by Tolstoy, as row points of the matrix in Table 2.13, as presented in the upper half of Table 2.14. The mutual distances between them are calculated in the lower half. The differences in the first two variables, LenS and LenD, predefine the result however different the other features are, because their scales prevail. This way we get a counter-intuitive conclusion that a novel by Dickens is closer to that of Tolstoy than to the other by Dickens because $d(2, 7) = 67.89 < d(1, 2) = 168.45$. Therefore, the features must be rescaled to give greater weights to the other variables.
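The computation in Example 2.5 can be reproduced with a few lines of Python. This is a minimal sketch: the three rows are copied from Table 2.14, while the names `novels` and `sq_euclidean` are introduced here for illustration only.

```python
# Rows of Table 2.14; features: LenS, LenD, NumC, SCon, Obje, Pers, Dire.
novels = {
    1: [19.0, 43.7, 2, 0, 1, 0, 0],
    2: [29.4, 36.0, 3, 0, 1, 0, 0],
    7: [23.9, 30.2, 4, 1, 0, 0, 1],
}

def sq_euclidean(x, y):
    """Euclidean distance squared: the sum of squared component-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

for i, j in [(1, 2), (1, 7), (2, 7)]:
    print(f"d({i},{j}) = {sq_euclidean(novels[i], novels[j]):.2f}")
# d(1,2) = 168.45
# d(1,7) = 213.26
# d(2,7) = 67.89
```

The first two features account for almost all of each total, which is the scale effect the example points out.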
Table 2.14: Computation of distances between three masterpieces according to Table 2.13. Squared differences of values in the upper part are in the lower part of the matrix; they are summed up across the columns in the column "Distance" on the right.

Item      LenS     LenD     NumC  SCon  Obje  Pers  Dire  Distance
1         19.0     43.7     2     0     1     0     0
2         29.4     36.0     3     0     1     0     0
7         23.9     30.2     4     1     0     0     1
d(1,2)    108.16   59.29    1     0     0     0     0     168.45
d(1,7)    24.01    182.25   4     1     1     0     1     213.26
d(2,7)    30.25    33.64    1     1     1     0     1     67.89

The concept of $M$-dimensional feature space comprises not only all $M$-series of reals, $y = (y_1, \ldots, y_M)$, but also two mathematical operations with them: the component-wise summation defined by the rule $x + y = (x_1 + y_1, \ldots, x_M + y_M)$ and multiplication of a vector by a number defined as $\lambda x = (\lambda x_1, \ldots, \lambda x_M)$ for any real $\lambda$. These operations naturally generalize the operations with real numbers and have similar properties. With such operations, the variance $s_v^2$ can also be expressed as the distance between column $v \in V$ and the column vector $c_v e$, whose components are all equal to the column's average $c_v$, divided by $N$: $s_v^2 = d(x_v, c_v e)/N$. Vector $c_v e$ is the result of multiplying vector $e = (1, \ldots, 1)$, whose components are all unities, by $c_v$.

Another important operation is the so-called inner, or scalar, product. For any two $M$-dimensional vectors $x$, $y$, their inner product is a number denoted by $(x, y)$ and defined as the sum of component-wise products, $(x, y) = x_1 y_1 + x_2 y_2 + \ldots + x_M y_M$. The inner product and distance are closely related. It is not difficult to see, just from the definitions, that for any vectors/points $x$, $y$: $d(x, 0) = (x, x) = \sum_{v \in V} x_v^2$ and $d(y, 0) = (y, y)$ and, moreover, $d(x, y) = (x - y, x - y)$. The symbol 0 refers here to a vector with all components equal to zero. The distance $d(y, 0)$ will be referred to as the scatter of $y$. The square root of the scatter $d(y, 0)$ is referred to as the (Euclidean) norm of $y$ and denoted by $\|y\| = \sqrt{(y, y)} = \sqrt{\sum_{v \in V} y_v^2}$. It expresses the length of $y$. In general, for any $M$-dimensional $x$, $y$, the following equation holds:

$$d(x, y) = (x - y, x - y) = (x, x) + (y, y) - 2(x, y) = d(0, x) + d(0, y) - 2(x, y). \tag{2.18}$$
This equation becomes especially simple when $(x, y) = 0$. In this case, vectors $x$, $y$ are referred to as mutually orthogonal. When $x$ and $y$ are mutually orthogonal, $d(0, x - y) = d(0, x + y) = d(0, x) + d(0, y)$, that is, the scatters of $x - y$ and $x + y$ are equal to each other and to the sum of the scatters of $x$ and $y$. This is a multidimensional analogue to the Pythagoras theorem and the base for decompositions of the data scatter employed in many statistical theories, including the theory for clustering presented in Chapter 5. The inner product of two vectors has a simple geometric interpretation (see Figure 2.1 on page 48), $(x, y) = \|x\| \|y\| \cos \alpha$, where $\alpha$ is the "angle" between $x$ and $y$ (at 0). This conforms to the concept of orthogonality above: vectors are orthogonal when the angle between them is a right angle.
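The relations between distance, scatter and inner product can be checked numerically. The sketch below uses three hypothetical vectors `x`, `y`, `z` (not taken from the text) and verifies that $d(x, z) = (x, x) + (z, z) - 2(x, z)$ in general, and that the scatters add up when the inner product is zero.

```python
import math

def inner(x, y):
    """Inner product (x, y): the sum of component-wise products."""
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    """Euclidean distance squared, d(x, y) = (x - y, x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    return inner(diff, diff)

def norm(x):
    """Euclidean norm: the square root of the scatter d(x, 0) = (x, x)."""
    return math.sqrt(inner(x, x))

# Hypothetical vectors chosen for illustration only.
x = [3.0, 0.0, 4.0]
y = [0.0, 2.0, 0.0]   # (x, y) = 0, so x and y are mutually orthogonal
z = [1.0, 1.0, 1.0]

# General relation between distance and inner product.
general = inner(x, x) + inner(z, z) - 2 * inner(x, z)
print(sq_dist(x, z), general)

# Orthogonal case: the scatters of x - y and x + y both equal d(0, x) + d(0, y).
x_plus_y = [a + b for a, b in zip(x, y)]
print(sq_dist(x, y), inner(x_plus_y, x_plus_y), inner(x, x) + inner(y, y))

# Cosine of the angle between x and z, from (x, z) = ||x|| ||z|| cos(angle).
cos_xz = inner(x, z) / (norm(x) * norm(z))
print(round(cos_xz, 4))
```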
The summary scatter of all row-vectors in data matrix Y is referred to as the data scatter of Y and denoted by
$$T(Y) = \sum_{i \in I} d(0, y_i) = \sum_{i \in I} \sum_{v \in V} y_{iv}^2 \tag{2.19}$$
Equation (2.19) means that $T(Y)$ is the total of all $Y$ entries squared. An important characteristic of feature $v \in V$ is its contribution to the data scatter defined as

$$T_v = \sum_{i \in I} y_{iv}^2, \tag{2.20}$$

the distance of the $N$-dimensional column from the zero column. Data scatter is obviously the sum of contributions of all variables, $T(Y) = \sum_{v \in V} T_v$. If feature $v$ is centered, then its contribution to the data scatter is proportional to its variance:

$$T_v = N s_v^2. \tag{2.21}$$

Indeed, $c_v = 0$ since $v$ is centered. Thus, $s_v^2 = \sum_{i \in I} (y_{iv} - 0)^2 / N = T_v / N$. The relative contribution $T_v / T(Y)$ is a characteristic playing an important role in data standardization issues, as explained in the next section.
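A short numerical illustration of the data scatter and feature contributions follows. It is a sketch on a hypothetical two-feature data set (the names `columns`, `center`, `T_v` are not from the text); it centers each column, computes its contribution as the sum of squared entries, and confirms that the contribution equals $N$ times the variance.

```python
def center(column):
    """Shift a column so that its mean becomes zero."""
    mean = sum(column) / len(column)
    return [value - mean for value in column]

# Hypothetical data matrix with N = 4 entities, stored column by column.
columns = {"f1": [1.0, 2.0, 3.0, 6.0], "f2": [0.0, 1.0, 1.0, 0.0]}
N = 4

centered = {v: center(col) for v, col in columns.items()}

# Contribution of each feature: the sum of its squared (centered) entries.
T_v = {v: sum(y * y for y in col) for v, col in centered.items()}

# Data scatter: the sum of all feature contributions.
T = sum(T_v.values())

# Variance of a centered column: the mean of its squared entries.
variance = {v: sum(y * y for y in col) / N for v, col in centered.items()}

for v in columns:
    print(v, T_v[v], N * variance[v], f"{100 * T_v[v] / T:.1f}%")
```

Here the feature with the wider scale takes about 93% of the data scatter, which illustrates why relative contributions matter for standardization.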
2.4 Pre-processing and standardizing mixed data
The data pre-processing stage is to transform the raw entity-to-feature table into a quantitative matrix for further analysis. To do this, one needs first to convert all categorical data to a numerical format. We will do this by using a dummy zero-one variable for each category. Then variables are standardized by shifting their origins and rescaling. This operation can be clearly substantiated from a statistics perspective, typically by assuming that entities have been randomly sampled from an underlying Gaussian distribution. In data mining, substantiation may come from the data geometry. By shifting all the origins to feature means, entities become scattered around the center of gravity so that clusters can be more easily "seen" from that point. With feature rescaling, feature scales become balanced according to the principle of equal importance of each feature brought into the data table. To implement these general principles, we are going to rely on the following three-stage procedure. The stages are: (1) enveloping qualitative categories, (2) standardization, and (3) rescaling, as follows:
1. Quantitatively enveloping categories: This stage converts a mixed scale data table into a quantitative matrix by treating every qualitative category as a separate dummy variable coded by 1 or 0 depending on whether an entity falls into the category or not. Binary features are coded similarly except that no additional columns are created. Quantitative features are left as they are. The converted data table will be denoted by $X = (x_{iv})$, $i \in I$, $v \in V$.

2. Standardization: This stage aims at transforming feature-columns of the data matrix to make them comparable by shifting their origins to $a_v$ and rescaling them by $b_v$, $v \in V$, thus creating the standardized matrix $Y = (y_{iv})$:

$$y_{iv} = \frac{x_{iv} - a_v}{b_v}, \quad i \in I, \ v \in V. \tag{2.22}$$

In this text, the shift coefficient $a_v$ will always be the grand mean. In particular, the dummy variable corresponding to category $v \in V_l$ has its mean $c_v = p_v$, the proportion of entities falling in the category. The scale factor $b_v$ can be either the standard deviation or range or other quantity reflecting the variable's spread. In particular, for a category $v \in V_l$, the standard deviation can be either $\sqrt{p_v (1 - p_v)}$ (Bernoulli distribution) or $\sqrt{p_v}$ (Poisson distribution), see page 46. The range of a dummy variable is always 1.

Using the standard deviation is popular in data mining, probably because it is used in classical statistics relying on the theory of Gaussian distribution, which is characterized by the mean and standard deviation. Thus standardized, contributions of all features to data scatter become equal to each other because contributions are proportional to variances (2.21). At first glance this seems an attractive property guaranteeing equal contributions of all features to the results, an opinion to which the current author once also subscribed [90]. However, this is not so. Two different factors contribute to the value of standard deviation: the feature scale and the shape of its distribution. As shown in section 2.1.2, within the same range scale the standard deviation may greatly vary from the minimum, at the peak unimodal distribution, to the maximum, at the peak bimodal distribution. By standardizing with standard deviations, we deliberately bias data in favor of unimodal distributions, although obviously it is the bimodal distribution that should contribute to clustering most. This is why the range, not the standard deviation, is used here as the scaling factor $b_v$. In the case when there can be outliers in data, which may highly affect the range, another more stable range-like scaling factor can be chosen, such as the difference between percentiles, that does not much depend on the distribution shape. The range based scaling option has been supported experimentally in [87].

3. Rescaling: This stage rescales column-features $v$ which come from the same categorical variable $l$ back, by further dividing $y_{iv}$ by supplementary rescaling coefficients $b'_v$ to restore the original weighting of raw variables. The major assumption in clustering is that all raw variables are supposed to be of equal weight. Having its categories enveloped, the "weight" of nominal variable $l$ becomes equal to the summary contribution $T_l = \sum_{v \in V_l} T_v$ to the data scatter, where $V_l$ is the set of categories belonging to $l$. Therefore, to restore the original "equal weighting" of $l$, the total contribution $T_l$ must be related to $|V_l|$, which is achieved by taking $b'_v = \sqrt{|V_l|}$ for $v \in V_l$. For a quantitative $v \in V$, $b'_v$ is, typically, unity. Sometimes an expert evaluation of the relative weights of the original variables $l$ is available. If such is the case, the rescaling coefficients $b'_v$ should be redefined with the square roots of the expert supplied relative weights. This option may be applied to both quantitative and qualitative features.

Note that two of the three steps above refer to categorical features.

Example 2.6. Effects of different scaling options

Table 2.15 presents the Masterpieces data in Table 2.13 standardized according to the most popular transformation of feature scales, the so-called z-scoring, when the scales are shifted to their mean values and then normalized by the standard deviations. Table 2.16 presents the same data matrix range standardized. All feature contributions are different in this table except for those of NC, Ob and Pe, which are the same. Why the same? Because they have the same variance $p(1 - p)$ corresponding to $p$ or $1 - p$ equal to 3/8.

Table 2.15: Std standardized Masterpieces matrix; Mean is grand mean, Std the standard deviation and Cntr the relative contribution of a variable.

            LS     LD     SC     NC     Ob     Pe     Di
1          -0.6    0.7   -0.9   -1.2    1.2   -0.7   -0.5
2           1.2    0.1    0.0   -1.2    1.2   -0.7   -0.5
3           0.3    0.3    0.0   -1.2   -0.7    1.2   -0.5
4          -0.7   -0.5   -0.9    0.7    1.2   -0.7   -0.5
5           0.6   -0.9    0.0    0.7   -0.7    1.2   -0.5
6          -1.8   -1.3   -0.9    0.7   -0.7    1.2   -0.5
7           0.3   -0.3    0.9    0.7   -0.7   -0.7    1.6
8           0.8    1.8    1.9    0.7   -0.7   -0.7    1.6
Mean       22.4   34.1    3.0    0.6    0.4    0.4    0.3
Std         5.6   12.9    1.1    0.5    0.5    0.5    0.5
Cntr, %    14.3   14.3   14.3   14.3   14.3   14.3   14.3
Table 2.16: Range standardized Masterpieces matrix; Mean is grand mean, Range the range and Cntr the relative contribution of a variable.

            LS     LD     SC     NC     Ob     Pe     Di
1          -0.2    0.2   -0.3   -0.6    0.6   -0.4   -0.3
2           0.4    0.0    0.0   -0.6    0.6   -0.4   -0.3
3           0.1    0.1    0.0   -0.6   -0.4    0.6   -0.3
4          -0.2   -0.2   -0.3    0.4    0.6   -0.4   -0.3
5           0.2   -0.3    0.0    0.4   -0.4    0.6   -0.3
6          -0.6   -0.4   -0.3    0.4   -0.4    0.6   -0.3
7           0.1   -0.1    0.3    0.4   -0.4   -0.4    0.8
8           0.3    0.6    0.7    0.4   -0.4   -0.4    0.8
Mean       22.4   34.1    3.0    0.6    0.4    0.4    0.3
Range      17.3   41.1    3.0    1.0    1.0    1.0    1.0
Cntr, %     7.8    7.3    9.4   19.9   19.9   19.9   15.9

We can see how overrated the summary contribution of the qualitative variable Narrative becomes: the three dummy columns on the right in Table 2.16 account for 55.7% of the data scatter and thus highly affect any further results. This is why further rescaling of these three variables by $\sqrt{3}$ is needed to decrease their total contribution 3 times. Table 2.17 presents results of this operation applied to data in Table 2.16. Note that the total Narrative's contribution per cent has not changed as much as we would expect: about two times, yet not three times! The reason is that the data scatter itself, the denominator of the relative contribution, decreases as well.

Table 2.17: Range standardized Masterpieces matrix with the additionally rescaled nominal feature attributes; Mean is grand mean, Range the range and Cntr the relative contribution of a variable.

            LS      LD      SC      NC      Ob      Pe      Di
1          -0.20    0.23   -0.33   -0.63    0.36   -0.22   -0.14
2           0.40    0.05    0.00   -0.63    0.36   -0.22   -0.14
3           0.08    0.09    0.00   -0.63   -0.22    0.36   -0.14
4          -0.23   -0.15   -0.33    0.38    0.36   -0.22   -0.14
5           0.19   -0.29    0.00    0.38   -0.22    0.36   -0.14
6          -0.60   -0.42   -0.33    0.38   -0.22    0.36   -0.14
7           0.08   -0.10    0.33    0.38   -0.22   -0.22    0.43
8           0.27    0.58    0.67    0.38   -0.22   -0.22    0.43
Mean       22.45   34.13    3.00    0.63    0.38    0.38    0.25
Range      17.30   41.10    3.00    1.00    1.73    1.73    1.73
Cntr, %    12.42   11.66   14.95   31.54   10.51   10.51    8.41

Figure 2.5: Masterpieces on the plane of the first two principal components at four different standardizations: no scaling (top left), scaling with standard deviations (top right), range scaling (bottom left), and range scaling with the follow-up rescaling (bottom right).

Figure 2.5 shows how important the scaling can be for clustering results. It displays mutual locations of the eight masterpieces on the plane of the first two principal components of the data in Table 2.13 at different scaling factors: (a) left top: no scaling at all; (b) right top: scaling by the standard deviations, see Table 2.15; (c) left bottom: scaling by ranges; (d) right bottom: scaling by ranges with the follow-up rescaling of the three dummy variables for categories of Narrative by taking into account that they come from the same nominal feature. The scale shifting parameter is always the variable's mean. The left top scatter-plot displays no relation to the novels' authorship. Probably no clustering algorithm can properly identify the author classes with this standardization of the data (Table 2.15). On the contrary, the authorship pattern is clearly displayed on the bottom right figure, and it is likely that any reasonable clustering algorithm will capture it with this standardization. We can clearly see on Figure 2.5 that, in spite of the unidimensional nature of transformation (2.22), its combination of shifts and scales can be quite powerful in changing the geometry of the data.
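The three-stage procedure (enveloping categories, range standardization, rescaling of dummy columns) can be sketched in Python as follows. The table `rows` and all names below are hypothetical and chosen for illustration; the raw Masterpieces data of Table 2.13 are not reproduced here, and the sketch assumes that every column has a non-zero range.

```python
import math

# Hypothetical mixed-scale table: one quantitative, one binary, one nominal feature.
rows = [
    {"length": 10.0, "flag": "Yes", "style": "a"},
    {"length": 14.0, "flag": "No", "style": "b"},
    {"length": 18.0, "flag": "No", "style": "c"},
    {"length": 30.0, "flag": "Yes", "style": "a"},
]

def envelop(rows):
    """Stage 1: return the quantitative columns and, for each column, the
    number of columns stemming from the same raw variable."""
    columns = {
        "length": [r["length"] for r in rows],                        # kept as is
        "flag": [1.0 if r["flag"] == "Yes" else 0.0 for r in rows],   # one 0/1 column
    }
    group_size = {"length": 1, "flag": 1}
    categories = sorted({r["style"] for r in rows})
    for c in categories:                                              # one dummy per category
        name = "style=" + c
        columns[name] = [1.0 if r["style"] == c else 0.0 for r in rows]
        group_size[name] = len(categories)
    return columns, group_size

def standardize(column, group_size=1):
    """Stages 2 and 3: shift to the mean, divide by the range, then by the
    square root of the number of sibling dummy columns."""
    mean = sum(column) / len(column)
    spread = max(column) - min(column)   # assumed non-zero in this sketch
    return [(x - mean) / (spread * math.sqrt(group_size)) for x in column]

columns, group_size = envelop(rows)
Y = {name: standardize(col, group_size[name]) for name, col in columns.items()}

# Relative contributions of the standardized columns to the data scatter.
scatter = sum(y * y for col in Y.values() for y in col)
for name, col in Y.items():
    share = 100 * sum(y * y for y in col) / scatter
    print(f"{name:8s} {[round(y, 2) for y in col]}  contribution {share:.1f}%")
```

Replacing the range in `standardize` by the standard deviation would give z-scoring, the option the text argues against.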
Example 2.7. Relative feature weighting under standard deviations and ranges may differ

For the Market town data, with N = 45 and n = 12, the summary feature characteristics are shown in Table 2.18. As proven above, the standard deviations in Table 2.18 are at least twice as small as the ranges, which is true for all data tables. However, the ratio of the range over the standard deviation may differ for different features, reaching as much as 3/0.6 = 5 for DIY. Therefore, using standard deviations and ranges for scaling in (2.22) may lead to differences in relative scales between the variables and, thus, to different clustering results, as with the Masterpieces data.

Table 2.18: Summary characteristics of the Market town data; Mean is grand mean, Std the standard deviation, Range the range and Cntr the relative contribution of a variable.

           P         PS     Do     Ho     Ba     Su     Pe     DIY    SP     PO     CAB    FM
Mean       7351.4    3.0    1.4    0.4    4.3    1.9    2.0    0.2    0.5    2.6    0.6    0.2
Std        6193.2    2.7    1.3    0.6    4.4    1.7    1.6    0.6    0.6    2.1    0.6    0.4
Range      21761.0   12.0   4.0    2.0    19.0   7.0    7.0    3.0    2.0    8.0    2.0    1.0
Cntr, %    8.5       5.4    11.1   8.8    5.6    6.3    5.7    4.2    10.3   7.3    9.7    17.1

Are there any regularities in the effects of data standardization (and rescaling) on the data scatter and feature contributions to it? Not many. But there are items that should be mentioned:
Effect of shifting to the average. With shifting to the averages, the feature contributions and variances become proportional to each other, $T_v = N s_v^2$, for all $v \in V$.

Effects of scaling of categories. Let us take a look at the effect of scaling and rescaling coefficients on categories. The contribution of a binary attribute $v$, standardized according to (2.22), becomes $T_v = N p_v (1 - p_v)/(b_v)^2$ where $p_v$ is the relative frequency of category $v$. Up to the factor $N$, this is either $p_v (1 - p_v)$ or 1 or $1 - p_v$ depending on whether $b_v = 1$ (range) or $b_v = \sqrt{p_v (1 - p_v)}$ (Bernoulli standard deviation) or $b_v = \sqrt{p_v}$ (Poisson standard deviation), respectively. These can give some guidance in rescaling of binary categories: the first option should be taken when both zeros and ones are equally important, the second when the distribution does not matter and the third when it is only unities that matter.

Identity of binary and two-category features. An important issue faced by the user is how to treat a categorical feature with two categories such as gender (Male/Female) or voting (Democrat/Republican) or belongingness to a group (Yes/No). The three-step procedure of standardization makes the issue irrelevant: there is no difference between a two-category feature and either of its binary representations. Indeed, let $x$ be a two-category feature assigning each entity $i \in I$ with a category 'eks1' or 'eks2' whose relative frequencies are $p_1$ and $p_2$ such that $p_1 + p_2 = 1$. Denote by $y1$ a binary feature corresponding to category 'eks1' so that $y1_i = 1$ if $x_i$ = eks1 and $y1_i = 0$ if $x_i$ = eks2. Analogously define a binary feature $y2$ corresponding to category 'eks2'. Obviously, $y1$ and $y2$ complement each other so that their sum makes an all unity vector. In the first stage of the standardization procedure, the user decides whether $x$ is to be converted to a binary feature or left as a categorical one. In the former case, $x$ is converted to a dummy feature, say, column $y1$; in the latter case, $x$ is converted into a two-column submatrix consisting of columns $y1$ and $y2$. Since the averages of $y1$ and $y2$ are $p_1$ and $p_2$, respectively, after shifting column $y1$'s entries become $1 - p_1$, for 1, and $-p_1$, for 0. Respective entries of $y2$ are shifted to $-p_2$ and $1 - p_2$, which can be expressed through $p_1 = 1 - p_2$ as $p_1 - 1$ and $p_1$. That means that $y2 = -y1$ after the shifting: the two columns become identical, up to the sign. That implies that all their square contribution characteristics become the same, including the total contributions to the data scatter, so that the total contribution of the two-column submatrix is twice the contribution of a single column $y1$, whichever scaling option is accepted. However, further rescaling the two-column submatrix by the recommended $\sqrt{2}$ restores the balance of contributions: the two-column submatrix contributes as much as a single column.

Total contributions of categorical features. The total contribution of nominal variable $l$ is

$$T_l = N \sum_{v \in V_l} p_v (1 - p_v)/(b_v)^2. \tag{2.23}$$

Depending on the choice of scaling coefficients $b_v$, this can be

1. $T_l = N (1 - \sum_{v \in V_l} p_v^2)$ if $b_v = 1$ (range normalization);
2. $T_l = N |V_l|$ if $b_v = \sqrt{p_v (1 - p_v)}$ (Bernoulli normalization);
3. $T_l = N (|V_l| - 1)$ if $b_v = \sqrt{p_v}$ (Poisson normalization),

where $|V_l|$ is the number of categories of $l$. The factor $1 - \sum_{v \in V_l} p_v^2$ in the first expression is the Gini coefficient (2.3) of the distribution of $v$.

The square roots of these should be used for further rescaling qualitative categories stemming from the same nominal variable $l$ to adjust their total impact on the data scatter.
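The three closed-form expressions for the total contribution of a nominal variable can be verified by direct computation. The following sketch builds the dummy columns of a hypothetical nominal feature (`labels` is made up for illustration), standardizes them with each of the three scaling options and compares the sums of squares with the formulas.

```python
import math

def total_contribution(labels, scaling):
    """Sum of squared standardized dummy entries over all categories of a
    nominal feature; scaling is 'range', 'bernoulli' or 'poisson'."""
    N = len(labels)
    total = 0.0
    for category in set(labels):
        dummy = [1.0 if lab == category else 0.0 for lab in labels]
        p = sum(dummy) / N   # relative frequency of the category, also the column mean
        b = {"range": 1.0,
             "bernoulli": math.sqrt(p * (1 - p)),
             "poisson": math.sqrt(p)}[scaling]
        total += sum(((x - p) / b) ** 2 for x in dummy)
    return total

# Hypothetical nominal feature with three categories over N = 8 entities.
labels = ["a", "a", "a", "b", "b", "c", "c", "c"]
N = len(labels)
K = len(set(labels))   # number of categories, |V_l|
gini = 1 - sum((labels.count(c) / N) ** 2 for c in set(labels))

print(total_contribution(labels, "range"), N * gini)
print(total_contribution(labels, "bernoulli"), N * K)
print(total_contribution(labels, "poisson"), N * (K - 1))
```

The sketch assumes that no category covers all entities, since the Bernoulli scaling factor would then be zero.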
2.5 Other table data types
2.5.1 Dissimilarity and similarity data
In many cases the entity-to-entity dissimilarity or similarity data is the preferred format of data as derived from more complex data such as Primates in Table 1.2 or as directly resulting from observations as Confusion data in Table 1.9. Similarity scoring is especially important for treating the so-called "wide" data tables in which the number of features is much greater than the number of entities. Such is the case of unstructured textual documents for which the presence or absence of a keyword is a feature. The number of meaningful keywords may go into hundreds of thousands even when the entire text collection is in dozens or hundreds. In this case, bringing in a text-to-text similarity index may convert the problem from a virtually untreatable one into a modest size clustering exercise.

An entity-to-entity similarity index may appear as the only data for clustering. For example, similarity scores may come from experiments on subjective judgements such as scoring an individual's evaluation of similarity between stimuli or products. More frequently, though, similarities are used when entities to be clustered are too complex to be put in the entity-to-feature table format. When considering two biomolecular amino acid sequences (proteins), a similarity score between them can be based on the probability of transformation of one of them into the other with evolutionary meaningful operations of substitution, deletion and insertion of amino acids [70].

The terminology reflects the differences between the two types of proximity scoring: the smaller the dissimilarity coefficient the closer the entities are, whereas the opposite holds for similarities. Also, the dissimilarity is conventionally considered as a kind of extended distance, thus satisfying some distance properties. In particular, given a matrix $D = (d_{ij})$, $i, j \in I$, where $I$ is the entity set, the entries $d_{ij}$ form a dissimilarity measure between entities $i, j \in I$ if $D$ satisfies the properties:

(a) Symmetry: $d_{ij} = d_{ji}$.
(b) Non-negativity: $d_{ij} \geq 0$.
(c) Semi-definiteness: $d_{ij} = 0$ if entities $i$ and $j$ coincide.

A dissimilarity measure is referred to as a distance if it additionally satisfies:

(d) Definiteness: $d_{ij} = 0$ if and only if entities $i$ and $j$ coincide.
(e) Triangle inequality: $d_{ij} \leq d_{il} + d_{lj}$ for any $i, j, l \in I$.

A distance is referred to as an ultra-metric if it satisfies a stronger triangle inequality:

(f) Ultra-triangle inequality: $d_{ij} \leq \max(d_{il}, d_{lj})$ for any $i, j, l \in I$.

In fact, the ultra-triangle inequality states that among any three distances $d_{ij}$, $d_{il}$ and $d_{jl}$, two are equal to each other and the third cannot be greater than that. Ultra-metrics emerge as distances between leaves of trees; in fact, they are equivalent to some tree structures such as the heighted upper cluster hierarchies considered in section 5.3.1. No such properties are assumed for similarity data except, sometimes, for symmetry (a).
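Properties (a) to (f) are straightforward to test on a given matrix. The function `classify` below is a hypothetical helper, not part of the text; it is a sketch that assumes distinct indices denote distinct entities, so that definiteness means strictly positive off-diagonal entries.

```python
from itertools import product

def classify(d, tol=1e-12):
    """Return the strongest property satisfied by a square matrix d:
    'ultra-metric', 'distance', 'dissimilarity', or None."""
    idx = range(len(d))
    symmetric = all(abs(d[i][j] - d[j][i]) <= tol for i, j in product(idx, idx))
    non_negative = all(d[i][j] >= -tol for i, j in product(idx, idx))
    zero_diagonal = all(abs(d[i][i]) <= tol for i in idx)
    if not (symmetric and non_negative and zero_diagonal):
        return None
    # Definiteness: distinct indices are assumed to be distinct entities.
    definite = all(d[i][j] > tol for i, j in product(idx, idx) if i != j)
    triangle = all(d[i][j] <= d[i][l] + d[l][j] + tol
                   for i, j, l in product(idx, idx, idx))
    if not (definite and triangle):
        return "dissimilarity"
    ultra = all(d[i][j] <= max(d[i][l], d[l][j]) + tol
                for i, j, l in product(idx, idx, idx))
    return "ultra-metric" if ultra else "distance"

# Hypothetical examples: three points 0, 1, 2 on a line.
line = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]       # Euclidean distances
squared = [[0, 1, 4], [1, 0, 1], [4, 1, 0]]    # Euclidean distances squared
tree = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]       # two largest distances are equal

print(classify(line), classify(squared), classify(tree))
# distance dissimilarity ultra-metric
```

The second example shows that Euclidean distance squared may violate the triangle inequality, so that in these terms it is a dissimilarity rather than a distance.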
Standardization of similarity data
Given a similarity measure, all entity-to-entity similarities are measured in the same scale so that its change will not change clustering results. This is why there is no need to change the scale of similarity data. As to the shift in the origin of the similarity measure, this can be of advantage by making within- and between-cluster similarities more contrasted. Figure 2.6 demonstrates the effect of changing a positive similarity measure $s_{ij}$ to $a_{ij} = s_{ij} - a$ by subtracting a threshold $a > 0$: small similarities $s_{ij} < a$ can be transformed into negative similarities $a_{ij}$. This can be irrelevant, as for example in such clustering methods as single linkage or similarity-based K-Means. But there are methods such as ADDI-S in section 5.5.5 which can be quite sensitive to the threshold $a$. Shift of the origin can be a useful option in standardizing similarity data.

Figure 2.6: A pattern of similarity $a_{ij} = s_{ij} - a$ values depending on a subtracted threshold $a$.
Standardization of dissimilarity data
Given a dissimilarity measure $d_{ij}$, $i, j \in I$, it is frequently standardized by transforming it into both a row- and column-wise centered similarity measure according to the formula:

$$s_{ij} = -(d_{ij} - d_{i\cdot} - d_{\cdot j} + d_{\cdot\cdot})/2 \tag{2.24}$$

where the dot denotes the operation of averaging over the corresponding index so that $d_{i\cdot} = \sum_{j \in I} d_{ij}/N$, $d_{\cdot j} = \sum_{i \in I} d_{ij}/N$, and $d_{\cdot\cdot} = \sum_{i, j \in I} d_{ij}/(N \times N)$.

This formula can be applied to any dissimilarity measure, but it is especially suitable in the situation in which $d_{ij}$ is Euclidean distance squared, that is, when $d_{ij} = (x_i - x_j, x_i - x_j)$ for some multidimensional $x_i$, $x_j$, $i, j \in I$. It can be proven then that $s_{ij}$ in (2.24) is the inner product $s_{ij} = (x_i, x_j)$ for all $i, j \in I$ if all $x_i$ are centered.
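The claim about inner products can be checked on a small example. The sketch below takes three hypothetical centered points, computes their Euclidean distances squared, applies the double-centering transformation (subtracting the row and column averages, adding the grand average back, and halving with the opposite sign) and compares the result with the inner products.

```python
def inner(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    """Euclidean distance squared."""
    diff = [a - b for a, b in zip(x, y)]
    return inner(diff, diff)

# Hypothetical centered points: each coordinate sums to zero over the entities.
points = [[1.0, 2.0], [3.0, 0.0], [-4.0, -2.0]]
N = len(points)

d = [[sq_dist(xi, xj) for xj in points] for xi in points]

row_mean = [sum(d[i]) / N for i in range(N)]                        # d_i.
col_mean = [sum(d[i][j] for i in range(N)) / N for j in range(N)]   # d_.j
grand_mean = sum(row_mean) / N                                      # d_..

# Double centering of the dissimilarities.
s = [[-(d[i][j] - row_mean[i] - col_mean[j] + grand_mean) / 2
      for j in range(N)] for i in range(N)]

for i in range(N):
    print([round(v, 6) for v in s[i]], [inner(points[i], xj) for xj in points])
```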
2.5.2 Contingency and flow data

Table $F = (f_{ij})$, $i \in I$, $j \in J$, is referred to as a flow table if every entry expresses a quantity of the same matter in such a way that all of the entries can be meaningfully summed up to a number expressing the total amount of the matter in the data. Examples of flow data tables are: (a) contingency tables counting numbers of co-occurred instances; (b) mobility tables counting numbers of individual members of a group having changed their categories; (c) trade tables showing the money transferred from $i$ to $j$ during a specified period.
This type of data is of particular interest in processing massive information sources. The data itself can be untreatable within time-memory constraints, but by counting co-occurrences of categories of interest in a sample it can be pre-processed into the flow data format and analyzed as such.

The nature of the data associates weights with row categories $i \in I$, $f_{i+} = \sum_{j \in J} f_{ij}$, and column categories $j \in J$, $f_{+j} = \sum_{i \in I} f_{ij}$, the total flow from $i$ and that to $j$. The total flow volume is $f_{++} = \sum_{i \in I} \sum_{j \in J} f_{ij}$, which is the summary weight, $f_{++} = \sum_{j \in J} f_{+j} = \sum_{i \in I} f_{i+}$. Extending concepts introduced in section 2.2.3 for contingency tables to the general flow data, we can extend the definition of the relative Quetelet index as

$$q_{ij} = \frac{f_{ij} f_{++}}{f_{i+} f_{+j}} - 1 \tag{2.25}$$

This index, in fact, compares the share of $j$ in $i$'s transaction, $p(j/i) = f_{ij}/f_{i+}$, with the share of $j$ in the overall flow, $p(j) = f_{+j}/f_{++}$. Indeed, it is easy to see that $q_{ij}$ is the relative difference of the two, $q_{ij} = (p(j/i) - p(j))/p(j)$. Obviously, $q_{ij} = 0$ when there is no difference. Transformation (2.25) is a standardization of flow data which takes into account the data's nature. Standardization (2.25) is very close to the so-called normalization of Rao and Sabovala widely used in marketing research and, also, the (marginal) cross-product ratio utilized in the analysis of contingency data. Both can be expressed as $p_{ij}$ transformed into $q_{ij} + 1$.

Example 2.8. Quetelet coefficients for Confusion data

Coefficients (2.25) are presented in Table 2.19. One can see from the table that the numerals overwhelmingly respond to themselves. However, there are also a few positive entries in the table outside of the diagonal. For instance, 7 is perceived as 1 with the frequency 87.9% greater than the average, and 3 is perceived as 9, and vice versa, with also higher frequencies than the average.
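A small sketch of the relative Quetelet index follows. The 2 x 2 table `flow` is hypothetical (the Confusion data are not reproduced here); the code computes the index from the flow totals and checks that it coincides with the relative difference between the share of $j$ in row $i$ and the share of $j$ overall.

```python
# Hypothetical 2 x 2 flow table f_ij (rows i, columns j).
flow = [[30, 10],
        [20, 40]]

row_totals = [sum(row) for row in flow]          # f_i+
col_totals = [sum(col) for col in zip(*flow)]    # f_+j
total = sum(row_totals)                          # f_++

def quetelet(i, j):
    """Relative Quetelet index q_ij = f_ij f_++ / (f_i+ f_+j) - 1."""
    return flow[i][j] * total / (row_totals[i] * col_totals[j]) - 1

def relative_difference(i, j):
    """(p(j/i) - p(j)) / p(j), the same quantity in terms of shares."""
    share_in_row = flow[i][j] / row_totals[i]
    share_overall = col_totals[j] / total
    return (share_in_row - share_overall) / share_overall

for i in range(2):
    for j in range(2):
        print(i, j, round(quetelet(i, j), 4), round(relative_difference(i, j), 4))
```

In this example the first row sends 75% of its flow to the first column against 50% overall, so the corresponding index is 0.5, that is, 50% above the average.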
Taking into account that the flow data entries can be meaningfully summed up to the total flow volume, the distance between row entities in a contingency table is defined by weighting the columns by their "masses" $p_{+j}$ as follows:

$$\chi(k, k') = \sum_{j \in J} p_{+j} (q_{kj} - q_{k'j})^2. \tag{2.26}$$
This is equal to the so-called chi-square distance defined between row conditional profiles in the Correspondence factor analysis (see section 5.1.4). The formula (2.26) is similar to formula (2.17) for Euclidean distance squared except for the weights. A similar chi-square distance can be defined for columns. The concept of data scatter for contingency data is also introduced with weighting:

$$X^2(P) = \sum_{i \in I} \sum_{j \in J} p_{i+} p_{+j} q_{ij}^2 \tag{2.27}$$

The notation reflects the fact that this value is closely connected with the Pearson chi-square contingency coefficient, defined in (2.13) above (thus to $Q^2$ (2.12) as well). Elementary algebraic manipulations show that $X^2(P) = X^2$. Note that in this context the chi-squared coefficient has nothing to do with statistical independence: it is just a weighted data scatter measure compatible with the specific properties of flow and contingency data. In particular, it is not difficult to prove an analogue to a property of the conventional data scatter: $X^2(P)$ is the sum of chi-square distances $\chi(k, 0)$ over all $k$. Thus, Pearson's chi-squared coefficient here emerges as the data scatter after the data have been standardized into Quetelet coefficients. For Table 1.9 transformed here into Table 2.19 the coefficient is equal to 4.21, which is 47% of 9, the maximum value of the coefficient for $10 \times 10$ tables.
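The equality between the weighted scatter of Quetelet coefficients and the Pearson chi-squared coefficient can be checked numerically. The sketch below uses a hypothetical 2 x 2 table and assumes that the chi-squared coefficient is taken on relative frequencies, that is, not multiplied by the number of observations; formula (2.13) is not reproduced here, so this normalization is an assumption.

```python
# Hypothetical 2 x 2 flow table and its relative frequencies p_ij.
flow = [[30, 10],
        [20, 40]]
total = sum(sum(row) for row in flow)
p = [[f / total for f in row] for row in flow]
p_row = [sum(row) for row in p]          # p_i+
p_col = [sum(col) for col in zip(*p)]    # p_+j
cells = [(i, j) for i in range(len(p)) for j in range(len(p[0]))]

def q(i, j):
    """Relative Quetelet index in terms of relative frequencies."""
    return p[i][j] / (p_row[i] * p_col[j]) - 1

# Weighted data scatter of the Quetelet coefficients.
scatter = sum(p_row[i] * p_col[j] * q(i, j) ** 2 for i, j in cells)

# Pearson chi-squared coefficient on relative frequencies (assumed normalization).
pearson = sum((p[i][j] - p_row[i] * p_col[j]) ** 2 / (p_row[i] * p_col[j])
              for i, j in cells)

print(round(scatter, 6), round(pearson, 6))
```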
K-Means Clustering
After reading this chapter the reader will know about:

1. Straight and incremental K-Means.
2. The instability of K-Means with regard to initial centroids.
3. An anomalous cluster version of K-Means for incomplete clustering.
4. Three approaches to the initial setting in K-Means: random, maxmin and anomalous pattern.
5. An intelligent version of K-Means mitigating issues of the initial setting and interpretation.
6. Cross-validation of clustering results.
7. Conventional and contribution based interpretation aids for K-Means.
Anomalous pattern: [...] from the so-called reference point, which may coincide with the grand mean of the entity set. The method works as K-Means at K=2 except for the location of the reference point, which is never changed.

Centroid: [...] cluster's elements. If the distance is Euclidean squared, the centroid is equal to the center of gravity of the cluster.

Cluster representative: An entity that is considered to represent its cluster well. Conventionally such an entity is drawn as that nearest to the cluster centroid in the Euclidean space. The theory used here suggests that the nearest entity must be drawn according to the inner product rather than distance, which extends the cluster tendencies over the grand mean.

Contributions to the data scatter: Additive items representing parts of the data scatter that are explained by certain elements of a cluster structure such as feature-cluster pairs. The greater the contribution, the more important the element. Summary contributions coincide with statistical measures of correlation and association, which is a theoretical support to the recommended data standardization rules.

Cross validation: A procedure for testing consistency of a clustering algorithm or its results by the comparison of cluster results found on subsamples formed by a random partitioning of the entity set into a number of groups of equal sizes.

iK-Means: An intelligent version of K-Means, in which an initial set of centroids (seeds) is found with an iterated version of the Anomalous pattern algorithm.

Incremental K-Means: A version of K-Means in which entities are dealt with one-by-one.

Interpretation aids: Computational tools for helping the user to interpret clusters in terms of features, external or used in the process of clustering. Conventional interpretation aids include cluster centroids and bivariate distributions of cluster partitions and features. Contribution based interpretation aids such as ScaD and QScaD tables are derived from the decomposition of the data scatter into parts explained and unexplained by the clustering.

K-Means: A major clustering method producing a partition of the entity set into non-overlapping clusters along with within-cluster centroids. It proceeds in iterations consisting of two steps each: one step updates clusters according to the Minimum distance rule, the other step updates centroids as the centers of gravity of clusters. The method implements the so-called alternating minimization algorithm for the square error criterion. To initialize the computations, either a partition or a set of all K tentative centroids must be specified.

Minimum distance rule: The rule which assigns each of the entities to its nearest centroid.

Reference point: A vector in the variable space serving as the space origin. The Anomalous pattern is sought starting from an entity which is the farthest from the reference point, which thus models the norm from which the Anomalous pattern deviates most.

ScaD and QScaD tables: Interpretation aids helping to capture cluster-specific features that are relevant to K-Means clustering results. ScaD is a cluster-to-feature table whose entries are cluster-to-feature contributions to the data scatter. QScaD is a table of the relative Quetelet coefficients of the ScaD entries to express how much they differ from the average.

Square error criterion: The sum of summary distances from cluster elements to the cluster centroids, which is minimized by K-Means. The distance used is the Euclidean distance squared, which is compatible with the least-squares data recovery criterion.
3.1 Conventional K-Means
K-Means is a major clustering technique that is present, in various forms, in major statistical packages such as SPSS [42] and SAS [17, 119] and data mining packages such as Clementine [14], iDA tool [114] and DBMiner [44]. The algorithm is appealing in many aspects. Conceptually it may be considered a model for the human process of making a typology. Also, it has nice mathematical properties. This method is computationally easy, fast and memory-efficient. However, there are some problems too, especially with respect to the initial setting and stability of results, which will be dealt with in section 3.2.

The cluster structure in K-Means is a partition of the entity set in $K$ non-overlapping clusters represented by lists of entities and within cluster means of the variables. The means are aggregate representations of clusters and as such they are sometimes referred to as standard points or centroids or prototypes. These terms are considered synonymous in the remainder of the text. More formally, the cluster structure is represented by subsets $S_k \subseteq I$ and $M$-dimensional centroids $c_k = (c_{kv})$, $k = 1, \ldots, K$. Subsets $S_k$ form partition $S = \{S_1, \ldots, S_K\}$ with a set of centroids $c = \{c_1, \ldots, c_K\}$.
3.1.1 Straight K-Means
Example 3.9. Centroids of author clusters in Masterpieces data
Let us consider the author-based clusters in the Masterpieces data. The cluster structure is presented in Table 3.1 in such a way that the centroids are calculated twice, once for the raw data in Table 2.13 and the second time, with the standardized data in Table 3.2, which is a copy of Table 2.17 of the previous chapter.
Given K M-dimensional vectors $c_k$ as cluster centroids, the algorithm updates cluster lists $S_k$ according to the so-called Minimum distance rule. The Minimum distance rule assigns entities to their nearest centroids. Specifically, for each entity $i \in I$, its distances to all centroids are calculated, and the entity is assigned to the nearest centroid. When there are several nearest centroids, the assignment is taken among them arbitrarily. In other words, $S_k$ is made of all such $i \in I$ that $d(i, c_k)$ is minimum over all centroids from $c = \{c_1, \dots, c_K\}$. The Minimum distance rule is popular in data analysis and can be found under different names such as Voronoi diagrams and vector learning quantization. In general, some centroids may be assigned no entity at all with this rule.

Table 3.1: Means of the variables in Table 3.2 within K=3 author-based clusters, real (upper row) and standardized (lower row).

                                           Mean
Cl.  List      LS (f1)  LD (f2)  NC (f3)  SC (f4)  P (f5)  O (f6)  D (f7)
1    1, 2, 3   24.1     39.2     2.67     0        0.67    0.33    0
               0.095    0.124    -0.111   -0.625   0.168   -0.024  -0.144
2    4, 5, 6   18.7     22.4     2.33     1        0.33    0.67    0
               -0.215   -0.286   -0.222   0.375    -0.024  0.168   -0.144
3    7, 8      25.6     44.1     4.50     1        0.00    0.00    1
               0.179    0.243    0.500    0.375    -0.216  -0.216  0.433

Having cluster lists updated with the Minimum distance rule, the algorithm updates centroids as gravity centers of the cluster lists $S_k$; the gravity center coordinates are defined as within cluster averages, that is, updated centroids are defined as $c_k = c(S_k)$, $k = 1, \dots, K$, where $c(S)$ is a vector whose components are averages of features over $S$. Then the process is reiterated until clusters do not change.

Recall that the distance referred to is the Euclidean squared distance defined, for any $M$-dimensional $x = (x_v)$ and $y = (y_v)$, as $d(x, y) = (x_1 - y_1)^2 + \dots + (x_M - y_M)^2$.
Example 3.10. Minimum distance rule at author cluster centroids in Masterpieces data
Let us apply the Minimum distance rule to entities in Table 3.2, given the standardized centroids in Table 3.1. The matrix of distances between the standardized eight row points in Table 3.2 and three centroids in Table 3.1 is in Table 3.3. The table shows that points 1, 2, 3 are nearest to centroid $c_1$, 4, 5, 6 to $c_2$, and 7, 8 to $c_3$, which is boldfaced. This means that the rule does not change clusters. These clusters will have the same centroids. Thus, no further calculations can change the clusters: the author-based partition is to be accepted as the result.
Let us now explicitly formulate the algorithm, which will be referred to as straight K-Means. Sometimes the same procedure is referred to as batch K-Means or parallel K-Means.
Straight K-Means

0. Data pre-processing. Transform data into a quantitative matrix $Y$. This can be done according to the three step procedure described in section 2.4.
1. Initial setting. Choose the number of clusters, K, and tentative centroids $c_1, c_2, \dots, c_K$, frequently referred to as seeds. Assume initial cluster lists $S_k$ empty.
2. Clusters update. Given K centroids, determine clusters $S'_k$ ($k = 1, \dots, K$) with the Minimum distance rule.
3. Stop-condition. Check whether $S' = S$. If yes, end with clustering $S = \{S_k\}$, $c = \{c_k\}$. Otherwise, change $S$ for $S'$.
4. Centroids update. Given clusters $S_k$, calculate within cluster means $c_k$ ($k = 1, \dots, K$) and go to Step 2.

This algorithm usually converges fast, depending on the initial setting. Location of the initial seeds may affect not only the speed of convergence but, more importantly, the final results as well. Let us give examples of how the initial setting may affect results.
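The steps of straight K-Means listed above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names (`sq_dist`, `assign`, `straight_kmeans`), the `max_iter` safeguard, the tie-breaking by lowest centroid index, and the rule that an empty cluster keeps its previous centroid are all assumptions made here, and the six two-dimensional points are hypothetical, not taken from the Masterpieces data.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, y: Point) -> float:
    """Squared Euclidean distance d(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def assign(data: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Minimum distance rule: index of the nearest centroid for each entity.

    Ties are resolved in favour of the lowest centroid index, which is one
    arbitrary choice among those the rule permits.
    """
    return [min(range(len(centroids)), key=lambda k: sq_dist(y, centroids[k]))
            for y in data]


def straight_kmeans(data: Sequence[Point], seeds: Sequence[Point],
                    max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Alternate Steps 2-4 until the cluster lists stop changing."""
    centroids = [list(c) for c in seeds]          # Step 1: tentative centroids
    labels = None                                 # cluster lists start empty
    for _ in range(max_iter):
        new_labels = assign(data, centroids)      # Step 2: clusters update
        if new_labels == labels:                  # Step 3: stop-condition
            break
        labels = new_labels
        for k in range(len(centroids)):           # Step 4: centroids update
            members = [y for y, lab in zip(data, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two visually separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels, centroids = straight_kmeans(data, seeds=[data[0], data[3]])
print(labels)
print(centroids)
```

Running the sketch from different seeds on the same data is a simple way to observe the dependence on the initial setting that the following examples discuss.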
Example 3.11. Successful application of K-Means
Let us apply K-Means to the same Masterpiece data in Table 3.2, this time starting with entities 2, 5 and 7 as tentative centroids (Step 1). To perform Step 2, the matrix of entity-to-centroid distances is computed (see Table 3.4 in which within column minima are boldfaced). The Minimum distance rule produces three cluster lists, $S_1 = \{1, 2, 3\}$, $S_2 = \{4, 5, 6\}$ and $S_3 = \{7, 8\}$. These coincide with the author-based clusters and produce within-cluster means (Step 4) already calculated in Table 3.1. Since these differ from the original tentative centroids (entities 2, 5, and 7), the algorithm returns to Step 2 of assigning clusters around the updated centroids. We do not do this here since the operation has been done already with distances in Table 3.3, which produced the same author-based lists according to the Minimum distance rule. The process thus stops.
Example 3.12. Unsuccessful run of K-Means with different initial seeds
Let us take entities 1, 2 and 3 as the initial centroids (assuming the same data in Table 3.2). The Minimum distance rule, according to entity-to-centroid distances in Table 3.5, leads to cluster lists $S_1 = \{1, 4\}$, $S_2 = \{2\}$ and $S_3 = \{3, 5, 6, 7, 8\}$. With the centroids updated at Step 4 as means of these clusters, a new application of Step 2 leads to slightly changed cluster lists $S_1 = \{1, 4, 6\}$, $S_2 = \{2\}$ and $S_3 = \{3, 5, 7, 8\}$. Their means calculated, it is not difficult to see that the Minimum distance rule does not change clusters anymore. Thus the lists represent the final outcome, which differs from the author-based solution.
The intuitive inappropriateness of the results in this example may be explained by the poor choice of the initial centroids, all by the same author. However, K-Means can lead to inconvenient results even if the initial setting is selected according to clustering by authors.

Table 3.6: Distances between the standardized Masterpiece entities and entities 1, 4, 7 as tentative centroids.

                             Row-point
Centroid   1     2     3     4     5     6     7     8
1          0.00  0.51  0.88  1.15  2.20  2.25  2.30  3.01
4          1.15  1.55  1.94  0.00  0.97  0.87  1.22  2.46
7          2.30  1.90  1.81  1.22  0.83  1.68  0.00  0.61
Example 3.13. Unsuccessful K-Means with author-based initial seeds
With the initial centroids at rows 1, 4, and 7, the entity-to-centroid matrix in Table 3.6 leads to cluster lists $S_1 = \{1, 2, 3\}$, $S_2 = \{4, 6\}$ and $S_3 = \{5, 7, 8\}$ that do not change in the follow-up operations. These results put a piece by Mark Twain among those by Leo Tolstoy. Not a good outcome.
The instability of clustering results with respect to the initial settings leads to a natural question whether there is anything objective in the method at all. Yes, there is. It appears, there is a scoring function, an index, that is minimized by K-Means. To formulate the function, let us define the within cluster error. For a cluster $S_k$ with centroid $c_k = (c_{kv})$, $v \in V$, its square error is defined as the summary distance from its elements to $c_k$:

$$W(S_k, c_k) = \sum_{i \in S_k} d(y_i, c_k) = \sum_{i \in S_k} \sum_{v \in V} (y_{iv} - c_{kv})^2. \qquad (3.1)$$

The square error criterion is the sum of these values over all clusters:

$$W(S, c) = \sum_{k=1}^{K} W(S_k, c_k). \qquad (3.2)$$
Criterion $W(S, c)$ (3.2) depends on two groups of arguments: cluster lists $S_k$ and centroids $c_k$. Criteria of this type are frequently optimized with the so-called alternating minimization algorithm. This algorithm consists of a series of iterations. At each of the iterations, $W(S, c)$ is, first, minimized over $S$, given $c$, and, second, minimized over $c$, given the resulting $S$. This way, at each iteration a set $c$ is transformed into a set $c'$. The calculations stop when $c$ is stabilized, that is, $c' = c$.

Statement 3.3. Straight K-Means is the alternating minimization algorithm for the summary square-error criterion (3.2) starting from seeds $c = \{c_1, \dots, c_K\}$ specified in step 1.

Proof: Equation

$$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k),$$

following from (3.1), implies that, given $c = \{c_1, \dots, c_K\}$, the Minimum distance rule minimizes $W(S, c)$ over $S$. Let us now turn to the problem of minimizing $W(S, c)$ over $c$, given $S$. It is obvious that minimizing $W(S, c)$ over $c$ can be done by minimizing $W(S_k, c_k)$ (3.1) over $c_k$ independently for every $k = 1, \dots, K$. Criterion $W(S_k, c_k)$ is a quadratic function of $c_k$ and, thus, can be optimized with just first-order optimality conditions that the derivatives of $W(S_k, c_k)$ over $c_{kv}$ must be equal to zero for all $v \in V$. These derivatives are equal to $F(c_{kv}) = -2 \sum_{i \in S_k} (y_{iv} - c_{kv})$, $k = 1, \dots, K$, $v \in V$. The condition $F(c_{kv}) = 0$ obviously leads to $c_{kv} = \sum_{i \in S_k} y_{iv} / |S_k|$, which proves that the optimal centroids must be within cluster gravity centers. This proves the statement.
Square-error criterion (3.2) is the sum of distances from entities to their cluster centroids. This can be reformulated as the sum of within cluster variances $\sigma^2_{kv} = \sum_{i \in S_k} (y_{iv} - c_{kv})^2 / N_k$ weighted by the cluster cardinalities:

$$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v \in V} (y_{iv} - c_{kv})^2 = \sum_{v \in V} \sum_{k=1}^{K} N_k \sigma^2_{kv}. \qquad (3.3)$$
Statement 3.3 implies, among other things, that K-Means converges in a finite number of steps because the set of all partitions $S$ over a finite $I$ is finite and $W(S, c)$ is decreased at each change of $c$ or $S$. Moreover, as experiments show, K-Means typically does not move far away from the initial setting of $c$. Considered from the perspective of minimization of criterion (3.2), this leads to the conventional strategy of repeatedly applying the algorithm starting from various randomly generated sets of prototypes to reach as deep a minimum of (3.2) as possible. This strategy may fail especially if the feature set is large because in this case random settings cannot cover the space of solutions in a reasonable time.

Yet, there is a different perspective, of typology making, in which the criterion is considered not as something that must be minimized at any cost but rather a beacon for direction. In this perspective, the algorithm is a model for developing a typology represented by the prototypes. The prototypes should come from an external source such as the advice of experts, leaving to data analysis only their adjustment to real data. In such a situation, the property that the final prototypes are not far away from the original ones is more of an advantage than not. What is important, though, is defining an appropriate, rather than random, initial setting.

The data recovery framework is consistent with this perspective since the model underlying K-Means is based on a somewhat simplistic claim that entities can be represented by their cluster's centroids, up to residuals. This model, according to section 5.2.1, leads to an equation involving K-Means criterion $W(S, c)$ (3.2) and the data scatter $T(Y)$:

$$T(Y) = B(S, c) + W(S, c) \qquad (3.4)$$

where

$$B(S, c) = \sum_{k=1}^{K} \sum_{v \in V} N_k c_{kv}^2. \qquad (3.5)$$

In this way, data scatter $T(Y)$ is decomposed into two parts: that one explained by the cluster structure $(S, c)$, that is, $B(S, c)$, and the other unexplained, that is, $W(S, c)$. The larger the explained part the better the match between clustering $(S, c)$ and data.

Criterion $B(S, c)$ measures the part of the data scatter taken into account by the cluster structure.
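The decomposition (3.4) can be checked numerically on any partition whose centroids are the within cluster means. The sketch below is only an illustration under stated assumptions: the function name `scatter_decomposition` is introduced here, the data scatter is taken as the sum of squared entries of the data matrix, and the six centered points and their labels are hypothetical.

```python
def scatter_decomposition(data, labels):
    """Return (T, B, W): data scatter, explained part (3.5), square error (3.2).

    Centroids are computed as within cluster means, as in K-Means.
    """
    clusters = {}
    for y, k in zip(data, labels):
        clusters.setdefault(k, []).append(y)

    T = sum(v * v for y in data for v in y)       # data scatter T(Y)
    B = 0.0
    W = 0.0
    for members in clusters.values():
        n_k = len(members)
        c_k = [sum(col) / n_k for col in zip(*members)]
        B += n_k * sum(c * c for c in c_k)        # N_k * sum_v c_kv^2
        W += sum((v - c) ** 2 for y in members for v, c in zip(y, c_k))
    return T, B, W


# Hypothetical data, already centered at the origin.
data = [(-2, -1), (-2, 1), (-1, 0), (1, 0), (2, 1), (2, -1)]
labels = [0, 0, 0, 1, 1, 1]
T, B, W = scatter_decomposition(data, labels)
print(T, B, W, f"explained: {100 * B / T:.1f}%")
```

The ratio B/T computed this way is the kind of "explained part" score reported as a percentage in Example 3.14.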
Example 3.14. Explained part of the data scatter
The explained part of the data scatter, $B(S, c)$, is equal to 43.7% of the data scatter $T(Y)$ for partition $\{\{1, 4, 6\}, \{2\}, \{3, 5, 7, 8\}\}$, found with entities 1, 2, 3 as initial centroids. The score is 58.9% for partition $\{\{1, 2, 3\}, \{4, 6\}, \{5, 7, 8\}\}$, found with entities 1, 4, 7 as initial centroids. The score is 64.0% for the author-based partition $\{\{1, 2, 3\}, \{4, 5, 6\}, \{7, 8\}\}$, which is thus superior.
Advice for selecting the number of clusters and tentative centroids at Step 1 will be given in sections 3.2 and 7.5.
3.1.3 Incremental versions of K-Means
Incremental versions of K-Means are those at which Step 2, with its Minimum distance rule, is executed not for all of the entities but for one of them only. There can be two principal reasons for doing so:

R1 The user is not able to operate with the entire data set and takes entities in one by one, because of either the nature of the data generation process or the largeness of the data set sizes. The former cause is typical when clustering is done in real time as, for instance, in an on-line application. Under traditional assumptions of probabilistic sampling of the entities, convergence of the algorithm was explored in paper [83], from which K-Means became known publicly.

R2 The user operates with the entire data set, but wants to smooth the action of the algorithm so that no drastic changes in the cluster contents may occur. To do this, the user may specify an order of the entities and run entities one-by-one in this order for a number of times. (Each of the runs through the data set is referred to as an "epoch" in the neural network discipline.) The result of this may differ from that of Straight K-Means because of different computations. This computation can be especially effective if the order of entities is not constant but depends on their contributions to the criterion optimized by the algorithm. In particular, each entity $i \in I$ can be assigned value $d_i$, the minimum of distances from $i$ to centroids $c_1, \dots, c_K$, so that $i$ minimizing $d_i$ is considered first.

When an entity $y_i$ joins cluster $S_t$ whose cardinality is $N_t$, the centroid $c_t$ changes to $c'_t$ to follow the within cluster average values:

$$c'_t = \frac{N_t}{N_t + 1} c_t + \frac{1}{N_t + 1} y_i.$$

When $y_i$ moves out of cluster $S_t$, the formula remains valid if all pluses are changed for minuses. By introducing the variable $z_i$ which is equal to $+1$ when $y_i$ joins the cluster and $-1$ when it moves out of it, the formula becomes

$$c'_t = \frac{N_t}{N_t + z_i} c_t + \frac{z_i}{N_t + z_i} y_i. \qquad (3.6)$$

Accordingly, the distances from other entities change to $d(y_j, c'_t)$. Because of the incremental setting, the stopping rule of the straight version (reaching a stationary state) is not necessarily applicable here. In case R1, the natural stopping rule is to end when there are no new entities observed. In case R2, the process of running through the entities one-by-one stops when all entities remain in their clusters. Also, the process may stop when a pre-specified number of runs (epochs) is reached. This gives rise to the following version of K-Means.

Incremental K-Means: one entity at a time.

1. Initial setting. Choose the number of clusters, K, and tentative centroids, $c_1, c_2, \dots, c_K$.
2. Getting an entity. Observe an entity $i \in I$ coming either randomly (setting R1) or according to a prespecified or dynamically changing order (setting R2).
3. Cluster update. Apply Minimum distance rule to determine to what cluster list $S_t$ ($t = 1, \dots, K$) entity $i$ should be assigned.
4. Centroid update. Update within cluster centroid $c_t$ with formula (3.6). For the case in which $y_i$ leaves cluster $t'$ (in R2 option), $c_{t'}$ is also updated with (3.6). Nothing is changed if $y_i$ remains in its cluster. Then the stopping condition is checked as described above, and the process moves to observing the next entity (Step 2) or ends (Step 5).
5. Output. Output lists $S_t$ and centroids $c_t$ with accompanying interpretation aids (as advised in section 3.4).

Example 3.15. Smoothing action of incremental K-Means
Let us apply version R2 to the Masterpieces data with the entity order dynamically updated and K = 3 starting with entities 1, 4 and 7 as centroids. Minimum distances $d_i$ to the centroids for the five remaining entities are presented in the first column of Table 3.7 along with the corresponding centroid (iteration 0). Since $d_2 = 0.51$ is minimum among them, entity 2 is put in cluster I whose centroid is changed accordingly. The next column, iteration 1, presents minimum distances to the updated centroids. This time the minimum is at $d_8 = 0.61$, so entity 8 is put in its nearest cluster III and its center is recomputed. In iteration 2, the distances are in column 2. Among remaining entities, 3, 5, and 6, the minimum distance is $d_3 = 0.70$, so 3 is added to its closest cluster I. Thus updated, the centroid of cluster I leads to the change in minimum distances recorded at iteration 3. This time $d_6 = 0.87$ becomes minimum for the remaining entities 5 and 6 so that 6 joins cluster II and, in the next iteration, 5 follows it. Then the partition stabilizes: each entity is closer to its cluster centroid than to any other. The final partition of the set of masterpieces is the author-based one. We can see that this procedure smooths the process indeed: starting from the same centroids in Example 3.13, straight K-Means leads to a different and worse partition.
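The centroid update of formula (3.6), used at Step 4 of incremental K-Means, can be sketched as a small Python function. The name `update_centroid` and the numbers in the usage lines are hypothetical; the guard against emptying a cluster is an assumption added here, since (3.6) is undefined when the only member leaves.

```python
def update_centroid(c, n, y, z):
    """Formula (3.6): centroid after entity y joins (z = +1) or leaves (z = -1)
    a cluster with current centroid c and cardinality n."""
    if z not in (1, -1):
        raise ValueError("z must be +1 (join) or -1 (leave)")
    m = n + z
    if m == 0:
        # assumption: removing the last member is reported, not computed
        raise ValueError("the cluster would become empty")
    return [(n * cv + z * yv) / m for cv, yv in zip(c, y)]


# Hypothetical cluster {(0, 2), (2, 2)} with centroid (1, 2) and cardinality 2.
c = [1.0, 2.0]
joined = update_centroid(c, 2, (4.0, 5.0), +1)       # entity (4, 5) joins
restored = update_centroid(joined, 3, (4.0, 5.0), -1)  # the same entity leaves
print(joined, restored)
```

Because only the cardinality and the current centroid are needed, the update does not require revisiting the other members of the cluster, which is what makes the one-entity-at-a-time setting R1 workable.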
3.2 Initialization of K-Means
To initialize K-Means, one needs to specify: (1) the number of clusters, K, and (2) initial centroids, $c_1, c_2, \dots, c_K$. Each of these can be an issue in practical computations. Both depend on the user's expectations related to the level of resolution and typological attitudes, which remain beyond the scope of the theory of K-Means. This is why some claim these considerations are beyond the clustering discipline. There have been, however, a number of approaches suggested for specifying the number and location of initial centroids, which will be briefly described in section 7.5.1. Here we present, first, the most popular existing approaches and, second, two approaches based on preliminary analysis of the data set structure.

Conventionally, either of two extremes is adhered to in initial setting. One view assumes no knowledge of the data and domain and takes initial centroids randomly; the other, on the contrary, relies on the user being an expert and defining initial centroids as hypothetical prototypes.

The first approach randomly selects K of the entities (or generates K n-dimensional points within the feature ranges) as the initial seeds (centroids), and applies K-Means (either straight or incremental). After repeating this a prespecified number of times (for instance, 100 or 1000), the best solution according to the square-error criterion (3.2) is taken. This approach can be handled by any package containing K-Means. For instance, SPSS allows the taking of the first K entities in a data set as the initial seeds. This can be repeated as many times as needed, each time reformatting the data matrix by putting a random K entity sample as its first K rows.

Selection of K can be done empirically by following this strategy for different values of K, say, in a range from 2 to 15. However, the optimal value of the square-error criterion decreases when K grows and thus cannot be utilized, as is, for the purpose. In the literature, a number of coefficients and tricks have been suggested based on the use of the square error (see later in section 7.5.1). Unfortunately, they all may fail even in the relatively simple situations of controlled computation experiments.

Comparing clusterings found for different K may lead to insights on the cluster structure. In many real world computations, the following phenomenon has been observed by the author and other researchers. When repeatedly proceeding from a larger K to K - 1, the found (K - 1)-clustering, typically, is rather similar to that found by merging some of the clusters in the K-clustering, in spite of the fact that the K- and (K - 1)-clusterings are found independently. However, in the process of decreasing K this way, a critical value of K is reached such that the (K - 1)-clustering doesn't resemble the K-clustering at all. If this is the case, the value of K can be taken as that corresponding to the cluster structure.

This can be a viable strategy. There are two critical points though.

1. The K-Means algorithm, as is, doesn't seek a global minimum of the square-error criterion and, moreover, the local minima achieved with K-Means are not very deep. Thus, with the number of entities of the order of hundreds or thousands and K within a dozen, or more, the number of tries needed to reach a representative set of the initial centroids may become too large and make it a computationally challenging problem. To overcome this, some effective computational strategies have been suggested, such as that of random jumps from a subset of centroids. Such random track changing, typically, produces much deeper minima than the standard K-Means [45].
2. Even if one succeeds in getting a deep or a global minimum of the square-error criterion, it should not be taken for granted that the clusters found reflect the cluster structure. There are some intrinsic flaws in the criterion that would not allow us to accept it as the only means for deciding upon whether the clusters minimizing it are those we are looking for. The square-error criterion needs to be supplemented with other tools for getting better insights into the data structure. Setting of the initial centroids can be utilized as such a tool.

The other approach relies on the opinion of an expert in the subject domain.
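The random-seed strategy described in this section (run K-Means from many randomly chosen entity seeds and keep the run with the smallest value of criterion (3.2)) can be sketched as follows. The function names, the `runs` and `rng` parameters, the fixed random seed, and the toy data are assumptions made for this illustration; the compact `kmeans` helper is a plain straight K-Means written only so that the block is self-contained.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(data, seeds):
    """Straight K-Means; returns (square error W, labels, centroids)."""
    centroids = [list(s) for s in seeds]
    labels = None
    while True:
        new = [min(range(len(centroids)), key=lambda k: sq_dist(y, centroids[k]))
               for y in data]
        if new == labels:
            break
        labels = new
        for k in range(len(centroids)):
            members = [y for y, lab in zip(data, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    w = sum(sq_dist(y, centroids[lab]) for y, lab in zip(data, labels))
    return w, labels, centroids


def best_of_random_starts(data, K, runs=100, rng=None):
    """Repeat K-Means from random K-entity samples; keep the lowest W."""
    rng = rng or random.Random(0)   # fixed seed only for reproducibility
    best = None
    for _ in range(runs):
        seeds = rng.sample(list(data), K)
        result = kmeans(data, seeds)
        if best is None or result[0] < best[0]:
            best = result
    return best


# Hypothetical toy data with two separated groups.
data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
w, labels, centroids = best_of_random_starts(data, K=2)
print(w, labels)
```

As the text cautions, a low value of the criterion found this way is not by itself evidence that the clusters reflect the structure of the data.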
Example 3.16. K-Means at Iris data
Table 3.8 presents results of the straight K-Means applied to the Iris data on page 11 with K=3 and specimens numbered 1, 51, and 101 taken as the initial centroids and cross-classified with the prior three class partition. The clustering does separate genus Setosa but misplaces 14+3=17 specimens between two other genera. This corresponds to the visual pattern in Figure 1.10, page 25.

Table 3.8: Cross-classification of 150 Iris specimens according to K-Means clustering and the genera; entries show Count/Proportion.

                        Iris genus
Cluster   Setosa     Versicolor   Virginica   Total
S1        50/0.333   0/0          0/0         50/0.333
S2        0/0        47/0.313     14/0.093    61/0.407
S3        0/0        3/0.020      36/0.240    39/0.260
Total     50/0.333   50/0.333     50/0.333    150/1.000
Similarly, an expert may propose to distinguish numeral digits by the presence of a closed drawing in them, so that this feature is present in 6 and absent from 1, and suggest these entities as the initial seeds. The expert may even go further and suggest one more feature, presence of a semi-closed drawing instantiated by 3, to be taken into account. This is a viable approach, too. It allows seeing how the conceptual types relate to the data and to what extent the hypothetical seed combinations match real data. However, in a common situation in which the user cannot make much sense of his data because they reflect superficial measurable features rather than those of essence, which cannot be measured, the expert vision may fail to suggest a reasonable degree of resolution, and the user should take a more data-driven approach to tackle the problem. Two data-driven approaches are described in the next two sections.
3.2.2 MaxMin for producing deviate centroids
This approach is based on the following intuition. If there are cohesive clusters in the data, then entities within any cluster must be close to each other and rather far away from entities in other clusters. The following method, based on this intuition, has proved to work well in real and simulated experiments.

1. Take entities $y_{i'}$ and $y_{i''}$ maximizing the distance $d(y_i, y_j)$ over all $i, j \in I$ as $c_1$ and $c_2$.
2. For each of the entities $y_i$ that have not been selected to the set $c$ of initial seeds so far, calculate $d_c(y_i)$, the minimum of its distances to $c_t \in c$.
3. Find $i$ maximizing $d_c(y_i)$ and check Stop-condition (see below). If it doesn't hold, add $y_i$ to $c$ and go to Step 2. Otherwise, end and output $c$ as the set of initial seeds.

As the Stop-condition in MaxMin, either or all of the following pre-specified constraints can be utilized:

1. The number of seeds has reached a pre-specified threshold.
2. Distance $d_c(y_i)$ is smaller than a pre-specified threshold such as $d = d(c_1, c_2)/3$.
3. There is a significant drop, such as 35%, in the value of $d_c(y_i)$ in comparison to that at the previous iteration.
Example 3.17. MaxMin for selecting initial seeds
The table of entity-to-entity distances for Masterpieces is displayed in Table 3.9. The maximum distance here is 3.43, between AK and YA, which makes the two of them initial centroids according to MaxMin.

Table 3.9: Distances between Masterpieces from Table 3.2.

      OT    DS    GE    TS    HF    YA    WP    AK
OT    0.00  0.51  0.88  1.15  2.20  2.25  2.30  3.01
DS    0.51  0.00  0.77  1.55  1.82  2.99  1.90  2.41
GE    0.88  0.77  0.00  1.94  1.16  1.84  1.81  2.38
TS    1.15  1.55  1.94  0.00  0.97  0.87  1.22  2.46
HF    2.20  1.82  1.16  0.97  0.00  0.75  0.83  1.87
YA    2.25  2.99  1.84  0.87  0.75  0.00  1.68  3.43
WP    2.30  1.90  1.81  1.22  0.83  1.68  0.00  0.61
AK    3.01  2.41  2.38  2.46  1.87  3.43  0.61  0.00

The distances from other entities to these two are in Table 3.10; those minimal at the two are boldfaced. The maximum among them, the next MaxMin distance, is 2.41 between DS and AK. The decrease here is less than 30%, suggesting that this can represent a different cluster.

Table 3.10: Distances from Masterpieces entities to YA and AK

      OT    DS    GE    TS    HF    WP
YA    2.25  2.99  1.84  0.87  0.75  1.68
AK    3.01  2.41  2.38  2.46  1.87  0.61

Thus, we add DS to the list of candidate centroids and then need to look at distances from other entities to these three (see Table 3.11). This time the MaxMin distance is 0.87 between TS and YA. We might wish to stop the process at this stage since we expect only three meaningful clusters in Masterpieces data and, also, there is a significant drop, 64% of the previous MaxMin distance. It is useful to remember that such a clear-cut situation may not necessarily occur in other examples. The three seeds selected have been shown in previous examples to produce the author-based clusters with K-Means.

Table 3.11: Distances between DS, YA, and AK and other Masterpiece entities.

      OT    GE    TS    HF    WP
DS    0.51  0.77  1.55  1.82  1.90
YA    2.25  1.84  0.87  0.75  1.68
AK    3.01  2.38  2.46  1.87  0.61
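The MaxMin selection traced in this example can be reproduced with a short Python sketch that works directly on the distance matrix of Table 3.9. The function name `maxmin_seeds` and its parameters `max_seeds` and `drop` are assumptions introduced here; of the three Stop-conditions, only the seed-count limit and the relative drop (35%, the figure mentioned in the text) are implemented.

```python
NAMES = ["OT", "DS", "GE", "TS", "HF", "YA", "WP", "AK"]

# Entity-to-entity distances copied from Table 3.9.
D = [
    [0.00, 0.51, 0.88, 1.15, 2.20, 2.25, 2.30, 3.01],
    [0.51, 0.00, 0.77, 1.55, 1.82, 2.99, 1.90, 2.41],
    [0.88, 0.77, 0.00, 1.94, 1.16, 1.84, 1.81, 2.38],
    [1.15, 1.55, 1.94, 0.00, 0.97, 0.87, 1.22, 2.46],
    [2.20, 1.82, 1.16, 0.97, 0.00, 0.75, 0.83, 1.87],
    [2.25, 2.99, 1.84, 0.87, 0.75, 0.00, 1.68, 3.43],
    [2.30, 1.90, 1.81, 1.22, 0.83, 1.68, 0.00, 0.61],
    [3.01, 2.41, 2.38, 2.46, 1.87, 3.43, 0.61, 0.00],
]


def maxmin_seeds(dist, max_seeds, drop=0.35):
    """Return indices of MaxMin seeds chosen from a square distance matrix."""
    n = len(dist)
    # Step 1: the two mutually farthest entities become c1 and c2.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    seeds = [i0, j0]
    prev = dist[i0][j0]
    while len(seeds) < max_seeds:                 # Stop-condition 1
        rest = [i for i in range(n) if i not in seeds]
        if not rest:
            break
        # Step 2: distance from each remaining entity to its nearest seed.
        dc = {i: min(dist[i][s] for s in seeds) for i in rest}
        # Step 3: the entity farthest from the current set of seeds.
        best = max(rest, key=lambda i: dc[i])
        if dc[best] < (1 - drop) * prev:          # Stop-condition 3
            break
        seeds.append(best)
        prev = dc[best]
    return seeds


seeds = [NAMES[i] for i in maxmin_seeds(D, max_seeds=len(NAMES))]
print(seeds)
```

With the 35% drop rule the sketch selects YA, AK and DS and stops at TS, whose MaxMin distance of 0.87 is far below the previous value of 2.41, in agreement with the example.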
The issues related to this approach are typical in data mining. First, it involves ad hoc thresholds which are not substantiated in terms of data. Second, it can be computationally intensive when the number of entities $N$ is large since finding the maximum distance at Step 1 involves computation of $O(N^2)$ distances.
3.2.3 Deviate centroids with Anomalous pattern
Reference point based clustering
The method described in this section provides an alternative to MaxMin for the initial setting, which is less intensive computationally and, also, reduces the number of ad hoc parameters. To avoid the computationally intensive problems of analyzing pair-wise distances, one may employ the concept of a reference point which is chosen to exemplify an average or norm of the features which define the entities. For example, the user might choose, as representing a "normal student," a point which indicates good marks in tests and serious work in projects, and then see what patterns of observed behavior deviate from this. Or, a bank manager may set as his reference point a customer having specific assets and backgrounds, not necessarily averaged, to see what types of customers deviate from this. In engineering, a moving robotic device should be able to classify the elements of the environment according to the robot's location, with things that are closer having more resolution, and things that are farther having less resolution: the location is the reference point in this case. In many cases the gravity center of the entire entity set can be the reference point of choice.

Availability of a reference point allows the comparison of entities with it, not with each other, which drastically reduces computations. To find a cluster which is most distant from a reference point, a version of K-Means described in [92] can be utilized. According to this procedure, the only ad hoc choice is the cluster's seed. There are two seeds here: the reference point, which is unvaried in the process, and the cluster's seed, which is taken to be the entity which is farthest from the reference point. Only the anomalous cluster is built here, defined as the set of points that are closer to the cluster seed than to the reference point. Then the cluster seed is substituted by the cluster's gravity center, and the procedure is reiterated until it converges. An exact formulation of this follows.
Figure 3.1: Extracting an 'Anomalous pattern' cluster with the reference point in the gravity center: the initial iteration is on the left and the final one on the right. (Labels in the figure: reference point, farthest entity, anomalous cluster center.)
1. Pre-processing. Specify a reference point $a = (a_1, \dots, a_n)$ (this can be the data grand mean) and standardize the original data table with formula (2.22) at which shift parameters $a_k$ are the reference point coordinates. (This way, the space origin is shifted into $a$.)
2. Initial setting. Put a tentative centroid, $c$, as an entity which is the most distant from the origin, 0.
3. Cluster update. Determine cluster list $S$ around $c$ against the only other "centroid" 0 with the Minimum distance rule so that $y_i$ is assigned to $S$ if $d(y_i, c) < d(y_i, 0)$.
4. Centroid update. Calculate the within $S$ mean $c'$ and check whether it differs from the previous centroid $c$. If $c'$ and $c$ do differ, update the centroid by assigning $c \leftarrow c'$ and return to Step 3. Otherwise, go to 5.
5. Output. Output list $S$ and centroid $c$ with accompanying interpretation aids (as advised in section 3.4) as the most anomalous pattern.

The process is illustrated in Figure 3.1. Obviously, the Anomalous pattern method is a version of K-Means in which: (i) the number of clusters K is 2; (ii) the centroid of one of the clusters is 0, which is forcibly kept there through all the iterations; (iii) the initial centroid of the anomalous cluster is taken as an entity point which is the most distant from 0. Property (iii) mitigates the issue of determining appropriate initial seeds, which allows using the Anomalous pattern algorithm for finding an initial setting for K-Means.

Like K-Means itself, the Anomalous pattern alternately minimizes a criterion,

$$W(S, c) = \sum_{i \in S} d(y_i, c) + \sum_{i \notin S} d(y_i, 0), \qquad (3.7)$$

which is a specific version of the K-Means general criterion $W(S, c)$ in (3.2): $S$ is a partition in the general criterion and a subset in AP. More technical detail of the method can be found in section 5.5.
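The Anomalous pattern steps can be sketched in Python as below. This is an illustration under stated assumptions, not the author's code: the function name `anomalous_pattern` and the `max_iter` safeguard are introduced here, pre-processing is reduced to shifting the origin to the reference point (any rescaling of features is assumed to have been done beforehand), convergence is detected by the cluster list no longer changing, and the seven points are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def anomalous_pattern(data, reference, max_iter=100):
    """Return (S, c): indices of the anomalous cluster and its centroid,
    the latter expressed in coordinates shifted to the reference point."""
    # Step 1 (shift only): move the space origin into the reference point.
    Y = [[v - a for v, a in zip(y, reference)] for y in data]
    origin = [0.0] * len(reference)
    # Step 2: the entity farthest from the origin is the tentative centroid.
    c = list(max(Y, key=lambda y: sq_dist(y, origin)))
    if sq_dist(c, origin) == 0:
        raise ValueError("all entities coincide with the reference point")
    S = None
    for _ in range(max_iter):
        # Step 3: entities strictly closer to c than to the origin.
        new_S = [i for i, y in enumerate(Y) if sq_dist(y, c) < sq_dist(y, origin)]
        if new_S == S:          # Step 4: nothing changed, so c is stable
            break
        S = new_S
        c = [sum(Y[i][v] for i in S) / len(S) for v in range(len(c))]
    return S, c                 # Step 5


# Hypothetical data: four points near the origin and three far from it.
data = [(0, 0), (0.2, -0.1), (-0.3, 0.1), (0.1, 0.2), (4, 4), (5, 5), (4.5, 5.5)]
grand_mean = [sum(col) / len(data) for col in zip(*data)]
S, c = anomalous_pattern(data, grand_mean)
print(S, c)
```

Only distances to the current centroid and to the reference point are computed, so each iteration is linear in the number of entities, in contrast to the pair-wise distances needed at Step 1 of MaxMin.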
Example 3.18. Anomalous pattern in Market towns
The Anomalous pattern method can be applied to the Market towns data in Table 1.1 assuming the grand mean as the reference point and scaling by range. The point farthest from 0, the tentative centroid at step 2, appears to be entity 35 (St Austell) whose distance from zero is 4.33, the maximum. Step 3 adds three more entities, 26, 29 and 44 (Newton Abbot, Penzance and Truro), to the cluster. They are among the largest towns in the data, though there are some large towns like Falmouth that are out of the list, thus being closer to 0 rather than to St Austell in the range standardized feature space. After one more iteration, the anomalous cluster stabilizes.
Table 3.12: Iterations in nding an anomalous pattern in Market towns data.
Iteration List # Distance Cntr Cntr, % 1 26, 29, 35, 44 4 2.98 11.92 28.3 2 4, 9, 25, 26, 29, 35, 41, 44 8 1.85 14.77 35.1 The iterations are presented in Table 3.12. It should be noted that the scatter's cluster part (contribution) increases along the iterations as follows from the theory in section 5.5.3: the decrease of the distance between centroid and zero is well compensated by the in ux of entities. The nal cluster consists of 8 entities and takes into account 35.13 % of the data scatter. Its centroid is displayed in Table 3.13. As frequently happens, the anomalous cluster here consists of better o entities { towns with all the standardized centroid values larger than the grand mean by 30 to 50 per cent of the feature ranges. This probably relates to the fact that they comprise eight out of eleven towns which have a resident population greater than 10,000. The other three largest towns have not made it into the cluster because of their de ciencies in services such as Hospitals and Farmers' markets. The fact that the scale of measurement of population is by far the largest in the original table doesn't much a ect the computation here as it runs with the range standardized scales at which the total contribution of this feature is mediocre, about 8.5% only (see Table 2.18). It is rather a concerted action of all features associated with greater population which makes up the cluster. As follows from the last line in Table 3.13, the most important for the cluster separation
PO    CAB   FM
6.4   1.2   .4
.47   .30   .18
143   94    88
163   51    10
are the following features: Population, Post offices, and Doctors, highlighted in boldface. This analysis suggests a simple decision rule separating the cluster entities from the rest with these variables: "P is greater than 10,000 and Do is 3 or greater." □
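The figures in Table 3.12 can be cross-checked with a few lines of arithmetic. The sketch below assumes that a cluster's contribution equals its cardinality times the distance between its centroid and zero, $B_{k+} = N_k d(0, c_k)$, as stated in section 3.4.2; the function name is illustrative, and the small mismatch at iteration 2 is attributed here to rounding of the printed distance.

```python
# Rows of Table 3.12: (iteration, cluster size N, distance d(0, c) as printed).
rows = [(1, 4, 2.98), (2, 8, 1.85)]
# Printed contributions: (Cntr, Cntr in per cent of the data scatter).
reported = {1: (11.92, 28.3), 2: (14.77, 35.1)}


def cluster_contribution(size, dist_to_origin):
    """Contribution of a cluster to the data scatter, N * d(0, c)."""
    return size * dist_to_origin


for iteration, size, dist in rows:
    cntr = cluster_contribution(size, dist)
    cntr_printed, percent = reported[iteration]
    # Data scatter T(Y) implied by the printed absolute and relative values.
    implied_scatter = cntr_printed / (percent / 100.0)
    print(f"iteration {iteration}: N*d = {cntr:.2f}, "
          f"printed Cntr = {cntr_printed}, implied T(Y) = {implied_scatter:.1f}")
```

Both rows imply nearly the same total data scatter, which is what one would expect if the percentages refer to a single value of $T(Y)$.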
3.3 Intelligent K-Means
3.3.1 Iterated Anomalous pattern for iK-Means
When clusters in the feature space are well separated from each other, or the cluster structure can be thought of as a set of differently contributing clusters, the clusters can be found by iterative application of Anomalous pattern, which mitigates the need for pre-setting the number of clusters and their initial centroids. Moreover, this can be used as a procedure to meaningfully determine the number of clusters and initial seeds for K-Means. In this way we come to an algorithm that can be referred to as intelligent K-Means, because it relieves the user of the task of specifying the initial setting. Some other potentially useful features of the method relate to its flexibility with regard to dealing with outliers and the "swamp" of inexpressive, ordinary entities around the grand mean.

0. Setting. Put $t = 1$ and $I_t$ the original entity set. Specify a threshold of resolution to discard all AP clusters whose cardinalities are less than the threshold.
1. Anomalous pattern. Apply AP to $I_t$ to find $S_t$ and $c_t$. Either option can be taken: do Step 1 of AP (standardization of the data) at each $t$ or only at $t = 1$. The latter is the recommended option as it is compatible with the theory in section 5.5.
2. Control. If Stop-condition (see below) does not hold, put $I_t \leftarrow I_t - S_t$ and $t \leftarrow t + 1$ and go to Step 1.
3. Removal of small clusters. Remove all of the found clusters that are smaller than a pre-specified cluster discarding threshold for the cluster size. (Entities comprising singleton clusters should be checked for errors in their data entries.) Denote the number of remaining clusters by $K$ and their centroids by $c_1, \dots, c_K$.
4. K-Means. Do Straight (or Incremental) K-Means with $c_1, \dots, c_K$ as initial seeds.

The Stop-condition in this method can be any or all of the following:

1. All clustered. $S_t = I_t$, so that there are no unclustered entities left.
2. Large cumulative contribution. The total contribution of the first $t$ clusters to the data scatter has reached a pre-specified threshold such as 50%.
3. Small cluster contribution. The contribution of the $t$-th cluster is too small, say, compared to the order of the average contribution of a single entity, $1/N$.
4. Number of clusters reached. The number of clusters, $t$, has reached a pre-specified value $K$.
The first condition is natural if there are "natural" clusters that indeed differ in their contributions to the data scatter. The second and third conditions can be considered as imposing further degrees of resolution with which the user looks at the data. At step 4, K-Means can be applied either to the entire dataset or to the set from which the smaller clusters have been removed. This may depend on the situation: in some problems, such as structuring of a set of settlements for better planning or monitoring, no entity should be left out of consideration, whereas in other problems, such as developing synoptic descriptions for text corpora, some deviant texts should be left out of the coverage.
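The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the Anomalous pattern step is taken to start from the entity farthest from the origin and to collect the entities that are closer to the tentative centroid than to the origin, standardization is done once (the recommended option), only Stop-condition 1 is used, and K-Means is run over the entire data set. The function names and the parameters `min_size` and `max_iter` are introduced here for illustration only.

```python
import numpy as np


def range_standardize(X):
    """Subtract the grand means and divide by the feature ranges."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                     # constant features: avoid 0/0
    return (X - X.mean(axis=0)) / span


def anomalous_pattern(Y, max_iter=100):
    """Extract one AP cluster from Y; the reference point is the origin."""
    d0 = (Y ** 2).sum(axis=1)                 # squared distances to the origin
    if d0.max() == 0:                         # degenerate case: all entities at 0
        return np.ones(len(Y), dtype=bool), np.zeros(Y.shape[1])
    c = Y[d0.argmax()].copy()                 # tentative centroid: farthest entity
    members = np.zeros(len(Y), dtype=bool)
    for _ in range(max_iter):
        dc = ((Y - c) ** 2).sum(axis=1)
        new_members = dc < d0                 # closer to c than to the origin
        if np.array_equal(new_members, members):
            break                             # cluster has stabilized
        members = new_members
        c = Y[members].mean(axis=0)
    return members, c


def ik_means(Y, min_size=2, max_iter=100):
    """iK-Means sketch with Stop-condition 1; returns (labels, centroids)."""
    remaining = np.arange(len(Y))
    seeds = []
    while remaining.size:                     # Steps 1-2: iterate AP
        members, c = anomalous_pattern(Y[remaining])
        if members.sum() >= min_size:         # Step 3: discard small clusters
            seeds.append(c)
        remaining = remaining[~members]
    if not seeds:
        raise ValueError("no AP cluster reaches min_size")
    centroids = np.array(seeds)
    for _ in range(max_iter):                 # Step 4: Straight K-Means
        d = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        updated = np.array([
            Y[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids
```

A call such as `labels, centroids = ik_means(range_standardize(X), min_size=2)` returns one label per row of `X`; the number of clusters is the number of rows of `centroids`, determined by the data rather than supplied by the user.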
Example 3.19. Iterated Anomalous patterns in Market towns
Applied to the Market towns data with Stop-condition 1, the iterated AP algorithm has produced 12 clusters, of which 5 are singletons. Each of the singletons has a strange pattern of town facilities with no similarity to any other town in the list. For instance, entity 19 (Liskeard, 7044 residents) has an unusually large number of Hospitals (6) and CABs (2), which makes it a singleton cluster. The seven non-singleton clusters are in Table 3.14, in the order of their extraction in the iterated AP. Centroids of the seven clusters are presented in Table 3.20 in the next section.
Cluster  Size  Elements                                                                Cntr, %
1        8     4, 9, 25, 26, 29, 35, 41, 44                                            35.1
3        6     5, 8, 12, 16, 21, 43                                                    10.0
4        18    2, 6, 7, 10, 13, 14, 17, 22, 23, 24, 27, 30, 31, 33, 34, 37, 38, 40     18.6
5        2     3, 32                                                                   2.4
6        2     1, 11                                                                   1.6
8        2     39, 42                                                                  1.7
11       2     20, 45                                                                  1.2

The cluster structure does not change much when, according to the iK-Means algorithm, Straight K-Means is applied to the seven centroids (with the five singletons put back into the data). Moreover, similar results have been observed with clustering of the original list of about thirteen hundred Market towns described by an expanded list of eighteen characteristics of their development: the number of non-singleton clusters was the same, with their descriptions (see page 101) very similar. □

Let us apply iK-Means to the Bribery data in Table 1.12 on page 20. According to the prescriptions above, the data processing includes the following steps:

1. Data standardization. This is done by subtracting the feature averages (grand means) from all entries and then dividing them by the feature ranges. For a binary feature corresponding to a qualitative category, this reduces to subtraction of the category proportion, $p$, from all the entries, which in this way become either $1 - p$, for "yes," or $-p$, for "no."
2. Repeatedly performing AP clustering. Applying AP to the pre-processed data matrix with the reference point taken as the space origin 0 and never altered, 13 clusters have been produced as shown in Table 3.15. They explain 64% of the data variance.
3. Initial setting for K-Means. There are only 5 clusters that have more than three elements according to Table 3.15. This defines the number of clusters as well as the initial setting: the first elements of the five larger clusters, indexed as 5, 12, 4, 1, and 11, are taken as the initial centroids.
4. Performing K-Means. K-Means, with the five centroids from the previous step, produces five clusters presented in Table 3.16. They explain 45% of the data scatter. The reduction of the proportion of the explained data scatter is obviously caused by the reduced number of clusters.

Conceptual description of the clusters is left to the next section (see page 106), which is devoted to interpretation aids. □
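The reduction of range standardization to $1 - p$ and $-p$ for a binary feature can be seen on a toy column. The eight yes/no values below are made up for illustration and are not taken from the Bribery data.

```python
import numpy as np

# Hypothetical yes/no answers for eight cases, coded 1 for "yes" and 0 for "no".
x = np.array([1, 0, 0, 1, 1, 0, 0, 0], dtype=float)

p = x.mean()                 # proportion of "yes" entries, here 3/8
span = x.max() - x.min()     # the range of a 0/1 feature is 1
y = (x - p) / span           # range standardization

# Only two distinct values remain: -p for "no" and 1 - p for "yes".
print(sorted(set(y.tolist())), "with p =", p)
```

Since the range of a 0/1 column is 1, dividing by it changes nothing, which is why the standardization amounts to subtracting the proportion alone.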
Table 3.16: Clusters found by K-Means in the entire Bribery data set from the largest clusters in Table 3.15.

Cluster  #   Elements                                                      Contribution, %
1        8   5, 16, 23, 25, 27, 28, 41, 42                                 10.0
2        19  7, 8, 12, 13, 14, 20, 33, 34, 35, 36, 38, 39, 43, 45, 47,     9.8
             48, 50, 51, 52
3        10  4, 6, 9, 10, 21, 22, 26, 30, 31, 40                           10.0
4        7   1, 3, 15, 17, 29, 32, 49                                      7.0
5        11  2, 11, 18, 19, 24, 37, 44, 46, 53, 54, 55                     8.1
3.3.2 Cross validation of iK-Means results
As described in section 1.2.6, the issue of validation of clusters may be subject to different perspectives. According to the classification paradigm, validation of clusters is provided by their interpretation, that is, by the convenience of the clusters and their fitting into and enhancing the existing knowledge. In the statistics paradigm, a cluster structure is validated by its correspondence to the underlying model. In the machine learning perspective, it is learning algorithms that are to be validated. In data mining, one validates the cluster structure found. In machine learning and data mining, validation is treated as the testing of how stable the algorithm results are with respect to random changes in the data. We refer the reader to section 7.5 for a general discussion of validation criteria in clustering.

Here we concentrate on the most popular validation method, m-fold cross-validation. According to this method, the entity set is randomly partitioned into m equal parts, and m pairs of training and testing sets are formed by taking each one of the m parts as the testing set, with the rest considered the training set. This scheme is easy to use for problems of learning decision rules: a decision rule is formed using a training set and then tested on the corresponding testing set. The testing results are then averaged over all m train-test experiments.

How can this line of thought be applied to clustering? In the literature, several methods for extending cross-validation techniques to clustering have been described (see references in section 7.5.2). Some of them fall in the machine learning perspective and some in the data mining perspective. The common idea is that the set of m training sets supplied by the cross-validation approach constitutes a convenient set of random samples from the entity set. In the remainder of this section, we describe somewhat simplified experiments in each of the two frameworks.
In the machine learning framework, one tests the consistency of a clustering algorithm. To do this, results of the algorithm run over each of the m training sets are compared. But how can two clusterings be compared if they partition different sets? One way to do this is by extending each clustering from the training set to the full entity set by assigning appropriate cluster labels to the test set elements. Another way would be to compare partitions pairwise over the overlap of their training sets. The overlap is not necessarily small. If, for instance, m = 10, then each of the training sets covers 90% of entities and the pairwise overlap is 80%.

In data mining, it is the clustering results that are tested. In this framework, the selected clustering method is applied to the entire data set before the set is split into m equal-size parts. Then m training sets are formed as usual, by removing one of the parts and combining the other parts. These training sets are used to verify the clustering results found on the entire data set. To do this, the clustering algorithm is applied to each of the m training sets and the found clustering is compared with that obtained on the entire data set.

Let us consider, with examples, how these strategies can be implemented.
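The 90% coverage and 80% pairwise overlap quoted for m = 10 can be checked directly. The sketch below builds the m training sets from a random partition; the function name, the seed and the choice of 100 entities are illustrative assumptions.

```python
import numpy as np


def m_fold_training_sets(n_entities, m, seed=0):
    """Randomly split range(n_entities) into m near-equal parts and return
    the m training sets, each being everything except one part."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_entities), m)
    everything = set(range(n_entities))
    return [everything - set(part.tolist()) for part in parts]


train = m_fold_training_sets(n_entities=100, m=10)
coverage = len(train[0]) / 100              # share of entities in one training set
overlap = len(train[0] & train[1]) / 100    # share common to two training sets
print(coverage, overlap)
```

Two training sets differ only by the two parts that were held out, so with ten equal parts their intersection always misses exactly two tenths of the entities.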
Example 3.21. Cross-validation of iK-Means clusters of the Market towns data
Let us address the issue of consistency of clustering results, a data mining approach. We have already found a set of clusters in the Market towns data; see Example 3.19 on page 94. This will be referred to as the base clustering. To explore how stable the base clusters are, let us do 10-fold cross-validation. First, randomly partition the set of 45 towns into 10 classes of approximately the same size: five classes with four towns and five classes with five towns in each. Taking out each of the classes, we get ten 90% subsamples of the original data as the training sets and run iK-Means on each of them. To see how much these clusterings differ from the base clustering found using the entire set, we use three scoring functions, as follows.

1. Average distance between centroids, adc. Let $c_k$ ($k = 1, \dots, 7$) be the base centroids and $c'_l$ ($l = 1, \dots, L$) the centroids of the clustering found on a 90% sample. For each $c_k$ find the nearest $c'_l$ over $l = 1, \dots, L$, calculate $d(c_k, c'_l)$, and average the distance over all $k = 1, \dots, 7$. (The correspondence between $c_k$ and $c'_l$ can also be established with the so-called best matching techniques [3].) This average distance scores the difference between base clusters and sample clusters. The smaller it is, the more consistent is the base clustering.

2. Relative distance between partitions of samples, M. Given a 90% training sample, let us compare two partitions of it: (a) the partition found on it with the clustering algorithm and (b) the base partition constrained to the sample. Cross-classifying these two partitions, we get a contingency table $P = (p_{tu})$ of frequencies $p_{tu}$ of sample entities belonging to the $t$-th class of one partition and the $u$-th class of the other. The distance, or mismatch coefficient, is

$$M = \sum_t p_{t+}^2 + \sum_u p_{+u}^2 - 2 \sum_{t,u} p_{tu}^2$$

where $p_{t+}$ and $p_{+u}$ are summary frequencies over rows and columns of $P$, as introduced later in formula (7.12).

3. Relative chi-square contingency coefficient, T. This is computed in the same way as the distance M; the only difference is that now the chi-squared coefficient (2.12), (2.13),

$$X^2 = \sum_{t,u} \frac{p_{tu}^2}{p_{t+} p_{+u}} - 1$$

and its normalized version $T = X^2 / \sqrt{(K - 1)(L - 1)}$, the Tchouprov coefficient, are used. The Tchouprov coefficient cannot be greater than 1.

Averaged results of fifteen independent 10-fold cross-validation tests are presented in the left column of Table 3.17; the standard deviations of the values are in parentheses.

Table 3.17: Averaged results of fifteen cross-validations of Market towns clusters with real and random data.

Method  Real data       Random data
adc     0.064 (0.038)   0.180 (0.061)
Ms      0.018 (0.018)   0.091 (0.036)
T       0.865 (0.084)   0.658 (0.096)

We can see that the distances adc and Ms are low and the contingency coefficient T is high. But how low and how high are they? Can any cornerstones or benchmarks be found?

One may wish to compare adc with the average distance between uniformly random vectors. This is not difficult, because the average squared difference $(x - y)^2$ between numbers $x$ and $y$ that are uniformly random in the unit interval is 1/6. This implies that the average distance in 12-dimensional space, taken as the sum of the squared differences over the 12 coordinates, is $12 \times 1/6 = 2$, which is by far greater than the observed 0.064. This difference, however, should not impress anybody, because the distance 2 refers to an unclustered set. Let us therefore generate a uniformly random $45 \times 12$ data table and simulate the same computations as with the real data. Results of these computations are in the right-hand column of Table 3.17. We can see that the distances adc and Ms over random data are small too; however, they are 3-5 times greater than those on the real data. If one believes that the average distances at random and real data may be considered as sampling averages of normal or chi-square distributions, one may consider a statistical test of difference such as that by Fisher [63, 50] to be appropriate and lead to a statistically sound conclusion that the hypothesis that the clustering of real data differs from that of random data can be accepted with a great confidence level. □
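The three scoring functions of this example can be sketched as follows. The code assumes that partitions are given as label vectors over the same entities, that $p_{tu}$ are relative frequencies, and that the distance between centroids is the squared Euclidean one; all function names are illustrative.

```python
import numpy as np


def contingency(labels_a, labels_b):
    """Relative-frequency contingency table P = (p_tu) of two partitions."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    P = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(P, (a, b), 1.0)                 # count co-occurrences
    return P / P.sum()


def mismatch_M(P):
    """M = sum_t p_t+^2 + sum_u p_+u^2 - 2 * sum_tu p_tu^2."""
    rows, cols = P.sum(axis=1), P.sum(axis=0)
    return (rows ** 2).sum() + (cols ** 2).sum() - 2.0 * (P ** 2).sum()


def tchouprov_T(P):
    """T = X^2 / sqrt((K-1)(L-1)), X^2 = sum_tu p_tu^2 / (p_t+ p_+u) - 1.
    Undefined when either partition has a single class."""
    rows, cols = P.sum(axis=1), P.sum(axis=0)
    x2 = (P ** 2 / np.outer(rows, cols)).sum() - 1.0
    K, L = P.shape
    return x2 / np.sqrt((K - 1) * (L - 1))


def adc(base_centroids, sample_centroids):
    """Average, over base centroids, of the squared Euclidean distance
    to the nearest sample centroid."""
    B = np.asarray(base_centroids, dtype=float)
    S = np.asarray(sample_centroids, dtype=float)
    d = ((B[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()
```

For two identical partitions the functions return M = 0 and T = 1, and for two statistically independent partitions T = 0, which gives natural end points for reading the values in Table 3.17.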
Example 3.22. Cross-validation of iK-Means algorithm on the Market towns data

In this example, the cross-validation techniques are applied within the machine learning context; that is to say, we are going to address the issue of the consistency of the clustering algorithm rather than its results. Thus, the partitions found on the training samples will be compared not with the base clustering but with each other. A 10-fold cross-validation is applied here as in the previous example. Ten 90% cross-validation subsamples of the original data are produced and iK-Means is applied to each of them. Two types of comparison between the ten subsample partitions are used, as follows.

1. Comparing partitions on common parts. The overlap of two 90% training samples comprises 80% of the original entities, which allows the building of their contingency table over those common entities. Then both the distance M and the chi-squared T coefficients can be used.

2. Comparing partitions by extending them to the entire entity set. Given a partition of a 90% training sample, let us first extend it to the entire entity set. To do so, each entity from the 10% testing set is assigned to the cluster whose centroid is the nearest to the entity. Having all ten 90% partitions extended this way to the entire data set, their pair-wise contingency tables are built and the scoring functions, the distance M and the chi-squared T coefficients, are calculated.

Tables 3.18 and 3.19 present results of the pair-wise comparison between partitions found by iK-Means applied to the Market towns data in both ways, on 80% overlaps and on the entire data set after extension, averaged over fifteen ten-fold cross-validation experiments. The cluster discarding threshold has been set to 1 as in the previous examples.

Table 3.18: Averaged comparison scores between iK-Means results at 80% real Market towns and random data.

Method  Real data       Random data
Ms      0.027 (0.025)   0.111 (0.052)
T       0.848 (0.098)   0.604 (0.172)

Table 3.19: Averaged comparison scores between iK-Means results extended to all real Market towns and random data.

Method  Real data       Random data
Ms      0.032 (0.028)   0.128 (0.053)
T       0.832 (0.098)   0.544 (0.179)

We can see that these are similar to the figures observed in the previous example, though the overall consistency of clustering results decreases here, especially when comparisons are conducted over extended partitions. It should be noted that the issue of consistency of the algorithm is treated somewhat simplistically in this example, with respect to the Market towns data only, not to a pool of data structures. Also, the concept of an algorithm's consistency can be defined differently, for instance, with regard to the criterion optimized by the algorithm. □
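The extension of a training-sample partition to the entire entity set can be sketched in a few lines. The sketch assumes squared Euclidean distance and recomputes centroids from the training labels; `extend_partition` and its parameter names are illustrative.

```python
import numpy as np


def extend_partition(Y, train_idx, train_labels):
    """Extend a partition of the training entities to all rows of Y by
    assigning each remaining entity to the nearest cluster centroid."""
    Y = np.asarray(Y, dtype=float)
    train_idx = np.asarray(train_idx)
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    # Centroids are computed over the training entities only.
    centroids = np.array([
        Y[train_idx][train_labels == k].mean(axis=0) for k in classes
    ])
    d = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = classes[d.argmin(axis=1)]
    labels[train_idx] = train_labels          # training entities keep their labels
    return labels
```

Applied to each of the ten training partitions, this gives ten partitions of the same entity set, so their pair-wise contingency tables can be built over all entities.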
Example 3.23. Higher dimensionality effects

It is interesting to mention that, when applying the same procedure to the original set of 18 features (not presented), the following phenomenon has been observed. When a $45 \times 18$ matrix is filled in with uniformly random numbers, iK-Means with the cluster discarding threshold 2 produces only two clusters. However, at the 90% training subsamples iK-Means fails most of the time to produce more than one nontrivial cluster. This is an effect of the higher dimensionality of the feature space relative to the number of entities in this example. Random points are situated too far away from each other in this case and cannot be conflated by iK-Means into clusters. One may safely claim that iK-Means differs from other clustering algorithms in that, in contrast to the others, it may fail to partition a data set if it is random. This does not always happen, but only in the cases in which the number of features is comparable to or greater than half of the number of entities. □
3.4 Interpretation aids
As already pointed out, interpretation is an important part of clustering, especially from the classification perspective, in which it is a validation tool as well. Unfortunately, this subject is generally not treated within the same framework as 'proper' clustering. The data recovery view of clustering allows us to fill in some gaps here, as described in this section.
3.4.1 Conventional interpretation aids
Two conventional tools for interpreting K-Means clustering results $(S, c)$ are: (1) analysis of cluster centroids $c_t$ and (2) analysis of bivariate distributions between the cluster partition $S = \{S_t\}$ and various features. In fact, under the zero-one coding system for categories, cross-classification frequencies are nothing but cluster centroids, which allows us to safely suggest that analysis of cluster centroids at various feature spaces is the only conventional interpretation aid.
Example 3.24. Conventional interpretation aids applied to Market towns clusters.
Let us consider Table 3.20, which displays centroids of the seven clusters of Market towns data in both real and range standardized scales. These show some tendencies rather clearly. For instance, the first cluster appears to be a set of larger towns that score 30 to 50% higher than average on almost all 12 features in the feature space. Similarly, cluster 3 obviously relates to smaller than average towns. However, in other cases, it is not always clear what features caused the separation of some clusters. For instance, both clusters 6 and 7 seem too close to the average to have any real differences at all. □
Table 3.20: Patterns of Market towns in the cluster structure found with iK-Means; the first column displays cluster numbering (top) and cardinalities (bottom).

k/#  Centr  P      PS     Do     Ho     Ba     Su     Pe     DIY    SP     PO     CAB    FM
1    Real   18484  7.63   3.63   1.13   11.63  4.63   4.13   1.00   1.38   6.38   1.25   0.38
8    Stand  0.51   0.38   0.56   0.36   0.38   0.38   0.30   0.26   0.44   0.47   0.30   0.17
2    Real   5268   2.17   0.83   0.50   4.67   1.83   1.67   0.00   0.50   1.67   0.67   1.00
6    Stand  -0.10  -0.07  -0.14  0.05   0.02   -0.01  -0.05  -0.07  0.01   -0.12  0.01   0.80
3    Real   2597   1.17   0.50   0.00   1.22   0.61   0.89   0.00   0.06   1.44   0.11   0.00
18   Stand  -0.22  -0.15  -0.22  -0.20  -0.16  -0.19  -0.17  -0.07  -0.22  -0.15  -0.27  -0.20
4    Real   11245  3.67   2.00   1.33   5.33   2.33   3.67   0.67   1.00   2.33   1.33   0.00
3    Stand  0.18   0.05   0.16   0.47   0.05   0.06   0.23   0.15   0.26   -0.04  0.34   -0.20
5    Real   5347   2.50   0.00   1.00   2.00   1.50   2.00   0.00   0.50   1.50   1.00   0.00
2    Stand  -0.09  -0.04  -0.34  0.30   -0.12  -0.06  -0.01  -0.07  0.01   -0.14  0.18   -0.20
6    Real   8675   3.80   2.00   0.00   3.20   2.00   2.40   0.00   0.00   2.80   0.80   0.00
5    Stand  0.06   0.06   0.16   -0.20  -0.06  0.01   0.05   -0.07  -0.24  0.02   0.08   -0.20
7    Real   5593   2.00   1.00   0.00   5.00   2.67   2.00   0.00   1.00   2.33   1.00   0.00
3    Stand  -0.08  -0.09  -0.09  -0.20  0.04   0.10   -0.01  -0.07  0.26   -0.04  0.18   -0.20
3.4.2 Contribution and relative contribution tables

Here two more interpretation aids are proposed:

1. Decomposition of the data scatter over clusters and features (table ScaD).
2. Quetelet coefficients for the decomposition (table QScaD).

According to (3.4) and (3.5), clustering decomposes the data scatter $T(Y)$ into the explained and unexplained parts, $B(S, c)$ and $W(S, c)$, respectively. The explained part can be further presented as the sum of additive items $B_{kv} = N_k c_{kv}^2$, which account for the contribution of every pair $S_k$ ($k = 1, \dots, K$) and $v \in V$, a cluster and a feature. The unexplained part can be further additively decomposed into contributions $W_v = \sum_{k=1}^{K} \sum_{i \in S_k} (y_{iv} - c_{kv})^2$, which can be differently expressed as $W_v = T_v - B_{+v}$, where $T_v$ and $B_{+v}$ are the parts of $T(Y)$ and $B(S, c)$ related to feature $v \in V$: $T_v = \sum_{i \in I} y_{iv}^2$ and $B_{+v} = \sum_{k=1}^{K} B_{kv}$. This can be displayed as a decomposition of $T(Y)$ in a table ScaD whose rows correspond to clusters, columns to variables, and entries to the contributions (see Table 3.21).

Table 3.21: ScaD: Decomposition of the data scatter over a K-Means cluster structure.

Cluster  f1    f2    ...  fM    Total
S1       B11   B12   ...  B1M   B1+
S2       B21   B22   ...  B2M   B2+
...
SK       BK1   BK2   ...  BKM   BK+
Expl     B+1   B+2   ...  B+M   B(S, c)
Unex     W1    W2    ...  WM    W(S, c)
Total    T1    T2    ...  TM    T(Y)
Summary rows, Expl and Total, and column, Total, are added to the table; they can be expressed as percentages of the data scatter $T(Y)$. The notation follows the notation of flow data. The row Unex accounts for the "unexplained" differences $W_v = T_v - B_{+v}$. The contributions highlight relative roles of features both at individual clusters and in total. These can be applied within clusters as well (see Table 3.26 further on as an example).
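The entries of a ScaD table follow directly from the definitions of $B_{kv}$, $W_v$ and $T_v$. The sketch below assumes that the data matrix has already been standardized (so that $T_v$ is the sum of squared entries of column $v$) and that the partition is given as a label vector; the function name is illustrative.

```python
import numpy as np


def scad_table(Y, labels):
    """Return (B, W, T) where B[k, v] = N_k * c_kv**2 is the explained
    contribution of cluster k and feature v, W[v] is the unexplained part
    of feature v, and T[v] = sum_i y_iv**2 is its total contribution."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    B = np.zeros((len(classes), Y.shape[1]))
    W = np.zeros(Y.shape[1])
    for row, k in enumerate(classes):
        Yk = Y[labels == k]
        ck = Yk.mean(axis=0)                  # within-cluster centroid
        B[row] = len(Yk) * ck ** 2
        W += ((Yk - ck) ** 2).sum(axis=0)
    T = (Y ** 2).sum(axis=0)
    return B, W, T
```

Summing `B` over its rows gives the Expl row $B_{+v}$, summing over columns gives the Total column $B_{k+}$, and `B.sum(axis=0) + W` reproduces `T` feature by feature, which is the decomposition displayed in Table 3.21.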
Example 3.25. Contribution table ScaD for Market towns clusters
Table 3.22 presents the Market towns data scatter decomposed, as in Table 3.21, over both clusters and features. The table shows that, among the variables, the maximum contribution to the data scatter is reached at FM. This can be attributed to the fact that FM is a binary variable: as shown in section 2.1.2, contributions of binary variables are maximal when they cover about half of the sample. The least contributing is DIY. The value of the ratio of the explained part of DIY to the total contribution, 0.79/1.75 = 0.451, amounts to the correlation ratio between the partition and DIY, as explained in sections 3.4.4 and 5.2.3.

The entries in the table actually combine cardinalities of clusters with squared differences between the grand mean vector and within-cluster centroids. Some show an exceptional value, such as the contribution 3.84 of FM to cluster 2, which covers more than 50% of the total contribution of FM and more than 90% of the total contribution of the cluster. Still, overall they do not give much guidance in judging which variables' contributions are most important in a cluster, because of differences between relative contributions of individual rows and columns. □
To measure the relative influence of contributions $B_{kv}$, let us utilize the property that they sum up to the total data scatter and, thus, can be considered an instance of flow data. The table of contributions can be analyzed in the same way as a contingency table (see section 2.2.3). Let us define, in particular, the relative contribution of feature $v$ to cluster $S_k$, $B(k/v) = B_{kv}/T_v$, to show what part of the variable contribution goes to the cluster. The total explained part of $T_v$, $B_v = B_{+v}/T_v = \sum_{k=1}^{K} B(k/v)$, is equal to the correlation ratio $\eta^2(S, v)$ introduced in section 2.2.3.

More sensitive measures can be introduced to compare the relative contributions $B(k/v)$ with the contribution of cluster $S_k$, $B_{k+} = \sum_{v \in V} B_{kv} = N_k d(0, c_k)$, related to the total data scatter $T(Y)$. These are similar to Quetelet coefficients introduced for flow data: the difference $g(k/v) = B(k/v) - B_{k+}/T(Y)$ and the relative difference

$$q(k/v) = \frac{g(k/v)}{B_{k+}/T(Y)} = \frac{T(Y) B_{kv}}{T_v B_{k+}} - 1.$$

The former compares the contribution of $v$ with the average contribution of variables to $S_k$. The latter relates this to the cluster's contribution. Index $q(k/v)$ can also be expressed as the ratio of the relative contributions of $v$: within $S_k$, $B_{kv}/B_{k+}$, and in the whole data, $T_v/T(Y)$. We refer to $q(k/v)$ as the Relative contribution index, RCI$(k, v)$.
For each cluster $k$, features $v$ with the largest RCI$(k, v)$ should be presented to the user for interpretation.
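The three indexes can be computed from a table of contributions $B_{kv}$ and the feature totals $T_v$. The following sketch uses illustrative names and returns the indexes as fractions; multiplying by 100 gives per cent values.

```python
import numpy as np


def qscad(B, T_v):
    """Relative contributions B(k/v), Quetelet differences g(k/v) and
    relative contribution indexes q(k/v) = RCI(k, v), as fractions."""
    B = np.asarray(B, dtype=float)            # K x M table of B_kv
    T_v = np.asarray(T_v, dtype=float)        # feature totals T_v
    T = T_v.sum()                             # total data scatter T(Y)
    Bk = B.sum(axis=1, keepdims=True)         # cluster totals B_k+
    rel = B / T_v                             # B(k/v) = B_kv / T_v
    g = rel - Bk / T                          # g(k/v) = B(k/v) - B_k+ / T(Y)
    q = g / (Bk / T)                          # q(k/v) = g(k/v) / (B_k+ / T(Y))
    return rel, g, q
```

For each row `k` of `q`, the features can then be ranked with `np.argsort(-q[k])` to list those with the largest RCI first.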
Example 3.26. Table QScaD of the relative and Quetelet indexes
All three indexes of association, $B(k/v)$, $g(k/v)$ and RCI $q(k/v)$, applied to the Market towns data in Table 3.22, are presented in Table 3.23 below the cluster centroids.
Table 3.23: Tendencies of the cluster structure of Market towns. At each cluster, the
Now contributions have become visible indeed. One can see, for instance, that variable Do highly contributes to cluster 5: RCI is 219.9. Why? As the upper number in the cell, 0, shows, this is a remarkable case indeed: no Doctor surgeries in the cluster at all. The difference between clusters 6 and 7, which was virtually impossible to spot with other interpretation aids, can now be explained by the high RCI values of SP, in excess of 100%, reached at these clusters. A closer look at the data shows that there is a swimming pool in each town in cluster 7 and none in cluster 6. If the variable SP is removed, then clusters 6 and 7 will not differ anymore and will join together.

Overall, the seven nontrivial clusters can be considered as reflecting the following four tiers of the settlement system: largest towns (Cluster 1), small towns (Cluster 3), large towns (Clusters 4 and 6), and small-to-average towns (Clusters 2, 5 and 7). In particular, the largest town Cluster 1 consists of towns whose population is two to three times larger than the average, and they have respectively larger numbers of all facilities, of which even more represented are Post Offices, Doctors, Primary Schools, and Banks. The small town Cluster 3 consists of the smallest towns, with 2-3 thousand residents on average. Respectively, the other facilities are also smaller and some are absent altogether (such as DIY shops and Farmers' markets). Two large town clusters, Cluster 4 and Cluster 6, are formed by towns of nine to twelve thousand residents. Although lack of such facilities as a Farmers' market is common to them, Cluster 4 is by far the richer, with service facilities that are absent in Cluster 6, which probably is the cause of the separation of the latter within the tier. Three small-to-average town clusters have towns of about 5,000 residents and differ from each other by the presence of a few fancy objects that are absent from the small town cluster, as well as from the other two clusters of this tier. These objects are: a Farmers' market in Cluster 2, a Hospital in Cluster 5, and a Swimming pool in Cluster 7. □
Example 3.27. ScaD and QScaD for Masterpieces
Tables 3.24 and 3.25 present similar decompositions with respect to the author-based clustering of the Masterpieces data in Table 2.13 on page 61. This time, only the Quetelet indexes of variables, RCI$(k, v)$, are presented (in Table 3.25). Table 3.25 shows feature SCon as the one most contributing to the Dickens cluster, feature LenD to the Twain cluster, and features NChar and Direct to the Tolstoy cluster. Indeed, these clusters can be distinctively described by the statements "SCon = 0," "LenD < 28," and "NChar > 3" (or "Narrative is Direct"), respectively. Curiously, the decisive role of LenD for the Twain cluster cannot be recognized from the absolute contributions in Table 3.24: SCon prevails over the Twain cluster in that table. □
Table 3.25: Relative centroids: standardized cluster centroids and Relative contribution indexes of variables, in the first and second lines of each cluster, respectively.

Title    LenS    LenD    NChar   SCon    Pers     Obje     Dire
Dickens  0.10    0.12    -0.11   -0.63   0.17     -0.02    -0.14
         -83.3   -70.5   -81.5   158.1   -40.1    -100.0   -50.5
Twain    -0.21   -0.29   -0.22   0.38    -0.02    0.17     -0.14
         1.2     91.1    -9.8    20.2    -100.0   -22.3    -35.8
Tolstoy  0.18    0.24    0.50    0.38    -0.22    -0.22    0.43
         -68.3   -33.0   119.5   -41.5   -43.3    -43.3
3.4.3 Cluster representatives

The user can be interested in a conceptual description of a cluster, but may also be interested in looking at the cluster via its representative, a "prototype." This is especially appealing when the representative is a well known object. Such an object can give much better meaning to a cluster than a logical description in situations where entities are complex and the concepts used in description are superficial and do not penetrate deep into the phenomenon. This is the case, for instance, in mineralogy, where a class of minerals can be represented by its stratotype, or in literary studies, where a general concept can be represented by a literary character.

To specify what entity should be taken as a representative of its cluster, conventionally that entity is selected which is the nearest to its cluster's centroid. This strategy can be referred to as "the nearest in distance." It can be justified in terms of the square error criterion $W(S, c) = \sum_{k=1}^{K} \sum_{h \in S_k} d(y_h, c_k)$ (3.2). Indeed, the entity $h \in S_k$ which is the nearest to $c_k$ contributes the least to $W(S, c)$, that is, to the unexplained part of the data scatter.

The contribution based approach supplements the conventional approach. Decomposition of the data scatter (3.4) suggests a different strategy by relating to the explained rather than the unexplained part of the data scatter. This strategy suggests that the cluster's representative must be the entity that maximally contributes to the explained part, $B(S, c) = \sum_{k=1}^{K} \sum_{v} c_{kv}^2 N_k$.

How can one compute the contribution of an entity to that? There seems to be nothing of entities in $B(S, c)$. To reveal contributions of individual entities, let us recall that $c_{kv} = \sum_{i \in S_k} y_{iv} / N_k$. Let us take $c_{kv}^2$ in $B(S, c)$ as the product of $c_{kv}$ with itself, and change one of the factors for the definition. This way we obtain the equation $c_{kv}^2 N_k = \sum_{i \in S_k} y_{iv} c_{kv}$. This leads to a formula for $B(S, c)$ as the summary inner product:

$$B(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v \in V} y_{iv} c_{kv} = \sum_{k=1}^{K} \sum_{i \in S_k} (y_i, c_k) \qquad (3.8)$$
which shows that the contribution of entity $i \in S_k$ is $(y_i, c_k)$. The most contributing entity is "the nearest in inner product" to the cluster centroid, which may sometimes lead to different choices. Intuitively, the choice according to the inner product follows the tendencies represented in $c_k$ towards the whole of the data rather than $c_k$ itself, which is manifested in the choice according to distance.
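The two notions of a representative can be compared on a toy cluster. The sketch below assumes standardized data, so that the origin is the grand mean, and squared Euclidean distance; the three points are made up to show that the two choices may differ.

```python
import numpy as np


def representatives(Yk, ck):
    """Return the index of the entity nearest in distance to ck, the index
    of the entity nearest in inner product, and all inner products."""
    Yk = np.asarray(Yk, dtype=float)
    ck = np.asarray(ck, dtype=float)
    dist = ((Yk - ck) ** 2).sum(axis=1)       # contributions to W(S, c)
    inner = Yk @ ck                           # contributions (y_i, c_k) to B(S, c)
    return int(dist.argmin()), int(inner.argmax()), inner


# A hypothetical cluster of three entities in a two-feature space.
Yk = np.array([[1.0, 0.0], [3.0, 0.0], [2.0, 0.3]])
ck = Yk.mean(axis=0)
by_distance, by_inner_product, inner = representatives(Yk, ck)
print(by_distance, by_inner_product)
```

Here the third entity is the closest to the centroid, whereas the second one, lying farther from the grand mean in the direction of the centroid, has the largest inner product. The inner products sum to $N_k (c_k, c_k)$, the cluster's contribution to $B(S, c)$, in line with (3.8).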
Example 3.28. Different concepts of cluster representatives
The entity based elements of the data scatter decomposition for the Dickens cluster from Table 3.24 are displayed in Table 3.26. Now some contributions are negative, which shows that a feature at an entity may be at odds with the cluster centroid. According to this table, the maximum contribution to the data scatter, 8.82%, is
Table 3.26: Decomposition of feature contributions to the Dickens cluster in Table
Title LenS LenD NChar SCon Pers Obje Dire Cntr Cntr,% Dist OTwist -19 29 37 391 61 5 21 524 8.82 222 DoSon 38 6 0 391 61 5 21 521 8.77 186 GExpect 8 12 0 391 -36 -9 21 386 6.49 310 Dickens 27 46 37 1172 86 2 62 1431 24.08 0
3.24 (in thousandth). The right-hand column shows distances to the cluster's centroid.
Table 3.27: Two Dickens' masterpieces along with features contributing to their
di erences. Item LenS LenD NChar OTwist 19.0 43.7 2 DoSon 29.4 36.0 3 Cluster mean 24.1 39.2 2.67 Grand mean 22.4 34.1 3.00
delivered by the novel Oliver Twist. Yet the minimum distance to the cluster's centroid is reached at a different novel, Dombey and Son. To see why this may happen, let us take a closer look at the two novels versus the within-cluster and grand means (Table 3.27). Table 3.27 clearly shows that the cluster's centroid is greater than the grand mean on the first two components and smaller on the third one. These tendencies are better expressed in Dombey and Son over the first component and in Oliver Twist over the other two, which accords with the contributions in Table 3.26. Thus, Oliver Twist wins over Dombey and Son as better representing the differences between the cluster centroid and the overall gravity center, expressed in the grand mean. With the distance measure, no overall tendency of this type can be taken into account. □

Let us apply similar considerations to the five clusters of the Bribery data listed in Table 3.16. Since individual cases are not of interest here, no cluster representatives will be considered. However, it is highly advisable to consult the original data and their description on page 19.

In cluster 1, the most contributing features are: Other branch (777%), Change of category (339%), and Level of client (142%). Here and further in this example the values in parentheses are relative contribution indexes RCI. By looking at the cluster's centroid, one can find the specifics of these features in the cluster. In particular, all its cases appear to fall in Other branch, comprising such bodies as universities or hospitals. In each of the cases the client's issue was a personal matter, and most times (six of the eight cases) the service provided was based on re-categorization of the client into a better category. The category Other branch (of feature Branch) appears to describe the cluster distinctively: the eight cases in this category constitute the cluster.

Cluster 2 consists of nineteen cases. Its most salient features are: Obstruction of justice (367%), Law enforcement (279%), and Occasional event (151%). By looking at the centroid values of these features, one can conclude: (1) all corruption cases in this cluster have occurred in the law enforcement system; (2) they are mostly done via obstruction of justice for occasional events. The fact (1) is not sufficient for distinctively describing the cluster since there are thirty-four cases, not just nineteen, that have occurred in the law enforcement branch. Two more conditions have been found by a cluster description algorithm, APPCOD (see section 7.1), to be conjunctively added to (1) to make the description distinctive: (3) the cases occurred at office levels higher than Organization, and (4) no cover-up was involved.

Cluster 3 contains ten cases for which the most salient categories are: Extortion in variable III Type of service (374%), Organization (189%), and Government (175%) in X Branch. Nine of the ten cases occurred in the Government branch, overwhelmingly at the level of organization (feature I) and, also overwhelmingly, the office workers extorted money for rendering their supposedly free services (feature III). The client level here is always that of an organization, though this feature is not as salient as the other three features.

Cluster 4 contains seven cases, and its salient categories are: Favors in III (813%), Government in X (291%), and Federal level of Office (238%). Indeed, all its cases occurred in the government legislative and executive branches. The service provided was mostly Favors (six of seven cases). Federal level of corrupt office was not frequent, two cases only. Still, this frequency was much higher than the average, for the two cases are just half of the total number, four, of the cases in which Federal level of office was involved.

Cluster 5 contains eleven cases and pertains to two salient features: Cover-up (707%) and Inspection (369%). All of the cases involve Cover-up as the service provided, mostly in inspection and monitoring activities (nine cases of eleven). A distinctive description of this cluster can be defined as a conjunction of two statements: it is always a cover-up, but not at the level of Organization.

Overall, the cluster structure leads to the following overview of the situation. Most important, it is Branch which is the feature defining Russian corruption when looked at through the media glass. Different branches tend to involve different corruption services. The government corruption involves either Extortion for rendering their free services to organizations (Cluster 3) or Favors (Cluster 4). The law enforcement corruption in higher offices is for either Obstruction of justice (Cluster 2) or Cover-up (Cluster 5). Actually, Cover-up does not exclusively belong in the law enforcement branch: it relates to all offices that are to inspect and monitor business activities (Cluster 5). Corruption cases in Other branch involve re-categorization of individual cases into more suitable categories (Cluster 1). □
3.4.4 Measures of association from ScaD tables
Here we are going to see that summary contributions of clustering towards a feature in ScaD tables are compatible with the traditional statistical measures of correlation considered in section 2.2. As proven in section 5.2.3, the total contribution $B_{+v} = \sum_k B_{vk}$ of a quantitative feature $v$ to the cluster-explained part of the scatter, presented in the ScaD tables, is proportional to the correlation ratio between $v$ and cluster partition $S$, introduced in section 2.2.2. In fact, the correlation ratios can be found by relating the row Expl to the row Total in the general ScaD table 3.21.
Example 3.30. Correlation ratio from a ScaD table
The correlation ratio of the variable P (Population resident) over the clustering in Table 3.22 can be found by relating the corresponding entries in rows Expl and Total: it is 3.16/3.56 = 0.89. This relatively high value shows that the clustering closely, though not entirely, follows this variable. In contrast, the clustering has rather little to do with variable DIY, the correlation ratio of which is equal to 0.79/1.75 = 0.45. □
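The ratio of the Expl entry to the Total entry can also be computed directly from a feature column and the cluster labels. The following is a minimal Python sketch of that computation, assuming the ratio of the between-cluster part of a feature's scatter to its total scatter; the function name and the toy data are hypothetical.

```python
def correlation_ratio(values, labels):
    """Explained-to-total ratio of a feature's scatter over a partition.

    `values` holds the feature column, `labels` the cluster label of each
    entity. The feature is centred at its grand mean, so the explained
    part is the sum over clusters of N_k * (cluster mean - grand mean)^2.
    """
    n = len(values)
    grand_mean = sum(values) / n
    total = sum((x - grand_mean) ** 2 for x in values)

    groups = {}
    for x, k in zip(values, labels):
        groups.setdefault(k, []).append(x)

    explained = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return explained / total


# Hypothetical feature that separates two clusters almost perfectly.
feature = [1, 2, 3, 7, 8, 9]
clusters = [0, 0, 0, 1, 1, 1]
print(round(correlation_ratio(feature, clusters), 3))
```

A value close to 1 means that the partition follows the feature closely, as with variable P in the example; a value close to 0 means the partition carries little information about it.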
Categorical feature case: Chi-square and other contingency coefficients

The summary contribution of a nominal feature $l$, having $V_l$ as the set of its categories, to the clustering partition $S$ is related to the contingency coefficients introduced in section 2.2.3. It is proven in section 5.2.4 to be equal to

$$B(S,l) = \frac{N}{|V_l|} \sum_{k=1}^{K} \sum_{v \in V_l} \frac{(p_{kv} - p_{k+} p_{+v})^2}{p_{k+} b_v^2} \qquad (3.9)$$

where $b_v$ stands for the scaling coefficient at the data standardization. Divisor $|V_l|$, the number of categories, comes from the rescaling stage introduced in section 2.4.

The coefficient $B(S,l)$ in (3.9) can be further specified depending on the scaling coefficients $b_v$. In particular, the items summed up in (3.9) are:

1. $\frac{(p_{kv} - p_{k+} p_{+v})^2}{p_{k+}}$ if $b_v = 1$, the range;
2. $\frac{(p_{kv} - p_{k+} p_{+v})^2}{p_{k+} p_{+v}(1 - p_{+v})}$ if $b_v = \sqrt{p_{+v}(1 - p_{+v})}$, the Bernoullian standard deviation;
3. $\frac{(p_{kv} - p_{k+} p_{+v})^2}{p_{k+} p_{+v}}$ if $b_v = \sqrt{p_{+v}}$, the Poissonian standard deviation.

Items 1 and 3 above lead to $B(S,l)$ being equal to the summary Quetelet coefficients introduced in section 2.2.3. The Quetelet coefficients, thus, appear to be related to the data standardization. Specifically, $G^2$ corresponds to $b_v = 1$ and $Q^2 = X^2$ to $b_v = \sqrt{p_{+v}}$. Yet item 2, the Bernoullian standardization, leads to an association coefficient which has not been considered in the literature.
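A minimal Python sketch of formula (3.9) under the three scaling options is given here as an illustration. The function name and the `scaling` keyword are hypothetical; the counts are the cross-classification of Table 3.28 expressed as numbers of novels (thousandths multiplied by N = 8).

```python
def nominal_contribution(counts, scaling="range"):
    """Summary contribution B(S, l) of a nominal feature, formula (3.9).

    counts[k][v] is the number of entities in cluster k falling in
    category v. `scaling` selects b_v: "range" (b_v = 1), "bernoulli"
    (b_v^2 = p_v (1 - p_v)) or "poisson" (b_v^2 = p_v).
    """
    n = sum(sum(row) for row in counts)
    n_categories = len(counts[0])
    p_k = [sum(row) / n for row in counts]
    p_v = [sum(row[v] for row in counts) / n for v in range(n_categories)]

    total = 0.0
    for k, row in enumerate(counts):
        for v, cell in enumerate(row):
            if scaling == "range":
                b2 = 1.0
            elif scaling == "bernoulli":
                b2 = p_v[v] * (1 - p_v[v])
            elif scaling == "poisson":
                b2 = p_v[v]
            else:
                raise ValueError("unknown scaling: " + scaling)
            total += (cell / n - p_k[k] * p_v[v]) ** 2 / (p_k[k] * b2)
    return n * total / n_categories


# Rows: Dickens, Twain, Tolstoy; columns: Personal, Objective, Direct.
narrative = [[1, 2, 0], [2, 1, 0], [0, 0, 2]]
print(round(nominal_contribution(narrative, "range"), 2))
```

With the range scaling the result agrees, up to rounding, with the total of 0.86 reported for variable Narrative in Example 3.31.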
Example 3.31. ScaD based association between a feature and clustering

Let us consider the contingency table between the author-based clustering of masterpieces and the only nominal variable in the data, Narrative (Table 3.28). In this example, the dummy variables have been range normalized and then rescaled with $b'_v = \sqrt{3}$, which is consistent with formula (3.9) with $b_v = 1$ and $|V_l| = 3$ for the calculation of the summary contribution $B(S,l)$. Table 3.29 presents the values of $\frac{(p_{kv} - p_{k+} p_{+v})^2}{3 p_{k+}/N}$ in each cell of the cross-classification. In fact, these are entries of the full ScaD table in Table 3.24, page 104, related to the categories of Narrative (columns) and the author-based clusters (rows), with row Total corresponding to row Expl in Table 3.24. In particular, the total contribution of the clustering and variable Narrative is equal to 0.18 + 0.18 + 0.50 = 0.86, or about 14.5% of the data scatter. □

Table 3.28: Cross-classification of the author-based partition and Narrative at the eight masterpieces (in thousandths).

| Class | Personal | Objective | Direct | Total |
|---|---|---|---|---|
| Dickens | 125 | 250 | 0 | 375 |
| Twain | 250 | 125 | 0 | 375 |
| Tolstoy | 0 | 0 | 250 | 250 |
| Total | 375 | 375 | 250 | 1000 |

Table 3.29: Elements of calculation of $B(S,l)$ according to formula (3.9) (in ten-thousandths).

| Class | Personal | Objective | Direct | Total |
|---|---|---|---|---|
| Dickens | 17 | 851 | 625 | 1493 |
| Twain | 851 | 17 | 625 | 1493 |
| Tolstoy | 938 | 938 | 3750 | 5626 |
| Total | 1806 | 1806 | 5000 | 8606 |

3.5 Overall assessment
feature set should be chosen carefully according to the goals of the data analysis. To cope with issue 2, the initial seeds should be selected based on conceptual understanding of the substantive domain or on preliminary data analysis with the AP clustering approach.

The issues can bring some advantages as well. Issue 3 keeps solutions close to pre-specified centroid settings, which is good when centroids have been conceptually substantiated. Issue 1, simplicity of cluster shapes, makes it possible to derive simple conjunctive descriptions of the clusters, which can be used as supplementary interpretation aids (see section 6.3).

A clustering algorithm should present the user with a comfortable set of options to do clustering. In our view, the intelligent version of K-Means described above and its versions, implementing the possibility of removal of entities that have been found either (1) "deviant" (contents of small Anomalous pattern clusters), or (2) "intermediate" (entities that are far away from their centroids, or have small attraction index values), or (3) "trivial" (entities that are close to the grand mean), give the user an opportunity to select a preferred option without being burdened with technical issues.
Ward Hierarchical Clustering
After reading this chapter the reader will know about:

1. Agglomerative and divisive clustering.
2. The Ward algorithm for agglomerative clustering.
3. Divisive algorithms for Ward criterion.
4. Visualization of hierarchical clusters with heighted tree diagrams and box charts.
5. Decomposition of the data scatter involving both Ward and K-Means criteria.
6. Contributions of individual splits to: (i) the data scatter, (ii) feature variances and covariances, and (iii) individual entries.
7. Extensions of Ward clustering to dissimilarity, similarity and contingency data.
aggregating its row and column categories with summing up corresponding entries. Can be done with Ward clustering extended to contingency tables.

Box chart: A visual representation of an upper cluster hierarchy involving a triple partition of a rectangular box corresponding to each split. The middle part is proportional to the contribution of the split and the other two to contributions of the resulting clusters.

Conceptual clustering: Any divisive clustering method that uses only a single feature at each splitting step. The purity and category utility scoring functions are closely related to the Ward clustering criterion.

Contribution of a split: Part of the data scatter that is explained by a split and equal to the Ward distance between split parts. Features most contributing to a split can be used in taxonomic analysis. Split contributions to covariances between features and individual entities can also be considered.

Divisive clustering: Any method of hierarchical clustering that works from top to bottom, by splitting a cluster in two distant parts, starting from the universal cluster containing all entities.

Heighted tree: A visual representation of a cluster hierarchy by a tree diagram in which nodes correspond to clusters and are positioned along a vertical axis in such a way that the height of a parent node is always greater than the heights of its child nodes.

Hierarchical clustering: An approach to clustering based on representation of data as a hierarchy of clusters nested over set-theoretic inclusion. In most cases, hierarchical clustering is used as a tool for partitioning, though there are some cases, such as that of the evolutionary tree, in which the hierarchy reflects the substance of a phenomenon.

Ward clustering: A method of hierarchical clustering involving Ward distance between clusters. Ward distance is maximized in Ward divisive clustering and minimized in Ward agglomerative clustering. Ward clustering accords with the data recovery approach.

Ward distance: A measure of dissimilarity between clusters, equal to the squared Euclidean distance between cluster centroids weighted by the product of cluster sizes.
for instance, [75, 58, 90]). The Lance and Williams formula covers all interesting algorithms proposed in the literature so far and much more. Here we concentrate on a weighted group average criterion first proposed by Ward [135]. Specifically, for clusters $S_{w1}$ and $S_{w2}$ whose cardinalities are $N_{w1}$ and $N_{w2}$ and centroids $c_{w1}$ and $c_{w2}$, respectively, Ward distance is defined as

$$dw(S_{w1}, S_{w2}) = \frac{N_{w1} N_{w2}}{N_{w1} + N_{w2}} \, d(c_{w1}, c_{w2}) \qquad (4.1)$$
where $d(c_{w1}, c_{w2})$ is the squared Euclidean distance between $c_{w1}$ and $c_{w2}$.

To describe the intuition behind this criterion, let us consider a partition $S$ on $I$ and two of its classes $S_{w1}$ and $S_{w2}$ and ask ourselves the following question: how would the square error of $S$, $W(S,c)$, change if these two classes were merged together? To answer the question, let us consider the partition that differs from $S$ only in that classes $S_{w1}$ and $S_{w2}$ are replaced in it by the union $S_{w1} \cup S_{w2}$, and denote it by $S(w1,w2)$. Note that the combined cluster's centroid can be expressed through the centroids of the original classes as $c_{w1 \cup w2} = (N_{w1} c_{w1} + N_{w2} c_{w2})/(N_{w1} + N_{w2})$. Then calculate the difference between the square error criterion values at the two partitions, $W(S(w1,w2), c(w1,w2)) - W(S,c)$, where $c(w1,w2)$ stands for the set of centroids in $S(w1,w2)$. The difference is equal to the Ward distance between $S_{w1}$ and $S_{w2}$:

$$dw(S_{w1}, S_{w2}) = W(S(w1,w2), c(w1,w2)) - W(S,c) \qquad (4.2)$$

Because of the additive nature of the square error criterion (3.2), all items on the right of (4.2) cancel out except for those related to $S_{w1}$ and $S_{w2}$, so that the following equation holds:

$$dw(S_{w1}, S_{w2}) = W(S_{w1} \cup S_{w2}, c_{w1 \cup w2}) - W(S_{w1}, c_{w1}) - W(S_{w2}, c_{w2}) \qquad (4.3)$$

where, for any cluster $S_k$ with centroid $c_k$, $W(S_k, c_k)$ is the summary distance (3.1) from elements of $S_k$ to $c_k$ ($k = w1, w2, w1 \cup w2$).

The latter equation can be rewritten as

$$W(S_{w1} \cup S_{w2}, c_{w1 \cup w2}) = W(S_{w1}, c_{w1}) + W(S_{w2}, c_{w2}) + dw(S_{w1}, S_{w2}) \qquad (4.4)$$
Equation (4.2) justifies the use of Ward distance if one wants to keep the within-cluster variance as small as possible at each of the agglomerative steps. The following presents the Ward agglomeration algorithm.

1. Initial setting. The set of maximal clusters is all the singletons, their cardinalities being unity, heights zero, themselves being centroids.
2. Cluster update. Two clusters, $S_{w1}$ and $S_{w2}$, that are closest to each other (being at the minimum Ward distance) among the maximal clusters, are merged together forming their parent cluster $S_{w1 \cup w2} = S_{w1} \cup S_{w2}$. The merged cluster's cardinality is defined as $N_{w1 \cup w2} = N_{w1} + N_{w2}$, its centroid as $c_{w1 \cup w2} = (N_{w1} c_{w1} + N_{w2} c_{w2})/N_{w1 \cup w2}$ and its height as $h(w1 \cup w2) = h(w1) + h(w2) + dw(S_{w1}, S_{w2})$.
3. Distance update. Put $S_{w1 \cup w2}$ into and remove $S_{w1}$ and $S_{w2}$ from the set of maximal clusters. Define Ward distances between the new cluster $S_{w1 \cup w2}$ and other maximal clusters $S_t$.
4. Repeat. If the number of maximal clusters is larger than 1, go to step 2. Otherwise, output the cluster merger tree along with leaves labelled by the entities.

Ward agglomeration starts with singletons, whose variance is zero, and produces an increase in criterion (3.2) that is as small as possible at each agglomeration step. This justifies the use of Ward agglomeration results by practitioners to get a reasonable initial setting for K-Means. The two methods supplement each other in that clusters are carefully built with Ward agglomeration, and K-Means allows overcoming the inflexibility of the agglomeration process over individual entities by reshuffling them. There is an issue with this strategy though: Ward agglomeration, unlike K-Means, is a computationally intensive method, not applicable to large sets of entities.

The height of the new cluster is defined as its square error according to equation (4.4). Since the heights of merged clusters include the sums of heights of their children, the heights of the nodes grow fast, with an "exponential" speed. This can be used to address the issue of determining what number of clusters is "relevant" to the data by cutting the hierarchical tree at the layer separating long edges from shorter ones, if such a layer exists (see, for example, Figure 4.1, whose three tight clusters can be seen as hanging on longer edges); see [28], pp. 76-77, for heuristic rules on this matter.
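The four steps of Ward agglomeration can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: all pairwise Ward distances are recomputed at every step rather than updated, which is easy to read but slow, and the function names, the returned tuple layout and the toy points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance d(a, b)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_distance(n1, c1, n2, c2):
    """Ward distance (4.1) from cardinalities and centroids."""
    return n1 * n2 / (n1 + n2) * sq_dist(c1, c2)


def ward_agglomerate(points):
    """Return the list of mergers (id1, id2, new_id, ward_dist, height)."""
    # Step 1: every singleton is a maximal cluster with height zero.
    clusters = {i: (1, list(p), 0.0) for i, p in enumerate(points)}
    mergers = []
    next_id = len(points)
    while len(clusters) > 1:
        # Step 2: find the pair of maximal clusters at minimum Ward distance.
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                na, ca, _ = clusters[a]
                nb, cb, _ = clusters[b]
                d = ward_distance(na, ca, nb, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        na, ca, ha = clusters.pop(a)
        nb, cb, hb = clusters.pop(b)
        # Merged cardinality, centroid and height as defined in step 2.
        n = na + nb
        c = [(na * x + nb * y) / n for x, y in zip(ca, cb)]
        height = ha + hb + d
        # Step 3: the parent replaces its children among maximal clusters.
        clusters[next_id] = (n, c, height)
        mergers.append((a, b, next_id, d, height))
        next_id += 1
    return mergers


# Hypothetical data: two tight pairs of points far from each other.
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
for merger in ward_agglomerate(toy_points):
    print(merger)
```

In this toy run the height of the root comes out as 26, which is the sum of squared distances from the four points to their grand mean (2.5, 0.5), in line with the definition of height as the square error of the cluster.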
Example 4.32. Agglomerative clustering of Masterpieces

Let us apply the Ward algorithm to the pre-processed and standardized Masterpieces data in Table 2.13, presented in the right-bottom display of Figure 2.5. The algorithm starts with the matrix of Ward distances between all singletons, that is, a matrix of entity-to-entity Euclidean distances squared and divided by two, as follows from (4.1) at $N_{w1} = N_{w2} = 1$. The Ward distance matrix is presented in Table 4.1. The minimum non-diagonal value in the matrix of Table 4.1 is $dw(1,2) = 0.25$, with $dw(7,8) = 0.30$ and $dw(5,6) = 0.37$ as the second and third smallest values, respectively. These are the starting agglomerations according to the Ward algorithm: clusters {1, 2}, {7, 8} and {5, 6}, whose heights are 0.25, 0.30 and 0.37, respectively, shown in Figure 4.1 as percentages of the data scatter $T(Y) = \sum_i \sum_v y_{iv}^2$, which is the height of the maximum cluster comprising all the entities, as proven in section 5.3. Further mergers are also shown in Figure 4.1 with their heights.

Figure 4.1: Cluster tree built with Ward clustering algorithm; the node heights are scaled in per cent to the height of the entire entity set.

The author-based classes hold on the tree for about 35% of its height, then the Leo Tolstoy cluster merges with that of Mark Twain, as should be expected from the bottom-right display in Figure 2.5. The hierarchy drastically changes if a different feature scaling system is applied. For instance, with the standard deviation based standardization, Leo Tolstoy's two novels do not constitute a single cluster but are separately merged within the Dickens and Twain clusters. This does not change even with the follow-up rescaling of categories of Narrative by dividing them over $\sqrt{3}$. □
4.2 Divisive clustering with Ward criterion
A divisive algorithm builds a cluster hierarchy from top to bottom, each time by splitting a cluster in two, starting from the entire set. Such an algorithm will be referred to as a Ward-like divisive clustering algorithm if the splitting steps maximize Ward distance between split parts. Let us denote a cluster by $S_w$ and its split parts by $S_{w1}$ and $S_{w2}$, so that $S_w = S_{w1} \cup S_{w2}$, and consider equation (4.4), which is applicable here: it decomposes the square error, that is, the summary distance between elements and the centroid of $S_w$, into the sum of the square errors of the split parts and the Ward distance between them:

$$W(S_w, c_w) = W(S_{w1}, c_{w1}) + W(S_{w2}, c_{w2}) + dw(S_{w1}, S_{w2}) \qquad (4.5)$$

where the indexed $c$ refers to the centroid of the corresponding cluster.

In the process of divisions, a divisive clustering algorithm builds what is referred to as an upper cluster hierarchy, which is a binary tree rooted at the universal cluster $I$ such that its leaves are not necessarily singletons. One may think of an upper cluster hierarchy as a cluster hierarchy halfway through construction from top to bottom, in contrast to lower cluster hierarchies, which are cluster hierarchies built halfway through, bottom up.

Thus, a Ward-like divisive clustering algorithm goes like this.

1. Start. Put $S_w \leftarrow I$ and draw the tree root as a node corresponding to $S_w$ at the height of $W(S_w, c_w)$.
2. Splitting. Split $S_w$ in two parts, $S_{w1}$ and $S_{w2}$, to maximize Ward distance $dw(S_{w1}, S_{w2})$.
3. Drawing attributes. In the drawing, add two children nodes corresponding to $S_{w1}$ and $S_{w2}$ at the parent node corresponding to $S_w$, their heights being their square errors.
4. Cluster set's update. Set $S_w \leftarrow S_{w'}$ where $S_{w'}$ is the node of maximum height among the leaves of the current upper cluster hierarchy.
5. Halt. Check the stopping condition (see below). If it holds, halt and output the hierarchy and interpretation aids described in section 4.2.3; otherwise, go to 2.
2. Cluster height. The height $W(S_w, c_w)$ of $S_w$ has decreased to a pre-specified threshold such as the average contribution of a single entity, $T(Y)/N$, or a pre-specified proportion, say 5%, of the data scatter.
3. Contribution to data scatter. The total contribution of the current cluster hierarchy, that is, the sum of Ward distances between split parts in it, has reached a pre-specified threshold such as 50% of the data scatter.

Each of these effectively specifies the number of final clusters. Other approaches to choosing the number of clusters are reviewed in section 7.5.1.

It should be noted that the drawn representation of an upper cluster hierarchy may follow formats differing from that utilized for representing results of an agglomerative method. In particular, we suggest that one can utilize the property that all contributions sum up to 100% of the data scatter and present the process of divisions with a box chart such as in Figure 4.3 on page 122. In such a box chart each splitting is presented with a partition of the corresponding box in three parts, of which the one in the middle corresponds to the split whereas those on the right and left correspond to the split clusters. The parts' areas are proportional to their contributions: to that of the split, which is the Ward distance itself, and to those of the clusters split, which are the summary distances of the cluster's entities to their centroids, $W(S_w, c_w) = \sum_{i \in S_w} d(y_i, c_w)$, for any $S_w \in S$. The box chart concept is similar to that of the pie chart except for the fact that the pie chart slices are of the same type whereas there are two types of slices in the box chart, those corresponding to splits and those to split clusters.
4.2.1 2-Means splitting
Developing a good splitting algorithm at Step 2 of Ward-like divisive clustering is an issue. To address it, let us take a closer look at the Ward distance as a splitting criterion. One of the possibilities follows from the fact that maximizing Ward distance is equivalent to minimizing the square-error criterion $W(S,c)$ of K-Means at $K = 2$, as proven in section 5.3.3 [90]. Thus, 2-Means can be used in the Ward-like divisive clustering algorithm to specify it as follows.

In spite of the fact that the squared Euclidean distance $d$, not the Ward distance $dw$, is used in splitting, the algorithm in fact goes in line with Ward agglomeration.

To specify two initial seeds in 2-Means splitting, any of the three options indicated in section 3.2 can be applied:

1. random selection;
2. maximally distant entities;
3. centroids of two Anomalous pattern clusters derived at $S_w$.

Random selection must be repeated many times to get a reasonable solution for any sizeable data.

The 2-Means splitting algorithm has two major drawbacks:

1. Step 1 is highly time consuming since it requires finding the maximum of all pair-wise distances.
2. The result can be highly affected by the choice of the initial seeds as the most distant entities, which can be at odds with the cluster structure hidden in data.

In spite of these, divisive clustering with 2-Means became rather popular after it had been experimentally approved in [127], where it was described as a heuristic method under the name of "Bisecting K-Means," probably without any knowledge of the work [90] in which it was proposed as an implementation of divisive clustering with the Ward criterion.
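A possible reading of 2-Means splitting with the two maximally distant entities as seeds is sketched in Python here. The code is an assumption-laden illustration rather than the book's algorithm box: the function names are hypothetical, ties in the assignment step go to the first part, and the toy points are invented. The closing lines check decomposition (4.5) numerically.

```python
def sq_dist(a, b):
    """Squared Euclidean distance d(a, b)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def square_error(rows):
    """W(S_w, c_w): summary squared distance to the cluster centroid."""
    c = centroid(rows)
    return sum(sq_dist(r, c) for r in rows)


def two_means_split(points, max_iter=100):
    """Split `points` in two with 2-Means; returns a 0/1 label per entity."""
    n = len(points)
    # Seeds: the two maximally distant entities (all pairs are examined,
    # which is the time-consuming step mentioned in the text).
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i0, j0 = max(pairs, key=lambda ij: sq_dist(points[ij[0]], points[ij[1]]))
    if sq_dist(points[i0], points[j0]) == 0:
        raise ValueError("all entities coincide; nothing to split")
    c1, c2 = list(points[i0]), list(points[j0])
    labels = None
    for _ in range(max_iter):
        # Assignment by the minimum squared Euclidean distance to a centroid.
        new_labels = [0 if sq_dist(p, c1) <= sq_dist(p, c2) else 1 for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Centroid update for both split parts.
        c1 = centroid([p for p, lab in zip(points, labels) if lab == 0])
        c2 = centroid([p for p, lab in zip(points, labels) if lab == 1])
    return labels


toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
split = two_means_split(toy_points)
part1 = [p for p, lab in zip(toy_points, split) if lab == 0]
part2 = [p for p, lab in zip(toy_points, split) if lab == 1]
ward = (len(part1) * len(part2) / len(toy_points)
        * sq_dist(centroid(part1), centroid(part2)))
# Decomposition (4.5): parent error = children errors + Ward distance.
print(square_error(toy_points), square_error(part1) + square_error(part2) + ward)
```

Replacing the seed selection with random draws or with Anomalous pattern centroids would change only the first few lines of `two_means_split`.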
4.2.2 Splitting by separating
To relax both of the issues above, one can employ a different formulation of Ward distance in (4.1):

$$dw(S_{w1}, S_{w2}) = \frac{N_w N_{w1}}{N_{w2}} \, d(c_{w1}, c_w) \qquad (4.6)$$

the incremental approach to building clusters described in section 3.1.3. According to this approach, cluster $S_{w2}$ and its center $c_{w2}$ are updated incrementally by considering one entity's move at a time. Let us denote $z = 1$ if an entity was added to $S_{w1}$ and $z = -1$ if that entity was removed. Then the new value of Ward distance (4.6) after the move will be $N_w (N_{w1} + z) \, d(c'_{w1}, c_w)/(N_{w2} - z)$, where $c'_{w1}$ is the updated centroid of $S_{w1}$. Relating this to $dw$ in (4.6), we can see that the value of Ward distance increases if the ratio is greater than 1, that is, if

$$\frac{d(c_w, c_{w1})}{d(c_w, c'_{w1})} < \frac{N_{w1} N_{w2} + z N_{w2}}{N_{w1} N_{w2} - z N_{w1}} \qquad (4.7)$$

and it decreases otherwise. This leads us to the following incremental splitting algorithm.
Splitting by separating

1. Initial setting. Given $S_w \subseteq I$ and its centroid $c_w$, specify its split part $S_{w1}$ as consisting of entity $y_1$, which is furthest from $c_w$, and put $c_{w1} = y_1$, $N_{w1} = 1$ and $N_{w2} = N_w - 1$.
2. Next move. Take an entity $y_i$; this can be that nearest to $c_{w1}$.
3. Stop-condition. Check inequality (4.7) with $y_i$ added to $S_{w1}$ if $y_i \notin S_{w1}$, or removed from $S_{w1}$ otherwise. If (4.7) holds, change the state of $y_i$ with respect to $S_{w1}$ accordingly, recalculate $c_{w1} = c'_{w1}$, $N_{w1}$, $N_{w2}$ and go to step 2.
4. Output results: split parts $S_{w1}$ and $S_{w2} = S_w - S_{w1}$; their centroids $c_{w1}$ and $c_{w2}$; their heights, $h_1 = W(S_{w1}, c_{w1})$ and $h_2 = W(S_{w2}, c_{w2})$; and the contribution of the split, that is, Ward distance $dw(S_{w1}, S_{w2})$.

To specify the seed at Step 1, the entity which is the farthest from the centroid is taken. However, different strategies can be pursued too: (a) random selection or (b) taking the centroid of the Anomalous pattern found with the AP algorithm from section 3.2.3. These strategies are similar to those suggested for the 2-Means splitting algorithm, and so are their properties.
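The incremental procedure can be sketched in Python under explicit simplifications, which are assumptions of this illustration and not part of the algorithm as stated: only additions to the first split part are tried, the candidate is always the outside entity nearest to the current centroid of that part, and instead of evaluating inequality (4.7) the sketch compares the values of Ward distance (4.6) before and after the move, which is the equivalent condition. All names and the toy points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance d(a, b)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def split_by_separating(points):
    """Return (indices of part 1, indices of part 2, Ward distance)."""
    n = len(points)
    cw = centroid(points)

    def ward(n1, c1):
        # Formulation (4.6): N_w * N_w1 / N_w2 * d(c_w1, c_w).
        return n * n1 / (n - n1) * sq_dist(c1, cw)

    # Step 1: the seed is the entity farthest from the parent centroid.
    seed = max(range(n), key=lambda i: sq_dist(points[i], cw))
    in_part1 = {seed}
    c1 = list(points[seed])
    current = ward(1, c1)

    # Steps 2-3, additions only; at least one entity stays in part 2.
    while len(in_part1) < n - 1:
        outside = [i for i in range(n) if i not in in_part1]
        i = min(outside, key=lambda j: sq_dist(points[j], c1))
        n1 = len(in_part1)
        c_new = [(n1 * a + b) / (n1 + 1) for a, b in zip(c1, points[i])]
        candidate = ward(n1 + 1, c_new)
        if candidate <= current:
            break  # the move would not increase Ward distance
        in_part1.add(i)
        c1, current = c_new, candidate

    # Step 4: output the split parts and the contribution of the split.
    part1 = sorted(in_part1)
    part2 = [i for i in range(n) if i not in in_part1]
    return part1, part2, current


toy_points = [(0, 0), (0, 1), (6, 0), (6, 1), (6, 2)]
print(split_by_separating(toy_points))
```

On these toy points the sketch separates the first two entities from the other three, and the returned value coincides with Ward distance (4.1) between the two parts, 2 * 3 / 5 * 36.25 = 43.5.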
Example 4.33. Divisive clustering of Masterpieces with 2-Means splitting
Let us apply the Ward-like divisive clustering method to the Masterpieces data in Table 2.13, range standardized with the follow-up rescaling of the dummy variables corresponding to the three categories of Narrative. The method with 2-Means splitting may produce a rather poorly resolved picture if the most distant entities, 6 and 8 according to the distance matrix in Table 4.1, are taken as the initial seeds. Then step 2 would produce tentative classes {1, 3, 4, 5, 6} and {2, 7, 8}, because 2 is closer to 8 than to 6, as is easily seen in Table 4.1. This partition breaks the authorship clusters. Unfortunately, no further iterations can change it. This shows how vulnerable results found with the rule for initial seed setting at the two farthest entities can be. □

Example 4.34. Divisive clustering of Masterpieces with splitting by separating and a box-chart

Splitting by separating, taking the first seed at entity 8, which is the farthest from the origin, works more gently and produces the tree presented in Figure 4.2. This tree differs from the tree found with the agglomerative Ward method not only in the order of author-based divisions (the Tolstoy cluster goes first here) but also in the node heights.

Figure 4.2: Cluster tree of masterpieces built with Splitting by separating; the node heights are scaled as percentages to the pre-processed data scatter.

Let us illustrate the splitting process in the Ward-like divisive clustering with a box chart. The first split separates Tolstoy's two novels, 7 and 8, from the rest. Contributions are calculated according to the decomposition (4.5). The split itself contributes 34.2% to the data scatter, the Tolstoy cluster 5.1% and the rest 60.7%, which is reflected in the areas occupied by the vertically split parts in Figure 4.3. The second split (horizontal lines across the right-hand part of the box) produces the Dickens cluster, with entities 1, 2, and 3, contributing 12.1% to the data scatter, and the Twain cluster, with entities 4, 5, and 6, contributing 14.5%; the split itself contributes 34.1%. If we accept the threshold 1/8 = 12.5% of the data scatter, which is the average contribution of a single entity, as the stopping criterion, then the process halts at this point. A box chart in Figure 4.3 illustrates the process. Slices corresponding to clusters are shadowed and those corresponding to splits are left blank. The most contributing features are put in split slices along with their contributions. The thinner the area of a cluster, the closer its elements to the centroid and thus to each other. □
Figure 4.3: A box chart to illustrate clustering by separating; splitting was halted when split contributions to the data scatter became less than the average contribution of an individual entity.

Example 4.35. Evolutionary tree for Gene profile data and mapping gene histories

Applying agglomerative and divisive Ward clustering, the latter with 2-Means splitting at every step, to the Gene profiles data in Table 1.3 for clustering genomes, which are columns of the table, leads to almost identical trees shown in Figure 4.4, (a) and (b), respectively; the height of each split reflects its contribution to the data scatter as described in section 5.3.2. The rows here are features. It should be noted that the first two lines, in which all entries are unities, do not affect the results of the computation at all, because their contributions to the data scatter are zero. Similarly, in the bacterial cluster appearing after the first split on the right, the next nine COGs (rows 3 to 11) also become redundant because they have constant values throughout this cluster.

The only difference between the two trees is the position of species b within the bacterial cluster dcrbjqv: b belongs to the left split part in tree (a), and to the right split part in tree (b). The first two splits after LUCA reflect the divergence of bacteria (the cluster on the right), then eukaryota (the leaf y) and archaea. All splits in these trees are compatible with the available biological knowledge; this is due to a targeted selection of COGs: of the original 1700 COGs considered in [96], more than one-third did not conform to the major divisions between bacteria, archaea and eukaryota because of extensive loss and horizontal transfer events during evolution. Due to these processes, the results obtained with a variety of tree-building algorithms on the full data are incompatible with the tree found with more robust data, such as similarities between their ribosomal proteins.

The COGs which contribute the most to the splits seem to be biologically relevant in the sense that they tend to be involved in functional systems and processes that are unique for the corresponding cluster. For instance, COG3073, COG3115, COG3107, COG2853 make the maximum contribution, of 13% each, to the split of bacteria in both trees. The respective proteins are unique to the bacterial cluster egfs and are bacteria-specific cell wall components or expression regulators.

Curiously, the divisive Ward-like algorithm with splitting by separating produces a different tree in which a subset of bacteria efgsj splits off first. In contrast to the Masterpieces set analyzed above, the procedure incorrectly determines the starting divergence here. □
4.2.3 Interpretation aids for upper cluster hierarchies
Table 4.2: ScaD extended: Decomposition of the data scatter over the author-based hierarchy for Masterpieces data in Table 3.2.

| Aid | Item | LenS | LenD | NChar | SCon | Pers | Obje | Dire | Total | Total,% |
|---|---|---|---|---|---|---|---|---|---|---|
| Typ | Dickens | 0.03 | 0.05 | 0.04 | 1.17 | 0.09 | 0.00 | 0.06 | 1.43 | 24.08 |
| | Twain | 0.14 | 0.25 | 0.15 | 0.42 | 0.00 | 0.09 | 0.06 | 1.10 | 18.56 |
| | Tolstoy | 0.06 | 0.12 | 0.50 | 0.28 | 0.09 | 0.09 | 0.38 | 1.53 | 25.66 |
| | Expl | 0.23 | 0.41 | 0.69 | 1.88 | 0.18 | 0.18 | 0.50 | 4.06 | 68.30 |
| Tax | Split1 | 0.09 | 0.16 | 0.67 | 0.38 | 0.12 | 0.12 | 0.50 | 2.03 | 34.22 |
| | Split2 | 0.14 | 0.25 | 0.02 | 1.50 | 0.06 | 0.06 | 0 | 2.03 | 34.09 |
| | Expl | 0.23 | 0.41 | 0.69 | 1.88 | 0.18 | 0.18 | 0.50 | 4.06 | 68.30 |
| | Unex | 0.51 | 0.28 | 0.20 | 0.00 | 0.44 | 0.44 | 0.00 | 1.88 | 31.70 |
| | Total | 0.74 | 0.69 | 0.89 | 1.88 | 0.63 | 0.63 | 0.50 | 5.95 | 100.00 |
contribution of feature $v \in V$. The denotation $y_v$ refers to column $v$ of the pre-processed data matrix. According to (5.24) at $u = v$ one has

$$T_v = (y_v, y_v) = \sum_w \frac{N_{w1} N_{w2}}{N_w} (c_{w1,v} - c_{w2,v})^2 + (e_v, e_v) \qquad (4.8)$$

where summation goes over all internal hierarchy clusters $S_w$ split in parts $S_{w1}$ and $S_{w2}$. Each split contributes, thus, $\frac{N_{w1} N_{w2}}{N_w} (c_{w1,v} - c_{w2,v})^2$; the larger a contribution, the greater the variable's effect on the split. The overall decomposition of the data scatter in (5.29),

$$T(Y) = \sum_w \frac{N_{w1} N_{w2}}{N_w} d(c_{w1}, c_{w2}) + W(S, c) \qquad (4.9)$$

where $S$ is the set of leaf clusters of an upper cluster hierarchy, shows contributions of both splits and leaf clusters. Both parts of the decomposition can be used for interpretation:
- Ward distances $\frac{N_{w1} N_{w2}}{N_w} d(c_{w1}, c_{w2})$ between split parts, to express differences between them for taxonomic purposes, and
- decomposition of the square error criterion $W(S, c)$ over leaf clusters, which is nothing but that employed in the analysis of results of K-Means clustering in the previous chapter; these can be used for purposes of typology rather than taxonomy.
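The split contributions in (4.8) can be illustrated numerically. The following is a minimal sketch, assuming a single split of the whole (centered) data set; the function name `split_contributions`, the random data and the chosen split parts are hypothetical and serve only to show that split contributions plus within-part scatter add up to the feature scatter $T_v$.

```python
import numpy as np

def split_contributions(Y, part1, part2):
    """Per-feature contributions N1*N2/N * (c1v - c2v)**2 of one split."""
    Y1, Y2 = Y[part1], Y[part2]
    n1, n2 = len(Y1), len(Y2)
    c1, c2 = Y1.mean(axis=0), Y2.mean(axis=0)
    return n1 * n2 / (n1 + n2) * (c1 - c2) ** 2

# Hypothetical data: 8 entities, 3 features, centered so the root centroid is 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = X - X.mean(axis=0)
part1, part2 = [0, 1, 2], [3, 4, 5, 6, 7]

contrib = split_contributions(Y, part1, part2)       # explained by the split
within = sum(((Y[p] - Y[p].mean(axis=0)) ** 2).sum(axis=0)
             for p in (part1, part2))                # unexplained part (e_v, e_v)
scatter = (Y ** 2).sum(axis=0)                       # T_v for each feature v

print(np.allclose(contrib + within, scatter))
```

Dividing `contrib` by `scatter.sum()` would give relative contributions of the kind reported, in per cent, in table ScaD.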
Example 4.36. Split-to-variable aids on a box chart and in table ScaD
Interpretation split-to-variable aids are displayed at the box chart in Figure 4.3. The upper split contributes 34.2% to the data scatter. The between-cluster differences $(c_{w1,v} - c_{w2,v})^2$ contributing most are at variables NChar and Direct. The next split contributes 34.1% and is almost totally due to SCon (25.2 out of 34.1). A more complete picture can be seen in Table 4.2, which extends Table 3.24 ScaD by adding one more aspect: split contributions. Maximum contributions are highlighted in boldface. The upper part of the table supplies aids for typological analysis, treating each cluster as is, and the middle part for taxonomical analysis, providing aids for interpretation of splits. Both parts take into account those contributions that relate to the explained part of the leaf cluster partition. □
(ii) Split-to-Covariance. Feature-to-feature covariances are decomposed over splits according to cluster contributions equal to $\frac{N_{w1} N_{w2}}{N_w} (c_{w1,v} - c_{w2,v})(c_{w1,u} - c_{w2,u})$ according to (5.24), where $u, v \in V$ are two features. At this level, not only do the quantities remain important for the purposes of comparison, but one more phenomenon may also occur. A covariance coefficient entry may appear with a different sign in a cluster, which indicates that in this split the association between the variables concerned changes its direction from positive to negative or vice versa. Such an observation may lead to insights into the cluster's substance.
Example 4.37. Decomposition of covariances over splits
Let us consider covariances between variables LenD, NChar, and SCon. Their total values, in thousandths, are presented in the left-hand part of the matrix equation below, and the corresponding items related to the first and second splits in Figure 4.2 in its right-hand part. Rows and columns are ordered LenD, NChar, SCon; the first matrix on the right relates to split S1 and the second to split S2. Entries on the main diagonal relate to the data scatter as discussed above.

$$
\begin{pmatrix} 87 & 58 & -47 \\ 58 & 111 & 42 \\ -47 & 42 & 234 \end{pmatrix}
=
\begin{pmatrix} 20 & 40 & 30 \\ 40 & 83 & 63 \\ 30 & 63 & 47 \end{pmatrix}
+
\begin{pmatrix} 32 & 9 & -77 \\ 9 & 2 & -21 \\ -77 & -21 & 188 \end{pmatrix}
+ \dots
$$

Each entry decomposed may tell us a story. For instance, the global positive correlation between NChar and SCon (+42) becomes more expressed at the first split (+63) and negative at the second split (-21). Indeed, these two are at their highest in the Tolstoy cluster and are discordant between Dickens and Twain. □

(iii) Split-to-Entry. Any individual row-vector $y_i$ in the data matrix can be decomposed according to an upper cluster hierarchy $S$ into the sum of items contributed by clusters $S_w \in S$ containing $i$, plus a residual, which is zero when $i$ itself constitutes a singleton cluster belonging to the hierarchy. This is guaranteed by the model (5.23). Each cluster $S_{w1}$ containing $i$ contributes the difference between its centroid and the centroid of its parent, $c_{w1} - c_w$, as described on page 153. The larger the cluster, the more aggregated its contribution is.
Example 4.38. Decomposition of an individual entity over a hierarchy
The decomposition of an individual entity, such as Great Expectations (entity 3 in the Masterpieces data), into items supplied by clusters from the hierarchy presented in Figure 4.2 on page 121, is illustrated in Table 4.3. The entity data constitute the first line in this table. The other lines refer to contributions of the three largest clusters in Figure 4.2 covering entity 3: the root, the not-Tolstoy cluster, and the Dickens cluster. These clusters are formed before the splitting, after the first split, and after the second split, respectively. The last line contains residuals, due to the aggregate nature of the Dickens cluster. One can see, for instance, that SCon=0 for the entity is initially changed for the grand mean, 0.63, and then step by step degraded, the most important change being at the second split separating Dickens from the rest. □
Table 4.3: Single entity data decomposed over clusters containing it; the last line is the residuals.

Item         LenS    LenD    NChar   SCon    Pers    Obje    Dire
GExpectat    23.9    38      3       0       0       1       0
Grand Mean   22.45   34.13   3.00    0.63    0.38    0.38    0.25
1 split      -1.03   -3.32   -0.50   -0.13   0.13    0.13    -0.25
2 split      2.68    8.43    0.17    -0.50   0.17    -0.17   0.00
Residual     -0.20   -1.23   0.33    0.00    -0.67   0.67    0.00
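As a quick check of the split-to-entry decomposition, the lines of Table 4.3 can be summed up feature by feature. This is only a sketch using the rounded values printed in the table, so the recovered entity can differ from the original row by a rounding error of about 0.01; the variable names are illustrative.

```python
import numpy as np

features = ["LenS", "LenD", "NChar", "SCon", "Pers", "Obje", "Dire"]
entity = np.array([23.9, 38, 3, 0, 0, 1, 0])       # Great Expectations row

items = np.array([
    [22.45, 34.13, 3.00, 0.63, 0.38, 0.38, 0.25],    # grand mean (root)
    [-1.03, -3.32, -0.50, -0.13, 0.13, 0.13, -0.25], # first split
    [2.68, 8.43, 0.17, -0.50, 0.17, -0.17, 0.00],    # second split
    [-0.20, -1.23, 0.33, 0.00, -0.67, 0.67, 0.00],   # residual
])

recovered = items.sum(axis=0)                 # sum of all contributions
max_gap = np.abs(recovered - entity).max()    # rounding error only

for name, value in zip(features, recovered):
    print(f"{name}: {value:.2f}")
```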
This type of analysis, which emphasizes unusually high negative or positive contributions, can be applied to a wide variety of hierarchical clustering results. When a hierarchical tree obtained by clustering has substantive meaning, it can also be used for interpretation of other types of data. To illustrate this, let us consider the evolutionary trees built in the previous section and see how one can employ them to reconstruct evolutionary histories of individual genes.
Example 4.39. Using an evolutionary tree to reconstruct the history of a gene
After an evolutionary tree has been built (see Figure 4.4), one may consider the problem of finding parsimonious scenarios of gene evolution leading to the observed presence-absence patterns. An evolutionary scenario for a gene may include major evolutionary events such as emergence of the gene, its inheritance along the tree, loss, and horizontal transfer between branches of the tree. To illustrate this line of development, let us consider three COGs in Table 4.4: one from the data in Table 1.3 and two not used for building the tree, COG1514 2'-5' RNA ligase and COG0017 Aspartyl-tRNA synthetase.
Table 4.4: Gene profiles: Presence-absence profiles of three COGs over 18 genomes.

                 Species
No  COG       y  a  o  m  p  k  z  q  v  d  r  b  c  e  f  g  s  j
16  COG1709   0  1  1  1  1  1  1  0  0  0  0  0  0  0  1  0  0  0
31  COG1514   0  1  0  1  1  1  1  1  1  1  0  1  0  1  1  0  0  1
32  COG0017   1  1  1  1  1  1  1  0  0  1  0  1  1  1  0  1  1  0
Given a gene presence-absence profile, let us consider, for the sake of simplicity, that each of the events of emergence, loss, and horizontal transfer of the gene is assigned the same penalty weight while events of inheritance along the tree are not penalized as conforming to the tree structure. Then, COG1709 (the first line of Table 4.4) can be thought of as having emerged in the ancestor of all archaea (the node corresponding to cluster aompkz) and then horizontally transferred to species f, which amounts to two penalized events. No other scenario for this COG has the same or smaller number of events.
The reconstructed history of COG1514 (the second line of Table 4.4) is more complicated. Let us reconstruct it on the right-hand tree of Figure 4.4. Since this COG is represented in all non-singleton clusters, it might have emerged in the root of the tree. Then, to conform to its profile, it must be lost at y, o, s, g, c and r, which amounts to 7 penalized events, one emergence and six losses, altogether. However, there exist better scenarios with six penalized events each. One such scenario assumes that the gene emerged in the node of the archaea cluster aompkz and was then horizontally transferred to cluster vqbj and singletons f, e, d, thus leading to one emergence event, four horizontal transfers, and one loss, at o. Obviously, the profile data by themselves give no clue to the node of emergence. In principle, the gene in question might have emerged in cluster vqbj or even in one of the singletons. To resolve this uncertainty, one needs additional data on evolution of the given gene, but this is beyond the scope of this book. □
$$u(S) = \frac{1}{K} \sum_{k=1}^{K} p_k \Big[ \sum_l \sum_{v \in V_l} P(l = v \mid S_k)^2 - \sum_l \sum_{v \in V_l} P(l = v)^2 \Big] \qquad (4.10)$$
The term in square brackets is the increase in the expected number of attribute values that can be predicted given a class, $S_k$, over the expected number of attribute values that could be predicted without using the class. The assumed prediction strategy follows a probability-matching approach. According to this approach, category $v$ is predicted with the frequency reflecting its probability, $p(v/k)$ within $S_k$, and $p_k = N_k/N$ when information of the class is not provided. Factors $p_k$ weigh classes $S_k$ according to their sizes, and the division by $K$ takes into account the difference in partition sizes: the smaller the better. Either of these functions can be applied to a partition $S$ to decide which of its clusters is to be split and how. They are closely related to each other as well as to contributions $B_{kv}$ in section 3.4.2, which can be stated as follows.
Statement 4.4. The impurity function $\Delta(l, S)$ equals the summary contribution $B(l, S) = \sum_{v \in V_l} \sum_{k=1}^{K} B_{kv}$ with scaling factors $b_v = 1$ for all $v \in V_l$. The category utility function $u(S)$ is the sum of impurity functions over all features $l \in L$ related to the number of clusters $K$, $u(S) = \sum_l \Delta(l, S)/K$.

Proof: Indeed, according to the definition of impurity function, $\Delta(l, S) = G(l) - \sum_k p_k G(l/k) = 1 - \sum_{v \in V_l} p_v^2 - \sum_k p_k \big(1 - \sum_{v \in V_l} p(v/k)^2\big) = \sum_k \sum_{v \in V_l} p_{kv}^2 / p_k - \sum_{v \in V_l} p_v^2 = B(l, S)$. To prove the second part of the statement, let us note that $P(l = v \mid S_k) = p(v/k)$ and $P(l = v) = p_v$. This obviously implies that $u(S) = \sum_l \Delta(l, S)/K$, which proves the statement.

The summary impurity function, or the category utility function multiplied by $K$, is exactly the summary contribution of variables $l$ to the explained part of the data scatter, $B(S, c)$, that is, the complement of the K-Means square error clustering criterion to the data scatter, when the data pre-processing has been done with all $b_v = 1$ ($v \in V_l$) [93]. In brief, maximizing the category utility function is equivalent to minimizing the K-Means square error criterion divided by the number of clusters with the data standardized as described above. This invites, first, different splitting criteria associated with $B(S, c)$ at different rescaling factors $b_v$ and $b'_v$, and, second, extending the splitting criterion to the mixed scale feature case, which is taken into account in the following formulation.
Conceptual clustering with binary splits
1. Initial setting. Set $S$ to consist of the only cluster, the entire entity set $I$.
2. Evaluation. In the cycle over all clusters $k$ and variables $l$, consider the possibility of splitting $S_k$ over $l$ in two parts. If $l$ is quantitative, the split must correspond to a split of its range in two intervals: if $b_l$ and $b_r$ are the minimum and maximum values of $l$ at $S_k$, take a number $T$ of locations for the splitting point, $p_t = b_l + t(b_r - b_l)/T$ ($t = 1, \dots, T$), and select the one that maximizes the goodness-of-split function. If $l$ is nominal and has more than two categories, consider categories $v \in V_l$ as entities weighted by their frequencies in $S_k$, $p(v/k)$, and apply a version of the correspondingly modified Serial splitting algorithm to divide them in two parts.
3. Splitting. Select the pair $(k, l)$ that received the highest score and do the binary split.
4. Halt. Check the stop-condition. If it is satisfied, end. Otherwise go to 2.

Since the scoring function is the same as in the Ward-like divisive clustering of the previous section, the stop-condition here can be borrowed from that on page 117. One more stopping criterion comes from the category utility function, which is the total contribution divided by the number of clusters $K$: calculations should stop when this goes down. Unfortunately, this simple idea does not always seem to work, as will be seen in the following examples.
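The evaluation step for a quantitative variable can be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: the goodness-of-split function is taken to be the Ward distance between the two parts, and the function names `ward_split_score` and `best_threshold`, as well as the toy data, are hypothetical.

```python
import numpy as np

def ward_split_score(Y, mask):
    """Ward distance N1*N2/(N1+N2) * squared distance between part centroids."""
    n1, n2 = int(mask.sum()), int((~mask).sum())
    if n1 == 0 or n2 == 0:          # degenerate split, e.g. at p_T = maximum
        return 0.0
    c1, c2 = Y[mask].mean(axis=0), Y[~mask].mean(axis=0)
    return n1 * n2 / (n1 + n2) * float(((c1 - c2) ** 2).sum())

def best_threshold(Y, x, T=10):
    """Try the T points p_t = b_l + t*(b_r - b_l)/T and keep the best one."""
    b_l, b_r = x.min(), x.max()
    best_score, best_p = 0.0, None
    for t in range(1, T + 1):
        p = b_l + t * (b_r - b_l) / T
        score = ward_split_score(Y, x <= p)
        if score > best_score:
            best_score, best_p = score, p
    return best_score, best_p

# Hypothetical one-feature cluster with two obvious groups.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
Y = (x - x.mean()).reshape(-1, 1)
score, p = best_threshold(Y, x, T=10)
print(score, p)
```

In a full procedure this evaluation would be repeated over all clusters and variables, and the best-scoring pair would be split.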
Example 4.40. Conceptual clustering of Digits
The classification tree of the Digit data in Figure 1.5 on page 16 has been produced with the process above, assuming all the features are binary nominal. Let us take, for instance, partition $S = \{S_1, S_2\}$ of $I$ according to attribute e2, which is present at $S_1$ comprising 4, 5, 6, 8, 9, and 0, and is absent at $S_2$ comprising 1, 2, 3, and 7. Cross-classification of $S$ and e7 in Table 4.5 yields $\Delta(e7, S) = 0.053$.

Table 4.5: Cross-tabulation of S (or, e2) against e7.

         e7=1   e7=0   Total
S1         5      1      6
S2         2      2      4
Total      7      3     10
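The value 0.053 can be reproduced from the counts in Table 4.5. The sketch below assumes the impurity function is the reduction of the Gini index, $G(l) - \sum_k p_k G(l/k)$, as in the proof of Statement 4.4; the function names are illustrative.

```python
import numpy as np

def gini(p):
    """Gini index 1 - sum of squared probabilities."""
    return 1.0 - float((np.asarray(p) ** 2).sum())

def impurity_reduction(table):
    """Reduction of the Gini index of the column variable given the row partition."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()                    # relative frequencies p_kv
    p_k = P.sum(axis=1)                # cluster proportions
    p_v = P.sum(axis=0)                # category proportions
    within = sum(p_k[k] * gini(P[k] / p_k[k]) for k in range(len(p_k)))
    return gini(p_v) - within

# Rows S1, S2; columns e7=1, e7=0 (inner part of Table 4.5).
delta = impurity_reduction([[5, 1], [2, 2]])
print(round(delta, 3))
```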
To see what this has to do with the setting in which the K-Means complementary criterion applies, let us pre-process the Digit data matrix by subtracting the averages within each column (see Table 4.6); note that the scaling coefficients are all unity here. However, the data in Table 4.6 is not exactly the data matrix $Y$ considered theoretically because both $Y$ and $X$ must have 14 columns after enveloping each of the 14 categories reflected in Table 2.3. Columns corresponding to the category "ei is absent" in all features i=1,2,...,7 are not included in Table 4.6, because they provide no additional information. The data scatter of this matrix is the summary column variance times $N = 10$, which is 13.1. However, to get the data scatter in the left-hand side of (5.13), this must be doubled to 26.2 to reflect the "missing half" of the virtual data matrix $Y$. Let us now calculate within-class averages $c_{kv}$ of each of the variables, $k = 1, 2$, v=e1,...,e7, and take contributions $N_k c_{kv}^2$ summed up over clusters $S_1$ and $S_2$. This is done in Table 4.7, the last line of which contains contributions of all features to the explained part of the data scatter.

Table 4.6: Data in Table 2.3 1/0 coded with the follow-up centering of the columns.
Table 4.7: Feature contributions to digit classes defined by e2.

e2      e1      e2      e3      e4      e5      e6      e7
e2=1    0.007   0.960   0.107   0.107   0.060   0.060   0.107
e2=0    0.010   1.440   0.160   0.160   0.090   0.090   0.160
Total   0.017   2.400   0.267   0.267   0.150   0.150   0.267

The last item, 0.267, is the contribution of e7. Has it anything to do with the reported value of the impurity function $\Delta(e7, S) = 0.053$? Yes, it does. There are two factors that make these two quantities different. First, to get the contribution from $\Delta$, it must be multiplied by $N = 10$, leading to $10\Delta(e7, S) = 0.533$. Second, this is the contribution to the data scatter of matrix $Y$ obtained after enveloping all 14 categories, which has not been done in Table 4.6 and is thus not taken into account in the contribution 0.267. After the contribution is properly doubled, the quantities do coincide.
Similar calculations made for the other six attributes, e1, e2, e3,..., e6, lead to the total $\sum_{l=1}^{7} \Delta(el, S) = 0.703$ and, thus, to $u(S) = 0.352$ according to Statement 4.4, since $K = 2$. The part of the data scatter taken into account by partition $S$ is the total of $\Delta(S, el)$ over $l = 1, \dots, 7$ times $N = 10$, according to (5.22), that is, 7.03 or 26.8% of the scatter 26.2.
The evaluations at the first splitting step of the total Digit set actually involve all pairwise contingency coefficients $G^2(l, l') = \Delta(l, l')$ ($l, l' = 1, \dots, 7$) displayed in Table 4.8. According to this data, the maximum summary contribution is supplied by the $S$ made according to e7; it is equal to 9.63, which is 36.8% of the total data scatter.
Table 4.8: Pairwise contingency coefficients. In each column el values of $\Delta$ for all variables are given under the assumption that partition S is made according to el.

Target   e1      e2      e3      e4      e5      e6      e7      Total
e1       0.320   0.005   0.020   0.020   0.080   0.005   0.245   0.695
e2       0.003   0.480   0.053   0.053   0.030   0.030   0.053   0.703
e3       0.020   0.080   0.320   0.045   0.005   0.005   0.045   0.520
e4       0.015   0.061   0.034   0.420   0.004   0.009   0.115   0.658
e5       0.053   0.030   0.003   0.003   0.480   0.030   0.120   0.720
e6       0.009   0.080   0.009   0.020   0.080   0.180   0.020   0.398
e7       0.187   0.061   0.034   0.115   0.137   0.009   0.420   0.963
Thus, the first split must be done according to e7. The second split, according to e5, contributes 3.90, and the third split, according to e1, 3.33, so that the resulting four-class partition, $S = \{\{1, 4, 7\}, \{3, 5, 9\}, \{6, 8, 0\}, \{2\}\}$, contributes 9.63+3.90+3.33=16.87=64.4% to the total data scatter. The next partition step would contribute less than 10% of the data scatter, that is, less than an average entity, which may be considered a signal to stop the splitting process. One should note that the category utility function $u(S)$ after the first split is equal to 9.63/2=4.81, and after the second split, to (9.63+3.90)/3=13.53/3=4.51. The decrease means that calculations must be stopped after the very first split, according to the category utility function, which is not an action of our preference. □
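The behaviour of the category utility stopping rule in this example can be traced with a few lines of arithmetic. This sketch only re-uses the split contributions quoted above (9.63, 3.90, 3.33) and divides the accumulated contribution by the number of clusters; the variable names are illustrative.

```python
# Split contributions to the data scatter quoted in Example 4.40.
contributions = [9.63, 3.90, 3.33]

explained = 0.0
utilities = []
for n_splits, b in enumerate(contributions, start=1):
    explained += b
    n_clusters = n_splits + 1            # each binary split adds one cluster
    utilities.append(explained / n_clusters)

for n_splits, u in enumerate(utilities, start=1):
    print(f"after split {n_splits}: u = {u:.3f}")
```

The sequence decreases from the very first step, which is why the rule would halt at two clusters here.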
Example 4.41. Relation between conceptual and Ward clustering of Gene profiles
The divisive tree of species according to gene profiles in Figure 4.4 (b) can be used for analysis of the conceptual clustering category utility score criterion (4.10), which is equivalent to the ratio of the explained part of the data scatter over the number of clusters. Indeed, the first split contributes 50.3% to the data scatter, which makes the category utility function $u$ equal to 50.3/2 = 25.15. The next split, of the bacterial cluster, adds 21.2%, making the total contribution 50.3+21.2=71.5%, which decreases the utility function to $u$ = 71.5/3 = 23.83. This would force the division process to stop at just two clusters, which shows that the normalizing value $K$ might be overly stringent. This consideration may be applied not only to the general divisive clustering results in Figure 4.4 (b) but to conceptual clustering results as well. Why? Because each of the two splits, although found with the multidimensional search, can also be done monothetically, with one feature only: the first split at COG0290 or COG1405 (lines 4, 17 in Table 1.3) and the second one at COG3073 or COG3107 (lines 22, 28 in Table 1.3). These splits must be optimal because they have been selected in the much less restrictive multidimensional splitting process. □
4.4 Extensions of Ward clustering
4.4.1 Agglomerative clustering with dissimilarity data
Given a dissimilarity matrix $D = (d_{ij})$, $i, j \in I$, the Ward algorithm can be reformulated as follows.

Ward dissimilarity agglomeration:
1. Initial setting. All the entities are considered as singleton clusters so that the between-entity dissimilarities are between-cluster dissimilarities. All singleton heights are set to be zero.
2. Agglomeration rule. Two candidate clusters, $S_{w1}$ and $S_{w2}$, that are nearest to each other (that is, being at the minimum distance) are merged together forming their parent cluster $S_{w1w2} = S_{w1} \cup S_{w2}$, and the merged cluster's height is defined as the sum of the children's heights plus the distance between them.
3. Distance. If the newly formed cluster $S_{w1w2}$ coincides with the entire entity set, go to Step 4. Otherwise, remove $S_{w1}$ and $S_{w2}$ from the set of candidate clusters and define distances between the new cluster $S_{w1w2}$ and other clusters $S_k$ as follows:

$$d_{w1w2,k} = \frac{N_{w1} + N_k}{N^+} d_{w1,k} + \frac{N_{w2} + N_k}{N^+} d_{w2,k} - \frac{N_k}{N^+} d_{w1,w2} \qquad (4.11)$$

where $N^+ = N_{w1w2} + N_k$. The other distances remain unvaried. Then, having the number of candidate clusters reduced by one, go to Step 2.
4. Output. Output the upper part of the cluster tree according to the height function.

It can be proven that distance (4.11) is equal to the Ward distance between the merged cluster and other clusters when $D$ is a matrix of Euclidean squared distances. In fact, formula (4.11) allows the calculation and update of Ward distances without calculation of cluster centroids. Agglomeration step 2 remains computationally intensive. However, the amount of calculations can be decreased because of properties of the Ward distance [103].
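The four steps can be sketched in code as follows. This is a minimal, unoptimized illustration that searches all candidate pairs at every step; the function name and the three-point example are hypothetical. In this sketch singleton distances are taken to be the entries of $D$ themselves, so with squared Euclidean input the final height comes out as twice the data scatter around the mean.

```python
import numpy as np

def ward_dissimilarity_agglomeration(D):
    """Merge clusters using only dissimilarities, updating them with (4.11)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    size = {i: 1 for i in range(n)}
    height = {i: 0.0 for i in range(n)}
    dist = {(i, j): D[i, j] for i in range(n) for j in range(i + 1, n)}
    merges, new_id = [], n
    while len(size) > 1:
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        for k in size:                       # update distances to other clusters
            if k in (a, b):
                continue
            d_ak = dist[(min(a, k), max(a, k))]
            d_bk = dist[(min(b, k), max(b, k))]
            n_plus = size[a] + size[b] + size[k]
            dist[(k, new_id)] = ((size[a] + size[k]) * d_ak
                                 + (size[b] + size[k]) * d_bk
                                 - size[k] * d_ab) / n_plus
        # drop every distance involving the two merged clusters
        dist = {pair: d for pair, d in dist.items()
                if a not in pair and b not in pair}
        size[new_id] = size.pop(a) + size.pop(b)
        height[new_id] = height.pop(a) + height.pop(b) + d_ab
        merges.append((a, b, d_ab, height[new_id]))
        new_id += 1
    return merges

# Hypothetical example: three points on a line, squared Euclidean distances.
x = np.array([0.0, 2.0, 5.0])
D = (x[:, None] - x[None, :]) ** 2
merges = ward_dissimilarity_agglomeration(D)
for step in merges:
    print(step)
```

Each tuple lists the two merged cluster labels, the distance at which they merged, and the height of the new cluster.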
total number of observations, they can be processed with a greater extent of comparability than the ordinary entity-to-feature data. To introduce the concepts needed, let us consider a contingency table $P = (p_{tu})$, $t \in T$, $u \in U$, whose entries have been divided by the total flow $p_{++}$, which means that $p_{++} = 1$. The marginals $p_{t+}$ and $p_{+u}$, which are just within-row and within-column totals, will be referred to as the weights of rows $t \in T$ and columns $u \in U$. For any subset of rows $S \subseteq T$, the conditional probability of a column $u$ can be defined as $p(u/S) = p_{Su}/p_{S+}$ where $p_{Su}$ is the sum of frequencies $p_{tu}$ over all $t \in S$ and $p_{S+}$ the summary frequency of rows $t \in S$.
Example 4.42. Aggregating Confusion data
For the Confusion data in Table 1.9, the matrix of relative frequencies is in Table 4.9. For example, for $S = \{1, 4, 7\}$ and $u = 3$, $p_{S3} = 0.001 + 0.000 + 0.002 = 0.003$ and $p_{S+} = 0.100 + 0.100 + 0.100 = 0.300$ so that $p(3/S) = 0.003/0.300 = 0.010$. Analogously, for $u = 1$, $p_{S1} = 0.088 + 0.015 + 0.027 = 0.130$ and $p(1/S) = 0.130/0.300 = 0.433$. □
The row set $S$ will be characterized by its profile, the vector of conditional probabilities $g(S) = (p(u/S))$, $u \in U$. Then, the chi-squared distance between any two non-overlapping row sets, $S_1$ and $S_2$, is defined as

$$\chi(g(S_1), g(S_2)) = \sum_{u \in U} (p(u/S_1) - p(u/S_2))^2 / p_{+u} \qquad (4.12)$$

Using this concept, Ward's agglomeration algorithm applies to contingency data exactly as it has been defined in section 4.1 except that the Ward distance is modified here to adapt to the situation when both rows and columns are weighted:

$$w(S_{h1}, S_{h2}) = \frac{p_{S_{h1}} p_{S_{h2}}}{p_{S_{h1}} + p_{S_{h2}}} \chi(g(S_{h1}), g(S_{h2}))$$
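The profile, the chi-squared distance and the modified Ward distance can be sketched as follows. The 3-by-2 table `P` is a hypothetical example, not the Confusion data, and the function names are illustrative; row-set weights are taken to be the summary frequencies of the rows in each set.

```python
import numpy as np

def profile(P, rows):
    """Conditional probabilities p(u/S) of columns given the row set S."""
    totals = P[rows].sum(axis=0)
    return totals / totals.sum()

def chi2_distance(P, rows1, rows2):
    """Chi-squared distance (4.12) between the profiles of two row sets."""
    col_weights = P.sum(axis=0)                      # p_{+u}
    diff = profile(P, rows1) - profile(P, rows2)
    return float((diff ** 2 / col_weights).sum())

def ward_chi2(P, rows1, rows2):
    """Modified Ward distance with marginal frequencies as cluster weights."""
    w1, w2 = P[rows1].sum(), P[rows2].sum()
    return float(w1 * w2 / (w1 + w2) * chi2_distance(P, rows1, rows2))

# Hypothetical table of relative frequencies (entries sum to 1).
P = np.array([[0.20, 0.10],
              [0.10, 0.20],
              [0.30, 0.10]])

print(profile(P, [0, 1]))
print(ward_chi2(P, [0], [1]))
```

On such a table one can also check numerically that merging two rows lowers the Pearson chi-squared coefficient of the table by exactly `ward_chi2` of those rows.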
Figure 4.5: The hierarchy found with the modified Ward algorithm for Confusion data.
This definition differs from the standard definition in the following three aspects:
1. Profiles are taken as cluster centroids.
2. Chi-squared distance is taken instead of the Euclidean squared.
3. Marginal frequencies are used instead of cardinalities.
These formulas are derived in section 5.4.3 from a data recovery model relating Quetelet coefficients in the original data table and that aggregated according to the clustering. It appears that they express the decrement of the Pearson chi-square contingency coefficient under the aggregation.
Example 4.43. Clustering and aggregation of Confusion data
The drawing in Figure 4.5 shows the hierarchy of Digits row clusters found with the agglomerative clustering algorithm which uses the chi-square distance (4.12). Curiously, the same topology, with slightly changed heights, emerges when the data table is aggregated over rows and columns simultaneously to minimize the decrement of $X^2(F, F)$ of the aggregated table. The aggregate confusion rates and Quetelet coefficient data corresponding to the four-class partition, $S = \{\{1, 4, 7\}, \{3, 5, 9\}, \{6, 8, 0\}, \{2\}\}$, are on the right in Table 4.10. □

I     1.95   -0.88   -0.89   -0.95
II   -0.90    7.29   -0.69   -0.68
III  -0.79   -0.70    1.81   -0.69
IV   -0.84   -0.84   -0.70    1.86
4.5 Overall assessment
Advantages of hierarchical clustering:
1. Visualizes the structure of similarities in a convenient form.
2. Models taxonomic classifications.
3. Provides a number of interpretation aids at the level of entities, variables and variable covariances.

Less attractive features of the approach:
1. Massive computations related to finding minimum distances at each step, which is especially time consuming in agglomerative algorithms.
2. Rigidity: once a splitting or merging step has been made, there is no possibility to change it afterwards.

Computations in agglomerative clustering can be drastically reduced if the minima from previous computations are kept.
Data Recovery Models
Main subjects covered:
1. What is the data recovery approach.
2. A data recovery model and method for Principal component analysis.
3. A data recovery model and data scatter decomposition for K-Means and Anomalous cluster clustering.
4. A data recovery model and data scatter decompositions for cluster hierarchies.
5. A unified matrix equation model for all three above.
6. Mathematical properties of the models justifying methods presented in previous chapters.
7. Extensions of the models, criteria and methods to similarity and contingency data.
8. One-by-one data recovery clustering methods.
9. Data recovery interpretation of correlation and association coefficients.
Linear regression A method for analysis of interrelation between two quantitative features x and y in which y is approximated by an affine transformation ax + b of x, where a and b are referred to as slope and intercept, respectively. This setting is the genuine ground on which the coefficients of correlation and determination are defined and substantiated.

One-by-one clustering A method in clustering in which clusters or splits are taken one by one. In this text, all such methods exploit the additive structure of clustering data recovery models, which is analogous to that of the model of Principal Component Analysis. Cluster separations and splits are to be made in the order of their contribution to the data scatter. In this way, the additive structure of the data scatter decomposition is maintained to provide for model-based interpretation aids.

Principal component analysis A method for approximation of a data matrix with a small number of hidden factors, referred to as principal components, such that data entries are expressed as linear combinations of hidden factor scores. It appears that principal components can be determined with the singular value decomposition (SVD) of the data matrix.

Reference point A vector in the variable space serving as the space origin. The Anomalous pattern is sought starting from an entity furthest from the reference point, which thus models the norm from which the Anomalous pattern deviates most.

Split versus separation Difference between two perspectives: cluster-versus-the-rest and cluster-versus-the-whole, reflected in different coefficients attached to the distance between centroids of split parts. The former perspective is taken into account in the Ward-like divisive clustering methods, the latter in the Anomalous pattern clustering.

Ward-like divisive clustering A divisive clustering method using Ward distance as the splitting criterion. The method can be considered an implementation of the one-by-one PCA strategy within the data recovery clustering.
Let a series of real numbers, $x_1, \dots, x_N$, have been assumed to represent the same unknown value $a$. Equation ( ) then becomes $x_i = a + e_i$ with $e_i$ being the residual for $i = 1, \dots, N$. To minimize the sum of squares $L(a) = \sum_i e_i^2 = \sum_i (x_i - a)^2$ as a function of $a$, one may utilize the first-order optimality condition, $dL/da = 0$, that is, $dL/da = -2(\sum_i x_i - Na) = 0$. That means that the least-squares solution is the average $a = \bar{x} = \sum_i x_i/N$. By substituting this for $a$ in $L(a)$, one obtains $L(\bar{x}) = \sum_i x_i^2 - N\bar{x}^2$. The last expression gives, in fact, the decomposition of the data scatter, the sum of data entries squared $T(x) = \sum_i x_i^2$, into the explained and unexplained parts, $T(x) = N\bar{x}^2 + L(\bar{x})$. The averaged unexplained value $L(\bar{x})/N$ is the well known variance $s(x)^2$ of the series, and its square root, $s(x) = \sqrt{L(\bar{x})/N}$, the standard deviation. It appears thus that the average minimizes the standard deviation $s(x)$ of observations from $a$.
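The decomposition $T(x) = N\bar{x}^2 + L(\bar{x})$ is easy to verify numerically. The series below is a hypothetical example; the sketch also compares the sum of squares at the average with that at a few other candidate values of $a$.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical series
N = len(x)
mean = x.mean()

T = (x ** 2).sum()                 # data scatter
L = ((x - mean) ** 2).sum()        # unexplained part at the average
explained = N * mean ** 2          # explained part

def sum_of_squares(a):
    """Least-squares criterion L(a) for a candidate value a."""
    return float(((x - a) ** 2).sum())

candidates = [mean - 1.0, mean - 0.1, mean + 0.1, mean + 1.0]
print(T, explained + L)
print(L / N, x.var())              # variance as the averaged unexplained value
```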
2. The values of $\rho$ are restricted to the interval between $-1$ and $1$. The closer $\rho$ is to either $1$ or $-1$, the smaller are the residuals in equation ( ). For instance, at $\rho = 0.9$, the unexplained variance of $y$ constitutes $1 - \rho^2 = 19\%$ of its original variance.
3. The slope $a$ is proportional to $\rho$ so that $a$ is positive or negative depending on the sign of the correlation coefficient. When $\rho = 0$ the slope is 0 too; the variables $y$ and $x$ are referred to as non-correlated, in this case, which means that there is no linear relation between them, though another functional relation, such as a quadratic one, may exist. The case of $\rho = 0$ geometrically means that centered versions of feature vectors $x = (x_i)$ and $y = (y_i)$ are mutually orthogonal.
4. With the data pre-processed as

$$x' = \frac{x - \bar{x}}{s(x)} \quad \text{and} \quad y' = \frac{y - \bar{y}}{s(y)} \qquad (5.1)$$

the variances become unities, thus leading to simpler formulas: $a' = \rho = (x', y')/N$, $b' = 0$, $L' = N(1 - \rho^2)$, and $N = N\rho^2 + L'$ where $N$ happens to be equal to the scatter of $y'$.
5. The value of $\rho$ does not change under linear transformations of scales of $x$ and $y$.
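The simplified formulas for standardized data can be checked on synthetic data. The sketch below assumes the standard deviation is computed with division by $N$, in line with $s(x)^2 = L(\bar{x})/N$; the generated data and variable names are hypothetical, and `np.polyfit` is used only as a generic least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.normal(size=N)
y = 0.8 * x + rng.normal(scale=0.5, size=N)   # hypothetical noisy linear relation

# Standardization (5.1): centre and divide by the standard deviation (ddof=0).
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

rho = float(xs @ ys) / N                       # correlation as inner product / N
slope, intercept = np.polyfit(xs, ys, 1)       # least-squares line on standardized data
residual_ss = float(((ys - (slope * xs + intercept)) ** 2).sum())

print(rho, slope, intercept)
print(residual_ss, N * (1 - rho ** 2))
```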
5.1.3 Principal component analysis
Principal component analysis¹ is a major tool for approximating observed data with model data formed by a few 'hidden' factors. Observed data such as marks of students $i \in I$ at subjects labelled by $l = 1, \dots, M$ constitute a data matrix $X = (x_{il})$. Assume that each mark $x_{il}$ reflects the student's hidden ability $z_i$ ($i \in I$) with an impact coefficient $c_l$, due to subject $l$'s specifics. The principal component analysis model suggests that the student $i$'s score over subject $l$ reflects the product of the mutual impact of student and subject, $z_i c_l$. Then equation ( ) can be formulated as

$$x_{il} = c_l z_i + e_{il}. \qquad (5.2)$$

The least squares criterion is $L = \sum_{i \in I} \sum_{l \in L} (x_{il} - c_l z_i)^2$ and the first-order optimality conditions lead to equations $\sum_l x_{il} c_l = z_i \sum_l c_l^2$ and $\sum_i x_{il} z_i = c_l \sum_i z_i^2$ for all $l \in L$ and $i \in I$. Sums $\sum_l c_l^2$ and $\sum_i z_i^2$ are squared norms of vectors $c$ and $z$ with the norms being defined as $\|c\| = \sqrt{\sum_l c_l^2}$ and $\|z\| = \sqrt{\sum_i z_i^2}$. A vector whose norm is unity is referred to as a normed vector; to make $z = (z_i)$ normed, $z$ has to be divided by its norm: vector $z^* = z/\|z\|$ is the normed version of $z$. Let us denote by $\mu$ the product $\mu = \|z\| \|c\|$ and by $z^*$ and $c^*$ the normed versions of the least-squares solution $c$, $z$. Then the equations above can be rewritten as $\sum_l x_{il} c^*_l = \mu z^*_i$ and $\sum_i x_{il} z^*_i = \mu c^*_l$, or in matrix algebra notation, $Xc^* = \mu z^*$ and $X^T z^* = \mu c^*$. These are quite remarkable equations expressing the fact that optimal vectors $c$ and $z$ are linear combinations of, respectively, rows and columns of matrix $X$. These expressions define an inherent property of matrix $X$, its singular value and vectors, where $\mu$ is a singular value of matrix $X$ and $c^*$ and $z^*$ are the normed singular vectors corresponding to $\mu$. It is well known that the number of non-zero singular values of a matrix is equal to its rank and, moreover, the singular vectors $c^*$ corresponding to different singular values are mutually orthogonal, as well as the vectors $z^*$ [38]. In our case however, $\mu$ must be the maximum singular value of $X$ because of the decomposition of the data scatter $T(X) = \sum_i \sum_l x_{il}^2$,

$$T(X) = \mu^2 + L \qquad (5.3)$$

that holds for the optimal $c$ and $z$. Indeed, since the data scatter $T(X)$ is constant if the data do not change, the unexplained part $L$ is minimum when the explained part, $\mu^2$, is maximum.

¹ This section, as well as the next one, can be understood in full only if introductory concepts of linear algebra are known, including the concepts of matrix and its rank.
shows that the least-squares solution is de ned only up to the linear subspace of the space of N -dimensional vectors, whose base is formed by columns of matrix Zm . The optimal linear subspace can be speci ed in terms of the so-called singular value decomposition (SVD) of matrix X. Let us recall that SVD of N M matrix X amounts to equation X = Z C where Z is N r matrix of mutually orthogonal normed N -dimensional column-vectors zk , C is r M matrix of mutually orthogonal normed M -dimensional row-vectors ck , and is diagonal r r matrix with positive singular values k on the main diagonal such that zk and ck are normed singular vectors of X corresponding to its singular value k so that Xck = k zk and X T zk = k ck , k = 1 ::: r 38]. The SVD decomposition is proven to be unique when singular values t are mutually di erent. A leastsquares solution to model (5.5) can now be determined from matrices Zm and Cm of m singular vectors zk and ck corresponding to m greatest singular values k , k = 1 ::: m. (The indices re ect the assumption that the singular values have been placed in the order of descent, 1 2 ::: r > 0.) Let us denote the diagonal matrix of the rst m singular vestors by m. Then a solution to the problem is determined by rescaling the normed singular vectors with p p formulas Zm = Zm m de ning principal components, and Cm = m Cm de ning their loadings. Since singular vectors z corresponding to di erent singular values are mutually orthogonal, the factors can be found one by one as solutions to the one-factor model (5.2) above applied to the so-called residual data matrix: after a factor z and loadings c are found, X must be substituted by the matrix of residuals, xil xil ; cl zi . The principal component of the residual matrix corresponds to the second largest singular value of the original matrix X . Repeating the process m times, one gets m rst principal components and loading vectors. 
It can be proven that, given $m$, the minimum value of $L = \sum_{i,l} e_{il}^2$ is equal to the sum of the $r - m$ smallest singular values squared, $L(Z_m, C_m) = \sum_{k=m+1}^{r} \mu_k^2$, whereas the $m$ greatest singular values and corresponding singular vectors define the factor space solving the least-squares fitting problem for equation (5.3). Each $k$-th component additively contributes $\mu_k^2$ to the data scatter $T(X)$ ($k = 1, \ldots, m$) so that equation (5.3), in the general case, becomes
$m \times m$ matrix $F$ satisfying equation $F^T = F^{-1}$, that is, being a "rotation" matrix [27, 38]. Obviously $ZC = Z_m C_m$, that is, the rotated solution $Z$, $C$ corresponds to the same residual matrix $E$ and thus the same value of $L$. This
the formula $X^T z = \mu c$ leading to $X^T X c = \mu^2 c$. This equation means that $c$ is a latent vector of the square matrix $X^T X$ corresponding to its latent value $\mu^2$. Thus, for $X^T X$, its latent value decomposition rather than SVD must be sought, because singular vectors of $X$ are latent vectors of $X^T X$ corresponding to their latent values $\mu^2$. Similarly, singular vectors $z$ of $X$ are latent vectors of $X X^T$.

It should be noted that in many texts the method of principal components is explained by using the square matrix $X^T X$ only, without any reference to the basic equations (5.2) or (5.4); see, for instance, [27, 61, 72]. The elements of matrix $X^T X$ are proportional to covariances or correlations between the variables; finding the maximum latent values and corresponding latent vectors of $X^T X$ can be interpreted as finding such a linear combination of the original variables that takes into account the maximum share of the data scatter. In this, the fact that the principal components are linear combinations of variables is an assumption of the method, not a corollary, which it is with models (5.2) and (5.4).

The singular value decomposition is frequently used as a data visualization tool on its own (see, for instance, [54]). Especially interesting results can be seen when entities are naturally mapped onto a visual image such as a geographic map; Cavalli-Sforza has interpreted several principal components in this way [13]. There are also popular data visualization techniques such as Correspondence analysis [6, 77], Latent semantic analysis [16], and Eigenfaces [133] that heavily rely on SVD. The first of these will be reviewed in the next section following the presentation in [93].

Correspondence Analysis (CA) is a method for visually displaying both row and column categories of a contingency table $P = (p_{ij})$ ($i \in I$, $j \in J$) in such a way that distances between the presenting points reflect the pattern of co-occurrences in $P$.
There have been several equivalent approaches developed for introducing the method (see, for example, Benzecri [6]). Here we introduce CA in terms similar to those of PCA above. To be specific, let us concentrate on the problem of finding just two "underlying" factors, $u_1 = \{(v_1(i)), (w_1(j))\}$ and $u_2 = \{(v_2(i)), (w_2(j))\}$, with $I \cup J$ as their domain, such that each row $i \in I$ is displayed as point $u(i) = (v_1(i), v_2(i))$ and each column $j \in J$ as point $u(j) = (w_1(j), w_2(j))$ on the plane as shown in Figure 5.1. The coordinate row-vectors, $v_l$, and column-vectors, $w_l$, constituting $u_l$ ($l = 1, 2$) are calculated to approximate the relative Quetelet coefficients $q_{ij} = p_{ij}/(p_{i+} p_{+j}) - 1$ according to equations:
are positive reals, by minimizing the weighted least-squares criterion

$$E^2 = \sum_{i \in I} \sum_{j \in J} p_{i+}\, p_{+j}\, e_{ij}^2 \qquad (5.7)$$
with regard to $\mu_l$, $v_l$, $w_l$, subject to conditions of weighted ortho-normality:

$$\sum_{i \in I} p_{i+}\, v_l(i)\, v_{l'}(i) = \sum_{j \in J} p_{+j}\, w_l(j)\, w_{l'}(j) = \begin{cases} 1, & l = l' \\ 0, & l \neq l' \end{cases} \qquad (5.8)$$
where $l, l' = 1, 2$.

The weighted criterion $E^2$ is equivalent to the unweighted least-squares criterion $L$ applied to the matrix with entries $a_{ij} = q_{ij}\sqrt{p_{i+} p_{+j}} = (p_{ij} - p_{i+} p_{+j})/\sqrt{p_{i+} p_{+j}}$. This implies that the factors are determined by the singular value decomposition of matrix $A = (a_{ij})$. More explicitly, the optimal values $\mu_l$ and row-vectors $f_l = (v_l(i)\sqrt{p_{i+}})$ and column-vectors $g_l = (w_l(j)\sqrt{p_{+j}})$ are the maximal singular values and corresponding singular vectors of matrix $A$, defined by equations $A g_l = \mu_l f_l$, $f_l A = \mu_l g_l$.

These equations, rewritten in terms of $v_l$ and $w_l$, are considered to justify the joint display: the row-points appear to be averaged column-points and, vice versa, the column-points appear to be averaged versions of the row-points.

The mutual location of the row-points is considered as justified by the fact that between-row-point squared Euclidean distances $d^2(u(i), u(i'))$ approximate chi-square distances between corresponding rows of the contingency table, $\chi^2(i, i') = \sum_{j \in J} p_{+j} (q_{ij} - q_{i'j})^2$. Here $u(i) = (v_1(i), v_2(i))$ for $v_1$ and $v_2$ rescaled in such a way that their norms are equal to $\mu_1$ and $\mu_2$, respectively.

To see it, one needs to derive first that the weighted averages $\sum_{i \in I} p_{i+} v_i$ and $\sum_{j \in J} p_{+j} w_j$ are equal to zero. Then, it will easily follow that the singularity equations for $f$ and $g$ are equivalent to equations $\sum_{j \in J} p(j/i)\, w_j = \mu v_i$ and $\sum_{i \in I} p(i/j)\, v_i = \mu w_j$, where $p(j/i) = p_{ij}/p_{i+}$ and $p(i/j) = p_{ij}/p_{+j}$ are conditional probabilities defined by the contingency table $P$. These latter equations define elements of $v$ as weighted averages of elements of $w$, up to the factor $\mu$, and vice versa.

The values $\mu_l^2$ are latent values of matrix $A^T A$. As is known, the sum of all latent values of a matrix is equal to its trace, defined as the sum of diagonal entries, that is, $\mathrm{Tr}(A^T A) = \sum_{t=1}^{r} \mu_t^2$ where $r$ is the rank of $A$. On the other hand, direct calculation shows that the sum of diagonal entries of $A^T A$ is
5.1. STATISTICS MODELING AS DATA RECOVERY
                          Type of Service
Branch        ObstrJus   Favors   Extort   CategCh   Cover-up   Total
Government        0         8        7        0          3        18
LawEnforc        14         1        3        2          9        29
Other             1         1        0        5          1         8
Total            15        10       10        7         13        55

Table 5.1: Bribery: Cross-classification of features Branch (X) and Type of Service (III) from Table 1.11.
which can be seen as a decomposition of the contingency data scatter, measured by $X^2$, into contributions of the individual factors, $\mu_l^2$, and unexplained residuals, $E^2$. (Here, $l = 1, 2$, but, actually, the number of factors sought can be raised up to the rank of matrix $A$.) In a common situation, the first two latent values account for a major part of $X^2$, thus justifying the use of the plane of the first two factors for visualization of the interrelations between $I$ and $J$.

Thus, CA is analogous to PCA, but differs from PCA in the following aspects:
(i) CA applies to contingency data in such a way that relative Quetelet coefficients are modeled rather than original frequencies;
(ii) Rows and columns are assumed to have weights, the marginal frequencies, that are used in both the least-squares criterion and orthogonality equations;
(iii) Both rows and columns are visualized on the same display so that geometric distances between the representations reflect chi-square distances between row and column conditional frequency profiles;
(iv) The data scatter is measured by the Pearson chi-square association coefficient.

As shown in [6] (see also [77]), CA better reproduces the visual shapes of contingency data than the standard PCA.
Example 5.44. Contingency table for the synopsis of Bribery data and its visualization
Let us build on results of the generalization of the Bribery data set obtained by clustering in section 3.4.3: all features of the Bribery data in Table 1.12 are well represented by the interrelation between the two variables: the branch at which the corrupt service occurred and type of the service, features X and III in Table 1.11, respectively. Let us take a look at the cross-classification of these features (Table 5.1) and visualize it with the method of Correspondence analysis. In Figure 5.1, one can see which columns are attracted to which rows: Change of category to Other branch, Favors and Extortion to Government, and Cover-up and Obstruction of justice to Law Enforcement. This is compatible with the conclusions drawn in section 3.4.3. A unique feature of this display is that the branches constitute a triangle covering the services.
Figure 5.1: CA display for the rows and columns of Table 5.1 represented by circles and pentagrams, respectively.
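The computation behind a display of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, takes the counts of Table 5.1, and the variable names (`row_points`, `col_points`, `share`) are chosen here; the exact scaling and orientation used for Figure 5.1 are not given in the text, so the coordinates may differ from the figure by sign or scale.

```python
import numpy as np

# Counts from Table 5.1: rows are branches, columns are types of service.
counts = np.array([[0, 8, 7, 0, 3],     # Government
                   [14, 1, 3, 2, 9],    # LawEnforc
                   [1, 1, 0, 5, 1]],    # Other
                  dtype=float)

P = counts / counts.sum()       # relative frequencies p_ij
r = P.sum(axis=1)               # row margins p_i+
c = P.sum(axis=0)               # column margins p_+j

# a_ij = (p_ij - p_i+ p_+j) / sqrt(p_i+ p_+j)
A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

F, mu, G = np.linalg.svd(A, full_matrices=False)

# Undo the square-root weighting to get v_l(i) and w_l(j) for the first two factors.
V = F[:, :2] / np.sqrt(r)[:, None]
W = G[:2].T / np.sqrt(c)[:, None]

# Plane coordinates with factor norms rescaled to the singular values.
row_points = V * mu[:2]
col_points = W * mu[:2]

# Share of the scatter Tr(A^T A) taken by the two factors.
share = float((mu[:2] ** 2).sum() / (A ** 2).sum())
print(row_points)
print(col_points)
print(share)
```

Because the table has only three rows, the centred matrix has rank at most two, so in this particular example the plane carries the whole scatter.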
5.2 Data recovery model for K-Means
In K-Means, a clustering is represented by a partition $S = \{S_k\}$ of the entity set $I$ consisting of $K$ cluster lists $S_k$ that do not overlap and cover all entities, that is, $S_k \cap S_l = \emptyset$ if $k \neq l$ and $\bigcup_{k=1}^{K} S_k = I$. The latter condition can be relaxed as described in sections 3.2.3 and 3.3. The lists are frequently referred to as cluster contents. The number of elements in $S_k$ is frequently referred to as the cluster's size or cardinality and denoted by $N_k$. Centroids of clusters are vectors $c_k = (c_{kv})$ representing cluster "prototypes" or "standard" points.

Given a partition $S$ and set of centroids $c = \{c_k\}$ resulting from K-Means, the original data can be recovered in such a way that any data entry $y_{iv}$ (where $i \in I$ denotes an entity and $v \in V$ a category or quantitative feature) is represented by the corresponding centroid value $c_{kv}$ such that $i \in S_k$, up to a residual, $e_{iv} = y_{iv} - c_{kv}$. In this way, clustering $(S, c)$ leads to the data recovery model described by equations

$$y_{iv} = c_{kv} + e_{iv}, \quad i \in S_k,\ k = 1, \ldots, K. \qquad (5.11)$$

It is this model, in all its over-simplicity, that stands behind K-Means. Let us see how this may happen.

Multiplying equations (5.11) by themselves and summing up the results, it is not difficult to derive the following equation:

$$\sum_{i \in I} \sum_{v \in V} y_{iv}^2 = \sum_{v \in V} \sum_{k=1}^{K} c_{kv}^2 N_k + \sum_{i \in I} \sum_{v \in V} e_{iv}^2 \qquad (5.12)$$
The derivation is based on the assumption that $c_{kv}$ is the average of within-cluster values $y_{iv}$, $i \in S_k$, so that $\sum_{i \in S_k} c_{kv} y_{iv} = N_k c_{kv}^2$. Noting that the right-hand term in (5.12), $\sum_{i \in I} \sum_{v \in V} e_{iv}^2 = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v \in V} (y_{iv} - c_{kv})^2 = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$, is the K-Means square error criterion $W(S, c)$ (3.2), equation (5.12) can be rewritten as

$$T(Y) = B(S, c) + W(S, c) \qquad (5.13)$$

where $T(Y) = \sum_{i,v} y_{iv}^2$ is the data scatter, $W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$ the square-error clustering criterion and $B(S, c)$ the middle term in decomposition (5.12):

$$B(S, c) = \sum_{v \in V} \sum_{k=1}^{K} c_{kv}^2 N_k \qquad (5.14)$$
Equation (5.12), or its equivalent (5.13), is well-known in the analysis of variance; its parts $B(S, c)$ and $W(S, c)$ are conventionally referred to as between-group and within-group variance in statistics. In the context of model (5.11) these, however, denote the explained and unexplained parts of the data scatter, respectively. The square error criterion of K-Means, therefore, minimizes the unexplained part of the data scatter, or, equivalently, maximizes the explained part, $B(S, c)$ (5.14). In other words, this criterion expresses the idea of approximation of data $Y$ by the K-Means partitioning as expressed in equation (5.11).

Equation (5.13) can be rewritten in terms of distances $d$ since $T(Y) = \sum_{i=1}^{N} d(y_i, 0)$ and $B(S, c) = \sum_{k=1}^{K} N_k\, d(c_k, 0)$ according to the definition of the Euclidean distance squared:

$$\sum_{i=1}^{N} d(y_i, 0) = \sum_{k=1}^{K} N_k\, d(c_k, 0) + \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k) \qquad (5.15)$$
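The decomposition of the data scatter into explained and unexplained parts can be verified on any partition whose centroids are the within-cluster means. Below is a minimal sketch, assuming NumPy, a made-up data matrix and an arbitrary, hand-picked partition; no K-Means run is needed for the identity to hold.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(12, 3))
Y = Y - Y.mean(axis=0)                       # shift the origin to the grand mean

labels = np.array([0] * 4 + [1] * 5 + [2] * 3)   # hypothetical partition, K = 3
K = 3

T = float((Y ** 2).sum())                    # data scatter T(Y)
B = 0.0                                      # explained part B(S, c)
W = 0.0                                      # unexplained part W(S, c)
for k in range(K):
    Yk = Y[labels == k]
    ck = Yk.mean(axis=0)                     # centroid = within-cluster mean
    B += len(Yk) * float(ck @ ck)            # N_k * d(c_k, 0)
    W += float(((Yk - ck) ** 2).sum())       # sum of d(y_i, c_k) over the cluster

print(T, B + W)                              # the two values should coincide
```

Since $T(Y)$ does not depend on the partition, any change of the partition that lowers `W` raises `B` by the same amount.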
5.2.2 Contributions of clusters, features, and individual entities
According to equations (5.13) and (5.14), each individual cluster $k = 1, \ldots, K$ additively contributes
standardization of data by subtracting features' grand means, the contribution $B_{vk}$ is proportional to the squared difference between variable $v$'s grand mean $c_v$ and its within-cluster mean $c_{kv}$: the larger the difference, the greater the contribution. This nicely fits into our intuition: the farther away the cluster is from the grand mean on a feature range, the more useful the feature should be in separating the cluster from the rest.

To evaluate contributions of individual entities to the explained part of the data scatter, one needs yet another reformulation of (5.16). Let us refer to the definition $c_{kv} = \sum_{i \in S_k} y_{iv}/N_k$ and put it into $c_{kv}^2 N_k$ "halfway." This then becomes $c_{kv}^2 N_k = \sum_{i \in S_k} y_{iv} c_{kv}$, leading to $B(S_k, c_k)$ reformulated as the summary inner product:

$$B(S_k, c_k) = \sum_{v \in V} \sum_{i \in S_k} y_{iv} c_{kv} = \sum_{i \in S_k} (y_i, c_k) \qquad (5.17)$$

thus suggesting that the contribution of entity $i \in S_k$ to the explained part of the data scatter is $(y_i, c_k)$, the inner product between the entity point and the cluster's centroid, as follows from (3.8). This may give further insights into the scatter decomposition to highlight contributions of individual entities (see an example in Table 3.26).

These and related interpretation aids are suggested for use in section 3.4.2 as a non-conventional but informative instrument.
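As a small illustration of (5.17), the sketch below splits one cluster's contribution both by features and by entities. It assumes NumPy, made-up data and a hypothetical cluster given by the index list `members`; reading the feature-level items as $N_k c_{kv}^2$ follows (5.14).

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(10, 4))
Y -= Y.mean(axis=0)                      # grand means moved to the origin

members = np.array([0, 3, 4, 7])         # hypothetical cluster S_k
Yk = Y[members]
ck = Yk.mean(axis=0)                     # centroid c_k

cluster_contrib = len(members) * float(ck @ ck)   # B(S_k, c_k) = N_k * sum_v c_kv^2
feature_contrib = len(members) * ck ** 2          # items N_k * c_kv^2, one per feature
entity_contrib = Yk @ ck                          # inner products (y_i, c_k)

print(cluster_contrib, feature_contrib.sum(), entity_contrib.sum())
```

Individual entity contributions can be negative: an entity lying on the opposite side of the origin from its centroid reduces the explained part.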
5.2.3 Correlation ratio as contribution
To measure statistical association between a quantitative feature $v$ and partition $S = \{S_1, \ldots, S_K\}$, the so-called correlation ratio $\eta^2$ has been defined in statistics:

$$\eta^2(S, v) = \frac{\sigma_v^2 - \sum_{k=1}^{K} p_k \sigma_{kv}^2}{\sigma_v^2} \qquad (5.18)$$

where $\sigma_v^2 = \sum_{i \in I} (x_{iv} - c_v)^2/N$ and $\sigma_{kv}^2 = \sum_{i \in S_k} (x_{iv} - c_{kv})^2/N_k$ are the variance and within-cluster variance of variable $v$, respectively, before data pre-processing, and $p_k = N_k/N$. Actually, the correlation ratio expresses the extent to which the within-cluster averages can be used as predicted values of $v$ and, in

The cluster to variable contribution $N \eta^2(S, v)\, \sigma_v^2 / b_v^2$ becomes plain $N \eta^2(S, v)$ when the variable has been normalized with $b_v$ being its standard deviation, the option which hides the shape of the variable distribution. Otherwise, with $b_v$ being the range $r_v$, the contribution should be considered a partition-to-feature association coefficient on its own, equal to

$$\eta^2(S, v)\, \sigma_v^2 / r_v^2. \qquad (5.19)$$

5.2.4 Partition contingency coefficients

Consider now the case of a nominal variable $l$ presented by its set of categories $V_l$. The summary contribution $B(S, l) = \sum_{v \in V_l} \sum_{k} B_{vk}$ of nominal feature $l$ to partition $S$, according to decomposition (5.12), appears to have something to do with association coefficients in contingency tables considered in section 2.2.3.

To analyze the case, let us initially derive frequency-based reformulations of centroids for binary variables $v \in V_l$. Let us recall that a categorical variable $l$ and cluster-based partition $S$, when cross-classified, form contingency table $p_{kv}$, whose marginal frequencies are $p_{k+}$ and $p_{+v}$, $k = 1, \ldots, K$, $v \in V_l$.

For any three-stage pre-processed column $v \in V_l$ and cluster $S_k$ in the clustering $(S, c)$, its within-cluster average is equal to:

$$c_{kv} = \left[\frac{p_{kv}}{p_{k+}} - c_v\right] \Big/ \left[b_v b'_v\right] \qquad (5.20)$$

where $b'_v = \sqrt{|V_l|}$. Indeed, the within-cluster average in this case equals $c_{kv} = p_{kv}/p_{k+}$, the proportion of $v$ in cluster $S_k$. The mean $c_v$ of binary attribute $v \in V_l$ is the proportion of ones in it, that is, the frequency of the corresponding category, $c_v = p_{+v}$. This implies that

$$B(S, l) = N \sum_{k=1}^{K} p_{k+} \sum_{v \in V_l} \frac{(p_{kv}/p_{k+} - p_{+v})^2}{|V_l|\, b_v^2} \qquad (5.21)$$

which can be transformed, with little arithmetic, into

$$B(S, l) = \frac{N}{|V_l|} \sum_{k=1}^{K} \sum_{v \in V_l} \frac{(p_{kv} - p_{k+} p_{+v})^2}{p_{k+}\, b_v^2} \qquad (5.22)$$

where $b_v$ and $|V_l|$ stand for the second stage, scaling, and the third stage, rescaling, during data pre-processing, respectively. The items summarized in (5.22) can be further specified depending on scaling coefficients $b_v$ as

1. $\dfrac{(p_{kv} - p_k p_v)^2}{p_k}$ if $b_v = 1$, the range;

2. $\dfrac{(p_{kv} - p_k p_v)^2}{p_k\, p_v (1 - p_v)}$ if $b_v = \sqrt{p_v (1 - p_v)}$, the Bernoulli standard deviation;

3. $\dfrac{(p_{kv} - p_k p_v)^2}{p_k\, p_v}$ if $b_v = \sqrt{p_v}$, the Poisson standard deviation.

These lead to the following statement.

Statement 5.5. The contribution $B(S, l)$ of a nominal variable $l$ and partition $S$ to the explained part of the data scatter, depending on the standardizing coefficients, is equal to contingency coefficient $G^2$ (2.11) or $Q^2 = X^2/N$ (2.12) if scaling coefficients $b_v$ are taken to be $b_v = 1$ or $b_v = \sqrt{p_{+v}}$, respectively, and rescaling coefficients $b'_v = 1$. It is further divided by the number of categories $|V_l|$ if rescaling coefficients are $b'_v = \sqrt{|V_l|}$ for $v \in V_l$.

Two well-known normalizations of the Pearson chi-square contingency coefficient are due to Tchouprov, $T = X^2/\sqrt{(K - 1)(|V_l| - 1)}$, and Cramer, $C = X^2/\min(K - 1, |V_l| - 1)$, both symmetric over the numbers of categories and clusters. Statement 5.5 implies one more, asymmetric, normalization of $X^2$, $M = X^2/|V_l|$, as a meaningful part of the data scatter in the clustering problem.

When the chi-square contingency coefficient or related indexes are applied in the traditional statistics context, the presence of zeros in a contingency table becomes an issue because it contradicts the hypothesis of statistical independence. In the context of data recovery clustering, zeros are treated as any other numbers and create no problems at all because the coefficients are measures of contributions and bear no other statistical meaning in this context.
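The link between the contribution of a nominal feature and chi-square type coefficients can be checked numerically. The sketch below is an illustration only: it assumes NumPy, a made-up nominal feature `cat` and partition `lab`, the Poisson scaling option (dividing each dummy column by the square root of its category frequency) and no rescaling by the number of categories. Under these assumptions the contribution computed from centroids should coincide with the count-based Pearson chi-square statistic of the cluster-by-category table.

```python
import numpy as np

cat = np.array([0, 0, 1, 2, 1, 0, 2, 2, 1, 0, 1, 2])   # hypothetical nominal feature
lab = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2])   # hypothetical partition
N, K, n_cat = len(cat), 3, 3

D = np.eye(n_cat)[cat]                 # one binary (dummy) column per category
p_v = D.mean(axis=0)                   # category frequencies p_+v
Yl = (D - p_v) / np.sqrt(p_v)          # centre, then scale by sqrt(p_+v)

# Contribution of the feature: sum over clusters of N_k * sum_v c_kv^2.
B = 0.0
for k in range(K):
    Yk = Yl[lab == k]
    ck = Yk.mean(axis=0)
    B += len(Yk) * float(ck @ ck)

# Contingency table of relative frequencies p_kv and the chi-square statistic.
Pkv = np.zeros((K, n_cat))
np.add.at(Pkv, (lab, cat), 1.0 / N)
p_k = Pkv.sum(axis=1)
expected = np.outer(p_k, p_v)
chi2 = N * float(((Pkv - expected) ** 2 / expected).sum())

print(B, chi2)
```

The zero cells in the table need no special treatment here, in line with the remark that in this context the coefficient is just a contribution to the data scatter.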
5.3 Data recovery models for Ward criterion
5.3.1 Data recovery models with cluster hierarchies
To formulate supporting models for agglomerative and divisive Ward clustering, one needs to explicitly define the concepts of cluster tree and hierarchy. A set $\mathcal{S}$ of subsets $S_w \subseteq I$ is called nested if for every two subsets $S_w$ and $S_{w'}$ from $\mathcal{S}$ either one of them is part of the other or they do not overlap at all. Given a nested set $\mathcal{S}$, $S_{w'} \in \mathcal{S}$ is referred to as a child of $S_w \in \mathcal{S}$ if $S_{w'} \subset S_w$ and no other subset $S_{w''} \in \mathcal{S}$ exists such that $S_{w'} \subset S_{w''} \subset S_w$. A subset $S_w \in \mathcal{S}$ is referred to as a terminal node or leaf if $S_w$ has no children in $\mathcal{S}$. A nested set $\mathcal{S}$ will be referred to as a cluster hierarchy over $I$ if any non-terminal subset $S_w \in \mathcal{S}$ has two children $S_{w'}, S_{w''} \in \mathcal{S}$ covering it entirely so that $S_{w'} \cup S_{w''} = S_w$. The subsets $S_w \in \mathcal{S}$ will be referred to as clusters.

Two types of cluster hierarchy are of interest in modeling clustering algorithms: those $\mathcal{S}$ containing singleton clusters $\{i\}$ for all $i \in I$ and those containing set $I$ itself as a cluster. The former will be referred to as the lower
cluster hierarchy and the latter, the upper cluster hierarchy. A lower hierarchy can be thought of as resulting from an agglomerative clustering algorithm and an upper hierarchy from a divisive clustering algorithm. A complete result of a clustering algorithm of either type can be represented by a cluster tree, that is, a cluster hierarchy which is both lower and upper.

For an upper cluster hierarchy $\mathcal{S}$, let us denote the set of its leaves by $L(\mathcal{S})$; it is obviously a partition of $I$. Similarly, for a lower cluster hierarchy $\mathcal{S}$, let us denote its set of maximal clusters by $M(\mathcal{S})$; this is also a partition of $I$.

Given an upper or lower cluster hierarchy $\mathcal{S}$ over set $I$ and a pre-processed data matrix $Y = (y_{iv})$, let us, for any feature $v$, denote the average value of $y_{iv}$ within $S_w \in \mathcal{S}$ by $c_{wv}$.

Given an upper cluster hierarchy $\mathcal{S}$, let us consider its leaf partition $L(\mathcal{S})$. For any data entry $y_{iv}$ and a leaf cluster $S_w \in L(\mathcal{S})$ containing it, the model underlying K-Means suggests that $y_{iv}$ is equal to $c_{wv}$ up to the residual $e_{iv} = y_{iv} - c_{wv}$. Obviously, $e_{iv} = 0$ if $S_w$ is a singleton consisting of just one entity $i$. To extend this to the hierarchy $\mathcal{S}$, let us denote the set of all nonsingleton clusters containing $i$ by $\mathcal{S}_i$ and add to and subtract from the equation the averages of feature $v$ within each $S_w \in \mathcal{S}_i$. This leads us to the following equation:

$$y_{iv} = \sum_{S_w \in \mathcal{S}_i} (c_{w1,v} - c_{wv}) + e_{iv} \qquad (5.23)$$

where $S_{w1}$ is a child of $S_w$ that runs through $\mathcal{S}_i$. Obviously, all $e_{iv} = 0$ in (5.23) if $\mathcal{S}$ is a cluster tree.

5.3.2 Covariances, variances and data scatter decomposed

Model in (5.23) is not just a trivial extension of the K-Means model to the case of upper cluster hierarchies, in spite of the fact that the only "real" item in the sum in (5.23) is $c_{wv}$ where $S_w$ is the leaf cluster containing $i$. Indeed, the equation implies the following decomposition.

Statement 5.6. For every pair of feature columns $v, u \in V$ in the pre-processed data matrix $Y$, their inner product can be decomposed over the cluster hierarchy $\mathcal{S}$ as follows:

$$(y_v, y_u) = \sum_{w} \frac{N_{w1} N_{w2}}{N_w} (c_{w1,v} - c_{w2,v})(c_{w1,u} - c_{w2,u}) + (e_v, e_u) \qquad (5.24)$$

where $w$ runs over all nonterminal clusters $S_w \in \mathcal{S}$ with children $S_{w1}$ and $S_{w2}$; $N_w$, $N_{w1}$, $N_{w2}$ being their respective cardinalities.

Proof: The proof follows from equation $N_w c_w = N_{w1} c_{w1} + N_{w2} c_{w2}$ which relates the centroid of a non-terminal cluster $S_w$ with those of its children. By putting this into (5.23), one can arrive at (5.24) by multiplying the decomposition for $y_{iv}$ by that for $y_{iu}$ and summing up results over all $i \in I$, q.e.d.

Another, more mathematically loaded analysis of model (5.23) can be based on the introduction of an $N \times m$ matrix $\Phi$ where $N$ is the number of entities in $I$ and $m$ the number of nonterminal clusters in $\mathcal{S}$. The columns $\phi_w$ of $\Phi$ correspond to non-terminal clusters $S_w \in \mathcal{S}$ and are defined by equations: $\phi_{iw} = 0$ if $i \notin S_w$, $\phi_{iw} = a_w$ for $i \in S_{w1}$, and $\phi_{iw} = -b_w$ for $i \in S_{w2}$, where $a_w$ and $b_w$ are positive reals specified by the conditions that vector $\phi_w$ must be centered and normed. These conditions imply that $a_w = \sqrt{1/N_{w1} - 1/N_w} = \sqrt{N_{w2}/(N_w N_{w1})}$ and $b_w = \sqrt{1/N_{w2} - 1/N_w} = \sqrt{N_{w1}/(N_w N_{w2})}$ so that $a_w b_w = 1/N_w$. It is not difficult to prove that thus defined vectors $\phi_w$ are mutually orthogonal and, therefore, form an orthonormal base (see [90]).

By using matrix $\Phi$, equations (5.23) can be rewritten in matrix denotations as

$$Y = \Phi A + E \qquad (5.25)$$

where $A$ is an $m \times M$ matrix with entries $a_{wv} = \sqrt{N_{w1} N_{w2}/N_w}\,(c_{w1,v} - c_{w2,v})$. Multiplying (5.25) by $Y^T$ on the left, one arrives at matrix equation $Y^T Y = A^T A + E^T E$, since $\Phi^T \Phi$ is the identity matrix and $\Phi^T E$ the zero matrix. This matrix equation is a matrix formulation for (5.24). This would give another proof of Statement 5.6.

Given a lower cluster hierarchy $\mathcal{S}$, model (5.23) remains valid, with $e_{iv}$ redefined as $e_{iv} = c_{w^\# v}$ where $S_{w^\#} \in M(\mathcal{S})$ is the maximal cluster containing $i$. The summation in (5.23) still runs over all $S_w \in \mathcal{S}$ containing $i$, so that in the end the equation may be reduced to the definition $e_{iv} = c_{w^\# v}$. Yet taken as they are, the equations lead to the same formula for decomposition of inner products between feature columns because of (5.25).

Statement 5.7. Statement 5.6 is also true if $\mathcal{S}$ is a lower cluster hierarchy, with residuals redefined accordingly.

These lead to a decomposition described in the following statement.

Statement 5.8. Given a lower or upper cluster hierarchy $\mathcal{S}$, the data scatter can be decomposed as follows:

$$T(Y) = \sum_{w} \frac{N_{w1} N_{w2}}{N_w} \sum_{v \in V} (c_{w1,v} - c_{w2,v})^2 + \sum_{i \in I} \sum_{v \in V} e_{iv}^2 \qquad (5.26)$$
Note that the central sum in (5.26) is nothing but the squared Euclidean distance between centroids of clusters $S_{w1}$ and $S_{w2}$, which leads to the following reformulation:

$$T(Y) = \sum_{w} \frac{N_{w1} N_{w2}}{N_w}\, d(c_{w1}, c_{w2}) + \sum_{i \in I} \sum_{v \in V} e_{iv}^2 \qquad (5.27)$$

Further reformulations easily follow from the definitions of $e_{iv}$ in upper or lower cluster hierarchies.

In particular, if $\mathcal{S}$ is a lower cluster hierarchy then the residual part $\sum_i \sum_v e_{iv}^2$ of the data scatter decomposition in (5.26) is equal to the complementary K-Means criterion $B(S, c)$ where $S = M(\mathcal{S})$ is the set of maximal clusters in $\mathcal{S}$ and $c$ the set of their centroids. That means that for any lower cluster hierarchy $\mathcal{S}$ with $S = M(\mathcal{S})$:

$$T(Y) = \sum_{w} \frac{N_{w1} N_{w2}}{N_w}\, d(c_{w1}, c_{w2}) + B(S, c) \qquad (5.28)$$

Similarly, if $\mathcal{S}$ is an upper cluster hierarchy, the residual part is equal to the original K-Means square error criterion $W(S, c)$ where $S = L(\mathcal{S})$ is the set of leaf clusters in $\mathcal{S}$ with $c$ being their centroids. That means that for any upper cluster hierarchy $\mathcal{S}$ with $S = L(\mathcal{S})$:

$$T(Y) = \sum_{w} \frac{N_{w1} N_{w2}}{N_w}\, d(c_{w1}, c_{w2}) + W(S, c) \qquad (5.29)$$

These decompositions explain what is going on in Ward clustering in terms of the underlying data recovery model. Every merging step in agglomerative clustering or every splitting step in divisive clustering adds the Ward distance

$$\lambda_w = dw(S_{w1}, S_{w2}) = \frac{N_{w1} N_{w2}}{N_w}\, d(c_{w1}, c_{w2}) \qquad (5.30)$$

to the central sum in (5.27) by reducing the other part, $B(S, c)$ or $W(S, c)$, respectively.
5.3.3 Direct proof of the equivalence between 2-Means and Ward criteria
At first glance, the Ward criterion for dividing an entity set in two clusters has nothing to do with that of K-Means. Given $S_w \subseteq I$, the former is to maximize $\lambda_w$ in (5.30) over all splits of $S_w$ in two parts, while the K-Means criterion, with $K = 2$, in the corresponding denotations is:

$$W(S_{w1}, S_{w2}, c_{w1}, c_{w2}) = \sum_{i \in S_{w1}} d(y_i, c_{w1}) + \sum_{i \in S_{w2}} d(y_i, c_{w2}) \qquad (5.31)$$
Criterion (5.31) is supposed to be minimized over all possible partitions of $S_w$ in two clusters, $S_{w1}$ and $S_{w2}$. According to equation (5.15), this can be equivalently reformulated as the problem of maximization of:

$$B(S_{w1}, S_{w2}, c_{w1}, c_{w2}) = N_{w1}\, d(c_{w1}, 0) + N_{w2}\, d(c_{w2}, 0) \qquad (5.32)$$

over all partitions $S_{w1}$, $S_{w2}$ of $S_w$.

Now we are ready to prove that criteria (5.31) and (5.30) are equivalent.

Statement 5.9. Maximizing Ward criterion (5.30) is equivalent to minimizing 2-Means criterion (5.31).

Proof: To see if there is any relation between (5.30) and (5.32), let us consider an equation relating the two centroids with the total gravity center, $c_w$, in the entity set $S_w$ under consideration:

$$N_{w1} c_{w1} + N_{w2} c_{w2} = (N_{w1} + N_{w2})\, c_w \qquad (5.33)$$

The equation holds because the same summary entity point stands on both sides of it.

Let us assume $c_w = 0$. This should not cause any trouble because the split criterion (5.30) depends only on the difference between $c_{w1}$ and $c_{w2}$, which does not depend on $c_w$. Indeed, if $c_w \neq 0$, then we can shift all entity points in $S_w$ by subtracting $c_w$ from each of them, thus defining $y'_{iv} = y_{iv} - c_{wv}$. With the shifted data, the averages are, obviously, $c'_w = 0$, $c'_{w1} = c_{w1} - c_w$, and $c'_{w2} = c_{w2} - c_w$, which does not change the difference between centroids.

With $c_w = 0$, equation (5.33) implies $c_{w1} = (-N_{w2}/N_{w1})\, c_{w2}$ and $c_{w2} = (-N_{w1}/N_{w2})\, c_{w1}$. Based on these, the inner product $(c_{w1}, c_{w2})$ can be presented as either $(c_{w1}, c_{w2}) = (-N_{w1}/N_{w2})(c_{w1}, c_{w1})$ or $(c_{w1}, c_{w2}) = (-N_{w2}/N_{w1})(c_{w2}, c_{w2})$. By substituting these instead of $(c_{w1}, c_{w2})$ in decomposition $d(c_{w1}, c_{w2}) = [(c_{w1}, c_{w1}) - (c_{w1}, c_{w2})] + [(c_{w2}, c_{w2}) - (c_{w1}, c_{w2})]$ we can see that $\lambda_w$ (5.30) becomes:

$$\lambda_w = \frac{N_{w1} N_{w2}}{N_w} \left[ \frac{N_w (c_{w1}, c_{w1})}{N_{w2}} + \frac{N_w (c_{w2}, c_{w2})}{N_{w1}} \right].$$

By removing redundant items, this leads to equation:

$$\lambda_w = B(S_{w1}, S_{w2}, c_{w1}, c_{w2})$$

Figure 5.2: Three clusters with respect to Ward clustering: which is the first to go, $S_3$ or $S_1$?
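The equivalence between the Ward split criterion (5.30) and the 2-Means criterion (5.31) can be illustrated by brute force on a small set. The sketch below assumes NumPy, a made-up two-dimensional entity set and exhaustive enumeration of all two-part splits; the function name `ward_and_w` is hypothetical. It checks that the Ward distance and the 2-Means square error of any split add up to the same constant, the scatter of the set around its own gravity center, so the split maximizing one minimizes the other.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
Y = rng.normal(size=(7, 2)) + 5.0        # entity set S_w, deliberately not centred
n = len(Y)
cw = Y.mean(axis=0)
scatter = float(((Y - cw) ** 2).sum())   # scatter of S_w around its gravity center

def ward_and_w(part1):
    """Return (Ward distance, 2-Means square error) for the split part1 / rest."""
    mask = np.zeros(n, dtype=bool)
    mask[list(part1)] = True
    Y1, Y2 = Y[mask], Y[~mask]
    c1, c2 = Y1.mean(axis=0), Y2.mean(axis=0)
    lam = len(Y1) * len(Y2) / n * float(((c1 - c2) ** 2).sum())
    w_err = float(((Y1 - c1) ** 2).sum() + ((Y2 - c2) ** 2).sum())
    return lam, w_err

# Each split is listed once: the first part always contains entity 0.
splits = [(0,) + s for size in range(0, n - 1)
          for s in combinations(range(1, n), size)]
results = [ward_and_w(s) for s in splits]
lams = np.array([res[0] for res in results])
errs = np.array([res[1] for res in results])

best = int(np.argmax(lams))
print(splits[best], lams[best], errs[best])
```

Exhaustive enumeration is feasible only for tiny sets (here 63 splits); it is used solely to check the identity, not as a clustering method.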
a mathematical property. In fact, with local search algorithms, which are the only ones currently available, the property may not work at all.

Let us consider, for instance, Ward-like divisive clustering with 2-Means splitting. It appears that the controversy does not show up in it. Indeed, one starts with the entities farthest from each other, $c_1$ and $c_3$, as initial seeds. Then, with the Minimum distance rule, all points in $S_2$ are put in the closest cluster $S_1$ and never removed, as the centroid of the merged cluster $S_1 \cup S_2$ is always closer to $c_1$ than $c_3$. This way, 2-Means will produce the intuitively correct separation of the farthest singleton from the rest.

Similarly, in Splitting by separating, $c_3$ is selected as the initial seed of the cluster to be separated. Then adding an element from the middle cluster will produce $d/d'$ larger than the right-hand side expression in (4.7), thereby halting the splitting process. For instance, with $N_1 = N_2 = 50$ in the example above, $d/d' = 3.89$ while (N1 N2 + N2)/(N1 N2 − N1) = 2.02. Actually, the latter becomes larger than the former only after more than half of the entities from the middle cluster have been added to the distant singleton, which is impossible with local search heuristics.

This is an example of the situation in which the square error criterion would have led to a wrong partition if this was not prevented by constraints associated with the local search nature of 2-Means and Splitting by separating procedures.
5.4.1 Similarity and attraction measures compatible with K-Means and Ward criteria
5.4. EXTENSIONS TO OTHER DATA TYPES
translates $c_v$ to the scale's origin, 0. The product is positive when both entities are either larger than $c_v = 0$ or smaller than $c_v = 0$. It is negative when $i$ and $j$ are on different sides from $c_v = 0$. These correspond to our intuition. Moreover, the closer $y_{iv}$ and $y_{jv}$ to 0, the smaller is the product; the further away is either of them from 0, the larger is the product. This property can be interpreted as supporting a major data mining idea that the less ordinary a phenomenon, the more interesting it is: "the ordinary" in this case is just the average.

The next issue to be addressed is measuring the quality of a cluster. Given a similarity matrix $A = (a_{ij})$, $i, j \in I$, let us introduce the following measure of the quality of any $S \subseteq I$:

$$A(S) = \sum_{i, j \in S} a_{ij}/N_S = N_S\, a(S) \qquad (5.34)$$
where $N_S$ is the number of entities in $S$ and $a(S)$ its average internal similarity, $a(S) = \sum_{i, j \in S} a_{ij}/N_S^2$. The greater $A(S)$, the better is subset $S$ as a cluster.

Why is this so? Conventionally, a cluster should be cohesive internally and separate externally. This is the leading thought in defining what is a good cluster in the literature. Criterion (5.34) does involve a measure of cohesion, the average similarity $a(S)$, but it seems to have nothing to do with measuring the separation of clusters from the rest. However, this is not exactly so. The right-hand expression in (5.34) shows that it is a compromise between two goals: maximizing the within-cluster similarity $a(S)$ and maximizing the cluster's cardinality, $N_S$. The goals are at odds: the larger the cluster, the smaller its within-cluster similarity. The factor $N_S$ appears to be a proxy for the goal of making $S$ separate from the rest, $I - S$. Let us formulate a property of the criterion supporting this claim.

For any entity $i \in I$, let us define its attraction to subset $S \subseteq I$ as the difference:

$$\beta(i, S) = a(i, S) - a(S)/2 \qquad (5.35)$$

where: (a) $a(i, S)$ is the average similarity of $i$ and $S$ defined as $a(i, S) = \sum_{j \in S} a_{ij}/N_S$; (b) $a(S)$ is the average within-$S$ similarity. The fact that $a(S)$ is equal to the average $a(i, S)$ over all $i \in S$ leads us to expect that normally $a(i, S) \geq a(S)/2$ for the majority of elements $i \in S$, that is, normally $\beta(i, S) \geq 0$ for $i \in S$: entities should be positively attracted to their clusters. It appears that a cluster $S$ maximizing the quality criterion $A(S)$ is much more than that: $S$ is cohesive internally and separate externally because all its members are positively attracted to $S$ whereas non-members are negatively attracted to $S$.
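The attraction property of a subset maximizing criterion (5.34) can be observed on a small example by exhaustive search. The following sketch assumes NumPy, made-up centred data, similarities taken as inner products (so that the diagonal entries are non-negative) and hypothetical helper names `quality` and `attraction`, which implement (5.34) and (5.35) respectively.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
Y = rng.normal(size=(8, 3))
Y -= Y.mean(axis=0)
A = Y @ Y.T                      # similarities a_ij = (y_i, y_j); a_ii >= 0
n = len(A)

def quality(S):
    """Criterion (5.34): sum of within-S similarities divided by the size of S."""
    S = list(S)
    return A[np.ix_(S, S)].sum() / len(S)

def attraction(i, S):
    """Attraction (5.35): average similarity of i to S minus half of a(S)."""
    S = list(S)
    a_iS = A[i, S].sum() / len(S)
    a_S = A[np.ix_(S, S)].sum() / len(S) ** 2
    return a_iS - a_S / 2

# Exhaustive search over all non-empty subsets (255 of them for n = 8).
subsets = [s for size in range(1, n + 1) for s in combinations(range(n), size)]
best = max(subsets, key=quality)

inside = [attraction(i, best) for i in best]
outside = [attraction(i, best) for i in range(n) if i not in best]
print(best, inside, outside)
```

The property concerns a global maximizer of the criterion; a subset found by a local search procedure need not satisfy it for every entity.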
Proof: Indeed, if $i \in S$ and $\beta(i, S) < 0$, then $A(S - i) > A(S)$, which contradicts the assumption that $S$ maximizes (5.34). To prove the inequality, let us consider $A(S) N_S = \sum_{i, j \in S} a_{ij}$, the sum of within-cluster similarities. Obviously, $A(S) N_S = A(S - i)(N_S - 1) + 2 a(i, S) N_S - a_{ii}$. This leads to $[A(S - i) - A(S)](N_S - 1) = a_{ii} + A(S) - 2 a(i, S) N_S = a_{ii} - 2 \beta(i, S) N_S > 0$ since $a_{ii}$ must be non-negative, which proves the inequality. The other part, that $i \notin S$ contradicts $\beta(i, S) > 0$, can be proven similarly, q.e.d.

Having thus discussed the intuition behind the similarity and cluster quality measures thus defined, it is nice to see that they have something to do with K-Means and Ward criteria.
Statement 5.11. If the similarity measure $a_{ij}$ is defined as the inner product $(y_i, y_j)$, then, for any partition $S = \{S_1, \ldots, S_K\}$ with the set of centroids $c = \{c_1, \ldots, c_K\}$, the within-cluster similarity measure $A(S_k)$ is equal to the relative contribution $N_k \sum_{v \in V} c_{kv}^2$ of $S_k$ to the complementary cluster criterion $B(S, c)$ in decomposition (5.13):

$$A(S_k) = \sum_{i, j \in S_k} a_{ij}/N_k = N_k \sum_{v \in V} c_{kv}^2 \qquad (5.36)$$

Proof: Indeed, according to definition, $c_{kv} = \sum_{i \in S_k} y_{iv}/N_k$, which implies that $c_{kv}^2 = \sum_{i, j \in S_k} y_{iv} y_{jv}/N_k^2$. Summing these up by $v \in V$, we get (5.36), q.e.d.
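Equation (5.36) is easy to confirm numerically. The sketch below assumes NumPy, made-up data and an arbitrary hand-picked partition; it compares the within-cluster similarity computed from the inner-product matrix with the centroid-based contribution of each cluster.

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(9, 3))
Y -= Y.mean(axis=0)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])   # hypothetical partition, K = 3

A = Y @ Y.T                                      # a_ij = (y_i, y_j)

A_k = []   # within-cluster similarity A(S_k) from the similarity matrix
B_k = []   # contribution N_k * sum_v c_kv^2 from the centroid
for k in range(3):
    idx = np.flatnonzero(labels == k)
    ck = Y[idx].mean(axis=0)
    A_k.append(A[np.ix_(idx, idx)].sum() / len(idx))
    B_k.append(len(idx) * float(ck @ ck))

print(A_k)
print(B_k)
```

Only the similarity matrix is needed on the left-hand side, which is what allows the criterion to be applied when data come as similarities rather than as an entity-to-feature table.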
This statement leads to the following.

Statement 5.12. Criteria maximized by K-Means partitioning and the divisive Ward-like method can be expressed in terms of between-entity inner products $a_{ij} = (y_i, y_j)$ as

$$B(S, c) = \sum_{k=1}^{K} A(S_k) \qquad (5.37)$$

and

$$\lambda_w = A(S_{w1}) + A(S_{w2}) \qquad (5.38)$$
A method for building clusters individually, one at a time, according to criterion (5.37), will be described in section 5.5.5. In this section we present only a method for divisive clustering using similarity data with criterion (5.38).

The sum $\sum_{i,j \in S_w} a_{ij}$ in the parent class $S_w$ is equal to the sum of within-children summary similarities plus the between-children similarity doubled. Thus, by subtracting the former from their counterparts in criterion (5.38), the criterion can be transformed into:

$$\lambda_w = \frac{N_{w1}N_{w2}}{N_w}\,(A_{11} + A_{22} - 2A_{12}) \qquad (5.39)$$

where

$$A_{ij} = \frac{\sum_{i' \in S_{wi}} \sum_{j' \in S_{wj}} a_{i'j'}}{N_{wi}N_{wj}}$$

is the average similarity between $S_{wi}$ and $S_{wj}$, or within $S_{wi}$ if $i = j$ ($i, j = 1, 2$). Note that criterion (5.39) is but the Ward distance translated in terms of entity-to-entity inner products treated as similarities.

Moving an entity $i$ from $S_{w2}$ to $S_{w1}$ leads to the change in $\lambda_w$ given by (5.40), where $A_1(i,w1) = \sum_{j \in S_{w1}} a_{ij}/(N_{w1}+1)$, $A_2(i,w2) = \sum_{j \in S_{w2}} a_{ij}/(N_{w2}-1)$, and the remaining coefficient is $N/((N_{w1}+1)(N_{w2}-1))$. This can be derived from expression (5.38). A similar value for the change in $\lambda_w$ when $i \in S_{w1}$ is moved into $S_{w2}$ can be obtained from this by interchanging symbols 1 and 2 in the indices.

Let us describe a local search algorithm for maximizing criterion $\lambda_w$ in (5.39) by splitting $S_w$ in two parts. At first, a tentative split of $S_w$ is carried out according to the dissimilarities $(a_{ii} + a_{jj} - 2a_{ij})/2$ between entities $i$ and $j$, which are defined according to criterion (5.39) applied to individual entities. Then the split parts are updated by exchanging entities between them until the increment in (5.40) becomes negative.
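The remark that criterion (5.39) is the Ward distance rewritten through inner products can be verified on a small example. The following is a minimal sketch under the assumption that similarities are inner products of the rows of a made-up matrix `Y`; the function names are hypothetical.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def average_similarity(Y, part_a, part_b):
    """A_ij of (5.39): average inner product between entities of two parts."""
    total = sum(inner(Y[i], Y[j]) for i in part_a for j in part_b)
    return total / (len(part_a) * len(part_b))


def similarity_criterion(Y, s1, s2):
    """Right-hand side of (5.39), computed from similarities only."""
    n1, n2 = len(s1), len(s2)
    a11 = average_similarity(Y, s1, s1)
    a22 = average_similarity(Y, s2, s2)
    a12 = average_similarity(Y, s1, s2)
    return n1 * n2 / (n1 + n2) * (a11 + a22 - 2 * a12)


def ward_distance(Y, s1, s2):
    """N1*N2/(N1+N2) times the squared Euclidean distance between centroids."""
    n1, n2 = len(s1), len(s2)
    dim = len(Y[0])
    c1 = [sum(Y[i][v] for i in s1) / n1 for v in range(dim)]
    c2 = [sum(Y[i][v] for i in s2) / n2 for v in range(dim)]
    return n1 * n2 / (n1 + n2) * sum((a - b) ** 2 for a, b in zip(c1, c2))


# Hypothetical data (assumed for illustration): six entities, two features.
Y = [[1.0, 0.5], [0.8, 1.2], [1.1, 0.9], [-1.0, -0.7], [-0.9, -1.3], [-1.2, 0.1]]
s1, s2 = [0, 1, 2], [3, 4, 5]

from_similarities = similarity_criterion(Y, s1, s2)
from_centroids = ward_distance(Y, s1, s2)
print(from_similarities, from_centroids)
```

The agreement follows because $A_{11}$, $A_{22}$ and $A_{12}$ equal $(c_{w1}, c_{w1})$, $(c_{w2}, c_{w2})$ and $(c_{w1}, c_{w2})$ respectively when similarities are inner products.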
In this method, step 3 is the most challenging computationally: it requires finding the maximum of the increment (5.40) over $i \in I$. To scale the method to large data sizes, one can use not the optimal $i$ but rather any $i$ at which the increment is positive: this would be analogous to shifting from the method of steepest descent to the method of possible directions in the optimization of a function.

In spite of its simplicity, Splitting by similarity produces rather tight clusters, which can be expressed in terms of the measure of attraction $\beta(i,S) = a(i,S) - a(S)/2$ (5.35). If an entity $i \in S_w$ is moved from split part $S_{w2}$ to $S_{w1}$, the change in the numbers of elements of individual parts can be characterized by the number $n_{12} = \frac{N_{w1}(N_{w2}-1)}{N_{w2}(N_{w1}+1)}$, or by the similar number $n_{21}$ if an entity is moved in the opposite direction. Obviously, for large $N_{w1}$ and $N_{w2}$, $n_{12}$ tends to unity. It appears that elements of a split part are more attracted to it than to the other part, up to this quantity.
Statement 5.13. For any $i \in S_{w2}$,

$$n_{12}\,\beta(i,S_{w1}) < \beta(i,S_{w2}) \qquad (5.41)$$

and a symmetric inequality holds for any $i \in S_{w1}$.

Proof: Indeed, the increment (5.40) is negative for all $i \in S_{w2}$ after the Splitting by similarity algorithm has been applied to $S_w$. This implies that $\frac{N_{w1}}{N_{w1}+1}\beta(i,S_{w1})$ plus a positive term is less than $\frac{N_{w2}}{N_{w2}-1}\beta(i,S_{w2})$. The inequality remains true even if the positive term is removed from it. Dividing the result by $\frac{N_{w2}}{N_{w2}-1}$ leads to (5.41), q.e.d.
A divisive clustering algorithm can be formulated exactly as in section 4.2 except that this time Splitting by similarity is to be used for splitting clusters.
Example 5.45. Attractions of entities to Masterpieces clusters
Table 5.2 displays the attractions of entities in the Masterpieces data to clusters occurring in the splitting process. In the end, not only does Statement 5.13 hold, but all entities become positively attracted to their clusters and negatively to the rest.
The case at which all variables are binary is of considerable interest because it emerges in many application areas such as document clustering in which features characterize presence/absence of keywords. When keywords are created automatically from collections of documents, the size of the feature space may become much larger than the size of the entity set so that it can be of advantage to shift from the entity-to-feature data format to a less demanding search space such as similarities between entities or just word counts. The K-Means and Ward criteria can be adjusted for use in these situations.
Inner product expressed through frequencies
As stated in the previous section, the inner product of rows in the prestandardized data matrix is a similarity measure which is compatible with K-Means and Ward criteria. With binary features, specifying cv = pv , the proportion of entities at which feature v is present, and bv = 1, the range, the inner product can be expressed as:
where $t(i)$ (or $t(j)$) is the total frequency weight of features that are present at $i$ (or at $j$). The frequency weight of a feature $v$ is defined here as $1 - p_v$: the more frequent the feature, the smaller its frequency weight. The constant $\sum_v p_v^2$ here is simply an averaging constant related to the Gini coefficient. The right-hand expression for $a_{ij}$ follows from the fact that each of the pre-processed $y_{iv}$ and $y_{jv}$ can only be either $1 - p_v$ or $-p_v$. Similarity index (5.42) is further discussed in section 7.3.1. Here we note that it involves evaluations of the information contents of all binary features that are present at entities $i$ or $j$.

With binary features, K-Means and Ward clustering criteria can be reformulated in terms of feature counts, which can lead to scalable heuristics for their optimization. Indeed, given a partition $S = \{S_k\}$ along with cluster centroids $c_k = (c_{kv})$, and standardizing coefficients $a_v = p_v$ and $b_v$, the within-cluster average values of binary features can be expressed as $c_{kv} = [(1 - p_v)p_{kv}/p_k - p_v(1 - p_{kv}/p_k)]/b_v$, where $p_{kv}$ is the proportion of entities simultaneously falling in feature $v$ and cluster $S_k$, and $p_k$ is the proportion of $S_k$ in $I$. This leads to

$$c_{kv} = \frac{p_{kv} - p_k p_v}{p_k b_v} \qquad (5.43)$$
Putting equation (5.43) into the formulas for the K-Means criterion and Ward distance proves the following statement.

Statement 5.14. In the situation of binary data pre-processed with $a_v = p_v$, the complementary criterion $B(S,c)$ of K-Means is equal to

$$B(S,c) = N \sum_{k=1}^{K} \sum_{v \in V} \frac{(p_{kv} - p_k p_v)^2}{p_k b_v^2} \qquad (5.44)$$

and Ward distance to

$$\lambda_w = \frac{N_{w1}N_{w2}}{N_w} \sum_{v \in V} \left(\frac{p_{w1,v}}{p_{w1}} - \frac{p_{w2,v}}{p_{w2}}\right)^2 / b_v^2 \qquad (5.45)$$
It should be noted that, when $b_v = 1$, measure $\lambda_w$ (5.45) closely resembles the so-called "twoing rule," a heuristic criterion used in CART techniques for scoring splits in decision trees [11], and $B(S,c)$ (5.44) is the sum of $NG^2$ over all nominal features involved, where $G^2$ is the Quetelet coefficient (2.11) (and (2.16)) introduced in section 2.2.3.
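The frequency form of the K-Means criterion for binary data can be checked directly. The sketch below compares $B(S,c) = \sum_k N_k \sum_v c_{kv}^2$, computed from the standardized data, with $N \sum_k \sum_v (p_{kv} - p_k p_v)^2/(p_k b_v^2)$. The binary table `X`, the partition, and the choice $b_v = 1$ are assumptions made for illustration only.

```python
def criterion_from_centroids(X, clusters, b):
    """B(S, c) = sum_k N_k sum_v c_kv^2 on data standardized as (x - p_v) / b_v."""
    n, n_features = len(X), len(X[0])
    p = [sum(row[v] for row in X) / n for v in range(n_features)]
    total = 0.0
    for cluster in clusters:
        for v in range(n_features):
            c_kv = sum((X[i][v] - p[v]) / b[v] for i in cluster) / len(cluster)
            total += len(cluster) * c_kv ** 2
    return total


def criterion_from_frequencies(X, clusters, b):
    """N * sum_k sum_v (p_kv - p_k p_v)^2 / (p_k b_v^2)."""
    n, n_features = len(X), len(X[0])
    p = [sum(row[v] for row in X) / n for v in range(n_features)]
    total = 0.0
    for cluster in clusters:
        p_k = len(cluster) / n
        for v in range(n_features):
            p_kv = sum(X[i][v] for i in cluster) / n
            total += (p_kv - p_k * p[v]) ** 2 / (p_k * b[v] ** 2)
    return n * total


# Hypothetical binary data: six entities, three presence/absence features.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]]
clusters = [[0, 1, 2], [3, 4, 5]]
b = [1.0, 1.0, 1.0]  # assumed ranges b_v = 1

direct = criterion_from_centroids(X, clusters, b)
by_counts = criterion_from_frequencies(X, clusters, b)
print(direct, by_counts)
```

The second function needs only the feature counts within clusters, which is what makes count-based heuristics attractive when the feature space is large.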
5.4.3 Agglomeration and aggregation of contingency data
$X^2(T,U) - X^2(F,U)$, where $X^2$ is the Pearson contingency coefficient in the format of equation (2.13) applied to $P(T,U)$ and $P(F,U)$, respectively. This equation can be put as:

$$X^2(T,U) = X^2(F,U) + L(F,U) \qquad (5.46)$$

and interpreted as a Pythagorean decomposition of the data scatter: $X^2(T,U)$ is the original data scatter, $X^2(F,U)$ is its part taken into account by partition $F$, and $L(F,U)$ is the unexplained part. Obviously, minimizing $L(F,U)$ over partition $F$ is equivalent to maximizing $X^2(F,U)$.

Let $F(k,l)$ be the partition obtained from $F$ by merging its classes $F_k$ and $F_l$ into the united class $F_{k \cup l} = F_k \cup F_l$. To score the quality of the merging, we need to analyze the difference $D = X^2(F,U) - X^2(F(k,l),U)$, according to (5.46). Obviously, $D$ depends only on items related to $k$ and $l$:

$$D = \sum_{u \in U} [p_{ku}q_{ku} + p_{lu}q_{lu} - (p_{ku} + p_{lu})q_{k \cup l,u}] = \sum_{u \in U} (1/p_{+u})\,[p_{ku}^2/p_{k+} + p_{lu}^2/p_{l+} - (p_{ku} + p_{lu})^2/(p_{k+} + p_{l+})].$$

By using equation $(x - y)^2 = x^2 + y^2 - 2xy$, one can transform the expression on the right to obtain:

$$D = \sum_{u \in U} (1/p_{+u})\, \frac{p_{ku}^2 p_{l+}^2 + p_{lu}^2 p_{k+}^2 - 2p_{ku}p_{lu}p_{k+}p_{l+}}{p_{k+}p_{l+}(p_{k+} + p_{l+})}$$

which leads to:

$$D = \frac{p_{k+}p_{l+}}{p_{k+} + p_{l+}} \sum_{u \in U} (1/p_{+u})\,(p_{ku}/p_{k+} - p_{lu}/p_{l+})^2.$$
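The closed-form expression for $D$ can be compared with the direct difference of chi-squared coefficients before and after merging two classes. This is a minimal sketch: the contingency counts are invented, and $X^2$ is assumed here to take the usual form $\sum_{k,u} p_{ku}^2/(p_{k+}p_{+u}) - 1$ for a table of relative frequencies.

```python
def chi_squared(P):
    """Pearson chi-squared coefficient of a relative frequency table P."""
    row = [sum(r) for r in P]
    col = [sum(c) for c in zip(*P)]
    total = sum(P[k][u] ** 2 / (row[k] * col[u])
                for k in range(len(P)) for u in range(len(col)))
    return total - 1.0


def merge_rows(P, k, l):
    """Table for the partition F(k, l): rows k and l replaced by their sum."""
    merged = [a + b for a, b in zip(P[k], P[l])]
    return [r for idx, r in enumerate(P) if idx not in (k, l)] + [merged]


def merging_loss(P, k, l):
    """Closed form of D: weighted chi-squared distance between row profiles."""
    row_k, row_l = sum(P[k]), sum(P[l])
    col = [sum(c) for c in zip(*P)]
    weight = row_k * row_l / (row_k + row_l)
    return weight * sum((P[k][u] / row_k - P[l][u] / row_l) ** 2 / col[u]
                        for u in range(len(col)))


# Hypothetical contingency counts: four row classes, three column classes.
counts = [[10, 5, 5], [2, 8, 10], [4, 4, 12], [6, 1, 3]]
n = sum(sum(r) for r in counts)
P = [[x / n for x in r] for r in counts]

direct = chi_squared(P) - chi_squared(merge_rows(P, 1, 2))
closed_form = merging_loss(P, 1, 2)
print(direct, closed_form)
```

Because $D$ depends only on the two merged rows, an agglomerative procedure can score every candidate pair without recomputing the whole coefficient.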
The power of the data recovery approach can be further demonstrated in problems involving data on the same entities coming from different sources in different formats. For instance, two data tables on numeral digits, Digits Table 1.8 of their drawing features and Confusion Table 1.9, are given in section 1.1. We have analyzed these data by finding patterns in one of them and using the other one for interpretation of the patterns. However, in some applications one may need to look for patterns that are present in both data sets. Obviously, to do this, the data sets must have a common entity set, which is so in the example of Digits. Problems of this type are of interest in such applications as data fusion.

Although generally many different data sets involving the same entities can be available, we will concentrate here on the data formats that are given for Digits. Thus, the case is analyzed in which an entity set $I$ is provided with two data sets, a pre-processed standardized entity-to-feature data matrix $Y$ and a flow table $P(I,I) = P = (p_{ij})$, $i,j \in I$. These two data sets can be of different importance in the description of the phenomenon under consideration. To take this into account, a positive relative weight can be assigned to $Y$, with $P(I,I)$ assumed to have weight 1.

Let us consider a partition $S = \{S_1, ..., S_K\}$ of $I$ with classes $S_k$ corresponding to the data patterns that are being searched for. Let us denote the binary membership vector for $S_k$ by $z_k = (z_{ik})$ so that $z_{ik} = 1$ if $i \in S_k$ and $z_{ik} = 0$ if $i \notin S_k$. According to the data recovery approach, this partition can be used to recover the feature data with an analogue to the K-Means clustering model (5.11), which indeed can be presented as an extension of the Principal component analysis model (5.4) (this is discussed in greater detail later in section 5.5.3, see (5.52)):

$$y_{iv} = c_{1v}z_{i1} + c_{2v}z_{i2} + ... + c_{Kv}z_{iK} + e_{iv} \qquad (5.47)$$

for all $i \in I$ and $v \in V$, and the flow data with a similar equation, though taking into account that the row and column entities are the same:

$$q_{ij} = \sum_{k=1}^{K} \sum_{l=1}^{K} \lambda_{kl} z_{ik} z_{jl} + \epsilon_{ij} \qquad (5.48)$$

for all $i,j \in I$, where $c_{kv}$ and $\lambda_{kl}$ are unknown reals. The partition $S$ and reals $c_{kv}$ and $\lambda_{kl}$ are sought to minimize the least squares criterion $L$ (5.49).

To develop a K-Means-like method of alternating minimization of $L$, let us consider $S$ fixed. Then the optimal $c_{kv}$ is obviously the average of $y_{iv}$ over $i \in S_k$, and the optimal $\lambda_{kl}$ is the Quetelet coefficient $q_{kl}$ calculated over the aggregate interaction matrix $P(S,S)$, entries of which are defined as $p_{kl} = \sum_{i \in S_k} \sum_{j \in S_l} p_{ij}$, $k,l = 1, ..., K$. Then, given cluster centroids $c_k = (c_{kv})$ and $\lambda_{kl} = q_{kl}$, criterion $L$ (5.49) can be rewritten as
the cluster membership vectors. Similarly, the model for hierarchical clustering is equivalent to the PCA model with the restriction that $Z$ must be a matrix of $m$ ternary split membership vectors. The PCA strategy of extracting columns of $Z$ one by one, explained in section 5.1.3, thus can be applied in each of these situations. We begin by describing an adaptation of this strategy to the case of hierarchical clustering and then to the case of partitioning. Extensions to the similarity and contingency data are briefly described next (for more detail, see [90]).
5.5.2 Divisive Ward-like clustering
Given an upper cluster hierarchy $S = \{S_w\}$, each item $c_{w1} - c_w$ in equation (5.23), with $c_{w1}$ being the centroid of $S_{w1} \in S$ and $c_w$ the centroid of the parent $S_w$, contributes $\lambda_w = dw(S_{w1}, S_{w2}) = \frac{N_{w1}N_{w2}}{N_w} d(c_{w1}, c_{w2})$ (5.30) to data scatter $T(Y)$ according to equation (5.29). This implies that, to build $S = \{S_w\}$ from scratch, that is, by starting from the universal cluster $S_w = I$, one can employ the PCA one-by-one strategy by adding to $S$ one split at a time, each time maximizing contribution $\lambda_w$ of the split to the explained part of the data scatter, so that $\lambda_w$ plays the role of the squared singular value $\mu_t^2$ of $Y$. The analogy is not superficial. Both $\mu_t$ and $\sqrt{\lambda_w}$ express the scaling coefficient when projecting the data onto a corresponding axis, which follows the principal component $z_t$, for the former, and the ternary split vector $\phi_w$, for the latter.

In the sequel to this section, we are going to demonstrate that the divisive Ward-like algorithm described in section 4.2 is an implementation of the PCA one-by-one extracting strategy. The strategy involves two types of iterations. One, major iteration, builds a series of splits of clusters $S_w$ into split parts $S_{w1}$ and $S_{w2}$ to greedily maximize the summary contribution $\sum_{S_w \in S} \lambda_w$ to the data scatter $T(Y)$. The other, minor iteration, is utilized when maximizing individual $\lambda_w$ by splitting $S_w$ using a specific splitting method. Two such methods described in section 4.2 fit into the strategy because they do maximize $\lambda_w$, though locally. They are: a straight method, with 2-Means, and an incremental method, Splitting by separation, based on a different expression, (4.6), for $\lambda_w$. The use of 2-Means with Euclidean squared distance $d$, not Ward distance $dw$, for splitting is justified with the following statement.
Statement 5.15. The 2-Means algorithm for splitting $S_w$ maximizes (locally) the Ward distance between split parts.

Each splitting step is equivalent to finding a ternary splitting vector maximizing Ward distance between its parts or, equivalently, minimizing the summary squared residuals in (5.27) with regard to all possible ternary column vectors $\phi_w = (\phi_{iw})$ and any row-vectors $a_w = (a_{wv})$ in model (5.25). According to the first-order optimality conditions, the optimal vector $a_w$ in (5.25) relates to $\phi_w$ as the weighted vector of differences, $a_{wv} = \sqrt{N_{w1}N_{w2}/N_w}\,(c_{w1,v} - c_{w2,v})$, which concurs with the data recovery model (5.23) for hierarchical clustering.
5.5.3 Iterated Anomalous pattern
In this section we are going to demonstrate that the PCA one-by-one strategy, applied when the membership vectors are constrained to be binary, is but the iterated Anomalous pattern analysis. Consider a set of subsets $S_k \subseteq I$, each specified by a binary $N$-dimensional vector $z_k = (z_{ik})$ where $z_{ik} = 1$ if $i \in S_k$ and $z_{ik} = 0$ if $i \notin S_k$. (Vector $z_k$ is interpreted as the membership vector for subset $S_k \subseteq I$.) The data recovery model (5.11) for K-Means can be rewritten with $z_k$ in the following way:

$$y_{iv} = c_{1v}z_{i1} + c_{2v}z_{i2} + ... + c_{Kv}z_{iK} + e_{iv} \qquad (5.52)$$

which follows the PCA model (5.4) with $m = K$ factors and the additional requirement that scoring vectors $z_k$ must be binary.
The one-by-one extracting strategy here would require the building of clusters $S_k$ one by one, each time minimizing criterion:

$$l = \sum_{i \in I} \sum_{v \in V} (y_{iv} - c_v z_i)^2 \qquad (5.53)$$

over unknown $c_v$ and binary $z_i$, index $k$ being omitted. The membership vector $z$ is characterized by subset $S = \{i : z_i = 1\}$. Criterion (5.53) can be rewritten in terms of $S$, as follows:
exactly the rules in Steps 3 and 4 of the Anomalous pattern clustering algorithm in section 3.2.3. Thus, the following is proven.
Statement 5.16. The Anomalous pattern algorithm iteratively minimizes (5.54) according to the alternating optimization rule: given $c$, find optimal $S$, and given $S$, find optimal $c$. The initial choice of $c$ in the Anomalous cluster minimizes (5.54) over all singleton clusters.

Thus, the AP method is a set of minor iterations of alternatingly optimizing criterion (5.53) for selecting the best cluster. After one cluster has been found, the next one can be sought in the remainder of the set $I$, to obtain a cluster partition; this amounts to the iterated application of Anomalous pattern in Intelligent K-Means (section 3.3). Another version of the one-by-one Anomalous clustering, with the residual data matrix and not necessarily non-overlapping clusters, is described in [90], section 4.6.
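The alternating rule of Statement 5.16 can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the book's own listing: the reference point is taken to be the origin (the data are assumed pre-standardized), an entity joins $S$ when it is closer to $c$ than to the origin, which is what minimizing (5.53) over binary $z_i$ requires for a fixed $c$, and the data set is invented.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def anomalous_pattern(Y, max_iter=100):
    """Alternating minimization of criterion (5.53) with the origin as reference."""
    n, dim = len(Y), len(Y[0])
    origin = [0.0] * dim
    # Initial c: the entity farthest from the origin (best singleton cluster).
    seed = max(range(n), key=lambda i: sq_dist(Y[i], origin))
    c = list(Y[seed])
    S = [seed]
    for _ in range(max_iter):
        # Given c, the optimal S takes the entities closer to c than to the origin.
        new_S = [i for i in range(n) if sq_dist(Y[i], c) < sq_dist(Y[i], origin)]
        if not new_S or new_S == S:
            break
        S = new_S
        # Given S, the optimal c is the within-cluster mean.
        c = [sum(Y[i][v] for i in S) / len(S) for v in range(dim)]
    return S, c


# Hypothetical data: four entities near the origin and two far from it.
Y = [[0.1, 0.0], [-0.2, 0.1], [0.0, -0.1], [3.0, 3.1], [2.9, 3.0], [-0.1, 0.2]]
S, c = anomalous_pattern(Y)
print(S, c)
```

Running the same function on the entities left outside `S` gives the iterated version mentioned above.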
5.5.4 Anomalous pattern versus Splitting
A question related to the Anomalous pattern method is whether any relation exists between its criterion (5.54) and the Ward criterion $\lambda_w$, in its form (4.6) used in the Splitting by separation algorithm. Both methods separate a cluster from a body of entities with a square-error criterion. To analyze the situation at any of the major iterations, let us denote the entity set under consideration by $J$, and the separated cluster by $S_1$ with its centroid $c_1$. Consider also a pre-specified point $c$ in $J$, which is the centroid of $J$ in Ward clustering and a reference point in the Anomalous pattern clustering. The other subset, $J - S_1$, will be denoted by $S_2$. In these denotations, the Anomalous pattern criterion can be expressed as:
around $c$, an item which is constant with respect to the cluster $S_1$ being separated. By noting that $d(y_i, c) = (y_i, y_i) + (c, c) - 2(y_i, c)$ and $d(y_i, c_1) = (y_i, y_i) + (c_1, c_1) - 2(y_i, c_1)$, the last expression can be equivalently rewritten as $W = T(Y,c) - N_1 d(c_1, c)$. The following is proven:

Statement 5.17. Both separating methods, Anomalous pattern and Splitting by separation, maximize the weighted distance $d(c_1, c)$ between centroid $c_1$ of the cluster being separated and a pre-specified point $c$. The difference between the methods is that: (1) the weight is $N_1$ in the Anomalous pattern and $N_1/N_2$ in the Splitting by separation, and (2) $c$ is the user-specified reference point in the former and the unvaried centroid of set $J$ being split in the latter.

The difference (2) disappears if the reference point has been set at the centroid of $J$. The difference (1) is fundamental: as proven in section 5.4.1, both sets, $S_1$ and $S_2$, are tight clusters in Splitting, whereas only one of them, $S_1$, is to be tight as the Anomalous pattern: the rest, $J - S_1$, may be a set of rather disconnected entities. This shows once again the effect of size coefficients in cluster criteria.

Let $A = (a_{ij})$, $i,j \in I$, be a given similarity matrix and $\lambda z z^T = (\lambda z_i z_j)$ a weighted binary matrix defined by a binary membership vector $z = (z_i)$ of a subset $S = \{i \in I : z_i = 1\}$ along with its numeric intensity weight $\lambda$. When $A$ can be considered as noisy information on a set of weighted "additive clusters" $\lambda_k z_k z_k^T$, $k = 1, ..., K$, the following model can be assumed ([121], [88]):
$L = \sum_{i,j \in I} (a_{ij} - \lambda z_i z_j)^2$ with respect to all possible $\lambda$ and binary $z = (z_i)$. Obviously, given $S \subseteq I$, the $\lambda$ minimizing criterion $L$ is the average similarity within cluster $S$, that is, $\lambda = a(S)$ where $a(S) = \sum_{i,j \in S,\, i<j} a_{ij}/[|S|(|S| - 1)/2]$. In matrix terms, $a(S) = z^T A z / z^T z$. After a cluster $S$ and corresponding $\lambda = a(S)$ are found, similarities are changed for residual similarities $a_{ij} - \lambda z_i z_j$, and the process of finding a cluster $S$ and its intensity $a(S)$ is reiterated using the updated similarity matrix. In the case of non-overlapping clusters, there is no need to calculate the residual similarity: the process need only be applied to those entities that have remained as yet unclustered.

When the clusters are assumed to be mutually non-overlapping (that is, the membership vectors $z_t$ are mutually orthogonal) or when fitting of the model is done using the one-by-one PCA strategy, the data scatter decomposition holds as follows [88]:
$$T(A) = \sum_{k=1}^{K} [z_k^T A_k z_k / z_k^T z_k]^2 + L(E) \qquad (5.56)$$

in which the least-squares optimal $\lambda_k$'s are taken as the within-cluster averages of the (residual) similarities. The matrix $A_k$ in (5.56) is either $A$ unchanged, if clusters are not overlapping, or a residual similarity matrix at iteration $k$, $A_k = A - \sum_{t=1}^{k-1} \lambda_t z_t z_t^T$.

Equation (5.56) shows that finding least-squares optimal clusters requires maximizing the intermediate term in (5.56), which is the sum of squared criteria (5.36) or (5.34) discussed in section 5.4.1. When applying the one-by-one extraction strategy so that one cluster is sought at a time, the constituent one-cluster criterion is the squared criterion (5.34). To find a cluster optimizing it, one may exploit its square root, criterion (5.34) itself, which will be briefly discussed next.

In [88, 90] a number of incremental procedures to maximize $A(S)$ (5.34) or its square are described to involve the average similarity of an entity $i \in I$ to a subset $S$,
Recall that the attraction of $i$ to $S$ is defined as $\beta(i,S) = a(i,S) - a(S)/2$. Statement 5.10 states that if subset $S$ maximizes criterion (5.34) then entities with $\beta(i,S) > 0$ belong in $S$ and those with $\beta(i,S) < 0$ do not. This line of action can be wrapped up in the following algorithm.
ADDI-S or Additive similarity cluster

1. Initialization. Find the maximum $a_{ij}$ in $A$ and set $S$ to consist of the row and column indexes $i$, $j$ involved; define $a(S)$ as the maximum $a_{ij}$. Calculate $a(i,S)$ for all $i \in I$.

2. Direction of the steepest ascent. Find $i \in I$ that maximizes $s_i\beta(i,S) = s_i(a(i,S) - a(S)/2)$, where $s_i = -1$ if $i \in S$ and $s_i = 1$ otherwise. If this maximum is not positive, stop the process at the current $S$; otherwise, proceed.

3. Steepest ascent. Update $S$ by adding $i$ to $S$ if it does not belong to $S$ or by removing it from $S$ if it does. Recalculate $a(S)$ and $a(i,S)$ for all $i \in I$ and return to Step 2.

What is nice in this algorithm is that it involves no ad hoc parameters. Obviously, at the resulting cluster, attractions $\beta(i,S) = a(i,S) - a(S)/2$ are positive for within-cluster elements $i$ and negative for out-of-cluster elements $i$. Actually, the procedure is a local search algorithm for maximizing criterion $A(S)$ (5.34). This method works well in application to real-world data; similar procedures have been successfully applied in [141].
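The three steps above can be sketched as follows. Several details are assumptions made for this illustration and are not fixed by the text: the diagonal of $A$ is ignored, $a(S)$ is the average of $a_{ij}$ over pairs $i < j$ in $S$, $a(i,S)$ averages $a_{ij}$ over the members of $S$ other than $i$, and a step cap guards against endless loops. The similarity matrix is invented.

```python
def addi_s(A, max_steps=1000):
    """Local search for one additive similarity cluster (diagonal of A ignored)."""
    n = len(A)

    def a_cluster(S):
        members = sorted(S)
        if len(members) < 2:
            return 0.0
        total = sum(A[i][j] for pos, i in enumerate(members) for j in members[pos + 1:])
        return total / (len(members) * (len(members) - 1) / 2)

    def a_entity(i, S):
        others = [j for j in S if j != i]
        return sum(A[i][j] for j in others) / len(others) if others else 0.0

    # Step 1: start from the pair with the maximum similarity.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i0, j0 = max(pairs, key=lambda p: A[p[0]][p[1]])
    S = {i0, j0}

    for _ in range(max_steps):
        a_S = a_cluster(S)

        def gain(i):
            sign = -1.0 if i in S else 1.0
            return sign * (a_entity(i, S) - a_S / 2)

        # Step 2: entity with the largest signed attraction.
        best = max(range(n), key=gain)
        if gain(best) <= 0:
            break
        # Step 3: add the entity if it is outside S, remove it otherwise.
        S = S ^ {best}
    return sorted(S), a_cluster(S)


# Hypothetical similarity matrix: entities 0, 1, 2 are strongly similar.
A = [[0, 9, 8, 1, 0],
     [9, 0, 7, 2, 1],
     [8, 7, 0, 1, 1],
     [1, 2, 1, 0, 3],
     [0, 1, 1, 3, 0]]
cluster, intensity = addi_s(A)
print(cluster, intensity)
```

On this matrix the search starts from the pair (0, 1), adds entity 2, and then stops because every remaining move has a non-positive signed attraction.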
The theory supports:

1. A unified framework for K-Means and Ward clustering that not only justifies conventional partitioning and agglomerative methods but extends them to mixed scale and multiple table data

2. Modified versions of the algorithms such as scalable divisive clustering as well as intelligent versions of K-Means mitigating the need for user-driven ad hoc parameters

3. Compatible measures and criteria for similarity data to produce provably tight clusters with the tightness measured by the attraction coefficient derived in the data recovery framework

4. Effective measures and clustering criteria for analyzing binary data tables

5. One-by-one clustering procedures that allow for more flexible clustering structures including single clusters, overlapping clusters, and incomplete clustering

6. Interpretation aids based on decompositions of the data scatter over elements of cluster structures to judge elements' relative contributions

7. Conventional and modified measures of statistical association, such as the correlation ratio and Pearson chi-square contingency coefficient, as summary contributions of cluster structures to the data scatter

8. A related framework for cluster analysis of contingency and flow data in which the data scatter and contributions are measured in terms of the Pearson chi-square contingency coefficient.
Different Clustering Approaches

After reading through this chapter the reader will know about the most popular other clustering approaches including:

1. Partitioning around medoids (PAM)
2. Gaussian mixtures and the Expectation-Maximization (EM) method
3. Kohonen's Self-organizing map (SOM)
4. Fuzzy clustering
5. Regression-wise K-Means
6. Single linkage clustering and Minimum Spanning Tree (MST)
7. Core and shelled core clusters
8. Conceptual description of clusters
precision, the better the quality of the association rule.

Comprehensive description: A conjunctive predicate $A$ describing a subset $S \subseteq I$ in such a way that $A$ holds for an entity if and only if it belongs to $S$. It may have errors of two types: the false positives, the entities satisfying $A$ but not belonging to $S$, and the false negatives, the entities from $S$ that do not satisfy $A$.

Conceptual description: Description of clusters with feature-based logical predicates.

Connected component: Part of a graph or network comprising a maximal subset of vertices that can be reached from each other by using the graph's edges only.

Decision tree: A highly popular concept in machine learning and data mining. A decision tree is a conceptual description of a subset $S \subseteq I$ or a partition on $I$ in a tree-like manner, the root of the tree corresponding to all entities in $I$, with each node divided according to values of a feature so that tree leaves correspond to individual classes.

Expectation and maximization EM: An algorithm of alternating maximization applied to the likelihood function for a mixture of distributions model. At each iteration, EM is performed according to the following steps: (1) Expectation: given parameters of the mixture $p_k$ and individual density functions $a_k$, find posterior probabilities for observations to belong to individual clusters, $g_{ik}$ ($i \in I$, $k = 1, ..., K$). (2) Maximization: given posterior probabilities $g_{ik}$, find parameters $p_k$, $a_k$ maximizing the likelihood function.

Fuzzy clustering: A clustering model in which entities $i \in I$ are assumed to belong to different clusters $k$ with a degree of membership $z_{ik}$ so that $z_{ik} \ge 0$ and $\sum_{k=1}^{K} z_{ik} = 1$ for all $i \in I$ and $k = 1, ..., K$. Conventional 'crisp' clusters can be considered a special case of fuzzy clusters in which $z_{ik}$ may only be 0 or 1.

Gaussian distribution: A popular probabilistic cluster model characterized by the modal point of the cluster mean vector and the feature-to-feature covariance matrix. The surfaces of equal density for a Gaussian distribution are ellipses.

Graph: A formalization of the concept of network involving two sets: a set of nodes, or vertices, and a set of edges or arcs connecting pairs of nodes.

Medoid: An entity in a cluster with the minimal summary distance to the other within-cluster elements. It is used as a formalization of the concept of prototype in the PAM clustering method.

Minimum spanning tree: A tree structure within a graph such that its summary edge weights are minimal. This concept is the backbone of the method of single link clustering.

Mixture of distributions: A probabilistic clustering model according to which each cluster can be represented by a unimodal probabilistic density function so that the population density is a probabilistic mixture of the individual cluster functions. This concept integrates a number of simpler models and algorithms for fitting them, including the K-Means clustering algorithm. The assumptions leading to K-Means can be overly restrictive, such as the compulsory z-score data standardization.

Production rule: A predicate $A \to B$ which is true for those and only those entities that satisfy $B$ whenever they satisfy $A$. A conceptual description of a cluster $S$ can be a production rule $A \to B$ in which $B$ expresses belongingness of an entity to $S$.

Regression-wise clustering: A clustering model in which a cluster is sought as a set of observations satisfying a regression function. Such a model can be fitted with an extension of K-Means in which prototypes are presented as regression functions.

Self-organizing map: A model for data visualization in the form of a grid on the plane in which entities are represented by grid nodes reflecting both their similarity and the grid topology.

Shelled core: A clustering model according to which the data are organized as a dense core surrounded by shells of a lesser density.

Single link clustering: A method in clustering in which the between-cluster distance is defined by the nearest entities. The method is related to the minimum spanning trees and connected components in the corresponding graphs.

Weakest link partitioning: A method for sequentially removing "weakest link" entities from the data set. The series found in this way is utilized for finding the shelled core.
This distance is frequently used in data analysis and referred to as Manhattan distance or city-block metric. Within the data recovery approach with the least-moduli criterion the concept of centroid would slightly change too: it is the medians, not the averages, that would populate them! Both Manhattan distance and median-based centroids bring more stability to cluster solutions, especially with respect to outliers. However, the very same stability properties can make least-moduli based data-recovery clustering less versatile with respect to the presence of mixed scale data [91].

One may use different distance and centroid concepts with computational schemes of K-Means and hierarchical clustering without any strictly defined model framework as, for instance, proposed in [62]. A popular method from that book, PAM (Partitioning around medoids), will be described in the next subsection.

There can be other clustering goals as well. In particular, the following are rather popular:

I. Cluster membership of an entity may not necessarily be confined to one cluster only but shared among several clusters (Fuzzy clustering)

II. Geometrically, cluster shapes may be not necessarily spherical as in the classic K-Means but may have shapes elongated along regression lines (Regression-wise clustering)

III. Data may come from a probabilistic distribution which is a mixture of unimodal distributions that are to be separated to represent different clusters (Expectation-Maximization method EM)

IV. A visual representation of clusters on a two-dimensional screen can be explicitly embedded in clustering (Self-Organizing Map method SOM).

Further on in this section we present extensions of K-Means clustering techniques that are oriented towards these goals.
Having this concept in mind, the method of partitioning around medoids PAM from [62] can be formulated analogously to that of Straight K-Means. Our formulation slightly differs from the formulation in [62], though it is equivalent to the original, to make its resemblance to K-Means more visible.

Partitioning around medoids PAM

1. Initial setting. Choose the number of clusters, $K$, and select $K$ entities $c_1, c_2, ..., c_K \in I$ with a special algorithm Build. Assume initial cluster lists $S_k$ are empty.

2. Clusters update. Given $K$ medoids $c_k \in I$, determine clusters $S'_k$ ($k = 1, ..., K$) with the Minimum distance rule applied to dissimilarity $d(i,j)$, $i,j \in I$.

3. Stop-condition. Check whether $S' = S$. If yes, end with clustering $S = \{S_k\}$, $c = (c_k)$. Otherwise, change $S$ for $S'$.

4. Medoids update. Given clusters $S_k$, determine their medoids $c_k$ ($k = 1, ..., K$) and go to Step 2.

The Build algorithm [62] for selecting initial seeds proceeds in a manner resembling that of the iterated Anomalous pattern. It starts with choosing an analogue to the grand mean, that is, the medoid of set $I$, and takes it as the first medoid $c_1$. Let us describe how $c_{m+1}$ is selected when a set $C_m$ of $m$ initial medoids, $C_m = \{c_1, ..., c_m\}$, has been selected already ($1 \le m < K$). For each $i \in I - C_m$, a cluster $S_i$ is defined to consist of entities $j$ that are closer to $i$ than to $C_m$. The distance from $j$ to $C_m$ is taken as $D(j, C_m) = \min_{k=1,...,m} d(j, c_k)$, and $S_i$ is defined as the set of all $j \in I - C_m$ for which $E_{ji} = D(j, C_m) - d(i,j) > 0$. The summary value $E_i = \sum_{j \in S_i} E_{ji}$ is used as a decisive characteristic of remoteness of $i$ from $C_m$. The next seed $c_{m+1}$ is defined as the most remote from $C_m$, that is, an entity $i$ for which $E_i$ is maximum over $i \in I - C_m$.

There is a certain similarity between selecting initial centroids in iK-Means and initial medoids with Build. But there are certain differences as well:

1. $K$ must be pre-specified in Build and not necessarily in iterated Anomalous clusters in iK-Means

2. The central point of the entire set $I$ is taken as an initial seed in Build and is not in iK-Means

3. Adding a seed is based on different criteria in the two methods.

Example 6.46. Partitioning around medoids for Masterpieces
Let us apply PAM to the matrix of entity-to-entity distances for Masterpieces displayed in Table 6.1, which is replicated from Table 3.9. We start with building three initial medoids. First, we determine that HF is the medoid of the entire set $I$, because its total distance to the others, 9.60, is the minimum of total distances in the bottom line of Table 6.1. Thus, HF is the first initial seed.
Table 6.1: Distances between Masterpieces data from Table 3.2.

          OT     DS     GE     TS     HF     YA     WP     AK
OT      0.00   0.51   0.88   1.15   2.20   2.25   2.30   3.01
DS      0.51   0.00   0.77   1.55   1.82   2.99   1.90   2.41
GE      0.88   0.77   0.00   1.94   1.16   1.84   1.81   2.38
TS      1.15   1.55   1.94   0.00   0.97   0.87   1.22   2.46
HF      2.20   1.82   1.16   0.97   0.00   0.75   0.83   1.87
YA      2.25   2.99   1.84   0.87   0.75   0.00   1.68   3.43
WP      2.30   1.90   1.81   1.22   0.83   1.68   0.00   0.61
AK      3.01   2.41   2.38   2.46   1.87   3.43   0.61   0.00
Total  12.30  12.95  10.78  10.16   9.60  13.81  11.35  16.17
Then we build clusters $S_i$ around all other entities $i \in I$. To build $S_{OT}$, we take the distance between OT and HF, 2.20, and see that entities DS, GE, and TS have their distances to OT smaller than that, which makes them OT's cluster with $E_{OT} = 4.06$. Similarly, $S_{DS}$ is set to consist of the same entities, but its summary remoteness $E_{DS} = 2.98$ is smaller. Cluster $S_{GE}$ consists of OT and DS with even smaller $E_{GE} = 0.67$. Cluster $S_{YA}$ is empty and those of WP and AK each contain just the other Tolstoy novel, contributing less than $E_{OT}$. This makes OT the next selected seed.

After the set of seeds has been updated by OT, we start building clusters $S_i$ again, on the remaining six entities. Of them, clusters $S_{DS}$ and $S_{YA}$ are empty and the others are singletons, of which $S_{AK}$ consisting of WP is the most remote, $E_{AK} = 1.87 - 0.61 = 1.26$. This completes the set of initial seeds: HF, OT, and AK. Note, these are novels by different authors.

With the selected seeds, the Minimum distance rule produces the author-based clusters (Step 2). The Stop-condition sends us to Step 4, because these clusters differ from the initial, empty, clusters. At Step 4, clusters' medoids are selected: they are obviously DS in the Dickens cluster, YA in the Mark Twain cluster and either AK or WP in the Tolstoy cluster. With the set of medoids changed to DS, YA and AK, we proceed to Step 2, and again apply the Minimum distance rule, which again leads us to the author-based clusters. This time the Stop-condition at Step 3 halts the process.
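Steps 2 to 4 of PAM can be replayed on the distances of Table 6.1. The sketch below is an illustration only: it skips the Build step and starts from the seeds HF, OT and AK named in the example, it uses the table entries as printed, and ties between candidate medoids are broken by table order (an assumption, since the text allows either AK or WP).

```python
NAMES = ["OT", "DS", "GE", "TS", "HF", "YA", "WP", "AK"]
D = [
    [0.00, 0.51, 0.88, 1.15, 2.20, 2.25, 2.30, 3.01],
    [0.51, 0.00, 0.77, 1.55, 1.82, 2.99, 1.90, 2.41],
    [0.88, 0.77, 0.00, 1.94, 1.16, 1.84, 1.81, 2.38],
    [1.15, 1.55, 1.94, 0.00, 0.97, 0.87, 1.22, 2.46],
    [2.20, 1.82, 1.16, 0.97, 0.00, 0.75, 0.83, 1.87],
    [2.25, 2.99, 1.84, 0.87, 0.75, 0.00, 1.68, 3.43],
    [2.30, 1.90, 1.81, 1.22, 0.83, 1.68, 0.00, 0.61],
    [3.01, 2.41, 2.38, 2.46, 1.87, 3.43, 0.61, 0.00],
]


def assign(medoids):
    """Step 2: Minimum distance rule, each entity goes to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for i in range(len(D)):
        nearest = min(medoids, key=lambda m: D[i][m])
        clusters[nearest].append(i)
    return sorted(clusters.values())


def medoid(cluster):
    """Step 4: entity with the minimal summary distance to the rest of its cluster."""
    return min(cluster, key=lambda i: sum(D[i][j] for j in cluster))


def pam(seeds, max_iter=100):
    medoids, previous = list(seeds), None
    for _ in range(max_iter):
        clusters = assign(medoids)
        if clusters == previous:  # Step 3: stop when the clusters do not change
            break
        previous = clusters
        medoids = [medoid(c) for c in clusters]
    return clusters, medoids


seeds = [NAMES.index(name) for name in ("HF", "OT", "AK")]
clusters, medoids = pam(seeds)
print([[NAMES[i] for i in c] for c in clusters], [NAMES[m] for m in medoids])
```

With these seeds the first pass already yields the three author-based groups, and the second pass with the updated medoids reproduces them, so the loop stops.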
Figure 6.1: Fuzzy sets: Membership functions for the concepts of short, normal and tall in men's height.
belongingness. However, in many cases fuzzy partitions have nothing to do with probabilities. For instance, dividing all people by their height may involve fuzzy categories "short," "average" and "tall" with fuzzy meanings such as shown in Figure 6.1. Fuzzy clustering can be of interest in applications related to natural fuzziness of the cluster borders, such as image analysis, robot planning, geography, etc.

The following fuzzy version of Straight K-Means became popular. It involves a fuzzy K-class partition and cluster centroids. The fuzzy partition is represented with an $N \times K$ membership matrix $(z_{ik})$ ($i \in I$, $k = 1, \ldots, K$) where $z_{ik}$ is the degree of membership of entity $i$ in cluster $k$, satisfying the conditions $0 \le z_{ik} \le 1$ and $\sum_{k=1}^{K} z_{ik} = 1$ for every $i \in I$. With these conditions, one may think of the total membership of item $i$ as a unity that can be differently distributed among centroids. The criterion of quality of fuzzy clustering is a modified version of the square-error criterion (3.2),
$$W_F(z, c) = \sum_{k=1}^{K} \sum_{i \in I} z_{ik}^{\alpha} \, d(y_i, c_k) \qquad (6.1)$$
where $\alpha > 1$ is a parameter affecting the shape of the membership function and $d$ is distance (2.17) (Euclidean distance squared) as usual, $y_i$ is an entity point and $c_k$ a centroid. In computations, typically, the value of $\alpha$ is put at 2. By analogy with Straight K-Means, which is an alternating optimization technique, Fuzzy K-Means can be defined as the alternating minimization technique for function (6.1). The centroids are actually weighted averages of the entity points, while memberships are related to the distances between entities and centroids. More precisely, given centroids $c_t = (c_{tv})$, the optimal membership values are determined as

$$z_{it} = \left[ \sum_{t'=1}^{K} \left( \frac{d(y_i, c_t)}{d(y_i, c_{t'})} \right)^{1/(\alpha - 1)} \right]^{-1} \qquad (6.2)$$
Figure 6.2: Regression-wise clusters: solid lines as centroids.

Given membership values, centroids are determined as convex combinations of the entity points:

$$c_t = \sum_{i \in I} \lambda_{it} \, y_i \qquad (6.3)$$

where $\lambda_{it}$ is a convex combination coefficient defined as $\lambda_{it} = z_{it}^{\alpha} / \sum_{i' \in I} z_{i't}^{\alpha}$. These formulas follow from the first-degree optimality conditions for criterion (6.1). Thus, starting from a set of initial centroids and repeatedly applying formulas (6.2) and (6.3), a computational algorithm has been proven to converge to a local optimum of criterion (6.1). Further improvements of the approach are reported in [60] and [35]. Criterion (6.1) as it stands cannot be associated with a data recovery model. An attempt to build a criterion fitting into the data recovery approach is made in [104].
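The alternating minimization of the fuzzy criterion can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the toy points are hypothetical, the distance is Euclidean distance squared, memberships are taken proportional to $d(y_i, c_k)^{-1/(\alpha-1)}$ (the first-order optimum of the criterion under the unit-sum constraint), and a point that coincides with a centroid is simply given full membership there.

```python
def sqdist(a, b):
    """Euclidean distance squared."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships(points, centroids, alpha=2.0):
    """Membership update: z_ik proportional to d(y_i, c_k) ** (-1/(alpha-1))."""
    z = []
    for y in points:
        d = [sqdist(y, c) for c in centroids]
        if min(d) == 0.0:                     # y coincides with a centroid
            hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
            z.append([h / sum(hits) for h in hits])
            continue
        w = [dk ** (-1.0 / (alpha - 1.0)) for dk in d]
        total = sum(w)
        z.append([wk / total for wk in w])
    return z

def update_centroids(points, z, alpha=2.0):
    """Centroid update: convex combinations with weights z_ik ** alpha."""
    dim = len(points[0])
    centroids = []
    for k in range(len(z[0])):
        w = [zi[k] ** alpha for zi in z]
        total = sum(w)
        centroids.append([sum(wi * y[v] for wi, y in zip(w, points)) / total
                          for v in range(dim)])
    return centroids

def fuzzy_k_means(points, centroids, alpha=2.0, tol=1e-9, max_iter=200):
    for _ in range(max_iter):
        z = memberships(points, centroids, alpha)
        new = update_centroids(points, z, alpha)
        shift = max(sqdist(a, b) for a, b in zip(centroids, new))
        centroids = new
        if shift < tol:                       # centroids stopped moving
            break
    return memberships(points, centroids, alpha), centroids

# Hypothetical toy data: two well-separated groups of three points each
points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
z, centroids = fuzzy_k_means(points, [[0.0, 0.0], [5.0, 5.0]])
```

Each row of `z` sums to one, so a row can be read as the way the unit membership of an entity is spread over the centroids.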
variables, $y_1, \ldots, y_{n-1}$, in each cluster. Such a function is defined by the equation $y_n = a_1 y_1 + a_2 y_2 + \ldots + a_{n-1} y_{n-1} + a_0$ for some coefficients $a_0, a_1, \ldots, a_{n-1}$. These coefficients form a vector, $a = (a_0, a_1, \ldots, a_{n-1})$, which can be referred to as the regression-wise centroid. When a regression-wise centroid is given, its distance to an entity point $y_i = (y_{i1}, \ldots, y_{in})$ is defined as $r(i, a) = (y_{in} - a_1 y_{i1} - a_2 y_{i2} - \ldots - a_{n-1} y_{i,n-1} - a_0)^2$, the squared difference between the observed value of $y_n$ and that calculated from the regression equation. To determine the regression-wise centroid $a(S)$, given a cluster list $S \subseteq I$, the standard technique of multivariate linear regression analysis is applied, which is nothing but minimizing the within-cluster summary residual $\sum_{i \in S} r(i, a)$ over all possible $a$. Then Straight K-Means can be applied with the only changes being that: (1) centroids must be regression-wise centroids and (2) the entity-to-centroid distance must be $r(i, a)$.
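A minimal sketch of regression-wise K-Means for the simplest case of a single predictor (n = 2) may look as follows. The function names, the toy points and the initial centroids are hypothetical; the centroid of a cluster is the least-squares line fitted to its members, and the distance is the squared residual $r(i, a)$.

```python
def fit_line(cluster):
    """Least-squares centroid (a0, a1) of y2 = a1*y1 + a0 over (y1, y2) points."""
    n = len(cluster)
    mx = sum(p[0] for p in cluster) / n
    my = sum(p[1] for p in cluster) / n
    sxx = sum((p[0] - mx) ** 2 for p in cluster)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in cluster)
    a1 = sxy / sxx if sxx > 0 else 0.0
    return (my - a1 * mx, a1)

def residual(p, a):
    """r(i, a): squared difference between observed and predicted y2."""
    return (p[1] - a[1] * p[0] - a[0]) ** 2

def regression_k_means(points, centroids, max_iter=100):
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Minimum distance rule with r(i, a) as the distance
        new_labels = [min(range(len(centroids)),
                          key=lambda k: residual(p, centroids[k]))
                      for p in points]
        if new_labels == labels:              # Stop-condition
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:                       # keep the old centroid otherwise
                centroids[k] = fit_line(members)
    return labels, centroids

# Hypothetical toy data lying on the lines y2 = 2*y1 and y2 = -y1 + 20
points = [(x, 2.0 * x) for x in range(5)] + [(x, 20.0 - x) for x in range(5)]
labels, lines = regression_k_means(points, [(1.0, 1.5), (18.0, -0.5)])
```

With more than one predictor, `fit_line` would be replaced by a multivariate least-squares fit; the alternating scheme itself stays the same.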
6.1.5 Mixture of distributions and EM algorithm
Data of financial transactions or astronomic observations can be considered as a random sample from a (potentially) infinite population. In such cases, the data structure can be analyzed with probabilistic approaches, of which arguably the most radical is the mixture of distributions approach. According to this approach, each of the yet unknown clusters $k$ is modeled by a density function $f(x, a_k)$ which represents a family of density functions over $x$ defined up to a parameter vector $a_k$. A one-dimensional density function $f(x)$, for any small $dx > 0$, assigns probability $f(x)dx$ to the interval between $x$ and $x + dx$; multidimensional density functions have a similar interpretation. Usually, the density $f(x, a_k)$ is considered unimodal (the mode corresponding to a cluster standard point), such as the normal, or Gaussian, density function defined by its mean vector $\mu_k$ and covariance matrix $\Sigma_k$:
$$f(x, a_k) = (2^p \pi^p |\Sigma_k|)^{-1/2} \exp\{-(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)/2\} \qquad (6.4)$$
The shape of Gaussian clusters is ellipsoidal because any surface at which $f(x, a_k)$ (6.4) is constant satisfies the equation $(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) = const$, defining an ellipsoid. The mean vector $\mu_k$ specifies the $k$-th cluster's location.
The mixture of distributions clustering model can be set as follows. The row points $y_1, \ldots, y_N$ are considered a random sample of $|V|$-dimensional observations from a population with density function $f(x)$ which is a mixture of individual cluster density functions $f(x, a_k)$ ($k = 1, \ldots, K$) so that $f(x) = \sum_{k=1}^{K} p_k f(x, a_k)$, where $p_k \ge 0$ are the mixture probabilities, $\sum_k p_k = 1$. For $f(x, a_k)$ being the normal density, $a_k = (\mu_k, \Sigma_k)$ where $\mu_k$ is the mean and $\Sigma_k$ the covariance matrix.
To estimate the individual cluster parameters, the main approach of mathematical statistics, the maximum likelihood, is applied. The approach is based on the postulate that really occurred events are those that are most likely. In its simplest version, the approach requires finding the parameters $p_k$, $a_k$, $k = 1, \ldots, K$, by maximizing the logarithm of the likelihood of the observed data under the assumption that the data come from a mixture of distributions:

$$L = \log \left\{ \prod_{i=1}^{N} \sum_{k=1}^{K} p_k f(y_i, a_k) \right\}.$$

To computationally handle the maximization problem, this criterion can be reformulated as

$$L = \sum_{i=1}^{N} \sum_{k=1}^{K} g_{ik} \log p_k + \sum_{i=1}^{N} \sum_{k=1}^{K} g_{ik} \log f(y_i, a_k) - \sum_{i=1}^{N} \sum_{k=1}^{K} g_{ik} \log g_{ik} \qquad (6.5)$$
where $g_{ik}$ is the posterior density of class $k$, defined as

$$g_{ik} = \frac{p_k f(y_i, a_k)}{\sum_{k'=1}^{K} p_{k'} f(y_i, a_{k'})}.$$
In this way, criterion $L$ can be considered a function of two groups of variables: (1) $p_k$ and $a_k$, and (2) $g_{ik}$, so that the method of alternating optimization can be applied. The alternating optimization algorithm for this criterion is referred to as the EM-algorithm since computations are performed as a sequence of the so-called expectation (E) and maximization (M) steps.
EM-algorithm
Start: With any initial values of the parameters,
E-step: Given $p_k$, $a_k$, estimate $g_{ik}$.
M-step: Given $g_{ik}$, find $p_k$, $a_k$ maximizing the log-likelihood function (6.5).
Halt: When the current parameter values approximately coincide with the previous ones.
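If $f$ is the Gaussian density function, then the optimal values of parameters, in M-step, can be found with the following formulas:

$$\mu_k = \sum_{i=1}^{N} g_{ik} y_i / g_k, \qquad \Sigma_k = \sum_{i=1}^{N} g_{ik} (y_i - \mu_k)(y_i - \mu_k)^T / g_k,$$

where $g_k = \sum_{i=1}^{N} g_{ik}$, and the optimal mixture probabilities are $p_k = g_k / N$.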
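The E- and M-steps can be illustrated on one-dimensional data, where the covariance matrix reduces to a variance. The sketch below makes several assumptions that are not part of the text: the function names, the toy data, the starting values, and a small floor on the variance that guards against a component collapsing onto a single point.

```python
import math

def normal_pdf(x, mu, var):
    """One-dimensional Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(ys, p, mu, var, tol=1e-8, max_iter=500):
    p, mu, var = list(p), list(mu), list(var)
    K, N = len(p), len(ys)
    g = []
    for _ in range(max_iter):
        # E-step: posteriors g_ik given the current p_k, mu_k, var_k
        g = []
        for y in ys:
            w = [p[k] * normal_pdf(y, mu[k], var[k]) for k in range(K)]
            s = sum(w)
            g.append([wk / s for wk in w])
        # M-step: weighted proportions, means and variances
        old = p + mu + var
        for k in range(K):
            gk = sum(gi[k] for gi in g)
            p[k] = gk / N
            mu[k] = sum(gi[k] * y for gi, y in zip(g, ys)) / gk
            spread = sum(gi[k] * (y - mu[k]) ** 2 for gi, y in zip(g, ys)) / gk
            var[k] = max(spread, 1e-6)        # assumed safeguard, not in the text
        # Halt when the parameters approximately coincide with the previous ones
        if max(abs(a - b) for a, b in zip(old, p + mu + var)) < tol:
            break
    return p, mu, var, g

# Hypothetical toy sample: two groups centred at 1 and 5
ys = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
p, mu, var, g = em_gaussian_mixture(ys, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])
labels = [max(range(2), key=lambda k: gi[k]) for gi in g]
```

The last line assigns each observation to the class with the maximum posterior, which gives a crisp partition when one is needed.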
If the user needs an assignment of the observations to the classes, the posterior probabilities $g_{ik}$ can be utilized: $i$ is assigned to that $k$ for which $g_{ik}$ is the maximum. Also, ratios $g_{ik}/g_k$ can be considered as fuzzy membership values.

The situation in which all covariance matrices $\Sigma_k$ are diagonal and have the same variance value $\sigma^2$ on the diagonal corresponds to the assumption that all clusters have uniformly spherical distributions. This situation is of particular interest because the maximum likelihood criterion is here equivalent to the $W(S, c)$ criterion of K-Means and, moreover, there is a certain homology between the EM and Straight K-Means algorithms. Indeed, under the assumption that feature vectors corresponding to entities $x_1, \ldots, x_N$ are randomly and independently sampled from the population, with unknown assignment of the entities to clusters $S_k$, the likelihood function in this case has the following formula:
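$$L(\{\mu_k, \sigma^2, S_k\}) = A \prod_{k=1}^{K} \prod_{i \in S_k} \sigma^{-M} \exp\{-(x_i - \mu_k)^T \sigma^{-2} (x_i - \mu_k)/2\} \qquad (6.6)$$

so that to maximize its logarithm, the following function is to be minimized:

$$l(\{\mu_k, \sigma^2, S_k\}) = \sum_{k=1}^{K} \sum_{i \in S_k} (x_i - \mu_k)^T (x_i - \mu_k)/\sigma^2 \qquad (6.7)$$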
This function is but a theoretic counterpart to the K-Means criterion $W(S, \mu) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, \mu_k)$ applied to vectors $y_i$ obtained from $x_i$ with z-scoring standardization (shifting scales to grand means and normalizing them by standard deviations). Thus the mixture model can be considered a probabilistic model behind the conventional K-Means method. Moreover, it can handle overlapping clusters of not necessarily spherical shapes (see Figure 6.3). Note however that the K-Means data recovery model assumes no restricting hypotheses on the mechanism of data generation. We also have seen how restrictive is the requirement of data standardization by z-scoring, associated with the model. Moreover, there are numerous computational issues related to the need to estimate much larger numbers of parameters in the mixture model.

Figure 6.3: A system of ellipsoids corresponding to a mixture of three normal distributions with different means and covariance matrices.

One of the latest and most successful attempts in application of this approach is described in [141]. The authors note that there is a tradeoff between the complexity of the probabilistic model and the number of clusters: a more complex model may fit to a smaller number of clusters. To select the better model one can choose the one which gives the higher value of the likelihood criterion, which can be approximately evaluated by the so-called Bayesian Information Criterion (BIC) equal, in this case, to

$$BIC = 2 \log p(X \mid p_k, a_k) - \nu_k \log N \qquad (6.8)$$

where $X$ is the observed data matrix, $\nu_k$ the number of parameters to be fitted, and $N$ the number of observations, that is, the rows in $X$. The BIC analysis has been demonstrated to be useful in assessing the number of clusters. To guarantee normality, the authors applied three popular transformations of the data: logarithm, square root, and z-score standardization of rows (genes), not columns. Typically, with logarithms the data better accord with the Gaussian distribution. The authors note that using complex models is not always feasible: the number of parameters estimated per cluster at the space dimension 24 becomes 24 + 24·25/2 = 324 (24 mean components plus the distinct entries of a symmetric 24 × 24 covariance matrix), which can be greater than the cluster's cardinality. Overall the results are inconclusive [141]. Further advances in mixture of distributions clustering are described in [5, 142].
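The per-cluster parameter count and criterion (6.8) are easy to check numerically. The following sketch uses hypothetical function names; it counts the mean components plus the distinct entries of a symmetric covariance matrix and evaluates BIC from a given log-likelihood value.

```python
import math

def gaussian_params_per_cluster(dim):
    """Mean entries plus the free entries of a symmetric covariance matrix."""
    return dim + dim * (dim + 1) // 2

def bic(log_likelihood, n_params, n_obs):
    """BIC as in (6.8): twice the log-likelihood minus a complexity penalty."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)

print(gaussian_params_per_cluster(24))   # parameters per cluster at dimension 24
```

Between two fitted models, the one with the higher BIC value would be preferred under this criterion.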
distance from $e_t$ on the grid is smaller than a pre-selected threshold value. Historically, all SOM algorithms worked in an incremental manner as neural networks, but later on, after some theoretical investigation, straight versions appeared, such as the following.
Straight SOM
1. Initial setting. Select $r$ and $c$ for the grid and initialize model vectors $m_t$ ($t = 1, \ldots, rc$) in the entity space.
2. Neighborhood update. For each grid node $e_t$, define its grid neighborhood $E_t$ and collect the list $I_u$ of entities most resembling the model $m_u$ for each $e_u \in E_t$.
3. Seeds update. For each node $e_t$, define new $m_t$ as the average of all entities $y_i$ with $i \in I_u$ for some $e_u \in E_t$.
4. Stop-condition. Halt if the new $m_t$ are close to the previous ones (or after a pre-specified number of iterations). Otherwise go to 2.

As one can see, the process much resembles that of Straight K-Means, with the model vectors similar to centroids, except for two items:
1. The number of model vectors is large and has nothing to do with the number of clusters, which are determined visually in the end as grid clusters.
2. The averaging goes along the grid, not entity space, neighborhood.

These features provide for less restricted visual mapping of the data structures to the grid. On the other hand, the interpretation of results here remains more of an intuition rather than instruction. A framework relating SOM and EM approaches is proposed in [9].
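The four steps of Straight SOM can be sketched as below. The choices that the text leaves open are assumptions here: the grid neighborhood $E_t$ is taken as all nodes within a city-block distance `radius` of $e_t$, resemblance is measured by Euclidean distance squared, and a node whose neighborhood attracts no entities keeps its model vector. All names and the toy data are hypothetical.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def straight_som(points, models, rows, cols, radius=1, tol=1e-12, max_iter=100):
    """models[t] is the model vector of grid node t = r * cols + c."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Grid neighborhood E_t (assumed: city-block distance within `radius`)
    neigh = [[u for u, (ru, cu) in enumerate(nodes)
              if abs(ru - r) + abs(cu - c) <= radius]
             for (r, c) in nodes]
    models = [list(m) for m in models]
    lists = [[] for _ in nodes]
    for _ in range(max_iter):
        # Neighborhood update: I_u holds the entities nearest to model m_u
        lists = [[] for _ in nodes]
        for y in points:
            u = min(range(len(models)), key=lambda t: sqdist(y, models[t]))
            lists[u].append(y)
        # Seeds update: average over the entities of all lists in E_t
        new_models = []
        for t in range(len(nodes)):
            pool = [y for u in neigh[t] for y in lists[u]]
            if pool:
                new_models.append([sum(col) / len(pool) for col in zip(*pool)])
            else:
                new_models.append(models[t])
        shift = max(sqdist(a, b) for a, b in zip(models, new_models))
        models = new_models
        if shift < tol:                       # Stop-condition
            break
    return models, lists

# With radius 0 every neighborhood is the node itself, which gives Straight K-Means
points = [[0.0], [1.0], [10.0], [11.0]]
models, lists = straight_som(points, [[0.0], [10.0]], rows=1, cols=2, radius=0)
```

Setting `radius=0` is a convenient sanity check, because the averaging then runs over entity-space neighbors only and the model vectors become ordinary centroids.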
6.2 Graph-theoretic approaches
In this section we present some approaches that are relevant to networks and other structural data (see, for instance, [3]). As the bottom line, they rely on graph-theoretic properties of data.
6.2.1 Single linkage, minimum spanning tree and connected components
dissimilarity matrix as a separate data file. This may save quite a lot of memory, especially when the number of entities is large. For instance, if the size of the original data table is 1000 × 10, it takes only 10,000 numbers to store, whereas the entity-to-entity distances may take up to half a million numbers. There is always a trade-off between memory and computation in distance based approaches that may require some effort to balance.

The single linkage approach is based on the principle that the dissimilarity between two clusters is defined as the minimum dissimilarity (or, maximum similarity) between entities of one and the other cluster [41]. This can be built into the agglomerative distance based algorithm described in section 4.4.1 by leaving it without any change except for the formula (4.11) that calculates distances between the merged cluster and the others and must be substituted by the following:
$$d_{w1 \cup w2, w} = \min(d_{w1, w}, d_{w2, w}) \qquad (6.9)$$
Formula (6.9) follows the principle of minimum distance, which explains the method's name. In general, agglomerative processes are rather computationally intensive because the minimum of inter-cluster distances must be found at each merging step. However, for single linkage clustering, there exists a much more effective implementation involving the concept of the minimum spanning tree (MST) of a weighted graph.
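Before turning to the MST, the role of formula (6.9) can be seen in a naive agglomerative loop, sketched below. The function name and the three-entity distance matrix are hypothetical; the loop repeatedly merges the two closest clusters and recomputes the distances from the merged cluster by taking minima.

```python
def single_linkage(names, dist):
    """Agglomerative clustering in which merged-cluster distances follow (6.9)."""
    clusters = {i: (name,) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Find the closest pair of current clusters
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        merged = clusters.pop(a) + clusters.pop(b)
        # Formula (6.9): d(w1 u w2, w) = min(d(w1, w), d(w2, w))
        for w in clusters:
            d[(w, next_id)] = min(d[tuple(sorted((a, w)))],
                                  d[tuple(sorted((b, w)))])
        # Drop all distances that involve the two merged clusters
        d = {pair: v for pair, v in d.items()
             if a not in pair and b not in pair}
        clusters[next_id] = merged
        merges.append((merged, height))
        next_id += 1
    return merges

# Hypothetical three-entity example
merges = single_linkage(["A", "B", "C"], [[0, 1, 4], [1, 0, 2], [4, 2, 0]])
```

Here C joins the cluster {A, B} at height 2, the smaller of its distances to A and to B, rather than at 4.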
Figure 6.4: Weighted graph (a), minimum spanning tree (b) and single-linkage hierarchy for Primates data (c).
as the sum of all weights $d_{ij}$ over edges $\{i, j\}$ belonging to $T$. A minimum spanning tree (MST) $T$ must have minimum length. The concept of MST is of prime importance in many applications in the Computer Sciences. Let us take a look at how it works in clustering. First of all, let us consider the so-called Prim algorithm for finding an MST. The algorithm processes nodes (entities), one at a time, starting with $T = \emptyset$ and updating $T$ at each step by adding to $T$ an element $i$ (and edge $\{i, j\}$) minimizing $d_{ij}$ over all $i \in I - T$ and $j \in T$. An exact formulation is this.
Prim algorithm
1. Initialization. Start with set $T \subset I$ consisting of an arbitrary $i \in I$ with no edges.
2. Tree update. Find $j \in I - T$ minimizing $d(i, j)$ over all $i \in T$ and $j \in I - T$. Add $j$ and $(i, j)$ with the minimal $d(i, j)$ to $T$.
3. Stop-condition. If $I - T = \emptyset$, halt and output tree $T$. Otherwise go to 2.

To build a computationally effective procedure for the algorithm may be a cumbersome issue, depending on how $d(i, j)$ and their minima are handled, to which a lot of work has been devoted. A simple pre-processing step can be quite useful: in the beginning, find a nearest neighbor for each of the entities; only they may go to MST.
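The three steps translate almost literally into code. The sketch below is a straightforward, not computationally optimized, version applied to the Masterpieces distances of Table 6.1; the function name and the choice of OT as the starting entity are assumptions made for illustration.

```python
NAMES = ["OT", "DS", "GE", "TS", "HF", "YA", "WP", "AK"]
D = [  # Table 6.1 without the Total column
    [0.00, 0.51, 0.88, 1.15, 2.20, 2.25, 2.30, 3.01],
    [0.51, 0.00, 0.77, 1.55, 1.82, 2.99, 1.90, 2.41],
    [0.88, 0.77, 0.00, 1.94, 1.16, 1.84, 1.81, 2.38],
    [1.15, 1.55, 1.94, 0.00, 0.97, 0.87, 1.22, 2.46],
    [2.20, 1.82, 1.16, 0.97, 0.00, 0.75, 0.83, 1.87],
    [2.25, 2.99, 1.84, 0.87, 0.75, 0.00, 1.68, 3.43],
    [2.30, 1.90, 1.81, 1.22, 0.83, 1.68, 0.00, 0.61],
    [3.01, 2.41, 2.38, 2.46, 1.87, 3.43, 0.61, 0.00],
]

def prim_mst(names, dist, start=0):
    """Return MST edges as (node in tree, added node, weight) triples."""
    in_tree = [start]                         # Initialization
    edges = []
    while len(in_tree) < len(names):          # Stop-condition: I - T is empty
        # Tree update: closest pair with i in T and j outside T
        i, j = min(((i, j) for i in in_tree
                    for j in range(len(names)) if j not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        in_tree.append(j)
        edges.append((names[i], names[j], dist[i][j]))
    return edges

mst = prim_mst(NAMES, D)
for a, b, w in mst:
    print(a, b, w)
```

Scanning all pairs at every step is the simplest possible way of handling the minima; it is adequate for a table of this size.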
Example 6.47. Building MST for Primate data
Let us apply the Prim algorithm to the Primates distance matrix in Table 1.2, p. 5. Let us start, for instance, with $T = \{Human\}$. Among the remaining entities, Chimpanzee is the closest to Human (distance 1.45), which adds Chimpanzee and the corresponding edge to $T$ as shown in Figure 6.4 (b). Among the three other entities, Gorilla is the closest to one of the elements of $T$ (Human, distance 1.57). This adds Gorilla to $T$ and the corresponding edge to the MST in Figure 6.4 (b). Then comes Orangutan as the closest to Chimpanzee in $T$. The only remaining entity, Monkey, is nearest to Orangutan, as shown on the drawing. □
Figure 6.5: Minimum spanning tree for Masterpieces (a) and a result of cutting it (b). [Edges of the tree in (a), with weights multiplied by 100: OT-DS 51, DS-GE 77, OT-TS 115, TS-YA 87, YA-HF 75, HF-WP 83, WP-AK 61. In (b) the edges OT-TS and TS-YA are removed.]

An MST $T$ allows finding the single linkage hierarchy by a divisive method: sequentially cutting edges in the MST beginning from the largest and continuing in descending order. The result of each cut creates two connected components that are children of the cluster in which the cut occurred (see, for instance, Figure 6.4 (c)). A peculiarity of the single linkage method is that it involves just $N - 1$ dissimilarity entries occurring in an MST rather than all $N(N - 1)/2$ of them. This results in a threefold effect: (1) a nice mathematical theory, (2) fast computations, and (3) poor application capability.
Example 6.48. MST and single linkage clusters in the Masterpieces data
To illustrate point (3) above, let us consider an MST built on the distances between Masterpieces in Table 3.9 (see Figure 6.5 (a)). Cutting the tree at the longest two edges, we obtain clusters presented in part (b) of the Figure. Obviously these clusters do not reflect the data structure properly, in spite of the fact that the structure of (a) corresponds to authors. □
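The cut described in this example can be reproduced with a few lines of code. The sketch below takes the MST edges as they appear in Figure 6.5 (a), with weights written as decimals, removes the longest edges and collects the connected components; the function name is hypothetical.

```python
NODES = ["OT", "DS", "GE", "TS", "HF", "YA", "WP", "AK"]
MST = [("OT", "DS", 0.51), ("DS", "GE", 0.77), ("OT", "TS", 1.15),
       ("TS", "YA", 0.87), ("YA", "HF", 0.75), ("HF", "WP", 0.83),
       ("WP", "AK", 0.61)]

def cut_mst(nodes, edges, n_clusters):
    """Remove the n_clusters - 1 longest edges; return connected components."""
    kept = sorted(edges, key=lambda e: e[2])[:len(edges) - (n_clusters - 1)]
    adjacent = {v: [] for v in nodes}
    for a, b, _ in kept:
        adjacent[a].append(b)
        adjacent[b].append(a)
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        stack, component = [v], []
        seen.add(v)
        while stack:                          # depth-first traversal
            u = stack.pop()
            component.append(u)
            for w in adjacent[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components

print(cut_mst(NODES, MST, 3))
```

The output isolates TS as a singleton and lumps the two other Mark Twain novels together with the Tolstoy novels, which is the mismatch with the author-based structure that the example points out.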
belonging to the relevant part of $T$, thus $G_u$, because all the weights along $T$ are smaller than $u$ by definition. On the other hand, for any $k \in S_2$, the dissimilarity $d_{ik} \ge u$ so that edge $(i, k)$ is absent from $G_u$, thus making $S_1$ a connected component. The fact that $S_2$ is a connected component of $G_u$ can be proven similarly. When the maximum weight $u$ on the tree $T$ is not unique, $T$ must be split along all maximum weight edges to make the split parts connected components. Then one could prove that the connected components of threshold graphs $G_u$ (for $u$ being the weight of an edge from a minimum spanning tree) are single linkage clusters.

6.2.2 Finding a core

Here we consider the problem of finding a dense core, rather than a deviate pattern, in a given set of interrelated objects. This problem attracted attention not only in data mining but in other disciplines such as Operations Research (knapsack and location problems).

We follow the way to formalize the problem in terms of a, possibly edge-weighted, graph proposed in [102] and further extended in [98]. Such is the graph of feature-to-feature similarities in Figure 6.6. For a subset of vertices $H \subseteq I$ and a vertex $i \in H$, let us define linkage $\pi(i, H)$ as the sum of the weights of all edges connecting $i$ with $j \in H$. Obviously, the thus defined linkage function $\pi(i, H)$ is monotone over $H$: adding more vertices to $H$ may only increase the linkage, that is, $\pi(i, H) \le \pi(i, H \cup G)$ for any vertex subset $G \subseteq I$. All contents of this section are applicable to any monotone linkage function.

A monotone linkage function such as $\pi(i, H)$ can be used to estimate the overall density of a subset $H \subseteq I$ by "integrating" its values $\pi(i, H)$ over $i \in H$. In particular, an integral function defined by the weakest link in $H$,

$$F(H) = \min_{i \in H} \pi(i, H)$$

will be referred to as the tightness function.
Example 6.49. Linkage function on Masterpieces data features
For the pre-processed Masterpieces data in Table 2.17 or 3.2, let us consider the matrix of squared feature-to-feature inner products in Table 6.2. This matrix can be used for analysis of interrelations between features of Masterpieces. In particular, let us draw a weighted similarity graph whose vertices are features and whose edges correspond to those similarities which are, in the rounded form, 14 or greater, see Figure 6.6. In this graph, at set $H = \{LS, LD, NC, SC\}$, $\pi(LS, H) = 31$, $\pi(LD, H) = 22 + 14 = 36$, $\pi(NC, H) = 31 + 22 = 53$ and $\pi(SC, H) = 14$. The value of the tightness function at this set $H$ is the minimum of these four, $F(H) = 14$, thus making it the core.
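The linkage and tightness values quoted in this example can be verified with a small sketch. Only the three edges among LS, LD, NC and SC that the example itself mentions are used; the function names are hypothetical.

```python
# Edges within H = {LS, LD, NC, SC}, as quoted in the example
EDGES = {("LS", "NC"): 31, ("LD", "NC"): 22, ("LD", "SC"): 14}

def linkage(i, H, edges):
    """Summary linkage: total weight of the edges joining i to members of H."""
    return sum(w for (a, b), w in edges.items()
               if (a == i and b in H) or (b == i and a in H))

def tightness(H, edges):
    """Tightness F(H): the weakest linkage over the members of H."""
    return min(linkage(i, H, edges) for i in H)

H = {"LS", "LD", "NC", "SC"}
for i in sorted(H):
    print(i, linkage(i, H, EDGES))
print(tightness(H, EDGES))
```

Because the weights are non-negative, dropping a vertex from H can only lower the linkage of the remaining ones, which is the monotonicity property used throughout this section.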
Figure 6.6: Weighted graph generating the summary linkage function (a); connected components after cutting two weakest edges in its Maximum spanning tree (b); and the core of the tightness function (c).
□

A property of the tightness function (easily following from the monotonicity of $\pi$) is that it satisfies the so-called quasi-convexity condition: for any $H, G \subseteq I$,

$$F(H \cup G) \ge \min(F(H), F(G)) \qquad (6.11)$$
added, even if some of its elements are removed, that is, $F(S) > F(H)$ for any $H \subseteq I$ such that $H \cap (I - S) \ne \emptyset$. Thus defined, F-patterns must be chain-nested.

Statement 6.18. The set of all F-patterns, $P$, is nonempty and chain-nested, that is, $S_1 \subseteq S_2$ or $S_2 \subseteq S_1$ for any $S_1, S_2 \in P$.

Proof: If $S_1$, $S_2$ are F-patterns and $S_1$ is not part of $S_2$, then $F(S_2) > F(S_1)$. If, moreover, $S_2$ is not part of $S_1$, then $F(S_1) > F(S_2)$. The contradiction proves that $P$ must be chain-nested. Besides, $S = I$ makes the definition of F-pattern true because of the false premise, which proves that $I$ is always an F-pattern and completes the proof, q.e.d.

In general, denser subparts may occur within the smallest F-pattern $S$: nothing prevents one or more $H \subset S$ with $F(H) > F(S)$ from existing. When this is not the case, that is, when the smallest F-pattern is a global maximizer of the function $F$, the set of F-patterns will be referred to as a layered cluster, which is uniquely defined and thus can be considered as an aggregate representation of the F-density structure in $I$.
Statement 6.19. If $F(S)$ is a tightness function, then its minimum pattern is the largest global maximizer of $F(S)$ in the set of all $S \subseteq I$.

Proof: Let $S$ be the minimum F-pattern in the chain-nested set of F-patterns. If $S$ is not a global maximizer of $F$, then $F(H) > F(S)$ for some $H \subseteq I$. In fact, all such $H$ must fall within $S$, because of the definition of F-patterns. Let us take a maximal subset $H \subset S$ in the set of all $H$ such that $F(H) > F(S)$ and prove that $H$ is a pattern as well. Indeed, let us take any $S' \subseteq S$ such that $S' \cap (I - H) \ne \emptyset$; the existence of such $S'$ follows from the fact that $H$ does not coincide with $S$. Then $F(H) > F(H \cup S')$ because of the assumed maximality of $H$ within $S$. But $F(H \cup S') \ge \min(F(S'), F(H))$ according to (6.11), that is, $F(H) > F(S')$. Let us consider now an $S'$ which is not contained in $S$ and still satisfies the condition $S' \cap (I - H) \ne \emptyset$. (This may only happen when $S$ is not equal to $I$.) By the definition of $S'$, $F(S) > F(S')$ because $S$ is a pattern. Therefore, $F(H) > F(S')$. This implies that $H$ is a pattern, which contradicts the assumption of minimality of $S$. Thus, no $H \subset S$ exists with $F(H) > F(S)$ and $S$ is the maximum global maximizer of $F(H)$, q.e.d.

Let us denote by $m(H)$ the "weakest link", that is, the set of elements $i \in H$ at which the value of $F(H)$ is reached:

$$m(H) = \{ i \in H : \pi(i, H) = F(H) \}.$$
Weakest Link Partitioning.
Input: Monotone linkage function $\pi(i, H)$ defined for all pairs $(i, H)$ such that $i \in H \subseteq I$.
Output: Weakest link partition $M = (M_0, M_1, \ldots, M_n)$ of $I$ along with class values $F(M) = \{F_0, F_1, \ldots, F_n\}$ defined as follows.
Step 0. Initial setting. Put $t = 0$ and define $I_0 = I$.
Step 1. Partitioning. Find class $M_t = m(I_t)$ and define $I_{t+1} = I_t - M_t$. Define class value $F_t = F(I_t) = \pi(i, I_t)$ for $i \in M_t$.
Step 2. Stop-condition. If $I_{t+1} = \emptyset$, halt. Otherwise, add 1 to $t$ and go to Step 1.

The layered cluster of F-patterns can be easily extracted from the weakest link partition $M$ thus produced. From the sequence $F$, pick up the smallest index $t^*$ among those maximizing $F_t$, $t = 0, 1, \ldots, n$. Then apply the same selection rule to the starting part of the sequence $F$, $F^{t^*} = (F_0, \ldots, F_{t^*-1})$, obtained by removing $F_{t^*}$ and all the subsequent elements. Reiterating this pick-and-removal process until all elements of $F$ are removed, we obtain set $T$ of all the picked up indices. Sets $I_t$, $t \in T$, form the layered cluster of $F$, which is proven in [98].
Let us apply the Weakest link partitioning algorithm to the graph in Figure 6.6 with linkage function $\pi(i, S) = \sum_{j \in S} a_{ij}$, with $a_{ij}$ being the weight. Obviously, $m(I) = \{Ob\}$ because $\pi(Ob, I) = 14$ is the minimum of $\pi(i, I)$ over all $i \in I$. With the weakest link Ob removed from $I$, the minimum of $\pi(i, I - \{Ob\})$ is reached at LS with $\pi(LS, I - \{Ob\}) = 31$, which is the next weakest link to be removed. The next entity to be removed is LD, with $\pi(LD, I - \{Ob, LS\}) = 36$. In the remaining set $I_3 = \{NC, Di, SC, Pe\}$, the weakest link is Pe with $\pi(Pe, \{NC, Di, SC, Pe\}) = 41$. This yields $I_4 = \{NC, Di, SC\}$ with the minimum link, 19, reached at $m(I_4) = \{SC\}$. Two remaining entities, NC and Di, are linked by 33.

The results can be represented as a labeled sequence:

$$(Ob)^{14} \; (LS)^{31} \; (LD)^{36} \; (Pe)^{41} \; (SC)^{19} \; (Di, NC)^{33}$$

where the parentheses contain sets $M_t = m(I_t)$ removed at each step of the algorithm, their order reflecting the order of removals. The labels correspond to the values of the function $F(I_t)$ for $t = 0, 1, 2, 3, 4, 5$. The maxima are 41, 36, 31, 14, the corresponding patterns being $H_3 = \{NC, Di, SC, Pe\}$ (part (c) of Figure 6.6), $H_2 = H_3 \cup \{LD\}$, $H_1 = H_2 \cup \{LS\}$, and $H_0 = I$, to form the layered cluster. It should be noted that this drastically differs from any partition along the maximum spanning tree presented in part (b) of Figure 6.6. □
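The removal sequence and the layered cluster of this example can be retraced in code. The edge list below is a reconstruction and should be read as an assumption: the weights 31, 22, 14 and 33 are quoted in the text, the edges SC-Di 19, SC-Pe 26 and NC-Pe 15 are inferred from the linkage values reported at each step, and attaching Ob to SC is a guess that has no effect on the outcome since Ob is removed first. Function names are hypothetical.

```python
EDGES = {("LS", "NC"): 31, ("LD", "NC"): 22, ("LD", "SC"): 14,
         ("NC", "Di"): 33, ("SC", "Di"): 19, ("SC", "Pe"): 26,
         ("NC", "Pe"): 15,
         ("Ob", "SC"): 14}   # assumed neighbour of Ob

def linkage(i, H, edges):
    return sum(w for (a, b), w in edges.items()
               if (a == i and b in H) or (b == i and a in H))

def weakest_link_partition(I, edges):
    """Steps 0-2: repeatedly remove the weakest-link class m(I_t)."""
    current, classes, values = set(I), [], []
    while current:
        link = {i: linkage(i, current, edges) for i in current}
        f = min(link.values())                        # F_t = F(I_t)
        m = sorted(i for i in current if link[i] == f)
        classes.append(m)
        values.append(f)
        current -= set(m)
    return classes, values

def layered_cluster(I, classes, values):
    """Pick the smallest index maximizing F_t, truncate, and repeat."""
    sets, remaining = [], set(I)
    for m in classes:                                 # sets[t] is I_t
        sets.append(sorted(remaining))
        remaining -= set(m)
    picked, end = [], len(values)
    while end > 0:
        t = max(range(end), key=lambda s: (values[s], -s))
        picked.append(t)
        end = t
    return [sets[t] for t in picked]

I = ["LS", "LD", "NC", "SC", "Di", "Ob", "Pe"]
classes, values = weakest_link_partition(I, EDGES)
patterns = layered_cluster(I, classes, values)
print(list(zip(classes, values)))
print(patterns[0])
```

The first pattern returned is the densest one, the core; each subsequent pattern contains the previous one, in line with the chain-nested property.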
these, we considered (1) and (2) in the previous chapters. Here we concentrate on the conceptual description of clusters.
6.3.1 False positives and negatives
A conceptual description of a cluster is a logic predicate over features defined on entities, that is, in general, true on entities belonging to the cluster and false on entities out of it. For instance, the cluster of Dickens masterpieces, according to Masterpieces data in Table 1.10, can be described conceptually with predicate "SCon=0" or predicate "19 ≤ LSent ≤ 30 & 2 ≤ NChar ≤ 3." These descriptions can be tested for any entity from the table. Obviously, the former predicate distinctively describes the cluster with no errors at all, while the latter admits one false positive error: the entity HuckFinn satisfies the description but belongs to a different cluster. False positive errors are entities that do not belong to the cluster but satisfy its description; in contrast, an entity from the cluster that does not satisfy its description is referred to as a false negative. Thus, the problem of the conceptual description of clusters can be formalized as that of finding as brief and clear descriptions of them as possible while keeping the false positive and false negative errors as low as possible.

Traditionally, the issue of the interpretation of clusters is considered as art rather than science. The problem of finding a cluster description is supposed to have little interest on its own and be of interest only as an intermediate tool in cluster prediction: a cluster description sets a decision rule which is then applied to predict, given an observation, which class it belongs to. This equally applies to both supervised and unsupervised learning, that is, when clusters are pre-specified or found from data. In data mining, the prediction problem is frequently referred to as that of classification so that a decision rule, which is not necessarily based on a conceptual description, is referred to as a classifier (see, for instance, [23]). In clustering, the problem of cluster description is part of the interpretation problem.
Some authors even suggest that conceptual descriptions be a form of representing clusters [59].
Figure 6.7: Classification tree in the feature space (a); the same, as a decision tree (b); the tree after pruning (c).
clusters whose definitions involve combinations of features. Within traditional approaches, handling both continuous-valued and categorical features can be an issue because of incomparability between statistical indexes used for different feature types. The data recovery framework seems a good vehicle for overcoming this hurdle, since different types of features get unified contribution-based indexes of association, such as in sections 5.2 and 3.4.4.

One more aspect of decision trees that can be addressed with the data recovery approach is that of developing splitting criteria explicitly oriented towards description of a single cluster or a category rather than a partition. The conventional decision tree techniques do not pay much attention to the cases in which the user is interested in getting a description for a cluster $S$ only, though in many applications $I - S$ can be highly non-homogeneous so that its conceptual description may have no meaning at all. A scoring function of a split of tree cluster $S_w$ in two parts $S_1$ and $S_2$, with respect to their relevance to $S$ can have the following formula
6.3.4 Comprehensive conjunctive description of a cluster
Figure 6.9: Rectangular description defining the numbers of false positives (blank circles within the box) and false negatives (black dots outside of the rectangle).
been considered in [92] and [107]. In [107] the difficulty of the problem was partly avoided by applying the techniques to cohesive clusters only, and in [92] transformation of the space by arithmetically combining features was used as a device to reduce the spread of $S$ over $I$.

An algorithm outlined in [92] has the following input: set $I$ of entities described by continuously-valued features $v \in V$ and a group $S \subseteq I$ to be described. The algorithm operates with interval feature predicates $P_f(a, b)$ defined for each feature $f$ and real interval $(a, b)$ of values $a \le x \le b$: $P_f(a, b)(x) = 1$ if the value of feature $f$ at $x$ falls between $a$ and $b$, and $P_f(a, b)(x) = 0$ otherwise. The output is a conjunctive description of $S$ with the interval feature predicates along with its false positive and false negative errors. The algorithm involves two building steps:

1. Finding a conjunctive description in a given feature space V. To do this, all features are initially normalized by their ranges or other coefficients. Then all features are ordered according to their contribution weights, which are proportional to the squared differences between their within-group $S \subseteq I$ averages and grand means, as described in section 3.4.2. A conjunctive description of $S$ is then found by consecutively adding feature interval predicates $f_v(a, b)$ according to the sorted order, with $(a, b)$ being the range of feature $v$ within group $S$ ($v \in V$). An interval predicate $f_v(a, b)$ is added to a current description $A$ only if it decreases the error. (In this way, correlated features are eliminated without much fuss about it.) This forward feature selection process stops after the last element of $V$ is checked. Then a backward feature search strategy is applied to decrease the number of conjunctive items in the description, if needed.

2. Expanding the feature space V. This operation is applied if there are too many errors in the description $A$ found on the first step. It produces a new feature space by arithmetically combining original features $v \in V$ with those occurring in the found description $A$.

These two steps can be reiterated up to a pre-specified number of feature combining operations. Step 2 transforms and arithmetically combines the features to extend the set of potentially useful variables. Then Step 1 is applied to the feature space thus enhanced. The combined variables appearing in the derived decision rule are then used in the next iteration, for combining with the original variables again and again, until a pre-specified number of combining operations, $t$, is reached. For example, one iteration may produce a feature $v1 * v2$, the second may add to this another feature, leading to $v1 * v2 + v3$, and the third iteration may further divide this by $v4$, thus producing the combined feature $f = (v1 * v2 + v3)/v4$ with three combining operations.

This iterative combination process will be referred to as APPCOD (APProximate COmprehensive Description). An APPCOD iteration can be considered a specification of the recombination step in genetic algorithms [99]. The difference is that the diversity of the original space here is maintained by using the original variables at each recombination step, rather than by the presence of `bad' solutions in conventional genetic algorithms. In this way, given a feature set $V$, entity set $I$, and class $S \subseteq I$, APPCOD produces a new feature set $F(V)$, a conjunction $A$ of the interval feature predicates based on $F(V)$, and its errors, the numbers of false positives FP and false negatives FN. Since the algorithm uses within-$S$ ranges of variables, it fits into the situation in which all features are quantitative.

However, the algorithm can also be applied to categorical features presented with dummy zero-one variables; intervals $(a, b)$ here should just satisfy the condition $a = b$, that is, correspond to either of the values, 1 or 0. It should be mentioned that a more conventional approach to finding a good description $A$ with mixed scale variables involves building logical predicates over combinations of features [24] and [69].
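Step 1 of the algorithm, the forward selection of interval predicates, can be sketched as follows. This is only an illustration under assumptions: the function name and the toy data are hypothetical and are not taken from any table of the book, features are normalized by their ranges, the error is taken as the total of false positives and false negatives, and the backward search is omitted.

```python
def conjunctive_description(rows, in_S, features):
    """rows: list of dicts feature -> value; in_S: list of booleans."""
    def normalized(f):
        vals = [r[f] for r in rows]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0
        return [(v - lo) / span for v in vals]

    def weight(f):
        # Squared difference between the within-S mean and the grand mean
        z = normalized(f)
        grand = sum(z) / len(z)
        inside = [v for v, s in zip(z, in_S) if s]
        return (sum(inside) / len(inside) - grand) ** 2

    def errors(desc):
        fp = fn = 0
        for r, s in zip(rows, in_S):
            ok = all(a <= r[f] <= b for f, (a, b) in desc.items())
            fp += ok and not s        # satisfies the description, outside S
            fn += s and not ok        # belongs to S, fails the description
        return fp, fn

    desc = {}
    for f in sorted(features, key=weight, reverse=True):
        inside = [r[f] for r, s in zip(rows, in_S) if s]
        trial = dict(desc)
        trial[f] = (min(inside), max(inside))      # within-S range of f
        if sum(errors(trial)) < sum(errors(desc)): # add only if error drops
            desc = trial
    return desc, errors(desc)

# Hypothetical data: feature "a" separates S, feature "b" does not
rows = [{"a": 1, "b": 5}, {"a": 2, "b": 6}, {"a": 3, "b": 7},
        {"a": 7, "b": 4}, {"a": 8, "b": 6}, {"a": 9, "b": 8}]
in_S = [True, True, True, False, False, False]
desc, errs = conjunctive_description(rows, in_S, ["a", "b"])
```

Since the intervals are the within-S ranges, such a description never produces false negatives on the data it was built from; only false positives can remain.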
Example 6.51. Combined features to describe Body mass groups

Let us apply APPCOD to the Body mass data in Table 1.7, with the restriction of no more than one feature combining operation. Depending on which of the groups, overweight or normal-weight, is to be described, the algorithm first selects either Weight or Height. This gives a poor description of the group, with too many false positives. Then APPCOD combines the feature with the other one and produces the difference Height - Weight and the ratio Height/Weight for both the overweight and the normal-weight groups. The overweight group is described with these combined features with no errors, whereas the normal group cannot be described distinctively with them. This means that the former group is somewhat more compact in the combined feature space. The overweight group's comprehensive description is: $2.12 \le H/W \le 2.39$ & $89 \le H - W \le 98$, with no errors.

The second inequality can be reinterpreted as stating that the difference Height - Weight for the other, normal group, is about 100, which fits well into the known common sense rule: "Normally, the difference between Height (cm) and Weight (kg) ought to be about 100."

When the number of permitted feature combining operations increases to 2 and the number of conjunctive items is limited by 1, the method produces the only variable, $H * H/W$, which distinctively describes either group and, in fact, is the inverse body mass index used to define the groups. □
6.4 Overall assessment
There is a great number of clustering methods that have been and are being developed. In section 6.1, a number of different clustering methods that can be treated as extensions of K-Means clustering have been presented. These methods are selected because of their great popularity. There are rather straightforward extensions of K-Means among them, such as Partitioning around medoids, in which the concept of centroid is specified to be necessarily an observed entity, and Fuzzy K-Means, in which entity memberships can be spread over several clusters. Less straightforward extensions are represented by the regression-wise clustering, in which centroids are regression functions, and self-organizing maps (SOM) that combine the Minimum distance rule with visualization of the cluster structure over a grid. The probabilistic mixture of distributions is a far reaching extension, falling within the statistics paradigm with a set of rather rigid requirements such as that all features are continuous-valued and must be standardized with the z-score transformation, none of which needs to be assumed within the data recovery approach.

Two graph-theoretic clustering methods are presented in section 6.2. One is the quite popular single linkage method related to connected components and minimum spanning trees. The other is a rather new method oriented towards finding a central dense core cluster; the method works well when such a dense core is unique.

Conceptual description of clusters is covered in the last section. Three different types of conceptual description are highlighted: (1) decision tree, the most popular means for conceptual classification in machine learning, (2) production rule, a device for picking up conditions leading to belongingness to a cluster, and (3) comprehensive description, which is oriented towards formulation of necessary and sufficient conditions of belongingness to a cluster.
The problem of conceptual description of clusters has not yet found any general satisfactory solution.
General Issues
After reading through this chapter you will get a general understanding of existing approaches to:
1. Data pre-processing and standardization
2. Imputation of missing data
3. Feature selection and extraction
4. Various approaches to determining number of clusters
5. Validation of clusters with indexes and resampling
and contributions of the data recovery approach to these issues.
Base words
Bootstrapping: A resampling technique in which a copy of the data set is created by randomly selecting either rows or columns, with replacement, as many times as there are rows or columns, respectively, in the data.

Cluster validation: A procedure for validating a cluster structure or clustering algorithm. This can be based on an internal index, an external index, or resampling. An internal index scores the degree of correspondence between the data and the cluster structure. An external index compares the cluster structure with a structure given externally. Resampling is used to see whether the cluster structure is stable with respect to data change.

Cross-validation: A resampling technique that provides for a full coverage of the data set. The data set is randomly divided into a number Q of equally sized parts, and Q copies of the data in the `training/testing' format are generated by taking each of the parts as the test part in a corresponding copy.

Dis/similarity between partitions: A partition-to-partition measure based on the contingency table. All measures reviewed in this chapter can be expressed as the weighted averages of subset-to-subset measures.

Dis/similarity between sets: A subset-to-subset measure relying on the subsets' overlap size. The popular Jaccard coefficient relates the overlap size to that of the union and has both advantages and shortcomings. Somewhat better measures should involve the ratios of the overlap size to each of the subsets, or Quetelet coefficients.

Feature extraction: A procedure for creating such `synthetic' data features that maintain the most important information of a pattern in question.

Feature selection: A procedure for finding such a subspace of data features that retains the most important information of a pattern in question.

Index based validation: Validation of a cluster structure involving either an internal index, measuring the degree of correspondence between the data and the clusters, or an external index, measuring the correspondence between the cluster structure and an externally specified partition or other external data.

Number of clusters: An important characteristic of a cluster structure, which depends both on the data and the user's objectives. A number of criteria have been developed for determining a `right' number of clusters, both index-based and resampling-based; so far none has come up as applicable across diverse data ranges.

Resampling: An approach to creating an empirical distribution of a data mining pattern or criterion by creating a number of random copies of the data table and independently applying to them the method for computing the pattern or criterion of interest. The distribution is used for assessing validity of the algorithm or pattern/criterion.
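The two resampling schemes defined above can be illustrated with a minimal Python sketch. It works on row indices only; the function names `bootstrap_rows` and `cross_validation_copies`, the seed, and the sizes are assumptions made for illustration and are not taken from the text.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Indices of a bootstrap copy: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def cross_validation_copies(n_rows, q, rng):
    """Split row indices at random into q nearly equal parts and return
    q (training, testing) index pairs, each part serving once as the test part."""
    order = list(range(n_rows))
    rng.shuffle(order)
    parts = [order[k::q] for k in range(q)]
    copies = []
    for k, test in enumerate(parts):
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        copies.append((sorted(train), sorted(test)))
    return copies

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
boot = bootstrap_rows(10, rng)
copies = cross_validation_copies(10, 5, rng)
```

In a bootstrap copy some rows typically occur several times and others not at all, whereas the Q test parts of cross-validation cover every row exactly once.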
7.1.2 Comprehensive description as a feature selector
Let us concentrate on the task of learning a subset $S \subset I$. A filter method can be employed for selecting the most salient features. The salience of feature v with respect to S should be measured based on the ability of v to separate S from the rest. To this end, the difference between v's grand mean and within-S mean can be utilized. In particular, its squared value $c_{Sv}^2$ can be used as pertaining to the contribution of feature v at S to the data scatter (see section 3.4.2). With the data preliminarily pre-processed by shifting them to the grand mean $a_v$ and normalizing by $b_v$, $c_{Sv}^2 = (a_v - a_{Sv})^2/b_v^2$ where $a_{Sv}$ is the within-S mean of feature v. Note that both $a_v$ and $a_{Sv}$ are calculated at the original variable $x_v$. When $x_v$ is a zero-one binary feature corresponding to a category, $c_{Sv}^2 = (p_v - p(v/S))^2/b_v^2$ where $p_v$ and $p(v/S)$ are unconditional and conditional frequencies of v. Denoting the proportion of S in I by $p_S$ and the proportion of the overlap of v and S by $p_{Sv}$, so that $p(v/S) = p_{Sv}/p_S$, this can be further reformulated as $c_{Sv}^2 = (p_{Sv} - p_v p_S)^2/p_S^2$ or $c_{Sv}^2 = (p_{Sv} - p_v p_S)^2/(p_v p_S^2)$ at $b_v = 1$ or $b_v = \sqrt{p_v}$, respectively.

Selecting features with largest $c_{Sv}^2$ is computationally effective and simple. However, features picked this way can be highly correlated, thus reducing their usefulness. Using methods of cluster description described in section 6.3 to postprocess salient features can mitigate the issue. In particular, the APPCOD method from section 6.3.4 can be used as a feature selecting device: the set of selected features W is formed by those features that are involved in the comprehensive description of S found with APPCOD.
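The salience score can be sketched in a few lines of Python. The data below are hypothetical toy columns (not from the book), the function name `salience` is an assumption, and the range is used as the normalizing coefficient $b_v$. The last lines check, for the binary column, that the mean-based value agrees with the closed form in frequencies at $b_v = 1$.

```python
def salience(column, in_S, b_v=1.0):
    """Squared difference between grand mean and within-S mean, scaled by b_v^2."""
    grand_mean = sum(column) / len(column)
    within = [x for x, member in zip(column, in_S) if member]
    within_mean = sum(within) / len(within)
    return (grand_mean - within_mean) ** 2 / b_v ** 2

# Hypothetical toy data: six entities, the last three form S.
in_S = [False, False, False, True, True, True]
features = {
    "u": [1, 2, 3, 10, 11, 12],   # quantitative, separates S well
    "z": [1, 0, 0, 1, 1, 0],      # binary, weakly related to S
}
scores = {name: salience(col, in_S, b_v=max(col) - min(col))
          for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Closed form for the binary feature at b_v = 1 (its range is 1).
n = len(in_S)
p_v = sum(features["z"]) / n
p_S = sum(in_S) / n
p_Sv = sum(x for x, member in zip(features["z"], in_S) if member) / n
closed_form = (p_Sv - p_v * p_S) ** 2 / p_S ** 2
```

Ranking by the score puts the well-separating feature first; as the text warns, such a ranking says nothing about correlations among the top features.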
Example 7.52. APPCOD based feature selection applied to Iris classes
In the Iris data set, predefined classes can be described by the following concepts found with the algorithm APPCOD: $1 \le w3 \le 1.9$ (class I, FP=0), $3.0 \le w3 \le 5.1$ & $1.0 \le w4 \le 1.8$ (class II, FP=8), and $1.4 \le w4 \le 2.5$ & $4.5 \le w3 \le 6.9$ (class III, FP=18). Since the method APPCOD uses within-class ranges of features for producing interval predicates, numbers of false negatives FN, in this version, are all zero. Numbers of false positives FP at conjunctions describing classes II and III are rather high, but they cannot be reduced by adding other variables' ranges. High errors for two of the Iris classes support the conclusion that they are dispersed in the variable space (see Figure 1.10, page 25). Still, only two variables occur in the descriptions obtained for the Iris classes, w3 and w4. This effectively selects these two variables as those that matter. □
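How an interval conjunction and its error counts arise can be sketched as follows. This is only the description-and-counting step, not the full APPCOD search; the six rows are hypothetical values chosen for illustration (they are not the Iris data), and the function names are assumptions.

```python
def interval_description(rows, in_S, features):
    """Conjunction of within-S ranges: feature -> (min over S, max over S)."""
    members = [r for r, m in zip(rows, in_S) if m]
    return {v: (min(r[v] for r in members), max(r[v] for r in members))
            for v in features}

def count_errors(rows, in_S, description):
    """False positives: covered entities outside S. False negatives: uncovered members of S."""
    fp = fn = 0
    for r, m in zip(rows, in_S):
        covered = all(lo <= r[v] <= hi for v, (lo, hi) in description.items())
        if covered and not m:
            fp += 1
        elif m and not covered:
            fn += 1
    return fp, fn

# Hypothetical entities; S consists of the third and fourth rows.
rows = [
    {"w3": 1.4, "w4": 0.2}, {"w3": 1.9, "w4": 0.4},
    {"w3": 4.5, "w4": 1.5}, {"w3": 5.1, "w4": 1.8},
    {"w3": 4.8, "w4": 1.7}, {"w3": 6.0, "w4": 2.5},
]
in_S = [False, False, True, True, False, False]
description = interval_description(rows, in_S, ["w3", "w4"])
fp, fn = count_errors(rows, in_S, description)
```

Because the intervals are the within-S ranges, every member of S is covered and FN is zero by construction; only false positives can occur, as noted in the example.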
7.1.3 Comprehensive description as a feature extractor
In the early days of the development of multivariate data analysis methods, feature extraction was confined to extracting `hidden' factors being linear combinations of the original variables. This is still a niche for Principal component analysis (PCA) and related methods described in sections 5.1.3 and 5.5.1, which have been recently greatly enhanced by the advent of kernel based methods [36, 110]. In this context, kernel based methods provide for non-linear transformations of the original features to effectively linearize the task of supervised learning. An issue with all these methods is the artificial character of the emerging `synthetic' features, which is rather disappointing in the context of clustering because new features are needed mainly for better understanding.

More recently, a number of methods emerged, especially in supervised learning, that combine original features to produce meaningful descriptions of classes [92, 107]. The APPCOD method [92] described in section 6.3.4 seems rather convenient in this regard as it uses arithmetic combinations of original features, which nicely fits into the long-standing tradition of sciences. According to this tradition, derivative measures such as area or electric current can be expressed as arithmetic products or ratios of other features such as length and width or voltage and resistance.
Example 7.53. APPCOD based feature extraction applied to Iris classes

Let us try APPCOD as a feature-extracting device on the Iris data set. Assume that features w3 and w4, selected with APPCOD in the previous section, constitute set A which will be arithmetically combined with the four original Iris variables in set V. We denote by $F(A, V)$ the set of features produced by applying each of five operations ($x + y$, $x - y$, $x * y$, $x/y$, $y/x$) to each $x \in A$ and $y \in V$. APPCOD applied to $F(A, V)$ for description of Iris classes II and III (class I has been distinctively separated by w3 alone and thus excluded from further analysis) produces conjunctions $1.18 \le w1/w3 \le 1.70$ & $3.30 \le w3 * w4 \le 8.64$ (class II, FP=4) and $7.50 \le w3 * w4 \le 15.87$ & $1.80 \le w3 - w2 \le 4.30$ (class III, FP=2), with the number of conjunctive terms restricted to being not greater than 2. The errors have become smaller, but still they may be considered too high.

Can the numbers of false positives be reduced to just one per class? Putting the compound variables involved in the descriptions above, $w1/w3$, $w3 * w4$, and $w3 - w2$, as A and leaving V as is, an updated $F(A, V)$ can be computed to give rise to the following APPCOD produced conjunctions: $2.86 \le w1 * w2/w3 \le 4.77$ & $3.30 \le w3 * w4^2 \le 15.55$ (class II, FP=4) and $3.24 \le (w3 - w2) * w4 \le 9.89$ & $1.35 \le w3 * w4 - w1 \le 8.70$ (class III, FP=2). The errors of the two-term conjunctions have not changed. However, one could see that the new feature space leads to a better four-term conjunctive description of class III (with FP decreased to 1).

With set A consisting of the four new variables, $w1 * w2/w3$, $w3 * w4^2$, $(w3 - w2) * w4$, and $w3 * w4 - w1$, and V unchanged, the algorithm APPCOD applied to $F(A, V)$ leads to the following conjunctions: $0.64 \le w2 * (w3 - w2) * w4 \le 4.55$ & $0.21 \le w2/(w3 * w4^2) \le 0.74$ (class II, FP=1) and $4.88 \le w3 * w4^2 - w1 \le 31.20$ & $-2.85 \le (w3 - w2) * w4 - w1 \le 2.19$ (class III, FP=1). It should be added that class I can be distinctively separated with one of these variables, $w2 * (w3 - w2) * w4 \le -3.07$ (class I, FP=0).
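The construction of the combined feature set $F(A, V)$ can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the author's implementation: features are stored as named columns, the function name `combine` and the two-row data are hypothetical, a feature is not paired with itself, and all values are assumed nonzero so that the ratios exist.

```python
def combine(A, V):
    """Apply x+y, x-y, x*y, x/y and y/x to every x in A and y in V (columns as lists)."""
    F = {}
    for x_name, x in A.items():
        for y_name, y in V.items():
            if x_name == y_name:
                continue  # simplification of this sketch: skip a feature paired with itself
            F[f"{x_name}+{y_name}"] = [a + b for a, b in zip(x, y)]
            F[f"{x_name}-{y_name}"] = [a - b for a, b in zip(x, y)]
            F[f"{x_name}*{y_name}"] = [a * b for a, b in zip(x, y)]
            F[f"{x_name}/{y_name}"] = [a / b for a, b in zip(x, y)]
            F[f"{y_name}/{x_name}"] = [b / a for a, b in zip(x, y)]
    return F

# Two illustrative entities described by four features.
V = {"w1": [5.1, 7.0], "w2": [3.5, 3.2], "w3": [1.4, 4.7], "w4": [0.2, 1.4]}
A = {name: V[name] for name in ("w3", "w4")}
F = combine(A, V)
```

The result contains, among others, the columns named `w1/w3`, `w3*w4` and `w3-w2`, the three compound features that appear in the first pair of conjunctions of this example. Feeding selected compound columns back in as A repeats the step, one combining operation at a time.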
Table 7.1: Confusion matrix for iK-Means clusters, in the original feature space, and predefined classes of the Iris data.

Classes in       iK-Means clusters, option 1       iK-Means clusters, option 2
Iris data         1    2    3    4    5    6         1    2    3    4    5      Total
1                 0   49    1    0    0    0         0   50    0    0    0         50
2                 0    0   13   10    5   22         0    0   15   16   19         50
3                12    0    2   18   18    0        27    0    3   21    0         50
Total            12   49   16   28   23   22        27   50   18   36   19        150
Having achieved the goal of not more than one error per class, the process of combining variables is stopped at this point to underscore a trade-off between the exactness and complexity of cluster descriptions, which parallels similar trade-offs in other description techniques such as regression analysis. The final features do not make much sense. However, an intermediate feature that emerged and was conserved, $w3 * w4$, referring to the petal area, is considered by many as a key feature to substantively discriminate between Iris genera II and III. □
Example 7.54. Extracting Body mass features with APPCOD

Example 6.51 of using APPCOD for separating two classes of Body mass data can be considered in the feature extracting context, too. Depending on what group is considered as S, those of overweight or normal weight, the method leads to either Height - Weight or Height/Weight$^2$, as an extracted combined feature. Both of these features are known to be quite meaningful in the problem of separation of the overweight from the normal. □
As explained above, criteria for evaluation of the quality of extracted features can be based on the performance of a classifier or cluster maker in the new feature space. In the clustering context the latter seems preferable.
Example 7.55.
Let us explore the quality of different feature spaces produced in the previous examples to tackle the issue of supervised feature selection and extraction on the Iris data set. Let us utilize the intelligent version of K-Means, iK-Means described in section 3.3, as a cluster maker. First, let us consider results of iK-Means applied to the original data set with two standardizing options, (1) features z-score standardized and (2) features range standardized, that are presented in the left and right contingency tables in Table 7.1, respectively. The cluster discarding threshold in iK-Means was set to be equal to 2. Table 7.1 shows that the level of confusion is less in the case of the range standardized data (option 2). This is caused by the fact that with this standardization, contributions of features to the data scatter are shifted towards those petal related, w3 and w4, which better correlate with the genera, as was shown in Example 7.52: the vector of contributions of features w1-w4 under option (2) is (19, 12, 32, 37) per cent, whereas all contributions are equal to 25% for the z-score standardization. The levels of confusion are lessened even more if we impose the restriction that
Table 7.2: Confusion matrix at iK-Means clusters set to 3 (in all three options) and predefined classes of the Iris data.

Classes in       Clusters, op. 1      Clusters, op. 2      Clusters, op. 3
Iris data         1    2    3          1    2    3          1    2    3      Total
1                 0   49    1          0   50    0          0   50    0         50
2                13    0   37         10    0   40          2    0   48         50
3                42    0    8         42    0    8         46    0    4         50
Total            55   49   46         52   50   48         48   50   52        150
Table 7.3: Confusion matrix of iK-Means clusters, in the extracted feature space, and predefined classes of the Iris data.

Classes in       Clusters, 1               Clusters, 2
Iris data         1    2    3    4          1    2    3      Total
1                50    0    0    0         50    0    0         50
2                 0    2   41    7          0    2   48         50
3                 0   48    2    0          0   48    2         50
Total            50   50   43    7         50   50   50        150
the number of clusters in iK-Means must be the same as the number of genera, that is, 3, as clearly seen in Table 7.2. One more option, 3, is added here, with the feature set confined to only two features, w3 and w4, that have been selected by APPCOD in Example 7.52. This last option leads to a solution with only 6 displaced entities, which supports the principle of feature selection.

Let us take a look now at clustering results with extracted features. Table 7.3 presents results of iK-Means applied to the APPCOD enhanced features w1/w3, w3*w4 and w3-w2 from the previous example. The method leads to a four cluster solution presented in the left part (option 1); the right part presents the solution found by iK-Means restricted to having 3 clusters only (option 2). The range standardized features contribute almost equally to the data scatter in this case, which implies that results would not differ under the z-score based data standardization. All the confusion in the extracted feature space is created by four specimens, 13 and 32 of class II, and 33 and 39 of class III, since they are still closer to the other class centroid than to that of the class they belong to. It is worth adding that the separation of cluster 2 is mostly due to the contribution of the "area" feature w3*w4, while cluster 3 is overwhelmingly supported by the ratio of sepal length to petal length w1/w3.

Results in Table 7.3 are the best we could obtain with APPCOD-extracted features of the previous example. Contrary to expectations, further combining features does not improve the quality of clustering, probably because more complex features relate to finer details of differences between entities, which do not help in separating the genera but rather break them up according to the finer granulation. Let us take a look, for example, at iK-Means clustering found at the final feature set obtained using three combining operations in the previous example, w2*(w3-w2)*w4, w2/(w3*w4*w4), w3*w4*w4-w1, and (w3-w2)*w4-w1 (Table 7.4). Five entities of the first genus are extracted as cluster 2 here because they have by far the greatest values of feature w2/(w3*w4*w4), which is also responsible for the emergence of a partial cluster 7. This delicate feature, along with (w3-w2)*w4-w1, leads to the production of the mixed cluster 6 as well. □

Table 7.4: Confusion matrix for iK-Means clusters, in a complex feature space, and predefined classes of the Iris data.

Classes in       iK-Means clusters
Iris data         1    2    3    4    5    6    7      Total
1                 0    5   45    0    0    0    0         50
2                 0    0    0    0   22    7   21         50
3                15    0    0   25    0   10    0         50
Total            15    5   45   25   22   17   21        150
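The counts of misplaced entities quoted in this example can be read off a confusion table mechanically. The sketch below assumes one particular reading, namely that each cluster is assigned to its largest class and every other entity in it counts as displaced; the function name `displaced` is hypothetical. The nonzero cells are copied from option 3 of Table 7.2 and option 2 of Table 7.3.

```python
def displaced(table):
    """Entities that fall outside the largest class of their cluster.

    table maps (class label, cluster label) -> count of entities.
    """
    total = 0
    largest = {}
    for (cls, cluster), count in table.items():
        total += count
        largest[cluster] = max(largest.get(cluster, 0), count)
    return total - sum(largest.values())

# Nonzero cells of Table 7.2, option 3 (features w3 and w4 only).
table_7_2_option_3 = {(1, 2): 50, (2, 1): 2, (2, 3): 48, (3, 1): 46, (3, 3): 4}
# Nonzero cells of Table 7.3, option 2 (extracted features, three clusters).
table_7_3_option_2 = {(1, 1): 50, (2, 2): 2, (2, 3): 48, (3, 2): 48, (3, 3): 2}

selected = displaced(table_7_2_option_3)
extracted = displaced(table_7_3_option_2)
```

Under this reading the first table yields the 6 displaced entities and the second the four confused specimens mentioned in the text.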
is determined in terms of dis/similarities between entities. In fact, features can be useful as well because:
1. Clustering methods such as K-Means may employ the specificity of the vector space format to explicitly formulate the concept of cluster's prototype or centroid as a feature based entity.
2. Features are involved in cluster interpretation.
3. The vector space format can be by far more effective computationally when the number of features is relatively small. Think of the difference between $NM$ input numbers in the vector space and $N^2$ or $N^2/2$ input numbers in the dis/similarity format when N is large and M small. If, for instance, there are 10,000 entities and 100 features, there will be about one million numbers in the former case and about a hundred million numbers in the latter case, a difference that can be crucial with contemporary personal computers.

The idea of conversion of dis/similarity based data into the feature space format motivated a lot of research in the so-called multidimensional scaling, the discipline oriented towards embedding given dis/similarity data into a vector space in such a way that the between-entity distances in the space would approximate the prespecified dis/similarities. An issue with this approach is that the found space has no intrinsic interpretational support. In our view, this can be overcome with another idea gaining more and more popularity: given a dis/similarity data table, a number of entities should be pre-selected as reference points serving as coordinate axes so that each entity can be represented by the vector of its distances to the reference points.

Complex data such as chemical compound formulas, time-series of stock prices, biomolecular sequences, images, etc. can be used for automatically discovering patterns such as motifs in sequences or edges in images that can then be used for feature generation.
This implies that the pre-processing issues are not necessarily purely computational ones but can involve specific data mining techniques including clustering. These topics remain beyond the scope of this text. Relevant materials can be found in [23] (sequences and temporal data), [123] (images), [44] (data bases and warehouses), [26] (spatial data).
they must be considered on par with the correlation ratio for quantitative variables and, moreover, get unexpectedly transformed, one into the other, depending on the data normalization option accepted.
4. The data recovery framework helps in addressing one of the major issues of data analysis: warranting that data pre-processed into the dis/similarity format get no additional structures imposed by the pre-processing. Statements proven in sections 5.4 and 5.5 allow us to claim that there are criteria and methods that should lead to compatible results in both feature-based and dis/similarity data. More explicitly, if a dissimilarity measure is equal (or similar) to the squared Euclidean distance between rows of a vector space matrix, or a similarity measure is the row-to-row inner product, then compatible criteria and local search heuristics are those detailed in section 5.4.1.
In [85] a set of computational experiments has been performed with a number of distance-based clustering methods and various standardization options. The methods included a range of agglomerative procedures from the single linkage to Ward algorithms. The data generated consisted of a small number, 50, of points of small, 4 to 8, dimensions, concentrated in a number of well separated clusters, supplemented with errors being either (i) normally distributed noise added to data entries, or (ii) two additional, randomly generated, features, or (iii) 20% outlier points added to the set. Overall, the normalization by range appeared to unanimously provide for the best recovery of the underlying cluster structure among other normalization options; "the traditional z-score formula was not especially effective" ([85], p. 202). This concurs with the advice following from the analysis of feature contributions to the Pythagorean decomposition in section 3.4.

Subtracted shift values $a_v$ play no role in the distance (7.1), but they do work when using the inner product as a similarity measure and, more importantly, in cluster centroids. In this regard, the recommended option (a5) for $a_v$ being the grand mean has obvious advantages. With this, the cluster centroids express the differences between the feature within-cluster averages and grand means; these differences accentuate each cluster's specifics. The differences are indispensable in interpreting clusters and, moreover, they are behind the entire system of cluster-feature contributions to the data scatter. The appropriateness of using grand means as the shift coefficients is underscored by the properties of these contributions relating them to classical ideas of statistics: the correlation ratio, Quetelet indexes, and the Pearson chi-square association coefficient. The working of the Anomalous pattern method and iK-Means clustering is based on that, too.
In some applications, binary data are pre-processed to take into account their row and column frequencies. In particular, in information retrieval, the so-called tf-idf ("term frequency, inverse-document frequency") weighting scheme is widely accepted. This scheme assigns every unity in a document-to-keyword data table the product of the keyword frequency within the document and the logarithm of the proportion of total documents to the number of documents in which the term appears [116]. Another pre-processing option would be to treat a binary data table as a contingency table and code every entry by the corresponding Quetelet index defined in section 2.2.3. These transformations make the data not binary anymore and are not considered further on.

Partitions correspond to nominal features and, thus, also constitute an important class of input data. On the output side, subsets and/or partitions are results of many clustering algorithms. To compare results of different classification schemes or clustering algorithms, one needs to measure similarity between them. There have been a number of different approaches and measures of dis/similarity between subsets or partitions developed in the literature, of which some will be reviewed here (see [56] and [90] for more).
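The tf-idf weighting just described can be sketched as follows. The sketch takes the raw within-document count as the keyword frequency and the natural logarithm, both of which are assumptions, since variants of the scheme differ on these points; the three toy documents and the function name `tf_idf` are hypothetical.

```python
import math

def tf_idf(docs):
    """docs: list of dicts term -> count within the document.

    Returns weights count * log(number of documents / number of documents with the term).
    """
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return [{term: count * math.log(n_docs / doc_freq[term])
             for term, count in doc.items()}
            for doc in docs]

docs = [
    {"cluster": 2, "data": 1},
    {"data": 3},
    {"mining": 1, "data": 1},
]
weights = tf_idf(docs)
```

A term present in every document, such as `data` here, gets zero weight whatever its count, while a term confined to one document keeps a positive weight.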
7.3.1 Dis/similarity between binary entities or subsets
Set theoretic similarity measures

When all features are binary, every entity $i \in I$ can be considered as just a set (bag) of features $V_i$ that are present at i. Then traditional set-theoretic operations can be used to express such concepts as "set of features that are absent from i" ($\bar{V}_i$), "set of features that are present at i but not at j" ($V_i - V_j$), "set of features that are common to i and j" ($V_i \cap V_j$), etc.

The so-called four-fold table presented in Table 7.5 is a traditional instrument for comparing subsets $V_i$ and $V_j$ in I. It is a contingency table cross classifying $V_i$ and its complement $\bar{V}_i = I - V_i$ with $V_j$ and its complement $\bar{V}_j = I - V_j$. Its interior entries are cardinalities of intersections of corresponding sets and its marginal entries are cardinalities of the sets themselves.

Table 7.5: Four-fold table.

Set            $V_j$     $\bar{V}_j$     Total
$V_i$          a         b               a+b
$\bar{V}_i$    c         d               c+d
Total          a+c       b+d             a+b+c+d = N
Probably the simplest similarity measure is the size of the overlap,

$|V_i \cap V_j| = a$    (7.2)

However, this measure fails to relate the size of the overlap to the set sizes and thus does not allow us to judge whether the overlap comprises the bulk or just the tip of the sets. The distance between sets defined as

$h(i,j) = |V_i \cup V_j| - |V_i \cap V_j| = |V_i| + |V_j| - 2|V_i \cap V_j| = b + c$    (7.3)

is somewhat better. It is 0 when $V_i = V_j$, but its maximum, $|V_i \cup V_j|$, still depends on the set sizes. Returning to binary rows $x_i$ and $x_j$ of the data table corresponding to entities i and j, it is easy to see that distance $h(i,j)$ equals the squared Euclidean, and also the city-block, distance between $x_i$ and $x_j$:

$h(i,j) = \sum_{v \in V} (x_{iv} - x_{jv})^2 = \sum_{v \in V} |x_{iv} - x_{jv}|.$

The latter equation follows from the fact that $(x - x')^2 = |x - x'|$ for any binary $x$ and $x'$. Measure $h(i,j)$ (7.3) is referred to as Hamming distance between the binary vectors. The relative distance $h(i,j)/N$ is known as the mismatch coefficient between sets:

$m(i,j) = h(i,j)/N = (b + c)/N$    (7.4)

Its complement to unity, the so-called match coefficient,

$s(i,j) = (N - h(i,j))/N = (a + d)/N$    (7.5)
is popular as well. Index $s(i,j)$ is always between 0 and 1. It is 1 when $V_i = V_j$ and it is 0 when $V_i$ and $V_j$ are complementary parts of I. An issue with $s(i,j)$ as a similarity measure is that it depends on the size N of I which may be irrelevant, especially in comparing two "little things in the big world" such as two text documents among thousands of others. The so-called Jaccard coefficient [57],

$J(i,j) = \frac{|V_i \cap V_j|}{|V_i \cup V_j|} = \frac{a}{a + b + c}$    (7.6)

relates the overlap to the union of the sets and thus does not depend on N. It is 1 when $V_i = V_j$ and 0 when $V_i$ and $V_j$ are disjoint. It is good mathematically, too, because its complement to unity, the dissimilarity $d_J = 1 - J$, satisfies the so-called triangle inequality, $d_J(i,j) \le d_J(i,l) + d_J(j,l)$ for any $i, j, l \in I$ (for a proof see [90], p. 175). A popular algorithm for clustering categorical data, ROCK [43], uses the Jaccard coefficient.

However, there appears to be an intrinsic flaw in the Jaccard coefficient in that it systematically underestimates the similarity. To demonstrate this, let us consider two typical situations [96].

1. Undervalued overlap. When the sizes of sets $V_i$ and $V_j$ are about the same and their overlap is about half of the elements in each of them, the Jaccard coefficient is about 1/3, while one would expect the similarity score to be, in this case, about 1/2. To make the coefficient equal to 1/2, the overlap must contain about 2/3 of the elements from each of the sets, which intuitively should correspond to a score of 2/3.

2. Undervalued part. When $V_i$ is part of $V_j$, being smaller than $V_j$ in size, then J is just equal to the proportion of $V_i$ in $V_j$. If, for instance, the size of $V_i$ is 0.2 of the size of $V_j$, then the value of J also will be 0.2. Such a small value contradicts our intuition on the relationship between entities i and j because the fact that $V_i \subset V_j$ may intuitively mean that they are highly related, semantically in the case of text documents or evolutionarily in the case of genomes.

Is there any remedy that can be suggested? Yes. One can note that in the former example, each of the ratios $|V_i \cap V_j|/|V_i|$ and $|V_i \cap V_j|/|V_j|$ is equal to one half. In the latter example, one of them is still 0.2 but the other is 1! Thus, an appropriate similarity index should combine these two ratios. Practitioners usually take the maximum or minimum of them by relating $|V_i \cap V_j|$ to $\min(|V_i|, |V_j|)$ or $\max(|V_i|, |V_j|)$, respectively. Also, the geometric mean,
$g(i,j) = \frac{|V_i \cap V_j|}{\sqrt{|V_i||V_j|}} = \frac{a}{\sqrt{(a + b)(a + c)}}$    (7.7)

or arithmetic mean,

$f(i,j) = \left(\frac{|V_i \cap V_j|}{|V_i|} + \frac{|V_i \cap V_j|}{|V_j|}\right)/2 = \frac{a(2a + b + c)}{2(a + b)(a + c)}$    (7.8)

can be utilized.
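The set-theoretic measures of this section can be compared on the two situations described above with a short Python sketch. The function names, the dictionary keys, and the ten-element universe are assumptions made for illustration; the formulas are those expressed through the four-fold table entries a, b, c, d.

```python
import math

def four_fold(V_i, V_j, universe):
    """Entries a, b, c, d of the four-fold table for two subsets of the universe."""
    a = len(V_i & V_j)
    b = len(V_i - V_j)
    c = len(V_j - V_i)
    d = len(universe - (V_i | V_j))
    return a, b, c, d

def set_measures(a, b, c, d):
    n = a + b + c + d
    return {
        "hamming": b + c,
        "mismatch": (b + c) / n,
        "match": (a + d) / n,
        "jaccard": a / (a + b + c),
        "geometric": a / math.sqrt((a + b) * (a + c)),
        "arithmetic": (a / (a + b) + a / (a + c)) / 2,
    }

universe = set(range(1, 11))
# Undervalued overlap: equal sizes, half of each set shared.
half = set_measures(*four_fold({1, 2, 3, 4}, {3, 4, 5, 6}, universe))
# Undervalued part: the first set is a fifth of the second.
part = set_measures(*four_fold({1}, {1, 2, 3, 4, 5}, universe))
```

In the first case the Jaccard value is 1/3 while both means give 1/2; in the second the Jaccard value is 0.2 while the arithmetic mean gives 0.6, which reflects the full inclusion of the smaller set.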
As mentioned above, the distance $d_{ij}$ between rows $x_i$ and $x_j$ representing $i, j \in I$ in the original binary data table is precisely the Hamming distance $h(i,j)$ (7.3). The inner product of the two binary rows is the overlap (7.2). To get more subtle measures one needs to employ subtler tools. In particular, the normalized product, $(x_i, x_j)/\sqrt{(x_i, x_i)(x_j, x_j)}$, yields the geometric mean $g(i,j)$ (7.7). When the feature means have been subtracted according to recommendations of the data recovery approach, the inner product of rows in the prestandardized data matrix leads to an interesting similarity index [96]. With $a_v = p_v$, the proportion of entities in which feature v is present, and $b_v = 1$, the range, the scalar product of the standardized rows has been derived in (5.42) on page 163 as,
where is a constant and t(i) (or t(j)) is the total frequency weight of features that are present at i (or, at j). (The more frequent is the feature, the less its frequency weight.) Similarity index (7.9) is a linear analogue to the arithmetic mean coefficient f (7.8) and, thus, has properties similar to those of f. However, it is further adjusted according to the information content of features in h and g, which is evaluated over all entities by counting their frequencies. Also, it can be either positive or negative, thus introducing the expression power of the sign into similarity measurement, and should be further explored.
Table 7.6: A contingency table or cross classification of two partitions, $S = \{S_1, \ldots, S_K\}$ and $T = \{T_1, \ldots, T_L\}$, on the entity set I.

Cluster    1           2           ...    L           Total
1          $N_{11}$    $N_{12}$    ...    $N_{1L}$    $N_{1+}$
2          $N_{21}$    $N_{22}$    ...    $N_{2L}$    $N_{2+}$
...        ...         ...         ...    ...         ...
K          $N_{K1}$    $N_{K2}$    ...    $N_{KL}$    $N_{K+}$
Total      $N_{+1}$    $N_{+2}$    ...    $N_{+L}$    N
and one of the halves separated by the diagonal, either that under or above the diagonal, because of their symmetry. Let us consider graph $\Gamma_S$ as the set of its edges. Then, obviously, the cardinality of $\Gamma_S$ is $|\Gamma_S| = \sum_{k=1}^{K} \binom{N_{k+}}{2}$ where $\binom{N_{k+}}{2} = N_{k+}(N_{k+} - 1)/2$ is the number of edges in the clique of $\Gamma_S$ corresponding to cluster $S_k$. In terms of the corresponding binary matrices, this would be the number of $r_{ij} = 1$ for $i, j \in S_k$, that is, $N_{k+}^2$ standing for $\binom{N_{k+}}{2}$ here. With a little arithmetic, this can be transformed to:

$|\Gamma_S| = \left(\sum_{k=1}^{K} N_{k+}^2 - N\right)/2$    (7.10)
To compare graphs $\Gamma_S$ and $\Gamma_T$ as edge sets one can invoke the four-fold table utilized in section 7.3.1 for comparing sets (see Table 7.5). Table 7.7 presents it in the current context [2].
Table 7.7: Four-fold table for partition graphs.

Graph               $\Gamma_T$      $\bar{\Gamma}_T$   Total
$\Gamma_S$          a               b                  $|\Gamma_S|$
$\bar{\Gamma}_S$    c               d                  c + d
Total               $|\Gamma_T|$    b + d              a + b + c + d = $\binom{N}{2}$
graph corresponding to the intersection of partitions $S$ and $T$. Thus,

$$a = \Bigl(\sum_{k=1}^{K}\sum_{l=1}^{L} N_{kl}^2 - N\Bigr)/2 \qquad (7.11)$$
which is a partition analogue to the overlap set similarity measure (7.2). It is of interest because it is proportional (up to the subtracted $N$, which is due to the specifics of graphs) to the inner product of matrices $r_S$ and $r_T$, which is frequently used as a measure of similarity between partitions on its own. Among other popular measures of proximity between partitions are partition graph analogues to the mismatch and match coefficients $m$ and $s$, (7.4) and (7.5), Jaccard coefficient $J$ (7.6) and geometric mean $g$ (7.7). These analogues are referred to as distance (mismatch), Rand, Jaccard and Fowlkes-Mallows coefficients, respectively, and can be presented with the following formulas:
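$$M = (|\Gamma_S| + |\Gamma_T| - 2a)/\binom{N}{2} \qquad (7.12)$$

$$Rand = 1 - (|\Gamma_S| + |\Gamma_T| - 2a)/\binom{N}{2} = 1 - M$$

$$J = \frac{a}{|\Gamma_S| + |\Gamma_T| - a}$$

$$FM = \frac{a}{\sqrt{|\Gamma_S|\,|\Gamma_T|}}$$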
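The four coefficients can be computed directly from the contingency counts via (7.10) and (7.11). The following is a minimal sketch, assuming the two partitions are given as label sequences of equal length; the function name `partition_indexes` and the dictionary keys are hypothetical, introduced only for illustration.

```python
from collections import Counter
from math import comb, sqrt


def partition_indexes(s_labels, t_labels):
    """Mismatch, Rand, Jaccard and Fowlkes-Mallows coefficients of two partitions.

    Hypothetical helper: partitions are given as cluster labels per entity.
    Assumes each partition has at least one cluster with two or more entities,
    so that the denominators below are nonzero.
    """
    n = len(s_labels)
    if n != len(t_labels):
        raise ValueError("both partitions must be defined on the same entity set")
    n_k = Counter(s_labels)                    # row marginals N_{k+}
    n_l = Counter(t_labels)                    # column marginals N_{+l}
    n_kl = Counter(zip(s_labels, t_labels))    # contingency entries N_{kl}

    # Edge counts of the partition graphs, formula (7.10).
    gamma_s = (sum(v * v for v in n_k.values()) - n) // 2
    gamma_t = (sum(v * v for v in n_l.values()) - n) // 2
    # Overlap of the two edge sets, formula (7.11).
    a = (sum(v * v for v in n_kl.values()) - n) // 2

    mismatch = (gamma_s + gamma_t - 2 * a) / comb(n, 2)
    return {
        "M": mismatch,
        "Rand": 1.0 - mismatch,
        "Jaccard": a / (gamma_s + gamma_t - a),
        "FM": a / sqrt(gamma_s * gamma_t),
    }


if __name__ == "__main__":
    print(partition_indexes([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
```

For the two six-entity partitions in the usage lines, $|\Gamma_S| = 6$, $|\Gamma_T| = 3$ and $a = 2$, so that $M = 1/3$ and $J = 2/7$.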
7.3. SIMILARITY ON SUBSETS AND PARTITIONS
Table 7.8: Model confusion data.

Attribute    $T$                $\bar{T}$          Total
$S$          $(1-\epsilon)n$    $\epsilon n$       $n$
$\bar{S}$    $\epsilon n$       $(1-\epsilon)n$    $n$
Total        $n$                $n$                $2n$
$N - N_{k+} - N_{+l} + 2N_{kl}$, respectively, as related to the set-theoretic difference between $S_k$ and $T_l$. This suggests a potential extension of the partition similarity indexes by averaging other, nonlinear, measures of set similarity such as the means (7.7) and (7.8) or even the Quetelet coefficients. The averaged Quetelet coefficients, $G^2$ and $Q^2$, are akin to traditional contingency measures, which are traditionally regarded as tied to the case of statistical independence. However, the material of sections 2.2.3 and 3.4.4 demonstrates that the measures can be utilized just as summary association measures in their own right, with no relation to the statistical independence framework.
Example 7.56. Distance and chi-squared according to model confusion data
Consider the following contingency table (Table 7.8) that expresses the idea that binary features $S$ and $T$ significantly overlap and differ by only a small proportion $\epsilon$ of the total contents of their classes. If, for instance, $\epsilon = 0.1$, then the overlap is 90%; the smaller the $\epsilon$, the greater the overlap. Let us analyze how this translates into values of the contingency coefficients described above, in particular, the distance and chi-squared. The distance, or mismatch coefficient, is defined by formulas (7.12), (7.11) and (7.10), so that it equals the sum of the marginal frequencies squared minus the doubled sum of squares of the contingency elements. To normalize, for the sake of convenience, we divide by $4n^2$, which is $2n$ squared, rather than by the binomial coefficient $\binom{2n}{2}$. This leads to

$$M = (4n^2 - 2(2(1-\epsilon)^2 + 2\epsilon^2)n^2)/(4n^2) = 2\epsilon(1-\epsilon).$$

To calculate the chi-squared, we use formula (2.12) from section 2.2.3. According to this formula, $X^2$ is the total of contingency entries squared and related to both column and row marginal frequencies, minus one:

$$X^2 = 2(1-\epsilon)^2 + 2\epsilon^2 - 1 = 1 - 4\epsilon(1-\epsilon).$$

These formulas imply that for these model data, $X^2 = 1 - 2M$. For example, at $\epsilon = 0.05$ and $\epsilon = 0.1$, $M = 0.095$ and $M = 0.18$, respectively. The respective values of $X^2$ are 0.81 and 0.64.
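The arithmetic of this example can be checked numerically. The sketch below builds the model confusion table for a given proportion and computes both coefficients from the table itself, with the same $4n^2$ normalization of the distance as in the text; the function name `model_confusion_indexes` and the default `n=1000` are assumptions made for illustration.

```python
def model_confusion_indexes(eps, n=1000):
    """Distance M and chi-squared X^2 for the model confusion table.

    Hypothetical helper; `eps` is the confusion proportion and `n` the
    size of each class, as in Table 7.8.
    """
    table = [[(1 - eps) * n, eps * n],
             [eps * n, (1 - eps) * n]]
    rows = [sum(r) for r in table]            # row marginals, each equal to n
    cols = [sum(c) for c in zip(*table)]      # column marginals, each equal to n
    total = sum(rows)                         # 2n

    # Squared marginals minus doubled squared entries, divided by (2n)^2.
    sq_marginals = sum(x * x for x in rows) + sum(x * x for x in cols)
    sq_entries = sum(x * x for r in table for x in r)
    m = (sq_marginals - 2 * sq_entries) / total ** 2

    # Entries squared, related to both marginals, minus one.
    x2 = sum(table[k][l] ** 2 / (rows[k] * cols[l])
             for k in range(2) for l in range(2)) - 1
    return m, x2


if __name__ == "__main__":
    for eps in (0.05, 0.1):
        m, x2 = model_confusion_indexes(eps)
        print(eps, round(m, 4), round(x2, 4))
```

The computed pairs agree with the closed forms $M = 2\epsilon(1-\epsilon)$ and $X^2 = 1 - 2M$.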
Matching-based similarity versus Quetelet association
We can distinguish between two types of subset-to-subset and respective partition-to-partition association measures:

M: matching between subsets/partitions measured by their overlap $a$ in (7.2), (7.11) and (7.13), and

C: conditional probability and Quetelet indexes such as (2.10), (2.12) and (2.13).

Considered as they are, in terms of co-occurrences, these two types are so different that, to the author's knowledge, they have never been considered simultaneously. There is an opinion that, in the subset-to-subset setting, the former type applies to measuring similarity between entities whereas the latter type applies to measuring association between features. However, there exists a framework, that of data recovery models for entity-to-entity similarity matrices, in which these two types represent the same measure of correlation applied in two slightly different settings. In this framework, each partition $S = \{S_1, \ldots, S_K\}$ of the entity set $I$, or a nominal variable whose categories correspond to classes $S_k$ ($k = 1, \ldots, K$), can be represented by a similarity matrix $s = (s_{ij})$ between $i, j \in I$, where $s_{ij} = 0$ for $i$ and $j$ from different classes and a positive real when $i$ and $j$ are from the same class. Consider two definitions:

M: $s_{ij} = 1$ if $i, j \in S_k$ for some $k = 1, \ldots, K$;

C: $s_{ij} = 1/N_k$ if $i, j \in S_k$ for some $k = 1, \ldots, K$, where $N_k$ is the number of entities in $S_k$.
In some situations, especially when the same clustering algorithm has been applied many times at different initial settings or subsamples, there can be many partitions to compare. A good many-to-many dis/similarity measure can be produced by averaging a pair-wise dis/similarity between partitions. Let us denote the given partitions on $I$ by $S^1, S^2, \ldots, S^m$ where $m > 2$. Then the average distance $M(\{S^t\})$ will be defined by the formula:

$$M(\{S^t\}) = \frac{1}{m^2}\sum_{u,w=1}^{m} M(S^u, S^w) \qquad (7.16)$$

where $M(S^u, S^w)$ is defined by formula (7.12).

An interesting measure was suggested in [101] based on the average partition matrix, which is an entity-to-entity similarity matrix defined by

$$\mu(i,j) = \frac{\sum_{t=1}^{m} s^t_{ij}}{m(i,j)}$$

where $s^t$ is the binary relation matrix corresponding to $S^t$, with $s^t_{ij} = 1$ when $i$ and $j$ belong to the same class of $S^t$ and $s^t_{ij} = 0$ otherwise, and with $m(i,j)$ denoting the number of partitions $S^t$ at which both $i$ and $j$ are present. The latter denotation concerns the case when (some of the) partitions $S^t$ have been found not necessarily at the entire entity set $I$ but at its sampled parts.

In the situation in which all $m$ partitions coincide, all values $\mu(i,j)$ are binary, being either 1 or 0 depending on whether $i$ and $j$ belong to the same class or not, which means that the distribution of values $\mu(i,j)$ is bimodal in this case. The further away the partitions are from coincidence, the further away the distribution of $\mu(i,j)$ is from the bimodal. Thus, the authors of [101] suggest watching for the distribution of $\mu(i,j)$, its shape and the area under the empirical cumulative distribution,

$$A(\{S^t\}) = \sum_{l=2}^{L} (\ldots)$$
Proof: According to the definition, $M(\{S^t\}) = \sum_{u,w=1}^{m}\sum_{i,j \in I} (s^u_{ij} - s^w_{ij})^2/m^2$. This can be rewritten as $M(\{S^t\}) = \sum_{u,w=1}^{m}\sum_{i,j \in I} (s^u_{ij} + s^w_{ij} - 2s^u_{ij}s^w_{ij})/m^2$, because $(s^t_{ij})^2 = s^t_{ij}$ for binary entries. Applying the operations to individual items, with $\mu(i,j)$ denoting the entries of the average partition matrix, we obtain $M(\{S^t\}) = \sum_{i,j \in I} \bigl(\sum_{w=1}^{m} \mu(i,j)/m + \sum_{u=1}^{m} \mu(i,j)/m - 2\mu(i,j)\mu(i,j)\bigr) = 2\sum_{i,j \in I} (\mu(i,j) - \mu(i,j)^2)$. The proof of the statement follows then from the definition of the variance of the matrix, q.e.d.

As proven in Statement 2.1, given the probability of values smaller than the average, the variance of a variable reaches its maximum when the variable is binary. The value $\mu$ in this case can be assigned the meaning of being such a probability. Then the formula in Statement 7.20 shows that the average distance $M$, in fact, measures the difference between the maximum and observed values of the variance of the average partition matrix. Either this difference or the variance itself can be used as an index of similarity between the observed distribution of values $\mu(i,j)$ and the ideal case at which all of them coincide.
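The identity used in the proof, $M(\{S^t\}) = 2\sum_{i,j}(\mu(i,j) - \mu(i,j)^2)$, is easy to verify numerically. The sketch below assumes that all partitions are defined on the entire entity set, so that $m(i,j) = m$ for every pair, and uses the unnormalized pairwise distance $\sum_{i,j}(s^u_{ij} - s^w_{ij})^2$ exactly as it appears in the proof; the function names are hypothetical.

```python
import numpy as np


def relation_matrix(labels):
    """Binary matrix s with s[i, j] = 1 when i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def average_partition_matrix(partitions):
    """Entry-wise mean of the relation matrices (all entities in all partitions)."""
    return np.mean([relation_matrix(p) for p in partitions], axis=0)


def average_distance(partitions):
    """Average over all ordered pairs (u, w) of the squared matrix difference."""
    mats = [relation_matrix(p) for p in partitions]
    m = len(mats)
    return sum(((su - sw) ** 2).sum() for su in mats for sw in mats) / m ** 2


if __name__ == "__main__":
    parts = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
    mu = average_partition_matrix(parts)
    print(average_distance(parts), 2 * (mu - mu ** 2).sum())
```

Both printed numbers coincide, whatever the partitions; when all partitions are identical, every $\mu(i,j)$ is 0 or 1 and the distance is zero.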
3. Least-squares approximation, in which the data are approximated with a low-rank data matrix [34, 65, 90].

Let us briefly discuss them in turn.
7.4.2 Conditional mean
Arguably the most popular method of imputation, at least among practitioners, is the substitution of a missing entry by the corresponding variable's mean, which will be referred to as the Mean algorithm. More subtle approaches use regression-based imputation, in which a regression-predicted value is imputed [73, 79]. Decision trees are used for handling missing categorical data in [111]. The Mean algorithm has been combined with nearest-neighbor based techniques in [130]. Two common features of the conditional mean approaches are: (1) missing entries are dealt with sequentially, one by one, and (2) most of them rely on a limited number of variables. The other approaches handle all missings simultaneously by taking advantage of all the available data entries.
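The Mean algorithm takes only a few lines. The sketch below assumes that missing entries are coded as NaN in a numeric entity-to-feature array and that every feature has at least one observed value; the function name `mean_impute` is hypothetical.

```python
import numpy as np


def mean_impute(X):
    """Replace each missing entry (NaN) by the mean of the observed values
    of the same feature (column)."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)          # means over observed entries only
    return np.where(np.isnan(X), col_means, X)


if __name__ == "__main__":
    X = [[1.0, np.nan],
         [3.0, 4.0],
         [np.nan, 8.0]]
    print(mean_impute(X))
```

In the usage lines the observed column means are 2 and 6, so these are the values imputed in the first and second column, respectively.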
7.4.3 Maximum likelihood
The maximum likelihood approach relies on a parametric model of data generation, typically, the multivariate Gaussian mixture model. The maximum likelihood method is applied for both fitting the model and imputation of the missing data. The most popular is the so-called expectation-maximization (EM) algorithm [118], which exploits the idea of alternating optimization to maximize the likelihood criterion as described in section 6.1.5. The popularity of the maximum likelihood approach is based on the fact that it is grounded on a precise statistical model. However, methods within this approach may involve unsubstantiated hypotheses and be computationally intensive.
2. Iterative majorized least squares (IMLS): Start by filling in all missing entries with some value such as zero, then iteratively approximate the thus completed data by updating the imputed values with those implied by a low-rank SVD based approximation [65].

These can be combined with nearest-neighbor based techniques as follows: take a row that contains a missing entry as the target entity $x_i$, find its $K$ nearest neighbors in $X$, and form a matrix $X_i$ consisting of the target entity and its neighbors. Then apply an imputation algorithm to the matrix $X_i$, imputing missing entries at the target entity only. Repeat this until all missing entries are filled in. Then output the thus completed data matrix.

A global-local approach proposed in [136] involves two stages. First stage: use a global imputation technique to fill in all missings in the original matrix $X$ so that entity-to-entity distances can be calculated with no concessions to the presence of missing values. Let us denote the resulting matrix $X^*$. Second stage: apply a nearest-neighbor based technique to fill in the missings in $X$ again but, this time, based on distances computed with the completed data matrix $X^*$. This global-local approach involving IMLS on both of the stages, in the beginning with $m = 4$ and in the end with $m = 1$, is referred to as algorithm INI in [136].

In experiments reported in [136], nearest-neighbor based least-squares imputation algorithms give similar results, outperforming other least-squares algorithms including the Mean and nearest-neighbor-Mean imputation. When the proportion of random missings grows to 10% and, moreover, 25%, INI becomes the only winner [136].

Overall, the subject is in early stages of development. Potential mechanisms of missings are not well defined yet.
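The IMLS iteration described above can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of [65] in every detail: missing entries are coded as NaN, the initial fill is zero, the low-rank approximation is the truncated SVD computed with NumPy, and the names `imls_impute`, `rank`, `n_iter` and `tol` are hypothetical.

```python
import numpy as np


def imls_impute(X, rank=1, n_iter=200, tol=1e-9):
    """Iteratively refill missing entries from a rank-`rank` SVD approximation.

    Observed entries are never changed; only NaN positions are updated.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, 0.0, X)            # start with zeros at the missings
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-`rank` fit
        new_filled = np.where(missing, approx, X)       # update imputed values only
        converged = np.max(np.abs(new_filled - filled)) < tol
        filled = new_filled
        if converged:
            break
    return filled


if __name__ == "__main__":
    X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])   # exactly rank one
    X[0, 0] = np.nan                                       # true value is 1
    print(imls_impute(X, rank=1)[0, 0])
```

On a matrix that is exactly of rank one with a single entry removed, the imputed value approaches the removed one. The nearest-neighbor versions apply the same routine to the submatrix formed by a target row and its neighbors.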
An internal validity index in clustering is a measure of correspondence between a cluster structure and the data from which it has been generated. The better the index value, the more reliable the cluster structure. We give just a few formulations.

1. Measures of cluster cohesion versus isolation
1.1. Silhouette width.
The silhouette width of an entity $j \in I$ [62] is a popular measure defined as:

$$sil(j) = (b(j) - a(j))/\max(a(j), b(j))$$
where $a(j)$ is the average dissimilarity of $j$ with its cluster and $b(j)$ the smallest average dissimilarity of $j$ from the other clusters. Values $a(j)$ and $b(j)$ measure cohesion and isolation, respectively. Entities with large silhouette width are well clustered while those with small width can be considered intermediate. The greater the average silhouette width, the better the clustering. The measure has shown good results in some experiments [109].

The point-biserial correlation between a partition and distances is a global measure that has shown very good results in experiments described in [84]. As any other correlation coefficient, it can be introduced in the context of the data recovery approach similar to that of the linear regression in section 5.1.2. We follow the presentation in [90]. Let us denote the matrix of between-entity distances by $D = (d_{ij})$. This can be just an input dissimilarity matrix, but in our context, $D$ is a matrix of squared Euclidean distances. For a partition $S = \{S_1, \ldots, S_K\}$ on $I$ let us consider the corresponding "ideal" dissimilarity matrix $s = (s_{ij})$ where $s_{ij} = 0$ if $i$ and $j$ belong to the same cluster $S_k$ for some $k = 1, \ldots, K$, and $s_{ij} = 1$ if $i$ and $j$ belong to different clusters. Both matrices are considered here at unordered pairs of different $i$ and $j$; the number of these pairs is obviously $N(N-1)/2$. Consider the coefficient of correlation (2.6) between these matrices as vectors of this dimension. With elementary transformations, it can be proven that the coefficient, $r(D, s)$, admits a closed-form expression, referred to as (7.19).
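Both indexes can be computed directly from a dissimilarity matrix and a vector of cluster labels. The sketch below is an illustration under assumptions: the correlation is evaluated as the ordinary Pearson coefficient over unordered pairs rather than through the closed form, singleton clusters are given silhouette width zero by convention, at least two clusters are assumed, and the function names are hypothetical.

```python
import numpy as np


def silhouette_widths(D, labels):
    """sil(j) = (b(j) - a(j)) / max(a(j), b(j)) for every entity j."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    widths = np.zeros(len(labels))
    for j in range(len(labels)):
        own = labels == labels[j]
        own[j] = False                      # j itself is excluded from its cluster
        if not own.any():
            continue                        # singleton cluster: width 0 (assumption)
        a = D[j, own].mean()                # cohesion
        b = min(D[j, labels == k].mean()    # isolation: nearest other cluster
                for k in np.unique(labels) if k != labels[j])
        widths[j] = (b - a) / max(a, b)
    return widths


def point_biserial(D, labels):
    """Correlation between distances and the 'ideal' 0/1 dissimilarity matrix,
    taken over unordered pairs of different entities."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    s = (labels[:, None] != labels[None, :])[iu].astype(float)
    return np.corrcoef(D[iu], s)[0, 1]


if __name__ == "__main__":
    x = np.array([0.0, 1.0, 10.0, 11.0])
    D = (x[:, None] - x[None, :]) ** 2      # squared Euclidean distances
    print(silhouette_widths(D, [0, 0, 1, 1]).mean())
    print(point_biserial(D, [0, 0, 1, 1]), point_biserial(D, [0, 1, 0, 1]))
```

In the usage lines, the partition that separates the two pairs of close points scores higher on both indexes than the partition that mixes them.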
partitioning [90], that is, finding a partition in which all within-cluster distances are equal to the same number $\alpha$ and all between-cluster distances are equal to the same number $\beta$, so that its distance matrix is equal to $\lambda s + \mu$ where $\mu = \alpha$, $\lambda = \beta - \alpha$ and $s = (s_{ij})$ is the dissimilarity matrix of partition $S$ defined above. In the framework of the data recovery approach, the problem of uniform partitioning is the problem of approximating matrix $D$ with an unknown matrix $\lambda s + \mu$. This is the problem of regression analysis of $D$ over $s$ with the added complication that $s$ is also unknown. Obviously, the least-squares solution to the uniform partitioning problem is the one that maximizes the correlation coefficient $r(D, s)$ (7.19). It appears that the problem indirectly involves the requirement that the cluster sizes should be balanced, which works well when the underlying clusters are of more or less similar sizes [90]. The criterion may fail, however, when the underlying clusters drastically differ in sizes. Good results shown by criterion (7.19) and its versions in experiments conducted by G. Milligan [84] may be due to the fact that the cardinalities of clusters generated in these experiments were quite similar to each other.

2. Indexes derived using the data recovery approach.

2.1. Attraction.

This is a data-recovery based analogue to the concept of silhouette width above, described in section 5.4.1. The attraction of entity $j \in I$ to its cluster is defined as $a(j) - a/2$, where $a(j)$ is the average similarity of $j$ to entities in its cluster and $a$ the average within-cluster similarity. If the data are given in feature-based format, the similarity is measured by the inner product of the corresponding row vectors. Otherwise, it is taken from data as described in section 5.5.5. Attraction indexes are positive in almost all clusterings obtained with the K-Means and Ward-like algorithms; still, the greater they are, the better the clustering. The average attraction coefficient can be used as a measure of cluster tightness.

2.2. Contribution of the cluster structure to the data scatter.

A foremost index of validity of a cluster structure in the data recovery approach is the measure of similarity between the observed data and that arising from a clustering model. Criterion $B(S, c)$, the cluster structure's contribution to the data scatter according to decomposition (5.13) in section 5.2, measures exactly this type of similarity: the greater it is, the better the partition fits the data. It should be pointed out that the data recovery based criteria are formulated in such a way that they seem to score cluster cohesion only. However, they implicitly do take into account cluster isolation as well, as proven in Chapter 5.
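The contribution of a cluster structure to the data scatter can be illustrated with a short computation. The sketch assumes that the data are centered at the grand mean, that centroids are within-cluster means, and that the decomposition takes the usual form $T = B + W$ with $B = \sum_k N_k (c_k, c_k)$; the function name `scatter_decomposition` is hypothetical.

```python
import numpy as np


def scatter_decomposition(X, labels):
    """Return the data scatter T, the cluster contribution B and the
    within-cluster square error W for centered data; T equals B + W."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Y = X - X.mean(axis=0)                   # shift the origin to the grand mean
    T = (Y ** 2).sum()
    B = 0.0
    W = 0.0
    for k in np.unique(labels):
        Yk = Y[labels == k]
        c = Yk.mean(axis=0)                  # centroid of cluster k
        B += len(Yk) * (c ** 2).sum()        # explained part
        W += ((Yk - c) ** 2).sum()           # unexplained part
    return T, B, W


if __name__ == "__main__":
    X = [[0, 0], [1, 0], [10, 10], [11, 10]]
    T, B, W = scatter_decomposition(X, [0, 0, 1, 1])
    print(T, B, W, B / T)
```

For the four points in the usage lines, $T = 201$, $B = 200$ and $W = 1$, so the two-cluster structure accounts for about 99.5% of the data scatter.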
3. Indexes derived from probabilistic clustering models.
A number of indexes have been suggested based on probabilistic models of cluster structure; for a review, see [8, 39]. Results of using these indexes highly depend on the accepted model which, typically, the user has no possibility of verifying.

Use of internal indexes to estimate the number of clusters

The issue of how to determine the number of clusters in K-Means and other partitional algorithms has attracted a great deal of attention in the literature. As explained in section 1.2.6, this issue is not always meaningful. However, this does not mean that a review of the many attempts should be glossed over. At first glance, any validity measure, such as the average silhouette width and uniform correlation explained above, can be used for selecting the number of clusters: just do clustering for a range of values of $K$ and select the one which makes the measure maximum (or minimum if the better fit corresponds to its decrease). This idea was explored by many authors with respect to the within-cluster variance, $W_K$, the value of $W(S, c)$ at the found $K$-cluster partition. The index as is obviously cannot be used because it monotonically decreases when $K$ grows: the greater the number of clusters, the better the fit. But its change, the first (or even second) difference, $(W_K - W_{K+1})/T$ where $T$ is the data scatter, can be considered a good signal: the number of clusters should stop rising when this difference becomes small. Hartigan [48] is credited with formally shaping a similar idea: take the index,
$h_K = (W_K/W_{K+1} - 1)(N - K - 1)$, and one by one increase $K$ starting from 1; stop calculations when $h_K$ becomes less than 10. An index by Calinski and Harabasz,
The latter seems especially suitable: calculate the average 'distortion' of an axis associated with fitting $K$ centroids to data with the K-Means algorithm, $w_K = W_K/M$, where $M$ is the number of features and $W_K$ the value of the K-Means square error criterion at a partition found with K-Means, and calculate 'jumps' $j_K = w_K^{-M/2} - w_{K-1}^{-M/2}$. With $w_0^{-M/2} = 0$ and $w_1 = T/M$, where $T$ is the data scatter, jumps can be defined for all reasonable values of $K = 0, 1, \ldots$. The maximum jump corresponds to the right number of clusters. This is supported with a mathematical derivation stating that if the data can be considered a standard sample from a mixture of Gaussian distributions and distances between centroids are great enough, then the maximum jump would indeed occur at $K$ equal to the number of Gaussian components in the mixture [128].

Reviews of these and other indexes of the best number of clusters can be found in [22], [138], [129]. Experiments conducted by the authors of [138] and others show that none of them can be considered a universal criterion. In fact, the straightforward option does not necessarily work even for the average silhouette width and the uniform correlation. In particular, [109] reports that somewhat better results could be obtained by using the average silhouette width indirectly, for determining that no further splits of clusters are needed: the smaller the average silhouette width within a split cluster, the more likely it should not be split at all.

There is no universal criterion because, apart from obvious cases such as clearly seen differences between rock and water, light and dark, horse and birch tree, clusters are not only in the data but also in the mind of the user. The granularity of the user's vision and between-cluster boundaries, when no natural boundaries can be seen, should be left to the user. This can be formalized by the introduction of relevant "external" criteria.
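Hartigan's stopping rule and the jump statistic described above both operate on the sequence of square-error values $W_1, W_2, \ldots$, so they can be sketched without committing to any particular K-Means implementation. In the sketch below the list `W` is assumed to hold $W_K$ at position $K - 1$, with $W_1 = T$ for centered data, and the term for $K = 0$ is taken to be zero; the function names are hypothetical.

```python
def hartigan_number_of_clusters(W, N, threshold=10.0):
    """Smallest K at which h_K = (W_K / W_{K+1} - 1)(N - K - 1) drops below
    the threshold; None if this never happens within the supplied range."""
    for K in range(1, len(W)):
        h = (W[K - 1] / W[K] - 1.0) * (N - K - 1)
        if h < threshold:
            return K
    return None


def jump_number_of_clusters(W, M):
    """K with the largest jump j_K = w_K^(-M/2) - w_{K-1}^(-M/2), w_K = W_K / M."""
    powered = [0.0] + [(w / M) ** (-M / 2.0) for w in W]   # leading 0 stands for K = 0
    jumps = [powered[K] - powered[K - 1] for K in range(1, len(powered))]
    return 1 + max(range(len(jumps)), key=jumps.__getitem__)


if __name__ == "__main__":
    W = [1000.0, 400.0, 100.0, 95.0, 92.0]    # illustrative values for K = 1..5
    print(hartigan_number_of_clusters(W, N=100))
    print(jump_number_of_clusters(W, M=2))
```

With the illustrative sequence in the usage lines, where the error stops dropping sharply after three clusters, both rules return $K = 3$.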
External indexes
The so-called external indexes compare a clustering found in data with another clustering either given by an expert or following from a knowledge domain or found by another clustering algorithm. Typically clusterings are sets or partitions of the entity set I . Thus, the external indexes are those described in sections 2.2.3 and 7.3.
GENERAL ISSUES
uations for each individual copy. The averaged score can be considered as a test result for the algorithm. In this way, one can select the best performing algorithm among those tested. This can also be applied to the selection of parameters, such as the number of clusters, with the same algorithm [78], [22]. The results on perturbed copies can be used to score confidence in various elements of the cluster structure found at the original data set. For instance, in [30], each of the hierarchic clusters is accompanied by the proportion of copies on which the same algorithm produced the same cluster: the greater the proportion, the greater the confidence.

D2 Averaging models. Models found with different copies can be averaged if they are of the same format. The averaging is not necessarily done by merely averaging numerical values. For instance, a set of hierarchical clustering structures can be averaged in a structure that holds only those clusters that are found in a majority of the set structures [80]. In [19], centroids found at subsamples are considered a data set which is clustered on its own to produce the averaged centroids.

D3 Combining models. When models have different formats, as in the case of decision trees that may have different splits over different features at different copies, the models can be combined to form a 'committee' in such a way that it is their predictions rather than themselves that are averaged. Such is the procedure referred to as bagging in [50].
7.5.3 Model selection with resampling
Since a quality score such as the number of misclassified entities or the contribution to the data scatter may be different over different samples, resampling can be used not only for testing, but for learning, too. Of a set of algorithms, the one is selected at which an accepted quality score coefficient is optimized. This is called model selection. Examples of model selection can be found in Efron and Tibshirani [25], p. 243-247 (selection of decision tree sizes), Salzberg [117], p. 193-194 (averaging probabilities obtained with split-sample-based decision trees) and the following subsections.
Let us specify a number $K$, perform K-Means on many random subsamples from the entity set $I$, define the average partition matrix $\mu(i,j)$ based on the partitions found and the area under its cumulative distribution, $A(K)$, as defined by formula (7.18). After having done this for $K = 2, 3, \ldots, K_{max}$, $B(K)$ is defined as the maximum of $A(K')$ over $K' = 2, \ldots, K$. Then the relative area increase is expressed as $\Delta(K) = (A(K+1) - B(K))/B(K)$, $K = 3, 4, \ldots$. At $K = 2$, $\Delta(K)$ is defined as $\Delta(2) = A(2)$. Based on a number of simulation studies, the authors of [101] claim that the maximum of this index is indicative of the true number of clusters.
keep = 0.5. Relative, not absolute, values of the errors appear in $E$ to address the situations in which $S$ is significantly smaller than $I - S$. The features occurring in the best comprehensive description are considered effective for learning set $S$ in $I$ and added to the feature set.
2. Selecting descriptions. At this stage, a number of independent samples from $(I, S)$ is generated again and an APPCOD-produced description is found at each of them. However, this time APPCOD is used as a feature selector, not extractor. Only features present after the first stage, not their combinations, are used in producing feature interval predicates. Then sample descriptions are aggregated according to a version of the majority rule: only features occurring in the majority of sample descriptions are selected. Their intervals are determined by averaging the left and right boundaries of intervals involved in the sample feature interval predicates.

The algorithm can be tuned up with two principal parameters: the maximum number of items permitted in a conjunctive description, $l$, and the number of arithmetic operations admitted in a combined feature, $f$. This proceeds in a loop over $f$ and $l$, starting with a value of 1 for each and doing the loop over $l$ within the loop over $f$. Based on our experimentation, these were set to $f = 2$ and the limits of $l$ from $l = 5(f - 1) + 1$ to $l = 5f$ by default. For any given pair, $l$ and $f$, the generalization process applies a number of times (three, by default) and the best result is picked up and stored [95].
Example 7.57. Generalization with resampling at Body mass data.
Applied to the Body mass data in Table 1.7, this method produces either the body mass index Weight/(Height*Height) or the less expressive variables Weight/Height and Height - Weight, depending on the numbers of random samples at each step of the generalization process described above. When both numbers are small, the less expressive variables tend to appear. When the number of samples at the Aggregating step increases, the method produces the body mass index. This tendency is much less expressed when the number of samplings increases at the Generalization step. Curiously, these tendencies practically do not depend on the sample size.
Example 7.58. Binary data: Republicans and democrats at the US congressional voting
Let us describe the results of applying the APPCOD based model selection method to the data set of 1984 United States Congressional Voting Records from [113], which contains records of sixteen yes/no votings by 267 democrats and 168 republicans. It appears that the set of republicans can be described by just one predicate, p/esa = 1, admitting 11 FP and 0 FN, 2.53% of total error. Here p and esa are abbreviations of the issues "physician-fee-freeze" and "export-administration-act-south-africa," and 1 codes 'yes' and 0 'no'. This means that all republicans, and only 11 democrats, voted consistently on both issues, either both yes or both no, which cannot be described that concisely without the operation of division. Another issue: the method describes each individual class asymmetrically, which allows us to check which of the classes, republican or democrat, is more coherent. Because of the asymmetry in APPCOD, one can safely claim that the class whose description has the minimum error is more coherent. Descriptions of democrats found at various subsamples always had 3-4 times more errors than descriptions of republicans, which fits very well into the popular images of these parties [95].
7.6 Overall assessment
The chapter reviews a number of issues of current interest in clustering as part of data mining.

First, the issue of finding a relevant feature space is considered. There are two approaches here: feature selection and feature transformation (extraction). Automatic transformation of the feature space is a major problem in data mining which is yet to be properly addressed. Arithmetically combining features within the context of a decision rule maker, specifically, the comprehensive description algorithm, can be of value in attacking the problem.

Second, issues and approaches in data pre-processing and standardization, including dealing with missing data, are discussed; advantages of the data recovery approach are pointed out.

Third, issues of cluster validity, including those of determining the "right" number of clusters, are addressed in terms of: (a) internal and external indexes and (b) data resampling. Internal indexes show the correspondence between data and clusters derived from them. Obviously, the best source of such indexes are the data recovery criteria and related coefficients. Unfortunately, none of the indexes proposed so far can be considered a universal tool because of both the diversity of data and cluster structures and the different levels of granulation required in different problems. External indexes show correspondence between data-driven clusters and those known from external considerations. We provide a description of such indexes in section 7.3 in such a way that correspondences between set-to-set and partition-to-partition indexes are established, as well as relations between structural and association indexes. Data resampling as a tool for model testing and selection is a relatively new addition. A systematic review of the approaches is given based on the most recent publications. A data resampling based model for learning a comprehensive description of a cluster is discussed.
CLUSTERING FOR DATA MINING
position among the entity points (this position is taken to be the grand mean).

(b) The data scatter is the sum of feature contributions that are proportional to their variances, thus reflecting the distribution shapes; this allows for tuning feature normalization options by separating scale and shape related parts.

(c) The strategy of binary coding of the qualitative categories in order to simultaneously process them together with quantitative features, which cannot be justified in conventional frameworks, is supported in this framework with the following:

i. Binary features appear to be the ultimate form of quantitative features, those maximally contributing to the data scatter.

ii. The explained parts of category contributions sum up to association contingency coefficients that already have been heavily involved in data analysis and statistics, though from a very different perspective.

iii. The association coefficients are related to the data normalization options, which can be utilized to facilitate the user's choice among the latter; this can be done now from either end, the process's input or output, or both.

iv. The equivalent entity-to-entity similarity measure which has emerged in the data recovery context is akin to the best heuristic similarity measures but goes even further by taking into account the information weights of categories.
contributions of individual clusters and features. In contrast, the decomposition for agglomerative clustering hides the structure of the explained part, which probably can explain why no specific tools for the interpretation of cluster hierarchies have been proposed before.

By exploiting the additive structure of the data recovery models in the manner following that of the PCA method, one-by-one clustering methods are proposed to allow for effective computational schemes as well as greater flexibility in a controlled environment. In particular, the intelligent version of K-Means, iK-Means, can be used for incomplete clustering with removal of devious or, in contrast, overly normal items, if needed.

Local search algorithms presented lead to provably tight clusters, the fact expressed with the attraction coefficient, a theory-based analogue to popular criteria such as the silhouette width coefficient.

The approach is extended to contingency and flow data by taking into account the property that each entry is a part of the whole. The entries are naturally standardized into the Quetelet coefficients; the corresponding data scatter appears to be equal to the chi-squared contingency coefficient.

The inner product can be used as an equivalent device in the correspondingly changed criteria, thus leading to similarity measures and clustering criteria; some of those are quite popular and some are new, still being similar to those in use.
[137] A. Webb (2002) Statistical Pattern Recognition, Chichester, England: J. Wiley & Sons.
[138] A. Weingessel, E. Dimitriadou, and S. Dolnicar (1999) An examination of indexes for determining the number of clusters in binary data sets, Working Paper No. 29, Vienna University of Economics, Wien, Austria.
[139] S.M. Weiss, N. Indurkhya, T. Zhang, and F.J. Damerau (2005) Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer Science+Business Media.
[140] D. Wishart (1999) The ClustanGraphics Primer, Edinburgh: Clustan Limited.
[141] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery, and W.L. Ruzzo (2001) Model-based clustering and data transformations for gene expression data, Bioinformatics, 17, no. 10, 977-987.
[142] S. Zhong and J. Ghosh (2003) A unified framework for model-based clustering, Journal of Machine Learning Research, 4, 1001-1037.