|Title of host publication||Encyclopedia of Data Warehousing and Mining|
|Place of Publication||Hershey, PA|
|Number of pages||6|
|State||Published - 2009|
Cluster analysis is a set of statistical models and algorithms that attempt to find “natural groupings” of sampling units (e.g., customers, survey respondents, plant or animal species) based on measurements. The observable measurements are sometimes called manifest variables, and cluster membership is called a latent variable. It is assumed that each sampling unit comes from one of K clusters or classes, but the cluster identifier cannot be observed directly and can only be inferred from the manifest variables. See Bartholomew and Knott (1999) and Everitt, Landau and Leese (2001) for a broader survey of existing methods for cluster analysis. Many applications in science, engineering, social science, and industry require grouping observations into “types.” Identifying typologies is challenging, especially when the responses (manifest variables) are categorical. The classical approach to cluster analysis for such data is latent class analysis (LCA), in which the manifest variables are assumed to be independent conditional on the cluster identity. For example, Aitkin, Anderson and Hinde (1981) classified 468 teachers into clusters according to their binary responses to 38 teaching style questions. This conditional independence assumption in classical LCA is often violated, and it seems to have been made out of convenience rather than because it is reasonable in a wide range of situations. For example, in the teaching styles study two of the questions are “Do you usually allow your pupils to move around the classroom?” and “Do you usually allow your pupils to talk to one another?” Responses to these questions are most likely correlated even within a class.
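The conditional independence assumption of classical LCA can be written out as a short sketch for binary responses. The notation is an assumption introduced here for illustration and does not come from the source: x = (x_1, ..., x_J) is one unit's vector of J binary manifest variables (J = 38 in the teaching styles example), z is the latent cluster label, \pi_k is the proportion of units in cluster k, and \theta_{kj} is the probability of a positive response to item j within cluster k.

```latex
% Assumed notation (not from the source): x_j \in \{0,1\}, z \in \{1,\dots,K\},
% \pi_k = P(z = k), \theta_{kj} = P(x_j = 1 \mid z = k).
\begin{align}
  % Local (conditional) independence: items are independent within a cluster.
  P(x \mid z = k) &= \prod_{j=1}^{J} \theta_{kj}^{x_j} (1 - \theta_{kj})^{1 - x_j}, \\
  % Marginal distribution of the responses: a finite mixture over the K clusters.
  P(x) &= \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} \theta_{kj}^{x_j} (1 - \theta_{kj})^{1 - x_j},
  \qquad \sum_{k=1}^{K} \pi_k = 1, \\
  % Posterior cluster membership, used to assign a unit to a cluster.
  P(z = k \mid x) &= \frac{\pi_k \, P(x \mid z = k)}{\sum_{l=1}^{K} \pi_l \, P(x \mid z = l)}.
\end{align}
```

Under the first equation, any two items are uncorrelated within a cluster. If the “move around” and “talk to one another” responses remain correlated among teachers in the same cluster, the product form no longer holds, which is the violation described above.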