Wednesday, January 26, 2005

Wednesday 26 January 2005 Lecture: Clustering

Clustering


Qualities of a similarity metric
Clustering algos
evaluation discussion



finding natural groupings among collection of objects into k sets such that the avg distance of points form the centroid of their assigned group is minimized
centroid: average of coordinates in each dimension
mimimum avg pairwise distance

question: we typically pick a k, cluster, then consider whether we'd like to change k. are there more systematic ways of doing this? setting k to a range; have a formal means for deciding between k values

how is unsupervided learning different from dredging?

finding patterns and then assigning meaning


we can come up with a set of features
clustering is subjective because deciding what groupings are best is a subjective process, and deciding what serves as the basis for similarity is likewise subjective


Distance metric props
symmetry
constancy of self-similarity
positivity (separation)
triangular inequality

two types of clustering
partitional
hierarchical/subspace


use dendograms: root/terminal branch/internal branch/internal node/leaf

desirable props of a clustering algo
scalabilty
ability to dela with diff data types
minimal req's for domain knowledge to determine input params
able to deal with noice & outliers

(number of objects)!/2 - number of objects


hierarchival is O(n2); exponentials don't scale well


simple clustering: with numeric data only - K-means
pick a number of (K) of cluster centers
move each centroid to the mean of each cluster
repeat (2,3) until movement is less than a sentinel value

value of your k selection

inter and intracluser distance values



partitioned clustering
----------------------

how do we evaluate clustering?


birch algo example

question:
we're hoping that our clusters are somehow meaningful summaries of subsets of the attributes
but aren't we somehow more prone to data dredging in unsupervised approaches

well, everything is subject to verification, validity, meaningfulness

Distance revisited: measuring a distance been an object and a cluster or two clusters
single linkage (or nearest neighbor) - distance of the two clostest objects in the different clusters
complete linkage (furthest neighbor) - greatest distance between any two objects in the different clusters
group average


For class presentation:

B1 - Introduction to KD est Feb 9
Don R. Swanson & Neil R. Smalheiser An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artifical Intelligence 91:183-203; 1997.

B8 - Document Summarization est Mar 28 or 30
Generating Patient-Specific Summaries of Online Literature
Kathleen R. McKeown, Desmond A. Jordan, Vasileios Hatzivassiloglou
*http://www.ics.uci.edu/~pratt/courses/papers/TextMining/patient-specific-summaries.pdf

0 Comments:

Post a Comment

<< Home