Wednesday 26 January 2005 Lecture: Clustering
Clustering
Qualities of a similarity metric
Clustering algos
evaluation discussion
finding natural groupings among collection of objects into k sets such that the avg distance of points form the centroid of their assigned group is minimized
centroid: average of coordinates in each dimension
mimimum avg pairwise distance
question: we typically pick a k, cluster, then consider whether we'd like to change k. are there more systematic ways of doing this? setting k to a range; have a formal means for deciding between k values
how is unsupervided learning different from dredging?
finding patterns and then assigning meaning
we can come up with a set of features
clustering is subjective because deciding what groupings are best is a subjective process, and deciding what serves as the basis for similarity is likewise subjective
Distance metric props
symmetry
constancy of self-similarity
positivity (separation)
triangular inequality
two types of clustering
partitional
hierarchical/subspace
use dendograms: root/terminal branch/internal branch/internal node/leaf
desirable props of a clustering algo
scalabilty
ability to dela with diff data types
minimal req's for domain knowledge to determine input params
able to deal with noice & outliers
(number of objects)!/2 - number of objects
hierarchival is O(n2); exponentials don't scale well
simple clustering: with numeric data only - K-means
pick a number of (K) of cluster centers
move each centroid to the mean of each cluster
repeat (2,3) until movement is less than a sentinel value
value of your k selection
inter and intracluser distance values
partitioned clustering
----------------------
how do we evaluate clustering?
birch algo example
question:
we're hoping that our clusters are somehow meaningful summaries of subsets of the attributes
but aren't we somehow more prone to data dredging in unsupervised approaches
well, everything is subject to verification, validity, meaningfulness
Distance revisited: measuring a distance been an object and a cluster or two clusters
single linkage (or nearest neighbor) - distance of the two clostest objects in the different clusters
complete linkage (furthest neighbor) - greatest distance between any two objects in the different clusters
group average
For class presentation:
B1 - Introduction to KD est Feb 9
Don R. Swanson & Neil R. Smalheiser An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artifical Intelligence 91:183-203; 1997.
B8 - Document Summarization est Mar 28 or 30
Generating Patient-Specific Summaries of Online Literature
Kathleen R. McKeown, Desmond A. Jordan, Vasileios Hatzivassiloglou
*http://www.ics.uci.edu/~pratt/courses/papers/TextMining/patient-specific-summaries.pdf
Qualities of a similarity metric
Clustering algos
evaluation discussion
finding natural groupings among collection of objects into k sets such that the avg distance of points form the centroid of their assigned group is minimized
centroid: average of coordinates in each dimension
mimimum avg pairwise distance
question: we typically pick a k, cluster, then consider whether we'd like to change k. are there more systematic ways of doing this? setting k to a range; have a formal means for deciding between k values
how is unsupervided learning different from dredging?
finding patterns and then assigning meaning
we can come up with a set of features
clustering is subjective because deciding what groupings are best is a subjective process, and deciding what serves as the basis for similarity is likewise subjective
Distance metric props
symmetry
constancy of self-similarity
positivity (separation)
triangular inequality
two types of clustering
partitional
hierarchical/subspace
use dendograms: root/terminal branch/internal branch/internal node/leaf
desirable props of a clustering algo
scalabilty
ability to dela with diff data types
minimal req's for domain knowledge to determine input params
able to deal with noice & outliers
(number of objects)!/2 - number of objects
hierarchival is O(n2); exponentials don't scale well
simple clustering: with numeric data only - K-means
pick a number of (K) of cluster centers
move each centroid to the mean of each cluster
repeat (2,3) until movement is less than a sentinel value
value of your k selection
inter and intracluser distance values
partitioned clustering
----------------------
how do we evaluate clustering?
birch algo example
question:
we're hoping that our clusters are somehow meaningful summaries of subsets of the attributes
but aren't we somehow more prone to data dredging in unsupervised approaches
well, everything is subject to verification, validity, meaningfulness
Distance revisited: measuring a distance been an object and a cluster or two clusters
single linkage (or nearest neighbor) - distance of the two clostest objects in the different clusters
complete linkage (furthest neighbor) - greatest distance between any two objects in the different clusters
group average
For class presentation:
B1 - Introduction to KD est Feb 9
Don R. Swanson & Neil R. Smalheiser An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artifical Intelligence 91:183-203; 1997.
B8 - Document Summarization est Mar 28 or 30
Generating Patient-Specific Summaries of Online Literature
Kathleen R. McKeown, Desmond A. Jordan, Vasileios Hatzivassiloglou
*http://www.ics.uci.edu/~pratt/courses/papers/TextMining/patient-specific-summaries.pdf
0 Comments:
Post a Comment
<< Home