INLS 110-122 Spring 2005 Knowledge Discovery: Lecture Monday 07 March 2005: CLUSTERING!

Document clustering

Papers covered:

* Tao Li, Sheng Ma, Mitsunori Ogihara. Document clustering via adaptive subspace iteration. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (2004)

* Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths. Probabilistic author-topic models for information discovery. Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining (2004)

Presented by Elxsas, J.

1. What is clustering?
2. Two articles
3. Discussion

Jain, Murty, Flynn. Data clustering: a review. ACM Computing Surveys. Vol 31, No. 3, pp. 264-323, 1999.

Partitional vs. hierarchical clustering
k-means: partitional; minimizes squared error between docs and their cluster centroids
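
A minimal sketch of k-means as a squared-error partitional method, to make the note above concrete. This is not code from the lecture or from Jain et al.; the function names (`kmeans`, `squared_error`), the random initialization from the data points, and the toy 2-D points are all assumptions for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iterations=100, seed=0):
    """Return (assignment, centroids) for a simple k-means run."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so we have converged
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

def squared_error(points, assignment, centroids):
    """Total squared distance of points to their assigned centroids."""
    return sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignment))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment, squared_error(points, assignment, centroids))
```

Each pass can only lower (or keep) the total squared error, which is why the loop stops once the assignment no longer changes. The result can still depend on the starting centroids.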

text clustering: locate topically similar(?) documents

open problems:
- document representation: how do you represent docs so that these algorithms can take advantage of them?
- document or term similarity: how is similarity measured?
- cluster interpretation: what does that cluster mean?
- cluster evaluation: are the clusters any good?
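
One common answer to the representation and similarity questions above, sketched here as an assumption rather than something stated in the lecture: treat each doc as a bag-of-words term-count vector and compare docs with cosine similarity. The helper names `term_vector` and `cosine_similarity` and the whitespace tokenizer are hypothetical choices for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words counts; assumes plain whitespace tokenization."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)

a = term_vector("topic models for text")
b = term_vector("models for text clustering")
c = term_vector("protein folding simulation")
print(cosine_similarity(a, b))  # three of four terms shared
print(cosine_similarity(a, c))  # no shared terms
```

Raw counts ignore word order and treat every term as equally informative; weighting schemes such as tf-idf are a usual refinement.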

Li, Ma, Ogihara: Document clustering via adaptive subspace iteration
- addresses document similarity & cluster interpretation
- subspace clustering

ASI:
in: M: term-document matrix
k: number of clusters (k can be estimated)
out:
D: cluster for each doc
F: subspace for each cluster

partitions docs into k groups (optimizing D)
reduces distance between docs and centroids

Help me out: why should I expect my clusters to correspond to topics? Because we want them to? Clusters aren't being defined on one-dimensional topics; we know they're probably not, and we don't know exactly what they mean, even when they are explicitly described.

mutual information score!!!
can be used for disambiguation: Schütze, p. 239
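
A rough sketch of a mutual information score, read here as the agreement between cluster assignments and reference topic labels. That reading is an assumption; the notes do not say exactly how the score is applied. The function name `mutual_information` and the toy labels are made up for illustration.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi

clusters = [0, 0, 1, 1]
print(mutual_information(clusters, ["x", "x", "y", "y"]))  # 1.0: clusters match topics
print(mutual_information(clusters, ["x", "y", "x", "y"]))  # 0.0: clusters say nothing about topics
```

The score does not care what the clusters are called, only whether knowing the cluster tells you the topic.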

Steyvers, Smyth, Rosen-Zvi, Griffiths: Probabilistic author-topic models for information discovery

modeling topics: a model of word generation based on statistics
modeling authors: model of author association based on topics and terms
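
A toy sketch of the generative story behind the author-topic model: for each word, pick one of the doc's authors at random, pick a topic from that author's topic distribution, then pick a word from that topic's word distribution. The function name `generate_document`, the author names, and every probability below are invented for illustration, and this only generates text; it does not fit the model to data.

```python
import random

def generate_document(authors, author_topics, topic_words, length, seed=0):
    """Sample `length` words for a document written by `authors`."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        # 1. Choose an author uniformly from the document's author list.
        author = rng.choice(authors)
        # 2. Choose a topic from that author's distribution over topics.
        topics, topic_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_probs)[0]
        # 3. Choose a word from that topic's distribution over words.
        vocab, word_probs = zip(*topic_words[topic].items())
        words.append(rng.choices(vocab, weights=word_probs)[0])
    return words

# Hypothetical toy parameters, not taken from the paper.
author_topics = {
    "author_a": {"clustering": 0.9, "retrieval": 0.1},
    "author_b": {"clustering": 0.2, "retrieval": 0.8},
}
topic_words = {
    "clustering": {"centroid": 0.5, "partition": 0.3, "cluster": 0.2},
    "retrieval": {"query": 0.6, "ranking": 0.4},
}
print(generate_document(["author_a", "author_b"], author_topics, topic_words, length=8))
```

Learning runs this story in reverse: given the words and author lists, estimate the author-topic and topic-word distributions.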

showing increase and decrease of topics over time

association of words with language fashions: e.g., use of greek letterings vs. french letterings

assigning topics to unseen documents; separating two abstracts that have been merged; detecting an author's surprising papers - problems well-defined by what data we have by the system...the method recommends such things by nature

EM clustering
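
A small sketch of EM used as soft clustering. The notes do not say which model was meant, so this assumes a mixture of two one-dimensional Gaussians. The function name `em_two_gaussians`, the min/max initialization, and the toy data are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    # Assumption: a deterministic start at the extremes of the data.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: soft assignment (responsibility) of each point to each component.
        resp = []
        for x in data:
            scores = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                      for j in range(2)]
            total = sum(scores)
            if total == 0:
                resp.append([0.5, 0.5])  # guard against numerical underflow
            else:
                resp.append([s / total for s in scores])
        # M-step: re-estimate each component from its weighted points.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # component has no support; leave it unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # keep the variance strictly positive
    return weights, means, variances

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

Unlike k-means, every point belongs to every cluster with some probability; hard clusters come from taking the most probable component.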
