Lecture Monday 07 March 2005: CLUSTERING!
Document clustering
Papers covered:
* Tao Li, Sheng Ma, Mitsunori Ogihara. Document clustering via adaptive subspace iteration. Proceedings of the 27th annual international conference on Research and development in information retrieval (2004)
* Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths. Probabilistic author-topic models for information discovery. Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining (2004)
Presented by Elxsas, J.
1. What is clustering?
2. Two articles
3. Discussion
Jain, Murty, Flynn. Data clustering: a review. ACM Computing Surveys. Vol 31, No. 3, pp. 264-323, 1999.
Partitional vs hierarchical clustering
k-means: partitional; minimizes squared error between points and their cluster centroids
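A minimal sketch of the k-means idea in the note above, assuming 2-D points as tuples and squared Euclidean distance. The function names (`kmeans`, `squared_error`) and the toy points are made up for illustration and do not come from the lecture or the Jain, Murty, Flynn survey.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid, until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialise from k random points
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return assignment, centroids

def squared_error(points, assignment, centroids):
    """The criterion k-means reduces: total squared distance to assigned centroids."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment, squared_error(points, assignment, centroids))
```

The result is a flat partition (one label per point), which is what distinguishes this family from hierarchical methods that return a tree of nested clusters.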
text clustering: locate topically similar(?) documents
open problems:
- document representation: how do you represent docs so that these algorithms can take advantage of them?
- document or term similarity: how is similarity measured?
- cluster interpretation: what does that cluster mean?
- cluster evaluation: are the clusters any good?
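To make the representation and similarity questions above concrete, here is one common choice, sketched under assumptions: each document becomes a bag of term counts and similarity is the cosine between count vectors. The tokenizer (lowercase, split on whitespace) and the example sentences are placeholders, not something the lecture prescribes.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words representation: term -> count (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical example documents.
a = term_vector("document clustering groups similar documents")
b = term_vector("clustering groups documents by topic")
c = term_vector("author topic models")

print(cosine_similarity(a, b))  # shares several terms with a
print(cosine_similarity(a, c))  # shares no terms with a
```

Note that "document" and "documents" count as different terms here; whether to stem, weight, or otherwise normalise terms is exactly the kind of representation decision the first open problem is about.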
Li, Ma, Ogihara: Document clustering via adaptive subspace iteration
- addresses document similarity & cluster interpretation
- subspace clustering
ASI:
in: M: term-document matrix
k: number of clusters (k can be estimated)
out:
D: cluster for each doc
F: subspace for each cluster
partitions docs into k groups (optimizing D)
reduces distance between docs and centroids
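The input M listed above can be built from raw text roughly as follows. This is only a sketch of the data structure, not of ASI itself: the orientation (terms as rows, documents as columns), the raw-count weighting, and the function name `term_document_matrix` are all assumptions made for illustration.

```python
def term_document_matrix(docs):
    """Return (vocabulary, M) where M[i][j] is the count of term i in document j."""
    vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
    row_of = {term: i for i, term in enumerate(vocabulary)}
    matrix = [[0] * len(docs) for _ in vocabulary]
    for j, doc in enumerate(docs):
        for term in doc.lower().split():
            matrix[row_of[term]][j] += 1
    return vocabulary, matrix

# Hypothetical mini-corpus.
docs = ["cluster the documents", "cluster the terms", "topic model"]
vocabulary, M = term_document_matrix(docs)

for term, row in zip(vocabulary, M):
    print(f"{term:10s} {row}")
```

In the same spirit, the output D can be pictured as one cluster label per column of M, and F as a set of term weights per cluster.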
help me out: why should I expect my clusters to correspond to topics? Because we want them to? Clusters aren't being defined on one-dimensional topics; we know they probably aren't, and we don't know exactly what they mean, even when they are explicitly described.
mutual information score!!!
can be used for disambiguation: Schütze, p. 239
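The notes do not say how the mutual information score is defined, so the following is the standard, unnormalised mutual information between two labelings, assumed here to be cluster assignments and reference topic labels. The function name and the toy labelings are illustrative only.

```python
import math
from collections import Counter

def mutual_information(clusters, labels):
    """I(C; L) in bits, estimated from co-occurrence counts of two labelings."""
    n = len(clusters)
    joint = Counter(zip(clusters, labels))
    cluster_counts = Counter(clusters)
    label_counts = Counter(labels)
    mi = 0.0
    for (c, l), n_cl in joint.items():
        p_joint = n_cl / n
        p_c = cluster_counts[c] / n
        p_l = label_counts[l] / n
        mi += p_joint * math.log2(p_joint / (p_c * p_l))
    return mi

# Hypothetical labelings of four documents.
reference = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]    # clusters match the reference topics exactly
unrelated = [0, 1, 0, 1]  # clusters are independent of the reference topics

print(mutual_information(perfect, reference))
print(mutual_information(unrelated, reference))
```

A higher score means that knowing a document's cluster tells you more about its reference label; the score is zero when the two labelings are statistically independent.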
Probabilistic author-topic models for info discovery
modeling topics: a model of word generation based on statistics
modeling authors: model of author association based on topics and terms
showing increase and decrease of topics over time
association of words with language fashions: e.g., use of Greek letterings vs. French letterings
assigning topics to unseen documents; separating two abstracts that have been merged; detecting an author's surprising papers - problems well-defined by the data the system has...the method lends itself to such things by nature
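The "model of word generation" in these notes can be pictured as a sampling procedure: for each word position, pick one of the document's authors, pick a topic from that author's topic distribution, then pick a word from that topic's word distribution. The sketch below follows that reading; the author names, topics, words, and probabilities are invented placeholders, and it shows only generation, not how the distributions are learned.

```python
import random

def sample(distribution, rng):
    """Draw one key from a dict mapping outcome -> probability."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_document(authors, author_topics, topic_words, length, rng):
    """Generate (author, topic, word) triples, one per word position."""
    triples = []
    for _ in range(length):
        author = rng.choice(authors)               # uniform over the document's authors
        topic = sample(author_topics[author], rng)  # author -> topic
        word = sample(topic_words[topic], rng)      # topic -> word
        triples.append((author, topic, word))
    return triples

# Hypothetical toy parameters.
authors = ["author_a", "author_b"]
author_topics = {
    "author_a": {"clustering": 0.9, "retrieval": 0.1},
    "author_b": {"clustering": 0.2, "retrieval": 0.8},
}
topic_words = {
    "clustering": {"centroid": 0.5, "cluster": 0.5},
    "retrieval": {"query": 0.6, "ranking": 0.4},
}

document = generate_document(authors, author_topics, topic_words,
                             length=8, rng=random.Random(0))
for author, topic, word in document:
    print(author, topic, word)
```

Read this way, the applications listed above fall out of the same tables: an unseen document is scored against `topic_words`, and a paper is "surprising" for an author when its words are unlikely under that author's row of `author_topics`.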
EM clustering
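The notes mention EM clustering by name only. As a reminder of what the technique looks like, here is a generic EM sketch for a mixture of two 1-D Gaussians; it is not claimed to be the procedure used in either paper, and the initialisation, the variance floor, and the toy data are assumptions.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with expectation-maximisation."""
    means = [min(data), max(data)]  # assumption: start at the extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    responsibilities = []
    for _ in range(iterations):
        # E-step: soft assignment of each point to each component.
        responsibilities = []
        for x in data:
            scores = [w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances)]
            total = sum(scores)
            responsibilities.append([s / total for s in scores])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            r_k = sum(r[k] for r in responsibilities)
            weights[k] = r_k / len(data)
            means[k] = sum(r[k] * x for r, x in zip(responsibilities, data)) / r_k
            spread = sum(r[k] * (x - means[k]) ** 2
                         for r, x in zip(responsibilities, data)) / r_k
            variances[k] = max(spread, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances, responsibilities

# Hypothetical data: two groups centred near 1 and 5.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
weights, means, variances, responsibilities = em_two_gaussians(data)
hard_labels = [max(range(2), key=lambda k: r[k]) for r in responsibilities]
print(means, weights, hard_labels)
```

Unlike k-means, each point keeps a degree of membership in every cluster (the responsibilities); hard labels are only read off at the end.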