INLS 110-122 Spring 2005 Knowledge Discovery: March 2005

Wednesday, March 30, 2005

Lecture Wednesday 30 March 2005

Shift focus from retrieval to synthesis

Users making consequential decisions

Information systems in augmentative role

130k+ breast cancer articles (MeSH term: "breast neoplasm")
cancer 'cures' often worse than symptoms
beyond age + gender, risk factors don't explain half of breast cancer incidence

desirable properties of a system:
integrate well with work practices

In epidemiology, because of the non-experimental design of the studies, we should expect results to be controversial (http://www.hsph.harvard.edu/Organizations/DDIL/gep_PE.html)
risk factors should be skeptically treated in exploring any notion of causality

Cochrane reviews (http://www.cochrane.org/index2.htm)

Are the results from Epi studies, RCTs, or mixtures?

Cochrane reviews pertain mainly to RCTs:
"The Cochrane Collaboration and the Cochrane Reviewers’ Handbook focus particularly on systematic reviews of randomised controlled trials (RCTs) because they are likely to provide more reliable information than other sources of evidence on the differential effects of alternative forms of healthcare (Kunz 2003). Systematic reviews of other types of evidence can also help those wanting to make better decisions about healthcare, particularly forms of care where RCTs have not been done and may not be possible or appropriate. The basic principles of reviewing research are the same, whatever type of evidence is being reviewed. Although we focus mainly on systematic reviews of RCTs we address issues specific to reviewing other types of evidence when this is relevant. Fuller guidance on such reviews is being developed."
(http://www.cochrane.dk/cochrane/handbook/1_introduction/1_0_introduction.htm)

Cohort studies: (http://servers.medlib.hscbklyn.edu/ebm/2400.htm)
"A Cohort Study is a study in which patients who presently have a certain condition and/or receive a particular treatment are followed over time and compared with another group who are not affected by the condition under investigation.

"For instance, since a randomized controlled study to test the effect of smoking on health would be unethical, a reasonable alternative would be a study that identifies two groups, a group of people who smoke and a group of people who do not, and follows them forward through time to see what health problems they develop.

"Cohort studies are not as reliable as randomized controlled studies, since the two groups may differ in ways other than in the variable under study. For example, if the subjects who smoke tend to have less money than the non-smokers, and thus have less access to health care, that would exaggerate the difference between the two groups.

"The main problem with cohort studies, however, is that they can end up taking a very long time, since the researchers have to wait for the conditions of interest to develop. Physicians are, of course, anxious to have meaningful results as soon as possible, but another disadvantage with long studies is that things tend to change over the course of the study. People die, move away, or develop other conditions, new and promising treatments arise, and so on. Even so, cohort studies are generally preferred to case control studies, since they involve far fewer statistical problems and generally produce more reliable answers. "

More on Epi tests/experimental design:
http://www.vetmed.wsu.edu/courses-jmgay/GlossClinStudy.htm

Concept of "ever drinker"
working in tox labs, dosage was everything

I want to see dimensions that express dosage mean & variance
and clinical experiment OR observational study
use the height of the bar or the shade/intensity of color

Monday, March 28, 2005

Lecture Monday 28 March 2005

Visualization of text data

ThemeRiver
Reminds me of babynamewizard
http://babynamewizard.com/namevoyager/lnv0105.html

Indicates a growth in newsfeed volume around certain events, i.e., that there's more text.

Jon: frustrated with the neglect of some dimensions

Kohonen maps (SOM)

Pratt's DynaCat (1997):
approach is to establish predefined questions
then generate dynamic categories--hierarchical categories

Wednesday, March 23, 2005

Lecture Wednesday 23 March 2005

Clifton, C, Cooley, R, , JM, and Rauch, J; TopCat: data mining for topic identification in a text corpus. in Principles of Data Mining and Knowledge Discovery. Third European

Experimental Design:
Factor: something you are changing
keywords being assigned
interest instead of support-confidence
person-org-place: using this representation other entities,
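
A rough sketch of the "interest" note above, assuming interest means the ratio P(A,B) / (P(A) P(B)) as it is commonly defined for association rules (also called lift). The function name and the toy entity sets are made up for illustration and are not taken from the TopCat paper.

```python
def rule_measures(transactions, a, b):
    """Return (support, confidence, interest) for the rule a -> b.

    transactions: list of sets of items (here, named entities per document).
    """
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_b = sum(1 for t in transactions if b in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)

    support = count_ab / n
    confidence = count_ab / count_a if count_a else 0.0
    # interest = P(a, b) / (P(a) * P(b)); a value of 1.0 means independence
    if count_a and count_b:
        interest = (count_ab * n) / (count_a * count_b)
    else:
        interest = 0.0
    return support, confidence, interest


# Hypothetical documents, each reduced to the entities it mentions.
docs = [
    {"Clinton", "White House", "Washington"},
    {"Clinton", "Washington"},
    {"Yeltsin", "Moscow"},
    {"Clinton", "Moscow"},
]

support, confidence, interest = rule_measures(docs, "Clinton", "Washington")
print(support, confidence, interest)
```

Unlike confidence, interest is symmetric in a and b and corrects for items that are simply frequent everywhere.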

Level: what you are setting the factor to
It is important to cite reasons for why you're selecting factors and why you're fixing those factors

e.g., setting minimum and maximum term frequencies to get documents with a minimum of five terms in both docs, and to retain as close to half of the original document corpus as possible
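
A minimal sketch of that kind of filtering, under my own reading of the note: terms are kept when their document frequency falls between two bounds, and documents are kept when enough distinct terms survive. The function name, the thresholds, and the toy corpus are all hypothetical.

```python
from collections import Counter


def filter_corpus(docs, min_df, max_df, min_terms):
    """Filter a corpus by document frequency.

    docs: list of token lists.
    Keeps terms that occur in at least min_df and at most max_df documents,
    then keeps documents with at least min_terms distinct surviving terms.
    Returns (kept_docs, vocabulary).
    """
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {term for term, count in df.items() if min_df <= count <= max_df}
    kept = []
    for doc in docs:
        terms = [t for t in doc if t in vocab]
        if len(set(terms)) >= min_terms:
            kept.append(terms)
    return kept, vocab


# Toy corpus: "a" is too common, "d" and "e" are too rare.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"], ["a", "b", "c"]]
kept, vocab = filter_corpus(corpus, min_df=2, max_df=3, min_terms=2)
print(len(kept), "of", len(corpus), "documents retained")
```

The fraction of documents retained is the thing to report when justifying the thresholds.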

e.g., why ten-fold cross-validation? because most people use it....
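
For reference, a small sketch of what ten-fold cross-validation does mechanically: shuffle the items, cut them into k folds, and hold each fold out once. The helper name and the seed are assumptions for illustration only.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

Every item is tested exactly once and trained on k - 1 times, so k is itself a level that deserves a stated reason.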

Blake, C. & Pratt, W. (2001). Better rules fewer features: A semantic approach to selecting features from text. In Proceedings of the Institute of Electrical and Electronics Engineers Data Mining Conference (IEEE DM 2001), San Jose, CA.

Monday, March 07, 2005

Lecture Monday 07 March 2005: CLUSTERING!

Document clustering

Papers covered:

* Tao Li, Sheng Ma, Mitsunori Ogihara, Document clustering via adaptive subspace iteration, Proceedings of the 27th annual international conference on Research and development in information retrieval (2004)

* Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths, Probabilistic author-topic models for information discovery, Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining (2004)

Presented by Elxsas, J.

1. What is clustering?
2. Two articles
3. Discussion

Jain, Murty, Flynn. Data clustering: a review. ACM Computing Surveys. Vol 31, No. 3, pp. 264-323, 1999.

Partitional vs hierarchical clustering
k-means: partitional square error
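
A minimal pure-Python sketch of k-means as a partitional, squared-error method: alternate between assigning points to the nearest centroid and recomputing centroids, then report the total squared error. The random initialisation, the fixed iteration count, and the toy points are my own choices, not from the Jain et al. survey.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def k_means(points, k, iterations=20, seed=0):
    """Return (assignment, centroids, squared_error)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    assignment = [nearest(p, centroids) for p in points]
    error = sum(squared_distance(p, centroids[a])
                for p, a in zip(points, assignment))
    return assignment, centroids, error


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids, error = k_means(points, k=2)
print(assignment, error)
```

The squared error only ever goes down across iterations, but the result can depend on the starting centroids, which is one reason to fix and report the seed.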

text clustering: locate topically similar(?) documents

open problems:
- how do you represent docs in a way these algos can take advantage of them
- document or term similarity: how is similarity measured?
- cluster interpretation: what does that cluster mean?
- cluster evaluation: are the clusters any good?
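
One common answer to the representation and similarity questions, sketched here as an assumption rather than as what either paper does: treat each document as a bag of words (a term-frequency vector) and compare documents by the cosine of the angle between their vectors.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two token lists, using raw term counts."""
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = "breast cancer risk factors".split()
d2 = "risk factors for breast cancer incidence".split()
d3 = "document clustering via subspace iteration".split()
print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```

Documents that share no terms score zero no matter how related they are in meaning, which is part of why representation is listed as an open problem.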

Li, Ma, Ogihara: Document clustering via adaptive subspace iteration
- addresses document similarity & cluster interpretation
- subspace clustering

ASI:
in: M: term-document matrix
k: number of clusters (k can be estimated)
out:
D: cluster for each doc
F: subspace for each cluster

partitions docs into k groups (optimizing D)
reduces distance between docs and centroids

help me out: why should I expect my clusters to correspond to topics? Because we want them to? clusters aren't being defined on one-dimensional topics; we know they're probably not, and we don't know what they mean exactly, even if explicitly described.

mutual information score!!!
can be used for disambiguation: p.239 Schütze
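
A small sketch of a mutual information score between two labelings of the same items, for example cluster ids against known topic labels. This is the standard definition in bits; whether it matches the exact variant used in the papers is an assumption, and the function name is mine.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length label sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


clusters = [0, 0, 1, 1]
topics_matching = ["a", "a", "b", "b"]
topics_unrelated = ["a", "b", "a", "b"]
print(mutual_information(clusters, topics_matching))
print(mutual_information(clusters, topics_unrelated))
```

A score of zero means knowing the cluster tells you nothing about the topic; higher scores mean the clusters line up with the labels.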

Probabilistic author-topic models for info discovery

modeling topics: a model of word generation based on statistics
modeling authors: model of author association based on topics and terms
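
A rough sketch of the generative story behind an author-topic model, as I understand it: for each word, pick one of the document's authors, pick a topic from that author's topic distribution, then pick a word from that topic's word distribution. All names and probabilities below are hypothetical toy values, and this only samples from a given model; it does not do the inference the paper is about.

```python
import random


def generate_document(authors, author_topics, topic_words, length, seed=0):
    """Sample (author, topic, word) triples for one document."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(length):
        author = rng.choice(authors)  # uniform over the document's authors
        topics, topic_weights = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_weights)[0]
        words, word_weights = zip(*topic_words[topic].items())
        word = rng.choices(words, weights=word_weights)[0]
        tokens.append((author, topic, word))
    return tokens


# Toy model parameters (made up).
author_topics = {
    "author_a": {"clustering": 0.9, "epidemiology": 0.1},
    "author_b": {"clustering": 0.2, "epidemiology": 0.8},
}
topic_words = {
    "clustering": {"centroid": 0.5, "subspace": 0.3, "partition": 0.2},
    "epidemiology": {"cohort": 0.5, "risk": 0.3, "trial": 0.2},
}

doc = generate_document(["author_a", "author_b"], author_topics, topic_words, length=8)
print(doc)
```

Fitting the model runs this story in reverse: given only the words and the author lists, estimate the two sets of distributions.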

showing increase and decrease of topics over time

association of words with language fashions: e.g., use of Greek letterings vs. French letterings

assigning topics to unseen documents; separating two abstracts that have been merged; detecting an author's surprising papers - problems well-defined by the data we have in the system...the method recommends such things by nature

EM clustering
