Wednesday, April 06, 2005

Lecture Wednesday 06 April 2005

Summarization

Mani, I. & Bloedorn, E. (1999). Summarizing similarities and differences among related documents. Information Retrieval, 1(1-2), 35-67.

automatic text summarization
analysis phase
refinement phase
synthesis phase

why salient information?
establish similarities & differences
easier to compare salient items than the entire text body

by identifying both commonalities and differences, we can see what's novel
(very different from the notion of centroids)


represent each document as a graph, with nodes as word instances and edges of multiple types: ADJACENCY, SAME, ALPHA, PHRASE, NAME, COREFERENTIAL
weights of graph nodes: activation vector
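a minimal sketch of that graph representation in Python (networkx assumed available); only ADJACENCY and SAME links are built here, since PHRASE, NAME, ALPHA, and COREFERENTIAL links need real NLP machinery:

```python
# Sketch of a Mani & Bloedorn-style document graph. Only ADJACENCY and
# SAME edges are built; the other edge types need deeper NLP.
from collections import defaultdict

import networkx as nx

def build_document_graph(tokens):
    g = nx.MultiGraph()
    positions = defaultdict(list)          # word -> list of node ids
    for i, word in enumerate(tokens):
        g.add_node(i, word=word.lower(), weight=0.0)
        positions[word.lower()].append(i)
        if i > 0:
            g.add_edge(i - 1, i, type="ADJACENCY")
    # SAME edges connect repeated instances of the same word
    for word, ids in positions.items():
        for a, b in zip(ids, ids[1:]):
            g.add_edge(a, b, type="SAME")
    return g

g = build_document_graph("the cat sat near the cat".split())
print(g.number_of_nodes(), g.number_of_edges())
```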

phrase extraction:
WordNet as HP

topic-related text region spreading activation (think of nodes being lit up by queries)
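a toy sketch of the spreading-activation idea, assuming a plain adjacency-dict graph; the decay factor and iteration count are invented for illustration:

```python
# Query terms "light up" their nodes; activation decays as it propagates
# to neighbors. Decay and iteration count are illustrative parameters.
def spread_activation(neighbors, seeds, decay=0.5, iterations=3):
    activation = {n: 0.0 for n in neighbors}
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        nxt = dict(activation)
        for node, adj in neighbors.items():
            for nb in adj:
                nxt[nb] = max(nxt[nb], activation[node] * decay)
        activation = nxt
    return activation

graph = {"breast": ["cancer"], "cancer": ["breast", "risk"], "risk": ["cancer"]}
print(spread_activation(graph, seeds=["breast"]))
```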

two types of evaluation: extrinsic evaluation & intrinsic evaluation
extrinsic: how summary affects outcome of some other task
intrinsic: judgements of informativeness


what is it you should evaluate?
what if users disagree?
force disagreement?
compare it to system....


Okurowski, M. et al. (2000). Text summarizer in use: lessons learned from real world deployment and evaluation. Proceedings of the ANLP/NAACL Workshop on Automatic Summarization, 49-58.

Mark Pope presenting



Question: we seem to be making the assumption that we are improving upon relevance rather than leaving it behind, that we are trying to get things "faster"

maybe we already know what our most relevant documents are, and instead of getting a "more efficient" representation, maybe we want to learn something remarkable
maybe we pick our favorite papers on a topic - we've read them and have grokked them well - but we feel we might be missing something.

this idea of the technology suggesting the task is good
but also being more creative with problem identification than "information overload" in the "millions of relevant documents" sense
let's IMAGINE some uses that are not currently part of any professional's task

Wednesday, March 30, 2005

Lecture Wednesday 30 March 2005

Shift focus from retrieval to synthesis

Users making consequential decisions

Information systems in an augmentative role

130k+ breast cancer articles (MeSH term: "breast neoplasm")
cancer 'cures' often worse than symptoms
beyond age + gender, risk factors don't explain half of breast cancer incidence

desirable properties of a system:
integrate well with work practices


In epidemiology, because of the non-experimental design of the studies, we should expect results to be controversial (http://www.hsph.harvard.edu/Organizations/DDIL/gep_PE.html)
risk factors should be treated skeptically when exploring any notion of causality

Cochrane reviews (http://www.cochrane.org/index2.htm)

Are the results from epi studies, RCTs, or mixtures?

Cochrane reviews pertain mainly to RCTs:
"The Cochrane Collaboration and the Cochrane Reviewers’ Handbook focus particularly on systematic reviews of randomised controlled trials (RCTs) because they are likely to provide more reliable information than other sources of evidence on the differential effects of alternative forms of healthcare (Kunz 2003). Systematic reviews of other types of evidence can also help those wanting to make better decisions about healthcare, particularly forms of care where RCTs have not been done and may not be possible or appropriate. The basic principles of reviewing research are the same, whatever type of evidence is being reviewed. Although we focus mainly on systematic reviews of RCTs we address issues specific to reviewing other types of evidence when this is relevant. Fuller guidance on such reviews is being developed."
(http://www.cochrane.dk/cochrane/handbook/1_introduction/1_0_introduction.htm)

Cohort studies: (http://servers.medlib.hscbklyn.edu/ebm/2400.htm)
"A Cohort Study is a study in which patients who presently have a certain condition and/or receive a particular treatment are followed over time and compared with another group who are not affected by the condition under investigation.

"For instance, since a randomized controlled study to test the effect of smoking on health would be unethical, a reasonable alternative would be a study that identifies two groups, a group of people who smoke and a group of people who do not, and follows them forward through time to see what health problems they develop.

"Cohort studies are not as reliable as randomized controlled studies, since the two groups may differ in ways other than in the variable under study. For example, if the subjects who smoke tend to have less money than the non-smokers, and thus have less access to health care, that would exaggerate the difference between the two groups.

"The main problem with cohort studies, however, is that they can end up taking a very long time, since the researchers have to wait for the conditions of interest to develop. Physicians are, of course, anxious to have meaningful results as soon as possible, but another disadvantage with long studies is that things tend to change over the course of the study. People die, move away, or develop other conditions, new and promising treatments arise, and so on. Even so, cohort studies are generally preferred to case control studies, since they involve far fewer statistical problems and generally produce more reliable answers. "

More on Epi tests/experimental design:
http://www.vetmed.wsu.edu/courses-jmgay/GlossClinStudy.htm


Concept of "ever drinker"
working in tox labs, dosage was everything


I want to see dimensions that express dosage mean & variance
and clinical experiment OR observational study
use the height of the bar or the shade/intensity of color
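a hypothetical matplotlib rendering of that idea (all data made up): bar height for mean dosage, color intensity for variance, hatching for observational vs. experimental:

```python
# Made-up data; just a sketch of the encoding idea above.
import matplotlib.pyplot as plt

studies = ["Study A", "Study B", "Study C"]
mean_dose = [2.1, 4.8, 3.3]
variance = [0.2, 1.5, 0.7]          # mapped to color intensity below
is_experiment = [True, False, True]

max_var = max(variance)
colors = [(0.1, 0.3, 0.8, 0.3 + 0.7 * v / max_var) for v in variance]
hatches = ["" if e else "//" for e in is_experiment]

bars = plt.bar(studies, mean_dose, color=colors)
for bar, h in zip(bars, hatches):
    bar.set_hatch(h)
plt.ylabel("mean dosage")
plt.title("dosage by study (intensity = variance, hatched = observational)")
plt.show()
```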

Monday, March 28, 2005

Lecture Monday 28 March 2005

Visualization of text data

ThemeRiver
Reminds me of babynamewizard
http://babynamewizard.com/namevoyager/lnv0105.html

Indicates growth of newsfeeds around certain events - that there's simply more text.

Jon: frustrated with the neglect of some dimensions

Kohonen maps (SOM)
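a bare-bones SOM sketch in numpy, just to show the mechanics (grid of weight vectors, best-matching unit, shrinking neighborhood); parameters are illustrative, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(5, 5), epochs=20, lr=0.5):
    rows, cols = grid
    w = rng.random((rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for t in range(epochs):
        radius = max(rows, cols) / 2 * (1 - t / epochs) + 1e-9
        for x in data:
            # best-matching unit: grid cell closest to the input
            d = ((w - x) ** 2).sum(axis=2)
            bmu = np.unravel_index(d.argmin(), d.shape)
            # pull the BMU's neighborhood toward the input
            dist = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-dist / (2 * radius ** 2))[..., None]
            w += lr * (1 - t / epochs) * h * (x - w)
    return w

data = rng.random((100, 3))   # e.g., 100 docs in a 3-dim reduced term space
print(train_som(data).shape)  # (5, 5, 3)
```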


Pratt's DynaCat (1997):
approach is to establish predefined questions
then generate dynamic categories--hierarchical categories

Wednesday, March 23, 2005

Lecture Wednesday 23 March 2005

Clifton, C. and Cooley, R.; TopCat: data mining for topic identification in a text corpus. In Żytkow, J.M. and Rauch, J. (eds.), Principles of Data Mining and Knowledge Discovery, Third European Conference (PKDD '99).

Experimental Design:
Factor: something you are changing
- keywords being assigned
- interest instead of support-confidence
- person-org-place: using this representation vs. other entities

Level: what you are setting the factor to
It is important to give reasons for why you select certain factors and why you fix others


e.g., setting minimum and maximum term frequencies, keeping documents with a minimum of five terms, so as to retain close to half of the original document corpus
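for concreteness, one way to set such a factor level with scikit-learn's CountVectorizer (note it prunes terms by document frequency, not documents; the thresholds here are exactly the kind of knob being discussed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
# drop terms in fewer than 2 docs or in more than 75% of docs
vec = CountVectorizer(min_df=2, max_df=0.75)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())   # "the" (100% of docs) is gone
```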

e.g., why ten-fold cross-validation? because most people use it....




Blake, C. & Pratt, W. (2001). Better rules fewer features: A semantic approach to selecting features from text. In Proceedings of the Institute of Electrical and Electronics Engineers Data Mining Conference (IEEE DM 2001), San Jose, CA.

Monday, March 07, 2005

Lecture Monday 07 March 2005: CLUSTERING!

Document clustering

Papers covered:

* Tao Li, Sheng Ma, Mitsunori Ogihara. Document clustering via adaptive subspace iteration. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2004)

* Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths. Probabilistic author-topic models for information discovery. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004)

Presented by Elsas, J.

1. What is clustering?
2. Two articles
3. Discussion

Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323.

Partitional vs hierarchical clustering
k-means: partitional, squared-error clustering (see the sketch below)

text clustering: locate topically similar(?) documents
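the k-means just mentioned, as a minimal numpy sketch of the partitional squared-error loop:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each doc to its nearest centroid (squared Euclidean error)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids as cluster means
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X = np.random.default_rng(1).random((20, 5))   # 20 docs, 5 terms
labels, _ = kmeans(X, k=3)
print(labels)
```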


open problems:
- how do you represent docs in a way these algorithms can take advantage of?
- document or term similarity: how is similarity measured?
- cluster interpretation: what does that cluster mean?
- cluster evaluation: are the clusters any good?

Li, Ma, Ogihara: Document clustering via adaptive subspace iteration
- addresses document similarity & cluster interpretation
- subspace clustering

ASI:
in: M: term-document matrix
k: number of clusters (k can be estimated)
out:
D: cluster for each doc
F: subspace for each cluster

partitions docs into k groups (optimizing D)
reduces distance between docs and centroids
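to be clear, this is not the actual ASI optimization - just a sketch of the alternating idea it describes: fix assignments D, pick a feature subspace F per cluster (here simply each centroid's top-weighted terms), reassign docs using only that subspace, repeat:

```python
# NOT the real ASI algorithm; a simplified alternating loop over D and F.
import numpy as np

def asi_like(M, k, subspace_size=3, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.integers(0, k, size=len(M))            # cluster for each doc
    for _ in range(iters):
        F = []                                     # subspace for each cluster
        centroids = np.zeros((k, M.shape[1]))
        for j in range(k):
            members = M[D == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
            F.append(np.argsort(centroids[j])[-subspace_size:])
        # reassign docs by distance measured only inside each subspace
        for i, doc in enumerate(M):
            dists = [((doc[F[j]] - centroids[j][F[j]]) ** 2).sum()
                     for j in range(k)]
            D[i] = int(np.argmin(dists))
    return D, F

M = np.random.default_rng(1).random((30, 8))       # docs x terms
D, F = asi_like(M, k=3)
print(D, [f.tolist() for f in F])
```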

help me out: why should I expect my clusters to correspond to topics? Because we want them to? Clusters aren't being defined on one-dimensional topics; we know they're probably not, and we don't know what they mean exactly, even if explicitly described.

mutual information score!!!
can be used for disambiguation: p. 239, Schütze
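the mutual information score, written out for two discrete variables (say, a term's presence and a cluster label):

```python
# I(X;Y) = sum over (x,y) of p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats.
import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

clusters = [0, 0, 1, 1, 1, 0]
has_term = [1, 1, 0, 0, 0, 1]
print(mutual_information(clusters, has_term))   # high: term tracks cluster
```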

Probabilistic author-topic models for info discovery

modeling topics: a model of word generation based on statistics
modeling authors: model of author association based on topics and terms


showing increase and decrease of topics over time

association of words with language fashions: e.g., use of Greek lettering vs. French lettering



assigning topics to unseen documents; separating two abstracts that have been merged; detecting an author's surprising papers - problems well-defined by what data the system has... the method recommends such tasks by its nature



EM clustering
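EM clustering in its most packaged form - a Gaussian mixture fit by expectation-maximization via scikit-learn; running it on raw term counts is only illustrative, real text would want dimensionality reduction first:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((50, 4))   # stand-in for reduced doc vectors
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict(X)[:10])        # hard cluster labels
print(gmm.predict_proba(X)[0])    # soft (EM) responsibilities for doc 0
```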

Monday, February 28, 2005

Lecture Monday 28 February 2005

Information Extraction
----------------------

Two general approaches to information systems:

1. knowledge engineering
- hand-constructed grammars
- human experts design rules
- e.g., Paice & Jones, Blaschke et al

2. trained systems
- use stats where possible to learn rules
- e.g., Riloff (less info, not fully learned); Califf & Mooney


Even trained system approaches require background knowledge of some sort


knowledge vs. data trade-off
1.2 million words needed to learn statistically


levels of info

text
words e.g., POS
noun phrase e.g., phrase units
sentence level
inter sentence level e.g., anaphoric resolution & discourse analysis
template level, e.g., changes format to the required output

effort increases as you move down these levels


AAAI Applet & Hobbes


IE techniques:
KB (knowledge-based)
Semi-learned
Learned



Representation for Learning
---------------------------

pre filler pattern
filler: what you want
post filler pattern


RAPIER: Robust automated production of information extraction rules
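the pre-filler / filler / post-filler shape maps naturally onto a regex with a named group; a toy sketch only - RAPIER learns much richer constraints (word, POS, and semantic-class lists) than a literal pattern like this:

```python
import re

rule = re.compile(
    r"located in\s+"            # pre-filler pattern
    r"(?P<filler>[A-Z][a-z]+)"  # filler: what you want (a city name, say)
    r","                        # post-filler pattern
)

m = rule.search("The company is located in Austin, Texas.")
if m:
    print(m.group("filler"))    # -> Austin
```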


covering algorithm: take a sentence that is a positive example
while positive examples remain, create a rule that covers many of them, then remove the covered examples

(seems like it would have pretty good precision but not so good recall)

RAPIER is one good example of a covering algorithm (see the sketch below)
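a generic covering-algorithm skeleton matching that description; `induce_rule` is a hypothetical stand-in for RAPIER's actual rule search:

```python
# Repeatedly induce a rule that covers remaining positives, then remove
# what it covered. induce_rule is a placeholder learner.
def covering(positives, induce_rule):
    rules = []
    remaining = set(positives)
    while remaining:
        rule = induce_rule(remaining)            # hypothetical learner
        covered = {p for p in remaining if rule(p)}
        if not covered:                          # give up if no progress
            break
        rules.append(rule)
        remaining -= covered
    return rules

# e.g., a trivial "learner" that memorizes one example per rule:
rules = covering({"a", "b", "c"},
                 lambda rem: (lambda p, t=next(iter(rem)): p == t))
print(len(rules))   # 3 rules, one per example
```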

To do week of 28 Feb 2005

TO DO FOR THE TM PROJECTS (independent study/kdd)

load hypernym tree
correct synset tree
regenerate webterm+lragr+synset table
regenerate statistics for table

for each term with multiple POS,
find hypernym I-V
add hypernym I-V columns & add those hypernyms
find lowest level reduction


figure out how to backtrack to POS
install POS tagger
for every sentence
retrieve & reconstruct sentence
submit sentence to POS tagger (see the sketch after this list)
retrieve results
insert POS info back to webterm data set

now retrieve data set limited to POS={N, V, ADJ, ADV}
create as table webtermtagged

equijoin webtermtagged on webtermlragrsynset

generate statistics again
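a sketch of the tagging loop above using NLTK (assuming its tokenizer and tagger models are downloaded); the database retrieval and insert steps are placeholders for the real webterm tables:

```python
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

KEEP_PREFIXES = ("NN", "VB", "JJ", "RB")   # noun, verb, adjective, adverb

def tag_sentences(sentences):
    for sent in sentences:                  # retrieve & reconstruct sentence
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))   # submit to tagger
        # keep only POS in {N, V, ADJ, ADV}, as in the filter step above
        yield [(w, t) for w, t in tagged if t.startswith(KEEP_PREFIXES)]

for row in tag_sentences(["Heavy stemming is good for high recall."]):
    print(row)   # results would go back into the webterm data set
```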

Monday, February 14, 2005

Lecture Monday 14 February 2005

papers:
Marti Hearst
Henry Small

heavy stemming good for high recall


Pruning our document space: tf*idf

term freq * inverse doc freq

tf: term frequency in the document
idf: two options
- inverse of the number of documents with term i
- log of the ratio of total docs to docs containing term i (more typical; controls for the wide range): idf_i = log(n/df_i)
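that formula, transcribed directly into Python:

```python
# tf * log(n / df_i), over a tiny made-up corpus.
import math
from collections import Counter

docs = [["heavy", "stemming", "recall"],
        ["stemming", "algorithm"],
        ["recall", "precision", "recall"]]

n = len(docs)
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

print(tfidf(docs[2]))   # "recall" scores 2 * log(3/2); shared terms shrink
```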


co-occurrence analysis
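a minimal document-level co-occurrence count, for concreteness:

```python
# Count how often term pairs appear in the same document.
from collections import Counter
from itertools import combinations

docs = [["breast", "cancer", "risk"],
        ["cancer", "risk", "smoking"],
        ["smoking", "cohort"]]

cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1

print(cooc.most_common(3))   # ("cancer", "risk") co-occurs twice
```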


send in statistics
how many unique words?
stemmed words?
some data that describes the dataset


we can avoid using a "search-centric" approach by assuming we're discovering knowledge? by assuming...

but at the end of the day
we still want to generate hypotheses

generate vs. discover
discovery seems to be a search paradigm



avoiding search-centric: removing the document as the denominator

if it's already explicit then it's searching not discovery
if it's not explicit then it's discovery (or hypothesis generation)

generation:
vs:
discovery:

we had an experiment where we assigned a percentage of truth to a statement, but we need a reliable denominator that says: this is 100% true

concern surrounds user

Swanson:
something is not in the literature

if we are generating hypotheses, then it is beside the point whether we should predicate discovery in a relative or absolute sense

testable hypotheses
unverifiable statements:
unfalsifiable statements:
generating poetry