Monday, February 14, 2005

Lecture Monday 14 February 2005

Marti Hearst
Henry Small

heavy stemming good for high recall

Pruning our document space: tf*idf

term freq * inverse doc freq

term frequency in document
idf: two options
inverse of the number of documents with term i
log of (total number of docs / number of docs with term i) (more typical; the log controls for the wide range), which is log(n/df_i), where n = total docs and df_i = docs containing term i
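
A rough sketch of the weighting above, using the log form of idf, log(n/df_i). The function name `tf_idf`, the toy documents, and the choice of raw counts for term frequency are assumptions for illustration, not something given in the lecture.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus (made up for this example)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat chased the dog".split(),
]
w = tf_idf(docs)
```

A term that appears in every document ("the" here) gets idf = log(n/n) = 0 and so drops out, which is the pruning effect; a term found in only one document gets the largest idf.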

co-occurrence analysis

send in statistics
how many unique words?
stemmed words?
some data that describes the dataset
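
One possible way to gather the statistics listed above, as a sketch only. `crude_stem` is a hypothetical stand-in that strips a few suffixes; it is not a real stemming algorithm, and a proper stemmer would give different counts. Whitespace tokenization and lowercasing are also assumptions.

```python
def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def dataset_stats(docs):
    """docs: list of raw text strings. Returns simple descriptive counts."""
    tokens = [t.lower() for doc in docs for t in doc.split()]
    return {
        "documents": len(docs),
        "tokens": len(tokens),
        "unique_words": len(set(tokens)),
        "unique_stems": len({crude_stem(t) for t in tokens}),
    }

# Toy input (made up for this example)
stats = dataset_stats(["cats chase dogs", "the dog chased a cat"])
```

The gap between unique words and unique stems gives a rough sense of how much stemming collapses the vocabulary.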

we can avoid using a "search-centric" approach by assuming we're discovering knowledge? by assuming

but at the end of the day
we still want to generate hypotheses

generate vs. discover
discovery seems to be a search paradigm

avoiding search-centric: removing the document as the denominator

if it's already explicit then it's searching not discovery
if it's not explicit then it's discovery (or hypothesis generation)


we had an experiment where we assigned a percentage of truth to a statement, but we need a reliable denominator that says: this is 100% true

concern surrounds user

something is not in the literature

if we are generating hypotheses, then it is beside the point whether we should predicate discovery in a relative or absolute sense

testable hypotheses
unverifiable statements:
unfalsifiable statements:
generating poetry

