Monday, February 14, 2005

Lecture Monday 14 February 2005

papers:
Marti Hearst
Henry Small

heavy stemming good for high recall


Pruning our document space: tf*idf

term freq * inverse doc freq

term frequency in document
idf: two options
inverse of the number of documents with term i
log of the inverse document frequency ratio (more typical; the log controls for the wide range of values), which is log(n/df_i), where n is the total number of docs and df_i is the number of docs with term i
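
A minimal sketch of the log-scaled weighting above, assuming raw counts for term frequency and the natural log for idf. The function name tf_idf and the toy documents are made up for illustration; they are not from the lecture.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        # weight = tf * log(n / df_i)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, already tokenized
docs = [
    ["stemming", "recall", "recall"],
    ["stemming", "precision"],
    ["stemming", "cooccurrence"],
]
w = tf_idf(docs)
```

A term that appears in every document ("stemming" here) gets weight log(3/3) = 0, so it can be pruned from the document space; a term concentrated in one document keeps a high weight.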


co-occurrence analysis


send in statistics
how many unique words?
stemmed words?
some data that describes the dataset
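
One possible way to gather these statistics, as a rough sketch. The crude_stem function is a deliberately naive, hypothetical suffix stripper standing in for a real stemmer, and whitespace tokenization is an assumption.

```python
def crude_stem(word):
    # Hypothetical, naive suffix stripper; a real stemmer would replace this
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def dataset_stats(docs):
    """docs: list of raw text strings. Returns simple counts describing the dataset."""
    tokens = [w.lower() for doc in docs for w in doc.split()]
    return {
        "documents": len(docs),
        "tokens": len(tokens),
        "unique_words": len(set(tokens)),
        "unique_stems": len({crude_stem(t) for t in tokens}),
    }

# Hypothetical two-document dataset
stats = dataset_stats(["Searching finds documents", "search finds a document"])
```

The gap between unique_words and unique_stems gives a rough sense of how much the stemming collapses the vocabulary.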


we can avoid using a "search-centric" approach by assuming we're discovering knowledge? by assuming

but at the end of the day
we still want to generate hypotheses

generate vs. discover
discovery seems to be a search paradigm



avoiding search-centric: removing the document as the denominator

if it's already explicit then it's searching not discovery
if it's not explicit then it's discovery (or hypothesis generation)

generation:
vs:
discovery:

we had an experiment where we cited a percentage of truth to a statement, but we need a reliable denominator that says "this is 100% true"

concern surrounds user

Swanson:
something is not in the literature

if we are generating hypotheses, then it is beside the point whether we should predicate discovery in a relative or absolute sense

testable hypotheses
unverifiable statements:
unfalsifiable statements:
generating poetry
