Lecture Monday 14 February 2005
papers:
Marti Hearst
Henry Small
heavy stemming good for high recall
Pruning our document space with tf*idf
tf*idf = term freq * inverse doc freq
tf: term frequency in document
idf: two options
inverse of the number of documents with term i, i.e. 1/df_i
log of the total number of docs divided by the number of docs with term i, i.e. log(n/df_i) (more typical; the log controls for the wide range of df values)
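A minimal sketch of the tf*idf weighting in these notes, using the log(n/df_i) form of idf. The function name tf_idf, the toy documents, and the use of raw counts for tf are assumptions made for illustration, not something specified in the lecture.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency in this document (assumed tf variant)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus, invented for this example
docs = [
    ["stemming", "recall", "recall"],
    ["stemming", "precision"],
    ["stemming", "discovery"],
]
weights = tf_idf(docs)
```

A term that appears in every document ("stemming" here) gets weight 0 because log(n/n) = 0, which is how the weighting can be used to prune uninformative terms.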
co-occurrence analysis
send in statistics
how many unique words?
stemmed words?
some data that describes the dataset
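One possible way to gather the statistics listed above (unique words, stemmed words). This is only a sketch: crude_stem is a hypothetical stand-in for a real stemmer such as Porter's, and whitespace tokenization is an assumption.

```python
def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a few common suffixes
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def dataset_stats(docs):
    """docs: list of raw text strings. Returns basic counts describing the dataset."""
    tokens = [w.lower() for doc in docs for w in doc.split()]
    unique = set(tokens)
    stems = {crude_stem(w) for w in unique}
    return {
        "documents": len(docs),
        "tokens": len(tokens),
        "unique_words": len(unique),
        "unique_stems": len(stems),
    }

# Toy input, invented for this example
stats = dataset_stats([
    "generating hypotheses",
    "generated hypotheses tested",
    "testing tests",
])
```

The gap between unique_words and unique_stems gives a rough sense of how aggressively the stemmer conflates word forms, which relates to the note that heavy stemming favors recall.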
we can avoid using a "search-centric" approach by assuming we're discovering knowledge? by assuming
but at the end of the day
we still want to generate hypotheses
generate vs. discover
discovery seems to be a search paradigm
avoiding search-centric: removing the document as the denominator
if it's already explicit then it's searching not discovery
if it's not explicit then it's discovery (or hypothesis generation)
generation:
vs:
discovery:
we had an experiment where we assigned a percentage of truth to a statement, but we need a reliable denominator that says: this is 100% true
concern surrounds user
Swanson:
something is not in the literature
if we are generating hypotheses, then it is beside the point whether we should predicate discovery in a relative or absolute sense
testable hypotheses
unverifiable statements:
unfalsifiable statements:
generating poetry