Monday, February 28, 2005

Lecture Monday 28 February 2005

Information Extraction
----------------------

Two general approaches to information extraction systems:

1. knowledge engineering
- hand-constructed grammars
- human experts design rules
- e.g., Paice & Jones; Blaschke et al.

2. trained systems
- use stats where possible to learn rules
- e.g., Riloff (less info required, not fully learned); Califf & Mooney


Even trained system approaches require background knowledge of some sort


knowledge vs. data trade-off
1.2 million words needed to learn statistically


levels of info:

- text
- word level, e.g., POS
- noun phrase level, e.g., phrase units
- sentence level
- inter-sentence level, e.g., anaphora resolution & discourse analysis
- template level: changes format to the required output

effort increases as you move down the levels


AAAI tutorial: Appelt & Hobbs


IE techniques:
- knowledge-based (KB)
- semi-learned
- learned



Representation for Learning
---------------------------

pre-filler pattern
filler: what you want
post-filler pattern
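The pre-filler / filler / post-filler idea can be sketched with ordinary regular expressions, where the context patterns anchor the slot and a named group captures the value to extract. The patterns and example sentence below are invented illustrations, not RAPIER's actual rule syntax:

```python
import re

# Toy slot-extraction rule: pre- and post-filler patterns anchor the
# context, and the named group captures the filler we want.
pre_filler = r"located in\s+"
filler = r"(?P<filler>[A-Z][a-z]+)"
post_filler = r"\s*,"

rule = re.compile(pre_filler + filler + post_filler)

match = rule.search("The company is located in Austin, Texas.")
print(match.group("filler"))  # -> Austin
```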


RAPIER: Robust Automated Production of Information Extraction Rules


covering algorithm: takes sentences that are positive examples;
while positive examples remain, create a rule that covers (and removes) the majority of the remaining positive examples

(seems like it would have pretty good precision but not so good recall)

RAPIER is one good example of a covering algo
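A minimal sketch of a sequential covering loop under simplifying assumptions: rules are stood in for by plain predicates, and "learn a rule" is reduced to picking the candidate that covers the most remaining positives.

```python
# Toy sequential covering: greedily pick rules until all positives
# are covered (or no candidate covers anything else).
def sequential_covering(positives, candidate_rules):
    """candidate_rules: list of predicates over examples (hypothetical)."""
    remaining = set(positives)
    learned = []
    while remaining:
        # pick the candidate covering the most remaining positives
        best = max(candidate_rules,
                   key=lambda r: sum(1 for p in remaining if r(p)))
        covered = {p for p in remaining if best(p)}
        if not covered:
            break  # nothing covers the rest; stop
        learned.append(best)
        remaining -= covered  # remove covered positives and repeat
    return learned

# usage: "rules" as simple predicates over integers
evens = lambda x: x % 2 == 0
big = lambda x: x > 10
rules = sequential_covering([2, 4, 12, 13], [evens, big])
```

Greedily removing whatever the best rule covers is what gives covering algorithms their precision-leaning flavor noted above: each rule is chosen to fit the examples still on the table.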

To do week of 28 Feb 2005

TO DO FOR THE TM PROJECTS (independent study/kdd)

load hypernym tree
correct synset tree
regenerate webterm+lragr+synset table
regenerate statistics for table

for each term with multiple POS,
find hypernym I-V
add hypernym I-V columns & add those hypernyms
find lowest level reduction


figure out how to backtrack to POS
install POS tagger
for every sentence
retrieve & reconstruct sentence
submit sentence to pos tagger
retrieve results
insert POS info back to webterm data set
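The tag-and-reinsert loop above might look like the following sketch, with a toy lexicon standing in for a real POS tagger (e.g., a Brill-style tagger) and invented data in place of the webterm set:

```python
# Toy lexicon standing in for a real POS tagger.
TOY_LEXICON = {"protein": "N", "binds": "V", "rapidly": "ADV", "the": "DET"}

def tag_sentence(words):
    """Return (word, tag) pairs; unknown words default to 'N'."""
    return [(w, TOY_LEXICON.get(w.lower(), "N")) for w in words]

def tag_dataset(sentences):
    """For every sentence: reconstruct, submit to tagger, collect results."""
    tagged = []
    for words in sentences:
        tagged.extend(tag_sentence(words))
    # keep only POS={N, V, ADJ, ADV}, as in the webtermtagged step below
    return [(w, t) for (w, t) in tagged if t in {"N", "V", "ADJ", "ADV"}]

rows = tag_dataset([["The", "protein", "binds", "rapidly"]])
```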

now retrieve data set limited to POS={N, V, ADJ, ADV}
create as table webtermtagged

equijoin webtermtagged on webtermlragrsynset

generate statistics again
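The equijoin and statistics steps can be sketched with sqlite3; the schemas and rows below are invented stand-ins for the real webterm tables:

```python
import sqlite3

# In-memory stand-in for the real tables.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE webtermtagged (term TEXT, pos TEXT)")
cur.execute("CREATE TABLE webtermlragrsynset (term TEXT, synset TEXT)")
cur.executemany("INSERT INTO webtermtagged VALUES (?, ?)",
                [("protein", "N"), ("bind", "V")])
cur.executemany("INSERT INTO webtermlragrsynset VALUES (?, ?)",
                [("protein", "protein.n.01")])

# equijoin webtermtagged on webtermlragrsynset
cur.execute("""
    SELECT t.term, t.pos, s.synset
    FROM webtermtagged AS t
    JOIN webtermlragrsynset AS s ON t.term = s.term
""")
joined = cur.fetchall()

# regenerate statistics: e.g., row count per POS
cur.execute("SELECT pos, COUNT(*) FROM webtermtagged GROUP BY pos")
stats = dict(cur.fetchall())
```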

Monday, February 14, 2005

Lecture Monday 14 February 2005

papers:
Marti Hearst
Henry Small

heavy stemming good for high recall


Pruning our document space: tf*idf

term freq * inverse doc freq

tf: term frequency in the document
idf: two options
- inverse of the number of documents with term i
- log of total docs over docs with term i (more typical; controls for wide range): idf_i = log(n / df_i)
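A minimal tf*idf computation over a toy corpus, using the log(n / df_i) form noted above; the documents are made up for illustration:

```python
import math

docs = [
    ["gene", "protein", "gene"],
    ["protein", "binding"],
    ["gene", "expression"],
]

def tf(term, doc):
    return doc.count(term)  # raw term frequency in the document

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # documents containing term
    return math.log(len(docs) / df)        # idf_i = log(n / df_i)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score = tfidf("gene", docs[0], docs)  # tf = 2, df = 2, n = 3
```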


co-occurrence analysis


send in statistics
how many unique words?
stemmed words?
some data that describes the dataset


we can avoid using a "search-centric" approach by assuming we're discovering knowledge?

but at the end of the day
we still want to generate hypotheses

generate vs. discover
discovery seems to be a search paradigm



avoiding search-centric: removing the document as the denominator

if it's already explicit then it's searching not discovery
if it's not explicit then it's discovery (or hypothesis generation)

generation:
vs:
discovery:

we had an experiment where we assigned a percentage of truth to a statement, but we need a reliable denominator that says: this is 100% true

concern surrounds user

Swanson:
something is not in the literature

if we are generating hypotheses, then it is beside the point whether we should predicate discovery in a relative or absolute sense

testable hypotheses
unverifiable statements:
unfalsifiable statements:
generating poetry

Friday, February 11, 2005

Thoughts on knowledge discovery versus hypothesis generation

reading some of the philosophy behind scientific reasoning...discussions of incommensurability from Polanyi, Kuhn, Feyerabend...noting that Kuhn believed alternate hypotheses should be entertained only when science is at a point of crisis while Feyerabend suggested we might always want to entertain alternatives as a necessity perhaps of scientific reasoning...someone suggests that Kuhn's theories do not apply to biology...oxidative phosphorylation is perhaps a good example for examining incommensurability, as two competing theories battled from 1960 or 61 until the late 70s....

Some argument

If it is our goal to mine information from the literature, and if the best we can do with mining text is to generate hypotheses rather than make discoveries (because making knowledge discoveries requires a tight coupling between the signifier and the signified, between the world and the language used to describe it, whereas generating hypotheses suggests we are merely producing ideas that need to be tested based on what's already present--generated hypotheses as emergent properties of the interaction of texts and text mining applications), then, given that information is essentially what captures variance, we might want to generate as best as possible orthodox hypotheses--hypotheses with a majority of data already confirming those hypotheses--and, more importantly, heterodox hypotheses--hypotheses suggested by the literature with little or no data supporting them. The more heterodox a hypothesis, the greater the potential for information. It's a risk/reward trade-off: the costs in testing heterodox hypotheses given the reduced likelihood of veracity may be offset by the motherlode potential of confirming but one extremely heterodox theory.

Further, we may want to mine the various ways these hypotheses, whether heterodox or orthodox, are contradicted in the literature. It seems our burden in science is to posit a hypothesis and then try our best to disprove it in every way possible, rather than try to prove it.


http://tinyurl.com/4y39s
http://tinyurl.com/5j5b5


Metaphors defining our cognitive scaffolding, which in turn defines and limits how it is we witness novel things (or any thing, for that matter) and deem them interesting

Friday CRADLE Talk Mining Spatial Patterns from Protein Structures

Luke Huan

Seq->Struct->Function

Structure indicates function

Mine frequent subgraphs; retrieve spatial motifs from protein structure data

Global vs. local alignments of protein structures
local alignments are motifs

Mining for subgraphs/spatial motifs is a challenging problem for data mining

How to model a protein with a set of points:
each aa is represented by a point in 3D space; a protein structure is a point set
LCP: largest common point set problem


Is clique hashing a useful pattern finding approach for other domains?
Heavy combinatorial approach


SCOP: Structural Classification of Proteins DB
10 classes
800 folds
1294 superfamilies
2327 families

every protein entered into this db is annotated using this scheme


hypergeometric distribution:
The problem of finding the probability of such a picking problem is sometimes called the "urn problem," since it asks for the probability that i out of N balls drawn are "good" from an urn that contains n "good" balls and m "bad" balls. It therefore also describes the probability of obtaining exactly i correct balls in a pick-N lottery from a reservoir of r balls (of which n = N are "good" and m = r - N are "bad"). (from MathWorld http://mathworld.wolfram.com/HypergeometricDistribution.html)
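The urn-problem PMF can be computed directly from binomial coefficients; a small sketch:

```python
import math

# PMF of the hypergeometric distribution: probability of drawing
# exactly i "good" balls in N draws from an urn with n good and
# m bad balls.
def hypergeom_pmf(i, n, m, N):
    return (math.comb(n, i) * math.comb(m, N - i)) / math.comb(n + m, N)

# e.g., 5 good + 5 bad balls, draw 4: probability of exactly 2 good
p = hypergeom_pmf(2, 5, 5, 4)  # = C(5,2)*C(5,2)/C(10,4) = 100/210
```

Low values of this probability for an observed motif count are what the talk used as low P-values, i.e., evidence that a motif is specific rather than a chance pattern.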

Searching for biological relevance:
reading papers: key word search, drawing from experience
website search
talking to people who know the subjects


presented an algo for finding frequently occurring spatial motifs
discovered motifs are specific, measured by low P-values using the hypergeometric distribution


motifs they discover are highly specific, measured by low P-values

mapping structures to highly specific motifs
how about mapping sequence

Monday, February 07, 2005

Monday 07 February 2005 Lecture: Interestingness

Silberschatz & Tuzhilin

Actionable & Unexpected makes for interestingness

possibly: unexpectedness is roughly equivalent to actionability

unexpectedness as a criterion for interestingness is uninteresting

valid
understandable
novel
-->interestingness<--


Are there uninteresting facts that arise that somehow might cause beliefs to change? Or a lack of facts, can they cause beliefs to change?



the data vs the processing of it?

this statement is either trivial or false


terms change? terms must stay the same

statements

what is my confidence factor that all birds can fly? my confidence is low but some birds can fly
unexpectedness roughly equivalent

confidence factor & CYC: assign confidence associated with facts


Rational agency:
http://www.ryerson.ca/~dgrimsha/courses/cps720/rational.html


expected data returned
but
beliefs change


something between data and belief that seems absent from the discussion in the paper
there must be some sort of representation in between that converts data to belief that might be rational and might convert expected data into new beliefs
e.g., we might be able to test for significance

understanding was somewhat removed; a telltale sign is this elimination of the intermediary between the receipt of data and the formulation of belief

throw in data that is wildly aberrant & witness change?
unexpected data witness belief changes

monitor belief changes, see if enough data is out there

what's better? monotonicity is conserved or maximally rejected?

reaction essay for every paper?....


(x/y)
correctly classified/incorrectly classified


homework:
pick a UCI dataset that starts with the same letter as your first name; if there is none, use your last name
explore different classification algos within the WEKA toolkit
email Cathy the decision tree + 1 more classifier



Thursday, February 03, 2005

Evaluations

Novelty
C. D. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

Basu, S., Mooney, R.J., Pasupuleti, K.V., and Ghosh, J. Evaluating the Novelty of Text-Mined Rules Using Lexical Knowledge. KDD-2001.

Understandability Bias
Pazzani, M.J., Mani, S., Shankle, W.R. (2001). Acceptance of Rules Generated by Machine Learning among Medical Experts. Methods of Information in Medicine; 40: 380-385.