Monday, January 31, 2005

Monday 31 January 2005 Lecture

KD: Association

Typically used for recommendations
Also called market basket analysis
What do people buy together? Which attributes are associated?

left hand side: antecedent
right hand side: consequent

association rule must have an associated population P
- pop consists of a set of instances
e.g., each sale at a store is an instance
- set of all transactions is the population


set of items: I = {I1, I2, ..., Im}
transactions: D = {t1, t2, ..., tn}
itemset
a set of items that satisfies some criterion or other


association rule algo
we are generally only interested in association rules w/ high support

Apriori algorithm
If {ACD} is frequent, then all subsets of {ACD} are frequent (e.g., {AC}, {AD}, {CD})
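
A minimal sketch of how that downward-closure property drives the level-wise search, assuming transactions are plain Python sets of item names (all names here are illustrative, not the lecture's):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search for frequent itemsets.

    Apriori property: if {A,C,D} is frequent, all of its subsets are
    frequent -- so a k-itemset is a candidate only if every one of its
    (k-1)-subsets survived the previous level."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    all_freq = set(freq)
    k = 2
    while freq:
        # join step: unions of frequent (k-1)-itemsets
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune step: drop candidates with any infrequent (k-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        freq = {c for c in cands
                if sum(c <= t for t in transactions) / n >= min_support}
        all_freq |= freq
        k += 1
    return all_freq

# e.g., apriori([{"A","C","D"}, {"B","C"}, {"A","C","D"}], min_support=0.5)
```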

Two questions: why is unidirectional causality implied by the terminology (e.g., antecedent, consequent)? Isn't it bidirectional by nature? Direction on the graph as we speak of it now is not temporal

Also, why aren't we interested in low support? Do we want to get only the best association rules in all cases, or sometimes do we want to describe the population space as completely as possible? isn't that determined to some extent by how we plan on using the results?

Re: different feature representations yield different patterns

Confidence vs. support: interestingness!
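
For concreteness, the two measures side by side (a toy sketch; the basket data is made up):

```python
def support(itemset, transactions):
    # Fraction of all transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent (LHS),
    # what fraction also contain the consequent (RHS)?
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# e.g., for the rule {diapers} -> {beer}:
baskets = [{"diapers", "beer"}, {"diapers"},
           {"beer", "chips"}, {"diapers", "beer"}]
print(support({"diapers", "beer"}, baskets))       # 0.5
print(confidence({"diapers"}, {"beer"}, baskets))  # 2/3
```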

We may have that info already present in a DB...

Another algo: Instance-Based Learning

Decision trees, clustering, and association rules are created on historical data; the model is then used to predict/describe the class of a new instance

Instance based: no model is created ahead of time
- learned when a new instance arrives
- identify historical data that is similar

Similar challenges as clustering with respect to distance
symbolic distances are particularly difficult
instance-based learning is effective given an efficient db design
similarity is being pushed into the db--next-generation dbs will enable similarity-based queries
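
A bare-bones instance-based learner along these lines, assuming numeric attributes and Euclidean distance (illustrative only, not the lecture's algorithm):

```python
import math
from collections import Counter

def knn_classify(query, history, k=3):
    """Classify a new instance with no prebuilt model: find the k most
    similar historical instances and let them vote.
    `history` is a list of (feature_vector, class_label) pairs."""
    neighbors = sorted(history,
                       key=lambda pair: math.dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# e.g., two groups of 2-D points with known labels
history = [((1, 1), "a"), ((1, 2), "a"), ((8, 8), "b"), ((9, 8), "b")]
print(knn_classify((2, 1), history))  # "a"
```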

Wednesday, January 26, 2005

Wednesday 26 January 2005 Lecture: Clustering

Clustering


Qualities of a similarity metric
Clustering algos
evaluation discussion



finding natural groupings among a collection of objects into k sets such that the avg distance of points from the centroid of their assigned group is minimized
centroid: average of the coordinates in each dimension
minimum avg pairwise distance

question: we typically pick a k, cluster, then consider whether we'd like to change k. Are there more systematic ways of doing this? E.g., set k to a range and have a formal means for deciding between k values

how is unsupervised learning different from dredging?

finding patterns and then assigning meaning


we can come up with a set of features
clustering is subjective because deciding what groupings are best is a subjective process, and deciding what serves as the basis for similarity is likewise subjective


Distance metric props
symmetry
constancy of self-similarity
positivity (separation)
triangle inequality
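
These four properties can be spot-checked on a finite set of points; a small sketch using Euclidean distance (`is_metric_on` is a made-up helper, not a library call):

```python
import math

def is_metric_on(points, d):
    """Spot-check the four distance-metric properties on a point set."""
    for x in points:
        if d(x, x) != 0:                     # constancy of self-similarity
            return False
        for y in points:
            if x != y and d(x, y) <= 0:      # positivity (separation)
                return False
            if d(x, y) != d(y, x):           # symmetry
                return False
            for z in points:                 # triangle inequality
                if d(x, z) > d(x, y) + d(y, z) + 1e-12:
                    return False
    return True

pts = [(0, 0), (3, 4), (1, 1)]
print(is_metric_on(pts, math.dist))  # True
```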

two types of clustering
partitional
hierarchical/subspace


use dendrograms: root/terminal branch/internal branch/internal node/leaf

desirable props of a clustering algo
scalability
ability to deal with different data types
minimal req's for domain knowledge to determine input params
able to deal with noise & outliers

number of pairwise distances: n(n-1)/2 for n objects


hierarchical is O(n²); exponentials don't scale well


simple clustering: with numeric data only - K-means
1. pick a number K of cluster centers
2. assign each object to its nearest center
3. move each centroid to the mean of its cluster
repeat (2, 3) until movement is less than a sentinel value
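
A bare-bones sketch of those steps (random initial centers; the `sentinel` threshold name just echoes the note above):

```python
import math, random

def kmeans(points, k, sentinel=1e-6):
    centers = random.sample(points, k)        # 1. pick K cluster centers
    while True:
        # 2. assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # 3. move each centroid to the mean of its cluster
        new_centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                       else centers[i]        # keep old center if cluster empty
                       for i, cl in enumerate(clusters)]
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < sentinel:                  # repeat (2, 3) until settled
            return centers, clusters
```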

evaluating your k selection:

inter- and intracluster distance values



partitional clustering
----------------------

how do we evaluate clustering?


BIRCH algo example

question:
we're hoping that our clusters are somehow meaningful summaries of subsets of the attributes
but aren't we somehow more prone to data dredging in unsupervised approaches?

well, everything is subject to verification, validity, meaningfulness

Distance revisited: measuring the distance between an object and a cluster, or between two clusters
single linkage (or nearest neighbor) - distance between the two closest objects in the different clusters
complete linkage (furthest neighbor) - greatest distance between any two objects in the different clusters
group average
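
The three cluster-to-cluster distances as code, assuming clusters are lists of numeric points (illustrative):

```python
import math
from itertools import product

def single_linkage(A, B):    # nearest neighbor
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B):  # furthest neighbor
    return max(math.dist(a, b) for a, b in product(A, B))

def group_average(A, B):     # mean of all pairwise distances
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

A = [(0, 0), (1, 0)]
B = [(4, 0), (9, 0)]
print(single_linkage(A, B), complete_linkage(A, B), group_average(A, B))
# 3.0 9.0 6.0
```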


For class presentation:

B1 - Introduction to KD est Feb 9
Don R. Swanson & Neil R. Smalheiser, An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence 91:183-203; 1997.

B8 - Document Summarization est Mar 28 or 30
Generating Patient-Specific Summaries of Online Literature
Kathleen R. McKeown, Desmond A. Jordan, Vasileios Hatzivassiloglou
*http://www.ics.uci.edu/~pratt/courses/papers/TextMining/patient-specific-summaries.pdf

Monday, January 24, 2005

Monday 24 January Lecture: Classification

Classification example: selecting safe fruit from dangerous fruit on a deserted island with no info

nominal attributes
empirical approach
requires knowledge of the conclusion: THE CLASS

Conclusion | Skin | Color | Size | Flesh
----------------------------------------
class      | nom  | nom   | nom  | nom


decision trees! if attribute1 = value1, then subtree 1;
else if attribute1 = value2, then subtree 2

(Question for later: are there decision tree algos that use something other than number of instances in each class to calculate information gain [or some discriminant other than information gain]?)


decision tree classification:
assume you know the conclusion C that can take any of the values c1...cn
and that an attribute can take values a1...an

then you can calculate P(cj|ai)


P(C=safe|skin=hairy) = 6/8=0.75
P(C=safe|size=small) = 5/9=0.55
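
Those conditional probabilities are just counts over the labeled instances; a toy sketch (the fruit data here is invented and doesn't reproduce the 6/8 and 5/9 figures above):

```python
def cond_prob(instances, cond, outcome):
    """Estimate P(outcome | cond) from labeled instances (dicts)."""
    matching = [x for x in instances if cond(x)]
    return sum(outcome(x) for x in matching) / len(matching)

fruit = [{"skin": "hairy", "size": "small", "safe": True},
         {"skin": "hairy", "size": "large", "safe": False},
         {"skin": "smooth", "size": "small", "safe": True}]

# P(C=safe | skin=hairy) on this toy data = 1/2
print(cond_prob(fruit, lambda x: x["skin"] == "hairy",
                lambda x: x["safe"]))
```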

a conditional probability close to 1 is a good discriminator


Step 1: partition data into groups based on a partitioning attribute and partitioning condition
Step 2: continue until a stop condition is reached
- all or most of the items belong to the same class
- all attributes have been considered and no further partitioning is possible
- such a node is a leaf node


Info[2,3] = 0.971 bits
Info[4,0] = 0.0 bits
Info[3,2] = 0.971 bits
goodness for conditions = 0.693 bits
(5/14)*0.971 + (4/14)*0.0 + (5/14)*0.971 = 0.693
Gain = 0.940 - 0.693 = 0.247
no partiti
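
Those numbers fall out of the standard entropy formula; a small sketch that reproduces them (the [2,3]/[4,0]/[3,2] branch counts over 14 instances, hence 9/5 class totals, are from the example above):

```python
import math

def info(counts):
    """Entropy in bits of a class distribution, e.g. info([2, 3])."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

splits = [[2, 3], [4, 0], [3, 2]]       # class counts in each branch
n = sum(sum(s) for s in splits)         # 14 instances overall
split_info = sum(sum(s) / n * info(s) for s in splits)
gain = info([9, 5]) - split_info        # entropy before minus after

print(round(info([2, 3]), 3))           # 0.971
print(round(split_info, 3))             # 0.693
print(round(gain, 3))                   # 0.247
```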



(Is there some way to handle the assumption that our attribute list is complete?)

(Rescaling/revaluing attribute values may be helpful for useless attributes)




Wednesday, January 19, 2005

Wednesday 19 January 2005 Lecture: Terminology

central questions of KD
what data should you use?
what goes into the mining task?
what will the patterns look like?


what data should you use?

Inputs required:

attribute type | vals are...                                      | type     | order | e.g.
---------------------------------------------------------------------------------------------
nominal        | names                                            | symbolic | n     | zip
ordinal        | scale w/ meaningless interval, meaningful order  | symbolic | y     | ranking
interval       | scale w/ meaningful interval                     | numeric  | y     | deg Celsius
ratio          | scale w/ a true zero                             | numeric  | y     | Kelvin


attributes can be considered along a spectrum:
from a two-valued nominal attribute to a ratio

nominal takes two bits
ordinal: as many symbols as needed
interval: integer
ratio: float


Concept is
‘the thing to be learned’ p38

concepts can be predictive or descriptive, supervised or unsupervised, transparent or black box


goal of predictive dm: induction
predictive mechanisms
- classification
- regression (linear: y=ax+b):
given a set of param-value-to-function-result mappings,
WITHOUT KNOWING THE REAL FUNCTION,
predict the function result for a new param-value
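
A least-squares sketch of that setup: fit y = ax + b to the known mappings without knowing the true function, then predict for a new param value (the data is invented):

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# known param-value -> function-result mappings (true function unknown)
xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
a, b = fit_line(xs, ys)
print(a * 5 + b)  # predicted result for the new param value 5
```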


descriptive concepts
association rules & clusters
associations are often a first step in detecting causation (or making a prediction)
clustering good for epidemiology; for detecting source of a problem


supervised vs. unsupervised
supervised:
input: labeled data
goal: identify the attribute combination that maximizes the chance of getting the correct concept label
unsupervised:
input: unlabelled data
goal: identify groups within the data


black box vs. transparent

is the output black box or transparent?
neural network good example of a black box
ID3 decision tree good example of a transparent one

black box: goal is to get categories right (good for control systems)
transparent: understanding the rules, or the way the answer is reached, is also useful


bias
------

high-dimensional data will have multiple "correct" concept descriptions
bias is not bad
algos are greedy and sometimes insist their results are best


Can we use low-quality data? Yes, but we need high-quality data to map it to the low-quality data

what role do database design principles play in the KD process?
flexibility in the kinds of queries you might write; likewise, difficulty in keeping data accessible
without those queries you might not be able to pick what inputs you want, what data you use
scalability, maintenance, indexability in RDBMS



Monday, January 17, 2005

A1 Readings Pt. 3 DeGruy; Guernsey

DeGruy, Healthcare Applications of Knowledge Discovery in Databases, Journal of Healthcare Information Management, vol 14 no 2, summer 2000

DeGruy makes the case for KDD in healthcare. Am painfully familiar with the questions, slogans, debates, & players.

Successful healthcare KDD implementations:
  • HMO determined disease-specific risk for its members: potential for targeted intervention (or potential denial of coverage); means can likewise be used to identify new risk factors in a population
  • fraud detection in insurance claims
  • NY Worker's Compensation Board decision trees for streamlining compensation claims
  • EHRs (electronic health records) are ripe for KDD - who needs flu shots; who isn't following needed treatments, etc.
  • medical marketing

NYT article oct 16 2003 "Digging for Nuggets of Wisdom", Lisa Guernsey

Text mining vs. information retrieval: text mining goes beyond tokens: categorization, linking unconnected documents, creating visual representations of related docs, synthesizing new ideas

Text mining vs. data mining: text mining starts with unstructured data: text

Determine context/resolve ambiguity/select semantic concepts

Text mining might not recognize subtlety


A1 Readings Pt. 2 Fayyad

Fayyad, et al., "KDD Process for Extracting Useful Knowledge from Volumes of Data", Communications of the ACM, November 1996, Vol 39, no 11.

Manual data analysis is swiftly becoming impractical. The number of databases is growing, and so too is the average dimensionality (records times attributes). Automation is the answer, Fayyad, et al., contend.

Fayyad et al. labor to differentiate KDD from DM. DM is but a step in the KDD process, which involves:
  1. data collection
  2. data selection (picking our topic)
  3. preprocessing of target data (prepping our relevant data)
  4. transformation of preprocessed data (creating useful structured representations of relevant data)
  5. mining of transformed data
  6. interpreting discovered patterns (deriving knowledge)
DM is only step 5 of this six-step process.

The authors raise the concern that the discovery of a pattern from transformed data does not always mean that the pattern is statistically significant. Fundamental is the statistician's art of hypothesis selection.

Ahh, method.

Discovered patterns should be neither underfit nor overfit: overfitted patterns typically fail to be predictive, while underfit models don't provide very much information.

Interestingness:
Process is assumed to be nontrivial
Patterns should be valid for new data to some degree of certainty
Patterns should be novel to the system, and hopefully to the user as well
Patterns should be understandable - simplicity?

Model functions in DM:
  • classification
  • regression
  • clustering
  • summarization
  • dependency modeling
  • link analysis
  • sequence analysis
Challenges & issues:

What's interesting or useful? What's not? Knowledge leverages some amount of subjectivity, and that subjectivity is the very essence of making judgments about such things as usability, interestingness, informativeness, relevance, and so on. Insert here the role of the information scientist.


A1 Readings Pt. 1 Lesk

Lesk, M. "How Much Information...?"

In 1997 M. Lesk estimated the volume of information in the world at several exabytes (~ several billion GB). The spirit of the article is dedicated to haphazard guessing & makeshift counting methods (e.g., estimates of hard drive sales), but entertaining nonetheless. Less entertaining and more haphazard is the later speculation on the "volume" of "human memory": Lesk cites Landauer's estimate of brain capacity at 200MB, which is manifestly dubious, and then calculates total human memory to be just over one exabyte. Oops. Landauer, at least according to Lesk, figures the brain holds 1,000 to 100,000 neurons per bit of memory, based on Landauer's recall tests of human memory. There are obvious reasons why recall tests are a bad way of measuring the amount of memory (more in the computer sense than in the sentimental sense) a human brain can retain. Computers of course store memories in 0s and 1s, but, importantly, those 0s and 1s do not always add up to high-level, semantically rich information. Some of this memory can be very low-level. So human recall tests prima facie bias results to be very low.

To be fair, just as we measure hard disk capacity (and just as Lesk uses hard drive *capacity* quantity figures for his argument) we should guess the very same way with the human brain: count the number of bits. Now, a human brain has ~10e14 neurons, but hardly is a neuron a single bit. Neurons have a fairly high order of ways in which they can articulate: number & length of dendrites & axons alone are but two dimensions which we may count to get an idea of a possible number of states for each neuron. We might also have a few dimensions for measuring neural cell signalling; it's unlikely that nerve cells have just one signal to pass, and it's equally unlikely that the signal is strictly digital. There may be many other dimensions with respect to neurons to count as well, various location-dependent interactions with other aspects of the brain, for example. But we'll skip that arena. We might also count possible states for astrocytes, a species of glial cell, since they may also be involved in "saving state."

10e14 nerve cells

conservative estimate of states per cell
---------------------------------------
avg number of synapses: 10e4
avg number of sodium pumps: 10e6

10e4*10e6=10e10

total number our bits based on neurons: 10e14 * 10e10 = 10e24

one exabyte is approximately 10e19 bits
10e24 bits, our current estimate for the storage capacity of a single human brain without counting astrocytes, is approximately 100,000 times larger than Lesk's estimate for the total of the world's information, and is 10e16 times larger than Landauer's estimate for a single brain.

By the numbers, and by the paper's current methods, the storage needed to cover the capacity of human neurons would be in the order of 10e34 (6*10e9*10e24) bits, or approximately 100 trillion exabytes.
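
The back-of-envelope arithmetic above, spelled out (every quantity is the text's own rough guess, not a vetted figure):

```python
neurons = 10 ** 14           # nerve cells per brain (text's estimate)
synapses = 10 ** 4           # avg synapses per cell
sodium_pumps = 10 ** 6       # avg sodium pumps per cell

states_per_cell = synapses * sodium_pumps    # 10^10
bits_per_brain = neurons * states_per_cell   # 10^24
exabyte_bits = 10 ** 19                      # ~1 EB in bits (rounded)
population = 6 * 10 ** 9                     # world population, ~2005

print(bits_per_brain / exabyte_bits)         # ~10^5 x Lesk's world total
print(population * bits_per_brain)           # ~6 x 10^33 bits, order 10^34
```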

It cannot be remembered for us wholesale.

To be very honest, I don't take seriously any claims to strong AI, and don't fancy the parallels routinely drawn between human experience and computer I/O. The metaphors of computing just don't work very well, except maybe to dumb down and underestimate our understanding of the myriad complexities of this lump of gray matter and its correlated but yet-to-be-verified partner, the mind. For example, there's simply *no* equivalence between an actual sentence in an ASCII file and that same sentence memorized and "inside" a human head. That sentence "inside" the human head ("inside" is in quotes, because no one can actually locate it, touch it, verify it) probably has a ton of other information attached to it, and the recall task brings with it tons of other information. In other words, when it comes to the human mind, a sentence is not a sentence. A sentence (e.g., ASCII) is not a sentence (either in the mind's eye OR in the brain).

So I digress. Lesk's more mild claim that we will be able to store all our media digitally seems reasonable. What this means for text mining is that we will have data, lots of data, for mining (if we gain access to it), and that the quantity of data will demand more and more mining work for information retrieval, classification, discovery, and synthesis.