Wednesday, January 19, 2005

Wednesday 19 January 2005 Lecture: Terminology

central questions of KD
what data should you use?
what goes into the mining task?
what will the patterns look like?


what data should you use?

Inputs required:
attribute type (nominal/ordinal/interval/ratio)
nominal: values are names; symbolic; no order; e.g., zip code
ordinal: scale with meaningful order but meaningless intervals; symbolic; ordered; e.g., ranking
interval: scale with meaningful intervals; numeric; ordered; e.g., degrees Celsius
ratio: scale with a true zero; numeric; ordered; e.g., kelvin
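
A minimal sketch of the four attribute types above, assuming they can be treated as an ordered scale where each level allows everything the previous one does. The names `Scale`, `is_symbolic`, `has_order`, `differences_meaningful` and `ratios_meaningful` are made up for illustration and do not come from the lecture.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Attribute types, ordered from least to most structure (illustrative)."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def is_symbolic(scale):
    # nominal and ordinal are symbolic; interval and ratio are numeric
    return scale <= Scale.ORDINAL


def has_order(scale):
    # order: n/y/y/y
    return scale >= Scale.ORDINAL


def differences_meaningful(scale):
    # only interval and ratio scales have meaningful intervals
    return scale >= Scale.INTERVAL


def ratios_meaningful(scale):
    # "twice as much" only makes sense with a true zero
    return scale == Scale.RATIO


# the examples from the notes
EXAMPLES = {
    "zip code": Scale.NOMINAL,
    "ranking": Scale.ORDINAL,
    "degrees Celsius": Scale.INTERVAL,
    "kelvin": Scale.RATIO,
}

# 20 deg C looks like "twice" 10 deg C, but Celsius has no true zero;
# on the kelvin scale the same two temperatures differ by only a few percent
celsius_ratio = 20.0 / 10.0
kelvin_ratio = (20.0 + 273.15) / (10.0 + 273.15)
```

Here `celsius_ratio` is 2.0 while `kelvin_ratio` is only slightly above 1, which is why ratios are read off the kelvin scale and not the Celsius one.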


attributes can be considered along a spectrum:
from a two-valued nominal attribute to a ratio

nominal (two-valued): a single bit
ordinal: as many symbols as needed
interval: integer
ratio: float


a concept is ‘the thing to be learned’ (p. 38)

concepts can be predictive or descriptive, supervised or unsupervised, transparent or black box


goal of predictive dm: induction
predictive mechanisms
- classification
- regression (linear: y=ax+b):
given a set of param-value-to-function-result mappings,
WITHOUT KNOWING THE REAL FUNCTION,
predict the function result for a new param-value
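
A small sketch of the linear case, assuming ordinary least squares is used to pick a and b (the notes do not say which fitting method was meant). The data points and the names `fit_line` and `predict` are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # spread of x around its mean, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(a, b, x):
    """Function result for a new param-value, using the fitted line."""
    return a * x + b


# hypothetical param-value-to-function-result mappings; the real function is unknown
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

a, b = fit_line(xs, ys)
y_new = predict(a, b, 5.0)  # prediction for a param-value not seen before
```

The fit only sees the (x, y) pairs, never the function that produced them, which is the point of the capitalised line above.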


descriptive concepts
association rules & clusters
associations are often first step in detecting causation (or making a prediction)
clustering is good for epidemiology, e.g. for detecting the source of a problem
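
To make "association" concrete, here is a sketch using support and confidence, two standard measures for association rules that the notes themselves do not name. The shopping-basket data and the function names are assumptions for illustration only.

```python
# hypothetical transactions: each one is the set of items seen together
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# the rule "bread -> milk" describes co-occurrence; it does not show causation
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

A rule with high confidence is only a candidate for a causal or predictive claim, which matches the "first step" wording above.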


supervised vs. unsupervised
supervised:
input: labeled data
goal: identify the attribute combination that maximizes the chance of getting the correct concept label
unsupervised:
input: unlabelled data
goal: identify groups within the data
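
The contrast can be sketched on one-dimensional data. The two techniques here (1-nearest-neighbour for the supervised case, a two-group k-means for the unsupervised case) are stand-ins chosen for brevity, not methods named in the lecture, and all data and function names are hypothetical.

```python
def nearest_neighbour_label(labelled, x):
    """Supervised: labelled is a list of (value, label); return the label of the closest value."""
    _, label = min(labelled, key=lambda pair: abs(pair[0] - x))
    return label


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two groups (1-D k-means with k=2)."""
    centres = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # assign each value to the nearer centre
            idx = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[idx].append(v)
        # move each centre to the mean of its group (keep it if the group is empty)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


# supervised: input is labelled data, output is a label for a new value
labelled = [(1.0, "low"), (1.2, "low"), (5.0, "high"), (5.3, "high")]
predicted = nearest_neighbour_label(labelled, 4.5)

# unsupervised: input has no labels, output is groups found in the data
unlabelled = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
low_group, high_group = two_means(unlabelled)
```

Note that the unsupervised result is just two groups; naming them "low" and "high" is something a person does afterwards.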


black box vs. transparent

is the output black box or transparent?
a neural network is a good example of a black box
an ID3 decision tree is a good example of a transparent one

black box: goal is to get categories right (good for control systems)
transparent: understanding the rules (how the categories are reached) is also useful


bias
------

high-dimensional data will have multiple "correct" concept descriptions
bias is not bad: it is what lets an algorithm pick one description over the others
algos are greedy and sometimes present a locally best result as if it were the best overall


CAN we use low-quality data? yes, but we need high-quality data to map the low-quality data to

what role do database design principles play in the KD process?
they determine how flexible (and how difficult) the queries you can write against the accessible data are
without those queries you might not be able to pick what inputs you want, what data you use
they also affect scalability, maintenance, and indexability in an RDBMS










