A1 Readings Pt. 2 Fayyad
Fayyad et al., "The KDD Process for Extracting Useful Knowledge from Volumes of Data", Communications of the ACM, November 1996, Vol. 39, No. 11.
Manual data analysis is swiftly becoming impractical. The number of databases is growing, and so too is the average dimensionality (records times attributes). Automation is the answer, Fayyad et al. contend.
Fayyad et al. labor to differentiate KDD from DM. DM is but one step in the KDD process, which involves:
- Data collection
- Data selection (picking our topic)
- Preprocessing of target data (prepping our relevant data)
- Transformation of preprocessed data (creating useful structured representations of relevant data)
- Mining of transformed data
- Interpreting discovered patterns (deriving knowledge)
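To make the steps concrete for myself, here is a minimal sketch of the process as a chain of small functions. Everything in it is my own invention for illustration: the toy records, the field names (customer, region, spend), the function names, and the threshold of 100 are assumptions, not anything from the paper. The "mining" step is just an average, which is about the simplest summarization one could pick.

```python
# Hypothetical toy records standing in for the data-collection step.
# Field names and values are invented for illustration.
RAW = [
    {"customer": "a", "region": "north", "spend": "120"},
    {"customer": "b", "region": "north", "spend": ""},  # missing value
    {"customer": "c", "region": "south", "spend": "80"},
    {"customer": "d", "region": "north", "spend": "200"},
    {"customer": "e", "region": "east", "spend": "40"},
]


def select(records, region):
    """Selection: keep only the target data for the question at hand."""
    return [r for r in records if r["region"] == region]


def preprocess(records):
    """Preprocessing: drop records whose spend field is missing."""
    return [r for r in records if r["spend"] != ""]


def transform(records):
    """Transformation: build a structured numeric representation."""
    return [float(r["spend"]) for r in records]


def mine(values):
    """Mining: a plain average, used here as a stand-in for a real model."""
    return sum(values) / len(values)


def interpret(pattern, threshold=100.0):
    """Interpretation: turn the pattern into a statement a person can use."""
    return "high-spend region" if pattern > threshold else "low-spend region"


def kdd(records, region):
    """Run the steps in order; mining is only one link in the chain."""
    target = select(records, region)
    clean = preprocess(target)
    values = transform(clean)
    return interpret(mine(values))


print(kdd(RAW, "north"))
print(kdd(RAW, "south"))
```

Even in a toy like this, most of the code sits around the mining step rather than in it, which fits the authors' point that DM is one step among several.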
The authors raise the concern that the discovery of a pattern from transformed data does not always mean that the pattern is statistically significant. Fundamental is the statistician's art of hypothesis selection.
Ahh, method.
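One way to check whether a discovered pattern could be a fluke is a permutation test: shuffle one attribute many times and count how often chance alone produces a correlation at least as strong as the one observed. This is my own choice of check, not something the paper prescribes, and the function names, the 999 trials, and the fixed seed are assumptions made for the sketch.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, trials=999, seed=0):
    """Share of shuffles whose |correlation| matches or beats the observed one."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself, so p is never exactly 0.
    return (hits + 1) / (trials + 1)


xs = list(range(20))
ys = [2 * x + 1 for x in xs]  # a perfectly linear, made-up relationship
print(permutation_p_value(xs, ys))
```

A small p-value says the pattern is unlikely under shuffling. It does not say the pattern is interesting, and if we test many candidate patterns, some will look significant by chance alone, which is where hypothesis selection comes back in.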
Discovered patterns should be neither underfit nor overfit: overfitted patterns typically fail to be predictive, while underfit models don't provide very much information.
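A small sketch of the contrast, under assumptions of my own: the data follow y = 2x plus a fixed alternating wobble of plus or minus 1 (standing in for noise so the run is reproducible), the underfit model predicts the training mean everywhere, the overfit model memorizes the training points and answers with the nearest one, and a least-squares line sits in between. None of these models come from the paper.

```python
def mse(model, xs, ys):
    """Mean squared error of a model over paired inputs and targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """Underfit: ignore x and always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Reasonable fit: ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_memorizer(xs, ys):
    """Overfit: return the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Made-up data: y = 2x with a fixed +1/-1 wobble in place of random noise.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [x + 0.4 for x in range(9)]
test_y = [2 * x + (-1 if i % 2 == 0 else 1) for i, x in enumerate(test_x)]

models = {
    "mean (underfit)": fit_mean(train_x, train_y),
    "line": fit_line(train_x, train_y),
    "memorizer (overfit)": fit_memorizer(train_x, train_y),
}

for name, model in models.items():
    train_err = mse(model, train_x, train_y)
    test_err = mse(model, test_x, test_y)
    print(f"{name}: train={train_err:.2f} test={test_err:.2f}")
```

The memorizer is perfect on the data it has seen and worse than the line on data it has not, while the mean is poor on both: the first has learned the wobble, the second has learned almost nothing.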
Interestingness:
- Process is assumed to be nontrivial
- Patterns should be valid for new data to some degree of certainty
- Patterns should be novel to the system, and hopefully to the user as well
- Patterns should be understandable (simplicity?)
Model functions in DM:
- classification
- regression
- clustering
- summarization
- dependency modeling
- link analysis
- sequence analysis
What's interesting or useful? What's not? Knowledge leverages some amount of subjectivity, and that subjectivity is the very essence of making judgments about such things as usability, interestingness, informativeness, relevance, and so on. Insert here the role of the information scientist.