<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-10158875</id><updated>2011-04-21T11:13:36.362-07:00</updated><title type='text'>INLS 110-122 Spring 2005 Knowledge Discovery</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>19</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-10158875.post-111281524579806945</id><published>2005-04-06T15:15:00.000-07:00</published><updated>2005-04-06T12:20:45.800-07:00</updated><title type='text'>Lecture Wednesday 06 April 2005</title><content type='html'>Summarization&lt;br /&gt;&lt;br /&gt;Mani, I and Bloedorn, E; Summarizing similarities and differences among related documents. Information Retrieval, 1999. 1(1-2): p. 
35-67.&lt;br /&gt;&lt;br /&gt;automatic text summarization&lt;br /&gt;    analysis phase&lt;br /&gt;    refinement phase&lt;br /&gt;    synthesis phase&lt;br /&gt;&lt;br /&gt;why salient information?&lt;br /&gt;establish similarities &amp; differences&lt;br /&gt;easier to compare salient items than the entire text body&lt;br /&gt;&lt;br /&gt;by identifying both commonalities and differences, we can see what's novel&lt;br /&gt;(very different from the notion of centroids)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;represent each document as a graph in which each node is a word instance and edges are of multiple types: ADJACENCY, SAME, ALPHA, PHRASE, NAME, COREFERENTIAL&lt;br /&gt;weights of graph nodes: activation vector&lt;br /&gt;&lt;br /&gt;phrase extraction:&lt;br /&gt;WordNet as HP&lt;br /&gt;&lt;br /&gt;topic-related text region spreading activation (think of nodes being lit up by queries)&lt;br /&gt;&lt;br /&gt;two types of evaluation: extrinsic evaluation &amp; intrinsic evaluation&lt;br /&gt;extrinsic: how summary affects outcome of some other task&lt;br /&gt;intrinsic: judgements of informativeness&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;what is it you should evaluate?&lt;br /&gt;what if users disagree?&lt;br /&gt;force disagreement?&lt;br /&gt;compare it to system....&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Okurowski, M. et al (2000). Text summarizer in use: lessons learned from real world deployment and evaluation. 
Proceedings of the ANLP/NAACL Workshop on Automatic Summarization, 49-58.&lt;br /&gt;&lt;br /&gt;Mark Pope presenting&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Question: we seem to be making the assumption that we are improving upon relevance rather than leaving it behind, that we are trying to get things "faster"&lt;br /&gt;&lt;br /&gt;maybe we already know what our most relevant documents are, and instead of getting a "more efficient" representation, maybe we want to learn something remarkable&lt;br /&gt;maybe we pick our favorite papers on a topic, we've read them and have grokked them well, but maybe we feel we might be missing something.&lt;br /&gt;&lt;br /&gt;this idea of the technology suggesting the task is good&lt;br /&gt;but also being more creative with problem identification than "information overload" in the "millions of relevant document" sense&lt;br /&gt;let's IMAGINE some uses that are not currently part of any professional's task&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-111281524579806945?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/111281524579806945/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=111281524579806945' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111281524579806945'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111281524579806945'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/04/lecture-wednesday-06-april-2005.html' title='Lecture Wednesday 06 April 2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-111221387025512163</id><published>2005-03-30T15:15:00.000-08:00</published><updated>2005-03-30T12:17:50.266-08:00</updated><title type='text'>Lecture Wednesday 30 March 2005</title><content type='html'>Shift focus from retrieval to synthesis&lt;br /&gt;&lt;br /&gt;Users making consequential decisions&lt;br /&gt;&lt;br /&gt;Information systems in augmentative role&lt;br /&gt;&lt;br /&gt;130k+ breast cancer articles (MeSH term: "breast neoplasm")&lt;br /&gt;cancer 'cures' often worse than symptoms&lt;br /&gt;beyond age + gender, risk factors don't explain half of breast cancer incidence&lt;br /&gt;&lt;br /&gt;desirable properties of a system:&lt;br /&gt;    integrate well with work practices&lt;br /&gt;    &lt;br /&gt;&lt;br /&gt;In epidemiology, because of the non-experimental design of the studies, we should expect results to be controversial (&lt;a href="http://www.hsph.harvard.edu/Organizations/DDIL/gep_PE.html"&gt;http://www.hsph.harvard.edu/Organizations/DDIL/gep_PE.html&lt;/a&gt;)&lt;br /&gt;risk factors should be skeptically treated in exploring any notion of causality&lt;br /&gt;&lt;br /&gt;Cochrane reviews (&lt;a href="http://www.cochrane.org/index2.htm"&gt;http://www.cochrane.org/index2.htm&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;Are these results from Epi studies, RCTs, mixtures?&lt;br /&gt;&lt;br /&gt;Cochrane reviews pertain mainly to RCTs:&lt;br /&gt;"The Cochrane Collaboration and the Cochrane Reviewers’ Handbook focus particularly on systematic reviews of randomised controlled trials (RCTs) because they are likely to provide more reliable information than other sources of evidence on the differential effects of alternative forms of healthcare (Kunz 2003). 
Systematic reviews of other types of evidence can also help those wanting to make better decisions about healthcare, particularly forms of care where RCTs have not been done and may not be possible or appropriate. The basic principles of reviewing research are the same, whatever type of evidence is being reviewed. Although we focus mainly on systematic reviews of RCTs we address issues specific to reviewing other types of evidence when this is relevant. Fuller guidance on such reviews is being developed."&lt;br /&gt;(&lt;a href="http://www.cochrane.dk/cochrane/handbook/1_introduction/1_0_introduction.htm"&gt;http://www.cochrane.dk/cochrane/handbook/1_introduction/1_0_introduction.htm&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;Cohort studies: (&lt;a href="http://servers.medlib.hscbklyn.edu/ebm/2400.htm"&gt;http://servers.medlib.hscbklyn.edu/ebm/2400.htm&lt;/a&gt;)&lt;br /&gt;"A Cohort Study is a study in which patients who presently have a certain condition and/or receive a particular treatment are followed over time and compared with another group who are not affected by the condition under investigation.&lt;br /&gt;&lt;br /&gt;"For instance, since a randomized controlled study to test the effect of smoking on health would be unethical, a reasonable alternative would be a study that identifies two groups, a group of people who smoke and a group of people who do not, and follows them forward through time to see what health problems they develop.&lt;br /&gt;&lt;br /&gt; "Cohort studies are not as reliable as randomized controlled studies, since the two groups may differ in ways other than in the variable under study. 
For example, if the subjects who smoke tend to have less money than the non-smokers, and thus have less access to health care, that would exaggerate the difference between the two groups.&lt;br /&gt;&lt;br /&gt; "The main problem with cohort studies, however, is that they can end up taking a very long time, since the researchers have to wait for the conditions of interest to develop. Physicians are, of course, anxious to have meaningful results as soon as possible, but another disadvantage with long studies is that things tend to change over the course of the study. People die, move away, or develop other conditions, new and promising treatments arise, and so on. Even so, cohort studies are generally preferred to case control studies, since they involve far fewer statistical problems and generally produce more reliable answers. "&lt;br /&gt;&lt;br /&gt;More on Epi tests/experimental design:&lt;br /&gt;&lt;a href="http://www.vetmed.wsu.edu/courses-jmgay/GlossClinStudy.htm"&gt;http://www.vetmed.wsu.edu/courses-jmgay/GlossClinStudy.htm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Concept of "ever drinker"&lt;br /&gt;working in tox labs, dosage was everything&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I want to see dimensions that express dosage mean &amp; variance&lt;br /&gt;and clinical experiment OR observational study&lt;br /&gt;use the height of the bar or the shade/intensity of color&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-111221387025512163?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/111221387025512163/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=111221387025512163' title='1 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/10158875/posts/default/111221387025512163'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111221387025512163'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/03/lecture-wednesday-30-march-2005.html' title='Lecture Wednesday 30 March 2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-111211735717885342</id><published>2005-03-28T12:15:00.000-08:00</published><updated>2005-03-29T09:29:17.180-08:00</updated><title type='text'>Lecture Monday 28 March 2005</title><content type='html'>Visualization of text data&lt;br /&gt;&lt;br /&gt;ThemeRiver&lt;br /&gt;Reminds me of babynamewizard&lt;br /&gt;http://babynamewizard.com/namevoyager/lnv0105.html&lt;br /&gt;&lt;br /&gt;Indicates a growth of newsfeeds with certain events.  
That there's more text.&lt;br /&gt;&lt;br /&gt;Jon: frustrated with the neglect of some dimensions&lt;br /&gt;&lt;br /&gt;Kohonen maps (SOM)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Pratt's DynaCat (1997):&lt;br /&gt;approach is to establish predefined questions&lt;br /&gt;then generate dynamic categories--hierarchical categories&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-111211735717885342?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/111211735717885342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=111211735717885342' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111211735717885342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111211735717885342'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/03/lecture-monday-28-march-2005.html' title='Lecture Monday 28 March 2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-111163453718764184</id><published>2005-03-23T15:21:00.000-08:00</published><updated>2005-03-23T19:22:17.190-08:00</updated><title type='text'>Lecture Wednesday 23 March 2005</title><content type='html'>Clifton, C, Cooley, R, Żytkow, JM, and Rauch, J;  TopCat: data mining for topic identification in a text corpus. in Principles of Data Mining and Knowledge Discovery. 
Third European&lt;br /&gt;&lt;br /&gt;Experimental Design:&lt;br /&gt;Factor: something you are changing&lt;br /&gt;keywords being assigned&lt;br /&gt;interest instead of support-confidence&lt;br /&gt;person-org-place: using this representation  other entities, &lt;br /&gt;&lt;br /&gt;Level: what you are setting the factor to&lt;br /&gt;  It is important to cite reasons for why you're selecting factors and why you're fixing those factors&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;e.g., setting minimum and maximum term frequencies to get documents with a minimum of five terms in both docs.  to retain as close as possible to half of the original document corpus&lt;br /&gt;&lt;br /&gt;e.g., why ten-fold cross-validation?  because most people use it....&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Blake, C. &amp; Pratt, W. (2001). Better rules, fewer features: A semantic approach to selecting features from text. In Proceedings of the Institute of Electrical and Electronics Engineers Data Mining Conference (IEEE DM 2001), San Jose, CA.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-111163453718764184?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/111163453718764184/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=111163453718764184' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111163453718764184'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111163453718764184'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/03/lecture-wednesday-23-march-2005.html' title='Lecture Wednesday 23 March 
2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-111023886126341350</id><published>2005-03-07T15:15:00.000-08:00</published><updated>2005-03-07T15:41:01.266-08:00</updated><title type='text'>Lecture Monday 07 March 2005: CLUSTERING!</title><content type='html'>Document clustering&lt;br /&gt;Lecture Monday 07 March 2005: CLUSTERING!&lt;br /&gt;&lt;br /&gt;Papers covered:&lt;br /&gt;&lt;br /&gt;    * Tao Li, Sheng Ma, Mitsunori Ogihara Clustering: Document clustering via adaptive subspace iteration, Proceedings of the 27th annual international conference on Research and development in information retrieval (2004)&lt;br /&gt;&lt;br /&gt;    * Mark Steyvers , Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths, Probabilistic author-topic models for information discovery, Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining (2004)&lt;br /&gt;&lt;br /&gt;Presented by Elxsas, J.&lt;br /&gt;&lt;br /&gt;1. What is clustering?&lt;br /&gt;2. Two articles&lt;br /&gt;3. Discussion&lt;br /&gt;&lt;br /&gt;Jain, Murty, Flynn. Data clustering: a review.  ACM Computing Surveys. Vol 31, No. 3, pp. 264-323, 1999.&lt;br /&gt;&lt;br /&gt;Partitional vs hierarchical clustering&lt;br /&gt;k-means: partitional square error&lt;br /&gt;&lt;br /&gt;text clustering: locate topically similar(?) 
documents&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;open problems:&lt;br /&gt; - how do you represent docs in a way these algos can take advantage of them?&lt;br /&gt; - document or term similarity: how is similarity measured?&lt;br /&gt; - cluster interpretation: what does that cluster mean?&lt;br /&gt; - cluster evaluation: are the clusters any good?&lt;br /&gt;&lt;br /&gt;Li, Ma, Ogihara: Document clustering via adaptive subspace iteration&lt;br /&gt; - addresses document similarity &amp; cluster interpretation&lt;br /&gt;  - subspace clustering&lt;br /&gt;&lt;br /&gt;ASI:&lt;br /&gt;in: M: term-document matrix&lt;br /&gt;    k: number of clusters (k can be estimated)&lt;br /&gt;out: &lt;br /&gt;   D: cluster for each doc&lt;br /&gt;   F: subspace for each cluster&lt;br /&gt;&lt;br /&gt;partitions docs into k groups (optimizing D)&lt;br /&gt;reduces distance between docs and centroids &lt;br /&gt;&lt;br /&gt;help me out: why should I expect my clusters to correspond to topics?  Because we want them to?  clusters aren't being defined on one-dimensional topics; we know they're probably not, and we don't know what they mean exactly, even if explicitly described.&lt;br /&gt;&lt;br /&gt;mutual information score!!!&lt;br /&gt;can be used for disambiguation: p. 239, Schütze&lt;br /&gt;&lt;br /&gt;Probabilistic author-topic models for info discovery&lt;br /&gt;&lt;br /&gt;modeling topics: a model of word generation based on statistics&lt;br /&gt;modeling authors: model of author association based on topics and terms&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;showing increase and decrease of topics over time&lt;br /&gt;&lt;br /&gt;association of words with language fashions: e.g., use of Greek letterings vs. 
French letterings&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;assigning topics to unseen documents; separating two abstracts that have been merged;  detecting an author's surprising papers - problems well-defined by what data we have by the system...the method recommends such things by nature&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;EM clustering&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-111023886126341350?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/111023886126341350/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=111023886126341350' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111023886126341350'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/111023886126341350'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/03/lecture-monday-07-march-2005.html' title='Lecture Monday 07 March 2005: CLUSTERING!'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110962023056580000</id><published>2005-02-28T15:15:00.000-08:00</published><updated>2005-02-28T11:50:30.566-08:00</updated><title type='text'>Lecture Monday 28 February 2005</title><content type='html'>Information Extraction&lt;br /&gt;----------------------&lt;br /&gt;&lt;br /&gt;Two general approaches to information systems:&lt;br /&gt;&lt;br /&gt;1.  
knowledge engineering&lt;br /&gt; - hand-constructed grammars&lt;br /&gt; - human experts design rules&lt;br /&gt; - e.g., Paice &amp; Jones, Blaschke et al&lt;br /&gt;&lt;br /&gt;2.  trained systems&lt;br /&gt; - use stats where possible to learn rules&lt;br /&gt; - Riloff less info, not fully learned; Califf &amp; Mooney&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Even trained system approaches require background knowledge of some sort&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;knowledge vs. data trade-off&lt;br /&gt;1.2 million words needed to learn statistically&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;levels of info&lt;br /&gt;&lt;br /&gt;text&lt;br /&gt;words e.g., POS&lt;br /&gt;noun phrase e.g., phrase units&lt;br /&gt;sentence level&lt;br /&gt;inter sentence level e.g., anaphora resolution &amp; discourse analysis&lt;br /&gt;template level - changes format to output required&lt;br /&gt;&lt;br /&gt;effort increased as&lt;br /&gt; &lt;br /&gt;&lt;br /&gt;AAAI Appelt &amp; Hobbs&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;IE techniques:&lt;br /&gt;KB - &lt;br /&gt;Semi-learned&lt;br /&gt;Learned&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Representation for Learning&lt;br /&gt;---------------------------&lt;br /&gt;&lt;br /&gt;pre filler pattern&lt;br /&gt;filler: what you want&lt;br /&gt;post filler pattern&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;RAPIER: Robust Automated Production of Information Extraction Rules&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;covering algorithm: takes a sentence that is a positive example&lt;br /&gt;while more positive examples remain, create a rule that removes the majority of positive examples&lt;br /&gt;&lt;br /&gt;(seems like it would have pretty good precision but not so good recall)&lt;br /&gt;&lt;br /&gt;RAPIER is one good example of a covering algo&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110962023056580000?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link 
rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110962023056580000/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110962023056580000' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110962023056580000'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110962023056580000'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/lecture-monday-28-february-2005.html' title='Lecture Monday 28 February 2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110963109158103156</id><published>2005-02-28T14:51:00.000-08:00</published><updated>2005-02-28T14:51:31.583-08:00</updated><title type='text'>To do week of 28 Feb 2005</title><content type='html'>TO DO FOR THE TM PROJECTS (independent study/kdd)&lt;br /&gt;&lt;br /&gt;load hypernym tree&lt;br /&gt;correct synset tree&lt;br /&gt;regenerate webterm+lragr+synset table&lt;br /&gt;regenerate statistics for table&lt;br /&gt;&lt;br /&gt;for each term with multiple POS,&lt;br /&gt; find hypernym I-V&lt;br /&gt; add hypernym I-V columns &amp; add those hypernyms&lt;br /&gt;find lowest level reduction&lt;br /&gt; &lt;br /&gt;  &lt;br /&gt;figure out how to backtrack to POS&lt;br /&gt;install POS tagger&lt;br /&gt;for every sentence&lt;br /&gt; retrieve &amp; reconstruct sentence&lt;br /&gt; submit sentence to pos tagger&lt;br /&gt; retrieve results&lt;br /&gt; insert POS info back to webterm data set&lt;br /&gt;&lt;br /&gt;now retrieve data set limited to POS={N, 
V, ADJ, ADV}&lt;br /&gt;create as table webtermtagged&lt;br /&gt;&lt;br /&gt;equijoin webtermtagged on webtermlragrsynset&lt;br /&gt;&lt;br /&gt;generate statistics again&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110963109158103156?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110963109158103156/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110963109158103156' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110963109158103156'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110963109158103156'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/to-do-week-of-28-feb-2005.html' title='To do week of 28 Feb 2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110841226531834641</id><published>2005-02-14T15:15:00.000-08:00</published><updated>2005-02-14T12:17:45.320-08:00</updated><title type='text'>Lecture Monday 14 February 2005</title><content type='html'>papers:&lt;br /&gt;Marti Hearst&lt;br /&gt;Henry Small&lt;br /&gt;&lt;br /&gt;heavy stemming good for high recall&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Pruning our document space tf*idf&lt;br /&gt;&lt;br /&gt;term freq * inverse doc freq&lt;br /&gt;&lt;br /&gt;term frequency in document&lt;br /&gt;idf: two options&lt;br /&gt;        inverse of the number of documents with term i&lt;br /&gt;       
log of the inverse document fraction, i.e. total docs over docs with term i (more typical; controls for wide range), which is log(n/df_i)  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;co-occurrence analysis&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;send in statistics&lt;br /&gt;how many unique words?&lt;br /&gt;stemmed words?&lt;br /&gt;some data that describes the dataset&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;we can avoid using a "search-centric" approach by assuming we're discovering knowledge?  by assuming&lt;br /&gt;&lt;br /&gt;but at the end of the day&lt;br /&gt;we still want to generate hypotheses&lt;br /&gt;&lt;br /&gt;generate vs. discover&lt;br /&gt;discovery seems to be a search paradigm&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;avoiding search-centric: removing the document as the denominator&lt;br /&gt;&lt;br /&gt;if it's already explicit then it's searching not discovery&lt;br /&gt;if it's not explicit then it's discovery (or hypothesis generation)&lt;br /&gt;&lt;br /&gt;generation:&lt;br /&gt;vs:&lt;br /&gt;discovery:&lt;br /&gt;&lt;br /&gt;we had an experiment where we cited a percentage of truth to a statement, but we need a reliable denominator that says, this is 100% true&lt;br /&gt;&lt;br /&gt;concern surrounds user&lt;br /&gt;&lt;br /&gt;Swanson:&lt;br /&gt;something is not in the literature&lt;br /&gt;&lt;br /&gt;if we are generating hypotheses, then it is beside the point as to whether we should predicate discovery in a relative or absolute sense&lt;br /&gt;&lt;br /&gt;testable hypotheses&lt;br /&gt;unverifiable statements:&lt;br /&gt;unfalsifiable statements:&lt;br /&gt;    generating poetry&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110841226531834641?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110841226531834641/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110841226531834641' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110841226531834641'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110841226531834641'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/lecture-monday-14-february-2005.html' title='Lecture Monday 14 February 2005'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110818760622800813</id><published>2005-02-11T21:47:00.000-08:00</published><updated>2005-02-11T22:03:58.920-08:00</updated><title type='text'>Thoughts on knowledge discovery versus hypothesis generation</title><content type='html'>reading some of the philosophy behind scientific reasoning...discussions of incommensurability from Polanyi, Kuhn, Feyerabend...noting that Kuhn believed alternate hypotheses should be entertained only when science is at a point of crisis while Feyerabend suggested we might always want to entertain alternatives as a necessity perhaps of scientific reasoning...someone suggests that Kuhn's theories do not apply to biology...oxidative phosphorylation is perhaps a good example for examining incommensurability, as two competing theories battled from 1960 or 61 until the late 70s....&lt;br /&gt;&lt;br /&gt;Some argument&lt;br /&gt;&lt;br /&gt;If it is our goal to mine information from the literature, and if the best we can do with mining text is to generate hypotheses rather than make discoveries (because making knowledge discoveries requires a tight coupling between the signifier and the signified, between 
the world and the language used to describe it, whereas generating hypotheses suggests we are merely producing ideas that need to be tested based on what's already present--generated hypotheses as emergent properties of the interaction of texts and text mining applications), then, given that information is essentially what captures variance, we might want to generate as best we can orthodox hypotheses--hypotheses with a majority of data already confirming those hypotheses--and, more importantly, heterodox hypotheses--hypotheses suggested by the literature with little or no data supporting them. The more heterodox a hypothesis, the greater the potential for information. It's a risk/reward trade-off: the costs of testing heterodox hypotheses given the reduced likelihood of veracity may be offset by the motherlode potential of confirming but one extremely heterodox theory.&lt;br /&gt;&lt;br /&gt;Further, we may want to mine the various ways these hypotheses, whether heterodox or orthodox, are contradicted in the literature. 
It seems our burden in science is to posit a hypothesis and then try our best to disprove it in every way possible, rather than try to prove it.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;http://tinyurl.com/4y39s&lt;br /&gt;http://tinyurl.com/5j5b5&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Metaphors defining our cognitive scaffolding, which in turn defines and limits how it is we witness novel things (or any thing, for that matter) and deem them interesting &lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110818760622800813?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110818760622800813/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110818760622800813' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110818760622800813'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110818760622800813'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/thoughts-on-knowledge-discovery-versus.html' title='Thoughts on knowledge discovery versus hypothesis generation'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110815421125581547</id><published>2005-02-11T13:03:00.000-08:00</published><updated>2005-02-11T12:36:51.256-08:00</updated><title type='text'>Friday CRADLE Talk Mining Spatial Patterns from Protein Structures</title><content type='html'>Luke 
Huan&lt;br /&gt;&lt;br /&gt;Seq-&gt;Struct-&gt;Function&lt;br /&gt;          &lt;br /&gt;Structure indicates function&lt;br /&gt;&lt;br /&gt;Mine frequent subgraphs; retrieve spatial motifs from protein structure data&lt;br /&gt;&lt;br /&gt;Global vs. local alignments of protein structures&lt;br /&gt;local alignments are motifs&lt;br /&gt;&lt;br /&gt;Mining for subgraphs/spatial motifs is a challenging problem for data mining&lt;br /&gt;&lt;br /&gt;How to model protein w/ set of points&lt;br /&gt;each aa (amino acid) is represented by a point in 3d space; protein structure is a point set&lt;br /&gt;LCP largest common point set problem&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Is clique hashing a useful pattern finding approach for other domains?&lt;br /&gt;Heavy combinatorial approach&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;SCOP Structural Classification of Proteins DB&lt;br /&gt;10 classes&lt;br /&gt;800 folds&lt;br /&gt;1294 superfamilies&lt;br /&gt;2327 families&lt;br /&gt;&lt;br /&gt;every protein entered into this db is annotated using this scheme&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;hypergeometric distribution:&lt;br /&gt; The problem of finding the probability of such a picking problem is sometimes called the "urn problem," since it asks for the probability that &lt;i&gt;i&lt;/i&gt; out of &lt;i&gt;N&lt;/i&gt; balls drawn are "good" from an urn that contains &lt;i&gt;n&lt;/i&gt; "good" balls and &lt;i&gt;m&lt;/i&gt; "bad" balls.  It therefore also describes the probability of obtaining exactly &lt;i&gt;i&lt;/i&gt; correct balls in a pick-&lt;i&gt;N&lt;/i&gt; lottery from a reservoir of &lt;i&gt;r&lt;/i&gt; balls (of which &lt;i&gt;n = N&lt;/i&gt; are "good" and &lt;img src="http://mathworld.wolfram.com/himg4416.gif" align="middle" border="0" height="29" width="81" /&gt; are "bad").  
(from MathWorld http://mathworld.wolfram.com/HypergeometricDistribution.html)&lt;br /&gt;&lt;br /&gt;Searching for biological relevance:&lt;br /&gt;    reading papers: key word search, drawing from experience&lt;br /&gt;    website search&lt;br /&gt;    talking to people who know the subjects&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;presented an algo for finding frequently occurring spatial motifs&lt;br /&gt;discovered motifs are specific, measured by low P-values using the hypergeometric distribution&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;motifs they discover are highly specific, measured by low P-values&lt;br /&gt;&lt;br /&gt;mapping structures to highly specific motifs&lt;br /&gt;how about mapping sequence&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110815421125581547?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110815421125581547/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110815421125581547' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110815421125581547'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110815421125581547'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/friday-cradle-talk-mining-spatial.html' title='Friday CRADLE Talk Mining Spatial Patterns from Protein Structures'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' 
src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110780726388096165</id><published>2005-02-07T15:15:00.000-08:00</published><updated>2005-02-07T12:14:23.880-08:00</updated><title type='text'>Monday 07 February 2005 Lecture: Interestingness</title><content type='html'>Silberschatz &amp; Tuzhilin&lt;br /&gt;&lt;br /&gt;Actionable &amp;amp; Unexpected makes for interestingness&lt;br /&gt;&lt;br /&gt;possibly: unexpectedness is roughly equivalent to actionability&lt;br /&gt;&lt;br /&gt;unexpectedness as a criterion for interestingness is uninteresting&lt;br /&gt;&lt;br /&gt;valid&lt;br /&gt;understandable&lt;br /&gt;novel&lt;br /&gt;--&gt;interestingness&lt;--&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Are there uninteresting facts that arise that somehow might cause beliefs to change?  Or a lack of facts, can they cause beliefs to change?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;the data vs the processing of it?&lt;br /&gt;&lt;br /&gt;this statement is either trivial or false&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;terms change?  terms must stay the same&lt;br /&gt;&lt;br /&gt;statements&lt;br /&gt;&lt;br /&gt;what is my confidence factor that all birds can fly?  
my confidence is low but some birds can fly&lt;br /&gt;unexpectedness roughly equivalent&lt;br /&gt;&lt;br /&gt;confidence factor &amp; CYC: assign confidence associated with facts&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Rational agency:&lt;br /&gt;http://www.ryerson.ca/~dgrimsha/courses/cps720/rational.html&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;expected data returned&lt;br /&gt;but&lt;br /&gt;beliefs change&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;something between data and belief that seems absent from the discussion in the paper&lt;br /&gt;there must be some sort of representation in between that converts data to belief that might be rational and might convert expected data into new beliefs&lt;br /&gt;e.g., we might be able to test for significance&lt;br /&gt;&lt;br /&gt;understanding was somewhat removed; a telltale sign is this elimination of the intermediary between the receipt of data and the formulation of belief&lt;br /&gt;&lt;br /&gt;throw in data that is wildly aberrant &amp; witness change?&lt;br /&gt;unexpected data witness belief changes&lt;br /&gt;&lt;br /&gt;monitor belief changes, see if enough data is out there&lt;br /&gt;&lt;br /&gt;what's better?  monotonicity is conserved or maximally rejected?&lt;br /&gt;&lt;br /&gt;reaction essay for every paper?....
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(x/y)&lt;br /&gt;correctly classified/incorrectly classified&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;homework:&lt;br /&gt;pick a UCI dataset that starts with the same letter as your first name, if not, then last name&lt;br /&gt;explore different classification algos within the WEKA toolkit&lt;br /&gt;email Cathy the decision tree + 1 more classifier&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110780726388096165?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110780726388096165/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110780726388096165' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110780726388096165'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110780726388096165'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/monday-07-february-2005-lecture.html' title='Monday 07 February 2005 Lecture: Interstingness'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110745093931843079</id><published>2005-02-03T09:15:00.000-08:00</published><updated>2005-02-03T09:15:39.316-08:00</updated><title type='text'></title><content type='html'>Evaluations&lt;br /&gt;&lt;br /&gt;Novelty&lt;br /&gt;C. D. Fellbaum. WordNet: An Electronic Lexical Database. 
MIT Press, Cambridge, MA, 1998.&lt;br /&gt;&lt;br /&gt;Basu,S. Mooney,R.J., Pasupuleti, K.V. and Ghosh,J. Evaluating the Novelty of Text-Mined Rules using Lexical Knowledge KDD-2001&lt;br /&gt;&lt;br /&gt;Understandability Bias&lt;br /&gt;Pazzani, M.J., Mani,S., Shankle,W.R.  (2001). Acceptance of Rules Generated by Machine Learning among Medical Experts. Methods of Information in Medicine; 40: 380-385&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110745093931843079?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110745093931843079/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110745093931843079' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110745093931843079'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110745093931843079'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/02/evaluations-novelty-c.html' title=''/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110720199273181063</id><published>2005-01-31T15:15:00.000-08:00</published><updated>2005-01-31T12:06:32.730-08:00</updated><title type='text'>Monday 31 January 2005 Lecture</title><content type='html'>KD: Association&lt;br /&gt;&lt;br /&gt;Typically used for recommendations&lt;br /&gt;Also called market 
basket analysis&lt;br /&gt;What do people buy together?  What attributes are associated?&lt;br /&gt;&lt;br /&gt;left hand side: antecedent&lt;br /&gt;right hand side: consequent&lt;br /&gt;&lt;br /&gt;association rule must have an associated population P&lt;br /&gt;    - pop consists of a set of instances&lt;br /&gt;                    e.g., each sale at a store is an instance&lt;br /&gt;    - set of all transactions is the population&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;set of items I {I1, I2, ..., Im}&lt;br /&gt;transactions: D = {t1, t2, ... tn}&lt;br /&gt;Itemset&lt;br /&gt;    set of items that satisfy some criteria or other&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;association rule algo&lt;br /&gt;we are generally only interested in association rules w/ generally high support&lt;br /&gt;&lt;br /&gt;A priori algorithm&lt;br /&gt;If {ACD} is frequent, then all subsets of {ACD} are frequent ({AC}, {AD}, {CD})&lt;br /&gt;&lt;br /&gt;Two questions: why is unidirectional causality implied by the terminology (e.g., antecedent, consequent)?  Isn't it &lt;span style="font-style: italic;"&gt;bi&lt;/span&gt;directional by nature?  Direction on the graph as we speak of it now is not temporal&lt;br /&gt;&lt;br /&gt;Also, why aren't we interested in low support?  Do we want to get only the best association rules in all cases, or sometimes do we want to describe the population space as completely as possible?  Isn't that determined to some extent by how we plan to use the results?&lt;br /&gt;&lt;br /&gt;Re:  different feature representations yield different&lt;br /&gt;&lt;br /&gt;Confidence vs. support: interestingness!&lt;br /&gt;&lt;br /&gt;We may have that info already present in a DB...
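&lt;br /&gt;&lt;br /&gt;A minimal Python sketch of support, confidence, and the a priori subset property from the notes above. The transactions, the 0.4 support threshold, and the function names are made up for illustration; this is only a way to check the definitions on toy data, not the algorithm as presented in lecture.&lt;br /&gt;&lt;pre&gt;
```python
from itertools import combinations

# Hypothetical transactions: each sale is one instance in the population.
transactions = [
    {"A", "C", "D"},
    {"A", "C"},
    {"A", "C", "D"},
    {"B", "C"},
    {"A", "D"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def all_subsets_frequent(itemset, transactions, min_support):
    """A priori property: every non-empty proper subset of a frequent itemset is frequent."""
    items = sorted(itemset)
    return all(
        support(subset, transactions) >= min_support
        for size in range(1, len(items))
        for subset in combinations(items, size)
    )


acd_support = support({"A", "C", "D"}, transactions)
rule_confidence = confidence({"A", "C"}, {"D"}, transactions)
subsets_ok = all_subsets_frequent({"A", "C", "D"}, transactions, min_support=0.4)
```
&lt;/pre&gt;On this toy data support({B, C}) is the same whichever way the rule is read, but confidence is not: B-&gt;C gets 1.0 while C-&gt;B gets 0.25, which may be one reason the rule is written with a direction even though nothing temporal is implied.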
&lt;br /&gt;&lt;br /&gt;Another algo: Instance-Based Learning&lt;br /&gt;&lt;br /&gt;Decision trees, clustering and association rules are created on historical data, then model us used to predict/describe class of new instance&lt;br /&gt;&lt;br /&gt;Instance based: no model is created ahead of time&lt;br /&gt; - learned when a new instance arrives&lt;br /&gt; - identify historical data that is simlar&lt;br /&gt;&lt;br /&gt;Similar challenges as clustering with respect to distance&lt;br /&gt;symbolic distances are particualarly difficult&lt;br /&gt;instance-based learning effective when efficient db design&lt;br /&gt;similarity is being pushecd into the db--next-generation dbs will enable similarity-based queries&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110720199273181063?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110720199273181063/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110720199273181063' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110720199273181063'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110720199273181063'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/monday-31-january-2005-lecture.html' title='Monday 31 January 2005 Lecture'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' 
src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110677005831759817</id><published>2005-01-26T15:15:00.000-08:00</published><updated>2005-01-26T12:07:38.316-08:00</updated><title type='text'>Wednesday 26 January 2005 Lecture: Clustering</title><content type='html'>Clustering&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Qualities of a similarity metric&lt;br /&gt;Clustering algos&lt;br /&gt;evaluation discussion&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;finding natural groupings among collection of objects into k sets such that the avg distance of points from the centroid of their assigned group is minimized&lt;br /&gt;centroid: average of coordinates in each dimension&lt;br /&gt;minimum avg pairwise distance&lt;br /&gt;&lt;br /&gt;question: we typically pick a k, cluster, then consider whether we'd like to change k.  are there more systematic ways of doing this?  setting k to a range; have a formal means for deciding between k values&lt;br /&gt;&lt;br /&gt;how is unsupervised learning different from dredging?
&lt;br /&gt;&lt;br /&gt;finding patterns and then assigning meaning&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;we can come up with a set of features&lt;br /&gt;clustering is subjective because deciding what groupings are best is a subjective process, and deciding what serves as the basis for similarity is likewise subjective&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Distance metric props&lt;br /&gt;symmetry&lt;br /&gt;constancy of self-similarity&lt;br /&gt;positivity (separation)&lt;br /&gt;triangular inequality &lt;br /&gt;&lt;br /&gt;two types of clustering&lt;br /&gt;    partitional&lt;br /&gt;    hierarchical/subspace &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;use dendograms:  root/terminal branch/internal branch/internal node/leaf&lt;br /&gt;&lt;br /&gt;desirable props of a clustering algo&lt;br /&gt;scalabilty&lt;br /&gt;ability to dela with diff data types&lt;br /&gt;minimal req's for domain knowledge to determine input params&lt;br /&gt;able to deal with noice &amp; outliers&lt;br /&gt;&lt;br /&gt;(number of objects)!/2 - number of objects&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;hierarchival is O(n2); exponentials don't scale well&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;simple clustering: with numeric data only  - K-means&lt;br /&gt;pick a  number of (K) of cluster centers&lt;br /&gt;move each centroid to the mean of each cluster&lt;br /&gt;repeat (2,3) until movement is less than a sentinel value&lt;br /&gt;&lt;br /&gt;value of your k selection&lt;br /&gt;&lt;br /&gt;inter and intracluser distance values&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;partitioned clustering&lt;br /&gt;----------------------&lt;br /&gt;&lt;br /&gt;how do we evaluate clustering?
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;birch algo example&lt;br /&gt;&lt;br /&gt;question:&lt;br /&gt;we're hoping that our clusters are somehow meaningful summaries of subsets of the attributes&lt;br /&gt;but aren't we somehow more prone to data dredging in unsupervised approaches&lt;br /&gt;&lt;br /&gt;well, everything is subject to verification, validity, meaningfulness&lt;br /&gt;&lt;br /&gt;Distance revisited: measuring a distance been an object and a cluster or two clusters&lt;br /&gt;single linkage (or nearest neighbor) - distance of the two clostest objects in the different clusters&lt;br /&gt;complete linkage (furthest neighbor) - greatest distance between any two objects in the different clusters&lt;br /&gt;group average&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For class presentation:&lt;br /&gt;&lt;br /&gt;B1 - Introduction to KD   est Feb 9&lt;br /&gt;Don R. Swanson &amp; Neil R. &lt;span class="SpellE"&gt;Smalheiser&lt;/span&gt;   &lt;a href="http://kiwi.uchicago.edu/webwork/AIabtext.html"&gt;An interactive   system for finding complementary literatures: a stimulus to scientific   discovery.&lt;/a&gt; &lt;span class="SpellE"&gt;Artifical&lt;/span&gt; Intelligence 91:183-203;   1997.&lt;br /&gt;&lt;br /&gt;B8 -  Document Summarization  est Mar 28 or 30&lt;br /&gt;Generating Patient-Specific Summaries of Online Literature&lt;br /&gt;Kathleen R. McKeown, Desmond A. 
Jordan, Vasileios Hatzivassiloglou&lt;br /&gt;*http://www.ics.uci.edu/~pratt/courses/papers/TextMining/patient-specific-summaries.pdf&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110677005831759817?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110677005831759817/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110677005831759817' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110677005831759817'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110677005831759817'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/wednesday-26-january-2005-lecture.html' title='Wednesday 26 January 2005 Lecture: Clustering'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110659751492133863</id><published>2005-01-24T15:15:00.000-08:00</published><updated>2005-01-24T12:11:54.920-08:00</updated><title type='text'>Monday 24 January Lecture: Classification</title><content type='html'>Classification example: selecting safe fruit from dangerous fruit on a deserted island with no info&lt;br /&gt;&lt;br /&gt;nominal attributes&lt;br /&gt;empirical approach&lt;br /&gt;requires knowledge of the conclusion: THE CLASS&lt;br /&gt;&lt;br /&gt;Conclusion|Skin|Color|Size|Flesh&lt;br /&gt;----------------------------------&lt;br /&gt;class             nom nom  nom 
nom&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;decision trees!  if attribute 1=value1, then subtree 1&lt;br /&gt;else if attribute 1=value2 then subtree 2&lt;br /&gt;&lt;br /&gt;(Question for later: are there decision tree algos that use something other than number of instances in each class to calculate information gain [or some discriminant other than information gain]?)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;decision tree classification:&lt;br /&gt;assume you know the conclusion C that has any pof the values c1...cn&lt;br /&gt;that an attribute can take values a1...an&lt;br /&gt;&lt;br /&gt;then you can calculate P(cj|ao)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;P(C=safe|skin=hairy) = 6/8=0.75&lt;br /&gt;P(C=safe|size=small)  =  5/9=0.55&lt;br /&gt;&lt;br /&gt;conditional probability close to 1 is a&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Step 1: partition data into groups based on a partitioning attribute and partioning condition&lt;br /&gt;Step 2: continue until stop condition reached&lt;br /&gt;     - all or more of items belong to the same class&lt;br /&gt;     - all attributes have been considered and no further partitioning is possible&lt;br /&gt;     - such a node is a leaf node&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;INfo[2,3] = 0.972 bits&lt;br /&gt;Info[4,0] = 0.0 bits&lt;br /&gt;Info[3,2] = 0.971 bits&lt;br /&gt;goodness for conditions = 0.693 bits&lt;br /&gt;(5/14)*.971 + 4/14* 0.0 + (5/14)*0.971 = 0.693&lt;br /&gt;Gain=0.247&lt;br /&gt;no partiti&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(Is there some way to handle assume our attribute list is complete?)
&lt;br /&gt;&lt;br /&gt;(Rescaling/revaluing attribute values may be helpful for useless attributes)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110659751492133863?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110659751492133863/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110659751492133863' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110659751492133863'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110659751492133863'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/monday-24-january-lecture.html' title='Monday 24 January Lecture: Classification'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110616719471105645</id><published>2005-01-19T15:15:00.000-08:00</published><updated>2005-01-19T23:31:47.553-08:00</updated><title type='text'>Thursday 19 January 2005 Lecture: Terminology</title><content type='html'>central questions of KD&lt;br /&gt;what data should you use?&lt;br /&gt;what goes into the mining task?&lt;br /&gt;what will the patterns look like?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;what data should you use?
&lt;br /&gt;&lt;br /&gt;Inputs required:&lt;br /&gt;attribute type&lt;br /&gt;nominal/ordinal/interval/ratio&lt;br /&gt;(vals are names/scale with meaningless interval but meaningful order/scale with meaningful interval/scale with true zero; zero is a true zero)&lt;br /&gt;type&lt;br /&gt;symbolic/symbolic/numeric/numeric&lt;br /&gt;order&lt;br /&gt;n/y/y/y&lt;br /&gt;e.g.,&lt;br /&gt;zip/ranking/deg Celcius/kelvin&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;attributes can be considered along a spectrum:&lt;br /&gt;from a two-valued nominal attribute to a ratio&lt;br /&gt;&lt;br /&gt;nominal takes two bits&lt;br /&gt;ordinal: as many symbols as needed&lt;br /&gt;interval: integer&lt;br /&gt;ratio: float&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Concept is&lt;br /&gt;‘the thing to be learned’ p38&lt;br /&gt;&lt;br /&gt;concepts can be predictive or descriptive, supervised or unsupervised, transparent or black box&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;goal of predictive dm: induction&lt;br /&gt;predictive mechanisms&lt;br /&gt;- classification&lt;br /&gt;- regression (linear: y=ax+b):&lt;br /&gt;given a set of param-value-to-function-result mappings,&lt;br /&gt;WITHOUT KNOWING THE REAL FUNCTION,&lt;br /&gt;predict the function result for a new param-value&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;descriptive concepts&lt;br /&gt;association rules &amp; clusters&lt;br /&gt;associations are often first step in detecting causation (or making a prediction)&lt;br /&gt;clustering good for epidemiology; for detecting source of a problem&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;supervised vs. unsupervised&lt;br /&gt;supervised:&lt;br /&gt;input: labeled data&lt;br /&gt;goal: identify the attribute combination that maximizes chance you get correct concept label&lt;br /&gt;unsupervised:&lt;br /&gt;input: unlabelled data&lt;br /&gt;goal: identify groups within the data&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;black box vs. 
transparent&lt;br /&gt;&lt;br /&gt;is the output black box or transparent?&lt;br /&gt;neural network good example of a black box&lt;br /&gt;id3 decision tree good example of a transparent&lt;br /&gt;&lt;br /&gt;black box: goal is to get categories right (good for control systems)&lt;br /&gt;transparent: understanding the rules or ways is also useful&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;bias&lt;br /&gt;------&lt;br /&gt;&lt;br /&gt;high-dimensional data will have multiple "correct" concept descriptions&lt;br /&gt;bias is not bad&lt;br /&gt;algos are greedy and sometimes insist their results are best&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;we CAN use low quality data? yes, but we need high quality data to map it to the low quality&lt;br /&gt;&lt;br /&gt;what role does database design principles play in KD process?&lt;br /&gt;flexibility in kinds of queries you might write/difficulty as well for accessible data&lt;br /&gt;without those queries you might not be able to pick what inputs you want, what data you use&lt;br /&gt;scalability, maintenance, indexability in RDBMS&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110616719471105645?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110616719471105645/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110616719471105645' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110616719471105645'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/10158875/posts/default/110616719471105645'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/thursday-19-january-2005-lecture.html' title='Thursday 19 January 2005 Lecture: Terminology'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110601238037617818</id><published>2005-01-17T17:24:00.000-08:00</published><updated>2005-01-17T17:39:40.376-08:00</updated><title type='text'>A1 Readings Pt. 3 DeGruy; Guernsey</title><content type='html'>DeGruy, Healthcare Applications of Knowledge Discovery in Databases, Journal of Healthcare Information Management, vol 14 no 2, summer 2000&lt;br /&gt;&lt;br /&gt;DeGruy makes the case for KDD in healthcare.  Am painfully familiar with the questions, slogans, debates, &amp; players.&lt;br /&gt;&lt;br /&gt;Successful healthcare KDD implementations:&lt;br /&gt;&lt;ul&gt;   &lt;li&gt; HMO determined disease-specific risk for its members: potential for targeted intervention (or potential denial of coverage); means can likewise be used to identify new risk factors in a population&lt;/li&gt;   &lt;li&gt;fraud detection in insurance claims&lt;/li&gt;   &lt;li&gt;NY Workers' Compensation Board decision trees for streamlining compensation claims&lt;/li&gt;   &lt;li&gt;EHRs (electronic health records) are ripe for KDD - who needs flu shots; who isn't following needed treatments, etc.&lt;/li&gt;   &lt;li&gt;medical marketing&lt;/li&gt; &lt;/ul&gt;&lt;br /&gt;NYT article oct 16 2003 "Digging for Nuggets of Wisdom", Lisa Guernsey&lt;br /&gt;&lt;br /&gt;Text mining vs. 
information retrieval: text mining goes beyond tokens, categorization, linking unconnected documents, creating visual representation of related docs, synthesizing new ideas&lt;br /&gt;&lt;br /&gt;Text mining vs. data mining: text mining starts with unstructured data: text&lt;br /&gt;&lt;br /&gt;Determine context/resolve ambiguity/select semantic concepts&lt;br /&gt;&lt;br /&gt;Text mining might not recognize subtlety&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110601238037617818?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110601238037617818/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110601238037617818' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110601238037617818'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110601238037617818'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/a1-readings-pt-3-degruy-guernsey.html' title='A1 Readings Pt. 3 DeGruy; Guernsey'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110601134282984372</id><published>2005-01-17T16:55:00.000-08:00</published><updated>2005-01-17T17:22:22.830-08:00</updated><title type='text'>A1 Readings Pt. 2 Fayyad</title><content type='html'>Fayyad, et 
al, "KDD Process for Extracting Useful Knowledge from Volumes of Data", Communications of the ACM, November 1996, Vol 39, no 11.&lt;br /&gt;&lt;br /&gt;Manual data analysis is swiftly becoming impractical.  The number of databases is growing, and so too is the average dimensionality (records times attributes).  Automation is the answer, Fayyad, et al, contend. &lt;br /&gt;&lt;br /&gt;Fayyad et al labor to differentiate KDD from DM.  DM is but a step in the KDD process which involves:&lt;br /&gt;&lt;ol&gt;   &lt;li&gt;Data collection&lt;/li&gt;   &lt;li&gt;Data selection (picking our topic)&lt;br /&gt;  &lt;/li&gt;   &lt;li&gt;preprocessing of target data (prepping our relevant data)&lt;br /&gt;  &lt;/li&gt;   &lt;li&gt;transformation of preprocessed data (creating useful structured representations of relevant data)&lt;br /&gt;  &lt;/li&gt;   &lt;li&gt;mining of transformed data&lt;/li&gt;   &lt;li&gt;interpreting discovered patterns (deriving knowledge)&lt;/li&gt; &lt;/ol&gt; DM is only step 5 of this six-step process.&lt;br /&gt;&lt;br /&gt;The authors raise the concern that the discovery of a pattern from transformed data does not always mean that the pattern is statistically significant.  Fundamental is the statistician's art of hypothesis selection.&lt;br /&gt;&lt;br /&gt;Ahh, method.&lt;br /&gt;&lt;br /&gt;Discovered patterns should be neither underfit nor overfit: overfitted patterns typically fail to be predictive, while underfit models don't provide very much information.&lt;br /&gt;&lt;br /&gt;Interestingness:&lt;br /&gt;Process is assumed to be nontrivial&lt;br /&gt;Patterns should be valid for new data to some degree of certainty&lt;br /&gt;Patterns should be novel to the system, and hopefully to the user as well&lt;br /&gt;Patterns should be understandable - simplicity?
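&lt;br /&gt;&lt;br /&gt;A minimal sketch of the underfit/overfit point above, and of why patterns should be checked against new data. This is a toy illustration only, not anything taken from the paper: the synthetic data, the three models (a constant, a least-squares line, and a nearest-point memorizer standing in for an overfit model), and all names are assumptions made up for the example.&lt;br /&gt;&lt;pre&gt;
```python
import random

# Toy data: y = 2x + noise. All names and numbers here are illustrative.
random.seed(0)

def sample(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 2)) for x in xs]

train, test = sample(30), sample(30)

def mse(model, data):
    # Mean squared error of a model (a function of x) over (x, y) pairs.
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: ignore x and always predict the training mean.
mean_y = sum(y for _, y in train) / len(train)

def underfit(x):
    return mean_y

# Reasonable fit: ordinary least-squares line through the training data.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))

def line(x):
    return mean_y + slope * (x - mean_x)

# Overfit: memorize the training set and answer with the nearest stored point.
def memorizer(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in [("underfit", underfit), ("line", line), ("memorizer", memorizer)]:
    print(name, "train:", round(mse(model, train), 2), "test:", round(mse(model, test), 2))
```
&lt;/pre&gt;The memorizer reproduces the data it has already seen exactly, noise included, so its training error says nothing about how it will do on the held-out set; comparing the train and test columns is the point.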
&lt;br /&gt;&lt;br /&gt;Model functions in DM:&lt;br /&gt;&lt;ul&gt;   &lt;li&gt;classification&lt;/li&gt;   &lt;li&gt;regression&lt;/li&gt;   &lt;li&gt;clustering&lt;/li&gt;   &lt;li&gt;summarization&lt;/li&gt;   &lt;li&gt;dependency modeling&lt;/li&gt;   &lt;li&gt;link analysis&lt;/li&gt;   &lt;li&gt;sequence analysis&lt;/li&gt; &lt;/ul&gt; Challenges &amp; issues:&lt;br /&gt;&lt;br /&gt;What's interesting or useful?  What's not?  Knowledge leverages some amout of subjectivity, and that subjectivity is the very essence of making judgments about such things as usability, interestingness, informativeness, relevance, and so on.  Insert here the role of the information scientist.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110601134282984372?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110601134282984372/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110601134282984372' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110601134282984372'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110601134282984372'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/a1-readings-pt-2-fayyad.html' title='A1 Readings Pt. 
2 Fayyad'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-10158875.post-110600944493825393</id><published>2005-01-17T15:50:00.000-08:00</published><updated>2005-01-17T16:55:19.603-08:00</updated><title type='text'>A1 Readings Pt. 1 Lesk</title><content type='html'>Lesk, M. "How Much Information...?"&lt;br /&gt;&lt;br /&gt;In 1997 M. Lesk estimated volume of information in the world at several exabytes (~ several billion GB). Spirit of article is dedicated to haphazard guessing &amp; makeshift counting methods (e.g., estimates of hard drive sales), but entertaining nonetheless. Less entertaining and more haphazard is the later speculation on the "volume" of "human memory;" Lesk cites Landauer's estimate of brain capacity at 200MB, which is manifestly dubious, and then Lesk calculates total human memory to be just over one exabyte. Oops. Landauer, at least according to Lesk, figures the brain holds 1,000 to 100,000 neurons per bit of memory, based on Landauer's recall tests of human memory. There are obvious reasons why recall tests are a bad way of measuring the amount of memory (more in the computer sense than in the sentimental sense) a human brain can retain. Computers of course store memories in 0s and 1s, but, importantly, those 0s and 1s do not always add up to high level, semantically-rich information. Some of this memory can be very low-level. So human recall tests prima facie bias results to be very low.&lt;br /&gt;&lt;br /&gt;To be fair, just as we measure hard disk capacity (and just as Lesk uses hard drive *capacity* quantity figures for his argument) we should guess the very same way with the human brain: count the number of bits. 
Now, a human brain has ~10e14 neurons, but hardly is a neuron a single bit. Neurons have a fairly high order of ways in which they can articulate: number &amp; length of dendrites &amp; axons alone are but two dimensions which we may count to get an idea of a possible number of states for each neuron. We might also have a few dimensions for measuring neural cell signalling; it's unlikely that nerve cells have just one signal to pass, and it's equally unlikely that the signal is strictly digital. There may be many other dimensions with respect to neurons to count as well, various location-dependent interactions with other aspects of the brain, for example. But we'll skip that arena. We might also count possible states for astrocytes, a species of glial cell, since they may also be involved in "saving state."&lt;br /&gt;&lt;br /&gt;10e14 nerve cells&lt;br /&gt;&lt;br /&gt;conservative estimate of states per cell&lt;br /&gt;---------------------------------------&lt;br /&gt;avg number of synapses: 10e4&lt;br /&gt;avg number of sodium pumps: 10e6&lt;br /&gt;&lt;br /&gt;10e4*10e6=10e10&lt;br /&gt;&lt;br /&gt;total number of bits based on neurons: 10e14 * 10e10 = 10e24&lt;br /&gt;&lt;br /&gt;one exabyte is approximately 10e19 bits&lt;br /&gt;10e24 bits, our current estimate for storage capacity of a single human brain without counting astrocytes, is approximately 100,000 times larger than Lesk's estimate for the total of the world's information, and is roughly 10e15 times larger than Landauer's estimate for a single brain.&lt;br /&gt;&lt;br /&gt;By the numbers, and by the paper's current methods, the storage needed to cover the capacity of human neurons would be in the order of 10e34 (6*10e9*10e24) bits, or approximately one quadrillion exabytes.&lt;br /&gt;&lt;br /&gt;It cannot be remembered for us wholesale.
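&lt;br /&gt;&lt;br /&gt;A rough sketch of the arithmetic above, reading the shorthand 10e14 as 10^14 (and likewise for the other figures). The variable names are invented for the example, the exabyte is taken as the decimal 10^18 bytes of 8 bits each, and every input is just the guess made in this post, not a measured value.&lt;br /&gt;&lt;pre&gt;
```python
# Back-of-envelope check of the brain-capacity guess in this post.
# Assumption: the post's shorthand "10e14" is read as 10**14, and so on.

NEURONS_PER_BRAIN = 10**14        # the post's figure for nerve cells
SYNAPSES_PER_NEURON = 10**4       # the post's average number of synapses
PUMPS_PER_NEURON = 10**6          # the post's average number of sodium pumps
WORLD_POPULATION = 6 * 10**9      # the post's head count

BITS_PER_EXABYTE = 8 * 10**18     # decimal exabyte: 10**18 bytes of 8 bits
LANDAUER_BITS = 200 * 10**6 * 8   # Landauer's 200 MB, as cited by Lesk

# The post treats synapses times pumps as the number of states (bits) per cell.
states_per_neuron = SYNAPSES_PER_NEURON * PUMPS_PER_NEURON
bits_per_brain = NEURONS_PER_BRAIN * states_per_neuron
bits_all_brains = WORLD_POPULATION * bits_per_brain

print("bits per brain:", bits_per_brain)
print("exabytes per brain:", bits_per_brain / BITS_PER_EXABYTE)
print("ratio to Landauer:", bits_per_brain / LANDAUER_BITS)
print("exabytes, all brains:", bits_all_brains / BITS_PER_EXABYTE)
```
&lt;/pre&gt;With these inputs one brain comes to about 125,000 exabytes, some 6 * 10^14 times Landauer's 200MB, and all brains together to roughly 7.5 * 10^14 exabytes. Change any one guess by a factor of ten and the totals move with it, which is about as much confidence as the exercise deserves.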
&lt;br /&gt;&lt;br /&gt;To be very honest, I don't take seriously any claims to strong AI, and don't fancy the parallels routinely drawn between human experience and computer I/O. The metaphors of computing just don't work very well, except maybe to dumb down and underestimate our understanding of the myriad complexities of this lump of gray matter and its correlated but yet-to-be-verified partner, the mind. For example, there's simply *no* equivalence between an actual sentence in an ASCII file and that same sentence memorized and "inside" a human head. That sentence "inside" the human head ("inside" is in quotes, because no one can actually locate it, touch it, verify it) . probably has a tonb of other information attached to it, and the recall task brings with it tons of other information. In other words, when it comes to the human mind, a sentence is not a sentence. A sentence (e.g., ASCII) is not a sentence (either in the mind's eye OR in the brain).&lt;br /&gt;&lt;br /&gt;So I digress. Lesk's more mild claim that we will be able to store all our media digitally seems reasonable. What this means for text mining is that we will have data, lots of data, for mining (if we gain access to it), and that the quantity of data will demand more and more mining work for information retrieval, classification, discovery, and synthesis.
&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/10158875-110600944493825393?l=inls110.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://inls110.blogspot.com/feeds/110600944493825393/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=10158875&amp;postID=110600944493825393' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110600944493825393'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/10158875/posts/default/110600944493825393'/><link rel='alternate' type='text/html' href='http://inls110.blogspot.com/2005/01/a1-readings-pt-1-lesk.html' title='A1 Readings Pt. 1 Lesk'/><author><name>Patrick</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='21' height='32' src='http://bp3.blogger.com/_E3CMTxi_Yas/SH0MIfGD3bI/AAAAAAAAAAM/FJICRtQ5ffU/s1600-R/ae-16730.jpeg'/></author><thr:total>0</thr:total></entry></feed>
