Friday CRADLE Talk: Mining Spatial Patterns from Protein Structures
Luke Huan
Seq->Struct->Function
Structure indicates function
Mine frequent subgraphs; retrieve spatial motifs from protein structure data
Global vs. local alignments of protein structures
local alignments are motifs
Mining for subgraphs/spatial motifs is a challenging problem for data mining
How to model protein w/ set of points
each aa is represented by a point in 3D space; a protein structure is a point set
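A minimal sketch of the point-set idea, not the speaker's actual method: each residue becomes a labeled 3D point, and residue pairs closer than a distance cutoff become edges of a graph that subgraph mining could then work on. The coordinates, the 8.0 angstrom cutoff, and the name `contact_graph` are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy data: (residue type, x, y, z), one point per amino acid.
residues = [
    ("ALA", 0.0, 0.0, 0.0),
    ("GLY", 3.8, 0.0, 0.0),
    ("SER", 3.8, 3.8, 0.0),
    ("HIS", 20.0, 20.0, 20.0),
]

def contact_graph(points, cutoff=8.0):
    """Return (i, j, distance) for every residue pair within the cutoff.

    The cutoff value is an assumption, not a figure from the talk.
    """
    edges = []
    for (i, a), (j, b) in combinations(enumerate(points), 2):
        d = math.dist(a[1:], b[1:])  # Euclidean distance between the two points
        if d <= cutoff:
            edges.append((i, j, round(d, 2)))
    return edges

edges = contact_graph(residues)
```

Here the first three residues form a triangle of edges, while the distant HIS residue stays unconnected.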
LCP largest common point set problem
Is clique hashing a useful pattern finding approach for other domains?
Heavy combinatorial approach
SCOP Structural Classification of Proteins DB
10 classes
800 folds
1294 superfamilies
2327 families
every protein entered into this db is annotated using this scheme
hypergeometric distribution:
The problem of finding the probability of such a picking problem is sometimes called the "urn problem," since it asks for the probability that i out of N balls drawn are "good" from an urn that contains n "good" balls and m "bad" balls. It therefore also describes the probability of obtaining exactly i correct balls in a pick-N lottery from a reservoir of r balls (of which n = N are "good" and m = r - N are "bad"). (from MathWorld http://mathworld.wolfram.com/HypergeometricDistribution.html)
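In the notation of the quote above (n "good" balls, m "bad" balls, N balls drawn without replacement), the standard hypergeometric probability of drawing exactly i good balls can be written as follows. The random variable name X is introduced here for illustration only.

```latex
P(X = i) = \frac{\binom{n}{i}\,\binom{m}{N-i}}{\binom{n+m}{N}}
```

The numerator counts the ways to pick i of the n good balls and the remaining N - i from the m bad ones; the denominator counts all ways to draw N balls from the n + m in the urn.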
Searching for biological relevance:
reading papers: key word search, drawing from experience
website search
talking to people who know the subjects
presented an algo for finding frequently occurring spatial motifs
discovered motifs are highly specific, measured by low P-values using the hypergeometric distribution
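A hedged sketch of how such a P-value might be computed; the notes do not give the speaker's exact formulation. The assumption is that a motif counts as specific to a SCOP family when the proteins containing it fall in that family far more often than chance predicts, so the P-value is the upper tail of the hypergeometric distribution. The function name `hypergeom_tail` and all counts below are hypothetical.

```python
from math import comb

def hypergeom_tail(k, total, in_family, with_motif):
    """P(X >= k): chance that at least k of the `with_motif` proteins
    fall in the family if they were drawn at random from `total` proteins,
    `in_family` of which belong to the family."""
    upper = min(in_family, with_motif)
    favourable = sum(
        comb(in_family, i) * comb(total - in_family, with_motif - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(total, with_motif)

# Hypothetical counts: 1000 proteins, 20 in the family,
# 25 contain the motif, 15 of those are family members.
p_value = hypergeom_tail(15, 1000, 20, 25)
```

A very small `p_value` would indicate that the motif's concentration in the family is unlikely to arise by chance.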
mapping structures to highly specific motifs
how about mapping sequence