"Data Mining" Experiment with EP
Saccharomyces cerevisiae: 6221 genes over 80 expression conditions (from P. Brown’s lab)
~1000 x K-means clustering: K ∈ 2..1000, repeat 10 x for each K
Calculate the average “Silhouette” value for each cluster and each clustering
Select 52,000 clusters of size 20 to 100
Analyze all up-stream regions of all 52,000 clusters with SPEXS
Extract all substrings from up-stream regions with probability less than 1% according to binomial distribution (background probability is calculated simultaneously from 6221 up-stream regions)
Plot the silhouette value of each cluster vs. the probability of the most improbable pattern from the up-stream regions of the genes in the cluster
Repeat pattern discovery for randomized clusters (i.e., replace genes in a real cluster with arbitrary genes)
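
The binomial scoring and the randomized control described above can be sketched roughly as follows. This is only an illustration, not SPEXS itself: it assumes a pattern is counted once per up-stream region, that the background probability is the fraction of all up-stream regions containing the pattern, and that the score is the upper binomial tail. The names `binomial_tail`, `pattern_probability`, `randomized_cluster`, and the toy data `TOY_UPSTREAM` are hypothetical.

```python
import math
import random


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def pattern_probability(pattern, cluster, upstream):
    """Chance of the pattern hitting at least as many cluster regions as observed.

    Assumption: each up-stream region counts at most once, and the
    background probability is estimated from all regions in `upstream`.
    """
    background = sum(pattern in seq for seq in upstream.values()) / len(upstream)
    hits = sum(pattern in upstream[gene] for gene in cluster)
    return binomial_tail(hits, len(cluster), background)


def randomized_cluster(cluster, upstream, rng):
    """Control: replace the genes of a cluster with arbitrary genes (same size)."""
    return rng.sample(sorted(upstream), len(cluster))


# Toy data (hypothetical): gene name -> up-stream sequence.
TOY_UPSTREAM = {
    "g1": "TTACGTAA", "g2": "ACGTCCCC", "g3": "GGGACGTT", "g4": "CCACGTGG",
    "g5": "TTTTTTTT", "g6": "GGGGCCCC", "g7": "AATTAATT", "g8": "CACACACA",
}

real_cluster = ["g1", "g2", "g3"]
real_score = pattern_probability("ACGT", real_cluster, TOY_UPSTREAM)

control = randomized_cluster(real_cluster, TOY_UPSTREAM, random.Random(0))
control_score = pattern_probability("ACGT", control, TOY_UPSTREAM)
```

In the experiment itself the score would be computed for every substring found in a cluster's up-stream regions, and the smallest value kept as the "most improbable pattern"; comparing `real_score` with `control_score` over many clusters mirrors the real-versus-randomized comparison.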