Briefings in Bioinformatics Advance Access originally published online on May 26, 2006
Briefings in Bioinformatics 2006 7(2):202-203; doi:10.1093/bib/bbl013
© The Author 2006. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org

News Section

Circumventing the cut-off for enrichment analysis

ABSTRACT

Three tools for threshold-free enrichment analysis of microarray data are introduced: GSEA (gene set enrichment analysis), ermineJ and DRIM (discovering rank imbalanced motifs). GSEA offers an interface to a specific algorithm and a well-defined pipeline for identifying enrichment in diverse gene sets and creating signature profiles. ermineJ offers a combined front end to three different algorithms, two of which perform a cut-off-free enrichment analysis. DRIM comprises an implementation of a new algorithm and is specifically designed to search for new transcription-factor-binding sites based on expression patterns. Together, these tools demonstrate an emerging trend in high-throughput data analysis: the joint analysis of raw results with external knowledge.


Software for interpreting metabolomics data is still scarce. In this issue, I would like to take a look at a trend in microarray analysis (for which algorithms and software are abundant) that may be directly relevant to metabolomics: the use of cut-off-free enrichment analysis.

Enrichment analysis has been a standard step in microarray analysis ever since relevant, comprehensive, machine-readable data sources, such as the Gene Ontology (GO) database (http://www.geneontology.org/), became available. In enrichment analysis, expression profiles are used to partition the genes into non-regulated, up-regulated and down-regulated groups. In most cases, the resulting lists of genes are meaningless in themselves; rather, they are used to find features common to the regulated genes, such as over- or under-representation of GO categories or of transcription-factor-binding sites. Such enrichment offers a generalization of the response, expressing it in terms of affected systems or processes, or providing mechanistic clues regarding the processes behind the observed change. The main weakness of this approach is that the partitioning (i.e. the choice of a cut-off in the gene list) is separated from the subsequent enrichment analysis. Such a separation leads to loss of information. Imagine, for example, a case in which all 100 genes in some GO category are slightly over-expressed under certain conditions. This should be contrasted with a situation in which the genes follow the general expression distribution (i.e. some being over-expressed and some being under-expressed). The former expression pattern would probably be considered more ‘surprising’ than the latter. Moreover, it is conceivable that none of the genes is significant by itself; rather, it is the pattern that emerges when observing the genes as a group that is ‘surprising’.
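The 100-gene scenario can be made concrete with a small Python sketch. All numbers are invented for illustration: 900 background genes with log-ratios spread evenly between -2 and 2, a category of 100 genes that are all mildly over-expressed, and an assumed cut-off of 1.0. The one-sided sign test used here is only one of many possible group-level tests and is not the method of any of the tools reviewed in this article.

```python
from math import comb
from statistics import median

# Hypothetical log-ratios (illustrative numbers, not real data):
# 900 background genes spread evenly between -2 and 2 ...
background = [-2.0 + 4.0 * i / 899 for i in range(900)]
# ... and 100 genes of one category, all only slightly over-expressed.
category = [0.2 + 0.002 * i for i in range(100)]

CUTOFF = 1.0  # assumed threshold for calling a gene "up-regulated"


def count_above(scores, threshold):
    """Number of scores strictly above the threshold."""
    return sum(1 for s in scores if s > threshold)


def sign_test_p(scores, reference):
    """One-sided sign test: chance that at least the observed number of
    scores exceed `reference` if each does so with probability 0.5."""
    n = len(scores)
    k = count_above(scores, reference)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# Cut-off-based view: how many category genes make it into the "up" list?
passing = count_above(category, CUTOFF)
# Group-level view: are the category genes consistently above the background median?
group_p = sign_test_p(category, median(background))

print(f"category genes passing the cut-off: {passing}")
print(f"group-level sign-test P-value: {group_p:.2e}")
```

In this toy setting no category gene passes the cut-off, so a cut-off-based enrichment test has nothing to work with, whereas the group-level test reports a very small P-value because all 100 genes lie on the same side of the background median.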

Several resources have recently been developed that couple enrichment analysis with the raw data. I will review here three such tools that support threshold-free enrichment analysis: gene set enrichment analysis (GSEA), ermineJ and discovering rank imbalanced motifs (DRIM).

GSEA, developed at the Broad Institute, is a well-rounded application that guides the user through a pre-defined path for enrichment analysis [1]. Given samples with two labels, each gene is scored according to its correlation with the label. For example, in the NCI-60 set, 17 samples are known to be p53+ and 33 are p53−; NADK and CEP1 are both correlated with the p53 status of the cells but in opposite directions. The entire list of genes, without any cut-off, is then sorted and weighted according to this correlation. Then, the weighted ranks of the genes belonging to a particular gene set (e.g. genes belonging to a specific process) are examined, and the likelihood of obtaining this ranking by chance is estimated. This analysis is performed for each gene set, using re-sampling and false-discovery-rate estimation to provide a robust P-value. The resulting report comprises lists of enriched sets with the relevant statistics and a list of genes that can serve as markers for the given phenotypes. The package offers two bonuses: first, a smooth, friendly and robust environment, and second, a large collection of human gene sets. Formatting the expression data in the right way took some hacking on my part (writing a few lines of code in Perl), but I am convinced that it can also be done in Excel. Reformatting GO annotations derived from the Gene Expression Omnibus (GEO) into gene sets took 20 leisurely minutes of Perl hacking, but I do not think it could have been done in Excel. All in all, GSEA is designed to perform a specific task, and it performs it very well with rather impressive back-end statistics.
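The ranking-and-weighting idea can be sketched in a few lines of Python. This is a minimal illustration in the spirit of the running-sum statistic described in [1], not the GSEA implementation itself: the function names, the toy genes and the scores are all hypothetical, the null distribution is built by drawing random gene sets of the same size (a simplification of the re-sampling step), and the false-discovery-rate step is omitted.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of a weighted running sum.

    ranked_genes: gene names sorted by decreasing score
    scores:       dict mapping gene name -> correlation with the label
    gene_set:     set of gene names (assumed to overlap the list only partially)
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = in_set.count(False)
    hit_total = sum(abs(scores[g]) ** p
                    for g, hit in zip(ranked_genes, in_set) if hit)
    running = best = 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            # Step up by the gene's share of the total weight in the set.
            running += abs(scores[g]) ** p / hit_total
        else:
            # Step down by a constant amount for genes outside the set.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Share of random same-size gene sets scoring at least as extremely."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    size = sum(g in gene_set for g in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        null_es = enrichment_score(ranked_genes, scores, random_set)
        if abs(null_es) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Toy data: 20 hypothetical genes with scores 10, 9, ..., -9.
toy_scores = {f"g{i}": 10.0 - i for i in range(20)}
toy_ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
toy_set = {"g0", "g1", "g2"}  # a gene set concentrated at the top of the list

es = enrichment_score(toy_ranked, toy_scores, toy_set)
p_value = permutation_p(toy_ranked, toy_scores, toy_set)
print(f"ES = {es:.3f}, permutation P = {p_value:.4f}")
```

No cut-off appears anywhere in the sketch: every gene in the list contributes to the running sum, either as a weighted step up or as a constant step down.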

ermineJ, developed at Columbia University, offers more of a ‘supermarket’ of enrichment analysis methods [2]. It supports the traditional cut-off-based analysis as well as three cut-off-free algorithms (all described elsewhere). The Gene Score Resampling method is conceptually similar to GSEA, but uses simpler statistics. The receiver operating characteristic (ROC) method takes only the rank of the genes into consideration but does not use a threshold; rather, it tries all possible thresholds. Finally, the gene group correlation analysis method identifies gene categories in which the expression ‘clusters together’, with the correlation between expression patterns serving to guide the clustering. What makes ermineJ stand out is its simple-to-use integrated environment. Note that, in my hands, getting new data in was not simple: unlike GSEA, when ermineJ is ‘unhappy’ about the format of the input files, it provides an unspecific error message. As a result, I found that introducing expression files to ermineJ was more of an effort than preparing the same data for GSEA. Preparing the GO annotation from GEO SOFT files, on the other hand, was very straightforward and can probably be done in a word processor or in Excel. All in all, ermineJ is a worthwhile tool to get to know, and its ability to support multiple analysis methods may come in handy. After all, no single analysis method is ideal for all types of data.
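Two of these ideas are easy to sketch in Python. The code below is not ermineJ's implementation; the function names and toy data are hypothetical. It assumes that the ROC method can be summarized by the area under the ROC curve (the probability that a randomly chosen gene from the set ranks above a randomly chosen gene outside it), and that a resampling score can be taken as the mean gene score of the set compared against random sets of the same size.

```python
import random
from statistics import mean


def roc_auc(ranked_genes, gene_set):
    """Area under the ROC curve, using ranks only (no threshold is chosen).

    Walking down the ranked list, every non-member is credited with the
    number of set members already seen above it.
    """
    members_seen = 0
    concordant_pairs = 0
    for gene in ranked_genes:
        if gene in gene_set:
            members_seen += 1
        else:
            concordant_pairs += members_seen
    n_outside = len(ranked_genes) - members_seen
    return concordant_pairs / (members_seen * n_outside)


def resampling_p(scores, gene_set, n_resamples=2000, seed=0):
    """Share of random same-size gene sets whose mean score is at least
    as high as that of the gene set under test."""
    rng = random.Random(seed)
    genes = sorted(scores)
    members = [g for g in genes if g in gene_set]
    observed = mean(scores[g] for g in members)
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(genes, len(members))
        if mean(scores[g] for g in sample) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Toy data: 10 hypothetical genes with scores 10, 9, ..., 1.
toy_scores = {f"g{i}": 10 - i for i in range(10)}
toy_ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
toy_set = {"g0", "g1", "g3"}

auc = roc_auc(toy_ranked, toy_set)
p_mean = resampling_p(toy_scores, toy_set)
print(f"AUC = {auc:.3f}, resampling P = {p_mean:.4f}")
```

An AUC of 0.5 corresponds to a gene set scattered at random through the list, and values near 1 to a set concentrated at the top. Because the area integrates over every possible threshold, no single cut-off has to be chosen.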

Finally, I want to briefly mention DRIM (http://bioinfo.cs.technion.ac.il/drim/). DRIM is geared towards a related but different task, i.e. finding enrichment in sequence motifs (or more precisely, in sequence words over a flexible alphabet). It employs an original statistical method at its back end, which is conceptually similar to the ROC method in ermineJ. Variants of the statistical approach have been published (e.g. [3]), and I recently had the opportunity to hear it explained by the authors. The principle is to sort the genes, calculate a P-value for enrichment at every possible threshold and then apply exact statistics to calculate the likelihood of obtaining, by chance, enrichment at the level of the best threshold (i.e. the probability of obtaining the observed enrichment under a null model). DRIM is offered through a web interface, which I have not yet tried, but it seems simple enough to operate.
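The 'all possible thresholds' principle can be illustrated with a short Python sketch. It assumes that enrichment at each threshold is measured with a hypergeometric tail probability and keeps the smallest value. The exact statistics mentioned above are not reproduced here: a simple label permutation stands in for them to correct for having tried every threshold. The function names and the toy ranking are hypothetical, and this is not DRIM's code.

```python
import random
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n genes are drawn from N, of which B carry the motif."""
    favourable = sum(comb(B, k) * comb(N - B, n - k)
                     for k in range(b, min(n, B) + 1))
    return favourable / comb(N, n)


def min_hypergeom(labels):
    """Smallest tail probability over all thresholds of a ranked list.

    labels: booleans in ranked order (True = the gene carries the motif).
    Returns (best tail probability, threshold at which it occurs).
    """
    N, B = len(labels), sum(labels)
    best_p, best_n = 1.0, 0
    b = 0
    for n in range(1, N + 1):
        b += labels[n - 1]  # motif occurrences among the top n genes
        p = hypergeom_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n


def permutation_p(labels, n_perm=2000, seed=0):
    """Chance that a shuffled ranking reaches an equally small minimum."""
    rng = random.Random(seed)
    observed, _ = min_hypergeom(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if min_hypergeom(shuffled)[0] <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy ranking: 20 hypothetical genes, the motif occurs only in the top 4.
toy_labels = [True] * 4 + [False] * 16

best_p, best_n = min_hypergeom(toy_labels)
perm_p = permutation_p(toy_labels)
print(f"best threshold: top {best_n} genes, tail probability {best_p:.2e}")
print(f"permutation P-value for the minimum: {perm_p:.4f}")
```

The minimum over thresholds is not itself a P-value, because the best threshold was selected after looking at the data. That is why a second step, exact in the approach described above and approximate in this sketch, is needed.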

Together, the described resources demonstrate where the integration of external knowledge and raw data from high-throughput methods is heading. Note that none of these methods is specific to microarray data (although care should be taken to revisit any possible assumption regarding background distributions when applying them as is to other data types). Indeed, I think that enrichment analysis, in general, and non-threshold methods, in particular, will very quickly become general practice in external knowledge integration. It will be interesting to see how this approach will come to be applied to the analysis of metabolomics data. We certainly need tools to relate metabolic fluctuations and reaction rate changes to prior knowledge, such as enzyme features or pathway topology.

Eitan Rubin
National Institute of Biotechnology in the Negev, Ben Gurion University


E-mail: erubin{at}bgu.ac.il

Submitted: April 2, 2006. Received (in revised form): April 6, 2006.

References

  1. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005; 102:15545–50.
  2. Lee HK, Braynen W, Keshav K, Pavlidis P. ErmineJ: Tool for functional analysis of gene expression data sets. BMC Bioinformatics 2005; 6:269.
  3. Ben Zaken C, Eskin E, Yakhini Z. Using Expression Data to Discover RNA and DNA Regulatory Sequence Motifs. Lecture Notes in Computer Science 2005; 3318:65–78.
