Briefings in Bioinformatics Advance Access originally published online on February 3, 2006
Briefings in Bioinformatics 2006 7(1):116-120; doi:10.1093/bib/bbk009

© Oxford University Press, 2006, All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Abstracts

Briefings in Bioinformatics aims to provide working biologists with an awareness and understanding of the computational approaches available for research and discovery. The Abstracts section of the journal consists of summaries of bioinformatics manuscripts published in the previous quarter. Inclusion of an article in this section indicates that the editors consider it to be among the most interesting and/or useful contributions to the field for the quarter covered. The contents of these reports are briefly distilled for the readers with an emphasis placed on their biological context and potential utility. Publications from the fourth quarter of 2005 (October–December) are reviewed here.

Alignments anchored on genomic landmarks can aid in the identification of regulatory elements
Kannan Tharakaraman, Leonardo Mariño-Ramírez, Sergey Sheetlin, David Landsman and John L. Spouge Bioinformatics (2005) Vol. 21, Suppl. 1, pp. i440–i448
One of the most important challenges facing bioinformatics is the characterization of regulatory sequences that control the timing and pattern of gene expression. Gene expression is controlled, to a large extent, by the interaction of trans-regulatory proteins with their cis-regulatory DNA binding sites. Cis-regulatory sites tend to be short, often degenerate, sequences, and attempts to identify such sites in gene promoters based on sequence information alone are prone to numerous false positives. As such, the necessity of incorporating additional information sources into the process of cis-regulatory sequence identification, exemplified by the manuscript of Tharakaraman et al., is becoming increasingly obvious. The method of Tharakaraman et al. is unique in two important respects. First, it combines two broad approaches to cis-regulatory element identification, namely alignment-based and enumerative methods. Alignment methods rely on (local) similarity between regulatory sequences, while enumerative methods identify sequence motifs that are over-represented in regulatory regions. Second, their method relies on positional information, with respect to the start site of transcription, to help inform the identification of functionally relevant regulatory motifs. The approach starts with an ungapped alignment of proximal promoter sequences, from the human genome in the case reported here, that is anchored by the transcription start sites. After the application of a novel word-specific mask, the method identifies all octonucleotide words that form colocalized clusters along the promoter sequence alignment. The promoter sequences are then realigned using the positions of the significant words as anchors, and the new alignment is surveyed for potential cis-regulatory sequences using a Gibbs sampling algorithm. When applied to the human promoter dataset, this method identified 791 words with conserved promoter positions. 
Alignments based on these word positions led to the discovery of novel potential cis-regulatory motifs missed by other prediction methods. A potentially interesting bonus of the algorithm is the fact that it does not necessitate the masking of known repetitive regions prior to the identification of putative cis-regulatory elements. Thus it may be possible to identify an entirely new class of cis-regulatory sites, i.e. those donated by repetitive sequences, that is likely to be missed by almost all other methods. The authors make the program that runs their alignment freely available along with the results of the analysis of human promoter regions.
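The word-clustering step described above can be illustrated with a minimal sketch. It assumes equal-length promoter sequences that are already anchored on the transcription start site, so that the same index refers to the same alignment column in every sequence. This is not the authors' program: the function names, the fixed window and the count threshold are hypothetical stand-ins for the statistical test used in the published method, and the word-specific mask is omitted.

```python
from collections import defaultdict


def word_positions(promoters, k=8):
    """Map each k-mer to the alignment columns at which it starts.

    `promoters` are equal-length sequences anchored on the transcription
    start site, so index i is the same column in every sequence.
    """
    positions = defaultdict(list)
    for seq in promoters:
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append(i)
    return positions


def colocalized_words(promoters, k=8, window=3, min_count=3):
    """Return words whose occurrences pile up within `window` columns.

    Hypothetical criterion: a word is reported when at least `min_count`
    of its occurrences start inside some run of `window` adjacent columns.
    Occurrences are pooled over all sequences, so repeats within a single
    sequence also count; a real analysis would need a proper null model.
    """
    hits = {}
    for word, cols in word_positions(promoters, k).items():
        cols.sort()
        best, left = 0, 0
        # Sliding window over the sorted start columns.
        for right in range(len(cols)):
            while cols[right] - cols[left] >= window:
                left += 1
            best = max(best, right - left + 1)
        if best >= min_count:
            hits[word] = best
    return hits
```

Words returned by `colocalized_words` would then serve as the anchors for realignment; the Gibbs sampling step is beyond the scope of this sketch.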

A genome-wide survey of structural variation between human and chimpanzee
Tera L. Newman, Eray Tuzun, V. Anne Morrison, Karen E. Hayden, Mario Ventura, Sean D. McGrath, Mariano Rocchi and Evan E. Eichler Genome Research (2005) Vol. 15, no. 10, pp. 1344–1356
The high level of similarity between the protein coding sequences of humans and chimpanzees has been appreciated for the last thirty years and was confirmed by comparisons with the recently completed sequence of the chimpanzee genome. The lack of divergence at protein coding sequences has raised questions pertaining to the hereditary basis for the obvious phenotypic differences between the two species. One possible explanation lies in structural genomic differences that may exist between the two lineages, and indeed, cytogenetic studies have successfully identified a number of large-scale structural differences that result in distinct human and chimp karyotypes. However, many smaller scale structural differences that may exist between human and chimp have been obscured by the relatively low quality of the chimpanzee genome assembly. Such structural differences may have had profound evolutionary impacts due to their dramatic and irreversible nature. To deal with the assembly issue, Newman et al. have employed their own mapping procedure to compare human and chimp sequences directly, and in so doing, identified numerous, previously undetected, structural differences between the two genomes. A total of 651 sites of structural variation including insertions, deletions and inversions were discovered. Structural variants were found on all chromosomes and, not surprisingly, tend to be enriched in the areas where segmental duplications had occurred. The variant structural sites not only cover a substantial portion of the analyzed sequences, 24 megabases, but also overlap more than 200 genes. Importantly, a number of these computationally identified structural variants were confirmed experimentally using PCR and Southern blots. This work represents the first genome-scale analysis of structural differences between human and chimpanzees. 
In addition to increasing the number of known human-chimpanzee structural rearrangements by a factor of 50, this study may also help to point out chimpanzee genomic regions that should be priorities for future efforts at producing ‘finished’ sequences.

Metabolic functions of duplicate genes in Saccharomyces cerevisiae
Lars Kuepfer, Uwe Sauer and Lars M. Blank Genome Research (2005) Vol. 15, no. 10, pp. 1421–1430
Gene duplication is a major evolutionary force and the subject of numerous molecular evolutionary studies. There are a number of competing hypotheses that seek to explain the fate of duplicate genes; most of these address the basic question of why duplicate genes are retained in the genome. Kuepfer et al. approach this issue using the extremely well studied metabolism of the yeast Saccharomyces cerevisiae as a model system. The essence of this work is the functional characterization of 295 duplicate metabolic genes in yeast. The authors used an impressive and in-depth combined functional annotation scheme that employed information from experimentally determined phenotypes (under five different environmental conditions), in vivo flux data, in silico flux balance analysis and topological analysis of the yeast metabolic network. Interestingly, they found that duplicates are no more frequently associated with essential metabolic reactions than singletons. These results suggest that duplicate genes are not maintained by any one particular dominant function, i.e. duplicates do not appear to be primarily maintained as backups. Instead, it seems as if duplicates encode proteins involved in different, if somewhat overlapping, functions. This work has important implications for theories of gene duplication and genetic robustness.

Ascertainment bias in studies of human genome-wide polymorphism
Andrew G. Clark, Melissa J. Hubisz, Carlos D. Bustamante, Scott H. Williamson, and Rasmus Nielsen Genome Research (2005) Vol. 15, no. 11, pp. 1496–1502
Studies that map single nucleotide polymorphisms (SNPs) onto genomes often use a two-step process: first, genomic locations that harbor SNPs are identified using small samples, and second, the SNP allele frequencies at these locations are characterized by resequencing larger samples. The primary use of SNPs is for the detection of phenotype-genotype associations through linkage disequilibrium, and this two-step strategy works quite well irrespective of the mode of the initial SNP identification. However, SNP data gathered in this way will most likely not be suitable for population genetic studies. This is because the original sample used to identify SNPs is small and thus biased towards high frequency alleles; rare SNPs are more likely to go undetected with this approach. This will inevitably alter statistical characteristics of the resulting SNP frequency distribution, such as nucleotide diversity and FST (between population variance). To demonstrate the effects of this ascertainment bias, Clark et al. compare two large-scale SNP datasets, each of which used a different strategy to uncover SNPs. The International HapMap project used an extremely heterogeneous SNP discovery step, while Perlegen's SNP map was produced using a more uniform approach where SNPs identified by hybridization were resequenced. A SNP data set produced by the NIEHS was taken as a reference for comparison since it was produced by complete resequencing and as such is not expected to have any ascertainment bias. The SNP site frequency spectra as well as measures of heterozygosity and FST were determined for genome windows of 500 kb on each of these datasets. There are substantial differences between the SNP datasets for all three of these measures, and these disparities underscore the relevance of the ascertainment biases inherent in the different SNP discovery methods. Clearly then, population genetic analyses on such datasets will be affected by these biases. 
The authors take an important step beyond issuing this caveat by performing an ascertainment correction. The result is a corrected SNP frequency spectrum with frequencies weighted by the probability of discovering each SNP in the original study. The corrected data are far more consistent in terms of the various measures based on the SNP site frequencies. Nevertheless, even after this correction some disparities between datasets still exist. On a slightly more optimistic note, while the effects of ascertainment bias on population genetic studies are profound, genotype-phenotype association studies are likely to be far less affected (and not prone to false positives).
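The idea of weighting each frequency class by the probability of discovery can be sketched under one simple, explicitly assumed model: the discovery panel is a random subsample of d chromosomes drawn from the n chromosomes later typed, and a SNP is discovered only if that subsample is variable. The correction actually applied by Clark et al. may differ; the function names and the example counts below are hypothetical.

```python
from math import comb


def discovery_probability(k, n, d):
    """Probability that a SNP with k copies of one allele among n
    chromosomes is variable in a random subsample of d chromosomes.

    Assumes 1 <= k <= n - 1 and d >= 2 (a single chromosome can never
    reveal a polymorphism, so the probability would be zero).
    """
    # The subsample misses the SNP when all d chromosomes carry the same allele.
    monomorphic = comb(k, d) + comb(n - k, d)
    return 1 - monomorphic / comb(n, d)


def correct_spectrum(observed, n, d):
    """Reweight observed SNP counts per frequency class and renormalize.

    `observed` maps allele count k to the number of SNPs seen at that count.
    Each class is divided by its discovery probability, which up-weights
    the rare classes that a small discovery panel tends to miss.
    """
    weighted = {k: c / discovery_probability(k, n, d) for k, c in observed.items()}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}


# Illustrative counts only: 20 singletons and 50 SNPs at count 5 of 10.
corrected = correct_spectrum({1: 20, 5: 50}, n=10, d=2)
```

In this toy case the singleton class gains weight after correction, which matches the direction of the bias described above: rare alleles are under-represented when discovery relies on a small panel.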

Natural selection on protein-coding genes in the human genome
Carlos D. Bustamante, Adi Fledel-Alon, Scott Williamson, Rasmus Nielsen, Melissa Todd Hubisz, Stephen Glanowski, David M. Tanenbaum, Thomas J. White, John J. Sninsky, Ryan D. Hernandez, Daniel Civello, Mark D. Adams, Michele Cargill and Andrew G. Clark Nature (2005) Vol. 437, no. 7062, pp. 1153–1157
One of the enduring questions of human biology is the extent to which the evolution of our species has been shaped by natural selection. The recent completion of the closely related chimpanzee genome sequence as well as concerted efforts aimed at uncovering human genomic polymorphism in the form of single nucleotide polymorphisms (SNPs) provide the necessary raw data to address this issue in a systematic way and on a genome-wide scale. Bustamante et al. present a comparative sequence analysis of human–chimp protein coding sequences that also employed abundant SNP data, based on a population of 20 European Americans and 19 African Americans, generated by Celera Genomics. The work focused on more than 11 000 well-annotated human genes that have demonstrable orthologs in the chimpanzee. Levels of coding sequence SNP polymorphism and human-chimp divergence were compared in order to assess the effects of selection on these genes. To do this, coding sequence sites were partitioned into synonymous sites, where a change does not result in an amino acid difference, and non-synonymous sites, where nucleotide changes result in a corresponding change in the encoded amino acid. Under this approach, known as the McDonald–Kreitman test, synonymous sites are taken as a standard of neutral evolution and an excess of non-synonymous divergence indicates adaptive evolution between species. On the other hand, a paucity of non-synonymous divergence indicates either weak negative selection or balancing selection acting to maintain polymorphism at a locus. The vast majority of genes analyzed in this study were either uninformative with respect to the McDonald–Kreitman test or did not show evidence for selection when the test was applied. However, when genes with at least four variable non-synonymous sites are considered, 9.0% showed evidence of rapid amino acid evolution due to adaptive divergence. 
It should be noted that this figure represents only ~2.5% of all genes in the study since the majority of genes show very low human–chimp divergence and thus have not undergone adaptive protein evolution. As has been noted previously, genes involved in immunity and defense, gametogenesis and sensory perception were over-represented in the set of adaptively evolving genes. Transcription factors were also enriched in this set suggesting one mechanism by which regulatory differences may evolve between species. In addition, 13.5% of informative loci show a paucity of amino acid divergence suggesting weak negative selection. These loci were found to be enriched in genes involved in actin binding and cytoskeleton formation.
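The logic of the McDonald–Kreitman comparison can be made concrete with a small sketch of the classic 2×2 table form of the test, using a neutrality index and Fisher's exact test. This is an assumption about a convenient way to illustrate the test, not a description of the statistical machinery used by Bustamante et al.; the function names and the example counts are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """McDonald-Kreitman 2x2 test from counts of non-synonymous (n) and
    synonymous (s) fixed differences (d) and polymorphisms (p).

    Returns the neutrality index (Pn/Ps)/(Dn/Ds) and a p-value.
    Assumes ps and dn are non-zero; values below 1 indicate an excess of
    non-synonymous divergence, values above 1 a paucity.
    """
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts for a single gene.
ni, p_value = mk_test(dn=7, ds=17, pn=2, ps=42)
```

A gene with few variable sites yields a table too sparse to reach significance, which is why so many genes in a genome-wide scan are uninformative under this kind of test.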

Adaptive evolution of non-coding DNA in Drosophila
Peter Andolfatto Nature (2005) Vol. 437, no. 7062, pp.1149–1152
Most studies of molecular evolution have focused on the coding sequences of genes and the protein products that they encode. However, the vast majority of genomic DNA in eukaryotes, e.g. >98% in the human genome, is non-coding. The tempo and mode of non-coding sequence evolution has been studied far less and, as a consequence, is not nearly as well understood. As more and more complete genome sequences accumulate, comparative genomics studies are beginning to remedy this situation. The recent report of Peter Andolfatto exemplifies the evolutionary insights that can be gained by the analysis of non-coding genome sequences. Andolfatto studied genome sequences within and between Drosophila species, with an emphasis on D. melanogaster, and demonstrated the extent to which natural selection has influenced the evolution of non-coding sequences. Two main aspects of the study were: (i) partitioning genomic sites into different categories that are expected to differ with respect to the intensity of selection and (ii) comparing levels of within species (polymorphism) to between species (divergence) sequence variation. Intergenic and gene regions were studied separately, and gene regions were partitioned into synonymous and non-synonymous sites of the coding sequence as well as introns and untranslated regions (UTRs). This resulted in one of the most surprising findings of the study, namely the discovery that the majority of non-coding sequences evolve more slowly than do synonymous sites of coding sequences. This includes not only non-coding genic sequences such as introns and UTRs but also intergenic regions quite distant from known genes; relative levels of both polymorphism and divergence were reduced for non-coding DNA. Apparently, the conservation of non-coding sequences cannot be attributed to reduced mutation rates. 
The estimate that 40-70% of Drosophila non-coding sequences are conserved by natural selection represents more than an order of magnitude increase over the fraction of conserved sites recently estimated for other eukaryotes such as mammals. Andolfatto also employed a version of the McDonald–Kreitman test that allows for inferences of adaptive selection. This test compares the ratio of polymorphism to divergence for sites that are thought to be subject to the effects of selection versus sites that are thought to evolve neutrally. Application of this test revealed a significant excess of divergence for UTRs indicating that these regions have evolved adaptively. Taken together, these results suggest that a large fraction, most likely a majority, of non-coding Drosophila DNA is functionally important to the organism, presumably due to the exertion of regulatory effects, and subject to the effects of both purifying and adaptive selection.

Two rounds of whole genome duplication in the ancestral vertebrate
Paramvir Dehal and Jeffrey L. Boore PLoS Biology (2005) Vol. 3, no. 10, p. e314
In 1970, Susumu Ohno proposed the so-called 2R hypothesis, which explains the increase in vertebrate genome size and complexity by two rounds of genome duplication proposed to have occurred early in vertebrate evolution. It was difficult to test this provocative hypothesis directly without the availability of complete (or nearly so) vertebrate and outgroup genome sequences, and so it was not addressed analytically (at the genome scale) for decades. In just the past few years, owing to the availability of numerous genome sequences, a plethora of tests of the 2R hypothesis have been published. These works have produced highly contradictory results: some studies confirm the 2R hypothesis, others find evidence for only one round of genome duplication, while still others find no support whatever for whole genome duplication in vertebrates. It is against this backdrop of controversy that Dehal and Boore contribute their own test of the 2R hypothesis. They report what they consider to be unequivocal evidence of two rounds of whole genome duplication in early vertebrate evolution. This conclusion is based on a comprehensive phylogenetic analysis of all multi-gene families from complete genome sequences of tunicate, fish, mouse and human. To date, most tests of 2R have similarly analyzed multi-gene families to look for evidence based on gene number or tree topology. Such works often find negligible remaining signal of two rounds of genome duplication, and the analysis reported here confirms this paucity of supporting evidence for 2R when gene number and tree topology alone are considered. It is only when the genomic positions of duplicate (paralogous) gene families are mapped to the genome that the evidence confirming 2R emerges. 
Specifically, genes that were duplicated prior to the divergence of fish and tetrapods tend to cluster into what the authors refer to as tetra-paralogons—that is four distinct genome regions, each of which represents an ancient, pre-duplicated, genomic segment. In contrast, more recently duplicated genes, which presumably result from tandem duplications of individual genes or chromosome segments, do not group into distinct tetra-paralogons. The authors estimate that 25–72% of human genes map to tetra-paralogons and thus may have been generated by whole genome duplications.

The design of transcription-factor binding sites is affected by combinatorial regulation
Yonatan Bilu and Naama Barkai Genome Biology (2005), Vol. 6, no. 12, p. R103
A concerted and sustained effort is underway to characterize the regulatory genomic regions, cis-binding sequences in particular, that control gene expression. Bioinformatic studies are increasingly relying on the incorporation of evolutionary information to aid in the identification of functionally important cis-regulatory sequences. Phylogenetic footprinting in particular relies on the assumption that functionally important cis-regulatory sites will be anomalously conserved over evolutionary time, i.e. show low levels of sequence diversity. However, the mode of evolution of regulatory sequences remains poorly understood, and consequently the extent to which this assumption is justified is not known. Bilu and Barkai report a timely and relevant attempt to understand the factors involved in the evolution, and by inference the design, of cis-regulatory binding sequences. They employ the yeast Saccharomyces cerevisiae for their study, taking advantage of the detailed map of cis-regulatory binding sites that exists for this model eukaryote. The specific aim of the study is to understand the relationship between the length and sequence specificity of cis-regulatory binding sequences. Prior to this work, it was recognized that cis-regulatory binding sequences that are short and lack sequence specificity may nevertheless be responsible for controlling gene expression. On the other hand, many apparent binding sites, including ones that are longer and closer to the sequence consensus for the site, are actually non-functional. This paradox has been attributed to combinatorial regulation, whereby the regulation of eukaryotic genes is controlled by the interaction of numerous regulatory factors and cis-binding sites. For the first time here, the authors actually demonstrate a connection between this type of combinatorial regulation and binding site specificity. 
They show that binding sites found in promoters that bind multiple transcription factors tend to be both shorter and ‘fuzzier’—i.e. less similar to the consensus sequence. In addition, novel binding sites were found to appear more often in promoters that already bind multiple transcription factors. The authors posit that this may be due to reduced selective pressure on such promoter sequences, which is also consistent with the fuzzier sequence motifs found therein. The observation that essential genes tend to possess fewer binding sequences seems to be consistent with this evolutionary model. The type of evolutionary design principles elucidated by the work of Bilu and Barkai would seem to have direct relevance to the use of comparative genomic-based cis-regulatory site prediction methods.

I. King Jordan
