Abstracts
Briefings in Bioinformatics aims to provide working biologists with an awareness and understanding of the computational approaches available for research and discovery. The Abstracts section of the journal consists of summaries of bioinformatics manuscripts published in the previous quarter. Inclusion of an article in this section indicates that the editors consider it to be among the most interesting and/or useful contributions to the field for the quarter covered. The contents of these reports are briefly distilled for the readers with an emphasis placed on their biological context and potential utility. Publications from the third quarter of 2006 (July–September), with an emphasis on computational resources for the detection and analysis of orthologous groups of genes (proteins), are reviewed here.
| Roundup: a multi-genome repository of orthologs and evolutionary distances |
|---|
Todd F. DeLuca, I-Hsien Wu, Jian Pu, Thomas Monaghan, Leonid Peshkin, Saurav Singh and Dennis P. Wall
Bioinformatics (2006) Vol. 22, no. 16, pp. 2044–2046
Orthologs are genes (proteins) that share a common ancestor and diverged via a speciation event. In this sense, orthologs can be thought of as directly corresponding genes from different genomes, and orthologous relationships are the fundamental unit of comparative genomics. Orthologous relationships are most often used to guide functional inferences, since orthologs are very likely to encode proteins with the same function. The importance of defining orthologs is underscored by the increasing number of databases dedicated to defining orthologous sets of genes. Orthologs are operationally defined as sequences with the highest similarity between two genomes, and as such, the accurate identification of orthologs relies on comparisons between complete (or nearly so) genome sequences. Accurate ortholog identification can be a computationally daunting task since it rests on the sequence comparison of many thousands of gene pairs, and one of the biggest challenges facing the field is the increasing number of complete genome sequences. The developers of Roundup, a database of orthologous gene pairs, have met this challenge head on by computing all possible pairwise sequence relationships between 250 complete genome sequences, including 32 eukaryotes. This represents a more than two-fold increase over any other ortholog database. Roundup uses a previously reported method for ortholog detection, reciprocal smallest distance, which is a refinement of the classically employed reciprocal best BLAST hit method. The reciprocal smallest distance method is distinguished by the use of global, rather than local, sequence alignments as well as the calculation of evolutionary distances instead of the reliance on BLAST scores. The combination of these two approaches yields improvements in both the sensitivity and selectivity of ortholog detection.
The evolutionary distance metric used is a maximum likelihood estimate of the number of amino acid substitutions between orthologs, and these values are reported along with the pairs of orthologs. There are three distinct ways to access the ortholog data in Roundup. Users may retrieve phylogenetic profiles, which are binary vectors indicating the presence and absence of orthologs across a set of genomes. Users can also use individual genes to browse orthologous relationships among specified sets of genomes. This option is less inclusive than the phylogenetic profile approach, but is faster and more direct. To their credit, the database developers also make the raw data available for download. These data consist of pairs of orthologous genes for any two genomes along with their evolutionary distances. One of the nice features of Roundup is that it is conceived and implemented as a community-evolvable resource, meaning that users may actually request that particular genomes be added to the database.
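The two ideas summarized above, reciprocal smallest distance pairing and binary phylogenetic profiles, can be illustrated with a short Python sketch. It assumes that evolutionary distances have already been computed and are supplied as a dictionary; the function names, gene identifiers, genome names and distance values are hypothetical and are not taken from Roundup, which derives its distances from global alignments and maximum likelihood estimation.

```python
from typing import Dict, Iterable, List, Set, Tuple

Distances = Dict[Tuple[str, str], float]


def reciprocal_smallest_distance(dist: Distances) -> List[Tuple[str, str, float]]:
    """Return (gene_a, gene_b, distance) triples that are mutual nearest neighbours.

    `dist` maps (gene from genome A, gene from genome B) to a precomputed
    evolutionary distance. Computing those distances is outside this sketch.
    """
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (a, b), d in dist.items():
        if a not in best_for_a or d < best_for_a[a][1]:
            best_for_a[a] = (b, d)
        if b not in best_for_b or d < best_for_b[b][1]:
            best_for_b[b] = (a, d)
    # Keep a pair only if each gene is the other's closest partner.
    return sorted(
        (a, b, d) for a, (b, d) in best_for_a.items() if best_for_b[b][0] == a
    )


def phylogenetic_profile(gene: str, genomes: Iterable[str],
                         orthologs: Dict[str, Set[str]]) -> List[int]:
    """Binary vector: 1 if `gene` has an ortholog in a genome, else 0.

    `orthologs` maps a genome name to the set of query genes that have an
    ortholog in that genome (a hypothetical input format).
    """
    return [1 if gene in orthologs.get(g, set()) else 0 for g in genomes]


# Hypothetical distances between three genes of genome A and three of genome B.
toy_dist = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.40,
    ("a2", "b1"): 0.05, ("a2", "b2"): 0.30,
    ("a3", "b3"): 0.20,
}
pairs = reciprocal_smallest_distance(toy_dist)
print(pairs)  # [('a2', 'b1', 0.05), ('a3', 'b3', 0.2)]

toy_orthologs = {"genome_B": {"a2", "a3"}, "genome_C": {"a3"}}
profile = phylogenetic_profile("a2", ["genome_B", "genome_C", "genome_D"], toy_orthologs)
print(profile)  # [1, 0, 0]
```

In this toy input, a1 is closest to b1, but b1 is closer to a2, so a1 is left without an ortholog; this is the reciprocity requirement at work.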
| PhyloPat: phylogenetic pattern analysis of eukaryotic genes |
|---|
Tim Hulsen, Jacob de Vlieg and Peter M.A. Groenen
BMC Bioinformatics (2006) Vol. 7, p. 398
Functional inferences on genomic sequence data increasingly rely on evolutionary information. An important source of evolutionary information is phylogenetic patterns, which reflect the presence or absence of genes across a set of species. PhyloPat is a utility that allows users to query the Ensembl gene database using phylogenetic patterns. Phylogenetic pattern queries allow for the identification of evolutionarily and/or functionally related groups of genes. In order to define phylogenetic patterns, it is necessary to first establish orthologous relationships between gene (protein) sequences. PhyloPat uses orthologous relationships defined by the Ensmart database. A number of other orthology databases exist and several also allow for queries using phylogenetic patterns. PhyloPat is distinguished from these earlier tools in that it uses a strictly gene-centric approach as opposed to one that relies on relationships among proteins. This is important because orthology is most accurately defined at the level of the genome rather than the proteome. In particular, individual genomic loci can encode multiple proteins via alternative transcripts, and the presence of multiple protein variants from single genes can confound attempts to accurately define orthologous relationships as well as lineage-specific gene family expansions and deletions. There are two basic ways to search PhyloPat: (i) users can start with a given phylogenetic pattern, using check boxes or regular expressions, and extract the set of genes that match the pattern or (ii) users can provide a list of genes to evaluate the phylogenetic pattern of each. This flexibility, combined with the easy-to-use interface, allows users to readily construct complex phylogenetic queries. PhyloPat output can be stored in HTML, MS Excel or plain text format.
In addition to the phylogenetic pattern information, functional information is provided via the integration of the FatiGO web interface that reports Gene Ontology annotations. The integration with Ensembl and Ensmart helps to ensure the reliability of the orthologous relationships while providing additional utility to the user.
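The two search modes described above can be sketched in a few lines of Python. This is only an illustration of the idea of querying presence/absence patterns with regular expressions; the species order, gene identifiers, pattern encoding (one `1` or `0` per species) and function names are all assumptions made for the example and do not reflect PhyloPat's actual data model or interface.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical species order; each pattern character is 1 (present) or 0 (absent).
SPECIES = ["human", "mouse", "chicken", "zebrafish", "fly"]

# Hypothetical gene identifiers and patterns, for illustration only.
PATTERNS: Dict[str, str] = {
    "geneA": "11100",
    "geneB": "11111",
    "geneC": "00011",
    "geneD": "11000",
}


def genes_matching(patterns: Dict[str, str], regex: str) -> List[str]:
    """Mode (i): return the genes whose whole pattern matches the regular expression."""
    compiled = re.compile(regex)
    return sorted(g for g, p in patterns.items() if compiled.fullmatch(p))


def patterns_for(patterns: Dict[str, str],
                 genes: Iterable[str]) -> Dict[str, Dict[str, bool]]:
    """Mode (ii): return a species -> presence mapping for each requested gene."""
    return {
        g: {sp: flag == "1" for sp, flag in zip(SPECIES, patterns[g])}
        for g in genes
    }


# Present in human and mouse, absent from zebrafish and fly, chicken unconstrained.
amniote_like = genes_matching(PATTERNS, r"11[01]00")
print(amniote_like)  # ['geneA', 'geneD']

print(patterns_for(PATTERNS, ["geneC"]))
```

Using `fullmatch` rather than `search` means that the expression must account for every species position, which keeps a query such as `11[01]00` from matching a longer or shifted pattern by accident.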
| PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification |
|---|
Nandini Krishnamurthy, Duncan P. Brown, Dan Kirshner and Kimmen Sjölander
Genome Biology (2006) Vol. 7, no. 9, p. R83
The fundamental method for functionally annotating newly sequenced genes is information transfer. Information transfer is a computational technique whereby experimentally characterized information is transferred to the query sequence of interest by virtue of sequence similarity to a known gene. While this approach has been largely successful, it is not without substantial problems. For instance, the most similar sequence in a database search may have a different function from the query sequence, and many query sequences do not retrieve any statistically significant hit when searched against a database. In light of these challenges, the Berkeley Phylogenomics Group has developed the database PhyloFacts. The essence of PhyloFacts is data integration from multiple sources, and the use of consensus bioinformatics approaches that help to avoid systematic errors inherent to individual methods. PhyloFacts is an online encyclopedia with close to 10 000 entries, or books as they are known, each of which reports the results of pre-calculated structural, functional and evolutionary analyses for a protein family or domain. To assemble such an integrated resource PhyloFacts combines a number of publicly available bioinformatic tools with their own phylogenomic methods of analysis. Standard database homology search tools such as BLAST and FASTA are employed along with more recently developed iterative search methods like PSI-BLAST. PhyloFacts also uses domain-based annotation and structure prediction as implemented in databases like PFAM and SMART as well as prediction of protein localization. The in-house phylogenomic methods used allow for more precise classification based on the discrimination between orthologous and paralogous evolutionary relationships. 
PhyloFacts books include a number of distinct data types that characterize a protein family such as homologous protein clusters, multiple sequence alignments, phylogenetic trees, predicted structures, subfamily definitions, taxonomic distributions, domain characterizations and Gene Ontology functional annotations as well as links to the literature and other online databases. Users can browse this repository of data by viewing the individual web pages for each PhyloFacts book, or they can query their sequence of interest against the database in order to classify it. PhyloFacts books are assembled according to several different classes of evidentiary standards which correspond roughly to different levels of confidence. More than half of the book entries correspond to experimentally characterized structural domains; presumably, these represent the gold standard of the resource. The remaining entries are successively less confident, being derived from groups of proteins that can be globally aligned, conserved regions, motifs and pending groups that remain to be manually validated.
| Functional noncoding sequences derived from SINEs in the mammalian genome |
|---|
Hidenori Nishihara, Arian F.A. Smit and Norihiro Okada
Genome Research (2006) Vol. 16, no. 7, pp. 864–874
The single most abundant class of eukaryotic genomic DNA is made up of the inserted remnants of formerly mobile transposable elements (TEs). For years, it was assumed that these sequences were merely junk or parasitic DNA that served no function other than their own propagation. However, numerous molecular biology studies have uncovered different ways in which TEs have become domesticated to play some functional role for the host genome in which they reside. Most of these studies have been anecdotal in nature and their functional revelations about TEs were accidental. More recently, investigators have begun to explore the functional contributions of TEs to their host genomes in a more direct and systematic way. This report from the Okada group demonstrating the domestication of SINEs (short interspersed nuclear elements) in mammalian genomes represents one such study. In this case, the investigators were aware that (i) large numbers of mammalian non-coding sequences evolve under selective constraint that points to their functional relevance and (ii) non-coding DNA is vastly enriched with TE sequences. This led them to search for cases of conserved sequences derived from TEs. They were able to demonstrate that numerous ancient SINEs, from several families that they discovered, were evolving under strong selective constraint. The authors started with their discovery of a number of ancient SINE families from mammalian and chicken genomes as well as coelacanth, shark, hagfish and amphioxus. All of the novel SINEs they reported were discovered by virtue of a shared central domain that is highly conserved within and between element families. Together these elements comprise an ancient superfamily of TEs named DeuSINE for their distribution among deuterostome genomes. Perhaps the most interesting family discovered is named Amniota SINE1 (AmnSINE1) to reflect its distribution among mammalian and bird genomes. 
AmnSINE1 is more than 300 million years old and has a chimaeric structure consisting of a 5' promoter region derived from 5S rRNA followed by a tRNA-derived sequence and the conserved core domain characteristic of all the DeuSINEs. The 3' tail is also shared by other DeuSINEs and shows similarity to the 3' end of a zebrafish LINE (long interspersed nuclear element). There are approximately 1000 copies of AmnSINE1 in the human genome, and remarkably, more than 10% of these copies correspond to loci that are highly conserved across mammalian genomes. Furthermore, the conservation is highest in the core domain of the element shared among all DeuSINEs. Thus, AmnSINE1 represents an ancient TE with a substantial fraction of copies that have evolved some function that benefits their host genome. While the authors speculate as to what this particular function may be, they were not able to point to any specific role based on computational analysis alone.
| Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure |
|---|
Elfar Torarinsson, Milena Sawera, Jakob H. Havgaard, Merete Fredholm and Jan Gorodkin
Genome Research (2006) Vol. 16, no. 7, pp. 885–889
This is an exciting and provocative article that points to an entirely new area of post-genomic sequence analysis. Comparative analyses of eukaryotic genomes have revealed that numerous non-coding sequences have been selectively conserved for tens of millions of years. Since this conservation must be based on function, efforts to functionally characterize the non-coding portion of mammalian genomes are increasingly focused on these slowly evolving regions. Torarinsson et al. departed from this standard line of reasoning and focused their efforts instead on the non-repetitive part of the human genome that cannot be aligned with mouse genomic sequence. They point out that while these genomic regions may not be conserved at the level of sequence, they may be structurally conserved. To evaluate this possibility, the authors carefully selected corresponding regions of the human and mouse genome that were syntenic and contiguous but nevertheless could not be aligned. In other words, these pairs of human-mouse sequences represent orthologous regions that share a common ancestor but have evolved beyond sequence similarity recognition. They then ran the program FOLDALIGN to compare the structures of the corresponding human-mouse sequences. FOLDALIGN conducts a local structural alignment, simultaneously aligning and folding regions between the two sequences, which facilitates the identification of conserved local structures. This allowed them to identify approximately 1800 structurally conserved regions with no sequence similarity. Simulation with randomly shuffled sequences was used to indicate that this conservation was statistically significant. When these structurally conserved regions were compared to a recently published transcriptional map of the human genome, they were found to be twice as likely to map to transcriptionally active regions of the genome. This strongly suggests that these regions are structurally conserved by virtue of some role that their transcripts play. However, the set analysed in this study does not correspond to any of the known non-coding RNAs stored in the Rfam database. In other words, these represent completely novel discoveries. Studies of this kind emphasize the functional relevance of vast uncharacterized regions of genomic DNA and reveal the nuances of natural selection on the genome. As an aside, the computational analysis needed to complete this study was impressive and daunting. Alignment of the RNA structures took 5 months to run on a Linux cluster consisting of 70 nodes with 2 GB of RAM each. This suggests an outstanding need for the development of heuristic algorithms that can accomplish similar structural comparisons, albeit with some loss of accuracy, but at far greater speeds.
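The shuffled-sequence control mentioned above follows a general pattern that can be sketched in Python: score the real sequence, score many shuffled versions of it, and ask how often the shuffled versions do at least as well. The sketch below is not FOLDALIGN and does not reproduce the authors' procedure; as a stand-in for a structural alignment score it uses a deliberately simple toy score (the longest perfect Watson-Crick hairpin stem in a single sequence), and the function names, the example sequence and the number of shuffles are all assumptions.

```python
import random
from typing import Callable

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def longest_stem(seq: str, min_loop: int = 3) -> int:
    """Toy structure score: length of the longest perfect hairpin stem.

    Bases i+k and j-k are paired outward-in, leaving at least `min_loop`
    unpaired bases between the innermost pair.
    """
    best = 0
    n = len(seq)
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            k = 0
            while i + k < j - k - min_loop and (seq[i + k], seq[j - k]) in WATSON_CRICK:
                k += 1
            best = max(best, k)
    return best


def shuffle_p_value(seq: str, score: Callable[[str], int],
                    n_shuffles: int = 200, seed: int = 0) -> float:
    """Empirical p-value: fraction of shuffles scoring at least as well as `seq`.

    Shuffling preserves base composition but destroys base order. One is
    added to numerator and denominator so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = score(seq)
    bases = list(seq)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        if score("".join(bases)) >= observed:
            at_least += 1
    return (at_least + 1) / (n_shuffles + 1)


# Hypothetical hairpin: a 10-base stem, a 4-base loop and the reverse complement.
hairpin = "GACUGCAUGC" + "AAAA" + "GCAUGCAGUC"
print(longest_stem(hairpin))  # 10
p_value = shuffle_p_value(hairpin, longest_stem)
print(p_value)
```

A simple mononucleotide shuffle is used here for brevity; controls that also preserve dinucleotide frequencies are generally a stricter choice for RNA structure predictions.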
| An initial map of insertion and deletion (INDEL) variation in the human genome |
|---|
Ryan E. Mills, Christopher T. Luttig, Christine E. Larkins, Adam Beauchamp, Circe Tsui, W. Stephen Pittard and Scott E. Devine
Genome Research (2006) Vol. 16, no. 9, pp. 1182–1190
A sustained and vigorous effort has been made to survey variation among human genome sequences in the form of single nucleotide polymorphisms (SNPs). There are more than 10 million distinct SNPs and a large, well-funded consortium is dedicated to assembling these data into a haplotype map that can be used for genetic linkage association studies. Far less work has been done on other sources of genetic variation; insertion and deletion polymorphism in particular has not been nearly as well characterized. This is surprising because indels are known to be common among genomes and can be expected to have even more profound mutagenic effects than SNPs. The lab of Scott Devine has been working to remedy this inequality, and here they present their own map of human genome indel polymorphism. This study is particularly notable because the investigators simply took advantage of sequencing studies that had been previously conducted to identify SNPs. Apparently the investigators behind these original re-sequencing studies were so focused on SNPs that they neglected indels. The authors of this report were able to identify more than 400 000 unique indel polymorphisms by computationally reanalysing the previously reported sequence traces. When the gaps in the sequence traces were polarized using chimpanzee genome sequence as an outgroup, they were found to correspond to roughly equal proportions of insertions and deletions. The indels ranged in size from 1 bp to almost 10 000 bp and could be classified into several groups. These groups include 1 bp indels, indels that result from local base pair repeat expansions, transposable element derived indels and all other indels of random genomic sequence. Over the entire human genome there is, on average, one indel for every 7.2 kb of sequence. However, indels are far from randomly distributed; there are hotspots that have an order of magnitude more indel variation than the chromosome average.
A number of indels that map to gene regions including promoters and coding sequences were identified in this study. Importantly, the authors have deposited their human indel map into the NCBI's SNP database (dbSNP), and in so doing substantially enhanced the publicly available information regarding human genome variation.
| Bacterial regulatory networks are extremely flexible in evolution |
|---|
Irma Lozada-Chávez, Sarath Chandra Janga and Julio Collado-Vides
Nucleic Acids Research (2006) Vol. 34, no. 12, pp. 3434–3445
The combination of accumulating complete genome sequences and decades of experimental effort is leading to a more systematic understanding of higher-order functional and evolutionary relationships among numerous genes and their encoded protein products. This is the basis of the renaissance of systems biology, which can now serve as a powerful explanatory paradigm at the molecular level where reductionism once held sway exclusively. The leitmotif of molecular systems biology is the network, a graphical representation of the interactions among numerous genes, proteins, metabolites or other biological players. The articulation of transcriptional regulatory networks, whereby transcription factors are directionally connected to their target genes, allows for the comprehensive understanding of an organism's regulatory program. A question that has emerged with the availability of large-scale transcriptional networks is the extent of evolutionary conservation of regulatory interactions and pathways between organisms. It is to this question that Julio Collado-Vides and colleagues turn their attention in this far-reaching report on bacterial regulatory network evolution. They were able to demonstrate that bacterial regulatory networks have changed dramatically over evolutionary time and postulate that this flexibility underlies the remarkable ability of bacteria to adapt to vastly different environmental niches. The authors defined regulatory networks, based on interactions between transcription factors and target genes, in the two model organisms Escherichia coli and Bacillus subtilis. These two organisms have been intensively studied and so it was possible to rely on experimentally validated regulatory networks from each.
They evaluated the evolution of these regulatory networks at three different levels: (i) the evolution of individual proteins in the network, (ii) the evolution of the interactions between proteins and (iii) the coherence of regulons, which consist of groups of genes regulated by a single transcription factor. All three levels of inquiry revealed evidence of rapid and substantial reorganization of bacterial regulatory networks. Interestingly, transcription factors were found to evolve more rapidly than target genes, and global regulators, which influence the expression of numerous target genes, were found to be particularly poorly conserved. This result is the opposite of what might be expected based on biological network properties where hubs of interaction networks tend to be evolutionarily conserved, and it underscores the ability of bacterial regulatory networks to change dramatically in one or a few evolutionary steps. Regulatory interactions were also found to be poorly conserved and there was little evidence of co-evolution between interacting components in the regulatory network. Bacterial regulons were also shown to be rapidly lost in evolution, all of which points to the rapid evolution of bacterial gene regulation and the importance of this process in driving adaptation.
| Identification of the REST regulon reveals extensive transposable element-mediated binding site duplication |
|---|
Rory Johnson, Richard J. Gamblin, Lezanne Ooi, Alexander W. Bruce, Ian J. Donaldson, David R. Westhead, Ian C. Wood, Richard M. Jackson and Noel J. Buckley
Nucleic Acids Research (2006) Vol. 34, no. 14, pp. 3862–3877
A major goal of post-genomics research efforts is to characterize the functional roles of the vast quantities of non-coding DNA that predominate among eukaryotic genomes. Towards this end there are many efforts underway to map the DNA cis-regulatory motifs that interact with transcription factors to regulate the expression of nearby genes. The article of Johnson et al. presents a genome-wide analysis of one particular genetic regulatory sequence motif known as repressor element 1 (RE1). RE1 is a 21 bp long transcription factor binding site that is bound by the transcription factor REST near a number of neuron-specific genes. The authors mapped RE1 sites and their associated target genes in the human and mouse genomes by creating a sensitive position-specific score matrix (PSSM) with which to query the genome sequences. The PSSM is a probabilistic representation of the sequence variability among 93 experimentally characterized RE1 sequences. The authors were able to achieve both good sensitivity and selectivity in the computational detection of RE1 sites by training the PSSM with known binding sequences as well as a known negative set that does not bind REST. They were able to identify approximately 1000 RE1 sites in both the human and mouse genome and more than 40% of these sites represent novel discoveries. In addition to the computational analysis, the functional relevance of multiple RE1 sites was confirmed using an experimental transcription assay. But the most interesting finding of the study was that many of the RE1 sites are closely related and turn out to be derived from transposable elements (TEs). Despite the fact that the TE-derived RE1s are similar in sequence, they come from members of a number of different TE families including long and short interspersed nuclear elements as well as endogenous retrovirus sequences. This suggests a specific mechanism by which multiple novel RE1 sites can be introduced near genes dispersed throughout the genome and lead to changes (possibly coordinated) in their expression pattern.
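For readers unfamiliar with PSSMs, the general technique can be sketched in Python as follows. This is a generic log-odds matrix built from aligned sites and scanned along a sequence, not the authors' implementation: it scores against a uniform background rather than a trained negative set, and the 4 bp toy sites, the threshold, the pseudocount and the function names are assumptions chosen for illustration (real RE1 sites are 21 bp long).

```python
import math
from typing import Dict, List, Optional, Tuple

ALPHABET = "ACGT"
Pssm = List[Dict[str, float]]


def build_pssm(sites: List[str],
               background: Optional[Dict[str, float]] = None,
               pseudocount: float = 1.0) -> Pssm:
    """Build a log2-odds matrix, one column per position, from aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    background = background or {base: 0.25 for base in ALPHABET}
    pssm: Pssm = []
    for pos in range(width):
        # The pseudocount keeps unseen bases from receiving a score of -infinity.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({base: math.log2((counts[base] / total) / background[base])
                     for base in ALPHABET})
    return pssm


def scan(seq: str, pssm: Pssm, threshold: float) -> List[Tuple[int, str, float]]:
    """Slide the matrix along `seq` (A, C, G, T only) and report windows scoring
    at or above `threshold` as (start, window, score)."""
    width = len(pssm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical 4 bp training sites; far shorter than a real binding site.
toy_sites = ["TCAG", "TCAG", "TCAC", "TTAG"]
toy_pssm = build_pssm(toy_sites)
hits = scan("GGTCAGAAT", toy_pssm, threshold=4.0)
print([(start, window) for start, window, _ in hits])  # [(2, 'TCAG')]
```

The threshold controls the trade-off described above: lowering it recovers more degenerate sites at the cost of more false positives, which is why a negative training set is useful for calibrating it.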