Briefings in Bioinformatics Advance Access published online on October 10, 2007
Briefings in Bioinformatics, doi:10.1093/bib/bbm048
Discovering and detecting transposable elements in genome sequences
Corresponding author. Casey M. Bergman, Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK. Tel: +44 (0)161-275-1713; Fax: +44 (0)161-275-5082; E-mail: casey.bergman{at}manchester.ac.uk
| ABSTRACT |
|---|
The contribution of transposable elements (TEs) to genome structure and evolution as well as their impact on genome sequencing, assembly, annotation and alignment has generated increasing interest in developing new methods for their computational analysis. Here we review the diversity of innovative approaches to identify and annotate TEs in the post-genomic era, covering both the discovery of new TE families and the detection of individual TE copies in genome sequences. These approaches span a broad spectrum in computational biology including de novo, homology-based, structure-based and comparative genomic methods. We conclude that the integration and visualization of multiple approaches and the development of new conceptual representations for TE annotation will further advance the computational analysis of this dynamic component of the genome.
Keywords: transposable element, genome annotation, repetitive DNA, bioinformatics
| INTRODUCTION |
|---|
Transposable elements (TEs) are a widespread class of repetitive sequences that can be viewed largely as selfish intragenomic parasitic sequences [1, 2]. Owing to their ability to undergo replicative transposition via an RNA or DNA intermediate, TEs can increase in copy number to occupy large fractions of genome sequences, especially in higher eukaryotes. For example, an estimated 25% and 45% of the rice and human genome sequences, respectively, are TE in origin [3, 4]. Aside from their unique modes of replication and sheer abundance, TEs are important biological entities for study because of their roles in genome structure [5], genome size [6], genome rearrangement [7] and contribution to host gene [7, 8] and regulatory evolution [8]. More practically, the repetitive nature of TE sequences poses fundamental challenges to genome sequencing [9], assembly [10], annotation [11] and alignment [12]. For these reasons, together with the unprecedented availability of genomic DNA sequences in recent years, there has been rapidly growing interest in developing new computational methods to aid the analysis of TEs in genome sequences. Here we review the diverse and innovative computational approaches to identify TEs in genome sequences, covering aspects of both TE discovery (i.e. the identification of new TE families) and TE detection (i.e. the identification of individual TEs from previously discovered TE families). Our main aim here is to cover recently developed methods for TE genome bioinformatics (see Refs [13, 14] for related perspectives); we refer readers elsewhere for an introduction to the biology of TEs [15, 16] and previous discussions on fundamental considerations in the field that have led to this new generation of computational tools [17–19]. Available computational resources for TE discovery and TE detection discussed in this review are summarized in Table 1. Some additional methods discussed in the text are available by request from the authors.
| TE DISCOVERY |
|---|
De novo repeat discovery
De novo methods attempt to discover new TEs using the intrinsic repetition of mobile DNA in genome sequences, typically without using prior information about TE structure or similarity to known TE sequences. These methods attempt to discover consensus sequences of new TE families, instances of new TEs or both. Typically de novo repeat discovery methods use assembled sequence data, and therefore are critically dependent on both sequencing and assembly strategies (see Refs [10, 20] for discussion of the effects of repetitive DNA on genome assembly), but new methods have recently been developed that attempt to bypass these dependencies by using sequencing reads [21]. The most common strategy is to detect pairs of similar sequences at different locations in a self-genome comparison, and then cluster these pairs to obtain repeat families. Because these methods are not specific to TEs, they find repeats generated by many different processes, including tandem repeats, segmental duplications and satellites as well as TE sequences. Additionally, TE families at low copy number (1–2 copies) or composed entirely of non-overlapping fragments cannot be detected by these approaches. The main challenges for de novo methods are to distinguish TEs from all other repeat classes and to identify distinct TE families. The difficulty in achieving these goals comes from the complexities of TE biology, including: (i) the fragmented nature of TE instances (which can result from sequence divergence, abortive repair of double-strand DNA breaks, incomplete reverse transcription, recombination among TEs and/or nesting of TEs into each other), (ii) the interspersion of TEs into other repeat classes such as segmental duplications and tandem repeats and (iii) the existence of closely related TE families that share different extents of sequence similarity. These biological realities have several effects computationally. First, because TE copies are often partially deleted or nested, partial matches among repeat instances are interrupted by large insertions or deletions. Thus between two divergent instances of the same repeat, repeat discovery methods will often detect several co-linear repeats instead of a single repeat with large gaps. We note that most of the TEs in a genome are degenerate copies and thus this is a major problem for de novo TE discovery. Second, TE nesting causes discovered repeats to aggregate into meta-families containing assemblages of the distinct repeat families present in the largest nest. One nest is often sufficient to cause this effect. Third, differences in historical activity and functional constraints lead to various degrees of sequence similarity, both in percent divergence and in extent of homology among TEs, and thus families may be either over-split or composed of multiple distinct subfamilies.
Most de novo TE discovery methods initially use classical computational strategies to detect repeat regions in a genome, such as suffix trees or pairwise similarity searches. Suffix-tree approaches, implemented in Reputer and RepeatMatch [22–26], have algorithmic complexity that is linear in genome size in both space and time. They detect exact [22] as well as degenerate [24] repeats. RepeatFinder [27] also relies on the efficient suffix tree data structure to detect exact repeats, taking output from Reputer or RepeatMatch, and subsequently merges neighboring and overlapping exact repeats to detect non-exact repeats. Faster approaches use hashing of k-mers (as in BLAST [28–30], SSAHA [31], BLAT [32] or PASH [33]) to anchor pairwise similarity searches. These strategies often use less memory, making them more practical for large genomes. For example, Leung et al. [34] designed an efficient algorithm using k-mer hashing and a linked list connecting all occurrences of the same word to find repeats with errors, with average memory and run time increasing almost linearly with total sequence length.
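As a rough illustration of the k-mer hashing idea, the following Python sketch indexes every k-mer of a sequence by its start positions and reports the words that occur more than once. It is a toy, exact-match simplification and not the algorithm of Leung et al. or of any tool named above; the function names `kmer_index` and `repeated_kmers` are hypothetical, and a dictionary of lists stands in for the linked list of word occurrences.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def repeated_kmers(seq, k, min_count=2):
    """Return {k-mer: positions} for k-mers seen at least min_count times.

    Exact matches only: real tools extend such seeds into alignments
    that tolerate mismatches and gaps.
    """
    return {
        word: positions
        for word, positions in kmer_index(seq, k).items()
        if len(positions) >= min_count
    }


if __name__ == "__main__":
    print(repeated_kmers("GATTACATTGATTACA", 7))
```

Memory grows with the number of distinct k-mers, which is why practical implementations compress or sample the index for large genomes.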
The next step in de novo TE detection is to cluster pairs of aligned DNA segments into repeat families and filter out non-TE repeats. Two general approaches are currently used to filter out false-positive clusters corresponding to non-TE repeats or those containing a mix of different repeat types (i.e. TEs nested in segmental duplications). The first approach, implemented in RECON [35], considers the multiple alignments of all repeat copies of a cluster. Multiple alignments are split into sub-clusters when several sequences end at the same (or similar) position in the alignment, as expected for interspersed repeats that arise by the process of transposition. Sequences are split according to these boundaries and the underlying alignment pairs are re-clustered. In this way, nested repeats can be detected and separated from one another. Moreover, RECON will split sub-clusters according to the presence of long non-conserved regions (i.e. mismatching regions) between instances to deal with interfamily similarity between closely related TE families.
The second approach to detect families from pairs of repeat instances tries to find complete copies of the repeat among all instances of a family. These methods search for the longest sequences in a cluster and filter according to their occurrence. The rationale is that for an active TE family, there will exist at least a few complete copies (>3) in the genome. Other dispersed repeats would generally have few copies (typically 2) or copies of different lengths (if >2). An example of this approach is the GROUPER program from the BLASTER suite [36, 37], which operates on the basis of single-link clustering with overlap constraints. GROUPER starts from pairwise gapped alignments and connects co-linear aligned segments by an efficient chaining algorithm based on dynamic programming. Hence, it attempts to recover properly nested and deleted TEs by first chaining their constitutive fragments (a process known as defragmentation, see below). Chained matches are then gathered according to their similarity into groups by a constrained single-link clustering algorithm. Taking groups with more than three sequences removes many false positives and allows bona fide TE sequences to be kept in the majority of cases.
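The clustering-and-filtering step can be sketched in a few lines of Python. This is a minimal illustration of plain single-link clustering followed by a group-size filter, assuming that pairwise matches have already been chained and are given as pairs of hypothetical copy identifiers; it omits the overlap constraints that GROUPER applies, and the names `single_link_clusters` and `candidate_families` are invented for this example.

```python
def single_link_clusters(pairs):
    """Group identifiers connected by at least one pairwise match (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def candidate_families(pairs, min_size=4):
    """Keep groups with more than three members, as in the filter described above."""
    return [g for g in single_link_clusters(pairs) if len(g) >= min_size]


if __name__ == "__main__":
    matches = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
    print(candidate_families(matches))
```

Single-link clustering merges two groups as soon as one pair links them, which is why additional constraints are needed in practice to stop nested TEs from gluing unrelated families together.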
The PILER de novo repeat discovery system [38] operates on a similar basis by constructing piles of pairwise hits. A pile is formed by stacking pairwise alignments on the genomic regions from which they are obtained. The boundaries of the maximal contiguous region covered by a pile define a repeated region. Pairwise alignments are then re-examined and grouped by single-link clustering if they belong to the same piles and cover more than x% of the region each pile defines. Only clusters with more than n instances are kept. A unique feature of PILER is that it attempts to find and distinguish different classes of repeats by their characteristic features, implementing different search methods for dispersed families (PILER-DF), tandem arrays (PILER-TA), pseudo-satellites (PILER-PS) and terminal repeats (PILER-TR). The ability of PILER to incorporate biologically informed constraints into the repeat discovery process allows the identification of repeats that are more likely to be TEs. The PILER-DF and PILER-TR methods are clearly the most relevant for the detection of TE families; however, the other PILER methods could also prove useful, as here sensitivity is clearly sacrificed for specificity through the additional constraints.
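The notion of a pile can be made concrete with a small Python sketch. It assumes that hits are given as half-open (start, end) intervals on a single sequence, merges overlapping hits into maximal covered regions, and then selects the hits that span more than a chosen fraction of a pile. This is an illustrative simplification rather than PILER's implementation; `build_piles`, `hits_covering_pile` and `min_fraction` are hypothetical names.

```python
def build_piles(hits):
    """Merge overlapping half-open (start, end) hits into maximal covered regions."""
    piles = []
    for start, end in sorted(hits):
        if piles and start < piles[-1][1]:
            # Hit overlaps the current pile: extend the pile if needed.
            piles[-1][1] = max(piles[-1][1], end)
        else:
            piles.append([start, end])
    return [tuple(p) for p in piles]


def hits_covering_pile(hits, pile, min_fraction):
    """Return hits lying inside the pile that cover more than min_fraction of it."""
    pile_start, pile_end = pile
    pile_length = pile_end - pile_start
    return [
        (start, end)
        for start, end in hits
        if pile_start <= start
        and end <= pile_end
        and (end - start) / pile_length > min_fraction
    ]


if __name__ == "__main__":
    hits = [(0, 100), (10, 95), (50, 60), (200, 300)]
    piles = build_piles(hits)
    print(piles)
    print(hits_covering_pile(hits, piles[0], 0.8))
```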
Other methods solve the clustering and filtering steps using approaches based on graph theory. The Repeat Pattern Toolkit (RPT) [39] searches for ungapped pairwise alignments and groups instances by single-link clustering only if an overlap constraint of 90% is satisfied. It then divides the set of alignments into connected components (repeat families), computes the evolutionary distance between repeat family members, constructs minimum spanning trees from the connected components and visualizes the divergence within repeat families. The RepeatGluer [40] system also uses graphs to cluster pairwise similarities in assembled sequences based on an A-Bruijn graph, which represents all repeats in a genome as a mosaic of sub-repeats. This graph-based method decomposes a repeat into shorter subsequences shared by multiple repeat families that reveal the mosaic and nesting structure of repeat families. Then the graph structure can be queried to extract repeat families according to specific occurrence constraints.
Other methods depart from this pairwise alignment-clustering paradigm, and instead detect either repeat instances or families from over-represented k-mers. The aim of these methods is simply to mask repetitive sequences from a genome in the absence of a repeat library, and as such they do not attempt to infer repeat families or annotate repeat boundaries, but may be extended for such purposes. For example, the WindowMasker repeat detection suite [41] first automatically sets the k-mer length based on input sequence length and estimates of k-mer occurrence frequencies. In a second pass, all k + 4 windows exceeding an average score (based on k-mer frequencies) are masked as repetitive, joining neighboring windows if intervening intervals exceed a slightly lower threshold. Related methods to find repeats based on word counting use compression techniques to reduce memory requirements (mer-engine [42]) or count gapped words to detect inexact repeats (RAP [43]).
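A much-simplified Python sketch of library-free masking by word counting follows. It counts every k-mer in the input and soft-masks (lowercases) each base covered by a k-mer whose count reaches a threshold. It does not reproduce WindowMasker's automatic choice of k, its window scoring or its window-joining step; `mask_frequent_kmers` and `min_count` are hypothetical names chosen for illustration.

```python
from collections import Counter


def mask_frequent_kmers(seq, k, min_count):
    """Lowercase every base covered by a k-mer occurring at least min_count times."""
    n_words = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_words))
    masked = list(seq)
    for i in range(n_words):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = seq[j].lower()
    return "".join(masked)


if __name__ == "__main__":
    print(mask_frequent_kmers("CAGCAGCAGTTGAC", k=3, min_count=3))
```

In this toy input only the word CAG reaches the threshold, so the three tandem CAG units are masked and the remaining bases are left in upper case.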
The ReAS program [21] uses a seed-and-extend strategy for repeat detection based on high-frequency k-mers. A key distinguishing feature of ReAS is that it operates on sequencing reads, rather than assembled genome sequences, and therefore is not affected by the poor quality or collapsed repeat sequences that are common in whole-genome shotgun assemblies. ReAS is based on the converse principle of many shotgun sequence assembly programs, which first remove sequencing reads containing high-frequency k-mers prior to contig layout. Once the reads containing highly represented k-mers are identified and assembled into repeat contigs, these contigs serve to seed an extension procedure using multiple alignments. Sudden changes in the depth of sequence read coverage and partially aligned reads are used to detect TEs nested in segmental duplications in a way similar to RECON. RepeatScout [44] similarly extends over-represented k-mer seeds, but on already assembled sequences. The k-mer approaches of ReAS and RepeatScout avoid the main problems of pairwise similarity-based approaches such as RECON, which are computationally intensive and rely on pairwise alignment boundaries to identify TE instances.
Homology-based methods
The most common approach to detect new TE families is based on detecting homology to known TE protein-coding sequences. Protein homology-based methods have distinct advantages over de novo repeat discovery methods in that they capitalize on prior knowledge captured in the large number of previously reported TE sequences. Thus they are more likely to detect bona fide TEs, even those present in a single copy in the genome, and can provide a classification of the new TE (i.e. transposon versus retrotransposon). Homology-based methods are biased towards the detection of previously identified families and to TEs active recently enough to retain substantial protein homology. They are also not applicable to certain classes of TEs that are composed entirely of noncoding sequences, such as miniature inverted repeat transposable elements (MITEs) and short interspersed nuclear elements (SINEs). Homology-based TE detection methods are typically applied to assembled genome sequences, but have also been successfully applied to preliminary genome resources such as BAC-end sequences [45]. Most homology-based methods [36, 45–47] used some version of a translated fast heuristic alignment algorithm in the BLAST [29] or FASTA [48] families of programs with known TEs used as queries, followed by post-processing including merging and/or extending of individual genomic hits. An alternative approach to homology-based TE discovery is to use HMMER [49] to scan predicted ORFs with profile hidden Markov models (HMMs) from the PFAM database [50] to detect common TE protein domains [51, 52]. In general, homology-based TE detection requires further analysis of structural features to obtain a full-length reference sequence.
Structure-based methods
A third class of tools uses prior knowledge about the common structural features shared by different TEs that are necessary for the process of transposition, such as long terminal repeats (LTRs). Unlike de novo repeat discovery methods, structure-based methods rely on detecting specific models of TE architecture, rather than just the expected results of the transposition process (i.e. dispersed repeats with similar boundaries). In contrast to homology-based methods, structure-based methods are less biased by similarity to the set of known elements. We note that the distinction between the classification of de novo, homology and structure-based methods is not absolute, as demonstrated by the PILER-TR method mentioned above. Like homology-based methods, structure-based methods can detect low copy number families, have high specificity to detect TE repeats and can provide a preliminary structural classification of the new TE. Nevertheless, purely structure-based methods are limited by the facts that specific models must be designed and implemented for each type of TE under consideration, and that some classes of TEs are inherently more strongly structured than others and are therefore more easily detected using these kinds of methods.
Several structure-based methods have been developed recently to detect LTR retrotransposons by searching for the common structural signals in this subclass of TE: LTRs, target site duplications (TSDs), primer-binding sites (PBSs), polypurine tracts (PPTs) and ORFs for the gag, pol (containing the RT domain) and/or env genes. The first method of this type to be developed was LTR_STRUC [53], which uses a heuristic seed-and-extend strategy to find and align local repeats located within a user-specified distance, which are used as an initial set of candidate LTRs. The pairwise alignment of putative LTRs is then used to estimate the boundaries of the LTRs on the original contig, which from the start of the 5' LTR to the end of the 3' LTR should span a full-length element. Putative full-length LTR elements are subsequently given a quality score based on the presence of TSD (required), PBS, PPT and RT ORF signals. LTR_STRUC is unable to identify incomplete LTR elements or solo LTRs and thus, like most TE discovery methods, should be used to compile reference sets for subsequent detailed TE detection (see below). One critical limitation of LTR_STRUC is that only TEs within the same contig can be detected, which is a particularly acute problem given that contigs from shotgun sequence assemblies often end in the middle of TE sequences [10].
Kalyanaraman and Aluru [54] have developed a similar structure-based method to detect LTR retrotransposons, called LTR_par, that improves on some of the weaknesses of LTR_STRUC. First, this method allows for degenerate nucleotides in its alphabet, which obviates the problem posed to LTR_STRUC by sequence gaps. Second, the algorithm uses a suffix-array based pre-processing of the input sequence, which speeds up the generation of pairs of putative LTRs and ensures that all possible LTR candidate pairs are considered. Alignment of LTR pairs is performed by banded dynamic programming (in contrast to the greedy heuristic implemented in LTR_STRUC), which also takes into account distance and structural constraints between the LTR pairs on the original contig. More recently, a similar method called LTR_FINDER, based on principles related to those of LTR_STRUC and LTR_par, has been established as a webserver for LTR element discovery [55]. Rho et al. [52] have extended the structure-based detection of LTR elements using suffix arrays, first by merging neighboring exact repeats into longer fragments that more likely represent a complete pair of LTRs, which define the search space for TE ORFs. This method further clusters LTRs to build nucleotide profile HMMs to scan for solo LTRs and fragmented LTR elements. Phylogenetic analysis of the complete set of LTRs is used to detect pairs of solo LTRs that are nearest neighbors in the tree and closely spaced in the genome, which are subsequently joined into a complete divergent copy.
Morgante et al. [56] have developed the SMaRTFinder platform to conduct efficient searches in DNA for structured sets of motifs, including those shared among LTR retrotransposons. A structured motif is an ordered set of motifs and a list of intervals specifying the distances between motifs. In the case of LTR elements, these motifs can be LTR end motifs, the PBS or PPT or the DNA sequence of a highly conserved domain in an ORF. This generalized approach starts by locating instances of individual motifs (using suffix trees) and then solves a constraint satisfaction problem by constructing a graph with motif instances as nodes and edges between nodes that satisfy order and distance constraints. SMaRTFinder allows a user-specified level of edits to the motifs as well as deletion of motifs from the overall compound pattern, thus permitting variations on the structured motif resulting from mutation to be identified. In this regard, SMaRTFinder should be able to discover incomplete or degenerate LTR elements and potentially deal with missing data found at the site of TE insertions in whole-genome shotgun assemblies. The flexibility of structured motif searches suggests that other classes of TE would be amenable to analysis by this approach, and indeed Morgante et al. [56] discuss the extension of their approach to helitrons.
TE discovery methods that capture specific structural features have also been successfully developed for the MITE subclass of TEs. MITEs are DNA-based elements that have terminal inverted repeats (TIRs) but lack a transposase gene, which is supplied by autonomous DNA transposons that recognize their TIRs. As such, homology-based methods that attempt to discover TEs by similarity to known TE ORFs are not applicable to MITEs. The FINDMITE program [57] uses a fast string matching algorithm to identify TIRs flanked by TSDs separated by a user-specified distance, retaining only those TIRs that are not composed of simple sequences. The related method TRANSPO uses a fast approximate string search algorithm to find matches to known MITE TIRs that occur in inverted orientation within a given window [58]. For MITEs derived from autonomous DNA transposons, it is possible to discover their ancestral founder element using the MITE Analysis Kit (MAK) [59]. MAK first finds matches to TIRs with the inverted orientation and spacing and then performs a translated BLAST to find transposase ORFs.
Andrieu et al. [60] have developed a unique HMM-based structural method that can discover TEs based on the fact that the nucleotide composition of TE ORFs often differs from that of host genes in the same genome. The origin of these differences in base composition between TEs and host sequences is unknown, but likely relates to biases in biochemical processes that govern aspects of TE replication (e.g. reverse transcription) that do not apply to host sequences. This approach involves building HMMs that model 3 states for coding regions (one for each codon position) and at least one state for noncoding regions, which allows for the frameshift mutations that are common in decaying TEs. By training separate HMMs for RNA-based TEs, DNA-based TEs and host genes, this method was able to discriminate TEs from genes in curated datasets as well as to identify accurately the coding regions in known TEs. A global model which incorporates three submodels of this basic structure (one each for RNA-based TEs, DNA-based TEs and host genes) into a single HMM was further developed and applied to genome-wide TE detection [37]. As with all HMMs, the TE-HMM of Andrieu et al. [60] requires, and is dependent on, good training data sets, and for this method it appears that training data from the same species group will be required. In addition, since the TE-HMM attempts to predict only coding (and associated noncoding) sequences, other structural features of unidentified TEs (such as UTRs and LTRs) cannot be discovered using this approach.
Comparative genomic methods
An innovative method to detect new TE families and instances which relies neither on homology nor structural features has been recently proposed by Caspi and Pachter [61], which uses the fact that transposition creates large insertions that can be detected in multiple sequence alignments. Starting with whole-genome multiple alignments, this method searches for insertion regions (IRs) where multiple alignments of orthologous genome sequences are disrupted by a large (>200 bp) insertion in one or more species. After filtering (for simple and tandem repeats) and concatenating, IRs are then locally aligned with all other IRs to identify repeat insertion regions (RIRs). This process is constrained to cluster only IRs that are inferred to have occurred on the same branch of the phylogenetic tree. This method can identify new TE families and instances, as well as date TEs to branches of a phylogeny. While avoiding many of the problems of TE discovery associated with repeat boundary detection, this comparative approach is dependent on the quality of whole genome alignments, which can be compounded by the multiple alignment of draft genomes, and undoubtedly will be poor in TE-rich regions. Furthermore, the applicability of this method will be determined by the activity of TEs in the genomes in question: if all TE insertions are ancestral with respect to the species analyzed, no TEs will be identified. Nevertheless, more research to discover TE families using comparative genomic data between (and perhaps even within [62]) species seems justified given the performance of this approach and the recent explosion of multiple genome sequences for closely related species.
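The first step of such a comparative approach, locating large insertions in an alignment, can be sketched as follows. This Python example is reduced to a pairwise alignment given as two gapped rows of equal length and simply reports runs of columns where one genome has bases and the other has gaps; it is not the method of Caspi and Pachter, which works on whole-genome multiple alignments and places insertions on branches of a phylogeny. The function name and the default threshold parameter `min_len` are illustrative assumptions, with the default taken from the >200 bp cutoff quoted above.

```python
def insertion_regions(target_row, other_row, min_len=200):
    """Return half-open column spans where target has bases and other has gaps.

    Only runs strictly longer than min_len columns are reported.
    """
    if len(target_row) != len(other_row):
        raise ValueError("aligned rows must have equal length")
    regions = []
    run_start = None
    for col, (t, o) in enumerate(zip(target_row, other_row)):
        inserted = t != "-" and o == "-"
        if inserted and run_start is None:
            run_start = col
        elif not inserted and run_start is not None:
            if col - run_start > min_len:
                regions.append((run_start, col))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and len(target_row) - run_start > min_len:
        regions.append((run_start, len(target_row)))
    return regions


if __name__ == "__main__":
    target = "ACGT" + "A" * 5 + "ACGT"
    other = "ACGT" + "-" * 5 + "ACGT"
    print(insertion_regions(target, other, min_len=3))
```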
| TE DETECTION |
|---|
TE discovery methods that identify TE instances can also be used for TE detection, but as this is not their main aim, they usually have low sensitivity [35,37]. Thus, a second step of TE detection is required to comprehensively annotate TE instances in genomes. TE detection starts by assembling a reference set of TE sequences produced by the methods described above, followed by determination of the consensus sequence, classification of the TE type and typically some amount of manual curation. The definitive repository of such reference sets for eukaryotes is Repbase [63], which continues to expand through its new community-driven annotation and submission tools [64]. Reference sets of TE sequences can be composed of either instances of individual TE copies or consensus sequences based on an entire TE family. Consensus sequences represent the best available approximation of the active TEs that generated the copies we observe in the genome today, and the evolutionary distance between the consensus and genomic copies is less than that of two TE instances to each other [18]. Therefore, alignment-based TE detection sensitivity is better using consensus sequences than any genomic copy chosen as a reference. However, because some TE families are at low copy number or are composed of only partially overlapping fragmented copies, construction of a consensus sequence may neither be possible nor lead to a meaningful reconstruction of an active element.
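As a minimal illustration of how a consensus sequence is derived from aligned copies, the following Python sketch applies a per-column majority rule to a gapped multiple alignment and drops columns in which a gap is the most common symbol. It is a simplification under stated assumptions (pre-aligned rows of equal length, no ambiguity codes, no special treatment of CpG sites or ties) and is not the procedure used to build Repbase entries; `majority_consensus` is a hypothetical name.

```python
from collections import Counter


def majority_consensus(aligned_copies):
    """Majority-rule consensus of equal-length gapped sequences.

    Columns where the gap character '-' is the most common symbol are skipped.
    """
    consensus = []
    for column in zip(*aligned_copies):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


if __name__ == "__main__":
    copies = ["ACGT-A", "ACCT-A", "A-GTTA"]
    print(majority_consensus(copies))
```

Because each copy carries its own private mutations, the majority symbol at each column tends to recover the ancestral state, which is the intuition behind the higher sensitivity of consensus-based searches described above.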
Given a reference set of TE sequences, there are two main goals in detecting TEs in genome sequences: either to mask them as a preprocessing step in some other bioinformatic task (e.g. gene-finding or alignment) or to study them directly to make inferences about the biology of TEs. These twin aims of TE genome informatics are explicitly incorporated into the most common systems used to detect individual instances of TEs in genome sequences, Censor [64, 65] and RepeatMasker [66]. These methods classically used Smith–Waterman nucleotide alignment to output masked genomic DNA and a tabular summary of TE content. To speed the initial detection of local alignments, MaskerAid [67] was developed as a wrapper for WU-BLAST results to be processed by RepeatMasker, although more recent versions of both RepeatMasker and Censor run natively with WU-BLAST or NCBI-BLAST as the alignment search engine. RepeatMasker uses different similarity matrices (each optimal for a background GC% level of the mammalian genome) in the scoring of local alignments. The BLASTER program [36] also works as a wrapper of NCBI-BLAST or WU-BLAST, but allows match, mismatch and gap penalties to be adjusted for each genome. As an alternative to pairwise alignment, nucleotide profile-HMMs can be used to search for TE instances [68, 69]. These approaches replace consensus sequences by the richer information contained in a profile obtained from a multiple alignment of known TE instances, but are too computationally intensive to be used routinely. Other detection methods use translated similarity searches (e.g. BLASTER, Censor) or a combination of translated and nucleotide searches (e.g. RetroMap [70], RepeatRunner [71]).
Large insertions or deletions that occur in TEs after they integrate in the genome can cause detection methods to identify two matches to a query sequence, instead of one match with a long gap. This occurs through both TE nesting and small-scale insertion and deletion processes. Therefore, a post-processing step, called defragmentation [68], is often required to join fragments of a single TE insertion event into a biologically meaningful annotation. For example, the ProcessRepeats script distributed with RepeatMasker joins LTRs to the body of their corresponding LTR retrotransposon and links poly(A) sequences to the tail of non-LTR retrotransposons. Another heuristic defragmentation approach is implemented in LTR_MINER [72], based on distance and orientation between fragments, membership in the same family and length constraints on the resulting chain of fragments. Transposon Cluster Finder (TCF) [73] likewise joins fragments based on family, strand and the amount of non-TE DNA between fragments, but also implements structure-based defragmentation methods specific for LTR and LINE-like elements. TE nest (Kronmiller et al., unpublished data) is another defragmentation method designed primarily for LTR elements that uses age information encoded in the sequence divergence that occurs between LTRs after insertion. MATCHER, which comes with the BLASTER suite of tools, uses dynamic programming to efficiently find an optimal chain of co-linear fragments from the same family by summing HSP scores and subtracting a gap penalty. PLOTREP [74] combines defragmentation (based on maximal insertion/deletion sizes and other parameters) with an interactive graphical user interface to allow rapid visual inspection of merged hits.
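The chaining idea behind defragmentation can be illustrated with a short dynamic-programming sketch in Python. It assumes that all fragments come from the same TE family and strand, that each is given as (genome start, genome end, reference start, reference end, score) with half-open coordinates, and that a fixed penalty is charged for every join; that fixed penalty is an assumption made for this example and not the scoring scheme of MATCHER or any other tool. The names `best_chain` and `gap_penalty` are hypothetical.

```python
def best_chain(fragments, gap_penalty=10):
    """Find the highest-scoring chain of co-linear fragments.

    fragments: list of (g_start, g_end, r_start, r_end, score).
    Two fragments can be chained if the second starts at or after the end
    of the first on both the genome and the reference.
    Returns (total_score, chain), where chain is a list of fragments.
    """
    frags = sorted(fragments)
    if not frags:
        return 0, []
    best = [f[4] for f in frags]   # best chain score ending at each fragment
    prev = [None] * len(frags)     # back-pointers for traceback
    for j, fj in enumerate(frags):
        for i in range(j):
            fi = frags[i]
            colinear = fi[1] <= fj[0] and fi[3] <= fj[2]
            if colinear and best[i] + fj[4] - gap_penalty > best[j]:
                best[j] = best[i] + fj[4] - gap_penalty
                prev[j] = i
    end = max(range(len(frags)), key=best.__getitem__)
    chain = []
    k = end
    while k is not None:
        chain.append(frags[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


if __name__ == "__main__":
    hits = [
        (100, 200, 0, 100, 50),
        (300, 400, 150, 250, 40),
        (500, 600, 50, 120, 30),  # out of order on the reference
    ]
    print(best_chain(hits))
```

This quadratic version is enough to show the principle; the third toy fragment is excluded because it breaks co-linearity on the reference, which is the situation created by rearranged or unrelated neighboring copies.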
As with TE discovery, hallmarks of the transposition process and structural features of various TE classes can influence their detection. For example, TSDs can be used to refine the precise location of TEs, as is implemented in TSDfinder [75] or SINEDR [76]. Detection of symmetrical structures such as TIRs or LTRs can be complicated if these internally repetitive regions are not identical in the reference sequence. If the two TIRs of the genomic copy are more similar to each other than to the appropriate TIR in the reference sequence, only one TIR of the reference (the most similar one) is used to detect the two genomic TIRs, but on different strands. Similarly, if the two LTRs of a reference TE are not identical, a genomic copy can be detected with two 5' LTRs (or 3' LTRs) if its LTRs are more similar to each other than to the appropriate LTRs of the reference sequence. These cases will disrupt defragmentation algorithms that (sensibly) require co-linearity of fragments on the same strand. RepeatMasker libraries avoid this problem by splitting LTR retrotransposons into constitutive LTRs and internal sequences. Methods such as ProcessRepeats, LTR_MINER or TCF are then required to connect LTRs to internal regions as a post-processing step. A simple solution that avoids these problems is to use reference sequences with identical TIRs or LTRs. Likewise, non-LTR retrotransposons pose unique challenges to TE detection because of their variable poly(A) tail length. One solution to this problem is to extend the poly(A) tail of non-LTR retrotransposons in the reference set to the length of the longest observed genomic copy. However, this can have the negative effect of causing spurious hits to homopolymeric repeats.
The problem of poly(A) tails reflects the more general problem in TE detection posed by the existence of simple repeats in TE reference sequences, which lead to many false positive TEs being identified in the genome [37, 64]. In general, TE and simple repeat detection should be considered inter-related processes, as is evident from the inclusion of simple repeat masking functionality in both Censor and RepeatMasker. However, the best way to filter simple repeats from either genomic or reference sequences without affecting the sensitivity of TE detection is not obvious and remains an open research question. RepeatMasker and Censor identify simple repeats by similarity searches against libraries of short k-mers, but are not optimized for the detection of subtly divergent simple repeat regions, i.e. those with some mismatches to the periodic unit. Specialized simple repeat detection software such as TRF [77] or mreps [78] may be more appropriate; however, certain trade-offs between sensitivity and specificity in TE detection will persist until the problem of simple repeat detection is solved.
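To make the notion of a divergent simple repeat concrete, the following Python sketch scores a sequence window by the fraction of positions that match the base one period downstream. It is an illustration only and does not reproduce the methods of TRF, mreps, RepeatMasker or Censor; the function names, the maximum period and the 0.8 threshold are hypothetical.

```python
from typing import Tuple


def periodicity(seq: str, max_period: int = 6) -> Tuple[int, float]:
    """Return (best_period, match_fraction) for a sequence window.

    match_fraction is the share of positions i where seq[i] == seq[i + k].
    A perfect repeat of unit length k scores 1.0; mismatches to the periodic
    unit lower the score without destroying the signal. Ties keep the
    smallest period. Returns (0, 0.0) for windows shorter than two bases.
    """
    best_k, best_frac = 0, 0.0
    for k in range(1, min(max_period, len(seq) - 1) + 1):
        pairs = len(seq) - k
        matches = sum(seq[i] == seq[i + k] for i in range(pairs))
        frac = matches / pairs
        if frac > best_frac:
            best_k, best_frac = k, frac
    return best_k, best_frac


def is_simple_repeat(seq: str, threshold: float = 0.8) -> bool:
    """Flag a window as a (possibly divergent) simple repeat.

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    return periodicity(seq)[1] >= threshold
```

Lowering the threshold captures more divergent repeats but also flags more non-repetitive sequence, which is the sensitivity and specificity trade-off discussed above.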
| CONCLUSIONS |
|---|
The growing awareness of the challenge of understanding TEs in genome sequences is beginning to be reflected in the increasing number of sophisticated methods for TE discovery and TE detection. The diversity of recent approaches and continued improvements in performance and usability indicate that the field of TE bioinformatics is still in a growth phase. Recent studies indicate that no single computational method for TE detection provides the most efficient or sensitive approach to TE annotation [37, 69]. Borrowing ideas from the field of gene annotation, Quesneville et al. [37] have shown that high-quality TE models can be produced by combining the results of multiple independent TE detection methods. Systems that integrate results from multiple TE discovery methods will undoubtedly also provide the most robust and meaningful reference sets of TE sequences. Intelligently reasoning over the results of sometimes conflicting methods will require new methods for combining multiple sources of evidence, such as those recently developed for gene finding [79]. The evaluation of such systems will be greatly accelerated by optimized parallelization of TE discovery/classification/detection workflows [80] and by enhanced visualization of the underlying computational evidence supporting newly discovered TE sequences [74, 81]. Perhaps equally important is the need for improvements in how TEs and TE-specific features (such as LTRs, TSDs and overlapping reading frames) are represented and stored in genome annotation file formats and databases, which, together with further improvements in discovery and detection software, should begin to unmask the true impact of TEs on genome biology.
| ACKNOWLEDGEMENTS |
|---|
We thank Thomas Bureau, Douglas Hoen and Nikoleta Juretic for helpful discussions and two anonymous reviewers for helpful comments on the manuscript.
| FOOTNOTES |
|---|
Casey M. Bergman received his PhD from the University of Chicago, did his postdoctoral research at the Berkeley Drosophila Genome Project and the University of Cambridge, and is currently a Lecturer in Bioinformatics at the University of Manchester (UK).
Hadi Quesneville received his PhD from the University Pierre et Marie Curie in Paris, did his postdoctoral research at Genethon and at INSERM, and is currently head of the laboratory Unité de Recherches en Génomique-Info in Evry (France).
Submitted: July 16, 2007. Accepted: September 17, 2007.
| REFERENCES |
|---|
1. Doolittle WF, Sapienza C. Selfish genes, the phenotype paradigm and genome evolution. Nature (1980) 284:601–3.
2. Orgel LE, Crick FH. Selfish DNA: the ultimate parasite. Nature (1980) 284:604–7.
3. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
4. Yu J, Hu S, Wang J, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science (2002) 296:79–92.
5. Pardue ML, Rashkova S, Casacuberta E, et al. Two retrotransposons maintain telomeres in Drosophila. Chromosome Res (2005) 13:443–53.
6. Gregory TR. Synergy between sequence and size in large-scale genomics. Nat Rev Genet (2005) 6:699–708.
7. Bennetzen JL. Transposable elements, gene creation and genome rearrangement in flowering plants. Curr Opin Genet Dev (2005) 15:621–7.
8. Medstrand P, van de Lagemaat LN, Dunn CA, et al. Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenet Genome Res (2005) 110:342–52.
9. Devine SE, Chissoe SL, Eby Y, et al. A transposon-based strategy for sequencing repetitive DNA in eukaryotic genomes. Genome Res (1997) 7:551–63.
10. Myers EW, Sutton GG, Delcher AL, et al. A whole-genome assembly of Drosophila. Science (2000) 287:2196–204.
11. Reese MG, Hartzell G, Harris NL, et al. Genome annotation assessment in Drosophila melanogaster. Genome Res (2000) 10:483–501.
12. Bray N, Dubchak I, Pachter L. AVID: a global alignment program. Genome Res (2003) 13:97–102.
13. Feschotte C, Pritham EJ. Computational analysis and paleogenomics of interspersed repeats in eukaryotes. In: Computational Genomics: Current Methods. Stojanovic N, ed. (2007) London: Taylor and Francis. 31–53.
14. Pavlicek A, Kohany O, Jurka J. Repeat mining: basic tools for detection and analysis. In: Analytical Tools for DNA, Genes and Genomes: Nuts and Bolts. Markoff A, ed. (2005) Eagleville: DNA Press. 131–60.
15. Feschotte C, Jiang N, Wessler SR. Plant transposable elements: where genetics meets genomics. Nat Rev Genet (2002) 3:329–41.
16. Kazazian HH Jr. Mobile elements: drivers of genome evolution. Science (2004) 303:1626–32.
17. Jurka J. Approaches to identification and analysis of interspersed repetitive DNA sequences. In: Automated DNA Sequencing and Analysis. Adams MD, Fields C, Venter JC, eds. (1994) San Diego: Academic Press, Inc. 294–8.
18. Jurka J. Repeats in genomic DNA: mining and meaning. Curr Opin Struct Biol (1998) 8:333–7.
19. Kapitonov VV, Pavlicek A, Jurka J. Anthology of human repetitive DNA. In: Encyclopedia of Molecular Cell Biology and Molecular Medicine. Meyers RA, ed. (2004) Wiley-VCH Verlag GmbH & Co. 251–305.
20. Pop M, Salzberg SL, Shumway M. Genome sequence assembly: algorithms and issues. IEEE Computer (2002) 35:47–54.
21. Li R, Ye J, Li S, et al. ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comput Biol (2005) 1:e43.
22. Kurtz S, Schleiermacher C. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics (1999) 15:426–7.
23. Delcher AL, Kasif S, Fleischmann RD, et al. Alignment of whole genomes. Nucleic Acids Res (1999) 27:2369–76.
24. Kurtz S, Choudhuri JV, Ohlebusch E, et al. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res (2001) 29:4633–42.
25. Delcher AL, Phillippy A, Carlton J, et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res (2002) 30:2478–83.
26. Kurtz S, Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes. Genome Biol (2004) 5:R12.
27. Volfovsky N, Haas BJ, Salzberg SL. A clustering method for repeat analysis in DNA sequences. Genome Biol (2001) 2:RESEARCH0027.
28. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol (1990) 215:403–10.
29. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–402.
30. Washington University BLAST Archives. (1 October 2007, date last accessed). http://blast.wustl.edu.
31. Ning Z, Cox AJ, Mullikin JC. SSAHA: a fast search method for large DNA databases. Genome Res (2001) 11:1725–9.
32. Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res (2002) 12:656–64.
33. Kalafus KJ, Jackson AR, Milosavljevic A. Pash: efficient genome-scale sequence anchoring by positional hashing. Genome Res (2004) 14:672–8.
34. Leung MY, Blaisdell BE, Burge C, et al. An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J Mol Biol (1991) 221:1367–78.
35. Bao Z, Eddy SR. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res (2002) 12:1269–76.
36. Quesneville H, Nouaud D, Anxolabehere D. Detection of new transposable element families in Drosophila melanogaster and Anopheles gambiae genomes. J Mol Evol (2003) 57(suppl. 1):S50–9.
37. Quesneville H, Bergman CM, Andrieu O, et al. Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol (2005) 1:e22.
38. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics (2005) 21(suppl. 1):i152–8.
39. Agarwal P, States DJ. The Repeat Pattern Toolkit (RPT): analyzing the structure and evolution of the C. elegans genome. Proc Int Conf Intell Syst Mol Biol (1994) 2:1–9.
40. Pevzner PA, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Res (2004) 14:1786–96.
41. Morgulis A, Gertz EM, Schaffer AA, et al. WindowMasker: window-based masker for sequenced genomes. Bioinformatics (2006) 22:134–41.
42. Healy J, Thomas EE, Schwartz JT, et al. Annotating large genomes with exact word matches. Genome Res (2003) 13:2306–15.
43. Campagna D, Romualdi C, Vitulo N, et al. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics (2005) 21:582–8.
44. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics (2005) 21(suppl. 1):i351–8.
45. Mao L, Wood TC, Yu Y, et al. Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res (2000) 10:982–90.
46. Biedler J, Tu Z. Non-LTR retrotransposons in the African malaria mosquito, Anopheles gambiae: unprecedented diversity and evidence of recent activity. Mol Biol Evol (2003) 20:1811–25.
47. McClure MA, Richardson HS, Clinton RA, et al. Automated characterization of potentially active retroid agents in the human genome. Genomics (2005) 85:512–23.
48. Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol (2000) 132:185–219.
49. Durbin R, Eddy SR, Krogh A, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1999) Cambridge: Cambridge University Press. 368.
50. Bateman A, Birney E, Cerruti L, et al. The Pfam protein families database. Nucleic Acids Res (2002) 30:276–80.
51. Berezikov E, Bucheton A, Busseau I. A search for reverse transcriptase-coding sequences reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster. Genome Biol (2000) 1:RESEARCH0012.
52. Rho M, Choi JH, Kim S, et al. De novo identification of LTR retrotransposons in eukaryotic genomes. BMC Genomics (2007) 8:90.
53. McCarthy EM, McDonald JF. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics (2003) 19:362–7.
54. Kalyanaraman A, Aluru S. Efficient algorithms and software for detection of full-length LTR retrotransposons. In: Proc IEEE Comput Syst Bioinform Conf (2005) 56–64.
55. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res (2007) 35:W265–8.
56. Morgante M, Policriti A, Vitacolonna N, et al. Structured motifs search. J Comput Biol (2005) 12:1065–82.
57. Tu Z. Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae. Proc Natl Acad Sci USA (2001) 98:1699–704.
58. Santiago N, Herraiz C, Goni JR, et al. Genome-wide analysis of the Emigrant family of MITEs of Arabidopsis thaliana. Mol Biol Evol (2002) 19:2285–93.
59. Yang G, Hall TC. MAK, a computational tool kit for automated MITE analysis. Nucleic Acids Res (2003) 31:3659–65.
60. Andrieu O, Fiston AS, Anxolabehere D, et al. Detection of transposable elements by their compositional bias. BMC Bioinformatics (2004) 5:94.
61. Caspi A, Pachter L. Identification of transposable elements using multiple alignments of related genomes. Genome Res (2006) 16:260–70.
62. Bennett EA, Coleman LE, Tsui C, et al. Natural genetic variation caused by transposable elements in humans. Genetics (2004) 168:933–51.
63. Jurka J, Kapitonov VV, Pavlicek A, et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res (2005) 110:462–7.
64. Kohany O, Gentles AJ, Hankus L, et al. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics (2006) 7:474.
65. Jurka J, Klonowski P, Dagman V, et al. CENSOR - a program for identification and elimination of repetitive elements from DNA sequences. Comput Chem (1996) 20:119–21.
66. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. (1996-2004) Institute for Systems Biology.
67. Bedell JA, Korf I, Gish W. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics (2000) 16:1040–1.
68. Paces J, Pavlicek A, Paces V. HERVd: database of human endogenous retroviruses. Nucleic Acids Res (2002) 30:205–6.
69. Juretic N, Bureau TE, Bruskiewich RM. Transposable element annotation of the rice genome. Bioinformatics (2004) 20:155–60.
70. Peterson-Burch BD, Nettleton D, Voytas DF. Genomic neighborhoods for Arabidopsis retrotransposons: a role for targeted integration in the distribution of the Metaviridae. Genome Biol (2004) 5:R78.
71. Smith CD, Edgar RC, Yandell MD, et al. Improved repeat identification and masking in Dipterans. Gene (2007) 389:1–9.
72. Pereira B. Insertion bias and purifying selection of retrotransposons in the Arabidopsis thaliana genome. Genome Biol (2004) 5:R79.
73. Giordano J, Ge Y, Gelfand Y, et al. Evolutionary history of mammalian transposons determined by genome-wide defragmentation. PLoS Comput Biol (2007) 3:e137.
74. Toth G, Deak G, Barta E, et al. PLOTREP: a web tool for defragmentation and visual analysis of dispersed genomic repeats. Nucleic Acids Res (2006) 34:W708–13.
75. Szak ST, Pickeral OK, Makalowski W, et al. Molecular archeology of L1 insertions in the human genome. Genome Biol (2002) 3:RESEARCH0052.
76. Tu Z, Li S, Mao C. The changing tails of a novel short interspersed element in Aedes aegypti: genomic evidence for slippage retrotransposition and the relationship between 3' tandem repeats and the poly(dA) tail. Genetics (2004) 168:2037–47.
77. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res (1999) 27:573–80.
78. Kolpakov R, Bana G, Kucherov G. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res (2003) 31:3672–8.
79. Allen JE, Salzberg SL. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics (2005) 21:3596–603.
80. Ranganathan N, Feschotte C, Levine D. Cluster and grid based classification of transposable elements in eukaryotic genomes. In: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06) (2006) 45.
81. Lewis SE, Searle SM, Harris N, et al. Apollo: a sequence annotation editor. Genome Biol (2002) 3:RESEARCH0082.