Skip Navigation


Briefings in Bioinformatics Advance Access originally published online on February 3, 2006
Briefings in Bioinformatics 2006 7(1):113-115; doi:10.1093/bib/bbk008
This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
7/1/113    most recent
bbk008v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Mullan, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mullan, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org

Pairwise sequence alignment—it's all about us!

Lisa Mullan

Lisa Mullan, EMBL-European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. E-mail: lisa{at}ebi.ac.uk

Pairwise alignment is one of the most fundamental tools of bioinformatics and underpins a variety of other, more sophisticated methods of annotation. Pairwise alignment in its most rigorous form uses a method called ‘dynamic programming’, which is highly accurate, but also incredibly costly to compute.

In order to align anything other than an exact alphabetic match, the algorithm has to know what it is looking for and how it can evaluate the worth of what it finds. To this end, ‘comparison matrices’ have been created which define a score for every possible match possibility—an effective tally of how well the computational alignment is doing. The software will search for the highest score available. The final score is relevant only with its resulting alignment and cannot be used outside this context.

In the case of DNA, comparison values are generated using a simple identity matrix of the type that allows one (positive) score for a correct match within the alignment, and a different (zero or negative) score for a mismatch. A fuller comparison matrix allowing ambiguities, alters these basic values as potential transitions and transversions are taken into account—but essentially there is very little mathematical difference that can be achieved between one alignment and another similar one.

Protein matrices, on the other hand, offer a greater breadth of calculation as not only are there five times as many common amino acid residues as there are DNA bases, they incorporate a significant amount of evolutionary information. The most common matrices here are the position accepted mutation (PAM) [1, 2] and BLOSUM [3] comparison tables. The first PAM matrix was created in the late 1970s and relied on noting accepted residue substitutions within protein sequences to produce the PAM 1 table. Subsequent tables in the PAM family have been created by multiplication models based on that first matrix. The greater the multiplication, the higher the number in the PAM series and the greater the number of accepted mutations which have been involved in the proteins used to create the tables, and thus the greater the evolutionary divergence of those proteins. One of the more common matrices—the PAM 250 matrix—represents a subset of proteins of approximately 80% diversity.

The BLOSUM matrices were created in the early 90s and relied on the presence of residues within the blocks of conserved regions of related proteins to create the matrix. These blocks can be accessed in the BLOCKS [4, 5] database. There is also a family of BLOSUM matrices which are differentiated with numbers. These numbers, however, represent the minimum percentage identity of the BLOCKS used to create the matrix. The most common matrix of this set is BLOSUM 62, a default setting for many protein alignment applications. It indicates that BLOCKS of at least 62% identity were used in the creation of this matrix. In the case of the BLOSUM matrices, the higher the number connected to the BLOSUM matrix, the smaller the evolutionary divergence between sequences.

Once the comparison matrix has been established, the computer can make its own matrix based on the two sequences to be aligned—inserting a ‘score’ for each potential base or residue alignment. However, to allow the computer to score each comparison and select the one match—or run of matches—that gives the highest score would not necessarily yield the best alignment as it ignores biological insertions and deletions. In order to accommodate an insertion, gaps must be tolerated and as with the bases or residues, they must contribute a numerical value to the final score for the alignment. In order to allow the computer to create gaps in a biologically significant manner, the value contributed is a negative one—thus introducing a gap into an alignment is penalised. An initial gap is generally penalised by using the ‘gap open penalty’ and forms a relatively high cost to the program. Any subsequent gaps belonging to that particular insertion event are penalised at a much lower cost by ‘gap extension penalties’. Such a scoring system, known as ‘affine gap penalties’ encourages longer gaps in the alignment instead of lots of smaller, more random gaps and results in a more biologically significant alignment.

Rigorous alignment software uses values for base or residue comparison, along with the penalties for gaps to create its own matrix. Each position in the matrix is occupied by the highest score possible between pairs of bases or residues based on the alignment up to that point. It can allow for gaps in the alignment and in such cases, points are deducted from each score to compensate for introducing them. Once the matrix has been filled, the computer can trace back the best alignment following the route of the highest values—the sum of which is returned with the alignment. In order to create the matrix, the most rigorous programs need memory capacity proportional to the multiple of the lengths of the two sequences to be aligned. This is approximately 1.6 KB for two sequences of 100 bases or residues in length, rising to 16 MB for sequences of 1000 bases or residues in length.

Depending on the requirements of the user, an alignment may either be local or global. Local alignments match only conserved regions within two given sequences and discard any unmatched portions so they do not appear in the final result. In computational terms, once the score goes down to such a point that no further alignment will increase it, the match is terminated and the matched portion brought back as a result (hit). Flexible local alignment programs will bring back several matches in a sequence, regardless of their topology. The most rigorous algorithm for local alignments is Smith–Waterman [6].

Global alignment programs are designed to match sequences end-to-end and will search for the best score possible. As they cannot discard any unmatching regions of sequence, it is best to use global alignment programs to align sequences of approximately the same length—for example orthologous genes or proteins. The most rigorous algorithm for local alignments is Needleman–Wunsch [7], which should be used for short, divergent sequences.

All programs designed to offer pairwise alignment will bring back a match, however bad, that represents the highest score possible. For proteins, if the subsequence alignment displays 40% identity, sequences are generally expected to be homologous—i.e. evolutionarily related. Should this identity fall below 20%, statistics suggest there can be no definition of homology based simply on alignment. The ‘twilight zone’ indicates the region in-between these two matches where identity of the alignment is not deemed to be statistically significant enough to indicate any homology and a relationship must be demonstrated using other methods.

Other, faster, alignment methods still use dynamic programming, but have cut corners in order to both speed up the alignment and reduce compute power. These are generally slightly less accurate than the most rigorous implementations, and should be used on sequences that are relatively closely related, or much longer sequences. These include pairwise alignment matches such as LAlign or, in more extreme cases, sequence search software such as BLAST or FastA (not covered in this article). In addition, multiple sequence alignment options generally rely on initial pairwise alignment before producing a multiple match. Although extensions to the Smith–Waterman [10] and Needleman–Wunsch [11] algorithms can and have been made to simultaneously align three sequences, the compute power required is extraordinary and currently multiple alignment methods using other means offer a good solution at a much lower computational cost.

Table 1 lists some alignment options available through a web interface. Hits indicate the number of alignments that can potentially be generated by the program.


View this table:
[in this window]
[in a new window]
 
Table 1: Pairwise alignment programs

 

Key Points

  • Local alignments match similar regions and discard any unmatching regions, regardless of topology.
  • Global alignments look for matches across the entire length of the sequences.
  • More accurate alignments use more compute power and should not be used as default for extremely large sequences.

 

FOOTNOTES

Lisa Mullan is a Scientific Training Officer at the European Bioinformatics Institute, which she joined in 2004 to coordinate their user training program. Previously she worked as training coordinator at the late Rosalind Franklin Centre for Genome Research (formerly HGMP-RC). She currently also teaches introductory level bioinformatics on courses throughout the world.

Submitted: November 30, 2005. Received (in revised form): December 15, 2005.

References

  1. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In Dayhoff MO (Ed.). Atlas of Protein Sequence and Structure Washington, DC: National Biomedical Research Foundation 1978; 5:345–52.
  2. Dayhoff MO, Schwartz RM, Orcutt BC. Matrices for detecting distant relationships. In Dayhoff MO (Ed.). Atlas of Protein sequence and Structure Washington, DC: National Biomedical Research Foundation 1978; 5:353–8.
  3. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992; 89:10915–19.[Abstract/Free Full Text]
  4. Henikoff JG, Greene EA, Pietrokovski S, et al. Increased coverage of protein families with the blocks database servers. Nucl Acids Res 2000; 28:228–30.[Abstract/Free Full Text]
  5. Henikoff S, Henikoff JG, Pietrokovski S. Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics 1999; 15:471–9.[Abstract/Free Full Text]
  6. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981; 147:195–7.[CrossRef][Web of Science][Medline]
  7. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970; 48:443–53.[CrossRef][Web of Science][Medline]
  8. Huang X, Miller W. A time-efficient, linear-space local similarity algorithm. Adv Appl Math 1991; 12:337–57.[CrossRef]
  9. Myers E, Miller W. Optimal alignments in linear space. CABIOS 1988; 4:11–17.
  10. Waterman MS, Smith TF, Beyer WA. Some biological sequence metrics. Adv Math 1976; 20:367–87.[CrossRef]
  11. Murata M, Richardson JS, Sussman JL. Simultaneous comparison of three protein sequences. Proc Nat Acad Sci 1985; 82:3073–3077.[Abstract/Free Full Text]

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
7/1/113    most recent
bbk008v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Mullan, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mullan, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?