Briefings in Bioinformatics Advance Access originally published online on February 3, 2006
Briefings in Bioinformatics 2006 7(1):2-24; doi:10.1093/bib/bbk001
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Statistical significance in biological sequence analysis
Corresponding author. Mark Borodovsky, School of Biology, Georgia Institute of Technology, Atlanta, GA 303320230, USA. Tel: +1 404 894 8432; Fax: +1 404 894 0519;mark.borodovsky{at}biology.gatech.edu
| ABSTRACT |
|---|
One of the major goals of computational sequence analysis is to find sequence similarities, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Since the degree of similarity is usually assessed by the sequence alignment score, it is necessary to know if a score is high enough to indicate a biologically interesting alignment. A powerful approach to defining score cutoffs is based on the evaluation of the statistical significance of alignments. The statistical significance of an alignment score is frequently assessed by its P-value, which is the probability that this score or a higher one can occur simply by chance, given the probabilistic models for the sequences. In this review we discuss the general role of P-value estimation in sequence analysis, and give a description of theoretical methods and computational approaches to the estimation of statistical signifiance for important classes of sequence analysis problems. In particular, we concentrate on the P-value estimation techniques for single sequence studies (both score-based and score-free), global and local pairwise sequence alignments, multiple alignments, sequence-to-profile alignments and alignments built with hidden Markov models. We anticipate that the review will be useful both to researchers professionally working in bioinformatics as well as to biomedical scientists interested in using contemporary methods of DNA and protein sequence analysis.
Keywords: sequence analysis, pairwise alignment, multiple alignment, profile, probabilistic model, statistical significance, P-value, E-value
| INTRODUCTION |
|---|
Computational methods of biological sequence analysis have become an indispensable part of the modern scientist's research arsenal. In protein studies, the results of sequence similarity searches in databases help generate reasonable hypotheses concerning structural and functional properties of proteins, as well as their evolutionary relationships. On the DNA level, sequence analysis techniques make it possible to identify genes and functional elements in newly sequenced genomes. The abundance of biomolecular sequence information (generated as a result of the ever-increasing number of large-scale sequencing projects), together with a relatively high cost of wet lab experimentation, calls for powerful and efficient computational tools as primary means for high-throughput genomic proteomic investigations. Since sequence comparison and motif analysis methods can be used to predict proteinprotein interactions [1, 2] and interactions in transcriptional regularity networks [3], computational sequence analysis also provides a basis for the rapidly developing field of systems biology.
It thus comes as no surprise that searching a biosequence database is probably the first thing a biologist would do once a biological sequence of interest is known. However, the evolution of DNA and proteins in living organisms is influenced by a number of random factors, and observed amino acid or nucleotide patterns may result from the action of these factors, rather than from the selective pressure maintaining a certain function. Therefore, not all database sequences which resemble the query sequence may be its true homologues. Naturally enough, the biologist is interested in finding biologically relevant targets, namely, those sharing important structural and functional characteristics with the query. The question arises how likely it is that a particular observation of high sequence similarity, a hit, has occurred by chance. This is the starting point for exploring the role of statistical significance in computational analysis of biological sequences.
The strength of an alignment is usually determined by its score, and the statistical significance of the score is assessed by the P-value. The term P-value of an alignment designates the probability of an alignment with this score or higher occurring by chance alone. The exact meaning of chance depends on the method of modeling randomness in the P-value estimation procedures. Originally, works on statistical significance analysis [48] considered randomly shuffled sequences with preserved compositional properties. Alternatively, sequences can be modeled as sample paths of stochastic models whose parameters are derived from the observed compositional statistics of the real sequences. Such models have the advantages of being analytically tractable in simple settings. This tractability leads to explicit formulas connecting the score distribution to the nucleotide or amino acid composition of the sequence and the parameters of the scoring system, thus simplifying and accelerating the assessment of statistical significance in many practically important situations.
The general fact is that the procedure for determining the P-value depends on the sequence analysis problem being solved. Therefore, different types of alignments (global versus local, pairwise versus multiple) require the development of the corresponding P-value estimation methods and algorithms. Our goal is to outline some of the important accomplishments in this area. After a broad discussion of the notion of the P-value in the context of sequence analysis, we give a description of problems that require a stochastic model for only one sequence. In a sense, this is the simplest class of sequence analysis problems, but it can be viewed as an example illustrating the general difficulties related to the P-value estimation. We see that some important results on score distributions obtained for this class can be carried over to the more intricate case of two sequences, i.e. to pairwise alignments. The framework developed for single- and two-sequence methods may be extended further to multiple alignments and sequence-to-profile, as well as to HMM-based alignments.
Although statistically significant patterns are supposed to attract the biologist's attention, it is necessary to realize that statistical significance is not equivalent to biological significance [911]. The ultimate way to establish the latter is by experimental studies. The primary purpose of the statistical significance estimation is to highlight the targets for experiments with potentially interesting results. Whether such highlights are (on average) indeed worth further study, depends on the quality of the P-value estimation method.
| THE USE OF P-VALUES IN SEQUENCE COMPARISONS |
|---|
Definitions and basic methods
The notion of a P-value originates in the general statistical methodology of hypotheses testing [12]. Suppose we have a null and an alternative hypotheses, both of which can be used to explain the data produced by an experiment. The hypotheses are mutually exclusive, and we wish to determine which one holds true. Based on a statistical test applied to the data, the null hypothesis can be either accepted or rejected in favor of the alternative hypothesis. To formalize the acceptance/rejection procedure, we introduce the P-value and the significance level (
-value) of the test. The P-value is defined as the probability of seeing a value of the test statistic at least as extreme as the observed value, assuming that the null hypothesis is true. If the null hypothesis is rejected when the observed statistic has a particular P-value, then the probability of making an error, rejecting the null hypothesis when in fact it is true (type I error), is equal to the P-value. The predefined level of the type I error selected by the researcher is called the
-value (say,
= 0.01). If the P-value is smaller than the significance level, then the null hypothesis is rejected and the result is said to be significant (at the given significance level). It is easy to see that the P-value is actually all we need; it can be interpreted as the smallest significance level at which we can reject the null hypothesis.
For sequence alignments, the test statistic is usually the alignment score, and the null hypothesis is that the aligned sequences are unrelated. The alternative hypothesis is that the sequences are biologically related; this relatedness often means homology, i.e. having a common ancestor. In this context, the P-value of a score can be viewed as the probability that a prediction of biological relationship with this score or higher is a false positive. The efficiency of score-based binary classification methods is often judged by their sensitivity (Sn) and specificity (Sp). The Sn and Sp are defined as follows:
|
|
To calculate the P-value for a sequence alignment, we need to rigorously define the null hypothesis by specifying a probabilistic model (the null model) in which the aligned sequences are independent, and estimate the probability that this model generates a score at least as extreme as the observed score. Note that, since usually we are interested in high scores, the score-based tests is a one-sided statistical test, and we should concentrate on the behavior of the right tail of the score distribution. In score-free analyses, such as studies of word occurrences in single or multiple sequences or string matching, the test statistic may be the count of a given word in a sequence, or the distance between words, etc. The P-value is again the probability of this statistic taking a value at least as large as actually observed under the assumption that the null hypothesis is true, i.e. by pure chance.
As it was already mentioned, it is possible to implement randomness in the null model either by applying shuffling algorithms or by defining stochastic processes. Yet another approach is to select biomolecular sequences at random from available databases [15, 16]. The motivation for the latter approach is that real sequences may give a more biologically adequate null model, compared to mathematical abstractions or ad hoc algorithms. Shuffling methods also use real sequences as raw material, and are oriented toward preserving certain properties of real sequences. For example, the permutation technique devised by Altschul and Erickson [4] can preserve dinucleotide and trinucleotide composition in nucleotide sequences. However, shuffling methods have been criticized for introducing significant bias akin to overfitting in the case of short- and medium-length sequences, and for failing to reflect the natural processes of random nucleotide and amino acid replacement: evolution does not scramble sequence [17]. In the study carried out by Pearson [15], the results of using database-derived effectively random sets of unrelated sequences for the null model parameter estimation were encouraging, however, the null model itself was derived based on analytic properties of probabilistic sequence models. Waterman and Vingron [16] found that, for protein sequences, the estimation results using sequences retrieved from databases are in good agreement with those using random sequence models with independent residues. In general, the use of database sequences for parameter estimation and benchmarking is a promising approach [16, 18].
Having chosen a null model, it is important to have an efficient method for calculating the P-value. In the absence of analytical results, P-values are usually estimated by simulations; in the case of database-derived sequences, a part of simulation is the process of random selection of the sequences. Suppose we would like to estimate the P-value of the value s of the test statistic, e.g. the alignment score. The most straightforward way is to run N simulations for the null model, and calculate the test statistic for each run. If the scores observed in M runs are greater than or equal to s, then P(s) can be estimated by
|
|
A more insightful way to estimate P-values is to use the (approximate) knowledge of the distribution of the test statistic. If the statistic is continuously distributed, and its probability density function (p.d.f.), f(x), is known, then we can apply the formula
|
|
|
|
Instead of using the text statistic itself, researchers frequently use the corresponding Z-score, which is defined by
|
|
is the standard deviation. Usually, empirical estimates of the mean and standard deviation are used in the Z-score definition instead of the exact values [22]. A general result concerning the sequence alignment Z-scores was obtained by Bastien [23], who suggested a theoretical justification to the empirical fact that Z > 8 usually corresponds to significant alignments. This result is distribution-free, and follows directly from Chebyshev's inequality (which has been applied in sequence analysis earlier [10, 24]). Note, however, that Chebyshev's inequality is known not to give tight bounds (due to its generality), so this approach may underestimate the P-value and fail to detect a large percentage of significant results. Although obtaining P-value estimates of the form (1) or building empirical distribution is always an option, such estimates may fail to give the correct answer for small P-values [10, 11, 25]. Furthermore, the applicability of the P-values obtained by simulation is frequently restricted to the concrete parameter values for which they were derived [18]. In addition, obtaining good estimates by simulations may be very time consuming [11, 18, 26]. Thus, analytic expressions for the distributions are always desirable. They are especially useful when they give explicit dependence of the P-value on some of the parameters, so that this dependence does not have to be estimated by simulations. Moreover, in the case of local gapped alignments analytic results for a simple special case allowed to guess an accurate formula for the alignment score distribution. While simulations are needed to estimate those of the distribution parameters which depend on the sequence composition and the scoring system, the formula gives explicit dependence of the P-value on the lengths of aligned sequences, which makes its use quite efficient in practice [16, 18, 27].
Null models
Analytic approximations can only be derived for precisely specified probabilistic models. The question is what probabilistic models are adequate for DNA and protein sequences. For DNA, typical models are homogeneous Markov chains of order m
0; the case m = 0 accounts for sequences of i.i.d. random variables [7, 28, 29]. Evidence has been obtained that, for protein-coding regions of DNA, non-homogeneous three-matrix Markov chain models of order m
1 should be used, while noncoding regions can be modeled by homogeneous Markov chains [3033]. Markov chain models are built under the assumption that the genome is homogeneous. If the nucleotide composition of a genome varies substantially along its length, it is possible to parse DNA [34] and build Markov models for separate homogeneous segments. It should also be noted that successful modern gene finding methods use hidden Markov or hidden semi-Markov models to represent DNA sequences [3537]. These models may be best for describing inhomogeneous DNA sequences.
It is apparent that, in general, we should expect some dependence of the P-values on the selected model. However, it has been found that the expression for the mean length of the longest (possibly interrupted) match for two random sequences with i.i.d. letters (nucleotides) can be fitted well to the distribution of the scores of the optimal local (possibly gapped) alignments of unrelated sequences; the variance in these two cases also behaves in a similar way (does not depend on the sequence length) [10, 24]. These findings have been regarded as a justification for the claim that independence models are likely to give accurate enough approximations for the distributions of test statistics in sequence comparisons [11]. For protein sequences, independence models (sequences of i.i.d. random variables) are usually employed [16, 18, 27, 38]; Robinson and Robinson [39] give amino acid frequencies which are considered typical [14, 40]. Numerical experiments show that, for protein sequences, using Markov models instead of independence models gives a relatively small enhancement in accuracy [16, 18, 41]. The study of Goldstein and Waterman [42] also shows that independence models for protein sequences may produce distributions close to those for real sequences. Yet Karlin and Altschul [43] note that independence models should be used only as a reference standard to prove that scores of certain alignments can occur by chance alone. Mott [14] points out that protein sequences consist of segments which differ in composition, and that accurate estimates of statistical significance should take this structure into account. However, the work of Schäffer et al. [40] on improving the accuracy PSI-BLAST demonstrated that, while it is important to consider the actual amino acid composition (which may be different from typical), adjustments made to cope with local variations in amino acid composition may not increase the accuracy of the database searches.
In the DNA sequence matching studies by Arratia et al. [44], the independence models worked rather well (see also Reich et al. [17]). Yet it is known that simple independence models should not be used for describing low complexity and GC-biased regions in DNA sequences [13, 45]. In the studies of transcription factor binding sites, background models in the form of Markov chains are generally considered as more advanced compared to independence models [46]. Thus, the degree of influence of the choice of a probabilistic sequence model on the quality of the statistical significance analysis appears to depend on the specific problem being solved. We shall return to the subject of null model selection when describing specific sequence analysis problems.
| SINGLE SEQUENCE ANALYSIS |
|---|
Occurrence of words in random sequences
Many functional elements in the genome, e.g. DNA restriction sites [42] and transcription factor binding sites [46], may be represented as words, i.e. as strings of letters from an alphabet. The knowledge of probabilistic properties of words in random texts is necessary for the estimation of statistical significance of patterns observed in DNA sequences. In particular, distributional properties of words are important in DNA sequencing by hybridization [28, 47], in the analysis of functional DNA regions [4850] and in alignment-free comparisons of biological sequences [51].
The abundance of analytic results for Markov chain models makes such models a convenient tool in the studies of statistical properties of words in DNA sequences. Many of the probabilistic properties of words in texts generated by a stationary homogeneous Markov chain of order m
0 have been surveyed by Reinert, Schbath and Waterman [28] and Schbath [52]; see also Robin and Schbath [53]. In these works, approximate as well as exact results on the statistical properties of counts of sell-overlapping and non-overlapping words have been discussed. These results allow to assess the probabilities of events related to different word occurrences, thus providing P-value estimates for such events. Statistics for both exact and degenerate words, like TAT(A or G)A, have been considered. Word count distributions, distributions of the length of the gaps between words, and occurrences of multiple patterns were analyzed. In asymptotic cases, normal, Poisson and compound Poisson approximations have been derived for word counts; the type of approximation depends on the word length and on the method of counting word occurrences. In particular, if n is the length of the sequence, then, for large n, the distribution of counts of a word can be approximated by the normal distribution; this approximation is good when the length of the word is relatively small compared to the sequence length, and works for both self-overlapping and non-overlapping words. The normal approximation can also be extended to the case of the inhomogeneous three-matrix Markov model [28]. For the distributions of counts of words for which the expected count is rather small (words with length O(ln n)), Poisson approximation is more appropriate. For periodic words (words consisting of repeated patterns) Poisson approximation is not satisfactory because of possible overlaps, and compound Poisson approximation should be used. Not only limiting distributions, but also bounds on the speed of convergence have been provided for the approximations. Similar results are available for the distributions of joint counts and joint occurrences of multiple patterns. While the Gaussian and Poisson laws are used to approximate the probability that a given word occurs more than a certain number of times, the probability that a given word frequency deviates from its expected value can be approximated using large deviations techniques. The distribution of the length of sequence intervals between two occurrences of a word can (under some conditions) also be approximated by the Poisson law.
The word count statistics could be immediately used in word representation analysis [53]. Its main principle is as follows. Let N(w) be the random variable designating the number of occurrences of word w in a sequence described by a given probabilistic model; N(w) is our test statistic. Denote by Nobs(w) the number of occurrences of w observed in the real sequence. Then the P-value for the observation is defined by
|
|
Score statistics for random sequences
If a sequence of letters represents the primary structure of a biological macromolecule, certain biologically relevant properties of the monomers (letters) can be described numerically by scores, the numbers associated with each letter. For the letters from an alphabet
, a scoring system is defined by the set
, where score si corresponds to the letter ai. For example, the score based on amino acid charge can be defined as follows [43]: s = + 2 for the positively charged amino acids lysine and arginine; for the negatively charged amino acids aspartate and glutamate, s = 2; for histidines, s = 0.04; for other amino acids, s = 1. Other examples of scores include scores derived from target frequencies and scores based on structural alphabets [43, 54, 55]. The aggregate score for a (contiguous) sequence segment is defined as the sum of the scores of constituting letters. Of interest are high-scoring segments, which with an appropriate choice of scoring system correspond to structurally and functionally interesting regions of a biological molecule.
A natural objective of the estimation of statistical significance for scores is to assess the probability of scores higher than a given score for random sequences. The general simulation-based approach to this problem was described above, and here we concentrate on the theoretical estimates for the P-values. One of the most important result in this direction is the theorem of IglehartKarlinDemboKawabata [55, 56], brought into a biological context by Karlin and Altschul [43]. The theorem concerns with maximal scoring segments (MSSs) in sequences of i.i.d. letters with letter probability distribution ( p1, ... , pq). An MSS is a sequence segment with maximal aggregate score. The two main assumptions on the scores are that: (i) at least one of the scores si is positive and (ii) the mean score,
, is negative (the case
can also be analyzed [43, 55]). The assumption (ii) is to prevent the MSS from extending up to the whole sequence. The conditions (i) and (ii) are naturally satisfied in many situations [43]. Let n be the sequence length, and s(n) be the score of an MSS. Then, for any real number x and for large n,
|
|
* are positive constants depending on the probabilities pi and scores si. In particular,
* is the unique positive root of the equation |
|
* in (3) is the asymptotics for the maximal segment score; to account for the growth of s(n) with n, we need to do the substraction in the left-hand side of (3). Setting s = ln n/
* + x, we can write (3) in the equivalent form: |
|
*s(n) ln(K*n) can be approximated by the standard Gumbel distribution. A more rigorous form of (3) is the inequality
|
|
* from the geometrically converging series for K * (Karlin and Altschul [43]; Karlin, Dembo and Kawabata [55]).
The formula (3) gives estimates for the statistical significance of the score x of an MSS: if the right-hand side equals p, then the score x is significant at the level
p. This formula can be derived given that the following statement is true [43, 55]: for large n, the distribution of the number, m, of nonintersecting segments with score exceeding ln n/
* + x is approximately Poisson with parameter
. Indeed, the probability that no segment score exceeds ln n/
* + x is calculated by setting m = 0, and is equal to
. Subtracting this from 1, we obtain (3). This Poisson approximation can be used to assess the statistical significance of several occurrences of high-scoring segments (i.e. distinct segments scoring higher than a given threshold) in a sequence and judge about their possible overrepresentation. However, while the utility of (3) and the Poisson approximation is obvious, no theoretical bounds on the accuracy of the approximation are available. Thus it is unknown of how large n should be to guarantee a good approximation. Nevertheless, Karlin and Altschul [43] state that, in practical situations, n
150 is large enough.
The theory described above concerns with the i.i.d. models of sequences, but the analogue of (3) can be obtained for Markov chain models of the sequence, as was shown by Karlin and Dembo [57]. In that work, the sequence was modeled by a Markov chain with a positive transition probability matrix (the results are extendable to irreducible aperiodic case). For each consecutive pair of letters in the sequence, the scores were represented by random variables associated with the Markov chain transitions, so that the sequence of scores formed a hidden Markov model of a special type. The analogues of the conditions (i) and (ii) were introduced, and formulas similar to (3) and (4) were derived for the distribution of the MSS score.
Sometimes it may be necessary to analyze not only the MSS, but also the second-, third-, ... , rth highest-scoring segments. Such a necessity arises when, in a biomolecule, there exist several high-scoring domains with a common property of biological interest. In this case, the search for the MSS only will ignore other important information. Let s1, ... , sr be the scores of the r distinct highest-scoring segments. Write
; this normalization is introduced for convenience. For large n, the joint distribution for s1, ... , sr is well approximated by the p.d.f.
|
|
|
|
|
|
| STATISTICAL SIGNIFICANCE OF PAIRWISE ALIGNMENTS |
|---|
Global alignments
Two sequences can be aligned in many different ways. Optimal alignments, those with the best scores, are of great practical interest. Global optimal alignments are optimized along the whole length of the two sequences. The dynamic programming algorithm for constructing the optimal global pairwise alignment was proposed by Needleman and Wunsch [60]. A more efficient version was later devised by Gotoh [61].
So far, the statistical significance of global alignment scores has been assessed via simulations; no theoretical results on the score distribution are known. Reich et al. [17] investigated score distributions for the DNA global alignments. Their scoring system was defined by three numbers: a match score (equal to +1), a mismatch score (equal to 0) and a linear gap penalty. The work was centered around the case of zero gap penalties. Using independence models for DNA sequences, the authors found that the score distribution for alignments of such sequences is quite close to that for alignments of DNA sequences randomly retrieved from a database. The authors used the Z-score as the statistical measure for the extremity of the scores. As is well known, if the score s is normally distributed, then Z-scores such that Z > 3 are highly significant; however, in the case of massive screening in databases, Z-score cutoffs Z = 5 or Z = 6 should be used [17]. Reich et al. argued that, although in reality the score distribution may display significant deviations from normal, the approach to P-value estimation based on the use of Z-scores may work well in practice. For sequences of equal length and uniform four-letter distribution, aligned with zero gap penalties, Reich et al. provided empirical formulas for the dependence of
and
on the sequence length. Corrections to these formulas for the cases of non-zero gap penalties, non-uniform letter distributions and non-equal sequence lengths have also been discussed.
Even if the empirical rule that Z
6 corresponds to significant results holds in some cases, it should be used with caution in the absence of knowledge about the tail behavior of the score distribution. An example of another possible approach to assessing P-values for global optimal alignments is the estimation of estimation of statistical significance for evolutionary distances between sequences (derived from global alignments) carried out by Altschul and Erickson [4]. In that work, the authors devised a special shuffling procedure (mentioned in the section of the use of P-values in sequence comparisons), and proposed to use it for P-value estimation. Since computations showed that the distribution is slightly non-Gaussian, the authors preferred to make no assumptions concerning the tail behavior, and used an expression similar to (1) for P-value estimation.
The above-cited works considered global alignments of DNA sequences. Webber and Barton [62] investigated the Z-score distribution for global protein alignments with length-independent gap penalties (the affine gap penalty with gap extension coefficient equal to zero). The protein sequences to be aligned were selected from existing databases and modified by shuffling. Having tried different distributions (including the normal distribution and the extreme value distribution) for fitting the data, the authors came to the conclusion that the best fit was given by the gamma distribution with density defined by
|
|
x + ß <
,
> 0 is a shape parameter,
> 0 is a scale parameter, and
(·) is the gamma function. These parameters account for the length variability for the studied proteins. The authors stated that the choice of this distribution was justified by the quality of fit. In the paper, the authors provided the values of
, ß, and
for several frequently used scoring matrices (PAM [63], BLOSUM [64], and Gonnet [65]) and length-independent gap penalties. Using the formula (8), the P-value of a given Z-score can be computed via (2). Webber and Barton provided P-values for Z-scores ranging from 0 to 11.
Local alignments
In many practical situations we are interested in the highest-scoring alignment between the subsequences of two given sequences, the best local alignment. The local alignment is called optimal if (a) the aligned subsequences form a high-scoring segment pair (HSP) (with score that cannot be increased by extending or shortening the aligned subsequences) and (b) all other HSPs have the same or smaller alignment score. The dynamic programming algorithm for finding optimal local alignment was proposed by Smith and Waterman [66]. Gotoh [61] developed a faster version for the case of affine gap scores.
The problem of P-value estimation for local alignments has been intensively studied. We first describe the theoretical results available for simplified cases, and then move on to a description of different corrections and modifications necessary for the P-value estimates to work in practically important situations. Consider a local ungapped optimal alignment with score matrix S = (sij). Such an alignment can be obtained by applying the SmithWaterman algorithm with sufficiently large gap costs. The score sij is the score of aligning the letter i in one sequence with the letter j in the other sequence. The fundamental result in the P-value estimation for such alignments is given by the extension of the single sequence formula (3) to the case of two sequences. The possibility of such an extension was stated by Karlin and Altschul [43], and the corresponding approximation is known as the KarlinAltschul statistic. Consider sequences of letters from the alphabet
. For two independence sequence models with letter distributions ( p1, ... , pq) and
, respectively, we assume that: (i)
for some i, j and (ii)
. Assume also that the letter distributions ( pi) and
are not too dissimilar, and that the sequence lengths n and n' grow at roughly equal rates. In this case, the two-sequence analogue of (3) is
|
|
|
|
u),
u is the unique positive root of the equation
|
|
u than Ku, because of the double-exponential dependence on
u in (9). If x is large, then, using (9) with the expression |
|
In general, condition (ii) needs to be checked for a given scoring system and sequence compositions, but there is a special case when (ii) is satisfied automatically. This is the case when sij are log-odds scores, i.e. by definition
|
|
u = 1 for log-odds scores.
The expression (9) was rigorously proved by Dembo, Karlin and Zeitouni [69] for the special case n = n', under an additional assumption on the sequence compositions and the scoring system (see also Karlin [59]). This assumption is satisfied if
, sij = sji for all i, and sij are not of the form s(i) + s( j). In fact, they proved that the number of distinct HSPs with score larger than ln n2/
u + x has Poisson distribution with mean Ku exp(
ux), which implies (9) if n = n'. In the general case, this Poisson approximation and the expression (9) can be considered intuitive. Yet (9) has been shown to work quite well in practice [16, 17]. The formula (6) for the score distribution of the rth best HSP could also be adapted to the case of local ungapped pairwise alignments: it describes the p.d.f. of the statistic
, where
is the greatest value attainable by the sum of the normalized scores from r distinct and consistently ordered segment pairs [58]. The approximation (7) in the case of pairwise alignments becomes
|
|
The knowledge of the behavior of the score distribution for ungapped alignments provides insights into what to expect of gapped alignments. In this more general situation, frequently used approach has been to fit the score distribution via an extreme value distribution of the form (9), where the constants Kg and
g (the gapped analogues of Ku and
u) are to be estimated from computations with random sequences or with sequences randomly retrieved from databases [15, 16, 18, 70, 71]. In general, the obtained approximations are quite accurate. Altschul and Gish [27] came to the conclusion that the statistical theory for ungapped alignments carries over essentially unchanged to gapped alignments. The currently available theoretical results support the assumption that such approximations are appropriate. In particular, a theorem of Siegmund and Yakir [72, 73] gives the following result. Consder two independence sequences of lengths n and n', and suppose that the given scoring system (the set of matching scores) has a rather general form (the scores occurring with positive probabilities cannot be represented as a sequence of increasing numbers with constant increment). These sequences are locally aligned with gaps, using affine gap penalties. Suppose also that, for the gap opening penalty,
(s), we have the formula
u
(s) = ln(
us) + C, where s is a real number representing the alignment score, and C is some constant. If, in addition, nn' exp(
us) converges to a finite, positie limit as s, n, n'
, then, for large n, n', s,
|
|
Approximations by parametric distributions of the form (9) work not for all scoring systems, gap penalties and sequence compositions. It can be shown that, in the case of affine gap costs and symmetric scores, there are two types of regions for scoring parameters: linear and logarithmic [16, 18, 74]. In the logarithmic region, the growth of the expected local alignment score is proportional to the logarithm of the product of the sequence lengths; in the linear region, the expected local alignment score grows proportionally to the sequence lengths. This result holds both for independence and Markov sequence models [74]. The precise definitions of the logarithmic and linear regions are as follows. Let sg(n2) be the optimal global alignment score of two random sequences of length n, and Esg(n2) be its expected value. It can be shown that the limit
exists. If
< 0, then the local alignment with this scoring system is in the logarithmic region; if
> 0, then it is in the linear region [74]. The requirement that the local alignments be in the logarithmic region is the extension of condition (ii) for ungapped alignments to alignments with gaps. The parallel with the ungapped case becomes obvious if we remember that in that case the expected score is ln(nn')/
u, which exactly corresponds to our description of the logarithmic region. In the linear region, the penalty for poorly aligned letters and indels is too low, so the limiting expected global alignment score per aligned letter pair is positive. In this case, high-scoring local alignments may have regions of poorly aligned letters. Consequently, using local alignments with scores falling in the linear region is not productive in biological sequence analysis.
Waterman and Vingron [16, 18] showed that (9) can be extended to gapped local alignments in the logarithmic region, but not in the linear region. They used Poisson clumping heuristic to model the distinct HSPs. The basic idea of the approximation is that the number of nonintersecting segment pairs with score greater than or equal to s is approximately Poisson distributed with mean
, where n, n' are the sequence lengths, and Kg,
g are the parameters to be estimated from simulations (notice the similarity with the Poisson approximation in the single sequence case). In parallel to the single sequence case, this Poisson approximation also yields the distribution of the rth best alignment score sr (see Waterman and Vingron [16]):
|
|
Another study of the quality of approximation of the score distribution for gapped alignments by (9) was undertaken by Altschul and Gish [27], who estimated Kg,
g via the method of moments for independence sequence models. One of the important facts that they discovered was that (6) gave a good approximation for the sum of normalized scores distribution for gapped local pairwise alignments. Another fact was the necessity of using corrected sequence lengths when extending (9) to gapped alignments. Such a necessity arises because local alignments starting near the end of either sequence are likely to run out of sequence before they can attain a sufficiently large score. All sequences must therefore be considered as having an effective length shorter than the actual one. Altschul and Gish [27] propose using the corrected lengths
|
|
due to which they do not arise in theoretical considerations which provide asymptotics. When l is a sizable part of either n or n', the edge effects cannot be discarded, and asymptotic theory loses accuracy. It has also been noticed that gapped alignments are generally longer than ungapped alignments, so finite-length effects and, respectively, the importance of the finite-length corrections are greater in the gapped case [75, 76]. The formula for l, suggested by Altschul and Gish [27], is as follows:
|
|
|
|
and ß can be estimated from simulations. A theoretical investigation of finite-length corrections for ungapped alignments was carried out by Spouge [77]. However, the correction (12) was shown [75] to be generally more suitable for applications than the one developed by Spouge [77]. The importance of finite-length corrections for a given pair of sequences depends on how good an approximation is provided by (9) for sequences of such lengths. Mott [14] has shown that the quality of this approximation depends on the scoring system and on the sequence composition, for both gapped and ungapped alignments. In the case of protein sequences with typical composition, relatively accurate approximation is achieved at lengths about n = n' = 250. For sequences with biased compositions, even n = n' = 1000 may be insufficient. Since many real protein sequences have skewed composition, and the average protein sequence length is about 330 residues, finite-length corrections are as a rule necessary.
The search for computationally efficient methods of estimating the parameters of the score distribution for gapped alignments continues. Several methods for computing the parameters were compared by Pearson [15]. Among recently proposed methods, we have to mention the islands method [70] for estimating Kg and
g, which is being used in the latest modifications of BLAST [40]. Islands are in fact distinct local alignments defined in a special way; this definition leads to an effective algorithm for finding such alignments. Altschul et al. [70] approximated the distribution of the number of such alignments by the Poisson distribution analogous to that used by Waterman and Vingron [16, 18]. This gave an explicit expression for the expected number of such alignments, which allowed to estimate the parameters Kg,
g by fitting this expression to the islands score data obtained by simulations. An importance sampling based method for estimating
g was developed by Bundschuh [71], and theoretically justified by Grossman and Yakir [78]. Bailey and Gribskov [79] treated H in (12) as a parameter to be estimated, and devised a maximum likelihood method for simultaneous estimation of Kg,
g and H. Mott and Tribe [80] proposed an ingenious method of estimating Kg,
g, which uses a suboptimal approximation to the SmithWaterman algorithm with simpler score statistics. For this approximation, termed the Greedy Extension Model (GEM), the score distribution was made dependent on the gap penalty function through a parameter
, and Kg,
g are estimated as
|
|
,
depend only on
. Note that the method of Mott and Tribe allows to predict the transition point between the linear and algorithmic region. Subsequently, Mott [14] further developed and improved this approach. Note that we need to know sequences lengths and composition to use the KarlinAltschul statistic (9). While this statistic explicitly shows the dependence of the score distribution on the lengths, the dependence on sequence composition is implicit. It is therefore desirable to have a composition-free measure of statistical significance. Bacro and Comet [81] have shown that the Z-score of an optimal local alignment is such a measure. Under the assumptions used by Waterman and Vingron [16], Bacro and Comet demonstrated that the Z-score has approximately extreme value distribution with parameters independent of sequence lengths or compositions. The Z-values can be obtained from simulations, using the shuffling procedure described by Comet et al. [22].
Local pairwise alignments and database searches
One of the frequently used types of database searches is the search for sequence similarity implemented as a series of local pairwise alignments of the query sequence to all the sequences in the database. Current databases are quite large, and the SmithWaterman full dynamic programming algorithm is too slow to be practical in this context. This is why fast heuristic algorithms have been designed for database searches. The best-known and most widely used heuristic algorithm are BLAST [38, 40, 82] and FASTA [83]. Though fast, the heuristic local alignment algorithms are not as sensitive as the SmithWaterman algorithm.
Since results of database searches are best local alignments (hits), the KarlinAltschul statistics is applicable for the estimation of their statistical significance. However, the specific nature of database searches makes it necessary to modify the straightforwardly obtained KarlinAltschul estimates. Before describing these modifications, we introduce the notion of the E-value. As was mentioned above, the number of HSPs for gapped alignments with score not lower than s has approximately the Poisson distribution with parameter Kgnn' exp(
gs). Since the parameter of the Poisson distribution is also its mean value, the expected number of HSPs with score not less than s is given by the formula
|
|
|
|
E. In the context of sequence similarly detection by massive screening of local alignments to database sequences, the E-value of an alignment score can be interpreted as the expected number of false positives having this score or a higher one.
The approach to the evaluation of the statistical significance of database hits depends on how to consider the set of sequences in the database. In this respect, BLAST and FASTA differ significantly. In BLAST, all the sequences are treated as one concatenated sequence [38, 84]. Thus, if a query sequence of length n hits a database sequence with score s, then the corresponding E-value is
|
|
|
|
|
|
In evaluation of the FASTA hit it is assumed that all database sequences, independent of their length, a priori have equal chances to score high when compared to a given query sequence [15, 84]. This can be justified by the assumption that a significant similarity to the query sequence can occur in both short and long database sequences. In this case, the E-value of a database hit with score s is calculated as
|
|
| STATISTICAL SIGNIFICANCE OF MULTIPLE ALIGNMENTS |
|---|
Global alignments
The exact dynamic programming algorithm for constructing a multiple alignment is known but is impractical for more than a few sequences, therefore, heuristic methods, such as progressive alignment, are usually used [68, 85]. The question of how to estimate the statistical significance of such alignments is even more obscure than it is for pairwise alignments. The papers describing well-known multiple alignment algorithms such as CLUSTAL W [86] and T-COFFEE [87], and newer methods such as MUSCLE [88] and MAFFT [89], say nothing about statistical significance of the produced alignment. Of course, it is possible to use simulations and distribution curve fitting for P-value estimations, but methods based on pure simulation may fail in the case of small P-values [25]. Thus, analytic approaches should be developed. When estimating statistical significance, it is natural to take into account the specific features of a multiple alignment algorithm. For example, if a method uses a sequence of pairwise alignments, then it may be possible to utilize the estimates for the pairwise alignment score statistics to assess the P-value for the multiple alignment. This approach is implemented in DIALIGN-T [90], which builds multiple alignments from ungapped pairwise local alignments, called fragments, involving pairs from the whole set of sequences. The score of a multiple alignment produced by DIALIGN-T is the sum of the scores of the constituting fragments, while a fragment score is defined as a negative logarithm of the P-value of global (NeedlemanWunsch) pairwise alignment for the fragment. Thus, an optimal multiple alignment is a collection of fragments with minimal product of pairwise P-values, and the score of a multiple alignment is the negative logarithm of an estimate of its P-value. The probabilistic sequence models for both DNA and protein alignments are independence models with uniform letter distribution; the global pairwise score P-values are estimated via simulations combined with heuristic formulas. Although such an approach seems reasonable, it was argued that DIALIGN-T over-estimates the probabilities of random occurrences of alignments with high scores [90].
Local alignments
Similar to the case of global multiple alignments, the full dynamic programming solution to the problem of local multiple alignments is currently unfeasible. Practically efficient approaches include heuristic block analysis [91], expectationmaximization [92], Gibbs sampling [93] and Eulerian path approach [94]. Also, the program DIALIGN-T can construct local alignments, and has been shown to perform quite well (the P-value estimation procedure of DIALIGN-T was outlined above). As is the case with pairwise alignments, some analytic results are available which can increase the efficiency of the P-value estimation for local multiple alignments. The choice of a method for P-value estimation depends on the nature of the alignment algorithm and on scoring system.
Formulas for the P-value estimation are available only for ungapped local multiple alignments. The basic idea behind the P-value estimation is the conjecture that expression (9) is extendable to the case of multiple sequences [43, 69]. Let S1, ... , Sr be independent sequences consisting of i.i.d. letters from the alphabet
. Let
be the letter probabilities for the sequence Sj. We specify a length and choose one segment of this length from each of the sequences; the set of such segments is called a block. The score of this block is defined as the sum of scores for the block columns. If there is a block whose score cannot be increased by shifting any of its borders, then this block is called a high-scoring block. A high-scoring block with the maximal score corresponds to the optimal local ungapped alignment of the sequences
. Suppose that: (i) some column score is positive with positive probability; (ii) the average column score is negative; (iii) column scores do not change upon permuting the block's rows. These assumptions are valid if the column scores are defined as the SP(sum-of-pairs)-scores [91]. The SP-score of a column is the sum of all the pairwise scores for the letters it comprises, with each pair being counted only once. For example, if the letters denote amino acids, then the pairwise substitution scores can be the usual log-odds scores defined by a PAM matrix. SP-scores are frequently used scores for multiple sequence alignments [68].
Suppose now that the lengths n1, ... , nl of the sequences S1, ... , Sl are large and do not differ much from each other; the sequences are assumed to have similar sequence composition. Then, for the P-value of the multiple sequence alignment with score s(n1 ··· nl), we have the expression
|
|
m satisfies |
|



