Briefings in Bioinformatics Advance Access published online on February 27, 2007
Briefings in Bioinformatics, doi:10.1093/bib/bbm004
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Book Review |
Algebraic Statistics for Computational Biology
Edited by Lior Pachter and Bernd Sturmfels
Cambridge University Press, Cambridge; ISBN: 0-521-85700-7; 432pp.; 2005; $60. This book introduces a new, nontraditional framework that attempts to fuse the four themes of algebra, statistics, computational algorithms and biological sequence analysis into a coherent whole. Algebraic Statistics is an emerging field that comprises statistical applications of computational algebraic geometry: statistical models are formulated so that one studies solutions of systems of polynomial equations. The models considered in this book come from computational biology, and include sequence alignment and molecular evolution. The book will primarily be of interest to mathematically sophisticated readers, from graduate students on up, with an interest in computational biology. The book is based on a graduate mathematics class taught by Pachter and Sturmfels at UC Berkeley in 2004, and comprises their lectures and student projects.
General topic areas include sequence alignments, phylogenetic trees based on sequence alignments, substitution models of molecular evolution and Hidden Markov Models and generalizations. The tools used to cover these topics are what make this book unique: polytopes, Gröbner bases, toric models and tropical geometry. Some of the chapters are focused on the mathematics and algorithms and simply mention application areas in biology, while others apply algorithms to real biological data. Throughout the book, code is given in a variety of languages, including C++, Java, Maple Matlab and others, sometimes directly in the book and sometimes on a supplementary website.
The first part of the book consists of four chapters written by Pachter and Sturmfels, one on each of the four themes. They lay out connections between the themes using integrated examples and algebraic notations across the different disciplines. This is intriguing, although the algebraic notation may reduce the accessibility of the material. However, they work out detailed examples that illustrate the notation and the connections between the different disciplines.
The second part of the book, Studies on the four themes, consists of 18 chapters written by students based on their class projects. It is similar to conference proceedings, but with cross-references to other chapters so that one may read just the necessary portions of the first four chapters for background and then follow-up with overlapping projects in other chapters. However, the consistency of notation and integrated examples in the Pachter and Sturmfels chapters is not always maintained in the remaining chapters.
Several chapters concern polytopes and sequence alignment. Given two DNA sequences, the precise details of the alignment between them will vary depending on the scoring method that is used. A common scoring system for the NeedlemanWunsch alignment algorithm has three parameters: the scores for matches (M), mismatches (X) and spaces (insertions/deletions; S), where these score parameters are real numbers with constraints that M is nonnegative and X and S are nonpositive. The alignment summary (m,x,s) consists of the numbers of matches, mismatches and spaces; these are nonnegative integers. A four parameter model with gap scores, and a general 33 parameter hidden Markov model allowing different scores for all combinations of nucleotides, are also considered. The questions addressed include the mathematical problem of the structure of the different alignments produced from two fixed nucleotide sequences as the scoring parameters vary, and the biological problem of the accuracy of the alignments.
In Chapter 2 (Computation), they define tropical arithmetic, an algebraic system in which the combination of all steps of the NeedlemanWunsch algorithm (and separately the Viterbi algorithm) may be expressed as a single tropical polynomial. This coding by polynomials allows them to apply tools from computational algebraic geometry, such as polytopes and Gröbner bases.
In Chapter 7, by Colin Dewey and Kevin Woods, a convex polytope is formed (the alignment polytope) which, in the three parameter model, consists of all score parameters (M,X,S) that result in a particular alignment. They consider the mRNA alignment of particular nucleotide sequences from chicken
and ß hemoglobin for which there are two biologically reasonable alignments: one that respects codon boundaries and one that does not. They prove that there are no score parameters that give either of these in the three parameter model. In the four parameter model, the alignment respecting codon boundaries still is not achievable, while the less likely but still biologically plausible alignment that breaks codon boundaries is achievable. Chapters 5 and 6 concern algorithms and software for analyzing polytopes and hidden Markov models; the connection to alignments is mentioned but is not the focus. The methods of these three chapters were combined and extended into an article [1] in which they aligned two Drosophila genomes. They identified 2 million pairs of orthologous segments between them, 44% of which have ambiguous alignments. For these ambiguous cases, they simultaneously find all optimal alignments over all parameter settings in the three, four, and five parameter models.
Chapter 8, by Sergi Elizalde and Fumei Lam, further studies the structure of the space of all alignments, but through a dual problem: they form a polytope as the convex hull of the alignment summaries, which in the three parameters model is the numbers (m,x,s) that count matches, mismatches and spaces, as the score parameters vary. They relate this polytope to the partition of score parameter space into all alignment polytopes, and they give known and conjectural bounds on the number of different alignments obtained between two fixed sequences as the score parameters vary.
The impact of Chapters 58 is that people typically run alignment software with a fixed scoring scheme, but the score parameters have to be carefully tuned to produce the correct biological alignments and there may be cases when no parameter settings give the correct alignment.
Chapter 14, by Niko Beerenwinkel and Mathias Drton, considers mutagenetic tree models, which are trees that model accumulative evolutionary processes such as chromosomal gains and losses in tumor development, and amino acid substitutions in proteins such as in the evolution of HIV under antiviral drug therapy. The chapter mentions these applications, but is focused on mathematical algorithms. It shows how to construct certain polynomials that describe events such as amino acid substitutions in different positions, and then how to use Gröbner bases on these polynomials to study the trees. It also considers mixture models, which arise from a patient having a population of diseased cells in different evolutionary states. While the chapter is focused on the mathematics, a follow-up study [2] by the same authors applies their techniques to determine the order and rate of occurrence of particular amino acid changes in HIV mutation data from clinical trials.
There are other recent books in this field that follow a more traditional approach. Statistical Methods in Molecular Evolution [3] is a series of separately authored chapters that cover statistical models used in popular software. Statistical Methods in Bioinformatics [4] is a graduate textbook that starts with a ground-up review of probability and statistics, and continues with statistical applications in bioinformatics. The book under review is unique in its use of algebraic methods, the exploratory nature of the projects described in it, and in that it presents new results.
Assistant Professor
Department of Mathematics
University of California, San Diego
La Jolla
CA 92093-0112
USA
E-mail: gptesler{at}math.ucsd.edu
| References |
|---|
|
|
|---|
- Dewey CN, Huggins PM, Woods K, et al. Parametric Alignment of Drosophila Genomes. PLoS Computational Biology 2006; 2:e73 doi:10.1371/journal.pcbi.0020073.
- Beerenwinkel N, Drton M. A mutagenetic tree hidden Markov model for longitudinal clonal HIV sequence data. Biostatistics 2007; 8:5371 doi:10.1093/biostatistics/kxj033.
[Abstract/Free Full Text] - Nielsen R. Statistical Methods in Molecular Evolution.New York: Springer,-Verlag 2005 ISBN:978-0-387-22333-9.
- Ewens WJ, Grant GR. Statistical Methods in Bioinformatics. 2nd edn New York: Springer-Verlag 2005 ISBN:978-0-387-40082-2.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||