Briefings in Bioinformatics Advance Access originally published online on December 21, 2006
Briefings in Bioinformatics 2007 8(2):136-137; doi:10.1093/bib/bbl020
Book Reviews
Bioinformatics and Computational Biology Solutions Using R and Bioconductor.
Edited by Robert Gentleman, Wolfgang Huber, Vincent J. Carey, Rafael A. Irizarry and Sandrine Dudoit
Springer
ISBN: 0387251464; 473 pp.; 2005; $89.95.
This book guides the reader through practical bioinformatics data analysis using the Bioconductor toolkit, which is based on the statistical language R. R itself is an open-source implementation of the S language, whose commercial implementation is S-Plus. Bioconductor is a collection of R-packages for the analysis of genomic and molecular biological data generated in high-throughput experiments. High-throughput experiments are characterized by large amounts of data generated in short periods of time on a sizable number of samples. This poses new challenges for the analysis, such as assessing and adjusting for noise, exploration using cluster analysis, visualization, and linking to (or annotating with) biomedical knowledge bases.
The book focuses on gene expression microarrays, the high-throughput technology for which statistical methods are best developed today. In addition, a few chapters are dedicated to other high-throughput technologies such as cell-based assays and proteomics using the first mass-marketed surface-enhanced laser desorption/ionization time-of-flight spectrometry (SELDI-TOF) technology (not the more robust LC/MS technology). Each of the experimental technologies discussed is introduced briefly, so that even a relative novice in bioinformatics is familiar with it before the discussion dives into the specific data analysis problems and methods. In the same vein, it would have been useful to provide the reader with some introduction to the syntax and semantics of R itself and the coding conventions used in the examples; unfortunately, the book simply writes down R code examples without a guide and without any obvious systematic or self-explanatory coding style. The somewhat idiosyncratic short-naming convention, although apparently commonplace among R users, does not help to make the code examples more transparent either.
So, the reader who chooses this book as holiday reading on a remote island should be advised to take along a reference book on R in order to make sense of the many R-code samples provided. The problem is not restricted to the R-code examples; Bioconductor itself is not introduced in enough detail to give the reader a systematic understanding of the design and roadmap of this powerful, evolving bioinformatics toolkit. This is a critical loss to those readers who are hoping to engage in the community effort and who need to learn about the fundamental data structures and design conventions of Bioconductor. For instance, the importance of the exprSet data structure is announced early in the book, and this structure seems to be used in many of the code examples, yet not even a tabular or schematic overview of it is provided. A reader who has seen many programming languages can still follow the discussion, but those who lack such experience may be lost.
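To illustrate the kind of schematic overview the reviewer misses, the following is a minimal sketch of an expression-set-like container. It is written in Python rather than R and is not the Bioconductor exprSet class; the class name `ExpressionSet`, its field names and the toy values are all hypothetical and serve only to show the general idea of keeping an expression matrix together with its per-sample metadata.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExpressionSet:
    """Hypothetical, simplified stand-in for an expression-set container."""

    exprs: List[List[float]]          # rows = features (probes/genes), columns = samples
    feature_names: List[str]          # one name per row of exprs
    sample_names: List[str]           # one name per column of exprs
    pheno_data: Dict[str, List[str]]  # per-sample covariates, one value per sample
    annotation: str = ""              # free-text label of the array platform

    def __post_init__(self) -> None:
        # Enforce that the matrix and the metadata describe the same features and samples.
        if len(self.exprs) != len(self.feature_names):
            raise ValueError("one feature name is required per matrix row")
        n_samples = len(self.sample_names)
        if any(len(row) != n_samples for row in self.exprs):
            raise ValueError("every matrix row needs one value per sample")
        for covariate, values in self.pheno_data.items():
            if len(values) != n_samples:
                raise ValueError(f"covariate {covariate!r} needs one value per sample")

    def subset_samples(self, keep: List[int]) -> "ExpressionSet":
        """Return a new set restricted to the sample columns listed in keep."""
        return ExpressionSet(
            exprs=[[row[j] for j in keep] for row in self.exprs],
            feature_names=list(self.feature_names),
            sample_names=[self.sample_names[j] for j in keep],
            pheno_data={k: [v[j] for j in keep] for k, v in self.pheno_data.items()},
            annotation=self.annotation,
        )


# Toy data: 3 features measured on 4 samples (all values are made up).
eset = ExpressionSet(
    exprs=[[7.1, 6.9, 9.4, 9.8], [5.0, 5.2, 5.1, 4.9], [8.3, 8.1, 6.0, 6.2]],
    feature_names=["probe_a", "probe_b", "probe_c"],
    sample_names=["s1", "s2", "s3", "s4"],
    pheno_data={"group": ["control", "control", "treated", "treated"]},
    annotation="toy_platform",
)

# Select the treated samples; matrix columns and covariates are subset together.
treated_idx = [j for j, g in enumerate(eset.pheno_data["group"]) if g == "treated"]
treated = eset.subset_samples(treated_idx)
print(treated.sample_names)  # ['s3', 's4']
print(treated.exprs[0])      # [9.4, 9.8]
```

The point of such a container is that subsetting the samples keeps the expression values and the sample covariates in step, which is the sort of design convention a newcomer would want to see spelled out.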
The strength of the book lies more in its practical, application-oriented discussion of the various data analysis approaches, with a solid body of explanation, references, and comparison of alternative methods. It helps that the reader hears all this from the horse's mouth, as the authors and editors are not only the chief designers of the Bioconductor packages discussed, but also the authors of much original methodology, and thus outstanding experts in the areas they discuss. It is also helpful that even the practically oriented reader with a more casual background in mathematics will have a chance of following the discussion. However, the book is again not self-contained even in the methodological discussion, because most methods are discussed by reference only and their essentials are not actually described. The application of these methods is demonstrated using realistic data, and enough example output and diagrams are shown that the reader can still follow the point of the discussion, albeit somewhat unsure about the specific detail.
The book is divided into four main parts, dealing with (i) preprocessing of raw experimental data, (ii) working with annotation metadata, (iii) statistical analysis of results (to find differentially expressed genes) and (iv) working with graphs and networks. This is followed by a series of short case studies which illustrate how the material discussed in the four main parts applies to specific example projects. Each part will be useful to the reader, as each is an essential component of working with high-throughput data. In particular, parts i, iii, and iv are inherently mathematical and hence clearly the domain of R-packages, and their discussion is most gratifying. It is good to see R-examples both for statistical applications (parts i and iii) and for the discrete mathematics algorithms used in graph theory (part iv). Conversely, regarding the treatment of annotation metadata in R (part ii), it is not obvious why one would rely solely on R-packages to access such annotations, which mostly reside in relational database management systems, XML or web resources; integrating these resources does not seem to be a unique strength of R. It would have been helpful if the book had discussed how users can make their own annotation database resources accessible to analysis algorithms executed in R, rather than simply showing some subset of R-modules designed to interact with the world outside. Likewise, the short discussion on workflow integration of R-analyses has too narrow a horizon: it stays within the perimeter of R and does not discuss alternatives such as the use of R as part of a larger analysis platform. A much needed discussion of the CPU and memory requirements of the data structures and algorithms involved, and of their scalability, optimization and parallelization for the analysis of huge, complex data sets, is lacking in a book which otherwise presents itself as such a practical cookbook.
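To make the concern about memory requirements concrete, here is a back-of-the-envelope sketch in Python. The data set dimensions are hypothetical round numbers, not figures from the book, and the helper name `dense_matrix_bytes` is invented for this illustration; the only fixed assumption is that each value is stored as an 8-byte double, as R does for numeric data.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory needed for a dense numeric matrix, ignoring any object overhead."""
    return n_rows * n_cols * bytes_per_value


# Hypothetical data set: 50,000 features measured on 1,000 samples.
n_features, n_samples = 50_000, 1_000

# The expression matrix itself.
expr_bytes = dense_matrix_bytes(n_features, n_samples)
print(expr_bytes / 1e6, "MB")  # 400.0 MB

# A dense feature-by-feature matrix, for example all pairwise correlations
# as input to clustering of genes, grows with the square of the feature count.
pairwise_bytes = dense_matrix_bytes(n_features, n_features)
print(pairwise_bytes / 1e9, "GB")  # 20.0 GB
```

Under these assumptions the raw data fit comfortably in memory, while a naive all-pairs computation on the features does not, which is the kind of scalability consideration the review finds missing.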
Even knowing that this book cannot be used as a self-contained introduction to bioinformatics with Bioconductor and R, as the title would suggest, it is nevertheless to be recommended as a timely and practical guide to this highly dynamic subject area, written by experts in a style which everyone, including the student, should have no problem following when armed with the additional reference manuals and cited methodology papers. The quality of the discussion and relevant details is exceptional, as can be seen in the reviewer's favorite chapter on visualization techniques, and overall the book is very nicely produced with plenty of high-quality illustrations. It is also good to know that the content of the book was in fact generated using R itself and that the examples and code can be downloaded from a companion website. Although what the authors describe as a computable book seems to be not quite the same as Donald Knuth's literate programming, developed for his phenomenal work on algorithms and typesetting, this book honors Knuth's legacy, which lives on in the R-community. The book is easily preferred over any hastily thrown-together "for dummies" beginner's guide that provides little more than the online user manual; hence, after all the above criticism has been raised, this may even be the only right way to produce a printed book in today's dynamic world of web-accessible online reference material. Indeed, one intriguing use of the book might be as the main textbook for a bioinformatics course, which would guide the student through practical tasks and methodology while leaving plenty of room for the additionally desirable self-study of the primary methodological literature and software reference manuals.
Indiana University School of Informatics and
Regenstrief Institute