An approach to compare genome tiling microarray and MPSS sequencing data for transcript mapping

Abstract

Background: There are two main technologies for transcriptome profiling, namely, tiling microarrays and high-throughput sequencing. Recently, the advent of next-generation sequencing technologies and their promise has generated tremendous excitement about the latter. Consequently, the question of the moment is how these two technologies compare. Here we develop an approach for a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data.

Findings: This comparison is a challenging task because the sequencing data is discrete while the tiling array data is continuous. We use the published rice and Arabidopsis datasets, which currently provide the best-matched sets of array and sequencing experiments, using a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both organisms, a first-pass comparison reveals a surprisingly small overlap in transcripts of 22% and 66% in rice and Arabidopsis, respectively. However, a more detailed analysis shows that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and look at their intensity distribution, we see that it is very similar to that of exons. Furthermore, restricting our comparison to protein-coding gene loci reveals a very good overlap between the two technologies.

Conclusion: Our approach to comparing genome tiling microarray and MPSS sequencing data suggests that there is actually a reasonable overlap in transcripts identified by the two technologies. The apparent overlap is distorted by the scoring and thresholding steps of the tiling array analysis.
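
As a rough illustration of what a first-pass overlap computation might look like, the sketch below counts the fraction of array-detected regions that contain at least part of one sequencing tag. The half-open interval representation, the function name `overlap_fraction`, and the toy coordinates are assumptions made for this example; they are not the paper's pipeline or data.

```python
from bisect import bisect_left

def overlap_fraction(regions, tags):
    """Fraction of regions overlapped by at least one tag.

    Regions and tags are half-open intervals (start, end) on one
    chromosome and strand (an assumed, simplified representation).
    """
    if not regions:
        return 0.0
    tags = sorted(tags)
    tag_starts = [start for start, _ in tags]
    # Prefix maximum of tag ends: max_end[i] is the furthest end
    # reached by any of the first i + 1 tags.
    max_end = []
    running = float("-inf")
    for _, end in tags:
        running = max(running, end)
        max_end.append(running)
    hits = 0
    for r_start, r_end in regions:
        # Tags starting before the region ends are the first k tags.
        k = bisect_left(tag_starts, r_end)
        # One of them overlaps if it ends after the region starts.
        if k > 0 and max_end[k - 1] > r_start:
            hits += 1
    return hits / len(regions)

# Toy data: two of the three regions are hit by a tag.
regions = [(0, 100), (200, 300), (400, 500)]
tags = [(50, 70), (290, 310), (350, 370)]
print(overlap_fraction(regions, tags))
```

A count like this depends directly on which regions the array scoring step calls as transcribed, which is why the threshold used there can distort the apparent overlap.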

Download from publisher
Download free from PubMed
Correction

Citation
Rajkumar Sasidharan, Ashish Agarwal, Joel Rozowsky, Mark Gerstein (2009). An approach to compare genome tiling microarray and MPSS sequencing data for transcript mapping, BMC Research Notes 2(1): 150.

Posted in Publications

ENCODE/modENCODE Consortium Meeting 2009

Ashish Agarwal, LaDeana W. Hillier, Joel Rozowsky, David Koppstein, Andrea Sboner, Lukas Habegger, Jeanyoung Jo, Michael Snyder, Philip Green, Valerie Reinke, Robert H. Waterston, Mark Gerstein, “Transcriptome comparison between tiling arrays and next generation sequencing on matched worm samples, towards making optimal use of arrays”. Presentation at ENCODE/modENCODE Consortium Meeting 2009.

Title: Transcriptome comparison between tiling arrays and next generation sequencing on matched worm samples, towards making optimal use of arrays

Abstract:

Sequencing technologies are becoming a viable alternative to traditional microarrays as their cost continues to decrease, and so a detailed comparison of their relative strengths is warranted. We investigate the transcriptome of a C. elegans matched sample that was both sequenced and hybridized on a tiling array. We describe a method for comparing the single base pair resolution data from sequencing with the probe data from an array. Using this method, we compute several correlations of the signal. We find that the raw signals are highly correlated, both across the genome and within transcribed regions only. A comparison of the differential expression across two samples from both technologies also shows significant agreement.
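
One plausible way to put the two data types on a common footing is to average the per-base sequencing coverage over each probe's footprint and then correlate those averages with the probe intensities. The sketch below does this with a Pearson correlation; the helper names, the toy probe length of 2 bases, and the numbers are assumptions for illustration only and are not taken from the presentation.

```python
from math import sqrt

def probe_means(coverage, probes):
    """Average per-base coverage over each probe interval [start, end)."""
    return [sum(coverage[s:e]) / (e - s) for s, e in probes]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy example: per-base read coverage and four non-overlapping probes.
coverage = [0, 0, 2, 4, 6, 6, 0, 0]
probes = [(0, 2), (2, 4), (4, 6), (6, 8)]
intensities = [1.0, 7.0, 13.0, 1.0]  # hypothetical array signal

seq_signal = probe_means(coverage, probes)
print(seq_signal, pearson(seq_signal, intensities))
```

Restricting `probes` to those inside annotated transcribed regions would give the second kind of correlation mentioned above.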

Next we compare both technologies with regard to their agreement with a gold standard, in the manner of an ROC plot, where the parameter varied is the threshold above which the signal is considered indicative of transcriptional activity. The sequencing data simultaneously provides a higher sensitivity and a lower false positive rate. The higher resolution of sequencing also leads to more accurate prediction of exonic boundaries. We are able to use the sequencing data as a gold standard to optimally calibrate the parameters required to analyze the tiling array.
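
The threshold sweep behind an ROC-style comparison can be sketched as follows. This is a generic illustration, not the analysis code from the presentation: `roc_points`, the per-position boolean gold standard, and the toy values are all assumptions, and the function expects at least one positive and one negative position.

```python
def roc_points(signal, truth):
    """Return (false positive rate, true positive rate) per threshold.

    A position is called transcribed when its signal is >= threshold.
    `truth` marks positions transcribed according to the gold standard.
    """
    positives = sum(truth)
    negatives = len(truth) - positives
    points = []
    # Sweep thresholds from strictest to most permissive.
    for threshold in sorted(set(signal), reverse=True):
        tp = sum(1 for s, t in zip(signal, truth) if s >= threshold and t)
        fp = sum(1 for s, t in zip(signal, truth) if s >= threshold and not t)
        points.append((fp / negatives, tp / positives))
    return points

# Toy example in which the signal separates the two classes perfectly.
signal = [0.9, 0.8, 0.4, 0.3]
truth = [True, True, False, False]
for fpr, tpr in roc_points(signal, truth):
    print(fpr, tpr)
```

Running the same sweep on the array signal and on the sequencing signal against one gold standard yields two curves that can be compared directly; the threshold closest to the top-left corner is one candidate for a calibrated cutoff.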

Finally, we investigate the extent of cross hybridization, the most likely artifact leading to false positives, in the tiling array. We quantify the degree of its contribution to the signal and show the effects of filtering out the unreliable probes. Correspondingly, we show that there is a much smaller but detectable degree of error in the sequencing data from cross-mapping.

The material presented here was later published in BMC Genomics
Presentation Slides

Posted in Presentations

Linear coupled component automata for MILP modeling of hybrid systems

Abstract

We first introduce a novel modeling framework, called linear coupled component automata (LCCA), to facilitate the modeling of discrete-continuous dynamical systems with piecewise constant derivatives. Second, we provide a procedure for transforming models in this framework to mixed-integer linear programming (MILP) constraints. Traditionally, such systems have been modeled directly with MILP constraints. We show with an example that our framework significantly simplifies model formulation and allows the complex MILP constraints to be produced systematically.

Preprint version
Published version

Citation
Ashish Agarwal, Ignacio E. Grossmann (2009). Linear coupled component automata for MILP modeling of hybrid systems, Computers & Chemical Engineering 33(1): 162-175.

Posted in Publications

tYNA: An Online Tool for Analysis of Biological Networks

A summary of Kevin Yip’s tYNA software is available here.

Posted in News

Dissertation: Logical Modeling Frameworks for the Optimization of Discrete-Continuous Systems

Abstract

Often, it is very difficult to pose a model for a system even after the system is conceptually understood. The reason is that the mathematical languages we employ have few forms of expression. We define more expressive languages, first for dynamical discrete-continuous systems, and then more rigorously for mathematical programs (MP). Our approach provides a theoretical basis for designing MP software.

The first framework we define is called linear coupled component automata (LCCA). It supports finite domain constraints, explicitly handles dynamics, and enforces modular modeling. We show how LCCA models can be mechanically converted into mathematical programming (MP) constraints. Currently, chemical process systems are usually modeled directly with MP constraints. We show with an example that it is much easier to model hybrid systems in our LCCA framework.

We then pursue a more rigorous approach for the MP part of our work, for the purposes of providing a computer implementation of an MP framework. There are two main results: a rich computer language for declaring MPs and automation of certain model transformations. Mathematically, these correspond to defining a set p of MPs and defining a binary relation on p.

The set p contains programs as one would want to write in practice, not just canonical matrix forms. Complex index sets can be defined in intuitive ways, and they are first-class entities in our theory, not mere notational conveniences eliminated at parse time. This has many benefits: it retains knowledge of the problem structure, keeps the program size to a minimum, and speeds up certain operations. Our definition of the semantics elucidates the nature of MP algorithms and explains the information sought from a solution.

The binary relation on programs p can be defined because a logical formulation allows treating constraints and programs as mathematical objects. Principally, our definition includes: a procedure for putting Boolean expressions into conjunctive normal form, and a procedure for converting disjunctive constraints into mixed-integer inequalities. Neither has been defined previously for a language as expressive as ours, and the latter has not been defined as a formal mapping on constraint spaces. Overall this leads to a procedure for converting general MPs to pure mixed-integer programs (MIPs).
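
For readers unfamiliar with the first of these procedures, here is a minimal sketch of the textbook conversion of a Boolean expression to conjunctive normal form: push negations inward, then distribute disjunction over conjunction. It is not the dissertation's procedure, which is defined for a far richer language; the tuple encoding of formulas and the function names are assumptions made for this example.

```python
def nnf(f, negate=False):
    """Push negations down to the variables (negation normal form)."""
    op = f[0]
    if op == "var":
        return ("not", f) if negate else f
    if op == "not":
        return nnf(f[1], not negate)
    # De Morgan: negating a formula swaps "and" and "or".
    if negate:
        op = "or" if op == "and" else "and"
    return (op, nnf(f[1], negate), nnf(f[2], negate))

def cnf_clauses(f):
    """Return CNF as a set of clauses, each a frozenset of literals.

    Literals are strings; a leading "~" marks a negated variable.
    """
    def go(g):
        op = g[0]
        if op == "var":
            return {frozenset([g[1]])}
        if op == "not":  # in NNF only variables are negated
            return {frozenset(["~" + g[1][1]])}
        left, right = go(g[1]), go(g[2])
        if op == "and":
            return left | right
        # "or": distribute over the conjunctions on each side.
        return {a | b for a in left for b in right}
    return go(nnf(f))

# a or (b and c)  becomes  (a or b) and (a or c)
a, b, c = ("var", "a"), ("var", "b"), ("var", "c")
print(cnf_clauses(("or", a, ("and", b, c))))
```

Naive distribution can grow exponentially in the size of the formula, which is one reason a formal treatment of the transformation is useful.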

Sets and set relations are defined using the methods of type theory, which espouses a close relation between mathematics and computation. As a result, the set p can be viewed simultaneously as a novel definition of MP and as the software architecture for implementing an MP language. Similarly, the binary relation on p can be directly implemented on a computer. Several examples of our software’s input and output are provided.

Download thesis
Download defense slides

Citation
Ashish Agarwal (2006). Logical Modeling Frameworks for the Optimization of Discrete-Continuous Systems, PhD Dissertation, Carnegie Mellon University.

Posted in Presentations, Publications

Finished my PhD!

Thanks to my great advisors Robert Harper and Ignacio Grossmann. Read all about it here.

Posted in News

Milner on the science of the artificial

“Probably what [physicists] mean is: [computer science] can’t be a science because we are always making things. What are we doing science of? We are doing the science of our own constructions. But I think the boundary between understanding one’s own constructions and understanding the world is breaking down. You’ve only got to look at bioinformatics. In chemistry, in chemical engineering, there are structures which we are making which deserve to be understood by the same models as we understand the natural world.”
— Milner (3 Sep 2003, In interview conducted by Martin Berger)

Posted in Quotes

Boltzmann says theory is practical

“Nothing is more practical than a good theory.”
— Ludwig Boltzmann

Posted in Quotes