Ashish Agarwal » Bioinformatics

Challenges in Genomics Data Visualization

ashish — Mon, 15 Apr 2013 16:29:09 +0000

Genomics, like many fields, is generating data at an ever increasing rate with the promise to enable personalized medicine, improve agriculture, and advance fundamental research. Data visualization is a key component of the scientific process and often the bottleneck in effective interpretation of analysis results. We’ll begin with a broad introduction to Genomics and then describe several visualization challenges. We categorize the challenges in three domains: those arising purely from the size of the data, those requiring intricate interactivity, and those requiring high resolution 3D rendering. The goal of the talk is to provide a broad overview of visualization challenges in the field of Genomics.

Paul Scheid, Ashish Agarwal, Karl Ward. Challenges in Genomics Data Visualization, Tisch Interactive Telecommunications Program (ITP), New York University, Apr 12, 2013.

A Validated Regulatory Network for Th17 Cell Specification

ashish — Thu, 27 Sep 2012 15:25:18 +0000

Abstract
Th17 cells have critical roles in mucosal defense and are major contributors to inflammatory disease. Their differentiation requires the nuclear hormone receptor RORγt working with multiple other essential transcription factors (TFs). We have used an iterative systems approach, combining genome-wide TF occupancy, expression profiling of TF mutants, and expression time series to delineate the Th17 global transcriptional regulatory network. We find that cooperatively bound BATF and IRF4 contribute to initial chromatin accessibility and, with STAT3, initiate a transcriptional program that is then globally tuned by the lineage-specifying TF RORγt, which plays a focal deterministic role at key loci. Integration of multiple data sets allowed inference of an accurate predictive model that we computationally and experimentally validated, identifying multiple new Th17 regulators, including Fosl2, a key determinant of cellular plasticity. This interconnected network can be used to investigate new therapeutic approaches to manipulate Th17 functions in the setting of inflammatory disease.

Full article from publisher

Citation
Maria Ciofani, Aviv Madar, Carolina Galan, MacLean Sellars, Kieran Mace, Florencia Pauli, Ashish Agarwal, Wendy Huang, Christopher N. Parkhurst, Michael Muratet, Kim M. Newberry, Sarah Meadows, Alex Greenfield, Yi Yang, Preti Jain, Francis K. Kirigin, Carmen Birchmeier, Erwin F. Wagner, Kenneth M. Murphy, Richard M. Myers, Richard Bonneau, and Dan R. Littman (2012). A Validated Regulatory Network for Th17 Cell Specification, Cell 151:1-15.

Functional Big-Data Genomics

ashish — Tue, 11 Sep 2012 20:11:51 +0000

Abstract
High-throughput genomic sequencing is characterized by large diverse datasets and numerous analysis methods. It is normal for an individual bioinformatician to work with thousands of data files and employ hundreds of distinct computations during the course of a single project. This problem is magnified in “core facilities”, which support multiple researchers working on diverse projects. Most investigators use ad hoc methods to manage this complexity with dire consequences: analyses frequently fail to meet the scientific mandate of reproducibility; improved analysis methods are often not considered because recomputing all downstream steps would be overwhelming; hard drives and CPUs are used sub-optimally; and, in some cases, raw data is lost.

We describe HITSCORE, an OCaml software stack that operates all computational aspects of the Genomics Core Facility at New York University’s Center for Genomics and Systems Biology. HITSCORE has been in production use for one year, and was implemented quickly by fewer than two programmers following design advice from several biologists. A simple domain-specific language (DSL) enables generating type-safe database bindings and GUI components, and greatly eases updates and migration of our data model. We found a multi-lingual stack too burdensome in a small team setting, and credit OCaml for fulfilling the needs of our full application stack. It has good database bindings, excels at encoding complex domain logic, and now allows construction of rich websites due to the Ocsigen web programming framework. Higher-level libraries for distributed computing would be a welcome improvement.

The opportunity to build this system did not stem directly from any strength of functional programming or OCaml. It was necessary for a person with credibility amongst biologists to champion its development, and this credibility took several years to build. Rapid development appears to be the single most important factor in allaying doubts about using a lesser known language, and we will briefly describe our experiences in bringing OCaml to several high profile projects.

Download slides
Video

Citation
Ashish Agarwal, Sebastien Mondet, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Functional Big-Data Genomics. Commercial Users of Functional Programming 2012, Copenhagen, Denmark, Sep 15, 2012.

Biocaml: The OCaml Bioinformatics Library

ashish — Mon, 10 Sep 2012 20:11:38 +0000

Abstract
Biology is an increasingly computational discipline due to rapid advances in experimental techniques, especially DNA sequencing, that are generating data at unprecedented rates. The computational techniques needed range from the complex (e.g. algorithms, distributed computing) to the simple (e.g. scripting, parsing), and there are hundreds of thousands of Biologists now involved in computing. We propose that OCaml can serve virtually the full spectrum of computational tasks needed by Biologists, improving both programmer productivity and computational efficiency. To support this end, we have developed Biocaml.

Biocaml aims to be a standard library for the Biology domain. We provide features that are needed in a broad range of applications and avoid including overly specialized methods. The current feature set can be split into 3 broad categories: stream parsing/printing of many data formats, data structures for genomics, and access to public data repositories. We will demonstrate how some complex calculations can be performed quite easily with the current API, and describe our efforts to make a uniform API with comprehensive documentation. Finally, there is a BioX library for X equal to any programming language. The most widely used is BioPerl, and we will compare Biocaml with these alternatives.

Biocaml and other OCaml libraries have now been successfully used in multiple high-profile Biology projects (e.g. modENCODE, ENCODE, NYU’s Genomics Core Facility, and others). Some time will be spent discussing the social aspect of bringing a novel language to the Biology community. We will attempt to elucidate strategies that are successful and those that are not. In particular, it will be argued that discussions regarding programming language choices need to be more scientific.

Download slides
Video

Citation
Ashish Agarwal, Sebastien Mondet, Philippe Veber, Christophe Troestler, Francois Berenger. Biocaml: The OCaml Bioinformatics Library. OCaml Users and Developers Meeting 2012, Copenhagen, Denmark, Sep 14, 2012.

Managing and Analyzing Big-Data in Genomics

ashish — Fri, 29 Jun 2012 17:18:00 +0000

Abstract

Biology is an increasingly computational discipline. Rapid advances in experimental techniques, especially DNA sequencing, are generating data at exponentially increasing rates. Aside from the algorithmic challenges this poses, researchers must manage large volumes and innumerable varieties of data, run computational jobs on an HPC cluster, and track the inputs/outputs of the numerous computational tools they employ. Here we describe a software stack fully implemented in OCaml that operates the Genomics Core Facility at NYU’s Center for Genomics and Systems Biology.

We define a domain specific language (DSL) that allows us to easily describe the data we need to track. More importantly, the DSL approach provides us with code generation capabilities. From a single description, we generate PostgreSQL schema definitions, OCaml bindings to the database, and web pages and forms for end-users to interact with the database. Strong type safety is provided at each stage. Database bindings check properties not expressible in SQL, and web pages, forms, and links are validated at compile time by the Ocsigen framework. Since the entire stack depends on this single data description, rapid updates are easy; the compiler informs us of all necessary changes.
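The idea of deriving several artifacts from one data description can be illustrated with a minimal OCaml sketch. This is not the DSL described in the talk: the types `column_type`, `column`, and `table`, the functions `create_table` and `record_type`, and the `sample` table are all hypothetical names chosen for illustration. The sketch only shows how a single value can drive both a PostgreSQL schema definition and a matching OCaml type declaration, so that a change to the description propagates to both.

```ocaml
(* Hypothetical sketch: one data description, two generated artifacts.
   None of these names come from the system described in the abstract. *)

(* Column types supported by this toy description language. *)
type column_type = Integer | Text | Timestamp

(* A column has a name, a type, and a nullability flag. *)
type column = { name : string; ctype : column_type; nullable : bool }

(* A table description: the single source of truth. *)
type table = { table_name : string; columns : column list }

(* Map a description-level type to a PostgreSQL type name. *)
let sql_type = function
  | Integer -> "integer"
  | Text -> "text"
  | Timestamp -> "timestamp"

(* Map the same type to an OCaml type name (an arbitrary choice here). *)
let ocaml_type = function
  | Integer -> "int"
  | Text -> "string"
  | Timestamp -> "float"

(* Render one SQL column definition. *)
let sql_column c =
  Printf.sprintf "%s %s%s" c.name (sql_type c.ctype)
    (if c.nullable then "" else " NOT NULL")

(* Render one OCaml record field; nullable columns become options. *)
let ocaml_field c =
  Printf.sprintf "%s : %s%s" c.name (ocaml_type c.ctype)
    (if c.nullable then " option" else "")

(* Generate a CREATE TABLE statement from the description. *)
let create_table t =
  Printf.sprintf "CREATE TABLE %s (%s);" t.table_name
    (String.concat ", " (List.map sql_column t.columns))

(* Generate an OCaml record type declaration from the same description. *)
let record_type t =
  Printf.sprintf "type %s = { %s }" t.table_name
    (String.concat "; " (List.map ocaml_field t.columns))

(* Example description, made up for illustration. *)
let sample =
  { table_name = "sample";
    columns =
      [ { name = "id"; ctype = Integer; nullable = false };
        { name = "organism"; ctype = Text; nullable = true } ] }

let () =
  print_endline (create_table sample);
  print_endline (record_type sample)
```

Adding a column to `sample` changes both generated outputs at once, which is the property the abstract attributes to the single-description approach. A real generator would emit compiled bindings and web forms rather than strings.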

The application launches compute intensive jobs on a high-performance compute (HPC) cluster, requiring consideration of concurrency and fault-tolerance. We have implemented what we call a “flow” monad that combines error and thread monads. Errors are modeled with polymorphic variants, which get arranged automatically into a hierarchical structure from lower level system calls to high level functions. The net result is extremely precise information in the case of any failures and reasonably straightforward concurrency management.
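The error half of such a monad can be sketched in plain OCaml, under several assumptions. The code below uses only the standard `result` type and omits the thread half entirely, which would require a concurrency library. The in-memory `files` table and the functions `read_file`, `parse_int`, `wrap_io`, `wrap_parse`, and `lane_count` are hypothetical. The wrapping of low-level errors into higher-level tags is written by hand here, whereas the abstract describes this arrangement as automatic.

```ocaml
(* Sketch only: an error monad over the standard [result] type, with
   errors modeled as polymorphic variants. The thread half of the
   "flow" monad is omitted. All names are hypothetical. *)

(* Monadic bind: continue on success, short-circuit on error. *)
let ( >>= ) m f = match m with Ok x -> f x | Error e -> Error e

(* A stand-in for the file system, so the example is self-contained. *)
let files = [ ("lanes.txt", "8"); ("bad.txt", "eight") ]

(* Low-level step with its own error tag. *)
let read_file path =
  match List.assoc_opt path files with
  | Some contents -> Ok contents
  | None -> Error (`File_not_found path)

(* Another low-level step with a different error tag. *)
let parse_int s =
  match int_of_string_opt s with
  | Some n -> Ok n
  | None -> Error (`Not_an_int s)

(* Nest a lower-level error inside a higher-level tag, giving a hierarchy. *)
let wrap_io = function Ok x -> Ok x | Error e -> Error (`Io_error e)
let wrap_parse = function Ok x -> Ok x | Error e -> Error (`Parse_error e)

(* High-level function: its inferred error type is the union of both layers. *)
let lane_count path =
  wrap_io (read_file path) >>= fun contents ->
  wrap_parse (parse_int contents)

(* The compiler checks that every error case is handled here. *)
let describe = function
  | Ok n -> Printf.sprintf "%d lanes" n
  | Error (`Io_error (`File_not_found p)) -> "I/O error: no such file " ^ p
  | Error (`Parse_error (`Not_an_int s)) -> "parse error: not an integer: " ^ s

let () =
  List.iter
    (fun p -> print_endline (describe (lane_count p)))
    [ "lanes.txt"; "bad.txt"; "missing.txt" ]
```

Because the variants are polymorphic, no error type needs to be declared up front: the type of `lane_count` records exactly which failures can occur, and `describe` can report each one precisely.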

Download slides

Citation
Sebastien Mondet, Ashish Agarwal, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Managing and Analyzing Big-Data in Genomics. IBM Programming Languages Day 2012, Hawthorne, NY, June 28, 2012.

The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics

ashish — Mon, 11 Apr 2011 21:36:26 +0000

Abstract

Biological data is often tabular but finding statistically valid connections between entities in a sequence of tables can be problematic – for example, connecting particular entities in a drug property table to gene properties in a second table, using a third table associating genes with drugs. Here we present an approach (CRIT) to find connections such as these and show how it can be applied in a variety of genomic contexts including chemogenomics data.

Full article from publisher
Paper’s website

Citation
Tara A Gianoulis, Ashish Agarwal, Michael Snyder, and Mark Gerstein (2011). The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics, Genome Biology 12(R32):1-12.

Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data

ashish — Thu, 20 Jan 2011 20:09:02 +0000

Abstract

We present an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs). It combines a large amount of expression data, RNA secondary-structure stability, and evolutionary conservation at the protein and nucleic-acid level. Using the incRNA model and data from the modENCODE consortium, we are able to separate known C. elegans ncRNAs from coding sequences and other genomic elements with a high level of accuracy (97% AUC on an independent validation set), and find more than 7000 novel ncRNA candidates, among which more than 1000 are located in the intergenic regions of C. elegans genome. Based on the validation set, we estimate that 91% of the approximately 7000 novel ncRNA candidates are true positives. We then analyze 15 novel ncRNA candidates by RT-PCR, detecting the expression for 14. In addition, we characterize the properties of all the novel ncRNA candidates and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families. We also find that they are often targeted by specific transcription factors (∼59% of intergenic novel ncRNA candidates). Overall, our study identifies many new potential ncRNAs in C. elegans and provides a method that can be adapted to other organisms.

Full article from publisher
Supplementary material
Paper’s website

Citation
Zhi John Lu, Kevin Y. Yip, Guilin Wang, Chong Shou, LaDeana W. Hillier, Ekta Khurana, Ashish Agarwal, Raymond Auerbach, Joel Rozowsky, Chao Cheng, Masaomi Kato, David M. Miller, Frank Slack, Michael Snyder, Robert H. Waterston, Valerie Reinke, and Mark B. Gerstein (2011). Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data, Genome Research 21(2):276-85.

Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project

ashish — Sat, 25 Dec 2010 04:00:01 +0000

Abstract

We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.

Full article from publisher

Citation
Mark B. Gerstein, Zhi John Lu, Eric L. Van Nostrand, Chao Cheng, Bradley I. Arshinoff, Tao Liu, Kevin Y. Yip, Rebecca Robilotto, Andreas Rechtsteiner, Kohta Ikegami, Pedro Alves, Aurelien Chateigner, Marc Perry, Mitzi Morris, Raymond K. Auerbach, Xin Feng, Jing Leng, Anne Vielle, Wei Niu, Kahn Rhrissorrakrai, Ashish Agarwal, Roger P. Alexander, Galt Barber, Cathleen M. Brdlik, Jennifer Brennan, Jeremy Jean Brouillet, Adrian Carr, Ming-Sin Cheung, Hiram Clawson, Sergio Contrino, Luke O. Dannenberg, Abby F. Dernburg, Arshad Desai, Lindsay Dick, Andréa C. Dosé, Jiang Du, Thea Egelhofer, Sevinc Ercan, Ghia Euskirchen, Brent Ewing, Elise A. Feingold, Reto Gassmann, Peter J. Good, Phil Green, Francois Gullier, Michelle Gutwein, Mark S. Guyer, Lukas Habegger, Ting Han, Jorja G. Henikoff, Stefan R. Henz, Angie Hinrichs, Heather Holster, Tony Hyman, A. Leo Iniguez, Judith Janette, Morten Jensen, Masaomi Kato, W. James Kent, Ellen Kephart, Vishal Khivansara, Ekta Khurana, John K. Kim, Paulina Kolasinska-Zwierz, Eric C. Lai, Isabel Latorre, Amber Leahey, Suzanna Lewis, Paul Lloyd, Lucas Lochovsky, Rebecca F. Lowdon, Yaniv Lubling, Rachel Lyne, Michael MacCoss, Sebastian D. Mackowiak, Marco Mangone, Sheldon McKay, Desirea Mecenas, Gennifer Merrihew, David M. Miller III, Andrew Muroyama, John I. Murray, Siew-Loon Ooi, Hoang Pham, Taryn Phippen, Elicia A. Preston, Nikolaus Rajewsky, Gunnar Rätsch, Heidi Rosenbaum, Joel Rozowsky, Kim Rutherford, Peter Ruzanov, Mihail Sarov, Rajkumar Sasidharan, Andrea Sboner, Paul Scheid, Eran Segal, Hyunjin Shin, Chong Shou, Frank J. Slack, Cindie Slightam, Richard Smith, William C. Spencer, E. O. Stinson, Scott Taing, Teruaki Takasaki, Dionne Vafeados, Ksenia Voronina, Guilin Wang, Nicole L. Washington, Christina M. Whittle, Beijing Wu, Koon-Kiu Yan, Georg Zeller, Zheng Zha, Mei Zhong, Xingliang Zhou, modENCODE Consortium, Julie Ahringer, Susan Strome, Kristin C. Gunsalus, Gos Micklem, X. Shirley Liu, Valerie Reinke, Stuart K. Kim, LaDeana W. Hillier, Steven Henikoff, Fabio Piano, Michael Snyder, Lincoln Stein, Jason D. Lieb, and Robert H. Waterston (2010). Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project, Science 330(6012):1775-1787.

RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries

ashish — Tue, 07 Dec 2010 16:06:47 +0000

Abstract

Summary: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools that use this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. Moreover, these tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by this format it also facilitates the decoupling of the alignment of reads from downstream analyses.

Availability and implementation: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/

Download free from publisher

Citation
Lukas Habegger, Andrea Sboner, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein (2011). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries, Bioinformatics 27(2):281-283.

Our paper comparing sequencing and array technologies is online.

ashish — Fri, 18 Jun 2010 13:47:45 +0000

Click here.