Challenges in Genomics Data Visualization

Genomics, like many fields, is generating data at an ever-increasing rate, with the promise of enabling personalized medicine, improving agriculture, and advancing fundamental research. Data visualization is a key component of the scientific process and often the bottleneck in effective interpretation of analysis results. We’ll begin with a broad introduction to Genomics and then describe several visualization challenges. We categorize the challenges into three domains: those arising purely from the size of the data, those requiring intricate interactivity, and those requiring high-resolution 3D rendering. The goal of the talk is to provide a broad overview of visualization challenges in the field of Genomics.

Paul Scheid, Ashish Agarwal, Karl Ward. Challenges in Genomics Data Visualization, Tisch Interactive Telecommunications Program (ITP), New York University, Apr 12, 2013.

Posted in Presentations

A Validated Regulatory Network for Th17 Cell Specification

Th17 cells have critical roles in mucosal defense and are major contributors to inflammatory disease. Their differentiation requires the nuclear hormone receptor RORγt working with multiple other essential transcription factors (TFs). We have used an iterative systems approach, combining genome-wide TF occupancy, expression profiling of TF mutants, and expression time series to delineate the Th17 global transcriptional regulatory network. We find that cooperatively bound BATF and IRF4 contribute to initial chromatin accessibility and, with STAT3, initiate a transcriptional program that is then globally tuned by the lineage-specifying TF RORγt, which plays a focal deterministic role at key loci. Integration of multiple data sets allowed inference of an accurate predictive model that we computationally and experimentally validated, identifying multiple new Th17 regulators, including Fosl2, a key determinant of cellular plasticity. This interconnected network can be used to investigate new therapeutic approaches to manipulate Th17 functions in the setting of inflammatory disease.

Full article from publisher

Maria Ciofani, Aviv Madar, Carolina Galan, MacLean Sellars, Kieran Mace, Florencia Pauli, Ashish Agarwal, Wendy Huang, Christopher N. Parkhurst, Michael Muratet, Kim M. Newberry, Sarah Meadows, Alex Greenfield, Yi Yang, Preti Jain, Francis K. Kirigin, Carmen Birchmeier, Erwin F. Wagner, Kenneth M. Murphy, Richard M. Myers, Richard Bonneau, and Dan R. Littman (2012). A Validated Regulatory Network for Th17 Cell Specification, Cell 151:1-15.

Posted in Publications

On my way to ICFP/CUFP/OUD and more in Copenhagen

I’m on my way to ICFP, CUFP, OUD, and more in Copenhagen. Especially interesting to me this year are the several biology-related talks and events:

  • ICFP: Sneaking Around concatMap — Efficient Combinators for Dynamic Programming, Christian Höner zu Siederdissen (University of Vienna, Vienna, Austria)
  • ICFP: Experience Report: Haskell in Computational Biology,
    Noah M. Daniels, Andrew Gallant, and Norman Ramsey (Tufts University)
  • OUD: Biocaml: The OCaml Bioinformatics Library, Ashish Agarwal, Sebastien Mondet, Philippe Veber, Christophe Troestler, Francois Berenger
  • CUFP BoF: FP in the Life Sciences
  • CUFP: Functional Big-Data Genomics, Ashish Agarwal, Sebastien Mondet, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus
  • CUFP: Microsoft: Using F# to Prove Stabilization of Biological Networks, Samin Ishtiaq
  • CUFP: IntelliFactory: Developing an F# Bioinformatics Application with HTML5 Visualization, Adam Granicz
  • CUFP: factis research: Developing Medical Software in Scala and Haskell, Stefan Wehr

Quite an impressive list! Hopefully I’m not being too optimistic in thinking we just might move Biology past the dark ages of Perl.

Posted in News

Functional Big-Data Genomics

High-throughput genomic sequencing is characterized by large diverse datasets and numerous analysis methods. It is normal for an individual bioinformatician to work with thousands of data files and employ hundreds of distinct computations during the course of a single project. This problem is magnified in “core facilities”, which support multiple researchers working on diverse projects. Most investigators use ad hoc methods to manage this complexity with dire consequences: analyses frequently fail to meet the scientific mandate of reproducibility; improved analysis methods are often not considered because recomputing all downstream steps would be overwhelming; hard drives and CPUs are used sub-optimally; and, in some cases, raw data is lost.

We describe HITSCORE, an OCaml software stack that operates all computational aspects of the Genomics Core Facility at New York University’s Center for Genomics and Systems Biology. HITSCORE has been in production use for one year, and was implemented quickly by fewer than two programmers following design advice from several biologists. A simple domain-specific language (DSL) enables generating type-safe database bindings and GUI components, and greatly eases updates and migration of our data model. We found a multi-lingual stack too burdensome in a small team setting, and credit OCaml for fulfilling the needs of our full application stack. It has good database bindings, excels at encoding complex domain logic, and now allows construction of rich websites thanks to the Ocsigen web programming framework. Higher-level libraries for distributed computing would be a welcome improvement.

The opportunity to build this system did not stem directly from any strength of functional programming or OCaml. It was necessary for a person with credibility amongst biologists to champion its development, and this credibility took several years to build. Rapid development appears to be the single most important factor in allaying doubts about using a lesser known language, and we will briefly describe our experiences in bringing OCaml to several high profile projects.

Download slides

Ashish Agarwal, Sebastien Mondet, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Functional Big-Data Genomics. Commercial Users of Functional Programming 2012, Copenhagen, Denmark, Sep 15, 2012.

Posted in Presentations

Biocaml: The OCaml Bioinformatics Library

Biology is an increasingly computational discipline due to rapid advances in experimental techniques, especially DNA sequencing, that are generating data at unprecedented rates. The computational techniques needed range from the complex (e.g. algorithms, distributed computing) to the simple (e.g. scripting, parsing), and there are hundreds of thousands of Biologists now involved in computing. We propose that OCaml can serve virtually the full spectrum of computational tasks needed by Biologists, improving both programmer productivity and computational efficiency. To support this end, we have developed Biocaml.

Biocaml aims to be a standard library for the Biology domain. We provide features that are needed in a broad range of applications and avoid including overly specialized methods. The current feature set can be split into three broad categories: stream parsing/printing of many data formats, data structures for genomics, and access to public data repositories. We will demonstrate how some complex calculations can be performed quite easily with the current API, and describe our efforts to make a uniform API with comprehensive documentation. Finally, there is a BioX library for virtually every programming language X. The most widely used is BioPerl, and we will compare Biocaml with these alternatives.

Biocaml and other OCaml libraries have now been successfully used in multiple high-profile Biology projects (e.g. modENCODE, ENCODE, NYU’s Genomics Core Facility, and others). Some time will be spent discussing the social aspect of bringing a novel language to the Biology community. We will attempt to elucidate strategies that are successful and those that are not. In particular, it will be argued that discussions regarding programming language choices need to be more scientific.

Download slides

Ashish Agarwal, Sebastien Mondet, Philippe Veber, Christophe Troestler, Francois Berenger. Biocaml: The OCaml Bioinformatics Library. OCaml Users and Developers Meeting 2012, Copenhagen, Denmark, Sep 14, 2012.

Posted in Presentations

Managing and Analyzing Big-Data in Genomics


Biology is an increasingly computational discipline. Rapid advances in experimental techniques, especially DNA sequencing, are generating data at exponentially increasing rates. Aside from the algorithmic challenges this poses, researchers must manage large volumes and innumerable varieties of data, run computational jobs on an HPC cluster, and track the inputs/outputs of the numerous computational tools they employ. Here we describe a software stack fully implemented in OCaml that operates the Genomics Core Facility at NYU’s Center for Genomics and Systems Biology.

We define a domain specific language (DSL) that allows us to easily describe the data we need to track. More importantly, the DSL approach provides us with code generation capabilities. From a single description, we generate PostgreSQL schema definitions, OCaml bindings to the database, and web pages and forms for end-users to interact with the database. Strong type safety is provided at each stage. Database bindings check properties not expressible in SQL, and web pages, forms, and links are validated at compile time by the Ocsigen framework. Since the entire stack depends on this single data description, rapid updates are easy; the compiler informs us of all necessary changes.
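
To make the single-description idea concrete, here is a minimal sketch in OCaml using only the standard library. It is not the HITSCORE DSL: the types `field_type`, `field`, and `record`, the functions `to_sql` and `to_ocaml`, and the `sample` table are all hypothetical names chosen for illustration. The sketch covers only two of the generated artifacts (a PostgreSQL table definition and an OCaml record type) and leaves out web pages and forms.

```ocaml
(* Hypothetical mini-DSL: none of these names come from HITSCORE. *)
type field_type = Int | Text | Timestamp | Ref of string

type field = { name : string; typ : field_type; nullable : bool }

type record = { table : string; fields : field list }

(* Map a DSL type to a PostgreSQL column type. *)
let sql_type = function
  | Int -> "integer"
  | Text -> "text"
  | Timestamp -> "timestamp"
  | Ref target -> Printf.sprintf "integer REFERENCES %s(id)" target

(* Map a DSL type to the OCaml type used in the generated record. *)
let ocaml_type = function
  | Int | Ref _ -> "int"
  | Text -> "string"
  | Timestamp -> "float"

let sql_column f =
  Printf.sprintf "  %s %s%s" f.name (sql_type f.typ)
    (if f.nullable then "" else " NOT NULL")

(* Generate a CREATE TABLE statement from one record description. *)
let to_sql r =
  let columns = "  id serial PRIMARY KEY" :: List.map sql_column r.fields in
  Printf.sprintf "CREATE TABLE %s (\n%s\n);" r.table
    (String.concat ",\n" columns)

let ocaml_field f =
  Printf.sprintf "  %s : %s%s;" f.name (ocaml_type f.typ)
    (if f.nullable then " option" else "")

(* Generate the matching OCaml record type from the same description. *)
let to_ocaml r =
  Printf.sprintf "type %s = {\n  id : int;\n%s\n}" r.table
    (String.concat "\n" (List.map ocaml_field r.fields))

(* An illustrative description; the table and its fields are made up. *)
let sample =
  { table = "sample";
    fields =
      [ { name = "name"; typ = Text; nullable = false };
        { name = "organism"; typ = Text; nullable = true };
        { name = "library_id"; typ = Ref "library"; nullable = false } ] }

let () =
  print_endline (to_sql sample);
  print_endline (to_ocaml sample)
```

Because both outputs are derived from the same `sample` value, adding or renaming a field changes the schema and the OCaml type together, which is the property the paragraph above relies on.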

The application launches compute-intensive jobs on a high-performance compute (HPC) cluster, requiring consideration of concurrency and fault-tolerance. We have implemented what we call a “flow” monad that combines error and thread monads. Errors are modeled with polymorphic variants, which get arranged automatically into a hierarchical structure from lower-level system calls to high-level functions. The net result is extremely precise information in the case of any failures and reasonably straightforward concurrency management.
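
The error half of this idea can be sketched with the standard library alone. The code below is an assumption-laden illustration, not the actual “flow” monad: it omits the thread monad entirely, and `wrap`, `read_file`, `parse_count`, `load_count`, and all of the error tags are hypothetical names. It shows how polymorphic variants let low-level errors nest inside higher-level ones without any type declarations.

```ocaml
(* Minimal sketch of an error monad whose errors are polymorphic variants.
   The thread (concurrency) half of the "flow" monad is omitted here. *)
let ( >>= ) m f = match m with Ok x -> f x | Error e -> Error e

(* Re-tag a low-level error so that it nests inside a higher-level one. *)
let wrap tag m = match m with Ok x -> Ok x | Error e -> Error (tag e)

(* Hypothetical low-level step: read a whole file, reporting system errors. *)
let read_file path =
  try
    let ic = open_in path in
    let n = in_channel_length ic in
    let s = really_input_string ic n in
    close_in ic;
    Ok s
  with Sys_error msg -> Error (`Sys_error msg)

(* Hypothetical low-level step: parse a non-negative count. *)
let parse_count s =
  match int_of_string_opt (String.trim s) with
  | Some n when n >= 0 -> Ok n
  | Some n -> Error (`Negative_count n)
  | None -> Error (`Not_an_integer s)

(* High-level function: each step's errors are tagged with their origin,
   and the compiler infers the resulting hierarchy of variants. *)
let load_count path =
  wrap (fun e -> `Read_failed (path, e)) (read_file path) >>= fun contents ->
  wrap (fun e -> `Parse_failed e) (parse_count contents)

(* No wildcard case: if a new error tag is added upstream, this match
   stops type-checking until the new case is handled. *)
let describe = function
  | Ok n -> Printf.sprintf "count = %d" n
  | Error (`Read_failed (path, `Sys_error msg)) ->
    Printf.sprintf "cannot read %s: %s" path msg
  | Error (`Parse_failed (`Negative_count n)) ->
    Printf.sprintf "negative count: %d" n
  | Error (`Parse_failed (`Not_an_integer s)) ->
    Printf.sprintf "not an integer: %S" s

(* "count.txt" is a placeholder path for illustration. *)
let () = print_endline (describe (load_count "count.txt"))
```

In a full version, the `result` values would presumably live inside a thread monad so that steps can also run concurrently; that layer is deliberately left out to keep the sketch free of external libraries.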

Download slides

Sebastien Mondet, Ashish Agarwal, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Managing and Analyzing Big-Data in Genomics. IBM Programming Languages Day 2012, Hawthorne, NY, June 28, 2012.

Posted in Presentations

A Domain Specific Language Stack for Bio HPC

We have given several presentations on our DSL approach to the management and analysis of big-data in the field of Biology. Now, with Karl Ward, we have been extending this approach to the systems layer, enabling more robust management and configuration of the hardware and software infrastructure so critical to bioinformatics. Many thanks to Efstratios (Stratos) Efstathiadis for giving us an opportunity to present this work at the first Bio HPC Workshop at NYU’s Langone Medical Center.

Karl Ward, Sebastien Mondet, Ashish Agarwal. A Domain Specific Language Stack for Bio HPC, First Workshop on High Performance Computing for Biomedical Research, Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, May 2012.

Posted in Presentations

Shonan Meeting

I’ll be back in Tokyo, this time for the Shonan Meeting on Bridging the Theory of Staged Programming Languages and the Practice of High-Performance Computing. My talk will be on BINQ, a domain-specific language for genomic computations.

Posted in News, Presentations

A Comparison of Single-cell RNA-seq with Gene Expression Microarrays


Single-cell RNA-sequencing (SCRS) is a powerful technique to address biological variation by profiling expression in single cells and samples with low RNA input. Previous studies have shown that gene expression is highly correlated across array and standard RNA-sequencing technologies. However, no comparative studies utilizing SCRS and the same starting sample across platforms have been reported.

We compared expression data from tiling arrays and SCRS on RNA harvested from embryonic cholinergic motor neurons (dorsal A), embryonic coelomocytes (macrophage-like cells), and larval dopaminergic neurons from C. elegans. Picogram quantities of total RNA from each sample were amplified using a single-cell protocol to generate double-stranded cDNAs, which were then sequenced with the Illumina HiSeq platform. The same RNA samples from each cell type were previously amplified using the NuGEN WT-Ovation Pico protocol and hybridized to Affymetrix tiling arrays.

We compared log2 FPKM counts for each gene with the corresponding RMA-normalized log2 array expression values. These two independent measures of transcript expression are highly correlated (Spearman correlation = 0.75 for coelomocytes, 0.62 for A-class motor neurons, and 0.68 for dopaminergic neurons). Moreover, SCRS data showed several hundred genes that are significantly enriched in each cell type in comparison with existing whole animal RNA-seq from the same developmental stage, and these significantly overlap genes detected as enriched from the tiling array data (p < 5.38e-36 for all three sets).
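
For readers unfamiliar with the statistic, the following OCaml sketch shows one way to compute a Spearman correlation between log2 FPKM values and log2 array values, using only the standard library. It is not the analysis code behind the poster. The pseudocount of 1 in `log2_fpkm` is an assumption (the text does not say how zero counts were handled), and the five input values are made up for illustration.

```ocaml
(* Average ranks (1-based); tied values share the mean of their ranks. *)
let ranks xs =
  let n = Array.length xs in
  let order = Array.init n (fun i -> i) in
  Array.sort (fun i j -> compare xs.(i) xs.(j)) order;
  let r = Array.make n 0.0 in
  let i = ref 0 in
  while !i < n do
    (* Find the end of the run of values tied with xs.(order.(!i)). *)
    let j = ref !i in
    while !j + 1 < n && xs.(order.(!j + 1)) = xs.(order.(!i)) do incr j done;
    let avg = float_of_int (!i + !j) /. 2.0 +. 1.0 in
    for k = !i to !j do r.(order.(k)) <- avg done;
    i := !j + 1
  done;
  r

let mean xs = Array.fold_left ( +. ) 0.0 xs /. float_of_int (Array.length xs)

(* Pearson correlation of two equal-length arrays. *)
let pearson xs ys =
  let mx = mean xs and my = mean ys in
  let sxy = ref 0.0 and sxx = ref 0.0 and syy = ref 0.0 in
  Array.iteri
    (fun i x ->
      let dx = x -. mx and dy = ys.(i) -. my in
      sxy := !sxy +. dx *. dy;
      sxx := !sxx +. dx *. dx;
      syy := !syy +. dy *. dy)
    xs;
  !sxy /. sqrt (!sxx *. !syy)

(* Spearman correlation is the Pearson correlation of the ranks. *)
let spearman xs ys = pearson (ranks xs) (ranks ys)

(* log2 of FPKM with a pseudocount so that zero counts stay finite.
   The pseudocount of 1 is an assumption, not taken from the poster. *)
let log2_fpkm x = log (x +. 1.0) /. log 2.0

let () =
  (* Made-up values for five genes, for illustration only. *)
  let fpkm = [| 0.0; 3.5; 120.0; 15.2; 870.0 |] in
  let array_log2 = [| 2.1; 3.0; 7.9; 8.4; 11.2 |] in
  Printf.printf "Spearman = %.2f\n"
    (spearman (Array.map log2_fpkm fpkm) array_log2)
```

Because Spearman correlation depends only on ranks, the log transform and the choice of pseudocount do not change the result; they matter only if the same values are also plotted or compared with a Pearson correlation.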

In sum, the correlation of SCRS to tiling array is high, similar to published comparisons between microarrays and standard RNA-seq where at least one thousand-fold more starting material was used. These results suggest that single-cell RNA-sequencing is a robust tool for gene expression quantification and transcriptome profiling when input material is limiting.

Download poster

Paul Scheid, Clay Spencer, Michelle Gutwein, Ashish Agarwal, Kristin C. Gunsalus, David Miller III. A Comparison of Single-cell RNA-seq with Gene Expression Microarrays. Advances in Genome Biology and Technology, Marco Island, FL, Feb 2012.

Posted in Posters

On my way to POPL

I’ll be at POPL through Saturday. See you there!

Posted in News