Ashish Agarwal » Presentations

Challenges in Genomics Data Visualization

ashish — Mon, 15 Apr 2013 16:29:09 +0000

Genomics, like many fields, is generating data at an ever-increasing rate with the promise to enable personalized medicine, improve agriculture, and advance fundamental research. Data visualization is a key component of the scientific process and often the bottleneck in effective interpretation of analysis results. We’ll begin with a broad introduction to Genomics and then describe several visualization challenges. We categorize the challenges in three domains: those arising purely from the size of the data, those requiring intricate interactivity, and those requiring high resolution 3D rendering. The goal of the talk is to provide a broad overview of visualization challenges in the field of Genomics.

Paul Scheid, Ashish Agarwal, Karl Ward. Challenges in Genomics Data Visualization, Tisch Interactive Telecommunications Program (ITP), New York University, Apr 12, 2013.

Functional Big-Data Genomics

ashish — Tue, 11 Sep 2012 20:11:51 +0000

Abstract
High-throughput genomic sequencing is characterized by large diverse datasets and numerous analysis methods. It is normal for an individual bioinformatician to work with thousands of data files and employ hundreds of distinct computations during the course of a single project. This problem is magnified in “core facilities”, which support multiple researchers working on diverse projects. Most investigators use ad hoc methods to manage this complexity with dire consequences: analyses frequently fail to meet the scientific mandate of reproducibility; improved analysis methods are often not considered because recomputing all downstream steps would be overwhelming; hard drives and CPUs are used sub-optimally; and, in some cases, raw data is lost.

We describe HITSCORE, an OCaml software stack that operates all computational aspects of the Genomics Core Facility at New York University’s Center for Genomics and Systems Biology. HITSCORE has been in production use for one year, and was implemented quickly by fewer than two programmers following design advice from several biologists. A simple domain specific language (DSL) enables generating type safe database bindings and GUI components, and greatly eases updates and migration of our data model. We found a multi-lingual stack too burdensome in a small team setting, and credit OCaml for fulfilling the needs of our full application stack. It has good database bindings, excels at encoding complex domain logic, and now allows construction of rich websites due to the Ocsigen web programming framework. Higher level libraries for distributed computing would be a welcome improvement.

The opportunity to build this system did not stem directly from any strength of functional programming or OCaml. It was necessary for a person with credibility amongst biologists to champion its development, and this credibility took several years to build. Rapid development appears to be the single most important factor in allaying doubts about using a lesser known language, and we will briefly describe our experiences in bringing OCaml to several high profile projects.

Download slides
Video

Citation
Ashish Agarwal, Sebastien Mondet, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Functional Big-Data Genomics. Commercial Users of Functional Programming 2012, Copenhagen, Denmark, Sep 15, 2012.

Biocaml: The OCaml Bioinformatics Library

ashish — Mon, 10 Sep 2012 20:11:38 +0000

Abstract
Biology is an increasingly computational discipline due to rapid advances in experimental techniques, especially DNA sequencing, that are generating data at unprecedented rates. The computational techniques needed range from the complex (e.g. algorithms, distributed computing) to the simple (e.g. scripting, parsing), and there are hundreds of thousands of Biologists now involved in computing. We propose that OCaml can serve virtually the full spectrum of computational tasks needed by Biologists, improving both programmer productivity and computational efficiency. To support this end, we have developed Biocaml.

Biocaml aims to be a standard library for the Biology domain. We provide features that are needed in a broad range of applications and avoid including overly specialized methods. The current feature set can be split into three broad categories: stream parsing/printing of many data formats, data structures for genomics, and access to public data repositories. We will demonstrate how some complex calculations can be performed quite easily with the current API, and describe our efforts to make a uniform API with comprehensive documentation. Finally, there is a BioX library for virtually any programming language X. The most widely used is BioPerl, and we will compare Biocaml with these alternatives.

Biocaml and other OCaml libraries have now been successfully used in multiple high-profile Biology projects (e.g. modENCODE, ENCODE, NYU’s Genomics Core Facility, and others). Some time will be spent discussing the social aspect of bringing a novel language to the Biology community. We will attempt to elucidate strategies that are successful and those that are not. In particular, it will be argued that discussions regarding programming language choices need to be more scientific.

Download slides
Video

Citation
Ashish Agarwal, Sebastien Mondet, Philippe Veber, Christophe Troestler, Francois Berenger. Biocaml: The OCaml Bioinformatics Library. OCaml Users and Developers Meeting 2012, Copenhagen, Denmark, Sep 14, 2012.

Managing and Analyzing Big-Data in Genomics

ashish — Fri, 29 Jun 2012 17:18:00 +0000

Abstract

Biology is an increasingly computational discipline. Rapid advances in experimental techniques, especially DNA sequencing, are generating data at exponentially increasing rates. Aside from the algorithmic challenges this poses, researchers must manage large volumes and innumerable varieties of data, run computational jobs on an HPC cluster, and track the inputs/outputs of the numerous computational tools they employ. Here we describe a software stack fully implemented in OCaml that operates the Genomics Core Facility at NYU’s Center for Genomics and Systems Biology.

We define a domain specific language (DSL) that allows us to easily describe the data we need to track. More importantly, the DSL approach provides us with code generation capabilities. From a single description, we generate PostgreSQL schema definitions, OCaml bindings to the database, and web pages and forms for end-users to interact with the database. Strong type safety is provided at each stage. Database bindings check properties not expressible in SQL, and web pages, forms, and links are validated at compile time by the Ocsigen framework. Since the entire stack depends on this single data description, rapid updates are easy; the compiler informs us of all necessary changes.
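The single-description idea can be illustrated with a minimal OCaml sketch. This is not the actual DSL used at the facility: the types `field_type`, `field`, and `record`, the functions `to_sql` and `to_ocaml`, and the `sample` table are all hypothetical names chosen for illustration. The sketch assumes a data description is just a table name plus typed fields, and derives both a PostgreSQL table definition and a matching OCaml type declaration from it.

```ocaml
(* Hypothetical miniature of a data-description DSL; not the real one. *)
type field_type = Int | Text | Timestamp

type field = { name : string; typ : field_type; nullable : bool }

type record = { table : string; fields : field list }

(* Map a DSL type to a PostgreSQL column type. *)
let sql_type = function
  | Int -> "integer"
  | Text -> "text"
  | Timestamp -> "timestamp"

(* Map a DSL type to an OCaml type name (timestamps as floats, by assumption). *)
let ocaml_type = function
  | Int -> "int"
  | Text -> "string"
  | Timestamp -> "float"

(* Generate a CREATE TABLE statement from one description. *)
let to_sql r =
  let column f =
    Printf.sprintf "%s %s%s" f.name (sql_type f.typ)
      (if f.nullable then "" else " NOT NULL")
  in
  Printf.sprintf "CREATE TABLE %s (%s);" r.table
    (String.concat ", " (List.map column r.fields))

(* Generate the matching OCaml record type, as source text. *)
let to_ocaml r =
  let field f =
    Printf.sprintf "%s : %s%s" f.name (ocaml_type f.typ)
      (if f.nullable then " option" else "")
  in
  Printf.sprintf "type %s = { %s }" r.table
    (String.concat "; " (List.map field r.fields))

(* One description drives both outputs. *)
let sample =
  { table = "sample";
    fields =
      [ { name = "id"; typ = Int; nullable = false };
        { name = "organism"; typ = Text; nullable = true } ] }

let () =
  print_endline (to_sql sample);
  print_endline (to_ocaml sample)
```

Because both outputs come from the same value, adding a field to `sample` changes the schema and the OCaml type together, which is the property that makes compiler-guided updates possible.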

The application launches compute intensive jobs on a high-performance compute (HPC) cluster, requiring consideration of concurrency and fault-tolerance. We have implemented what we call a “flow” monad that combines error and thread monads. Errors are modeled with polymorphic variants, which get arranged automatically into a hierarchical structure from lower level system calls to high level functions. The net result is extremely precise information in the case of any failures and reasonably straightforward concurrency management.
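The error half of such a monad can be sketched in plain OCaml. This is an illustration under stated assumptions, not the implementation described in the talk: the thread half is omitted entirely, and `wrap`, `parse_cores`, `check_limit`, `submit_job`, and all variant tags are hypothetical names. The point is only to show how polymorphic variants let the compiler infer a nested error type as low-level failures are wrapped by higher-level functions.

```ocaml
(* Error half only; a real "flow" monad would also thread concurrency. *)
let return x = Ok x

let ( >>= ) m f =
  match m with
  | Ok x -> f x
  | Error e -> Error e

(* Wrap a lower-level error inside a higher-level variant tag. *)
let wrap tag m =
  match m with
  | Ok x -> Ok x
  | Error e -> Error (tag e)

(* Hypothetical low-level step: parse a requested core count. *)
let parse_cores s =
  match int_of_string_opt s with
  | Some n when n > 0 -> Ok n
  | Some n -> Error (`Non_positive n)
  | None -> Error (`Not_an_int s)

(* Hypothetical low-level step: check the request against a limit. *)
let check_limit ~max n =
  if n <= max then Ok n else Error (`Over_limit (n, max))

(* High-level function; its error type is inferred as a nested variant. *)
let submit_job ~max cores =
  wrap (fun e -> `Bad_request e) (parse_cores cores) >>= fun n ->
  wrap (fun e -> `Scheduler e) (check_limit ~max n) >>= fun n ->
  return (Printf.sprintf "submitted with %d cores" n)

(* Each failure can be reported with its full path through the layers. *)
let describe = function
  | Ok msg -> msg
  | Error (`Bad_request (`Not_an_int s)) -> "bad request: not an int: " ^ s
  | Error (`Bad_request (`Non_positive n)) ->
    Printf.sprintf "bad request: non-positive: %d" n
  | Error (`Scheduler (`Over_limit (n, max))) ->
    Printf.sprintf "scheduler: %d cores exceeds limit %d" n max
```

No error type is declared anywhere in the sketch. The hierarchy seen in `describe` is assembled by type inference, and the compiler warns if a case is left unhandled.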

Download slides

Citation
Sebastien Mondet, Ashish Agarwal, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Managing and Analyzing Big-Data in Genomics. IBM Programming Languages Day 2012, Hawthorne, NY, June 28, 2012.

A Domain Specific Language Stack for Bio HPC

ashish — Fri, 11 May 2012 22:42:07 +0000

We have given several presentations on our DSL approach to the management and analysis of big-data in the field of Biology. Now, with Karl Ward, we have been extending this approach to the systems layer, enabling more robust management and configuration of the hardware and software infrastructure so critical to bioinformatics. Many thanks to Efstratios (Stratos) Efstathiadis for giving us an opportunity to present this work at the first Bio HPC Workshop at NYU’s Langone Medical Center.

Karl Ward, Sebastien Mondet, Ashish Agarwal. A Domain Specific Language Stack for Bio HPC, First Workshop on High Performance Computing for Biomedical Research, Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, May 2012.

Shonan Meeting

ashish — Wed, 09 May 2012 19:11:38 +0000

I’ll be back in Tokyo, this time for the Shonan Meeting on Bridging the Theory of Staged Programming Languages and the Practice of High-Performance Computing. My talk will be on BINQ, a domain-specific language for genomic computations.

A Type Theory for Probability Density Functions

ashish — Tue, 04 Oct 2011 17:20:18 +0000

Abstract

There has been great interest in creating probabilistic programming languages to simplify the coding of statistical tasks; however, there still does not exist a formal language that simultaneously provides (1) continuous probability distributions, (2) the ability to naturally express custom probabilistic models, and (3) probability density functions (PDFs). This collection of features is necessary for mechanizing fundamental statistical techniques. We formalize the first probabilistic language that exhibits these features, and it serves as a foundational framework for extending the ideas to more general languages. Particularly novel are our type system for absolutely continuous (AC) distributions (those which permit PDFs) and our PDF calculation procedure, which calculates PDFs for a large class of AC distributions. Our formalization paves the way toward the rigorous encoding of powerful statistical reformulations.

Download preprint
Published version
Download slides

Citation
Sooraj Bhat, Ashish Agarwal, Richard Vuduc, Alexander Gray (2012). A Type Theory for Probability Density Functions, Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2012. ACM SIGPLAN Notices 47(1):545-556.

Errata
In Figure 10, the P-PLUS rule should be:
\[
\frac{{\Upsilon;\Lambda} \vdash {\varepsilon_1} \perp {\varepsilon_2}
\qquad\{{\Upsilon;\Lambda} \vdash {\varepsilon_i} \leadsto {\delta_i}\}_{i=1,2}}
{{\Upsilon;\Lambda} \vdash {\varepsilon_1+\varepsilon_2} \leadsto {\lambda {x:\mathsf{R}}\centerdot
\int\lambda {t:\mathsf{R}}\centerdot\ \delta_1\ t * \delta_2\ (x - t)}}
\]
The \(t\) and \(x\) were accidentally transposed. Many thanks to Chung-chieh “Ken” Shan for finding this.
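For comparison, and assuming \(\perp\) is read as independence and each \(\delta_i\) as the density of \(\varepsilon_i\), the corrected rule matches the standard convolution formula for the density of a sum of two independent random variables \(X_1\) and \(X_2\) with densities \(f_{X_1}\) and \(f_{X_2}\) (notation introduced here, not taken from the paper):
\[
f_{X_1 + X_2}(x) = \int_{-\infty}^{\infty} f_{X_1}(t)\, f_{X_2}(x - t)\, dt
\]
Here \(x\) is the argument of the resulting density and \(t\) is the variable of integration, as in the corrected rule.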

Presenting at the CScADS Workshop on Autotuning for Petascale Applications

ashish — Thu, 22 Jul 2010 18:22:44 +0000

Thanks to Rich Vuduc for inviting me to give a talk at CScADS. Autotuning is an approach for generating efficient code for high performance computing. I’ll try to summarize how my PL work can contribute to and benefit from this approach.

Slides

IBM PL Day 2010

ashish — Wed, 14 Jul 2010 14:13:56 +0000

Here are the abstract and slides for my talk at IBM PL Day.

Title: Mechanizing Optimization and Statistics
Abstract:

Scientific and engineering investigations are formalized most often in the language of numerical mathematics. The tools supporting this are numerous but disparate, leading to sub-optimal use of existing mathematical theory. We present a unifying framework by taking a programming languages based approach to this problem. Our richly typed language allows naturally declaring optimization and statistics problems, and a library of transformations allows users to interactively compile input problems to solvable forms. We implement our system as a domain specific language embedded in OCaml. Here, we focus on three features: disjunctive constraints, measure types and random variables, and indexing.

By disjunctive constraints, we mean disjunctions over propositions on reals, e.g. \(x \leq w \vee x \geq w + 4.0\). The usual solution strategy involves converting these into mixed-integer linear programming (MILP) constraints using the big-M, convex-hull, or other methods. Automation is clearly needed because these are algebraically tedious and manual application limits them to experts. We provide the first robust implementations and compare our results with those of ILOG CPLEX.
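As a worked illustration of the big-M method on the example above (a textbook-style sketch, not output from the system described in the talk), introduce a binary variable \(y\) and a constant \(M\), both assumptions of this example, with \(M\) chosen large enough to bound \(|x - w| + 4\) over the feasible region:
\[
\begin{aligned}
x - w &\leq M\,(1 - y) \\
w + 4.0 - x &\leq M\,y \\
y &\in \{0, 1\}
\end{aligned}
\]
Setting \(y = 1\) enforces \(x \leq w\) and relaxes the second constraint. Setting \(y = 0\) enforces \(x \geq w + 4.0\) and relaxes the first. The quality of the resulting relaxation depends on how tightly \(M\) is chosen, which is one reason manual application is error-prone.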

Statistics is increasingly important due to the increasing amount of data generated in the sciences. We introduce language features that enable declarative expression of statistical models and estimation problems. A type ‘prob T’ characterizes probability measures over type T, a special let binding introduces random variables, and some standard measures (e.g. Normal, Gaussian) can be used to construct more complex ones. We demonstrate with an example how our software facilitates exploring the large space of algorithms for solving statistical problems.

Finally, matrices are accepted canonical forms in mathematics, but practitioners employ a more flexible indexing notation: e.g. \(\forall i \in \{A,B,C\} \quad x_i \leq w_i\). Especially in optimization, this need is so critical that virtually every tool supports it. However, indexing has been treated as a mere syntactic convenience and is eliminated at parse time. We present a dependently typed theory that enables far richer index sets to be expressed. Importantly, our theory brings indexing into the formal realm, providing an O(n) to O(1) reduction in memory requirements and the potential for a corresponding computational improvement.

Download slides

Toward Interactive Statistical Modeling

ashish — Sat, 27 Mar 2010 22:39:22 +0000

Abstract

When solving machine learning problems, there is currently little automated support for easily experimenting with alternative statistical models or solution strategies. This is because this activity often requires expertise from several different fields (e.g., statistics, optimization, linear algebra), and the level of formalism required for automation is much higher than for a human solving problems on paper. We present a system toward addressing these issues, which we achieve by (1) formalizing a type theory for probability and optimization, and (2) providing an interactive rewrite system for applying problem reformulation theorems. Automating solution strategies this way enables not only manual experimentation but also higher-level, automated activities, such as autotuning.

Download from publisher
Presentation slides

Citation
Sooraj Bhat, Ashish Agarwal, Alexander Gray, Richard Vuduc (2010). Toward Interactive Statistical Modeling, In Procedia Computer Science, International Conference on Computational Science ICCS 2010, 1(1): 1829-1838.