Ashish Agarwal » OCaml

Functional Big-Data Genomics

ashish — Tue, 11 Sep 2012 20:11:51 +0000

Abstract
High-throughput genomic sequencing is characterized by large diverse datasets and numerous analysis methods. It is normal for an individual bioinformatician to work with thousands of data files and employ hundreds of distinct computations during the course of a single project. This problem is magnified in “core facilities”, which support multiple researchers working on diverse projects. Most investigators use ad hoc methods to manage this complexity with dire consequences: analyses frequently fail to meet the scientific mandate of reproducibility; improved analysis methods are often not considered because recomputing all downstream steps would be overwhelming; hard drives and CPUs are used sub-optimally; and, in some cases, raw data is lost.

We describe HITSCORE, an OCaml software stack that operates all computational aspects of the Genomics Core Facility at New York University’s Center for Genomics and Systems Biology. HITSCORE has been in production use for one year, and was implemented quickly by fewer than two programmers following design advice from several biologists. A simple domain-specific language (DSL) enables generating type-safe database bindings and GUI components, and greatly eases updates and migration of our data model. We found a multi-lingual stack too burdensome in a small team setting, and credit OCaml for fulfilling the needs of our full application stack. It has good database bindings, excels at encoding complex domain logic, and now allows construction of rich websites due to the Ocsigen web programming framework. Higher-level libraries for distributed computing would be a welcome improvement.

The opportunity to build this system did not stem directly from any strength of functional programming or OCaml. It was necessary for a person with credibility amongst biologists to champion its development, and this credibility took several years to build. Rapid development appears to be the single most important factor in allaying doubts about using a lesser-known language, and we will briefly describe our experiences in bringing OCaml to several high-profile projects.

Download slides
Video

Citation
Ashish Agarwal, Sebastien Mondet, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Functional Big-Data Genomics. Commercial Users of Functional Programming 2012, Copenhagen, Denmark, Sep 15, 2012.

Biocaml: The OCaml Bioinformatics Library

ashish — Mon, 10 Sep 2012 20:11:38 +0000

Abstract
Biology is an increasingly computational discipline due to rapid advances in experimental techniques, especially DNA sequencing, that are generating data at unprecedented rates. The computational techniques needed range from the complex (e.g. algorithms, distributed computing) to the simple (e.g. scripting, parsing), and there are hundreds of thousands of Biologists now involved in computing. We propose that OCaml can serve virtually the full spectrum of computational tasks needed by Biologists, improving both programmer productivity and computational efficiency. To support this end, we have developed Biocaml.

Biocaml aims to be a standard library for the Biology domain. We provide features that are needed in a broad range of applications and avoid including overly specialized methods. The current feature set can be split into three broad categories: stream parsing/printing of many data formats, data structures for genomics, and access to public data repositories. We will demonstrate how some complex calculations can be performed quite easily with the current API, and describe our efforts to make a uniform API with comprehensive documentation. Finally, BioX libraries exist for many programming languages X, the most widely used being BioPerl, and we will compare Biocaml with these alternatives.
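To make "stream parsing" concrete, here is a minimal sketch of a lazy FASTA parser written against the OCaml standard library only. It is not Biocaml's API: the type fasta and the function parse_fasta are hypothetical names chosen for illustration, and the handling of malformed input (blank lines and sequence data before any header are skipped) is an assumption.

```ocaml
(* Illustrative only: NOT Biocaml's API. All names are hypothetical. *)
type fasta = { header : string; sequence : string }

(* Lazily group a stream of lines into FASTA records. Records are produced
   on demand, so the whole file never needs to be held in memory. *)
let parse_fasta (lines : string Seq.t) : fasta Seq.t =
  (* A record in progress is its header plus the reversed sequence chunks. *)
  let finish (header, chunks) =
    { header; sequence = String.concat "" (List.rev chunks) }
  in
  let rec go current lines () =
    match lines () with
    | Seq.Nil ->
      (match current with
       | None -> Seq.Nil
       | Some c -> Seq.Cons (finish c, Seq.empty))
    | Seq.Cons (line, rest) ->
      if line = "" then go current rest ()
      else if line.[0] = '>' then begin
        (* A new header closes the record in progress, if any. *)
        let header = String.sub line 1 (String.length line - 1) in
        let next = go (Some (header, [])) rest in
        match current with
        | None -> next ()
        | Some c -> Seq.Cons (finish c, next)
      end
      else
        match current with
        | None -> go None rest ()  (* data before any header: skipped *)
        | Some (h, chunks) -> go (Some (h, line :: chunks)) rest ()
  in
  go None lines

let records =
  List.of_seq (parse_fasta (List.to_seq [ ">a"; "AC"; "GT"; ">b"; "TT" ]))

let () =
  List.iter (fun r -> Printf.printf "%s\t%s\n" r.header r.sequence) records
```

Because the result is a Seq.t, a consumer can filter or fold over records one at a time, which is the property that matters for large sequencing files.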

Biocaml and other OCaml libraries have now been successfully used in multiple high-profile Biology projects (e.g. modENCODE, ENCODE, NYU’s Genomics Core Facility, and others). Some time will be spent discussing the social aspect of bringing a novel language to the Biology community. We will attempt to elucidate strategies that are successful and those that are not. In particular, it will be argued that discussions regarding programming language choices need to be more scientific.

Download slides
Video

Citation
Ashish Agarwal, Sebastien Mondet, Philippe Veber, Christophe Troestler, Francois Berenger. Biocaml: The OCaml Bioinformatics Library. OCaml Users and Developers Meeting 2012, Copenhagen, Denmark, Sep 14, 2012.

Managing and Analyzing Big-Data in Genomics

ashish — Fri, 29 Jun 2012 17:18:00 +0000

Abstract

Biology is an increasingly computational discipline. Rapid advances in experimental techniques, especially DNA sequencing, are generating data at exponentially increasing rates. Aside from the algorithmic challenges this poses, researchers must manage large volumes and innumerable varieties of data, run computational jobs on an HPC cluster, and track the inputs/outputs of the numerous computational tools they employ. Here we describe a software stack fully implemented in OCaml that operates the Genomics Core Facility at NYU’s Center for Genomics and Systems Biology.

We define a domain-specific language (DSL) that allows us to easily describe the data we need to track. More importantly, the DSL approach provides us with code generation capabilities. From a single description, we generate PostgreSQL schema definitions, OCaml bindings to the database, and web pages and forms for end-users to interact with the database. Strong type safety is provided at each stage. Database bindings check properties not expressible in SQL, and web pages, forms, and links are validated at compile time by the Ocsigen framework. Since the entire stack depends on this single data description, rapid updates are easy; the compiler informs us of all necessary changes.
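The idea of generating several artifacts from one data description can be sketched in a few lines of OCaml. This is a toy illustration and not the DSL used in the facility's stack: the types field_type, field, and record, the generators to_sql and to_ocaml, and the sample record are all hypothetical, and web form generation is omitted.

```ocaml
(* Hypothetical miniature of a data-description DSL; every name here is
   illustrative and not taken from the actual system. *)
type field_type =
  | Int
  | Text
  | Timestamp
  | Ref of string  (* foreign key to another record, by record name *)

type field = { name : string; typ : field_type; nullable : bool }

type record = { record_name : string; fields : field list }

(* Map a DSL type to a PostgreSQL column type. *)
let sql_type = function
  | Int -> "integer"
  | Text -> "text"
  | Timestamp -> "timestamp"
  | Ref target -> Printf.sprintf "integer REFERENCES %s (id)" target

(* Generate a CREATE TABLE statement from one record description. *)
let to_sql { record_name; fields } =
  let column f =
    Printf.sprintf "  %s %s%s" f.name (sql_type f.typ)
      (if f.nullable then "" else " NOT NULL")
  in
  let columns = "  id serial PRIMARY KEY" :: List.map column fields in
  Printf.sprintf "CREATE TABLE %s (\n%s\n);" record_name
    (String.concat ",\n" columns)

(* Map a DSL field to the OCaml type used in the generated binding. *)
let ocaml_type f =
  let base =
    match f.typ with
    | Int | Ref _ -> "int"
    | Text -> "string"
    | Timestamp -> "float"
  in
  if f.nullable then base ^ " option" else base

(* Generate an OCaml record type from the same description. *)
let to_ocaml { record_name; fields } =
  let field f = Printf.sprintf "  %s : %s;" f.name (ocaml_type f) in
  Printf.sprintf "type %s = {\n  id : int;\n%s\n}" record_name
    (String.concat "\n" (List.map field fields))

let sample =
  { record_name = "sample";
    fields =
      [ { name = "label"; typ = Text; nullable = false };
        { name = "library"; typ = Ref "library"; nullable = true } ] }

let () =
  print_endline (to_sql sample);
  print_endline (to_ocaml sample)
```

Since both outputs derive from the single value sample, adding or renaming a field changes the schema and the OCaml type together, and any code using the old type stops compiling.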

The application launches compute-intensive jobs on a high-performance compute (HPC) cluster, requiring consideration of concurrency and fault-tolerance. We have implemented what we call a “flow” monad that combines error and thread monads. Errors are modeled with polymorphic variants, which get arranged automatically into a hierarchical structure from lower-level system calls to high-level functions. The net result is extremely precise information in the case of any failures and reasonably straightforward concurrency management.
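A minimal sketch of such a combined monad follows; it is one plausible shape, not the implementation described in the talk. The module names THREAD, Flow, and Identity, the function bind_on_error, and the in-memory files table are assumptions made for illustration. A real deployment would plug a cooperative-thread library into the functor, whereas the sketch uses an identity monad so that it runs with the standard library alone. Error nesting is done by hand here rather than automatically.

```ocaml
(* Minimal interface of a thread monad (hypothetical signature). *)
module type THREAD = sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
end

(* A flow is a threaded computation that yields a value or an error. *)
module Flow (T : THREAD) = struct
  type ('a, 'e) t = ('a, 'e) result T.t

  let return x = T.return (Ok x)
  let error e = T.return (Error e)

  (* Sequence two flows; the first error short-circuits the rest. *)
  let bind m f =
    T.bind m (function Ok x -> f x | Error e -> T.return (Error e))

  (* Handle an error, typically to wrap it in a higher-level one. *)
  let bind_on_error m f =
    T.bind m (function Ok x -> T.return (Ok x) | Error e -> f e)

  let ( >>= ) = bind
end

(* Stand-in thread monad so the sketch needs no external library. *)
module Identity = struct
  type 'a t = 'a
  let return x = x
  let bind x f = f x
end

module F = Flow (Identity)
open F

(* Simulated file system, to keep the example deterministic. *)
let files = [ ("lanes.txt", "8"); ("bad.txt", "eight") ]

(* Low-level steps: errors are plain polymorphic variants. *)
let read_file path =
  match List.assoc_opt path files with
  | Some text -> return text
  | None -> error (`File_not_found path)

let parse_count s =
  match int_of_string_opt s with
  | Some n -> return n
  | None -> error (`Not_an_int s)

(* High-level step: nests the low-level errors under its own tag. *)
let load_lane_count path =
  bind_on_error
    (read_file path >>= fun text -> parse_count text)
    (fun e -> error (`Load_lane_count (path, e)))

let () =
  match load_lane_count "bad.txt" with
  | Ok n -> Printf.printf "lanes: %d\n" n
  | Error (`Load_lane_count (path, `File_not_found _)) ->
    Printf.printf "%s: file not found\n" path
  | Error (`Load_lane_count (path, `Not_an_int s)) ->
    Printf.printf "%s: not an integer: %s\n" path s
```

The compiler infers the union of the low-level error tags without any declaration, and the final match must cover each nested case, which is where the precision of failure information comes from.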

Download slides

Citation
Sebastien Mondet, Ashish Agarwal, Paul Scheid, Aviv Madar, Richard Bonneau, Jane Carlton, Kristin C. Gunsalus. Managing and Analyzing Big-Data in Genomics. IBM Programming Languages Day 2012, Hawthorne, NY, June 28, 2012.

logit – a simple tool to date-stamp files into a log directory

ashish — Wed, 21 Apr 2010 18:18:02 +0000

http://github.com/agarwal/logit

Toward Interactive Statistical Modeling

ashish — Sat, 27 Mar 2010 22:39:22 +0000

Abstract

When solving machine learning problems, there is currently little automated support for easily experimenting with alternative statistical models or solution strategies. This is because this activity often requires expertise from several different fields (e.g., statistics, optimization, linear algebra), and the level of formalism required for automation is much higher than for a human solving problems on paper. We present a system toward addressing these issues, which we achieve by (1) formalizing a type theory for probability and optimization, and (2) providing an interactive rewrite system for applying problem reformulation theorems. Automating solution strategies this way enables not only manual experimentation but also higher-level, automated activities, such as autotuning.

Download from publisher
Presentation slides

Citation
Sooraj Bhat, Ashish Agarwal, Alexander Gray, Richard Vuduc (2010). Toward Interactive Statistical Modeling, in Procedia Computer Science, International Conference on Computational Science ICCS 2010, 1(1): 1829-1838.

Automating Mathematical Program Transformations

ashish — Mon, 18 Jan 2010 19:13:17 +0000

Abstract

Mathematical programs (MPs) are a class of constrained optimization problems that include linear, mixed-integer, and disjunctive programs. Strategies for solving MPs rely heavily on various transformations between these subclasses, but most are not automated because MP theory does not presently treat programs as syntactic objects. In this work, we present the first syntactic definition of MP and of some widely used MP transformations, most notably the big-M and convex hull methods for converting disjunctive constraints. We use an embedded OCaml DSL on problems from chemical process engineering and operations research to compare our automated transformations to existing technology (finding that no one technique is always best) and also to manual reformulations (finding that our mechanizations are comparable to human experts). This work enables higher-level solution strategies that can use these transformations as subroutines.
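As a rough illustration of treating a program as a syntactic object, the textbook big-M transformation can be written as a function over a tiny constraint syntax. This sketch is not the paper's DSL: the types lin_le and disjunction, the function big_m, and the selector names y1, y2, ... are hypothetical. It assumes the caller supplies a constant m large enough to make each relaxed constraint redundant over the feasible region, and that the generated selector names do not clash with existing variables.

```ocaml
(* A linear constraint: sum of coef * var over [terms] <= [rhs]. *)
type lin_le = { terms : (float * string) list; rhs : float }

(* A disjunction: at least one disjunct (a conjunction of constraints)
   must hold. *)
type disjunction = lin_le list list

(* Textbook big-M: introduce one binary selector y_i per disjunct and
   return the mixed-integer constraints plus the list of new binaries. *)
let big_m ~m (d : disjunction) =
  let binaries = List.mapi (fun i _ -> Printf.sprintf "y%d" (i + 1)) d in
  (* a.x <= b becomes a.x + m*y_i <= b + m, i.e. a.x <= b + m*(1 - y_i):
     enforced when y_i = 1, relaxed by m when y_i = 0. *)
  let relax y c = { terms = c.terms @ [ (m, y) ]; rhs = c.rhs +. m } in
  let relaxed =
    List.concat (List.map2 (fun y cs -> List.map (relax y) cs) binaries d)
  in
  (* Exactly one selector is on: sum y_i <= 1 and -(sum y_i) <= -1. *)
  let pick_one =
    [ { terms = List.map (fun y -> (1.0, y)) binaries; rhs = 1.0 };
      { terms = List.map (fun y -> (-1.0, y)) binaries; rhs = -1.0 } ]
  in
  (relaxed @ pick_one, binaries)

(* Render a constraint as text. *)
let show c =
  let term (k, v) = Printf.sprintf "%g*%s" k v in
  Printf.sprintf "%s <= %g" (String.concat " + " (List.map term c.terms)) c.rhs

(* Example: x <= 2  OR  x >= 5 (written as -x <= -5), with m = 10. *)
let example : disjunction =
  [ [ { terms = [ (1.0, "x") ]; rhs = 2.0 } ];
    [ { terms = [ (-1.0, "x") ]; rhs = -5.0 } ] ]

let constraints, binaries = big_m ~m:10.0 example

let () =
  List.iter (fun c -> print_endline (show c)) constraints;
  Printf.printf "binary: %s\n" (String.concat ", " binaries)
```

The convex hull method would be a second function over the same syntax, trading extra disaggregated variables for a tighter relaxation; it is omitted here for brevity.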

Download from publisher
Download preprint
Associated code
Presentation slides

Citation
Ashish Agarwal, Sooraj Bhat, Alexander Gray, Ignacio E. Grossmann (2010). Automating Mathematical Program Transformations, in Proceedings of the 12th International Symposium on Practical Aspects of Declarative Languages, PADL 2010, Vol 5937 of Lecture Notes in Computer Science, pp. 134-148.