A Type Theory for Probability Density Functions

Abstract

There has been great interest in creating probabilistic programming languages to simplify the coding of statistical tasks; however, there still does not exist a formal language that simultaneously provides (1) continuous probability distributions, (2) the ability to naturally express custom probabilistic models, and (3) probability density functions (PDFs). This collection of features is necessary for mechanizing fundamental statistical techniques. We formalize the first probabilistic language that exhibits these features, and it serves as a foundational framework for extending the ideas to more general languages. Particularly novel are our type system for absolutely continuous (AC) distributions (those which permit PDFs) and our PDF calculation procedure, which calculates PDFs for a large class of AC distributions. Our formalization paves the way toward the rigorous encoding of powerful statistical reformulations.

Download preprint
Published version
Download slides

Citation
Sooraj Bhat, Ashish Agarwal, Richard Vuduc, Alexander Gray (2012). A Type Theory for Probability Density Functions, Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2012. ACM SIGPLAN Notices 47(1):545-556.

Errata
In Figure 10, the P-PLUS rule should be:
\[
\frac{{\Upsilon;\Lambda} \vdash {\varepsilon_1} \perp {\varepsilon_2}
\qquad\{{\Upsilon;\Lambda} \vdash {\varepsilon_i} \leadsto {\delta_i}\}_{i=1,2}}
{{\Upsilon;\Lambda} \vdash {\varepsilon_1+\varepsilon_2} \leadsto {\lambda {x:\mathsf{R}}\centerdot
\int\lambda {t:\mathsf{R}}\centerdot\ \delta_1\ t * \delta_2\ (x – t)}}
\]
The \(t\) and \(x\) were accidentally transposed. Many thanks to Chung-chieh “Ken” Shan for finding this.

Posted in Presentations, Publications | Tagged , | 1 Comment

ML/ICFP/CUFP 2011


I’ve been having a great time at ICFP in Tokyo. I spent the first day at the ML Workshop, and am now enjoying CUFP after presenting my own tutorial on creating collaborative scientific software in OCaml.

Posted in News | Leave a comment

Rap Guide to Evolution

Baba Brinkman’s Rap Guide to Evolution is a hilarious show on a serious topic. It’s an alternative approach to educate people about the fact of evolution. If you have it, don’t miss the chance to see him live!

Posted in Uncategorized | Tagged | Leave a comment

Dijkstra on elegance in programming

“programmers should not be puzzle-minded …. We would be much better served by clean, systematic minds, with a sense of elegance.”
— Dijkstra (In interview as reported in CACM)

Posted in Quotes | Leave a comment

The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics

Abstract

Biological data is often tabular but finding statistically valid connections between entities in a sequence of tables can be problematic – for example, connecting particular entities in a drug property table to gene properties in a second table, using a third table associating genes with drugs. Here we present an approach (CRIT) to find connections such as these and show how it can be applied in a variety of genomic contexts including chemogenomics data.

Full article from publisher
Paper’s website

Citation
Tara A Gianoulis, Ashish Agarwal, Michael Snyder, and Mark Gerstein (2011). The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics, Genome Biology 12(R32):1-12.

Posted in Publications | Tagged | Comments Off

OCaml User Meeting 2011

I just registered for the OCaml User Meeting. Looking forward to seeing everyone in Paris. I’ll also attend the OCaml Hacking Days at IRILL and visit OCamlCore.

Posted in News | Leave a comment

MACS 1.4 patch

A bug was introduced in MACS 1.4, which arises if many of the initial reads in an input SAM file are unaligned. You are having this error if MACS is printing something like the following to stderr:

Traceback (most recent call last):
File "python/bin/macs14", line 354, in
main()
File "python/bin/macs14", line 59, in main
(treat, control) = load_tag_files_options (options)
File "python/bin/macs14", line 323, in load_tag_files_options
ttsize = tp.tsize()
File "lib/python2.6/site-packages/MACS14/IO/Parser.py", line 655, in tsize
return int(s/n)
ZeroDivisionError: integer division or modulo by zero

Here’s a patch to fix the problem. You can download the patch to any location. Apply the patch by cd’ing to the directory lib/IO within the MACS source code. Then do

patch < path/to/macs14_Parser.patch.txt

and proceed to install MACS as normal.

Many thanks to David Cittaro, who provided this fix on the MACS mailing list. However, the diff given there is the wrong way around---with the old and new files swapped---so that cannot be used as the patch.

Posted in Uncategorized | Leave a comment

Small example of pthreads

Here’s a simple C program that demonstrates the use of the pthreads library. The following code can be downloaded here, so you can test it yourself. Simply compile with the command:

gcc -lpthread pthreads_test.c -o pthreads_test

First, let’s import some necessary headers, mainly pthread.h which provides the POSIX threads implementation.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

Now, define a task that takes some non-negligible amount of time to complete. We pass in an id simply to identify each call to the task in output messages.

void *task(int id) {
  printf("Task %d started\n", id);
  int i;
  double result = 0.0;
  for (i = 0; i < 1000000; i++) {
    result = result + sin(i) * tan(i);
  }
  printf("Task %d completed with result %e\n", id, result);
}

We can run the above task a desired number of times in serial by calling the following function:

void *serial(int num_tasks) {
  int i;
  for (i = 0; i < num_tasks; i++) {
    task(i);
  }
}

Now, let’s define task in a manner suitable for being called in its own thread. All this requires is to use pthread_exit to let the parent process know this thread has completed. Actually this doesn’t affect the simple program we’re writing, but it’s the correct thing to do if you want to later join threads. Also, we have to make sure the function signature matches the requirements of pthread_create.

void *threaded_task(void *t) {
  long id = (long) t;
  printf("Thread %ld started\n", id);
  task(id);
  printf("Thread %ld done\n", id);
  pthread_exit(0);
}

Finally, the interesting code generates a new thread for each call to threaded_task.

void *parallel(int num_tasks)
{
  int num_threads = num_tasks;
  pthread_t thread[num_threads];
  int rc;
  long t;
  for (t = 0; t < num_threads; t++) {
    printf("Creating thread %ld\n", t);
    rc = pthread_create(&thread[t], NULL, threaded_task, (void *)t);
    if (rc) {
      printf("ERROR: return code from pthread_create() is %d\n", rc);
      exit(-1);
    }
  }
}

The main function runs a specified number of tasks either in serial or parallel.

void *print_usage(int argc, char *argv[]) {
  printf("Usage: %s serial|parallel num_tasks\n", argv[0]);
  exit(1);
}

int main(int argc, char *argv[]) {
  if (argc != 3) {print_usage(argc, argv);}

  int num_tasks = atoi(argv[2]);

  if (!strcmp(argv[1], "serial")) {
    serial(num_tasks);
  } else if (!strcmp(argv[1], "parallel")) {
    parallel(num_tasks);
  }
  else {
    print_usage(argc, argv);
  }

  printf("Main completed\n");
  pthread_exit(NULL);
}

That’s it! Now, compile the program using the command given at the beginning of this post. Here are some results on a MacBook Air.

$ time ./pthreads_test serial 4
Task 0 started
Task 0 completed with result -3.153838e+06
Task 1 started
Task 1 completed with result -3.153838e+06
Task 2 started
Task 2 completed with result -3.153838e+06
Task 3 started
Task 3 completed with result -3.153838e+06
Main completed

real 0m0.673s
user 0m0.665s
sys 0m0.003s


$ time ./pthreads_test parallel 4
Creating thread 0
Creating thread 1
Creating thread 2
Creating thread 3
Main completed
Thread 1 started
Task 1 started
Thread 2 started
Task 2 started
Thread 3 started
Task 3 started
Thread 0 started
Task 0 started
Task 3 completed with result -3.153838e+06
Thread 3 done
Task 2 completed with result -3.153838e+06
Task 1 completed with result -3.153838e+06
Thread 2 done
Thread 1 done
Task 0 completed with result -3.153838e+06
Thread 0 done

real 0m0.378s
user 0m0.667s
sys 0m0.007s

Note that threads do not necessarily start or end in order. And the point of it all; there is a factor 2x speedup for the multi-threaded run! Here’s a plot of the real time against number of tasks for 2, 4, 8, 16, 32, and 64 tasks.

You can learn more by following the excellent tutorial provided here, from which the above example was derived.

Posted in Uncategorized | Tagged | 12 Comments

MACS password download

Downloading MACS requires you to either sign up for their mailing list or solve a puzzle. I signed up for their mailing list and then solved the puzzle for fun. Perhaps you will find the answer on my website.

Posted in Uncategorized | Leave a comment

Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data

Abstract

We present an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs). It combines a large amount of expression data, RNA secondary-structure stability, and evolutionary conservation at the protein and nucleic-acid level. Using the incRNA model and data from the modENCODE consortium, we are able to separate known C. elegans ncRNAs from coding sequences and other genomic elements with a high level of accuracy (97% AUC on an independent validation set), and find more than 7000 novel ncRNA candidates, among which more than 1000 are located in the intergenic regions of C. elegans genome. Based on the validation set, we estimate that 91% of the approximately 7000 novel ncRNA candidates are true positives. We then analyze 15 novel ncRNA candidates by RT-PCR, detecting the expression for 14. In addition, we characterize the properties of all the novel ncRNA candidates and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families. We also find that they are often targeted by specific transcription factors (∼59% of intergenic novel ncRNA candidates). Overall, our study identifies many new potential ncRNAs in C. elegans and provides a method that can be adapted to other organisms.

Full article from publisher
Supplementary material
Paper’s website

Citation
Zhi John Lu, Kevin Y. Yip, Guilin Wang, Chong Shou, LaDeana W. Hillier, Ekta Khurana, Ashish Agarwal, Raymond Auerbach, Joel Rozowsky, Chao Cheng, Masaomi Kato, David M. Miller, Frank Slack, Michael Snyder, Robert H. Waterston, Valerie Reinke, and Mark B. Gerstein (2011). Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data, Genome Research 21(2):276-85.

Posted in Publications | Tagged , | Comments Off