Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project


We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.

Full article from publisher

Mark B. Gerstein, Zhi John Lu, Eric L. Van Nostrand, Chao Cheng, Bradley I. Arshinoff, Tao Liu, Kevin Y. Yip, Rebecca Robilotto, Andreas Rechtsteiner, Kohta Ikegami, Pedro Alves, Aurelien Chateigner, Marc Perry, Mitzi Morris, Raymond K. Auerbach, Xin Feng, Jing Leng, Anne Vielle, Wei Niu, Kahn Rhrissorrakrai, Ashish Agarwal, Roger P. Alexander, Galt Barber, Cathleen M. Brdlik, Jennifer Brennan, Jeremy Jean Brouillet, Adrian Carr, Ming-Sin Cheung, Hiram Clawson, Sergio Contrino, Luke O. Dannenberg, Abby F. Dernburg, Arshad Desai, Lindsay Dick, Andréa C. Dosé, Jiang Du, Thea Egelhofer, Sevinc Ercan, Ghia Euskirchen, Brent Ewing, Elise A. Feingold, Reto Gassmann, Peter J. Good, Phil Green, Francois Gullier, Michelle Gutwein, Mark S. Guyer, Lukas Habegger, Ting Han, Jorja G. Henikoff, Stefan R. Henz, Angie Hinrichs, Heather Holster, Tony Hyman, A. Leo Iniguez, Judith Janette, Morten Jensen, Masaomi Kato, W. James Kent, Ellen Kephart, Vishal Khivansara, Ekta Khurana, John K. Kim, Paulina Kolasinska-Zwierz, Eric C. Lai, Isabel Latorre, Amber Leahey, Suzanna Lewis, Paul Lloyd, Lucas Lochovsky, Rebecca F. Lowdon, Yaniv Lubling, Rachel Lyne, Michael MacCoss, Sebastian D. Mackowiak, Marco Mangone, Sheldon McKay, Desirea Mecenas, Gennifer Merrihew, David M. Miller III, Andrew Muroyama, John I. Murray, Siew-Loon Ooi, Hoang Pham, Taryn Phippen, Elicia A. Preston, Nikolaus Rajewsky, Gunnar Rätsch, Heidi Rosenbaum, Joel Rozowsky, Kim Rutherford, Peter Ruzanov, Mihail Sarov, Rajkumar Sasidharan, Andrea Sboner, Paul Scheid, Eran Segal, Hyunjin Shin, Chong Shou, Frank J. Slack, Cindie Slightam, Richard Smith, William C. Spencer, E. O. Stinson, Scott Taing, Teruaki Takasaki, Dionne Vafeados, Ksenia Voronina, Guilin Wang, Nicole L. Washington, Christina M. Whittle, Beijing Wu, Koon-Kiu Yan, Georg Zeller, Zheng Zha, Mei Zhong, Xingliang Zhou, modENCODE Consortium, Julie Ahringer, Susan Strome, Kristin C. Gunsalus, Gos Micklem, X. Shirley Liu, Valerie Reinke, Stuart K. Kim, LaDeana W. Hillier, Steven Henikoff, Fabio Piano, Michael Snyder, Lincoln Stein, Jason D. Lieb, and Robert H. Waterston (2010). Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project, Science 330(6012):1775-1787.


RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries


Summary: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools that use this format for the analysis of RNA-Seq experiments. RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. Moreover, these tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by this format it also facilitates the decoupling of the alignment of reads from downstream analyses.

Availability and implementation: RSEQtools is implemented in C and the source code is available at

Download free from publisher

Lukas Habegger, Andrea Sboner, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein (2011). RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries, Bioinformatics 27(2):281-283.


Milner on science and language

“[Languages] should be treated as a part of a modelling theory. Up to now I don’t think we had sufficient incentive to make sure that our languages are close to scientific models. It’s only with the onset of computation as a global phenomenon that modelling those interactions becomes so scientifically important that it is bound to have its effect on programming languages.”
— Milner (3 Sep 2003, In interview conducted by Martin Berger)


DREAM and RECOMB Satellite Conferences

I’m spending the week at the 3rd Annual Joint Conference on Systems Biology, Regulatory Genomics, and Reverse Engineering Challenges.


Platform independent .bashrc file

The number of computers I have accounts on has recently exploded, and manually editing my .bashrc file on each of these computers started getting tedious. I decided to resolve this by writing a single .bashrc file that can be copied over to all my accounts. This allows me to maintain a single master file, but requires me to set configurations conditional on the account the file is on. Here, I describe some of the things I did.

Firstly, I finally spent a few hours learning bash by reading Learning the bash Shell. I recommend you do this too. You’ll easily make up the hours in increased productivity.

My conditional settings are dependent on the kind of operating system I’m running and the particular machine I’m on. Let’s put this information in two variables:

os=$(uname -s)
host=$(hostname | cut -d. -f1)

Passing the result of hostname to cut causes host to be set to just the first part of your hostname: for example, foo rather than the fully qualified name. This allows me to type less in later code.
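To make the effect of cut concrete, here is a minimal sketch. The function names short_host and short_host_expansion and the host name foo.example.org are made up for illustration; they are not part of my actual .bashrc.

```shell
# Hypothetical helper: keep everything before the first dot of a host name.
short_host() {
    printf '%s\n' "$1" | cut -d. -f1
}

# Same result without spawning cut, using shell parameter expansion.
short_host_expansion() {
    printf '%s\n' "${1%%.*}"
}
```

Running short_host foo.example.org prints foo. The parameter-expansion variant avoids starting an extra process, which may matter slightly if your .bashrc is sourced often.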

Now, as an example, on one of my computer accounts I was using a software package called Netkit. This required setting some environment variables and sourcing a file that provided bash completion features. Assuming the short hostname was net, I did this with:

if [ "$host" = "net" ]; then
    export NETKIT_HOME=/usr/local/netkit
    . "$NETKIT_HOME/bin/netkit_bash_completion"
fi

With this code, my environment variable namespace is not polluted on all my other accounts, and I won’t get any errors at login about netkit_bash_completion not being found.

Some settings depend on the OS. For example, I like to colorize the output from ls, but the option for this differs on Mac and Linux systems. I can use a bash case construct to set the alias appropriately:

case $os in
    "Darwin" )
        alias ls='ls -G';;
    "Linux"  )
        alias ls='ls --color=auto';;
esac

As one more example, I define aliases to help me ssh between all my machines more easily. There’s no point in setting this up for the machine you’re on, so I do:

[ $host != "net" ] && alias ssh_net='ssh'
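With more than a couple of machines, the same idea can be written as a loop. This is only a sketch: the function name define_ssh_aliases, the machine names, and the example.org domain are placeholders I made up here, not my real setup.

```shell
# Define ssh_<name> aliases for every machine except the current one.
# $1 is the current short host name, $2 a domain suffix (hypothetical),
# and the remaining arguments are short machine names.
define_ssh_aliases() {
    current=$1
    domain=$2
    shift 2
    for name in "$@"; do
        if [ "$name" != "$current" ]; then
            alias "ssh_$name=ssh $name.$domain"
        fi
    done
}
```

For instance, define_ssh_aliases "$host" example.org net lab would define ssh_lab on the machine named net, and ssh_net on the machine named lab.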

Now all you have to do is create another bash script that rsync’s your .bashrc file to all your accounts, and run it every time you make a change.
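A rough sketch of such a script follows. The function name push_bashrc, the DRY_RUN switch, and the account names in the usage note are assumptions for illustration. It relies only on rsync's standard -a (archive) and -z (compress) options.

```shell
# Copy a master .bashrc to each destination given as user@host.
# Set DRY_RUN=1 to print the rsync commands instead of running them.
push_bashrc() {
    src=$1
    shift
    for dest in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "rsync -az $src $dest:.bashrc"
        elif ! rsync -az "$src" "$dest:.bashrc"; then
            # Report the failure but keep going with the other accounts.
            echo "push_bashrc: failed for $dest" >&2
        fi
    done
}
```

You would call it as push_bashrc ~/.bashrc alice@net.example.org alice@lab.example.org, substituting your own accounts. Running it first with DRY_RUN=1 lets you check the destinations before anything is copied.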


Goodbye Yale. Hello NYU.

It’s official. I’m moving to NYU. This will be a great opportunity for me to push my agenda of bringing functional programming techniques to biology. I’m fortunate enough to be working there with Kris Gunsalus and Rich Bonneau.

Many thanks to Michael Stern, Mark Gerstein, and Paul Hudak, all of whose labs I worked with at different times during the last few years. I will miss Yale and New Haven. But I’m looking forward to living in Manhattan!


Poincare on the necessity of hypothesis

“I consider a priori a law… Without this belief, … interpolation would be impossible; no law could be deduced from a finite number of observations; science would not exist.”
— Poincare (1913, p. 170)


Shiny new copy of Logicomix has arrived!



Carnap on the challenge of interdisciplinary research

“If one is interested in the relations between fields which, according to customary academic divisions, belong to different departments, then he will not be welcomed as a builder of bridges, as he might have expected, but will rather be regarded by both sides as an outsider and troublesome intruder.”
— Rudolf Carnap


Just registered for ICFP

Functional programmers unite at ICFP! I’ll also be at the ML Workshop, Haskell Symposium, and CUFP.
