With release 76 looming large in our calendars and the final deadlines for GRCh38 data production out of the way, it’s a good time to look back and take stock of what we’ve been doing in the Ensembl Regulation office. We have been rather quiet the past few months, working feverishly on an ambitious overhaul of our infrastructure. We’ve already given you a sneak peek at the new Ensembl Regulatory Build, so I’d like to take a look at the workhorse underlying all of our data, the ‘Ensembl Regulation Analysis Pipeline’.
The end result is a core resource that centralises epigenomic data from multiple public sources, processes them through a universal pipeline, then summarises them into easily understood annotations. Ensembl Regulation aims to be a single entry point to obtain an overview of all the available regulatory data, from individual datasets to summary annotations, all coming to a browser near you, very soon. Underlying it is a full ‘end to end’ pipeline for producing the input data to the Regulatory Build, from fastq download, to alignment, IDR processing, peak calling and finally motif alignments.
The inputs to the Regulatory Segmentation and Build are experiments (ChIP-seq and DNase-seq) describing the chromatin status (i.e. histone modifications) and transcription factor landscape across various cell lines. These experiments range from large projects (e.g. ENCODE, Roadmap Epigenomics and BLUEPRINT) through to individual experiments made accessible via archives such as the ERA/ENA, SRA and GEO.
The main outputs of the pipeline are genome alignments, peak calls and ‘collection’ files which provide coverage statistics across the genome. Managing and processing these data is no simple task, and we expect the number of available epigenomic datasets to increase significantly in the years to come. Also, with the arrival of GRCh38, we needed to reprocess all of the existing data in a short timeframe. We therefore integrated our processes into a shiny new fully automated pipeline using the ensembl-hive framework. Here follows a brief summary of the new features of the regulation analysis pipeline.
The Tracking Database
This now constitutes our main analysis and archive database, tracking the data both within our pipeline and in external repositories. In it, we register the metadata from different projects and data repositories, providing a single point of reference to query the data available in the public domain. This has been crucial in determining which cell lines meet the requirements for a build.
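To make the idea of a single point of reference concrete, here is a minimal sketch of how such a query might look. The schema, column names, status values, accessions and the rule for what makes a cell line ‘build ready’ are all hypothetical and chosen purely for illustration; the real tracking database is not described in this post. The sketch uses Python’s built-in sqlite3 module rather than our production database.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE dataset (
    accession TEXT PRIMARY KEY,   -- placeholder identifiers, not real accessions
    source    TEXT NOT NULL,      -- project or archive the data came from
    cell_line TEXT NOT NULL,
    feature   TEXT NOT NULL,      -- histone mark, transcription factor or DNase1
    status    TEXT NOT NULL       -- e.g. 'registered' or 'aligned' (assumed values)
);
"""


def cell_lines_with_features(conn, required):
    """Return cell lines that have an aligned dataset for every required feature."""
    required = sorted(set(required))
    placeholders = ",".join("?" for _ in required)
    sql = f"""
        SELECT cell_line
        FROM dataset
        WHERE status = 'aligned' AND feature IN ({placeholders})
        GROUP BY cell_line
        HAVING COUNT(DISTINCT feature) = ?
        ORDER BY cell_line
    """
    rows = conn.execute(sql, (*required, len(required))).fetchall()
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO dataset VALUES (?, ?, ?, ?, ?)",
        [
            ("example-1", "project_X", "cell_A", "CTCF", "aligned"),
            ("example-2", "project_X", "cell_A", "DNase1", "aligned"),
            ("example-3", "archive_Y", "cell_A", "H3K4me3", "aligned"),
            ("example-4", "project_X", "cell_B", "CTCF", "aligned"),
            ("example-5", "archive_Y", "cell_B", "CTCF", "aligned"),
            ("example-6", "project_X", "cell_C", "CTCF", "aligned"),
            ("example-7", "project_X", "cell_C", "DNase1", "aligned"),
            ("example-8", "project_X", "cell_C", "H3K4me3", "registered"),
        ],
    )
    # The required feature set here is an assumption, not the real build criteria.
    return cell_lines_with_features(conn, ["CTCF", "DNase1", "H3K4me3"])
```

Calling demo() returns only the cell line that has all three features aligned. Keeping datasets from every source in one table is what makes this kind of cross-project question a single query.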
Read Alignment and Peak Calling
We first align reads using BWA, then call peaks using SWEmbl for short regions and CCAT for broader ranging histone modifications. Replicates are processed in parallel to support ENCODE’s Irreproducible Discovery Rate (IDR) methodology.
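The routing described above can be sketched roughly as follows. This is an illustration only, not our pipeline code: the Replicate record, the plan_jobs function and the set of marks treated as ‘broad’ are assumptions made for the example, and no real command lines are built. Each replicate is kept as its own alignment and peak calling job, since IDR compares peak calls between replicates and so needs at least two of them.

```python
from collections import defaultdict
from typing import NamedTuple

# Assumption for illustration: which histone marks count as broad.
BROAD_MARKS = {"H3K27me3", "H3K36me3"}


class Replicate(NamedTuple):
    experiment: str
    feature: str
    replicate: int
    fastq: str


def choose_peak_caller(feature):
    """CCAT for broad histone marks, SWEmbl for everything else."""
    return "CCAT" if feature in BROAD_MARKS else "SWEmbl"


def plan_jobs(replicates):
    """Group replicates by experiment and plan one job per replicate."""
    by_experiment = defaultdict(list)
    for rep in replicates:
        by_experiment[rep.experiment].append(rep)

    plan = {}
    for experiment, reps in by_experiment.items():
        reps = sorted(reps, key=lambda r: r.replicate)
        caller = choose_peak_caller(reps[0].feature)
        plan[experiment] = {
            # Replicates are aligned separately so they can run in parallel.
            "align": [("BWA", r.fastq) for r in reps],
            "peaks": [(caller, r.replicate) for r in reps],
            # IDR needs at least two replicates to compare.
            "idr_possible": len(reps) >= 2,
        }
    return plan
```

A transcription factor experiment with two replicates would get two BWA jobs, two SWEmbl jobs and be flagged as suitable for IDR, while a single-replicate H3K36me3 experiment would be routed to CCAT.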
Pipeline Improvements
Flexibility has been a key aim of the redesign, and the hive infrastructure has helped here by allowing us to define each logical part of the pipeline as a separate configuration which can be ‘topped up’ as required. This means that it’s easy to run just the read alignment stage (which we require as input to the segmentation), or at your pleasure add in the peak calling and collection file writing stages whilst it’s still running. All the necessary state information is captured in the tracking database, so it’s really easy to pick things up at any point and start running the later stages of the pipeline.
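As a rough illustration of the ‘top up’ idea, the sketch below keeps a record of which stages have finished for each dataset and only runs what is still missing. It is plain Python and does not use the ensembl-hive API; the stage names, the Tracker class and the run function are hypothetical stand-ins for the real configurations and tracking database.

```python
# Hypothetical stage names, in the order they depend on each other.
STAGES = ("align", "call_peaks", "write_collections")


class Tracker:
    """In-memory stand-in for the state held in the tracking database."""

    def __init__(self):
        self._done = {}

    def mark_done(self, dataset, stage):
        self._done.setdefault(dataset, set()).add(stage)

    def is_done(self, dataset, stage):
        return stage in self._done.get(dataset, set())


def top_up(tracker, dataset, upto):
    """Return the stages still needed for `dataset` to reach stage `upto`."""
    wanted = STAGES[: STAGES.index(upto) + 1]
    return [stage for stage in wanted if not tracker.is_done(dataset, stage)]


def run(tracker, dataset, upto, runners):
    """Run the missing stages in order and record each one as it completes."""
    executed = []
    for stage in top_up(tracker, dataset, upto):
        runners[stage](dataset)
        tracker.mark_done(dataset, stage)
        executed.append(stage)
    return executed
```

Running a dataset up to "align" and later up to "write_collections" executes the alignment only once; the second call picks up at peak calling because the earlier state was recorded.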
Due to the size of our input data set and the resulting rolling data footprint, we set up a garbage collection of intermediate files and added inline archiving. This has limited our footprint, and enabled us to reprocess the entire human data set in one go.
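In spirit, the clean-up step looks something like the sketch below: once a stage has finished, the files worth keeping are moved to an archive and everything else in the working directory is deleted. The function name, the directory layout and the choice of which file suffixes to keep are assumptions for this example, not a description of our actual implementation.

```python
import shutil
import tempfile
from pathlib import Path


def collect_garbage(workdir, archive_dir, keep_suffixes):
    """Archive files with a kept suffix, delete the rest, return bytes freed."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    freed = 0
    for path in sorted(workdir.iterdir()):
        if not path.is_file():
            continue
        if path.suffix in keep_suffixes:
            # Inline archiving: final outputs leave the working area.
            shutil.move(str(path), str(archive_dir / path.name))
        else:
            # Intermediate file: reclaim the space.
            freed += path.stat().st_size
            path.unlink()
    return freed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "work"
        archive = Path(tmp) / "archive"
        workdir.mkdir()
        (workdir / "rep1.sam").write_bytes(b"x" * 10)  # treated as intermediate
        (workdir / "rep1.bam").write_bytes(b"x" * 4)   # treated as final output
        freed = collect_garbage(workdir, archive, {".bam"})
        archived = sorted(p.name for p in archive.iterdir())
        remaining = sorted(p.name for p in workdir.iterdir())
        return freed, archived, remaining
```

Running the clean-up as each stage completes, rather than at the end, is what keeps the rolling footprint bounded while many datasets are in flight.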
The combination of the above improvements, the new ensembl-hive implementation and a whole load of other refinements means that much less manual intervention is required, resulting in a large reduction in run times. For the alignments in particular, what was taking several weeks now takes just ~5 days!
What does the future hold?
We’ve already identified some more optimisations to the structure of the pipeline, so the run times are likely to drop even further. This will be crucial to handle the hundreds of cell types currently being examined within Roadmap Epigenomics, BLUEPRINT, ENCODE 3 and other projects. We will also be revising our schemas to better reflect tissue-specific data. This is part of a larger push within Ensembl to better describe the dynamics of gene regulation and transcription.
Finally, we are keeping up with lab techniques, and will be extending our pipelines to handle newer types of data, such as chromatin conformation assays or eQTLs. Although we do not process these data ourselves, we have already integrated and remapped the FANTOM5 CAGE-tag annotations onto GRCh38.
P.S. If you want even more info, keep an eye on this page. Once release 76 is out it will be updated with our new Regulatory Build documentation.