Author: Daniel (Genome Analysis)

Drawing cis-interactions in Ensembl

27th May 2015 by Daniel (Genome Analysis)

As we announced in our e80 release post, we rolled out a prototype display for cis-regulatory interactions: essentially arches connecting any two elements on the same chromosome. It was designed mainly to let you display your eQTL or Hi-C data with little effort, and is based on the WashU Epigenomics Explorer interaction track:

All you need to do is prepare a tab-delimited interaction file, optionally indexing it with Tabix if it is too large to upload directly. You can also use it to represent intra-chromosomal rearrangements such as the micro-inversion below:

@RNA3DHub even suggested using it to display RNA secondary structure:

You can do what you want really!

The Ensembl Regulatory Build – the Track Hub

20th August 2014 by Daniel (Genome Analysis)

The brand new Ensembl Regulatory Build on the new GRCh38 human assembly has been released in Ensembl 76. This involved a complete redesign of the build process with a new, statistically rigorous logic, a streamlining of all the backend processes, and a remapping and peak calling of all our data sets.

The build and its constituent data can be viewed directly in Ensembl via the Regulation section of “Configure This Page”, but we have also made them accessible by creating a public track hub. A track hub is a pre-configured set of tracks which you can load together into Ensembl or other genome browsers such as the UCSC. In addition to data loaded into Ensembl, it also contains tracks that summarise the data used to generate the build.

What’s in the Ensembl Regulatory Build Track Hub?

GRCh37 data

You’ve been meaning to do that data remapping for a while now, but didn’t quite get to it? Or maybe you’re just feeling nostalgic? No worries, the track hub covers both GRCh37 and GRCh38.

Raw data

You can scan the header files for GRCh37 or GRCh38 for direct access to the raw data in BigBed and BigWig format.

Transcription Factors

For each transcription factor, we calculate the probability of having binding at any position, based on the available data sets by simply dividing the number of overlapping peaks by the number of data sets. These probabilities can be viewed in the TFBS Summaries section. An overall probability of any binding is viewable in the TFBS Summary track of the Ensembl build overview section.
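The per-position calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the in-memory interval lists and the half-open coordinate convention are assumptions, and a data set is counted once even if several of its peaks overlap the position.

```python
def binding_probability(peak_sets, position):
    """Fraction of data sets that have a peak covering `position`.

    peak_sets: one list of (start, end) half-open intervals per data set
    (a hypothetical in-memory stand-in for the real peak files).
    """
    if not peak_sets:
        raise ValueError("need at least one data set")
    # Count the data sets in which at least one peak overlaps the position
    overlapping = sum(
        any(start <= position < end for start, end in peaks)
        for peaks in peak_sets
    )
    return overlapping / len(peak_sets)


# Made-up peaks for one TF across four data sets
datasets = [
    [(100, 200)],
    [(150, 250)],
    [(400, 500)],
    [],
]
print(binding_probability(datasets, 160))  # two of the four data sets overlap
```

In practice the peaks would be read from the BigBed files in the hub rather than typed in by hand.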

Segmentations

We use genome segmentation software (Segway) to partition the genome into regions of similar signal over these assays, and label these states as, for example, predicted promoters, enhancers or repressed regions. The segmentations for each cell type can be found in the Cell Type Segmentations tracks.

For each state of the segmentation, we also create a summary track which represents the number of cell types that have that state at any given base pair of the genome.
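As a rough sketch of how such a summary track could be derived, the snippet below counts, base by base, how many cell types carry a given state. The dictionary layout, state names and cell type names are invented for illustration, and a real implementation would work on intervals rather than per-base lists.

```python
def state_summary(segmentations, state):
    """Per-base count of cell types labelled with `state`.

    segmentations: dict mapping a cell type name to a list of per-base
    state labels; all lists must have the same length (an assumption
    made to keep the sketch short).
    """
    tracks = list(segmentations.values())
    lengths = {len(labels) for labels in tracks}
    if len(lengths) != 1:
        raise ValueError("need at least one track, all of equal length")
    n_bases = lengths.pop()
    return [
        sum(labels[i] == state for labels in tracks)
        for i in range(n_bases)
    ]


# Toy segmentations over four bases in three hypothetical cell types
segs = {
    "cellA": ["tss", "tss", "enh", "rep"],
    "cellB": ["tss", "enh", "enh", "rep"],
    "cellC": ["rep", "tss", "enh", "enh"],
}
print(state_summary(segs, "tss"))
print(state_summary(segs, "enh"))
```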

The Ensembl Regulatory Build

The summarised Ensembl Regulatory Build can be viewed in the “Ensembl Reg. Build” track of the Ensembl Build Overview section. For each cell type, we then annotate each feature as on or off, as displayed in the Cell Type Activity tracks.

Computing Ensembl’s New Regulatory Annotation

1st January 2014 by Daniel (Genome Analysis)

In a previous post, we described Ensembl’s new regulatory annotation of the genome. Here, we go into greater detail on how we computed it.

We started by running ChromHMM over 17 cell types, using publicly available ENCODE and Roadmap Epigenomics data. This produced a segmentation, or annotation, of the genome for each of these cell types under 25 labels, or segmentation states. These states were given arbitrary names (P0, P1, …) after a preliminary comparison to the earlier ENCODE segmentations across six cell types.

For each state and each position in the genome, we computed the number of cell types that have that state at that position. This resulted in 25 segmentation state summary tracks. We also pulled in all the ChIP-seq peaks from the January 2011 ENCODE data freeze, and kept those that overlapped with open chromatin (i.e. DNase I hypersensitivity) in the same cell type. From all these assays, we computed a summary track, which indicates the probability (between 0 and 1) of seeing a TF peak at any location of the genome. The segmentations and summary tracks can be seen in the illustration below.

Summary tracks generated from the segmentation and TFBS peaks

We then used TF binding data to determine the specificity of the ChromHMM states, as regulatory regions are presumably correlated with TF binding. Each function (TSS, CTCF insulator, proximal regulatory elements, distal regulatory elements) was thus associated with several states. For example, transcription start sites were strongly associated with the P0 state in the segmentations. We overlapped these state signals to define function-specific signals. A simple count threshold was set to maximise the detection of TF binding sites. This led to the regions displayed at the bottom of the following figure:

Selected summary tracks were overlapped to define the Ensembl Regulatory features

These new regions concur strongly with the TF binding datasets: 73.4% of the TF ChIP-seq peaks were captured in the ChromHMM regions, equivalent to an 8.3x enrichment with respect to the genomic average. Conversely, 24.0% of the ChromHMM-based regions were covered by observed TF ChIP-seq peaks. To avoid losing information, TFBS peaks which were not covered by any of these elements were marked as ‘Unannotated TFBS’.
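For readers who want to relate the two figures, here is one way the enrichment could be computed. It assumes that fold enrichment is defined as the fraction of peaks captured divided by the fraction of the genome covered by the regions; the post does not spell out the exact formula, so the implied coverage below is a derived illustration rather than a reported result.

```python
def fold_enrichment(fraction_captured, fraction_of_genome):
    """Observed capture rate relative to what uniform coverage would give.

    Assumed definition: fraction of peaks captured divided by the
    fraction of the genome that the regions cover.
    """
    if not 0 < fraction_of_genome <= 1:
        raise ValueError("fraction_of_genome must be in (0, 1]")
    return fraction_captured / fraction_of_genome


captured = 0.734            # fraction of TF peaks captured (from the text)
reported_enrichment = 8.3   # fold enrichment (from the text)

# Under the assumed definition, the regions would cover this share of the genome
implied_coverage = captured / reported_enrichment
print(f"implied genome coverage: {implied_coverage:.1%}")
print(f"round trip: {fold_enrichment(captured, implied_coverage):.1f}x")
```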

Having defined consensus regulatory regions, we returned to the original data to determine which regions are active in which cell lines:

The Regulatory Features are then compared to the cell type specific segmentations to determine their activity in each cell type.

Validation

FANTOM 4 CAGE tags

In the median cell type, 83% of FANTOM tags supported by three CAGE tags or more were annotated by our pipeline.

Vista enhancers

The Vista Enhancer database contains enhancer sequences validated by in vivo staining assays on transgenic mice (hats off to the VISTA team for their years of meticulous work). It currently contains 1,575 predicted enhancers, of which 807 were experimentally confirmed. 491 of those (60.8%) were picked up by our pipeline.

TF binding motifs

We found two estimates of active TF binding motifs:

Source:	JASPAR	Arbiza et al.
Number:	803,489	2,013,074
Avg. length (bp):	13.1	10.8
Total length (Mbp):	10.5	21.7
% covered:	59.0	87.0
Enrichment (fold):	6.1	9.0

JASPAR motifs are not mapped by default to the human genome, so we are missing a few. We will therefore be remapping them in time for Ensembl release 75, in early 2014.

The computing of Ensembl’s new regulatory annotation is work in progress. If you have any ideas on the subject, feel free to leave a comment or to send them to helpdesk@ensembl.org.

The New Ensembl Regulatory Annotation

26th December 2013 by Daniel (Genome Analysis)

GWAS after GWAS returns statistically significant hits that are hard to interpret because they fall outside of coding regions, and this calls for more functional annotation of regulatory regions. We at Ensembl have been providing such an annotation for a few years now, and we are now redesigning from the ground up the way we define these regions. This work is in progress and we would love to hear your suggestions and comments.

An overview of the major elements involved in gene expression regulation

In short, we are looking for all regions of the genome which display regulatory function. Much ink has been spilled over the definition of the word ‘functional’, so we’re going to expand a bit.

We propose to map out the regions of the genome that display epigenomic marks and/or transcription factor binding sites (TFBS) associated with proximal and distal regulatory elements, transcription start sites (TSS), and CTCF insulators.

Ensembl’s Regulation pipeline

We will post more details next week on Computing the New Ensembl Regulatory Annotation. To cut to the chase, we defined the following regions from publicly available ENCODE and Roadmap Epigenomics datasets:

Label	Count	Avg. length (bp)	Max. length (bp)	Total length (Mbp)
TSS	40,249	973.2	11,400	39.2
Proximal Reg.	101,206	1,005.5	15,000	101.8
Distal Reg.	209,081	526.1	8,400	110.0
CTCF	108,284	550.1	5,200	59.6
Unannotated TFBS	163,528	155.8	1,630	25.5
All:				299.2

The new Regulatory Build will allow us to separate state from function, as shown below, upstream of VNN2:


The track at the top colours the region by function, independently of the cell type. In the cell specific tracks below, the various features are greyed out if we do not have evidence of activity for that region.

We incorporated our preliminary results into a track hub, along with some of our intermediary data. Next week we will post more details on Computing Ensembl’s New Regulatory Build. We want to integrate this build officially into Ensembl release 76, sometime during the 3rd or 4th quarter of 2014.

FAQ:

Are you saying that 9.7% of the genome is functional?

Not quite. We’re saying that if you split the genome into 200 bp bins, 9.7% of them show epigenomic marks or TF binding. Remember that histone marks are measured at nucleosome resolution, so this signal is at best at a 140 bp resolution. If you add in experimental noise (typically proportional to the ChIP-seq fragment length), the exact position of these elements on the genome is rather fuzzy. At the same time, the epigenome is a dynamic system, and we only have some assays on some cell lines. No doubt more regions will be annotated as more datasets come in.
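To make the binning idea concrete, here is a small sketch of how the fraction of marked 200 bp bins could be computed from a list of annotated regions. The function name, the half-open coordinates and the toy genome are assumptions for illustration only, not the pipeline’s actual code.

```python
def fraction_of_bins_marked(regions, genome_length, bin_size=200):
    """Fraction of fixed-size bins overlapped by at least one region.

    regions: iterable of (start, end) half-open intervals on one sequence.
    """
    if genome_length <= 0 or bin_size <= 0:
        raise ValueError("genome_length and bin_size must be positive")
    n_bins = -(-genome_length // bin_size)  # ceiling division
    marked = set()
    for start, end in regions:
        if end <= start:
            continue  # skip empty or inverted intervals
        first = start // bin_size
        last = min((end - 1) // bin_size, n_bins - 1)
        marked.update(range(first, last + 1))
    return len(marked) / n_bins


# Toy 2,000 bp genome: ten bins, three of which are touched by a region
toy_regions = [(150, 250), (1000, 1100)]
print(fraction_of_bins_marked(toy_regions, 2000))
```

Note that a 100 bp region straddling a bin boundary marks two full bins, which is one reason why the binned percentage overstates the exact footprint.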

What happened to the 80.4% of the genome being functional?

This statistic from the main ENCODE paper took into account other biochemical markers, in particular those associated with transcription, which can be observed over most of the genome. We therefore recommend using curated genesets such as GENCODE to define gene bodies. Nonetheless, the number of regulatory elements and promoters described here is of the same order of magnitude as that discussed in the ENCODE paper.

Is this the same as ENCODE?

This is more than just ENCODE. There are other fascinating epigenomic surveys out there, such as Roadmap Epigenomics or BLUEPRINT, to name a few. Here at Ensembl, we have started merging all these datasets (including ENCODE) to provide the most comprehensive overview possible, updating our calls as new projects come along. Also, as discussed above, we are producing a cell-type independent summary of epigenomic function, which can be used to inform studies on new cell types.

What about other species?

We focused our Regulation database primarily on human: that is what most of our users ask for, and what we have most data for. But that does not mean that we ignore other species. Ensembl already has regulatory information for mouse, and we plan to expand this shortly to farm animals, in collaboration with the Roslin Institute.

Can you assign regulatory elements to genes?

We’re working on it. Correlations are easy to find, but multiple testing quickly gets in the way when testing 310,287 regulatory elements against 40,249 TSSs.
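To give a feel for the scale of the problem, the snippet below counts the pairwise tests and applies a Bonferroni correction. Bonferroni and the 0.05 significance level are used here purely as a familiar illustration; they are not a description of the method we will eventually use.

```python
n_elements = 310_287   # regulatory elements (from the text)
n_tss = 40_249         # transcription start sites (from the text)

# Testing every element against every TSS
n_tests = n_elements * n_tss

alpha = 0.05  # assumed family-wise significance level
bonferroni_threshold = alpha / n_tests

print(f"{n_tests:,} pairwise tests")
print(f"per-test p-value threshold: {bonferroni_threshold:.1e}")
```

With more than twelve billion tests, a single correlation would need a p-value of the order of 1e-12 to survive this correction.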

Remember, this is work in progress, and we would love to hear your suggestions. Please leave your comments here or drop us a line.

WiggleTools: a pocket calculator for very large datasets

23rd December 2013 by Daniel (Genome Analysis)

We are pleased to announce a new bioinformatics application, WiggleTools, described in a recent Application Note in Bioinformatics. It allows you to quickly and conveniently compute statistics across many (up to hundreds of) genome-wide datasets.

WiggleTools is first and foremost a data summary tool. It collapses a large collection of genome-wide datasets, such as BigWig, BigBed or BAM files, into a single summary. You can then view on the Ensembl browser a single statistic that combines all the datasets for a given project, rather than displaying a pileup of several data tracks.

For example, if you wanted to display the average binding probability for each TF in the ENCODE dataset, you could display a huge number of tracks on the browser, one for each TF. Such a view is clearly difficult to interpret, as one has to scroll up and down endlessly.

And here is a summary track which recaps all of the data above in a single track (explained below):

Overall binding probability for all TF in a single track. Note that you now have room to add other datasets in the ‘Region in detail’ view.

To better handle different types of signal, WiggleTools offers a range of statistics on a set of values, such as mean, median, minimum, maximum, or variance.

Besides boiling down large collections of data into a single track, WiggleTools also allows you to compare groups of datasets. For example, if you have a collection of case and control replicates, you can compare the means of the cases and controls, but you can also apply more advanced statistics such as Welch’s t-test (for normally distributed variables) or Wilcoxon’s rank sum test (for other variables).
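To illustrate the kind of per-position comparison involved, here is a plain Python sketch of Welch’s t statistic applied across aligned tracks. It uses only the standard library and is not WiggleTools syntax; the function names and the toy signal values are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch's t statistic for two samples with possibly unequal variances."""
    if len(cases) < 2 or len(controls) < 2:
        raise ValueError("each group needs at least two replicates")
    # Standard error built from the two sample variances (n - 1 denominator)
    se = sqrt(variance(cases) / len(cases) + variance(controls) / len(controls))
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(cases) - mean(controls)) / se


def welch_track(case_tracks, control_tracks):
    """Apply welch_t position by position across aligned replicate tracks."""
    return [
        welch_t(case_values, control_values)
        for case_values, control_values in zip(zip(*case_tracks), zip(*control_tracks))
    ]


# Three case and three control replicates over two positions (made-up signal)
case_tracks = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]
control_tracks = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
t_values = welch_track(case_tracks, control_tracks)
print(t_values)
```

The first position shows no difference between the groups, while the second has a clearly higher signal in the cases.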

WiggleTools has been designed with efficiency in mind. Streaming the data keeps memory requirements to a minimum by only storing local information. Functional components communicate directly in memory, without disk access or string passing. Parallel threads keep the system going smoothly regardless of irregularities in disk access. Finally, a novel BigWig file merging tool, bigWigCat, which we contributed to Jim Kent’s C library, allows WiggleTools to make the most of a cluster of computers. For example, to compute the sum of 126 BigWig files (a total of 121 GB) takes less than 17 minutes in total, on 116 CPUs, and fits on less than 5.5 GB of RAM.

A statistics package for genomic datasets

With WiggleTools, you can pretty much play with the BigWig, BigBed and BAM files lying on your filesystem as if they were vectors loaded in R, NumPy or MATLAB. A simple language, which resembles LISP, is enough to define the functions that WiggleTools then runs in a single pass through the data.

A use case: we wanted a summary of transcription factor (TF) binding across the genome. For every position in the genome, we had estimated for each TF the probability of observing binding in a random cell type. To compose these datasets, we wanted to compute an overall probability of observing any binding at that position. We therefore wanted to compute:

WiggleTools can create the appropriate function on the fly and compute the result in a single pass through the files. Total run time: 34s, max memory: 20MB.
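As an illustration of the quantity described above, the sketch below assumes that binding events for different TFs are independent, in which case the probability of observing any binding at a position is 1 minus the product of (1 - p) over all TFs, where p is each TF’s binding probability at that position. This is plain Python rather than WiggleTools syntax, and the independence assumption is ours for the purpose of the example.

```python
from math import prod


def any_binding_probability(per_tf_probabilities):
    """Probability that at least one TF binds, assuming independence.

    per_tf_probabilities: binding probabilities of each TF at one position.
    """
    return 1.0 - prod(1.0 - p for p in per_tf_probabilities)


# Two TFs that each bind half of the time at this position
print(any_binding_probability([0.5, 0.5]))

# Made-up probabilities for three TFs, one of which never binds here
print(any_binding_probability([0.1, 0.2, 0.0]))
```

Applied position by position across all per-TF tracks, this yields a single summary track.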

For more information on WiggleTools, have a look at our paper in Bioinformatics, and our code on GitHub.

Ensembl Blog

News about the Ensembl Project and its genome browser