A brand new regulatory build for the human GRCh38 and GRCh37 assemblies was released in Ensembl 95 earlier this week. The new regulatory build incorporates data for 55 new and 38 updated epigenomes from the ENCODE project. So what are the differences from the previous regulatory build?
The Ensembl regulatory build
The Ensembl regulatory build uses experimental data to predict features that regulate gene expression, such as promoters, enhancers and transcription factor binding sites. The annotation of these regulatory features is based upon a wide variety of data produced by the Blueprint epigenome project, the Roadmap epigenomics project and the ENCODE project. The ENCODE project generates a wide variety of data for a range of cultured cell types, including DNase hypersensitivity assays, DNA methylation assays and ChIP-Seq of proteins that interact with DNA, i.e., modified histones, transcription factors and chromatin regulators.
What’s changed?
This is the first update to the regulatory build since release 87 in 2016. Since then, the data hosted by ENCODE has grown continually, and the new regulatory build incorporates the most up-to-date ENCODE data, covering 55 new and 38 updated epigenomes.
| Previous build | New build |
Number of epigenomes | 68 | 123 |
Number of regulatory features | 341,929 | 675,965 |
Genome covered | 10% | 21% |
The new regulatory build has 675,965 regulatory features, which cover 21% of the genome. This is an increase from the previous build, which had 341,929 regulatory features covering 10% of the genome.
Figure 1: Number of regulatory features by feature type in the new build for GRCh38
Figure 2: Percent of genome covered by feature type for GRCh38
The reason for the increase in regulatory features is a change to the regulatory segmentation process. In the regulatory build, segmentation algorithms are used to partition the genome into regions with distinct epigenomic profiles.
The segmentation algorithm no longer analyses all the experiments combined. Instead, the experiments are split into smaller groups and each group is processed independently. The experiments are split:
- by the consortium that provided the experiment and
- by which epigenetic marks were available.
This means that the segmentation algorithm now has more degrees of freedom when learning the epigenetic patterns. This allows it to find more structure, which leads to a greater number of regulatory features and greater coverage of the genome.
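To make the grouping idea concrete, here is a minimal sketch in Python. It is not the Ensembl Regulation team's code: the `Experiment` record, the `group_epigenomes` function and the epigenome names are all hypothetical, and it assumes that "which epigenetic marks were available" means the set of marks assayed for each epigenome.

```python
from collections import defaultdict
from typing import NamedTuple


class Experiment(NamedTuple):
    """Hypothetical record for one experiment (not an Ensembl data type)."""
    epigenome: str
    consortium: str
    mark: str


def group_epigenomes(experiments):
    """Bucket epigenomes by (consortium, set of available marks)."""
    # Collect the marks that were assayed for each (consortium, epigenome) pair.
    marks_by_epigenome = defaultdict(set)
    for exp in experiments:
        marks_by_epigenome[(exp.consortium, exp.epigenome)].add(exp.mark)

    # Epigenomes sharing a consortium and an identical mark set form one group;
    # each group would then be segmented independently of the others.
    groups = defaultdict(list)
    for (consortium, epigenome), marks in marks_by_epigenome.items():
        groups[(consortium, frozenset(marks))].append(epigenome)
    return {key: sorted(names) for key, names in groups.items()}


# Made-up example input: four epigenomes from two consortia.
experiments = [
    Experiment("epigenome_A", "ENCODE", "H3K4me3"),
    Experiment("epigenome_A", "ENCODE", "H3K27ac"),
    Experiment("epigenome_B", "ENCODE", "H3K4me3"),
    Experiment("epigenome_B", "ENCODE", "H3K27ac"),
    Experiment("epigenome_C", "ENCODE", "H3K4me3"),
    Experiment("epigenome_D", "Blueprint", "H3K4me3"),
    Experiment("epigenome_D", "Blueprint", "H3K27ac"),
]

groups = group_epigenomes(experiments)
for (consortium, marks), names in groups.items():
    print(consortium, sorted(marks), names)
```

In this toy input, epigenomes A and B end up in the same group, while C (fewer marks) and D (different consortium) each get their own, so a segmentation model trained per group never has to reconcile data sets with different marks or provenance.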
Will this affect my work?
The MySQL database schema and data access through the Ensembl web browser, BioMart and the Perl and REST APIs will not change, but the regulatory features themselves may be different from those found in older Ensembl releases. Many regulatory features will remain the same, and in these cases, we have mapped the stable IDs (ENSR#) from the older regulatory build to the new version. New stable IDs will be assigned for any new regulatory features.
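As an illustration of access through the REST API, the sketch below builds a request for regulatory features overlapping a region, using the `overlap/region` endpoint with `feature=regulatory`. The helper names and the example region are ours, and the assumption that each returned record carries its stable ID under an `id` key should be checked against the REST documentation for your release.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def regulatory_overlap_url(region, species="human", server=REST_SERVER):
    """Build the URL for regulatory features overlapping `region`."""
    path = urllib.parse.quote(f"/overlap/region/{species}/{region}", safe="/:-")
    query = urllib.parse.urlencode({"feature": "regulatory"})
    return f"{server}{path}?{query}"


def fetch_regulatory_features(region, species="human"):
    """Query the REST server (needs network access) and return parsed JSON."""
    request = urllib.request.Request(
        regulatory_overlap_url(region, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def stable_ids(features):
    """Pick out ENSR stable IDs, assuming each record has an 'id' key."""
    return [f["id"] for f in features if str(f.get("id", "")).startswith("ENSR")]


# Example region chosen for illustration only.
url = regulatory_overlap_url("7:140424943-140624564")
print(url)
```

Calling `stable_ids(fetch_regulatory_features("7:140424943-140624564"))` against the current release and against an archived one would be one way to see which stable IDs were carried over for a region of interest.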
Seriously?!
The amount of data processed in the new Ensembl regulatory build is truly phenomenal:
- More than eight terabytes of data in read files were analysed to create the new Ensembl regulatory build.
- If the sequence were printed on a strip of paper at 5 mm per base, the strip would be long enough to wrap around the sun more than 11 times or to go to the moon and back 66 times. That is also about six base pairs for every day that our solar system has existed.
- Analysing this data to generate the regulatory build took more than 9 CPU years (but thanks to parallelisation we only had to wait seven days).
- The code written by the Ensembl Regulation team for this analysis has more lines than The Lord of the Rings, but not more than The Lord of the Rings including The Hobbit.
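For the curious, the comparisons above can be sanity-checked with a few lines of Python. The post does not state the total number of bases, so this sketch works backwards from each comparison. The physical constants (mean solar radius, mean Earth-moon distance, age of the solar system) are our own approximate assumptions, not figures from the Regulation team.

```python
import math

MM_PER_BASE = 5.0  # from the text: 5 mm of paper per base

# Approximate constants (our assumptions, not from the original post).
SUN_RADIUS_KM = 695_700
EARTH_MOON_KM = 384_400
SOLAR_SYSTEM_AGE_YEARS = 4.6e9
DAYS_PER_YEAR = 365.25


def bases_for_length(length_km):
    """Number of bases that fit on a paper strip of the given length."""
    return length_km * 1e6 / MM_PER_BASE  # 1 km = 1e6 mm


# Bases implied by each of the three comparisons.
bases_sun = bases_for_length(11 * 2 * math.pi * SUN_RADIUS_KM)
bases_moon = bases_for_length(66 * 2 * EARTH_MOON_KM)
bases_days = 6 * SOLAR_SYSTEM_AGE_YEARS * DAYS_PER_YEAR

# Average number of CPUs kept busy: 9 CPU years squeezed into 7 days.
average_cpus = 9 * DAYS_PER_YEAR / 7

print(f"sun:  {bases_sun:.2e} bases")
print(f"moon: {bases_moon:.2e} bases")
print(f"days: {bases_days:.2e} bases")
print(f"average CPUs in use: {average_cpus:.0f}")
```

Under these assumptions the three comparisons agree with each other at roughly 10^13 bases, and the run time implies several hundred CPUs working in parallel on average. Since the text says "more than" for both the sun comparison and the CPU years, these are lower bounds.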
We hope that you find the data available from the new Ensembl regulatory build useful in your work. Please do get in touch using the Ensembl Helpdesk or the comments section if you have any further questions.