Our initial annotation of the new human assembly is live on the Ensembl Pre site and we are now fully annotating GRCh38 as part of our customary Ensembl release cycle. The complete annotation will be available in Ensembl release 76, due in the third quarter of 2014.

What is a release cycle in Ensembl?

In a nutshell, it’s the process of updating and releasing the Ensembl website and the underlying databases. It takes three months, which gives us time to prepare our data and tools and to carry out rigorous quality checks throughout.

A new Ensembl release may include new species, new assemblies, updated gene sets, new variation data, new gene trees, genomic alignments and homologies, and annotation of regulatory features. There will also be improvements and additions to our web-interface and database structure, including the Perl APIs.

Several Ensembl teams are involved in this orchestrated release cycle, and more than 40 people take part in the whole process.

Genebuild

The Genebuild team have developed a gene annotation pipeline that annotates genes and transcripts on a given assembly, based on biological evidence (e.g. protein and nucleotide sequences). This step can take three to six months depending on, among other factors, the quality of the assembly, the number of species-specific sequences available in public sequence databases, and the amount of RNASeq data available. The Genebuild team also annotate patches and haplotypes.

For human, mouse and zebrafish, the Ensembl automatic annotation is combined with the Havana manual annotation to give the merged gene set. For human and mouse, these merged genes are the GENCODE gene set. Once these databases are complete, they are handed over to the other Ensembl teams for downstream processing and analyses. The handing over of the databases corresponds to the first step in our release cycle.

Core

The Core team provide the API support for the core and core-like (i.e. otherfeatures, cdna and rnaseq) databases in Ensembl. They also maintain our automated quality control (QC) system and the key pieces of Perl code for assembly mapping, Ensembl ID mapping, and cross-referencing of Ensembl models to external data sources such as HGNC, UniProtKB, RefSeq and GO.

Compara

The Comparative genomics team (also known as Compara) run multiple pipelines to bring the separate species together and provide gene trees (coding and ncRNA), homologues and protein families, in addition to whole genome alignments and synteny data. The resultant data is compiled into the ensembl_compara and ensembl_ancestral databases.

Variation

The Variation team bring together sequence variation data from a variety of sources, predict the effects of these variants on our transcripts, call new variants from re-sequencing data, import QTLs, maintain the VEP (Variant Effect Predictor) and incorporate associated disease and phenotype information. These data are used to create variation databases, currently for 22 Ensembl species.

Regulation

The Regulation team collect experimental data from Ensembl collaborators and integrate these data into our regulatory build, which we currently compute for human and mouse. The build yields predicted promoters, enhancers and other regulatory features based on ChIP-seq, DNase I, methylation and other data. Regulation (also known as funcgen) databases exist for some other species, such as chicken, to support the microarray mapping data. Other sources of regulation data can be found on our help page.

Production

Once the Compara, Core, Variation and Regulation teams have handed over their databases, the Production team do a full rebuild of all BioMart databases from the updated data. These data can be accessed via the Ensembl BioMart data-mining tool, via martservice, or using the biomaRt package from Bioconductor. The Production team are also responsible for overseeing the entire release cycle and running various pipelines throughout the process to generate statistics for the website, dump flat files for the FTP site, generate BLAST indices, and carry out quality control of the data.
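As a rough illustration of programmatic BioMart access, the Python sketch below builds a BioMart XML query and the matching martservice URL without sending any request. The endpoint URL, the dataset name `hsapiens_gene_ensembl`, and the attribute and filter names are assumptions for illustration only and should be checked against the current BioMart interface.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed endpoint; check the Ensembl BioMart help pages for the current URL.
MARTSERVICE_URL = "http://www.ensembl.org/biomart/martservice"


def build_query_xml(dataset, attributes, filters=None):
    """Return a BioMart XML query string for a single dataset."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE Query>",
        '<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1">',
        f'<Dataset name={quoteattr(dataset)} interface="default">',
    ]
    # Filters restrict the rows returned; attributes choose the columns.
    for name, value in (filters or {}).items():
        parts.append(f"<Filter name={quoteattr(name)} value={quoteattr(value)}/>")
    for name in attributes:
        parts.append(f"<Attribute name={quoteattr(name)}/>")
    parts.append("</Dataset>")
    parts.append("</Query>")
    return "".join(parts)


def build_query_url(dataset, attributes, filters=None):
    """Return a URL that passes the XML query in the 'query' parameter."""
    xml = build_query_xml(dataset, attributes, filters)
    return MARTSERVICE_URL + "?" + urlencode({"query": xml})


# Hypothetical example: gene IDs and names on human chromosome 21.
example_xml = build_query_xml(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name"],
    {"chromosome_name": "21"},
)
example_url = build_query_url(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name"],
    {"chromosome_name": "21"},
)
```

The resulting URL could be fetched with any HTTP client; the sketch stops short of that so it can be run offline.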

Web

Whilst the genomic data is being prepared, the Web team work on new displays and website features, maintain BLAST/BLAT in addition to revamping extant online tools, such as our popular VEP. They bring together all the finished databases and make the content available on the Ensembl web browser in a number of ways:

  • The website configuration is updated to access the new data
  • The databases are copied to the public MySQL servers and dumped for downloading from the FTP site.

When the new release is ready to go live, a copy of the current version is set up as an archive, and the web server is updated to point to the new site. A new release also includes updated APIs from many of the teams and tools to access the fully integrated data.
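For readers who query the public MySQL servers directly, it helps to know how the per-release databases are named. The Python sketch below assumes the usual species-specific pattern `<species>_<group>_<release>_<assembly>` (for example `homo_sapiens_core_76_38`); the list of groups and the helper names `parse_db_name` and `build_db_name` are illustrative assumptions, and multi-species databases such as ensembl_compara do not follow this pattern.

```python
import re
from typing import NamedTuple


class DbName(NamedTuple):
    species: str
    group: str
    release: int
    assembly: str


# Assumed set of database groups, taken from the team descriptions above.
GROUPS = ("core", "otherfeatures", "cdna", "rnaseq", "variation", "funcgen")

_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+?)_(?P<group>" + "|".join(GROUPS) + r")"
    r"_(?P<release>\d+)_(?P<assembly>\w+)$"
)


def parse_db_name(name):
    """Split a species-specific database name into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a species-specific database name: {name!r}")
    return DbName(
        match["species"], match["group"], int(match["release"]), match["assembly"]
    )


def build_db_name(species, group, release, assembly):
    """Build a database name from its components."""
    if group not in GROUPS:
        raise ValueError(f"unknown database group: {group!r}")
    return f"{species}_{group}_{release}_{assembly}"


parsed = parse_db_name("homo_sapiens_core_76_38")
rebuilt = build_db_name("homo_sapiens", "variation", 76, "38")
```

Keeping the release number in the name is what allows an archived release and the current one to sit side by side on the same server.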

Outreach

The Outreach team carry out usability testing of the new displays and website features developed during the release cycle. The team also update the help and documentation pages, provide user support and online training, deliver workshops and engage with our users via YouTube, Twitter and Facebook. They are the bridge between the different Ensembl teams and our user communities, making sure our users know how and when to access and view the data in our integrated tools and resources.

The special release cycle for GRCh38

GRCh38, the new assembly of our most popular genome, human, was released by the GRC in December 2013.

Incorporating a new human assembly requires extra work during the release cycle. In addition to the usual steps described above, we also need to remap and reprocess our data against the new genome. This invariably leads to a longer release cycle.

We want to deliver a top-quality annotation of genes, sequence variants, regulatory elements and comparative features for the new human assembly.

The blog posts from the Ensembl teams in this GRCh38 series will describe the extra steps involved in preparing the updated data for the new human assembly. For more details and comments, just get in touch.

Ensembl 73 is scheduled for release in August 2013. Highlights for the coming release include:

Updated gene sets
  • Updated human gene set including Havana manual annotation using GRCh37.p12 release
  • Zebrafish including manual annotation from Havana
  • Rabbit, including RNAseq data
Variation data imports and updates
  • COSMIC version 65
  • PhenCode data
  • HGMD-PUBLIC data from release 2013.2 with regulatory data for human
  • Mouse phenotype data from EuroPhenome, International Mouse Phenotyping Consortium and WTSI Mouse Genetics Project
  • Update of COSMIC structural variants
New species
[Images: collared flycatcher, courtesy of Gareth Peacock; duck (Anas platyrhynchos)]

  • Collared flycatcher (FicAlb_1.4), including RNASeq data
  • Duck (BGI_duck_1.0)
New web features
  • New search engine with enhanced interface
  • Alternative display styles for assembly exceptions

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

Ensembl release 69 is scheduled for release in October.

We expect this release to include, among other things:

  • The latest sequence variants from dbSNP build 137 for human
  • New DNA methylation data from multiple human and mouse cell lines including data from ENCODE and ES cells in both species
  • Updated human and mouse CCDS sets
  • An increased number of human phenotype annotations
  • New COSMIC variants update
  • An import of the Exome Sequencing project dataset
  • More 1000 Genomes structural variations
  • The first inclusion of HAVANA manual curation for Ensembl genes in pig
  • Update of the HAVANA manual curation for human and zebrafish
  • Ferret (Mustela putorius furo) and Southern Platyfish (Xiphophorus maculatus) genomes with full annotation. These are new species and each will include an RNA-seq database in addition to the gene annotation. These species will also be included in new multiple and pairwise alignments and into recalculated GeneTrees and homology annotations.

For more details on the declared intentions, please visit our Ensembl admin site. Please note that these are intentions and are not guaranteed to make it into the release.

We are delighted to announce the latest Ensembl release 64 (e!64).

This release includes assemblies for two new species, lamprey (Petromyzon marinus) and Tasmanian devil (Sarcophilus harrisii), as well as a patch of the human assembly (GRCh37.p5) and an update of the cow assembly (UMD 3.1). We have incorporated the most recent human and mouse manual gene annotations from HAVANA, new regulation data for human and mouse, as well as many other interesting data updates and features. The previous Ensembl release is archived at e63.ensembl.org.

Petromyzon_marinus_7.0 is an assembly of the sea lamprey (Petromyzon marinus), provided by the lamprey consortium and sequenced to a total of 5.0X whole genome coverage. The gene set for lamprey was built using the Ensembl genebuild pipeline. New translated BLAT whole genome pairwise alignments against the zebrafish, the stickleback, Ciona intestinalis and the human genome are now available for lamprey. Protein trees now include genes from the lamprey (10,079 genes); with the inclusion of the lamprey, 849 more trees have a root older than the last common ancestor of bony vertebrates.

We now have new phenotype views showing the genes associated with diseases and phenotypes. The new phenotype page can be accessed via the gene tab. Genes and variations associated with a phenotype can also be displayed on a karyotype. The associated colour key corresponds to the p-value of the association between the variation and the phenotype.

In order to make turning on data tracks easier, a number of changes have been made to the configuration panel in the region in detail page (accessed via the “Configure this page” button), including a new menu structure with grouping for similar track types. Configuration for regulatory evidence is now accessible via two links in the Regulation section of the menu for the configuration panel – “Open chromatin & TFBS” and “Histones & polymerases”.

The Tasmanian devil (Sarcophilus harrisii) 7.0 assembly, provided by Illumina and the Wellcome Trust Sanger Institute, has been added as a new species to Ensembl for release 64. RNASeq data were used in the genebuild and can be found in the otherfeatures database. More detailed information on the genebuild can be found here.

Check out our improved FAQs. These have been reorganized into categories.

Confused about browser navigation? Why not try our new e-learning course!

More details on some of these changes will be posted soon, so keep an eye on our blog!

More information is also available on the Ensembl website.