Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 takes advantage of the Sequence Ontology (SO). In the initial release, we attempted to map all existing Ensembl biotypes to equivalent SO terms.

This has proven unsatisfactory for several reasons:

  • not all biotypes have an equivalent SO term
  • there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
  • some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
  • some SO mappings are inaccurate or missing
  • the SO term is mostly redundant with the biotype, which is also provided as an attribute
  • there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not

For all these reasons, our SO term mapping has undergone a major overhaul to take better advantage of what the Sequence Ontology offers. This new mapping, which will be used from release 90 onwards, provides general biotype groupings that match the ones used on the website. As a result, all gene biotypes are mapped to one of three groups: coding, non-coding or pseudogene. Meanwhile, transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.

Additionally, the groupings remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
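To see how the third column relates to the biotype attribute in a given file, a minimal sketch along the following lines may help. It assumes the standard nine tab-separated GFF3 columns with `key=value` attributes separated by semicolons; the sample line and its identifier are made up for illustration and are not taken from a real Ensembl file.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Return the feature type (column 3) and the attribute dict (column 9)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key] = unquote(value)
    return cols[2], attributes


# Illustrative line only; ENSG_EXAMPLE is a hypothetical identifier.
sample = "1\tensembl\tgene\t1000\t5000\t.\t+\t.\tID=gene:ENSG_EXAMPLE;biotype=protein_coding"
so_term, attrs = parse_gff3_line(sample)
print(so_term, attrs.get("biotype"))
```

Comparing `so_term` with `attrs["biotype"]` across a whole file is one way to see the general grouping next to the finer-grained biotype.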

To see the full extent of those changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP.
We hope this improvement will help our users take better advantage of the GFF3 format.

You may have noticed our beta REST server has been retired. We have replaced it with our new service, http://rest.ensembl.org, and have a handy migration guide to help you update existing scripts. Details about the new server can be found in the article published in Bioinformatics. Some of the improvements include:

  • New POST endpoints
  • POST messages allow users to submit a list of inputs as a single request
    This is supported for the archive, lookup and vep endpoints
  • The rate limit has been increased, with up to 15 requests per second allowed
    Combined with POST, we were able to process 1000 variants per second!
  • New /variation endpoint to retrieve variation information linked to a gene or a transcript
  • New /regulatory endpoint to retrieve data from the regulatory build
  • HTTPS support for clients working with a secure environment
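As a rough illustration of the POST support, the sketch below builds a single request carrying several stable ids. The `/lookup/id` path and the `ids` key in the JSON body reflect our understanding of the lookup endpoint and should be checked against the server documentation; the helper name `build_lookup_post` is hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

SERVER = "http://rest.ensembl.org"


def build_lookup_post(ids, server=SERVER):
    """Build (but do not send) a POST request with a list of stable ids."""
    body = json.dumps({"ids": list(ids)}).encode("utf-8")
    return urllib.request.Request(
        server + "/lookup/id",  # assumed endpoint path, see lead-in
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_lookup_post(["ENSG00000168528", "ENSG00000268651"])

# To actually send the request:
# with urllib.request.urlopen(request) as response:
#     genes = json.load(response)
```

Grouping inputs into one POST keeps the number of requests low, which makes it easier to stay within the rate limit.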

This server provides access to the latest data in Ensembl, including the new human build on the GRCh38 assembly. For those wishing to use data from the GRCh37 assembly, a dedicated server is available on http://grch37.rest.ensembl.org

As mentioned in another post, due to the presence of patches in both GRCh37 and GRCh38, the assembly mapping has proven challenging.
Related to this, another novelty arises when assigning stable ids to genes.

Every time a gene set is updated for a species, we compare the newest gene set with the previous one.
If we find a perfect match between the two gene sets, the stable id assigned to the older model will be used for the new model.
Even if the model has changed slightly (longer UTR for example), we try to map the old stable id whenever possible, with a version change to indicate that it was not a perfect match.
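The rule above can be summarised in a toy sketch. This is not the actual stable id mapping code: the function name is hypothetical, the identifier is a placeholder, and incrementing the version by one is an assumption made for illustration.

```python
def carry_over_stable_id(old_id, old_version, perfect_match):
    """Reuse the old stable id; change the version only if the model changed.

    Assumption: a "version change" is modelled as old_version + 1.
    """
    new_version = old_version if perfect_match else old_version + 1
    return old_id, new_version


# Placeholder identifier and made-up version numbers.
unchanged = carry_over_stable_id("ENSG_EXAMPLE", 3, perfect_match=True)
modified = carry_over_stable_id("ENSG_EXAMPLE", 3, perfect_match=False)
print(unchanged, modified)
```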

To provide a better comparison between the last GRCh37 gene set (e!75) and the new GRCh38 gene set (e!76), we have decided to project the old set onto the new assembly. This allows for overlap comparisons rather than simple sequence alignments. However, this means that around 2% of the genes are lost, as they cannot be mapped onto the new assembly. If these gene models are still present in the new assembly, they are assigned a new stable id.

In the context of patch fixes integrated into the new reference, we also have cases where two genes in GRCh37 (one on the reference, one on the patch) both match the same gene on the new reference in GRCh38.
In that case, we have decided to arbitrarily keep the longest-standing stable ID, which is likely to be the one on the reference.
The stable ID which was used on the patch is recorded as retired but a link is provided to its replacement. For example, searching for ENSG00000260384 (SERINC2 gene on HG989_PATCH) will redirect the user to ENSG00000168528 (SERINC2 on the primary assembly).

This resulted in the deletion of around 3% of our genes.

In other cases, the difference between the GRCh37 reference (without patch) and the GRCh38 reference (with the integrated patch fix from GRCh37) is too great to project annotations from the reference. Only annotations from the patch are then kept, along with their stable ids. For these cases, if there is a known alt_allele to a gene on the GRCh37 reference, it is added as a link to its equivalent on the patch.

Consequently, searching for ENSG00000183678 (CTAG1A gene on the GRCh37 primary assembly) will redirect the user to ENSG00000268651 (CTAG1A gene on HG1497_PATCH in GRCh37, on the primary assembly in GRCh38).

As mentioned in the blog post about the new gene set, a new assembly implies a number of underlying changes in the gene structure.
Despite this, 95% of all the gene stable ids have been assigned to the new gene models.
With this work, we try to ensure that you will still be able to find your favourite gene using the same stable id as in GRCh37.

As you may know, the new GRCh38 assembly for human was released in December 2013. This is a major update for Ensembl and will require months of hard work to provide high quality annotation for our users. Our goal is to provide a full genebuild on the GRCh38 assembly, as well as regulation, comparative and variation features.

As part of the Ensembl core team, I am responsible for generating a reliable mapping between the GRCh37 and GRCh38 assemblies. This mapping will be used by other teams to project existing annotations onto new coordinates. Therefore, it is important to get this right if we don’t want to end up with features in the wrong location!

The basic principle of assembly mapping is relatively simple. Let’s say we are mapping chromosome 1 in GRCh37 to chromosome 1 in GRCh38. For both chromosomes, we get the list of contigs used to construct the chromosome. If the same contigs are used, in the same order, in both chromosomes, these can be mapped directly. For the remaining unmapped regions, where no shared contigs can be found, the sequences are aligned using lastz.
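A toy version of the shared-contig step might look like the sketch below. It is not the Ensembl mapping pipeline: it uses Python's `difflib.SequenceMatcher` as a stand-in to find runs of contigs that occur in the same order in both chromosomes, and the contig names are invented. Whatever is left over would be the input to the alignment step.

```python
from difflib import SequenceMatcher


def shared_contig_blocks(old_contigs, new_contigs):
    """Return runs of contigs found in the same order in both assemblies.

    Each argument is the ordered list of contig names for one chromosome.
    """
    matcher = SequenceMatcher(None, old_contigs, new_contigs, autojunk=False)
    blocks = []
    for match in matcher.get_matching_blocks():
        if match.size:  # the last block is a zero-length sentinel
            blocks.append(old_contigs[match.a:match.a + match.size])
    return blocks


def unmapped_contigs(old_contigs, blocks):
    """Contigs of the old chromosome not covered by any shared block."""
    mapped = {name for block in blocks for name in block}
    return [name for name in old_contigs if name not in mapped]


# Invented contig names for illustration.
old = ["ctgA", "ctgB", "ctgC", "ctgD", "ctgE"]
new = ["ctgA", "ctgB", "ctgX", "ctgD", "ctgE"]
blocks = shared_contig_blocks(old, new)
leftover = unmapped_contigs(old, blocks)
print(blocks, leftover)
```

In this example the region covered by `ctgC` has no shared contig, so it is the part that would need a sequence alignment.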

Results of the mapping: how similar are the assemblies?

The results of the mapping meet our expectations. For the 24 chromosomes as well as the haplotype regions, we map between 95 and 100% of the non-N sequence. Out of 82 regions, 72 map over 99%. To check the consistency of these mappings, the Ensembl gene set (GENCODE) is copied from GRCh37 to GRCh38. 97% of the transcripts find an identical model in GRCh38, with 98.5% of exons mapped correctly. Only 1.5% of the total transcripts do not have an equivalent model in the new assembly. This is expected, as we know some regions in GRCh37 do not exist in GRCh38.
For example, the gene PPIAL4A, associated with CCDS30835.1, is on a reference region in GRCh37 which is overlapped by patch HG1287. In GRCh38, that region does not exist and PPIAL4A is lost. PPIAL is a family of retrogenes, and other PPIAL4 models will still be in GRCh38.

Two additional regions have proved challenging for our mapping.

Chromosome 17:22904289-37003842:
This region in GRCh37 has become a haplotype in GRCh38 (HSCHR17_1_CTG4). As we do not provide mappings between haplotypes, we have only an approximate alignment between the reference in GRCh37 and the reference in GRCh38.

Chromosome 9:42900000-66450000, flanking the centromeric region:
This region in GRCh37 corresponds to 9:40700000-61600000 in GRCh38 and has undergone some massive changes on a sequence level. Some of the contigs have been split, shortened, extended, or simply removed. This means that gene models located on this region will change considerably from the gene models in GRCh37.

If your favourite gene is not in one of these regions though, there is a good chance you will be able to identify it using the same stable_id as in release 75! You’ll be able to read more about stable id mapping in a future post in this series.

Challenge: Patches

One of the major challenges when mapping GRCh37 to GRCh38 comes with patch regions. Shortly after GRCh37 was first released, a number of sequence differences were noticed. Rather than provide a whole new assembly, the concept of patches was introduced.

For regions where a sequencing error was corrected, a patch fix was added. It contains the corrected reference sequence as well as some padding on both ends, so that it can be placed on the genome.

In Ensembl, we provide annotation for both the reference and the patch region. Where the modified sequence is relatively short, a number of annotations are identical between the reference and the patch.

For example, CHAMP1 is a merged gene on chromosome 13 but has also been annotated on patch HG531_PATCH.

For regions where an alternative sequence was found, a patch novel is added.

In GRCh38, all patch novels will still exist as haplotypes. For the patch fixes, it is another story altogether. Given that these patches fix an error in the reference sequence in GRCh37, they will become the reference in GRCh38, replacing the GRCh37 sequence. This means that we are likely to keep the annotation produced on patches in GRCh37 while losing the GRCh37 reference annotation.

To deal with the special patch cases, we add an additional step in the assembly mapping. For patch fixes in GRCh37, we know their contig composition, as well as where they are mapped against the reference. Presuming the contig composition has not changed, we should be able to locate the same region in the reference in GRCh38. It should then be possible to map any feature in GRCh37, whether on patch or reference, onto GRCh38.

The mouse assembly GRCm38 (GCA_000001635.2) was submitted by the Genome Reference Consortium (GRC). The whole assembly comprises 65 toplevel sequences: 19 autosomes, X and Y, 22 unlocalized scaffolds and 22 unplaced scaffolds. These toplevel sequences are assembled from 20,423 contigs with an N50 value of 191 kb. The N50 size is the length such that 50% of the assembled genome lies in blocks of the N50 size or longer. On the mouse Pre! site, you can view mouse protein, cDNA and EST alignments, as well as alignments of the Ensembl release 66 mouse translations. This assembly will undergo full automatic gene annotation in due course.
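To make the N50 definition concrete, here is a small sketch that computes it from a list of block lengths. The lengths are made up and do not correspond to the mouse contigs.

```python
def n50(lengths):
    """Largest length L such that blocks of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")


# Made-up contig lengths: total 20, so the halfway point is 10.
example = [2, 3, 4, 5, 6]
print(n50(example))  # 6 + 5 = 11 reaches the halfway point, so this prints 5
```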