Since release 81, Ensembl has provided gene annotations in GFF3 files alongside the existing GTF ones. While GTF uses its own controlled vocabulary to classify features, GFF3 takes advantage of the Sequence Ontology (SO). In the initial release, we attempted to map all existing Ensembl biotypes to equivalent SO terms.
This has proven unsatisfactory for several reasons:
- not all biotypes have an equivalent SO term
- there are too many levels of granularity, with 25 terms for genes and another 25 for transcripts
- some SO mappings do not respect the parent-child relationship expected between gene and transcript SO terms
- some SO mappings are inaccurate or missing
- the SO term is mostly redundant with the biotype, which is also provided as an attribute
- there can be confusion when most features have identical values in the third column (the SO term) and the biotype attribute, yet a handful do not
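The last point can be checked directly on a file. The following is a minimal sketch, assuming each feature line has nine tab-separated columns and carries a `biotype` key in its attribute column. The function names and the sample records are hypothetical and made up for illustration; they are not copied from an Ensembl release.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def type_biotype_mismatches(lines):
    """Yield (feature ID, column-3 type, biotype) where the two values differ."""
    for line in lines:
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        biotype = attrs.get("biotype")
        if biotype is not None and biotype != cols[2]:
            yield attrs.get("ID"), cols[2], biotype


# Made-up records for illustration only (assumption, not real Ensembl data).
EXAMPLE = [
    "1\texample\tsnRNA\t100\t200\t.\t+\t.\tID=transcript:T1;biotype=snRNA",
    "1\texample\ttranscript\t300\t400\t.\t-\t.\tID=transcript:T2;biotype=retained_intron",
]

mismatches = list(type_biotype_mismatches(EXAMPLE))
for feature_id, so_term, biotype in mismatches:
    print(feature_id, so_term, biotype)
```

Only the second sample record is reported, because its third column and its biotype attribute disagree.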
For all these reasons, our SO term mapping has undergone a major overhaul to take better advantage of the functionality the Sequence Ontology offers. The new mapping, which will be used from release 90 onwards, attempts to provide general biotype groupings that match the ones used on the website. As a result, all gene biotypes are mapped to one of three groups: coding, non-coding or pseudogene. Meanwhile, transcript biotypes are mapped to one of five main groups: mRNA, pseudogenic_transcript, long non-coding RNAs, short non-coding RNAs and IG biotypes.
Additionally, the groupings remove some of the previous granularity, which can still be explored via the biotype attribute, and the assigned terms respect the gene-transcript relationship where possible.
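One way to explore that granularity is to tabulate which biotypes appear under each third-column term. The sketch below is only an illustration: `biotypes_by_type` is a hypothetical helper, and the sample records are made up rather than taken from the actual release 90 mapping.

```python
import re
from collections import defaultdict

# Matches the value of the biotype key in a GFF3 attribute column.
BIOTYPE_RE = re.compile(r"(?:^|;)biotype=([^;]+)")


def biotypes_by_type(lines):
    """Map each column-3 term to the sorted list of biotypes seen with it."""
    groups = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        match = BIOTYPE_RE.search(cols[8])
        if match:
            groups[cols[2]].add(match.group(1))
    return {term: sorted(values) for term, values in groups.items()}


# Made-up records for illustration only (assumption, not the real mapping).
EXAMPLE = [
    "1\texample\tmRNA\t1\t10\t.\t+\t.\tID=transcript:A;biotype=protein_coding",
    "1\texample\tpseudogenic_transcript\t20\t30\t.\t+\t.\tID=transcript:B;biotype=processed_pseudogene",
    "1\texample\tpseudogenic_transcript\t40\t50\t.\t-\t.\tID=transcript:C;biotype=unprocessed_pseudogene",
]

table = biotypes_by_type(EXAMPLE)
for term, biotypes in table.items():
    print(term, ", ".join(biotypes))
```

For a real, compressed GFF3 file, the same function could be fed the handle returned by `gzip.open(path, "rt")`, where `path` is wherever the downloaded file was saved.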
To see the full extent of these changes, as they will be reflected in the GFF3 files provided from release 90 onwards, please check these files on the FTP site.
We hope this improvement will help our users take better advantage of the GFF3 format.