We are pleased to announce the release of Ensembl 113, and the corresponding release of Ensembl Genomes 60. This release brings major gene and regulatory feature annotation updates in Homo sapiens (Human) and Mus musculus (Mouse). We have updated existing genomes and added additional genomes across the different Ensembl sites, including livestock breeds in Ensembl, three new species in Ensembl Plants and 26 new species in Ensembl Metazoa. Can’t find the species you are looking for? Don’t forget that new and exciting genome assemblies and annotations are continuously added to Ensembl Rapid Release!
Vertebrates
Gene annotation
Ensembl 113 features the biggest ever expansion of the GENCODE human and mouse gene annotations from one release to the next. This is due to our integration of large numbers of new transcript models produced by the GENCODE Capture Long-read Sequencing (CLS) project, further details of which are provided on this GENCODE documentation page. The present phase of the CLS project was specifically designed to expand the annotation of lncRNAs, and we have added over 130,000 new lncRNA transcripts to both species. We now have approximately twice as many lncRNA genes for these species as we did before. A manuscript to describe this work is currently being finalised.
The default gene track has changed from GENCODE Comprehensive to GENCODE Basic to accommodate the increased number of transcripts. Consequently, a new GENCODE Primary tag has been introduced. This tag represents a minimal set of transcripts including all well-conserved exons and exon skips that have high expression in human protein-coding genes. It will be extended to other biotypes and to mouse at a later date. Further details can be found in the GENCODE FAQs. Transcripts in this set are labelled gencode_primary in REST API responses and GFF3/GTF files. The existing GENCODE Basic tag, previously labelled basic, has been renamed gencode_basic. Tags have not been changed in archives.
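For scripts that filter GTF files by tag, the rename means that a check for the old basic label will no longer match current files, while archive files still carry it. The following is a minimal sketch, assuming the standard GTF convention of tag "value"; pairs in the ninth column. The helper names and the example record are hypothetical and invented for illustration. GFF3 files use a different attribute syntax and would need their own parser.

```python
import re

# Matches every `tag "value"` pair in a GTF attribute column.
TAG_PATTERN = re.compile(r'\btag "([^"]+)"')


def gtf_tags(line):
    """Return the set of tag values on one GTF line (empty for comments)."""
    if line.startswith("#"):
        return set()
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return set()
    return set(TAG_PATTERN.findall(columns[8]))


def has_basic_tag(line):
    # Accept both the older label (still used in archives) and the renamed one.
    return bool(gtf_tags(line) & {"basic", "gencode_basic"})


def has_primary_tag(line):
    return "gencode_primary" in gtf_tags(line)


# Hypothetical GTF record, invented for illustration only.
example = "\t".join([
    "1", "havana", "transcript", "100", "900", ".", "+", ".",
    'gene_id "ENSG_EXAMPLE"; transcript_id "ENST_EXAMPLE"; '
    'tag "gencode_basic"; tag "gencode_primary";',
])

print(sorted(gtf_tags(example)))
print(has_basic_tag(example), has_primary_tag(example))
```

Accepting both labels in has_basic_tag lets the same script run against current and archived files.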
Updated Havana manual gene annotations are also available for Human (GRCh38) and Mouse. The major histocompatibility complex (MHC) genes in Rattus norvegicus (Norway rat; mRatBN7.2) and Sus scrofa (Pig; Sscrofa11.1) have also been manually updated, along with additional immune genes.
Moreover, genome assemblies and gene annotations for additional breeds of existing species are now available:
- Capra hircus (Goat): 2 breeds (Saanen dairy and Xinong Saanen dairy)
- Ovis aries (Sheep): 8 breeds (Qiaoke, Kermani, East Friesian, Polled Dorset, White dorper, Hu, Chinese merino and Romanov)
- S. scrofa (Pig): 8 breeds (Bama miniature, Duroc, NIHS-2020, PB115, Euw1, Ossabaw miniature, Meishan and Ningxiang)
Variation
To keep variant displays manageable given the large increase in Human GRCh38 transcripts, driven by the ongoing incorporation of models from long-read sequencing data, pre-calculated transcript consequences are now displayed only for the GENCODE Primary set, which includes all exons in this assembly. The Human GRCh37 displays remain unchanged. A VCF file containing variant annotations for the full set of GRCh38 transcripts is available on the FTP site.
The Ensembl Variant Effect Predictor (VEP) now supports the GENCODE Primary transcript set to enable annotation of all potential variant consequences without duplication across multiple transcripts. The following features are supported:
- Optionally limiting Ensembl VEP annotation to this set
- Reporting the flag in Ensembl VEP output (as was done for MANE) as a boolean
gnomAD v4.1 population allele frequency data is now available for Human on the website, command-line, and REST API instances of Ensembl VEP. Additionally, REVEL and ClinPred scores, along with ribosome profiling open reading frame (Ribo-seq ORF) annotations, are now available on the website and REST API versions of Ensembl VEP. The Paralogues plugin is now supported on the website and the LOEUF plugin is available on the REST API. Plugin data have also been updated for CADD (v1.6 to v1.7) and dbNSFP (v4.5c to v4.7c). From this release onwards, we only generate indexed Ensembl VEP cache files.
We have updated our Nextflow VEP pipeline so that it can accept all the input formats Ensembl VEP supports, including VCF.
The O. aries (Sheep; ARS-UI_Ramb_v2.0; GCA_016772045.1) genome now includes variants from the European Variation Archive (EVA) release 5.
Regulation
We have added new regulatory annotation for Cow. Regulatory annotation for Human and Mouse has undergone a major update, including regulatory features, a set of epigenomes (tissues and cell lines), and the associated regulatory activity. The Transcription Factor Binding Sites (TFBS) regulatory feature has been retired for both Human and Mouse. Instead, we recommend using the motif features track, which has been updated for Human and added to Mouse. This track can be found under Other regulatory features in the Regulation section of the configuration menu in the Location tab.
The format of ENSR IDs has been updated for chromosome-level assemblies, and now includes additional characters (capital letters and the underscore symbol ‘_’). This change affects eight species with regulatory annotation. Regulatory annotation has been updated across all species to address minor inconsistencies, incorporate additional data, and introduce the new ENSR IDs.
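Scripts that validate regulatory feature IDs against a digits-only pattern may reject the new identifiers. The sketch below only illustrates relaxing such a check. It assumes a legacy human layout of ENSR followed by 11 digits, and the exact layout of the new IDs is not specified here, so the relaxed pattern checks the character set only. It is not a full validator: it does not cover other species prefixes, and it can accept other stable IDs that start with ENSR. The ID ENSR1_ABC23 is hypothetical and invented for illustration.

```python
import re

# Assumption: legacy human regulatory IDs are "ENSR" plus 11 digits.
LEGACY_ENSR = re.compile(r"^ENSR\d{11}$")

# Relaxed character-set check: digits, capital letters and underscore.
# This does not encode the real layout of the new IDs, which is not given here.
RELAXED_ENSR = re.compile(r"^ENSR[0-9A-Z_]+$")


def looks_like_legacy_ensr(identifier):
    return bool(LEGACY_ENSR.match(identifier))


def fits_relaxed_ensr_charset(identifier):
    return bool(RELAXED_ENSR.match(identifier))


# "ENSR1_ABC23" is a made-up identifier, not a real regulatory feature.
for candidate in ["ENSR00000000001", "ENSR1_ABC23", "ensr1_abc23"]:
    print(candidate,
          looks_like_legacy_ensr(candidate),
          fits_relaxed_ensr_charset(candidate))
```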
The following REST API regulation endpoints are now retired:
- GET regulatory/species/:species/id/:id
- GET regulatory/species/:species/epigenome
- GET regulatory/species/:species/microarray/:microarray/vendor/:vendor
- GET regulatory/species/:species/microarray
- GET regulatory/species/:species/microarray/:microarray/probe/:probe
- GET regulatory/species/:species/microarray/:microarray/probe_set/:probe_set
In addition, the other_regulatory and array_probe options in the feature parameter have been removed from the following endpoints:
- GET overlap/region/:species/:region
- GET overlap/id/:id
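To audit existing client code against these REST changes, one option is to match request paths against the retired endpoint templates and to flag the removed feature options. The following is a minimal sketch under that assumption. The helper names are hypothetical, and the sample paths and the placeholder names SOME_ARRAY and SOME_PROBE are invented for illustration. It makes no network requests.

```python
import re

# Endpoint templates copied from the retirement list above.
RETIRED_TEMPLATES = [
    "regulatory/species/:species/id/:id",
    "regulatory/species/:species/epigenome",
    "regulatory/species/:species/microarray/:microarray/vendor/:vendor",
    "regulatory/species/:species/microarray",
    "regulatory/species/:species/microarray/:microarray/probe/:probe",
    "regulatory/species/:species/microarray/:microarray/probe_set/:probe_set",
]

# Options removed from the feature parameter of the overlap endpoints.
REMOVED_FEATURE_OPTIONS = {"other_regulatory", "array_probe"}


def template_to_regex(template):
    """Turn ':name' placeholders into single-path-segment wildcards."""
    parts = []
    for segment in template.split("/"):
        parts.append("[^/]+" if segment.startswith(":") else re.escape(segment))
    return re.compile("^" + "/".join(parts) + "/?$")


RETIRED_PATTERNS = [template_to_regex(t) for t in RETIRED_TEMPLATES]


def is_retired_path(path):
    """True if the request path (query string ignored) hits a retired endpoint."""
    bare = path.split("?", 1)[0].lstrip("/")
    return any(pattern.match(bare) for pattern in RETIRED_PATTERNS)


def removed_feature_options(features):
    """Return the requested feature options that are no longer accepted."""
    return sorted(set(features) & REMOVED_FEATURE_OPTIONS)


# Hypothetical request paths and feature lists, for illustration only.
print(is_retired_path("/regulatory/species/homo_sapiens/microarray"))
print(is_retired_path("overlap/id/SOME_ID?feature=gene"))
print(removed_feature_options(["gene", "array_probe", "regulatory"]))
```

Matching on templates rather than literal strings keeps the check independent of the species or array names a client happens to use.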
Ensembl Plants
Additional Watkins Wheat variation data has been added for Triticum aestivum (Common wheat; GCA_900519105.1). The Watkins Wheat variation collection encompasses the genetic diversity found in the A.E. Watkins Landrace Collection of bread wheat, which includes 827 landraces from 32 countries.
Gene annotation for the following existing species has been updated:
- Brassica rapa (Field mustard; GCA_900412535.3)
- Marchantia polymorpha (Common liverwort; GCA_039105155.1)
- Triticum aestivum (Common wheat; Paragon v2; GCA_949126075.1)
Genome assemblies and gene annotation for the following new species are now available:
- Arachis hypogaea (Peanut; GCA_003086295.3)
- Lathyrus sativus (Grass pea; GCA_963859935.3)
- Triticum timopheevii (Sanduri wheat; GCA_963921465.1)
Ensembl Metazoa
Multiple mosquito genomes have been added, including the latest reference genomes for Anopheles coluzzii (GCA_943734685.1) and Anopheles funestus (African malaria mosquito; GCA_943734845.1). Additionally, a new annotation source has been added for the existing Culex quinquefasciatus (Southern house mosquito; GCA_015732765.1) assembly. Furthermore, genome assemblies and gene annotations are now available for the following species:
- Anastrepha ludens (Mexican fruit fly; GCA_028408465.1)
- Anastrepha obliqua (West Indian fruit fly; GCA_027943255.1)
- Anopheles coluzzii (Mosquitos; GCA_943734685.1)
- Anopheles funestus (African malaria mosquito; GCA_943734845.1)
- Bombus affinis (Rusty patched bumble bee; GCA_024516045.2)
- Culex pipiens pallens (Common house mosquito; GCA_016801865.2)
- Cylas formicarius (Sweet potato weevil; GCA_029955315.1)
- Diorhabda carinulata (Northern tamarisk beetle; GCA_026250575.1)
- Diorhabda sublineata (Subtropical tamarisk beetle; GCA_026230105.1)
- Hylaeus anthracinus (Anthricinan yellow-faced bee; GCA_026225885.1)
- Hylaeus volcanicus (Volcano masked bee; GCA_026283585.1)
- Lutzomyia longipalpis (Sandfly; GCA_024334085.1)
- Malaya genurostris (Mosquitos; GCA_030247185.2)
- Microplitis demolitor (Parasitoid wasp; GCA_026212275.2)
- Microplitis mediator (Endoparasitoid wasp; GCA_029852145.1)
- Mytilus californianus (California mussel; GCA_021869535.1)
- Phlebotomus argentipes (Sandfly; GCA_947086385.1)
- Phlebotomus papatasi (Sandfly; GCA_024763615.2)
- Plodia interpunctella (Indianmeal moth; GCA_027563975.1)
- Spodoptera frugiperda (Fall armyworm; GCA_023101765.3)
- Topomyia yanbarensis (Mosquitos; GCA_030247195.1)
- Toxorhynchites rutilus septentrionalis (Elephant mosquito; GCA_029784135.1)
- Uranotaenia lowii (Sandfly; GCA_029784155.1)
- Wyeomyia smithii (Pitcher plant mosquito; GCA_029784165.1)
- Zeugodacus cucurbitae (Melon fly; GCA_028554725.2)
The following genomes are no longer available:
- Athalia rosae (Turnip sawfly; GCA_000344095.2)
- Drosophila yakuba (Fruit fly; GCA_000005975.1)
- Melitaea cinxia (Glanville fritillary; GCA_000716385.1)
Other updates and changes
- The latest InterProScan version (5.69-101.0) has been run on all species, including bacteria.
- Cross-references for all plant and vertebrate species are fully updated.
- UniProt references for all species are up-to-date.
- The Ensembl core API is now available on CPAN.
- Ensembl Genomes 37 (October 2017), Ensembl Genomes 40 (July 2018) and Ensembl 97 (July 2019) archives are now retired. Data are still available via the Ensembl and Ensembl Genomes FTP sites.