Ensembl 94 is out!

The latest version of Ensembl, release 94, is out and have we got some treats for you. As well as GENCODE updates for human and mouse, we’ve also got loads of new fish. Plus, we have brand new transcription factor binding motifs, additional predictors of variant pathogenicity and updated gene tree pipelines.

New assemblies and gene annotation

Human and mouse have both been updated for release 94, bringing us to GENCODE 29 and M19 respectively.

Fish

It’s all looking a bit fishy. We have a tank full of 38 new fish genomes, plus we’ve updated the genomes of some of our older fish.

All the new fish genomes in Ensembl, arranged alphabetically by scientific name (the three strains of Medaka are represented only once). Can you find Nemo?

Variation updates

We’ve used SIFT and PolyPhen2 in Ensembl for many years to predict how likely it is that missense variants will change protein function. More recently, we’ve made scores from other algorithms available through VEP plugins. In Ensembl 94, we’ve made four additional pathogenicity scores available for known variants in the database. You can see them in the gene variation table, the transcript variation table and the variant gene table. The new scores we’ve added are:

Details of all these algorithms are in our documentation.

The new pathogenicity scores in the gene table.

We also have an update in horse (Equus caballus), with genotypes from six new breeds.

Transcription factor binding motifs

We’ve updated our transcription factor binding motif (TFBM) pipeline in human and mouse. We are now using TFBMs computed by the SELEX project, a much broader collection than the JASPAR collection we previously used. We have also altered our filtering process to align more closely with standard practice, so that our calls capture the breadth of validated binding sites. This has resulted in over 200 million TFBMs across the human genome and 30 million in the mouse genome. As well as the motif positions, we have annotated the cell lines in which the transcription factor is known to bind these positions, based on matched ChIP-seq data. These data are available in our existing interfaces, such as the region in detail view and the regulation tab, as well as in a new interface for displaying TFBMs.


Click on a TFBM identifier to see the sequence logo as a pop-up.

Because of the number of motifs, which cover a large proportion of the genome, we are no longer assigning motif feature consequences to variants, either for known variants or for VEP analysis. It will be possible to find the motif features a variant overlaps using VEP custom annotation with a BED file export of these data.
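To give a feel for what "overlaps" means here, the sketch below checks a variant against motif features held in BED format. It is not how VEP does it, just a minimal illustration under a couple of assumptions: the BED export has at least four tab-separated columns (chromosome, start, end, name), and the variant is given in 1-based, inclusive coordinates, as in a VCF file. The motif names, coordinates and function names are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class MotifFeature(NamedTuple):
    chrom: str
    start: int  # BED start: 0-based, inclusive
    end: int    # BED end: 0-based, exclusive
    name: str


def parse_bed(lines: Iterable[str]) -> List[MotifFeature]:
    """Parse the first four BED columns, skipping blank and header lines."""
    features = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        features.append(MotifFeature(chrom, int(start), int(end), name))
    return features


def overlapping_motifs(
    features: Iterable[MotifFeature], chrom: str, var_start: int, var_end: int
) -> List[str]:
    """Names of motifs overlapping a variant given in 1-based, inclusive coordinates."""
    # Convert the variant to BED-style 0-based, half-open coordinates.
    v_start, v_end = var_start - 1, var_end
    return [
        f.name
        for f in features
        if f.chrom == chrom and f.start < v_end and v_start < f.end
    ]


# Hypothetical motif features, made up for illustration only.
example_bed = [
    "1\t100\t110\tmotif_A",
    "1\t105\t120\tmotif_B",
    "2\t100\t110\tmotif_C",
]
features = parse_bed(example_bed)

# A single-base variant at position 106 on chromosome 1.
print(overlapping_motifs(features, "1", 106, 106))  # ['motif_A', 'motif_B']
```

The main thing to watch is the coordinate shift: a BED start of 100 means the feature's first base is position 101 in 1-based terms, so a variant at position 100 does not overlap motif_A. A linear scan like this is fine for a handful of features, but at the scale of hundreds of millions of motifs you would want an indexed file or an interval tree instead.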

Gene trees

Our gene tree pipeline, which is used to predict orthologues and paralogues, has been updated. From release 94 (e94) onwards, we will be using hidden Markov models (HMMs) to define our gene trees. The advantage is that we can now add new species and genes to a tree without disrupting the existing genes. This is a great improvement on the old version, as it means the trees won’t change significantly from one release to the next. The new pipeline has been run for vertebrates and plants, and will be run for other species sets in future.

Find out more

If you would like to find out more about these changes, see live demos of how to find the new data on the site, and put your questions to the Ensembl team, please register for the release webinar at 4pm (BST) on Tuesday 9th October. A recording of this webinar will be available on our YouTube channel.