The latest version of Ensembl, release 94, is out and have we got some treats for you. As well as GENCODE updates for human and mouse, we’ve also got loads of new fish. Plus, we have brand new transcription factor binding motifs, additional predictors of variant pathogenicity and updated gene tree pipelines.
New assemblies and gene annotation
Human and mouse have both been updated for release 94, bringing us to GENCODE 29 and M19 respectively.
Fish
It’s all looking a bit fishy. We have a tank full of 38 new fish genomes, plus we’ve updated the genomes of some of our older fish:
- Spiny chromis damselfish (Acanthochromis polyacanthus)
- Midas cichlid (Amphilophus citrinellus)
- Clown anemonefish (Amphiprion ocellaris)
- Orange clownfish (Amphiprion percula)
- Climbing perch (Anabas testudineus)
- Eastern happy (Astatotilapia calliptera)
- Mexican tetra (Astyanax mexicanus) – Update
- Tongue sole (Cynoglossus semilaevis)
- Sheepshead minnow (Cyprinodon variegatus)
- Pike (Esox lucius)
- Mummichog (Fundulus heteroclitus)
- Western mosquitofish (Gambusia affinis)
- Burton’s mouthbrooder cichlid (Haplochromis burtoni)
- Tiger tail seahorse (Hippocampus comes)
- Channel catfish (Ictalurus punctatus)
- Mangrove killifish (Kryptolebias marmoratus)
- Ballan wrasse (Labrus bergylta)
- Zig-zag eel (Mastacembelus armatus)
- Zebra mbuna (Maylandia zebra)
- Ocean sunfish (Mola mola)
- Asian swamp eel (Monopterus albus)
- Lyretail cichlid (Neolamprologus brichardi)
- Medaka Hd-rR (Oryzias latipes Hd-rR) – Update
- Medaka HNI (Oryzias latipes HNI)
- Medaka HSOK (Oryzias latipes HSOK)
- Indian medaka (Oryzias melastigma)
- Elephant fish (Paramormyrops kingsleyae)
- Mudskipper (Periophthalmus magnuspinnatus)
- Sailfin molly (Poecilia latipinna)
- Shortfin molly (Poecilia mexicana)
- Guppy (Poecilia reticulata)
- Lake Victoria cichlid (Pundamilia nyererei)
- Red-bellied piranha (Pygocentrus nattereri)
- Asian arowana (Scleropages formosus)
- Turbot (Scophthalmus maximus)
- Greater amberjack (Seriola dumerili)
- Yellowtail amberjack (Seriola lalandi dorsalis)
- Bicolor damselfish (Stegastes partitus)
- Fugu (Takifugu rubripes) – Update
- Monterrey platyfish (Xiphophorus couchianus)
- Southern platyfish (Xiphophorus maculatus) – Update
Variation updates
We’ve used SIFT and PolyPhen2 in Ensembl for many years to predict how likely it is that missense variants will change protein function. More recently, we’ve made scores from other algorithms available through VEP plugins. In Ensembl 94, we’ve made four additional pathogenicity scores available for known variants in the database. You can see them in the gene variation table, the transcript variation table and the variant gene table. The new scores we’ve added are:
Details of all these algorithms are in our documentation.
We also have an update in horse (Equus caballus), with genotypes from six new breeds.
Transcription factor binding motifs
We’ve updated our transcription factor binding motif (TFBM) pipeline in human and mouse. We are now using TFBMs computed by the SELEX project, a much broader collection than the JASPAR set we used previously. We have also altered our filtering process to align more closely with standard practice, so that our calls capture the breadth of validated binding sites. This has resulted in over 200 million TFBMs across the human genome and 30 million in the mouse genome. As well as the motif positions, we have annotated the cell lines in which each transcription factor is known to bind these positions, based on matched ChIP-seq data. These data are available in our existing interfaces, such as the region in detail view and the regulation tab, as well as in a new interface for displaying TFBMs.
Because the motifs are so numerous and cover a large proportion of the genome, we are no longer assigning motif feature consequences to variants, either for known variants or for VEP analysis. It will be possible to find the motif features a variant overlaps using VEP custom annotation with a BED file export of these data.
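To give a feel for what an overlap lookup against a BED export involves, here is a minimal Python sketch. It is not how VEP does it, and the names (`MotifFeature`, `parse_bed`, `overlapping_motifs`) and the toy motif records are made up for illustration. The one assumption worth flagging is the coordinate convention: BED intervals are 0-based and half-open, whereas variant positions are usually quoted 1-based, so the position is shifted by one before comparing.

```python
from typing import Iterable, List, NamedTuple


class MotifFeature(NamedTuple):
    """One BED record (hypothetical container, not an Ensembl type)."""
    chrom: str
    start: int  # BED start: 0-based, inclusive
    end: int    # BED end: 0-based, exclusive
    name: str


def parse_bed(lines: Iterable[str]) -> List[MotifFeature]:
    """Parse tab-separated BED lines, skipping blanks and header lines."""
    features = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        features.append(MotifFeature(chrom, start, end, name))
    return features


def overlapping_motifs(
    features: List[MotifFeature], chrom: str, pos: int
) -> List[MotifFeature]:
    """Return features covering a single-base variant at 1-based `pos`."""
    zero_based = pos - 1  # convert to BED's 0-based coordinates
    return [
        f for f in features
        if f.chrom == chrom and f.start <= zero_based < f.end
    ]


# Toy records standing in for a real BED export of motif features.
bed_text = """\
1\t100\t110\tmotif_A
1\t105\t120\tmotif_B
2\t100\t110\tmotif_C
"""

features = parse_bed(bed_text.splitlines())
hits = overlapping_motifs(features, "1", 106)
print([f.name for f in hits])
```

A linear scan like this is fine for a handful of records; with motif counts in the hundreds of millions you would want an indexed file or an interval tree, which is part of what a dedicated tool takes care of for you.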
Gene trees
Our gene tree pipeline, which is used to predict orthologues and paralogues, has been updated. From Ensembl 94 onwards we will be using hidden Markov models (HMMs) to define our gene trees. The advantage of this is that we can now add new species and genes to a tree without disrupting the genes already in it. This is a great improvement on the old version, as it means that we won’t have significant changes between trees each release. The new pipeline has been run for vertebrates and plants, and will be run for other species sets in future.
Find out more
If you would like to find out more about these changes, see live demos of how to find the new data in the site, and put your questions to the Ensembl team, please register for the release webinar at 4pm (BST) on Tuesday 9th October. A recording of this webinar will be available on our YouTube channel.