Genomics gets global: Ensembl and the VGP

As the community’s capacity for genome sequencing expands, so do its ambitions. Recently, many exciting global genomics projects have been launched, including the Vertebrate Genomes Project (VGP), Darwin Tree of Life (DToL), Earth BioGenome Project (EBP), i5K (insects) and 10KP (plants). Between them, they aim to sequence the genomes of every eukaryote on Earth, and Ensembl are excited to take on the annotation of some of those genomes.

Through these kinds of projects, we can understand the genomics of whole ecosystems, preserve a digital record of endangered species and understand our evolutionary history. They force us to innovate and improve the way we study genomics and bioinformatics, as well as to set standards making it easier for researchers to use the data.

Vertebrate Genomes Project: a project of the G10K Consortium

Given our recent efforts to significantly increase the number of vertebrate annotations in Ensembl, this post will focus on the VGP, which aims to sequence the genomes of all 66,000 vertebrate species alive right now. This global project is a collaboration of laboratories and bioinformatics groups around the world, carrying out sample collection, sequencing, assembly, archiving and (our job) genome annotation.

Our role within the VGP is to lead the way in producing annotation and providing a public framework for complex downstream analysis. We’ve spent the past few years building towards the capacity to annotate many genomes at once. While there’s still a lot of work to do, we’ve demonstrated that we can work at scale with the annotation of 16 rodents in release 90, 12 primates in release 91, 38 fish in release 93, and a collection of four reptiles, 15 mammals and 20 birds expected for release 96. In fact, the number of species Ensembl has added over the past two years exceeds the number incorporated during the preceding 17 years of our existence.

The number of vertebrate genomes in the Ensembl database over time, which has more than doubled in the past two years.

As you may have noticed from the groups of species listed in the previous paragraph, we are gradually working our way through various clades of existing vertebrate assemblies in order to optimise our pipelines for different species. This process has been useful in that it has highlighted challenges that we will face with the coming wave of assemblies. If the vertebrate ecosystem is a jigsaw puzzle, it has 66,000 pieces and we can currently only see a few hundred of them. On top of that, many of the pieces are faded or have bits missing and need replacing. Some parts of the puzzle, like the great apes or rodents, are very clear, but other sections, like reptiles or amphibians, are barely represented.

Vertebrate genomics is a big puzzle, and we don’t yet have all the pieces

The best data for gene annotation is deep transcriptomic data, but we don’t expect this for every species in the VGP. This means that the resulting annotations will vary in quality, based roughly on the coverage of transcriptomic data and the suitability of homology data. During the course of the VGP, we will endeavour to come up with new and innovative solutions to deal with this challenge.

Outside of annotation, there are challenges for the broader bioinformatics community. How do we deal with things like salamanders, where genome sizes can be an order of magnitude larger than human? These genomes will likely break many of the most-used tools in the community (and they pale in comparison to the 130 Gb genome of the marbled lungfish). How will we find and mask repeats in genomes where there are no curated repeat libraries? How do we display, navigate and interact with data sets of 66,000 vertebrates?

Meeting these challenges will not be easy, but we are committed to working with an ever-widening community of geneticists, bioinformaticians, ecologists, taxonomists and more. The resources that we hope to provide will be critical services to the community, and will drive future research into emerging potential model organisms for human disease and into the conservation of endangered species. Crucially, this research will allow us to construct the complete history of vertebrate evolution, at both the gene and chromosomal level. Projects like the VGP are ushering in a new era of genomics and Ensembl are excited to be a part of it.