The Variant Effect Predictor (VEP) is one of Ensembl’s most popular tools. In six years it has grown from a simple Perl script of a couple of hundred lines into a multi-limbed beast with thousands of lines of code and well over 100 configurable options.

VEP is now used by many high-profile projects, institutes and companies around the world. To manage this growth effectively and keep delivering the most reliable and feature-rich variant annotator out there, we’ve had to go back to basics. Over the past six months the VEP codebase has been totally rewritten, and the new version is now available for download. Users of VEP’s web and REST API interfaces should see virtually no difference with the new version, so if that’s you, you can stop reading now!

Users of our command line tool can try the new VEP by visiting https://github.com/Ensembl/ensembl-vep. The full list of changes to the code can be found in the README on GitHub, but these are the main points of note:

  • Faster : process an individual genome in around 30 minutes.
  • Backward-compatible : all data sources (cache files, databases) and most command line flags from the old code are fully compatible with the new code.
  • More reliable : test-driven development means the new code is covered by more than 1500 unit tests with over 99% statement coverage.

For those tied to the current codebase, it is still available as part of the ensembl-tools GitHub repository, though updates and support for this will cease over time. Ensembl release 87 will be the last for which the ensembl-tools version of VEP will be the “primary” VEP codebase. Of course, the previous code and supporting data will remain available as part of Ensembl’s archiving strategy.

Some other points of note:

  • The documentation at ensembl.org still refers to the old code. From Ensembl release 88 onwards full documentation for the new code will be made available.
  • If possible, please report any issues you may find with the new code as a GitHub Issue.
  • The code that calculates variant consequence types (e.g. missense_variant, stop_gained) remains a part of the ensembl-variation API module and has not been (significantly) updated; it is used by both the old and new code. The ensembl-vep codebase performs the following functions:
    • parsing command line flags
    • parsing input
    • reading data from annotation sources (databases, cache files, flat files)
    • interval alignment of input variants with annotation data
    • writing output
    • monitoring statistics
    • threading
    • data filtering interface
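
As a rough illustration of the interval alignment step listed above, the sketch below reports which annotation features overlap an input variant. It is written in Python purely for illustration and is not taken from the ensembl-vep codebase: the Interval type, the function name and the coordinates are all hypothetical, and a plain linear scan stands in for whatever indexing the real code uses.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A named region with 1-based, inclusive coordinates (hypothetical type)."""
    name: str
    chrom: str
    start: int
    end: int


def overlapping(variant: Interval, features: List[Interval]) -> List[str]:
    """Return the names of features on the variant's chromosome that overlap it."""
    return [
        f.name
        for f in features
        # Two closed intervals overlap when each starts before the other ends.
        if f.chrom == variant.chrom
        and f.start <= variant.end
        and variant.start <= f.end
    ]


# Made-up annotation features and a made-up single-base variant.
FEATURES = [
    Interval("tx1", "1", 100, 200),
    Interval("tx2", "1", 150, 300),
    Interval("tx3", "2", 100, 200),
]
print(overlapping(Interval("var1", "1", 201, 201), FEATURES))
```

With the made-up data above, only tx2 is reported: tx1 ends one base before the variant, and tx3 is on a different chromosome.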

Ensembl Variation recently incorporated the latest versions of the dbSNP and 1000 Genomes datasets. While we were able to import all of the variant loci from phase 3 of the 1000 Genomes project, the vast amount of genotype data (2500 individuals x 80 million sites = 200 billion data points!) meant we had to create a new solution to deliver this data through our API and website.

To this end we have extended the Ensembl Variation API to read genotype data directly from tabix-indexed VCF files. The API then calculates frequency and linkage disequilibrium (LD) data from these genotypes on-the-fly. You can see this in action on a typical population genetics page:
[Screenshot: population genetics page]

In order to use this functionality with your local API installation, there are a couple of extra dependencies to install. You may even have them already!

Tabix

The tabix utility is used for rapid random access into compressed position-based text files. It also allows access to data across HTTP and FTP protocols, downloading only a small index file in the process.

To install it, we clone it from GitHub and run a couple of “make” statements. From here on we assume that you typically install things in your $HOME/src/ directory and that you are using bash or a bash-like terminal.

cd ~/src
git clone https://github.com/samtools/tabix.git
cd tabix
make
cd perl
perl Makefile.PL PREFIX=${HOME}/src/
make && make install

You may need the tabix binary in your path; you can either copy ~/src/tabix/tabix to a directory in your path, or add this to your path:

PATH=${PATH}:${HOME}/src/tabix/
export PATH

If it isn’t already, you should also add the relevant path to your PERL5LIB environment variable; the path in question is shown in the output from the “make && make install” command above.

PERL5LIB=${PERL5LIB}:${HOME}/src/lib/perl/5.14.2/
export PERL5LIB

ensembl-io

The ensembl-io package contains objects and methods for parsing and writing data formats commonly used in bioinformatics. If you installed the API using Git and Ensembl Git tools, chances are you already have the module.

If not, it’s simple to install with git:

cd ~/src
git clone https://github.com/Ensembl/ensembl-io.git
PERL5LIB=${PERL5LIB}:${HOME}/src/ensembl-io/modules
export PERL5LIB

Using in the API

That’s it! Now to use this in an API script, there’s a simple flag we have to set on the Variation DBAdaptor object:

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

my $variation_adaptor = $registry->get_adaptor('homo_sapiens', 'variation', 'variation');

# Tell API to use VCFs
$variation_adaptor->db->use_vcf(1);

my $variation = $variation_adaptor->fetch_by_name('rs699');
my $alleles = $variation->get_all_Alleles();

foreach my $allele (@{$alleles}) {
  next unless 
    (defined $allele->population) &&
    (defined $allele->frequency);
  my $allele_string = $allele->allele;
  my $frequency = $allele->frequency;
  my $population_name = $allele->population->name;
  printf("Allele %s has frequency %.3g in %s\n", $allele_string, $frequency, $population_name);
}

This script should print out frequency data for a number of populations, including those from 1000 Genomes phase 3:

....
Allele A has frequency 0.121 in 1000GENOMES:phase_3:KHV
Allele G has frequency 0.879 in 1000GENOMES:phase_3:KHV
Allele A has frequency 0.149 in 1000GENOMES:phase_3:JPT
Allele G has frequency 0.851 in 1000GENOMES:phase_3:JPT
Allele A has frequency 0.295 in 1000GENOMES:phase_3:ALL
Allele G has frequency 0.705 in 1000GENOMES:phase_3:ALL
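
When VCFs are in use, frequencies like the 1000 Genomes ones above are derived from genotypes on the fly. The Python sketch below shows the basic counting involved; it is not the Ensembl API's own code. It assumes genotypes arrive as allele strings such as "A|G" (real VCF files encode alleles as indices into the REF and ALT columns), and the function name is made up for this example.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Compute allele frequencies from genotype strings such as "A|G" or "A/G".

    Missing calls (".") are ignored. Returns an empty dict if nothing was called.
    """
    counts = Counter()
    for genotype in genotypes:
        # Treat phased ("|") and unphased ("/") separators the same way.
        for allele in genotype.replace("/", "|").split("|"):
            if allele != ".":
                counts[allele] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# Four made-up diploid genotypes: 3 copies of A and 5 copies of G.
print(allele_frequencies(["A|G", "G|G", "A|A", "G|G"]))
```

For these four genotypes the result is 0.375 for A and 0.625 for G.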

You can use the “->db->use_vcf(1)” stub on any adaptor from the variation adaptor group.

Once set, it will affect fetching objects of the following types:

  • Allele
  • PopulationGenotype
  • IndividualGenotype
  • LDFeatureContainer
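
To give a feel for what an LD value computed from genotypes means, here is a minimal Python sketch of the textbook r² statistic between two biallelic sites. It assumes phased haplotypes coded as 0 and 1, which is a simplification; the estimator the API actually uses may differ, and the function name is hypothetical.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic sites from phased haplotypes.

    Each haplotype is a (site1_allele, site2_allele) tuple coded 0 or 1.
    Returns None when r^2 is undefined (no data, or a monomorphic site).
    """
    n = len(haplotypes)
    if n == 0:
        return None
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at site 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


# Alleles always travel together: complete LD.
print(r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))
# Every combination equally common: no LD.
print(r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

The first call gives 1.0 and the second gives 0.0.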

Advanced configuration

The value we pass to use_vcf() also affects the behaviour of the API:

  • 0 : fetch data only from database
  • 1 : fetch data from VCFs and database
  • 2 : fetch data only from VCFs

One final thing; the API is pre-configured to use VCFs hosted on the Ensembl FTP site. It is also possible to use VCFs on your local machine or any arbitrary server. The configuration is found in the ensembl-variation folder:

cat ~/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json
{
 "collections": [
   {
     "id": "1000genomes_phase3",
     "species": "homo_sapiens",
     "assembly": "GRCh37",
     "type": "remote",
     "strict_name_match": 1,
     "filename_template": "ftp://ftp.ensembl.org/pub/grch37/release-79/variation/vcf/homo_sapiens/1000GENOMES-phase_3-genotypes/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.vcf.gz",
     "chromosomes": [
       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22"
     ],
     "individual_prefix": "1000GENOMES:phase_3:"
   },
   {
     "id": "1000genomes_phase3",
     "species": "homo_sapiens",
     "assembly": "GRCh38",
     "type": "remote",
     "strict_name_match": 1,
     "filename_template": "ftp://ftp.ensembl.org/pub/release-80/variation/vcf/homo_sapiens/1000GENOMES-phase_3-genotypes/ALL.chr###CHR###.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP.vcf.gz",
     "chromosomes": [
       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12","13", "14", "15", "16", "17", "18", "19", "20", "21", "22"
     ],
     "individual_prefix": "1000GENOMES:phase_3:"
   }
 ]
}

Feel free to edit the filename_template entry in this file. Note there are separate entries for the two currently supported human assemblies, GRCh37 and GRCh38; the relevant entries will be used depending on which port you connect to in your API script (3306 for GRCh38, 3337 for GRCh37).

“###CHR###” is a placeholder that allows the API to read from a set of files distributed as one per chromosome. This is not mandatory, and indeed a single genome-wide VCF file could be used. The only requirement is that the chromosomes contained in the VCF or set of VCFs are listed in the “chromosomes” field of the JSON configuration file.
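
The placeholder logic can be sketched in a few lines of Python. This is an illustration of the behaviour described above rather than the API's own code: the function names are made up, and the shortened local filename_template is a hypothetical stand-in for the FTP URLs in the real file.

```python
PLACEHOLDER = "###CHR###"

# Cut-down, hypothetical configuration in the same shape as vcf_config.json.
CONFIG = {
    "collections": [
        {
            "id": "1000genomes_phase3",
            "species": "homo_sapiens",
            "assembly": "GRCh38",
            "filename_template": "/data/vcf/ALL.chr###CHR###.vcf.gz",
            "chromosomes": ["1", "2", "7"],
        }
    ]
}


def select_collection(config, species, assembly):
    """Return the first collection matching species and assembly, or None."""
    for collection in config["collections"]:
        if collection["species"] == species and collection["assembly"] == assembly:
            return collection
    return None


def vcf_path(collection, chrom):
    """Expand the template for one chromosome listed in the collection."""
    if chrom not in collection["chromosomes"]:
        raise ValueError(f"chromosome {chrom} is not listed in the collection")
    # A genome-wide file has no placeholder, so replace() leaves it unchanged.
    return collection["filename_template"].replace(PLACEHOLDER, chrom)


collection = select_collection(CONFIG, "homo_sapiens", "GRCh38")
print(vcf_path(collection, "7"))
```

With this configuration the call prints /data/vcf/ALL.chr7.vcf.gz, while asking for a chromosome that is not in the “chromosomes” list raises an error.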

Any questions, don’t hesitate to get in touch!

Boom!

Due to a somewhat catastrophic hardware failure during our production cycle for Ensembl release 79, we have only been able to release human dbSNP 142 incorporating 1000 Genomes phase 3 data on our GRCh37-based services (Web: grch37.ensembl.org, REST API: grch37.rest.ensembl.org, MySQL: anonymous@ensembldb.ensembl.org:3337, VEP: grch37.ensembl.org/vep). We hope to have these data on GRCh38 for Ensembl release 80; our GRCh38 database is based on dbSNP 138 until then.

VEP and GRCh38

For users of the Variant Effect Predictor (VEP) living in the brave new GRCh38 world, we have made available a VCF file which can be used to incorporate IDs and allele frequencies from the 1000 Genomes phase 3 data into your results. See this handy guide for details.

Viewing phase 3 data on the web

Using the same VCF as above, you can attach this as a custom data file to your Ensembl browser; this will allow you to see the genomic locations of the variants and their consequence types on our Region in Detail view.

Attach the following URL as a VCF (indexed): ftp://ftp.ensembl.org/pub/release-79/variation/VEP/1KG.phase3.GRCh38.vcf.gz

[Screenshot: attaching the VCF URL as custom data]

You can then see the data alongside the tracks from our database:

[Screenshot: the attached VCF data alongside database tracks]

It has been quite a while since we’ve blogged about the VEP (Variant Effect Predictor), and in that time we’ve added a whole load of new features, particularly to the downloadable script version.

Structural variants

The VEP now supports finding the consequences of structural variants, with input either in VCF or tab-delimited format. Using the web interface to the VEP you can visualise which transcripts and features your structural variants overlap by clicking through to the Region in Detail view:

[Screenshot: structural variants in the Region in Detail view]

The cache

We’ve really pushed the VEP script’s capabilities when using local “caches” (as opposed to using remote databases). Almost every feature of the VEP is now available when using the cache in offline mode. You can use a local FASTA file to quickly retrieve the sequences required to construct HGVS notations. You can even construct your own cache from a GTF file if your species isn’t supported by Ensembl.

Our cache for human now contains allele frequency data from phase 1 of the 1000 Genomes Project, and you can use these frequencies to filter your input (for example, you might want to filter out variants that are common in the combined European (EUR) population). We also now provide SIFT predictions for 8 species – human, mouse, zebrafish, pig, cow, chicken, rat and dog.
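
The kind of frequency filter described above can be pictured with a short Python sketch. It is not how the VEP script implements filtering: the record layout, the function name and the 1% threshold are all assumptions chosen for the example.

```python
def drop_common(variants, population="EUR", max_freq=0.01):
    """Keep variants whose frequency in `population` does not exceed `max_freq`.

    Variants with no recorded frequency for the population are kept.
    """
    return [
        v for v in variants
        if v.get("freqs", {}).get(population, 0.0) <= max_freq
    ]


# Made-up input records; only the "freqs" field matters here.
VARIANTS = [
    {"id": "var_a", "freqs": {"EUR": 0.25, "AFR": 0.002}},
    {"id": "var_b", "freqs": {"EUR": 0.004}},
    {"id": "var_c", "freqs": {}},
]
print([v["id"] for v in drop_common(VARIANTS)])
```

Here var_a is dropped because it is common in EUR, while var_b and var_c are kept.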

Plugins

We’re always trying to add new and useful features to the VEP, but we also recognise that other users have great ideas that they’d like to implement. The VEP script enables the use of plugins; these are bits of code that add extra functionality to the VEP. They can be used to retrieve data from remote sources, run external tools, filter output; pretty much anything you can think of can be accomplished in a plugin!

It’s easy to get started, and a basic plugin can be just a few lines of code – have a look at some of the examples we’ve created.

I recently added a plugin to retrieve data from dbNSFP – this is a great resource created by Liu et al in Houston, TX. They have, for every possible missense substitution in the human genome, pre-calculated pathogenicity scores, frequencies, conservation scores and a plethora of other things, and made all of this available as an easily downloadable file. To use this with the VEP, you just download the file and the plugin, run a couple of commands to get the data into the right format, and away you go – the VEP can now provide you with scores from LRT, MutationAssessor, MutationTaster, FATHMM and more for any missense substitution in your input.

Summary and HTML output

We had a number of requests for the VEP to provide summary statistics at the end of each run, and who are we to disappoint our loyal users?!? The VEP now writes a pretty HTML summary:
[Screenshot: VEP HTML summary]

You can also view your output as HTML using the --html flag, which allows you to sort, filter and analyse your output on the fly.

Don’t hesitate to get in touch with us about the VEP – our developer mailing list is the best place for technical questions, with helpdesk for everything else.

 

Variant Call Format (VCF) has become the standard portable data format for variation data. Ensembl provides various tools for handling VCF files, such as the ability to upload and view VCF data on the genome browser, and the Variant Effect Predictor (VEP) for predicting the functional consequence of variants.

We now provide an easy-to-use Perl script that allows you to create and populate an Ensembl Variation database from a VCF file.

Using the script you can create a database that is ready for use on your own Ensembl website mirror, as well as with Ensembl’s Perl API. Through these you have access to our powerful tools.

You can either build a new database from scratch, or add data on top of an existing database (for example, a copy of one of our databases downloaded from our FTP servers).

All you need are the Ensembl API modules and a MySQL server that you can write to!

The script is located in the ensembl-variation API module, in the scripts/import/ sub-directory.

The Variant Effect Predictor (VEP) software can predict the consequence of genomic variants using the genomic annotations provided by Ensembl. In release 63 of Ensembl we have added new features to both the script and web versions of the VEP.

Regulatory consequences have made their return; the VEP now reports if a variant falls within a regulatory region or a transcription factor binding motif, and furthermore if the variant falls in a high information locus within the motif.

The VEP now also has a dedicated area of the Ensembl website documentation.

Script version

To improve performance for users in the USA, we have now deployed a mirror of the public database server; to use this simply pass the flag “--host useastdb.ensembl.org” when running the script.

We have also implemented a caching system in the VEP, such that it is possible to use almost all of the functionality of the script without querying the database at all. Simply download and unpack a pre-built cache, run the script with the flag “--cache”, and hey presto! No more network dependencies.

We have now made “whole genome mode” the default run mode of the script – this code has been rewritten and optimized such that it should be suitable for all use cases. We’ve also improved the status output of the script as it runs, so users with lots of data can easily track their progress.

See the new documentation for further details on all of these new features, or just download the script!

Web version

It is now possible to filter your input variants by their frequency as observed in the 1000 Genomes or HapMap populations. You can either include or exclude input variants that are co-located with existing variants, based on frequencies in any particular population or across a range of populations.

As before, you can access the web VEP through the tools page, or via the “Manage your data” link on any species-specific page.