Cool stuff the Ensembl VEP can do: predict pathogenicity of missense variants

Some missense variants have a significant impact on protein function; others do not. In the absence of comprehensive functional assays covering all missense variants, the next best way to assess whether a missense variant is likely to be pathogenic is through prediction tools. These take into account factors such as the chemical properties of amino acids, functional protein domains and protein conservation to predict how likely it is that a missense variant will impact function. A number of different missense pathogenicity predictors are available for human through Ensembl VEP, and these are optimised for different purposes.

What effect do missense variants have on proteins?

Many of the scores we provide with the VEP, such as REVEL, MetaLR and Mutation Assessor, come via dbNSFP (database for nonsynonymous SNPs’ functional predictions). Version 3.5a of dbNSFP is available through VEP and can be accessed through the web interface and the REST API endpoints. These data were released in February 2019 and represent the transcript set available at that time. The offline tool supports a number of different dbNSFP versions through a VEP plugin. Similarly, the CADD (Combined Annotation Dependent Depletion) scores available through the VEP are version 1.5 from January 2018. For each of these measures there is a single score for each variant allele.
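As a rough illustration of requesting these scores through the REST API, the Python sketch below builds a request URL for the VEP HGVS endpoint. It only constructs the URL and does not contact the server. The `dbNSFP` and `CADD` query parameters and the dbNSFP field names `REVEL_score` and `MetaLR_score` reflect our reading of the Ensembl REST and dbNSFP documentation and should be checked against the current docs; the function name `build_vep_hgvs_url` and the example HGVS string are illustrative assumptions.

```python
from urllib.parse import quote, urlencode

REST_SERVER = "https://rest.ensembl.org"


def build_vep_hgvs_url(hgvs, dbnsfp_fields=(), cadd=False, species="human"):
    """Build a VEP REST URL for one HGVS notation (hypothetical helper).

    dbnsfp_fields: dbNSFP column names to request (assumed parameter: dbNSFP).
    cadd: whether to ask for CADD scores (assumed parameter: CADD).
    """
    # Percent-encode the notation; '>' in e.g. 'c.803C>T' becomes '%3E'.
    path = f"/vep/{species}/hgvs/{quote(hgvs, safe=':')}"

    params = {}
    if dbnsfp_fields:
        params["dbNSFP"] = ",".join(dbnsfp_fields)
    if cadd:
        params["CADD"] = 1

    # Keep commas readable in the field list.
    query = urlencode(params, safe=",")
    return REST_SERVER + path + ("?" + query if query else "")


if __name__ == "__main__":
    print(build_vep_hgvs_url(
        "ENST00000366667:c.803C>T",          # example notation, for illustration
        dbnsfp_fields=["REVEL_score", "MetaLR_score"],
        cadd=True,
    ))
```

The resulting URL can be fetched with any HTTP client, requesting JSON via a `Content-Type: application/json` header.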

In contrast, SIFT and PolyPhen-2, which are available directly via all VEP interfaces, are calculated by Ensembl. These also use fixed data freezes, but they are calculated per transcript/variant allele combination, which means that each variant allele will have a separate score for each transcript in which it causes a missense change.
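To make the per-transcript behaviour concrete, here is a minimal Python sketch that pulls SIFT and PolyPhen-2 results out of one record of VEP JSON output. The key names (`transcript_consequences`, `consequence_terms`, `sift_score`, `polyphen_score` and so on) follow the VEP JSON format as we understand it. The helper name `per_transcript_scores`, the transcript IDs and all score values in the example record are made up for illustration.

```python
def per_transcript_scores(vep_record):
    """Return one row per transcript in which the allele is missense.

    vep_record: a single parsed record from VEP JSON output (a dict).
    """
    rows = []
    for tc in vep_record.get("transcript_consequences", []):
        # SIFT/PolyPhen-2 only apply where the change is missense.
        if "missense_variant" not in tc.get("consequence_terms", []):
            continue
        rows.append({
            "transcript_id": tc["transcript_id"],
            "sift_score": tc.get("sift_score"),
            "sift_prediction": tc.get("sift_prediction"),
            "polyphen_score": tc.get("polyphen_score"),
            "polyphen_prediction": tc.get("polyphen_prediction"),
        })
    return rows


# Hypothetical record: IDs and values are invented for illustration only.
example_record = {
    "transcript_consequences": [
        {"transcript_id": "ENST_EXAMPLE_1",
         "consequence_terms": ["missense_variant"],
         "sift_score": 0.02, "sift_prediction": "deleterious",
         "polyphen_score": 0.91, "polyphen_prediction": "probably_damaging"},
        {"transcript_id": "ENST_EXAMPLE_2",
         "consequence_terms": ["missense_variant"],
         "sift_score": 0.2, "sift_prediction": "tolerated",
         "polyphen_score": 0.3, "polyphen_prediction": "benign"},
        {"transcript_id": "ENST_EXAMPLE_3",
         "consequence_terms": ["intron_variant"]},
    ]
}

if __name__ == "__main__":
    for row in per_transcript_scores(example_record):
        print(row)
```

In this invented record the same allele gets different predictions in two transcripts, while the third transcript has no score at all because the allele is intronic there.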

PolyPhen-2 also has two models, allowing it to be optimised for different kinds of study. The VEP default is HumVar, which is better suited to diagnosing rare Mendelian disease, where variants with drastic effects need to be distinguished from the rest of human variation, including mildly deleterious alleles. If you’re running the offline VEP tool, you can change the PolyPhen-2 options to give you a HumDiv score instead, which is optimised for evaluating rare, potentially mildly deleterious alleles at loci potentially involved in complex phenotypes, such as you might study through GWAS.

Scores for missense predictions can be highly dependent on the multiple sequence alignment available at the time of analysis, which is used to calculate the level of conservation. This means that fetching scores for the same variant allele through different tools or websites may give different results, depending on the data freeze used. In any case, these scores should always be treated only as predictions, and are no substitute for experimental validation.