SIFT and PolyPhen predictions for human Ensembl and RefSeq proteins

For release 67 we changed how we store the protein function predictions from SIFT and PolyPhen so that they can be used for more than just Ensembl transcripts, including RefSeq transcripts. We use these tools to compute the predicted effect of every possible amino acid substitution in the human proteome (over 2 billion predictions!). The complete set of predictions for a particular protein is now retrieved using the protein sequence itself as the identifier, rather than an Ensembl stable identifier (in practice we use the MD5 hash of the sequence). This means that you can retrieve predictions for any protein that has the same amino acid sequence as an Ensembl translation. So if you work with RefSeq transcripts, you can now get SIFT and PolyPhen predictions, through both the Variant Effect Predictor (VEP) and the Variation API, for any missense variant that falls in the 95% of RefSeq transcripts that match an Ensembl transcript exactly.
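To illustrate the idea of keying predictions by sequence rather than by stable identifier, here is a minimal Python sketch. It is not the Ensembl implementation: the function name sequence_key, the dictionary predictions_by_key, the toy sequences, and the normalisation steps (removing whitespace, upper-casing, dropping a trailing stop symbol) are all assumptions made for the example. Only the use of an MD5 hash of the protein sequence comes from the text above.

```python
import hashlib


def sequence_key(protein_sequence: str) -> str:
    """Return the MD5 hex digest of a protein sequence, for use as a lookup key."""
    # Assumed normalisation (not confirmed by the source): remove whitespace,
    # upper-case the residues, and drop any trailing stop symbol.
    normalized = "".join(protein_sequence.split()).upper().rstrip("*")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Hypothetical in-memory store standing in for the predictions database.
predictions_by_key = {}

# Toy sequences, invented for this example.
ensembl_translation = "MKTAYIAKQR"
refseq_protein = "mktayiakqr*"  # same residues, different formatting

# Store predictions once, keyed by the hash of the Ensembl translation.
predictions_by_key[sequence_key(ensembl_translation)] = "predictions for this sequence"

# A protein from another source with the same sequence maps to the same key,
# so its predictions can be found without any Ensembl stable identifier.
found = predictions_by_key.get(sequence_key(refseq_protein))
```

Because the key depends only on the amino acid sequence, a RefSeq protein that is identical to an Ensembl translation resolves to the same stored predictions, while a protein that differs by even one residue gets a different key and finds nothing.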

Also new in release 67 are predictions from both classifier models supplied with PolyPhen. Previously we provided predictions from the classifier trained on the HumVar dataset, which is intended to distinguish severely deleterious alleles from the background of abundant variation with milder effects. This is still the default, but when using the API you can now also opt for predictions from the classifier trained on the HumDiv dataset, which is intended to help evaluate rarer alleles potentially involved in complex disease. For more details on how these datasets are composed, please refer to the PolyPhen website.