The Ensembl VEP command-line tool can annotate and filter variants called against the latest human assemblies, including the telomere-to-telomere assembly of the CHM13 cell line (T2T-CHM13). In this blog post, we provide examples of how to run Ensembl VEP with these new assemblies and list the additional annotations supported via plugins.
The Human Pangenome Reference Consortium (HPRC) aims to sequence 350 individuals of diverse ancestries, producing a pangenome of 700 haplotypes by the end of 2024. The first publication (A draft human pangenome reference) describes 47 phased, diploid assemblies from a cohort of genetically diverse individuals.
We have annotated genes on these human assemblies, based on Ensembl/GENCODE 38 genes and transcripts, via a new mapping pipeline as detailed in the Methods section of A draft human pangenome reference. The links to download and visualise the human annotations for HPRC assemblies are summarised in the Ensembl HPRC data page.
Running Ensembl VEP with HPRC assemblies
Currently, Ensembl VEP can only be run with HPRC assemblies in offline mode and one assembly at a time. There are two ways to use Ensembl VEP with HPRC assemblies:
- Using Ensembl VEP cache with (recommended) FASTA sequence (the most efficient way)
- Using GTF annotation with (mandatory) FASTA sequence
In the examples below, we demonstrate annotating variants on the T2T-CHM13v2.0 (GCA_009914755.4) assembly. To create a sample VCF to use in the examples below, you can take the first 100 lines from the ClinVar VCF file mapped to T2T-CHM13:
clinvar=https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz
tabix -h $clinvar 1 | head -n 100 > test.vcf
Ensembl VEP cache
The cache is a downloadable archive containing all transcript models for an assembly; it may also contain regulatory features and variant data.
Let’s start by downloading and extracting the Ensembl VEP cache to the default Ensembl VEP directory (available for each annotation by clicking in VEP cache in the Ensembl HPRC data page). In the case of T2T-CHM13:
cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/indexed_vep_cache/Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
tar xzf Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
This will create the folder homo_sapiens_gca009914755v4/107_T2T-CHM13v2.0 with the gene data required to run Ensembl VEP. The name of this folder contains relevant information when running VEP:
- Species: homo_sapiens_gca009914755v4
- Cache version: 107
- Assembly: T2T-CHM13v2.0
As well as molecular consequence predictions, many gene/transcript-based Ensembl VEP options are supported for HPRC assemblies:
vep -i test.vcf --offline --species homo_sapiens_gca009914755v4 --cache_version 107 --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz --domains --symbol --canonical --protein --biotype --uniprot --variant_class
We don’t have other annotations, such as RefSeq transcripts or variant information in the cache.
To run Ensembl VEP with the downloaded cache in offline mode, please specify the species (which includes assembly name) and cache version:
vep -i test.vcf --offline --species homo_sapiens_gca009914755v4 --cache_version 107
FASTA sequence
When using Ensembl VEP cache, supplying the reference genomic sequence in a FASTA file is optional, but is required to enable the following options:
- Create HGVS notations (–hgvs and –hgvsg)
- Check the reference sequence given in input data (–check_ref)
Genomic FASTA files can be found in Ensembl HPRC data page > FTP dumps > ensembl > genome. FASTA files need to be either uncompressed or compressed with bgzip (recommended) to be compatible with VEP. For instance, to download a compressed FASTA file, uncompress it and then re-compress it with bgzip:
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/genome/Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
gzip -d Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
bgzip Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
Afterwards, you can run VEP using cache and the –fasta flag:
vep -i test.vcf --offline --species homo_sapiens_gca009914755v4 --cache_version 107 --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
Visit the documentation for more information on using FASTA files with VEP.
GTF and GFF annotation
As an alternative to using cache files, Ensembl VEP can utilise gene information in appropriately indexed GTF or GFF files. GTF and GFF files can be downloaded from the annotation column in the Ensembl HPRC data page. The data needs to be re-sorted in chromosomal order, compressed in bgzip and indexed with tabix. We present here the example for a GTF file:
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/geneset/2022_07/Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
gzip -d Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
grep -v "#" Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
tabix Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz
FASTA files are always required when running HPRC data with GTF annotation, as the transcript sequences are not available in the GFF files.
Afterwards, you can run Ensembl VEP using the GTF and FASTA files:
vep -i test.vcf --gtf Homo_sapiens-GCA_009914755.4-2022_07-genes.gtf.gz --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz
Check our documentation for more information on using GTF and GFF annotation.
Missense deleteriousness predictions
Using our ProteinFunction pipeline, we ran PolyPhen-2 2.2.3 and SIFT 6.2.1 on the proteome sequences for GRCh38 and all HPRC assemblies (the protein FASTA files indicated in Ensembl HPRC data page) and stored their results in a single SQLite file: homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db.
Pre-computed scores and predictions can be retrieved by downloading this file and running VEP with the PolyPhen_SIFT plugin:
curl -O https://ftp.ensembl.org/pub/current_variation/pangenomes/Human/homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db
vep -i test.vcf --offline --species homo_sapiens_gca009914755v4 --cache_version 107 --fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz --plugin PolyPhen_SIFT,db=human_pangenomes.PolyPhen_SIFT.db
Matched variant annotations (ClinVar, gnomAD and dbSNP)
We don’t have variant data in the Ensembl VEP caches for the pangenome assemblies, but it can be integrated using the –custom option with data files using the same assembly coordinates. We have lifted-over some key datasets, including ClinVar and gnomAD to the HPRC assemblies (downloadable from the VCF column in Ensembl HPRC data page).
# Download ClinVar data and respective index (TBI)
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/2024_07/clinvar_20240624_GCA_009914755.4.vcf.gz.tbi
# Run VEP with ClinVar data
vep -i test.vcf --offline \
--species homo_sapiens_gca009914755v4 --cache_version 107 \
--fasta Homo_sapiens-GCA_009914755.4-softmasked.fa.gz \
--custom file=clinvar_20240624_GCA_009914755.4.vcf.gz,short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN
Additional annotations
Ensembl VEP plugins are a simple way to add new functionality to your analysis. Many require data that is only available for GRCh37 or GRCh38, but others, for example, those based on gene attributes or on the fly analysis are compatible with the HGRC assemblies.
Here is a list of compatible Ensembl VEP plugins that can be easily used with HPRC assemblies:
Plugin | Description | Plugin data | Usage example |
---|---|---|---|
Blosum62 | Looks up the BLOSUM 62 substitution matrix score for the reference and alternative amino acids predicted for a missense mutation. | –plugin Blosum62 | |
DosageSensitivity | Retrieves haploinsufficiency and triplosensitivity probability scoresfor affected genes (Collins et al., 2022). | Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz | –plugin DosageSensitivity,file=Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz |
Downstream | Predicts downstream effects of a frameshift variant on the protein sequence of a transcript. | Requires a FASTA file provided via the –fasta option | –plugin Downstream |
Draw | Draws pictures of the transcript model showing the variant location. | –plugin Draw | |
GeneSplicer | Runs GeneSplicer to get splice site predictions. | Binary and training data for GeneSplicer (plugin instructions) | –plugin GeneSplicer,binary=genesplicer/bin/linux/genesplicer,training=genesplicer/human |
GO | Retrieves Gene Ontology (GO) terms associated with genes (for HGRC assemblies, specifically) using custom GFF annotation containing GO terms. | Ensembl HPRC data page > FTP dumps > ensembl > variation > [date] > gff:*_GO_plugin.gff.gz*_GO_plugin.gff.gz.tbi | –plugin GO,file=homo_sapiens_gca009914755v4_110_VEP_GO_plugin.gff.gz |
HGVSIntronOffset | Returns HGVS intron start and end offsets. To be used with –hgvs option. | –plugin HGVSIntronOffset | |
LoFtool | Provides a rank of genic intolerance and consequent susceptibility to disease based on the ratio of Loss-of-function (LoF) to synonymous mutations for each gene. | –plugin LoFtool | |
MaxEntScan | Runs MaxEntScan to get splice site predictions. | Extracted directory from fordownload.tar.gz | –plugin MaxEntScan,/path/to/fordownload |
NearestExonJB | Finds the nearest exon junction boundary to a coding sequence variant. | –plugin NearestExonJB | |
NMD | Predicts if a variant allows the transcript to escape nonsense-mediated mRNA decay based on certain rules. | –plugin NMD | |
Phenotypes | Retrieves overlapping phenotype information. | Ensembl HPRC data page > FTP dumps > ensembl > variation > [date] > gff:*_phenotypes_plugin.gvf.gz*_phenotypes_plugin.gvf.gz.tbi | –plugin Phenotypes,file=homo_sapiens_gca009914755v4_110_VEP_phenotypes_plugin.gvf.gz |
pLI | Adds the probability of a gene being loss-of-function intolerant (pLI). | –plugin pLI | |
PolyPhen_SIFT | Retrieves PolyPhen and SIFT predictions from a SQLite database. | homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db | –plugin PolyPhen_SIFT,db=homo_sapiens_pangenome_PolyPhen_SIFT_20240502.db |
ProteinSeqs | Writes two files with the reference and mutated protein sequences of any proteins found with non-synonymous mutations in the input file. | –plugin ProteinSeqs | |
SingleLetterAA | Returns HGVSp string with single amino acid letter codes. | –plugin SingleLetterAA | |
SpliceRegion | Provides more granular predictions of splicing effects. | –plugin SpliceRegion | |
SubsetVCF | Retrieves overlapping records from a given VCF file. | A VCF file | –plugin SubsetVCF,file=file.vcf.gz,name=myvfc |
TranscriptAnnotator | Annotates variant-transcript pairs based on a given file. | Tab-separated annotation file (plugin instructions) | –plugin TranscriptAnnotator,file=annotation.txt.gz |
TSSDistance | Calculates the distance from the transcription start site for upstream variants. | –plugin TSSDistance |
Written by: Nuno Agostinho, Jamie Allen and Sarah Hunt. Edited by Aleena Mushtaq.