One of the biggest headaches when working with insertions and deletions is how many different ways there are to represent the same variant. If you want to find out whether there are already known allele frequencies or phenotypes at a locus, you need to be sure that you find the right variant. The VEP can take that headache away by normalising variants.
Normalisation is the process by which variants are standardised into a particular format. For SNPs this is pretty obvious: you just state the coordinates. For insertions and deletions, particularly those that fall within repeat regions, it is far more complicated. Take the following example:
REF: 1 - GATCTATAAAAAAAAAAATCGAATCGTACA - 30
ALT: 1 - GATCTATAAAAAAAAATCGAATCGTACA - 28
There is a repeat of 11 As in the reference sequence (positions 8 to 18), shortened to 9 in the alternative. That means that two bases are deleted, but which two?
The standard way to report an insertion or a deletion in a VCF file is to write it in terms of the base upstream of it. So, if the insertion or deletion occurs within a repeat region, we move up to the 5′ end of the repeat and report the insertion or deletion as if it occurred at the beginning of the repeat. Allele frequencies from sources such as gnomAD and 1000 Genomes are reported in VCF, so they follow this convention. For the example above, the record is:
chromosome 7 TAA T
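As a rough illustration of this left-alignment, the sketch below shifts a deletion to the 5′ end of a repeat and writes it with its upstream anchor base. The function names are made up for this example and the sequence is the 30-base reference from above; it is only meant to show the idea, not how any particular tool implements it.

```python
# Reference from the example: 7 bases, a run of 11 As, then 12 bases (30 in total).
REF_SEQ = "GATCTAT" + "A" * 11 + "TCGAATCGTACA"


def left_align_deletion(ref_seq, start, length):
    """Return the 5'-most start (1-based) of an equivalent deletion."""
    # Moving one base to the left leaves the resulting sequence unchanged only
    # if the base just before the deletion equals the last deleted base.
    while start > 1 and ref_seq[start - 2] == ref_seq[start + length - 2]:
        start -= 1
    return start


def to_vcf_deletion(ref_seq, start, length):
    """Return (POS, REF, ALT) for a deletion, anchored on the upstream base."""
    start = left_align_deletion(ref_seq, start, length)
    if start < 2:
        raise ValueError("no upstream base to anchor on")
    anchor = ref_seq[start - 2]
    deleted = ref_seq[start - 1:start - 1 + length]
    return start - 1, anchor + deleted, anchor


# A two-base deletion reported anywhere in the run collapses to the same record.
print(to_vcf_deletion(REF_SEQ, 10, 2))  # (7, 'TAA', 'T')
print(to_vcf_deletion(REF_SEQ, 17, 2))  # (7, 'TAA', 'T')
```

Whichever two As you start from, the record comes out as position 7, TAA to T.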
HGVS works differently: it reports the position of an insertion or a deletion in a repeat as the last (3′-most) position within the repeat. Since HGVS notation is in terms of the transcript, for negative-stranded transcripts the reported position is the same as the one that would appear in VCF, but for positive-stranded transcripts a different position is reported. Since HGVS is a preferred format for the clinical community, a lot of phenotypes will be linked to the HGVS notation for variants.
transcript:c.17_18delAA
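To make the strand point concrete, here is a small sketch of the 3′ shift. It assumes a toy transcript whose coordinates are simply the positions in the 30-base example sequence, which a real transcript would not be, and all function names are invented for illustration.

```python
# Reference from the example: 7 bases, a run of 11 As, then 12 bases (30 in total).
REF_SEQ = "GATCTAT" + "A" * 11 + "TCGAATCGTACA"


def right_align_deletion(seq, start, length):
    """Return the 3'-most (start, end), 1-based, of an equivalent deletion."""
    end = start + length - 1
    # Moving one base to the right is safe only if the first deleted base
    # equals the base just after the deletion.
    while end < len(seq) and seq[start - 1] == seq[end]:
        start += 1
        end += 1
    return start, end


def hgvs_like_deletion(seq, start, length):
    """Format the 3'-shifted deletion in the style used in the text."""
    s, e = right_align_deletion(seq, start, length)
    position = f"{s}" if s == e else f"{s}_{e}"
    return f"c.{position}del{seq[s - 1:e]}"


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def three_prime_on_minus_strand(genomic_seq, start, length):
    """3'-shift along a minus-strand transcript, reported in genomic coordinates."""
    n = len(genomic_seq)
    end = start + length - 1
    # Position p on the forward strand is position n + 1 - p on the reverse strand.
    s, e = right_align_deletion(reverse_complement(genomic_seq), n + 1 - end, length)
    return n + 1 - e, n + 1 - s


print(hgvs_like_deletion(REF_SEQ, 10, 2))           # c.17_18delAA
print(three_prime_on_minus_strand(REF_SEQ, 10, 2))  # (8, 9)
```

On the minus strand the 3′-most placement lands on genomic positions 8 and 9, the same two bases that the left-aligned VCF record removes; on the plus strand it lands on 17 and 18.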
On top of this, publications listing phenotypes linked to variants may pick an arbitrary point within the repeat to report their variant, and follow neither convention. For example, the above variant may be listed in a publication as an AA deletion at positions 10-11. If variants end up in this format in public databases, they will be propagated with these coordinates into Ensembl.
The VEP knows how to deal with this. If you input any variant into the VEP, the first thing it does is strip off any superfluous sequence: if you give it a reference of AGT and an alternative of ACT, the VEP converts this to just a G/C base change.
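The trimming step can be sketched in a few lines. This is only an illustration of the idea, not the VEP's own code: the function name and the starting position of 100 are made up, and the choice to trim the shared suffix before the shared prefix is an assumption of this sketch.

```python
def trim_alleles(pos, ref, alt):
    """Strip bases shared by both alleles and return (pos, ref, alt).

    An allele that becomes empty is written as "-".
    """
    # Drop shared trailing bases; the position does not change.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases; each one moves the position along by one.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref or "-", alt or "-"


print(trim_alleles(100, "AGT", "ACT"))  # (101, 'G', 'C')
print(trim_alleles(7, "TAA", "T"))      # (8, 'AA', '-')
```

The second call shows the same step applied to the deletion from earlier: once the shared anchor base is removed, what is left is a deletion of AA starting at position 8.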
Then it looks in the database of known variants for any that match your input variant. If the variant you’ve put in is an insertion or deletion in a repeat, it will scan up and down the repeat for other variants which give the same resultant sequence. Then, when you get the colocated variants in your VEP output, you’ll find them all. If you also get allele frequencies from gnomAD or 1000 Genomes, you’ll get the frequencies that are associated with any matching variant found in the whole repeat.
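The "same resultant sequence" test is easy to try out on the example. The sketch below is a brute-force illustration under simple assumptions: it checks every possible placement in the 30-base sequence rather than walking along the repeat, and the table of known variants with its identifiers is entirely hypothetical.

```python
# Reference from the example: 7 bases, a run of 11 As, then 12 bases (30 in total).
REF_SEQ = "GATCTAT" + "A" * 11 + "TCGAATCGTACA"


def apply_deletion(ref_seq, start, length):
    """Return the sequence left after deleting `length` bases from 1-based `start`."""
    return ref_seq[:start - 1] + ref_seq[start - 1 + length:]


def equivalent_starts(ref_seq, start, length):
    """List every start position whose deletion gives the same resulting sequence."""
    target = apply_deletion(ref_seq, start, length)
    return [
        s
        for s in range(1, len(ref_seq) - length + 2)
        if apply_deletion(ref_seq, s, length) == target
    ]


# Hypothetical known two-base deletions, keyed by their reported start position.
KNOWN_DELETIONS = {
    8: "known_left_aligned",
    17: "known_three_prime_shifted",
    3: "known_unrelated",
}

starts = equivalent_starts(REF_SEQ, 10, 2)
matches = [KNOWN_DELETIONS[s] for s in starts if s in KNOWN_DELETIONS]

print(starts)   # [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
print(matches)  # ['known_left_aligned', 'known_three_prime_shifted']
```

A deletion reported at 10-11 matches both the left-aligned record and the 3′-shifted one, while the unrelated deletion outside the repeat is ignored.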