Ensembl Variation - Variant quality

Ensembl carries out quality control on all imported variants and provides a summary of the evidence supporting each variant.


Evidence status

We provide a simple summary of the evidence supporting a variant as a guide to its potential reliability.

Name | Description
Multiple observations | The variant has multiple independent dbSNP submissions, i.e. submissions with different submitter handles or different discovery samples.
Frequency | The variant is reported to be polymorphic in at least one sample.
Cited | The variant is cited in a PubMed article.
Phenotype or Disease | The variant is associated with at least one phenotype or disease.
1000 Genomes | The variant was discovered in the 1000 Genomes Project (human only).
gnomAD | The variant was discovered in the Genome Aggregation Database (human only).
TOPMed | The variant was discovered in the Trans-Omics for Precision Medicine program (human only).
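
As an illustration of the "Multiple observations" criterion, the following is a minimal sketch, not Ensembl's implementation. The `Submission` class and its `handle` and `sample` fields are hypothetical names, and the rule is read as "more than one distinct submitter handle or more than one distinct discovery sample".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    handle: str  # submitter handle (hypothetical field name)
    sample: str  # discovery sample (hypothetical field name)


def has_multiple_observations(submissions):
    """Return True if the submissions differ in handle or in discovery sample.

    Assumption: two submissions count as independent when either value differs.
    """
    handles = {s.handle for s in submissions}
    samples = {s.sample for s in submissions}
    return len(handles) > 1 or len(samples) > 1
```

Under this reading, two submissions from the same handle on the same sample count as a single observation.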


Quality control

A quality control process checks imported variant data. Suspect variants and alleles are flagged but are not withheld from downstream annotation. Data failing the checks remains available through the browser, where the failure reasons are prominently listed. By default the API does not return failed data; to retrieve it, the database adaptor must be explicitly configured using Bio::EnsEMBL::Variation::DBSQL::DBAdaptor::include_failed_variations().

Variants for which dbSNP holds citations from PubMed are not submitted to the QC process and are therefore never flagged as failed.
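
The flag-rather-than-withhold behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Ensembl pipeline or API: variants are plain dictionaries, and the keys `cited`, `mappings` and `failure_reasons`, as well as the functions `run_qc` and `visible_variants`, are hypothetical names.

```python
def run_qc(variant, checks):
    """Return a copy of the variant annotated with its failure reasons.

    `checks` is a list of (reason, predicate) pairs; a predicate returns True
    when the variant fails that check. The variant is never dropped, and
    variants marked as cited skip QC entirely.
    """
    if variant.get("cited"):
        reasons = []  # cited variants are exempt from QC
    else:
        reasons = [reason for reason, fails in checks if fails(variant)]
    return {**variant, "failure_reasons": reasons}


def visible_variants(variants, include_failed=False):
    """Hide failed variants unless the caller explicitly asks for them."""
    return [v for v in variants if include_failed or not v["failure_reasons"]]
```

For example, with `checks = [("Variant does not map to the genome", lambda v: v["mappings"] == 0)]`, an unmapped, uncited variant is flagged and hidden from `visible_variants` by default, but it is still returned when `include_failed=True`.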

Failure reasons

QC Type | Reported failure reason | Checking process
Mapping checks | Variant does not map to the genome | Variants with flanking sequences which do not map to reference or non-reference genomic sequences are flagged as failed.
Mapping checks | Variant maps to more than 1 location | For variants with flanking sequences mapping to a reference sequence, the number of mappings within all reference sequences is counted and those mapping more than once are flagged as failed. (Variants with a single mapping to both X and Y within a PAR region are not failed.) For variants with flanking sequences which do not map to a reference sequence, the number of mappings within all non-reference sequences is counted and those mapping more than once are flagged as failed.
Mapping checks | Mapped position is not compatible with reported alleles | The length of the reported alleles is compared to that expected given the coordinates specified for the variant. If none of the alleles match the expected length, the variant is flagged as failed.
Mapping checks | None of the variant alleles match the reference allele | The sequence at the coordinates specified for the variant is extracted from the reference genome and compared to the dbSNP refSNP alleles. If the extracted sequence does not match any of these alleles, the variant is flagged as failed.
Checks on the alleles of refSNPs | Loci with no observed variant alleles in dbSNP | Variants with dbSNP refSNP alleles reported as 'NOVARIATION' are flagged as failed.
Checks on the alleles of refSNPs | Alleles contain ambiguity codes | Variants with an IUPAC ambiguity code (e.g. M, Y, R) in the dbSNP refSNP alleles are flagged as failed.
Checks on the alleles of refSNPs | Alleles contain non-nucleotide characters | Variants with unexpected characters in the dbSNP refSNP alleles are flagged as failed.
Checks on the alleles in dbSNP submissions | Additional submitted allele data from dbSNP does not agree with the dbSNP refSNP alleles | Alleles from all the dbSNP submissions for the rsID are checked against the dbSNP refSNP alleles. (These alleles come primarily from frequency submissions but can also come from variant discovery submissions, and these are merged in the dbSNP pipeline with the pre-existing refSNP variant.) Discrepant sets of alleles are flagged as failed, as this often highlights a strand error in the submission of frequency information for a known variant. The failure is flagged at the allele submission level.
External failure classification | Flagged as suspect by dbSNP | Variants reported as suspect by dbSNP because they lie in probable paralogous regions are imported but flagged as failed (human only).
New assembly | Variant can not be re-mapped to the current assembly | Variants that mapped to the previous assembly but could not be remapped to the current assembly are flagged as failed.
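
Several of the checks in the table can be sketched in a few lines. The code below is an illustrative approximation under stated assumptions, not the code Ensembl runs. The function names are hypothetical, alleles are assumed to be plain strings on the forward strand with '-' denoting a deleted or inserted-at position, the set of "expected" characters is an assumption, and coordinates are assumed to be 1-based and inclusive, with an insertion written as end = start - 1.

```python
# IUPAC nucleotide ambiguity codes (everything other than A, C, G, T)
IUPAC_AMBIGUITY = set("MRWSYKVHDBN")
# Characters treated as expected in an allele string (assumption)
EXPECTED = set("ACGT-") | IUPAC_AMBIGUITY


def allele_failures(alleles):
    """Checks on the refSNP alleles themselves."""
    alleles = [a.upper() for a in alleles]
    if "NOVARIATION" in alleles:
        return ["Loci with no observed variant alleles in dbSNP"]
    reasons = []
    if any(set(a) & IUPAC_AMBIGUITY for a in alleles):
        reasons.append("Alleles contain ambiguity codes")
    if any(set(a) - EXPECTED for a in alleles):
        reasons.append("Alleles contain non-nucleotide characters")
    return reasons


def position_failures(ref_seq, start, end, alleles):
    """Checks of the alleles against the mapped coordinates.

    Strand handling is deliberately ignored in this sketch.
    """
    expected_len = end - start + 1  # 0 for an insertion
    lengths = [0 if a == "-" else len(a) for a in alleles]
    if expected_len not in lengths:
        return ["Mapped position is not compatible with reported alleles"]
    ref_allele = ref_seq[start - 1:end] or "-"  # empty slice means insertion
    if ref_allele not in alleles:
        return ["None of the variant alleles match the reference allele"]
    return []
```

For instance, on the toy reference `"ACGTACGT"`, a variant at position 3 with alleles G and T passes both position checks, whereas alleles A and T at the same position are flagged because neither matches the reference base G.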