The stringvalidation test suite accompanies the publication Roeck et al. 2010. We explicitly state that this database is not intended for any use other than testing the string-based search SAM published in Roeck et al. 2010, and especially not for forensic frequency searches.
The stringvalidation test suite is a mirror of the EMPOP database (http://empop.org) as far as the search engine and the layout are concerned. In contrast to the EMPOP database, no registration is needed for full access to the database. The stringvalidation database provides a subset of the EMPOP database and contains three populations (Brandstätter et al. 2004, Brandstätter et al. 2007, Zimmermann et al. 2009) that can be freely downloaded in difference-coded format in the download section. These data sets are forensic data accompanied by high-quality primary sequence data, which permit questionable positions to be checked at any time.
The mtDNA haplotypes are stored in difference-coded format, relative to the revised Cambridge Reference Sequence (rCRS; Andrews et al. 1999) and aligned using the phylogenetic approach described in Bandelt and Parson (2008).
Rules:
(quoting Bandelt and Parson, 2008)
  • (Phylogenetic law) Sequences should be aligned with regard to the current knowledge of the phylogeny. In the case of multiple equally plausible solutions, one should strive for maximum (weighted) parsimony. Variants flanking long C tracts, however, are subject to extra conventions in view of extensive length heteroplasmy.
  • (C tract conventions) The long C tracts of HVS-I and HVS-II should always be scored with 16189C and 310C, respectively, so that phylogenetically subsequent interruptions by novel C to T changes are encoded by the corresponding transition. Length variation of the short A tract preceding 16184 should be notated in terms of transversions.
  • (Indel scoring) Indels should be placed 3′ with respect to the light strand unless the phylogeny suggests otherwise.

Database queries

MtDNA haplotypes can be entered into the query field in difference-coded format or copied and pasted as consensus sequence strings. For the query process, SAM translates all haplotypes into sequence strings and then compares them. Thus, the search becomes independent of alignment and annotation and allows identical haplotypes to be correctly retrieved, even if a query sequence was entered using other alignment rules.
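The idea of translating difference-coded haplotypes into sequence strings before comparing them can be illustrated with a minimal Python sketch. It is not the SAM implementation: the function name `apply_differences`, the simplified notation it accepts (`16189C` for a substitution, `249del` for a deletion, `315.1C` for an insertion) and the six-base toy reference are assumptions made only for illustration.

```python
import re

# Hypothetical, simplified difference notation: "<pos><base>", "<pos>del",
# "<pos>.<n><base>". Real-world notation has more cases than this sketch.
_DIFF = re.compile(r"(\d+)(?:\.(\d+))?([ACGT]|del)")

def apply_differences(reference, differences, offset=1):
    """Rebuild a sequence string from a difference-coded haplotype.

    reference[0] corresponds to position `offset` of the reference.
    """
    subs, dels, ins = {}, set(), {}
    for diff in differences:
        m = _DIFF.fullmatch(diff)
        if m is None:
            raise ValueError(f"unsupported notation: {diff!r}")
        pos, idx, what = int(m.group(1)), m.group(2), m.group(3)
        if idx is not None:
            if what == "del":
                raise ValueError(f"unsupported notation: {diff!r}")
            # Insertion after position `pos`; idx orders multiple insertions.
            ins.setdefault(pos, []).append((int(idx), what))
        elif what == "del":
            dels.add(pos)
        else:
            subs[pos] = what

    out = []
    for i, base in enumerate(reference):
        pos = i + offset
        if pos not in dels:
            out.append(subs.get(pos, base))
        out.extend(b for _, b in sorted(ins.get(pos, [])))
    return "".join(out)

# Toy reference (positions 1-6), NOT the rCRS.
toy_reference = "ACCCTG"

# The same extra C in the C tract, annotated under two different conventions.
placed_3prime = apply_differences(toy_reference, ["4.1C"])
placed_5prime = apply_differences(toy_reference, ["1.1C"])

# Both annotations translate to the same string, so a string comparison
# treats them as the same haplotype.
same_haplotype = placed_3prime == placed_5prime
```

Because the comparison is done on the resulting strings, the two annotations of the insertion are recognized as one haplotype regardless of where the indel was placed.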
The results of the string-based query SAM are reported in difference-coded format. A maximum likelihood (ML) and a maximum parsimony (MP) approach are used to determine the transcript (i.e. the differences) from database to query profile. The ML approach reports the transcript with minimal costs (based on mutation-specific weights), while the MP approach favors the transcript with the minimal number of differences. In most cases the MP transcript is identical to the ML transcript, so only MP transcripts are shown in the output summary, which makes a search much faster. The detail view of a database profile also provides the option to calculate the ML transcript on demand. See Roeck et al. 2010 for further details.
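The difference between counting differences and summing mutation-specific weights can be sketched with a generic weighted edit distance. This is only an illustration under stated assumptions, not the algorithm of Roeck et al. 2010: the function `edit_cost`, the cost functions `unit_cost` and `weighted_cost`, and all weight values are hypothetical, and the sketch returns only the minimal cost rather than the transcript itself.

```python
def edit_cost(db, query, sub_cost, indel_cost):
    """Minimal total cost of turning string `db` into string `query`
    using substitutions, insertions and deletions (dynamic programming)."""
    prev = [j * indel_cost for j in range(len(query) + 1)]
    for i in range(1, len(db) + 1):
        cur = [i * indel_cost]
        for j in range(1, len(query) + 1):
            a, b = db[i - 1], query[j - 1]
            keep_or_sub = prev[j - 1] + (0 if a == b else sub_cost(a, b))
            delete = prev[j] + indel_cost
            insert = cur[j - 1] + indel_cost
            cur.append(min(keep_or_sub, delete, insert))
        prev = cur
    return prev[-1]

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def unit_cost(a, b):
    # Parsimony-style counting: every difference costs 1.
    return 1

def weighted_cost(a, b):
    # Hypothetical weights: transitions cheaper than transversions.
    return 1 if (a, b) in TRANSITIONS else 3

# One transversion (C to A) between database and query strings.
n_differences = edit_cost("ACGT", "AAGT", unit_cost, indel_cost=1)
weighted_total = edit_cost("ACGT", "AAGT", weighted_cost, indel_cost=2)
```

With unit costs the result is simply the number of differences, which is the quantity an MP-style criterion minimizes; with the assumed weights the same pair of strings receives a higher cost because the single difference is a transversion.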