Problematic haplogroup assignments in YFull's MTree

Dr. Ian Logan, an mtDNA expert who runs the website IanLogan.co.uk and maintains a magnificent "GENBANK SEQUENCES BY HAPLOGROUP" database, informed me that YFull's MTree, a comprehensive and often reliable resource, contains some erroneous haplogroup assignments for particular samples that YFull downloaded from the National Center for Biotechnology Information's public GenBank database. Sometimes YFull corrects its mistakes and other times it does not. Some of the mistaken haplogroup assignments did not originate with YFull, because certain mtDNA sequences in GenBank are of poor quality, with questionable mutations or missing expected mutations. These include some sequences in the data set from Marina Silva's September 2021 study of people from Spain.


1. Ian Logan found a problem with YFull MTree's listing of Serbian samples at https://www.yfull.com/mtree/H1as2/ under subclade H1as2a, including sample MK134305, which came from the data set for the September 2020 study "Complete mitogenome data for the Serbian population: the contribution to high-quality forensic databases" by Slobodan Davidovic's team in International Journal of Legal Medicine. YFull lists the following samples from Serbia there as H1as2a:
  • GenBank sample MK617219 from Serbia
  • GenBank sample MK617244 from Serbia
  • GenBank sample MK134305 from Serbia
  • YFull customer YF80654 from an ethnic Serbian from Serbia
  • YFull customer YF07829 from Serbia

    Some or all of them don't carry the mutation T4688C that defines membership in H1as. That is why some of them (including MK617219 and MK617244) were instead called haplogroup H1cm1 or just H1 in GenBank. At least MK134305 is definitely not in a branch of H1as, even though it shares some of its other mutations with H1as2, because it does not have T4688C. Logan lists its mutations as: A263G A750G T980C A1438G G3010A A4769G A8860G A15326G T16209C T16519C.

    I don't know the mutation lists of YFull customers YF80654 and YF07829, but I doubt they have T4688C either. Please check that too. The mutations T980C and T16209C would not be enough for membership in H1as2a if T4688C is missing.
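
    The check described above can be sketched in a few lines of Python. The mutation list is the one Logan gives for MK134305. The function name missing_defining_mutations and the decision to treat T4688C as the single required marker for H1as are illustrative assumptions, not the actual classification logic used by YFull or by Logan.

```python
# Mutations Ian Logan lists for GenBank sample MK134305.
MK134305 = (
    "A263G A750G T980C A1438G G3010A A4769G A8860G "
    "A15326G T16209C T16519C"
).split()

# Assumption for illustration: T4688C is treated as the one marker
# required for H1as, as the text describes. The full haplogroup
# definition is not reproduced here.
H1AS_REQUIRED = {"T4688C"}

# The two mutations the text says are not sufficient on their own.
NOT_SUFFICIENT_ALONE = {"T980C", "T16209C"}


def missing_defining_mutations(sample_mutations, required):
    """Return the required mutations that the sample does not carry."""
    return sorted(set(required) - set(sample_mutations))


missing = missing_defining_mutations(MK134305, H1AS_REQUIRED)
shared = sorted(NOT_SUFFICIENT_ALONE & set(MK134305))

print("required but missing:", missing)
print("present but not sufficient alone:", shared)
```

    A non-empty "missing" list is the situation Logan objects to: the sample carries the downstream mutations but lacks the one that defines the parent branch.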

    CURRENT STATUS: I informed YFull's team about the above problems on September 16, 2021. The samples are still listed with the wrong haplogroup assignment in YFull as of November 18, 2021.

    YFull told me: "Ian Logan has at least one sample MN540564, which also has SNP T4688C negative, but it is located under H1as2. We believe that in this case, a back mutation may have occurred since the remaining signs indicate a high probability of finding these samples under the H1as2a subclade."

    Ian Logan in response told me: "I do not like the idea of a 'back mutation'. So much more likely a new subgroup."


2. Ian Logan informed me that the sample MZ920334, collected in León, Spain by Marina Silva's team for their September 2021 study "Biomolecular insights into North African-related ancestry, mobility and diet in eleventh-century Al-Andalus" in Scientific Reports, isn't actually in haplogroup HV5* as YFull's MTree claims at https://yfull.com/mtree/HV5/, and that it is a poor-quality sample because its "sequencing is poor".

    Logan lists the sample's mutations as: A93G A263G C7028T A7543G G8269A A8860G A10398G G10589A A11251G G11719A A12612G A13105G A14037G A15326G A16235G C16291T

    Logan says "This nothing like a HV5 - but neither is it like anything else. So I guess something has gone wrong. Interestingly James Lick's program doesn't like it either !"

    CURRENT STATUS: I informed YFull's team about the above problem on November 18, 2021. Later that day, YFull acknowledged their error and reassigned sample MZ920334 to haplogroup H2a2b10 at https://www.yfull.com/mtree/H2a2b10/ where another Spanish sample and a Portuguese sample are also located.

    But Ian Logan believes that this assignment may also be in error. He saw that "the position of MZ920334 bears no resemblance to H2a2" (the root level) and that MZ920334 has extra mutations that the other Spanish sample, MZ921068, doesn't have. Thus, Logan concluded, "I should there are so many missing/wrong mutations that neither sequence is to be acceptable. But nevertheless there is marked similarity. [...] So my suggestion is that these 2 sequences are corrupted results - either in the sequencing, or perhaps more likely in the data-manipulation."


3. GenBank generically lists the sample JN415472 from Italy as a member of haplogroup "H". Ian Logan places it within haplogroup H66, partly because it contains the mutation G7337A that defines H66. The "List of complete mtDNA sequences included in Figure 1" within the 2012 study "Rare Primary Mitochondrial DNA Mutations and Probable Synergistic Variants in Leber’s Hereditary Optic Neuropathy" similarly lists JN415472 as H66. But MTree uses another mutation, A2060G, to define what it calls H-c2, a child of H-c outside the H66 tree, and it includes JN415472 in H-c2 at https://yfull.com/mtree/H-c2/. This seems to be an error on YFull's part.