Frequently (and not-so-frequently) asked questions about the age analysis process

Before contacting us about a query, you may find one of the following previously asked questions provides an answer:

Queries about the reports:

Where is haplogroup/clade X, you haven't included it!
The report won't open / looks funny.
What are the codes in the raw reports?
Why do you only include BigY tests?
What is the history behind this project?
Why do the numbers keep changing?

General queries about age analysis:

How does the age analysis process work?
How accurate are these ages?
How valid is the concept of a mutation rate?
Which mutation rate do you use?
Do you use constraints from ancient DNA in your ages?
Which tests / regions of the chromosome do you take into account?

Where is haplogroup/clade X, you haven't included it!

Answer valid as of: 22 June 2017

There are several possibilities that could explain the absence of a haplogroup (clade) from the data.

The clade is there, you're just not seeing it. The full reports are large files, which can be hard to process visually, and don't always open correctly in some browsers or software packages. Smaller, clade-specific files are better supported (these are still being processed for P312 as of 22/06/17). However, we recommend that you search by your kit number or clade name. Searches can normally be performed by pressing Ctrl-F (Command-F on a Mac).

The clade is not there because it is not covered by BigY. BigY only sequences about 15% of the Y chromosome. For comparison, Full Genomes Corp.'s products typically cover 25%. Hence, many clades are known that do not appear in BigY, or appear so sporadically that they are not appropriate for age analysis. The tree created by our software is designed to be robust for the purposes of age calculation, not to be a complete representation of every mutation that takes place. It may be that your clade is listed as a sub-clade of a clade further up in the tree (e.g. U152>Z142, rather than U152>L2>Z49>Z142). Some of these clades can be included on an ad hoc basis, but it negatively impacts the accuracy of the overall analysis (typically over-estimating the age of that clade by about 50 years, with smaller effects on neighbouring branches). We are aware that this is a confusing factor for many people, and are working on an improvement which accounts for this. See also this question.
The clade is there, but under another name. Many clades are represented by a number of SNPs, and a "lead" SNP has to be picked to represent them. Many clades, particularly new ones, don't yet have a standardised "lead" SNP. While we have tried to synchronise this tree against a number of others (e.g. Family Tree DNA's haplotree), some inconsistencies will remain where clades have not been entered.
The clade is not there because it is not part of our covered regime. Currently, we are operating two trees: one for U106 and one for P312. If we receive a file that is not from one of these clades, it is not processed. You may also want to check to make sure you are looking in the right tree!
The clade only has one member. It takes two to tango. The ages provided here are to the date of birth of the last common ancestor of two or more people that form a clade. The birth date of the last common ancestor of a single tester doesn't make sense, so we cannot compute it. However, these isolated clades are very useful in constraining the ages of clades, and are incorporated in our analysis.
We don't have the data files. We can only process files we have. Please see our instructions for uploading your data to the right places.

The report won't open / looks funny.

Answer valid as of: 23 June 2017

For the HTML documents, the most common reason is that you are trying to open a large file in an unsupported browser. For the CSV files, it is likely that you are trying to open a large file in an unsupported spreadsheet package, or you are not opening it as a CSV file.

In the process of putting together this analysis, we discovered a bug that affects several web browsers, including Microsoft Edge and Mozilla Firefox. Google Chrome appears unaffected. The bug means that only the first cell of an HTML table is displayed when the number of columns exceeds 999. For this reason, we have to split the report up into smaller sections that can be opened on all browsers (not fully implemented for P312 as of 23/06/17).

A similar, known limitation also exists in some spreadsheet software, including older versions of Excel, Numbers, OpenOffice and LibreOffice, whereby files with more than 1024 columns cannot be fully opened. Newer versions of Excel open the files correctly. The latest version of Gnumeric provides a free alternative which opens the files correctly. Again, using the files for smaller sub-clades offers a work-around for those only interested in one sub-clade.

The raw reports are in CSV (comma-separated values) format. A common problem encountered when spreadsheet packages open CSV files is that they are interpreted as generic plain text files, where spaces, tabs, semi-colons or other characters can be used to split the text up into cells. Options are displayed when you open the file, where you can select which separators you want to use. When you are presented with these options, ensure that only "comma" is selected.
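For those who prefer a script to a spreadsheet, the separator can also be fixed explicitly in code. The following is a minimal sketch in Python, assuming only that the report is an ordinary comma-separated file. The sample text and the function names are made up for illustration. It also reports the widest row, which indicates whether a file would run into the column limits described above.

```python
import csv
import io

def read_report(handle):
    """Parse a raw report, splitting cells on commas only."""
    return list(csv.reader(handle, delimiter=","))

def widest_row(rows):
    """Return the largest number of cells found in any row."""
    return max((len(row) for row in rows), default=0)

# Tiny made-up sample standing in for a real report. The semi-colon
# and the space are kept inside their cells, not treated as separators.
sample = io.StringIO("SNP;name,kit 1,kit 2\nZ156,Z156,\n")
rows = read_report(sample)
print(rows[0])
print(widest_row(rows))
```

For a real file, the same function can be given a file opened with open(path, newline=""), where path is wherever you saved the report.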

What are the codes in the raw reports?

Answer valid as of: 2 July 2017

The raw (CSV format) report should open (partly or wholly) in most spreadsheets. It contains a list of kits going horizontally, and a list of mutations going vertically. The kits are ordered horizontally by sub-clade from largest to smallest, and vertically by sub-clade from oldest to most recent. Additional meta-data for each mutation is present in the left-hand 17 columns.

Each cell contains a description of the call for that mutation in that tester. They can be negative (blank), positive (listed with the SNP name), or uncalled. In cases where there is not a call, one of several codes is listed, which are detailed in the top-left of the report. These are:

nc = No call. This region is not covered in the BED file, and is neither called positive nor negative.
cbl = Coverage boundary, lower. This base pair is not called, but lies on the edge of a region of coverage. A BAM file call may be possible.
cbu = Coverage boundary, upper. This base pair is called, but lies on the edge of a region of coverage. Calls for indels may be affected.
cblu = Coverage boundary, both lower and upper. Indicates a poorly formed BED file.
(R) = Mutation is treated as recurrent within the dataset. The first, second, third, etc., occurrences of the mutation are listed as (R1), (R2), (R3), etc. Generally, these mutations shouldn't be used for age analysis, and alternative mutations should be used where possible in any subsequent analysis or testing.
(?+) = Presumed positive. The mutation is not called, but a manual edit has been made to force it to be positive on the basis of downstream calls, so that a phylogenetic tree can be created.

In the short version of the report, "singleton SNPs" (and other mutations) are condensed down into a single list which follows the shared SNPs. These "singleton" SNPs are called in only one test, and are sometimes referred to as "private" SNPs, but are not necessarily mutations which are "de novo", or new to the testers themselves. Sometimes, SNPs can be called in individual testers, but not in their nearest neighbours. Consequently, we don't know if they are really "singleton" SNPs, or shared SNPs. Often, the information can be found by examining the BAM file of the opposing test. Such mutations receive the following codes:

(s?) = Questionable singleton. The mutation is not called in some other people belonging to this person's lowest shared clade. It may be possible to determine whether this mutation is shared by looking at the individual calls of other people in this clade. For example, if someone is R-Z156 but negative for all shared SNPs below Z156, and a R-Z156>Z306 person is not called for one of their singleton SNPs, they will receive a (s?) code; however, they cannot share this SNP, since they don't share the Z306 clade.
(s?!) = Highly questionable singleton. The mutation is not called in all the other people belonging to this person's lowest clade. It is not possible to determine whether this SNP is truly private or not without a BAM file analysis of every other person from this clade. Often, this will just be one or two people, as it is much more common in the smaller clades.
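The codes above can be sorted into categories programmatically. The sketch below is only an illustration: it assumes that a no-call code occupies a cell on its own and that the bracketed codes appear somewhere in the cell text alongside the SNP name. The exact cell layout in the reports may differ, and the function name and the SNP names "SNP1" and "SNP2" are placeholders.

```python
import re

# No-call codes, assumed to fill the whole cell.
NO_CALL_CODES = {
    "nc": "no call",
    "cbl": "coverage boundary, lower",
    "cbu": "coverage boundary, upper",
    "cblu": "coverage boundary, lower and upper",
}

def classify_cell(cell):
    """Return a rough category for one cell of the raw report."""
    text = cell.strip()
    if not text:
        return "negative"  # blank cells are negative calls
    if text in NO_CALL_CODES:
        return NO_CALL_CODES[text]
    if "(?+)" in text:
        return "presumed positive"
    if "(s?!)" in text:
        return "highly questionable singleton"
    if "(s?)" in text:
        return "questionable singleton"
    if re.search(r"\(R\d*\)", text):  # (R), (R1), (R2), ...
        return "recurrent"
    return "positive"  # anything else is taken to be an SNP name

for cell in ["", "Z156", "nc", "SNP1 (R2)", "SNP2 (s?!)"]:
    print(repr(cell), "->", classify_cell(cell))
```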

Why do you only include BigY tests?

Answer valid as of: 22 June 2017

At present, we only include BigY tests in this analysis, not other sequencing tests such as Full Genomes' YElite or WGS, or YSeq's WGS tests, or test data from the scientific literature. The reason is simply that the large number of testers taking BigY tests has provided a homogeneous sample of test results that can be accurately compared to each other, and accurately internally calibrated.

A homogeneous sample is needed if we are to be sure we are comparing like with like. Different companies and different tests use different quality criteria for declaring coverage of a particular base pair, and declaring an SNP to be passed or rejected. These may be differences in the read quality, number of reads, or the ratio and combination of the two.

An internal calibration is necessary if the ages we derive are to be believable: we need to recover the same mutation rate in the BigY tests as is published in the literature.

We fully support Full Genomes Corp. and YSeq, among other companies, in these excellent products. The best way to include their data in our analysis is to buy these tests and increase the number of their tests that can be used in our internal calibrations.

What is the history behind this project?

Answer valid as of: 23 June 2017

This project was born out of a BigY data comparison first performed by David Carlisle for the U106 project when the BigY tests were first released (March 2014). David's original Mac code is still available on the U106 forum.

This project dovetailed with analysis I was already performing on STR ages for the U106 group, which never fully reached maturity due to limitations of the STR-based age analysis methods available. The project was taken over by Andrew Booth in October 2014, who ran it up to his sudden death in May 2016.

At this point, a "quick fix" was needed. J.R. Cannon became familiar with and helped run David's software, while I coded up a platform-independent version as quickly as I could. This coincided with the release of the Windows Subsystem for Linux on Windows 10. Consequently, it was decided to code this in a portable script that could be run on Windows Bash, Linux or Mac without the need for a compiler. The current code was mostly written by me, but invaluable coding help has been provided by Harald Alvestrand. Jef Treece has also contributed latterly to the coding, but his main help has been in the considerable amount of work it has needed to shape the P312 tree out of the raw data.

At this point, the U106 project contained a few hundred BigY tests, and the output of the software was a single spreadsheet, with a table of ages. Since then, the format has expanded to contain more user-friendly information that can be conveyed more easily over the web. The number of tests has also expanded into the thousands, and the inclusion of data from P312 has multiplied that further. The project is now at the stage where a replacement code is needed. The obvious choice at this point is to provide a code in Python, due to its versatility, portability, array-processing functions and use within the scientific and programming community. A Python-based "version 2" of this code is therefore envisaged which allows us to take many of the existing problems into account.

Why do the numbers keep changing?

Answer valid as of: 23 June 2017

The ages we provide are estimates. They have limited accuracy and are only as good as the data and assumptions that go into them. The code computes ages dynamically, and each clade's age is dependent on both the age of its parent and the age of its children. Hence, changing one clade on the tree affects the age computed for all of them.

We frequently add new kits to our analysis, changing both the tree layout and the mutation rates that go into it. Every addition we make to the tree has a knock-on effect on other branches. Usually these are minor, changing the ages by a few years, but changes close to any given clade will have a larger impact.

Larger changes occur when global changes are made. These may include changes to the mutation rate as more data or publications become available, changes to the workings of the code to better take into account limits from ancient DNA, changes to the regions of coverage as we better explore the chromosome, etc.

The changes that you see highlight the inexactness of these computations, and are typical of scientific data products. Depending on the assumptions you make, you will get a slightly different answer. This is why a scientific error margin is incorporated in the results: you should consider that the age of each clade could be anywhere within the 95% confidence interval, presented on this site as a range of dates. There is even a 5% chance the age could be outside this range! Consequently, as we get more data, you can expect the dates we compute to move around, but normally stay well within this interval.

How does the age analysis process work?

Answer valid as of: 22 June 2017

The method is based on the same method as YFull (Adamov et al. (2015)), but with a few mathematical bells and whistles. A short, mathematical description of the process is available here. Further details can be found in the code repository. A fuller description is being written up more rigorously.

How accurate are these ages?

Answer valid as of: 22 June 2017

Typically, the ages of individual clades are only accurate to within a few hundred years at best. Each age is quoted with a 95% confidence interval: this is a standard measure, and roughly translates as the range in which I am 95% certain that the true answer lies, provided the assumptions I have made are correct.

This uncertainty range takes into account the factors which we think are the most uncertain in the analysis. These include (in descending order of importance): statistical scatter due to the small number of mutations (typically +/- a few centuries), uncertainties in the mutation rate (typically +/- 8%, or a few centuries), and uncertainties in the age zero point (birth dates of testers; +/- 16 years).

There are other uncertainties that are not included in the modelling of this analysis, including temporal or spatial variations in the mutation rate (+/- less than about 7%), inaccuracy in SNP calls (affecting less than about 10% of clades by +/- 180 years), inaccuracy in classifying mutations (including MNPs and SNP clusters, which represent ~1% of mutations and are not distinguished from SNPs), and correct accounting of clades with zero SNPs counted for age analysis (effects vary between about 10 and 50 years per clade, depending on source). In some cases, this may cause additional uncertainties that have not been correctly accounted for. Plus there may be "unknown unknowns", things we have not considered that might affect the final ages.

As part of the process of developing these methods and this website, we are writing up more rigorous documentation to detail our findings in these matters, much of which does not appear in the current scientific literature.

How valid is the concept of a mutation rate?

Answer valid as of: 22 June 2017

Mathematically, a rate is simply defined as the number of something that occurs per change in something else. This might be a very accurate rate, such as the rate at which a clock ticks or an atom vibrates; a mostly accurate rate, such as the rate the Earth spins or the rate a tap drips; a slightly inaccurate rate, like the number of raindrops falling on a leaf per second or the number of babies born per year; or a very inaccurate rate, like the frequency with which the word "cat" appears in the New York Times or the number of cars per minute crossing Tower Bridge. A rate can be defined by pretty much anything. The question is, how useful is that rate?

Genetic mutations are supposed to occur as random processes. Similar processes are the rate at which radioactive atoms like carbon-14 decay, or the rate at which photons of light hit a camera. These are governed by a branch of mathematics known as Poisson statistics. A classic early application, analysed by Ladislaus Bortkiewicz in 1898, was the number of Prussian cavalrymen killed by kicks from their horses. If genetic mutations are truly random, they should be described entirely by Poisson statistics.
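To give a feel for what Poisson statistics implies, the sketch below computes the probability of seeing exactly k mutations when a given number is expected. It uses the standard Poisson formula, P(k) = m^k e^(-m) / k!, where m is the expected count. The 480-year interval is an arbitrary choice for illustration, and the figure of one SNP per 160 years is the average quoted elsewhere in this FAQ.

```python
import math

def poisson_pmf(k, expected):
    """Probability of exactly k events when `expected` are expected."""
    return expected**k * math.exp(-expected) / math.factorial(k)

years = 480           # arbitrary interval, for illustration only
years_per_snp = 160   # average quoted elsewhere in this FAQ
expected = years / years_per_snp  # 3 mutations expected on average

for k in range(7):
    print(k, round(poisson_pmf(k, expected), 3))
```

The spread of these probabilities around the expected value is the "statistical scatter" that dominates the uncertainty on clades defined by only a few mutations.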

We also know from STR data that mutations from one state to another can be made more likely by an existing mutation. We also know of several environmental factors that can make genetic mutations occur of their own accord, which are broadly synonymous with carcinogens. Social factors, like generation length, could change the rate at which mutations are passed down. Hence there is reason to believe that the mutation rate may change over time. Direct factors exacerbating this may come from a variety of sources, including changes in diet (e.g. hunter-gatherer to farmer), changes in location, altitude (cosmic ray exposure), radon exposure, etc.

Observationally, we can limit this, by looking at the mutation rate derived in different studies (see ISOGG list), that focus on:

Different time periods (ancient versus modern).
Different geographical locations (different haplogroups and different environments).
Different generational lengths (from different modern father-son pairs).

These different studies provide near-identical mutation rates within about 15% (see a recent review or a now-outdated in depth article), centred on around 7 or 8 x 10^-10 SNPs per base pair per year. The particular study of Kong et al. (2012) indicates that mutations per year is the primary factor, not mutations per generation, so we can ignore generational length as a concern. The self-similar mutation rates found in these studies show that the mutation rate has not changed by more than 15% (+/- 7.5% from their mean) within the last 10,000 years and across regimes as different as Icelandic farmers, Sardinian hunter-gatherers and Amerindian tribes. Extension of this rate before about 10,000 BC becomes increasingly less constrained by the data, but still provides a reasonable (if very uncertain) approximation to relationships as old as the chimpanzee-human divergence.

The application of any particular mutation rate to any given study depends on reproducing the conditions of that mutation rate to the study in question. Hence, since most mutation rates rely solely on SNPs from a particular region of the Y chromosome, it is important that the age analysis study only counts the SNPs from that region of the chromosome, not other mutations, or other regions of the chromosome. It is possible to bootstrap the values across these regions, but it generally requires some form of interpolation that is not scientifically ideal.

Which mutation rate do you use?

Answer valid as of: 23 June 2017

The exact mutation rate we use is under constant revision, as new calibration data is added, and as we revise how we treat literature studies and how we account for ancient DNA. It is currently 8.124 x 10^-10 SNPs per base pair per year, with a 95% confidence interval of 7.534 - 8.722 x 10^-10 SNPs / bp / yr.

We typically use about 7.7 million base pairs of coverage out of Family Tree DNA's average claimed coverage of 10.6 million, although both the coverage and what we can use varies from test to test. As an average, this works out to one SNP per 160 years (95% c.i.: 149 to 172 years) over the region of the BigY test we cover. If we were to use the entire 10.6 million base pairs of the BigY test (and we don't recommend you do), that would work out to about 124 years per SNP, and an extrapolated rate for Full Genomes YElite and >5x WGS products of about 94 years per SNP, but we don't recommend you use these figures as any more than an approximate guide, since the remaining regions of the chromosome may mutate at different rates.
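The conversion from a mutation rate to "years per SNP" is a simple reciprocal, and can be checked with a few lines of Python. This is only a sketch of the arithmetic for the 7.7 million base pairs we typically use, not the code used in the analysis. The function and variable names are our own choice for this example.

```python
RATE = 8.124e-10       # SNPs per base pair per year (current value)
RATE_LOW = 7.534e-10   # lower end of the 95% confidence interval
RATE_HIGH = 8.722e-10  # upper end of the 95% confidence interval
USED_BP = 7.7e6        # base pairs typically used per test

def years_per_snp(rate_per_bp_per_year, base_pairs):
    """Average waiting time between SNPs over the covered region."""
    return 1.0 / (rate_per_bp_per_year * base_pairs)

print(round(years_per_snp(RATE, USED_BP)))       # 160
print(round(years_per_snp(RATE_HIGH, USED_BP)))  # 149
print(round(years_per_snp(RATE_LOW, USED_BP)))   # 172
```

Note that the higher rate gives the shorter interval between SNPs, which is why the ends of the confidence interval swap over.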

The mutation rate we use is derived from an amalgam of different sources. The three main sources are: literature studies of the mutation rate, constraints from ancient DNA, and our own internal calibration. The uncertainty-weighted mean of these is used to derive a final mutation rate.
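As a rough illustration of what an uncertainty-weighted mean does, the sketch below uses the common inverse-variance weighting, in which each estimate is weighted by one over the square of its uncertainty. This is an assumption made for the example: it treats the uncertainties as symmetric and Gaussian, which the real calculation does not necessarily do. The input numbers are invented and are not our calibration data.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its 1-sigma uncertainty."""
    weights = [1.0 / s**2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)

# Invented inputs (years per SNP and 1-sigma uncertainty), for illustration.
mean, sigma = weighted_mean([150.0, 180.0], [10.0, 20.0])
print(mean, sigma)
```

The more precise estimate dominates the result, and the combined uncertainty is smaller than either of the inputs.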

The literature sources we use include Xue et al. (2009), Poznik et al. (2013), Mendez et al. (2013), Francalacci et al. (2013), Fu et al. (2014), Helgason et al. (2015), Karmin et al. (2015) and Trombetta et al. (2015). Each study has its advantages and drawbacks (here's one example, here's another). These rates are weighted by their errors and combined in quadrature. Additional uncertainties of +20%/-10% are added to Poznik et al. (2013) and Francalacci et al. (2013) to account for archaeological uncertainties, and a small uncertainty is added to Fu et al. (2014) to account for the uncertainty in the burial date. The coverage-weighted average of the palindromic and non-palindromic regions of Helgason et al. is used, although our currently-used coverage only contains the non-palindromic regions. Currently, we are not including the ancient-DNA-based Karmin et al. and Trombetta et al. in the analysis, instead accounting for the constraints from ancient DNA more directly. With this setup, the literature rates alone provide a 95% confidence interval of 146 - 170 years per SNP (50th centile: 158 years/SNP).

The archaeological DNA constraints we use come from Lille Beddinge (RISE 98) and Rathlin 1, where the data have been analysed by others and, in the case of RISE 98, passed through the Full Genomes Corp. pipeline and novel variants identified. These provide a mutation rate of (95% c.i.) 166 - 266 years per SNP (50th centile: 183 years/SNP), which agrees within the uncertainties with the literature mutation rates and those of Karmin (136 - 206 years/SNP) and Trombetta (159 - 210 years/SNP). The reason for adopting this approach is that these two burials directly constrain the R-P311 rate, rather than relying on results from a different haplogroup. Lower limit uncertainties from ancient DNA can also be applied, but are not at present. Additional, better options for including ancient DNA results in the analysis are possible, and we are exploring how best to incorporate these.

The internal calibration we use is based on the tree we create, where two testers with proven genealogical connections have taken BigY tests which confirm their triangulation back to a common ancestor.

For R-U106, we have approximately 17,850 years of lineages in which there are approximately 103 SNP mutations, giving approximately 173 years per SNP (95% c.i.: 133 - 233 years/SNP). The approximations come from two sources: sometimes the exact year of the common ancestor is not known, or the birth year of testers is not given. The average birth year of BigY testers is 1950 AD with a standard deviation of 15 years, based on a sample of 121 BigY testers. Together with uncertainties in the common ancestor date, this provides an estimated uncertainty in the length of time of +/- 724 years.

The uncertainty in the number of mutations arises because not every mutation is called in both tests of the triangulation. This affects approximately 10% of SNPs. Some of them should be included and some of them should not be. We are currently inquiring about the status of these mutations in the BAM files of these testers, and we are investigating modifications to the age analysis process that may be able to ignore these mutations on a more rigorous mathematical basis.

Equivalent numbers in P312 are 14,157 +/- 618 years and 81 mutations, giving 175 years per SNP (95% c.i.: 131 - 244 years/SNP). From R1a, we have 11,688 +/- 689 years and 53 mutations, giving 221 years per SNP (95% c.i.: 153 - 340 years/SNP). These values are less well calibrated, however. The R1a haplogroup results are entirely based off the Clan Donald BigY sample, where many of the shared lineages are poorly constrained by the historical record, and this is suspected to be the cause of the offset (although it remains within the 95% uncertainty range).
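The central values of these internal calibrations are simply the total length of the lineages divided by the number of mutations counted. The sketch below reproduces them from the figures quoted above, and also prints the fractional counting scatter, 1/sqrt(N), expected for a Poisson count of N mutations. That scatter alone is narrower than the quoted 95% intervals, which also take in the uncertainties in the elapsed time and in the number of mutations. The names used are our own choice for this example.

```python
import math

# (total years of lineages, SNP mutations counted), as quoted above.
CALIBRATIONS = {
    "R-U106": (17850, 103),
    "P312": (14157, 81),
    "R1a": (11688, 53),
}

def years_per_snp(total_years, snp_count):
    """Central estimate of the average time between SNPs."""
    return total_years / snp_count

def poisson_fraction(snp_count):
    """Fractional 1-sigma counting scatter for a Poisson count."""
    return 1.0 / math.sqrt(snp_count)

for name, (years, snps) in CALIBRATIONS.items():
    rate = round(years_per_snp(years, snps))
    print(f"{name}: {rate} years/SNP, counting scatter {poisson_fraction(snps):.0%}")
```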

The three rates (literature, archaeological and internal) are combined to yield the rate of 160 years per SNP, with a 95% confidence interval of 149 to 172 years per SNP, that is used in the main analysis. For privacy reasons, I cannot make public all the data that goes into the internal calibrations, and we are not yet ready to release the full details of this calculation, since we are still determining the details of the best way to account for the archaeological DNA. I hope to be able to provide more detail in the future.

Do you use constraints from ancient DNA in your ages?

Answer valid as of: 23 June 2017

Not directly. We do take into account some constraints from ancient DNA in our mutation rate calculations. However, the recent release of a large number of relevant ancient DNA results, plus the expansion of our tree to P312, where archaeological DNA provides a more direct constraint on ages, means a more explicit use of ancient DNA is desirable.

In the summary of how the pipeline works, I have outlined how it can be modified to take ancient DNA results into account. However, this is not yet implemented.

Which tests / regions of the chromosome do you take into account?

Answer valid as of: 25 June 2017

This analysis can only work on sequencing data. We need a large number (millions) of base pairs called positive or negative for SNPs before we can do anything meaningful with this data. Individual results and targeted SNP packs/panels cannot be used because they don't contain enough data. However, they can be useful in placing individual kits within this hierarchy. Out of the plethora of sequencing tests available these days, for consistency reasons we currently only include BigY tests in this age analysis, but hope to expand this once the software is working maturely and the datasets for these other tests have grown.

We only select certain parts of the Y chromosome for this analysis, where the mutation rate is well characterised, the test coverage is relatively uniform, and the individual SNPs well read. The details of this region are subject to final consideration, but are currently as listed in the summary description of the reduction pipeline. These regions are restricted to the euchromatic, non-palindromic regions of the Y chromosome (see the last page of the primer). They exclude regions in the BigY such as the major palindromic and inverted repeat regions (which are the main differences between Family Tree DNA's claimed coverage of the BigY test and that of Full Genomes' re-analysis), as well as the centromeric and DYZ19 repeat regions. Currently, any phylogenetically consistent SNPs falling within STRs are still counted.
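In practice, restricting a test to these regions means intersecting the tester's coverage (as listed in their BED file) with the list of regions used for age analysis, and counting the base pairs that survive. The sketch below shows one simple way to do this for two sorted lists of half-open (start, end) intervals. It is an illustration rather than our pipeline code, and the coordinates are invented, not real Y-chromosome positions.

```python
def overlap_length(coverage, regions):
    """Total base pairs shared by two sorted, non-overlapping interval lists."""
    total = 0
    i = j = 0
    while i < len(coverage) and j < len(regions):
        start = max(coverage[i][0], regions[j][0])
        end = min(coverage[i][1], regions[j][1])
        if start < end:
            total += end - start
        # Advance whichever interval finishes first.
        if coverage[i][1] < regions[j][1]:
            i += 1
        else:
            j += 1
    return total

# Invented coordinates, for illustration only.
coverage = [(100, 200), (300, 450)]   # regions covered by one test
regions = [(150, 350), (400, 500)]    # regions accepted for age analysis
print(overlap_length(coverage, regions))
```

The resulting count is the number of base pairs that would enter the age calculation for that test, which is why the usable coverage varies from test to test.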