qpWave and qpAdm

library(magrittr)
library(tidyverse)
library(admixtools)

qpWave and qpAdm are programs which can answer a number of questions about the relationships of different populations, such as:

Do populations form clades with respect to each other?
What is the minimum number of admixture waves that separates two groups of populations?
Can a population be modeled as a mix of a set of other populations, or would this miss an ancestry component?
What are the relative proportions of the different admixing populations?

The first two questions can be answered by qpWave. The latter two questions are about a specific target population, and they can be answered by qpAdm, an extension of qpWave.

This page describes how both programs use \(f_4\)-statistics to address these questions. The original description of qpWave and qpAdm which goes into more technical detail can be found in Supplementary Information section 10 in this paper.

qpWave

qpWave and f4

\(f_4\)-statistics can be used to test whether two pairs of populations (\(L_1, L_2\) and \(R_1, R_2\)) form clades with respect to one another: If \(f_4(L_1,L_2; R_1,R_2)\) is significantly different from 0, the two groups do not form a clade. If we have two groups with more than two populations each, and we want to know if the groups form two clades, we could compute all possible \(f_4\)-statistics and test if any of them deviate from zero. Except we would have to correct for multiple testing, which is not straight-forward in this case because the \(f_4\)-statistics are not independent of each other.

qpWave provides another way to test whether two groups of populations form two clades. We can empirically see that qpWave and \(f_4\) are closely related, because in the special case of applying qpWave to only four populations, we get almost the same p-value as for an \(f_4\)-statistic of the four populations:

L = c('Altai_Neanderthal.DG', 'Vindija.DG')
R = c('Chimp.REF', 'Mbuti.DG')
f4(example_f2_blocks, L[1], L[2], R[1], R[2])$p

## [1] 0.3576872

qpwave(example_f2_blocks, L, R)$rankdrop$p

## ℹ Computing f4 stats...
## ℹ Computing number of admixture waves...
##

## [1] 0.3577113

Estimating the number of independent gene flows

Testing whether two pairs of populations form clades with each other is the same as asking whether there is only one gene flow event which separates the two pairs (\(f_4 = 0\)) or whether there must have been more than a single gene flow event between the two pairs (\(f_4 != 0\)).

In the graph on the left, \(L1\) and \(L2\) form a clade relative to \(R1\) and \(R2\), which means \(L\) and \(R\) are only connected through one gene flow event. On the right they do not form a clade, and they are connected through two independent gene flow events (one going through the root, and the other going through the edge n1b|n2b).

qpWave allows us to estimate a lower bound on the number of gene flow events that separate two groups of populations \(L\) and \(R\) of size \(n_L\) and \(n_R\).

The basic idea behind it is that we can use \(f_4\)-statistics of the form \(f_4(L_i, L_j; R_m, R_n)\) to estimate the genetic drift that is shared between pairs of left populations and pairs of right populations. There are \(\frac{n_L(n_L-1)}{2} \frac{n_R(n_R-1)}{2}\) \(f_4\)-statistics of that form, but not all of them contain unique information. In the case where \(L\) and \(R\) form two clades, all \(f_4\)-statistics \(f_4(L_i, L_j; R_m, R_n)\) are expected to be zero. If there are two independent gene flow events, a few of these \(f_4\)-statistics will be enough to determine what the other \(f_4\)-statistics should be. More generally, we can construct a matrix \(X\) of these \(f_4\)-statistics, and the rank of \(X\) tells us what the minimum number of gene flows is between \(L\) and \(R\). A high rank of \(X\) suggests many independent gene flows between \(L\) and \(R\).

In practice, \(X\) is constructed by fixing one population in \(L\) and one population in \(R\) which results in a matrix of dimensions \((n_L - 1) \times (n_R - 1)\) with all \(f_4\)-statistics of the form \(f_4(L_1, L_i; R_1, R_j)\). By convention, \(n_R > n_L\), which means that the highest possible rank of \(X\) is \(n_L - 1\). A rank of \(n_L - 1\) suggests that at least \(n_L - 1\) independent gene flows have occurred between \(L\) and \(R\).

Estimating the rank of X

To estimate the rank of \(X\) we test how well \(X\) can be approximated by different low rank approximations. An approximation of \(X\) with one rank less than full rank can be written as \(AB\), where \(A\) is \((n_L-1) \times (n_L-2)\), and \(B\) is \((n_L-2) \times (n_R-1)\) and \(AB\) should be as close as possible to \(X\). In this case (one rank less than full rank), \(X\) would have rank \(n_L - 1\), and \(AB\) would have rank \(n_L - 2\). If this approximation works well, the residuals \(E = X - AB\) will be very small. Large residuals suggest that the low rank approximation \(AB\) doesn’t capture all the information in \(X\) and that a simple model with only few admixture events doesn’t fit the data. How large these residuals can become before a model can be rejected depends on the standard errors of the \(f_4\)-statistics. If we have data from a large number of SNPs, these standard errors will be small, which gives us more power to reject models.

To get low rank approximations of \(X\), we start with a singular value decompositions (SVD) of \(X\), which gives us precursors of the matrices \(A\) and \(B\). Because we cannot ignore the covariance among the \(f_4\)-statistics (\(Q\)), \(A\) and \(B\) need to be further adjusted. The initial SVD estimates of \(A\) and \(B\) already minimize the squared sum of the residuals \(E = X - AB\). What is actually minimized in the end is the quadratic form \(E' Q^{-1} E\) (\(E\) is a vector of length \((n_L - 1)(n_R - 1)\), and \(Q\) is a square matrix of dimension \((n_L - 1)(n_R - 1) \times (n_L - 1)(n_R - 1)\)). This quantity is closely related to the likelihood of the model, \(L(A,B) = -\frac{1}{2} E'Q^{-1}E\). The difference between the log likelihoods of two models (same populations, but one with higher rank than the other) approximately follows a \(\chi^2\) distribution with \(x\) degrees of freedom. From this \(\chi^2\) distribution we can compute a p-value that tests whether we can reject simple models with only one or a few admixture waves.

Let’s consider two extreme cases:

First, \(L\) and \(R\) are two groups of populations which form clades with respect to each other. This means that all \(f_4\)-statistics of the form \(f_4(L_i, L_j; R_m, R_n)\) are \(0\), and hence \(X\) is just a matrix full of \(0s\) and rank has rank \(0\).

The opposite extreme is when \(X\) (which has dimensions \((n_L - 1) \times (n_R - 1)\)) has full rank (\((n_L - 1)\), because \(n_L\) <= \(n_R\)). This means that there is no population in \(L\) which can be modeled as a combination of other populations in \(L\). Each population in \(L\) shares some amount of drift with populations in \(R\), which is not captured by other populations in \(L\).

This second extreme case is important, because it forms the basis for qpAdm: Once we have identified two groups \(L\) and \(R\) with this property, we can add an additional population \(T\) and test whether this property is retained or not.

qpAdm

Modelling an admixed population

The qpWave section above describes how we can use a matrix of \(f_4\)-statistics (\(X\)) to test the number of gene flows that separate \(L\) and \(R\). The idea behind qpAdm is that we can say something about an additional population \(T\) if we follow this approach twice: Once for a model with \(L\) and \(R\), and once for a model with \(L \cup T\) (the union of \(L\) and \(T\)) and \(R\). If we find a higher number of gene flows for the model with \(L \cup T\) and \(R\) than for the model with only \(L\) and \(R\), this suggests that there was some gene flow between \(T\) and \(R\). Thus \(T\) cannot be modeled as a combination of populations in \(L\). This may hint at the existence of a yet undiscovered “ghost” population which is related to a population in \(R\), but not represented by any of the populations in \(L\). On the other hand, if both models have the same rank (no extra gene flow between \(T\) and \(R\)), then \(T\) can be modeled as a combination of populations in \(L\), and we can estimate the relative contribution of each population.

qpAdm therefore produces two types of output that are of main interest: A p-value which tests whether there is a connection between \(R\) and \(T\), other than through \(L\), and (if the answer to the last question is no), admixture weights which can tell us in what proportions populations in \(L\) have contributed to \(T\).

The null hypothesis tested by qpAdm is that the populations in \(L\) can account for all shared drift between \(T\) and \(R\). A significant p-value rejecting this null hypothesis indicates that there has been gene flow between \(T\) and \(R\) (in either direction) that is not accounted for by any population in \(L\).

The admixture weights are estimated by computing the left null space of the matrix \(A\) (they satisfy \(w A = 0\)). This is basically asking in what proportions the populations in \(L\) need to be added up so that their \(f_4\)-statistics will be equal to the \(f_4\)-statistics involving \(T\). These weights will also be computed when the model is rejected (the p-value is significant), but in that case they can’t be interpreted as admixture weights. It is important to keep in mind that the true source populations of \(T\) are almost certainly not identical to any of the populations in \(L\), and that any inferences about the mixture proportions of the true, unobserved source populations depend on an unverifiable assumption: Each left population needs to be strictly cladal with one of the source populations.