*qpWave* and *qpAdm* are programs which can answer a
number of questions about the relationships of different populations,
such as:

- Do populations form clades with respect to each other?
- What is the minimum number of admixture waves that separates two groups of populations?
- Can a population be modeled as a mix of a set of other populations, or would this miss an ancestry component?
- What are the relative proportions of the different admixing populations?

The first two questions can be answered by *qpWave*. The
latter two questions are about a specific target population, and they
can be answered by *qpAdm*, an extension of *qpWave*.

This page describes how both programs use \(f_4\)-statistics to address these
questions. The original description of *qpWave* and
*qpAdm* which goes into more technical detail can be found in
Supplementary Information section 10 in this paper.

\(f_4\)-statistics can be used to test whether two pairs of populations (\(L_1, L_2\) and \(R_1, R_2\)) form clades with respect to one another: If \(f_4(L_1,L_2; R_1,R_2)\) is significantly different from 0, the two groups do not form a clade. If we have two groups with more than two populations each, and we want to know if the groups form two clades, we could compute all possible \(f_4\)-statistics and test if any of them deviate from zero. Except we would have to correct for multiple testing, which is not straight-forward in this case because the \(f_4\)-statistics are not independent of each other.

*qpWave* provides another way to test whether two groups of
populations form two clades. We can empirically see that *qpWave*
and \(f_4\) are closely related,
because in the special case of applying *qpWave* to only four
populations, we get almost the same p-value as for an \(f_4\)-statistic of the four
populations:

```
L = c('Altai_Neanderthal.DG', 'Vindija.DG')
R = c('Chimp.REF', 'Mbuti.DG')
f4(example_f2_blocks, L[1], L[2], R[1], R[2])$p
```

`## [1] 0.3576872`

`qpwave(example_f2_blocks, L, R)$rankdrop$p`

```
## ℹ Computing f4 stats...
## ℹ Computing number of admixture waves...
##
```

`## [1] 0.3577113`

Testing whether two pairs of populations form clades with each other is the same as asking whether there is only one gene flow event which separates the two pairs (\(f_4 = 0\)) or whether there must have been more than a single gene flow event between the two pairs (\(f_4 != 0\)).

In the graph on the left, \(L1\) and
\(L2\) form a clade relative to \(R1\) and \(R2\), which means \(L\) and \(R\) are only connected through one gene
flow event. On the right they do not form a clade, and they are
connected through two independent gene flow events (one going through
the root, and the other going through the edge
`n1b|n2b`

).

*qpWave* allows us to estimate a lower bound on the number of
gene flow events that separate two groups of populations \(L\) and \(R\) of size \(n_L\) and \(n_R\).

The basic idea behind it is that we can use \(f_4\)-statistics of the form \(f_4(L_i, L_j; R_m, R_n)\) to estimate the genetic drift that is shared between pairs of left populations and pairs of right populations. There are \(\frac{n_L(n_L-1)}{2} \frac{n_R(n_R-1)}{2}\) \(f_4\)-statistics of that form, but not all of them contain unique information. In the case where \(L\) and \(R\) form two clades, all \(f_4\)-statistics \(f_4(L_i, L_j; R_m, R_n)\) are expected to be zero. If there are two independent gene flow events, a few of these \(f_4\)-statistics will be enough to determine what the other \(f_4\)-statistics should be. More generally, we can construct a matrix \(X\) of these \(f_4\)-statistics, and the rank of \(X\) tells us what the minimum number of gene flows is between \(L\) and \(R\). A high rank of \(X\) suggests many independent gene flows between \(L\) and \(R\).

In practice, \(X\) is constructed by fixing one population in \(L\) and one population in \(R\) which results in a matrix of dimensions \((n_L - 1) \times (n_R - 1)\) with all \(f_4\)-statistics of the form \(f_4(L_1, L_i; R_1, R_j)\). By convention, \(n_R > n_L\), which means that the highest possible rank of \(X\) is \(n_L - 1\). A rank of \(n_L - 1\) suggests that at least \(n_L - 1\) independent gene flows have occurred between \(L\) and \(R\).

To estimate the rank of \(X\) we test how well \(X\) can be approximated by different low rank approximations. An approximation of \(X\) with one rank less than full rank can be written as \(AB\), where \(A\) is \((n_L-1) \times (n_L-2)\), and \(B\) is \((n_L-2) \times (n_R-1)\) and \(AB\) should be as close as possible to \(X\). In this case (one rank less than full rank), \(X\) would have rank \(n_L - 1\), and \(AB\) would have rank \(n_L - 2\). If this approximation works well, the residuals \(E = X - AB\) will be very small. Large residuals suggest that the low rank approximation \(AB\) doesn’t capture all the information in \(X\) and that a simple model with only few admixture events doesn’t fit the data. How large these residuals can become before a model can be rejected depends on the standard errors of the \(f_4\)-statistics. If we have data from a large number of SNPs, these standard errors will be small, which gives us more power to reject models.

To get low rank approximations of \(X\), we start with a singular value decompositions (SVD) of \(X\), which gives us precursors of the matrices \(A\) and \(B\). Because we cannot ignore the covariance among the \(f_4\)-statistics (\(Q\)), \(A\) and \(B\) need to be further adjusted. The initial SVD estimates of \(A\) and \(B\) already minimize the squared sum of the residuals \(E = X - AB\). What is actually minimized in the end is the quadratic form \(E' Q^{-1} E\) (\(E\) is a vector of length \((n_L - 1)(n_R - 1)\), and \(Q\) is a square matrix of dimension \((n_L - 1)(n_R - 1) \times (n_L - 1)(n_R - 1)\)). This quantity is closely related to the likelihood of the model, \(L(A,B) = -\frac{1}{2} E'Q^{-1}E\). The difference between the log likelihoods of two models (same populations, but one with higher rank than the other) approximately follows a \(\chi^2\) distribution with \(x\) degrees of freedom. From this \(\chi^2\) distribution we can compute a p-value that tests whether we can reject simple models with only one or a few admixture waves.

Let’s consider two extreme cases:

First, \(L\) and \(R\) are two groups of populations which form clades with respect to each other. This means that all \(f_4\)-statistics of the form \(f_4(L_i, L_j; R_m, R_n)\) are \(0\), and hence \(X\) is just a matrix full of \(0s\) and rank has rank \(0\).

The opposite extreme is when \(X\) (which has dimensions \((n_L - 1) \times (n_R - 1)\)) has full rank (\((n_L - 1)\), because \(n_L\) <= \(n_R\)). This means that there is no population in \(L\) which can be modeled as a combination of other populations in \(L\). Each population in \(L\) shares some amount of drift with populations in \(R\), which is not captured by other populations in \(L\).

This second extreme case is important, because it forms the basis for
*qpAdm*: Once we have identified two groups \(L\) and \(R\) with this property, we can add an
additional population \(T\) and test
whether this property is retained or not.

The *qpWave* section above describes how we can use a matrix
of \(f_4\)-statistics (\(X\)) to test the number of gene flows that
separate \(L\) and \(R\). The idea behind *qpAdm* is that
we can say something about an additional population \(T\) if we follow this approach twice: Once
for a model with \(L\) and \(R\), and once for a model with \(L \cup T\) (the union of \(L\) and \(T\)) and \(R\). If we find a higher number of gene
flows for the model with \(L \cup T\)
and \(R\) than for the model with only
\(L\) and \(R\), this suggests that there was some gene
flow between \(T\) and \(R\). Thus \(T\) cannot be modeled as a combination of
populations in \(L\). This may hint at
the existence of a yet undiscovered “ghost” population which is related
to a population in \(R\), but not
represented by any of the populations in \(L\). On the other hand, if both models have
the same rank (no extra gene flow between \(T\) and \(R\)), then \(T\) can be modeled as a combination of
populations in \(L\), and we can
estimate the relative contribution of each population.

*qpAdm* therefore produces two types of output that are of
main interest: A p-value which tests whether there is a connection
between \(R\) and \(T\), other than through \(L\), and (if the answer to the last
question is no), admixture weights which can tell us in what proportions
populations in \(L\) have contributed
to \(T\).

The null hypothesis tested by *qpAdm* is that the populations
in \(L\) can account for all shared
drift between \(T\) and \(R\). A significant p-value rejecting this
null hypothesis indicates that there has been gene flow between \(T\) and \(R\) (in either direction) that is not
accounted for by any population in \(L\).

The admixture weights are estimated by computing the left null space of the matrix \(A\) (they satisfy \(w A = 0\)). This is basically asking in what proportions the populations in \(L\) need to be added up so that their \(f_4\)-statistics will be equal to the \(f_4\)-statistics involving \(T\). These weights will also be computed when the model is rejected (the p-value is significant), but in that case they can’t be interpreted as admixture weights. It is important to keep in mind that the true source populations of \(T\) are almost certainly not identical to any of the populations in \(L\), and that any inferences about the mixture proportions of the true, unobserved source populations depend on an unverifiable assumption: Each left population needs to be strictly cladal with one of the source populations.