Computes f4-statistics of the form \(f4(A, B; C, D)\). For allele frequencies `a`

, `b`

, `c`

, `d`

, `f4(A, B; C, D)`

is computed as the average of `(a-b)*(c-d)`

over all SNPs in each SNP block. This is equivalent to \((f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2\) (assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by `f2_from_precomp`

or `f2_from_geno`

, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).

```
qpdstat(
data,
pop1 = NULL,
pop2 = NULL,
pop3 = NULL,
pop4 = NULL,
boot = FALSE,
sure = FALSE,
unique_only = TRUE,
comb = TRUE,
blgsize = NULL,
block_lengths = NULL,
f4mode = TRUE,
afprod = TRUE,
cpp = TRUE,
verbose = is.character(data),
...
)
```

- data
Input data in one of three forms:

A 3d array of blocked f2 statistics, output of

`f2_from_precomp`

or`f2_from_geno`

(fastest option)A directory which contains pre-computed f2-statistics

The prefix of genotype files (slowest option)

- pop1
One of the following four:

`NULL`

: all possible population combinations will be returnedA vector of population labels. All combinations with the other

`pop`

arguments will be returnedA matrix with population combinations to be tested, with one population per column and one combination per row. Other

`pop`

arguments will be ignored.the location of a file (

`poplistname`

or`popfilename`

) which specifies the populations or population combinations to be tested. Other`pop`

arguments will be ignored.

- pop2
A vector of population labels

- pop3
A vector of population labels

- pop4
A vector of population labels

- boot
If

`FALSE`

(the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If`boot`

is an integer, that number will specify the number of bootstrap resamplings. If`boot = TRUE`

, the number of bootstrap resamplings will be equal to the number of SNP blocks.- sure
The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.

- unique_only
If

`TRUE`

(the default), redundant combinations will be excluded- comb
Generate all combinations of

`pop1`

,`pop2`

,`pop3`

,`pop4`

. If`FALSE`

,`pop1`

,`pop2`

,`pop3`

,`pop4`

should all be vectors of the same length.- blgsize
SNP block size in Morgan. Default is 0.05 (5 cM). Only used when

`data`

is the prefix of genotype files- block_lengths
Vector with lengths of each jackknife block.

`sum(block_lengths)`

has to match the number of SNPs. only used when`data`

is the prefix of genotype files- f4mode
Set this to

`FALSE`

to compute D-statistics instead of f4. This only has an effect if the first argument is a genotype prefix. D-statistics are computed as`(a-b)*(c-d) / ((a + b - 2*a*b) * (c + d - 2*c*d))`

, which is the same as`(P(ABBA) - P(BABA)) / (P(ABBA) + P(BABA))`

- afprod
Compute f4 from allele frequency products instead of f2 (default

`TRUE`

). Only used if`data`

is a directory with precomputed data. For populations with lots of missing data, this option reduces bias that can result from setting`maxmiss`

to values greater than 0. In all other cases it should not make a difference.- cpp
Use C++ functions. Setting this to

`FALSE`

will be slower but can help with debugging.- verbose
Print progress updates

- ...
Additional arguments passed to

`f4blockdat_from_geno`

if`data`

is a genotype file prefix or`f2_from_precomp`

if`data`

is a directory with f2-statistics

`qpdstat`

returns a data frame with f4 statistics, with columns for populations 1 to 4,
f4-estimate (`est`

), standard error of the estimate (`se`

), z-score (`z`

, `est`

/`se`

), p-value (`2*(1-pnorm(z))`

),
and the number of SNPs used (`n`

; only if first argument is genotype prefix)

f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.

Use the same SNPs for every f4-statistic

This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (`maxmiss = 0`

in`extract_f2`

or`f2_from_geno`

).Use different SNPs for each f4-statistic

This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set`allsnps = FALSE`

. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the`allsnps`

option, whereas the default mode in qpAdm and qpGraph is equivalent to`maxmiss = 0`

)Use different SNPs for each f2-statistic

This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the`maxmiss`

parameter to a value greater than 0 (and not larger than 1) in`extract_f2`

or`f2_from_geno`

. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option`afprod = TRUE`

in`f2_from_precomp`

. In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting`maxmiss = 1`

in`extract_f2`

) to get an overview, and then spot-check individual results using more conservative options.

`f4`

Patterson, N. et al. (2012) *Ancient admixture in human history* Genetics

Peter, B. (2016) *Admixture, Population Structure, and F-Statistics* Genetics

```
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
#> # A tibble: 6 × 8
#> pop1 pop2 pop3 pop4 est se z p
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Denisova.DG Altai_Neanderthal.DG Chim… Swit… 1.50e-2 4.64e-4 32.3 6.06e-229
#> 2 Denisova.DG Altai_Neanderthal.DG Mbut… Swit… 2.03e-3 3.53e-4 5.75 8.85e- 9
#> 3 Denisova.DG Altai_Neanderthal.DG Russ… Swit… -2.17e-4 3.73e-4 -0.580 5.62e- 1
#> 4 Denisova.DG Vindija.DG Chim… Swit… 1.54e-2 4.78e-4 32.2 5.81e-228
#> 5 Denisova.DG Vindija.DG Mbut… Swit… 2.33e-3 3.63e-4 6.42 1.40e- 10
#> 6 Denisova.DG Vindija.DG Russ… Swit… -2.40e-4 3.87e-4 -0.620 5.35e- 1
if (FALSE) {
qpdstat(f2_dir, pop1, pop2, pop3, pop4)
}
if (FALSE) {
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)
}
if (FALSE) {
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)
}
```