Estimate f4 statistics — qpdstat • admixtools

Computes f4-statistics of the form \(f4(A, B; C, D)\). For allele frequencies a, b, c, d, f4(A, B; C, D) is computed as the average of (a-b)*(c-d) over all SNPs in each SNP block. This is equivalent to \((f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2\) (assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by f2_from_precomp or f2_from_geno, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).

qpdstat(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  comb = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  f4mode = TRUE,
  afprod = TRUE,
  cpp = TRUE,
  verbose = is.character(data),
  ...
)

Arguments

data

Input data in one of three forms:

A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno (fastest option)
A directory which contains pre-computed f2-statistics
The prefix of genotype files (slowest option)

pop1

One of the following four:

NULL: all possible population combinations will be returned
A vector of population labels. All combinations with the other pop arguments will be returned
A matrix with population combinations to be tested, with one population per column and one combination per row. Other pop arguments will be ignored.
the location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. Other pop arguments will be ignored.

pop2

A vector of population labels

pop3

A vector of population labels

pop4

A vector of population labels

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

sure

The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.

unique_only

If TRUE (the default), redundant combinations will be excluded

comb

Generate all combinations of pop1, pop2, pop3, pop4. If FALSE, pop1, pop2, pop3, pop4 should all be vectors of the same length.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). Only used when data is the prefix of genotype files

block_lengths

Vector with lengths of each jackknife block. sum(block_lengths) has to match the number of SNPs. only used when data is the prefix of genotype files

f4mode

Set this to FALSE to compute D-statistics instead of f4. This only has an effect if the first argument is a genotype prefix. D-statistics are computed as (a-b)*(c-d) / ((a + b - 2*a*b) * (c + d - 2*c*d)), which is the same as (P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA))

afprod

Compute f4 from allele frequency products instead of f2 (default TRUE). Only used if data is a directory with precomputed data. For populations with lots of missing data, this option reduces bias that can result from setting maxmiss to values greater than 0. In all other cases it should not make a difference.

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

verbose

Print progress updates

...

Additional arguments passed to f4blockdat_from_geno if data is a genotype file prefix or f2_from_precomp if data is a directory with f2-statistics

Value

qpdstat returns a data frame with f4 statistics, with columns for populations 1 to 4, f4-estimate (est), standard error of the estimate (se), z-score (z, est/se), p-value (2*(1-pnorm(z))), and the number of SNPs used (n; only if first argument is genotype prefix)

Details

f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.

Use the same SNPs for every f4-statistic
This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (maxmiss = 0 in extract_f2 or f2_from_geno).
Use different SNPs for each f4-statistic
This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set allsnps = FALSE. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the allsnps option, whereas the default mode in qpAdm and qpGraph is equivalent to maxmiss = 0)
Use different SNPs for each f2-statistic
This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the maxmiss parameter to a value greater than 0 (and not larger than 1) in extract_f2 or f2_from_geno. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option afprod = TRUE in f2_from_precomp. In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting maxmiss = 1 in extract_f2) to get an overview, and then spot-check individual results using more conservative options.

Alias

f4

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
#> # A tibble: 6 × 8
#>   pop1        pop2                 pop3  pop4       est      se      z         p
#>   <chr>       <chr>                <chr> <chr>    <dbl>   <dbl>  <dbl>     <dbl>
#> 1 Denisova.DG Altai_Neanderthal.DG Chim… Swit…  1.50e-2 4.64e-4 32.3   6.06e-229
#> 2 Denisova.DG Altai_Neanderthal.DG Mbut… Swit…  2.03e-3 3.53e-4  5.75  8.85e-  9
#> 3 Denisova.DG Altai_Neanderthal.DG Russ… Swit… -2.17e-4 3.73e-4 -0.580 5.62e-  1
#> 4 Denisova.DG Vindija.DG           Chim… Swit…  1.54e-2 4.78e-4 32.2   5.81e-228
#> 5 Denisova.DG Vindija.DG           Mbut… Swit…  2.33e-3 3.63e-4  6.42  1.40e- 10
#> 6 Denisova.DG Vindija.DG           Russ… Swit… -2.40e-4 3.87e-4 -0.620 5.35e-  1
if (FALSE) {
qpdstat(f2_dir, pop1, pop2, pop3, pop4)
}
if (FALSE) {
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)
}
if (FALSE) {
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)
}