Computes f4-statistics of the form \(f4(A, B; C, D)\). For allele frequencies a
, b
, c
, d
, f4(A, B; C, D)
is computed as the average of (a-b)*(c-d)
over all SNPs in each SNP block. This is equivalent to \((f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2\) (assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by f2_from_precomp
or f2_from_geno
, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).
qpdstat(
data,
pop1 = NULL,
pop2 = NULL,
pop3 = NULL,
pop4 = NULL,
boot = FALSE,
sure = FALSE,
unique_only = TRUE,
comb = TRUE,
blgsize = NULL,
block_lengths = NULL,
f4mode = TRUE,
afprod = TRUE,
cpp = TRUE,
verbose = is.character(data),
...
)
Input data in one of three forms:
A 3d array of blocked f2 statistics, output of f2_from_precomp
or f2_from_geno
(fastest option)
A directory which contains pre-computed f2-statistics
The prefix of genotype files (slowest option)
One of the following four:
NULL
: all possible population combinations will be returned
A vector of population labels. All combinations with the other pop
arguments will be returned
A matrix with population combinations to be tested, with one population per column and one
combination per row. Other pop
arguments will be ignored.
the location of a file (poplistname
or popfilename
) which specifies the populations or
population combinations to be tested. Other pop
arguments will be ignored.
A vector of population labels
A vector of population labels
A vector of population labels
If FALSE
(the default), block-jackknife resampling will be used to compute standard errors.
Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot
is an integer, that number
will specify the number of bootstrap resamplings. If boot = TRUE
, the number of bootstrap resamplings will be
equal to the number of SNP blocks.
The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.
If TRUE
(the default), redundant combinations will be excluded
Generate all combinations of pop1
, pop2
, pop3
, pop4
. If FALSE
, pop1
, pop2
, pop3
, pop4
should all be vectors of the same length.
SNP block size in Morgan. Default is 0.05 (5 cM). Only used when data
is the prefix of genotype files
Vector with lengths of each jackknife block. sum(block_lengths)
has to
match the number of SNPs. only used when data
is the prefix of genotype files
Set this to FALSE
to compute D-statistics instead of f4. This only has an effect if the first argument is a genotype prefix. D-statistics are computed as (a-b)*(c-d) / ((a + b - 2*a*b) * (c + d - 2*c*d))
, which is the same as (P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA))
Compute f4 from allele frequency products instead of f2 (default TRUE
). Only used if data
is a directory with precomputed data. For populations with lots of missing data, this option reduces bias that can result from setting maxmiss
to values greater than 0. In all other cases it should not make a difference.
Use C++ functions. Setting this to FALSE
will be slower but can help with debugging.
Print progress updates
Additional arguments passed to f4blockdat_from_geno
if data
is a genotype file prefix or f2_from_precomp
if data
is a directory with f2-statistics
qpdstat
returns a data frame with f4 statistics, with columns for populations 1 to 4,
f4-estimate (est
), standard error of the estimate (se
), z-score (z
, est
/se
), p-value (2*(1-pnorm(z))
),
and the number of SNPs used (n
; only if first argument is genotype prefix)
f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.
Use the same SNPs for every f4-statistic
This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (maxmiss = 0
in extract_f2
or f2_from_geno
).
Use different SNPs for each f4-statistic
This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set allsnps = FALSE
. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the allsnps
option, whereas the default mode in qpAdm and qpGraph is equivalent to maxmiss = 0
)
Use different SNPs for each f2-statistic
This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the maxmiss
parameter to a value greater than 0 (and not larger than 1) in extract_f2
or f2_from_geno
. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option afprod = TRUE
in f2_from_precomp
.
In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting maxmiss = 1
in extract_f2
) to get an overview, and then spot-check individual results using more conservative options.
f4
Patterson, N. et al. (2012) Ancient admixture in human history Genetics
Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
#> # A tibble: 6 × 8
#> pop1 pop2 pop3 pop4 est se z p
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Denisova.DG Altai_Neanderthal.DG Chim… Swit… 1.50e-2 4.64e-4 32.3 6.06e-229
#> 2 Denisova.DG Altai_Neanderthal.DG Mbut… Swit… 2.03e-3 3.53e-4 5.75 8.85e- 9
#> 3 Denisova.DG Altai_Neanderthal.DG Russ… Swit… -2.17e-4 3.73e-4 -0.580 5.62e- 1
#> 4 Denisova.DG Vindija.DG Chim… Swit… 1.54e-2 4.78e-4 32.2 5.81e-228
#> 5 Denisova.DG Vindija.DG Mbut… Swit… 2.33e-3 3.63e-4 6.42 1.40e- 10
#> 6 Denisova.DG Vindija.DG Russ… Swit… -2.40e-4 3.87e-4 -0.620 5.35e- 1
if (FALSE) {
qpdstat(f2_dir, pop1, pop2, pop3, pop4)
}
if (FALSE) {
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)
}
if (FALSE) {
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)
}