This function takes a allele frequency data and computes blocked f2 statistics for all population pairs,
which are written to outdir
. \(f2\) for each SNP is computed as
\((p1 - p2)^2 - p1 (1 - p1)/(n1 - 1) - p2 (1 - p2)/(n2 - 1)\), where \(p1\) and \(p2\) are
allele frequencies in populations \(1\) and \(2\), and \(n1\) and \(n2\) is the number of
non-missing haplotypes in populations \(1\) and \(2\). See details
afs_to_f2_blocks(
afdat,
maxmem = 8000,
blgsize = 0.05,
pops1 = NULL,
pops2 = NULL,
outpop = NULL,
outdir = NULL,
overwrite = FALSE,
afprod = TRUE,
fst = TRUE,
poly_only = c("f2"),
apply_corr = apply_corr,
n_cores = 1,
verbose = TRUE
)
A list with three items with the same SNP in each row (generated by packedancestrymap_to_afs
or plink_to_afs
)
afs
A matrix of allele frequencies for all populations (columns) and SNPs (rows)
counts
A matrix of allele counts for all populations (columns) and SNPs (rows)
snpdat
A data frame with SNP metadata
split up allele frequency data into blocks, if memory requirements exceed maxmem
MB.
SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize
is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
If specified, f2-statistics will be weighted by heterozygosity in this population
Directory into which to write f2 data (if NULL
, data is returned instead)
Should existing files be overwritten? Only relevant if outdir
is not NULL
Exclude sites with identical allele frequencies in all populations. Can be different for f2-statistics, allele frequency products, and fst. Should be a character vector of length three, with some subset of c("f2", "ap", "fst")
Print progress updates
pops1
and pops2
can be specified if only a subset of pairs should be computed.
pops1
and pops2
can be specified if only a subset of pairs should be computed.
For each population pair, each of the \(i = 1, \ldots, n\) resutling values (\(n\) is around 700 in practice) is the mean \(f2\) estimate across all SNPs except the ones in block \(i\).
\(- p1 (1 - p1)/(2 n1 - 1) - p2 (1 - p2)/(2 n2 - 1)\) is a correction term which makes the estimates unbiased at low sample sizes.
if (FALSE) {
afdat = plink_to_afs('/my/geno/prefix')
afs_to_f2_blocks(afdat, outdir = '/my/f2/data/')
}