This function takes a allele frequency data and computes blocked f2 statistics for all population pairs, which are written to outdir. \(f2\) for each SNP is computed as \((p1 - p2)^2 - p1 (1 - p1)/(n1 - 1) - p2 (1 - p2)/(n2 - 1)\), where \(p1\) and \(p2\) are allele frequencies in populations \(1\) and \(2\), and \(n1\) and \(n2\) is the number of non-missing haplotypes in populations \(1\) and \(2\). See details

afs_to_f2_blocks(
  afdat,
  maxmem = 8000,
  blgsize = 0.05,
  pops1 = NULL,
  pops2 = NULL,
  outpop = NULL,
  outdir = NULL,
  overwrite = FALSE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = apply_corr,
  n_cores = 1,
  verbose = TRUE
)

Arguments

afdat

A list with three items with the same SNP in each row (generated by packedancestrymap_to_afs or plink_to_afs)

  • afs A matrix of allele frequencies for all populations (columns) and SNPs (rows)

  • counts A matrix of allele counts for all populations (columns) and SNPs (rows)

  • snpdat A data frame with SNP metadata

maxmem

split up allele frequency data into blocks, if memory requirements exceed maxmem MB.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

outpop

If specified, f2-statistics will be weighted by heterozygosity in this population

outdir

Directory into which to write f2 data (if NULL, data is returned instead)

overwrite

Should existing files be overwritten? Only relevant if outdir is not NULL

poly_only

Exclude sites with identical allele frequencies in all populations. Can be different for f2-statistics, allele frequency products, and fst. Should be a character vector of length three, with some subset of c("f2", "ap", "fst")

verbose

Print progress updates

pop1

pops1 and pops2 can be specified if only a subset of pairs should be computed.

pop2

pops1 and pops2 can be specified if only a subset of pairs should be computed.

Details

For each population pair, each of the \(i = 1, \ldots, n\) resutling values (\(n\) is around 700 in practice) is the mean \(f2\) estimate across all SNPs except the ones in block \(i\).

\(- p1 (1 - p1)/(2 n1 - 1) - p2 (1 - p2)/(2 n2 - 1)\) is a correction term which makes the estimates unbiased at low sample sizes.

Examples

if (FALSE) {
afdat = plink_to_afs('/my/geno/prefix')
afs_to_f2_blocks(afdat, outdir = '/my/f2/data/')
}