Compute and store blocked f2 statistics

This function prepares data for various other ADMIXTOOLS 2 functions. It reads data from genotype files, computes allele frequencies and blocked f2-statistics for selected populations, and writes the results to outdir.

extract_f2(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  cols_per_chunk = NULL,
  fst = TRUE,
  afprod = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  qpfstats = FALSE,
  n_cores = 1,
  verbose = TRUE,
  ...
)

Arguments

pref: Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP files have to end in .geno, .snp, .ind; PLINK files have to end in .bed, .bim, .fam.
outdir: Directory where data will be stored.
inds: Individuals for which data should be extracted
pops: Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each individual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.
blgsize: SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, it will be interpreted as a base pair distance rather than as a genetic distance.
maxmem: Maximum amount of memory to be used. If the required amount of memory exceeds maxmem, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This does not put a precise cap on the amount of memory used (although it did at some point). Set this parameter to a lower value if you run out of memory while running this function. Set it to a higher value if this function is too slow and you have plenty of memory.
maxmiss: Discard SNPs which are missing in a fraction of populations higher than maxmiss
minmaf: Discard SNPs with minor allele frequency less than minmaf
maxmaf: Discard SNPs with minor allele frequency greater than maxmaf
minac2: Discard SNPs with allele count lower than 2 in any population (default FALSE). This option should be set to TRUE when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. minac2 == 2 will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the minac2 option discards SNPs with allele count lower than 2 in any population, the qp3pop function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).
pops2: If specified, only pairs between pops and pops2 will be computed
outpop: Keep only SNPs which are heterozygous in this population
outpop_scale: Scale f2-statistics by the inverse outpop heterozygosity (1/(p*(1-p))). Providing outpop and setting outpop_scale to TRUE will give the same results as the original qpGraph when the outpop parameter has been set, but it has the disadvantage of treating one population differently from the others. This may limit the use of these f2-statistics for other models.
transitions: Set this to FALSE to exclude transition SNPs
transversions: Set this to FALSE to exclude transversion SNPs
auto_only: Keep only SNPs on chromosomes 1 to 22
keepsnps: SNP IDs of SNPs to keep. Overrides other SNP filtering options
overwrite: Overwrite existing files in outdir
format: Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat or packedancestrymap to read .geno, .snp, .ind files.
adjust_pseudohaploid: Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.
cols_per_chunk: Number of allele frequency chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for cols_per_chunk in extract_afs is 10. Lower numbers will lower the memory requirement but increase the time it takes.
fst: Write files with pairwise FST for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.
afprod: Write files with allele frequency products for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.
poly_only: Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
apply_corr: Apply small-sample-size correction when computing f2-statistics (default TRUE)
qpfstats: Compute smoothed f2-statistics (default FALSE). In the presence of large amounts of missing data, this option can be used to retain information from all SNPs while introducing less bias than setting maxmiss to values greater than 0. When setting qpfstats = TRUE, most other options to extract_f2 will be ignored. See qpfstats for more information. Arguments to qpfstats can be passed via ...
n_cores: Parallelize computation across n_cores cores via the doParallel package.
verbose: Print progress updates
...: Pass arguments to qpfstats
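
The adjust_pseudohaploid description can be illustrated with a small base R sketch. This is not the package's implementation: the toy genotype matrix and the helpers detect_pseudohaploid and allele_freq are hypothetical and only mirror the rule stated above (a sample with no genotype coded as 1 among the first n SNPs is treated as pseudohaploid and contributes one allele per call instead of two).

```r
# Toy genotype matrix (hypothetical data): rows are SNPs, columns are samples,
# values are alternative-allele counts (0, 1, 2) or NA for missing calls.
geno <- cbind(
  dip1 = c(0, 1, 2, 1, NA, 2),
  ph1  = c(0, 2, 2, 0, 2, NA),
  ph2  = c(2, 2, 0, NA, 0, 0)
)

# Hypothetical helper: flag samples without any heterozygous call (coded 1)
# among the first n SNPs.
detect_pseudohaploid <- function(geno, n = 1000) {
  first <- geno[seq_len(min(n, nrow(geno))), , drop = FALSE]
  colSums(first == 1, na.rm = TRUE) == 0
}

# Hypothetical helper: per-SNP alternative allele frequency across the samples,
# counting one allele per pseudohaploid call and two per diploid call.
allele_freq <- function(geno, is_ph) {
  ploidy <- ifelse(is_ph, 1, 2)                  # alleles observed per sample
  alt <- sweep(geno, 2, ploidy / 2, `*`)         # 0/2 becomes 0/1 for pseudohaploids
  called <- sweep(!is.na(geno), 2, ploidy, `*`)  # alleles observed per non-missing call
  rowSums(alt, na.rm = TRUE) / rowSums(called)
}

is_ph <- detect_pseudohaploid(geno)              # dip1 is diploid, ph1 and ph2 are not
af_adjusted <- allele_freq(geno, is_ph)          # analogous to adjust_pseudohaploid = TRUE
af_diploid <- allele_freq(geno, rep(FALSE, ncol(geno)))  # all samples treated as diploid
```

In this toy population the two frequency vectors differ because diploid and pseudohaploid samples are mixed; the adjusted version also yields the smaller allele counts that a small-sample-size correction would operate on.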

Value

SNP metadata (invisibly)

Examples

if (FALSE) {
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2(pref, f2dir, pops = c('popA', 'popB', 'popC'))
}
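
The following sketch shows the three ways of selecting samples described under pops. The file prefix, output directories, and the individual and population labels are placeholders, and the package is assumed to be installed under the name admixtools; like the example above, the code is wrapped in if (FALSE) because it needs real genotype files.

```r
if (FALSE) {
library(admixtools)
pref = 'my/genofiles/prefix'

# Only pops: all individuals in the .ind or .fam file that belong to these populations
extract_f2(pref, 'my/f2dir_pops/', pops = c('popA', 'popB', 'popC'))

# Only inds: each individual is assigned to its own population of the same name
extract_f2(pref, 'my/f2dir_inds/', inds = c('ind1', 'ind2', 'ind3'))

# Both: inds and pops have the same length and are matched by position,
# so ind1 and ind2 form groupX and ind3 forms groupY
extract_f2(pref, 'my/f2dir_custom/',
           inds = c('ind1', 'ind2', 'ind3'),
           pops = c('groupX', 'groupX', 'groupY'))
}
```

Each call writes to a separate directory, which avoids having to set overwrite = TRUE.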
