Extract and store data needed to compute blocked f2

Prepare data for various ADMIXTOOLS 2 functions. This function reads genotype files and extracts the data required to compute blocked f-statistics for any set of samples. The extracted data are stored as .rds files with total and alternative allele counts for each individual, and products of total and alternative allele counts for each pair of individuals. The function calls packedancestrymap_to_afs or plink_to_afs and afs_to_f2_blocks.

extract_counts(
  pref,
  outdir,
  inds = NULL,
  blgsize = 0.05,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  maxmem = 8000,
  overwrite = FALSE,
  format = NULL,
  cols_per_chunk = NULL,
  verbose = TRUE
)
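
A minimal sketch of a call, assuming the admixtools package is installed. The paths in pref and outdir are hypothetical placeholders, and the non-default values are chosen only for illustration. The call is skipped if the package or the input files are missing.

```r
# Hypothetical paths: replace with your own genotype prefix and output directory.
pref   <- "data/my_genotypes"   # expects .geno/.snp/.ind or .bed/.bim/.fam files
outdir <- "data/my_counts"

# Argument names are taken from the usage above; values are illustrative.
count_args <- list(
  pref = pref,
  outdir = outdir,
  blgsize = 0.05,       # 5 cM blocks (the default)
  maxmiss = 0.1,        # drop SNPs missing in more than 10% of individuals
  transitions = FALSE,  # exclude transition SNPs
  overwrite = FALSE     # do not overwrite existing files in outdir
)

# Run only if admixtools is installed and input files with this prefix exist.
have_pkg   <- requireNamespace("admixtools", quietly = TRUE)
have_files <- file.exists(paste0(pref, ".geno")) || file.exists(paste0(pref, ".bed"))
if (have_pkg && have_files) {
  do.call(admixtools::extract_counts, count_args)
} else {
  message("Skipping: admixtools or the input files are not available.")
}
```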

Arguments

pref: Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP files have to end in .geno, .snp, .ind; PLINK files have to end in .bed, .bim, .fam
outdir: Directory where data will be stored.
inds: Individuals for which data should be read. Defaults to all individuals
blgsize: SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, it will be interpreted as base pair distance rather than genetic distance.
maxmiss: Discard SNPs which are missing in a fraction of individuals greater than maxmiss
minmaf: Discard SNPs with minor allele frequency less than minmaf
maxmaf: Discard SNPs with minor allele frequency greater than maxmaf
transitions: Set this to FALSE to exclude transition SNPs
transversions: Set this to FALSE to exclude transversion SNPs
auto_only: Keep only SNPs on chromosomes 1 to 22
keepsnps: SNP IDs of SNPs to keep. Overrides other SNP filtering options
maxmem: Maximum amount of memory to be used. If the required amount of memory exceeds maxmem, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This is not a precise cap on the amount of memory used (although it used to be at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have memory to spare.
overwrite: Overwrite existing files in outdir
format: Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat or packedancestrymap to read .geno, .snp, .ind files.
cols_per_chunk: Number of genotype chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for cols_per_chunk in extract_afs is 10. Lower numbers will lower the memory requirement but increase the time it takes.
verbose: Print progress updates
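
The SNP filters maxmiss, minmaf and maxmaf described above can be illustrated on a toy genotype matrix. This is only a sketch of the documented rules in base R, not the package's implementation: filter_snps is a hypothetical helper, and the example assumes diploid genotypes coded as alternative allele counts with NA for missing data.

```r
# Toy genotype matrix: rows are SNPs, columns are individuals.
# Entries are alternative allele counts (0, 1, 2); NA marks a missing genotype.
geno <- rbind(
  snp1 = c(0, 1, 2, 1),
  snp2 = c(0, 0, 0, NA),
  snp3 = c(NA, NA, 1, 2),
  snp4 = c(0, 0, 0, 1)
)

# Hypothetical helper, not part of ADMIXTOOLS 2.
# Returns TRUE for SNPs that pass all three filters.
filter_snps <- function(geno, maxmiss = 0, minmaf = 0, maxmaf = 0.5) {
  miss <- rowMeans(is.na(geno))             # fraction of individuals with missing data
  alt  <- rowMeans(geno, na.rm = TRUE) / 2  # alternative allele frequency (diploid assumption)
  maf  <- pmin(alt, 1 - alt)                # minor allele frequency
  keep <- miss <= maxmiss & maf >= minmaf & maf <= maxmaf
  keep[is.na(keep)] <- FALSE                # SNPs without any data are dropped
  keep
}

# Allow up to 25% missing individuals and require a minor allele frequency of at least 0.2.
keep <- filter_snps(geno, maxmiss = 0.25, minmaf = 0.2)
print(keep)
```

With these toy data only snp1 passes: snp2 and snp4 fall below minmaf, and snp3 is missing in half of the individuals.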