Prepare data for various ADMIXTOOLS 2 functions. This function reads genotype files and extracts the data required to compute blocked f-statistics for any set of samples. The output consists of .rds files with total and alternative allele counts for each individual, and products of total and alternative allele counts for each pair. The function calls packedancestrymap_to_afs or plink_to_afs and afs_to_f2_blocks.

extract_counts(
  pref,
  outdir,
  inds = NULL,
  blgsize = 0.05,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  maxmem = 8000,
  overwrite = FALSE,
  format = NULL,
  cols_per_chunk = NULL,
  verbose = TRUE
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP files have to end in .geno, .snp, .ind; PLINK files have to end in .bed, .bim, .fam.

outdir

Directory where data will be stored.

inds

Individuals for which data should be read. Defaults to all individuals

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, it will be interpreted as a base pair distance rather than a genetic distance.
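
The unit rule for blgsize can be illustrated with a small helper. This is only a sketch of the rule as stated in the description, not the package's internal code; describe_blgsize is a hypothetical name introduced for illustration.

```r
# Hypothetical helper (not part of ADMIXTOOLS 2): report how a blgsize
# value would be interpreted under the documented rule.
describe_blgsize <- function(blgsize) {
  stopifnot(is.numeric(blgsize), length(blgsize) == 1, blgsize > 0)
  if (blgsize >= 100) {
    # Values of 100 or greater are treated as physical distance in base pairs
    list(unit = "bp", size = blgsize)
  } else {
    # Smaller values are treated as genetic distance in Morgan
    list(unit = "Morgan", size = blgsize, centimorgan = blgsize * 100)
  }
}

describe_blgsize(0.05)  # genetic distance: 0.05 Morgan, i.e. 5 cM
describe_blgsize(2e6)   # physical distance: 2,000,000 bp
```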

maxmiss

Discard SNPs which are missing in a fraction of individuals greater than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than maxmaf
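
To make the maxmiss, minmaf and maxmaf thresholds concrete, the following sketch applies them to a toy genotype matrix. It assumes diploid genotypes coded as alternative allele counts and computes frequencies across all individuals; filter_snps and the toy data are hypothetical, and the package's own filtering may differ in detail.

```r
# Toy genotype matrix: rows are SNPs, columns are individuals.
# Entries are alternative allele counts (0, 1, 2); NA marks a missing genotype.
geno <- rbind(
  snp1 = c(0, 1, 2, 1),
  snp2 = c(0, 0, 0, 0),
  snp3 = c(NA, NA, 1, 2),
  snp4 = c(2, 2, 1, NA)
)

# Hypothetical helper (not part of ADMIXTOOLS 2): names of SNPs passing the filters
filter_snps <- function(geno, maxmiss = 0, minmaf = 0, maxmaf = 0.5) {
  miss <- rowMeans(is.na(geno))             # fraction of individuals missing
  af <- rowMeans(geno, na.rm = TRUE) / 2    # alternative allele frequency
  maf <- pmin(af, 1 - af)                   # minor allele frequency
  keep <- miss <= maxmiss & maf >= minmaf & maf <= maxmaf
  rownames(geno)[keep]
}

filter_snps(geno)                                 # defaults: no missing data allowed
filter_snps(geno, maxmiss = 0.25, minmaf = 0.05)  # tolerate 25% missing, drop rare SNPs
```

With the defaults only SNPs without any missing genotype pass; raising maxmiss admits snp4, while minmaf = 0.05 removes the monomorphic snp2.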

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs
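
For reference, transitions are substitutions within the purines (A/G) or within the pyrimidines (C/T); all other substitutions are transversions. A minimal sketch of this classification, using a hypothetical helper snp_type that is not part of the package:

```r
# Hypothetical helper: classify biallelic SNPs from their two alleles.
# Returns "transition", "transversion", or NA for anything else.
snp_type <- function(a1, a2) {
  a1 <- toupper(a1)
  a2 <- toupper(a2)
  purines <- c("A", "G")
  bases <- c("A", "C", "G", "T")
  valid <- a1 %in% bases & a2 %in% bases & a1 != a2
  # Both alleles purines, or both pyrimidines, means a transition
  same_class <- (a1 %in% purines) == (a2 %in% purines)
  ifelse(!valid, NA_character_,
         ifelse(same_class, "transition", "transversion"))
}

type <- snp_type(c("A", "C", "A", "G"), c("G", "T", "C", "T"))
type

# Mimic transitions = FALSE, transversions = TRUE
keep <- type == "transversion"
```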

auto_only

Keep only SNPs on chromosomes 1 to 22

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

maxmem

Maximum amount of memory to be used. If the required amount of memory exceeds maxmem, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.

overwrite

Overwrite existing files in outdir

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat or packedancestrymap to read .geno, .snp, .ind files.

cols_per_chunk

Number of genotype chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for cols_per_chunk in extract_afs is 10. Lower numbers will lower the memory requirement but increase the time it takes.

verbose

Print progress updates
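
Examples

A typical call might look like the sketch below. The paths are placeholders, and the sketch assumes the function is available from an installed package named admixtools; the call is guarded so that nothing runs unless both the package and the input files are present.

```r
# Placeholder paths: replace with your own genotype prefix and output directory
pref <- "/path/to/genotypes"   # expects genotypes.geno, genotypes.snp, genotypes.ind
outdir <- "/path/to/counts"

have_pkg <- requireNamespace("admixtools", quietly = TRUE)
have_data <- all(file.exists(paste0(pref, c(".geno", ".snp", ".ind"))))

if (have_pkg && have_data) {
  admixtools::extract_counts(
    pref, outdir,
    blgsize = 0.05,       # 5 cM blocks
    maxmiss = 0.1,        # drop SNPs missing in more than 10% of individuals
    transitions = FALSE,  # keep transversions only
    overwrite = FALSE
  )
}
```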