Prepare data for various ADMIXTOOLS 2 functions. Reads genotype data from EIGENSTRAT, PACKEDANCESTRYMAP, or PLINK files,
computes allele frequencies for the selected populations, and stores them as .rds
files in outdir.
extract_afs(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  cols_per_chunk = 10,
  numparts = 100,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  auto_only = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)
pref: Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP files have to end in .geno, .snp, .ind; PLINK files have to end in .bed, .bim, .fam.
outdir: Directory where data will be stored.
inds: Individuals for which data should be extracted.
pops: Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each individual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.
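The four combinations of inds and pops described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's implementation: `resolve_samples` and `ind_table` are hypothetical names, and `ind_table` stands in for the sample and population columns of an .ind or .fam file.

```r
# Hypothetical helper, not part of ADMIXTOOLS 2: decides which individuals
# are extracted and which population label each one receives.
resolve_samples <- function(ind_table, inds = NULL, pops = NULL) {
  if (!is.null(inds) && !is.null(pops)) {
    # Both given: matched by position, so the lengths must agree.
    stopifnot(length(inds) == length(pops))
    return(data.frame(ind = inds, pop = pops, stringsAsFactors = FALSE))
  }
  if (!is.null(pops)) {
    # Only pops: every individual in the file that belongs to those populations.
    return(ind_table[ind_table$pop %in% pops, c("ind", "pop")])
  }
  if (!is.null(inds)) {
    # Only inds: each individual becomes its own population of the same name.
    return(data.frame(ind = inds, pop = inds, stringsAsFactors = FALSE))
  }
  # Neither given: all individuals and populations in the file.
  ind_table[, c("ind", "pop")]
}

# Toy stand-in for the contents of an .ind or .fam file.
ind_table <- data.frame(
  ind = c("I1", "I2", "I3", "I4"),
  pop = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

resolve_samples(ind_table, pops = c("A", "C"))
resolve_samples(ind_table, inds = c("I2", "I3"))
resolve_samples(ind_table, inds = c("I1", "I3"), pops = c("X", "X"))
```

The last call shows how supplying both arguments can regroup individuals under a population label that does not appear in the file.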
cols_per_chunk: Number of populations per chunk. Lowering this number will lower the memory requirements when running afs_to_f2, but more chunk pairs will have to be computed.
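To get a rough feel for this trade-off, the number of chunk pairs can be estimated from the number of populations. The sketch below assumes that populations are split into consecutive chunks of at most cols_per_chunk and that every unordered pair of chunks, including a chunk paired with itself, is processed once; the actual scheduling in afs_to_f2 may differ. `count_chunk_pairs` is a hypothetical helper.

```r
# Hypothetical helper: estimates how many chunks and chunk pairs result from
# a given cols_per_chunk, under the assumptions stated above.
count_chunk_pairs <- function(npops, cols_per_chunk = 10) {
  nchunks <- ceiling(npops / cols_per_chunk)
  c(chunks = nchunks, chunk_pairs = nchunks * (nchunks + 1) / 2)
}

count_chunk_pairs(50, cols_per_chunk = 10)
count_chunk_pairs(50, cols_per_chunk = 5)
```

Under these assumptions, halving cols_per_chunk roughly quadruples the number of chunk pairs for large numbers of populations, while each pair holds fewer populations in memory.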
numparts: Number of parts in which genotype data will be read for computing allele frequencies.
maxmiss: Discard SNPs which are missing in a fraction of populations higher than maxmiss.
minmaf: Discard SNPs with minor allele frequency less than minmaf.
maxmaf: Discard SNPs with minor allele frequency greater than maxmaf.
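The three thresholds above can be illustrated on a toy allele frequency matrix. This is a minimal sketch, not the package's code: `filter_snps` is a hypothetical helper, and it assumes that the minor allele frequency is derived from the unweighted mean frequency across populations, which may not match how extract_afs computes it.

```r
# Toy allele frequency matrix: rows are SNPs, columns are populations.
afs <- rbind(
  snp1 = c(A = 0.10, B = 0.20, C = 0.30),
  snp2 = c(A = NA,   B = NA,   C = 0.50),
  snp3 = c(A = 0.00, B = 0.02, C = 0.01),
  snp4 = c(A = 0.90, B = 0.80, C = NA)
)

# Hypothetical helper applying maxmiss, minmaf and maxmaf as described above.
filter_snps <- function(afs, maxmiss = 0, minmaf = 0, maxmaf = 0.5) {
  # Fraction of populations in which each SNP is missing.
  missfrac <- rowMeans(is.na(afs))
  # Assumption: MAF is based on the unweighted mean frequency across populations.
  meanaf <- rowMeans(afs, na.rm = TRUE)
  maf <- pmin(meanaf, 1 - meanaf)
  keep <- missfrac <= maxmiss & maf >= minmaf & maf <= maxmaf
  afs[which(keep), , drop = FALSE]
}

# Defaults: only SNPs without any missing population are kept.
rownames(filter_snps(afs))
# Tolerate up to 50% missing populations, but require MAF of at least 0.05.
rownames(filter_snps(afs, maxmiss = 0.5, minmaf = 0.05))
```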
minac2: Discard SNPs with allele count lower than 2 in any population (default FALSE). This option should be set to TRUE when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. minac2 == 2 will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the minac2 option discards SNPs with allele count lower than 2 in any population, the qp3pop function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).
outpop: Keep only SNPs which are heterozygous in this population.
auto_only: Keep only SNPs on chromosomes 1 to 22.
transitions: Set this to FALSE to exclude transition SNPs.
transversions: Set this to FALSE to exclude transversion SNPs.
keepsnps: SNP IDs of SNPs to keep. Overrides other SNP filtering options.
format: Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat or packedancestrymap to read .geno, .snp, .ind files.
poly_only: Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
adjust_pseudohaploid: Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.
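The detection rule and its effect on allele counts can be sketched in base R on a toy genotype matrix. This is an illustration of the description above under simplifying assumptions, not the package's implementation; `detect_pseudohaploid` and `allele_counts` are hypothetical helpers, and all samples are treated as belonging to a single population.

```r
# Toy genotype matrix: rows are SNPs, columns are samples; values are 0/1/2 or NA.
geno <- cbind(
  diploid1 = c(0, 1, 2, 1, NA),
  pseudo1  = c(0, 2, 2, NA, 0),
  pseudo2  = c(2, 2, 0, 0, 2)
)

# Hypothetical helper: a sample is flagged as pseudohaploid if none of its
# first n genotypes is coded as 1 (heterozygous).
detect_pseudohaploid <- function(geno, n = 1000) {
  first <- geno[seq_len(min(n, nrow(geno))), , drop = FALSE]
  colSums(first == 1, na.rm = TRUE) == 0
}

# Hypothetical helper: per-SNP alternative allele count and total number of
# observed alleles across all samples.
allele_counts <- function(geno, adjust_pseudohaploid = TRUE) {
  ploidy <- rep(2, ncol(geno))
  if (!isFALSE(adjust_pseudohaploid)) {
    # TRUE checks the first 1000 SNPs; an integer n checks the first n SNPs.
    n <- if (isTRUE(adjust_pseudohaploid)) 1000 else adjust_pseudohaploid
    ploidy[detect_pseudohaploid(geno, n)] <- 1
  }
  # A pseudohaploid genotype of 2 contributes one alternative allele, not two.
  alt <- sweep(geno, 2, ploidy / 2, `*`)
  # Each non-missing genotype contributes as many alleles as the sample's ploidy.
  observed <- sweep(!is.na(geno), 2, ploidy, `*`)
  data.frame(alt = rowSums(alt, na.rm = TRUE), total = rowSums(observed))
}

detect_pseudohaploid(geno)
allele_counts(geno, adjust_pseudohaploid = TRUE)
allele_counts(geno, adjust_pseudohaploid = FALSE)
```

With the adjustment, each pseudohaploid sample adds at most one allele to the total per SNP; without it, every non-missing genotype counts as two alleles.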
verbose: Print progress updates.
Returns SNP metadata (invisibly).
if (FALSE) {
  pref = 'my/genofiles/prefix'
  outdir = 'dir/for/afdata/'
  extract_afs(pref, outdir)
}