Estimate admixture weights

qpadm models a target population as a mixture of left (source) populations, given a set of right (outgroup) populations. It can be used to test whether the left populations explain all genetic variation in the target population, relative to the right populations, and to estimate the admixture proportions that the left populations contribute to the target population.
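
The mixture idea above can be illustrated with a small base R sketch. It is not qpadm's actual fitting procedure: it assumes, for illustration only, that the target's f4-statistics are an exact weighted average of the sources' f4-statistics, and recovers the weights by least squares under the constraint that they sum to 1. All values are simulated, and every variable name is hypothetical.

```r
set.seed(1)

# Simulated f4-statistics: one row per hypothetical source, one column per
# outgroup combination. These numbers are made up for illustration.
n_src = 3
n_stat = 5
src_f4 = matrix(rnorm(n_src * n_stat), nrow = n_src,
                dimnames = list(paste0("source", seq_len(n_src)), NULL))

# A noise-free target that is an exact mixture of the three sources
true_w = c(0.6, 0.3, 0.1)
target_f4 = drop(true_w %*% src_f4)

# Enforce sum(w) = 1 by writing the last weight as 1 - sum(other weights),
# then solve the remaining least-squares problem
ref = src_f4[n_src, ]
X = t(src_f4[-n_src, , drop = FALSE]) - ref
y = target_f4 - ref
w_free = qr.solve(X, y)
w_hat = c(w_free, 1 - sum(w_free))
names(w_hat) = rownames(src_f4)

w_hat
```

Because the simulated target contains no noise, the recovered weights equal the weights used to build it. With real data the weights are estimates with standard errors, and nothing in this unconstrained fit keeps them between 0 and 1.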

qpadm(
  data,
  left,
  right,
  target,
  f4blocks = NULL,
  fudge = 1e-04,
  fudge_twice = FALSE,
  auto_only = TRUE,
  blgsize = 0.05,
  poly_only = FALSE,
  boot = FALSE,
  getcov = TRUE,
  constrained = FALSE,
  return_f4 = FALSE,
  cpp = TRUE,
  verbose = TRUE,
  ...
)

Arguments

data

The input data in the form of:

A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2
A directory with f2 statistics
The prefix of a genotype file

left

Left populations (sources)

right

Right populations (outgroups)

target

Target population

f4blocks

Instead of f2 blocks, f4 blocks can be supplied. This is used by qpadm_multi

fudge

Value added to diagonal matrix elements before inverting

fudge_twice

Setting this to TRUE should result in p-values that better match those in the original qpAdm program

auto_only

Use only chromosomes 1 to 22.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, it will be interpreted as base pair distance rather than centimorgan distance.

poly_only

Exclude sites with identical allele frequencies in all populations.

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

getcov

Compute weights covariance. Setting getcov = FALSE will speed up the computation.

constrained

Constrain admixture weights to be non-negative

return_f4

Return f4-statistics

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

verbose

Print progress updates

...

If data is the prefix of genotype files, additional arguments will be passed to f4blockdat_from_geno
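
The two resampling schemes selected by the boot argument can be sketched in base R. This is a simplified illustration, not the package's implementation: it assumes equally sized blocks and uses the mean of simulated per-block values as the statistic, and all names are hypothetical.

```r
set.seed(42)

# Hypothetical per-block estimates of some statistic (simulated values)
n_blocks = 200
block_est = rnorm(n_blocks, mean = 0.01, sd = 0.005)

# Block jackknife (boot = FALSE): leave out one block at a time
loo = (sum(block_est) - block_est) / (n_blocks - 1)
se_jack = sqrt((n_blocks - 1) / n_blocks * sum((loo - mean(loo))^2))

# Block bootstrap (boot = TRUE): resample blocks with replacement,
# using as many resamplings as there are blocks
n_boot = n_blocks
boot_est = replicate(n_boot, mean(sample(block_est, replace = TRUE)))
se_boot = sd(boot_est)

c(jackknife = se_jack, bootstrap = se_boot)
```

For a simple mean, the unweighted jackknife standard error equals sd(block_est) / sqrt(n_blocks), and the bootstrap estimate approaches the same value as the number of resamplings grows.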

Value

qpadm returns a list with up to four data frames describing the model fit:

weights: A data frame with estimated admixture proportions where each row is a left population.
f4: A data frame with estimated and fitted f4-statistics
rankdrop: A data frame describing model fits with different ranks, including p-values for the overall fit and for nested models (comparing two models with rank difference of one). A model with L left populations and R right populations has an f4-matrix of dimensions (L-1)*(R-1). If no two left populations form a clade with respect to all right populations, this model will have rank (L-1)*(R-1).
- f4rank: Tested rank
- dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)
- chisq: Chi-squared statistic, obtained as E'QE, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.
- p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)
- dofdiff: Difference in degrees of freedom between this model and the model with one less rank
- chisqdiff: Difference in chi-squared statistics
- p_nested: p-value testing whether the difference between two models of rank difference 1 is significant
popdrop: A data frame describing model fits with different populations. Note that all models with fewer populations use the same set of SNPs as the first model.
- pat: A binary code indicating which populations are present in this model. A 1 represents dropped populations. The full model is all zeros.
- wt: Number of populations dropped
- dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)
- chisq: Chi-squared statistic, obtained as E'QE, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.
- p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)
- f4rank: Tested rank
- feasible: A model is feasible if all weights fall between 0 and 1
- <population name>: The weights for each population in this model
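
The dof, p, p_nested and feasible columns described above can be recomputed by hand. The sketch below uses the rounded values printed in the example output on this page, so results agree only up to that rounding. It assumes L = 3, that is, L counts the target together with the two sources, because that choice reproduces the printed dof values. The helper dof_rank is hypothetical and not part of the package.

```r
# Degrees of freedom for a tested rank, following the formula given above
dof_rank = function(L, R, f4rank) (L - 1 - f4rank) * (R - 1 - f4rank)

L = 3                 # assumption: target plus two sources
R = 4                 # four right populations
f4rank = c(1, 0)
dof = dof_rank(L, R, f4rank)

# Rounded chi-squared statistics from the example rankdrop table
chisq = c(7.15, 1580)
p = pchisq(chisq, df = dof, lower.tail = FALSE)

# Nested test between the rank 1 model and the rank 0 model
dofdiff = dof[2] - dof[1]
chisqdiff = chisq[2] - chisq[1]
p_nested = pchisq(chisqdiff, df = dofdiff, lower.tail = FALSE)

# A model is feasible if all weights fall between 0 and 1
weights = c(49.6, -48.6)
feasible = all(weights >= 0 & weights <= 1)

data.frame(f4rank, dof, chisq, p)
```

The rank 1 row gives a p-value of about 0.028, in line with the printed 0.0280. The weights from the example lie far outside the interval from 0 to 1, so this model is not feasible.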

References

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 10)

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm(example_f2_blocks, left, right, target)
#> ℹ Computing f4 stats...
#> ℹ Computing admixture weights...
#> ℹ Computing standard errors...
#> ℹ Computing number of admixture waves...
#> 
#> $weights
#> # A tibble: 2 × 5
#>   target      left                 weight    se     z
#>   <chr>       <chr>                 <dbl> <dbl> <dbl>
#> 1 Denisova.DG Altai_Neanderthal.DG   49.6  23.3  2.13
#> 2 Denisova.DG Vindija.DG            -48.6  23.3 -2.08
#> 
#> $f4
#> # A tibble: 36 × 9
#>    pop1        pop2        pop3  pop4       est      se       z         p weight
#>    <chr>       <chr>       <chr> <chr>    <dbl>   <dbl>   <dbl>     <dbl>  <dbl>
#>  1 Denisova.DG Altai_Nean… Chim… Mbut…  0.0129  3.64e-4  35.6   2.22e-277   49.6
#>  2 Denisova.DG Vindija.DG  Chim… Mbut…  0.0131  3.73e-4  35.0   7.55e-269  -48.6
#>  3 Denisova.DG fit         Chim… Mbut…  0.00693 6.60e-3   1.05  2.94e-  1   NA  
#>  4 Denisova.DG Altai_Nean… Chim… Russ…  0.0152  4.46e-4  34.0   4.67e-254   49.6
#>  5 Denisova.DG Vindija.DG  Chim… Russ…  0.0156  4.53e-4  34.5   2.14e-261  -48.6
#>  6 Denisova.DG fit         Chim… Russ… -0.00642 8.03e-3  -0.800 4.23e-  1   NA  
#>  7 Denisova.DG Altai_Nean… Chim… Swit…  0.0150  4.64e-4  32.3   6.06e-229   49.6
#>  8 Denisova.DG Vindija.DG  Chim… Swit…  0.0154  4.78e-4  32.2   5.81e-228  -48.6
#>  9 Denisova.DG fit         Chim… Swit… -0.00552 8.43e-3  -0.654 5.13e-  1   NA  
#> 10 Denisova.DG Altai_Nean… Mbut… Chim… -0.0129  3.64e-4 -35.6   2.22e-277   49.6
#> # ℹ 26 more rows
#> 
#> $rankdrop
#> # A tibble: 2 × 7
#>   f4rank   dof   chisq      p dofdiff chisqdiff p_nested
#>    <int> <int>   <dbl>  <dbl>   <int>     <dbl>    <dbl>
#> 1      1     2    7.15 0.0280       4     1572.        0
#> 2      0     6 1580.   0           NA       NA        NA
#> 
#> $popdrop
#> # A tibble: 3 × 13
#>   pat      wt   dof    chisq      p f4rank Altai_Neanderthal.DG Vindija.DG
#>   <chr> <dbl> <dbl>    <dbl>  <dbl>  <dbl>                <dbl>      <dbl>
#> 1 00        0     2     7.15 0.0280      1                 49.6      -48.6
#> 2 01        1     3 11412.   0           0                  1         NA  
#> 3 10        1     3 11449.   0           0                 NA          1  
#> # ℹ 5 more variables: feasible <lgl>, best <lgl>, dofdiff <dbl>,
#> #   chisqdiff <dbl>, p_nested <dbl>
#> 
if (FALSE) {
# The original ADMIXTOOLS qpAdm program has an option called "allsnps"
# that selects different SNPs for each f4-statistic, which is
# useful when working with sparse genotype data.
# To get the same behavior in ADMIXTOOLS 2, supply the genotype data prefix
# and set `allsnps = TRUE`
qpadm("/my/geno/prefix", left, right, target, allsnps = TRUE)
}
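
The unconstrained fit shown earlier returns weights far outside the interval from 0 to 1. As a follow-up sketch, the same model can be refit with constrained = TRUE, which the argument list describes as constraining the weights to be non-negative. The call is wrapped in a guard (an addition for this sketch, not a package convention) so that it is skipped when admixtools is not installed; the resulting weights are not shown here because they were not computed for this page.

```r
# Guarded so the snippet is skipped when admixtools is not installed
fit = NULL
if (requireNamespace("admixtools", quietly = TRUE)) {
  library(admixtools)
  left = c('Altai_Neanderthal.DG', 'Vindija.DG')
  right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
  target = 'Denisova.DG'
  # Same model as above, with non-negative weights and no progress messages
  fit = qpadm(example_f2_blocks, left, right, target,
              constrained = TRUE, verbose = FALSE)
  print(fit$weights)
}
```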
