While the previous vignette provides a mathematically satisfactory
derivation of the Kolmogorov forward/backward equations for a
time-homogeneous diffusion process (without invoking concepts beyond
calculus), it doesn’t convey much intuition. In this vignette, we
study the classic example of using a diffusion process to approximate
genetic drift (Kimura 1955). As we will see, not only does it provide an
appealing model for continuous variation in allele frequency, it also
offers an intuitive justification of the Kolmogorov forward
equation.
Prerequisites
A good understanding of the material in previous
vignettes
Ability to reason about continuous variation and infinitesimal
changes
Introduction
The term diffusion refers to the general phenomenon
where particles move from a region of high concentration to a region of
lower concentration. In genetic drift, we use diffusion to describe the
“flattening” effect of random drift on the distribution of allele frequencies.
In 1956, Peter Buri used a beautiful
experiment to demonstrate this effect. Briefly, Buri began the
experiment with 107 populations of D. melanogaster (fruit fly). In each
population, the frequency of a target allele called \(\text{bw}^{75}\) was exactly 0.5. Buri
tracked the allele frequency changes over 19 generations by inferring
the genotype of each fly from its eye color. The changes in the allele
frequency distribution vividly illustrate the effect of genetic drift across
many populations.
In the picture
above, all the mass is initially concentrated at the center. Within a
few generations, the mass spreads out fairly evenly. As time passes,
the allele reaches fixation in some populations and is lost from
others.
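To make the flattening effect concrete, here is a minimal R sketch (not Buri’s actual data or protocol) that mimics the experiment with binomial Wright-Fisher sampling; the number of gene copies per population (2N = 32) and the use of pure drift with no other forces are illustrative assumptions.

```r
# Minimal sketch: 107 replicate populations drifting from frequency 0.5.
# The number of gene copies (two_N = 32) is an illustrative assumption.
set.seed(1)
n_pop <- 107   # replicate populations, as in Buri's experiment
two_N <- 32    # gene copies resampled each generation (assumed)
n_gen <- 19    # generations tracked
freq <- matrix(NA_real_, nrow = n_gen + 1, ncol = n_pop)
freq[1, ] <- 0.5
for (t in 1:n_gen) {
  # Wright-Fisher drift: binomial resampling of gene copies in each population
  freq[t + 1, ] <- rbinom(n_pop, size = two_N, prob = freq[t, ]) / two_N
}
# The frequencies spread out and start piling up at 0 and 1 (loss / fixation)
hist(freq[n_gen + 1, ], breaks = seq(0, 1, length.out = two_N + 1),
     main = "Allele frequency after 19 generations", xlab = "frequency")
```

Plotting earlier generations shows the distribution flattening first, before the mass accumulates at the two boundaries.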
Derivation of forward equation
In fact, compared to our derivation, Kimura’s intuitive justification
applies to a broader class of diffusion processes, in that it doesn’t
assume time-homogeneity.
Consider a locus with a pair of alleles \(A_1\) and \(A_2\) segregating in a population. Let
\(A_1\) be the focal allele with
frequency \(p\) at time 0. Following
our notation in the previous vignette, \(\phi(p,0,x,t)\) is the transition density
for the frequency of \(A_1\), \(x\), at time \(t\). Note that we use \(\phi(\cdot)\) instead of \(p(\cdot)\) to avoid confusion between the
density and the initial frequency. We can further simplify our notation to
\(\phi(x,t)\), since 0 and \(p\) are both fixed quantities. The forward
equation is: \[
\frac{\partial}{\partial t}\phi(x,t) = -\frac{\partial}{\partial x}\{\phi(x,t)\mu(x, t)\} +
\frac{1}{2}\frac{\partial^2}{\partial x^2}\{\phi(x,t)\sigma^2(x,t) \}
\] Let’s pause for a second and think about the interpretation of
\(\phi(x,t)\). First of all, we
highlight the difference between the diffusion process and the
characterization of genetic drift as binomial sampling (as demonstrated
in the first vignette). In the binomial sampling model, we focus on the
behavior of a single population. However, when we switch to the Markov chain
and diffusion process, we are essentially modelling the average
behavior of allele frequency across many populations. Consider some
tiny \(dx > 0\); then \(\phi(x,t) dx\) can be thought of as the
proportion of populations whose allele frequency lies within \(dx\) of \(x\),
among an infinite number of “identical”
Wright-Fisher populations. Sometimes we also call \(\phi(x,t) dx\) the frequency of class \(x\) at time \(t\). In other words, think of a single
population as a small machine that runs its own drifting process. Now we
have many such populations, each of which is a small machine that runs a
drifting process. Although the same set of “instructions” (the parameters of
genetic drift) is sent to each machine, sampling variation in the
drifting process leads to allele frequency variation among populations.
If we pause the process at a particular time \(t\), we obtain the allele frequency
distribution at time \(t\) (as a
function of \(x\), the dummy variable
for allele frequency), which is exactly \(\phi(x,t)\). Kimura (1955) has a nice
diagram that visualizes \(\phi(x,t)\).
The distribution is approximated by bins of width \(h\). The approximation gets better as \(h\) becomes smaller, so think of \(h\) as tiny. Because the density in each bin
is constant, there are only three classes of interest here, represented by their
midpoints \(x-h\), \(x\), and \(x+h\), respectively. Consider a population with
allele frequency \(x\) (henceforth
class \(x\), and similarly for the other
classes); we want to describe its probability of moving to another class
after some time \(\Delta t\). Recall that in
the definition of a diffusion process, there is a technical condition
which ensures continuous sample paths. Therefore, if we take \(\Delta t\) sufficiently small and the population moves at all, it moves
to either class \(x-h\) or class \(x+h\).
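As a quick illustration of the class-frequency interpretation (a self-contained R sketch with illustrative parameters, not part of Kimura’s derivation), we can estimate \(\phi(x,t)h\) by simulating many replicate Wright-Fisher populations and binning their frequencies into classes of width \(h\):

```r
# Minimal sketch: estimate phi(x, t) * h at generation t_gen by binning the
# frequencies of many replicate Wright-Fisher populations into classes of
# width h. All parameter values here are illustrative assumptions.
set.seed(1)
n_pop <- 1e4; two_N <- 32; t_gen <- 5
p <- rep(0.5, n_pop)
for (g in 1:t_gen) p <- rbinom(n_pop, size = two_N, prob = p) / two_N
h <- 1 / 16
class_freq <- table(cut(p, breaks = seq(0, 1, by = h), include.lowest = TRUE)) / n_pop
round(class_freq, 3)  # proportion of populations in each class ~ phi(x, t_gen) * h
```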
We also consider two forces that drive the population to different
classes: systematic pressure and randomness. Let \(m(x,t)\Delta t\) be the probability that
the population moves to class \(x+h\)
after time \(\Delta t\) due to
systematic pressure. Note that the pressure can be specified in either
direction without loss of generality. Let \(v(x,t)\Delta t\) be the probability that
the population moves to a different class after time \(\Delta t\) due to randomness. Therefore, if
the population leaves the current class (say, class \(x\)) by randomness, half of the time it moves
to class \(x-h\) and half of the time
to class \(x+h\). We can write the
density after time \(\Delta t\) as:
\[
\begin{aligned}
\phi(x,t+\Delta t)h &= \phi(x,t)h \\
&- [v(x,t)+m(x,t)]\Delta t \, \phi(x,t)h \hspace{10mm} (\star) \\
&+ \frac{1}{2}v(x-h,t)\Delta t \, \phi(x-h,t)h + \frac{1}{2}v(x+h, t)\Delta t \, \phi(x+h,t)h \hspace{10mm} (\star\star)\\
&+ m(x-h,t)\Delta t \, \phi(x-h, t)h \hspace{10mm} (\star\star\star)
\end{aligned}
\] Let’s translate the above equation into plain English. \((\star)\) represents the probability mass
leaving class \(x\) per \(\Delta t\) at time \(t\); \((\star\star)\) represents the probability
mass moving into class \(x\) per \(\Delta t\) from the other two classes due
to randomness; \((\star\star\star)\)
represents the probability mass moving into class \(x\) per \(\Delta
t\) from class \(x-h\) by
systematic pressure (i.e., the wind only blows from left to right).
Together, \((\star\star)\) and \((\star\star\star)\) represent the
probability mass entering class \(x\)
per \(\Delta t\) at time \(t\). Hence, the previous equation says:
\[
\begin{aligned}
\text{the probability mass in class } x \text{ at time } (t + \Delta t)
&= \text{the probability mass in class } x \text{ at time } t \\
&- \text{the probability mass leaving class } x \text{ per } \Delta
t \text{ at time } t \\ &+ \text{the probability mass entering class
} x \text{ per } \Delta t \text{ at time } t
\end{aligned}
\] Cool! Now we can define \(\sigma^2(x,t)\) and \(\mu(x,t)\). Let: \[
\begin{aligned}
\sigma^2(x,t) \Delta t &= h^2 \frac{1}{2}v(x,t)\Delta t + (-h)^2
\frac{1}{2}v(x,t)\Delta t\\
\mu(x,t) \Delta t &= hm(x,t)\Delta t
\end{aligned}
\] In other words, \(\sigma^2(x,t)
\Delta t\) is the variance of the per \(\Delta t\) change in \(x\) due to randomness, and \(\mu(x,t) \Delta t\) is the average per
\(\Delta t\) change in \(x\). Dividing both sides by \(\Delta t\), we have: \[
\begin{aligned}
\sigma^2(x,t) &= h^2 v(x,t)\\
\mu(x,t) &= hm(x,t)
\end{aligned}
\] Although \(\Delta t\) is
short, it is still a finite segment of time. Shifting perspective to a single
particle (a single population) and letting \(\Delta t\) shrink toward zero,
we should interpret \(\sigma^2(x,t)\)
as the variance of the change in \(x\) per infinitesimal time due to randomness,
and \(\mu(x,t)\) as the average change in \(x\) per infinitesimal
time. Recall that in the
definition of a diffusion process, we call \(\mu(x,t)\) the infinitesimal
mean and \(\sigma^2(x,t)\) the
infinitesimal variance.
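To see this machinery in action, below is a minimal R sketch that iterates the mass-balance update \((\star)\)–\((\star\star\star)\) on a grid of classes. It assumes no systematic pressure (\(\mu = 0\)) and takes \(\sigma^2(x) = x(1-x)/(2N)\), the standard infinitesimal variance of pure Wright-Fisher drift (quoted here as a standard result, not derived in this vignette); all parameter values are illustrative.

```r
# Minimal sketch: iterate the mass-balance update with m = 0 (no systematic
# pressure) and v(x) = sigma^2(x) / h^2, where sigma^2(x) = x(1 - x) / (2N)
# is assumed (standard pure-drift variance). Parameters are illustrative.
two_N <- 32                 # 2N gene copies (assumed)
h     <- 1 / two_N          # class width
dt    <- 0.05               # time step, small enough that v * dt < 1 everywhere
x     <- seq(0, 1, by = h)  # allele-frequency classes, including the ends x = 0, 1
v     <- x * (1 - x) / (2 * two_N) / h^2  # v(x) = sigma^2(x) / h^2; zero at x = 0, 1
mass  <- ifelse(abs(x - 0.5) < 1e-9, 1, 0)  # all mass starts in class x = 0.5

step <- function(mass, v, dt) {
  out        <- v * dt * mass                      # (*)  mass leaving each class by randomness
  from_right <- c(0.5 * v[-1] * dt * mass[-1], 0)  # (**) mass arriving from class x + h
  from_left  <- c(0, 0.5 * v[-length(v)] * dt * mass[-length(mass)])  # (**) from class x - h
  mass - out + from_right + from_left
}

for (i in 1:400) mass <- step(mass, v, dt)  # roughly 20 "generations"
sum(mass)   # total mass stays exactly 1
plot(x, mass, type = "h", xlab = "allele frequency class", ylab = "probability mass",
     main = "Mass-balance update: drift flattens the distribution")
```

Because \(v\) vanishes at the boundary classes \(x = 0\) and \(x = 1\), no mass leaves the grid and the total stays exactly 1; mass that reaches the ends simply accumulates there, mirroring loss and fixation in Buri’s experiment.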
Eventually, we want to find an expression for the time derivative: \[
\lim_{\Delta t \rightarrow 0} \frac{\phi(x,t+\Delta t) -
\phi(x,t)}{\Delta t}
\] This is easy because we can substitute \(\sigma^2(x,t)\) and \(\mu(x,t)\) back into the equation for \(\phi(x,t+\Delta t)h\). With some standard
algebraic manipulations, we obtain the desired forward equation (remember
to take the limit as \(h \rightarrow
0\), because we started with an approximation of the distribution):
\[
\frac{\partial}{\partial t}\phi(x,t) = -\frac{\partial}{\partial x}\{\phi(x,t)\mu(x, t)\} +
\frac{1}{2}\frac{\partial^2}{\partial x^2}\{\phi(x,t)\sigma^2(x,t) \}
\]
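For readers who want the skipped algebra, here is a sketch: divide the balance equation by \(h\), rearrange, and substitute \(v = \sigma^2/h^2\) and \(m = \mu/h\):
\[
\begin{aligned}
\frac{\phi(x,t+\Delta t) - \phi(x,t)}{\Delta t}
&= \big[m(x-h,t)\phi(x-h,t) - m(x,t)\phi(x,t)\big] \\
&\quad + \tfrac{1}{2}\big[v(x-h,t)\phi(x-h,t) - 2v(x,t)\phi(x,t) + v(x+h,t)\phi(x+h,t)\big] \\
&= -\frac{\mu(x,t)\phi(x,t) - \mu(x-h,t)\phi(x-h,t)}{h} \\
&\quad + \frac{1}{2}\,\frac{\sigma^2(x-h,t)\phi(x-h,t) - 2\sigma^2(x,t)\phi(x,t) + \sigma^2(x+h,t)\phi(x+h,t)}{h^2}
\end{aligned}
\]
As \(h \rightarrow 0\), the backward difference quotient converges to \(-\frac{\partial}{\partial x}\{\mu(x,t)\phi(x,t)\}\) and the second difference quotient to \(\frac{\partial^2}{\partial x^2}\{\sigma^2(x,t)\phi(x,t)\}\); letting \(\Delta t \rightarrow 0\) on the left-hand side then gives the forward equation above.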
Main reference
Stochastic processes and distribution of gene frequencies under
natural selection by Motoo Kimura (1955)