Short Review on Mathematical Model of Molecular Evolution through a Stochastic Analysis

Short Review - American Journal of Physiology, Biochemistry and Pharmacology (2021)

Joel Valdivia Ortega*
 
Faculty of Science from the National Autonomous University of Mexico, Mexico
 
*Corresponding Author:

Joel Valdivia Ortega, Faculty of Science from the National Autonomous University of Mexico, Mexico, Email: joelvaltega@ciencias.unam.mx

Published: 29-May-2021

Abstract

In "Mathematical model of molecular evolution through a stochastic analysis", I present a mathematical treatment of molecular evolution on DNA chains from several species, which allowed me to understand their modifications as random variables connected by a Markov chain. As a result, I was able to determine the probability that a given nucleotide at a given position in the studied genes changes into another as a product of random events, and to describe a mechanism that estimates the number of generations needed for one given DNA chain to become another given one.

Introduction

Several attempts to study molecular evolution as a stochastic process have been made over time, each of them using different types of parameters and different hypotheses. One of the first was [1], which assumed a uniform probability for all the events. I consider this to be a good first hypothesis, but it is well known that there exist several ways in which a mutation can occur, e.g. a transition or a transversion, each of them with a different probability of occurrence, so this assumption needs to be changed.

Some works which do consider a non-uniform probability distribution for the mutation of nucleotides are [2,3], which used, respectively, a molecular clock of mitochondrial DNA and a comparison of a pair of nucleotide sequences. The former used the molecular clock to estimate the number of generations between two species, but even the authors recognized that the results were not accurate with respect to the evidence found in fossils. The latter found a simple formula to estimate the "evolutionary distance per site", which is K=-1/2 ln(1-2P-Q), where P and Q are respectively the fractions of nucleotides showing differences due to transitions and transversions. Leaving aside the fact that it is possible for 1-2P-Q to be negative while the logarithm is only defined for positive values, this formula also has the problem that the weight it gives to the proportion depends on the proportion itself, leading to wrong estimations of the evolutionary distance. To give an example of what I mean by this, let us suppose we have one nucleotide chain α1=(A,A,A,A,A) which becomes α2=(T,A,A,A,A) on the next generation and α3=(T,T,A,A,A) after three generations. Both mutations are transversions, so P=0 in both cases while Q=1/5 for α2 and Q=2/5 for α3. With this formula, the evolutionary distance between α1 and α2 would be around 0.11, but the estimation for α1 and α3 would be 0.25, showing that the estimation does not change by the same amount for the same number of mutations, so I will propose another formula which takes this into consideration.
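The example above can be checked numerically. The following is a minimal sketch, assuming the formula exactly as quoted in the text; the function name `evolutionary_distance` and the `TRANSITIONS` set are illustrative names, not taken from [3].

```python
import math

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def evolutionary_distance(seq1, seq2):
    """Return K = -1/2 ln(1 - 2P - Q), the formula as quoted in the text."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    n = len(seq1)
    pairs = list(zip(seq1, seq2))
    p = sum(pair in TRANSITIONS for pair in pairs) / n
    q = sum(x != y and (x, y) not in TRANSITIONS for x, y in pairs) / n
    argument = 1 - 2 * p - q
    if argument <= 0:
        # The case pointed out in the text: the logarithm is undefined here.
        raise ValueError("1 - 2P - Q must be positive")
    return -0.5 * math.log(argument)


k12 = evolutionary_distance("AAAAA", "TAAAA")  # one transversion, Q = 1/5
k13 = evolutionary_distance("AAAAA", "TTAAA")  # two transversions, Q = 2/5
print(round(k12, 3), round(k13, 3), round(k13 / k12, 2))
```

The ratio between the two distances is larger than 2 although the number of mutations only doubled, which is the non-proportionality described above.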

One last work I would like to address, since its approach is similar to what I did in my previous paper, is [4], where a Markov chain with a 61 × 61 transition matrix was used to study codon-based mutations, leading to a maximum likelihood analysis of α- and β-globin genes. The main difference I would like to remark between the paper from Goldman and Yang and what I did is that they obtained the transition matrix as the solution of a differential equation they assumed; meanwhile, I got my transition matrix as a result of data analysis, meaning I made no suppositions about the way it should behave, so my result is a reflection of the empirical data.

Stochastic Analysis

Since the mutations do not depend on the previous ones, I can think of the process of evolution as a Markov chain by definition, so I propose to take the nucleotides on a DNA chain as a random variable whose range is the set of all possible nucleotides. Given this, I tried to calculate the transition matrix P' which connects the wolf's genome to the genome of a boxer, a poodle, a yorkshire and a beagle [5-9]. To do this, I define

and D:={boxer, poodle, yorkshire, beagle}, so I can build the function in figure 1,

where d is in D, w(n) is the n-th nucleotide of the wolf's genome and d(n) is the n-th nucleotide of the given dog's genome. Therefore I can write

since this is the probability that a given type of nucleotide on the wolf's genome ends up being another given type of nucleotide on the genome of any of the dogs. I took only 10000 nucleotides so I can be sure the probability distribution for a mutation to happen is the same along the whole chain.
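Since the defining equations are not reproduced here, the following is only one plausible reading of the verbal description above: count, position by position, how often each wolf nucleotide is aligned with each dog nucleotide, then normalize each row. The function name `empirical_transition_matrix` and the toy sequences are hypothetical, and the sketch assumes the sequences are already aligned and contain only A, C, G and T.

```python
NUCLEOTIDES = "ACGT"


def empirical_transition_matrix(wolf, dogs):
    """Estimate P'[x][y]: the fraction of positions where the wolf has
    nucleotide x and a dog has nucleotide y, pooled over all dogs."""
    counts = {x: {y: 0 for y in NUCLEOTIDES} for x in NUCLEOTIDES}
    for dog in dogs:
        if len(dog) != len(wolf):
            raise ValueError("sequences must be aligned to the same length")
        for w_n, d_n in zip(wolf, dog):
            counts[w_n][d_n] += 1
    matrix = {}
    for x in NUCLEOTIDES:
        total = sum(counts[x].values())
        # A nucleotide absent from the wolf sequence yields a row of zeros.
        matrix[x] = {
            y: (counts[x][y] / total if total else 0.0) for y in NUCLEOTIDES
        }
    return matrix


# Toy data (hypothetical, far shorter than the 10000 nucleotides of the text).
wolf = "ACGT"
dogs = ["ACGT", "ACGA"]
p_prime = empirical_transition_matrix(wolf, dogs)
print(p_prime["T"])
```

Each non-empty row sums to 1, so the result can be used as the transition matrix of a Markov chain over the four nucleotides.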

The domestication of the wolf began 30000 years ago and, with it, the evolution of the dog [10]. Supposing a new generation of dogs is born each year on average, and if P is the transition matrix of the Markov chain I am working with, I have P'=P^30000, so I just have to calculate P=(P')^(1/30000). After taking the real part of the entries, I obtained the matrix in figure 2, where P is the transition matrix which describes the probability of seeing a mutation on a certain nucleotide in the next generation.

Figure 2: Obtained transition matrix for the Markov chain
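The root P=(P')^(1/30000) can be sketched with the standard library alone. This sketch swaps in a different technique from the one described: instead of computing the root and taking the real part of its entries, it uses the series for the matrix logarithm and exponential, which stays real but is only valid when P' is close to the identity. The matrix `p_prime` below is hypothetical and is not the matrix obtained from the genomes.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_add(a, b, scale=1.0):
    """Return a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_log(m, terms=80):
    """log(M) = X - X^2/2 + X^3/3 - ... with X = M - I (needs small X)."""
    n = len(m)
    x = mat_add(m, identity(n), -1.0)
    power = x
    result = [[0.0] * n for _ in range(n)]
    for k in range(1, terms + 1):
        result = mat_add(result, power, (-1.0) ** (k + 1) / k)
        power = mat_mul(power, x)
    return result


def mat_exp(m, terms=30):
    """exp(M) = I + M + M^2/2! + ..."""
    result = identity(len(m))
    term = identity(len(m))
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        result = mat_add(result, term)
    return result


def mat_pow(m, exponent):
    """Integer matrix power by repeated squaring."""
    result = identity(len(m))
    base = m
    while exponent:
        if exponent & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        exponent >>= 1
    return result


def per_generation_matrix(p_total, generations):
    """Return P such that P^generations equals p_total (near-identity case)."""
    log_p = mat_log(p_total)
    return mat_exp([[v / generations for v in row] for row in log_p])


# Hypothetical P' for illustration only.
p_prime = [[0.97, 0.01, 0.01, 0.01],
           [0.01, 0.97, 0.01, 0.01],
           [0.01, 0.01, 0.97, 0.01],
           [0.01, 0.01, 0.01, 0.97]]
p = per_generation_matrix(p_prime, 30000)
print(p[0])
```

Raising `p` back to the power 30000 with `mat_pow` recovers `p_prime` up to rounding, which is a convenient sanity check for any method used to compute the root.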

It is worth noticing that the described calculation of P did not rely on any assumption about how the matrix should behave and is based only on the empirical data I used, so this result should be independent of any scientific scheme other than the mathematical one.

Now that I have the transition matrix, I should be able to begin from the wolf's genome and arrive at each dog's genome by simulating the random variables describing the nucleotides of each chain. To do so, I made an artificial selection algorithm in which I picked a dog's genome and simulated that the wolf had 50 children, from which I chose the two whose genomes were closest to the selected dog and simulated that they bred. After that, I used the transition matrix again to create 50 children from that pair, and repeated the process until the desired genome was reached. This process was repeated 100 times for each dog. In order to know whether this process could lead me only to the genomes of the dogs I began the work with or whether I could reach any other genome, I generated a random one and applied the artificial selection algorithm with the random genome as the target; as with the dogs, this process was repeated 100 times.
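The artificial selection loop can be sketched as follows. Several details are assumptions made for illustration, since the text does not specify them: closeness is measured as the number of differing positions, a child takes each position from one of its two parents at random before mutating, and the toy matrix uses a mutation rate far higher than a realistic one so that the example ends quickly. All function names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(genome, matrix, rng):
    """Draw each next-generation nucleotide from its row of the matrix."""
    return "".join(
        rng.choices(NUCLEOTIDES, weights=[matrix[x][y] for y in NUCLEOTIDES])[0]
        for x in genome
    )


def distance(g1, g2):
    """Number of positions at which two genomes differ (assumed metric)."""
    return sum(x != y for x, y in zip(g1, g2))


def breed(parent1, parent2, matrix, rng):
    """Assumed breeding rule: pick each position from a random parent,
    then apply one generation of mutation."""
    mixed = "".join(rng.choice(pair) for pair in zip(parent1, parent2))
    return mutate(mixed, matrix, rng)


def generations_to_reach(start, target, matrix, rng,
                         litter=50, max_generations=10000):
    """Return the number of generations needed to reach the target,
    or None if the cap is hit first."""
    parents = (start, start)  # the first litter descends from the wolf alone
    for generation in range(1, max_generations + 1):
        children = [breed(parents[0], parents[1], matrix, rng)
                    for _ in range(litter)]
        children.sort(key=lambda g: distance(g, target))
        if children[0] == target:
            return generation
        parents = (children[0], children[1])  # keep the two closest children
    return None


# Toy matrix with an inflated mutation rate (hypothetical values).
toy = {x: {y: (0.94 if x == y else 0.02) for y in NUCLEOTIDES}
       for x in NUCLEOTIDES}
rng = random.Random(0)
print(generations_to_reach("AAAAAAAA", "ACGTACGT", toy, rng))
```

Repeating the call 100 times per target, as described above, gives the sample of generation counts used in the comparison between the dogs and the random genome.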

Given the 100 simulations for each dog and for the random genome, I calculated the average number of generations taken to reach the target, the standard deviation of the number of generations taken to conclude the simulation, and the standard deviation as a percentage of the average, but I could not find a significant difference between them, meaning this algorithm would lead to any genome, not only to those the transition matrix was calculated with. Rather than a failure of the model, I see two reasons for this inability to distinguish the real genomes from the random one. The first is that the artificial selection algorithm is very strict and unrealistic, because of the very selective way the children are chosen and because it was designed only to make sure I was able to connect the genomes of the wolf and the dogs using only the transition matrix. The second is that a phenomenon described by a Markov chain, such as the process of mutation itself, depends only on some given probabilities and random events and has neither a planned target nor logical decisions.
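The three summary quantities can be computed as in the sketch below. The use of the sample standard deviation is an assumption, since the text does not say which estimator was used, and the generation counts shown are made-up placeholders rather than results from the simulations.

```python
import statistics


def summarize(generation_counts):
    """Return (mean, standard deviation, deviation as % of the mean)."""
    mean = statistics.mean(generation_counts)
    std = statistics.stdev(generation_counts)  # sample estimator (assumed)
    return mean, std, 100.0 * std / mean


# Placeholder counts for illustration only.
mean, std, percent = summarize([90, 95, 100, 105, 110])
print(mean, round(std, 3), round(percent, 3))
```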

Estimating Number of Generations between Two Genomes

Let q_n be the nucleotide chain obtained from the wolf's genome after applying the previously calculated transition matrix n times, and let U(n) be the number of identical nucleotides between q_n and w. In order to find a differential equation which describes U, I propose the following hypotheses:

1. The number of new differences depends on the quantity of nucleotides on the genome, and then dU/dn is proportional to |U|.

2. The probability of a mutation on a specific entry of the genome does not depend on the number of generations without change, which means dU/dn is proportional to -U [11].

3. It is possible that a nucleotide from q_n mutates in such a way that it becomes the same as in w, therefore dU/dn is proportional to U(0)-U.

Therefore I propose the differential equation

dU/dn=-aU+b(U(0)-U)

which, after taking q_0=w as an initial condition, has for solution

U(n)=U(0)(b+a e^(-(a+b)n))/(a+b)

where a is the probability that, if the nucleotide at position m on the chain q_n is the same as in w, this nucleotide changes in the next iteration so that this is no longer true for q_(n+1), while b represents the probability that, if the nucleotide at a given position m on the chain q_n is not the same as in w, the corresponding nucleotides are the same for q_(n+1).
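The differential equation dU/dn=-aU+b(U(0)-U) with initial value U(0) has the closed form U(n)=U(0)(b+a e^(-(a+b)n))/(a+b), which follows from its being linear with constant coefficients. The sketch below evaluates this closed form and compares its numerical derivative with the right-hand side of the equation. The values of `a` and `b` are hypothetical placeholders, not the ones obtained from the transition matrix.

```python
import math


def identical_count(n, u0, a, b):
    """Closed-form solution of dU/dn = -a*U + b*(U(0) - U) with U(0) = u0."""
    return u0 * (b + a * math.exp(-(a + b) * n)) / (a + b)


def right_hand_side(u, u0, a, b):
    """Right-hand side of the differential equation proposed in the text."""
    return -a * u + b * (u0 - u)


u0 = 10000          # chain length used in the text
a, b = 1e-5, 3e-6   # hypothetical probabilities, for illustration only

n = 1000.0
h = 1.0
numeric = (identical_count(n + h, u0, a, b)
           - identical_count(n - h, u0, a, b)) / (2 * h)
exact = right_hand_side(identical_count(n, u0, a, b), u0, a, b)
print(numeric, exact)
```

For large n the exponential vanishes and U approaches U(0) b/(a+b), so the chain keeps only the fraction of matches expected from the balance between the two probabilities.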

Remembering that I am understanding molecular evolution as a Markov chain and using the values I obtained from its transition matrix, I conclude the expression I have been looking for is

which makes clear the tendency of the nucleotide chain q_n to become independent from w at a probabilistic level, this being a successful simulation of genetic diversity.

Now we can answer the question about the number of generations needed to get from one genome to another. With U the number of identical nucleotides between the two given genomes, as defined above, one can solve the previous expression for n and get, in terms of a and b,

n=-ln(((a+b)U/U(0)-b)/a)/(a+b)

which, for the empirical data used here, gives 34054±70 generations between the wolf and the dogs, where the second number is the margin of error. If one remembers that the transition matrix was made with the assumption that the number of generations between those species was 30000, one can say this is a very good result. It is worth mentioning that the way the transition matrix was made is not the source of the accuracy of the estimation I obtained, because I proposed the differential equation independently and the only information I used from the matrix is the values of a and b.
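Solving the closed-form solution of the differential equation for n gives n=-ln(((a+b)U/U(0)-b)/a)/(a+b), with U the number of identical nucleotides. The sketch below implements this inversion and checks it with a round trip. The values of `a` and `b` are hypothetical, so the output is not expected to reproduce the estimate quoted above.

```python
import math


def identical_count(n, u0, a, b):
    """Closed-form solution of dU/dn = -a*U + b*(U(0) - U) with U(0) = u0."""
    return u0 * (b + a * math.exp(-(a + b) * n)) / (a + b)


def generations_between(u, u0, a, b):
    """Invert the closed form: generations needed to go from u0 identical
    nucleotides down to u identical nucleotides."""
    if not 0 < u <= u0:
        raise ValueError("u must lie in (0, u0]")
    ratio = ((a + b) * u / u0 - b) / a
    if ratio <= 0:
        # u is at or below the long-run level u0*b/(a+b), which the model
        # only approaches asymptotically.
        raise ValueError("u is not reachable in a finite number of generations")
    return -math.log(ratio) / (a + b)


u0 = 10000          # chain length used in the text
a, b = 1e-5, 3e-6   # hypothetical probabilities, for illustration only

u_after = identical_count(30000, u0, a, b)
print(round(generations_between(u_after, u0, a, b)))
```

The guard on `ratio` makes explicit that the formula only applies while U is above its long-run value; below that level the logarithm is undefined.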

Conclusion

Let us notice again that the described process and the presented results do not depend on any hypothesis other than the process being a Markov chain and following the rules I proposed for the construction of the differential equation, so this can be extended not only to other sequences in the genome of these species or to other species, but also to non-nuclear DNA or even RNA.

References