# Basic Statistics and Data Analysis

## Markov Chain

A Markov chain, named after Andrey Markov is a mathematical system that experience transitions from one state to another, between a finite or countable number of possible states. Markov chain is a random process usually characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of memorylessness is called the Markov property. Markov chains have many applications as statistical models of real world processes.

If the random variables $X_{n-1}$ and $X_n$ take the values $X_{n-1}=i$ and $X_n=j$, then the system has made a transition $S_i \rightarrow S_j$, that is, a transition from state $S_i$ to state $S_j$ at the $n$th trial. Note that $i$ can equal $j$, so that transitions within the same state may be possible. We need to assign probabilities to the transitions $S_i \rightarrow S_j$. Generally in chain, the probability that $X_n=j$ will depend on the whole sequence of random variables starting with the initial value $X_0$. The Markov chain has the characteristic property that the probability that $X_n=j$ depends only on the immediate previous state of the system. This means that we need no further information at each step other than for each $i$ and $j$,  $P\{X_n=j|X_{n-1}=i\}$
which means the probability that $X_n=j$ given that $X_{n-1}=i$: this probability is independent of the values of $X_{n-2},X_{n-3},\cdots, X_0$.

Let we have a set of states $S=\{s_1,s_2,\cdots,s_n\}$. The process starts in one of these states and moves successively from one state to another state. Each move is called a step. If the chain is currently in state $s_i$ then it moves to state $s_j$ at the next step with probability denoted by $p_{ij}$ (transition probability) and this probability does not depend upon which states the chain was in before the current state. The probabilities $p_{ij}$ are called transition probabilities ($s_i \xrightarrow[]{p_{ij}} s_j$ ). The process can remain in the state it is in, and this occurs in probability $p_{ii}$.

An initial probability distribution, define on $S$ specifies the starting state. Usually this is done by specifying a particular state as the starting state.

Formally a Markov chain is a sequence of random variables $X_1,X_2,\cdots,$ with the Markov property that, given the present state, the future and past state are independent. Thus
$P(X_n=x|X_1=x_1,X_2=x_2\cdots X_{n-1}=x_{n-1})$
$\quad=P(X_n=x|X_{n-1}=x_{n-1})$
Or
$P(X_n=j|X_{n-1}=i)$

## Example: Markov Chain

A Markov chain $X$ on $S=\{0,1\}$ is determined by the initial distribution given by $p_0=P(X_0=0), \; p_1=P(X_0=1)$ and the one-step transition probability given by $p_{00}=P(x_{n+1}=0|X_n=0)$, $p_{10}=P(x_{n+1}=0|X_n=1)$, $p_{01}=1-p_{00}$ and $p_{11}=1-p_{10}$, so one-step transition probability in matrix form is $P=\begin{pmatrix}p_{00}&p_{10}\\p_{01}&p_{11}\end{pmatrix}$

## References:

• https://en.wikipedia.org/wiki/Markov_chain
• http://people.virginia.edu/~rlc9s/sys6005/SYS_6005_Intro_to_MC.pdf
• http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf

# Symmetric Random Walk Probability

As we know that a related probability is the one for the event in which the first visit to position x occurs at the nth step given that the walk starts at the origin, i.e. the first passage through x. For case x=0 we will find the probability of the first return to the origin. Let A be the event that $X_n=0$, let $B_k$ be the event that the first visit to the origin occurs at the kth step. By the law of total probability
$P(A)=\sum_{k=1}^n P(A|B_k)P(B_k) \tag{1}\label{eqn2.341}$
Let $f_k=P(B_k)$. The conditional probability $P(A|B_k)$ is the probability that the walk returns to the origin after n–k steps i.e. $P(A|B_k)=p_{n-k}$. Note that $p_n$ is the probability that the walk is at the origin at step n and $p_n=0$ if n is odd. Hence (\ref{eqn2.341}) can be written as
$p_n=\sum_{k=1}^n p_{n-k}f_k \tag{2}\label{eqn2.342}$
Multiplying both sides of (\ref{eqn2.342}) by $s^n$ and summing for $n\geq1$, we have
$\sum_{n=1}^\infty p_ns^n =\sum_{n=1}^\infty \sum_{k=1}^n p_{n-k}f_k s^n \tag{3}\label{eqn2.343}$
as
$H(s)=\sum_{n=0}^\infty p_ns^n=1+\sum_{n=1}^\infty p_ns^n$
$H(s)-1=\sum_{n=1}^\infty p_ns^n\qquad$
Therefore from (\ref{eqn2.343})
\begin{align*}
H(s)-1&=\sum_{n=1}^\infty p_n s^n\\
&=\sum_{n=1}^\infty \sum_{k=1}^n p_{n-k}f_ks^n\tag{4}\label{eqn2.344}
\end{align*}
Since $p_0=1$ and $f_0=0$, we can replace (\ref{eqn2.344}) by
\begin{align*}
H(s)-1&=\sum_{n=1}^\infty p_ns^n\\
&=\sum_{n=0}^\infty \sum_{k=0}^n p_{n-k}f_ks^n = \sum_{n=0}^\infty p_ns^n \sum_{k=0}^\infty f_k s^k \tag{5}\label{eqn2.345}
\end{align*}
the product of two power series, also known as Convolution of the distribution.

Now if $Q(s)=\sum_{n=0}^\infty f_ns^n$ is the probability generating function of the first return distribution, then
$H(s)-1=H(s)Q(s)$.

From equation (8) of lecture “Probability Distribution after n steps” we have
\begin{align*}
Q(s)&=\frac{H(s)-1}{H(s)}=1-\frac{1}{H(s)}\\
&=1-(1-s^2)^{\frac{1}{2}} \tag{6}\label{eqn2.346}
\end{align*}
The probability that the walk will, at some step, return to the origin is $\sum_{n=1}^\infty f_n=Q(1)=1$;
i.e. the return is certain. In this walk the origin or any point by translation is persistent. However, the mean number of steps until this return occurs is
\begin{align*}
\sum_{n=1}^\infty nf_n &=\lim_{s \rightarrow 1-} Q'(s)\\
&=\lim_{s \rightarrow 1-} \frac{s}{(1-s^2)^{\frac{1}{2}}}=\infty
\end{align*}
In this walk which starts at the origin is certain to return there in the future, but on average, it will take an infinite number of steps.

Example:  Find the probability that a symmetric random walk starting from the origin returns there for the first time after 6 steps.

solution:  We require the coefficient of $s^6$ in the power series expansion of the pgf Q(s), which is
\begin{align*}
Q(s)&=1-(1-s^2)^{\frac{1}{2}}\\
&=1-\left[\frac{1^{\frac{1}{2}}}{0!}-\frac{\frac{1}{2}{(-s^2)}^1}{1!}-\frac{\frac{1}{2}(\frac{1}{2}-1){(-s^2)}^2}{2!}-\frac{\frac{1}{2}(\frac{1}{2}-1)(\frac{1}{2}-2){(-s^2)}^3}{3!}\right]\\
&=1-1+\frac{1}{2}s^2+\frac{1}{8}s^4+\frac{1}{16}s^6+0s^8
\end{align*}
Hence the probability of a first return at step 6 is $\frac{1}{16}$.

# Random Walk Probability of Returning to Origin after n Steps

Assume that the walk starts at x=0 with steps to the right or left occurring with probabilities p and q=1-p. We can write the position $X_n$ after n steps as
$X_n=R_n-L_n \tag{1}$
where $R_n$ is the number of right or positive steps (+1) and $L_n$ is the number of left or negative steps (-1).

Therefore the Total steps can be calculated as:  $n=R_n+L_n \tag{2}$
Hence
\begin{align*}
L_n&=n-R_n\\
\Rightarrow X_n&=R_n-n+R_n\\
R_n&=\frac{1}{2}(n+X_n) \tag{3}
\end{align*}
The equation (3) will be an integer only when n and $X_n$ are both even or both odd (eg. To go from x=0 to x=7 we must take an odd number of steps).

Now, let $v_{n,x}$ be the probability that the walk is at state x after n steps assuming that x is a positive integer. Then
\begin{align*}
v_{n,x}&=P(X_n=x)\\
&=P(R_n=\frac{1}{2}(n+x))
\end{align*}
$R_n$ is a binomial random variable with index $n$ having probability p, since the walker either moves to the right or not at every step, and the steps are independent, then
\begin{align*}
v_{n,x}&=\binom{n}{\frac{1}{2}(n+x)}p^{\frac{1}{2}(n+x)}q^{n-\frac{1}{2}(n+x)}\\
&=\binom{n}{\frac{1}{2}(n+x)}p^{\frac{1}{2}(n+x)}q^{\frac{1}{2}(n-x)} \tag{4}
\end{align*}
where (n,x) are both even or both odd and $-n \leq x \leq n$. Note that a similar argument can be constructed if x is a negative integer.

Example: For total number of steps is 2, the net displacement must be one of the three possibilities: (1) two steps to the left, (2) back to the start, (3) or two steps to the right. These correspond to values of x = -2, 0,+2. Clearly it is impossible to get more than two units away from the origin if you take only two steps and it is equally impossible to end up exactly one unit from the origin if you take two steps.

For symmetric case ($p=\tfrac{1}{2}$), starting from the origin, there are $2^n$ different paths of length n since there is a choice of right or left move at each step. Since the number of steps in the right direction must be $\tfrac{1}{2}(n+x)$ and the total number of paths must be the number of ways in which $\frac{1}{2}(n+x)$ can be chosen from n: that is
$N_{n,x}=\binom{n}{\tfrac{1}{2}(n+x)}$
provided that $\tfrac{1}{2}(n+x)$ is an integer.

By counting rule, the probability that the walk ends at x after n steps is given by the ratio of this number and the total number of paths (since all paths are equally likely). Hence
$v_{n,x}=\frac{N_{n,x}}{2^n}=\binom{n}{\tfrac{1}{2}(n+x)}\frac{1}{2^n}$
The probability $v_{n,x}$ is the probability that the walk ends at state x after n steps: the walk could have overshot x before returning there.

## Related Probability/ First Passage through x

A related probability is the probability that the first visit to position x occurs at the nth step. The following is descriptive derivation of the associated probability generating function of the symmetric random walk in which the walk starts at the origin, and we consider the probability that it returns to the origin.

From equation (4), the probability that a walk is at the origin at step n is
\begin{align*}
v_{n,x}&=\binom{n}{\frac{1}{2}(n+x)}p^{\frac{1}{2}(n+x)}q^{n-\frac{1}{2}(n+x)}\\
&=\binom{n}{\tfrac{1}{2}(n+0)} \left(\frac{1}{2}\right)^{\tfrac{1}{2}n} \left(\frac{1}{2}\right)^{\tfrac{1}{2}n}\\
\end{align*}
Here $p_n$ is the probability that after n steps the position of walker is at origin. We also assume that $p_n=0$ if n is odd. From equation (5) we can construct a generating function.
\begin{align*}
H(s)&=\sum_{n=0}^\infty p_n s^n\\
&=\sum_{n=0}^\infty p_{2n}s^{2n}=\sum_{n=0}^\infty \frac{1}{2^{2n}}\binom{2n}{n}s^{2n} \tag{6}
\end{align*}
Note that $p_0=1$, and H(s) is not a probability generating function since $H(1)\neq1$.

The binomial coefficient can be re-arranged as follows:
\begin{align*}
\binom{2n}{n}&=\frac{(2n)!}{n!n!}=\frac{2n(2n-1)(2n-2)\cdots3.2.1}{n!n!}\\
&=\frac{2^nn!(2n-1)(2n-3)\cdots3.2.1}{n!n!}\\
&=\frac{2^{2n}}{n!}\frac{1}{2}\frac{3}{2}\cdots(n-\tfrac{1}{2})\\
&=(-1)^n \binom{-\tfrac{1}{2}}{n}2^{2n} \tag{7}
\end{align*}
Using equation (6) in (7)
$H(s)=\sum_{n=0}^\infty \frac{1}{2^{2n}}(-1)^n \binom{-\tfrac{1}{2}}{n}s^{2n}2^{2n}=(1-s^2)^{-\tfrac{1}{2}} \tag{8}$
by binomial theorem, provided $|s|<1$. Note that this expansion guarantees that $p_n=0$ if n is odd.

Note that the equation (8) does not sum to one. This is called defective distribution which still gives the probability that the walk is at the origin at step n.

We can estimate the behaviour of $p_n$ for large n by using Stirling’s Formula (asymptotic estimate for n! for large n), $n!\approx\sqrt{2\pi} n^{n+\tfrac{1}{2}}e^{-n}$

From equation (5)
\begin{align*}
p_{2n}&=\frac{1}{2^{2n}}\binom{2n}{n}=\frac{1}{2^{2n}}\frac{(2n)!}{n!n!}\\
&\approx\frac{1}{2^{2n}}\frac{\sqrt{2\pi}(2n)^{2n+\tfrac{1}{2}}e^{-2n}}{[\sqrt{2\pi}(n^{n+\tfrac{1}{2}}e^{-n})]^2}\\
&=\frac{1}{\sqrt{\pi n}}; \qquad \text{for large $n$}
\end{align*}
Hence $np_n\rightarrow \infty$ confirming that the series $\sum\limits_{n=0}^\infty p_n$ must diverge.

Random Walk Probability of Returning to Origin after n Steps some EXAMPLES

Example: Consider a random walk starts from x_0=0 find the probability that after 5 steps the position is 3. i.e. $X_5=3$, p=0.6.

Solution: Here number of steps are n=5 and position is x=3. Therefore positive and negative steps are

$R_n= \frac{1}{2}(n+x)=\frac{1}{2}(5+3)=4$ and $X_n=R_n-L_n \Rightarrow 3=4+L_n=1$
The probability that the event $X_5=3$ will occur in a random walk with $p=0.6$ is
$P(X_5=3)=\binom{5}{\frac{1}{2}(5+3)}(0.6)^{\tfrac{1}{2}(3+5)}(0.4)^{\tfrac{1}{2}(5-3)}=0.2592$

## Simple Random Walk: Unrestricted Random Walk

A simple random walk on a line or in one dimension occurs with probability $p$ when walker step forward (+1) and/or has probability q=1-p if walker steps back (-1). For ith step, the modified Bernoulli random variable Wi (takes the value +1 or -1 instead of {0,1}) is observed and the position of the walk at the nth step can be found by
\begin{align}
X_n&=X_0+W_1+W_2+\cdots+W_n\nonumber\\
&=X_0+\sum_{i=1}^nW_i\nonumber\\
&=X_{n-1}+W_n
\end{align}
In the gambler’s ruin problems X0=k, but here we assume (without loss of generality) that walks starts from the origin so that Xo=0.

Several derived results for random walks are restricted by boundaries. We consider here random walks without boundaries called unrestricted random walks. We are interested in

1. The position of the walk after a number of steps and
2. The probability of a return to the origin, the start of the walker.

From equation (1) the position of the walker at step n simply depends on the position at (n-1)th step, because the simple random walk possesses the Markov property (the current state of the walk depends on its immediate previous state, not on the history of the walks up to the present state)

Furthermore Xn=Xn-1 ±1

and the transition probabilities from one position to another is
P(Xn=j|Xn-1=j-1)=p, and P(Xn=j|Xn-1=j+1)=q are independent of the number of plays in the game or steps in the represented by n.

Mean and Variance of Xn can be calculate as:
\begin{align*}
E(X_n)&=E\left(X_0+\sum_{i=1}^n W_i\right)\\
&=E\left(\sum_{i=1}^n W_i\right)=nW_n\\
V(X_n)&=V\left(\sum_{i=1}^n W_i\right)=nV(W)
\end{align*}
Since Wi are independent and identically distributed (iid) random variables and where W is the common or typical Bernoulli random variable in the sequence {Wi}. Thus
\begin{align*}
E(W)&=1.p+(-1)q=p-q\\
V(W)&=E(W^2)-[E(W)]^2\\
&=1^2p+(-1)^2q-(p-q)^2\\
&=p+q-(p^2+q^2-2pq)\\
&=1-p^2-q^2+2pq\\
&=1-p^2-(1-p)^2+2pq\\
&=1-p^2-(1+p^2-2p)+2pq\\
&=1-p^2-1-p^2+2p+2pq\\
&=-2p^2+2p+2pq\\
&=2p(1-p)+2pq=4pq
\end{align*}
So the probability distribution of the position of the random walk at stage n has mean E(Xn)=n(p-q) and variance V(Xn)=4npq.

For symmetric random walk (where p=½) after n steps, the expected position is the origin, and it yields the maximum value of V(Xn)=4npq=4np(1-p).

If p>½ then a drift is expected away from the origin in a positive direction and if p<½ it would be expected that the drift would be in the negative direction.

Since V(Xn) is proportional to $n$, it grows with increasing n, and we would be increasingly uncertain about the position of the walker as n increases.
i.e.
\begin{align*}
\frac{\partial V(X_n)}{\partial p}&=\frac{\partial}{\partial p} {4npq}\\
&=\frac{\partial}{\partial p} \{4np-4np^2 \}=4n-8np \quad \Rightarrow p=\frac{1}{2}
\end{align*}
Just knowing the mean and standard deviation of a random variable does not enable us to identify its probability distribution. But for large n, we can apply the CLT.
$Z_n=\frac{X_n-n(p-q)}{\sqrt{4npq}}\thickapprox N(0,1)$
Applying continuity correction, approximate probabilities may be obtained for the position of the walk.

Example : Consider unrestricted random walk with n=100, p=0.6 then
\begin{align*}
E(X_n)&=E(X_{100})=nE(W)=n(p-q)\\
&=100(0.6-0.4)=20\\
V(X_n)&=4npq=4\times 100\times 0.6 \times 0.4=96
\end{align*}
The position of walk at the 100th step between 15 and 25 pace/step from the origin is
$P(15\leq X_{100}\leq30)\thickapprox P(14.5<X_{100}<25.5)$
$-\frac{5.5}{\sqrt{96}}<Z_{100}=\frac{X_{100}-20}{\sqrt{96}}<\frac{5.5}{96}$
hence
$P(-0.5613<Z_{100}<0.5613)=\phi(0.5613)-\phi(-0.5613)=0.43$
where Φ(Z) is the standard normal distribution function.

## Some Basic Definitions (Stochastic Processes Introduction)

Experiment: Any activity or situation having uncertain outcome.

Sample Space:  The set of all possible outcomes is called sample space and every element ω of Ω is called sample point. In Stochastic process we will call it as state space.

Event and Event Space:  An event is a subset of the sample space. The class of all events associated with a given experiment is defined to be the event space.

An event will always be a subset of the sample space, but for sufficiently large sample spaces, not all subsets will be events. Thus the class of all subsets of the sample space will not necessarily correspond to the event space.

Random Variable:
A random variable is a mapping function which assigns outcomes of a random experiment to real numbers. Occurrence of the outcome follows certain probability distribution. Therefore, a random variable is completely characterized by its probability density function (PDF). Or

A random variable is a map X: Ω→R  such that {X ≤ x}={ω ε Ω : x(ω) ≤ x}ε F  for all x ε R.

Probability Space:  A probability space consists $(\Omega, \mathfrak{F}, P)$ of three parts, sample space, a collection of events and a probability measure.

Cumulative Distribution Function (CDF): Probability distribution function for the random variable X such that F(a)=P{X ≤ a }

Time: A point of time either discrete or continuous

State: It describe the attribute of a system at some point in time S=(s1,s2,…,st)

It is convenient to assign some unique non-negative integer as index to each possible value of the state vector S.

Activity: Something that takes some amount of time (duration) to occur. Activity culminates in an event.

Transition:  Transition is caused by an event and it results in some movement from one state to another state.

Probability Measure:  A probability measure intends to be a function defined for all subsets of Ω.

What is Stochastic Process?

The word stochastic is derived from the Greek word “stoΩ’kæstIk” meaning “to aim at a target”. Stochastic processes involves state which changes in a random way.
Given a probability space $(\Omega, \mathfrak{F}, P)$  stochastic process {X(t), t ε T} is a family of random variables, where the index set T may be discrete (T={0,1,2,…}) or continuous (T=[0, ∞)). The set of possible values which random variables {X(t), t ε T} may assume is called the state space of the process, and denoted by S. A continuous time stochastic process {X(t), t ε T}; (T=[0,∞)) is said to have independent increment of for all choices of {t1,t2,…,tn}, the n random variables X(t1)-X(t0), X(t2)-X(t1),…,X(tn)-X(tn-1) are independent. Using discrete time the state of the process at time n+1 depends only on its state at time n.

It is often used to represent the evolution of some random value or system over time. Examples of processes modeled as stochastic time series include stock market and exchange rate fluctuations, signals such as speech, audio and video, medical data such as a patient’s EKG, EEG, blood pressure or temperature, random movement such as Brownian motion or random walks, counting process, Renewal process, Poisson process and Markov process.