# Basic Statistics and Data Analysis

## Binomial Probability Distribution

A statistical experiment that consists of successive independent trials, each with two possible outcomes (such as success and failure, true and false, yes and no, or right and wrong), with the same probability of success on every trial, and that is repeated a fixed number of times (say $n$ times), is called a Binomial Experiment. Each trial of a Binomial Experiment is known as a Bernoulli trial (a trial is a single performance of an experiment), for example a single toss of a coin. A Binomial Experiment has four properties.

1. Each trial of Binomial Experiment can be classified as success or failure.
2. The probability of success for each trial of the experiment is equal.
3. Successive trials are independent, that is, the occurrence of one outcome in an experiment does not affect occurrence of the other.
4. The experiment is repeated a fixed number of times.

## Binomial Probability Distribution

Let $X$ be a discrete random variable that denotes the number of successes in a Binomial Experiment (we call this a binomial random variable). The random variable assumes the isolated values $X=0,1,2,\cdots,n$. The probability distribution of a binomial random variable is termed the binomial probability distribution. It is a discrete probability distribution.

## Binomial Probability Mass Function

The probability function of the binomial distribution is also called the binomial probability mass function and is denoted by $b(x, n, p)$, that is, the binomial distribution of a random variable $X$ with parameters $n$ (the given number of trials) and $p$ (the probability of success). If $p$ is the probability of success (so that $q=1-p$ is the probability of failure and $p+q=1$), then the probability of exactly $x$ successes can be found from the following formula,

\begin{align}
b(x, n, p) &= P(X=x)\\
&=\binom{n}{x} p^x q^{n-x}, \quad x=0,1,2, \cdots, n
\end{align}

where $p$ is probability of success of a single trial, $q$ is probability of failure and $n$ is number of independent trials.

The formula gives the probability of each possible value of the binomial random variable $X$ for any combination of $n$ and $p$. Note that it assigns no probability to $X<0$ or $X>n$. The binomial distribution is suitable when $n$ is small, and it applies when sampling is done with replacement.
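As a rough check of the formula, the sketch below evaluates $b(x, n, p)$ directly and compares it with R's built-in `dbinom()`. The helper name `binom_pmf` and the values $n=5$, $p=0.3$ are illustrative assumptions, not part of the original text.

```r
# Binomial pmf written directly from b(x, n, p) = choose(n, x) * p^x * q^(n - x)
# (binom_pmf is a hypothetical helper name used only for this illustration)
binom_pmf <- function(x, n, p) {
  q <- 1 - p
  choose(n, x) * p^x * q^(n - x)
}

n <- 5      # illustrative number of trials
p <- 0.3    # illustrative probability of success
x <- 0:n    # all values the binomial random variable can take

manual  <- binom_pmf(x, n, p)
builtin <- dbinom(x, size = n, prob = p)

print(round(manual, 5))
print(all.equal(manual, builtin))  # TRUE when both computations agree
print(sum(manual))                 # the probabilities over x = 0, ..., n add up to 1
```

For instance, `binom_pmf(2, 5, 0.3)` is $\binom{5}{2}(0.3)^2(0.7)^3 = 0.3087$.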

$b(x, n, p) = \binom{n}{x} p^x q^{n-x}, \quad x=0,1,2,\cdots,n,$

is called the Binomial distribution because its successive terms are exactly the same as those of the binomial expansion of

\begin{align}
(q+p)^n=\binom{n}{0} p^0 q^{n-0}+\binom{n}{1} p^1 q^{n-1}+\cdots+\binom{n}{n-1} p^{n-1} q^{n-(n-1)}+\binom{n}{n} p^n q^{n-n}
\end{align}

$\binom{n}{0}, \binom{n}{1}, \binom{n}{2},\cdots, \binom{n}{n-1}, \binom{n}{n}$ are called Binomial coefficients.
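
As a small worked instance of the expansion (the value $n=3$ is chosen here only for illustration):

\begin{align}
(q+p)^3 &= \binom{3}{0} p^0 q^{3}+\binom{3}{1} p^1 q^{2}+\binom{3}{2} p^2 q^{1}+\binom{3}{3} p^3 q^{0}\\
&= q^3+3pq^2+3p^2q+p^3
\end{align}

With $p=q=\frac{1}{2}$ the four terms are $\frac{1}{8}, \frac{3}{8}, \frac{3}{8}, \frac{1}{8}$, which are $P(X=0), P(X=1), P(X=2), P(X=3)$ and add up to $(q+p)^3=1$.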

Note that it is necessary to state the range of the random variable; otherwise the expression is only a mathematical equation, not a probability distribution.

# Covariance and Correlation

Covariance measures the degree to which two variables co-vary (i.e. vary or change together). If the greater values of one variable (say, $X_i$) correspond with the greater values of the other variable (say, $X_j$), i.e. if the variables tend to show similar behaviour, then the covariance between the two variables ($X_i$, $X_j$) will be positive. Similarly, if the smaller values of one variable correspond with the smaller values of the other variable, then the covariance between the two variables will be positive. In contrast, if the greater values of one variable (say, $X_i$) mainly correspond to the smaller values of the other variable (say, $X_j$), i.e. the variables tend to show opposite behaviour, then the covariance will be negative.

In other words, a positive covariance between two variables means that they vary together in the same direction relative to their expected values (averages): if one variable moves above its average value, the other variable tends to be above its average value as well. Similarly, if the covariance between the two variables is negative, then when one variable is above its expected value the other tends to be below its expected value. If the covariance is zero, there is no linear dependency between the two variables. Mathematically, the covariance between two random variables $X_i$ and $X_j$ can be represented as
$COV(X_i, X_j)=E[(X_i-\mu_i)(X_j-\mu_j)]$
where
$\mu_i=E(X_i)$ is the average of the first variable
$\mu_j=E(X_j)$ is the average of the second variable

\begin{aligned}
COV(X_i, X_j)&=E[(X_i-\mu_i)(X_j-\mu_j)]\\
&=E[X_i X_j - X_i E(X_j)-X_j E(X_i)+E(X_i)E(X_j)]\\
&=E(X_i X_j)-E(X_i)E(X_j) - E(X_j)E(X_i)+E(X_i)E(X_j)\\
&=E(X_i X_j)-E(X_i)E(X_j)
\end{aligned}

Note that the covariance of a random variable with itself is the variance of that random variable, i.e. $COV(X_i, X_i)=VAR(X_i)$. If $X_i$ and $X_j$ are independent, then $E(X_i X_j)=E(X_i)E(X_j)$ and $COV(X_i, X_j)=E(X_i X_j)-E(X_i) E(X_j)=0$.
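
The two forms of the covariance can be compared numerically. The following is a minimal sketch on made-up data (the vectors `x` and `y` are assumptions for illustration). Expectations are replaced by averages over the five observations, so R's `cov()`, which divides by $n-1$, is rescaled by $(n-1)/n$ before comparison.

```r
# Made-up paired observations, used only to illustrate the two formulas
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Definition: average product of deviations from the means
cov_def <- mean((x - mean(x)) * (y - mean(y)))

# Shortcut form: E(XY) - E(X)E(Y)
cov_short <- mean(x * y) - mean(x) * mean(y)

# cov() uses the divisor n - 1, so rescale it to the divisor n
cov_r <- cov(x, y) * (n - 1) / n

print(c(cov_def, cov_short, cov_r))   # all three are 3.2 for these data

# Covariance of a variable with itself is its variance
print(all.equal(cov(x, x), var(x)))
```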

## Covariance and Correlation

Correlation and covariance are related, but not equivalent, statistical measures. The correlation between two variables (say, $X_i$ and $X_j$) is their normalized covariance, defined as
\begin{aligned}
\rho_{i,j}&=\frac{E[(X_i-\mu_i)(X_j-\mu_j)]}{\sigma_i \sigma_j}\\
&=\frac{n \sum XY - \sum X \sum Y}{\sqrt{(n \sum X^2 -(\sum X)^2)(n \sum Y^2 - (\sum Y)^2)}}
\end{aligned}
where $\sigma_i$ is the standard deviation of $X_i$ and $\sigma_j$ is the standard deviation of $X_j$. The second line is the computational form for $n$ paired observations, written with $X$ for $X_i$ and $Y$ for $X_j$.

Note that correlation is dimensionless, i.e. a number that is free of measurement units, and its value lies between -1 and +1 inclusive. In contrast, covariance has a unit of measure: the product of the units of the two variables.
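
A short sketch of both points, again on made-up data (`x` and `y` are illustrative assumptions): the computational formula reproduces `cor()`, and rescaling one variable changes the covariance but not the correlation.

```r
# Made-up paired observations, used only for illustration
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Computational (sum-based) formula for the correlation coefficient
r_sums <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

# Normalized covariance: covariance divided by both standard deviations
r_norm <- cov(x, y) / (sd(x) * sd(y))

print(c(r_sums, r_norm, cor(x, y)))   # all approximately 0.8 for these data

# Changing the unit of x (e.g. metres to centimetres) rescales the covariance ...
print(cov(100 * x, y) / cov(x, y))    # ratio is approximately 100
# ... but leaves the correlation unchanged
print(cor(100 * x, y))
```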

# Introduction to Odds Ratio

Medical students, students of the clinical and psychological sciences, professionals allied to medicine who want to improve their understanding of the medical literature, and researchers from many other fields usually encounter the Odds Ratio (OR) throughout their careers.

The odds ratio is a relative measure of effect that allows the intervention group of a study to be compared with the comparison or placebo group. The Odds Ratio is computed as a quotient:

• The numerator is the odds in the intervention arm
• The denominator is the odds in the control or placebo arm

If the outcome is the same in both groups, the ratio will be 1, implying that there is no difference between the two arms of the study. When the outcome is an undesirable event, an OR>1 means that the control group is better than the intervention group, while an OR<1 means that the intervention group is better than the control group (for a desirable outcome the interpretation is reversed).

The ratio of the probability of success to the probability of failure is known as the odds. If the probability of an event is $p_1$, then the odds are:
$\text{Odds}=\frac{p_1}{1-p_1}$

The Odds Ratio is the ratio of two odds and can be used to quantify how strongly a factor is associated with the response in a given model. If the probabilities of occurrence of an event are $p_1$ (for the first group) and $p_2$ (for the second group), then the OR is:
$OR=\frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}}$
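
The computation can be sketched on a hypothetical two-arm study; the counts below are invented purely for illustration.

```r
# Hypothetical counts (not from any real study)
events_int   <- 15    # events in the intervention arm
total_int    <- 100
events_ctrl  <- 30    # events in the control arm
total_ctrl   <- 100

p1 <- events_int / total_int      # probability of the event, intervention arm
p2 <- events_ctrl / total_ctrl    # probability of the event, control arm

odds1 <- p1 / (1 - p1)            # odds in the intervention arm
odds2 <- p2 / (1 - p2)            # odds in the control arm

odds_ratio <- odds1 / odds2
print(c(odds1 = odds1, odds2 = odds2, OR = odds_ratio))
```

Here the odds are $0.15/0.85 \approx 0.176$ and $0.30/0.70 \approx 0.429$, so the OR is about 0.41: the odds of the event in the intervention arm are less than half of those in the control arm.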

If the predictors are binary, then the OR for the $i$th factor is defined as
$OR_i=e^{\beta_i}$
where $\beta_i$ is the regression coefficient of the $i$th factor.

The regression coefficient $b_1$ from logistic regression is the estimated increase in the log odds of the dependent variable per unit increase in the value of the independent variable. In other words, the exponential function of the regression coefficient $(e^{b_1})$ is the OR associated with a one-unit increase in the independent variable.
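
The link between the logistic regression coefficient and the OR can be illustrated with `glm()`. This is a minimal sketch on hypothetical grouped counts (the variable names and numbers are assumptions): with a single binary predictor, $e^{b_1}$ should reproduce the OR computed directly from the two odds.

```r
# Hypothetical grouped data: one row per study arm
arm        <- c(1, 0)        # 1 = intervention, 0 = control
events     <- c(15, 30)      # subjects with the event
non_events <- c(85, 70)      # subjects without the event

# Logistic regression of the event on the binary predictor `arm`
fit <- glm(cbind(events, non_events) ~ arm, family = binomial)

b1       <- unname(coef(fit)["arm"])   # estimated change in log odds
or_glm   <- exp(b1)                    # OR from the regression coefficient
or_table <- (15 / 85) / (30 / 70)      # OR from the two odds directly

print(c(b1 = b1, or_glm = or_glm, or_table = or_table))
```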

## Binomial Random number Generation in R

Here we will learn how to generate Bernoulli or Binomial random numbers in R, using the flip of a coin as an example. This tutorial is based on how to generate random numbers according to different statistical distributions in R. Our focus is on binomial random number generation in R.

We know that in a Bernoulli distribution something either happens or it does not; a coin flip, for example, has two outcomes, head or tail (either a head occurs or it does not, i.e. a tail occurs). For an unbiased coin there is a 50% chance of a head and a 50% chance of a tail in the long run. To generate random numbers that follow a binomial distribution in R, use the rbinom(n, size, prob) command.

The rbinom(n, size, prob) command has three parameters, namely

• n is the number of observations
• size is the number of trials (it may be zero or more)
• prob is the probability of success on each trial, for example 1/2

Some Examples

• One coin is tossed 10 times with probability of success=0.5; the coin is fair (unbiased, as p=1/2)
> rbinom(n=10, size=1, prob=1/2)
OUTPUT (random, so it varies between runs): 1 1 0 0 1 1 1 1 0 1
• Two coins are tossed 10 times with probability of success=0.5
> rbinom(n=10, size=2, prob=1/2)
OUTPUT (random, so it varies between runs): 2 1 2 1 2 0 1 0 0 1
• One coin is tossed one hundred thousand times with probability of success=0.5
> rbinom(n=100000, size=1, prob=1/2)
• Five coins are tossed one hundred thousand times; store the simulation results in the vector $x$
> n <- 100000
> x <- rbinom(n=n, size=5, prob=1/2)
count the total number of successes (heads) in the x vector
> sum(x)
find the frequency distribution
> table(x)
create a relative frequency table (in percent)
> pct <- table(x)/n * 100
plot the relative frequency distribution
> plot(table(x)/n, ylab="Probability", main="size=5, prob=0.5")
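
As a rough check on such a simulation, the relative frequencies can be compared with the theoretical binomial probabilities from `dbinom()`. The sketch below assumes a fixed seed (so the run is repeatable) and the same size=5, prob=1/2 setting; the variable names are illustrative.

```r
set.seed(123)                  # fixed seed, an assumption made for repeatability
n_obs <- 100000                # number of simulated observations
x <- rbinom(n = n_obs, size = 5, prob = 1/2)

# Relative frequency of each possible value 0, 1, ..., 5
# (factor() keeps a value in the table even if it never occurs)
rel_freq <- as.numeric(table(factor(x, levels = 0:5))) / n_obs

# Theoretical binomial probabilities for the same values
theory <- dbinom(0:5, size = 5, prob = 1/2)

print(round(cbind(value = 0:5, simulated = rel_freq, theoretical = theory), 4))
print(max(abs(rel_freq - theory)))   # small when the simulation agrees with theory
```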


## Non Central Chi Squared Distribution

The Non Central Chi Squared Distribution is a generalization of the Chi Squared Distribution.
If $Y_{1} ,Y_{2} ,\cdots ,Y_{n}$ are independent with $Y_{i} \sim N(0,1)$, then $y_{i}^{2} \sim \chi _{(1)}^{2}$ and $\sum y_{i}^{2} \sim \chi _{(n)}^{2}$.

If the mean ($\mu$) is non-zero, then $y_{i} \sim N(\mu _{i} ,1)$, i.e. each $y_{i}$ has a different mean, and
\begin{align*}
\Rightarrow  & \qquad y_i^2 \sim \chi^2_{(1,\frac{\mu_i^2}{2})} \\
\Rightarrow  & \qquad \sum y_i^2 \sim \chi^2_{(n,\frac{\sum \mu_i^2}{2})} =\chi_{(n,\lambda )}^{2}
\end{align*}

Note that if $\lambda =0$ then we have the central $\chi ^{2}$. If $\lambda \ne 0$ then it is the non central chi squared distribution, because the underlying normal variables are not centred at zero (they are not standard normal).

Central Chi-Square Distribution: $f(x)=\frac{1}{2^{\frac{n}{2}} \Gamma\left(\frac{n}{2}\right)} x^{\frac{n}{2} -1} e^{-\frac{x}{2} }; \qquad 0<x<\infty$
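
The construction can be illustrated by simulation. The sketch below uses hypothetical means $\mu_i$ and a fixed seed (both are assumptions), and it relies on the known fact that the mean of this distribution is $n+\sum \mu_i^2 = n+2\lambda$. Note that R's `ncp` argument is $\sum \mu_i^2$, which equals $2\lambda$ under the definition of $\lambda$ used here.

```r
set.seed(1)                    # fixed seed, an assumption made for repeatability
mu     <- c(1, 0.5, -1)        # hypothetical means mu_i
n      <- length(mu)           # degrees of freedom
lambda <- sum(mu^2) / 2        # non-centrality parameter as defined in the text
reps   <- 100000               # number of simulated values of sum(y_i^2)

# Each row holds one draw of (y_1, ..., y_n) with y_i ~ N(mu_i, 1)
y <- matrix(rnorm(reps * n, mean = mu, sd = 1), ncol = n, byrow = TRUE)
w <- rowSums(y^2)

print(mean(w))                 # close to n + 2 * lambda
print(n + 2 * lambda)          # theoretical mean, 5.25 for these means

# Simulated median against the quantile function with ncp = 2 * lambda
print(c(median(w), qchisq(0.5, df = n, ncp = 2 * lambda)))
```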

## Theorem:

If $Y_{1} ,Y_{2} ,\cdots ,Y_{n}$ are independent normal random variables with $E(y_{i} )=\mu _{i}$ and $V(y_{i} )=1$, then $w=\sum y_{i}^{2}$ is distributed as non central chi square with $n$ degrees of freedom and non-centrality parameter $\lambda$, where $\lambda =\frac{\sum \mu _{i}^{2} }{2}$, and has pdf

\begin{align*}
f(w)=e^{-\lambda } \sum _{i=0}^{\infty }\left[\frac{\lambda ^{i} w^{\frac{n+2i}{2} -1} e^{-\frac{w}{2} } }{i!\, 2^{\frac{n+2i}{2} } \Gamma\left(\frac{n+2i}{2}\right) } \right]\qquad 0\le w< \infty
\end{align*}
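
The series can be evaluated numerically and compared with R's `dchisq()`. This is a sketch under stated assumptions: the helper name `ncchisq_pdf`, the truncation after 100 terms, and the example values are all illustrative, and it requires $\lambda>0$ and $w>0$. R parameterizes the non-centrality as `ncp` $=\sum \mu_i^2$, so `ncp = 2 * lambda` corresponds to the $\lambda$ of the theorem.

```r
# Truncated version of the series pdf in the theorem
# (ncchisq_pdf is a hypothetical helper; valid for lambda > 0 and w > 0)
ncchisq_pdf <- function(w, n, lambda, terms = 100) {
  i <- 0:(terms - 1)
  k <- (n + 2 * i) / 2
  sapply(w, function(wj) {
    # each term is built on the log scale to avoid overflow in i! and Gamma(k)
    log_term <- -lambda + i * log(lambda) - lgamma(i + 1) +
      (k - 1) * log(wj) - wj / 2 - k * log(2) - lgamma(k)
    sum(exp(log_term))
  })
}

n      <- 3                    # illustrative degrees of freedom
lambda <- 1.125                # illustrative non-centrality (sum(mu^2) / 2)
w      <- c(0.5, 2, 5, 10)     # points at which to evaluate the density

series  <- ncchisq_pdf(w, n, lambda)
builtin <- dchisq(w, df = n, ncp = 2 * lambda)

print(cbind(w, series, builtin))
print(all.equal(series, builtin, tolerance = 1e-6))
```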

## Proof:

Consider the moment generating function of $w=\sum y_{i}^{2}$

\begin{align*}
M_{w} (t)=E(e^{wt} )=E(e^{t\sum y_{i}^{2}  } ); \qquad \text{ where } y_{i} \sim N(\mu _{i} ,1)
\end{align*}

By definition, with $K_{1} =\left(\frac{1}{\sqrt{2\pi } } \right)^{n} e^{-\frac{\sum \mu _{i}^{2} }{2} }$ collecting the factors that do not depend on the $y_{i}$,
\begin{align*}
M_{w} (t) &= \int \cdots \int e^{t\sum y_{i}^{2} } f(y_{1} ,\cdots ,y_{n} )\, dy_{1} \cdots dy_{n} \\
&= K_{1} \int \cdots \int e^{-\frac{1}{2} (1-2t)\left[\sum y_{i}^{2} -\frac{2\sum y_{i} \mu _{i} }{1-2t} \right]}   dy_{1} \, dy_{2} \cdots dy_{n} \\
&\text{By completing the square}\\
& =K_{1} \int \cdots \int e^{-\frac{1}{2} (1-2t)\sum \left[\left[y_{i} -\frac{\mu _{i} }{1-2t} \right]^{2} -\frac{\mu _{i}^{2} }{(1-2t)^{2} } \right]}   dy_{1} \, dy_{2} \cdots dy_{n} \\
&= e^{-\frac{\sum \mu_{i}^{2} }{2} \left(1-\frac{1}{1-2t} \right)} \int \cdots \int \left(\frac{1}{\sqrt{2\pi } } \right)^{n} \frac{\frac{1}{\left(\sqrt{1-2t} \right)^{n} } }{\frac{1}{\left(\sqrt{1-2t} \right)^{n} } }  \, e^{-\frac{1}{2\cdot \frac{1}{1-2t} } \sum \left(y_{i} -\frac{\mu _{i} }{1-2t} \right)^{2} }  dy_{1} \, dy_{2} \cdots dy_{n}\\
&=e^{-\frac{\sum \mu _{i}^{2} }{2} \left(1-\frac{1}{1-2t} \right)} \cdot \frac{1}{\left(\sqrt{1-2t} \right)^{n} } \int \cdots \int \left(\frac{1}{\sqrt{2\pi } } \right)^{n}  \frac{1}{\left(\sqrt{\frac{1} {1-2t}} \right)^n} e^{-\frac{1}{2\cdot \frac{1}{1-2t} } \sum \left(y_{i} -\frac{\mu_i}{1-2t}\right)^{2} } dy_{1} \, dy_{2} \cdots dy_{n}
\end{align*}

where

$\int_{-\infty}^{\infty } \cdots \int _{-\infty }^{\infty }\left(\frac{1}{\sqrt{2\pi}} \right)^{n} \frac{1}{\left(\frac{1}{1-2t} \right)^{\frac{n}{2}}} e^{-\frac{1}{2\cdot \frac{1}{1-2t} } \sum \left(y_{i} -\frac{\mu _{i} }{1-2t} \right)^{2} } dy_{1} \, dy_{2} \cdots dy_{n} =1$
because it is the integral of a complete density (that of $n$ independent normal variables with means $\frac{\mu _{i} }{1-2t}$ and common variance $\frac{1}{1-2t}$). Hence

\begin{align*}
M_{w}(t)&=e^{-\frac{\sum \mu_i^2}{2} \left(1-\frac{1}{1-2t}\right)} .\left(\frac{1}{\sqrt{1-2t} } \right)^{n} \\
&=\left(\frac{1}{\sqrt{1-2t}}\right)^{n} e^{-\lambda \left(1-\frac{1}{1-2t} \right)} \\
&=e^{-\lambda }.e^{\frac{\lambda}{1-2t}} \frac{1}{(1-2t)^{\frac{n}{2}}}\\
&=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} }{i!(1-2t)^{i} (1-2t)^{n/2} }\\
M_{w=\sum y_{i}^{2} } (t)&=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} }{i!(1-2t)^{\frac{n+2i}{2} } }\tag{A}
\end{align*}

Now the Moment Generating Function (MGF) of the non-central distribution, computed from the given density function, is
\begin{align*}
M_{\omega} (t) & = E(e^{\omega t} )\\
&=\int _{0}^{\infty }e^{\omega t } e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} \omega ^{\frac{n+2i}{2} -1} e^{-\frac{\omega }{2} } }{i!2^{\frac{n+2i}{2} } \Gamma\left(\frac{n+2i}{2}\right) } d\omega\\
&=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} }{i!2^{\frac{n+2i}{2} } \Gamma\left(\frac{n+2i}{2}\right) }  \int _{0}^{\infty }e^{-\frac{\omega }{2} (1-2t)}  \omega ^{\frac{n+2i}{2} -1} d\omega
\end{align*}
Let
\begin{align*}
\frac{\omega }{2} (1-2t)&=P\\
\Rightarrow \omega & =\frac{2P}{1-2t} \\
\Rightarrow d\omega &=\frac{2dP}{1-2t}
\end{align*}

\begin{align*}
M_{\omega } (t)&=e^{-\lambda } \sum\limits_{i=0}^{\infty }\frac{\lambda ^{i} }{i!2^{\frac{n+2i}{2} } \Gamma\left(\frac{n+2i}{2}\right) }  \int _{0}^{\infty }e^{-P} \left(\frac{2P}{1-2t} \right)^{\frac{n+2i}{2} -1} \frac{2dP}{1-2t}  \\
&=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} 2^{\frac{n+2i}{2} } }{i!2^{\frac{n+2i}{2} } \Gamma\left(\frac{n+2i}{2}\right) (1-2t)^{\frac{n+2i}{2} } } \int _{0}^{\infty }e^{-P} P^{\frac{n+2i}{2} -1}  dP \\
&=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} }{i!\, \Gamma\left(\frac{n+2i}{2}\right) (1-2t)^{\frac{n+2i}{2} } } \Gamma\left(\frac{n+2i}{2}\right)
\end{align*}

as $\int\limits _{0}^{\infty }e^{-P} P^{\frac{n+2i}{2} -1} dP=\Gamma\left(\frac{n+2i}{2}\right)$

\begin{align*}
M_{\omega } (t)=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} }{i!(1-2t)^{\frac{n+2i}{2} } } \tag{B}
\end{align*}

Comparing ($A$) and ($B$)
$M_{w=\sum y_{i}^{2} } (t)=M_{\omega } (t)$
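
The agreement of the two moment generating functions can also be checked numerically. The sketch below (with illustrative values of $n$, $\lambda$ and $t<\frac{1}{2}$, all assumptions) integrates $e^{\omega t}$ against R's non-central chi-square density, using `ncp = 2 * lambda` because R defines `ncp` as $\sum \mu_i^2$, and compares the result with the closed form $(1-2t)^{-n/2} e^{-\lambda \left(1-\frac{1}{1-2t}\right)}$ obtained in the proof.

```r
n      <- 3        # illustrative degrees of freedom
lambda <- 1.125    # illustrative non-centrality parameter (sum(mu^2) / 2)
t_val  <- 0.1      # the MGF exists only for t < 1/2

# Closed form reached in the proof
mgf_closed <- (1 - 2 * t_val)^(-n / 2) *
  exp(-lambda * (1 - 1 / (1 - 2 * t_val)))

# E(exp(t * w)) by numerical integration against the density
integrand   <- function(w) exp(t_val * w) * dchisq(w, df = n, ncp = 2 * lambda)
mgf_numeric <- integrate(integrand, lower = 0, upper = Inf)$value

print(c(closed_form = mgf_closed, numeric = mgf_numeric))
```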

By Uniqueness theorem

$f_{w} (w)=f_{\omega } (\omega )$
\begin{align*}
\Rightarrow \qquad f_{w} (w)&=f(\chi ^{2} )\\
&=e^{-\lambda } \sum _{i=0}^{\infty }\frac{\lambda ^{i} w^{\frac{n+2i}{2} -1} e^{-\frac{w}{2} } }{i!2^{\frac{n+2i}{2} } \Gamma\left(\frac{n+2i}{2}\right) };  \qquad 0\le w< \infty
\end{align*}
is the pdf of the non central chi square distribution with $n$ df, and $\lambda =\frac{\sum \mu _{i}^{2} }{2}$ is the non-centrality parameter. Like the central chi square distribution, the non central chi squared distribution is additive.