Stem and Leaf Plot: Exploratory Data Analysis

Before performing any statistical calculation (even the simplest one), the data should be tabulated or plotted, especially if they are quantitative and few in number, to visualize the shape of the distribution.

A stem and leaf plot is a way of summarizing a set of data measured on an interval scale in condensed form. Stem and leaf plots are often used in exploratory data analysis and help illustrate the different features of the distribution of the observed data. A basic stem and leaf display contains two columns separated by a vertical line: the left side of the vertical line contains the stems, while the right side contains the leaves. It is customary to sort the values within each stem from smallest to largest. In this technique of presenting a set of data, each numerical value is divided into two parts:

  1. Leading Digit(s)
  2. Trailing Digit

The stem values are the leading digit(s) and the leaves are the trailing digits. The stems are located along the vertical axis, and the leaf values are stacked against each other along the horizontal axis.

A stem and leaf display is similar to a frequency distribution but carries more information. It provides information about the symmetry, concentration, empty classes, and outliers of the observed data set. Organizing the data into a frequency distribution has two disadvantages:

  1. The exact identity of each value is lost (the individuality of the observations vanishes).
  2. We cannot tell how the values within each class are distributed.

The advantage of the stem and leaf plot (display) over a frequency distribution is that we do not lose the identity (individuality) of each observation. A stem and leaf plot is also similar to a histogram, but it usually provides more information for a relatively small data set.

More than one data set can be compared using multiple stem and leaf plots. Using a back-to-back stem and leaf plot, we can compare the same characteristic in two different groups.

The origin of the stem and leaf plot is associated with Tukey, J. W. (1977).

Constructing a Stem and Leaf Plot

Consider the following data set: 56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59, 69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73. We want to construct a stem and leaf plot for these data.

First of all, it is better to sort the data. The sorted data are: 56, 59, 64, 65, 69, 70, 71, 73, 76, 77, 78, 80, 82, 82, 85, 86, 91, 91, 92, 92, 95, 98, 99.

Here the first digit of each value is the stem and the second digit is the leaf; since the data range from 56 to 99, the stems run from 5 to 9.

Draw a vertical line separating the stems from the leaves. Put the stem values on the left side of the vertical line (bar) and the leaf values on the right side. Note that each number is assigned to the plot by pairing its units digit, or leaf, with the correct stem. For example, the score 56 is plotted by placing the units digit, 6, to the right of stem 5.

The stem and leaf plot of the above data looks like this:

The decimal point is 1 digit(s) to the right of the |
Stem | Leaf
5      | 6 9
6      | 4 5 9
7      | 0 1 3 6 7 8
8      | 0 2 2 5 6
9      | 1 1 2 2 5 8 9

Rotating the stem and leaf plot anti-clockwise shows that it has the shape of a histogram.

By adding columns of frequency and cumulative frequency to the stem and leaf plot, we can find the median of the data.
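In R, this display can be produced directly with the built-in `stem()` function (the note about the decimal point in the display above is part of `stem()`'s output):

```r
# Data from the example above
scores <- c(56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
            69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73)

# stem() sorts the values, splits each into a stem (tens digit)
# and a leaf (units digit), and prints the display
stem(scores)
```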



F Distribution: Ratios of two Independent Estimators (2013)

The F distribution is a continuous probability distribution (also known as Snedecor's F distribution or the Fisher–Snedecor distribution), named in honor of R. A. Fisher and George W. Snedecor. This distribution arises frequently as the null distribution of a test statistic in hypothesis testing, in the construction of confidence intervals, and in the analysis of variance for the comparison of several population means.

If $s_1^2$ and $s_2^2$ are two unbiased estimates of the population variance $\sigma^2$ obtained from independent samples of sizes $n_1$ and $n_2$, respectively, from the same normal population, then mathematically the F-ratio is defined as
\[F=\frac{s_1^2}{s_2^2}=\frac{\frac{(n_1-1)\frac{s_1^2}{\sigma^2}}{v_1}}{\frac{(n_2-1)\frac{s_2^2}{\sigma^2}}{v_2}}\]
where $v_1=n_1-1$ and $v_2=n_2-1$. Since $\chi_1^2=(n_1-1)\frac{s_1^2}{\sigma^2}$ and $\chi_2^2=(n_2-1)\frac{s_2^2}{\sigma^2}$ are distributed independently as $\chi^2$ with $v_1$ and $v_2$ degrees of freedom respectively, we have
\[F=\frac{\frac{\chi_1^2}{v_1}}{\frac{\chi_2^2}{v_2}}\]

So, the F distribution is the ratio of two independent chi-square ($\chi^2$) statistics, each divided by its respective degrees of freedom.
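This ratio definition is easy to check by simulation in R: draw two independent chi-square variates, divide each by its degrees of freedom, and compare the resulting ratios with the F distribution itself. The degrees of freedom and sample size below are arbitrary choices for illustration:

```r
set.seed(1)
v1 <- 5; v2 <- 10   # arbitrary degrees of freedom
N  <- 1e5           # number of simulated ratios

# Ratio of two independent chi-squares, each divided by its df
ratio <- (rchisq(N, v1) / v1) / (rchisq(N, v2) / v2)

# Sample quantiles of the ratios agree closely with F quantiles
quantile(ratio, c(0.25, 0.50, 0.75, 0.95))
qf(c(0.25, 0.50, 0.75, 0.95), v1, v2)
```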

F Distribution Properties

  • The F distribution takes only non-negative values, since the numerator and denominator of the F-ratio are squared quantities.
  • The range of F values is from 0 to infinity.
  • The shape of the F-curve depends on the parameters $v_1$ and $v_2$ (its numerator and denominator degrees of freedom). It is a non-symmetrical distribution, skewed to the right (positively skewed). It tends to become more and more symmetric as one or both of the parameter values ($v_1$, $v_2$) increase, as the R sketch following this list illustrates.
  • It is asymptotic. As the F values increase, the F-curve approaches the X-axis but never crosses or touches it (behavior similar to that of the normal probability distribution).
  • F has a unique mode at the value \[\tilde{F}=\frac{v_2(v_1-2)}{v_1(v_2+2)},\quad (v_1>2)\] which is always less than unity.
  • The mean of F is $\mu=\frac{v_2}{v_2-2},\quad (v_2>2)$
  • The variance of F is \[\sigma^2=\frac{2v_2^2(v_1+v_2-2)}{v_1(v_2-2)^2(v_2-4)},\quad (v_2>4)\]
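These properties can be inspected numerically in R. The sketch below plots the density and evaluates the mode, mean, and variance formulas for one arbitrary choice of degrees of freedom:

```r
v1 <- 6; v2 <- 12   # arbitrary degrees of freedom for illustration

# Density curve: non-negative support, positively skewed
curve(df(x, v1, v2), from = 0, to = 5, xlab = "F", ylab = "density")

# Mode, mean, and variance from the formulas above
mode_F <- v2 * (v1 - 2) / (v1 * (v2 + 2))   # requires v1 > 2; always < 1
mean_F <- v2 / (v2 - 2)                     # requires v2 > 2
var_F  <- 2 * v2^2 * (v1 + v2 - 2) /
          (v1 * (v2 - 2)^2 * (v2 - 4))      # requires v2 > 4
c(mode = mode_F, mean = mean_F, variance = var_F)
```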

Assumptions of F Distribution

The statistical procedure for comparing the variances of two populations rests on the following assumptions:

  • The two populations (from which the samples are drawn) follow the normal distribution.
  • The two samples are random samples drawn independently from their respective populations.

The statistical procedure for comparing the means of three or more populations (the analysis of variance) rests on the following assumptions:

  • The populations follow the normal distribution.
  • The populations have equal standard deviations $\sigma$.
  • The populations are independent of each other.

Note

This distribution is relatively insensitive to violations of the assumptions of normality of the parent population or the assumption of equal variances.

Use of F Distribution Table

For a given (specified) level of significance $\alpha$, the symbol $F_\alpha(v_1,v_2)$ is used to represent the upper (right-hand side) $100\alpha\%$ point of an F distribution having $v_1$ and $v_2$ degrees of freedom.

The lower (left-hand side) percentage point can be found by taking the reciprocal of the F-value corresponding to the upper (right-hand side) percentage point, with the degrees of freedom interchanged, i.e., \[F_{1-\alpha}(v_1,v_2)=\frac{1}{F_\alpha(v_2,v_1)}\]
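This reciprocal relation can be verified with R's F quantile function `qf()`; the values of $\alpha$, $v_1$, and $v_2$ below are arbitrary:

```r
alpha <- 0.05; v1 <- 5; v2 <- 10

# Upper 100*alpha% point F_alpha(v1, v2): P(F > F_upper) = alpha
F_upper <- qf(1 - alpha, v1, v2)

# Lower point F_{1-alpha}(v1, v2) via the reciprocal rule,
# with the degrees of freedom interchanged
F_lower <- qf(alpha, v1, v2)
all.equal(F_lower, 1 / qf(1 - alpha, v2, v1))   # TRUE
```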

The probability density for the variable F is given by
\[Y=k\,F^{\frac{v_1}{2}-1}\left(1+\frac{v_1F}{v_2}\right)^{-\frac{v_1+v_2}{2}}\]
where $k$ is a constant such that the total area under the curve is one.


Role of Hat Matrix in Regression Analysis

The post is about the importance and role of the Hat Matrix in Regression Analysis.

The hat matrix is an $n\times n$ symmetric and idempotent matrix with many special properties; it plays an important role in regression diagnostics by transforming the vector of observed responses $Y$ into the vector of fitted responses $\hat{Y}$.

The model is $Y=X\beta+\varepsilon$, with least squares solution $b=(X'X)^{-1}X'Y$, provided that $X'X$ is non-singular. The fitted values are $\hat{Y}=Xb=X(X'X)^{-1}X'Y=HY$.

Like the fitted values $\hat{Y}$, the residuals can be expressed as linear combinations of the response values $Y_i$:

\begin{align*}
e&=Y-\hat{Y}\\
&=Y-HY\\
&=(I-H)Y
\end{align*}
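A short R sketch (with simulated data, purely for illustration) confirms that $HY$ reproduces the fitted values from `lm()` and $(I-H)Y$ the residuals:

```r
set.seed(1)
n <- 20
x <- rnorm(n)
X <- cbind(1, x)              # design matrix with an intercept column
Y <- 2 + 3 * x + rnorm(n)     # simulated response

H     <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X(X'X)^{-1}X'
Y_hat <- H %*% Y                            # fitted values HY
e     <- (diag(n) - H) %*% Y                # residuals (I - H)Y

fit <- lm(Y ~ x)
all.equal(c(Y_hat), unname(fitted(fit)))      # TRUE
all.equal(c(e),     unname(residuals(fit)))   # TRUE
```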

The role of the hat matrix in regression analysis and regression diagnostics is as follows:

  • The hat matrix involves only the observations on the predictor variables $X$, as $H=X(X'X)^{-1}X'$. It plays an important role in diagnostics for regression analysis.
  • The hat matrix plays an important role in determining the magnitude of a studentized deleted residual, and therefore in identifying outlying $Y$ observations.
  • The hat matrix is also helpful in directly identifying outlying $X$ observations.
  • In particular, the diagonal elements of the hat matrix indicate, in a multi-variable setting, whether or not a case is outlying with respect to its $X$ values.
  • The elements of the hat matrix always have values between 0 and 1, and their sum is $p$, i.e., $0 \le h_{ii}\le 1$ and $\sum_{i=1}^{n}h_{ii}=p$,
    where $p$ is the number of regression parameters, including the intercept term.
  • $h_{ii}$ is a measure of the distance between the $X$ values for the $i$th case and the means of the $X$ values for all $n$ cases (see the R sketch after this list).
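In R, the leverages $h_{ii}$ are available through `hatvalues()`; continuing the sketch above:

```r
h <- hatvalues(fit)   # diagonal elements h_ii of the hat matrix

range(h)              # all values lie between 0 and 1
sum(h)                # equals p, the number of parameters (2 here)

# A common rule of thumb flags cases with h_ii > 2p/n as high leverage
p <- length(coef(fit))
which(h > 2 * p / length(h))
```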

Mathematical Properties of Hat Matrix

  • $HX=X$
  • $(I-H)X=0$
  • $HH=H^2=H$, and in general $H^p=H$ for any positive integer $p$ (idempotency).
  • $H(I-H)=0$
  • $Cov(\hat{e},\hat{Y})=Cov\left\{HY,(I-H)Y\right\}=\sigma ^{2} H(I-H)=0$
  • $I-H$ is also symmetric and idempotent.
  • $H\mathbf{1}=\mathbf{1}$ when the model includes an intercept term, i.e., every row of $H$ adds up to 1; by symmetry, $\mathbf{1}'H=\mathbf{1}'$. Also, $\mathbf{1}'H\mathbf{1}=n$.
  • The elements of $H$ are denoted by $h_{ij}$, i.e.,
    \[H=\begin{pmatrix}h_{11} & h_{12} & \cdots & h_{1n} \\ h_{21} & h_{22} & \cdots & h_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1} & h_{n2} & \cdots & h_{nn}\end{pmatrix}\]
    A large value of $h_{ii}$ indicates that the $i$th case is distant from the center of all $n$ cases. The diagonal element $h_{ii}$ in this context is called the leverage of the $i$th case. $h_{ii}$ is a function of the $X$ values only, so $h_{ii}$ measures the role of the $X$ values in determining how important $Y_i$ is in affecting the fitted value $\hat{Y}_i$.
    The larger $h_{ii}$ is, the smaller the variance of the residual $e_i$; for $h_{ii}=1$, $\sigma^2(e_i)=0$.
  • Variance and covariance of $e$:
    \begin{align*}
    e-E(e)&=(I-H)(Y-X\beta)=(I-H)\varepsilon\\
    E(\varepsilon\varepsilon')&=V(\varepsilon)=I\sigma^2\,\,\text{and}\,\,E(\varepsilon)=0\\
    (I-H)'&=I-H'=I-H\\
    V(e)&=E\left[e-E(e)\right]\left[e-E(e)\right]'\\
    &=(I-H)E(\varepsilon\varepsilon')(I-H)'\\
    &=(I-H)I\sigma^2(I-H)'\\
    &=(I-H)(I-H)\sigma^2=(I-H)\sigma^2
    \end{align*}
    $V(e_i)$ is given by $\sigma^2$ times the $i$th diagonal element $1-h_{ii}$, and $Cov(e_i,e_j)$ by $\sigma^2$ times the $(i,j)$th element $-h_{ij}$ of the matrix $I-H$.
    \begin{align*}
    \rho_{ij}&=\frac{Cov(e_i,e_j)}{\sqrt{V(e_i)V(e_j)}}\\
    &=\frac{-h_{ij}}{\sqrt{(1-h_{ii})(1-h_{jj})}}\\
    SS(b)&=SS(\text{all parameters})=b'X'Y\\
    &=\hat{Y}'Y=Y'H'Y=Y'HY=Y'H^2Y=\hat{Y}'\hat{Y}
    \end{align*}
    The average of $V(\hat{Y}_i)$ over all data points is
    \begin{align*}
    \sum_{i=1}^{n}\frac{V(\hat{Y}_i)}{n}&=\frac{trace(H\sigma^2)}{n}=\frac{p\sigma^2}{n}\\
    \hat{Y}_i&=h_{ii}Y_i+\sum_{j\ne i}h_{ij}Y_j
    \end{align*}

Role of the Hat Matrix in Regression Diagnostics

Internally Studentized Residuals

$V(e_i)=(1-h_{ii})\sigma^2$, where $\sigma^2$ is estimated by $s^2$,

i.e., $s^2=\frac{e'e}{n-p}=\frac{\sum e_i^2}{n-p}$ (the residual mean square, RMS).

We can studentize the residuals as $s_i=\frac{e_i}{s\sqrt{1-h_{ii}}}$.

These studentized residuals are said to be internally studentized because $s$ has within it $e_i$ itself.

Extra Sum of Squares attributable to $e_i$

\begin{align*}
e&=(I-H)Y\\
e_i&=-h_{i1}Y_1-h_{i2}Y_2-\cdots+(1-h_{ii})Y_i-\cdots-h_{in}Y_n=c'Y\\
c'&=(-h_{i1},-h_{i2},\cdots,(1-h_{ii}),\cdots,-h_{in})\\
c'c&=\sum_{j=1}^{n}h_{ij}^2+(1-2h_{ii})=h_{ii}+(1-2h_{ii})=1-h_{ii}\\
SS(e_i)&=\frac{e_i^2}{1-h_{ii}}\\
s_{(i)}^2&=\frac{(n-p)s^2-\frac{e_i^2}{1-h_{ii}}}{n-p-1}
\end{align*}
provides an estimate of $\sigma^2$ after deletion of the contribution of $e_i$.

Externally Studentized Residuals

$t_i=\frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}$ are the externally studentized residuals. Here, if $e_i$ is large, it is thrown into emphasis even more by the fact that $s_{(i)}$ has excluded it. The $t_i$ follow a $t_{n-p-1}$ distribution under the usual normality-of-errors assumption.
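Both kinds of studentized residuals are built into R as `rstandard()` (internal) and `rstudent()` (external). A sketch, reusing the `fit` object from the earlier hat matrix example:

```r
e  <- residuals(fit)
h  <- hatvalues(fit)
p  <- length(coef(fit)); n <- length(e)
s2 <- sum(e^2) / (n - p)    # residual mean square s^2

# Internally studentized: e_i / (s * sqrt(1 - h_ii))
s_int <- e / sqrt(s2 * (1 - h))
all.equal(unname(s_int), unname(rstandard(fit)))   # TRUE

# Externally studentized: uses s_(i), which excludes e_i;
# follows t with n - p - 1 degrees of freedom
t_ext <- rstudent(fit)
```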


Read more about the Role of the Hat Matrix in Regression Analysis https://en.wikipedia.org/wiki/Hat_matrix

Read about Regression Diagnostics: https://rfaqs.com

Simple Random Walk (Unrestricted Random Walk) 2012

A simple random walk (or unrestricted random walk) on a line, i.e., in one dimension, moves one step forward ($+1$) with probability $p$ or one step back ($-1$) with probability $q=1-p$. At the $i$th step, a modified Bernoulli random variable $W_i$ (taking the value $+1$ or $-1$ instead of $\{0,1\}$) is observed, and the position of the walk at the $n$th step is
\begin{align}
X_n&=X_0+W_1+W_2+\cdots+W_n\nonumber\\
&=X_0+\sum_{i=1}^nW_i\nonumber\\
&=X_{n-1}+W_n
\end{align}
In the gambler’s ruin problem $X_0=k$, but here we assume (without loss of generality) that the walk starts from the origin, so that $X_0=0$.


Many results for random walks are derived for walks restricted by boundaries. Here we consider random walks without boundaries, called unrestricted random walks. We are interested in

  1. The position of the walk after a number of steps and
  2. The probability of a return to the origin, the start of the walker.

From equation (1), the position of the walker at step $n$ depends only on the position at step $(n-1)$, because the simple random walk possesses the Markov property (the current state of the walk depends on its immediate previous state, not on the history of the walk up to the present state).

Furthermore, $X_n=X_{n-1}\pm 1$, and the transition probabilities from one position to another, $P(X_n=j \mid X_{n-1}=j-1)=p$ and $P(X_n=j \mid X_{n-1}=j+1)=q$, are independent of the number of steps taken (the step number $n$).

The mean and variance of $X_n$ can be calculated as follows:
\begin{align*}
E(X_n)&=E\left(X_0+\sum_{i=1}^n W_i\right)\\
&=E\left(\sum_{i=1}^n W_i\right)=nE(W)\\
V(X_n)&=V\left(\sum_{i=1}^n W_i\right)=nV(W)
\end{align*}
since $X_0=0$ and the $W_i$ are independent and identically distributed (iid) random variables; here $W$ denotes the common (typical) random variable in the sequence $\{W_i\}$. Thus
\begin{align*}
E(W)&=1.p+(-1)q=p-q\\
V(W)&=E(W^2)-[E(W)]^2\\
&=1^2p+(-1)^2q-(p-q)^2\\
&=p+q-(p^2+q^2-2pq)\\
&=1-p^2-q^2+2pq\\
&=1-p^2-(1-p)^2+2pq\\
&=1-p^2-(1+p^2-2p)+2pq\\
&=1-p^2-1-p^2+2p+2pq\\
&=-2p^2+2p+2pq\\
&=2p(1-p)+2pq=4pq
\end{align*}
So the probability distribution of the position of the random walk at stage $n$ has mean $E(X_n)=n(p-q)$ and variance $V(X_n)=4npq$.
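A quick simulation in R verifies these two moment formulas; the parameter values below are arbitrary:

```r
set.seed(1)
n <- 100; p <- 0.6; reps <- 1e4

# Final position X_n of each walk: the sum of n steps of +1 or -1,
# starting from X_0 = 0
X_n <- replicate(reps,
                 sum(sample(c(1, -1), n, replace = TRUE,
                            prob = c(p, 1 - p))))

mean(X_n)   # close to n(p - q) = 100 * (0.6 - 0.4) = 20
var(X_n)    # close to 4npq     = 4 * 100 * 0.6 * 0.4 = 96
```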

For the symmetric random walk ($p=\frac{1}{2}$), the expected position after $n$ steps is the origin, and this value of $p$ yields the maximum of the variance $V(X_n)=4npq=4np(1-p)=n$.

If $p>\frac{1}{2}$, the walk is expected to drift away from the origin in the positive direction; if $p<\frac{1}{2}$, the drift is expected to be in the negative direction.

Since $V(X_n)$ is proportional to $n$, it grows with increasing $n$, and we become increasingly uncertain about the position of the walker as $n$ increases. Setting the derivative of $V(X_n)$ with respect to $p$ to zero confirms that the variance is maximized at $p=\frac{1}{2}$:
\begin{align*}
\frac{\partial V(X_n)}{\partial p}&=\frac{\partial}{\partial p} {4npq}\\
&=\frac{\partial}{\partial p} \{4np-4np^2 \}=4n-8np \quad \Rightarrow p=\frac{1}{2}
\end{align*}
Just knowing the mean and standard deviation of a random variable does not enable us to identify its probability distribution, but for large $n$ we can apply the central limit theorem (CLT):
\[Z_n=\frac{X_n-n(p-q)}{\sqrt{4npq}}\thickapprox N(0,1)\]
Applying a continuity correction, approximate probabilities may be obtained for the position of the walk.

Example: Consider an unrestricted random walk with $n=100$ and $p=0.6$. Then
\begin{align*}
E(X_n)&=E(X_{100})=nE(W)=n(p-q)\\
&=100(0.6-0.4)=20\\
V(X_n)&=4npq=4\times 100\times 0.6 \times 0.4=96
\end{align*}
The probability that the position of the walk at the 100th step is between 15 and 25 steps from the origin is
\[P(15\leq X_{100}\leq 25)\thickapprox P(14.5<X_{100}<25.5)\]
\[-\frac{5.5}{\sqrt{96}}<Z_{100}=\frac{X_{100}-20}{\sqrt{96}}<\frac{5.5}{\sqrt{96}}\]
hence
\[P(-0.5613<Z_{100}<0.5613)=\Phi(0.5613)-\Phi(-0.5613)\approx 0.43\]
where $\Phi(z)$ is the standard normal distribution function.
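The same approximation takes a few lines in R, where `pnorm()` is the standard normal distribution function $\Phi$:

```r
n <- 100; p <- 0.6; q <- 1 - p
m <- n * (p - q)      # mean     = 20
v <- 4 * n * p * q    # variance = 96

# Continuity-corrected normal approximation for P(15 <= X_100 <= 25)
z <- 5.5 / sqrt(v)
pnorm(z) - pnorm(-z)  # approximately 0.43
```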


Read more about Simple Random Walk: Random Walks Model

FAQs about Simple Random Walk

  1. What is meant by a simple random walk?
  2. How can the mean and variance of a simple random walk be computed?
  3. Give an example of a simple random walk.

References

  1. https://www.encyclopediaofmath.org/index.php/Random_walk
  2. https://mathworld.wolfram.com/RandomWalk1-Dimensional.html