Statistics for Data Analyst - Statistics MCQs, Analysis, Software

Absolute Measure of Dispersion

Apr 6, 2024 / May 25, 2013 by Muhammad Imdad Ullah

An absolute measure of dispersion gives an idea of the amount of dispersion (spread) in a set of observations. These quantities measure dispersion in the same units as the original data. Because they carry those units, absolute measures cannot be used to compare the variation of two or more series/data sets that are measured in different units or that differ greatly in magnitude. An absolute measure of dispersion does not, in itself, tell whether the variation is large or small.

The absolute measures of dispersion are:

Range
Quartile Deviation
Mean Deviation
Variance or Standard Deviation

Range

The Range is the difference between the largest value and the smallest value in the data set. For ungrouped data, let $X_0$ be the smallest value and $X_n$ be the largest value in a data set then the range ($R$) is defined as
$R=X_n-X_0$.

For grouped data, the range can be calculated in three different ways:
R = Midpoint of the highest class – Midpoint of the lowest class
R = Upper class limit of the highest class – Lower class limit of the lowest class
R = Upper class boundary of the highest class – Lower class boundary of the lowest class
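As a minimal sketch of these definitions in Python, the following computes the range for ungrouped data and, for grouped data, the class-boundary version. The function names and the sample values are illustrative assumptions, not part of the original text.

```python
def data_range(values):
    """Range of ungrouped data: largest value minus smallest value."""
    return max(values) - min(values)


def grouped_range(classes):
    """Range of grouped data from class boundaries.

    `classes` is a list of (lower_boundary, upper_boundary) pairs.
    """
    lowest = min(lower for lower, _ in classes)
    highest = max(upper for _, upper in classes)
    return highest - lowest


heights = [87, 92, 90, 98, 95, 101, 112]               # illustrative values
classes = [(85.5, 90.5), (90.5, 95.5), (95.5, 100.5)]  # hypothetical classes

print(data_range(heights))     # 112 - 87 = 25
print(grouped_range(classes))  # 100.5 - 85.5 = 15.0
```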

Quartile Deviation (Semi-Interquartile Range)

The interquartile range is defined as the difference between the third and first quartiles, $Q_3-Q_1$. Half of this range is called the semi-interquartile range (SIQR) or simply the quartile deviation (QD), which is an absolute measure of dispersion. $$QD=\frac{Q_3-Q_1}{2}$$

The Quartile Deviation is superior to the range as it is not affected by extremely large or small observations. However, it does not give any information about the position of observations lying outside the two quartiles. It is not amenable to mathematical treatment and is greatly affected by sampling variability. Although the Quartile Deviation is not widely used as a measure of dispersion, it is used in situations in which extreme observations are thought to be unrepresentative/misleading. Because the Quartile Deviation is based only on the middle 50% of the observations rather than on all of them, it is not affected by extreme observations.

Note: The range “Median ± QD” contains approximately 50% of the data.
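The insensitivity of the Quartile Deviation to extreme values can be checked with a short Python sketch. It uses `statistics.quantiles` from the standard library with its default "exclusive" method. Other quartile conventions give slightly different values, so the numbers are illustrative only, and the data set is made up for the example.

```python
import statistics


def quartile_deviation(values):
    """Half the distance between the third and first quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default method="exclusive"
    return (q3 - q1) / 2


data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 40]           # illustrative values
with_outlier = [2, 4, 4, 5, 7, 9, 10, 12, 15, 400]  # largest value inflated

# The range reacts strongly to the extreme value, the QD does not.
print(max(data) - min(data), max(with_outlier) - min(with_outlier))
print(quartile_deviation(data), quartile_deviation(with_outlier))
```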

Mean Deviation (Average Deviation)

The Mean Deviation is another absolute measure of dispersion and is defined as the arithmetic mean of the deviations measured either from the mean or from the median. All these deviations are counted as positive to avoid the difficulty arising from the property that the sum of deviations of observations from their mean is zero.

$MD=\frac{\sum|X-\overline{X}|}{n}\quad$ for ungrouped data for mean
$MD=\frac{\sum f|X-\overline{X}|}{\sum f}\quad$ for grouped data for mean
$MD=\frac{\sum|X-\tilde{X}|}{n}\quad$ for ungrouped data for median
$MD=\frac{\sum f|X-\tilde{X}|}{\sum f}\quad$ for grouped data for median
The Mean Deviation can be calculated about other measures of central tendency, but it is least when the deviations are taken from the median.

The Mean Deviation gives more information than the range or the Quartile Deviation as it is based on all the observed values. The Mean Deviation does not give undue weight to occasional large deviations, so it is likely to be useful in situations where such deviations are likely to occur.
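The ungrouped formulas above can be sketched in Python as follows. The helper name `mean_deviation` and the data are assumptions made for illustration. The example also shows the property that the Mean Deviation about the median does not exceed the Mean Deviation about the mean.

```python
import statistics


def mean_deviation(values, center):
    """Arithmetic mean of the absolute deviations from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


data = [3, 5, 6, 8, 18]  # illustrative values: mean = 8, median = 6

md_about_mean = mean_deviation(data, statistics.mean(data))
md_about_median = mean_deviation(data, statistics.median(data))

print(md_about_mean)    # (5 + 3 + 2 + 0 + 10) / 5 = 4.0
print(md_about_median)  # (3 + 1 + 0 + 2 + 12) / 5 = 3.6
```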

Variance and Standard Deviation

This absolute measure of dispersion is defined as the mean of the squares of deviations of all the observations from their mean. Traditionally population variance is denoted by $\sigma^2$ (sigma square) and for sample data denoted by $S^2$ or $s^2$.

Symbolically
$\sigma^2=\frac{\sum(X_i-\mu)^2}{N}\quad$ Population Variance for ungrouped data
$S^2=\frac{\sum(X_i-\overline{X})^2}{n}\quad$ sample Variance for ungrouped data
$\sigma^2=\frac{\sum f(X_i-\mu)^2}{\sum f}\quad$ Population Variance for grouped data
$S^2=\frac{\sum f (X_i-\overline{X})^2}{\sum f}\quad$ Sample Variance for grouped data

The variance is denoted by $Var(X)$ for a random variable $X$. The term variance was introduced by R. A. Fisher (1890-1962) in 1918. The variance is expressed in squared units of the original data, so it is often a large number compared to the observations themselves.
Note that there are alternative formulas to compute the Variance or the Standard Deviation.

The positive square root of the variance is called Standard Deviation (SD) to express the deviation in the same units as the original observation. It is a measure of the average spread about the mean and is symbolically defined as

$\sigma=\sqrt{\frac{\sum(X_i-\mu)^2}{N}}\quad$ Population Standard Deviation for ungrouped data
$S=\sqrt{\frac{\sum(X_i-\overline{X})^2}{n}}\quad$ Sample Standard Deviation for ungrouped data
$\sigma=\sqrt{\frac{\sum f(X_i-\mu)^2}{\sum f}}\quad$ Population Standard Deviation for grouped data
$S=\sqrt{\frac{\sum f (X_i-\overline{X})^2}{\sum f}}\quad$ Sample Standard Deviation for grouped data
Standard Deviation is the most useful measure of dispersion and is credited with the name Standard Deviation by Karl Pearson (1857-1936).

In some texts, the sample variance is defined as $S^2=\frac{\sum (X_i-\overline{X})^2}{n-1}$, based on the argument that knowledge of any $n-1$ deviations determines the remaining deviation, as the sum of the $n$ deviations must be zero. This is an unbiased estimator of the population variance $\sigma^2$. The Standard Deviation has a definite mathematical meaning, utilizes all the observed values, and is amenable to mathematical treatment, but it is affected by extreme values.
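To make the difference between the divisors $n$ and $n-1$ concrete, here is a small Python sketch using the standard `statistics` module: `pvariance`/`pstdev` divide by $n$, while `variance`/`stdev` divide by $n-1$. The data values are illustrative assumptions.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, mean = 5
n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 32

var_n = sum_sq / n               # divisor n
var_n_minus_1 = sum_sq / (n - 1) # divisor n - 1 (unbiased estimator)

print(var_n, math.sqrt(var_n))   # 4.0 2.0
print(var_n_minus_1)             # 32 / 7

# The standard library agrees with the hand computation.
print(statistics.pvariance(data), statistics.pstdev(data))
print(statistics.variance(data), statistics.stdev(data))
```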


Random Walk Probability of Returning to Origin after n steps

Jun 23, 2024 / May 4, 2013 by Muhammad Imdad Ullah

Random Walk Probability of Returning to Origin

Assume that the walk starts at $x=0$ with steps to the right or left occurring with probabilities $p$ and $q=1-p$. We can write the position $X_n$ after $n$ steps as
\[X_n=R_n-L_n \tag{1}\]
where $R_n$ is the number of right or positive steps (+1) and $L_n$ is the number of left or negative steps ($-1$).

Therefore the Total steps can be calculated as: \[n=R_n+L_n \tag{2}\]
Hence
\begin{align*}
L_n&=n-R_n\\
\Rightarrow X_n&=R_n-n+R_n\\
R_n&=\frac{1}{2}(n+X_n) \tag{3}
\end{align*}
The right-hand side of equation (3) is an integer only when $n$ and $X_n$ are both even or both odd (e.g., to go from $x=0$ to $x=7$ we must take an odd number of steps).

Now, let $v_{n,x}$ be the probability that the walk is at state $x$ after $n$ steps assuming that $x$ is a positive integer. Then
\begin{align*}
v_{n,x}&=P(X_n=x)\\
&=P(R_n=\frac{1}{2}(n+x))
\end{align*}
$R_n$ is a binomial random variable with index $n$ having probability $p$, since the walker either moves to the right or not at every step, and the steps are independent, then
\begin{align*}
v_{n,x}&=\binom{n}{\frac{1}{2}(n+x)}p^{\frac{1}{2}(n+x)}q^{n-\frac{1}{2}(n+x)}\\
&=\binom{n}{\frac{1}{2}(n+x)}p^{\frac{1}{2}(n+x)}q^{\frac{1}{2}(n-x)} \tag{4}
\end{align*}
where $(n,x)$ are both even or both odd and $-n \leq x \leq n$. Note that a similar argument can be constructed if $x$ is a negative integer.
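Equation (4) translates directly into a few lines of Python. The sketch below is one possible implementation: the function name `walk_probability` is an assumption, and unreachable positions (wrong parity or $|x|>n$) are assigned probability zero.

```python
from math import comb


def walk_probability(n, x, p):
    """P(X_n = x) for a simple random walk started at 0, as in equation (4)."""
    if abs(x) > n or (n + x) % 2 != 0:
        return 0.0            # too far away, or n and x have different parity
    r = (n + x) // 2          # number of right steps R_n
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 5, 0.6
distribution = {x: walk_probability(n, x, p) for x in range(-n, n + 1)}

print(round(distribution[3], 4))                # 0.2592
print(round(sum(distribution.values()), 10))    # 1.0
```

The probabilities over all reachable positions sum to one, as they should for a binomial distribution on $R_n$.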

Example

When the total number of steps is 2, the net displacement must be one of three possibilities: (1) two steps to the left, (2) back to the start, or (3) two steps to the right. These correspond to the values $x = -2, 0, +2$. It is impossible to get more than two units away from the origin if you take only two steps, and it is equally impossible to end up exactly one unit from the origin if you take two steps.

For the symmetric case ($p=\tfrac{1}{2}$), starting from the origin, there are $2^n$ different paths of length $n$ since there is a choice of a right or left move at each step. To end at $x$, the number of steps in the right direction must be $\tfrac{1}{2}(n+x)$, so the total number of paths ending at $x$ is the number of ways in which $\tfrac{1}{2}(n+x)$ right steps can be chosen from $n$ steps, that is,
\[N_{n,x}=\binom{n}{\tfrac{1}{2}(n+x)}\]
provided that $\tfrac{1}{2}(n+x)$ is an integer.

By counting rule, the probability that the walk ends at $x$ after $n$ steps is given by the ratio of this number and the total number of paths (since all paths are equally likely). Hence
\[v_{n,x}=\frac{N_{n,x}}{2^n}=\binom{n}{\tfrac{1}{2}(n+x)}\frac{1}{2^n}\]
The probability $v_{n,x}$ is the probability that the walk ends at state $x$ after $n$ steps; the walk could have overshot $x$ before returning there.
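The path-counting argument for the symmetric case can be verified by brute force for small $n$. The following Python sketch enumerates all $2^n$ paths and compares the number ending at each $x$ with the binomial coefficient $N_{n,x}$. The choice $n=4$ is only an illustration.

```python
from collections import Counter
from itertools import product
from math import comb

n = 4  # small enough to enumerate all 2**n paths

# Each path is a tuple of +1/-1 steps; its end point is the sum of the steps.
endpoints = Counter(sum(path) for path in product((-1, 1), repeat=n))

for x in sorted(endpoints):
    count = endpoints[x]
    assert count == comb(n, (n + x) // 2)  # matches N_{n,x}
    print(x, count, count / 2**n)          # position, paths, probability
```

For $n=4$ the counts are 1, 4, 6, 4, 1 at $x=-4,-2,0,2,4$, so, for example, the probability of being back at the origin is $6/16=0.375$.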

A related probability is the probability that the first visit to position $x$ occurs at the $n$th step. The following is a descriptive derivation of the associated probability-generating function of the symmetric random walk in which the walk starts at the origin, and we consider the probability that it returns to the origin.

From equation (4), the probability that a walk is at the origin at step $n$ is
\begin{align*}
v_{n,0}&=\binom{n}{\frac{1}{2}(n+0)}p^{\frac{1}{2}(n+0)}q^{n-\frac{1}{2}(n+0)}\\
&=\binom{n}{\tfrac{1}{2}(n+0)} \left(\frac{1}{2}\right)^{\tfrac{1}{2}n} \left(\frac{1}{2}\right)^{\tfrac{1}{2}n}\\
&=\binom{n}{\tfrac{1}{2}n}\frac{1}{2^n}=p_n \,\,\,\text{(say)}, \quad (n=2,4,6,\cdots) \tag{5}
\end{align*}
Here $p_n$ is the probability that after $n$ steps the position of the walker is at the origin. We also assume that $p_n=0$ if $n$ is odd. From equation (5) we can construct a generating function.
\begin{align*}
H(s)&=\sum_{n=0}^\infty p_n s^n\\
&=\sum_{n=0}^\infty p_{2n}s^{2n}=\sum_{n=0}^\infty \frac{1}{2^{2n}}\binom{2n}{n}s^{2n} \tag{6}
\end{align*}
Note that $p_0=1$, and $H(s)$ is not a probability generating function since $H(1)\neq1$.

The binomial coefficient can be re-arranged as follows:
\begin{align*}
\binom{2n}{n}&=\frac{(2n)!}{n!n!}=\frac{2n(2n-1)(2n-2)\cdots3\cdot2\cdot1}{n!n!}\\
&=\frac{2^nn!(2n-1)(2n-3)\cdots3\cdot1}{n!n!}\\
&=\frac{2^{2n}}{n!}\frac{1}{2}\frac{3}{2}\cdots(n-\tfrac{1}{2})\\
&=(-1)^n \binom{-\tfrac{1}{2}}{n}2^{2n} \tag{7}
\end{align*}
Using equation (7) in (6)
\[H(s)=\sum_{n=0}^\infty \frac{1}{2^{2n}}(-1)^n \binom{-\tfrac{1}{2}}{n}s^{2n}2^{2n}=(1-s^2)^{-\tfrac{1}{2}} \tag{8}\]
by binomial theorem, provided $|s|<1$. Note that this expansion guarantees that $p_n=0$ if $n$ is odd.
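As a numerical sanity check of equation (8), the partial sums of the series in equation (6) can be compared with the closed form $(1-s^2)^{-1/2}$ for some $|s|<1$. This Python sketch is only an illustration: the function name, the number of terms, and the test value $s=0.5$ are arbitrary choices.

```python
from math import comb


def H_partial(s, terms=200):
    """Partial sum of H(s) = sum over n of C(2n, n) / 2^(2n) * s^(2n)."""
    return sum(comb(2 * n, n) / 4**n * s ** (2 * n) for n in range(terms))


s = 0.5
closed_form = (1 - s**2) ** -0.5

print(H_partial(s))
print(closed_form)
```

For values of $s$ close to 1 the series converges slowly, so more terms are needed before the two numbers agree.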

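Note that the probabilities $p_n$ generated by equation (8) do not sum to one; in fact, $H(s)$ is unbounded as $s\rightarrow 1$. They therefore do not form a probability distribution over $n$, although each $p_n$ still gives the probability that the walk is at the origin at step $n$.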

We can estimate the behavior of $p_n$ for large n by using Stirling’s Formula (asymptotic estimate for $n!$ for large $n$), $n!\approx\sqrt{2\pi} n^{n+\tfrac{1}{2}}e^{-n}$

From equation (5)
\begin{align*}
p_{2n}&=\frac{1}{2^{2n}}\binom{2n}{n}=\frac{1}{2^{2n}}\frac{(2n)!}{n!n!}\\
&\approx\frac{1}{2^{2n}}\frac{\sqrt{2\pi}(2n)^{2n+\tfrac{1}{2}}e^{-2n}}{[\sqrt{2\pi}(n^{n+\tfrac{1}{2}}e^{-n})]^2}\\
&=\frac{1}{\sqrt{\pi n}}; \qquad \text{for large $n$}
\end{align*}
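Hence $np_{2n}\rightarrow \infty$ as $n\rightarrow\infty$, that is, the terms decay more slowly than those of the harmonic series, confirming that the series $\sum\limits_{n=0}^\infty p_n$ must diverge.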
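The quality of the Stirling estimate $p_{2n}\approx 1/\sqrt{\pi n}$ can be inspected numerically. The Python sketch below compares the exact value from equation (5) with the approximation for a few arbitrarily chosen values of $n$.

```python
from math import comb, pi, sqrt


def p_exact(n):
    """Exact probability p_{2n} of being at the origin after 2n steps."""
    return comb(2 * n, n) / 4**n


def p_stirling(n):
    """Large-n approximation 1 / sqrt(pi * n)."""
    return 1 / sqrt(pi * n)


for n in (10, 100, 1000):
    exact, approx = p_exact(n), p_stirling(n)
    print(n, exact, approx, exact / approx)  # the ratio approaches 1
```

The ratio of the exact value to the approximation moves towards 1 as $n$ grows, while $n\,p_{2n}$ keeps increasing.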

Random Walk Probability of Returning to Origin after $n$ Steps: Some Examples

Example: Consider a random walk starting from $x_0=0$ with $p=0.6$. Find the probability that the position after 5 steps is 3, i.e., $X_5=3$.

Solution: Here the number of steps is $n=5$ and the position is $x=3$. Therefore the numbers of positive and negative steps are

$R_n= \frac{1}{2}(n+x)=\frac{1}{2}(5+3)=4$ and $X_n=R_n-L_n \Rightarrow 3=4-L_n \Rightarrow L_n=1$

The probability that the event $X_5=3$ will occur in a random walk with $p=0.6$ is
\[P(X_5=3)=\binom{5}{\frac{1}{2}(5+3)}(0.6)^{\tfrac{1}{2}(3+5)}(0.4)^{\tfrac{1}{2}(5-3)}=0.2592\]


Graphical Presentation of Data (2013)

Aug 2, 2024 / Mar 28, 2013 by Muhammad Imdad Ullah

Expertise in the graphical presentation of data is important, and it is a major way of gaining insight into data.

Graphical Presentation of Data

A chart/graph says more than twenty pages of prose; this is especially true when you are presenting and explaining data. A graph is a visual display of data in the form of continuous curves or discontinuous lines on graph paper. Many graphs simply summarize data that has been collected to support a particular theory, to help the audience understand the data quickly in a visual way, to make a comparison, to show a relationship, or to highlight a trend.

Usually, it is suggested that the graphical presentation of the data should be carefully looked at before proceeding with the formal statistical analysis. It is because the trend in the data can often be depicted by the use of charts and graphs.

A chart/ graph is a graphical presentation of data, in which the data is usually represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart. A chart/ graph can represent tabular numeric data, functions, or some kinds of qualitative structures.

Common Uses of Graphs

Graphical presentation of data is a pictorial way of representing relationships between various quantities, parameters, and variables. A graph summarizes how one quantity changes if another quantity that is related to it also changes.

Graphs are useful for checking assumptions made about the data i.e. the probability distribution assumed.
The graphs provide a useful subjective impression as to what the results of the formal analysis should be.
Graphs often suggest the form of a statistical analysis to be carried out, particularly, the graph of model fitted to the data.
Graphs give a visual representation of the data or the results of statistical analysis to the reader which are usually easily understandable and more attractive.
Some graphs are useful for checking the variability in the observations, and outliers can be easily detected.

Important Points for Graphical Presentation of Data

Clearly label the axis with the names of the variable and units of measurement.
Keep the units along each axis uniform, regardless of the scales chosen for the axis.
Keep the diagram simple. Avoid any unnecessary details.
A clear and concise title should be chosen to make the graph meaningful.
If the data on different graphs are to be compared, always use identical scales.
In a scatter plot, do not join up the dots. Joining them makes it likely that you will see apparent patterns in any random scatter of points.
Use either grid rulings or tick marks on the axis to mark the graph divisions.
Use color, shading, or pattern to differentiate the different sections of the graphs such as lines, pieces of the pie, bars, etc.
In general start each axis from zero; if the graph is too large, indicate a break in the grid.

For further reading about the Graphical Presentation of data go to https://en.wikipedia.org/wiki/Chart


Percentiles: Relative Standing

Apr 8, 2024 / Mar 10, 2013 by Muhammad Imdad Ullah

Percentiles are a measure of the relative standing of an observation within a data set. Percentiles divide a set of observations into 100 equal parts, and percentile scores are frequently used to report results from national standardized tests such as NAT, GAT, GRE, etc.

The $p$th percentile is the value $Y_{(p)}$ in the order statistics such that $p$ percent of the values are less than $Y_{(p)}$ and $(100-p)$ percent of the values are greater than $Y_{(p)}$. The 5th percentile is denoted by $P_5$, the 10th by $P_{10}$, and the 95th by $P_{95}$.

Percentiles for the Ungrouped data

To calculate percentiles (a measure of the relative standing of an observation) for the ungrouped data, adopt the following procedure:

Order the observations.
For the $m$th percentile, determine the product $\frac{m \cdot n}{100}$. If $\frac{m \cdot n}{100}$ is not an integer, round it up and find the corresponding ordered value; if $\frac{m \cdot n}{100}$ is an integer, say $k$, then calculate the mean of the $k$th and $(k+1)$th ordered observations.

Example: For the following height data collected from students find the 10th and 95th percentiles. 91, 89, 88, 87, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101, 96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112.

Solution: The ordered observations of the data are 87, 87, 88, 89, 89, 90, 91, 91, 92, 95, 96, 96, 97, 98, 98, 98, 99, 99, 100, 100, 101, 101, 102, 103, 105, 105, 106, 107, 107, 112.

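For the 10th percentile, the position is
\[\frac{10 \times 30}{100}=3\]

Since this is an integer, the 10th percentile is the mean of the 3rd and 4th observations in the sorted data, i.e., $P_{10}=\frac{88+89}{2}=88.5$, which means that 10 percent of the observations in the data set (three of the 30) are less than 88.5.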

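For the 95th percentile, the position is
\[\frac{95 \times 30}{100}=28.5\]

Since 28.5 is not an integer, it is rounded up to 29. The 29th observation in the sorted data is the 95th percentile, i.e., $P_{95}=107$.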
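The two-step procedure for ungrouped data can be sketched in Python as below. The function name `percentile_ungrouped` is an assumption, and the code follows the round-up/average rule stated above; statistical packages often use other interpolation conventions, so their results may differ slightly. Integer arithmetic (`divmod`) is used to avoid floating-point issues when testing whether $\frac{m \cdot n}{100}$ is an integer.

```python
def percentile_ungrouped(values, m):
    """m-th percentile (0 < m < 100) using the round-up / average rule."""
    if not 0 < m < 100:
        raise ValueError("m must be strictly between 0 and 100")
    ordered = sorted(values)
    k, remainder = divmod(m * len(ordered), 100)
    if remainder == 0:
        # m*n/100 is an integer k: mean of the k-th and (k+1)-th values
        return (ordered[k - 1] + ordered[k]) / 2
    # otherwise round the position up and take that ordered value
    return ordered[k]  # index k is the (k+1)-th value, i.e. the rounded-up position


heights = [91, 89, 88, 87, 89, 91, 87, 92, 90, 98, 95, 97, 96, 100, 101,
           96, 98, 99, 98, 100, 102, 99, 101, 105, 103, 107, 105, 106, 107, 112]

print(percentile_ungrouped(heights, 95))  # position 28.5 -> 29th value = 107
print(percentile_ungrouped(heights, 25))  # position 7.5  -> 8th value  = 91
```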

Percentiles for the Frequency Distribution Table (Grouped data)

The $m$th percentile (a measure of the relative standing of an observation) for the Frequency Distribution Table (grouped data) is

\[P_m=l+\frac{h}{f}\left(\frac{m.n}{100}-c\right)\]

Like median, $\frac{m.n}{100}$ is used to locate the $m$th percentile group.

$l$    is the lower class boundary of the class containing the $m$th percentile
$h$   is the width of the class containing $P_m$
$f$    is the frequency of the class containing $P_m$
$n$   is the total number of frequencies
$c$    is the cumulative frequency of the class immediately preceding the class containing $P_m$

Note that the 50th percentile is the median by definition as half of the values in the data are smaller than the median and half of the values are larger than the median. Similarly, the 25th and 75th percentiles are the lower ($Q_1$) and upper quartiles ($Q_3$) respectively. The quartiles, deciles, and percentiles are also called quantiles or fractiles.

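Example: For the following grouped data (the height data above grouped into classes of width 5), compute $P_{10}$, $P_{25}$, $P_{50}$, and $P_{95}$.

\[\begin{array}{ccc}
\text{Class boundaries} & f & \text{Cumulative frequency}\\
\hline
85.5\text{--}90.5 & 6 & 6\\
90.5\text{--}95.5 & 4 & 10\\
95.5\text{--}100.5 & 10 & 20\\
100.5\text{--}105.5 & 6 & 26\\
105.5\text{--}110.5 & 3 & 29\\
110.5\text{--}115.5 & 1 & 30
\end{array}\]

Solution: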

Locate the 10th percentile (lower decile, i.e., $D_1$) by $\frac{10 \times n}{100}=\frac{10 \times 30}{100}=3$rd observation.
So, the $P_{10}$ group is 85.5–90.5, containing the 3rd observation.
\begin{align*}
P_{10}&=l+\frac{h}{f}\left(\frac{10 n}{100}-c\right)\\
&=85.5+\frac{5}{6}(3-0)\\
&=85.5+2.5=88
\end{align*}
Locate the 25th percentile (lower quartile, i.e., $Q_1$) by $\frac{25 \times n}{100}=\frac{25 \times 30}{100}=7.5$th observation.
So, the $P_{25}$ group is 90.5–95.5, containing the 7.5th observation.
\begin{align*}
P_{25}&=l+\frac{h}{f}\left(\frac{25 n}{100}-c\right)\\
&=90.5+\frac{5}{4}(7.5-6)\\
&=90.5+1.875=92.375
\end{align*}
Locate the 50th percentile (median, i.e., 2nd quartile, 5th decile) by $\frac{50 \times n}{100}=\frac{50 \times 30}{100}=15$th observation.
So, the $P_{50}$ group is 95.5–100.5, containing the 15th observation.
\begin{align*}
P_{50}&=l+\frac{h}{f}\left(\frac{50 n}{100}-c\right)\\
&=95.5+\frac{5}{10}(15-10)\\
&=95.5+2.5=98
\end{align*}
Locate the 95th percentile by $\frac{95 \times n}{100}=\frac{95 \times 30}{100}=28.5$th observation.
So, the $P_{95}$ group is 105.5–110.5, containing the 28.5th observation.
\begin{align*}
P_{95}&=l+\frac{h}{f}\left(\frac{95 n}{100}-c\right)\\
&=105.5+\frac{5}{3}(28.5-26)\\
&=105.5+4.1667=109.6667
\end{align*}
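The grouped-data formula $P_m=l+\frac{h}{f}\left(\frac{m \cdot n}{100}-c\right)$ can be sketched in Python as follows. The function name is an assumption. The class boundaries and the frequencies 6, 4, 10, and 3 are taken from the calculations above, while the frequencies of the classes 100.5–105.5 and 110.5–115.5 are inferred here from the cumulative totals and from the ungrouped height data, so they should be treated as assumptions.

```python
def percentile_grouped(classes, m):
    """m-th percentile from (lower_boundary, upper_boundary, frequency) rows."""
    n = sum(f for _, _, f in classes)
    target = m * n / 100          # position m*n/100 of the percentile
    c = 0                         # cumulative frequency of preceding classes
    for lower, upper, f in classes:
        if f > 0 and c + f >= target:
            # l + (h / f) * (m*n/100 - c)
            return lower + (upper - lower) * (target - c) / f
        c += f
    raise ValueError("m must be between 0 and 100")


classes = [
    (85.5, 90.5, 6),
    (90.5, 95.5, 4),
    (95.5, 100.5, 10),
    (100.5, 105.5, 6),   # inferred frequency (assumption)
    (105.5, 110.5, 3),
    (110.5, 115.5, 1),   # inferred frequency (assumption)
]

for m in (10, 25, 50, 95):
    print(m, round(percentile_grouped(classes, m), 4))
```

With these classes the sketch reproduces the hand calculations: 88, 92.375, 98, and 109.6667.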

The percentiles and quartiles may be read directly from the graphs of the cumulative frequency function.

Further Reading: https://en.wikipedia.org/wiki/Percentile
