Multivariate Analysis (2012)

The term Multivariate Analysis covers all statistical methods in which more than two variables are analyzed simultaneously.

Multivariate analysis is based upon an underlying probability model known as the Multivariate Normal Distribution (MND). The objectives of scientific investigation to which multivariate methods most naturally lend themselves include the following.


Objectives of Multivariate Analysis

The following are some basic objectives of multivariate analysis.

  • Data reduction or structural simplification
    The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
  • Sorting and Grouping
    Graphs of similar objects or variables are created, based on measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
  • Investigation of the dependence among variables
    The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent based on observation of the other variables?
  • Prediction
    Relationships between variables must be determined for predicting the values of one or more variables based on observation of the other variables.
  • Hypothesis construction and testing
    Specific statistical hypotheses, formulated in terms of the parameters of the multivariate population, are tested. This may be done to validate assumptions or to reinforce prior convictions.

Applications: Multivariate analysis is used in various fields:

  • Social sciences (understanding factors influencing voting behavior)
  • Business (analyzing customer demographics and purchase patterns)
  • Finance (evaluating risk factors in investment portfolios)
  • Natural sciences (studying the relationships between different environmental variables)

Multivariate Data Sets

We are concerned with analyzing measurements made on several variables or characteristics. These measurements (data) frequently must be arranged and displayed in various ways (graphs, tabular form, etc.). Preliminary concepts underlying these first steps of data organization are given below.

Array

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects $p \ge 1$ variables or characteristics to record. The values of these variables are recorded for each distinct item, individual, or experimental unit.

The notation $x_{jk}$ is used to indicate the particular value of the $k$th variable observed on the $j$th item or trial; i.e., $x_{jk}$ is the measurement of the $k$th variable on the $j$th item. So, $n$ measurements on $p$ variables can be displayed as

\[\begin{array}{ccccccc}
 & V_1 & V_2  & \dots  & V_k & \dots  & V_p \\
Item 1 & x_{11} & x_{12} & \dots  & x_{1k} & \dots  & x_{1p} \\
Item 2 & x_{21} & x_{22} & \dots  & x_{2k} & \dots  & x_{2p} \\
\vdots & \vdots  & \vdots  & \vdots & \vdots   & \vdots & \vdots  \\
Item j  & x_{j1}   & x_{j2} & \dots  & x_{jk} & \dots  & x_{jp} \\
\vdots &  \vdots & \vdots & \vdots & \vdots   & \vdots & \vdots  \\
Item n & x_{n1} & x_{n2} & \dots  & x_{nk} & \dots  & x_{np} \\
\end{array}\]

These data can be displayed as a rectangular array $X$ of $n$ rows and $p$ columns:

\[X=\begin{pmatrix}
x_{11}     & x_{12} & \dots  & x_{1k}  & \dots  & x_{1p} \\
x_{21}     & x_{22} & \dots  & x_{2k}  & \dots  & x_{2p} \\
\vdots & \vdots & \ddots  & \vdots &  & \vdots  \\
x_{j1}     & x_{j2} & \dots  & x_{jk}  & \dots  & x_{jp} \\
\vdots  & \vdots &   & \vdots & \ddots & \vdots  \\
x_{n1}     & x_{n2} & \dots  & x_{nk}  & \dots  & x_{np}
\end{pmatrix}\]

This $X$ array contains the data consisting of all of the observations on all of the variables.

Example: Suppose we have data on four sales, recording the total amount of each sale (variable 1) and the number of books sold (variable 2).

Variable 1 (Sales in Dollars)
\[\begin{array}{ccccc}
Data Values: & 42 & 52 & 48 & 63 \\
Notation: & x_{11} & x_{21} & x_{31} & x_{41}
\end{array}\]

Variable 2 (Number of Books sold)
\[\begin{array}{ccccc}
Data Values: & 4 & 2 & 8 & 3 \\
Notation: & x_{12} & x_{22} & x_{32} & x_{42}
\end{array}\]
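As a quick illustration (Python is used here only because the values are small enough to check by hand), the two variables can be stored as a $4 \times 2$ array so that $x_{jk}$ corresponds to `X[j-1][k-1]` with 0-based indexing:

```python
# Book-sales example as a 4 x 2 data array (n = 4 items, p = 2 variables).
# x_jk (1-based, as in the text) maps to X[j-1][k-1] in 0-based Python.
X = [[42, 4],   # item 1: sale in dollars, books sold
     [52, 2],   # item 2
     [48, 8],   # item 3
     [63, 3]]   # item 4

x_31 = X[2][0]  # third sale, variable 1 (dollars) -> 48
x_32 = X[2][1]  # third sale, variable 2 (books)   -> 8
```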


The information available in a multivariate data set can be assessed by calculating certain summary numbers, known as multivariate descriptive statistics: for example, the sample mean (a measure of location) and the average of the squared distances of all of the numbers from the mean (a measure of spread or variation).
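A minimal sketch of these summary numbers, reusing the hypothetical book-sales data above (the divisor $n$ gives the population form of the variance):

```python
# Column-wise mean and variance for an n x p data matrix (book-sales example).
X = [[42, 4], [52, 2], [48, 8], [63, 3]]
n, p = len(X), len(X[0])

# Sample mean of each variable: a measure of location.
means = [sum(row[k] for row in X) / n for k in range(p)]

# Average squared distance from the mean: a measure of spread.
variances = [sum((row[k] - means[k]) ** 2 for row in X) / n for k in range(p)]

print(means, variances)
```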


Measure of Dispersion or Variability (2012)

Introduction to Measure of Dispersion

A measure of location (average, central tendency) alone is not sufficient to describe the characteristics of a distribution, because two or more distributions may have averages that are exactly alike even though the distributions are dissimilar in other respects; a measure of central tendency represents only the typical value of the data set. To give a sensible description of data, we also need a numerical quantity called a measure of dispersion (variability or scatter) that describes the spread of the values in a data set. There are two types of measures of dispersion or variability:

  1. Absolute Measures
  2. Relative Measures

A measure of central tendency together with a measure of dispersion gives a more adequate description of data than a measure of location alone, because averages describe only the balancing point of the data set; they provide no information about the degree to which the data spread or scatter about the average value. A measure of dispersion thus indicates how representative the measure of central tendency is: the smaller the variability of a given data set, the better the average represents the data.

Absolute Measure of Dispersion

Absolute measures are defined in such a way that they have units such as meters, grams, etc., the same as those of the original measurements. Absolute measures cannot be used to compare the variation/spread of two or more data sets measured in different units.
The most common absolute measures of variability are:

  • Range
  • Quartile Deviation
  • Mean Deviation
  • Standard Deviation (and Variance)
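A short sketch computing common absolute measures for a small hypothetical sample (the quartile deviation is omitted because quartile conventions vary; the standard deviation uses the population divisor $n$):

```python
# Common absolute measures of dispersion; each keeps the units of the data.
data = [42, 52, 48, 63]   # hypothetical sample
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                         # Range
mean_dev = sum(abs(x - mean) for x in data) / n            # Mean deviation
std_dev = (sum((x - mean) ** 2 for x in data) / n) ** 0.5  # Standard deviation

print(data_range, mean_dev, round(std_dev, 4))
```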

Relative Measures of Dispersion

The relative measures have no units as these are ratios, coefficients, or percentages. Relative measures are independent of units of measurement and are useful for comparing data of different natures.

  • Coefficient of Variation
  • Coefficient of Mean Deviation
  • Coefficient of Quartile Deviation
  • Coefficient of Standard Deviation
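For example, the coefficient of variation expresses the standard deviation as a percentage of the mean, which makes it unit-free. A sketch with hypothetical data, using the population standard deviation from Python's `statistics` module:

```python
from statistics import mean, pstdev

# Coefficient of variation: a ratio, so data in different units can be compared.
data = [42, 52, 48, 63]   # hypothetical sample
cv = pstdev(data) / mean(data) * 100   # expressed as a percentage

print(round(cv, 2))
```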

Different terms are used for the measure of dispersion or variability, such as variability, spread, scatter, measure of uncertainty, and deviation.



Testing of Hypothesis (2012)

Introduction

The objective of testing hypotheses (Testing of Statistical Hypothesis) is to determine if an assumption about some characteristic (parameter) of a population is supported by the information obtained from the sample.

Testing of Hypothesis

The terms hypothesis testing and testing of hypothesis are used interchangeably. A statistical hypothesis (as opposed to an everyday, non-statistical hypothesis) is a statement about a characteristic of one or more populations, such as the population mean. This statement may or may not be true; its validity is checked based on information obtained by sampling from the population.
Testing of Hypothesis refers to the formal procedures used by statisticians to accept or reject statistical hypotheses that include:

i) Formulation of Null and Alternative Hypothesis

Null hypothesis

A hypothesis formulated for the sole purpose of rejecting or nullifying it is called the null hypothesis, usually denoted by $H_0$. There is usually a “not” or a “no” term in the null hypothesis, meaning that there is “no change”.

For example, the null hypothesis may be that the mean age of M.Sc. students is 20 years. Statistically, it can be written as $H_0: \mu = 20$. Generally speaking, the null hypothesis is developed for the purpose of testing.
We should emphasize that if the null hypothesis is not rejected based on the sample data, we cannot say that the null hypothesis is true. In other words, failing to reject the null hypothesis does not prove that $H_0$ is true; it means that we have failed to disprove $H_0$.

For the null hypothesis, we usually state that “there is no significant difference between A and B”, for example, “the mean tensile strength of copper wire is not significantly different from some standard”.

Alternative Hypothesis

Any hypothesis different from the null hypothesis is called an alternative hypothesis, denoted by $H_1$. Equivalently, the alternative hypothesis is the statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false. The alternative hypothesis is also referred to as the research hypothesis.

It is important to remember that no matter how the problem is stated, the null hypothesis will always contain the equal sign, and the equal sign will never appear in the alternative hypothesis. This is because the null hypothesis is the statement being tested and we need a specific value to include in our calculations. The alternative hypothesis for the example given in the null hypothesis is $H_1: \mu \ne 20$.

Simple and Composite Hypothesis

If a statistical hypothesis completely specifies the form of the distribution as well as the values of all parameters, it is called a simple hypothesis. For example, suppose the age distribution of first-year college students follows $N(16, 25)$; then the null hypothesis $H_0: \mu = 16$ is a simple hypothesis. If a statistical hypothesis does not completely specify the form of the distribution, it is called a composite hypothesis, for example, $H_1: \mu < 16$ or $H_1: \mu > 16$.

ii) Level of Significance

The level of significance (significance level) is denoted by the Greek letter alpha ($\alpha$). It is also called the level of risk (as there is the risk you take of rejecting the null hypothesis when it is true). The level of significance is defined as the probability of making a type-I error. It is the maximum probability with which we would be willing to risk a type-I error. It is usually specified before any sample is drawn so that the results obtained will not influence our choice.

In practice, the 10% (0.10), 5% (0.05), and 1% (0.01) levels of significance are used in testing a given hypothesis. A 5% level of significance means that there are about 5 chances out of 100 that we would reject a true null hypothesis, i.e., we are 95% confident that we have made the right decision. A hypothesis rejected at the 0.05 level of significance means that we could be wrong with a probability of 0.05.

Selection of Level of Significance

In Testing of Hypothesis, the selection of the level of significance depends on the field of study. Traditionally, the 0.05 level is selected for business and science-related problems, 0.01 for quality assurance, and 0.10 for political polling and the social sciences.

Type-I and Type-II Errors

Whenever we accept or reject a statistical hypothesis based on sample data, there are always some chances of making an incorrect decision. Accepting a true null hypothesis or rejecting a false null hypothesis is a correct decision; accepting a false null hypothesis or rejecting a true one is an incorrect decision. The two kinds of incorrect decision are called type-I and type-II errors.
Type-I error: rejecting the null hypothesis ($H_0$) when it is true.
Type-II error: accepting (failing to reject) the null hypothesis when it is false (i.e., when $H_1$ is true).

iii) Test Statistics

The third step of Testing the Hypothesis is a procedure that enables us to decide whether to accept or reject the hypothesis, or to determine whether observed samples differ significantly from expected results. Such procedures are called tests of hypothesis, tests of significance, or rules of decision. We can also say that a test statistic is a value calculated from sample information, used to determine whether to reject the null hypothesis.

The test statistic for the mean $\mu$ when $\sigma$ is known is $Z= \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$, where the Z-value is based on the sampling distribution of $\bar{X}$, which follows the normal distribution with mean $\mu_{\bar{X}}$ equal to $\mu$ and standard deviation $\sigma_{\bar{X}}$ equal to $\sigma/\sqrt{n}$. Thus we determine whether the difference between $\bar{X}$ and $\mu$ is statistically significant by finding the number of standard deviations $\bar{X}$ is from $\mu$ using the Z statistic. Other test statistics are also available, such as $t$, $F$, and $\chi^2$.
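A numerical sketch of this Z statistic with hypothetical values (sample mean 21.5 from $n = 36$ observations, $\sigma = 4$, testing $H_0: \mu = 20$):

```python
from math import sqrt

# Z test statistic for a mean with known sigma (hypothetical numbers).
xbar, mu0, sigma, n = 21.5, 20, 4, 36

# Z = (xbar - mu0) / (sigma / sqrt(n)): standard deviations of xbar from mu0.
z = (xbar - mu0) / (sigma / sqrt(n))

print(round(z, 2))
```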

iv) Critical Region (Formulating Decision Rule)

It must be decided, before the sample is drawn, under what conditions (circumstances) the null hypothesis will be rejected. A dividing line must be drawn separating “probable” from “improbable” sample values, given that the null hypothesis is true. In other words, a decision rule must be formulated specifying the conditions under which the null hypothesis should or should not be rejected. This dividing line defines the region of rejection: those values so large or so small that their probability of occurrence under the null hypothesis is rather remote. The set of possible values of the sample statistic that leads to rejecting the null hypothesis is called the critical region.


One-tailed and two-tailed tests of significance

In testing of hypothesis, if the rejection region lies on only one tail (left or right) of the curve, the test is called one-tailed. This happens when the null hypothesis is tested against an alternative hypothesis of the “greater than” or “less than” type.

If the rejection region lies on both tails of the curve, the test is called two-tailed. This happens when the null hypothesis is tested against an alternative hypothesis of the “not equal to” type.

v) Making a Decision

In this last step of testing hypotheses, the computed value of the test statistic is compared with the critical value. If the sample statistic falls within the rejection region, the null hypothesis is rejected; otherwise, it is not rejected. Note that only one of two decisions is possible in hypothesis testing: either reject or do not reject the null hypothesis. Instead of “accepting” the null hypothesis ($H_0$), some researchers prefer to phrase the decision as “do not reject $H_0$”, “we fail to reject $H_0$”, or “the sample results do not allow us to reject $H_0$”.
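The whole decision step can be sketched as follows, using a hypothetical computed statistic $Z = 2.25$ and the standard-normal two-tailed critical value $\pm 1.96$ for $\alpha = 0.05$:

```python
# Two-tailed decision rule at alpha = 0.05 (hypothetical computed z).
z = 2.25
z_critical = 1.96   # two-tailed critical value for alpha = 0.05

# Reject H0 only if z falls in the rejection (critical) region.
if abs(z) > z_critical:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(decision)
```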


The Deciles: Measure of Position Made Easy (2012)

The deciles are the nine values of the variable that divide an ordered (sorted, arranged) data set into ten equal parts, so that each part represents $\frac{1}{10}$ of the sample or population. They are denoted by $D_1, D_2, \cdots, D_9$. The first decile ($D_1$) is the value of the order statistics that exceeds $\frac{1}{10}$ of the observations and is less than the remaining $\frac{9}{10}$. The ninth decile ($D_9$) is the value that exceeds $\frac{9}{10}$ of the observations and is less than the remaining $\frac{1}{10}$. Note that the fifth decile is equal to the median. The deciles determine the values for 10%, 20%, …, and 90% of the data.

Calculating Deciles for Ungrouped Data

To calculate the decile for the ungrouped data, first order all observations according to the magnitudes of the values, then use the following formula for $m$th decile.

\[D_m= m \times \left( \frac{n+1}{10} \right) \mbox{th value}, \qquad \mbox{where } m=1,2,\cdots,9\]

Example: Calculate the 2nd and 8th deciles of the following ordered data 13, 13,13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38, 41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82.
Solution:

\begin{eqnarray*}
D_2 &=& 2 \times \left\{ \frac{(n+1)}{10} \right\} \mbox{th value}\\
&=& 2 \times \frac{30+1}{10}=6.2\mbox{th value}
\end{eqnarray*}

We have to locate the sixth value in the ordered array and then move 0.2 of the distance between the sixth and seventh values. i.e. the value of the 2nd decile can be calculated as
\[6 \mbox{th observation} + \{7 \mbox{th observation} - 6 \mbox{th observation} \}\times 0.2\]
as 6th observation is 27 and 7th observation is 31.
The second decile would be $27+\{31-27\} \times 0.2 = 27.8$

Similarly, $D_8$ can be calculated. $D_8=52.6$.
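The two-step interpolation above can be wrapped in a small function (a sketch; the helper name `decile` is not from the original text):

```python
def decile(data, m):
    """m-th decile of ungrouped data via the D_m = m(n+1)/10-th ordered value."""
    x = sorted(data)
    pos = m * (len(x) + 1) / 10   # 1-based, possibly fractional, position
    j = int(pos)                  # rank of the lower neighbouring value
    frac = pos - j
    if frac == 0 or j >= len(x):  # exact rank, or position beyond the last value
        return x[min(j, len(x)) - 1]
    # Move frac of the distance between the j-th and (j+1)-th ordered values.
    return x[j - 1] + frac * (x[j] - x[j - 1])


data = [13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38,
        41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82]
print(round(decile(data, 2), 1), round(decile(data, 8), 1))
```

This reproduces the hand calculation: $D_2 = 27.8$ and $D_8 = 52.6$.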

Calculating Deciles for Grouped Data

The following formula can calculate the $m$th decile for grouped data (in ascending order).

\[D_m=l+\frac{h}{f}\left(\frac{m \, n}{10}-c\right)\]

where

$l$ = is the lower class boundary of the class containing $m$th deciles
$h$ = is the width of the class containing $m$th deciles
$f$ = is the frequency of the class containing $m$th deciles
$n$ = is the total number of frequencies
$c$ = is the cumulative frequency of the class preceding the class containing $m$th deciles

Example: Calculate the first and seventh deciles of the following grouped data

(Grouped frequency distribution table)

Solution: The decile class for $D_1$ is located by $\frac{m \, n}{10} = \frac{1 \times 30}{10} = 3$rd observation. As the 3rd observation lies in the first class (first group), so

\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m \, n}{10}-c\right)\\
D_1&=&85.5+\frac{5}{6}\left(\frac{1\times30}{10}-0\right)\\
&=&88
\end{eqnarray*}

The decile class for $D_7$ is 100.5–105.5, as $\frac{7 \times 30}{10}=21$st observation, which lies in the fourth class (group).
\begin{eqnarray*}
D_m&=&l+\frac{h}{f}\left(\frac{m \, n}{10}-c\right)\\
D_7&=&100.5+\frac{5}{6}\left(\frac{7\times30}{10}-20\right)\\
&=&101.333
\end{eqnarray*}
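The grouped-data formula translates directly into a small function (a sketch; the name `grouped_decile` is not from the original text). The arguments below are the class boundary, width, frequency, total, and cumulative frequency used in the worked example:

```python
def grouped_decile(m, l, h, f, n, c):
    """m-th decile for grouped data: D_m = l + (h/f) * (m*n/10 - c)."""
    return l + (h / f) * (m * n / 10 - c)


# Values from the worked example above.
d1 = grouped_decile(m=1, l=85.5, h=5, f=6, n=30, c=0)
d7 = grouped_decile(m=7, l=100.5, h=5, f=6, n=30, c=20)
print(round(d1, 3), round(d7, 3))
```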

