# Basic Statistics and Data Analysis

## Descriptive Statistics for a Multivariate Data Set

Much of the information contained in a data set can be assessed by calculating certain summary numbers, known as descriptive statistics. Examples are the arithmetic mean (a measure of location) and the average of the squared distances of the observations from the mean (a measure of spread, or variation). This section discusses descriptive statistics for a multivariate data set.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association.

## Measure of Location

The arithmetic average of the $n$ measurements $(x_{11}, x_{21}, \cdots ,x_{n1})$ on the first variable (the $x_{jk}$ notation is defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1}$

The sample mean can be computed for the $n$ measurements on each of the $p$ variables, giving $p$ sample means:

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where } k = 1, 2, \cdots , p$
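The formula for $\bar{x}_{k}$ can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the function name `sample_means` and the list-of-rows layout are assumptions made here, and the data values are those of the bookstore example in Multivariate Analysis: An Introduction.

```python
# Data matrix X: one row per item (index j), one column per variable (index k).
# Column 0 = sales in dollars, column 1 = number of books sold.
X = [
    [42, 4],
    [52, 2],
    [48, 8],
    [63, 3],
]


def sample_means(X):
    """Return the p sample means, one per column of X (hypothetical helper)."""
    n = len(X)       # number of items
    p = len(X[0])    # number of variables
    return [sum(row[k] for row in X) / n for k in range(p)]


means = sample_means(X)
print(means)  # [51.25, 4.25]
```

Python indexes from 0, so `means[0]` corresponds to $\bar{x}_{1}$ and `means[1]` to $\bar{x}_{2}$.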

A measure of spread, the sample variance, for the $n$ measurements on the first variable is
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2}$ where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$ values.

In general, the sample variance for the $n$ measurements on the $k$th variable is

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$

The square root of the sample variance, $\sqrt{s_{kk}}$, is the sample standard deviation, where the sample variance is written with a double subscript as

$s_{k}^{2} =s_{kk} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$
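A short sketch of the variance and standard deviation calculations, again in plain Python and using the bookstore data from Multivariate Analysis: An Introduction. The helper name `sample_variances` is hypothetical, and the divisor is $n$ to match the formulas in this section.

```python
import math

# Rows are items; column 0 = sales in dollars, column 1 = number of books sold.
X = [[42, 4], [52, 2], [48, 8], [63, 3]]


def sample_variances(X):
    """Return s_kk for each column k, using divisor n (hypothetical helper)."""
    n = len(X)
    p = len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    return [
        sum((row[k] - means[k]) ** 2 for row in X) / n
        for k in range(p)
    ]


variances = sample_variances(X)
std_devs = [math.sqrt(v) for v in variances]  # sample standard deviations

print(variances)  # [58.6875, 5.1875]
```

Many libraries divide by $n-1$ instead of $n$ by default. In Python's standard library, for instance, `statistics.pvariance` uses the divisor $n$, while `statistics.variance` uses $n-1$.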

## Sample Covariance

Consider $n$ pairs of measurements on Variable 1 and Variable 2:
$\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]$
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of $V_1$ and $V_2$ is provided by the sample covariance
$s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2} )$
(the average of the products of the deviations from their respective means). In general, for the $i$th and $k$th variables,

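$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) \mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

It measures the linear association between the $i$th and $k$th variables. When $i=k$, the sample covariance reduces to the sample variance $s_{kk}$.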
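The covariance formula can be illustrated with a small sketch that computes every $s_{ik}$ at once. The function name `sample_covariance_matrix` is an assumption made for this example, the divisor is $n$ as in the text, and the data are the bookstore values from Multivariate Analysis: An Introduction.

```python
# Rows are items; column 0 = sales in dollars, column 1 = number of books sold.
X = [[42, 4], [52, 2], [48, 8], [63, 3]]


def sample_covariance_matrix(X):
    """Return the p x p nested list S with S[i][k] = s_ik (divisor n)."""
    n = len(X)
    p = len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[k] - means[k]) for row in X) / n
            for k in range(p)
        ]
        for i in range(p)
    ]


S = sample_covariance_matrix(X)
print(S[0][1])  # -6.5625, the sample covariance s_12
```

The diagonal entries `S[0][0]` and `S[1][1]` are the sample variances, and `S[0][1]` equals `S[1][0]` because the product of deviations does not depend on the order of the two variables.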

Variance is the most commonly used measure of dispersion (variation) in the data. A larger variance indicates that the observations are more spread out around their mean and, in that sense, that more variation or information is available in the data.

## Sample Correlation Coefficient

The sample correlation coefficient for the $i$th and $k$th variables is

$r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2} } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} } }$
$\mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

Note that $r_{ik} =r_{ki}$ for all $i$ and $k$, and that $r$ lies between $-1$ and $+1$. The magnitude of $r$ measures the strength of the linear association, and its sign indicates the direction of the association. If $r=0$, there is no linear association between the two variables.
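As a rough sketch, the correlation coefficient can be computed directly from the sums of squares and cross-products, since the factor $\frac{1}{n}$ cancels between numerator and denominator. The function name `sample_correlation` is hypothetical, and the two lists hold the bookstore data from Multivariate Analysis: An Introduction.

```python
import math

sales = [42, 52, 48, 63]  # Variable 1: sales in dollars
books = [4, 2, 8, 3]      # Variable 2: number of books sold


def sample_correlation(x, y):
    """Return r for two equal-length lists (hypothetical helper).

    Assumes neither variable is constant; otherwise the denominator is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


r = sample_correlation(sales, books)
print(round(r, 3))  # -0.376
```

For these four sales the coefficient is negative and fairly small in magnitude, which suggests a weak negative linear association between the amount of a sale and the number of books sold.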

# Multivariate Analysis: An Introduction

The term “multivariate statistics” covers all statistical methods in which more than two variables are analyzed simultaneously.

Many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following.

• Data reduction or structural simplification
The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
• Sorting and Grouping
Groups of similar objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
• Investigation of the dependence among variables
The nature of the relationships among variables is of interest. Are all the variables mutually independent, or are one or more variables dependent on the others?
• Prediction
Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observation on the other variables.
• Hypothesis Construction and testing
Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

## The Organization of Data

We are concerned with analyzing measurements made on several variables or characteristics. These measurements (data) must frequently be arranged and displayed in various ways (graphs, tabular form, etc.). The preliminary concepts underlying these first steps of data organization are described below.

## Array

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p \ge 1$ of variables or characteristics to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.

The notation $x_{jk}$ indicates the particular value of the $k$th variable that is observed on the $j$th item or trial, i.e. $x_{jk}$ is the measurement of the $k$th variable on the $j$th item. So, $n$ measurements on $p$ variables can be displayed as

$\begin{array}{ccccccc} & V_1 & V_2 & \dots & V_k & \dots & V_p \\ \mbox{Item 1} & x_{11} & x_{12} & \dots & x_{1k} & \dots & x_{1p} \\ \mbox{Item 2} & x_{21} & x_{22} & \dots & x_{2k} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \mbox{Item } j & x_{j1} & x_{j2} & \dots & x_{jk} & \dots & x_{jp} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \mbox{Item } n & x_{n1} & x_{n2} & \dots & x_{nk} & \dots & x_{np} \\ \end{array}$

These data can be displayed as a rectangular array $X$ of $n$ rows and $p$ columns:

$X=\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2k} & \dots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{j1} & x_{j2} & \dots & x_{jk} & \dots & x_{jp} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} & \dots & x_{np} \end{pmatrix}$

The array $X$ contains all of the observations on all of the variables.

Example: Suppose we have data on four sales, recording the total amount of each sale and the number of books sold.

Variable 1 (Sales in Dollars)
$\begin{array}{ccccc} \mbox{Data Values:} & 42 & 52 & 48 & 63 \\ \mbox{Notation:} & x_{11} & x_{21} & x_{31} & x_{41} \end{array}$

Variable 2 (Number of Books Sold)
$\begin{array}{ccccc} \mbox{Data Values:} & 4 & 2 & 8 & 3 \\ \mbox{Notation:} & x_{12} & x_{22} & x_{32} & x_{42} \end{array}$
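One possible way to hold this example in code is as a list of rows, one per item, mirroring the array $X$ described above. This is a sketch only: the accessor `x(j, k)` is a hypothetical helper that translates the 1-based subscripts of the text into Python's 0-based indices.

```python
sales = [42, 52, 48, 63]  # Variable 1: sales in dollars
books = [4, 2, 8, 3]      # Variable 2: number of books sold

# Build the n x p data array: row j holds all measurements on item j.
X = [list(row) for row in zip(sales, books)]
n, p = len(X), len(X[0])


def x(j, k):
    """Return x_jk, the value of variable k on item j, using 1-based indices."""
    return X[j - 1][k - 1]


print(n, p)     # 4 2
print(x(3, 2))  # 8, the number of books in the third sale
print(x(4, 1))  # 63, the dollar amount of the fourth sale
```

Storing one row per item keeps all measurements on the same experimental unit together, which matches the convention that rows index items and columns index variables.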

Much of the information contained in such data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the sample mean (a measure of location) and the average of the squared distances of the observations from the mean (a measure of spread, or variation).