# Basic Statistics and Data Analysis

## Principal Component Regression (PCR)

Principal components are a new set of uncorrelated variables obtained by transforming the original data set. The transformation ranks the new variables according to their importance (that is, according to the size of their variance), so that those of least importance can be eliminated. In principal component regression, a least squares regression is then performed on this reduced set of principal components.

Principal Component Regression (PCR) is not scale invariant, so one should center and scale the data first. Consider a p-dimensional random vector $x=(x_1, x_2, \cdots, x_p)^t$ with mean zero (after centering) and covariance matrix $\Sigma$, and assume that $\Sigma$ is positive definite. Let $V=(v_1,v_2, \cdots, v_p)$ be a $(p \times p)$-matrix with orthonormal column vectors, that is, $v_i^t\, v_i=1$ for $i=1,2, \cdots, p$, $v_i^t\, v_j=0$ for $i \ne j$, and $V^t =V^{-1}$. Consider the linear transformation

\begin{aligned}
z&=V^t x\\
z_i&=v_i^t x
\end{aligned}

Since $x$ has been centered, the variance of the random variable $z_i$ is
\begin{aligned}
Var(z_i)&=E[v_i^t\, x\, x^t\, v_i]\\
&=v_i^t \Sigma v_i
\end{aligned}

Maximizing the variance $Var(z_i)$ under the constraint $v_i^t v_i=1$ with a Lagrange multiplier $a_i$ gives
$\phi_i=v_i^t \Sigma v_i -a_i(v_i^t v_i-1)$

Setting the partial derivative to zero, we get
$\frac{\partial \phi_i}{\partial v_i} = 2 \Sigma v_i - 2a_i v_i=0$

which is
$(\Sigma - a_i I)v_i=0$

In matrix form
$\Sigma V= VA$
or
$\Sigma = VAV^t$

where $A=diag(a_1, a_2, \cdots, a_p)$. This is known as the eigenvalue problem: $v_i$ are the eigenvectors of $\Sigma$ and $a_i$ the corresponding eigenvalues, ordered such that $a_1 \ge a_2 \ge \cdots \ge a_p$. Since $\Sigma$ is symmetric and positive definite, all eigenvalues are real and strictly positive.

$z_i$ is called the $i$th principal component of $x$ and we have
$Cov(z)=V^t Cov(x) V=V^t \Sigma V=A$

The variance of the $i$th principal component equals the eigenvalue $a_i$, and the variances are ranked in descending order. This means that the last principal component has the smallest variance. Because $A$ is a diagonal matrix, the principal components are uncorrelated with one another, and the directions $v_i$ that define them are orthogonal.
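The derivation above can be checked numerically. The following is a minimal sketch using NumPy, assuming synthetic data (the `mixing` matrix, the sample size, and the seed are made up for illustration). It estimates the covariance matrix, solves the eigenvalue problem, and confirms that the covariance of the transformed variables is the diagonal matrix of eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): n observations on p correlated variables.
n, p = 200, 3
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.standard_normal((n, p)) @ mixing.T

# Center the data and estimate the covariance matrix (divisor n).
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so reorder them to get a_1 >= a_2 >= ... >= a_p.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
a = eigvals[order]        # eigenvalues = variances of the principal components
V = eigvecs[:, order]     # columns are the eigenvectors v_1, ..., v_p

# Principal component scores: each row is z^t = x^t V for one observation.
Z = Xc @ V
cov_Z = Z.T @ Z / n

print("eigenvalues:", np.round(a, 4))
print("Cov(z):")
print(np.round(cov_Z, 4))
```

The off-diagonal entries of `cov_Z` are zero up to rounding error, and its diagonal reproduces the eigenvalues in descending order.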

In the following, we use the first $q$ principal components for regression, where $1\le q \le p$. The regression model for observed data $X$ and $y$ can then be expressed as

\begin{aligned}
y&=X\beta+\varepsilon\\
&=XVV^t\beta+\varepsilon\\
&= Z\theta+\varepsilon
\end{aligned}

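with the $n\times q$ matrix of the empirical principal components $Z=XV$ and the new regression coefficients $\theta=V^t \beta$. The step $X\beta=XVV^t\beta$ is exact when all $p$ columns of $V$ are used, since $VV^t=I$; for $q<p$, $V$ is restricted to its first $q$ columns and $Z\theta$ is the reduced model. The solution of the least squares estimation is

\begin{aligned}
\hat{\theta}_k=(z_k^t z_k)^{-1}z_k^ty
\end{aligned}

and $\hat{\theta}=(\hat{\theta}_1, \cdots, \hat{\theta}_q)^t$, where $z_k$ denotes the $k$th column of $Z$.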

Since the $z_k$ are orthogonal, the regression is a sum of univariate regressions, that is
$\hat{y}_{PCR}=\sum_{k=1}^q \hat{\theta}_k z_k$

Since $z_k$ are linear combinations of the original $x_j$, the solution in terms of coefficients of the $x_j$ can be expressed as
$\hat{\beta}_{PCR} (q)=\sum_{k=1}^q \hat{\theta}_k v_k=V \hat{\theta}$

Note that if $q=p$, we would get back the usual least squares estimates for the full model. For $q<p$, we get a “reduced” regression.
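As a rough illustration of these formulas, the sketch below fits PCR with NumPy on synthetic data. The data, the function name `pcr_coefficients`, and the choice $q=3$ are assumptions made for the example, not part of the derivation. It standardizes the predictors, regresses $y$ on the first $q$ component scores one at a time, and maps the coefficients back to the original variables.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative assumption): the fourth column nearly duplicates the first.
n, p = 100, 4
X = rng.standard_normal((n, p))
X[:, 3] = X[:, 0] + 0.05 * rng.standard_normal(n)
y = X @ np.array([1.0, -2.0, 0.5, 1.0]) + 0.1 * rng.standard_normal(n)

# PCR is not scale invariant: center and scale X, center y.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()


def pcr_coefficients(Xs, yc, q):
    """Return beta_PCR(q), the PCR coefficients on the standardized predictors."""
    S = Xs.T @ Xs / Xs.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    Vq = eigvecs[:, order[:q]]          # first q eigenvectors
    Z = Xs @ Vq                         # n x q matrix of component scores
    # Columns of Z are orthogonal, so each theta_k is a univariate regression.
    theta = (Z.T @ yc) / np.sum(Z ** 2, axis=0)
    return Vq @ theta


beta_full = pcr_coefficients(Xs, yc, q=p)      # q = p
beta_reduced = pcr_coefficients(Xs, yc, q=3)   # q < p
beta_ols, *_ = np.linalg.lstsq(Xs, yc, rcond=None)

rss_full = np.sum((yc - Xs @ beta_full) ** 2)
rss_reduced = np.sum((yc - Xs @ beta_reduced) ** 2)

print("PCR, q = p:", np.round(beta_full, 4))
print("OLS       :", np.round(beta_ols, 4))
print("PCR, q = 3:", np.round(beta_reduced, 4))
```

With $q=p$ the PCR coefficients agree with ordinary least squares up to rounding, while the reduced fit trades a somewhat larger residual sum of squares for dropping the lowest-variance direction.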

## Canonical Correlation Analysis

Bivariate correlation analysis measures the strength of the relationship between two variables. Sometimes one needs to find the strength of the relationship between two sets of variables, and canonical correlation is the appropriate technique in that case. Canonical correlation is appropriate in the same situations where multiple regression would be, but where there are multiple inter-correlated outcome variables. Canonical correlation analysis determines a set of canonical variates: orthogonal linear combinations of the variables within each set that best explain the variability both within and between sets. For example,

• In medicine, individuals’ lifestyles and eating habits may affect their health, as measured by a number of health-related variables such as hypertension, weight, anxiety, and tension level.
• In business, the marketing manager of a consumer goods firm may be interested in finding the relationship between the types of products purchased and consumers’ lifestyles and personalities.

In these two examples, one set of variables is the predictor (independent) set, while the other is the criterion (dependent) set. The objective of canonical correlation analysis is to determine whether the predictor set of variables affects the criterion set of variables.

Note that it is not necessary to designate the two sets of variables as the dependent and independent sets. In this case the objective of canonical correlation is to ascertain the relationship between the two sets of variables.

The objective of canonical correlation is similar to that of conducting a principal components analysis on each set of variables. In principal components analysis, the first new axis results in a new variable that accounts for the maximum variance in the data, while in canonical correlation analysis a new axis is identified for each set of variables such that the correlation between the two resulting new variables is maximum.

Canonical correlation analysis can also be considered a data reduction technique, since only a few canonical variables may be needed to adequately represent the association between the two sets of variables. Therefore, an additional objective of canonical correlation is to determine the minimum number of canonical correlations needed to adequately represent the association between two sets of variables.
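To make the idea concrete, here is a minimal sketch that computes canonical correlations with NumPy. The synthetic data and the function name `canonical_correlations` are assumptions for illustration. The QR-plus-SVD route used here is one common numerical approach and is not taken from the discussion above; it assumes both centered data matrices have full column rank.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (illustrative assumption): one shared latent signal links the two sets.
n = 500
latent = rng.standard_normal(n)
X = np.column_stack([latent + 0.5 * rng.standard_normal(n),
                     rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([latent + 0.5 * rng.standard_normal(n),
                     rng.standard_normal(n)])


def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and the columns of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two centered sets.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^t Qy are the canonical correlations, largest first.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


rho = canonical_correlations(X, Y)
pairwise = np.corrcoef(X[:, 0], Y[:, 0])[0, 1]

print("canonical correlations:", np.round(rho, 3))
print("correlation of the two signal-carrying columns:", round(pairwise, 3))
```

In this setup only the first canonical correlation is large, which matches the data reduction view: a single pair of canonical variates captures most of the association between the two sets.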

## Descriptive Statistics of a Multivariate Data Set

Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the arithmetic mean (a measure of location) and the average of the squared distances of all the numbers from the mean (a measure of spread or variation). Here we discuss descriptive statistics for a multivariate data set.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association.

## Measure of Location

The arithmetic average of the $n$ measurements $(x_{11}, x_{21}, \cdots, x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1}$

The sample mean for $n$ measurements on each of the $p$ variables (there will be $p$ sample means) is

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where } k = 1, 2, \cdots , p$
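As a quick check of the formula, the sketch below computes the sample means with NumPy for the small book-sales data used in this series (dollar sales 42, 52, 48, 63 and books sold 4, 2, 8, 3). Storing the data as an array `X` with one row per item is a layout chosen for the example.

```python
import numpy as np

# Rows are items j = 1..n, columns are variables k = 1..p
# (column 0: sales in dollars, column 1: number of books sold).
X = np.array([[42, 4],
              [52, 2],
              [48, 8],
              [63, 3]], dtype=float)
n, p = X.shape

# x_bar[k-1] = (1/n) * sum over j of x_jk
x_bar = X.sum(axis=0) / n

print(x_bar)  # means: 51.25 for sales, 4.25 for books sold
```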

## Measure of Spread

The measure of spread (variance) for $n$ measurements on the first variable is
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2}$, where $\bar{x}_{1}$ is the sample mean of the $x_{j1}$’s.

The measure of spread (variance) for $n$ measurements on each of the $p$ variables is

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$

The sample variance is also written as $s_{kk}$, and its square root, $\sqrt{s_{kk}}$, is the sample standard deviation, i.e.

$s_{k}^{2} =s_{kk} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} \mbox{ where } k=1,2,\cdots ,p$
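The following sketch evaluates the variance and standard deviation formulas with NumPy on the book-sales data (dollar sales 42, 52, 48, 63 and books sold 4, 2, 8, 3). It uses the divisor $n$, as in the formulas above, rather than $n-1$; the array layout is an assumption for the example.

```python
import numpy as np

# Rows are items, columns are variables (sales in dollars, books sold).
X = np.array([[42, 4],
              [52, 2],
              [48, 8],
              [63, 3]], dtype=float)
n = X.shape[0]

x_bar = X.mean(axis=0)

# s_kk = (1/n) * sum over j of (x_jk - x_bar_k)^2, computed for every column k
s_kk = ((X - x_bar) ** 2).sum(axis=0) / n

# Sample standard deviation: square root of the sample variance
s_k = np.sqrt(s_kk)

print(s_kk)  # variances: 58.6875 for sales, 5.1875 for books sold
print(s_k)
```

NumPy's `X.var(axis=0)` uses the same divisor $n$ by default, so it returns the same values.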

## Sample Covariance

Consider $n$ pairs of measurements on Variable 1 and Variable 2
$\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]$
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of $V_1$ and $V_2$ is provided by the sample covariance
$s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2} )$
(the average product of the deviations from their respective means). In general,

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) \mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

It measures the linear association between the $i$th and $k$th variables. When $i=k$, the sample covariance reduces to the sample variance $s_{kk}$.

Variance is the most commonly used measure of dispersion (variation) in the data and it is directly proportional to the amount of variation or information available in the data.
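A minimal NumPy sketch of the sample covariance, again on the book-sales data (dollar sales 42, 52, 48, 63 and books sold 4, 2, 8, 3) and with the divisor $n$. Collecting all $s_{ik}$ into one matrix `S` is a convenience chosen for the example.

```python
import numpy as np

# Rows are items, columns are variables (sales in dollars, books sold).
X = np.array([[42, 4],
              [52, 2],
              [48, 8],
              [63, 3]], dtype=float)
n = X.shape[0]

# Deviations from the column means
D = X - X.mean(axis=0)

# S[i, k] = (1/n) * sum over j of (x_ji - x_bar_i) * (x_jk - x_bar_k)
S = D.T @ D / n

print(S)
# Diagonal: variances 58.6875 and 5.1875; off-diagonal: covariance s_12 = -6.5625
```

The negative covariance says that, in this small sample, larger dollar amounts tend to go with fewer books sold. `np.cov(X, rowvar=False, bias=True)` gives the same matrix.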

## Sample Correlation Coefficient

The sample correlation coefficient for the $i$th and $k$th variables is

$r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2} } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2} } }$
$\mbox{ where } i=1,2,\cdots ,p \mbox{ and } k=1,2,\cdots ,p$

Note that $r_{ik} =r_{ki}$ for all $i$ and $k$, and $r$ lies between -1 and +1. $r$ measures the strength of the linear association. If $r=0$, there is no linear association between the components. The sign of $r$ indicates the direction of the association.
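The sketch below computes the sample correlation matrix with NumPy for the book-sales data (dollar sales 42, 52, 48, 63 and books sold 4, 2, 8, 3). The matrix names `S` and `R` are chosen for the example.

```python
import numpy as np

# Rows are items, columns are variables (sales in dollars, books sold).
X = np.array([[42, 4],
              [52, 2],
              [48, 8],
              [63, 3]], dtype=float)
n = X.shape[0]

D = X - X.mean(axis=0)
S = D.T @ D / n                  # sample covariance matrix, entries s_ik

# r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk))
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)

print(R)  # r_12 is approximately -0.376
```

The divisor cancels in the ratio, so `np.corrcoef(X, rowvar=False)` returns the same matrix regardless of whether $n$ or $n-1$ is used for the covariances.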

# Multivariate Analysis: An Introduction

The term “multivariate statistics” covers all statistics in which more than two variables are analyzed simultaneously.

Many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following.

• Data reduction or structural simplification
The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
• Sorting and Grouping
Groups of similar objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
• Investigation of the dependence among variables
The nature of the relationships among variables is of interest. Are all the variables mutually independent, or are one or more variables dependent on the others?
• Prediction
Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observation on the other variables.
• Hypothesis Construction and testing
Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

## The Organization of Data

We are concerned with analyzing measurements made on several variables or characteristics. These measurements (data) must frequently be arranged and displayed in various ways (graphs, tabular form, etc.). The preliminary concepts underlying these first steps of data organization are introduced below.

## Array

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p \ge 1$ of variables or characteristics to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.

The notation $x_{jk}$ indicates the particular value of the $k$th variable that is observed on the $j$th item or trial, i.e. $x_{jk}$ is the measurement of the $k$th variable on the $j$th item. So, $n$ measurements on $p$ variables can be displayed as

$\begin{array}{ccccccc} & V_1 & V_2 & \dots & V_k & \dots & V_p \\ \mbox{Item 1} & x_{11} & x_{12} & \dots & x_{1k} & \dots & x_{1p} \\ \mbox{Item 2} & x_{21} & x_{22} & \dots & x_{2k} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \mbox{Item } j & x_{j1} & x_{j2} & \dots & x_{jk} & \dots & x_{jp} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \mbox{Item } n & x_{n1} & x_{n2} & \dots & x_{nk} & \dots & x_{np} \\ \end{array}$

These data can be displayed as a rectangular array $X$ of $n$ rows and $p$ columns

$X=\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2k} & \dots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{j1} & x_{j2} & \dots & x_{jk} & \dots & x_{jp} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} & \dots & x_{np} \end{pmatrix}$

This X array contains the data consisting of all of the observations on all of the variables.

Example: Suppose we have data on the number of books sold and the total amount of each sale.

Variable 1 (Sales in Dollars)
$\begin{array}{ccccc} \mbox{Data Values:} & 42 & 52 & 48 & 63 \\ \mbox{Notation:} & x_{11} & x_{21} & x_{31} & x_{41} \end{array}$

Variable 2 (\# of Books sold)
$\begin{array}{ccccc} \mbox{Data Values:} & 4 & 2 & 8 & 3 \\ \mbox{Notation:} & x_{12} & x_{22} & x_{32} & x_{42} \end{array}$
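One way to hold this example as the array $X$ is a NumPy array with one row per item and one column per variable. This is a minimal sketch; the variable names `sales` and `books` are chosen for illustration. Note that NumPy indexes from 0, so the element $x_{jk}$ is stored at `X[j-1, k-1]`.

```python
import numpy as np

sales = [42, 52, 48, 63]   # Variable 1: sales in dollars
books = [4, 2, 8, 3]       # Variable 2: number of books sold

# Stack the variables as columns: X has n rows (items) and p columns (variables).
X = np.column_stack([sales, books])
n, p = X.shape

# x_{jk} in the text (1-based) corresponds to X[j-1, k-1] (0-based).
x_32 = X[2, 1]   # third item, second variable: 8 books
x_41 = X[3, 0]   # fourth item, first variable: 63 dollars

print(X)
print("n =", n, "p =", p)
print("x_32 =", x_32, "x_41 =", x_41)
```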

Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics, such as the sample mean (a measure of location) and the average of the squared distances of all the numbers from the mean (a measure of spread or variation).