Multivariate Data Sets: Descriptive Statistics (2010)

Much of the information contained in a multivariate data set can be assessed by calculating summary numbers known as multivariate descriptive statistics. Examples are the arithmetic mean (a measure of location) and the average of the squared distances of the observations from the mean (a measure of spread or variation). This post discusses descriptive statistics for multivariate data sets.

Multivariate data sets are used in various fields, such as:

  • Social Sciences: Analyzing factors influencing social phenomena like voting behavior, educational attainment, or health outcomes.
  • Business: Understanding customer demographics and purchase patterns, market research, risk assessment, and financial modeling.
  • Natural Sciences: Studying relationships between environmental variables, analyzing climate data, or exploring genetic factors influencing diseases.

Multivariate Data Sets: Descriptive Analysis

We rely most heavily on descriptive statistics that measure location, variation, and linear association. For a multivariate data set, we start with a measure of location and a measure of spread, and then turn to the sample covariance and the sample correlation coefficient.

Measure of Location

The arithmetic average of the $n$ measurements $(x_{11}, x_{21}, \cdots, x_{n1})$ on the first variable (defined in Multivariate Analysis: An Introduction) is

Sample Mean = $\bar{x}_{1}=\frac{1}{n} \sum _{j=1}^{n}x_{j1}$, where the sum runs over the items $j =1, 2,3,\cdots , n$.

The sample mean can be computed for the $n$ measurements on each of the $p$ variables, so there are $p$ sample means:

$\bar{x}_{k} =\frac{1}{n} \sum _{j=1}^{n}x_{jk} \mbox{ where }  k  = 1, 2, \cdots , p$
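
As a minimal sketch (not part of the original text), the $p$ sample means can be computed column by column in plain Python. The function name `sample_means` and the list-of-rows layout are assumptions made for illustration; the values are the four book-sale observations used as the example data set (sales in dollars, number of books sold).

```python
def sample_means(rows):
    """Return the p column means for n rows of p measurements each."""
    n = len(rows)       # number of items (rows)
    p = len(rows[0])    # number of variables (columns)
    return [sum(row[k] for row in rows) / n for k in range(p)]

# Each row is one item: (sales in dollars, number of books sold).
X = [(42, 4), (52, 2), (48, 8), (63, 3)]

means = sample_means(X)
print(means)  # [51.25, 4.25]
```

Here `means[0]` plays the role of $\bar{x}_{1}$ and `means[1]` the role of $\bar{x}_{2}$.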

Measure of Spread

The measure of spread (sample variance) for the $n$ measurements on the first variable of a multivariate data set is
$s_{1}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )^{2} $ where $\bar{x}_{1} $ is the sample mean of the $x_{j1}$’s, the measurements on the first variable.

In general, the sample variance for the $n$ measurements on each of the $p$ variables is

$s_{k}^{2} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2}  \mbox{ where } k=1,2,\cdots ,p$ and the sum runs over the items $j=1,2,\cdots ,n$.

The sample variance is also written with a double subscript as $s_{kk}$, and the square root of the sample variance, $\sqrt{s_{kk}}$, is the sample standard deviation, i.e.

$s_{k}^{2} =s_{kk} =\frac{1}{n} \sum _{j=1}^{n}(x_{jk} -\bar{x}_{k} )^{2}  \mbox{ where }  k=1,2,\cdots ,p$
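
The variance and standard deviation formulas can be sketched in plain Python as follows. This is an illustrative sketch only: the function name `sample_variances` is an assumption, the divisor is $n$ as in the formulas above, and the data are the four book-sale observations from the example data set.

```python
import math

def sample_variances(rows):
    """Return the p variances s_kk (divisor n) for n rows of p measurements."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(p)]
    return [sum((row[k] - means[k]) ** 2 for row in rows) / n
            for k in range(p)]

# Each row is one item: (sales in dollars, number of books sold).
X = [(42, 4), (52, 2), (48, 8), (63, 3)]

variances = sample_variances(X)
std_devs = [math.sqrt(v) for v in variances]  # sample standard deviations
print(variances)  # [58.6875, 5.1875]
```

Note that many software routines divide by $n-1$ instead of $n$. In Python's standard library, for instance, `statistics.pvariance` uses the divisor $n$ while `statistics.variance` uses $n-1$, so results may differ slightly from the formula above.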


Sample Covariance

Consider $n$ pairs of measurements on Variable 1 and Variable 2
\[\left[\begin{array}{c} {x_{11} } \\ {x_{12} } \end{array}\right],\left[\begin{array}{c} {x_{21} } \\ {x_{22} } \end{array}\right],\cdots ,\left[\begin{array}{c} {x_{n1} } \\ {x_{n2} } \end{array}\right]\]
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $(j=1,2,\cdots ,n)$. A measure of linear association between the measurements of $V_1$ and $V_2$ is provided by the sample covariance
\[s_{12} =\frac{1}{n} \sum _{j=1}^{n}(x_{j1} -\bar{x}_{1} )(x_{j2} -\bar{x}_{2}  )\]
(the average product of the deviations from their respective means). In general,

$s_{ik} =\frac{1}{n} \sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k}  )$;  $i=1,2,\cdots, p$ and $k=1,2,\cdots, p$.

It measures the linear association between the $i$th and $k$th variables. When $i=k$, the sample covariance reduces to the sample variance $s_{kk}$.
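
A minimal sketch of the full set of sample covariances $s_{ik}$, arranged as a $p \times p$ table, is given below in plain Python. The function name `sample_covariance_matrix` is an assumption for illustration, the divisor is $n$ as in the formula above, and the data are again the four book-sale observations from the example data set.

```python
def sample_covariance_matrix(rows):
    """Return the p x p nested list of covariances s_ik (divisor n)."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(p)]
    return [[sum((row[i] - means[i]) * (row[k] - means[k]) for row in rows) / n
             for k in range(p)]
            for i in range(p)]

# Each row is one item: (sales in dollars, number of books sold).
X = [(42, 4), (52, 2), (48, 8), (63, 3)]

S = sample_covariance_matrix(X)
print(S)  # [[58.6875, -6.5625], [-6.5625, 5.1875]]
```

The diagonal entries `S[0][0]` and `S[1][1]` are the sample variances $s_{11}$ and $s_{22}$, and the off-diagonal entries are equal because $s_{12}=s_{21}$.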

Variance is the most commonly used measure of dispersion (variation) in the data: the larger the variance, the more variation (and, in that sense, information) the data contain.

Sample Correlation Coefficient

For Multivariate Data Sets, the sample correlation coefficient for the ith and kth variables is

\[r_{ik} =\frac{s_{ik} }{\sqrt{s_{ii} } \sqrt{s_{kk} } } =\frac{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )(x_{jk} -\bar{x}_{k} ) }{\sqrt{\sum _{j=1}^{n}(x_{ji} -\bar{x}_{i} )^{2}  } \sqrt{\sum _{j=1}^{n}(x_{jk} -\bar{x}_{k}  )^{2} } } \]
$\mbox{ where } i=1,2,\dots ,p \mbox{ and } k=1,2,\dots ,p$

Note that $r_{ik} =r_{ki} $ for all $i$ and $k$, and $r$ lies between $-1$ and $+1$ inclusive. The magnitude of $r$ measures the strength of the linear association, and the sign of $r$ indicates its direction. If $r=0$, there is no linear association between the two variables.
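
The correlation formula can be sketched in plain Python as shown below. This is an illustration under stated assumptions: the function name `sample_correlation` and the zero-based column indices are choices made here, and the data are the four book-sale observations from the example data set. The factor $\frac{1}{n}$ cancels between numerator and denominator, so the sketch works with the raw sums.

```python
import math

def sample_correlation(rows, i, k):
    """Return r_ik for columns i and k (zero-based) of the data rows."""
    n = len(rows)
    mean_i = sum(row[i] for row in rows) / n
    mean_k = sum(row[k] for row in rows) / n
    # Sum of products of deviations (numerator of r_ik).
    num = sum((row[i] - mean_i) * (row[k] - mean_k) for row in rows)
    # Square roots of the sums of squared deviations (denominator of r_ik).
    den = (math.sqrt(sum((row[i] - mean_i) ** 2 for row in rows))
           * math.sqrt(sum((row[k] - mean_k) ** 2 for row in rows)))
    return num / den

# Each row is one item: (sales in dollars, number of books sold).
X = [(42, 4), (52, 2), (48, 8), (63, 3)]

r = sample_correlation(X, 0, 1)
print(round(r, 4))  # -0.3761
```

For these four observations the correlation is negative and fairly weak, and swapping the two column indices gives the same value, in line with $r_{ik}=r_{ki}$.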

Other Multivariate Analysis Techniques

Multiple Regression: It is used to model the relationship between a dependent variable (DV) and multiple independent variables (IV).

Principal Component Analysis (PCA): It reduces the dimensionality of data by identifying a smaller set of uncorrelated variables that capture most of the data’s variance.

Cluster Analysis: It groups the data points into clusters based on their similarities, helping identify subgroups within the data.

Discriminant Analysis: It classifies data points into predefined groups based on their characteristics.


Multivariate Analysis (2012)

The term multivariate analysis covers all statistical methods in which more than two variables are analyzed simultaneously.

Multivariate analysis is based upon an underlying probability model known as the Multivariate Normal Distribution (MND). The objectives of scientific investigations to which multivariate methods most naturally lend themselves are described below.


Objectives of Multivariate Analysis

The following are some basic objectives of multivariate analysis.

  • Data reduction or structural simplification
    The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
  • Sorting and Grouping
    Groups of similar objects or variables are created, based on measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
  • Investigation of the dependence among variables
    The nature of the relationships among variables is of interest. Are all the variables mutually independent, or are one or more variables dependent on the others?
  • Prediction
    Relationships between variables must be determined for predicting the values of one or more variables based on observation of the other variables.
  • Hypothesis Construction and testing
    Specific statistical hypotheses, formulated in terms of the parameters of the multivariate population, are tested. This may be done to validate assumptions or to reinforce prior convictions.

Applications: Multivariate analysis is used in various fields:

  • Social sciences (understanding factors influencing voting behavior)
  • Business (analyzing customer demographics and purchase patterns)
  • Finance (evaluating risk factors in investment portfolios)
  • Natural sciences (studying the relationships between different environmental variables)

Multivariate Data Sets

We are concerned with analyzing measurements made on several variables or characteristics. These measurements (data) must frequently be arranged and displayed in various ways (graphs, tabular form, etc.). The preliminary concepts underlying these first steps of data organization are described below.

Array

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p\ge 1$ of variables or characteristics to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.

The notation $x_{jk}$ indicates the particular value of the $k$th variable that is observed on the $j$th item or trial, i.e. $x_{jk}$ is the measurement of the $k$th variable on the $j$th item. So, $n$ measurements on $p$ variables can be displayed as

\[\begin{array}{ccccccc}
 & V_1 & V_2  & \dots  & V_k & \dots  & V_p \\
\mbox{Item 1} & x_{11} & x_{12} & \dots  & x_{1k} & \dots  & x_{1p} \\
\mbox{Item 2} & x_{21} & x_{22} & \dots  & x_{2k} & \dots  & x_{2p} \\
\vdots & \vdots  & \vdots  &  & \vdots   &  & \vdots  \\
\mbox{Item } j  & x_{j1}   & x_{j2} & \dots  & x_{jk} & \dots  & x_{jp} \\
\vdots &  \vdots & \vdots &  & \vdots   &  & \vdots  \\
\mbox{Item } n & x_{n1} & x_{n2} & \dots  & x_{nk} & \dots  & x_{np} \\
\end{array}\]

These data can be displayed as rectangular arrays $X$ of $n$ rows and $p$ columns

\[X=\begin{pmatrix}
x_{11}     & x_{12} & \dots  & x_{1k}  & \dots  & x_{1p} \\
x_{21}     & x_{22} & \dots  & x_{2k}  & \dots  & x_{2p} \\
\vdots & \vdots &   & \vdots &  & \vdots  \\
x_{j1}     & x_{j2} & \dots  & x_{jk}  & \dots  & x_{jp} \\
\vdots  & \vdots &   & \vdots &  & \vdots  \\
x_{n1}     & x_{n2} & \dots  & x_{nk}  & \dots  & x_{np}
\end{pmatrix}\]

This $X$ array contains the data consisting of all of the observations on all of the variables.

Example: suppose we have data for the number of books sold and the total amount of each sale.

Variable 1 (Sales in Dollars)
\[\begin{array}{ccccc}
Data Values: & 42 & 52 & 48 & 63 \\
Notation: & x_{11} & x_{21} & x_{31} & x_{41}
\end{array}\]

Variable 2 (Number of Books sold)
\[\begin{array}{ccccc}
Data Values: & 4 & 2 & 8 & 3 \\
Notation: & x_{12} & x_{22} & x_{32} & x_{42}
\end{array}\]
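
A minimal sketch of how this small data set could be stored as the array $X$ in plain Python follows. The names `sales`, `books`, and the helper `x` are assumptions made for illustration. The helper uses one-based indices so that `x(j, k)` mirrors the notation $x_{jk}$, whereas Python lists themselves are zero-based.

```python
# The two variables from the example, one value per item (sale).
sales = [42, 52, 48, 63]   # Variable 1: sales in dollars
books = [4, 2, 8, 3]       # Variable 2: number of books sold

# Build the n x p array X: one row per item, one column per variable.
X = [list(row) for row in zip(sales, books)]
n, p = len(X), len(X[0])

def x(j, k):
    """Return x_jk using the one-based indices of the text."""
    return X[j - 1][k - 1]

print(n, p)     # 4 2
print(x(3, 2))  # 8
```

Here `x(3, 2)` is the number of books sold in the third sale, i.e. the entry $x_{32}$.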


The information available in a multivariate data set can be assessed by calculating summary numbers known as multivariate descriptive statistics, such as the arithmetic (sample) mean, a measure of location, and the average of the squared distances of the observations from the mean, a measure of spread or variation.
