Statistics for Data Science & Analytics - MCQs, Software & Data Analysis

Writing Excel Formulas

Jul 7, 2025Jun 11, 2016 by Muhammad Imdad Ullah

Post Views: 1,276

Writing Excel formulas is a little different than the way it is done in mathematics class. All Excel formulas start with an equal sign (=), that is, the equal sign always goes in that cell where you want the answer to appear from the formula. Therefore, the equal sign informs Excel that this is a formula not just a name or number. Let us start with writing Excel Formulas.

Excel Formula

The Excel formula looks like

= 3 + 2

rather than

3+2 =

Writing Excel Formulas and Cell References in MS Excel

The example of a formula has one drawback. If you want to change the number being calculated (3, and 2), you need to edit it or re-write the formula. A better way is to write formulas in such a way that you can change the numbers without changing or re-writing the formulas themselves. To do this, cell references are used, which tells Excel that data/ numbers are located in a cell. Therefore a cell’s location/ reference in the spreadsheet is referred to as its cell reference.

To find a cell reference, simply click the cell of which you need cell reference, and from NAME BOX (shown in the figure below), see the text, such as F2.

F2 represents the cell in the $F$ column (horizontal position) and row 2 (vertical position). It means cell reference can also be found by reading the column heading (at the topmost position) of the cells and row number (at the leftmost position). Therefore, cell reference is a combination of the column letter and row number such as A1, B2, Z5, and A106, etc. For the previous formula example, instead of writing = 3 + 2 in cell suppose (C1), follow this way of cell reference and formula writing:

In cell A1 write 3, and in cell B2 write 2. In the C1 cell write the formula such as,

= A1 + A2

Note that there is no gap between A & 1 and A & 2, they are simply A1 and A2. See the diagram for a clear understanding.

Updating/ Writing Excel Formulas

Upon wrong cell references in the Excel formula, the results from the formula will be automatically updated, whenever the data value in relevant cells is changed. For example, if you want to change data in cell A1 to 8 instead of 3, you only need to change the content of A1. The result of the formula in cell C1 will automatically be updated after the updation of the data value in A1 or B1.

Note that the formula will not change because the cell references are being used instead of data values or numbers.

Data Analysis in R Language

The Correlogram

May 8, 2024Jun 4, 2016 by Muhammad Imdad Ullah

Post Views: 1,780

A correlogram is a graph used to interpret a set of autocorrelation coefficients in which $r_k$ is plotted against the $log k$. A correlogram is often very helpful for visual inspection.

Some general advice to interpret the correlogram are:

A Random Series: If a time series is completely random, then for large $N$, $r_k \cong 0$ for all non-zero values of $k$. A random time series $r_k$ is approximately $N\left(0, \frac{1}{N}\right)$. If a time series is random, 19 out of 20 of the values of $r_k$ can be expected to lie between $\pm \frac{2}{\sqrt{N}}$. However, plotting the first 20 values of $r_k$, one can expect to find one significant value on average even when the time series is random.
Short-term Correlation: Stationary series often exhibit short-term correlation characterized by a fairly large value of $r_1$ followed by 2 or 3 more coefficients (significantly greater than zero) tend to get successively smaller values of $r_k$ for larger lags tend to get be approximately zero. A time series that gives rise to such a correlogram is one for which an observation above the mean tends to be followed by one or more further observations above the mean and similarly for observation below the mean. A model called an autoregressive model may be appropriate for a series of this type.

Alternating Series: If a time series tends to alternate with successive observations on different sides of the overall mean, then the correlogram also tends to alternate. The value of $r_1$ will be negative, however, the value of $r_2$ will be positive as observation at lag 2 will tend to be on the same side of the mean.
Non-Stationary Series: If a time series contains a trend, then the value of $r_k$ will not come down to zero except for very large values of the lags. This is because of a large number of further observations on the same side of the mean because of the trend. The sample autocorrelation function $\{ r_k \}$ should only be calculated for stationary time series and no trend should be removed before calculating $\{ r_k\}$.
Seasonal Fluctuations: If a time series contains a seasonal fluctuation then the correlogram will also exhibit an oscillation at the same frequency. If $x_t$ follows a sinusoidal pattern then so does $r_k$.
$x_t=a\, cos\, t\, w, $ where $a$ is constant, $w$ is frequency such that $0 < w < \pi$. Therefore $r_k \cong cos\, k\, w$ for large $N$.
If the seasonal variation is removed from seasonal data then the correlogram may provide useful information.
Outliers: If a time series contains one or more outliers the correlogram may be seriously affected. If there is one outlier in the time series and it is not adjusted, then the plot of $x_y$ vs $x_{t+k}$ will contain two extreme points, which will tend to depress the sample correlation coefficients towards zero. If there are two outliers, this effect is more noticeable.
General Remarks: Experience is required to interpret autocorrelation coefficients. We need to study the probability theory of stationary series and the classes of the model too. We also need to know the sampling properties of $x_t$.

There are two main types of correlograms depending on the type of correlation being analyzed:

Pearson Correlation: This is the most common type and measures linear correlations between continuous variables.
Spearman Rank Correlation: This is a non-parametric measure suitable for ordinal or continuous data and assesses monotonic relationships (not necessarily linear).

In summary, a correlogram is a valuable tool for exploratory data analysis. It helps us:

Understand the relationships between multiple variables in your data.
Identify potential issues with multicollinearity before building statistical models.
Gain insights into the underlying structure of your data.

Learn R Programming and R Data Analysis

Onl ine MCQs Test

Principal Component Regression (PCR)

Aug 3, 2024Apr 9, 2016 by Muhammad Imdad Ullah

Post Views: 1,920

Principal Component Regression (PCR) is a statistical technique that combines two powerful methods: Principal Component Analysis (PCA) and linear regression.

The transformation of the original data set into a new set of uncorrelated variables is called principal components. This kind of transformation ranks the new variables according to their importance (that is, variables are ranked according to the size of their variance and eliminate those of least importance). After transformation, a least square regression on this reduced set of principal components is performed, called principal component regression.

Principal Component Regression (PCR)

Principal Component Regression (PCR) is not scale invariant, therefore, one should scale and center data first. Therefore, given a p-dimensional random vector $x=(x_1, x_2, …, x_p)^t$ with covariance matrix $\sum$ and assume that $\sum$ is positive definite. Let $V=(v_1,v_2, \cdots, v_p)$ be a $(p \times p)$-matrix with orthogonal column vectors that is $v_i^t\, v_i=1$, where $i=1,2, \cdots, p$ and $V^t =V^{-1}$. The linear transformation

\begin{aligned}
z&=V^t x\\
z_i&=v_i^t x
\end{aligned}

The variance of the random variable $z_i$ is
\begin{aligned}
Var(Z_i)&=E[v_i^t\, x\, x^t\,\, v_i]\\
&=v_i^t \sum v_i
\end{aligned}

Maximizing the variance $Var(Z_i)$ under the conditions $v_i^t v_i=1$ with Lagrange gives
\[\phi_i=v_i^t \sum v_i -a_i(v_i^t v_i-1)\]

Setting the partial derivation to zero, we get
\[\frac{\partial \phi_i}{\partial v_i} = 2 \sum v_i – 2a_i v_i=0\]

which is
\[(\sum – a_i I)v_i=0\]

In matrix form
\[\sum V= VA\]
of
\[\sum = VAV^t\]

where $A=diag(a_1, a_2, \cdots, a_p)$. This is known as the eigenvalue problem, $v_i$ are the eigenvectors of $\sum$ and $a_i$ the corresponding eigenvalues such that $a_1 \ge a_2 \cdots \ge a_p$. Since $\sum$ is positive definite, all eigenvalues are real and non-negative numbers.

$z_i$ is named the ith principal component of $x$ and we have
\[Cov(z)=V^t Cov(x) V=V^t \sum V=A\]

The variance of the ith principal component matches the eigenvalue $a_i$, while the variances are ranked in descending order. This means that the last principal component will have the smallest variance. The principal components are orthogonal to all the other principal components (they are even uncorrelated) since $A$ is a diagonal matrix.

In the following, for regression, we will use $q$, that is,($1\le q \le p$) principal components. The regression model for observed data $X$ and $y$ can then be expressed as

\begin{aligned}
y&=X\beta+\varepsilon\\
&=XVV^t\beta+\varepsilon\\
&= Z\theta+\varepsilon
\end{aligned}

with the $n\times q$ matrix of the empirical principal components $Z=XV$ and the new regression coefficients $\theta=V^t \beta$. The solution of the least squares estimation is

\begin{aligned}
\hat{\theta}_k=(z_k^t z_k)^{-1}z_k^ty
\end{aligned}

and $\hat{\theta}=(\theta_1, \cdots, \theta_q)^t$

Since the $z_k$ are orthogonal, the regression is a sum of univariate regressions, that is
\[\hat{y}_{PCR}=\sum_{k=1}^q \hat{\theta}_k z_k\]

Since $z_k$ are linear combinations of the original $x_j$, the solution in terms of coefficients of the $x_j$ can be expressed as
\[\hat{\beta}_{PCR} (q)=\sum_{k=1}^q \hat{\theta}_k v_k=V \hat{\theta}\]

Note that if $q=p$, we would get back the usual least squares estimates for the full model. For $q<p$, we get a “reduced” regression.

Why use Principal Component Regression?

Reduces Dimensionality: When dealing with a large number of predictors, PCR can help reduce the complexity of the model.
Handles multicollinearity: If there is a high correlation among predictors (multicollinearity), PCR can address this issue.
Improves interpretability: In some cases, the principal components can be easier to interpret than the original variables.

Important Points to Remember

PCR is an unsupervised technique for dimensionality reduction.
The number of principal components used in the regression model is a crucial parameter.
PCR can be compared to Partial Least Squares Regression (PLS), another dimensionality reduction technique that considers the relationship between predictors and the response variable.

R Language Interview Questions

Online MCQs Test Website

Writing Excel Formulas

Table of Contents

Excel Formula

Writing Excel Formulas and Cell References in MS Excel

Updating/ Writing Excel Formulas

The Correlogram

Some general advice to interpret the correlogram are:

Principal Component Regression (PCR)

Table of Contents

Principal Component Regression (PCR)

Why use Principal Component Regression?

Important Points to Remember

Table of Contents

Excel Formula

Writing Excel Formulas and Cell References in MS Excel

Updating/ Writing Excel Formulas

Share this:

Some general advice to interpret the correlogram are:

Share this:

Table of Contents

Principal Component Regression (PCR)

Why use Principal Component Regression?

Important Points to Remember

Share this: