# Tagged: Short Questions

## Correlation Coeficient values lies between +1 and -1?

We know that the ratio of the explained variation to the total variation is called the coefficient of determination which is the square of the correlation coefficient. This ratio is non-negative, therefore denoted by $r^2$, thus

\begin{align*}
r^2&=\frac{\text{Explained Variation}}{\text{Total Variation}}\\
&=\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}
\end{align*}

It can be seen that if the total variation is all explained, the ratio $r^2$ (Coefficient of Determination) is one and if the total variation is all unexplained then the explained variation and the ratio $r^2$ is zero.

The square root of the coefficient of determination is called the correlation coefficient, given by

\begin{align*}
r&=\sqrt{ \frac{\text{Explained Variation}}{\text{Total Variation}} }\\
&=\pm \sqrt{\frac{\sum (\hat{Y}-\overline{Y})^2}{\sum (Y-\overline{Y})^2}}
\end{align*}

and

$\sum (\hat{Y}-\overline{Y})^2=\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2$

therefore

\begin{align*}
r&=\sqrt{ \frac{\sum(Y-\overline{Y})^2-\sum (Y-\hat{Y})^2} {\sum(Y-\overline{Y})^2} }\\
&=\sqrt{1-\frac{\sum (Y-\hat{Y})^2}{\sum(Y-\overline{Y})^2}}\\
&=\sqrt{1-\frac{\text{Unexplained Variation}}{\text{Total Variation}}}=\sqrt{1-\frac{S_{y.x}^2}{s_y^2}}
\end{align*}

where $s_{y.x}^2=\frac{1}{n} \sum (Y-\hat{Y})^2$ and $s_y^2=\frac{1}{n} \sum (Y-\overline{Y})^2$

\begin{align*}
\Rightarrow r^2&=1-\frac{s_{y.x}^2}{s_y^2}\\
\Rightarrow s_{y.x}^2&=s_y^2(1-r^2)
\end{align*}

Since variances are non-negative

$\frac{s_{y.x}^2}{s_y^2}=1-r^2 \geq 0$

Solving for inequality we have

\begin{align*}
1-r^2 & \geq 0\\
\Rightarrow r^2 \leq 1\, \text{or}\, |r| &\leq 1\\
\Rightarrow & -1 \leq r\leq 1
\end{align*}

## Alternative Proof

Since $\rho(X,Y)=\rho(X^*,Y^*)$ where $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-Y^*}{\sigma_Y}$

and as covariance is bi-linear and X* ,Y* have zero mean and variance 1, therefore

\begin{align*}
\rho(X^*,Y^*)&=Cov(X^*,Y^*)=Cov\{\frac{X-\mu_X}{\sigma_X},\frac{Y-\mu_Y}{\sigma_Y}\}\\
&=\frac{Cov(X-\mu_X,Y-\mu_Y)}{\sigma_X\sigma_Y}\\
&=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}=\rho(X,Y)
\end{align*}

We also know that the variance of any random variable is ≥0, it could be zero i.e .(Var(X)=0) if and only if X is a constant (almost surely), therefore

$V(X^* \pm Y^*)=V(X^*)+V(Y^*)\pm2Cov(X^*,Y^*)$

As Var(X*)=1 and Var(Y*)=1, the above equation would be negative if $Cov(X^*,Y^*)$ is either greater than 1 or less than -1. Hence $1\geq \rho(X,Y)=\rho(X^*,Y^*)\geq -1$.

If $\rho(X,Y )=Cov(X^*,Y^*)=1$ then $Var(X^*- Y ^*)=0$ making X* =Y* almost surely. Similarly, if $\rho(X,Y )=Cov(X^*,Y^*)=-1$ then X*=−Y* almost surely. In either case, Y would be a linear function of X almost surely.

We can see that the Correlation Coefficient values lie between -1 and +1.

## Bias: The Difference Between the Expected Value and True Value

Bias in Statistics is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. Therefore, the bias is a measure of the systematic error of an estimator. The bias indicates the distance of the estimator from the true value of the parameter. For example, if we calculate the mean of a large number of unbiased estimators, we will find the correct value.

In other words, the bias (sampling error) is a systematic error in measurement or sampling and it tells how far off on the average the model is from the truth.

Gauss, C.F. (1821) during his work on the least-squares method gave the concept of an unbiased estimator.

The bias of an estimator of a parameter should not be confused with its degree of precision as the degree of precision is a measure of the sampling error. The bias is favoring of one group or outcome intentionally or unintentionally over other groups or outcomes available in the population under study. Unlike random errors, bias is a serious problem and bias can be reduced by increasing the sample size and averaging the outcomes.

There are several types of bias that should not be considered mutually exclusive

• Selection Bias (arise due to systematic differences between the groups compared)
• Exclusion Bias (arise due to the systematic exclusion of certain individuals from the study)
• Analytical Bias (arise due to the way that the results are evaluated)

Mathematically Bias can be defined as

Let statistics $T$ used to estimate a parameter $\theta$ if $E(T)=\theta+bias(\theta)$ then $bias(\theta)$ is called the bias of the statistic $T$, where $E(T)$ represents the expected value of the statistics $T$.
Note: that if $bias(\theta)=0$, then $E(T)=\theta$. So, $T$ is an unbiased estimator of the true parameter, say $θ$.

Reference:
Gauss, C.F. (1821, 1823, 1826). Theoria Combinations Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and suppl. Werke 4, 1-108.

## Testing of Hypothesis or Hypothesis Testing

#### To whom is the researcher similar to in hypothesis testing: the defense attorney or the prosecuting attorney? Why?

The researcher is similar to the prosecuting attorney in the sense that the researcher brings the null hypothesis “to trial” when she believes there is a probability of strong evidence against the null.

• Just as the prosecutor usually believes that the person on trial is not innocent, the researcher usually believes that the null hypothesis is not true.
• In the court system the jury must assume (by law) that the person is innocent until the evidence clearly calls this assumption into question; analogously, in hypothesis testing the researcher must assume (in order to use hypothesis testing) that the null hypothesis is true until the evidence calls this assumption into question.

# Type I Error

It has become part of the statistical hypothesis testing culture.

• It is a longstanding convention.
• It reflects a concern over making type I errors (i.e., wanting to avoid the situation where you reject the null when it is true, that is, wanting to avoid “false positive” errors).
• If you set the significance level at .05, then you will only reject a true null hypothesis 5% or the time (i.e., you will only make a type I error 5% of the time) in the long run.