# High Correlation Does Not Imply Cause and Effect

The correlation coefficient is a measure of the co-variability of two variables. It does not necessarily imply any functional relationship between the variables concerned, and correlation theory does not establish any causal relationship between them. Knowledge of the value of the correlation coefficient $r$ alone will not enable us to predict the value of $Y$ from $X$.

Sometimes there is a high correlation between unrelated variables, such as the number of births and the number of murders in a country. This is called spurious correlation.

For example, suppose there is a positive correlation between watching violent movies and violent behavior in adolescence. The cause of both could be a third (extraneous) variable, say growing up in a violent environment, which leads adolescents both to watch violence-related movies and to behave violently.
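The confounding idea above can be illustrated with a small simulation (a hypothetical sketch; the variable names and coefficients are assumptions, not real data). A single "environment" variable drives two outcomes that never influence each other, yet the outcomes end up strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical confounder: "violent environment" score drives both outcomes.
environment = rng.normal(size=n)
movie_watching = 0.8 * environment + rng.normal(scale=0.5, size=n)
violent_behavior = 0.8 * environment + rng.normal(scale=0.5, size=n)

# The two outcomes are highly correlated even though neither causes the other.
r = np.corrcoef(movie_watching, violent_behavior)[0, 1]
print(f"r = {r:.2f}")
```

The theoretical correlation here is about 0.72, entirely induced by the common cause; removing `environment` (e.g., by conditioning on it) would remove the association.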

# Measure of Kurtosis

Kurtosis is a measure of the peakedness of a distribution relative to the normal distribution. A distribution having a relatively high peak is called leptokurtic. A distribution that is flat-topped is called platykurtic. The normal distribution, which is neither very peaked nor very flat-topped, is called mesokurtic. The histogram is an effective graphical technique for showing both the skewness and the kurtosis of a data set.

Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

The moment coefficient and the percentile coefficient of kurtosis are used to measure kurtosis.

Moment Coefficient of Kurtosis = $b_2 = \frac{m_4}{S^4} = \frac{m_4}{m^{2}_{2}}$, where $m_2$ and $m_4$ are the second and fourth central moments and $S^2 = m_2$ is the sample variance.

Percentile Coefficient of Kurtosis = $k=\frac{Q.D}{P_{90}-P_{10}}$
where $Q.D = \frac{1}{2}(Q_3 - Q_1)$ is the semi-interquartile range. For the normal distribution this has the value 0.263.

A normal random variable has a kurtosis of 3 irrespective of its mean or standard deviation. If a random variable’s kurtosis is greater than 3, it is said to be leptokurtic. If its kurtosis is less than 3, it is said to be platykurtic.
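Both coefficients can be computed directly from the definitions above. The following sketch (assuming NumPy) draws a large standard-normal sample and checks that $b_2 \approx 3$ and $k \approx 0.263$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # standard normal sample

# Moment coefficient of kurtosis: b2 = m4 / m2^2 (central moments).
m2 = np.mean((x - x.mean()) ** 2)
m4 = np.mean((x - x.mean()) ** 4)
b2 = m4 / m2 ** 2  # approx 3 for normal data

# Percentile coefficient of kurtosis: k = Q.D / (P90 - P10).
q1, q3 = np.percentile(x, [25, 75])
p10, p90 = np.percentile(x, [10, 90])
k = 0.5 * (q3 - q1) / (p90 - p10)  # approx 0.263 for normal data

print(b2, k)
```

Substituting a heavy-tailed sample (e.g., `rng.standard_t(df=5, size=100_000)`) would push $b_2$ above 3, consistent with the leptokurtic case described above.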

# Sampling Error: Definition, Example, Formula

Sampling error, also called estimation error, is the amount of inaccuracy in estimating some value that arises from observing only a portion of a population (i.e., a sample) rather than the whole population. The difference between a statistic (a sample value, such as the sample mean) and the corresponding parameter (the population value, such as the population mean) is called the sampling error. If $\bar{x}$ is the sample statistic and $\mu$ is the corresponding parameter, then the sampling error is $\bar{x} - \mu$.

Exact calculation of the sampling error is generally not feasible because the true population value is usually unknown; however, it can often be estimated by probabilistic modeling of the sample.


## Causes of Sampling Error

• One cause of sampling error is a biased sampling procedure. Every researcher should select a sample that is free from any bias and representative of the entire population of interest.
• Another cause is chance. Randomization and probability sampling are used to minimize sampling error, but it is still possible that the randomly chosen subjects/objects are not representative of the population.

## Eliminating/Reducing the Sampling Error

Sampling error can be eliminated or reduced when the researcher uses a proper, unbiased probability sampling technique and the sample size is large enough.

• Increasing the sample size
The sampling error can be reduced by increasing the sample size. If the sample size $n$ is equal to the population size $N$, then the sampling error will be zero.
• Improving the sample design, e.g., by using stratification
The population is divided into different groups (strata) containing similar units.


# Difference Between an Outlier and an Influential Observation

Cases that do not follow the same model as the rest of the data are called outliers. In regression, cases with large residuals are candidates for outliers, so an outlier is a data point that diverges from the overall pattern in a sample. An outlier can therefore influence the relationship between the variables and may also exert an influence on the slope of the regression line.

An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. Outliers may be due to recording errors (which may be correctable), or to the sample not being drawn entirely from the same population. They may also arise from values that come from the same population but from a non-normal (heavy-tailed) distribution; that is, outliers may reflect incorrect specifications based on the wrong distributional assumptions.

An influential observation is often an outlier in the $x$-direction. Influential observations may arise from:

1. observations that are unusually large or that deviate in unusually extreme ways from the center of a reference distribution;
2. observations associated with a unit that has a low selection probability, and thus a high probability weight;
3. observations whose weight is very large (relative to the weights of other units in the specified subpopulation) due to problems with stratum jumping; sampling of birth units or highly seasonal units; large nonresponse adjustment factors arising from unusually low response rates within a given adjustment cell; unusual calibration-weighting effects; or other factors.
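The distinction can be made concrete with a small simulation (assuming NumPy; the data are artificial). A $y$-outlier at a central $x$ value has a large residual but barely moves the fitted slope, while a point that is extreme in the $x$-direction pulls the slope toward itself:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(scale=1.0, size=30)  # true slope is 2

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

base = slope(x, y)

# Outlier in y at a central x: large residual, little effect on the slope.
x_out = np.append(x, 5.0)
y_out = np.append(y, 40.0)

# Influential point: extreme in the x-direction, drags the slope toward itself.
x_inf = np.append(x, 30.0)
y_inf = np.append(y, 10.0)

print(base, slope(x_out, y_out), slope(x_inf, y_inf))
```

The outlier leaves the slope near 2, while the high-leverage point pulls it well below the true value, which is exactly why influential observations are usually diagnosed through leverage rather than residual size alone.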

# Question: Differentiate Between Errors and Residuals in the Linear Model

In statistics and optimization, statistical errors and residuals are two closely related and easily confused measures of the deviation of an observation from a central value.

The term "error" is a misnomer; an error is the amount by which an observation differs from its expected value. The errors $e$ are unobservable random variables, assumed to have zero mean and uncorrelated elements, each with common variance $\sigma^2$.

A residual, on the other hand, is an observable estimate of the unobservable error. The residuals $\hat{e}$ are computed quantities with mean ${E(\hat{e})=0}$ and variance ${V(\hat{e})=\sigma^2 (I-H)}$, where $H$ is the hat matrix.

Like the errors, each of the residuals has zero mean, but each residual may have a different variance. Unlike the errors, the residuals are correlated. The residuals are linear combinations of the errors; if the errors are normally distributed, so are the residuals.

Note that the sum of the residuals is necessarily zero (in a model fitted with an intercept term), and thus the residuals are necessarily not independent. The sum of the errors need not be zero; the errors are independent random variables if the individuals are chosen from the population independently.
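The contrast can be checked numerically. The sketch below (assuming NumPy; model and coefficients are illustrative) generates data from a known linear model, fits it by least squares, and compares the sum of the residuals with the sum of the true errors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 10, size=n)
errors = rng.normal(scale=2.0, size=n)   # unobservable true errors
y = 3 + 1.5 * x + errors                 # true model

# Fit by least squares; residuals are the observable estimates of the errors.
X = np.column_stack([np.ones(n), x])     # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print(residuals.sum())   # numerically zero, by the normal equations
print(errors.sum())      # the true errors generally do not sum to zero
```

The residuals sum to zero because the intercept column forces $\mathbf{1}^\top \hat{e} = 0$ in the normal equations; the true errors carry no such constraint.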