# Data Transformation (Variable Transformation)

A transformation is a rescaling of the data using a function or some mathematical operation on each observation. When data are very strongly skewed (negatively or positively), we sometimes transform them so that they are easier to model. Similarly, if a variable does not fit a normal distribution, one may try a data transformation to satisfy the assumptions of a parametric statistical test.

The most common data transformation is the logarithmic (or natural log) transformation, which is often applied when most of the data values cluster around zero relative to the larger values in the data set and all of the observations are positive.

Transformations can also be applied to one or more variables in scatter plot, correlation, and regression analysis to make the relationship between the variables more linear, and hence easier to model with simple methods. Transformations other than the log include the square root, the reciprocal, and others.

Reciprocal Transformation
The reciprocal transformation $x$ to $\frac{1}{x}$ (or $-\frac{1}{x}$) is a very strong transformation with a drastic effect on the shape of the distribution. It cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all of the values are of the same sign, and it reverses the order among values of the same sign: the largest becomes the smallest, and so on.

Logarithmic Transformation
The logarithm $x$ to $\log_{10}(x)$ (or the natural log, or log base 2) is another strong transformation that affects the shape of the distribution. The logarithmic transformation is commonly used for reducing right skewness, but it cannot be applied to zero or negative values.

Square Root Transformation
The square root transformation $x$ to $x^{\frac{1}{2}}=\sqrt{x}$ has a moderate effect on the shape of the distribution, weaker than the logarithm. It can be applied to zero values but not to negative values.

Goals of transformation
The goals of transformation may be:

• to see the data structure differently
• to reduce skewness, which assists in modeling
• to straighten a nonlinear (curvilinear) relationship in a scatter plot; a transformation may also be used to obtain approximately equal dispersion, making the data easier to handle and interpret
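As a quick numerical illustration (a minimal sketch in plain Python; the right-skewed data set, built as exponentials of standard-normal draws, is made up for the example), the log transformation pulls a strongly right-skewed sample toward symmetry, while the square root has a weaker effect:

```python
import math
import random

random.seed(0)
# Hypothetical right-skewed, strictly positive data:
# exponentials of standard-normal draws (lognormal-like).
x = [math.exp(random.gauss(0, 1)) for _ in range(10_000)]

def skewness(a):
    """Sample skewness: mean of cubed standardized deviations."""
    n = len(a)
    m = sum(a) / n
    s = math.sqrt(sum((v - m) ** 2 for v in a) / n)
    return sum(((v - m) / s) ** 3 for v in a) / n

print(f"original:    skew = {skewness(x):.2f}")
print(f"log:         skew = {skewness([math.log(v) for v in x]):.2f}")
print(f"square root: skew = {skewness([math.sqrt(v) for v in x]):.2f}")
```

On data like these, the log transform brings the skewness close to zero while the square root reduces it only moderately, matching the ordering of strength described above.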

# p-value interpretation, definition, introduction and examples

The p-value, also known as the observed (or exact) level of significance, or the exact probability of committing a type-I error (the probability of rejecting $H_0$ when it is true), helps to determine the significance of the results of a hypothesis test. The p-value is the probability of obtaining the observed sample results, or a more extreme result, when the null hypothesis (a statement about the population) is actually true.

In technical terms, the p-value can be defined as the lowest level of significance at which a null hypothesis can be rejected. If the p-value is very small, or smaller than the threshold value (the chosen level of significance), then the observed data are considered inconsistent with the assumption that the null hypothesis is true; thus the null hypothesis is rejected and the alternative hypothesis is accepted. The p-value is a number between 0 and 1, and in the literature it is usually interpreted in the following way:

• A small p-value (<0.05) indicates strong evidence against the null hypothesis.
• A large p-value (>0.05) indicates weak evidence against the null hypothesis.
• A p-value very close to the cutoff (say, 0.05) is considered marginal.

Suppose the p-value of a certain test statistic is 0.002. This means that the probability of committing a type-I error (making a wrong decision) is about 0.2 percent, that is, only about 2 in 1,000. For a given sample size, as $|t|$ (or any test statistic) increases, the p-value decreases, so one can reject the null hypothesis with increasing confidence.
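As a concrete sketch of how a p-value is computed, consider a hypothetical two-sided z-test with known population standard deviation (the sample figures below are invented for illustration):

```python
import math

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for a z-test of H0: mu = mu0 (sigma known)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) under the standard normal null distribution:
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical sample: mean 52 from n = 64 observations, H0: mu = 50, sigma = 8.
p = z_test_p_value(sample_mean=52.0, mu0=50.0, sigma=8.0, n=64)
print(f"p-value: {p:.4f}")  # z = 2.0, so p is about 0.0455
```

Here the test statistic is $z = 2.0$, so the data would be judged significant at the 5% level but not at the 1% level.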

If the level of significance ($\alpha$) (i.e., the type-I error rate) is set equal to the p-value of a test statistic, then there is no conflict between the two values. In other words, it may be better to give up fixing the significance level arbitrarily at some value such as 5% or 10% and simply report the p-value of the test statistic. For example, if the p-value of a test statistic is about 0.145, then one can reject the null hypothesis at this exact significance level, since there is nothing wrong with taking a 14.5% chance of being wrong when rejecting the null hypothesis.

The p-value addresses only one question: how likely are your data, assuming the null hypothesis is true? It does not measure support for the alternative hypothesis.

Most authors refer to p < 0.05 as statistically significant and p < 0.001 as highly statistically significant (less than a one in a thousand chance of being wrong).

The p-value is often incorrectly interpreted as the probability of making a mistake by rejecting a true null hypothesis (a type-I error). The p-value cannot be an error rate because:

The p-value is calculated on the assumption that the null hypothesis is true and that the difference in the sample arose by random chance. Consequently, the p-value cannot tell us the probability that the null hypothesis is true or false, because the null hypothesis is 100% true from the perspective of the calculation.
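This assumption can be illustrated by simulation (a sketch in plain Python with an invented z-test setting): when the null hypothesis really is true, a test at level $\alpha = 0.05$ rejects in roughly 5% of repeated experiments. That 5% is the type-I error rate fixed in advance by $\alpha$, not something an individual p-value reports.

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Simulate experiments in which H0 is true: n standard-normal observations,
# z statistic for the sample mean, two-sided p-value.
n, trials, alpha = 25, 20_000, 0.05
rejections = 0
for _ in range(trials):
    xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
    z = xbar * math.sqrt(n)  # (xbar - 0) / (1 / sqrt(n))
    if two_sided_p(z) < alpha:
        rejections += 1

print(f"rejection rate under a true H0: {rejections / trials:.3f}")  # close to alpha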

# Bias (Statistical Bias)

Bias is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. The bias is therefore a measure of the systematic error of an estimator, indicating how far the estimator lies, on average, from the true value of the parameter. For example, if we average a large number of estimates produced by an unbiased estimator, the average will be close to the correct value.

Gauss, C.F. (1821), during his work on the least squares method, introduced the concept of an unbiased estimator.

The bias of an estimator of a parameter should not be confused with its degree of precision, as the degree of precision is a measure of the sampling error.

There are several types of bias, which should not be considered mutually exclusive:

• Selection Bias (arises due to systematic differences between the groups compared)
• Exclusion Bias (arises due to the systematic exclusion of certain individuals from the study)
• Analytical Bias (arises due to the way the results are evaluated)

Mathematically, bias can be defined as follows:

Let the statistic $T$ be used to estimate a parameter $\theta$. If $E(T) = \theta + b(\theta)$, then $b(\theta)$ is called the bias of the statistic $T$, where $E(T)$ represents the expected value of $T$. Note that if $b(\theta) = 0$, then $E(T) = \theta$, so $T$ is an unbiased estimator of $\theta$.
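As a sketch of this definition (plain Python with invented $N(0, 2^2)$ data): the variance estimator that divides by $n$ has bias $b(\sigma^2) = -\sigma^2/n$, while dividing by $n-1$ gives $b(\sigma^2) = 0$. Averaging each estimator over many samples approximates its expected value $E(T)$:

```python
import random

random.seed(7)
true_var = 4.0            # variance of the N(0, 2^2) population
n, trials = 5, 50_000

biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(trials):
    x = [random.gauss(0, 2) for _ in range(n)]
    m = sum(x) / n
    ss = sum((v - m) ** 2 for v in x)
    biased_sum += ss / n          # divisor n: E(T) = ((n-1)/n) * sigma^2
    unbiased_sum += ss / (n - 1)  # divisor n-1: E(T) = sigma^2

print(f"divide by n:     approx E(T) = {biased_sum / trials:.2f}")   # near 3.2
print(f"divide by n - 1: approx E(T) = {unbiased_sum / trials:.2f}") # near 4.0
```

With $n = 5$ and $\sigma^2 = 4$, the divisor-$n$ estimator averages about $4 \times 4/5 = 3.2$, exhibiting the bias $b(\sigma^2) = -0.8$, while the divisor-$(n-1)$ estimator averages about the true value 4.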

Reference:
Gauss, C.F. (1821, 1823, 1826). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and suppl. Werke 4, 1-108.

# Difference between an outlier and influential observation

Cases that do not follow the same model as the rest of the data are called outliers. In regression, cases with large residuals are candidates for outliers. An outlier is thus a data point that diverges from the overall pattern in a sample. Therefore an outlier can certainly influence the relationship between the variables and may also exert an influence on the slope of the regression line.

An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. Outliers may be due to recording errors (which may be correctable), or to the sample not being drawn entirely from the same population. They may also be values from the same population but from a non-normal (heavy-tailed) distribution; that is, outliers may arise from incorrect specifications based on the wrong distributional assumptions.

An influential observation is often an outlier in the x-direction. Influential observations may arise from:

1. observations that are unusually large or that otherwise deviate in unusually extreme ways from the center of a reference distribution;
2. observations associated with a unit that has a low selection probability, and thus a high probability weight;
3. observations whose weight is very large (relative to the weights of other units in the specified subpopulation) due to problems with stratum jumping; sampling of birth units or highly seasonal units; large nonresponse adjustment factors arising from unusually low response rates within a given adjustment cell; unusual calibration-weighting effects; or other factors.
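A small least-squares sketch (plain Python; the data points are made up) shows how a single observation that is extreme in the x-direction can drag the fitted slope away from the pattern of the remaining points:

```python
def slope_intercept(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx

# Five points lying close to the line y = x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
b1, _ = slope_intercept(x, y)

# Add one high-leverage point far out in the x-direction with a discrepant y.
b2, _ = slope_intercept(x + [15.0], y + [5.0])

print(f"slope without the influential point: {b1:.2f}")  # 1.00
print(f"slope with the influential point:    {b2:.2f}")  # about 0.23
```

One added point at $x = 15$ pulls the slope from about 1.0 down to about 0.23, even though the other five points are unchanged.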

# p-value Interpretation

The p-value is a probability, with a value ranging from zero to one. It is a measure of how much evidence we have against the null hypothesis. The p-value is a way to express the likelihood that $H_0$ is not true. The smaller the p-value, the more evidence we have against $H_0$.

The p-value can be defined as:

the largest significance level at which we would accept the null hypothesis (this definition enables us to test a hypothesis without first specifying a value for $\alpha$); OR

the probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true.

If the p-value is smaller than the chosen significance level, then $H_0$ (the null hypothesis) is rejected (even if it is, in fact, true); if it is larger than the significance level, $H_0$ is not rejected.

If the p-value is less than

• 0.10, we have some evidence that $H_0$ is not true
• 0.05, strong evidence that $H_0$ is not true
• 0.01, very strong evidence that $H_0$ is not true
• 0.001, extremely strong evidence that $H_0$ is not true

Misinterpretation of a P-value

Many people misunderstand p-values. For example, if the p-value is 0.03, it means that there is a 3% chance of observing a difference as large as you observed even if the two population means are the same (i.e., the null hypothesis is true). It is tempting to conclude that there is a 97% chance that the difference you observed reflects a real difference between populations and a 3% chance that it is due to chance; however, this would be an incorrect conclusion. What you can say is that random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments, and larger than you observed in 3% of experiments.
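This correct reading can be checked by simulation (a sketch in plain Python; the "observed" difference of 0.69 and the group size are invented so that the two-sided p-value works out near 0.03): drawing both samples from identical populations, a difference at least as large as the observed one appears in roughly 3% of experiments.

```python
import random

random.seed(3)

def mean_diff(n):
    """Difference of sample means for two groups drawn from identical N(0, 1) populations."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    return sum(a) / n - sum(b) / n

observed = 0.69          # hypothetical observed difference (p near 0.03 in this setting)
n, trials = 20, 20_000
as_large = sum(1 for _ in range(trials) if abs(mean_diff(n)) >= observed)

print(f"experiments with a difference at least as large: {as_large / trials:.3f}")
```

The simulated fraction approximates the p-value itself, not the probability that the null hypothesis is true.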