Data Transformation (Variable Transformation)
A transformation is a rescaling of the data using a function or some other mathematical operation applied to each observation. When data are very strongly skewed (negatively or positively), we sometimes transform them so that they are easier to model. In other words, if a variable does not follow a normal distribution, one may try a data transformation so that it better meets the assumptions of a parametric statistical test.
The most common data transformation is the log (often the natural log) transformation, which is typically applied when most of the data values cluster near zero relative to a few much larger values in the data set, and all of the observations are positive.
Transformations can also be applied to one or more variables in scatter plots, correlation, and regression analysis to make the relationship between the variables more linear, and hence easier to model with a simple method. Transformations other than the log include the square root, the reciprocal, etc.
Reciprocal Transformation
The reciprocal transformation, $x$ to $\frac{1}{x}$ or $-\frac{1}{x}$, is a very strong transformation with a drastic effect on the shape of the distribution. It cannot be applied to zero values. Although it can be applied to negative values, it is generally not useful unless all of the values are positive. The reciprocal reverses the order among values of the same sign, i.e., the largest becomes the smallest, and so on; the negative reciprocal $-\frac{1}{x}$ preserves the original order.
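The order reversal described above can be checked with a few numbers. The following is a minimal sketch in plain Python; the sample values are made up for illustration and are not taken from any real data set.

```python
# Illustrative positive values, already sorted in increasing order
# (made up for this sketch).
values = [1.0, 2.0, 4.0, 10.0]

reciprocals = [1.0 / v for v in values]        # x -> 1/x
neg_reciprocals = [-1.0 / v for v in values]   # x -> -1/x

print(reciprocals)      # decreasing: the largest value became the smallest
print(neg_reciprocals)  # increasing: the original order is preserved

# Zero has no reciprocal: Python raises ZeroDivisionError.
try:
    1.0 / 0.0
    zero_ok = True
except ZeroDivisionError:
    zero_ok = False
print(zero_ok)
```

When the data contain zeros, the reciprocal cannot be used directly, which is one reason it is usually reserved for strictly positive variables.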
Logarithmic Transformation
The logarithmic transformation, $x$ to $\log_{10} x$ (or the natural log $\ln x$, or $\log_2 x$), is another strong transformation that affects the shape of the distribution. It is commonly used for reducing right skewness, but cannot be applied to negative or zero values.
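As a rough illustration, the sketch below applies logs of three bases to the same made-up values using Python's standard `math` module. It also shows that changing the base only rescales the result by a constant, so the choice of base does not change the shape of the transformed distribution. The helper `safe_log10` is a hypothetical name introduced here, not part of any library.

```python
import math

# Illustrative positive values spanning several orders of magnitude.
values = [1.0, 10.0, 100.0, 1000.0]

log10_values = [math.log10(v) for v in values]
ln_values = [math.log(v) for v in values]
log2_values = [math.log2(v) for v in values]

# Changing the base only rescales: ln(x) = ln(10) * log10(x).
rescaled = [math.log(10) * t for t in log10_values]


def safe_log10(x):
    """Hypothetical helper: log10(x) for x > 0, otherwise None."""
    return math.log10(x) if x > 0 else None


print(log10_values)
print(safe_log10(0.0), safe_log10(-5.0))  # both undefined, so None
```

Returning `None` for zero or negative inputs is only one possible design; depending on the analysis, one might instead drop such observations or choose a different transformation.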
Square Root Transformation
The square root transformation, $x$ to $x^{\frac{1}{2}}=\sqrt{x}$, has a moderate effect on the shape of the distribution and is weaker than the logarithm. It can be applied to zero values but not to negative values.
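To see the relative strength of the square root and the logarithm, one can compare the skewness of a right-skewed sample before and after each transformation. This is a minimal sketch: the data are made up, and `skewness` is a small helper written here using the moment-based definition $m_3 / m_2^{3/2}$, not a library function.

```python
import math


def skewness(xs):
    """Moment-based skewness m3 / m2**1.5, using population moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed sample: most values are small, a few are large.
data = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]

skew_raw = skewness(data)
skew_sqrt = skewness([math.sqrt(x) for x in data])
skew_log = skewness([math.log(x) for x in data])

print(skew_raw, skew_sqrt, skew_log)

# The square root is defined at zero, unlike the logarithm.
print(math.sqrt(0))
```

For this sample the square root reduces the skewness, and the logarithm reduces it further, which matches the ordering of strength described above.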
Goals of Transformation
The goals of a transformation may be:
- one might want to see the data structure differently
- one might want to reduce skewness, which assists in modeling
- one might want to straighten a nonlinear (curvilinear) relationship in a scatter plot
- one might want approximately equal dispersion (spread) across the data, making the data easier to handle and interpret
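The straightening goal can be sketched with a small example. It assumes a hypothetical power-law relationship $y = 3x^2$, chosen only for illustration, and uses a hand-written `pearson_r` helper rather than a library routine. Taking logs of both variables turns the curve into a straight line, because $\log y = \log 3 + 2\log x$.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical curvilinear relationship: y = 3 * x**2 for x = 1, ..., 10.
x = [float(i) for i in range(1, 11)]
y = [3.0 * xi ** 2 for xi in x]

r_raw = pearson_r(x, y)
r_log = pearson_r([math.log(v) for v in x], [math.log(v) for v in y])

print(r_raw)  # below 1: the raw relationship is curved
print(r_log)  # essentially 1: the log-log relationship is linear
```

Real data rarely follow an exact power law, so in practice the improvement in linearity is usually partial and should be checked with a scatter plot of the transformed variables.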