Skewness in Statistics: A Measure of Asymmetry (2017)

This article is about skewness in statistics, a measure of asymmetry. Skewed and skew are widely used terms that refer to something out of order or distorted on one side. Similarly, when referring to the shape of a frequency or probability distribution, skewness refers to the asymmetry of that distribution. A distribution with an asymmetric tail extending out to the right is referred to as “positively skewed” or “skewed to the right”. In contrast, a distribution with an asymmetric tail extending out to the left is “negatively skewed” or “skewed to the left”.


Skewness in Statistics

Skewness ranges from minus infinity ($-\infty$) to positive infinity ($+\infty$). In simple words, skewness is a measure of the lack of symmetry (asymmetry) of a distribution; a perfectly symmetric distribution has zero skewness.

Skewness by Karl Pearson

Karl Pearson (1857-1936) first suggested measuring skewness by standardizing the difference between the mean and the mode, that is, $\frac{\mu-\text{mode}}{\text{standard deviation}}$. Because the population mode is not well estimated from the sample mode, Stuart and Ord (1994) suggested estimating the difference between the mean and the mode as three times the difference between the mean and the median. The estimate of skewness then becomes $$\frac{3(\text{mean}-\text{median})}{\text{standard deviation}}$$ Many statisticians use this measure after eliminating the ‘3’, that is, $$\frac{\text{mean}-\text{median}}{\text{standard deviation}}$$ This statistic ranges from $-1$ to $+1$. According to Hildebrand (1986), absolute values above 0.2 indicate great skewness.
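
As a quick illustration, here is a minimal base R sketch of Pearson's coefficient (the function name and simulated data are illustrative, not from the article):

```r
# Pearson's median-based skewness: 3 * (mean - median) / sd;
# dropping the factor of 3 gives the version bounded by -1 and +1
pearson_skew <- function(x, keep_factor_3 = TRUE) {
  k <- if (keep_factor_3) 3 else 1
  k * (mean(x) - median(x)) / sd(x)
}

set.seed(1)
x <- rexp(100)           # a right-skewed sample
pearson_skew(x)          # positive, indicating right skew
pearson_skew(x, FALSE)   # the version without the factor of 3
```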

Fisher’s Skewness

Skewness has also been defined in terms of the third moment about the mean, that is, $\gamma_1=\frac{\sum(X-\mu)^3}{n\sigma^3}$, which is simply the expected value of the distribution of cubed $z$ scores. Skewness measured in this way is sometimes referred to as “Fisher’s skewness”. When the deviations from the mean are greater in one direction than in the other, this statistic deviates from zero in the direction of the larger deviations.

From sample data, Fisher’s skewness is most often estimated by $$g_1=\frac{n\sum z^3}{(n-1)(n-2)}$$ For large sample sizes ($n > 150$), $g_1$ is distributed approximately normally, with a standard error of approximately $\sqrt{\frac{6}{n}}$. While one could use this sampling distribution to construct confidence intervals or tests of hypotheses about $\gamma_1$, there is rarely any value in doing so.
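
This estimator is easy to compute directly. The following base R sketch (function name illustrative) follows the formula above, standardizing with R's sample standard deviation:

```r
# Fisher's skewness estimate: g1 = n * sum(z^3) / ((n - 1) * (n - 2)),
# where z are the standardized scores (x - mean) / sd
fisher_skew <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  n * sum(z^3) / ((n - 1) * (n - 2))
}

set.seed(1)
x <- rexp(200)   # n > 150, so the normal approximation applies
c(g1 = fisher_skew(x), se = sqrt(6 / length(x)))
```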

Bowley’s Coefficient of Skewness

Arthur Lyon Bowley (1869-1957) proposed a measure of asymmetry based on the median and the two quartiles. In a symmetrical distribution, the two quartiles are equidistant from the median, but in an asymmetrical distribution they are not. Bowley’s coefficient of skewness is $$\frac{Q_1+Q_3-2\,\text{median}}{Q_3-Q_1}$$ Its value lies between $-1$ and $+1$.
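
A small base R sketch of Bowley's coefficient (quartiles computed with R's default quantile method; names are illustrative):

```r
# Bowley's coefficient: (Q1 + Q3 - 2 * median) / (Q3 - Q1)
bowley_skew <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  (q[1] + q[3] - 2 * q[2]) / (q[3] - q[1])
}

set.seed(1)
bowley_skew(rexp(100))   # positive for a right-skewed sample
```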

The most commonly used measures of Asymmetry (those discussed here) may produce some surprising results, such as a negative value when the shape of the distribution appears skewed to the right.

Impact of Lack of Symmetry

Researchers in the behavioral and business sciences need to measure the lack of symmetry when it appears in their data. A great amount of asymmetry may motivate the researcher to investigate the existence of outliers. When deciding which measure of location to report and which inferential statistic to employ, one should take the estimated skewness of the population into consideration. Normal distributions have zero skewness. Of course, a distribution can be perfectly symmetric yet still be far from normal. Transformations of the variables under study are commonly employed to reduce (positive) asymmetry; these transformations include the square root, log, and reciprocal of a variable.
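
As a hedged illustration of this last point, the sketch below compares the median-based skewness of a simulated right-skewed sample before and after two common transformations (the data and numbers are simulated, not from the article):

```r
set.seed(1)
x <- rexp(500)   # simulated, strongly right-skewed data

# median-based skewness, (mean - median) / sd, for raw and transformed data
skew <- function(v) (mean(v) - median(v)) / sd(v)
c(raw = skew(x), sqrt = skew(sqrt(x)), log = skew(log(x)))
# the square root reduces the positive skew; for exponential-like data
# the log can overcorrect and produce mild negative skew
```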

In summary, by understanding and recognizing how skewness affects the data, one can choose appropriate analysis methods, gain more insights from the data, and make better decisions based on the findings.

FAQs About Skewness

  1. What statistical measure is used to find the asymmetry in the data?
  2. Define the term Skewness.
  3. What is the difference between the concepts of symmetry and asymmetry?
  4. Describe negative and positive skewness.
  5. What is the difference between left-skewed and right-skewed data?
  6. What is a lack of symmetry?
  7. Discuss the measure proposed by Karl Pearson.
  8. Discuss the measure proposed by Bowley (Bowley’s coefficient of skewness).
  9. For what distribution is the skewness zero?
  10. What is the impact of transforming a variable?

The Sum of Squared Deviations from the Mean (2015)

Introduction to the Sum of Squared Deviations

In statistics, the sum of squared deviations (also known as the sum of squares) is a measure of the total variability (spread or variation) within a data set. In other words, the sum of squares measures the deviation or variation of the data values from the mean (average) of the given data set.

Computation of Sum of Squared Deviations

The sum of squares is calculated by first computing the difference between each data point (observation) and the mean of the data set, i.e., $x=X-\overline{X}$. The computed $x$ is known as the deviation score. Squaring each of these deviation scores and then adding them gives the sum of squared deviations (SS), represented mathematically as

\[SS=\sum(x^2)=\sum(X-\overline{X})^2\]

Note that the small letter $x$ usually represents the deviation of each observation from the mean value, while the capital letter $X$ represents the variable of interest in statistics.

The Sum of Squared Deviations Example

Consider the following data set: {5, 6, 7, 10, 12}. To compute the sum of squares of this data set, follow these steps:

  • Calculate the average of the given data by summing all the values in the data set and then dividing this sum by the total number of observations. Mathematically, it is $\frac{\sum X_i}{n}=\frac{40}{5}=8$, where 40 is the sum $5+6+7+10+12$ and there are 5 observations.
  • Calculate the difference of each observation in the data set from the average computed in step 1, for the given data. The differences are
    $5 - 8 = -3$; $6 - 8 = -2$; $7 - 8 = -1$; $10 - 8 = 2$; and $12 - 8 = 4$
    Note that the sum of these differences is zero: $(-3) + (-2) + (-1) + 2 + 4 = 0$
  • Now square each of the differences obtained in step 2. The squares of these differences are
    9, 4, 1, 4, and 16
  • Now add the squared numbers obtained in step 3. The sum of these squared quantities is $9 + 4 + 1 + 4 + 16 = 34$, which is the sum of squares for the given data set.
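
The same steps can be carried out in R with the article's data set:

```r
# Sum of squared deviations for the worked example
X <- c(5, 6, 7, 10, 12)
x <- X - mean(X)        # deviation scores; sum(x) is 0
SS <- sum(x^2)          # 9 + 4 + 1 + 4 + 16 = 34
SS
SS / (length(X) - 1)    # SS is the numerator of the sample variance: var(X) = 8.5
```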

Sums of Squares in Different Contexts

In statistics, the sum of squares occurs in different contexts, such as:

  • Partitioning of Variance (Partition of Sums of Squares)
  • Sum of Squared Deviations (Least Squares)
  • Sum of Squared Differences (Mean Squared Error)
  • Sum of Squared Errors (Residual Sum of Squares)
  • Sum of Squares due to Lack of Fit (Lack of Fit Sum of Squares)
  • Sum of Squares for Model Predictions (Explained Sum of Squares)
  • Sum of Squares for Observations (Total Sum of Squares)
  • Sum of Squared Deviations (Squared Deviations)
  • Modeling involving the Sum of Squares (Analysis of Variance)
  • Multivariate Generalization of the Sum of Squares (Multivariate Analysis of Variance)

As previously discussed, the sum of squares is a measure of the total variability of a set of scores around a specific number.

Summary

  • A higher sum of squares indicates that your data points are further away from the mean on average, signifying greater spread or variability in the data. Conversely, a lower sum of squares suggests the data points are clustered closer to the mean, indicating less variability.
  • The sum of squares plays a crucial role in calculating other important statistics like variance and standard deviation. These concepts help us understand the distribution of data and make comparisons between different datasets.

Data Transformation (Variable Transformation)

Data transformation is a rescaling of the data using a function or some mathematical operation on each observation. When data are very strongly skewed (negatively or positively), we sometimes transform them so that they are easier to model. Put another way, if a variable does not fit a normal distribution, one can try a data transformation to meet the assumptions of a parametric statistical test.

The most common data transformation is log (or natural log) transformation, which is often applied when most of the data values cluster around zero relative to the larger values in the data set and all of the observations are positive.

Data Transformation Techniques

Variable transformation can also be applied to one or more variables in scatter plots, correlation, and regression analysis to make the relationship between the variables more linear, and hence easier to model with simple methods. Transformations other than the log include the square root, reciprocal, etc.

Reciprocal Transformation

The reciprocal transformation, $x$ to $\frac{1}{x}$ (or $-\frac{1}{x}$), is a very strong transformation with a drastic effect on the shape of the distribution. It cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all of the values are positive. The plain reciprocal also reverses the order among values of the same sign (the largest becomes the smallest, and so on), which is why the negative reciprocal $-\frac{1}{x}$ is often used to preserve the order.

Logarithmic Transformation

The logarithmic transformation, $x$ to $\log_{10}(x)$ (or the natural log, or log base 2), is another strong transformation that affects the shape of the distribution. It is commonly used for reducing right skewness, but cannot be applied to negative or zero values.

Square Root Transformation

The square root transformation, $x$ to $x^{\frac{1}{2}}=\sqrt{x}$, has a moderate effect on the distribution shape and is weaker than the logarithm. It can be applied to zero values but not to negative values.
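
The following base R sketch applies the three transformations to a simulated positive, right-skewed variable (the shift of 0.5 is only to keep the log and reciprocal well-behaved near zero):

```r
set.seed(1)
x <- rexp(1000) + 0.5    # simulated positive, right-skewed data

x_sqrt  <- sqrt(x)       # moderate; requires x >= 0
x_log   <- log(x)        # strong; requires x > 0
x_recip <- -1 / x        # very strong; the minus sign preserves the order

# Compare the shapes, e.g. with histograms:
# hist(x); hist(x_sqrt); hist(x_log); hist(x_recip)
```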


The purpose of data transformation is:

  • Convert data from one format or structure to another (like changing a messy spreadsheet into a table).
  • Clean and prepare data for analysis (fixing errors, inconsistencies, and missing values).
  • Standardize data for easier integration and comparison (making sure all your data uses the same units and formats).

Goals of Transformation

The goals of transformation may be

  • one might want to see the data structure differently
  • one might want to reduce the skew, which assists in modeling
  • one might want to straighten a nonlinear (curvilinear) relationship in a scatter plot
  • one might want approximately equal dispersion across the data, making it easier to handle and interpret

There are many techniques used in data transformation; these include:

  • Cleaning and Filtering: Identifying and removing errors, missing values, and duplicates.
  • Data Normalization: Ensuring data consistency across different fields.
  • Aggregation: Summarizing data by combining similar values.

Benefits of Data Transformation

The benefits of data transformation and data cleaning are:

  • Improved data quality: Fewer errors and inconsistencies lead to more reliable results.
  • Easier analysis: Structured data is easier to work with for data analysts and scientists.
  • Better decision-making: Accurate insights from clean data lead to better choices.

Data transformation is a crucial step in the data pipeline, especially in tasks like data warehousing, data integration, and data wrangling.

FAQs about Data Transformation

  • What is data transformation?
  • When is data transformation done?
  • What is the most common data transformation?
  • What is the reciprocal data transformation?
  • When is the reciprocal transformation not useful?
  • What is a logarithmic transformation?
  • When can the logarithmic transformation not be applied to the data?
  • What is the square root transformation?
  • When can the square root transformation not be applied?
  • What is the main purpose of data transformation?
  • What are the goals of transformation?
  • What is data normalization?
  • What is data aggregation?
  • What is cleaning and filtering?
  • What are the benefits of data transformation?

Levels of Measurement in Statistics

Introduction to Levels of Measurement in Statistics

Data can be classified according to its level of measurement, which dictates the calculations that can be done to summarize and present the data (numerically and graphically) and helps determine what statistical tests should be performed.

For example, suppose there are six colors of candies in a bag and you assign different numbers (codes) to them such that brown candy has a value of 1, yellow 2, green 3, orange 4, blue 5, and red 6. Adding all the assigned color values from this bag of candies and then dividing by the number of candies might yield an average value of 3.68. Does this mean that the average color is green or orange? Of course not. When computing statistics, it is important to recognize the data type, which may be qualitative (nominal or ordinal) or quantitative (interval or ratio).

Levels of measurement were developed in conjunction with the concepts of numbers and units of measurement. Statisticians classify measurements into four levels, namely nominal, ordinal, interval, and ratio, described below.

Nominal Level of Measurement

At the nominal level of measurement, the observations of a qualitative variable can only be classified and counted. There is no particular order to the categories. The mode, frequency table (discrete frequency table), pie chart, and bar graph are usually used for this level of measurement.

Ordinal Level of Measurement

At the ordinal level of measurement, data are classified into sets of labels or names that have relative values (a ranking or ordering of values). For example, suppose you survey 1,000 people and ask them to rate a restaurant on a scale ranging from 0 to 5, where 5 is the highest liking level and 0 the lowest. Taking the average of these 1,000 responses will have meaning. Usually, graphs and charts are drawn for ordinal data.


Interval Level of Measurement

Numbers are also used to express quantities: temperature, dress size, and plane ticket price are all quantities. The interval level of measurement allows for the degree of difference between items but not the ratio between them. There is a meaningful difference between values; for example, the difference between 10 and 15 degrees Fahrenheit is 5 degrees, and the difference between 50 and 55 degrees is also 5 degrees. It is also important that zero is just a point on the scale; it does not represent the absence of heat.

Ratio Level of Measurement

All quantitative data are recorded on the ratio level. It has all the characteristics of the interval level, but in addition the zero point is meaningful and the ratio between two numbers is meaningful. Examples of ratio-level variables are wages, units of production, weight, changes in stock prices, the distance between home and office, height, etc.


Many inferential test statistics depend on the ratio and interval levels of measurement. Many authors argue that interval and ratio measures should be called scales.


Importance of Levels of Measurement in Statistics

Understanding the level of measurement of data is crucial for several reasons:

  • Choosing Appropriate Statistical Tests: Different statistical tests are designed for different levels of measurement. Using the wrong test on data with an inappropriate level of measurement can lead to misleading results and decisions.
  • Data Interpretation: The level of measurement determines how one can interpret the data and what conclusions can be made. For example, the average (mean) is calculated for interval and ratio data, but not for nominal or ordinal data.
  • Data analysis: The level of measurement influences the types of calculations and analyses one can perform on the data.

By correctly identifying the level of measurement of the data, one can ensure that appropriate statistical methods are used and valid conclusions are drawn from the analysis.
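
As a minimal sketch (variable names are illustrative, reusing the candy example above), the four levels are commonly represented in R as follows:

```r
# Nominal: unordered categories -- a factor
candy <- factor(c("brown", "yellow", "green", "red"),
                levels = c("brown", "yellow", "green", "orange", "blue", "red"))

# Ordinal: ordered categories -- an ordered factor
rating <- factor(c(0, 3, 5, 4), levels = 0:5, ordered = TRUE)

# Interval: meaningful differences, arbitrary zero -- plain numeric
temp_f <- c(10, 15, 50, 55)

# Ratio: meaningful zero and meaningful ratios -- plain numeric
wage <- c(0, 12.5, 30)

# Averaging nominal codes is not meaningful; mean(candy) returns NA with a warning
```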
