# Basic Statistics and Data Analysis

### Category: Basic Statistics

Introduction to statistics

## Skewness: Measure of Asymmetry

The words skewed and askew are widely used to refer to something that is out of order or distorted on one side. Similarly, when referring to the shape of frequency distributions or probability distributions, the term skewness refers to the asymmetry of that distribution. A distribution with an asymmetric tail extending out to the right is referred to as “positively skewed” or “skewed to the right”, while a distribution with an asymmetric tail extending out to the left is referred to as “negatively skewed” or “skewed to the left”. Skewness can range from minus infinity ($-\infty$) to positive infinity ($+\infty$). In simple words, skewness is a measure of asymmetry, that is, of the lack of symmetry.

Karl Pearson (1857-1936) first suggested measuring skewness by standardizing the difference between the mean and the mode, such that $skewness=\frac{\mu-mode}{\text{standard deviation}}$. Since population modes are not well estimated from sample modes, Stuart and Ord (1994) suggested that one can estimate the difference between the mean and the mode as being three times the difference between the mean and the median. Therefore, the estimate of skewness will be $skewness=\frac{3(M-median)}{\text{standard deviation}}$, where $M$ is the sample mean. Many statisticians use this measure after eliminating the ‘3’, that is, $skewness=\frac{M-median}{\text{standard deviation}}$. This statistic ranges from $-1$ to $+1$. According to Hildebrand (1986), absolute values of this statistic above 0.2 indicate great skewness.
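
The median-based measures above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_median_skewness`, the `keep_factor_3` flag, and the sample data are made up for this example, and the sample standard deviation (divisor $n-1$) is assumed because the text does not say which standard deviation is meant.

```python
import statistics


def pearson_median_skewness(data, keep_factor_3=True):
    """Median-based Pearson skewness: 3 * (mean - median) / s, or without the 3."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    s = statistics.stdev(data)  # assumption: sample standard deviation (n - 1)
    factor = 3 if keep_factor_3 else 1
    return factor * (mean - median) / s


# Hypothetical data: one long right tail, and one symmetric set.
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
symmetric = [1, 2, 3, 4, 5]

print(pearson_median_skewness(right_skewed))                       # positive
print(pearson_median_skewness(right_skewed, keep_factor_3=False))  # between -1 and +1
print(pearson_median_skewness(symmetric))                          # zero
```

With the factor of 3 removed, the value for the right-skewed data stays within $-1$ and $+1$, in line with the range stated above.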

Skewness has also been defined with respect to the third moment about the mean, that is $\gamma_1=\frac{\sum(X-\mu)^3}{n\sigma^3}$, which is simply the expected value of the distribution of cubed $Z$ scores. Skewness measured in this way is also sometimes referred to as “Fisher’s skewness”. When the deviations from the mean are greater in one direction than in the other direction, this statistic will deviate from zero in the direction of the larger deviations. From sample data, Fisher’s skewness is most often estimated by: $g_1=\frac{n\sum z^3}{(n-1)(n-2)}$. For large sample sizes ($n > 150$), $g_1$ may be distributed approximately normally, with a standard error of approximately $\sqrt{\frac{6}{n}}$. While one could use this sampling distribution to construct confidence intervals for or tests of hypotheses about $\gamma_1$, there is rarely any value in doing so.
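
As a rough sketch, the sample estimate $g_1$ and its approximate standard error could be computed as follows. The function names are hypothetical, and the $z$ scores are assumed to be formed with the sample standard deviation (divisor $n-1$), which is the usual convention for this estimator.

```python
import math
import statistics


def fisher_skewness(data):
    """Sample skewness g1 = n * sum(z^3) / ((n - 1) * (n - 2))."""
    n = len(data)
    if n < 3:
        raise ValueError("g1 needs at least three observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # assumption: z scores use the sample SD
    sum_z_cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n * sum_z_cubed / ((n - 1) * (n - 2))


def approx_standard_error(n):
    """Large-sample approximation sqrt(6 / n) for the standard error of g1."""
    return math.sqrt(6 / n)


print(fisher_skewness([5, 6, 7, 10, 12]))  # positive: larger deviations above the mean
print(fisher_skewness([1, 2, 3, 4, 5]))    # approximately zero for symmetric data
print(approx_standard_error(150))
```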

Arthur Lyon Bowley (1869-1957) also proposed a measure of skewness based on the median and the two quartiles. In a symmetrical distribution, the two quartiles are equidistant from the median, but in an asymmetrical distribution this will not be the case. Bowley’s coefficient of skewness is $skewness=\frac{Q_1+Q_3-2\,\text{median}}{Q_3-Q_1}$. Its value lies between $-1$ and $+1$.
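
A minimal sketch of Bowley’s coefficient, assuming the quartiles are taken from Python’s `statistics.quantiles` with its default method. Other quartile conventions exist and can give slightly different values; the function name and data are illustrative only.

```python
import statistics


def bowley_skewness(data):
    """Bowley's coefficient: (Q1 + Q3 - 2 * median) / (Q3 - Q1)."""
    # statistics.quantiles(data, n=4) returns the three cut points Q1, Q2, Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return (q1 + q3 - 2 * q2) / (q3 - q1)


print(bowley_skewness([1, 2, 3, 4, 5, 6, 7]))             # symmetric: quartiles equidistant
print(bowley_skewness([1, 2, 2, 3, 3, 4, 5, 8, 13, 40]))  # right-skewed: positive
```

Because the coefficient uses only the quartiles, a single extreme value in the tail affects it far less than it affects the moment-based measures.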

The most commonly used measures of skewness (those discussed here) may produce some surprising results, such as a negative value when the shape of the distribution appears skewed to the right.

It is important for researchers from the behavioral and business sciences to measure skewness when it appears in their data. A great amount of skewness may motivate the researcher to investigate the existence of outliers. When making decisions about which measure of location to report and which inferential statistic to employ, one should take into consideration the estimated skewness of the population. Normal distributions have zero skewness. Of course, a distribution can be perfectly symmetric and still be far from a normal distribution. Transformations of the variables under study are commonly employed to reduce (positive) skewness. These transformations may include the square root, the log, and the reciprocal of the variable.
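
To illustrate how such transformations act on positive skewness, the sketch below compares the moment-based skewness of a small, made-up, right-skewed data set before and after a square root and a log transformation. The helper name `moment_skewness` and the data are assumptions for illustration, and the population form of the third-moment formula is used.

```python
import math
import statistics


def moment_skewness(data):
    """Third-moment skewness: sum((x - mean)^3) / (n * sigma^3)."""
    n = len(data)
    mean = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    return sum((x - mean) ** 3 for x in data) / (n * sigma ** 3)


# Hypothetical positive data with a long right tail.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

skew_raw = moment_skewness(raw)
skew_sqrt = moment_skewness([math.sqrt(x) for x in raw])
skew_log = moment_skewness([math.log(x) for x in raw])

print(skew_raw, skew_sqrt, skew_log)
```

For this data set the log transformation pulls the long right tail in more strongly than the square root does, so the skewness shrinks at each step.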

## Convert PDFs to Editable File Formats in 3 Easy Steps

Since the introduction of computers into our lives, we’ve been able to do things that we couldn’t do before. Slowly but surely, our PC skills have improved and today we are using new technologies that are enabling us to be better and more productive in almost every aspect of our lives.

One huge part of modern technology is the digital document, a legacy of the digital revolution. Paper documents have largely been replaced by digital files, since these are easier to use, edit and share between colleagues and friends.

One of the most used and known digital file formats is Portable Document Format, better known as the PDF. Developed and published in the nineties, the PDF is still a number one format for managers, students, accountants, writers and many others. For more than 20 years it has been building up supporters, who use it for 3 main reasons:

1. It’s universal — it can be opened on any device (including mobile devices).
2. It’s shareable — documents are easily shared across all platforms.
3. It’s standardized — the files always maintain original formatting.

Aside from attractive features that make this file format popular, there is one major downside to using PDF — the format is not so easy to edit.

If you want to make changes to your financial or project reports saved in PDF, the best thing to do is to edit your documents using software that’s designed for that purpose. One such tool is Able2Extract Professional 11, known for its powerful and modern PDF editing features.

With Able2Extract’s integrated PDF editor you can:

• Resize and scale more pages at once
• Customize any individual page
• Extract and combine multiple PDFs
• Redact any sensitive content

The software also converts PDF to over 10 different file formats (MS Office, AutoCAD, Image, HTML, CSV) and it’s available for all three desktop platforms.

It’s so easy to use that all you need to do is follow this three-step conversion process:

1. Click Open and select the PDF document that you want to convert.
2. Select either the entire document or just a part, using the Selection panel. After making the selection, click on the desired output format.
3. Choose where you want your document to be saved, and the conversion will begin.

Besides editing and conversion, Able2Extract also provides complete document encryption and decryption when you create a PDF.

Now you can set up file owners, configure passwords and share your documents freely. When you click the “Create” button in Able2Extract, the software automatically makes a PDF document from your file.

To conclude this quick guide: the conversion of PDF files is precise and quick and, most importantly, it can boost your office productivity. On the downside, the tool is aimed at experienced business professionals, with the full, lifetime license costing around \$150. To see if Able2Extract is a tool that can help you with your everyday document struggles, you can download the free trial version. It lasts for 7 days, which is more than enough to make the right call.

## The Correlogram

A correlogram is a graph used to interpret a set of autocorrelation coefficients, in which $r_k$ is plotted against the lag $k$. A correlogram is often very helpful for visual inspection. Some general advice for interpreting the correlogram:

• A Random Series: If a time series is completely random, then for large $N$, $r_k \cong 0$ for all non-zero values of $k$. For a random time series, $r_k$ is approximately distributed as $N\left(0, \frac{1}{N}\right)$. If a time series is random, then 19 out of 20 of the values of $r_k$ can be expected to lie between $\pm \frac{2}{\sqrt{N}}$. However, when plotting the first 20 values of $r_k$, one can expect to find one significant value on average even when the time series is really random.
• Short-term Correlation: Stationary series often exhibit short-term correlation, characterized by a fairly large value of $r_1$ followed by two or three more coefficients that, while significantly greater than zero, tend to get successively smaller. Values of $r_k$ for larger lags tend to be approximately zero. A time series which gives rise to such a correlogram is one for which an observation above the mean tends to be followed by one or more further observations above the mean, and similarly for observations below the mean. A model called an autoregressive model may be appropriate for series of this type.
• Alternating Series: If a time series has a tendency to alternate, with successive observations on different sides of the overall mean, then the correlogram also tends to alternate. The value of $r_1$ will be negative; however, the value of $r_2$ will be positive, as observations at lag 2 will tend to be on the same side of the mean.
• Non-Stationary Series: If a time series contains a trend, then the values of $r_k$ will not come down to zero except for very large values of the lag. This is because an observation on one side of the overall mean tends to be followed by a large number of further observations on the same side of the mean because of the trend. The sample autocorrelation function $\{ r_k \}$ should only be calculated for stationary time series, so any trend should be removed before calculating $\{ r_k \}$.
• Seasonal Fluctuations: If a time series contains a seasonal fluctuation, then the correlogram will also exhibit an oscillation at the same frequency. If $x_t$ follows a sinusoidal pattern, then so does $r_k$. For example, if $x_t = a \cos(t w)$, where $a$ is a constant and $w$ is the frequency such that $0 < w < \pi$, then $r_k \cong \cos(k w)$ for large $N$. If the seasonal variation is removed from seasonal data, then the correlogram may provide useful information.
• Outliers: If a time series contains one or more outliers, the correlogram may be seriously affected. If there is one outlier in the time series and it is not adjusted, then the plot of $x_t$ vs $x_{t+k}$ will contain two extreme points, which will tend to depress the sample correlation coefficients towards zero. If there are two outliers, this effect is more noticeable.
• General Remarks: Experience is required to interpret autocorrelation coefficients. We need to study the probability theory of stationary series and the classes of models too. We also need to know the sampling properties of $r_k$.

## Standard Error of Estimate

Standard error (SE) is a statistical term used to measure the accuracy of a sample taken from a population of interest.
The standard error of the mean measures the variation in the sampling distribution of the sample mean. It is usually denoted by $\sigma_{\overline{x}}$ and is calculated as

$\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}$

Drawing different samples from the same population of interest usually results in different values of the sample mean, indicating that there is a distribution of sample means having its own mean (average value) and variance. The standard error of the mean is the standard deviation of the means of all possible samples drawn from the same population.

The size of the standard error is affected by the standard deviation of the population and by the number of observations in a sample, called the sample size. The larger the standard deviation of the population ($\sigma$), the larger the standard error will be, indicating that there is more variability in the sample means. However, the larger the number of observations in a sample, the smaller the standard error will be, indicating that there is less variability in the sample means. Less variability here means that the sample is more representative of the population of interest.

If the sampled population is not very large, we need to make an adjustment in computing the SE of the sample mean. For a finite population, in which the total number of objects (observations) is $N$ and the number of objects (observations) in a sample is $n$, the adjustment is $\sqrt{\frac{N-n}{N-1}}$. This adjustment is called the finite population correction factor. The adjusted standard error is then

$\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$

The SE is used to:

1. measure the spread of values of a statistic about the expected value of that statistic
2. construct confidence intervals
3. test the null hypothesis about population parameter(s)

The standard error is computed from sample statistics.
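
The two formulas for the standard error of the mean, with and without the finite population correction, can be sketched as below. The function name, the parameter names, and the numbers in the usage lines are hypothetical and chosen only to illustrate the calculation.

```python
import math


def standard_error_of_mean(sigma, n, population_size=None):
    """sigma / sqrt(n), optionally times the finite population correction."""
    se = sigma / math.sqrt(n)
    if population_size is not None:
        # Finite population correction factor: sqrt((N - n) / (N - 1)).
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


# Illustrative values: sigma = 10 and a sample of n = 25.
print(standard_error_of_mean(10, 25))                       # no correction
print(standard_error_of_mean(10, 25, population_size=100))  # with correction, N = 100
```

The correction factor is below one whenever $n > 1$, so the adjusted standard error is smaller than the unadjusted one.
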
For simple random samples, assuming that the size of the population ($N$) is at least 20 times larger than the sample size ($n$), the SE is computed as follows:

\begin{align*}
\text{Sample mean, } \overline{x} &\Rightarrow SE_{\overline{x}} = \frac{s}{\sqrt{n}}\\
\text{Sample proportion, } p &\Rightarrow SE_{p} = \sqrt{\frac{p(1-p)}{n}}\\
\text{Difference between means, } \overline{x}_1 - \overline{x}_2 &\Rightarrow SE_{\overline{x}_1-\overline{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\\
\text{Difference between proportions, } p_1-p_2 &\Rightarrow SE_{p_1-p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{align*}

The equations for the standard error are identical to those for the standard deviation of the statistic, except that they use sample statistics where the standard deviation equations use population parameters.

## Sum of Squares

In statistics, the sum of squares is a measure of the total variability (spread, variation) within a data set. In other words, the sum of squares is a measure of deviation or variation from the mean value of the given data set. A sum of squares is calculated by first computing the difference between each data point (observation) and the mean of the data set, i.e. $x=X-\overline{X}$. The computed $x$ is the deviation score for the given data set. Squaring each of these deviation scores and then adding the squared deviation scores gives the sum of squares (SS), which is represented mathematically as

$SS=\sum(x^2)=\sum(X-\overline{X})^2$

Note that the small letter $x$ usually represents the deviation of each observation from the mean value, while the capital letter $X$ represents the variable of interest in statistics.

## Sum of Squares Example

Consider the following data set: {5, 6, 7, 10, 12}. To compute the sum of squares of this data set, follow these steps:

• Calculate the average of the given data by summing all the values in the data set and then dividing this sum by the total number of observations in the data set.
Mathematically, it is $\frac{\sum X_i}{n}=\frac{40}{5}=8$, where 40 is the sum of all the numbers, $5+6+7+10+12$, and there are 5 observations.
• Calculate the difference between each observation in the data set and the average computed in step 1. The differences are
5 – 8 = –3; 6 – 8 = –2; 7 – 8 = –1; 10 – 8 = 2 and 12 – 8 = 4
Note that the sum of these differences should be zero. (–3 + –2 + –1 + 2 + 4 = 0)
• Now square each of the differences obtained in step 2. The squares of these differences are
9, 4, 1, 4 and 16
• Now add the squared numbers obtained in step 3. The sum of these squared quantities is 9 + 4 + 1 + 4 + 16 = 34, which is the sum of squares of the given data set.
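
The steps above can be reproduced with a short Python sketch. The function name `sum_of_squares` is only illustrative.

```python
def sum_of_squares(data):
    """Sum of squared deviations of each observation from the mean."""
    mean = sum(data) / len(data)            # step 1: the average
    deviations = [x - mean for x in data]   # step 2: deviation scores
    return sum(d ** 2 for d in deviations)  # steps 3 and 4: square, then add


data = [5, 6, 7, 10, 12]
print(sum_of_squares(data))  # 34.0
```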

In statistics, sum of squares occurs in different contexts such as

• Partitioning of Variance (Partition of Sums of Squares)
• Sum of Squared Deviations (Least Squares)
• Sum of Squared Differences (Mean Squared Error)
• Sum of Squared Error (Residual Sum of Squares)
• Sum of Squares due to Lack of Fit (Lack of Fit Sum of Squares)
• Sum of Squares for Model Predictions (Explained Sum of Squares)
• Sum of Squares for Observations (Total Sum of Squares)
• Sum of Squared Deviation (Squared Deviations)
• Modeling involving Sum of Squares (Analysis of Variance)
• Multivariate Generalization of Sum of Square (Multivariate Analysis of Variance)

As previously discussed, the sum of squares is a measure of the total variability of a set of scores around a specific number, usually the mean.