Basic Statistics and Data Analysis

Lecture notes, MCQS of Statistics

Category: Basic Statistics

Introduction to statistics

Convert PDFs to Editable File Formats in 3 Easy Steps

Since the introduction of computers into our lives, we’ve been able to do things that we couldn’t do before. Slowly but surely, our PC skills have improved and today we are using new technologies that are enabling us to be better and more productive in almost every aspect of our lives.

Digital documents, a legacy of the digital revolution, are a large part of modern technology. Digital files have largely replaced paper documents because they are easier to use, edit and share with colleagues and friends.

One of the most used and known digital file formats is Portable Document Format, better known as the PDF. Developed and published in the nineties, the PDF is still a number one format for managers, students, accountants, writers and many others. For more than 20 years it has been building up supporters, who use it for 3 main reasons:

  1. It’s universal — it can be opened on any device (including mobile devices).
  2. It’s shareable — documents are easily shared across all platforms.
  3. It’s standardized — the files always maintain original formatting.

Aside from attractive features that make this file format popular, there is one major downside to using PDF — the format is not so easy to edit.

If you want to make changes to financial or project reports saved as PDF, the best approach is to edit the documents with software designed for that purpose. One such tool is Able2Extract Professional 11, known for its PDF editing features.

With Able2Extract’s integrated PDF editor you can:

  • Resize and scale multiple pages at once
  • Add 10 different annotations
  • Customize any individual page
  • Add and delete your PDF content
  • Extract and combine multiple PDFs
  • Redact any sensitive content

The software also converts PDF to over 10 different file formats (MS Office, AutoCAD, Image, HTML, CSV) and it’s available for all three desktop platforms.

It’s so easy to use that all you need to do is follow this three step conversion process:

  1. Click Open and select the PDF document that you want to convert.
  2. Select either the entire document or just a part, using the Selection panel. After making the selection, click on the desired output format.
  3. Choose where you want your document to be saved, and the conversion will begin.

Besides editing and conversion, Able2Extract also provides document encryption and decryption when you create a PDF.

You can set up file owners, configure passwords and share your documents freely. When you click the “Create” button in Able2Extract, the software automatically makes a PDF document from your file.

To conclude this quick guide: the conversion of PDF files is precise, quick and most importantly — it can boost your office productivity. On the downside, the tool is aimed at experienced business professionals, with the full, lifetime license costing around $150.

To see if Able2Extract is a tool that can help you with your everyday document struggles, you can download the free trial version. It lasts for 7 days, which is more than enough to make the right call.


The Correlogram

A correlogram is a graph used to interpret a set of autocorrelation coefficients, in which $r_k$ is plotted against the lag $k$. A correlogram is often very helpful for visual inspection. Some general advice for interpreting the correlogram follows:

  • A Random Series: If a time series is completely random, then for large $N$, $r_k \cong 0$ for all non-zero values of $k$. For a random time series, $r_k$ is approximately $N\left(0, \frac{1}{N}\right)$, so 19 out of 20 of the values of $r_k$ can be expected to lie between $\pm \frac{2}{\sqrt{N}}$. However, when plotting the first 20 values of $r_k$, one can expect to find one significant value on average even when the time series really is random.
  • Short-term Correlation: Stationary series often exhibit short-term correlation, characterized by a fairly large value of $r_1$ followed by 2 or 3 more coefficients which, while significantly greater than zero, tend to get successively smaller. Values of $r_k$ for larger lags tend to be approximately zero. A time series which gives rise to such a correlogram is one for which an observation above the mean tends to be followed by one or more further observations above the mean, and similarly for observations below the mean. A model called an autoregressive model may be appropriate for series of this type.
  • Alternating Series: If a time series has a tendency to alternate, with successive observations on different sides of the overall mean, then the correlogram also tends to alternate. The value of $r_1$ will be negative; however, the value of $r_2$ will be positive, as observations at lag 2 will tend to be on the same side of the mean.
  • Non-Stationary Series: If a time series contains a trend, then the values of $r_k$ will not come down to zero except for very large values of the lag. This is because an observation on one side of the overall mean tends to be followed by a large number of further observations on the same side of the mean because of the trend. The sample autocorrelation function $\{ r_k \}$ should only be calculated for stationary time series, so any trend should be removed before calculating $\{ r_k\}$.
  • Seasonal Fluctuations: If a time series contains a seasonal fluctuation, then the correlogram will also exhibit an oscillation at the same frequency. If $x_t$ follows a sinusoidal pattern then so does $r_k$.
    For example, if $x_t = a \cos(t w)$, where $a$ is a constant and $w$ is a frequency such that $0 < w < \pi$, then $r_k \cong \cos(k w)$ for large $N$.
    If the seasonal variation is removed from seasonal data then the correlogram may provide useful information.
  • Outliers: If a time series contains one or more outliers, the correlogram may be seriously affected. If there is one outlier in the time series and it is not adjusted, then the plot of $x_t$ vs $x_{t+k}$ will contain two extreme points, which will tend to depress the sample correlation coefficients towards zero. If there are two outliers, this effect is more noticeable.
  • General Remarks: Experience is required to interpret autocorrelation coefficients. We also need to study the probability theory of stationary series and the classes of model that may be appropriate, and we need to know the sampling properties of $r_k$.
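
The coefficients $r_k$ and the $\pm \frac{2}{\sqrt{N}}$ limits discussed above can be computed directly. The following is a minimal sketch in plain Python, assuming the usual sample autocorrelation $r_k = \sum_{t}(x_t-\overline{x})(x_{t+k}-\overline{x}) / \sum_{t}(x_t-\overline{x})^2$; the function names and the simulated series are illustrative and not part of the original notes.

```python
import math
import random


def autocorrelation(x, max_lag):
    """Return the sample autocorrelation coefficients r_1, ..., r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)  # lag-0 sum of squared deviations
    coefficients = []
    for k in range(1, max_lag + 1):
        ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
        coefficients.append(ck / c0)
    return coefficients


def significance_bound(n):
    """Approximate 95% limit 2 / sqrt(N) for r_k of a random series."""
    return 2 / math.sqrt(n)


# Simulated random series (assumed data, for illustration only).
random.seed(0)
series = [random.gauss(0, 1) for _ in range(400)]
r = autocorrelation(series, 20)
bound = significance_bound(len(series))
outside = [k + 1 for k, value in enumerate(r) if abs(value) > bound]
print("bound:", bound)
print("lags outside the bound:", outside)

# An alternating series gives a negative r_1 and a positive r_2.
print(autocorrelation([1.0, -1.0] * 50, 2))
```

Plotting `r` against the lag, with horizontal lines at plus and minus `bound`, gives the correlogram itself.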

Standard Error of Estimate

Standard error (SE) is a statistical term that measures how accurately a sample taken from the population of interest represents that population. The standard error of the mean measures the variation in the sampling distribution of the sample mean. It is usually denoted by $\sigma_{\overline{x}}$ and is calculated as

\[\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}\]

Drawing different samples from the same population of interest usually results in different values of the sample mean, indicating that there is a distribution of sample means with its own mean (average value) and variance. The standard error of the mean is the standard deviation of the means of all possible samples drawn from the same population.

The size of the standard error is affected by the standard deviation of the population and by the number of observations in a sample, called the sample size. The larger the standard deviation of the population ($\sigma$), the larger the standard error will be, indicating that there is more variability in the sample means. However, the larger the number of observations in a sample, the smaller the standard error will be, indicating that there is less variability in the sample means, where less variability means that the sample is more representative of the population of interest.

If the sampled population is not very large, we need to make an adjustment in computing the SE of the sample mean. For a finite population in which the total number of objects (observations) is $N$ and the number of objects (observations) in a sample is $n$, the adjustment is $\sqrt{\frac{N-n}{N-1}}$. This adjustment is called the finite population correction factor. The adjusted standard error is then

\[\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}\]
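
As a rough illustration of the two formulas above, the sketch below computes the standard error of the mean with and without the finite population correction. The function name and the numbers ($\sigma = 10$, $n = 25$, $N = 100$) are assumptions chosen only for the example.

```python
import math


def standard_error_mean(sigma, n, population_size=None):
    """SE of the mean, sigma / sqrt(n).

    If population_size (N) is given, the finite population correction
    sqrt((N - n) / (N - 1)) is applied.
    """
    se = sigma / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


# Hypothetical values, for illustration only.
print(standard_error_mean(10, 25))                       # no correction
print(standard_error_mean(10, 25, population_size=100))  # with correction
```

The corrected value is smaller than the uncorrected one, and the correction factor approaches 1 as $N$ grows relative to $n$.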

The SE is used to:

  1. measure the spread of values of statistic about the expected value of that statistic
  2. construct confidence intervals
  3. test the null hypothesis about population parameter(s)

The standard error is computed from sample statistics. The following formulas give the SE for simple random samples, assuming that the size of the population ($N$) is at least 20 times larger than the sample size ($n$); here $s$ denotes the sample standard deviation.
\begin{align*}
\text{Sample mean, } \overline{x} & \Rightarrow SE_{\overline{x}} = \frac{s}{\sqrt{n}}\\
\text{Sample proportion, } p &\Rightarrow SE_{p} = \sqrt{\frac{p(1-p)}{n}}\\
\text{Difference between means, } \overline{x}_1 - \overline{x}_2 &\Rightarrow SE_{\overline{x}_1-\overline{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\\
\text{Difference between proportions, } p_1-p_2 &\Rightarrow SE_{p_1-p_2}=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}
\end{align*}
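
The four standard errors listed above translate directly into code. This is a minimal sketch using only the Python standard library; the function names are illustrative, and $s$ is taken to be the sample standard deviation with the $n-1$ divisor, as computed by `statistics.stdev`.

```python
import math
import statistics


def se_mean(sample):
    """SE of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def se_proportion(p, n):
    """SE of a sample proportion: sqrt(p (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def se_diff_means(s1, n1, s2, n2):
    """SE of the difference between two independent sample means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def se_diff_proportions(p1, n1, p2, n2):
    """SE of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical inputs, for illustration only.
print(se_mean([5, 6, 7, 10, 12]))
print(se_proportion(0.5, 100))
print(se_diff_means(3, 9, 4, 16))
print(se_diff_proportions(0.5, 100, 0.5, 100))
```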

The standard error is defined in the same way as the standard deviation of the sampling distribution, except that it is computed from sample statistics, whereas the standard deviation uses the population parameters.

 


Sum of Squares


In statistics, the sum of squares is a measure of the total variability (spread, variation) within a data set. In other words, the sum of squares is a measure of deviation or variation from the mean value of the given data set. A sum of squares is calculated by first computing the difference between each data point (observation) and the mean of the data set, i.e. $x=X-\overline{X}$. The computed $x$ is the deviation score for the given data set. Squaring each of these deviation scores and then adding the squared deviation scores gives the sum of squares (SS), which is represented mathematically as

\[SS=\sum(x^2)=\sum(X-\overline{X})^2\]

Note that the small letter $x$ usually represents the deviation of each observation from mean value, while capital letter $X$ represents the variable of interest in statistics.

Sum of Squares Example

Consider the following data set {5, 6, 7, 10, 12}. To compute the sum of squares of this data set, follow these steps

  • Calculate the average of the given data by summing all the values in the data set and then dividing this sum by the total number of observations in the data set. Mathematically, it is $\frac{\sum X_i}{n}=\frac{40}{5}=8$, where 40 is the sum of all the numbers, $5+6+7+10+12$, and there are 5 observations.
  • Calculate the difference of each observation in the data set from the average computed in step 1. The differences are
    5 – 8 = –3; 6 – 8 = –2; 7 – 8 = –1; 10 – 8 = 2 and 12 – 8 = 4
    Note that the sum of these differences should be zero. (–3 + –2 + –1 + 2 + 4 = 0)
  • Now square each of the differences obtained in step 2. The squares of these differences are
    9, 4, 1, 4 and 16
  • Now add the squared numbers obtained in step 3. The sum of these squared quantities is 9 + 4 + 1 + 4 + 16 = 34, which is the sum of squares of the given data set.
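
The steps above can be sketched in a few lines of Python. The function name is illustrative; the data set is the one from the example.

```python
def sum_of_squares(data):
    """Sum of squared deviations of each observation from the mean."""
    mean = sum(data) / len(data)               # step 1: the average
    deviations = [x - mean for x in data]      # step 2: deviation scores
    squared = [d ** 2 for d in deviations]     # step 3: squared deviations
    return sum(squared)                        # step 4: add them up


data = [5, 6, 7, 10, 12]
print(sum_of_squares(data))  # 34.0
```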

In statistics, sum of squares occurs in different contexts such as

  • Partitioning of Variance (Partition of Sums of Squares)
  • Sum of Squared Deviations (Least Squares)
  • Sum of Squared Differences (Mean Squared Error)
  • Sum of Squared Error (Residual Sum of Squares)
  • Sum of Squares due to Lack of Fit (Lack of Fit Sum of Squares)
  • Sum of Squares for Model Predictions (Explained Sum of Squares)
  • Sum of Squares for Observations (Total Sum of Squares)
  • Sum of Squared Deviation (Squared Deviations)
  • Modeling involving Sum of Squares (Analysis of Variance)
  • Multivariate Generalization of Sum of Squares (Multivariate Analysis of Variance)

As previously discussed, the sum of squares is a measure of the total variability of a set of scores around a specific number.

 

Data Transformation (Variable Transformation)


A transformation is a rescaling of the data using a function or some mathematical operation on each observation. When data are very strongly skewed (negatively or positively), we sometimes transform the data so that they are easier to model. Put another way, if a variable does not fit a normal distribution, one can try a data transformation so that the assumptions of a parametric statistical test are met.

The most common data transformation is the log (or natural log) transformation, which is often applied when most of the data values cluster around zero relative to the larger values in the data set and all of the observations are positive.

A transformation can also be applied to one or more variables in scatter plots, correlation and regression analysis to make the relationship between the variables more linear, and hence easier to model with a simple method. Transformations other than the log include the square root and the reciprocal.

Reciprocal Transformation
The reciprocal transformation, $x$ to $\frac{1}{x}$ or $(-\frac{1}{x})$, is a very strong transformation with a drastic effect on the shape of the distribution. Note that this transformation cannot be applied to zero values. Although it can be applied to negative values, it is not useful unless all of the values are positive. The reciprocal reverses the order among values of the same sign, i.e. the largest becomes the smallest, etc.

Logarithmic Transformation
The logarithm, $x$ to $\log_{10} x$ (or natural log, or log base 2), is another strong transformation that has an effect on the shape of the distribution. The logarithmic transformation is commonly used for reducing right skewness, but it cannot be applied to negative or zero values.

Square Root Transformation
The square root, $x$ to $x^{\frac{1}{2}}=\sqrt{x}$, is a transformation with a moderate effect on the shape of the distribution; it is weaker than the logarithm. The square root transformation can be applied to zero values but not to negative values.
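
To make the relative strength of these transformations concrete, the sketch below applies each one to a small right-skewed data set and reports a moment-based skewness coefficient. The data values and the function name are made up for illustration, and the skewness measure used ($m_3 / m_2^{3/2}$) is one common choice, not something prescribed by the notes above.

```python
import math


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positive, right-skewed data (sorted in ascending order).
data = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]

transforms = {
    "raw": lambda v: v,
    "square root": math.sqrt,
    "log": math.log,
    "reciprocal": lambda v: 1 / v,
}

for name, transform in transforms.items():
    transformed = [transform(v) for v in data]
    print(f"{name:12s} skewness = {skewness(transformed):.2f}")
```

For this data set the square root reduces the right skew moderately and the log reduces it further, while the reciprocal reverses the ordering of the values.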

Goals of transformation
The goals of transformation may be

  • one might want to see the data structure differently
  • one might want to reduce the skew, which assists in modeling
  • one might want to straighten a nonlinear (curvilinear) relationship in a scatter plot
  • one might want approximately equal dispersion, making the data easier to handle and interpret

 

Copy Right © 2011 ITFEATURE.COM