Tagged: Basic Statistics

Bias: The Difference Between the Expected Value and True Value

Bias in statistics is defined as the difference between the expected value of a statistic and the true value of the corresponding parameter. Therefore, the bias is a measure of the systematic error of an estimator: it indicates how far, on average, the estimator lies from the true value of the parameter. For example, if we calculate the mean of a large number of estimates produced by an unbiased estimator, we will find approximately the correct value.

In other words, bias is a systematic error in measurement or sampling, and it tells how far off, on average, the model is from the truth.

Gauss, C.F. (1821) introduced the concept of an unbiased estimator during his work on the least-squares method.

The bias of an estimator of a parameter should not be confused with its degree of precision, as the degree of precision is a measure of the sampling error. Bias is the favoring of one group or outcome, intentionally or unintentionally, over other groups or outcomes available in the population under study. Unlike random error, which can be reduced by increasing the sample size and averaging the outcomes, bias is a serious problem because it is systematic and does not average out as the sample grows.

Bias and Variance

There are several types of bias, which should not be considered mutually exclusive:

  • Selection Bias (arises due to systematic differences between the groups compared)
  • Exclusion Bias (arises due to the systematic exclusion of certain individuals from the study)
  • Analytical Bias (arises due to the way that the results are evaluated)

Mathematically, bias can be defined as follows.

Let the statistic $T$ be used to estimate a parameter $\theta$. If $E(T)=\theta+bias(\theta)$, then $bias(\theta)$ is called the bias of the statistic $T$, where $E(T)$ represents the expected value of the statistic $T$.
Note that if $bias(\theta)=0$, then $E(T)=\theta$, so $T$ is an unbiased estimator of the true parameter $\theta$.
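
The definition $E(T)=\theta+bias(\theta)$ can be checked approximately by simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be standard normal (so the true variance is 1), the sample size of 5 and the 20,000 repetitions are arbitrary choices, and `estimate_bias` is a hypothetical helper name. It compares the variance estimator with divisor $n$ against the one with divisor $n-1$.

```python
import random
import statistics


def estimate_bias(estimator, true_value, sample_size, n_repeats, rng):
    """Approximate E(T) - theta by averaging the estimator over many samples."""
    estimates = []
    for _ in range(n_repeats):
        # Assumption: the population is standard normal (mean 0, variance 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        estimates.append(estimator(sample))
    return statistics.fmean(estimates) - true_value


rng = random.Random(42)   # fixed seed so the run is repeatable
true_variance = 1.0       # variance of the assumed population

# statistics.pvariance divides by n; statistics.variance divides by n - 1.
bias_n = estimate_bias(statistics.pvariance, true_variance, 5, 20000, rng)
bias_n_minus_1 = estimate_bias(statistics.variance, true_variance, 5, 20000, rng)

print(f"divisor n:     estimated bias = {bias_n:.3f}")
print(f"divisor n - 1: estimated bias = {bias_n_minus_1:.3f}")
```

Under these assumptions, the estimated bias for the divisor-$n$ estimator should come out close to $-\sigma^2/n = -0.2$, while the divisor-$(n-1)$ estimator should show an estimated bias close to zero.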

Reference:
Gauss, C.F. (1821, 1823, 1826). Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and suppl. Werke 4, 1-108.

For further reading about Statistical Bias visit: Statistical Bias.

Constructing Frequency Tables

A frequency table is a way of summarizing a set of data. It records how often each value (or set of values) of the variable in question occurs in the data.

A grouping of qualitative data into mutually exclusive classes showing the number of observations in each class is called a frequency table. The number of values falling in a particular category/class is called the frequency of that category/class, denoted by $f$.

If the data of a continuous variable are arranged into different classes with their frequencies, the result is known as a continuous frequency distribution. If the data of a discrete variable are arranged into different classes with their frequencies, the result is known as a discrete (or discontinuous) frequency distribution.

Example

Car type      Number of cars
Local         50
Foreign       30
Total Cars    80

A frequency distribution may be constructed for both discrete and continuous variables. A discrete frequency distribution can be converted back to the original values, but for continuous variables this is not possible.

The following steps are taken into account while constructing frequency tables for continuous data:

  1. Calculate the range of the data. The range is the difference between the highest and lowest values of the given data.
    \[\text{Range} = \text{Highest Value} - \text{Lowest Value}\]
  2. Decide the number of classes. An approximate number of classes may be determined by either of the following rules:
    the smallest number of classes $k$ such that $2^k \geq n$     OR    Number of classes $(C) = 1+3.3 \log_{10}(n)$, where $n$ is the number of observations.
    Note that too many classes or too few classes might not reveal the basic shape of the data set.
  3. Determine the Class Interval or Width
    The classes, all taken together, should cover at least the distance from the lowest value in the data up to the highest value. This can be done with the formula \[I=\frac{\text{Highest Value} - \text{Lowest Value}}{\text{Number of Classes}}=\frac{H-L}{K}\]
    where $I$ is the class interval, $H$ is the highest observed value, $L$ is the lowest observed value, and $K$ is the number of classes.
    Generally, the class interval or width should be the same for all classes.
    In practice, the interval size is usually rounded up to some convenient number, such as a multiple of 10 or 100. Unequal class intervals present problems in graphically portraying the distribution and in doing some of the computations. Unequal class intervals may be necessary in certain situations, such as to avoid a large number of empty or almost empty classes.
  4. Set the Individual Class Limits
    Class limits are the endpoints of the class interval. State clear class limits so that you can put each observation into one and only one category, i.e., you must avoid overlapping or unclear class limits. Because class intervals are usually rounded up to get a convenient class size, the classes cover a larger range than necessary.
    It is convenient to choose the endpoints of the class interval so that no observation falls on them. This can be achieved by expressing the endpoints to one more decimal place than the observations themselves, i.e., limits are converted to class boundaries to achieve continuity in the data.
  5. Tally the Observation into the Classes
  6. Count the Number of Items in each Class
    The number of observations in each class is called the class frequency. Note that the frequencies of all classes must total the number of observations. After following these steps, we have organized the data into a tabulation called a frequency distribution, which can be used to summarize the pattern in the observations, i.e., the concentration of the data.
Frequency Distribution Table

Note: Arranging the data into a tabulation or frequency distribution results in a loss of detailed information, as the individuality of the observations vanishes; i.e., in a frequency distribution we cannot pinpoint the exact values, and we cannot tell the actual lowest and highest values of the data. However, the lower limit of the first class and the upper limit of the largest class convey essentially the same meaning. So, in constructing frequency tables, the advantages of condensing the data into a more understandable and organized form more than offset this disadvantage.
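
The six steps for continuous data can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `frequency_table` and the `marks` data are made up for the example, the number of classes uses the $1+3.3\log_{10}(n)$ rule rounded up, the class width is rounded up to the next whole number, and each class is treated as including its lower limit but not its upper limit.

```python
import math


def frequency_table(data, num_classes=None):
    """Return a list of (lower limit, upper limit, frequency) tuples."""
    n = len(data)
    low, high = min(data), max(data)
    data_range = high - low                                # step 1: range
    if num_classes is None:                                # step 2: number of classes
        num_classes = math.ceil(1 + 3.3 * math.log10(n))
    width = max(math.ceil(data_range / num_classes), 1)    # step 3: width, rounded up
    frequencies = [0] * num_classes
    for x in data:                                         # steps 5 and 6: tally and count
        # Clamp so the highest value lands in the last class when it sits
        # exactly on the final upper limit.
        index = min(int((x - low) // width), num_classes - 1)
        frequencies[index] += 1
    # Step 4: class limits, starting from the lowest observed value.
    return [(low + i * width, low + (i + 1) * width, f)
            for i, f in enumerate(frequencies)]


# Hypothetical data: 20 made-up exam marks, used only for illustration.
marks = [52, 67, 71, 48, 85, 93, 60, 74, 77, 66,
         58, 81, 69, 73, 90, 55, 64, 79, 88, 62]

table = frequency_table(marks)
for lower, upper, freq in table:
    print(f"{lower} to under {upper}: {freq}")
print("Total:", sum(freq for _, _, freq in table))
```

For these made-up marks the sketch uses 6 classes of width 8, starting at 48. The printed table no longer shows the individual marks, which is the loss of detail described in the note above.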

Further Reading

https://itfeature.com/statistics/frequency-distribution-table

The Word Statistics Meaning and Use


The word statistics was first used by German scholar Gottfried Achenwall in the middle of the 18th century as the science of statecraft concerning the collection and use of data by the state.

The word statistics comes from the Latin word “Status”, the Italian word “Statista”, the German word “Statistik”, or the French word “Statistique”, each meaning a political state. It originally meant information useful to the state, such as information about the sizes of the population (human, animal, products, etc.) and armed forces.

According to the pioneer statistician Yule, the word statistics first occurred in the book “The Elements of Universal Erudition” by Baron (1770). In 1787, a wider definition was used by E.A.W. Zimmermann in “A Political Survey of the Present State of Europe”. It appeared in the Encyclopaedia Britannica in 1797 and was used by Sir John Sinclair in Britain in a series of volumes, published between 1791 and 1799, giving a statistical account of Scotland. In the 19th century, the word statistics acquired a wider meaning, covering numerical data on almost any subject whatever and also the interpretation of data through appropriate analysis.

Today, the word statistics is used in three different senses.

  • Statistics refers to “numerical facts that are arranged systematically in the form of tables or charts, etc.” In this sense, it is always used as a plural, i.e., a set of numerical information. For instance, statistics of prices, road accidents, crimes, births, educational institutions, etc.
  • The word statistics is defined as a discipline that includes procedures and techniques used to collect, process, and analyze numerical data to make inferences and to reach an appropriate decision in a situation of uncertainty (uncertainty refers to incompleteness; it does not imply ignorance). In this sense, the word statistics is used in the singular. It denotes the science of basing decisions on numerical data.
  • The word statistics refers to numerical quantities calculated from sample observations; a single quantity calculated from sample observations, such as the mean, is called a statistic. Here, the word statistics is the plural of statistic.

“We compute statistics from statistics by statistics”

In the first place, statistics is the plural of statistic; in the second place, it is used in the plural sense of data; and in the third place, it is used in the singular sense of statistical methods.

To learn about the basics of statistics, follow the link: Basic Statistics

Rules for Skewed data


The two general rules are:

  1. If the mean is less than the median, the data are skewed to the left, and
  2. If the mean is greater than the median, the data are skewed to the right.

Therefore, if the mean is much greater than the median, the data are probably skewed to the right.
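
The two rules can be applied directly by comparing the mean and the median. The sketch below is only an illustration: `skew_direction` is a hypothetical helper name, both data sets are made up, and the rules are rules of thumb that suggest a direction rather than prove it.

```python
import statistics


def skew_direction(data):
    """Apply the mean-versus-median rule of thumb to a data set."""
    mean = statistics.fmean(data)
    median = statistics.median(data)
    if mean < median:
        return "skewed to the left"
    if mean > median:
        return "skewed to the right"
    return "no skew indicated"


# Made-up data: one very large value pulls the mean above the median.
incomes = [20, 22, 25, 27, 30, 32, 35, 150]
# Made-up data: one very small value pulls the mean below the median.
scores = [1, 50, 52, 55, 57, 60]

print(skew_direction(incomes))
print(skew_direction(scores))
```

In the first data set the mean (42.625) is well above the median (28.5), so the rule points to right skew; in the second the mean is below the median (53.5), so it points to left skew.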