Sufficient Estimators and Sufficient Statistics

Introduction to Sufficient Estimator and Sufficient Statistics

An estimator $\hat{\theta}$ is sufficient if it makes so much use of the information in the sample that no other estimator can extract additional information about the population parameter being estimated from the sample.

The sample mean $\overline{X}$ utilizes all the values included in the sample, so it is a sufficient estimator of the population mean $\mu$.

Sufficient estimators are often used to develop the estimator that has minimum variance among all unbiased estimators (MVUE).

If a sufficient estimator exists, no other estimator from the sample can provide additional information about the population parameter being estimated.

If there is a sufficient estimator, then there is no need to consider any of the non-sufficient estimators. A good estimator is a function of sufficient statistics.

Let $X_1, X_2,\cdots, X_n$ be a random sample from a probability distribution with unknown parameter $\theta$. The statistic (estimator) $U=g(X_1, X_2,\cdots, X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of the observations $X_1, X_2,\cdots, X_n$ given $U$ does not depend upon the population parameter $\theta$.

Sufficient Statistics Example

The sample mean $\overline{X}$ is sufficient for the population mean $\mu$ of a normal distribution with known variance: once the sample mean is known, no further information about $\mu$ can be obtained from the sample itself. The sample median, in contrast, is not sufficient for the mean; even if the median of the sample is known, the sample itself would still provide further information about the population mean $\mu$.

Mathematical Definition of Sufficiency

Suppose that $X_1,X_2,\cdots,X_n \sim p(x;\theta)$. A statistic $T=T(X_1,X_2,\cdots,X_n)$ is sufficient for $\theta$ if the conditional distribution of $X_1,X_2,\cdots, X_n$ given $T$ does not depend upon $\theta$. Thus
\[p(x_1,x_2,\cdots,x_n|t;\theta)=p(x_1,x_2,\cdots,x_n|t)\]
This means that we can replace $X_1,X_2,\cdots,X_n$ with $T(X_1,X_2,\cdots,X_n)$ without losing information.
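
As a quick illustration of this definition (a standard textbook example added here for clarity, not part of the original discussion), let $X_1, X_2, \cdots, X_n$ be independent Bernoulli trials with success probability $\theta$ and let $T=\sum_{i=1}^{n} X_i$. For any $x_1, x_2, \cdots, x_n$ with $\sum_{i=1}^{n} x_i = t$,
\[p(x_1,x_2,\cdots,x_n|t;\theta)=\frac{\theta^{t}(1-\theta)^{n-t}}{\binom{n}{t}\theta^{t}(1-\theta)^{n-t}}=\frac{1}{\binom{n}{t}}\]
which does not involve $\theta$, so $T$ is sufficient for $\theta$.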

For further reading visit: https://en.wikipedia.org/wiki/Sufficient_statistic

Creating Frequency Distribution Table (2014)

Using descriptive statistics, we can organize the data to see its general pattern, check where the data values tend to concentrate, and expose extreme or unusual data values. Let us start learning about the Frequency Distribution Table and its construction.

A frequency distribution is a compact tabular form of data that displays the categories of observations according to their magnitudes and frequencies, such that similar or identical numerical values are grouped together. The categories are also known as groups, class intervals, or simply classes. The classes must be mutually exclusive, and the table shows the number of observations in each class. The number of values falling in a particular category is called the frequency of that category, denoted by $f$.

A Frequency Distribution Table shows a summarized grouping of data divided into mutually exclusive classes together with the number of occurrences in each class. A frequency distribution is a way of organizing raw (ungrouped or unorganized) data into grouped or organized data to show results of sales, production, income, loans, death rates, heights, weights, temperatures, etc.

The relative frequency of a category is the proportion of the observed frequency to the total frequency, obtained by dividing the observed frequency by the total frequency, and is denoted by $r.f.$ The sum of the relative frequency column should be one, except for rounding errors. Multiplying each class's relative frequency by 100 gives the percentage occurrence of that class. A relative frequency captures the relationship between a class total and the total number of observations.

The Frequency Distribution Table may be made for continuous data, discrete data, and categorical data (for both qualitative and quantitative data). It can also be used to draw some graphs such as histograms, line charts, bar charts, pie charts, frequency polygons, Pareto charts, scatter diagrams, stem-and-leaf displays, etc.

Steps of Creating Frequency Distribution Table

  1. Decide about the number of classes. The number of classes is usually between 5 and 20. Too many or too few classes might not reveal the basic shape of the data set, and such a frequency distribution will also be difficult to interpret. The number of classes may be determined approximately by the formula:
    \[\text{Number of Classes} = C = 1 + 3.3 \log_{10}(n)\]
    \[\text{or} \quad C = \sqrt{n} \quad \text{(approximately)}\]
    where $n$ is the total number of observations in the data.
  2. Calculate the range of the data ($\text{Range} = \text{Max} - \text{Min}$) by finding the minimum and maximum data values. The range will be used to determine the class interval or class width.
  3. Decide about the width of the class, denoted by $h$ and obtained by
    \[h = \frac{\text{Range}}{\text{Number of Classes}}= \frac{R}{C} \]
    Generally, the class interval or class width is the same for all classes. The classes all taken together must cover at least the distance from the lowest value (minimum) in the data set up to the highest (maximum) value. Also note that equal class intervals are preferred in frequency distribution, while unequal class intervals may be necessary in certain situations to avoid a large number of empty, or almost empty classes.
  4. Decide the individual class limits and select a suitable starting point for the first class. The starting point is arbitrary; it may be less than or equal to the minimum value. Usually, it is placed at or just before the minimum value in such a way that the midpoint (the average of the lower and upper class limits of the first class) is properly placed.
  5. Take an observation and mark a vertical bar (|) for the class it belongs to. A running tally is kept until the last observation; tally marks are usually recorded in bunches of five, with every fifth mark drawn across the previous four.
  6. Find the frequencies, relative frequencies, cumulative frequencies, etc., as required (a short R sketch of these steps is given after this list).
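
The following is a minimal sketch in R of the steps above, using simulated data (the data, the number of observations, and the variable names are illustrative assumptions, not from the original post):

```r
# Frequency distribution table in R: a minimal sketch with simulated data
set.seed(1)
x <- round(rnorm(60, mean = 50, sd = 10))   # 60 hypothetical observations

n <- length(x)
C <- ceiling(1 + 3.3 * log10(n))            # Step 1: number of classes (Sturges' rule)
R <- max(x) - min(x)                        # Step 2: range of the data
h <- ceiling(R / C)                         # Step 3: class width

breaks  <- seq(min(x), min(x) + C * h, by = h)                     # Step 4: class limits
classes <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

freq     <- table(classes)                  # Steps 5-6: class frequencies
rel_freq <- prop.table(freq)                # relative frequencies (sum to 1)
cum_freq <- cumsum(freq)                    # cumulative frequencies

data.frame(Class = names(freq),
           f     = as.vector(freq),
           r.f.  = round(as.vector(rel_freq), 3),
           c.f.  = as.vector(cum_freq))
```

The last line prints the frequency distribution table with the class limits, frequencies, relative frequencies, and cumulative frequencies in a single data frame.
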
Frequency Distribution Table

A frequency distribution is said to be skewed when its mean and median are different. The kurtosis of a frequency distribution is the concentration of scores at the mean, or how peaked the distribution appears if depicted graphically, for example, in a histogram. If the distribution is more peaked than the normal distribution it is said to be leptokurtic; if less peaked it is said to be platykurtic.
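
As a small illustration of these ideas (using simulated data, an assumption for demonstration only), the following base R sketch compares the mean and median of a right-skewed sample and computes moment-based skewness and excess kurtosis:

```r
# Comparing mean and median, and computing moment-based shape measures (base R)
set.seed(2)
x <- rexp(500, rate = 1)                # a hypothetical right-skewed sample

c(mean = mean(x), median = median(x))   # mean > median indicates positive skew

z <- (x - mean(x)) / sd(x)              # standardized values
c(skewness = mean(z^3),                 # > 0 for right-skewed data
  kurtosis = mean(z^4) - 3)             # > 0: leptokurtic, < 0: platykurtic
```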

Continuous Frequency Distribution Table

Further Reading: Frequency Distribution Table

Learn R Language: R Frequently Asked Questions

Objectives of Time Series Analysis (2014)

There are many objectives of time series analysis. One of the major objectives is to identify the underlying structure of the time series, represented by a sequence of observations, by breaking it down into its components (secular trend, seasonal variation, cyclical trend, and irregular variation).

Objectives of Time Series Analysis

The objectives of Time Series Analysis are classified as follows:

  1. Description
  2. Explanation
  3. Prediction
  4. Control

The description of the objectives of time series analysis is as follows:

Description of Time Series Analysis

The first step in the analysis is to plot the data and obtain simple descriptive measures of the main properties of the series (looking for trends, seasonal fluctuations, and so on). In the figure shown below, there is a regular seasonal pattern of price change, although this price pattern is not consistent. The graph also enables us to look for "wild" observations or outliers (values that do not appear to be consistent with the rest of the data). Graphing the time series further makes it possible to spot turning points, where an upward trend suddenly changes to a downward trend. If there is a turning point, different models may have to be fitted to the two parts of the series.
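
As a minimal sketch in R (using the built-in AirPassengers series as a stand-in for the price series discussed here, which is an assumption for illustration), the data can be plotted and broken into trend, seasonal, and irregular components:

```r
# Plot a monthly time series and decompose it into its components
plot(AirPassengers, ylab = "Passengers (thousands)",
     main = "Monthly international airline passengers")

# Classical decomposition: trend, seasonal, and irregular (random) components
components <- decompose(AirPassengers, type = "multiplicative")
plot(components)
```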

Explanation

When observations are taken on two or more variables, it may be possible to use the variation in one time series to explain the variation in another series. This may lead to a deeper understanding. A multiple regression model may be helpful in this case, as sketched below.
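
The sketch below is an illustrative assumption (simulated series, not data from the original post) showing how the variation in one series can be used to explain another via regression in R:

```r
# Explaining the variation in one series by another via regression (simulated data)
set.seed(3)
x <- arima.sim(model = list(ar = 0.7), n = 120)   # explanatory time series
y <- 2 + 0.5 * x + rnorm(120, sd = 0.3)           # response series driven by x

fit <- lm(y ~ x)                                  # regression of y on x
summary(fit)$coefficients                         # estimated relationship
```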

Prediction

Given an observed time series, one may want to predict the future values of the series. This is an important task in sales forecasting and in the analysis of economic and industrial time series. The terms prediction and forecasting are often used interchangeably.
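
As a brief sketch in R (again assuming the built-in AirPassengers series for illustration), Holt-Winters exponential smoothing can be used to predict future values of a series:

```r
# Forecasting future values of a series with Holt-Winters exponential smoothing
hw <- HoltWinters(AirPassengers)    # fit level, trend, and seasonal components
predict(hw, n.ahead = 12)           # point forecasts for the next 12 months
```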

Control

When a time series is generated to measure the quality of a manufacturing process, the aim may be to control the process. Control procedures are of several different kinds. In statistical quality control, the observations are plotted on a control chart and the controller takes action as a result of studying the charts. Alternatively, a stochastic model is fitted to the series, future values of the series are predicted, and the input process variables are then adjusted to keep the process on target.
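
A minimal base R sketch of the control-chart idea (the process measurements and 3-sigma limits are illustrative assumptions, not from the original post):

```r
# A simple control chart for a simulated process
set.seed(4)
proc <- rnorm(50, mean = 10, sd = 0.5)   # hypothetical process measurements
cl   <- mean(proc)                       # centre line
ucl  <- cl + 3 * sd(proc)                # upper control limit
lcl  <- cl - 3 * sd(proc)                # lower control limit

plot(proc, type = "b", ylim = range(proc, ucl, lcl),
     xlab = "Sample number", ylab = "Measurement",
     main = "Simple control chart")
abline(h = c(cl, ucl, lcl), lty = c(1, 2, 2))   # centre line and 3-sigma limits
```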

Figure: Seasonal effects in a time series, showing a regular seasonal pattern of price change, although this price pattern is not consistent. (Image source: http://archive.stats.govt.nz)

Learn more about Time Series on Wikipedia

Learn R Programming

Primary and Secondary Data (2014)

Data

Before learning about Primary and Secondary Data, let us first understand the term Data in Statistics.

The facts and figures which can be numerically measured are studied in statistics. Numerical measures of the same characteristic are known as observations, and a collection of observations is termed data. Data are collected by individual research workers or by organizations through sample surveys or experiments, keeping in view the objectives of the study. The data collected may be (i) Primary Data or (ii) Secondary Data.

Primary and Secondary Data in Statistics

The difference between primary and secondary data in Statistics is that Primary data is collected firsthand by a researcher (organization, person, authority, agency or party, etc.) through experiments, surveys, questionnaires, focus groups, conducting interviews, and taking (required) measurements, while the secondary data is readily available (collected by someone else) and is available to the public through publications, journals, and newspapers.

Primary and Secondary Data

Primary Data

Primary data means raw data (data that has not been processed or tailored) that has just been collected from the source and has not gone through any kind of statistical treatment such as sorting and tabulation. The term primary data may sometimes be used to refer to first-hand information.

Sources of Primary Data

The sources of primary data are primary units such as basic experimental units, individuals, and households. The following methods are usually used to collect data from primary units; the choice of method depends on the nature of the primary unit. (Published data and data collected in the past are called secondary data.)

  • Personal Investigation
    The researcher conducts the experiment or survey himself/herself and collects data from it. The collected data is generally accurate and reliable. This method of collecting primary data is feasible only in the case of small-scale laboratory or field experiments or pilot surveys and is not practicable for large-scale experiments and surveys because it takes too much time.
  • Through Investigators
    Trained (experienced) investigators are employed to collect the required data. In the case of surveys, they contact the individuals and fill in the questionnaires after asking for the required information (a questionnaire is an inquiry form comprising a number of questions designed to obtain information from the respondents). This method of collecting data is usually employed by most organizations; it gives reasonably accurate information, but it is very costly and may be time-consuming too.
  • Through Questionnaire
    The required information (data) is obtained by sending a questionnaire (printed or soft form) to the selected individuals (respondents) (by mail) who fill in the questionnaire and return it to the investigator. This method is relatively cheap as compared to the “through investigator” method but the non-response rate is very high as most of the respondents don’t bother to fill in the questionnaire and send it back to the investigator.
  • Through Local Sources
    The local representatives or agents are asked to send requisite information and provide the information based on their own experience. This method is quick but it gives rough estimates only.
  • Through Telephone
    The information may be obtained by contacting the individuals by telephone. It is quick and provides the required information accurately.
  • Through Internet
    With the introduction of information technology, people may be contacted through the Internet and individuals may be asked to provide pertinent information. Google Survey is widely used as an online method for data collection nowadays. There are many paid online survey services too.

It is important to go through the primary data and locate any inconsistent observations before it is given a statistical treatment.

Secondary Data

Secondary data is data that has already been collected by someone else; it may have been sorted and tabulated and may have undergone statistical treatment. In this sense, it is processed or tailored data.

Sources of Secondary Data

The secondary data may be available from the following sources:

  • Government Organizations
    Federal and Provincial Bureaus of Statistics, Crop Reporting Service (Agriculture Department), Census and Registration Organization, etc.
  • Semi-Government Organization
    Municipal committees, district councils, and commercial and financial institutions such as banks, etc.
  • Teaching and Research Organizations
  • Research Journals and Newspapers
  • Internet

Data Structure in R Language