Mode Measure of Central Tendency

The mode is the most frequent observation in a data set, i.e. the value (number) that appears most often. A data set may have more than one mode, or it may have no mode at all. The mode is usually used for categorical data (data measured on a nominal or ordinal scale), but this is not a requirement. The mode can also be used for interval and ratio scale data, provided that some values are repeated in the data set or the data can be classified into groups. If no two data points have the same value (no repetition in the data values), then the mode of that data set does not exist or may not be meaningful. A data set having more than one mode is called multimodal.

Example 1: Consider the following data set showing the weight (in kg) of children at the age of 10 years: 33, 30, 23, 23, 32, 21, 23, 30, 30, 22, 25, 33, 23, 23, 25. We can find the mode by tabulating the given data in the form of a frequency distribution table, whose first column is the weight of the child and whose second column is the number of times that weight appears in the data, i.e. the frequency of each weight in the first column.

Weight of 10 year child (kg) Frequency
21 1
22 1
23 5
25 2
30 3
32 1
33 2
Total 15

From the above frequency distribution table we can easily find the most frequently occurring observation (data point), which is the mode of the data set. Therefore the mode of the given data set is 23, meaning that 23 kg is the most common weight among these 10 year old children (5 of the 15 children). Note that it is not necessary to make a frequency distribution table to find the mode, but the table helps in finding the mode quickly and can also be used in further calculations, such as the percentage and cumulative percentage of each weight group.
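
The tabulation in Example 1 can also be done programmatically. The following is a minimal Python sketch (Python is not used in the original post, and the names `weights` and `modes` are illustrative assumptions). It counts each weight with `collections.Counter` and returns every value tied for the highest frequency, so it also covers multimodal data and the "no mode" case described above.

```python
from collections import Counter

# Weights (kg) of the fifteen 10 year old children from Example 1
weights = [33, 30, 23, 23, 32, 21, 23, 30, 30, 22, 25, 33, 23, 23, 25]

def modes(data):
    """Return all values that share the highest frequency, sorted.

    An empty list means "no mode": either the data are empty or,
    following the convention in the text, no value is repeated.
    """
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, freq in counts.items() if freq == highest)

# Frequency distribution table: (weight, frequency) pairs in ascending order
frequency_table = sorted(Counter(weights).items())
for weight, freq in frequency_table:
    print(weight, freq)

print("Mode:", modes(weights))
print("Mean:", sum(weights) / len(weights))
```

Returning a list rather than a single number is a design choice: it keeps the multimodal and no-mode cases explicit instead of silently picking one value.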

Example 2: Suppose we have recorded the gender of each person in a group, where M stands for male and F stands for female. The sequence of genders noted is as follows: F, F, M, F, F, M, M, M, M, F, M, F, M, F, M, M, M, F, F, M. The frequency distribution table of gender is

Gender Frequency
Male 11
Female 9
Total 20

The mode of the gender data is male, showing that male is the most frequent gender (11 of the 20 people) in this data set.

The mode can be found by simply sorting the data in ascending or descending order, so that repeated values sit next to each other and are easy to count. The mode can also be found by counting the most frequent value without sorting the data, especially when the data contain a small number of observations, though it may be difficult to remember how many times each observation occurs. Note that the mode is not affected by extreme values (outliers or influential observations).

The mode is a measure of central tendency, but it may not reflect the center of the data very well. For example, the mean of the data set in Example 1 is 26.4 kg, while the mode is 23 kg.

One should use the mode as a measure of central tendency when the data points are expected to repeat or to fall into classes. For example, in a production process each product can be classified as defective or non-defective. Similarly, student grades can be classified as A, B, C, D, etc. For such kinds of data one should use the mode as a measure of central tendency instead of the mean or median.

Example 3: Consider the following data: 3, 4, 7, 11, 15, 20, 23, 22, 26, 33, 25, 13. There is no mode in this data set, as each value occurs only once. By grouping the data in some useful and meaningful form we can still obtain a mode; for example, the grouped frequency table is

Group Values Frequency
0 to 9 3, 4, 7 3
10 to 19 11, 13, 15 3
20 to 29 20, 22, 23, 25, 26 5
30 to 39 33 1
Total 12

From this table we cannot find the most frequently appearing value, but we can say that "20 to 29" is the group in which most of the observations occur. We can say that this group contains the mode, which can be found by using the mode formula for grouped data.
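
The post does not state the mode formula for grouped data. A commonly used textbook version, assumed here, is $\text{Mode} = l + h\,\frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)}$, where $l$ is the lower limit of the modal group, $h$ is the group width, $f_m$ is the frequency of the modal group, and $f_1$ and $f_2$ are the frequencies of the groups before and after it. The Python sketch below applies this formula to the table in Example 3. It assumes equal group widths and uses the stated lower limit 20 rather than a class boundary such as 19.5; the function name `grouped_mode` is hypothetical.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Estimate the mode from a grouped frequency table (equal widths assumed)."""
    # Index of the modal group (first group with the highest frequency)
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    f_m = frequencies[m]
    # Neighbouring frequencies; a missing neighbour is treated as 0
    f_prev = frequencies[m - 1] if m > 0 else 0
    f_next = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    denominator = (f_m - f_prev) + (f_m - f_next)
    if denominator == 0:
        raise ValueError("modal group does not stand out from its neighbours")
    return lower_limits[m] + width * (f_m - f_prev) / denominator

# Groups 0 to 9, 10 to 19, 20 to 29, 30 to 39 from Example 3
estimate = grouped_mode([0, 10, 20, 30], [3, 3, 5, 1], width=10)
print(round(estimate, 2))  # 23.33
```

The estimate of about 23.33 lies inside the "20 to 29" group, as expected. It is an interpolated value, not an observation from the data set.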

 

Heteroscedasticity Regression Residual Plot

Heteroscedasticity

One of the assumptions of the classical linear regression model is that there is no heteroscedasticity, i.e. the error terms have constant variance. Under this assumption (together with the other classical assumptions), the ordinary least squares (OLS) estimators are BLUE (best linear unbiased estimators): their variances are the lowest among all linear unbiased estimators (Gauss-Markov theorem). If the assumption of constant variance does not hold, the Gauss-Markov theorem does not apply. For heteroscedastic data, regression analysis still provides unbiased estimates of the relationship between the predictors and the outcome variable, but these estimates are no longer the most efficient and their usual standard errors are unreliable.

As discussed above, heteroscedasticity occurs when the error term has non-constant variance. In this case, we can think of the disturbance for each observation as being drawn from a different distribution with a different variance. Stated equivalently, the variance of the observed value of the dependent variable around the regression line is non-constant. We can think of each observed value of the dependent variable as being drawn from a different conditional probability distribution with a different conditional variance. A general linear regression model with heteroscedastic errors can be expressed as follows:

\begin{align*}
y_i & = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\\
Var(\varepsilon_i)&=E(\varepsilon_i^2)\\
&=\sigma_i^2; \quad i=1,2,\cdots, n
\end{align*}

Note that we have an $i$ subscript attached to sigma squared. This indicates that the disturbance for each of the $n$ units is drawn from a probability distribution that has a different variance.

If the error term has non-constant variance, but all other assumptions of the classical linear regression model are satisfied, then the consequences of using the OLS estimator to obtain estimates of the population parameters are:

  • The OLS estimator is still unbiased
  • The OLS estimator is inefficient; that is, it is not BLUE
  • The estimated variances and covariances of the OLS estimates are biased and inconsistent
  • Hypothesis tests are not valid

Detection of Heteroscedasticity Regression Residual Plot

The residual for the $i$th observation, $\hat{\varepsilon}_i$, is an estimate of the unknown and unobservable error for that observation, $\varepsilon_i$. Thus the squared residuals, $\hat{\varepsilon}_i^2$, can be used as an estimate of the unknown and unobservable error variance, $\sigma_i^2=E(\varepsilon_i^2)$. You can calculate the squared residuals and then plot them against an explanatory variable that you believe might be related to the error variance. If you believe that the error variance may be related to more than one of the explanatory variables, you can plot the squared residuals against each of these variables. Alternatively, you could plot the squared residuals against the fitted values of the dependent variable obtained from the OLS estimates. Most statistical software packages have a command for these residual plots. It must be emphasized that this is not a formal test for heteroscedasticity; it only suggests whether heteroscedasticity may exist.
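
The procedure described above can be sketched in a few lines of Python. The data here are simulated under an assumption that is not part of the original post: the error standard deviation is taken to be proportional to the explanatory variable. The sketch fits a simple regression by OLS, computes the squared residuals, and compares their average for small and large values of $x$ as a numerical stand-in for the plot. All variable names are illustrative.

```python
import random

random.seed(1)  # fixed seed so the simulated data are reproducible

n = 200
x = [float(i) for i in range(1, n + 1)]
# Assumed data-generating process: the error s.d. grows in proportion to x
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.5 * xi) for xi in x]

# OLS estimates for the simple regression y = b0 + b1 * x
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals and squared residuals (the quantities one would plot against x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
squared_residuals = [e ** 2 for e in residuals]

# Informal check: average squared residual for small x versus large x
half = n // 2
mean_sq_low = sum(squared_residuals[:half]) / half
mean_sq_high = sum(squared_residuals[half:]) / (n - half)
print("mean squared residual, lower half of x:", mean_sq_low)
print("mean squared residual, upper half of x:", mean_sq_high)
```

Plotting `squared_residuals` against `x` (or against the fitted values) with any plotting tool would show the same widening spread. Like the plot, this comparison is only an informal indication and not a formal test.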

Below are residual plots showing three typical patterns. The first plot shows a random pattern, which indicates a good fit for a linear model. The other two plots show non-random patterns (U-shaped and inverted U), suggesting that a non-linear model would fit better than a linear regression model.

[Figures: Heteroscedasticity Regression Residual Plot 1, Plot 2 and Plot 3]

 


Matrix in Matlab: Creating and manipulating Matrices in Matlab

A matrix (a two-dimensional, rectangular data structure used to store multiple elements of data in an easily accessible format) is the most basic data structure in Matlab. The elements of a matrix can be numbers, characters, logical states of yes or no (true or false), or other Matlab structure types. Matlab also supports data structures with more than two dimensions, referred to as arrays in Matlab. Matlab is a matrix-based computing environment in which all of the data entered into Matlab is stored as a matrix or array.

It is assumed in this Matlab tutorial that you know some of the basics of how to define and manipulate vectors in Matlab. We will discuss here

  1. Defining Matrices
  2. Matrix Operations
  3. Matrix Functions

1)  Defining/ Creating Matrices

Defining a matrix in Matlab is similar to defining a vector in Matlab. To define a matrix, treat it as a column of row vectors.
>> A=[1 2 3; 4 5 6; 7 8 9]

Note that spaces between the numbers separate the elements within a row, and a semi-colon separates the rows of matrix A. Square brackets are used to construct matrices. Individual matrix and vector entries can be referenced with indices inside parentheses. For example, A(2,3) represents the element in the second row and third column of matrix A.

Some examples of creating matrices and extracting elements:
>> A=rand(6, 6)
>> B=rand(6, 4)

>> A(1:4, 3)    % column vector consisting of the first four entries of the third column of A
>> A(:, 3)      % the third column of A
>> A(1:4, :)    % the first four rows of A (all columns)

Convenient matrix building Functions

eye –> identity
zeros –> matrix of zeros
ones –> matrix of ones
diag –> create or extract diagonal elements of matrix
triu –> upper triangular part of matrix
tril –> lower triangular part of matrix
rand –> randomly generated matrix
hilb –> Hilbert matrix
magic –> magic square

2)  Matrix Operations

Many of the mathematical operations can be applied on matrices and vectors in Matlab such as addition, subtraction, multiplication and division of matrices etc.

Matrix or Vector Multiplication

If x and y are both column vectors, then x'*y is their inner (or dot) product (for vectors of equal length) and x*y' is their outer product.

Matrix division

Let A be an invertible square matrix and b a compatible vector; then
x = A\b is the solution of A * x = b (b is a column vector)
x = b/A is the solution of x * A = b (b is a row vector)

These are called the backslash (\) and slash (/) operators, also referred to as mldivide and mrdivide, respectively.

3)  Matrix Functions

Matlab has many functions for computing with matrices. Some important matrix functions in Matlab are

eig –> eigenvalues and eigenvectors
eigs –> like eig, for large sparse matrices
chol –> Cholesky factorization
svd –> singular value decomposition
svds –> like svd, for large sparse matrices
inv –> inverse of matrix
lu –> LU factorization
qr –> QR factorization
hess –> Hessenberg form
schur –> Schur decomposition
rref –> reduced row echelon form
expm –> matrix exponential
sqrtm –> matrix square root
poly –> characteristic polynomial
det –> determinant of matrix
size –> size of an array
length –> length of a vector
rank –> rank of matrix

Sufficient statistics and Sufficient Estimators

An estimator $\hat{\theta}$ is sufficient if it makes so much use of the information in the sample that no other estimator could extract additional information from the sample about the population parameter being estimated.

The sample mean $\overline{X}$ utilizes all the values included in the sample; for a normal population with known variance, for instance, it is a sufficient estimator of the population mean $\mu$.

Sufficient estimators are often used to develop the estimator that has minimum variance among all unbiased estimators (MVUE).

If a sufficient estimator exists, no other estimator from the sample can provide additional information about the population parameter being estimated.

If there is a sufficient estimator, then there is no need to consider any of the non-sufficient estimators. Good estimators are functions of sufficient statistics.

Let $X_1,X_2,\cdots,X_n$ be a random sample from a probability distribution with unknown parameter $\theta$. The statistic (estimator) $U=g(X_1,X_2,\cdots,X_n)$ is sufficient for $\theta$ if the conditional distribution of the sample $X_1,X_2,\cdots,X_n$, given the observed value of $U$, does not depend upon the population parameter $\theta$.

Sufficient Statistic Example

The sample mean $\overline{X}$ is sufficient for the population mean $\mu$ of a normal distribution with known variance. Once the sample mean is known, no further information about the population mean $\mu$ can be obtained from the sample itself. The median, in contrast, is not sufficient for the mean: even if the median of the sample is known, knowing the sample itself would provide further information about the population mean $\mu$.

Mathematical Definition of Sufficiency

Suppose that $X_1,X_2,\cdots,X_n \sim p(x;\theta)$. $T$ is sufficient for $\theta$ if the conditional distribution of $X_1,X_2,\cdots, X_n|T$ does not depend upon $\theta$. Thus
\[p(x_1,x_2,\cdots,x_n|t;\theta)=p(x_1,x_2,\cdots,x_n|t)\]
This means that we can replace $X_1,X_2,\cdots,X_n$ with $T(X_1,X_2,\cdots,X_n)$ without losing information.
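
The definition can be checked numerically on a small case that is not in the original post and is assumed here for illustration: a Bernoulli($\theta$) sample with $T=\sum X_i$. The joint probability of a particular sequence is $\theta^t(1-\theta)^{n-t}$ and $P(T=t)=\binom{n}{t}\theta^t(1-\theta)^{n-t}$, so the conditional probability of the sequence given $T=t$ is $1/\binom{n}{t}$, whatever the value of $\theta$. The Python sketch below evaluates this ratio for several values of $\theta$; the function name is hypothetical.

```python
from math import comb, isclose

def conditional_prob_given_total(x, theta):
    """P(X_1..X_n = x | T = sum(x)) for an i.i.d. Bernoulli(theta) sample."""
    n, t = len(x), sum(x)
    joint = theta ** t * (1 - theta) ** (n - t)  # P(X = x; theta)
    prob_total = comb(n, t) * joint              # P(T = t; theta), binomial
    return joint / prob_total

sample = [1, 0, 1, 1, 0]  # n = 5 observations, T = 3 successes
for theta in (0.2, 0.5, 0.7):
    print(theta, conditional_prob_given_total(sample, theta))
```

Each printed value is (up to rounding) $1/\binom{5}{3}=0.1$, so once $T$ is known the sample carries no further information about $\theta$.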

For further reading visit: https://en.wikipedia.org/wiki/Sufficient_statistic


Component of Time Series Data

Traditional methods of time series analysis are concerned with decomposing a series into a trend, a seasonal variation and other irregular fluctuations. Although this approach is not always the best, it is still useful (Kendall and Stuart, 1996).

The components of which a time series is composed are called the components of time series data. There are four basic components of time series data, described below.

Different Sources of Variation are:

  1. Seasonal effect (Seasonal Variation or Seasonal Fluctuations)
    Many time series exhibit a seasonal variation with an annual period, such as sales and temperature readings. This type of variation is easy to understand and can be easily measured or removed from the data to give de-seasonalized data. Seasonal fluctuations describe any regular variation (fluctuation) with a period of less than one year; for example, the cost of various types of fruits and vegetables, clothes, unemployment figures, average daily rainfall, the increase in the sale of tea in winter, and the increase in the sale of ice cream in summer all show seasonal variations. Changes which repeat themselves within a fixed period are also called seasonal variations, for example traffic on roads in the morning and evening hours, sales at festivals like EID, and the increase in the number of passengers at weekends. Seasonal variations are caused by climate, social customs, religious activities, etc.
  2. Other Cyclic Changes (Cyclical Variation or Cyclic Fluctuations)
    Some time series exhibit cyclical variation at a fixed period due to some other physical cause, such as the daily variation in temperature. Cyclical variation is a non-seasonal component which varies in a recognizable cycle. Some time series exhibit oscillations which do not have a fixed period but are predictable to some extent, for example economic data affected by business cycles with a period varying between about 5 and 7 years. In weekly or monthly data, the cyclical component may describe any regular variation (fluctuation) in the time series data. Cyclical variations are periodic in nature and repeat themselves like the business cycle, which has four phases: (i) Peak (ii) Recession (iii) Trough/Depression (iv) Expansion.
  3. Trend (Secular Trend or Long Term Variation)
    A trend is a longer term change. Here we take into account the number of observations available and make a subjective assessment of what is long term. To understand the meaning of long term, note for example that climate variables sometimes exhibit cyclic variation over a very long time period, such as 50 years. If one had just 20 years of data, this long term oscillation would appear to be a trend, but if several hundred years of data were available, the long term oscillation would be visible. These movements are systematic in nature: they are broad and steady, showing a slow rise or fall in the same direction. The trend may be linear or non-linear (curvilinear). Some examples of secular trend are: increase in prices, increase in pollution, increase in the need for wheat, increase in the literacy rate, and decrease in deaths due to advances in science. Taking averages over a certain period is a simple way of detecting a trend in seasonal data. A change in the averages with time is evidence of a trend in the given series, though there are more formal tests for detecting a trend in time series.
  4. Other Irregular Variation (Irregular Fluctuations)
    When trend and cyclical variations are removed from a set of time series data, a residual series is left, which may or may not be random. Various techniques for analyzing series of this type examine whether the irregular variation may be explained in terms of probability models such as moving average or autoregressive models, i.e. we can see whether any cyclical variation is still left in the residuals. Variations that occur due to sudden causes are called residual variation (irregular variation, or accidental or erratic fluctuations) and are unpredictable, for example a rise in the price of steel due to a strike in the factory, an accident due to brake failure, a flood, an earthquake, a war, etc.
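
The remark under the trend component, that taking averages over a certain period is a simple way of detecting a trend in seasonal data, can be illustrated with a short Python sketch. The series below is hypothetical (a linear trend plus a seasonal pattern that sums to zero over each year of four quarters), and the function name `period_averages` is an assumption made for this example. Averaging over each complete year cancels the seasonal effect, so any change in the yearly averages reflects the trend.

```python
# Hypothetical quarterly series: linear trend (2 units per quarter) plus a
# seasonal pattern that sums to zero over each year of four quarters.
period = 4
seasonal_pattern = [5, -3, -5, 3]
series = [2 * t + seasonal_pattern[t % period] for t in range(12)]

def period_averages(values, period):
    """Average each complete block of `period` consecutive observations."""
    n_blocks = len(values) // period
    return [
        sum(values[k * period:(k + 1) * period]) / period
        for k in range(n_blocks)
    ]

yearly_means = period_averages(series, period)
changes = [later - earlier for earlier, later in zip(yearly_means, yearly_means[1:])]

print(yearly_means)  # [3.0, 11.0, 19.0]
print(changes)       # [8.0, 8.0]
```

The yearly averages rise by a constant 8 units (2 units per quarter times 4 quarters), which is evidence of an upward linear trend even though individual quarters move up and down with the season.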