Median Measure of Central Tendency

The median is the middle value in a data set when all of the values (observations) are arranged in either ascending or descending order of magnitude. The median is a measure of central tendency that divides the data set into two halves, with 50% of the observations below the median value and 50% above it. If a data set has an odd number of observations (data points), the median is the single middle value after sorting the data.

Example: Consider the following data set: 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3.
To find the median of the given data set, first sort it (in either ascending or descending order), that is,
0, 1, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The middle value of the sorted data is 5, which is the median of the given data set.

When the number of observations in a data set is even, the median is the average of the two middle values in the sorted data.

Example: Consider the following data set: 5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3, 2.
To find the median, first sort the data and then locate the two middle values, that is,
0, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 8, 8, 9. The two middle values are 4 and 5, so the median is their average, i.e. 4.5 in this case.

The median is less affected by extreme values in the data set, so it is the preferred measure of central tendency when the data set is skewed or not symmetrical.

For a large data set it is relatively difficult to locate the median by inspecting the sorted data, so it is helpful to find the position of the median using a formula. For an odd number of observations the formula is
$\begin{aligned}
Median &=\left(\frac{n+1}{2}\right)\text{th value}\\
&=\left(\frac{13+1}{2}\right)\text{th}\\
&=\left(\frac{14}{2}\right)\text{th}=7\text{th value}
\end{aligned}$

The 7th value in the sorted data is the median of the given data set.

The median formula for an even number of observations is
$\begin{aligned}
Median&=\frac{1}{2}\left(\frac{n}{2}\text{th value} + \left(\frac{n}{2}+1\right)\text{th value}\right)\\
&=\frac{1}{2}\left(\frac{14}{2}\text{th} + \left(\frac{14}{2}+1\right)\text{th}\right)\\
&=\frac{1}{2}(7\text{th value} + 8\text{th value})\\
&=\frac{1}{2}(4 + 5)= 4.5
\end{aligned}$
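
As a quick numerical check, the same results can be obtained with R's built-in median() function; a minimal sketch using the two example data sets above:

# data set with an odd number of observations: the middle value of the sorted data
x.odd <- c(5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3)
median(x.odd)    # 5

# data set with an even number of observations: the average of the two middle values
x.even <- c(5, 9, 8, 4, 3, 1, 0, 8, 5, 3, 5, 6, 3, 2)
median(x.even)   # 4.5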

Note that the median, as a measure of central tendency, cannot be found for nominal (unordered categorical) data, since such data cannot be ranked.

 


Pseudo Random Process

Random Number

Every random experiment results in two or more outcomes.
A variable whose values depend upon the outcomes of a random experiment is called a random variable, denoted by capital letters X, Y, or Z, with its values denoted by the corresponding small letters x, y, or z.

Random Numbers and their Generation

Random numbers are a sequence of digits from the set {0, 1, 2, ⋯, 9} such that, at each position in the sequence, each digit has the same probability 0.1 of being selected, irrespective of the digits already generated.

The simplest ways of obtaining such numbers are games of chance such as dice, coins, cards, or repeatedly drawing numbered slips out of a jar. The digits are usually grouped purely for convenience of reading, but generating long runs of digits this way quickly becomes very tedious. Fortunately, tables of random digits are now widely available.
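
For illustration (a minimal sketch, not part of the original text), such digits can also be drawn on a computer with R's base sample() function, which selects each digit from 0 to 9 with equal probability 0.1:

set.seed(123)                           # arbitrary seed, only for reproducibility
sample(0:9, size = 20, replace = TRUE)  # 20 random digits from {0, 1, ..., 9}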

Pseudo Random Process

A pseudo random process is a process that appears to be random but actually is not. Pseudo random sequences typically exhibit statistical randomness while being generated by an entirely deterministic causal process. Such a process is easier to produce than a genuinely random one and has the benefit that it can be used again and again to produce exactly the same numbers, which is useful for testing and debugging software.

For computer implementations that provide such sequences of digits easily and quickly, the most common methods are called pseudo random techniques.

In such techniques the digits will eventually re-appear in the same order (a cycle); for a good technique the cycle might be tens of thousands of digits long.
Of course, pseudo random digits are not truly random. In fact, they are completely deterministic, but they do exhibit most of the properties of random digits. Generally, these methods involve a recursive formula, e.g.

\[x_{n+1}= (a x_n + b) \bmod m, \quad n=0, 1, 2, \ldots\]

Here $a$, $b$ and $m$ are suitably chosen integer constants and the seed $x_0$ (the starting value, i.e. $n = 0$) is an integer. (Note that mod $m$ means: divide the result of the formula by $m$ and keep the remainder as the random number.)

Use of this formula gives rise to a sequence of integers, each of which lies in the range 0 to m – 1.

Example

Let a = 13, b = 5, and m = 1000. Generate 500 random numbers.

Solution

\[x_{n+1}=(13 x_n + 5) \bmod 1000, \quad n=0,1,2,\ldots\]

Let the seed be $x_0=5$; then for $n=0$ we have

\begin{align*}
x_1&=(13 \times 5 +5) \bmod 1000=70\\
x_2&=(13 \times 70 + 5) \bmod 1000=915
\end{align*}

Application of Random Numbers

Random numbers have wide applicability in simulation techniques (also called Monte Carlo methods), which have been applied to many problems in the various sciences. They are useful in situations where direct experimentation is not possible, where the cost of conducting an experiment is very high, or where the experiment would take too much time.

R Code to Generate Random Numbers

# store the pseudo random output
rand.num <- numeric(500)   # vector to hold the 500 generated numbers
rand.seed <- 5             # seed x0
for(i in 1:500){
    # x_{n+1} = (13 * x_n + 5) mod 1000; reduce at each step so the seed does not overflow
    rand.seed <- (13*rand.seed + 5) %% 1000
    rand.num[i] <- rand.seed
}
rand.num

 



Mode Measure of Central Tendency

The mode is the most frequent observation in a data set, i.e. the value (number) that appears most often in the data. There may be more than one mode, or there may be no mode at all in a data set. The mode is usually used for categorical data (data on a nominal or ordinal scale), but this is not a requirement; it can also be used for quantitative (interval and ratio scale) data, provided there are repeated values in the data set or the data can be classified into groups. If none of the data points share the same value (no repetition in the data values), then the mode of that data set does not exist or may not be meaningful. A data set having more than one mode is called multimodal.

Example 1: Consider the following data set showing the weight (in kg) of children at the age of 10 years: 33, 30, 23, 23, 32, 21, 23, 30, 30, 22, 25, 33, 23, 23, 25. We can find the mode by tabulating the given data in the form of a frequency distribution table, whose first column is the weight of the child and whose second column is the number of times that weight appears in the data, i.e. the frequency of each weight in the first column.

Weight of 10-year-old child (kg)    Frequency
21                                   1
22                                   1
23                                   5
25                                   2
30                                   3
32                                   1
33                                   2
Total                               15

From the above frequency distribution table we can easily find the most frequently occurring observation (data point), which is the mode of the data set. Therefore the mode of the given data set is 23, meaning that most of the 10-year-old children weigh 23 kg. Note that it is not necessary to make a frequency distribution table to find the mode, but it helps in finding the mode quickly, and the frequency table can also be used in further calculations such as the percentage and cumulative percentage of each weight group.
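
Base R has no built-in function for the mode, but for a data set like the one in Example 1 it can be found from a frequency table; a minimal sketch (the vector name weights is arbitrary):

# weights of the 10-year-old children from Example 1
weights <- c(33, 30, 23, 23, 32, 21, 23, 30, 30, 22, 25, 33, 23, 23, 25)

freq <- table(weights)          # frequency distribution table
freq
names(freq)[which.max(freq)]    # value with the largest frequency, i.e. the mode "23"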

Example 2: Suppose we have information about the gender of a group of people, where M stands for male and F stands for female. The recorded sequence of genders is as follows: F, F, M, F, F, M, M, M, M, F, M, F, M, F, M, M, M, F, F, M. The frequency distribution table of gender is

Gender    Frequency
Male      11
Female     9
Total     20

The mode of the gender data is male, showing that the majority of the people in this data set are male.

The mode can be found by simply sorting the data in ascending or descending order and counting the repeated values. It can also be found without sorting, especially when the data set contains a small number of observations, though it may then be difficult to keep track of how many times each observation occurs. Note that the mode is not affected by extreme values (outliers or influential observations).

The mode is also a measure of central tendency, but it may not reflect the center of the data very well. For example, the mean of the data set in Example 1 is 26.4 kg, while the mode is 23 kg.

One should use the mode as the measure of central tendency if the data points are expected to repeat or to have some natural classification. For example, in a production process a product can be classified as defective or non-defective. Similarly, student grades can be classified as A, B, C, D, etc. For such data one should use the mode as a measure of central tendency instead of the mean or median.

Example 3: Consider the following data: 3, 4, 7, 11, 15, 20, 23, 22, 26, 33, 25, 13. There is no mode for this data, as each value occurs only once. By grouping the data in some useful and meaningful form we can still obtain a modal group; for example, the grouped frequency table is

Group      Values                 Frequency
0 to 9     3, 4, 7                3
10 to 19   11, 13, 15             3
20 to 29   20, 22, 23, 25, 26     5
30 to 39   33                     1
Total                            12

From this table we cannot identify a single most frequently appearing value, but we can say that "20 to 29" is the group in which most of the observations fall. This group contains the mode, which can be found by using the mode formula for grouped data.
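
For reference, one common form of the grouped-data mode formula is sketched below; here $l$ is the lower class boundary of the modal class, $f_m$ its frequency, $f_1$ and $f_2$ the frequencies of the classes before and after it, and $h$ the class width. Taking $l = 19.5$ and $h = 10$ for the modal group "20 to 29" (an assumption based on the usual class-boundary convention):

\begin{align*}
Mode &= l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h\\
&= 19.5 + \frac{5 - 3}{(5 - 3) + (5 - 1)} \times 10 \approx 22.8
\end{align*}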

 


Heteroscedasticity Regression Residual Plot

Heteroscedasticity

One of the assumptions of the classical linear regression model is that there is no heteroscedasticity (the error term has constant variance), which ensures that the ordinary least squares (OLS) estimators are BLUE (best linear unbiased estimators) and that their variances are the lowest among all unbiased estimators (Gauss-Markov Theorem). If the assumption of constant variance does not hold, then the Gauss-Markov Theorem does not apply. For heteroscedastic data, regression analysis still provides unbiased estimates of the relationship between the predictors and the outcome variable, but the usual standard errors are unreliable.

As discussed, heteroscedasticity occurs when the error term has non-constant variance. In this case, we can think of the disturbance for each observation as being drawn from a different distribution with a different variance. Stated equivalently, the variance of the observed value of the dependent variable around the regression line is non-constant. We can think of each observed value of the dependent variable as being drawn from a different conditional probability distribution with a different conditional variance. A general linear regression model with the assumption of heteroscedasticity can be expressed as follows:

\begin{align*}
y_i & = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i\\
Var(\varepsilon_i)&=E(\varepsilon_i^2)=\sigma_i^2, \quad i=1,2,\cdots, n
\end{align*}

Note that we have an $i$ subscript attached to $\sigma^2$. This indicates that the disturbance for each of the $n$ units is drawn from a probability distribution that has a different variance.

If the error term has non-constant variance, but all other assumptions of the classical linear regression model are satisfied, then the consequences of using the OLS estimator to obtain estimates of the population parameters are:

  • The OLS estimator is still unbiased
  • The OLS estimator is inefficient; that is, it is not BLUE
  • The estimated variances and covariances of the OLS estimates are biased and inconsistent
  • Hypothesis tests are not valid

Detection of Heteroscedasticity Regression Residual Plot

The residual for the $i$th observation, $\hat{\varepsilon}_i$, is an unbiased estimate of the unknown and unobservable error for that observation, $\varepsilon_i$. Thus the squared residual, $\hat{\varepsilon}_i^2$, can be used as an estimate of the unknown and unobservable error variance, $\sigma_i^2=E(\varepsilon_i^2)$. You can calculate the squared residuals and then plot them against an explanatory variable that you believe might be related to the error variance. If you believe that the error variance may be related to more than one of the explanatory variables, you can plot the squared residuals against each of these variables. Alternatively, you can plot the squared residuals against the fitted values of the dependent variable obtained from the OLS estimates. Most statistical software has a command for such residual plots. It must be emphasized that this is not a formal test for heteroscedasticity; it only suggests whether heteroscedasticity may exist.
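
As an illustration of such a plot (a sketch only; the simulated data, variable names and settings below are not from the original article), the following R code generates data whose error variance grows with the explanatory variable, fits an OLS regression, and plots the squared residuals against the fitted values:

set.seed(1)
n <- 200
x <- runif(n, 1, 10)
e <- rnorm(n, mean = 0, sd = 0.5 * x)   # error standard deviation increases with x
y <- 2 + 3 * x + e

fit <- lm(y ~ x)              # OLS fit
res.sq <- residuals(fit)^2    # squared residuals

# an increasing (fan-shaped) spread in this plot suggests heteroscedasticity
plot(fitted(fit), res.sq,
     xlab = "Fitted values", ylab = "Squared residuals")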

Below are residual plots showing three typical patterns. The first plot shows a random pattern, which indicates a good fit for a linear model. The other two patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model than for a linear regression model.

[Figures: Heteroscedasticity regression residual plots 1, 2 and 3]

 


Matrix in Matlab: Creating and manipulating Matrices in Matlab

A matrix (a two-dimensional, rectangular data structure used to store multiple elements of data in an easily accessible format) is the most basic data structure in Matlab. The elements of a matrix can be numbers, characters, logical states (true or false) or other Matlab structure types. Matlab also supports data structures with more than two dimensions, referred to as arrays in Matlab. Matlab is a matrix-based computing environment in which all of the data entered into Matlab is stored as a matrix.

This Matlab tutorial assumes that you know some of the basics of defining and manipulating vectors in Matlab. We will discuss here:

  1. Defining Matrices
  2. Matrix Operations
  3. Matrix Functions

1)  Defining/ Creating Matrices

Defining a matrix in Matlab is similar to defining a vector. To define a matrix, treat it as a column of row vectors:
>> A=[1 2 3; 4 5 6; 7 8 9]

Note that spaces between numbers are used to separate the elements of the matrix and semicolons are used to separate the rows of matrix A. The square brackets are used to construct matrices. Individual matrix and vector entries can be referenced within parentheses. For example, A(2,3) refers to the element in the second row and third column of matrix A.

[Figure: Matrix in Matlab]

Some examples of creating matrices and extracting elements:
>> A=rand(6, 6)
>> B=rand(6, 4)

>>A(1:4, 3) is a column vector consisting of the first four entries of the third column of A
>>A(:, 3) is the third column of A
>>A(1:4, :) contains the first four rows of matrix A

Convenient matrix building Functions

eye –> identity
zeros –> matrix of zeros
ones –> matrix of ones
diag –> create or extract diagonal elements of matrix
triu –> upper triangular part of matrix
tril –> lower triangular part of matrix
rand –> randomly generated matrix
hilb –> Hilbert matrix
magic –> magic square

2)  Matrix Operations

Many mathematical operations can be applied to matrices and vectors in Matlab, such as addition, subtraction, multiplication and division of matrices.

Matrix or Vector Multiplication

If x and y are both column vectors, then x'*y is their inner (or dot) product and x*y' is their outer product.

Matrix division

Let A be an invertible square matrix and b a compatible column vector; then
x = A\b is the solution of A * x = b
x = b/A is the solution of x * A = b

These are the backslash (\) and slash (/) operators, also referred to as mldivide and mrdivide.

3)  Matrix Functions

Matlab has many functions that create and operate on matrices. Some important matrix functions used in Matlab are

eig –> eigenvalues and eigenvectors
eigs –> like eig, for large sparse matrices
chol –> Cholesky factorization
svd –> singular value decomposition
svds –> like svd, for large sparse matrices
inv –> inverse of matrix
lu –> LU factorization
qr –> QR factorization
hess –> Hessenberg form
schur –> Schur decomposition
rref –> reduced row echelon form
expm –> matrix exponential
sqrtm –> matrix square root
poly –> characteristic polynomial
det –> determinant of matrix
size –> size of an array
length –> length of a vector
rank –> rank of matrix
