# p-value: Interpretation, Definition, Introduction, and Examples

The p-value, also known as the observed (or exact) level of significance, helps determine the significance of results from a hypothesis test. The p-value is the probability of obtaining the observed sample results, or more extreme results, when the null hypothesis (a statement about the population) is actually true.

In technical terms, the p-value can be defined as the lowest level of significance at which a null hypothesis can be rejected. If the p-value is very small, or less than a threshold value (the chosen level of significance), then the observed data are considered inconsistent with the assumption that the null hypothesis is true; the null hypothesis is therefore rejected and the alternative hypothesis is accepted. The p-value is a number between 0 and 1, and in the literature it is usually interpreted in the following way:

• A small p-value (< 0.05) indicates strong evidence against the null hypothesis.
• A large p-value (> 0.05) indicates weak evidence against the null hypothesis.
• A p-value very close to the cutoff (say, 0.05) is considered marginal.

If the p-value of a certain test statistic is 0.002, it means that, if the null hypothesis were true, results as extreme as those observed would occur only about 0.2 percent of the time, that is, about 2 in 1,000. For a given sample size, as $|t|$ (or any test statistic) increases, the p-value decreases, so one can reject the null hypothesis with increasing confidence.
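
As a concrete illustration (not part of the original example), the following Python sketch uses SciPy to compute the p-value of a one-sample t-test on made-up data and compares it with a chosen significance level:

```python
# A minimal sketch of computing a p-value with SciPy's one-sample t-test.
# The data values here are made up for illustration.
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.8, 4.7, 5.2, 5.0, 5.6]

# H0: the population mean is 5.0; H1: it is not (two-sided test).
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

# Compare against the chosen level of significance.
alpha = 0.05
if p_value < alpha:
    print("Reject H0: data are inconsistent with a mean of 5.0")
else:
    print("Fail to reject H0")
```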

If the level of significance ($\alpha$, the Type-I error rate) is set equal to the p-value of a test statistic, then there is no conflict between the two values. In other words, it may be better to give up fixing the significance level arbitrarily at some conventional level (such as 5% or 10%) and simply report the p-value of the test statistic. For example, if the p-value of a test statistic is about 0.145, one can reject the null hypothesis at this exact significance level, as there is nothing wrong with taking a 14.5% chance of being wrong when rejecting the null hypothesis.

The p-value addresses only one question: how likely are the observed data, assuming the null hypothesis is true? It does not measure support for the alternative hypothesis.

Most authors refer to p-value < 0.05 as statistically significant and p-value < 0.001 as highly statistically significant (such a result would occur by chance less than one time in a thousand if the null hypothesis were true).

The p-value is often misinterpreted as the probability of making a mistake by rejecting a true null hypothesis (a Type-I error). The p-value cannot be such an error rate because:

The p-value is calculated based on the assumption that the null hypothesis is true and that any difference in the sample arose by random chance. Consequently, the p-value cannot tell us the probability that the null hypothesis is true or false, because it is 100% true from the perspective of the calculation.
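
A small simulation makes this point concrete. The sketch below (illustrative, with simulated data rather than anything from the original text) shows that when the null hypothesis really is true, p-values are uniformly distributed, so about 5% of tests still yield p < 0.05 purely by chance:

```python
# Simulation sketch: when H0 is true, p-values are uniform on (0, 1),
# so a "significant" p < 0.05 still occurs about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests = 10_000
p_values = []
for _ in range(n_tests):
    # Sample drawn from a population where H0 (mean = 0) is exactly true.
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    p_values.append(stats.ttest_1samp(sample, popmean=0.0).pvalue)

p_values = np.array(p_values)
print(f"Fraction of p-values below 0.05: {np.mean(p_values < 0.05):.3f}")
```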


# Degrees of Freedom

The degrees of freedom (df), or number of degrees of freedom, refers to the number of observations in a sample minus the number of (population) parameters being estimated from the sample data. This means that the degrees of freedom are a function of both the sample size and the number of parameters estimated. In other words, it is the number of independent observations out of a total of $n$ observations.

In statistics, the degrees of freedom can be thought of as the number of values in a study that are free to vary. For a real-life example of degrees of freedom: if you have to take ten different courses to graduate, and only ten different courses are offered, then you have nine degrees of freedom. For nine semesters you can choose which class to take; in the tenth semester only one class is left, so there is no choice if you want to graduate. This is the concept of degrees of freedom (df) in statistics.

Let a random sample of size $n$ be taken from a population with unknown mean $\mu$, and let $\overline{X}$ be the sample mean. The deviations from the sample mean always sum to zero, i.e., $\sum_{i=1}^n (X_i-\overline{X})=0$. This places a constraint on the deviations $X_i-\overline{X}$ used when calculating the sample variance

$S^2 =\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{n-1}$

This constraint (restriction) implies that any $n-1$ of the deviations completely determine the remaining one. The $n$ deviations (and hence also the sum of their squares and the sample variance $S^2$) therefore have $n-1$ degrees of freedom.
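
The following short Python sketch (made-up numbers) shows the zero-sum constraint on the deviations and the resulting $n-1$ divisor in the sample variance:

```python
# Sketch: the n-1 divisor in the sample variance reflects the single
# constraint that deviations from the sample mean sum to zero.
import numpy as np

x = np.array([4.0, 7.0, 6.0, 5.0, 8.0])
deviations = x - x.mean()
print(deviations.sum())          # always 0 (up to rounding): one constraint

# NumPy's ddof ("delta degrees of freedom") sets the divisor to n - ddof.
print(np.var(x, ddof=1))         # sample variance S^2, divisor n-1
print((deviations**2).sum() / (len(x) - 1))  # same value, by hand
```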

A common way to think of degrees of freedom is as the number of independent pieces of information available to estimate another piece of information. More concretely, the number of degrees of freedom is the number of independent observations in a sample that are available to estimate a parameter of the population from which the sample is drawn. For example, if we have two observations, then when calculating the mean we have two independent pieces of information; however, when calculating the variance we have only one, since the two observations are equally distant from the sample mean.

Single sample: For $n$ observations, one parameter (the mean) needs to be estimated, which leaves $n-1$ degrees of freedom for estimating variability (dispersion).

Two samples: There is a total of $n_1+n_2$ observations ($n_1$ in group 1 and $n_2$ in group 2) and two means need to be estimated, which leaves $n_1+n_2-2$ degrees of freedom for estimating variability.

Regression with $p$ predictors: There are $n$ observations, and $p+1$ parameters need to be estimated (a regression coefficient for each predictor plus the intercept). This leaves $n-p-1$ degrees of freedom for error, which accounts for the error degrees of freedom in the ANOVA table.
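
A quick check of this rule, using simulated data and Python's statsmodels (an illustrative sketch, not part of the original text):

```python
# Sketch (made-up data): with n observations and p predictors plus an
# intercept, the residual degrees of freedom are n - p - 1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 2 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.df_resid)  # 46, i.e. n - p - 1 = 50 - 3 - 1
```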

Several commonly encountered statistical distributions (Student's t, chi-squared, F) have parameters that are referred to as degrees of freedom. This terminology simply reflects that in many applications where these distributions occur, the parameter corresponds to the degrees of freedom of an underlying random vector. If $X_i,\; i=1,2,\cdots, n$ are independent normal $(\mu, \sigma^2)$ random variables, the statistic $\frac{\sum_{i=1}^n (X_i-\overline{X})^2}{\sigma^2}$ follows a chi-squared distribution with $n-1$ degrees of freedom. Here the degrees of freedom arise from the residual sum of squares in the numerator and, in turn, from the $n-1$ degrees of freedom of the underlying residual vector $\{X_i-\overline{X}\}$.
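
This can be verified by simulation. The sketch below (with made-up parameters) shows that the statistic has a mean close to $n-1$, as a chi-squared variable with $n-1$ degrees of freedom should:

```python
# Simulation sketch: the sum of squared deviations about the sample mean,
# divided by sigma^2, behaves like a chi-squared variable with n-1 df.
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 10, 2.0
reps = 20_000
values = np.empty(reps)
for i in range(reps):
    x = rng.normal(loc=5.0, scale=sigma, size=n)
    values[i] = np.sum((x - x.mean())**2) / sigma**2

# The mean of a chi-squared(n-1) variable is n-1 = 9.
print(values.mean())
```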


# Binary Logistic Regression Minitab Tutorial

Binary Logistic Regression is used to perform logistic regression on a binary response (dependent) variable, that is, a variable that has only two possible values, such as the presence or absence of a particular disease. Such a variable is known as a dichotomous variable, i.e., binary in nature.

Binary Logistic Regression can classify observations into one of two categories. These classifications can give fewer classification errors than discriminant analysis in some cases.

The default model contains the variables that you enter in Continuous predictors and Categorical predictors. You can also add interaction and/or polynomial terms by using the tools available in the Model sub-dialog box.

Minitab stores the last model that you fit for each response variable. These stored models can be used to quickly generate predictions, contour plots, surface plots, overlaid contour plots, factorial plots, and optimized responses.

To perform a Binary Logistic Regression Analysis in Minitab, follow the steps given below. It is assumed that you have already launched the Minitab software.

Step 1: Choose Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model.

Fit Binary Logistic Regression

Step 2: Do one of the following:

If your data are in raw or frequency form, follow these steps:

Response in Binary Logistic Regression (Frequency Format)

1. Choose Response in binary response/frequency format from the drop-down list at the top of the dialog box.
2. In Response, enter the column that contains the response variable.
3. In Frequency, enter the optional column that contains the count or frequency variable.

If you have summarized data, then follow these steps:

Response in Binary Logistic Regression (Trial Format)

1. Choose Response in event/trial format from the drop-down list at the top of the dialog box.
2. In Number of events, enter the column that contains the number of times the event occurred in your sample at each combination of the predictor values.
3. In Number of trials, enter the column that contains the corresponding number of trials.

Step 3: In Continuous predictors, enter the columns that contain continuous predictors. In Categorical predictors, enter the columns that contain categorical predictors. You can add interactions and other higher-order terms to the model.

Step 4: If you like, use one or more of the other dialog box options, then click OK.

The following options are available in the main dialog box of Minitab's Binary Logistic Regression:

• **Response in binary response/frequency format**: Choose this if the response data have been entered as a single column that contains two distinct values, i.e., a dichotomous variable.
• **Response**: Enter the column that contains the response values.
• **Response event**: Choose which event of interest the results of the analysis will describe.
• **Frequency (optional)**: If the data are in two columns, one that contains the response values and one that contains their frequencies, enter the column that contains the frequencies.
• **Response in event/trial format**: Choose this if the response data are in two columns: one that contains the number of successes or events of interest and one that contains the number of trials.
• **Event name**: Enter a name for the event in the data.
• **Number of events**: Enter the column that contains the number of events.
• **Number of trials**: Enter the column that contains the number of trials.
• **Continuous predictors**: Select the continuous variables that explain changes in the response. A predictor is also called an X variable.
• **Categorical predictors**: Select the categorical classifications or group assignments, such as type of raw material, that explain changes in the response. A predictor is also called an X variable.

Step 5: To store diagnostic measures and characteristics of the estimated equation, click the Storage… button.

Binary Logistic Regression Storage Dialog Box
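
Minitab's dialogs are menu-driven, but the same kind of binary logistic model can also be fit in code. For readers who want a scriptable analogue, here is a minimal sketch using Python's statsmodels with simulated (made-up) data, one continuous and one categorical predictor:

```python
# Sketch: fitting a binary logistic regression in code (simulated data),
# analogous to Minitab's Fit Binary Logistic Model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
dose = rng.uniform(0, 10, size=n)               # continuous predictor
group = rng.integers(0, 2, size=n)              # categorical predictor (0/1)
logit_p = -2.0 + 0.4 * dose + 0.8 * group       # true linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p))) # binary (0/1) response

X = sm.add_constant(np.column_stack([dose, group]))
result = sm.Logit(y, X).fit(disp=False)
print(result.summary())  # coefficients, standard errors, p-values
```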


# Multivariable / Multiple Regression

Multiple regression (a regression with multiple variables) refers to a regression model having more than one predictor (independent or explanatory variable) to explain a response (dependent) variable. A simple regression model has one predictor to explain a single response, while a multiple (multivariable) regression model has more than one predictor. Both simple and multiple (multivariable) regression models can further be categorized as linear or non-linear.

Note that linearity is not determined by the predictors or by the addition of more predictors to a simple regression model; it refers to the parameters (the coefficients attached to the predictors). If the parameters enter the model at a constant rate of change, the model is referred to as linear, whether it is a simple regression model or a multiple (multivariable) regression model. The relationship between the variables is assumed to be linear, though this assumption can never be fully confirmed in the case of multiple linear regression. As a rule, however, it is useful to look at bivariate scatter diagrams of the variables of interest and check that there is no curvature in the relationship.

Multiple regression also allows one to determine the overall fit of the model (known as variance explained) and the relative contribution of each predictor to the total variance explained. For example, one may want to know how much of the variation in exam performance can be explained by predictors such as revision time, test anxiety, lecture attendance, and gender "as a whole", and also the "relative contribution" of each independent variable in explaining the variance.

A multiple regression model has the form

$y=\alpha+\beta_1 x_1+\beta_2 x_2+\cdots+\beta_k x_k+\varepsilon$

Here $y$ is a continuous variable, and the $x$'s are the predictors, which may be continuous, categorical, or discrete. The model above is referred to as a linear multiple (multivariable) regression model.

For example, one might predict college GPA using high school GPA, test scores, time given to study, and a rating of the high school as predictors.
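
As an illustrative sketch (with simulated, made-up data rather than real student records), the college-GPA example can be fit in Python with statsmodels; the $R^2$ value reports the overall fit (variance explained) discussed above:

```python
# Sketch of the college-GPA example with simulated data: predict college
# GPA from high-school GPA, test score, and hours of study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
hs_gpa = rng.uniform(2.0, 4.0, size=n)
test = rng.normal(60, 10, size=n)
study = rng.uniform(0, 30, size=n)

college_gpa = (0.5 + 0.6 * hs_gpa + 0.01 * test + 0.02 * study
               + rng.normal(0, 0.3, size=n))

X = sm.add_constant(np.column_stack([hs_gpa, test, study]))
fit = sm.OLS(college_gpa, X).fit()
print(fit.params)       # estimates of alpha, beta_1, ..., beta_k
print(fit.rsquared)     # overall fit: variance explained
```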


# Logistic regression Introduction

Logistic regression was introduced in the 1930s by Ronald Fisher and Frank Yates, and was proposed in the 1970s as an alternative technique for overcoming the limitations of ordinary least squares regression in handling dichotomous outcomes. It is a probabilistic statistical classification model; although the model is non-linear, it can be converted into a linear model by a simple transformation. It is used to predict a binary (dichotomous) categorical response variable, which takes the value zero or one, on the basis of one or more predictor variables, and to estimate empirical values of the parameters in the model. A logistic regression model is written as

$\pi=\frac{1}{1+e^{-(\alpha +\sum_{i=1}^k \beta_i X_i)}}$

where $\alpha$ is the intercept and the $\beta_i$ are the slopes (coefficients) of the predictors $X_i$.

So, in simple words, logistic regression is used to find the probability of occurrence of the outcome of interest. For example, if we want to assess the significance of different predictors (gender, sleeping hours, participation in extracurricular activities, etc.) for a binary response (pass or fail in exams, coded as 0 and 1), we use logistic regression.

By using a transformation, this non-linear regression model can easily be converted into a linear model. Since $\pi$ is the probability of the event of interest, taking the natural log of the ratio of the probability of success to the probability of failure (the odds) makes the model linear:

$\ln\left(\frac{\pi}{1-\pi}\right)=\alpha+\sum_{i=1}^k \beta_i X_i$

This natural log of the odds (the logit) converts the logistic regression model into linear form.
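
The following short Python sketch (with illustrative values for $\alpha$ and $\beta$, not taken from the text) evaluates the logistic model and verifies that the log of the odds recovers the linear form $\alpha + \beta x$:

```python
# Sketch: the logistic model and its linearizing logit transform.
import numpy as np

alpha, beta = -1.0, 0.5          # illustrative intercept and slope
x = np.array([0.0, 2.0, 4.0, 6.0])

pi = 1 / (1 + np.exp(-(alpha + beta * x)))  # logistic model: probabilities
logit = np.log(pi / (1 - pi))               # log-odds: back to linear form
print(pi)
print(logit)                                # equals alpha + beta * x
```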
