Select Cases in SPSS

This post is about Select Cases in SPSS (IBM SPSS Statistics). Sometimes you may be interested in analyzing only a specific part (subset) of the available dataset. For example, you may want descriptive or inferential statistics for males and females separately, for a certain age range, or only for non-smokers. In such cases, you can use Select Cases in SPSS.

Select Cases in SPSS: Step-by-Step Procedure

For illustrative purposes, I am using the “customer_dbase” file available in the SPSS sample data files. I will use the gender variable to select male customers only and then present some descriptive statistics for males. For this purpose, follow these steps:

Step 1: Go to the Menu bar, select “Data” and then “Select Cases”.

Select Cases in SPSS - 1

Step 2: A new window called “Select Cases” will open.

Use of If statement for Select Cases in SPSS

Step 3: Tick the option “If condition is satisfied” as shown in the figure below.

Select Cases in SPSS - 2

Step 4: Click on the button “If” highlighted in the above picture.

Step 5: A new window called “Select Cases: If” will open.

Select Cases in SPSS - If Dialog box 3

Step 6: The left box of this dialog box contains all the variables from the data view. Choose the variable (using the left mouse button) that you want to select cases for and use the “arrow” button to move the selected variable to the right box.

Step 7: In this example, the variable gender (for which we want to select only men) is moved from the left box to the right box. In the right box, write “gender=0” (since men are coded 0 in this dataset).

Select Cases in SPSS - with Condition

Step 8: Click on Continue and then the OK button. Now only men are selected (the women’s data values are temporarily filtered out of the dataset).

Re-Select Cases in SPSS

Note: To “re-select” all cases (the complete dataset), carry out the following steps:

Step a: Go to the Menu bar, choose “Data” and then “Select Cases”.

Step b: From the dialog box of “Select Cases”, tick the box called “All cases”, and then click on the OK button. 

Select Cases in SPSS - data 5

When you use Select Cases in SPSS, a new variable called “filter_$” is created in the dataset. Deleting this filter variable removes the selection. The unselected cases are shown crossed out in the Data View window.
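The effect of this filter variable can be sketched outside SPSS in plain Python. The rows, the gender coding (0 = male, as in the example above), and the "filter_" key below are all made up for illustration; SPSS itself manages its filter variable automatically:

```python
# A plain-Python sketch of what Select Cases does (made-up data;
# the "filter_" key mimics SPSS's filter variable, gender coded 0 = male).
rows = [
    {"id": 1, "gender": 0, "age": 34},
    {"id": 2, "gender": 1, "age": 29},
    {"id": 3, "gender": 0, "age": 41},
]

# SPSS adds a filter variable: 1 = selected, 0 = filtered out.
for row in rows:
    row["filter_"] = 1 if row["gender"] == 0 else 0

# Analyses then run only on the selected (male) cases.
selected = [row for row in rows if row["filter_"] == 1]
```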

Select Cases in SPSS - data view 6

Note: The selection will be applied to everything you do from the point you select cases until you remove the selection. In other words, all statistics, tables, and graphs will be based only on the selected individuals until you remove (or change) the selection.

Random Sample of Cases

There are other kinds of selection too: a random sample of cases, selection based on a time or case range, or use of an existing filter variable. The selected cases can also be copied to a new dataset, or the unselected cases can be deleted. For this purpose, choose the appropriate option from the Output section of the Select Cases dialog box.
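As a rough illustration (not SPSS itself), drawing a random sample of cases might look like this in Python, using stand-in case IDs:

```python
import random

# Illustrative sketch of "Random sample of cases" (stand-in case IDs 1..100).
random.seed(1)                      # fixed seed, for reproducibility only
cases = list(range(1, 101))
sample = random.sample(cases, 10)   # draw exactly 10 cases without replacement
```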

Select Cases in SPSS - random selection 7

For other SPSS tutorials, see Independent Samples t-tests in SPSS

Hypothesis Testing in R Programming Language

Subjective Probability (2019)

Subjective probability is a type of probability based on personal beliefs, judgment, or experience about the occurrence of a specific outcome in the future. The calculation of subjective probability involves no formal computations (no formula) and reflects a person’s opinion based on his/her experience. Subjective probability differs from person to person and may contain a high degree of personal bias.

This kind of probability is usually based on a person’s experience, understanding, knowledge, and intelligence in determining the probability of some specific event (situation). It is usually applied in real-life situations, especially decisions in business, job interviews, employee promotions, awarding incentives, and daily-life situations such as buying and/or selling a product. An individual may use their expertise, opinion, past experience, or intuition to assign a degree of probability to a specific situation.

It is worth noting that subjective probability is highly flexible in terms of an individual’s belief. For example, one individual may believe that the chance of occurrence of a certain event is 25%. The same person or others may state a different belief, especially when given a specific range from which to choose (such as 25% to 30%), even if no additional hard data lies behind the change.

Events that may Alter Subjective Probability

Subjective probability is usually affected by a variety of personal beliefs and opinions held by an individual (related to caste, family, region, religion, and even relationships with other people). This is because subjective probability is often based on how each individual interprets the information presented to him or her.

Disadvantages of Subjective Probability

As only personal opinions (beliefs, experiences) are involved, there may be a high degree of bias, and one person’s opinion may differ greatly from another’s. Similarly, because subjective probability involves no formal calculations, the assessed probabilities may be inconsistent with the formal rules of probability.


Examples Related to Subjective Probability

  • You may think there is an 80% chance that your best friend will call you today because his/her car broke down yesterday and he/she will probably need a ride.
  • You think you have a 50% chance of getting a certain job you applied for, as the other applicant is also qualified.
  • The probability that it will rain in the next (say) 5 hours is based on current weather conditions, wind patterns, nearby weather, barometric pressure, etc. One predicts this from one’s experience of weather and rain.
  • Suppose a cricket tournament is to be held between Pakistan and India. The theoretical probability of either team winning is 50%, but in reality it is not 50%. Moreover (unlike empirical probability), a large number of trial tournaments cannot be arranged to determine an experimental probability. Thus, subjective probability will be used to assess the winning team, based on the beliefs and experience of the investigator interested in the probability of the Pakistan cricket team winning. Note that there will be bias if a fan of either team assesses the probability of that team winning.
  • To locate petroleum, minerals, and/or water lying under the earth, dowsers are employed to predict the likelihood of the existence of the required material. They usually adopt non-scientific methods. In such situations, subjective probability is used.
  • Note that decisions based on subjective probability may be valid if the person’s degree of belief about the situation is unbiased and is arrived at by some logical reasoning.

For further reading, see Introduction to Probability Theory

R Programming Language and R Frequently Asked Questions

Remedial Measures of Heteroscedasticity (2018)

This post is about remedial measures of heteroscedasticity.

Heteroscedasticity is a condition in which the variance of the residual (error) term in a regression model is not constant across observations.

Heteroscedasticity does not destroy the unbiasedness and consistency properties of the OLS estimators (they remain unbiased and consistent in the presence of heteroscedasticity), but they are no longer efficient, not even asymptotically. This lack of efficiency makes the usual hypothesis-testing procedures dubious (unreliable). Therefore, there should be some remedial measures for heteroscedasticity.

Homoscedasticity

Remedial Measures of Heteroscedasticity

For remedial measures of heteroscedasticity, there are two approaches: (i) when $\sigma_i^2$ is known, and (ii) when $\sigma_i^2$ is unknown.

(i) $\sigma_i^2$ is known

Consider the simple linear regression model $Y_i=\alpha + \beta X_i + u_i$.

If $V(u_i)=\sigma_i^2$ then heteroscedasticity is present. Given the values of $\sigma_i^2$, heteroscedasticity can be corrected by using weighted least squares (WLS) as a special case of Generalized Least Squares (GLS). Weighted least squares is the OLS method of estimation applied to the transformed model.

When heteroscedasticity is detected by an appropriate statistical test, the appropriate solution is to transform the original model in such a way that the transformed disturbance term has a constant variance. Here the transformation divides the whole model by $\sigma_i$, giving $\frac{Y_i}{\sigma_i}=\alpha\frac{1}{\sigma_i}+\beta\frac{X_i}{\sigma_i}+\frac{u_i}{\sigma_i}$, so the transformed error term $u_i^*=\frac{u_i}{\sigma_i}$ has a constant variance, i.e., it is homoscedastic. Mathematically,

\begin{eqnarray*}
V(u_i^*)&=&V\left(\frac{u_i}{\sigma_i}\right)\\
&=&\frac{1}{\sigma_i^2}Var(u_i)\\
&=&\frac{1}{\sigma_i^2}\sigma_i^2=1
\end{eqnarray*}

This approach has limited use, as the individual error variances are rarely known a priori. When sufficient sample information is available, reasonable guesses of the true error variances can be made and used for $\sigma_i^2$.
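The known-$\sigma_i$ case can be sketched numerically. In the following Python snippet, the data and the pattern $\sigma_i = 0.5X_i$ are made up for illustration: OLS is applied to the transformed variables $Y_i/\sigma_i$, $1/\sigma_i$, and $X_i/\sigma_i$ (no intercept), and it recovers the true parameters from the noise-free data:

```python
# Sketch of WLS when sigma_i is known (made-up data; the pattern
# sigma_i = 0.5 * X_i is an assumption for illustration only).
# Model: Y_i = alpha + beta*X_i + u_i with Var(u_i) = sigma_i^2.
# Dividing through by sigma_i yields a homoscedastic model, so OLS applies.

X = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = [0.5 * x for x in X]       # assumed known error s.d. per case
Y = [1 + 2 * x for x in X]         # noise-free data with alpha=1, beta=2

# Transformed variables: regress Y/sigma on (1/sigma, X/sigma), no intercept.
z0 = [1 / s for s in sigma]                # coefficient of z0 is alpha
z1 = [x / s for x, s in zip(X, sigma)]     # coefficient of z1 is beta
y = [yi / s for yi, s in zip(Y, sigma)]

# Solve the 2x2 normal equations (Z'Z) b = Z'y for b = (alpha, beta).
a11 = sum(v * v for v in z0)
a12 = sum(v * w for v, w in zip(z0, z1))
a22 = sum(w * w for w in z1)
b1 = sum(v * yi for v, yi in zip(z0, y))
b2 = sum(w * yi for w, yi in zip(z1, y))
det = a11 * a22 - a12 * a12
alpha = (b1 * a22 - b2 * a12) / det
beta = (a11 * b2 - a12 * b1) / det
```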

Let us now discuss the second remedial measure of heteroscedasticity.

(ii) $\sigma_i^2$ is unknown

If $\sigma_i^2$ is not known a priori, heteroscedasticity is corrected by hypothesizing a relationship between the error variance and one of the explanatory variables. There can be several versions of the hypothesized relationship. Suppose the hypothesized relationship is $Var(u_i)=\sigma^2 X_i^2$ (the error variance is proportional to $X_i^2$). For this hypothesized relation, we will use the following transformation to correct for heteroscedasticity in the simple linear regression model $Y_i =\alpha + \beta X_i +u_i$.
\begin{eqnarray*}
\frac{Y_i}{X_i}&=&\frac{\alpha}{X_i}+\beta+\frac{u_i}{X_i}\\
\Rightarrow \quad Y_i^*&=&\beta +\alpha X_i^*+u_i^*\\
\mbox{where } Y_i^*&=&\frac{Y_i}{X_i},\; X_i^*=\frac{1}{X_i}, \mbox{ and } u_i^*=\frac{u_i}{X_i}
\end{eqnarray*}

Now OLS estimation of the above transformed model will yield efficient parameter estimates, as the $u_i^*$ have constant variance, i.e.,

\begin{eqnarray*}
V(u_i^*)&=&V\left(\frac{u_i}{X_i}\right)\\
&=&\frac{1}{X_i^2} V(u_i)\\
&=&\frac{1}{X_i^2}\sigma^2X_i^2\\
&=&\sigma^2=\mbox{ Constant}
\end{eqnarray*}
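This transformation can also be sketched numerically. In the following Python snippet the data are made up (noise-free, with true $\alpha=3$ and $\beta=2$), so that regressing $Y_i/X_i$ on $1/X_i$ recovers the parameters:

```python
# Sketch of the unknown-sigma case under the hypothesized relation
# Var(u_i) = sigma^2 * X_i^2 (made-up, noise-free data; alpha=3, beta=2).
# Transformed model: Y_i/X_i = beta + alpha*(1/X_i) + u_i/X_i,
# i.e. a simple OLS of Y* = Y/X on X* = 1/X.

X = [1.0, 2.0, 4.0, 5.0, 10.0]
Y = [3 + 2 * x for x in X]

Ystar = [yi / xi for yi, xi in zip(Y, X)]
Xstar = [1 / xi for xi in X]

# Simple OLS: the slope estimates alpha, the intercept estimates beta.
n = len(X)
mx = sum(Xstar) / n
my = sum(Ystar) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(Xstar, Ystar))
sxx = sum((x - mx) ** 2 for x in Xstar)
alpha_hat = sxy / sxx            # slope of the transformed regression
beta_hat = my - alpha_hat * mx   # intercept of the transformed regression
```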


For remedial measures of heteroscedasticity, some other hypothesized relations are:

  • Error variance is proportional to $X_i$ (square root transformation), i.e., $E(u_i^2)=\sigma^2X_i$
    The transformed model is
    \[\frac{Y_i}{\sqrt{X_i}}=\frac{\alpha}{\sqrt{X_i}}+\beta\sqrt{X_i}+\frac{u_i}{\sqrt{X_i}}\]
    It (the transformed model) has no intercept term; therefore, we have to use the regression-through-the-origin model to estimate $\alpha$ and $\beta$. To get back the original model, multiply the transformed model by $\sqrt{X_i}$.
  • Error variance is proportional to the square of the mean value of $Y$, i.e., $E(u_i^2)=\sigma^2[E(Y_i)]^2$
    Here the variance of $u_i$ is proportional to the square of the expected value of $Y$, and $E(Y_i)=\alpha + \beta X_i$.
    The transformed model will be
    \[\frac{Y_i}{E(Y_i)}=\frac{\alpha}{E(Y_i)}+\beta\frac{X_i}{E(Y_i)}+\frac{u_i}{E(Y_i)}\]
    This transformation is not directly applicable because $E(Y_i)$ depends upon $\alpha$ and $\beta$, which are unknown parameters. $\hat{Y}_i=\hat{\alpha}+\hat{\beta}X_i$ is an estimator of $E(Y_i)$, so we will proceed in two steps:
     
    1. We run the usual OLS regression, disregarding the heteroscedasticity problem, and obtain $\hat{Y}_i$.
    2. We then transform the model using the estimated $\hat{Y}_i$, i.e., $\frac{Y_i}{\hat{Y}_i}=\alpha\frac{1}{\hat{Y}_i}+\beta\frac{X_i}{\hat{Y}_i}+\frac{u_i}{\hat{Y}_i}$, and run the regression on the transformed model.

      This transformation yields satisfactory results only if the sample size is reasonably large.

  • Log transformation, such as $\ln Y_i = \alpha + \beta\, \ln X_i + u_i$.
    The log transformation compresses the scales in which the variables are measured. However, it is not applicable if some of the $Y$ or $X$ values are zero or negative.
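The log transformation can be sketched in the same way. In the following Python snippet the data are made up (noise-free, with $Y_i = 3X_i^2$, so that $\ln Y_i=\ln 3 + 2\ln X_i$):

```python
import math

# Sketch of the log transformation ln Y = alpha + beta * ln X
# (made-up, noise-free data: Y = 3 * X^2, so beta = 2 and alpha = ln 3).
X = [1.0, 2.0, 4.0, 8.0]
Y = [3 * x ** 2 for x in X]

lx = [math.log(x) for x in X]
ly = [math.log(y) for y in Y]

# Simple OLS of ln Y on ln X.
n = len(lx)
mx = sum(lx) / n
my = sum(ly) / n
beta = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / \
       sum((a - mx) ** 2 for a in lx)
alpha = my - beta * mx
```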

Visit: R Language Frequently Asked Questions