WHERE and IF Statements in SAS

Master the difference between WHERE and IF Statements in SAS for efficient data filtering. Learn when to use WHERE statements for speed on existing variables and IF statements for new variables. Our guide includes syntax examples, performance tips, and helps you avoid common subsetting errors to clean your datasets faster.

Differentiate between WHERE and IF Statements in SAS

There is a fundamental difference between WHERE and IF Statements in SAS. The breakdown of differences between WHERE and IF statements in SAS is as follows:

Key Differences Where and If Statements in SAS

Where Statement in SAS

The WHERE statement in SAS acts as a filter at the source. It tells SAS only to read observations that meet a specific condition from the input dataset(s). This happens very early in the process, often before the Program Data Vector (PDV) is fully constructed.

When to use WHERE Statement in SAS: Almost always for simple filtering, especially when working with large datasets, as it significantly improves performance by reducing I/O.

Example 1: Basic Filtering in a DATA Step

data high_earners;
    set sashelp.class;
    where age > 13; /* Only reads observations where Age is >13 */
run;

In this case, observations where age <= 13 are never even loaded into the PDV for processing.

Example 2: Using a WHERE Dataset Option

This is very powerful in procedures or when merging.

proc print data=sashelp.class(where=(sex='F')); /* Prints only females */
run;

data combined;
    merge ds1(where=(valid=1)) /* Merge only valid records from ds1 */
          ds2;
    by id;
run;

IF Statement in SAS

The IF statement in SAS is a processing-time filter. The entire observation is read into the PDV, all variables are calculated, and then the IF condition is evaluated. If the condition is false, the OUTPUT statement is bypassed (for that observation), and SAS returns to the beginning of the Data Step to process the next observation.

When to use IF Statement in SAS: When your filtering condition involves variables created within the Data Step or requires complex logic that must be executed row-by-row.

Example 1: Filtering on a New Variable

data tall_people;
set sashelp.class;
height_inches = height * 2.54; /* Create a new variable */
if height_inches > 64; /* Subsetting IF statement;*/
/* Equivalent to: if height_inches <= 64 then delete; */
run;

This works perfectly because the new variable height_inches exists in the PDV by the time the IF statement is executed.

Example 2: Complex Row-by-Row Logic

data flagged_records;
set mydata;
if some_var = . then do; /* Check for missing value */
error_flag = 'M'; /* Set a flag */
error_count + 1; /* Increment a counter */
end;
if error_flag = 'M'; /* Output only the records with errors */
run;

This kind of multi-step logic is not possible with a WHERE statement.

What is the Special Case of Subsetting IF vs DELETE Statements?

A common use of an IF statement is the subsetting IF (if condition;). This outputs an observation only if the condition is true. Its logical opposite is if condition then delete; which deletes an observation if the condition is true.

data adults;
    set people;
    if age >= 18; * Output if true;
run;

/* Is logically equivalent to: */

data adults;
    set people;
    if age < 18 then delete; * Delete if true;
run;

WHERE or IF Statement: Which One to Use?

  • Use WHERE when:
    • You are filtering based on variables that exist in the input dataset.
    • You want the most efficient processing, especially for large data.
    • You are working in a PROC step (like PROC PRINT, PROC SORT).
    • You want to use special operators like CONTAINS or LIKE.
  • Use IF when:
    • You need to filter based on a variable created within the same Data Step.
    • Your filtering logic is complex and requires other SAS statements (like DO loops or ARRAY processing).
    • You are already reading every observation into the PDV for other necessary calculations, and the efficiency gain of WHERE is negligible.

The WHERE statement can be used …. IF statement cannot be used

  • WHERE statement can be used in procedures to subset data, while the IF statement cannot be used in procedures.
  • WHERE can be used as a data set option, while IF cannot be used as a data set option.
  • WHERE statement is more efficient than the IF statement. It tells SAS not to read all observations from the data set
  • WHERE statement can be used to search for all similar character values that sound alike, while the IF statement cannot be used.
  • WHERE statement can not be used when reading data using the INPUT statement, whereas the IF statement can be used.
  • Multiple IF statements can be used to execute multiple conditional statements
  • When it is required to use newly created variables, use an IF statement, as it doesn’t require variables to exist in the READIN data set.  

What is the one statement to set the criteria of data that can be coded in any step?

A WHERE statement can set the criteria for any data set in a data step or a proc step.

General Knowledge Quizzes

Understanding Advanced SAS Procedures

Master advanced statistical modeling in SAS with our detailed question-and-answer guide. This Understanding Advanced SAS Procedures post explains the core statements and functionality of essential SAS procedures like PROC NLIN for nonlinear regression, PROC NLMIXED for nonlinear mixed models, PROC GLIMMIX for linear and generalized linear mixed models, and PROC PROBIT for dose-response analysis. Learn how to use PARMS, MODEL, RANDOM, and CLASS statements correctly, avoid common syntax errors, and interpret your results with practical examples from the sashelp.cars and sashelp.iris datasets. Perfect for data analysts and statisticians looking to deepen their SAS programming skills.

Understanding Advanced SAS Procedures

Understanding Advanced SAS Procedures

Explain the following SAS Statements used in the Example below (Non-Linear Mixed Model)

proc nlmixed data = CARS;
parms b1 = 220 b2 = 500 b3 = 310 s2u = 100 s2e = 60;
model X ~ normal(num/den, s2e);
random u1 ~ normal(0, s2u) subject = NUMBER;
run;

This is an excellent example of a nonlinear mixed model in SAS. The MIXED MODEL statement defines the dependent variable and its conditional distribution given the random effects. In the above statement, a normal (Gaussian) conditional distribution is specified.

This code is fitting a nonlinear mixed-effects model to data about cars (from the CARS dataset). It is trying to estimate parameters ($b_1, b_2$, and $b_3$) for a specific nonlinear relationship between a predictor and the outcome $X$, while also accounting for random variations between different groups of cars (grouped by NUMBER).

The RANDOM statement defines the single random effect to be $u1$, and specifies that it follows a normal distribution with mean $0$ and variance $s2u$. The SUBJECT= statement in the RANDOM statement is used to define a variable that will indicate when the random effect obtains new realizations.

Explain the following SAS statements (Linear Mixed Model) in the example below

proc glimmix data = sashelp.iris;
class species;
model age = weight;
random age = weight;
run;

The CLASS statement instructs the technique to treat the variable species as type variables. The version announcement in the example shown above specifies the reaction variable as a pattern proportion by means of the use of the occasions/trials approach.

This PROC GLIMMIX code contains a critical error in its RANDOM statement, which makes the model, as written, invalid and nonsensical.

In code, it is trying to fit a linear mixed model to the sashelp.iris dataset (famous Fisher’s Iris data). The intent might have been to see how age (which does not exist in the standard iris dataset) is related to weight (which also does not exist), while accounting for the grouping structure of species. The syntax of the RANDOM statement is completely incorrect.

Explain the use of each SAS statement (PROC PROBIT) given below

PROC PROBIT dataset;
CLASS <dependent variables>;
Model < dependent variables > = <independent VARIABLES>;

This statement outlines the basic structure for using PROC PROBIT in SAS, but it contains a few common misunderstandings and a critical error in the CLASS statement. However, the line-by-line explanation of the code is:

The DATA= option specifies the dataset that will be studied.

The PLOTS= choice within the PROC PROBIT statement, collectively with the ODS graphics announcement, requests all plots (as all have been specified in brackets, we will pick out a selected plot also) for the anticipated opportunity values and peak ranges.

The model statement prepares a response between a structured variable and independent variables. The variables top and weight are the stimuli or explanatory variables.

Explain the following SAS example (PROC NLIN)

proc nlin data = sashelp.cars method = gauss;
parms hosepower = 135
cylinders = 6;
model mpg_highway = (horsepower/cylinders);
run;

This code is used to fit a nonlinear regression model (PROC NLIN) to car data. The METHOD = option directs PROC NLIN to use the GAUSS iterative method. The PARMS statement declares the parameters and specifies their initial values.

The code is trying to model a car’s highway fuel efficiency (mpg_highway) as a simple nonlinear function of its power (horsepower) and engine size (cylinders). Specifically, it is testing the hypothesis that highway MPG is directly proportional to the power-per-cylinder (horsepower / cylinders). The code contains a critical error in its model specification, which will cause it to fail.

R Frequently Asked Questions

SAS Functions and Procedures

Discover key differences between SAS functions and procedures, when to use SUM() vs. ‘+’ operator, and INPUT vs. INFILE statements in SAS Software. Learn with clear examples and practical use cases for efficient data analysis. Perfect for SAS beginners and professionals!

SAS Functions and Procedures Questions And Answers

What is the difference between SAS Functions and Procedures?

The SAS Functions and Procedures (PROCs) serve different purposes and operate in distinct ways. The breakdown of the key differences between SAS Functions and Procedures is:

SAS Functions

Perform computations or transformations on individual values (usually within a DATA step). The SAS Functions are used to (i) operate on single values or variables, (ii) return a single result for each function call, and (iii) are often used in assignment statements or expressions.

## SAS Functions Example

data example;
    x = SUM(10, 20, 30);  /* Returns 60 */
    y = UPCASE('hello');  /* Returns 'HELLO' */
    z = SUBSTR('SAS Programming', 1, 3);  /* Returns 'SAS' */
run;

The following are some important types of SAS Functions:

  • Numeric Functions (e.g., SUM(), MEAN(), ROUND())
  • Character Functions (e.g., UPCASE(), SUBSTR(), TRIM())
  • Date/Time Functions (e.g., TODAY(), MDY(), INTCK())
  • Statistical Functions (e.g., NORMAL(), RANUNI())

SAS Procedures (PROCs)

SAS procedures, or PROCs, are used to perform data manipulation, analysis, or reporting on entire datasets. The usage of PROCS is to (i) operate on entire datasets (not just single values), (ii) generate tables, reports, graphs, or statistical analyses, and (iii) execute in a PROC step, not a DATA step.

## SAS Procedures (PROCs) Examples

proc means data=sashelp.class;  /* Computes summary statistics */
    var age height weight;
run;

proc sort data=sashelp.class;  /* Sorts a dataset */
    by descending age;
run;

proc freq data=sashelp.class;  /* Generates frequency tables */
    tables sex age;
run;

The types of SAS Procedures are:

  • Data Management PROCs (e.g., PROC SORT, PROC TRANSPOSE)
  • Statistical PROCs (e.g., PROC MEANS, PROC REG, PROC ANOVA)
  • Reporting PROCs (e.g., PROC PRINT, PROC TABULATE, PROC REPORT)
  • Graphical PROCs (e.g., PROC SGPLOT, PROC GCHART)

What are the key differences between SAS Functions and SAS Procedures?

The following are the key differences between SAS Functions and SAS Procedures:

FeatureSAS FunctionsSAS Procedures (PROCs)
OperationWork on single values/variablesWork on entire datasets
ExecutionUsed in DATA stepsUsed in PROC steps
OutputReturns a single valueGenerates reports, tables, or datasets
ExamplesSUM(), UPCASE(), SUBSTR()PROC MEANS, PROC SORT, PROC FREQ
Usage ContextCalculations within a variableDataset processing & analysis

Describe when to use SAS Functions or SAS PROCs

  • Use Functions when you need to transform or compute values within a DATA step.
  • Use Procedures when you need to analyze, summarize, or manipulate entire datasets.

What is the Difference Between the “Sum” Function and using the “+” Operator in SAS?

In SAS, both the SUM function and the + Operators can be used to perform addition, but they behave differently in terms of handling missing values and syntax. The breakdown of the differences between the SUM Function and the + Operator is:

SUM Function (SUM())

The SUM Function is used to add values while ignoring missing values (.). The general syntax of the SUM Function in SAS is

sum(var1, var2, var3, ...)

The behaviour of the SUM() is that if any argument is non-missing, the result is the sum of non-missing values. If all arguments are missing, the result is missing (.). The SUM() Function is best for

  • Summing multiple variables where some may have missing values.
  • Avoiding unintended missing results due to missing data.
## SAS SUM() Function Example 

data example;
    a = 10;
    b = .;  /* missing */
    c = 30;
    sum_result = sum(a, b, c);  /* 10 + 30 = 40 (ignores missing) */
run;

+ Operator

The ‘+’ operator performs arithmetic addition but propagates missing values. The general syntax of the ‘+’ operator in SAS is

var1 + var2 + var3

The behaviour of ‘+’ is:

  • If any variable is missing, the result is missing (.).
  • Only works if all values are non-missing.

The use of ‘+’ operator is best for:

  • Cases where missing values should make the result missing (e.g., strict calculations).
## + Operator Example

data example;
    a = 10;
    b = .;  /* missing */
    c = 30;
    plus_result = a + b + c;  /* 10 + . + 30 = . (missing) */
run;

What are the Key Differences between the SUM() Function and the ‘+’ Operator in SAS?

FeatureSUM Function (SUM())+ Operator
Handling Missing ValuesIgnores missing values (10 + . = 10)Returns missing if any value is missing (10 + . = .)
Syntaxsum(a, b, c)a + b + c
Use CaseSumming variables where some may be missingStrict arithmetic (missing = invalid)
PerformanceSlightly slower (function call)Faster (direct operation)

When to Use the SUM() Function and ‘+’ Operator in SAS?

  • Use SUM() when:
    • You want to ignore missing values (e.g., calculating totals where some data is missing).
    • Example: total = sum(sales1, sales2, sales3);
  • Use + when:
    • Missing values should make the result missing (e.g., strict calculations where all inputs must be valid).
    • Example: net_pay = salary + bonus; (if bonus is missing, net_pay should also be missing).

What is the difference between the INPUT and INFILE statements?

In SAS, both the INPUT and INFILE statements are used to read data, but they serve different purposes and are often used together. Here’s a breakdown of their differences:

INFILE Statement

The INFILE Statement in SAS specifies the source file from which data is to be read. It is used to

  • Defines the external file (e.g., .txt, .csv, .dat) to be read.
  • Can include options to control how data is read (e.g., delimiters, missing values, encoding).

The general Syntax of the INFILE Statement in SAS is:

INFILE "file-path" <options>;

The Key Options of the INFILE Statement are:

  • DLM=’,’ (specifies delimiter, e.g., CSV files)
  • DSD (handles quoted values and missing data correctly)
  • FIRSTOBS=2 (skips the first line, e.g., headers)
  • MISSOVER (prevents SAS from moving to the next line if data is missing)
## INFILE Statement Example
    DATA sample;
      INFILE "/path/to/data.csv" DLM=',' DSD FIRSTOBS=2;
      INPUT name $ age salary;
    RUN;

INPUT Statement

The INPUT Statement defines how SAS reads raw data (variable names, types, and formats). It is used to

  • Maps raw data to SAS variables (numeric or character).
  • Specifies the layout of the data (column positions, delimiters, or formats).

The general Syntax of the INPUT Statement is

INPUT variable1 $ variable2 variable3 ...;

The types of Input Styles are:

  • List Input (space/comma-delimited): INPUT name $ age salary;
  • Column Input (fixed columns): INPUT name $ 1-10 age 11-13 salary 14-20;
  • Formatted Input (specific formats): INPUT name $10. age 2. salary 8.2;
## INPUT Statement Example

    DATA sample;
      INFILE "/path/to/data.txt";
      INPUT name $ age salary;
    RUN;
Statistics Data Analysis SAS Functions, SAS Procedures

Learn String Manipulation in R Language