Dimensionality Reduction in Machine Learning

Curious about dimensionality reduction in machine learning? This post answers key questions: What is dimension reduction? How do PCA, KPCA, and ICA work? Should you remove correlated variables before PCA? Is rotation necessary in PCA? Perfect for students, researchers, data analysts, and ML practitioners looking to master feature extraction, interpretability, and efficient modeling. Learn best practices and avoid common pitfalls of dimensionality reduction in machine learning.

What is Dimension Reduction in Machine Learning?

Dimensionality Reduction in Machine Learning is the process of reducing the number of input features (variables) in a dataset while preserving its essential structure and information. Dimensionality reduction simplifies data without losing critical patterns, making ML models more efficient and interpretable. Dimensionality reduction in machine learning is used because it:

  • Removes Redundancy: Eliminates correlated or irrelevant features/variables
  • Fights Overfitting: Simplifies models by reducing noise
  • Speeds up Training: Fewer dimensions mean faster computation
  • Improves Visualization: Projects data into 2D/3D for better understanding.

The common techniques for dimensionality reduction in machine learning are:

  • PCA: Linear projection maximizing variance
  • t-SNE (t-Distributed Stochastic Neighbour Embedding): Non-linear, good for visualization
  • Autoencoders (Neural Networks): Learn compact representations.
  • UMAP (Uniform Manifold Approximation and Projection): Preserves global & local structure.

The uses of dimensionality reduction in machine learning are:

  • Image compression (for example, reducing pixel dimensions)
  • Anomaly detection (by isolating key features)
  • Text data (for example, topic modeling via LDA)

What are PCA, KPCA, and ICA used for?

PCA (Principal Component Analysis), KPCA (Kernel Principal Component Analysis), and ICA (Independent Component Analysis) are dimensionality reduction (feature extraction) techniques in machine learning that are widely used in data analysis and signal processing.

  • PCA (Principal Component Analysis): reduces dimensionality by transforming data into a set of linearly uncorrelated variables (principal components) while preserving maximum variance. Its key uses are:
    • Dimensionality Reduction: Compresses high-dimensional data while retaining most information.
    • Data Visualization: Projects data into 2D/3D for easier interpretation.
    • Noise Reduction: Removes less significant components that may represent noise.
    • Feature Extraction: Helps in reducing multicollinearity in regression/classification tasks.
    • Assumptions: Linear relationships, Gaussian-distributed data.
  • KPCA (Kernel Principal Component Analysis): It is a nonlinear extension of PCA using kernel methods to capture complex structures. Its key uses are:
    • Nonlinear Dimensionality Reduction: Handles data with nonlinear relationships.
    • Feature Extraction in High-Dimensional Spaces: Useful in image, text, and bioinformatics data.
    • Pattern Recognition: Detects hidden structures in complex datasets.
    • Advantage: Works well where PCA fails due to nonlinearity.
    • Kernel Choices: RBF, polynomial, sigmoid, etc.
  • ICA (Independent Component Analysis): It separates mixed signals into statistically independent components (blind source separation). Its key uses are:
    • Signal Processing: Separating audio (cocktail party problem), EEG, fMRI signals.
    • Denoising: Isolating meaningful signals from noise.
    • Feature Extraction: Finding hidden factors in data.
    • Assumptions: Components are statistically independent and non-Gaussian.

Note that Principal Component Analysis finds uncorrelated components, and ICA finds independent ones.
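
As a concrete illustration, a minimal PCA run in SAS might look like the sketch below, using the bundled SASHELP.IRIS sample data; the variables used and the number of retained components are chosen purely for illustration.

/* A minimal PCA sketch on the bundled SASHELP.IRIS sample data (illustrative only) */
proc princomp data=sashelp.iris out=pca_scores n=2;
   var sepallength sepalwidth petallength petalwidth;   /* original, possibly correlated, features */
run;

/* The OUT= dataset contains the component scores Prin1 and Prin2 */
proc print data=pca_scores (obs=5);
   var prin1 prin2;
run;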

Dimensionality reduction in Machine Learning

Suppose a dataset contains many variables, and you know that some of them are highly correlated. Your manager has asked you to run PCA. Would you remove the correlated variables first? Why?

No, one should not remove correlated variables before PCA, because:

  • PCA Handles Correlation Automatically
    • PCA works by transforming the data into uncorrelated principal components (PCs).
    • It inherently identifies and combines correlated variables into fewer components while preserving variance.
  • Removing Correlated Variables Manually Can Lose Information
    • If you drop correlated variables first, you might discard useful variance that PCA could have captured.
    • PCA’s strength is in summarizing correlated variables efficiently rather than requiring manual preprocessing.
  • PCA Prioritizes High-Variance Directions
    • Since correlated variables often share variance, PCA naturally groups them into dominant components.
    • Removing them early might weaken the resulting principal components.
  • When Should You Preprocess Before PCA?
    • Scale Variables (if features are in different units) → PCA is sensitive to variance magnitude.
    • Remove Near-Zero Variance Features (if some variables are constants).
    • Handle Missing Values (PCA cannot handle NaNs directly).

Therefore, do not remove correlated variables before Principal Component Analysis; let PCA handle them. Instead, focus on standardizing data (if needed) and ensuring no missing values exist.
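
For the preprocessing steps listed above, a hedged SAS sketch could standardize the variables (and optionally fill missing values) before running PCA; the dataset and variable names (mydata, x1-x3) are hypothetical placeholders.

/* Hypothetical sketch: center and scale variables, replacing missing values with the mean */
proc standard data=mydata mean=0 std=1 replace out=mydata_std;
   var x1 x2 x3;
run;

/* Note: PROC PRINCOMP works from the correlation matrix by default,
   which has the same effect as standardizing the variables first */
proc princomp data=mydata_std;
   var x1 x2 x3;
run;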

Note, however, that discarding correlated variables can have a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.

Suppose you have 3 variables in a dataset, of which 2 are highly correlated. If you run Principal Component Analysis on this dataset, the first principal component would exhibit close to twice the variance that it would exhibit if the variables were uncorrelated. In other words, including correlated variables lets PCA place more weight on them, which can be misleading when they merely repeat the same information.
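
To see this effect, one can simulate a small dataset in which two of three variables are nearly identical and run PCA on it; the sketch below is illustrative only (variable names, sample size, and seed are arbitrary).

/* Sketch: simulate three variables, two of them almost perfectly correlated */
data corr_demo;
   call streaminit(123);
   do i = 1 to 200;
      x1 = rand('normal');
      x2 = x1 + 0.1*rand('normal');   /* x2 is nearly a copy of x1 */
      x3 = rand('normal');            /* independent of x1 and x2 */
      output;
   end;
   drop i;
run;

proc princomp data=corr_demo;
   var x1 x2 x3;   /* the first component is dominated by the shared x1/x2 variance */
run;

In the eigenvalue table produced by this run, the first component absorbs most of the shared variance of x1 and x2, which is the inflation described above.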

Is rotation necessary in PCA? If yes, why? What will happen if you do not rotate the components?

Rotation is optional but often beneficial; it improves interpretability without losing information.

Why Rotate PCA Components?

  • Simplifies Interpretation
    • PCA components are initially uncorrelated but may load on many variables, making them hard to explain.
    • Rotation (e.g., Varimax for orthogonal rotation) forces loadings toward 0 or ±1, creating “simple structure.”
    • Example: A rotated component might represent only 2-3 variables instead of many weakly loaded ones.
  • Enhances Meaningful Patterns
    • Unrotated components maximize variance but may mix multiple underlying factors.
    • Rotation aligns components closer to true latent variables (if they exist).
  • Preserves Variance Explained
    • Rotation redistributes variance among components but keeps total variance unchanged.

What Happens If You Do Not Rotate?

  • Harder to Interpret: Components may have many moderate loadings, making it unclear which variables dominate.
  • Less Aligned with Theoretical Factors: Unrotated components are mathematically optimal (max variance) but may not match domain-specific concepts.
  • No Statistical Harm: Unrotated PCA is still valid for dimensionality reduction—just less intuitive for human analysis.

When to Rotate?

  • Rotate if your goal is interpretability (e.g., identifying clear feature groupings in psychology, biology, or market research). There is no need to rotate if you only care about dimensionality reduction (e.g., preprocessing for ML models).

Therefore, orthogonal rotation (e.g., Varimax) is applied because it redistributes the variance among the retained components so that each component has a few large loadings and many near-zero ones, which makes the components much easier to interpret. That interpretability is, after all, part of the motive for doing Principal Component Analysis, where we aim to select fewer components (than features) that explain the maximum variance in the dataset. Rotation does not change the relative locations of the data points; it only changes the axes (and hence the loadings) used to describe them, and the total variance explained by the retained components remains the same. If we do not rotate the components, the loadings tend to stay spread across many variables, and linking each component to a meaningful group of features becomes harder.

Rotation does not change PCA’s mathematical validity but significantly improves interpretability for human analysis. Skip it only if you are using PCA purely for algorithmic purposes (e.g., input to a classifier).
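
In SAS, rotated components are obtained with PROC FACTOR rather than PROC PRINCOMP; the sketch below (again on the SASHELP.IRIS sample data, purely for illustration) extracts two principal components and applies a Varimax rotation so the unrotated and rotated loading patterns can be compared.

/* Extract two principal components and rotate them with Varimax (illustrative sketch) */
proc factor data=sashelp.iris method=principal nfactors=2 rotate=varimax;
   var sepallength sepalwidth petallength petalwidth;
run;
/* The output lists both the unrotated and the rotated factor patterns,
   so the effect of rotation on the loadings can be inspected directly */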


Complement of an Event

Probability is a fundamental concept in statistics used to quantify uncertainty. One of the key concepts in probability is the Complement of an event. The complement of an event provides a different perspective on computing the probabilities, that is, it is used to determine the likelihood of an event not occurring. Let us explore how the complement of an event is used for the computation of probability.

What is the Complement of an Event?

The complement of an event $E$, denoted by $E'$, encompasses all outcomes in the sample space that are not part of event $E$. In simple terms, if event $E$ represents a specific outcome or set of outcomes, its complement represents everything else that could occur.

For example, let the event $E$ be rolling a 4 on a six-sided die; the complement of event $E$, written $E'$, is rolling a 1, 2, 3, 5, or 6.

Note that event $E$ and its complement $E’$ cover the entire sample space of the die roll.

Complement Rule: Calculating Probabilities

A pivotal property of complementary events is that the sum of their probabilities is 1 (or 100%). This is because either the event happens or it does not happen; there are no other possibilities. It can be described as
$$P(E) + P(E') = 1$$
This leads to the complement rule, which states that
$$P(E') = 1 - P(E)$$
It is useful when computing the probability of an event not occurring.

Complement of an Event in Probability

Examples (Finding the Complement of an Event)

Suppose the probability that today is a rainy day is 0.3. The probability of it not raining today is $1 - 0.3 = 0.7$.

Similarly, the probability of rolling a 2 on a fair die is $P(E) = \frac{1}{6}$. The probability of not rolling a 2 is $P(E') = 1 - \frac{1}{6} = \frac{5}{6}$.

Why use the Complement Rule?

Sometimes, calculating the probability of the complement is easier than calculating the probability of the event itself. For example,

Question: What is the probability of getting at least one head in three coin tosses?
Solution: Instead of listing all possible favourable outcomes, one can easily use the complement rule. That is,
Complement Event: Getting no heads (all tails)
Probability of all tails = $\left(\frac{1}{2}\right)^3 = \frac{1}{8}$. Therefore, the probability of at least one head is

P(At least one head) = $1 - \frac{1}{8} = \frac{7}{8}$
This approach is quicker than counting all possible cases; that is, one can avoid enumerating all the favourable outcomes.
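
The same arithmetic can be verified with a few lines of SAS; the data-step sketch below is only illustrative, and the dataset and variable names are arbitrary.

/* Complement rule check: probability of at least one head in three tosses */
data complement_demo;
   p_all_tails = (1/2)**3;                 /* probability of no heads in three tosses */
   p_at_least_one_head = 1 - p_all_tails;  /* complement rule: 1 - P(all tails) */
   put p_at_least_one_head=;               /* writes p_at_least_one_head=0.875 to the log */
run;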

Properties of Complementary Events

  • Mutually Exclusive: An event and its complement cannot occur together (simultaneously)
  • Collectively Exhaustive: An event and its complement encompass all possible outcomes
  • Probability Sum: The probabilities of an event and its complement add up to 1.

Understanding complements in probability can make complex problems much simpler and easier.

Practical Applications

Understanding complements is invaluable in various fields:

  • Quality Control: Determining the probability of defects in manufacturing
  • Everyday Decisions: Estimating probabilities in daily life, such as the chance of missing a bus or the likelihood of rain.
  • Game Theory: Calculating chances of winning or losing scenarios
  • Risk Assessment: Evaluating the likelihood of adverse events not occurring

More Examples (Complement of an Event)

  • In a standard 52-card deck, what is the probability of not drawing a heart card?
$P(Not\,\,Heart) = 1 - P(Heart) = 1 - \frac{13}{52} = \frac{39}{52}$
  • If the probability of passing an examination is 0.85, what is the probability of failing it?
$P(Fail) = 1 - P(Pass) = 1 - 0.85 = 0.15$
  • If the probability that a flight will be delayed is 0.13, then the probability that it will not be delayed will be $1 - 0.13 = 0.87$
  • If $K$ is the event of drawing a king card from a well-shuffled 52-card deck, then $K'$ is the event that a king is not drawn, so $K'$ contains 48 possible outcomes.


PROC in SAS

This comprehensive Q&A-style guide about PROC in SAS Software breaks down fundamental SAS PROCs used in statistical analysis and data management. Learn:

  • What PROCs do and their key functions.
  • Differences between PROC MEANS & SUMMARY.
  • When to use PROC MIXED for mixed-effects models.
  • CANCORR vs CORR for multivariate vs bivariate analysis.
  • Sample PROC MIXED code with required statements.
  • How PROC PRINT & CONTENTS help inspect data.

Ideal for students learning SAS Programming and statisticians performing advanced analyses. Includes ready-to-use code snippets and easy comparisons!

Q&A PROC in SAS Software

Explain the functions of PROC in SAS.

PROC (Procedure) is a fundamental component of SAS programming that performs specific data analysis, reporting, or data management tasks. Each PROC is a pre-built routine designed to handle different statistical, graphical, or data processing operations. The key functions of PROC in SAS are:

  • Data Analysis & Statistics: PROCs perform statistical computations, including:
    • Descriptive Statistics (PROC MEANS, PROC SUMMARY, PROC UNIVARIATE)
    • Hypothesis Testing (PROC TTEST, PROC ANOVA, PROC GLM)
    • Regression & Modeling (PROC REG, PROC LOGISTIC, PROC MIXED)
    • Multivariate Analysis (PROC FACTOR, PROC PRINCOMP, PROC DISCRIM)
  • Data Management & Manipulation
    • Sorting (PROC SORT)
    • Transposing Data (PROC TRANSPOSE)
    • Merging & Combining Datasets (PROC SQL, PROC APPEND)
  • Reporting & Output Generation
    • Printing Data (PROC PRINT)
    • Creating Summary Reports (PROC TABULATE, PROC REPORT)
    • Generating Graphs (PROC SGPLOT, PROC GCHART)
  • Quality Control & Data Exploration
    • Checking Data Structure (PROC CONTENTS)
    • Identifying Missing Data (PROC FREQ with MISSING option)
    • Sampling Data (PROC SURVEYSELECT)
  • Advanced Analytics & Machine Learning
    • Cluster Analysis (PROC CLUSTER)
    • Time Series Forecasting (PROC ARIMA)
    • Text Mining (PROC TEXTMINER)

PROCs are the backbone of SAS programming, enabling data analysis, manipulation, and reporting with minimal coding. Choosing the right PROC depends on the task, whether it is statistical modeling, data cleaning, or generating business reports.
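
As a small, hedged illustration of how a few of these PROCs fit together, the following sketch sorts the bundled SASHELP.CLASS sample data, tabulates frequencies (including missing values), and summarizes a numeric variable by group; the choice of dataset and variables is arbitrary.

/* Illustrative sketch chaining a few common PROCs on the SASHELP.CLASS sample data */
proc sort data=sashelp.class out=class_sorted;
   by sex;                       /* BY-group processing requires sorted data */
run;

proc freq data=class_sorted;
   tables sex age / missing;     /* frequency tables, counting missing values as a level */
run;

proc univariate data=class_sorted;
   by sex;                       /* separate descriptive statistics for each sex */
   var height;
run;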

Explain the Difference Between PROC MEANS and PROC SUMMARY.

Both PROC MEANS and PROC SUMMARY in SAS compute descriptive statistics (e.g., mean, sum, min, max), but they differ in default behavior and output:

  • Default Output
    • PROC MEANS: Automatically prints results in the output window.
    • PROC SUMMARY: Does not print by default; requires the PRINT option.
  • Dataset Creation
    • Both can store results in a dataset using OUT=.
  • Handling of N Observations
    • PROC MEANS: Includes a default N (count) statistic.
    • PROC SUMMARY: Requires explicit specification of statistics.
  • Usage Context
    • Use PROC MEANS for quick interactive analysis.
    • Use PROC SUMMARY for programmatic, non-printed summaries.

PROC MEANS is more user-friendly for direct analysis, while PROC SUMMARY in SAS offers finer control for automated reporting.

With PROC MEANS, subgroup statistics are produced only when a BY statement is used, and the input data must first be sorted by the BY variables.

With PROC SUMMARY, statistics are produced automatically for all subgroups (for example, those defined by a CLASS statement), so all of the information is obtained in a single run, as the example below shows.
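
The difference in default behavior can be seen in the short sketch below, which uses the bundled SASHELP.CLASS sample data; the statistics requested and the output dataset name are arbitrary.

/* PROC MEANS prints its results to the output window by default */
proc means data=sashelp.class n mean std min max;
   var height weight;
run;

/* PROC SUMMARY produces no printed output by default; results go to an output dataset */
proc summary data=sashelp.class nway;
   class sex;
   var height weight;
   output out=class_stats mean= std= / autoname;   /* e.g., Height_Mean, Height_StdDev */
run;

proc print data=class_stats;   /* inspect the summarized dataset */
run;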

Introduction to PROC in SAS Software

What is the PROC MIXED Procedure in SAS/STAT used for?

PROC MIXED in SAS/STAT fits linear mixed models. A mixed model can accommodate different sources of variation in the data: it allows different variances for different groups and takes into account the correlation structure of repeated measurements.

PROC MIXED is essential for analyzing data with correlated observations or hierarchical structures. Its flexibility in modeling random effects and covariance makes it a cornerstone of advanced statistical analysis in SAS.

PROC MIXED is a powerful SAS procedure for fitting linear mixed-effects models, which account for both fixed and random effects in data. It is widely used for analyzing hierarchical, longitudinal, or clustered data where observations are correlated (e.g., repeated measures, multilevel data).

What is the Difference Between the CANCORR and CORR Procedures in SAS/STAT?

Both procedures analyze relationships between variables, but they serve distinct purposes:

1. PROC CORR (Correlation Analysis): Computes simple pairwise correlations (e.g., Pearson, Spearman). It is used to examine linear associations between two or more variables, or when there is no distinction between dependent and independent variables. Typical output includes the correlation matrix, p-values, and descriptive statistics. The code below tests how height, weight, and age are linearly related.

   PROC CORR DATA=my_data;
      VAR height weight age;
   RUN;
2. PROC CANCORR (Canonical Correlation Analysis): Analyzes multivariate relationships between two sets of variables. It is used to find linear combinations (canonical variables) that maximize the correlation between the two sets, and it is also useful for dimension reduction (e.g., linking psychological traits to behavioral measures). Typical output includes the canonical correlations, canonical coefficients, and a redundancy analysis.

   PROC CANCORR DATA=my_data;
      VAR set1_var1 set1_var2;    /* First variable set */
      WITH set2_var1 set2_var2;   /* Second variable set */
   RUN;

Key Differences Summary

Feature       | PROC CORR                        | PROC CANCORR
Analysis Type | Bivariate correlations           | Multivariate (set-to-set)
Variables     | Single list (no grouping)        | Two distinct sets (VAR & WITH)
Output Focus  | Pairwise coefficients (e.g., r)  | Canonical correlations (ρ)
Complexity    | Simple, descriptive              | Advanced, inferential

Write a sample program using the PROC MIXED procedure, including all the required statements.

/* The PROC MIXED and MODEL statements are required; CLASS, RANDOM, and REPEATED are optional */
proc mixed data=sashelp.iris plots=all;
   class species;                             /* Species is a classification variable */
   model petallength = species / solution;    /* fixed effect of Species on PetalLength */
   /* a RANDOM or REPEATED statement would be added here to model random effects */
run;

Describe what PROC PRINT and PROC CONTENTS are used for.

PROC CONTENTS displays information about a SAS dataset, while PROC PRINT is used to check that the data have been read correctly into the SAS dataset.

1. PROC CONTENTS: Displays metadata about a SAS dataset (structure, variables, attributes). Its key uses are:
  • Check variable names, types (numeric/character), lengths, and formats.
  • Identify dataset properties (e.g., number of observations, creation date).
  • Debug data import/export issues (e.g., mismatched formats).

The general syntax of PROC CONTENTS is

PROC CONTENTS DATA=your_data;
RUN;

2. PROC PRINT: Displays raw data from a SAS dataset in the output window. Its key uses are:
  • View actual observations and values.
  • Verify data integrity (e.g., missing values, unexpected codes).
  • Quick preview before analysis.

The general syntax of PROC PRINT is

PROC PRINT DATA=your_data (OBS=10);   /* Prints first 10 rows */
   VAR var1 var2;                     /* Optional: limit columns */
RUN;
