Discover the key differences between supervised and unsupervised learning in this quick Q&A guide. Learn what each type of learning is used for, the standard approach to supervised learning, and the common algorithms (such as kNN vs. k-means). Also, learn how these ideas apply to a classification task. Perfect for beginners in machine learning!
Supervised and Unsupervised Learning Questions and Answers
What is the function of Unsupervised Learning?
Unsupervised Learning is a type of machine learning where the model finds hidden patterns or structures in unlabeled data without any guidance (no predefined outputs). It’s used for clustering, dimensionality reduction, and anomaly detection. The functions of unsupervised learning are:
- Find clusters of the data
- Find low-dimensional representations of the data
- Find interesting directions in data
- Find interesting coordinates and correlations
- Find novel observations (useful for database cleaning)
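As a rough illustration of the last item, the sketch below flags a novel observation in a small list of numbers without using any labels. The readings, the median-absolute-deviation rule, and the cutoff of 5 are all assumptions chosen for the example, not a standard recipe.

```python
import statistics

# Hypothetical sensor readings; one value is clearly unlike the rest.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 42.0, 9.7]

# Robust centre and spread: the median and the median absolute deviation (MAD).
median = statistics.median(readings)
mad = statistics.median([abs(x - median) for x in readings])

# Flag values whose distance from the median is many MADs (cutoff is an assumption).
CUTOFF = 5
outliers = [x for x in readings if abs(x - median) / mad > CUTOFF]
print(outliers)
```

No labels were needed: the rule relies only on the structure of the data itself, which is what makes it unsupervised.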
What is the function of Supervised Learning?
Supervised Learning is a type of machine learning where the model learns from labeled data (input-output pairs) to make predictions or classifications. It’s used for tasks like regression (predicting values) and classification (categorizing data). The functions of supervised learning are:
- Classification
- Speech recognition
- Regression
- Time series prediction
- String annotation
How would you reduce the dimensionality of a large classification dataset on a machine with memory constraints?
You are given a training dataset with 1000 columns and 1 million rows. The dataset is based on a classification problem. Your manager has asked you to reduce the dimensionality of this data so that the model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)
Processing high-dimensional data on a limited memory machine is a strenuous task; your interviewer would be fully aware of that. The following are the methods you can use to tackle such a situation:
- Because the machine has limited RAM, one should close all other applications on the machine, including the web browser, so that most of the memory can be put to use.
- One can randomly sample the dataset. This means one can create a smaller dataset, for example, one with 1000 variables and 300,000 rows, and do the computations on it.
- To reduce dimensionality, one can separate the numerical and categorical variables and remove the correlated variables. For numerical variables, one should use correlation. For categorical variables, one should use the chi-square test.
- One can also use PCA and pick the components that can explain the maximum variance in the dataset.
- Using an online learning tool such as Vowpal Wabbit (which has Python bindings) is a possible option, because it processes the data incrementally instead of loading all of it into memory.
- Building a linear model using Stochastic Gradient Descent is also helpful.
- One can also apply business understanding to estimate which predictors can impact the response variable. But this is an intuitive approach; failing to identify useful predictors might result in a significant loss of information.
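Two of the steps above, random sampling and removing correlated numerical variables, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the toy data, the helper names `pearson` and `drop_correlated`, and the 0.9 threshold are all hypothetical, and a real 1000-column dataset would call for a vectorized numerical library rather than pure-Python loops.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(rows, threshold=0.9):
    """Return the indices of columns to keep, greedily dropping any column
    whose absolute correlation with an already-kept column exceeds threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per column
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[i])) <= threshold for i in kept):
            kept.append(j)
    return kept


# Hypothetical toy data: column 1 is nearly a copy of column 0,
# column 2 is independent of both.
rng = random.Random(0)
full_data = []
for _ in range(10_000):
    a = rng.gauss(0, 1)
    full_data.append([a, 2 * a + rng.gauss(0, 0.01), rng.gauss(0, 1)])

# Step 1: work on a random sample of rows instead of the full dataset.
sample = rng.sample(full_data, 3_000)

# Step 2: keep one representative of each group of highly correlated columns.
kept_columns = drop_correlated(sample, threshold=0.9)
reduced_sample = [[row[j] for j in kept_columns] for row in sample]
print(kept_columns)
```

The greedy rule keeps the first column of each correlated group, so the result depends on column order. For categorical variables, a chi-square test would play the role that `pearson` plays here.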
What is the standard approach to supervised learning?
The standard approach to supervised learning involves:
- Labeled Dataset: Input features paired with correct outputs.
- Training: The model learns patterns by minimizing prediction errors.
- Validation: Tuning hyperparameters to avoid overfitting.
- Testing: Evaluating performance on unseen data.
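The four steps above can be sketched end to end with a deliberately tiny model. Everything here is an assumption made for illustration: the synthetic data (y = 3x plus noise), the 60/20/20 split, the one-parameter ridge regression, and the candidate regularization strengths.

```python
import random

rng = random.Random(42)

# Labeled dataset: hypothetical inputs x paired with outputs y = 3x + noise.
data = []
for _ in range(1000):
    x = rng.uniform(-1, 1)
    data.append((x, 3 * x + rng.gauss(0, 0.5)))

# Split into training (60%), validation (20%), and test (20%) sets.
rng.shuffle(data)
train, val, test = data[:600], data[600:800], data[800:]


def fit_ridge(pairs, lam):
    """Training: closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / (sxx + lam)


def mse(pairs, w):
    """Mean squared prediction error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in pairs) / len(pairs)


# Validation: choose the hyperparameter with the lowest validation error.
candidates = [0.0, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: mse(val, fit_ridge(train, lam)))
best_w = fit_ridge(train, best_lam)

# Testing: evaluate once on data the model has never seen.
test_mse = mse(test, best_w)
print(best_lam, round(best_w, 3), round(test_mse, 3))
```

The test set is touched only once, after the hyperparameter has been chosen, so the reported error is an honest estimate of performance on unseen data.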
What are the common supervised learning algorithms?
The most common supervised learning algorithms are:
- Linear Regression: Predicts continuous values (e.g., house prices).
- Logistic Regression: Binary classification (e.g., spam detection).
- Decision Trees: Splits data into branches for classification/regression.
- Random Forest: An ensemble of decision trees for better accuracy.
- Support Vector Machines (SVM): Finds the optimal boundary for classification.
- k-Nearest Neighbors (k-NN): Classifies based on the closest data points.
- Naive Bayes: Probabilistic classifier based on Bayes’ theorem.
- Neural Networks: Layered models that learn complex, non-linear patterns (the basis of deep learning).
How is kNN different from k-means clustering?
Firstly, do not be misled by the ‘k’ in their names. The fundamental difference between the two algorithms is:
- k-means clustering is unsupervised (it is a clustering algorithm).
The k-means algorithm partitions a dataset into k clusters so that each cluster is homogeneous, meaning the points in it are close to each other, while keeping the clusters well separated from one another. Because the algorithm is unsupervised, the clusters have no labels.
- kNN is supervised (it is a classification or regression algorithm).
The kNN algorithm classifies an unlabeled observation based on the labels of its k nearest neighbors in the training data (k can be any positive integer). It is also known as a lazy learner because it involves minimal training: it does not build a generalized model up front, but stores the training data and defers the computation to prediction time.
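To make the contrast concrete, here is a minimal sketch of both algorithms in plain Python. The function names `knn_predict` and `kmeans`, the toy points, and the fixed iteration count are assumptions for illustration; the k-means part follows the usual assign-then-update loop (Lloyd's algorithm) with random initial centroids.

```python
import math
import random
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: majority label among the k training points nearest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def kmeans(points, k, iterations=20, seed=0):
    """Unsupervised: returns (centroids, assignments) after a fixed number
    of assign/update rounds, starting from k randomly chosen points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                       for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels = ["a", "a", "a", "b", "b", "b"]  # used by kNN only

# kNN needs the labels and returns one of them for a new observation.
print(knn_predict(points, labels, (0.1, 0.1), k=3))

# k-means never sees the labels; it returns arbitrary cluster ids.
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Note that `knn_predict` has no training step at all, which is the "lazy learner" behavior, while `kmeans` produces cluster ids (0 or 1) that carry no meaning beyond group membership.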