Currently working as an Assistant Professor of Statistics at Ghazi University, Dera Ghazi Khan.
I completed my Ph.D. in Statistics at the Department of Statistics, Bahauddin Zakariya University, Multan, Pakistan.
My interests include Applied Statistics, Mathematics, and Statistical Computing.
The statistical and mathematical software I use includes SAS, STATA, Python, GRETL, EVIEWS, R, SPSS, and VBA in MS-Excel.
I also like to use the LaTeX typesetting system for composing articles, theses, etc.
This post is about MCQs on Cluster Analysis. It contains 20 multiple-choice questions about clustering, covering topics such as k-means, k-median, k-means++, cosine similarity, k-medoids, Manhattan distance, etc. Let us start with the MCQs Cluster Analysis Quiz.
Online Multiple-Choice Questions about Cluster Analysis
Which of the following statements is true?
What are some common considerations and requirements for cluster analysis?
Which of the following statements is true?
Which of the following statements is true?
Which of the following statements about the K-means algorithm are correct?
Which of the following statements, if any, is FALSE?
In the figure below, map the figure to the type of link it illustrates.
In the figure below, map the figure to the type of link it illustrates.
In the figure below, map the figure to the type of link it illustrates.
Considering the k-median algorithm, if points $(-1, 3), (-3, 1),$ and $(-2, -1)$ are the only points that are assigned to the first cluster now, what is the new centroid for this cluster?
Which of the following statements about the K-means algorithm are correct?
Given the two-dimensional points (0, 3) and (4, 0), what is the Manhattan distance between those two points?
Given three vectors $A, B$, and $C$, suppose the cosine similarity between $A$ and $B$ is $cos(A, B) = 1.0$, and the similarity between $A$ and $C$ is $cos(A, C) = -1.0$. Can we determine the cosine similarity between $B$ and $C$?
Is K-means guaranteed to find K clusters that lead to the global minimum of the SSE?
The k-means++ algorithm is designed to better initialize K-means, which will take the farthest point from the currently selected centroids. Suppose $k = 2$ and we have chosen the first centroid as $(0, 0)$. Among the following points (these are all the remaining points), which one should we take for the second centroid?
Which of the following statements is true?
Suppose $X$ is a random variable with $P(X = -1) = 0.5$ and $P(X = 1) = 0.5$. In addition, we have another random variable $Y=X * X$. What is the covariance between $X$ and $Y$?
For k-means, will different initializations always lead to different clustering results?
In the k-medoids algorithm, after computing the new center for each cluster, is the center always guaranteed to be one of the data points in that cluster?
In the k-median algorithm, after computing the new center for each cluster, is the center always guaranteed to be one of the data points in that cluster?
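Two of the measures the quiz draws on, Manhattan distance and cosine similarity, can be computed in a few lines. The sketch below uses NumPy and illustrative points of my own choosing (not the quiz data):

```python
import numpy as np

def manhattan(p, q):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: (a . b) / (|a| |b|)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(manhattan((1, 2), (4, 6)))          # |1-4| + |2-6| = 7
print(cosine_similarity((1, 0), (0, 1)))  # orthogonal vectors -> 0.0
```

Note that cosine similarity depends only on the angle between the vectors, not their lengths, which is why $cos(A, B) = 1.0$ means $A$ and $B$ point in the same direction.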
This post is about the use of the t distribution in statistics. The t distribution, also known as Student's t-distribution, is a probability distribution used to estimate population parameter(s) when the sample size is small or when the population variance is unknown. The t distribution is similar to the normal bell-shaped distribution but has heavier tails. This means that it gives a lower probability to the center and a higher probability to the tails than the standard normal distribution.
The t distribution is particularly useful as it accounts for the extra variability that comes with small sample sizes, making it a more accurate tool for statistical analysis in such cases.
The following are common situations in which the t distribution is used:
Use of t Distribution: Confidence Intervals
The t distribution is widely used in constructing confidence intervals. In most cases, the width of the confidence interval depends on the degrees of freedom (sample size minus 1):
Confidence Interval for One Sample Mean $$\overline{X} \pm t_{\frac{\alpha}{2}} \left(\frac{s}{\sqrt{n}} \right)$$ where $t_{\frac{\alpha}{2}}$ is the upper $\frac{\alpha}{2}$ point of the t distribution with $v=n-1$ degrees of freedom and $s^2$ is the unbiased estimate of the population variance obtained from the sample, $s^2 = \frac{\Sigma (X_i-\overline{X})^2}{n-1} = \frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n-1}$
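As a small illustration, the one-sample interval above can be computed in Python with SciPy; the sample data here are hypothetical:

```python
import math
from statistics import mean, stdev  # stdev uses the unbiased (n-1) divisor
from scipy import stats

# Hypothetical small sample (n = 9)
x = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.1]
n = len(x)
xbar, s = mean(x), stdev(x)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # upper alpha/2 point, v = n-1
half_width = t_crit * s / math.sqrt(n)

ci = (xbar - half_width, xbar + half_width)
print(ci)
```

With $n = 9$ the critical value $t_{0.025, 8} \approx 2.306$ is noticeably larger than the normal value 1.96, which is exactly the extra width the t distribution adds for small samples.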
Confidence Interval for the Difference between Two Independent Sample Means: Let $X_{11}, X_{12}, \cdots, X_{1n_1}$ and $X_{21}, X_{22}, \cdots, X_{2n_2}$ be random samples of sizes $n_1$ and $n_2$ from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Let $\overline{X}_1$ and $\overline{X}_2$ be the respective sample means. The confidence interval for the difference between the two population means $\mu_1 - \mu_2$, when the population variances $\sigma_1^2$ and $\sigma_2^2$ are unknown and the sample sizes $n_1$ and $n_2$ are small (less than 30), is $$(\overline{X}_1 - \overline{X}_2) \pm t_{\frac{\alpha}{2}}\, S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$ where $S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$ (pooled variance), with $v = n_1+n_2-2$ degrees of freedom, and $s_1^2$ and $s_2^2$ are the unbiased estimates of the population variances $\sigma_1^2$ and $\sigma_2^2$, respectively.
Confidence Interval for Paired Observations The confidence interval for $\mu_d=\mu_1-\mu_2$ is $$\overline{d} \pm t_{\frac{\alpha}{2}} \frac{S_d}{\sqrt{n}}$$ where $\overline{d}$ and $S_d$ are the mean and standard deviation of the differences of $n$ pairs of measurements and $t_{\frac{\alpha}{2}}$ is the upper $\frac{\alpha}{2}$ point of the distribution with $n-1$ degrees of freedom.
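The paired-observations interval can be sketched the same way; the before/after measurements below are hypothetical:

```python
import math
from statistics import mean, stdev
from scipy import stats

# Hypothetical before/after measurements on the same n = 6 subjects
before = [85, 90, 78, 92, 88, 76]
after_ = [80, 87, 75, 90, 85, 74]
d = [b - a for b, a in zip(before, after_)]  # paired differences

n = len(d)
dbar, s_d = mean(d), stdev(d)             # mean and sd of the differences
t_crit = stats.t.ppf(0.975, df=n - 1)     # 95% interval, v = n - 1

margin = t_crit * s_d / math.sqrt(n)
ci = (dbar - margin, dbar + margin)
print(ci)
```

The key point is that the interval is built from the single sample of differences, so the degrees of freedom are $n - 1$ where $n$ is the number of pairs, not the total number of measurements.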
Use of t Distribution: Testing of Hypotheses
The t-tests are used to compare means between two groups or to test if a sample mean is significantly different from a hypothesized population mean.
Testing of Hypothesis for One Sample Mean It compares the mean of a single sample to a hypothesized population mean when the population standard deviation is unknown, $$t=\frac{\overline{X}-\mu}{\frac{s}{\sqrt{n}}}$$ with $v=n-1$ degrees of freedom.
Testing of Hypothesis for Difference between Two Population Means For two random samples of sizes $n_1$ and $n_2$ drawn from two normal populations having equal variances ($\sigma_1^2 = \sigma_2^2 = \sigma^2$), the test statistic is $$t=\frac{\overline{X}_1 - \overline{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$ with $v=n_1+n_2-2$ degrees of freedom.
Testing of Hypothesis for Paired/Dependent Observations To test the null hypothesis $H_0: \mu_d = d_o$, the test statistic is $$t=\frac{\overline{d} - d_o}{\frac{s_d}{\sqrt{n}}}$$ with $v=n-1$ degrees of freedom.
Testing the Coefficient of Correlation For $n$ pairs of observations $(X, Y)$ with sample correlation coefficient $r$, the test of significance (testing of hypothesis) for the correlation coefficient uses the statistic $$t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$ with $v=n-2$ degrees of freedom.
Testing the Regression Coefficients The t distribution is used to test the significance of regression coefficients in linear regression models. It helps determine whether a particular independent variable ($X$) has a significant effect on the dependent variable ($Y$). The regression coefficient can be tested using the statistic $$t=\frac{\hat{\beta} - \beta}{SE_{\hat{\beta}}}$$ where $SE_{\hat{\beta}} = \frac{S_{Y\cdot X}}{\sqrt{\Sigma (X-\overline{X})^2}}=\frac{\sqrt{\frac{\Sigma Y^2 - \hat{\beta}_0 \Sigma Y - \hat{\beta}_1 \Sigma XY }{n-2} } }{S_X \sqrt{n-1}}$
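The correlation t-test described above can be checked against SciPy's `pearsonr`, whose two-sided p-value is based on the same statistic; the paired data here are hypothetical:

```python
import math
from scipy import stats

# Hypothetical paired data (n = 10)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 3.4, 4.8, 5.1, 6.3, 6.8, 8.2, 8.9, 10.1]

n = len(x)
r, p_scipy = stats.pearsonr(x, y)

# t = r * sqrt(n-2) / sqrt(1 - r^2), with v = n - 2 degrees of freedom
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(t_stat, p_manual)
```

The manually computed two-sided p-value agrees with the one `pearsonr` reports, since both test $H_0: \rho = 0$ with the same transformation of $r$.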
The t distribution is a useful statistical tool for data analysis as it allows the user to make inferences/conclusions about population parameters even when there is limited information about the population.
Suppose we have a population of size $N$ with mean $\mu$ and variance $\sigma^2$. We draw all possible samples of size $n$ from this population, with or without replacement. We then compute the mean of each sample and denote it by $\overline{x}$. These means are classified into a frequency table, called the frequency distribution of means, and the probability distribution of the means is called the sampling distribution of means.
Sampling Distribution
A sampling distribution is defined as the probability distribution of the values of a sample statistic such as mean, standard deviation, proportions, or difference between means, etc., computed from all possible samples of size $n$ from a population. Some of the important sampling distributions are:
Sampling Distribution of Means
Sampling Distribution of the Difference Between Means
Sampling Distribution of the Proportions
Sampling Distribution of the Difference between Proportions
Sampling Distribution of Variances
Notations of Sampling Distribution of Means
The following notations are used for the sampling distribution of means:
$\mu$: Population mean
$\sigma^2$: Population variance
$\sigma$: Population standard deviation
$\mu_{\overline{X}}$: Mean of the sampling distribution of means
$\sigma^2_{\overline{X}}$: Variance of the sampling distribution of means
$\sigma_{\overline{X}}$: Standard deviation of the sampling distribution of means
Formulas for Sampling Distribution of Means
The following formulas can be used to compute the mean, variance, and standard deviation of the sampling distribution of means:
$\mu_{\overline{X}} = \mu$
$\sigma^2_{\overline{X}} = \frac{\sigma^2}{n}$ (sampling with replacement)
$\sigma^2_{\overline{X}} = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$ (sampling without replacement)
$\sigma_{\overline{X}} = \sqrt{\sigma^2_{\overline{X}}}$
A population of size $N=5$ has values 2, 4, 6, 8, and 10. Draw all possible samples of size 2 from this population with and without replacement. Construct the sampling distribution of sample means. Find the mean, variance, and standard deviation of the population and verify the following:
Standard Deviation (sampling without replacement): $\sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} = \sqrt{3} \approx 1.73$
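The worked example above can be verified by brute force, enumerating every possible sample of size 2 and computing the mean and variance of the resulting sample means:

```python
from itertools import product, combinations
from statistics import mean

population = [2, 4, 6, 8, 10]
N, n = len(population), 2

mu = mean(population)                                  # population mean = 6
sigma2 = sum((x - mu) ** 2 for x in population) / N    # population variance = 8

def dist_of_means(samples):
    """Mean and variance of the sampling distribution of means."""
    means = [mean(s) for s in samples]
    m = mean(means)
    v = sum((xb - m) ** 2 for xb in means) / len(means)
    return m, v

# With replacement: N^n = 25 ordered samples
m_wr, v_wr = dist_of_means(list(product(population, repeat=n)))

# Without replacement: C(N, n) = 10 unordered samples
m_wor, v_wor = dist_of_means(list(combinations(population, n)))

print(m_wr, v_wr)    # mu = 6, sigma^2 / n = 8/2 = 4
print(m_wor, v_wor)  # mu = 6, (sigma^2/n) * (N-n)/(N-1) = 4 * 3/4 = 3
```

The enumeration confirms both formulas: the variance of the sample means is $\sigma^2/n = 4$ with replacement and $\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} = 3$ without replacement, giving $\sigma_{\overline{X}} = \sqrt{3} \approx 1.73$ in the second case.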
Why is Sampling Distribution Important?
Inference: Sampling distribution of means allows users to make inferences about the population mean based on sample data.
Hypothesis Testing: It is crucial for hypothesis testing, where the researcher compares sample statistics to population parameters.
Confidence Intervals: It helps construct confidence intervals, which provide a range of values likely to contain the population mean.
Note that the sampling distribution of means provides a framework for understanding how sample means vary from sample to sample and how they relate to the population mean. This understanding is fundamental to statistical inference and decision-making.
The Poisson Probability Distribution is discrete and deals with events that can only take on specific, whole number values (like the number of cars passing a certain point in an hour). Poisson Probability Distribution models the probability of a given number of events occurring in a fixed interval of time or space, given a known average rate of occurrence ($\mu$). The events must be independent of each other and occur randomly.
The Poisson probability function gives the probability for the number of events that occur in a given interval (often a period of time) assuming that events occur at a constant rate during the interval.
Poisson Random Variable
The Poisson random variable satisfies the following conditions:
The number of successes in two disjoint time intervals is independent.
The probability of a success during a small time interval is proportional to the length of the interval.
The probability of two or more events occurring in a very short interval is negligible.
Apart from disjoint time intervals, the Poisson random variable is also applied to disjoint regions of space.
Applications of Poisson Probability Distribution
The following are a few of the applications of Poisson Probability Distribution:
The number of deaths by horse kicking in the Prussian Army (it was the first application).
Birth defects and genetic mutations.
Rare diseases (like Leukemia, but not AIDS because it is infectious and so not independent), especially in legal cases.
Car accidents
Traffic flow and ideal gap distance
Hairs found in McDonald’s hamburgers
Spread of an endangered animal in Africa
Failure of a machine in one month
The formula of Poisson Distribution
The probability distribution of a Poisson random variable $X$, representing the number of successes occurring in a given time interval or specified region of space, is given by $$P(X=x)=\frac{e^{-\mu}\,\mu^x}{x!}, \quad x = 0, 1, 2, \cdots$$
where $P(X=x)$ is the probability of $x$ events occurring, $e$ is the base of the natural logarithm (approximately 2.71828), $\mu$ is the mean number of successes in the given time interval (or region of space), $x$ is the number of events we are interested in, and $x!$ is the factorial of $x$.
Mean and Variance of Poisson Distribution
If $\mu$ is the average number of successes occurring in a given time interval (or region) in the Poisson distribution, then the mean and the variance of the Poisson distribution are both equal to $\mu$. That is, $$E(X) = Var(X) = \mu$$
The Poisson distribution has only one parameter, $\mu$, which is needed to determine the probability of an event. For binomial experiments involving rare events (small $p$) and large values of $n$, the distribution of $X=$ the number of successes out of $n$ trials is binomial, but it is also well approximated by the Poisson distribution with mean $\mu=np$.
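A short Python sketch (the values of $\mu$, $n$, and $p$ are chosen purely for illustration) shows the pmf formula agreeing with SciPy, and the Poisson approximation to a binomial with large $n$ and small $p$:

```python
from math import exp, factorial
from scipy import stats

mu = 2.5  # hypothetical mean rate

def poisson_pmf(x, mu):
    """P(X = x) = e^(-mu) * mu^x / x!"""
    return exp(-mu) * mu**x / factorial(x)

# The hand-rolled pmf agrees with scipy's implementation
print(poisson_pmf(3, mu), float(stats.poisson.pmf(3, mu)))

# Poisson approximation to the binomial: large n, small p, mu = n * p
n, p = 1000, 0.0025
print(float(stats.binom.pmf(3, n, p)), poisson_pmf(3, n * p))
```

For $n = 1000$ and $p = 0.0025$ the binomial and Poisson probabilities agree to about three decimal places, which is the practical content of the $\mu = np$ approximation.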
When to Use Poisson Probability Distribution
The Poisson distribution is useful in various scenarios:
Modeling Rare Events: Like accidents, natural disasters, or equipment failures.
Counting Events in a Fixed Interval: Such as the number of customers arriving at a store in an hour, or the number of calls to a call center in a minute.
Approximating the Binomial Distribution: When the number of trials ($n$) is large and the probability of success ($p$) is small.
It is important to note that:
The Poisson distribution is related to the exponential distribution, which models the time between events.
It is a fundamental tool in probability theory and statistics, with applications in fields like operations research, queuing theory, and reliability engineering.