Cumulative Frequency Distribution and Polygon (2012)

Introduction to Cumulative Frequency Distribution

A cumulative frequency distribution (plotted as a cumulative frequency curve or ogive) and a cumulative frequency polygon are both built from cumulative frequencies. The cumulative frequency of a class interval, denoted by CF, is obtained by adding the frequencies of all the preceding classes to the frequency of that class. It indicates the total number of values less than or equal to the upper limit of that class. For comparing two or more distributions, relative cumulative frequencies or percentage cumulative frequencies are computed.

The relative cumulative frequency of a class, denoted by CRF, is the cumulative frequency expressed as a proportion: it is obtained by dividing the cumulative frequency by the total frequency (the total number of observations). The CRF of a class can also be obtained by adding the relative frequencies (rf) of the preceding classes to that of the class itself. Multiplying the cumulative relative frequency by 100 gives the corresponding percentage cumulative frequency of the class.

Method of Construction of Cumulative Frequencies

The method of construction of cumulative frequencies and cumulative relative frequencies is explained in the following table:

Cumulative Frequency Distribution
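
As a rough illustration of the construction, the following Python sketch computes CF, CRF, and percentage CF for a small grouped data set. The upper class limits and frequencies are made-up values chosen for illustration, not data from this post.

```python
from itertools import accumulate

# Hypothetical grouped data (illustrative only)
upper_limits = [10, 20, 30, 40, 50]   # upper limit of each class
freq = [4, 6, 10, 7, 3]               # class frequencies

total = sum(freq)                     # total number of observations
cf = list(accumulate(freq))           # cumulative frequency (CF)
crf = [c / total for c in cf]         # cumulative relative frequency (CRF)
pcf = [100 * r for r in crf]          # percentage cumulative frequency

for u, f, c, r, p in zip(upper_limits, freq, cf, crf, pcf):
    print(f"<= {u:>3}  f={f:>2}  CF={c:>2}  CRF={r:.3f}  %CF={p:.1f}")
```

The last CF always equals the total frequency, so the last CRF is 1 and the last percentage CF is 100.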

Plot a Cumulative Frequency Distribution

To plot a CF distribution, mark the upper limit (upper class boundary) of each class along the x-axis and the corresponding cumulative frequencies along the y-axis. For additional information, you can label the vertical axis on the left in units and the vertical axis on the right in percent. The plotted points are joined by straight line segments; a "less than" ogive is plotted against the upper class boundaries, whereas a "more than" ogive is plotted against the lower class boundaries. The cumulative frequency polygon can be used to read off the median, quartiles, deciles, percentiles, etc.
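
Reading a quantile off a "less than" ogive amounts to linear interpolation between the plotted points. The sketch below shows one way to do this in Python; the class boundaries and frequencies are hypothetical, and `ogive_quantile` is an illustrative helper name rather than a library function.

```python
from itertools import accumulate

# Hypothetical class boundaries and frequencies (illustrative only)
boundaries = [0, 10, 20, 30, 40, 50]   # k + 1 boundaries for k classes
freq = [4, 6, 10, 7, 3]

# Ogive points: cumulative frequency at each boundary, starting from 0
cf = [0] + list(accumulate(freq))
total = cf[-1]

def ogive_quantile(p):
    """Value below which a proportion p of the observations falls,
    found by linear interpolation along the 'less than' ogive."""
    target = p * total
    if target <= 0:
        return boundaries[0]
    for i in range(1, len(cf)):
        if cf[i] >= target:
            lo, hi = boundaries[i - 1], boundaries[i]
            # fraction of the way through the class that contains the target
            frac = (target - cf[i - 1]) / (cf[i] - cf[i - 1])
            return lo + frac * (hi - lo)
    return boundaries[-1]

median = ogive_quantile(0.50)
q1, q3 = ogive_quantile(0.25), ogive_quantile(0.75)
print(median, q1, q3)
```

Deciles and percentiles follow in the same way by passing, for example, `0.1` or `0.95` as `p`.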


Cumulative Frequency Polygon (Ogive)

Inverse Regression Analysis or Calibration (2012)

In most regression problems, we have to determine the value of $Y$ corresponding to a given value of $X$. The inverse of this problem, determining the value of $X$ corresponding to an observed value of $Y$, is called inverse regression analysis or calibration.

Inverse Regression Analysis

For inverse regression analysis, let the known values be represented by the matrix $X$ and their corresponding responses by the vector $Y$, which together follow a simple linear regression model. Suppose there is an unknown value of $X$, say $X_0$, which cannot be measured, but for which we observe the corresponding value of $Y$, say $Y_0$. Then $X_0$ can be estimated, and a confidence interval for $X_0$ can be obtained.

In regression analysis, we want to investigate the relationship between variables. Regression has many applications, which occur in many fields: engineering, economics, the physical and chemical sciences, management, biological sciences, and social sciences. We only consider the simple linear regression model, which is a model with one regressor $X$ that has a linear relationship with a response $Y$. It is not always easy to measure the regressor $X$ or the response $Y$.

Let us consider a typical example of this problem. Let $X$ be the concentration of glucose in certain substances, and suppose a spectrophotometric method is used to measure the absorbance $Y$, which depends on the concentration $X$. The response $Y$ is easy to measure with the spectrophotometric method, whereas the concentration is not. If we have $n$ known concentrations, the absorbance can be measured at each of them.

If there is a linear relation between $Y$ and $X$, then a simple linear regression model can be fitted to these data. Suppose we have an unknown concentration that is difficult to measure, but we can measure the absorbance at this concentration. Is it possible to estimate this concentration from the measured absorbance? This is called the calibration problem or inverse regression analysis.

Suppose, we have a linear model $Y=\beta_0+\beta_1X+e$ and we have an observed value of the response $Y$, but we do not have the corresponding value of $X$. How can we estimate this value of $X$? The two most important methods to estimate $X$ are the classical method and the inverse method.

The classical method of inverse regression analysis is based on the simple linear regression model

$Y=\beta_0+\beta_1X+\varepsilon,$   where $\varepsilon \sim N(0, \, \sigma^2)$

where the parameters $\beta_0$ and $\beta_1$ are estimated by least squares as $\hat{\beta}_0$ and $\hat{\beta}_1$. At least two of the $n$ values of $X$ have to be distinct; otherwise, a regression line cannot be fitted. At an unknown value of $X$, say $X_0$, a $Y$ value, say $Y_0$ (or a random sample of $k$ values of $Y$), is observed. For inverse regression analysis, the problem is to estimate $X_0$. The classical method uses the observed $Y_0$ (or the mean of the $k$ observed values) and inverts the fitted line, so that $X_0$ is estimated by $\hat{X}_0=\frac{Y_0-\hat{\beta}_0}{\hat{\beta}_1}$.
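
A minimal Python sketch of the classical method is given below. The calibration data and the observed response `y0` are hypothetical values chosen for illustration, and `least_squares` is an illustrative helper, not a library function.

```python
# Hypothetical calibration data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]        # known concentrations
y = [2.1, 3.9, 6.2, 7.8, 10.1]       # measured absorbances

def least_squares(u, v):
    """Return (intercept, slope) of the least-squares line v = b0 + b1 * u."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)   # zero if all u are equal
    slope = s_uv / s_uu
    return v_bar - slope * u_bar, slope

# Step 1: regress Y on X using the calibration data
b0, b1 = least_squares(x, y)

# Step 2: invert the fitted line at the newly observed response
y0 = 5.0                              # hypothetical observed absorbance
x0_classical = (y0 - b0) / b1
print(f"b0={b0:.4f}, b1={b1:.4f}, classical estimate of X0={x0_classical:.4f}")
```

The division by `s_uu` is where the requirement of at least two distinct $X$ values shows up: if all $X$ values coincide, the slope is undefined.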

scatter with regression line: Inverse Regression Analysis

The inverse estimator is the simple linear regression of $X$ on $Y$. In this case, we have to fit the model

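\[X=a_0+a_1Y+e, \quad \text{where } e \sim N(0, \sigma^2)\]

to obtain the estimator. The inverse estimator of $X_0$ is then

\[\hat{X}_0=\hat{a}_0+\hat{a}_1Y_0,\]

where $\hat{a}_0$ and $\hat{a}_1$ are the least squares estimates of $a_0$ and $a_1$.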
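
The following Python sketch computes the inverse estimator and, for comparison, the classical one on the same data. All data values and the observed response `y0` are hypothetical and serve only as an illustration.

```python
# Hypothetical calibration data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]        # known concentrations
y = [2.1, 3.9, 6.2, 7.8, 10.1]       # measured absorbances
y0 = 5.0                              # hypothetical observed absorbance

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

# Inverse method: regress X on Y, then predict X at the observed Y0
a1 = s_xy / s_yy
a0 = x_bar - a1 * y_bar
x0_inverse = a0 + a1 * y0

# Classical method: regress Y on X, then invert the fitted line
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
x0_classical = (y0 - b0) / b1

r2 = s_xy ** 2 / (s_xx * s_yy)        # squared correlation of X and Y
print(f"inverse={x0_inverse:.4f}, classical={x0_classical:.4f}, r^2={r2:.4f}")
```

Algebraically, the inverse estimate equals $\bar{X} + r^2(\hat{X}_{0,\text{classical}} - \bar{X})$, so it is pulled towards $\bar{X}$ relative to the classical estimate; the two nearly coincide when the correlation is close to 1.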

Important Considerations when performing Inverse Regression

  • Inverse regression can be statistically challenging, especially when the errors are mainly in the independent variables (which become the dependent variables in the inverse model).
  • It is not a perfect replacement for traditional regression, and the assumptions underlying the analysis may differ.
  • In some cases, an errors-in-variables approach (such as Deming regression), which treats both variables as having errors, might be more suitable.

In summary, inverse regression is a statistical technique that flips the roles of the independent and dependent variables in a regression model.


Binomial Probability Distribution (2012)

We first need to understand the Bernoulli Trials to learn about Binomial Probability Distribution.

Bernoulli Trials

Many experiments consist of repeated independent trials and each trial has only two possible outcomes such as head or tail, right or wrong, alive or dead, defective or non-defective, etc. If the probability of each outcome remains the same (constant) throughout the trials, then such trials are called the Bernoulli Trials.

Binomial Probability Distribution

The Binomial Probability Distribution is a discrete probability distribution describing the results of an experiment known as the Bernoulli Process. An experiment consisting of $n$ Bernoulli trials is called a Binomial probability experiment, and it satisfies the following four conditions/assumptions:

  1. The experiment consists of $n$ repeated trials.
  2. Each trial results in an outcome that may be classified as success or failure.
  3. The probability of success denoted by $p$ remains constant from trial to trial.
  4. The repeated trials are independent.

If each trial results in a success with probability $p$ and in a failure with probability $q=1-p$, then the probability distribution of the Binomial random variable $X$, the number of successes in $n$ independent trials (with $x$ successes and $n-x$ failures, $x=0,1,\ldots,n$), is:

\begin{align*}
P(X=x)&=\binom{n}{x} \, p^x \, q^{n-x} \\
&=\frac{n!}{x!(n-x)!}\, p^x \, q^{n-x}
\end{align*}
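
The formula can be evaluated directly. The Python sketch below is a minimal illustration; `binomial_pmf` is an illustrative helper name, and the values of `n` and `p` are arbitrary.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a Binomial(n, p) random variable."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Hypothetical example: n = 10 trials with success probability p = 0.3
n, p = 10, 0.3
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, pr in enumerate(probs):
    print(f"P(X = {x:>2}) = {pr:.4f}")
print("sum of probabilities:", sum(probs))
```

The probabilities over $x=0,1,\ldots,n$ sum to 1 (up to floating-point rounding), as they must for a probability distribution.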

Binomial Probability Distribution

The Binomial probability distribution is the most widely used distribution in situations with two possible outcomes. It was discovered by the Swiss mathematician Jakob Bernoulli (1654–1705), whose main work, “Ars Conjectandi” (The Art of Conjecturing), was published posthumously in Basel in 1713.

Mean of Binomial Distribution:   Mean = $\mu = np$

Variance of Binomial Distribution:  Variance = $\sigma^2 = npq$, where $q=1-p$

Standard Deviation of Binomial Distribution:  Standard Deviation = $\sigma = \sqrt{npq}$

Moment Coefficient of Skewness:

\begin{align*}
\gamma_1 = \sqrt{\beta_1} &= \frac{q-p}{\sqrt{npq}}  \\
&= \frac{1-2p}{\sqrt{npq}}
\end{align*}

Moment Coefficient of Kurtosis:  $\beta_2 = 3+\frac{1-6pq}{npq}$
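
As a numerical check of the mean, variance, skewness, and kurtosis formulas above, the following Python sketch computes the moments directly from the probability distribution and compares them with the closed-form expressions. The values of `n` and `p` are arbitrary illustrative choices.

```python
from math import comb, sqrt

# Hypothetical parameters (illustrative only)
n, p = 10, 0.3
q = 1 - p

# Probability of each possible number of successes x = 0, 1, ..., n
pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

mean = sum(x * f for x, f in enumerate(pmf))

def central_moment(k):
    """k-th moment about the mean, computed from the distribution itself."""
    return sum((x - mean) ** k * f for x, f in enumerate(pmf))

variance = central_moment(2)
skewness = central_moment(3) / variance ** 1.5   # moment coefficient of skewness
kurtosis = central_moment(4) / variance ** 2     # moment coefficient of kurtosis

print(mean, n * p)
print(variance, n * p * q)
print(skewness, (q - p) / sqrt(n * p * q))
print(kurtosis, 3 + (1 - 6 * p * q) / (n * p * q))
```

Each printed pair should agree up to floating-point rounding.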

Application of Binomial Probability Distribution

  • Quality control: In manufacturing, Binomial Probability Distribution can be used to determine the probability of finding a defective product in a batch.
  • Medical testing: It can be used to assess the probability of a specific number of positive test results in a group.
  • Opinion polls: Binomial Probability Distribution can be used to estimate the margin of error in a poll by considering the probability of getting a certain number of votes for a particular candidate.

By understanding the binomial distribution, you can analyze the probability of success in various scenarios with two possible outcomes.

FAQs about Binomial Probability Distribution

  1. What is a Binomial Experiment?
  2. Define the Binomial distribution.
  3. What are the important Assumptions of a Binomial experiment?
  4. What are the important applications of Binomial distribution?
  5. What are the characteristics of Binomial distribution?
  6. Write the probability distribution formula for a Binomial random variable.

Coefficient of Determination: Model Selection (2012)

$R^2$, pronounced R-squared (the coefficient of determination), is a useful statistic for checking the fit of a regression. $R^2$ measures the proportion of total variation about the mean $\bar{Y}$ explained by the regression. $R$ is the correlation between $Y$ and $\hat{Y}$ and is usually called the multiple correlation coefficient. The coefficient of determination satisfies $0\le R^2\le 1$ and can take values as high as 1 (or 100%) when all the $X$ values are different.

Coefficient of Determination

When repeat runs exist in the data, the value of $R^2$ cannot attain 1, no matter how well the model fits, because no model can explain the variation in the data due to pure error. For a perfect fit to the data, in which $\hat{Y}_i=Y_i$ for every $i$, $R^2=1$. If $\hat{Y}_i=\bar{Y}$, that is, if $\beta_1=\beta_2=\cdots=\beta_{p-1}=0$ or if the model $Y=\beta_0 +\varepsilon$ alone has been fitted, then $R^2=0$. Therefore, $R^2$ is a measure of the usefulness of the terms other than $\beta_0$ in the model.

Note that an improvement (increase) in the $R^2$ value due to adding a new term (variable) to the model under study should have some real significance, and should not arise merely because the number of parameters in the model is getting close to the saturation point. If there is no pure error, $R^2$ can be made unity.

\begin{align*}
R^2 &= \frac{\text{SS due to regression given}\, b_0}{\text{Total SS corrected for mean} \, \bar{Y}} \\
&= \frac{SS \, (b_1 | b_0)}{S_{YY}} \\
&= \frac{\sum(\hat{Y}_i-\bar{Y})^2} {\sum(Y_i-\bar{Y})^2} \\
&= \frac{S^2_{XY}}{(S_{XX})(S_{YY})}
\end{align*}

where the summations are over $i=1,2,\cdots, n$.
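
For a simple linear regression, the different expressions for $R^2$ give the same number. The Python sketch below checks this on a small made-up data set; the data are hypothetical and the variable names are illustrative.

```python
# Hypothetical data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)            # total SS corrected for the mean
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# Least-squares fit of Y on X and the fitted values
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

ss_reg = sum((f - y_bar) ** 2 for f in y_hat)      # SS due to regression given b0
ss_res = sum((b - f) ** 2 for b, f in zip(y, y_hat))  # residual SS

r2_from_ss = ss_reg / s_yy
r2_from_sxy = s_xy ** 2 / (s_xx * s_yy)
r2_from_residuals = 1 - ss_res / s_yy

print(r2_from_ss, r2_from_sxy, r2_from_residuals)
```

The third form, $1 - \text{residual SS}/S_{YY}$, is not listed in the derivation above but is equivalent whenever the model includes an intercept.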


Interpreting R-Square

$R^2$ does not indicate whether:

  • the independent variables (explanatory variables) are a cause of the changes in the dependent variable;
  • omitted-variable bias exists;
  • the correct regression was used;
  • the most appropriate set of explanatory variables has been selected;
  • there is collinearity (or multicollinearity) present in the data;
  • the model might be improved using transformed versions of the existing explanatory variables.
