The t-distribution was derived by W. S. Gosset (who published under the pen name “Student”) and further developed by R. A. Fisher. The entries in Student’s t table are the critical values (percentiles) of the t distribution. The applications of Student’s t distribution relate to (i) the sampling distribution of the mean $\overline{x}$, (ii) the distribution of a difference $(\overline{x}_1 - \overline{x}_2)$ of the means of two independent populations, (iii) the distribution of the mean difference for paired (dependent) observations, and (iv) the significance of the correlation coefficient. It is also used for constructing confidence intervals for small samples. The Student’s t distribution is a crucial tool in statistical analysis, especially when dealing with small sample sizes: it helps us make informed decisions based on our data even when the population standard deviation is unknown.
The Student’s t variable can be generated by dividing a standard normal random variable $Z$ by the square root of an independent $\chi^2_v$ random variable that has itself been divided by its parameter $v$ (the degrees of freedom). That is,
$$t = \frac{Z}{\sqrt{\chi^2_v / v}}.$$
The t distribution is symmetric about zero and has heavier tails than the normal density. It is unimodal, and it tends to the normal distribution as $v \rightarrow \infty$. Its probability density function is
$$f(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}, \qquad -\infty < t < \infty,$$
where $\Gamma(x)$ indicates the Gamma function.
Moments of t Distribution
Since the t distribution is symmetric and its PDF is centered at zero, the expectation (average), the median, and the mode are all zero for the t distribution with $v$ degrees of freedom. The variance ($\sigma^2$) equals $\frac{v}{v-2}$ (for $v > 2$), and the excess kurtosis is $\frac{6}{v-4}$ (for $v > 4$).
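These moment formulas can be checked empirically; the sketch below draws a large sample using NumPy's built-in t sampler (the choice $v = 10$ is arbitrary, for illustration only):

```python
import numpy as np

v = 10  # degrees of freedom (arbitrary illustrative choice)
rng = np.random.default_rng(0)
samples = rng.standard_t(df=v, size=1_000_000)

# Sample moments should sit close to the theoretical values
print(samples.mean())  # near 0
print(samples.var())   # near v/(v-2) = 1.25
```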
For a bivariate normal population, the distribution of the sample correlation coefficient $r$ is linked with Student’s t distribution through the transformation
$$t = r\sqrt{\frac{n-2}{1-r^2}},$$
which, under the null hypothesis of zero correlation, follows the t distribution with $v = n - 2$ degrees of freedom.
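As a sketch of this link (the simulated data below is illustrative), the two-sided p-value computed from this t statistic agrees with the one reported by SciPy's `pearsonr`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # simulated bivariate data

r, p_scipy = stats.pearsonr(x, y)

# Transform r to a t statistic with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
```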
The following algorithm can be used to generate random variates from the Student’s $t(v)$ distribution using serially generated independent uniform $U(0,1)$ random variates.
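A minimal sketch of such an algorithm (assuming the Box–Muller transform for turning uniform variates into standard normals, and a $\chi^2_v$ variate built as the sum of $v$ squared standard normals):

```python
import math
import random

def std_normal(rng):
    """One standard normal variate from two U(0,1) draws (Box-Muller)."""
    u1, u2 = 1.0 - rng.random(), rng.random()  # 1 - u1 avoids log(0)
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def t_variate(v, rng):
    """One Student's t(v) variate: t = Z / sqrt(chi2_v / v)."""
    z = std_normal(rng)
    chi2 = sum(std_normal(rng) ** 2 for _ in range(v))  # chi-square(v)
    return z / math.sqrt(chi2 / v)

rng = random.Random(1)
sample = [t_variate(10, rng) for _ in range(100_000)]
```

For $v = 10$, the sample mean should be close to 0 and the sample variance close to $v/(v-2) = 1.25$.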
A simple linear regression model is one of the most fundamental techniques in machine learning and statistics. Whether you are a data science newbie or just brushing up on the basics, understanding linear regression is essential.
Introduction
Frequently, we measure two or more variables on each individual and try to express the nature of the relationship between these variables (for example, in the simple linear regression model and correlation analysis). Using the regression technique, we estimate the relationship of one variable with another by expressing one as a linear (or more complex) function of the other. We can also predict the values of one variable in terms of the other. The variables involved in regression and correlation analysis are continuous. In this post, we will learn about the Simple Linear Regression Model.
Functional Relationship Between Variables
We are interested in establishing significant functional relationships between two (or more) variables. For example, the function $Y = f(X) = a + bX$ (read as $Y$ is a function of $X$) establishes a relationship to predict the values of variable $Y$ for given values of variable $X$. In statistics (biostatistics), this function is called the simple linear regression model, or the regression equation.
The variable $Y$ is called the dependent (response) variable, and $X$ is called the independent (regressor or explanatory) variable.
In biology, many relationships are appropriate over only a limited range of values of $X$; negative values are meaningless for variables such as age, height, weight, and body temperature.
The method of linear regression is used to estimate the best-fitting straight line to describe the relationship between variables. The linear regression gives the equation of the straight line that best describes how the outcome of $Y$ increases/decreases with an increase/decrease in the explanatory variable $X$. The equation of the regression line is $$Y=\beta_0 + \beta_1 X,$$ where $\beta_0$ is the intercept (value of $Y$ when $X=0$) and $\beta_1$ is the slope of the line. Both $\beta_0$ and $\beta_1$ are the parameters (or regression coefficients) of the linear equation.
Estimation of Regression Coefficients in Simple Linear Regression Model
The best-fitting line is derived using the method of *least squares*: finding the values of the parameters $\beta_0$ and $\beta_1$ that minimize the sum of the squared vertical distances of the points from the regression line.
The best-fit line passes through the point ($\overline{X}, \overline{Y}$).
The regression line $Y=\beta_0+\beta_1X$ is fit by the least-squares method. The regression coefficients $\beta_0$ and $\beta_1$ are both calculated to minimize the sum of squares of the vertical deviations of the points about the regression line. Each deviation equals the difference between the observed value of $Y$ and the estimated value of $Y$ (the corresponding point on the regression line).
The following table shows the *body weight* and *plasma volume* of eight healthy men.

| Subject | Body Weight (kg) | Plasma Volume (liters) |
|---------|------------------|------------------------|
| 1       | 58.0             | 2.75                   |
| 2       | 70.0             | 2.86                   |
| 3       | 74.0             | 3.37                   |
| 4       | 63.5             | 2.76                   |
| 5       | 62.0             | 2.62                   |
| 6       | 70.5             | 3.49                   |
| 7       | 71.0             | 3.05                   |
| 8       | 66.0             | 3.12                   |
Estimation of Parameters
The parameters $\beta_0$ and $\beta_1$ of the simple linear regression model are estimated using the least-squares formulas
$$\hat{\beta}_1 = \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2}, \qquad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}.$$
Regression coefficients are sometimes known as “beta coefficients”. When the slope $\beta_1 = 0$, there is no linear relationship between the $X$ and $Y$ variables. For the data above, the best-fitting straight line describing the relationship of plasma volume with body weight is $$Plasma\, Volume = 0.0857 + 0.0436\times Weight.$$ Note that the calculated values for $\beta_0$ and $\beta_1$ are estimates of the population values and, therefore, subject to sampling variation.
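The calculation for the table above can be sketched in a few lines of Python (plain arithmetic applying the least-squares formulas, no fitted library routine):

```python
# Body weight (kg) and plasma volume (liters) for the eight subjects
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n

# Sums of cross-products and squares about the means
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
s_xx = sum((x - x_bar) ** 2 for x in weight)

b1 = s_xy / s_xx         # slope: about 0.0436
b0 = y_bar - b1 * x_bar  # intercept: about 0.0857
```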
Real-Life Examples: Simple Linear Regression Models
Real Estate: Predicting House Prices. $X$: size of the house (sq ft); $Y$: price of the house. Estimates home prices based on size to guide buyers and sellers.
Education: Predicting Student Scores. $X$: hours studied; $Y$: exam scores. Teachers or students can predict likely outcomes based on study habits.
Healthcare: Predicting Blood Pressure. $X$: age of patient; $Y$: systolic blood pressure. Helps understand how blood pressure tends to rise with age, aiding diagnosis.
Energy: Predicting Electricity Usage. $X$: temperature (°C or °F); $Y$: electricity consumption (kWh). Power companies use this to forecast demand and manage resources.
Manufacturing: Predicting Machine Failures. $X$: hours a machine has been in use; $Y$: number of breakdowns or wear percentage. Helps predict maintenance schedules and avoid production delays.
Business: Predicting Sales Based on Advertising Spend. $X$: advertising expenditure (in $\$$); $Y$: product sales (in units). Helps businesses decide how much to invest in advertising.
Agriculture: Predicting Crop Yield. $X$: amount of rainfall (mm); $Y$: crop yield (kg per acre). Estimates yield based on expected rainfall to plan for food production.
Finance: Predicting Stock Prices. $X$: time (days or months); $Y$: stock closing price. Helps in forecasting trends over time, although simple linear regression has limits in volatile markets.
Transportation: Estimating Fuel Consumption. $X$: distance traveled (km); $Y$: fuel used (liters). Predicts fuel needs and optimizes transportation costs.
E-commerce: Predicting Customer Spending. $X$: time spent on the website; $Y$: amount spent on a purchase. Helps analyze user behavior and optimize the website experience for better conversion.
The design in which the levels of one factor are applied to large experimental units and the levels of other factors to the sub-units is known as a “split plot design”.
A split plot experiment is a blocked experiment in which the blocks serve as experimental units for a subset of the factors. After blocking, the levels of the other factors are randomly applied to large units within the blocks, often called whole plots or main plots.
The split plot design is specifically suited for two-factor designs that have more treatments than can be accommodated by a complete block design. In a split plot design, not all factors are of equal importance. For example, in an experiment with varieties and fertilizers, the variety may be less important and the fertilizer more important.
In this design, the experimental units are divided into two parts: (i) main plots and (ii) sub-plots. The levels of one factor are assigned at random to the large experimental units (main plots), and the levels of the other (second) factor are applied at random to the sub-units (sub-plots) within the large experimental units. The sub-units are obtained by dividing the large experimental units.
Note that the assignment of a particular factor to either the main plot or the subplot is extremely important, because the plot size and the precision of measurement of the effects are not the same for both factors.
The sub-plot treatments are the combinations of the levels of the different factors.
The split plot design involves assigning the levels of one factor to main plots, which may be arranged in a CRD, RCBD, or LSD. The levels of the other factor are assigned to subplots within each main plot.
Split Plot Design Layout Example
If there are 3 varieties and 3 fertilizers, and we want more precision for fertilizers, then with an RCBD with 3 replications, the varieties are assigned randomly to the main plots within the 3 blocks, using a separate randomization for each block. The levels of the fertilizers are then randomly assigned to the subplots within the main plots, using a separate randomization in each main plot.
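The randomization described above can be sketched as follows (the variety and fertilizer labels, and the helper `split_plot_layout`, are illustrative placeholders):

```python
import random

def split_plot_layout(varieties, fertilizers, n_blocks, seed=0):
    """Randomize a split plot layout: varieties go to main plots within
    each block (RCBD), fertilizers to subplots within each main plot."""
    rng = random.Random(seed)
    layout = []
    for block in range(1, n_blocks + 1):
        mains = list(varieties)
        rng.shuffle(mains)        # separate randomization for each block
        for variety in mains:
            subs = list(fertilizers)
            rng.shuffle(subs)     # separate randomization per main plot
            layout.append((block, variety, subs))
    return layout

for block, variety, ferts in split_plot_layout(
        ["V1", "V2", "V3"], ["F1", "F2", "F3"], n_blocks=3):
    print(f"Block {block} | main plot {variety} | subplots {ferts}")
```

Each block contains every variety exactly once, and every main plot contains each fertilizer exactly once, in a fresh random order.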
Another Split Plot Design Example
Suppose we want to study the effects of two irrigation methods (factor 1) and two different fertilizer types (factor 2) on four different fields (“whole plots”). While a field can easily be split into two for the two different fertilizers, the field cannot easily be split into two for irrigation: One irrigation system normally covers a whole field and the systems are expensive to replace.
Advantages and Disadvantages of Split Plot Design
Advantages of Split Plot Design
More Practical: Randomizing hard-to-change factors in groups, rather than randomizing every run, is much less labor- and time-intensive.
Pliable: Factors that naturally have large experimental units can easily be combined with factors having smaller experimental units.
More Powerful: Tests for the subplot effects from the easy-to-change factors generally have higher power due to partitioning the variance sources.
Adaptable: New treatments can be introduced to experiments that are already in progress.
Cheaper to Run: In the case of a CRD, implementing a new irrigation method for each subplot would be extremely expensive.
More Efficient: Since changing the hard-to-change factors causes more error (increased variance) than changing the easy-to-change factors, a split-plot design is more precise (than a completely randomized run order) for the subplot factors, subplot-by-subplot interactions, and subplot-by-whole-plot interactions.
Efficient: Statistically more efficient, with increased precision. The design permits efficient application of factors that would be difficult to apply to small plots.
Reduced Cost: Split plot designs can reduce the cost and complexity of manipulating factors that are difficult or expensive to change.
Precision: The overall precision of the split-plot design relative to the randomized complete block design may be increased by arranging the main plot treatments in a Latin square design or in an incomplete Latin square design.
Disadvantages of Split Plot Design
Less Powerful: Tests for the hard-to-change factors are less powerful, having a larger variance to test against and fewer changes to help overcome the larger error.
Unfamiliar: Analysis requires specialized methods to cope with the partitioned variance sources.
Different: Hard-to-change (whole-plot) and easy-to-change (subplot) factor effects are tested against different estimated noise. This can result in large whole-plot effects not being statistically significant, whereas small subplot effects are significant even though they may not be practically important.
Precision: There is a differential in the precision of estimation of the interaction and the main effects.
Sources of Variation: Split plot designs involve different sources of variation and error for each factor.
Missing Data: When missing data occur, the analysis is more complex than for a randomized complete block design.
Complex Comparisons: Different treatment comparisons have different basic error variances, which makes the analysis more complex than with the randomized complete block design, especially if some unusual type of comparison is being made.