Application of Regression in Medical Sciences: A Quick Guide (2024)

Regression is a powerful statistical tool widely used in medical research to understand the relationships between variables. It helps identify risk factors, predict outcomes, and optimize treatment strategies.

Considering the application of regression analysis in the medical sciences, Chan et al. (2006) used multiple linear regression to estimate standard liver weight, for assessing the adequacy of graft size in live donor liver transplantation and of the remnant liver in major hepatectomy for cancer. Standard liver weight (SLW) in grams, body weight (BW) in kilograms, gender (male = 1, female = 0), and other anthropometric data of 159 Chinese liver donors who underwent donor right hepatectomy were analyzed. The formula (fitted model)

 \[SLW = 218 + 12.3 \times BW + 51 \times gender\]

 was developed with a coefficient of determination $R^2=0.48$.


These results mean that, in Chinese people, each 1-kg increase in BW increases SLW by about 12.3 g on average, and men have, on average, a 51-g higher SLW than women. Unfortunately, standard errors (SEs) and confidence intervals (CIs) for the estimated regression coefficients were not reported. Using Formula 6 in their article, the SLW of a Chinese liver donor can be estimated when BW and gender are known. About 50% of the variance of SLW is explained by BW and gender.
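The fitted model can be applied directly once BW and gender are known. A minimal sketch in Python, using the coefficients reported above (the 60-kg body weight is an arbitrary illustration, not from the article):

```python
def standard_liver_weight(bw_kg, male):
    """Predict standard liver weight (g) from the fitted model
    SLW = 218 + 12.3 * BW + 51 * gender (gender: male = 1, female = 0)."""
    return 218 + 12.3 * bw_kg + 51 * (1 if male else 0)

# A 60-kg male donor: 218 + 12.3*60 + 51 = 1007.0 g
print(round(standard_liver_weight(60, male=True), 1))   # 1007.0
# A 60-kg female donor: 218 + 12.3*60 = 956.0 g
print(round(standard_liver_weight(60, male=False), 1))  # 956.0
```

The 51-g gap between the two calls is exactly the gender coefficient, which is how the dummy variable in the fitted model should be read.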

The regression analysis helps in:

  • Identifying risk factors: Determine which factors contribute to the development of a disease (for example, gender, age, smoking, and blood pressure for heart disease).
  • Predicting disease occurrence: Estimate the likelihood of a patient developing a disease based on specific risk factors. For example, logistic regression can be used to predict the risk of diabetes from factors like BMI, age, and family history.

The following types of regression models are widely used in medical sciences:

  • Linear regression: Used when the outcome variable is continuous (e.g., blood pressure, cholesterol levels).
  • Logistic regression: Used when the outcome variable is binary (e.g., disease present/absent, survival/death).
  • Cox proportional hazards regression: Used for survival analysis (time-to-event data).
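For the binary-outcome case, a logistic model turns a linear predictor into a probability between 0 and 1. A minimal sketch of that prediction step, using the diabetes example above; the intercept and coefficients here are hypothetical values chosen for illustration only, not from any fitted model:

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (not a fitted model):
# predictors are BMI, age in years, and family history (1 = yes, 0 = no).
b0, b = -8.0, [0.1, 0.05, 1.2]
p = predicted_probability(b0, b, [32, 55, 1])  # a value strictly between 0 and 1
```

Because the logistic function is increasing, a larger BMI (holding age and family history fixed) always yields a larger predicted probability, which is how the sign of a logistic coefficient is interpreted.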


Reference

  • Chan SC, Liu CL, Lo CM, et al. (2006). Estimating liver weight of adults by body weight and gender. World J Gastroenterol 12, 2217–2222.


Using Mathematica Built-in Functions (2014)

Introduction to Mathematica Built-in Functions

Mathematica has thousands of built-in functions. Knowing a few dozen of the more important ones is enough for lots of useful calculations. Memorizing the names of most functions is not hard, because nearly all Mathematica built-in functions follow a naming convention (function names reflect their purpose): for example, Abs computes the absolute value, Cos the cosine, and Sqrt the square root of a number.

More important than memorizing function names is remembering the syntax needed to use built-in functions. Knowing many of the built-in Mathematica functions will not only make it easier to follow programs but also enhance your programming skills.

Important and Widely Used Mathematica Built-in Functions

The following is a short list of important Mathematica built-in functions:

  • Sqrt[ ]:   used to find the square root of a number
  • N[ ]:   used for numerical evaluation of any mathematical expression e.g. N[Sqrt[27]]
  • Log[  ]: used to find the natural logarithm of a number; use Log[10, x] for the log base 10
  • Sin[  ]: used to find trigonometric function Sin
  • Abs[  ]: used to find the absolute value of a number

Common Mathematica built-in functions include

  1. Trigonometric functions and their inverses
  2. Hyperbolic functions and their inverses
  3. Logarithmic and exponential functions

Every built-in function in Mathematica has two very important features:

  • All Mathematica built-in function names begin with capital letters; for example, for the square root we use Sqrt, and for the inverse cosine we use the ArcCos built-in function.
  • Square brackets are always used to surround the input or argument of a function.

To compute the absolute value of -12, type Abs[-12] at the command prompt rather than, for example, Abs(-12) or Abs{-12}; that is, Abs[-12] is the valid command for computing the absolute value of -12.


Note that:

In Mathematica, single square brackets [ and ] surround the input (argument) of a function, double square brackets [[ and ]] extract parts of lists, parentheses ( and ) group terms in algebraic expressions, and curly brackets { and } delimit lists. Thus the three sets of delimiters [ ], ( ), and { } are used for functions, algebraic expressions, and lists, respectively.


Time Series Analysis and Forecasting (2013)

Time Series Analysis

Time series analysis is the analysis of a series of data points over time, allowing one to answer questions such as what is the causal effect on a variable $Y$ of a change in variable $X$ over time? An important difference between time series and cross-section data is that the ordering of cases does matter in time series.

A time series $\{Y_t\}$ or $\{y_1,y_2,\cdots,y_T\}$ is a discrete-time, continuous-state process, where the times $t=1,2,\cdots,T$ are discrete time points spaced at uniform intervals.

Usually, time is taken at more or less equally spaced intervals such as hour, day, month, quarter, or year. More specifically, it is a set of data in which observations are arranged in chronological order (A set of repeated observations of the same variable).

Use of Time Series

Time series are used in different fields of science such as statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, and communications engineering among many other fields.

Definition: A sequence of random variables indexed by time is called a stochastic process (stochastic means random) or time series for mere mortals. A data set is one possible outcome (realization) of the stochastic process. If history had been different, we would observe a different outcome, thus we can think of time series as the outcome of a random variable.

Rather than dealing with individuals as units, the unit of interest is time: the value of $Y$ at time $t$ is $Y_t$. The unit of time can be anything from days to election years. The value of $Y_t$ in the previous period is called the first lag value: $Y_{t-1}$. The $j$th lag is denoted $Y_{t-j}$. Similarly, $Y_{t+1}$ is the value of $Y_t$ in the next period. A simple bivariate regression equation for time series data therefore looks like: \[Y_t = \beta_0 + \beta_1 X_t + u_t\]
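The lag notation above can be made concrete with a short sketch (pure Python; the series values are hypothetical, purely for illustration):

```python
# An illustrative series y_1, ..., y_T (hypothetical values)
y = [100, 102, 101, 105, 107]

def lag(series, j=1):
    """Return the series of j-th lag values Y_{t-j}; the first j
    entries are undefined (None) because no earlier data exist."""
    return [None] * j + series[:-j]

print(lag(y, 1))  # [None, 100, 102, 101, 105]
```

Note that each lag shortens the usable sample by one observation, which is why regressions involving $Y_{t-j}$ lose the first $j$ time points.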

Continuous Time Series

A time series is said to be continuous when observations are made continuously in time. The term continuous is used for series of this type even when the measured variable can only take a discrete set of values.

Discrete Time Series

A time series is said to be discrete when observations are taken at specific times, usually equally spaced. The term discrete is used for series of this type even when the measured variable is a continuous variable.

Most macroeconomic and financial data come in the form of time series; GNP and stock returns are examples of time series data.

We can write a series as $\{x_1,x_2,x_3,\cdots,x_T\}$ or $\{x_t\}$, where $t=1,2,3,\cdots,T$. $x_t$ is treated as a random variable.

Time series analysis refers to the branch of statistics where observations are collected sequentially in time, usually but not necessarily at equally spaced time points. The key notational difference between time series and other variables is the use of time subscripts.

Time series analysis comprises methods for analyzing time series data to extract useful (meaningful) statistics and other characteristics of the data, while time series forecasting is the use of a model to predict future values based on previously observed values.

Given an observed time series, the first step in analyzing a time series is to plot the given series on a graph taking time intervals (t) along the X-axis (as independent variable) and the observed value ($Y_t$) on the Y-axis (as dependent variable). Such a graph will show various types of fluctuations and other points of interest.


Note

  • $Y_t$ is treated as a random variable. If $Y_t$ is generated by some model (a regression model for time series, i.e. $Y_t=x_t\beta +\varepsilon_t$ with $E(\varepsilon_t|x_t)=0$), then ordinary least squares (OLS) provides consistent estimates of $\beta$.
  • The term time series is used interchangeably for the sample $\{x_t\}$ and for the probability model. A possible probability model for the joint distribution of a time series $\{x_t\}$ is $x_t=\varepsilon_t$, $\varepsilon_t\sim iid\ N(0,\sigma_\varepsilon^2)$.
  • Time series are typically not iid (independent and identically distributed); for example, if GNP today is unusually high, GNP tomorrow is also likely to be unusually high.


Quartiles in Statistics: Relative Measure of Observation

Quartiles in Statistics

Like percentiles and deciles, quartiles are a type of quantile: a measure of the relative standing of an observation within a data set. The quartiles are three points in the order statistics that divide the data into four equal parts, each comprising a quarter of the data: the first quartile $Q_1$, the second quartile $Q_2$ (also the median), and the third quartile $Q_3$.

The first quartile (also known as the lower quartile) $Q_1$ is the value in the order statistic that exceeds 1/4 of the observations and is less than the remaining 3/4. The third quartile, known as the upper quartile $Q_3$, is the value in the order statistic that exceeds 3/4 of the observations and is less than the remaining 1/4, while the second quartile $Q_2$ is the median.

Quartiles in Statistics for Ungrouped Data

For ungrouped data, the quartiles are calculated by splitting the order statistic at the median and then calculating the median of the two halves. If $n$ is odd, the median can be included on both sides.

Example: Find $Q_1, Q_2$, and $Q_3$ for the following ungrouped data set: 88.03, 94.50, 94.90, 95.05, 84.60.

Solution: We split the order statistic at the median and calculate the median of the two halves. Since $n$ is odd, we include the median in both halves. The order statistic is 84.60, 88.03, 94.50, 94.90, 95.05.


\begin{align*}
Q_2&=median=Y_{(\frac{n+1}{2})}=Y_{(3)}\\
&=94.50  (\text{the third observation})\\
Q_1&=\text{Median of the first three values}=Y_{(\frac{3+1}{2})}\\&=Y_{(2)}=88.03 (\text{the second observation})\\
Q_3&=\text{Median of the last three values}=Y_{(\frac{3+5}{2})}\\
&=Y_{(4)}=94.90 (\text{the fourth observation})
\end{align*}
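The split-at-the-median rule used above can be sketched in a few lines of Python (a minimal illustration of this particular method; note that statistical packages often use different quartile conventions and may give slightly different values):

```python
from statistics import median

def quartiles(data):
    """Quartiles by splitting the order statistic at the median;
    if n is odd, the median is included in both halves."""
    x = sorted(data)
    n = len(x)
    q2 = median(x)
    if n % 2 == 1:
        lower, upper = x[: n // 2 + 1], x[n // 2 :]
    else:
        lower, upper = x[: n // 2], x[n // 2 :]
    return median(lower), q2, median(upper)

print(quartiles([88.03, 94.50, 94.90, 95.05, 84.60]))  # (88.03, 94.5, 94.9)
```

The output reproduces the worked example: $Q_1=88.03$, $Q_2=94.50$, $Q_3=94.90$.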

Quartiles in Statistics for Grouped Data

For the grouped data (in ascending order) the quartiles are calculated as:
\begin{align*}
Q_1&=l+\frac{h}{f}(\frac{n}{4}-c)\\
Q_2&=l+\frac{h}{f}(\frac{2n}{4}-c)\\
Q_3&=l+\frac{h}{f}(\frac{3n}{4}-c)
\end{align*}
where
$l$    is the lower class boundary of the class containing the $Q_1, Q_2$ or $Q_3$.
$h$    is the width of the class containing the $Q_1, Q_2$ or $Q_3$.
$f$    is the frequency of the class containing the $Q_1, Q_2$ or $Q_3$.
$c$    is the cumulative frequency of the class immediately preceding the class containing $Q_1, Q_2$, or $Q_3$.
The quantities $\frac{n}{4}$, $\frac{2n}{4}$, and $\frac{3n}{4}$ are used to locate the $Q_1$, $Q_2$, and $Q_3$ groups, respectively.


Example: Find the quartiles for the following grouped data.

Solution: To locate the class containing $Q_1$, find $\frac{n}{4}$th observation which is here $\frac{30}{4}$th observation i.e. 7.5th observation. Note that the 7.5th observation falls in the group ($Q_1$ group) 90.5–95.5.
\begin{align*}
Q_1&=l+\frac{h}{f}(\frac{n}{4}-c)\\
&=90.5+\frac{5}{4}(7.5-6)=92.3750
\end{align*}

For $Q_2$, the $\frac{2n}{4}$th observation=$\frac{2 \times 30}{4}$th observation = 15th observation falls in the group 95.5–100.5.
\begin{align*}
Q_2&=l+\frac{h}{f}(\frac{2n}{4}-c)\\
&=95.5+\frac{5}{10}(15-10)=98
\end{align*}

For $Q_3$, the $\frac{3n}{4}$th observation=$\frac{3\times 30}{4}$th = 22.5th observation. So
\begin{align*}
Q_3&=l+\frac{h}{f}(\frac{3n}{4}-c)\\
&=100.5+\frac{5}{6}(22.5-20)=102.5833
\end{align*}
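The three grouped-data computations above all apply the same formula, so they can be wrapped in one small function (a sketch using the class boundaries, frequencies, and cumulative frequencies from the worked example):

```python
def grouped_quartile(k, n, l, h, f, c):
    """Q_k = l + (h/f) * (k*n/4 - c) for grouped data, where
    l = lower boundary of the Q_k class, h = class width,
    f = frequency of the Q_k class, and c = cumulative frequency
    of the class immediately preceding it."""
    return l + (h / f) * (k * n / 4 - c)

n = 30
print(grouped_quartile(1, n, l=90.5, h=5, f=4, c=6))              # 92.375
print(grouped_quartile(2, n, l=95.5, h=5, f=10, c=10))            # 98.0
print(round(grouped_quartile(3, n, l=100.5, h=5, f=6, c=20), 4))  # 102.5833
```

Locating the correct class (and hence $l$, $f$, and $c$) still has to be done from the cumulative frequency table before the formula is applied.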

Application of Quartiles

By analyzing quartiles, one can gain insight into the following:

  • Spread of the data: The distance between $Q_1$ and $Q_3$ (called the interquartile range or IQR) indicates how spread out the data is. A relatively large IQR indicates a wider distribution, while a small IQR shows that the data is more concentrated around the median ($Q_2$).
  • Presence of outliers: If the data points are extremely far from the quartiles, they might be outliers that could skew the analysis of measures like the mean.
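One common convention for flagging such points is the 1.5 × IQR rule (an assumption added here for illustration; the text above does not specify a rule): observations below $Q_1 - 1.5\,IQR$ or above $Q_3 + 1.5\,IQR$ are treated as potential outliers.

```python
def iqr_fences(q1, q3):
    """Lower and upper outlier fences under the 1.5 * IQR rule
    (a common convention, assumed here for illustration)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Using the ungrouped-data quartiles from the example (Q1 = 88.03, Q3 = 94.90):
low, high = iqr_fences(88.03, 94.90)  # roughly 77.725 and 105.205
```

For the ungrouped example, no observation falls outside these fences, so that small data set shows no outliers under this rule.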
