Free Statistical Software

General Purpose Statistical Software

All of the free statistical software listed here provides a wide variety of statistical analyses. Each package is completely free and can be used in its fully functional mode.

OpenStat

OpenStat is a general-purpose free statistical software package. It supports Windows (XP, 7, and 8) and is also available for Linux systems (under Wine). It was developed by Bill Miller of Iowa State University and has a very broad range of data manipulation and analysis capabilities. It has an SPSS-like user interface, along with excellent reference material and video tutorials.

Download Software OpenStat
Download Sample Data files
Download Help Files

PSA OpenStat Free statistical software

OpenStat’s Tutorials

  • The use of OpenStat to create a file and do several analyses Tutorial1.zip.
  • Importing Excel file data into OpenStat Tutorial2.zip.
  • Use of Multiple regression analysis with OpenStat Tutorial3.zip.
  • Creating a professional-looking output document in OpenStat Tutorial4.zip.
  • The Relationship of Analysis of Variance to Multiple Regression Tutorial5.zip.
  • The use of Simulation procedures in OpenStat Tutorial6.zip.
  • PowerPoint presentation on the use of the Select Cases in OpenStat Tutorial7.zip.
  • Converting string group codes to integer group codes and the Recode option, Tutorial8.zip.

SalStat-2

SalStat-2 is a free statistical software package written in Python with a graphical user interface. It is a multi-platform, easy-to-use statistical system that provides data management features such as importing, editing, and pivot tables. It provides a range of numerical statistical calculations such as descriptive statistics, probability functions, chi-square, t-tests, one-way ANOVA, regression analysis, correlation, non-parametric tests, and Six Sigma.

It has a graphics system inherited from matplotlib and can produce bar, line, scatter, area, histogram, box&whisker, stem, adaptive, ternary scatter, normal probability, and quality control graphs.
Download Windows Version: Final Windows version S2 V2.1

SOFA (Statistics Open For All)

SOFA is an innovative, user-friendly, open-source package for statistical analysis and reporting. It is available for Windows, Mac, and Linux systems. SOFA has an emphasis on ease of use, learn-as-you-go, and beautifully formatted output. It can help you if you are a researcher, student, data analyst, or anyone who wants to understand their data.

Download SOFA Ubuntu Linux version
Download SOFA Windows version

ViSta

ViSta is a Visual Statistics program that runs under Windows, Mac, and Unix and is available in three languages: English, Spanish, and French. ViSta can perform univariate and multivariate visualization and data analysis. ViSta constructs highly interactive, dynamic graphics that show you multiple views of your data simultaneously. The graphics are designed to augment your visual intuition so that you can better understand your data.

Download Windows version (English)
Download Mac version (English)
Download Unix version (English)

Visit for: Lecture Notes on Statistics and Data Analysis with Vista

PSPP

PSPP is a free replacement for SPSS, although at this time it implements only a small fraction of SPSS’s analyses. However, it never “expires”. It looks very similar to SPSS and even reads native SPSS syntax and files!

PSPP Free statistical Software

Some features…

  • Supports over 1 billion cases and over 1 billion variables.
  • Choice of terminal or graphical user interface.
  • Choice of text, postscript, or HTML output formats.
  • Inter-operates with Gnumeric, Open Office, and other free software.
  • Easy data import from spreadsheets, text files, and database sources.
  • Fast statistical procedures, even on very large data sets.
  • No license fees; no expiration period; no unethical “end-user license agreements”.
  • Fully indexed user manual.
  • Cross-platform; Runs on many different computers and operating systems.

Download PSPP Windows Version

OpenEpi

OpenEpi is a free web-based open-source program for use in public health and medicine. It provides several epidemiologic and statistical tools. It can be run from a web server or downloaded to run without an internet connection. The programs are written in JavaScript and HTML. It provides stratified analysis with exact confidence limit, matched pair, and person-time analysis, sample size, power test, sensitivity, R x C tables, chi-square for dose-response, etc.

Download: OpenEpi

Installation Instructions:

  • Download and unzip the OpenEpi.zip file. Make sure you have an “OpenEpi” folder after unzipping the file; otherwise, rename it as OpenEpi.zip.
  • To use the OpenEpi program, find and double-click the index.html file.
  • Enter the data in the required tables.
  • Save the output from the browser’s File menu using the Save As command.

Statext

Statext provides a nice assortment of basic statistical tests with text output; even its graph output is text-based.
Its capabilities include: rearranging, transposing, and tabulating data; random sampling and basic descriptive statistics; graphs such as the dot plot (text-based), box-and-whiskers plot, stem-and-leaf display, histogram, and scatter-plot; parametric tests such as finding z-values, confidence intervals for means, t-tests (one group, two groups, and paired), one- and two-way ANOVA, Pearson, Spearman, and Kendall correlation, and linear regression analysis; non-parametric tests such as chi-square goodness-of-fit and independence tests, sign tests, and Mann-Whitney U and Kruskal-Wallis H tests; probability tables such as z, t, chi-square, F, and U; a random number generator; and demonstrations of the Central Limit Theorem and the chi-square distribution.

Download Statext

You can also buy Statext software at a cheap price.

MicrOsiris

MicrOsiris is a comprehensive statistical and data management package for Windows, derived from the OSIRIS IV package developed at the University of Michigan. MicrOsiris has special statistical techniques for data mining and analysis of nominal, ordinal, and scaled data.

It can handle data sets of any size and has Excel-style data entry. SPSS, SAS, and Stata data sets can be imported or exported, and MicrOsiris reads ICPSR (OSIRIS) and UNESCO (IDAMS) datasets. It includes an interactive decision tree for selecting appropriate tests, database manipulation, and extensive statistics such as scatter-plots, cross-tabs, ANOVA/MANOVA, log-linear models, correlation/regression, logistic, linear, Tobit, Poisson, and proportional hazards regression, cluster and factor analysis, MINISSA, item analysis, survival analysis, and internal consistency. It is fully functional freeware.

Download MicrOsiris

Gnumeric

Gnumeric is a free, fast, and accurate high-powered spreadsheet with better statistical features than Microsoft Excel. It has about 60 extra functions compared to Excel, including basic support for financial derivatives (such as Black-Scholes) and functions for telecommunication engineering problems, as well as advanced statistical analysis tools, extensive random number generation techniques, linear and non-linear solvers, implicit intersection and iteration, goal seek, and Monte Carlo simulation tools. It also has many features of Excel such as autofill, automatic input guessing, batch processing, and import from and export to different file formats.

Download Gnumeric  (Linux Version)
Gnumeric Tutorials

Statist

Statist is a compact, portable program with most of the basic statistical capabilities, such as data manipulation (recoding, transforming, selecting), descriptive statistics (including histograms and box plots), correlation and regression analysis, and common significance tests such as chi-square and the t-test. This free statistical software is written in C, and its source code is available for improvement and further updates. It runs on Unix/Linux, Windows, and Mac, among other operating systems. Statist is simple to use and can be run in scripts. It also handles big data sets well on small machines.

To download this software, register as a user on the site.

Tanagra

Tanagra is free, open-source statistical software for data mining, intended for academic and research purposes. It supports the standard “stream diagram” paradigm used by most data-mining systems. The software contains components for data sources (tab-delimited text), visualization (grid, scatter-plots), descriptive statistics (cross-tab, ANOVA, correlation), instance selection (sampling, stratified), feature selection and construction, multiple linear regression, factorial analysis (principal components, multiple correspondence), clustering (K-means, SOM, LVQ, HAC), supervised learning (logistic regression, k-NN, multi-layer perceptron, prototype-NN, ID3, discriminant analysis, naive Bayes, radial basis function), meta-spv learning (instance Spv, arcing, boosting, bagging), learning assessment (train-test, cross-validation), and association (Agrawal a-priori).

Download Link Download (XP, Vista, Win 7)

Dap

Dap is a statistics and graphics package developed by Susan Bassein for Unix and Linux systems, with the necessary and common data management facilities. It supports statistical analyses such as univariate statistics, correlations and regression, ANOVA, categorical data analysis, logistic regression, and nonparametric analyses. Dap provides some of the core functionality of SAS and can read and run many (but not all) SAS program files.

Download Dap

AM Statistical Software

AM is a free statistical software package for analyzing data from complex samples, especially large-scale assessments, as well as non-assessment survey data. AM has advanced statistical tools, an easy drag-and-drop interface, and an integrated help system that explains the statistics as well as how to use the software. It can estimate statistical models via marginal maximum likelihood (MML), which defines a probability distribution over the proficiency scale. It also analyzes “plausible values” used in programs like NAEP. AM automatically provides appropriate standard errors for complex samples via Taylor-series approximation, jackknife & other replication techniques. This software also offers a set of non-MML statistics, including regression, probit, logit, cross-tabs, and other statistics that are useful for survey data in general.

You can download the AM Statistical software.

Instat Plus

Instat Plus is a statistical computing package from the Statistical Services Centre at the University of Reading, UK (do not confuse it with InStat from GraphPad Software). It is an interactive statistics package for Windows or DOS. The software is simple, useful for teaching statistical ideas, and powerful enough to assist researchers in any discipline that requires the analysis of data. Instat includes many special facilities for processing climatic data.

Download Instat Plus

SSP

SSP (Smith’s Statistical Package) is a simple, user-friendly statistical software package available for both Mac and Windows operating systems. SSP helps you enter, edit, transform, import, and export data. It can calculate basic summaries, prepare charts, evaluate distribution function probabilities, and perform simulations. Many inferential tests are available, such as tests comparing means and proportions, ANOVAs, chi-square tests, and simple and multiple regression analysis.

Download SSP Windows Version
Download SSP Mac Version

Dataplot

Dataplot is a software system for scientific visualization, statistical analysis, and non-linear modeling, available for Unix, Linux, PC-DOS, and Windows operating systems. It has extensive mathematical and graphical capabilities. The target Dataplot user is the researcher or analyst engaged in the characterization, modeling, visualization, analysis, monitoring, and optimization of scientific and engineering processes. It is closely integrated with the NIST/SEMATECH Engineering Statistics Handbook.

Download Dataplot Windows Version

Regress+

Regress+ is a professional statistical software package for performing univariate mathematical modeling (equations and distributions). The most powerful software of its kind available anywhere, with state-of-the-art functionality and user-friendliness. It has 21 built-in equations and 59 built-in distributions.

Download Regress+
Download Compendium of Common Probability Distributions

SISA

SISA (Simple Interactive Statistical Analysis) is a package for the PC (DOS and Windows) from Daan Uitenbroek. It offers an excellent collection of individual Windows and DOS modules for several statistical calculations, including some analyses not readily available elsewhere.

Download SISA Windows Program

Each of these Windows programs contains a procedure that performs a specific statistical analysis.

  • lifetables
    It helps to perform mortality analysis for demography and epidemiology. The Lifetables program calculates the life expectancy, including all intermediary statistics, the variance and a confidence interval for the life expectancy, Potential Gains in Life Expectancy (PGLE), Years of Potential Life Lost (YPLL), and Lifetime Years of Potential Life Lost (LYPLL).
  • Distributions
    The SISA-Distributions program allows the user to analyze discrete single-dimension distributions. The program is based on various manipulations of the Poisson, binomial, and hypergeometric distribution. Available are the probability of an observed number of cases for the certain null hypothesis, the calculation of exact Poisson, binomial, or hypergeometric confidence intervals, the exact and approximate size of a population using catch-recatch methodologies, the full analysis of a Poisson distributed rate ratio, Fieller analysis, and two versions of the negative binomial distribution can be used in various ways.
  • Multinomial
    The multinomial program is the exact solution to the Chi-square Goodness of fit test of testing for a difference between an observed and an expected distribution in a one-dimensional array. For the two-category array, the multinomial test provides a two-sided solution for the Binomial test. The multinomial allows you to work with empty ‘0’ observation cells although you must expect a cell.
  • Tables
    SISA-Tables is a program for the analysis of tables with up to 2*7 and 3*3 cells. This program (Tables) allows for exact and approximate
    statistics. Fisher exact, Number Needed to Treat, Proportional Reduction in Error Statistics, Normal Approximations, Four different Chi-squares, Gamma, Odds-ratio, t-tests, and Kappa are among the many statistical procedures available in Tables Program.
  • Weighting
    The weighting program by SISA calculates sample weights according to the cell weight procedure. The design factor and the effective sample size for the resulting set of weights are determined. It is possible to specify a value above which extreme weights will be trimmed. The not-trimmed weights will be recalculated.
  • Intra Correlation
    The intra-correlation program from SISA calculates intra-correlations and design effects for clustered samples where the outcome measure is
    the number of positive responses per cluster. Confidence intervals and other statistics corrected for design effects can be calculated.
    This program helps to compare two groups of clusters with a t-test procedure.

JASP

JASP is open-source free statistical software from the University of Amsterdam. It has a user-friendly interface. JASP offers standard statistical analysis routines in both classical and Bayesian forms.

Download: https://jasp-stats.org/download/

Jamovi

Jamovi is another free statistical software package, designed to be easy to use and to serve as a good alternative to costly statistical software such as SAS, Minitab, and SPSS. It is integrated with the R language, is made by the scientific community, and is available in both desktop and cloud versions.

Jamovi Cloud Version
Jamovi Desktop Version

Develve

Develve is free statistical software for experimental data. It is equipped with basic statistics, graphical representation of data, inferential statistics, non-parametric statistics, and many statistics related to the design of experiments.
Download: Develve

Spreadsheets

Two kinds of spreadsheets are available: one for demographic analysis and another for the calculation of intra-correlation coefficients. The spreadsheets are in Microsoft Excel file format; if you have MS Excel installed on your computer, Excel will start and load the spreadsheet automatically after you double-click the procedure name.

  • Lifetable
    This lifetable spreadsheet does a full abridged current life table analysis to obtain the life expectancy of a population. Furthermore, one can calculate Potential Gains in Life Expectancy (PGLE) after removing cause k, considering competing causes of death; the (Premature) Years of
    Potential Life Lost (YPLL), the Standardized Mortality Ratio (SMR), standardized numbers per 100,000, and the Comparative Mortality Figure (CMF) can also be calculated.
  • Discounted YPLL
    This spreadsheet contains the procedure to discount the YPLL if you only have mortality by age.
  • Intra Correlation
    The spreadsheet performs intra-correlation calculations for dichotomous (binary yes/no) outcome variables according to two different methods proposed for the single cluster, one by Fleiss and another by Bennett et al. A third spreadsheet concerns a method for two clusters by Donner and Klar.
  • Distributions
    22 spreadsheets demonstrate various statistical distributions such as Beta, Binomial, Normal, Poisson, Pareto, etc.

Online Data Analysis Programs

The programs below run directly on the Internet. These procedures (programs) are very fast, and a study guide is available.

  • Hypergeometric
    This procedure calculates the hypergeometric probability distribution, which is used to evaluate hypotheses when sampling without replacement from small populations.
  • Binomial
    This procedure calculates probabilities for sampling with replacement from small populations or without replacement from a very large population. It can be used to approximate the hypergeometric distribution.
  • Poisson
    This procedure calculates probabilities for samples that are very large in an even larger population. It can be used to approximate the binomial distribution.
  • Negative Binomial 1
    It is used to study accidents and is a more general case than the Poisson; it considers the probability of getting into accidents when accidents cluster differently in subgroups of the population.
  • Negative Binomial 2
    Another version of the negative binomial, used for the marginal distribution of binomials. It is often used to predict the termination of real-time events, such as the probability of giving up on a non-answering phone after n rings.
  • Fisher
    This procedure calculates the exact p-value for 2*2 contingency tables. Use the Fisher exact test instead of the Chi-square when you have a small value in one cell or a very uneven marginal distribution.
  • Chi-Square
    This DOS procedure calculates the Chi-square and some other measures for two-dimensional tables.
  • Downloadable Programs: see the list of different downloadable programs.
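
The relationships between these distributions can be checked numerically. The following is a minimal Python sketch using only the standard library; it is not the code behind the online procedures, and the function names and example numbers are illustrative assumptions. It computes the hypergeometric, binomial, and Poisson probabilities, shows the two approximations mentioned above, and derives a two-sided Fisher exact p-value for a 2*2 table from the hypergeometric distribution.

```python
from math import comb, exp, factorial

def hypergeom_pmf(k, N, K, n):
    """P(X = k) successes in n draws without replacement
    from a population of N items containing K successes."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, n, p):
    """P(X = k) successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) events when the expected number of events is lam."""
    return exp(-lam) * lam**k / factorial(k)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]:
    the total probability of all tables with the same margins that are
    no more likely than the observed table."""
    N, K, n = a + b + c + d, a + b, a + c
    p_obs = hypergeom_pmf(a, N, K, n)
    lo, hi = max(0, n - (N - K)), min(n, K)
    probs = (hypergeom_pmf(k, N, K, n) for k in range(lo, hi + 1))
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

# Sampling 10 items from a very large population: binomial ~ hypergeometric.
print(hypergeom_pmf(3, 100000, 30000, 10), binom_pmf(3, 10, 0.3))
# Large n and small p: Poisson (lam = n * p) ~ binomial.
print(binom_pmf(2, 1000, 0.002), poisson_pmf(2, 2.0))
# A small 2*2 table where the Chi-square approximation would be doubtful.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

The Fisher function uses the common "sum of tables no more probable than the observed one" definition of the two-sided p-value; other definitions exist and may give slightly different results.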

Statistical Software by Paul W. Mielke Jr.

Free Statistical Software by Paul W. Mielke Jr. is a large collection of executable DOS programs and FORTRAN sources. It contains matrix occupancy, exact g-sample empirical coverage tests, and interactions of exact analyses. It also contains spectral decomposition analysis, randomized block (exact MRBP) analyses, exact multi-response permutation procedures, Fisher’s exact test for cross-classification, and goodness-of-fit tests. Furthermore, it covers Fisher’s combined p-values (i.e., meta-analysis), the largest part’s proportion test, the Pearson-Zelterman, Greenwood-Moran, and Kendall-Sherman goodness-of-fit tests, and runs tests. The advanced statistical procedures include the multivariate Hotelling’s test, least-absolute-deviation (LAD) regression analysis, sequential permutation procedures, principal component analysis, matched pair permutation, r-by-c contingency tables, r-way contingency tables, and Jonckheere-Terpstra.

Download Free Statistical Software by Paul W. Mielke Jr. (Windows Version)
Download Free Statistical Software by Paul W. Mielke Jr. (Unix Version)

If any link is broken or not working, please let me know. Also, if you have the web address of any other free statistical software, inform me and I will update the list.

RegressIt (MS-Excel add-in)

RegressIt is a powerful free Excel add-in that performs multivariate descriptive data analysis and regression analysis with high-quality table and chart output.  It’s an excellent tool for instructors who are running online data analysis exercises using platforms such as Zoom.  The software includes built-in documentation and it can embed regression teaching notes in output worksheets in the form of cell comments. It also has some innovative auditing tools that allow instructors to easily review and verify the originality of the complete analysis carried out by every student in a class.  RegressIt also has a unique interface with R that allows Excel to be used as a front end for running very detailed linear and logistic regression analyses in R and which also allows R to be used as a computational engine for running models in Excel. Visit https://regressit.com for complete details and free downloading of the software.

Statistical Software

R Programming Language

Select Cases in SPSS

This post is about Select Cases in SPSS (IBM SPSS Statistics). Sometimes you may be interested in analyzing only a specific part (subset) of the available dataset. For example, you may be interested in getting descriptive or inferential statistics for males and females separately. You may also be interested in a certain age range or may want to study (say) only non-smokers. In such cases, you can use Select Cases in SPSS.

Select Cases in SPSS: Step-by-Step Procedure

For illustrative purposes, I am using the “customer_dbase” file available in the SPSS sample data files. I will use the gender variable to select male customers only and present some descriptive statistics for males only. For this purpose, follow these steps:

Step 1: Go to the Menu bar, select “Data” and then “Select Cases”.

Select Cases in SPSS - 1

Step 2: A new window called “Select Cases” will open.

Use of If statement for Select Cases in SPSS

Step 3: Tick the box called “If the condition is satisfied” as shown in the figure below.

Select Cases in SPSS - 2

Step 4: Click on the button “If” highlighted in the above picture.

Step 5: A new window called “Select Cases: If” will open.

Select Cases in SPSS - If Dialog box 3

Step 6: The left box of this dialog box contains all the variables from the data view. Choose the variable (using the left mouse button) that you want to select cases for and use the “arrow” button to move the selected variable to the right box.

Step 7: In this example, the variable gender (for which we want to select only men) is moved from the left to the right box. In the right box, write “gender=0” (since men are coded with the value 0 in this dataset).

Select Cases in SPSS - with Condition

Step 8: Click on Continue and then the OK button. Now, only men are selected (and the women’s data values are temporarily filtered out from the dataset).

Re-Select Cases in SPSS

Note: To “re-select” all cases (complete dataset), you carry out the following steps:

Step a: Go to the Menu bar, choose “Data” and then “Select Cases”.

Step b: From the dialog box of “Select Cases”, tick the box called “All cases”, and then click on the OK button. 

Select Cases in SPSS - data 5

When you use Select Cases in SPSS, a new filter variable (named “filter_$” by SPSS) is created in the dataset. If you delete this filter variable, the selection disappears. The “un-selected” cases are crossed out in the Data View window.

Select Cases in SPSS - data view 6

Note: The selection will be applied to everything you do from the point you select cases until you remove the selection. In other words, all statistics, tables, and graphs will be based only on the selected individuals until you remove (or change) the selection.
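
The idea behind filtering, as opposed to deleting, can be illustrated outside SPSS. The following is a minimal Python sketch, not SPSS syntax; the records, the `filter_flag` name, and the `select_cases` helper are hypothetical and only mimic the behavior described above. Every case stays in the dataset, a 0/1 flag marks the selection, and statistics are computed on the flagged cases only.

```python
from statistics import mean

# Hypothetical records standing in for a few rows of a customer file;
# gender is coded 0 = male and 1 = female, as in the example above.
customers = [
    {"id": 1, "gender": 0, "age": 34},
    {"id": 2, "gender": 1, "age": 51},
    {"id": 3, "gender": 0, "age": 46},
    {"id": 4, "gender": 1, "age": 29},
    {"id": 5, "gender": 0, "age": 25},
]

def select_cases(rows, condition):
    """Return copies of all rows with a 0/1 flag added; nothing is deleted."""
    return [dict(row, filter_flag=int(condition(row))) for row in rows]

# Equivalent in spirit to the condition "gender=0".
flagged = select_cases(customers, lambda row: row["gender"] == 0)

# Analyses use only the flagged cases until the flag is removed or changed.
selected = [row for row in flagged if row["filter_flag"] == 1]
mean_age_males = mean(row["age"] for row in selected)
print(len(flagged), len(selected), mean_age_males)
```

Dropping the flag (or setting it to 1 for every row) restores the full dataset, which mirrors choosing “All cases” in the dialog box.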

Random Sample of Cases

There are other kinds of selection too: for example, a random sample of cases, a selection based on a time or case range, or the use of an existing filter variable. The selected cases can be copied to a new dataset, or the unselected cases can be deleted. For this purpose, choose the appropriate option from the Output section of the Select Cases dialog box.

Select Cases in SPSS - random selection 7

For other SPSS tutorials, see Independent Sample t-tests in SPSS

Hypothesis Testing in R Programming Language

Subjective Probability (2019)

Subjective probability is a type of probability based on personal beliefs, judgment, or experience about the occurrence of a specific outcome in the future. Its calculation involves no formal computation (no formula); it reflects the opinion of a person based on his/her experience. Subjective probability differs from person to person and may contain a high degree of personal bias.

This kind of probability is usually based on a person’s experience, understanding, knowledge, and intelligence, and it determines the probability of some specific event (situation). It is usually applied in real-life situations, especially those related to business decisions, job interviews, employee promotions, awarding incentives, and daily life situations such as buying and/or selling a product. An individual may use their expertise, opinion, past experiences, or intuition to assign degrees of probability to a specific situation.

It is worth noting that the subjective probability is highly flexible in terms of an individual’s belief, for example, one individual may believe that the chance of occurrence of a certain event is 25%. The same person or others may have a different belief especially when they are given a specific range from which to choose, (such as 25% to 30%). This can occur even if no additional hard data is behind the change.

Events that may Alter Subjective Probability

Subjective probability is usually affected by a variety of personal beliefs and opinions held by an individual (related to caste, family, region, religion, relationships with people, etc.). This is because subjective probability is often based on how each individual interprets the information presented to them.

Disadvantages of Subjective Probability

As only personal opinions (beliefs, experiences) are involved, there may be a high degree of bias. In addition, one person’s opinion may differ greatly from another’s. Similarly, when relying on subjective probability, one may fall into the trap of failing to carry out complex calculations that the problem requires.

Subjective Probability

Examples Related to Subjective Probability

  • You may think that there is an 80% chance that your best friend will call you today because his/her car broke down yesterday and he/she will probably need a ride.
  • You think you have a 50% chance of getting a certain job you applied for, as the other applicant is also qualified.
  • The probability that it will rain in the next (say) 5 hours depends on current weather conditions, wind patterns, nearby weather, barometric pressure, etc. A person can predict rain in the next 5 hours based on his/her experience with weather and rain and on his/her beliefs.
  • Suppose a cricket match is going to be held between Pakistan and India. The theoretical probability of either team winning is 50%. But, in reality, it is not 50%. On the other hand, unlike with empirical probability, a number of trial matches cannot be arranged to determine an experimental probability. Thus, subjective probability will be used to predict the winning team, based on the beliefs and experience of the investigator who is interested in finding the probability of the Pakistan cricket team winning. Note that there will be a bias if a fan of either team assesses the probability of a team winning.
  • To locate petroleum, minerals, and/or water lying under the earth, dowsers are employed to predict the likelihood of the existence of the required material. They usually adopt some non-scientific methods. In such a situation, subjective probability is used.
  • Note that decisions made through subjective probability may be valid if the person’s degree of belief about the situation is unbiased and he/she arrives at it by some logical reasoning.

For further reading See Introduction to Probability Theory

R Programming Language and R Frequently Asked Questions

Remedial Measures of Heteroscedasticity (2018)

The post is about Remedial Measures of Heteroscedasticity.

Heteroscedasticity is a condition in which the variance of the residual term, or error term, in a regression model is not constant across observations.

Heteroscedasticity does not destroy the unbiasedness and consistency properties of the OLS estimators (they remain unbiased and consistent in the presence of heteroscedasticity), but they are no longer efficient, not even asymptotically. The lack of efficiency makes the usual hypothesis testing procedure dubious (doubtful, unreliable). Therefore, there should be some remedial measures for heteroscedasticity.

Homoscedasticity

Remedial Measures of Heteroscedasticity

For remedial measures of heteroscedasticity, there are two approaches: (i) when $\sigma_i^2$ is known, and (ii) when $\sigma_i^2$ is unknown.

(i) $\sigma_i^2$ is known

Consider the simple linear regression model $Y_i=\alpha + \beta X_i + u_i$.

If $V(u_i)=\sigma_i^2$ then heteroscedasticity is present. Given the values of $\sigma_i^2$, heteroscedasticity can be corrected by using weighted least squares (WLS) as a special case of Generalized Least Squares (GLS). Weighted least squares is the OLS method of estimation applied to the transformed model.

When heteroscedasticity is detected by an appropriate statistical test, the solution is to transform the original model in such a way that the transformed disturbance term has a constant variance. The transformation amounts to an adjustment of the original data: each observation is divided by $\sigma_i$, so the transformed error term $u_i^*=u_i/\sigma_i$ has a constant variance, i.e., it is homoscedastic. Mathematically

\begin{eqnarray*}
V(u_i^*)&=&V\left(\frac{u_i}{\sigma_i}\right)\\
&=&\frac{1}{\sigma_i^2}Var(u_i)\\
&=&\frac{1}{\sigma_i^2}\sigma_i^2=1
\end{eqnarray*}

This approach has limited use as the individual error variances are not always known a priori. In case of significant sample information, reasonable guesses of the true error variances can be made and be used for $\sigma_i^2$.
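
As an illustration of case (i), the following is a minimal Python sketch of weighted least squares carried out as OLS on the transformed model, assuming the $\sigma_i$ are known. The function names and the data are hypothetical. Dividing the model by $\sigma_i$ gives $Y_i/\sigma_i=\alpha(1/\sigma_i)+\beta(X_i/\sigma_i)+u_i/\sigma_i$, which has two regressors and no intercept.

```python
def ols_no_intercept_2(z1, z2, y):
    """OLS of y on two regressors z1 and z2 without an intercept,
    solved from the 2x2 normal equations by Cramer's rule."""
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def wls_known_sigma(x, y, sigma):
    """Estimate (alpha, beta) in Y = alpha + beta*X + u with V(u_i) = sigma_i^2
    by dividing every term of the model by sigma_i and applying OLS."""
    y_star = [yi / si for yi, si in zip(y, sigma)]
    const_star = [1.0 / si for si in sigma]             # coefficient: alpha
    x_star = [xi / si for xi, si in zip(x, sigma)]      # coefficient: beta
    return ols_no_intercept_2(const_star, x_star, y_star)

# Hypothetical data: the error standard deviation grows with X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [5.1, 7.8, 11.4, 13.5, 17.9]
alpha_hat, beta_hat = wls_known_sigma(x, y, sigma)
print(alpha_hat, beta_hat)
```

Observations with a large $\sigma_i$ are scaled down and therefore carry less weight in the fit, which is the sense in which the least squares are "weighted".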

Let us now discuss the second remedial measure for heteroscedasticity.

(ii) $\sigma_i^2$ is unknown

If $\sigma_i^2$ is not known a priori, then heteroscedasticity is corrected by hypothesizing a relationship between the error variance and one of the explanatory variables. There can be several versions of the hypothesized relationship. Suppose the hypothesized relationship is $Var(u_i)=\sigma^2 X_i^2$ (the error variance is proportional to $X_i^2$). For this hypothesized relation, we divide the simple linear regression model $Y_i =\alpha + \beta X_i +u_i$ by $X_i$ to correct for heteroscedasticity:
\begin{eqnarray*}
\frac{Y_i}{X_i}&=&\frac{\alpha}{X_i}+\beta+\frac{u_i}{X_i}\\
\Rightarrow \quad Y_i^*&=&\beta +\alpha X_i^*+u_i^*\\
\mbox{where } Y_i^*&=&\frac{Y_i}{X_i}, \quad X_i^*=\frac{1}{X_i} \mbox{ and } u_i^*=\frac{u_i}{X_i}
\end{eqnarray*}

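OLS estimation of the transformed model now yields efficient parameter estimates because the $u_i^*$ have constant variance. Note that in the transformed model $\beta$ is the intercept and $\alpha$ is the coefficient of $1/X_i$. The variance of the transformed error term is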

\begin{eqnarray*}
V(u_i^*)&=&V(\frac{u_i}{X_i})\\
&=&\frac{1}{X_i^2} V(u_i)\\
&=&\frac{1}{X_i^2}\sigma^2X_i^2\\
&=&\sigma^2=\mbox{ Constant}
\end{eqnarray*}
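
A minimal Python sketch of this transformation follows, assuming the hypothesized relation $Var(u_i)=\sigma^2X_i^2$ holds and that no $X_i$ is zero. The function names and data are hypothetical. The transformed variables $Y_i/X_i$ and $1/X_i$ are used in an ordinary simple regression; the fitted intercept estimates $\beta$ and the fitted slope estimates $\alpha$.

```python
def ols_simple(x, y):
    """Simple OLS regression of y on x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_variance_prop_x_squared(x, y):
    """Estimate (alpha, beta) in Y = alpha + beta*X + u when Var(u_i) is
    assumed proportional to X_i^2, by regressing Y/X on 1/X."""
    y_star = [yi / xi for xi, yi in zip(x, y)]
    x_star = [1.0 / xi for xi in x]
    intercept, slope = ols_simple(x_star, y_star)
    # Roles are swapped in the transformed model:
    # its intercept is beta and its slope is alpha.
    return slope, intercept

# Hypothetical data in which the spread of Y grows with X.
x = [1.0, 2.0, 4.0, 5.0, 10.0]
y = [5.2, 7.6, 14.9, 16.1, 34.0]
alpha_hat, beta_hat = fit_variance_prop_x_squared(x, y)
print(alpha_hat, beta_hat)
```

The swap of roles is easy to overlook when reading regression output for the transformed model, so the function returns the estimates already mapped back to the original parameters.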

Remedial Measures of Heteroscedasticity (2018)

For remedial measures of heteroscedasticity, some other hypothesized relations are:

  • Error variance is proportional to $X_i$ (square root transformation), i.e. $E(u_i^2)=\sigma^2X_i$
    The transformed model is
    \[\frac{Y_i}{\sqrt{X_i}}=\frac{\alpha}{\sqrt{X_i}}+\beta\sqrt{X_i}+\frac{u_i}{\sqrt{X_i}}\]
    The transformed model has no intercept term. Therefore, we have to use the regression-through-the-origin model to estimate $\alpha$ and $\beta$. To get back to the original model, multiply the transformed model by $\sqrt{X_i}$.
  • Error Variance is proportional to the square of the mean value of $Y$. i.e. $E(u_i^2)=\sigma^2[E(Y_i)]^2$
    Here the variance of $u_i$ is proportional to the square of the expected value of $Y$, where $E(Y_i) = \alpha + \beta X_i$.
    The transformed model will be
    \[\frac{Y_i}{E(Y_i)}=\frac{\alpha}{E(Y_i)}+\beta\frac{X_i}{E(Y_i)}+\frac{u_i}{E(Y_i)}\]
    This transformation cannot be applied directly because $E(Y_i)$ depends upon $\alpha$ and $\beta$, which are unknown parameters. However, $\hat{Y_i}=\hat{\alpha}+\hat{\beta}X_i$ is an estimator of $E(Y_i)$, so we proceed in two steps:
     
    1. Run the usual OLS regression, disregarding the heteroscedasticity problem, and obtain $\hat{Y_i}$.
    2. Transform the model using the estimated $\hat{Y_i}$, i.e. $\frac{Y_i}{\hat{Y_i}}=\alpha\frac{1}{\hat{Y_i}}+\beta\frac{X_i}{\hat{Y_i}}+\frac{u_i}{\hat{Y_i}}$, and run the regression on the transformed model.

      This transformation gives satisfactory results only if the sample size is reasonably large.

  • Log transformation such as $\ln Y_i = \alpha + \beta \ln X_i + u_i$.
    Log transformation compresses the scales in which the variables are measured. However, this transformation is not applicable if some of the $Y$ and $X$ values are zero or negative.
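
The two-step procedure for the case where the error variance is proportional to $[E(Y_i)]^2$ can be sketched in Python as follows. This is only an illustration under that assumption; the function names and data are hypothetical, and all fitted values must be nonzero. Step 1 obtains $\hat{Y_i}$ by ordinary OLS, and step 2 regresses $Y_i/\hat{Y_i}$ on $1/\hat{Y_i}$ and $X_i/\hat{Y_i}$ without an intercept.

```python
def ols_simple(x, y):
    """Simple OLS regression of y on x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def ols_no_intercept_2(z1, z2, y):
    """OLS of y on two regressors without an intercept (2x2 normal equations)."""
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def two_step_fit(x, y):
    """Step 1: ordinary OLS to get fitted values.
    Step 2: divide the model by the fitted values and re-estimate."""
    a0, b0 = ols_simple(x, y)
    y_hat = [a0 + b0 * xi for xi in x]
    if any(abs(v) < 1e-12 for v in y_hat):
        raise ValueError("fitted values must be nonzero to transform the model")
    y_star = [yi / yh for yi, yh in zip(y, y_hat)]
    const_star = [1.0 / yh for yh in y_hat]             # coefficient: alpha
    x_star = [xi / yh for xi, yh in zip(x, y_hat)]      # coefficient: beta
    return ols_no_intercept_2(const_star, x_star, y_star)

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.3, 7.7, 11.6, 13.2, 18.1]
alpha_hat, beta_hat = two_step_fit(x, y)
print(alpha_hat, beta_hat)
```

Because the weights depend on first-stage estimates, the result inherits their sampling error, which is one way to see why the procedure is recommended only for reasonably large samples.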

Visit: R Language Frequently Asked Questions