Primary and Secondary Data (2014)

Data

Before learning about primary and secondary data, let us first understand the term data in statistics.

Statistics studies facts and figures that can be measured numerically. Numerical measures of the same characteristic are known as observations, and a collection of observations is termed data. Data are collected by individual researchers or by organizations through sample surveys or experiments, keeping in view the objectives of the study. The data collected may be (i) primary data or (ii) secondary data.

Primary and Secondary Data in Statistics

The difference between primary and secondary data in statistics is that primary data are collected firsthand by a researcher (an organization, person, authority, agency, etc.) through experiments, surveys, questionnaires, focus groups, interviews, and (required) measurements, while secondary data have already been collected by someone else and are readily available to the public through publications, journals, and newspapers.


Primary Data

Primary data means raw data, that is, data just collected from the source that have not gone through any kind of statistical treatment such as sorting or tabulation. The term primary data is sometimes used to refer to first-hand information.

Sources of Primary Data

The sources of primary data are primary units such as basic experimental units, individuals, and households. The following methods are usually used to collect data from primary units, and the choice of method depends on the nature of the primary unit. (By contrast, published data and data collected in the past are called secondary data.)

  • Personal Investigation
    The researcher conducts the experiment or survey personally and collects data from it. The collected data are generally accurate and reliable. This method of collecting primary data is feasible only for small-scale laboratory experiments, field experiments, or pilot surveys; it is not practicable for large-scale experiments and surveys because it takes too much time.
  • Through Investigators
    Trained (experienced) investigators are employed to collect the required data. In the case of surveys, they contact the individuals and fill in the questionnaires after asking for the required information; here, a questionnaire is an inquiry form comprising a number of questions designed to obtain information from the respondents. This method of collecting data is employed by most organizations, and it gives reasonably accurate information, but it is very costly and may be time-consuming too.
  • Through Questionnaire
    The required information (data) is obtained by sending a questionnaire (in printed or soft form) to the selected individuals (respondents), for example by mail, who fill in the questionnaire and return it to the investigator. This method is relatively cheap compared with the “through investigators” method, but the non-response rate is very high, as most respondents do not bother to fill in the questionnaire and send it back to the investigator.
  • Through Local Sources
    Local representatives or agents are asked to send the requisite information, which they provide based on their own experience. This method is quick, but it gives rough estimates only.
  • Through Telephone
    The information may be obtained by contacting the individuals by telephone. This method is quick and can provide reasonably accurate information.
  • Through Internet
    With the introduction of information technology, people may be contacted through the Internet and asked to provide pertinent information. Google Survey is widely used as an online method for data collection nowadays, and there are many paid online survey services too.

It is important to go through the primary data and locate any inconsistent observations before the data are given any statistical treatment.

Secondary Data

Secondary data are data that have already been collected by someone else and may have been sorted, tabulated, and given statistical treatment. In this sense, they are processed (tailored) data rather than raw data.

Sources of Secondary Data

The secondary data may be available from the following sources:

  • Government Organizations
    Federal and Provincial Bureaus of Statistics, the Crop Reporting Service (Agriculture Department), the Census and Registration Organization, etc.
  • Semi-Government Organizations
    Municipal committees, district councils, and commercial and financial institutions such as banks, etc.
  • Teaching and Research Organizations
  • Research Journals and Newspapers
  • Internet


Markov Chain: An Introduction (2014)

A Markov chain, named after Andrey Markov, is a mathematical system that experiences transitions from one state to another among a finite or countable number of possible states. A Markov chain is a random process usually characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of memorylessness is called the Markov property. Markov chains have many applications as statistical models of real-world processes.

If the random variables $X_{n-1}$ and $X_n$ take the values $X_{n-1}=i$ and $X_n=j$, then the system has made a transition $S_i \rightarrow S_j$, that is, a transition from state $S_i$ to state $S_j$ at the $n$th trial. Note that $i$ can equal $j$, so transitions within the same state are possible. We need to assign probabilities to the transitions $S_i \rightarrow S_j$. In a general random process, the probability that $X_n=j$ may depend on the whole sequence of random variables, starting with the initial value $X_0$.

The Markov chain has the characteristic property that the probability that $X_n=j$ depends only on the immediately previous state of the system. This means that, at each step, we need no further information than, for each $i$ and $j$,
\[P\{X_n=j \mid X_{n-1}=i\},\]
the probability that $X_n=j$ given that $X_{n-1}=i$; this probability is independent of the values of $X_{n-2}, X_{n-3}, \cdots, X_0$.

Let us have a set of states $S=\{s_1,s_2,\cdots,s_n\}$. The process starts in one of these states and moves successively from one state to another; each move is called a step. If the chain is currently in state $s_i$, then it moves to state $s_j$ at the next step with a probability denoted by $p_{ij}$ (the transition probability), and this probability does not depend on which states the chain was in before the current state ($s_i \xrightarrow[]{p_{ij}} s_j$). The process can also remain in its current state, which occurs with probability $p_{ii}$.


An initial probability distribution, defined on $S$, specifies the starting state. Usually, this is done by specifying a particular state as the starting state.

A Markov chain is a sequence of random variables $X_1, X_2, \cdots$ with the Markov property that, given the present state, the future and past states are independent. Thus
\[P(X_n=x \mid X_1=x_1, X_2=x_2, \cdots, X_{n-1}=x_{n-1}) = P(X_n=x \mid X_{n-1}=x_{n-1})\]
or, in state notation, the one-step transition probability is
\[p_{ij} = P(X_n=j \mid X_{n-1}=i).\]

Example: Markov Chain

A Markov chain $X$ on $S=\{0,1\}$ is determined by the initial distribution given by $p_0=P(X_0=0)$, $p_1=P(X_0=1)$ and the one-step transition probabilities given by $p_{00}=P(X_{n+1}=0 \mid X_n=0)$, $p_{10}=P(X_{n+1}=0 \mid X_n=1)$, $p_{01}=1-p_{00}$, and $p_{11}=1-p_{10}$, so the one-step transition probabilities in matrix form, with row $i$ holding the probabilities of moving out of state $i$, are $P=\begin{pmatrix}p_{00}&p_{01}\\p_{10}&p_{11}\end{pmatrix}$.
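To make the two-state example concrete, here is a minimal simulation sketch in Python; the particular transition probabilities ($p_{00}=0.7$, $p_{10}=0.4$) are illustrative assumptions, not values from the text:

```python
import numpy as np

# Illustrative one-step transition matrix on S = {0, 1}:
# row i holds the probabilities of moving out of state i (rows sum to 1).
P = np.array([[0.7, 0.3],   # p00, p01
              [0.4, 0.6]])  # p10, p11

rng = np.random.default_rng(seed=1)

def simulate_chain(P, start, n_steps, rng):
    """Simulate a Markov chain: the next state depends only on the current state."""
    states = [start]
    for _ in range(n_steps):
        # The Markov property: the next state is drawn from row `states[-1]` only.
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

path = simulate_chain(P, start=0, n_steps=5000, rng=rng)
print("Fraction of time in state 0:", np.mean(np.array(path) == 0))
```

For this particular matrix, solving $\pi P = \pi$ gives the stationary distribution $(4/7, 3/7) \approx (0.57, 0.43)$, so the printed fraction of time spent in state 0 should be close to 0.57.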


Markov chains are a powerful tool for modeling various random processes. However, it’s important to remember that they assume the Markov property, which may not always hold true in real-world scenarios.

Applications of Markov Chains

  • Information Theory: Used in data compression, where Markov models describe the dependence between successive symbols (for example, in context-based coding schemes).
  • Search Algorithms: Applied in recommender systems and website navigation analysis.
  • Queueing Theory: Helps model customer arrivals and service times in queues.
  • Financial Modeling: Financial analysts can use Markov chains to model stock prices or economic trends.
  • Game Design: Markov chains can be used to create video games with more realistic and interesting behavior for non-player characters.
  • Predictive Text: Smartphones that suggest the next word you are typing use a kind of Markov chain, where the probability of the next word depends on the current word.
  • Modeling weather: Markov chains can be used to represent the probabilities of transitioning between different weather states.



Effect Size Definition, Formula, Interpretation (2014)

Effect Size Definition

The effect size definition: an effect size is a measure of the strength of a phenomenon, conveying the estimated magnitude of a relationship without making any statement about the true relationship. Effect size measures play an important role in meta-analysis and statistical power analysis, so reporting effect sizes in theses, reports, or research papers can be considered good practice, especially when presenting empirical results/findings, because an effect size measures the practical importance of a significant finding. Simply, we can say that effect size is a way of quantifying the size of the difference between two groups.

Effect size is usually computed after rejecting the null hypothesis in a statistical hypothesis-testing procedure, so if the null hypothesis is not rejected, the effect size has little meaning.

There are different formulas for different statistical tests to measure the effect size. In general, the effect size can be computed in two ways.

  1. As the standardized difference between two means
  2. As the effect size correlation (the correlation between the independent-variable classification and the individual scores on the dependent variable).
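For the first approach, a standard formulation (supplied here for illustration; the original text does not give an explicit formula) is Cohen's $d$ for two independent groups, the difference between group means divided by the pooled standard deviation:

\[d=\frac{\overline{X}_1-\overline{X}_2}{s_p}, \qquad s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}\]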

The Effect Size for the Dependent Sample t-test

The effect size for the paired sample t-test (dependent sample t-test), known as Cohen's $d$, ranges from $-\infty$ to $\infty$ and evaluates, in standard deviation units, the degree to which the mean of the difference scores departs from zero. If the value of $d$ equals 0, then the mean of the difference scores is equal to zero. The farther the value of $d$ is from 0, the larger the effect size.

Effect Size Formula for Dependent Sample T-test

The effect size for the dependent sample t-test can be computed by using

\[d=\frac{\overline{D}-\mu_D}{SD_D}\]

Note that both the mean of the differences ($\overline{D}$) and the standard deviation of the differences ($SD_D$) are reported in the SPSS output under “Paired Differences”.

Suppose the effect size is $d = 2.56$; this means that the sample mean difference and the population mean difference are 2.56 standard deviations apart. The sign does not affect the size of an effect, i.e., $-2.56$ and $2.56$ are equivalent effect sizes.

The $d$ statistic can also be computed from the obtained $t$ value and the number of paired observations, following Ray and Shadish (1996):

\[d=\frac{t}{\sqrt{N}}\]
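As a quick sketch, both routes to Cohen's $d$ can be computed as follows; the pre/post scores are hypothetical values invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical pre/post scores for the same eight subjects (illustrative only).
pre  = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0])
post = np.array([14.0, 18.0, 13.0, 15.0, 16.0, 18.0, 13.0, 17.0])

diff = post - pre
n = len(diff)

# Cohen's d from the difference scores (mu_D = 0 under the null hypothesis).
d_direct = diff.mean() / diff.std(ddof=1)

# Equivalent computation from the paired t statistic: d = t / sqrt(N).
t_stat, p_value = stats.ttest_rel(post, pre)
d_from_t = t_stat / np.sqrt(n)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"Cohen's d (direct)    = {d_direct:.3f}")
print(f"Cohen's d (t/sqrt(N)) = {d_from_t:.3f}")  # agrees with d_direct
```

Both computations give $d \approx 2.65$ for these made-up data, illustrating that the two formulas are algebraically equivalent for the dependent sample t-test.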

The value of $d$ is usually categorized as small, medium, and large. With Cohen’s $d$:

  • $d = 0.2$ to $0.5$: small effect
  • $d = 0.5$ to $0.8$: medium effect
  • $d = 0.8$ and higher: large effect

Calculating Effect Size from $r^2$

Another method of computing the effect size is with r-squared ($r^2$), i.e.

\[r^2=\frac{t^2}{t^2+df}\]

Effect size is categorized into small, medium, and large effects as

  • $r^2=0.01$, small effect
  • $r^2=0.09$, medium effect
  • $r^2=0.25$, large effect.
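Continuing the hypothetical paired example above, where $t = 2\sqrt{14} \approx 7.48$ and $df = n - 1 = 7$, the computation works out as
\[r^2=\frac{t^2}{t^2+df}=\frac{56}{56+7}\approx 0.89,\]
a large effect on the scale above.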

A non-significant t-test result indicates that we failed to reject the hypothesis that the two conditions have equal means in the population. A larger value of $r^2$ indicates a larger effect (effect size), and a large effect size together with a non-significant result suggests that the study should be replicated with a larger sample size.

So, a larger value of the effect size computed from either method indicates a larger effect, meaning that the means are likely very different.

Choosing the Right Effect Size Measure

The appropriate effect size measure depends on the type of analysis being conducted (for example, correlation or group comparison) and the measurement scale of the data (continuous, binary, nominal, ordinal, interval, ratio, etc.). It is always good practice to report both the effect size and statistical significance (p-value) to provide a more complete picture of the findings.

In conclusion, effect size is a crucial concept in interpreting statistical results. By understanding and reporting effect size, one can gain a deeper understanding of the practical significance of the research findings and contribute to a more comprehensive understanding of the field of study.

References:

  • Ray, J. W., & Shadish, W. R. (1996). How interchangeable are different estimators of effect size? Journal of Consulting and Clinical Psychology, 64, 1316-1325. (see also “Correction to Ray and Shadish (1996)”, Journal of Consulting and Clinical Psychology, 66, 532, 1998)
  • Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17(2), 137–152. doi:10.1037/a0028086


Consistent Estimator: Easy Learning

A statistic is a consistent estimator of a population parameter if, as the sample size increases, it becomes almost certain that the value of the statistic comes closer and closer to the value of the population parameter. If an estimator (statistic) is consistent, it becomes more reliable with a large sample ($n \to \infty$). All this means that the distribution of the estimates becomes more and more concentrated near the value of the population parameter being estimated, such that the probability of the estimator being arbitrarily close to $\theta$ converges to one (a sure event).

Consistent Estimator

The estimator $\hat{\theta}_n$ is said to be a consistent estimator of $\theta$ if, for any positive $\varepsilon$,
\[\lim_{n \rightarrow \infty} P[|\hat{\theta}_n-\theta| \le \varepsilon]=1\]
or
\[\lim_{n\rightarrow \infty} P[|\hat{\theta}_n-\theta|> \varepsilon]=0\]

Here $\hat{\theta}_n$ denotes the estimator of $\theta$ calculated from a sample of size $n$.

Examples of Consistent Estimators
  • The sample median is a consistent estimator of the population mean if the population distribution is symmetrical; otherwise, the sample median would approach the population median, not the population mean.
  • The sample standard deviation is a biased but consistent estimator, as the distribution of $\hat{\sigma}^2$ becomes more and more concentrated at $\sigma^2$ as the sample size increases.
  • A sample statistic can be an inconsistent estimator. A consistent statistic is unbiased in the limit, but an unbiased estimator may or may not be consistent.

Note that these two properties are not equivalent: (1) unbiasedness is a statement about the expected value of the sampling distribution of the estimator, while (2) consistency is a statement about where the sampling distribution of the estimator is going as the sample size increases.

A consistent estimator has negligible error (variation) as the sample size increases indefinitely. More specifically, the probability that the error exceeds any given amount approaches zero as the sample size increases. In other words, the more data you collect, the closer the estimator gets to the real population parameter you are trying to measure. The sample mean ($\overline{X}$) and sample variance ($S^2$) are two well-known consistent estimators.
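A small Monte Carlo sketch makes the definition tangible; the normal population ($\mu = 10$, $\sigma = 2$), the tolerance $\varepsilon = 0.1$, and the sample sizes below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mu, sigma = 10.0, 2.0   # population parameters (illustrative)
epsilon = 0.1           # tolerance around the true mean
n_reps = 1_000          # Monte Carlo replications per sample size

for n in (10, 100, 1_000, 10_000):
    # Draw n_reps samples of size n and compute each sample mean (x-bar).
    xbar = rng.normal(mu, sigma, size=(n_reps, n)).mean(axis=1)
    # Estimate P(|x-bar - mu| <= epsilon); it should approach 1 as n grows.
    coverage = np.mean(np.abs(xbar - mu) <= epsilon)
    print(f"n = {n:>6}: P(|xbar - mu| <= {epsilon}) = {coverage:.3f}")
```

The estimated probability climbs toward 1 as $n$ increases, which is exactly the defining limit $\lim_{n \to \infty} P[|\hat{\theta}_n - \theta| \le \varepsilon] = 1$ with $\hat{\theta}_n = \overline{X}$.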
