Primary and Secondary Data (2014)

Data

Before learning about primary and secondary data, let us first understand the term "data" in Statistics.

Statistics studies facts and figures that can be measured numerically. A numerical measurement of a characteristic is known as an observation, and a collection of observations is termed data. Data are collected by individual research workers or by organizations through sample surveys or experiments, keeping in view the objectives of the study. The data collected may be (i) primary data or (ii) secondary data.

Primary and Secondary Data in Statistics

The difference between primary and secondary data in Statistics is that primary data is collected firsthand by a researcher (an organization, person, authority, agency, or party, etc.) through experiments, surveys, questionnaires, focus groups, interviews, and (required) measurements, while secondary data has already been collected by someone else and is readily available to the public through publications, journals, and newspapers.

Primary Data

Primary data means raw data (data that has not been processed or tailored) that has just been collected from the source and has not undergone any kind of statistical treatment such as sorting or tabulation. The term primary data may sometimes be used to refer to first-hand information.

Sources of Primary Data

The sources of primary data are primary units such as basic experimental units, individuals, and households. The following methods are usually used to collect data from primary units; the choice of method depends on the nature of the primary unit. By contrast, published data and data collected in the past are called secondary data.

  • Personal Investigation
    The researcher conducts the experiment or survey personally and collects the data from it. The collected data is generally accurate and reliable. This method of collecting primary data is feasible only for small-scale laboratory or field experiments and pilot surveys; it is not practicable for large-scale experiments and surveys because it takes too much time.
  • Through Investigators
    Trained (experienced) investigators are employed to collect the required data. In the case of surveys, they contact the individuals and fill in the questionnaires after asking for the required information. A questionnaire is an inquiry form containing questions designed to obtain information from the respondents. This method is employed by most organizations and gives reasonably accurate information, but it is very costly and may be time-consuming too.
  • Through Questionnaire
    The required information (data) is obtained by sending a questionnaire (in printed or soft form) by mail to the selected individuals (respondents), who fill in the questionnaire and return it to the investigator. This method is relatively cheap compared to the "through investigators" method, but the non-response rate is very high, as most respondents do not bother to fill in the questionnaire and send it back to the investigator.
  • Through Local Sources
    The local representatives or agents are asked to send requisite information and provide the information based on their own experience. This method is quick but it gives rough estimates only.
  • Through Telephone
    The information may be obtained by contacting the individuals by telephone. This method is quick and provides the required information accurately.
  • Through Internet
    With the introduction of information technology, people may be contacted through the Internet and individuals may be asked to provide pertinent information. Google Survey is widely used as an online method for data collection nowadays. There are many paid online survey services too.

It is important to go through the primary data and locate any inconsistent observations before it is given a statistical treatment.

Secondary Data

Secondary data is data that has already been collected by someone else and may have been sorted, tabulated, or otherwise given a statistical treatment. It is processed (tailored) data.

Sources of Secondary Data

The secondary data may be available from the following sources:

  • Government Organizations
    Federal and Provincial Bureau of Statistics, Crop Reporting Service-Agriculture Department, Census and Registration Organization etc.
  • Semi-Government Organizations
    Municipal committees, District Councils, commercial and financial institutions such as banks, etc.
  • Teaching and Research Organizations
  • Research Journals and Newspapers
  • Internet

Quartiles in Statistics (2025)

Quartiles in Statistics

Like percentiles and deciles, quartiles are a type of quantile, which is a measure of the relative standing of an observation within the data set. The quartiles are three values that divide the ordered data into four equal parts, each comprising a quarter of the data: the first quartile $Q_1$, the second quartile $Q_2$ (also the median), and the third quartile $Q_3$.

The first quartile (also known as the lower quartile, $Q_1$) is the value in the ordered data that exceeds 1/4 of the observations and is less than the remaining 3/4 of the observations. The third quartile (also known as the upper quartile, $Q_3$) is the value in the ordered data that exceeds 3/4 of the observations and is less than the remaining 1/4 of the observations, while the second quartile is the median.

Quartiles in Statistics for Ungrouped Data

For ungrouped data, the quartiles are calculated by splitting the ordered data at the median and then calculating the median of each of the two halves. If $n$ is odd, the median can be included in both halves.

Example: Quartiles in Statistics

Find $Q_1, Q_2$, and $Q_3$ for the following ungrouped dataset: 88.03, 94.50, 94.90, 95.05, 84.60.

Solution: We split the ordered data at the median and calculate the median of each half. Since $n$ is odd, we can include the median in both halves. The ordered data are 84.60, 88.03, 94.50, 94.90, 95.05.

\begin{align*}
Q_2&=\text{median}=Y_{(\frac{n+1}{2})}=Y_{(3)}\\
&=94.50 \quad (\text{the third observation})\\
Q_1&=\text{median of the first three values}=Y_{(\frac{3+1}{2})}\\
&=Y_{(2)}=88.03 \quad (\text{the second observation})\\
Q_3&=\text{median of the last three values}=Y_{(\frac{3+5}{2})}\\
&=Y_{(4)}=94.90 \quad (\text{the fourth observation})
\end{align*}
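
The median-of-halves rule used above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `quartiles_inclusive` is hypothetical, and the rule of keeping the median in both halves when $n$ is odd follows the convention of this example (statistical software often uses other interpolation rules and may give slightly different quartiles).

```python
from statistics import median


def quartiles_inclusive(values):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median.

    When n is odd, the median is included in both halves,
    as in the worked example above.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]   # first half, including the median
        upper = ordered[half:]       # second half, including the median
    else:
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [88.03, 94.50, 94.90, 95.05, 84.60]
q1, q2, q3 = quartiles_inclusive(data)
print(q1, q2, q3)  # 88.03 94.5 94.9
```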

Quartiles in Statistics for Grouped Data

For the grouped data (in ascending order) the quartiles are calculated as:
\begin{align*}
Q_1&=l+\frac{h}{f}(\frac{n}{4}-c)\\
Q_2&=l+\frac{h}{f}(\frac{2n}{4}-c)\\
Q_3&=l+\frac{h}{f}(\frac{3n}{4}-c)
\end{align*}
where
$l$    is the lower class boundary of the class containing $Q_1, Q_2$ or $Q_3$.
$h$    is the width of the class containing $Q_1, Q_2$ or $Q_3$.
$f$    is the frequency of the class containing $Q_1, Q_2$ or $Q_3$.
$c$    is the cumulative frequency of the class immediately preceding the class containing $Q_1, Q_2$ or $Q_3$.
$n$    is the total number of observations; the values $\frac{n}{4}, \frac{2n}{4}$ or $\frac{3n}{4}$ are used to locate the $Q_1, Q_2$ or $Q_3$ group.

Quartiles in Statistics Example

Find the quartiles for the following grouped data

Solution: To locate the class containing $Q_1$, find the $\frac{n}{4}$th observation, which here is the $\frac{30}{4}$th, i.e., the 7.5th observation. The 7.5th observation falls in the group 90.5–95.5 (the $Q_1$ group).
\begin{align*}
Q_1&=l+\frac{h}{f}(\frac{n}{4}-c)\\
&=90.5+\frac{5}{4}(7.5-6)=92.3750
\end{align*}

For $Q_2$, the $\frac{2n}{4}$th observation = $\frac{2 \times 30}{4}$th observation = 15th observation, which falls in the group 95.5–100.5.
\begin{align*}
Q_2&=l+\frac{h}{f}(\frac{2n}{4}-c)\\
&=95.5+\frac{5}{10}(15-10)=98
\end{align*}

For $Q_3$, the $\frac{3n}{4}$th observation = $\frac{3\times 30}{4}$th observation = 22.5th observation, which falls in the group 100.5–105.5. So
\begin{align*}
Q_3&=l+\frac{h}{f}(\frac{3n}{4}-c)\\
&=100.5+\frac{5}{6}(22.5-20)=102.5833
\end{align*}
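
As a quick arithmetic check, the grouped-data formula $l+\frac{h}{f}(\text{position}-c)$ can be evaluated directly. The sketch below plugs in only the class values quoted in the solution ($l$, $h$, $f$, $c$ and $n=30$); the helper name `grouped_quantile` and its parameter names are hypothetical.

```python
def grouped_quantile(l, h, f, c, position):
    """Evaluate l + (h / f) * (position - c) for one quantile class.

    l: lower class boundary, h: class width, f: class frequency,
    c: cumulative frequency of the preceding class,
    position: n/4, 2n/4 or 3n/4 for Q1, Q2 or Q3.
    """
    return l + (h / f) * (position - c)


n = 30  # total number of observations in the example
q1 = grouped_quantile(l=90.5, h=5, f=4, c=6, position=n / 4)
q2 = grouped_quantile(l=95.5, h=5, f=10, c=10, position=2 * n / 4)
q3 = grouped_quantile(l=100.5, h=5, f=6, c=20, position=3 * n / 4)
print(round(q1, 4), round(q2, 4), round(q3, 4))
```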

Application of Quartiles

By analyzing quartiles, one can get insights into the:

  • Spread of the data: The distance between $Q_1$ and $Q_3$ (called the interquartile range or IQR) indicates how spread out the data is. A relatively large IQR indicates a wider distribution, while a small IQR shows that the data is more concentrated around the median ($Q_2$).
  • Presence of outliers: Data points that lie extremely far from the quartiles might be outliers that could skew measures such as the mean.
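
Both uses can be illustrated with the ungrouped example from earlier ($Q_1=88.03$, $Q_3=94.90$). The sketch below computes the IQR and flags points lying far from the quartiles. Note that the cutoff of 1.5 × IQR beyond $Q_1$ and $Q_3$ is a common convention and an assumption here, not something stated in this article; the function name `iqr_outliers` is hypothetical.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Return (iqr, outliers) using fences at q1 - k*IQR and q3 + k*IQR.

    k = 1.5 is a widely used convention, assumed here for illustration.
    """
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return iqr, outliers


data = [88.03, 94.50, 94.90, 95.05, 84.60]
iqr, outliers = iqr_outliers(data, q1=88.03, q3=94.90)
print(round(iqr, 2), outliers)
```

With these five observations, no point lies outside the fences, so the list of flagged outliers is empty.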
Frequently Asked Questions about Quartiles in Statistics

  1. What are the first, second, and third quartiles?
  2. How to compute quartiles for ungrouped data?
  3. How to compute quartiles for grouped data?
  4. What is the application of quartiles?
  5. Give some Numerical examples of quartiles.

Range Measure of Dispersion (2013)

A measure of central tendency provides a typical value for the data set, but it does not tell the whole story about the data; i.e., the mean, median, and mode are not enough to summarize the data, because they describe only its center. In other words, we can measure the center of the data by looking at averages (mean, median, and mode), but these measures tell us nothing about the spread of the data. So, for more information about the data, we need some other measure, such as the range measure of dispersion (spread).

Range Measure of Dispersion

The spread of data can be measured by calculating the range of the data; the range tells us over how many units the data extend. The range is an absolute measure of dispersion that can be found by subtracting the smallest value (called the lower bound) in the data from the largest value (called the upper bound), i.e.,

Range = Upper Bound – Lower Bound
OR
Range = Largest Value – Smallest Value

This absolute measure of dispersion has disadvantages: the range describes only the width of the data set (i.e., how far it spreads out), measured in the same unit as the data, but it does not give the real picture of how the data are distributed. If the data contain outliers, using the range to describe the spread can be very misleading, as the range is sensitive to outliers.

We need to be careful in using the range measure of dispersion, as it does not give the full picture of what is going on between the highest and lowest values. It might give a misleading picture of the spread of the data because it is based only on the two extreme values. Therefore, the range is an unsatisfactory measure of dispersion.
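
A short sketch makes the sensitivity to outliers concrete. The temperature readings below are made-up values used purely for illustration, and the function name `data_range` is hypothetical.

```python
def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


# Hypothetical daily temperatures (illustrative values only)
temps = [21, 22, 23, 24, 25]
temps_with_outlier = temps + [60]  # one extreme reading added

print(data_range(temps))               # 4
print(data_range(temps_with_outlier))  # 39
```

A single extreme reading changes the range from 4 to 39, even though the other five observations are unchanged.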

However, because it is very easy to calculate, the range measure of dispersion is widely used in applications such as statistical process control (e.g., control charts of manufactured products), daily temperatures, and stock prices. The range is an absolute measure of dispersion; its relative measure, known as the coefficient of dispersion, is defined by the relation

\[\text{Coefficient of Dispersion} = \frac{x_m-x_0}{x_m+x_0}\]

where $x_m$ is the largest value and $x_0$ is the smallest value in the data.

The coefficient of dispersion is a pure (dimensionless) number and is used for comparison purposes.
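
The sketch below illustrates why a dimensionless measure is convenient for comparison. It takes the coefficient as (largest − smallest) / (largest + smallest) and assumes positive data; the heights are made-up values and the function name `coefficient_of_range` is hypothetical.

```python
def coefficient_of_range(values):
    """(largest - smallest) / (largest + smallest); assumes positive data."""
    x_m, x_0 = max(values), min(values)
    return (x_m - x_0) / (x_m + x_0)


# Hypothetical heights, recorded in two different units
heights_cm = [150, 160, 170]
heights_m = [1.5, 1.6, 1.7]

print(max(heights_cm) - min(heights_cm))   # range in cm: 20
print(coefficient_of_range(heights_cm))    # 0.0625
print(round(coefficient_of_range(heights_m), 4))
```

The range itself depends on the unit (20 cm versus 0.2 m), while the coefficient is the same, up to floating-point rounding, for both versions of the data.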
