Stem and Leaf Plot: Exploratory Data Analysis

Before performing any statistical calculation, even the simplest one, the data should be tabulated or plotted to visualize the shape of the distribution, especially when the data are quantitative and the observations are few.

A stem and leaf plot summarizes a set of data measured on an interval scale in condensed form. Stem and leaf plots are often used in exploratory data analysis and help to illustrate the different features of the distribution of the observed data. A basic stem and leaf display contains two columns separated by a vertical line. The left side of the vertical line contains the stems, while the right side contains the leaves. It is customary to sort the leaves within each stem from smallest to largest. To present a set of data with this technique, each numerical value is divided into two parts:

  1. Leading Digit(s)
  2. Trailing Digit

Stem values are the leading digit(s), and leaves are the trailing digits. The stems are listed along the vertical axis, and the leaves belonging to each stem are placed side by side along the horizontal axis.

A stem and leaf display is similar to a frequency distribution but carries more information. It reveals the symmetry, concentration, gaps (empty stems), and outliers of the observed data set. Organizing the data into a frequency distribution has the following disadvantages:

  1. The exact identity of each value is lost (the individuality of the observations vanishes).
  2. We cannot be sure how the values within each class are distributed.

The advantage of the stem and leaf plot (display) over a frequency distribution is that we do not lose the identity (individuality) of each observation. A stem and leaf plot also resembles a histogram, but it usually provides more information for a relatively small data set.

More than one data set can be compared by using multiple stem and leaf plots. Using a back-to-back stem and leaf plot, we can compare the same characteristic in two different groups.
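
A back-to-back display can be sketched in Python as follows. This is only an illustration: the two groups are hypothetical values invented for the example, the helper names `leaves_by_stem` and `back_to_back` are not from any library, and the code assumes non-negative integers with the units digit as the leaf.

```python
def leaves_by_stem(values):
    """Map each stem (all digits but the last) to its sorted leaves."""
    result = {}
    for v in sorted(values):
        result.setdefault(v // 10, []).append(v % 10)
    return result

def back_to_back(left, right):
    """Return the lines of a back-to-back stem and leaf display."""
    l, r = leaves_by_stem(left), leaves_by_stem(right)
    stems = range(min(min(l), min(r)), max(max(l), max(r)) + 1)
    lines = []
    for s in stems:
        # Left leaves are reversed so that they read outward from the stem.
        left_part = " ".join(str(d) for d in reversed(l.get(s, [])))
        right_part = " ".join(str(d) for d in r.get(s, []))
        lines.append(f"{left_part:>12} | {s} | {right_part}")
    return lines

# Hypothetical scores for two groups (not taken from the article).
group_a = [52, 58, 61, 67, 67, 73]
group_b = [55, 63, 64, 71, 78, 79]

lines = back_to_back(group_a, group_b)
for line in lines:
    print(line)
```

The shared stems sit in the middle column, so the two groups can be compared stem by stem.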

The stem and leaf plot is attributed to Tukey, J. W. (1977).

Constructing a Stem and Leaf Plot

Consider the following data set: 56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59, 69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73. Suppose we want to draw a stem and leaf plot of these data.

First of all, it’s better to sort the data. The sorted data is 56, 59, 64, 65, 69, 70, 71, 73, 76, 77, 78, 80, 82, 82, 85, 86, 91, 91, 92, 92, 95, 98, 99.

Now the first (tens) digit is the stem and the second (units) digit is the leaf. The stems run from 5 to 9 because the data range from 56 to 99.

Draw a vertical line separating the stems from the leaves. Put the stem values on the left side of the vertical line (bar) and the leaf values on the right side. Note that each number is assigned to the graph (plot) by pairing its units digit, or leaf, with the correct stem. For example, the score 56 is plotted by placing the units digit, 6, to the right of stem 5.

The stem and leaf plot of the above data looks like this:

The decimal point is 1 digit(s) to the right of the |
Stem | Leaf
5      | 6 9
6      | 4 5 9
7      | 0 1 3 6 7 8
8      | 0 2 2 5 6
9      | 1 1 2 2 5 8 9
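
The construction steps above can be sketched in Python. This is a minimal illustration that assumes non-negative integers whose last digit is the leaf and whose remaining digits are the stem. The function name `stem_and_leaf` is chosen here for illustration and does not come from any library.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} for non-negative integers."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 56 -> stem 5, leaf 6
        plot[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible.
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}

data = [56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
        69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Sorting first means the leaves within each stem come out in increasing order, matching the customary layout.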

When rotated 90 degrees anti-clockwise, the stem and leaf plot looks like a histogram.

By adding columns of frequency and cumulative frequency in the stem and leaf plots we can find the median of the data.
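
As a rough sketch of this idea, the Python code below adds frequency and cumulative frequency to each stem of the example data and then locates the median. It assumes two-digit integers and an odd number of observations (23 here), so the median is the value at position (n + 1) / 2. The variable names are illustrative only.

```python
data = [56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
        69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73]
ordered = sorted(data)
n = len(ordered)

# One row per stem: (stem, leaves, frequency, cumulative frequency).
rows = []
cumulative = 0
for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1):
    leaves = [v % 10 for v in ordered if v // 10 == stem]
    cumulative += len(leaves)
    rows.append((stem, leaves, len(leaves), cumulative))

# For odd n, the median is the ((n + 1) / 2)-th ordered value.
position = (n + 1) // 2
for stem, leaves, freq, cum in rows:
    if cum >= position:
        # Count into this stem's leaves to reach the median position.
        leaf = leaves[position - (cum - freq) - 1]
        median = stem * 10 + leaf
        break

for stem, leaves, freq, cum in rows:
    print(stem, "|", " ".join(map(str, leaves)), "| f =", freq, "| cf =", cum)
print("Median:", median)  # Median: 80
```

The cumulative frequency first reaches position 12 in stem 8, and the first leaf there is 0, which gives a median of 80.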

Stem and Leaf Plot

Histogram Graph: Useful Graphical Representation of Data

A histogram is very similar to a bar chart, but it displays a frequency distribution based on quantitative data, whereas a bar chart shows the distribution of qualitative data. It is a useful graphical representation that helps to visualize the distribution of data.

Important Points to Draw a Histogram Graph

The histogram is constructed from the grouped data by taking the class boundaries (not class limits) along the x-axis and the corresponding frequencies along the y-axis. For ungrouped data, we have to form the grouped frequency distribution before making a histogram. It consists of a set of bars (like a bar chart) but these bars are adjacent to each other and the height of the bars is proportional to the frequency associated with respective classes.
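
The step from ungrouped data to a grouped frequency distribution with class boundaries can be sketched as follows. The data are the 23 values from the stem and leaf example. The class limits 50-59, 60-69, ..., 90-99 are an assumption made for this illustration, with each boundary lying 0.5 below or above the corresponding limit.

```python
data = [56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
        69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73]

width = 10  # assumed class width
table = []  # rows of (lower boundary, upper boundary, frequency)
for lower_limit in range(50, 100, width):
    upper_limit = lower_limit + width - 1
    lower_boundary = lower_limit - 0.5
    upper_boundary = upper_limit + 0.5
    freq = sum(lower_boundary <= x < upper_boundary for x in data)
    table.append((lower_boundary, upper_boundary, freq))

for lower_boundary, upper_boundary, freq in table:
    print(f"{lower_boundary} to {upper_boundary}: {freq}")
```

Because the upper boundary of one class equals the lower boundary of the next, bars drawn on these boundaries touch each other.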

The area of each rectangle represents the respective class frequency. When the class intervals are equal, the rectangles all have the same width and their heights directly represent the class frequencies. When the class intervals are not all equal, the height of the rectangle (bar) over an unequal class interval has to be adjusted, because it is the area and not the height that measures frequency. This means that the height of a rectangle must be proportionally decreased if the length of the corresponding class interval increases.

For example, if the length of a class interval is doubled, then the height of the rectangle has to be halved so that the area, the fundamental property of a histogram rectangle, remains unchanged. This sort of rescaling is necessary to observe the correct pattern of the distribution.
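
The rescaling can be illustrated with a few lines of Python. The classes and frequencies below are hypothetical, and `bar_heights` is an illustrative helper name. It computes the height as frequency divided by class width, often called the frequency density.

```python
import math

def bar_heights(classes):
    """classes: list of (lower boundary, upper boundary, frequency).
    Returns the frequency per unit of class width for each class."""
    return [f / (upper - lower) for lower, upper, f in classes]

# Hypothetical: the same frequency (12) over a class of width 10
# and over a class of twice that width.
narrow = (0, 10, 12)
wide = (0, 20, 12)

h_narrow, h_wide = bar_heights([narrow, wide])
print(h_narrow, h_wide)              # the wider class gets half the height

# The area (width x height) recovers the frequency in both cases.
area_narrow = (narrow[1] - narrow[0]) * h_narrow
area_wide = (wide[1] - wide[0]) * h_wide
print(math.isclose(area_narrow, 12), math.isclose(area_wide, 12))
```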

Important Features of Histogram

The important feature of the histogram is that there is no gap (space) between the vertical bars, because the variable plotted on the horizontal axis is quantitative and is measured on either an interval or a ratio scale. Thus, it provides an easily interpreted visual representation of a frequency distribution. Note that class midpoints may be used as labels for the classes.

It allows us to analyze extremely large data sets by reducing them to a single graphical representation, which can show primary, secondary, and tertiary peaks in the data and gives a visual impression of how prominent those peaks are.

Alternative to the Histogram

An alternative to the histogram is kernel density estimation, which uses a kernel to smooth the sample. This produces a smooth estimate of the probability density function, which, in general, reflects the distribution of the underlying variable more accurately.
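
A kernel density estimate can be sketched in plain Python as below. This is a minimal illustration with a Gaussian kernel. The bandwidth of 5.0 is an arbitrary assumption, not a recommended value, and `gaussian_kde` is a hand-written helper here rather than a library call. The data are the 23 values from the stem and leaf example.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density at x by averaging
    Gaussian kernels centred on each observation."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

data = [56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
        69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73]

density = gaussian_kde(data, bandwidth=5.0)  # bandwidth chosen arbitrarily
for x in range(50, 101, 10):
    print(x, round(density(x), 4))
```

A smaller bandwidth follows the sample more closely, while a larger one gives a smoother curve, much as the choice of class width affects a histogram.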

Histograms for Continuous Grouped Data

To draw a histogram graph from the continuous grouped frequency distribution, the following steps are taken.

  1. Mark the class boundaries of the classes along the x-axis.
  2. Mark frequencies along the y-axis.
  3. Draw a rectangle for each class such that the height of each rectangle is proportional to the frequency corresponding to that class. This is the case when classes are of equal width as they often are.
  4. If the classes are of unequal width, then the area instead of the height of each rectangle is proportional to the frequency corresponding to that class, and the height of each rectangle is obtained by dividing the frequency of the class by the width of that class.

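It may be noted that the area under a histogram graph can be calculated by adding up the areas of all the rectangles that constitute the histogram. When the height of each rectangle is the class frequency (classes of equal width), the area of one rectangle is obtained by multiplying the width of the class by the corresponding frequency, i.e.

Area of a single rectangle = width of the class × frequency of the class

When the heights are obtained by dividing the frequency by the class width (classes of unequal width), the area of each rectangle equals the frequency of its class.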
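
As a small worked example, the sketch below adds up the rectangle areas for equal-width classes. The frequencies are the stem counts of the 23 values in the stem and leaf example, grouped in classes of width 10, which is an assumption made for this illustration.

```python
class_width = 10
frequencies = [2, 3, 6, 5, 7]   # classes 50s, 60s, 70s, 80s, 90s

# Height equals frequency, so each area is width x frequency.
areas = [class_width * f for f in frequencies]
total_area = sum(areas)

print(areas)        # [20, 30, 60, 50, 70]
print(total_area)   # 230, i.e. class width x total frequency (10 x 23)
```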

Histogram for Discrete Data

Bar graphs are usually drawn for discrete and categorical data, but in some situations where an approximation is needed, a histogram may be constructed instead. To construct a histogram graph for discrete grouped data, the following steps are taken:

  1. Mark possible values on the x-axis.
  2. Mark frequencies along the y-axis.
  3. Draw a rectangle centered on each value with equal width on each side, possibly 0.5 to either side of the value.
Histogram graph
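
The steps for discrete data can be sketched in Python as follows. The observations are hypothetical values invented for the illustration, and each rectangle extends 0.5 to either side of its value, as suggested in step 3.

```python
from collections import Counter

# Hypothetical discrete observations (e.g. number of defects per item).
observations = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2, 1, 0, 3, 2]
counts = Counter(observations)

# One rectangle per possible value: (left edge, right edge, height).
rectangles = [
    (value - 0.5, value + 0.5, counts.get(value, 0))
    for value in range(min(counts), max(counts) + 1)
]

for left, right, height in rectangles:
    print(f"{left} to {right}: {height}")
```

With a half-width of 0.5 on integer values, neighbouring rectangles share an edge, so the bars are adjacent as in any histogram.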

The advantages of the histograms as compared to the unprocessed data are:

  1. It gives the range of the data.
  2. It gives the location of the data.
  3. It gives a clue about the skewness of the data.
  4. It gives information about out-of-control situations.
  5. Histograms are density estimates (they give a good impression of the distribution of the data).
  6. It can be compared to the normal curve.

The disadvantages are:

  • Exact values cannot be read from a histogram graph because the data are grouped into classes and the individuality of the data vanishes in grouped data.
  • It is more difficult to compare two data sets.
  • It is mainly suited to continuous data; for discrete data, it is only an approximation.

FAQs about Histogram

  1. What is a histogram graph?
  2. What is the difference between a bar chart and a histogram?
  3. What are the important features of histograms?
  4. What are the advantages and disadvantages of histogram graphs?
  5. How can one draw a histogram for a discrete data set?
  6. How can one draw a histogram for a continuous data set?

Pareto Chart Easy Guide (2012)

A Pareto chart, named after Vilfredo Pareto (an Italian economist), is a bar chart in which the bars are ordered from largest to smallest, together with a line showing the cumulative percentage or count of the bars. The left vertical axis shows the frequency of occurrence (number of occurrences) or some other important unit of measure, such as cost. The right vertical axis shows the cumulative percentage of the total number of occurrences or of the total of the particular unit of measure, such as total cost. For the Pareto chart, the cumulative function is concave because the bars (representing the reasons) are in decreasing order. A Pareto chart is also called a Pareto distribution diagram.

The Pareto chart is also known as the 80/20 rule chart. These charts offer several benefits for data analysis and problem-solving.
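
The quantities plotted in a Pareto chart can be computed as in the sketch below. The defect categories and counts are hypothetical, invented only to show the ordering and the cumulative percentage. They are not data from any study.

```python
# Hypothetical defect counts per category.
counts = {
    "Scratches": 12,
    "Dents": 45,
    "Misalignment": 8,
    "Cracks": 30,
    "Other": 5,
}

# Order the bars from largest to smallest.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(counts.values())

# Rows of (category, count, cumulative percentage).
table = []
running = 0
for category, count in ordered:
    running += count
    table.append((category, count, round(100 * running / total, 1)))

for category, count, cumulative_pct in table:
    print(f"{category:<13} {count:>3} {cumulative_pct:>6}%")
```

The counts give the bar heights for the left axis, and the cumulative percentages give the line for the right axis. In this made-up example the first two categories already account for 75% of the defects.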

A Pareto chart can be used when the answer to both of the following questions is “yes”:

  1. Can data be arranged into categories?
  2. Is the rank of each category important?

Pareto charts are often used to analyze defects in a manufacturing process, or the most frequent reasons for customer complaints, to help determine which types of defects are most prevalent (important) in a process. A company can then focus its improvement efforts on the areas where it can make the largest gain, or incur the lowest loss, by eliminating the causes of defects. Pareto charts therefore make it easy to prioritize the problem areas. The categories in the “tail” of the Pareto chart are called the insignificant factors.

Pareto Chart Example

Pareto Chart

The Pareto chart given above shows the reasons for consumer complaints against airlines in 2004. Here each bar represents the number (frequency) of complaints received of each type. The largest category of complaints relates to flight problems (such as cancellations, delays, and other deviations from the schedule). The second largest category is customer service (rude or unhelpful employees, inadequate meals or cabin service, treatment of delayed passengers, etc.). Flight problems account for 21% of the complaints, while flight problems and customer service together account for 40% of the complaints. The top three complaint categories account for 55% of the complaints. So, to reduce the number of complaints, airlines should work on flight delays, customer service, and baggage problems.

By incorporating Pareto charts into data analysis, one can gain valuable insights, prioritize effectively, and make data-driven decisions.
