Basic Statistics and Data Analysis

Lecture notes, MCQS of Statistics

Graphic Presentation of Data

A chart or graph can say more than twenty pages of prose, and this is especially true when you are presenting and explaining data. A graph is a visual display of data in the form of continuous curves or discontinuous lines on graph paper. Many graphs simply summarize data that have been collected to support a particular theory. They help the audience understand the data quickly, make comparisons, see relationships, or notice a trend.

It is usually recommended to look carefully at graphical representations of the data before proceeding to formal statistical analysis, because trends in the data can often be revealed by charts and graphs.

A chart or graph is a graphical representation of data in which the data are usually represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart. A chart or graph can represent tabular numeric data, functions, or some kinds of qualitative structures.

Common Uses of Graphs

Presenting data in a graph is a pictorial way of representing relationships between various quantities, parameters, or variables. A graph basically summarizes how one quantity changes when another, related quantity changes.

  1. Graphs are useful for checking assumptions made about the data, e.g. the assumed probability distribution.
  2. Graphs provide a useful subjective impression of what the results of the formal analysis should be.
  3. Graphs often suggest the form of the statistical analysis to be carried out, particularly a graph of the model fitted to the data.
  4. Graphs give the reader a visual representation of the data or of the results of a statistical analysis, which is usually easier to understand and more attractive.
  5. Some graphs are useful for checking the variability in the observations, and outliers can be easily detected.

Some Important Points for Drawing Graphs

  • Clearly label the axes with the names of the variables and the units of measurement.
  • Keep the units along each axis uniform, regardless of the scales chosen for the axes.
  • Keep the diagram simple. Avoid any unnecessary details.
  • Choose a clear and concise title to make the graph meaningful.
  • If the data on different graphs are to be compared, always use identical scales.
  • In a scatter plot, do not join up the dots. Joining them makes it likely that you will see apparent patterns in a random scatter of points.
  • Use either grid rulings or tick marks on the axes to mark the graph divisions.
  • Use color, shading, or pattern to differentiate the sections of the graph, such as lines, slices of the pie, bars, etc.
  • In general, start each axis from zero; if the graph would be too large, indicate a break in the grid.

For further reading about graphic presentation of data, see https://en.wikipedia.org/wiki/Chart

Stem and Leaf plot helps to visualize the features of distribution of observed data

Before performing any statistical calculation, even the simplest one, the data should be tabulated or plotted to visualize the shape of the distribution, especially if they are quantitative and few in number.

A stem and leaf plot is a way of summarizing, in condensed form, a set of data measured on an interval scale. Stem and leaf plots are often used in exploratory data analysis and help to illustrate the different features of the distribution of the observed data. A basic stem-and-leaf plot contains two columns separated by a vertical line. The left side of the vertical line contains the stems, while the right side contains the leaves. It is customary to sort the values within each stem from smallest to largest. In this technique for presenting a set of data, each numerical value is divided into two parts:

  1. Leading Digit(s)
  2. Trailing Digit

Stem values are the leading digit(s) and leaves are the trailing digits. The stems are located along the vertical axis, and the leaf values are stacked against each other along the horizontal axis.

A stem and leaf plot is similar to a frequency distribution but carries more information. It reveals the symmetry, concentration, gaps (empty stems), and outliers of the observed data set. Organizing the data into a frequency distribution has the following disadvantages:

  1. The exact identity of each value is lost (the individuality of the observations vanishes).
  2. We cannot be sure how the values within each class are distributed.

The advantage of the stem and leaf plot (display) over a frequency distribution is that we do not lose the identity (individuality) of each observation. Likewise, a stem and leaf plot is similar to a histogram but usually provides more information for a relatively small data set.

More than one data set can be compared by using multiple stem and leaf plots. Using a back-to-back stem and leaf plot, we can compare the same characteristic in two different groups.

The stem and leaf plot is credited to J. W. Tukey (1977).

Constructing a stem-and-leaf display

Suppose we have the following data set: 56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59, 69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73, and we want to draw a stem and leaf plot of these data.

First of all, it is better to sort the data. The sorted data are 56, 59, 64, 65, 69, 70, 71, 73, 76, 77, 78, 80, 82, 82, 85, 86, 91, 91, 92, 92, 95, 98, 99.

The first digit is the stem and the second digit is the leaf, i.e. the stems run from 5 to 9 because the data range from 56 to 99.

Draw a vertical line separating the stems from the leaves. Put the stem values on the left side of the vertical line (bar) and the leaf values on the right side. Each number is assigned to the plot by pairing its units digit, or leaf, with the correct stem. For example, the score 56 is plotted by placing the units digit 6 to the right of stem 5.

The stem and leaf plot of the above data looks like this:

The decimal point is 1 digit(s) to the right of the |
Stem | Leaf
5      | 6 9
6      | 4 5 9
7      | 0 1 3 6 7 8
8      | 0 2 2 5 6
9      | 1 1 2 2 5 8 9
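
The construction above can be sketched in a few lines of Python. This is only an illustration, assuming two-digit whole numbers so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem) and collect units digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):          # sorting first keeps the leaves in ascending order
        stem, leaf = divmod(v, 10)    # e.g. 56 -> stem 5, leaf 6
        groups[stem].append(leaf)
    # List every stem between the smallest and largest, so empty stems stay visible.
    lo, hi = min(groups), max(groups)
    return {s: groups.get(s, []) for s in range(lo, hi + 1)}

data = [56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
        69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(d) for d in leaves)}")
```

The printed rows should match the display shown above, one row per stem.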

If a stem and leaf plot is rotated anticlockwise by 90 degrees, it looks like a histogram.

By adding columns of frequency and cumulative frequency to the stem and leaf plot, we can find the median of the data.
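
As a rough sketch of this idea, the code below walks down the stems, keeps a running cumulative frequency, and picks out the middle observation. The stems and leaves are copied from the plot above; the function name `median_from_stems` is an assumption made for this example.

```python
# Stems and leaves copied from the plot above (leaves already sorted).
stems = {5: [6, 9],
         6: [4, 5, 9],
         7: [0, 1, 3, 6, 7, 8],
         8: [0, 2, 2, 5, 6],
         9: [1, 1, 2, 2, 5, 8, 9]}

def median_from_stems(stems):
    """Locate the median using the frequency and cumulative frequency of each stem."""
    n = sum(len(leaves) for leaves in stems.values())
    # 1-based position(s) of the middle value(s): one if n is odd, two if n is even.
    positions = [(n + 1) // 2] if n % 2 else [n // 2, n // 2 + 1]
    found = []
    cumulative = 0
    for stem in sorted(stems):
        leaves = sorted(stems[stem])
        freq = len(leaves)
        for pos in positions:
            # The wanted position lies in this stem if it falls inside its cumulative range.
            if cumulative < pos <= cumulative + freq:
                found.append(10 * stem + leaves[pos - cumulative - 1])
        cumulative += freq
    return sum(found) / len(found)

print(median_from_stems(stems))
```

With 23 observations the middle one is the 12th. The cumulative frequency reaches 11 at the end of stem 7, so the median is the first leaf of stem 8, i.e. 80.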

Histogram: a useful graphical representation of data which helps to visualize the shape of data.

A histogram is similar to a bar chart, but it displays a frequency distribution based on quantitative data, whereas a bar chart shows the distribution of qualitative data. It is a useful graphical representation which helps to visualize the distribution of data.

A histogram is constructed from grouped data by taking the class boundaries (not the class limits) along the x-axis and the corresponding frequencies along the y-axis. For ungrouped data, we have to form a grouped frequency distribution before making a histogram. A histogram consists of a set of bars (like a bar chart), but these bars are adjacent to each other, and the area of each rectangle represents the frequency of the respective class. When the class intervals are equal, the rectangles all have the same width and their heights directly represent the class frequencies.

When the class intervals are not all equal, the height of the rectangle (bar) over an unequal class interval has to be adjusted, because it is area, not height, that measures frequency. This means that the height of a rectangle must be proportionally decreased if the length of the corresponding class interval increases. For example, if the length of a class interval is doubled, the height of the rectangle is halved, so that the area, which is the fundamental property of a rectangle in a histogram, remains unchanged. This rescaling is necessary to observe the correct pattern of the distribution.

A key feature of the histogram is that there is no gap between the vertical bars, because the variable plotted on the horizontal axis is quantitative, measured on an interval or ratio scale. Thus, the histogram provides an easily interpreted visual representation of a frequency distribution. Note that class midpoints may also be used as the labels for the classes.

Histograms allow us to analyze extremely large data sets by reducing them to a single graphical representation, which can show primary, secondary, and tertiary peaks in the data and give a visual impression of how pronounced those peaks are.

An alternative to the histogram is kernel density estimation, which uses a kernel to smooth the samples. This constructs a smooth probability density function, which will in general reflect the underlying variable more accurately.
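
To make the idea of kernel smoothing concrete, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The choice of a Gaussian kernel and the bandwidth of 5 are assumptions made only for illustration, and the sample values are the data set used in the stem and leaf example.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Every sample contributes a small bell curve centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

data = [56, 65, 98, 82, 64, 71, 78, 77, 86, 95, 91, 59,
        69, 70, 80, 92, 76, 82, 85, 91, 92, 99, 73]

density = gaussian_kde(data, bandwidth=5.0)   # bandwidth chosen arbitrarily here
for x in range(50, 101, 10):
    print(x, round(density(x), 4))
```

A smaller bandwidth follows the individual observations more closely, while a larger one gives a smoother curve, much like choosing narrower or wider classes in a histogram.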

Histogram for continuous grouped data

To draw a histogram from a continuous grouped frequency distribution, the following steps are taken.

  1. Mark the class boundaries of the classes along the x-axis.
  2. Mark the frequencies along the y-axis.
  3. Draw a rectangle for each class such that the height of each rectangle is proportional to the frequency of that class. This applies when the classes are of equal width, as they often are.
  4. If the classes are of unequal width, then the area, instead of the height, of each rectangle is proportional to the frequency of that class, and the height of each rectangle is obtained by dividing the frequency of the class by the width of that class.
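
Step 4 can be illustrated with a short sketch. The class boundaries and frequencies below are hypothetical, and the function name `histogram_heights` is made up for this example.

```python
def histogram_heights(boundaries, frequencies):
    """Height of each rectangle = class frequency / class width."""
    if len(boundaries) != len(frequencies) + 1:
        raise ValueError("need one more boundary than there are classes")
    heights = []
    for lower, upper, freq in zip(boundaries, boundaries[1:], frequencies):
        heights.append(freq / (upper - lower))
    return heights

# Hypothetical grouped data with unequal class widths (10, 10, 20, 40).
boundaries = [0, 10, 20, 40, 80]
frequencies = [5, 8, 12, 4]

heights = histogram_heights(boundaries, frequencies)
for lower, upper, h in zip(boundaries, boundaries[1:], heights):
    # width * height gives back the class frequency, i.e. the area of the rectangle
    print(f"{lower}-{upper}: height {h:.2f}, area {(upper - lower) * h:.1f}")
```

Note how the third class has the largest frequency (12) but not the tallest bar, because it is twice as wide as the first two classes.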

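It may be noted that the area under a histogram can be calculated by adding up the areas of all the rectangles that constitute the histogram. When the heights are the class frequencies (equal class widths), the area of one rectangle is obtained by multiplying the width of the class by the corresponding frequency, i.e.

Area of a single rectangle = width of the class × frequency of the class

When the heights are obtained by dividing the frequency by the class width (unequal class widths), the area of each rectangle equals the frequency of its class.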

Histogram for Discrete Data

Bar graphs are usually drawn for discrete and categorical data, but in some situations where an approximation is needed, a histogram may be constructed instead. To construct a histogram for discrete grouped data, the following steps are taken:

  1. Mark the possible values on the x-axis.
  2. Mark the frequencies along the y-axis.
  3. Draw a rectangle centered on each value, with equal width on each side, typically 0.5 to either side of the value.
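
A small sketch of these steps follows. It counts how often each value occurs and returns the left edge, right edge, and height of each rectangle. The sample values (number of children per family) are hypothetical, and the name `discrete_histogram_bars` is made up for this example.

```python
from collections import Counter

def discrete_histogram_bars(values, half_width=0.5):
    """Return (left edge, right edge, frequency) for a bar centred on each observed value."""
    counts = Counter(values)
    # Values that never occur get no bar in this simple sketch.
    return [(v - half_width, v + half_width, counts[v]) for v in sorted(counts)]

# Hypothetical discrete data: number of children in nine families.
children = [0, 1, 1, 2, 2, 2, 3, 3, 4]

bars = discrete_histogram_bars(children)
for left, right, freq in bars:
    print(f"{left:4.1f} to {right:4.1f}: {freq}")
```

Because each bar extends 0.5 to either side of an integer value, neighbouring bars touch, which is what makes the picture a histogram rather than a bar graph.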

Advantages:

The advantages of the histogram as compared to the unprocessed data are:

  1. It shows the range of the data.
  2. It shows the location of the data.
  3. It gives a clue about the skewness of the data.
  4. It gives information about out-of-control situations.
  5. Histograms are density estimates (they give a good impression of the distribution of the data).
  6. It can be compared to the normal curve.

Disadvantages:

  1. Exact values cannot be read from a histogram, because the data are grouped into classes and the individuality of the data vanishes in grouped data.
  2. It is more difficult to compare two data sets.
  3. It is suited mainly to continuous data.

Pareto Chart

A Pareto chart, named after Vilfredo Pareto (an Italian economist), is a bar chart in which all bars are ordered from largest to smallest, along with a line showing the cumulative percentage or cumulative count of the bars. The left vertical axis shows the frequency of occurrence (number of occurrences), or some other important unit of measure such as cost. The right vertical axis shows the cumulative percentage of the total number of occurrences, or of the total of the particular unit of measure, such as total cost. For a Pareto chart the cumulative function is concave, because the bars (representing the reasons) are in decreasing order. Pareto charts are also called Pareto distribution diagrams.

A Pareto chart can be used when the answer to both of the following questions is “yes”:

  1. Can the data be arranged into categories?
  2. Is the rank of each category important?

Pareto charts are often used to analyze defects in a manufacturing process, or the most frequent reasons for customer complaints, to help determine which types of defects are most prevalent (important) in a process. A company can then focus its improvement efforts on the particular areas where it can make the largest gain or avoid the largest loss by eliminating the causes of defects. So it is easy to prioritize the problem areas using Pareto charts. The categories in the “tail” of the Pareto chart are called the insignificant factors.
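
The numbers behind a Pareto chart can be prepared as in the sketch below: sort the categories by decreasing count, then accumulate the percentages. The defect categories and counts are hypothetical, and `pareto_table` is a name made up for this example.

```python
def pareto_table(counts):
    """Return (category, count, cumulative percentage) rows, largest count first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, count in ordered:
        running += count
        rows.append((category, count, 100.0 * running / total))
    return rows

# Hypothetical defect counts from an inspection.
defects = {"scratch": 12, "dent": 30, "misalignment": 5, "paint": 3}

rows = pareto_table(defects)
for category, count, cum_pct in rows:
    print(f"{category:13s} {count:3d} {cum_pct:6.1f}%")
```

The counts give the heights of the bars (left axis) and the cumulative percentages give the line (right axis). Because the counts are in decreasing order, each step of the cumulative line is no larger than the one before it, which is why the line is concave.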

Pareto chart Example

[Figure: sample Pareto diagram for an airline company]

The Pareto chart above shows the reasons for consumer complaints against airlines in 2004. Each bar represents the number (frequency) of complaints received in a category. The most frequent complaints are related to flight problems (such as cancellations, delays, and other deviations from schedule). The second most frequent complaints concern customer service (rude or unhelpful employees, inadequate meals or cabin service, treatment of delayed passengers, etc.). Flight problems account for 21% of the complaints, while flight problems and customer service together account for 40%. The top three complaint categories account for 55% of the complaints. So, to reduce the number of complaints, airlines need to work on flight delays, customer service, and baggage problems.
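
The cumulative percentages quoted above can be turned back into the share of each category by taking successive differences, as in this short sketch. Labelling the third category "baggage" is an inference from the last sentence of the paragraph, not something read off the chart.

```python
# Cumulative percentages as quoted in the text; the third label is inferred.
cumulative = [("flight problems", 21), ("customer service", 40), ("baggage", 55)]

def individual_shares(cumulative):
    """Convert cumulative percentages into the percentage of each single category."""
    shares = []
    previous = 0
    for name, cum_pct in cumulative:
        shares.append((name, cum_pct - previous))
        previous = cum_pct
    return shares

shares = individual_shares(cumulative)
for name, pct in shares:
    print(f"{name}: {pct}%")
print(f"all other categories: {100 - cumulative[-1][1]}%")
```

So customer service alone accounts for 19% and the third category for 15%, leaving 45% spread over the remaining categories in the tail.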

References:

  • Nancy R. Tague (2004). “Seven Basic Quality Tools”. The Quality Toolbox. Milwaukee, Wisconsin: American Society for Quality. p. 15. Retrieved 2010-02-05.
  • http://www.spcforexcel.com/pareto-diagrams-newsletter
  • http://en.wikipedia.org/wiki/Pareto_chart

Cumulative Frequency Distribution and Polygon

A cumulative frequency distribution (cumulative frequency curve or ogive) and a cumulative frequency polygon require cumulative frequencies. The cumulative frequency is denoted by CF, and for a class interval it is obtained by adding the frequency of that class to the frequencies of all the preceding classes. It indicates the total number of values less than or equal to the upper limit of that class. For comparing two or more distributions, relative cumulative frequencies or percentage cumulative frequencies are computed.

The relative cumulative frequency, denoted by crf, is a proportion obtained by dividing the cumulative frequency by the total frequency (the total number of observations). The crf of a class can also be obtained by adding the relative frequencies (rf) of the preceding classes, including that class. Multiplying the relative cumulative frequencies by 100 gives the corresponding percentage cumulative frequencies.

The method of construction of cumulative frequencies and cumulative relative frequencies is explained in the following table:

[Table: cumulative frequency distribution]
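
As an illustration of how such a table is built, the sketch below computes CF, crf, and percentage cumulative frequencies. The class frequencies are taken from the stem and leaf example (classes 50-59, 60-69, 70-79, 80-89, 90-99) and are not necessarily those of the table referred to above; `cumulative_table` is a name made up for this example.

```python
from itertools import accumulate

def cumulative_table(frequencies):
    """Return cumulative, relative cumulative, and percentage cumulative frequencies."""
    total = sum(frequencies)
    cf = list(accumulate(frequencies))    # running total of the class frequencies
    crf = [c / total for c in cf]         # proportion of observations up to each class
    pct = [100 * r for r in crf]          # the same proportion expressed in percent
    return cf, crf, pct

classes = ["50-59", "60-69", "70-79", "80-89", "90-99"]
frequencies = [2, 3, 6, 5, 7]

cf, crf, pct = cumulative_table(frequencies)
for label, f, c, r, p in zip(classes, frequencies, cf, crf, pct):
    print(f"{label}  f={f}  CF={c:2d}  crf={r:.3f}  {p:5.1f}%")
```

The last class always has a CF equal to the total number of observations, a crf of 1, and a percentage cumulative frequency of 100.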

To plot a cumulative frequency distribution, mark the upper class boundary of each class along the x-axis and the corresponding cumulative frequencies along the y-axis. For additional information, you can label the vertical axis on the left in units and the vertical axis on the right in percent. The cumulative frequencies are plotted against the upper class boundaries (for a "less than" ogive) or the lower class boundaries (for a "more than" ogive), and the plotted points are joined by straight lines. A cumulative frequency polygon can be used to estimate the median, quartiles, deciles, percentiles, etc.

[Figure: Cumulative Frequency Polygon or Ogive]
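
Reading the median or a quartile off a "less than" ogive amounts to linear interpolation between two plotted points. The sketch below assumes the classes 50-59 to 90-99 from the stem and leaf example, with class boundaries 49.5, 59.5, ..., 99.5; the function name `ogive_quantile` is made up for this example.

```python
def ogive_quantile(boundaries, cumulative, p):
    """Estimate the value below which a proportion p of the data lies.

    boundaries: class boundaries, starting with the lowest lower boundary
    cumulative: cumulative frequency at each boundary, starting with 0
    """
    target = p * cumulative[-1]
    for i in range(1, len(boundaries)):
        if cumulative[i] >= target:
            x0, x1 = boundaries[i - 1], boundaries[i]
            c0, c1 = cumulative[i - 1], cumulative[i]
            if c1 == c0:                  # flat segment: nothing to interpolate
                return x0
            # Straight-line interpolation along this segment of the ogive.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    return boundaries[-1]

boundaries = [49.5, 59.5, 69.5, 79.5, 89.5, 99.5]
cumulative = [0, 2, 5, 11, 16, 23]

print("median        :", ogive_quantile(boundaries, cumulative, 0.50))
print("first quartile:", ogive_quantile(boundaries, cumulative, 0.25))
```

For these classes the interpolated median is 80.5, close to but not equal to the exact median of 80 found from the raw data, because grouping hides where the values lie inside each class.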

Copyright © 2011-2017 ITFEATURE.COM