Graphs

The old adage "a picture is worth a thousand words" is quite true in statistics, as the eye can observe relationships and trends much more easily through pictures than through data. In a simplistic sense, there are three types of graphs:

circle graphs (which are often called pie charts)
bar graphs (and histograms, which are related but not quite identical)
line graphs (sometimes called frequency polygons)

Virtually everything else is a variation of one of these three types. However, rather than focusing on each type, we feel it is much more appropriate to discuss how to display different types of data.

Picturing Qualitative Data

The National Pet Owners Survey is conducted by the American Pet Products Association on a regular basis. For the 2011-2012 survey, they obtained the following results.

Type of Pet	Number of Households	Number of Pets
Bird	5.7 million	16.2 million
Cat	38.9 million	86.4 million
Dog	46.3 million	78.2 million
Equine	2.4 million	7.9 million
Fish, Freshwater	11.9 million	151.1 million
Fish, Saltwater	0.7 million	8.6 million
Reptile	4.6 million	13.0 million
Small Animal	5.0 million	16.0 million

The number of households owning each type of pet (in the first numerical column of the table) is qualitative data (in particular, nominal data), but the total of that column would be meaningless, because some households own more than one type of pet and others own none. In such cases, a bar graph is the most appropriate vehicle for displaying the data.

The number of pets owned (in the second numerical column of the table) is also qualitative data, and the total of that column would represent the total number of pets in the USA (except for the few households that might keep more exotic pets). Pie charts are an excellent vehicle for displaying qualitative data having a total.

Bar Graph of the Number of Households in the USA Owning Pets
Pie Chart of the Number of Pets in the USA

We could have constructed the first graph with vertical bars rather than horizontal, but it would have been slightly more difficult to determine how to place the text labels for each bar. And we could have displayed the second set of data as a bar graph rather than a pie chart, but we would have lost the sense of totality that comes with a pie chart. However, if we were displaying ordinal data rather than nominal, we would have avoided the pie chart, because the ordered structure of the data is lost in a pie chart.
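The sense of totality in a pie chart comes from each sector's share of the whole. As a minimal sketch (the function name `pie_sectors` is our own, not part of any library), the share and central angle of each sector can be computed from the number-of-pets column as follows:

```python
# Number of pets (in millions), copied from the 2011-2012 survey table.
pets_millions = {
    "Bird": 16.2,
    "Cat": 86.4,
    "Dog": 78.2,
    "Equine": 7.9,
    "Fish, Freshwater": 151.1,
    "Fish, Saltwater": 8.6,
    "Reptile": 13.0,
    "Small Animal": 16.0,
}


def pie_sectors(counts):
    """Return {category: (percent of total, central angle in degrees)}."""
    total = sum(counts.values())
    return {
        name: (100 * value / total, 360 * value / total)
        for name, value in counts.items()
    }


sectors = pie_sectors(pets_millions)
for name, (percent, angle) in sectors.items():
    print(f"{name:<18}{percent:6.1f}%{angle:8.1f} degrees")
```

The percentages sum to 100 and the angles to 360, which is exactly the property that the household counts lack. Freshwater fish account for roughly 40 percent of the total.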

Graphs are meant to be interpreted, and it is interesting to compare these two graphs. Dogs and cats dominate the first graph, but freshwater fish clearly form the largest sector of the second graph. The two graphs display different information, and in this case their differences alert us to the fact that households with freshwater fish own on average $\dfrac{151.1}{11.9} \approx 12.7$ fish each, while households with cats own on average $\dfrac{86.4}{38.9} \approx 2.2$ cats each.
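The same division can be carried out for every row of the table. The following sketch (with the illustrative helper name `pets_per_household`) computes the average number of pets per owning household for each type of pet:

```python
# Both columns (in millions), copied from the 2011-2012 survey table.
households = {
    "Bird": 5.7, "Cat": 38.9, "Dog": 46.3, "Equine": 2.4,
    "Fish, Freshwater": 11.9, "Fish, Saltwater": 0.7,
    "Reptile": 4.6, "Small Animal": 5.0,
}
pets = {
    "Bird": 16.2, "Cat": 86.4, "Dog": 78.2, "Equine": 7.9,
    "Fish, Freshwater": 151.1, "Fish, Saltwater": 8.6,
    "Reptile": 13.0, "Small Animal": 16.0,
}


def pets_per_household(pets, households):
    """Average number of pets of each type per household owning that type."""
    return {name: pets[name] / households[name] for name in households}


ratios = pets_per_household(pets, households)
# List the pet types from the largest average to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18}{ratio:5.1f}")
```

Each ratio is an average over the households that own that type of pet, not over all households in the USA.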

Picturing Quantitative Data

The U.S. Census Bureau provided the following estimates of the annual income of individuals for the year 2008.

Income Range (in dollars)	Number of Individuals (in millions)
0 - 4999	20.353
5000 - 9999	21.476
10000 - 14999	21.981
15000 - 19999	18.801
20000 - 24999	17.742
25000 - 29999	14.941
30000 - 34999	14.078
35000 - 39999	11.895
40000 - 44999	10.447
45000 - 49999	7.994
50000 - 54999	8.963
55000 - 59999	5.136
60000 - 64999	5.921
65000 - 69999	3.909
70000 - 74999	3.961
75000 - 79999	3.139
80000 - 84999	2.886
85000 - 89999	1.806
90000 - 94999	1.910
95000 - 99999	1.278
100000+	13.215

This data is quantitative data (and more specifically, ratio data). Quantitative data are almost always of either interval or ratio type, and in both cases, pie charts should be avoided because they lose the ordered structure present in the data. Furthermore, rather than a bar chart, a histogram should be used to display the continuous ordering of the ranges for each bar. That is, a histogram has no gaps between its bars, whereas a bar chart does. But if the ranges themselves have little relevance, then a line graph (frequency polygon) is probably more appropriate. For this example, we have shown both.

Histogram of Individual Income in the USA
Line Graph of Individual Income in the USA

Looking at the result, we should probably discuss the last class. Note that the ranges in the frequency distribution all had width $\$5,000$, except for the last open-ended class. Unfortunately, that open-ended class does distort the graph. If we ignored that last class, the histogram would be misleading, because it would imply that no one earned more than $\$100,000$. But we could drop the last class from the line graph, leaving the end of the line hanging, and this does give the impression that there would be more data to the right, just not displayed on this graph.
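To make the treatment of the last class concrete, here is a minimal sketch of how the points of a frequency polygon might be computed once the open-ended class is set aside. The helper `polygon_points` is our own illustration, and it assumes each class is treated as the interval from its lower limit up to (but not including) the next lower limit, so that the class 0 - 4999 has midpoint 2500.

```python
# Frequencies (millions of individuals) for the 20 closed classes in the
# table above: 0-4999, 5000-9999, ..., 95000-99999.
closed_frequencies = [
    20.353, 21.476, 21.981, 18.801, 17.742,
    14.941, 14.078, 11.895, 10.447, 7.994,
    8.963, 5.136, 5.921, 3.909, 3.961,
    3.139, 2.886, 1.806, 1.910, 1.278,
]
open_ended_frequency = 13.215  # the 100000+ class
CLASS_WIDTH = 5000


def polygon_points(frequencies, width, start=0):
    """Pair the midpoint of each equal-width class with its frequency.

    Assumption: each class is treated as [lower, lower + width),
    so the class 0-4999 has midpoint 2500.
    """
    return [
        (start + width * i + width / 2, frequency)
        for i, frequency in enumerate(frequencies)
    ]


points = polygon_points(closed_frequencies, CLASS_WIDTH)
total = sum(closed_frequencies) + open_ended_frequency
open_ended_share = 100 * open_ended_frequency / total

for midpoint, frequency in points:
    print(f"{midpoint:>8.0f}  {frequency:6.3f}")
print(f"Open-ended class: {open_ended_share:.1f}% of individuals")
```

The open-ended class has no midpoint, which is why it cannot be plotted on the same footing as the others, yet it holds a little over 6 percent of the individuals, too many to ignore silently.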

Another useful type of graph is the cumulative frequency polygon (also called an ogive). Instead of using the frequency data as given, we find the cumulative frequencies and plot those instead. In the following example, we have also converted the cumulative frequencies to percentages.

Cumulative Frequency Polygon of Individual Incomes

From examining this graph, we can easily see that the average (specifically, the median) income of individuals in the USA in 2008 was about $\$25,000$, and that an income of $\$75,000$ per year would place the earner at approximately the 90th percentile. We will discuss averages and percentiles in later sections.
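The readings taken from the graph can be checked numerically. The sketch below builds the cumulative percentages from the table and interpolates linearly between class boundaries, which is what reading a value off the ogive amounts to. The function names are our own, and the class boundaries are assumed to fall at 0, 5000, 10000, and so on up to 100000.

```python
from itertools import accumulate

# Frequencies (millions) for the 20 closed classes, plus the 100000+ class.
frequencies = [
    20.353, 21.476, 21.981, 18.801, 17.742,
    14.941, 14.078, 11.895, 10.447, 7.994,
    8.963, 5.136, 5.921, 3.909, 3.961,
    3.139, 2.886, 1.806, 1.910, 1.278,
]
open_ended = 13.215
total = sum(frequencies) + open_ended

# Assumed class boundaries: 0, 5000, ..., 100000.
boundaries = [5000 * i for i in range(21)]
cumulative_percent = [0.0] + [100 * c / total for c in accumulate(frequencies)]
ogive = list(zip(boundaries, cumulative_percent))


def percent_below(income):
    """Percent of individuals earning less than `income` (0 to 100000)."""
    for (x0, y0), (x1, y1) in zip(ogive, ogive[1:]):
        if x0 <= income <= x1:
            return y0 + (y1 - y0) * (income - x0) / (x1 - x0)
    raise ValueError("income lies outside the closed classes")


def income_at_percent(percent):
    """Income below which `percent` of individuals fall (inverse reading)."""
    for (x0, y0), (x1, y1) in zip(ogive, ogive[1:]):
        if y0 <= percent <= y1:
            return x0 + (x1 - x0) * (percent - y0) / (y1 - y0)
    raise ValueError("percent lies in the open-ended class")


median_income = income_at_percent(50)
print(f"Median income: about {median_income:,.0f} dollars")
print(f"Percent below 75,000 dollars: {percent_below(75000):.1f}")
```

Under these assumptions, the interpolated median comes out close to $\$27,000$, and an income of $\$75,000$ falls between the 88th and 89th percentiles, both consistent with the rougher values read from the graph.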

Picturing Time Series Data

Time series data will always have interval data (years, months, etc.) along one axis. Because of the continuous nature of time, a line graph (frequency polygon) should be used. Here are two examples of time series graphs. The first graph was prepared using data from the Centers for Disease Control and Prevention, and the second graph, displayed on the CNN Money web site, shows the Dow Jones Industrial Average for the week of January 9-13, 2012.

Graph of Death Rates in the USA
Graph of Dow Jones Average for a Week

The first graph may appear rather boring, but in fact, it indicates that death rates have fallen over the past 70 years (which is a good thing for all of us who are still living). The second graph looks quite active, yet when the scales and the detailed information are examined, we see that it displays a really ho-hum week (with not much opportunity for making large profits on the changes in the value of stocks).

Picturing Bivariate Data

Bivariate data, where each data point consists of two observations, are typically displayed using a scatterplot. The following graph is a scatterplot of the winning and losing scores for the Super Bowl football games from 1967 to 2011.

Scatterplot of Super Bowl Scores

The winning score is plotted along the horizontal axis and the losing score along the vertical axis. Since the winning score is always larger than the losing score, every data point is closer to the horizontal axis than to the vertical axis. The games where the scores were closest are found roughly equidistant from each axis. We can see that there were a number of close games, but more frequently the scores were not that close.
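The geometry of the scatterplot can be illustrated with a handful of games. The four scores below are supplied from general knowledge for illustration only; they are not taken from the data set behind the graph and are worth verifying independently. The sketch assumes each game is plotted as the point (winning score, losing score).

```python
# (year, winning score, losing score): a few illustrative games, supplied
# from general knowledge rather than from the data behind the scatterplot.
games = [
    (1967, 35, 10),
    (1969, 16, 7),
    (1990, 55, 10),
    (1991, 20, 19),
]

summary = []
for year, win, lose in games:
    # For the point (win, lose), the distance to the horizontal axis is the
    # losing score and the distance to the vertical axis is the winning score.
    closer_to_horizontal = lose < win
    margin = win - lose  # a small margin puts the point near the line y = x
    summary.append((year, closer_to_horizontal, margin))
    print(f"{year}: point ({win}, {lose}), margin {margin}")

closest_game = min(summary, key=lambda row: row[2])
print(f"Closest game in this sample: {closest_game[0]}")
```

Every point lies below the diagonal line on which the two scores would be equal, and the closest games sit nearest that diagonal.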

Good Graphs and Poor Graphs

Since graphs today are almost all made by computer software, we have not discussed the mechanics of putting one together. But if you follow certain principles, your graphs will be of much higher quality.

A graph should be self-contained, telling the entire story without the need for additional information. To achieve this, be sure to include informative titles, axis labels, and other necessary information.
A graph should not distort the data. To achieve this, if the data is numerical, equal-width classes are preferred. Be sure the scales on each axis begin at zero, and proceed in a proportional fashion. (But if the differences in values are more important, beginning at zero may not be the best choice, as in the graph of the Dow Jones Industrial Average.)
A graph should be interesting. Color often helps. Additional information on the graph can also keep the viewer's attention.
A graph should not include superfluous features. If the additional items do not increase the information available, they only clutter the graph and should probably be omitted.

We make no claim that the graphs on this page meet all of these criteria. You should take the time to consider what improvements might be made to enhance each of these graphs.

We close with one last example, which must be viewed to be appreciated: Hans Rosling's 200 Countries, 200 Years, 4 Minutes, produced by the British Broadcasting Corporation for their program The Joy of Stats. Rosling's multivariate data effectively tells the story of the progress of global health in the last two hundred years.