Measures of Central Tendency

A measure of central tendency is used when one intends to express the values in a distribution by a single representative value. Often, we refer to such a value as an average. Three types of average are quite common: the mean, the median, and the mode. It is necessary, though, to distinguish between population and sample data when computing averages.

Computing an Average from Raw Sample Data

Suppose we randomly sample twelve students who took the first statistics exam, and obtain the following raw data:

82, 76, 31, 87, 94, 76, 85, 88, 93, 98, 89, 93
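
As a quick check on these twelve scores, the three averages can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are chosen here for illustration and are not part of the original text.

```python
import statistics

# Exam scores from the sample of twelve students.
scores = [82, 76, 31, 87, 94, 76, 85, 88, 93, 98, 89, 93]

sample_mean = statistics.mean(scores)        # sum of the scores divided by 12
sample_median = statistics.median(scores)    # midpoint of the two middle scores once sorted
sample_modes = statistics.multimode(scores)  # every score tied for most frequent

print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median}")
print(f"modes  = {sample_modes}")
```

The scores sum to 992, so the mean is $992/12 \approx 82.67$. The two middle scores are 87 and 88, giving a median of 87.5. Both 76 and 93 occur twice, so this sample has two modes.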

Computing an Average from Raw Population Data

The essence of the computation is the same, whether the data came from a sample or a population. However, the variables used in the formulas are different, because of the interplay that occurs between populations and samples.
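
For reference, the mean is usually written as follows for the two cases. The notation below is the conventional one ($\bar{x}$ and $n$ for a sample, $\mu$ and $N$ for a population) and is assumed here rather than taken from the text.

```latex
\begin{aligned}
\bar{x} &= \frac{\sum x}{n} && \text{(sample mean, sample size } n\text{)} \\
\mu &= \frac{\sum x}{N} && \text{(population mean, population size } N\text{)}
\end{aligned}
```

The arithmetic is identical in both lines; only the symbols change to signal whether the result describes a sample or the whole population.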

Estimating an Average from a Frequency Distribution

A big disadvantage occurs when working with data that has already been summarized in a frequency distribution. We no longer have the original raw data, and therefore we cannot use the original formulas for the various types of averages. The modifications we make to find averages in this case are at best estimates of the values that would be found from the raw data.

Suppose the heights of 169 freshmen at Western High School were found, and the results are provided in the following table.

| Height | Number of Students |
|---|---|
| 135-149 cm | 23 |
| 150-164 cm | 36 |
| 165-179 cm | 29 |
| 180-194 cm | 64 |
| 195-209 cm | 17 |

| Height | Class Midpoints $x_i$ | Number of Students $w_i$ | Product $w_i x_i$ |
|---|---|---|---|
| 135-149 cm | 142 cm | 23 | 3266 |
| 150-164 cm | 157 cm | 36 | 5652 |
| 165-179 cm | 172 cm | 29 | 4988 |
| 180-194 cm | 187 cm | 64 | 11968 |
| 195-209 cm | 202 cm | 17 | 3434 |
| Totals | | 169 | 29308 |
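
The totals in the table suggest estimating the mean as a weighted average of the class midpoints, $\sum w_i x_i / \sum w_i$. The sketch below reproduces that arithmetic and also picks out the class with the largest count and the class containing the middle observation. The variable names are illustrative, and locating the middle observation at position $(n+1)/2$ is one common convention, assumed here.

```python
# Class limits (cm) and student counts from the height table.
classes = [(135, 149), (150, 164), (165, 179), (180, 194), (195, 209)]
counts = [23, 36, 29, 64, 17]

# Each class is represented by its midpoint x_i, weighted by its count w_i.
midpoints = [(low + high) / 2 for low, high in classes]
n = sum(counts)
weighted_total = sum(w * x for w, x in zip(counts, midpoints))
estimated_mean = weighted_total / n

# Modal class: the class with the largest count.
modal_class = classes[counts.index(max(counts))]

# Median class: the first class whose cumulative count reaches
# the middle position (n + 1) / 2 (assumed convention).
middle_position = (n + 1) / 2
cumulative = 0
median_class = None
for cls, w in zip(classes, counts):
    cumulative += w
    if cumulative >= middle_position:
        median_class = cls
        break

print(f"estimated mean = {estimated_mean:.1f} cm")
print(f"modal class    = {modal_class}")
print(f"median class   = {median_class}")
```

The estimated mean is $29308/169 \approx 173.4$ cm. The 85th of the 169 heights falls in the 165-179 cm class, while the most populated class is 180-194 cm. All three results are estimates, since the individual heights are no longer available.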

Estimating an Average from a Graph

When estimating averages from a graph, imagine that the distribution is essentially a continuous function on a domain that extends from the minimum to the maximum values of the variable.

[Figure: graph of a function, with a balance point at the mean, a vertical line at the median, and a point at the mode]
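
To make the picture concrete, the three averages can be approximated numerically for a curve. In the sketch below the mean is the balance point of the area under the curve, the median is where a vertical line splits that area in half, and the mode is where the curve peaks. The curve `f` and the interval from 0 to 10 are hypothetical choices made for illustration; they do not come from the original graph.

```python
import math


def f(x):
    # Hypothetical right-skewed curve, chosen only for illustration.
    return x * math.exp(-x / 2)


a, b = 0.0, 10.0   # assumed minimum and maximum of the variable
steps = 10_000
dx = (b - a) / steps

# Sample the curve at the midpoint of each small strip.
xs = [a + (i + 0.5) * dx for i in range(steps)]
ys = [f(x) for x in xs]

area = sum(ys) * dx

# Mean: the balance point of the area under the curve.
mean = sum(x * y for x, y in zip(xs, ys)) * dx / area

# Median: the x value where the accumulated area first reaches half the total.
running = 0.0
median = None
for x, y in zip(xs, ys):
    running += y * dx
    if running >= area / 2:
        median = x
        break

# Mode: the x value where the curve is highest.
mode = xs[ys.index(max(ys))]

print(f"mode = {mode:.2f}, median = {median:.2f}, mean = {mean:.2f}")
```

For this particular curve the long right tail pulls the mean to the right of the median, and the median to the right of the mode.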

Characteristics of the Different Averages

The mean uses an algebraic formula, and that does make it easier to work with. That also means quantitative data is required; no mean is possible with only qualitative data. The mean is also the most stable average in sampling, in that a small change in any one data value cannot cause a large change in the average. However, the mean is easily influenced by outliers (data values quite unlike the great majority of values), and is in fact dependent on every value in the data set.

The median requires an ordering process rather than a straightforward formula, so its determination is not algebraic. Medians can be found for any data that is at least ordinal, which means some qualitative variables will have medians. The median is not influenced by outliers, nor by most other values; in fact, the only values that really count are those at the very center.
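
The contrast between the mean and the median can be seen with the twelve exam scores from earlier. In the sketch below the lowest score, 31, is replaced by a hypothetical even lower score of 1 (a value invented here for illustration). The mean moves, while the median does not.

```python
import statistics

# Exam scores from the earlier sample of twelve students.
scores = [82, 76, 31, 87, 94, 76, 85, 88, 93, 98, 89, 93]

# Hypothetical change: make the lowest score even more extreme.
altered = [1 if s == 31 else s for s in scores]

mean_change = statistics.mean(altered) - statistics.mean(scores)
median_change = statistics.median(altered) - statistics.median(scores)

print(f"change in mean   = {mean_change:.2f}")
print(f"change in median = {median_change}")
```

Lowering one score by 30 points lowers the mean by $30/12 = 2.5$ points, but the two middle scores are still 87 and 88, so the median stays at 87.5.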

The mode does not require any structure on the data, so modes can be found even for nominal data. However, they may not exist, and if they do exist, they may not be unique.
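
A short sketch with nominal data shows both situations. The color lists below are hypothetical and were invented for illustration. `statistics.multimode` returns every value tied for the highest count, so a result listing every distinct value corresponds to the case where no mode exists.

```python
import statistics

# Hypothetical nominal data (not from the text): favorite colors in two small surveys.
survey_a = ["red", "blue", "green", "blue", "red", "yellow"]
survey_b = ["red", "blue", "green", "yellow"]

modes_a = statistics.multimode(survey_a)  # two colors tie for most frequent
modes_b = statistics.multimode(survey_b)  # every color occurs exactly once

print(modes_a)
print(modes_b)
```

In the first survey the mode is not unique, since red and blue each occur twice. In the second every color occurs once, so no value stands out as a mode.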

Choosing an appropriate average for a set of data is not always easy. Suppose, for example, that the wages of a sample of 28 employees of a company are recorded, with the following results.

| Hourly wage | Number of employees |
|---|---|
| $\$7.50$ | 15 |
| $\$7.75$ | 8 |
| $\$8.00$ | 3 |
| $\$8.25$ | 1 |
| $\$25.00$ | 1 |

Note that the classes in this frequency distribution are precise, so our results will not involve estimates (beyond those due to any sampling process). When the three types of averages are computed, it can be found that the median and mode are both $\$7.50$, while the mean is slightly more than $\$8.27$. Now if the average wage were an issue in labor-management negotiations, management would not be happy hearing that the average was $\$7.50$, since in fact that is the lowest wage they were paying. And labor would not be happy hearing that the average was almost $\$8.28$, since that would put all but one individual below average. None of the three averages really does a good job of representing the "center" of this set of data, and the basic cause of the problem is the presence of the outlier, $\$25.00$. If that value were excluded, the mean would drop to about $\$7.66$, well within the bulk of the data.
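
The figures quoted for the wage data can be checked with a few lines of Python. This is a sketch using the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Hourly wages and employee counts from the wage table.
wage_counts = {7.50: 15, 7.75: 8, 8.00: 3, 8.25: 1, 25.00: 1}

# Expand the frequency table back into one wage per employee.
wages = [wage for wage, count in wage_counts.items() for _ in range(count)]

mean_wage = statistics.mean(wages)
median_wage = statistics.median(wages)
mode_wage = statistics.mode(wages)

# The same mean with the 25.00 outlier left out.
wages_without_outlier = [w for w in wages if w != 25.00]
mean_without_outlier = statistics.mean(wages_without_outlier)

print(f"mean   = {mean_wage:.4f}")
print(f"median = {median_wage:.2f}")
print(f"mode   = {mode_wage:.2f}")
print(f"mean without outlier = {mean_without_outlier:.4f}")
```

The total payroll per hour is 231.75 across 28 employees, giving a mean of about 8.277. Dropping the single 25.00 wage leaves 206.75 across 27 employees, or about 7.657.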