Data Summaries I

Contents of this page:
Tables and Graphs for Discrete Data
Totals (Frequencies) vs. Percentages
Tables and Graphs for Continuous Data

Tables and Graphs for Discrete Data

Example Consider the variable gender in our WRInc data. Clearly this is discrete data. Usually the first thing one would do is simply count how many of each type there are. You can use the Stat > Tables command in MINITAB to do the "counting":
Stat > Tables > Tally Individual Variables, Variable=Gender. We see that there are 206 Female and 321 Male employees.
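
For readers without MINITAB, the same kind of tally can be sketched in Python. The short list below is made-up illustration data, not the WRInc file, and the helper name tally is our own choice.

```python
from collections import Counter

def tally(values):
    """Count how often each distinct value occurs."""
    return dict(Counter(values))

# Made-up illustration data (NOT the WRInc data set).
gender = ["Female", "Male", "Male", "Female", "Male"]
counts = tally(gender)
print(counts)
```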

Example According to the US Department of Education there were 12,263,000 undergraduate students in US colleges in 1994. Their breakdown by race was as follows:
Race Number
American Indian 117000
Asian 674000
Black 1317000
Hispanic 968000
White 8916000

If a table is used for presentation purposes it should usually include a little more information, such as percentages and a total, and a more helpful ordering, for example by size. Also, large numbers are often expressed in larger units (here, thousands):
Race Number (in 1000) Percentage (%)
White 8,916 72.7
Black 1,317 10.7
Hispanic 968 7.9
Asian 674 5.5
American Indian 117 1.0
Total 12,263 100

In order to compute the percentages we need to divide by the total and multiply by 100. The total is found using the sum command:

Calc > Column Statistics, Input Variable=Number
and to calculate the percentages use
Calc > Calculator, Store in c3, Expression 'Number' / SUM('Number') * 100
Usually percentages are rounded either to the nearest integer or to one decimal place, and we can do this with
Calc > Calculator, Store in c3, Expression ROUND('Number' / SUM('Number') * 100,1)
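
As a cross-check, the same calculation can be sketched in Python. The function name percentages and its total argument are our own assumptions, not part of MINITAB. Note that the five listed groups add up to 11,992 thousand, while the stated total is 12,263 thousand. Dividing by the column sum, as SUM('Number') does, therefore gives slightly different percentages than dividing by the stated total, which is what the table above uses.

```python
def percentages(counts, total=None, digits=1):
    """Return each count as a percentage of the total, rounded to `digits` places.

    If `total` is None, the sum of the counts is used, mirroring SUM('Number').
    """
    if total is None:
        total = sum(counts.values())
    return {k: round(v / total * 100, digits) for k, v in counts.items()}

# Enrollment in thousands, copied from the table above.
enrollment = {"White": 8916, "Black": 1317, "Hispanic": 968,
              "Asian": 674, "American Indian": 117}

by_column_sum = percentages(enrollment)             # divides by 11,992
by_stated_total = percentages(enrollment, 12263)    # divides by the stated total
print(by_column_sum)
print(by_stated_total)
```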

Some discrete variables have a built-in (natural) ordering, for example t-shirt size (small, medium, large, x-large) or grades (A, B, ...). Such a natural ordering can be used in the table instead of ordering by size.

Graph for discrete data: Bar chart
Use Graph > Bar Chart, Values from a Table, Graph Variables=Number, Categorical Variable=Race
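
To see what the bar chart conveys without opening MINITAB, here is a minimal text-only sketch in Python. It is not how MINITAB draws the graph; the function name text_bar_chart and the bar width of 40 characters are arbitrary choices for illustration.

```python
def text_bar_chart(counts, width=40):
    """Return one line per category with a bar proportional to its count."""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        # Scale so that the largest category gets a bar of `width` characters.
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return lines

# Enrollment in thousands, copied from the table above, ordered by size.
enrollment = {"White": 8916, "Black": 1317, "Hispanic": 968,
              "Asian": 674, "American Indian": 117}

chart = text_bar_chart(enrollment)
print("\n".join(chart))
```

Each bar is separated from the next, which is the usual convention for discrete categories.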

Example This is a nice professional table from the website of the CDC (Centers for Disease Control) about the dangers of smoking:

For more on bar charts see page 38 of the textbook.

Totals (Frequencies) vs. Percentages

Decide based on the background of the data which number is more relevant/important/interesting. Some of the things to consider are:

--- If the data are a random sample from a larger population, percentages are often better:
Example: of 150 randomly selected people in a phone survey, 85 said they would vote for candidate AA in the next election. Report this as 57%.
Example: in a company with 150 employees, 85 said they like their job. Here the 150 employees are the whole group of interest, so report the counts themselves.

--- For small numbers use frequencies, for large numbers use percentages

--- These are just guidelines, there can always be exceptions if there is a good reason.

Tables and Graphs for Continuous Data

Frequency Tables

Example Consider the WRInc data. If you want to make a table for the variable "Satisfaction" it is clear what it will look like:
Rating Frequency
1 60
2 91
3 103
4 83
5 190

Here "1", "2", ..., "5" are called the classes, and for a discrete variable there is usually an obvious choice. Of course we can still combine classes if we wish:
Rating Frequency
1 or 2 151
3 or 4 186
5 190
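
Combining classes is just a matter of adding the frequencies of the classes being merged. A minimal Python sketch, using the frequencies from the table above (the variable names are our own):

```python
# Frequencies of the Satisfaction ratings, copied from the first table.
ratings = {1: 60, 2: 91, 3: 103, 4: 83, 5: 190}

# Each new class lists the original ratings it absorbs.
groups = {"1 or 2": (1, 2), "3 or 4": (3, 4), "5": (5,)}

merged = {name: sum(ratings[r] for r in members)
          for name, members in groups.items()}
print(merged)
```

The total number of observations is unchanged by the regrouping.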

But what about a continuous variable, say "Income"? Again we need "classes" for a table, but now it is not at all clear what they should be. In fact there are many possibilities. First of all, a "class" will now be an interval, for example 10000-20000.

Follow these steps to construct a table (now called a frequency table) for continuous data:

1) find the range = largest observation - smallest observation
2) decide on the number of classes, usually at least 5. In general, the more observations we have, the higher the number of classes will be.
3) calculate the class width = range / number of classes, rounded to a nice number.
4) find the left endpoint of the first class, and define the classes so that each observation falls into one and only one class.

Example: Let's do a frequency table for the Incomes of the employees of WRInc
Stat > Basic Statistics > Display Descriptive Statistics, Variable=Income shows Min=6700 and Max=96300
Now
1) range = 96300-6700 = 89600
2) number of classes = 10 (just as an example)
3) class width = 89600/10 = 8960, round to 10000
4) Left endpoint: 5000

How about these classes?

Class
5000 - 15000
15000 - 25000
25000 - 35000
35000 - 45000
45000 - 55000
55000 - 65000
65000 - 75000
75000 - 85000
85000 - 95000
95000 - 105000
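
The four steps and the class list above can be reproduced with a short Python sketch. The text does not define "a nice number", so the rule used here (round up to 1, 2 or 5 times a power of ten) is only an assumption that happens to give 10000 in this example. The left endpoint 5000 is taken from the text rather than computed, and the function names are our own.

```python
import math

def nice_width(raw_width):
    """Round a raw class width UP to 1, 2 or 5 times a power of ten (assumed rule)."""
    power = 10 ** math.floor(math.log10(raw_width))
    for step in (1, 2, 5, 10):
        if raw_width <= step * power:
            return step * power

def class_limits(left, width, n_classes):
    """Return (lower, upper) endpoints for n_classes consecutive classes."""
    return [(left + i * width, left + (i + 1) * width) for i in range(n_classes)]

smallest, largest = 6700, 96300          # Min and Max of Income from the text
data_range = largest - smallest          # step 1
n_classes = 10                           # step 2
width = nice_width(data_range / n_classes)   # step 3
limits = class_limits(5000, width, n_classes)  # step 4, left endpoint from the text

for lower, upper in limits:
    print(lower, "-", upper)
```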

Next we will count the number of people with an income between 5000 and 15000. Because the smallest income is 6700, it is enough to count the incomes of at most 15000. We can use Minitab to do that: Calc > Calculator, Store in c8, Expression SUM('Income' <= 15000) gives the answer: 6
The next class is 15000-25000, and there is a problem: the income of the employee in row 500 is exactly 25000, so where do we count this person, in this class or in the 25000-35000 class? Remember step 4 above: each observation must fall into one and only one class. One solution is to end each class just below the start of the next one:

Class Frequency
5000 - 14999 6
15000 - 24999 42
25000 - 34999 177
35000 - 44999 156
45000 - 54999 74
55000 - 64999 40
65000 - 74999 17
75000 - 84999 11
85000 - 94999 3
95000 - 104999 1
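
The boundary rule used in this table can be sketched in Python: each class contains its left endpoint but not its right endpoint, so an income of exactly 25000 is counted in the 25000-34999 class and nowhere else. The incomes below are made up for illustration (they are not the WRInc data), and the function name frequency_table is our own.

```python
def frequency_table(values, left, width, n_classes):
    """Count values in the classes [left + i*width, left + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - left) // width)   # which class the value falls into
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside every class")
        counts[index] += 1
    return counts

# Made-up incomes (NOT the WRInc data); 15000 and 25000 sit on class boundaries.
incomes = [6700, 14999, 15000, 24999, 25000, 31000, 96300]
counts = frequency_table(incomes, left=5000, width=10000, n_classes=10)
print(counts)
```

Because every value is assigned to exactly one class, the frequencies always add up to the number of observations.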

A histogram is just like a bar chart, with the bars drawn on top of the classes. Note, though, that in a histogram there are no spaces between the bars! In Minitab use Graph > Histogram to draw the graph. You can double-click on the graph to change the number of classes.

In general we like to draw several histograms, with different numbers of classes. Try some!

For more on histograms see page 52 of the textbook.