Exercises - Descriptive Statistics - Data Summaries

Problem 1 For each of the following variables decide whether the data is discrete, continuous or possibly both
Variable Discrete Continuous Possibly Both
Daily low temperature in New York
Brand of cereal in supermarket
Telephone number
License plates of cars
Weight lost in a weight loss program
Time spent on studying for ESMA 3101 during last week

Problem 2 Polling company "Polls-R-Us" has been asked to conduct an opinion poll for Coca-Cola. Coca-Cola wants to run an ad on TV and claim that "most people prefer Coke to any other soft drink". Coca-Cola will pay Polls-R-Us a big bonus if their survey results support that claim. Name three things that Polls-R-Us can do to influence the poll results in Coca-Cola's favour.
Problem 3 Consider the data in Marital Status. If you do a barchart, it will look like this:

Instead draw this barchart:

Things that you need to change from the default graph:
Numbers are based on percentages
Percentages appear as labels above bars
Bars have a black right slant at 22.5°
Title - with your name


Problem 4 Consider the data set for Friday the 13th.
a) Find the mean and the standard deviation of the number of accidents on Friday the 13th by hand. Write down all the parts of the solution.
b) Use MINITAB to compute the mean and the standard deviation for the two Fridays and the three data sets separately (so there will be the mean and st. dev. for the number of accidents on Friday the 6th, the mean and st. dev. for the number of accidents on Friday the 13th, the mean and st. dev. for the number of shoppers on Friday the 6th and so on). Do any of these numbers support the idea that Friday the 13th is special?
Problem 5 Consider the data in AIDS in Americas in 1995.

Do this work by hand. You can check your answers using MINITAB. Remember, though, the answers might be slightly different.

a) Find the 20th and the 80th percentiles of the AIDS rates. What are the corresponding countries?

b) Find the 5-number summary for the AIDS rates.

c) According to the boxplot as drawn by MINITAB, which countries are outliers in this data set?

d) Let's say the WHO wants to use the "average" rate of AIDS infection (together with the number of people living in the Americas) to estimate the number of AIDS infected people in the Americas. Should they use the mean or the median to find the "average"?


Problem 6 In this exercise we study the dataset headache

a): What is the type of data of the variables?

b): Find the mean and standard deviation of Time

c): Find the 5-number summary and draw the boxplot of Time

d): Draw the boxplot of Time, by hand


Problem 7 Company XYZ has a contract with a supplier for metal rods. The contract says that all the rods have to be between 15.2cm and 15.5cm long. XYZ just received a shipment of 50000 rods. They randomly select 100 of them and measure the length of each. They find x̄=15.344 and s=0.041. Should they accept this shipment?
Problem 8 Consider the Rogaine dataset. Draw a good graph for this dataset.

Solutions

Problem 1 For each of the following variables decide whether the data is discrete, continuous or possibly both
Variable Discrete Continuous Possibly Both
Daily low temperature in New York
X
Brand of cereal in supermarket
X
Telephone number
X
License plates of cars
X
Weight lost in a weight loss program
X
Time spent on studying for ESMA 3101 during last week
X

Problem 2 1) Ask the question "Do you agree that Coca-Cola is much better than any of these other soft drinks?"
2) Wait outside a supermarket and only ask those customers who have a Coke bottle in their shopping cart.
3) Do a survey with 10 customers and keep repeating it until six say they prefer Coke over any other soft drink.


Problem 3
Numbers are based on percentages: Compute percentages (rounded to nearest integer) and put in new column "Percentages"
Percentages appear as labels above bars: Labels - Data Labels - check Show data labels
Title - with your name: Labels - title
Hit enter
Bars have a right slant on a light gray background: click inside a bar, right-click, choose Edit Bars

Problem 4
a) n = 6, ∑x = 65, ∑x² = 769
so x̄ = ∑x/n = 65/6 = 10.83,

Sxx = ∑x² − (∑x)²/n = 769 − 65²/6 = 64.8
so s = √(Sxx/(n−1)) = √(64.8/5) = 3.60
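As a cross-check on the hand calculation, here is a minimal Python sketch of the shortcut formula. It works only from the three sums quoted in part a), since the raw accident counts are not reproduced in this solution; the variable names are illustrative.

```python
import math

# Summary values quoted in the solution to part a)
n = 6          # number of observations
sum_x = 65     # sum of the observations
sum_x2 = 769   # sum of the squared observations

mean = sum_x / n                  # sample mean
sxx = sum_x2 - sum_x**2 / n       # Sxx, the corrected sum of squares
s = math.sqrt(sxx / (n - 1))      # sample standard deviation

print(f"mean = {mean:.2f}, Sxx = {sxx:.1f}, s = {s:.2f}")
# prints: mean = 10.83, Sxx = 64.8, s = 3.60
```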

b)
            Friday 6th              Friday 13th
Data Set    Mean      Stand. Dev.   Mean      Stand. Dev.
Traffic     128385    7259          126550    7664
Shoppers    4971      1166          5017      1173
Accidents   7.50      3.33          10.83     3.60

There does not appear to be anything special about Friday the 13th.


Problem 5
a) 20th percentile: np/100 = 45*20/100 = 9, 9th observation in ordered data set is 0.8 (Nicaragua)
80th percentile: np/100 = 45*80/100 = 36, 36th observation in ordered data set is 13.7 (Honduras)
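
The position rule used in these solutions (compute np/100 and move up to the next whole number when the result is not an integer) can be sketched in Python as follows. The function name is hypothetical, and statistical software often uses a different interpolation rule, which is why its answers may differ slightly from the hand calculation.

```python
def percentile_position(n: int, p: int) -> int:
    """1-based position of the p-th percentile among n ordered observations.

    Follows the rule used in these solutions: n*p/100, rounded up
    when it is not a whole number.
    """
    return -(-n * p // 100)  # integer ceiling of n*p/100


# Positions for the 45 countries in the AIDS data set
for p in (20, 25, 75, 80):
    print(p, percentile_position(45, p))
# prints: 20 9, 25 12, 75 34, 80 36 (one pair per line)
```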

b) We have Min=0.0 and Max=131.4
Median: The number in the middle is 5.6 (Argentina), so Median=5.6
Q1: Q1=P25, so np/100=45*25/100=11.25, which we round up to 12; the 12th observation is 1.5 (Chile), so Q1=1.5
Q3: Q3=P75, so np/100=45*75/100=33.75, which we round up to 34; the 34th observation is 10.9 (Saint Kitts & Nevis), so Q3=10.9
and we have the 5-number summary: Min=0.0, Q1=1.5, Median=5.6, Q3=10.9, Max=131.4

c) The countries with the five highest AIDS rates are outliers (Guadeloupe, Barbados, French Guiana, Bermuda, Bahamas).

d) Mean, because the countries with the highest AIDS rates have to influence our "average", and they don't if we use the median.


Problem 6
a) Time: continuous
Dose: possibly both
Sex: discrete
BP Quan: looks like it could be "possibly both", which I would accept as an answer, but it is actually just discrete.

b) n = 24, ∑x = 632, ∑x² = 21524
so x̄ = ∑x/n = 632/24 = 26.33,

Sxx = ∑x² − (∑x)²/n = 21524 − 632²/24 = 4881.3
so s = √(Sxx/(n−1)) = √(4881.3/23) = 14.57
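
These values can be checked with Python's standard `statistics` module. The sketch below uses the 24 Time values as they are listed in part c); the variable names are illustrative.

```python
import statistics

# The 24 Time values, as listed in order in part c)
time = [3, 5, 8, 11, 13, 14, 19, 20, 22, 22, 23, 26,
        26, 27, 27, 28, 29, 29, 35, 43, 43, 47, 55, 57]

n = len(time)                       # 24
sum_x = sum(time)                   # sum of the observations
sum_x2 = sum(x * x for x in time)   # sum of the squared observations

mean = statistics.mean(time)
s = statistics.stdev(time)          # sample standard deviation (divides by n-1)

print(n, sum_x, sum_x2)
print(f"mean = {mean:.2f}, s = {s:.2f}")
```

`statistics.stdev` divides by n-1, so it matches the hand formula; `statistics.pstdev` would divide by n instead.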

c) The numbers in the 5 number summary are: Min, Q1, Median, Q3 and Max. For all those numbers we need the data to be ordered:
3 5 8 11 13 14 19 20 22 22 23 26 26 27 27 28 29 29 35 43 43 47 55 57.
So we have Min=3 and Max=57
Median: The two numbers in the middle are 26 and 26, so Median =(26+26)/2 = 26
Q1: Q1=P25, so np/100=24*25/100=6, the 6th observation is 14, so Q1=14
Q3: Q3=P75, so np/100=24*75/100=18, the 18th observation is 29, so Q3=29
and we have the 5-number summary: Min=3, Q1=14, Median=26, Q3=29, Max=57

For the boxplot we need IQR=Q3-Q1=29-14=15
LF=Q1-1.5IQR = 14-1.5×15=-8.5<3, so the left line goes to 3
RF=Q3+1.5IQR = 29+1.5×15=51.5<57, so the right line goes to 51.5, and the observations beyond it (55 and 57) are outliers
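
The fence calculation can be sketched in Python as follows, taking Q1 and Q3 as the 6th and 18th ordered observations found above. The variable names are illustrative. The sketch follows the convention of this solution, in which a line is cut off at the fence; many software packages instead end the line at the most extreme observation inside the fence.

```python
# The 24 ordered Time values from part c)
time = [3, 5, 8, 11, 13, 14, 19, 20, 22, 22, 23, 26,
        26, 27, 27, 28, 29, 29, 35, 43, 43, 47, 55, 57]

q1, q3 = time[5], time[17]       # 6th and 18th ordered observations
iqr = q3 - q1                    # interquartile range

lower_fence = q1 - 1.5 * iqr     # LF
upper_fence = q3 + 1.5 * iqr     # RF

# Convention of this solution: a line stops at the fence if data lie beyond it
left_end = max(min(time), lower_fence)
right_end = min(max(time), upper_fence)

# Observations outside the fences are flagged as outliers
outliers = [x for x in time if x < lower_fence or x > upper_fence]

print(iqr, lower_fence, upper_fence)   # 15 -8.5 51.5
print(left_end, right_end, outliers)   # 3 51.5 [55, 57]
```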


Problem 7
We can use the empirical rule to decide. This requires that the lengths of the rods have a bell-shaped histogram, which of course should be checked. Then

x̄ ± s = 15.344 ± 0.041 = (15.303, 15.385)
x̄ ± 2s = 15.344 ± 2·0.041 = (15.262, 15.426)
x̄ ± 3s = 15.344 ± 3·0.041 = (15.221, 15.467)

The last interval is supposed to include "almost all" the observations (or lengths of the rods), so we can conclude that "almost all" the rods have a length at least 15.221cm and at most 15.467cm, which is in accordance with the contract. So XYZ should accept the shipment.
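
The same check can be written as a short Python sketch. It assumes, as the solution does, that the rod lengths are roughly bell-shaped; the variable names are illustrative.

```python
# Sample results and contract limits from the problem statement (in cm)
xbar, s = 15.344, 0.041
low_spec, high_spec = 15.2, 15.5

# Empirical-rule intervals: mean plus or minus k standard deviations
intervals = {
    k: (round(xbar - k * s, 3), round(xbar + k * s, 3))
    for k in (1, 2, 3)
}
for k, (low, high) in intervals.items():
    print(f"k={k}: ({low}, {high})")

# Accept if the 3-standard-deviation interval lies inside the contract limits
low3, high3 = intervals[3]
accept = low_spec <= low3 and high3 <= high_spec
print("accept shipment:", accept)
```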


Problem 8 Here we have two discrete variables: Growth and Group. The standard graph for two discrete variables is the multiple barchart. There are 6 possible graphs, depending on the grouping and on whether we show totals, percentages based on the grand total, percentages based on the row totals or percentages based on the column totals. The most useful of these is probably the bar chart based on percentages within Growth:

Select rogaine1.mtw (raw data)
Graph > Bar Chart > Counts of unique value, Cluster, Categorical Variables: Growth Group, Bar Chart Options > Show Y as Percent, Within categories of level 1, hit enter

but it has one major problem. The variable Growth has values that have an ordering (No Growth - Dens Growth) and the graph should reflect that ordering. We can do this as follows: in the worksheet select the variable by clicking on C1-T, Editor > Column > Value Order, User-specified order. The ordering that appears in the box is already ok, so hit enter. Now redraw the graph as described above: