Stories and Datasets

Datasets used in Class
1970 Military Draft
Euro Coins
Alcohol and Tobacco Expenditures
Olympics
Gas Chromatography
Hubble's Constant
Quality of Fish
Golf Scores
Brain and Body Weight of Mammals
Usage of Electricity
House Prices
Temperature in the USA
Air Pollution and Mortality
Environment, Safety and Health Attitudes
Headaches and Pain Reliever
Hair Growth
Methods of Drowning
Seat Belt Use
Gregor Mendel's Genetic Experiments
Death by Horsekicks
Sex Discrimination in Graduate School Admissions
Drug Use of Mothers and the Health of the Newborn
Cancer Survival
Capacity of Wells in the Appalachian Mountains
Cultural Differences in Equipment Use
Testing Hearing Aids
Oxygen Concentration and Fermentation
Film Thickness in Semiconductor Production
Water Quality and Mining
Noise and Air Filters for Cars
Datasets used in Exercises
Forbes
Employee Attitudes
Singers in the New York Choral Society
Watering Schedules and Crop Yield
Treatment for Hostility
Treatments for Lowering Blood Pressure
Prices of Diamond Rings
GPA and IQ
Pets and one-year Survival
Larvae of Mosquito
Datasets used in Homeworks and Exams
Albuquerque House Prices
Cats
Medieval Churches
Cuckoo Eggs
Fabric Wear
Fiber in Crackers
Flammability of Children's Sleepwear
Sex Ratios by State (American Community Survey 2004)
Salaries in a Company
Egyptian Skulls
Smoking and Lung Cancer Rates
Liver and Toxins in Trout
Life Expectancy, TVs and Doctors
Deaths in Car Accidents (2002)
Highway Accidents
Old Faithful Geyser
Smoking and Job Type
Cheese Tasting
Hotdogs
Homework or not?
Gender and Age in US, Washington DC and Puerto Rico (US Census 2000)
Gasoline type and Pollution
Wine Consumption and Heart Disease

Sex Ratios by State (American Community Survey 2004)

Sex ratios by state, without Puerto Rico. Data from the American Community Survey 2004. Includes upper and lower limits of 90% confidence interval.
Data set: sexratio

1970 Military Draft

In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Data set: draft
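
One way to check the randomness claim is to correlate the day of the year (1 to 366) with the draft number. The sketch below is only an illustration under stated assumptions: the `pearson` helper and the simulated fair lottery are made up for this example, and the actual values in the draft data set are not reproduced here.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical stand-in for the real data: a fair lottery,
# simulated by shuffling the draft numbers 1..366.
days = list(range(1, 367))
draft_numbers = days[:]
random.Random(1970).shuffle(draft_numbers)

# For a fair lottery the correlation should be close to zero.
print(round(pearson(days, draft_numbers), 3))
```

Replacing `draft_numbers` with the numbers from the draft data set gives the correlation for the actual 1970 lottery.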

Euro Coins

The data were collected by Herman Callaert at Hasselt University in Belgium. The euro coins were "borrowed" at a local bank. Two assistants, Sofie Bogaerts and Saskia Litiere, weighed the coins one by one under laboratory conditions on a Sartorius BP 310s weighing scale.
Data set: euros

Alcohol Consumption

Data from a British government survey of household spending may be used to examine the relationship between household spending on tobacco products and alcoholic beverages. The numbers are the average expenditure for each of the 11 regions of England.
Data set: alcohol

Quality of Fish

A study was conducted to examine the quality of fish after several days in ice storage. Ten raw fish of the same kind and quality were caught and prepared for storage. Two of the fish were placed in ice storage immediately after being caught, two were placed there after 3 hours, and two each after 6, 9 and 12 hours. Then all the fish were left in storage for 7 days. Finally they were examined and rated according to their "freshness".
Data set: fish

Gas Chromatography

Results of a study of gas chromatography, a technique which is used to detect very small amounts of a substance. Five measurements were taken for each of four specimens containing different amounts of the substance. The amount of the substance in each specimen was determined before the experiment. The response variable is the output reading from the gas chromatograph. The purpose of the study is to calibrate the chromatograph by relating the actual amount of the substance to the chromatograph reading.
Data set: chromatography

Olympics

Data on the gold medal winning performances in the men's long jump, high jump and discus throw and the women's long jump for the modern Olympic games.
Data set: olympics

Hubble's Constant






In 1929 Edwin Hubble published a paper showing a relationship between the distance and radial velocity away from Earth of "extra-galactic nebulae" (galaxies). His findings revolutionized astronomy. The "Hubble constant," the slope of the regression of velocity (Y) on distance (X), is still a subject of research and debate. The data here are those Hubble published in his original paper. The data set also has data from much more recent studies, which I got from http://www.geocities.com/jurgenshestani/hubble.html
For more on Hubble's Constant and Cosmology go here
Data set: hubble
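
The slope of the regression of velocity on distance can be computed directly from the least-squares formulas. This is a minimal sketch, not the analysis used in class: the function name `least_squares`, the units, and the four data points are assumptions chosen for illustration and are not values from the hubble data set.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative values, NOT taken from the hubble data set.
# Assumed units: distance in megaparsecs (X), radial velocity in km/s (Y).
distance = [0.5, 1.0, 1.5, 2.0]
velocity = [250.0, 500.0, 750.0, 1000.0]

intercept, slope = least_squares(distance, velocity)
print(slope)  # estimated "Hubble constant" for the made-up data
```

With real measurements the points scatter around the line, and the fitted slope is the estimate of the Hubble constant in the units of the data.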

Golf Scores

We have the scores for the four rounds of the 2006 AT&T Pebble Beach National Pro-Am, 2006 Honda Classic, 2006 Shell Houston Open, 2006 Byron Nelson Classic and the 2007 Sony Open.
Data set: golfscores

Brain and Body Weight of Mammals

Brain and Body Weight (in kg) of 62 Mammals.
Data set: brainsize

Usage of Electricity

In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage in order to plan purchases of fuel and budget revenue flow. The data include usage (in kilowatt-hours per day) and average monthly temperature for 55 consecutive months for an all-electric home.
Data set: elusage

House Prices

Prices of residences located 30 miles south of a large metropolitan area with several possible predictor variables. Notice the 1.7 baths!
Data set: houseprice

Temperature in the USA

The data gives the normal average January minimum temperature in degrees Fahrenheit with the latitude and longitude of 56 U.S. cities. (For each year from 1931 to 1960, the daily minimum temperatures in January were added together and divided by 31. Then, the averages for each year were averaged over the 30 years.)
Variables:
City: City
State: State postal abbreviation
JanTemp: Average January minimum temperature in degrees F. from 1931-1960
Latitude: Latitude in degrees north of the equator
Longitude: Longitude in degrees west of the prime meridian
Data set: ustemperature
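
The two-step averaging described above (average within each January, then across the 30 years) can be sketched as follows. The function name and the readings are hypothetical; the ustemperature file contains only the final averages, not daily values.

```python
def normal_january_minimum(daily_minimums_by_year):
    """Average the January daily minimums within each year, then across years.

    daily_minimums_by_year: dict mapping year -> list of 31 daily minimums (deg F).
    """
    yearly_averages = [
        sum(days) / 31 for days in daily_minimums_by_year.values()
    ]
    return sum(yearly_averages) / len(yearly_averages)


# Hypothetical readings for one city: within each year every January day is
# set to the same made-up value, just to exercise the function.
readings = {
    year: [20.0 + (year - 1931) * 0.1] * 31 for year in range(1931, 1961)
}
print(round(normal_january_minimum(readings), 2))
```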

Air Pollution and Mortality

The dependent variable for analysis is age adjusted mortality (called "Mortality"). The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics, and variables recording the pollution potential of three different air pollutants.
Data set: airpollution

Environment, Safety and Health Attitudes

Environment, Safety and Health Attitudes of employees of a laboratory. Employees are given a questionnaire, which is then collated into an average score from 1 (bad) to 10 (good). We also have available the length of service of each employee, their gender and their race.
Data set: es&h

Headaches and Pain Reliever

A pharmaceutical company set up an experiment in which patients with a common type of headache were treated with a new analgesic or pain reliever. The analgesic was given to each patient in one of four dosage levels: 2, 5, 7 or 10 grams. Then the time until noticeable relief was recorded in minutes. In addition the sex and the blood pressure of each patient were recorded. The blood pressure groups were formed by comparing each patient's diastolic and systolic pressure reading with historical data. Based on this comparison the patients are assigned to one of three types: low (0.25), medium (0.5), high (0.75) according to the respective quantiles of the historic data.
Data set: headache

Rogaine

Rogaine is the first treatment for hair loss approved by the Food and Drug Administration. Here we have the results of one of the studies that were done to show that Rogaine works. A randomized clinical trial was carried out: 1431 bald men were randomly assigned to two groups. The men in the treatment group received Rogaine, the men in the control group received a placebo. After some time the men were examined and assigned to one of 5 groups:
No Growth = no difference

New Vellus = some hair follicles
Min Growth = minimal hair growth
Mod Growth = moderate hair growth
Den Growth = dense hair growth

Here is the original statistical analysis by the Food and Drug Administration used to approve Rogaine.
Data set: rogaine

Drownings

Data is from O'Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.
Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. We examined data collected by the Los Angeles County Coroner's Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period, for an annual rate of 2.36 drownings per 100,000 persons (3.44 for males and 1.33 for females). The largest proportion of drownings (44.5%) for both sexes, and in almost every age group, occurred in private swimming pools. Children 2 to 3 years of age had the highest swimming-pool drowning rate (7.95). The elderly also experienced high drowning rates, primarily in swimming pools and bathtubs. Drowning-site profiles varied dramatically by age and sex. These findings indicate a need for Los Angeles County to address the problem of drownings among infants and toddlers in private swimming pools and to investigate the failure of regulations requiring fencing of swimming pools to prevent these deaths. These findings also suggest several potential opportunities for preventive intervention by physicians and demonstrate that health professionals cannot rely on national drowning-site profiles when developing local drowning prevention strategies.
Data set: drownings

Seat Belt Use and Injuries

This data set gives historic percentages and actual numbers for types of injuries of drivers and front seat passengers in tow-away crashes in Charlottesville, Virginia before and after a mandatory seat belt law went into effect. Did the seat belt law have an effect?
Data set: seatbelt

Gregor Mendel's (1822-1884) Genetic Experiments



Data from one of Gregor Mendel's breeding trials, reported in Experiments in Plant Hybridization (1865). His theory of genetics predicted that the number of peas would be in the proportions 9:3:3:1. For more on Gregor Mendel see http://en.wikipedia.org/wiki/Gregor_Mendel
Data set: mendel
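
Whether observed counts agree with the predicted 9:3:3:1 proportions can be checked with a chi-square goodness-of-fit statistic. The sketch below is an illustration only: the helper name is made up, and the four counts are the ones commonly quoted for Mendel's dihybrid pea cross, assumed here rather than read from the mendel file.

```python
def chi_square_gof(observed, ratios):
    """Pearson chi-square statistic for observed counts vs. expected ratios."""
    total = sum(observed)
    ratio_sum = sum(ratios)
    expected = [total * r / ratio_sum for r in ratios]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Assumed counts (round/yellow, round/green, wrinkled/yellow, wrinkled/green),
# as commonly quoted for Mendel's cross; check them against the mendel file.
observed = [315, 108, 101, 32]
statistic, expected = chi_square_gof(observed, [9, 3, 3, 1])
print(expected)             # expected counts under 9:3:3:1
print(round(statistic, 3))  # compare with a chi-square distribution, 3 df
```

A small statistic relative to the chi-square distribution with 3 degrees of freedom means the data are consistent with the predicted proportions.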

Death by Horsekicks

Number of Deaths by Horsekicks in the Prussian Army from 1875-1894 for 14 Corps
Data set: horsekicks

Sex Discrimination in Graduate School Admissions

The famous Berkeley data on sex discrimination. In fall quarter, 1973, there were 8,442 men who applied for admission to graduate school, and 4,321 women.
Source: Freedman, D., Pisani, R., Purves, R. and Adhikari, A. (1991) Statistics (2nd edition). WW Norton.
Data set: berkeleyadmissions1 and berkeleyadmissions2

Drug Use of Mothers and the Health of the Newborn

Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use. The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn. Is there a significant difference between the groups?

Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics & Gynecology 1989;74:882-885
Data set: mothers and cocain use

Cancer Survival


Linus Pauling, USA Chemistry 1954 "Studies Of Molecular Structure And The Chemical Bond" Peace 1962 "Fight Against Atomic Testing"
Frederick Sanger, UK Chemistry 1958 "Determining Structure Of Insulin Molecule" Chemistry 1980 "Biochemical Studies Of Nucleic Acids"
Marie Sklodowska Curie, Poland/France Physics 1903 "Discovery Of Radioactivity In Uranium Work On Radioactivity Based On Becquerel's Discovery" Chemistry 1911 "discovery of the elements radium and polonium"
John Bardeen, USA Physics 1956 "Development of the transistor effect" Physics 1972 "Theory of Superconductivity"
Patients with advanced cancers of the stomach, bronchus, colon, ovary or breast were treated with ascorbate. The purpose of the study was to determine if patient survival differed with respect to the organ affected by the cancer.
Cameron, E. and Pauling, L. (1978) Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Sciences USA, 75, 4538-4542.
Data set: cancersurvival

Capacity of Wells in the Appalachian Mountains

The specific capacity of wells in the Appalachian mountain region of Pennsylvania has been measured in four rock types (Knopman 1990). The rock types are dolomite, limestone, siliciclastic and metamorphic. The capacities are recorded in gal/min/ft.
Data set: rocks

Cultural Differences in Equipment Use

A US company manufactures equipment that is used in the production of semiconductors. The firm is considering a costly redesign that will improve the performance of its equipment. The performance is characterized as mean time between failures (MTBF). Most of the company's customers are in the USA, Europe and Japan, and there is anecdotal evidence that the Japanese customers typically get better performance than the users in the USA and Europe.
Data: MTBF for randomly selected users in the USA, Europe and Japan.
Data set: culture

Testing Hearing Aids

Reference: Loven, Faith. (1981). A Study of the Interlist Equivalency of the CID W-22 Word List Presented in Quiet and in Noise. Unpublished MS Thesis, University of Iowa.
Description: Percent of a Standard 50-word list heard correctly in the presence of background noise. 24 subjects with normal hearing listened to standard audiology tapes of English words at low volume with a noisy background. They repeated the words and were scored correct or incorrect in their perception of the words. The order of list presentation was randomized.
The word lists are standard audiology tools for assessing hearing. They are calibrated to be equally difficult to perceive. However, the original calibration was performed with normal-hearing subjects and no noise background. The experimenter wished to determine whether the lists were still equally difficult to understand in the presence of a noisy background.
Data set: hearingaid

Oxygen Concentration and Fermentation

The effect of oxygen level on fermentation end products was examined in the article "Effects of Oxygen on Pyruvate Formate-Lyase in Situ and Sugar Metabolism of Streptococcus mutans and Streptococcus sanguis" (Infection and Immunity 1985, p129-134). Four oxygen concentrations (0, 46, 92, 138 microM) and two sugar types (galactose and glucose) were used. The amount of ethanol was measured for each oxygen-sugar combination, and each measurement was made twice.
Data set: fermentation

Film Thickness in Semiconductor Production

Chemical vapor deposition is a process used in the semiconductor industry to deposit thin films of silicon dioxide and photoresist on substrates of wafers as they are manufactured. The films must be as thin as possible and have a uniform thickness, which is measured by a process called infrared interference. A process engineer wants to evaluate a low-pressure chemical vapor deposition process that reduces costs and increases productivity. The engineer has set up an experiment to study the effect of chamber temperature and pressure on film thickness.
Data set: filmcoatings

Water Quality and Mining

The effects of mining and rock type on water quality.
Data set: mines

Noise and Air Filters for Cars

The data are from a statement by Texaco, Inc. to the Air and Water Pollution Subcommittee of the Senate Public Works Committee on June 26, 1973. Mr. John McKinley, President of Texaco, cited the Octel filter, developed by Associated Octel Company, as effective in reducing pollution. However, questions had been raised about the effects of pollution filters on aspects of vehicle performance, including noise levels. He referred to data presented in the datafile associated with this story as evidence that the Octel filter was at least as good as a standard silencer in controlling vehicle noise levels.
Data set: airfilters

Forbes Companies

Data on 79 companies, from Forbes magazine
Data set: forbes

Albuquerque House Prices

The data are a random sample of records of resales of homes from Feb 15 to Apr 30, 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data is collected by multiple listing agencies in many cities and is used by realtors as an information base.
Variables:
Price = Selling price ($hundreds)
Sqfeet = Square feet of living space
Features = Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access)
Corner = Corner location ("yes", coded as 0 and "no", coded as 1)
Tax = Annual taxes ($)

For some of the houses not all the information was available. Such missing data is coded as *.
Data set: albuquerque house price

Treatments for Lowering Blood Pressure

Four different treatments (three drugs and a placebo) are to be evaluated for their impact on diastolic blood pressure. Thirty-two patients (16 male and 16 female) who have high blood pressure are randomly selected to participate in the experiment. They are randomly assigned to one of the four treatments, with 4 males and 4 females for each. Each patient's diastolic blood pressure is measured before and after the treatment. The differences (before-after) are recorded.
Data set: bloodpressure

Cats

Anatomical Data from Domestic Cats
Description: The heart and body weights of samples of male and female cats used for digitalis experiments. The cats were all adult, over 2 kg body weight.
Variables:
Sex: Female cat="F" and male cat="M".
Bwt: Body weight in kg.
Hwt: Heart weight in g.
Source: R. A. Fisher (1947) The analysis of covariance method for the relation between a part and the whole, Biometrics 3, 65–68.
Data set: cats

Medieval Churches

Gould (1973) has speculated on the applicability of biological "laws" of shape to other objects. To study this, he chose a "simple minded" example: medieval churches. These were built in a very wide range of sizes and shapes, to serve fairly similar purposes. Because of limitations due to the use of stone as a building material, we might speculate that the relationship between various measurements of the churches will be very strong. The data lists the perimeter in hundreds of meters and area in hundreds of square meters for 25 post-Conquest Romanesque churches in Britain. The data were measured from ground plans given by Clapham (1934) and kindly provided by S. J. Gould.
Data set: church

Watering Schedules and Crop Yield

In an experiment in agriculture 8 different watering schedules were used to grow a certain crop.
Data set: crop

Cuckoo Eggs

That cuckoo eggs were peculiar to the locality where found was already known in 1892. A study by E.B. Chance in 1940 called The Truth About the Cuckoo demonstrated that cuckoos return year after year to the same territory and lay their eggs in the nests of a particular host species. Further, cuckoos appear to mate only within their territory. Therefore, geographical sub-species are developed, each with a dominant foster-parent species, and natural selection has ensured the survival of cuckoos most fitted to lay eggs that would be adopted by a particular foster-parent species. The data has the lengths of cuckoo eggs found in the nests of six other bird species (drawn from the work of O.M. Latter in 1902).
Data set: cuckoo

Prices of Diamond Rings

The data contains the prices of ladies' diamond rings and the carat size of their diamond stones. The rings are made with gold of 20 carats purity and are each mounted with a single diamond stone. The source of the data is a full page advertisement placed in the _Straits Times_ newspaper issue of February 29, 1992, by a Singapore-based retailer of diamond jewelry.
Data set: diamond

Pets and one-year Survival

Psychological and social factors can influence the survival of patients with serious diseases. One study examined the relationship between survival of patients with coronary heart disease and pet ownership. Each of 92 patients was classified as having a pet or not, and whether they survived one year. Here is the data, from Erika Friedmann et al., "Animal companions and one-year survival of patients after discharge from a coronary care unit.":

Patient Status   Owns a Pet   Does not Own a Pet
Alive            50           28
Dead             3            11
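
A natural first look at this table compares the one-year survival proportions of the two groups and computes a chi-square statistic for independence. This is a minimal sketch under stated assumptions: the helper name is made up, no continuity correction is applied, and the counts are the four values read from the table (50, 28, 3, 11).

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return n * (a * d - b * c) ** 2 / (
        row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]
    )


# Counts from the table: rows = alive/dead, columns = pet/no pet.
table = [[50, 28], [3, 11]]

survival_with_pet = 50 / (50 + 3)
survival_without_pet = 28 / (28 + 11)
print(round(survival_with_pet, 3), round(survival_without_pet, 3))
print(round(chi_square_2x2(table), 2))  # compare with chi-square, 1 df
```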

Fabric Wear

Results from an experiment designed to determine how much the speed of a washing machine affects the wear on a new fabric. The machine was run at 5 different speeds (measured in rpm) and with six pieces of fabric each.
Data set: fabric wear

Fiber in Crackers

A manufacturer was considering marketing crackers high in a certain kind of edible fiber as a dieting aid. Dieters would consume some crackers before a meal, filling their stomachs so that they would feel less hungry and eat less. A laboratory studied whether people would in fact eat less in this way. Overweight female subjects ate crackers with different types of fiber (bran fiber, gum fiber, both, and a control cracker) and were then allowed to eat as much as they wished from a prepared menu. The amount of food they consumed and their weight were monitored, along with any side effects they reported. Unfortunately, some subjects developed uncomfortable bloating and gastric upset from some of the fiber crackers.
Data set: fiber

Flammability of Children's Sleepwear

The flammability of fabric used in children's sleepwear is tested by placing a flame under a piece of fabric until it ignites. The flame is then removed, and the fabric stops burning. The length of the charred portion of the fabric is measured. In the data set pieces of the same cloth were sent to five different laboratories, which then carried out this experiment eleven times.
Data set: flammability

Treatment for Hostility

A clinical psychologist wished to compare three methods for reducing hostility levels in university students. A certain psychological test (HLT) was used to measure the degree of hostility. High scores on this test were taken to indicate great hostility. Eleven students obtaining high and nearly equal scores were used in the experiment. Five were selected at random from among the 11 problem cases and treated by method A. Three were taken at random from the remaining six students and treated by method B. The other three students were treated by method C. All treatments continued throughout a semester. Each student was given the HLT test again at the end of the semester.
Data set: hostility

GPA and IQ

Data on grade point average of high school students (measured on a scale of 0-12) and their IQ and gender.
Data set: iq

Larvae of Mosquito

Dowdy and Wearden (1983) presented a set of data containing measurements on the number of larvae of Chaoborus (a non-blood-sucking type of mosquito), the depth of the lake where the larvae were found, the brackishness (or how clear) of the water and the amount of oxygen in the water.
Data set: larvea

Employee Attitudes

A survey was conducted to study the attitude of the employees of 30 departments of a large financial institution towards their supervisors. The employees were asked to give a score of 0 (=totally disagree) to 100 (=totally agree). The data set contains the average score per department. The variables and the respective questions are:
Variable name: Question in Survey
Rating: My supervisor is doing an outstanding job
Complaints: He/She handles complaints well
Privileges: He/She does not allow special privileges
Learn new Things: He/She gives us the opportunity to learn new things
Raises: He/She bases raises on the performance
Too Critical: He/She is too critical of poor performances
Advancing: We have a good rate of advancing to new jobs

Data set: ratings

Survey of Children

We have data on students in grades 4-6 from three school districts in Ingham and Clinton Counties, Michigan. Chase and Dummer stratified their sample, selecting students from urban, suburban, and rural school districts with approximately 1/3 of their sample coming from each district. Students indicated whether good grades, athletic ability, or popularity was most important to them. They also ranked four factors: grades, sports, looks, and money, in order of their importance for popularity. The questionnaire also asked for gender, grade level, and other demographic information.
Variable Names:
1. Gender: Boy or girl
2. Grade: 4, 5 or 6
3. Age: Age in years
4. Race: White, Other
5. Urban/Rural: Rural, Suburban, or Urban school district
6. School: Brentwood Elementary, Brentwood Middle, Ridge, Sand, Eureka, Brown, Main, Portage, Westdale Middle
7. Goals: Student's choice in the personal goals question where options were 1 = Make Good Grades, 2 = Be Popular, 3 = Be Good in Sports
8. Grades: Rank of "make good grades" (1=most important for popularity, 4=least important)
9. Sports: Rank of "being good at sports" (1=most important for popularity, 4=least important)
10. Looks: Rank of "being handsome or pretty" (1=most important for popularity, 4=least important)
11. Money: Rank of "having lots of money" (1=most important for popularity, 4=least important)
Data set: popular

Salaries in a Company

A company did a survey of its salaries, together with information on the number of years people have worked there and their job level, either "low" or "high".
Data set: salaries

Singers in the New York Choral Society

Each singer in the NY Choral Society in 1979 self-reported his or her height to the nearest inch. Their voice parts in order from highest pitch to lowest pitch are Soprano, Alto, Tenor, Bass. The first two are typically sung by female voices and the last two by male voices.
Data set: singer

Egyptian Skulls

Four measurements were made of male Egyptian skulls from five different time periods ranging from 4000 B.C. to 150 A.D. Are there any differences in the skull sizes between the time periods? The researchers theorize that a change in skull size over time is evidence of the interbreeding of the Egyptians with immigrant populations over the years.
The predictor variables are
Maximal Breadth of Skull (MB)
Basibregmatic Height of Skull (BH)
Basialveolar Length of Skull (BL)
Nasal Height of Skull (NH)
Data is from Thomson, A. and Randall-Maciver, R. (1905) Ancient Races of the Thebaid, Oxford: Oxford University Press
Data set: skulls

Smoking and Lung Cancer Rates

Data on the per capita numbers of cigarettes smoked (sold) in 43 states and the District of Columbia in 1960, together with death rates per thousand population from lung cancer.
Data set: smoking

Liver and Toxins in Trout

Casella and Berger (p522) describe an experiment to measure the amount of deterioration of the liver of trout, depending on three toxins and a control.
Data set: trout

Life Expectancy, TVs and Doctors

For each of the 38 largest countries in the world (according to 1990 population figures), data are given for the country's life expectancy at birth, number of People per TV, and number of People per Doctor. SOURCE: _The World Almanac and Book of Facts 1993_ (1993), New York: Pharos Books
Data set: tv

Deaths in Car Accidents (2002)

Number of fatalities in car crashes in the US, by age and gender. Source: National Center for Injury Prevention and Control
Data set: cardeaths

Highway Accidents

This data set, taken from an unpublished Masters thesis by Carl Hoffstedt and discussed in Weisberg, S. (1985) Applied Linear Regression, relates the automobile accident rate, in accidents per million vehicle miles, to 13 potential predictors. The data include 39 sections of large highways in the state of Minnesota in 1973. The variables are
rate: accidents per million vehicle miles
len: length of the segment in miles
adt: average daily traffic count in thousands
trks: truck volume as a percent of total traffic
slim: speed limit
lwid: lane width in feet
shld: width in feet of outer shoulder
itg: number of freeway-type interchanges per mile
sigs: number of signalized interchanges per mile
acpt: number of access points per mile
lane: total number of traffic lanes in both directions
fai: 1 if federal aid interstate highway, 0 otherwise
pa: 1 if principal arterial highway, 0 otherwise
ma: 1 if major arterial highway, 0 otherwise.

Data set: highways

Old Faithful Geyser

The Old Faithful Geyser in Yellowstone National Park erupts every 35 to 120 minutes. Each eruption lasts 1½ to 5 minutes. Notice that Old Faithful is not as faithful as one might expect: the time between eruptions and the length of each eruption vary quite a bit. However, one can estimate the time of the next eruption quite accurately given the duration of the previous eruption. The data set we will work with consists of the duration of eruption and time between eruptions for 272 different eruptions of Old Faithful taken over a number of days in August 1978 and August 1979. (From Applied Linear Regression, 2nd Edition, by Sanford Weisberg, pp. 231 and 234.) The times given are in minutes. We will use the duration of an eruption to predict the time until the next eruption. The park rangers at Yellowstone do this, and their predictions are posted near the geyser and at the web cam picture site located at http://www.nps.gov/yell/oldfaithfulcam.htm
Data set: faithful

Smoking and Job Type

Data summarizes a study of men in 25 occupational groups in England. Two indices are presented for each occupational group. The smoking index is the ratio of the average number of cigarettes smoked per day by men in the particular occupational group to the average number of cigarettes smoked per day by all men. The mortality index is the ratio of the rate of deaths from lung cancer among men in the particular occupational group to the rate of deaths from lung cancer among all men.
Variables:
1. Occupational_Group: Occupational Group
2. Smoking: Smoking index (100 = average)
3. Mortality: Lung cancer mortality index (100 = average)
Source: Occupational Mortality: The Registrar General's Decennial Supplement for England and Wales, 1970-1972, Her Majesty's Stationery Office, London, 1978.
Data set: smokingandjob
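
Both indices are ratios scaled so that 100 means "same as all men". A minimal sketch of that calculation follows; the function name and the numbers are hypothetical and are not taken from the smokingandjob file.

```python
def index_vs_average(group_value, overall_value):
    """Express a group's value as a percentage of the overall value."""
    return 100.0 * group_value / overall_value


# Hypothetical numbers, not from the smokingandjob file: a group whose men
# smoke 18 cigarettes a day on average when all men average 15.
print(index_vs_average(18, 15))
```

An index above 100 means the group smokes more (or has higher lung cancer mortality) than men overall; below 100 means less.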

Cheese Tasting

As cheese ages, various chemical processes take place that determine the taste of the final product. This dataset contains concentrations of various chemicals in 30 samples of mature cheddar cheese, and a subjective measure of taste for each sample.
Variable Names:
Taste: Subjective taste test score, obtained by combining the scores of several tasters
Acetic: Concentration of acetic acid
H2S: Concentration of hydrogen sulfide
Lactic: Concentration of lactic acid
Data set: cheese

Hotdogs

People who are concerned about their health may prefer hot dogs that are low in salt and calories. The datafile contains data on the sodium and calories contained in each of 54 major hot dog brands. The hot dogs are classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat).
Data set: hotdogs

Homework or not?

Does assigning homework improve students' performance? To study this a teacher of a basic Statistics course set up the following experiment. She taught three sections of the same course. In one she did not assign any homework, in the second she assigned homework but did not collect it, and in the third she assigned and graded the homework. After the first exam she collected the exam scores.
Data set: homework

Age by Gender in US and PR (Census 2000)

Breakdown of the population of USA and Puerto Rico by age and gender, according to the 2000 Census
Data set: Puerto Rico: agesex, all of US: agesexUS

Gasoline type and Pollution

In an experiment to reduce pollution, four different blends of gasoline are tested in each of three makes of automobiles. The cars are driven a fixed distance to determine the mpg (miles per gallon). The experiment is repeated three times for each blend-automobile combination. (Taken from Lyman Ott)
Data set: gasoline

Wine Consumption and Heart Disease

Data for 19 developed countries on wine consumption (liters of wine per person per year) and deaths from heart disease (per 100000 people). (taken from David Moore: The Active Practice of Statistics)
Data set: wine