Use Summary Statistics and Graphics to Clean and Analyze Data

John Holcomb (Cleveland State University)

Angela Spalsbury (Youngstown State University)

Journal of Statistics Education Volume 13, Number 3 (2005), www.amstat.org/publications/jse/v13n3/datasets.holcomb.html

The problem

The objective of the study whose data you will analyze was to determine whether significant gender differences existed among subjects 65 years of age and older with regard to calcium, inorganic phosphorus, and alkaline phosphatase levels (Boyd et al., 1998). The researchers performed a retrospective chart review of laboratory procedures performed in 6 different physician practices. The data cover 178 subjects (92 males and 86 females) aged 65 or older. The dataset contains three discrete variables: sex, lab, and agegroup. The coding is as follows:

Variable  Code
Sex       1 = Male; 2 = Female
Lab       1 = Metpath; 2 = Deyor; 3 = St. Elizabeth's; 4 = CB Rouche; 5 = YOH; 6 = Horizon
Agegroup  1 = 65-69; 2 = 70-74; 3 = 75-79; 4 = 80-84; 5 = 85-89

The other variables are continuous: age (years), alkphos (alkaline phosphatase, IU/L), cammol (calcium, mmol/L), and phosmmol (inorganic phosphorus, mmol/L).
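The coding above already implies a few mechanical checks: each discrete variable should take only its listed codes, and age should fall inside the range named by agegroup. The sketch below shows one way to flag records that break those rules. The function name `find_problems`, the dictionary layout, and the two example records are assumptions made for illustration; the example records are made up and are not rows from the dataset.

```python
# Allowed codes, taken from the coding table above.
VALID_CODES = {
    "sex": {1, 2},
    "lab": {1, 2, 3, 4, 5, 6},
    "agegroup": {1, 2, 3, 4, 5},
}

# Age range (inclusive) named by each agegroup code.
AGEGROUP_RANGES = {1: (65, 69), 2: (70, 74), 3: (75, 79), 4: (80, 84), 5: (85, 89)}


def find_problems(record):
    """Return a list of readable problem descriptions for one record (a dict)."""
    problems = []
    # Check every discrete variable against its list of valid codes.
    for name, allowed in VALID_CODES.items():
        if record.get(name) not in allowed:
            problems.append(f"{name}={record.get(name)!r} is not a valid code")
    # Check that age agrees with agegroup, when agegroup itself is valid.
    age = record.get("age")
    bounds = AGEGROUP_RANGES.get(record.get("agegroup"))
    if age is None:
        problems.append("age is missing")
    elif bounds is not None and not (bounds[0] <= age <= bounds[1]):
        problems.append(f"age={age} does not match agegroup={record['agegroup']}")
    return problems


# Made-up records for illustration only; they are not rows from the dataset.
example_ok = {"age": 72, "sex": 1, "lab": 3, "agegroup": 2}
example_bad = {"age": 72, "sex": 12, "lab": 3, "agegroup": 5}

print(find_problems(example_ok))
print(find_problems(example_bad))
```

A check like this only points at suspicious values. Deciding what the correct value should be still requires going back to the original records.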

  1. The first task of the assignment is to check the validity of the data. Determine whether this is a “messy” dataset with variable values that appear incorrect. Attempt to recover the correct values by looking them up in the actual data records, copies of which can be found in bigtable.htm. Be sure to catalogue the problem values in the data and the changes that were made to clean the dataset. Include a paragraph detailing the steps taken to clean the dataset.

  2. Once the data are “clean”, perform a summary analysis of the three discrete variables (sex, lab, and agegroup). For the variables alkphos, cammol and phosmmol, report the mean, median, standard deviation, min and max broken down by sex. Then summarize alkphos, cammol and phosmmol in the same way, using lab as the factor variable.

  3. Construct side-by-side boxplots of the variables alkphos, cammol and phosmmol, using sex as the factor variable. Next construct side-by-side boxplots of the same three continuous variables, using lab as the factor variable.

  4. Compare the mean and standard deviation of age, alkphos, cammol and phosmmol from the messy dataset with the mean and standard deviation from your cleaned dataset. Does cleaning the data make a difference? Explain.

  5. Using your summary statistics and your side-by-side boxplots, do you believe a significant difference exists in alkphos, cammol and phosmmol levels with respect to sex? Why or why not? Do you believe a significant difference exists in alkphos, cammol and phosmmol levels with respect to lab? Why or why not?

  6. Suppose Mr. and Mrs. Contrarian are married and Mrs. Contrarian has lower calcium than Mr. Contrarian. Because of this, she refuses to believe the study's result that men tend to have lower calcium than women. Using your results from question 3, explain to Mrs. Contrarian the flaw in her thinking.

  7. One of the objectives of this research was to propose a reference range of values that are to be considered “normal” for calcium, inorganic phosphorus, and alkaline phosphatase. Looking at the results for cammol alone for each of the labs, explain why a single reference range is so difficult to establish.
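Tasks 2 and 4 both rest on the same five summaries (mean, median, standard deviation, min and max), computed either per level of a factor or per dataset. The following is a minimal sketch of the grouped version using only Python's standard library. The function name `summarize_by`, the use of `None` for a missing measurement, and the toy values are assumptions for illustration; the toy values are made up and are not taken from the dataset.

```python
import statistics
from collections import defaultdict


def summarize_by(records, value_name, factor_name):
    """Mean, median, sample SD, min and max of value_name per factor level."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(value_name)
        if value is not None:  # skip missing measurements
            groups[rec[factor_name]].append(value)

    summary = {}
    for level, values in sorted(groups.items()):
        summary[level] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # The sample standard deviation needs at least two values.
            "sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Made-up values for illustration only; not taken from the dataset.
toy = [
    {"sex": 1, "cammol": 2.0},
    {"sex": 1, "cammol": 3.0},
    {"sex": 2, "cammol": 2.5},
    {"sex": 2, "cammol": 2.5},
    {"sex": 2, "cammol": None},
]

for level, stats in summarize_by(toy, "cammol", "sex").items():
    print(level, stats)
```

Calling the same function with lab as the factor gives the per-lab summaries, and running it once on the messy data and once on the cleaned data gives the comparison asked for in task 4.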

Getting the Data

The file calcium.dat.txt contains the data with the problem values. The file calciumgood.dat.txt contains the data with the problem values corrected. The observation grid can be found at bigtable.htm. The file calcium.txt is a documentation file that contains a brief description of the dataset and the purpose of the assignment.

Appendix A: Key to variables in calcium.dat.txt and calciumgood.dat.txt

calcium.dat.txt

Columns  Variable  Comment
9-11     OBSNO     Patient Observation Number
21-22    AGE       Years
33       SEX       1=Male, 2=Female
42-44    ALKPHOS   Alkaline Phosphatase, International Units/Liter
55       Lab       Lab: 1=Metpath; 2=Deyor; 3=St. Elizabeth's; 4=CB Rouche; 5=Youngstown Osteopathic Hospital; 6=Horizon
63-66    CAMMOL    Calcium, mmol/L
74-77    PHOSMMOL  Inorganic Phosphorus, mmol/L
88       AGEGROUP  Age group in years: 1=65-69; 2=70-74; 3=75-79; 4=80-84; 5=85-89

calciumgood.dat.txt

Columns  Variable  Comment
9-11     OBSNO     Patient Observation Number
20-22    AGE       Years
32-33    SEX       1=Male, 2=Female
42-44    ALKPHOS   Alkaline Phosphatase, International Units/Liter
54-55    Lab       Lab: 1=Metpath; 2=Deyor; 3=St. Elizabeth's; 4=CB Rouche; 5=Youngstown Osteopathic Hospital; 6=Horizon
62-66    CAMMOL    Calcium, mmol/L
74-77    PHOSMMOL  Inorganic Phosphorus, mmol/L
88       AGEGROUP  Age group in years: 1=65-69; 2=70-74; 3=75-79; 4=80-84; 5=85-89
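The column positions in the key are 1-indexed and inclusive, so columns 20-22 correspond to the Python slice `line[19:22]`. Here is a minimal sketch of reading one fixed-width record with the calciumgood.dat.txt positions. The names `COLUMNS_GOOD`, `parse_line` and `build_line` are hypothetical, the sample line is made up rather than copied from the file, and the choice to keep unconvertible text as a raw string is an assumption, not something the files are known to require.

```python
# (start, end) column positions, 1-indexed and inclusive, copied from the
# calciumgood.dat.txt key above.
COLUMNS_GOOD = {
    "obsno": (9, 11),
    "age": (20, 22),
    "sex": (32, 33),
    "alkphos": (42, 44),
    "lab": (54, 55),
    "cammol": (62, 66),
    "phosmmol": (74, 77),
    "agegroup": (88, 88),
}
INTEGER_FIELDS = {"obsno", "age", "sex", "lab", "agegroup"}


def parse_line(line, columns=COLUMNS_GOOD):
    """Slice one fixed-width line into a dict of numbers (None if blank)."""
    record = {}
    for name, (start, end) in columns.items():
        text = line[start - 1:end].strip()  # 1-indexed inclusive -> slice
        if text == "":
            record[name] = None
            continue
        try:
            record[name] = int(text) if name in INTEGER_FIELDS else float(text)
        except ValueError:
            # Assumption: keep unreadable text so it can be catalogued later.
            record[name] = text
    return record


def build_line(values, columns=COLUMNS_GOOD, width=88):
    """Hypothetical helper: place right-justified text fields in a blank line."""
    chars = [" "] * width
    for name, text in values.items():
        start, end = columns[name]
        chars[start - 1:end] = text.rjust(end - start + 1)
    return "".join(chars)


# Made-up record for illustration only; not a row from the data file.
sample = build_line({
    "obsno": "1", "age": "72", "sex": "1", "alkphos": "80",
    "lab": "3", "cammol": "2.35", "phosmmol": "1.10", "agegroup": "2",
})
print(parse_line(sample))
```

The key for calcium.dat.txt lists slightly different positions for AGE, SEX, Lab and CAMMOL, so reading that file would need a second position table passed as the `columns` argument. Whether either file begins with header lines is not stated in the key, so check the first few lines before parsing.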