Describing Patterns in Data (Part 3)

Homer White, Georgetown College

Patterns in Data: Part 3

Load Packages

Make sure the necessary packages are loaded:

require(mosaic)
require(tigerstats)

Relationship Between a Factor and a Numerical Variable

Factor and Numerical Variable

Research Question: Who tends to drive faster – GC guys or GC gals?

  • Question about the relationship between two variables.
  • Explanatory variable: sex (factor)
  • Response variable: fastest (numerical)

Idea for Investigation

  • Compare the conditional distributions of the numerical variable, given each value of the factor variable.
  • Getting conditional distributions requires breaking data into “groups”, one for each value of the factor.
  • If the conditional distributions differ, then the two variables are related.
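
The grouping idea above can be sketched in base R. This is only an illustration: the data frame `toy` and its values are made up for the example (they are not taken from m111survey), and `group_summary` is a name chosen here.

```r
# Toy data, made up for illustration only (not taken from m111survey).
toy <- data.frame(
  sex     = factor(c("female", "female", "female", "male", "male", "male")),
  fastest = c(90, 95, 110, 100, 110, 125)
)

# Break the numerical variable into groups, one per level of the factor.
groups <- split(toy$fastest, toy$sex)

# Summarize each conditional distribution with the same statistics.
group_summary <- sapply(groups, function(x) {
  c(median = median(x), mean = mean(x), sd = sd(x))
})
print(group_summary)
```

Each column of the result describes one conditional distribution, so comparing columns is comparing groups.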

Numerical Approach

favstats(fastest~sex,data=m111survey)

Note the formula-data input. The formula has the form:

\[ \text{numerical} \sim \text{factor} \]

Numerical Approach

  .group min Q1 median    Q3 max  mean    sd
1 female  60 90     95 110.0 145 100.0 17.61
2   male  85 99    110 122.5 190 113.5 22.57

Focus on a measure of center (definitely NOT on max or min!).

Guys drive faster, on average (mean speed 113.5 mph, compared to 100 mph for the gals).

Graphics: Parallel Boxplots

bwplot(fastest~sex,
       data=m111survey,
       ylab="speed (mph)",
       main="Fastest Speed")

[Figure: parallel boxplots of fastest speed, by sex]

Graphics: Histograms

histogram(~fastest|sex,data=m111survey,
       type="density",
       main="Fastest Speed Driven, by Sex",
       xlab="Fastest Speed, in mph")

The Result

[Figure: density histograms of fastest speed driven, one panel per sex]

Graphics: Grouped Density Plots

densityplot(~fastest,data=m111survey,
       groups=sex,
       main="Fastest Speed Driven, by Sex",
       xlab="speed (mph)",
       auto.key=TRUE)

[Figure: grouped density plots of fastest speed driven, by sex]

Comparing Measures of Center and Spread

Two Pairs of Measures

  • Two measures of center:
    • mean
    • median
  • Two measures of spread:
    • standard deviation (SD)
    • interquartile range (IQR)

Which to use?

Mean/Median

In a histogram or density plot:

  • half of the area lies to the left of the median
  • the graph “balances” at the mean

Mean/Median with Symmetry

[Figure: a symmetric distribution]

The mean and median are about the same!

Mean/Median with Skewness

[Figure: a skewed distribution]

The mean is dragged toward the tail.

Mean/Median, No Outliers

SmallDataset
[1] 32 45 47 47 49 56 56 56 57
favstats(~SmallDataset)
 min Q1 median Q3 max  mean    sd
  32 47     49 56  57 49.44 8.079

Mean and median are both in the “middle.”

Mean/Median, With Outlier

NewData <- c(SmallDataset,200)
NewData
 [1]  32  45  47  47  49  56  56  56  57 200
favstats(~NewData)
 min Q1 median Q3 max mean    sd
  32 47   52.5 56 200 64.5 48.22

The mean (64.5) is now bigger than every value except the outlier, while the median has moved only from 49 to 52.5. The SD is also badly inflated (from about 8.1 to about 48.2).
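
To see how each measure reacts to the outlier, here is a small base R sketch using the same values as the slides. The names `small_data`, `new_data` and `center_spread` are chosen here for illustration; `favstats` is not needed.

```r
# Data copied from the slides above.
small_data <- c(32, 45, 47, 47, 49, 56, 56, 56, 57)
new_data   <- c(small_data, 200)

# Collect center and spread measures for one numeric vector.
center_spread <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# One row per dataset, so the two can be compared column by column.
comparison <- rbind(without_outlier = center_spread(small_data),
                    with_outlier    = center_spread(new_data))
print(round(comparison, 2))
```

The IQR is 9 in both rows, and the median shifts only slightly, whereas the mean and SD change a great deal.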

Criterion

If the data

  • are strongly skewed, or
  • have extreme outliers in one direction but not the other

then use median/IQR rather than mean/SD to describe center/spread.

The 68-95 Rule

If the distribution of a numerical variable is roughly bell-shaped, then

  • about 68% of the values are within one SD of the mean
  • about 95% are within 2 SDs of the mean
  • about 99.7% are within 3 SDs of the mean

(This is sometimes called the Empirical Rule.)
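
The three percentages can be checked against an exactly normal curve with base R's `pnorm`. This is a sketch under that assumption (real data are only roughly bell-shaped); the helper name `within_k_sd` is made up here.

```r
# Proportion of an exactly normal distribution within k SDs of the mean.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

# Evaluate the proportion for 1, 2 and 3 SDs.
rule <- sapply(c(one_sd = 1, two_sd = 2, three_sd = 3), within_k_sd)
print(round(100 * rule, 1))
```

The results round to the 68%, 95% and 99.7% figures quoted in the rule.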

68-95 Rule Limits

The rule works well even if the data are somewhat skewed, but be careful!

require(manipulate)
EmpRuleGC()

68-95 Rule Application

A population has heights that have a bell-shaped distribution, with a mean of 70 inches and a standard deviation of 3 inches.

  • About what percentage of people are taller than 73 inches?
  • About what percentage are shorter than 64 inches?
  • The percentage that are taller than 74 inches is less than \( U \)% and greater than \( L \)%. Find \( U \) and \( L \).

Suggestion: try

require(manipulate)
EmpRuleGC(mean=70,sd=3)
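
Once you have worked the problems with the 68-95 Rule, one way to check your answers is to compute the same areas with base R's `pnorm`. This sketch assumes the heights are exactly normal, which is stronger than "bell-shaped"; the variable names are chosen here.

```r
mu    <- 70  # mean height in inches, from the problem statement
sigma <- 3   # standard deviation in inches, from the problem statement

# Assumption: heights follow an exactly normal distribution.
taller_than_73  <- pnorm(73, mean = mu, sd = sigma, lower.tail = FALSE)
shorter_than_64 <- pnorm(64, mean = mu, sd = sigma)
taller_than_74  <- pnorm(74, mean = mu, sd = sigma, lower.tail = FALSE)

# Report the three areas as percentages.
print(round(100 * c(taller_than_73, shorter_than_64, taller_than_74), 1))
```

Your rule-based answers to the first two questions should be close to the first two numbers, and your \( L \) and \( U \) should bracket the third.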

z-scores

Suppose you want to compare an individual to a group. The \( z \)-score for the individual value \( x \) is:

\[ z=\frac{x-\bar{x}}{s}, \]

where

  • \( \bar{x} \) is the mean of the group
  • \( s \) is the standard deviation of the group
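
As a sketch of the formula at work, the base R code below computes the \( z \)-score of every value in the earlier dataset with the outlier. The name `new_data` is chosen here; the values are the ones shown for NewData.

```r
# Data copied from the "With Outlier" slide above.
new_data <- c(32, 45, 47, 47, 49, 56, 56, 56, 57, 200)

# z = (x - xbar) / s, applied to every value at once.
z <- (new_data - mean(new_data)) / sd(new_data)
print(round(z, 2))
```

The value 200 gets a \( z \)-score of about 2.81, far larger than that of any other value in the group.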

Idea of a z-score

The \( z \)-score of \( x \) tells you how many SDs \( x \) is above or below the mean for the group. It measures how unusual \( x \) is.

Our Rule for Being Surprised

Let's just agree to say that a value \( x \) is:

  • surprisingly high if \( z>2 \)
  • surprisingly low if \( z<-2 \)
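
The rule can be written as a small helper. This is a hypothetical sketch: the function name `surprise_label` and the label "not surprising" for everything in between are assumptions made here, and the example reuses the mean of 70 inches and SD of 3 inches from the 68-95 Rule application.

```r
# Hypothetical helper: label a z-score using the cutoffs of +2 and -2.
surprise_label <- function(z) {
  ifelse(z > 2, "surprisingly high",
         ifelse(z < -2, "surprisingly low", "not surprising"))
}

# Heights of 77, 70, and 63 inches, measured against a group with
# mean 70 and SD 3 (the numbers from the 68-95 Rule application).
z <- (c(77, 70, 63) - 70) / 3
labels <- surprise_label(z)
print(data.frame(z = round(z, 2), label = labels))
```

Because `ifelse` is vectorized, the same helper works on a single \( z \)-score or on a whole column of them.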

Application

Question: Suppose that Linda is 72 inches tall. How does she compare with the other GC students in the m111survey data?

Solution: Get her \( z \)-score.

favstats(~height,data=m111survey)
  mean    sd
 67.99 5.296

The mean is about 67.987 inches, and the SD is about 5.296 inches. Hence Linda's \( z \)-score is:

(72-67.987)/5.296
[1] 0.7577

The \( z \)-score is about 0.76, which means that Linda is only about three-fourths of a standard deviation above the mean height.

She is taller than average, but not unusually tall.

Question: Is Linda unusually tall, for a female?

Solution:

favstats(height~sex,data=m111survey)
  .group  mean    sd
1 female 64.94 4.622
2   male 71.92 3.049

Linda's \( z \)-score relative to the females is:

(72-64.94)/4.622
[1] 1.527

So Linda is about 1.53 SDs above the mean female height.

That's more impressive, but still not terribly unusual.

Note

  • \( z \)-scores and our criterion for “surprise” make the most sense when the distribution is roughly bell-shaped.
  • But even if the distribution is a bit skewed, they are still useful.