Homer White, Georgetown College

Always remember to make sure the necessary packages are loaded:

```
require(mosaic)
require(tigerstats)
```

Research Question: Who tends to drive faster – GC guys or GC gals?

- Question about the *relationship* between two variables.
- Explanatory variable: **sex** (factor)
- Response variable: **fastest** (numerical)

- Compare the *conditional distributions* of the numerical variable, given each value of the factor variable.
- Getting conditional distributions requires breaking data into “groups”, one for each value of the factor.
- If the conditional distributions differ, then the two variables are related.
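The grouping idea can be sketched in base R. The data frame `toy` and its values are hypothetical, made up only for illustration; they are not taken from `m111survey`.

```r
# Hypothetical toy data, not taken from m111survey
toy <- data.frame(
  sex     = factor(c("female", "female", "female", "male", "male", "male")),
  fastest = c(90, 95, 110, 99, 110, 125)
)

# Break the numerical variable into one group per value of the factor
groups <- split(toy$fastest, toy$sex)
print(groups)

# Compute a measure of center for each conditional distribution
group_medians <- sapply(groups, median)
print(group_medians)
```

If the group medians differ noticeably, that is a first hint that the two variables are related.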

```
favstats(fastest~sex,data=m111survey)
```

Note the formula-data input. The formula has the form:

\[ numerical \sim factor \]

```
.group min Q1 median Q3 max mean sd
1 female 60 90 95 110.0 145 100.0 17.61
2 male 85 99 110 122.5 190 113.5 22.57
```

*Focus on a measure of center (definitely NOT on max or min!)*.

Guys drive faster, on average (mean speed 113.5 mph, compared to 100 mph for the gals).

```
bwplot(fastest~sex,
data=m111survey,
ylab="speed (mph)",
main="Fastest Speed")
```

```
histogram(~fastest|sex,data=m111survey,
type="density",
main="Fastest Speed Driven, by Sex",
xlab="Fastest Speed, in mph")
```

```
densityplot(~fastest,data=m111survey,
groups=sex,
main="Fastest Speed Driven, by Sex",
xlab="speed (mph)",
auto.key=TRUE)
```

- Two measures of center:
  - mean
  - median
- Two measures of spread:
  - standard deviation (SD)
  - interquartile range (IQR)

Which to use?

In a histogram or density plot:

- Half of the area comes before the median
- The graph “balances” at the mean
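Both facts have a numerical counterpart that is easy to check. This is a minimal sketch; the vector `x` is a hypothetical name, filled with a handful of made-up values.

```r
# Hypothetical small data set
x <- c(32, 45, 47, 47, 49, 56, 56, 56, 57)

# "Balance": deviations from the mean cancel out
# (the sum is zero, up to rounding error)
balance <- sum(x - mean(x))
print(balance)

# "Half before the median": at least half of the values are at or
# below the median, and at least half are at or above it
prop_at_or_below <- mean(x <= median(x))
prop_at_or_above <- mean(x >= median(x))
print(c(prop_at_or_below, prop_at_or_above))
```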

When the distribution is roughly symmetric, the mean and median are about the same!

When the distribution is skewed, the mean is dragged toward the tail.

```
SmallDataset
```

```
[1] 32 45 47 47 49 56 56 56 57
```

```
favstats(~SmallDataset)
```

```
min Q1 median Q3 max mean sd
32 47 49 56 57 49.44 8.079
```

Mean and median are both in the “middle.”

```
NewData <- c(SmallDataset,200)
NewData
```

```
[1] 32 45 47 47 49 56 56 56 57 200
```

```
favstats(~NewData)
```

```
min Q1 median Q3 max mean sd
32 47 52.5 56 200 64.5 48.22
```

Mean is much bigger than all values except the outlier! The SD is also quite inflated.
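To see which measures resist the outlier, compare all four side by side. This sketch uses base R only and assumes `SmallDataset` holds exactly the values printed earlier.

```r
# Values copied from the printed output (assumed to be the full data set)
SmallDataset <- c(32, 45, 47, 47, 49, 56, 56, 56, 57)
NewData <- c(SmallDataset, 200)

# Resistant measures: the outlier changes them little or not at all
medians <- c(before = median(SmallDataset), after = median(NewData))
iqrs    <- c(before = IQR(SmallDataset),    after = IQR(NewData))

# Non-resistant measures: the outlier changes them a lot
means <- c(before = mean(SmallDataset), after = mean(NewData))
sds   <- c(before = sd(SmallDataset),   after = sd(NewData))

print(rbind(median = medians, IQR = iqrs, mean = means, sd = sds))
```

The IQR is Q3 minus Q1, which is 56 - 47 = 9 in both `favstats()` outputs, so the outlier leaves it untouched.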

If

- the data are **strongly** skewed, or
- have **extreme** outliers in one direction but not the other

then use median/IQR rather than mean/SD to describe center/spread.

If the distribution of a numerical variable is roughly bell-shaped, then

- about 68% of the values are within one SD of the mean
- about 95% are within 2 SDs of the mean
- about 99.7% are within 3 SDs of the mean

(This is sometimes called the *Empirical Rule*.)
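For an exactly normal distribution (an idealized bell shape, which real data only approximate), the three percentages can be computed with base R's `pnorm()`. The names `k` and `within` are illustrative.

```r
# Number of SDs on either side of the mean
k <- 1:3

# Proportion of a normal distribution within k SDs of its mean
within <- pnorm(k) - pnorm(-k)

# Express as percentages, rounded to one decimal place
print(round(100 * within, 1))
```

The results round to the 68%, 95%, and 99.7% figures quoted in the rule.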

The rule works well even if the data are somewhat skewed, but be careful!

```
require(manipulate)
EmpRuleGC()
```

A population has heights that have a bell-shaped distribution, with a mean of 70 inches and a standard deviation of 3 inches.

- About what percentage of people are taller than 73 inches?
- About what percentage are shorter than 64 inches?
- The percentage that are taller than 74 inches is less than \( U \)% and bigger than \( L \)%. Find \( U \) and \( L \).

Suggestion: try

```
require(manipulate)
EmpRuleGC(mean=70,sd=3)
```
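One way to check your answers without the interactive app is sketched below. The first part uses only Empirical Rule arithmetic; the second part assumes the heights are exactly normal, which is a stronger assumption than "bell-shaped". All variable names are illustrative.

```r
m <- 70  # mean height, in inches
s <- 3   # standard deviation, in inches

# Empirical Rule arithmetic: the area outside an interval is split
# evenly between the two tails
above_1sd <- (100 - 68) / 2  # percent more than 1 SD above the mean (73 in)
below_2sd <- (100 - 95) / 2  # percent more than 2 SDs below the mean (64 in)
print(c(above_1sd, below_2sd))

# 74 inches lies between 1 SD (73) and 2 SDs (76) above the mean, so the
# percentage taller than 74 must fall between the two tail areas above.
# Check under an exactly normal model:
taller_74 <- 100 * pnorm(74, mean = m, sd = s, lower.tail = FALSE)
print(taller_74)
```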

Suppose you want to compare an individual to a group. The \( z \)-score for the individual value \( x \) is:

\[ z=\frac{x-\bar{x}}{s}, \]

where

- \( \bar{x} \) is the mean of the group
- \( s \) is the standard deviation of the group

The \( z \)-score of \( x \) tells you how many SDs \( x \) is above or below the mean for the group. It measures how unusual \( x \) is.

Let's just agree to say that a value \( x \) is:

- surprisingly high if \( z>2 \)
- surprisingly low if \( z<-2 \)
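The formula and the "surprise" criterion can be wrapped in two small helper functions. Both `z_score()` and `classify()` are hypothetical helpers written for this sketch; they are not part of `mosaic` or `tigerstats`, and the example numbers are made up.

```r
# z-score of a value x, relative to a group with mean xbar and SD s
z_score <- function(x, xbar, s) {
  (x - xbar) / s
}

# Apply the agreed-upon criterion for "surprise"
classify <- function(z) {
  if (z > 2) {
    "surprisingly high"
  } else if (z < -2) {
    "surprisingly low"
  } else {
    "not surprising"
  }
}

# Example: a value of 80 in a group with mean 70 and SD 3
z <- z_score(80, xbar = 70, s = 3)
print(z)
print(classify(z))
```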

**Question:** Suppose that Linda is 72 inches tall. How does she compare with the other GC students in the `m111survey` data?

Solution: Get her \( z \)-score.

```
favstats(~height,data=m111survey)
```

```
mean sd
67.99 5.296
```

The mean is about 67.987 inches, and the SD is about 5.296 inches. Hence Linda's \( z \)-score is:

```
(72-67.987)/5.296
```

```
[1] 0.7577
```

The \( z \)-score is about 0.76, which means that Linda is only about three-fourths of a standard deviation above the mean height.

She is taller than average, but not unusually tall.

**Question**: Is Linda unusually tall, *for a female*?

**Solution**:

```
favstats(height~sex,data=m111survey)
```

```
.group mean sd
1 female 64.94 4.622
2 male 71.92 3.049
```

Linda's \( z \)-score relative to the *females* is:

```
(72-64.94)/4.622
```

```
[1] 1.527
```

So Linda is about 1.53 SDs above the mean female height.

That's more impressive, but still not terribly unusual.
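The same comparison can be made for every individual at once. The sketch below uses base R's `ave()` to compute \( z \)-scores within each group; the data frame `toy`, its values, and the helper `standardize()` are hypothetical and not drawn from `m111survey`.

```r
# Hypothetical toy data
toy <- data.frame(
  sex    = factor(c("female", "female", "female", "male", "male", "male")),
  height = c(62, 65, 68, 69, 72, 75)
)

# Convert a vector of values to z-scores
standardize <- function(v) (v - mean(v)) / sd(v)

# z-scores relative to everyone
toy$z_overall <- standardize(toy$height)

# z-scores relative to each person's own group
toy$z_within <- ave(toy$height, toy$sex, FUN = standardize)

print(toy)
```

In this toy data the 68-inch female is slightly below the overall mean but one SD above the female mean, so the reference group changes how unusual a value looks.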

- \( z \)-scores and our criterion for “surprise” make the most sense when the distribution is roughly bell-shaped.
- But even if the distribution is a bit skewed, they are still useful.