Homer White, Georgetown College

Always remember to make sure the necessary packages are loaded:

```
require(mosaic)
require(tigerstats)
```

The estimators for the Basic Five Parameters are all random variables, so they all have a center (EV) and a spread (SD).

Estimator | Center | Spread |
---|---|---|
\( \bar{x} \) | \( \mu \) | \( \frac{\sigma}{\sqrt{n}} \) |
\( \bar{x}_1-\bar{x}_2 \) | \( \mu_1-\mu_2 \) | \( \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \) |
\( \hat{p} \) | \( p \) | \( \sqrt{\frac{p(1-p)}{n}} \) |
\( \hat{p}_1-\hat{p}_2 \) | \( p_1-p_2 \) | \( \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}} \) |
\( \bar{d} \) | \( \mu_d \) | \( \frac{\sigma_d}{\sqrt{n}} \) |
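The spread formulas translate directly into R. Here is a minimal sketch in base R. The helper names (`sd_xbar`, `sd_diff_means`, `sd_phat`, `sd_diff_props`) are made up for illustration and are not part of `mosaic` or `tigerstats`, and the input values are arbitrary.

```r
# Hypothetical helpers (not from mosaic or tigerstats): one per spread formula.
sd_xbar <- function(sigma, n) sigma / sqrt(n)
sd_diff_means <- function(sigma1, n1, sigma2, n2) sqrt(sigma1^2 / n1 + sigma2^2 / n2)
sd_phat <- function(p, n) sqrt(p * (1 - p) / n)
sd_diff_props <- function(p1, n1, p2, n2) sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Illustrative inputs only
sd_xbar(sigma = 6, n = 36)     # 6 / 6 = 1
sd_phat(p = 0.5, n = 100)      # sqrt(0.25 / 100) = 0.05
```

For \( \bar{d} \), `sd_xbar()` can be reused with \( \sigma_d \) in place of \( \sigma \).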

There are *three* SDs. You have to keep them straight:

- \( \sigma \) is the SD of a numerical population
  - says how much a typical population value differs from the population mean \( \mu \)
- \( s \) is the SD of a numerical sample
  - says how much a typical sample value differs from the mean \( \bar{x} \) of the sample
  - can be used to estimate \( \sigma \)
- \( SD(\bar{x}) \) is the SD of \( \bar{x} \) *as a random variable*
  - says how much \( \bar{x} \) is liable to differ from \( EV(\bar{x}) \), in repeated sampling
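To see the three SDs side by side, here is a small simulation sketch in base R. The population is a made-up, right-skewed set of 10,000 values (not `imagpop`), and the seed, sample size and number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility

# Hypothetical right-skewed population of 10,000 values (not imagpop)
population <- rexp(10000, rate = 1 / 50)
N <- length(population)
# sd() divides by N - 1; rescale to get the population SD
sigma <- sd(population) * sqrt((N - 1) / N)

n <- 25
one_sample <- sample(population, n)
s <- sd(one_sample)                 # SD of one sample: an estimate of sigma

# Repeated sampling: 5,000 sample means, each from a fresh sample of size n
xbars <- replicate(5000, mean(sample(population, n)))
sd_xbar_sim <- sd(xbars)            # simulated SD of x-bar
sd_xbar_formula <- sigma / sqrt(n)  # SD of x-bar from the formula

c(sigma = sigma, s = s,
  sd_xbar_sim = sd_xbar_sim, sd_xbar_formula = sd_xbar_formula)
```

The last two numbers should come out close to each other, and both should be much smaller than \( \sigma \) and \( s \).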

But what is the *shape* of an estimator?

To investigate the shape of the distribution of \( \bar{x} \), try:

```
require(manipulate)
MeanSampler(~income,data=imagpop)
```

Suggestions:

- first try with \( n=1 \). Watch the density curve build up.
- then with \( n=2 \)
- then with higher \( n \), maybe \( n=11 \)
- try with another numerical variable in `imagpop`, maybe **height**

To investigate the shape of the distribution of \( \hat{p} \), try:

```
require(manipulate)
PropSampler(~cappun,data=imagpop)
```

Suggestions:

- first try with \( n=10 \). Watch the density curve build up.
- then with \( n=30 \)
- then with higher \( n \), maybe \( n=100 \)
- try with another two-value factor variable in `imagpop`, maybe **math**

To investigate the shape of the distribution of \( \bar{x}_1-\bar{x}_2 \), try:

```
require(manipulate)
SampDist2Means(imagpop)
```

Suggestions:

- first set `numver` to **income**, `facvar` to **sex**. Try for various sample sizes.
- then set `numver` to **height**, `facvar` to **sex**. Try for various sample sizes.

As sample sizes increase, the shape of the distribution of the estimator looks more and more bell-shaped.

**No matter what the underlying population looks like!!**

(But the more skewed the population is, the bigger the sample size must be before the estimator starts looking bell-shaped.)

No matter how the population is distributed, as sample size \( n \) increases, the distribution of \( \bar{x} \) gets closer and closer to:

\[ norm(\mu,\frac{\sigma}{\sqrt{n}}). \]

No matter how the population is distributed, as sample size \( n \) increases, the distribution of

\[ Z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \]

gets closer and closer to:

\[ norm(0,1), \]

the *standard normal* distribution.
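The claim about \( Z \) can be checked with a quick simulation. Here is a sketch in base R, assuming an exponential population with rate 1 (so \( \mu = 1 \) and \( \sigma = 1 \), strongly right-skewed). It draws independent values from that theoretical distribution instead of sampling from `imagpop`, and the seed and sizes are arbitrary.

```r
set.seed(2)  # arbitrary seed, for reproducibility

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1
mu <- 1
sigma <- 1
n <- 50
reps <- 10000

# Each replication: draw n values, compute x-bar, then standardize it
z <- replicate(reps, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

mean(z)            # should be near 0
sd(z)              # should be near 1
mean(abs(z) < 2)   # should be near pnorm(2) - pnorm(-2), about 0.954
```

Running `hist(z)` afterwards gives a picture of the shape. Rerunning with a smaller `n` shows more of the population's skewness surviving in \( Z \).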

The Central Limit Theorem also applies to the other four of the Basic Five estimators!

As long as sample sizes are “big enough”, their shape will be approximately normal!

Rules of Thumb:

Estimator | Big Enough |
---|---|
\( \bar{x} \) | \( n \ge 30 \) |
\( \bar{x}_1-\bar{x}_2 \) | \( n_1 \ge 30,n_2 \ge 30 \) |
\( \hat{p} \) | \( np \ge 10,n(1-p) \ge 10 \) |
\( \hat{p}_1-\hat{p}_2 \) | all of \( n_1p_1,n_1(1-p_1),n_2p_2,n_2(1-p_2) \ge 10 \) |
\( \bar{d} \) | \( n \ge 30 \) |
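The rules of thumb are easy to encode as quick checks. The following is a sketch in base R. The function names `big_enough_mean` and `big_enough_prop` are hypothetical, and the inputs are illustrative only.

```r
# Hypothetical helpers encoding the rules of thumb for x-bar (or d-bar) and p-hat
big_enough_mean <- function(n) n >= 30
big_enough_prop <- function(n, p) n * p >= 10 && n * (1 - p) >= 10

big_enough_mean(25)                  # FALSE: 25 < 30
big_enough_prop(n = 100, p = 0.5)    # TRUE: np = 50 and n(1-p) = 50
big_enough_prop(n = 100, p = 0.05)   # FALSE: np = 5
```

For the two-sample estimators, apply the matching check to each sample and require both to pass.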

Imagine that you are a powerful being (a Greek god/goddess, maybe).

- You know everything about the present
- so you have complete information on all populations
- so you can instantly find any population parameter you like
- but you do NOT know the future

You see a poor mortal, a statistician, about to take a simple random sample of size \( n \) from a population.

The following is all true:

- You know population mean \( \mu \),
- you know population SD \( \sigma \),
- and you know that \( \bar{x} \approx norm(\mu,\sigma/\sqrt{n}). \)
- You don't know what \( \bar{x} \) will be,
- but you can compute probabilities for it to lie in various ranges.

A statistician is about to take a SRS of size \( n=25 \) from `imagpop` and compute \( \bar{x} \), the sample mean of the heights of the 25 selected individuals.

What is the probability that the sample mean will exceed 68.3 inches?

In other words, what is:

\[ P(\bar{x} > 68.3)? \]

First of all, we use our god-like powers (and R!) to find:

- mean \( \mu \) and
- standard deviation \( \sigma \)

of the heights in the population:

```
favstats(~height,data=imagpop)[c("mean","sd")]
```

```
mean sd
67.53 3.907
```

So we know that:

- \( EV(\bar{x}) = 67.53 \), and
- \( SD(\bar{x}) = 3.907/\sqrt{25}=0.7814 \).

We know the center and the spread of \( \bar{x} \)!

How about the shape of \( \bar{x} \)? The sample size is \( n=25 \), a bit less than the “cut-off” of 30 where the CLT kicks in.

But if heights in `imagpop` are approximately normal, we don't need a big \( n \) to assure us that \( \bar{x} \) will be approximately normal. So, use god-like powers to make a density plot of the population:

```
densityplot(~height,data=imagpop,
            xlab="Height (inches)",
            main="Imagpop Heights",
            plot.points=FALSE)
```

Population looks fairly bell-shaped!

So even though \( n=25 < 30 \), we still figure that

\[ \bar{x} \approx norm(67.53,0.7814). \]

So for \( P(\bar{x} > 68.3) \), go for:

```
pnormGC(bound=68.3,region="above",
        mean=67.53,sd=0.7814,
        graph=TRUE)
```

```
[1] 0.1622
```

So \( P(\bar{x} > 68.3) \approx 16.22\% \)
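As a cross-check that does not need `tigerstats`, the same tail area can be computed with base R's `pnorm()`. The sketch below just plugs in the population values reported above. The variable names are illustrative.

```r
mu <- 67.53       # population mean height, from the favstats output
sigma <- 3.907    # population SD, from the same output
n <- 25

sd_xbar <- sigma / sqrt(n)   # 0.7814
p_above <- pnorm(68.3, mean = mu, sd = sd_xbar, lower.tail = FALSE)
round(p_above, 4)            # 0.1622
```

Setting `lower.tail = FALSE` asks for the area above the bound, which plays the same role as `region="above"` in `pnormGC`.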

A statistician plans to take a SRS of 30 males from the population of all males in `imagpop`, and an independent SRS of 40 females from the population of all females in `imagpop`. She will then compute \( \bar{x}_1-\bar{x}_2 \), the sample mean height of the males minus the sample mean height of the females.

Approximately what is the chance that the difference of sample means will be between 4 and 6 inches?

This time the sample sizes are:

- \( n_1=30 \), and
- \( n_2=40 \).

Both sample sizes are \( \ge 30 \), so the Central Limit Theorem says \( \bar{x}_1-\bar{x}_2 \) is approximately normal.

Next, compute the EV and SD of \( \bar{x}_1-\bar{x}_2 \). For this we will need the means and standard deviations of the heights for:

- all males in `imagpop`, and
- all females in `imagpop`.

Hence we ask for:

```
favstats(height~sex,
         data=imagpop)[c(".group","mean","sd")]
```

```
.group mean sd
1 female 65.00 2.962
2 male 70.03 3.013
```

So the EV of \( \bar{x}_1-\bar{x}_2 \) is

\[ 70.03-65=5.03 \]

inches, and \( SD(\bar{x}_1-\bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}, \)

```
sqrt(3.013^2/30+2.962^2/40)
```

```
[1] 0.7225
```

So:

\[ \bar{x}_1-\bar{x}_2 \approx norm(5.03,0.722). \]

Now we can get the desired probability:

```
pnormGC(c(4,6),region="between",
        mean=5.03,sd=0.722)
```

```
[1] 0.8336
```

So there is about an 83.36% chance that \( \bar{x}_1-\bar{x}_2 \) will turn out to be between 4 and 6 inches.

In other words:

There is about an 83.36% chance that the mean for the sample guys will be between 4 and 6 inches higher than the mean for the sample gals.
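The whole calculation can also be done in one go with base R. This sketch plugs in the population values from the `favstats` output above. It keeps the SD unrounded, so the answer may differ slightly in the later decimal places from the one computed with 0.722. The variable names are illustrative.

```r
# Population values by sex, from the favstats output
mu_m <- 70.03; sigma_m <- 3.013; n_m <- 30   # males
mu_f <- 65.00; sigma_f <- 2.962; n_f <- 40   # females

ev_diff <- mu_m - mu_f                               # 5.03
sd_diff <- sqrt(sigma_m^2 / n_m + sigma_f^2 / n_f)   # about 0.7225

# P(4 < difference < 6) as a difference of two lower-tail areas
p_between <- pnorm(6, ev_diff, sd_diff) - pnorm(4, ev_diff, sd_diff)
p_between
```

A "between" probability is the area below the upper bound minus the area below the lower bound.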

Consult the Course Notes for several more examples!

If the distribution of a numerical variable is roughly bell-shaped, then

- about 68% of the values are within one SD of the mean
- about 95% are within 2 SDs of the mean
- about 99.7% are within 3 SDs of the mean

(This is sometimes called the *Empirical Rule*.)

If the probability distribution of a random variable \( X \) is approximately normal, then

- there is about a 68% chance that \( X \) will turn out to be within one SD of its EV
- there is about a 95% chance that \( X \) will turn out to be within two SDs of its EV
- there is about a 99.7% chance that \( X \) will turn out to be within three SDs of its EV
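The three percentages are rounded values of standard normal areas, which can be computed with base R's `pnorm()`. A short sketch:

```r
k <- 1:3                           # number of SDs from the EV
within_k <- pnorm(k) - pnorm(-k)   # P(-k < Z < k) for standard normal Z
round(within_k, 4)                 # 0.6827 0.9545 0.9973
```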

You are about to take a simple random sample from a population.

- The mean \( \mu \) is 50.
- The standard deviation \( \sigma \) of the population is 6.
- The size of your sample will be \( n=36 \).

Approximately what is

\[ P(\bar{x} > 53)? \]

- \( EV(\bar{x})=\mu=50 \)
- \( SD(\bar{x})=\sigma/\sqrt{n}=6/\sqrt{36}=1 \)
- \( n=36 \ge 30 \), so by CLT \( \bar{x} \approx norm(50,1) \)

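The value 53 is three SDs above the EV. There is about a 99.7% chance that \( \bar{x} \) lands within three SDs of its EV, so about 0.3% of the probability lies outside that range, half of it in the upper tail. So,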

\[ P(\bar{x} > 53) \approx 0.15\%. \]
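The 0.15% figure is half of the roughly 0.3% that lies more than three SDs from the EV. The sketch below compares it with the normal tail area from base R's `pnorm()`. The variable names are illustrative.

```r
ev <- 50
sd_xbar <- 6 / sqrt(36)   # = 1

rule_estimate <- (1 - 0.997) / 2   # 0.0015, i.e. 0.15%
exact_normal <- pnorm(53, mean = ev, sd = sd_xbar, lower.tail = FALSE)
round(exact_normal, 5)             # 0.00135
```

The two answers agree to within about 0.02 percentage points. The small gap comes from rounding in the 99.7% figure.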

Use the SD for the random variable you are working with, NOT the SD of the population.

\[ SD(\bar{x})=\frac{\sigma}{\sqrt{n}}=\frac{6}{\sqrt{36}}=1, \]

\[ SD(\bar{x}) \neq \sigma=6 \]
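To see how much this mistake matters, here is a sketch in base R that computes \( P(\bar{x} > 53) \) both ways. The names `right` and `wrong` are just labels for this illustration.

```r
# Correct: use SD(x-bar) = sigma / sqrt(n) = 1
right <- pnorm(53, mean = 50, sd = 6 / sqrt(36), lower.tail = FALSE)
# Mistake: use the population SD sigma = 6
wrong <- pnorm(53, mean = 50, sd = 6, lower.tail = FALSE)

round(c(right = right, wrong = wrong), 4)   # 0.0013 0.3085
```

Using \( \sigma \) in place of \( SD(\bar{x}) \) turns a rare event into one that appears to happen almost a third of the time.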