Rebekah Robinson and Homer White, Georgetown College

Always remember to make sure the necessary packages are loaded:

```
require(mosaic)
require(tigerstats)
```

The *population* is the set of all items of interest.

The *sample* is the subset of the population for which we have data.

A *parameter* is a number that you could compute if you knew the entire population.

Suppose the population is **all adults**.

Examples of parameters:

- The mean height \( \mu \) of the population
- The standard deviation \( \sigma \) of the heights
- The proportion \( p \) of the population that plays tennis

A *statistic* is a number that you can compute from your sample data.

Suppose that we have a sample from the population of all adults. Examples of statistics:

- The sample mean \( \bar{x} \) of the heights
- The sample standard deviation \( s \) of the heights
- The proportion \( \hat{p} \) of the sample that plays tennis
  - Note: \( \hat{p} \) is the number \( X \) in the sample who play tennis, divided by the sample size \( n \).

- The parameters are fixed, but we don't know them.
- The statistics depend on the sample.
- We use statistics to estimate parameters. That is, we hope that:
  - \( \bar{x} \approx \mu \)
  - \( s \approx \sigma \)
  - \( \hat{p} \approx p \)
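The relationship between parameters and statistics can be seen in a small simulation. This is only a sketch: the population below is made up (the height distribution and the tennis-playing rate are assumptions, not real data), and names such as `population` and `samp` are illustrative.

```r
# Hypothetical population of 10,000 adults (simulated values, not real data)
set.seed(2024)
N <- 10000
population <- data.frame(
  height = rnorm(N, mean = 67, sd = 4),      # heights in inches (assumed)
  tennis = rbinom(N, size = 1, prob = 0.15)  # 1 = plays tennis (assumed rate)
)

# Parameters: computable only because we "know" the whole population
mu    <- mean(population$height)
sigma <- sd(population$height)   # sd() uses the N - 1 divisor; negligible here
p     <- mean(population$tennis)

# Statistics: computed from a simple random sample of size n
n    <- 100
samp <- population[sample(N, n), ]
xbar <- mean(samp$height)
s    <- sd(samp$height)
phat <- sum(samp$tennis) / n     # X / n

round(c(mu = mu, xbar = xbar, sigma = sigma, s = s, p = p, phat = phat), 3)
```

Re-running the sampling lines with a different seed changes `xbar`, `s`, and `phat`, while `mu`, `sigma`, and `p` stay put.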

For the approximations to be good, the sample should be *representative* of the population.

So we should employ methods of sampling for which the sample is likely to be representative of the population.

How do we get a representative sample?

Should we let potential subjects choose whether or not to be in the sample?

This is called a *volunteer* sample.

**Example**: When conducting an opinion survey on food in the Cafe, you leave forms at the entrance for people to fill out.

The volunteers might differ from the general population in some important way.

In our example, the students who take the time to fill out the survey might have stronger opinions (one way or another) than those who don't bother.
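A rough simulation can make this concern concrete. Everything here is hypothetical: the opinion scores, their frequencies, and the response probabilities are assumptions chosen only to illustrate the mechanism.

```r
set.seed(101)
N <- 5000
# Hypothetical opinion scores: 1 = hates the food, 3 = neutral, 5 = loves it
opinion <- sample(1:5, N, replace = TRUE, prob = c(0.1, 0.2, 0.4, 0.2, 0.1))

# Assumption: the stronger the opinion, the more likely a student fills out a form
respond_prob <- c(0.50, 0.15, 0.05, 0.15, 0.50)[opinion]
volunteered  <- runif(N) < respond_prob

# Share of students with a strong opinion (score 1 or 5)
strong_pop <- mean(opinion %in% c(1, 5))
strong_vol <- mean(opinion[volunteered] %in% c(1, 5))
round(c(population = strong_pop, volunteers = strong_vol), 2)
```

Under these assumptions, strong opinions are heavily over-represented among the volunteers compared with the population as a whole.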

Should the researchers decide who gets into the sample?

**Example**: *Quota* sampling in the 1948 U.S. presidential election.

When researchers use their own judgement to decide on the sample, they could (intentionally or unintentionally) choose an unrepresentative sample.

In the 1948 quota sampling, pollsters ended up interviewing “approachable” folks, who turned out to be wealthier than average, thus biasing poll results toward the Republican candidate.

- if you can't let subjects decide whether to be in the sample,
- and you can't let researchers decide who should be in the sample …

… then what should decide who gets into the sample?

*Let chance decide who gets into the sample!*

We should use some form of *random sampling*. That is, we should use chance in a controlled, quantifiable way.

There are many types of random sampling. The one we will think about the most is *simple random sampling.*

Suppose you are planning to take a sample of size \( n \) from a population. If you take the sample so that

every set of \( n \) subjects in the population has the same chance to be the sample selected

then you are doing *simple random sampling* (SRS).

Simple random sampling is like having a box full of tickets, one for each member of the population:

- You randomly pick out one ticket …
- and set it aside …
- then randomly pick out another ticket
- and set it aside …
- … and so on until you have drawn \( n \) tickets.

(You draw \( n \) tickets at random from the box, **without** replacement.)
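In base R, drawing tickets without replacement can be sketched with `sample()`. The box of 20 numbered tickets is a made-up population; the second half checks the SRS definition on a tiny case by brute-force simulation.

```r
set.seed(42)
box  <- 1:20   # tickets numbered 1..20 (a made-up population)
n    <- 5
draw <- sample(box, size = n, replace = FALSE)   # n tickets, without replacement
draw

# Check the definition on a tiny case: population of 4, samples of size 2.
# There are choose(4, 2) = 6 possible samples; each should turn up
# about 1/6 of the time.
reps   <- 6000
picked <- replicate(reps, paste(sort(sample(1:4, 2)), collapse = "-"))
round(table(picked) / reps, 3)
```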

Simple random sampling tends to produce a representative sample, especially when the sample size \( n \) is large.
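The effect of sample size can be checked by simulation. The sketch below uses a made-up population (not `imagpop`) and measures how much the sample proportion of females varies from one simple random sample to the next, for three sample sizes.

```r
set.seed(7)
# Hypothetical population: 10,000 people, about half of them female
pop_sex <- sample(c("female", "male"), 10000, replace = TRUE)

# For a given n, draw many simple random samples and record each
# sample's proportion of females
phat_for <- function(n, reps = 2000) {
  replicate(reps, mean(sample(pop_sex, n) == "female"))
}

# Standard deviation of the sample proportion at each sample size
spread <- sapply(c(10, 100, 1000), function(n) sd(phat_for(n)))
names(spread) <- c("n=10", "n=100", "n=1000")
round(spread, 3)
```

The spread shrinks as \( n \) grows, so a large simple random sample is much less likely to land far from the population proportion.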

Try this app:

```
require(manipulate)
SimpleRandom()
```

Distribution of **sex** in `imagpop`:

```
rowPerc(xtabs(~sex, data=imagpop))
```

```
female male Total
49.68 50.32 100.00
```

Take a simple random sample of size \( n=10 \):

```
popsamp(imagpop,n=10)
```

Try several times:

```
mysample <- popsamp(imagpop,n=10)
rowPerc(xtabs(~sex,data=mysample))
```

Are you always pleased with the results?

To get the sample “right” (at least with respect to a few variables):

- break the population into homogeneous groups called *strata*
- use SRS to sample a set number from each stratum

A small, imaginary population:

```
data(FakeSchool)
View(FakeSchool)
```

Say you only have time to sample 7 of these 28 students. You plan to ask them questions about academic life, so you want the sample to exactly resemble the population with respect to the variable **Honors**.

You construct two strata:

**First Stratum**: All the Honors Students

```
honors <- subset(FakeSchool,Honors=="Yes")
honors
```

**Second Stratum**: All the non-Honors Students

```
nonhonors <- subset(FakeSchool,Honors=="No")
nonhonors
```

In the population, there are:

- 12 Honors students
- 16 non-Honors students

So in the sample of size 7 you want:

- 3 Honors students
- 4 non-Honors students

because

\[ \frac{3}{12}=\frac{4}{16}. \]
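The arithmetic behind these counts is proportional allocation: each stratum contributes to the sample in proportion to its share of the population. A minimal sketch, with illustrative variable names:

```r
stratum_sizes <- c(Honors = 12, NonHonors = 16)   # counts in the population
n <- 7                                            # total sample size

# Proportional allocation: n times each stratum's share of the population
alloc <- n * stratum_sizes / sum(stratum_sizes)
alloc

# The sampling fraction is the same in every stratum
alloc / stratum_sizes
```

Here the allocations come out as whole numbers. In general they will not, and some rounding is needed.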

Sample the three Honors students by SRS:

```
set.seed(1837)
popsamp(honors,n=3)
```

```
   Students Sex class GPA Honors
9     Betsy   F    So 4.0    Yes
11    Dylan   M    So 3.5    Yes
8    Andrea   F    So 4.0    Yes
```

Sample the four non-Honors students by SRS:

```
set.seed(17365)
popsamp(nonhonors,n=4)
```

```
   Students Sex class  GPA Honors
25    Diana   F    Sr 2.90     No
13     Eric   M    So 2.10     No
14  Gabriel   M    So 1.98     No
28    Grace   F    Sr 1.40     No
```

Combine the two samples to get your stratified sample!
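One way to do the whole procedure in base R is sketched below. It does not use `FakeSchool`: the `school` data frame is a stand-in with the same Honors counts, and `srs_rows()` is a hypothetical helper, not a `tigerstats` function.

```r
set.seed(3)
# Stand-in for FakeSchool: 28 made-up students, 12 Honors and 16 non-Honors
school <- data.frame(
  id     = 1:28,
  Honors = rep(c("Yes", "No"), times = c(12, 16))
)

# Hypothetical helper: simple random sample of n rows from a data frame
srs_rows <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]

honors_samp    <- srs_rows(subset(school, Honors == "Yes"), 3)
nonhonors_samp <- srs_rows(subset(school, Honors == "No"), 4)

# Stack the two stratum samples into one stratified sample
strat_samp <- rbind(honors_samp, nonhonors_samp)
table(strat_samp$Honors)
```

With the objects above, `rbind()` on the two `popsamp()` results should work the same way, assuming `popsamp()` returns ordinary data frames as its printed output suggests.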

Stratified sampling is more accurate than SRS, with respect to the variables that determine the strata.

(This advantage is most evident at small sample sizes.)
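For the school example, the comparison can be made exact. Under SRS of 7 students from 28 (12 Honors, 16 non-Honors), the number of Honors students in the sample follows a hypergeometric distribution, which base R provides as `dhyper()`. The stratified design always yields exactly 3.

```r
# Distribution of the number of Honors students in an SRS of size 7
# m = Honors students, n = non-Honors students, k = sample size
counts <- 0:7
prob   <- dhyper(counts, m = 12, n = 16, k = 7)
round(setNames(prob, counts), 3)

# Chance that an SRS matches the population split exactly (3 Honors, 4 non-Honors)
p_exact <- dhyper(3, m = 12, n = 16, k = 7)
p_exact
```

An SRS of this size hits the 3-to-4 split only about one time in three, whereas the stratified sample hits it every time.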

You have to know the population distribution of the variables that determine the strata.

To actually take an SRS, you would have to be able to identify every subject in the population, **before** you take the sample.

If you are on the server, run:

```
source("/mat111/Additional_R_Functions/SquareSamp.R")
```

Then take a simple random sample of 10 dots out of 2500 dots equally spaced on a rectangle:

```
require(shape)
SquareSamp()
```
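If you are not on the server, a rough stand-in can be written in base R. This is not the `SquareSamp()` implementation, only a sketch of the same idea, assuming a 50-by-50 grid of dots.

```r
set.seed(11)
grid   <- expand.grid(x = 1:50, y = 1:50)   # 50 x 50 = 2500 equally spaced dots
chosen <- grid[sample(nrow(grid), 10), ]    # SRS of 10 dots

# Plot all dots faintly, then highlight the sampled ones
plot(grid, pch = ".", asp = 1, main = "SRS of 10 dots out of 2500")
points(chosen, pch = 19, col = "red")
```

The selected dots are typically scattered all over the rectangle, which hints at how spread out the subjects of a simple random sample can be.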

One way to get around these problems:

- Divide the population into disjoint subsets (*clusters*), each of which is representative of the population.
- Sample a few of the clusters by SRS.
- Then contact each subject in each one of the selected clusters.

This is called *cluster sampling*.
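The procedure can be sketched in base R. The population here is hypothetical: 40 clusters (think of city blocks) of 25 subjects each, with illustrative column names.

```r
set.seed(5)
# Hypothetical population: 40 clusters of 25 subjects each
pop <- data.frame(
  subject = 1:1000,
  cluster = rep(1:40, each = 25)
)

# Step 1: take an SRS of clusters (only the list of clusters is needed)
chosen_clusters <- sample(unique(pop$cluster), 4)

# Step 2: include every subject in each chosen cluster
cluster_samp <- pop[pop$cluster %in% chosen_clusters, ]
nrow(cluster_samp)
```

Note that the random draw in step 1 works from the list of 40 clusters, not from the list of 1000 subjects.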

- Easy to get to your selected subjects.
- You only have to be able to identify the clusters—not everyone in the population.

Clusters are seldom exactly like the population, so cluster samples can be quite variable.