Homer White, Georgetown College

- Data Basics
- Describing the Distribution of a Variable
- Exploring Relationships Between Variables
- Numerical Measures
- Graphical Tools

- Types of Variables
- Research Questions
- Distribution of One Factor Variable
- Relationship Between Two Factor Variables
- Distribution of One Numerical Variable
- Numerical Measures of Center and Spread

Always remember to make sure the necessary packages are loaded:

```
require(mosaic)
require(tigerstats)
```

(1) Put the data into your Global Environment:

```
data(m111survey)
```

(2) View it:

```
View(m111survey)
```

(3) Learn more about it:

```
help(m111survey)
?m111survey #same thing, less typing!
```

Here is part of `m111survey`:

```
  height ideal_ht fastest     seat    sex
1   76.0       78     119  1_front   male
2   74.0       76     110 2_middle   male
3   64.0       NA      85 2_middle female
4   62.0       65     100  1_front female
5   72.0       72      95   3_back   male
6   70.8       NA     100  1_front   male
```

In a data frame:

- rows are *observations* (individuals)
- columns are *variables*

**Factor** (Categorical)

- values (*levels*) are not numbers: “male”, “female”, …
- *ordinal factors*: levels come in an order: “front”, “middle”, “back”, …

**Numerical** (Quantitative)

- *Double* (Continuous): values are real numbers 4.37, 2.58, …
- *Integer* (Discrete): values are whole numbers 1, 4, 2, 2, …

```
str(m111survey)
```

```
'data.frame': 71 obs. of 12 variables:
$ height : num 76 74 64 62 72 70.8 70 79 59 67 ...
$ ideal_ht : num 78 76 NA 65 72 NA 72 76 61 67 ...
$ sleep : num 9.5 7 9 7 8 10 4 6 7 7 ...
$ fastest : int 119 110 85 100 95 100 85 160 90 90 ...
$ weight_feel : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
$ love_first : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ extra_life : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
$ seat : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
$ GPA : num 3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
$ enough_Sleep : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
$ sex : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
$ diff.ideal.act.: num 2 2 NA 3 0 NA 2 -3 2 0 ...
```
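The variable types listed by `str()` can also be checked one variable at a time. The sketch below uses a small made-up data frame (the name `toy` and its values are assumptions for illustration, not part of `m111survey`) and base R functions only:

```r
# Hypothetical mini data frame (made-up values, not taken from m111survey)
toy <- data.frame(
  height  = c(70.5, 64.0, 68.2),                        # double (continuous)
  fastest = c(110L, 85L, 100L),                         # integer (discrete)
  seat    = factor(c("1_front", "3_back", "2_middle"))  # factor (categorical)
)

# Report the class of every column
var_types <- sapply(toy, class)
print(var_types)

# A factor stores its possible values as levels (sorted alphabetically here)
seat_levels <- levels(toy$seat)
print(seat_levels)

# Checks on individual variables
is.factor(toy$seat)      # TRUE
is.numeric(toy$fastest)  # TRUE: integers count as numerical
is.double(toy$height)    # TRUE
```

The same checks should work on the columns of `m111survey` once the data are loaded, for example `is.factor(m111survey$sex)`.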

*Descriptive Statistics* is the art of summarizing data and describing patterns in the data.

**Graphical Devices**

- barcharts
- histograms
- density plots
- boxplots
- and many more …

**Numerical Measurements**

- mean and standard deviation
- median and interquartile range
- quantiles
- and many more …

The choice of tools will depend on the type of variables involved in your *Research Question*.

- “Are a majority of students female?”
  - variable: **sex** (factor)
- “Who is more likely to prefer to sit in the front: a guy or a gal?”
  - variables: **sex** (factor) and **seat** (factor)
- “Who drives faster: students who prefer the front, the middle, or the back?”
  - variables: **seat** (factor) and **fastest** (numerical)
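One way to picture how the variable type drives the choice of tool is a small helper that branches on the type. This is only a sketch: `describe_variable` and the toy vectors are hypothetical names and values introduced here, and the function uses base R rather than the `tigerstats` functions.

```r
# Hypothetical helper: pick a numerical summary based on the variable's type
describe_variable <- function(x) {
  if (is.factor(x)) {
    # factor: percentage of observations in each level
    round(100 * prop.table(table(x)), 2)
  } else if (is.numeric(x)) {
    # numerical: a measure of center and a measure of spread
    c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
  } else {
    stop("unsupported variable type")
  }
}

# Made-up data for illustration
sex_toy   <- factor(c("female", "male", "female", "female"))
speed_toy <- c(90, 100, 110)

factor_summary  <- describe_variable(sex_toy)    # percentages by level
numeric_summary <- describe_variable(speed_toy)  # mean and sd
print(factor_summary)
print(numeric_summary)
```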

**Research Question**:

- “What percentage of students in the survey are female?”

Variable involved is **sex** (factor).

Tally the sexes (a table of counts):

```
xtabs(~sex,data=m111survey)
```

```
sex
female male
40 31
```

Get percentages:

```
rowPerc(xtabs(~sex,data=m111survey))
```

```
female male Total
56.34 43.66 100.00
```
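`rowPerc()` comes from the `tigerstats` package. As a rough base R equivalent, the sketch below rebuilds the **sex** variable from the counts shown above (40 female, 31 male) and converts the table of counts to percentages with `prop.table()`:

```r
# Rebuild the sex variable from the counts in the table above
sex <- factor(rep(c("female", "male"), times = c(40, 31)))

# Table of counts, then percentages rounded to two decimals
counts   <- table(sex)
percents <- round(100 * prop.table(counts), 2)
print(percents)
```

The percentages agree with the `rowPerc()` output: 56.34 for female and 43.66 for male.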

```
barchartGC(~sex,data=m111survey,
type="percent",
main="Distribution of Sex")
```

**Research Question**: “Who is more likely to sit in the front: a guy or a gal?”

- This question is about the *relationship* between two variables.
- Variables involved are:
  - **sex** (factor). This is the *explanatory* variable.
  - **seat** (factor). This is the *response* variable.
- The explanatory variable is the variable that:
  - we think might help *cause* the response, or …
  - that we might intend to use to *predict* the response.
- Often we don't use the explanatory/response distinction.

A two-way table of counts is also called a *cross table* or *contingency table*.

```
xtabs(~sex+seat,data=m111survey)
```

```
        seat
sex      1_front 2_middle 3_back
  female      19       16      5
  male         8       16      7
```

**But**: counts don't answer the Research Question. (There are more gals than guys in the first place!)

To check for a relationship between two factor variables, see if the *conditional distributions* differ. For the conditional distribution of **seat** given each value of **sex**, compute row percents:

```
sexseat <- xtabs(~sex+seat,data=m111survey)
rowPerc(sexseat)
```

```
       1_front 2_middle 3_back Total
female   47.50    40.00  12.50   100
male     25.81    51.61  22.58   100
```

Females are more likely to prefer the front (47.5%, vs. 25.81% for the guys).
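The row percents can be reproduced in base R from the two-way table of counts. The sketch below types in the counts shown earlier as a matrix (the object names are assumptions) and uses `prop.table()` with `margin = 1`, which divides each count by its row total:

```r
# Two-way table of counts, entered from the xtabs() output above
sexseat_counts <- matrix(
  c(19, 16, 5,
     8, 16, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex  = c("female", "male"),
                  seat = c("1_front", "2_middle", "3_back"))
)

# margin = 1 means "within each row": the distribution of seat given sex
row_pct <- round(100 * prop.table(sexseat_counts, margin = 1), 2)
print(row_pct)
```

Each row sums to 100, so the two rows can be compared directly even though there are more females than males.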

```
barchartGC(~sex+seat,data=m111survey,
type="percent",
main="Seating Preference by Sex")
```

Note the “formula-data” input:

- **formula** is ~explanatory+response
- **data** gives the data frame containing the variables

- Females tend to prefer the front more than guys do.
- Guys tend to prefer the back more than gals do.

(Numerical Tools)

**Research Question**: “How fast do GC students drive, when they drive their fastest?”

This question involves one variable:

**fastest** (numerical)

Describe the distribution's:

- Center (numerical measurements)
- Spread (numerical measurements)
- Shape (graphical tools)

```
favstats(~fastest,data=m111survey)
```

```
min Q1 median Q3 max mean sd n
60 90.5 102 119.5 190 105.9 20.88 71
```

- **min**: smallest data value
- **Q1**: first quartile
- **median**: median of the data
- **Q3**: third quartile
- **max**: largest data value
- **mean**: mean of the data
- **sd**: standard deviation of the data
- **n**: sample size

The mean of the sample data is:

\[ \bar{x}=\frac{\sum{x_i}}{n}, \]

where:

- \( \sum \) denotes summing
- \( x_i \) denotes the individual values to be summed
- \( n \) denotes the number of values in the list.

A small example:

```
FakeData <- c(2,4,7,9,10)
mean(FakeData)
```

```
[1] 6.4
```

Using R as a calculator to compute the mean:

```
(2+4+7+9+10)/5
```

```
[1] 6.4
```

Standard deviation (SD) measures how far the typical data value is from the mean of all the data.

\[ s = \sqrt{(\sum{(x_i - \bar{x})^2})/(n-1)}. \]

- Find the mean \( \bar{x} \) of the numbers.
- Subtract the mean from each number \( x_i \), then square these “deviations”.
- Add up the squared deviations.
- Average them (almost!) by dividing the sum by how many there are, MINUS ONE.
- Take the square root of this “almost-average.”

```
mean sd
105.9 20.88
```

“The typical GC student drives about 105.9 mph, give or take 20.9 mph or so.”

(**The SD is a “give-or-take” figure.**)

A small dataset:

```
FakeData <- c(2,4,7,9,10)
FakeData
```

```
[1] 2 4 7 9 10
```

First, get the mean of the data:

```
mean(FakeData)
```

```
[1] 6.4
```

Subtract mean from each data value, to get the *deviations* from the mean:

```
x deviations
1 2 -4.4
2 4 -2.4
3 7 0.6
4 9 2.6
5 10 3.6
```

We don't care about “above” or “below”, we only care about how far away, so *square* the deviations:

```
x deviations squared.devs
1 2 -4.4 19.36
2 4 -2.4 5.76
3 7 0.6 0.36
4 9 2.6 6.76
5 10 3.6 12.96
```

The square is never negative!

Now add up the squared deviations and divide by one less than the number of values:

```
(19.36+5.76+0.36+6.76+12.96)/(5-1)
```

```
[1] 11.3
```

The result is called the *sample variance*.

To (sort of) make up for squaring the deviations, we now take the *square root* of the variance:

```
sqrt(11.3)
```

```
[1] 3.362
```

The result is the sample standard deviation!
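The whole calculation can be collected into a few lines and compared with R's built-in `sd()` function. This is a sketch of the steps above; the intermediate object names are chosen here for illustration.

```r
FakeData <- c(2, 4, 7, 9, 10)

# Step 1 and 2: deviations from the mean, then squared
deviations   <- FakeData - mean(FakeData)
squared_devs <- deviations^2

# Step 3 and 4: add them up and divide by n - 1 (the sample variance)
sample_variance <- sum(squared_devs) / (length(FakeData) - 1)

# Step 5: the square root of the variance is the standard deviation
manual_sd <- sqrt(sample_variance)

# Built-in function for comparison
builtin_sd <- sd(FakeData)

print(c(manual = manual_sd, builtin = builtin_sd))
```

Both values should agree (about 3.362), so in practice `sd()` does all of these steps at once.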

```
FakeData <- c(2,4,7,9,10)
median(FakeData)
```

```
[1] 7
```

To find the median:

- sort the data from smallest to largest
- look in the “middle” of the list:
  - if \( n \) is odd, the median is the middle value
  - if \( n \) is even, the median is the average of the two values closest to the middle
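The recipe above can be written out as a short function and checked against `median()`. The function name `manual_median` is hypothetical, and the sketch assumes a non-empty vector with no missing values.

```r
# Hypothetical helper following the recipe above
# (assumes x is non-empty and has no NA values)
manual_median <- function(x) {
  x <- sort(x)          # sort from smallest to largest
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]      # n odd: the middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])  # n even: average of the two middle values
  }
}

odd_result  <- manual_median(c(10, 2, 7, 4, 9))  # five values: middle one is 7
even_result <- manual_median(c(2, 4, 7, 9))      # four values: (4 + 7) / 2 = 5.5
print(c(odd = odd_result, even = even_result))
```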

Quantiles are also called *percentiles*.

```
with(m111survey,
quantile(fastest,
probs=c(0.2,0.5,0.8,0.9)))
```

```
20% 50% 80% 90%
90 102 120 130
```

- About 20% of the students drove slower than 90 mph
- About 50% drove slower than 102 mph (median!)
- About 80% drove slower than 120 mph
- About 90% drove slower than 130 mph
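The "about p% drove slower than" reading can be checked directly by computing the share of values below each quantile. The sketch below uses ten made-up speeds (the vector `speeds` is an assumption, not the survey data):

```r
# Made-up speeds for illustration (not the m111survey data)
speeds <- c(60, 85, 90, 95, 100, 102, 110, 119, 130, 160)

# Quantiles at 20%, 50% and 80%
q <- quantile(speeds, probs = c(0.2, 0.5, 0.8))
print(q)

# For each quantile, the proportion of speeds strictly below it
share_below <- sapply(q, function(cutoff) mean(speeds < cutoff))
print(share_below)
```

With this toy vector the shares come out to 0.2, 0.5 and 0.8. With real data that contain ties, the match is usually only approximate, which is why the statements above say "about".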

```
with(m111survey,
quantile(fastest,
probs=c(0.25,0.50,0.75)))
```

```
25% 50% 75%
90.5 102.0 119.5
```

- 25th percentile is the *First Quartile* Q1
- 50th percentile is the median
- 75th percentile is the *Third Quartile* Q3

The *interquartile range* is

\[ IQR=Q3-Q1. \]

The middle 50% of the data lie between Q1 and Q3.

```
min Q1 median Q3 max
60 90.5 102 119.5 190
```

- “The median fastest speed driven was 102 mph.”
- “The IQR was \( 119.5-90.5=29 \) mph.”
- “The middle half of the students drove between 90.5 and 119.5 mph.”
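The IQR can be computed from the quartiles by subtraction, or with the base R function `IQR()`. A small sketch on the `FakeData` values used earlier, with default settings for `quantile()`:

```r
FakeData <- c(2, 4, 7, 9, 10)

# First and third quartiles
quartiles <- quantile(FakeData, probs = c(0.25, 0.75))
print(quartiles)

# IQR = Q3 - Q1, by hand and with the built-in function
iqr_by_hand <- unname(quartiles["75%"] - quartiles["25%"])
iqr_builtin <- IQR(FakeData)

print(c(by_hand = iqr_by_hand, builtin = iqr_builtin))
```

Here Q1 is 4 and Q3 is 9, so the IQR is 5 either way.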

Part 2 will begin with tools for describing the *shape* of the distribution of a numerical variable.