Rebekah Robinson, Georgetown College

Always remember to make sure the necessary packages are loaded:

```
require(mosaic)
require(tigerstats)
```

The datasets that will be discussed in this section are:

```
data(m111survey)
View(m111survey)
help(m111survey)
```

```
data(ucdavis1)
View(ucdavis1)
help(ucdavis1)
```

```
data(pennstate1)
View(pennstate1)
help(pennstate1)
```

It is important to consider the **strength** of association as well as the direction.

In the `m111survey` data, the variables **height** and **fastest** have a positive association.

However, **height** and **ideal_ht** have a *stronger* positive association. The points are less scattered.

The **correlation coefficient** is the numerical measure of the direction and strength of the linear association between two numerical variables.

The correlation coefficient, \( r \), is

\[ r=\sum (z_x)(z_y)/(n-1) \]

where

\( z_x=(x-\bar{x})/s_x \)

\( z_y=(y-\bar{y})/s_y \).
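The formula can be checked by hand in R. The sketch below uses a small made-up data set (the vectors `x` and `y` are illustrative and are not taken from `m111survey`) and compares the z-score calculation with base R's `cor`.

```r
# Made-up illustrative data (not from m111survey)
x <- c(60, 62, 65, 68, 70, 72, 75)
y <- c(61, 64, 64, 69, 68, 73, 74)
n <- length(x)

# z-scores, using the sample standard deviation (divisor n - 1)
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Correlation coefficient from the formula
r_manual <- sum(z_x * z_y) / (n - 1)

r_manual
cor(x, y)  # should agree with r_manual
```

The two printed values should match, since `cor` computes the same quantity.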

The correlation coefficient for **height** and **fastest** is

```
cor(fastest~height,
data=m111survey,
use="na.or.complete")
```

```
[1] 0.1709
```

The correlation coefficient for **height** and **ideal_ht** is

```
cor(ideal_ht~height,
data=m111survey,
use="na.or.complete")
```

```
[1] 0.832
```

\( r \) always falls between -1 and 1, inclusive.

The **sign** of \( r \) indicates the *direction* of the relationship.

- \( r>0 \) indicates a positive linear association.
- \( r<0 \) indicates a negative linear association.

The **magnitude** of \( r \) indicates the *strength* of the relationship.

- \( r=1 \) indicates a perfect positive linear relationship.
- \( r=-1 \) indicates a perfect negative linear relationship.
- \( r=0 \) indicates no linear relationship.
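The extreme values of \( r \) can be seen with a few made-up vectors. This is only a sketch: the data are invented so that the points fall exactly on a line or on a symmetric curve.

```r
x <- 1:10

r_pos <- cor(x, 3 + 2 * x)   # points lie exactly on an increasing line
r_neg <- cor(x, 3 - 2 * x)   # points lie exactly on a decreasing line

# A symmetric curved pattern: y depends on x, but not linearly
x_sym <- -3:3
r_curve <- cor(x_sym, x_sym^2)

c(r_pos, r_neg, r_curve)
```

The third value shows why \( r=0 \) is read as no *linear* relationship: the points follow a curve exactly, yet the correlation coefficient is 0.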

You can further investigate the values of \( r \) with the following app:

```
require(manipulate)
VaryCorrelation()
```

Research Question: At UC Davis, how is a student's mother's height related to their father's height?

Explanatory variable: **dadheight** (numerical)

Response variable: **momheight** (numerical)

```
xyplot(momheight~dadheight,data=ucdavis1,
col="blue",pch=19)
```

It appears that students with tall dads tend to have tall moms, and students with short dads tend to have short moms.

Since the points do not form a tight cluster, the positive association does not appear to be very strong.

```
cor(momheight~dadheight,data=ucdavis1,
use="na.or.complete")
```

```
[1] 0.2572
```

A linear relationship can be explained using the equation of a line.

**Regression equation** - the equation of a line used to *predict* the value of the response variable from a known value of the explanatory variable.

\[ \hat{y}=a+bx \]

- \( a \) is the \( y \)-intercept.
- \( b \) is the slope.
- \( x \) is the known value of the explanatory variable.
- \( \hat{y} \) is the predicted value of the response variable.

Research Question: At Penn State, how is a student's right handspan related to their height?

Each point on the scatterplot, \( (x,y) \), is a known observation.

Each point on the line, \( (x,\hat{y}) \), is a predicted response for a value of the explanatory variable.

A **residual** is the vertical distance between a point and the regression line.

\[ \mbox{residual}=y-\hat{y} \]

The regression line is the line that minimizes the sum of the squared residuals.

\[ \mbox{Sum of Squares} = \sum (\mbox{residuals})^2= \sum(y_i-\hat{y}_i)^2 \]
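To see the minimizing property numerically, the sketch below fits a line with base R's `lm` to made-up data and compares its sum of squared residuals with that of two nearby lines. The data and the helper `sum_sq` are illustrative assumptions, not part of `tigerstats`.

```r
# Made-up illustrative data
x <- c(18, 19, 20, 21, 22, 23, 24)
y <- c(64, 66, 65, 69, 70, 69, 73)

fit <- lm(y ~ x)                    # least-squares regression line
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["x"]]

# Sum of squared residuals for any candidate line y-hat = a0 + b0 * x
sum_sq <- function(a0, b0) sum((y - (a0 + b0 * x))^2)

ss_fit     <- sum_sq(a, b)          # the regression line
ss_steeper <- sum_sq(a, b + 0.1)    # same intercept, steeper slope
ss_higher  <- sum_sq(a + 1, b)      # same slope, shifted up by 1

c(ss_fit, ss_steeper, ss_higher)
```

The first value should be the smallest of the three; any other choice of intercept and slope gives a larger sum of squares.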

Investigate this further with the following app:

```
require(manipulate)
FindRegLine()
```

The **lmGC** function, from the `tigerstats` package,

- outputs the correlation coefficient,
- outputs the equation of the regression line,
- gives the option to display the graph of the regression line,
- and more.

```
lmGC(Height~RtSpan,data=pennstate1)
```

```
Simple Linear Regression
Correlation coefficient r = 0.6314
Equation of Regression Line:
Height = 41.96 + 1.239 * RtSpan
Residual Standard Error: s = 3.149
R^2 (unadjusted): R^2 = 0.3987
```

Including the argument **graph=TRUE** prints the same output and also draws the scatterplot with the regression line.

```
lmGC(Height~RtSpan,data=pennstate1,
graph=TRUE)
```

```
Simple Linear Regression
Correlation coefficient r = 0.6314
Equation of Regression Line:
Height = 41.96 + 1.239 * RtSpan
Residual Standard Error: s = 3.149
R^2 (unadjusted): R^2 = 0.3987
```

What is the predicted height of a Penn State student with a right handspan of 22 cm?

We can use the regression equation

\[ \hat{y}=41.96+1.239x \]

to predict this height by plugging in \( x=22 \).

\[ \hat{y}=41.96+1.239(22)=69.2 \]

**A Penn State student with a right handspan of 22 cm is predicted to be 69.2 inches tall.**

The **predict** function will also compute this. It requires two inputs:

- a linear model
- an \( x \)-value.

```
handheightmod<-lmGC(Height~RtSpan,
data=pennstate1)
predict(handheightmod,x=22)
```

```
[1] 69.23
```

The regression line is \( \hat{y}=41.9593+1.2394x \).

The **intercept** is 41.9593.

The **slope** is 1.2394.

What do these numbers *mean*?

**Intercept:** 41.9593 is the *predicted* height of a Penn State student whose right handspan is 0 cm.

*The interpretation of the intercept does not always make logical sense!*

**Slope:** The *predicted* height of a Penn State student changes by 1.2394 inches as right handspan increases by 1 centimeter.

*For every one centimeter increase in right handspan, the predicted height increases by 1.2394 inches.*
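These interpretations can be checked with a few lines of arithmetic. The sketch below hard-codes the intercept and slope stated above; the helper name `predicted_height` is made up for illustration.

```r
# Intercept and slope of the regression line stated above
a <- 41.9593
b <- 1.2394

# Predicted height (inches) for a given right handspan (cm)
predicted_height <- function(span_cm) a + b * span_cm

predicted_height(0)                           # the intercept
predicted_height(22)                          # about 69.23 inches
predicted_height(23) - predicted_height(22)   # the slope, up to rounding
```

A one-centimeter increase in handspan changes the prediction by exactly the slope, whatever the starting handspan.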

When there is variation in the data, a line is not a perfect explanation of the relationship.

The variation is measured two ways:

- Residual Standard Error (RSE)
- Squared Correlation (\( r^2 \))

Let's return to the Penn State data of right handspans and heights.

```
lmGC(Height~RtSpan,data=pennstate1)
```

RSE measures the *spread* of the residuals.

\[ \mbox{RSE}=3.149 \]
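As a rough sketch of where such a number comes from: in simple linear regression, base R's `lm` reports the RSE as the square root of the sum of squared residuals divided by \( n-2 \). The example assumes made-up data and assumes that the value `s` printed by `lmGC` follows the same convention.

```r
# Made-up illustrative data
x <- c(18, 19, 20, 21, 22, 23, 24)
y <- c(64, 66, 65, 69, 70, 69, 73)

fit <- lm(y ~ x)
res <- resid(fit)        # residuals: observed y minus predicted y
n   <- length(y)

# Residual standard error computed by hand
rse_manual <- sqrt(sum(res^2) / (n - 2))

rse_manual
summary(fit)$sigma       # RSE as reported by lm; should agree
```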

**Warning**: RSE is directly affected by a change in scale.

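For the Penn State data, RSE = 3.149 when height is measured in inches. Converting the same heights to feet divides the RSE by 12, giving RSE = 0.262.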

The squared correlation, \( r^2 \), is

- another measurement of the explained variation in the scatterplot.
- the *proportion* of variation in the response variable that is explained by the explanatory variable.
- unaffected by a change in scale.

\( r^2 \) is always between 0 and 1.

\( r^2=1 \) implies perfect linear correlation between the explanatory and response variables.

\( r^2=0 \) implies no linear correlation.
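The contrast between RSE and \( r^2 \) under a change of scale can be sketched with base R's `lm`. The data below are made up, and the conversion from inches to feet is only an illustrative choice of rescaling.

```r
# Made-up illustrative data
span_cm   <- c(18, 19, 20, 21, 22, 23, 24)
height_in <- c(64, 66, 65, 69, 70, 69, 73)
height_ft <- height_in / 12          # same heights, different units

fit_in <- lm(height_in ~ span_cm)
fit_ft <- lm(height_ft ~ span_cm)

# RSE changes with the units of the response variable
rse_in <- summary(fit_in)$sigma
rse_ft <- summary(fit_ft)$sigma
c(rse_in, rse_ft, rse_in / 12)

# r^2 does not change, and equals the squared correlation coefficient
r2_in <- summary(fit_in)$r.squared
r2_ft <- summary(fit_ft)$r.squared
c(r2_in, r2_ft, cor(span_cm, height_in)^2)
```

The RSE in feet should equal the RSE in inches divided by 12, while all three \( r^2 \) values should agree.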

Part 3 will discuss topics of **caution** when examining relationships between two numerical variables.