Two Numerical Variables (Part 2)

Rebekah Robinson, Georgetown College

Load Packages

Always remember to make sure the necessary packages are loaded:

require(mosaic)
require(tigerstats)

Load & View Data

The datasets that will be discussed in this section are:

data(m111survey)
View(m111survey)
help(m111survey)
data(ucdavis1)
View(ucdavis1)
help(ucdavis1)
data(pennstate1)
View(pennstate1)
help(pennstate1)

Statistical Relationships in Part 2:

Correlation

Strength of Association

It is important to consider the strength of association as well as the direction.

In the m111survey data, the variables height and fastest have a positive association.

However, height and ideal_ht have a stronger positive association. The points are less scattered.

Comparison of Association Strength

[Scatterplot: fastest vs. height in m111survey]

[Scatterplot: ideal_ht vs. height in m111survey]
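Scatterplots like these can be drawn with xyplot, as used later in these slides (a sketch; the color and plotting character are our own choices):

xyplot(fastest~height, data=m111survey,
       col="blue", pch=19)
xyplot(ideal_ht~height, data=m111survey,
       col="blue", pch=19)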

Correlation Coefficient

The correlation coefficient is the numerical measure of the direction and strength of the linear association between two numerical variables.

The correlation coefficient, \( r \), is

\[ r=\sum (z_x)(z_y)/(n-1) \]

where

  • \( z_x=(x-\bar{x})/s_x \)

  • \( z_y=(y-\bar{y})/s_y \).
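As a check on this formula, \( r \) can be computed directly from the z-scores (a sketch in base R; complete.cases drops missing values, mirroring use="na.or.complete" on the next slide):

x  <- m111survey$height
y  <- m111survey$fastest
ok <- complete.cases(x, y)              # keep only rows with both values present
zx <- (x[ok] - mean(x[ok])) / sd(x[ok])
zy <- (y[ok] - mean(y[ok])) / sd(y[ok])
sum(zx * zy) / (sum(ok) - 1)            # should agree with cor() on the next slide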

r in R

The correlation coefficient for height and fastest is

cor(fastest~height,
    data=m111survey,
    use="na.or.complete")
[1] 0.1709

[Scatterplot: fastest vs. height, r = 0.17]

r in R

The correlation coefficient for height and ideal_ht is

cor(ideal_ht~height,
    data=m111survey,
    use="na.or.complete")
[1] 0.832

[Scatterplot: ideal_ht vs. height, r = 0.83]

Properties of r

  • \( r \) always falls between -1 and 1

  • The sign of \( r \) indicates the direction of the relationship.

    • \( r>0 \) indicates a positive linear association.
    • \( r<0 \) indicates a negative linear association.
  • The magnitude of \( r \) indicates the strength of the relationship.

    • \( r=1 \) indicates a perfect positive linear relationship.
    • \( r=-1 \) indicates a perfect negative linear relationship.
    • \( r=0 \) indicates no linear relationship.

Perfect Positive Correlation

[Scatterplot illustrating a perfect positive correlation, r = 1]

Perfect Negative Correlation

[Scatterplot illustrating a perfect negative correlation, r = -1]

No Correlation

[Scatterplot illustrating no correlation, r near 0]
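Simulated data make these extremes easy to see (a sketch; the particular numbers are arbitrary):

x <- 1:50
cor(x, 3 + 2*x)        # points exactly on an increasing line: r = 1
cor(x, 10 - 0.5*x)     # points exactly on a decreasing line:  r = -1
set.seed(123)
cor(x, rnorm(50))      # unrelated noise:                      r near 0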

Further Investigation

You can further investigate the values of \( r \) with the following app:

require(manipulate)
VaryCorrelation()

Idea for Investigation

Research Question: At UC Davis, how is a student's mother's height related to their father's height?

  • Explanatory variable: dadheight (numerical)

  • Response variable: momheight (numerical)

xyplot(momheight~dadheight,data=ucdavis1,
       col="blue",pch=19)

Graphical Investigation

It appears that students with tall dads tend to have tall moms, and students with short dads tend to have short moms.

Since the points do not form a tight cluster, the positive association does not appear to be very strong.

[Scatterplot: momheight vs. dadheight in ucdavis1]

Numerical Investigation

cor(momheight~dadheight,data=ucdavis1,
    use="na.or.complete")
[1] 0.2572

Regression Equation

Regression Analysis

A linear relationship can be described using the equation of a line.

Regression equation - the equation of a line used to predict the value of the response variable from a known value of the explanatory variable.

\[ \hat{y}=a+bx \]

  • \( a \) is the \( y \)-intercept.
  • \( b \) is the slope.
  • \( x \) is the known value of the explanatory variable.
  • \( \hat{y} \) is the predicted value of the response variable.

Regression Line

RQ: At Penn State, how is a student's right handspan related to his/her height?

Each point on the scatterplot, \( (x,y) \), is a known observation.

Each point on the line, \( (x,\hat{y}) \), is a predicted response for a value of the explanatory variable.

[Scatterplot: Height vs. RtSpan in pennstate1, with the regression line]

Residuals

A residual is the vertical deviation of a point from the regression line: positive when the point lies above the line, negative when it lies below.

\[ \mbox{residual}=y-\hat{y} \]

[Plot illustrating a residual as the vertical gap between a point and the regression line]
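For example, using the regression equation for the Penn State data found a few slides below, the residual for a hypothetical student with a 20 cm right handspan and an observed height of 68 inches is (a sketch):

y     <- 68                    # observed height (hypothetical student)
y_hat <- 41.96 + 1.239 * 20    # predicted height from the regression line below
y - y_hat                      # residual: about 1.26, so the point lies above the line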

Regression Line and Residuals

The regression line is the line that minimizes the sum of the squared residuals.

\[ \mbox{Sum of Squares} = \sum (\mbox{residuals})^2= \sum(y_i-\hat{y}_i)^2 \]

Investigate this further with the following app:

require(manipulate)
FindRegLine()
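A direct check of this idea (a sketch in base R; the coefficients 41.96 and 1.239 are taken from the lmGC output below, and any other intercept/slope pair should give a larger sum of squares):

x  <- pennstate1$RtSpan
y  <- pennstate1$Height
ok <- complete.cases(x, y)
ss <- function(a, b) sum((y[ok] - (a + b * x[ok]))^2)   # sum of squared residuals
ss(41.96, 1.239)   # the least-squares line
ss(40.00, 1.239)   # different intercept: larger sum of squares
ss(41.96, 1.500)   # different slope:     larger sum of squares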

Linear Model Function

The built-in function, lmGC, outputs

  • the correlation coefficient,

  • the equation of the regression line,

  • a graph of the regression line (optional),

  • and more.

Finding the Regression Equation

lmGC(Height~RtSpan,data=pennstate1)

            Simple Linear Regression

Correlation coefficient r =  0.6314 

Equation of Regression Line:

     Height = 41.96 + 1.239 * RtSpan 

Residual Standard Error:    s   = 3.149 
R^2 (unadjusted):       R^2 = 0.3987 

Graphing the Regression Line

Including the argument graph=TRUE outputs the same regression summary and also displays a graph of the regression line.

lmGC(Height~RtSpan,data=pennstate1,
     graph=TRUE)

The Result


            Simple Linear Regression

Correlation coefficient r =  0.6314 

Equation of Regression Line:

     Height = 41.96 + 1.239 * RtSpan 

Residual Standard Error:    s   = 3.149 
R^2 (unadjusted):       R^2 = 0.3987 

[Scatterplot: Height vs. RtSpan with the fitted regression line]

Predictions

What is the predicted height of a Penn State student with a right handspan of 22 cm?

We can use the regression equation

\[ \hat{y}=41.96+1.239x \]

to predict this height by plugging in \( x=22 \).

\[ \hat{y}=41.96+1.239(22)=69.2 \]

A Penn State student with a right handspan of 22 cm is predicted to be 69.2 inches tall.

Predictions using the Predict Function

The predict function will also compute this. It requires two inputs:

  • a linear model

  • an \( x \)-value.

handheightmod<-lmGC(Height~RtSpan,
                    data=pennstate1)

predict(handheightmod,x=22)
[1] 69.23

Interpretation of Regression Line

The regression line is \( \hat{y}=41.9593+1.2394x \).

The intercept is 41.9593.

The slope is 1.2394.

What do these numbers mean?

[Scatterplot: Height vs. RtSpan with the regression line]

Interpretation of Slope and Intercept

Intercept: 41.9593 is the predicted height of a Penn State student whose right handspan is 0 cm.

The interpretation of the intercept does not always make logical sense!

Slope: The predicted height of a Penn State student increases by 1.2394 inches as right handspan increases by 1 centimeter.

In other words, for every one-centimeter increase in right handspan, the predicted height increases by 1.2394 inches.
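A quick arithmetic check of this interpretation (a sketch):

(41.9593 + 1.2394 * 23) - (41.9593 + 1.2394 * 22)   # difference in predicted heights = the slope, 1.2394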

How well does our line fit?

When the points do not all fall exactly on a line, the line does not describe the relationship perfectly.

The leftover variation is measured in two ways:

  • Residual Standard Error (RSE)

  • Squared Correlation (\( r^2 \))

Residual Standard Error

Let's return to the Penn State data of right handspans and heights.

lmGC(Height~RtSpan,data=pennstate1)

RSE measures the spread of the residuals.

\[ \mbox{RSE}=3.149 \]

Warning: RSE is directly affected by a change in the scale (units) of the variables.

RSE for Different Units

[Regression plot: Height vs. RtSpan, original units]

RSE = 3.149

[Regression plot: Height vs. RtSpan, rescaled units]

RSE = 0.262
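One way to see this is to refit the model after rescaling the response. For instance, since Height is recorded in inches, converting it to feet should divide the RSE by 12 (a sketch; HeightFt is a column we create ourselves):

pennstate2 <- pennstate1
pennstate2$HeightFt <- pennstate2$Height / 12   # inches to feet
lmGC(HeightFt ~ RtSpan, data=pennstate2)        # RSE should now be about 3.149/12
3.149 / 12                                      # about 0.262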

Squared Correlation

The squared correlation, \( r^2 \), is

  • another measurement of the explained variation in the scatterplot.

  • the proportion of variation in the response variable that is explained by the explanatory variable.

  • unaffected by a change in scale.

Properties of r-squared

  • \( r^2 \) is always between 0 and 1.

  • \( r^2=1 \) implies a perfect linear relationship between the explanatory and response variables.

  • \( r^2=0 \) implies no linear relationship.
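The \( R^2 \) reported by lmGC is just the square of the correlation coefficient; a quick check for the Penn State data (a sketch):

r <- cor(Height ~ RtSpan, data=pennstate1,
         use="na.or.complete")
r^2     # about 0.3987, matching R^2 from the lmGC output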

Next Topic

Part 3 will discuss topics of caution when examining relationships between two numerical variables.