Two Factor Variables (Pt. 1)

Homer White, Georgetown College

Two Factor Variables

  • Detecting and Describing Relationships between Two Factor Variables
  • The Inferential Aspect of a Research Question
  • The Chi-Square Test for Relationship
  • Simpson's Paradox

In Part 1:

Load Packages

Always remember to make sure the necessary packages are loaded:

require(mosaic)
require(tigerstats)

Conditional Distributions

m111survey

data(m111survey)
View(m111survey)
help(m111survey)

Research Question

What is the relationship between sex and how one feels about one's weight?

Two Aspects

  • Descriptive Aspect: Is there a relationship in the sample data?
  • Inferential Aspect: Does the relationship in the sample (if any) provide strong evidence for a similar relationship in the population from which the sample was drawn?

We will focus first on the descriptive aspect.

Two-Way Table

SexWt <- xtabs(~sex+weight_feel,
               data=m111survey)
SexWt

        weight_feel
sex      under ok over
  female     1 11   28
  male       8 14    9

  • Two dimensions (rows, columns)
  • 2 rows x 3 cols = 6 cells

Row and Column Totals

       under ok over Total
female     1 11   28    40
male       8 14    9    31
Total      9 25   37    71

The margins describe each factor variable individually.
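
The totals in the margins can be reproduced in base R. The sketch below is only illustrative: it types in the counts from the two-way table by hand, under the made-up name `SexWt_counts`, so that it runs without the survey data. With `m111survey` loaded, the same call should work on `SexWt` itself.

```r
# Counts copied from the two-way table shown earlier
# (SexWt_counts is a hypothetical name used only for this illustration).
SexWt_counts <- as.table(matrix(
  c(1, 11, 28,
    8, 14,  9),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  weight_feel = c("under", "ok", "over"))
))

# addmargins() appends a "Sum" row and a "Sum" column holding the totals.
with_totals <- addmargins(SexWt_counts)
with_totals
```

Base R labels the margins "Sum" rather than "Total", but the numbers are the same row and column totals.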

A Marginal Distribution

Tally for sex:

sex
female   male 
    40     31 

Marginal Distribution of sex (percents):

female   male 
 56.34  43.66 

A Marginal Distribution

Tally for weight_feel:

under    ok  over 
    9    25    37 

Marginal Distribution of weight_feel (percents):

under    ok  over 
12.68 35.21 52.11 
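
A marginal distribution is just the totals for one variable divided by the grand total. As a rough sketch in base R (the counts are typed in by hand from the two-way table, and the object names are made up for illustration):

```r
# Counts copied from the two-way table (hypothetical object name).
tab <- as.table(matrix(
  c(1, 11, 28,
    8, 14,  9),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  weight_feel = c("under", "ok", "over"))
))

# Totals for each variable on its own: 1 = rows (sex), 2 = columns (weight_feel).
sex_totals <- margin.table(tab, 1)
wt_totals  <- margin.table(tab, 2)

# Convert the totals to percents of the grand total.
sex_pct <- round(100 * prop.table(sex_totals), 2)
wt_pct  <- round(100 * prop.table(wt_totals), 2)
sex_pct
wt_pct
```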

Marginal Distributions

The marginal distributions don't tell you anything about how the two variables are related!

Conditional Distributions

rowPerc(SexWt)
       under    ok  over Total
female  2.50 27.50 70.00   100
male   25.81 45.16 29.03   100

Each row of this table gives a conditional distribution.
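
To see where the row percents come from, each count can be divided by its own row total. The sketch below does this in base R with `prop.table()`. It assumes the counts from the two-way table, typed in by hand under a made-up name, and is meant as an illustration of the arithmetic rather than a description of how `rowPerc()` is implemented.

```r
# Counts copied from the two-way table (hypothetical object name).
tab <- as.table(matrix(
  c(1, 11, 28,
    8, 14,  9),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  weight_feel = c("under", "ok", "over"))
))

# margin = 1 divides every cell by its ROW total,
# so each row becomes a conditional distribution of weight_feel given sex.
row_pct <- round(100 * prop.table(tab, margin = 1), 2)
row_pct

# Within rounding, each row adds up to 100 percent.
rowSums(row_pct)
```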

Conditional Distributions

The conditional distribution of weight_feel, given that sex is “female”.

under    ok  over 
  2.5  27.5  70.0 

The conditional distribution of weight_feel, given that sex is “male”.

under    ok  over 
25.81 45.16 29.03 

Detecting Relationships

Principle For Detection

If the conditional distributions are different, then the two variables are related.

  • compare row percents down a single column
  • if they differ, the variables are related

Comparison

       under    ok  over Total
female  2.50 27.50 70.00   100
male   25.81 45.16 29.03   100

Look down the “over” (overweight) column:

        over
female 70.00
male   29.03

70% of the females think they are overweight, but only 29% of the males think they are overweight.

Describing Relationships

Generate the two-way table and the row percents:

SexWt <- xtabs(~sex+weight_feel,
               data=m111survey)
SexWt
rowPerc(SexWt)

        weight_feel
sex      under ok over
  female     1 11   28
  male       8 14    9

       under    ok  over Total
female  2.50 27.50 70.00   100
male   25.81 45.16 29.03   100

When you choose a column:

  • Focus on a column with a big difference in %'s
  • Preferably where percents are based on large counts

        weight_feel
sex      under ok over
  female     1 11   28
  male       8 14    9

       under    ok  over Total
female  2.50 27.50 70.00   100
male   25.81 45.16 29.03   100
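
The two guidelines for choosing a column can be checked side by side. The sketch below is one possible way to do it in base R, with the counts typed in by hand and all object names made up for illustration. For each column it reports the gap between the row percents and the number of people in that column.

```r
# Counts copied from the two-way table (hypothetical object name).
tab <- as.table(matrix(
  c(1, 11, 28,
    8, 14,  9),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  weight_feel = c("under", "ok", "over"))
))

# Row percents: conditional distributions of weight_feel given sex.
row_pct <- 100 * prop.table(tab, margin = 1)

# Gap between the largest and smallest row percent within each column.
gap <- apply(row_pct, 2, function(p) max(p) - min(p))

# Column totals: how many people each comparison rests on.
n <- margin.table(tab, 2)

column_summary <- data.frame(gap = round(gap, 2), count = as.numeric(n))
column_summary

# Column with the biggest difference in percents.
best <- names(which.max(gap))
best
```

Here the “over” column has both the largest gap and the largest count, which is why it makes the best column to describe.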

A Good Description

“The females in the sample are more likely than the males to think that they are overweight (70% of females think they are overweight, as compared to only 29% of the males).”

(Always back up your answer with specific, relevant features of the data!)

A Mistaken Description

       under    ok  over Total
female  2.50 27.50 70.00   100
male   25.81 45.16 29.03   100

“The females in the sample are more likely to think that they are overweight (70% of females think they are overweight, only 2.5% think they are underweight).”

[Comparing row percents across a row involves only one conditional distribution!]

Another Mistake

Counts:

       sprouted not.sprouted
Plot.A       70           30
Plot.B      140           60

Row percents:

       sprouted not.sprouted Total
Plot.A       70           30   100
Plot.B       70           30   100

“Type of plot and sprouting are related: in both types of plot, a majority of the seeds (70%) sprouted.”

[The conditional distributions are identical, so the variables are not related at all.]
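
A quick way to confirm this is to compute the row percents for the sprouting counts and compare the two rows directly. The sketch below uses base R with made-up object names, and the counts are the ones shown in the table above.

```r
# Sprouting counts from the example (hypothetical object name).
sprout <- as.table(matrix(
  c( 70, 30,
    140, 60),
  nrow = 2, byrow = TRUE,
  dimnames = list(plot = c("Plot.A", "Plot.B"),
                  outcome = c("sprouted", "not.sprouted"))
))

# Conditional distribution of the outcome within each plot.
row_pct <- 100 * prop.table(sprout, margin = 1)
row_pct

# Do the two plots share the same conditional distribution?
same_distribution <- isTRUE(all.equal(as.numeric(row_pct["Plot.A", ]),
                                      as.numeric(row_pct["Plot.B", ])))
same_distribution
```

Plot B has twice as many seeds, but the percents in the two rows match, so knowing the plot tells you nothing extra about sprouting.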