Describing Patterns in Data (Part 1)

Homer White, Georgetown College

Describing Patterns in Data

  • Data Basics
  • Describing the Distribution of a Variable
  • Exploring Relationships Between Variables
  • Numerical Measures
  • Graphical Tools

In Part 1:

Load Packages

Always remember to make sure the necessary packages are loaded:

require(mosaic)
require(tigerstats)

Types of Variables

Looking at Data

(1) Put the data into your Global Environment:

data(m111survey)

(2) View it:

View(m111survey)

(3) Learn more about it:

help(m111survey)
?m111survey #same thing, less typing!

A Data Frame

Here is part of m111survey:

  height ideal_ht fastest     seat    sex
1   76.0       78     119  1_front   male
2   74.0       76     110 2_middle   male
3   64.0       NA      85 2_middle female
4   62.0       65     100  1_front female
5   72.0       72      95   3_back   male
6   70.8       NA     100  1_front   male

In a data frame:

  • rows are observations (individuals)
  • columns are variables

Variable Types

  • Factor (Categorical)
    • values (levels) are not numbers: “male”, “female”,…
    • ordinal factors: the levels have a natural order (“front”, “middle”, “back”, …)
  • Numerical (Quantitative)
    • Double (Continuous): values are real numbers 4.37,2.58, …
    • Integer (Discrete): values are whole numbers 1,4,2,2,…
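As a rough illustration of these types, the sketch below builds a tiny data frame and asks R how each column is stored. The data frame `toy` and its values are invented for this example and are not part of m111survey; only base R is used.

```r
# Hypothetical mini data frame; the values are made up for illustration.
toy <- data.frame(
  sex     = factor(c("male", "female", "female")),       # factor
  seat    = factor(c("front", "back", "middle"),
                   levels = c("front", "middle", "back"),
                   ordered = TRUE),                       # ordinal factor
  height  = c(76.0, 64.5, 62.0),                          # double (continuous)
  fastest = c(119L, 85L, 100L)                            # integer (the L suffix forces integer storage)
)

# First class label of each column
var_classes <- sapply(toy, function(v) class(v)[1])
print(var_classes)

# Storage type of the two numerical columns
print(typeof(toy$height))
print(typeof(toy$fastest))

# For an ordinal factor, comparisons between levels are meaningful
print(toy$seat < "back")
```

Note that R labels the double column as "numeric" when asked for its class; `typeof()` is what reveals the double/integer distinction.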

The str() Function

str(m111survey)
'data.frame':   71 obs. of  12 variables:
 $ height         : num  76 74 64 62 72 70.8 70 79 59 67 ...
 $ ideal_ht       : num  78 76 NA 65 72 NA 72 76 61 67 ...
 $ sleep          : num  9.5 7 9 7 8 10 4 6 7 7 ...
 $ fastest        : int  119 110 85 100 95 100 85 160 90 90 ...
 $ weight_feel    : Factor w/ 3 levels "1_underweight",..: 1 2 2 1 1 3 2 2 2 3 ...
 $ love_first     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ extra_life     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 2 2 2 1 ...
 $ seat           : Factor w/ 3 levels "1_front","2_middle",..: 1 2 2 1 3 1 1 3 3 2 ...
 $ GPA            : num  3.56 2.5 3.8 3.5 3.2 3.1 3.68 2.7 2.8 NA ...
 $ enough_Sleep   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 1 2 ...
 $ sex            : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 2 1 1 ...
 $ diff.ideal.act.: num  2 2 NA 3 0 NA 2 -3 2 0 ...

Research Questions

Descriptive Statistics

Descriptive Statistics is the art of summarizing data and describing patterns in the data.

Descriptive Tools

  • Graphical Devices
    • barcharts
    • histograms
    • density plots
    • boxplots
    • and many more …
  • Numerical Measurements
    • mean and standard deviation
    • median and interquartile range
    • quantiles
    • and many more …

Important Guiding Principle

The choice of tools will depend on the type of variables involved in your Research Question.

Some Research Questions

  • “Are a majority of students female?”
    • variable: sex (factor)
  • “Who is more likely to prefer to sit in the front: a guy or a gal?”
    • variables: sex (factor) and seat (factor)
  • “Who drives faster: students who prefer the front, the middle, or the back?”
    • variables: seat (factor) and fastest (numerical)

Distribution of One Factor Variable

One Factor Variable

Research Question:

  • “What percentage of students in the survey are female?”

Variable involved is sex (factor).

Numerical Tool: Tables

Tally the sexes (a table of counts):

xtabs(~sex,data=m111survey)
sex
female   male 
    40     31 

Get percentages:

rowPerc(xtabs(~sex,data=m111survey))
 female   male  Total
  56.34  43.66 100.00
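The percentages above are simply each count divided by the total. A minimal base-R sketch, assuming only the two counts shown in the table (the names `counts` and `percents` are chosen for this example):

```r
# Counts copied from the xtabs() output above
counts <- c(female = 40, male = 31)

# prop.table() divides each count by the total; multiply by 100 for percents
percents <- round(100 * prop.table(counts), 2)
print(percents)
```

This reproduces the 56.34 and 43.66 reported by rowPerc().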

Graphical Tool: Barchart

barchartGC(~sex,data=m111survey,
           type="percent",
           main="Distribution of Sex")

[Figure: barchart of the percentage of students of each sex, titled “Distribution of Sex”]

Relationship Between Two Factor Variables

Two Factor Variables

Research Question: “Who is more likely to sit in the front: a guy or a gal?”

  • This question is about the relationship between two variables.
  • Variables involved are:
    • sex (factor). This is the explanatory variable.
    • seat (factor). This is the response variable.
  • The explanatory variable is the variable that:
    • we think might help cause the response, or …
    • that we might intend to use to predict the response.
  • Often we don't make the explanatory/response distinction.

Numerical Tool: Two-Way Tables

Also called a cross table or a contingency table.

xtabs(~sex+seat,data=m111survey)
        seat
sex      1_front 2_middle 3_back
  female      19       16      5
  male         8       16      7

But the counts alone don't answer the Research Question. (There are more gals than guys in the first place!)

Row Percents

To check for a relationship between two factor variables, see whether the conditional distributions differ. For the conditional distribution of seat given each value of sex, compute row percents:

sexseat <- xtabs(~sex+seat,data=m111survey)
rowPerc(sexseat)
       1_front 2_middle 3_back Total
female   47.50    40.00  12.50   100
male     25.81    51.61  22.58   100

Females are more likely to prefer the front (47.5%, vs. 25.81% for the guys).
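Row percents divide each count by its own row total, so each row sums to 100. A minimal base-R sketch, assuming only the counts from the two-way table above (the object names `sexseat_counts` and `row_pct` are chosen for this example):

```r
# Counts copied from the two-way table above
sexseat_counts <- matrix(
  c(19, 16, 5,
     8, 16, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex  = c("female", "male"),
                  seat = c("1_front", "2_middle", "3_back"))
)

# margin = 1 means "divide each entry by its row total"
row_pct <- round(100 * prop.table(sexseat_counts, margin = 1), 2)
print(row_pct)

# Each row is a conditional distribution, so it adds up to 100
print(rowSums(row_pct))
```

Dividing by row totals is what removes the effect of there being more females than males in the survey.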

Graphical Tool: Barchart

barchartGC(~sex+seat,data=m111survey,
           type="percent",
           main="Seating Preference by Sex")

Note the “formula-data” input:

  • formula is ~explanatory+response
  • data gives the data frame containing the variables

Barchart Output

[Figure: barchart of seating preference by sex, in percent, titled “Seating Preference by Sex”]

  • Females tend to prefer the front more than guys do.
  • Guys tend to prefer the back more than gals do.

Distribution of One Numerical Variable

(Numerical Tools)


One Numerical Variable

Research Question: “How fast do GC students drive, when they drive their fastest?”

This question involves one variable:

  • fastest (numerical)

Describing a Numerical Variable

Describe the distribution's:

  • Center (numerical measurements)
  • Spread (numerical measurements)
  • Shape (graphical tools)

Numerical Measures: favstats()

favstats(~fastest,data=m111survey)
 min   Q1 median    Q3 max  mean    sd  n
  60 90.5    102 119.5 190 105.9 20.88 71

  • min: smallest data value
  • Q1: first quartile
  • median: median of the data
  • Q3: third quartile
  • max: largest data value
  • mean: mean of the data
  • sd: standard deviation of the data
  • n: sample size

Describing the Center: The Mean

The mean of the sample data is:

\[ \bar{x}=\frac{\sum{x_i}}{n}, \]

where:

  • \( \sum \) denotes summing
  • \( x_i \) denotes the individual values to be summed
  • \( n \) denotes the number of values in the list.

Describing the Center: The Mean

A small example:

FakeData <- c(2,4,7,9,10)
mean(FakeData)
[1] 6.4

Using R as a calculator to compute the mean:

(2+4+7+9+10)/5
[1] 6.4

Describing the Spread: the SD

Standard deviation (SD) measures how far the typical data value is from the mean of all the data.

\[ s = \sqrt{\frac{\sum{(x_i - \bar{x})^2}}{n-1}}. \]

  • Find the mean \( \bar{x} \) of the numbers.
  • Subtract the mean from each number \( x_i \), then square these “deviations”.
  • Add up the squared deviations.
  • Average them (almost!) by dividing the sum by how many there are, MINUS ONE.
  • Take the square root of this “almost-average.”

Combining Mean and SD

  mean    sd
 105.9 20.88

“The typical GC student drives about 105.9 mph, give or take 20.9 mph or so.”

(The SD is a “give-or-take” figure.)

Computing SD: Small Example

A small dataset:

FakeData <- c(2,4,7,9,10)
FakeData
[1]  2  4  7  9 10

First, get the mean of the data:

mean(FakeData)
[1] 6.4

Computing the SD: Deviations

Subtract mean from each data value, to get the deviations from the mean:

   x deviations
1  2       -4.4
2  4       -2.4
3  7        0.6
4  9        2.6
5 10        3.6

Squared Deviations

We don't care about “above” or “below”, we only care about how far away, so square the deviations:

   x deviations squared.devs
1  2       -4.4        19.36
2  4       -2.4         5.76
3  7        0.6         0.36
4  9        2.6         6.76
5 10        3.6        12.96

The square is never negative!
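The deviation tables can be rebuilt in a few lines of base R. This is one possible sketch; the names `deviations` and `dev_table` are chosen for this example.

```r
FakeData <- c(2, 4, 7, 9, 10)

# Deviation = data value minus the mean of all the data
deviations <- FakeData - mean(FakeData)

# Collect the values, deviations, and squared deviations in one table
dev_table <- data.frame(
  x            = FakeData,
  deviations   = deviations,
  squared.devs = deviations^2
)
print(dev_table)

# The deviations themselves always add up to zero (up to rounding error),
# which is one reason we square them before combining them
print(sum(deviations))
```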

"Almost Averaging"

(19.36+5.76+0.36+6.76+12.96)/(5-1)
[1] 11.3

The result is called the sample variance.

"Make Up" for Squaring

To (sort of) make up for squaring the deviations, we now take the square root of the variance:

sqrt(11.3)
[1] 3.362

The result is the sample standard deviation!
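As a check, the step-by-step computation can be compared with R's built-in var() and sd() functions. A minimal sketch, with `manual_var` and `manual_sd` as names chosen for this example:

```r
FakeData <- c(2, 4, 7, 9, 10)
n <- length(FakeData)

# Sum of squared deviations, divided by n - 1 (the "almost average")
manual_var <- sum((FakeData - mean(FakeData))^2) / (n - 1)

# Square root of the variance
manual_sd <- sqrt(manual_var)

# Compare with the built-in functions
print(c(manual_var = manual_var, builtin_var = var(FakeData)))
print(c(manual_sd = manual_sd, builtin_sd = sd(FakeData)))
```

Both pairs should agree, since var() and sd() also divide by n - 1.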

Describing the Center: The Median

FakeData <- c(2,4,7,9,10)
median(FakeData)
[1] 7

To find the median:

  1. sort the data from smallest to largest
  2. look in the “middle” of the list:
    • if \( n \) is odd, the median is the middle value
    • if \( n \) is even, the median is the average of the two values closest to the middle
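The two steps above can be sketched as a small function. `manual_median` is a hypothetical helper written for this example; in practice you would simply call median().

```r
# Hypothetical helper: median by sorting and looking in the middle
manual_median <- function(x) {
  x <- sort(x)        # step 1: sort from smallest to largest
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                    # n odd: the middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])      # n even: average of the two middle values
  }
}

odd_result  <- manual_median(c(2, 4, 7, 9, 10))  # five values
even_result <- manual_median(c(2, 4, 7, 9))      # four values
print(odd_result)
print(even_result)
```

With four values there is no single middle value, so the function averages 4 and 7.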

Quantiles

Also called percentiles.

with(m111survey,
  quantile(fastest,
    probs=c(0.2,0.5,0.8,0.9)))
20% 50% 80% 90% 
 90 102 120 130 
  • About 20% of the students drove slower than 90 mph
  • About 50% drove slower than 102 mph (median!)
  • About 80% drove slower than 120 mph
  • About 90% drove slower than 130 mph

Quartiles

with(m111survey,
  quantile(fastest,
    probs=c(0.25,0.50,0.75)))
  25%   50%   75% 
 90.5 102.0 119.5 
  • 25th percentile is the First Quartile Q1
  • 50th percentile is the median
  • 75th percentile is the Third Quartile Q3

Describing the Spread: IQR

The interquartile range is

\[ IQR=Q3-Q1. \]

The middle 50% of the data lie between Q1 and Q3.
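A minimal sketch of the IQR computation in base R, using a small made-up dataset rather than m111survey. The name `manual_iqr` is chosen for this example, and quantile() is left at its default method.

```r
FakeData <- c(2, 4, 7, 9, 10)

# First and third quartiles
quartiles <- quantile(FakeData, probs = c(0.25, 0.75))
print(quartiles)

# IQR = Q3 - Q1
manual_iqr <- unname(quartiles["75%"] - quartiles["25%"])
print(manual_iqr)

# Base R's IQR() function does the same subtraction
print(IQR(FakeData))
```

Here Q1 is 4 and Q3 is 9, so the IQR is 5.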

Combining Median and IQR

 min   Q1 median    Q3 max
  60 90.5    102 119.5 190
  • “The median fastest speed driven was 102 mph.”
  • “The IQR was \( 119.5-90.5=29 \) mph.”
  • “The middle half of the students drove between 90.5 and 119.5 mph.”

Next Topic

Part 2 will begin with tools for describing the shape of the distribution of a numerical variable.