Describing Patterns in Data (Pt. 2)

Homer White, Georgetown College

In Part 2:

Load Packages

Always make sure the necessary packages are loaded:

require(mosaic)
require(tigerstats)

One Numerical Variable

(Graphical Tools)


Graph Tool: Histogram

[Figure: frequency histogram of fastest speed driven (mph)]

A frequency histogram.

  • 30 people drove between 80 and 100 mph.
  • One person drove between 190 and 200 mph.

Graph Tool: Histogram

[Figure: relative frequency histogram of fastest speed driven (mph)]

A relative frequency histogram.

  • 42% drove between 80 and 100 mph.
  • 7% drove between 60 and 80 mph.
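The two vertical scales differ only by a division. As a rough sketch in base R, the snippet below bins a small made-up vector (the `speeds` values are hypothetical, not the m111survey data) and turns the counts into relative frequencies by dividing by the number of observations.

```r
# Hypothetical speeds in mph (illustrative only; not taken from m111survey)
speeds <- c(65, 72, 85, 88, 90, 95, 99, 105, 110, 130)

# Bin into intervals of width 20 without drawing anything
h <- hist(speeds, breaks = seq(60, 140, by = 20), plot = FALSE)

counts   <- h$counts                 # heights in a frequency histogram
rel_freq <- counts / length(speeds)  # heights in a relative frequency histogram

# One row per interval, showing both scales side by side
data.frame(lower    = head(h$breaks, -1),
           upper    = tail(h$breaks, -1),
           count    = counts,
           rel_freq = rel_freq)
```

The shape of the histogram is the same on either scale; only the labels on the vertical axis change.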

Graph Tool: Histogram

[Figure: density histogram of fastest speed driven (mph)]

A density histogram.

  • The area of each rectangle gives the proportion of values in its range.
  • The total area is 1 (100%).

Density Histogram

How this works:

  • The rectangle from 80 to 100 mph has base \( 100-80=20 \).
  • The height of the 80-100 rectangle is about 0.021.
  • The proportion driving between 80 and 100 mph is:

\[ \text{base} \times \text{height} = 20 \times 0.021 \approx 0.42. \]

  • So, about 42% drove between 80 and 100 mph.
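The same base-times-height arithmetic can be checked for every rectangle at once. This is a minimal sketch in base R, using a made-up vector (`speeds` is hypothetical, not the m111survey data) and `hist()` with `plot = FALSE` so that only the numbers are computed.

```r
# Hypothetical speeds in mph (illustrative only; not taken from m111survey)
speeds <- c(65, 72, 85, 88, 90, 95, 99, 105, 110, 130)

h <- hist(speeds, breaks = seq(60, 140, by = 20), plot = FALSE)

base   <- diff(h$breaks)  # width of each rectangle (20 mph here)
height <- h$density       # height of each rectangle on the density scale

area <- base * height     # proportion of values in each interval
area
sum(area)                 # the areas add up to 1
```

Because each area is a proportion, the areas must sum to 1 no matter how the intervals are chosen.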

Making a Density Histogram

histogram(~fastest,
 data=m111survey,
 type="density",   # state the vertical scale explicitly
 xlab="speed (mph)",
 main="Fastest Speed")

[Figure: density histogram of fastest, as produced by the code above]

Graph Tool: Density Plot

[Figure: density plot]

Making a Density Plot

densityplot(~fastest,
        data=m111survey,
        xlab="speed (mph)",
        main="Fastest Speed")

The Plot

[Figure: density plot of fastest, as produced by the code above]

Describing Shape of a Numerical Distribution

Terminology

  • symmetric (mirror image of itself around a central vertical line)
  • skewed left (tail toward the lower values)
  • skewed right (tail toward the higher values)
  • unimodal (one major “hump”)
  • bimodal (two major “humps”)
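One way to get a feel for these terms is to simulate data with a known shape. The sketch below uses base R random generators; the sample sizes, the seed, and the variable names are arbitrary choices made for illustration. It also checks a common rule of thumb that is not stated above: the mean tends to be pulled toward the long tail, so comparing it with the median hints at the direction of skew.

```r
set.seed(1)  # arbitrary seed so the simulated samples are reproducible

right_skewed <- rexp(1000)                # long tail toward the higher values
left_skewed  <- -rexp(1000)               # mirror image: tail toward the lower values
symmetric    <- rnorm(1000)               # one hump, roughly bell-shaped
bimodal      <- c(rnorm(500, mean = -3),  # two well-separated humps
                  rnorm(500, mean = 3))

# The mean is usually pulled toward the long tail, so its position
# relative to the median gives a quick numerical hint about skew
mean(right_skewed) > median(right_skewed)
mean(left_skewed)  < median(left_skewed)
```

With mosaic loaded, a call such as `densityplot(~right_skewed)` should draw each simulated shape for a visual check.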


Unimodal, Left-Skewed

[Figure: a unimodal, left-skewed distribution]

Unimodal, Right-Skewed

[Figure: a unimodal, right-skewed distribution]

Bimodal, Right-Skewed

[Figure: a bimodal, right-skewed distribution]

Unimodal, Symmetric

[Figure: a unimodal, symmetric distribution]

This is often called “bell-shaped.”

Bimodal, Symmetric

[Figure: a bimodal, symmetric distribution]

Bimodal, Symmetric

[Figure: another bimodal, symmetric distribution]

An Imaginary Population

data(imagpop)

Some of the variables in imagpop:

     sex math income cappun kkardashtemp
1 female   no  40900 oppose            6
2 female   no  56100 oppose            1
3 female   no 108800 oppose            5
4 female   no  43100 oppose            3
5   male   no  15500 oppose           94
6   male   no  49800 oppose           77


Describing Kim Kardashian Temp

Numerical Approach:

favstats(~kkardashtemp,
         data=imagpop)
 min Q1 median Q3 max mean    sd     n
   0  7     62 93 100 50.4 41.76 10000

Describing Kim Kardashian Temp

Graphical Approach:

densityplot(~kkardashtemp,data=imagpop,
      xlab="Point Rating",
      main="Kim Kardashian Temp")

[Figure: density plot of kkardashtemp, as produced by the code above]

Describing Kim Kardashian Temp

  • center
  • spread
  • shape
  • any unusual features

So we say something like:

  • The mean rating is about 50.4, with a standard deviation of 41.76.
  • The distribution is roughly symmetric, but bimodal, with modes near 0 and 100.
  • People either love her or hate her!
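This is why shape has to be reported along with center and spread. A minimal sketch with simulated data (the `ratings` vector is hypothetical, not drawn from imagpop) shows a mean near the middle of the scale even though no rating is anywhere near it.

```r
set.seed(2)  # arbitrary seed so the simulated ratings are reproducible

# Hypothetical love-or-hate ratings (illustrative only; not the imagpop data):
# half of the values fall between 0 and 10, the other half between 90 and 100
ratings <- c(runif(500, 0, 10), runif(500, 90, 100))

mean(ratings)                      # lands near the middle of the scale
sd(ratings)                        # large, because values sit far from the mean
mean(ratings > 40 & ratings < 60)  # proportion of ratings near the mean
```

In a bimodal distribution like this one, the mean describes a "typical" rating that hardly anyone actually gives.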

Boxplots

A Special Graphical Tool


Boxplots

ImaginaryData <- c(7.1,7.3,7.5,8.2,8.5,9.1,9.5,
          9.8,9,9,9.9,10,10.5,11)
bwplot(~ImaginaryData,xlab="x",
  main="Example Boxplot")

[Figure: boxplot of ImaginaryData, as produced by the code above]
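A boxplot is built from a five-number summary: the box runs between the lower and upper quartiles, the median is marked inside the box, and the whiskers reach out toward the extremes. As a sketch, base R's `fivenum()` computes these values for the example data. It uses Tukey's hinges as the quartiles, which may differ slightly from the quartiles other functions report.

```r
# Same values as ImaginaryData in the example above
ImaginaryData <- c(7.1,7.3,7.5,8.2,8.5,9.1,9.5,
          9.8,9,9,9.9,10,10.5,11)

# Tukey's five-number summary:
# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ImaginaryData)
five
```

None of these values lies far enough from the box to be flagged as an outlier, so the whiskers extend all the way to the minimum and the maximum.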

Boxplots Detect Outliers

bwplot(~height,data=m111survey,
       main="Height at GC",
       xlab="height (inches)")

[Figure: boxplot of height, as produced by the code above]
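The points drawn individually beyond the whiskers are the suspected outliers. The sketch below assumes the default rule in base R's `boxplot.stats()`, which flags any value more than 1.5 box-lengths beyond either end of the box. The `heights` vector is made up for illustration and is not the m111survey data.

```r
# Hypothetical heights in inches (illustrative only; not taken from m111survey)
heights <- c(51, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 82)

bs <- boxplot.stats(heights)  # default coef = 1.5

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond the whiskers, drawn as individual points
```

Here the box runs from 64.5 to 70.5, so its length is 6 and anything below 55.5 or above 79.5 is flagged. That catches 51 and 82, and the whiskers stop at 62 and 72.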

Boxplots Detect Skewness

[Figure: boxplot illustrating skewness]

Boxplots Miss "Crowding"

[Figure: violin plot of kkardashtemp]
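A boxplot shows only five summary values, so it cannot reveal where the data pile up inside the box. One possible remedy is to overlay a violin plot, which is a mirrored density curve. The sketch below adapts the pattern from the lattice documentation for `panel.violin()`. The `ratings` data are simulated for illustration and are not the imagpop values.

```r
library(lattice)  # provides bwplot(), panel.violin(), and panel.bwplot()

set.seed(4)  # arbitrary seed so the simulated ratings are reproducible

# Hypothetical love-or-hate ratings (illustrative only; not the imagpop data)
ratings <- c(runif(500, 0, 10), runif(500, 90, 100))

# Draw a violin (mirrored density curve) with a narrow boxplot on top of it
p <- bwplot(~ratings,
            xlab = "Point Rating",
            panel = function(..., box.ratio) {
              panel.violin(..., col = "transparent", box.ratio = box.ratio)
              panel.bwplot(..., fill = NULL, box.ratio = 0.1)
            })

print(p)  # lattice plots must be printed explicitly inside a script
```

The box by itself would stretch across most of the scale and hide the fact that the middle is nearly empty. The violin should make the crowding at the two ends visible.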