
individuals
 Individuals can be people,
 animals, plants, or any object of interest.

variable
 any characteristic of an
 individual. A variable varies
 among individuals.

distribution
 tells us what values the
 variable takes and how often it takes these values.

quantitative variable
 Something that takes
 numerical values for which arithmetic operations, such as adding and averaging,
 make sense.

categorical variable
 A variable that places each individual into
 one of several categories. What can be summarized is the count or proportion of
 individuals in each category.
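
The count and proportion in each category can be tallied directly. The sketch below is only an illustration; the blood-type data are made up for the example.

```python
from collections import Counter

# Hypothetical sample: the blood type recorded for eight individuals.
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "O"]

counts = Counter(blood_types)          # count of individuals per category
n = len(blood_types)
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, proportion={proportions[category]:.3f}")
```

These counts or proportions are exactly what a bar graph or pie chart displays.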

Ways to chart categorical data
Bar graphs and pie charts

Bar graphs
 Each category is
 represented by one bar. The bar’s height shows the count (or sometimes the
 percentage) for that particular category

Pie charts
 Each slice represents a piece of one whole. The size of
 a slice depends on what percent of the whole this category represents.

Ways to chart quantitative data
Histograms, stemplots, and line graphs (time plots)

Line graphs: time plots
 Use when there is a
 meaningful sequence, like time. The line connecting the points helps emphasize
 any change over time

Histograms and stemplots
 These are summary graphs
 for a single variable. They are very useful to understand the pattern of
 variability in the data

Histograms
  The range of values that a
 variable can take is divided into equal size intervals.
  The histogram shows the
 number of individual data points that fall in each interval.
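
A minimal sketch of the counting behind a histogram, assuming made-up exam scores and a hypothetical helper named histogram_counts. It divides the range into equal-width intervals and counts the data points in each one.

```python
def histogram_counts(values, low, width, n_bins):
    """Count the values falling in each of n_bins equal-width intervals.

    Interval i covers [low + i*width, low + (i+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical exam scores.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 91]
counts = histogram_counts(scores, low=50, width=10, n_bins=5)

# Text version of the histogram: one row per interval, one star per data point.
for i, c in enumerate(counts):
    start = 50 + 10 * i
    print(f"{start}-{start + 9} | {'*' * c}")
```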

stem plots
 To compare two related distributions, a back-to-back stem plot with common
 stems is useful.
 Stem plots do not work well for large
 datasets.
 When the observed values have too many
 digits, trim the numbers before making
 a stem plot.
 When plotting a moderate number of
 observations, you can split each
 stem.
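
A rough sketch of a basic stem plot, assuming two-digit whole numbers so that the tens digit is the stem and the ones digit is the leaf. The function name stem_plot and the ages are hypothetical; splitting stems and back-to-back plots are not handled here.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the lines of a stem plot for non-negative integers.

    Stem = all digits except the last; leaf = the last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Keep empty stems so that gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:2d} | {leaf_text}")
    return lines

# Hypothetical ages of nine individuals.
ages = [23, 25, 31, 34, 34, 38, 42, 47, 61]
lines = stem_plot(ages)
print("\n".join(lines))
```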

Interpreting histograms
We can describe the overall pattern of a histogram by its shape, center, and spread.

A distribution is symmetric if ...
 the right and left sides
 of the histogram are approximately mirror images of each other.

A distribution is skewed to the right
 if the right side of
 the histogram (side with larger values) extends much farther out than the left
 side

skewed to the left
 if the left side of
 the histogram extends much farther out than the right side

An important kind of deviation is an outlier
 Outliers are observations that lie outside the overall
 pattern of a distribution

A trend is
a rise or fall that persists over time, despite small irregularities.

seasonal variation
A pattern that repeats itself at regular intervals of time

mean
 add all values, then divide by the number of individuals. It is the “center of
 mass.”

median
 the midpoint of a
 distribution—the number such that half of the observations are smaller and half are larger

Comparing the mean and the median
 The
 mean and the median are equal when the distribution is exactly symmetric. In a
 skewed distribution the mean is pulled farther toward the long tail. The
 median is a measure of center that is resistant to skew and outliers. The mean
 is not.
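
A small illustration of resistance, using made-up salary figures. Adding one extreme value moves the mean a long way but barely moves the median.

```python
import statistics

# Hypothetical salaries in thousands of dollars.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]     # add one extreme observation

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean={mean_before}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```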

first quartile, Q1
 the value
 in the sample that has 25% of the data at or below it

third quartile, Q3
 the value
 in the sample that has 75% of the data at or below it
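
Quartiles can be computed in several slightly different ways, and software packages do not all agree. The sketch below assumes one common textbook convention: Q1 is the median of the observations below the overall median, and Q3 is the median of the observations above it. The function name quartiles and the data are hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    With an odd number of observations, the middle one is left out of both halves.
    Needs at least two observations.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

q1, med, q3 = quartiles([2, 4, 5, 7, 8, 10, 12, 15, 18])
print(q1, med, q3)
```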

1.5 * IQR rule for outliers
 An observation is a suspected outlier if it falls more than 1.5
 times the size of the interquartile range (IQR = Q3 - Q1) above the third
 quartile or below the first quartile
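
A hedged sketch of the rule in code. It flags values lying more than 1.5 * IQR beyond the quartiles, that is, below Q1 or above Q3 by more than that amount. It assumes Python's statistics.quantiles with the "inclusive" method for the quartiles; other quartile conventions can give slightly different fences. The function name iqr_outliers and the data are made up.

```python
import statistics

def iqr_outliers(values):
    """Return the values flagged as suspected outliers by the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr       # anything below this is flagged
    high_fence = q3 + 1.5 * iqr      # anything above this is flagged
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data with one unusually large value.
data = [10, 12, 13, 14, 15, 16, 18, 40]
flagged = iqr_outliers(data)
print(flagged)
```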


standard deviation s
 s measures spread about the mean and should be
 used only when the mean is the measure of center.
 s = 0 only when all observations have the same
 value and there is no spread.
 Otherwise, s > 0.
 s is not resistant to outliers.
 s has the same units of measurement as the
 original observations.
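
A minimal sketch of the calculation, assuming s is the usual sample standard deviation with n - 1 in the denominator (the notes do not spell out the formula). The function name sample_sd and the data are hypothetical.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: square root of the sum of squared
    deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]
s = sample_sd(data)
print(f"s = {s:.3f}")                      # same units as the data
print(f"library value = {statistics.stdev(data):.3f}")
print(sample_sd([3, 3, 3]))                # no spread, so s is 0
```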

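linear transformation
 A linear transformation (x_new = a + b*x) does not change the basic shape of a
 distribution (skew, symmetry, multimodality). It does change the measures of
 center and spread: multiplying by b multiplies the measures of center by b and
 the measures of spread by |b|; adding a shifts the measures of center by a and
 leaves the measures of spread unchanged.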
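
As an illustration, converting made-up temperatures from Celsius to Fahrenheit is a linear transformation of the form a + b*x with a = 32 and b = 1.8. The mean is shifted and rescaled, while the standard deviation is only rescaled.

```python
import statistics

# Hypothetical temperatures in degrees Celsius.
celsius = [10, 15, 20, 25, 40]
a, b = 32, 1.8
fahrenheit = [a + b * c for c in celsius]

mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)

print(f"mean: {mean_c} C -> {mean_f:.1f} F (a + b * mean = {a + b * mean_c:.1f})")
print(f"sd:   {sd_c:.2f} C -> {sd_f:.2f} F (b * sd = {b * sd_c:.2f})")
```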

density curve
 The
 total area under the curve, by definition, is equal to 1, or 100%.
 The area under the
 curve for a range of values is the proportion of all observations for that
 range

median of a density curve is
the equal-areas point,
 the point that divides the area under the curve in half.

mean of a density curve is
the balance point
at which the curve would balance if it were made of solid material.
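
The ideas of area, median, and mean for a density curve can be checked numerically. The sketch below assumes a simple made-up density, f(x) = 2x on the interval from 0 to 1, and approximates areas with the midpoint rule. The function names are hypothetical.

```python
def density(x):
    """Hypothetical density curve: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def area(f, a, b, steps=10_000):
    """Approximate the area under f between a and b (midpoint rule)."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total = area(density, 0, 1)                    # whole curve: should be about 1
share_below_half = area(density, 0, 0.5)       # proportion of observations below 0.5
median = 0.5 ** 0.5                            # equal-areas point for this curve
area_below_median = area(density, 0, median)   # should be about one half
mean = area(lambda x: x * density(x), 0, 1)    # balance point

print(f"total area = {total:.4f}")
print(f"proportion below 0.5 = {share_below_half:.4f}")
print(f"median = {median:.4f}, area below it = {area_below_median:.4f}")
print(f"mean = {mean:.4f}")
```

For this left-skewed curve the mean (about 0.667) lies below the median (about 0.707), pulled toward the long tail.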

Normal – or Gaussian –
distributions
 a family of symmetrical,
 bell-shaped density curves defined by a mean μ (mu) and a standard deviation
 σ (sigma): N(μ, σ).

z-score
 measures the number of standard deviations that a data value x
 is from the mean μ: z = (x - μ) / σ.
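
A short sketch of the calculation: subtract the mean and divide by the standard deviation. The function name z_score and the numbers (mean 100, standard deviation 15) are made up for the example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and standard deviation 15.
print(z_score(130, mu=100, sigma=15))   # two standard deviations above the mean
print(z_score(85, mu=100, sigma=15))    # one standard deviation below the mean
```

A positive z-score means the value is above the mean; a negative one means it is below.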

