
statistics
 getting information out of numerical data obtained from an experiment or from a sample
 creating the experiment or sampling procedure, collecting and analyzing data, and making inferences (statements) about the population

descriptive statistics
methods for organizing, displaying, and describing data by using tables, graphs, and summary measures

inferential statistics
methods that use sample results to help make inferences (decisions or predictions) about a population

data analysis
process of describing data using graphs and numerical summaries

individuals
objects described by a set of data; may be people, animals, or things

variables
any characteristic of an individual

categorical variable
places an individual into one of several groups or categories; can be numerical in some cases (zip codes, classes of age)

quantitative variable
takes numerical values for which it makes sense to find an average; the unit should always be specified

distribution
tells what values a variable takes and how often it takes these values

inference
drawing conclusions that go beyond the data at hand

frequency table
displays the count (frequency) of observations in each category or class

relative frequency table
shows the percents (relative frequencies) of observations in each category or class
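
As a minimal sketch of the two tables above, the following builds a frequency table and a relative frequency table from a small list of responses. The `responses` data is hypothetical and exists only for illustration.

```python
from collections import Counter

# Hypothetical sample: favorite color reported by 10 students.
responses = ["red", "blue", "blue", "green", "red",
             "blue", "green", "blue", "red", "blue"]

# Frequency table: count of observations in each category.
frequency = Counter(responses)

# Relative frequency table: each count as a percent of the total.
total = sum(frequency.values())
relative_frequency = {category: 100 * count / total
                      for category, count in frequency.items()}

for category in frequency:
    print(f"{category:<6} {frequency[category]:>3} "
          f"{relative_frequency[category]:>6.1f}%")
```

The relative frequencies should add up to 100%, apart from any roundoff error in the displayed values.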

roundoff error
the difference between the calculated approximation of a number and its exact mathematical value

pie chart
 shows the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories
 must include all of the categories that make up the whole

when can you not use pie charts
 if you don't have all the categories that make up the whole
 if you're dealing with individuals that represent a category (e.g. 10-12 yrs) since those are different groups, not part of a whole

bar graph
used to display the distribution of a categorical variable or to compare the sizes of different quantities. The categories or quantities being compared are on the horizontal axis. Has blank spaces between the bars.

how can graphs be misleading
 bars with different widths
 x-axis and y-axis intervals

two-way table
table of counts that organizes data about two categorical variables

marginal distribution
 distribution of values of one of the categorical variables in a two-way table among all of the individuals described in the table
 in a two-way table, found by calculating percentages from the row or column totals for one variable
 says nothing about the relationship between the two variables

conditional distribution
 describes the values of one variable among individuals who have a specific value of another variable
 percentages calculated within a single row or column of a two-way table
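
To make the difference between marginal and conditional distributions concrete, here is a rough sketch using a small table of counts. The table, its labels, and the variable names are all hypothetical.

```python
# Hypothetical two-way table of counts: rows are class year,
# columns are whether the student has a part-time job.
table = {
    "junior": {"job": 30, "no job": 70},
    "senior": {"job": 60, "no job": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of class year: row totals as percents of everyone.
marginal_year = {year: 100 * sum(row.values()) / grand_total
                 for year, row in table.items()}

# Conditional distribution of job status given class year:
# each cell as a percent of its own row total.
conditional_job_given_year = {
    year: {status: 100 * count / sum(row.values())
           for status, count in row.items()}
    for year, row in table.items()
}

print(marginal_year)
print(conditional_job_given_year)
```

In this made-up table the marginal distribution of class year is an even split, while the conditional distributions differ between juniors and seniors, which is the kind of pattern that suggests an association.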

segmented bar graph
 compares the distribution of a categorical variable in each of several groups. There is a bar for each group with segments that correspond to the different values of the categorical variable.
 height of each segment is determined by the percent of individuals in the group with that value, each bar has a total height of 100%

four steps to answer a statistics problem
 STATE the question you want to answer
 PLAN how you will answer the question and which statistical techniques the problem requires
 DO make graphs and calculate stuff
 CONCLUDE be practical given the setting of the real-world problem

side-by-side bar graph
 used to compare the distribution of a categorical variable in each of several groups. For each value of the categorical variable, there is a bar corresponding to each group.
 height of each bar is determined by the count or percent of individuals in the group with that value

association
occurs between two variables if specific values of one variable tend to occur in common with specific values of the other

qualitative data
values of categorical data

dotplot
a simple graph that shows each data value as a dot above its location on a number line

overall pattern
 in any graph of data, this can be described by the direction, form, and strength of the relationship
 SOCS: shape, outliers, center, and spread

center
the midpoint/median represents the typical value, and the calculated mean is the average

spread
indicates the variability of the data, includes the maximum and minimum values and the range

range
maximum - minimum values

outlier
an observation that lies outside the overall pattern of other observations

residuals
an observation that is an outlier in the y direction, but not the x direction, has a large residual

shape
 number of peaks (modes)
 skewed results or symmetry
 number of clusters + gaps

mode
the value or class in a statistical distribution having the greatest frequency

unimodal
describes a graph of quantitative data with a single peak

bimodal
describes a graph of quantitative data with two clear peaks

multimodal
describes a graph of quantitative data with more than two clear peaks

symmetry
left and right sides of the graph are approximately mirror images of each other

skewed to the right
right side of the graph is much longer than the left side, tail is extended to the right

skewed to the left
left side of the graph is much longer than the right side, tail is on the left

stemplot
observations are separated into stems (all but the final digit) and leaves (the final digit); stems are arranged in a vertical column, and each stem's leaves are written beside it in increasing order out from the stem
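
One way to see how stems and leaves fit together is to build a text stemplot. This is only a sketch: the function name `stemplot` is made up, and it assumes a non-empty list of non-negative whole numbers.

```python
from collections import defaultdict

def stemplot(values):
    """Return text lines of a stemplot for non-negative whole numbers."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # stem: all but the final digit
        leaves_by_stem[stem].append(leaf)
    lines = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical data set.
for line in stemplot([12, 15, 21, 21, 28, 43]):
    print(line)
```

Keeping the empty stem 3 in the output preserves the gap in the distribution, which matters when describing shape.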

splitting stems
 a method for spreading out a stemplot that has too few stems
 should use asterisks (e.g. 5* and 5**)

back-to-back stemplot
used to compare the distribution of a quantitative variable for two groups; the leaves for one group go on one side of the shared stem and the leaves for the other group go on the other side

truncate
removing one or more digits from a value if it has too many digits, like in creating stemplots

histogram
type of bar graph without spaces between bars that displays the frequency or relative frequency of each class of a quantitative variable; horizontal axis shows the classes of the variable, vertical axis has the scale of counts/percents; does not preserve the raw data because it has been grouped into classes

time plots
used to show bivariate (2-variable) quantitative data where the independent variable (x) represents time

independent/dependent variable on graph axes
 dependent = y-axis
 independent = x-axis


mean
arithmetic average, nonresistant measure, represents the value each observation would have if the total were split equally among all observations

resistant measure
statistic that is not affected very much by extreme observations

median
midpoint M of a distribution, half the observations are smaller than this and half are larger, represents typical value, resistant measure

median position formula
 position = (n + 1)/2, where n = number of observations in the data set
 after arranging the data in increasing order, count this many observations in from either end to find the median
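
The counting procedure can be sketched as follows, assuming the standard position formula (n + 1)/2 and a non-empty data set. The function name `median_by_position` is hypothetical.

```python
def median_by_position(values):
    """Find the median by counting (n + 1) / 2 positions into sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position of the median
    if position.is_integer():
        return ordered[int(position) - 1]
    # Position falls halfway between two observations: average them.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median_by_position([7, 1, 3, 5, 9]))   # n = 5, position 3
print(median_by_position([4, 1, 3, 2]))      # n = 4, position 2.5
```

When n is even the position ends in .5, so the median is the average of the two middle observations.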

mean > median
right skewed


mean < median
left skewed
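
A small illustration of why the mean and median separate in skewed data, using a hypothetical right-skewed data set with one extreme value:

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is very large.
incomes = [30, 32, 35, 38, 40, 42, 45, 250]

skewed_mean = mean(incomes)        # pulled upward by the extreme value
skewed_median = median(incomes)    # barely affected by it

# Drop the extreme value to see how much each measure moves.
without_extreme = incomes[:-1]
mean_shift = skewed_mean - mean(without_extreme)
median_shift = skewed_median - median(without_extreme)

print(skewed_mean, skewed_median)
print(mean_shift, median_shift)
```

The mean moves a great deal when the extreme value is removed and the median hardly moves, which is what it means for the mean to be nonresistant and the median to be resistant.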

mode
value that occurs the most

68-95-99.7 Rule aka Empirical Rule
in a bell-shaped distribution, 68% of the data lies within one standard deviation of the mean, 95% lies within two standard deviations of the mean, and 99.7% lies within three standard deviations of the mean
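
The three percentages can be checked against a Normal model. This sketch assumes that "bell-shaped" means exactly Normal, which real data only approximate.

```python
from statistics import NormalDist

# Standard Normal model: mean 0, standard deviation 1 (an assumption).
bell = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
shares = {k: bell.cdf(k) - bell.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The exact Normal values are slightly different from 68, 95, and 99.7, which are rounded for easy recall.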

interquartile range (IQR)
 measures the range of the middle 50% of the data, resistant measure
 IQR = Q_{3} - Q_{1}

first quartile
median of observations to the left of the median

third quartile
median of observations to the right of the median

percentile implication
95th percentile means that 95% of the population got that score or lower
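
Following the "that score or lower" reading above, a percentile can be sketched as the percent of observations at or below a given value. The function name and the scores are hypothetical, and some textbooks count only values strictly below the score.

```python
def percentile_of(score, scores):
    """Percent of observations less than or equal to `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical test scores.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_of(90, scores))
```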

IQR rule for calculating outliers
an observation is an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile

how to use IQR to calculate bottom cutoff value
Q_{1} - 1.5 x IQR

how to use IQR to calculate top cutoff value
Q_{3}+1.5 x IQR
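
The two cutoff values can be combined into a rough outlier check. This sketch takes the quartiles as medians of the lower and upper halves of the sorted data, and it assumes at least two observations. The function names and data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the halves on either side of the median."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]   # skips the middle value when n is odd
    return median(lower_half), median(upper_half)

def iqr_outliers(values):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [v for v in values if v < low_cutoff or v > high_cutoff]

# Hypothetical data set with one unusually large value.
data = [10, 12, 13, 14, 15, 16, 18, 40]
print(quartiles(data))
print(iqr_outliers(data))
```

Software packages often compute quartiles with slightly different conventions, so cutoffs from a calculator or library may not match this hand method exactly.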

standard deviation
 measure of spread that looks at how far observations are from the mean; typical observations fall within about one standard deviation above or below the mean; nonresistant measure
 standard deviation of 0 indicates no variability, greater when observations are more spread out

degrees of freedom
(n - 1), one less than the number of observations

variance
S_{x}^{2}, the average squared distance of the observations in a data set from their mean

standard deviation formula
S_{x} = \sqrt{\frac{1}{n-1}\sum (x_{i} - \bar{x})^{2}}

how to calculate variance and standard deviation
 find the mean of the data, find the deviations of the observations from the mean, square them, and add them up; then divide by the degrees of freedom (n - 1) to find the variance
 to find standard deviation, take the square root of variance
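
The steps above can be sketched directly in code. The function names and the data are hypothetical, and the sketch assumes at least two observations so that n - 1 is not zero.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))
print(sample_std_dev(data))
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as a check on hand calculations.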

five-number summary
 minimum, first quartile, median, third quartile, maximum
 gives a summary of both center and spread, roughly divides the distribution into quarters

boxplot
graphs the five-number summary, box spans the quartiles and whiskers extend to the min/max values, center line represents median

modified boxplots
boxplots that always show the outliers as dots

side-by-side boxplots
show the boxplots next to each other using the same scale, used to compare distributions of two data sets

detecting skewness in boxplots
the distribution is skewed toward the side with the longer whisker; a larger difference in whisker lengths means a more strongly skewed distribution

detecting range and IQR in boxplots
range is represented by full length of boxplot, IQR is represented by length of box

options for measuring center and spread, resistant or nonresistant
 median and IQR are resistant, use when analyzing skewed data and/or outliers
 average and standard deviation are nonresistant and sensitive to skewed results and outliers

sigma
Σ represents a summation, "add them up"


lower limit and upper limit
the numbers below and above a sigma, respectively; they represent the range of numbers you are plugging into i and adding up

summand
in sigma notation, what you're adding up (e.g. i^{2})

solution
in sigma notation, the answer that you solve for (your sum after you add everything up)
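
As a small worked example of the sigma vocabulary, the sum of i^{2} from i = 1 to 4 can be written as a loop over the limits. The specific limits here are chosen only for illustration.

```python
lower_limit = 1
upper_limit = 4

# Summand i^2, evaluated for i = 1, 2, 3, 4 and added up.
solution = sum(i ** 2 for i in range(lower_limit, upper_limit + 1))
print(solution)  # 1 + 4 + 9 + 16 = 30
```

Note that `range` stops one short of its second argument, so the upper limit needs the `+ 1` to be included.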

frequency table categories
class and count

relative frequency table categories
class and percent

frequency histogram; relative frequency histogram




