
What is Statistics?
Methods for Collecting, Describing, Analyzing and Drawing Conclusions from Data

Population
 total set of all subjects of interest
 the entire group of people, animals or things about which we want information

Elementary Unit
any individual member of the population

Sample
 subset of the population from which the study actually collects information
 used to draw conclusions about the whole population

Variable
 a characteristic of a unit that can vary among subjects in the population/sample
 Examples: gender, nationality, age, income, hair color, height, disease status, company rating, grade in STA 291, state of residence

Sampling Frame
listing of all the units in the population

Parameter
 numerical characteristic of the population
 calculated using the whole population

Statistic
 numerical characteristic of the sample
 calculated using the sample

Why not measure all of the units in the population?
 Accuracy: may not be able to list them all; may not be able to come up with a frame
 Time: Speed of Response
 Expense: Cost
 Infinite Population
 Destructive Sampling or Testing

Descriptive Statistics
Summarizing the information in a collection of data

Inferential Statistics
Using information from a sample to make conclusions/predictions about the population

Univariate data set
Consists of observations on a single attribute

Multivariate data
Consists of observations on several attributes

Special case: Bivariate data
Two attributes collected per observation

Nominal variables
 have a scale of unordered categories
 Examples: gender, nationality, hair color, eye color (it doesn't make sense to say that green eyes are greater/higher/better than brown eyes)

Ordinal variables
 have a scale of ordered categories; often treated in a quantitative manner
 Examples: disease status, company rating, grade in STA 291

Qualitative variables
 categorical (not numerical)
 Nominal and Ordinal

Quantitative variables
measured numerically, that is, for each subject a number is observed

interval scale
the scale for quantitative variables

Discrete variables
 have a finite (or countably infinite) number of possible values
 all qualitative (categorical) variables are ~
 only some quantitative (numeric) variables are ~

Continuous variables
 can take all the values in a continuum of real values
 Examples: time, distance, volume, speed, (usually physical measures)

Simple Random Sample
 Each possible sample has the same probability of being selected
 The sample size is usually denoted by "n"
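
A minimal sketch of drawing a simple random sample, assuming the sampling frame is available as a Python list. The frame of unit IDs 1 to 100 and the helper name `simple_random_sample` are made up for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every possible sample of size n is equally likely."""
    rng = random.Random(seed)          # a seed only makes the draw repeatable
    return rng.sample(list(frame), n)  # sampling without replacement

frame = list(range(1, 101))            # hypothetical frame: unit IDs 1..100
sample = simple_random_sample(frame, n=10, seed=0)
print(sample)
```

Sampling without replacement matters here: no unit can appear twice, so the sample always holds n different members of the population.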

Convenience Sample
a sample made up of people who just happened to be there

Volunteer Sampling
 this sample will poorly represent the population
 will cause misleading conclusions
 BIAS
 people are much more likely to speak up if they feel strongly about the issue
 Examples: Mall interview, Street corner interview

Random Sample
even if it is smaller, it is much more trustworthy than a volunteer sample because it has less bias

Observational Study
 observes individuals and measures variables of interest but does not attempt to influence the responses
 passive data collection
 its purpose is to describe/compare groups or situations

Experiment
 deliberately imposes some treatment on individuals in order to observe their responses
 active data production
 its purpose is to study whether the treatment causes a change in the response

stratified sampling
 divide the population into separate, nonoverlapping groups ("strata")
 select a simple random sample independently (and usually proportionally) from each group
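
The two steps above can be sketched as follows, assuming proportional allocation (the same sampling fraction in every stratum). The strata names, their sizes, and the helper `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Take an independent simple random sample from each stratum.

    strata maps a stratum name to the list of its units; the groups must not overlap.
    """
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = round(len(units) * fraction)      # proportional allocation
        sample[name] = rng.sample(units, k)   # SRS within this stratum
    return sample

strata = {
    "freshman": list(range(0, 60)),    # 60 hypothetical units
    "senior": list(range(60, 100)),    # 40 hypothetical units
}
picked = stratified_sample(strata, fraction=0.1, seed=1)
print({name: len(units) for name, units in picked.items()})
```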

cluster sampling
 the population can be divided into a set of nonoverlapping subgroups (the clusters)
 the clusters are then selected at random, and all individuals in the selected clusters are included in the sample
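
A small sketch of cluster sampling, assuming the clusters are stored in a dictionary. The dorm names, the unit labels, and the helper `cluster_sample` are invented for the example.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every individual in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # random choice of clusters
    return [unit for name in chosen for unit in clusters[name]]

clusters = {
    "dorm_a": ["a1", "a2"],
    "dorm_b": ["b1", "b2", "b3"],
    "dorm_c": ["c1"],
}
sample = cluster_sample(clusters, n_clusters=2, seed=0)
print(sample)
```

Unlike stratified sampling, the randomness is in which groups are chosen, not in who is chosen within a group.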

systematic sampling
 an initial name is selected at random
 every Kth name is selected after that
 K is computed by dividing membership list length by the desired sample size
 not a simple random sample, but often almost as good as one
 useful when the population exists as a list
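
The procedure above might look like this in code. One assumption is labeled in the comments: the random start is drawn from the first K names, a common variant that keeps the sample at exactly n names. The membership list and the helper `systematic_sample` are hypothetical.

```python
import random

def systematic_sample(names, n, seed=None):
    """Select every K-th name after a random start, with K = list length // n."""
    if not 0 < n <= len(names):
        raise ValueError("n must be between 1 and the length of the list")
    k = len(names) // n                 # sampling interval K
    rng = random.Random(seed)
    start = rng.randrange(k)            # assumption: random start within the first K names
    return names[start::k][:n]          # every K-th name from the start

names = [f"member_{i:03d}" for i in range(100)]   # hypothetical membership list
sample = systematic_sample(names, n=10, seed=3)
print(sample)
```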

types of bias
 Selection Bias: selection of the sample systematically excludes some part of the population of interest
 Measurement/Response Bias: method of observation tends to produce values that systematically differ from the true value
 Nonresponse Bias: occurs when responses are not actually obtained from all individuals selected for inclusion in the sample

sampling error
 the error that occurs when a statistic based on a sample estimates or predicts the value of a population parameter
 in random samples, the sampling error can usually be quantified

nonsampling error
 any error that could also happen in a census
 Examples: bias due to question wording, question order, nonresponse, wrong answers (especially to delicate questions)

frequency distribution
 a listing of intervals of possible values for a variable AND a tabulation of the # of observations in each interval
 use intervals of same length (if possible)
 intervals must be mutually exclusive (any observation must fall into one and only one interval)
 rule of thumb: if you have n observations, the # of intervals should be about √n
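
A rough sketch of building a frequency distribution with equal-length, mutually exclusive intervals and about √n of them. The data values and the helper `frequency_distribution` are made up; the choice of half-open intervals [left, right) is an assumption, not something the notes prescribe.

```python
import math

def frequency_distribution(data, num_intervals=None):
    """Count observations in equal-width, mutually exclusive intervals."""
    n = len(data)
    if num_intervals is None:
        num_intervals = max(1, round(math.sqrt(n)))   # rule of thumb: about sqrt(n)
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_intervals
    counts = [0] * num_intervals
    for x in data:
        if x == hi:
            idx = num_intervals - 1     # the last interval also includes the maximum
        else:
            idx = min(int((x - lo) / width), num_intervals - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_intervals + 1)]
    return edges, counts

data = [1, 2, 2, 3, 5, 6, 7, 8, 9]      # hypothetical observations, n = 9
edges, counts = frequency_distribution(data)
print(edges, counts)
```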

Frequency, Relative Frequency, and Percentage Distribution
 frequency = # in interval
 relative frequency = frequency/total #
 percentage = relative frequency x 100%
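
The three quantities above can be computed directly from a list of observations. The grades below are hypothetical data.

```python
from collections import Counter

grades = ["A", "B", "B", "C", "A", "B", "D", "B"]   # hypothetical observations

freq = Counter(grades)                               # frequency = # in each class
total = sum(freq.values())
rel_freq = {g: c / total for g, c in freq.items()}   # relative frequency = frequency / total #
percent = {g: 100 * r for g, r in rel_freq.items()}  # percentage = relative frequency x 100%

for g in sorted(freq):
    print(g, freq[g], rel_freq[g], percent[g])
```

The relative frequencies always add up to 1, and the percentages to 100%.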

cumulative frequencies
# of observations that fall in the class and in smaller classes
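
As a short illustration, cumulative frequencies are running totals of the class frequencies, taken in order from the smallest class up. The counts are hypothetical.

```python
from itertools import accumulate

counts = [4, 2, 3]                       # hypothetical frequencies, smallest class first
cumulative = list(accumulate(counts))    # running totals
print(cumulative)
```

The last cumulative frequency always equals the total number of observations.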

histogram (interval data)
 use numbers from the frequency distribution to create a graph
 draw a bar over each interval, the height of the bar represents the relative frequency for that interval
 bars should be touching; i.e., equally extend the width of the bar at the upper and lower limits so that the bars are touching

bar graph (nominal/ordinal data)
 the bars are usually separated to emphasize that the variable is categorical rather than quantitative
 for nominal variables (no natural ordering), order the bars by frequency, except possibly for a category "other" that is always last
 for ordinal data, classes are presented in the natural order (A, B, C, ...)

stem and leaf plot
 write the observations ordered from smallest to largest
 each observation is represented by a stem (leading digit(s)) and a leaf (final digit)
 looks like a histogram turned sideways, but gives individual values
 contains more information than a histogram, because every single measurement can be recovered
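
A minimal sketch of a stem-and-leaf plot, assuming non-negative whole numbers with the tens as the stem and the final digit as the leaf. The scores and the helper `stem_and_leaf` are invented for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group ordered observations by stem (leading digits) and leaf (final digit)."""
    plot = defaultdict(list)
    for v in sorted(values):             # order from smallest to largest
        stem, leaf = divmod(v, 10)       # assumes non-negative integers
        plot[stem].append(leaf)
    return dict(plot)

scores = [72, 85, 91, 78, 85, 69, 93, 77]   # hypothetical data
plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Every observation can be rebuilt as 10 x stem + leaf, which is why no information is lost.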

describing distributions
 center, spread (numbers later)
 symmetric distributions: bell-shaped or U-shaped
 not symmetric distributions: left-skewed or right-skewed

contingency table
 number of subjects observed at all the combinations of possible outcomes for the 2 variables
 ~ are identified by their number of rows and columns: a table with 2 rows and 3 columns is called a 2x3 table
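
A small sketch of tabulating bivariate data into a contingency table. The two variables (a yes/no answer and an income level) and their values are hypothetical.

```python
from collections import Counter

# Hypothetical bivariate data: (answer, income level) for each subject
pairs = [("yes", "low"), ("yes", "high"), ("no", "low"),
         ("no", "medium"), ("yes", "low")]

counts = Counter(pairs)
rows = sorted({r for r, _ in pairs})     # outcomes of the first variable
cols = sorted({c for _, c in pairs})     # outcomes of the second variable

# table[i][j] = number of subjects with outcome rows[i] and outcome cols[j]
table = [[counts[(r, c)] for c in cols] for r in rows]

print("     ", cols)
for r, row in zip(rows, table):
    print(f"{r:5}", row)
```

With 2 row outcomes and 3 column outcomes, this is a 2x3 table.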

good graphics...
 present large data sets concisely and coherently
 can replace a thousand words and still be clearly understood and comprehended
 encourage the viewer to compare two or more variables
 do not replace substance by form
 do not distort what the data reveal
 have a high "data-to-ink" ratio

bad graphics...
 don't have a scale on the axis
 have a misleading caption
 distort by stretching/shrinking the vertical or horizontal axis
 use histograms or bar charts with bars of unequal width
 are more confusing than helpful

sampling variability
sample-to-sample differences

undercoverage
some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population

measures of central location or tendency
 Mean: arithmetic average
 Median: midpoint of the observations when they are arranged in increasing order
 Mode: most frequent value
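
The three measures can be compared on a small data set using Python's standard `statistics` module. The data are hypothetical and include one extreme value (40) to show how differently the measures react.

```python
import statistics

data = [2, 3, 3, 5, 7, 40]            # hypothetical data with one extreme value

mean = statistics.mean(data)          # arithmetic average: 60 / 6
median = statistics.median(data)      # n is even, so average the two middle values (3 and 5)
mode = statistics.mode(data)          # most frequent value

print(mean, median, mode)
```

Here the single large value pulls the mean well above the median.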

outliers
stragglers that stand off away from the body of the distribution

mean
 sample mean = xbar
 population mean = mu
  sometimes the mean is calculated for ordinal variables, but this doesn't always make sense (GPA = 3.8)
 it is highly influenced by outliers

median
 falls in the middle of the ordered sample; if n is even, average the 2 middle values
 for skewed distributions, it is a more appropriate measure of central tendency than the mean (better describes a "typical value")
 it may be too insensitive to changes in the data

trimmed mean
 compromise between the median and the mean
 1. order the data from smallest to largest
 2. delete a selected number of values from each end of the ordered list
 3. find the mean of the remaining values
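
The three steps above can be sketched as a small function. The helper name `trimmed_mean`, its `trim_count` parameter, and the data are assumptions made for illustration.

```python
def trimmed_mean(data, trim_count):
    """Mean after deleting trim_count values from each end of the ordered list."""
    ordered = sorted(data)                               # 1. order the data
    if 2 * trim_count >= len(ordered):
        raise ValueError("trimming would delete every value")
    kept = ordered[trim_count:len(ordered) - trim_count] # 2. delete from each end
    return sum(kept) / len(kept)                         # 3. mean of what remains

data = [2, 3, 3, 5, 7, 40]            # hypothetical data with one extreme value
result = trimmed_mean(data, trim_count=1)
print(result)
```

Trimming one value from each end removes both 2 and 40, so the extreme value no longer affects the result.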

trimming percentage
the percentage of values that have been deleted from each end of the ordered list when calculating the trimmed mean

mode
 the most frequently occurring value
 on a histogram it would be the highest bar
 it may not be unique

measures of dispersion of the data
 variance, standard deviation
 interquartile range
 range

percentiles
 50th percentile = median
 25th = lower quartile = Q_{1}
 75th = upper quartile = Q_{3}

interquartile range (IQR)
 the difference between upper and lower quartile
 IQR = Q_{3} - Q_{1}
 range of values that contains the middle 50% of the data
 IQR increases as variability increases
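
A short sketch of computing the quartiles and the IQR with `statistics.quantiles` from the Python standard library. Textbooks and software use slightly different quartile conventions, so the values from the default method here may not match a hand calculation exactly. The data are hypothetical.

```python
import statistics

data = [12, 15, 11, 20, 18, 14, 25]   # hypothetical observations

# n=4 splits the data into quarters and returns the three cut points
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # IQR = Q3 - Q1

print(q1, median, q3, iqr)
```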

five-number summary of a distribution
reports its median, quartiles, and extremes (maximum and minimum)

boxplot (AKA box-and-whiskers plot)
 basically a graphical version of the five-number summary (unless there are outliers)
 it consists of a box that contains the central 50% of the distribution (from lower quartile to upper quartile)
 a line within the box that marks the median
 cutoff lines at 1.5 IQRs below the lower quartile and above the upper quartile; observations beyond them are treated as outliers
 whiskers that extend to the max and min, unless there are outliers
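
The numbers behind a boxplot can be sketched as follows, assuming the 1.5 IQR cutoffs are used to flag outliers and that the whiskers stop at the most extreme observations inside those cutoffs. The data are hypothetical, and the quartiles use the default convention of `statistics.quantiles`.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]   # hypothetical data with one straggler

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]
inside = [x for x in data if lower_cutoff <= x <= upper_cutoff]
whisker_low, whisker_high = min(inside), max(inside)   # whisker ends

print(q1, median, q3, outliers, whisker_low, whisker_high)
```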

range
the difference between the extremes (max/min)

variance
 the average of the squared deviations
 of the sample: s^{2} = ∑(x_{i} - x̄)^{2} / (n - 1)
 of the population: σ^{2} = ∑(x_{i} - μ)^{2} / N

standard deviation
 of the population: σ = √(σ^{2})
 of the sample: s = √(s^{2})
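
A brief sketch contrasting the sample and population versions, using the standard `statistics` module and a manual check. The data are hypothetical; whether they count as a sample or a whole population depends on the study.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical observations, mean = 5

# Manual computation: sum of squared deviations from the mean
mean = sum(data) / len(data)
ss = sum((x - mean) ** 2 for x in data)

sample_var = ss / (len(data) - 1)      # s^2: divide by n - 1
pop_var = ss / len(data)               # sigma^2: divide by N

sample_sd = math.sqrt(sample_var)      # s
pop_sd = math.sqrt(pop_var)            # sigma

# The library functions should agree with the manual results
print(sample_var, statistics.variance(data))
print(pop_var, statistics.pvariance(data))
print(sample_sd, pop_sd)
```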


standard deviation
if the histogram of the data is approximately symmetric and bellshaped, then
 about 68% of the data are within one standard deviation of the mean
 about 95% of the data are within two standard deviations of the mean
 about 99.7% of the data are within three standard deviations of the mean
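
One way to see these percentages is to simulate bell-shaped data and count. This is only an illustration: the mean of 100, the standard deviation of 15, the sample size, and the seed are arbitrary choices, and the simulated shares will be close to, not exactly, 68%, 95%, and 99.7%.

```python
import random
import statistics

rng = random.Random(42)                               # fixed seed so the run is repeatable
data = [rng.gauss(100, 15) for _ in range(10_000)]    # simulated bell-shaped data

mean = statistics.mean(data)
sd = statistics.stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sd: {share:.3f}")
```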

