-
What is Statistics?
Methods for Collecting, Describing, Analyzing and Drawing Conclusions from Data
-
Population
- total set of all subjects of interest
- the entire group of people, animals or things about which we want information
-
Elementary Unit
any individual member of the population
-
Sample
- subset of the population from which the study actually collects information
- used to draw conclusions about the whole population
-
Variable
- a characteristic of a unit that can vary among subjects in the population/sample
- Examples: gender, nationality, age, income, hair color, height, disease status, company rating, grade in STA 291, state of residence
-
Sampling Frame
listing of all the units in the population
-
Parameter
- numerical characteristic of the population
- calculated using the whole population
-
Statistic
- numerical characteristic of the sample
- calculated using the sample
-
Why not measure all of the units in the population?
- Accuracy: May not be able to list them all - may not be able to come up with a frame
- Time: Speed of Response
- Expense: Cost
- Infinite Population
- Destructive Sampling or Testing
-
Descriptive Statistics
Summarizing the information in a collection of data
-
Inferential Statistics
Using information from a sample to make conclusions/predictions about the population
-
Univariate data set
Consists of observations on a single attribute
-
Multivariate data
Consists of observations on several attributes
-
Special case: Bivariate data
Two attributes collected per observation
-
Nominal variables
- have a scale of unordered categories
- Examples: gender, nationality, hair color (it doesn't make sense to say that green eyes are greater/higher/better than brown eyes)
-
Ordinal variables
- have a scale of ordered categories; often treated in a quantitative manner
- Examples: disease status, company rating, grade in STA 291
-
Qualitative variables
- categorical (not numerical)
- Nominal and Ordinal
-
Quantitative variables
measured numerically, that is, for each subject a number is observed
-
interval scale
the scale for quantitative variables
-
Discrete variables
- have a finite (or countable) number of possible values
- all qualitative (categorical) variables are ~
- only some quantitative (numeric) variables are ~
-
Continuous variables
- can take all the values in a continuum of real values
- Examples: time, distance, volume, speed (usually physical measures)
-
Simple Random Sample
- Each possible sample has the same probability of being selected
- The sample size is usually denoted by "n"
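A minimal sketch of drawing a simple random sample, assuming the population is available as a Python list. The heights are simulated, and the helper name simple_random_sample is made up for illustration. It also contrasts a parameter (computed from the whole population) with a statistic (computed from the sample).

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: 1000 simulated heights in cm (not real data).
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(1000)]

sample = simple_random_sample(population, n=50, seed=1)

parameter = sum(population) / len(population)  # population mean (mu)
statistic = sum(sample) / len(sample)          # sample mean (x-bar)
print(f"mu = {parameter:.2f}, x-bar = {statistic:.2f}")
```

The two printed values will usually be close but not equal; that gap is the sampling error.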
-
Convenience Sample
a sample made up of people who just happened to be there (chosen because they are easy to reach, not at random)
-
Volunteer Sampling
- this sample will poorly represent the population
- will cause misleading conclusions
- BIAS
- people are much more likely to speak up if they feel strongly about the issue
- Examples: Mall interview, Street corner interview
-
Random Sample
even if it is smaller, it is much more trustworthy than a volunteer sample because it has less bias
-
Observational Study
- observes individuals and measures variables of interest but does not attempt to influence the responses
- passive data collection
- its purpose is to describe/compare groups or situations
-
Experiment
- deliberately imposes some treatment on individuals in order to observe their responses
- active data production
- its purpose is to study whether the treatment causes a change in the response
-
stratified sampling
- divide the population into separate, non-overlapping groups ("strata")
- select a simple random sample independently (and usually proportionally) from each group
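The two steps above might be sketched as follows, assuming proportional allocation (the same sampling fraction in every stratum). The student list, the stratum_of callback, and the function name are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Take a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # non-overlapping groups
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # proportional allocation
        sample.extend(rng.sample(members, k))       # independent SRS per stratum
    return sample

# Hypothetical population: (student id, class year) pairs.
students = [(i, "freshman") for i in range(60)] + [(i, "senior") for i in range(60, 100)]
picked = stratified_sample(students, stratum_of=lambda s: s[1], fraction=0.1, seed=0)
print(len(picked))  # 6 freshmen + 4 seniors = 10
```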
-
cluster sampling
- the population can be divided into a set of non-overlapping subgroups (the clusters)
- the clusters are then selected at random, and all individuals in the selected clusters are included in the sample
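A small sketch of the same idea, assuming the clusters are stored in a dictionary that maps a cluster name to its members. The dorm data and the function name are invented for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return [unit for name in chosen for unit in clusters[name]]

# Hypothetical population: students grouped by dorm.
dorms = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
    "D": ["d1", "d2"],
}
dorm_sample = cluster_sample(dorms, n_clusters=2, seed=0)
print(dorm_sample)
```

Unlike stratified sampling, only some groups appear in the sample, but each selected group appears in full.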
-
systematic sampling
- an initial name is selected at random (from the first K names on the list)
- every Kth name is selected after that
- K is computed by dividing membership list length by the desired sample size
- not a simple random sample, but often almost as good as one
- useful when the population exists as a list
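The procedure above could look like this in code, assuming the list is at least as long as the desired sample size and that K is rounded down to a whole number. The membership list and the function name are hypothetical.

```python
import random

def systematic_sample(names, n, seed=None):
    """Select every k-th name after a random start, with k = len(names) // n."""
    rng = random.Random(seed)
    k = len(names) // n        # sampling interval K (assumes n <= len(names))
    start = rng.randrange(k)   # random starting position among the first k names
    return names[start::k][:n]

members = [f"member_{i:03d}" for i in range(100)]  # hypothetical membership list
chosen = systematic_sample(members, n=10, seed=0)
print(chosen)
```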
-
types of bias
- Selection Bias - selection of the sample systematically excludes some part of the population of interest
- Measurement/Response Bias - method of observation tends to produce values that systematically differ from the true value
- Nonresponse Bias - occurs when responses are not actually obtained from all individuals selected for inclusion in the sample
-
sampling error
- the error that occurs when a statistic based on a sample estimates or predicts the value of a population parameter
- in random samples, the sampling error can usually be quantified
-
non-sampling error
- any error that could also happen in a census
- Examples: bias due to question wording, question order, nonresponse, wrong answers (especially to delicate questions)
-
frequency distribution
- a listing of intervals of possible values for a variable AND a tabulation of the # of observations in each interval
- use intervals of the same length (if possible)
- intervals must be mutually exclusive (any observation must fall into one and only one interval)
- rule of thumb: if you have n observations, the # of intervals should be about √n
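A rough sketch of building such a table with equal-width intervals and about √n of them. The scores are made up, and the choice to make intervals closed on the left and open on the right (with the maximum placed in the last interval) is an assumption, one of several reasonable conventions.

```python
import math

def frequency_distribution(values, n_intervals=None):
    """Count observations in equal-width, mutually exclusive intervals.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    if n_intervals is None:
        n_intervals = max(1, round(math.sqrt(len(values))))  # about sqrt(n)
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        # Intervals are [left, right); the maximum goes into the last interval.
        i = min(int((v - low) / width), n_intervals - 1)
        counts[i] += 1
    edges = [low + j * width for j in range(n_intervals + 1)]
    return edges, counts

scores = [52, 55, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 88, 91, 97]  # made-up data
edges, counts = frequency_distribution(scores)
print(edges)   # [52.0, 63.25, 74.5, 85.75, 97.0]
print(counts)  # [3, 5, 5, 3]
```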
-
Frequency, Relative Frequency, and Percentage Distribution
- frequency = # in interval
- relative frequency = frequency/total #
- percentage = relative frequency x 100%
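The three quantities can be computed together, as in this short sketch. The grade data and the function name distribution_table are invented for illustration.

```python
from collections import Counter

def distribution_table(values):
    """Map each category to (frequency, relative frequency, percentage)."""
    counts = Counter(values)
    total = len(values)
    return {cat: (k, k / total, 100 * k / total) for cat, k in counts.items()}

grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]  # made-up data
table = distribution_table(grades)
for grade in sorted(table):
    freq, rel, pct = table[grade]
    print(f"{grade}: frequency={freq}, relative frequency={rel:.2f}, percentage={pct:.0f}%")
```

The relative frequencies always add up to 1 and the percentages to 100%.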
-
cumulative frequencies
# of observations that fall in the class and in smaller classes
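As a quick illustration, cumulative frequencies are running totals of the class frequencies, assuming the classes are listed from smallest to largest. The counts below are made up.

```python
from itertools import accumulate

class_counts = [3, 5, 5, 3]  # made-up frequencies, classes ordered smallest to largest
cumulative = list(accumulate(class_counts))
print(cumulative)  # [3, 8, 13, 16]
```

The last cumulative frequency equals the total number of observations.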
-
histogram (interval data)
- use numbers from the frequency distribution to create a graph
- draw a bar over each interval, the height of the bar represents the relative frequency for that interval
- bars should be touching; i.e., equally extend the width of the bar at the upper and lower limits so that the bars are touching
-
bar graph (nominal/ordinal data)
- the bars are usually separated to emphasize that the variable is categorical rather than quantitative
- for nominal variables (no natural ordering), order the bars by frequency, except possibly for a category "other" that is always last
- for ordinal data, classes are presented in their natural order (A, B, C, ...)
-
stem and leaf plot
- write the observations ordered from smallest to largest
- each observation is represented by a stem (leading digit(s)) and a leaf (final digit)
- looks like a histogram sideways - gives individual values
- contains more information than a histogram, because every single measurement can be recovered
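A minimal sketch of a stem and leaf plot, assuming non-negative whole numbers so that the leaf is the final digit and the stem is everything before it. The data are made up. Note that this simple version skips stems that have no leaves, which a hand-drawn plot would normally still show.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers (leaf = final digit)."""
    plot = defaultdict(list)
    for v in sorted(values):        # order the observations first
        stem, leaf = divmod(v, 10)  # leading digit(s) and final digit
        plot[stem].append(leaf)
    return dict(plot)

data = [23, 25, 31, 31, 38, 40, 42, 47, 47, 49, 52]  # made-up data
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {''.join(map(str, leaves))}")
# 2 | 35
# 3 | 118
# 4 | 02779
# 5 | 2
```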
-
describing distributions
- center, spread (numbers later)
- symmetric distributions - bell-shaped or U-shaped
- not symmetric distributions - left-skewed or right-skewed
-
contingency table
- number of subjects observed at all the combinations of possible outcomes for the 2 variables
- ~ are identified by their number of rows and columns - a table with 2 rows and 3 columns is called a 2x3 table
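A small sketch of tabulating two categorical variables, assuming each observation is stored as a pair of outcomes. The eight observations are invented, and the result happens to be a 2x3 table.

```python
from collections import Counter

# Made-up observations: (class year, preferred study time) for 8 students.
observations = [
    ("freshman", "morning"), ("freshman", "evening"), ("senior", "evening"),
    ("senior", "evening"), ("freshman", "morning"), ("senior", "afternoon"),
    ("freshman", "afternoon"), ("senior", "morning"),
]

cell_counts = Counter(observations)
rows = sorted({r for r, _ in observations})  # ['freshman', 'senior']
cols = sorted({c for _, c in observations})  # ['afternoon', 'evening', 'morning']
contingency = [[cell_counts[(r, c)] for c in cols] for r in rows]
print(contingency)  # [[1, 1, 2], [1, 2, 1]]
```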
-
good graphics...
- present large data sets concisely and coherently
- can replace a thousand words and still be clearly understood and comprehended
- encourage the viewer to compare two or more variables
- do not replace substance by form
- do not distort what the data reveal
- have a high "data-to-ink" ratio
-
bad graphics...
- don't have a scale on the axis
- have a misleading caption
- distort by stretching/shrinking the vertical or horizontal axis
- use histograms or bar charts with bars of unequal width
- are more confusing than helpful
-
sampling variability
sample-to-sample differences
-
undercoverage
some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population
-
measures of central location or tendency
- Mean: arithmetic average
- Median: midpoint of the observations when they are arranged in increasing order
- Mode: most frequent value
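The three measures can be compared on one small data set using Python's standard statistics module. The incomes are made up, with one deliberately extreme value to show how differently the mean and the median react to it.

```python
import statistics

incomes = [28, 31, 31, 35, 40, 44, 250]  # made-up incomes in $1000s; 250 is extreme

mean = statistics.mean(incomes)        # arithmetic average, 459 / 7
median = statistics.median(incomes)    # middle value of the ordered data
modes = statistics.multimode(incomes)  # list of all most frequent values

print(f"mean = {mean:.2f}, median = {median}, mode(s) = {modes}")
# mean = 65.57, median = 35, mode(s) = [31]
```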
-
outliers
stragglers that stand off away from the body of the distribution
-
mean
- sample mean = x-bar
- population mean = mu
- sometimes the mean is calculated for ordinal variables, but this doesn't always make sense (e.g., GPA = 3.8 is a mean of letter grades)
- it is highly influenced by outliers
-
median
- falls in the middle of the ordered sample; if n is even, average the 2 middle values
- for skewed distributions, it is a more appropriate measure of central tendency than the mean (better describes a "typical value")
- it may be too insensitive to changes in the data
-
trimmed mean
- compromise between the median and the mean
- 1. order the data from smallest to largest
- 2. delete a selected number of values from each end of the ordered list
- 3. find the mean of the remaining values
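The three steps translate directly into code. This is only a sketch: the function name and the made-up data are assumptions, and the number of values to delete is passed as a count per end rather than as a percentage.

```python
def trimmed_mean(values, trim_count):
    """Mean after deleting trim_count values from each end of the ordered list."""
    ordered = sorted(values)                              # step 1
    if 2 * trim_count >= len(ordered):
        raise ValueError("trim_count removes every value")
    kept = ordered[trim_count:len(ordered) - trim_count]  # step 2
    return sum(kept) / len(kept)                          # step 3

data = [28, 31, 31, 35, 40, 44, 250]  # made-up data with one extreme value
print(trimmed_mean(data, 1))  # drops 28 and 250, leaving a mean of 36.2
```

Here one value out of seven is deleted from each end, so the trimming percentage is about 14%.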
-
trimming percentage
the percentage of values that have been deleted from each end of the ordered list when calculating the trimmed mean
-
mode
- the most frequently occurring value
- on a histogram it would be the highest bar
- it may not be unique
-
measures of dispersion of the data
- variance, standard deviation
- interquartile range
- range
-
percentiles
- 50th percentile = median
- 25th = lower quartile = Q1
- 75th = upper quartile = Q3
-
interquartile range (IQR)
- the difference between upper and lower quartile
- IQR = Q3 - Q1
- range of values that contains the middle 50% of the data
- IQR increases as variability increases
-
five-number summary of a distribution
reports its median, quartiles, and extremes (maximum and minimum)
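A sketch of computing the five-number summary and the IQR with the standard statistics module. Textbooks and software use slightly different quartile conventions; the "inclusive" method used here is just one of them, so quartiles computed by hand may differ a little. The data are made up.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [2, 4, 4, 5, 7, 8, 9, 11, 15]  # made-up data
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1

print(minimum, q1, median, q3, maximum)  # 2 4.0 7.0 9.0 15
print(iqr)                               # 5.0
```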
-
boxplot (AKA box-and-whiskers plot)
- basically a graphical version of the five-number summary (unless there are outliers)
- it consists of a box that contains the central 50% of the distribution (from lower quartile to upper quartile)
- a line within the box that marks the median
- lines at 1.5 IQRs beyond the lower/upper quartiles (observations outside these lines are treated as outliers)
- whiskers that extend to the max and min, unless there are outliers
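The numbers behind a boxplot might be computed as in the sketch below. It assumes the common convention that points more than 1.5 IQRs beyond the quartiles count as outliers and that whiskers stop at the most extreme non-outlying values; the quartile method ("inclusive") and the data are also assumptions.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, 1.5*IQR limits, whisker ends, and outliers for a boxplot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_limit <= v <= upper_limit]
    outliers = [v for v in ordered if v < lower_limit or v > upper_limit]
    return {
        "box": (q1, median, q3),
        "limits": (lower_limit, upper_limit),
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outliers
        "outliers": outliers,
    }

stats = boxplot_stats([2, 4, 4, 5, 7, 8, 9, 11, 30])  # made-up data
print(stats["limits"])    # (-3.5, 16.5)
print(stats["whiskers"])  # (2, 11)
print(stats["outliers"])  # [30]
```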
-
range
the difference between the extremes (max - min)
-
variance
- the average of the squared deviations from the mean
- of the sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
- of the population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
-
standard deviation
- of the population: $\sigma = \sqrt{\sigma^2}$
- of the sample: $s = \sqrt{s^2}$
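A short sketch of both versions of the variance and the standard deviation, written out by hand so the divisor (n - 1 for a sample, N for a population) is visible. The function names and data are made up; the standard library's statistics.variance and statistics.pvariance compute the same quantities.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from x-bar, divided by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

def population_variance(xs):
    """Sum of squared deviations from mu, divided by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0 (population standard deviation)
print(round(sample_variance(data), 3))       # 4.571
print(round(math.sqrt(sample_variance(data)), 3))  # sample standard deviation
```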
-
standard deviation (empirical rule)
if the histogram of the data is approximately symmetric and bell-shaped, then
- about 68% of the data are within one standard deviation from the mean
- about 95% of the data are within two standard deviations from the mean
- about 99.7% of the data are within three standard deviations from the mean
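These percentages can be checked on simulated bell-shaped data. This is only an illustration: the data come from a random number generator with an arbitrary mean of 100 and standard deviation of 15, and the helper name share_within is made up.

```python
import random
import statistics

rng = random.Random(0)
data = [rng.gauss(100, 15) for _ in range(10_000)]  # simulated bell-shaped data

mean = statistics.fmean(data)
sd = statistics.stdev(data)

def share_within(values, center, spread, k):
    """Fraction of values within k standard deviations of the center."""
    return sum(abs(v - center) <= k * spread for v in values) / len(values)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(data, mean, sd, k):.1%}")
```

The printed shares should land near 68%, 95%, and 99.7%, though not exactly, because of sampling variability.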