STA 291 LEC 1&2

What is Statistics?

Methods for Collecting, Describing, Analyzing and Drawing Conclusions from Data
Population
- total set of all subjects of interest
- the entire group of people, animals or things about which we want information
Elementary Unit

any individual member of the population
Sample
- subset of the population from which the study actually collects information
- used to draw conclusions about the whole population
Variable
- a characteristic of a unit that can vary among subjects in the population/sample
- Examples: gender, nationality, age, income, hair color, height, disease status, company rating, grade in STA 291, state of residence
Sampling Frame

listing of all the units in the population
Parameter
- numerical characteristic of the population
- calculated using the whole population
Statistic
- numerical characteristic of the sample
- calculated using the sample
Why not measure all of the units in the population?
- Accuracy: May not be able to list them all - may not be able to come up with a frame
- Time: Speed of Response
- Expense: Cost
- Infinite Population
- Destructive Sampling or Testing
Descriptive Statistics

Summarizing the information in a collection of data
Inferential Statistics

Using information from a sample to make conclusions/predictions about the population
Univariate data set

Consists of observations on a single attribute
Multivariate data

Consists of observations on several attributes
Special case: Bivariate data

Two attributes collected per observation
Nominal variables
- have a scale of unordered categories
- Examples: gender, nationality, hair color (it doesn't make sense to say that green eyes are greater/higher/better than brown eyes)
Ordinal variables
- have a scale of ordered categories; often treated in a quantitative manner
- Examples: disease status, company rating, grade in STA 291
Qualitative variables
- categorical (not numerical)
- Nominal and Ordinal
Quantitative variables

measured numerically, that is, for each subject a number is observed
interval scale

the scale for quantitative variables
Discrete variables
- have a finite (or countable) number of possible values
- all qualitative (categorical) variables are ~
- only some quantitative (numeric) variables are ~
Continuous variables
- can take all the values in a continuum of real values
- Examples: time, distance, volume, speed, (usually physical measures)
Simple Random Sample
- Each possible sample has the same probability of being selected
- The sample size is usually denoted by "n"
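A minimal sketch of drawing a simple random sample and comparing a statistic with the matching parameter. The population values, the seed, and the names `population`, `mu`, and `x_bar` are made up for illustration.

```python
import random
import statistics

# Hypothetical population: ages of 10 people (made-up values).
population = [23, 35, 41, 29, 52, 38, 47, 31, 60, 44]

rng = random.Random(291)            # fixed seed so the draw is repeatable
n = 4                               # sample size
sample = rng.sample(population, n)  # every subset of size n is equally likely

mu = statistics.mean(population)    # parameter: uses the whole population
x_bar = statistics.mean(sample)     # statistic: uses only the sample

print("sample:", sample)
print("population mean (parameter):", mu)
print("sample mean (statistic):", x_bar)
```

Changing the seed gives a different sample and usually a different sample mean, while the population mean stays fixed.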
Convenience Sample

a sample made up of people who just happened to be there (easy to reach, not randomly selected)
Volunteer Sampling
- this sample will poorly represent the population
- will cause misleading conclusions
- BIAS
- people are much more likely to speak up if they feel strongly about the issue
- Examples: Mall interview, Street corner interview
Random Sample

even if it is smaller, it is much more trustworthy than a volunteer sample because it has less bias
Observational Study
- observes individuals and measures variables of interest but does not attempt to influence the responses
- passive data collection
- its purpose is to describe/compare groups or situations
Experiment
- deliberately imposes some treatment on individuals in order to observe their responses
- active data production
- its purpose is to study whether the treatment causes a change in the response
stratified sampling
- divide the population into separate, non-overlapping groups ("strata")
- select a simple random sample independently (and usually proportionally) from each group
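A rough sketch of stratified sampling with proportional allocation. The strata names, sizes, and the overall sample size of 20 are assumptions chosen so the shares come out as whole numbers.

```python
import random

# Hypothetical population split into non-overlapping strata (made-up labels).
strata = {
    "freshman": [f"F{i}" for i in range(40)],
    "sophomore": [f"S{i}" for i in range(30)],
    "junior": [f"J{i}" for i in range(20)],
    "senior": [f"R{i}" for i in range(10)],
}

rng = random.Random(1)
total = sum(len(units) for units in strata.values())
n = 20  # desired overall sample size

stratified_sample = {}
for name, units in strata.items():
    # Proportional allocation: each stratum contributes its share of n.
    k = round(n * len(units) / total)
    # Independent simple random sample within the stratum.
    stratified_sample[name] = rng.sample(units, k)

for name, chosen in stratified_sample.items():
    print(name, len(chosen))
```

With shares that are not whole numbers, the rounded stratum sizes may not add up to exactly n, so a real design would need a rule for that case.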
cluster sampling
- the population can be divided into a set of non-overlapping subgroups (the clusters)
- the clusters are then selected at random, and all individuals in the selected clusters are included in the sample
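A small sketch of cluster sampling, assuming made-up clusters (dorm floors) and a choice of 2 clusters. Whole clusters are drawn at random, then everyone in them is kept.

```python
import random

# Hypothetical clusters: each floor lists its residents (made-up names).
clusters = {
    "floor_1": ["A1", "A2", "A3"],
    "floor_2": ["B1", "B2"],
    "floor_3": ["C1", "C2", "C3", "C4"],
    "floor_4": ["D1", "D2", "D3"],
}

rng = random.Random(2)
# Step 1: select clusters (not individuals) at random.
chosen_clusters = rng.sample(sorted(clusters), 2)
# Step 2: include every individual from each selected cluster.
cluster_sample = [person for c in chosen_clusters for person in clusters[c]]

print("chosen clusters:", chosen_clusters)
print("sample:", cluster_sample)
```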
systematic sampling
- an initial name is selected at random
- every Kth name is selected after that
- K is computed by dividing membership list length by the desired sample size
- not a simple random sample, but often almost as good as one
- useful when the population exists as a list
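The systematic sampling steps can be sketched as follows. The membership list is hypothetical, and the sketch assumes the random starting name is chosen among the first K names, which is a common convention.

```python
import random

# Hypothetical membership list of 100 names.
members = [f"member_{i:03d}" for i in range(1, 101)]
n = 10                      # desired sample size
K = len(members) // n       # list length divided by sample size

rng = random.Random(3)
start = rng.randrange(K)    # random starting position within the first K names
systematic_sample = members[start::K]  # then every Kth name after that

print("K =", K, "start index =", start)
print(systematic_sample)
```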
types of bias
- Selection Bias - selection of the sample systematically excludes some part of the population of interest
- Measurement/Response Bias - method of observation tends to produce values that systematically differ from the true value
- Nonresponse Bias - occurs when responses are not actually obtained from all individuals selected for inclusion in the sample
sampling error
- the error that occurs when a statistic based on a sample estimates or predicts the value of a population parameter
- in random samples, the sampling error can usually be quantified
non-sampling error
- any error that could also happen in a census
- Examples: bias due to question wording, question order, nonresponse, wrong answers (especially to delicate questions)
frequency distribution
- a listing of intervals of possible values for a variable AND a tabulation of the # of observations in each interval
- use intervals of the same length (if possible)
- intervals must be mutually exclusive (any observation must fall into one and only one interval)
- rule of thumb: if you have n observations, the # of intervals should be about √n
Frequency, Relative Frequency, and Percentage Distribution
- frequency = # in interval
- relative frequency = frequency/total #
- percentage = relative frequency x 100%
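A minimal sketch of building a frequency, relative frequency, and percentage distribution. The scores are made up, and the interval start (60) and width (10) are assumptions picked to give equal-width, mutually exclusive intervals.

```python
import math

# Made-up exam scores.
scores = [61, 64, 67, 70, 71, 73, 75, 78, 79, 80, 82, 85, 88, 91, 94, 97]
n = len(scores)
num_intervals = round(math.sqrt(n))  # rule of thumb: about sqrt(n) intervals
low, width = 60, 10                  # assumed interval start and common width

freq = [0] * num_intervals
for x in scores:
    # Each score lands in exactly one interval [low + i*width, low + (i+1)*width).
    i = (x - low) // width
    freq[i] += 1

rel_freq = [f / n for f in freq]       # frequency / total #
percent = [100 * r for r in rel_freq]  # relative frequency x 100%

for i in range(num_intervals):
    start = low + i * width
    print(f"[{start}, {start + width}): freq={freq[i]}, "
          f"rel={rel_freq[i]:.4f}, pct={percent[i]:.2f}%")
```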
cumulative frequencies

# of observations that fall in the class and in smaller classes
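As a quick illustration, cumulative frequencies are running totals of the class frequencies. The counts below are hypothetical.

```python
from itertools import accumulate

freq = [3, 6, 4, 3]  # hypothetical counts for four consecutive classes
cum_freq = list(accumulate(freq))                # running totals
cum_rel = [c / sum(freq) for c in cum_freq]      # cumulative relative frequencies

print(cum_freq)
print(cum_rel)
```

The last cumulative frequency always equals the total number of observations.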
histogram (interval data)
- use numbers from the frequency distribution to create a graph
- draw a bar over each interval, the height of the bar represents the relative frequency for that interval
- bars should be touching; i.e., equally extend the width of the bar at the upper and lower limits so that the bars are touching
bar graph (nominal/ordinal data)
- the bars are usually separated to emphasize that the variable is categorical rather than quantitative
- for nominal variables (no natural ordering), order the bars by frequency, except possibly for a category "other" that is always last
- for ordinal data, classes are presented in the natural order (A, B, C...)
stem and leaf plot
- write the observations ordered from smallest to largest
- each observation is represented by a stem (leading digit(s)) and a leaf (final digit)
- looks like a histogram sideways - gives individual values
- contains more information than a histogram, because every single measurement can be recovered
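A short sketch of a stem and leaf plot for two-digit values, assuming the tens digit is the stem and the ones digit is the leaf. The data are made up.

```python
from collections import defaultdict

data = [23, 45, 31, 28, 37, 41, 29, 35, 48, 33]  # made-up two-digit values

stems = defaultdict(list)
for x in sorted(data):            # order the observations first
    stems[x // 10].append(x % 10) # stem = leading digit, leaf = final digit

lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
```

Every original value can be read back from its stem and leaf, which is why the plot keeps more information than a histogram.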
describing distributions
- center, spread (numbers later)
- symmetric distributions - bell-shaped or U-shaped
- not symmetric distributions - left-skewed or right-skewed
contingency table
- number of subjects observed at all the combinations of possible outcomes for the 2 variables
- ~ are identified by their number of rows and columns - a table with 2 rows and 3 columns is called a 2x3 table
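A small sketch of tabulating a 2x3 contingency table from paired observations. The two variables ("smoker" and "exercise level") and all the data are invented for illustration.

```python
from collections import Counter

# Each observation is a pair: (smoker, exercise level). Made-up data.
observations = [
    ("yes", "low"), ("yes", "low"), ("yes", "medium"), ("no", "low"),
    ("no", "medium"), ("no", "medium"), ("no", "high"), ("no", "high"),
    ("yes", "high"), ("no", "high"),
]

rows = ["yes", "no"]               # 2 rows
cols = ["low", "medium", "high"]   # 3 columns
counts = Counter(observations)     # number of subjects at each combination

table = [[counts[(r, c)] for c in cols] for r in rows]

print("      " + "  ".join(f"{c:>6}" for c in cols))
for r, row in zip(rows, table):
    print(f"{r:>5} " + "  ".join(f"{v:>6}" for v in row))
```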
good graphics...
- present large data sets concisely and coherently
- can replace a thousand words and still be clearly understood and comprehended
- encourage the viewer to compare two or more variables
- do not replace substance by form
- do not distort what the data reveal
- have a high "data-to-ink" ratio
bad graphics...
- don't have a scale on the axis
- have a misleading caption
- distort by stretching/shrinking the vertical or horizontal axis
- use histograms or bar charts with bars of unequal width
- are more confusing than helpful
sampling variability

sample-to-sample differences
undercoverage

some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population
measures of central location or tendency
- Mean: arithmetic average
- Median: midpoint of the observations when they are arranged in increasing order
- Mode: most frequent value
outliers

stragglers that stand off away from the body of the distribution
mean
- sample mean = x-bar
- population mean = mu
- sometimes the mean is calculated for ordinal variables, but this doesn't always make sense (GPA = 3.8)
- it is highly influenced by outliers
median
- falls in the middle of the ordered sample; if n is even, average the 2 middle values
- for skewed distributions, it is a more appropriate measure of central tendency than the mean (better describes a "typical value")
- it may be too insensitive to changes in the data
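A brief sketch comparing the mean and the median, with and without an outlier. The numbers are made up, and the even-sized list shows the "average the 2 middle values" rule.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]          # made-up values, n is even
mean_value = statistics.mean(data)
median_value = statistics.median(data)   # average of the two middle values, 3 and 5

with_outlier = data + [100]         # add one extreme value
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print("mean:", mean_value, "median:", median_value)
print("with outlier -> mean:", round(mean_out, 2), "median:", median_out)
```

The single extreme value pulls the mean far upward while the median barely moves.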
trimmed mean
- compromise between the median and the mean
- 1. order the data from smallest to largest
- 2. delete a selected number of values from each end of the ordered list
- 3. find the mean of the remaining values
trimming percentage

the percentage of values that have been deleted from each end of the ordered list when calculating the trimmed mean
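The three trimmed-mean steps can be sketched as below. The function name `trimmed_mean` and the data are hypothetical, and the sketch assumes a trimming percentage below 50 so that some values remain.

```python
def trimmed_mean(values, trim_percent):
    """Mean after deleting trim_percent% of the values from each end."""
    ordered = sorted(values)                     # 1. order the data
    k = int(len(ordered) * trim_percent / 100)   # number deleted from each end
    kept = ordered[k:len(ordered) - k]           # 2. delete k values per end
    return sum(kept) / len(kept)                 # 3. mean of what remains

data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 95]  # made-up values with two extremes

print("ordinary mean:   ", trimmed_mean(data, 0))
print("10% trimmed mean:", trimmed_mean(data, 10))
```

With 10 values and 10% trimming, one value is dropped from each end, so the extremes 1 and 95 no longer affect the result.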
mode
- the most frequently occurring value
- on a histogram it would be the highest bar
- it may not be unique
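A short illustration of the mode, including a case where it is not unique. The grades are made up; `statistics.multimode` returns every most-frequent value.

```python
import statistics

grades = ["B", "A", "C", "B", "A", "D"]   # made-up grades; A and B tie
modes = statistics.multimode(grades)      # all most-frequent values

single = statistics.mode([1, 2, 2, 3])    # unique mode

print("modes of grades:", modes)
print("mode of [1, 2, 2, 3]:", single)
```

The mode also works for categorical data like grades, where a mean would not make sense.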
measures of dispersion of the data
- variance, standard deviation
- interquartile range
- range
percentiles
- 50th percentile = median
- 25th = lower quartile = Q₁
- 75th = upper quartile = Q₃
interquartile range (IQR)
- the difference between upper and lower quartile
- IQR = Q₃ - Q₁
- range of values that contains the middle 50% of the data
- IQR increases as variability increases
five-number summary of a distribution

reports its median, quartiles, and extremes (maximum and minimum)
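A minimal sketch of the five-number summary and the IQR. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" rule (excluding the overall median when n is odd), and the helper name `five_number_summary` is made up.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[n - half:]    # values above the median position
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

data = [7, 1, 15, 4, 10, 3, 12, 6, 8]   # made-up values
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1                           # IQR = Q3 - Q1

print("min, Q1, median, Q3, max:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
```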
boxplot (AKA box-and-whiskers plot)
- basically a graphical version of the five-number summary (unless there are outliers)
- it consists of a box that contains the central 50% of the distribution (from lower quartile to upper quartile)
- a line within the box that marks the median
- lines at 1.5 IQRs from the lower/upper quartiles
- whiskers that extend to the max and min, unless there are outliers
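A rough sketch of the 1.5 IQR rule used to flag outliers and place the whiskers. The data are made up, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions.

```python
import statistics

data = [2, 4, 5, 6, 7, 8, 9, 10, 30]   # made-up values; 30 sits far from the rest

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr           # line 1.5 IQRs below the lower quartile
upper_limit = q3 + 1.5 * iqr           # line 1.5 IQRs above the upper quartile

outliers = [x for x in data if x < lower_limit or x > upper_limit]
inside = [x for x in data if lower_limit <= x <= upper_limit]
# Whiskers reach the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, Q3, IQR:", q1, q3, iqr)
print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```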
range

the difference between the extremes (max/min)
variance
- the average of the squared deviations from the mean
- of the sample: s² = ∑(xᵢ - x̄)² / (n - 1)
- of the population: σ² = ∑(xᵢ - μ)² / N
standard deviation
- of the population: σ = √σ²
- of the sample: s = √s²
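A short sketch contrasting the sample and population versions of the variance and standard deviation. The data are made up; the hand-computed sample variance divides by n - 1, and the population version divides by N.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
n = len(data)
mean = sum(data) / n

squared_devs = [(x - mean) ** 2 for x in data]
s2_by_hand = sum(squared_devs) / (n - 1)    # sample variance: divide by n - 1
sigma2_by_hand = sum(squared_devs) / n      # population variance: divide by N

s2 = statistics.variance(data)      # sample variance
sigma2 = statistics.pvariance(data) # population variance
s = statistics.stdev(data)          # sample standard deviation
sigma = statistics.pstdev(data)     # population standard deviation

print("sample:     variance =", round(s2, 4), " sd =", round(s, 4))
print("population: variance =", sigma2, " sd =", sigma)
```

The sample variance is slightly larger than the population variance computed from the same numbers because of the smaller divisor.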
standard deviation
if the histogram of the data is approximately symmetric and bell-shaped, then
- about 68% of the data are within one standard deviation of the mean
- about 95% of the data are within two standard deviations of the mean
- about 99.7% of the data are within three standard deviations of the mean
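A simulation sketch that checks these percentages on bell-shaped data. It assumes 10,000 draws from a normal distribution with an arbitrary mean (100) and standard deviation (15); the helper `share_within` is a made-up name.

```python
import random
import statistics

rng = random.Random(0)
# Simulated bell-shaped data (assumed normal, made-up mean and sd).
data = [rng.gauss(100, 15) for _ in range(10_000)]

x_bar = statistics.mean(data)
s = statistics.stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.3f}")
```

The printed shares should land close to 68%, 95%, and 99.7%, though the exact figures vary with the simulated data.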

Author

clydethedog

32635

Card Set

STA 291 LEC 1&2

Description

Statistics

Updated

2010-09-29T00:53:46Z
