STA 291 LEC 1&2

  1. What is Statistics?
    Methods for Collecting, Describing, Analyzing and Drawing Conclusions from Data
  2. Population
    • total set of all subjects of interest
    • the entire group of people, animals or things about which we want information
  3. Elementary Unit
    any individual member of the population
  4. Sample
    • subset of the population from which the study actually collects information
    • used to draw conclusions about the whole population
  5. Variable
    • a characteristic of a unit that can vary among subjects in the population/sample
    • Examples: gender, nationality, age, income, hair color, height, disease status, company rating, grade in STA 291, state of residence
  6. Sampling Frame
    listing of all the units in the population
  7. Parameter
    • numerical characteristic of the population
    • calculated using the whole population
  8. Statistic
    • numerical characteristic of the sample
    • calculated using the sample
  9. Why not measure all of the units in the population?
    • Accuracy: May not be able to list them all - may not be able to come up with a frame
    • Time: Speed of Response
    • Expense: Cost
    • Infinite Population
    • Destructive Sampling or Testing
  10. Descriptive Statistics
    Summarizing the information in a collection of data
  11. Inferential Statistics
    Using information from a sample to make conclusions/predictions about the population
  12. Univariate data set
    Consists of observations on a single attribute
  13. Multivariate data
    Consists of observations on several attributes
  14. Special case: Bivariate data
    Two attributes collected per observation
  15. Nominal variables
    • have a scale of unordered categories
    • Examples: gender, nationality, hair color, eye color (it doesn't make sense to say that green eyes are greater/higher/better than brown eyes)
  16. Ordinal variables
    • have a scale of ordered categories; often treated in a quantitative manner
    • Examples: disease status, company rating, grade in STA 291
  17. Qualitative variables
    • categorical (not numerical)
    • Nominal and Ordinal
  18. Quantitative variables
    measured numerically, that is, for each subject a number is observed
  19. interval scale
    the scale for quantitative variables
  20. Discrete variables
    • have a finite (or at most countable) number of possible values
    • all qualitative (categorical) variables are ~
    • only some quantitative (numeric) variables are ~
  21. Continuous variables
    • can take all the values in a continuum of real values
    • Examples: time, distance, volume, speed, (usually physical measures)
  22. Simple Random Sample
    • Each possible sample has the same probability of being selected
    • The sample size is usually denoted by "n"
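    A minimal sketch of drawing a simple random sample, assuming the sampling frame is available as a Python list. The frame of 20 ID numbers and the sample size n = 5 are made up for illustration.

```python
import random

random.seed(1)  # fixed seed only so the sketch is reproducible

# Hypothetical sampling frame: ID numbers for a population of N = 20 units.
frame = list(range(1, 21))

n = 5  # sample size
# random.sample draws without replacement; every subset of size n is equally likely.
srs = random.sample(frame, n)
print(srs)
```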
  23. Convenience Sample
    subjects are chosen because they are easy to reach (the people just happened to be there)
  24. Volunteer Sampling
    • this sample will poorly represent the population
    • will cause misleading conclusions
    • BIAS
    • people are much more likely to speak up if they feel strongly about the issue
    • Examples: Mall interview, Street corner interview
  25. Random Sample
    even if it is smaller it is much more trustworthy than volunteer because it has less bias
  26. Observational Study
    • observes individuals and measures variables of interest but does not attempt to influence the responses
    • passive data collection
    • its purpose is to describe/compare groups or situations
  27. Experiment
    • deliberately imposes some treatment on individuals in order to observe their responses
    • active data production
    • its purpose is to study whether the treatment causes a change in the response
  28. stratified sampling
    • divide the population into separate, non-overlapping groups ("strata")
    • select a simple random sample independently (and usually proportionally) from each group
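    The two steps above can be sketched as follows, assuming proportional allocation. The strata names, their sizes, and the overall sample size of 50 are hypothetical.

```python
import random

random.seed(2)  # fixed seed only so the sketch is reproducible

# Hypothetical population, already divided into non-overlapping strata.
strata = {
    "freshman": [f"F{i}" for i in range(400)],
    "sophomore": [f"S{i}" for i in range(300)],
    "junior": [f"J{i}" for i in range(200)],
    "senior": [f"R{i}" for i in range(100)],
}

total = sum(len(units) for units in strata.values())
n = 50  # desired overall sample size

sample = {}
for name, units in strata.items():
    # Proportional allocation: the stratum's share of n matches its share of the population.
    n_h = round(n * len(units) / total)
    # Independent simple random sample within the stratum.
    sample[name] = random.sample(units, n_h)

sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)
```

    With stratum sizes that do not divide evenly, the rounded allocations may not add up to exactly n, so a real design would need a rule for the leftover units.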
  29. cluster sampling
    • the population can be divided into a set of non-overlapping subgroups (the clusters)
    • the clusters are then selected at random, and all individuals in the selected clusters are included in the sample
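    A small sketch of cluster sampling, assuming the clusters are classrooms. The room names, the cluster size, and the choice of 3 clusters are made up for illustration.

```python
import random

random.seed(3)  # fixed seed only so the sketch is reproducible

# Hypothetical population: 10 classrooms (clusters) of 4 students each.
clusters = {
    f"room{c}": [f"room{c}-student{s}" for s in range(4)]
    for c in range(10)
}

# Select clusters at random, then keep every individual in the chosen clusters.
chosen = random.sample(sorted(clusters), 3)
sample = [student for room in chosen for student in clusters[room]]

print(chosen, len(sample))
```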
  30. systematic sampling
    • an initial name is selected at random
    • every Kth name is selected after that
    • K is computed by dividing membership list length by the desired sample size
    • not a simple random sample, but often almost as good as one
    • useful when the population exists as a list
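    The procedure can be sketched as below, assuming the list length is an exact multiple of the sample size. The membership list of 100 names is hypothetical.

```python
import random

random.seed(4)  # fixed seed only so the sketch is reproducible

# Hypothetical membership list of N = 100 names.
members = [f"member{i:03d}" for i in range(100)]
n = 10  # desired sample size

k = len(members) // n          # K = list length / desired sample size
start = random.randrange(k)    # random starting position among the first K names
sample = members[start::k]     # every Kth name from the starting position on

print(k, start, len(sample))
```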
  31. types of bias
    • Selection Bias - selection of the sample systematically excludes some part of the population of interest
    • Measurement/Response Bias - method of observation tends to produce values that systematically differ from the true value
    • Nonresponse Bias - occurs when responses are not actually obtained from all individuals selected for inclusion in the sample
  32. sampling error
    • the error that occurs when a statistic based on a sample estimates or predicts the value of a population parameter
    • in random samples, the sampling error can usually be quantified
  33. non-sampling error
    • any error that could also happen in a census
    • Examples: bias due to question wording, question order, nonresponse, wrong answers (especially to delicate questions)
  34. frequency distribution
    • a listing of intervals of possible values for a variable AND a tabulation of the # of observations in each interval
    • use intervals of the same length (if possible)
    • intervals must be mutually exclusive (any observation must fall into one and only one interval)
    • rule of thumb: if you have n observations, the # of intervals should be about √n
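    A rough sketch of building a frequency distribution with equal-length intervals and the √n rule of thumb. The 16 exam scores are made up, and the code assumes integer data.

```python
import math

# Hypothetical data: 16 exam scores (made up for illustration).
scores = [52, 55, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 88, 91, 97]

n = len(scores)
k = round(math.sqrt(n))                   # rule of thumb: about sqrt(n) intervals
low, high = min(scores), max(scores)
width = math.ceil((high - low + 1) / k)   # equal-length intervals covering all values

counts = [0] * k
for x in scores:
    counts[(x - low) // width] += 1       # each observation lands in exactly one interval

for i, c in enumerate(counts):
    a = low + i * width
    print(f"[{a}, {a + width}): {c}")
```

    Half-open intervals such as [52, 64) are one way to guarantee that a value on a boundary is counted only once.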
  35. Frequency, Relative Frequency, and Percentage Distribution
    • frequency = # in interval
    • relative frequency = frequency/total #
    • percentage = relative frequency x 100%
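    The three quantities can be computed side by side, as in this sketch. The interval labels and frequencies are hypothetical.

```python
# Hypothetical frequency distribution (interval label -> frequency).
frequency = {"50-59": 2, "60-69": 3, "70-79": 5, "80-89": 6, "90-99": 4}

total = sum(frequency.values())
relative = {}
for interval, f in frequency.items():
    rel = f / total            # relative frequency = frequency / total #
    pct = rel * 100            # percentage = relative frequency x 100%
    relative[interval] = rel
    print(f"{interval}: frequency={f}, relative={rel:.2f}, percentage={pct:.0f}%")
```

    The relative frequencies always sum to 1, which is a quick check on the table.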
  36. cumulative frequencies
    # of observations that fall in the class and in smaller classes
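    Cumulative frequencies are running totals, sketched here with made-up classes listed from smallest to largest.

```python
from itertools import accumulate

# Hypothetical frequencies for classes listed from smallest to largest.
classes = ["50-59", "60-69", "70-79", "80-89", "90-99"]
frequency = [2, 3, 5, 6, 4]

# Running total: observations in this class and in all smaller classes.
cumulative = list(accumulate(frequency))
print(dict(zip(classes, cumulative)))
```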
  37. histogram (interval data)
    • use numbers from the frequency distribution to create a graph
    • draw a bar over each interval, the height of the bar represents the relative frequency for that interval
    • bars should touch: extend each bar equally at its lower and upper limits until adjacent bars meet
  38. bar graph (nominal/ordinal data)
    • the bars are usually separated to emphasize that the variable is categorical rather than quantitative
    • for nominal variables (no natural ordering), order the bars by frequency, except possibly for a category "other" that is always last
    • for ordinal data, classes are presented in their natural order (A, B, C, ...)
  39. stem and leaf plot
    • write the observations ordered from smallest to largest
    • each observation is represented by a stem (leading digit(s)) and a leaf (final digit)
    • looks like a histogram sideways - gives individual values
    • contains more information than a histogram, because every single measurement can be recovered
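    A minimal text-only sketch of a stem and leaf plot, assuming two-digit observations so that the tens digit is the stem and the ones digit is the leaf. The data are made up.

```python
from collections import defaultdict

# Hypothetical two-digit observations.
data = [47, 52, 58, 61, 63, 63, 70, 74, 75, 79, 82, 91]

stems = defaultdict(list)
for x in sorted(data):              # order observations from smallest to largest
    stem, leaf = divmod(x, 10)      # stem = leading digit, leaf = final digit
    stems[stem].append(leaf)

for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")

# Every measurement can be recovered from the plot.
recovered = sorted(10 * s + leaf for s, ls in stems.items() for leaf in ls)
```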
  40. describing distributions
    • center, spread (numbers later)
    • symmetric distributions - bell-shaped or U-shaped
    • not symmetric distributions - left-skewed or right-skewed
  41. contingency table
    • number of subjects observed at all the combinations of possible outcomes for the 2 variables
    • ~ are identified by their number of rows and columns - a table with 2 rows and 3 columns is called a 2x3 table
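    A sketch of tabulating bivariate data into a 2x3 contingency table. The two variables (class standing and letter grade) and the nine observations are hypothetical.

```python
from collections import Counter

# Hypothetical bivariate data: (class standing, letter grade) for each subject.
observations = [
    ("lower", "A"), ("lower", "B"), ("lower", "B"), ("lower", "C"),
    ("upper", "A"), ("upper", "A"), ("upper", "B"), ("upper", "C"), ("upper", "C"),
]

counts = Counter(observations)
rows = ["lower", "upper"]    # 2 rows
cols = ["A", "B", "C"]       # 3 columns, so this is a 2x3 table

table = [[counts[(r, c)] for c in cols] for r in rows]

print("      " + "  ".join(cols))
for r, row in zip(rows, table):
    print(f"{r:5} " + "  ".join(str(v) for v in row))
```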
  42. good graphics...
    • present large data sets concisely and coherently
    • can replace a thousand words and still be clearly understood and comprehended
    • encourage the viewer to compare two or more variables
    • do not replace substance by form
    • do not distort what the data reveal
    • have a high "data-to-ink" ratio
  43. bad graphics...
    • don't have a scale on the axis
    • have a misleading caption
    • distort by stretching/shrinking the vertical or horizontal axis
    • use histograms or bar charts with bars of unequal width
    • are more confusing than helpful
  44. sampling variability
    sample-to-sample differences
  45. undercoverage
    some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population
  46. measures of central location or tendency
    • Mean: arithmetic average
    • Median: midpoint of the observations when they are arranged in increasing order
    • Mode: most frequent value
  47. outliers
    stragglers that stand off away from the body of the distribution
  48. mean
    • sample mean = x-bar
    • population mean = mu
    • sometimes the mean is calculated for ordinal variables, but this doesn't always make sense (GPA = 3.8)
    • it is highly influenced by outliers
  49. median
    • falls in the middle of the ordered sample; if n is even, average the 2 middle values
    • for skewed distributions, it is a more appropriate measure of central tendency than the mean (better describes a "typical value")
    • it may be too insensitive to changes in the data
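    A short sketch comparing how the mean and the median react to a single outlier, using Python's standard `statistics` module. The income values (in $1000s) are made up.

```python
import statistics

# Hypothetical incomes in $1000s; the second list adds one outlier.
incomes = [30, 35, 40, 45, 50, 55]
with_outlier = incomes + [500]

# n is even here, so the median averages the two middle values (40 and 45).
print(statistics.mean(incomes), statistics.median(incomes))

# The outlier pulls the mean far more than it moves the median.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```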
  50. trimmed mean
    • compromise between the median and the mean
    • 1. order the data from smallest to largest
    • 2. delete a selected number of values from each end of the ordered list
    • 3. find the mean of the remaining values
  51. trimming percentage
    the percentage of values that have been deleted from each end of the ordered list when calculating the trimmed mean
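    The three steps of the trimmed mean can be sketched as a small function. The name `trimmed_mean` and its `trim_fraction` parameter are hypothetical, the data are made up, and the sketch assumes a trimming percentage below 50% so that some values remain.

```python
def trimmed_mean(values, trim_fraction):
    """Mean after deleting trim_fraction of the values from EACH end (hypothetical helper)."""
    ordered = sorted(values)                   # 1. order the data from smallest to largest
    k = int(len(ordered) * trim_fraction)      # number of values to delete from each end
    kept = ordered[k:len(ordered) - k]         # 2. delete k values from each end
    return sum(kept) / len(kept)               # 3. mean of the remaining values

data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]       # made-up data with one outlier
print(trimmed_mean(data, 0.10))                # 10% trimmed mean
print(trimmed_mean(data, 0))                   # ordinary mean, for comparison
```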
  52. mode
    • the most frequently occurring value
    • on a histogram it would be the highest bar
    • it may not be unique
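    When the mode is not unique, `statistics.multimode` (Python 3.8+) returns every most-frequent value. The hair colors below are made up.

```python
import statistics

colors = ["brown", "black", "brown", "blond", "black", "red"]  # made-up data

# "brown" and "black" both occur twice, so there are two modes.
modes = statistics.multimode(colors)
print(modes)
```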
  53. measures of dispersion of the data
    • variance, standard deviation
    • interquartile range
    • range
  54. percentiles
    • 50th percentile = median
    • 25th = lower quartile = Q1
    • 75th = upper quartile = Q3
  55. interquartile range (IQR)
    • the difference between upper and lower quartile
    • IQR = Q3 - Q1
    • range of values that contains the middle 50% of the data
    • IQR increases as variability increases
  56. five-number summary of a distribution
    reports its median, quartiles, and extremes (maximum and minimum)
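    A sketch of the five-number summary and the IQR. Textbooks and software use slightly different quartile conventions; this one assumes the "median of each half" convention, and both the helper name `five_number_summary` and the data are hypothetical. It needs at least two observations.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the 'median of each half' convention (an assumption)."""
    x = sorted(values)
    half = len(x) // 2
    lower = x[:half]      # values below the median position
    upper = x[-half:]     # values above the median position
    return (x[0], statistics.median(lower), statistics.median(x),
            statistics.median(upper), x[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # made-up data
lo, q1, med, q3, hi = five_number_summary(data)
iqr = q3 - q1                            # IQR = Q3 - Q1
print(lo, q1, med, q3, hi, iqr)
```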
  57. boxplot (AKA box-and-whiskers plot)
    • basically a graphical version of the five-number summary (unless there are outliers)
    • it consists of a box that contains the central 50% of the distribution (from lower quartile to upper quartile)
    • a line within the box that marks the median
    • fences at 1.5 IQRs below the lower quartile and above the upper quartile; observations beyond them are flagged as outliers
    • whiskers that extend to the max and min, unless there are outliers
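    The numbers behind a boxplot can be sketched without any plotting library. This assumes the 1.5 IQR rule for flagging outliers and uses `statistics.quantiles` (Python 3.8+) with its "inclusive" method, which is only one of several quartile conventions. The data are made up.

```python
import statistics

data = [1, 3, 4, 6, 7, 8, 10, 12, 40]   # made-up data; 40 is suspiciously large

# Quartiles from the standard library ("inclusive" is one common convention).
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))    # whiskers stop at the most extreme non-outliers

print(q1, med, q3, outliers, whiskers)
```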
  58. range
    the difference between the extremes (max - min)
  59. variance
    • the average of the squared deviations
    • of the sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
    • of the population: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
  60. standard deviation
    • of the population: $\sigma = \sqrt{\sigma^2}$
    • of the sample: $s = \sqrt{s^2}$
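    A worked sketch of the variance and standard deviation, computed by hand and then printed next to Python's `statistics` module for comparison. The eight data values are made up, and the population versions simply treat the same list as if it were the whole population.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data

n = len(x)
x_bar = sum(x) / n
ss = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations from the mean

s2 = ss / (n - 1)      # sample variance: divide by n - 1
sigma2 = ss / n        # population variance: divide by N (x treated as the population)

print(s2, math.sqrt(s2))           # sample variance and standard deviation
print(sigma2, math.sqrt(sigma2))   # population variance and standard deviation

# The standard library gives the same values.
print(statistics.variance(x), statistics.pvariance(x))
```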
  61. standard deviation (Empirical Rule)
    if the histogram of the data is approximately symmetric and bell-shaped, then
    • about 68% of the data are within one standard deviation from the mean
    • about 95% of the data are within two standard deviations from the mean
    • about 99.7% of the data are within three standard deviations from the mean
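    The three percentages can be checked by simulation. This sketch assumes bell-shaped data generated from a normal distribution; the mean of 100, the standard deviation of 15, and the 10,000 draws are arbitrary choices.

```python
import random
import statistics

random.seed(291)  # fixed seed only so the sketch is reproducible

# Hypothetical bell-shaped data: 10,000 draws from a normal distribution.
data = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

    The printed shares should land close to 0.68, 0.95, and 0.997, with small differences due to sampling variability.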