# STA 291 LEC 1&2

What is Statistics?
- methods for collecting, describing, analyzing, and drawing conclusions from data

Population
- total set of all subjects of interest
- the entire group of people, animals, or things about which we want information

Elementary Unit
- any individual member of the population

Sample
- subset of the population from which the study actually collects information
- used to draw conclusions about the whole population

Variable
- a characteristic of a unit that can vary among subjects in the population/sample
- Examples: gender, nationality, age, income, hair color, height, disease status, company rating, grade in STA 291, state of residence

Sampling Frame
- listing of all the units in the population

Parameter
- numerical characteristic of the population
- calculated using the whole population

Statistic
- numerical characteristic of the sample
- calculated using the sample

Why not measure all of the units in the population?
- Accuracy: may not be able to list them all (may not be able to come up with a frame)
- Time: speed of response
- Expense: cost
- Infinite population
- Destructive sampling or testing

Descriptive Statistics
- summarizing the information in a collection of data

Inferential Statistics
- using information from a sample to make conclusions/predictions about the population

Univariate data set
- consists of observations on a single attribute

Multivariate data
- consists of observations on several attributes
- Special case: bivariate data, with two attributes collected per observation

Nominal variables
- have a scale of unordered categories
- Examples: gender, nationality, hair color (it doesn't make sense to say that green eyes are greater/higher/better than brown eyes)

Ordinal variables
- have a scale of ordered categories; often treated in a quantitative manner
- Examples: disease status, company rating, grade in STA 291

Qualitative variables
- categorical (not numerical)
- nominal and ordinal variables

Quantitative variables
- measured numerically; that is, a number is observed for each subject

Interval scale
- the scale for quantitative variables

Discrete variables
- have a finite (or countably infinite) number of possible values
- all qualitative (categorical) variables are discrete
- only some quantitative (numeric) variables are discrete

Continuous variables
- can take all the values in a continuum of real values
- Examples: time, distance, volume, speed (usually physical measures)

Simple Random Sample
- each possible sample of a given size has the same probability of being selected
- the sample size is usually denoted by "n"

Convenience Sample
- the people sampled just happened to be there

Volunteer Sampling
- this sample will poorly represent the population
- will cause misleading conclusions
- the result is bias: people are much more likely to speak up if they feel strongly about the issue
- Examples: mall interview, street corner interview

Random Sample
- even when it is smaller, it is much more trustworthy than a volunteer sample because it has less bias

Observational Study
- observes individuals and measures variables of interest but does not attempt to influence the responses
- passive data collection
- its purpose is to describe/compare groups or situations

Experiment
- deliberately imposes some treatment on individuals in order to observe their responses
- active data production
- its purpose is to study whether the treatment causes a change in the response

Stratified sampling
- divide the population into separate, non-overlapping groups ("strata")
- select a simple random sample independently (and usually proportionally) from each group

Cluster sampling
- the population can be divided into a set of non-overlapping subgroups (the clusters)
- the clusters are then selected at random, and all individuals in the selected clusters are included in the sample

Systematic sampling
- an initial name is selected at random
- every Kth name is selected after that
- K is computed by dividing the length of the membership list by the desired sample size
- not a simple random sample, but often almost as good as one
- useful when the population exists as a list

Types of bias
- Selection bias: selection of the sample systematically excludes some part of the population of interest
- Measurement/response bias: the method of observation tends to produce values that systematically differ from the true value
- Nonresponse bias: occurs when responses are not actually obtained from all individuals selected for inclusion in the sample

Sampling error
- the error that occurs when a statistic based on a sample estimates or predicts the value of a population parameter
- in random samples, the sampling error can usually be quantified

Non-sampling error
- any error that could also happen in a census
- Examples: bias due to question wording, question order, non-response, wrong answers (especially to delicate questions)

Frequency distribution
- a listing of intervals of possible values for a variable AND a tabulation of the number of observations in each interval
- use intervals of the same length (if possible)
- intervals must be mutually exclusive (any observation must fall into one and only one interval)
- Rule of thumb: if you have n observations, the number of intervals should be about √n

Frequency, Relative Frequency, and Percentage Distribution
- frequency = number of observations in the interval
- relative frequency = frequency / total number of observations
- percentage = relative frequency × 100%

Cumulative frequencies
- number of observations that fall in the class and in smaller classes

Histogram (interval data)
- use numbers from the frequency distribution to create a graph
- draw a bar over each interval; the height of the bar represents the relative frequency for that interval
- bars should be touching; i.e., extend the width of each bar equally at the upper and lower limits so that adjacent bars touch

Bar graph (nominal/ordinal data)
- the bars are usually separated to emphasize that the variable is categorical rather than quantitative
- for nominal variables (no natural ordering), order the bars by frequency, except possibly for a category "other" that is always last
- for ordinal data, classes are presented in the natural order (A, B, C, ...)
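The frequency-distribution rules above (equal-width, mutually exclusive intervals, about √n of them) can be sketched in Python. This is only an illustration and not part of the original card set: the function name `frequency_distribution`, the `scores` data, and the choice to count the maximum value in the last interval are assumptions made for the example.

```python
import math

def frequency_distribution(data, num_intervals=None):
    """Tabulate counts over equal-width, mutually exclusive intervals."""
    n = len(data)
    if num_intervals is None:
        # Rule of thumb from the cards: about sqrt(n) intervals.
        num_intervals = max(1, round(math.sqrt(n)))
    low, high = min(data), max(data)
    width = (high - low) / num_intervals
    if width == 0:
        width = 1  # all observations equal; any positive width works
    counts = [0] * num_intervals
    for x in data:
        # Each observation falls into exactly one interval;
        # the maximum is assigned to the last interval (an assumption).
        idx = min(int((x - low) / width), num_intervals - 1)
        counts[idx] += 1
    rows, cumulative = [], 0
    for i, freq in enumerate(counts):
        cumulative += freq
        rows.append({
            "interval": (low + i * width, low + (i + 1) * width),
            "frequency": freq,
            "relative_frequency": freq / n,
            "percentage": 100 * freq / n,
            "cumulative_frequency": cumulative,
        })
    return rows

# Hypothetical exam scores, made up for this example.
scores = [55, 62, 67, 70, 71, 74, 78, 80, 81, 83, 85, 88, 90, 92, 95, 99]
for row in frequency_distribution(scores):
    print(row)
```

The relative frequencies in the resulting rows are the bar heights a histogram of the same data would use.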
Stem and leaf plot
- write the observations ordered from smallest to largest
- each observation is represented by a stem (leading digit(s)) and a leaf (final digit)
- looks like a histogram sideways, but gives individual values
- contains more information than a histogram, because every single measurement can be recovered

Describing distributions
- center, spread (numbers later)
- symmetric distributions: bell-shaped or U-shaped
- non-symmetric distributions: left-skewed or right-skewed

Contingency table
- number of subjects observed at all the combinations of possible outcomes for the 2 variables
- contingency tables are identified by their number of rows and columns; a table with 2 rows and 3 columns is called a 2x3 table

Good graphics...
- present large data sets concisely and coherently
- can replace a thousand words and still be clearly understood
- encourage the viewer to compare two or more variables
- do not replace substance by form
- do not distort what the data reveal
- have a high "data-to-ink" ratio

Bad graphics...
- don't have a scale on the axis
- have a misleading caption
- distort by stretching/shrinking the vertical or horizontal axis
- use histograms or bar charts with bars of unequal width
- are more confusing than helpful

Sampling variability
- sample-to-sample differences

Undercoverage
- some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population

Measures of central location or tendency
- Mean: arithmetic average
- Median: midpoint of the observations when they are arranged in increasing order
- Mode: most frequent value

Outliers
- stragglers that stand off away from the body of the distribution

Mean
- sample mean = x-bar
- population mean = mu
- sometimes the mean is calculated for ordinal variables, but this doesn't always make sense (GPA = 3.8)
- it is highly influenced by outliers

Median
- falls in the middle of the ordered sample; if n is even, average the 2 middle values
- for skewed distributions, it is a more appropriate measure of central tendency than the mean (better describes a "typical value")
- it may be too insensitive to changes in the data

Trimmed mean
- compromise between the median and the mean
1. order the data from smallest to largest
2. delete a selected number of values from each end of the ordered list
3. find the mean of the remaining values

Trimming percentage
- the percentage of values that have been deleted from each end of the ordered list when calculating the trimmed mean
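The mean, median, and trimmed-mean steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names and the `incomes` data are hypothetical, and the number of values trimmed from each end is rounded down.

```python
def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)

def median(values):
    """Middle of the ordered sample; average the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def trimmed_mean(values, trimming_percentage):
    """Drop trimming_percentage % of the values from EACH end, then average."""
    ordered = sorted(values)                           # step 1: order the data
    k = int(len(ordered) * trimming_percentage / 100)  # values deleted per end
    if 2 * k >= len(ordered):
        raise ValueError("trimming percentage removes every value")
    kept = ordered[k:len(ordered) - k]                 # step 2: delete from each end
    return mean(kept)                                  # step 3: mean of the rest

# Hypothetical incomes (in thousands) with one large outlier.
incomes = [30, 32, 35, 38, 40, 41, 43, 45, 47, 400]
print(mean(incomes), median(incomes), trimmed_mean(incomes, 10))
```

With data like this, the single outlier pulls the mean far above the median, while the 10% trimmed mean stays close to the median.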
Mode
- the most frequently occurring value
- on a histogram it would be the highest bar
- it may not be unique

Measures of dispersion of the data
- variance, standard deviation
- interquartile range
- range

Percentiles
- 50th percentile = median
- 25th percentile = lower quartile = Q1
- 75th percentile = upper quartile = Q3

Interquartile range (IQR)
- the difference between the upper and lower quartiles
- IQR = Q3 - Q1
- range of values that contains the middle 50% of the data
- IQR increases as variability increases

Five-number summary
- of a distribution, reports its median, quartiles, and extremes (maximum and minimum)

Boxplot (AKA box-and-whiskers plot)
- basically a graphical version of the five-number summary (unless there are outliers)
- it consists of a box that contains the central 50% of the distribution (from lower quartile to upper quartile)
- a line within the box that marks the median
- lines (fences) at 1.5 IQRs from the lower and upper quartiles
- whiskers that extend to the maximum and minimum, unless there are outliers

Range
- the difference between the extremes (maximum and minimum)

Variance
- the average of the squared deviations
- of the sample: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
- of the population: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$

Standard deviation
- of the population: $\sigma = \sqrt{\sigma^2}$
- of the sample: $s = \sqrt{s^2}$

Standard deviation
- if the histogram of the data is approximately symmetric and bell-shaped, then:
- about 68% of the data are within one standard deviation of the mean
- about 95% of the data are within two standard deviations of the mean
- about 99.7% of the data are within three standard deviations of the mean

Author: clydethedog | ID: 32635 | Card set: STA 291 LEC 1&2 | Description: Statistics | Updated: 2010-09-29T00:53:46Z
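As a companion to the dispersion cards (variance, standard deviation, quartiles, IQR, five-number summary), here is a rough Python sketch. It is not from the original card set. The function names and the `data` values are hypothetical, and the quartiles are computed as the medians of the lower and upper halves of the ordered data, which is only one of several common conventions and is an assumption here.

```python
import math

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def sample_variance(values):
    """s^2: squared deviations from x-bar, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """sigma^2: squared deviations from mu, divided by N."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

def five_number_summary(values):
    """(min, Q1, median, Q3, max), quartiles as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # values below the median position
    upper = ordered[half + (n % 2):]    # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical data, made up for this example.
data = [2, 4, 4, 4, 5, 5, 7, 9]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
# Boxplot fences: 1.5 IQRs beyond the lower and upper quartiles.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("sample variance:", sample_variance(data))
print("population sd:", math.sqrt(population_variance(data)))
print("five-number summary:", (minimum, q1, med, q3, maximum))
print("IQR:", iqr, "outliers:", outliers)
```

Statistical software may use a different quartile rule, so Q1 and Q3 for small data sets can differ slightly from what this sketch reports.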