# 549-Chapter 3

## What is regression and why do we use it?

Regression is used to make predictions about scores on one variable (Y) from knowledge of scores on another variable (X). These predictions are obtained from the regression line, which is defined as the best-fitting straight line through a set of points in a scatter diagram. It is found by using the principle of least squares, which minimizes the squared deviation around the regression line. The procedure for making such predictions is "linear regression." Regression uses raw score data, so it has no reciprocal property.

### The importance of a regression line

* simplifies the description of the relationship between two variables
* useful for predicting values of the response variable (criterion) from values of the predictor variable
* the line that best represents the trend of the data points
* minimizes the average distance for all data points

### Goal of regression

* Keep prediction errors to a minimum. The values of "b" and "a" are chosen to satisfy this condition; they are constants for a given set of data.
* The goal is to find the regression line that maximizes prediction accuracy and minimizes error.

### Regression equation

Y' = bX + a

* Y = dependent variable or "criterion": the variable whose scores are predicted. Y' is the predicted score on the dependent variable.
* X = independent variable or "predictor": the variable used to predict or estimate scores on the other variable; it provides the information on which predictions are based.
* b = slope of the line; also the regression coefficient between variables Y and X.
The value of b is the amount of change expected in Y when X changes by one unit.
* a = Y-intercept: the value of Y' when X = 0, i.e., the point at which the regression line crosses the Y axis.

### Principle of least squares

* The regression line identifies the "central tendency" of the relationship between two variables.
  - The mean is the least-squares solution for one variable.
  - The regression line is its counterpart in two dimensions: the line on which all data points share equally.
* We want to minimize the squared deviation around the regression line.
* Data points rarely fall on an exact straight line.
  - Any difference between the observed score (Y) and the predicted score for the criterion (Y') is error, or a "residual."
  - Error in predicting Y = Y - Y' (a.k.a. the residual).
  - Σ(Y - Y') = 0, so the residuals are squared before they are summed.
  - Minimizing the sum of squared residuals yields the "least squares" regression line.
  - The best-fitting line minimizes the deviation between observed and predicted Y scores.

### Other regression considerations

* Once values for the slope (b) and intercept (a) are determined, the original regression equation is used to make the best linear predictions possible on new samples of data.
* We must ensure that the original sample is representative of the future groups for whom the predictions will be made.
* If there is no good information on which to base a prediction (i.e., "b" or slope = 0), the same estimate is made for everyone.
  - The best estimate is the mean of the criterion, e.g., when the correlation between the variables = 0.

## What is correlation and why do we use it?

* Determines whether two variables covary ("co-related"): the extent to which change in one variable corresponds with change in a second variable
* Used primarily in examining linear relationships
* Assesses the magnitude and direction of a relationship

Why?

* creates a common metric
* correlation coefficients are reciprocal, unlike regression coefficients
* easily interpreted statistic
* yields information about strength, direction, and effect size

## How are correlation and regression similar/different?

Similar: correlation is a specialized case of regression. Both attempt to determine:

* the degree of association between two or more variables
* whether the association is greater than expected by chance
* the strength of the association, i.e., weak, moderate, strong

Different: the difference between the two methods is analogous to the difference between standardized and raw scores. In correlation, scores are standardized, so the means are the same for both variables; in regression, variables X and Y have their own means.

The correlation coefficient is reciprocal: the correlation between X and Y will always be the same as the correlation between Y and X. E.g., if the correlation between drug dose and activity is .68, the correlation between activity and drug dose is also .68.
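The least-squares line and the reciprocal property described above can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the helper names (`least_squares`, `pearson_r`) and the five data pairs are made up for illustration. It computes b and a from the usual deviation-score formulas, confirms that the residuals sum to zero, and shows that r is the same in both directions while the two regression slopes differ.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def least_squares(x, y):
    """Return (b, a) for the line Y' = bX + a that minimizes squared residuals."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope: expected change in Y per one-unit change in X
    a = my - b * mx        # intercept: value of Y' when X = 0
    return b, a


def pearson_r(x, y):
    """Pearson correlation computed from deviation scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b_yx, a_yx = least_squares(x, y)   # predicting Y from X
b_xy, a_xy = least_squares(y, x)   # predicting X from Y
residuals = [yi - (b_yx * xi + a_yx) for xi, yi in zip(x, y)]

print(f"Y' = {b_yx:.2f}X + {a_yx:.2f}")
print(f"sum of residuals = {sum(residuals):.10f}")
print(f"r(X,Y) = {pearson_r(x, y):.4f}, r(Y,X) = {pearson_r(y, x):.4f}")
print(f"slope of Y on X = {b_yx:.2f}, slope of X on Y = {b_xy:.2f}")
```

For these made-up scores the slope of Y on X is 0.6 and the slope of X on Y is 1.0, while r is the same whichever variable is listed first.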
When both variables are in z-score form, the mean of X = 0 and the mean of Y = 0, so the intercept (a) always equals 0 and is dropped from the equation Y' = bX + a; instead we have Y' = rX. The correlation coefficient (r) equals the regression coefficient (slope) when the variables are measured in standardized units, e.g., SAT scores and GPA.

Regression does not have the reciprocal property. Regression is used to transform scores on one variable into estimated scores on the other. We often use regression to predict raw scores on Y on the basis of raw scores on X. For instance, we might seek an equation to predict a student's grade point average (GPA) on the basis of his or her SAT score. Because regression uses the raw units of the variables, the reciprocal property does not hold: the coefficient that describes the regression of X on Y is usually not the same as the coefficient that describes the regression of Y on X.

## What types of information are yielded from a correlation coefficient (e.g., r = .7: what does this mean)?

The "correlation coefficient" (r) is calculated to measure a relationship.

* It is a mathematical index that describes the direction and strength of the relationship, i.e., a measure of effect size for a particular result.

Types of correlation relationships:

1. Positive
2. Negative
3. No correlation, i.e., information on one variable fails to yield information on the second variable

Perfect correlation: the Z scores for each pair are identical (with opposite signs if the relationship is negative).

Two primary methods to calculate "r":

1. the raw score method
2. calculating the mean of the products of paired Z values

### Pearson product-moment correlation coefficient (a.k.a. Pearson r)

* used when both variables are "continuous" (height, weight, IQ)
* a calculated ratio that estimates the degree of variation in one variable in relation to knowledge about variation in the other variable
* the coefficient ranges between -1.0 and +1.0
* 0 = no relationship

### Other correlation coefficients

* Pearson product-moment correlation coefficient (r): used when measuring two continuous variables.
* Spearman's rho (ρ): used with ordinal data (correlating two sets of ranks).
* Biserial r: the correlation between a continuous variable and an artificial dichotomous variable, i.e., one that reflects an underlying continuous scale forced into a dichotomy (two levels such as yes/no or incorrect/correct). E.g., a pass/fail test and GPA.
* Point biserial r: the correlation between a continuous variable and a true dichotomous variable, i.e., one that naturally forms two categories (such as male/female). E.g., the relationship between gender and GPA.

## What is measurement error? What are the assumptions associated with it?

* The standard deviation of the residuals is known as the standard error of estimate. It is a measure of the accuracy of prediction. Its calculation uses 2 degrees of freedom rather than 1 (N - 2) because the regression equation has two constants (b and a).
* Assumption: prediction is most accurate when the standard error of estimate is relatively small. As it becomes larger, the prediction becomes less accurate.

### Correlational terms

* Residual: the difference between the observed value and the predicted score (Y - Y'). The sum of the residuals always = 0.
* Standard error of estimate: the standard deviation of the residuals; a measure of the accuracy of the prediction.
* Coefficient of determination (r²): the proportion of the total variation in scores on Y that can be accounted for as a function of information about X. It is computed by squaring the correlation coefficient (r). E.g., if the correlation between the SAT score and performance in the first year of college is .40, then the coefficient of determination is .16. The calculation is simply .40² = .16.
This means that we can explain 16% of the variation in first-year college performance by knowing SAT scores.
* Coefficient of alienation: a measure of nonassociation between two variables. A high value means a high degree of nonassociation between the two variables.
* Shrinkage: the amount of decrease observed when a regression equation is created for one population and then applied to another. Many times the strength of a relationship is overestimated, especially if the sample size is small.
* Restriction of range: correlation requires variability. If the variability in the data is restricted, meaningful or significant correlations are difficult to find. E.g., the relationship between GRE quantitative scores and graduate school GPA among elite students: GRE scores vary little, so it is hard to find a relationship.

## Why are causal statements not permitted when analyzing correlations?

Correlation IS NOT causation. The cause of the relationship cannot be discerned from the correlation coefficient:

* X may cause Y
* Y may cause X
* a third variable may cause both X and Y

Correlation versus experiment:

* experimental design: may be able to infer causation
* correlation: variables are measured as they currently exist
* experiment: the presumed causal variable is manipulated by the experimenter, and the investigator controls the study conditions

### Scatter diagrams

A visual depiction of the relationship between two variables:

* one variable on the X-axis (horizontal)
* the other on the Y-axis (vertical)

Each point represents the performance of one individual assessed on two measures. The diagram allows for visual inspection of the data.

## What does the regression line represent?

Importance of the regression line:

* simplifies the description of the relationship between two variables
* useful for predicting values of the response variable (criterion) from values of the predictor variable
* the line that best represents the trend of the data points
* minimizes the average distance for all data points

### Regression toward the mean
This is the tendency of scores to regress toward the mean: if a person is extreme on X, then regression predicts that he or she will be less extreme on Y.

Sir Francis Galton (1885), "Regression toward Mediocrity in Hereditary Stature":

* the physical characteristics of offspring are related to, but less extreme than, those of their parents
* e.g., especially short or tall parents tended to produce offspring who "regress" toward the average

### Testing for significance

It is not enough to demonstrate a relationship between two variables.

* Is the obtained association greater than what would be expected by chance?
  - i.e., determine whether the observed relationship was due to sampling error
  - it is possible to obtain correlations between variables by chance alone
* To answer this, calculate statistical significance.

### How to test for significance

* Begin by establishing a "null" hypothesis, i.e., no relationship between the variables.
* Test for statistical significance using a "t distribution" and select a criterion of significance (e.g., .05, .01).
* A table of significant values for r can also be used.
  - If the obtained value is greater than the tabled value, the null hypothesis is rejected.

### Type I & II errors

* Type I error: the decision that one variable has an effect on or a relationship with another variable when in reality it does not.
* Type II error: the decision that one variable does not have an effect on or a relationship with another variable when in reality it does.

Implications of the decision to retain or reject the null hypothesis:

* Retain the null: this does not establish that the two variables are linearly uncorrelated in the population (a Type II error is possible). The two variables could also be related, but not in a linear fashion.
* Reject the null (i.e., r is significant): some degree of linear relationship is noted, i.e., the relationship in the general population is unlikely to be zero.

### Statistical vs. clinical significance

* statistical: is the obtained result likely to be attributable to chance factors?
* clinical: is the obtained result important or meaningful?
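The significance test for r mentioned above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the card set's own procedure: it uses the standard conversion t = r√(n - 2) / √(1 - r²) with n - 2 degrees of freedom, the function names are hypothetical, and the critical values are not computed but passed in by the caller as they would be read from a printed t table.

```python
from math import sqrt


def t_for_r(r, n):
    """t statistic for testing the null hypothesis of no correlation (df = n - 2)."""
    if n < 3:
        raise ValueError("need at least 3 pairs of scores")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def is_significant(r, n, critical_t):
    """Two-tailed decision: reject the null if |t| exceeds the tabled value."""
    return abs(t_for_r(r, n)) > critical_t


# Hypothetical example: r = .40 obtained from n = 30 pairs of scores.
# 2.048 is the two-tailed .05 critical value of t for df = 28 in standard t tables.
t_large = t_for_r(0.40, 30)
reject_large = is_significant(0.40, 30, critical_t=2.048)

# Same r from only n = 10 pairs; the tabled value for df = 8 is 2.306.
t_small = t_for_r(0.40, 10)
reject_small = is_significant(0.40, 10, critical_t=2.306)

print(f"n = 30: t = {t_large:.2f}, reject null: {reject_large}")
print(f"n = 10: t = {t_small:.2f}, reject null: {reject_small}")
```

With these assumed inputs, the same r of .40 leads to rejecting the null at n = 30 but not at n = 10, which shows how strongly the decision depends on sample size.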
Author: msgreta1970; ID: 36192; Card set: 549-Chapter 3; Description: Chapter 3; Updated: 2010-09-21T19:45:16Z