
What is regression and why do we use it?
Regression is used to make predictions about scores on one variable from knowledge of scores on another variable. These predictions are obtained from the regression line, which is defined as the best-fitting straight line through a set of points in a scatter diagram. It is found by using the principle of least squares, which minimizes the squared deviation around the regression line.
 One variable is used to predict or estimate scores on another variable, i.e., “regression” means making predictions about scores on one variable (Y) based upon knowledge of scores for another variable (X)
 Regression uses raw score data, i.e., it has no reciprocal property

The importance of a Regression line
 *simplifies description of the relationship between two variables
 *useful to predict values for the response variable (criterion) from values on the predictor variable
 *line that best represents the trend of the data points
 *minimizes the average squared distance between the line and the data points
The regression line is the best-fitting straight line through a set of points in a scatter diagram. The procedure for making such predictions is “linear regression.”

Goal of Regression
Keep prediction errors to a minimum. The values of “b” and “a” are chosen to satisfy this condition; they are constants for a given set of data.
*Goal is to find the regression line that maximizes prediction accuracy and minimizes error.

Regression Equation
Y’ = bX + a
 *Y variable = dependent variable or “criterion”: the variable whose scores are predicted; Y’ is the predicted score on the dependent variable
 *X variable = independent variable or “predictor”: provides the information upon which predictions are based
 *b = slope of the line, also the regression coefficient between variables Y and X; the value of b is the amount of change expected in Y when X changes by one unit
 *a = Y-intercept: the value of Y’ when X = 0, i.e., the point at which the regression line crosses the Y axis
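
As an illustration of how b and a in Y’ = bX + a can be obtained, here is a minimal sketch in plain Python. The function names (`fit_line`, `predict`) and the five data points are hypothetical, chosen only for this example; the slope is the sum of cross-products of deviations divided by the sum of squared X deviations, and the intercept follows from the two means.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least squares line Y' = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations and sum of squared X deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope: expected change in Y per unit of X
    a = mean_y - b * mean_x    # intercept: value of Y' when X = 0
    return b, a


def predict(x, b, a):
    """Predicted score Y' for a given X."""
    return b * x + a


# Hypothetical scores for five people (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = fit_line(xs, ys)
print(round(b, 2), round(a, 2))      # 0.6 2.2
print(round(predict(6, b, a), 2))    # 5.8
```

With these made-up scores, each one-unit increase in X raises the predicted Y’ by 0.6, and the line crosses the Y axis at 2.2.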

Principle of least squares
 *regression line identifies the "central tendency" of the relationship between two variables
 the mean is the least squares solution for 1 variable
 the regression line is its analogue in two dimensions, with all points sharing equally
 *want to minimize the squared deviation around the regression line
 *Data points rarely fall in an exact straight line
 any difference between the predicted score for the criterion (Y’) and the observed score (Y) is error or “residual”
 error in predicting Y = Y – Y’ (a.k.a. residual)
 ∑(Y – Y’) = 0, so the deviations are squared before they are summed
 Minimizing the sum of squared deviations yields the “least squares” regression line
 best-fitting line minimizes the squared deviation between observed and predicted Y scores
 *Once values for the slope (b) and intercept (a) are determined, the original regression equation is used to make the best linear predictions possible on new samples of data
 *Must ensure that the original sample is representative of future groups for whom the predictions will be made
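
The two properties above (residuals sum to zero, and the fitted line has the smallest sum of squared residuals) can be checked numerically. This is a sketch with hypothetical data and helper names (`fit_line`, `sum_squared_error`); it compares the fitted line against two arbitrarily nudged lines and is not a proof.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least squares line Y' = bX + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


def sum_squared_error(xs, ys, b, a):
    """Sum of squared residuals (Y - Y') for the line Y' = bX + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Hypothetical scores (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = fit_line(xs, ys)
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

best_sse = sum_squared_error(xs, ys, b, a)
steeper_sse = sum_squared_error(xs, ys, b + 0.1, a)   # nudge the slope
higher_sse = sum_squared_error(xs, ys, b, a + 0.5)    # nudge the intercept

print(abs(residual_sum) < 1e-9)                        # True
print(round(best_sse, 2))                              # 2.4
print(steeper_sse > best_sse, higher_sse > best_sse)   # True True
```

Because the raw residuals cancel out, their plain sum cannot distinguish a good line from a bad one; the squared residuals can.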

Other Regression Considerations
 *If there is no good information on which to base a prediction (i.e., “b” or slope = 0), the same estimate is made for everyone
 *best estimate = mean of the criterion, e.g., when the correlation between the variables = 0
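
A small sketch of the zero-slope case, using hypothetical data constructed so that X and Y are uncorrelated. The helper name `fit_line` is an assumption made for this example. When b = 0, the intercept equals the mean of Y, so every person receives the same predicted score.

```python
def fit_line(xs, ys):
    """Return (b, a) for the least squares line Y' = bX + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Hypothetical scores chosen so that X carries no linear information about Y.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 1, 3]

b, a = fit_line(xs, ys)
mean_y = sum(ys) / len(ys)
predictions = [b * x + a for x in xs]

print(b, a)           # slope is 0, intercept equals the mean of Y
print(predictions)    # the same estimate for everyone
```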

What is correlation and why do we use it?
 *Determine whether two variables covary
 “Correlated”: the extent to which change in one variable corresponds with change in a 2nd variable
*Used primarily in examining linear relationships
 *Assesses magnitude and direction of a relationship
 Why?
 *creates a common metric
 *correlation coefficients are reciprocal in nature, unlike regression coefficients
 *easily interpreted statistic
 *yields information about strength, direction, and effect size

How are correlation and regression similar/different?
 Similar: Correlation is a specialized case of regression. Both attempt to determine:
 *degree of association between two or more variables
 *whether association is greater than expected by chance
 *strength of the association, i.e., weak, moderate, strong
 Difference between the two methods is analogous to the difference between standardized and raw scores. In correlation, because scores are standardized, the means are the same for both variables; in regression, variables X and Y have their own means.
 The correlation coefficient is reciprocal in nature. The correlation between X and Y will always be the same as the correlation between Y and X. E.g., if the correlation between drug dose and activity is .68, the correlation between activity and drug dose is also .68. With both variables in z-score form, the mean of X = 0 and the mean of Y = 0; thus the a (intercept) value will always = 0 and is dropped from the equation Y’ = bX + a, leaving Y’ = rX. The correlation coefficient (r) = the regression coefficient (slope) when both variables are measured in standardized units, e.g., SAT scores and GPA.
Regression does not have this property. Regression is used to transform scores on one variable into estimated scores on the other. We often use regression to predict raw scores on Y on the basis of raw scores on X. For instance, we might seek an equation to predict a student’s grade point average (GPA) on the basis of his or her SAT score. Because regression uses the raw units of the variables, the reciprocal property does not hold. The coefficient that describes the regression of X on Y is usually not the same as the coefficient that describes the regression of Y on X.
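
The contrast can be seen in a short sketch with hypothetical data. The helper names (`pearson_r`, `slope`, `z_scores`) are assumptions for this example, and z scores are computed with the population standard deviation (dividing by N). The correlation is the same in both directions, the raw-score slopes differ, and the slope computed on z scores equals r.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson r from deviation scores."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def slope(predictor, criterion):
    """Least squares slope for predicting criterion from predictor."""
    mp, mc = mean(predictor), mean(criterion)
    spc = sum((p - mp) * (c - mc) for p, c in zip(predictor, criterion))
    spp = sum((p - mp) ** 2 for p in predictor)
    return spc / spp


def z_scores(values):
    """Standardize using the population standard deviation (divide by N)."""
    m = mean(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]


# Hypothetical raw scores (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)
b_y_on_x = slope(xs, ys)    # predicting Y from X
b_x_on_y = slope(ys, xs)    # predicting X from Y
b_standardized = slope(z_scores(xs), z_scores(ys))

print(round(r_xy, 4), round(r_yx, 4))            # 0.7746 0.7746
print(round(b_y_on_x, 2), round(b_x_on_y, 2))    # 0.6 1.0
print(round(b_standardized, 4))                  # 0.7746
```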

What types of information are yielded from a correlation coefficient (e.g., r = .7; what does this mean)?
 “Correlation coefficient” (r) is calculated to measure a relationship
 *mathematical index that describes the direction and strength of the relationship
 i.e., a measure of effect size for a particular result
 e.g., r = .7 indicates a positive, fairly strong linear relationship

Types of correlation relationships
(1) Positive
(2) Negative
 (3) No correlation
 i.e., information on one variable fails to yield information on the 2nd variable

Perfect correlation
 Z scores for each pair identical
 opposite signs if relationship is negative

Two primary methods to calculate “r”
(1) raw score method
(2) calculating mean of product of paired Z values
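
Both methods can be sketched side by side. The function names and data are hypothetical; the raw score version uses the standard computational formula, and the z-score version uses the population standard deviation (dividing by N) so that the mean of the paired z products equals r.

```python
import math


def r_raw_score(xs, ys):
    """Raw score method: works directly from sums of X, Y, XY, X^2, Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def r_mean_z_product(xs, ys):
    """Z-score method: mean of the products of paired z values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    products = [((x - mx) / sd_x) * ((y - my) / sd_y) for x, y in zip(xs, ys)]
    return sum(products) / n


# Hypothetical scores (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r1 = r_raw_score(xs, ys)
r2 = r_mean_z_product(xs, ys)
print(round(r1, 4), round(r2, 4))   # 0.7746 0.7746
```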

Pearson product moment correlation coefficient (a.k.a. Pearson r)
 *used when both variables are “continuous” (height, weight, IQ)
 *calculated ratio: estimates the degree of variation in one variable in relation to knowledge about variation on the other variable
 *coefficient ranges from –1.0 to +1.0
 *0 = no relationship

Other correlation coefficients
Pearson product moment correlation coefficient (r) used when measuring two continuous variables
Spearman’s rho (ρ) – used with ordinal data (correlating two sets of ranks)
Biserial r – correlation between a continuous variable and an artificial dichotomous variable. Dichotomous = 2 levels (y/n, m/f, incorrect/correct); artificial = reflects an underlying continuous scale forced into a dichotomy. E.g., pass/fail on a test and GPA.
Point biserial r – correlation between a continuous variable and a true dichotomous variable (one that naturally forms 2 categories, e.g., m/f). E.g., the relationship between gender and GPA.
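
As one example from this list, Spearman’s rho can be sketched by converting scores to ranks and applying the usual shortcut formula, rho = 1 – 6∑d² / (n(n² – 1)). The function names and data are hypothetical, and the sketch assumes there are no tied scores (the shortcut formula is not exact when ties are present).

```python
def ranks(values):
    """Rank scores from 1 (lowest) to n; assumes no tied scores."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties assumed)."""
    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical scores from two measures (illustrative only).
xs = [10, 20, 30, 40, 50]
ys = [15, 12, 40, 33, 90]

print(ranks(ys))                        # [2, 1, 4, 3, 5]
rho = spearman_rho(xs, ys)
print(round(rho, 2))                    # 0.8
```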

What is measurement error? What are the assumptions associated with it?
 *The standard deviation of the residuals is known as the standard error of estimate
 It is a measure of the accuracy of prediction. Its calculation must use N – 2 degrees of freedom rather than N – 1 because the regression equation has 2 constants (a and b)
*Assumption: prediction is most accurate when the standard error of estimate is relatively small. As it becomes larger, the prediction becomes less accurate.
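
A minimal sketch of the standard error of estimate, assuming hypothetical data and helper names. The sum of squared residuals is divided by N – 2 before taking the square root, reflecting the two constants (a and b) estimated from the data.

```python
import math


def fit_line(xs, ys):
    """Return (b, a) for the least squares line Y' = bX + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x


def standard_error_of_estimate(xs, ys):
    """Standard deviation of the residuals, using N - 2 degrees of freedom."""
    b, a = fit_line(xs, ys)
    sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


# Hypothetical scores (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

see = standard_error_of_estimate(xs, ys)
print(round(see, 3))   # 0.894
```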

Correlational terms
Residual – difference between the observed score and the predicted score (Y – Y’); the sum of the residuals always = 0
Standard error of estimate – the standard deviation of the residuals; a measure of the accuracy of prediction
 Coefficient of determination (r²) – proportion of the total variation in scores on Y that can be accounted for as a function of information about X; computed by squaring the correlation coefficient (r). E.g., if the correlation between SAT score and performance in the first year of college is .40, then the coefficient of determination is .16. The calculation is simply .40² = .16. This means that we can explain 16% of the variation in first-year college performance by knowing SAT scores.
 Coefficient of alienation – measure of nonassociation between two variables; a high value means a high degree of nonassociation between the two variables.
 Shrinkage – amount of decrease observed when a regression equation is created for one population and then applied to another. Many times the strength of the relationship is overestimated, esp. if the sample size is small
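
The SAT example can be worked in a few lines. The formula for the coefficient of alienation is not given in these notes; the sketch assumes the usual definition, the square root of (1 – r²).

```python
import math

r = 0.40                                   # correlation from the SAT example

r_squared = r ** 2                         # coefficient of determination
alienation = math.sqrt(1 - r_squared)      # coefficient of alienation (assumed definition)

print(round(r_squared, 2))     # 0.16  -> 16% of the variation in Y accounted for
print(round(alienation, 3))    # 0.917 -> high degree of nonassociation
```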

Restriction of Range
 It is very difficult to find meaningful relationships when variability in the data is small: “Correlation requires variability”
 If the variability is restricted, then significant correlations are difficult to find. E.g., the relationship between GRE quantitative scores and graduate school GPA among elite students: GRE scores vary little, so it is hard to find a relationship.
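
The effect can be illustrated with made-up numbers. In this sketch the data, the cutoff of 7, and the function name `pearson_r` are all hypothetical; the same scores give a high correlation across the full range and a noticeably lower one when only the top scorers are kept.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test scores (X) and later grades (Y) for ten applicants.
test_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
grades = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r_full = pearson_r(test_scores, grades)

# Restrict the range: keep only the "elite" group with test scores of 7 or more.
elite = [(x, y) for x, y in zip(test_scores, grades) if x >= 7]
elite_x = [x for x, _ in elite]
elite_y = [y for _, y in elite]
r_restricted = pearson_r(elite_x, elite_y)

print(round(r_full, 2), round(r_restricted, 2))   # 0.94 0.6
```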

Why are causal statements not permitted when analyzing correlations?
Correlation IS NOT causation
The cause of the relationship cannot be discerned from the correlation coefficient
 X may cause Y
 Y may cause X
 A third variable may cause both X and Y
 *experimental design – may be able to infer causation
 *correlation – variables are measured as they currently exist
 *experiment – the presumed causal variable is manipulated by the experimenter; the investigator controls the study conditions

Scatter Diagrams
 Visual depiction of relationship between two variables
 *one variable on the X-axis (horizontal)
 *other on the Y-axis (vertical)
 Each point represents performance of one individual assessed on two measures
 Allows for visual inspection of data

What does the regression line represent?
 Importance of regression line: simplifies description of the relationship between two variables
 *useful to predict values for the response variable (criterion) from values on the predictor variable
 *line that best represents the trend of the data points; minimizes the average squared distance for all data points

Regression toward the mean.
Tendency of scores to regress toward the mean. If a person is extreme on X, then regression predicts that he or she will be less extreme on Y.
 Sir Francis Galton (1885) – Regression toward Mediocrity in Hereditary Stature
 *physical characteristics of offspring related but less extreme than those of parents
 e.g., especially short or tall parents tended to produce offspring who “regress” toward average
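
In z-score form the prediction is Y’ = rX, so whenever the correlation is less than perfect the predicted z score is closer to the mean than the predictor’s z score. A tiny sketch, with a hypothetical correlation of .50 and a hypothetical parent who is 2 standard deviations above the mean:

```python
def predicted_z(r, z_x):
    """Predicted z score on Y from a z score on X, using Y' = rX."""
    return r * z_x


r = 0.50          # hypothetical parent-offspring correlation
z_parent = 2.0    # parent is 2 standard deviations above the mean

z_offspring = predicted_z(r, z_parent)
print(z_offspring)                         # 1.0
print(abs(z_offspring) < abs(z_parent))    # True: prediction is less extreme
```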

Testing for significance
 Not enough to demonstrate a relationship between two variables
 *is the obtained association greater than what would be expected by chance?
 **i.e., determine whether the observed relationship was due to sampling error
 **possible to obtain correlations between variables by chance alone, so statistical significance must be calculated

How to test for significance
 *Begin by establishing a “null” hypothesis, i.e., no relationship between the variables
 *Test for statistical significance using a “t distribution” and select a criterion of significance (e.g., .05, .01).
 *Can also use a table of significant values for r
 **If the absolute value of the obtained r is greater than the tabled value, the null hypothesis is rejected
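
A sketch of the t test for a correlation, assuming the standard formula t = r√(N – 2) / √(1 – r²) with N – 2 degrees of freedom. The function names, the r of .775, and the sample sizes are hypothetical; the critical values are two-tailed .05 entries copied from a standard t table (3.182 for df = 3, 2.048 for df = 28).

```python
import math


def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def is_significant(r, n, critical_t):
    """Reject the null hypothesis when |t| exceeds the tabled critical value."""
    return abs(t_for_r(r, n)) > critical_t


r = 0.775   # hypothetical obtained correlation

# Same r, two different sample sizes.
print(round(t_for_r(r, 5), 2), is_significant(r, 5, 3.182))     # small sample
print(round(t_for_r(r, 30), 2), is_significant(r, 30, 2.048))   # larger sample
```

With only 5 pairs of scores the correlation does not reach significance at the .05 level, while the same r with 30 pairs does, which is why sample size matters when interpreting r.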

Type I & II errors
 Type I error – decision that one variable has an effect on or a relationship with another variable when in
 reality it does not
 Type II error – decision that one variable does not have an effect on or a relationship with another variable
 when in reality it does

Implications for decision of retaining or rejecting null hypothesis:
 *retain null: this does not establish that the two variables are linearly uncorrelated in the population (i.e., a Type II error is possible)
 Two variables could be related but not in a linear fashion
 *reject null (i.e., r is significant): some degree of linear relationship is noted, i.e., the relationship in the general population is unlikely to be zero

Statistical vs. clinical significance
statistical – is the obtained result likely to be attributable to chance factors?
clinical – is the obtained result important or meaningful?

