STAT 104 - Chapter 5 - Regression

  1. 1. What is the formula for a straight (linear) line, and what do the symbols represent?

    2. What is the sentence we MUST remember in order to interpret a slope? **PUT ON CHEAT SHEET**
    1. y = a + bx

    y is the variable on the vertical axis (response variable)

    x is the variable on the horizontal axis (explanatory variable)

    a is the intercept. This is where the line cuts the vertical axis. a is the value of y when x is zero

    b is the slope. 

    Image Upload 1

    2. When x increases by one unit, y changes by b.
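A minimal Python sketch of y = a + bx, to check the two facts above: a is the value of y when x is zero, and a one-unit increase in x changes y by b. The function name `line` and the values a = 2, b = 3 are made up for illustration only.

```python
def line(a, b, x):
    """Return y = a + b*x for intercept a and slope b."""
    return a + b * x

a, b = 2.0, 3.0  # hypothetical intercept and slope, not from the course data

# The intercept is the value of y when x is zero.
print(line(a, b, 0))                   # 2.0

# When x increases by one unit, y changes by b.
print(line(a, b, 5) - line(a, b, 4))   # 3.0
```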
  2. 1. What is the least squares regression line and how do you see it?

    Image Upload 2
    2. Use the line to estimate the average growth when the average sea surface temperature is 26.4 degrees.

    3. Use the line to estimate the average growth when the average sea surface temperature is 28 degrees.

    4. Interpret the slope

    5. Interpret the intercept
    1. The least squares regression line is the straight line that best fits the data. It is used to describe a linear relationship between two variables. To see it, hover over the regression line in Minitab and copy the whole equation. It shows the intercept (a) and the slope (b) to put into the linear equation y = a + bx.

    • Image Upload 3
    • a = 5.039 and b = -0.1579, so growth = 5.039 - 0.1579 × temperature

    2. By reading the graph, we can see that the average growth is approx. 0.87 cm/year, even though there is no data point at 26.4 degrees. We use the regression line to estimate values where we have no observation.

    3. You can go a little bit outside the data, but this is quite far outside. There may be a linear pattern in the plot, but we don't know what will happen to coral at 28 degrees. It may die. Mathematically, the line goes on forever, but in practice, we cannot and should not extrapolate this far.

    4. When x increases by one unit, y changes by b.

    Now substitute

    When the sea surface temperature (x) increases by 1°C, (one unit) the growth rate of coral (y) goes down (changes) by 0.1579 cm/year (b). 

    5. Intercept is 5.039

    *If temp (x) = 0°C, growth is 5.039 cm / year.

    Coral doesn't grow at 0 degrees, so the intercept isn't useful
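The estimates in this card can be reproduced with a short Python sketch. It assumes the fitted line is growth = 5.039 - 0.1579 × temperature (intercept 5.039, slope -0.1579, since growth goes down as temperature rises); the names `INTERCEPT`, `SLOPE` and `predicted_growth` are illustrative.

```python
INTERCEPT = 5.039   # a: estimated growth (cm/year) at 0 degrees C
SLOPE = -0.1579     # b: change in growth (cm/year) per 1 degree C increase

def predicted_growth(temp_c):
    """Estimated average coral growth (cm/year) at a given sea surface temperature."""
    return INTERCEPT + SLOPE * temp_c

# Estimate at 26.4 degrees.
print(round(predicted_growth(26.4), 2))   # 0.87

# The arithmetic works at 28 degrees too, but this is far outside the data,
# so the number should not be trusted (extrapolation).
print(round(predicted_growth(28.0), 2))   # 0.62

# The intercept is the prediction at 0 degrees, which is not meaningful for coral.
print(round(predicted_growth(0.0), 3))    # 5.039
```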
  3. 1. What is the co-efficient of determination? How do we write it?

    2. What do we use it for?
    Give an example using coral growth.
    1. The co-efficient of determination is the percent of variation in the response variable that is accounted for by the explanatory variable.

    It is denoted as R², and calculated in Minitab when you hover over the regression line. 

    2. It tells us how much of the variation in the y variable is accounted for by x. In other words, it tells us how well the line fits the data and therefore how good the estimates will be that are calculated from the line.

    Example: 66% (R²) of the variation in coral growth is accounted for by sea surface temperature.

    If you only used sea surface temp, you'd account for 66% of the variation. To make a more accurate prediction, you need to use other variables as well.
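Minitab reports R² for you, but a rough sketch of what is being computed may help. The code below fits a least squares line and computes R² as 1 minus (variation left over around the line) / (total variation in y). The data are made up for illustration and are not the coral data; the function names are hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def r_squared(xs, ys):
    """Fraction of the variation in y accounted for by the line fitted on x."""
    a, b = least_squares(xs, ys)
    mean_y = sum(ys) / len(ys)
    total_variation = sum((y - mean_y) ** 2 for y in ys)
    leftover_variation = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - leftover_variation / total_variation

# Made-up example data (NOT the coral data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(f"R-squared = {100 * r_squared(xs, ys):.1f}%")   # R-squared = 60.0%
```

Here x accounts for 60% of the variation in y, so 40% is left to other variables.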
  4. 1. What does it mean when R² is close to 100%? What does it mean when it is close to 0%?

    2. What 2 things must we consider when determining how good or bad, accurate or inaccurate R² is?

    3. How can we improve R²?
    1. If R² is close to 100%, it means that the line fits the data very well.

    Consequently, estimates calculated from the line will be very accurate. 

    If R² is close to 0%, it means that the line does not fit the data very well. 

    Consequently, estimates calculated from the line will not be very accurate. 

    2. We also need to consider sample size to determine what is good or bad, accurate or inaccurate.

    Outliers can also affect R²

    3. By adding more relevant explanatory variables.
  5. 1. How do we calculate the following question

    What percentage of the variation in average growth is accounted for by the average sea surface temperature?

    2. How do we calculate the following question

    What percentage of the variation in average growth is accounted for by variables other than the average sea surface temperature?

    3. How accurate will the estimates of growth be if we use the regression line to produce these estimates?

    4. What are high influence points?
    1. R² - When you hover over the line, we find that R² is 65.8%. 

    2. 100% - 65.8% (R²) = 34.2%

    3. 65.8% is moderately accurate. Close to 100% would be very accurate, and close to 0% would be very inaccurate.

    4. An individual is a high influence point if omitting it changes the results of calculations. Like outliers, they change the results a lot.
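One way to check for a high influence point is to fit the line with and without the suspect individual and compare. The sketch below does this on made-up data (five points on a perfect line plus one unusual point); the data and the helper `least_squares` are assumptions for illustration.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Made-up data: the last individual (x = 10, y = 0) is the suspect point.
xs = [1, 2, 3, 4, 5, 10]
ys = [1, 2, 3, 4, 5, 0]

a_all, b_all = least_squares(xs, ys)                    # all individuals
a_without, b_without = least_squares(xs[:-1], ys[:-1])  # suspect point omitted

print(f"with the point:    y = {a_all:.3f} + {b_all:.3f}x")
print(f"without the point: y = {a_without:.3f} + {b_without:.3f}x")
```

In this example the slope goes from 1 without the point to about -0.15 with it, so omitting the individual changes the result a lot and it is a high influence point.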
  6. 1. What influences the slope of the line more, outliers in the x or y direction?

    2. What is a lurking variable?

    3. What is meant by the phrase "association does not imply causation."

    4. What is the meaning of causation?
    1. Outliers in the x direction. 

    2. A variable that is not amongst the response and explanatory variables, and yet may influence the interpretation of relationships between these variables.

    * They are variables you haven't included in your data, but are still influencing it. 

    3. An association between two variables does not mean that one variable is causing the other to change, no matter how strong the relationship is. 

    We may be able to use the relationship to predict y from x, but we cannot claim that x is causing y to change. 

    We have to consider lurking variables that might be causing both variables to change together. 

    4. By changing x, we can make y change.
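A small simulation can make the lurking variable idea concrete. In the sketch below, x and y are each built from a third variable z plus independent noise, so neither causes the other, yet they are strongly correlated. Everything here (the variable names, the noise size 0.3, the sample size 500, the seed) is an assumption chosen for illustration.

```python
import random

def correlation(xs, ys):
    """Pearson correlation r between two lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

z = [rng.gauss(0, 1) for _ in range(500)]    # lurking variable
x = [zi + rng.gauss(0, 0.3) for zi in z]     # x depends only on z (plus noise)
y = [zi + rng.gauss(0, 0.3) for zi in z]     # y depends only on z (plus noise)

r_xy = correlation(x, y)

# Remove the part of x and y that comes from z: what is left is unrelated noise.
x_leftover = [xi - zi for xi, zi in zip(x, z)]
y_leftover = [yi - zi for yi, zi in zip(y, z)]
r_leftover = correlation(x_leftover, y_leftover)

print(f"r between x and y:                 {r_xy:.2f}")
print(f"r after removing the effect of z:  {r_leftover:.2f}")
```

The first correlation should come out strong and the second close to zero: the association between x and y is produced entirely by z, so changing x would not make y change.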
  7. How is causation established?
    By conducting a statistical experiment that deliberately manipulates the explanatory variable.
  8. 1. How would you go about conducting a statistical experiment? (3 parts)

    2. What is an observational study, and can causation be established from it?

    3. Are there any exceptions? What factors must be in place? (5 steps)
    1. You must deliberately manipulate the explanatory variable.

    a. There must be at least two groups of individuals,

    b. and different groups must receive different levels of the explanatory variable.

    c. Individuals must also be randomly assigned to the groups. 

    2. In an observational study, individuals are only observed, by direct observation, measurement or asking questions. They are not forced to change anything, and therefore causation cannot be established. 

    *Almost all data collected are from observational studies. 

    3. In rare cases, causation can be claimed from numerous observational studies, provided that certain steps are followed.

    • a. strong association
    • b. consistent association
    • c. Higher doses are associated with stronger responses, e.g. the more packs of cigarettes smoked, the higher the chance of lung cancer
    • d. The alleged cause precedes the effect in time
    • e. The alleged cause is plausible (must make sense)
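The random assignment step of a statistical experiment (answer 1c in this card) can be sketched in Python as follows. The function `randomly_assign`, the subject labels and the seed are hypothetical; the point is only that chance, not the researcher, decides who lands in which group.

```python
import random

def randomly_assign(individuals, n_groups=2, seed=None):
    """Shuffle the individuals and deal them out into n_groups groups."""
    rng = random.Random(seed)
    shuffled = list(individuals)   # copy so the original list is untouched
    rng.shuffle(shuffled)
    # Deal like cards: every n_groups-th individual goes to the same group.
    return [shuffled[i::n_groups] for i in range(n_groups)]

# Ten hypothetical individuals.
subjects = [f"subject_{i}" for i in range(1, 11)]

# Two groups, which would then receive different levels of the explanatory variable.
treatment, control = randomly_assign(subjects, n_groups=2, seed=42)

print("treatment:", treatment)
print("control:  ", control)
```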
  9. Many people believe that big data (very large data sets collected by social media and Google) will provide the answer to many questions. 

    What are the 2 incorrect beliefs that this faith is based on?
    1a. There is no need to worry about causation because correlations are all we need to know for making accurate predictions

    1b. Scientific and statistical theory isn't necessary because with enough data, the numbers speak for themselves.
  10. Image Upload 4
    1. What is the response variable, and what is the explanatory variable?

    2. Describe the relationship

    Image Upload 5

    3. Is it appropriate to fit a straight line to the data? If so, find the least squares regression line.

    4. Estimate the discharge in 2018, assuming that it is safe to extrapolate this far. 

    5. Interpret the slope

    6. Interpret the intercept

    7. Is it appropriate to calculate the correlation? If so, calculate and interpret the correlation. 

    Image Upload 6

    8. What percentage of the variation in discharge (i.e. water) is accounted for by the year? Is year a good predictor of discharge?

    9. What percentage of the variation in discharge (i.e. water) is accounted for by variables other than the year? 

    10. How can we improve the accuracy of our predictions?
    1. Response = water, explanatory = year

    2. We see a somewhat weak increasing linear relationship, as the points are quite widely scattered. Possibly 1 outlier (in 2007, the discharge was higher than expected).

    3. Yes, we can fit a straight line through the data because the relationship looks linear.

    The line is: water = -2589 + 2.239 year

    4. Water = -2589 + (2.239)(2018) = 1929.302 km³

    5. As each year passes, water (discharge) increases by 2.239 km³.

    6. In year 0 (that's 2018 years ago), discharge is estimated to be -2589 km³. This makes no sense at all: how could you have a negative amount of water coming from the rivers into the ocean?

    7. Yes, we can calculate the correlation because the relationship is linear, and we have two numeric variables. 

    r = 0.417

    8. R² = 17.42%, so year is NOT a good predictor of discharge

    9. 100% - 17.42% = 82.58%

    10. Include other variables that are connected with discharge, e.g. rain, size of ice caps, etc.
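The arithmetic in answers 4, 7, 8 and 9 can be checked with a short Python sketch. It uses only the numbers quoted in this card (water = -2589 + 2.239 year, R² = 17.42%); the constant and function names are illustrative.

```python
INTERCEPT = -2589.0      # a, in km³
SLOPE = 2.239            # b, in km³ per year
R_SQUARED_PCT = 17.42    # R² as a percentage

def predicted_discharge(year):
    """Estimated discharge (km³) for a given year, from the fitted line."""
    return INTERCEPT + SLOPE * year

# Answer 4: estimate for 2018 (assuming it is safe to extrapolate this far).
print(round(predicted_discharge(2018), 3))   # 1929.302

# Answer 9: variation accounted for by variables other than year.
print(round(100 - R_SQUARED_PCT, 2))         # 82.58

# Answer 7: with one explanatory variable, r is the square root of R²,
# taking the sign of the slope (positive here).
r = (R_SQUARED_PCT / 100) ** 0.5
print(round(r, 3))                           # 0.417
```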
Quiz #2 Prep