-
What do good models do? Why is it important to have a good-fitting model?
To minimise the total squared error of prediction
-
We can quantify the fit of the model in terms of the deviations between the data (_) and the model predictions (_)
(yi), (yhati)
-
The deviations (y-yhat) are called the ___
Model Residuals
-
The sum of squared residuals is a ___
- - Measure of model fit
- - smaller = better
- - Convenient/intuitive measure of error of prediction
-
Outcome/predictor?
lm(formula=_~_, data=parenthood)
Variables: dansleep, dangrump
dangrump ~ dansleep (outcome ~ predictor)
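A minimal R sketch (assuming a data frame parenthood with columns dangrump and dansleep, as named above):
  # Fit the regression: grumpiness (outcome) predicted by sleep (predictor)
  mod <- lm(formula = dangrump ~ dansleep, data = parenthood)
  summary(mod)  # coefficients, R-squared, F-test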
-
Proportion of variance unexplained (normalised least-squares error)
RSS/TSS
-
Proportion of variance explained (coefficient of determination or R2)
(TSS-RSS)/TSS
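A sketch computing both by hand from the fitted model above (same assumed mod and parenthood):
  rss <- sum(residuals(mod)^2)  # residual sum of squares
  tss <- sum((parenthood$dangrump - mean(parenthood$dangrump))^2)  # total SS
  rss / tss            # proportion of variance UNexplained
  (tss - rss) / tss    # R2: proportion of variance explained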
-
A model can be defined as (2)
- 1) a set of parameters
- 2) a rule for combining the parameters
-
Model parameters can also be called: (2)
- 1) weights
- 2) coefficients
They are chosen to:
- minimize RSS
-
If the model fits the data then the _ is small, or the _ is large
RSS, coefficient of determination (R2)
-
Steps in constructing a statistical test (3)
- 1) Specify a null hyp
- 2) Identify a test statistic of interest
- 3) Determine the sampling distribution of the test stat under the assumption that the null hyp is true (plus any other assumptions you have to make so it will work)
-
Steps in applying the statistical test (4)
- 1) Collect data
- 2) calculate value of test stat
- 3) Compare this value against the relevant sampling distribution (the one that holds if the null hyp is true)
- 4) If the probability of observing a value at least this extreme is smaller than some criterion (alpha), reject the null hyp
-
The F distribution is handy because it:
describes the sampling distribution of the statistic used to test our null hyp under the linear model
-
Linear model formula:
yi = b0 + b1*xi + ei
- yi = outcome
- b0 = intercept
- b1 = slope (coefficient)
- xi = predictor
- ei = residual/error
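A quick simulation in R illustrating the formula (made-up parameter values):
  set.seed(1)
  x <- rnorm(100)                    # predictor
  e <- rnorm(100, mean = 0, sd = 2)  # residuals: normal, mean 0
  y <- 3 + 0.5 * x + e               # true b0 = 3, true b1 = 0.5
  coef(lm(y ~ x))                    # estimates close to 3 and 0.5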
-
"Additive" =…(2)
- If 2+ UNCORRELATED predictors, total proportion of explained variance is additive
- - Additive models don't include any interaction
-
"Sub-additive" = ...
If 2+ CORRELATED predictors, total proportion of explained variance is sub-additive
-
Semi-partial (part) correlation:
rY(X.Z) =
CORRELATION between Y and X with the effect of Z removed from X only
-
Partial correlation: rYX.Z =
CORRELATION between X and Y with the effect of Z removed from both X and Y
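Both can be computed via residuals in R (a sketch, assuming numeric vectors y, x, z):
  x.res <- residuals(lm(x ~ z))  # x with the effect of z removed
  y.res <- residuals(lm(y ~ z))  # y with the effect of z removed
  cor(y, x.res)      # semi-partial rY(X.Z): z removed from x only
  cor(y.res, x.res)  # partial rYX.Z: z removed from both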
-
Dilution effect = (3)
- Adding non-predictive predictors reduces the efficacy of the model
- But salient predictors remain salient
- Suggests paying more attention to tests of individual coefficients
-
Collinearity effect = (3)
- Very high correlation btw 2 predictors INFLATES their standard errors
- None of the coefficients of the correlated predictors may be significant
- Suggests paying more attention to the tests of all the coefficients
-
Variance Inflation Factor (VIF) (3)
- Way of checking for (multi)collinearity
- If you regress a redundant predictor onto all the other predictors, the resulting R2 will be very high (VIF = 1/(1-R2))
- Rule = abandon hope if VIF>10
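One way to compute it in R (hypothetical model/variable names; vif() is from the car package):
  library(car)
  mod <- lm(y ~ x1 + x2 + x3, data = dat)
  vif(mod)  # VIF_k = 1/(1 - R2_k); values > 10 flag serious collinearity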
-
Forward Selection=
Start with no predictors and add ones you think will work
-
Backward selection=
Start with all predictors and remove the ones that don't work
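A common automated (AIC-based) variant of both in R, using step() with hypothetical names:
  null.mod <- lm(y ~ 1, data = dat)             # no predictors
  full.mod <- lm(y ~ x1 + x2 + x3, data = dat)  # all predictors
  step(null.mod, scope = formula(full.mod), direction = "forward")
  step(full.mod, direction = "backward")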
-
Two attitudes towards data (4 and 4)
- Planned:
- - A priori
- - Confirmatory
- - Hypothesis testing
- - Minimal/controlled capitalisation on chance
- Unplanned:
- - Post hoc
- - Exploratory
- - Prediction
- - Maximal/uncontrolled capitalisation on chance
-
Planned model building (3)
- Regression model based on theoretical/practical context
- Hypotheses determined by questions of interest
- Mostly avoids capitalisation on chance
-
Regression (linear model) assumptions: (5)
- 1. Residuals have no discernible structure - the outcome is modelled as a LINEAR function of the predictors (multiply each predictor by a coefficient, add together -> "prediction")
- 2. Residuals are independent (uncorrelated)
- 3. Residuals are normally distributed (mean of 0, some SD)
- 4. Residuals have a constant variance (homoscedasticity)
- 5. No outliers (no residuals distorting the results we would otherwise find)
-
Normal distribution of residuals assumption of linear regression can be tested/viewed using: (2)
- Quantile Probability Plots
- Shapiro-Wilk test (W closer to 1 = more normal; small p = reject normality)
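In R (a sketch, assuming a fitted model mod):
  qqnorm(residuals(mod)); qqline(residuals(mod))  # points near the line = normal
  shapiro.test(residuals(mod))  # W near 1 and p > .05 = consistent with normality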
-
What to do about non-normal residuals? (3)
- Ignore it
- Transform 1 or more variables
- Try another more complicated approach
-
Homoscedasticity=
- The residual variance (population SD) is the same across groups/levels of the predictors
- Tested with a chi-square statistic
- p < .05 = homoscedasticity IS violated
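One option in R is the non-constant variance test from the car package, which reports a chi-square statistic (assuming a fitted model mod):
  library(car)
  ncvTest(mod)  # p < .05 = reject constant variance (homoscedasticity violated)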
-
_ deals with factors, _ deals with numeric
ANOVA deals with factors, MULTIPLE REGRESSION deals with numeric
-
t-tests: (3)
- Compare 2 means of an outcome variable
- The 2 gps are defined by the levels of intervention: (No, Yes)
- Null hyp = both means are the same
-
Dummy (numeric) variable coding: Comparison between 2 gps (2)
- Think of gps as CATEGORICAL variable (or "factor") and conduct t-test accordingly
- Think of gps as defining a numeric (dummy) predictor: one gp gets one value of the predictor (e.g., 0), the other gp gets the other (e.g., 1)
-
If equal variance is assumed, t-tests comparing 2 means of an outcome variable (H0:mu1=mu2) are equivalent to:
- a test of the regression equation: y=b0+b1x+e
- (x is a dummy variable coding for the group, and the null hyp is equivalent to H0:b1=0)
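A sketch of the equivalence in R (hypothetical outcome y and two-level factor grp):
  t.test(y ~ grp, data = dat, var.equal = TRUE)
  summary(lm(y ~ grp, data = dat))  # t and p for the slope b1 match the t-test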
-
Why called "one-way" ANOVA?
- Only one variable is used to predict the outcome, and it has multiple levels
- With only two levels, it is equivalent to a t-test
-
Anova - Factorial Design: (4)
- 2 or more factors are orthogonally (independent of each other) combined/crossed (aka fully crossed design)
- Each CELL defined by choice of level across all factors (factors= Treatment & Expectations)
- Allows effects of multiple factors to be estimated SIMULTANEOUSLY
- Reduces residual error
-
BALANCED DESIGN =
If each cell has the SAME number of observations, it's called a balanced design
-
eta-squared (η2) =
- Effect size: η2 = SS_effect / SS_total
- Partial η2 = SS_effect / (SS_effect + SS_residual), analogous to a partial R2
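A sketch in R (hypothetical names; etaSquared() is from the lsr package):
  mod <- aov(outcome ~ drug * therapy, data = dat)
  summary(mod)
  lsr::etaSquared(mod)  # eta-sq and partial eta-sq for each effect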
-
Interactions = (4)
- Any departure from additive model = interaction
- Effect of one factor not same at each level of other factor
- Effect of one factor DEPENDS on level of other, effects of the two factors are NOT INDEPENDENT
- Interaction is an EFFECT: has a size + can be measured
-
A set of factors is orthogonal if: (2)
- They are fully crossed
- There are equal number of observations in each cell
-
If factors not orthogonal, they share some explained variance. Either: (2)
- 1. The common variance is assigned to one of the correlated factors or
- 2) The common variance is assigned to none of the correlated factors
- (Anova takes option 1 - may not be appropriate)
-
Type I SOS:(4)
- Allocate variance in the order in which factors enter the model
- Called "Sequential Sums of Squares/Type I"
- SAME as forward selection
- Method assumed by ANOVA
-
Type II SOS: (2)
- Allocate ONLY UNIQUE variance
- Only works if NO interaction
-
Type III SOS: (3)
- Do not allocate ANY common variance to any factor/interaction
- Works if SIGNIFICANT interaction
- But main effects may be reduced
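How the three show up in R (assuming a fitted factorial model mod; Anova() is from the car package):
  anova(mod)            # Type I (sequential): order of entry matters
  library(car)
  Anova(mod, type = 2)  # Type II: unique variance only
  Anova(mod, type = 3)  # Type III: needs sum-to-zero contrasts to be meaningful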
-
2 factors: treatment and expectations. Treatment has 3 levels, Expectations has 2 levels. This is called a __ anova
3x2 anova
-
Crossing the 2 factors creates a structure with __ cells
6 (3x2)
-
The mean of each cell is indexed by___
the levels of the 2 factors
-
F is a ratio of …
How to calculate?
- F is a ratio of mean squares
- Divide the effect's mean square by the residual mean square
-
Contrasts (3)
- A planned comparison - (tests meaningful hypothesis)
- A linear combination of predictors whose weights sum to zero
- Another way of specifying dummy variables
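A sketch of planned contrasts for a 3-level factor in R (hypothetical names; each column of weights sums to zero):
  contrasts(dat$treat) <- cbind(
    placebo.vs.drug = c(-2, 1, 1) / 2,  # placebo vs mean of the two drugs
    drugA.vs.drugB  = c(0, -1, 1)       # drug A vs drug B
  )
  summary.lm(aov(y ~ treat, data = dat))  # a t-test for each contrast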
-
Post hoc pairwise comparisons (4)
- If you don't have any meaningful hypotheses can conduct set of these
- EXPLORATORY rather than confirmatory
- Proposed after meaningful hyps have been tested against data
- Compare mean of each cell with mean of every other cell controlling for the FAMILYWISE ERROR RATE
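Two standard options in R (hypothetical names):
  TukeyHSD(aov(y ~ treat, data = dat))  # Tukey's HSD, controls FWER
  pairwise.t.test(dat$y, dat$treat, p.adjust.method = "holm")  # Holm correction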
-
Type 1 error =
rejecting the null hyp when it's true
-
Familywise error rate
If you test k independent hypotheses each at level a, the probability of making at least 1 Type 1 error is 1-(1-a)^k, which grows rapidly with k
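A quick check in R for independent tests at a = .05:
  alpha <- 0.05
  k <- 1:10
  round(1 - (1 - alpha)^k, 3)  # reaches about .40 by k = 10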
-
ANCOVA
Hybrid form of multiple regression + ANOVA
-
ANCOVA combines:
- 1 or more CATEGORICAL factors (as dummy variables/contrasts) + 1 or more CONTINUOUS predictors (called covariates)
- (Interest usually lies in the effects of the FACTORS on the DV)
-
In ANCOVA, the predictors serve 2 main purposes:
- 1) To reduce residual error/variance
- 2) To "control" for possible confounding effects of the covariate(s)
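A sketch in R (hypothetical names; entering the covariate first removes its variance before the factor is tested):
  mod <- aov(outcome ~ covariate + group, data = dat)
  summary(mod)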
-
In ANCOVA, it is desirable that the covariate be at least __ correlated with the DV, and at most __ correlated with the FACTOR of interest
moderately, weakly
-
The __ is based on the variance of the differences between conditions across participants.
This is the same as the __ between the __ and the __
The paired samples t-test is based on the variance of the differences between conditions across participants.
This is the same as the interaction between the factor (time) and the subjects variable (id).
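The equivalence in R (hypothetical vectors t1, t2 holding each subject's scores at the two time points):
  t.test(t1, t2, paired = TRUE)
  t.test(t1 - t2, mu = 0)  # identical t, df, and p-value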
-
Assumptions of repeated measures (3)
- 1. Independence of "subjects" (assume unrelated to each other)
- 2. Normal distribution within each cell
- 3. Sphericity
-
Sphericity =
Assumption that the VARIANCES of differences between each pair of within-subjects cells are EQUAL
-
Sphericity can be tested using:
- Mauchly test of sphericity
- e.g., W = .0004, p < 2.2e-16
- -> REJECT the hypothesis that the variances of differences are equal
-
Why are the Greenhouse-Geisser and Huynh-Feldt corrections sometimes required by a repeated measures anova?
- These corrections are applied to the degrees of freedom of an F-ratio to adjust for failure of the sphericity assumption in repeated measures anova
-
Diagrams serve __ and __ functions
- Expository: explain/provide info
- Productive: generate new info
-
Graphs: (2)
- Diagrams that exhibit relationship between 2 sets of numbers as a set of points having coordinates determined by the relationship (plots).
- Used to illustrate relationships (charts)
-
The ggplot2 package invokes the following terminology (4)
- Aesthetics - map data onto visual elements of the graph
- Geoms (geometries) - specify how elements of the graph are represented
- Themes - modify the look/feel of graph elements
- Others (e.g., scales, facets)
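A minimal example tying the terms together (assuming the parenthood data frame used earlier):
  library(ggplot2)
  ggplot(parenthood, aes(x = dansleep, y = dangrump)) +  # aesthetics: data -> axes
    geom_point() +  # geom: draw the mapped data as points
    theme_bw()      # theme: change the look/feel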