-
What do good models do? Why is it important to have a good-fitting model?
To minimise the total squared error of prediction
-
We can quantify the fit of the model in terms of the deviations between the data (_) and the model predictions (_)
(yi), (yhati)
-
The deviations (y-yhat) are called the ___
Model Residuals
-
The sum of squared residuals is a ___
- - Measure of model fit
- - smaller = better
- - Convenient/intuitive measure of error of prediction
-
Outcome/predictor?
lm(formula=_~_, data=parenthood)
Variables: dansleep, dangrump
dangrump ~ dansleep (outcome ~ predictor)
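A minimal R sketch (assuming a data frame parenthood with columns dangrump and dansleep, as named above):
  # Fit the regression: grumpiness (outcome) predicted by sleep (predictor)
  mod <- lm(formula = dangrump ~ dansleep, data = parenthood)
  summary(mod)  # coefficients, R-squared, F-test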
-
Proportion of variance unexplained (normalised least-squares error)
RSS/TSS
-
Proportion of variance explained (coefficient of determination or R2)
(TSS-RSS)/TSS
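A sketch computing both by hand from the fitted model above (same assumed mod and parenthood):
  rss <- sum(residuals(mod)^2)  # residual sum of squares
  tss <- sum((parenthood$dangrump - mean(parenthood$dangrump))^2)  # total SS
  rss / tss            # proportion of variance UNexplained
  (tss - rss) / tss    # R2: proportion of variance explained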
-
A model can be defined as (2)
- 1) a set of parameters
- 2) a rule for combining the parameters
-
Model parameters can also be called: (2)
- 1) weights
- 2) coefficients
They are chosen to:
- minimize RSS
-
If the model fits the data then the _ is small, or the _ is large
RSS, coefficient of determination (R2)
-
Steps in constructing a statistical test (3)
- 1) Specify a null hyp
- 2) Identify a test statistic of interest
- 3) Determine the sampling distribution of the test stat under the assumption that the null hyp is true (plus any other assumptions you have to make so it will work)
-
Steps in applying the statistical test (4)
- 1) Collect data
- 2) calculate value of test stat
- 3) Compare this value against the relevant sampling distribution (the one that holds if the null hyp is true)
- 4) If the probability of observing a value at least this extreme is smaller than some criterion (alpha), reject the null hyp
-
The F distribution is handy because it:
describes the sampling distribution of the statistic used to test our null hyp under the linear model
-
Linear model formula:
yi = b0 + b1*xi + ei
- yi = outcome
- b0 = intercept
- b1 = slope (coefficient)
- xi = predictor
- ei = residual/error
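A quick simulation in R illustrating the formula (made-up parameter values):
  set.seed(1)
  x <- rnorm(100)                    # predictor
  e <- rnorm(100, mean = 0, sd = 2)  # residuals: normal, mean 0
  y <- 3 + 0.5 * x + e               # true b0 = 3, true b1 = 0.5
  coef(lm(y ~ x))                    # estimates close to 3 and 0.5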
-
"Additive" =…(2)
- If 2+ UNCORRELATED predictors, total proportion of explained variance is additive
- - Additive models don't include any interaction
-
"Sub-additive" = ...
If 2+ CORRELATED predictors, total proportion of explained variance is sub-additive
-
Semi-partial (part) correlation:
rY(X.Z) =
CORRELATION between Y and X with the effect of Z removed from X only
-
Partial correlation: rYX.Z =
CORRELATION between X and Y with the effect of Z removed from both X and Y
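Both can be computed via residuals in R (a sketch, assuming numeric vectors y, x, z):
  x.res <- residuals(lm(x ~ z))  # x with the effect of z removed
  y.res <- residuals(lm(y ~ z))  # y with the effect of z removed
  cor(y, x.res)      # semi-partial rY(X.Z): z removed from x only
  cor(y.res, x.res)  # partial rYX.Z: z removed from both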
-
Dilution effect = (3)
- Adding non-predictive predictors reduces the efficacy of the model
- But salient predictors remain salient
- Suggests paying more attention to tests of individual coefficients
-
Collinearity effect = (3)
- Very high correlation btw 2 predictors INFLATES their standard errors
- None of the coefficients of the correlated predictors may be significant
- Suggests paying more attention to the tests of all the coefficients
-
Variance Inflation Factor (VIF) (3)
- Way of checking for (multi)collinearity
- If you regress a redundant predictor onto all the other predictors, the resulting R2 will be very high (VIF = 1/(1-R2))
- Rule = abandon hope if VIF>10
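One way to compute it in R (hypothetical model/variable names; vif() is from the car package):
  library(car)
  mod <- lm(y ~ x1 + x2 + x3, data = dat)
  vif(mod)  # VIF_k = 1/(1 - R2_k); values > 10 flag serious collinearity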
-
Forward Selection=
Start with no predictors and add ones you think will work
-
Backward selection=
Start with all predictors and remove the ones that don't work
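A common automated (AIC-based) variant of both in R, using step() with hypothetical names:
  null.mod <- lm(y ~ 1, data = dat)             # no predictors
  full.mod <- lm(y ~ x1 + x2 + x3, data = dat)  # all predictors
  step(null.mod, scope = formula(full.mod), direction = "forward")
  step(full.mod, direction = "backward")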
-
Two attitudes towards data (4 and 4)
- Planned:
- - A priori
- - Confirmatory
- - Hypothesis testing
- - Minimal/controlled capitalisation on chance
- Unplanned:
- - Post hoc
- - Exploratory
- - Prediction
- - Maximal/uncontrolled capitalisation on chance
-
Planned model building (3)
- Regression model based on theoretical/practical context
- Hypotheses determined by questions of interest
- Mostly avoids capitalisation on chance
-
Regression (linear model) assumptions: (5)
- 1. Residuals have no discernible structure - the outcome is modelled as a LINEAR function of the predictors (multiply each predictor by a coefficient, add together -> "prediction")
- 2. Residuals are independent (uncorrelated)
- 3. Residuals are normally distributed (mean of 0, some SD)
- 4. Residuals have a constant variance (homoscedasticity)
- 5. No outliers (no residuals distorting the results we would otherwise find)
-
Normal distribution of residuals assumption of linear regression can be tested/viewed using: (2)
- Quantile Probability Plots
- Shapiro-Wilk test (W closer to 1 = more normal; small p = reject normality)
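In R (a sketch, assuming a fitted model mod):
  qqnorm(residuals(mod)); qqline(residuals(mod))  # points near the line = normal
  shapiro.test(residuals(mod))  # W near 1 and p > .05 = consistent with normality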
-
What to do about non-normal residuals? (3)
- Ignore it
- Transform 1 or more variables
- Try another more complicated approach
-
Homoscedasticity=
- The residual variance (population SD) is the same across groups/levels of the predictors
- Tested with a chi-square statistic
- p < .05 = homoscedasticity IS violated
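One option in R is the non-constant variance test from the car package, which reports a chi-square statistic (assuming a fitted model mod):
  library(car)
  ncvTest(mod)  # p < .05 = reject constant variance (homoscedasticity violated)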
-
_ deals with factors, _ deals with numeric
ANOVA deals with factors, MULTIPLE REGRESSION deals with numeric
-
t-tests: (3)
- Compare 2 means of an outcome variable
- The 2 gps are defined by the levels of intervention: (No, Yes)
- Null hyp = both means are the same
-
Dummy (numeric) variable coding: Comparison between 2 gps (2)
- Think of gps as CATEGORICAL variable (or "factor") and conduct t-test accordingly
- Think of gps as defining a numeric (dummy) predictor: one gp gets one value of the predictor (e.g., 0), the other gp gets the other (e.g., 1)
-
If equal variance is assumed, t-tests comparing 2 means of an outcome variable (H0:mu1=mu2) are equivalent to:
- a test of the regression equation: y=b0+b1x+e
- (x is a dummy variable coding for the group, and the null hyp is equivalent to H0:b1=0)
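A sketch of the equivalence in R (hypothetical outcome y and two-level factor grp):
  t.test(y ~ grp, data = dat, var.equal = TRUE)
  summary(lm(y ~ grp, data = dat))  # t and p for the slope b1 match the t-test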
-
Why called "one-way" ANOVA?
- Only one variable is used to predict the outcome, and it has multiple levels
- With only two levels, it is equivalent to a t-test
-
Anova - Factorial Design: (4)
- 2 or more factors are orthogonally (independent of each other) combined/crossed (aka fully crossed design)
- Each CELL defined by choice of level across all factors (factors= Treatment & Expectations)
- Allows effects of multiple factors to be estimated SIMULTANEOUSLY
- Reduces residual error
-
BALANCED DESIGN =
If each cell has the SAME number of observations, it's called a balanced design
-
eta-squared (η2) =
- Effect size: η2 = SS_effect / SS_total
- Partial η2 = SS_effect / (SS_effect + SS_residual), analogous to a partial R2
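A sketch in R (hypothetical names; etaSquared() is from the lsr package):
  mod <- aov(outcome ~ drug * therapy, data = dat)
  summary(mod)
  lsr::etaSquared(mod)  # eta-sq and partial eta-sq for each effect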
-
Interactions = (4)
- Any departure from additive model = interaction
- Effect of one factor not same at each level of other factor
- Effect of one factor DEPENDS on level of other, effects of the two factors are NOT INDEPENDENT
- Interaction is an EFFECT: has a size + can be measured
-
A set of factors is orthogonal if: (2)
- They are fully crossed
- There are equal number of observations in each cell
-
If factors not orthogonal, they share some explained variance. Either: (2)
- 1. The common variance is assigned to one of the correlated factors or
- 2) The common variance is assigned to none of the correlated factors
- (Anova takes option 1 - may not be appropriate)
-
Type I SOS:(4)
- Allocate variance in the order in which factors enter the model
- Called "Sequential Sums of Squares/Type I"
- SAME as forward selection
- Method assumed by ANOVA
-
Type II SOS: (2)
- Allocate ONLY UNIQUE variance
- Only works if NO interaction
-
Type III SOS: (3)
- Do not allocate ANY common variance to any factor/interaction
- Works if SIGNIFICANT interaction
- But main effects may be reduced
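How the three show up in R (assuming a fitted factorial model mod; Anova() is from the car package):
  anova(mod)            # Type I (sequential): order of entry matters
  library(car)
  Anova(mod, type = 2)  # Type II: unique variance only
  Anova(mod, type = 3)  # Type III: needs sum-to-zero contrasts to be meaningful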
-
2 factors: treatment and expectations. Treatment has 3 levels, Expectations has 2 levels. This is called a __ anova
3x2 anova
-
Crossing the 2 factors creates a structure with __ cells
6 (3x2)
-
The mean of each cell is indexed by___
the levels of the 2 factors
-
F is a ratio of …
How to calculate?
- F is a ratio of mean squares
- Divide the effect's mean square by the residual mean square
-
Contrasts (3)
- A planned comparison - (tests meaningful hypothesis)
- A linear combination of predictors whose weights sum to zero
- Another way of specifying dummy variables
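A sketch of planned contrasts for a 3-level factor in R (hypothetical names; each column of weights sums to zero):
  contrasts(dat$treat) <- cbind(
    placebo.vs.drug = c(-2, 1, 1) / 2,  # placebo vs mean of the two drugs
    drugA.vs.drugB  = c(0, -1, 1)       # drug A vs drug B
  )
  summary.lm(aov(y ~ treat, data = dat))  # a t-test for each contrast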
-
Post hoc pairwise comparisons (4)
- If you don't have any meaningful hypotheses can conduct set of these
- EXPLORATORY rather than confirmatory
- Proposed after meaningful hyps have been tested against data
- Compare mean of each cell with mean of every other cell controlling for the FAMILYWISE ERROR RATE
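Two standard options in R (hypothetical names):
  TukeyHSD(aov(y ~ treat, data = dat))  # Tukey's HSD, controls FWER
  pairwise.t.test(dat$y, dat$treat, p.adjust.method = "holm")  # Holm correction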
-
Type 1 error =
rejecting the null hyp when it's true
-
Familywise error rate
If you test k independent hypotheses each at level a, the probability of making at least 1 Type 1 error is 1-(1-a)^k, which grows rapidly with k
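A quick check in R for independent tests at a = .05:
  alpha <- 0.05
  k <- 1:10
  round(1 - (1 - alpha)^k, 3)  # reaches about .40 by k = 10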
-
ANCOVA
Hybrid form of multiple regression + ANOVA
-
ANCOVA combines:
- 1 or more CATEGORICAL factors (as dummy variables/contrasts) + 1 or more CONTINUOUS predictors (called covariates)
- (Interest usually lies in the effects of the FACTORS on the DV)
-
In ANCOVA, the predictors serve 2 main purposes:
- 1) To reduce residual error/variance
- 2) To "control" for possible confounding effects of the covariate(s)
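A sketch in R (hypothetical names; entering the covariate first removes its variance before the factor is tested):
  mod <- aov(outcome ~ covariate + group, data = dat)
  summary(mod)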
-
In ANCOVA, it is desirable that the covariate be at least __ correlated with the DV, and at most __ correlated with the FACTOR of interest
moderately, weakly
-
The __ is based on the variance of the differences between conditions across participants.
This is the same as the __ between the __ and the __
The paired samples t-test is based on the variance of the differences between conditions across participants.
This is the same as the interaction between the factor (time) and the subjects variable (id).
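The equivalence in R (hypothetical vectors t1, t2 holding each subject's scores at the two time points):
  t.test(t1, t2, paired = TRUE)
  t.test(t1 - t2, mu = 0)  # identical t, df, and p-value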
-
Assumptions of repeated measures (3)
- 1. Independence of "subjects" (assume unrelated to each other)
- 2. Normal distribution within each cell
- 3. Sphericity
-
Sphericity =
Assumption that the VARIANCES of differences between each pair of within-subjects cells are EQUAL
-
Sphericity can be tested using:
- Mauchly test of sphericity
- e.g., W = .0004, p < 2.2e-16
- -> REJECT the hypothesis that the variances of differences are equal
-
Why are the Greenhouse-Geisser and Huynh-Feldt corrections sometimes required by a repeated measures anova?
- These corrections are applied to the degrees of freedom of an F-ratio to adjust for failure of the sphericity assumption in repeated measures anova
-
Diagrams serve __ and __ functions
- Expository: explain/provide info
- Productive: generate new info
-
Graphs: (2)
- Diagrams that exhibit relationship between 2 sets of numbers as a set of points having coordinates determined by the relationship (plots).
- Used to illustrate relationships (charts)
-
The ggplot2 package invokes the following terminology (4)
- Aesthetics - map data onto visual elements of the graph
- Geoms (geometries) - specify how elements of the graph are represented
- Themes - modify the look/feel of graph elements
- Others (e.g., scales, facets)
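A minimal example tying the terms together (assuming the parenthood data frame used earlier):
  library(ggplot2)
  ggplot(parenthood, aes(x = dansleep, y = dangrump)) +  # aesthetics: data -> axes
    geom_point() +  # geom: draw the mapped data as points
    theme_bw()      # theme: change the look/feel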