quals: clinical data mining part IV

  1. Inferential tests: goals
    • estimate parameters
    • Y = f(X1, ..., Xn, noise, parameters)
    • aims to infer properties of a population given a smaller, observable sample
    • could this function have generated the output I see?
  2. Inferential tests: model validation
    a yes/no check using goodness-of-fit tests and residual examination
  3. Inferential tests: worries
    • significance: is there a measurable difference?
    • effect size: is the measured difference relevant?
  4. confidence intervals interpretation
    • If many samples were collected and a 95% confidence interval computed from each, 95% of those intervals would contain the true value of the parameter, theta
    • 95% confidence interval does NOT mean that there is a 95% probability that the interval contains the true value of the parameter theta
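The coverage interpretation above can be checked by simulation; a minimal sketch (sample size, trial count, and seed are arbitrary choices):

```python
# Monte Carlo check of CI coverage: repeatedly sample from a known
# distribution and count how often the 95% CI for the mean covers
# the true mean. (Illustrative sketch; known-sigma z-interval used
# for simplicity.)
import random
import math

random.seed(0)
TRUE_MEAN, SIGMA, N, TRIALS = 10.0, 2.0, 50, 2000
covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    se = SIGMA / math.sqrt(N)
    lo, hi = mean - 1.96 * se, mean + 1.96 * se
    if lo <= TRUE_MEAN <= hi:
        covered += 1
coverage = covered / TRIALS
print(f"empirical coverage: {coverage:.3f}")   # close to 0.95
```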
  5. point estimate
    providing a best guess of a parameter as a function of data points from some distribution
  6. linear regression
    • Method to estimate the intercept and slope of a line quantifying the relationship between at least one independent (predictor) variable and one dependent (outcome) variable
    • Y=α+βX+random error
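The closed-form least-squares estimates for α and β can be computed directly; a toy sketch with made-up data:

```python
# Least-squares estimates for Y = alpha + beta*X + error,
# via the closed-form formulas (toy data; a sketch, not a library).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
       / sum((x - xbar) ** 2 for x in xs)
alpha = ybar - beta * xbar
print(f"alpha={alpha:.2f}, beta={beta:.2f}")
```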
  7. multiple linear regression
    • linear model with more than one predictor
    • allows adjustment for possible confounders
    • Interpret each coefficient as the change in the outcome per unit increase in that variable, with all other variables held constant
  8. logistic regression
    fits a model for a categorical outcome variable
  9. applications of logistic regression (2)
    • Comparing response rates in cancer patients among treatment arms adjusted for age and duration of disease
    • Comparing the proportion of hypertensive patients between two treatment arms adjusted for age, cholesterol level and exercise habits
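A minimal logistic regression fit by gradient descent on made-up data (a sketch, not a validated implementation); exponentiating the fitted slope gives the odds ratio noted on the next card:

```python
import math

# Fit P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient descent
# on toy data where the outcome becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(b0 + b1 * x) - y   # gradient of the log-loss
        g0 += err
        g1 += err * x
    b0 -= lr * g0 / len(xs)
    b1 -= lr * g1 / len(xs)

odds_ratio = math.exp(b1)   # OR per unit increase in x
print(f"b1={b1:.2f}, OR={odds_ratio:.2f}")
```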
  10. cox regression
    Multivariate regression for a time-to-event outcome
  11. Cox proportional hazards model
    • Relates the time that passes before some event occurs to one or more covariates that may be associated with that event
    • Assumes that the effects of the predictor variables upon survival are constant over time and are multiplicative
  12. Regularized regression
    • Increase the bias of linear regression a little to reduce the variance by a lot
    • ridge, lasso, elastic net
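The bias-variance trade is easiest to see in one dimension, where (for centered data) ridge just adds lambda to the denominator of the OLS slope, shrinking it toward zero; a sketch with illustrative numbers:

```python
# Ridge shrinkage in one dimension on roughly centered toy data.
# Sketch only; real use would rely on a library such as scikit-learn.
xs = [-2, -1, 0, 1, 2]
ys = [-4.1, -1.9, 0.2, 2.1, 3.8]   # roughly y = 2x

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

beta_ols = sxy / sxx
beta_ridge = sxy / (sxx + 5.0)     # lambda = 5 chosen for illustration
print(beta_ols, beta_ridge)        # ridge slope is pulled toward 0
```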
  13. classification trees
    • Model space = piecewise constant functions with disjoint rectangular regions
    • robust to outliers, fast to fit, will ignore useless features
    • not good predictors because of high variance: the decision boundary is sensitive to the particular training data
  14. solutions to classification tree problems
    • bagging: fit many trees to bootstrap-resampled versions of the training data, use majority vote as prediction
    • random forest: at each split, only consider a random subset of features to split on
    • boosting: fit trees to successively reweighted versions of the data, accounting for errors made by previous trees, use weighted majority vote as prediction
    • boosting and random forests are among the best ML methods
    • boosting is slightly better than random forest, but has more parameters to tune
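The bagging idea from card 14 can be sketched with decision stumps on a toy 1-D problem (illustrative only, not a production ensemble):

```python
import random

# Bagging sketch: fit many decision stumps to bootstrap resamples of
# a 1-D training set and predict by majority vote.
random.seed(1)
train = [(x, 0) for x in [1, 2, 3, 4]] + [(x, 1) for x in [6, 7, 8, 9]]

def fit_stump(data):
    """Pick the threshold that best separates the two classes."""
    best_t, best_err = None, float("inf")
    for t in [p[0] + 0.5 for p in data]:
        err = sum((x > t) != (y == 1) for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

stumps = []
for _ in range(25):
    boot = [random.choice(train) for _ in train]   # bootstrap resample
    stumps.append(fit_stump(boot))

def predict(x):
    votes = sum(x > t for t in stumps)             # majority vote
    return 1 if votes > len(stumps) / 2 else 0

print(predict(2.0), predict(8.0))
```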
  15. type I error
    • falsely rejecting the null hypothesis when there is no effect
    • occurs in 1 of 20 true-null tests, on average, when the p-value threshold is 0.05
  16. type II error
    failing to reject the null hypothesis when there actually is an effect
  17. statistical power
    • complement of the type II error rate: power = 1 − β
    • chance of detecting a true effect if it exists
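Power can be estimated by Monte Carlo: simulate many experiments in which a true effect exists and count how often the test rejects (a sketch with a one-sided known-sigma z-test; all numbers illustrative):

```python
import random
import math

# Fraction of simulated experiments that detect a true effect of 0.5
# at alpha = 0.05 (one-sided z-test, sigma known).
random.seed(2)
EFFECT, SIGMA, N, TRIALS = 0.5, 1.0, 30, 2000
z_crit = 1.645                       # one-sided 5% cutoff
rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(EFFECT, SIGMA) for _ in range(N)]
    z = (sum(sample) / N) / (SIGMA / math.sqrt(N))
    if z > z_crit:
        rejections += 1
power = rejections / TRIALS
print(f"estimated power: {power:.2f}")
```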
  18. hypothesis testing steps (6)
    • define a null hypothesis
    • determine the null distribution for the variable to be observed
    • conduct experiment to measure variable under conditions of interest
    • summarize the variable as a point estimate of the unobserved parameter (e.g. mean)
    • determine the probability that the observation was by chance given the null distribution
    • if the probability is low (less than 5%), reject the null hypothesis, declare a measurable difference from the null
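The six steps applied to a toy example, a coin hypothesized to be fair, with an exact one-sided binomial p-value (counts are made up):

```python
from math import comb

# Null: fair coin. Observation: 16 heads in 20 flips. p-value: the
# probability, under the null, of a result at least this extreme.
n, heads = 20, 16
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n
reject_null = p_value < 0.05
print(f"p = {p_value:.4f}, reject: {reject_null}")
```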
  19. risk
    the rate of experiencing the outcome
  20. odds
    the chance of experiencing the outcome relative to not experiencing the outcome
  21. relative risk (RR)
    • measure of the risk of an outcome in a group compared to another group
    • Cannot be calculated from case-control data, because the study design fixes the prevalence of the outcome in the sample (it does not reflect the population)
  22. odds ratio (OR)
    • A measure of the odds of the outcome in a group compared to a control group
    • Can be calculated from case-control study data
    • Output of logistic regression (you get the odds ratio for free)
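Risk, odds, RR, and OR, computed from a single hypothetical 2x2 table:

```python
# Hypothetical 2x2 table:
#                outcome   no outcome
#   exposed         30          70
#   unexposed       10          90
a, b, c, d = 30, 70, 10, 90

risk_exposed   = a / (a + b)                      # 0.30
risk_unexposed = c / (c + d)                      # 0.10
relative_risk  = risk_exposed / risk_unexposed    # 3.0

odds_exposed   = a / b
odds_unexposed = c / d
odds_ratio     = odds_exposed / odds_unexposed    # (a*d)/(b*c)
print(relative_risk, odds_ratio)
```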
  23. predictive test: goals
    • learn a function f(x) that operates on x to predict y
    • what is the function that could have generated the output I see?
  24. predictive test: model validation
    measured by predictive accuracy
  25. predictive test: worries
    • accuracy: how often do i get it right?
    • actionability: i got it right, so what?
  26. Positive Predictive Value (precision)
    • how likely are you to have the disease if you test positive?
    • TP / (TP + FP)
  27. Negative predictive value (npv)
    • How likely are you to not have the disease if you test negative?
    • TN / (TN + FN)
  28. sensitivity (recall)
    • How likely are you to test positive if you have the disease?
    • TP / (TP + FN)
  29. Specificity
    • How likely are you to test negative if you do not have the disease?
    • TN / (TN + FP)
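The four quantities on cards 26-29, computed from one hypothetical confusion matrix:

```python
# Hypothetical test results for a screening test.
TP, FP, TN, FN = 80, 40, 860, 20

ppv         = TP / (TP + FP)   # precision
npv         = TN / (TN + FN)
sensitivity = TP / (TP + FN)   # recall
specificity = TN / (TN + FP)
print(ppv, npv, sensitivity, specificity)
```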
  30. causal test on observational data
    • If you want to do causal analysis on observational data, then you need to find two people that are the same besides the fact that one person got the exposure and the other did not. 
    • In other words, we want to find two people that had the same probability of receiving treatment, but only one of them actually did receive treatment
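A nearest-propensity matching sketch with hypothetical scores (real analyses would estimate each patient's propensity with a model such as logistic regression):

```python
# Pair each treated patient with the untreated patient whose
# (hypothetical) propensity score is closest, without replacement.
treated   = [("t1", 0.62), ("t2", 0.35)]           # (id, propensity)
untreated = [("u1", 0.60), ("u2", 0.33), ("u3", 0.90)]

pairs = []
available = list(untreated)
for tid, p in treated:
    match = min(available, key=lambda u: abs(u[1] - p))
    pairs.append((tid, match[0]))
    available.remove(match)                        # match without replacement
print(pairs)
```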
  31. exploratory tests: goals
    • find sub-groups: clustering
    • find themes: look for patterns, decomposition
  32. confounding factors
    Factors that can cause or prevent the outcome of interest, are not intermediate variables, and are associated with the factors under investigation
  33. ways to prevent being wrong in analyses (5)
    • consider confounding factors
    • replication: over time, across sites, using different study designs
    • quantify the "instability" of analysis: i.e. variance in the face of alternative study design and data source choices
    • test for, and quantify non-stationarity in the data
    • examine multiple dimensions of performance: calibration, AUPRC, effect size, estimated FDR, etc.
Author
tulipyoursweety