quals: clinical data mining part IV

  1. Inferential tests: goals
    • estimate parameters
    • Y = f(X1, ..., Xn, noise, parameters)
    • aims to infer properties of a population given a smaller, observable sample
    • could this function have generated the output I see?
  2. Inferential tests: model validation
    a yes/no check using goodness-of-fit tests and residual examination
  3. Inferential tests: worries
    • significance: is there a measurable difference?
    • effect size: is the measured difference relevant?
  4. confidence intervals interpretation
    • If many samples were collected and a 95% confidence interval computed from each, 95% of those intervals would contain the true value of the parameter, theta
    • 95% confidence interval does NOT mean that there is a 95% probability that the interval contains the true value of the parameter theta
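The coverage interpretation above can be checked by simulation; a minimal sketch (sample size, trial count, and seed are arbitrary choices):

```python
# Monte Carlo check of CI coverage: repeatedly sample from a known
# distribution and count how often the 95% CI for the mean covers
# the true mean. (Illustrative sketch; known-sigma z-interval used
# for simplicity.)
import random
import math

random.seed(0)
TRUE_MEAN, SIGMA, N, TRIALS = 10.0, 2.0, 50, 2000
covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    se = SIGMA / math.sqrt(N)
    lo, hi = mean - 1.96 * se, mean + 1.96 * se
    if lo <= TRUE_MEAN <= hi:
        covered += 1
coverage = covered / TRIALS
print(f"empirical coverage: {coverage:.3f}")   # close to 0.95
```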
  5. point estimate
    providing a best guess of a parameter as a function of data points from some distribution
  6. linear regression
    • Method to estimate the intercept and slope of a line quantifying the relationship between at least one independent (predictor) variable and one dependent (outcome) variable
    • Y=α+βX+random error
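The closed-form least-squares estimates for α and β can be computed directly; a toy sketch with made-up data:

```python
# Least-squares estimates for Y = alpha + beta*X + error,
# via the closed-form formulas (toy data; a sketch, not a library).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
       / sum((x - xbar) ** 2 for x in xs)
alpha = ybar - beta * xbar
print(f"alpha={alpha:.2f}, beta={beta:.2f}")
```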
  7. multiple linear regression
    • linear model with more than one predictor
    • allows adjustment for possible confounders
    • Interpret each coefficient as the change in the outcome per unit increase in that variable, with all other variables held constant
  8. logistic regression
    fits a model for a categorical outcome variable
  9. applications of logistic regression (2)
    • Comparing response rates in cancer patients among treatment arms adjusted for age and duration of disease
    • Comparing the proportion of hypertensive patients between two treatment arms adjusted for age, cholesterol level and exercise habits
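A minimal logistic regression fit by gradient descent on made-up data (a sketch, not a validated implementation); exponentiating the fitted slope gives the odds ratio noted on the next card:

```python
import math

# Fit P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient descent
# on toy data where the outcome becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(b0 + b1 * x) - y   # gradient of the log-loss
        g0 += err
        g1 += err * x
    b0 -= lr * g0 / len(xs)
    b1 -= lr * g1 / len(xs)

odds_ratio = math.exp(b1)   # OR per unit increase in x
print(f"b1={b1:.2f}, OR={odds_ratio:.2f}")
```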
  10. cox regression
    Multivariate regression for a time-to-event outcome
  11. Cox proportional hazards model
    • Relates the time that passes before some event occurs to one or more covariates that may be associated with that event
    • Assumes that the effects of the predictor variables upon survival are constant over time and are multiplicative
  12. Regularized regression
    • Increase the bias of linear regression a little to reduce the variance by a lot
    • ridge, lasso, elastic net
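The bias-variance trade is easiest to see in one dimension, where (for centered data) ridge just adds lambda to the denominator of the OLS slope, shrinking it toward zero; a sketch with illustrative numbers:

```python
# Ridge shrinkage in one dimension on roughly centered toy data.
# Sketch only; real use would rely on a library such as scikit-learn.
xs = [-2, -1, 0, 1, 2]
ys = [-4.1, -1.9, 0.2, 2.1, 3.8]   # roughly y = 2x

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

beta_ols = sxy / sxx
beta_ridge = sxy / (sxx + 5.0)     # lambda = 5 chosen for illustration
print(beta_ols, beta_ridge)        # ridge slope is pulled toward 0
```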
  13. classification trees
    • Model space = piecewise constant functions with disjoint rectangular regions
    • robust to outliers, fast to fit, will ignore useless features
    • not good predictors because of high variance: the decision boundary is sensitive to the particular training data
  14. solutions to classification tree problems
    • bagging: fit many trees to bootstrap-resampled versions of the training data, use majority vote as prediction
    • random forest: at each split, only consider a random subset of features to split on
    • boosting: fit trees to successively reweighted versions of the data, accounting for errors made by previous trees, use weighted majority vote as prediction
    • boosting and random forests are among the best ML methods
    • boosting is slightly better than random forest, but has more parameters to tune
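The bagging idea from card 14 can be sketched with decision stumps on a toy 1-D problem (illustrative only, not a production ensemble):

```python
import random

# Bagging sketch: fit many decision stumps to bootstrap resamples of
# a 1-D training set and predict by majority vote.
random.seed(1)
train = [(x, 0) for x in [1, 2, 3, 4]] + [(x, 1) for x in [6, 7, 8, 9]]

def fit_stump(data):
    """Pick the threshold that best separates the two classes."""
    best_t, best_err = None, float("inf")
    for t in [p[0] + 0.5 for p in data]:
        err = sum((x > t) != (y == 1) for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

stumps = []
for _ in range(25):
    boot = [random.choice(train) for _ in train]   # bootstrap resample
    stumps.append(fit_stump(boot))

def predict(x):
    votes = sum(x > t for t in stumps)             # majority vote
    return 1 if votes > len(stumps) / 2 else 0

print(predict(2.0), predict(8.0))
```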
  15. type I error
    • falsely rejecting the null hypothesis when there is no effect
    • occurs in 1 of 20 true-null tests, on average, when the p-value threshold is 0.05
  16. type II error
    failing to reject the null hypothesis when there actually is an effect
  17. statistical power
    • complement of the type II error rate: power = 1 − β
    • chance of detecting a true effect if it exists
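Power can be estimated by Monte Carlo: simulate many experiments in which a true effect exists and count how often the test rejects (a sketch with a one-sided known-sigma z-test; all numbers illustrative):

```python
import random
import math

# Fraction of simulated experiments that detect a true effect of 0.5
# at alpha = 0.05 (one-sided z-test, sigma known).
random.seed(2)
EFFECT, SIGMA, N, TRIALS = 0.5, 1.0, 30, 2000
z_crit = 1.645                       # one-sided 5% cutoff
rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(EFFECT, SIGMA) for _ in range(N)]
    z = (sum(sample) / N) / (SIGMA / math.sqrt(N))
    if z > z_crit:
        rejections += 1
power = rejections / TRIALS
print(f"estimated power: {power:.2f}")
```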
  18. hypothesis testing steps (6)
    • define a null hypothesis
    • determine the null distribution for the variable to be observed
    • conduct experiment to measure variable under conditions of interest
    • summarize the variable as a point estimate of the unobserved parameter (e.g. mean)
    • determine the probability that the observation was by chance given the null distribution
    • if the probability is low (less than 5%), reject the null hypothesis, declare a measurable difference from the null
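The six steps applied to a toy example, a coin hypothesized to be fair, with an exact one-sided binomial p-value (counts are made up):

```python
from math import comb

# Null: fair coin. Observation: 16 heads in 20 flips. p-value: the
# probability, under the null, of a result at least this extreme.
n, heads = 20, 16
p_value = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n
reject_null = p_value < 0.05
print(f"p = {p_value:.4f}, reject: {reject_null}")
```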
  19. risk
    the rate of experiencing the outcome
  20. odds
    the chance of experiencing the outcome relative to not experiencing the outcome
  21. relative risk (RR)
    • measure of the risk of an outcome in a group compared to another group
    • Cannot be calculated from case-control data, because the study design fixes the prevalence of the outcome in the sample (it does not reflect the population)
  22. odds ratio (OR)
    • A measure of the odds of the outcome in a group compared to a control group
    • Can be calculated from case-control study data
    • Output of logistic regression (you get the odds ratio for free)
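Risk, odds, RR, and OR, computed from a single hypothetical 2x2 table:

```python
# Hypothetical 2x2 table:
#                outcome   no outcome
#   exposed         30          70
#   unexposed       10          90
a, b, c, d = 30, 70, 10, 90

risk_exposed   = a / (a + b)                      # 0.30
risk_unexposed = c / (c + d)                      # 0.10
relative_risk  = risk_exposed / risk_unexposed    # 3.0

odds_exposed   = a / b
odds_unexposed = c / d
odds_ratio     = odds_exposed / odds_unexposed    # (a*d)/(b*c)
print(relative_risk, odds_ratio)
```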
  23. predictive test: goals
    • learn a function f(x) that operates on x to predict y
    • what is the function that could have generated the output I see?
  24. predictive test: model validation
    measured by predictive accuracy
  25. predictive test: worries
    • accuracy: how often do i get it right?
    • actionability: i got it right, so what?
  26. Positive Predictive Value (precision)
    • how likely are you to have the disease if you test positive?
    • TP / (TP + FP)
  27. Negative predictive value (npv)
    • How likely are you to not have the disease if you test negative?
    • TN / (TN + FN)
  28. sensitivity (recall)
    • How likely are you to test positive if you have the disease?
    • TP / (TP + FN)
  29. Specificity
    • How likely are you to test negative if you do not have the disease?
    • TN / (TN + FP)
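The four quantities on cards 26-29, computed from one hypothetical confusion matrix:

```python
# Hypothetical test results for a screening test.
TP, FP, TN, FN = 80, 40, 860, 20

ppv         = TP / (TP + FP)   # precision
npv         = TN / (TN + FN)
sensitivity = TP / (TP + FN)   # recall
specificity = TN / (TN + FP)
print(ppv, npv, sensitivity, specificity)
```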
  30. causal test on observational data
    • If you want to do causal analysis on observational data, then you need to find two people that are the same besides the fact that one person got the exposure and the other did not. 
    • In other words, we want to find two people that had the same probability of receiving treatment, but only one of them actually did receive treatment
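A nearest-propensity matching sketch with hypothetical scores (real analyses would estimate each patient's propensity with a model such as logistic regression):

```python
# Pair each treated patient with the untreated patient whose
# (hypothetical) propensity score is closest, without replacement.
treated   = [("t1", 0.62), ("t2", 0.35)]           # (id, propensity)
untreated = [("u1", 0.60), ("u2", 0.33), ("u3", 0.90)]

pairs = []
available = list(untreated)
for tid, p in treated:
    match = min(available, key=lambda u: abs(u[1] - p))
    pairs.append((tid, match[0]))
    available.remove(match)                        # match without replacement
print(pairs)
```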
  31. exploratory tests: goals
    • find sub-groups: clustering
    • find themes: look for patterns, decomposition
  32. confounding factors
    Factors that can cause or prevent the outcome of interest, are not intermediate variables, and are associated with the factors under investigation
  33. ways to prevent being wrong in analyses (5)
    • consider confounding factors
    • replication: over time, across sites, using different study designs
    • quantify the "instability" of analysis: i.e. variance in the face of alternative study design and data source choices
    • test for, and quantify non-stationarity in the data
    • examine multiple dimensions of performance: calibration, AUPRC, effect size, estimated FDR, etc.
Author
tulipyoursweety