Epidemiology & Biostatistics

  1. Chapter 2
  2. Nominal Data
    • values that correspond to unordered categories or classes
    • eg. hair color, gender, state of birth, etc.
    • have no ordering & are summarized as percentages
    • nominal data that can take only 1 of 2 values (eg. having or not having gonorrhea) are called binary/dichotomous data
    • type of Categorical data
  3. Ordinal Data
    • when the categories of a variable have an inherent order to them
    • eg. Glasgow Coma Score, severity of adverse events (AE), “satisfaction”
    • type of Categorical data
  4. Discrete Data
    • data restricted to specific integer values
    • eg. number of children, bacteria count
    • type of Measured data
  5. Continuous Data
    • unrestricted real numbers
    • eg. weight, age, degree of stenosis, physician salary
    • type of Measured data
  6. Age as a Continuous & Ordinal Variable
    • age itself can be reported using mean & standard deviation (SD)
    • OR it can be grouped into categories - 45-54 yr, 55-64, ≥ 65 - & reported as percentage (x % were at this age, etc.)
  7. Bar Charts
    • useful for summarizing Categorical (nominal & ordinal) data
    • displays # or % of individuals in each of the categories
    • their bars do NOT touch to emphasize that the categories are DISTINCT from each other & MUTUALLY EXCLUSIVE
  8. Histogram
    • bars touch each other to emphasize the data are CONTINUOUS (measured)
    • like in a bar chart, each bar of a histogram displays the # or % of individuals in each interval
    • used to display the distribution of data in a sample
    • is VERY good at displaying extreme data points
  9. Histogram Example
    y-axis = frequency (or %) - aka # of individuals whose cholesterol falls between 5 & 6
  10. Arithmetic Mean (Average)
    • sum of observations divided by the # of observations
    • if one were to place “weight” on a numeric scale, the arithmetic mean would be at the balancing point
    • because of this, it’s SENSITIVE to EXTREME observations (outliers)
  11. Median
    • NOT sensitive to extreme observations
    • the value that divides an ordered data set in half
    • 50% of the data are ABOVE the median value, & 50% are below it
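    A minimal sketch of how an extreme observation pulls the mean but barely moves the median. The values are hypothetical, made up only for illustration:

```python
from statistics import mean, median

# Hypothetical measurements; 30 is the extreme observation (outlier).
values = [4, 5, 5, 6, 6, 7]
with_outlier = values + [30]

print(mean(values), median(values))              # 5.5 5.5
print(mean(with_outlier), median(with_outlier))  # 9 6
```

    Adding one outlier moves the mean from 5.5 to 9, while the median only moves from 5.5 to 6.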
  12. 1st Quartile
    refers to the value where 25% of the data are below it
  13. 3rd Quartile
    the value where 75% of the data are below it
  14. Interpret This Table (re: Quartiles)
    • 25% of patients have an LVOT below 7 mmHg
    • 50% of patients have an LVOT below 16 mmHg
    • 75% of patients have an LVOT below 26 mmHg
    • the interquartile range (7-26) contains 50% of the observations
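    A small sketch of computing quartiles & the interquartile range. The data are hypothetical (chosen so the quartiles land on 7, 16, & 26), & the "inclusive" method is only one of several quartile conventions, so other software may give slightly different cut points:

```python
from statistics import quantiles

# Hypothetical gradients in mmHg, already sorted for readability.
lvot = [2, 5, 7, 10, 16, 20, 26, 30, 40]

# n=4 splits the data into quarters and returns the 3 cut points.
q1, q2, q3 = quantiles(lvot, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 7.0 16.0 26.0
print(iqr)         # 19.0
```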
  15. Box Plot
    • an illustrative tool for displaying the MEDIAN as well as the 1st & 3rd quartiles
    • the Whiskers extend to the edge of extreme values in the data set
    • more extreme values are individually displayed as dots
    • can be useful when comparing 2 or more groups
  16. Cholesterol & Grouped Box Plots

    • even though women & men have similar distributions of cholesterol levels by age group, it’s easy to see that the median cholesterol levels increase as women age
    • however levels remain relatively stable in men as they age
  17. Distribution Shapes
    when looking at distributions of data (using either histograms or box plots), it’s important to characterize them in regard to how far a distribution deviates from symmetry
  18. Symmetrical Distribution
    • when the mean, median, & mode have identical values
    • normal distributions are symmetric
  19. Positive Right Skew
    • means the skew is “to the right”: the graph has a longer, more pronounced tail on the right side
    • mean > median
  20. Negative Left Skew
    • the skew is to the left
    • the left “tail” is more pronounced
    • mean < median
    • Mode: the most common (repeated) value in a dataset
  21. Skew with a Box Plot & Histogram
    • Positive/Right Skew: both the right whisker & tail are ELONGATED
    • Negative/Left Skew: both the left whisker & tail are ELONGATED
    • Symmetrical: both whiskers & both tails extend an equal distance from the center
    • Extremes: box plot displays them as individual data points; histogram displays them separately, with discontinuity
  22. Standard Deviation
    • numerically characterizes the amount of variability (data dispersion, data spread) among observed data points
    • the formula gives the typical distance of observations from the mean (the square root of the average squared deviation from the mean)
    • square root of the variance
  23. Z Scores
    • indicate how many SDs an observation is above or below the mean of a distribution
    • computed by subtracting the mean from the observation & dividing by the SD
    • Z = (observation - mean) / SD
    • or (x - μ) / σ
  24. What are the units of Z Score?
    there are none: because the observation, mean, & SD all have the same units, the units cancel out
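    The Z score formula can be sketched as a one-line function. The mean of 100 & SD of 15 below are hypothetical values, not from the notes:

```python
def z_score(x, mu, sigma):
    """Number of SDs that observation x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and SD 15.
print(z_score(130, 100, 15))  # 2.0  (2 SDs above the mean)
print(z_score(85, 100, 15))   # -1.0 (1 SD below the mean)
```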
  25. What are 3 important numbers to remember from a NORMAL distribution?
    • 68
    • 95
    • 99.7
    • ~68% of the observations lie within ±1 SD of the mean
    • ~95% of the observations lie within ±2 SDs of the mean
    • ~99.7% of the observations lie within ±3 SDs of the mean
    • someone with a Z score of 1 is 1 SD above the mean
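    These percentages can be checked against the standard normal curve. A short sketch using Python's standard library (the helper name share_within is made up for this example):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within ±k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.1%}")
```

    The printed values round to about 68%, 95%, & 99.7%.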
  26. Chance
    a probability expressed as a percentage
  27. Chapter 3
  28. Target Population
    all individuals we’re interested in studying

  29. Random Sample
    a representative subset chosen from said population of individuals we’re interested in studying

  30. Sampling Distribution: a Concept
    • suppose a population has a mean = μ & SD = σ
    • from the population, you can take many samples, eg. of size 10 (n = 10), & find the means for each sample
    • [the distribution of observations in each sample will be different from each other]
    • can continue the process of selecting random samples an infinite # of times
    • if you were to plot all the calculated means they would likely show a normal bell-shaped curve
    • the population mean would be the center of the curve, & the variability among the infinite # of means = the STANDARD ERROR
  31. Standard Error
    • SE = pop. standard deviation / √ sample size
    • SE = σ / √n (eg. 41/√10)
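    The eg. above (σ = 41, n = 10) worked out in code. The second call, with n = 40, is an added illustration of how a larger sample shrinks the SE:

```python
import math

def standard_error(sigma, n):
    """SE of the mean: population SD divided by the square root of the sample size."""
    return sigma / math.sqrt(n)

print(round(standard_error(41, 10), 2))  # 12.97
print(round(standard_error(41, 40), 2))  # quadrupling n halves the SE
```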
  32. Sampling Distribution of the Mean
    describes the entire spectrum of sample means that could occur for all possible random samples of a given size n from means that are COMMON (near the center of the distribution/curve) to means that are RARE (near the edges of the distribution)
  33. Central Limit Theorem
    • given the existence of a population mean (μ) & population standard deviation (σ) from any distribution of any shape, the CLT states that the distribution of sample means (computed from random samples of size n from a population) will have its mean centered at the population mean
    • the standard deviation of all the means = standard error
    • if n > 30 (a common rule of thumb), the shape of the sampling distribution will be ~normal
    • important because it forms the basis of many statistical tests
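    A simulation sketch of the CLT, under stated assumptions: the population is a made-up, right-skewed (exponential) one, & the sample size (n = 40) & number of samples (2,000) are arbitrary choices:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population (exponential with mean about 10).
population = [random.expovariate(1 / 10) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 40
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(2_000)
]

# Center of the sample means vs. the population mean.
print(round(statistics.fmean(sample_means), 2), round(mu, 2))
# SD of the sample means vs. the standard error sigma / sqrt(n).
print(round(statistics.stdev(sample_means), 2), round(sigma / n ** 0.5, 2))
```

    Even though the population is skewed, the sample means center near μ, & their SD is close to σ / √n.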
  34. Confidence Interval
    • used to approximate the population parameter of interest using a point estimate (eg. sample mean)
    • a range of values that, with a known level of certainty, includes (surrounds, captures) the unknown population value
  35. What does a 95% CI indicate?
    that we are 95% confident that the range of values between the lower & upper limits contains the true population value
  36. CI Formula
    • point estimate ± critical value * SE(point estimate)
    • critical value indicates a level of certainty (eg. 95%)
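    A sketch of the CI formula for a mean, assuming the critical value comes from the standard normal curve (about 1.96 for 95%); small samples would normally use a t critical value instead. The inputs (mean 200, SD 41, n = 100) are hypothetical:

```python
import math
from statistics import NormalDist

def mean_ci(point_estimate, sd, n, confidence=0.95):
    """Point estimate ± critical value * SE, using a normal critical value."""
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sd / math.sqrt(n)
    return point_estimate - critical * se, point_estimate + critical * se

low, high = mean_ci(200, 41, 100)
low99, high99 = mean_ci(200, 41, 100, confidence=0.99)

print(round(low, 2), round(high, 2))      # 191.96 208.04
print(round(low99, 2), round(high99, 2))  # wider interval at 99% confidence
```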
  37. How to Interpret a CI
    1. We are 95% confident that _______ levels (scores, values, etc.) ranging from x to y (units) capture the true mean _________.

    2. We are 95% certain that the true mean ________ level is between x to y (units).

  38. What 3 things can make a CI wider (aka less precise in capturing the true population value)?
    1. Increasing the level of confidence (eg. 95% → 99%, you widen the CI to be more certain in capturing the true population value)

    2. More variability among the observations (eg. a larger standard deviation, which gives a larger standard error); more variability implies less precision in capturing the true pop. value

    3. a smaller sample size (there’s less information involved in capturing the true population value)
  39. What does the process of testing a hypothesis begin with?
    specifying the null & alternative hypothesis

    null: statistical statement that indicates no difference, response, or change exists

    alternative: contradicts the null
  40. α
    • significance level of the test
    • usually set at 5%
    • represents the threshold beyond which the null hypothesis would be rejected in favor of the alternative hypothesis
  41. p-value
    • probability of obtaining a test-statistic at LEAST as extreme as the one that was actually observed, assuming that the null hypothesis is true
    • can be interpreted as the likelihood that the observed result could’ve occurred by chance alone
  42. T-test
    • ratio of the observed mean difference to the amount of sampling variability (given via standard error)
    • t values close to 0: support the null hypothesis
    • t values further from 0: support the alternative hypothesis
    • eg. T = 3.4 implies the mean difference is 3.4 standard errors ABOVE 0 (supports the alternative hypothesis)
  43. How is the p-value calculated?
    • it is the area under the sampling distribution of mean differences in both the L & R tails
    • the areas that yield the p-value are calculated from a t distribution
    • half the p-value comes from the L tail & half comes from the R tail
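    A sketch of the two-tailed calculation. Note the swap: the notes use a t distribution, but Python's standard library has none, so this uses the standard normal curve as a large-sample approximation:

```python
from statistics import NormalDist

def two_tailed_p(test_statistic):
    """Two-tailed p-value under a normal approximation (not an exact t-test)."""
    upper_tail = 1 - NormalDist().cdf(abs(test_statistic))
    return 2 * upper_tail  # half from the L tail, half from the R tail

p = two_tailed_p(3.4)
print(round(p, 5))  # 0.00067
```

    With small samples, the t distribution has heavier tails, so the exact p-value would be somewhat larger than this approximation.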
  44. What does a p-value of 0.0007 mean?
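    it means that, if the null hypothesis were true, there would be only a 0.07% chance of observing a difference at least this extreme by chance alone; because p < α (0.05), the null hypothesis is rejected in favor of the alternative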
  45. What kind of correspondence is there between p-values & confidence intervals?

    whenever the CI doesn’t contain the null value of the parameter of interest (eg. 0 for a mean difference), the null hypothesis can be rejected

    eg. if a 95% CI about a mean difference EXCLUDES 0, we can conclude that the mean difference is significant at an alpha level of 5%

    however, if a 95% CI about a mean difference INCLUDES 0, we cannot reject the null hypothesis
  46. Chi Square Test
    • used when the outcome & exposure variables are categorical
    • eg. % of current smokers among newly diagnosed diabetes patients
    • the basis of the chi-square test is to quantify the extent of agreement between the observed results gathered from data collection & the EXPECTED results one would observe if the null hypothesis were true
  47. Interpretation of the Chi Square
    • chi square values near 0: accept the null hypothesis
    • large chi square values: reject the null hypothesis, accept the alternative hypothesis
    • χ² = 0.61 implies good agreement between observed & expected results → accept the null hyp.
    • the p-value for a chi square test is calculated as a tail area under the chi square distribution
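    A sketch of the observed-v.-expected calculation for a 2x2 table, without a continuity correction. The counts are hypothetical, & the closed-form p-value used here holds only for 1 degree of freedom (ie. a 2x2 table):

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic & p-value for the table [[a, b], [c, d]]
    (rows = exposure groups, columns = outcome yes/no)."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the null hypothesis (no association) were true.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper-tail area of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 30/100 smokers in one group vs. 20/100 in the other.
chi2, p_value = chi_square_2x2(30, 70, 20, 80)
print(round(chi2, 2), round(p_value, 2))  # 2.67 0.1
```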
  48. Type I Error (α)
    • when the null hypothesis is rejected (alternative hypothesis accepted) when it shouldn’t have been
    • a difference in the sample is observed when there is actually no difference in the population
    • (guilty verdict when defendant is innocent)
  49. Type II Error (β)
    • when the null hypothesis is not rejected (alternative hypothesis is rejected) when it should have been rejected
    • (not guilty verdict when defendant is guilty)
  50. Power
    • 1 - β
    • the probability that a statistical test will RESULT in the REJECTION of the null hypothesis (acceptance of the alternative hypothesis) when it SHOULD be rejected
    • (when a jury correctly assigns guilt)
  51. When is power considered?
    • 1. when a study is being planned, to determine the # of participants to enroll
    • 2. when the null hypothesis is accepted (NOT rejected)
  52. How can the power of a study be increased?
    • 1. ↑ the expected effect size (eg. expected association, difference in means)
    • 2. ↓ the expected standard error by:
    • ↑ the sample size or ↓ standard deviation
  53. What happens every time a statistical test is performed and a p-value is reported?
    • there is a chance (equal to α) of making a Type I error
    • this is because the significance level for the test is pre-specified
    • multiple testing inflates the overall Type I error rate
  54. Overall Type I Error Rate
    • 1 - (1 - α)^# of tests
    • eg. if 5 tests are performed at α = .05 (5% sig level)
    • error rate = 1–(1–0.05)^5 = 0.226 (22.6%)
    • this means there’s a 22.6% chance of finding at least 1 significant difference when it doesn’t exist
    • (instead of the usual 5% chance of a false-positive finding)
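    The overall error rate formula as a short function. It assumes the tests are independent, which the formula itself implies; the 10-test case is an added illustration:

```python
def overall_type1_error(alpha, n_tests):
    """Chance of at least 1 false-positive finding across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(round(overall_type1_error(0.05, 5), 3))   # 0.226
print(round(overall_type1_error(0.05, 10), 3))  # grows further with more tests
```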
  55. Chapter 5
  56. What are the 3 levels of screening?
    • 1. Primary
    • 2. Secondary
    • 3. Tertiary
  57. Primary Screening
    • screening is done to prevent the disease
    • eg. serum lipids are screened to prevent coronary artery disease
  58. Secondary Screening
    • the attempt to reduce the impact of the disease
    • eg. mammography
  59. Tertiary Screening
    • aims to improve the quality of life of those with the disease
    • eg. metabolic bone screen for pathologic fractures
  60. What kinds of diseases are appropriate for screening?
    those that are serious, common, & would benefit from treatment before symptoms/signs develop
  61. What’s important when selecting a screening test?
    making sure it’s available, inexpensive, low risk, easily performed, reliable (its results can be reproduced), & accurate (its results are correct)
  62. Test Results
    • people with a disease: true positives & false negatives
    • people with no disease: true negatives & false positives
  63. What are the two test performance measures?
    • 1. Sensitivity
    • 2. Specificity
    • these are measures of test performance
  64. Sensitivity
    • true positives / (true positives + false negatives)
    • calculated ONLY among individuals WITH the disease
    • given the disease is present, the likelihood of testing positive
    • TP / (TP + FN)
  65. Specificity
    • true negatives / (true negatives + false positives)
    • calculated only among individuals WITHOUT the disease
    • given the disease is not present, the likelihood of testing negative
    • TN / (TN + FP)
  66. Predictive Value Positive
    • TP / (TP + FP)
    • calculated ONLY among individuals that test positive
    • the number of true positives divided by the sum of true & false positives
  67. Predictive Value Negative
    • TN / (TN + FN)
    • calculated only among individuals that test NEGATIVE
    • the number of true negatives divided by the sum of true & false negatives
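    The 4 measures above can be sketched from the cells of a 2x2 table. The function name & the counts are hypothetical, chosen only for illustration:

```python
def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, & predictive values from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # among those WITH the disease
        "specificity": tn / (tn + fp),  # among those WITHOUT the disease
        "ppv": tp / (tp + fp),          # among those who test positive
        "npv": tn / (tn + fn),          # among those who test negative
    }

# Hypothetical screening results for 1,000 people (100 with the disease).
metrics = screening_metrics(tp=90, fp=50, fn=10, tn=850)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```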
  68. Why does the “prior probability” (prevalence) of disease matter when interpreting test results?
    because changes in prevalence can alter Predictive Value Positive & Predictive Value Negative
  69. What happens to Predictive Value Positive & Predictive Value Negative as prevalence increases?
    • predictive value positive increases
    • predictive value negative decreases
  70. What happens to Predictive Value Positive & Predictive Value Negative as prevalence decreases?
    • predictive value positive decreases
    • predictive value negative increases
  71. What happens to Sensitivity & Specificity as prevalence changes?
    they remain unaffected by prevalence!
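    A sketch of why prevalence matters: sensitivity & specificity are held fixed (at a hypothetical 90% each) while prevalence varies, & the predictive values are recomputed each time:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV & NPV for a test applied to a population with the given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

results = {}
for prevalence in (0.01, 0.10, 0.50):
    results[prevalence] = predictive_values(0.90, 0.90, prevalence)
    ppv, npv = results[prevalence]
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

    As prevalence rises, PPV climbs & NPV falls, even though the test itself is unchanged.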
  72. Likelihood Ratio
    a method that quantifies the likelihood that a given test result represents true disease or not

    it’s the ratio of the chance that a certain test result will be found in a patient who has the disease versus the chance that the test result will be found in a patient who does not have the disease
  73. Likelihood Ratio Formula
    LR for a positive result = sensitivity / (1-specificity)

    LR for a negative test result = (1-sensitivity) / specificity

    • to calculate LR in general:
    • proportion of people with the disease who have a certain test outcome / proportion of people without the disease who have the same test outcome
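    The 2 LR formulas as a short sketch. The sensitivity (90%) & specificity (80%) are hypothetical values:

```python
def likelihood_ratios(sensitivity, specificity):
    """LR for a positive result & LR for a negative result."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
print(round(lr_pos, 2), round(lr_neg, 3))  # 4.5 0.125
```

    Here a positive result is 4.5 times as likely in someone with the disease as in someone without it.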
  74. Chapter 6
  75. Randomized Control Trial (RCT)
    • volunteers are randomized to:
    • 1. experimental arm
    • 2. placebo/standard/control arm
    • both groups are followed over a certain period of time
    • then the incidence of outcome in both groups is measured & compared
  76. What are the 5 Steps of a RCT?
    • 1. Enroll volunteers using strict inclusion/exclusion criteria
    • 2. Allocate to treatment groups by randomization
    • 3. Follow for relevant time period
    • 4. Ascertain endpoint/outcome
    • 5. Analyze results
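  77. Why are RCTs considered the “gold standard” of studies?
    because randomization & blinding give them greater internal validity (less bias) than other study designs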
  78. Randomization
    • reduces potential bias by:
    • removing investigator treatment preference & volunteer treatment preference, & balancing the trial arms
    • it results in similar risk profiles in both groups
  79. Blinding of Treatment
    • investigators & volunteers don’t know whether a person received a treatment or not
    • this prevents investigators & volunteers from biasing the results
    • treatments should look alike (this can be difficult, eg. radiation v. chemotherapy); even then, the outcome/end points committee can be blind when they evaluate both groups
  80. How does using a placebo reduce potential bias?
    • it removes the “placebo effect” from the measure of treatment effect
    • it removes volunteer influence by blinding
    • it removes investigator influence by blinding
    • these are especially important if the outcome is SUBJECTIVE
  81. How long are volunteers followed for?
    • long enough to detect differences in outcome as well as differences in side effects
    • often side effects don’t appear within the relevant time period
    • while longer follow up might be beneficial in terms of detecting side effects, it can also lead to a potentially larger loss to follow up
    • this loss to follow-up is a source of bias
  82. Ascertaining the Outcome
    • those evaluating should be blind to treatment assignment
    • a PRECISE definition should be used to define the outcome
  83. “Intention to Treat” Analysis
    • volunteers analyzed in the group to which they were randomized regardless of actual treatment received (eg. if they’re lost to follow up, don’t comply, etc.)
    • “best” measure to be used to analyze the results
    • the idea is to retain benefits of randomization
  84. How can you tell if a new treatment is better than a placebo in a RCT?
    compare the two arms using Relative Risk (RR)
  85. Relative Risk (RR)
    • Outcome Incidence in Experimental Group / Outcome Incidence in Control Group
    • for a cohort study, RR can be written as:
    • Outcome Incidence in Exposed / Outcome Incidence in Unexposed Group
    • a ratio of risks
  86. RR = 1
    • Outcome Incidence in Experimental = Outcome Incidence in Controls
    • RR = 1 is also called the “Null Value”
  87. RR < 1
    • when the outcome incidence is higher in controls than in the experimental group
    • the denominator (controls) is bigger, so RR < 1
  88. RR > 1
    • when the outcome incidence is higher in the experimental group than in controls
    • the numerator (experimental group) is bigger, so RR > 1
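    A sketch of the RR calculation & its 3 possible readings (RR < 1, = 1, > 1). The trial counts are hypothetical:

```python
def relative_risk(events_exp, n_exp, events_ctrl, n_ctrl):
    """Outcome incidence in the experimental group / incidence in controls."""
    return (events_exp / n_exp) / (events_ctrl / n_ctrl)

# Hypothetical trial arms of 100 volunteers each.
rr_protective = relative_risk(10, 100, 20, 100)  # fewer outcomes on treatment
rr_null = relative_risk(20, 100, 20, 100)        # same incidence in both arms
rr_harmful = relative_risk(30, 100, 20, 100)     # more outcomes on treatment

print(rr_protective, rr_null, rr_harmful)
```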
  89. Why are subgroup analyses important?
    • they ascertain effect modification
    • eg. is the effect of a treatment on an outcome the SAME in all people that participated in the trial (eg. in women & men, young & old, etc.)
    • subgroup analyses are important to clinicians because they may identify subgroups in which treatment is helpful or harmful
  90. Chapter 7
  91. What are the 2 overarching types of Epidemiological studies?
    • 1. Observational
    • 2. Interventional

    studies within these categories can either be Descriptive or Analytic
  92. How do descriptive case series differ from cross-sectional studies?
    • in case series there is NO comparison group
    • in cross-sectional studies there IS a comparison group
  93. Descriptive Observational Studies
    • Case Report
    • Case Series
    • Cross-sectional
    • Correlational
  94. Analytic Observational Studies
    • Cohort Studies
    • Case-control Studies
  95. Descriptive Interventional Studies
    • Case Report
    • Case Series
    • these 2 kinds of studies can ALSO be categorized as interventional because the exposure can be chosen by an investigator
  96. Analytic Interventional Studies
    Randomized Control Trial
  97. Cohort Studies
    • observational studies that are analytic (as opposed to descriptive) in nature
    • investigator recruits 2 types of individuals: exposed & unexposed
    • the investigator then follows these 2 groups through time & eventually measures the incidence of outcome in each
    • eg. twin studies comparing twins who had different exposure levels
  98. Prospective Cohort Study
    the investigator starts a study (TODAY) & follows exposed & unexposed volunteers through time (eg. for 10 years FORWARD) & then compares the incidence of outcome in exposed v. unexposed volunteers
  99. Retrospective Cohort Study
    • the investigator again might start the study today, but then looks BACK in time with the assistance of medical records to determine who’s exposed & who’s unexposed
    • the investigator might then again compare current day outcomes in both groups
  100. When is a Cohort Study appropriate?
    • when you’re interested in incidence rates or predictors MORE than the effects of interventions
    • it can be used before a randomized trial is proposed (eg. to generate hypotheses like the effects of hormone replacement therapy or dietary fat)
    • when exposure CANNOT be randomized (eg. genes, race, BMI, serum cholesterol, potentially harmful exposures [cigarettes, drugs, pesticides])
  101. Is a Cohort Study more or less valid than an RCT?
    • a Cohort Study has a LOWER validity than an RCT
    • this is mostly because there is no randomization in a Cohort Study
    • it’s also more difficult to measure exposure because often we rely on self report
  102. What is the goal in regard to Exposure in a Cohort Study?
    • to have an ACCURATE measure of true exposure
    • definition of Exposure should be clear & concise
    • it should be measured with accurate instruments, & the same method/instrument should be used in the exposed & unexposed
    • Exposure should be assessed BLIND to outcome (to avoid investigator bias)
  103. What is the goal in regard to Outcome in a Cohort Study?
    • to have (again) an ACCURATE measure of the true outcome
    • a clear & precise definition is needed
    • outcome should be measured with an accurate instrument
    • potential sources of outcome data include disease registries, medical records, death certificates
  104. What are some advantages of Cohort Studies?
    • they work well for exposures that can’t be randomized (genes, drug use)
    • they’re good for RARE exposures
    • they can assess multiple outcomes
    • they can generate Incidence data
  105. What are some advantages of Retrospective Cohort Studies in particular?
    • they’re good for studying diseases with long latency
    • they take less time to complete
    • are less expensive
    • (less time & resource intensive)
  106. What are some DISadvantages of Cohort Studies?
    • the exposure can’t be randomized
    • they’re bad for rare OUTCOMES (because you have to follow many many people over a long time to see the outcome)
    • this long follow-up time to observe outcomes can also lead to loss to follow-up & a change in someone’s exposure status
    • they can be EXPENSIVE because of the number of years needed to follow the cohorts
    • subjects need to be free of the outcome at the start of the study & sometimes that can be hard if diagnosis can’t easily be done or isn’t clear-cut
    • in a retrospective cohort study, the data may not be available or adequate
  107. Relative Risk (Risk Ratio, Rate Ratio) in a Cohort Study
    • can be used as a measure of association between exposure & outcome
    • the probability (risk) of developing the disease if exposed compared to the probability of developing the disease if unexposed
    • RR = Incidence of Disease if Exposed / Incidence of Disease if Unexposed
    • unexposed (controls) are in the denominator
  108. Cohort RR > 1
    • exposure promotes the outcome
    • eg. a larger # of people got the disease (outcome) if they were exposed
    • larger numerator
  109. Cohort RR < 1
    • exposure prevents outcome
    • eg. a larger proportion of people got the disease (outcome) if they WEREN’T exposed
    • larger denominator
  110. RR = 1
    • the exposure had no effect on the outcome
    • the same proportion of people got the disease (outcome) in both the exposed & unexposed groups
    • RR = 1 is the null value
  111. Attributable Risk (Excess Risk/Risk Difference)
    • Incidence in Exposed - Incidence in Unexposed
    • it’s the incidence DUE to exposure, because it’s the difference between the two incidences
    • it compares the incidence of disease (outcome) in the exposed group & the incidence of outcome in the unexposed group
    • is another measure of association between exposure & outcome of interest
  112. Number Needed to Treat
    1 / Attributable or Excess Risk

    this is the # needed to treat to prevent 1 occurrence of a disease (eg. stroke)
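    A sketch of attributable risk & number needed to treat. The incidences are hypothetical, & taking the absolute value (so a protective treatment still gives a positive NNT) is an assumption of this example, not something stated in the notes:

```python
def attributable_risk(incidence_exposed, incidence_unexposed):
    """Risk difference: incidence in exposed minus incidence in unexposed."""
    return incidence_exposed - incidence_unexposed

def number_needed_to_treat(risk_difference):
    """1 / risk difference; abs() is an assumption so the result is positive."""
    return 1 / abs(risk_difference)

# Hypothetical: 15% of treated v. 20% of untreated patients have a stroke.
ar = attributable_risk(0.15, 0.20)
nnt = number_needed_to_treat(ar)

print(round(ar, 2), round(nnt))  # -0.05 20
```

    Under these made-up numbers, about 20 people would need to be treated to prevent 1 stroke.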
  113. Relative Risk v. Attributable Risk (RR v. AR)
    • RR can be useful in helping an individual understand their personal risk of disease if they continue an exposure
    • RR might be what’s shared with a patient
    • AR is important in public health advocacy work (eg. when talking with legislators or policy makers)
    • AR helps you understand the # of cases that can be averted, or can be converted into the cost of a disease in a population
  114. Bias
    • the distortion of a study’s results
    • the results of the study don’t reflect the truth when there’s systematic error in the measurement of an association between 2 variables
  115. What are different types of Bias?
    • Confounding
    • Selection Bias
    • Information Bias (random + non-random)
    • Loss to Follow-up
    • unlike confounding, the other sources of bias CAN’T be corrected during analysis & MUST be avoided in the study design
  116. Random (Non-differential) Misclassification of the Exposure
    any misclassification of the exposure that’s the SAME in both outcome groups
  117. What are some examples of Random Misclassification of Exposure?
    • 1. ALL volunteers lie about substance abuse → underestimate of exposure
    • 2. ALL volunteers over-estimate daily physical activity → overestimate of exposure
    • 3. ALL subjects have trouble recalling average red meat consumption
    • with these 3 examples, the bias is ALWAYS toward the NULL
    • there’s a “watering down” of the association
  118. Random (Non-differential) Misclassification of the Outcome
    any misclassification of the outcome that’s the SAME in both exposure groups
  119. What are some examples of Random Misclassification of Outcome?
    • 1. an investigator may UNDER-diagnose outcomes in ALL volunteers (fewer people will have the outcome of interest)
    • 2. an investigator may OVER-diagnose outcomes in ALL volunteers (more people will have the outcome of interest than should)
    • 3. a disease is difficult to conclusively diagnose (eg. MI)
    • again, the bias is ALWAYS toward the NULL (there’s a watering-down effect)
  120. Non-random (Differential) Misclassification of the Exposure
    any misclassification of the exposure that’s DIFFERENT in outcome groups
  121. What are some examples of Non-random Misclassification of Exposure?
    • 1. mothers of babies born with FAS LIE about alcohol use
    • 2. volunteers who haven’t had an MI over-estimate physical activity
    • 3. investigators over-estimate exposure in people with a disease & under-estimate exposure in people without a disease
    • bias can either be toward OR away from the NULL
  122. Non-random (Differential) Misclassification of the Outcome
    any misclassification of the outcome that’s DIFFERENT in exposure groups
  123. What are some examples of Non-random Misclassification of Outcome?
    • 1. investigators under-diagnosing an outcome in people who’ve had surgery v. people who haven’t had surgery
    • 2. investigators over-diagnose a disease (outcome) in those who were exposed v. people who weren’t exposed
    • 3. investigators UNDER-diagnose people in the exposed group & OVER-diagnose people in the unexposed group
    • can lead to bias toward or away from the null
  124. How can non-random misclassification of exposure or outcome be avoided?
    by BLINDING investigators to both the exposure & outcome of interest
  125. Chapter 8
  126. External Validity (Generalizability)
    the ability to apply the findings from our study population to a larger population
  127. Internal Validity
    • the degree to which a study’s findings represent a true reflection of the exposure-outcome association in the population
    • how close the estimated relative risk in a study is to the true (but unknown) relative risk
    • internal validity = absence of bias (bias = “distance from the truth”)
  128. What are the 3 steps of assessing internal validity of a study?
    • 1. Rule out confounding
    • 2. Rule out other sources of bias
    • 3. Rule out chance with statistical tests
  129. RCTs & Internal Validity
    double-blind RCTs are considered the “gold standard” of epidemiologic study designs because the internal validity (absence of bias) is greater than in other study designs
  130. What 3 things must a variable be to be considered a confounder?
    1. it must be a risk factor for / associated with the outcome

    2. it must be associated with the exposure, or unbalanced in the exposure group

    3. it must NOT be on the intermediate path between the exposure & outcome, aka it can’t be a mediator on the causal pathway between the exposure & outcome
  131. Framingham Heart Study & Menopause
    researchers showed that menopause increased a woman’s risk for heart disease

    what was actually happening was that age was acting as a confounder, creating the ILLUSION of a positive association between menopause & CHD

    the criteria: age IS a risk factor for CHD, age is UNBALANCED across exposure groups (pre/postmenopausal women tend to be different ages), & age is NOT on the causal pathway (like endogenous estrogen, which would be)
  132. Risk Factor
    • an attribute or exposure associated with an increased or decreased probability of a health-related outcome
    • not necessarily a direct cause of disease
    • eg. vaccines are an example of a “risk factor” that may prevent disease.
    • also called exposure, predictor, or determinant
  133. How do you address confounding in relation to study design?
    • you can Randomize individuals into the 2 arms of the study
    • you can Restrict which subjects you include (eg. if sex is a confounder, just do the study in men)
    • you can MATCH subjects on specific characteristics that you KNOW to be confounders (eg. sex, age, race, ethnicity)
  134. How do you address confounding in relation to study analysis?
    via Stratified analysis & Multivariable analysis
  135. Matching
    • a way to avoid confounding in the study DESIGN
    • eg. TWIN studies
    • if sex, ethnicity, & age were confounders, then matching eliminates their ability to confound
  136. Effect Modification (Interaction)
    a factor OTHER than the exposure of interest (or even the disease) that can modify the exposure-disease association
  137. Confounding v. Effect Modification
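    Conf: compare the crude & adjusted RR/OR for the same exposure; if the estimate changes meaningfully once you adjust for another variable, that variable is a confounder

    EM: compare the RR/OR across subgroups (strata) of the third variable (eg. age); if the stratum-specific estimates differ from each other, that variable is an effect modifier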
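    A minimal Python sketch of the crude v. stratified comparison. The counts are hypothetical (chosen so the pattern is obvious) & the name `relative_risk` is illustrative, not from the course material.

```python
def relative_risk(cases_exp, n_exp, cases_unexp, n_unexp):
    """Risk in the exposed divided by risk in the unexposed."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)

# Hypothetical counts per age stratum:
# (cases among exposed, total exposed, cases among unexposed, total unexposed)
strata = {
    "younger": (5, 100, 45, 900),
    "older": (180, 900, 20, 100),
}

# Stratum-specific RRs: look ACROSS these to check for effect modification.
stratum_rr = {name: relative_risk(*counts) for name, counts in strata.items()}

# Crude RR: ignore age by adding the strata together.
totals = [sum(column) for column in zip(*strata.values())]
crude_rr = relative_risk(*totals)

print(stratum_rr, round(crude_rr, 2))
```

    Here the RR is about 1 within each age stratum but about 2.85 when age is ignored, so the crude association is produced by age (confounding); because the stratum-specific RRs agree with each other, there is no sign of effect modification.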
  138. Chapter 9
  139. Scattergram
    • visually summarizes the relationship between two continuous variables
    • eg. vitamin D serum levels v. vitamin D supplementation
  140. RR, OR, & Pearson R
    • RR & OR are single numbers used to quantify the relationship between 2 binary variables
    • when you have 2 continuous variables, as in this example, the relationship can also be quantified by a single number: r
  141. r (Pearson Product-Moment Correlation Coefficient)
    • quantifies the magnitude & direction of LINEAR relationships between 2 CONTINUOUS variables
    • is the average of the product of the Z-scores between the two variables of interest
    • has NO units
    • range = -1 → +1
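    The "average product of Z-scores" definition can be sketched in Python as below. This assumes the population SD (dividing by n) so that a plain average gives r exactly; the function name `pearson_r` & the sample values are illustrative only.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r as the mean of the products of paired Z-scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # population SD (divide by n)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return mean([a * b for a, b in zip(z_x, z_y)])

# Hypothetical values: a perfectly linear pair & a perfectly inverse pair.
r_positive = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])
r_negative = pearson_r([1, 2, 3, 4], [40, 30, 20, 10])
```

    Because the Z-scores divide out the units of each variable, the result is unitless.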
  142. Pearson Correlation Coefficient (r) Values
    • 0: no linear relationship
    • +1: a perfect positive linear relationship (positive means as the values of 1 variable increase, so do the values of the other variable)
    • -1: a perfect inverse linear relationship (negative means as the values of 1 variable increase, the values of the other variable decrease)
    • Image Upload 22
  143. Regression Analysis
    a statistical tool for evaluating the relationship of one or MORE independent variables (predictors) to a SINGLE continuous dependent (outcome) variable

    in addition to evaluating relationships, regression analysis can also be used to predict outcomes (via deriving prediction equations)
  144. Simple Linear Regression
    • used to fit a straight line (describe & predict a linear relationship) through points on a Scattergram
    • done in relation to 2 variables (X & Y axis)
    • the straight line is derived mathematically (least squares) so that the sum of the squared vertical distances between the data points & the line is as small as possible
  145. Simple Linear Regression Line of Best Fit
    • Y = βo + β1X
    • Y: the predicted value (outcome)
    • βo: the INTERCEPT (the predicted value of Y when X = 0)
    • β1: the slope coefficient (the change in Y for every 1 unit change in X)
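    One common way to derive the line is ordinary least squares. A minimal sketch, assuming that method; the function name `fit_line` & the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line Y = b0 + b1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X & sum of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return intercept, slope

# Hypothetical points that lie exactly on Y = 1 + 2X.
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```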
  146. Vitamin D Eg. Line of Best Fit
    Image Upload 23

    for every 1 unit of supplemental vitamin D taken, the estimated increase in vitamin D serum levels is 0.0249 units

    Y = 64.7 + 0.0249X

    Image Upload 24
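    The fitted line above can be used as a prediction equation. A small Python sketch using the two coefficients from this card; the function name is illustrative & the units are whatever the original figure used (not stated here).

```python
INTERCEPT = 64.7   # βo from the card
SLOPE = 0.0249     # β1 from the card

def predict_serum_vitamin_d(supplement_units):
    """Predicted serum vitamin D for a given amount of supplemental vitamin D."""
    return INTERCEPT + SLOPE * supplement_units

at_zero = predict_serum_vitamin_d(0)         # equals the intercept
at_thousand = predict_serum_vitamin_d(1000)  # 64.7 + 24.9
per_unit = predict_serum_vitamin_d(501) - predict_serum_vitamin_d(500)
```

    The difference between any two predictions 1 unit apart is the slope, which is the interpretation given in the card.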
  147. What are the 2 distinctions between the slope coefficient & the correlation coefficient (worth remembering)?
    1. the slope coefficient β1 has UNITS, the correlation coefficient r doesn’t

    2. the correlation coefficient r just assesses the strength & direction of the linear relationship; it doesn’t describe how much the dependent variable changes in relation to changes in the independent variable
  148. What kind of relationship is there between slope & correlation coefficients?
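    a 1 to 1 relationship: they always have the same sign, & β1 = r × (SD of Y / SD of X), so for a given pair of SDs each value of r corresponds to exactly one slope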
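    The link between the two coefficients in simple linear regression is β1 = r × (SD of Y / SD of X). A short Python check of that identity on hypothetical data; the helper names are illustrative only.

```python
from statistics import mean, pstdev

def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def correlation(xs, ys):
    """Pearson r from the same sums of deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4]        # hypothetical exposure values
ys = [20, 10, 40, 30]    # hypothetical outcome values

slope = least_squares_slope(xs, ys)
r = correlation(xs, ys)
slope_from_r = r * pstdev(ys) / pstdev(xs)
```

    `slope` & `slope_from_r` agree, & both carry the sign of `r`; only the slope changes if Y is re-expressed in different units.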
  149. Different Values of β1 (Slope Coefficient)
    Image Upload 25
  150. R² (Coefficient of Determination)
    • proportion (percentage) of variation in the outcome variable that can be explained by the exposure
    • ranges from 0 → 1 (0 → 100%)

    eg. if R² = 0.73, then 73% of the variability in the outcome can be explained by the exposure
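    A minimal sketch of how the coefficient of determination is computed from observed outcomes & the values predicted by a regression line, assuming the usual definition 1 − (residual sum of squares / total sum of squares). The function name & numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Proportion of the variation in the outcome explained by the model."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

observed = [1, 2, 3]
perfect_fit = r_squared(observed, [1, 2, 3])   # every point on the line
no_better_than_mean = r_squared(observed, [2, 2, 2])  # always predicts the mean
```

    A perfect fit explains all of the variability (1, or 100%), while a model that only ever predicts the mean explains none of it (0).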
  151. Multiple Linear Regression
    used to describe & predict a linear relationship between 1 dependent variable & 2 or more independent variables

    the adjusted R² is the version of R² reported for a multiple linear regression; it accounts for the number of independent variables in the model, so it only increases when an added variable explains enough extra variability
  152. Multiple Linear Regression Line of Best Fit
    Y = βo + β1X1 + β2X2 + β3X3
  153. What are the benefits of a Multiple Linear Regression?
    it can assess multiple exposures

    it can assess potential confounding
  154. β Estimates in Multiple Linear Regression
    • each independent variable has its own β: for every 1 unit that the independent variable changes by, the dependent variable (what you’re measuring) changes by that variable’s β coefficient
    • each β coefficient is the estimate AFTER adjusting for all other variables in the regression model
    • Image Upload 26
    • vitamin D eg.: for every 1 ounce of fish eaten, vitamin D serum levels increase by 2.01 (the individual β coefficient for the variable “fish intake”)
  155. When is a Logistic Regression used?
    • when the outcome is binary (yes/no, present/absent - is an extension of the 2 by 2 table)
    • can be used to calculate adjusted odds ratios & relative risks for more than 1 extraneous variable
    • to generate prediction equations in the case of binary outcomes
  156. Log Regression Equation
    • π(x) = e^(βo + β1x) / (1 + e^(βo + β1x))
    • βo & β1: unknown parameters
    • x: exposure
    • π: proportion of the outcome (1=yes, 0=no)
  157. How is the logistic model written?
    as a non-linear equation to ensure the outcome is bounded between 0 & 1

    if logistic regression is used to predict the risk of outcome, the risk estimates can’t be less than 0% or greater than 100%
  158. How do you calculate the β coefficients for a logistic regression?
    • convert the logistic equation to a linear one via a logit transform
    • ln[odds] = ln[π(x) / (1 - π(x))] = βo + β1x
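    The logistic model & its logit transform can be sketched in Python as below. The coefficient values are hypothetical & the function names are illustrative; this only evaluates the equations, it does not estimate β from data.

```python
import math

def logistic(b0, b1, x):
    """pi(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), always between 0 & 1."""
    z = b0 + b1 * x
    return 1 / (1 + math.exp(-z))  # algebraically the same as e^z / (1 + e^z)

def logit(p):
    """ln[odds] = ln[p / (1 - p)]; undoes the logistic function."""
    return math.log(p / (1 - p))

b0, b1 = -2.0, 0.5            # hypothetical coefficients
risk = logistic(b0, b1, 3)     # predicted probability at x = 3
log_odds = logit(risk)         # back on the linear scale: b0 + b1*3
odds_ratio = math.exp(b1)      # OR for a 1 unit increase in x
```

    Taking the logit of the predicted probability returns βo + β1x, which is why the β coefficients can be interpreted on the log-odds scale & exponentiated to give odds ratios.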
  159. When is a Time to Event Analysis used?
    when the time to an event is AS important as whether the event occurs or not
  160. Time to Event Analysis
    • accommodates varying lengths of follow-up
    • outcome of interest is the TIME until an event occurs rather than whether it does or doesn’t occur
  161. Why might there be varying lengths of subject follow up?
    • 1. staggered entry into the study
    • 2. subjects might drop out of the study, are lost to follow up, or may die
    • 3. subjects who don’t have an outcome by the end of the study
  162. Physician Waiting Room Example
    • Image Upload 27
    • F: saw physician (outcome achieved)
    • L: lost to follow up at noted time
    • C: censored (outcome not achieved in specified time period)
    • question to ask: What is the rate (risk, chance, likelihood) of being seen by a physician (outcome) within __ minutes of arriving at the doctor’s office?
  163. Incidence Density
    • # of new cases in a specified time period / total number of units of person-time
    • a simple but MORE precise measure of incidence than cumulative incidence
    • the advantage of ID is that it accounts for unequal follow up & loss to follow up
  164. Incidence Density with Doctor’s Office
    • 12 total patients
    • 3 patients were seen within 15 minutes of arrival
    • minutes waited by everyone = 146
    • ID = 3 / 146 = 0.0205
    • 0.0205 * 100 = 2.05 patients per 100 patient-minutes
    • 0.0205 × 15 minutes = 0.308, ie. about 30.8 of every 100 patients are expected to be seen within 15 minutes
  165. Doctor’s Office Incidence Density Interpretation
    there’s a 30.8% CHANCE a patient will be seen by a physician within 15 minutes of their arrival
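    The arithmetic in the doctor’s office example can be reproduced with a few lines of Python. The numbers come from the card; the variable names are illustrative.

```python
patients_seen = 3          # outcome events within the window
patient_minutes = 146      # total minutes waited by all 12 patients
window_minutes = 15

incidence_density = patients_seen / patient_minutes   # per patient-minute
per_100_patient_minutes = incidence_density * 100
chance_within_window = incidence_density * window_minutes

print(round(incidence_density, 4))        # 0.0205
print(round(per_100_patient_minutes, 2))  # 2.05
print(round(chance_within_window, 3))     # 0.308
```

    Multiplying the rate by the 15 minute window treats the rate as constant over that period, which is the simplification the card uses.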
  166. Kaplan-Meier Event-Free Survival Curve
    • an approach to display the cumulative proportion of participants who did NOT experience an outcome event over time
    • each step plotted on a K-M curve represents an outcome event
    • because K-M analysis & corresponding curve accounts for varying lengths of follow up, the K-M estimates are VERY close to Incidence Density
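    A minimal sketch of how the steps of a K-M curve are computed, assuming the standard product-limit method (at each event time, multiply the running event-free proportion by 1 − events / number still at risk). The function name & the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, event-free proportion)] at each time an event occurs.

    times:  follow-up time for each subject
    events: 1 if the outcome occurred at that time, 0 if censored or lost
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))  # one step on the curve
        at_risk -= n_leaving  # censored subjects leave without causing a step
    return curve

# Hypothetical follow-up: 5 subjects, 3 events, 2 censored.
curve = kaplan_meier([5, 10, 10, 15, 20], [1, 0, 1, 1, 0])
```

    Censored subjects still count in the number at risk up to the time they leave, which is how the method accommodates varying lengths of follow up.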
  167. Logrank Test
    • used to compare Kaplan-Meier curves between 2 or more groups
    • the comparison is made across the entire event free survival curve & not at any particular period of time
  168. Cox Proportional Hazards Regression
    does the same thing that a logistic regression does (accommodating multiple exposure variables, including potential confounders & effect modifiers), but for time to event data
  169. Hazard Function
    • h(t) = h0(t) × e^(β1x1 + β2x2)
    • h(t): hazard @ time t
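    A small Python sketch of the hazard function, showing why exponentiating a β coefficient gives a hazard ratio. The baseline hazard & coefficients are hypothetical & the function name is illustrative; this evaluates the equation only, it does not fit a Cox model.

```python
import math

def hazard(baseline_hazard, betas, xs):
    """h(t) = h0(t) * e^(b1*x1 + b2*x2 + ...) at a single time t."""
    linear_part = sum(b * x for b, x in zip(betas, xs))
    return baseline_hazard * math.exp(linear_part)

h0 = 0.1                 # hypothetical baseline hazard at time t
betas = [0.7, 0.2]       # hypothetical coefficients for x1 & x2

exposed = hazard(h0, betas, [1, 3])     # x1 = 1 (exposed), x2 = 3
unexposed = hazard(h0, betas, [0, 3])   # x1 = 0 (unexposed), same x2
hazard_ratio = exposed / unexposed      # h0 & the x2 term cancel out
```

    The baseline hazard cancels in the ratio, so the hazard ratio for x1 is e^β1 regardless of t or of the value of x2.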
  170. Logistic Regression is an extension of the __ _____ ____, & Cox Regression is an extension of the _________________
    • Logistic Regression is an extension of the chi-square test
    • Cox Regression is an extension of the event free survival analysis
    • while the log regression model yields odds ratios, a Cox regression model yields hazard ratios
  171. How can hazard ratios be interpreted?
    • as Relative Risks
    • hazard ratios are byproducts of the Cox regression model
  172. Chapter 10
  173. Case-control Study
    • a type of observational analytic study in which subjects are selected based on their disease status
    • subjects are classified as Cases (having the disease) or Controls (not having the disease)
    • cases are identified by the outcome/disease being clearly & precisely defined
    • Good for rare OUTCOMES (diseases); Bad for rare exposures
  174. How should case & control status be assessed?
    • investigators should be blind to a person’s exposure status when assessing if the person is a case
    • similarly, exposure should be assessed blind to outcome
  175. Why are incident (new) cases better than prevalent cases?
    prevalent (old + new) cases may be survivors & therefore may not be representative of “typical” cases
  176. What cannot/is not calculated in a Case-control Study?
    • INCIDENCE, because we started with cases (people who already have the disease of interest)
    • therefore relative risk can’t be calculated either
  177. So, what is the only measure of association possible in a case-control study?
    Odds Ratio

    OR is a good estimate of the relative risk when the disease of interest is rare (<10%)
  178. Remember: Cohort Study
    • in both a prospective & retrospective cohort study, disease incidence can be calculated → RR can be calculated (as can OR)
    • Case-control can ONLY use OR
  179. Odds Ratio (OR)
    odds that a case was exposed / odds that a control was exposed

    Image Upload 28
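    The odds ratio from a 2 by 2 table can be sketched in Python as below. The counts are hypothetical & the function name is illustrative only.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds that a case was exposed divided by odds that a control was exposed."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# Hypothetical study: 100 cases (30 exposed) & 100 controls (10 exposed).
or_estimate = odds_ratio(30, 70, 10, 90)
print(round(or_estimate, 2))  # 3.86
```

    This is the same as the cross-product (30 × 90) / (70 × 10) from the 2 by 2 table.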
  180. What are the advantages of Case-Control Studies?
    • Good for rare diseases
    • Allow for evaluation of multiple exposures
    • Efficient (re: time & cost)
    • Avoid potential ethical issues of an RCT (eg. can’t randomize subjects to a harmful exposure like smoking)
  181. What are the disadvantages of Case-Control Studies?
    • they’re BAD for rare exposures
    • Cases & controls might not be comparable
    • Can’t generate incidence data because time isn’t known
    • there’s potential Selection bias, Interviewer bias & Recall bias (are all controls really free of the disease?)
  182. Selection Bias
    • caused by how the study subjects were selected for the study
    • results in the association between the exposure & outcome not being representative of the target population’s true association
    • to avoid selection bias, the exposure & outcome shouldn’t both be mentioned in recruitment material
  183. What type of study cannot have Selection Bias?
    • RCT
    • because subjects join the study before they know their exposure status & because the outcome has not yet occurred
  184. Recall Bias
    • Cases (who have experienced an adverse health outcome) may be more likely to recall exposure histories than controls
    • eg. cancer cases recall pesticide exposure more readily than people without cancer
    • it can attenuate association between disease & exposure towards the null or exaggerate association away from the null
    • a differential (non-random) misclassification of the EXPOSURE
  185. In what kinds of studies is Recall Bias found?
    • Retrospective Studies only
    • because outcome has already occurred when the Exposure is assessed
  186. Advantages of Case-control v. Cohort Studies
    • Image Upload 29
    • Case-control: recruit people based on their disease (outcome) for the outcome arm then find controls to match that don’t have the disease
    • Cohort: observational study, recruit people just based on whether or not they have or don’t have a certain exposure → outcome is assessed “later”
  187. Disadvantages of Case-control v. Cohort Studies
    Image Upload 30
  188. Why are all observational studies subject to confounding?
    because there’s NO randomization

    [small RCTs are more subject to confounding than large RCTs]