quals: clinical data mining part II

  1. Main components of a clinical data mining workflow?
    • 1) What questions to ask?
    • 2) Finding data sources
    • 3) Preparing datasets for analyses
    • 4) Performing analyses
  2. Asking questions about a data source (5)
    • 1) data model: how does the source store what relates to what
    • 2) terminologies: what terms does the source use to denote specific items in the data
    • 3) level of quality control: what tests and checks ensure that the right terms are used and the right things relate
    • 4) counting outcomes of interest: how will you decide if X happens or not
    • 5) methods: what analyses can you do and not do? What can you do to not be wrong?
  3. Prepare dataset for analyses: structured data in databases can be transformed into datasets using _____ and ______.
    • SQL
    • R
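A minimal sketch of the SQL-to-dataset step (the table and drug names are hypothetical): long exposure records are pivoted into one row per patient, ready for a data frame.

```python
import sqlite3
import pandas as pd

# Hypothetical schema: one row per (patient_id, drug) exposure record.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE drug_exposure (patient_id INT, drug TEXT)")
con.executemany("INSERT INTO drug_exposure VALUES (?, ?)",
                [(1, "aspirin"), (1, "metformin"), (2, "aspirin")])

# SQL pivots long records into a one-row-per-patient dataset.
# (In SQLite, a comparison evaluates to 0/1, so SUM counts matches.)
df = pd.read_sql_query(
    """SELECT patient_id,
              SUM(drug = 'aspirin')   AS aspirin,
              SUM(drug = 'metformin') AS metformin
       FROM drug_exposure
       GROUP BY patient_id
       ORDER BY patient_id""", con)
```

The same pivot could be done in R with `dplyr`/`tidyr`; the point is that structured records become a patient-by-feature table.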
  4. Prepare dataset for analyses: define the unit of observation
    • patients (usually), or drugs
    • these are the rows in the data frame
    • need vector representation to make a dataframe
  5. Prepare dataset for analyses: standardizing features (2)
    • scale
    • normalize
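The two standardizations above can be sketched with NumPy (toy values, not clinical data):

```python
import numpy as np

# Toy feature matrix: rows = patients, columns = features on very
# different scales (e.g., age in years vs. a lab value).
X = np.array([[25.0, 0.1], [50.0, 0.4], [75.0, 0.7]])

# Scale (z-score): subtract the column mean, divide by the column std,
# so every feature has mean 0 and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalize: rescale each row to unit length so magnitude differences
# between patients do not dominate distance-based methods.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
```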
  6. Prepare dataset for analyses: feature spaces are often reduced by (2)
    • using ontologies to aggregate features
    • data-driven methods
  7. Prepare dataset for analyses: why select features? (5)
    • useless features
    • features that are too specific
    • highly correlated features
    • spurious patterns ("overfitting")
    • longer runtime
  8. Prepare dataset for analyses: selecting features
    Remove ____-prevalence, ___-variance features
    • low
    • low
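A minimal sketch of this filter, assuming binary feature columns; the 0.2 prevalence threshold is an illustrative choice, not a standard value.

```python
import pandas as pd

# Toy binary feature columns (1 = code present for that patient).
df = pd.DataFrame({
    "common_dx": [1, 0, 1, 1, 0, 1],
    "rare_dx":   [0, 0, 0, 1, 0, 0],   # low prevalence
    "constant":  [1, 1, 1, 1, 1, 1],   # zero variance
})

prevalence = df.mean()   # fraction of patients with the feature
variance = df.var()

# Drop low-prevalence and low-variance features.
keep = [c for c in df.columns if prevalence[c] >= 0.2 and variance[c] > 0]
df_reduced = df[keep]
```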
  9. Prepare dataset for analyses: How can you use ontologies to reduce dimensionality?
    • collapse to parent concepts
    • e.g. condense aspirin, ibuprofen, naproxen columns into one column NSAID
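The NSAID example above can be sketched in pandas; the parent map stands in for what an ontology (e.g., ATC) would supply.

```python
import pandas as pd

# Hypothetical drug-level indicator columns, one row per patient.
df = pd.DataFrame({
    "aspirin":   [1, 0, 0],
    "ibuprofen": [0, 1, 0],
    "naproxen":  [0, 0, 0],
})

# Child -> parent concept map, as an ontology would provide.
parents = {"aspirin": "NSAID", "ibuprofen": "NSAID", "naproxen": "NSAID"}

# Collapse child columns into one parent column: a patient counts as
# NSAID-exposed if exposed to any child drug.
rolled = df.T.groupby(parents).max().T
```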
  10. Prepare dataset for analyses: reducing dimensionality using PCA, SVD - pros and cons
    • pros: no dependence on outside knowledge
    • cons: derived features are not interpretable
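A minimal PCA-via-SVD sketch on random data: center the columns, take the SVD, and project onto the top components. The derived columns are linear mixtures of the originals, which is exactly why they lose interpretability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # toy patient-by-feature matrix

# Center each column, then take the SVD; the rows of Vt are the
# principal directions, ordered by decreasing singular value.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
X_reduced = Xc @ Vt[:k].T   # project onto the top-k components
```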
  11. Prepare dataset for analyses: dealing with missing data (3)
    • delete rows with missing values - "complete case analysis"
    • impute data
    • multivariate regression model
  12. Prepare dataset for analyses: downside of "complete case analysis"
    • deleting rows with missing values
    • can quickly end up with nothing
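A quick illustration of the shrinkage, assuming each value is independently missing 10% of the time: with 10 features, only about 0.9^10 ≈ 35% of rows survive `dropna`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# 100 patients, 10 features; each cell independently missing 10% of the time.
df = pd.DataFrame(rng.normal(size=(100, 10)))
df = df.mask(rng.random(df.shape) < 0.1)

# Complete case analysis: keep only rows with no missing values.
complete = df.dropna()
```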
  13. Prepare dataset for analyses: imputing data (3 methods)
    • column-mean imputation and other naive approaches - bias summary statistics (shrink variance)
    • k-nearest neighbors
    • low-rank model imputation
    • might be better to use an analysis method that handles missingness rather than impute
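A minimal sketch of the naive column-mean approach (toy values); kNN and low-rank imputation would replace the `fillna` step with a model fit.

```python
import numpy as np
import pandas as pd

# Toy data with missing entries.
df = pd.DataFrame({"age": [30.0, np.nan, 50.0, 60.0],
                   "sbp": [120.0, 130.0, np.nan, 140.0]})

# Column-mean imputation: fill each NaN with its column's mean.
# Simple, but it shrinks variance and can bias summary statistics.
mean_imputed = df.fillna(df.mean())
```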
  14. Prepare dataset for analyses: multivariate regression model for missing data
    will kick out all samples missing a value for any variable used in the model
  15. Prepare dataset for analyses: constructing data
    • a variable you'd like to analyze (or adjust your results for) is not recorded, so you need to construct it from what's in the data
    • e.g. counts vs. binary indicators
    • e.g. socioeconomic status from zipcode
    • useful tricks: counts, differences, first derivatives, ratios
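The tricks above (counts, differences, first derivatives) sketched on hypothetical long-format lab records:

```python
import pandas as pd

# Hypothetical long-format labs: one row per measurement.
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "day":        [0, 30, 60, 0, 45],
    "creatinine": [1.0, 1.2, 1.5, 0.9, 0.9],
})
g = labs.sort_values("day").groupby("patient_id")

constructed = pd.DataFrame({
    "n_measurements": g.size(),                                   # counts
    "change": g["creatinine"].last() - g["creatinine"].first(),   # differences
    "slope": (g["creatinine"].last() - g["creatinine"].first())
             / (g["day"].last() - g["day"].first()),              # first derivative
})
```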
  16. Prepare dataset for analyses: text mining - things to know about a term mention
    • negation
    • context: does it apply to the patient or someone else, is it current or in the past, uncertainty
    • temporality: drug precedes outcome vs. drug follows outcome
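A toy negation check in the spirit of NegEx (the trigger list and 5-token window are illustrative): a mention counts as negated if a trigger phrase appears shortly before the term. Real systems also handle scope, context (patient vs. family member), and temporality.

```python
import re

# Illustrative trigger phrases; NegEx-style systems use a much larger list.
NEGATION_TRIGGERS = r"\b(no|denies|without|negative for|ruled out)\b"

def is_negated(sentence: str, term: str) -> bool:
    """True if `term` appears in `sentence` with a negation trigger
    in the 5 tokens preceding it."""
    m = re.search(re.escape(term), sentence, re.IGNORECASE)
    if not m:
        return False
    window = sentence[:m.start()].split()[-5:]
    return re.search(NEGATION_TRIGGERS, " ".join(window), re.IGNORECASE) is not None
```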
  17. Prepare dataset for analyses: major components
    • define the unit of observation
    • standardize features
    • select features
    • deal with missing data
    • construct data
    • text-mining
    • electronic phenotyping
  18. Prepare dataset for analyses: electronic phenotyping
    finding patients with specific conditions or outcomes using EHR data
  19. electronic phenotyping dimensions
    • features: what features are used to characterize a phenotype
    • time: does the representation support a time-variant definition of the phenotype
    • complexity: how complex is the representation corresponding to a phenotype definition
  20. electronic phenotyping approaches (2)
    • rule-based, expert-consensus definitions
    • probabilistic phenotyping
  21. electronic phenotyping: rule-based approach
    • figure out what "things" you want to see in a record to believe the patient has that condition
    • using the knowledge-graph that the data conforms to, convert "things" into specific identifiers
    • define the criteria around the presence, absence, frequency of occurrence etc. for those identifiers
    • PheKB, ATLAS
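The three steps above can be sketched as code; the ICD-9 codes, drug names, and lab threshold here are illustrative stand-ins, not a validated PheKB definition.

```python
# Identifiers the "things" were converted into (illustrative subset).
T2DM_CODES = {"250.00", "250.02"}            # ICD-9 diagnosis codes
T2DM_DRUGS = {"metformin", "glipizide"}      # antidiabetic medications

def has_t2dm(record: dict) -> bool:
    """Rule over one patient record:
    {'codes': set, 'drugs': set, 'max_hba1c': float}."""
    has_code = bool(T2DM_CODES & set(record["codes"]))
    has_drug = bool(T2DM_DRUGS & set(record["drugs"]))
    has_lab = record.get("max_hba1c", 0.0) >= 6.5
    # Criteria over presence/absence of identifiers:
    # (diagnosis code AND medication) OR diagnostic lab value.
    return (has_code and has_drug) or has_lab
```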
  22. electronic phenotyping: probabilistic phenotyping
    • learn from a set of labeled examples (i.e. supervised learning)
    • broad themes:
    • automated feature selection
    • reduce the number of training examples
    • probability of phenotype as a continuous trait
    • APHRODITE: aimed to create large training datasets for "cheap" and still learn a good phenotype model
    • ANCHOR learning
  23. electronic phenotyping: state of the art approach
    • develop a rule-based consensus definition for an electronic phenotype as a combination of codes (ICD9 and CPT), medications, key terms, and lab values that in combination identify the phenotype of interest
    • validate at multiple sites
    • execute such a phenotype "algorithm" as queries over an EMR database system
    • Con: fragmentation across sites and idiosyncrasies of site-specific coding and data recording practices can have significant effects on performance
    • improve: NLP improves case identification rates over coded data and exact string matching alone for a variety of phenotypes
  24. electronic phenotyping: recent approaches
    • train classifier using EMR derived data to distinguish cases of phenotype from controls
    • some approaches rely on manual feature selection, more recent approaches incorporate automated methods for feature selection as part of the model building process
    • rate-limiting step is manually constructing the training set of cases and controls