bioinformatics exam 2

  1. Types of Sequence Alignment
    • Number of sequences:
    • Pair-wise
    • Multiple Sequence Alignment (MSA)
  2. Types of Sequence Alignment
    Type of sequence
    • DNA, RNA, and Protein
  3. Types of Sequence Alignment
    Extent of alignment:
    • Global – All letters and nulls in all sequences must be aligned.
    • Local – Allows arbitrary-length segments of each sequence to be aligned, with no penalty for the unaligned portions of the sequences.
  4. Matches/Mismatches
    • Whether the letters are the same or differ between two sequences at a specific shared position in the alignment (a match is marked “|”).
  5. Gaps.
    An arbitrary number of null characters (“-”) may be placed into either sequence, and aligned with letters in the other sequence. Two nulls may not be aligned. Depending upon perspective, the alignment of a letter with a null may be understood as the insertion of a letter into one sequence, or the deletion of a letter from the other. Therefore, a letter aligned with a null is sometimes called an indel.
  6. Alignment scores
    • The score for an alignment is taken to be the sum of scores for aligned pairs of letters, and scores for letters aligned with nulls.
  7. Substitution scores
    • Scores for aligned pairs of letters are called substitution scores, whether the letters aligned are identical or not. Most simply, substitution scores may take the form of match scores and mismatch scores.
  8. Gap scores.
    The score for a letter aligned with a null is called a gap score. Usually gap scores are letter-independent.
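    A minimal sketch of how an alignment score is summed from substitution scores and gap scores. The function name and the score values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values from the course.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned rows of equal length, column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("two nulls may not be aligned")
        if x == "-" or y == "-":
            total += gap        # letter aligned with a null (indel)
        elif x == y:
            total += match      # substitution score for identical letters
        else:
            total += mismatch   # substitution score for different letters
    return total


# 5 matches, 1 mismatch, 1 gap -> 5 - 1 - 2 = 2
print(alignment_score("GATTACA", "GA-TCCA"))
```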
  9. Dynamic Programming and Global Alignment
    • Dynamic programming is a method by which a larger problem may be solved by first solving smaller, partial versions of the problem.
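    A short sketch of dynamic programming applied to global alignment (the Needleman-Wunsch recurrence with a linear gap score). Each cell holds the best score for aligning a prefix of one sequence with a prefix of the other, so larger prefixes reuse the answers for smaller ones. The score values are assumed for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned entirely with nulls
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] aligned with a null
                            curr[j - 1] + gap))  # b[j-1] aligned with a null
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATACA"))
```

    Only two rows are kept in memory because this sketch returns the score alone; recovering the alignment itself would need the full matrix and a traceback.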
  10. Local Alignment: Definition
    • A local alignment of two sequences allows arbitrary-length segments of each sequence to be aligned, with no penalty for the unaligned portions of the sequences. Otherwise, the score for a local alignment is calculated the same way as that for a global alignment.
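    A sketch of the local variant (the Smith-Waterman recurrence). The only changes from the global case are that a cell may restart at 0, which is what removes the penalty for unaligned portions, and that the answer is the best cell anywhere in the matrix. Score values (match +2, mismatch -1, gap -2) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                    # start a fresh local alignment
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared segment GATTACA scores 7 * 2 = 14; the flanks cost nothing.
print(smith_waterman_score("TTTGATTACATTT", "CCGATTACACC"))
```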
  11. Assumption of homology (learn how species are biologically related)
    MSAs: Identify sequences or parts of sequences -> function
    • Phylogenetic analysis
    • Study of the evolutionary history and relationships among individuals or groups of organisms
    • Explore evolutionary history
    • Evolution of a species or a gene/gene family
  12. Sequence alignment
    Phylogeny: important to current questions
    • Evolutionary tree
      ○ Reveals pattern of relatedness in genes
      ○ Displays hierarchical ancestral relationships among sequences
      ○ Classify gene with appropriate subfamily
      ○ Test predictions about placement of sequence
      ○ Test predictions about effects of species history on evolution
      ○ Used as the null model
  13. 2 Types of trees:
    • Unrooted tree topology
    • Rooted tree topology
      ○ Needs info on evolutionary rates or on which lineage is most ancient
      ○ Needs an outgroup -> outside the group of interest, but not too distantly related
  14. Branches and Nodes
    Node Types
    • Leaves: extant samples
    • Internal nodes: putative ancestors
  15. Branch Lengths
    • Amount of genetic change
      ○ Average number of substitutions per site
      ○ Describes the amount of change between instances
  16. How do we build phylogenies?
    Find the topology most congruent with the observed data (the one that matches up best to explain the data)
  17. Distance Methods
    • Uses probabilistic models of sequence evolution to calculate pairwise evolutionary distances for all pairs of sequences
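    A small sketch of a distance calculation, assuming the Jukes-Cantor model as the probabilistic model (the card does not name one) and assuming ungapped, equal-length aligned sequences. The observed proportion of differing sites p is corrected to d = -3/4 ln(1 - 4p/3) to account for multiple substitutions at the same site.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (substitutions per site)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def pairwise_distances(seqs):
    """seqs: dict of name -> aligned sequence. Returns {(name1, name2): distance}."""
    return {(m, n): jukes_cantor(p_distance(seqs[m], seqs[n]))
            for m, n in combinations(sorted(seqs), 2)}


print(pairwise_distances({"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACCTACGA"}))
```

    The resulting distance table is what a clustering method such as neighbor joining would take as input.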
  18. Parsimony
    Smallest number of changes or steps to explain data
  19. Maximum likelihood
    • Best topology is the one that makes the observed data most probable (highest likelihood) under a model of sequence evolution.
      ○ Can include more complex evolutionary models
  20. Distance methods
    • Uses probabilistic models of sequence evolution to calculate pairwise evolutionary distances for all pairs of sequences
  21. Bayesian Inference
    • Uses likelihood to create a "posterior probability of trees" using a model of evolution, based on some prior probabilities, and determines the most likely topology for the given data.
  22. Evaluating Tree Reconstruction
    • Measure support for clades
    • Bootstrapping (resampling technique)
      ○ Resampling of the sample data (alignment columns) with replacement
      ○ Especially in maximum likelihood analysis
      ○ 70-80%: usually considered strong support
    • Posterior probability
      ○ Bayesian
      ○ Ranges from 0 to 1
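    A sketch of one bootstrap replicate, assuming the data being resampled are the columns of a multiple sequence alignment. The function name and the dict-of-strings input format are hypothetical. Whole columns are drawn with replacement, so every sequence keeps the same sites in each pseudo-alignment.

```python
import random


def bootstrap_alignment(alignment, rng):
    """alignment: dict of name -> aligned sequence (all the same length).
    Returns a pseudo-alignment built from columns sampled with replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


rng = random.Random(42)  # fixed seed so the example is repeatable
msa = {"human": "ACGTAC", "chimp": "ACGTAT", "mouse": "ATGTCT"}
for _ in range(3):
    print(bootstrap_alignment(msa, rng))
```

    In a full analysis a tree would be rebuilt from each replicate, and the support for a clade is the percentage of replicate trees that contain it.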
  23. N50 length
    • The length of the sequence that takes the running sum of sequence lengths past 50% of the total assembly size (when summing lengths from longest to shortest).
    • Used to assess whether or not you have a good assembly.
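    A minimal sketch of the N50 calculation described above. It assumes the common convention that the threshold is reached when the running sum is at least half of the total.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):  # longest to shortest
        running += length
        if running * 2 >= total:                  # reached 50% of the assembly
            return length
    raise ValueError("no contig lengths given")


# Total = 290, half = 145; 80 + 70 = 150 crosses it, so N50 = 70.
print(n50([80, 70, 50, 40, 30, 20]))
```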
  24. Assessing assemblies
    We desire
    • total length similar to genome size
    • fewer, larger contigs
    • no mistakes
  25. Metrics
    • No generally useful objective measure
      ■ Longest, total bp, genes recovered
  26. Validation
    • Self-consistency
      ■ align reads back to contigs
      ■ check for errors or discordant pairs
    • Second opinion
      ■ use two complementary sequencing methods
      ■ target troublesome areas for PCR
      ■ use a genome-wide "optical map"
  27. Consideration
    • Size of genome
    • Hardware
      ● phone, laptop, desktop, server, cloud
      ● RAM more limiting than CPU
    • Operating system
      ● Linux, Mac, Windows
    • Software budget
      ● commercial, free, open-source
    • Using core genes
      ● All genomes perform some core functions (transcription, replication, translation, etc.)
      ● The proteins involved tend to be highly conserved
      ● They should be present in every genome
    • CEGMA
      ● Core Eukaryotic Genes Mapping Approach
      ● Defines a set of 248 "Core Eukaryotic Genes" (CEGs)
      ● CEGs identified from the genomes of: S. cerevisiae, S. pombe....
      ● A higher N50 does not mean more core genes are recovered.
      ● CEGMA is outdated and hard to install, but a good idea.
  28. de novo genome assembly (from scratch)
    An attempt to accurately represent an entire genome sequence from a large set of short DNA sequences, like a jigsaw puzzle of overlapping sequences that fit together.
  29. Largest assembled genome belongs to the
    Loblolly pine (Pinus taeda), with a 22 Gbp genome.
  30. 64x coverage means
    An estimate of sequencing depth: on average, each base of the genome is covered by 64 reads.
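    A sketch of the usual depth-of-coverage estimate, coverage = (number of reads x read length) / genome size. The numbers in the example are made up for illustration.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_length / genome_size


# 3.2 million reads of 100 bp over a 5 Mbp genome
print(mean_coverage(3_200_000, 100, 5_000_000))  # 64.0
```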
  31. Biological Problems of genome assembly
    • Repeats: Many organisms' genomes consist largely of repetitive sequences.
    • Ploidy: at least two copies of the genome are present.
    • Lack of a reference genome: Reference-assisted assembly is a much easier problem than de novo assembly. Even having a genome from a closely related species can help.
  32. Challenges for genome assembly
    • Cost: In 2014 Illumina claimed the $1,000 genome barrier had been broken (if you first spend ~$10 million on hardware).
    • Library prep: a critical, and often overlooked, step in the process.
    • Sequencing platform diversity: Illumina, 454, Ion Torrent, PacBio, Oxford Nanopore. Which mix of sequence data will you be using?
    • Hardware: Some genome assemblers have very high CPU/RAM requirements. Might need a specialized cluster.
    • Expertise (one of the biggest inhibitors): It is not always easy to get assembly software installed, let alone understand how to use it properly.
    • Software: Many tools to choose from, with many parameters that result in different genome assemblies. At least 125 different tools for assembly.
  33. Before you assemble...
    • Remove adapter contamination
    • Remove sequence contamination
    • Trim sequences...
  34. First eukaryotic genome sequence
  35. First plant genome
    • Arabidopsis thaliana
    • Size corrected upwards and downwards multiple times over the years
  36. We no longer have one genome per species
    we have genome sequences representing different strains and varieties of a species.
  37. Really easy to sequence genomes but very hard to accurately assemble them.
    Bad genome assemblies: (1) fragmented, (2) contain a large number of Ns.
  38. Approaches to Genome Assembly: Hierarchical sequencing
    Top-down, map-based, or clone-by-clone strategy. Break the genome into smaller and smaller units whose locations are known, then sequence them.

    Benefits
    • Known map positions
    • Reduces repetitive effort
    • How all the first sets of genomes were done
    • High confidence

    Cost
    • Expensive
    • Time consuming
  39. Shotgun Sequencing
    • Use a computer algorithm to assemble contigs from derived overlapping sequences
    Benefits
    • Cheap!!!
    • Fast!!!
    Cost
    • Computationally complex
    • Difficult to assess
    • Biases unclear
    • Some parts of the genome cannot be sequenced (e.g. centromeres)
  40. Hierarchical sequencing?
    • Find overlaps - find all pairs of sequences that overlap
    • Layout - remove redundant or weak ones
    • Consensus - merge pairs that overlap unambiguously (overlap ONLY with each other and not other sequences)
  41. Overlap graph
    • A vertex is a string (sequence)
    • Used by overlap-layout-consensus assemblers
    • An edge represents an overlap between two strings
    • Contig: continuous segment
    • _______  -> O (bacterial chromosome)
    • Isolated segments of sequence; break up and analyse.
  42. What counts as "overlapping"?
    • Min overlap: 20 bp
    • Min % identity across overlap: 95%
    • Choice depends on L (length) and expected error rate
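    A sketch of an overlap test between two reads using the two thresholds on this card (minimum overlap length and minimum percent identity). It assumes an ungapped suffix-to-prefix comparison, which is a simplification; real assemblers also allow indels in the overlap. The function name is hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=20, min_identity=0.95):
    """Length of the longest suffix of a that matches a prefix of b
    with at least min_identity, or 0 if no overlap reaches min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):  # longest first
        matches = sum(x == y for x, y in zip(a[-k:], b[:k]))
        if matches / k >= min_identity:
            return k
    return 0


# Small thresholds so the toy reads are easy to check by eye.
print(suffix_prefix_overlap("AAACCCGGG", "CCGGGTTT", min_len=3, min_identity=1.0))  # 5
```

    In an overlap graph, each pair of reads with a non-zero result would become an edge from a to b.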
  43. What ruins the graphs?
    • Read errors: introduce false edges and nodes
    • Heterozygosity: causes lots of detours from homozygous areas
    • Repeats: cause nodes to be shared, locality confusion
  44. Contigs
    • Contiguous, unambiguous stretches of assembled DNA sequence
    • Contigs end at:
      - real ends
      - dead ends
      - forks
  45. Scaffolding
  46. Contig sizes Limited by :
    • Length of repeats in your genome
      - can't change this!
    • The length (or "span") of the reads
      - wait for new technology
      - use "tricks" with existing technology
  47. RNA -Seq
    Transcriptomic. Quantitative -> number of reads = level of expression
  48. ChIP - Seq (DNA)
    Chromatin Immuno-Precipitation
  49. Variants (DNA)
    Take a closer look at which SNPs are associated with what
  50. So you run a next-generation sequencer and get 10s-100s of millions of sequences; now what?
    • Get rid of low-quality stuff
    • Map your data to a reference
    • Analyze results...
  51. Mapping: alignment of sequence read to reference sequence
    • Reads
    • References
    • Alignment
  52. Elements of an alignment
    • Matches/Mismatches
    • Gaps
    • Alignment scores
    • Substitution scores
  53. Different Algorithms
    Dynamic Programming
    • Scoring matrix
    • Global
      - Needleman-Wunsch
    • Local
      - Smith-Waterman
    • Affine gap
      - Separate penalties for opening and extending a gap
    • Computationally expensive
      - Aligning 100 bp x 1000 bp sequences requires calculating 10^5 scores
      - Several calculations or FLOPs (FLoating point OPerations) per score
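    A small sketch of an affine gap penalty and of the matrix size that makes dynamic programming expensive. The penalty values and the convention used here (the first gap position pays the opening penalty, each further position pays the extension penalty) are assumptions; tools differ on the exact convention.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Score of a single gap of the given length under an affine model."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def dp_cells(len_a, len_b):
    """Number of matrix scores a full dynamic-programming alignment fills in."""
    return len_a * len_b


print(affine_gap_penalty(1))   # -10: opening a gap is costly
print(affine_gap_penalty(5))   # -14: extending it is cheap
print(dp_cells(100, 1000))     # 100000 scores for 100 bp x 1000 bp
```

    One long gap therefore costs much less than several short ones, which matches the idea that a single insertion or deletion event can span many letters.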
  54. Heuristic Algorithms
    • Make approximations to speed up the process
    • e.g. FASTA or BLAST
    • Identify all matching k-tuples
    • k-tuple ~ k-mer
      - String of length k
      - k determines the speed and sensitivity
    • Hash ~ dictionary
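    A sketch of the k-mer hash idea: store every k-mer of the subject in a dictionary, then look up the query's k-mers to find exact seed matches. Function names are hypothetical, and this shows only the seeding step, not the extension that FASTA or BLAST perform afterwards.

```python
from collections import defaultdict


def build_kmer_index(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_hits(query, index, k):
    """Return (query_pos, subject_pos) for every exact k-mer match."""
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


idx = build_kmer_index("GATTACAGATT", 3)
print(idx["GAT"])                  # [0, 7]
print(seed_hits("TTAC", idx, 3))   # [(0, 2), (1, 3)]
```

    A larger k gives fewer, more specific seeds (faster, less sensitive); a smaller k gives more seeds (slower, more sensitive).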
  55. Heuristic Algorithms
    • BLAST
      - Restricts searches to fewer, smaller bands around matrix diagonals
      - Hash table of words ("seeds")
      - Find matches and extend in both directions, first without gaps or mismatches
      - Extend and join matches in restricted regions, allowing gaps and mismatches (above some threshold alignment score)
    • BLAST will miss alignments outside of the restricted region
    • Collection of algorithms
      - BLASTN - Short
      - MEGABLAST - Long
  56. Next-Generation Sequencing Data
    • Lots of sequences!!!
    • Millions and millions
    • Same algorithms, but they need to be sped up
    • Various approaches:
      - Global vs. local
      - Allow or do not allow gaps
      - Index (create a hash map of "seeds") the reads or index the reference
      - Different algorithms
  57. Mapping
    Burrows-Wheeler Transform
    • e.g. Bowtie, BWA, SOAP2, etc.
    • Increases speed, decreases memory footprint
    • Uses the "trie" convention (from "retrieval") for fast string matching
    • A trie contains all possible substrings of a reference sequence (either a prefix or a suffix trie)
    • Uses a graph-theoretical approach, with nodes and edges
    • Generates a suffix tree, then uses the BWT and indexing methods
    • Allows for reduced information storage and easy data compression
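    A naive sketch of the Burrows-Wheeler transform itself, built by sorting all rotations of the text. This is for illustration only: it is not how Bowtie or BWA build their indexes, since sorting full rotations does not scale to a genome. The "$" terminator is the usual convention and is assumed not to occur in the text.

```python
def bwt(text, terminator="$"):
    """Last column of the sorted rotations of text + terminator."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, terminator="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("GATTACA")))   # GATTACA
```

    The transform tends to group identical letters into runs, which is why it compresses well, and it is reversible, so no information about the reference is lost.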
  58. Query
    Thing we are trying to identify/map
  59. Subject
    Thing we are matching to (index = bunch of subjects)
  60. Applications
  61. Genome assembly
    • e.g. scaffolding; depth of coverage
    • RNA-Seq - estimation of gene/transcript expression
    • Epigenomics and gene regulation - modifications to the genome
    • Unknown identification - small RNA discovery
    • Variant identification - SNP prediction
  62. Transcriptomics
    • Study of the complete set of RNA transcripts that are produced by the genome, under specific circumstances or in a specific cell
      - Often referred to as expression profiling
  63. Traditional Process:
    1. Isolate RNA from sample
    2. Get the fraction of RNA in which you are interested
      o Poly-A selected
      o Ribo-depletion
      o small RNA
    3. Construct library of sample RNA
    4. Sequence data
    5. Analyze...
  64. Power of RNA-SEQ
    • Entire transcriptome
    • Inductive and deductive
    • Quantitative
      - Number of reads mapped ~ level of gene expression
    • Strand specific
      - Sense vs. anti-sense transcription
    • Can do RNA-Seq on really small amounts of RNA - single cells
  65. How is quantification done?
    • To normalize the data (put them on the same scale so they can be compared)
      o RPKM - reads per kilobase per million reads
      o FPKM - fragments per kilobase per million fragments
      o TPM - transcripts per million (proportion of each transcript in the total pool)
    None is absolute - always comparing the relative proportion of reads mapped.
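    A sketch of how RPKM and TPM can be computed from raw read counts and gene lengths. The gene names, counts and lengths are made-up illustration values, and the input format (two dicts keyed by gene) is an assumption.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads."""
    total_reads = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total_reads) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    denom = sum(rate.values())
    return {g: r * 1e6 / denom for g, r in rate.items()}


counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

    TPM values always sum to one million within a sample, which makes the proportions easier to compare across samples than RPKM.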
  67. Statistics
    • Differential gene expression
      - A more complicated t-test
      - Correct for non-normality
      - Correct for multiple tests and the false discovery rate
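    A sketch of one common multiple-testing correction, the Benjamini-Hochberg procedure for controlling the false discovery rate. The card does not say which correction the course uses, so this choice is an assumption. Each p-value is scaled by (number of tests / rank) and the results are forced to be monotone.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                        # largest p first
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

    Genes whose adjusted value falls below the chosen threshold (for example 0.05) would be reported as differentially expressed.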
  68. Clustering
    • Correlated gene expression
    • Simple to very complicated
    • Differential functional element representation
      - GO group count comparisons