Types of Sequence Alignment
• Number of sequences:
- Pairwise (two sequences)
- Multiple Sequence Alignment (MSA)
• Type of sequence:
- DNA, RNA, and Protein
• Extent of the alignment:
- Global – All letters and nulls in all sequences must be aligned.
- Local – Allows arbitrary-length segments of each sequence to be aligned, with no penalty for the unaligned portions of the sequences.
- Match or mismatch: whether the letters are the same or differ between two sequences at a specific shared position along the alignment (a match is marked “|”).
An arbitrary number of null characters (“-”) may be placed into either sequence, and aligned with letters in the other sequence. Two nulls may not be aligned. Depending upon perspective, the alignment of a letter with a null may be understood as the insertion of a letter into one sequence, or the deletion of a letter from the other. Therefore, a letter aligned with a null is sometimes called an indel.
- The score for an alignment is taken to be the sum of scores for aligned pairs of letters, and scores for letters aligned with nulls.
- Scores for aligned pairs of letters are called substitution scores, whether the letters aligned are identical or not. Most simply, substitution scores may take the form of match scores and mismatch scores.
- The score for a letter aligned with a null is called a gap score. Usually gap scores are letter-independent.
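The scoring rule above can be sketched in a few lines. This is a minimal illustration, not a standard scoring scheme: the function name and the values match=+1, mismatch=-1, gap=-2 are assumptions chosen for the example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Sum substitution and gap scores over the columns of an alignment.

    a and b are already-aligned strings of equal length that use "-" as
    the null character. The score values are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("two nulls may not be aligned")
        if x == "-" or y == "-":
            score += gap          # letter aligned with a null (indel)
        elif x == y:
            score += match        # substitution score for identical letters
        else:
            score += mismatch     # substitution score for different letters
    return score


print(alignment_score("GATTACA", "GA-TACA"))  # 6 matches + 1 gap -> 4
```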
Dynamic Programming and Global Alignment
- Dynamic programming is a method by which a larger problem may be solved by first solving smaller, partial versions of the problem.
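For global alignment, the "smaller, partial versions" are the best scores for aligning every prefix of one sequence with every prefix of the other. The sketch below fills that table in the style of the Needleman-Wunsch algorithm (a named technique not mentioned in these notes) and returns only the optimal score. The linear gap score and the numeric values are assumptions for illustration.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score by dynamic programming.

    F[i][j] holds the best score for aligning s[:i] with t[:j].
    Score values are illustrative assumptions (linear gap score).
    """
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap             # s[:i] aligned entirely with nulls
    for j in range(1, m + 1):
        F[0][j] = j * gap             # t[:j] aligned entirely with nulls
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                F[i - 1][j] + gap,      # s[i-1] aligned with a null
                F[i][j - 1] + gap,      # t[j-1] aligned with a null
            )
    return F[n][m]


print(global_alignment_score("GATTACA", "GATACA"))  # 4
```

Each cell depends only on three neighbours that were already computed, which is what makes the larger problem solvable from the smaller ones.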
Local Alignment: Definition
- A local alignment of two sequences allows arbitrary-length segments of each sequence to be aligned, with no penalty for the unaligned portions of the sequences. Otherwise, the score for a local alignment is calculated the same way as that for a global alignment.
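A minimal sketch of local alignment scoring, in the style of the Smith-Waterman algorithm (a named technique not mentioned in these notes). The only changes from a global dynamic-programming table are that a cell may restart at 0, so unaligned flanks cost nothing, and that the answer is the best cell anywhere in the table. Score values are assumptions.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between any segment of s and any segment of t."""
    best = 0
    prev = [0] * (len(t) + 1)         # previous row of the table
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # extend with an aligned pair
                prev[j] + gap,        # letter of s aligned with a null
                cur[j - 1] + gap,     # letter of t aligned with a null
            )
            best = max(best, cur[j])
        prev = cur
    return best


# The shared segment ACGTACG (7 letters) is found despite unrelated flanks.
print(local_alignment_score("TTTTACGTACGTTTTT", "GGGACGTACGGGG"))  # 7
```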
Assumption of homology (learn how species are biologically related)
MSAs: Identify sequences or parts of sequences > function
- Phylogenetic Analysis
- Study of the evolutionary history and relationships among individuals or groups of organisms
- Explore evolutionary history
- Evolution of a species or a gene/gene family
Phylogeny: important to current questions
- Evolutionary tree
- ○ Reveals pattern of relatedness in genes.
- ○ Displays hierarchical ancestral relationships among sequences
- ○ Classify gene with appropriate subfamily
- ○ Test predictions about placement of sequence
- ○ Test predictions about effects of species history on evolution
- ○ Used as the null model
2 Types of trees:
- 1. Unrooted tree topology
- 2. Rooted tree topology
- ○ needs info on evolutionary rates or on which lineage is most ancient
- ○ needs an outgroup: outside of the group of interest, but not too distantly related
Branches and Nodes
- Leaves: extant samples
- ○ Internal nodes: putative ancestors
- Branch lengths: amount of genetic change
- ○ Average number of substitutions per site
- ○ Describe amount of change between instances
How do we build phylogenies?
Find the topology most congruent with the observed data (the one that best explains the data)
- Distance methods: use probabilistic models of sequence evolution to calculate pairwise evolutionary distances for all pairs of sequences
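As one concrete instance of a model-based pairwise distance, the sketch below uses the Jukes-Cantor model. The choice of model is an assumption for illustration; these notes do not say which model was used. The function names are hypothetical.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a null ("-") are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def jukes_cantor_distance(a, b):
    """Expected substitutions per site under the Jukes-Cantor model.

    Corrects the observed proportion p for multiple hits at the same
    site: d = -3/4 * ln(1 - 4p/3). Undefined (infinite) once p >= 0.75.
    """
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
print(round(d, 4))  # slightly larger than the observed p = 0.1
```

A distance-based method would compute this value for every pair of sequences and build the tree from the resulting matrix.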
Maximum parsimony: the smallest number of changes or steps needed to explain the data
- Best topology is the one with the smallest number of changes needed to explain the data.
- ○ Can include more complex evolutionary models
- Uses probabilistic models of sequence evolution to calculate pairwise evolutionary distances for all pairs of sequences
- ● Bayesian inference: uses likelihood to create a "posterior probability of trees" using a model of evolution, based on some prior probabilities, and determines the most likely topology for the given data.
Evaluating Tree Reconstruction
- Measure support for clades
- ● Bootstrapping (resampling technique)
- ○ resampling of the sample data with replacement
- ○ Especially in maximum likelihood analysis
- ○ 70-80%: usually strong support
- ● Posterior probability
- ○ Bayesian
- ○ ranges from 0 to 1
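For sequence data, bootstrapping usually means resampling alignment columns with replacement to build pseudo-replicate alignments of the same length. The sketch below shows only that resampling step. Tree building is left out, and the function name and the dictionary layout of the alignment are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment maps sequence names to aligned strings of equal length.
    Columns are drawn with replacement, and the same column choices are
    applied to every sequence so that columns stay intact.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][c] for c in columns)
        for name in names
    }


aln = {"taxonA": "ACGT", "taxonB": "ACGA", "taxonC": "TCGA"}
replicate = bootstrap_alignment(aln, random.Random(0))
print(replicate)
```

Support for a clade is then the percentage of replicate trees that contain that clade.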
- N50: the length of the contig that takes the cumulative length past 50% of the total assembly size (when summing lengths from longest to shortest).
- Assess whether or not you have a good assembly.
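The definition above translates directly into code. A minimal sketch; the function name is hypothetical, and it treats "past 50%" as "reaches at least 50%", which is a common convention but an assumption here.

```python
def n50(lengths):
    """N50 of a list of contig lengths.

    Sort contigs from longest to shortest, accumulate their lengths,
    and return the length of the contig at which the running total
    reaches at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Total = 290, half = 145; 80 + 70 = 150 crosses it, so N50 = 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```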
- We Desire
- ■ total length similar to genome size
- ■ fewer larger contigs
- ■ no mistakes
- No generally useful objective measure
- ■ Longest, total bp, genes recovered
- Self-consistency
- ■ align reads back to contigs
- ■ check for errors or discordant pairs
- ○ Second opinion
- ■ use two complementary sequencing methods
- ■ Target troublesome areas for PCR
- ■ Use a genome-wide "optical map"
- Size of genome
- ● phone, laptop, desktop, server, cloud
- ● RAM more limiting than CPU
- Operating System
- ● Linux, Mac, Windows
- Software budget
- ● Commercial, free, open source
- Using core genes
- All genomes perform some core functions
- (transcription, replication, translation, etc.)
- proteins involved tend to be highly conserved
- They should be present in every genome
- Core Eukaryotic Genes Mapping Approach (CEGMA)
- defines a set of 248 'Core Eukaryotic Genes' (CEGs)
- CEGs identified from genomes of: S. cerevisiae, S. pombe....
- A higher N50 does not necessarily mean more genes recovered.
- CEGMA is outdated and hard to install, but a good idea.
de novo genome assembly (from scratch)
An attempt to accurately represent an entire genome sequence from a large set of short DNA sequences, like a jigsaw puzzle of overlapping sequences that fit together.
Largest assembled genome belongs to the Loblolly Pine (Pinus taeda), with a 22 Gbp genome.
64x coverage means:
an estimate of sequencing depth: on average, each base of the genome was sequenced 64 times.
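Average coverage is commonly estimated as total sequenced bases divided by genome size. A small sketch of that arithmetic; the function names and the 100 bp read length are hypothetical values for illustration, not figures from the pine assembly.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# 1 million reads of 100 bp over a 5 Mbp genome.
print(mean_coverage(1_000_000, 100, 5_000_000))      # 20.0

# 64x over a 22 Gbp genome with (assumed) 100 bp reads.
print(reads_needed(64, 100, 22_000_000_000))         # 14080000000
```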
Biological Problems of genome assembly
- Repeats: Many organisms' genomes consist of repetitive sequences.
- Ploidy: at least two copies of the genome present
- Lack of reference genome: Reference-assisted assembly is a much easier problem than de novo assembly. Even having a genome from a closely related species can help.
Challenges for genome assembly
- Cost: In 2014 Illumina claimed the $1,000 genome barrier had been broken (if you first spend ~$10 million on hardware).
- Library prep: a critical, and often overlooked, step in the process.
- Sequencing hardware diversity: Illumina, 454, Ion Torrent, PacBio, Oxford Nanopore; which mix of sequence data will you be using?
- Hardware: Some genome assemblers have very high CPU/RAM requirements. Might need a specialized cluster.
- Expertise (one of the biggest inhibitors): it is not always easy to get assembly software installed, let alone understand how to use it properly.
- Software: many choices. Many parameters that result in different genome assemblies. At least 125 different tools for assembly.
Before you assemble...
- Remove adapter contamination
- Remove sequence contamination
- Trim sequences...
First eukaryotic genome sequence
First plant genome
- Arabidopsis thaliana
- Size corrected upwards and downwards multiple times over the years
We no longer have one genome per species
we have genome sequences representing different strains and varieties of a species.
Really easy to sequence genomes but very hard to accurately assemble them.
Bad genome assemblies: fragmented, or containing a large number of Ns.
Approaches to Genome Assembly
Hierarchical sequencing
- Top-down, map-based, or clone-by-clone strategy
- Break the genome into smaller and smaller units whose locations you know, then sequence them
- Benefits: known map positions; reduces repetitive effort; how all the first sets of genomes were done; high confidence
Shotgun sequencing
- Use a computer algorithm to assemble contigs from the derived overlapping sequences
- Benefits: Cheap! Fast!
- Costs:
- ○ Computationally complex
- ○ Difficult to assess
- ○ Biases unclear
- ○ Some parts of the genome cannot be sequenced (centromeres)
- Overlap - find all pairs of sequences that overlap
- Layout - remove redundant or weak overlaps
- Consensus - merge pairs that overlap unambiguously (overlap ONLY with each other and not with other sequences)
- A vertex is a string (sequence)
- Used by overlap-layout-consensus assemblers
- An edge represents an overlap between two strings
- Contig: continuous segment
- _______ -> O (bacterial chromosome)
- Isolated segments of sequence, break up and analyse.
What counts as "overlapping"?
- Min overlap: 20 bp
- Min % identity across overlap: 95%
- Choice depends on L (read length) and expected error rate
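The two thresholds can be applied in a simple suffix-prefix check between a pair of reads. This is a sketch only: it considers ungapped overlaps, compares the reads in one orientation, and uses a brute-force scan that real assemblers replace with indexed searches. The function name is hypothetical.

```python
def best_overlap(a, b, min_overlap=20, min_identity=0.95):
    """Length of the longest suffix of a that overlaps a prefix of b.

    An overlap of length k counts only if k >= min_overlap and the
    fraction of identical positions is >= min_identity. Returns 0 when
    no overlap qualifies. Ungapped comparison only.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        suffix, prefix = a[-k:], b[:k]
        matches = sum(x == y for x, y in zip(suffix, prefix))
        if matches / k >= min_identity:
            return k
    return 0


read1 = "GGGG" + "ACGTACGTTAGCCATGCAAT"
read2 = "ACGTACGTTAGCCATGCAAA" + "CCCC"   # one mismatch in the 20 bp overlap
print(best_overlap(read1, read2))                    # 20 (19/20 = 95%)
print(best_overlap(read1, read2, min_identity=1.0))  # 0
```

Lowering the thresholds finds more true overlaps in error-prone reads, but also admits more false edges from repeats.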
What ruins the graphs?
- Read errors: introduce false edges and nodes
- Heterozygosity: causes lots of detours from homozygous areas
- Repeats: cause nodes to be shared, locality confusion.
- Contigs: contiguous, unambiguous stretches of assembled DNA sequence
- real ends
- dead ends
Contig sizes are limited by:
- Length of repeats in your genome: can't change this!
- The length (or "span") of the reads: wait for new technology, or use "tricks" with existing technology
Transcriptomics: quantitative -> number of reads = level of expression
ChIP-Seq (DNA)
take a closer look at which SNPs are associated with what
So you run a next-generation sequencer and get tens to hundreds of millions of sequences. Now what?
- Get rid of low quality stuff
- Map your data to a reference
- Analyze results...
Mapping: alignment of sequence reads to a reference sequence
Elements of an alignment
- Alignment scores
- Substitution scores
- Scoring matrix
- Affine gap
- -Separate penalties for opening and extending a gap
- Computationally expensive
- -Aligning 100 bp x 1000 bp sequences requires calculating 10^5 scores
- --Several calculations or FLOPs (FLoating point OPerations) per score
- Make approximations to speed up the process
- e.g. FASTA or BLAST
- identify all matching k-tuples
- k-tuples ~ k-mer
- -String of length k
- -k determines the speed and sensitivity
- Hash ~ dictionary
- -Hash table of words ("seeds")
- -Restrict searches to fewer, smaller bands around matrix diagonals
- -Find matches and extend in both directions, first without gaps or mismatches
- -Extend matches joined in restricted regions, allowing gaps and mismatches (above some threshold alignment score)
- BLAST will miss alignments outside of restricted region
- -Collection of Algorithms
- --BLASTN - Short
- --MEGABLAST - Long
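The seed-and-extend idea behind FASTA/BLAST-style heuristics can be sketched with an ordinary dictionary as the hash table of k-mers. This is an illustration of the idea only, not the actual BLAST implementation; the function names are hypothetical and the gapped extension stage is left out.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """All exact k-mer matches as (query_position, reference_position)."""
    seeds = []
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], []):
            seeds.append((q, r))
    return seeds


def extend_ungapped(query, reference, q, r, k):
    """Extend a seed in both directions while letters match exactly.

    Returns (query_start, reference_start, length) of the extended match.
    """
    start_q, start_r = q, r
    while start_q > 0 and start_r > 0 and query[start_q - 1] == reference[start_r - 1]:
        start_q -= 1
        start_r -= 1
    end_q, end_r = q + k, r + k
    while end_q < len(query) and end_r < len(reference) and query[end_q] == reference[end_r]:
        end_q += 1
        end_r += 1
    return start_q, start_r, end_q - start_q


reference = "TTGACCATGCAATT"
query = "CCATGCA"
index = build_kmer_index(reference, k=3)
print(find_seeds(query, index, k=3))
# [(0, 4), (1, 5), (2, 6), (3, 7), (4, 8)]  all on the same diagonal
print(extend_ungapped(query, reference, 2, 6, k=3))  # (0, 4, 7)
```

Seeds that share the same value of reference position minus query position lie on one diagonal of the alignment matrix, which is why the search can be restricted to narrow bands. A smaller k finds more seeds (more sensitive, slower); a larger k finds fewer (faster, less sensitive).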
Next-Generation Sequencing Data
- Lots of sequences!!!
- Millions and millions
- Same algorithms, but need to be sped up
- Various approaches:
- o Global vs. local
- o Allow or not allow gaps
- o Index (create hash-map of 'seeds') the reads or index the reference
- Different algorithms
- o e.g. Bowtie, BWA, SOAP2, etc.
- o Increases speed, decreases memory footprint
- o Uses "trie" convention (from retrieval) for fast string matching
- o A "trie" contains all possible substrings of a reference sequence (either a prefix or a suffix trie)
- o Uses a graph-theoretical approach, with nodes and edges
- o Generates a suffix tree; uses the Burrows-Wheeler transform (BWT) and indexing methods
- o Allows for reduced information storage and easy data compression
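To make the Burrows-Wheeler transform concrete, here is a deliberately naive sketch. It sorts suffixes directly, which is far too slow and memory-hungry for a genome; real aligners use specialised index structures. The function names and the "$" terminator convention are assumptions for the example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via a naively sorted suffix array.

    The terminator must not occur in the text and sorts before every
    other character (true for "$" versus letters in ASCII).
    """
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    text += terminator
    suffix_order = sorted(range(len(text)), key=lambda i: text[i:])
    # Each output letter is the one preceding the corresponding sorted suffix.
    return "".join(text[i - 1] for i in suffix_order)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    for row in table:
        if row.endswith(terminator):
            return row[:-1]


print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("GATTACA")))   # GATTACA
```

The transform is reversible and tends to group identical letters into runs, which is what makes the compressed, low-memory index possible.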
Query: the thing we are trying to identify/map
Subject: the thing we are matching to (index = bunch of subjects)
o Genome assembly
- - e.g. scaffolding; depth of coverage
- o RNA-Seq-estimations of gene/transcript expression
- o Epigenomics and gene regulation - modifications to the genome
- o Unknown identification - small RNA discovery
- o Variant identification - SNP prediction
- Transcriptomics: study of the complete set of RNA transcripts that are produced by the genome, under specific circumstances or in a specific cell
- -Often referred to as expression profiling
- 1. Isolate RNA from sample
- 2. Get the fraction of RNA in which you are interested
- o Poly-A selected
- o Ribo-depletion
- o small RNA
- 3. Construct library of sample RNA
- 4. Sequence data
- 5. Analyze...
Power of RNA-SEQ
- o Entire transcriptome
- o Inductive and deductive
- o Quantitative
- Number of reads mapped ~ level of gene expression
- o Strand specific
- Sense vs. Anti-sense transcription
- o Can do RNA-seq on really small amounts of RNA - single cells
How is quantification done?
- o RPKM/FPKM/TPM
- To normalize the data you got (put them on the same scale to be able to compare them)
- o RPKM - reads per kilobase per million reads
- o FPKM - fragments per kilobase per million reads
- o TPM - transcripts per million (proportion of transcript in total pool)
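A minimal sketch of the two normalizations, using made-up counts and lengths. The gene names, numbers, and function names are hypothetical. FPKM uses the same formula as RPKM with fragments counted in place of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_reads)
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million.

    Divide each count by transcript length first, then rescale so the
    values sum to one million within the sample.
    """
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rate.values())
    return {gene: rate[gene] * 1e6 / total_rate for gene in counts}


# Hypothetical sample: read counts and transcript lengths in bp.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this example geneA and geneB get the same value under either measure, because geneB has three times the reads but is also three times as long. TPM values always sum to one million per sample, which makes the "proportion of the total pool" reading explicit.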
In RNA-seq EVERYTHING IS RELATIVE!
Nothing is absolute - always comparing the relative proportions of reads mapped.
- o Differential Gene Expression
- A more complicated t-test
- Correct for non-normality
- Correct for multiple tests and the false discovery rate
- o Correlated gene expression
- o Simple to very complicated
- Differential Functional Element Representation
- • GO group count comparisons
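For the multiple-testing correction listed under differential gene expression, one widely used false discovery rate method is the Benjamini-Hochberg procedure. The notes do not say which method was meant, so this choice is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order.

    Each p-value is scaled by (number of tests / rank), then the values
    are made monotone by taking a running minimum from the largest rank
    downward.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Four hypothetical per-gene p-values.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
# approximately [0.02, 0.04, 0.04, 0.02]
```

Genes whose adjusted value falls below the chosen FDR threshold are then called differentially expressed.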