The flashcards below were created by user lukemlj on FreezingBlue Flashcards.

  1. What is one of the key challenges facing biology today?
    To organize, study, and draw conclusions from all the information collected on DNA, RNA, and proteins.
  2. What is the main role of DNA?
    information storage
  3. What is a nucleotide?
    A nucleotide is a building block of DNA or RNA
  4. Which DNA base is sometimes methylated?
    Cytosine
  5. What effect does methylation have on certain genes?
    Certain genes can be “rendered inactive”
  6. What is the error rate in DNA replication?
    The error rate is about 1 base in a billion
  7. Why is the error rate in DNA replication so low?
    There are mechanisms of proof reading and error correction
  8. What is nucleic acid hybridization?
    The ability of a strand of DNA to pair up with a complementary strand of DNA or RNA
  9. Give 3 applications of hybridization.
    microarrays, in situ hybridization, and FISH
  10. What are the main working components of organisms that play the major role in almost all the key processes of life?
    Proteins
  11. What part of DNA determines whether a gene is active (a protein is produced) or inactive?
    There are surrounding regions of non-coding DNA which can act as control regions (e.g. promoter).
  12. What is the name of the RNA transcribed from a protein-coding gene?
    Messenger RNA
  13. Name one method for measuring the expression of many genes in a cell.
    a microarray
  14. What is meant by “a gene is expressed”?
    mRNA is transcribed and the corresponding protein is synthesized
  15. In which direction is the mRNA produced during transcription? Give the direction with respect to the mRNA itself.
    from its 5’ to its 3’ end
  16. Where do overlapping genes most commonly occur?
    in viruses
  17. Do overlapping genes also occur in humans?
    Yes, but very rarely
  18. What does it mean that the “genetic code is degenerate”?
    amino acids can be specified by more than one codon
  19. Which amino acid is often subsequently removed from newly synthesized proteins?
    Methionine (the residue encoded by the start codon)
  20. How many possible ways are there to translate a given DNA sequence?
    There can be six different reading frames for a given sequence
  21. What is meant by an “open reading frame”?
    The reading frame that is flanked by appropriate start (ATG) and stop signals (TAG, TGA, or TAA)
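    The six reading frames and the ORF definition above can be illustrated with a short Python sketch. The function names are hypothetical, the input is assumed to be an uppercase DNA string containing only A, C, G, and T, and the ORF rule used here (first ATG to the next in-frame stop) is one simple convention among several.

```python
# Illustrative sketch: enumerate six reading frames and find simple ORFs.
# All names here are hypothetical helpers, not part of any library.

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAG", "TGA", "TAA"}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return codon lists for the six frames: three forward, then three reverse."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


def find_orfs(codons):
    """Return ORFs as codon lists running from an ATG to the next in-frame stop."""
    orfs = []
    start = None
    for index, codon in enumerate(codons):
        if start is None and codon == "ATG":
            start = index
        elif start is not None and codon in STOP_CODONS:
            orfs.append(codons[start:index + 1])
            start = None
    return orfs
```

    For example, find_orfs(six_frames("CCATGAAATAGGG")[2]) returns the single ORF ATG AAA TAG found in the third forward frame.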
  22. What are RNAs mainly involved in?
    The transfer of information from DNA, and using this information to synthesize proteins
  23. The 3 main classes of RNAs are: rRNA, mRNA, and tRNA. What is each one involved in?
    mRNA is transcribed from DNA. rRNA along with proteins form the ribosomes, the factories which build proteins given the mRNA. tRNA molecules recognize the codon in the mRNA and carry the associated amino acid that is then added to the chain.
  24. What molecule is the physical link between the mRNA and the growing protein chain?
    tRNA
  25. Where does tRNA bind to?
    the ribosome
  26. What does the regulation of many processes that interpret the information contained in a DNA sequence rely on?
    regulatory elements
  27. What are promoters?
    The control regions in DNA at which RNA polymerase binds to initiate transcription
  28. Bacterial promoters typically occur immediately before the position of what site?
    In bacteria, they typically occur before the transcription start site (TSS).
  29. What is one of the main problems in finding promoters in DNA sequences?
    Sequences outside the conserved motifs vary a great deal, and even the motifs themselves are not always the same from gene to gene
  30. Which is more variable: the prokaryote terminator sequence or the promoter sequence?
    the prokaryote terminator sequence
  31. Is the prokaryote terminator sequence usually included in genome annotations?
  32. What are activators and repressors?
    Activators improve the efficiency with which RNA polymerase binds, while repressors inhibit gene expression by blocking promoter sites.
  33. Why are activators and repressors of critical biological importance?
    They can have a significant effect on when and/or whether or not transcription occurs.
  34. What is the most important core promoter sequence in genes transcribed by RNA polymerase II?
    The TATA box
  35. Where is the TATA box located?
    About 25 nucleotides upstream from the start of transcription
  36. What is the major difference between eukaryotes and prokaryotes in terms of their transcription and translation processes?
    Eukaryotic mRNA transcripts are substantially modified before translation
  37. What is the role of the spliceosome?
    The spliceosome is responsible for removing introns from a transcribed piece of mRNA
  38. What does the spliceosome consist of?
    small nuclear RNA molecules and proteins
  39. What is meant by “alternative splicing”?
    When some exons or parts of exons are left out of the final mRNA product that is spliced together prior to translation
  40. What is the Shine-Dalgarno sequence?
    A short sequence at the 5’ end of mRNA that indicates the ribosome-binding site
  41. Alternative splicing is ___ in the genes of humans and other mammals.
    quite common
  42. What is the Shine-Dalgarno sequence's typical consensus sequence?
  43. Where does the Shine-Dalgarno sequence occur?
    A few bases upstream of the AUG translation start sequence
  44. What is an operon?
    Functionally related protein-coding sequences under the control of a single regulatory signal or promoter
  45. How do viruses replicate?
    By using a host cell’s resources
  46. Give an example of an unusual feature that is found in some viral genomes but not in cellular genomes.
    Some viral genomes apparently have overlapping genes.
  47. What are plasmids?
    Circular pieces of DNA from bacteria that contain additional genes that can be transmitted between bacteria
  48. On what process does the fate of a mutation (to be lost or to be retained) depend?
    the process of natural selection
  49. What are 4 targets of HIV Drugs?
    Fusion, integrase, protease, and reverse transcriptase
  50. What can be said about the general statements made in chapter one?
    there are probably exceptions to every general statement
  51. What is one of the challenges facing bioinformatics?
    Determining how the amino acid sequence of a protein determines the protein’s final structure and function
  52. What are some of the physical and chemical properties that are used to classify proteins into groups?
    size, electrical charge, hydrophobicity, and polarity; they can overlap
  53. What is meant by α-helix?
    A polypeptide in the form of a stable, right-handed helix. It has a pitch of 0.54 nm, 3.6 amino acid residues per helical turn, and a rise per residue of 0.15 nm.
  54. What is meant by β-sheet?
    An extended polypeptide strand (a beta-strand) that, when aligned with other beta-strands, allows hydrogen bonding between them to form a beta-sheet.
  55. What are homologous proteins?
    proteins that have a common ancestor
  56. When comparing proteins, where are most amino acids that change during evolution found?
    In regions that are not structurally or functionally important, such as many of the loop (or variable) regions
  57. What are globular proteins?
    proteins that are roughly spherical when folded
  58. What are fibrous proteins?
    proteins that form long protein filaments shaped like rods or wires
  59. Large hydrophobic groups tend to cluster ___, while polar groups tend to cluster _____.
    inside the protein, on the protein’s surface
  60. A protein domain is a ___.
    chain of protein that has folded up into a discrete structural unit.
  61. A typical domain has between ___.
    50 and 350 amino acids.
  62. The core of a domain is composed mainly of ___.
    alpha-helices, beta-sheets or both.
  63. Adult hemoglobin is an example of a ___ and is made of ___.
    tetrameric quaternary structure, two alpha subunits and two beta subunits.
  64. What is one of the main aims of bioinformatics?
    “One of the main aims of bioinformatics is to predict and analyze the structure of proteins and the relationship of the structure to the function.”
  65. What should be done to the protein sequence under study in order to obtain more info about it and to perform accurate predictions?
    The protein should be aligned to other proteins to find homologs.
  66. Why are flat files still being used?
    Flat files can be read and analyzed by many different types of programs depending on the user’s preferences, while more complex databases often require specific, expensive software.
  67. What does SQL stand for?
    Structured Query Language
  68. What does HTML stand for?
    HTML: Hypertext Markup Language
  69. What does XHTML stand for?
    XHTML: Extensible Hypertext Markup Language
  70. What is meant by "Annotation"?
    Annotation refers to additional information related to a database entry that is not critical to identify the item.
  71. What are some features of an annotation in a database?
    Features include links to related entries in other databases, interpretation of the data, and relevant research citations.
  72. What is meant by "gene ontology"?
    According to the text, Gene Ontology is a collaborative project across many laboratories to provide a controlled vocabulary that describes gene and gene-associated information for all organisms.
  73. What are the three types of DNA sequences stored in databases containing info about nucleic acids?
    Raw genomic sequences, cDNAs, and expressed sequence tags.
  74. What is the difference between DIP and pSTIING?
    DIP only details protein-protein interactions. pSTIING details interactions with proteins and anything else.
  75. What is a non-redundant database?
    In a non-redundant database, related data are held in a single entry referring to all the independent experiments and summarizing that information.
  76. When and why are genes and proteins labeled hypothetical?
    When genes and their proteins are identified in nucleotide sequences purely by computational methods
  77. Sequences in the ___ database are curated manually.
  78. Name some applications for the identification of similar sequences.
    For raw DNA sequences, comparison with similar sequences can indicate that the sequence is likely to be part of a protein-coding gene. Similarities can also make predictions about a protein’s function and/or structure. Comparison with different organisms can be useful in constructing phylogenetic trees.
  79. ___ are disabled copies of genes that no longer produce functional proteins.
    Pseudogenes
  80. How do pseudogenes arise?
    It is assumed that they arise from gene duplication.
  81. How many pseudogenes are in the human genome?
    About 20,000
  82. What is an example of a pseudogene?
    The gene that codes for an enzyme (GULO) which aids in the biosynthesis of vitamin C is found in most mammals, but it is disabled in primates. Another is the eta globin pseudogene that is also disabled in humans, but is functional in goats.
  83. How do convergent and divergent evolution differ?
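    In divergent evolution, different structures or sequences are produced from a common ancestor. In convergent evolution, similar structures or sequences arise independently, without being inherited from a common ancestor.
  84. Why is it easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences?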
    There are only 4 letters in a DNA sequence, so less information is provided for a sequence of the same length as an amino acid sequence. Also, the genetic code is redundant, so a match in amino acid sequence does not mean there’s a match in the DNA sequence.
  85. When is it necessary to compare DNA sequences?
    DNA sequences are needed when searching for regulatory sequences or in whole-genome comparisons.
  86. How do you get rid of noise in a dot-plot?
    A filter can be applied. One type of filter uses over-lapping “windows” of a fixed length that requires a match over that fixed length before showing up on the plot.
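    The window filter described above can be sketched in Python. This is a minimal illustration with hypothetical names; the window length and the number of matches required within it are free parameters, and the default here (every position in the window must match) is an assumption.

```python
def dot_plot(seq_a, seq_b, window=1, min_matches=None):
    """Return the set of (i, j) positions that pass the window filter.

    A dot is placed at (i, j) when at least `min_matches` of the `window`
    residues starting at seq_a[i] and seq_b[j] are identical.
    """
    if min_matches is None:
        min_matches = window  # assumed default: perfect match over the window
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots
```

    Comparing "GATTACA" with itself gives 15 dots with window=1 (the diagonal plus chance single-letter matches), but only the 5 diagonal dots with window=3.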
  87. What is a tool used to visualize a protein?
    Protein Workshop at the Protein Data Bank (PDB)
  88. When scoring pairs of aligned amino acids, why are pairs such as isoleucine and leucine scored more highly than other pairs such as isoleucine and lysine?
    Isoleucine and leucine have similar properties compared to isoleucine and lysine. The more similar pair thus gets a higher score since they are more likely to substitute functionally for each other.
  89. What is the minimum percentage identity that can reasonably be accepted as significant when comparing two protein sequences?
    The text cites an analysis that found that “90% of sequence pairs with identity at or greater than 30% over their whole length were pairs of structurally similar proteins.” The conclusion is that 30% sequence identity is the threshold to claim likelihood of homology.
  90. Explain, making sure you include what is meant by “twilight zone” in your answer.
    Also in the study, “Below about 25% sequence identity … only 10% of the aligned pairs represented structural similarity.” So the range of roughly 20-30% identity is called the twilight zone, “where homology may exist but cannot be reliably assumed in the absence of other evidence.”
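    The thresholds in the two cards above can be turned into a small Python sketch. The names are hypothetical, and the choice of denominator (only columns where neither sequence has a gap) is an assumption; other definitions of percent identity divide by the alignment length or by the shorter sequence length.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def classify_identity(identity, lower=20.0, upper=30.0):
    """Label an identity value using the 20-30% twilight zone from the cards."""
    if identity >= upper:
        return "likely homologous"
    if identity >= lower:
        return "twilight zone"
    return "no reliable inference from identity alone"
```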
  91. The rows in the substitution matrices of Figure 4.4 on page 83 are color coded. Yellow:
    Small and polar. Polar in this case refers to the property of the side chain that results in a separation of charge for the amino acid. Water is also polar, so polar amino acids tend to like the water.
  92. The rows in the substitution matrices of Figure 4.4 on page 83 are color coded. White:
    Small and nonpolar. Nonpolar amino acids don’t have that charge separation and thus don’t like aqueous environments.
  93. The rows in the substitution matrices of Figure 4.4 on page 83 are color coded. Red:
    polar or acidic. Acidic amino acids are typically charged depending on the pH of the medium. For example, with respect to “physiological pH” (the pH of blood in humans), acidic amino acids will be negatively charged.
  94. The rows in the substitution matrices of Figure 4.4 on page 83 are color coded. Green:
    large and hydrophobic. Hydrophobic molecules do not like aqueous environments.
  95. The rows in the substitution matrices of Figure 4.4 on page 83 are color coded. Orange:
    Aromatic. Aromatic amino acids have carbon-based rings in their side chains. This structure allows them to absorb UV radiation. They are also relatively non-polar.
  96. Explain PAM
    PAM stands for Point Accepted Mutation; one PAM unit corresponds to one accepted point mutation per 100 residues. A PAM1 matrix is therefore geared towards sequences that differ by about 1%. PAM substitution matrices are based on substitution frequencies in aligned homologous protein sequences.
  97. Explain BLOSUM
    BLOSUM stands for BLOcks SUbstitution Matrix. BLOSUM was developed after PAM, and it is also derived from substitutions observed in real protein alignments. However, the BLOSUM number refers to percent identity. For example, a BLOSUM-80 substitution matrix will likely perform better for more closely related sequences than a BLOSUM-50 substitution matrix.
  98. What are the two findings that were obtained from structural analysis of the sequence of amino acids w.r.t. insertions and deletions of amino acids?
    Indels are rarer in sequences of structural importance. Insertions more often contain multiple residues as opposed to just one.
  99. When should we set a high gap penalty?
    A high gap penalty should be set when a new gap is introduced or when an exact match is desired.
  100. When should we set a low gap penalty?
    A low gap penalty should be set when an existing gap is extended or when similarity is sought between distantly related sequences.
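    A common way to combine a high penalty for opening a gap with a low penalty for extending it is an affine gap penalty. The sketch below is illustrative only: the function name and the default values are assumptions, and some programs count the first gapped position differently.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` positions: costly to open, cheap to extend."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)
```

    With these assumed values, a single gap of five positions costs 12.0, while five separate one-position gaps cost 50.0, which matches the observation that insertions often span several residues.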
  101. Why should amino acids such as tryptophan have higher gap penalties (when aligned with gaps) than other amino acids such as glycine?
    Tryptophan is more likely to be conserved because its side chain is more important in determining structure and function compared to glycine.
  102. What are the Smith-Waterman and Needleman-Wunsch algorithms?
    They are dynamic programming algorithms. For our purposes, they are used in comparing sequences. The Smith-Waterman algorithm is a modification of the Needleman-Wunsch algorithm for finding local alignments.
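    A minimal Python sketch of the Needleman-Wunsch recurrence, computing only the optimal global alignment score (no traceback). The function name and the scoring values (match, mismatch, and a linear gap cost) are illustrative assumptions, not values from the text.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score under a simple scoring scheme."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]
```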
  103. What do sequence conservations in multiple alignments identify?
    According to the text, they can identify “amino acids that are important for function or for the structural integrity of the protein fold”.
  104. What do the authors mean when they claim that Needleman–Wunsch and Smith-Waterman methods are rigorous?
    Given a scoring scheme, they are guaranteed to find the best-scoring alignments between two sequences.
  105. What programming technique are Needleman–Wunsch and Smith-Waterman methods based on?
    They are based on dynamic programming algorithms.
  106. Is BLAST a rigorous method? Explain.
    No. BLAST uses a heuristic algorithm to find similar sequences.
  107. Is SSEARCH a rigorous method? Explain.
    Yes. It’s based on the Smith-Waterman algorithm which is rigorous.
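    SSEARCH is a full search program, and the sketch below is not its implementation. It only illustrates the Smith-Waterman recurrence for the best local alignment score: cell values are never allowed to drop below zero, and the answer is the maximum over the whole matrix. The function name and scoring values are assumptions.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score under a simple scoring scheme."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh local alignment
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, score[i][j])
    return best
```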
  108. Is the default gapped setting of BLAST adequate for most applications?
    According to the text, it is suitable for most applications, but the text doesn’t elaborate.
  109. What are the differences between blastn, blastp, and blastx?
    Blastn compares a nucleotide sequence against a nucleotide database. Blastp compares a protein sequence against a protein database. Blastx takes a nucleotide sequence, translates it into six possible protein sequences based on the six possible reading frames, and then compares these sequences to a protein database.
  110. Consider Figure 4.12 (A) on page 99. Why does the caption of the figure (on page 98) claim that hits above the arrow are significant, while the ones below are not. Fully explain.
    The last column in the table is the E-value, the number of hits with a score at least this good that would be expected by chance in a database of this size; it plays a role similar to a p-value when testing for significance. The hits above the arrow have tiny E-values: 2e-90 or less. In other words, the chance that these results arose randomly is vanishingly small. The hit below the arrow has an E-value of 0.59, which is quite large as these things go. Any hits below that row will have even greater E-values, so they will be of even less significance.
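    For reference, BLAST-style E-values are usually derived from Karlin-Altschul statistics. The relation below is standard background rather than something taken from the flashcards: S is the raw alignment score, m and n are the query and database lengths, and K and λ are constants that depend on the scoring scheme.

```latex
E = K \, m \, n \, e^{-\lambda S}, \qquad
P(\text{at least one chance hit}) = 1 - e^{-E}
```

    With E = 0.59 this gives P ≈ 0.45, so a chance explanation is entirely plausible; with E = 2e-90, P is essentially equal to E.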
  111. Name two actions one could undertake to reduce the large number of hits one gets upon blasting a sequence against a database of sequences.
    One could limit the search output by E-value to reduce the number of hits. One could search just a subset of the data such as the newest sequences in the database.
  112. What are low complexity regions?
    According to the text, these are “regions with a highly biased amino acid composition, often runs of prolines or acidic amino acids.”
  113. Are low complexity regions desirable? Why?
    No. Alignments of these regions can give false matches and distract from actually significant hits.
  114. Why does it make sense to resubmit a query sequence sometime after it was found to have no match in the database?
    A lack of a hit does not mean there are no possible matches that might be found in the future. Improvements in algorithm searches may lead to matches down the road. Or it may be that additions to the database will yield a match in the future.
  115. What is the simplest method of constructing a pattern or motif?
    The simplest method is the consensus method.
  116. How does the consensus method work?
    According to the text, “… the most similar regions in a global multiple sequence alignment are used to construct a pattern. Those positions in the alignment that are all occupied by the same residue are used to define the pattern at these positions, by specifying just the allowed residues at each position.”
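    One simple reading of the consensus method can be sketched in Python: fully conserved columns keep their residue and every other column becomes a wildcard. The function name and the wildcard symbol "x" are assumptions, and real pattern syntaxes can also list several allowed residues at a position.

```python
def consensus_pattern(alignment, wildcard="x"):
    """Build a pattern from equal-length aligned sequences.

    Columns occupied by a single residue keep that residue; all other
    columns, including any column containing a gap, become the wildcard.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pattern = []
    for column in zip(*alignment):
        conserved = len(set(column)) == 1 and column[0] != "-"
        pattern.append(column[0] if conserved else wildcard)
    return "".join(pattern)
```

    For the alignment ACDEF, ACDQF, ACNEF this yields the pattern ACxxF.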
  117. What are logos?
    They are graphical representations of the conservation pattern in aligned homologous sequences.
  118. How are Logos constructed?
    They are computed using a position scoring matrix.
  119. What does the size of the letters in a logo mean?
    According to the text, “The size of the letters indicates the level of conservation…. Letters that are large and occupy the whole position represent identities in the multiple sequence alignment.”
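    In the usual information-theoretic formulation of sequence logos (background knowledge, not from the flashcards), the total height of a column is log2(alphabet size) minus the column's Shannon entropy, and each letter's height is its frequency times that total. The Python sketch below uses a hypothetical function name, ignores small-sample corrections, and assumes the column contains no gap characters.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=20):
    """Return letter heights in bits for one alignment column."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    information = math.log2(alphabet_size) - entropy
    return {residue: (n / total) * information for residue, n in counts.items()}
```

    A fully conserved DNA column such as "AAAA" (alphabet_size=4) gives A the maximum height of 2 bits, while a column with all four bases equally frequent gives every letter a height of 0.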
  120. How many genes in a human?
  121. What is the largest human gene?
    Dystrophin, 2.4 Mbp
  122. What percentage of nucleotides separates one person from another?
    0.1 percent
  123. Which human chromosome has the largest number of genes?
    Chromosome 1, with about 3,000 genes
  124. Which human chromosome has the smallest number of genes?
    The Y chromosome, with about 350 genes
  125. How would one gene code for multiple proteins?
    Alternative splicing
  126. In Genbank, what does PLN stand for?
    Plant, Fungal, Algal
  127. How many divisions does Genbank have?
  128. What does CDS stand for?
    Coding Sequence
  129. What does "rs" stand for?
    Reference SNP
  130. What are three major databases?
    GenBank, EMBL (at the EBI), and DDBJ
  131. What is a lentivirus?
    Virus with long interval between infection and symptoms
  132. When analyzing sequences with evolution in mind, is it the differences between them that we need to quantify and score?
    Yes, according to the text, these differences are summarized by “evolutionary or genetic distance.”
  133. What is the purpose of the phylogenetic tree representation?
    The purpose “is to summarize the key aspects of a reconstructed evolutionary history.”
  134. What are species trees?
    These are phylogenetic trees that show the hypothesized relationship between species based on their orthologous sequences.
  135. How are species trees constructed?
    • From the analysis of orthologous sequences
    • From morphological features used in traditional taxonomy
    • From the presence of certain restriction sites in the DNA
    • From the order of a particular set of genes in the genome
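  136. Is the evolutionary history of a set of related genes always the same as that of the species from which the genes were selected?
    No. Events such as gene duplication, gene loss, and horizontal gene transfer can make a gene tree differ from the species tree.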
  137. What is a speciation event?
    An event in which one lineage splits, producing divergent species.
  138. How is a speciation event represented in a species tree?
    A speciation event is represented by an internal branch point.
  139. What does the root represent in rooted trees?
    The root represents the last common ancestor of the species represented by the branches coming off of it.
  140. What is the major task of phylogenetic tree reconstruction?
    It is “to identify from the numerous alternatives the topology that best describes the evolution of the data.”
  141. Describe cladograms.
    The topology has meaning, but the branch lengths do not
  142. Describe additive trees.
    Branch lengths are a measure of evolutionary divergence (e.g. number of mutations per site).
  143. Describe ultrametric trees.
    An additive tree with the added property that the rate of mutation is assumed to be constant along all branches, thus allowing the measurement of actual time
  144. What is meant by bootstrap analysis?
    Bootstrap analysis builds trees from resampled versions of the original data (alignment columns drawn at random with replacement) to estimate the support for particular topological features for a given tree construction method.
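    A minimal Python sketch of the resampling step, assuming the data are equal-length aligned sequences and that columns are drawn with replacement. The names are hypothetical, and has_feature is a stand-in callable for "build a tree from this replicate and check whether it contains the branch of interest"; no tree-building method is implemented here.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picks) for row in alignment]


def bootstrap_support(alignment, has_feature, replicates=100, seed=0):
    """Percentage of bootstrap replicates for which has_feature(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        bool(has_feature(bootstrap_alignment(alignment, rng)))
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates
```

    A branch whose support comes out below a chosen cutoff would be the kind of branch that is collapsed when a condensed tree is drawn.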
  145. How can bootstrap analysis be used to construct condensed trees?
    “A condensed tree is produced by removing internal branches that are supported by less than 60% of the bootstrap trees.”
  146. What are the two conditions that would have made phylogenetic tree reconstruction from a set of homologous sequences considerably easier had they held during sequence evolution?
    • “… all the sequences evolved at a constant mutation rate for all mutations at all times”
    • “… the sequences have only diverged to a moderate degree such that no position has been subjected to more than one mutation.”
  147. Where do most mutations that are retained in DNA come from?
    Most retained mutations come from uncorrected errors during replication.
  148. What is the difference between synonymous and nonsynonymous mutations?
    Synonymous mutations do not change the amino acid, while nonsynonymous ones do.
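    The distinction can be checked mechanically with the standard genetic code. The Python sketch below is illustrative: the helper name is hypothetical, codons are written as DNA (T rather than U), and the compact table string lists amino acids in TCAG order with "*" marking stop codons.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_mutation(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same residue, else 'nonsynonymous'."""
    same = CODON_TABLE[codon_before] == CODON_TABLE[codon_after]
    return "synonymous" if same else "nonsynonymous"
```

    For example, CTT to CTC leaves leucine unchanged (synonymous), while GAA to GTA replaces glutamate with valine (nonsynonymous).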
  149. When is it useful to remove the third codon sites from the data before any further analysis?
    It may be useful when “the dataset involves long evolutionary timescales….”
  150. What is the key assumption that is made when constructing a phylogenetic tree from a set of sequences?
    The sequences “are all derived from a single ancestral sequence.”
  151. Explain the process of gene loss. Does gene loss occur solely because of gene duplication?
    • Sometimes, after gene duplication, one of the copies loses its function due to a mutation and becomes a pseudogene. Additional mutations can make it unrecognizable, resulting in gene loss.
    • According to the text, “gene loss can occur without gene duplication.”
  152. What is meant by homoplasy?
    “Sequence similarity not due to homology”
  153. What is meant by “horizontal gene transfer” (also known as “lateral gene transfer”)? Why is it called “horizontal”?
    • It is the transfer of genes between organisms not by reproduction (Wikipedia)
    • It is called “horizontal” because it does not occur from parent to offspring (vertical transfer).
  154. What are syntenic regions? Are they easily detected? Explain.
    • “equivalent regions in different species”, i.e. “regions containing related genes in the same order”
    • Apparently they are not as easily detected as researchers had hoped, because large-scale rearrangements of chromosomes and genomes occur, shuffling the location of the related genes.
  155. When comparing sequences from two closely related species, which regions will convey useful information for the construction of phylogenetic trees?
    • “The ideal is a genomic region that occurs in every species but only occurs once in the genome.”
    • “There should be little if any HGT within this region.”
    • “The rate of change in this sequence segment must be fast enough to distinguish between closely related species, but not so fast that the regions from very distantly related species cannot be confidently aligned.”
    • The three requirements above can often be satisfied “… by a single sequence that has some highly conserved regions and other regions that are more variable between species.”
  156. The analysis of which genomic sequence led to the discovery that prokaryotes comprised two quite distinct domains?
    “The DNA sequence specifying the small ribosomal subunit rRNA (called 16S RNA in prokaryotes)….”
  157. What is phylogeny?
    The history of descent of a group of organisms from a common ancestor. The inference of evolutionary relationships.
  158. What is taxonomy?
    The science of classification of organisms.
  159. What is the aim of phylogenetic analysis?
    To discover all of the branching relationships in the tree and the branch lengths.
  160. Why are phylogenetic trees constructed?
    • To understand lineage.
    • To understand how functions evolved.
    • To perform multiple alignment.
  161. What do internal nodes represent in a phylogenetic tree?
    Hypothetical ancestral units.
  162. In a ___ tree, the path from root to a node represents ___.
    rooted, an evolutionary path
  163. A(n) ___ specifies ___, but not evolutionary paths.
    unrooted tree, relationships among objects
  164. All objects in a ___ have a single common ancestor.
  165. What might tree construction be based on?
    morphological features or sequence data.
  166. Describe a cladogram.
    Branch length carries no meaning.
  167. Describe an additive tree.
    Branch length measures evolutionary divergence.
  168. Describe an ultrametric tree.
    • An additive tree where there is a constant rate of mutation.
    • Horizontal lines are not important.
  169. What is significant about an additive tree with an outgroup?
    An outgroup can be used to convert an unrooted tree to a rooted tree.
  170. How is a tree rooted?
    Place the candidate root halfway between the outgroup and the closest node.
  171. What is UPGMA?
    Distance-based method - Unweighted Pair Group Method using Arithmetic averages
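    A compact Python sketch of UPGMA clustering, under stated assumptions: the input is a list of leaf names plus a dictionary mapping frozenset({a, b}) to the distance between leaves a and b, and the output is a nested tuple together with the height of each merge. The names and data layout are hypothetical choices for illustration.

```python
def upgma(names, distances):
    """Cluster leaves with UPGMA; return (nested tuple tree, merge heights)."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            dist[frozenset((i, j))] = distances[frozenset((names[i], names[j]))]
    next_id = len(names)
    heights = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        a, b = sorted(pair)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        heights.append(dist[pair] / 2)  # the new node sits halfway between them
        # Distance from the merged cluster to each remaining cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean.
        merged = {}
        for other in clusters:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            merged[other] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        for other, d in merged.items():
            dist[frozenset((next_id, other))] = d
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree, heights
```

    With distances A-B = 2, A-C = 6, and B-C = 6, the sketch first joins A and B at height 1.0 and then joins that pair with C at height 3.0.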
  172. UPGMA method results in what kind of tree?
    A rooted, ultrametric tree