CS 4/504, Fall 2001 Computational Biology

Homework 5, Due: 10/11/01

 

 

In Homework 1, we found a human endogenous retrovirus, accession number AF087913. For this exercise, we will find a putative protein in this sequence, then search for similar proteins using both an Entrez search and a Blast search.

 

  1. Use Entrez (http://www.ncbi.nlm.nih.gov/Entrez/index.html) to search for AF087913 (remember, this is a nucleotide sequence). Find the putative reverse transcriptase pseudogene in AF087913. How many A’s, C’s, G’s, and T’s are in this pseudogene (just to show you’ve found the record)?

 

Ans: BASE COUNT      430 a    313 c    287 g    385 t
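As a quick sanity check on a BASE COUNT line, the same tally can be reproduced from the raw sequence. The following is a minimal sketch in Python; the function name base_count and the short toy sequence are made up for illustration and are not taken from the GenBank record.

```python
from collections import Counter

def base_count(sequence):
    """Return counts of a, c, g, t in a nucleotide string (case-insensitive)."""
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in "acgt"}

# Hypothetical toy sequence, NOT taken from AF087913.
toy = "ACGTTGCAAT"
print(base_count(toy))  # {'a': 3, 'c': 2, 'g': 2, 't': 3}

# The four counts reported in the answer sum to the pseudogene length.
reported = {"a": 430, "c": 313, "g": 287, "t": 385}
print(sum(reported.values()))  # 1415
```

Applied to the pseudogene's actual sequence, the same function should reproduce the four numbers in the answer.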

 

  2. This gene is a putative gene from humans that codes for the reverse transcriptase protein. Use Entrez to search for other such proteins: go back to Entrez and search for a protein using the fields “human” and “reverse transcriptase” (you can enter both in the search field, keep “reverse transcriptase” in quotes, and be sure to look for proteins). How many items does this find? How many pages of data is that?

 

Ans: 10,794; 540 pages (on 30 October)
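The page count follows from the item count by rounding up. The sketch below assumes 20 items per page, which is an assumption (it is the value that makes 10,794 items come out to 540 pages); the function name page_count is hypothetical.

```python
import math

def page_count(total_items, items_per_page=20):
    """Number of result pages, rounding a partial last page up.

    items_per_page=20 is an assumed display setting, not stated in the handout.
    """
    return math.ceil(total_items / items_per_page)

print(page_count(10794))  # 540
```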

 

  3. The reverse transcriptase protein copies RNA back into DNA (reverse transcription). These sequences are mostly from viruses, often HIV. Why might a virus need reverse transcriptase? Why are these viral sequences in the human genome? (For fun, click on the ORGANISM link and follow the taxonomic chain back up to the Retroviridae, to see how many other similar viruses there are.)

 

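Ans: Because it is a retrovirus, this virus carries its genome as RNA rather than DNA. It needs reverse transcriptase to copy that RNA genome into DNA, which is then inserted into the host genome. That insertion is also why the viral sequences appear in the human genome, so this answers both questions.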

 

One of the proteins from question 2 had accession number AAL05380. Perform the following steps to prepare for the next few questions:

    1. Display this sequence in FASTA format.
    2. Copy the sequence (one way: highlight the sequence, then right click and select “copy”).
    3. Go to the BLAST page (http://www.ncbi.nlm.nih.gov/BLAST) and select blastp (for protein searches).
    4. Paste your sequence (right click and select “paste”) into the search field and press the BLAST! button (use the “nr” database).
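For reference, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below shows one way to split such a record into its header and sequence in Python. The function name parse_fasta is hypothetical, and the example record is made up; it is not the AAL05380 sequence.

```python
def parse_fasta(text):
    """Parse a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with a '>' header line")
    header = lines[0][1:]            # drop the leading '>'
    sequence = "".join(lines[1:])    # join wrapped sequence lines
    return header, sequence

# Made-up record for illustration only (NOT the AAL05380 sequence).
example = """>example_id hypothetical protein
MKTAYIAKQR
QISFVKSHFS
"""
header, seq = parse_fasta(example)
print(header)         # example_id hypothetical protein
print(seq, len(seq))  # MKTAYIAKQRQISFVKSHFS 20
```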

 

  4. The blue box labeled “rvt” is a conserved region in AAL05380. Move your cursor over this box. The box above reports the score and E value for this conserved domain; the E value reflects how likely it is that a match this good would occur in the database by chance. What are the score and E value for it?

 

Ans: 141 and 4e-35.
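To see why a high score goes with a tiny E value, recall the standard relationship between a bit score S and the expected number of chance hits: E is approximately m * n * 2^(-S), where m is the query length and n is the database length. The sketch below is a simplified illustration only. It ignores the effective-length corrections that BLAST applies, the function name expected_chance_hits is hypothetical, and the lengths used are made-up values, so it will not reproduce the 4e-35 above.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S).

    Simplified: real BLAST uses effective (edge-corrected) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# Made-up lengths, for illustration only.
m, n = 100, 1024
print(expected_chance_hits(10, m, n))   # 100.0
print(expected_chance_hits(20, m, n) < expected_chance_hits(10, m, n))  # True
```

Each additional bit of score halves the expected number of chance matches, which is why scores in the hundreds correspond to vanishingly small E values.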

 

  5. Click on Format! to see the sequences from the nr database. How many sequences does blastp find? How many characters?

 

Ans: 791,492 sequences; 251,575,206 total letters

 

  6. How do you reconcile the count from question 5 with that from question 2?

 

Ans: Many sequences are similar to our reverse transcriptase sequence but are either not actually the same sort of gene (the match is spurious) or not annotated as the same sort of gene (the annotation is suspect).

 

  7. The sequences returned are sorted from lowest E value to highest, and the alignment shown is the best local alignment of the query sequence with the database sequence. An “identity” is an exact amino acid match, and a “positive” is a match with a chemically similar residue. Consider the match with the sequence with accession number AF286365. For this sequence, compared to our query sequence, what are the score, the E-value, the percentage identities, and the percentage positives? Why does this score higher than the alignment of our query sequence with itself?

 

Ans: Score =  414 bits (1065), Expect = e-115,  Identities = 195/206 (94%), Positives = 202/206 (97%)
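The identity and positive counts can be illustrated with a toy alignment. In the sketch below, identical residues count as both identities and positives, and a mismatched pair counts as a positive only if it scores above zero in the substitution matrix. The function name identities_and_positives is hypothetical, the two aligned strings are made up, and POSITIVE_PAIRS is only a small hand-copied excerpt of positive-scoring BLOSUM62 pairs, so on real alignments it would undercount positives.

```python
# Partial excerpt of BLOSUM62 pairs with a positive score (NOT the full matrix).
POSITIVE_PAIRS = {
    frozenset(pair)
    for pair in [("I", "V"), ("I", "L"), ("L", "M"), ("K", "R"),
                 ("D", "E"), ("F", "Y"), ("S", "T"), ("E", "Q")]
}

def identities_and_positives(query, subject):
    """Count identities and positives over two equal-length aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue  # a gap column is neither an identity nor a positive
        if q == s:
            identities += 1
            positives += 1  # identical residues also count as positives
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            positives += 1
    return identities, positives, len(query)

# Made-up aligned segments, for illustration only.
print(identities_and_positives("MKVLDEFS", "MRILDQYW"))  # (3, 7, 8)
```

Read in the same "count/alignment length" form as the BLAST output above, the toy result would be Identities = 3/8 and Positives = 7/8.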