CS 4/504, Fall 2001 Computational Biology
Homework 5, Due: 10/11/01
In Homework 1, we found a human endogenous retrovirus,
accession number AF087913. For this exercise, we will find a putative protein
in this sequence, then search for similar proteins using both an Entrez search
and a BLAST search.
- Use Entrez (http://www.ncbi.nlm.nih.gov/Entrez/index.html) to search for
AF087913 (remember, this is a nucleotide record). Find the putative reverse
transcriptase pseudogene in AF087913. How many A’s, C’s, G’s, and T’s are
in this pseudogene (just to show you’ve found the record)?
Ans: BASE COUNT 430 a 313 c 287 g 385 t
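A base count like this one can be reproduced from any nucleotide sequence with a few lines of Python. The sketch below is only an illustration: `base_counts` is a hypothetical helper name, and the toy sequence is invented, not taken from AF087913. The last lines simply total the counts reported in the answer.

```python
from collections import Counter

def base_counts(seq):
    """Count a, c, g and t in a nucleotide string, ignoring case and whitespace."""
    counts = Counter(seq.lower())
    return {base: counts.get(base, 0) for base in "acgt"}

# Toy sequence for illustration only; this is NOT the AF087913 pseudogene.
toy = "ACGTTGCA acgt"
print(base_counts(toy))

# Sum of the counts reported in the answer above.
reported = {"a": 430, "c": 313, "g": 287, "t": 385}
print(sum(reported.values()))
```

Pasting the real sequence in place of `toy` should give the same four numbers as the BASE COUNT line of the record.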
- This gene is a putative human gene that codes for the reverse
transcriptase protein. Use Entrez to search for other such proteins: go
back to Entrez and search for a protein using the terms “human” and
“reverse transcriptase” (you can enter both in the search field; keep
“reverse transcriptase” in quotes, and be sure to look for proteins). How
many items does this find? How many pages of data is that?
Ans: 10,794; 540 pages (on 30 October)
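The page count follows from the item count by a ceiling division. The sketch below assumes 20 items per page, which is an assumption (the page size is not stated in the question) chosen because it is consistent with the answer; `page_count` is a hypothetical helper name.

```python
import math

def page_count(total_items, per_page=20):
    """Number of result pages needed to show total_items (per_page is assumed)."""
    return math.ceil(total_items / per_page)

print(page_count(10794))  # 540
```

With a different display setting the number of pages changes, but the number of items does not.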
- The reverse transcriptase protein copies RNA back into DNA (reverse
transcription). These sequences are mostly from viruses, often HIV. Why
might a virus need reverse transcriptase? Why are these viral sequences in
the human genome? (For fun, click on the ORGANISM link and follow the
taxonomic chain back up to the Retroviridae, to see how many other similar
viruses there are.)
Ans: A retrovirus carries its genome as RNA rather than DNA, so it needs
reverse transcriptase to copy that RNA into DNA, which is then inserted into
the host genome. That answers both questions.
One of the proteins from question
2 had accession number AAL05380. Perform the following steps to prepare for the
next few questions:
- Display this sequence in FASTA format.
- Copy the sequence (one way: highlight the sequence, then right click and
select “copy”).
- Go to the BLAST page (http://www.ncbi.nlm.nih.gov/BLAST) and select blastp
(for protein searches).
- Paste your sequence (right click and select “paste”) into the search
field and press the BLAST! button. (Use the “nr” database.)
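For those who would rather handle the FASTA text in a script than by copy and paste, the steps above can be sketched as follows. This is a minimal sketch, not part of the assignment: `parse_fasta` is a hypothetical helper, and the example record (header and residues) is invented for illustration and is not the AAL05380 sequence.

```python
def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA-formatted string."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; keep only the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line found")
    # Join the wrapped sequence lines into one string, ready to paste into BLAST.
    return header, "".join(chunks)

# Placeholder record: header and residues are made up, NOT the real AAL05380 entry.
example = """>example_protein hypothetical record
MKVLAAGIVG
LLTRPQ
"""
header, seq = parse_fasta(example)
print(header)
print(seq, len(seq))
```

The BLAST search field accepts either the full FASTA text or the bare sequence, so the joined string is only a convenience.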
- The blue box labeled “rvt” is a conserved region in AAL05380. Move your
cursor over this box. The box above reports the score and E value for this
conserved domain; the E value reflects how likely a match this good would
be to turn up in the database by chance. What are the score and E value
for it?
Ans: 141 and 4e-35.
- Click on Format! to see the sequences from the nr database. How many
sequences does blastp find? How many characters?
Ans: 791,492 sequences; 251,575,206 total letters
- How do you reconcile the count from question 5 with that from question 2?
Ans: There are many sequences
similar to our reverse transcriptase sequence which are either not actually the
same sort of gene (the match is spurious), or which are not annotated to be the
same sort of gene (the annotation is suspect).
- The sequences returned are sorted from lowest E value to highest, and the
alignment shown is the best local alignment of the query sequence with the
database sequence. An “identity” is an exact amino acid match, and a
“positive” is a match with a chemically similar residue. Find the match
with the sequence with accession number AF286365. For this sequence,
compared to our query sequence, what are: the score, the E value, the
percentage identities, and the percentage positives? Why does this score
higher than the alignment of our query sequence with itself?
Ans: Score = 414 bits (1065), Expect = e-115, Identities = 195/206 (94%),
Positives = 202/206 (97%)
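To see how a bit score and an E value relate, one commonly quoted form of the BLAST statistics is E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the number of letters in the database. The sketch below plugs in the bit score and database size from the answers above. The query length of 206 is an assumption (it is only the aligned length shown in the answer), and BLAST also applies effective-length corrections that are ignored here, so the result should only be read as an order-of-magnitude check. `log10_expect` is a hypothetical helper name.

```python
import math

def log10_expect(bit_score, query_len, db_letters):
    """log10 of E = m * n * 2**(-S'), computed in logs to avoid underflow."""
    return (math.log10(query_len)
            + math.log10(db_letters)
            - bit_score * math.log10(2))

# Bit score (414) and database size (251,575,206 letters) are from the answers.
# query_len=206 is an ASSUMPTION: the aligned length, not the true query length.
value = log10_expect(414, 206, 251_575_206)
print(round(value, 1))
```

The estimate should land within an order of magnitude or two of the reported Expect value, which is about as close as this simplified formula can be expected to get.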