CS 4/504, Fall 2001 Computational Biology
Homework 6, Due: 11/15/01
- Use Entrez to search for AAL05380, and display the sequence in FASTA format.
Paste this into the search box for a PSI-BLAST search using the nr
database (see http://www.ncbi.nlm.nih.gov/blast).
How many of the sequences were new on this iteration? What are the score
and E-value of the top-scoring sequence?
Ans: All of them. This is an iterative algorithm, and this is the first
iteration, so everything returned was new! The top E-value (on 1 Dec 2001
around 17.00) was e-115 for AF286365.
- Run a second iteration of PSI-BLAST. You will have to go back to the original
PSI-BLAST window (not the results window) and press the “FORMAT!” button
again. This will change the results window. How many of the top 10 scoring
sequences were added on this iteration (iteration 2)? Has the top score
and E-value changed?
Ans: One (AF042100) was new.
The new top E-value is e-119, better than before even though the top-scoring
sequence (AF286365) is the same as last time. This is because the second
iteration used more information (the results of the first iteration), so a
match this good is less likely to arise by chance than it was on the first
iteration, which is what the lower E-value means.
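As a rough illustration of why a higher alignment score means a lower E-value, here is a minimal sketch of the usual Karlin-Altschul estimate E = K * m * n * exp(-lambda * S). The values of lambda and K, the sequence and database lengths, and both raw scores are assumptions chosen only for illustration; they are not the numbers NCBI used for this search.

```python
import math

def expected_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative defaults (assumed), not values taken
    from the actual search.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical raw scores for the same hit on two iterations.
e_first = expected_value(raw_score=1000, query_len=300, db_len=2e8)
e_second = expected_value(raw_score=1035, query_len=300, db_len=2e8)

# A modest gain in raw score shrinks E by several orders of magnitude.
print(f"first: {e_first:.1e}  second: {e_second:.1e}")
```

Because the score sits in the exponent, even a small improvement from the refined scoring matrix can move the E-value by several orders of magnitude.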
- How many of the top 10 were added on iteration 3? What was the top score and
E-value? How many of the top 10 were added on iteration 4? What was the
top score and E-value?
Ans: On iterations 3 and 4, none were new, and the top E-value was again e-119
for AF286365 both times. It looks like the algorithm has converged, and we’re
unlikely to find anything new.
- Look up how PSI-BLAST works. What do your observations tell you about how
quickly this particular iterative algorithm converges?
Ans: PSI-BLAST builds a profile from the top results of one search, and then
uses the profile to search for new sequences. The profile is essentially a
position-specific scoring matrix: a separate set of residue scores for each
position. Using this new scoring information, the next iteration matches
proteins in the database against the profile. The sequences returned may or may
not include some that were not returned on the prior iteration, and some
sequences may score higher than before, given the new scoring matrix. Using
these new results, PSI-BLAST computes a new scoring matrix to be used for the
next iteration. And so on. Once the top scoring sequences stop changing, the
scoring matrix also stops changing, so the algorithm has converged. This
happens very quickly in most cases (as in this one).
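The loop described above can be sketched in a few lines. This is a toy, not PSI-BLAST itself: it assumes ungapped sequences of equal length, a uniform background distribution, a fixed score threshold, and a profile built from the query alone on the first pass (the real program starts from a standard substitution matrix and uses E-value cutoffs). All names and the tiny database are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seqs, pseudocount=1.0):
    """Per-position log-odds scores from equal-length, ungapped sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def score(profile, seq):
    """Sum the position-specific scores along the sequence."""
    return sum(col[aa] for col, aa in zip(profile, seq))

def iterate_search(query, database, threshold, max_iters=10):
    """Rebuild the profile from the current hits until the hit set is stable."""
    hits = {query}
    added_per_iteration = []
    for _ in range(max_iters):
        profile = build_profile(sorted(hits))
        new_hits = {s for s in database if score(profile, s) >= threshold}
        new_hits.add(query)
        added_per_iteration.append(new_hits - hits)
        if new_hits == hits:  # nothing changed, so the profile is fixed too
            break
        hits = new_hits
    return hits, added_per_iteration

# Hypothetical four-sequence database.
database = ["ACDE", "ACDF", "ACGF", "WWWW"]
hits, added = iterate_search("ACDE", database, threshold=2.5)
print(sorted(hits), [sorted(a) for a in added])
```

In this toy run the more distant sequence ACGF only clears the threshold after ACDF has been folded into the profile, and the following pass adds nothing, which mirrors the quick convergence seen in the homework.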
- Go to an online ClustalW webpage (for example
http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html), or you
could install ClustalW on your own machine and run it <grin>. Enter the top two
and the bottom two scoring sequences from a Blast on AAL05380. (You may have
to re-do the Blast from question 1, and you will certainly have to do a lot
of scrolling and cutting and pasting.) Answer the following questions:
- Annoying, isn’t it? Imagine doing this with 100 or 200 sequences. Do you see
why biologists want better tools?
- Describe the alignments you see (just enough so that I know you actually did
the exercise). Be sure to try the JALViewer. If it works for you, it’s MUCH
easier.
- Can you find the guide tree for these four sequences? If so, what does it
look like?
Ans:
- Yes indeed. Damned annoying. That’s why we have grad students do it!
- The top 2 and bottom 2 were AF286365, AF156820, AY032069, and AF077704
(respectively, on 1 December 2001 at 17.43). The first two of these
sequences were very long, and the last two very short. The alignment has
long gaps at the beginning and ends, with a moderately long gap at the
beginning of the second sequence as well. It would appear the middle of the
long sequence is the highly conserved part.
- Nope. I can’t find it. Remember, ClustalW first builds a guide tree, then
uses that to build the MSA. I know that if you install ClustalW yourself, you
can get this guide tree. But I couldn’t find it on this online service.
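To give a feel for what a guide tree over four sequences looks like, here is a minimal sketch that clusters a pairwise distance table with UPGMA and prints the result in Newick-style notation. UPGMA is used only because it is short; it is a stand-in, not ClustalW's own method (ClustalW uses neighbor joining). The labels and distances are made up for illustration and are not computed from the actual sequences.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a Newick-style tree string by UPGMA.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    """
    clusters = {name: 1 for name in names}  # cluster label -> leaf count
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"

# Hypothetical labels and distances: two long and two short sequences.
names = ["long_1", "long_2", "short_1", "short_2"]
dist = {frozenset(pair): 0.8 for pair in combinations(names, 2)}
dist[frozenset(("long_1", "long_2"))] = 0.1
dist[frozenset(("short_1", "short_2"))] = 0.2

tree = upgma(names, dist)
print(tree)  # ((long_1,long_2),(short_1,short_2));
```

With distances like these, the two long sequences pair up first, then the two short ones, and the two pairs join last; a progressive aligner would align the sequences in that order.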