1) Use Entrez to search for AAL05380, and display the sequence in FASTA format. Paste this into the search box for a PSI-BLAST search using the nr database (see http://www.ncbi.nlm.nih.gov/blast). How many of the sequences were new on this iteration? What is the score and E-value of the top scoring sequence?
Ans: Oops. This
was accidentally copied from HW 6.
2) Here is a multiple sequence alignment (question thanks to William Grundy’s course, COMS W4761 Computational Genomics, at Columbia):
DAKYTKQGKYV
DS---KIGKIY
AE--TKTAKIP
A-------EKA
FT---DT-EDP
a) Using the heuristic we discussed in class, how many match states would a profile HMM for this alignment have? Draw a picture of it.
b) Using this alignment, write down the following. Show your work, and use a pseudocount of 1.
i) the emission probabilities from the first match state,
ii) the transition probabilities out of the first match state,
iii) the transition probabilities out of the second insert state, and
iv) the transition probabilities out of the third delete state.
Ans:
1)
8, one for each column with fewer than half deletions.
The heavy box and lines are the parameters we provide in part b).

2)
We label the matching states B, M1, M2, ..., M8, E, I0,
I1, I2, ..., I8, D1, D2, ... D8.
i)
eM1(A)=eM1(D)=3/25; eM1(F)=2/25;
eM1(all others)=1/25. This is because there are 2 A’ s, 2 D’s and 1
F in column one (I’m using “the first” matching state to be the one for the
first column, ignoring the begin state). The denominator is 5 (2+2+1) plus a
pseudocount of 1 for each possible amino acid.
ii)
aM1,M2=5/8; aM1,D1=2/8; aM1,I1=1/8.
There are four matches and one deletion following characters in column one. The
denumerator is the number of transitions (4+1) plus the pseudocount of 3 (one
for each edge from M1).
iii)
There is some ambiguity as to whether “the second
insert state” is I1 or I2. I’ll provide answers for both. aI1,I1=aI1,D2=aI1,M2=1/3
since there are no transitions from an insertion after column one (the
denumerator is the pseudocount total of 3, one for each edge from I1). aI2,I2=3/7;
aI2,D3=1/7; aI2,M3=3/7. There are two transitions from I2
to an insertion and two from I2 to a match, the denominator is those counts
plus a total pseudocount of 3.
To see where the numbers come from in the
second case (which is the interesting one), run the Baum-Welch estimation
algorithm by aligning the sequences to the HMM and counting the number of times
a transition is used.
·
First, initialize all counters to one. In
particular, set AI2,I2=AI2,D3= AI2,M3=1.
·
For the first sequence, there is a transition from a
match (for A) to an insertion (for K). This adds one to AM2,I2 (which
we don’t care about for this question). Then the Y is emitted from I2, which
adds one to AI2,I2. Then the T is emitted from I2, which adds one to
AI2,I2. Then we transition to the match state M3 to emit K, adding
one to AI2,M3.
·
The second, fourth, and fifth sequences are transitions
from M2 to M3 directly, and so increment AM2,M3 (which we don’t care
about for this question).
·
The third sequence makes a transition to the
insertion state in order to emit the T, adding
one to AM2,I2 (which we don’t care about for this question).
Then it transitions to M3 in order to emit K, adding one to AI2,M3.
·
At the end of this, AI2,I2=3, AI2,D3=1,
and AI2,M3=3. The sum of these is 7, which is where the values in
the answer come from (see the equations on page 107).
iv)
aD3,D4=2/4; aD3,M4=1/4; aD3,I3=1/4.
There is one transition to a deletion in column 3, and the denumerator is that
one transition plus a total pseudocount of 3.
3) Find the protein sequence for AAL05380 in Genbank, using Entrez (question thanks to Kevin Karplus’ course BME 100 at UC Santa Cruz). Now, run the script at the following URL on your sequence: http://www.soe.ucsc.edu/research/compbio/HMM-apps/T99-query.html
This script will run the SAM T99 algorithm in two phases: in the model library template search phase, it takes your sequence and scores it against a collection of hidden Markov models, one model for each sequence in a select, representative subset of the PDB (Protein Data Base) database.
In the target search phase the following occurs: The script builds an HMM from your sequence, and then scores every sequence in PDB against this one HMM. Actually, the process is a bit more involved than that. First it finds some close relatives of your query sequence in the nr database using a version of BLAST called WU-BLAST, then it uses these to build an HMM. Then it uses this HMM to search the database again and finds some more sequences. Four iterations of search and re-alignment are done to create a final HMM. It uses this final HMM to score all sequences in the database you specified on the web page (in this case PDB), and reports the good hits. It learns more about the family of proteins you are searching for in each of the iterations. PSI-BLAST works in a similar fashion, but uses BLAST to do the searches and alignment, rather than HMMs.
Print out and turn in the
“Database hits and scores” and the “alignment 1” from the search results (which
will come back in email).
Ans: Here is a partial copy of my
results. First, the “Database hits and scores”:

And now “Alignment 1”: