CS 4/504, Fall 2001 Computational Biology

Homework 7, Due: 12/6/01

 

 

1)      Use Entrez to search for AAL05380, and display the sequence in FASTA format. Paste this into the search box for a PSI-BLAST search using the nr database (see http://www.ncbi.nlm.nih.gov/blast). How many of the sequences were new on this iteration? What is the score and E-value of the top scoring sequence?

 

Ans: Oops. This was accidentally copied from HW 6.

 

2)      Here is a multiple sequence alignment (question thanks to William Grundy’s course, COMS W4761 Computational Genomics, at Columbia):

DAKYTKQGKYV

DS---KIGKIY

AE--TKTAKIP

A-------EKA

FT---DT-EDP

a)      Using the heuristic we discussed in class, how many match states would a profile HMM for this alignment have? Draw a picture of it.

b)      Using this alignment, write down the following. Show your work, and use a pseudocount of 1.

i)        the emission probabilities from the first match state,

ii)       the transition probabilities out of the first match state,

iii)     the transition probabilities out of the second insert state, and

iv)     the transition probabilities out of the third delete state.

 

Ans:

1)      8, one for each column with fewer than half deletions. The heavy box and lines are the parameters we provide in part b).

2)      We label the matching states B, M1, M2, ..., M8, E, I0, I1, I2, ..., I8, D1, D2, ... D8.

i)        eM1(A)=eM1(D)=3/25; eM1(F)=2/25; eM1(all others)=1/25. This is because there are 2 A’ s, 2 D’s and 1 F in column one (I’m using “the first” matching state to be the one for the first column, ignoring the begin state). The denominator is 5 (2+2+1) plus a pseudocount of 1 for each possible amino acid.

ii)      aM1,M2=5/8; aM1,D1=2/8; aM1,I1=1/8. There are four matches and one deletion following characters in column one. The denumerator is the number of transitions (4+1) plus the pseudocount of 3 (one for each edge from M1).

iii)    There is some ambiguity as to whether “the second insert state” is I1 or I2. I’ll provide answers for both. aI1,I1=aI1,D2=aI1,M2=1/3 since there are no transitions from an insertion after column one (the denumerator is the pseudocount total of 3, one for each edge from I1). aI2,I2=3/7; aI2,D3=1/7; aI2,M3=3/7. There are two transitions from I2 to an insertion and two from I2 to a match, the denominator is those counts plus a total pseudocount of 3. 

     To see where the numbers come from in the second case (which is the interesting one), run the Baum-Welch estimation algorithm by aligning the sequences to the HMM and counting the number of times a transition is used.

·        First, initialize all counters to one. In particular, set AI2,I2=AI2,D3= AI2,M3=1.

·        For the first sequence, there is a transition from a match (for A) to an insertion (for K). This adds one to AM2,I2 (which we don’t care about for this question). Then the Y is emitted from I2, which adds one to AI2,I2. Then the T is emitted from I2, which adds one to AI2,I2. Then we transition to the match state M3 to emit K, adding one to AI2,M3.

·        The second, fourth, and fifth sequences are transitions from M2 to M3 directly, and so increment AM2,M3 (which we don’t care about for this question).

·        The third sequence makes a transition to the insertion state in order to emit the T, adding  one to AM2,I2 (which we don’t care about for this question). Then it transitions to M3 in order to emit K, adding one to AI2,M3.

·        At the end of this, AI2,I2=3, AI2,D3=1, and AI2,M3=3. The sum of these is 7, which is where the values in the answer come from (see the equations on page 107).

iv)    aD3,D4=2/4; aD3,M4=1/4; aD3,I3=1/4. There is one transition to a deletion in column 3, and the denumerator is that one transition plus a total pseudocount of 3.

 

3)      Find the protein sequence for AAL05380 in Genbank, using Entrez (question thanks to Kevin Karplus’ course BME 100 at UC Santa Cruz). Now, run the script at the following URL on your sequence:  http://www.soe.ucsc.edu/research/compbio/HMM-apps/T99-query.html

This script will run the SAM T99 algorithm in two phases: in the model library template search phase, it takes your sequence and scores it against a collection of hidden Markov models, one model for each sequence in a select, representative subset of the PDB (Protein Data Base) database.

In the target search phase the following occurs: The script builds an HMM from your sequence, and then scores every sequence in PDB against this one HMM. Actually, the process is a bit more involved than that. First it finds some close relatives of your query sequence in the nr database using a version of BLAST called WU-BLAST, then it uses these to build an HMM. Then it uses this HMM to search the database again and finds some more sequences. Four iterations of search and re-alignment are done to create a final HMM. It uses this final HMM to score all sequences in the database you specified on the web page (in this case PDB), and reports the good hits. It learns more about the family of proteins you are searching for in each of the iterations. PSI-BLAST works in a similar fashion, but uses BLAST to do the searches and alignment, rather than HMMs.

 

Print out and turn in the “Database hits and scores” and the “alignment 1” from the search results (which will come back in email).

 

Ans: Here is a partial copy of my results. First, the “Database hits and scores”:

Text Box: The E-value is an estimate of approximately how many sequences would score this well by chance in the database searched.  <snip> 

Database: PDB (updated Thu Oct 25 12:41:18 2001)
Score method: average

Summary of Database Hits:
                  Average               Target               Template     
SeqID       Reverse     EValue    Reverse     EValue    Reverse     EValue  SeqLabel
1c0tA          -300          0       -300   1.5e-126  ---------  ---------  :mol:protein-het length:560     Hiv-1 Reverse Transcriptase (A-Chain)

1c0tB          -300          0       -300   1.5e-126  ---------  ---------  :mol:protein length:440     Hiv-1 Reverse Transcriptase (B-Chain)

1c0uA          -300          0       -300   1.5e-126  ---------  ---------  :mol:protein-het length:560     Hiv-1 Reverse Transcriptase (A-Chain)

1c0uB          -300          0       -300   1.5e-126  ---------  ---------  :mol:protein length:440     Hiv-1 Reverse Transcriptase (B-Chain)

1c1bA          -300          0       -300   1.5e-126  ---------  ---------  :mol:protein-het length:560     Hiv-1 Reverse Transcriptase (A-Chain)

1c1cA          -300          0       -300   1.5e-126  ---------  ---------  :mol:protein-het length:560     Hiv-1 Reverse Transcriptase (A-Chain)

<snip>

 


Text Box: Alignment 1:
Seq 1: gi|15723673|gb|AAL05380.1|
Seq 2: 1c0tA
   1 ..................................VGICSEMGKGGKISKIGPENPYNTPVFAIKK
   2 pispietvpvklkpgmdgpkvkqwplteekikalVEICTEMEKEGKISKIGPENPYNTPVFAIKK


   1 KDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKEKSVTVLDVGDAYFSVPLDKDFRKYTAF
   2 KDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF


   1 TIPSTNNETPGIRYQYNVLPQGWKGSPAIFQSSMLKILEPFRKQNPDIVIYQYVDDLYVGSDLEI
   2 TIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYMDDLYVGSDLEI


   1 GQHRTKIEELREHLLRWGFTTPDKKHQKEPPFLWMGYELHPDKWT....................
   2 GQHRTKIEELRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTvqpivlpekdswtvndiqkl


   1 .................................................................
   2 vgklnwasqiypgikvrqlxkllrgtkalteviplteeaelelaenreilkepvhgvyydpskdl


   1 .................................................................
   2 iaeiqkqgqgqwtyqiyqepfknlktgkyarmrgahtndvkqlteavqkittesiviwgktpkfk


   1 .................................................................
   2 lpiqketwetwwteywqatwipewefvntpplvklwyqlekepivgaetfyvdgaanretklgka


   1 .................................................................
   2 gyvtnrgrqkvvtltdttnqktelqaiylalqdsglevnivtdsqyalgiiqaqpdqseselvnq


   1 .........................................
   2 iieqlikkekvylawvpahkgiggneqvdklvsagirkvl.
And now “Alignment 1”: