The GRAIL Gene Finding Program

The Knuth-Morris-Pratt Algorithm

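The Knuth-Morris-Pratt algorithm builds a failure function for the pattern: for each prefix of the pattern, it records the length of the longest proper prefix that is also a suffix of that prefix. After a mismatch, this tells the algorithm how much of the pattern is already known to match, so no text character has to be re-read. Consider again the pattern p = AAATA.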

Consider again searching in the text t=AATAAAATA:

  1. AATAAAATA

  2. AATAAAATA

  3. AATAAAATA

  4. AATAAAATA

  5. AATAAAATA

  6. AATAAAATA

  7. AATAAAATA

  8. AATAAAATA

  9. AATAAAATA

  10. AATAAAATA

The Knuth-Morris-Pratt algorithm requires around n+m operations in the worst case, where m is the length of p and n is the length of t.
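The failure function and the search can be sketched in Python as follows. This is a minimal illustration, not the code behind the slides; the function names `failure_function` and `kmp_search` are chosen here for the example, and positions are 0-based.

```python
def failure_function(p):
    """fail[j] is the length of the longest proper prefix of p[:j+1]
    that is also a suffix of p[:j+1]."""
    fail = [0] * len(p)
    k = 0  # length of the currently matched prefix
    for j in range(1, len(p)):
        # Fall back to shorter prefixes until p[j] extends one of them.
        while k > 0 and p[j] != p[k]:
            k = fail[k - 1]
        if p[j] == p[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(p, t):
    """Return the 0-based start positions of every occurrence of p in t."""
    if not p:
        return []
    fail = failure_function(p)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(t):
        # On a mismatch, reuse the failure function instead of re-reading t.
        while k > 0 and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            matches.append(i - len(p) + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches


print(failure_function("AAATA"))         # [0, 1, 2, 0, 1]
print(kmp_search("AAATA", "AATAAAATA"))  # [4]
```

Each text character is read once, and the fallback loop can only undo progress made earlier, which is where the bound of around n+m operations comes from.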

Weight Matrix Definition

Weight Matrix Design

Gene Finding

Pattern Matching Introduction

Instructor (me):

Kevin Atteson, 2-5581
atteson@peaplant.biology.yale.edu

Subject matter:

Pattern matching is the search for various patterns in long texts. Typical patterns might be:

  • Genes or signals related to genes:

    • Transcription factor binding sites

    • Cap sites

    • Splice signals

    • Start and stop codons

    • Polyadenylation signals

  • Protein sequence motifs (e.g. PROSITE patterns).

We will discuss these systems from a computational perspective, that is, we will describe how pattern matching systems work.

The Naive Algorithm

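The naive algorithm tries every location of the text t as a possible starting point for the pattern p, comparing the two character by character until a mismatch occurs or the whole pattern has matched.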

As an example, consider searching for the pattern p = AAATA in the text t=AATAAAATA:

  1. AAATA
    AATAAAATA

  2. AAATA
    AATAAAATA

  3. AAATA
    AATAAAATA

  4. AAATA
    AATAAAATA

  5. AAATA
    AATAAAATA

  6. AAATA
    AATAAAATA

  7. AAATA
    AATAAAATA

  8. AAATA
    AATAAAATA

  9. AAATA
    AATAAAATA

  10. AAATA
    AATAAAATA

  11. AAATA
    AATAAAATA

  12. AAATA
    AATAAAATA

  13. AAATA
    AATAAAATA

  14. AAATA
    AATAAAATA

  15. AAATA
    AATAAAATA

  16. AAATA
    AATAAAATA

The naive algorithm requires around m(n-m) operations in the worst case, where m is the length of p and n is the length of t.
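A minimal Python sketch of the naive algorithm, with a counter for character comparisons, makes the cost concrete. The function name `naive_search` and the comparison counter are illustrative additions, and positions are 0-based.

```python
def naive_search(p, t):
    """Try every alignment of p against t.

    Returns (matches, comparisons): the 0-based start positions of p in t
    and the number of character comparisons performed.
    """
    n, m = len(t), len(p)
    matches = []
    comparisons = 0
    for s in range(n - m + 1):      # every possible start position
        j = 0
        while j < m:
            comparisons += 1
            if t[s + j] != p[j]:    # mismatch: give up on this alignment
                break
            j += 1
        if j == m:                  # all m characters matched
            matches.append(s)
    return matches, comparisons


print(naive_search("AAATA", "AATAAAATA"))  # ([4], 15)
print(naive_search("AAAB", "AAAAAAAA"))    # ([], 20)
```

The second call is a worst case: every one of the n-m+1 = 5 alignments needs all m = 4 comparisons before failing, for 20 comparisons in total.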

Is there a better way?

Other Methods of String Matching

Pattern Matching Overview

String Probabilities

Signals for Gene Finding

String Searching

The string searching problem is to find a string or pattern, such as the polyadenylation consensus signal "AATAAA", in a body of text such as a gene.

String matching can be used to find:

  • Start and stop codons

  • Polyadenylation consensus signals

  • Consensus transcription factor binding sites (see the TRANSFAC database)
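As a small illustration of the problem statement, the sketch below reports every position of the signal "AATAAA" in a DNA string using Python's built-in `str.find`. The helper name `find_all` and the example sequence are made up for this example and do not come from a real gene.

```python
def find_all(pattern, text):
    """Return the 0-based start positions of all (possibly overlapping)
    occurrences of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping hits are not skipped.
        start = text.find(pattern, start + 1)
    return positions


# Hypothetical sequence, invented for illustration only.
sequence = "GCAATAAAGGCTAATAAACC"
print(find_all("AATAAA", sequence))  # [2, 12]
```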

In our introduction to string matching, we will illustrate issues of algorithm design and analysis from the perspective of computer science and mathematics.

Those interested in further information are referred to the following book:

Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997. This is a very readable book about algorithms in computational biology. Of particular interest to this lecture are Chapters 1 and 2.

Weight Matrices