5-5: The T-Score web application – Scoring and ranking peptides for MHC binding

PHP Data structure for scoring matrices

We wish to import the scoring matrix data from the .csv files in a data structure in PHP that would be convenient to use for calculating the peptide binding scores. There is indeed one structure that is perfect for the job: an associative array or dictionary, in which we have a key for each amino-acid letter that has, as associated value, an array with the scoring values as elements.

To exemplify the concept, let’s consider the lines for Alanine and Leucine in the .csv file for MHC allele HLA-A1 that we generated in the previous section:

A;1.69;1.50;-0.15;-1.5;0.82;1.36;1.00;1.14;-1.20
L;0.51;1.62;1.24;-0.29;0.19;0.44;0.38;0.22;1.31

and manually build a dictionary called $matrix with just two keys, “A” and “L”, for now.

We could use this matrix to score a 9 mer peptide solely composed of Alanine and Leucine residues, with sequence: ALLAAAALA

The scoring array for an amino-acid can be accessed very easily in the matrix by using the amino-acid itself as key. For an Alanine (“A”), this would be:

$matrix[“A”]

Then, to access the score for the A in a specific position of the peptide, we just use the position itself as an array index, where position 1 in the peptide corresponds to index 0 (the first element) in the array. So the score for an alanine in position 1 would be:

$matrix[“A”][0]

Check out the following scoring example:

Based on this code, we can easily write a function that takes a peptide and a matrix as arguments and returns the peptide score. In the following code example we write the function and then use it to score a peptide. In the function, we include a check to make sure that each analysed amino-acid exists as a key in the matrix and return an error if it does not. This could be useful for code debugging and will prevent PHP errors if we supply an invalid sequence.

Once we have a score, we can use the thresholds data available in the matrix to rank the peptide among the x% best binders.

In the previous section we did store the threshold information in the csv file in this format:

tre;6.68;5.18;4.21;3.47;2.85;2.31;1.83;1.39;0.99;0.61;

We could store this data in our matrix array in the same format as the scoring data for amino-acids:

Let’s calculate the score and the rank this time.

Let’s now modify the pepscore() function we wrote before to return the score, the rank, or both, depending on a third optional argument, so that we can enjoy some flexibility while using it.

We now have a function, pepscore(), to score and rank a peptide based on a matrix. A quite valuable and central piece of code for our T-Score application.

Let’s now see how we can generate a proper matrix associative array starting from one of the .csv files we produced in the previous section.

Generating scoring matrices from the .csv files

We will now write a little function matrix_file_to_array() to convert our .csv files into associative arrays with the format discussed above.

The flow of the code is as follows:

– We import the file to a variable
– Split on “\n” to get the lines
– For each line, we split the line to an array by using the semicolon as splitting element. The first element of this array will become a key in our final matrix associative array, while all the other elements will be part of an array assigned as value to this key. Remember how each line is structured:

A;1.69;1.50;-0.15;-1.5;0.82;1.36;1.00;1.14;-1.20

– The function then returns this matrix array

On executing the code above from our script.php file located in the tscore directory, we get the var_dump of the matrix for the HLA-A1 allele. It will show that the $a1_matrix variable is now an array of 22 elements, as expected, as we have 20 amino-acids, plus the X amino-acid, plus the “tre” line with the thresholds.

Here is how the var_dump starts (it’s quite long so we just report the beginning of the output, but please test it on your own server):

array(22) {
[“A”]=>
array(9) {
[0]=>
string(4) “1.69”
[1]=>
string(4) “1.50”
[2]=>
string(5) “-0.15”
[3]=>
string(5) “-1.50”
[4]=>
string(4) “0.82”
[5]=>
string(4) “1.36”
[6]=>
string(4) “1.00”
[7]=>
string(4) “1.14”
[8]=>
string(5) “-1.20”
}
[“C”]=>
array(9) {
[0]=>
string(4) “2.00”
[1]=>
string(4) “0.00”
[2]=>
string(4) “0.00”
[3]=>
string(4) “1.00”
[4]=>
string(4) “0.00”
[5]=>
string(5) “-2.00”
[6]=>
string(4) “0.00”
[7]=>
string(5) “-2.00”
[8]=>
string(4) “0.00”
}
[“D”]=>
array(9) {
[0]=>
string(5) “-1.50”
[1]=>
string(5) “-2.00”
[2]=>
string(4) “6.52”
[3]=>
string(5) “-0.67”
[4]=>
string(5) “-2.00”
[5]=>
string(4) “0.67”
[6]=>
string(5) “-1.33”
[7]=>
string(5) “-1.00”
[8]=>
string(5) “-2.00”
}
etc…

Getting real: scoring and ranking a peptide sequence by using a full scoring matrix

In the following code example we will select an allele name, generate the scoring matrix array “on the fly” and use it to score and rank an actual peptide sequence.

To get a fully woking self-contained example, we move the pepscore() and matrix_file_to_array() functions to a “functions.php” file created in a folder called “include”, inside the tscore directory.

We also create a new php file called “generate_matrices.php” where we will move all the code related to the download of the .html files from the original website and the generation of the .csv files in our matrix folder, from these .html files. The reason for moving some of the code in a different file is to keep our main file, script.php, as clean and light as possible so that it will be more readable and easier to maintain. As functions.php, we will store this file in the include folder.

Our application file structure therefore becomes, for now:

tscore (permission 777)
    matrix (permission 777)
    script.php
    include
        functions.php
        generate_matrices.php

And here is the content of the files:

generate_matrices.php

functions.php

script.php

See how, as usual, with all the pieces (accessory code and functions) in place, the script itself becomes short and easy to write.

Running this code will output the following:

Peptide KLDVHRTAC has a score of 3.86 with the HLA-A1 MHC allele, which ranks it among the 4% best binders

As you see, we are getting closer to be able to write a full fledged code for our T-Score web application.

The slide_window() function

There is one important part still missing though: in the web form, the user will provide a full protein sequence rather than a 9 mer peptide sequence. We need a function to extract all possible 9 mers from a protein sequence, by applying a so-called “sliding window”. To make a simple example, let’s suppose the original protein sequence provided by the user has a length of 15 amino-acids.

KTKLDVHRTACEWVM

The 9 mers it contains are the following:

1-9: KTKLDVHRT
2-10: TKLDVHRTA
3-11: KLDVHRTAC
4-12: LDVHRTACE
5-13: DVHRTACEW
6-14: VHRTACEWV
7-15: HRTACEWVM

Those are obtained by applying a window with a fixed length of 9 to the sequence, and sliding it each time by one position on the sequence, hence the term “sliding window”.

Scoring a full protein means generating all possible peptides with the sliding window and then scoring each individual peptide. The peptides that rank higher than 10 are considered “negative”. Remember that the lower the rank, the better binder a peptide is. A rank of 1 means that the peptide is among the top 1% binders, which is better than being “just” among the top 10% binders.

Let us write a slide_window() function and test it on a short sequence. The function will take a protein sequence as argument and return an array with the individual peptide sequences of the width provided as second argument (optional, as default width= 9) as elements.

This is the output of the script above, everything looks nice:

array(7) {
[0]=>
string(9) “KTKLDVHRT”
[1]=>
string(9) “TKLDVHRTA”
[2]=>
string(9) “KLDVHRTAC”
[3]=>
string(9) “LDVHRTACE”
[4]=>
string(9) “DVHRTACEW”
[5]=>
string(9) “VHRTACEWV”
[6]=>
string(9) “HRTACEWVM”
}

Scoring and ranking all the peptides in a protein for binding to an MHC allele

By maintaining the very same structure as the script we used above to score and rank a single peptide, let’s just add the slide_window() and the process_fasta() functions to the functions.php file in the include folder and revisit the script.php file in order to score and rank all the peptides from an entire target protein, the capsid protein of human rhinovirus 14 (HRV-14) – http://www.uniprot.org/uniprot/P03303.

>sp|P03303|POLG_HRV14 Genome polyprotein OS=Human rhinovirus 14 PE=1 SV=3
MGAQVSTQKSGSHENQNILTNGSNQTFTVINYYKDAASTSSAGQSLSMDPSKFTEPVKDL
MLKGAPALNSPNVEACGYSDRVQQITLGNSTITTQEAANAVVCYAEWPEYLPDVDASDVN
KTSKPDTSVCRFYTLDSKTWTTGSKGWCWKLPDALKDMGVFGQNMFFHSLGRSGYTVHVQ
CNATKFHSGCLLVVVIPEHQLASHEGGNVSVKYTFTHPGERGIDLSSANEVGGPVKDVIY
NMNGTLLGNLLIFPHQFINLRTNNTATIVIPYINSVPIDSMTRHNNVSLMVIPIAPLTVP
TGATPSLPITVTIAPMCTEFSGIRSKSIVPQGLPTTTLPGSGQFLTTDDRQSPSALPNYE
PTPRIHIPGKVHNLLEIIQVDTLIPMNNTHTKDEVNSYLIPLNANRQNEQVFGTNLFIGD
GVFKTTLLGEIVQYYTHWSGSLRFSLMYTGPALSSAKLILAYTPPGARGPQDRREAMLGT
HVVWDIGLQSTIVMTIPWTSGVQFRYTDPDTYTSAGFLSCWYQTSLILPPETTGQVYLLS
FISACPDFKLRLMKDTQTISQTVALTEGLGDELEEVIVEKTKQTVASISSGPKHTQKVPI
LTANETGATMPVLPSDSIETRTTYMHFNGSETDVECFLGRAACVHVTEIQNKDATGIDNH
REAKLFNDWKINLSSLVQLRKKLELFTYVRFDSEYTILATASQPDSANYSSNLVVQAMYV
PPGAPNPKEWDDYTWQSASNPSVFFKVGDTSRFSVPYVGLASAYNCFYDGYSHDDAETQY
GITVLNHMGSMAFRIVNEHDEHKTLVKIRVYHRAKHVEAWIPRAPRALPYTSIGRTNYPK
NTEPVIKKRKGDIKSYGLGPRYGGIYTSNVKIMNYHLMTPEDHHNLIAPYPNRDLAIVST
GGHGAETIPHCNCTSGVYYSTYYRKYYPIICEKPTNIWIEGNPYYPSRFQAGVMKGVGPA
EPGDCGGILRCIHGPIGLLTAGGSGYVCFADIRQLECIAEEQGLSDYITGLGRAFGVGFT
DQISTKVTELQEVAKDFLTTKVLSKVVKMVSALVIICRNHDDLVTVTATLALLGCDGSPW
RFLKMYISKHFQVPYIERQANDGWFRKFNDACNAAKGLEWIANKISKLIEWIKNKVLPQA
KEKLEFCSKLKQLDILERQITTMHISNPTQEKREQLFNNVLWLEQMSQKFAPLYAVESKR
IRELKNKMVNYMQFKSKQRIEPVCVLIHGTPGSGKSLTTSIVGRAIAEHFNSAVYSLPPD
PKHFDGYQQQEVVIMDDLNQNPDGQDISMFCQMVSSVDFLPPMASLDNKGMLFTSNFVLA
STNSNTLSPPTILNPEALVRRFGFDLDICLHTTYTKNGKLNAGMSTKTCKDCHQPSNFKK
CCPLVCGKAISLVDRTTNIRYSVDQLVTAIISDFKSKMQITDSLETLFQGPVYKDLEIDV
CNTPPPECINDLLKSVDSEEIREYCKKKKWIIPEIPTNIERAMNQASMIINTILMFVSTL
GIVYVIYKLFAQTQGPYSGNPPHNKLKAPTLRPVVVQGPNTEFALSLLRKNIMTITTSKG
EFTGLGIHDRVCVIPTHAQPGDDVLVNGQKIRVKDKYKLVDPENINLELTVLTLDRNEKF
RDIRGFISEDLEGVDATLVVHSNNFTNTILEVGPVTMAGLINLSSTPTNRMIRYDYATKT
GQCGGVLCATGKIFGIHVGGNGRQGFSAQLKKQYFVEKQGQVIARHKVREFNINPVNTPT
KSKLHPSVFYDVFPGDKEPAVLSDNDPRLEVKLTESLFSKYKGNVNTEPTENMLVAVDHY
AGQLLSLDIPTSELTLKEALYGVDGLEPIDITTSAGFPYVSLGIKKRDILNKETQDTEKM
KFYLDKYGIDLPLVTYIKDELRSVDKVRLGKSRLIEASSLNDSVNMRMKLGNLYKAFHQN
PGVLTGSAVGCDPDVFWSVIPCLMDGHLMAFDYSNFDASLSPVWFVCLEKVLTKLGFAGS
SLIQSICNTHHIFRDEIYVVEGGMPSGCSGTSIFNSMINNIIIRTLILDAYKGIDLDKLK
ILAYGDDLIVSYPYELDPQVLATLGKNYGLTITPPDKSETFTKMTWENLTFLKRYFKPDQ
QFPFLVHPVMPMKDIHESIRWTKDPKNTQDHVRSLCMLAWHSGEKEYNEFIQKIRTTDIG
KCLILPEYSVLRRRWLDLF

script.php

On running this script we get the following output:

Analysed protein: >sp|P03303|POLG_HRV14 Genome polyprotein OS=Human rhinovirus 14 PE=1 SV=3

Out of 2172 total peptides, 205 rank better then 10 for binding to MHC allele HLA-A1

Position Sequence Score Rank
12-20 SHENQNILT 1.3 9
24-32 NQTFTVINY 2.5 6
25-33 QTFTVINYY 6.08 2
33-41 YKDAASTSS 3.72 4
40-48 SSAGQSLSM 0.94 10
47-55 SMDPSKFTE 1.34 9
53-61 FTEPVKDLM 4.47 3
70-78 SPNVEACGY 4.91 3
72-80 NVEACGYSD 1.43 8
78-86 YSDRVQQIT 1.09 9
93-101 TTQEAANAV 1.55 8
94-102 TQEAANAVV 1.45 8
96-104 EAANAVVCY 4.26 3
97-105 AANAVVCYA 2.49 6
104-112 YAEWPEYLP 1 9
111-119 LPDVDASDV 0.62 10
116-124 ASDVNKTSK 3.55 4
134-142 TLDSKTWTT 1.3 9
155-163 LKDMGVFGQ 1.99 7
167-175 FHSLGRSGY 1.56 8
196-204 IPEHQLASH 3.62 4
202-210 ASHEGGNVS 1.08 9
203-211 SHEGGNVSV 3.63 4
205-213 EGGNVSVKY 1.62 8
209-217 VSVKYTFTH 0.74 10
222-230 GIDLSSANE 3.46 5
228-236 ANEVGGPVK 4.08 4
232-240 GGPVKDVIY 0.79 10
235-243 VKDVIYNMN 5 3
264-272 NTATIVIPY 2.68 6
268-276 IVIPYINSV 2.19 7
290-298 MVIPIAPLT 1.08 9
297-305 LTVPTGATP 1.1 9
317-325 CTEFSGIRS 4.83 3
358-366 NYEPTPRIH 2.38 6
390-398 HTKDEVNSY 4.74 3
391-399 TKDEVNSYL 3.73 4
398-406 YLIPLNANR 1.16 9
418-426 IGDGVFKTT 1.22 9
426-434 TLLGEIVQY 1.23 9
427-435 LLGEIVQYY 3.58 4
440-448 GSLRFSLMY 4.34 3
447-455 MYTGPALSS 1.15 9
448-456 YTGPALSSA 0.79 10
454-462 SSAKLILAY 11.18 1
498-506 WTSGVQFRY 3.27 5
506-514 YTDPDTYTS 4.52 3
508-516 DPDTYTSAG 1.09 9
512-520 YTSAGFLSC 3.06 5
514-522 SAGFLSCWY 3.9 4
529-537 PPETTGQVY 6.74 1
553-561 MKDTQTISQ 4.77 3
565-573 LTEGLGDEL 2.32 6
572-580 ELEEVIVEK 2.52 6
573-581 LEEVIVEKT 2.64 6
577-585 IVEKTKQTV 2.11 7
603-611 ANETGATMP 3.44 5
616-624 DSIETRTTY 5.28 2
617-625 SIETRTTYM 1.57 8
631-639 ETDVECFLG 3.91 4
635-643 ECFLGRAAC 2.82 6
646-654 VTEIQNKDA 2.66 6
651-659 NKDATGIDN 0.65 10
692-700 DSEYTILAT 4.21 3
695-703 YTILATASQ 1.53 8
701-709 ASQPDSANY 9.23 1
711-719 SNLVVQAMY 5.89 2
729-737 EWDDYTWQS 5.8 2
730-738 WDDYTWQSA 1.47 8
747-755 VGDTSRFSV 1.5 8
756-764 PYVGLASAY 3.89 4
760-768 LASAYNCFY 5.32 2
763-771 AYNCFYDGY 1.36 9
767-775 FYDGYSHDD 3.7 4
772-780 SHDDAETQY 8.39 1
776-784 AETQYGITV 1.21 9
803-811 KTLVKIRVY 4.09 4
822-830 PRAPRALPY 3.25 5
830-838 YTSIGRTNY 7.29 1
841-849 NTEPVIKKR 4.29 3
854-862 KSYGLGPRY 0.96 10
858-866 LGPRYGGIY 0.72 10
867-875 TSNVKIMNY 6.16 2
880-888 PEDHHNLIA 1.61 8
904-912 GAETIPHCN 3.15 5
905-913 AETIPHCNC 0.77 10
910-918 HCNCTSGVY 4.31 3
911-919 CNCTSGVYY 3.51 4
914-922 TSGVYYSTY 4.45 3
915-923 SGVYYSTYY 3.75 4
918-926 YYSTYYRKY 3.17 5
919-927 YSTYYRKYY 6.56 2
930-938 ICEKPTNIW 5.31 2
936-944 NIWIEGNPY 1.02 9
937-945 IWIEGNPYY 3.4 5
938-946 WIEGNPYYP 1.46 8
959-967 PAEPGDCGG 4.76 3
962-970 PGDCGGILR 4.08 4
978-986 LLTAGGSGY 2.69 6
989-997 FADIRQLEC 2.03 7
994-1002 QLECIAEEQ 1.01 9
998-1006 IAEEQGLSD 8.07 1
999-1007 AEEQGLSDY 12.46 1
1004-1012 LSDYITGLG 0.71 10
1019-1027 FTDQISTKV 4.68 3
1027-1035 VTELQEVAK 2.07 7
1034-1042 AKDFLTTKV 3.58 4
1046-1054 VVKMVSALV 1.36 9
1074-1082 GCDGSPWRF 2.91 5
1095-1103 YIERQANDG 5.18 2
1100-1108 ANDGWFRKF 2.86 5
1108-1116 FNDACNAAK 3 5
1117-1125 GLEWIANKI 4.79 3
1128-1136 LIEWIKNKV 1.54 8
1140-1148 AKEKLEFCS 1.25 9
1172-1180 KREQLFNNV 2.01 7
1186-1194 MSQKFAPLY 7.62 1
1195-1203 AVESKRIRE 2.28 7
1219-1227 RIEPVCVLI 1.34 9
1245-1253 AIAEHFNSA 1.18 9
1246-1254 IAEHFNSAV 4.18 4
1247-1255 AEHFNSAVY 4.32 3
1255-1263 YSLPPDPKH 0.66 10
1269-1277 QQEVVIMDD 1.32 9
1274-1282 IMDDLNQNP 0.69 10
1281-1289 NPDGQDISM 2.77 6
1296-1304 SVDFLPPMA 3.92 4
1305-1313 SLDNKGMLF 3.88 4
1343-1351 GFDLDICLH 1.43 8
1345-1353 DLDICLHTT 1.55 8
1346-1354 LDICLHTTY 3.51 4
1369-1377 CKDCHQPSN 1.52 8
1392-1400 LVDRTTNIR 3.27 5
1393-1401 VDRTTNIRY 0.7 10
1402-1410 SVDQLVTAI 4.94 3
1420-1428 ITDSLETLF 0.69 10
1425-1433 ETLFQGPVY 6.08 2
1433-1441 YKDLEIDVC 2.97 5
1437-1445 EIDVCNTPP 3.51 4
1445-1453 PPECINDLL 0.9 10
1456-1464 VDSEEIREY 1.43 8
1457-1465 DSEEIREYC 5.46 2
1478-1486 NIERAMNQA 1.74 8
1488-1496 MIINTILMF 0.62 10
1496-1504 FVSTLGIVY 5.18 2
1518-1526 SGNPPHNKL 0.8 10
1540-1548 NTEFALSLL 1.96 7
1559-1567 KGEFTGLGI 0.83 10
1572-1580 CVIPTHAQP 3.38 5
1581-1589 GDDVLVNGQ 1.65 8
1593-1601 VKDKYKLVD 4.87 3
1599-1607 LVDPENINL 5.85 2
1606-1614 NLELTVLTL 3.01 5
1628-1636 SEDLEGVDA 2.09 7
1630-1638 DLEGVDATL 5.53 2
1633-1641 GVDATLVVH 1.92 7
1636-1644 ATLVVHSNN 2.19 7
1649-1657 ILEVGPVTM 2.37 6
1651-1659 EVGPVTMAG 1.63 8
1657-1665 MAGLINLSS 1.69 8
1664-1672 SSTPTNRMI 0.71 10
1668-1676 TNRMIRYDY 1.26 9
1706-1714 FSAQLKKQY 2.33 6
1715-1723 FVEKQGQVI 4.23 3
1742-1750 SKLHPSVFY 1.73 8
1754-1762 PGDKEPAVL 2.38 6
1756-1764 DKEPAVLSD 5.94 2
1773-1781 LTESLFSKY 12.02 1
1786-1794 NTEPTENML 5.36 2
1792-1800 NMLVAVDHY 3.49 4
1793-1801 MLVAVDHYA 2.19 7
1796-1804 AVDHYAGQL 7.29 1
1806-1814 SLDIPTSEL 1.83 8
1813-1821 ELTLKEALY 3.94 4
1814-1822 LTLKEALYG 0.84 10
1825-1833 GLEPIDITT 6.78 1
1828-1836 PIDITTSAG 3.21 5
1831-1839 ITTSAGFPY 3.42 5
1856-1864 DTEKMKFYL 1.4 8
1859-1867 KMKFYLDKY 0.91 10
1863-1871 YLDKYGIDL 8.56 1
1868-1876 GIDLPLVTY 9.37 1
1877-1885 IKDELRSVD 2.2 7
1883-1891 SVDKVRLGK 6.96 1
1894-1902 LIEASSLND 1.63 8
1906-1914 MRMKLGNLY 7.72 1
1930-1938 GCDPDVFWS 1.45 8
1942-1950 CLMDGHLMA 1.29 9
1943-1951 LMDGHLMAF 1.68 8
1959-1967 SLSPVWFVC 0.96 10
1967-1975 CLEKVLTKL 6.33 2
1999-2007 VVEGGMPSG 4.14 4
2023-2031 IRTLILDAY 0.92 10
2033-2041 GIDLDKLKI 0.99 10
2035-2043 DLDKLKILA 3.33 5
2036-2044 LDKLKILAY 1.7 8
2044-2052 YGDDLIVSY 11.68 1
2045-2053 GDDLIVSYP 2.31 6
2055-2063 ELDPQVLAT 9.12 1
2060-2068 VLATLGKNY 4.53 3
2115-2123 IHESIRWTK 1.85 7
2128-2136 TQDHVRSLC 4 4
2134-2142 SLCMLAWHS 3.84 4
2142-2150 SGEKEYNEF 2.01 7
2156-2164 TTDIGKCLI 1.82 8

As it happens in these pages, the formatting of the output, in particular of the results table in this case, will be influenced by the CSS of this web site. On running the script on your own server, which you are warmly encouraged to do, the visual results will be slightly different. Still, you can see that we are getting some interesting results here, 205 peptides out of 2172 were found positive with a few ranking as high as 1.

It would be nice to be able to dynamically sort this results table according to the scores or ranks. This can be easily achieved by using jQuery and DataTables as we will see later on during the development of the full T-Score web application. You can see that at this stage, all the pieces of code we need to fully score and rank the peptides in a protein are in place, so we are ready to proceed to the development of the web application. Keep on reading about this in the next section.

Chapter Sections

WORK IN PROGRESS ON CHAPTER 5!