5-4: The T-Score web application – Web scraping of the scoring matrix data

Keywords: web scraping, file parsing, PHP programming, regular expressions, MHC-peptide binding, Immunoinformatics

Logical flow of the T-Score application

This web application will accept a protein sequence or a UniProt ID as input. The user will select the HLA (MHC) allele against which to score all the possible 9mers (sequences of 9 amino-acids) present in the input sequence, and the threshold above which positive peptides will be shown in the analysis results.

The processing script will generate all possible 9mers from the input protein and assign each one a binding score for the selected HLA allele, based on the appropriate scoring matrix, as discussed in the previous section. The score will then be used to rank each peptide according to the thresholds present in the scoring matrix, thereby assigning the peptide to, say, the top 10% binders, the top 5% binders, the top 1% binders, etc.
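The 9mer generation step can be sketched as a sliding window over the sequence. This is only a minimal illustration: the function name generate_peptides() and the toy sequence are assumptions made for this example, not part of the application code, and the scoring itself is left to the matrix logic from the previous section.

```php
<?php
// Hypothetical helper (name not taken from the original text): returns every
// window of $length consecutive residues found in $sequence.
function generate_peptides(string $sequence, int $length = 9): array
{
    // Normalise the input: drop whitespace and use upper-case letters.
    $sequence = strtoupper(preg_replace('/\s+/', '', $sequence));
    $peptides = [];
    // A sequence of n residues contains n - length + 1 windows.
    for ($i = 0; $i + $length <= strlen($sequence); $i++) {
        $peptides[] = substr($sequence, $i, $length);
    }
    return $peptides;
}

// An 11-residue toy sequence yields 11 - 9 + 1 = 3 peptides:
// ACDEFGHIK, CDEFGHIKL and DEFGHIKLM.
print_r(generate_peptides('ACDEFGHIKLM'));
```

A sequence shorter than 9 residues simply yields an empty array, which the processing script can treat as "nothing to score".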

The output will list, for each protein, the peptides whose score ranks them above the threshold selected by the user in the initial web form.

As you can imagine, there is quite a bit of code to write to achieve all of this.

In this section we will concentrate on gathering the scoring matrices data from the original web page where they are shared, through a web scraping operation performed by using PHP and regular expressions.

Downloading the scoring matrices raw HTML files from the original website and saving a local copy

As a first step, we need to download the data for each scoring matrix from the website where they are available. Extracting the data we need from an existing web page is called “web scraping”. Web scraping can be achieved with different techniques. For this example we will adopt a simple strategy based on the manual analysis of the HTML source code of our target pages. From this analysis we will be able to write dedicated regular expressions and use them in appropriate code to retrieve the scoring matrix data and write it to text files.

Another piece of code will be dedicated to reading those text files and storing the data into appropriate data structures (arrays) that can then be used by the form processing script to assign scores to peptides.

Let’s get started.

Our application will be limited to the analysis with three HLA alleles, namely HLA-A1, A2 and A3. The scoring matrices for these three alleles can be found at the following URLs:

A1 matrix: http://www.imtech.res.in/raghava/nhlapred/matrices/a1.html
A2 matrix: http://www.imtech.res.in/raghava/nhlapred/matrices/a2.html
A3 matrix: http://www.imtech.res.in/raghava/nhlapred/matrices/a3.html

Somewhere under the web root of our Apache installation, let us create a directory for the whole project, called tscore, and inside it a subdirectory called matrix, where the matrix work will be executed and the actual matrix files will be stored. The code will be written in a file called “script.php”, to be created inside the tscore directory. To avoid permission issues during code execution, grant 777 permissions to both the tscore and matrix directories with chmod (chmod 777 directory_path). So for now we will have the following structure:

tscore (permission 777)
    matrix (permission 777)

In script.php, the very first task will be to download the three “raw” HTML matrix files, as they are, into the matrix folder, unless they already exist there (we do not want to put any unnecessary load on the server we are scraping the data from, so let us make sure we download the original HTML pages just once).
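A minimal sketch of this download step follows. It is a reconstruction under stated assumptions rather than the original listing: the helper name download_if_missing() is made up for this example, the a3.html address is assumed to follow the same pattern as the a1 and a2 URLs, and file_get_contents() can only fetch http URLs when allow_url_fopen is enabled in the PHP configuration.

```php
<?php
// Hypothetical helper: saves the page at $url to $target unless a local copy
// already exists. Returns true only when a new file was written.
function download_if_missing(string $url, string $target): bool
{
    if (file_exists($target)) {
        return false;                    // already downloaded: no request is sent
    }
    $html = file_get_contents($url);     // needs allow_url_fopen for http URLs
    if ($html === false) {
        return false;                    // download failed: nothing is written
    }
    // The "long way" of writing a file: open, write, close.
    $handle = fopen($target, 'w');
    if ($handle === false) {
        return false;                    // target not writable (check permissions)
    }
    fwrite($handle, $html);
    fclose($handle);
    return true;
}

// Intended use from script.php (assumed URL pattern; left commented out so the
// sketch can be read and tested without contacting the server):
// $base = 'http://www.imtech.res.in/raghava/nhlapred/matrices/';
// foreach (['a1', 'a2', 'a3'] as $allele) {
//     download_if_missing($base . $allele . '.html', __DIR__ . "/matrix/$allele.html");
// }
```

Because the existence check comes first, running the script a second time sends no request at all, which is exactly the "download just once" behaviour described above.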

You can try the code above yourself and then check that the expected 3 files a1.html, a2.html and a3.html were created in the matrix folder and do contain some HTML code. Please do this just once (if successful), so as to avoid any unnecessary load on the server through which the research group who created the matrices kindly shares them.

It is worth mentioning that instead of using the fopen(), fwrite() and fclose() functions sequentially, as we did in the code above, we could have used the file_put_contents() function, which can be considered a shortcut to these three functions for file writing. We took the long way here for educational purposes.

So we could replace this part of the code above:

with something much shorter, like:
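The two ways of writing a file can be compared side by side in a small sketch. The file names and the placeholder content are assumptions chosen for illustration; files are written to the system temporary directory so that the example does not touch the matrix folder.

```php
<?php
$html      = '<HTML><BODY>placeholder page</BODY></HTML>';  // stand-in content
$long_way  = sys_get_temp_dir() . '/tscore-long.html';      // hypothetical paths
$short_way = sys_get_temp_dir() . '/tscore-short.html';

// Long way: three calls.
$handle = fopen($long_way, 'w');
fwrite($handle, $html);
fclose($handle);

// Short way: one call that opens, writes and closes the file.
file_put_contents($short_way, $html);

// Both files now hold the same bytes.
var_dump(file_get_contents($long_way) === file_get_contents($short_way)); // bool(true)
```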

Scraping the scoring matrices data from the HTML files

We can now proceed to extract the data from those HTML files and write the data, in a cleaner format, in some new text files that we will name as a1-matrix.csv, a2-matrix.csv and a3-matrix.csv.

The problem is that in the original HTML files the numbers and amino-acid letters we need sit inside an HTML table, so they are heavily mixed up with HTML tags that we need to get rid of. The image below gives an idea:

The HTML source code relative to the part of the page where the matrix data is stored. It is actually an HTML table, HTML1 style.

Let’s state upfront what we are aiming for. We want to generate a file with the matrix data in csv format, where the scoring values for each amino-acid are stored on a dedicated line and separated by semicolons. The first character of each line will be the letter of the amino-acid to which the numbers that follow refer. So each file will contain 21 lines of scores (one for each of the 20 amino-acids plus the “generic” amino-acid X, which is included in the matrices). Each of these lines will contain the amino-acid letter as its first character, followed by 9 numbers corresponding to the scoring values for positions P1 to P9.
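To make the target format concrete, here is a minimal sketch that builds one such line. The helper name matrix_line() is an assumption for this example; the Alanine values are the ones that appear in the var_dump example later in this section.

```php
<?php
// Hypothetical helper: joins an amino-acid letter and its nine position
// scores (P1 to P9) into one semicolon-separated line.
function matrix_line(string $letter, array $scores): string
{
    return $letter . ';' . implode(';', $scores);
}

// Scores are kept as strings so that trailing zeros such as "1.50" survive.
$alanine = ['1.69', '1.50', '-0.15', '-1.50', '0.82', '1.36', '1.00', '1.14', '-1.20'];
echo matrix_line('A', $alanine), "\n";
// prints: A;1.69;1.50;-0.15;-1.50;0.82;1.36;1.00;1.14;-1.20
```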

To clarify the concept, let’s consider again the scoring matrix for the HLA-A1 allele:

The scoring Matrix for MHC allele HLA-A1. Source: http://www.imtech.res.in/raghava/nhlapred/matrix.html

Our text file for the HLA-A1 matrix (a1-matrix.csv) will start with those two lines:


In the same file we will also store the threshold information (the last two lines in the matrix figure above), but let’s take care of this later on.

In order to extract the data from the HTML file (web scraping) we need to look carefully at the HTML source code of the source page(s). Please open the HTML page for the HLA-A1 matrix in your browser and view its HTML source by selecting the appropriate option in the browser’s menu.

At the time of this writing, this is what the relevant part of the HTML, the one with the matrix data, looks like. Even better, look at the original source code in your browser.

The HTML source code relative to the part of the page where the matrix data is stored. It is actually an HTML table, HTML1 style.

Thankfully, all the lines for the amino-acids look the same in the source code, and each amino-acid occupies its own line which, in HTML terms, is a table row, enclosed in TR tags (in capital letters, as this page is written in “old style” HTML1). Let’s consider the line for the first amino-acid, Alanine (A):

All the data we need is there; we just need to scrape it out.

We start by writing a regular expression that matches the line with a capture group right where we need it.

More specifically, we are interested in matching parts like this:

like this:

or like this:

So our capture group may contain from 1 to 5 characters.

Here is a fitting expression:

We can then use preg_match_all() to collect everything the capture group matches in the line. Each match will be either a capital letter (the amino-acid letter, which should be the first match found) or a score.

Before we get into the parsing of a whole file, let’s try the regular expression on a single line with a small example:

This will output the following var_dump:

array(10) {
  string(1) "A"
  string(4) "1.69"
  string(4) "1.50"
  string(5) "-0.15"
  string(5) "-1.50"
  string(4) "0.82"
  string(4) "1.36"
  string(4) "1.00"
  string(4) "1.14"
  string(5) "-1.20"
}

This is great, as we now have an array, namely $matches[1], that contains exactly the data we were looking for (remember that $matches[0] contains the matches to the whole regular expression, while the matches for the capture group are in $matches[1]).
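The single-line test described above can be sketched as follows. Both the table row and the regular expression are assumptions made for this illustration: the row is a simplified, hypothetical version of the markup (the real page may add attributes or spacing), and the pattern is one possible expression with a 1 to 5 character capture group, not necessarily the one used originally.

```php
<?php
// HYPOTHETICAL table row: the real markup on the source page may differ
// (attributes, spacing), so check it against the page source.
$line = '<TR><TD>A</TD><TD>1.69</TD><TD>1.50</TD><TD>-0.15</TD><TD>-1.50</TD>'
      . '<TD>0.82</TD><TD>1.36</TD><TD>1.00</TD><TD>1.14</TD><TD>-1.20</TD></TR>';

// Capture 1 to 5 characters (capital letters, digits, dot, minus sign)
// sitting between the end of one tag and the start of the next.
$pattern = '/>\s*([A-Z0-9.\-]{1,5})\s*</';

preg_match_all($pattern, $line, $matches);

// $matches[0] holds the full matches, $matches[1] the captured values:
// the letter "A" followed by the nine scores, all as strings.
var_dump($matches[1]);
```

Tag names such as TR and TD are never captured, because the capture group must start immediately after a ">" and end immediately before a "<".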

Let’s now discuss the thresholds part. Each page we are parsing actually contains six different tables. The fifth table contains the scoring data we just discussed, while the sixth table contains the thresholds data. Check this out in the HTML source of one of the original pages.

We need to manage this in the script. In particular, as we reach the sixth table, the thresholds one, during the parsing process, we know we should stop scraping scoring data and instead start to scrape the thresholds data.

To keep track of which table we are handling at any given time in the scraping code, we will use a “$table_flag” variable whose value will be 0 to begin with, will become 5 as we reach the scores table and 6 as we reach the thresholds table. Depending on the value of $table_flag we will execute different code.

In the .csv file, thresholds will be represented by a dedicated line (the last line of the file). For the HLA-A1 matrix, this line will look like this:

tre;6.68;5.18;4.21;3.47;2.85;2.31;1.83;1.39;0.99;0.61
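The $table_flag idea can be sketched as a small line-by-line parser. This is a hypothetical reconstruction, not the original code: it assumes that every table starts with a TABLE tag, that each table row sits on a single line, that the only decimal numbers in the sixth table are the threshold scores, and it reuses an assumed regular expression with a 1 to 5 character capture group. The input page in the example is made up.

```php
<?php
// Hypothetical parser, not the original script (see assumptions above).
function parse_matrix_html(string $html): array
{
    $pattern     = '/>\s*([A-Z0-9.\-]{1,5})\s*</';
    $tables_seen = 0;   // number of <TABLE tags met so far
    $table_flag  = 0;   // 0 = skip, 5 = scores table, 6 = thresholds table
    $csv_lines   = [];
    $thresholds  = [];

    foreach (preg_split('/\R/', $html) as $line) {
        if (stripos($line, '<TABLE') !== false) {
            $tables_seen++;
            $table_flag = ($tables_seen === 5 || $tables_seen === 6) ? $tables_seen : 0;
        }
        if ($table_flag === 0 || !preg_match_all($pattern, $line, $m)) {
            continue;   // uninteresting table, or no cell values on this line
        }
        $cells = $m[1];
        if ($table_flag === 5 && count($cells) === 10 && preg_match('/^[A-Z]$/', $cells[0])) {
            $csv_lines[] = implode(';', $cells);   // letter + P1..P9 scores
        } elseif ($table_flag === 6) {
            foreach ($cells as $cell) {
                if (preg_match('/^-?\d+\.\d+$/', $cell)) {
                    $thresholds[] = $cell;         // keep decimal numbers only
                }
            }
        }
    }
    if ($thresholds) {
        $csv_lines[] = 'tre;' . implode(';', $thresholds);
    }
    return $csv_lines;
}

// Made-up page: four filler tables, one scores row, two thresholds.
$html = str_repeat("<TABLE></TABLE>\n", 4)
      . "<TABLE>\n<TR><TD>A</TD><TD>1.69</TD><TD>1.50</TD><TD>-0.15</TD><TD>-1.50</TD>"
      . "<TD>0.82</TD><TD>1.36</TD><TD>1.00</TD><TD>1.14</TD><TD>-1.20</TD></TR>\n</TABLE>\n"
      . "<TABLE>\n<TR><TD>6.68</TD><TD>5.18</TD></TR>\n</TABLE>";
print_r(parse_matrix_html($html));
```

Keeping a separate counter next to $table_flag is a design choice of this sketch: the flag stays at 0 for the first four tables, so the scraping branches only ever see the values 5 and 6.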

We can now parse the files. To do this, we will list all the files in the matrix directory. Then, file name by file name, if the name ends in .html and the corresponding .csv file does not exist, we will create that file and write the cleaned scoring and thresholds data into it, in the csv format discussed above.
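The file-selection part of this step can be sketched as follows. The helper name pending_conversions() is an assumption for this example; it only decides which files still need converting and leaves the actual parsing and writing to the scraping code.

```php
<?php
// Hypothetical helper: returns [html path => csv path] for every .html file
// in $dir whose matching "-matrix.csv" file has not been created yet.
function pending_conversions(string $dir): array
{
    $pending = [];
    foreach (scandir($dir) as $name) {
        // Skip anything that does not end in ".html" (including "." and "..").
        if (substr($name, -5) !== '.html') {
            continue;
        }
        $csv = $dir . '/' . substr($name, 0, -5) . '-matrix.csv';
        if (!file_exists($csv)) {
            $pending[$dir . '/' . $name] = $csv;
        }
    }
    return $pending;
}

// Intended use from script.php (commented out: it needs the matrix folder):
// foreach (pending_conversions(__DIR__ . '/matrix') as $html_file => $csv_file) {
//     // parse $html_file here and write the cleaned lines to $csv_file
// }
```

With a1.html and a2.html in the folder and a2-matrix.csv already present, only a1.html would be returned, mapped to a1-matrix.csv.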

Mind that this last bit of code assumes that the HTML files were already downloaded to the matrix directory by the download code earlier on this page. After running it, the matrix directory will contain 3 new files: a1-matrix.csv, a2-matrix.csv and a3-matrix.csv. Here is what the contents of a1-matrix.csv will look like:



Let’s put all the code for this section together, in a single script that will download the HTML files from the matrices web site and extract the data to .csv files in a split second:

There you have it: a file for each of the three selected MHC alleles with all the data needed to score a peptide for binding, in a clean format. In the next section we will work out how to use these files to actually score a peptide and then rank it according to the available thresholds.
