4-11: Regular expressions in PHP – Retrieving match positions with the PREG_OFFSET_CAPTURE flag in preg_match() and preg_match_all() calls

In the previous section we saw how to go beyond the “match or no match” paradigm by passing a $matches array as the third argument to preg_match() and preg_match_all(), which lets us retrieve the actual matches of the regular expression in the target string. It is often useful to know not only what matched in the target string, but also exactly where the match is located. To find this out we can call preg_match() or preg_match_all() with a fourth argument, called a flag. More specifically, we will use the PREG_OFFSET_CAPTURE flag (capture the offset of each match) to accomplish this task.

Passing PREG_OFFSET_CAPTURE as the fourth argument to preg_match() slightly modifies the structure of the $matches array: each individual match, rather than being a simple string, is represented by an array whose first element is the part of the string that matched and whose second element is the position in the target string at which the match starts (the offset).
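To make this structure concrete, here is a minimal sketch of a single preg_match() call with the flag. The target string, the pattern and the variable names are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical target string and pattern, used only to illustrate the flag.
$target  = 'AAATTTGGG';
$pattern = '/TTT/';

// The fourth argument switches on offset capture.
preg_match($pattern, $target, $matches, PREG_OFFSET_CAPTURE);

// $matches[0] is no longer a string but an array: [matched string, offset].
$matched_string = $matches[0][0];
$offset         = $matches[0][1]; // counted from 0

echo "Matched $matched_string at offset $offset\n";
```

Without the flag, $matches[0] would simply be the string that matched.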

Let’s start with an example in which we search for all occurrences of BanII cutting sites in a nucleotide sequence (excluding possible overlapping cutting sites for the sake of simplicity).

Our target sequence is (with BanII cutting sites highlighted in red):

CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC

As we saw in section 4-8, a regular expression for the BanII restriction enzyme cutting site could be formulated as follows:
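One possible formulation, assuming the BanII recognition site GRGCYC (where R is A or G and Y is C or T), is sketched below. The variable name $banII is an assumption made for this example.

```php
<?php
// Assumed BanII recognition site: G, then A or G, then GC, then C or T, then C.
$banII = '/G[AG]GC[CT]C/';

// Quick check against one site that fits the pattern and one that does not.
$fits        = preg_match($banII, 'GAGCTC'); // 1 = match
$does_not_fit = preg_match($banII, 'GTGCTC'); // 0 = no match

echo "GAGCTC: $fits, GTGCTC: $does_not_fit\n";
```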

Let’s now use preg_match_all() to find the two cutting sites for BanII in the target sequence and their respective starting positions in the target string.
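A minimal sketch of such a call could look like the following. The pattern is an assumption based on the BanII recognition site GRGCYC, and the variable names are chosen for this example only.

```php
<?php
// Target sequence from the text above.
$sequence = 'CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC';

// Assumed pattern for the BanII site (GRGCYC).
$banII = '/G[AG]GC[CT]C/';

// The fourth argument asks for [matched string, offset] pairs.
// The return value is the number of full matches found.
$count = preg_match_all($banII, $sequence, $matches, PREG_OFFSET_CAPTURE);

echo "Number of cutting sites found: $count\n";
var_dump($matches);
```

Because preg_match_all() resumes searching after the end of each match, overlapping sites are not reported, which is the simplification mentioned earlier.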

The output contains the var_dump of the $matches array as structured after the call with the PREG_OFFSET_CAPTURE flag:

array(1) { [0]=> array(2) { [0]=> array(2) { [0]=> string(6) "GAGCTC" [1]=> int(6) } [1]=> array(2) { [0]=> string(6) "GGGCTC" [1]=> int(21) } } }

Let’s break it down:

array(1) { // The $matches array contains a single element (as there are no capture groups in the regexp)
  [0]=> // which has index 0
  array(2) { // This single element is the array of the matches to the whole regular expression and contains 2 elements (as there were 2 matches here)
    [0]=> // The first match, index 0. Matches here are themselves arrays, as the PREG_OFFSET_CAPTURE flag was used
    array(2) { // The first match is an array of 2 elements:
               // the string matched (a string) and where it starts in the sequence (an integer)
      [0]=>
      string(6) "GAGCTC" // The string matched in the first match
      [1]=>
      int(6) // Where the first match starts in the sequence (offset).
             // Since numbering of characters in the target string starts from 0, not 1, 6 means nucleotide 7 of the sequence
    }
    [1]=> // The second match has index 1
    array(2) { // The second match is again an array of 2 elements
      [0]=>
      string(6) "GGGCTC" // The string matched in the second match
      [1]=>
      int(21) // Offset of the second match.
              // Since numbering of characters starts from 0, not 1, 21 means nucleotide 22 of the sequence
    }
  }
}

Let us re-write the code to provide an output that is nicer than the one we obtain with a var_dump by using a table for displaying the results, as we did in section 4-9.
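A possible version of such a table output is sketched below. The column headings, the table markup and the pattern (based on the assumed BanII site GRGCYC) are illustrative choices, not necessarily those of the original listing.

```php
<?php
$sequence = 'CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC';
$banII    = '/G[AG]GC[CT]C/'; // assumed pattern for the BanII site

preg_match_all($banII, $sequence, $matches, PREG_OFFSET_CAPTURE);

// Build the table row by row: one row per match.
$html  = "<table border=\"1\">\n";
$html .= "<tr><th>Match</th><th>Offset (from 0)</th><th>Nucleotide (from 1)</th></tr>\n";
foreach ($matches[0] as $match) {
    $string   = $match[0];     // the matched string
    $offset   = $match[1];     // 0-based position, as PHP sees it
    $position = $offset + 1;   // 1-based position, as biologists count
    $html .= "<tr><td>$string</td><td>$offset</td><td>$position</td></tr>\n";
}
$html .= "</table>\n";

echo $html;
```

Showing both the offset and the 1-based nucleotide number in the same row makes the difference between the two numbering schemes explicit.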


With a few more lines of code we can apply the same logic to perform a full restriction mapping of a much longer sequence in FASTA format.

Let us generate a BanII restriction map of the first 50,000 nucleotides of the genome of Listeria monocytogenes. You can download a FASTA file for this sequence here.

Up to now, whenever we handled FASTA sequences, we mostly concentrated our attention on the header lines; this time we have to deal with the sequence itself. Since in the FASTA format the sequence is typically divided into lines of 80 nucleotides, we have to reassemble the full sequence from the individual lines. We will then provide a double output: a table with the matches and offsets found, as in the previous example, and the sequence in FASTA format with the matches in red, for a visual representation. In addition, by using the title attribute within the same span tag used to render the match sequence in red, the user can hover over the red match and get information about the start and end positions of the match.
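The reassembly step can be sketched as follows. The FASTA record here is a short hypothetical string rather than the Listeria file, and its header text is invented for the example; in practice the lines would come from the downloaded file.

```php
<?php
// Hypothetical FASTA record: one header line, then the sequence split over two lines.
// Note that the first cutting site is deliberately split across the line break.
$fasta = ">example_record hypothetical header\n"
       . "CCCCCCGAG\n"
       . "CTCTTTTTTTTTGGGCTCCCCCCCC\n";

$header   = '';
$sequence = '';
foreach (explode("\n", $fasta) as $line) {
    $line = trim($line);              // remove line endings and stray spaces
    if ($line === '') {
        continue;                     // skip empty lines
    }
    if ($line[0] === '>') {
        $header = substr($line, 1);   // header line, without the leading ">"
    } else {
        $sequence .= $line;           // append sequence lines to rebuild the full sequence
    }
}

echo "Header: $header\n";
echo "Sequence length: " . strlen($sequence) . "\n";
```

Joining the lines before searching matters: a cutting site that straddles a line break, like the first one here, would be missed if each line were searched separately.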

This is what the source code for a matched sequence portion will look like when we run the code of the next example:
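As a rough sketch, markup of this kind could be generated as shown below. The exact wording of the title attribute ("start: ..., end: ...") is an assumption, as are the variable names.

```php
<?php
// One match as returned with PREG_OFFSET_CAPTURE: the string and its 0-based offset.
$match  = 'GAGCTC';
$offset = 6;

// Convert to 1-based positions for display.
$start = $offset + 1;               // first nucleotide of the match
$end   = $offset + strlen($match);  // last nucleotide of the match

// Red text, pointer cursor, and a tooltip carried by the title attribute.
$span = '<span style="color:red; cursor:pointer" title="start: ' . $start
      . ', end: ' . $end . '">' . $match . '</span>';

echo $span . "\n";
```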

It provides an output that you can mouseover to get information. Try to mouseover the following:

GAGCTCC

You will see that on mouseover the cursor turns to a pointer thanks to the cursor:pointer style assigned to the span tag, and a tooltip appears with the value of the title attribute.

In the following code we use two indexes to track nucleotide numbers: $i and $j. $i starts from 0 and is used to refer to nucleotide positions as PHP sees them (the first character of a string is 0, not 1). $j starts from 1 and allows us to manage the sequence and provide output in a more conventional way for biological sequences, where the first nucleotide or amino acid is considered to be 1 rather than 0.

This code is more complex than our typically short examples. It is important that you go through it carefully and try to understand what the individual lines do, as this is closer (still not there, though) to the level of complexity you may encounter while writing “real” web applications for bioinformatics.
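The sketch below shows one way the highlighting logic with $i and $j could be written. It is a simplified stand-in under stated assumptions: it uses the short sequence from the first example instead of the Listeria FASTA file, a line length of 10 instead of 80, an assumed pattern for the BanII site (GRGCYC), and an assumed format for the title attribute.

```php
<?php
$sequence    = 'CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC'; // stand-in for the full sequence
$banII       = '/G[AG]GC[CT]C/';                     // assumed BanII pattern
$line_length = 10;                                   // 80 in a typical FASTA file

preg_match_all($banII, $sequence, $matches, PREG_OFFSET_CAPTURE);

// Index the matches by their 0-based start offset: offset => matched string.
$starts = [];
foreach ($matches[0] as $match) {
    $starts[$match[1]] = $match[0];
}

$html     = '<pre>';
$last     = -1;                 // 0-based index of the last nucleotide of the open span
$length   = strlen($sequence);
for ($i = 0; $i < $length; $i++) {
    $j = $i + 1;                // 1-based position, for display
    if (isset($starts[$i])) {
        // A match starts here: open the red span with start and end in the tooltip.
        $last = $i + strlen($starts[$i]) - 1;
        $end  = $last + 1;      // 1-based end position
        $html .= '<span style="color:red; cursor:pointer" title="start: '
               . $j . ', end: ' . $end . '">';
    }
    $html .= $sequence[$i];
    if ($i === $last) {
        $html .= '</span>';     // the match ends on this nucleotide
    }
    if ($j % $line_length === 0 && $j < $length) {
        $html .= "\n";          // break the output into fixed-length lines
    }
}
$html .= '</pre>';

echo $html . "\n";
```

Walking through the sequence one nucleotide at a time keeps the line breaks and the span tags independent of each other, so a match that crosses a line boundary is still highlighted as a whole.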


Up to now we have handled FASTA sequences in a slightly different way each time, depending on what we needed to do. However, it would be handy to have a reusable piece of code that takes a FASTA sequence as input and returns the header and the sequence as separate variables for further processing. We could achieve this by writing our own dedicated function, which we could call, for example, fasta_process(). A set of functions (or classes, not yet discussed here) revolving around the same task, such as the handling of biological sequences, is called a library. Writing our own functions is an essential skill in programming. We will learn how to do it in the next section.
