4-11: Regular expressions in PHP – Retrieving matches position by using the PREG_OFFSET_CAPTURE flag in preg_match() and preg_match_all() calls

In the previous section we have seen how to to go beyond the “match or no match” paradigm using the $matches array as a third element in preg_match() and preg_match_all() calls, that allows us to retrieve the actual matches of the PHP regular expression in the target string. In several instances it will be useful to now only know what matched in the target string, but also where exactly this match is located. To find this out we can call the preg_match() or preg_match_all() functions with a fourth argument, called a flag. More specifically, we will want to use the PREG_OFFSET_CAPTURE flag (capture the offset of the regular expression) to accomplish this task.

Using PREG_OFFSET_CAPTURE as a fourth argument in preg_match calls will slightly modify the structure of the $matches array, in that each individual match in the array, rather than a simple string, will be represented by an array composed by the part of the string that matched as a first element, and the position in the target string at which the match starts (offset) as a second element.

Let’s start with an example in which we will search all the occurrences of a BanII cutting sites in a nucleotide sequence (excluding possible overlapping cutting sites for the sake of simplicity).

Our target sequence is (with BanII cutting sites highlighted in red):

CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC

As we saw in section 4-8, a regular expression for the BanII restriction enzyme cutting site could be formulated as follows:


<?php

$banII_regexp ="/G[AG]GC[CT]C/";

?>

<?php

$banII_regexp ="/G[AG]GC[CT]C/";

Let’s now use preg_match_all() to find the two cutting sites for BanII in the target sequence and their respective starting positions in the target string.


<?php

$sequence = "CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC";
$banII_regexp ="/G[AG]GC[CT]C/";

preg_match_all($banII_regexp, $sequence, $matches, PREG_OFFSET_CAPTURE);

echo "<p>";
var_dump($matches);
echo "</p>";

?>

<?php

$sequence = "CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC";

$banII_regexp ="/G[AG]GC[CT]C/";

preg_match_all($banII_regexp, $sequence, $matches, PREG_OFFSET_CAPTURE);

echo "<p>";

var_dump($matches);

echo "</p>";

The output contains the var_dump of the $matches array as structured after the call with the PREG_OFFSET_CAPTURE flag:

array(1) { [0]=> array(2) { [0]=> array(2) { [0]=> string(6) “GAGCTC” [1]=> int(6) } [1]=> array(2) { [0]=> string(6) “GGGCTC” [1]=> int(21) } } }

Let’s break it down:

array(1) { // $matches array contains a single element (as no capture groups in regexp)
[0]=> // which has index 0
array(2) { // This single element is the array of the matches to the whole regular expression and contains 2 elements (as there were 2 matches here)
[0]=> // The first match, index 0. Matches here are themselves arrays as PREG_OFFSET_CAPTURE flag was used
array(2) { // First match is an array on 2 elements,
// the string matched (a string) and where it starts in the sequence (an integer)
[0]=>
string(6) “GAGCTC” // the string matched in the first match
[1]=>
int(6) // Where the first match starts in the sequence (offset).
// Since numbering of characters in the target string starts from 0, not 1, 6 means nucleotide 7 of the sequence
}
[1]=> // Second match has index 1
array(2) { // Second match is again an array on 2 elements
[0]=>
string(6) “GGGCTC” // the string matched in the second match
[1]=>
int(21) // Offset of the second match.
// Since numbering of characters starts from 0, not 1, 21 means nucleotide 22 of the sequence
}
}
}

Let us re-write the code to provide an output that is nicer than the one we obtain with a var_dump by using a table for displaying the results, as we did in section 4-9.


<?php

$sequence = "CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC";
$banII_regexp ="/G[AG]GC[CT]C/";

preg_match_all($banII_regexp, $sequence, $matches, PREG_OFFSET_CAPTURE);

$matches_num = count($matches[0]);

echo "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n<title>Test page</title>\n<style>td,th{padding:5px;}thead{font-weight:bold;text-align:left;}</style>\n</head>\n<body>";

echo "<p><strong>Regular Expression:</strong> $banII_regexp </p>\n<br><strong>Target Sequence:</strong> $sequence</p>\n";
echo "<p>$matches_num matches found</p>\n";
 
echo "<table><thead><tr><th>Match</th><th>Offset</th></tr></thead><tbody>\n"; // Starting to output the HTML for the table
 
foreach($matches[0] as $match){ // We cycle through the matches
    echo "<tr><td>".$match[0]."</td><td>".$match[1]."</td></tr>\n";
    
}
echo "</tbody>\n</table>\n</body>\n</html>"; // Closing the various tags opened before

?>

<?php

$sequence = "CCCCCCGAGCTCTTTTTTTTTGGGCTCCCCCCCC";

$banII_regexp ="/G[AG]GC[CT]C/";

preg_match_all($banII_regexp, $sequence, $matches, PREG_OFFSET_CAPTURE);

$matches_num = count($matches[0]);

echo "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n<title>Test page</title>\n<style>td,th{padding:5px;}thead{font-weight:bold;text-align:left;}</style>\n</head>\n<body>";

echo "<p><strong>Regular Expression:</strong> $banII_regexp </p>\n<br><strong>Target Sequence:</strong> $sequence</p>\n";

echo "<p>$matches_num matches found</p>\n";

echo "<table><thead><tr><th>Match</th><th>Offset</th></tr></thead><tbody>\n"; // Starting to output the HTML for the table

foreach($matches[0] as $match){ // We cycle through the matches

echo "<tr><td>".$match[0]."</td><td>".$match[1]."</td></tr>\n";

}

echo "</tbody>\n</table>\n</body>\n</html>"; // Closing the various tags opened before

To run this code live and see the output click here.

With a few more lines of code we can apply the same logic to perform a full restriction mapping of a much longer sequence in FASTA format.

Let us generate a restriction mapping with the BanII enzyme of the first 50.000 nucleotides of the genome of Listeria monocytogenes. You can download a FASTA file for this sequence here.

While up to now, in other cases in which we handled FASTA sequences, we mostly concentrated out attention on the header lines, this time we have to deal with the sequence itself. Since in the FASTA format the sequence is typically divided in lines of 80 nucleotides, we have to remount a full sequence from the individual lines. We will then provide a double output: a table with the matches and offsets found, as in the previous example, and the sequence in FASTA format with the matches in red, for a visual representation. Also, by using the title attribute within the same span tag used to render the match sequence in red, the user can mouseover the red match and get information about the start and end positions for the match.

This is what the source code for a matched sequence portion will look like when we run the code of the next example:


<span style="color:red;cursor:pointer;" title="19878-19884">GAGCTCC</span>

<span style="color:red;cursor:pointer;" title="19878-19884">GAGCTCC</span>

It provides an output that you can mouseover to get information. Try to mouseover the following:

GAGCTCC

You will see that on mouseover the cursor turns to a pointer thanks to the cursor:pointer style assigned to the span tag, and a tooltip appears with the value of the title attribute.

In the following code we use two indexes to track nucleotide numbers: $i and $j. $i starts from 0, it is useful to refer to nucleotide positions as PHP sees them (first character of a string is 0, not 1). $j starts from 1 and allows us to manage the sequence and provide output in a more conventional way for biological sequences, where the first nucleotide or amino-acid is considered to be 1 rather than 0.

This code has some complexity with respect to our typically short examples. It is important that you go through it carefully and try to understand what individual lines do, as this is closer (still not there, though) to the level of complexity you may encounter while writing “real” web applications for bioinformatics.


<?php

$fasta_sequence = file_get_contents("http://www.cellbiol.com/bioinformatics_web_development/uploads/AL591983.1_1-50000.fasta");
$banII_regexp ="/G[AG]GC[CT]C/";

// Storing the FASTA header and sequence into different variables

$fasta_lines = explode("\n", $fasta_sequence);

$header = ""; // We will store the header line here during the next foreach cycle
$sequence = ""; // We will store the sequence here during the next foreach cycle

foreach($fasta_lines as $line){
    if(preg_match("/^>/", $line)){  // If the line starts with a > it's the header line
        $header = $line;
    }
    elseif($line != ""){
        $sequence = $sequence.$line; // We concatenate each new sequence line in the $sequence variable
    }
}

// At this point we should have the FASTA header in the $header variable 
// and the whole sequence in the $sequence variable

// Ready to perform the match

preg_match_all($banII_regexp, $sequence, $matches, PREG_OFFSET_CAPTURE);

$matches_num = count($matches[0]); // Number of BanII cutting sites in the sequence

// We now remount the FASTA sequence for a web output

$offsets = array(); // We declare an array in which we will store all the matches offsets

foreach($matches[0] as $match){ // We go through each match, extract the offset and add it to $offsets
   $offsets[] = $match[1];
}

$end_matches = array(); // The positions at which the matches end, let's compute this

foreach($offsets as $offset){
    $end_matches[] = $offset + 6; // The BanII cutting site is 6 nucleotides long
}

$output_fasta_sequence = "<span style=\"font-family:courier\">".$header."<br>\n"; // We start by adding the header line followed by a break tag to the output FASTA sequence
// All the sequence will be in courier font

$nucleotides = str_split($sequence, 1); // We create an array with the single nucleotides of the sequence as elements

$openspan = "<span style=\"color:red;cursor:pointer;\" title=\""; // We have to place this before the start of each match
$closespan = "</span>";

$i = 0;
$j = 1;
foreach($nucleotides as $nucleotide){
    if(in_array($i, $offsets)){ // If the nucleotide is at the start of a match we open a span for color:red before it
                                // and add the correct info for the title attribute
        $endmatchpos = $j + 6; // The end of the match position calculated assuming the first nucleotide
                               // of the sequence is 1, not 0, so we use $j, not $i
        $output_fasta_sequence = $output_fasta_sequence.$openspan."$j-$endmatchpos\">".$nucleotide;
    }
    elseif(in_array($i, $end_matches)){ // If the nucleotide is at the end of a match we close the span after it
        $output_fasta_sequence = $output_fasta_sequence.$nucleotide.$closespan;
    }
    else{ // average nucleotides, not at start or end of matches
        $output_fasta_sequence = $output_fasta_sequence.$nucleotide;
    }
    if(is_int($j/80)){  // Let's introduce a break each 80 nucleotides for the sake of FASTA format
        $output_fasta_sequence = $output_fasta_sequence."<br>\n";
    }
    $i++;
    $j++;
}

$output_fasta_sequence = $output_fasta_sequence."</span>"; // Closing the very first span used to have the sequence in courier font

echo "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n<title>FASTA sequence restriction mapping</title>\n<style>td,th{padding:5px;}thead{font-weight:bold;text-align:left;}</style>\n</head>\n<body>";

echo "<p><strong>Regular Expression:</strong> $banII_regexp </p>\n";
echo "<p><span style=\"color:red;font-weight:bold;\">$matches_num BanII cutting sites found</span></p>\n";

echo "<p>$output_fasta_sequence</p>";
 
echo "<table><thead><tr><th>Ban II Site</th><th>Offset</th></tr><tbody>\n"; // Starting to output the HTML for the table
 
foreach($matches[0] as $match){ // We cycle through the matches
    // While the software numbers characters in a string starting from 0, for humans the first nucleotide of a sequence is 1
    // so we adjust the match start to provide an output for the end user, possibly a human
    $match_start = $match[1] + 1; 
    echo "<tr><td>".$match[0]."</td><td>".$match_start."</td></tr>\n";    
}

echo "</tbody>\n</table>\n</body>\n</html>"; // Closing the various tags opened before

?>

<?php

$fasta_sequence = file_get_contents("http://www.cellbiol.com/bioinformatics_web_development/uploads/AL591983.1_1-50000.fasta");

$banII_regexp ="/G[AG]GC[CT]C/";

// Storing the FASTA header and sequence into different variables

$fasta_lines = explode("\n", $fasta_sequence);

$header = ""; // We will store the header line here during the next foreach cycle

$sequence = ""; // We will store the sequence here during the next foreach cycle

foreach($fasta_lines as $line){

if(preg_match("/^>/", $line)){ // If the line starts with a > it's the header line

$header = $line;

}

elseif($line != ""){

$sequence = $sequence.$line; // We concatenate each new sequence line in the $sequence variable

}

// At this point we should have the FASTA header in the $header variable

// and the whole sequence in the $sequence variable

// Ready to perform the match

preg_match_all($banII_regexp, $sequence, $matches, PREG_OFFSET_CAPTURE);

$matches_num = count($matches[0]); // Number of BanII cutting sites in the sequence

// We now remount the FASTA sequence for a web output

$offsets = array(); // We declare an array in which we will store all the matches offsets

foreach($matches[0] as $match){ // We go through each match, extract the offset and add it to $offsets

$offsets[] = $match[1];

}

$end_matches = array(); // The positions at which the matches end, let's compute this

foreach($offsets as $offset){

$end_matches[] = $offset + 6; // The BanII cutting site is 6 nucleotides long

}

$output_fasta_sequence = "<span style=\"font-family:courier\">".$header."<br>\n"; // We start by adding the header line followed by a break tag to the output FASTA sequence

// All the sequence will be in courier font

$nucleotides = str_split($sequence, 1); // We create an array with the single nucleotides of the sequence as elements

$openspan = "<span style=\"color:red;cursor:pointer;\" title=\""; // We have to place this before the start of each match

$closespan = "</span>";

$i = 0;

$j = 1;

foreach($nucleotides as $nucleotide){

if(in_array($i, $offsets)){ // If the nucleotide is at the start of a match we open a span for color:red before it

// and add the correct info for the title attribute

$endmatchpos = $j + 6; // The end of the match position calculated assuming the first nucleotide

// of the sequence is 1, not 0, so we use $j, not $i

$output_fasta_sequence = $output_fasta_sequence.$openspan."$j-$endmatchpos\">".$nucleotide;

}

elseif(in_array($i, $end_matches)){ // If the nucleotide is at the end of a match we close the span after it

$output_fasta_sequence = $output_fasta_sequence.$nucleotide.$closespan;

}

else{ // average nucleotides, not at start or end of matches

$output_fasta_sequence = $output_fasta_sequence.$nucleotide;

}

if(is_int($j/80)){ // Let's introduce a break each 80 nucleotides for the sake of FASTA format

$output_fasta_sequence = $output_fasta_sequence."<br>\n";

}

$i++;

$j++;

}

$output_fasta_sequence = $output_fasta_sequence."</span>"; // Closing the very first span used to have the sequence in courier font

echo "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n<title>FASTA sequence restriction mapping</title>\n<style>td,th{padding:5px;}thead{font-weight:bold;text-align:left;}</style>\n</head>\n<body>";

echo "<p><strong>Regular Expression:</strong> $banII_regexp </p>\n";

echo "<p><span style=\"color:red;font-weight:bold;\">$matches_num BanII cutting sites found</span></p>\n";

echo "<p>$output_fasta_sequence</p>";

echo "<table><thead><tr><th>Ban II Site</th><th>Offset</th></tr><tbody>\n"; // Starting to output the HTML for the table

foreach($matches[0] as $match){ // We cycle through the matches

// While the software numbers characters in a string starting from 0, for humans the first nucleotide of a sequence is 1

// so we adjust the match start to provide an output for the end user, possibly a human

$match_start = $match[1] + 1;

echo "<tr><td>".$match[0]."</td><td>".$match_start."</td></tr>\n";

}

echo "</tbody>\n</table>\n</body>\n</html>"; // Closing the various tags opened before

You can run this code live by following this link.

Up to now we have handled FASTA sequences in a slight different way each time, depending of what we needed to do. However, it would be handy to have a reusable piece of code that would allow us to take a FASTA sequence in input and get the header and sequence as separate variables, for further processing, in output. We could achieve this by writing our own dedicated function, that we could call for example fasta_process(). A set of functions (or classes, still not discussed here) revolving around the same task, such as for example the handling of biological sequences, is called a library. Writing our own functions is indeed an essential skill in programming. We will learn how to do it in the next section.

Chapter Sections

[pagelist include=”435″]

[siblings]

Chapter Sections

Leave a Reply Cancel reply