4-10: Regular expressions in PHP – retrieving all matches to a pattern in a string with preg_match_all() including overlapping matches

In this section we will explore the use of the preg_match_all() PHP function to retrieve all the occurrences of a pattern in a target string as well as the use of lookahead expressions to include overlapping matches.

You may remember that in the previous section we mentioned that preg_match() only retrieves the first match to a pattern in a string.

For example, the following regular expression:

$regexp = "/[GT]AATT[CA]/";

used in a preg_match() call will match GAATTC (the first match) but not TAATTA (the second match) in the following string:

CCCGAATTCTTTCCCTAATTATTTT
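A minimal sketch of that call is given here. The variable names $sequence, $found and $matches are illustrative assumptions, not taken from the original script.

```php
<?php
// Pattern and target string from the text above.
$regexp   = "/[GT]AATT[CA]/";
$sequence = "CCCGAATTCTTTCCCTAATTATTTT";

// preg_match() stops at the first match and returns 1 (match) or 0 (no match).
$found = preg_match($regexp, $sequence, $matches);

var_dump($matches);
```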

The $matches array, as seen with a var_dump, looks like this:

array(1) { [0]=> string(6) "GAATTC" }

Since there are no capture groups, $matches only has one element. This element is a string that represents the first match to the whole pattern in the target sequence.

preg_match_all() is very similar to preg_match(), with a catch

There are several cases in which we need to be able to match and retrieve all the matches to a pattern rather than just the first one. In these cases we can use preg_match_all() instead of preg_match().

A call to preg_match_all() takes the same arguments as a call to preg_match(). Here’s an example call.
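A minimal sketch of such a call, assuming the same pattern and target string used in the preg_match() example ($sequence and $count are illustrative variable names):

```php
<?php
$regexp   = "/[GT]AATT[CA]/";
$sequence = "CCCGAATTCTTTCCCTAATTATTTT";

// Same argument order as preg_match(): pattern, subject, then the array
// that receives the results. The return value is the number of full matches.
$count = preg_match_all($regexp, $sequence, $matches);

var_dump($matches);
```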

This time the output, again a var_dump of the $matches array, looks like this:

array(1) { [0]=> array(2) { [0]=> string(6) "GAATTC" [1]=> string(6) "TAATTA" } }

As with preg_match(), the $matches array contains just one element, because the regular expression used in this example has no capture groups. There is a catch, though: this time the element is itself an array rather than a string. The elements of this array, accessible at $matches[0], are all the matches, as strings, to the whole search pattern. In this particular case there are 2 matches, GAATTC and TAATTA.

In the following example we perform the same match but refine the output a bit:
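One way such a script might look is sketched here. The plain-text bullet formatting and the variable names are assumptions, chosen to produce a listing in the same shape as the output shown.

```php
<?php
$regexp   = "/[GT]AATT[CA]/";
$sequence = "CCCGAATTCTTTCCCTAATTA";

$count = preg_match_all($regexp, $sequence, $matches);

echo "$regexp found $count matches in $sequence, namely:\n\n";

// $matches[0] holds every match to the whole pattern, in order of appearance.
foreach ($matches[0] as $match) {
    echo "  • $match\n";
}
```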

The output:

/[GT]AATT[CA]/ found 2 matches in CCCGAATTCTTTCCCTAATTA, namely:

  • GAATTC
  • TAATTA

Using preg_match_all() with regular expressions containing capture groups

As with preg_match(), if the regular expression contains capture groups defined with round parentheses, a new element is added to the $matches array for each capture group, in the order in which the groups appear in the regular expression. For preg_match_all(), however, these elements are arrays rather than strings, and each one contains all the matches (as strings) to that capture group in the target string.

If, for example, the regular expression contains only one capture group, the structure of the $matches array, a two-element array, will be as follows:

[(first match to whole expression, second match to whole expression, third match to whole expression…), (first match to capture group1, second match to capture group1, third match to capture group 1…)]

In this example, the second match to the first (and only) capture group could be accessed at $matches[1][1], since PHP array indices start at 0.
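As a small illustration with one capture group, the sketch here wraps the first base of the pattern in parentheses. Which part of the pattern to capture is an assumption made for this example only.

```php
<?php
// One capture group around the first base of the site.
$regexp   = "/([GT])AATT[CA]/";
$sequence = "CCCGAATTCTTTCCCTAATTATTTT";

preg_match_all($regexp, $sequence, $matches);

// $matches[0]: matches to the whole expression -> GAATTC, TAATTA
// $matches[1]: matches to capture group 1      -> G, T
echo $matches[0][1], "\n"; // second match to the whole expression
echo $matches[1][1], "\n"; // second match to capture group 1
```

Both indices are zero-based: the first selects the group (0 for the whole expression), the second selects the match.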

If the regular expression contains two capture groups, the structure of the $matches array, a three-element array, will be as follows:

[(first match to whole expression, second match to whole expression, third match to whole expression…), (first match to capture group1, second match to capture group1, third match to capture group 1…), (first match to capture group2, second match to capture group2, third match to capture group 2…)]

In this last example, the third match to the second capture group could be accessed at $matches[2][2].
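The same layout can be checked with two capture groups. The sequence used here is a made-up example (the earlier target string with GAATTA appended) so that there are three matches to index, and the choice of captured bases is likewise an assumption.

```php
<?php
// Two capture groups: the first and the last base of the site.
$regexp   = "/([GT])AATT([CA])/";
$sequence = "CCCGAATTCTTTCCCTAATTATTTTGAATTA";

$count = preg_match_all($regexp, $sequence, $matches);

// Index $i selects the same match in every sub-array of $matches.
for ($i = 0; $i < $count; $i++) {
    printf(
        "%s: first base %s, last base %s\n",
        $matches[0][$i],
        $matches[1][$i],
        $matches[2][$i]
    );
}
```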

Finding overlapping matches with a lookahead regular expression in PHP

Sometimes, as often happens when searching for a pattern in nucleotide sequences, some of the matches overlap with each other.

For example in the following sequence:

CCCGAATTAATTCCC

this pattern:

"/[GT]AATT[CA]/"

could in principle be found twice (the brackets mark each match):

CCC[GAATTA]ATTCCC

and

CCCGAAT[TAATTC]CC

However, since the two matches overlap, the following call to preg_match_all():

preg_match_all("/[GT]AATT[CA]/", "CCCGAATTAATTCCC", $matches);

will only find the first one: once a match is found and stored, the search resumes from the character that follows the end of that match.

Consider the following example:
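A sketch of such a script, assuming plain-text output and illustrative variable names:

```php
<?php
$regexp   = "/[GT]AATT[CA]/";
$sequence = "CCCGAATTAATTCCC";

$count = preg_match_all($regexp, $sequence, $matches);

echo "$regexp found $count matches in $sequence, namely:\n\n";
foreach ($matches[0] as $match) {
    echo "  • $match\n";
}

// Dump the raw structure as well, to see what was stored.
echo "\nThe var_dump of matches:\n";
var_dump($matches);
```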

The output of this code:

/[GT]AATT[CA]/ found 1 matches in CCCGAATTAATTCCC, namely:

  • GAATTA

The var_dump of matches:
array(1) {
  [0]=>
  array(1) {
    [0]=>
    string(6) "GAATTA"
  }
}

In the search of patterns within biological sequences this is a serious limitation.

In order to retrieve all the matches, even if overlapping, we can use a lookahead in the regular expression. Lookaheads are a special case of “lookaround”; see this page on regular-expressions.info for an in-depth discussion of lookaround regular expressions.

As regular-expressions.info puts it:

“Lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not.”

There are two important things to consider here:

  • First, lookarounds “give up the match”. Therefore, if the whole expression is a lookahead, each match to the whole expression is an empty string: preg_match_all() still adds an entry to $matches[0] for every match, but that entry contains no characters. To capture the matched text we need to explicitly use a capture group inside the lookahead. With one capture group, the matches are found in $matches[1].
  • Second, lookaheads do not “consume” characters. This means that even overlapping matches will be found.

A simple example is worth a thousand words: let’s run the last code example again, with a lookahead expression and some slight modifications.

We transform the original regular expression

"/[GT]AATT[CA]/"

to:

"/(?=([GT]AATT[CA]))/"

(?=             start of the lookahead: look ahead to see whether what follows matches
(               start of the capture group
[GT]AATT[CA]    our usual expression
)               end of the capture group
)               end of the lookahead

Credits: this expression was inspired by this brilliant page on StackOverflow
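A sketch of the modified script, under the same assumptions as before (plain-text output, illustrative variable names). The main change besides the pattern is that the loop reads from $matches[1] rather than $matches[0].

```php
<?php
// The original pattern wrapped in a capture group inside a lookahead.
$regexp   = "/(?=([GT]AATT[CA]))/";
$sequence = "CCCGAATTAATTCCC";

$count = preg_match_all($regexp, $sequence, $matches);

echo "$regexp found $count matches in $sequence, namely:\n\n";

// The whole-expression matches in $matches[0] are empty strings,
// so the sites are read from capture group 1 instead.
foreach ($matches[1] as $match) {
    echo "  • $match\n";
}

echo "\nThe var_dump of matches:\n";
var_dump($matches);
```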

This is the output of the script:

/(?=([GT]AATT[CA]))/ found 2 matches in CCCGAATTAATTCCC, namely:

  • GAATTA
  • TAATTC

The var_dump of matches:
array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(0) ""
    [1]=>
    string(0) ""
  }
  [1]=>
  array(2) {
    [0]=>
    string(6) "GAATTA"
    [1]=>
    string(6) "TAATTC"
  }
}

In the last example, note how $matches[0], the placeholder for matches to the whole expression, contains 2 elements, but they are empty strings. This is because a lookahead matches without consuming any characters, as explained above, so each match to the whole expression has zero length. In contrast, the matched sequences are found in $matches[1], because we used a capture group.

You now have all the knowledge to find all the matches to a pattern in a string, even overlapping matches.

In the next section we look at the PREG_OFFSET_CAPTURE flag, which can be passed to preg_match() or preg_match_all() to capture not only the matches but also the position in the string at which each one starts. This is often useful when searching for a pattern in a biological sequence, as well as in many cases unrelated to biology or bioinformatics.
