4-9: Regular expressions in PHP – retrieving matches to patterns with preg_match() called with the $matches argument

In the previous section we learned to write simple regular expressions and to check whether a pattern has a match within a string. We did not learn, though, how to retrieve the part of the string that actually matched the pattern.

Calling preg_match() with the $matches array as a third argument

In order to retrieve the actual match we need to call the preg_match() function with a third argument, typically called “$matches” (but you may name it as you wish).

Any match to the regular expression (first argument of preg_match()) found in the string (second argument) is stored in $matches (third argument), which is an array with a particular structure. Let us start with a simple example in which we use var_dump() to explore the structure of the $matches array after a match has occurred.
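The original listing is not reproduced in this text, so here is a minimal sketch of what such a script could look like. The pattern /[GT]AATT[CA]/ and the variable names are assumptions, chosen only because they are consistent with the two sequences and matches discussed in this section.

```php
<?php
// Hypothetical pattern: G or T, then AATT, then C or A.
$pattern = '/[GT]AATT[CA]/';

// Two example sequences (assumed from the output discussed in the text).
$sequences = [
    1 => 'CCCGAATTCTTT',
    2 => 'CCCTAATTATTT',
];

foreach ($sequences as $number => $sequence) {
    // preg_match() returns 1 on a match and fills $matches by reference.
    if (preg_match($pattern, $sequence, $matches)) {
        echo "Match to sequence $number\n";
        var_dump($matches);
        echo "\n";
    }
}
```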

By running this script we get the following:

Match to sequence 1
array(1) {
  [0]=>
  string(6) "GAATTC"
}

Match to sequence 2
array(1) {
  [0]=>
  string(6) "TAATTA"
}

As we can see and as expected, a match to the regular expression was found in both sequences.

How is the $matches array structured? Since the regular expression did not contain any portion within round parentheses (used to create “capture groups”, more on this in a moment), $matches contains only one element, namely the match found for the whole regular expression. Each var_dump() output starts with “array(1)”, which means that $matches is an array containing 1 element. For both sequences this single element, which therefore has index 0 (“[0]=>”), is a string 6 characters long (“string(6)”). The actual string differs between the two sequences, though: it is “GAATTC” for the first sequence and “TAATTA” for the second.

To rephrase this: if the regular expression (first argument of preg_match()) does not contain any capture groups, which are defined by enclosing a portion of it in round parentheses, the $matches array passed as third argument will contain only one element: the part of the string (second argument) that matched the pattern defined by the whole regular expression.

There you have it, you can not only know if a regular expression finds a match inside a string, but also know what exactly this match was.

Let’s run the same example as above with slight modifications to provide a cleaner output and skip the var_dump:
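A possible version of the modified script is sketched below. As before, the pattern and the sequences are assumptions consistent with the output reported in the text; the $found array is an extra, added only so that the results can be inspected afterwards.

```php
<?php
// Hypothetical pattern and sequences, chosen to agree with the output in the text.
$pattern = '/[GT]AATT[CA]/';
$sequences = [1 => 'CCCGAATTCTTT', 2 => 'CCCTAATTATTT'];

$found = []; // collected only so the result can be inspected afterwards

foreach ($sequences as $number => $sequence) {
    if (preg_match($pattern, $sequence, $matches)) {
        // $matches[0] is the part of $sequence matched by the whole pattern.
        echo "Match to sequence $number\n";
        echo "Match within sequence $sequence is {$matches[0]}\n\n";
        $found[$number] = $matches[0];
    }
}
```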

We get:

Match to sequence 1
Match within sequence CCCGAATTCTTT is GAATTC

Match to sequence 2
Match within sequence CCCTAATTATTT is TAATTA

You should be aware that preg_match() will only find the first match to the regular expression within the string.

In the example above, if the target sequence had been the following, which contains two possible matches to the pattern (GAATTC and TAATTA):

CCCGAATTCTTTCCCTAATTATTTT

only the first match, GAATTC, would have been found by the preg_match() call.
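This behaviour can be checked with a short sketch. The pattern is again the hypothetical /[GT]AATT[CA]/ used for illustration, not necessarily the one from the original listing.

```php
<?php
$pattern = '/[GT]AATT[CA]/'; // hypothetical pattern
$sequence = 'CCCGAATTCTTTCCCTAATTATTTT';

// preg_match() stops at the first match, so TAATTA is never reported.
$count = preg_match($pattern, $sequence, $matches);

echo "Number of matches reported: $count\n";
echo "First match: {$matches[0]}\n";
```

The return value is 1 even though the sequence contains two stretches that fit the pattern.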

Defining capture groups with round parentheses within a regular expression

Rather than recovering the match to the whole pattern defined in the regular expression, in many cases we may be interested in recovering the match to only a part of the pattern. We may use a pattern to identify a part of a string that contains the bit (or the bits) of information we are actually interested in.

To make this clear with an example, let’s consider a hypothetical header line of a FASTA sequence.

>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin

The GI is the number that follows “gi|” (here 197107235), a number we may want to extract from the header. Learn more about GIs here.

We could indeed write a regular expression along these lines to catch it:

“/\d+/”

Which means, as you should know by now, “any digit, repeated one or more times”.

This is a very broad pattern: it would match 5, 33, 453672, etc. It does catch our GI number in this specific header, but only “by chance”, because the GI happens to be the first run of digits in the line.

Here’s the code we could use:
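A minimal sketch of such code, assuming the header is held in a variable called $header (a name chosen here for illustration):

```php
<?php
$header = '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

// \d+ matches the first run of one or more digits in the header.
if (preg_match('/\d+/', $header, $matches)) {
    echo $matches[0], "\n";
}
```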

which gives the following output:

197107235

In this other hypothetical header:

>pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin

the very same pattern will catch “3”, the first run of digits that preg_match() encounters. This is not the GI number we wanted, which is in fact not present at all in this header. Try this:
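A sketch of this test could look as follows; only the header has changed with respect to the previous case.

```php
<?php
$header = '>pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

// The first run of digits is the "3" at the start of the PDB ID 3CHW.
if (preg_match('/\d+/', $header, $matches)) {
    echo $matches[0], "\n";
}
```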

As a general rule, the more specific the pattern is, and the more “context information” it contains, the higher the chances to really find what we are looking for.

In order to write a pattern that is very specific for a GI, one that will catch a number in a FASTA header only if it really is a GI, let’s take advantage of the syntactic context in which the GI number is normally embedded in FASTA headers, and write a pattern that includes this context information.

Check the context of the GI number:

>gi|197107235|

Let’s write a regular expression that includes this context:

However, if we run the usual example on it:
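The sketch below assumes that the context-aware pattern is /gi\|\d+\|/, which is consistent with the result reported in the text but is not taken from the original listing. Note that the pipe character has to be escaped, since an unescaped | means alternation in a regular expression.

```php
<?php
$header = '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

// Assumed pattern: literal "gi|", one or more digits, literal "|".
$pattern = '/gi\|\d+\|/';

if (preg_match($pattern, $header, $matches)) {
    echo $matches[0], "\n";
}

// The header without a GI no longer produces a false positive.
$noGi = '>pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';
$matchedNoGi = preg_match($pattern, $noGi);
```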

This is what we get:

gi|197107235|

See how we now have the number we wanted, but it is included within the context we defined in the regular expression. Not surprisingly, the only element of the $matches array is now the match to the whole regular expression.

In order to obtain the number only, we can use a capture group within the regular expression. This is done by embedding the part of the expression that matches the number within round parentheses.

The regular expression therefore becomes:

Grouping elements with parentheses without creating a capture group

It is worth mentioning that sometimes you may want to group characters with parentheses without creating a capture group. In this case you should include ?: right after the first parenthesis. As an example, if you want to include in a pattern an optional part “ABC”, but you do not want to capture it in the $matches array, you may write:

(?:ABC)?

Don’t confuse the ? right after the opening parenthesis with the one at the end, as they have different meanings. The first one is part of the ?: marker that makes the group non-capturing; the second one means that what precedes it is optional.
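To see the difference in practice, here is a small sketch that combines the optional non-capturing group with an ordinary capture group. The pattern and the test strings are made up for illustration.

```php
<?php
// (?:ABC)? groups an optional "ABC" without capturing it;
// (\d+) is an ordinary capture group.
$pattern = '/^(?:ABC)?(\d+)$/';

preg_match($pattern, 'ABC123', $withPrefix);
preg_match($pattern, '123', $withoutPrefix);

print_r($withPrefix);    // element 0 is ABC123, element 1 is 123
print_r($withoutPrefix); // element 0 is 123, element 1 is 123
```

In both cases the array has only two elements: the non-capturing group never gets an element of its own.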

Now here’s the trick: for every capture group we define within the regular expression (and we can create as many as we like), a new element is automatically added to the $matches array, containing the match to that capture group only. When capture groups are used, the first element of the $matches array is still the match to the whole expression, while the subsequent elements are the matches to the capture groups, in the order in which they are used in the regular expression.

If we use just one capture group, $matches[0] will contain the match to the whole expression and $matches[1] will contain the match to the capture group.

If we use 2 capture groups (two sets of parentheses), the $matches array will contain 3 elements: $matches[0] will be the match to the whole expression, $matches[1] will be the match to the first capture group and $matches[2] will be the match to the second capture group. The concept extends to any number of capture groups we may want or need to use in our expression.

Here’s a code example to capture the GI number cleanly:
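The listing itself is missing from this text; the following sketch assumes the pattern /gi\|(\d+)\|/, which agrees with the output reported right after it.

```php
<?php
$header = '>gi|197107235|pdb|3CHW|P Chain P, Complex Of Dictyostelium Discoideum Actin';

// The parentheses around \d+ create capture group 1.
$pattern = '/gi\|(\d+)\|/';

if (preg_match($pattern, $header, $matches)) {
    echo "Match to the entire pattern:\n{$matches[0]}\n\n";
    echo "Just the GI number:\n{$matches[1]}\n\n";
    echo "The var_dump of the \$matches array:\n";
    var_dump($matches);
}
```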

The output:

Match to the entire pattern:
gi|197107235|

Just the GI number:
197107235

The var_dump of the $matches array:
array(2) {
  [0]=>
  string(13) "gi|197107235|"
  [1]=>
  string(9) "197107235"
}

See how the $matches array now contains two elements instead of just one. The second element contains the match to the first (and in this example, only) capture group we used inside the regular expression. As in the previous examples, the first element of $matches still contains the match to the whole regular expression.

Let us consider the following FASTA sequence:

>gi|28373620|pdb|1MA9|A Chain A, Crystal Structure Of The Complex Of Human Vitamin D Binding Protein And Rabbit Muscle Actin
LERGRDYEKNKVCKEFSHLGKEDFTSLSLVLYSRKFPSGTFEQVSQLVKEVVSLTEACCAEGADPDCYDT
RTSALSAKSCESNSPFPVHPGTAECCTKEGLERKLCMAALKHQPQEFPTYVEPTNDEICEAFRKDPKEYA
NQFMWEYSTNYGQAPLSLLVSYTKSYLSMVGSCCTSASPTVCFLKERLQLKHLSLLTTLSNRVCSQYAAY
GEKKSRLSNLIKLAQKVPTADLEDVLPLAEDITNILSKCCESASEDCMAKELPEHTVKLCDNLSTKNSKF
EDCCQEKTAMDVFVCTYFMPAAQLPELPDVELPTNKDVCDPGNTKVMDKYTFELSRRTHLPEVFLSKVLE
PTLKSLGECCDVEDSTTCFNAKGPLLKKELSSFIDKGQELCADYSENTFTEYKKKLAERLKAKLPEATPT
ELAKLVNKRSDFASNCCSINSPPLYCDSEIDAELKNIL

You can find a FASTA file with this sequence here: gi-28373620.fasta

Let us write a regular expression that will allow us to catch both the GI ID and the PDB ID. This is the relevant part of the header we want to focus on, that provides all the context for the two IDs:

>gi|28373620|pdb|1MA9|

Note that while the GI ID is made up of digits only, PDB identifiers are made up of capital letters or digits, and they are always 4 characters long. Here is a good regular expression for the job:

Now that we have a good regular expression we can write a little script to parse the FASTA sequence file and extract the GI ID and the PDB ID.
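A sketch of such a script is given below. The pattern is an assumption that fits the description above (digits for the GI ID, exactly four capital letters or digits for the PDB ID). To keep the example self-contained, the header and the first two sequence lines are embedded in the script instead of being read from the file.

```php
<?php
// With the real file one could instead use:
// $lines = file('gi-28373620.fasta', FILE_IGNORE_NEW_LINES);
$fasta = <<<'FASTA'
>gi|28373620|pdb|1MA9|A Chain A, Crystal Structure Of The Complex Of Human Vitamin D Binding Protein And Rabbit Muscle Actin
LERGRDYEKNKVCKEFSHLGKEDFTSLSLVLYSRKFPSGTFEQVSQLVKEVVSLTEACCAEGADPDCYDT
RTSALSAKSCESNSPFPVHPGTAECCTKEGLERKLCMAALKHQPQEFPTYVEPTNDEICEAFRKDPKEYA
FASTA;
$lines = explode("\n", $fasta);

// Assumed pattern: group 1 is the GI ID, group 2 is the PDB ID.
$pattern = '/>gi\|(\d+)\|pdb\|([A-Z0-9]{4})\|/';

foreach ($lines as $line) {
    if (preg_match($pattern, $line, $matches)) {
        echo "GI ID: {$matches[1]}\n\n";
        echo "PDB ID: {$matches[2]}\n\n";
        echo "Whole match: {$matches[0]}\n";
        break; // only the header line is of interest
    }
}
```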

Here is the output of this script:

GI ID: 28373620

PDB ID: 1MA9

Whole match: >gi|28373620|pdb|1MA9|

As a final example for this section, let’s unleash the power of scripting and regular expressions to extract all the GI IDs and PDB IDs from a FASTA file containing several sequences.

You can view or download the file from this link. If you look carefully at the headers of the sequences in this file, you may notice that while all of them include a GI ID, only some of them include a PDB ID. For this reason, in order to be sure to catch all the GI IDs and all the PDB IDs, we cannot use the compound regular expression from the previous example, as that expression assumes that both IDs are present.

Let us write a new regular expression in which the PDB ID part is optional.
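One way to do this, sketched here as an assumption rather than as the original solution, is to wrap the whole “pdb|XXXX|” part in an optional non-capturing group. The first header below is shortened and the second one is made up to show a record without a PDB ID.

```php
<?php
// Assumed pattern: the "pdb|XXXX|" part sits in an optional
// non-capturing group, with the PDB ID itself as capture group 2.
$pattern = '/>gi\|(\d+)\|(?:pdb\|([A-Z0-9]{4})\|)?/';

$headers = [
    '>gi|28373620|pdb|1MA9|A Chain A, Crystal Structure Of The Complex',
    '>gi|12345|ref|XX_000000| made-up header without a PDB ID',
];

$results = [];
foreach ($headers as $header) {
    if (preg_match($pattern, $header, $matches)) {
        $gi = $matches[1];
        // Group 2 is missing from $matches when the optional part did not match.
        $pdb = $matches[2] ?? '';
        $results[] = [$gi, $pdb];
        echo "GI ID: $gi, PDB ID: ", ($pdb !== '' ? $pdb : 'none'), "\n";
    }
}
```

The ?? operator supplies an empty string for the PDB ID whenever the second capture group took no part in the match.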

We can now write the code that parses the FASTA file with several sequences. The results will be presented in a table; check out this page on W3Schools for how to generate a table in HTML. The thead and tbody tags (optional in tables) are described here.

Also note that we store all the results in an array called $ids. The elements of this $ids array are themselves arrays of two elements, the first being the GI ID and the second the PDB ID of a sequence. So the structure of the $ids array can be exemplified as follows:

[(GI,PDB),(GI,PDB),(GI,PDB),(GI,PDB),(GI,PDB),(GI,PDB), etc…]
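The script could be sketched as follows. Since the linked file is not available here, a short made-up multi-sequence FASTA text (with placeholder sequences) stands in for it, and the pattern is the assumed one with an optional PDB part.

```php
<?php
// Made-up FASTA text standing in for the linked file; sequences are placeholders.
$fasta = <<<'FASTA'
>gi|28373620|pdb|1MA9|A Chain A, first example record
ACDEFGHIKL
>gi|197107235|pdb|3CHW|P Chain P, second example record
MNPQRSTVWY
>gi|12345|ref|XX_000000| made-up record without a PDB ID
ACDEFGHIKL
FASTA;

// Assumed pattern: GI ID in group 1, optional PDB ID in group 2.
$pattern = '/^>gi\|(\d+)\|(?:pdb\|([A-Z0-9]{4})\|)?/';

$ids = []; // each element is [GI ID, PDB ID]
foreach (explode("\n", $fasta) as $line) {
    if (preg_match($pattern, $line, $matches)) {
        $ids[] = [$matches[1], $matches[2] ?? ''];
    }
}

// Render the collected IDs as an HTML table with thead and tbody.
echo "<table>\n";
echo "<thead><tr><th>GI ID</th><th>PDB ID</th></tr></thead>\n";
echo "<tbody>\n";
foreach ($ids as [$gi, $pdb]) {
    echo '<tr><td>', htmlspecialchars($gi), '</td><td>', htmlspecialchars($pdb), "</td></tr>\n";
}
echo "</tbody>\n</table>\n";
```

With a real file, the embedded text would be replaced by reading the lines with file() and the FILE_IGNORE_NEW_LINES flag.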

You can do a copy-paste and run the script on your own server, or run a demo here.

If you are able to understand the last script example, you have come a long way in your PHP and regular expressions learning journey. Congratulations!

In the next section we will learn how to extract all the matches to a regular expression in a string with preg_match_all(), rather than just the first match, as preg_match() does. Keep reading.
