4-8: Using regular expressions in PHP – metacharacters and preg_match() basics

Regular expressions are an extremely powerful tool. As this quite informative page on Wikipedia puts it, a regular expression is

“a sequence of characters that defines a search pattern”

We can indeed use regular expressions to look for patterns, sequences of characters we are interested in, in strings or text files. If you are totally new to the concept, this can be difficult to grasp at the beginning. However we promise that once you get it, you will love it and a whole world of exciting possibilities opens up.

Uses of regular expressions

You can use them to “parse” text files in search, for example, of particular sequences or chunks of characters, and set those aside for further analysis. Imagine parsing a text file with thousands of FASTA sequences and taking out all the sequence IDs for some particular purpose.

Or you can perform a smart “search and replace” in a text file. This is far more powerful than what we saw with explode(): while explode() splits a string on precisely defined substrings or characters used as delimiters, a regular expression can look for a much more loosely defined pattern, which lets us catch matching substrings whose exact sequence of characters we do not know beforehand.

In the analysis of biological entities such as nucleotide or protein sequences, particular linear (sequence-based) patterns are very often linked to interesting functional properties, for example the ability to interact with proteins to form complexes, or other regulatory roles. Just think about how gene promoters are rich in transcription factor binding sites, which can often be described by loose linear patterns and which determine whether, and how strongly, a particular gene is expressed in a given cellular environment.

Regular characters and metacharacters in regular expressions

There are two kinds of characters that can be used inside a regular expression: regular characters, which have a “literal” meaning, and metacharacters, which have a special meaning.

  • Every character that is not a metacharacter is a regular character.
  • Every regular character matches itself.
  • In order to match a metacharacter used within a regular expression literally, we need to escape it with a backslash.

All of this will become more clear by looking at the examples shown below in this section and maybe just a little bit of practice.
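
As a first taste, here is a minimal sketch of the escaping rule, using the preg_match() function that is introduced later in this section. The two patterns and the two target strings are made up for illustration. Single quotes are used around the patterns so that PHP leaves the backslash untouched.

```php
<?php
// The dot is a metacharacter: unescaped, it matches any single character.
$unescaped = '/1.5/';
// Escaped with a backslash, it only matches a literal dot.
$escaped = '/1\.5/';

// Illustrative target strings (assumptions, not from the original text).
$targets = ["1.5", "125"];

foreach ($targets as $target) {
    $loose = preg_match($unescaped, $target);
    $strict = preg_match($escaped, $target);
    echo "$target: unescaped pattern gives $loose, escaped pattern gives $strict\n";
}
```

The unescaped pattern accepts both strings, because the dot also matches the 2 in 125, while the escaped pattern only accepts the string that really contains a dot.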

Here is a summary of the most used metacharacters, thanks to Wikipedia:

Figure 4-8-1: metacharacters according to the POSIX standard. Source: Wikipedia
Figure 4-8-2: additional metacharacters, POSIX extended standard. Source: Wikipedia

The PHP preg_match() function, basic use

Regular expressions are best learned within a framework that lets us use them in practical examples. PHP has several predefined functions that take regular expressions as arguments and act on them. The most popular and widely used is preg_match(). In its most basic use, it checks whether a particular regular expression finds a match in a given string. The regular expression is passed as the first argument, the string as the second. It returns 1 if a match is found and 0 if not (and false if an error occurs), so its result can be used directly as a condition. Used in this way, it gives no information on exactly “what” in the string matched the pattern, only whether a match was found.
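
The basic call can be sketched as follows. The target string and the two patterns are chosen only for illustration. Note that preg_match() reports a match as the integer 1 and the absence of a match as the integer 0, which is why it works directly inside an if statement.

```php
<?php
$target = "Hello, world";

// First argument: the regular expression. Second argument: the string to search.
$found = preg_match("/world/", $target);
$notFound = preg_match("/planet/", $target);

var_dump($found);    // int(1)
var_dump($notFound); // int(0)

// The return value can be used directly as a condition.
if (preg_match("/world/", $target)) {
    echo "We have a match\n";
} else {
    echo "No match\n";
}
```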

Let’s get into some practical examples to better grasp what regular expressions are and how you can use them in the PHP programming language. We will slowly introduce a few metacharacters as we proceed.

The ^ and $ metacharacters

To better understand the following example:

  • In PHP, a regular expression is written as a string in which the pattern is enclosed between delimiters, most commonly forward slashes: “/sequence of characters and metacharacters/”
  • Regular expressions are case sensitive. You can make them case insensitive by adding an i after the final forward slash, see $regexp6 below
  • The ^ character will match the beginning of the target string
  • The $ character inside the expression will match the end of the target string
  • Before looking at the script output included after the code, try to guess by yourself whether each sample regular expression will match the target string or not
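
A script along these lines can be sketched as follows. The eight patterns are assumptions chosen to illustrate the points in the list; they are not necessarily the ones of the original listing. Single quotes are used so that PHP does not try to interpret the $ sign.

```php
<?php
$target = "Hello, world";

// Illustrative patterns (assumed for this sketch).
$regexp1 = '/Hello/';          // plain substring
$regexp2 = '/hello/';          // same, but lowercase h: case matters
$regexp3 = '/^Hello/';         // must be at the beginning of the string
$regexp4 = '/^world/';         // "world" is not at the beginning
$regexp5 = '/world$/';         // must be at the end of the string
$regexp6 = '/^hello/i';        // the i flag makes the match case insensitive
$regexp7 = '/^Hello, world$/'; // the whole string, from start to end
$regexp8 = '/^Hello$/';        // the string would have to be exactly "Hello"

$regexps = [
    1 => $regexp1, 2 => $regexp2, 3 => $regexp3, 4 => $regexp4,
    5 => $regexp5, 6 => $regexp6, 7 => $regexp7, 8 => $regexp8,
];

echo "The target string is $target\n";

$results = [];
foreach ($regexps as $number => $regexp) {
    $results[$number] = preg_match($regexp, $target);
    if ($results[$number]) {
        echo "Regular expression $number matched\n";
    } else {
        echo "Regular expression $number did not match\n";
    }
}
```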

Here’s the output of the script above:

The target string is Hello, world

Regular expression 1 matched

Regular expression 2 did not match

Regular expression 3 matched

Regular expression 4 did not match

Regular expression 5 matched

Regular expression 6 matched

Regular expression 7 matched

Regular expression 8 did not match

Checking if a short DNA sequence is present within a target sequence
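
A minimal sketch of such a check follows. The target sequence and the seven numbered patterns are made up for illustration, and are chosen so that only patterns 4 to 7 match. A line is printed only when a pattern matches.

```php
<?php
// Hypothetical target sequence containing an EcoRI site (GAATTC).
$target = "ATGGCCAAGCTTGAATTCTAG";

// Hypothetical patterns, numbered from 1 to 7.
$regexps = [
    1 => '/GGATCC/',   // not present in the target
    2 => '/^GAATTC/',  // present, but not at the beginning
    3 => '/gaattc/',   // wrong case, and matching is case sensitive
    4 => '/GAATTC/',   // present
    5 => '/gaattc/i',  // present once case is ignored
    6 => '/^ATG/',     // the target starts with ATG
    7 => '/TAG$/',     // the target ends with TAG
];

$matched = [];
foreach ($regexps as $number => $regexp) {
    if (preg_match($regexp, $target)) {
        $matched[] = $number;
        echo "$number - We have a match\n";
    }
}
```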

This is the output of the script:

4 – We have a match

5 – We have a match

6 – We have a match

7 – We have a match

Extracting the FASTA header line from a FASTA sequence

In the following example we obtain a FASTA sequence from the UniProt web site with file_get_contents(), as described previously. We use explode() to split the FASTA sequence into its individual lines, which are stored in a $fasta_lines array. We then loop through this array to identify the FASTA header line, that is, the one that starts with a > sign, by using preg_match().

Remember that since > is not a metacharacter, it is a regular character that will simply match itself when used in a regular expression.
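
The steps described above can be sketched as follows. To keep the sketch runnable offline, a short hard-coded record stands in for the text that file_get_contents() would download. The sequence lines are placeholders, not the real ABL1 sequence; only the header line is taken from this section.

```php
<?php
// Stand-in for the downloaded FASTA record (assumption for this sketch).
// The residue lines are placeholders, not the real ABL1 sequence.
$fasta = ">sp|P00519|ABL1_HUMAN Tyrosine-protein kinase ABL1 OS=Homo sapiens GN=ABL1 PE=1 SV=4\n"
       . "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\n"
       . "APILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ\n";

// Split the record into its individual lines.
$fasta_lines = explode("\n", $fasta);

// The header is the line that starts with a > sign.
$header = "";
foreach ($fasta_lines as $line) {
    if (preg_match('/^>/', $line)) {
        $header = $line;
        break;
    }
}

echo "The FASTA header for the selected sequence is:\n";
echo "$header\n";
```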

Executing the code above generates the following output:

The FASTA header for the selected sequence is:
>sp|P00519|ABL1_HUMAN Tyrosine-protein kinase ABL1 OS=Homo sapiens GN=ABL1 PE=1 SV=4

The . and * metacharacters

The dot . metacharacter matches every character in a string except newlines. The asterisk metacharacter * indicates that the character that precedes it in the regular expression is repeated zero or more times.

The “/.*/” regular expression means “any character zero or more times”. It will match everything, even an empty string. To get a match to a full string, from start to end, you can use “/^.*$/”.
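
A short sketch of this behavior, with a few arbitrary target strings including an empty one:

```php
<?php
$regexp = '/.*/';

// Arbitrary target strings; the last one is empty.
$targets = ["apples", "bananas", "GAATTC", ""];

$results = [];
foreach ($targets as $target) {
    $results[] = preg_match($regexp, $target);
    if (end($results)) {
        if ($target === "") {
            echo "\"$regexp\" even matches an empty string!\n";
        } else {
            echo "\"$regexp\" matches $target\n";
        }
    }
}
```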

Here’s the result:

“/.*/” matches apples

“/.*/” matches bananas

“/.*/” matches GAATTC

“/.*/” even matches an empty string!

The ? and + metacharacters

Similarly to the asterisk *, the ? and + metacharacters in a regular expression refer to the character that precedes them, adding a special meaning.

  • * the character that precedes can be repeated zero or more times
  • ? the character that precedes can be present zero times or one time. In other words, the character is optional: a match occurs whether or not it is present in the target
  • + the character that precedes can be repeated one or more times
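
The following sketch tries the ? and + quantifiers on a few short DNA strings. A line is printed only when the pattern matches.

```php
<?php
// Each entry pairs a pattern with the target string it is tested against.
$tests = [
    ['/ATG?/',  "AT"],          // G is optional and absent here
    ['/ATG?/',  "ATG"],         // G is optional and present here
    ['/ATG?C/', "ATCTT"],       // zero G between AT and C
    ['/ATG+C/', "ATGGGGGCTT"],  // one or more G between AT and C
];

$results = [];
foreach ($tests as [$regexp, $target]) {
    $results[] = preg_match($regexp, $target);
    if (end($results)) {
        echo "\"$regexp\" matches $target\n";
    }
}
```

As a contrast, “/ATG+C/” would not match ATCTT, because + requires at least one G.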

Here’s the output of the script above:

“/ATG?/” matches AT

“/ATG?/” matches ATG

“/ATG?C/” matches ATCTT

“/ATG+C/” matches ATGGGGGCTT

Defining character classes with square brackets []

By inserting a number of characters within square brackets in a regular expression, we indicate that any one of these characters is a valid match for the position where the brackets are. We are defining a class of characters. For example [ATGC] means that A, T, G or C is a valid match. The class syntax supports ranges: [a-z] means every lowercase letter, [A-Z] means every uppercase letter, [a-zA-Z] means every letter, lowercase or uppercase, [a-zA-Z0-9] means every letter or digit.
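
A sketch with a class in the middle of a short DNA pattern, tested against four sequences that differ only in the nucleotide at the bracketed position:

```php
<?php
// TT, then any one of A, G, C or T, then CC.
$regexp = '/TT[AGCT]CC/';

$targets = ["GGCTTACCTAT", "CAGTTCCCTTA", "TAGTTGCCCTT", "TCTTTTCCGCG"];

$matched = [];
foreach ($targets as $target) {
    if (preg_match($regexp, $target)) {
        $matched[] = $target;
        echo "\"$regexp\" matches $target\n";
    }
}
```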

Running the code generates the following output:

“/TT[AGCT]CC/” matches GGCTTACCTAT

“/TT[AGCT]CC/” matches CAGTTCCCTTA

“/TT[AGCT]CC/” matches TAGTTGCCCTT

“/TT[AGCT]CC/” matches TCTTTTCCGCG

Consider the cutting pattern of restriction enzyme BanII:

5′ GRGCYC 3′

For the less initiated:

  • Restriction enzymes are proteins that cut the DNA when a specific sequence pattern (the so-called “cutting site”) is present. Each enzyme has its own cutting site. Depending on the enzyme, the cutting site pattern may be very strict, with the nucleotide at every position uniquely defined (such as “GGATCC” for the enzyme BamHI), or less strict, with some fixed positions where a precise nucleotide must be present and other positions where more than one nucleotide is allowed for the enzyme to cut (such as “GRGCYC” for the enzyme BanII)
  • The “loosely defined” positions, called “degenerate” or ambiguous positions, can be specified by using the IUPAC alphabet. For example R means “either A or G” and Y means “either C or T”.
  • Patterns with degenerate positions can match several different DNA sequences, the precise number depending on the number of degenerate positions and the number of different nucleotides allowed at each position. The BanII pattern has 2 degenerate positions and each can accommodate 2 different nucleotides, so the total number of possible combinations is 2 x 2 = 4, namely: GAGCCC, GAGCTC, GGGCTC, GGGCCC. Finding all of those within a target DNA sequence is a perfect job for regular expressions, which were designed for matching loosely defined patterns rather than strictly defined strings (which can of course still be easily matched).

Let us write a regular expression that will match all possible cutting sites for the BanII enzyme within a target DNA sequence. In order to use it properly to find all matches within the target DNA sequence we do need a more advanced use of the preg_match() function. Better, we would actually want to use the related function preg_match_all() in order to find all possible matches within the target sequence rather than just the first one, as preg_match() would do.

We will leave the matching to the next section of the book, for now let us just write the regular expression by using character classes:
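
One way to write it is sketched below: R becomes the class [AG] and Y becomes the class [CT]. The variable names and the quick check against the four possible sites are additions for illustration.

```php
<?php
// BanII cutting site GRGCYC, with R = A or G and Y = C or T.
$banII = '/G[AG]GC[CT]C/';

// The four sequences the pattern is expected to cover.
$sites = ["GAGCCC", "GAGCTC", "GGGCTC", "GGGCCC"];

$allMatch = true;
foreach ($sites as $site) {
    if (!preg_match($banII, $site)) {
        $allMatch = false;
    }
}

// A sequence with T at the R position is not a BanII site.
$offTarget = preg_match($banII, "GTGCCC");

var_dump($allMatch);
var_dump($offTarget);
```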

Hopefully you start to see the power of regular expressions and classes for pattern matching in the analysis of biological sequences here, and this is just a tiny simple example. More on this example in the next section.

With this information you may try to write yourself regular expressions for:

Bca77I: WCCGGW
Bco118I: RCCGGY

Character classes shortcuts

Here are a few handy shortcuts for some frequently used character classes

\d => matches any digit, equivalent to [0-9]
\D => matches any character that is not a digit, equivalent to [^0-9] (the ^ inside square brackets has a negation meaning)
\w => matches any word character, equivalent to [A-Za-z0-9_]: letters, digits and the underscore
\W => matches any non-word character, equivalent to [^A-Za-z0-9_] (examples of non-word characters are a newline \n, a tab \t or a comma ,)
\s => matches any whitespace character, such as a space, a tab \t or a newline \n
[a-z] => lowercase letters. The PCRE engine behind preg_match() has no \l shortcut for these, so write the class itself or the POSIX class [[:lower:]]
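
A few of these shortcuts in action. The target strings are borrowed from the FASTA header seen earlier or made up for illustration; single quotes keep the backslashes intact.

```php
<?php
// \d: is there at least one digit?
$hasDigit = preg_match('/\d/', "ABL1_HUMAN");

// ^\d+$: is the whole string made of digits? No, it starts with a P.
$allDigits = preg_match('/^\d+$/', "P00519");

// ^\D+$: is the whole string free of digits?
$noDigits = preg_match('/^\D+$/', "GAATTC");

// ^\w+$: only letters, digits and underscores?
$wordOnly = preg_match('/^\w+$/', "ABL1_HUMAN");

// \W: is there at least one non-word character? The pipe | counts as one.
$hasNonWord = preg_match('/\W/', "sp|P00519");

// \s: is there any whitespace?
$hasSpace = preg_match('/\s/', "Homo sapiens");

var_dump($hasDigit, $allDigits, $noDigits, $wordOnly, $hasNonWord, $hasSpace);
```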

Quantifying characters with curly brackets {}

With *, ? and + you can specify whether a character in the regular expression should be present in the target string zero or more times, zero or one time, or one or more times. What if you wanted a character to be present 5 to 13 times? Exactly 25 times? 100 to 300 times? 5 or more times? 500 or more times? You can specify all these kinds of options with curly brackets:

{5,} the character that precedes is repeated at least 5 times
{25} exactly 25 times
{100,300} from 100 to 300 times
{5,13} from 5 to 13 times
etc…

Let us make an example with some humor:
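
The sketch below is one possible version of such an example; the pattern, the allowed range of 2 to 6 and the test strings are all assumptions made for illustration. The strings are built with str_repeat() so that the number of O characters is explicit.

```php
<?php
// YAH followed by 2 to 6 O characters, and nothing else.
// The ^ and $ anchors matter: without them, a longer run of O characters
// would still contain a run of 2 to 6 and the pattern would match anyway.
$regexp = '/^YAHO{2,6}$/';

$results = [];
foreach ([1, 2, 6, 12] as $count) {
    $target = "YAH" . str_repeat("O", $count);
    $results[$count] = preg_match($regexp, $target);
    if ($results[$count]) {
        echo "$target: success!\n";
    } else {
        echo "$target: wrong amount of enthusiasm\n";
    }
}
```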

Please feel free to run the code above yourself, play with it and generate variations. What about checking for success with YAHOOOOOOOOOOOO…

We have covered the basics here. We know how to write simple regular expressions and how to check whether a pattern defined by a regular expression finds a match inside a string. However, we have not yet learned how to extract from the string what actually matched, which is an essential skill. To do that, we need to call the preg_match() function with a third argument. We could also add call flags to see exactly where the match is inside the string. Last but not least, we may want to retrieve all the matches of the regular expression in the string, not just the first one as preg_match() does. This can be accomplished with the preg_match_all() function. We will start to tackle these topics in the next section, stay tuned!
