{"id":1113,"date":"2017-03-07T15:01:03","date_gmt":"2017-03-07T15:01:03","guid":{"rendered":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/?page_id=1113"},"modified":"2017-03-17T14:33:54","modified_gmt":"2017-03-17T14:33:54","slug":"regular-expressions-in-php-retrieving-all-matches-to-a-pattern-in-a-string-with-preg_match_all-including-overlapping-matches","status":"publish","type":"page","link":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/chapter-4-adding-a-dynamic-layer-introducing-the-php-programming-language\/regular-expressions-in-php-retrieving-all-matches-to-a-pattern-in-a-string-with-preg_match_all-including-overlapping-matches\/","title":{"rendered":"4-10: Regular expressions in PHP &#8211; retrieving all matches to a pattern in a string with preg_match_all() including overlapping matches"},"content":{"rendered":"<p>In this section we will explore the use of the preg_match_all() PHP function to retrieve all the occurrences of a pattern in a target string as well as the use of lookahead expressions to include overlapping matches.<\/p>\n<p>You may remember that in the <a href=\"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/chapter-4-adding-a-dynamic-layer-introducing-the-php-programming-language\/regular-expressions-in-php-retrieving-matches-to-patterns-with-preg_match-called-with-the-matches-argument\/\">previous section<\/a> we mentioned that preg_match() retrieves only the first match to a pattern in a string.<\/p>\n<p>For example, the following regular expression:<\/p>\n<p>$regexp = &#8220;\/[GT]AATT[CA]\/&#8221;;<\/p>\n<p>used in a preg_match() call will match GAATTC (the first match) but not TAATTA (the second match) in the following string:<\/p>\n<p>CCC<span style=\"color:red;\">GAATTC<\/span>TTTCCC<span style=\"color:red;\">TAATTA<\/span>TTTT<\/p>\n<pre lang=\"php\"><code>\r\n<?php\r\n\r\n$regexp = \"\/[GT]AATT[CA]\/\";\r\n\r\n$sequence = \"CCCGAATTCTTTCCCTAATTA\";\r\n\r\npreg_match($regexp, 
$sequence, $matches);\r\n\r\nvar_dump($matches);\r\n\r\n?>\r\n<\/code><\/pre>\n<p>The $matches array as seen with a var_dump looks like this (the output of this script):<\/p>\n<p>array(1) { [0]=> string(6) &#8220;GAATTC&#8221; }<\/p>\n<p>Since there are no capture groups, $matches has only one element. This element is <strong>a string<\/strong> that represents the first match to the whole pattern in the target sequence.<\/p>\n<h2>preg_match_all() is very similar to preg_match(), with a catch<\/h2>\n<p>There are many cases in which we need to retrieve all the matches to a pattern rather than just the first one. In these cases we can use preg_match_all() instead of preg_match().<\/p>\n<p>A call to preg_match_all() is written exactly like a call to preg_match(). Here&#8217;s an example call.<\/p>\n<pre lang=\"php\"><code>\r\n<?php\r\n\r\n$regexp = \"\/[GT]AATT[CA]\/\";\r\n\r\n$sequence = \"CCCGAATTCTTTCCCTAATTA\";\r\n\r\npreg_match_all($regexp, $sequence, $matches);\r\n\r\nvar_dump($matches);\r\n\r\n?>\r\n<\/code><\/pre>\n<p>This time the output, again a var_dump of the $matches array, looks like this:<\/p>\n<p>array(1) { [0]=> array(2) { [0]=> string(6) &#8220;GAATTC&#8221; [1]=> string(6) &#8220;TAATTA&#8221; } }<\/p>\n<p>As with preg_match(), the $matches array contains just one element, as there are no capture groups in the regular expression used in this example. <strong>There is a catch, though: this time, this element is itself an array rather than a string<\/strong>. The elements of this array, accessible at $matches[0], are all the matches, as strings, to the whole search pattern. 
In this particular case there are 2 matches, GAATTC and TAATTA.<\/p>\n<p>In the following example we perform the same match but refine the output a bit:<\/p>\n<pre lang=\"php\"><code>\r\n<?php\r\n\r\n$regexp = \"\/[GT]AATT[CA]\/\";\r\n\r\n$sequence = \"CCCGAATTCTTTCCCTAATTA\";\r\n\r\npreg_match_all($regexp, $sequence, $matches);\r\n\r\n$matches_num = count($matches[0]); \/\/ How many matches were found?\r\n\/\/ $matches[0] is now an array whose elements are the various matches found by the pattern in the target string\r\n\r\necho \"<p>\\n$regexp found $matches_num matches in $sequence, namely:\\n<ul>\\n\";\r\n\r\nforeach($matches[0] as $match){\r\n    echo \"<li>$match<\/li>\\n\";\r\n}\r\n\r\necho \"<\/ul>\";\r\n\r\n?>\r\n<\/code><\/pre>\n<p>The output:<\/p>\n<p>\n\/[GT]AATT[CA]\/ found 2 matches in CCCGAATTCTTTCCCTAATTA, namely:<\/p>\n<ul>\n<li>GAATTC<\/li>\n<li>TAATTA<\/li>\n<\/ul>\n<h2>Using preg_match_all() with regular expressions containing capture groups<\/h2>\n<p>Similarly to what happens with preg_match, if the regular expression contains capture groups defined with round parentheses, for each capture group, in the order in which they are used in the regular expression, a new element will be added to the $matches array. 
For preg_match_all(), however, these elements are arrays rather than strings, and will contain all the matches to the capture group in the target string (as strings, hopefully this is not too confusing).<\/p>\n<p>If for example the regular expression contains only one capture group, the structure of the $matches array &#8211; a 2-element array &#8211; will be as follows:<\/p>\n<p>[(first match to whole expression, second match to whole expression, third match to whole expression&#8230;), (first match to capture group 1, second match to capture group 1, third match to capture group 1&#8230;)]<\/p>\n<p>In this example, the second match to the first (and only) capture group could be accessed at $matches[1][1], since array indexes start at 0.<\/p>\n<p>If the regular expression contains two capture groups, the structure of the $matches array &#8211; a 3-element array &#8211; will be as follows:<\/p>\n<p>[(first match to whole expression, second match to whole expression, third match to whole expression&#8230;), (first match to capture group 1, second match to capture group 1, third match to capture group 1&#8230;), (first match to capture group 2, second match to capture group 2, third match to capture group 2&#8230;)]<\/p>\n<p>In this last example, the third match to the second capture group could be accessed at $matches[2][2].<\/p>\n<h2>Finding overlapping matches with a lookahead regular expression in PHP<\/h2>\n<p>Sometimes, as often happens when searching for a pattern in nucleotide sequences, some of the matches overlap with each other.<\/p>\n<p>For example in the following sequence:<\/p>\n<p>CCCGAATTAATTCCC<\/p>\n<p>this pattern:<\/p>\n<p>&#8220;\/[GT]AATT[CA]\/&#8221;<\/p>\n<p>could in principle be found 2 times:<\/p>\n<p>CCC<span style=\"color:red;\">GAATTA<\/span>ATTCCC<\/p>\n<p>and<\/p>\n<p>CCCGAAT<span style=\"color:red;\">TAATTC<\/span>CC<\/p>\n<p>However, since the two matches are overlapping with each other, the following call to 
preg_match_all():<\/p>\n<p>preg_match_all(&#8220;\/[GT]AATT[CA]\/&#8221;, &#8220;CCCGAATTAATTCCC&#8221;);<\/p>\n<p>will only find the first one, as once the first match is found and stored, the search starts again from the character that follows it.<\/p>\n<p>Consider the following example:<\/p>\n<pre lang=\"php\"><code>\r\n<?php\r\n\r\npreg_match_all(\"\/[GT]AATT[CA]\/\", \"CCCGAATTAATTCCC\", $matches);\r\n\r\n$matches_num = count($matches[0]); \/\/ How many matches were found?\r\n\/\/ $matches[0] is now an array whose elements are the various matches found by the pattern in the target string\r\n\r\necho \"<p>\\n\/[GT]AATT[CA]\/ found $matches_num matches in CCCGAATTAATTCCC, namely:\\n<ul>\\n\";\r\n\r\nforeach($matches[0] as $match){\r\n    echo \"<li>$match<\/li>\\n\";\r\n}\r\n\r\necho \"<\/ul>\\n<\/p>\\n<p>The var_dump of matches:<br>\\n\";\r\nvar_dump($matches);\r\necho \"\\n<\/p>\";\r\n?>\r\n<\/code><\/pre>\n<p>The output of this code:<\/p>\n<p>\n\/[GT]AATT[CA]\/ found 1 matches in CCCGAATTAATTCCC, namely:<\/p>\n<ul>\n<li>GAATTA<\/li>\n<\/ul>\n<p>The var_dump of matches:<br \/>\narray(1) {<br \/>\n  [0]=><br \/>\n  array(1) {<br \/>\n    [0]=><br \/>\n    string(6) &#8220;GAATTA&#8221;<br \/>\n  }<br \/>\n}<\/p>\n<p>In the search of patterns within biological sequences this is a serious limitation.<\/p>\n<p>In order to retrieve all the matches, even if overlapping, we can use a lookahead in the regular expression. Lookaheads are special cases of &#8220;lookaround&#8221;, see <a href=\"http:\/\/www.regular-expressions.info\/lookaround.html\" target=\"_blank\">this page on regular-expressions.info for an in-depth discussion about lookaround regular expressions<\/a>.<\/p>\n<p>As regular-expressions.info puts it:<\/p>\n<p><em>&#8220;Lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called &#8220;assertions&#8221;. 
They do not consume characters in the string, but only assert whether a match is possible or not.&#8221;<\/em><\/p>\n<p>There are two important things to consider here:<\/p>\n<ul>\n<li>First, lookarounds &#8220;give up the match&#8221;. Therefore, if we perform a match with a lookahead expression and preg_match_all(), the matched text will not be stored in $matches[0] as would normally happen: $matches[0] will contain only empty strings. To retrieve the matched text we need to explicitly use a capture group inside the lookahead. With one capture group we would have to look for the matches in $matches[1]. <\/li>\n<li>Second, lookaheads do not &#8220;consume&#8221; characters. This means that even overlapping matches will be found.<\/li>\n<\/ul>\n<p>A simple example is worth a thousand words: let&#8217;s run the last code example again with a lookahead expression and some slight modifications.<\/p>\n<p>We transform the original regular expression<\/p>\n<p>&#8220;\/[GT]AATT[CA]\/&#8221;<\/p>\n<p>to:<\/p>\n<p>&#8220;\/(?=([GT]AATT[CA]))\/&#8221;<\/p>\n<p>(?=               Look ahead to see if there is (the lookahead)<br \/>\n(                 start of the capture group<br \/>\n[GT]AATT[CA]      our usual expression<br \/>\n)                 end of the capture group<br \/>\n)                 end of the lookahead<\/p>\n<p>Credits: this expression was inspired by this <a href=\"http:\/\/stackoverflow.com\/questions\/24619099\/regex-for-overlapping-matches\" target=\"_blank\">brilliant page on StackOverflow<\/a> <\/p>\n<pre lang=\"php\"><code>\r\n<?php\r\n    \r\npreg_match_all(\"\/(?=([GT]AATT[CA]))\/\", \"CCCGAATTAATTCCC\", $matches);\r\n\r\n\/\/ Regular expression breakdown:\r\n\/\/ (?=               Look ahead to see if there is (the lookahead)\r\n\/\/ (                 start of the capture group\r\n\/\/ [GT]AATT[CA]      our usual expression\r\n\/\/ )                 end of the capture group \r\n\/\/ )                 end of the lookahead\r\n\r\n$matches_num = count($matches[1]); \/\/ How many matches 
were found?\r\n\/\/ $matches[1] is now an array whose elements are the various matches \r\n\/\/ found by the capture group in the target string\r\n\r\necho \"<p>\\n\/(?=([GT]AATT[CA]))\/ found $matches_num matches in CCCGAATTAATTCCC, namely:\\n<ul>\\n\";\r\n\r\nforeach($matches[1] as $match){\r\n    echo \"<li>$match<\/li>\\n\";\r\n}\r\n\r\necho \"<\/ul>\\n<\/p>\\n<p>The var_dump of matches:<br>\\n\";\r\nvar_dump($matches);\r\necho \"\\n<\/p>\";\r\n\r\n?>\r\n<\/code><\/pre>\n<p>This is the output of the script:<\/p>\n<p>\n\/(?=([GT]AATT[CA]))\/ found 2 matches in CCCGAATTAATTCCC, namely:<\/p>\n<ul>\n<li>GAATTA<\/li>\n<li>TAATTC<\/li>\n<\/ul>\n<p>The var_dump of matches:<br \/>\narray(2) {<br \/>\n  [0]=><br \/>\n  array(2) {<br \/>\n    [0]=><br \/>\n    string(0) &#8220;&#8221;<br \/>\n    [1]=><br \/>\n    string(0) &#8220;&#8221;<br \/>\n  }<br \/>\n  [1]=><br \/>\n  array(2) {<br \/>\n    [0]=><br \/>\n    string(6) &#8220;GAATTA&#8221;<br \/>\n    [1]=><br \/>\n    string(6) &#8220;TAATTC&#8221;<br \/>\n  }<br \/>\n}<\/p>\n<p>In the last example note how $matches[0], the placeholder for matches to the whole expression, contains 2 elements but they are empty strings. This is because a lookahead does not consume any characters, so the match to the whole expression is always empty, as explained above. 
In contrast, we can find the matches in $matches[1] because we used a capture group.<\/p>\n<p>You now have all the knowledge to find all the matches to a pattern in a string, even overlapping matches.<\/p>\n<p>In the <a href=\"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/chapter-4-adding-a-dynamic-layer-introducing-the-php-programming-language\/regular-expressions-in-php-retrieving-matches-position-by-using-the-preg_offset_capture-flag-in-preg_match-and-preg_match_all-calls\/\">next section we look at the use of the <strong>PREG_OFFSET_CAPTURE<\/strong> flag in the preg_match() or preg_match_all() calls<\/a> in order to capture not only the matches, but also the position in the string at which they start, which is often useful when searching for a pattern in a biological sequence as well as in other contexts unrelated to biology or bioinformatics.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this section we will explore the use of the preg_match_all() PHP function to retrieve all the occurrences of a pattern in a target string as well as the use of lookahead expressions to include overlapping matches. 
You may remember that in the previous section we mentioned that preg_match() will only allow to retrieve the &hellip; <a href=\"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/chapter-4-adding-a-dynamic-layer-introducing-the-php-programming-language\/regular-expressions-in-php-retrieving-all-matches-to-a-pattern-in-a-string-with-preg_match_all-including-overlapping-matches\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;4-10: Regular expressions in PHP &#8211; retrieving all matches to a pattern in a string with preg_match_all() including overlapping matches&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":435,"menu_order":10,"comment_status":"open","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1113","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/pages\/1113","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/comments?post=1113"}],"version-history":[{"count":56,"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/pages\/1113\/revisions"}],"predecessor-version":[{"id":1454,"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/pages\/1113\/revisions\/1454"}],"up":[{"embeddable":true,"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\/wp-json\/wp\/v2\/pages\/435"}],"wp:attachment":[{"href":"http:\/\/www.cellbiol.com\/bioinformatics_web_development\
/wp-json\/wp\/v2\/media?parent=1113"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}