Recursive regex for templating sub loops

So I looked at How to write a recursive regex that matches nested parentheses? and other solutions for recursive regex matching, but I'm still not getting a proper match on RegexBuddy.
I have a generic handlebars-style template that I want to parse myself, a table with headings:
<table>
<thead>
<tr>
{{#each columns as col }}<th>{{col}}</th>{{/each}}
</tr>
</thead>
<tbody>
{{#each rows as row }}
<tr>
{{#each row as col }}<td>col</td>{{/each}}
</tr>
{{/each}}
</tbody>
</table>
And trying to match with
/{{\#each (\w+) as (\w+) }}(.*?|(?R)){{/each}}/s
The regex matches the {{#each columns... in the <thead> just fine, but it seems to ignore the |(?R) part and matches {{#each rows... only until the first {{/each}}. I, of course, would like it to match both the inner and outer #each expressions. How? This is perhaps much more complex than simple nested parentheses.
(I always feel like I'm a pro at RegEx until I run into things like this. I have been trying for a while to make this work, and regular-expressions.info is just confusing me more.)
I'm currently working around this by doing {{#each_sub...}}...{{/each_sub}} so my regex won't stop on the first closing tag, but that's obviously a sub-optimal way of doing it. I have several other applications that would benefit from recursive regex but can't figure out what I'm doing wrong.

It isn't ignoring the recursion, it's just never reaching it. Because .*? is capable of matching your delimiters ({{#each...}} and {{/each}}), it matches the first closing delimiter it finds and reports success without ever needing to recurse.
For this technique to work, the branch before the (?R) has to match anything that's not a delimiter. Since your delimiters consist of multiple characters, you can't use a negated character class, as they did in the question you linked to. Instead, you need to use a tempered greedy token:
(?:(?!{{[#/]each\b).)*
This is the same as .*, except before it consumes each character it checks to make sure it's not the beginning of {{#each or {{/each. Here it is in context:
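To see the tempered greedy token on its own, here is a minimal Python sketch. The function name text_before_delimiter is only for illustration, and I've escaped the braces, which the pattern above leaves bare. It consumes text up to, but not including, the next {{#each or {{/each:

```python
import re

# Tempered greedy token: consume one character at a time, but only while the
# current position is NOT the start of "{{#each" or "{{/each".
TEMPERED = re.compile(r"(?:(?!\{\{[#/]each\b).)*", re.DOTALL)

def text_before_delimiter(s):
    """Return the leading part of s that contains no each-delimiter."""
    return TEMPERED.match(s).group(0)

# Ordinary placeholders such as {{col}} are consumed, delimiters are not.
print(repr(text_before_delimiter("{{col}} and {{#each a as b }}")))
print(repr(text_before_delimiter("<td>x</td>{{/each}} tail")))
```

Because the token can match the empty string, it always succeeds; it simply stops as soon as a delimiter begins.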
{{\#each (\w+) as (\w+) }}(?:(?:(?!{{[#/]each\b).)*|(?R))*{{/each}}
If the first branch fails, it means you've encountered something that looks like a delimiter. If it's an opening delimiter, the second branch takes over and tries to match the whole pattern recursively. Otherwise, it pops out of the loop (note the * after the group--you were missing that, too) and tries to match a closing delimiter.
While the regex above will work fine on valid input, it's subject to catastrophic backtracking if the input is malformed. To avoid that, you can use an unrolled loop in place of the alternation (as @Wiktor did in his comment):
{{\#each\s+(\w+)\s+as\s+(\w+)\s*}}(?:(?!{{[#/]each\b).)*(?:(?R)(?:(?!{{[#/]each\b).)*)*{{/each}}
Here's a slightly more readable version, with possessive quantifiers added to squeeze out even more speed:
{{\#each\s+(\w+)\s+as\s+(\w+)\s*}}
(?:(?!{{[#/]each\b).)*+
(?:
(?R)
(?:(?!{{[#/]each\b).)*+
)*+
{{/each}}
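If you ever need the same result in an engine without (?R), such as Python's built-in re module, one workaround is to match only the delimiters and count nesting depth yourself. This is a different technique from the recursive regex above, not a translation of it, and the function name outermost_each_blocks is made up for this sketch:

```python
import re

# Matches either delimiter; the "open" group is set only for {{#each x as y}}.
DELIM = re.compile(r"\{\{(?:(?P<open>#each\s+\w+\s+as\s+\w+\s*)|/each)\}\}")

def outermost_each_blocks(template):
    """Return every top-level {{#each}}...{{/each}} block, nested ones included."""
    blocks = []
    depth = 0
    start = None
    for m in DELIM.finditer(template):
        if m.group("open") is not None:
            if depth == 0:
                start = m.start()      # remember where the outer block begins
            depth += 1
        elif depth > 0:                # a closing tag that has a matching opener
            depth -= 1
            if depth == 0:
                blocks.append(template[start:m.end()])
    return blocks

template = (
    "{{#each columns as col }}<th>{{col}}</th>{{/each}}\n"
    "{{#each rows as row }}<tr>"
    "{{#each row as col }}<td>col</td>{{/each}}"
    "</tr>{{/each}}"
)
for block in outermost_each_blocks(template):
    print(block)
```

Stray closing tags are ignored and an unclosed opening tag just yields no block, so malformed input can't trigger backtracking problems here.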

Related

Why does a regular expression find a match outside of its bounds?

I have the following regular expression, which I'm using to find <icon use="some-id" class="some-class" />:
(?:<icon )(?=(?:.*?(?:use=(?:"|')(.*?)(?:"|')))?)(?=(?:.*?(?:class=(?:"|')(.*?)(?:"|')))?)(?:.*?)(?: \/)?[^?](?:>)
This mostly works, except that if I don't specify a class, but do specify one on another element on the same line, it'll match that other element's class, even though the full match is reported as just being the icon element.
For example:
<icon use="search" /> <div class="test"></div>
$1 for that is search, and $2 is test, even though they're not part of the same element. $& is reporting <icon use="search" />.
I'm sure I'm missing something obvious about the way regular expressions work.
The .*? just before the match of class= will match ANYTHING it has to in order to make the rest of the regex match - including the end of the first tag and the start of the second one, and everything that might lie in between. The only restriction you've placed on it is that it can't cross a line boundary, as newlines are not matched by . by default. To make this work somewhat more reliably, you'd need to restrict that part of the regex so that it cannot cross a tag boundary: [^<]+? (one or more characters that aren't a left angle bracket, matching as few as possible) should do the job.
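Here is a rough Python sketch of that restriction. It is a simplified variant of the pattern in the question, not a drop-in replacement: I've used [^<>]*? (excluding both angle brackets) instead of [^<]+?, and the name ICON is just for illustration.

```python
import re

# Both lookaheads are optional, but neither may cross a tag boundary,
# because [^<>] cannot match "<" or ">".
ICON = re.compile(
    r"""<icon\b"""
    r"""(?=(?:[^<>]*?\buse=["']([^"']*)["'])?)"""
    r"""(?=(?:[^<>]*?\bclass=["']([^"']*)["'])?)"""
    r"""[^<>]*?/?>"""
)

m = ICON.search('<icon use="search" /> <div class="test"></div>')
print(m.group(0))   # the icon element only
print(m.group(1))   # the use attribute
print(m.group(2))   # None: the div's class is no longer reachable
```

With the class lookahead confined to the icon tag, group 2 stays unset when the icon itself has no class attribute.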

how to separate two regexp (for taking text from brackets in commented area)?

I have some html page, it looks like:
<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>
I need to take the text from square brackets, but only inside the commented fields. I have 2 regexes and they work fine, but only separately.
/[^[\]]+(?=])/g - this is for text in brackets;
(?=<!--\[content)([\s\S]*?content]-->) - for commented fields.
But I can't combine them. I tried (?=<!--\[content)([^[\]]+(?=]))([\s\S]*?content]-->) but it doesn't work. I don't know much regexp; how can I combine them?
UPD: for output I need text in brackets only between commented fields (this, oops, world).
First, I might start with a simple attempt:
(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)
Explanation
(?<=\[)[^\]\[]*(?=\]) matches the text inside any pair of square brackets,
(?=[\s\S]*?<!--content\]-->) looks ahead for a closing content tag anywhere later in the string.
That sounds reasonable, right? But check DEMO1: it doesn't work. So the question is, why?
The problem is in the lookahead assertion. In the explanation above I described it as:
(?=[\s\S]*?<!--content\]-->) looks ahead for a closing content tag anywhere later in the string.
That description is misleading. It should be:
(?=[\s\S]*?<!--content\]-->) looks ahead for a closing content tag anywhere later in the string, even if an opening content tag comes first.
So the issue is that [\s\S]*? can run across more than one content tag, which means bracketed text before the commented block also satisfies the lookahead.
Workaround
To prevent this, we pair a negative lookahead for the opening content tag with every character consumed by [\s\S]*?. Thus, we get:
(?<=\[)[^\]\[]*(?=\])(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)
Notice that
[\s\S]*?
is just modified to
(?:(?!<!--\[content-->)[\s\S])*?
which means (?!<!--\[content-->) is applied in front of every character consumed by [\s\S]*?. For example, if [\s\S]*? consumes ABCDEF..., the negative lookahead is applied like this:
(?!<!--\[content-->)A(?!<!--\[content-->)B(?!<!--\[content-->)C(?!<!--\[content-->)D(?!<!--\[content-->)E(?!<!--\[content-->)F...
Finally, please check DEMO2. See? It just works!
DISCLAIMER: My regex works only for simple input like the example in the question. For more complex input, such as recursive structures, I can't guarantee anything.
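For anyone who wants to try the workaround regex outside a demo site, here is a small Python sketch. The SAMPLE string is a shortened version of the HTML in the question, and the function name brackets_in_content is my own:

```python
import re

# Bracketed text that is followed by a closing content tag, where the path to
# that closing tag does not cross an opening content tag.
PATTERN = re.compile(
    r"(?<=\[)[^\]\[]*(?=\])"
    r"(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)"
)

SAMPLE = """<th>Text [some text]</th>
<!--[content-->
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
<!--content]-->
<span>here is [the text]</span>"""

def brackets_in_content(html):
    """Return bracketed text found between the content comment markers."""
    return PATTERN.findall(html)

print(brackets_in_content(SAMPLE))
```

The bracket before the opening marker is rejected because the lookahead would have to step over <!--[content-->, and the one after the closing marker is rejected because no closing marker follows it.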

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match here. And some.
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but Regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
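As a quick illustration of the lookahead version, here is a Python sketch (the helper name between_br is hypothetical). Because the lookahead does not consume the closing <br>, each <br> can also open the next match, so the n-th result is the text after the n-th tag:

```python
import re

# The trailing <br> is only asserted, not consumed, so matches can chain.
BETWEEN_BR = re.compile(r"<br>\s*(.*?)(?=\s*<br>)")

def between_br(text, n):
    """Return the trimmed text following the n-th <br> (1-based)."""
    return BETWEEN_BR.findall(text)[n - 1]

text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"
print(between_br(text, 2))   # here. And some
```

Note that . does not match newlines by default, so multi-line text would need the DOTALL flag.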

REGEX replace all " style='anything'" except within tables

I am parsing html. I know this shouldn't be done with regex but dom/xpath. In my case it should just be fast, simple and without tidy so I chose regex.
The task is replacing all the style='xxx' with an empty string, except within tables.
This regex for preg_replace works catching all style='xxx' no matter where:
'/ style="([^"]+)"/s'
The content can look like this
<!-- more html here -->
<span style='do:smtg'><table class=... > <span style="...">
<table> <div style=""></div></table></span></table>
<!-- more html here -->
or just simple non nested tables, meaning regex should exclude all style='...' also within nested tables.
Is there a simple syntax doing this?
Thou Shalt Not Parse HTML with Regular Expressions!
No, really, you shouldn't.
As evidenced by your example, you can expect nested tables. That means the regex should keep track of the level of nesting, to decide whether or not you're in a table. If you find a way to do this, it will certainly not be "fast and simple".
Email, resurrecting this question because it had a regex that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
First we need a regex to match tables, nested or not. This does it with simple recursion:
<table(?:.*?(?R).*?|.*?)</table>
Next, we exclude these, and match what we do want. Here is the whole regex:
(?s)<table(?:.*?(?R).*?|.*?)<\/table>(*SKIP)(*F)|style=(['"])[^'"]*\1
See the demo
The left side of the alternation matches complete tables, nested or not, then deliberately fails. The right side matches and captures your styles to Group 1, allowing for different quote styles. We know these are the right styles because they were not matched by the expression on the left.
With this regex, you can do a simple preg_replace($regex, "", $yourstring);
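The (*SKIP)(*F) verbs and (?R) are PCRE features. As a rough sketch of the same "skip what's inside tables, replace the rest" idea in an engine that lacks them, here is a Python version that swaps in a different technique: it tokenizes table tags and style attributes and tracks nesting depth in a callback. The function name strip_styles_outside_tables is made up for this example, and unlike the regex above it also removes the whitespace before the attribute.

```python
import re

# One alternation for everything we care about: table open, table close,
# or a style attribute (with its leading whitespace).
TOKEN = re.compile(
    r"""(?P<open><table\b)|(?P<close></table>)|"""
    r"""(?P<style>\s+style=(?P<q>['"])[^'"]*(?P=q))""",
    re.IGNORECASE,
)

def strip_styles_outside_tables(html):
    depth = 0  # current table nesting level

    def replace(m):
        nonlocal depth
        if m.group("open"):
            depth += 1
        elif m.group("close"):
            depth = max(depth - 1, 0)
        elif depth == 0:
            return ""          # style attribute outside any table: drop it
        return m.group(0)      # everything else is left untouched

    return TOKEN.sub(replace, html)

html = ("<p style='a:b'>x</p><table><tr><td style=\"c:d\">"
        "<table><td style='e'></td></table></td></tr>"
        "<span style='f'></span></table><div style=\"g\"></div>")
print(strip_styles_outside_tables(html))
```

This is still regex-on-HTML, so the usual caveats apply (comments, scripts, and unusual attribute quoting are not handled).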
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

Why does this RegEx work the way I want it to?

I have a RegEx that is working for me but I don't know WHY it is working for me. I'll explain.
RegEx: \s*<in.*="(<?.*?>)"\s*/>\s*
Text it finds (it finds the white-space before and after the input tag):
<td class="style9">
<input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value="<?php echo $data[guarantor4]; ?>" /> </td>
</tr>
The part I don't understand:
<in.*=" <--- As I understand it, this should only find up to the first =" as in it should only find <input name="
It actually finds: <input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value=" which happened to be what I was trying to do.
What am I not understanding about this RegEx?
You appear to be using 'greedy' matching.
Greedy matching says "eat as much as possible to make this work"
try with
<in[^=]*=
for starters, that will stop it matching the "=" as part of ".*"
but in future, you might want to read up on the
.*?
and
.+?
notation, which stops at the first possible condition that matches instead of the last.
The use of 'non-greedy' syntax would be better if you were trying to only stop when you saw TWO characters,
ie:
<in.*?=id
which would stop on the first '=id' regardless of whether or not there are '=' in between.
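A small Python sketch comparing the three variants on a shortened version of the tag from the question (the variable names are just for illustration):

```python
import re

tag = '<input name="guarantor4" id="guarantor4" size="50" type="text" />'

# Greedy: .* runs to the end of the line, then backs up to the LAST ="
greedy = re.search(r'<in.*="', tag).group(0)

# Negated class: [^=]* cannot cross an "=", so it stops at the FIRST ="
negated = re.search(r'<in[^=]*="', tag).group(0)

# Lazy: .*? grows one character at a time and stops at the FIRST ="
lazy = re.search(r'<in.*?="', tag).group(0)

print(greedy)
print(negated)
print(lazy)
```

The greedy match ends at type=" while the other two end at name=".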
.* is greedy. You want .*? to find up to only the first =.
.* is greedy, so it'll find up to the last =. If you want it non-greedy, add a question mark, like so: .*?
As I understand it, this should only
find up to the first =" as in it
should only find <input name="
You don't say what language you're writing in, but almost all regular expression systems are "greedy matchers" - that is, they match the longest possible substring of the input. In your case, that means everything from the start of the input tag to the last equal-quote sequence.
Most regex systems have a way to specify that the pattern only match the shortest possible substring, not the longest - "non-greedy matching".
As an aside, don't assume the first parameter will be name= unless you have full control over the construction of the input. Both HTML and XML allow attributes to be specified in any order.
Your greedy approach is causing confusion. You want .*?
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. It starts by matching nothing, then takes one extra character at a time until the following 1 can match, eventually matching 101.
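The same comparison as a short Python sketch, using the input string above:

```python
import re

text = "101000000000100"

greedy = re.search(r"1.*1", text).group(0)   # runs to the last 1
lazy = re.search(r"1.*?1", text).group(0)    # stops at the first following 1

print(greedy)
print(lazy)   # 101
```

The greedy result runs from the first 1 through the last 1 in the input; the lazy result stops at the earliest 1 it can reach.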