Is there a way to optimize this case of catastrophic regex backtracking? - regex

So I have come up with the following regex:
([^\s\\]+(?:\\.[^\s\\]*)*)(?:.*?)(\S+\.php\b)
Test link: https://regex101.com/r/NV6Bk4/4
It matches the binary and the script name of a command line. Example:
php --strict myscript.php --arg=value
matches php and myscript.php in group(1) and group(2).
The problem is this part in the middle: (?:.*?), it leads to combinatorial explosion, slowing down the regex for large inputs. Is there a way to optimize this? Since there is no pattern I can't think of anything.
To clarify, the rule that I'm trying to match is:
Match any path to a command, possibly containing escaped whitespace. Ignore any arguments following it. Match a file ending in .php, ignore anything that follows it. The command should be in group(1), the filename should be in group(2).

You may use the following "fix" with Matcher#matches():
([^\s\\]*+(?:\\.[^\s\\]*)*).*?(\S+\.php\b).*
In Java
String regex = "([^\\s\\\\]*+(?:\\\\.[^\\s\\\\]*)*).*?(\\S+\\.php\\b).*";
See the regex demo. Note that a literal . outside of a character class must be escaped. Compile the pattern with Pattern.DOTALL if the string may have line breaks.
As you can see, the .*? part matches any character, and the (?:\\.[^\s\\]*)* group before it can match zero times, so it is effectively optional. That leaves [^\s\\]+ directly adjacent to .*?, and it can match the same characters as .*?. The regex engine may therefore backtrack into the first subpattern, handing characters over to .*? one at a time, which creates a lot of ways to match the same string. This is commonly called catastrophic backtracking.
If you disallow backtracking into the first negated character class with the *+ possessive quantifier, the regex will already work much more reliably.
Add .* at the end to make it work with .matches() as this method requires a full string match.
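For readers who want to try the fixed pattern outside Java, here is a minimal sketch in Python, offered as an illustration rather than the answerer's code. It assumes Python 3.11 or later for the possessive quantifier (older versions fall back to the plain greedy form), and it uses re.fullmatch as the closest counterpart to Matcher#matches(). The helper name split_command and the second sample command line are made up for the example.

```python
import re
import sys

# Possessive quantifiers (*+) were added to Python's re module in 3.11.
# On older versions this sketch falls back to the ordinary greedy form,
# which matches the same strings but can still backtrack.
FIRST = r"[^\s\\]*+" if sys.version_info >= (3, 11) else r"[^\s\\]*"

PATTERN = re.compile(
    "(" + FIRST + r"(?:\\.[^\s\\]*)*).*?(\S+\.php\b).*",
    re.DOTALL,  # let . cross line breaks, like Pattern.DOTALL in Java
)

def split_command(line):
    """Return (command, script), or None if the whole line does not match."""
    m = PATTERN.fullmatch(line)
    return (m.group(1), m.group(2)) if m else None

print(split_command("php --strict myscript.php --arg=value"))
# ('php', 'myscript.php')
print(split_command(r"/opt/my\ tools/php -f run.php"))
# ('/opt/my\\ tools/php', 'run.php')
```

Because fullmatch requires the whole string to match, the trailing .* plays the same role here as it does with .matches() in Java.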

Related

Match asterisk followed by space in PCRE

I'm just having trouble figuring out how to regex properly. What I need is to match an asterisk followed by a space followed by any amount of characters that aren't \n. (Similar to reddit list formatting)
Example:
* Test
* Test2
* Test3
The closest I got was this, but it wasn't working.
/^[*][ ](.*?)/s
Can anyone familiar with PCRE help me?
You should not use a lazy dot pattern at the end of the regex because it will never match a single character: a lazy quantifier is skipped at first, and since nothing after it still needs to match, the regex engine never has to come back and expand it, so .*? matches the empty string.
Use the greedy dot pattern:
^\* (.*)
See the regex demo
Other notes: you may use \h to match any horizontal whitespace instead of the regular space in the pattern. To match the start of each line with ^, use the m modifier. Only use the s modifier if you need . to match any character including a newline (and carriage return, depending on which PCRE verbs are active).
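The difference between the lazy and the greedy dot at the end of a pattern can be checked with a short sketch. Python's re module is used here as a stand-in for PCRE (an assumption of this example); re.MULTILINE corresponds to the m modifier, and the variable names are illustrative.

```python
import re

text = "* Test\n* Test2\n* Test3"

# Lazy dot at the very end: nothing follows it, so it settles for zero characters.
lazy = re.findall(r"^\* (.*?)", text, re.MULTILINE)

# Greedy dot: runs to the end of each line (. does not match \n by default).
greedy = re.findall(r"^\* (.*)", text, re.MULTILINE)

print(lazy)    # ['', '', '']
print(greedy)  # ['Test', 'Test2', 'Test3']
```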

Which would be better non-greedy regex or negated character class?

I need to match #anything_here# from the string #anything_here#dhhhd#shdjhjs#, so I used the following regex:
^#.*?#
or
^#[^#]*#
Both ways work, but I would like to know which one is the better solution: a regex with non-greedy repetition or a regex with a negated character class?
Negated character classes should usually be preferred over lazy matching, if possible.
If the regex is successful, ^#[^#]*# can match the content between #s in a single step, while ^#.*?# needs to expand for each character between #s.
When failing (for the case of no ending #), many regex engines will apply an optimization and internally treat [^#]* as [^#]*+, as there is a clear-cut border between # and non-#. The class then matches to the end of the string, the engine recognizes the missing #, and it fails instantly instead of backtracking. .*? will expand character by character as usual.
When used in larger contexts, [^#]* will also never expand over the borders of the ending # while this is very well possible for the lazy matching. E.g. ^#[^#]*a[^#]*# won't match #bbbb#a# while ^#.*?a.*?# will.
Note that [^#] will also match newlines, while . doesn't (in most regex engines and unless used in singleline mode). You can avoid this by adding the newline character to the negation - if it is not wanted.
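The behavioural differences described above can be reproduced with a small sketch. It uses Python's re module purely for illustration; the variable names and the two-line sample string are assumptions of the example, not part of the original answer.

```python
import re

negated = re.compile(r"^#[^#]*#")
lazy = re.compile(r"^#.*?#")

s = "#anything_here#dhhhd#shdjhjs#"
print(negated.match(s).group())  # #anything_here#
print(lazy.match(s).group())     # #anything_here#

# In a larger pattern the lazy dot can run past an inner '#',
# while the negated class never can.
strict = re.compile(r"^#[^#]*a[^#]*#")
loose = re.compile(r"^#.*?a.*?#")
print(strict.match("#bbbb#a#"))         # None
print(loose.match("#bbbb#a#").group())  # #bbbb#a#

# [^#] matches newlines; . does not unless singleline (DOTALL) mode is on.
multiline = "#two\nlines#"
print(negated.match(multiline) is not None)  # True
print(lazy.match(multiline) is not None)     # False
```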
It is clear the ^#[^#]*# option is much better.
The negated character class is quantified greedily, which means the regex engine grabs zero or more characters other than # right away, as many as possible.
When you use a lazy dot matching pattern, the engine matches #, then tries to match the trailing # (skipping the .*?). It does not find the # at Index 1, so the .*? matches the a char. This .*? pattern expands as many times as there are chars other than # up to the first #.

Regex end of capture string

Can anyone tell me why this regex:
(<\s*script\s*>.*<\s*\/*script\s*>)
Matches this entire line:
< script > some more javascript</script> ggg <script>
You have two problems:
First, a simple mistake: by using the * quantifier you make the closing slash match zero or more '/' characters, so an opening <script> tag also satisfies the closing part of the pattern. Remove the quantifier so that exactly one slash is required, which changes your regex to: (<\s*script\s*>.*<\s*\/script\s*>)
Second, .* is greedy. This means it grabs as much as it can while still allowing the rest of the regex (here, the closing-tag part) to match. So if you had multiple "...</script>" blocks on a line, it would match from the first opening tag to the last closing tag rather than each block separately.
What you want is to match any character as few times as possible, which is called lazy matching. You can append ? to any quantifier to accomplish this; in your example:
.*?
Using that your regex would become:
(<\s*script\s*>.*?<\s*\/script\s*>)
If you're actually using the http://www.regexr.com "Reference" menu to build your regex, you can find this under "Quantifiers and Alternation">"Lazy"
Replace \/* with \/.
\/* matches 0 or more "/" characters.
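Both fixes can be seen side by side in a short sketch. Python's re module is used for illustration only (the question does not say which engine is involved), and the variable names are made up for the example.

```python
import re

line = "< script > some more javascript</script> ggg <script>"

# Original: greedy .* plus an optional slash, so the final <script>
# is accepted as the "closing" tag and the whole line is matched.
original = re.search(r"(<\s*script\s*>.*<\s*\/*script\s*>)", line).group(1)

# Fixed: lazy .*? and exactly one slash in the closing tag.
fixed = re.search(r"(<\s*script\s*>.*?<\s*\/script\s*>)", line).group(1)

print(original)  # < script > some more javascript</script> ggg <script>
print(fixed)     # < script > some more javascript</script>
```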

Multiline capture regex fails if ungreedy

I would like to retrieve data from an xml file. I use for instance a regexp like this one:
/
<OVERLAYLINKPROJECT(?:.|\s)+
<OUTPUT
/xU
Here's an extract of my xml file:
<OVERLAYLINKPROJECT id='0773C138' parent_id='007285A0' ovl_id='0x4b' run_address='0x9022a' run_size='0x450' live_address='0x40c111' live_size='0x678' >
<FILE_NAME><![CDATA[xxx.ovl]]></FILE_NAME>
<OUTPUT_SECTIONS>
<OUTPUT_SECTION id='0773C138' name='xxxx' type='SHT_PROGBITS' start_address='0x9022a' word_size='0x450' word_size_unmapped='0x0' in_overlay='' >
<INPUT_SECTIONS>
<INPUT_SECTION id='0580D5B0' name='yyyy' start_address='0x9022b' size='0x44f' element_at='0x0' >
The regex doesn't work without the ungreedy modifier U. Why?
The problem is, surprisingly enough, catastrophic backtracking.
You used (?:.|\s), presumably because . doesn't match newlines, and your input contains them. However, \s also matches other whitespace, such as spaces, which the dot (.) can match too.
If you don't use the ungreedy modifier, (?:.|\s)+ first matches the entire string after <OVERLAYLINKPROJECT and then backtracks to see where <OUTPUT can first be matched. At each and every space, it needs to try all the alternatives between matching it with . or matching it with \s before it can be sure that neither lead to a valid match.
There are 14 spaces in that part of the string. Each of them can be matched either by . or by \s, and every choice has to be tried in combination with the choices for all the others, so the number of paths roughly doubles with each space: 2^14 = 16384 combinations for 14 spaces, and the count keeps doubling as the input grows. That takes a while (or the regex engine times out).
When you use the ungreedy modifier, the regex engine steps through the match one character at a time, "marking" each whitespace for later evaluation in case the match fails, but it succeeds as soon as <OUTPUT is encountered. That's why it matches much faster. It will still fail catastrophically if the input string does not contain <OUTPUT at all, because in that case the regex engine needs to revisit all the spaces and try the different combinations in the vain hope of finding a match that way.
Use the /s modifier instead to allow the dot to match newlines:
/
<OVERLAYLINKPROJECT.+
<OUTPUT
/xs
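As a rough illustration of the s modifier, the sketch below uses Python's re module, where re.DOTALL and re.VERBOSE play the roles of /s and /x. The XML string is an abbreviated, made-up version of the extract above, and the variable names are illustrative.

```python
import re

xml = (
    "<OVERLAYLINKPROJECT id='0773C138' >\n"
    "<FILE_NAME><![CDATA[xxx.ovl]]></FILE_NAME>\n"
    "<OUTPUT_SECTIONS>\n"
    "<OUTPUT_SECTION id='0773C138' >\n"
)

pattern = r"""
    <OVERLAYLINKPROJECT.+
    <OUTPUT
"""
lazy_pattern = r"""
    <OVERLAYLINKPROJECT.+?
    <OUTPUT
"""

# Without DOTALL the dot stops at the first newline, so nothing matches.
no_dotall = re.search(pattern, xml, re.VERBOSE)

# With DOTALL a single dot covers every character; no alternation is needed.
greedy = re.search(pattern, xml, re.VERBOSE | re.DOTALL)
lazy = re.search(lazy_pattern, xml, re.VERBOSE | re.DOTALL)

print(no_dotall)                # None
print(greedy.end(), lazy.end()) # greedy stops at the last <OUTPUT, lazy at the first
```

Whether the greedy or the lazy form is wanted depends on which <OUTPUT occurrence should end the match.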
<OVERLAYLINKPROJECT(?:.|\s)+?<OUTPUT
Try this. See the demo:
http://regex101.com/r/bW3aR1/1
The problem with your regex is catastrophic backtracking as explained by Tim.

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
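So if your regex were /f(.*)\1/ and the string were "fab", it would match just the "f": the group captures the empty string between the 'f' and the 'a', and \1 repeats that empty string. (On "foo" the same regex matches the whole string, because the group can settle on a single 'o' that \1 then repeats.)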
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
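The empty match is easy to reproduce. The sketch below uses Python's re module as a stand-in for regexec(); the two engines differ in general, so treat this as an illustration of the idea rather than of POSIX behaviour. The variable names are made up for the example.

```python
import re

# (.*)\1 can succeed with an empty group: zero characters, repeated.
star = re.search(r"(.*)\1", "blabl")
print(star.span(), repr(star.group(1)))  # (0, 0) ''

# (.+)\1 needs a non-empty substring that is immediately repeated.
plus_blabl = re.search(r"(.+)\1", "blabl")
plus_foo = re.search(r"(.+)\1", "foo")
print(plus_blabl)        # None: nothing in "blabl" repeats back to back
print(plus_foo.group())  # oo
```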
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
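A quick way to compare the two variants, sketched with Python's re module (used only for illustration; the variable names are assumptions of this example):

```python
import re

immediate = re.compile(r"(.+)\1")   # repeat must follow directly
later = re.compile(r"(.+).*\1")     # repeat may come after other text

first = immediate.search("BLABLA")
print(first.group(1))  # BLA

# "BLAahemBLA" has no back-to-back repetition, so the first form fails.
print(immediate.search("BLAahemBLA"))  # None

second = later.search("BLAahemBLA")
print(second.group(1), second.group())  # BLA BLAahemBLA
```

The second form also matches "BLABLA", since .* is allowed to match nothing.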