Multiline capture regex fails if ungreedy - regex

I would like to retrieve data from an xml file. I use for instance a regexp like this one:
/
<OVERLAYLINKPROJECT(?:.|\s)+
<OUTPUT
/xU
Here's an extract of my xml file:
<OVERLAYLINKPROJECT id='0773C138' parent_id='007285A0' ovl_id='0x4b' run_address='0x9022a' run_size='0x450' live_address='0x40c111' live_size='0x678' >
<FILE_NAME><![CDATA[xxx.ovl]]></FILE_NAME>
<OUTPUT_SECTIONS>
<OUTPUT_SECTION id='0773C138' name='xxxx' type='SHT_PROGBITS' start_address='0x9022a' word_size='0x450' word_size_unmapped='0x0' in_overlay='' >
<INPUT_SECTIONS>
<INPUT_SECTION id='0580D5B0' name='yyyy' start_address='0x9022b' size='0x44f' element_at='0x0' >
The regex doesn't work without the ungreedy modifier U. Why?

The problem is, surprisingly enough, catastrophic backtracking.
You used (?:.|\s), presumably because . doesn't match newlines, and your input contains them. However, \s also matches other whitespace which can also be matched by ..
If you don't use the ungreedy modifier, (?:.|\s)+ first matches the entire string after <OVERLAYLINKPROJECT and then backtracks to see where <OUTPUT can first be matched. At each and every space, it needs to try both alternatives (matching it with . or matching it with \s) before it can be sure that neither leads to a valid match.
There are 14 spaces in that part of the string. Each one can be matched either by . or by \s, so in the worst case the engine has to try every combination of those choices, and the number of combinations doubles with each additional space (exponential growth). That takes a while (or the regex engine times out).
When you use the ungreedy modifier, the regex engine steps through the match one character at a time, "marking" each whitespace for later evaluation in case the match fails - but it succeeds as soon as <OUTPUT is encountered. That's why it matches much faster. It will still fail catastrophically if the input string does not contain <OUTPUT at all - because in that case, the regex engine needs to revisit all the spaces and try the different permutations in the vain hope of finding a match that way.
Use the /s modifier instead to allow the dot to match newlines:
/
<OVERLAYLINKPROJECT.+
<OUTPUT
/xs
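For illustration, here is a minimal Python sketch of the same idea. It assumes Python's re module, where re.DOTALL plays the role of /s and re.VERBOSE the role of /x; the sample string is a shortened version of the extract from the question. It also shows that a greedy .+ runs on to the last <OUTPUT, while a lazy .+? stops at the first one, so pick whichever you actually need.

```python
import re

# Shortened version of the XML extract from the question.
sample = (
    "<OVERLAYLINKPROJECT id='0773C138' >\n"
    "<FILE_NAME><![CDATA[xxx.ovl]]></FILE_NAME>\n"
    "<OUTPUT_SECTIONS>\n"
    "<OUTPUT_SECTION id='0773C138' name='xxxx' >\n"
)

# Without DOTALL the dot cannot cross a newline, so nothing matches.
no_dotall = re.search(r"<OVERLAYLINKPROJECT.+<OUTPUT", sample)

# Greedy: the dot matches newlines, and .+ extends to the LAST <OUTPUT.
greedy = re.compile(r"""
    <OVERLAYLINKPROJECT.+
    <OUTPUT
""", re.VERBOSE | re.DOTALL)

# Lazy: .+? stops at the FIRST <OUTPUT.
lazy = re.compile(r"""
    <OVERLAYLINKPROJECT.+?
    <OUTPUT
""", re.VERBOSE | re.DOTALL)

greedy_match = greedy.search(sample).group(0)
lazy_match = lazy.search(sample).group(0)

print(greedy_match.count("<OUTPUT"), lazy_match.count("<OUTPUT"))
```

Since there is only one way for the dot to consume each character, neither variant has the exponential blow-up of (?:.|\s)+.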

<OVERLAYLINKPROJECT(?:.|\s)+?<OUTPUT
Try this. See demo.
http://regex101.com/r/bW3aR1/1
The problem with your regex is catastrophic backtracking as explained by Tim.

Related

Trying to combine two Regex

I'm trying to combine two working regex patterns into one. Please let me know the correct syntax and if this can be better written.
Pattern 1: (?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|
Pattern 2: (?P<path>[^\/]+(?=\-[^\/-]*$))
Sample line:
06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3
The first expression matches the start of the string, the second matches the end, you can combine them by putting a non-greedy .*? between them, like this:
(?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|.*?(?P<path>[^\/]+(?=\-[^\/-]*$))
As you can see here, this expression works, but it takes 1660 steps to match the string. This is because each .* between the | characters first captures the whole string up to the end and then steps back character by character to find the match.
If you use the non-greedy quantifier .*? here, the regex engine will initially match an empty string and then move forward character by character until it finds the matching |. That reduces the number of steps to 1183: demo
However, if you want to remove this backtracking (forward-tracking) entirely, you can skip as many non-| characters as possible in one go with [^|]*. Similarly, we can replace the other .* patterns in the regex. The resulting regex finds a match in just 47 steps, more than 30 times fewer than the original regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*
Demo here.
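As a quick check, here is a small Python sketch that runs the optimized regex against the sample line from the question. Python's re module is an assumption on my part (the question does not name a language), and LOG_LINE and fields are just illustrative names.

```python
import re

# The optimized pattern from above, split over two lines for readability.
LOG_LINE = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*"
)

# Sample line from the question.
line = (
    "06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
    "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3"
)

# groupdict() returns the named groups as a dictionary.
fields = LOG_LINE.match(line).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

The path group ends just before the last hyphen, which is the same text that Pattern 2 captures on its own.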
Update 2020-03-09
If you want to keep the last slash you can use this regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*

Is there a way to optimize this case of catastrophic regex backtracking?

So I have come up with the following regex:
([^\s\\]+(?:\\.[^\s\\]*)*)(?:.*?)(\S+\.php\b)
Test link: https://regex101.com/r/NV6Bk4/4
It matches the binary and the script name of a command line. Example:
php --strict myscript.php --arg=value
matches php and myscript.php in group(1) and group(2).
The problem is this part in the middle: (?:.*?), it leads to combinatorial explosion, slowing down the regex for large inputs. Is there a way to optimize this? Since there is no pattern I can't think of anything.
To clarify, the rule that I'm trying to match is:
Match any path to a command, possibly containing escaped whitespace. Ignore any arguments following it. Match a file ending in .php, ignore anything that follows it. The command should be in group(1), the filename should be in group(2).
You may use the following "fix" with Matcher#matches():
([^\s\\]*+(?:\\.[^\s\\]*)*).*?(\S+\.php\b).*
In Java
String regex = "([^\\s\\\\]*+(?:\\\\.[^\\s\\\\]*)*).*?(\\S+\\.php\\b).*";
See the regex demo. Note that a literal . outside of a character class must be escaped. Compile the pattern with Pattern.DOTALL if the string may have line breaks.
As you can see, the .*? part matches any char, the (?:\\.[^\s\\]*)* before it can match zero or more chars (so it is effectively optional), and the pattern adjoining .*? on the left, [^\s\\]+, can match the same chars as .*?. That means the regex engine may backtrack into the first subpattern, which creates a lot of ways to match the string; this is commonly known as catastrophic backtracking.
If you disallow backtracking into the first negated character class with *+ possessive quantifier, it will already work much more reliably.
Add .* at the end to make it work with .matches() as this method requires a full string match.
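To illustrate the full-match point outside Java, here is a rough Python sketch; re.fullmatch behaves like Matcher#matches() in that the whole string has to be consumed. Two assumptions to be aware of: I use the question's original [^\s\\]+ rather than the possessive *+, because Python's re only accepts possessive quantifiers from version 3.11 on, and the second command line (with an escaped space) is a made-up example.

```python
import re

# Question's pattern plus the trailing .* from the answer.
# NOTE: no possessive quantifier here (needs Python 3.11+), so this sketch
# only demonstrates the full-match behaviour, not the backtracking fix.
COMMAND = re.compile(r"([^\s\\]+(?:\\.[^\s\\]*)*).*?(\S+\.php\b).*")
WITHOUT_TAIL = re.compile(r"([^\s\\]+(?:\\.[^\s\\]*)*).*?(\S+\.php\b)")

cmd = "php --strict myscript.php --arg=value"

# With the trailing .* the whole string is consumed, so fullmatch succeeds.
full = COMMAND.fullmatch(cmd)

# Without it, fullmatch fails because "--arg=value" is left over,
# although a plain search still finds the two groups.
no_tail_full = WITHOUT_TAIL.fullmatch(cmd)
no_tail_search = WITHOUT_TAIL.search(cmd)

# Hypothetical command path containing an escaped space.
escaped = COMMAND.fullmatch(r"/opt/my\ tools/php run.php")

print(full.group(1), full.group(2))
print(escaped.group(1), escaped.group(2))
```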

Regex end of capture string

Can anyone tell me why this regex:
(<\s*script\s*>.*<\s*\/*script\s*>)
Matches this entire line:
< script > some more javascript</script> ggg <script>
You have two problems:
First, a simple mistake: by using the * quantifier, you are making the slash of the closing tag match 0 or more '/' characters. You can solve that by removing the quantifier so that exactly one slash is required, changing your regex to: (<\s*script\s*>.*<\s*\/script\s*>)
Second, .* is greedy. This means it grabs as much as it can while still allowing the rest of the regex (in this case <\s*\/*script\s*>) to match. So if you had multiple "...</script>"s on a line, it would match the entire line rather than each "...".
What you want is to match any character as few times as possible, which is called lazy matching. You can qualify any quantifier with ? to accomplish this; in your example:
.*?
Using that your regex would become:
(<\s*script\s*>.*?<\s*\/script\s*>)
If you're actually using the http://www.regexr.com "Reference" menu to build your regex, you can find this under "Quantifiers and Alternation">"Lazy"
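Here is a short Python sketch comparing the original and the corrected regex on the line from the question. Python's re module is my assumption; the question does not say which engine is in use, but greedy and lazy quantifiers behave the same way in most backtracking engines.

```python
import re

line = "< script > some more javascript</script> ggg <script>"

# Original: greedy .* plus an optional slash, so the trailing "<script>"
# is accepted as the "closing" tag and the whole line matches.
original = re.search(r"(<\s*script\s*>.*<\s*\/*script\s*>)", line).group(1)

# Corrected: lazy .*? and exactly one slash in the closing tag.
fixed = re.search(r"(<\s*script\s*>.*?<\s*\/script\s*>)", line).group(1)

print(original)
print(fixed)
```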
Replace \/* by \/.
\/* match 0 or more "/".

Need help with Regular Expression to Match Blood Group

I'm trying to come up with a regex that helps me validate a Blood Group field - which should accept only A[+-], B[+-], AB[+-] and O[+-].
Here's the regex I came up with (and tested using Regex Tester):
[A|B|AB|O][\+|\-]
Now this pattern successfully matches A,B,O[+-] but fails against AB[+-].
Can anyone please suggest a regex that'll serve my purpose?
Thanks,
m^e
Try:
(A|B|AB|O)[+-]
Using square brackets defines a character class, which can only be a single character. The parentheses create a grouping which allows it to do what you want. You also don't need to escape the +- in the character class, as they don't have their regexy meaning inside of it.
As you mentioned in the comments, if it is a string you want to match against that has the exact values you are looking for, you might want to do this:
^(A|B|AB|O)[+-]$
Without the start of string and end of string anchors, things like "helloAB+asdads" would match.
The brackets [] denote a character class, meaning "any of the characters herein". You want the parentheses () for grouping:
(A|B|AB|O)(\+|-)
When you are building an alternation (e.g. (A|B|AB|O)), you should be careful with the ordering of the elements. Many regex engines will stop at the first alternate that matches (rather than the longest). If it weren't for the [-+] forcing a backtrack, (A|B|AB|O)[-+] would not work for "AB+". It is probably better to say (AB|A|B|O)[-+] (but you should check the docs for your regex engine).
Also, if you do not intend to capture the antigen for later use, you should use the non-capturing grouping parentheses: (?:AB|A|B|O)[-+].
Furthermore, if you want to ensure that the only thing in the string is a blood type, then you need anchors to prevent it from matching only part of the string: ^(?:AB|A|B|O)[-+]$. A quick note on anchors: depending on your regex engine, ^ may match the beginning of a line rather than the beginning of the string if you pass it a multiline-match option. Similarly, $ may match the end of a line rather than the end of the string. For this reason there are three other anchors in common (but not 100%) usage: \A, \Z, and \z. If your regex engine supports them, \A always matches the start of the string, \Z matches the end of the string or a newline just before the end of the string, and \z matches only the end of the string.
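The points above can be tried out with a small Python sketch (Python is my choice here; the question does not name an engine). It uses re.fullmatch instead of anchors, partly because in Python \Z matches only at the very end of the string, which differs from the \Z described above. The helper name is_blood_group is hypothetical.

```python
import re

# Square brackets build a character class, so '|' is just another member.
char_class = re.compile(r"[A|B|AB|O][\+|\-]")
accepts_pipes = char_class.fullmatch("||") is not None   # unwanted match
rejects_ab = char_class.fullmatch("AB+") is None          # two letters never fit

# Python tries alternatives left to right and keeps the first that succeeds.
first_alt = re.match(r"A|B|AB|O", "AB+").group()
longest_first = re.match(r"AB|A|B|O", "AB+").group()

BLOOD_GROUP = re.compile(r"(?:AB|A|B|O)[-+]")

def is_blood_group(text: str) -> bool:
    """Return True only if the whole string is a blood group such as 'AB+'."""
    return BLOOD_GROUP.fullmatch(text) is not None

# '$' also matches just before a trailing newline, fullmatch does not.
dollar_allows_newline = re.search(r"^(?:AB|A|B|O)[-+]$", "AB+\n") is not None

for candidate in ["A+", "AB-", "O+", "helloAB+asdads", "AB+\n", "C+"]:
    print(repr(candidate), is_blood_group(candidate))
```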
For case-insensitive matching within an HTML pattern attribute, you may try this:
([AaBbOo]|[Aa][Bb])[\+-]
<input type="text" maxlength="3" pattern="([AaBbOo]|[Aa][Bb])[\+-]" required />
^(A|B|AB|O)[+-]?$
This will produce the correct output.

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
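So if your regex was /f(.*)\1/, it could match just the "f" in a string like "fab": the group captures the empty string right after the 'f', and \1 then matches that same empty string again.
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+, the group has to capture at least one character, so on 'foo' it captures the first 'o' and \1 matches the second.)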
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
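To see what actually gets matched, here is a small Python sketch. Note the assumption: Python's re is a backtracking engine, so the details may differ from what regexec() reports, but the empty match on blabl comes out the same way.

```python
import re

# (.*)\1 on "blabl": no non-empty substring is immediately repeated,
# so the only way to succeed is an empty group followed by an empty \1.
empty = re.search(r"(.*)\1", "blabl")

# (.+)\1 demands at least one character, so "blabl" has no match at all...
none_found = re.search(r"(.+)\1", "blabl")

# ...while "blabla" contains "bla" immediately followed by "bla".
immediate = re.search(r"(.+)\1", "blabla")

# Allowing arbitrary text in between matches "BLA...BLA".
later = re.search(r"(.+).*\1", "BLAahemBLA")

print(empty.span(), repr(empty.group(1)))
print(none_found)
print(immediate.group(1), later.group(1))
```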