Regular expression/Regex with Java/Javascript: performance drop or infinite loop - regex

I'd like to submit a very specific performance problem that I want to understand.
Goal
I'm trying to validate a custom syntax with a regex. I don't usually run into performance issues with regexes, so I like to use them.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid syntax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You can find the regex and a test text here:
https://regexr.com/3jama
I hope that is sufficient; I don't know how to explain what I want to match any better than with the regex itself ;-).
Issue
Applying the regex to valid text costs almost nothing; it's nearly instant.
But for certain invalid texts, the regexr app hangs. This isn't specific to regexr, since I also saw dramatic slowdowns in my own Java and JavaScript code.
I need to validate the text as the user types. I could also validate on click, but I cannot afford for the app to hang when the text submitted by the user is structured like the case below, or like any other case that produces the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text.
The invalid text that triggers the performance drop then becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test, this one with no performance drop, could be:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'd be glad if a regex guru passing by could explain what I'm doing wrong, or why my use case isn't suited to regex.

This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get ([^\][{}(),]+,?)+. Because the ,? is optional, the engine is free to skip it, so in effect you have ([^\][{}(),]+)+, which opens the gate to catastrophic backtracking, as a string can be matched in quite a lot of different ways with this pattern.
I get what you are trying to do with this pattern: match an identifier, and maybe other identifiers that are separated by commas. The proper way of doing this, however, is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there is no ambiguous way left to backtrack into this pattern.
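To make the difference concrete, here is a minimal sketch in Python (my choice for illustration only; the same reasoning applies to Java and JavaScript). The names IDENT, AMBIGUOUS, UNROLLED and timed_match are hypothetical. The failing input is kept short on purpose, since each extra character roughly doubles the work for the ambiguous pattern.

```python
import re
import time

# Character class from the answer: one or more non-delimiter characters.
IDENT = r"[^\][{}(),]+"

# Ambiguous form: a run of identifier characters can be split between
# iterations of the outer "+" in many different ways.
AMBIGUOUS = re.compile(r"^(?:" + IDENT + r",?)+$")

# Unrolled form: every iteration after the first must start with a comma,
# so there is only one way to split the input.
UNROLLED = re.compile(r"^" + IDENT + r"(?:," + IDENT + r")*$")


def timed_match(pattern, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


good = "actor1,actor2,actor3"
bad = "a" * 18 + "("  # the trailing "(" can never match, forcing backtracking

for name, pattern in (("ambiguous", AMBIGUOUS), ("unrolled", UNROLLED)):
    for text in (good, bad):
        matched, seconds = timed_match(pattern, text)
        print(f"{name:9} {text!r:24} matched={matched} {seconds:.6f}s")
```

Both patterns agree on these two inputs; only the time spent rejecting the bad one should differ noticeably.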
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concatenate, either using strings in your actual source or using defines if you are using a PCRE pattern.
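As a sketch of that splitting, assuming Python and plain string concatenation: all names below (IDENT, ITEM, GROUP, BLOCK, SECTION, SYNTAX, is_valid) are hypothetical, and non-capturing groups are used because only a yes/no answer is needed. Like the condensed regex, this version does not allow whitespace between brackets, and it no longer accepts the trailing commas that the original ,? permitted.

```python
import re

IDENT = r"[^\][{}(),]+"                              # actor1
IDENT_LIST = IDENT + r"(?:," + IDENT + r")*"         # syno1, syno2
ITEM = IDENT + r"(?:\(" + IDENT_LIST + r"\))?"       # actor2(syno1, syno2)
ITEM_LIST = ITEM + r"(?:," + ITEM + r")*"            # actor1, actor2(...)
GROUP = r"\[" + ITEM_LIST + r"\]"                    # [expr1,expr2]
BLOCK = r"\[(?:" + GROUP + r"){1,2}\]"               # [[...][...]]
SECTION = r"\{" + IDENT + r"\}(?:" + BLOCK + r")*"   # {Section}[[...][...]]
SYNTAX = re.compile(r"(?:" + SECTION + r")+")


def is_valid(text):
    """Return True if the whole text follows the custom syntax."""
    # fullmatch anchors at both ends, so no ^ or $ is needed in the pattern.
    return SYNTAX.fullmatch(text) is not None


valid = ("{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]]"
         "[[actor3,actor4(syno3, syno4)][expr3,expr4]]")

print(is_valid(valid))        # True
print(is_valid(valid[:-1]))   # False: trailing "]" removed
print(is_valid(valid + "]"))  # False: one "]" too many
```

Each level only refers to the level below it, which makes it easier to see that no subpattern can match the same text in two different ways.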
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to interpret the whole regex, which could be quite a long operation, and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See the solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
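Since the question mentions taking the original regex as an argument in Python, here is a minimal sketch of that wrapping. The helper names prefilter and matches are hypothetical. The original pattern is wrapped in a non-capturing group so that its own group numbering is left unchanged.

```python
import re


def prefilter(original, prefix):
    """Wrap an arbitrary pattern so that matches must start with a literal prefix."""
    # (?=...) is a zero-width lookahead: it checks for the prefix without
    # consuming it, so the original pattern still matches from the same position.
    return re.compile(r"(?=" + re.escape(prefix) + r")(?:" + original + r")")


def matches(pattern, text):
    """Return the full text of every match."""
    return [m.group(0) for m in pattern.finditer(text)]


print(matches(prefilter(r"cat|car|bat", "ca"), "cat car bat"))   # ['cat', 'car']
print(matches(prefilter(r"[a-z]{4}", "ca"), "card bats cake"))   # ['card', 'cake']
```

An original pattern that begins with global inline flags such as (?i) may need extra handling, because the flags would no longer sit at the very start of the combined pattern.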
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

find string that is missing substring in xml files regular expression

This is my regular expression that finds it:
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find the strings that do not contain the word "part".
The XML lines are:
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>
You can use negative lookahead
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Match anything except new line.
$ - End of string
Demo
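A minimal sketch of that lookahead applied line by line, assuming Python; the name NO_PART is hypothetical and the two lines are the ones from the question.

```python
import re

# ^(?!.*part) rejects the line if "part" occurs anywhere after the start.
NO_PART = re.compile(r"^(?!.*part).*$")

lines = [
    '<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>',
    '<instance_material symbol="material_677" target="#material_677"/>',
]

kept = [line for line in lines if NO_PART.match(line)]
print(kept)  # only the material_677 line remains
```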
Many regex beginners run into the problem of finding a string that does not contain certain words. You can find more useful tips on Regular-Expression.info.
^((?!part).)*$
You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you need to consider tricks like people writing a character reference such as &#112; in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_material[@symbol[not(contains(., 'part'))]]
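As a rough illustration of the parser-based approach, here is a sketch using Python's standard-library ElementTree. Its XPath subset has no contains() function, so the "does not contain part" test is done in plain Python instead of in the query. The root wrapper element and the function name symbols_without_part are assumptions, since the question shows only the two lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document around the two lines from the question.
DOCUMENT = """<root>
  <instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
  <instance_material symbol='material_677' target = "#material_677"/>
</root>"""


def symbols_without_part(xml_text):
    """Return symbol values of instance_material elements that lack 'part'."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter("instance_material"):
        symbol = element.get("symbol")
        # Require the attribute to exist, as the XPath query does.
        if symbol is not None and "part" not in symbol:
            result.append(symbol)
    return result


print(symbols_without_part(DOCUMENT))  # ['material_677']
```

The second element deliberately uses single quotes and spaces around "=" to show that the parser handles the variations the regex would miss.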

Why is my search in BBEdit causing a "stack overflow" error?

I'm stumped about a "stack overflow" error ("out of stack space (application error code: 12246)") that I'm getting in BBEdit when I do a "replace all", searching for
(@article(((?!eprint|@article|@book).)*\r)*)pmid = {(.+)}((((?!eprint|@article|@book).)*\r)*(@|\r*\z))
and replacing with
\1eprinttype = {pubmed}, eprint = {\4}\5
I can use these same patterns manually, doing one-at-a-time find & replace, without any errors, even once the match no longer occurs. I can also avoid the error by working on smaller files.
I suspect that it's my inefficient and sloppy regex coding that's to blame, and would appreciate an expert's help in doing this more efficiently. I'm trying to locate all entries in a BibLaTeX bibliography that don't already have an eprint field, but which have a pmid field, and replace the pmid field with a corresponding e-print specification (using eprint and eprinttype).
Update: After some experimentation, I've found that a different approach is the only thing I can get to work. Searching for
(?(?=@article(.+\r)+eprint = {(.+\r)+}\r*)(?!)|(@article(.+\r)+)pmid = {(.+)}((.+\r)+}\r*))
and replacing with
\3eprinttype = {pubmed}, eprint = {\5}\6
does the trick. The only problem with this is the backreferences are fragile, but I can't get named backreferences to work in BBEdit.
It's probably catastrophic backtracking caused by this last part:
.)*\r)*(@|\r*\z))
If you break that down and simplify it, you essentially have a .*, a \r*, and another \r* right next to each other. Now picture a string of \r characters at the end of your input: How should each \r be distributed? Which of those little clauses will soak up each \r character? If you have \r\r\r\r\r, you could eat all five \rs with the .* part and none at all with the \r* parts...or, you can make up any number of permutations that will still match. Since the * is greedy, it will try to fill the .* up first, but if that fails, it has to keep trying permutations until one of them works. So it's probably hogging a bunch of your resources with unnecessary backtracking, until finally it crashes.
I'm not an expert on optimization techniques for regex, but I'd start there if I were you.
Update:
Check out the Wikipedia article on PCRE:
Unless the "NoRecurse" PCRE build option (aka
"--disable-stack-for-recursion") is chosen, adequate stack space must
be allocated to PCRE by the calling application or operating system.
...
While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids entirely the issue of stack overflows.
So I think catastrophic backtracking is a good bet here. I'd try to solve it by tweaking your regex before changing the build options on PCRE.
Obviously this is some bug. But you could try changing the expression a bit. It's difficult to optimize the expression without knowing the requirements, but here's a guess:
(@article(?:(?:(?!eprint|@article|@book|pmid)[^\r])*+\r)*+)pmid = {([^\n\r]+)}((?:(?:(?!eprint|@article|@book)[^\r])*+\r)*(?:@|\r*\z))
Replace with:
\1eprinttype = {pubmed}, eprint = {\2}\3
BBEdit seems to use PCRE; unless it's (very) outdated, the above expression should be compatible.

Regex assistance: include/exclude

Hello, I am trying to figure out this regular expression. I have a URL that can have different query string parameters in different positions.
test.aspx?foo=bar&abc=123
test.aspx?abc=123&foo=bar
test.aspx?foo=bar&abc=123#T1
test.aspx?abc=123&foo=bar#T2
I am trying to find only the ones without the #Tnumber.
Here is what I have so far.
test.aspx\?(?!\#T[0-9])
However, it still selects all of them. Is there a way to have a string constant and scan it down the line?
Juniorflip
If #Tnum is always at the end, you just need to do a bit of anchoring. For example, like this:
test.aspx\?.*(?!\#T[0-9])...$
But that's very fragile as it depends on the bad URLs always ending in a very particular form and good URLs always having enough characters to soak up that end matching. A negative lookbehind assertion is somewhat better, but still fragile and less commonly supported:
test.aspx\?.*(?<!\#T[0-9])$
It's better to write a regular expression that matches what you don't want and to just invert the logic of what to do when you get a match (i.e., "if it matches throw it away", instead of "if it matches use it"). But really it's much better to delegate the parsing of the URLs to a specialized library and then just do a simpler check against the fragment identifier as a logical component instead of as a horrible RE hack.
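A minimal sketch of the library approach, assuming Python's urllib.parse. The helper name has_t_fragment is hypothetical, and treating "#Tnumber" as a fragment made of T followed by one or more digits is my assumption about the format.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of the unwanted fragment: "T" followed by digits.
T_FRAGMENT = re.compile(r"T[0-9]+")


def has_t_fragment(url):
    """Return True if the URL's fragment looks like T1, T2, ..."""
    # urlsplit separates the fragment (the part after "#") from the rest.
    return T_FRAGMENT.fullmatch(urlsplit(url).fragment) is not None


urls = [
    "test.aspx?foo=bar&abc=123",
    "test.aspx?abc=123&foo=bar",
    "test.aspx?foo=bar&abc=123#T1",
    "test.aspx?abc=123&foo=bar#T2",
]

wanted = [url for url in urls if not has_t_fragment(url)]
print(wanted)  # ['test.aspx?foo=bar&abc=123', 'test.aspx?abc=123&foo=bar']
```

The order of the query string parameters no longer matters, because the check only looks at the fragment component.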

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
Captures the piece before "archive" in both cases. Depending on regex flavor you'll need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead, but I don't like lookaheads, and it's easier to match a lot and just capture the parts you need (in my opinion), so if you prefer to use a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parentheses, but won't capture the value, so the only value captured is the value you want. (?:) has a risk of varying depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
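For illustration, here is that last pattern tried in Python, where / needs no escaping. The names FIELD and first_field are hypothetical, and the URLs are the four from the question.

```python
import re

# Domain, an optional "blogs/" field, then the captured field.
FIELD = re.compile(r"http://[^/]+/(?:blogs/)?([^/]+)/")


def first_field(url):
    """Return the captured field, or None if the URL does not fit."""
    match = FIELD.match(url)
    return match.group(1) if match else None


urls = [
    "http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx",
    "http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx",
    "http://codebetter.com/blogs/jeremy.miller/",
    "http://weblogs.asp.net/scottgu/",
]

for url in urls:
    print(first_field(url))  # jeremy.miller, scottgu, jeremy.miller, scottgu
```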
Wow. That's a lot of talking about regexes. I need to shut up and post already.
Try this one:
/\/([\w\.]+)\/archive/