Why is my search in BBEdit causing a "stack overflow" error? - regex

I'm stumped about a "stack overflow" error--"out of stack space (application error code: 12246)"--that I'm getting in BBEdit when I do a "replace all", searching for
(#article(((?!eprint|#article|#book).)*\r)*)pmid = {(.+)}((((?!eprint|#article|#book).)*\r)*(#|\r*\z))
and replacing with
\1eprinttype = {pubmed}, eprint = {\4}\5
I can use these same patterns manually, doing one-at-a-time find & replace, without any errors, even once the match no longer occurs. I can also avoid the error by working on smaller files.
I suspect that it's my inefficient and sloppy regex coding that's to blame, and would appreciate an expert's help in doing this more efficiently. I'm trying to locate all entries in a BibLaTeX bibliography that don't already have an eprint field, but which have a pmid field, and replace the pmid field with a corresponding e-print specification (using eprint and eprinttype).
Update: After some experimentation, I've found that a different approach is the only thing I can get to work. Searching for
(?(?=#article(.+\r)+eprint = {(.+\r)+}\r*)(?!)|(#article(.+\r)+)pmid = {(.+)}((.+\r)+}\r*))
and replacing with
\3eprinttype = {pubmed}, eprint = {\5}\6
does the trick. The only problem with this is the backreferences are fragile, but I can't get named backreferences to work in BBEdit.

It's probably catastrophic backtracking caused by this last part:
.)*\r)*(#|\r*\z))
If you break that down and simplify it, you essentially have a .*, a \r*, and another \r* right next to each other. Now picture a string of \r characters at the end of your input: How should each \r be distributed? Which of those little clauses will soak up each \r character? If you have \r\r\r\r\r, you could eat all five \rs with the .* part and none at all with the \r* parts...or, you can make up any number of permutations that will still match. Since the * is greedy, it will try to fill the .* up first, but if that fails, it has to keep trying permutations until one of them works. So it's probably hogging a bunch of your resources with unnecessary backtracking, until finally it crashes.
I'm not an expert on optimization techniques for regex, but I'd start there if I were you.
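To make the "permutations" concrete, here is a small Python sketch. It models only the three adjacent starred clauses described above, not BBEdit's actual engine, and the function name distributions is made up for illustration. It enumerates every way five \r characters can be shared out between three neighbouring starred sub-patterns.

```python
from itertools import product
from math import comb


def distributions(n, clauses=3):
    """Every way to split n identical characters among `clauses` adjacent starred sub-patterns."""
    return [d for d in product(range(n + 1), repeat=clauses) if sum(d) == n]


ways = distributions(5)
print(len(ways))                     # 21 candidate splits for five \r characters
print(len(ways) == comb(5 + 2, 2))   # True: agrees with the closed form C(n + 2, 2)
```

A greedy engine would start with (5, 0, 0) and only move on to the other splits when the rest of the pattern fails. In the real pattern these clauses sit inside another repeated group, so the number of combinations to try can grow much faster than this single-level count suggests.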
Update:
Check out the Wikipedia article on PCRE:
Unless the "NoRecurse" PCRE build option (aka
"--disable-stack-for-recursion") is chosen, adequate stack space must
be allocated to PCRE by the calling application or operating system.
...
While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids entirely the issue of stack overflows.
So I think catastrophic backtracking is a good bet here. I'd try to solve it by tweaking your regex before changing the build options on PCRE.

Obviously this is some bug. But you could try changing the expression a bit. It's difficult to optimize the expression without knowing the requirements, but here's a guess:
(#article(?:(?:(?!eprint|#article|#book|pmid)[^\r])*+\r)*+)pmid = {([^\n\r]+)}((?:(?:(?!eprint|#article|#book)[^\r])*+\r)*(?:#|\r*\z))
Replace with:
\1eprinttype = {pubmed}, eprint = {\2}\3
BBEdit seems to use PCRE; unless its version is (very) outdated, the above expression should be compatible.
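For anyone who wants to try the same substitution outside BBEdit, here is a rough Python sketch of it. It is an adaptation, not the expression above verbatim: possessive quantifiers are replaced by plain greedy ones (Python's re only accepts them from 3.11 on), \z becomes \Z, and the literal braces are escaped. The sample entries and the helper name convert are invented for illustration, and the sketch assumes \r line endings with one field per line.

```python
import re

# Adapted from the expression above: *+ became *, \z became \Z, braces escaped.
ENTRY = re.compile(
    r"(#article(?:(?:(?!eprint|#article|#book|pmid)[^\r])*\r)*)"
    r"pmid = \{([^\n\r]+)\}"
    r"((?:(?:(?!eprint|#article|#book)[^\r])*\r)*(?:#|\r*\Z))"
)
REPLACEMENT = r"\1eprinttype = {pubmed}, eprint = {\2}\3"


def convert(text):
    """Apply the replacement until nothing changes.

    A match consumes the leading "#" of the following entry, so a single
    pass can skip an entry that sits directly after a converted one.
    """
    while True:
        text, count = ENTRY.subn(REPLACEMENT, text)
        if count == 0:
            return text


# Hypothetical sample data: three entries, the last already has an eprint field.
sample = "\r".join([
    "#article{a,", "title = {A},", "pmid = {111},", "year = {2010}", "}", "",
    "#article{b,", "title = {B},", "pmid = {222}", "}", "",
    "#article{c,", "eprint = {999},", "pmid = {333}", "}", "",
])
result = convert(sample)
print(result.replace("\r", "\n"))
```

The first two entries get eprinttype and eprint fields in place of pmid, while the third is left alone because it already has an eprint field.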

Related

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I'd like to submit a very specific performance problem that I want to understand.
Goal
I'm trying to validate a custom syntax with a regex. I don't usually run into performance issues with regexes, so I like to use them.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid syntax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You can find the regex and a test text here:
https://regexr.com/3jama
I hope that is sufficient; I don't know how to explain what I want to match better than with the regex itself ;-).
Issue
Applying the regex to valid text costs very little; it's almost instant.
But for one specific kind of invalid text, the regexr app hangs. This isn't specific to the regexr app, since I also saw dramatic performance drops with my own Java and JavaScript code.
My need is to validate the text as the user types it. I could even live with validating on click, but I cannot afford to have the app hang when the text submitted by the user is structured like the case below, or like any other that produces the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text.
The invalid text that triggers the performance drop then becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test, this one with no performance drop, could be:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'd be glad if a regex guru passing by could explain what I'm doing wrong, or why my use case isn't suited to regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+, which opens the gate to catastrophic backtracking, as a string can be matched in quite a lot of different ways with this pattern.
I get what you are trying to do with this pattern: match an identifier, and maybe other identifiers separated by commas. The proper way of doing this, however, is ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there is no ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
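As a sketch of what splitting into subpatterns could look like, here is one way to assemble a pattern of the same shape from named string pieces in Python. The names IDENT, ITEM, BLOCK and so on are invented for this example, the groups are non-capturing, and like the rewritten pattern it does not accept trailing commas.

```python
import re

IDENT = r"[^\]\[{}(),]+"                              # one identifier, e.g. actor1
IDENT_LIST = IDENT + r"(?:," + IDENT + r")*"          # syno1, syno2
ITEM = IDENT + r"(?:\(" + IDENT_LIST + r"\))?"        # actor2(syno1, syno2)
ITEM_LIST = ITEM + r"(?:," + ITEM + r")*"             # actor1, actor2(...)
BLOCK = r"\[" + ITEM_LIST + r"\]"                     # [actor1, actor2(...)]
GROUP = r"\[(?:" + BLOCK + r"){1,2}\]"                # [[...][...]]
SECTION = r"\{" + IDENT + r"\}(?:" + GROUP + r")*"    # {Section}[[...][...]]...
PATTERN = re.compile(r"^(?:" + SECTION + r")+$")

valid = ("{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]]"
         "[[actor3,actor4(syno3, syno4)][expr3,expr4]]")
missing_bracket = valid[:-2]   # the input that made the original pattern hang
extra_bracket = valid + "]"

print(PATTERN.match(valid) is not None)   # True
print(PATTERN.match(missing_bracket))     # None
print(PATTERN.match(extra_bracket))       # None
```

Each piece is followed by a character that the piece itself cannot match, so there is only one way to divide the input between them, and the truncated input is rejected without the long wait.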
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Futile attempt to run regular expression find/replace in MS Word using groups on Mac

According to the received wisdom MS Word (more or less) supports find/replace with use of regular expressions. I have a simple regular expression:
^(C[[:alpha:]]*)(\d*)(.*)$
That I'm running on the data:
indSIMDdecile
CSdeccrim12006
CSdeccrim12006
CSdeccrim12009
CSdeccrim12009
CSdeccrim12012
CSdeccrim12012
CSdeceduc12004
CSdeceduc12004
CSdeceduc12006
CSdeceduc12006
CSdeceduc12009
CSdeceduc12009
CSdeceduc12012
CSdeceduc12012
CSdecemp12004.x
I'm interested in returning the first word prior to the digit 1, which works as demonstrated on regex101 here.
Problem
I would like to do the same in MS Word (v. 15.18 on Mac). After getting error messages for supplying unsuitable syntax, I learned that MS Word does not support the full regex syntax. I simplified my expression to something along the lines of:
but the search does not find any strings and nothing gets replaced. Hence my question: is it possible to use regex in MS Word on Mac?
The linked help website hints that something like that should be possible, but so far no luck.
The simple answer is "no", if you mean "Does Mac Word have a UI feature that lets you use one of the modern dialects of regex?" Word's Find/Replace only supports its own Regular Expression syntax.
In this case, I think the following will give you what you need:
Find with wildcards:
(C)([!1]@)(1)
and a replace by
\1
(If you also had to find "C1", then that doesn't work, and unfortunately neither does
(C)([!1]{0,})(1)
because Word does not allow 0 in the {,} pattern.)
But there is a problem with "@". If the text the "@" is matching is long, the find/replace may fail. There is supposed to be a 255 limit, but it seems rather more arbitrary than that. (I have long suspected a buffer overrun type error in the Word code, but perhaps there is a simpler explanation.)
If you mean "is there any way to use modern regex with Word?", then the answer is "Yes, but you only get to operate on a copy of the text in the document." You will need to create your own code to do the 'replace' part of the find/replace, and that means you would have to deal with issues such as preserving formatting that Word's built-in find/replace might get right for you.
On the Windows side, people who want a better regex than Word's often use VBScript's regexp object because it is easily used from VBA. VBA itself only really has the "like" operator, which also has only fairly crude pattern-matching abilities. I think there are examples of VBScript regexp use on StackOverflow. On the Mac side, you would either have to use VBA and "shell out" to one of the built-in Mac/Unix utilities to do your finding (and perhaps replacing), or perhaps use AppleScript or JavaScript application scripting to do it. As far as I can remember, AppleScript does not have a 'modern' regex built in either.
[As a bit of history, Word's "regular expressions" were I think introduced in Word 6, around 1993, at a time when most dialects of regex were much more crude than they are today. I don't think Word's version has moved along much at all - it probably added some Unicode support at some point, but that's probably about it. I assume that people using modern regex don't regard it as regex at all, and I personally prefer not to call Word's Regular Expressions 'regex' precisely for that reason.]
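As an illustration of the "operate on a copy of the text" route, here is a minimal Python sketch. It assumes the lines have already been copied out of the document as plain text; getting them out of Word and back in is not shown. Python's re does not support the POSIX class [[:alpha:]], so [A-Za-z] stands in for it.

```python
import re

# The question's expression, with [[:alpha:]] swapped for [A-Za-z].
PATTERN = re.compile(r"^(C[A-Za-z]*)(\d*)(.*)$", re.MULTILINE)

# A few of the sample lines from the question.
text = "\n".join([
    "indSIMDdecile",
    "CSdeccrim12006",
    "CSdeceduc12004",
    "CSdecemp12004.x",
])

cleaned = PATTERN.sub(r"\1", text)   # keep only the leading word
print(cleaned)
```

Lines that do not start with "C", such as indSIMDdecile, are left unchanged; the others are cut back to the word before the first digit.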

Having difficulties with regex

Here is a string:
111A9d809d4712701eea0e9c2b2c143941ab000
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The 9d809d4712701eea0e9c2b2c143941ab part changes every time, but the 111A and the 000 never change. I need to match the whole string. I tried googling, but it is very hard to find answers to such specific needs. Can anyone also suggest a web page or a program that would help me solve these kinds of problems?
You are looking for:
111A[a-z0-9]{32}000
or for matching words only:
\b111A[a-z0-9]{32}000\b
or for matching whole strings only:
^111A[a-z0-9]{32}000$
or, if the middle part has a variable length, you can replace the exact count (32) with "at least one" (+) or "any number, including none" (*):
111A[a-z0-9]+000
111A[a-z0-9]*000
\b111A[a-z0-9]+000\b
\b111A[a-z0-9]*000\b
^111A[a-z0-9]+000$
^111A[a-z0-9]*000$
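A quick way to check these variants against the sample string is a few lines of Python; this is only a sketch, and the surrounding "id=...;" text is made up to show the word-boundary version.

```python
import re

sample = "111A9d809d4712701eea0e9c2b2c143941ab000"

FIXED = re.compile(r"111A[a-z0-9]{32}000")        # exactly 32 characters in the middle
ANY_LENGTH = re.compile(r"111A[a-z0-9]+000")      # one or more characters in the middle
EMBEDDED = re.compile(r"\b111A[a-z0-9]{32}000\b") # the token inside a longer text

print(FIXED.fullmatch(sample) is not None)             # True
print(FIXED.fullmatch("111Aabc000") is not None)       # False: middle part too short
print(ANY_LENGTH.fullmatch("111Aabc000") is not None)  # True
print(EMBEDDED.search("id=" + sample + ";").group(0) == sample)  # True
```

fullmatch plays the role of the ^...$ anchors here, so the same compiled pattern can be used both for whole-string validation and for searching inside a larger text.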
Bold part?
Try
111A[a-z0-9]+000
I'd advise you to buy a good book, for instance "Mastering Regular Expressions", by Jeffrey Friedl which is really good (http://shop.oreilly.com/product/9780596528126.do). Regex are extremely useful but also take some investment before you start seeing real benefits to your coding.
As for your regex, /^111A[a-z0-9]+000$/ would do the trick (get rid of ^ and $ if you want to allow character before and after in the string, and depending on your programming language you may need to drop the surrounding slashes).
You can use this regex:
/111A\w+?000/g
Or this one if your pattern has exactly 32 chars:
/111A\w{32}000/g
Demo here: http://regex101.com/r/cQ7uF1/1

In what ways can I improve this regular expression?

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr
The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by a single character .. I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace characters, so you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with Xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption that you're trying to match valid code. If not, this will need to be modified to validate the import syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
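For a test outside RegExr, the expression can be run over a few lines in Python. This is a sketch with made-up sample lines; the last two are there only to show what gets rejected.

```python
import re

CATEGORY_IMPORT = re.compile(r"^#import\s+(.*\+.*)$", re.MULTILINE)

source = "\n".join([
    '#import <UIKit/UIKit.h>',
    '#import "NSString+MultilineFontSize.h"',
    'int total = a + b;',
    '#importbutnotreally "X+Y.h"',
])

hits = CATEGORY_IMPORT.findall(source)
print(hits)   # ['"NSString+MultilineFontSize.h"']
```

Only the category import is returned: the UIKit line has no +, the arithmetic line does not start with #import, and the last line fails the \s+ check.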
sed 's:^#import \(.*[+].*\):\1:' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.

Why does Mono locks up on regex

This is the line where Mono on Linux locks up (I am using the 2.6.4 VM distro from the official site):
var match = Regex.Match(sz, linkPattern);
The pattern is this, which captures the link and the title:
var linkPattern = @"<\ba\b[^\>]*\bhref\b*=\b*""([^""\>]*)""[^\>]*\btitle\b*=\b*""([^""\>]*) by [^""\>]*""";
When Mono hits that line it doesn't crash, throw an exception, or anything. Using top, I see Mono using 96% of the CPU. I don't know how long the string is; I suspect it's <8 KB (I tested a different URL). It has been a few minutes since I ran the code, so something must be broken.
"Too many \b's" was my first reaction. But really:
\b means word boundary. In my opinion, <\ba and <a should be identical. Also, \b* therefore would mean "optional repetition of word boundaries", which sounds rather confusing.
I guess I've never used \b at all, and used \s? or \s* instead.
Did you try a different regex engine (Perl, PHP) to determine whether the lockup is due to Mono?
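Here is what the pattern might look like with \s* in place of \b* and the redundant \b after < removed, tried out in Python rather than Mono. This is only a sketch: the HTML snippet is invented, and whether the rewrite avoids the lockup in Mono is untested.

```python
import re

# \b* replaced by \s*; the unnecessary escapes before > are dropped.
LINK = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*"([^">]*)"[^>]*\btitle\s*=\s*"([^">]*) by [^">]*"'
)

html = '<a class="t" href="/song/42" title="Blue Sky by Some Band">Blue Sky</a>'

m = LINK.search(html)
print(m.group(1))   # /song/42
print(m.group(2))   # Blue Sky
```

Running the same input through a second engine like this is also a cheap way to tell whether a hang comes from the pattern itself or from one particular implementation.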
There are some bugs in Mono's regex implementation that can cause it to recurse infinitely. Probably the only fix is to rewrite your pattern to be a simpler regular expression, or not use regular expressions for this task.
You may also want to file a bug. I think there is a Google Summer of Code student currently working on Mono's regular expression engine.