Why does Mono locks up on regex - regex

This is the line mono on linux locks up (i am using 2.6.4 VM distro on the official site)
var match = Regex.Match(sz, linkPattern);
The string is this which gets the link and the title.
var linkPattern = #"<\ba\b[^\>]*\bhref\b*=\b*""([^""\>]*)""[^\>]*\btitle\b*=\b*""([^""\>]*) by [^""\>]*""";
When mono hits that line it doesnt crash, throw an exception or anything. Using tops i see mono using 96% of the CPU. I dont know how long the string is. I suspect its <8kb (i tested a different url) and it has been a few minutes since i ran the code so something must be broken.

"Too many \b's" was my first reaction. But really:
\b means word boundary. In my opinion, <\ba and <a should be identical. Also, \b* therefore would mean "optional repetition of word boundaries", which sounds rather confusing.
I guess I've never used \b at all, and used \s? or \s* instead.
Did you try a different regex engine (Perl, PHP) to determine whether the lockup is due to Mono?

There are some bugs in Mono's regex implementation that can cause it to recurse infinitely. Probably the only fix is to rewrite your pattern to be a simpler regular expression, or not use regular expressions for this task.
You may also want to file a bug. I think there is a Google Summer of Code student currently working on Mono's regular expression engine.

Related

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I want here to submit a very specific performance problem that i want to understand.
Goal
I'm trying to validate a custom synthax with a regex. Usually, i'm not encountering performance issues, so i like to use it.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid synthax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You could find the regex and a test text here :
https://regexr.com/3jama
I hope that be sufficient enough, i don't know how to explain what i want to match more than with a regex ;-).
Issue
Applying the regex on valid text is not costing much, it's almost instant.
But when it comes to specific not valid text case, the regexr app hangs. It's not specific to regexr app since i also encountered dramatic performances with my own java code or javascript code.
Thus, my needs is to validate all along the user is typing the text. I can even imagine validating the text on click, but i cannot afford that the app will be hanging if the text submited by the user is structured as the case below, or another that produce the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text
So the invalid text to raise the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test could be, and with no permformance drop:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'll be glad if a regex guru coming by could explain me what i'm doing wrong, or why my use case isn't adapted for regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+ which is an opening gate to the catasrophic backtracking, as string can be matched in quite a lot of different ways with this pattern.
I get what you try to do with this pattern - match an identifier - and maybe other other identifiers that are separated by comma. The proper way of doing this however is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there isn't an ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+))*\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Futile attempt to run regular expression find/replace in MS Word using groups on Mac

According to the received wisdom MS Word (more or less) supports find/replace with use of regular expressions. I have a simple regular expression:
^(C[[:alpha:]]*)(\d*)(.*)$
That I'm running on the data:
indSIMDdecile
CSdeccrim12006
CSdeccrim12006
CSdeccrim12009
CSdeccrim12009
CSdeccrim12012
CSdeccrim12012
CSdeceduc12004
CSdeceduc12004
CSdeceduc12006
CSdeceduc12006
CSdeceduc12009
CSdeceduc12009
CSdeceduc12012
CSdeceduc12012
CSdecemp12004.x
I'm interested in returning the first word prior to the digit 1, which works as demonstrated on regex101 here.
Problem
I would like to the same but in MS Word (v. 15.18 on Mac). After getting error messages of trying to supply unsuitable syntax I learned that MS Word does not support to the full regex syntax. I simplified my expression to something on the lines:
but the search does not find any strings and nothing gets replaced. Hence my questions, is it possible to use MS Word on Mac with regex?
The linked help website hints that something like that should be possible, but so far now luck.
The simple answer is "no", if you mean "Does Mac Word have a UI feature that lets you use one of the modern dialects of regex?" Word's Find/Replace only supports its own Regular Expression syntax.
In this case, I think the following will give you what you need:
Find with wildcards:
(C)([!1]#)(1)
and a replace by
\1
(If you also had to find "C1", then that doesn't work, and unfortunately nor does
(C)([!1]{0,})(1)
because Word does not allow 0 in the {,} pattern)
But there is a problem with "#". If the text the "#" is looking for is long, the find/replace may fail. There is supposed to be a 255 limit, but it seems rather more arbitrary than that. (I have long suspected a buffer overrun type error in the Word code, but perhaps there is a simpler explanation).
If you mean, "is there any way to use modern regex with Word?", then the answer is "Yes, but you only get to operate on a copy of the text in the document. You will need to create your own code to do the 'replace' part of the find replace, and that means that you would have to deal with any of the issues such as preserving formatting that Word's built-in find/replace might get right for you.
On the Windows side, people who want a better regex than Word's often use VBScript's regexp object because it is easily used from VBA. VBA itself only really has the "like" operator, which also only has fairly crude pattern matching abilities. I think there are examples of VBScript rexexp use on StackOverflow. On the Mac side, you would either have to use VBA and "shell out" to one of the built-in Mac/Unix utilities to do your finding (and perhaps replacing), or perhaps use Applescript or Javascript application scripting to do it. As far as I can remember Applescript does not have a 'modern' regex built-in either.
[As a bit of history, Word's "regular expressions" were I think introduced in Word 6, around 1993, at a time when most dialects of regex were much more crude than they are today. I don't think Word's version has moved along much at all - it probably added some Unicode support at some point, but that's probably about it. I assume that people using modern regex don't regard it as regex at all, and I personally prefer not to call Word's Regular Expressions 'regex' precisely for that reason.]

Why is my search in BBEdit causing a "stack overflow" error?

I'm stumped about a "stack overflow" error--"out of stack space (application error code: 12246)--that I'm getting in BBEdit when I do a "replace all", searching for
(#article(((?!eprint|#article|#book).)*\r)*)pmid = {(.+)}((((?!eprint|#article|#book).)*\r)*(#|\r*\z))
and replacing with
\1eprinttype = {pubmed}, eprint = {\4}\5
I can use these same patterns manually, doing one-at-a-time find & replace, without any errors, even once the match no longer occurs. I can also avoid the error by working on smaller files.
I suspect that it's my inefficient and sloppy regex coding that's to blame, and would appreciate an expert's help in doing this more efficiently. I'm trying to locate all entries in a BibLaTeX bibliography that don't already have an eprint field, but which have a pmid field, and replace the pmid field with a corresponding e-print specification (using eprint and eprinttype).
Update: After some experimentation, I've found that a different approach is the only thing I can get to work. Searching for
(?(?=#article(.+\r)+eprint = {(.+\r)+}\r*)(?!)|(#article(.+\r)+)pmid = {(.+)}((.+\r)+}\r*))
and replacing with
\3eprinttype = {pubmed}, eprint = {\5}\6
does the trick. The only problem with this is the backreferences are fragile, but I can't get named backreferences to work in BBEdit.
It's probably catastrophic backtracking caused by this last part:
.)*\r)*(#|\r*\z))
If you break that down and simplify it, you essentially have a .*, a \r*, and another \r* right next to each other. Now picture a string of \r characters at the end of your input: How should each \r be distributed? Which of those little clauses will soak up each \r character? If you have \r\r\r\r\r, you could eat all five \rs with the .* part and none at all with the \r* parts...or, you can make up any number of permutations that will still match. Since the * is greedy, it will try to fill the .* up first, but if that fails, it has to keep trying permutations until one of them works. So it's probably hogging a bunch of your resources with unnecessary backtracking, until finally it crashes.
I'm not an expert on optimization techniques for regex, but I'd start there if I were you.
Update:
Check out the Wikipedia article on PCRE:
Unless the "NoRecurse" PCRE build option (aka
"--disable-stack-for-recursion") is chosen, adequate stack space must
be allocated to PCRE by the calling application or operating system.
...
While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids entirely the issue of stack overflows.
So I think catastrophic backtracking is a good bet here. I'd try to solve it by tweaking your regex before changing the build options on PCRE.
Obviously this is some bug. But you could try changing the expression a bit. It's difficult to optimize the expression without knowing the requirements, but here's a guess:
(#article(?:(?:(?!eprint|#article|#book|pmid)[^\r])*+\r)*+)pmid = {([^\n\r]+)}((?:(?:(?!eprint|#article|#book)[^\r])*+\r)*(?:#|\r*\z))
Replace with:
\1eprinttype = {pubmed}, eprint = {\2}\3
BBEdit seems to use PCRE, unless it's (very) outdated the above expression should be compatible.

Selenum-server fails regexp pattern that passes in Selenium-IDE

i have a (i guess) simple question.
I am running Selenium test cases (HTML, selense) after a successful build in Hudson. The testcases pass when i run them in the IDE after the build but not on the server. The cases that fails holds a regexp expression like so:
regexp:The profile details have been updated, it can take up to [0-9]*\s\w*\(.\) until the changes are fully visible.
the goal is to match times like 1 hour(s), 30 second(s) 4 minutes(s)
Have someone encountered a problem like this and how did you solve it?
I would use a slightly simpler expression, assuming you do not need to handle things like 1 second, 2 seconds, etc, but only need to handle things like 1 second(s) or 2 second(s), you could use this expression:
try replacing the regex stuff with
[0-9]+ [a-z]+\(s\)
Some older regex flavors (POSIX ERE) don't allow the shorthand character classes, so if those rules are somehow being invoked, then it would fail on \s and \w (and maybe even the .). The above expression should work almost anywhere; though it is not as flexible, it should be flexible enough for your situation.
Some flavors that are even older (POSIX BRE) don't support the shorthand repetition and actually interpret curly braces and parentheses differently as well. If that is the flavor being expected, it might just fail on any normal expression, and you'd need to usde something like:
[0-9]\{1,\} [a-z]\{1,\}(s)
If neither of these work, then there is a significant difference in the engines implementing your regex OR there is some difference in how the page is rendering and being parsed/searched/analyzed by the Selenium Server.
If that is the case, there could be <br> tags or formatting tags (like <i> or <span>) that are being parsed as text rather than being extracted from the string. Then, something like changes are fully visible might be parsed as changes are <em>fully</em> visible and thus failing the test
When you test it locally versus hudson run, are there differences in OS platform?
I know that I had some problems with windows xp versus unix (where hudson was deployed). Although I don't think that is the case here, just a thought though.

Regex PatternRepository pattern on BlackBerry 5 - how to ignore case

I hope this title makes sense - I need case-insensitive regex matching on BlackBerry 5.
I have a regular expression defined as:
public static final String SMS_REG_EXP = "(?i)[(htp:/w\\.)]*cobiinteractive\\.com/[\\w|\\%]+";
It is intended to match "cobiinteractive.com/" followed by some text. The preceding (htp:w.) is just there because on my device I needed to override the internal link-recognition that the phone applies (shameless hack).
The app loads at start-up. The idea is that I want to pick up links to my site from sms & email, and process them with my app.
I add it to the PatternRepository using:
PatternRepository.addPattern(
ApplicationDescriptor.currentApplicationDescriptor(),
GlobalConstants.SMS_REG_EXP,
PatternRepository.PATTERN_TYPE_REGULAR_EXPRESSION,
applicationMenu);
On the os 4.5 / 4.7 simulators and on
a Curve 8900 device (running 4.5),
this works.
On the os 5 simulators and the Bold
9700 I tested, app fails to compile
the pattern with an
IllegalArgumentException("unrecognized
character after (?").
I have also tried (naively) to set the pattern to "/rockstar/i" but that only matches the exact string - this is possibly the correct direction to take, but if so, I don't know how to implement it on the BB.
How would I modify my regex in order to pick up case insensitive patterns using the PatternRepository as above?
PS: would the "correct" way be to use the [Cc][Oo][Bb][Ii]2... etc pattern? This is ok for a short string, but I am hoping for a more general solution if possible?
Well not a real solution for the general problem but this workaround is easy, safe and performant:
As your dealing here with URLs and they are not case-sensitive...
(it doesn't matter if we write google.com or GooGLE.COm or whatever)
The most simple solution (we all love KISS_principle) is to do first a lowercase (or uppercase if you like) on the input and than do a regex match where it doesn't matter whether it's case-sensitive or not because we know for sure what we are dealing with.
Since nobody else has answered this question relating to the PatternRepository class, I will self-answer so I can close it.
One way to do this would be to use a pattern like: [Cc][Oo][Bb][Ii]2[Nn][Tt][Ee][Rr][Aa][Cc][Tt][Ii][Vv][Ee]... etc where for each letter in the string, you put 2 options. Fortunately my string is short.
This is not an elegant solution, but it works. Unfortunately I don't know of a way to modify the string passed to PatternRepository and I think the crash when using the (?i) modifier is a bug in BB.
Use the port of the jakarta regex library:
https://code.google.com/p/regexp-me/
If you use unicode support, it's going to eat memory,
but if you just want case insensitive matching,
you simply need to pass the RE.MATCH_CASEINDEPENDENT flag when you compile your regex.
new RE("yourCaseInsensitivePattern", RE.MATCH_CASEINDEPENDENT | OTHER_FLAGS)