Fixing regex to work around ICU/RegexKitLite bug - regex

I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ when searching against "xxxxxxxxxxx" will match empty string. It is behaving like /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting any unescaped * as + when a regex match returns a 0-length result. My naïve guess is that the regex with +s in placeof *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?
FWIW, ICU also offers a *+ operator, but it doesn't work either.
EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.

If you simply change every * quantifier to a +, the regex will fail to work in those instances where the * should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.
However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x* could be rewritten as (?:(?!x)|x+). It's hideous I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+), but not reluctant stars (*?).
Here it is in table form:
BEFORE AFTER
x* (?:(?!x)|x+)
x*+ (?:(?!x)|x++)
x*? x*? More complex atoms would need to have their own parentheses preserved:
(?:xyz)* (?:(?!(?:xyz))|(?:xyz)+) You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D If the {min,} and {min,max} forms are affected too, they would get the same treatment (with the same modifications for possessive variants):
x{0,} same as x*
x{0,n} (?:(?!x)|x{1,n})
It occurs to me that conditionals--(?(condition)yes-pattern|no-pattern)--would be a perfect fit here; unfortunately, ICU doesn't seem to support them.

I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)
I agree with the sentiment expressed above, the thing to do is not to try to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.

Both \* and [*] are literal asterisks, so a naive replacement mightn't work.
In fact, don't do dynamic rewriting, it's too complicated. Try to tweak your regexes statically first.
x* is equivalent to x{0,} and (?:x+)?.

Yeah, use that strategy:
(pseudo code)
if ($str =~ /x*/ && $str =~ /(x+)/) {
print "'$1'\n";
}
But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.

Related

How to find changing date format from a string [duplicate]

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit Shorthand
(something){0,1} (something)?
(something){1} (something)
(something){0,} (something)*
(something){1,} (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?
To my knowledge they are identical. I think there maybe a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criteria isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation and save space and, IMO, improve readability.
I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.
They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.
They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

Is there any difference between `{m}` and `{m}?` quantifiers?

Please give an example where the difference between greedy and lazy versions of "repeat-exactly-m-times" quantifier can be seen.
The question arose from here and here.
If there are no differences then what for the {m}? quantifier exists?
I don't believe there is any real difference between {m} and {m}? since each specifies exactly m times. However, there is a difference between {m,} and {m,}? (and {m,}+, while we're at it). It's appropriate and needed for quantifiers in general, even if it isn't needed for that particular case.
As said in comments, Oniguruma regexp engine treats it differently, as an exception: {m}? is not a non-greedy exact m (which is same as greedy exact m), but 0-or-m. All the other engines I tried did as other posters say: it makes no difference.
The reason for the non-greedy exact m to exist: if it didn't, it's an exception. Exceptions are harder to remember, and harder to implement - it's extra work, and in this case, as the semantics is equal, it doesn't hurt anyone.
I love Oniguruma, and appreciate they might have wanted to change the unneeded bit into something more usable and efficient, but this looks like a bug waiting to happen. Fortunately, no-one sane writes non-greedy exact m...
Doesn't make a difference in exact match {m}.
However, will make a difference with {m,} as greedy qualifiers match as many characters as possible, whereas lazy qualifiers match as few as possible.
Given the string "Baaaaaaaaaaaa"
The regex (B[a]{2,}?) would match "Baa"
The regex (B[a]{2,}) would match "Baaaaaaaaaaaa"
Whereas, with the exact match {m}:
The regex (B[a]{2}?) would match "Baa"
The regex (B[a]{2}) would also match "Baa"

Need assistance with Regular Expressions in Qt (QRegExp) [bad repetition syntax?]

void MainWindow::whatever(){
QRegExp rx ("<span(.*?)>");
//QString line = ui->txtNet1->toHtml();
QString line = "<span>Bar</span><span style='baz'>foo</span>";
while(line.contains(rx)){
qDebug()<<"Found rx!";
line.remove (rx);
}
}
I've tested the regular expression online using this tool. With the given regex string and a sample text of <span style="foo">Bar</span> the tool says that it the regular expression should be found in the string. In my Qt code, however, I'm never getting into my while loop.
I've really never used regex before, in Qt or any other language. Can someone provide some help? Thanks!
[edit]
So I just found that QRegExp has a function errorString() to use if the regex is invalid. I output this and see: "bad repetition syntax". Not really sure what this means. Of course, googling for "bad repetition syntax" brings up... this post. Damn google, you fast.
The problem is that QRegExp only supports greedy quantifiers. More precisely, it supports either greedy or reluctant quantifiers, but not both. Thus, <span(.*?)> is invalid, since there is no *? operator. Instead, you can use
QRegExp rx("<span(.*)>");
rx.setMinimal(true);
This will give every *, +, and ? in the QRegExp the behavior of *?, +?, and ??, respectively, rather than their default behavior. The difference, as you may or may not be aware, is that the minimal versions match as few characters as possible, rather than as many.
In this case, you can also write
QRegExp rx("<span([^>]*)>");
This is probably what I would do, since it has the same effect: match until you see a >. Yours is more general, yes (if you have a multi-character ending token), but I think this is slightly nicer in the simple case. Either will work, of course.
Also, be very, very careful about parsing HTML with regular expressions. You can't actually do it, and recognizing tags is—while (I believe) possible—much harder than just this. (Comments, CDATA blocks, and processing instructions throw a wrench in the works.) If you know the sort of data you're looking at, this can be an acceptable solution; even so, I'd look into an HTML parser instead.
What are you trying to achieve? If you want to remove the opening tag and its elements, then the pattern
<span[^>]*>
is probably the simplest.
The syntax .*? means non-greedy match which is widely supported, but may be confusing the QT regex engine.

Regex for password requirements

I want to require the following:
Is greater than seven characters.
Contains at least two digits.
Contains at least two special (non-alphanumeric) characters.
...and I came up with this to do it:
(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Now, I'd also like to make sure that no two sequential characters are the same. I'm having a heck of a time getting that to work though. Here's what I got that works by itself:
(\S)\1+
...but if I try to combine the two together, it fails.
I'm operating within the constraints of the application. It's default requirement is 1 character length, no regex, and no nonstandard characters.
Anyway...
Using this test harness, I would expect y90e5$ to match but y90e5$$ to not.
What an i missing?
This is a bad place for a regex. You're better off using simple validation.
Sometimes we cannot influence specifications and have to write the implementation regardless, i.e., when some ancient backoffice system has to be interfaced through the web but has certain restrictions on input, or just because your boss is asking you to.
EDIT: removed the regex that was based on the original regex of the asker.
altered original code to fit your description, as it didn't seem to really work:
EDIT: the q. was then updated to reflect another version. There are differences which I explain below:
My version: the two or more \W and \d can be repeated by each other, but cannot appear next to each other (this was my incorrect assumption), i fixed it for length>7 which is slightly more efficient to place as a typical "grab all" expression.
^(?!.*((\S)\1|\s))(?=.*(\d.+){2,})(?=.*(\W.+){2,}).{8,}
New version in original question: the two or more \W and the \d are allowed to appear next to each other. This version currently support length>=6, not length>7 as is explained in the text.
The current answer, corrected, should be something like this, which takes the updated q., my comments on length>7 and optimizations, then it looks like: ^(?!.*((\S)\1|\s))(?=(.*\d){2,})(?=(.*\W){2,}).{8,}.
Update: your original code doesn't seem to work, so I changed it a bit
Update: updated answer to reflect changes in question, spaces not allowed anymore
This may not be the most efficient but appears to work.
^(?!.*(\S)\1)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Test strings:
ad2f#we1$ //match valid.
adfwwe12#$ //No Match repeated ww.
y90e5$$ //No Match repeated $$.
y90e5$ //No Match too Short and only 1 \W class value.
One of the comments pointed out that the above regex allows spaces which are typically not used for password fields. While this doesn't appear to be a requirement of the original post, as pointed out a simple change will disallow spaces as well.
^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Your regex engine may parse (?!.*(\S)\1|.*\s) differently. Just be aware and adjust accordingly.
All previous test results the same.
Test string with whitespace:
ad2f #we1$ //No match space in string.
If the rule was that passwords had to be two digits followed by three letters or some such, or course a regular expression would work very nicely. But I don't think regexes are really designed for the sort of rule you actually have. Even if you get it to work, it would be pretty cryptic to the poor sucker who has to maintain it later -- possibly you. I think it would be a lot simpler to just write a quick function that loops through the characters and counts how many total and how many of each type. Then at the end check the counts.
Just because you know how to use regexes doesn't mean you have to use them for everything. I have a cool cordless drill but I don't use it to put in nails.

Regex: Is Lazy Worse?

I have always written regexes like this
([^<]*)
but I just learned about this lazy thing and that I can write it like this
(.*?)
is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).
Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.
Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)
Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:
/<BODY>.*?<\/BODY>/is
But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantitier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.
In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.
The complemented character class more rigorously defines what you want to match, so whenever you can, I'd use it.
The non greedy regex will match things you probably don't want, such as:
foo
where your first .*? matches
foo" NAME="foo
Note that your examples are not equivalent. Your first regular expression will not select any links that contain other tags, such as img or b. The second regular expression will, and I expect that's probably what you wanted anyway.
Besides the difference in meaning, the only disadvantage I can think of is that support for non-greedy modifiers isn't quite as prevalent as character-class negation is. It's more widely supported than I thought, before I checked, but notably absent from the list is GNU Grep. If the regular-expression evaluators you're using support it, then go ahead and use it.
It's not about better or worse. The term I've seen the most is greedy vs. non-greedy, but however you put they do two different things. You want to use the right one for the task. I.e. turn off the greedy option when you don't want to capture multiple matches in a line.
“lazy” is the wrong word here. You mean non-greedy as opposed to greedy. There's no disadvantage in using it, that I know of. But in your special case, neither should it be more efficient.
Non-greedy is better, is it not? It works forward, checking for a match each time and stopping when it finds one, whereas the normal kleene closure (*) works backwards matching the rest of the input and removing things until it finds a match.
In the end, they do different things, but I think non-greedy outperforms greedy. Bear in mind that I haven't tested this, but now I'm curious.