Need assistance with Regular Expressions in Qt (QRegExp) [bad repetition syntax?] - c++

void MainWindow::whatever(){
QRegExp rx ("<span(.*?)>");
//QString line = ui->txtNet1->toHtml();
QString line = "<span>Bar</span><span style='baz'>foo</span>";
while(line.contains(rx)){
qDebug()<<"Found rx!";
line.remove (rx);
}
}
I've tested the regular expression online using this tool. With the given regex string and a sample text of <span style="foo">Bar</span> the tool says that it the regular expression should be found in the string. In my Qt code, however, I'm never getting into my while loop.
I've really never used regex before, in Qt or any other language. Can someone provide some help? Thanks!
[edit]
So I just found that QRegExp has a function errorString() to use if the regex is invalid. I output this and see: "bad repetition syntax". Not really sure what this means. Of course, googling for "bad repetition syntax" brings up... this post. Damn google, you fast.

The problem is that QRegExp only supports greedy quantifiers. More precisely, it supports either greedy or reluctant quantifiers, but not both. Thus, <span(.*?)> is invalid, since there is no *? operator. Instead, you can use
QRegExp rx("<span(.*)>");
rx.setMinimal(true);
This will give every *, +, and ? in the QRegExp the behavior of *?, +?, and ??, respectively, rather than their default behavior. The difference, as you may or may not be aware, is that the minimal versions match as few characters as possible, rather than as many.
In this case, you can also write
QRegExp rx("<span([^>]*)>");
This is probably what I would do, since it has the same effect: match until you see a >. Yours is more general, yes (if you have a multi-character ending token), but I think this is slightly nicer in the simple case. Either will work, of course.
Also, be very, very careful about parsing HTML with regular expressions. You can't actually do it, and recognizing tags is—while (I believe) possible—much harder than just this. (Comments, CDATA blocks, and processing instructions throw a wrench in the works.) If you know the sort of data you're looking at, this can be an acceptable solution; even so, I'd look into an HTML parser instead.

What are you trying to achieve? If you want to remove the opening tag and its elements, then the pattern
<span[^>]*>
is probably the simplest.
The syntax .*? means non-greedy match which is widely supported, but may be confusing the QT regex engine.

Related

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

Fixing regex to work around ICU/RegexKitLite bug

I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ when searching against "xxxxxxxxxxx" will match empty string. It is behaving like /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting any unescaped * as + when a regex match returns a 0-length result. My naïve guess is that the regex with +s in placeof *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?
FWIW, ICU also offers a *+ operator, but it doesn't work either.
EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.
If you simply change every * quantifier to a +, the regex will fail to work in those instances where the * should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.
However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x* could be rewritten as (?:(?!x)|x+). It's hideous I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+), but not reluctant stars (*?).
Here it is in table form:
BEFORE AFTER
x* (?:(?!x)|x+)
x*+ (?:(?!x)|x++)
x*? x*? More complex atoms would need to have their own parentheses preserved:
(?:xyz)* (?:(?!(?:xyz))|(?:xyz)+) You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D If the {min,} and {min,max} forms are affected too, they would get the same treatment (with the same modifications for possessive variants):
x{0,} same as x*
x{0,n} (?:(?!x)|x{1,n})
It occurs to me that conditionals--(?(condition)yes-pattern|no-pattern)--would be a perfect fit here; unfortunately, ICU doesn't seem to support them.
I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)
I agree with the sentiment expressed above, the thing to do is not to try to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.
Both \* and [*] are literal asterisks, so a naive replacement mightn't work.
In fact, don't do dynamic rewriting, it's too complicated. Try to tweak your regexes statically first.
x* is equivalent to x{0,} and (?:x+)?.
Yeah, use that strategy:
(pseudo code)
if ($str =~ /x*/ && $str =~ /(x+)/) {
print "'$1'\n";
}
But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.

boost::regex - \bb?

I have some badly commented legacy code here that makes use of boost::regex::perl. I was wondering about one particular construct before, but since the code worked (more or less), I was loath to touch it.
Now I have to touch it, for technical reasons (more precisely, current versions of Boost no longer accepting the construct), so I have to figure out what it does - or rather, was intended to do.
The relevant part of the regex:
(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))
The piece that gives me headaches is \bb. I know of \b, but I could not find mention of \bb, and looking for a literal 'b' would not make sense here. Is \bb some special underdocumented feature, or do I have to consider this a typo?
As Boost seems to be a regex engine for C++, and one of the compatibility modes is perl compatibility--if that is a "perl-compatible" expression, than the second 'b' can only be a literal.
It's a valid expression, pretty much a special case for words beginning with 'b'.
It seems to be the deciding factor that this is a c++ library, and that it's to give environments that aren't perl, perl-compatible regexes. Thus my original thought that perl might interpret the expression (say with overload::constant) is invalid. Yet it is still worth mentioning just for clarification purposes, regardless of how inadvisable it would be tweak an expression meaning "word beginning with 'b'".
The only caveat to that idea is that perhaps Boost out-performs Perl at it's own expressions and somebody would be using the Boost engine in a Perl environment, then all bets are off as to whether that could have been meant as a special expression. This is just one stab, given a grammar where '!!!' meant something special at the beginning of words, you could piggyback on the established meaning like this (NOT RECOMMENDED!)
s/\\bb\b/(?:!!!(\\p{Alpha})|\\bb)/
This would be something dumb to do, but as we are dealing with code that seems unfit for its task, there are thousands of ways to fail at a task.
(\bb\s|\bb|^[a-z]\s|^[a-z]) matches a b if it's not preceded by another word character, or any lowercase letter if it's at the beginning of the string. In either case, the letter may be followed by a whitespace character. (It could match uppercase letters too if case-insensitive mode is set, and the ^ could also match the beginning of a line if multiline mode is set.)
But inside a lookbehind, that shouldn't even have compiled. In some flavors, a lookbehind can contain multiple alternatives with different, fixed lengths, but the alternation has to be at the top level in the lookbehind. That is, (?<=abc|xy|12345) will work, but (?<=(abc|xy|12345)) won't. So your regex wouldn't work even in those flavors, but Boost's docs just say the lookbehind expression has to be fixed-length.
If you really need to account for all four of the possibilities matched by that regex, I suggest you split the lookbehind into two:
(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)

Regex: Is Lazy Worse?

I have always written regexes like this
([^<]*)
but I just learned about this lazy thing and that I can write it like this
(.*?)
is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).
Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.
Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)
Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:
/<BODY>.*?<\/BODY>/is
But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantitier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.
In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.
The complemented character class more rigorously defines what you want to match, so whenever you can, I'd use it.
The non greedy regex will match things you probably don't want, such as:
foo
where your first .*? matches
foo" NAME="foo
Note that your examples are not equivalent. Your first regular expression will not select any links that contain other tags, such as img or b. The second regular expression will, and I expect that's probably what you wanted anyway.
Besides the difference in meaning, the only disadvantage I can think of is that support for non-greedy modifiers isn't quite as prevalent as character-class negation is. It's more widely supported than I thought, before I checked, but notably absent from the list is GNU Grep. If the regular-expression evaluators you're using support it, then go ahead and use it.
It's not about better or worse. The term I've seen the most is greedy vs. non-greedy, but however you put they do two different things. You want to use the right one for the task. I.e. turn off the greedy option when you don't want to capture multiple matches in a line.
“lazy” is the wrong word here. You mean non-greedy as opposed to greedy. There's no disadvantage in using it, that I know of. But in your special case, neither should it be more efficient.
Non-greedy is better, is it not? It works forward, checking for a match each time and stopping when it finds one, whereas the normal kleene closure (*) works backwards matching the rest of the input and removing things until it finds a match.
In the end, they do different things, but I think non-greedy outperforms greedy. Bear in mind that I haven't tested this, but now I'm curious.

Looking for a URI match on this regular expression, but it is crashing regex testers

I'm trying to use this regular expression, but I'm having trouble testing it, because for whatever reason it is crashing the only two regex testers I use.
^([a-zA-Z-]+/?)+$
If I'm not mistaken, that should match any URI that follows this format:
some-uri-string/some-uri-string/some-uri-string
However, I don't want it to match:
some-uri-string/some-uri-string//some-uri-string
Notice the double slash. Essentially, it is category/sub-category/sub-sub-category/page
With any number of sub-categories being allowed.
The problem is that, for a non-matching string, your pattern is a pathological case for a backtracking regular expression engine (which is true of those in most scripting languages, like Perl, Python, and anything based on PCRE).
More information:
http://swtch.com/~rsc/regexp/regexp1.html
http://www.regular-expressions.info/catastrophic.html
You could rewrite it as:
^([a-zA-Z-]+/)*[a-zA-Z-]+$
Or your RE implementation may have other facilities available to avoid that scenario.
Look up catastrophic backtracking... You need to define it a little better, or write one that chomps the URI string one part at a time... This one is recursive, and why it crashes on you... Or, you may change it from Greedy to lazy.. I didn't actually DEBUG this regex, but that's what I see...
For PCRE, you can make the group atomic to avoid catastrophic backtracking. Just add ?> like this:
^(?>[a-zA-Z-]+/?)+$
See the PCRE manual under "Atomic Grouping and Possessive Quantifiers".
Works using this tester :) Give it a shot - http://ocpsoft.org/tutorials/regular-expressions/java-visual-regex-tester/#!;t=some-uri-string%2Fsome-uri-string%2Fsome-uri-string&r=%5E(%5Ba-zA-Z-%5D%2B%2F%3F)%2B%24&x=