How can I fix Emacs' syntax highlighting of qr"\.[^.]+$"?

Emacs gets the syntax highlighting wrong after a line like
fileparse($file, qr"\.[^.]+$");
and thinks the rest of the file is a string. How do I fix this?

Emacs is confused by the quotes in the qr expression. The standard regular expression delimiter (/) works:
fileparse($file, qr/\.[^.]+$/);
In fact, almost anything else seems to work.
fileparse($file, qr{\.[^.]+$});
fileparse($file, qr*\.[^.]+$*);
fileparse($file, qr#\.[^.]+$#);
My version of VIM doesn't get confused by the quotes, but I know that older versions of VIM did. Putting quotes is really confusing because it makes the qr expression look like a string (which it isn't). It's usually bad policy to use quotes as delimiters in regular expressions even though it's technically legal.
However, the really important question is what's fileparse? That's not a standard Perl function. I am assuming that this is imported from File::Basename? Would that be correct?
According to the documentation I have, the second argument in fileparse is supposed to be an array and not a quoted regular expression.

David W. probably has the right answer. Again, probably best to avoid using " as delimiter, but sometimes you need to use ' as the delimiter to prevent interpolation, therefore this may be useful to someone in that case.
So, if for some reason you REALLY want the quote delimiter (or to prevent interpolation using '), you can always do something like:
fileparse($file, qr"\.[^.]+$"); #"# highlight fix
An example using a delimiter that SO highlights badly:
s'hi'by'; #'# highlight fix
Note: the second # triggers comment highlighting in the editor, so that "highlight fix" is shown as a comment.

Related

Understanding regex with OR

I have a regular expression like this: ('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
In order to test it I am using a handy tool called http://www.regexpal.com/
The thing is that I am getting stuck when trying to understand the logic, inserting a '0' is fine but then I don't get why the OR prevents inserting other characters. Any explanation is appreciated.
I'm not sure you understand how the brackets in the regex are working. It isn't the OR part that is preventing you.
('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
Will match either '0' with the quotes or for example 0000000'z''''9 or anything else like it. The quotes are treated as literal and the period must be escaped because it is a wildcard.
(0|[0-9]+\.[0-9a-f]*)
May be what you are looking for. This will match values such as 0 or 23. or 3.14159
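As a quick check of the corrected pattern, here is a minimal sketch in Python (my choice of language for the illustration; the pattern is written the same way in Python's re module). The first three samples are the values mentioned above; "12" and "z" are made-up non-matches, and matches_whole is a hypothetical helper name.

```python
import re

# Corrected pattern from the answer: a lone 0, or digits, a literal period,
# then zero or more hex digits.
pattern = re.compile(r"(0|[0-9]+\.[0-9a-f]*)")

def matches_whole(text):
    """Return True when the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None

for sample in ["0", "23.", "3.14159", "12", "z"]:
    print(sample, matches_whole(sample))
```

Note that "12" is rejected because the second alternative requires the period.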
There are numerous problems in your regex (as others have pointed out), but I'll explain something about alternations.
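Most regex flavors will short-circuit alternations: the alternatives are tried from left to right, and the first one that lets the match succeed is used.
This means that you should reorder them if you want the other expression to be tried before the plain 0.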
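The effect of the ordering can be seen in a small sketch, assuming Python's re module as a stand-in for a typical backtracking engine. The test string "0.5" and the variable names are made up for the illustration.

```python
import re

text = "0.5"

# Short alternative first: the engine accepts the lone 0 and never tries
# the longer branch.
short_first = re.match(r"(0|[0-9]+\.[0-9a-f]*)", text).group(0)

# Longer alternative first: the whole number is matched.
long_first = re.match(r"([0-9]+\.[0-9a-f]*|0)", text).group(0)

print(short_first)
print(long_first)
```

With the reordered pattern a plain "0" still matches, because the engine falls back to the second alternative when the first one fails.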

Match repetition with regexp in ocamllex

I'm trying to write a lexer with ocamllex for some special native language (that is a bit modified for my purposes). Some words should be matched by their first character, which is doubled. But I can't find any way to express this repetition of the first character. I can't use the regex syntax
(['a'-'z'])\1['a'-'z']+
with that "\1": ocamllex says "illegal escape sequence \1.", which I suppose is consistent with its escape syntax, but it's not what I wanted. Nor can I use the repetition syntax with curly braces in any way (though that wouldn't solve the problem anyway):
['a'-'z']{2}['a'-'z']+
I think there is a conflict with the OCaml code in the curly braces after the regexp.
Does anybody have an idea for that?
thank you very much.
Ocamllex's regex syntax has neither backreferences nor counted repetition. The available regex syntax is just what is listed in the reference manual:
http://caml.inria.fr/pub/docs/manual-ocaml-4.01/lexyacc.html#sec274
I think you can manually list all the possible repetitions, as below:
("aa"|"bb"|"cc"|"dd"|"ee"|"ff"| ..............)['a'-'z']+
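Typing out all 26 pairs by hand is error-prone, so one option is to generate the alternation with a throwaway script and paste the output into the .mll file. This is only a sketch in Python; doubled_letter_alternation is a hypothetical helper name, not part of any ocamllex tooling.

```python
import string

def doubled_letter_alternation(letters=string.ascii_lowercase):
    """Build an ocamllex-style alternation of every doubled letter."""
    return "(" + "|".join('"{0}{0}"'.format(c) for c in letters) + ")"

# A doubled first letter followed by one or more lowercase letters.
print(doubled_letter_alternation() + "['a'-'z']+")
```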

Regexp languages and replacements in Emacs

When I use the regexp-builder, I need to escape things in a different way from the way I do it when using replace-regexp. Now, this thread explains that these two commands use a different syntax, but why is that?
Also, I went through this blog post: Re-builder: The Interactive Regexp Builder, and I added
(require 're-builder)
(setq reb-re-syntax 'string)
to my .emacs file following the advice on the site. However, I still need to type " around my regexp to make it work. I thought changing the syntax language would take care of this but it doesn't.
With this, my actual questions are:
1. Is it still the case that Emacs does not support PCRE? Are there any workarounds to this?
2. Once I have the right regexp in regexp-builder, is there any way to directly send the regexp to replace-regexp and enter the replacement string?
There's a package in the MELPA repository called pcre2el that adds PCRE support to many parts of Emacs, including regexp-builder and replace-regexp.
Regarding question #2: No (at least not by default), but there's another way to do that without using re-builder.
Start by doing a regexp isearch for your pattern. Because it's an isearch, you'll see the matches interactively, a bit like re-builder (albeit without coloured groupings).
Still in isearch, once you're happy with the pattern, type C-M-% to call isearch-query-replace-regexp which will prompt you for the replacement.
You can of course simply copy your re-builder string from its buffer and yank it as a replacement string (but that's undoubtedly not news).
I was curious about the need for quotes in re-builder with string syntax. It seems that it's just a formality of the system, and reb-read-regexp returns everything between the first and last " when using that syntax. Maybe it's intended to ensure that leading or trailing whitespace can't confuse matters -- re-builder does use leading whitespace for improved visibility, and trailing whitespace would be harder to spot. Or maybe it just made some of the code more convenient/consistent.
1. No, Emacs doesn't support PCRE, and as far as I know there is no work-around for that.
2. I don't think so.
To answer your first question, why does re-builder use a different syntax than replace-regexp:
By default, re-builder uses the syntax that is appropriate for writing elisp programs. In the context of a written program, regexps are entered within strings. Inside a string, backslashes have a special meaning which conflicts with using the backslash as part of a regexp. Consequently, within a string, you need to double a backslash to use it to signify part of the regexp syntax.
replace-regexp, on the other hand, is designed to be used interactively by the user, and it explicitly expects the input to be a regexp. As a convenience, it interprets backslashes as regexp syntax, not as string escapes. Which is why you can use single backslashes in this context.

Fixing regex to work around ICU/RegexKitLite bug

I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ when searching against "xxxxxxxxxxx" will match the empty string. It is behaving like /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting any unescaped * as + when a regex match returns a 0-length result. My naïve guess is that the regex with +s in place of *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?
FWIW, ICU also offers a *+ operator, but it doesn't work either.
EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.
If you simply change every * quantifier to a +, the regex will fail to work in those instances where the * should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.
However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x* could be rewritten as (?:(?!x)|x+). It's hideous I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+), but not reluctant stars (*?).
Here it is in table form:
BEFORE      AFTER
x*          (?:(?!x)|x+)
x*+         (?:(?!x)|x++)
x*?         x*?
More complex atoms would need to have their own parentheses preserved:
(?:xyz)*    (?:(?!(?:xyz))|(?:xyz)+)
You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D
If the {min,} and {min,max} forms are affected too, they would get the same treatment (with the same modifications for possessive variants):
x{0,}       same as x*
x{0,n}      (?:(?!x)|x{1,n})
It occurs to me that conditionals--(?(condition)yes-pattern|no-pattern)--would be a perfect fit here; unfortunately, ICU doesn't seem to support them.
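To sanity-check the basic x* rewrite, here is a small sketch that compares it with the original on a few made-up strings. It uses Python's re module, which is an assumption on my part: it is not ICU and says nothing about how RegexKitLite behaves, and same_matches is a hypothetical helper name.

```python
import re

ORIGINAL = re.compile(r"x*")
REWRITTEN = re.compile(r"(?:(?!x)|x+)")

def same_matches(text):
    """Compare every match (as spans) found by the two patterns in text."""
    original = [m.span() for m in ORIGINAL.finditer(text)]
    rewritten = [m.span() for m in REWRITTEN.finditer(text)]
    return original == rewritten

for sample in ["", "xxxxxxxxxxx", "axxb", "abc"]:
    print(repr(sample), same_matches(sample))
```

This only checks the pattern used on its own. When more pattern follows the quantifier, backtracking can differ: x*x matches the string "x", while (?:(?!x)|x+)x does not, because the rewritten group can no longer give back its last x.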
I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)
I agree with the sentiment expressed above, the thing to do is not to try to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.
Both \* and [*] are literal asterisks, so a naive replacement mightn't work.
In fact, don't do dynamic rewriting, it's too complicated. Try to tweak your regexes statically first.
x* is equivalent to x{0,} and (?:x+)?.
Yeah, use that strategy:
(pseudo code)
if ($str =~ /x*/ && $str =~ /(x+)/) {
print "'$1'\n";
}
But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.

Need assistance with Regular Expressions in Qt (QRegExp) [bad repetition syntax?]

void MainWindow::whatever(){
    QRegExp rx ("<span(.*?)>");
    //QString line = ui->txtNet1->toHtml();
    QString line = "<span>Bar</span><span style='baz'>foo</span>";
    while(line.contains(rx)){
        qDebug()<<"Found rx!";
        line.remove (rx);
    }
}
I've tested the regular expression online using this tool. With the given regex string and a sample text of <span style="foo">Bar</span> the tool says that the regular expression should be found in the string. In my Qt code, however, I'm never getting into my while loop.
I've really never used regex before, in Qt or any other language. Can someone provide some help? Thanks!
[edit]
So I just found that QRegExp has a function errorString() to use if the regex is invalid. I output this and see: "bad repetition syntax". Not really sure what this means. Of course, googling for "bad repetition syntax" brings up... this post. Damn google, you fast.
The problem is that QRegExp only supports greedy quantifiers. More precisely, it supports either greedy or reluctant quantifiers, but not both. Thus, <span(.*?)> is invalid, since there is no *? operator. Instead, you can use
QRegExp rx("<span(.*)>");
rx.setMinimal(true);
This will give every *, +, and ? in the QRegExp the behavior of *?, +?, and ??, respectively, rather than their default behavior. The difference, as you may or may not be aware, is that the minimal versions match as few characters as possible, rather than as many.
In this case, you can also write
QRegExp rx("<span([^>]*)>");
This is probably what I would do, since it has the same effect: match until you see a >. Yours is more general, yes (if you have a multi-character ending token), but I think this is slightly nicer in the simple case. Either will work, of course.
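The difference between the greedy, minimal, and negated-class forms can be seen outside Qt as well. The following is only a sketch using Python's re module as a stand-in (it does support *?, unlike QRegExp), applied to the sample line from the question; the variable names are made up.

```python
import re

line = "<span>Bar</span><span style='baz'>foo</span>"

# Greedy: .* runs to the last '>' in the line, so one match swallows everything.
greedy = re.sub(r"<span(.*)>", "", line)

# Minimal: .*? stops at the first '>', the behavior setMinimal(true) asks for.
minimal = re.sub(r"<span(.*?)>", "", line)

# Negated character class: same result here without a reluctant quantifier.
negated = re.sub(r"<span([^>]*)>", "", line)

print(repr(greedy))
print(repr(minimal))
print(repr(negated))
```

The greedy version removes the entire line, while the other two remove only the two opening tags.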
Also, be very, very careful about parsing HTML with regular expressions. You can't actually do it, and recognizing tags is—while (I believe) possible—much harder than just this. (Comments, CDATA blocks, and processing instructions throw a wrench in the works.) If you know the sort of data you're looking at, this can be an acceptable solution; even so, I'd look into an HTML parser instead.
What are you trying to achieve? If you want to remove the opening tag and its elements, then the pattern
<span[^>]*>
is probably the simplest.
The syntax .*? means a non-greedy match, which is widely supported but may be confusing the Qt regex engine.