Match repetition with regexp in ocamllex - regex

I'm trying to write a lexer with ocamllex for some special native language (slightly modified for my purposes). Some words should be matched by their first character, which is doubled. But I can't find any way to express this repetition of the first character. I can't use the regex syntax
(['a'-'z'])\1['a'-'z']+
with that "\1". Ocamllex says "illegal escape sequence \1." and I think that's fair given the syntax of escape sequences, but it's certainly not what I wanted. Nor can I use the repetition syntax with curly braces (though this wouldn't solve the problem anyway):
['a'-'z']{2}['a'-'z']+
I think there is a conflict with the OCaml code in the curly braces after the regexp.
Does anybody have an idea for that?
thank you very much.

Ocamllex's regex syntax has no counted repetition ({n}) and no backreferences (\1); the only repetition operators are *, + and ?. The available syntax is exactly what is listed in the reference manual:
http://caml.inria.fr/pub/docs/manual-ocaml-4.01/lexyacc.html#sec274
So I think you have to list all the possible repetitions manually, as below:
("aa"|"bb"|"cc"|"dd"|"ee"|"ff"| ..............)['a'-'z']+
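Typing out 26 alternatives by hand is easy to get wrong, so the list can be generated once and pasted into the .mll file. A minimal sketch in Python, not part of the original answer; the helper name doubled_letter_alternation is made up for illustration:

```python
import string

def doubled_letter_alternation(letters=string.ascii_lowercase):
    """Build an ocamllex alternation such as ("aa"|"bb"|...|"zz")."""
    alternatives = ['"{0}{0}"'.format(c) for c in letters]
    return "(" + "|".join(alternatives) + ")"

# Full pattern: a doubled first letter followed by one or more letters.
pattern = doubled_letter_alternation() + "['a'-'z']+"
print(pattern)
```

The printed text is meant to be copied into the lexer rule as-is; nothing here runs inside ocamllex itself.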

Related

Why was "? immediately after a parenthesis" a syntax error, in Regular Expressions?

The Python Regular Expression HOWTO explains how the syntax of non-capturing and named groups came about:
For these new features the Perl developers couldn’t choose new single-keystroke metacharacters or new special sequences beginning with \ without making Perl’s regular expressions confusingly different from standard REs. If they chose & as a new metacharacter, for example, old expressions would be assuming that & was a regular character and wouldn’t have escaped it by writing \& or [&].
The solution chosen by the Perl developers was to use (?...) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems.
I don't understand why the parenthesis should have something that repeats?
I do understand the overall point that taking something that caused a syntax error, to use to extend regex functionality, would prevent existing regexes from breaking.
regular-expressions.info explains it nicely.
The ... question mark is the quantifier that makes the previous token optional. This quantifier cannot appear after an opening parenthesis, because there is nothing to be made optional at the start of a group. Therefore, there is no ambiguity between the question mark as an operator to make a token optional and the question mark as part of the syntax for non-capturing groups...
Nothing to be made optional, because the token before the ? is the group-opening metacharacter (, and not something searchable in a string such as a regular character.
I don't agree with the phrasing in the HOWTO you linked to. Optional – i.e., zero or one times – is not "repeating".
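The "nothing to be made optional" point can be checked in any engine that follows this convention. A small sketch using Python's re module (chosen only because the HOWTO is about Python; the helper name compiles is made up):

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A quantifier with no token in front of it is rejected.
print(compiles("?abc"))
# The same character after a token is an ordinary "optional" quantifier.
print(compiles("a?bc"))
# After an opening parenthesis it is read as extension syntax instead.
print(compiles("(?:ab)?c"))
```

Because a bare ? at that position could never have been valid, reusing it after ( for extensions does not change the meaning of any pattern that used to compile.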

Understanding regex with OR

I have a regular expression like this: ('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
In order to test it I am using a handy tool called http://www.regexpal.com/
I am stuck trying to understand the logic: entering a '0' is fine, but I don't get why the OR prevents entering other characters. Any explanation is appreciated.
I'm not sure you understand how the brackets in the regex are working. It isn't the OR part that is preventing you.
('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
Will match either '0' with the quotes or for example 0000000'z''''9 or anything else like it. The quotes are treated as literal and the period must be escaped because it is a wildcard.
(0|[0-9]+\.[0-9a-f]*)
May be what you are looking for. This will match values such as 0 or 23. or 3.14159
There are numerous problems in your regex (as others have pointed out), but I'll explain something about alternations.
Most regex flavors will short-circuit alternations.
This means that you should reorder it, if you want it to match the other expression first.
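A quick way to see the effect of alternation order is to run both orderings against the same input. This sketch uses Python's re module as one example of a backtracking flavor; other flavors may behave differently:

```python
import re

text = "0.5"

# "0" is tried first and succeeds, so the match stops after one character.
zero_first = re.match(r"(0|[0-9]+\.[0-9a-f]*)", text)

# With the longer alternative first, the whole number is matched.
number_first = re.match(r"([0-9]+\.[0-9a-f]*|0)", text)

print(zero_first.group())
print(number_first.group())
```

Anchoring the pattern (for example with re.fullmatch) is another option, since the engine then backtracks into the second alternative when the first one leaves input unconsumed.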

regex matching pair of brackets

I'm trying to write a Sublime Text 2 syntax highlighter for Simulink's Target Language Compiler (TLC) files. This is a scripting language for auto-generating code. In TLC, the syntax to expand the contents of a token (similar to dereferencing a pointer in C or C++) is
%<token>
The regular expression I wrote to match this is
%<.+?>
This works for most cases, but fails for the following statement
%<LibAddToCommonIncludes("<string.h>")>
Modifying the regular expression to greedy fixes this if the statement is by itself on a line, but fails in several other cases. So that is not an option.
For that line, the highlighting stops at the first > instead of the second. How can I modify the regular expression to handle this case?
It'd be great if there was a general expression that could handle any number of nested <> pairs; for example
%<...<...>...<...<...>...>...>
where the dots are optional characters. The entire expression above should be a single match.
A generic solution with regular expressions is difficult, as explained very well in this thread.
You can try to specifically match 2 < characters through a regex. Something like %<.+?<.+?>.+?>.
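To illustrate why arbitrary nesting needs more than a plain regular expression, here is a minimal depth-counting scanner in Python. It is only a sketch of the idea: it cannot be dropped into a Sublime Text syntax definition, and the function name find_tlc_tokens is made up for this example.

```python
def find_tlc_tokens(text):
    """Return every %<...> span in text, honouring nested <> pairs."""
    tokens = []
    i = 0
    while True:
        start = text.find("%<", i)
        if start == -1:
            break
        depth = 0
        end = None
        # Start at the "<" right after "%" and track the nesting depth.
        for j in range(start + 1, len(text)):
            if text[j] == "<":
                depth += 1
            elif text[j] == ">":
                depth -= 1
                if depth == 0:
                    end = j
                    break
        if end is None:
            break  # unbalanced brackets: stop scanning
        tokens.append(text[start:end + 1])
        i = end + 1
    return tokens

print(find_tlc_tokens('%<LibAddToCommonIncludes("<string.h>")>'))
```

The counter is the part a classic regex cannot express; flavors with recursion or balancing groups can, but whether the highlighter's engine supports those is a separate question.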

How can I fix Emacs' syntax highlighting of qr"\.[^.]+$"?

Emacs is confusing syntax after a line like
fileparse($file, qr"\.[^.]+$");
And thinks the rest of the file is a string. How do I fix this?
Emacs is confused by the quotes in the qr expression. The standard regular expression delimiter (/) works:
fileparse($file, qr/\.[^.]+$/);
In fact, almost anything else seems to work.
fileparse($file, qr{\.[^.]+$});
fileparse($file, qr*\.[^.]+$*);
fileparse($file, qr#\.[^.]+$#);
My version of VIM doesn't get confused by the quotes, but I know that older versions of VIM did. Putting quotes is really confusing because it makes the qr expression look like a string (which it isn't). It's usually bad policy to use quotes as delimiters in regular expressions even though it's technically legal.
However, the really important question is what's fileparse? That's not a standard Perl function. I am assuming that this is imported from File::Basename? Would that be correct?
According to the documentation I have, the second argument in fileparse is supposed to be an array and not a quoted regular expression.
David W. probably has the right answer. Again, probably best to avoid using " as delimiter, but sometimes you need to use ' as the delimiter to prevent interpolation, therefore this may be useful to someone in that case.
So, if for some reason you REALLY want the quote delimiter (or to prevent interpolation using '), you can always do something like:
fileparse($file, qr"\.[^.]+$"); #"# highlight fix
An example using one that SO highlights badly:
s'hi'by'; #'# highlight fix
Note: the second # is there to trigger comment highlighting in the editor, so that "highlight fix" shows as a comment.

Need assistance with Regular Expressions in Qt (QRegExp) [bad repetition syntax?]

void MainWindow::whatever(){
    QRegExp rx("<span(.*?)>");
    //QString line = ui->txtNet1->toHtml();
    QString line = "<span>Bar</span><span style='baz'>foo</span>";
    while (line.contains(rx)) {
        qDebug() << "Found rx!";
        line.remove(rx);
    }
}
I've tested the regular expression online using this tool. With the given regex string and a sample text of <span style="foo">Bar</span>, the tool says that the regular expression should be found in the string. In my Qt code, however, I never get into my while loop.
I've really never used regex before, in Qt or any other language. Can someone provide some help? Thanks!
[edit]
So I just found that QRegExp has a function errorString() to use if the regex is invalid. I output this and see: "bad repetition syntax". Not really sure what this means. Of course, googling for "bad repetition syntax" brings up... this post. Damn google, you fast.
The problem is that QRegExp only supports greedy quantifiers. More precisely, it supports either greedy or reluctant quantifiers, but not both. Thus, <span(.*?)> is invalid, since there is no *? operator. Instead, you can use
QRegExp rx("<span(.*)>");
rx.setMinimal(true);
This will give every *, +, and ? in the QRegExp the behavior of *?, +?, and ??, respectively, rather than their default behavior. The difference, as you may or may not be aware, is that the minimal versions match as few characters as possible, rather than as many.
In this case, you can also write
QRegExp rx("<span([^>]*)>");
This is probably what I would do, since it has the same effect: match until you see a >. Yours is more general, yes (if you have a multi-character ending token), but I think this is slightly nicer in the simple case. Either will work, of course.
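The difference between the three forms is easy to see on the sample string from the question. The sketch below uses Python's re module rather than QRegExp (Python supports *? directly), so it shows the matching behavior only, not the Qt API:

```python
import re

line = "<span>Bar</span><span style='baz'>foo</span>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.search(r"<span(.*)>", line).group()
# Minimal: .*? stops at the first ">" (what setMinimal(true) is described as doing).
minimal = re.search(r"<span(.*?)>", line).group()
# Negated class: cannot cross a ">" at all, so it also stops at the first one.
negated = re.search(r"<span([^>]*)>", line).group()

print(greedy)
print(minimal)
print(negated)

# Removing every opening tag with the negated-class form.
print(re.sub(r"<span[^>]*>", "", line))
```

Here the minimal and negated-class patterns pick out the same tags, while the greedy one swallows everything up to the final closing tag.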
Also, be very, very careful about parsing HTML with regular expressions. You can't actually do it, and recognizing tags is—while (I believe) possible—much harder than just this. (Comments, CDATA blocks, and processing instructions throw a wrench in the works.) If you know the sort of data you're looking at, this can be an acceptable solution; even so, I'd look into an HTML parser instead.
What are you trying to achieve? If you want to remove the opening tag and its elements, then the pattern
<span[^>]*>
is probably the simplest.
The syntax .*? means a non-greedy match, which is widely supported but may be confusing the Qt regex engine.