Why JFlex reject .+?(?=->) - regex

I'm trying design simple language idea plugin.
I want to match below example code as 3 tokens as text before ->, between -> and :, after :
Ex: First part -> Second Part: Third part
For first part when I try regex .+?(?=->) at https://regex101.com/r/TDBWg0/1 it works.
But as per JFlex .+?(?=->) has a syntax error:
Error in file "Simple.flex" (line 41):
Syntax error.
FIRST_PART=.+(?=(->))
^

Lexer generators like JFlex often have a different syntax and feature set than most other regex implementations, so helpers like regex101 aren't always that useful for them. Instead you should look at the JFlex manual to see which syntax JFlex supports.
There's two things of note there:
The syntax for lookahead is /regex not (?=regex)
There is no syntax for non-greedy quantifiers
< and > need to be quoted or escaped
So .+/"->" would be a valid regex, but when there are multiple ->s it will match up to the last ->, not the first. Presumably you tried to make the + non-greedy specifically so that it would only match up to the first, so this is no good.
Since there are no non-greedy modifiers in JFlex, we need a different approach. If we look at the available regex features again, we'll see that there's an operator ~, which works as follows:
~a (upto)
matches everything up to (and including) the first occurrence of a text matched by a. The expression ~a is equivalent to !([^]* a [^]*) a. A traditional C-style comment is matched by "/*" ~"*/".
So the regex you want is simpy ~"->".
Another approach, that works with virtually every regex implementation, would be to write a regex that specifically matches everything that's not a ->, i.e. any non-- character or a - not followed by a >. So that'd be:
([^-]|-[^\>])+

Related

Why was "? immediately after a parenthesis" a syntax error, in Regular Expressions?

The Python Regular Expression HOWTO explains how the syntax of non-capturing and named groups came about:
For these new features the Perl developers couldn’t choose new single-keystroke metacharacters or new special sequences beginning with \ without making Perl’s regular expressions confusingly different from standard REs. If they chose & as a new metacharacter, for example, old expressions would be assuming that & was a regular character and wouldn’t have escaped it by writing \& or [&].
The solution chosen by the Perl developers was to use (?...) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems.
I don't understand why the parenthesis should have something that repeats?
I do understand the overall point that taking something that caused a syntax error, to use to extend regex functionality, would prevent existing regexes from breaking.
regular-expressions.info explains it nicely.
The ... question mark is the quantifier that makes the previous token optional. This quantifier cannot appear after an opening parenthesis, because there is nothing to be made optional at the start of a group. Therefore, there is no ambiguity between the question mark as an operator to make a token optional and the question mark as part of the syntax for non-capturing groups...
Nothing to be made optional, because the token before the ? is the group-opening metacharacter (, and not something searchable in a string such as a regular character.
I don't agree with the phrasing in the HOWTO you linked to. Optional – i.e., zero or one times – is not "repeating".

The most efficient lookahead substitute for jflex

I am writing tokenizer in jflex. I need to match words like interferon-a as one token, and words like interferon-alpha as three.
Obvious solution would be lookaheads, but they do not work in jflex. For a similar task, I wrote a function matching one additional wildcard character after the matched pattern, checking if it is a whitespace in java code and pushing it back with or without a part of the matched string.
REGEX = [:letter:]+\-[:letter:]\.
From string interferon-alpha it would match interferon-al.
Then, in Java code section it would check if the last character of the match is a whitespace. It is not, so -al would be pushed back and interferon returned.
In the case of interferon-a, whitespace would be pushed back and interferon returned.
However, this function does not work if matched string does not have anything succeeding. Also, it seems quite clunky. Hence, I was wondering if there is any 'nicer' way of ensuring that the following character is a whitespace without actually matching and returning it.
JFlex certainly has a lookahead facility, the same as (f)lex. Unlike Java regex lookahead assertions, the JFlex lookahead can only be applied at the end of a match, but it is otherwise similar. It is described in the Semantics section of JFlex manual:
In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either $ (the end of line operator) or / followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match…
So you could certainly write the rule:
[:letter:]+\-[:letter:]/\s
However, you cannot put such a rule in a macro definition (REGEX = …), as the manual also mentions (in the section on macros):
The regular expression on the right hand side must be well formed and must not contain the ^, / or $ operators.
So the lookahead operator can only be used in a pattern rule.
Note that \s matches any whitespace character, including newline characters, while . does not match any newline character. I think that's what lead to your comment that REGEX = [:letter:]+\-[:letter:]\. "does not work if matched string does not have anything succeeding" (I'm guessing that you meant "does not have anything succeeding it on the same line, and also that you intended to write . rather than \.).
Rather than testing for following whitespace, you might (depending on your language) prefer to test for a non-word character:
[:letter:]+\-[:letter:]/\W
or to craft a more precise specification as a set of Unicode properties, as in the definition of \W (also found in the linked section of the JFlex manual).
Having said all that, I'd like to repeat the advice from my previous answer to a similar question of yours: put more specific patterns first. For example, using the following pair of patterns will guarantee that the first one picks up words with a single letter suffix, while avoiding the need to explicitly pushback.
[:letter:]+(-[:letter:])? { /* matches 'interferon' or 'interferon-a' */ }
[:letter:]+/-[:letter:]+ { /* matches only 'interferon' from 'interferon-alpha' */ }
Of course, in this case you could easily avoid the collision between the second pattern and the first pattern by using {2,} instead of + for the second repetition, but it's perfectly OK to rely on pattern ordering since it's often inconvenient to guarantee that patterns don't overlap.

Matching Word() when word is not (some word)

Specifically, I want to match functions in my Javascript code that are not in a set of common standard Javascript functions. In other words, I want to match user defined functions. I'm working with vim's flavour of regexp, but I don't mind seeing solutions for other flavours.
As I understand it, regexp crawls through a string character by character, so thinking in terms of sets of characters can be problematic even when a problem seems simple. I've tried negative lookahead, and as you might expect all the does is prevent the first character of the functions I don't want from being matched (ie, onsole.log instead of console.log).
(?(?!(if)|(console\.log)|(function))\w+)\(.*\)
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
https://regexr.com/
I would like to be able to crawl through a function and see where it is calling other usermade functions. Will I need to do post-processing (ie mapping with another regexp) on the matches to reject matches I don't want, or is there a way to do this in one regexp?
The basic recipe for a regular expression that matches all words except foo (in Vim's regular expression syntax) is:
/\<\%(foo\>\)\#!\k\+\>/
Note how the negative lookahead (\#!) needs an end assertion (here: \>) on its own, to avoid that it also excludes anything that just starts with the expression!
Applied to your examples (excluding if (potentially with whitespace), console.log, and function, ending with (), that gives:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\(\k\|\.\)\+\>(.*)
As you seem to want to include the entire object chain (so JSON.parse instead of just parse), the actual match includes both keyword characters (\k) and the period. There's one complication with that: The negative lookahead will latch onto the log() in console.log(), because the leading keyword boundary assertion (\<) matches there as well. We can disallow that match by also excluding a period just before the function; i.e. by placing \.\#<! in between:
\<\%(\%(if *\|console\.log\|function\)(\)\#!\.\#<!\(\k\|\.\)\+\>(.*)
That will highlight just the following calls:
JSON.parse(localStorage["beards"])
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)

Regex character interval with exception

Say I have an interval with characters ['A'-'Z'], I want to match every of these characters except the letter 'F' and I need to do it through the ^ operator. Thus, I don't want to split it into two different intervals.
How can I do it the best way? I want to write something like ['A'-'Z']^'F' (All characters between A-Z except the letter F). This site can be used as reference: http://regexr.com/
EDIT: The relation to ocaml is that I want to define a regular expression of a string literal in ocamllex that starts/ends with a doublequote ( " ) and takes allowed characters in a certain range. Therefore I want to exclude the doublequotes because it obviously ends the string. (I am not considering escaped characters for the moment)
Since it is very rare to find two regular expressions libraries / processors with exactly the same regular expression syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, Ecmascript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']
(?!F)[A-Z] or ((?!F)[A-Z])*
This will match every uppercase character excluding 'F'
Use character class subtraction:
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
The regex below accomplishes what you want using ^ and without splitting into different intervals. It also resambles your original thought (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed simple remove the i flag.
Demo

Regular Expression Opposite

Is it possible to write a regex that returns the converse of a desired result? Regexes are usually inclusive - finding matches. I want to be able to transform a regex into its opposite - asserting that there are no matches. Is this possible? If so, how?
http://zijab.blogspot.com/2008/09/finding-opposite-of-regular-expression.html states that you should bracket your regex with
/^((?!^ MYREGEX ).)*$/
, but this doesn't seem to work. If I have regex
/[a|b]./
, the string "abc" returns false with both my regex and the converse suggested by zijab,
/^((?!^[a|b].).)*$/
. Is it possible to write a regex's converse, or am I thinking incorrectly?
Couldn't you just check to see if there are no matches? I don't know what language you are using, but how about this pseudocode?
if (!'Some String'.match(someRegularExpression))
// do something...
If you can only change the regex, then the one you got from your link should work:
/^((?!REGULAR_EXPRESSION_HERE).)*$/
The reason your inverted regex isn't working is because of the '^' inside the negative lookahead:
/^((?!^[ab].).)*$/
^ # WRONG
Maybe it's different in vim, but in every regex flavor I'm familiar with, the caret matches the beginning of the string (or the beginning of a line in multiline mode). But I think that was just a typo in the blog entry.
You also need to take into account the semantics of the regex tool you're using. For example, in Perl, this is true:
"abc" =~ /[ab]./
But in Java, this isn't:
"abc".matches("[ab].")
That's because the regex passed to the matches() method is implicitly anchored at both ends (i.e., /^[ab].$/).
Taking the more common, Perl semantics, /[ab]./ means the target string contains a sequence consisting of an 'a' or 'b' followed by at least one (non-line separator) character. In other words, at ANY point, the condition is TRUE. The inverse of that statement is, at EVERY point the condition is FALSE. That means, before you consume each character, you perform a negative lookahead to confirm that the character isn't the beginning of a matching sequence:
(?![ab].).
And you have to examine every character, so the regex has to be anchored at both ends:
/^(?:(?![ab].).)*$/
That's the general idea, but I don't think it's possible to invert every regex--not when the original regexes can include positive and negative lookarounds, reluctant and possessive quantifiers, and who-knows-what.
You can invert the character set by writing a ^ at the start ([^…]). So the opposite expression of [ab] (match either a or b) is [^ab] (match neither a nor b).
But the more complex your expression gets, the more complex is the complementary expression too. An example:
You want to match the literal foo. An expression, that does match anything else but a string that contains foo would have to match either
any string that’s shorter than foo (^.{0,2}$), or
any three characters long string that’s not foo (^([^f]..|f[^o].|fo[^o])$), or
any longer string that does not contain foo.
All together this may work:
^[^fo]*(f+($|[^o]|o($|[^fo]*)))*$
But note: This does only apply to foo.
You can also do this (in python) by using re.split, and splitting based on your regular expression, thus returning all the parts that don't match the regex, how to find the converse of a regex
In perl you can anti-match with $string !~ /regex/;.
With grep, you can use --invert-match or -v.
Java Regexps have an interesting way of doing this (can test here) where you can create a greedy optional match for the string you want, and then match data after it. If the greedy match fails, it's optional so it doesn't matter, if it succeeds, it needs some extra data to match the second expression and so fails.
It looks counter-intuitive, but works.
Eg (foo)?+.+ matches bar, foox and xfoo but won't match foo (or an empty string).
It might be possible in other dialects, but couldn't get it to work myself (they seem more willing to backtrack if the second match fails?)