Regex for extracting qmake variables

I'm trying to write a QRegExp for extracting variable names from qmake project code (*.pro files).
The syntax of variable usage has two forms:
$$VAR
$${VAR}
So, my regular expression must handle both cases.
I wrote the expression this way:
\$\$\{?(\w+)\}?
But it does not work as expected: for the string $$VAR I get the match $$V when greedy matching is disabled (QRegExp::setMinimal(true)). As I understand it, greedy mode can lead to wrong results in my case.
So, what am I doing wrong?
Or maybe I should just use greedy mode and not worry about this behavior :)
P.S. A variable name can't contain spaces or other "special" symbols, only letters.

You do not need to disable greedy matching. If greedy matching is disabled, the minimal match that satisfies your expression is returned. In your example, there's no need to match the AR, because $$V already satisfies your expression.
So turn minimal mode back off (QRegExp::setMinimal(false), which restores greedy matching), and use
\$\$(\w+|\{\w+\})
This matches two dollar signs, followed by either a bunch of word characters, or by a bunch of word characters between braces. If you can trust your data not to contain any non-matching braces, your expression should work just as well.
\w is roughly [A-Za-z0-9_] (QRegExp also counts non-ASCII letters and digits as word characters), so it matches all digits, all upper and lowercase alphabetical letters, and the underscore. If you want to restrict this to just the letters of the alphabet, use [A-Za-z] instead.
Since the variable names cannot contain any special characters, there's no danger of matching too much, unless a variable can be followed directly by more word characters, in which case the end of the name is ambiguous.
For instance, if the data contains a string like Buy our new $$Varbuster!, where $$Var is supposed to be the variable, there is no regular expression that will separate the variable from the rest of the string.
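As a rough illustration in Python's re module rather than QRegExp, the difference between greedy and minimal matching can be reproduced with a lazy quantifier. This is only a sketch: the sample line is made up, and a lazy `+?` approximates setMinimal(true) rather than reproducing it exactly.

```python
import re

# The asker's pattern, greedy (the default behaviour).
GREEDY = re.compile(r"\$\$\{?(\w+)\}?")
# Same pattern with a lazy quantifier, roughly what minimal mode does.
LAZY = re.compile(r"\$\$\{?(\w+?)\}?")
# The pattern suggested in the answer; the braces end up inside the group.
ALTERNATION = re.compile(r"\$\$(\w+|\{\w+\})")

# Hypothetical line from a .pro file.
line = "SOURCES += $$VAR/main.cpp $${OTHER}/util.cpp"

greedy_names = GREEDY.findall(line)   # full names
lazy_names = LAZY.findall(line)       # only the first character of each name
alt_names = [name.strip("{}") for name in ALTERNATION.findall(line)]

# The ambiguous case from the answer: the regex cannot know where the name ends.
ambiguous = GREEDY.findall("Buy our new $$Varbuster!")
```

With the alternation pattern the captured text still includes the braces for the $${VAR} form, so they are stripped afterwards here.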

Related

The most efficient lookahead substitute for jflex

I am writing a tokenizer in JFlex. I need to match words like interferon-a as one token, and words like interferon-alpha as three.
The obvious solution would be lookaheads, but they do not work in JFlex. For a similar task, I wrote a function matching one additional wildcard character after the matched pattern, checking in Java code whether it is a whitespace, and pushing it back with or without a part of the matched string.
REGEX = [:letter:]+\-[:letter:]\.
From string interferon-alpha it would match interferon-al.
Then, in Java code section it would check if the last character of the match is a whitespace. It is not, so -al would be pushed back and interferon returned.
In the case of interferon-a, whitespace would be pushed back and interferon returned.
However, this function does not work if matched string does not have anything succeeding. Also, it seems quite clunky. Hence, I was wondering if there is any 'nicer' way of ensuring that the following character is a whitespace without actually matching and returning it.
JFlex certainly has a lookahead facility, the same as (f)lex. Unlike Java regex lookahead assertions, the JFlex lookahead can only be applied at the end of a match, but it is otherwise similar. It is described in the Semantics section of the JFlex manual:
In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either $ (the end of line operator) or / followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match…
So you could certainly write the rule:
[:letter:]+\-[:letter:]/\s
However, you cannot put such a rule in a macro definition (REGEX = …), as the manual also mentions (in the section on macros):
The regular expression on the right hand side must be well formed and must not contain the ^, / or $ operators.
So the lookahead operator can only be used in a pattern rule.
Note that \s matches any whitespace character, including newline characters, while . does not match any newline character. I think that's what led to your comment that REGEX = [:letter:]+\-[:letter:]\. "does not work if matched string does not have anything succeeding" (I'm guessing that you meant "does not have anything succeeding it on the same line", and also that you intended to write . rather than \.).
Rather than testing for following whitespace, you might (depending on your language) prefer to test for a non-word character:
[:letter:]+\-[:letter:]/\W
or to craft a more precise specification as a set of Unicode properties, as in the definition of \W (also found in the linked section of the JFlex manual).
Having said all that, I'd like to repeat the advice from my previous answer to a similar question of yours: put more specific patterns first. For example, using the following pair of patterns will guarantee that the first one picks up words with a single-letter suffix, while avoiding the need for an explicit pushback.
[:letter:]+(-[:letter:])? { /* matches 'interferon' or 'interferon-a' */ }
[:letter:]+/-[:letter:]+ { /* matches only 'interferon' from 'interferon-alpha' */ }
Of course, in this case you could easily avoid the collision between the second pattern and the first pattern by using {2,} instead of + for the second repetition, but it's perfectly OK to rely on pattern ordering since it's often inconvenient to guarantee that patterns don't overlap.
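For comparison only, the same tokenization can be sketched with Python's re module, which is not JFlex: Python tries alternatives left to right instead of picking the longest match, and the trailing context becomes a negative lookahead. ASCII letters stand in for [:letter:] here, and the `tokenize` helper is a hypothetical name.

```python
import re

# Alternatives are tried left to right, so the more specific one goes first.
TOKEN = re.compile(
    r"[A-Za-z]+-[A-Za-z](?![A-Za-z])"  # word with a single-letter suffix
    r"|[A-Za-z]+"                      # plain word
    r"|-"                              # hyphen on its own
)

def tokenize(text):
    """Return the tokens found in text, skipping anything unmatched."""
    return TOKEN.findall(text)

single = tokenize("interferon-a")
triple = tokenize("interferon-alpha")
```

Because the lookahead only asserts that no letter follows, it also succeeds at the very end of the input, which is the case the pushback approach in the question could not handle.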

Matching Word() when word is not (some word)

Specifically, I want to match functions in my Javascript code that are not in a set of common standard Javascript functions. In other words, I want to match user defined functions. I'm working with vim's flavour of regexp, but I don't mind seeing solutions for other flavours.
As I understand it, regexp crawls through a string character by character, so thinking in terms of sets of characters can be problematic even when a problem seems simple. I've tried negative lookahead, and as you might expect, all that does is prevent the first character of the functions I don't want from being matched (i.e., onsole.log instead of console.log).
(?(?!(if)|(console\.log)|(function))\w+)\(.*\)
function(meep, boop, doo,do)
JSON.parse(localStorage["beards"])
console.log("sldkfjls" + dododo);
if (beepboop) {
BLAH.blah.somefunc(arge, arg,arg);
https://regexr.com/
I would like to be able to crawl through a function and see where it is calling other usermade functions. Will I need to do post-processing (ie mapping with another regexp) on the matches to reject matches I don't want, or is there a way to do this in one regexp?
The basic recipe for a regular expression that matches all words except foo (in Vim's regular expression syntax) is:
/\<\%(foo\>\)\@!\k\+\>/
Note how the negative lookahead (\@!) needs an end assertion of its own (here: \>); without it, the lookahead would also exclude any word that merely starts with the expression!
Applied to your examples (excluding if, potentially followed by whitespace, console.log, and function, each followed by an opening parenthesis), that gives:
\<\%(\%(if *\|console\.log\|function\)(\)\@!\(\k\|\.\)\+\>(.*)
As you seem to want to include the entire object chain (so JSON.parse instead of just parse), the actual match includes both keyword characters (\k) and the period. There's one complication with that: the pattern will still latch onto the log() in console.log(), because the leading keyword boundary assertion (\<) matches there as well. We can disallow that match by also excluding a period just before the function; i.e. by placing the negative lookbehind \.\@<! in between:
\<\%(\%(if *\|console\.log\|function\)(\)\@!\.\@<!\(\k\|\.\)\+\>(.*)
That will highlight just the following calls:
JSON.parse(localStorage["beards"])
BLAH.blah.somefunc(arge, arg,arg);
foo.log(asdf)
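Since the question also invited other flavours, here is a minimal sketch of the same idea with Python's re module. It is an approximation, not a translation of the Vim pattern: [\w.] stands in for \k plus the period, and a single negative lookbehind replaces both \< and \.\@<!.

```python
import re

CALL = re.compile(
    r"(?<![\w.])"                            # not in the middle of a name or chain
    r"(?!(?:if *|console\.log|function)\()"  # skip the excluded names
    r"[\w.]+\(.*\)"                          # dotted name plus its argument list
)

lines = [
    'function(meep, boop, doo,do)',
    'JSON.parse(localStorage["beards"])',
    'console.log("sldkfjls" + dododo);',
    'if (beepboop) {',
    'BLAH.blah.somefunc(arge, arg,arg);',
    'foo.log(asdf)',
]

# Keep the first match on each line, if there is one.
calls = [m.group(0) for m in map(CALL.search, lines) if m]
```

The greedy `.*` extends to the last closing parenthesis on the line, so nested calls inside the argument list are swallowed by the outer match.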

Paired characters in regular expression

I expect this is very easy, but I can't work out how to match optional character pairs in regex. Regular expressions are not something I have ever had to do before.
I want to be able to match "=N","=B","=R" or "=Q" in a character string, optionally -- but if they appear, they must appear paired with the equal sign. So =?[NBRQ]? won't work for me, because someone could type 'N' without the accompanying equal sign. So it must be "=N","=B", "=R" or "=Q" or nothing at all.
If you need to make more than one regex production optional, enclose them in parentheses, capturing or non-capturing:
(=[NBRQ])?
The above would match an optional =N, =B, =R, or =Q. Since the question mark appears after parentheses, the entire group is optional, not its individual parts.
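A small Python sketch of the optional group in context. The `e8` prefix and the `is_valid` helper are hypothetical, added only so the group has something to attach to; the question does not say what precedes the pair.

```python
import re

# "e8" is a hypothetical fixed prefix, only there to give the optional
# group some context; it is not part of the original question.
MOVE = re.compile(r"e8(=[NBRQ])?")

def is_valid(text):
    """True if text is the prefix, optionally followed by =N, =B, =R or =Q."""
    return MOVE.fullmatch(text) is not None
```

Here is_valid("e8") and is_valid("e8=Q") both hold, while "e8Q" and "e8=" are rejected because the equal sign and the letter can only appear together.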

Regex to check if a string contains at least A-Za-z0-9 but not an &

I am trying to check if a string contains at least A-Za-z0-9 but not an &.
My experience with regexes is limited, so I started with the easy part and got:
.*[a-zA-Z0-9].*
However, I am having trouble combining this with the "does not contain an &" portion.
I was thinking along the lines of ^(?=.*[a-zA-Z0-9].*)(?![&()]).* but that does not seem to do the trick.
Any help would be appreciated.
I'm not sure if this is what you meant, but here is a regular expression that will match any string that:
- contains at least one alphanumeric character
- does not contain an &
This expression ensures that the entire string is always matched (the ^ and $ at beginning and end), and that none of the characters matched are a "&" sign (the [^&]* sections):
^[^&]*[a-zA-Z0-9][^&]*$
However, it might be clearer in code to simply perform two checks, if you are not limited to a single expression.
Also, check out the \w class in regular expressions (it might be the better solution for catching alphanumeric chars if you want to allow non-ASCII characters).
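Both variants mentioned above, the single expression and the two separate checks, can be sketched in Python as follows. The function names are made up for illustration.

```python
import re

SINGLE = re.compile(r"^[^&]*[a-zA-Z0-9][^&]*$")

def is_valid_regex(text):
    """Single-expression check from the answer."""
    return SINGLE.match(text) is not None

def is_valid_two_checks(text):
    """The same rule as two separate, arguably more readable, checks."""
    return "&" not in text and re.search(r"[a-zA-Z0-9]", text) is not None

samples = ["abc", "a b-1", "a&b", "!!!", ""]
regex_results = [is_valid_regex(s) for s in samples]
two_check_results = [is_valid_two_checks(s) for s in samples]
```

For these samples the two functions agree; the second form tends to be easier to extend when more forbidden characters are added later.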

Different regex evaluation in collections or patterns

I am experiencing a strange behaviour when searching for a regular expression in vim:
I attempt to clean up superfluous whitespace in a file and want to use the substitute command for it.
When I use the following regular expression with collections, vim matches single whitespaces as well:
\%[\s]\{2,}
When I use the same regular expression with patterns instead of collections vim correctly matches only 2 or more whitespaces:
\%(\s\)\{2,}
I know that I do not need to use a collection, but if I try the expression in an online regular expression parser (e.g. Rubular) it works with a collection as well.
Can anyone explain why these expressions are not evaluated in the same way?
Because \%[...] and \%(...\) are completely different patterns.
\%[...] means a sequence of optional atoms.
For example, r\%[ead] matches "read", "rea", "re" and "r".
While \%(...\) treats the enclosed atoms as a single atom.
For example, r\%(ead\) matches only "read".
So that,
\%[\s]\{2,} can be interpreted as \(\s\|\)\{2,}, then \(\s\|\)\(\s\|\)\|\(\s\|\)\(\s\|\)\(\s\|\)\|....
Here \(\s\|\)\(\s\|\), the minimum pattern, can be interpreted as \(\)\(\), \(\)\(\s\), \(\s\)\(\) or \(\s\)\(\s\).
It matches 1 whitespace character too.
\%(\s\)\{2,} can be interpreted as \s\{2,}, then \s\s\|\s\s\s\|....
It matches only 2 or more whitespace characters.
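Python's re module has no \%[...] construct, so the sketch below writes the expansions from the explanation above out by hand. It is meant to illustrate the reasoning, not Vim's engine: nested optional groups stand in for the sequence of optional atoms, and the "minimum pattern" \(\s\|\)\(\s\|\) is spelled as two groups with an empty alternative.

```python
import re

# r\%[ead] written out as nested optional groups.
OPTIONAL_SEQUENCE = re.compile(r"r(?:e(?:a(?:d)?)?)?")
# r\%(ead\) is just a non-capturing group.
GROUPED = re.compile(r"r(?:ead)")

# Minimum expansion of \%[\s]\{2,}: two optional whitespace atoms.
OPTIONAL_TWICE = re.compile(r"(?:\s|)(?:\s|)")
# \%(\s\)\{2,} keeps the repetition count meaningful.
AT_LEAST_TWO = re.compile(r"(?:\s){2,}")

words = ("r", "re", "rea", "read", "rd")
optional_hits = [w for w in words if OPTIONAL_SEQUENCE.fullmatch(w)]
grouped_hits = [w for w in words if GROUPED.fullmatch(w)]

single_space_optional = OPTIONAL_TWICE.fullmatch(" ") is not None
single_space_grouped = AT_LEAST_TWO.fullmatch(" ") is not None
```

A single space satisfies the optional version because the second atom may match nothing, whereas the grouped version really needs two whitespace characters.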
Does this answer your question?
http://vimdoc.sourceforge.net/htmldoc/pattern.html#/\%[]
A sequence of optionally matched atoms. This always matches.
It matches as much of the list of atoms it contains as possible.
Thus it stops at the first atom that doesn't match.
For example:
/r\%[ead]
matches "r", "re", "rea" or "read". The longest that matches is used.
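The problem is that \%[...] always matches, even when it matches nothing, so the {2,} quantifier after it can be satisfied by empty repetitions and no longer enforces a minimum of two whitespace characters.
It is rarely used, but interesting nevertheless.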