How to match something not defined - regex

If I have defined something like this:
COMMAND = "HI" | "HOW" | "ARE" | "YOU"
how can I say "match something that is not a COMMAND"?
I tried this:
[^COMMAND]
But it didn't work.

As far as I can tell this is not possible with (current) JFlex.
We would need an effective tempered negative lookahead: ((?!bad).)*
There are two ways to do a negative lookahead in JFlex:
negation in the lookahead: x / !(y [^]*) (match x if it is not followed by y).
lookahead with negated elements: x / [^y]|y[^z] (match x if it is followed by something that is not y, or by y and then something that is not z).
Otherwise, you may get some ideas from this answer (specifically the lookaround alternatives): https://stackoverflow.com/a/37988661/8291949
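For comparison, here is a minimal sketch of the tempered negative lookahead in an engine that supports it. It uses Python's re module, not JFlex, so it only illustrates the idea. The function name split_commands and the group names COMMAND and OTHER are made up for this example.

```python
import re

# Command words taken from the COMMAND macro in the question.
COMMANDS = ["HI", "HOW", "ARE", "YOU"]
ALTERNATION = "|".join(re.escape(word) for word in COMMANDS)

# Either a command, or a run of characters in which no command starts.
# (?:(?!X).)+ is the tempered negative lookahead: take one character
# at a time, but only while X does not match at the current position.
TOKEN = re.compile(
    rf"(?P<COMMAND>{ALTERNATION})|(?P<OTHER>(?:(?!{ALTERNATION}).)+)"
)

def split_commands(text):
    """Return (kind, lexeme) pairs, where kind is 'COMMAND' or 'OTHER'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(split_commands("HI there HOW are YOU"))
```

Note that . does not match a newline here, so this sketch only handles single-line input.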

Well, you can just match anything else, then
COMMAND = "HI" | "HOW" | "ARE" | "YOU"
. {throw new RuntimeException("Illegal character: <" + yytext() + ">");}

I have written a regex for matching the sub string with spaces around it but that's not working well

I was working on a regex problem whose task is to replace the substrings || and && with "or" and "and", but only where they have a space on both sides. I wrote code for it, but it does not work correctly.
Input = x&& &&& && && x || | ||\|| x
Expected output = x&& &&& and and x or | ||\|| x
Here is the code I wrote:
import re
for i in range(int(input())):
    print(re.sub(r'\s[&]{2}\s', ' and ', re.sub(r"\s[\|]{2}\s", " or ", input())))
My output = x&& &&& and && x or | ||\|| x
You need to use lookarounds. The problem with the current regex shows up in "&& &&": the first match consumes the space between the two operators, so no space is left in front of the second && and it does not match. Lookarounds are zero-length assertions, so they check for the space without consuming it.
Replace the regexes as follows:
\s[&]{2}\s --> (?<=\s)[&]{2}(?=\s)
\s[\|]{2}\s --> (?<=\s)[\|]{2}(?=\s)
(?<=\s) - the match must be preceded by a whitespace character
(?=\s) - the match must be followed by a whitespace character
You're looking for a regex like (?<=\s)&&(?=\s)
Using lookarounds to assert the whitespace around the target lets adjacent operators share a space. Without them, each match consumes the spaces on both sides and blocks the neighbouring match.
import re
in_str = r'x&& &&& && && x || | ||\|| x'
expect_str = r'x&& &&& and and x or | ||\|| x'
print(re.sub(r"(?<=\s)\|\|(?=\s)", "or", re.sub(r"(?<=\s)&&(?=\s)", "and", in_str)))
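To see the difference side by side, the following sketch runs the space-consuming patterns and the lookaround patterns on the same input. The function names replace_consuming and replace_lookaround are made up for this comparison.

```python
import re

def replace_consuming(s):
    # \s on both sides consumes the spaces, so two operators
    # separated by a single space cannot both match.
    s = re.sub(r"\s\|\|\s", " or ", s)
    return re.sub(r"\s&&\s", " and ", s)

def replace_lookaround(s):
    # Lookbehind and lookahead only assert the whitespace;
    # it stays available for the next match.
    s = re.sub(r"(?<=\s)\|\|(?=\s)", "or", s)
    return re.sub(r"(?<=\s)&&(?=\s)", "and", s)

sample = r"x&& &&& && && x || | ||\|| x"
print(replace_consuming(sample))   # x&& &&& and && x or | ||\|| x
print(replace_lookaround(sample))  # x&& &&& and and x or | ||\|| x
```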
Try using re.findall() instead of re.sub

R regex - removing pattern from ends

Suppose I have a string that looks like so:
x <- "NNNNAAAJNFHANFFADN"
How would I only remove the N's from the ends to get:
"AAAJNFHANFFAD"
Just match the N's at the start or at the end and remove them with gsub.
gsub("^N+|N+$", "", x)
^N+ matches one or more N's at the start.
| is the alternation operator.
N+$ matches one or more N's at the end.
Example:
> x <- "NNNNAAAJNFHANFFADN"
> gsub("^N+|N+$", "", x)
[1] "AAAJNFHANFFAD"
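For readers who want to try the same pattern outside R, here is a rough Python equivalent. It is only a sketch for comparison, not part of the R answer. Python's str.strip is shown as a regex-free alternative for the single-character case.

```python
import re

x = "NNNNAAAJNFHANFFADN"

# Same alternation as the gsub call: N's at the start or at the end.
trimmed = re.sub(r"^N+|N+$", "", x)

# For a single repeated character, strip() does the same job.
stripped = x.strip("N")

print(trimmed)   # AAAJNFHANFFAD
print(stripped)  # AAAJNFHANFFAD
```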
gsub("^N*([A-Z]*?)N*$", "\\1", x)
You can use \1 as a backreference to the captured group here. See demo.
https://regex101.com/r/uF4oY4/66
Use as
gsub("(^N{1,}|N{1,}$)","",x)
https://regex101.com/r/uF4oY4/69

Regular Expression : Splitting a string of list of multivalues

My goal is to split this string with a regular expression:
AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+
into a list of:
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Regards.
This regex works with your sample string:
,(?![^(]+\))
This splits on comma, but uses a negative lookahead to assert that the next bracket character is not a right bracket. It will still split even if there are no following brackets.
Here's some Java code demonstrating it with your sample, plus some more general input to show its robustness:
String input = "AA(1.2,1.3)+,BB(125)-,FOO,CC(A,B,C)-,DD(QWE)+,BAR";
String[] split = input.split(",(?![^(]+\\))");
for (String s : split) System.out.println(s);
Output:
AA(1.2,1.3)+
BB(125)-
FOO
CC(A,B,C)-
DD(QWE)+
BAR
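The same lookahead works unchanged in other engines with Perl-style syntax. A minimal Python sketch, using the same test input as the Java snippet:

```python
import re

text = "AA(1.2,1.3)+,BB(125)-,FOO,CC(A,B,C)-,DD(QWE)+,BAR"

# Split on a comma unless the next bracket character after it is ")",
# which would mean the comma sits inside a parenthesised list.
parts = re.split(r",(?![^(]+\))", text)

for part in parts:
    print(part)
```

Like the Java version, this assumes the parentheses are not nested.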
I don't know what language you are working with, but this does it with grep:
$ grep -o '[A-Z]*([A-Z0-9.,]*)[^,]*' file
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Explanation
[A-Z]*([A-Z0-9.,]*)[^,]*
[A-Z]* - list of chars from A to Z
( - the ( char
[A-Z0-9.,]* - A-Z, 0-9, . or , chars
) - the ) char
[^,]* - everything but a comma
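The same extracting approach (matching the items instead of splitting on commas) might look like this in Python. This is a sketch, not a translation of grep's behaviour in every detail. In grep's default basic syntax an unescaped ( is a literal character, while Python's re treats it as a group, so the parentheses are escaped here.

```python
import re

text = "AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+"

# Letters, a literal "(", the allowed list characters, a literal ")",
# then everything up to the next comma.
items = re.findall(r"[A-Z]*\([A-Z0-9.,]*\)[^,]*", text)

print(items)
```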

Non-greedy regular expression match for multicharacter delimiters in awk

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
I have tried the following:
awk '
BEGIN {
    str = "AB 1 BA 2 AB 3 BA"
    regex = "AB([^B][^A]|B[^A]|[^B]A)*BA"
    if (match(str, regex))
        print substr(str, RSTART, RLENGTH)
}'
with no output. I believe the reason for no match is that there is an odd number of characters between "AB" and "BA". If I replace str with "AB 11 BA 22 AB 33 BA" the regex seems to work..
Merge your two negated character classes and remove the [^A] from the second alternation:
regex = "AB([^AB]|B|[^B]A)*BA"
This regex fails on the string ABABA, though - not sure if that is a problem.
Explanation:
AB # Match AB
( # Group 1 (could also be non-capturing)
[^AB] # Match any character except A or B
| # or
B # Match B
| # or
[^B]A # Match any character except B, then A
)* # Repeat as needed
BA # Match BA
Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
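As a sanity check on the sample string, the patterns can be compared in Python. Keep in mind that Python's re is a backtracking engine with lazy quantifiers, not a POSIX leftmost-longest engine like awk, so this only shows what the non-greedy result should look like. The helper first_match is made up for this sketch.

```python
import re

text = "AB 1 BA 2 AB 3 BA"

original = r"AB([^B][^A]|B[^A]|[^B]A)*BA"   # pattern from the question
rewritten = r"AB([^AB]|B|[^B]A)*BA"         # pattern from this answer
lazy = r"AB.*?BA"                           # lazy quantifier, not available in awk

def first_match(pattern, s):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, s)
    return m.group(0) if m else None

# Every alternative in the original consumes exactly two characters,
# so it cannot span the three characters between "AB" and "BA".
print(first_match(original, text))
print(first_match(rewritten, text))
print(first_match(lazy, text))
```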
The other answer didn't really address the question: how do you match non-greedily?
It looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest
sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
The manual doesn't contain the words "greedy" or "lazy" anywhere. It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So no, it can't be done.
For general expressions, I'm using this as a non-greedy match:
function smatch(s, r) {
    if (match(s, r)) {
        m = RSTART
        do {
            n = RLENGTH
        } while (match(substr(s, m, n - 1), r))
        RSTART = m
        RLENGTH = n
        return RSTART
    } else return 0
}
smatch behaves like match, returning:
the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.

Regular expression match decimal with letters

I have the following strings: 3.14, 123.56f, .123e5f, 123D, 1234, 343E12, 32.
I want to match any of the inputs above. So far I have started with the following:
^[0-9]\d*(\.\d+)
I realize I have to escape the . since it is a special character in a regular expression.
Thanks.
This should also work, if not already proposed.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

try {
    Pattern regex = Pattern.compile("\\.?\\b[0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?[fD]?\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    // subjectString is the input text to scan
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        // matched text: regexMatcher.group()
        // match start: regexMatcher.start()
        // match end: regexMatcher.end()
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
Probably
^(\d+(\.\d+)?|\.\d+)([eE]\d+)?[fD]?$
http://regexr.com?2ut9t
^ start of the string
(\d+(\.\d+)?|\.\d+) one or more digits with an optional (. and one or more digits), or . and one or more digits
([eE]\d+)? an optional ( e or E and one or more digits)
[fD]? an optional f or D
$ end of the string
As a sidenote, I've made the D compatible with everything but the f.
If you need positive and negative sign, add [+-]? after the ^
This will match all of those:
[0-9.]+(?:[Ee][0-9.]*)?[DdFf]?
Note that within a character class (square brackets), dot . is not a special character and does not need to be escaped.
Maybe this one?
^\d*(?:\.\d+)?(?:[eE]\d+)?(?:[fD])?$
with
^\d* # optionally a digit or sequence of digits at the start
(?:\.\d+)? # optionally followed by a dot and at least one digit
(?:[eE]\d+)? # optionally an 'e' or 'E' followed by at least one digit
(?:[fD])?$ # optionally followed by an 'f' or 'D', then the end of the string
You can use regexpal to test it out, but this seems to work on all of those examples:
^\d*\.?(\d*[eE]?\d*)[fD]?$
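A quick way to compare the candidate patterns is to run each one against the sample inputs. The sketch below does that in Python, assuming the last sample is literally "32." with a trailing dot (the question's punctuation leaves this ambiguous). The dictionary keys are made-up labels for the answers above, and the Java pattern is left out.

```python
import re

# Samples from the question; "32." with a trailing dot is an assumption.
SAMPLES = ["3.14", "123.56f", ".123e5f", "123D", "1234", "343E12", "32."]

# Hypothetical labels for the patterns proposed in the answers.
CANDIDATES = {
    "alternation": r"^(\d+(\.\d+)?|\.\d+)([eE]\d+)?[fD]?$",
    "char_class": r"[0-9.]+(?:[Ee][0-9.]*)?[DdFf]?",
    "optional_parts": r"^\d*(?:\.\d+)?(?:[eE]\d+)?(?:[fD])?$",
    "loose": r"^\d*\.?(\d*[eE]?\d*)[fD]?$",
}

def misses(pattern):
    """Return the samples that the pattern does not match in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s) is None]

for name, pattern in CANDIDATES.items():
    print(name, misses(pattern))
```

Under that assumption, the two patterns that require digits after the dot reject "32.", while the two looser ones accept every sample. The looser ones also accept strings that are not numbers (for example a lone dot), so which trade-off is right depends on the input you expect.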