Understanding \G and \K in regex - regex

In a previous question, I asked to match chars that follow a specific pattern. In order to be more specific, I would like to consider this example:
We want to match all the x that follow b or d. We may want to replace these characters with o:
-a x x xx x x
-b x x x x xx x
-c x x x x x x
-d x x x xx x x
The result would be this:
-a x x xx x x
-b o o o o oo o
-c x x x x x x
-d o o o oo o o
anubhava answered my question with a pretty nice regex that has the same form as this one:
/([db]|\G)[^x-]*\Kx/g
Unfortunately I did not completely understand how \G and \K work. I would like a more detailed explanation of this specific case.
I tried to use the Perl regex debugger, but its output is a bit cryptic.
Compiling REx "([db]|\G)[^x-]*\Kx"
Final program:
1: OPEN1 (3)
3: BRANCH (15)
4: ANYOF[bd][] (17)
15: BRANCH (FAIL)
16: GPOS (17)
17: CLOSE1 (19)
19: STAR (31)
20: ANYOF[\x00-,.-wy-\xff][{unicode_all}] (0)
31: KEEPS (32)
32: EXACT <x> (34)
34: END (0)

Correct regex is:
(-[db]|(?!^)\G)[^x-]*\Kx
Check this demo
As per the regex101 description:
\G - asserts position at the end of the previous match or the start of the string for the first match. \G will match start of line as well for the very first match hence there is a need of negative lookahead here (?!^)
\K - resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match. \K will discard all matched input hence we can avoid back-reference in replacement.
More details about \K
More details about \G
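Python's standard `re` module supports neither \G nor \K, so the following is only a sketch of what the two escapes do in (-[db]|(?!^)\G)[^x-]*\Kx, not a translation of it. It assumes that `Pattern.match(text, pos)`, which only succeeds exactly at `pos`, is a fair stand-in for \G, and that copying a captured prefix back unchanged is a fair stand-in for \K. The names `START`, `STEP` and `replace_x` are made up for this illustration.

```python
import re

START = re.compile(r"-[db]")      # first alternative: a "-b" or "-d" marker
STEP = re.compile(r"([^x-]*)x")   # "[^x-]*\Kx": skipped text (group 1), then one x


def replace_x(text, repl="o"):
    out = []
    pos = 0
    while True:
        marker = START.search(text, pos)
        if marker is None:
            break
        # Copy everything up to and including the marker unchanged.
        out.append(text[pos:marker.end()])
        pos = marker.end()
        # Stand-in for \G: match() only succeeds exactly at pos, the end of
        # the previous match, so the chain stops at the next "-".
        step = STEP.match(text, pos)
        while step is not None:
            # Stand-in for \K: the skipped prefix is kept as it was and only
            # the final x is replaced.
            out.append(step.group(1) + repl)
            pos = step.end()
            step = STEP.match(text, pos)
    out.append(text[pos:])
    return "".join(out)


sample = "-a x x xx x x\n-b x x x x xx x\n-c x x x x x x\n-d x x x xx x x\n"
print(replace_x(sample))
```

The inner loop is the part that corresponds to \G: each x is only reachable because the previous match ended right before the text that leads up to it.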

I would suggest not doing it in one regex. Your intent is much more clear if you do this:
if ( /^-[bd]/ ) { # If it's a line that starts with -b or -d...
s/x/o/g; # ... replace the x's with o's.
}
If that's too many lines for you, you could even do:
s/x/o/g if /^-[bd]/;

Related

How to match something not defined

if I have defined something like that
COMMAND = "HI" | "HOW" | "ARE" | "YOU"
How can I say "if you match something that is not a COMMAND"?
I tried this:
[^COMMAND]
But it didn't work.
As far as I can tell this is not possible with (current) JFlex.
We would need an effective tempered negative lookahead: ((?!bad).)*
There are two ways to do a negative lookahead in JFlex:
negation in the lookahead: x / !(y [^]*) (match x if not followed by y in the lookahead).
lookahead with negated elements: x / [^y]|y[^z] (match x if it is followed by something that is not y, or by y that is not followed by z).
Otherwise, you may get some ideas from this answer (specifically the lookaround alternatives): https://stackoverflow.com/a/37988661/8291949
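To show what a tempered negative lookahead of the form ((?!bad).)* does, here is a minimal sketch in Python rather than JFlex (which, as noted above, may not be able to express it). The names `NOT_COMMAND` and `contains_no_command` are hypothetical, and the keyword list is taken from the COMMAND definition in the question.

```python
import re

# Tempered negative lookahead: consume one character at a time, but only at
# positions where none of the COMMAND keywords starts.
NOT_COMMAND = re.compile(r"(?:(?!HI|HOW|ARE|YOU).)*")


def contains_no_command(text):
    """True if no COMMAND keyword occurs anywhere in a single-line text."""
    return NOT_COMMAND.fullmatch(text) is not None


for candidate in ["HELLO", "SAY HI", "WHO", "YOUR"]:
    print(candidate, contains_no_command(candidate))
```

Note that this rejects keywords even inside longer words (YOUR contains YOU); whether that is wanted depends on the lexer's token rules.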
Well, you can just match anything else, then
COMMAND = "HI" | "HOW" | "ARE" | "YOU"
. {throw new RuntimeException("Illegal character: <" + yytext() + ">");}

Remove all numbers + symbols from line in Notepad++

Is it possible, in Notepad++, to remove every line that contains anything other than these characters:
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
, . '
Like that :
Remove Non-ascii
.*[^\x00-\x7F]+.*
Remove Numbers
.*[0-9]+.*
Text :
example
example'
example,
example.
example123
éxample è
[example/+
example'/é,
example,*
exa'mple--
example#
example"
You may use
^(?![a-zA-Z,.']+$).+$\R?
The regex matches any non-empty line (.+) that does not only consist of ASCII letters, ,, . or '. \R? at the end matches an optional line break.
Details:
^ - start of a line
(?![a-zA-Z,.']+$) - a negative lookahead that fails the match if its pattern matches: [a-zA-Z,.']+ - 1 or more ASCII letters, commas, periods or single quotes up to the end of the line ($)
.+ - 1+ chars other than line break char
$ - end of a line
\R? - an optional line break char (sequence)
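The same lookahead can be tried outside Notepad++. This is a sketch in Python under two assumptions: Python's `re` has no \R, so `\n?` stands in for it (LF line endings only), and `re.MULTILINE` is needed so that ^ and $ work per line. `UNWANTED_LINE` and `drop_unwanted_lines` are illustrative names.

```python
import re

# A non-empty line that does NOT consist solely of ASCII letters, commas,
# periods and single quotes, plus its trailing newline if there is one.
UNWANTED_LINE = re.compile(r"^(?![a-zA-Z,.']+$).+$\n?", re.MULTILINE)


def drop_unwanted_lines(text):
    return UNWANTED_LINE.sub("", text)


lines = ["example", "example'", "example,", "example.", "example123",
         "éxample è", "[example/+", "example'/é,", "example,*",
         "exa'mple--", "example#", 'example"']
print(drop_unwanted_lines("\n".join(lines) + "\n"))
```

On the sample text from the question, only the first four lines should survive.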
You can remove them like this:
Find what: ^.*[^a-zA-Z.,'].*$
Replace with: ``
Explanation:
.* for any text
the negated character class [^...] for any unwanted character
then .* again for any remaining text
You need to wrap it into ^...$ to match the whole line
If you want to delete the linefeed characters, then you can use \r?\n instead of the $ sign. I.e.: ^.*[^a-zA-Z.,'].*\r?\n
Try replacing all matches of this:
^.+?[^a-zA-Z,.'\r\n]+(.|\r?\n)

How do I represent "Any string except for .... "

I'm trying to solve a regex where the given alphabet is Σ={a,b}
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
which means the corresponding regex is: aa(a)*b(bbb)*
What would be a regex for L2, complement of L1?
Is it right to assume L2 = "Any string except for aa(a)*b(bbb)*"?
First, in my opinion, the regex for L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
is NOT what you gave but: aa(aa)*b(bbb)*. The reason is that a^2n with n >= 1 means there are at least two a's and the number of a's is even.
Now, the regular expression for "Any string except for aa(aa)*b(bbb)*" is:
^(?!^aa(aa)*b(bbb)*$).*$
more details here: Regex101
Explanations
aa(aa)*b(bbb)* the regex you DON'T want to match
^ represents beginning of line
(?!) negative lookahead: should NOT match what's in this group
$ represents end of line
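One way to gain confidence in a complement regex is to compare it against the set definition on every short string. The sketch below does that in Python for the lookahead form; `in_l1` and `check` are hypothetical helper names, and the length bound of 8 is an arbitrary assumption that only makes this a spot check, not a proof.

```python
import re
from itertools import product

L1 = re.compile(r"aa(?:aa)*b(?:bbb)*")
NOT_L1 = re.compile(r"^(?!aa(?:aa)*b(?:bbb)*$).*$")


def in_l1(s):
    """Membership in L1 taken straight from the set definition, no regex."""
    a = len(s) - len(s.lstrip("a"))   # length of the leading run of a's
    b = len(s) - a                    # everything after it must be b's
    return s[a:] == "b" * b and a >= 2 and a % 2 == 0 and b % 3 == 1


def check(max_len=8):
    """Return the first string on which a regex disagrees, else None."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (L1.fullmatch(s) is not None) != in_l1(s):
                return s
            if (NOT_L1.match(s) is not None) == in_l1(s):
                return s
    return None


print(check())
```

If `check()` returns a string instead of `None`, that string is a counterexample to one of the two patterns.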
EDIT
Yes, a complement for aa(aa)*b(bbb)* is "Any string but the ones that match aa(aa)*b(bbb)*".
Now you need to find a regex that represents that with the syntax that you can use. I gave you a regex in this answer that is correct and matches "Any string but the ones that match aa(aa)*b(bbb)*", but if you want a mathematical representation following the pattern you gave for L1, you'll need to find something simpler.
Without any negative lookahead, that would be:
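L2 = ^((b+.*)|((a(aa)*)?b*)|a*((bbb)*|bb(bbb)*)|(.*a+)|(.*ba.*))$
The last alternative, (.*ba.*), covers strings in which an a follows a b, such as aabab; without it those strings are not in L1 but are not matched either.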
Test it here at Regex101
Good luck with the mathematical representation translation...
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
Regex for L1 is:
^aa(?:aa)*b(?:bbb)*$
Regex demo
Input
a
b
ab
aab
abb
aaab
aabb
abbb
aaaab
aaabb
aabbb
abbbb
aaaaab
aaaabb
aaabbb
aabbbb
abbbbb
aaaaaab
aaaaabb
aaaabbb
aaabbbb
aabbbbb
abbbbbb
aaaabbbb
Matches
MATCH 1
1. [7-10] `aab`
MATCH 2
1. [30-35] `aaaab`
MATCH 3
1. [75-81] `aabbbb`
MATCH 4
1. [89-96] `aaaaaab`
MATCH 5
1. [137-145] `aaaabbbb`
Regex for L2, complement of L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL)|^.*$
Explanation:
^aa(?:aa)*b(?:bbb)*$ matches L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL) makes anything that matches L1 skip and fail
|^.*$ matches everything else, i.e. the lines that are not in L1
Regex demo
Matches
MATCH 1
1. [0-1] `a`
MATCH 2
1. [2-3] `b`
MATCH 3
1. [4-6] `ab`
MATCH 4
1. [11-14] `abb`
MATCH 5
1. [15-19] `aaab`
MATCH 6
1. [20-24] `aabb`
MATCH 7
1. [25-29] `abbb`
MATCH 8
1. [36-41] `aaabb`
MATCH 9
1. [42-47] `aabbb`
MATCH 10
1. [48-53] `abbbb`
MATCH 11
1. [54-60] `aaaaab`
MATCH 12
1. [61-67] `aaaabb`
MATCH 13
1. [68-74] `aaabbb`
MATCH 14
1. [82-88] `abbbbb`
MATCH 15
1. [97-104] `aaaaabb`
MATCH 16
1. [105-112] `aaaabbb`
MATCH 17
1. [113-120] `aaabbbb`
MATCH 18
1. [121-128] `aabbbbb`
MATCH 19
1. [129-136] `abbbbbb`

Regular expression replaces invalid words

I have this sentence and I use regular expressions to replace the word "merda" or "merdas" with ---
"merda vamerda e mais mmmerda? a merdaaa lol merda, namerda m e r d a mesmo merda"
This is the regular expression I'm using:
m{1,}e{1,}r{1,}d{1,}a{1,}s{1,}|m{1,}e{1,}r{1,}d{1,}a{1,}
and this is the result:
"--- va --- e mais --- ? a --- lol --- , na --- m e r d a mesmo ---"
There are 3 errors here: vamerda and namerda should not be replaced, and it didn't replace m e r d a.
Can you help me please?
how about :
/\bm+\s*e+\s*r+\s*d+\s*a+\s*s*\b/
explanation:
\b : word boundary
m+ : matches 1 or more m
\s* : matches 0 or more spaces
... same explanation for other letters (e,r,d,a)
s* : matches 0 or more s
\b : word boundary
This will match all expected combinations in the given example.
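As a quick check of that claim, here is a minimal Python sketch that runs the same pattern over the sentence from the question. `WORD`, `sentence` and `hits` are names chosen for the example; Python's `re` is assumed to treat \b and \s the same way for this ASCII input.

```python
import re

WORD = re.compile(r"\bm+\s*e+\s*r+\s*d+\s*a+\s*s*\b")

sentence = ("merda vamerda e mais mmmerda? a merdaaa lol merda, "
            "namerda m e r d a mesmo merda")

# strip() because the \s* before s* can also absorb a trailing space.
hits = [m.group().strip() for m in WORD.finditer(sentence)]
print(hits)
```

vamerda and namerda are left alone because there is no word boundary before their m. One side effect worth knowing: since \s* precedes the optional s, a match can swallow the space after the word, so a plain substitution with --- may glue the replacement to the next word.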
Edit
According to your comment, you can modify the regex by exchanging each \s* with [\s_]* like :
\bm+[\s_]*e+[\s_]* and so on ...
or even with:
\bm+[^a-z]* ...
Try putting your regular expression in Rubular
It will give you real-time match results, as you modify your regex.
Here's a link to your expression in Rubular permalink
Try this one:
/\Amerda\s+|\smerda,|\smerda\z|\s+merdas\s+|m\se\sr\sd\sa\s/

Regex to calculate straight poker hand - Using ASCII CODE

In another question I learned how to calculate straight poker hand using regex (here).
Now, out of curiosity, the question is: can I use regex to calculate the same thing using ASCII codes?
Something like:
regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
Matches: 45678, 23456
Doesn't match: 45679 or 23459 (not in sequence)
Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both A2345 and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.
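Rather than typing the alternation by hand, it can be generated from the rank order. This is a small Python sketch under the assumption that hands are written low to high with the ace last, as in the list above; `RANKS` and `straight_pattern` are illustrative names.

```python
import re

RANKS = "23456789TJQKA"


def straight_pattern():
    # Every run of five consecutive ranks, plus the ace-low straight.
    windows = [RANKS[i:i + 5] for i in range(len(RANKS) - 4)]
    return "(" + "|".join(["2345A"] + windows) + ")"


pattern = straight_pattern()
print(pattern)
print(bool(re.fullmatch(pattern, "45678")), bool(re.fullmatch(pattern, "45679")))
```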
There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generates the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
    seq.replaceAll(
        "(?=(.{5}).).",
        "$1|"
    )
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier. .{5} matches exactly 5 characters (five repetitions of .).
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
    match one char
       at a time
           |
(?=(.{5}).).
\_________/
must be able to see 6 chars ahead
(capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT ===                                    === OUTPUT ===
 A B C D E F G                                   ABCDE|BCDEFG
↑
 We can assert (?=(.{5}).), matching ABCDEF
 in the lookahead. ABCDE is captured.
 We now match A, and replace with ABCDE|
 A B C D E F G                                   ABCDE|BCDEF|CDEFG
  ↑
 We can assert (?=(.{5}).), matching BCDEFG
 in the lookahead. BCDEF is captured.
 We now match B, and replace with BCDEF|
 A B C D E F G                                   ABCDE|BCDEF|CDEFG
    ↑
 Can't assert (?=(.{5}).), skip forward
 A B C D E F G                                   ABCDE|BCDEF|CDEFG
      ↑
 Can't assert (?=(.{5}).), skip forward
 A B C D E F G                                   ABCDE|BCDEF|CDEFG
        ↑
 Can't assert (?=(.{5}).), skip forward
 :
 :
 A B C D E F G                                   ABCDE|BCDEF|CDEFG
              ↑
 Can't assert (?=(.{5}).), and we are at
 the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
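The same lookahead-and-capture trick carries over to other engines. As a sketch, here it is with Python's `re.sub`, where the replacement backreference is written `\1` instead of `$1`; `windows_of_five` is a made-up helper name.

```python
import re


def windows_of_five(seq):
    # Replace each character that has at least 6 characters ahead of it
    # (itself included) with the 5 captured characters plus "|".
    body = re.sub(r"(?=(.{5}).).", r"\1|", seq)
    return "^(" + body + ")$"


print(windows_of_five("ABCDEFG"))
```

The characters that cannot satisfy the lookahead are left untouched, which is why the last window (CDEFG here) appears without a trailing |.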
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround
Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You cannot do anything remotely close to this in most regex flavors. This is simply not the kind of pattern that regex is designed for.
There is no mainstream regex pattern that will succinctly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
    "ABCDE",    // true
    "JKLMN",    // true
    "AAAAA",    // false
    "ABCDEFGH", // false
    "ABCD",     // false
    "ACEGI",    // false
    "FGHIJ",    // true
};
for (String test : tests) {
    System.out.printf("[%s] : %s%n",
        test,
        test.matches(p5)
    );
}
This uses a meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then meta-regexed to be matched repeatedly 5 times.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
SOLVED!
See in http://jsfiddle.net/g48K9/3
I solved using closure, in js.
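String.prototype.isSequence = function () {
    if (this == "A2345") return true; // special case: the ace-low straight
    // The callback's boolean result is converted to the string "true" or
    // "false", which replaces the whole five-card match.
    return this.replace(/(\w)(\w)(\w)(\w)(\w)/, function (a, g1, g2, g3, g4, g5) {
        return code(g1) == code(g2) - 1 &&
               code(g2) == code(g3) - 1 &&
               code(g3) == code(g4) - 1 &&
               code(g4) == code(g5) - 1;
    }) === "true";
};
// Map each card to a code so that consecutive ranks have consecutive codes.
function code(card) {
    switch (card) {
        case "T": return 58;
        case "J": return 59;
        case "Q": return 60;
        case "K": return 61;
        case "A": return 62;
        default: return card.charCodeAt(0);
    }
}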
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
alert("cards " + cards + ": " + cards.isSequence())
}
Just to clarify, ascii codes:
ASCII CODES:
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62
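The remapping in the table can also be applied without any regex. Below is a minimal Python sketch of the same idea, assuming hands are five-character strings sorted low to high and that A2345 is the only special case; `RANK_CODE` and `is_sequence` are names invented for the example.

```python
# '2'..'9' keep their ASCII codes 50..57; T, J, Q, K, A continue at 58..62.
RANK_CODE = {rank: code for code, rank in enumerate("23456789TJQKA", start=50)}


def is_sequence(cards):
    if cards == "A2345":  # special case: the ace-low straight
        return True
    if len(cards) != 5 or any(card not in RANK_CODE for card in cards):
        return False
    codes = [RANK_CODE[card] for card in cards]
    # Each card's code must be exactly one more than the previous card's.
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))


for hand in ["23456", "23444", "789TJ", "TJQKA", "8JQKA"]:
    print(hand, is_sequence(hand))
```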