Scala Regular Expression Oddity - regex

I have this regular expression:
^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$
To give it a bit of organization, there's really 3 parts:
// Part 1
^(10)(1|0)(.)(.)(.)(.{18})
// Part 2
// Optional Elements that begin with two characters and is terminated by a |
// May appear at most once
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
// Part 3
AY([0-9]*)AZ(.*)$
Part 2 is the part that I'm having trouble with but I believe the current regular expression says any of these given elements will appear one or more times. I could have done something like: (AB.*?|) but I don't need the pipe in my group and wasn't quite sure how to express it.
This is my sample input - it's SIP2 if you've seen it before (please disregard checksum, I know it's not valid):
101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
This is my snippet of Scala code:
val regex = """^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$""".r
val msg = "101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71"
val m = regex.findFirstMatchIn(msg)) match {
case None => println("No match")
case Some(x) =>
for (i <- 0 to x.groupCount) {
println(i + " " + x.group(i))
}
}
This is my output:
0 101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
1 10
2 1
3 Y
4 N
5 Y
6 201406120000091911
7 DAl|
8 ABb
9 AQc
10 AJd
11 AFf
12 CSg
13 CRh
14 CTi
15 CKe
16 CVj
17 CYk
18 DAl
19 AOa
20 1
21 AA71
Note the entry that starts with 7. Can anyone explain why that's there?
I'm using Scala 2.10.4 but I believe regular expressions in Scala simply uses Java's regular expression. I'm certainly open to other suggestions for parsing strings.
EDIT: Based on wingedsubmariner's response, I was able to fix my regular expression:
^(10)(1|0)(.)(.)(.)(.{18})(?:AB([^|]*)\||AQ([^|]*)\||AJ([^|]*)\||AF([^|]*)\||CS([^|]*)\||CR([^|]*)\||CT([^|]*)\||CK([^|]*)\||CV([^|]*)\||CY([^|]*)\||DA([^|]*)\||AO([^|]*)\|)+AY([0-9]*)AZ(.*)$
Basically adding ?: to indicate I was not interested in the group!

You get a matched group for each set of parentheses, the order being the order of the opening parenthesis in the regex. Matched group 7 corresponds to the opening parenthesis that begins your "Group 2":
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
^
|
This parenthesis
Each matched group takes on the value of the last part of the text that matched, which in this case is DAl| because it was the last piece of text to match the "Group 2" expression.
Here is a simpler example that demonstrates the behavior:
val regex = """((A)\||(B)\|)+""".r
val msg = "A|B|A|B|"
regex.findFirstMatchIn(msg) match {
case None => println("No match")
case Some(x) =>
for (i <- 0 to x.groupCount) {
println(i + " " + x.group(i))
}
}
Which produces:
0 A|B|A|B|
1 B|
2 A
3 B

Related

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ")
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers but it there's a word that contains alphanumeric, it should stay. Can anyone help?
Try with apply to achieve your ideal output.
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
Output:
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I have replaced - with space and split string on spaces. Then check whether the word is digit or not.
If it is digit replace with empty string else with actual word.
At last we are joining the list.
Edit 1:
regex solution :-
>>> my_Series.str.replace("((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ")
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
Using lookaround.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(A number is last if it is followed by any other number) OR (allows alpha numeric)

Using the text replaced in the replace text

I'm using the version 7.3.3 of Notepad++.
I have this list of numbers to 1.000.000.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
I want to add XML tags to these numbers like this:
<SerialNumber>
<SN>1</SN>
</SerialNumber>
<SerialNumber>
<SN>2</SN>
</SerialNumber>
<SerialNumber>
<SN>3</SN>
</SerialNumber>
<SerialNumber>
<SN>4</SN>
</SerialNumber>
So I need a regular expression to find a number ended with a \n\r, and use the number I've found with the regular expression in the text that I'm going to add.
Do you know to do that in Notepad++?
I have tried with \d{*} but it is not a valid regular expression.
Open the "Replace" menu (Search > Replace...).
And set like that :
- Find what: (\d+)[(\r)?\n]
- Replace with : <SerialNumber>\r\n <SN>$1</SN>\r\n</SerialNumber>
- Check : Search mode -> Regular expression
Then press Replace All. To even match the last number, add an empty line at the end of your file.
I hope it helps.
if we take your [1.... n] numbers in string called strData then :
var nums = strData.split("\r\n").map(function(item) {
return parseInt(item, 10);
});
var strXmlOut = nums.map(function(n) {
return "<SerialNumber><SN>" + n + "</SN></SerialNumber>";
}).join("\r\n");
and the one call version would be :
var xmlOut = strData.split("\r\n")
.map(function(item) {return parseInt(item, 10);})
.map(function(n) {return "<SerialNumber><SN>" + n + "</SN></SerialNumber>";})
.join("\r\n");

How do I represent "Any string except for .... "

I'm trying to solve a regex where the given alphabet is Σ={a,b}
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
which means the corresponding regex is: aa(a)*b(bbb)*
What would be a regex for L2, complement of L1?
Is it right to assume L2 = "Any string except for aa(a)b(bbb)"?
First, in my opinion, the regex for L1 = {a^2n b^3m+1 | n>=1, m>=0}
is NOT what you gave but is: aa(aa)*b(bbb)*. The reason is that a^2n, n > 1 means that there are at least 2 a and a pair number of a.
Now, the regular expression for "Any string except for aa(aa)*b(bbb)*" is:
^(?!^aa(aa)*b(bbb)*$).*$
more details here: Regex101
Explanations
aa(a)*b(bbb)* the regex you DON'T want to match
^ represents begining of line
(?!) negative lookahead: should NOT match what's in this group
$ represents end of line
EDIT
Yes, a complement for aa(aa)*b(bbb)* is "Any string but the ones that match aa(aa)*b(bbb)*".
Now you need to find a regex that represents that with the syntax that you can use. I gave you a regex in this answer that is correct and matches "Any string but the ones that match aa(aa)*b(bbb)*", but if you want a mathematical representation following the pattern you gave for L1, you'll need to find something simpler.
Without any negative lookahead, that would be:
L2 = ^((b+.*)|((a(aa)*)?b*)|a*((bbb)*|bb(bbb)*)|(.*a+))$
Test it here at Regex101
Good luck with the mathematical representation translation...
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
Regex for L1 is:
^aa(?:aa)*b(?:bbb)*$
Regex demo
Input
a
b
ab
aab
abb
aaab
aabb
abbb
aaaab
aaabb
aabbb
abbbb
aaaaab
aaaabb
aaabbb
aabbbb
abbbbb
aaaaaab
aaaaabb
aaaabbb
aaabbbb
aabbbbb
abbbbbb
aaaabbbb
Matches
MATCH 1
1. [7-10] `aab`
MATCH 2
1. [30-35] `aaaab`
MATCH 3
1. [75-81] `aabbbb`
MATCH 4
1. [89-96] `aaaaaab`
MATCH 5
1. [137-145] `aaaabbbb`
Regex for L2, complement of L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL)|^.*$
Explanation:
^aa(?:aa)*b(?:bbb)*$ matches L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL) anything matches L1 will skip & fail
|^.*$ matches others that not matches L1
Regex demo
Matches
MATCH 1
1. [0-1] `a`
MATCH 2
1. [2-3] `b`
MATCH 3
1. [4-6] `ab`
MATCH 4
1. [11-14] `abb`
MATCH 5
1. [15-19] `aaab`
MATCH 6
1. [20-24] `aabb`
MATCH 7
1. [25-29] `abbb`
MATCH 8
1. [36-41] `aaabb`
MATCH 9
1. [42-47] `aabbb`
MATCH 10
1. [48-53] `abbbb`
MATCH 11
1. [54-60] `aaaaab`
MATCH 12
1. [61-67] `aaaabb`
MATCH 13
1. [68-74] `aaabbb`
MATCH 14
1. [82-88] `abbbbb`
MATCH 15
1. [97-104] `aaaaabb`
MATCH 16
1. [105-112] `aaaabbb`
MATCH 17
1. [113-120] `aaabbbb`
MATCH 18
1. [121-128] `aabbbbb`
MATCH 19
1. [129-136] `abbbbbb`

decision on regular expression length

I want to accomplish the following requirements using Regex only (no C# code can be used )
• BTN length is 12 and BTN starts with 0[123456789] then it should remove one digit from left and one digit from right.
WORKING CORRECTLY
• BTN length is 12 and it’s not the case stated above then it should always return 10 right digits by removing 2 from the start. (e.g. 491234567891 should be changed to 1234567891)
NOT WORKING CORRECTLY
• BTN length is 11 and it should remove one digit from left. WORKING CORRECTLY
for length <=10 BTNs , nothing is required to be done , they would remain as it is or Regex may get failed too on them , thats acceptable .
USING SQL this can be achieved like this
case when len(BTN) = 12 and BTN like '0[123456789]%' then SUBSTRING(BTN,2,10) else RIGHT(BTN,10) end
but how to do this using Regex .
So far I have used and able to get some result correct using this regex
[0*|\d\d]*(.{10}) but by this regex I am not able to correctly remove 1st and last character of a BTN like this 015732888810 to 1573288881 as this regex returns me this 5732888810 which is wrong
code is
string s = "111112573288881,0573288881000,057328888105,005732888810,15732888815,344956345335,004171511326,01777203102,1772576210,015732888810,494956345335";
string[] arr = s.Split(',');
foreach (string ss in arr)
{
// Match mm = Regex.Match(ss, #"\b(?:00(\d{10})|0(\d{10})\d?|(\d{10}))\b");
// Match mm = Regex.Match(ss, "0*(.{10})");
// ([0*|\\d\\d]*(.{10}))|
Match mm = Regex.Match(ss, "[0*|\\d\\d]*(.{10})");
// Match mm = Regex.Match(ss, "(?(^\\d{12}$)(.^{12}$)|(.^{10}$))");
// Match mm = Regex.Match(ss, "(info)[0*|\\d\\d]*(.{10}) (?(1)[0*|\\d\\d]*(.{10})|[0*|\\d\\d]*(.{10}))");
string m = mm.Groups[1].Value;
Console.WriteLine("Original BTN :"+ ss + "\t\tModified::" + m);
}
This should work:
(0(\d{10})0|\d\d(\d{10}))
UPDATE:
(0(\d{10})0|\d{1,2}(\d{10}))
1st alternate will match 12-digits with 0 on left and 0 on right and give you only 10 in between.
2nd alternate will match 11 or 12 digits and give you the right 10.
EDIT:
The regex matches the spec, but your code doesn't read the results correctly. Try this:
Match mm = Regex.Match(ss, "(0(\\d{10})0|\\d{1,2}(\\d{10}))");
string m = mm.Groups[2].Value;
if (string.IsNullOrEmpty(m))
m = mm.Groups[3].Value;
Groups are as follows:
index 0: returns full string
index 1: returns everything inside the outer closure
index 2: returns only what matches in the closure inside the first alternate
index 3: returns only what matches in the closure inside the second alternate
NOTE: This does not deal with anything greater than 12 digits or less than 11. Those entries will either fail or return 10 digits from somewhere. If you want results for those use this:
"(0(\\d{10})0|\\d*(\\d{10}))"
You'll get rightmost 10 digits for more than 12 digits, 10 digits for 10 digits, nothing for less than 10 digits.
EDIT:
This one should cover your additional requirements from the comments:
"^(?:0|\\d*)(\\d{10})0?$"
The (?:) makes a grouping excluded from the Groups returned.
EDIT:
This one might work:
"^(?:0?|\\d*)(\\d{10})\\d?$"
(?(^\d{12}$)(?(^0[1-9])0?(?<digit>.{10})|\d*(?<digit>.{10}))|\d*(?<digit>.{10}))
which does the exact same thing as sql query + giving result in Group[1] all the time so i didn't had to change the code a bit :)

Regex to calculate straight poker hand - Using ASCII CODE

In another question I learned how to calculate straight poker hand using regex (here).
Now, by curiosity, the question is: can I use regex to calculate the same thing, using ASCII CODE?
Something like:
regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
Matches: 45678, 23456
Doesn't matches: 45679 or 23459 (not in sequence)
Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both A2345 and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.
There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generate the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
seq.replaceAll(
"(?=(.{5}).).",
"$1|"
)
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier. .{5} matches exactly 5 ..
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
match one char
at a time
|
(?=(.{5}).).
\_________/
must be able to see 6 chars ahead
(capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT === === OUTPUT ===
A B C D E F G ABCDE|BCDEFG
↑
We can assert (?=(.{5}).), matching ABCDEF
in the lookahead. ABCDE is captured.
We now match A, and replace with ABCDE|
A B C D E F G ABCDE|BCDEF|CDEFG
↑
We can assert (?=(.{5}).), matching BCDEFG
in the lookahead. BCDEF is captured.
We now match B, and replace with BCDEF|
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
:
:
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), and we are at
the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround
Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You can not do anything remotely close to this in most regex flavors. This is simply not the kinds of patterns that regex is designed for.
There is no mainstream regex pattern that will succintly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
"ABCDE", // true
"JKLMN", // true
"AAAAA", // false
"ABCDEFGH", // false
"ABCD", // false
"ACEGI", // false
"FGHIJ", // true
};
for (String test : tests) {
System.out.printf("[%s] : %s%n",
test,
test.matches(p5)
);
}
This uses meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then meta-regexed to be matched repeatedly 5 times.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
SOLVED!
See in http://jsfiddle.net/g48K9/3
I solved using closure, in js.
String.prototype.isSequence = function () {
If (this == "A2345") return true; // an exception
return this.replace(/(\w)(\w)(\w)(\w)(\w)/, function (a, g1, g2, g3, g4, g5) {
return code(g1) == code(g2) -1 &&
code(g2) == code(g3) -1 &&
code(g3) == code(g4) -1 &&
code(g4) == code(g5) -1;
})
};
function code(card){
switch(card){
case "T": return 58;
case "J": return 59;
case "Q": return 60;
case "K": return 61;
case "A": return 62;
default: return card.charCodeAt();
}
}
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
alert("cards " + cards + ": " + cards.isSequence())
}
Just to clarify, ascii codes:
ASCII CODES:
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62