regex for position matching with OR condition - regex

I'm new to regex and looking for help creating a regexp to find the following.
The data items consist of six-character strings, as shown in the example below:
1) "100100"
2) "110011"
3) "010000"
4) "110011"
5) "111111"
6) "000111"
Need to use regexp to find data with say
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1 in 2nd position: Items 2, 3, 4 and 5 should be matched
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched

Given your samples, these will work:
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1.....|...1..
1 in 2nd position: Items 2, 3, 4 and 5 should be matched
.1....
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched
....11
Or if you want to match any of these rules, combine them with the | (or) operator.
Example:
http://regexpal.com/?flags=g&regex=(1.....%7C...1...%7C.1....%7C....11)&input=100100%0A%0A110011%0A%0A010000%0A%0A110011%0A%0A111111%0A%0A000111
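The positional patterns can be checked against the sample data with a short script. This is a minimal sketch in Python (the question does not name a language, so Python and the helper name `matching_items` are assumptions); it uses `re.fullmatch` so each pattern has to cover the whole six-character string. Note that item 3, "010000", also has a 1 in the 2nd position.

```python
import re

# The six sample items from the question, numbered from 1.
ITEMS = ["100100", "110011", "010000", "110011", "111111", "000111"]


def matching_items(pattern):
    """Return the 1-based numbers of the items that the pattern matches in full."""
    rx = re.compile(pattern)
    return [n for n, item in enumerate(ITEMS, start=1) if rx.fullmatch(item)]


print(matching_items(r"1.....|...1.."))  # 1 in the 1st OR the 4th position
print(matching_items(r".1...."))         # 1 in the 2nd position
print(matching_items(r"....11"))         # 1 in the 5th AND the 6th position
```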

If the strings always contain only 1s and 0s, you could treat them as binary numbers and use bitwise operators to find the matches.
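A rough sketch of that idea in Python, assuming every item is a six-character string of 0s and 1s. The mask constants and the helper names `any_set` and `all_set` are made up for illustration; each mask has a 1 in the positions the rule cares about.

```python
# Masks written as six-bit binary literals, leftmost bit = 1st position.
FIRST_OR_FOURTH = 0b100100
SECOND = 0b010000
FIFTH_AND_SIXTH = 0b000011


def any_set(item, mask):
    """True if at least one of the masked positions holds a 1 (an OR rule)."""
    return (int(item, 2) & mask) != 0


def all_set(item, mask):
    """True if every masked position holds a 1 (an AND rule)."""
    return (int(item, 2) & mask) == mask


print(any_set("000111", FIRST_OR_FOURTH))  # 4th position is 1
print(all_set("000111", FIFTH_AND_SIXTH))  # 5th and 6th positions are both 1
```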

Try this regex
(1[01]{5})|([01]{3}1[01]{2})|([01]1[01]{4})|([01]{4}1{2})
Find the explanation and demo here http://www.regex101.com/r/vD9jE7

Here's an example. Change dots with zeros if necessary. /^(11..|.1.1)11$/
^ # beginning of string
( # either
11.. # 11 and any 2 char
| # or
.1.1 # any char, 1, any char, 1
)
11
$ # end of string

Related

Removing special characters while retaining alpha numeric words

I'm in the middle of cleaning a data set that has this:
[IN]
my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
my_Series.str.replace("[^a-zA-Z]+", " ", regex=True)
[OUT]
0
1 ASD
2 AUG M G
3 Air G G
4 Karsh
[IDEAL OUT]
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
My goal is to remove special characters and numbers, but if there's a word that contains both letters and digits, it should stay. Can anyone help?
Try with apply to achieve your ideal output.
>>> my_Series = pd.Series(["-","ASD", "711-AUG-M4G","Air G2G", "Karsh"])
Output:
>>> my_Series.apply(lambda x: " ".join(['' if word.isdigit() else word for word in x.replace('-', ' ').split()]))
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
I replace - with a space and split the string on whitespace, then check whether each word consists only of digits.
If it does, it is replaced with an empty string; otherwise the word is kept.
Finally, the list is joined back into a string.
Edit 1:
regex solution :-
>>> my_Series.str.replace(r"((\d+)(?=.*\d))|([^a-zA-Z0-9 ])", " ", regex=True)
0
1 ASD
2 AUG M4G
3 Air G2G
4 Karsh
dtype: object
Explanation:
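Using a lookahead.
((\d+)(?=.*\d))|([^a-zA-Z0-9 ])
(a run of digits that is followed, somewhere later in the string, by another digit) OR (any character that is not a letter, a digit or a space). Everything matched is replaced with a space.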
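The lookahead version depends on where the digits sit in the string: a digit-only word at the end, for example, has no later digit to satisfy the lookahead. A token-based variant avoids that. The following is only a sketch in plain Python (no pandas, and the function name `keep_words` is made up here): split on anything that is not a letter or digit, then drop the tokens made up purely of digits.

```python
import re


def keep_words(text):
    """Drop special characters and digit-only tokens, keep mixed tokens like M4G."""
    # Any run of non-alphanumeric characters acts as a separator.
    tokens = re.split(r"[^a-zA-Z0-9]+", text)
    # Empty tokens come from leading/trailing separators; digit-only ones are numbers.
    return " ".join(t for t in tokens if t and not t.isdigit())


samples = ["-", "ASD", "711-AUG-M4G", "Air G2G", "Karsh"]
print([keep_words(s) for s in samples])
```

With pandas this could be applied as `my_Series.apply(keep_words)`.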

Filter a string using regular expression

I tried the following code. However, the result is not what I want.
$strLine = "100.11 Q9"
$sortString = StringRegExp ($strLine,'([0-9\.]{1,7})', $STR_REGEXPARRAYMATCH)
MsgBox(0, "", $sortString[0],2)
The output shows 100.11, but I want 100.11 9. How could I display it this way using a regular expression?
$sPattern = "([0-9\.]+)\sQ(\d+)"
$strLine = "100.11 Q9"
$sortString = StringRegExpReplace($strLine, $sPattern, '\1 \2')
MsgBox(0, "$sortString", $sortString, 2)
$strLine = "100.11 Q9"
$sortString = StringRegExp($strLine, $sPattern, 3); array of global matches.
For $i1 = 0 To UBound($sortString) -1
MsgBox(0, "$sortString[" & $i1 & "]", $sortString[$i1], 2)
Next
The pattern captures two groups: 100.11 and 9.
The first group matches digits and dots until it reaches \s, which matches the space.
The pattern then matches the literal Q, and the second group matches the digits that follow.
StringRegExpReplace replaces the whole match with the first and second groups
separated by a space.
StringRegExp returns the two groups as two array elements.
Choose whichever of the two approaches you prefer.
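For comparison, here is the same pattern exercised in Python. This is not AutoIt, only an illustration of the two approaches (replace with backreferences versus collecting the captured groups), using the standard `re` module.

```python
import re

PATTERN = r"([0-9.]+)\sQ(\d+)"
line = "100.11 Q9"

# Approach 1: rewrite the match as "group 1, space, group 2".
replaced = re.sub(PATTERN, r"\1 \2", line)

# Approach 2: collect the captured groups of every match.
groups = re.findall(PATTERN, line)

print(replaced)
print(groups)
```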

How do I represent "Any string except for .... "

I'm trying to solve a regex where the given alphabet is Σ={a,b}
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
which means the corresponding regex is: aa(a)*b(bbb)*
What would be a regex for L2, complement of L1?
Is it right to assume L2 = "Any string except for aa(a)*b(bbb)*"?
First, in my opinion, the regex for L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
is NOT what you gave but aa(aa)*b(bbb)*. The reason is that a^2n with n >= 1 means there are at least two a's and the number of a's is even.
Now, the regular expression for "Any string except for aa(aa)*b(bbb)*" is:
^(?!^aa(aa)*b(bbb)*$).*$
more details here: Regex101
Explanations
aa(aa)*b(bbb)* the regex you DON'T want to match
^ represents the beginning of the line
(?!...) negative lookahead: the input should NOT match what's in this group
$ represents the end of the line
EDIT
Yes, a complement for aa(aa)*b(bbb)* is "Any string but the ones that match aa(aa)*b(bbb)*".
Now you need to find a regex that represents that with the syntax that you can use. I gave you a regex in this answer that is correct and matches "Any string but the ones that match aa(aa)*b(bbb)*", but if you want a mathematical representation following the pattern you gave for L1, you'll need to find something simpler.
Without any negative lookahead, that would be:
L2 = ^((b+.*)|((a(aa)*)?b*)|a*((bbb)*|bb(bbb)*)|(.*a+)|(.*ba.*))$
Test it here at Regex101
Good luck with the mathematical representation translation...
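A lookahead-free complement is easy to get subtly wrong, so it is worth checking by brute force. The sketch below (Python, with a made-up helper `mismatches`) enumerates every string over {a, b} up to a given length and confirms that each one is matched by exactly one of the two patterns. The complement pattern here is a compact form written for this check: any string containing "ba" (which covers inputs such as aabab), or zero or an odd number of a's followed by b's, or any a's followed by a number of b's that is not of the form 3m+1.

```python
import re
from itertools import product

# L1 = a^(2n) b^(3m+1), n >= 1, m >= 0
L1 = re.compile(r"aa(?:aa)*b(?:bbb)*")
# Candidate complement, without lookarounds (an assumption to be verified below).
L2 = re.compile(r".*ba.*|(?:a(?:aa)*)?b*|a*(?:bbb)*(?:bb)?")


def mismatches(max_len):
    """Return strings up to max_len that are in both languages or in neither."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            in_l1 = L1.fullmatch(s) is not None
            in_l2 = L2.fullmatch(s) is not None
            if in_l1 == in_l2:  # a true complement differs on every string
                bad.append(s)
    return bad


print(mismatches(10))  # an empty list means the two patterns partition all strings
```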
The first expression is:
L1 = {a^2n b^(3m+1) | n >= 1, m >= 0}
Regex for L1 is:
^aa(?:aa)*b(?:bbb)*$
Regex demo
Input
a
b
ab
aab
abb
aaab
aabb
abbb
aaaab
aaabb
aabbb
abbbb
aaaaab
aaaabb
aaabbb
aabbbb
abbbbb
aaaaaab
aaaaabb
aaaabbb
aaabbbb
aabbbbb
abbbbbb
aaaabbbb
Matches
MATCH 1
1. [7-10] `aab`
MATCH 2
1. [30-35] `aaaab`
MATCH 3
1. [75-81] `aabbbb`
MATCH 4
1. [89-96] `aaaaaab`
MATCH 5
1. [137-145] `aaaabbbb`
Regex for L2, complement of L1
^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL)|^.*$
Explanation:
^aa(?:aa)*b(?:bbb)*$ matches L1
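^aa(?:aa)*b(?:bbb)*$(*SKIP)(*FAIL) anything that matches L1 is skipped and made to fail ((*SKIP) and (*FAIL) are PCRE backtracking control verbs, so this needs a PCRE-compatible engine)
|^.*$ matches every other line, i.e. those that do not match L1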
Regex demo
Matches
MATCH 1
1. [0-1] `a`
MATCH 2
1. [2-3] `b`
MATCH 3
1. [4-6] `ab`
MATCH 4
1. [11-14] `abb`
MATCH 5
1. [15-19] `aaab`
MATCH 6
1. [20-24] `aabb`
MATCH 7
1. [25-29] `abbb`
MATCH 8
1. [36-41] `aaabb`
MATCH 9
1. [42-47] `aabbb`
MATCH 10
1. [48-53] `abbbb`
MATCH 11
1. [54-60] `aaaaab`
MATCH 12
1. [61-67] `aaaabb`
MATCH 13
1. [68-74] `aaabbb`
MATCH 14
1. [82-88] `abbbbb`
MATCH 15
1. [97-104] `aaaaabb`
MATCH 16
1. [105-112] `aaaabbb`
MATCH 17
1. [113-120] `aaabbbb`
MATCH 18
1. [121-128] `aabbbbb`
MATCH 19
1. [129-136] `abbbbbb`

Scala Regular Expression Oddity

I have this regular expression:
^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$
To give it a bit of organization, there's really 3 parts:
// Part 1
^(10)(1|0)(.)(.)(.)(.{18})
// Part 2
// Optional elements that begin with two characters and are terminated by a |
// Each may appear at most once
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
// Part 3
AY([0-9]*)AZ(.*)$
Part 2 is the part that I'm having trouble with but I believe the current regular expression says any of these given elements will appear one or more times. I could have done something like: (AB.*?|) but I don't need the pipe in my group and wasn't quite sure how to express it.
This is my sample input - it's SIP2 if you've seen it before (please disregard checksum, I know it's not valid):
101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
This is my snippet of Scala code:
val regex = """^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$""".r
val msg = "101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71"
regex.findFirstMatchIn(msg) match {
case None => println("No match")
case Some(x) =>
for (i <- 0 to x.groupCount) {
println(i + " " + x.group(i))
}
}
This is my output:
0 101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
1 10
2 1
3 Y
4 N
5 Y
6 201406120000091911
7 DAl|
8 ABb
9 AQc
10 AJd
11 AFf
12 CSg
13 CRh
14 CTi
15 CKe
16 CVj
17 CYk
18 DAl
19 AOa
20 1
21 AA71
Note the entry that starts with 7. Can anyone explain why that's there?
I'm using Scala 2.10.4, but I believe regular expressions in Scala simply use Java's regular expression engine. I'm certainly open to other suggestions for parsing strings.
EDIT: Based on wingedsubmariner's response, I was able to fix my regular expression:
^(10)(1|0)(.)(.)(.)(.{18})(?:AB([^|]*)\||AQ([^|]*)\||AJ([^|]*)\||AF([^|]*)\||CS([^|]*)\||CR([^|]*)\||CT([^|]*)\||CK([^|]*)\||CV([^|]*)\||CY([^|]*)\||DA([^|]*)\||AO([^|]*)\|)+AY([0-9]*)AZ(.*)$
Basically adding ?: to indicate I was not interested in the group!
You get a matched group for each set of parentheses, in the order in which the opening parentheses appear in the regex. Matched group 7 corresponds to the opening parenthesis that begins your "Part 2":
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
^
|
This parenthesis
Each matched group takes on the value of the last piece of text it matched, which in this case is DAl| because that was the last piece of text to match the "Part 2" expression.
Here is a simpler example that demonstrates the behavior:
val regex = """((A)\||(B)\|)+""".r
val msg = "A|B|A|B|"
regex.findFirstMatchIn(msg) match {
case None => println("No match")
case Some(x) =>
for (i <- 0 to x.groupCount) {
println(i + " " + x.group(i))
}
}
Which produces:
0 A|B|A|B|
1 B|
2 A
3 B
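The same behaviour can be reproduced outside Scala. As a sketch, here is the simplified example in Python, whose `re` module also keeps only the last text matched by a repeated group. The second pattern swaps the outer parentheses for a non-capturing group `(?:...)`, which removes the extra group.

```python
import re

msg = "A|B|A|B|"

# Outer capturing group: group 1 holds the last repetition, "B|".
capturing = re.fullmatch(r"((A)\||(B)\|)+", msg).groups()

# Outer non-capturing group: only the inner groups are reported.
non_capturing = re.fullmatch(r"(?:(A)\||(B)\|)+", msg).groups()

print(capturing)
print(non_capturing)
```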

Regular expression to match three different strings

I need to write a regular expression that matches with 3 slightly different strings and extracts values out of them
Strings are as follows (excluding quotes)
1. "Beds: 3, Baths: 3"
2. "Beds: 3 - Sleeps 10, Baths: 3"
3. "Beds: 3 - 10, Baths: 3"
Values to extract for each string:
1. 3, 0 , 3
2. 3, 10, 3
3. 3, 10, 3
I have written something like
$pattern = '/Beds: ([0-9]+).*-[ Sleeps]* ([0-9]+).* Baths: ([\.0-9]+)/';
It matches with string 2 and 3, but not with string 1.
Just extract the digits from non-digits.
\D*(\d+)\D*(\d+)?\D*(\d+)
Beds: ([0-9]+)(?:(?:.*-[ Sleeps]* ([0-9]+))|).* Baths: ([\.0-9]+)
#!/usr/bin/perl
use strict;
use warnings;
open (my $rentals, '<', 'tmp.dat') or die "Cannot open tmp.dat: $!";
while (<$rentals>){
if (my ($beds, $sleeps, $baths) = $_=~m/^Beds:\s+(\d+)(?:\s+-)?\s*(?:Sleeps\s+)?(\d+)?,\s+Baths:\s+(\d+)$/){
# The sleeps group is optional, so it is undefined when the record omits it.
$sleeps = defined $sleeps ? $sleeps : "No information provided";
print "$.:\n\tBeds:\t$beds\n\tSleeps:\t$sleeps\n\tBaths:\t$baths\n\n";
}
else{
print "record $. did not match the regex:\n\t|$_|";
}
}
check this:
'/Beds:\s(\d)[\s,][\s-].*?(\d, |)Baths:\s(\d)/'
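To make the optional middle value explicit, here is a sketch in Python rather than PHP or Perl. The function name `parse_listing` is hypothetical, and it assumes whole-number baths, unlike the `[\.0-9]+` in the question. The "sleeps" part sits in an optional non-capturing group, so a missing value comes back as None and is turned into 0. One caveat that motivates this form: the digits-only pattern `\D*(\d+)\D*(\d+)?\D*(\d+)` can split a multi-digit last number in two when the middle value is absent.

```python
import re

LISTING = re.compile(
    r"Beds:\s*(\d+)"                     # beds
    r"(?:\s*-\s*(?:Sleeps\s*)?(\d+))?"   # optional "- 10" or "- Sleeps 10"
    r",\s*Baths:\s*(\d+)"                # baths
)


def parse_listing(line):
    """Return (beds, sleeps, baths) as integers, or None if the line does not match."""
    m = LISTING.fullmatch(line)
    if m is None:
        return None
    beds, sleeps, baths = m.groups()
    # The optional group yields None when absent; report that as 0.
    return (int(beds), int(sleeps) if sleeps else 0, int(baths))


for line in ["Beds: 3, Baths: 3", "Beds: 3 - Sleeps 10, Baths: 3", "Beds: 3 - 10, Baths: 3"]:
    print(parse_listing(line))
```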