how to put captured group inside a character class to negate it? - regex

I'm trying to parse my poker hand history to determine the number of hands I played 7-2 off suit (that is, the 7 is of one suit, and the 2 is of another).
I can get the hands where I played 77 or 22
$ grep -E "Dealt to .* \[([7|2])[s|c|h|d]\s\1" ~/poker/handhistory/*/* | wc -l
15
And the hands where I played 72 of the same suit.
$ grep -E "Dealt to .* \[([7|2])([s|c|h|d])\s[7|2]\2" ~/poker/handhistory/GMulligan/* | wc -l
9
I've captured the rank of the first card. What I'd like to do is have a character class that contains [7] if the first capture group was 2 and [2] if the first capture group was 7.
can anyone help here?
update:
sorry, some sample data would obviously help here
every hand that player1 is involved in has a line like this:
Dealt to player1 [4c Ac]
i'm looking specifically for all the following within the "[" and "]"
7h 2c
7h 2d
7h 2s
7c 2h
7c 2d
7c 2s
7d 2h
7d 2c
7d 2s
7s 2h
7s 2c
7s 2d

You may be able to use negative lookaheads to achieve what you're trying to do.
https://regex101.com/r/yK4oC7/2 (* denotes matches)
Dealt to player1 []
Dealt to player1 [7c 2c]
Dealt to player1 [7c 2h] *
Dealt to player1 [7d 7c]
Here's a breakdown of the regex \[([72])([sdch]) (?!\1)([72])(?!\2)([sdch]). (In an interactive bash shell, ! triggers history expansion inside double quotes, so put the pattern in single quotes. Also note that lookaheads are a PCRE feature: grep -E does not support them, so with GNU grep use -P instead.)
\[ - match literal [
([72]) - match 7 or 2 and capture it as \1
([sdch]) - match s, d, c or h and capture it as \2
 (?!\1)([72]) - match a space followed by a 7 or 2 that is not the same rank as \1
(?!\2)([sdch]) - match s, d, c or h where it's not the same suit as whichever of the four was captured as \2
Edit: I don't use bash, so I'm unfamiliar with the nuances, but the two answers to How to use regex negative lookahead should be useful in devising the proper syntax.
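For a quick check outside the shell, the same lookahead pattern can be tried in Python's re module, which supports lookaheads and backreferences. This is only a sketch: the closing \] and the sample lines (including the hypothetical [2s 7h] hand) are assumptions added for illustration, not part of the original answer.

```python
import re

# Rank 7 or 2, any suit, then the *other* rank with a *different* suit.
# The trailing \] is an added assumption so that only two-card hands match.
OFFSUIT_72 = re.compile(r"\[([72])([sdch]) (?!\1)([72])(?!\2)([sdch])\]")

# Sample lines; the last one is a hypothetical hand with the 2 dealt first.
lines = [
    "Dealt to player1 [7c 2c]",  # same suit -> rejected by (?!\2)
    "Dealt to player1 [7c 2h]",  # off suit -> matches
    "Dealt to player1 [7d 7c]",  # pair -> rejected by (?!\1)
    "Dealt to player1 [2s 7h]",  # off suit, reversed order -> matches
]

matches = [line for line in lines if OFFSUIT_72.search(line)]
print(len(matches))
```

Counting the matching lines this way plays the role of the `| wc -l` in the original commands.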

Related

Extracting multi values with regex ( Only values, Not Fieldname )

Can someone help me with this regex?
I would like to extract either 1. or 2.
1.
(2624594000) 303 days, 18:32:20.00 <-- Timeticks
.1.3.6.1.4.1.14179.2.6.3.39. <-- OID
Hex-STRING: 54 4A 00 C8 73 70 <-- Hex-STRING (need "Hex-STRING" itself too)
0 <--INTEGER
"NJTHAP027" <- STRING
OR
2.
Timeticks: (2624594000) 303 days, 18:32:20.00
OID: .1.3.6.1.4.1.14179.2.6.3.39
Hex-STRING: 54 4A 00 C8 73 70
INTEGER: 0
STRING: "NJTHAP027"
The field names and values will be different each time. (The data is variable.)
I don't need the field names; I only want to get the values in order from the top (multiple values).
(?s)[^=]+\s=\s(?<value_v2c>([^=]+)-)
https://regex101.com/r/lsKeEM/2
-> I can't extract the last STRING: "NJTHAP027" at all!
The named group value_v2c is already a group, so you can omit the inner capture group.
Currently the pattern always requires a - character after the value, which is why the last value is never matched. Instead, you can either match the - or assert the end of the string.
Because you are using the negated character class [^=]+ and \s, both of which already match newlines, you can omit the inline modifier (?s).
To match the 2. variation, you can update the pattern to:
[^=]+\s=\s(?<value_v2c>[^=]+)(?:-|$)
Regex demo
To get the 1. version, you can match everything before the colon as long as it is not Hex-STRING.
Then, inside the group, optionally match Hex-STRING so that it is kept as part of the value.
[^=]+\s=\s(?:(?!Hex-STRING:)[^:])*:?\s*(?<value_v2c>(?:Hex-STRING: )?[^=]+?)(?: -|$)
Regex demo
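Both patterns can be tried in Python as a rough sketch. The input layout below is an assumption (the real data was only in the linked demo): fields are written as name = TYPE: value and separated by " - ", and the field names a to d are placeholders. Python's re module spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Hypothetical single-line input; the field names a..d are placeholders.
text = ('a = Timeticks: (2624594000) 303 days, 18:32:20.00 - '
        'b = Hex-STRING: 54 4A 00 C8 73 70 - '
        'c = INTEGER: 0 - '
        'd = STRING: "NJTHAP027"')

# Variation 2: keep "TYPE: value"; the value ends at a "-" or at end of string.
with_type = re.compile(r"[^=]+\s=\s(?P<value_v2c>[^=]+)(?:-|$)")

# Variation 1: drop the type prefix unless it is "Hex-STRING:".
values_only = re.compile(
    r"[^=]+\s=\s(?:(?!Hex-STRING:)[^:])*:?\s*"
    r"(?P<value_v2c>(?:Hex-STRING: )?[^=]+?)(?: -|$)"
)

# The greedy [^=]+ leaves the space before "-" in the value, so strip it.
typed = [v.strip() for v in with_type.findall(text)]
bare = values_only.findall(text)

for value in bare:
    print(value)
```

The alternation with $ is what lets the final STRING value match even though no - follows it.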

Regex matching issue to Test-String

i have a problem and dont get it.
My Regex:
My Test-String:
I have two issues and one general question :)
As you can see in my Test-String, the very last (German) phone number (the big yellow one in the Test-String attachment) does not match my regex pattern correctly. I don't get it, what is the problem here? The "0049" fits Group 5, but should fit Group 2. Why is that?
My second problem is: how can I get rid of the spaces before and after every match? (The 7 small yellow circles in the Test-String attachment)
For copy/paste purposes, here is the Regex and Test-String again:
Regex:
((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ])?(\d+)?([ ])?(\d+)?)
Test-String:
Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 oder +49 89 123456 789. Also +49 89 123 456 789. Klammern 089/1234567 und 0151 19406041. Test +49 151 123 456 789 respectivly 0049 151 123 456 789
Last but not at least, my general question:
Is it a good approach to Group each logical part as i did in my example?
A last Information: I validate my Regex with https://regex101.com/ and use it in Python with the re Module.
What makes the result unpredictable is the large number of optional groups (..)?.
As a first step I recommend replacing ([ ])?(\d+)? with the coupled expression ([ ]?\d+)?, which avoids spaces at the end of the match (your point #2).
As a second step I recommend coupling the first optional space with the expression for the international prefix: ((\+|00)\d{2}([ ])?)?. This solves both the space at the beginning and the recognition of the whole number, because there are fewer possible ways to match: a leading space can no longer start a match, so 0049 can only be consumed as the prefix.
The new expression now looks like this:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+)?([ ]?\d+)?)
I would then simplify the last part, if you don't need the individual group values:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})
For better performance I suggest you remove the parentheses/groups where possible, or mark them as non-capturing, if you don't need the specific group values.
In some programming languages you will not need the outermost parentheses, as the whole match is always available as group 0.
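Since the question mentions Python's re module, here is a small sketch comparing the original pattern with the simplified one on the last sentence of the test string. It uses finditer and group(0) so that only the whole match is inspected; the variable names are illustrative.

```python
import re

# Pattern from the question.
original = re.compile(
    r"((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ])?(\d+)?([ ])?(\d+)?)"
)
# Final pattern from the answer.
simplified = re.compile(
    r"(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})"
)

# Last sentence of the test string from the question.
text = "Test +49 151 123 456 789 respectivly 0049 151 123 456 789"

before = [m.group(0) for m in original.finditer(text)]
after = [m.group(0) for m in simplified.finditer(text)]

print(before)  # second match starts with a space and loses the final 789
print(after)   # both numbers are matched completely, without stray spaces
```

With the original pattern the match for the last number starts at the space in front of 0049, so the prefix group is skipped and 0049 lands in the area-code group.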

How can I find values which do not contain characters and spaces?

Below is sample data from the list I am working with:
74
7491
75
75010
75013
78
8081
84
8400 Winterthu
852
9000 Aalborg
974
A
A CORUÑA
aa
Aalborg
Aargau
Aarhus
aas
AAT
AB
ABERC
Abu Dhabi
Abuja
AC
ACT
AD
Using [^\p{L}-] I can get a list but it also includes the following values which I do not want in the list
Abu Dhabi
Puerto Rico
Hong Kong
How can I do this?
You want to find multiple items, so you must use the g option.
You will be checking each line separately. The usual way to construct a pattern for this case is ^...$, but both ^ and $ should match the beginning and end of each line, not of the whole string, so you must also use the m option.
The last point is what the accepted content of a candidate line should be, i.e. what goes between ^ and $: any non-empty sequence of letters in any language or a literal minus, i.e. [\p{L}-]+.
So, to sum up, the whole regex should be:
/^[\p{L}-]+$/gm
This way names containing a space (e.g. Puerto Rico) will not be matched (as you specified).
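A rough Python equivalent, as a sketch: the standard re module has no \p{L}, so [^\W\d_] (a word character that is neither a digit nor an underscore) is used here as an approximation of "any letter". That substitution is an assumption on my part; the third-party regex module accepts \p{L} directly. findall plays the role of the g option and re.MULTILINE the role of m.

```python
import re

# [^\W\d_] approximates \p{L}; the alternation adds the literal minus.
LETTERS_ONLY = re.compile(r"^(?:[^\W\d_]|-)+$", re.MULTILINE)

# A few lines taken from the sample data in the question.
data = "74\n8400 Winterthu\nA\nA CORUÑA\naa\nAalborg\nAbu Dhabi"

names = LETTERS_ONLY.findall(data)
print(names)
```

Lines with digits or spaces are skipped because neither belongs to the allowed set between ^ and $.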
Say your file is test.dat
A simple one-liner with grep will give what you want:
grep -o -P "[0-9]+$" test.dat
Output:
74
7491
75
75010
75013
78
8081
84
852
974
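One caveat with the grep command above, sketched in Python: [0-9]+$ is only anchored at the end, so it also prints the trailing digits of a line that begins with other text. The line "Aalborg 9000" below is a hypothetical example, not part of the sample data, and the helper name trailing_digits is illustrative.

```python
import re

# Sample lines from the question plus one hypothetical line ("Aalborg 9000").
lines = ["74", "8400 Winterthu", "9000 Aalborg", "A", "Aalborg 9000"]

def trailing_digits(line):
    """Mimic grep -o -P "[0-9]+$": return the digits at the end of the line."""
    m = re.search(r"[0-9]+$", line)
    return m.group(0) if m else None

# End-anchored only: also picks up the 9000 at the end of "Aalborg 9000".
loose = [d for d in map(trailing_digits, lines) if d is not None]

# Anchored at both ends: keeps only lines made up entirely of digits.
strict = [line for line in lines if re.fullmatch(r"[0-9]+", line)]

print(loose)
print(strict)
```

For the sample data in the question both variants give the same list; they differ only when digits follow other text on the same line.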

Match Regular Expressions patterns if exist, else

Here is what I am trying to achieve. Given a certain set of data I am trying to get the entire row that contains the matching regular expressions that I have.
Essentially, given a data set such as this
AFAM 002A AFAM & DEV AM HIS/GV 03 46493 3 LEC D2 70 P 20/15 W 1800-2045 08/24/16-12/12/16 WSQ 207 K WHITE
AFAM 102 AFRO-AMER MUSIC 01 47200 3 LEC P 5/30 W 1800-2045 08/24/16-12/12/16 MUS 250 V GROCE-ROBERTS
AFAM 125 THE BLACK FAMILY 01 47198 3 LEC P 16/40 M 1800-2045 08/24/16-12/12/16 CCB 101 S MILLNER
AFAM 152 THE BLACK WOMAN 01 47199 3 LEC P 8/40 T 1800-2045 08/24/16-12/12/16 CL 111 R WILSON
AFAM 159 ECON ISSUES BLKCM 01 47197 3 LEC P 11/40 MW 1330-1445 08/24/16-12/12/16 CL 234 R WILSON
AFAM 180 INDIVIDUAL STUDIES 01 46982 3 SUP P 0/10 TBA TBA 08/24/16-12/12/16
The regex that I have created basically groups the following into..
Course ID eg. AFAM 002A
Course Name eg. AFRO-AMER MUSIC
Start date
end date
Professor Name (This is the value that I want to be optional)
The problem I am having is with the optional value: what I want is to check whether it exists and, if not, leave it empty. If someone could show me the correct way to do this I would greatly appreciate it.
Essentially, this part of my regular expression, ([A-Z][\s][A-Z]+[-]*[A-Z]+)?, needs to be included if it exists. I understand that that's how the ? operator is supposed to work, but I can't seem to find the right keyword for this question, so here I am.
([A-Z]+[\s][0-9]+[A-Z]*)(.+)[\s][0-9]+[\s][0-9]+.+(\d\d\/\d\d\/\d\d)-(\d\d\/\d\d\/\d\d)[\s]([A-Z][\s][A-Z]+[-]*[A-Z]+)?
The Expected results for this dataset for the last two rows should be
{ [ (AFAM 159), (ECON ISSUES BLKCM), (08/24/16), (12/12/16), (R WILSON)],
[(AFAM 180), (INDIVIDUAL STUDIES), (08/24/16), (12/12/16), ()]
}
Your regex does not match CL 234 in the last but one line. You need to consume it. However, just adding .*? won't work, you need to make your optional pattern obligatory (remove ?) and wrap .*?([A-Z]\s[A-Z]+-*[A-Z]+) with an optional non-capturing group (?:....).
([A-Z]+\s\d+[A-Z]*)(.+?)\s\d+\s\d+.+?(\d\d\/\d\d\/\d\d)-(\d\d\/\d\d\/\d\d)\s(?:.*?([A-Z]\s[A-Z]+-*[A-Z]+))?
See the regex demo.
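A sketch of the same idea in Python, applied row by row. One adjustment here is my own assumption rather than part of the answer: the \s after the end date is moved inside the optional group, so that a row ending directly after the date (like the TBA row as pasted above, with no trailing whitespace) still matches. The names COURSE and parse are illustrative.

```python
import re

COURSE = re.compile(
    r"([A-Z]+\s\d+[A-Z]*)(.+?)\s\d+\s\d+.+?"   # course id, course name
    r"(\d\d/\d\d/\d\d)-(\d\d/\d\d/\d\d)"       # start date, end date
    r"(?:\s.*?([A-Z]\s[A-Z]+-*[A-Z]+))?"       # optional professor name
)

def parse(row):
    """Return (id, name, start, end, professor); professor is '' if absent."""
    m = COURSE.search(row)
    if m is None:
        return None
    # An unmatched optional group is None; normalise to '' and trim spaces.
    return tuple((g or "").strip() for g in m.groups())

rows = [
    "AFAM 159 ECON ISSUES BLKCM 01 47197 3 LEC P 11/40 MW 1330-1445 08/24/16-12/12/16 CL 234 R WILSON",
    "AFAM 180 INDIVIDUAL STUDIES 01 46982 3 SUP P 0/10 TBA TBA 08/24/16-12/12/16",
]

parsed = [parse(row) for row in rows]
for record in parsed:
    print(record)
```

The lazy .*? inside the optional group is what consumes the room (CL 234) before the professor name.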

Find all substrings with at least one group

I try to find in a string all substring that meet the condition.
Let's say we've got string:
s = 'some text 1a 2a 3 xx sometext 1b yyy some text 2b.'
I need to apply search pattern {(one (group of words), two (another group of words), three (another group of words)), word}. First three positions are optional, but there should be at least one of them. If so, I need a word after them.
Output should be:
2a 1a 3 xx
1b yyy
2b
I wrote this expression:
find_it = re.compile(r"((?P<one>\b1a\s|\b1b\s)|" +
r"(?P<two>\b2a\s|\b2b\s)|" +
r"(?P<three>\b3\s|\b3b\s))+" +
r"(?P<word>\w+)?")
Every group contains a set of different words (not just 1a, 1b), and I can't merge them into one group. A group should be None if it is empty. Obviously the result is wrong.
find_it.findall(s)
> 2a 1a 2a 3 xx
> 1b 1b yyy
I am grateful for your help!
You can use following regex :
>>> reg=re.compile(r'((?:(?:[12][ab]|3b?)\s?)+(?:\w+|\.))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '2b.']
Here I just made your regex more concise by using a character class and the ? quantifier. The core of the regex consists of 2 parts:
[12][ab]|3b?
[12][ab] will match 1a,1b,2a,2b and 3b? will match 3b and 3.
And if you don't want the dot at the end of 2b, you can use the following regex, which uses a positive lookahead and is more general than the preceding one (making \s optional in the first group is not a good idea):
>>> reg=re.compile(r'((?:(?:[12][ab]|3b?)\s)+\w+|(?:(?:[12][ab]|3b?))+(?=\.|$))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '2b']
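Also, if your numbers and example substrings are just instances, you can use [0-9][a-z]? as a more general regex (the output below comes from a test string that also contained the segment 5h 9 7y examole):
>>> reg=re.compile(r'((?:[0-9][a-z]?\s)+\w+|(?:[0-9][a-z]?)+(?=\.|$))')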
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '5h 9 7y examole', '2b']