[Python3]RegEx to match multiple strings - regex

I am trying to match multiple strings with a regex that also includes an optional capture group.
My RegEx:
(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])
Strings:
[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]
[TEst]_ANother - Company Of 2[Hi-Res]
[Yes]coOl__ - some text_[20.06.96][Hi-Res]
How can I match all of these and optimize my RegEx? I'm still new to this.

I assume this is what you want:
r"\[(.*?)\](.*?)\s*-\s*(.*?)(?:\[(\d{2}\.\d{2}\.\d{2})\])?\[(.*?)\]"
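As a quick check, here is a minimal sketch that runs that pattern against the three sample strings with Python's re module. The names PATTERN, samples and results are only illustrative.

```python
import re

# Lazy groups (.*?) stop as early as possible; the bracketed date is optional.
PATTERN = re.compile(
    r"\[(.*?)\](.*?)\s*-\s*(.*?)(?:\[(\d{2}\.\d{2}\.\d{2})\])?\[(.*?)\]"
)

samples = [
    "[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]",
    "[TEst]_ANother - Company Of 2[Hi-Res]",
    "[Yes]coOl__ - some text_[20.06.96][Hi-Res]",
]

# One tuple of five groups per string; the date group is None when absent.
results = [PATTERN.search(s).groups() for s in samples]
for groups in results:
    print(groups)
```

The second string has no date, so its fourth group comes back as None while the other groups are still filled.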

Consider approaching this with pandas as shown below:
import pandas as pd
# create a Series object containing the strings to be searched
s = pd.Series([
'[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]',
'[TEst]_ANother - Company Of 2[Hi-Res]',
'[Yes]coOl__ - some text_[20.06.96][Hi-Res]'
])
# use pandas' StringMethods to perform regex extraction; a DataFrame object is returned because your regex contains more than one capture group
s.str.extract(r'(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])', expand=True)
# returns the following
        0              1                            2    3         4
0  [Test]  Kyubiikitsune  Company Of Wolves[20.06.96]  NaN  [Hi-Res]
1  [TEst]       _ANother                 Company Of 2  NaN  [Hi-Res]
2   [Yes]         coOl__         some text_[20.06.96]  NaN  [Hi-Res]
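Column 3 is NaN in that output because the greedy third group (.*) swallows the date before the optional group gets a chance to match. The sketch below compares the original pattern with a lazy variant using plain re; the variant (lazy third group, explicit brackets around the date) is an assumption for illustration, not part of the answer above.

```python
import re

line = "[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]"

# Original pattern: the greedy third group consumes the date, so group 4 stays empty.
greedy = re.compile(
    r"(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])"
)

# Illustrative variant: a lazy third group and literal brackets around the date.
lazy = re.compile(
    r"(\[[A-Za-z]*\])(.*) - (.*?)(\[[0-9]{2}\.[0-9]{2}\.[0-9]{2}\])?(\[.*\])"
)

greedy_groups = greedy.match(line).groups()
lazy_groups = lazy.match(line).groups()
print(greedy_groups)
print(lazy_groups)
```

The same lazy pattern could presumably be passed to s.str.extract in place of the original one.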

Related

Separating columns based on Regex | Pandas

So I have converted a pdf to a dataframe and am almost in the final stages of what I wish the format to be. However I am stuck in the following step. I have a column which is like -
Column A
1234[321]
321[3]
123
456[456]
and want to separate it into two different columns B and C such that -
Column B Column C
1234 321
321 3
123 0
456 456
How can this be achieved? I did try something along the lines of
df["Column A"].str.strip(r"\[\d+\]")
but I have not been able to get through after trying different variations. Any help will be greatly appreciated as this is the final part of this task. Much thanks in advance!
An alternative could be:
# Create the new two columns
df[["Column B", "Column C"]]=df["Column A"].str.split('[', expand=True)
# Get rid of the extra bracket
df["Column C"] = df["Column C"].str.replace("]", "", regex=False)
# Get rid of the NaN and the useless column
df = df.fillna(0).drop("Column A", axis=1)
# Convert all columns to numeric
df = df.apply(pd.to_numeric)
You may use
import pandas as pd
df = pd.DataFrame({'Column A': ['1234[321]', '321[3]', '123', '456[456]']})
df[['Column B', 'Column C']] = df['Column A'].str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# If you need to drop Column A here, use
# df[['Column B', 'Column C']] = df.pop('Column A').str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
df['Column C'] = df['Column C'].fillna(0)  # avoids chained assignment
df
# Column A Column B Column C
# 0 1234[321] 1234 321
# 1 321[3] 321 3
# 2 123 123 0
# 3 456[456] 456 456
See the regex demo. It matches
^ - start of string
(\d+) - Group 1: one or more digits
(?:\[(\d+)])? - an optional non-capturing group matching [, then capturing into Group 2 one or more digits, and then a ]
$ - end of string.
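The breakdown above can also be tried without pandas. This is a minimal sketch using only the re module; split_value, values and pairs are hypothetical names, and defaulting the missing bracket part to '0' mirrors the desired output in the question.

```python
import re

# Group 1: leading digits; group 2: digits inside optional [...] brackets.
pattern = re.compile(r'^(\d+)(?:\[(\d+)])?$')

def split_value(text):
    """Return (before, inside) as strings; inside is '0' when there are no brackets."""
    m = pattern.match(text)
    if m is None:
        return None  # the text does not have the expected shape
    return m.group(1), m.group(2) or '0'

values = ['1234[321]', '321[3]', '123', '456[456]']
pairs = [split_value(v) for v in values]
print(pairs)
```

When the optional group does not take part in the match, m.group(2) is None, which is why a default is substituted.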

make regex more specific in getting consecutive digits

import pandas as pd
df= pd.DataFrame({'Data':['123456A122 119999 This 1234522261 1A1619 BL171111 A-1-24',
'134456 dont 12-23-34-45-5-6 Z112 NOT 01-22-2001',
'mix: 1A25629Q88 or A13B ok'],
'IDs': ['A11','B22','C33'],
})
I have the df shown above. I am using the following to get only consecutive digits:
reg = r'((?:[\d]-?){6,})'
df['new'] = df['Data'].str.findall(reg)
Data IDs new
0 [123456,119999, 1234522261, 171111]
1 [134456, 12-23-34-45-5-6, 01-22-2001]
2 []
This picks up many things I don't want, like 171111 from BL171111 and 123456 from 123456A122.
I would like the following output, which only picks up runs of exactly 6 consecutive digits:
Data IDs new
0 [119999]
1 [134456]
2 []
How do I change my regex to do so?
reg = r'((?:[\d]-?){6,})'
Change your regex to use word boundaries (\b) and limit the number of digits to exactly 6, like this:
reg = r'(\b\d{6}\b)'
This looks for a word boundary, exactly six digits, and another word boundary.
Here's a demo.
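To see the word-boundary pattern at work outside pandas, here is a small sketch that applies it to the three Data strings from the question with re.findall. The names rows and found are illustrative.

```python
import re

# \b\d{6}\b: six digits with no letter, digit or underscore on either side.
reg = re.compile(r'\b\d{6}\b')

rows = [
    '123456A122 119999 This 1234522261 1A1619 BL171111 A-1-24',
    '134456 dont 12-23-34-45-5-6 Z112 NOT 01-22-2001',
    'mix: 1A25629Q88 or A13B ok',
]

found = [reg.findall(row) for row in rows]
print(found)
```

Note that a hyphen counts as a non-word character, so a six-digit run next to a hyphen (for example in 12-345678) would still match; if that matters, lookarounds such as (?<![\w-]) and (?![\w-]) may be a better fit.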

Pandas and regular expressions

I have the code below hoping to accomplish simple pattern recognition. I want it to find all occurrences of PDP or CDP or PRS or EDP followed by (0 or up to 3) nondigits followed by (exactly 6 digits). Seems simple enough but pandas keeps screaming the error below.
sample rows of data:
row1 CAPS ACCT # /APR 1-APR 30 18/EDP 443996/SPECIAL PRICING
row2 CAPS /EDP# 320902/UNUSED LABELS
ValueError: Wrong number of items passed 5, placement implies 1
df['USPS_refund_no'] = df['APEX Invoice Description'].str.extract(r'((EDP)|(PDP)|(CDP)|(PRS)\D{,3}\d{6})',expand=True)
Thanks in advance
In your case, str.extract expects one capturing group. To match alternatives before the number, enclose the alternative list with a non-capturing group and capture the whole pattern with an outer capturing group:
df['USPS_refund_no'] = df['APEX Invoice Description'].str.extract(r'((?:EDP|PDP|CDP|PRS)\D{0,3}\d{6})',expand=True)
See the regex demo.
Details
( - start of the outer capturing group (required for extract)
(?:EDP|PDP|CDP|PRS) - a non-capturing group matching any one of the alternatives listed inside (note you may also write it as (?:[EPC]DP|PRS)):
EDP - EDP
| - or
PDP - PDP
| - or
CDP - CDP
| - or
PRS - PRS
\D{0,3} - 0 to 3 non-digits
\d{6} - six digits
) - end of the outer capturing group.
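The difference between the two patterns can be checked with plain re: a compiled pattern exposes its number of capturing groups, which is what str.extract turns into columns. This is a sketch only; original, fixed, rows and matches are illustrative names.

```python
import re

# The pattern from the question: one outer group plus four inner groups.
original = re.compile(r'((EDP)|(PDP)|(CDP)|(PRS)\D{,3}\d{6})')

# The corrected pattern: the alternatives sit in a non-capturing group.
fixed = re.compile(r'((?:EDP|PDP|CDP|PRS)\D{0,3}\d{6})')

rows = [
    'CAPS ACCT # /APR 1-APR 30 18/EDP 443996/SPECIAL PRICING',
    'CAPS /EDP# 320902/UNUSED LABELS',
]

# Number of capturing groups in each pattern.
print(original.groups, fixed.groups)

matches = [fixed.search(row).group(1) for row in rows]
print(matches)
```

Five capturing groups would produce five columns, which cannot be assigned to a single DataFrame column; one group yields exactly one.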

Regex and capturing groups

I have my regex working to the point where I now have two groups of text, group 1 and group 2 - I'm only really interested in the group2 text. In the end how do I get just group 2 to match/display?
("token":")([^,]+)
Something like below should work:
>>> import re
>>> p = re.compile(r'("token":")([^,]+)')
>>> m = p.match('...')
>>> m.group(2)
This will get the content of the second group. (Taken from here)
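For a runnable illustration, the sketch below uses a made-up input string, since the question does not show the text being searched. It uses search rather than match because match only succeeds at the very start of the string. The second pattern, which keeps a single capturing group and excludes quotes, is an assumption and not part of the answer above.

```python
import re

# Hypothetical input; the real text is not shown in the question.
text = '{"id":7,"token":"abc123","user":"demo"}'

# Pattern from the question: group 2 is everything up to the next comma.
p = re.compile(r'("token":")([^,]+)')
m = p.search(text)
second_group = m.group(2) if m else None
print(second_group)

# Illustrative variant: only the value is captured, and it stops at a quote.
p_single = re.compile(r'"token":"([^",]+)')
m_single = p_single.search(text)
token = m_single.group(1) if m_single else None
print(token)
```

With this input, group 2 of the original pattern still carries the closing quote, because [^,]+ only stops at a comma.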

REGEX : Extract group of number where digits are more than 3

Hi, I have a question regarding regex.
This sounds very simple and I remember doing it but somehow it got deleted and I am finding it hard to get it back.
I want to extract group of numbers from one line.
If the count of digits > 3 - select that.
EG:
ga3rdparty/phpMyAdmin/i0ndex.php?&t0oken=abf540063shakk
This line can be different every time, but there will be only 1 group of digits with more than 2 digits.
OUTPUT: 540063
Thank you in advance
You can use \d{3,} where 3 is the minimum number of digits. You can take a look at the following Python code:
import re
var = "ga3rdparty/phpMyAdmin/i0ndex.php?&t0oken=abf540063shakk"
pattern = re.compile(r'\d{3,}')
# findall returns every run of three or more consecutive digits
for match in pattern.findall(var):
    print(match)