Pandas and regular expressions

I have the code below, hoping to accomplish simple pattern recognition. I want it to find all occurrences of PDP, CDP, PRS or EDP, followed by 0 to 3 non-digits, followed by exactly 6 digits. It seems simple enough, but pandas keeps raising the error below.
Sample rows of data:
row1 CAPS ACCT # /APR 1-APR 30 18/EDP 443996/SPECIAL PRICING
row2 CAPS /EDP# 320902/UNUSED LABELS
ValueError: Wrong number of items passed 5, placement implies 1
df['USPS_refund_no'] = df['APEX Invoice Description'].str.extract(r'((EDP)|(PDP)|(CDP)|(PRS)\D{,3}\d{6})',expand=True)
Thanks in advance

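In your case, the pattern contains five capturing groups, so str.extract returns five columns, which cannot be assigned to a single column; it needs exactly one capturing group here. To match alternatives before the number, enclose the alternative list in a non-capturing group and capture the whole pattern with an outer capturing group: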
df['USPS_refund_no'] = df['APEX Invoice Description'].str.extract(r'((?:EDP|PDP|CDP|PRS)\D{0,3}\d{6})',expand=True)
See the regex demo.
Details
( - start of the outer capturing group (required for extract)
(?:EDP|PDP|CDP|PRS) - a non-capturing group matching any one of the alternatives listed inside (note you may also write it as (?:[EPC]DP|PRS)):
EDP - EDP
| - or
PDP - PDP
| - or
CDP - CDP
| - or
PRS - PRS
\D{0,3} - 0 to 3 non-digits
\d{6} - six digits
) - end of the outer capturing group.
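The group count behind the error can be checked without pandas. The sketch below uses only Python's re module on the two sample rows from the question; the variable names and the helper first_token are made up for illustration.

```python
import re

# Sample rows copied from the question.
rows = [
    "CAPS ACCT # /APR 1-APR 30 18/EDP 443996/SPECIAL PRICING",
    "CAPS /EDP# 320902/UNUSED LABELS",
]

# The pattern from the question and the corrected one from the answer.
original = re.compile(r'((EDP)|(PDP)|(CDP)|(PRS)\D{,3}\d{6})')
fixed = re.compile(r'((?:EDP|PDP|CDP|PRS)\D{0,3}\d{6})')

# Number of capturing groups in each pattern: five versus one.
group_counts = (original.groups, fixed.groups)


def first_token(pattern, row):
    """Return group 1 of the first match in row, or None if nothing matches."""
    match = pattern.search(row)
    return match.group(1) if match else None


refund_numbers = [first_token(fixed, row) for row in rows]
print(group_counts)
print(refund_numbers)
```

With a single capturing group, str.extract has exactly one value per row to place into the new column.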

Related

Regex to search and replace what's inside parentheses

We're trying to find text inside parentheses and replace it with words. In this case, all text inside parentheses, like (R:2379; L:28), is to be replaced with (Receipt No.:2379; Ledger No.:28).
The very same text appears on the next line, and that copy should not be touched (I don't know why it is there; this is from an old DOS accounting application).
I came up with /\([R.]]+\)/g, 'Receipt No.' but this is harder than I imagined. How can this be done?
#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
R:2379;L:28
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
R:2432; L:28
#Ch. No. 884274 #Dr. Date 10-09-1997 #Ch. Dep. 19-09-1997 #Bank: Citibank (R:2475; L:28)
R:2475; L:28
#Ch. No. 884275 #Dr. Date 10-09-1997 #Ch. Dep. 24-09-1997 #Bank: Citibank (R:2480; L:28)
R:2480; L:28

You can use
\(R:(\d+);\s*L:(\d+)\)
Replace with (Receipt No.:$1; Ledger No.:$2).
See the regex demo. Details:
\(R: - (R: text
(\d+) - Group 1: one or more digits
; - a ; char
\s* - 0 or more whitespaces
L: - a literal L: text
(\d+) - Group 2: one or more digits
\) - a ) char.
The $1 is the backreference to Group 1 value and the $2 is the backreference to Group 2 value.
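As a rough illustration, here is the same replacement in Python. This is an assumption about tooling: the question's /.../g literal suggests another regex flavor, and Python's re.sub writes the backreferences as \1 and \2 instead of $1 and $2.

```python
import re

# First four lines of the sample data from the question.
text = (
    "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
    "R:2379;L:28\n"
    "#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)\n"
    "R:2432; L:28\n"
)

pattern = re.compile(r'\(R:(\d+);\s*L:(\d+)\)')

# The literal parentheses in the pattern keep the bare "R:...;L:..." lines untouched.
result = pattern.sub(r'(Receipt No.:\1; Ledger No.:\2)', text)
print(result)
```

Only the parenthesized occurrences change, because the pattern requires the literal ( and ) around the numbers.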

Regular expression for matching a specific substring of a string

I have a log file that logs connection drops of computers in a LAN. I want to extract the name of each computer from every line of the log file, and for that I am using this: (?<=Name:)\w+|(-PC)
The target text:
`[C417] ComputerName:KCUTSHALL-PC UserID:GO kcutshall Station 9900 (locked) LanId: | (11/23 10:54:09 - 11/23 10:54:44) | Average limit (300) exceeded while pinging www.google.com [74.125.224.147] 8x
[C445] ComputerName:FRONTOFFICE UserID:YB Yenae Ball Station 7C LanId: | (11/23 17:02:00) | Client is connected to agent.`
The problem is that some computer names have -PC in them and some don't. The expression I have created matches computers without -PC in their names, but if a computer has -PC in its name, it treats that as a separate match, and I don't want that. In short, it gives me 3 matches, but I want only 2. That's why I need help here; I am a beginner in regex.

You may use
(?<=Name:)\w+(?:-PC)?
Details
(?<=Name:) - a place immediately preceded with Name:
\w+ - 1+ word chars
(?:-PC)? - an optional non-capturing group that matches 1 or 0 occurrences of -PC substring.
Consider using word boundaries if you need to match PC as a whole word:
(?<=Name:)\w+(?:-PC\b)?
See the regex demo.
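A quick check of both patterns against the two log lines from the question, sketched in Python (the question does not say which regex engine is in use, so Python is an assumption here):

```python
import re

# The two log lines from the question.
log = (
    "[C417] ComputerName:KCUTSHALL-PC UserID:GO kcutshall Station 9900 (locked) "
    "LanId: | (11/23 10:54:09 - 11/23 10:54:44) | Average limit (300) exceeded "
    "while pinging www.google.com [74.125.224.147] 8x\n"
    "[C445] ComputerName:FRONTOFFICE UserID:YB Yenae Ball Station 7C "
    "LanId: | (11/23 17:02:00) | Client is connected to agent."
)

# Pattern from the question: -PC is a separate alternative, hence a separate match.
original_matches = [m.group(0) for m in re.finditer(r'(?<=Name:)\w+|(-PC)', log)]

# Pattern from the answer: -PC is an optional tail of the same match.
names = re.findall(r'(?<=Name:)\w+(?:-PC)?', log)

print(original_matches)
print(names)
```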

Regex for date and 3-letter code

I've been trying to create a regex that will parse a date and a 3-letter code from a somewhat longer message. Here are examples of the messages and what I want to get:
AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX -> PEK, 14JUN18/1654
XXX/WWWW BY 05JUL 0900 BKK LT ELSE BKG WILL BE QQQQ -> BKK, 05JUL 0900
TO AZ BY 02AUG 1910 TYO OR AZ WWWW WILL BE XXX -> TYO, 02AUG 1910
BY TYO20JUL18/0355 OR CXL CA ALL QQQ -> TYO, 20JUL18/0355
BY AMS04JUL18/1954 OR CXL MF 812 L07JUL -> AMS, 04JUL18/1954
I want to be able to match the 3-letter code and the date for every message. The code is always next to the date but can come before or after the date part. Also, the date part can be with or without a year. Is it possible to have one regex to use for all the above examples?
I started with this:
(\s[A-Z]{3}\d\d|\d\d[A-Z]{3}\s)
(https://regex101.com/r/LPLjgf/1) but it's not working as it should and I'm not very experienced with regex to be honest.
EDIT:
Actually I would need to use only the 3-letter codes but I need them to be connected with a date - for example in:
AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX
the AAA, BBB or QQQ shouldn't match because they aren't right before or after the date, as PEK is.
Same with BY TYO20JUL18/0355 OR CXL CA ALL QQQ -> only TYO should match because it's before a date and CXL shouldn't.

You may use the following pattern:
([A-Z]{3})(\d{2}[A-Z]{3}\d{2}\/\d{4})|(\d{2}[A-Z]{3} \d{4}) ([A-Z]{3})
([A-Z]{3}) Capturing group for three capital letters
(\d{2}[A-Z]{3}\d{2}\/\d{4}) Capturing group for two digits, three upper case letters, two digits, /, four digits.
| Logical OR, alternates pattern.
(\d{2}[A-Z]{3} \d{4}) Capturing group. Captures two digits, three upper case letters, whitespace and four digits.
([A-Z]{3}) Capturing group for three upper case letters.
You can try it live here.
Captured groups:
Group 1. 14-17 `PEK`
Group 2. 17-29 `14JUN18/1654`
Group 3. 83-93 `05JUL 0900`
Group 4. 94-97 `BKK`
Group 3. 151-161 `02AUG 1910`
Group 4. 162-165 `TYO`
Group 1. 211-214 `TYO`
Group 2. 214-226 `20JUL18/0355`
Group 1. 269-272 `AMS`
Group 2. 272-284 `04JUL18/1954`
Group 1. 342-345 `PEK`
Group 2. 345-357 `14JUN18/1654`
Group 1. 378-381 `TYO`
Group 2. 381-393 `20JUL18/0355`
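Because the code lands in group 1 or group 4 depending on which alternative matched, a small wrapper can normalize the result. A minimal Python sketch, assuming one code/date pair per message; the function name code_and_date is made up for illustration:

```python
import re

PATTERN = re.compile(
    r'([A-Z]{3})(\d{2}[A-Z]{3}\d{2}/\d{4})'  # code, then date with year and time
    r'|(\d{2}[A-Z]{3} \d{4}) ([A-Z]{3})'     # date without year, then code
)


def code_and_date(message):
    """Return (code, date) for the first match in message, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    if match.group(1) is not None:
        # First alternative matched: groups 1 and 2 are filled.
        return match.group(1), match.group(2)
    # Second alternative matched: groups 3 and 4 are filled.
    return match.group(4), match.group(3)


# The five sample messages from the question.
messages = [
    "AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX",
    "XXX/WWWW BY 05JUL 0900 BKK LT ELSE BKG WILL BE QQQQ",
    "TO AZ BY 02AUG 1910 TYO OR AZ WWWW WILL BE XXX",
    "BY TYO20JUL18/0355 OR CXL CA ALL QQQ",
    "BY AMS04JUL18/1954 OR CXL MF 812 L07JUL",
]
results = [code_and_date(m) for m in messages]
print(results)
```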

Firstly
(\s[A-Z]{3}\d\d|\d\d[A-Z]{3}\s)
The alternation (|) means the pattern will match either \s[A-Z]{3}\d\d or \d\d[A-Z]{3}\s, which is certainly not what you want. To narrow the scope of an alternation, use grouping.
I would think you want to match this fairly directly:
([A-Z]{3})\d{2}[A-Z]{3}\d{2}
And that only captures the three letters in a group.
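A short Python check suggests this pattern covers only the messages where the code sits directly before a date that includes the year; the date-first messages do not match. The helper name code_before_date is hypothetical.

```python
import re

pattern = re.compile(r'([A-Z]{3})\d{2}[A-Z]{3}\d{2}')


def code_before_date(message):
    """Return the 3-letter code that directly precedes a ddMMMyy date, or None."""
    match = pattern.search(message)
    return match.group(1) if match else None


# Code before the date: matches.
with_year = code_before_date("AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX")
# Code after the date, no year: no match with this pattern.
date_first = code_before_date("XXX/WWWW BY 05JUL 0900 BKK LT ELSE BKG WILL BE QQQQ")
print(with_year, date_first)
```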

Try the following RegEx:
[A-Z]{3}(\d{2}[A-Z]{3}[\S]*)|(\d{2}[A-Z]{3}\s\d{4}\s[A-Z]{3})
It will match 3 letters followed by 2 digits, 3 letters and any run of non-whitespace characters, OR 2 digits followed by 3 letters, a space, 4 digits, a space and 3 letters.
You can try it here

Regex to obtain two values from string

I have a string like the one below and I need to pull out two numeric values: 197 from 197kJ and 47 from 47kcal. Can someone help me with this? I'm going crazy :)
My regular expression:
((<|>)?\d+((\.|,)\d+)?kj\s?\/\s?)?(<|>)?(\d+((\.|,)\d+)?)kcal
String to search in:
Per 250ml serving (10 servings per pack): Energy 197kJ (2% ADH)/47kcal
(2% ADH), Fat 0.3g (of which Saturated Fat 0.1g), Carbohydrate 7.8g
(3% ADH) (of which Sugars 3.9g (4% ADH)), Fibres 1.6g, Protein 2.2g
(4% ADH), Salt 1.6g (27% ADH)

Just do:
\d+(?:kcal|kJ)
# require at least one number
# followed by either kcal or kJ
See a demo on regex101.com (or yours: https://regex101.com/r/uS3mE4/3)

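Your regex pattern looks overcomplicated, probably because it serves a more complex job than the one described.
But your task (getting the numeric values of kJ and kcal) can be done with a pattern like:
(\d+[.,]?\d+)(?:kJ|kcal)
Note that \d+[.,]?\d+ requires at least two digits, so a single-digit value such as 5kJ would not match.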

Here is my proposal:
(\d+)kJ.*?(\d+)kcal
It extracts the kJ and the kcal numbers into two different capturing groups.
It simply captures all numeric characters before the "kJ" or "kcal" substring.
It uses the lazy quantifier *? to avoid consuming the first characters of the kcal number.
https://regex101.com/r/uS3mE4/4

You may use alternation and optional (0 or more) spaces between the number and measurement unit:
[<>]?(\d+(?:[.,]\d+)?)\s*k(?:cal|j)
To be used with the i case insensitive modifier.
See the regex demo.
Details:
[<>]? - an optional < or >
(\d+(?:[.,]\d+)?) - Group 1 capturing 1 or more digits, and then an optional sequence: a . or , and 1+ digits
\s* - zero or more whitespaces
k - literal k
(?:cal|j) - either a cal or j.
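Applied in Python, with re.IGNORECASE standing in for the i modifier, the pattern returns just the numbers. This is a sketch under the assumption that Python is acceptable; the question does not name its language. The second input is a made-up string to show the optional comparison sign and decimal comma.

```python
import re

# Nutrition text from the question, joined into one string.
label = (
    "Per 250ml serving (10 servings per pack): Energy 197kJ (2% ADH)/47kcal "
    "(2% ADH), Fat 0.3g (of which Saturated Fat 0.1g), Carbohydrate 7.8g "
    "(3% ADH) (of which Sugars 3.9g (4% ADH)), Fibres 1.6g, Protein 2.2g "
    "(4% ADH), Salt 1.6g (27% ADH)"
)

pattern = re.compile(r'[<>]?(\d+(?:[.,]\d+)?)\s*k(?:cal|j)', re.IGNORECASE)

# With one capturing group, findall returns the group contents only.
values = pattern.findall(label)

# Hypothetical input, not from the question.
small_value = pattern.findall("<0,5 kcal")

print(values, small_value)
```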

Matching a group that may or may not exist

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
         1             2      3      4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that helps me get the first, second and third part but I would also need the fourth part. Part 4 is the country name and can either be FINLAND or SUOMI. If the fourth part didn't exist in an address the fourth group would be empty. This is my regex so far but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to be using this with Oracle's REGEXP functions)

Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three non-greedy lets the optional space and country alternatives match. If group 4 doesn't match, I think it will be uninitialized rather than blank, but that depends on the language.
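To see how an unmatched optional group behaves, here is a small Python sketch. Python's re module is only a stand-in: the question targets Oracle's REGEXP functions, whose handling of an unmatched group may differ. The address without a country is a hypothetical variant of the sample, and split_address is an illustrative name.

```python
import re

ADDRESS = re.compile(r'(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$')


def split_address(address):
    """Return the four groups as a tuple, or None if the address does not match."""
    match = ADDRESS.match(address)
    return match.groups() if match else None


with_country = split_address("BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI")
# Hypothetical variant: the same address with the country left out.
without_country = split_address("BLOOKKOKATU 20 A 773 00810 HELSINKI")

print(with_country)
print(without_country)
```

In Python the missing country comes back as None rather than an empty string, which is the language-dependent behavior mentioned above.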

To match a character (or in your case a group) that may or may not exist, you need to use ? after the character/subpattern/class in question. I'm answering now because RegEx is complicated and should be explained: only posting the fix without the explanation isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm

Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$

This will match your input more tightly and each of your groups is in its own regex group:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)
or if space is OK instead of "whitespace":
(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional: \w* can be empty, though the \s before it must still match)

(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
SQL> with t as
  ( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
  )
  select text
  , regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
  from t
  /
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.
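For readers without an Oracle session, the same substitution can be approximated with Python's re.sub. This is a stand-in under the assumption that the two engines treat this pattern alike; it is not Oracle's REGEXP_REPLACE. The second input is a hypothetical address without a country.

```python
import re

pattern = re.compile(r'(.*?)\s(\d{5})\s(\w+)\s(\w*)')
replacement = r'\1**\2**\3**\4'

new_text = pattern.sub(replacement, "BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI")

# Hypothetical address without a country: the \s before group 4 cannot match,
# so the pattern finds nothing and the text is returned unchanged.
no_country = pattern.sub(replacement, "BLOOKKOKATU 20 A 773 00810 HELSINKI")

print(new_text)
print(no_country)
```

The second call shows that this pattern handles the missing-country case only if a whitespace character still follows the city.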
