Regex to search and replace what's inside parentheses

We're trying to find text inside parentheses and replace it with words. In this case every parenthesized item, like (R:2379; L:28), is to be replaced with (Receipt No.:2379; Ledger No.:28).
The very same text appears on the next line, without parentheses, and should not be touched (I don't know why it is there. This is from an old DOS accounting application).
I got as far as /\([R.]]+\)/g, 'Receipt No.' but this is harder than I imagined. How can this be done?
#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
R:2379;L:28
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
R:2432; L:28
#Ch. No. 884274 #Dr. Date 10-09-1997 #Ch. Dep. 19-09-1997 #Bank: Citibank (R:2475; L:28)
R:2475; L:28
#Ch. No. 884275 #Dr. Date 10-09-1997 #Ch. Dep. 24-09-1997 #Bank: Citibank (R:2480; L:28)
R:2480; L:28

You can use
\(R:(\d+);\s*L:(\d+)\)
Replace with (Receipt No.:$1; Ledger No.:$2).
See the regex demo. Details:
\(R: - (R: text
(\d+) - Group 1: one or more digits
; - a ; char
\s* - 0 or more whitespaces
L: - a literal L: text
(\d+) - Group 2: one or more digits
\) - a ) char.
The $1 is the backreference to Group 1 value and the $2 is the backreference to Group 2 value.
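As a minimal sketch, the same substitution can be tried in Python. This is an assumption about tooling: the question does not name the regex engine, and Python's re module writes replacement backreferences as \1 and \2 rather than $1 and $2. The name expand_labels is made up for illustration.

```python
import re

# Matches only the parenthesized form, e.g. "(R:2379;L:28)" or "(R:2432; L:28)".
PATTERN = re.compile(r"\(R:(\d+);\s*L:(\d+)\)")


def expand_labels(text: str) -> str:
    """Replace (R:n; L:m) with (Receipt No.:n; Ledger No.:m)."""
    # \1 and \2 are Python's spelling of the $1 and $2 backreferences.
    return PATTERN.sub(r"(Receipt No.:\1; Ledger No.:\2)", text)


sample = (
    "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
    "R:2379;L:28\n"
)
print(expand_labels(sample))
```

Because the pattern requires the literal parentheses, the bare R:2379;L:28 line is left untouched.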

Related

Extracting Data from the Cell Through Formula But does not pull some of the data

I have been extracting data from a cell, but my formulas return only part of the data I need.
I have attached a sheet below and would appreciate any help.
My formulas:
=ArrayFormula(TRIM(REGEXREPLACE(A3:A,"\.\.\.(.*)|\*\*\*","")))
=ArrayFormula(IFERROR(TRIM(REGEXEXTRACT(A3:A, "DONE=>\s*.+\b"))))
https://docs.google.com/spreadsheets/d/1MKC1OWIj64v_mmuNM6mLFY9wMgLwl2mUxm6KnsM5arE/edit#gid=0
The regexps you can use are
=ArrayFormula(TRIM(REGEXREPLACE(A3:A,"(\*{3}.*?)(?:\s*\.{3}DONE=>.*)?(\*{3})$","$1 $2")))
=ArrayFormula(IFERROR(TRIM(REGEXEXTRACT(REGEXREPLACE(A3:A, "^([^-]*-)[^-]+-", "$1"), ".*DONE=>.*"))))
See the first regex demo and the second regex demo. The third one - .*DONE=>.* - simply returns all the strings that contain DONE=> in them.
Details:
(\*{3}.*?) - Group 1 ($1): three * chars and then any zero or more chars other than line break chars, as few as possible
(?:\s*\.{3}DONE=>.*)? - an optional string of zero or more whitespaces, ...DONE=> and then the rest of the string
(\*{3}) - Group 2 ($2): *** string
$ - end of string.
The ^([^-]*-)[^-]+- matches
^ - start of string
([^-]*-) - Group 1 ($1): any zero or more chars other than - and then a -
[^-]+- - one or more chars other than - and then a - char.
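The effect of the ^([^-]*-)[^-]+- replacement can be checked outside Sheets with a small Python sketch. The input string below is hypothetical (the real sheet data is not reproduced here), the function name is made up, and Python writes the $1 backreference as \1.

```python
import re


def drop_second_field(value: str) -> str:
    """Remove the second hyphen-delimited field, keeping the first one."""
    # Group 1 is the first field plus its hyphen; the second field and
    # its trailing hyphen are matched but not put back.
    return re.sub(r"^([^-]*-)[^-]+-", r"\1", value)


# Hypothetical input with three hyphen-separated fields.
print(drop_second_field("AAA-BBB-CCC"))
```

A value with fewer than two hyphens does not match and is returned unchanged.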
You said "Thank you, but it includes the ? value at the end."
Here is a completely new formula for your needs.
It joins the front part and the last part with &:
=ArrayFormula(IF(REGEXMATCH(A2:A,"MUKHML"),TRIM((REGEXEXTRACT(A2:A,"^[^-]*")&REGEXREPLACE(A2:A,".*\?|.* COMPLEXIES",""))),""))
Or use this new formula, based on the one from Wiktor:
=ArrayFormula(IF(REGEXMATCH(A2:A,"MUKHML"),REGEXREPLACE(A2:A,"^([^-]*-)[^-]+-","$1"),""))

Pandas and regular expressions

I have the code below, which is meant to do simple pattern matching. I want it to find all occurrences of PDP, CDP, PRS, or EDP followed by 0 to 3 non-digits followed by exactly 6 digits. It seems simple enough, but pandas keeps raising the error below.
sample rows of data:
row1 CAPS ACCT # /APR 1-APR 30 18/EDP 443996/SPECIAL PRICING
row2 CAPS /EDP# 320902/UNUSED LABELS
ValueError: Wrong number of items passed 5, placement implies 1
df['USPS_refund_no'] = df['APEX Invoice Description'].str.extract(r'((EDP)|(PDP)|(CDP)|(PRS)\D{,3}\d{6})',expand=True)
Thanks in advance
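In your case, str.extract expects one capturing group: it returns one column per capturing group, and your pattern has five, which is why assigning the result to a single column fails with "Wrong number of items passed 5". To match alternatives before the number, enclose the alternative list with a non-capturing group and capture the whole pattern with an outer capturing group: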
df['USPS_refund_no'] = df['APEX Invoice Description'].str.extract(r'((?:EDP|PDP|CDP|PRS)\D{0,3}\d{6})',expand=True)
See the regex demo.
Details
( - start of the outer capturing group (required for extract)
(?:EDP|PDP|CDP|PRS) - a non-capturing group matching any one of the alternatives listed inside (note you may also write it as (?:[EPC]DP|PRS)):
EDP - EDP
| - or
PDP - PDP
| - or
CDP - CDP
| - or
PRS - PRS
\D{0,3} - 0 to 3 non-digits
\d{6} - six digits
) - end of the outer capturing group.
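The group counts can be confirmed with the standard re module, which is the engine pandas uses for str.extract. This is a minimal sketch using only the two sample rows from the question; the variable names are made up for illustration.

```python
import re

ORIGINAL = r"((EDP)|(PDP)|(CDP)|(PRS)\D{,3}\d{6})"
FIXED = r"((?:EDP|PDP|CDP|PRS)\D{0,3}\d{6})"

# Number of capturing groups in each pattern.
original_groups = re.compile(ORIGINAL).groups  # 5
fixed_groups = re.compile(FIXED).groups  # 1

rows = [
    "CAPS ACCT # /APR 1-APR 30 18/EDP 443996/SPECIAL PRICING",
    "CAPS /EDP# 320902/UNUSED LABELS",
]

# In the original pattern only the PRS branch is followed by the digits,
# so the EDP branch matches the bare prefix.
original_first = re.search(ORIGINAL, rows[0]).group(1)

# The fixed pattern captures the prefix together with the number.
fixed_matches = [re.search(FIXED, row).group(1) for row in rows]

print(original_groups, fixed_groups)
print(original_first)
print(fixed_matches)
```

The second print also shows a side effect of the original alternation: without the non-capturing group, the \D{,3}\d{6} part belongs only to the last alternative.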

Python Regular Expression: No space in between

I have the following string:
"......(some chars) aaa bbb ###8/13/2018 ......(some chars)"
The ### in the string represent some random characters. ###'s length is unknown and it could be None (just "aaa bbb 8/13/2018").
My goal is to find the date from the string (8/13/2018) and the starting index of ###.
I currently used the following code:
m = re.search(r'\s.*?([0-9]{1,}/[0-9]{1,}/[0-9]{2,})', str)
m.groups()[0] ## The date
m.start() ## index of ###
But the regex is matching bbb ###8/13/2018 instead of ###8/13/2018
I also tried change the regex to:
r'\s(?!\s).*?[0-9]{1,}/[0-9]{1,}/[0-9]{2,}'
r'\s(?!\s)*?[0-9]{1,}/[0-9]{1,}/[0-9]{2,}'
But neither of them works.
I would appreciate any help or comments. Thank you.
I tend to believe you are looking for:
#*(?:\d{1,2}/){2}\d{2,4} or even \S*(?:\d{1,2}/){2}\d{2,4}
This is simply saying:
\S* start with 0 or more non-space characters.
(?:\d{1,2}/){2} find two repetitions of \d{1,2}/ without capturing them ((?:..) is a non-capturing group). This matches the month and day part, 8/13/. \d{1,2} means at least one digit and at most two digits.
\d{2,4} match the year: at least 2 digits and at most 4 digits.
Using a part of your regex, I think you mean something like this
r'\S*([0-9]+/[0-9]+/[0-9]{2,})'
https://regex101.com/r/dxF4sT/1
To find the starting index, it would be where the match was found.
Note that \S* will match all consecutive non-whitespace characters before the date.
You can change this to other things like [#a-zA-Z] etc..., just add it to the class.
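Putting the pieces together, here is a minimal sketch that returns both the date and the index where the ### prefix starts. It combines the \S* prefix with the (?:\d{1,2}/){2}\d{2,4} date pattern suggested above; the function name and the sample strings are made up for illustration.

```python
import re

# \S* consumes the unknown prefix; group 1 captures only the date.
DATE_WITH_PREFIX = re.compile(r"\S*((?:\d{1,2}/){2}\d{2,4})")


def find_date(text):
    """Return (date, index where the prefix starts), or None if no date."""
    m = DATE_WITH_PREFIX.search(text)
    if m is None:
        return None
    # m.start() is the start of the whole match, i.e. of the ### prefix;
    # it equals the start of the date when the prefix is empty.
    return m.group(1), m.start()


print(find_date("xx aaa bbb ###8/13/2018 yy"))
print(find_date("aaa bbb 8/13/2018"))
```

Since \S cannot cross a space, the match can no longer start at bbb as it did with the original \s.*? prefix.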

Extract nested string from text column

I have following SQL result entries.
Result
---------
TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>
TW - 5657980 Priority updated from <strong> Medium</strong> to <strong>Low</strong> by <strong>System</strong>
TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243
TW - 5657980 Labor added <strong>Kelsey Franks / 14:00 hours </strong> by <strong>System</strong>#65197
Now I am trying to extract a short description from this result and migrate it to another column in the same table.
Expected result
--------------
Due Date Updated
Priority Updated
Material Added
Labor Added
Ignore the first 13 characters. In most cases the description ends with 'updated'; a few end with 'added'. The match should be case-insensitive.
Is there any way to get the expected result?
Solution with substring() using a regular expression. It skips the first 13 characters, then takes the string up to the first ' updated' or ' added', case-insensitive, with leading blank. Else NULL:
SELECT substring(result, '(?i)^.{13}(.*? (?:updated|added))')
FROM tbl;
The regexp explained:
(?i) .. meta-syntax to switch to case-insensitive matching
^ .. start of string
.{13} .. skip the first 13 characters
() .. capturing parenthesis (captures payload)
.*? .. any number of characters (non-greedy)
(?:) .. non-capturing parenthesis
(?:updated|added) .. 2 branches (string ends in 'updated' or 'added')
If we cannot rely on 13 leading characters, as you later commented, we need some other reliable definition instead. Your difficulty seems to lie with hazy requirements more than with the actual implementation.
Say, we are dealing with 1 or more non-digits, followed by 1 or more digits, a space and then the payload as defined above:
SELECT substring(result, '(?i)^\D+\d+ (.*? (?:updated|added))') ...
\d .. class shorthand for digits
\D .. non-digits, the opposite of \d
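Both patterns can be tried outside the database with a short Python sketch. This only emulates the substring() call by returning the first capturing group, or None where SQL would return NULL. The function and constant names are made up, and the rows are copied from the question.

```python
import re

# Same patterns as in the two SQL statements above.
FIXED_WIDTH = re.compile(r"(?i)^.{13}(.*? (?:updated|added))")
FLEXIBLE = re.compile(r"(?i)^\D+\d+ (.*? (?:updated|added))")


def short_description(result, pattern=FIXED_WIDTH):
    """Return the captured payload, or None when the pattern does not match."""
    m = pattern.search(result)
    return m.group(1) if m else None


rows = [
    "TW - 5657980 Due Date updated : to <strong>2017-08-13 10:21:00</strong> by <strong>System</strong>",
    "TW - 5657980 Material added: <strong>1000 : Cash in Bank - Operating (Old)/ QTY:2</strong> by <strong>System</strong>#9243",
]
for row in rows:
    print(short_description(row), "|", short_description(row, FLEXIBLE))
```

The payload keeps the capitalization found in the data, so it reads "Due Date updated" rather than "Due Date Updated".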

Matching a group that may or may not exist

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
1 2 3 4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that helps me get the first, second and third part but I would also need the fourth part. Part 4 is the country name and can either be FINLAND or SUOMI. If the fourth part didn't exist in an address the fourth group would be empty. This is my regex so far but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to be using this with Oracle's REGEXP functions)
Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three non-greedy lets it stop before the optional space and country. If group 4 doesn't match, I think it will be uninitialized rather than blank; that depends on the language.
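A quick way to see how the optional group behaves is to run the pattern in Python. This is only a sketch: the question targets Oracle, whose regex functions may treat the unmatched group differently, and the function name is made up. In Python an unmatched optional group comes back as None.

```python
import re

ADDRESS = re.compile(r"(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$")


def parse_address(line):
    """Return (street, postcode, city, country); country is None if absent."""
    m = ADDRESS.match(line)
    return m.groups() if m else None


print(parse_address("BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI"))
print(parse_address("BLOOKKOKATU 20 A 773 00810 HELSINKI"))
```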
To match a character (or in your case a group) that may or may not exist, you need to use ? after the character, subpattern, or class in question. I'm answering now because regex is complicated and should be explained: posting only the fix without an explanation isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$
This will match your input more tightly and each of your groups is in its own regex group:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)
or if space is OK instead of "whitespace":
(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional - doesn't have to match)
(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
SQL> with t as
( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
)
select text
, regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
from t
/
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.