Matching a group that may or may not exist - regex

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
         1             2      3       4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that captures the first, second and third parts, but I also need the fourth. Part 4 is the country name and can be either FINLAND or SUOMI. If an address has no fourth part, the fourth group should be empty. This is my regex so far, but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to be using this with Oracle's REGEXP functions.)

Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three non-greedy lets the pattern go on to match the optional space and the country alternatives. If group 4 doesn't match, I think it will be uninitialized rather than blank, but that depends on the language.
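As a quick check of the pattern above, here is a minimal sketch in Python rather than Oracle (the function name parse_address is made up for illustration). In Python, an optional group that does not take part in the match comes back as None; other engines may report it differently.

```python
import re

# Pattern from the answer above: lazy group 3, optional space, optional country.
ADDRESS_RE = re.compile(r"(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$")

def parse_address(line):
    """Return (street, postcode, city, country); country is None when absent."""
    m = ADDRESS_RE.match(line)
    return m.groups() if m else None

with_country = parse_address("BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI")
without_country = parse_address("BLOOKKOKATU 20 A 773 00810 HELSINKI")

print(with_country)     # ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', 'SUOMI')
print(without_country)  # ('BLOOKKOKATU 20 A 773', '00810', 'HELSINKI', None)
```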

To match a character (or, in your case, a group) that may or may not exist, you need to put ? after the character, subpattern, or class in question. I'm answering now because regex is complicated and should be explained: posting only the fix without an explanation isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
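The quoted colou?r example can be tried directly. This is a small Python sketch; the word list is invented for the demonstration.

```python
import re

# "u?" makes the letter u optional: zero or one occurrence.
COLOR_RE = re.compile(r"colou?r")

candidates = ["color", "colour", "colouur", "colr"]
accepted = [word for word in candidates if COLOR_RE.fullmatch(word)]

print(accepted)  # ['color', 'colour']
```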

Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$

This matches your input more tightly, with each part in its own capturing group. The whitespace before group 4 is optional, so an address without a country still matches:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s?(\w*)
or, if a literal space is OK instead of any whitespace:
(\w+ \d+ \w \d+) (\d+) (\w+) ?(\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional - doesn't have to match)

(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
SQL> with t as
2 ( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
3 )
4 select text
5 , regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
6 from t
7 /
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.

Related

Regex for specific pattern and group issue

I'm using regex in Go and having a hard time capturing three names from the following statements:
Hello Planet Earth 2022 - R1 3 Hell John v Tom v Ford
Hello World 2022 - R2 3 Hell - John v Tom v Ford
I'm trying to group "John" "Tom" "Ford" with the following regex:
^(?i).+? . R\d 3 Hell (.+?)[,|v] (.+?)[,|v] (.+?)$
The issue is that it groups "- John" for second statement and I need "John" only.
Any ideas how can it be adjusted?
thanks
Not sure how correct it is, but having tested here, I came up with this...
^(?i).+? . R\d 3 Hell[\s-]* (.+?)[,v] (.+?)[,v] (.+?)$
Since you are matching single spaces, you can optionally match a dash followed by a space.
Note that the last .+ does not have to be non-greedy, as it is the last part of the pattern, and the character class [,v] does not need a pipe char if you do not intend to match that as a character.
^(?i).+? . R\d 3 Hell (?:- )?(.+?) [,v] (.+?) [,v] (.+)
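The optional-dash idea can be sketched as follows. This is an adaptation in Python, not Go: recent Python versions only accept a global inline flag such as (?i) at the very start of the pattern, so the flag is passed as re.IGNORECASE instead. The pattern puts a space on both sides of each separator so the captured names carry no trailing space, and extract_names is a made-up helper name.

```python
import re

# "(?:- )?" optionally consumes a leading "- " before the first name.
NAMES_RE = re.compile(
    r"^.+? . R\d 3 Hell (?:- )?(.+?) [,v] (.+?) [,v] (.+)",
    re.IGNORECASE,
)

def extract_names(statement):
    """Return the three names as a tuple, or None if the line does not match."""
    m = NAMES_RE.match(statement)
    return m.groups() if m else None

first = extract_names("Hello Planet Earth 2022 - R1 3 Hell John v Tom v Ford")
second = extract_names("Hello World 2022 - R2 3 Hell - John v Tom v Ford")

print(first)   # ('John', 'Tom', 'Ford')
print(second)  # ('John', 'Tom', 'Ford')
```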

Regex to search and replace what's inside parentheses

We're trying to find text inside parentheses and replace it with words. In this case, all text inside parentheses, like (R:2379; L:28), is to be replaced with (Receipt No.:2379; Ledger No.:28).
The very same text appears on the next line, and that copy should not be touched (I don't know why it's there; this is from an old DOS accounting application).
I came up with /\([R.]]+\)/g, 'Receipt No.' but this is harder than I imagined. How can this be done?
#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
R:2379;L:28
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
R:2432; L:28
#Ch. No. 884274 #Dr. Date 10-09-1997 #Ch. Dep. 19-09-1997 #Bank: Citibank (R:2475; L:28)
R:2475; L:28
#Ch. No. 884275 #Dr. Date 10-09-1997 #Ch. Dep. 24-09-1997 #Bank: Citibank (R:2480; L:28)
R:2480; L:28
You can use
\(R:(\d+);\s*L:(\d+)\)
Replace with (Receipt No.:$1; Ledger No.:$2).
See the regex demo. Details:
\(R: - (R: text
(\d+) - Group 1: one or more digits
; - a ; char
\s* - 0 or more whitespaces
L: - a literal L: text
(\d+) - Group 2: one or more digits
\) - a ) char.
The $1 is the backreference to Group 1 value and the $2 is the backreference to Group 2 value.
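The same replacement can be sketched in Python. Note that Python's re.sub writes backreferences as \1 and \2 rather than $1 and $2; the helper name expand_labels and the shortened sample text are assumptions for illustration.

```python
import re

# Only the parenthesised form matches; the bare copy on the next line has no "(".
LEDGER_RE = re.compile(r"\(R:(\d+);\s*L:(\d+)\)")

def expand_labels(text):
    """Replace (R:n; L:m) with (Receipt No.:n; Ledger No.:m)."""
    return LEDGER_RE.sub(r"(Receipt No.:\1; Ledger No.:\2)", text)

sample = "#Bank: Citibank (R:2432; L:28)\nR:2432; L:28"
result = expand_labels(sample)

print(result)
# #Bank: Citibank (Receipt No.:2432; Ledger No.:28)
# R:2432; L:28
```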

Regex: Match only street name within address

I have a list of addresses and I would like to have a regular expression that is able to capture just the name of the street without the street type, address number, or cardinal direction. There are some errors in formatting but all characters are in capital letters. So,
2038 W MAIN AVE
2038QWEW S JEFFERSON AVENUE
33 NORTH CALIFORNIA STREET
53371 SOUTH WASHINGTON
53371 S WASHINGTON AVENUE
1600 E PENNSYLVANIA AVE
WEST9 67ST ST
E171 N 23RD STREET
G171 N121ST STREET
ought to return
MAIN
JEFFERSON
CALIFORNIA
WASHINGTON
WASHINGTON
PENNSYLVANIA
67ST
23RD
121ST
So far I've got
([^ W ]|[^ E ]|[^ S ]|[^ N ])([0-9])*([A-Z]+)[^ ]
But I can't seem to only capture the first match that occurs after the street number. I feel like I need the standard greedy operators (i.e. ?, *, or +) but I can't figure out how to incorporate them.
These two links have taken me close:
Matching on every second occurence
Simple regex for street address
To get the output you want from the given addresses, this regex should help: [\pL\pN]+(?=\h+[\pL\pN]+$)
It matches the second-to-last word in each line, where a word is "one or more letters or digits in any language".
For reference, see https://superuser.com/questions/1361759/matching-second-last-word-in-sentence-through-regular-expression
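A rough way to try the second-to-last-word lookahead in Python is sketched below. Python's re module has no \pL, \pN or \h, so this version assumes ASCII capitals, digits and plain spaces, as in the sample data; second_last_word is a made-up helper name.

```python
import re

# A word that is followed by spaces and exactly one more word before end of line.
SECOND_LAST_RE = re.compile(r"[A-Z0-9]+(?= +[A-Z0-9]+$)")

def second_last_word(line):
    """Return the second-to-last word of the line, or None if there is none."""
    m = SECOND_LAST_RE.search(line)
    return m.group(0) if m else None

print(second_last_word("2038 W MAIN AVE"))         # MAIN
print(second_last_word("E171 N 23RD STREET"))      # 23RD
print(second_last_word("53371 SOUTH WASHINGTON"))  # SOUTH
```

The last call shows the limit of this approach: when a line has no street type, the second-to-last word is the direction rather than the street name, so such lines need separate handling.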
Logic: we look for the second-to-last word (a run of characters), optionally preceded by the letter N.
^.*?\s[N]{0,1}([-a-zA-Z0-9]+)\s*\w*$
Res:
Match 1
Full match 0-15 `2038 W MAIN AVE`
Group 1. 7-11 `MAIN`
Match 2
Full match 16-43 `2038QWEW S JEFFERSON AVENUE`
Group 1. 27-36 `JEFFERSON`
Match 3
Full match 44-70 `33 NORTH CALIFORNIA STREET`
Group 1. 53-63 `CALIFORNIA`
Match 4
Full match 71-93 `53371 SOUTH WASHINGTON`
Group 1. 83-93 `WASHINGTON`
Match 5
Full match 94-119 `53371 S WASHINGTON AVENUE`
Group 1. 102-112 `WASHINGTON`
Match 6
Full match 120-143 `1600 E PENNSYLVANIA AVE`
Group 1. 127-139 `PENNSYLVANIA`
Match 7
Full match 144-157 `WEST9 67ST ST`
Group 1. 150-154 `67ST`
Match 8
Full match 158-176 `E171 N 23RD STREET`
Group 1. 165-169 `23RD`
Match 9
Full match 177-195 `G171 N121ST STREET`
Group 1. 183-188 `121ST`
https://regex101.com/r/m2rmUQ/4
I was able to figure this out in a slightly different way
[0-9A-Z]* [0-9A-Z]*$
and then I simply split the matched string on the space. Maybe one or two steps too many, but it's transparent.

Regex for date and 3-letter code

I've been trying to create a regex that will parse a date and a 3-letter code from a somewhat longer message. Here are examples of the messages and what I want to get:
AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX -> PEK, 14JUN18/1654
XXX/WWWW BY 05JUL 0900 BKK LT ELSE BKG WILL BE QQQQ -> BKK, 05JUL 0900
TO AZ BY 02AUG 1910 TYO OR AZ WWWW WILL BE XXX -> TYO, 02AUG 1910
BY TYO20JUL18/0355 OR CXL CA ALL QQQ -> TYO, 20JUL18/0355
BY AMS04JUL18/1954 OR CXL MF 812 L07JUL -> AMS, 04JUL18/1954
I want to be able to match the 3-letter code and the date in every message. The code is always next to the date but can come before or after it. Also, the date part can be with or without a year. Is it possible to have one regex for all the above examples?
I started with this:
(\s[A-Z]{3}\d\d|\d\d[A-Z]{3}\s)
(https://regex101.com/r/LPLjgf/1) but it's not working as it should and I'm not very experienced with regex to be honest.
EDIT:
Actually I would need to use only the 3-letter codes but I need them to be connected with a date - for example in:
AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX
the AAA, BBB or QQQ shouldn't match because they aren't right before or after the date, as PEK is.
Same with BY TYO20JUL18/0355 OR CXL CA ALL QQQ -> only TYO should match because it's before a date and CXL shouldn't.
You may use the following pattern:
([A-Z]{3})(\d{2}[A-Z]{3}\d{2}\/\d{4})|(\d{2}[A-Z]{3} \d{4}) ([A-Z]{3})
([A-Z]{3}) Capturing group for three capital letters
(\d{2}[A-Z]{3}\d{2}\/\d{4}) Capturing group for two digits, three upper case letters, two digits, /, four digits.
| Logical OR, alternates pattern.
(\d{2}[A-Z]{3} \d{4}) Capturing group. Captures two digits, three upper case letters, whitespace and four digits.
([A-Z]{3}) Capturing group for three upper case letters.
Captured groups:
Group 1. 14-17 `PEK`
Group 2. 17-29 `14JUN18/1654`
Group 3. 83-93 `05JUL 0900`
Group 4. 94-97 `BKK`
Group 3. 151-161 `02AUG 1910`
Group 4. 162-165 `TYO`
Group 1. 211-214 `TYO`
Group 2. 214-226 `20JUL18/0355`
Group 1. 269-272 `AMS`
Group 2. 272-284 `04JUL18/1954`
Group 1. 342-345 `PEK`
Group 2. 345-357 `14JUN18/1654`
Group 1. 378-381 `TYO`
Group 2. 381-393 `20JUL18/0355`
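Because the two alternatives fill different groups, the caller has to check which side matched. A minimal Python sketch of that step, using the pattern from this answer (extract_code_and_date is a hypothetical helper name):

```python
import re

# Left branch: CODE + DDMMMYY/HHMM. Right branch: DDMMM HHMM + space + CODE.
CODE_DATE_RE = re.compile(
    r"([A-Z]{3})(\d{2}[A-Z]{3}\d{2}/\d{4})|(\d{2}[A-Z]{3} \d{4}) ([A-Z]{3})"
)

def extract_code_and_date(message):
    """Return (code, date) from whichever alternative matched, else None."""
    m = CODE_DATE_RE.search(message)
    if m is None:
        return None
    if m.group(1) is not None:           # code came before the date
        return m.group(1), m.group(2)
    return m.group(4), m.group(3)        # code came after the date

print(extract_code_and_date("AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX"))
# ('PEK', '14JUN18/1654')
print(extract_code_and_date("XXX/WWWW BY 05JUL 0900 BKK LT ELSE BKG WILL BE QQQQ"))
# ('BKK', '05JUL 0900')
```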
Firstly
(\s[A-Z]{3}\d\d|\d\d[A-Z]{3}\s)
The alternation (|) means the pattern will match either \s[A-Z]{3}\d\d or \d\d[A-Z]{3}\s, which is certainly not what you want. To narrow the scope of an alternation, use grouping.
I would think you want to match this fairly directly:
([A-Z]{3})\d{2}[A-Z]{3}\d{2}
And that only captures the three letters in a group.
Try the following RegEx:
[A-Z]{3}(\d{2}[A-Z]{3}[\S]*)|(\d{2}[A-Z]{3}\s\d{4}\s[A-Z]{3})
It will match 3 letters followed by 2 digits and 3 letters, OR 2 digits followed by 3 letters, a space, 4 digits, a space and 3 letters.

Matching String regex with Optional (foobar)

I have this data:
/blabla/blabla (abs,def)
/yxz
I use this regex
(.*)(?:\(([^$]*)\))?\n
But it doesn't work, and I don't know what's wrong.
I need the first "directory" and, optionally, the information in "()".
Try an online regex tester (e.g. http://www.rubular.com/) to test it on your own. Many of them highlight matches, so you can refine your regex as you go.
This regex extracts the first directory in group 1 and anything between the () optionally:
/([^/]*)(?:\((.*?)\)|.)*
Let me know if this works or need some assistance.
Match 1: /blabla/blabla (abs,def) 0 24
Group 1: blabla 1 6
Group 2: abs,def 16 7
Match 2: /yxz 28 4
Group 1: yxz 29 3
Group 2 did not participate in the match
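The result listed above can be reproduced with a short Python sketch (first_dir_and_args is a made-up helper name). In Python, a group that did not participate in the match is reported as None.

```python
import re

# Group 1: first path segment. Group 2: text inside "(...)", if present.
PATH_RE = re.compile(r"/([^/]*)(?:\((.*?)\)|.)*")

def first_dir_and_args(line):
    """Return (first_directory, parenthesised_text_or_None), or None if no match."""
    m = PATH_RE.match(line)
    return (m.group(1), m.group(2)) if m else None

print(first_dir_and_args("/blabla/blabla (abs,def)"))  # ('blabla', 'abs,def')
print(first_dir_and_args("/yxz"))                      # ('yxz', None)
```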
edit for quick joe
something like this, maybe? ([^(\n]+)(?:\(([^)]*)\))?