Regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with Name: prepended.
My regex to get the full name, (?<=Name:\s)(.*), works fine and returns the whole name. This one, (?<=Name:\s)([a-zA-Z]+), seems to get the first name.
Ideally I would like one expression each for the first, middle and last name. Could someone guide me in the right direction?
Thank you

You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you have the full name, you can split the result and get the names as a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
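A minimal Python sketch that combines the two ideas above: the question's full-name lookbehind followed by a split. The function name split_full_name and the handling of missing middle or last names are assumptions for illustration, not part of the original answer.

```python
import re

def split_full_name(text):
    """Return (first, middle, last) taken from the 'Name:' line, or None."""
    # '.' does not match a newline, so (.*) stops at the end of the Name line.
    match = re.search(r'(?<=Name:\s)(.*)', text)
    if match is None:
        return None
    parts = match.group(1).split()
    if not parts:
        return None
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ''
    # Everything between the first and last word is treated as middle name(s).
    middle = ' '.join(parts[1:-1])
    return first, middle, last

record = "ID Number / 1234\nName: John Doe Smith\nNationality: US"
print(split_full_name(record))
```

Splitting after the match keeps the regex simple and still works when a record has no middle name or more than one.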

Since you stated in the comments that you use .NET, you can use a quantifier in the lookbehind to select which "word" after Name: you want to match.
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier; to get the first name, use {0}.)
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo
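For readers who are not on .NET: Python's built-in re module only accepts fixed-width lookbehinds, so the pattern above will not compile there. A rough equivalent, sketched under that assumption, moves the prefix out of the lookbehind and reads the wanted word from a capture group instead. The helper name name_part and its zero-based index parameter are hypothetical.

```python
import re

def name_part(text, index):
    """Return the word at zero-based position `index` after 'Name:', or None."""
    # Same structure as the .NET pattern, but the prefix is matched normally
    # and the target word is captured in group 1.
    pattern = r'\bName:(?:\s+\w+){%d}\s+(\w+)' % index
    match = re.search(pattern, text)
    return match.group(1) if match else None

line = 'Name: John Doe Smith'
print(name_part(line, 0), name_part(line, 1), name_part(line, 2))
```

Because \s also matches a newline, an out-of-range index on multi-line input could pick up a word from the following line.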

Related

Regex that matches two or three words, but does not capture the third if it is a specific word

I need to match a specific pattern but I'm unable to do it with regular expressions. I'm looking for people's name. It follows always the same patterns. Some combinations are:
Mr. Snow
Mr. John Snow
Mr. John Snow (Winterfall of the nord lands)
My problem comes when I sometimes have things like: Mr. Snow and Ms. Stark. It also captures the and. So I'm looking for a regular expression that does not capture the second name only if it is and. Here I'm looking for ["Mr. Snow", "Ms. Stark"].
My best try is as follows:
(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?).
Note that the second name is in a non-capturing group, because I was thinking of using a negative lookahead. But if I do that, the first word is not captured (because the entire name does not match), and I need it to be captured.
Any ideas?
Here is some text to fast check.
Here are my two cents:
\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b
See an online demo
\b - A word-boundary;
M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group to match an uppercase letter followed by lowercase letters and 0+ 2nd names concatenated through whitespace or hyphen;
\b - A word-boundary.
Since it is a person's name, you could also check that each word starts with an uppercase letter.
M[rs]\.\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?
See the regex demo
Matching names is difficult, see this page for a nice article:
Falsehoods Programmers Believe About Names.
For the examples that you have given, you might use:
\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*
Explanation
\b A word boundary
M[rs]\. Match either Mr or Ms followed by a dot (note to escape it)
(?: Non capture group
Match a space (or \s+ if you want to allow newlines)
(?!M[rs]\.|and ) Negative lookahead, assert that from the current position there is not Mr or Ms or and directly to the right
\w+ Match 1+ word characters
)* Close the non capture group and optionally repeat it
Regex demo
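A short Python sketch of the negative-lookahead pattern above, run against the examples from the question. The pattern is copied as-is; the function name find_titled_names is only an illustrative assumption.

```python
import re

# Title, then any number of space-separated words that are neither
# another title nor the word "and".
TITLED_NAME = re.compile(r'\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*')

def find_titled_names(text):
    """Return every 'Mr.'/'Ms.' name found in the text."""
    return TITLED_NAME.findall(text)

print(find_titled_names('Mr. Snow and Ms. Stark'))
print(find_titled_names('Mr. John Snow (Winterfall of the nord lands)'))
```

The lookahead is checked before each extra word, so the title and first name are kept even when the next word is rejected.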
This captures the first name in group 1 and the second in group 2 if the second name exists and is not and:
(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?
See live demo.
If you want to capture the title as group 1 and the names as groups 2 and 3, change the lookbehind to a capture group:
(M[rs]\.) (\w+)(?: (?!and)(\w+))?
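As a quick check of the three-group variant, here is a small Python sketch. The helper name parse_titles is an assumption; note that re.findall reports an optional group that did not take part in the match as an empty string.

```python
import re

# Group 1: title, group 2: first word, group 3: optional second word
# (skipped when it starts with "and").
TITLE_PARTS = re.compile(r'(M[rs]\.) (\w+)(?: (?!and)(\w+))?')

def parse_titles(text):
    """Return a list of (title, first, second) tuples."""
    return TITLE_PARTS.findall(text)

print(parse_titles('Mr. Snow and Ms. Stark'))
print(parse_titles('Mr. John Snow'))
```

Since (?!and) has no trailing space or word boundary, it also rejects any second word that merely begins with "and".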

Terminating match at multiple space in Regex Pattern

I am reading a text which is like this:
BROKER : 0012301 AB ABCDEF/ABC
VENDOR NUMBER: 511111 A/P NUMBER: 3134
VENDOR NAME: KING ARTHUR FLOURCO INC OUR INVOICE #: 553121117 DATE: 05/03/2021
I want to extract the field Vendor Name, Vendor Number. Hence I'm using the regex
(?<=:\s).[^\s]*
But this helps me to extract any field which doesn't have any white space. However, the fields having spaces in between aren't extracted properly like Vendor Name. How do I modify my regex pattern to fetch all fields? I've tried (?<=:\s).[^\s\s]* but that didn't work.
One option could be to match either VENDOR NAME or VENDOR NUMBER and capture what follows until the first encounter of 3 whitespace chars.
Note that \s can also match a newline.
\bVENDOR\s+(?:NAME|NUMBER):\s+(\S.*?)\s{3}
The pattern matches:
\bVENDOR\s+(?:NAME|NUMBER) A word boundary to prevent a partial match, then VENDOR, 1+ whitespace chars and then either NAME or NUMBER
:\s+ Match : and 1+ whitespace chars
(\S.*?) Capture group 1, match a non-whitespace char followed by as few chars as possible
\s{3} Match 3 whitespace chars
See a regex demo.
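A Python sketch of the pattern in use. The sample text below assumes that fields on a line are separated by three or more spaces; that spacing, the SAMPLE constant and the function name extract_vendor_fields are assumptions made for illustration.

```python
import re

VENDOR_FIELD = re.compile(r'\bVENDOR\s+(?:NAME|NUMBER):\s+(\S.*?)\s{3}')

# Hypothetical layout: runs of 3+ spaces between fields on the same line.
SAMPLE = (
    "BROKER : 0012301 AB ABCDEF/ABC\n"
    "VENDOR NUMBER: 511111          A/P NUMBER: 3134\n"
    "VENDOR NAME: KING ARTHUR FLOURCO INC     OUR INVOICE #: 553121117   DATE: 05/03/2021\n"
)

def extract_vendor_fields(text):
    """Return the VENDOR NUMBER / VENDOR NAME values in order of appearance."""
    return VENDOR_FIELD.findall(text)

print(extract_vendor_fields(SAMPLE))
```

A value that sits at the very end of a line would not be matched by this pattern, because the trailing \s{3} needs three whitespace chars after it.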

Regex to capture a group, but only if preceded by a string and followed by a string

There are a few examples of the 'typical' solution to the problem, here on SO and elsewhere, but we need help with a slightly different version.
We have a string such as the following
pies
bob likes,
larry likes,
harry likes
cakes
And with the following regexp
(?<=pies\n|\G,\n)(\w+ likes)
We capture the 'nnn likes' entries only when the string starts with pies, as expected. However, we also need the capture to fail if the string doesn't end with 'cakes', and our attempts at doing so have failed.
Link to the regex101: https://regex101.com/r/uDNWXN/1/
Any help appreciated.
I suggest adding an extra lookahead at the start, to make sure there is cakes in the string:
(?s)(?<=\G(?!^),\n|pies\n(?=.*?cakes))(\w+ likes)
See the regex demo (no match as expected, add some char on the last line to have a match).
Pattern details
(?s) - DOTALL/singleline modifier to let . match any chars including line breaks
(?<= - a positive lookbehind that requires the following immediately to the left of the current location:
\G(?!^),\n - right after the end of the previous match (but not at the start of the string), a comma and then a newline
| - or
pies\n(?=.*?cakes) - pies and a newline, followed by any 0+ chars, as few as possible, and then a cakes string
) - end of the lookbehind
(\w+ likes) - Group 1: any one or more letters, digits or underscores and then a space and likes.
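The \G anchor is not available in Python's built-in re module (the third-party regex module supports it). If you need something similar with the standard library only, a two-step approach is one option: first match the whole pies ... cakes block, then pull the entries out of it. This is a swapped-in technique, not the pattern above, and it is stricter in that cakes must come right after the list. The names BLOCK and likes_between are assumptions.

```python
import re

# Step 1: the whole block, from a "pies" line to a "cakes" line.
BLOCK = re.compile(r'^pies\n((?:\w+ likes,\n)*\w+ likes)\ncakes$', re.M)

def likes_between(text):
    """Return the 'nnn likes' entries, or [] if the block is not complete."""
    block = BLOCK.search(text)
    if block is None:
        return []
    # Step 2: extract each entry from the validated block.
    return re.findall(r'\w+ likes', block.group(1))

good = "pies\nbob likes,\nlarry likes,\nharry likes\ncakes"
bad = "pies\nbob likes,\nlarry likes,\nharry likes\n"
print(likes_between(good))
print(likes_between(bad))
```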

How to avoid string based on prefix using regular expression

I am using a regular expression to identify names in a student file. A name may have a prefix such as 'MR' or 'MRS', or no prefix at all, for example 'MR GEORGE 51', 'MRS GEORGE 52' or 'GEORGE 53'.
I want to extract 53 only from 'GEORGE 53' (the last of the three), meaning 'MR GEORGE 51' and 'MRS GEORGE 52' should not match. Note: the number can change; it is an age.
I know some regular expressions and I tried patterns like '[^M][^R]' and '[^M][^R][^S]' to extract the age only when the string has no 'MR' or 'MRS' prefix. I understand I can achieve this in a Python program with a condition, but I want to know whether a regular expression can do the same.
The [^M][^R] pattern matches any char but M followed with any char but R. Thus, you may actually reject valid matches if they are SR or ME, for example.
You may use
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I)
See the regex demo. To grab the name and age into separate tuple items capture them:
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I)
Details
\b - word boundary
(?<!\bmr\s) - no mr + space right before the current location
(?<!\bmrs\s) - no mrs + space right before the current location
(\S+) - Group 1: one or more non-whitespace chars
\s+ - 1+ whitespaces
(\d{1,2}) - Group 2: one or two digits
\b - word boundary
The re.I is the case insensitive modifier.
Python demo:
import re
text = "for an example 'MR GEORGE 51' or 'MRS GEORGE 52' or 'GEORGE 53'"
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I))
# => ['GEORGE 53']
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I))
# => [('GEORGE', '53')]

Regex - find string by excluding part of it

I have text: "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
Is it possible to find Lastname and phone?
Exclude address?
Address regex is this: "([A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}" (two words from capital, housenumber, comma, postal code and city)
I want result string to be "Walker +123456789012"
This should do what you need. It also doesn't assume three names, so it still works for entries where no middle name is present:
.*?(\w+)\s*(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}\s*(\+\d+)
.*?(\w+)\s* - Capture the last word before the whitespace before the address. .*? will lazily match anything up to the word preceding the address, but not capture. \s* will match the whitespace between the word and the address.
(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,} - your address regex but using a non-capturing group (?:)
\s*(\+\d+) - Captures the + and following numbers. \s* will match the whitespace between the address and the +.
I reused your address regex, but made the capture group non-capturing. Then we match the last word before the address (the last name) using (\w+), and the + and following numbers after the address using (\+\d+).
Here it is in action: https://regex101.com/r/YGiaJT/1
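A Python sketch of the same pattern, assembled from the question's address regex and joined into the requested result string. The names ADDRESS, PATTERN and lastname_and_phone are assumptions for illustration.

```python
import re

# The address regex from the question, with its group made non-capturing.
ADDRESS = (r'(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}'
           r'[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}')
PATTERN = re.compile(r'.*?(\w+)\s*' + ADDRESS + r'\s*(\+\d+)')

def lastname_and_phone(text):
    """Return 'Lastname +phone', or None if the text does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return ' '.join(match.groups())

text = "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
print(lastname_and_phone(text))
```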
You could use:
\w+\s+\w+\s+(\w+).*(\+\d+)
And your capture groups should match up pretty well with what you're trying to match...
Essentially this will "disregard" your first and second "words" (first / middle name) and then disregard EVERYTHING from in between until it finds a + then captures the digits after it.
Live example: https://regex101.com/r/MjJCSv/1
In theory if your last name and your address will always be separated by more than 1 space you can shorten this a little bit and write it as
(\w+)\s{2,}.*(\+\d+)
Live example of this functionality: https://regex101.com/r/vGGB4z/1
Example implementation of the latter in Java: http://ideone.com/RExAEO
You can use the following to capture just the surname and the phone number.
The first part ((\w+\s){3}) matches three words, each followed by a whitespace char; the group keeps the 3rd occurrence.
The second part (.+?) lazily matches everything up to the phone number, without capturing it.
The third part ((\+?\d+)$) captures an optional + (phone number prefix) and the rest of the phone number, up to the end of the string.
(\w+\s){3}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/4
But if the surname and the address are separated by more than 1 space, then you can use
(\w+)\s{2,}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/5
I've tested these expressions on the Java engine, and they give back the correct match.
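For comparison, a Python sketch of the first pattern from this answer, (\w+\s){3}.+?(\+?\d+)$. One detail worth seeing in code: the repeated group keeps its last iteration including the trailing whitespace, so the surname is stripped here. The function name surname_and_phone is an assumption.

```python
import re

THREE_WORDS = re.compile(r'(\w+\s){3}.+?(\+?\d+)$')

def surname_and_phone(text):
    """Return (surname, phone), or None if the text does not match."""
    match = THREE_WORDS.search(text)
    if match is None:
        return None
    # Group 1 holds the third word plus its trailing whitespace char.
    return match.group(1).strip(), match.group(2)

text = "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
print(surname_and_phone(text))
```

This pattern relies on the name having exactly three words; with only two names, the first word of the street would end up in group 1.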