Refine a regex with group and lookaround


I have the following pattern:
NAME FISCAL 20394
I need to figure out whether or not after the name field is the owner's first and last name, or, for example, a company name, and if present, extract the data.
I report some explanatory examples:
NAME MARY POPPINS FISCAL 20394
NAME MARY JANE POPPINS FISCAL 20394
NAME SNOWFLAKE INC. FISCAL 20394
I've tried with the following regex:
NAME ([A-Za-z0-9]).*(?=FISCAL)
but in this case I am only able to recognize if there is text between NAME and FISCAL but not to extract (via group) the whole name (regardless of whether it consists of one or more words).
Would you be able to help me refine the regex?

You can use
NAME\s+\K\S.*?(?=\s+FISCAL\b)
See the regex demo. Or, if you can do with a group:
NAME\s+(\S.*?)\s+FISCAL\b
See this regex demo.
Details:
NAME - a literal string
\s+ - one or more whitespaces
(\S.*?) - Group 1: a non-whitespace and then zero or more chars other than line break chars as few as possible
\s+ - one or more whitespaces
FISCAL - a fixed text
\b - word boundary.
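
The question does not name a language, so the following is only a minimal sketch in Python of how the group-based pattern could be applied to the sample lines. The helper name extract_name is hypothetical, and the \K variant is left out because Python's standard re module does not support \K.

```python
import re

# Group-based pattern from the answer above.
NAME_RE = re.compile(r'NAME\s+(\S.*?)\s+FISCAL\b')

def extract_name(line):
    """Return the text between NAME and FISCAL, or None when nothing is there."""
    m = NAME_RE.search(line)
    return m.group(1) if m else None

samples = [
    "NAME FISCAL 20394",
    "NAME MARY POPPINS FISCAL 20394",
    "NAME MARY JANE POPPINS FISCAL 20394",
    "NAME SNOWFLAKE INC. FISCAL 20394",
]
for line in samples:
    print(repr(extract_name(line)))
```

A None result means the name field is empty; otherwise the whole name comes back as one string, however many words it has.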

Related

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The name line will always be prefixed with Name:.
My regex to get the full name, (?<=Name:\s)(.*), works fine for the whole name. This one, (?<=Name:\s)([a-zA-Z]+), seems to get only the first name.
Ideally I would have one expression each for the first, middle and last name. Could someone guide me in the right direction?
Thank you

You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you got the full name, you can apply split on the result, and get the names on a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.

As you stated in the comments that you use .NET you can make use of a quantifier in the lookbehind to select which part of a "word" you want to select after Name:
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo
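
The lookbehind above relies on .NET allowing quantifiers inside a lookbehind. Python's standard re module only accepts fixed-width lookbehinds, so a rough Python counterpart would consume the skipped words and capture the wanted one in a group instead. This is a sketch of that swapped-in approach, not the .NET solution itself; name_part is a hypothetical helper.

```python
import re

def name_part(text, index):
    """Return the word at 0-based position `index` after 'Name:', or None."""
    # The skipped words are matched (not looked behind), and the wanted
    # word is captured in group 1.
    pattern = r'\bName:(?:\s+\w+){%d}\s+(\w+)' % index
    m = re.search(pattern, text)
    return m.group(1) if m else None

text = "ID Number / 1234\nName: John Doe Smith\nNationality: US"
print([name_part(text, i) for i in range(3)])
# ['John', 'Doe', 'Smith']
```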

How to avoid string based on prefix using regular expression

I am using a regular expression to identify names in a student file. A name may have a prefix such as 'MR' or 'MRS', or no prefix at all, for example 'MR GEORGE 51', 'MRS GEORGE 52' or 'GEORGE 53'.
I want to extract 53 only from 'GEORGE 53' (the last of these three), meaning 'MR GEORGE 51' and 'MRS GEORGE 52' should not match. Note: the number can change, it is an age.
I know a little about regular expressions and tried patterns like '[^M][^R]' and '[^M][^R][^S]' to extract the age only when there is no 'MR' or 'MRS' prefix in the string. I understand I could do this in a Python program with some conditions, but I would like to know whether a regular expression can do the same.

The [^M][^R] pattern matches any char but M followed with any char but R. Thus, you may actually reject valid matches if they are SR or ME, for example.
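
A quick sketch of that point in Python, using two-letter test strings chosen only for illustration:

```python
import re

# Each bracket negates a single character at one position,
# so this is not a "not the word MR" test.
naive = re.compile(r'[^M][^R]')

results = {s: bool(naive.fullmatch(s)) for s in ['MR', 'SR', 'ME', 'GE']}
print(results)
# {'MR': False, 'SR': False, 'ME': False, 'GE': True}
```

SR and ME are rejected although neither is MR, which is why the lookbehind approach is used instead.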
You may use
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I)
See the regex demo. To grab the name and age into separate tuple items capture them:
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I)
Details
\b - word boundary
(?<!\bmr\s) - no mr + space right before the current location
(?<!\bmrs\s) - no mrs + space right before the current location
(\S+) - Group 1: one or more non-whitespace chars
\s+ - 1+ whitespaces
(\d{1,2}) - Group 2: one or two digits
\b - word boundary
The re.I is the case insensitive modifier.
Python demo:
import re
text="for an example 'MR GEORGE 51' or 'MRS GEORGE 52' or 'GEORGE 53'"
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I))
# => ['GEORGE 53']
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I))
# => [('GEORGE', '53')]

Regex capturing the first occurrence of every group in a recurring pattern

Suppose I have the following text:
Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity
I have a regex (a bit more complex, but it boils down to this):
^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$
which has three capturing groups, that can capture the values of Name, Address and City (if they occur in the text). A few more examples are here: https://regex101.com/r/37nemH/6. EDIT The ordering is not fixed beforehand, and it could also happen that the fields are not separated by \t characters.
Now this all works well, the only slight problem I have is when one field occurs twice in the same text, as can be seen in the last example I put on regex101:
Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address
What I would want is for the second capturing group to match the first address, i.e. Street 123 ABC, and preferably to let the second occurrence be matched within the "City" group, i.e.
1: John Doe
2: Street 123 ABC
3: MyCity\tAddress: Other Address
Conceptually, I tried doing this with a negative lookbehind, e.g. replacing (?:Address: (.+?)) with (?:(?<!.*Address: )Address: (.+?)), i.e. ensuring that an Address: match was not preceded somewhere in the text by another Address: tag. But negative lookbehind does not allow for arbitrary length, so this obviously would not work.
Can this be achieved using regex, and how?

For your stated problem, you may use this regex with a conditional construct:
^.*?(?:(?:Name: (.+?)|(Address: )(.+?)|City: ((?(2).*?Address: )*.+?))\t*)+$
RegEx Demo
Your values are available in captured groups 1, 3, 4.
Capture group 2 is for literal label "Address: ".
Here, (?(2).*?Address: )* is a conditional construct that means if captured group 2 is present then in group 4 match text till next Address: is found (0 or more matches of this).
For the text Name: John Doe Address: Street 123 ABC City: MyCity Address: Second address, it will have following matches:
Group 1. 169-177 `John Doe`
Group 2. 178-187 `Address: `
Group 3. 187-201 `Street 123 ABC`
Group 4. 210-240 `MyCity Address: Second address`

If the word order can be any and some or all the items can be missing, it is much easier to use 3 separate patterns to extract the bits you need.
Name (demo):
^.*?Name:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
City (demo):
^.*?City:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
Address (demo):
^.*?Address:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
Details
^ - start of string
.*? - any 0+ chars other than line break chars, as few as possible
Address: - a keyword to stop at and look for the expected match
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible...
(?=\s*(?:Name:|Address:|City:|$)) - up to but excluding 0 or more whitespaces followed with Name:, Address:, City: or end of string.
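
As a minimal sketch, the three patterns could be generated from one template in Python. The language, the LABELS tuple and the first_value helper are assumptions made for illustration; the question does not say which engine is in use.

```python
import re

LABELS = ("Name", "Address", "City")  # assumed set of field labels

def first_value(label, text):
    """Return the value after the first occurrence of `label:`, or None."""
    stop = "|".join(re.escape(l) + ":" for l in LABELS)
    # Same shape as the patterns above: lazy prefix, label, lazy value,
    # then a lookahead for the next label or the end of the string.
    pattern = r'^.*?%s:\s*(.*?)(?=\s*(?:%s|$))' % (re.escape(label), stop)
    m = re.search(pattern, text)
    return m.group(1) if m else None

text = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address"
for label in LABELS:
    print(label, '->', first_value(label, text))
```

Because the leading .*? is lazy, each call stops at the first occurrence of its label, so the repeated Address: does not override the first one.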

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?

In WebHarvy, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed with a chunk of whitespaces followed with 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once the match is found, Group 1 only contains n.
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
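
The overwritten group can be reproduced outside WebHarvy. The sketch below uses Python's re module rather than .NET, on the assumption that it is close enough to illustrate the point: a repeated capturing group keeps only its last iteration.

```python
import re

lines = ["New Houston, TX 77036-3320", "Texas, TX 77036-3320"]

# Original attempt: the group is repeated, so it ends up holding one char.
repeated = re.compile(r'(\w|\w\s\w)+(?=,\s\w{2})')
# Suggested fix: the whole city name sits inside a single group.
fixed = re.compile(r'(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}')

for line in lines:
    print(repr(repeated.search(line).group(1)), repr(fixed.search(line).group(1)))
```

For the first line the repeated group yields 'n' while the fixed pattern yields 'New Houston'.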

How to split a name into surname plus initials

I have a Postgres table containing names like "Smith, John Albert", and I need to create a view which has names like "Smith, J A". Postgres has some regex implementations I haven't seen elsewhere.
So far I've got
SELECT regexp_replace('Smith, John Albert', '\Y\w', '', 'g');
which returns
S, J A
So I'm thinking I need to find out how to make the replace start part-way into the source string.

The regex engine used in PostgreSQL is implemented with a software package written by Henry Spencer. It is not odd; it simply has its own advantages and peculiarities.
One of the differences from the usual NFA regex engines is the word boundary. Here, \Y matches a non-word boundary. The rest of the patterns you need are quite known ones.
So, you need to use '^(\w+)|\Y\w' pattern and a '\1' replacement.
Details:
^ - start of string anchor
(\w+) - Capturing group 1 matching 1+ word chars (this will be referred to with \1 from the replacement pattern)
| - or
\Y\w - a word char that is preceded with another word character.
The \1 is called a replacement numbered backreference, that just puts the value captured with Group 1 into the replacement result.

The original idea is by Wiktor Stribiżew:
SELECT regexp_replace('Smith, John Albert', '^(\w+)|\Y\w', '\1', 'g');
regexp_replace
----------------
Smith, J A
(1 row)
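
For readers who want to try the same idea outside PostgreSQL, here is a rough Python sketch. It assumes \B\w in Python's re plays the role that \Y\w plays above (a word char that does not sit at a word boundary); surname_plus_initials is a hypothetical helper.

```python
import re

def surname_plus_initials(name):
    # Keep the leading surname (group 1) as is; every other word char
    # that follows another word char is dropped.
    return re.sub(r'^(\w+)|\B\w', lambda m: m.group(1) or '', name)

print(surname_plus_initials('Smith, John Albert'))
# Smith, J A
```

The function replacement is used so that the second alternative, where group 1 did not take part in the match, is replaced with an empty string.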

As @bub suggested:
t=# SELECT concat(split_part('Smith, John Albert',',',1),',',regexp_replace(split_part('Smith, John Albert',',',2), '\Y\w', '', 'g'));
concat
------------
Smith, J A
(1 row)
