I am using a regular expression to identify names in a student file. A name may carry a prefix such as 'MR' or 'MRS', or no prefix at all, for example 'MR GEORGE 51', 'MRS GEORGE 52' or 'GEORGE 53'.
I want to extract only 53, from 'GEORGE 53' (the last of the three); 'MR GEORGE 51' and 'MRS GEORGE 52' should not match. Note: the number is an age, so it can change.
I do know about regular expressions, and I tried patterns like '[^M][^R]' and '[^M][^R][^S]' to extract the age only when the string has no 'MR' or 'MRS' prefix. I understand I can achieve this in a Python program with some conditions, but I want to know whether a regular expression can do the same.
The [^M][^R] pattern matches any char other than M followed by any char other than R. Thus, you may actually reject valid names that start with, for example, SR or ME.
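To make that point concrete, here is a minimal sketch (the two-letter sample strings are made up for illustration) showing that each negated class consumes one character on its own, so the pair does not mean "not the word MR":

```python
import re

# Each negated class matches exactly one character; together they do not
# express "the text is not MR".
naive = re.compile(r'[^M][^R]')

for sample in ('MR', 'SR', 'ME', 'GE'):
    print(sample, bool(naive.fullmatch(sample)))
# MR False
# SR False  (rejected only because the second char is R)
# ME False  (rejected only because the first char is M)
# GE True
```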
You may use
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I)
See the regex demo. To grab the name and age into separate tuple items capture them:
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I)
Details
\b - word boundary
(?<!\bmr\s) - no mr + space right before the current location
(?<!\bmrs\s) - no mrs + space right before the current location
(\S+) - Group 1: one or more non-whitespace chars
\s+ - 1+ whitespaces
(\d{1,2}) - Group 2: one or two digits
\b - word boundary
The re.I is the case insensitive modifier.
Python demo:
import re
text="for an example 'MR GEORGE 51' or 'MRS GEORGE 52' or 'GEORGE 53'"
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I))
# => ['GEORGE 53']
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I))
# => [('GEORGE', '53')]
I have strings like these:
WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474
LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046
I want to make sure that the strings I receive have the same format as the ones above.
Here's my regular expression:
[A-Z]+ [A-Z]+ [0-9]{3,4} [A-Z]+ [A-Z]{2,4} [A-Z]{2,4} [0-9]+ [A-Z]+ [A-Z]{2} [0-9]{5}-[0-9]{4}$
But my regular expression only matches the first example and does not match the second one.
Here's dawg's regex with capturing groups:
^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$
Here's the url.
UPDATE
sorry, I forgot to remove non-capturing group at the end of dawg's regex...
Here's new regex without non-capturing group: regex101
Try this:
^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$
Demo
Explanation:
1. ^[A-Z]+[ \t]+[A-Z]+[ \t]+    Starting at the start of the line,
                                two blocks of A-Z for the name
                                (however, names are often more complicated...)
2. \d+.*[ \t]+[A-Z]{2}[ \t]+    The full address runs from the leading number
                                to the two-letter state code at the end;
                                cities can contain spaces, such as 'Miami Beach'
3. \d{5}(?:-\d{4})?$            Zip code with optional -NNNN, then the end anchor
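As a quick check, the pattern can be run against both sample strings from the question. This is only a sketch: the name ADDRESS_RE is mine, and the ZIP+4 suffix is written as optional, `(?:-\d{4})?`, following the wording of step 3.

```python
import re

# Name block, then everything from the street number up to the two-letter
# state code, then a ZIP code with an optional -NNNN suffix.
ADDRESS_RE = re.compile(
    r'^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$'
)

samples = [
    'WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474',
    'LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046',
]

results = [bool(ADDRESS_RE.match(s)) for s in samples]
print(results)
# => [True, True]
```

Because the middle of the address is matched with `.*`, the number of street and city words no longer matters, which is what made the original pattern fail on the second string.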
I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The name will always be prefixed with Name:.
My regex to get the full name, (?<=Name:\s)(.*), works fine. This one, (?<=Name:\s)([a-zA-Z]+), seems to get only the first name.
So ideally I would like one expression each for the first, middle and last name. Could someone guide me in the right direction?
Thank you
You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you have the full name, you can split the result and get the names in a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
Since you stated in the comments that you use .NET, you can put a quantifier inside the lookbehind to select which "word" after Name: you want to match.
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo
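The pattern above depends on .NET accepting a quantifier inside a lookbehind. Python's built-in re module only supports fixed-width lookbehind, so a rough Python counterpart has to capture the wanted word instead. A minimal sketch, where name_part is a hypothetical helper and not part of any library:

```python
import re

def name_part(text, index):
    """Return the word at 0-based position `index` after 'Name:', or None."""
    # Skip `index` words with a quantified non-capturing group, then capture
    # the next word. The group replaces the variable-length lookbehind.
    pattern = r'\bName:(?:\s+\w+){%d}\s+(\w+)' % index
    match = re.search(pattern, text)
    return match.group(1) if match else None

text = "ID Number / 1234\nName: John Doe Smith\nNationality: US"

parts = [name_part(text, i) for i in range(3)]
print(parts)
# => ['John', 'Doe', 'Smith']
```

Note that \s also matches line breaks, so asking for a position past the last name would continue into the following line; restrict the whitespace class if that matters for your input.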
Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?
In WebHarvy, if a regex contains a capturing group, the group's contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed by a chunk of whitespace and 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once a match is found, Group 1 only contains the last character matched, n.
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
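The same behaviour can be reproduced outside WebHarvy. The sketch below uses Python's re module rather than the .NET engine, so treat it as an illustration only; the names REPEATED_GROUP and CITY_RE are mine.

```python
import re

# The pattern from the question: Group 1 is overwritten on every repetition,
# so it ends up holding only the last piece that was matched.
REPEATED_GROUP = re.compile(r'(\w|\w\s\w)+(?=,\s\w{2})')
last_piece = REPEATED_GROUP.search('New Houston, TX 77036-3320').group(1)
print(last_piece)
# => n

# The suggested pattern: one group around the whole city name.
CITY_RE = re.compile(r'(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}')

addresses = [
    "1234 Savoy Dr Ste 123\nNew Houston, TX 77036-3320",
    "1234 Savoy Dr Ste 510\nTexas, TX 77036-3320",
]
cities = [CITY_RE.search(a).group(1) for a in addresses]
print(cities)
# => ['New Houston', 'Texas']
```

Because [^\S\r\n]+ cannot cross a line break, the end of the street line is never glued onto the city name.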
I have a Postgres table containing names like "Smith, John Albert", and I need to create a view which has names like "Smith, J A". Postgres has some regex implementations I haven't seen elsewhere.
So far I've got
SELECT regexp_replace('Smith, John Albert', '\Y\w', '', 'g');
which returns
S, J A
So I'm thinking I need to find out how to make the replace start part-way into the source string.
The regex engine used in PostgreSQL is a software package written by Henry Spencer. It is not odd; it simply has its own advantages and peculiarities.
One of the differences from the usual NFA regex engines is the word boundary. Here, \Y matches a non-word boundary. The rest of the patterns you need are quite known ones.
So, you need to use '^(\w+)|\Y\w' pattern and a '\1' replacement.
Details:
^ - start of string anchor
(\w+) - Capturing group 1 matching 1+ word chars (this will be referred to with \1 from the replacement pattern)
| - or
\Y\w - a word char that is preceded with another word character.
The \1 is called a replacement numbered backreference, that just puts the value captured with Group 1 into the replacement result.
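For comparison outside PostgreSQL, the same idea can be sketched in Python, where \B plays the role of \Y (a position that is not a word boundary). The helper name abbreviate is hypothetical, and a replacement function is used so that an unmatched Group 1 simply becomes an empty string:

```python
import re

def abbreviate(name):
    # Keep the leading word (Group 1) as is; delete every word character
    # that directly follows another word character.
    return re.sub(r'^(\w+)|\B\w', lambda m: m.group(1) or '', name)

short = abbreviate('Smith, John Albert')
print(short)
# => Smith, J A
```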
The original idea is by Wiktor Stribiżew:
SELECT regexp_replace('Smith, John Albert', '^(\w+)|\Y\w', '\1', 'g');
regexp_replace
----------------
Smith, J A
(1 row)
As @bub suggested:
t=# SELECT concat(split_part('Smith, John Albert',',',1),',',regexp_replace(split_part('Smith, John Albert',',',2), '\Y\w', '', 'g'));
concat
------------
Smith, J A
(1 row)
I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex partly works, but I don't want the two constant strings JOHN and I to match. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
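A short Python check of the pattern against the sample text from the question (a sketch only; the variable names are mine):

```python
import re

# Uppercase tokens (optionally ending in one digit) or 2-3 digit numbers,
# except the whole words JOHN and I.
PATTERN = re.compile(r'\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b')

text = """JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match."""

matches = PATTERN.findall(text)
print(matches)
# => ['LONDON', 'PUB', 'CHICAGO', 'THIS1', '70']
```

The \b inside the lookahead matters: without it, any word that merely starts with I or JOHN would be excluded as well.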