How to split a name into surname plus initials - regex

I have a Postgres table containing names like "Smith, John Albert", and I need to create a view which has names like "Smith, J A". Postgres has some regex syntax I haven't seen elsewhere.
So far I've got
SELECT regexp_replace('Smith, John Albert', '\Y\w', '', 'g');
which returns
S, J A
So I'm thinking I need to find out how to make the replace start part-way into the source string.

The regex engine in PostgreSQL is based on a software package written by Henry Spencer. It is not odd; it simply has its own advantages and peculiarities.
One difference from the more common regex flavors is the word boundary syntax. Here, \Y matches at a non-word boundary (written \B in most other flavors). The rest of the pattern uses well-known constructs.
So, you need to use the '^(\w+)|\Y\w' pattern with a '\1' replacement.
Details:
^ - start of string anchor
(\w+) - Capturing group 1 matching 1+ word chars (this will be referred to with \1 from the replacement pattern)
| - or
\Y\w - a word char that is preceded by another word character.
The \1 is a numbered replacement backreference: it puts the value captured by Group 1 into the replacement result. When the \Y\w alternative matches instead, Group 1 is empty, so the matched character is simply removed.
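For readers who want to try the same idea outside PostgreSQL, here is a minimal Python sketch. It assumes Python 3.5 or later (where an unmatched group expands to an empty string in the replacement) and uses \B, Python's spelling of the non-word boundary, in place of \Y. The function name surname_and_initials is made up for illustration.

```python
import re

# Group 1 keeps the leading surname; \B\w matches any word character that
# directly follows another word character (Python's counterpart of \Y\w).
PATTERN = re.compile(r'^(\w+)|\B\w')

def surname_and_initials(name: str) -> str:
    # When the second alternative matches, group 1 is unset and \1 expands
    # to an empty string, so the matched character is dropped.
    return PATTERN.sub(r'\1', name)

print(surname_and_initials('Smith, John Albert'))  # Smith, J A
```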

The original idea is by Wiktor Stribiżew:
SELECT regexp_replace('Smith, John Albert', '^(\w+)|\Y\w', '\1', 'g');
regexp_replace
----------------
Smith, J A
(1 row)

As @bub suggested:
SELECT concat(split_part('Smith, John Albert',',',1),',',regexp_replace(split_part('Smith, John Albert',',',2), '\Y\w', '', 'g'));
concat
------------
Smith, J A
(1 row)
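The split-then-abbreviate idea can also be sketched without any regex. The following Python snippet is only an illustration: the function name is hypothetical, and it assumes every input contains exactly one comma separating the surname from the given names.

```python
def surname_and_initials_split(name: str) -> str:
    # Split at the first comma, similar to taking split_part(name, ',', 1)
    # and split_part(name, ',', 2) when there is exactly one comma.
    surname, _, given = name.partition(',')
    # Keep only the first letter of each given name.
    initials = ' '.join(part[0] for part in given.split())
    return f'{surname}, {initials}'

print(surname_and_initials_split('Smith, John Albert'))  # Smith, J A
```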

Related

Python Regex some name + US Address

I have these kind of strings:
WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474
LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046
I want to check that the strings I receive have the same format as the ones above.
Here's my regular expression:
[A-Z]+ [A-Z]+ [0-9]{3,4} [A-Z]+ [A-Z]{2,4} [A-Z]{2,4} [0-9]+ [A-Z]+ [A-Z]{2} [0-9]{5}-[0-9]{4}$
But my regular expression only matches the first example and does not match the second one.
Here's dawg's regex with capturing groups:
^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$
UPDATE
sorry, I forgot to remove non-capturing group at the end of dawg's regex...
Here's new regex without non-capturing group: regex101
Try this:
^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$
Explanation:
1. ^[A-Z]+[ \t]+[A-Z]+[ \t]+ - starting at the start of the line, two blocks of A-Z for the name (however, names are often more complicated...)
2. \d+.*[ \t]+[A-Z]{2}[ \t]+ - the leading number and the two-letter state code at the end delimit the full address; cities can contain spaces, such as 'Miami Beach'
3. \d{5}(?:-\d{4})?$ - zip code with optional -NNNN, followed by the end anchor
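A short Python sketch of this kind of validation follows. It is an illustration, not a general address parser: the names ADDRESS and parse_address are invented here, capturing groups are added for convenience, and the -NNNN part of the zip code is made optional with a trailing ?.

```python
import re

# Two-word name, street number, free-form street and city,
# two-letter state code, and a 5-digit zip with optional -NNNN.
ADDRESS = re.compile(
    r'^([A-Z]+[ \t]+[A-Z]+)[ \t]+'   # two blocks of A-Z for the name
    r'(\d+)[ \t]+(.*)[ \t]+'         # street number, then street and city
    r'([A-Z]{2})[ \t]+'              # two-letter state code
    r'(\d{5}(?:-\d{4})?)$'           # zip code, optionally followed by -NNNN
)

def parse_address(line: str):
    # Return the captured parts as a tuple, or None if the line does not fit.
    match = ADDRESS.match(line)
    return match.groups() if match else None

print(parse_address('LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046'))
```

Because .* is greedy and the state code and zip are anchored at the end, multi-word streets and cities all land in the third group.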

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with Name: prepended.
My regex (?<=Name:\s)(.*) works fine for getting the whole name, and (?<=Name:\s)([a-zA-Z]+) seems to get the first name.
Ideally I would like one expression each for the first, middle, and last name. Could someone point me in the right direction?
Thank you
You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you got the full name, you can apply split on the result, and get the names on a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
Since you stated in the comments that you use .NET, you can use a quantifier inside the lookbehind to select which "word" after Name: you want to match.
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non-whitespace chars instead of word characters only, you can use \S+ instead of \w+.
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
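Python's built-in re module only accepts fixed-width lookbehinds, so the pattern above cannot be used there unchanged. A possible workaround, sketched below, moves the skipped words into the main pattern and captures the wanted word in a group. The helper name nth_name is hypothetical, and [ \t]+ is used instead of \s+ as an assumption so that the match cannot run on into the next line.

```python
import re

def nth_name(text: str, n: int):
    # Skip n words after "Name:" and capture the next one (n is 0-based).
    # A capturing group replaces the variable-width lookbehind.
    pattern = r'\bName:(?:[ \t]+\w+){%d}[ \t]+(\w+)' % n
    match = re.search(pattern, text)
    return match.group(1) if match else None

text = 'ID Number / 1234\nName: John Doe Smith\nNationality: US'
print(nth_name(text, 0), nth_name(text, 1), nth_name(text, 2))  # John Doe Smith
```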

Handle initials in Postgresql

I'd like to join initials (at most two letters each) when they are separated by punctuation or spaces.
The following snippet handles almost everything, but I am having trouble joining initials that are separated by both punctuation and spaces. For instance, this works in other regex flavors, but not in PostgreSQL:
SELECT regexp_replace('R Z ELEMENTARY SCHOOL',
'(\b[A-Za-z]{1,2}\b)\s+\W*(?=[a-zA-Z]{1,2}\b)',
'\1')
The outcome should be "RZ ELEMENTARY SCHOOL". Other examples will include:
A & D ALTERNATIVE EDUCATION
J. & H. KNOWLEDGE DEVELOPMENT
A. - Z. EVOLUTION IN EDUCATION
The transformation should be as follows:
AD ALTERNATIVE EDUCATION
JH KNOWLEDGE DEVELOPMENT
AZ EVOLUTION IN EDUCATION
How to achieve this in Postgresql?
Thanks
Building on your current regex, I can recommend
SELECT REGEXP_REPLACE(
REGEXP_REPLACE('J. & H. KNOWLEDGE DEVELOPMENT', '\m([[:alpha:]]{1,2})\M\s*\W*(?=[[:alpha:]]{1,2}\M)', '\1'),
'^([[:alpha:]]+)\W+',
'\1 '
)
This yields
regexp_replace
JH KNOWLEDGE DEVELOPMENT
It is a two-step solution. The first regex matches
\m([[:alpha:]]{1,2})\M - a whole one- or two-letter word captured into Group 1 (\m is a leading word boundary, and \M is a trailing word boundary)
\s* - zero or more whitespaces
\W* - zero or more non-word chars
(?=[[:alpha:]]{1,2}\M) - a positive lookahead that requires a whole one or two letter word immediately to the right of the current position.
The match is replaced with the contents of Group 1 (\1).
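The second regex matches a run of letters at the start of the string and replaces the non-word chars that follow it with a single space. This cleans up punctuation left over from the first step, such as the dot in "JH.".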
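To experiment with the two steps outside the database, here is a rough Python translation. It is a sketch under assumptions: \m and \M are both written as \b, [[:alpha:]] is narrowed to [A-Za-z], count=1 imitates regexp_replace without the 'g' flag, and the function name join_initials is invented.

```python
import re

def join_initials(name: str) -> str:
    # Step 1: glue the first one- or two-letter word to the next one,
    # dropping the whitespace and punctuation in between (first match only).
    step1 = re.sub(
        r'\b([A-Za-z]{1,2})\b\s*\W*(?=[A-Za-z]{1,2}\b)', r'\1', name, count=1
    )
    # Step 2: replace whatever non-word chars follow the leading letters
    # with a single space.
    return re.sub(r'^([A-Za-z]+)\W+', r'\1 ', step1)

for sample in ('R Z ELEMENTARY SCHOOL', 'A & D ALTERNATIVE EDUCATION',
               'J. & H. KNOWLEDGE DEVELOPMENT', 'A. - Z. EVOLUTION IN EDUCATION'):
    print(join_initials(sample))
```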

How to avoid string based on prefix using regular expression

I am using regular expressions to identify names in a student file. A name may carry a prefix such as 'MR' or 'MRS', or no prefix at all, for example 'MR GEORGE 51', 'MRS GEORGE 52', or 'GEORGE 53'.
I want to extract only 53, from 'GEORGE 53' (the last of the three); 'MR GEORGE 51' and 'MRS GEORGE 52' should not match. Note: the number is an age, so it can change.
I know about regular expressions and I tried patterns like '[^M][^R]' and '[^M][^R][^S]' to extract the age only when the string has no 'MR' or 'MRS' prefix. I understand that I can achieve this in a Python program with some conditions, but I want to know whether a regular expression can do the same.
The [^M][^R] pattern matches any char but M followed by any char but R. Thus, you may actually reject valid matches such as SR or ME, for example.
You may use
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I)
To get the name and age as separate tuple items, capture them:
re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I)
Details
\b - word boundary
(?<!\bmr\s) - no mr + space right before the current location
(?<!\bmrs\s) - no mrs + space right before the current location
(\S+) - Group 1: one or more non-whitespace chars
\s+ - 1+ whitespaces
(\d{1,2}) - Group 2: one or two digits
\b - word boundary
The re.I is the case insensitive modifier.
Python demo:
import re
text="for an example 'MR GEORGE 51' or 'MRS GEORGE 52' or 'GEORGE 53'"
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)\S+\s+\d{1,2}\b', text, re.I))
# => ['GEORGE 53']
print(re.findall(r'\b(?<!\bmr\s)(?<!\bmrs\s)(\S+)\s+(\d{1,2})\b', text, re.I))
# => [('GEORGE', '53')]

Putting space in camel case string using regular expression

I am deriving my question from "add a space between two words".
Requirement: split a camel-case string and put a space just before each capital letter that is followed by a lowercase letter or by nothing. No space should be inserted between consecutive capital letters.
e.g. CSVFilesAreCoolButTXT should become CSV Files Are Cool But TXT
I derived a regular expression this way:
"LightPurple".replace(/([a-z])([A-Z])/, '$1 $2')
If you have more than 2 words, then you'll need to use the g flag, to match them all.
"LightPurpleCar".replace(/([a-z])([A-Z])/g, '$1 $2')
If you are trying to split words like CSVFile, then you might need to use this regexp instead:
"CSVFilesAreCool".replace(/([a-zA-Z])([A-Z])([a-z])/g, '$1 $2$3')
But it still does not meet my requirements.
var rex = /([A-Z])([A-Z])([a-z])|([a-z])([A-Z])/g;
"CSVFilesAreCoolButTXT".replace( rex, '$1$4 $2$3$5' );
// "CSV Files Are Cool But TXT"
And also
"CSVFilesAreCoolButTXTRules".replace( rex, '$1$4 $2$3$5' );
// "CSV Files Are Cool But TXT Rules"
The text of the subject string that matches the regex pattern will be replaced by the replacement string '$1$4 $2$3$5', where the $1, $2 etc. refer to the substrings matched by the pattern's capture groups ().
$1 refers to the substring matched by the first ([A-Z]) sub-pattern, and $3 refers to the substring matched by the first ([a-z]) sub-pattern etc.
Because of the alternation character |, a match has to come from either the ([A-Z])([A-Z])([a-z]) sub-pattern or the ([a-z])([A-Z]) sub-pattern, so whenever a match is made, several of the capture groups remain unmatched. These capture groups can still be referenced in the replacement string, but they have no effect on it: effectively, they reference an empty string.
The space in the replacement string ensures a space is inserted in the subject string every time a match is made (the trailing g flag means the regular expression engine will look for more than one match).
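The same alternation trick carries over to other languages. As a minimal sketch, here is a Python version; it assumes Python 3.5 or later, where unmatched groups expand to empty strings in the replacement, and the names CAMEL and space_camel_case are made up for this example.

```python
import re

# Either "upper, upper, lower" (end of an acronym) or "lower, upper".
CAMEL = re.compile(r'([A-Z])([A-Z])([a-z])|([a-z])([A-Z])')

def space_camel_case(s: str) -> str:
    # Only one alternative matches at a time; the groups of the other
    # alternative expand to empty strings, so the template covers both cases.
    return CAMEL.sub(r'\1\4 \2\3\5', s)

print(space_camel_case('CSVFilesAreCoolButTXT'))  # CSV Files Are Cool But TXT
```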
If the first character is always lowercase.
'camelCaseString'.replace(/([A-Z]+)/g, ' $1')
If the first character is uppercase.
'CamelCaseString'.replace(/([A-Z]+)/g, ' $1').replace(/^ /, '')
Splitting CamelCase with regex in .NET:
Regex.Replace(input, "((?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]))", " $1").Trim();
Example:
Regex.Replace("TheCapitalOfTheUAEIsAbuDhabi", "((?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]))", " $1").Trim();
Output:
The Capital Of The UAE Is Abu Dhabi
This worked for me
let camelCase = "CSVFilesAreCoolButTXTRules"
let re = /[A-Z-_\&](?=[a-z0-9]+)|[A-Z-_\&]+(?![a-z0-9])/g
let delimited = camelCase.replace(re,' $&').trim()
The above code works for almost all the use cases I had. I had a few peculiarities where '&' and '_' should be treated as equivalent to an uppercase character.
ThisIsASlug ---> This Is A Slug
loremIpsum ---> lorem Ipsum
PAGS_US ---> PAGS_US
TheCapitalOfTheUAEIsAbuDhabi ---> The Capital Of The UAE Is Abu Dhabi
eclipseRCPExt ---> eclipse RCP Ext
VALUE ---> VALUE
SG&A ---> SG&A
A brief explanation
[A-Z-_\&](?=[a-z0-9]+)
//Matches normal words i.e. one uppercase followed by one or more non-uppercase characters
[A-Z-_\&]+(?![a-z0-9])
//Matches acronyms & abbreviations i.e. a sequence of uppercase characters that is not followed by a lowercase letter or a digit
Camel-case replacement for JavaScript using lookaheads:
"TheCapitalOfTheUAEIsAbuDhabi".replace(/([A-Z](?=[a-z]+)|[A-Z]+(?![a-z]))/g, ' $1').trim()
// "The Capital Of The UAE Is Abu Dhabi"
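For comparison, the lookahead variant translates almost literally to Python. The sketch below is illustrative only; WORD_START and split_camel are hypothetical names, and only ASCII letters are handled.

```python
import re

# A capital that starts a normal word, or a run of capitals (an acronym)
# that is not followed by a lowercase letter.
WORD_START = re.compile(r'[A-Z](?=[a-z])|[A-Z]+(?![a-z])')

def split_camel(s: str) -> str:
    # \g<0> is the whole match; strip() removes the space that gets
    # inserted before a leading capital.
    return WORD_START.sub(r' \g<0>', s).strip()

print(split_camel('TheCapitalOfTheUAEIsAbuDhabi'))  # The Capital Of The UAE Is Abu Dhabi
```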