How to use REGEXEXTRACT to extract certain characters between two strings - regex

I am trying to extract a person's name between different characters. For example, the cells contain this information:
PATIENT: 2029985 - COLLINS, JUNIOR .
PATIENT: 1235231-02 - JERRY JR, PATRICK .
PATIENT: 986435--EXP-- - JULIUS, DANIEL .
PATIENT: 2021118-02 - DRED-HARRY, KEVIN .
My goal is to use one REGEXEXTRACT formula to get the following:
COLLINS, JUNIOR
JERRY JR, PATRICK
JULIUS, DANIEL
LOVE ALSTON, BRENDA
So far, I have come up with the formula:
=ARRAYFORMULA(REGEXEXTRACT(B3:B, "-(.*)\."))
where B3 contains the first entry.
Using that formula, I get:
COLLINS, JUNIOR
02 - JERRY JR, PATRICK
02 - LOVE-ALSTON, BRENDA
-EXP-- - JULIUS, DANIEL
02 - DRED-HARRY, KEVIN
I managed to get the first name right, but how do I go about extracting the rest?

You can use
=ARRAYFORMULA(REGEXEXTRACT(B3:B, "\s-\s+([^.]*?)\s*\."))
See the regex demo. Details:
\s-\s+ - a whitespace, -, one or more whitespaces
([^.]*?) - Group 1: zero or more chars other than a . as few as possible
\s* - zero or more whitespaces
\. - a . char.
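To check the pattern outside Sheets, here is a minimal Python sketch. It assumes Python's re module handles the constructs used here (\s, a negated character class, a lazy quantifier) the same way as the engine behind REGEXEXTRACT. The helper name extract_name is made up for illustration.

```python
import re

# Same pattern as in the Sheets formula above.
NAME = re.compile(r"\s-\s+([^.]*?)\s*\.")

def extract_name(cell):
    """Return the text captured by group 1, or None when nothing matches."""
    match = NAME.search(cell)
    return match.group(1) if match else None

cells = [
    "PATIENT: 2029985 - COLLINS, JUNIOR .",
    "PATIENT: 1235231-02 - JERRY JR, PATRICK .",
    "PATIENT: 986435--EXP-- - JULIUS, DANIEL .",
    "PATIENT: 2021118-02 - DRED-HARRY, KEVIN .",
]
for cell in cells:
    print(extract_name(cell))
```

The leading \s matters: hyphens inside 1235231-02, --EXP-- or DRED-HARRY have no whitespace in front of them, so the match can only start at the " - " separator.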

1st solution: With your shown samples, please try following regex.
^PATIENT:.*-\s+([^.]*?)\s*\.
Or try the following Google Sheets formula:
=ARRAYFORMULA(REGEXEXTRACT(B3:B, "^PATIENT:.*-\s+([^.]*?)\s*\."))
Explanation: Match a value that starts with PATIENT: and then greedily consume everything up to the last - that is followed by one or more whitespace characters. The single capturing group then lazily collects everything that is not a dot; it is followed by zero or more whitespace characters and a literal dot.
2nd solution: Using lazy match approach in regex, please try following regex.
.*?\s-\s([^.]*?)\s*\.
Google-sheet formula will be as follows:
=ARRAYFORMULA(REGEXEXTRACT(B3:B, ".*?\s-\s([^.]*?)\s*\."))
Online demo for above regex
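The two solutions agree on the sample data but can differ on other input. The Python sketch below compares them, assuming Python's re behaves like the Sheets engine for these patterns. The last record is hypothetical (it is not in the question) and is only there to show where greedy and lazy matching part ways.

```python
import re

# 1st solution: greedy .* runs to the LAST "-" that is followed by whitespace.
GREEDY = re.compile(r"^PATIENT:.*-\s+([^.]*?)\s*\.")
# 2nd solution: lazy .*? stops at the FIRST " - " separator.
LAZY = re.compile(r".*?\s-\s([^.]*?)\s*\.")

def extract(pattern, cell):
    """Return group 1 of the first match, or None."""
    match = pattern.search(cell)
    return match.group(1) if match else None

sample = "PATIENT: 2021118-02 - DRED-HARRY, KEVIN ."
hypothetical = "PATIENT: 1 - SMITH - JONES, AL ."  # made-up record

for cell in (sample, hypothetical):
    print(extract(GREEDY, cell), "|", extract(LAZY, cell))
```

If a name could itself contain " - ", the lazy version keeps the whole name while the greedy one keeps only the part after the last separator.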

Related

Regex for specific pattern and group issue

I'm using regex in Go and having a hard time capturing three names from the following statements:
Hello Planet Earth 2022 - R1 3 Hell John v Tom v Ford
Hello World 2022 - R2 3 Hell - John v Tom v Ford
I'm trying to group "John" "Tom" "Ford" with the following regex:
^(?i).+? . R\d 3 Hell (.+?)[,|v] (.+?)[,|v] (.+?)$
The issue is that it groups "- John" for second statement and I need "John" only.
Any ideas how it can be adjusted?
thanks
Not sure how correct it is, but having tested here, I came up with this...
^(?i).+? . R\d 3 Hell[\s-]* (.+?)[,v] (.+?)[,v] (.+?)$
As you seem to match single spaces, you can optionally match a dash and a space.
Note that the last .+ does not have to be non-greedy as it is the last part of the pattern, and the character class [,v] does not need a pipe char if you do not intend to match it as a character.
^(?i).+? . R\d 3 Hell (?:- )?(.+?) [,v] (.+?)[,v] (.+)
Regex demo
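The optional-dash idea can be tried quickly in Python. This is only a sketch, not the Go code from the question. Python's re wants the global (?i) flag at the very start of the pattern, so it is moved in front of ^. The separators are written as \s*[,v]\s+ (my own adjustment) so that no stray spaces end up in the groups. The function name split_names is hypothetical.

```python
import re

# (?:- )? swallows an optional "- " before the first name.
PATTERN = re.compile(
    r"(?i)^.+? . R\d 3 Hell (?:- )?(.+?)\s*[,v]\s+(.+?)\s*[,v]\s+(.+)$"
)

def split_names(line):
    """Return the three captured names, or None if the line does not match."""
    match = PATTERN.match(line)
    return list(match.groups()) if match else None

for line in (
    "Hello Planet Earth 2022 - R1 3 Hell John v Tom v Ford",
    "Hello World 2022 - R2 3 Hell - John v Tom v Ford",
):
    print(split_names(line))
```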

Match first and then all equal occurrences with regex

Let's say we have the string:
one day, when Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.
Is there a way with regex to match the first name, and then match all other occurrences of the same name in the string?
Given the name-matching regex /[A-Z][a-z]+/ (with /g maybe?), can the regex matcher be made to remember the first match, and then use that match EXACTLY for the rest of the string? Other subsequent matches of the name-matching regex should be ignored (except for Anne in the example).
The result would be (if matches are replaced with "Foo"):
one day, when Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.
Please ignore the fact that the sentence starts uncapitalized, or add an example that also handles this.
Using a script to get the first match and then using that as input for a second iteration works of course, but that's outside the scope of the question (which is limited to ONE regex expression).
The only way I could think of is with non-fixed width lookbehinds, for example through PyPI's regex module, and maybe JavaScript too? Either way, assuming a name is captured through [A-Z][a-z]+ as per your question, try:
\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)
See an online demo
\b([A-Z][a-z]+)\b - A 1st capture group capturing a name between two word-boundaries;
(?<=^[^A-Z]*\b\1\b.*) - A non-fixed width positive lookbehind to match start of line anchor followed by 0+ characters other than uppercase followed by the content of the 1st capture group and 0+ characters.
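Here is an example using PyPI's regex module:
import regex as re  # third-party "regex" package, not the standard-library re
s = 'Anne, Lisa and Paul went to the store, then Anne said to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.'
# Replace the first capitalized name and every later occurrence of that same name
s_new = re.sub(r'\b([A-Z][a-z]+)\b(?<=^[^A-Z]*\b\1\b.*)', 'Foo', s)
print(s_new)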
Prints:
Foo, Lisa and Paul went to the store, then Foo said to Paul: "I love Lisa!". Then Lisa laughed and kissed Foo.
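For comparison, the two-step approach that the question sets aside can be sketched with the standard library alone. It is not a single-regex answer, just a fallback for engines without variable-width lookbehind. The function name replace_first_name is made up for illustration.

```python
import re

def replace_first_name(text, replacement="Foo"):
    """Find the first capitalized word, then replace every occurrence of it."""
    first = re.search(r"\b[A-Z][a-z]+\b", text)
    if first is None:
        return text  # no name-like word at all
    # Escape the name so it is treated as literal text in the second pattern.
    name = re.escape(first.group(0))
    return re.sub(rf"\b{name}\b", replacement, text)

s = ('one day, when Anne, Lisa and Paul went to the store, then Anne said '
     'to Paul: "I love Lisa!". Then Lisa laughed and kissed Anne.')
print(replace_first_name(s))
```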

Find UPPERCASE names that end with a colon

I want to find UPPERCASE names that start with either a letter or a digit.
They may include digits, hyphens, periods and spaces, and they end with a colon.
MR. SMITH:
- CAPTAIN AMERICA:
ANT-MAN:
2014 NEBULA:
-2012 IRON MAN:
BOY 1:
1
00:00:07,174 --> 00:00:09,837
BARTON: Okay, hold on.
16
00:00:36,411 --> 00:00:37,527
- COOPER: Nice!
- LAURA: Nice throw, kiddo.
Here is my best attempt, but it is also selecting parts of the timecodes.
It's also not filling groups correctly.
Find: ([A-Z0-9])(?:([A-Z0-9\.\-\s]*)*\:+)
regex101
Thanks in advance!
Try this pattern: ([A-Z0-9][A-Z0-9. -]+)(?=:\B)
See Regex101 Demo
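To see why the timecodes drop out, here is a small Python sketch of the same pattern (Python's re accepts it unchanged). The lookahead (?=:\B) only accepts a colon that is not directly followed by a word character, so 00:00 fails while BARTON: followed by a space or line end passes. The sample text is a shortened copy of the question's data.

```python
import re

# Name: starts with A-Z or 0-9, continues with A-Z, 0-9, ".", " " or "-".
# The colon itself is only looked at, not captured.
SPEAKER = re.compile(r"([A-Z0-9][A-Z0-9. -]+)(?=:\B)")

SAMPLE = """MR. SMITH:
- CAPTAIN AMERICA:
-2012 IRON MAN:
1
00:00:07,174 --> 00:00:09,837
BARTON: Okay, hold on.
- COOPER: Nice!
"""

def speakers(text):
    """Return every uppercase name that is followed by a qualifying colon."""
    return SPEAKER.findall(text)

print(speakers(SAMPLE))
```

Note that the pattern needs at least two characters for a name, so a single-letter speaker would be skipped.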

Is there an easy way to distinguish Lastnames and Firstnames using RegEx in Notepad++

I have 20,000+ records to deal with, but multiple passes like the ones below are fine, unless of course all of it can be done in one super-efficient regex?? 🤔
Sample records:
ABBEY Chantelle - 08.11.1995 - A
ANAND Toni-Grace - 04.09.1999 - A
ADCOCK ALVEY James - 12.04.1992 - C
ADLINGTON-JONES Robin Jacob Sebastian - 15.02.1999 - B
AFZAL Kiera - 25.04.2000 - B
AHMED Nisar Abu Ben Adhem - 16.08.2002 - C
AIRE-DEANE Christopher-James - 06.01.1997 - B
AL-MISRI Yaqoob - 23.07.2004 - C
ASTER Lily-May - 01.04.2010 - B
McQUEEN Stephen - 02.02.2001 - A
Desired output:
ABBEY¬Chantelle¬08.11.1995¬A
ANAND¬Toni-Grace¬04.09.1999¬A
ADCOCK ALVEY¬James¬12.04.1992¬C
ADLINGTON-JONES¬Robin¬Jacob¬Sebastian¬15.02.1999¬B
AFZAL¬Kiera¬25.04.2000¬B
AHMED¬Nisar¬Abu¬Ben¬Adhem¬16.08.2002¬C
AIRE-DEANE¬Christopher-James¬06.01.1997¬B
AL-MISRI¬Yaqoob¬23.07.2004¬C
ASTER¬Lily-May¬01.04.2010¬B
McQUEEN¬Stephen¬02.02.2001¬A
First Pass:
Find: ^([A-Z]{2,20}-[A-Z]{2,20}) ([A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$
Replace: \1¬\2¬\3¬\4
Result:
AL-MISRI¬Yaqoob¬23.07.2004¬C
Second Pass:
Find: ^([A-Z]{2,20}) ([A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$
Replace: \1¬\2¬\3¬\4
Result:
ABBEY¬Chantelle¬08.11.1995¬A
AFZAL¬Kiera¬25.04.2000¬B
McQUEEN Stephen¬02.02.2001¬A
Third Pass:
Find: ^([A-Z]{2,20}) ([A-Za-z]{1,20}-[A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$
Replace: \1¬\2¬\3¬\4
Result:
ANAND¬Toni-Grace¬04.09.1999¬A
ASTER¬Lily-May¬01.04.2010¬B
Fourth Pass:
Find: ^([A-Z]{2,20}-[A-Z]{2,20}) ([A-Za-z]{1,20}-[A-Za-z]{1,20}) - ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])$
Replace: \1¬\2¬\3¬\4
Result:
AIRE-DEANE¬Christopher-James¬06.01.1997¬B
But the above Regexes can't account for these records 😢
ADCOCK ALVEY James - 12.04.1992 - C
ADLINGTON-JONES Robin Jacob Sebastian - 15.02.1999 - B
AHMED Nisar Abu Ben Adhem - 16.08.2002 - C
Notes:
All last names appear first [IN CAPITALS]; some may be hyphenated. First (and second or other middle) names come next in Title Case and MAY be hyphenated too
Match Case is Enabled in Notepad++ during the Search and Replace activity. None of the Names have an apostrophe (e.g. O'KEEFE), they have all been removed
Even if just the Names can be sorted, I can deal with the Dates and Suffixes separately, any help would be greatly appreciated as I'm still a novice to RegEx
I also apologise in advance if I have missed an existing solution, just in case I didn't select the correct tags or terminology during my searches on this site
I've checked this article; however, it didn't help to resolve my query: Regular expression for first and last name
Matching names is not easy because of all the possibilities, but for the given example data you could use a pattern with \G that matches the spaces and - separators in between, and replace them with ¬.
Use (?-i) or tick the Match case checkmark.
(?-i)(?:^(?:Mc)?[A-Z]+(?:[ -][A-Z]+)*|\G(?!^)[A-Z][a-z]+(?:-[A-Z][a-z]+)*|\d{2}\.\d{2}\.\d{4})\K -?\h*
Regex demo
This regexp works on almost all names (not McQUEEN, because it's not all caps):
(([A-Z]+[ \-]){1,})(([A-Z][a-z]+[ \-]){1,})\- ([0-9]{2}.[0-9]{2}.[0-9]{4}) - ([A|B|C])
Groups that can be used are \1 \3 \5 \6.
Link for demo: https://regex101.com/r/3LpI54/1
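If the records can be processed outside Notepad++, the same split can be sketched in Python with a different technique: peel off the date and suffix first, then treat the leading all-caps tokens (optionally starting with Mc) as the surname. This is not a translation of the regexes above. It assumes every record has the "names - dd.mm.yyyy - A/B/C" layout from the question, and the names RECORD, SURNAME_TOKEN and convert are made up for illustration.

```python
import re

# Whole record: names, then " - date - suffix".
RECORD = re.compile(
    r"^(?P<names>.+?) - (?P<date>\d{2}\.\d{2}\.\d{4}) - (?P<suffix>[ABC])$"
)
# One surname token: all caps, optional "Mc" prefix, optional hyphenated parts.
SURNAME_TOKEN = re.compile(r"(?:Mc)?[A-Z]+(?:-[A-Z]+)*")

def convert(line, sep="¬"):
    """Rewrite one record with sep between surname, given names, date, suffix."""
    match = RECORD.match(line)
    if match is None:
        return line  # leave unrecognised lines untouched
    tokens = match["names"].split()
    n = 0  # number of leading tokens that look like surname parts
    while n < len(tokens) and SURNAME_TOKEN.fullmatch(tokens[n]):
        n += 1
    if n == 0:
        return line
    surname = " ".join(tokens[:n])
    return sep.join([surname, *tokens[n:], match["date"], match["suffix"]])

print(convert("ADCOCK ALVEY James - 12.04.1992 - C"))
print(convert("McQUEEN Stephen - 02.02.2001 - A"))
```

One caveat of this approach: a given name written entirely in capitals (for example a single initial) would be counted as part of the surname.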

Regex to match a few possible strings with possible leading and/or trailing spaces

Let's say I have a string:
John Smith (auth.), Mary Smith, Richard Smith (eds.), Richie Jack (ed.), Jack Johnny (eds.)
I would like to match:
John Smith(auth.),Mary Smith,Richard Smith(eds.),Richie Jack(ed.),Jack Johnny(eds.)
I have come up with a regex, but I have a problem with the | (alternation) character because my string contains characters that have to be escaped, such as ( ) and . This is what I'm not able to deal with. My regex is:
\s+\((auth\.\)|\(eds\.\))?,\s+
EDIT: I think now that the most universal solution would be to assume that in () could be anything.
Try this:
\s*\((auth|eds?)?\.\)?,?\s*
\s+ means one or more whitespace characters
\s* means zero or more whitespace characters
Based on your comment, I modified the regex:
\s*((\([^)]*\))|,)\s*
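A quick way to check the modified regex is to substitute each match with its first capture group, which keeps the parenthesised tag or the comma and drops the whitespace around it. The sketch below uses Python's re. The question does not say which tool is used, so the \1 replacement syntax and the function name tighten are assumptions.

```python
import re

# Group 1 is either a "(...)" tag or a comma; \s* on both sides eats the spaces.
SEPARATOR = re.compile(r"\s*((\([^)]*\))|,)\s*")

def tighten(text):
    """Keep the captured token and remove the whitespace surrounding it."""
    return SEPARATOR.sub(r"\1", text)

s = ("John Smith (auth.), Mary Smith, Richard Smith (eds.), "
     "Richie Jack (ed.), Jack Johnny (eds.)")
print(tighten(s))
```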