Parse 'family' names into people + last name with regex - regex

Given the following string, I'd like to parse into a list of first names + a last name:
Peter-Paul, Mary & Joël Van der Winkel
(and the simpler versions)
I'm trying to work out if I can do this with a regex. I've got this far
(?:([^, &]+))[, &]*(?:([^, &]+))
But the problem here is that I'd like the last name to be captured in a different capture.
I suspect I'm beyond what's possible, but just in case...
UPDATE
Extracting captures from the group was new for me, so here's the (C#) code I used:
string familyName = "Peter-Paul, Mary & Joël Van der Winkel";
string firstperson = @"^(?<First>[-\w]+)"; // .NET syntax for a named capture
string lastname = @"\s+(?<Last>.*)";
string others = @"(?:(?:\s*[,|&]\s*)(?<Others>[-\w]+))*";
var reg = new Regex(firstperson + others + lastname);
var groups = reg.Match(familyName).Groups;
Console.WriteLine("LastName=" + groups["Last"].Value);
Console.WriteLine("First person=" + groups["First"].Value);
foreach(Capture firstname in groups["Others"].Captures)
Console.WriteLine("Other person=" + firstname.Value);
I had to tweak the accepted answer slightly to get it to cover cases such as:
Peter-Paul&Joseph Van der Winkel
Peter-Paul & Joseph Van der Winkel

Assuming a first name cannot be two words separated by a space (otherwise Peter Paul Van der Winkel cannot be parsed automatically), the following rules apply:
(first name), then any number of (, first name) or (& first name)
Everything left is the last name.
^([-\w]+)(?:(?:\s?[,|&]\s)([-\w]+)\s?)*(.*)
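The two rules above can be sketched in Python as well. Python's `re` module keeps only the last value of a repeated group, so this sketch captures the whole run of "other" first names in one group and splits it in a second step. The helper name `parse_family` is hypothetical, and the pattern is a slight variation on the one above that also tolerates a missing space around `,` or `&`.

```python
import re

# Assumption: a first name never contains a space, as stated above.
# Group 1: first person, group 2: the whole ", name" / "& name" run, group 3: last name.
FAMILY = re.compile(r"^([-\w]+)((?:\s*[,&]\s*[-\w]+)*)\s+(.*)$")


def parse_family(text):
    """Return (list_of_first_names, last_name), or None if the text does not fit."""
    m = FAMILY.match(text)
    if m is None:
        return None
    first, others, last = m.groups()
    # Second pass: pull each individual first name out of the repeated run.
    return [first] + re.findall(r"[-\w]+", others), last


print(parse_family("Peter-Paul, Mary & Joël Van der Winkel"))
```

In .NET the second pass is unnecessary because every repetition of a group stays available through `Captures`.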

Seems that this might do the trick:
((?:[^, &]+\s*[,&]+\s*)*[^, &]+)\s+([^,&]+)

Related

How to Extract people's last name start with "S" and first name not start with "S"

As the title shows, how do I capture a person whose:
Last name starts with the letter "S"
First name does NOT start with the letter "S"
The expression should match the entire last name, not just the first letter, and the first name should NOT be matched.
Input string is like the following:
(Last name) (First name)
Duncan, Jean
Schmidt, Paul
Sells, Simon
Martin, Jane
Smith, Peter
Stephens, Sheila
This is my regular expression:
/([S].+)(?:, [^S])/
Here is the result I have got:
Schmidt, P
Smith, P
The result includes the comma, the space, and the letter "P", which should be excluded.
The ideal match would be
Schmidt
Smith
You can try this pattern: ^S\w+(?=, [A-RT-Z]).
^S\w+ matches a word (a last name, in your case) that starts with S at the beginning of the line,
(?=, [A-RT-Z]) is a positive lookahead: it makes sure that what follows is a comma, a space, and a capital letter other than S ([A-RT-Z] covers all capitals except S), without making it part of the match.
Demo
I did something similar to catch the initials. I've just updated the code to fit your need. Check it:
// Requires: using System; using System.Linq;
public static void Main(string[] args)
{
    // Your code goes here
    Console.WriteLine(ValidateName("FirstName LastName", 'L'));
}

private static string ValidateName(string name, char letter)
{
    // Split the name on spaces
    string[] names = name.Split(new string[] {" "}, StringSplitOptions.RemoveEmptyEntries);
    if (names.Length > 0)
    {
        // Compare the upper-cased initials of the first and last word
        var firstInitial = names.First().ToUpper().First();
        var lastInitial = names.Last().ToUpper().First();
        if (!firstInitial.Equals(letter) && lastInitial.Equals(letter))
        {
            return names.Last();
        }
    }
    return string.Empty;
}
In your current regex you capture the last name in a capturing group and match the rest in a non-capturing group.
If you change your non-capturing group (?: into a positive lookahead (?= you would match only the last name.
([S].+)(?=, [^S]) or a bit shorter S.+(?=, [^S])
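As a rough illustration of the lookahead approach, here is a Python sketch run against the sample list from the question. It assumes the names arrive as one multi-line string, which is why `re.MULTILINE` is added so that `^` anchors at the start of each line.

```python
import re

names = """Duncan, Jean
Schmidt, Paul
Sells, Simon
Martin, Jane
Smith, Peter
Stephens, Sheila"""

# The lookahead checks for ", " plus a non-S character but does not consume it,
# so only the last name ends up in the match.
LAST_S = re.compile(r"^S\w+(?=, [^S])", re.MULTILINE)

matches = LAST_S.findall(names)
print(matches)  # ['Schmidt', 'Smith']
```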
Your regex worked for me fine
$array = ["Duncan, Jean","Schmidt, Paul","Sells, Simon","Martin, Jane","Smith, Peter","Stephens, Sheila"];
foreach($array as $el){
if(preg_match('/([S].+)(?:,)( [^S].+)/',$el,$matches))
echo $matches[2]."<br/>";
}
The Answer I got is
Paul
Peter

PowerShell Normalize List of Names

I have some really messed up names from a system that I'm trying to match First and Last names in AD. Just need to parse the strings. I have names such as :
Hagstrom, N.P., Ana (Analise)
Banas, R.N., Cynthia
Saltzmann, N.P., April
Lee, Christopher
Rajaram, Pharm.D., Sharmee
Goode Jr, John (Jack) L
Reyes, R.N., Meghan
Miller, M.S., Adrienne M
Chavez, Gabriela
Stevens, MS, CCC-SLP, Christopher
Lockwood Flores, R.N., Jessica
I have tried this, but for some reason, the GivenName isn't being returned properly.
$Name = "Saltzmann, N.P., April"
$GivenName = $Name.Split(",")[$Name.Split(",").GetUpperBound(0)]
$SN = $Name.Split(",")[0]
If ($SN.IndexOf("-") -gt -1) {
$HypenLast = $SN.Split("-")[0]
$SNName = $SN.Split("-")[1]
}
If ($GivenName.IndexOf(" ") -gt -1) {
$GivenName = $GivenName.Replace("(","").Replace(")","").Split(" ")[0]
$MiddleName =$GivenName.Replace("(","").Replace(")","").Split(" ")[1]
}
Trying to take everything before the first comma and everything after last comma, but take letters before the second space of the first name.
Trying to get LastName FirstName but then need to flip it to FirstName LastName. Thanks.
All of the names could be piped to a script block that uses a regex with some named capture groups. The named capture group values can be extracted to rebuild the name you need using string interpolation.
$nameList | ForEach-Object {
$match = [Text.RegularExpressions.Regex]::Match($_, "(?<last>[\w\s]+),(?:.*,)?(?:\s*)(?<first>\w+)")
$lastName = $match.Groups["last"].Value
$firstName = $match.Groups["first"].Value
"$firstName $lastName"
}
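The same named-group idea, sketched in Python for comparison. The function name `normalize` is hypothetical, and returning the input unchanged when nothing matches is an assumption of this sketch, not something the answer above specifies.

```python
import re

# Everything before the first comma is the last name; an optional middle part
# ending in a comma (titles such as "R.N.,") is skipped; the next word is the first name.
NAME = re.compile(r"(?P<last>[\w\s]+),(?:.*,)?\s*(?P<first>\w+)")


def normalize(raw):
    """Turn 'Last, [titles,] First ...' into 'First Last'."""
    m = NAME.search(raw)
    if m is None:
        return raw  # assumption: leave unparseable input untouched
    return f"{m.group('first')} {m.group('last').strip()}"


for raw in ["Saltzmann, N.P., April", "Goode Jr, John (Jack) L", "Stevens, MS, CCC-SLP, Christopher"]:
    print(normalize(raw))
```

Nicknames in parentheses and trailing middle initials are simply ignored, because the `first` group stops at the first non-word character.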

Replace last comma with or using ColdFusion

What is the best way to convert an array of values in ColdFusion
[ Fed Jones, John Smith, George King, Wilma Abby]
into a list where the last comma is an or
Fed Jones, John Smith, George King or Wilma Abby
I thought REReplace might work but haven't found the right expression yet.
If you've got an array, combining the last element with an ArrayToList is the simplest way (as per Henry's answer).
If you've got it as a string, using rereplace is a valid method, and would work like so:
<cfset Names = rereplace( Names , ',(?=[^,]+$)' , ' or ' ) />
Which says match a comma, then check (without matching) that there are no more commas until the end of the string (which of course will only apply for the last comma, and it will thus be replaced).
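For readers who want to try the lookahead outside ColdFusion, here is a small Python sketch of the same idea. The `\s*` after the comma is an addition in this sketch (not part of the expression above); it absorbs any space that follows the comma so the result does not end up with a double space. The function name `join_with_or` is hypothetical.

```python
import re


def join_with_or(text):
    """Replace the last comma (and any spaces after it) with ' or '."""
    # The lookahead requires that no further comma appears before the end of the string.
    return re.sub(r",\s*(?=[^,]+$)", " or ", text)


print(join_with_or("Fed Jones, John Smith, George King, Wilma Abby"))
```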
It'd be easier to manipulate in the array level first, before converting into a list.
names = ["Fed Jones", "John Smith", "George King", "Wilma Abby"];
lastIndex = arrayLen(names);
last = names[lastIndex];
arrayDeleteAt(names, lastIndex);
result = arrayToList(names, ", ") & " or " & last;
// result == "Fed Jones, John Smith, George King or Wilma Abby"
Another option is to work with a list / string, using listLast and the Java lastIndexOf() method of the result string.
<cfscript>
names = ["Fed Jones", "John Smith", "George King", "Wilma Abby"];
result = arraytoList(names,', ');
last = listLast(result);
result = listLen(result) gt 1 ? mid(result, 1, result.lastIndexOf(',')) & ' or' & last : result;
</cfscript>
<cfoutput>#result#</cfoutput>
Result:
Fed Jones, John Smith, George King or Wilma Abby

Matching the pattern with foreign character

Here I use a regular expression where _pattern is the list of team names and _name is the keyword I want to test against _pattern.
The result shows that it matched. I'm wondering how that is possible, because the keyword is totally different from _pattern. I suspect that it is related to the é symbol.
string _pattern = "Ipswich Town F.C.|Ipswich Town Football Club|Ipswich|The Blues||Town|The Tractor Boys|Ipswich Town";
string _name = "Estudiantes de Mérida";
regex = new Regex(@"(" + _pattern + @")", RegexOptions.IgnoreCase);
Match m = regex.Match(_name);
if (m.Success)
{
    var g = m.Groups[1].Value;
    break;
}
It has nothing to do with the é symbol. Let's go over a few things.
Is it intentional that there are two consecutive | characters in the pattern from your question?
The Blues||Town
The dot also has a special meaning in a regex, so you should escape it:
Ipswich Town F\.C\.
And alternatives should be enclosed in parentheses:
(Ipswich Town F.C.)|(Ipswich Town Football Club)|(Ipswich)|
The parentheses in the following C# line are not necessary:
regex = new Regex( @"(" + _pattern + @")"
Anyway, the reason it matches is not that the keyword resembles the pattern. The double || creates an empty alternative, and an empty alternative matches the empty string, so the match succeeds on any input.
The regex that I would rewrite for your purposes is:
^((Ipswich Town F\.C\.)|(Ipswich Town Football Club)|(Ipswich)|(The Blues)|(Town)|(The Tractor Boys)|(Ipswich Town))$
As you can see, there are quite a few differences.
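The effect of the doubled `|` is easy to reproduce. The following Python sketch (the variable names `raw`, `loose`, and `strict` are made up for illustration) first shows the unexpected match and then builds an anchored pattern from the same list, dropping empty entries and escaping the dots with `re.escape`.

```python
import re

raw = "Ipswich Town F.C.|Ipswich Town Football Club|Ipswich|The Blues||Town|The Tractor Boys|Ipswich Town"
name = "Estudiantes de Mérida"

# The "||" yields an empty alternative, which matches the empty string at position 0.
loose = re.search("(" + raw + ")", name, re.IGNORECASE)
print(repr(loose.group(1)))  # ''

# Drop empty entries, escape regex metacharacters, and anchor the whole alternation.
parts = [re.escape(p) for p in raw.split("|") if p]
strict = re.compile("^(?:" + "|".join(parts) + ")$", re.IGNORECASE)
print(strict.search(name))  # None
```

With the anchors in place, only a string that is exactly one of the listed names (ignoring case) is accepted.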

regex how can I split this word?

I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but on 50 records of varying length, my idea is a poor solution. Is there a way to regex in a way that will be more dynamic? Thanks
A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol - make sure this is fine with the way you use this type of thing
Finally, a single line version could be:
result = source.substring(0, 1).toUpperCase() + source.substring(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an Example similar to one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy
For the spacing problem, it's easy if your language supports zero-width look-behind:
var result = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or even if it doesn't support them:
var result2 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that supports replacement with $1...$n.
But you can't do the lowercase-to-uppercase conversion directly in the regex. You can match the first character with a regex like ^[a-z], but you can't convert it in the replacement string.
For example in C# you could do
var result3 = Regex.Replace(result, "^([a-z])", m =>
{
return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together
var result4 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
if (m.Groups[1].Success)
{
return m.ToString().ToUpperInvariant();
}
else
{
return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
}
});
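The fused, callback-based replacement carries over to other languages that accept a function as the replacement. A minimal Python sketch, with the hypothetical name `split_camel`, might look like the following. Like the C# version it only splits at a lowercase-to-uppercase boundary, so a run of capitals such as "IWant" would stay together.

```python
import re


def split_camel(phrase):
    """Capitalize the first letter and insert a space at each lower-to-upper boundary."""

    def repl(m):
        if m.group(1) is not None:
            # First alternative: the leading lowercase letter.
            return m.group(1).upper()
        # Second alternative: a lowercase letter followed by an uppercase letter.
        return m.group(2) + " " + m.group(3)

    return re.sub(r"^([a-z])|([a-z])([A-Z])", repl, phrase)


print(split_camel("thisIsAnExampleSentance"))
print(split_camel("hereIsAnotherExampleWithMoreWordsInIt"))
```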
A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;