How to extract multiple names with capital letters in Google Sheets? - regex

I am trying to extract contact names from a data set; however, the names are compiled in one cell and not split up by first name, middle name, last name, email, etc.
I only need to get their names because I already have a data set only with their emails, NOT their names.
How do I extract multiple case-sensitive words and split into cells?
Here's what it looks like in one cell:
I've tried several formulas I found online, and this is the only one that comes close; however, it still extracts unnecessary lowercase letters, which I don't need. Please help, I'm no expert with these kinds of things.
=TRANSPOSE(SPLIT(TRIM(SUBSTITUTE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
A2,"\b\w[^A-z]*\b"," "),"\W+"," "),"[0-9]+","")," m "," "))," "))
I expect the first, middle, and last names to be split into new columns like this:
Tom Billy Claudia Downey Karen Dicky Steve Harvey
OR
Tom Billy Claudia Downey Karen Dicky Steve Harvey

=ARRAYFORMULA(TRIM(IFERROR(REGEXREPLACE(IFERROR(REGEXEXTRACT(IFERROR(SPLIT(A2:A,
CHAR(10))), "(.*) .*#")), "Mr. |Mrs. ", ""))))
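
For readers who prefer a scripting language, the same idea can be sketched in Python. This is only an illustration, assuming each line of the cell holds an optional title, then the name, then an address containing "#" as in the formula above. The function name and the sample cell are hypothetical, not taken from the question.

```python
import re

def names_from_cell(cell: str) -> list[str]:
    """Return the name part of each line in a multi-line cell."""
    names = []
    for line in cell.split("\n"):
        # Greedy (.*) keeps everything before the last space-separated
        # token that contains "#", mirroring "(.*) .*#" in the formula.
        match = re.search(r"(.*) .*#", line)
        if not match:
            continue  # skip lines without an address
        # Drop a leading "Mr. " or "Mrs. " title.
        name = re.sub(r"Mrs?\. ", "", match.group(1)).strip()
        names.append(name)
    return names

# Hypothetical sample cell with two contacts separated by a line break.
cell = "Mr. Tom Billy tom#example.com\nMrs. Claudia Downey cd#example.com"
print(names_from_cell(cell))
```

Each returned name could then be split on spaces to get one word per column, which is what SPLIT does in the sheet.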

This formula might help. I have added conditions to remove the email ID and the Mr./Ms. titles.
=TRANSPOSE(SPLIT(TRIM(SUBSTITUTE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(A2,
"([a-zA-Z0-9_\-\.]+)#([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})",""),
"\w+[\\.]+(?)",""),"\b\w[^A-z]*\b"," "),"\W+"," "),"[0-9]+","")," m "," "))," "))

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after them. Unfortunately, there are cases where the street information to be extracted is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET, which are separated from the country by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
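
As a rough check of the regex above, here is a minimal sketch in Python. Python's re module is not PCRE, but this pattern only uses features both engines share (lazy quantifier, lookahead, multiline anchors). The function name and the shortened sample text are assumptions for illustration.

```python
import re

# Same pattern as in the answer; MULTILINE makes ^ and $ work per line.
ADDRESS = re.compile(r"^(.*?)(?=\s+(USA|Canada)\s*$)", re.MULTILINE)

def street_parts(text: str) -> list[str]:
    """Return everything before the trailing country on each line."""
    return [m.group(1) for m in ADDRESS.finditer(text)]

sample = "\n".join([
    "12H MARKET ST. Canada",       # single space before the country
    "123 SW 4TH Street   USA",     # several spaces before the country
    "# USA",
    "1234 E MAIN STREET USA",
])
print(street_parts(sample))
```

Because the country sits in a lookahead, it is never part of the match, so the single-space and multi-space cases behave the same way.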

Extract data from dataset

I need to extract the title from each name but cannot understand how the code works. I have provided the code below:
combine = [traindata, testdata]
for dataset in combine:
    dataset["title"] = dataset["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
There is no error, but I need to understand how the above code works.
Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry
Moran, Mr. James
Above is the Name feature from the CSV file; dataset["title"] stores the title of each name, that is Mr, Miss, Master, etc.
Your code extracts the title from the name using the pandas.Series.str.extract function, which uses a regex.
pandas.Series.str.extract - Extract capture groups in the regex pat as columns in a DataFrame.
' ([A-Za-z]+)\.' is the regex pattern in your code. It finds the part of the string (here, Name) that consists of a space, one or more letters, and a following period.
[A-Za-z] - this part of the pattern matches a single letter in the ranges a-z and A-Z
+ - it states that there can be one or more such characters
\. - matches the literal . that follows the letters
( ) - the parentheses form the capture group, so only the letters (not the space or the .) are returned
An example is provided at the link above, where it extracts parts of a string and puts them in separate columns.
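
To see what the pattern captures, here is a small sketch that applies it to the sample names with Python's standard re module instead of pandas, so it runs without extra dependencies. The helper name extract_title is hypothetical. In pandas, str.extract with expand=False and a single capture group returns the same values as a Series, with NaN where nothing matches.

```python
import re

# Space, then one or more letters (captured), then a literal period.
TITLE = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name: str):
    """Return the first captured title, or None when there is no match."""
    match = TITLE.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]
print([extract_title(n) for n in names])
```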
I found this specific response with the link very helpful for learning how to use the str extract method, and how changing the value of expand from True to False returns the strings as a Series instead of DataFrame columns.

Regular Expression for comma separated names

Regular comma separated names would be easy to use regular expressions on, but my problem is: how would a regular expression distinguish between a list of names and a (last name, first name)?
This is the example I have:
Lawrence, Billy
Alex Newell, Jess Glynne, DJ Cassidy, Nile Rodgers
These are some examples of many that show up in a text file that I have, and I need to distinguish between them. Does anyone have a solution?
I thought about just counting the commas and distinguishing that way, but I also have examples like this:
Tisto, Sean Kingston & Flo Rida
This is the format (a list of artists), just to give you an idea of what I need in the end:
Lawrence, Billy
Alex Newell
Jess Glynne
DJ Cassidy
Nile Rodgers
Tisto
Sean Kingston
Flo Rida
To make it easier to parse, you could add some constraints. For example, you could require every name to consist of two words, and when one of the words is missing you could add a placeholder word as a filler. Then, when you parse the file, every two words form a name, and your delimiters are ' ', ',' and '&'.
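
One possible way to put the delimiter idea into practice is sketched below in Python. The rule used here is an assumption of this sketch and not part of the answer: a line with exactly two single-word parts and no '&' is treated as "last name, first name", and everything else is treated as a list. It will misread a list of two single-word artists, so treat it as a starting point only.

```python
import re

def split_artists(line: str) -> list[str]:
    """Split a line into artist names using ',' and '&' as delimiters."""
    parts = [p.strip() for p in re.split(r"[,&]", line) if p.strip()]
    # Heuristic (assumption): "Word, Word" without '&' is one person
    # written as "last name, first name", so keep the line whole.
    if len(parts) == 2 and "&" not in line and all(" " not in p for p in parts):
        return [line.strip()]
    return parts

for line in ["Lawrence, Billy",
             "Alex Newell, Jess Glynne, DJ Cassidy, Nile Rodgers",
             "Tisto, Sean Kingston & Flo Rida"]:
    print(split_artists(line))
```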

Separate last name and firstname using openoffice formula

I have a record like
Mr. James M. Heilbronner
Bryan Southwick
Ismael G. Pugeda PE
I want to extract the last name as the last word; in this example it should be
Heilbronner
Southwick
PE (I can just manually edit this)
and the rest should go into the first name
Mr. James M.
Bryan
Ismael G. Pugeda
=RIGHT(A2;LEN(A2)-FIND(" ";SUBSTITUTE(A2;" ";" ";LEN(A2)-LEN(SUBSTITUTE(A2;" ";"")))))
This is my formula for the last name, but it returns all the words after the first word.
Edit:
I have the solution for the last name; it's this formula:
=IF(ISERROR(FIND(" ";A2));A2;TRIM(RIGHT(A2;LEN(A2)-FIND("";SUBSTITUTE(A2;" ";"";LEN(A2)-LEN(SUBSTITUTE(A2;" ";"")))))))
The only problem is the first name.
Assuming Mr. Heilbronner resides in A2:
B2: =LEFT(A2;LEN(A2)-LEN(C2))
C2: =TRIM(RIGHT(SUBSTITUTE(A2;" ";REPT(" ";99));99))
both copied down to suit.
The basic concept is, I think, courtesy of Jerry Beaucaire: replace every space with lots of spaces, then chop off a chunk from the end and trim the surrounding spaces from it. Once you have the length of the 'surname', use that to limit the number of characters taken for the 'first name'.
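
The padding trick can be traced step by step in Python. This is only a sketch of the same logic as the two formulas above; the function name and the pad parameter are made up for illustration, and like the formulas it assumes the last word is shorter than the pad width.

```python
def split_last_word(full: str, pad: int = 99) -> tuple[str, str]:
    """Mimic the C2 and B2 formulas: return (first part, last word)."""
    # SUBSTITUTE(A2;" ";REPT(" ";99)): widen every space to `pad` spaces.
    padded = full.replace(" ", " " * pad)
    # TRIM(RIGHT(...;99)): the last `pad` characters hold only the last word.
    last = padded[-pad:].strip()
    # LEFT(A2;LEN(A2)-LEN(C2)): everything before the last word.
    first = full[: len(full) - len(last)]
    return first, last

for record in ["Mr. James M. Heilbronner", "Bryan Southwick", "Ismael G. Pugeda PE"]:
    print(split_last_word(record))
```

As with the B2 formula, the first part keeps its trailing space; wrapping it in TRIM (or calling strip() here) removes it.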

CSV - split full name into first and last name

I regularly need to process large lists of user data for our marketing emails. I get a lot of CSVs with full name and email address and need to split these full names into separate first name and last name values. For example:
John Smith,jsmith#gmail.com
Jane E Smith,jane-smith#example.com
Jeff B. SMith,jeff_b#demo.com
Joel K smith,joelK#demo.org
Mary Jane Smith,mjs#demo.co.uk
In all of these cases, I want Smith to go in the last name column and everything else into the first name column.
Basically, I'd like to look for the last space before the first comma and replace that last space with a comma. But, I'm lost on how to do this, so any suggestions would be greatly appreciated. Also, I'm using BBEdit to process the text file.
Try the following regex:
(.*?) (\b\w*\b)(,[^,]*$)
And the substitution:
$1,$2$3
After substitution, the data will be as follows:
John,Smith,jsmith#gmail.com
Jane E,Smith,jane-smith#example.com
Jeff B.,SMith,jeff_b#demo.com
Joel K,smith,joelK#demo.org
Mary Jane,Smith,mjs#demo.co.uk
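
If the files ever need to be processed outside BBEdit, the same search and replace can be sketched in Python. The pattern is the one from the answer, with Python's \1-style backreferences instead of $1. The function name is hypothetical, and the sketch assumes one record per line with no quoted commas.

```python
import re

# Lazy first group, then the last word before the final comma-separated field.
PATTERN = re.compile(r"(.*?) (\b\w*\b)(,[^,]*$)")

def split_name(line: str) -> str:
    """Replace the last space before the email field with a comma."""
    return PATTERN.sub(r"\1,\2\3", line)

rows = """John Smith,jsmith#gmail.com
Jane E Smith,jane-smith#example.com
Jeff B. SMith,jeff_b#demo.com"""

for row in rows.splitlines():
    print(split_name(row))
```

Lines whose name has no space are left unchanged, because the pattern requires at least one space before the last word.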