I am looking for a way to remove a list of prefixes and suffixes from the name variable.
For example, under name I have
Mr. Walter White Jr.
and I wish to keep just Walter White.
I have a list of prefixes and suffixes that I can use as a reference.
Thanks in advance.
Use regular expressions with your list to replace them with blank values.
data have;
infile datalines dlm='|';
length name $20.;
input name$;
datalines;
Mr. Walter White Jr.
Mrs. Skyler White
Dr. Saul Goodman
Mr. Jesse Pinkman
Mr. Gus Fring
;
run;
data want;
set have;
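/* Word boundaries keep "Dr" from matching inside a name; the unescaped dot in "Mr." only matched "Mrs" by accident */
name = strip(prxchange('s/\b(Mr|Mrs|Dr|Jr)\b\.?//', -1, name));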
run;
Output:
name
Walter White
Skyler White
Saul Goodman
Jesse Pinkman
Gus Fring
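For readers working outside SAS, the same idea can be sketched in Python: build one pattern from the reference list and substitute the matches away. The `TITLES` list and the `strip_titles` helper are illustrative names, not part of the original answer, and the list contents are an assumption.

```python
import re

# Hypothetical reference list; extend it with your own prefixes and suffixes.
TITLES = ["Mr", "Mrs", "Ms", "Dr", "Jr", "Sr"]

# \b keeps "Dr" from matching inside "Drew"; \.? consumes an optional trailing period.
TITLE_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, TITLES)) + r")\b\.?",
    re.IGNORECASE,
)

def strip_titles(name: str) -> str:
    """Remove the listed titles, then collapse the whitespace left behind."""
    without_titles = TITLE_RE.sub("", name)
    return " ".join(without_titles.split())

names = ["Mr. Walter White Jr.", "Mrs. Skyler White", "Dr. Saul Goodman"]
cleaned = [strip_titles(n) for n in names]
print(cleaned)
```

Keeping the titles in a list means the pattern never has to be edited by hand when the reference list grows.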
I have this column in which I would like to keep only the names and remove everything from the opening ( onward. How can I achieve this?
Name Age
James 12
John (funny) 11
Jonathan 10
Alisa (134 cm) 12
Merlin (cheerful) 12
Jessica (hopeful) 12
Ali (quiet) 13
I have tried using functions such as compress, but it still didn't work:
data output;
length Name $30.;
infile datalines dlm=',';
input Name$ Age;
new = compress(name, '()');
datalines;
James,12
John (funny),11
Jonathan,10
Alisa (134 cm),12
Merlin (cheerful),12
Jessica (hopeful),12
Ali (quiet),13
;
Updating based on Tom's suggestion:
Use scan() and treat ( as a delimiter. This will pull all text before the first (.
new = scan(name, 1, '(', 'T');
The T modifier trims trailing blanks from the string and delimiter arguments before scanning.
You can use a Perl regular expression to replace parenthetical content with nothing:
name = prxchange ('s/\(.*?\)//', -1, name);
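The two suggestions above behave differently when text follows the closing parenthesis. A small Python sketch makes the difference visible; the function names are made up for illustration, and the "Ali (quiet) Khan" input is a hypothetical case that is not in the original data.

```python
import re

def name_before_paren(name: str) -> str:
    """Keep only the text before the first '(' (the scan-style approach)."""
    return name.split("(", 1)[0].strip()

def drop_parentheticals(name: str) -> str:
    """Remove each '(...)' group (the regex approach), then tidy the spaces."""
    return " ".join(re.sub(r"\(.*?\)", "", name).split())

rows = ["James", "John (funny)", "Alisa (134 cm)"]
before = [name_before_paren(r) for r in rows]
dropped = [drop_parentheticals(r) for r in rows]
print(before)
print(dropped)

# Hypothetical input with text after the parenthesis:
print(name_before_paren("Ali (quiet) Khan"))    # keeps only what precedes "("
print(drop_parentheticals("Ali (quiet) Khan"))  # keeps text on both sides
```

For the sample data both give the same names, so the choice only matters if something can follow the parenthesis.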
If you have a moment, I need some help extending my regex. I am validating a response in a Google Form for the user's full name.
The validation requires:
That only letters are used
That the user inputs both the first and second name (at a minimum), separated by a space
So far I have come up with:
[a-zA-Z ]+]
But this lacks the check for a minimum of two words in a given string.
After an hour of fails and googling, I have admitted defeat and need your help!
Thanks in advance.
This should do the job:
/^[a-z]{2,}( [a-z]+)*?( [a-z]{2,}){1,}$/i
It matches:
john smith ◄ all lowercase
John Smith
John P E Smith
John Paul E Smith
John Paul Edward Smith
It rejects:
John
John S
John Paul S
John Paul Edward S
J0hn Smith ◄ zero instead of the letter 'o'
John  Smith ◄ multiple spaces
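The accepted and rejected inputs listed above can be checked with a quick script. This is a sketch using Python's `re` module, which is an assumption: Google Forms uses its own regex engine, so treat this as a sanity check of the pattern only. The `is_full_name` helper is an illustrative name.

```python
import re

# Same pattern as above; the /i flag becomes re.IGNORECASE.
FULL_NAME_RE = re.compile(
    r"^[a-z]{2,}( [a-z]+)*?( [a-z]{2,}){1,}$",
    re.IGNORECASE,
)

def is_full_name(text: str) -> bool:
    """True when the whole string is two or more letter-only words."""
    return FULL_NAME_RE.fullmatch(text) is not None

accepted = ["john smith", "John Smith", "John P E Smith",
            "John Paul E Smith", "John Paul Edward Smith"]
rejected = ["John", "John S", "John Paul S", "John Paul Edward S",
            "J0hn Smith", "John  Smith"]

for text in accepted + rejected:
    print(repr(text), is_full_name(text))
```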
You can play with this fiddle.
Best regards
Background
I have the following df which is a modification of Alter text in pandas column based on names
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Doe works ',
'So is Mary Doe, works too',
'Jane Ann, Doe doesnt',
'Jone, Dow doesnt either'],
'P_ID': [1,2,3,4],
'P_Name' : ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone' ]
})
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
The following line blocks names like Jon J Doe, but it doesn't work when a character sits between the parts of a name, e.g. Jane Ann, Doe or Jone! Dow
df['NewText'] = df['Text'].replace(df['P_Name'].str.split(', *').apply(lambda l: ' '.join(l[::-1])),'**BLOCK**',regex=True)
Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt Jane Ann, Doe doesnt
3 4 Dow, Jone Jone,Dow doesnt either Jone, Dow doesnt either
Goal
1) Tweak the code above to account for , (or any other characters that may appear between the names)
(I know I can strip commas, but I need to leave them in.)
Desired Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt **BLOCK** doesnt
3 4 Dow, Jone Jone,Dow doesnt either **BLOCK** doesnt either
Question
How do I tweak my code to get my desired output?
I don't know how many such cases there are, but if the number is limited, you can list them by hand.
Sample DataSet:
>>> df
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
You can create a dict of replacements and apply it to the DataFrame to get the result.
>>> replace_values = {'Jon J Doe': '**BLOCK**', 'Mary Doe': '**BLOCK**', 'Jane Ann, Doe': '**BLOCK**', 'Jone, Dow': '**BLOCK**'}
Resulting DataFrame:
>>> df = df.replace(replace_values, regex=True)
>>> df
P_ID P_Name Text
0 1 Doe, Jon J **BLOCK** works
1 2 Doe, Mary So is **BLOCK**, works too
2 3 Doe, Jane Ann **BLOCK** doesnt
3 4 Dow, Jone **BLOCK** doesnt either
Try this:
df['NewText'] = df['Text'].replace(r'(' + df['P_Name'].str.split(r'\W+').str.join('|') + r'|\W+){3,}', ' **BLOCK** ', regex=True)
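Another way to approach the goal is to build one pattern per row that allows any run of non-word characters between the parts of the name. The sketch below keeps the core logic in plain Python so it is easy to test. `block_name` is an illustrative helper, and it assumes P_Name always has the form "Last, First [Middle]".

```python
import re

def block_name(text: str, p_name: str, mask: str = "**BLOCK**") -> str:
    """Mask the person's name in text, tolerating punctuation between name parts."""
    # "Doe, Jane Ann" -> last="Doe", first="Jane Ann" (assumed format)
    last, first = re.split(r",\s*", p_name, maxsplit=1)
    tokens = (first + " " + last).split()
    # Jane\W+Ann\W+Doe matches "Jane Ann Doe", "Jane Ann, Doe", "Jane! Ann Doe", ...
    pattern = r"\W+".join(re.escape(t) for t in tokens)
    return re.sub(pattern, mask, text)

texts = ["Jon J Doe works ", "So is Mary Doe, works too",
         "Jane Ann, Doe doesnt", "Jone, Dow doesnt either"]
p_names = ["Doe, Jon J", "Doe, Mary", "Doe, Jane Ann", "Dow, Jone"]

new_texts = [block_name(t, n) for t, n in zip(texts, p_names)]
print(new_texts)
```

With the DataFrame from the question, the same helper could be applied row-wise, for example `df['NewText'] = df.apply(lambda r: block_name(r['Text'], r['P_Name']), axis=1)`. Punctuation after the name, such as the comma in row 1, is left in place.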
Is it possible to use one regex to convert both
Doe, John C., Jr., M.D.
Doe, Jane, M.D.
to read
John C. Doe Jr., M.D.
Jane Doe, M.D.
Replace
^([^,]+),\s([^,]+),(?:(\s[^,]+),)?\s([^,]+)$
with
$2 $1$3, $4
Barmar's answer works for the specified examples. There is a possibly simpler alternative, with one caveat:
Replace ^([^,]+),\s([^,]+),(.*)$ with $2 $1$3
This swaps (?:(\s[^,]+),)?\s([^,]+) for a simpler (.*) that grabs everything after the first name, without caring about the specifics of the titles. The caveat: the comma after the first name is consumed, so Doe, Jane, M.D. becomes Jane Doe M.D. (no comma before M.D.), while the first example still comes out as John C. Doe Jr., M.D.
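Both replacements can be compared side by side. The sketch below uses Python's `re` module, which is an assumption about the engine: Python writes backreferences as \2 rather than $2, and since Python 3.5 an unmatched optional group substitutes as an empty string.

```python
import re

names = ["Doe, John C., Jr., M.D.", "Doe, Jane, M.D."]

# Pattern with an optional middle group for a suffix such as "Jr."
full = re.compile(r"^([^,]+),\s([^,]+),(?:(\s[^,]+),)?\s([^,]+)$")
# Simpler pattern: everything after the second comma lands in group 3
simple = re.compile(r"^([^,]+),\s([^,]+),(.*)$")

full_out = [full.sub(r"\2 \1\3, \4", n) for n in names]
simple_out = [simple.sub(r"\2 \1\3", n) for n in names]

for original, a, b in zip(names, full_out, simple_out):
    print(f"{original!r}: full={a!r} simple={b!r}")
```

In this check the two patterns agree on the first name and differ on the second, where the simpler one loses the comma before M.D.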