Removing duplicate substring in SAS - sas

In the following sample data I am trying to remove the duplicate substrings in my string using the code below:
data z;
input pvd_name_orig $50.;
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
DATA want (keep=pvd_name_orig pvd_name_temp);
SET z;
pvd_name_temp=scan(pvd_name_orig, 1, ' ');
do i=2 to countw(pvd_name_orig,' ');
word=scan(pvd_name_orig, i, ' ');
found=find(pvd_name_temp, word, 'it');
if found=0 then pvd_name_temp=catx(' ', pvd_name_temp, word);
end;
run;
However, this code works for all names except those whose middle initial also occurs elsewhere in the string. In that case the initial is treated as a repetition and deleted. Is there a way to avoid deleting single-letter words in my string?
I tried manually adding a period after the middle initial, and then the code keeps it in the new variable. However, I have not been able to add that period with SAS code.
I used the following code to add a period after a one-letter second word, but it only puts a period at the second character of the whole string.
data want;
set z;
if length(compress(scan(pvd_name_orig,2,' '),'.'))=1 then substr(pvd_name_orig,2,1)='.';
run;
My final desired output is
Obs pvd_name_orig pvd_name_temp
1 MD SMITH, JOHN MD MD SMITH, JOHN
2 SMITH, JOHN W SMITH, JOHN W
3 MD T. SMITH, JOHN W. MD T. SMITH, JOHN W.
4 SMITH, JOHN WILLIAM SMITH, JOHN WILLIAM
5 JOHN N MD SMITH MD JOHN N MD SMITH
6 MD JOHN W. SMITH MD MD JOHN W. SMITH
7 MD SMITH, MD JOHN MD SMITH, JOHN
Any suggestions??

A regular expression can resolve the issue. Here \b is a word boundary and \S+ is a run of non-space characters, so (\b\S+\b) captures a word; (.*) captures whatever follows it; and (\1) is a later repeat of the captured word, which is the part to delete. The replacement \1\2 keeps the first occurrence of the word and the text in between.
data z;
input name $50.;
new_name=prxchange('s/(\b\S+\b)(.*)(\1)/\1\2/',-1,strip(name));
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
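For comparison, the loop in the question drops the initial in JOHN N MD SMITH MD because find() searches for substrings, and N occurs inside JOHN. Below is a minimal Python sketch of the same word-by-word idea using whole-word comparison instead. The function name drop_repeated_words is hypothetical, and ignoring a trailing period or comma when comparing words is an assumption, not something the thread specifies.

```python
def drop_repeated_words(name):
    """Keep the first occurrence of each word and drop later repeats."""
    seen = set()
    kept = []
    for word in name.split():
        # Assumption: compare case-insensitively, ignoring trailing '.' or ','
        key = word.upper().rstrip(".,")
        if key in seen:
            continue
        seen.add(key)
        kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    for raw in ["MD SMITH, JOHN MD", "JOHN N MD SMITH MD", "MD SMITH, MD JOHN"]:
        print(raw, "->", drop_repeated_words(raw))
```

Because whole words are compared, a single-letter initial is only removed when the same initial really appears twice.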

Related

name prefix and suffix removal SAS

I am looking for a way to remove a list of prefixes and suffixes from the name variable.
For example, under name I have
Mr. Walter White Jr.
and I wish to keep just Walter White.
I have a list of prefixes and suffixes that I can use as a reference.
Thanks in advance.
Use regular expressions with your list to replace them with blank values.
data have;
infile datalines dlm='|';
length name $20.;
input name$;
datalines;
Mr. Walter White Jr.
Mrs. Skyler White
Dr. Saul Goodman
Mr. Jesse Pinkman
Mr. Gus Fring
;
run;
data want;
set have;
name = strip(prxchange('s/\b(Mrs?|Jr|Dr)\b\.?//', -1, name));
run;
Output:
name
Walter White
Skyler White
Saul Goodman
Jesse Pinkman
Gus Fring
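Since the question mentions a reference list of prefixes and suffixes, the pattern can also be built from that list rather than typed by hand. A rough Python sketch of the idea follows. The TITLES list and the strip_titles name are made up for illustration; the real list would come from the asker's data.

```python
import re

# Hypothetical reference list; the asker's real list is not shown in the thread
TITLES = ["Mr", "Mrs", "Dr", "Jr"]


def strip_titles(name, titles=TITLES):
    """Remove any listed prefix or suffix (with an optional period) from a name."""
    # Longest first so 'Mrs' is tried before 'Mr'
    alternation = "|".join(re.escape(t) for t in sorted(titles, key=len, reverse=True))
    pattern = re.compile(r"\b(?:" + alternation + r")\b\.?", re.IGNORECASE)
    cleaned = pattern.sub("", name)
    # Collapse the whitespace left behind by the removed titles
    return " ".join(cleaned.split())


if __name__ == "__main__":
    print(strip_titles("Mr. Walter White Jr."))
```

The \b word boundaries keep a title such as Dr from matching inside a longer word like Drake.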

Remove values from a string SAS

I have this column in which I wish to keep only the names and remove everything from the ( onward. How can I achieve this?
Name Age
James 12
John (funny) 11
Jonathan 10
Alisa (134 cm) 12
Merlin (cheerful) 12
Jessica (hopeful) 12
Ali (quiet) 13
I have tried using functions such as compress, but it still didn't work:
data output;
length Name $30.;
infile datalines dlm=',';
input Name$ Age;
new = compress(name, '()');
datalines;
James,12
John (funny),11
Jonathan,10
Alisa (134 cm),12
Merlin (cheerful),12
Jessica (hopeful),12
Ali (quiet),13
;
Updating based on Tom's suggestion:
Use scan() and treat ( as a delimiter. This will pull all text before the first (.
new = scan(name, 1, '(', 'T');
The T modifier trims trailing blanks from the string and delimiter arguments.
You can use Perl regular expression patterns to replace parenthetical content with 'nothing'
name = prxchange ('s/\(.*?\)//', -1, name);
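The two approaches above (cut at the first parenthesis, or delete the parenthesized part) can be sketched in Python for comparison. This is only an illustration; the function names name_only and drop_parenthetical are hypothetical.

```python
import re


def name_only(value):
    """Keep the text before the first '(' (like scan with '(' as delimiter)."""
    return value.split("(", 1)[0].rstrip()


def drop_parenthetical(value):
    """Delete each '(...)' group and the whitespace just before it."""
    return re.sub(r"\s*\(.*?\)", "", value).strip()


if __name__ == "__main__":
    for raw in ["James", "John (funny)", "Alisa (134 cm)"]:
        print(repr(name_only(raw)), repr(drop_parenthetical(raw)))
```

The two differ when text follows the closing parenthesis: the first drops it, the second keeps it.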

Regex - Creating validation to enforce that a string has 2+ words

If you have a moment, I need some help adding to my regex expression. I am validating a response in a Google Form for the user's full name.
The validation requires:
That only letters are used
That the user inputs both the first and second name (at a minimum), separated by a space
So far I have come up with:
[a-zA-Z ]+
But this lacks the check for a minimum of two words in a given string.
After an hour of fails and googling, I have admitted defeat and need your help!
Thanks in advance.
This should do the job:
/^[a-z]{2,}( [a-z]+)*?( [a-z]{2,}){1,}$/i
It matches:
john smith ◄ all lowercase
John Smith
John P E Smith
John Paul E Smith
John Paul Edward Smith
It rejects:
John
John S
John Paul S
John Paul Edward S
J0hn Smith  ◄ zero instead of the letter 'o'
John     Smith  ◄ multiple spaces
Best regards
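To check the pattern outside of Google Forms, here is a small Python sketch that runs it against the examples listed above. The helper name is_full_name is hypothetical; the pattern itself is the one from the answer, with the /i flag expressed as re.IGNORECASE.

```python
import re

# Same pattern as in the answer; re.IGNORECASE stands in for the trailing /i
FULL_NAME = re.compile(r"^[a-z]{2,}( [a-z]+)*?( [a-z]{2,}){1,}$", re.IGNORECASE)


def is_full_name(text):
    """Return True when text looks like 'First [Middle ...] Last'."""
    return FULL_NAME.match(text) is not None


if __name__ == "__main__":
    for sample in ["John Smith", "John P E Smith", "John", "John S", "J0hn Smith"]:
        print(sample, "->", is_full_name(sample))
```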

modification of alter text in pandas column based on names

Background
I have the following df which is a modification of Alter text in pandas column based on names
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Doe works ',
'So is Mary Doe, works too',
'Jane Ann, Doe doesnt',
'Jone, Dow doesnt either'],
'P_ID': [1,2,3,4],
'P_Name' : ['Doe, Jon J', 'Doe, Mary', 'Doe, Jane Ann', 'Dow, Jone' ]
})
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
The following block of code works to block names like Jon J Doe, but it doesn't work when a name like Jane Ann Doe has a character in between, e.g. Jane Ann, Doe or Jone! Dow
df['NewText'] = df['Text'].replace(df['P_Name'].str.split(', *').apply(lambda l: ' '.join(l[::-1])),'**BLOCK**',regex=True)
Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt Jane Ann, Doe doesnt
3 4 Dow, Jone Jone,Dow doesnt either Jone, Dow doesnt either
Goal
1) Tweak the code above to account for , (or any other characters that may appear between the names)
(I know I can strip commas, but I need to leave them in)
Desired Output
P_ID P_Name Text NewText
0 1 Doe, Jon J Jon J Doe works **BLOCK** works
1 2 Doe, Mary So is Mary Doe, works So is **BLOCK**, works
2 3 Doe, Jane Ann Jane Ann, Doe doesnt **BLOCK** doesnt
3 4 Dow, Jone Jone,Dow doesnt either **BLOCK** doesnt either
Question
How do I tweak my code to get my desired output?
I don't know if there are multiple such cases, but in case you have limited
Sample DataSet:
>>> df
P_ID P_Name Text
0 1 Doe, Jon J Jon J Doe works
1 2 Doe, Mary So is Mary Doe, works too
2 3 Doe, Jane Ann Jane Ann, Doe doesnt
3 4 Dow, Jone Jone, Dow doesnt either
You can create a dict that maps each name as it appears in the text to its replacement, and apply that to the DataFrame to get the result.
>>> replace_values = {'Jon J Doe': '**BLOCK**', 'Mary Doe': '**BLOCK**', 'Jane Ann, Doe': '**BLOCK**', 'Jone, Dow': '**BLOCK**'}
Resulting DataFrame:
>>> df = df.replace(replace_values, regex=True)
>>> df
P_ID P_Name Text
0 1 Doe, Jon J **BLOCK** works
1 2 Doe, Mary So is **BLOCK**, works too
2 3 Doe, Jane Ann **BLOCK** doesnt
3 4 Dow, Jone **BLOCK** doesnt either
Try this:
df['NewText'] = df['Text'].replace( r'('+ df['P_Name'].str.split(r'\W+').str.join('|')+r'|\W+){3,}', ' **BLOCK** ', regex=True)
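Another way to think about the goal is to build one pattern per row from P_Name, with \W+ between the name parts so that a space, a comma, or any other punctuation is tolerated. Here is a minimal sketch using only the standard re module. The function name block_name is hypothetical, and it assumes P_Name always has the form 'Last, First [Middle]'.

```python
import re


def block_name(text, p_name, token="**BLOCK**"):
    """Replace 'First [Middle] Last' in text, allowing punctuation between parts."""
    # Assumption: p_name looks like 'Last, First [Middle]'
    last, first = [part.strip() for part in p_name.split(",", 1)]
    parts = first.split() + [last]
    # \W+ lets any run of non-word characters (space, comma, '!') separate the parts
    pattern = r"\b" + r"\W+".join(re.escape(p) for p in parts) + r"\b"
    return re.sub(pattern, token, text)


if __name__ == "__main__":
    print(block_name("Jane Ann, Doe doesnt", "Doe, Jane Ann"))
```

With the DataFrame from the question, this could presumably be applied row by row, for example df.apply(lambda r: block_name(r['Text'], r['P_Name']), axis=1).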

Regex to transpose somewhat tricky last name, first name , title

Is it possible to use one regex to convert both
Doe, John C., Jr., M.D.
Doe, Jane, M.D.
to read
John C. Doe Jr., M.D.
Jane Doe, M.D.
Replace
^([^,]+),\s([^,]+),(?:(\s[^,]+),)?\s([^,]+)$
with
$2 $1$3, $4
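Barmar's answer works for the specified examples, but there's a possibly simpler solution that may be enough for this input:
Replace ^([^,]+),\s([^,]+),(.*)$ with $2 $1$3
We replace the (?:(\s[^,]+),)?\s([^,]+) with a simpler (.*) that grabs all titles after the first name (we don't care about the specifics of what's in these titles). Note that for an input with a single title, such as Doe, Jane, M.D., this yields Jane Doe M.D. without the comma before the title, so it reproduces the requested output only when at least two titles follow the name.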
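The first pattern (with the optional middle group) can be tried out in Python as sketched below. Python's re module writes back-references in the replacement as \2 rather than $2, and a group that did not take part in the match is substituted as an empty string. The function name transpose_name is hypothetical.

```python
import re

# Pattern from the first answer: last name, first name, optional suffix, final title
PATTERN = re.compile(r"^([^,]+),\s([^,]+),(?:(\s[^,]+),)?\s([^,]+)$")


def transpose_name(name):
    """Turn 'Last, First[, Suffix], Title' into 'First Last[ Suffix], Title'."""
    # Equivalent of the '$2 $1$3, $4' replacement in Python syntax
    return PATTERN.sub(r"\2 \1\3, \4", name)


if __name__ == "__main__":
    for raw in ["Doe, John C., Jr., M.D.", "Doe, Jane, M.D."]:
        print(transpose_name(raw))
```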