Search for a string in a string - sas

I have a data set which looks like the following, but containing thousands of rows.
Firstname Lastname Emailaddress
John Smith John.Smith#mail.com
Anna Blake Anna.Blake#mail.com
Susan Peterson 1962_Peterson_Susan#mail.com
David Anderson RandomEmail_1956#mail.com
I want to create a variable that indicates whether the email address contains the person's first or last name, regardless of position. If a match is found, the variable should have the value 1; if no match is found, it should have the value 0.
I have created the following logic which works for most cases.
Data CheckNames;
Set MyDataSet;
LenFM = Length(FirstName);
LenLM = Length(LastName);
If Substr(EmailAddress,1,LenFM) = FirstName or Substr(EmailAddress,1,LenLM) = LastName then Match = 1;
Else Match = 0;
run;
This logic would return Match = 1 for the first two results and Match = 0 for the last two. However I would like it to return Match = 1 for the third observation as well because it contains the name of the person.
My question is whether there is a SAS function that, for each observation, checks whether the value of Firstname or Lastname appears anywhere in the variable EmailAddress.
I have tried with Find() and PrxMatch() but they both seem to require hard coded values, making them inefficient for this purpose.
Thank you!

Both FIND and PRXMATCH would work fine, and have no such requirement of hardcoded values. FIND works particularly well for this. Add the modifier t to tell it to trim the spaces from the firstname/lastname variable (or use the trim function).
data MyDataSet;
length firstname lastname emailaddress $50;
input Firstname $ Lastname $ Emailaddress $;
datalines;
John Smith John.Smith#mail.com
Anna Blake Anna.Blake#mail.com
Susan Peterson 1962_Peterson_Susan#mail.com
David Anderson RandomEmail_1956#mail.com
;;;;
run;
Data CheckNames;
Set MyDataSet;
Match = find(EmailAddress,Firstname,'t') | find(EmailAddress,LastName,'t');
run;
I use | there to OR the two find functions' values together, but you can do it more explicitly if you prefer.
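For anyone who prefers the explicit form, here is a minimal sketch of the same check written with IF/ELSE. It assumes the MyDataSet created above. The i modifier (ignore case) is an addition that is not in the code above; drop it if the match should be case-sensitive.

```sas
data CheckNames;
    set MyDataSet;
    /* 'i' ignores case; 't' trims trailing blanks from both arguments */
    if find(EmailAddress, Firstname, 'it') > 0
       or find(EmailAddress, LastName, 'it') > 0 then Match = 1;
    else Match = 0;
run;
```

FIND returns the position of the first match, or 0 when there is none, so comparing against 0 gives the same 1/0 flag as the | version.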

Related

Splitting a cell up using delimiters, but the delimiter is before the desired split location

I have some data in a column of a dataset (named Person_details), where each cell holds an unknown number of entries. Each entry is a name (its parts separated by spaces), followed by an underscore, followed by that person's identifier (7 characters).
Is there a way to split these entries up automatically, rather than repeatedly finding the position of the underscore, and then taking the substring before and after?
Person_details:
Evan Davies_123F323 Adam John Smith_342D427 Karl Marx_903C943
There are an unknown number of names in each cell, e.g. some have just one name and some have 20. Also complicated by the fact that some entries have middle name(s).
The ideal output would be in the form
Name Code
Evan Davies 123F323
Adam John Smith 342D427
Karl Marx 903C943
You could just use SCAN() instead.
data have;
string='Evan Davies_123F323 Adam Smith_342D427 Karl Marx_903C943';
length name $50 code $7 ;
do index=1 to countw(string,' ');
name = catx(' ',name,scan(string,index,' '));
if index(name,'_') then do;
code = scan(name,-1,'_');
name = substr(name,1,length(name)-length(code)-1);
output;
name=' ';
end;
end;
run;
You can use a Perl regular expression (regex) to detect and extract pieces from patterned text. SAS routine PRXNEXT iterates through matches, and function PRXPOSN extracts pieces.
Example:
data have;
text = 'Evan Davies_123F323 Adam John Smith_342D427 Karl Marx_903C943';
run;
data want(keep=name code);
rx = prxparse('/(.+?)_(.{7}( |$))/');
set have;
start = 1;
stop = length(text);
do seq = 1 by 1;
call prxnext(rx,start,stop,text,position,length);
if position=0 then leave;
name = prxposn(rx,1,text);
code = prxposn(rx,2,text);
output;
end;
run;

Remove values from a string SAS

I have this column in which I want to keep only the names and remove everything from the ( onwards. How can I achieve this?
Name Age
James 12
John (funny) 11
Jonathan 10
Alisa (134 cm) 12
Merlin (cheerful) 12
Jessica (hopeful) 12
Ali (quiet) 13
I have tried using functions such as compress, but it still didn't work:
data output;
length Name $30.;
infile datalines dlm=',';
input Name$ Age;
new = compress(name, '()');
datalines;
James,12
John (funny),11
Jonathan,10
Alisa (134 cm),12
Merlin (cheerful),12
Jessica (hopeful),12
Ali (quiet),13
;
Updating based on Tom's suggestion:
Use scan() and treat ( as a delimiter. This will pull all text before the first (.
new = scan(name, 1, '(', 'T');
The T modifier trims trailing blanks from the string and delimiter arguments.
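As a rough sketch, this is how the SCAN call could slot into the data step from the question. Only three of the sample rows are used, and the variable name new is taken from the question.

```sas
data output;
    length Name $30;
    infile datalines dlm=',';
    input Name $ Age;
    /* Text before the first '('; names without '(' come back whole */
    new = scan(Name, 1, '(', 'T');
    datalines;
James,12
John (funny),11
Alisa (134 cm),12
;
run;
```

Any blank left between the name and the ( ends up as a trailing blank in new, which SAS ignores in character comparisons.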
You can use Perl regular expression patterns to replace parenthetical content with 'nothing'
name = prxchange ('s/\(.*?\)//', -1, name);

Can regex queries be combined in SAS?

I've successfully implemented a negative lookbehind in my regex code in SAS. However, there are multiple 'words' that could negate the string I'm looking for. Specifically, I'm looking for a phrase (from medical notes) that says "carbapenmase producing" or "carbapenamase confirmed". At times these phrases are preceded by a negating word, as in "not carbapenemase producing" or "possible carbapenamase producing", and these I don't want. Having learned that negative lookbehinds require the qualifier words (if > 1) to be of the same length, I need to create separate regex expressions to capture "not" and "possible", as in:
*!!! Create template to identify key phrases in the comment/note;
retain carba1 carba2 carba3;
if _n_ = 1 then do; /*probable*/
carba1 = prxparse("/(?<!not\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
carba2 = prxparse("/(?<!possible|probable\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
carba3 = prxparse("/(?<!not a\s)ca[bepr]\w*?\s*?(conf|posi|prod|\+)/i");
end;
if prxmatch(carba1,as_comments) > 0 or prxmatch(carba2,as_comments) > 0 or
prxmatch(carba3,as_comments) > 0;
Is there a workaround for this that would shorten execution time, or am I stuck with this? Any advice/comments are appreciated.
If there are just 4 scenarios and they are straightforward, you can do this simply by using contains and not contains.
data have;
length string $200.;
infile datalines;
input string & $ ;
datalines;
this is cool and carbapenmase producing or wow so nice
this is wow confirmed carbapenamase confirmed hello
now this positive for modified hodge test and later
cool is my name not carbapenemase producing" or "the modified hodge hello
wow and wow previous possible carbapenamase producing hello
Mr cool is hello
;
data want;
set have;
where (string contains "carbapenmase producing" or
string contains "carbapenamase confirmed")
and not (string contains "not carbapenemase producing" or
string contains "possible carbapenamase producing");
run;

Remove Middle Initial but not Middle Name from string

I'm trying to find a way to remove the Middle initial from a string containing the First name and middle initial (example "Mary A" needs to be "Mary").
However, I would need to keep the middle/second name if it was more than an initial (example "Mary Ann" would stay "Mary Ann").
Much thanks,
Matt
Try to use the function scan:
data test;
input name $20.;
cards;
Mary A
Anthony B
Mary Ann
Anthony Bernard
;
run;
data res;
set test;
if (length(scan(name,2))=1) then name=scan(name,1);
run;
As a result, you get:
Mary
Anthony
Mary Ann
Anthony Bernard
Here's an example of how to do this using regular expression substitution. I've used proc sql but this would also work in a data step:
data names;
input name & $5.;
cards;
Aa A
Aa Aa
Aaa A
;
run;
proc sql;
select prxchange('s/^(\w+)\s+\w\s*$/$1/',-1,name) from names;
quit;
The regex is built up as follows:
Capture the first word
Match a space, a single character, then any number of trailing spaces
If the whole expression is a match, return only the first word, otherwise return the whole input string unchanged.
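For reference, a minimal sketch of the data step equivalent, assuming the names dataset created above. The output dataset name res is only illustrative.

```sas
data res;
    set names;
    /* Same substitution as in the PROC SQL step; the pattern is anchored
       at both ends, so it can match at most once per value */
    name = prxchange('s/^(\w+)\s+\w\s*$/$1/', -1, name);
run;
```

Values that do not match the whole pattern, such as "Aa Aa", are returned unchanged.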

I want to separate first name, last name and age from my string in SAS

Input:
David30Miller
Jhonty45Rhodes
Ahsley63Cummins
So the first name variable should contain the characters before the age (i.e. David), Age should contain the number (i.e. 30), and last name should contain the characters after it (i.e. Miller).
Required Output:
FirstName Age Last name
David 30 Miller
Jhonty 45 Rhodes
Ahsley 63 Cummins
Can somebody help?
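Step 1: Extract the age with compress(string,,"kd"). The k modifier keeps the listed characters instead of removing them and d adds the digits to the list, so only the digits (the age) remain.
Step 2: Use the age as the delimiter argument of the SCAN function to get the first and last name. In scan(string, n, delimiters), the first argument is the value to work on, the second is which word to extract, and the third lists the delimiter characters. Because age is numeric, SAS converts it to a character string first, and SCAN treats every character of that string (here the digits of the age, plus any padding blanks) as a separate delimiter. The d modifier shown in the comments does the same job by treating all digits as delimiters.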
data abc;
input string $50.;
cards;
David30Miller
Jhonty45Rhodes
Ahsley63Cummins
;
run;
data abc;
set abc;
age = input(compress(string,,"kd"),best.);
first_name =scan(string,1,age); /*or scan(string,1,,"d");*/
last_name = scan(string,2,age); /*or scan(string,2,,"d");*/
run;
My Output:
|string |age |first_name |last_name
|David30Miller |30 |David |Miller
|Jhonty45Rhodes |45 |Jhonty |Rhodes
|Ahsley63Cummins |63 |Ahsley |Cummins
let me know in case of any queries
You can also use PRXCHANGE, as shown below. Here is a brief explanation of the pattern ^([a-z]+)([0-9]+)([a-z]+)$:
^([a-z]+) is group 1: letters at the start of the string (^ means start).
([0-9]+) is group 2: digits only.
([a-z]+)$ is group 3: letters at the end of the string.
$1 represents group 1; using /$1/ as the replacement replaces everything with group 1.
$2 represents group 2; using /$2/ as the replacement replaces everything with group 2.
$3 represents group 3; using /$3/ as the replacement replaces everything with group 3.
In the first case, replacing everything with group 1 gives the first name, and so on.
data want;
set have;
firstname = prxchange('s/^([a-z]+)([0-9]+)([a-z]+)$/$1/i',1,trim(string));
age = input(prxchange('s/^([a-z]+)([0-9]+)([a-z]+)$/$2/i',1,trim(string)),8.);
lastname = prxchange('s/^([a-z]+)([0-9]+)([a-z]+)$/$3/i',1,trim(string));
run;
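A possible variation, offered only as a sketch: compile the pattern once with PRXPARSE and read the three groups with PRXPOSN instead of calling PRXCHANGE three times. It assumes the same have dataset with a character variable string. The variable name rx and the $50 lengths are illustrative choices.

```sas
data want(drop=rx);
    set have;
    length firstname lastname $50;
    retain rx;
    /* Compile the pattern once; \s* allows for trailing blanks in string */
    if _n_ = 1 then rx = prxparse('/^([a-z]+)([0-9]+)([a-z]+)\s*$/i');
    if prxmatch(rx, string) then do;
        firstname = prxposn(rx, 1, string);            /* letters before the digits */
        age       = input(prxposn(rx, 2, string), 8.); /* digits as a number */
        lastname  = prxposn(rx, 3, string);            /* letters after the digits */
    end;
run;
```

Rows that do not match the pattern are left with missing values for all three variables, which makes malformed input easy to spot.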