Syntax search over a string with SAS

Got the following example
I'm trying to find out whether any part of the string in column nomvar of table tata exists in col1 of table toto and, if so, return the definition from col2.
For I2010,RT,IS-IPI,F_CC11_X_CCXBA, I would have "yes,toto,tata,well" in the column intitule.
I thought about using a proc sql with an insert and a select, but I have two tables and I would need to do a join.
At the same time, I thought about putting everything in one table, but I'm unsure whether that is a good idea.
Any suggestions are welcome as I'm deeply stuck.

The SAS data step hash object is a nice way to do this. It allows you to read the Toto table into memory and it becomes a lookup table for you. Then you just walk the string from the Tata table using the scan function, tokenize, and lookup the col2 value. Here is the code.
By the way, turning table Tata into a structure like Toto and performing join is a perfectly rational way to do this, too.
/*Create sample data*/
data toto;
length col1 col2 $ 100;
col1='I2010';
col2='yes';
output;
col1='RT';
col2='toto';
output;
col1='IS-IPI';
col2='tata';
output;
col1='F_CC11_X_CCXBA';
col2='well';
output;
run;
data tata;
length nomvar intitule $ 100;
nomvar='I2010,RT,IS-IPI,F_CC11_X_CCXBA';
run;
/*Now for the solution*/
/*You can do this lookup easily with a data step hash object*/
data tata;
set tata;
length col1 col2 token $ 100;
drop col1 col2 token i sepchar rc;
/*slurp the data in from the Toto data set into the hash*/
if (_n_ = 1) then do;
declare hash toto_hash(dataset: 'work.toto');
rc = toto_hash.definekey('col1');
rc = toto_hash.definedata('col2');
toto_hash.definedone();
end;
/*now walk the tokens in data set tata and perform the lookup to get each value*/
i = 1;
sepchar = ''; /*this will be a comma after the first iteration of the loop*/
intitule = '';
do until (token = '');
/*grab nth item in the comma-separated list*/
token = scan(nomvar, i, ',');
/*lookup the col2 value from the toto data set*/
rc = toto_hash.find(key:token);
if (rc = 0) then do;
/*lookup successful so tack the value on*/
intitule = strip(intitule) || sepchar || col2;
sepchar = ',';
end;
i = i + 1;
end;
run;

Assuming your data is all structured like this (you're looking at the different strings in between . characters), I would think the easiest way is to normalize TATA (splitting by .), then do a straight join, and then (if you need to) transpose back. (It might be better to leave it vertical; very likely you would find this a more useful structure for analysis.)
data tata_v;
set tata;
call scan(nomvar,1,position,length,'.');
do _i = 1 by 1 while (position gt 0); /*call scan returns position 0 once there are no more tokens*/
nomvar_out = substr(nomvar,position,length);
output;
call scan(nomvar,_i+1,position,length,'.');
end;
run;
Now you can join on nomvar_out and then (if needed) recombine things.
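A minimal sketch of that join and recombination, assuming tata_v was built with the delimiter your data actually uses (the sample nomvar earlier in the thread is comma-separated), that nomvar identifies each original row, and that _i from the split step gives the token order. The dataset names joined and tata_final are placeholders and not part of the original answer.

```sas
/* Sketch only: look up each token from tata_v in toto */
proc sql;
  create table joined as
  select v.nomvar, v._i, t.col2
  from tata_v as v
  left join toto as t
    on v.nomvar_out = t.col1
  order by v.nomvar, v._i;
quit;

/* Recombine the matched definitions into one comma-separated string per nomvar */
data tata_final;
  length intitule $ 100;
  set joined;
  by nomvar;
  retain intitule;
  if first.nomvar then intitule = '';
  /* catx skips blank col2 values, so unmatched tokens add nothing */
  intitule = catx(',', intitule, col2);
  if last.nomvar then output;
  keep nomvar intitule;
run;
```

If the vertical structure is more useful for analysis, the second step can simply be skipped.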

Related

Matching SAS character variables to a list

So I have a vector of search terms, and my main data set. My goal is to create an indicator for each observation in my main data set where variable1 includes at least one of the search terms. Both the search terms and variable1 are character variables.
Currently, I am trying to use a macro to iterate through the search terms, and for each search term, indicate if it is in the variable1. I do not care which search term triggered the match, I just care that there was a match (hence I only need 1 indicator variable at the end).
I am a novice when it comes to using SAS macros and loops, but I have tried searching and piecing together code from some online sites. Unfortunately, when I run it, it does nothing; it does not even give me an error.
I have put the code I am trying to run below.
*for example, I am just testing on one of the SASHELP data sets;
*I take the first five team names to create a search list;
data terms; set sashelp.baseball (obs=5);
search_term = substr(team,1,3);
keep search_term;;
run;
*I will be searching through the baseball data set;
data test; set sashelp.baseball;
run;
%macro search;
%local i name_list next_name;
proc SQL;
select distinct search_term into : name_list separated by ' ' from work.terms;
quit;
%let i=1;
%do %while (%scan(&name_list, &i) ne );
%let next_name = %scan(&name_list, &i);
*I think one of my issues is here. I try to loop through the list, and use the find command to find the next_name and if it is in the variable, then I should get a non-zero value returned;
data test; set test;
indicator = index(team,&next_name);
run;
%let i = %eval(&i + 1);
%end;
%mend;
Thanks
Here's the temporary array solution, which is fully data driven.
1. Store the number of terms in a macro variable to assign the length of arrays
2. Load terms to search into a temporary array
3. Loop through for each word and search the terms
4. Exit loop if you find the term to help speed up the process
/*1*/
proc sql noprint;
select count(*) into :num_search_terms from terms;
quit;
%put &num_search_terms.;
data flagged;
*declare array;
array _search(&num_search_terms.) $ _temporary_;
/*2*/
*load array into memory;
if _n_ = 1 then do j=1 to &num_search_terms.;
set terms;
_search(j) = search_term;
end;
set test;
*set flag to 0 for initial start;
flag = 0;
/*3*/
*loop through and create flag;
do i=1 to &num_search_terms. while(flag=0); /*4*/
if find(team, _search(i), 'it')>0 then flag=1;
end;
drop i j search_term ;
run;
Not sure I totally understand what you are trying to do but if you want to add a new binary variable that indicates if any of the substrings are found just use code like:
data want;
set have;
indicator = index(term,'string1') or index(term,'string2')
... or index(term,'string27') ;
run;
Not sure what a "vector" would be but if you had the list of terms in a dataset you could easily generate that code from the data. And then use %include to add it to your program.
filename code temp;
data _null_;
set term_list end=eof;
file code ;
if _n_ =1 then put 'indicator=' @ ;
else put ' or ' @;
put 'index(term,' string :$quote. ')' @;
if eof then put ';' ;
run;
data want;
set have;
%include code / source2;
run;
If you did want to think about creating a macro to generate code like that then the parameters to the macro might be the two input dataset names, the two input variable names and the output variable name.
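One possible shape for such a macro is sketched below. The macro name flag_terms and its parameter names are hypothetical; the body just wraps the code-generation approach above, and the example call assumes the test and terms datasets from the question.

```sas
/* Hypothetical macro: flags rows of &data. whose &var. contains any value of &termvar. from &terms. */
%macro flag_terms(data=, var=, terms=, termvar=, out=, flag=indicator);
  filename code temp;

  /* Write "<flag> = index(var,"t1") or index(var,"t2") ... ;" to a temporary file */
  data _null_;
    set &terms. end=eof;
    file code;
    if _n_ = 1 then put "&flag. = " @;
    else put ' or ' @;
    put "index(&var.," &termvar. :$quote. ')' @;
    if eof then put ';';
  run;

  /* Pull the generated assignment statement into the data step */
  data &out.;
    set &data.;
    %include code / source2;
  run;

  filename code clear;
%mend flag_terms;

%flag_terms(data=test, var=team, terms=terms, termvar=search_term, out=want);
```

Note that index is case sensitive, so this sketch only matches terms with the same casing as the searched variable.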

Aggregating multiple observations depending on validity ranges

I am facing a problem regarding the concatenation of multiple observations depending on validity ranges. The function I am trying to reproduce is similar to the Listagg() function in Oracle but I want to use it with regards to validity ranges.
Here is a reproducible minimal dataset:
data have;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
3,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,10JAN2016:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y,01JAN2019:00:00:00,31DEC2019:00:00:00
5,MEMBER,N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
                            
What I would like to do is to concatenate the value variable for each group of var inside an id.
However, there are multiple types of cases:
If there is only one value for a given var inside an id, don't do anything (e.g. case of id=1 in my example)
If validity ranges are consecutive, output every value of var inside an id (e.g. case of id=2)
If validity ranges are the same for the same var inside an id, concatenate them altogether (e.g. case of id=3)
If validity ranges are overlapping, the values of var are concatenated for the range they share and output with that validity range; the non-shared parts keep their single value (e.g. case of id=4)
If there are multiple validity ranges that are non consecutive for the same value in var inside an id, concatenate each that shares the same validity ranges (e.g. case of id=5)
Here is the desired result:
                            
Following @Kiran's answer on how to do Listagg function in SAS and @Joe's answer on List Aggregation and Group Concatenation in SAS Proc SQL, I tried to use the CATX function.
This is my attempt:
proc sort data=have;
by id var start;
run;
data staging1;
set have;
by id var start;
if first.var then group_number+1;
run;
/* Simulate LEAD() function in SAS */
data staging2;
merge staging1 staging1(firstobs = 2
keep=group_number start end
rename=(start=lead_start end=lead_end group_number=nextgrp));
if group_number ne nextgrp then do;
lead_start = .;
lead_end = .;
end;
drop nextgrp;
format lead_: datetime20.;
run;
proc sort data=staging2;
by id var group_number start;
run;
data want;
retain _temp;
set staging2;
by id var group_number;
/* Only one obs for a given variable, output directly */
if first.group_number = 1 and last.group_number = 1 then
output;
else if first.group_number = 1 and last.group_number = 0 then
do;
if lead_start ne . and lead_end ne .
and ((lead_start < end) or (lead_end < start)) then
do;
if (lead_start = start) or (lead_end = end) then
do;
retain _temp;
_temp = value;
end;
if (lead_start ne start) or (lead_end ne end) then
do;
_temp = value;
end = intnx('dtday',lead_start,-1);
output;
end;
end;
else if lead_start ne . and lead_end ne . and intnx('dtday', end, 1) = lead_start then
do;
_temp = value;
output;
end;
else output;
end;
else if first.group_number = 0 and last.group_number = 1 then
do;
/* Concatenate preceded retained value */
value = catx(";",_temp, value);
output;
call missing(_temp);
end;
else output;
drop _temp lead_start lead_end group_number;
run;
My attempt did not solve all the problems. Only the cases of id=1 and id=3 were correctly output. I am starting to think that the use of first. and last. as well as the simulated LEAD() function might not be the best approach and that there is probably a better way to do this.
Result of my attempt:
                            
Desired results in data:
data want;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,09JAN2016:00:00:00
4,MEMBER,Y;N,10JAN2016:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,01JAN2018:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y;N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
It's pretty hard to do this in raw SQL, without built in windowing functions; data step SAS will have some better solutions.
Some of this depends on your data size. One example, below, does exactly what you ask for, but it probably will be impractical with your real data. Some of that is the 31DEC9999 dates - that makes for a lot of data - but even without that, this has thousands of rows per person, so if you have a million people or something this will get rather large. But, it might still be the best solution, depending on what you need - it does give you the absolute best control.
* First, expand the dataset to one row per day/value. (Hopefully you do not need the datetime - just the date.);
data daily;
set have;
do datevar = datepart(start) to datepart(end);
output;
end;
format datevar date9.;
drop start end;
run;
proc sort data=daily;
by id var datevar value;
run;
*Now, merge together the rows to one row per day - so days with multiple values will get merged into one.;
data merging;
set daily;
by id var datevar;
retain merge_value;
if (first.datevar and last.datevar) then output;
else do;
if first.datevar then merge_value = value;
else merge_value = catx(',',merge_value,value);
if last.datevar then do;
value = merge_value;
output;
end;
end;
keep id var datevar value;
run;
proc sort data=merging;
by id var value datevar;
run;
*Now, re-condense;
data want;
set merging;
by id var value datevar;
retain start end;
last_datevar = lag(datevar);
if first.value then do;
start = datevar;
end = .;
end;
else if last_datevar ne (datevar - 1) then do;
end = last_datevar;
output;
start = datevar;
end = .;
end;
if last.value then do;
end = datevar;
output;
end;
format start end date9.;
run;
I do not necessarily recommend doing this - it's provided for completeness, and in case it turns out it's the only way to do what you do.
Easier, most likely, is to condense in the data step using an event-level dataset, where 'start' and 'end' are events. Here's an example that does what you require; it translates the original dataset to only 2 rows per original row, and then uses logic to decide what should happen for each event. This is pretty messy, so you'd want to clean it up for production, but the idea should work.
* First, make event level dataset so we can process the start and end separately;
data events;
set have;
type = 'Start';
dt_event = start;
output;
type = 'End';
dt_event = end;
output;
drop start end;
format dt_event datetime.;
run;
proc sort data=events;
by id var dt_event value;
run;
*Now, for each event, a different action is taken. Starts and Ends have different implications, and do different things based on those.;
data want;
set events(rename=(value=in_value));
by id var dt_event;
retain start end value orig_value;
format value new_value $8.;
* First row per var is easy, just start it off with a START;
if first.var then do;
start = dt_event;
value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current VALUE from the concatenated VALUE string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = value;
do _i = 1 to countw(value,',');
if scan(orig_value,_i,',') ne in_value then new_value = catx(',',new_value,scan(orig_value,_i,','));
end;
orig_value = new_value;
if last.dt_event then do;
end = dt_event;
output;
start = dt_event + 86400;
value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end = dt_event - 86400;
if start < end and not missing(value) then output;
value = catx(',',value,in_value);
start = dt_event;
end = .;
end;
end;
format start end datetime21.;
keep id var value start end;
run;
Last, I'll leave you with this: you probably work in insurance, pharma, or banking, and either way this is a VERY solved problem - this sort of windowing is done a lot. You shouldn't really be writing new code here for the most part - first look in your company, and then if not, look for papers in either PharmaSUG or FinSUG or one of the other SAS user groups, where they talk about this. There are probably several dozen implementations of code that does this already published.

Populate SAS macro-variable using a SQL statement within another SQL statement?

I stumbled upon the following code snippet in which the variable top3 has to be filled from a table have rather than from an array of numbers.
%let top3 = 14 15 42; /* This should be made obsolete.. */
%let no = 3;
proc sql;
create table want as
select *
from (select x, y from foo) a
%do i = 1 %to &no.;
%let current = %scan(&top3.,&i.); /* What do I need to put here? */
left join (select x, y from bar where z=&current.) row_&current.
on a.x = row_&current..x
%end;
;
quit;
The table have contains the xs from the string and looks as follows:
i x
1 14
2 15
3 42
I am now wondering how I should modify the %let current = ... line such that current is populated from the table have. I know how to populate a macro variable using proc sql with select .. into, but I am afraid that the way I am going right now is fully against SAS philosophy.
It looks like you're more or less transposing something. If that's the case, this is doable in macro/sql pretty easily.
First, here's the simple version - no macro.
proc sql;
create table class_t as
select * from (
select name from sashelp.class ) class
left join (
select name, age as age_Alfred
from sashelp.class
where name='Alfred') Alfred
on class.name = Alfred.name
;
quit;
We grab the value of age from the Alfred row and put it on the main join. This isn't exactly what you're doing, but it seems similar. (I'm just using one table, but you can of course use two here.)
Now, how do we extend this to be table-driven and not handwritten? Macros!
First, here's the macro - just taking the Alfred bit and making it generic.
%macro joiner(name=);
left join (
select name, age as age_&name.
from sashelp.class
where name="&name.") &name.
on class.name = &name..name
%mend joiner;
Second, we look at this and see two things we need to put into macro lists: the SELECT variable list (we'll get one new variable for each call), and the JOIN list.
proc sql;
select cats('%joiner(name=',name,')')
into :joinlist separated by ' '
from sashelp.class;
select cats(name,'.age_',name)
into :selectlist separated by ','
from sashelp.class;
quit;
And then, we just call it!
proc sql;
create table class_t as
select class.name,&selectlist. from (
select name from sashelp.class) class
&joinlist.
;
quit;
Now, your dataset you call the macro lists from is perhaps the dataset with the 3 rows in it you have above ("have"). The dataset you actually get the appending data from is some other dataset ("bar"), right? And then the ones you join to is perhaps a third dataset ("foo"). Here I just use the one, for simplicity, but the concept is the same, just different sources.
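For the narrower question of filling top3 from the table have instead of hard-coding it, a minimal sketch with select ... into follows. It assumes have has the columns i and x shown in the question; the %do loop in the original snippet would still need to sit inside a macro definition.

```sas
/* Build the space-separated list of x values, ordered by i */
proc sql noprint;
  select x
    into :top3 separated by ' '
    from have
    order by i;
quit;

/* SQLOBS holds the number of rows the last SELECT returned */
%let no = &sqlobs.;
%put NOTE: top3=&top3. no=&no.;
```

With these two macro variables set, the %let top3 = ...; and %let no = 3; lines in the original snippet are no longer needed.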
When the lookup data is in a table you can perform a three way join without any need for SAS Macro. You don't provide any data so the example will mock some.
Example:
Suppose a master record has several associated detail records, and the detail records contain a z value used for selection into a result set per a wanted z lookup table.
data masters;
call streaminit(2020);
do id = 1 to 100;
do x = 1 to 100;
m_rownum + 1;
code = rand('integer', 10,45);
output;
end;
end;
run;
data details;
call streaminit(2020);
do date = 1 to 20;
do x = 1 to 100;
do rep = 1 to 5;
d_rownum + 1;
amount = rand('integer', 100,200);
z = rand('integer', 10,45);
output;
end;
end;
end;
run;
data zs;
input z @@; datalines;
14 15 42
;
proc sql;
create table want as
select
m_rownum
, d_rownum
, masters.id
, masters.x
, masters.code
, details.z
, details.date
, details.amount
from
masters
left join
details
on
details.x = masters.x
inner join
zs
on
zs.z = details.z
order by
masters.id, masters.x, details.z, details.date
;
quit;

Creating a dataset with the unique values of indexed variable

I have a dataset (LRG_DS) with about 74,000,000 observations. The dataset has been indexed by a variable (I_VAR1) that has about 7500 unique values. I've discovered this by running a proc contents on the dataset.
I'd like to create a dataset (TEMP) that contains just the 7000 unique values of the index variable.
I've tried the following:
data TEMP;
set LRG_DS (keep = I_VAR1);
by I_VAR1;
if first.I_VAR1;
run;
and
proc sort data = LRG_DS nodupkey out = TEMP (keep = I_VAR1);
by I_VAR1;
run;
The first approach takes about 46 seconds and the second takes about 55 seconds.
I've read that the sas7bndx file is not intended to be examined in isolation, but rather as a file to speed up some of the procedures performed using the index variable.
Any help is much appreciated!
YMMV, but populating an empty hash table with the unique key values may perform better than a sort.
Create some example data:
data x;
do cnt=1 to 10*100000;
var=round(rand('uniform'),0.001);
do cnt2=1 to 10;
output;
end;
drop cnt2;
end;
run;
Test speed with a proc sort:
proc sort data=x(keep=var) out=sorted nodupkey;
by var;
run;
Compare with the hash table version:
data _null_;
set x(keep=var) end=eof;
if _n_ eq 1 then do;
declare hash ht ();
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;
if ht.check() ne 0 then do;
rc = ht.add();
end;
if eof then do;
ht.output(dataset:"ids");
end;
run;
From my very brief tests, I found that the hash table version starts to perform worse as the number of unique values grows. It may be possible to offset this by dimensioning the hash appropriately beforehand but I didn't test.
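A sketch of what dimensioning the hash beforehand could look like, using the hash object's hashexp argument tag (the table gets 2**hashexp buckets; the default is 8 and the maximum is 20). The value 16 is an untested guess, not a measured recommendation.

```sas
data _null_;
  set x(keep=var) end=eof;
  if _n_ eq 1 then do;
    /* hashexp: 16 requests 2**16 buckets instead of the default 2**8 */
    declare hash ht (hashexp: 16);
    rc = ht.DefineKey ('var');
    rc = ht.DefineDone ();
  end;
  /* add the key only if it has not been seen yet */
  if ht.check() ne 0 then do;
    rc = ht.add();
  end;
  if eof then do;
    ht.output(dataset:"ids");
  end;
run;
```

Timing this against the proc sort version on your own data is the only reliable way to tell whether the larger table helps.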

How to refer a variable from another file in SAS?

Suppose I have two files named "test" and "lookup".
The file "test" contains the following information:
COL1 COL2
az ab
fc ll
gc ms
cc ds
And the file "lookup" has:
VAR
ll
dd
cc
ab
ds
I want to find those observations, which are in "test" but not in "lookup" and to replace them with missing values. Here is my code:
data want; set test;
array COL[2] COL1 COL2;
do n=1 to 2;
if COL[n] in lookup.VAR then COL[n]=COL[n];
else COL[n]=.;
end;
run;
I tried the above code. But ERROR shows that "Expecting an relational or arithmetic operator".
My question is: how do I refer to a variable from another file?
First, grab the %create_hash() macro from this post.
You need to use a hash object to achieve what you are looking for.
The return code from a hash lookup is zero when found and non-zero when not found.
Character missing values are not . but "".
data want;
set test;
if _n_ = 1 then do;
%create_hash(lu,var,var,"lookup");
end;
array COL[2] COL1 COL2;
do n=1 to 2;
var = col[n];
rc = lu.find();
if rc then
col[n] = "";
end;
drop rc var n;
run;
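The %create_hash() macro comes from another post and is not reproduced here. As a sketch of the same lookup without it, assuming the datasets test and lookup from the question, the hash object can be declared directly:

```sas
data want;
  /* Give the key variable VAR its type and length without reading any rows */
  if 0 then set lookup(keep=var);
  if _n_ = 1 then do;
    /* Load the lookup values into memory once */
    declare hash lu(dataset: 'lookup');
    rc = lu.defineKey('var');
    rc = lu.defineDone();
  end;
  set test;
  array col[2] col1 col2;
  do n = 1 to 2;
    var = col[n];
    /* check() returns non-zero when the key is not in the hash */
    if lu.check() ne 0 then col[n] = '';
  end;
  drop rc var n;
run;
```

The check method is used instead of find because only the existence of the key matters here.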
Here is an alternative approach using proc sql:
proc sql;
create table want as
select case when col1 in (select var from lookup) then col1 else '' end as col1,
case when col2 in (select var from lookup) then col2 else '' end as col2
from test;
quit;