Dynamic Control Flow in DO loop in SAS data step - sas

Is the inner loop here (marked with ********) okay as is, or do I need to use something like %eval()? (I don't think I need %eval() because there are no macro variables.)
do _i = 1 to 5;
  if sp_id_array{_i} ne . then do;
    do _j = (_i+1) to 5; *********;
      if sp_id_array{_j} ne . then do;
        sp_id = sp_id_array{_i};
        sp_partner_id = sp_id_array{_j};
        output;
      end;
    end;
  end;
end;

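That's fine. (_i+1) is an ordinary data step expression, so SAS evaluates it at run time each time the inner loop starts; %eval() is only needed for arithmetic in macro code. In fact, you can even modify the loop control variable itself inside the loop. The step below prints only _i=1, because setting _i to 5 makes the next increment exceed the upper bound.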
data _null_;
  do _i = 1 to 5;
    put _i=;
    _i=5;
  end;
run;
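One related detail: unlike the index variable, the start, stop and BY expressions of an iterative DO are evaluated once, before the first iteration. Here is a minimal sketch (the variable names n and i are made up for illustration) showing that changing the upper bound inside the loop does not extend it:

```sas
data _null_;
  n = 3;
  do i = 1 to n;
    n = 10;      /* the stop value was already fixed at 3 when the loop began */
    put i= n=;   /* writes the current values to the log */
  end;
run;
```

The loop should still run three times, with i going from 1 to 3, even though n is 10 from the first iteration onward.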

Related

Aggregating multiple observations depending on validity ranges

I am facing a problem regarding the concatenation of multiple observations depending on validity ranges. The function I am trying to reproduce is similar to the Listagg() function in Oracle but I want to use it with regards to validity ranges.
Here is a reproducible minimal dataset:
data have;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
3,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,10JAN2016:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y,01JAN2019:00:00:00,31DEC2019:00:00:00
5,MEMBER,N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
                            
What I would like to do is concatenate the value variable for each group of var within an id.
However, there are several cases:
If there is only one value for a given var within an id, don't do anything (e.g. id=1 in my example).
If validity ranges are consecutive, output every value of var within the id unchanged (e.g. id=2).
If validity ranges are identical for the same var within an id, concatenate the values together (e.g. id=3).
If validity ranges overlap, concatenate the two values of var for the period they share and output it with that shared validity range (e.g. id=4).
If there are multiple non-consecutive validity ranges for the same var within an id, concatenate the values that share each validity range (e.g. id=5).
The desired result was shown as an image here; the same result is given as a data step further down.
Following @Kiran's answer on how to do Listagg function in SAS and @Joe's answer on List Aggregation and Group Concatenation in SAS Proc SQL, I tried to use the CATX function.
This is my attempt:
proc sort data=have;
by id var start;
run;
data staging1;
set have;
by id var start;
if first.var then group_number+1;
run;
/* Simulate LEAD() function in SAS */
data staging2;
merge staging1 staging1(firstobs = 2
keep=group_number start end
rename=(start=lead_start end=lead_end group_number=nextgrp));
if group_number ne nextgrp then do;
lead_start = .;
lead_end = .;
end;
drop nextgrp;
format lead_: datetime20.;
run;
proc sort data=staging2;
by id var group_number start;
run;
data want;
retain _temp;
set staging2;
by id var group_number;
/* Only one obs for a given variable, output directly */
if first.group_number = 1 and last.group_number = 1 then
output;
else if first.group_number = 1 and last.group_number = 0 then
do;
if lead_start ne . and lead_end ne .
and ((lead_start < end) or (lead_end < start)) then
do;
if (lead_start = start) or (lead_end = end) then
do;
retain _temp;
_temp = value;
end;
if (lead_start ne start) or (lead_end ne end) then
do;
_temp = value;
end = intnx('dtday',lead_start,-1);
output;
end;
end;
else if lead_start ne . and lead_end ne . and intnx('dtday', end, 1) = lead_start then
do;
_temp = value;
output;
end;
else output;
end;
else if first.group_number = 0 and last.group_number = 1 then
do;
/* Concatenate preceded retained value */
value = catx(";",_temp, value);
output;
call missing(_temp);
end;
else output;
drop _temp lead_start lead_end group_number;
run;
My attempt did not solve all the problems: only the cases id=1 and id=3 were output correctly. I am starting to think that first. and last. processing combined with the simulated LEAD() function is not the best approach, and that there is probably a better way to do this.
Desired results in data:
data want;
infile datalines4 delimiter=",";
input id var $ value $ start:datetime20. end:datetime20.;
format start end datetime20.;
datalines4;
1,NAME,AAA,01JAN2014:00:00:00,31DEC2020:00:00:00
1,MEMBER,Y,01JAN2014:00:00:00,31DEC9999:00:00:00
2,NAME,BBB,01JAN2014:00:00:00,31DEC9999:00:00:00
2,MEMBER,Y,01JAN2014:00:00:00,31DEC2016:00:00:00
2,MEMBER,N,01JAN2017:00:00:00,31DEC2019:00:00:00
3,NAME,CCC,01JAN2014:00:00:00,31DEC9999:00:00:00
3,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
4,NAME,DDD,01JAN2014:00:00:00,31DEC9999:00:00:00
4,MEMBER,Y,01JAN2014:00:00:00,09JAN2016:00:00:00
4,MEMBER,Y;N,10JAN2016:00:00:00,31DEC2017:00:00:00
4,MEMBER,N,01JAN2018:00:00:00,31DEC2019:00:00:00
5,NAME,EEE,01JAN2014:00:00:00,31DEC9999:00:00:00
5,MEMBER,Y;N,01JAN2014:00:00:00,31DEC2017:00:00:00
5,MEMBER,Y;N,01JAN2019:00:00:00,31DEC2019:00:00:00
;;;;
run;
It's pretty hard to do this in raw SQL without built-in windowing functions; the SAS data step offers better options.
Some of this depends on your data size. The first example below does exactly what you ask for, but it will probably be impractical with your real data. Part of that is the 31DEC9999 end dates, which expand into a huge number of rows, but even without them this produces thousands of rows per id, so with a million ids it will get very large. It might still be the best solution, depending on what you need, because it gives you the most control.
* First, expand the dataset to one row per day/value. (Hopefully you do not need the datetime - just the date.);
data daily;
set have;
do datevar = datepart(start) to datepart(end);
output;
end;
format datevar date9.;
drop start end;
run;
proc sort data=daily;
by id var datevar value;
run;
*Now, merge together the rows to one row per day - so days with multiple values will get merged into one.;
data merging;
set daily;
by id var datevar;
retain merge_value;
if (first.datevar and last.datevar) then output;
else do;
if first.datevar then merge_value = value;
else merge_value = catx(',',merge_value,value);
if last.datevar then do;
value = merge_value;
output;
end;
end;
keep id var datevar value;
run;
proc sort data=merging;
by id var value datevar;
run;
*Now, re-condense;
data want;
set merging;
by id var value datevar;
retain start end;
last_datevar = lag(datevar);
if first.value then do;
start = datevar;
end = .;
end;
else if last_datevar ne (datevar - 1) then do;
end = last_datevar;
output;
start = datevar;
end = .;
end;
if last.value then do;
end = datevar;
output;
end;
format start end date9.;
run;
I do not necessarily recommend doing this; it's provided for completeness, and in case it turns out to be the only way to do what you need.
Easier, most likely, is to condense in the data step using an event-level dataset, where 'start' and 'end' are events. Here's an example that does what you require; it turns each original row into just 2 rows, and then uses logic to decide what should happen for each event. This is pretty messy, so you'd want to clean it up for production, but the idea should work.
* First, make event level dataset so we can process the start and end separately;
data events;
set have;
type = 'Start';
dt_event = start;
output;
type = 'End';
dt_event = end;
output;
drop start end;
format dt_event datetime.;
run;
proc sort data=events;
by id var dt_event value;
run;
*Now, for each event, a different action is taken. Starts and Ends have different implications, and do different things based on those.;
data want;
set events(rename=value=in_value);
by id var dt_event;
retain start end value orig_value;
format value new_value $8.;
* First row per var is easy, just start it off with a START;
if first.var then do;
start = dt_event;
value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current VALUE from the concatenated VALUE string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = value;
do _i = 1 to countw(value,',');
if scan(orig_value,_i,',') ne in_value then new_value = catx(',',new_value,scan(orig_value,_i,','));
end;
orig_value = new_value;
if last.dt_event then do;
end = dt_event;
output;
start = dt_event + 86400;
value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end = dt_event - 86400;
if start < end and not missing(value) then output;
value = catx(',',value,in_value);
start = dt_event;
end = .;
end;
end;
format start end datetime21.;
keep id var value start end;
run;
Last, I'll leave you with this: you probably work in insurance, pharma, or banking, and either way this is a VERY solved problem; this sort of windowing is done a lot. For the most part you shouldn't be writing new code here. First look inside your company, and if nothing turns up, look for papers from PharmaSUG or FinSUG or one of the other SAS user groups, where they talk about this. There are probably several dozen published implementations that already do this.

Create a variable inside WHEN-DO statement SAS

I want to create two macro variables, "list" and "list2", depending on the value of prod, but they always end up with the values from the last branch.
Thanks
%let prod=WC;
SELECT ;
  WHEN (WC = &prod)
  DO;
    %let list = (60 , 63 );
    %let list2= ("6A","6B","6C") ;
  END;
  WHEN (MT = &prod)
  DO;
    %let list = (33 , 34);
    %let list2= ("3A","3B");
  END;
  OTHERWISE;
END;
RUN;
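The macro processor does its work before the generated code is passed to SAS to interpret, and a %LET statement executes as soon as it is read, regardless of the data step logic around it. So your code is effectively evaluated in this order: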
%let prod=WC;
%let list = (60 , 63 );
%let list2= ("6A","6B","6C") ;
%let list = (33 , 34);
%let list2= ("3A","3B");
data ...
SELECT ;
WHEN (WC = &prod) DO;
END;
WHEN (MT = &prod) DO;
END;
OTHERWISE;
END;
...
RUN;
To set macro variables from a running data step, use the CALL SYMPUTX routine. Also, are you really trying to compare data step variables? With prod set to WC, WHEN (MT = &prod) becomes WHEN (MT = WC), which compares a variable named MT to a variable named WC. Does the data in your data step even have those variables? Or did you want to compare the text WC to the value of &prod?
when ("WC" = "&prod") do;
call symputx('list','(60,63)');
call symputx('list2','("6A","6B","6C")') ;
end;
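Putting the pieces together, here is a minimal sketch of a complete step. It assumes a DATA _NULL_ step is acceptable and compares against the quoted value of &prod with SELECT; the MT values are taken from the question, and the final %PUT is only there to check the result:

```sas
%let prod = WC;

data _null_;
  /* compare the text of &prod, not data step variables */
  select ("&prod");
    when ("WC") do;
      call symputx('list',  '(60,63)');
      call symputx('list2', '("6A","6B","6C")');
    end;
    when ("MT") do;
      call symputx('list',  '(33,34)');
      call symputx('list2', '("3A","3B")');
    end;
    otherwise;  /* no macro variables are created for other values */
  end;
run;

/* the macro variables are available once the step has run */
%put list=&list list2=&list2;
```

Because CALL SYMPUTX executes while the data step runs, only the branch that matches assigns the macro variables. If prod is neither WC nor MT, list and list2 are never created, so the %PUT would warn about unresolved references.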
Use CALL SYMPUT in a data step (see the CALL SYMPUT entry in the SAS documentation).
Your statements will be something like:
call symput("list", "(60 , 63 )");
Hope this helps :-)

Generate multiple lags through loops in SAS?

I'm trying to generate 20 lags for a variable.
To generate the first lag, I use the following statement:
data temp.data2;
set temp.data1;
by gvkey fyear;
lag1 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(mv),.);
lag2 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(lag1),.);
etc.
run;
I don't want to repeat this 20 times. Is there a way to do this through a loop?
Thanks a lot!
You would have to maintain your own array of mv values and assign the lag values from that. The array is shifted down by one position for each row processed and reset at the start of each fyear group.
Example:
data have;
do gvkey = 1 to 5;
do fyear = 1 to 5;
do day = 1 to ifn(fyear=3, 10, 30);
mv = 366-day;
output;
end;
end;
end;
run;
data want;
  set have;
  by gvkey fyear;
  array mvs(20) _temporary_;
  array lags(20) lag1-lag20;
  if first.fyear then call missing(of mvs(*));
  * assign lags;
  do _n_ = 1 to dim(lags);
    lags(_n_) = mvs(_n_);
  end;
  * bubble mvs;
  do _n_ = dim(lags) to 2 by -1;
    mvs(_n_) = mvs(_n_-1);
  end;
  mvs(1) = mv;
run;

nesting do until eof in do while loop

Is it possible to nest a do until (eof) in a do until or do while loop? I have not been able to get it to work.
Data
data have;
do i = 1 to 3;
output;
end;
run;
Example - quits after first j
data want;
  j = 1;
  do while (j < 4);
    do until (eof);
      set have end=eof;
    end;
    call missing(eof);
    output;
    j + 1;
  end;
run;
Let me know if I haven't made my problem clear enough. Thanks!
In general you can use any type of DO loop inside any other type.
Your particular program stops the first time it executes the SET statement on the second iteration of the outer DO loop, since there are no more observations to read. Most SAS data steps stop the same way, when they have exhausted their input stream.
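The stopping behaviour described above can be seen with a small sketch. It reuses the three-row have dataset from the question; the message text is made up for illustration:

```sas
data have;
  do i = 1 to 3;
    output;
  end;
run;

data _null_;
  /* PUT runs before SET, so it also fires on the iteration that fails to read */
  put 'Starting iteration ' _n_;
  set have;
run;
```

The log should show four "Starting iteration" lines for three observations: the fourth iteration begins, the SET statement finds nothing left to read, and the step ends at that statement without executing anything after it.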
If you really want to re-read the whole dataset then use the POINT= option on the SET statement. Make sure to have a way to stop the step since it will no longer be able to read past the end of the input stream.
data want;
  do j=1 to 4;
    do p=1 to nobs;
      set have point=p nobs=nobs ;
      * Do something there ;
    end;
    * Do something else here ;
  end;
  * and maybe something here too ;
  stop;
run;

Report how many times condition is met in SAS

Given a data step like this:
data tmp;
do i=1 to 10;
if 3<i<7 then do;
some stuff;
end;
end;
run;
I want to write to the log how many times the if statement is true. For example, in this example, I want to have a line in the log that says:
If statement true 3 times
because the condition is true when i is 4, 5, or 6. How can I do this?
Using retain to keep a counter variable, it's pretty easy to increment a count of how many times an if condition was met.
data tmp;
  retain Counter 0;
  do i=1 to 10;
    if 3<i<7 then do;
      Counter+1;
      *some stuff;
    end;
  end;
  put 'If statement true ' Counter 'time(s).';
run;
Note that this writes to the log once, because the PUT is the last thing that executes before the data step terminates (the step in the example iterates only once). For a data step that iterates more than once (e.g. when a SET statement reads data in from another dataset), you'd want to tell SAS to report only at the end of the step. You'd do it like this:
* create an example input data set;
data exampleData;
  do i=1 to 10;
    output;
  end;
run;
* use a variable 'eof' to indicate the end of the input dataset;
data new;
  set exampleData end=eof;
  retain Counter 0;
  if 3<i<7 then do;
    Counter+1;
    *some stuff;
  end;
  if eof then put 'If statement true ' Counter 'time(s).';
run;