SAS LAG function issue (Not in conditional clause) - sas

The code below, which combines LAG with an IF block, doesn't work the way I expected. I know how to correct it, but I am looking for help understanding why it happens. I know LAG can misbehave inside a conditional statement, but my LAG call is outside the IF block.
The problem is with the records for the second ID. For vid=2, rid=2, Priorflag should be 0 instead of 2, and I don't understand why.
data a;
input vid 1. rid 2. flag 3. ;
datalines;
1 1 0
1 2 1
1 3 1
1 4 0
2 1 0
2 2 0
2 3 0
2 4 2
;
run;
/*incorrect version*/
data b;
set a;
by vid;
Cumflag+flag;
/*Keep track of prior record running total*/
put (vid rid flag CumFlag)(=) ;
Priorflag=lag(Cumflag);
put (vid rid flag CumFlag PriorFlag)(=) ;
if first.vid then do;
Cumflag=flag;
Priorflag=0;
put (vid rid CumFlag PriorFlag)(=) ;
end;
run;
/*correct version*/
data c;
set a;
by vid;
Cumflag+flag;
/*Keep track of prior record running total*/
if first.vid then Cumflag=flag;
Priorflag=lag(Cumflag);
if first.vid then Priorflag=0;
run;
Output dataset B is as follows.
The issue is the row with vid=2, rid=2, where Priorflag=2.
vid rid flag Cumflag Priorflag
1 1 0 0 0
1 2 1 1 0
1 3 1 2 1
1 4 0 2 2
2 1 0 0 0
2 2 0 0 2
2 3 0 0 0
2 4 2 2 0
Log file is here.
vid=1 rid=1 flag=0 Cumflag=0
vid=1 rid=1 flag=0 Cumflag=0 Priorflag=.
vid=1 rid=1 Cumflag=0 Priorflag=0
vid=1 rid=2 flag=1 Cumflag=1
vid=1 rid=2 flag=1 Cumflag=1 Priorflag=0
vid=1 rid=3 flag=1 Cumflag=2
vid=1 rid=3 flag=1 Cumflag=2 Priorflag=1
vid=1 rid=4 flag=0 Cumflag=2
vid=1 rid=4 flag=0 Cumflag=2 Priorflag=2
vid=2 rid=1 flag=0 Cumflag=2
vid=2 rid=1 flag=0 Cumflag=2 Priorflag=2
vid=2 rid=1 Cumflag=0 Priorflag=0
vid=2 rid=2 flag=0 Cumflag=0
vid=2 rid=2 flag=0 Cumflag=0 Priorflag=2 (*** having question here, since cumflag=0 prior**)
vid=2 rid=3 flag=0 Cumflag=0
vid=2 rid=3 flag=0 Cumflag=0 Priorflag=0
vid=2 rid=4 flag=2 Cumflag=2
vid=2 rid=4 flag=2 Cumflag=2 Priorflag=0

Agree with Tom. Suggest adding a PUT statement so you can see the values.
data b;
set a;
by vid;
Cumflag+flag;
/*Keep track of prior record running total*/
Priorflag=lag(Cumflag);
put (vid flag CumFlag PriorFlag)(=) ;
if first.vid then do;
Cumflag=flag;
Priorflag=0;
end;
run;
returns
vid=1 flag=0 Cumflag=0 Priorflag=.
vid=1 flag=1 Cumflag=1 Priorflag=0
vid=1 flag=1 Cumflag=2 Priorflag=1
vid=1 flag=0 Cumflag=2 Priorflag=2
vid=2 flag=0 Cumflag=2 Priorflag=2 ***
vid=2 flag=0 Cumflag=0 Priorflag=2 ***
vid=2 flag=0 Cumflag=0 Priorflag=0
vid=2 flag=2 Cumflag=2 Priorflag=0
vid=3 flag=0 Cumflag=2 Priorflag=2
vid=3 flag=0 Cumflag=0 Priorflag=2
vid=3 flag=1 Cumflag=1 Priorflag=0
vid=3 flag=3 Cumflag=4 Priorflag=1
Note that for the first record with vid=2, the value of CumFlag at the time lag() is executed is 2. This means the value 2 is put into the lag queue. Then for the second record with vid=2, the value of CumFlag is 0 when the lag executes, so this value is entered into the lag queue and it returns 2.
BTW, the lag function works correctly when it is called conditionally. The key to understanding lag is remembering that it is a queue.
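To make the queue behaviour concrete, here is a minimal sketch. The dataset name demo and the variables x, prev_x and prev_even are made up for illustration. The second LAG call only runs when x is even, so it returns the previous even value rather than the previous row's value.

```sas
data demo;
  input x;
  /* Runs on every row: the queue always holds the previous row's x. */
  prev_x = lag(x);
  /* Runs only when x is even: this call has its own queue, which is
     updated only on those rows, so it returns the previous even x. */
  if mod(x, 2) = 0 then prev_even = lag(x);
  put x= prev_x= prev_even=;
  datalines;
1
2
3
4
6
;
run;
```

On the row with x=4, prev_x is 3 but prev_even is 2, because 2 was the last value passed to that particular LAG call.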

The LAG() function does not retrieve the value from the previous observation. It just retrieves the value from the list of previous values you have passed in. The value that is lagged is the value that variable has when the LAG() function runs. That is why you normally do not want to execute the LAG() function conditionally.
The first program is lagging the value of CUMFLAG before resetting CUMFLAG on the first observation in the BY group. The second program is lagging the value of CUMFLAG after it has been reset.

The LAG<n>() function is a FIFO queue of size <n>. LAG is simply a queue with one item.
LAG(variable) does the following, at the point of invocation:
retrieve the return value from the front of the queue
shift all queue values forward by 1
push the variable value onto the back of the queue
The unconditional use of LAG in a data step will return the variable value from the prior implicit iteration; in typical use with SET, the value returned is the one from the prior row.
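A small sketch of the "point of invocation" idea, using a made-up loop variable x and no input data set at all. Each LAG call site is driven purely by how many times it has been called, not by which row is being read.

```sas
data _null_;
  do x = 10, 20, 30, 40;
    prev1 = lag(x);   /* one-item queue: value from the previous call */
    prev2 = lag2(x);  /* two-item queue: value from two calls ago */
    put x= prev1= prev2=;
  end;
run;
```

All four calls happen inside a single iteration of the data step, yet on the last pass prev1 is 30 and prev2 is 20.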

Related

Update Baseline Value Based on Previous Rows

Every Subject has a baseline. Once the difference between the value and the baseline exceeds 5, that value becomes the baseline for all future comparisons until another value exceeds this new baseline by 5.
This is what I want the output data to look like:
This is what I'm getting
This is my current code, which gets me as close as anything I've tried. I've tried different combinations of retain, lag(), and ifn (suggested in this post)
Data Have;
Input Visit usubjid Baseline Value;
datalines;
1 1 112.2 112.2
2 1 112.2 113.7
3 1 112.2 112
3 1 112.2 108
4 1 112.2 109
5 1 112.2 107
7 1 112.2 106
8 1 112.2 107
;
run;
proc sort;by usubjid;run;
data want;
Length chg $71;
retain chg;
set Have;
length prevchg $71;
by usubjid;
prevchg=chg;
if first.usubjid then do; prevchg=''; end;
baseline=ifn(prevchg in ('Increase >= 5mm New', "Decrease >= 5mm"),lag(value),lag(baseline));
diff = value-baseline;
if visit > 1 then do;
if diff > 5 then do; chg='Increase >= 5mm New'; order = 3; end;
else if diff < -5 then do; chg = 'Decrease >= 5mm'; order = 6; end;
else if -5 <= diff <= 5 then do;
if prevchg in('Increase >= 5mm New', 'Increase > 5mm Persistent') then do; chg ='Increase > 5mm Persistent'; order = 4; end;
else do; chg = 'No Change (change >= -5 and <= 5mm)'; order = 5; end;
end;
end;
run;
Right now the code will correctly update the baseline to the previous value for the next visit, but then goes right back to the original baseline. I'm confident this has something to do with the way Lag() and Retain work with if/then, but I cannot figure out the solution. here is an example of the issue:
You should be able to do this easily. The BASELINE variable CANNOT be on the input if you want to RETAIN its value.
data want ;
set have(drop=baseline); /* drop the input BASELINE so SET does not overwrite the retained one */
by usubjid;
retain baseline;
if first.usubjid then baseline=value;
difference = baseline - value;
output;
if difference > 5 then baseline=value;
run;
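The step above only moves the baseline when the value falls more than 5 below it. A minimal sketch of a two-sided variant follows, assuming a rise of more than 5 should also reset the baseline (the question's Increase/Decrease labels suggest so, but that is an assumption). The name orig_baseline is made up here; the input baseline is renamed so it cannot overwrite the retained one.

```sas
data have;
  input visit usubjid baseline value;
  datalines;
1 1 112.2 112.2
2 1 112.2 113.7
3 1 112.2 112
3 1 112.2 108
4 1 112.2 109
5 1 112.2 107
7 1 112.2 106
8 1 112.2 107
;
run;

data want;
  /* Rename the input baseline so it cannot overwrite the retained one. */
  set have(rename=(baseline=orig_baseline));
  by usubjid;
  retain baseline;
  if first.usubjid then baseline = orig_baseline;
  diff = value - baseline;
  /* Write the row with the baseline it was compared against ... */
  output;
  /* ... then move the baseline for the rows that follow. */
  if abs(diff) > 5 then baseline = value;
run;
```

With the sample data, the baseline stays at 112.2 through visit 5 (value 107, diff -5.2) and is 107 for visits 7 and 8.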

Create a running counter based on ID and date

I have 3 variables and a counter has to be created based on them.
Input:
ID window start window end
1 29oct20 12mar21
1 31oct20 08Feb21
1 31oct21 08feb21
1 31oct21 08feb21
2 06Nov20 11Apr21
2 06Nov20 11Apr21
2 27Nov20 01Apr19
Expected output:
ID window start window end priority_count
1 29oct20 12mar21 1
1 31oct20 08Feb21 2
1 31oct21 08feb21 2
1 31oct21 08feb21 2
2 06Nov20 11Apr21 1
2 06Nov20 11Apr21 1
2 27Nov20 01Apr19 2
So for every ID a new count should start once a new date comes.
I have been using this code
data want;
set have;
by ID window_start window_end;
if first.ID and first.window_start and first.window_end then priority_count=1;
else priority_count+1;
run;
But it gives:
priority_count
1
2
3
4
1
2
3
Not sure if those are typos but there are several observations for which window_start is after window_end.
Using the LAG function
data want;
set have;
by id;
_lag=lag(window_start);
if first.id then priority_count=1;
else do;
if window_start ne _lag then
priority_count + 1;
end;
drop _lag;
run;
ID window_start window_end priority_count
1 29OCT2020 12MAR2021 1
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
2 06NOV2020 11APR2021 1
2 06NOV2020 11APR2021 1
2 27NOV2020 01APR2019 2
I think you're on the right track but need slight modifications to your IF statements to reflect the logic.
Set to 0 at first of each ID
Increment if the window_end changes (or window_start since they're consistent in your example). Setting it to 0 initially means you can increment without worrying if it's the first or not.
data want;
set have;
by ID window_start window_end;
if first.ID then priority_count=0;
if first.window_end then priority_count+1;
run;

SAS and do loop

I'm writing a program in SAS.
Here's the dataset I have:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
For each ID, I want to delete the record if variable huuse ne 1, until I get to the first huuse=1. Then I want to keep that record and all subsequent records for that id, no matter what value huuse is. So for id=1, I want to delete the first two records, then keep all records for id=1 starting with the 3rd record. For id=2, the first record has huuse=1, so I want to keep all records for id=2.
The data set I want should look like this:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
I tried this code, but it removes all records that have huuse ne 1.
data want;
set have;
by id;
do until (huuse=1);
if huuse = 1 then LEAVE;
if huuse ne 1 then DELETE;
END;
run;
I've tried several variations of do loops, but they all do the same thing.
The DATA step is a program with an implicit loop that reads every record of the data set specified in the SET statement. Any program data vector (PDV) variables not coming from the data set are, by default, reset to missing at the top of the implicit loop. You change that behavior by using a RETAIN statement to name variables that should not get reset.
So, in your problem a tracking variable is needed, and there are two situations in which it changes. The variable will track the state of the condition "Have I seen huuse=1 yet in this group?". Call this variable one_flag.
RETAIN one_flag; so you control when its value changes
At the start of a BY group one_flag needs to be reset to false (0)
When huuse is first seen as 1 set the flag to true (1)
Example:
data want(drop=one_flag);
set have;
by id;
retain one_flag 0;
if first.id then one_flag = 0;
if not one_flag and huuse = 1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
run;
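The difference RETAIN makes can be seen in a small sketch. The data set three and the variables kept and not_kept are made up for illustration.

```sas
data three;
  do id = 1 to 3;
    output;
  end;
run;

data _null_;
  set three;
  retain kept 0;
  kept = kept + 1;              /* keeps its value across iterations: 1, 2, 3 */
  not_kept = sum(not_kept, 1);  /* reset to missing each iteration: always 1 */
  put id= kept= not_kept=;
run;
```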
You can place the SET and BY statement inside an explicit DO and that changes the operating behavior of the program, especially if the explicit loop is terminated according to a LAST.<var> automatic variable. Such a loop is commonly called a DOW loop by SAS programmers. There is no phrase DOW loop in the SAS documentation.
Example:
data want;
do until (last.id);
set have;
by id;
if not one_flag and huuse=1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
end;
run;
Because the looping is explicit and never reaches the top of the program within the loop, there is no need to RETAIN the flag variable or to reset it. Program variables that are not retained are reset automatically at the top of the program, and the top of the program is only reached at the start of the BY group. Learn more about this programming construct in the SGF 2013 paper "The Magnificent DO" by Paul M. Dorfman.
Your source and result are same :-)
But if I understood your question correctly, the solution is quite simple with RETAIN. I added 2 rows to the example to make it clear that I understood correctly.
The code with example table:
data test;
id=1;huuse=0;days=4;output;
id=1;huuse=0;days=3;output;
id=1;huuse=1;days=12;output;
id=1;huuse=1;days=1;output;
id=1;huuse=2;days=15;output;
id=2;huuse=1;days=13;output;
id=2;huuse=0;days=16;output;
id=2;huuse=1;days=18;output;
id=2;huuse=0;days=44;output;
id=3;huuse=0;days=1;output;
id=3;huuse=1;days=2;output;
run;
data test_output;
set test;
retain keep_id -1;
if (keep_id ne id and huuse = 1) then keep_id=id;
if keep_id = id then output;
run;
/* the results:
id huuse days keep_id
1 1 12 1
1 1 1 1
1 2 15 1
2 1 13 2
2 0 16 2
2 1 18 2
2 0 44 2
3 1 2 3
*/

SAS: Extract previous and next observations

I have a dataset like this(type is an indicator):
datetime type
...
ddmmyy:10:30:00 0
ddmmyy:10:31:00 0
ddmmyy:10:32:00 1
ddmmyy:10:33:00 0
ddmmyy:10:34:00 1
ddmmyy:10:35:00 0
...
I was trying to extract data with type 1 and also the previous and next one. Just try to extract (-1,+1) window based on type 1.
datetime type
...
ddmmyy:10:31:00 0
ddmmyy:10:32:00 1
ddmmyy:10:33:00 0
ddmmyy:10:34:00 1
ddmmyy:10:35:00 0
...
I found a similar post here. I copied and pasted the code, but I am not quite sure what does 'x' mean in his code. SAS gives me 'File WORK.x does not exist'.
Can someone help me out? Thx.
The X data set in the other post is the same source table you are filtering, so the logical order of the code is:
Check every row in the table 'Have', _N_ holds the current row number,
If Type = 1, then Set Have Point=_N_ goes to row _N_ in the 'Have' table and outputs that row to the new table 'want', then continues to the next row. The pointer can be set to the current, previous or next row. (The two inner IF statements handle the first and last rows, where there is no previous or no next row.)
Full Working Code:
data have;
length datetime $23.;
input datetime $ type ;
datalines;
ddmmyy:10:30:00 0
ddmmyy:10:31:00 0
ddmmyy:10:32:00 1
ddmmyy:10:33:00 0
ddmmyy:10:34:00 1
ddmmyy:10:35:00 0
;
run;
data want(drop=current prev next); /* drop the pointer variables so repeated rows are exact duplicates */
set have nobs=nobs;
if type = 1 then do;
current = _N_;
prev = current - 1;
next = current + 1;
if prev > 0 then do;
set have point = prev;
output;
end;
set have point = current;
output;
if next <= nobs then do;
set have point = next;
output;
end;
end;
run;
proc sort data=want noduprecs;
by _all_ ; Run;
Note: I added an extra step proc sort to remove duplicate rows.
Output:
datetime=ddmmyy:10:31:00 type=0
datetime=ddmmyy:10:32:00 type=1
datetime=ddmmyy:10:33:00 type=0
datetime=ddmmyy:10:34:00 type=1
datetime=ddmmyy:10:35:00 type=0
For your example data, which does not have any id or group variable, it should be pretty straightforward. Instead of thinking about moving back and forth in the file, just create new variables that contain the previous (LAG_TYPE) and next (LEAD_TYPE) value for TYPE. Then your requirement to keep the observations before the one with TYPE=1 is translated to keeping the observations where LEAD_TYPE=1.
Let's convert your sample data into a dataset.
data have ;
input datetime :$15. type ;
cards;
ddmmyy:10:30:00 0
ddmmyy:10:31:00 0
ddmmyy:10:32:00 1
ddmmyy:10:33:00 0
ddmmyy:10:34:00 1
ddmmyy:10:35:00 0
;
Rather than actually keeping the required observations I will make a new variable KEEP that will be true for records that meet your criteria.
data want ;
recno+1;
set have end=eof;
lag_type=lag(type);
if not eof then set have(firstobs=2 keep=type rename=(type=lead_type));
else lead_type=.;
keep= (type=1 or lag_type=1 or lead_type=1) ;
run;
Here is the result.
recno datetime type lag_type lead_type keep
1 ddmmyy:10:30:00 0 . 0 0
2 ddmmyy:10:31:00 0 0 1 1
3 ddmmyy:10:32:00 1 0 0 1
4 ddmmyy:10:33:00 0 1 1 1
5 ddmmyy:10:34:00 1 0 0 1
6 ddmmyy:10:35:00 0 1 . 1
Move the next observation up with a look-ahead merge so the current and next observations sit on the same row, and use lag to compare the current observation with the previous one.
data have;
length datetime $23.;
input datetime $ type ;
datalines;
ddmmyy:10:30:00 0
ddmmyy:10:31:00 0
ddmmyy:10:32:00 1
ddmmyy:10:33:00 0
ddmmyy:10:34:00 1
ddmmyy:10:35:00 0
;
run;
data want;
merge have have(firstobs=2 keep=type rename=(type=_type));
if max(type,_type) or max(type,lag(type)) ;
drop _type;
run;

What does this if mean in a data step?

In this data step I do not understand what if last.y does.
Could you tell me?
data stop2;
set stop2;
by x y z t;
if last.y; /*WHAT DOES THIS DO ??*/
if t ne 999999 then
t=t+1;
else do;
t=0;
z=z+1;
end;
run;
LAST.Y refers to the row immediately before a change in the value of Y. So, in the following dataset:
data have;
input x y z;
datalines;
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
2 3 1
2 3 2
2 3 3
;;;;
run;
LAST.Y would occur on the third, sixth, ninth, and twelfth rows in that dataset (on each row where Z=3). The first two times are when Y is about to change from 1 to 2, and when it is about to change from 2 to 3. The third time is when X is about to change - LAST.Y triggers when Y is about to change or when any variable before it in the BY list changes. Finally, the last row in the dataset is always LAST.(whatever).
In the specific dataset above, the subsetting if means you only take the last row for each group of Ys. In this code:
data want;
set have;
by x y z;
if last.y;
run;
You would end up with the following dataset:
data want;
input x y z;
datalines;
1 1 3
1 2 3
1 3 3
2 3 3
;;;;
run;
at the end.
One thing you can do if you want to see how FIRST and LAST operate is to use PUT _ALL_;. For example:
data want;
set have;
by x y z;
put _all_;
if last.y;
run;
It will show you all of the variables, including FIRST.(whatever) and LAST.(whatever) on the dataset. (FIRST.Y and LAST.Y are actually variables.)
In SAS, first. and last. are variables created implicitly within a data step.
Each variable named in the BY statement gets a first. and a last. variable, set for every record in the DATA step. These values will be either 0 or 1. Writing if last.y is the same as writing if last.y = 1.
Please refer here for further info.
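Because first. and last. variables are not written to the output data set, one way to inspect them is to copy them into ordinary variables. This is a minimal sketch; the data set flags and the variables first_y and last_y are made up for illustration.

```sas
data have;
  input x y z;
  datalines;
1 1 1
1 1 2
1 2 1
2 2 1
;
run;

data flags;
  set have;
  by x y;
  /* FIRST./LAST. variables are not written to the output data set,
     so copy them into ordinary variables to inspect them. */
  first_y = first.y;
  last_y  = last.y;
run;
```

On the last row both flags are 1, even though y is still 2, because x changed and x comes before y in the BY statement.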
That is an example of a subsetting IF statement, which is different from an IF/THEN statement. It basically means that if the condition is not true, then stop this iteration of the data step right now, without writing the observation.
So
if last.y;
is equivalent to
if not last.y then delete;
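or, provided the step contains an explicit OUTPUT statement (otherwise RETURN still writes the current observation through the automatic output),
if not last.y then return;
or
if last.y then do;
... rest of the data step before the run ...
output;
end;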