Update Baseline Value Based on Previous Rows - sas

Every Subject has a baseline. Once the difference between the value and the baseline exceeds 5, that value becomes the baseline for all future comparisons until another value exceeds this new baseline by 5.
This is what I want the output data to look like:
This is what I'm getting
This is my current code, which gets me as close as anything I've tried. I've tried different combinations of retain, lag(), and ifn (suggested in this post)
Data Have;
Input Visit usubjid Baseline Value;
datalines;
1 1 112.2 112.2
2 1 112.2 113.7
3 1 112.2 112
3 1 112.2 108
4 1 112.2 109
5 1 112.2 107
7 1 112.2 106
8 1 112.2 107
;
run;
proc sort;by usubjid;run;
data want;
Length chg $71;
retain chg;
set Have;
length prevchg $71;
by usubjid;
prevchg=chg;
if first.usubjid then do; prevchg=''; end;
baseline=ifn(prevchg in ('Increase >= 5mm New', "Decrease >= 5mm"),lag(value),lag(baseline));
diff = value-baseline;
if visit > 1 then do;
if diff > 5 then do; chg='Increase >= 5mm New'; order = 3; end;
else if diff < -5 then do; chg = 'Decrease >= 5mm'; order = 6; end;
else if -5 <= diff <= 5 then do;
if prevchg in('Increase >= 5mm New', 'Increase > 5mm Persistent') then do; chg ='Increase > 5mm Persistent'; order = 4; end;
else do; chg = 'No Change (change >= -5 and <= 5mm)'; order = 5; end;
end;
end;
run;
Right now the code will correctly update the baseline to the previous value for the next visit, but then goes right back to the original baseline. I'm confident this has something to do with the way Lag() and Retain work with if/then, but I cannot figure out the solution. here is an example of the issue:

You should be able to do this easily. The BASELINE variable CANNOT be on the input if you want to RETAIN its value.
data want ;
set have ;
by usubjid;
retain baseline;
if first.usubjid then baseline=value;
difference = baseline - value;
output;
if difference > 5 then baseline=value;
run;

Related

How do I select the first 5 observations with regard to duplicates? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I have a large dataset containting over 80 000 000 rows sorted by "name" and "income" (with duplicates both for name and income). For the first name I would like to have the 5 lowest incomes. For the second name I would like to have the 5 lowest incomes (but incomes drawn to the first name are then disqualified to be selected). And so on, until the last name (if there are any incomes left at that time).
You first want to rank income within names. So:
proc rank data=yourdata out=temp ties=low;
by name;
var income;
ranks incomerank;
run;
Then you want to filter the 5 lowest incomes by name, so:
proc sql;
create table want as
select distinct *
from temp
where incomerank < 6;
quit;
You will need to sort and track incomes
Use an array to sort and track the lowest five income of a name.
Use a hash to track and check the observance of an income being output and thus ineligible for output by later names.
Example:
An insert sort of eligible low valued incomes is used and will be fast due to only 5 items.
data have;
call streaminit(1234);
do name = 1 to 1e6;
do seq = 1 to rand('integer', 20);
income = rand('integer', 20000, 1000000);
output;
end;
end;
run;
data
want (label='Lowest 5 incomes (first occurring over all names) of each name')
want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
array X(5) _temporary_;
if _n_ = 1 then do;
if 0 then set have;
declare hash incomes();
incomes.defineKey('income');
incomes.defineDone();
end;
_maxmin5 = 1e15;
x(1) = 1e15;
x(2) = 1e15;
x(3) = 1e15;
x(4) = 1e15;
x(5) = 1e15;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if incomes.check() = 0 then continue;
* insert sort - lowest five not observed previously;
if income > _maxmin5 then continue;
do _i_ = 1 to 5;
if income < x(_i_) then do;
do _j_ = 5 to _i_+1 by -1;
x(_j_) = x(_j_-1);
end;
x(_i_) = income;
_maxmin5 = x(5);
incomes.add();
leave;
end;
end;
end;
_outflag = 0;
do _n_ = 1 to _n_;
set have;
if income in x then do;
_outflag = 1;
OUTPUT want;
end;
end;
if not _outflag then
OUTPUT want_barren;
drop _:;
run;
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
address = cats('Address_', _N_);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(dataset : 'have(obs=0)', ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedata(all : 'y');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Here is my interpretation of your problem and a solution.
Suppose a simplified version of your data looks like this and you want the 2 lowest income for each name. For simplicity, I use a numeric variable n as name, but a character var will work as well.
data have;
input n income;
datalines;
1 100
1 200
1 300
2 400
2 100
2 500
3 600
3 200
3 500
;
From this data, my guess is that your logic goes like this:
Start with n = 1.
Output the 2 observations with the lowest income (100 and 200)
Go to the next name (n=2).
Output the 2 observations with the lowest income, that has not already been output (300 and 400). 200 Has been output in the n=1 group.
...And so on...
This gives the desired result below:
data want;
input n income;
datalines;
1 100
1 200
2 300
2 400
3 500
;
Try out the solution below and verify that you get the result as posted above.
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 2 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Finally, let us create representable example data with 80Mio obs. I change the if c = 2 then leave; statement to if c = 5 then leave; to go back to your actual problem.
The code below runs in about 45 sec on my system and processes the data in a single pass. Let me know is it works for you :-)
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;

Sum consecutive observations in a dataset SAS

I have a dataset that looks like:
Hour Flag
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
I want to have an output dataset like:
Total_Hours Count
2 2
3 1
4 1
As you can see, I want to count the number of hours included in each period with consecutive "1s". A missing value ends the consecutive sequence.
How should I go about doing this? Thanks!
You'll need to do this in two steps. First step is making sure the data is sorted properly and determining the number of hours in a consecutive period:
PROC SORT DATA = <your dataset>;
BY hour;
RUN;
DATA work.consecutive_hours;
SET <your dataset> END = lastrec;
RETAIN
total_hours 0
;
IF flag = 1 THEN total_hours = total_hours + 1;
ELSE
DO;
IF total_hours > 0 THEN output;
total_hours = 0;
END;
/* Need to output last record */
IF lastrec AND total_hours > 0 THEN output;
KEEP
total_hours
;
RUN;
Now a simple SQL statement:
PROC SQL;
CREATE TABLE work.hour_summary AS
SELECT
total_hours
,COUNT(*) AS count
FROM
work.consecutive_hours
GROUP BY
total_hours
;
QUIT;
You will have to do two things:
compute the run lengths
compute the frequency of the run lengths
For the case of using the implict loop
Each run length occurnece can be computed and maintained in a retained tracking variable, testing for a missing value or end of data for output and a non missing value for run length reset or increment.
Proc FREQ
An alternative is to use an explicit loop and a hash for frequency counts.
Example:
data have; input
Hour Flag; datalines;
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
;
data _null_;
declare hash counts(ordered:'a');
counts.defineKey('length');
counts.defineData('length', 'count');
counts.defineDone();
do until (end);
set have end=end;
if not missing(flag) then
length + 1;
if missing(flag) or end then do;
if length > 0 then do;
if counts.find() eq 0
then count+1;
else count=1;
counts.replace();
length = 0;
end;
end;
end;
counts.output(dataset:'want');
run;
An alternative
data _null_;
if _N_ = 1 then do;
dcl hash h(ordered : "a");
h.definekey("Total_Hours");
h.definedata("Total_Hours", "Count");
h.definedone();
end;
do Total_Hours = 1 by 1 until (last.Flag);
set have end=lr;
by Flag notsorted;
end;
Count = 1;
if Flag then do;
if h.find() = 0 then Count+1;
h.replace();
end;
if lr then h.output(dataset : "want");
run;
Several weeks ago, #Richard taught me how to use DOW-loop and direct addressing array. Today, I give it to you.
data want(keep=Total_Hours Count);
array bin[99]_temporary_;
do until(eof1);
set have end=eof1;
if Flag then count + 1;
if ^Flag or eof1 then do;
bin[count] + 1;
count = .;
end;
end;
do i = 1 to dim(bin);
Total_Hours = i;
Count = bin[i];
if Count then output;
end;
run;
And Thanks Richard again, he also suggested me this article.

Matching Algorithm In SAS

So I have this code which aims to left join a new interest rate based on its original maturity. for example, if the original maturity is 0.25 years, I want it to append on the interest rate for a 3 month rate.
The code looks like
data want;
set maturity;
retain tempMat tempRate;
i = 1;
do until(stop eq 1);
set Maturity2 nobs = num point = i;
diff = abs(sum(Maturity, -OriginalMaturity));
if i eq 1 then
do;
tempMat = diff;
tempRate = Base_Rate_New;
end;
else
do;
if i = num then
do;
stop = 1;
Rate_ok = Base_Rate_New;
end;
else if diff gt tempMat then
do;
stop = 1;
Rate_ok = tempRate;
end;
else
do;
tempMat = diff;
tempRate = Base_Rate_New;
end;
end;
i = i + 1;
end;
end;
run;
The two tables look like
data maturity;
input ID maturity base_rate;
datalines
1 0.25 1
2 0.5 1
3 0.6 2
4 0.3 3
5 1.2 1.2
6 1.5 2
7 2 3
8 3 1
9 1 0.5
;
data maturity2;
input originalmaturity base_rate_new;
datalines
0.25 2
0.5 3
0.75 1
1 3
;
run;
the dataset that I look to create would look like (after dropping all the extra variables
data want;
input ID maturity base_rate base_rate_new;
datalines
1 0.25 1 2
2 0.5 1 3
3 0.6 2 3
4 0.3 3 2
5 1.2 1.2 3
6 1.5 2 3
7 2 3 3
8 3 1 3
9 1 0.5 3
;
run;
The problem is that currently if the value exceeds the closer number, it still chooses the higher number. For example if it is 0.3 it would choose the 0.5 instead of 0.25
data want;
set maturity;
retain tempDiff Rate;
do point=1 to n;
set maturity2 nobs=n point=point;
diff=abs(maturity-originalmaturity);
if point=1 then do;
tempDiff=diff;
Rate=base_rate_new;
end;
else if diff<tempDiff then do;
tempDiff=diff;
Rate=base_rate_new;
end;
end;
drop diff tempDiff originalmaturity base_rate_new;
run;
Assuming the second data set is sorted in ascending order, then you need to find the 2 rates that surround the value and check which is closer. You have to handle the edge case where you have an exact match or when you get to the end of the second data set.
/*This assumes maturity2 is sorted ascending by originalmaturity*/
data want(keep=ID maturity originalmaturity base_rate base_rate_new);
set maturity;
stop = 0;
i = 1;
do until(stop eq 1);
set Maturity2 nobs = num point = i;
/* put _all_;*/
if originalmaturity = maturity then do;
output;
stop = 1;
end;
else if originalmaturity < maturity then do;
ldist = abs(originalmaturity - maturity);
lbrn = base_rate_new;
end;
else if originalmaturity > maturity then do;
hdist = abs(originalmaturity - maturity);
if hdist > ldist then
base_rate_new = lbrn;
output;
stop = 1;
end;
if num = i then do;
output;
stop = 1;
end;
i = i + 1;
end;
run;

Calculate rolling sum for one column in time interval in SAS

I have one problem and I think there is not much to correct to work right.
I have table (with desired output column 'sum_usage'):
id opt t_purchase t_spent bonus usage sum_usage
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 . 7
So, I need to sum all usage values from time_purchase (for one id, opt combination (group by id, opt) there is just one unique time_purchase) until t_spent.
Also, I have about milion rows, so hash table would be the best solution. I've tried with:
data want;
if _n_=1 then do;
if 0 then set have(rename=(usage=_usage));
declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
h.definekey('id','opt', 't_purchase', 't_spent');
h.definedata('_usage');
h.definedone();
end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;
The fifth line from the bottom is not correct for sure (do i=intck('second', t_purchase, t_spent), but have no idea how to approach this. So, the main problem is how to set up time interval to calculate this. I have already one function in this hash table func with the same keys, but without time interval, so it would be pretty good to write this one too, but it's not necessary.
Personally, I would ditch the hash and use SQL.
Example Data:
data have;
input id $ opt
t_purchase datetime20.
t_spent datetime20.
bonus usage sum_usage;
format
t_purchase datetime20.
t_spent datetime20.;
datalines;
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
;
I'm leaving your sum_usage column here for comparison.
Now, create a table of sums. New value is sum_usage2.
proc sql noprint;
create table sums as
select a.id,
a.opt,
a.t_purchase,
a.t_spent,
sum(b.usage) as sum_usage2
from have as a,
have as b
where a.id = b.id
and a.opt = b.opt
and b.t_spent <= a.t_spent
and b.t_spent >= a.t_purchase
group by a.id,
a.opt,
a.t_purchase,
a.t_spent;
quit;
Now that you have the sums, join them back to the original table:
proc sql noprint;
create table want as
select a.*,
b.sum_usage2
from have as a
left join
sums as b
on a.id = b.id
and a.opt = b.opt
and a.t_spent = b.t_spent
and a.t_purchase = b.t_purchase;
quit;
This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a Data Step (which can be faster given the size).
data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
%create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;
rc = lk.find();
drop rc;
run;
%create_hash() macro available here https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
I believe this question is a morph of one your earlier ones where you compute a rolling sum by do a hash lookup for every second over a 3 hour period for each record in your data set. Hopefully you realized the simplicity of that approach has a large cost of 3*3600 hash lookups per record as well as having to load the entire data vector into a hash.
The time log data presented has new records inserted at the top of the data, and I presume the data to be descending monotonic in time.
A DATA Step can, in a single pass of monotonic data, compute the rolling sum over a time window. The technique uses 'ring' arrays, where-in index advancement is adjusted by modulus. One array is for the time and the other is for the metric (usage). The required array size is the maximum number of items that could occur within the time window.
Consider some generated sample data with time steps of 1, 2, and one jump of 200 seconds:
data have;
time = '12oct2017:11:22:32'dt;
usage = 0;
do _n_ = 1 to &have_count;
time + 2; *ceil(25*ranuni(123));
if _n_ > 30 then time + -1;
if _n_ = 145 then time + 200;
usage = floor(180*ranuni(123));
delta = time-lag(time);
output;
end;
run;
Start with the case of computing a rolling sum from prior items when sorted time ascending. (The descending case will follow):
The example parameters are RING_SIZE 16 and TIME_WINDOW of 12 seconds.
%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;
data want;
array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
array ring_time [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
retain ring_tail 0 ring_head -1 span 0 span_usage 0;
set have;
by time ; * cause error if data not sorted per algorithm requirement;
* unload from accumulated usage the tail items that fell out the window;
do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
span + -1;
span_usage + -ring_usage(ring_tail);
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
end;
ring_head = mod ( ring_head + 1, &RING_SIZE );
span + 1;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
* update the ring array;
ring_time(ring_head) = time;
ring_usage(ring_head) = usage;
span_usage + usage;
drop ring_tail ring_head span;
run;
For the case of data sorted descending, you could jiggle things; sort ascending, compute rolling and resort descending.
What to do if such a jiggle can't be done, or you just want a single pass?
The items to be part of the rolling calculation have to be from 'lead' rows, or rows not yet read via SET. How is this possible ? A second SET statement can be used to open a separate channel to the data set, and thus obtain lead values.
There is a little more bookkeeping for processing lead data -- premature overwrite and diminished window at the end of data need to be handled.
data want2;
array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
array ring_time [-1:%eval(&RING_SIZE-1)] _temporary_;
retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;
set have;
&debug put / _N_ ':' time= ring_head=;
* unload ring_head slotted item from sum;
span + -1;
span_usage + -ring_usage(ring_head);
* advance ring_head slot by 1, the vacated slot will be overwritten by lead;
ring_head = mod ( ring_head + 1, &RING_SIZE );
&debug put +2 ring_time(ring_head)= span= 'head';
* load ring with lead values via a second SET of the same data;
if not end2 then do;
do until (_n_ > 1 or lead_index = 0 or end2);
set have(keep=time usage rename=(time=t usage=u)) end=end2; * <--- the second SET ;
if end2 then guard_index = lead_index;
&debug if end2 then put guard_index=;
ring_time(lead_index) = t;
ring_usage(lead_index) = u;
&debug put +2 ring_time(lead_index)= 'lead';
lead_index = mod ( lead_index + 1, &RING_SIZE);
end;
end;
* advance ring_tail to cover the time window;
if ring_tail ne guard_index then do;
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
do while (time - ring_time(ring_tail) <= &TIME_WINDOW);
span + 1;
span_usage + ring_usage(ring_tail);
&debug put +2 ring_time(ring_tail)= span= 'seek';
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
if ring_tail_was = guard_index then leave;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
end;
* seek went beyond window, back tail off to prior index;
ring_tail = ring_tail_was;
end;
&debug put +2 ring_time(ring_tail)= span= 'mark';
drop lead_index t u ring_: guard_index span;
format ring: span: usage 6.;
run;
options source;
Confirm both methods have the same computation:
proc sort data=want2; by time;
run;
proc compare noprint data=want compare=want2 out=diff outnoequal;
id time;
var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.
I have not benchmarked ring-array versus SQL versus Proc EXPAND versus Hash.
Caution: Dead reckoning rolling values using +in and -out operations can experience round-off errors when dealing with non-integer values.

SAS maximum value in preceding rows

I need to calculate max (Measure) in the last 3 months for each ID and month, without using PROC SQL.I was wondering I could do this using the RETAIN statement, however I have no idea how to implement the condition of comparing the value of Measure in the current row and the preceding two.
I will also need to prepare the above for more than 3 months so any solution that do not require a separate step for each additional month would be absolutely appreciated!
Here is the data I have:
data have;
input month ID $ measure;
cards;
201501 A 0
201502 A 30
201503 A 60
201504 A 90
201505 A 0
201506 A 0
201501 B 0
201502 B 30
201503 B 0
201504 B 30
201505 B 60
;
Here the one I need:
data want;
input month ID $ measure max_measure_3m;
cards;
201501 A 0 0
201502 A 30 30
201503 A 60 60
201504 A 90 90
201505 A 0 90
201506 A 0 90
201501 B 0 0
201502 B 30 30
201503 B 0 30
201504 B 30 30
201505 B 60 60
;
And here both tables: the one I have on the left and the one I need on the right
You can do this with an array that's size to your moving window. I'm not sure what type of dynamic code you need in terms of windows. If you need the max for a 4 or 5 month on top of 3 month then I would recommend using PROC EXPAND instead of these methods. The documentation for PROC EXPAND has a good example of how to do this.
data want;
set have;
by id;
array _prev(0:2) _temporary_;
if first.id then
do;
call missing (of _prev(*));
count=0;
end;
count+1;
_prev(mod(count, 3))=measure;
max=max(of _prev(*));
drop count;
run;
proc expand data=test out=out method=none;
by id;
id month;
convert x = x_movave3 / transformout=(movave 3);
convert x = x_movave4 / transformout=(movave 4);
run;
Try this:
data want(drop=l1 l2 cnt tmp);
set have;
by id;
retain cnt max_measure_3m l1 l2;
if first.id then do;
max_measure_3m = 0;
cnt = 0;
l1 = .;
l2 = .;
end;
cnt = cnt + 1;
tmp = lag(measure);
if cnt > 1 then
l1 = tmp;
tmp = lag2(measure);
if cnt > 2 then
l2 = tmp;
if measure > l1 and measure > l2 then
max_measure_3m = measure;
run;