calculate length of overlapping time intervals - sas

Location Start End Diff
A 01:02:00 01:05:00 3
A 01:03:00 01:08:00 5
A 01:04:00 01:11:00 7
B 02:00:00 02:17:00 17
B 02:10:00 02:20:00 10
B 02:11:00 02:15:00 4
Ideal Output:
Location OverlapTime(Min) OverlapRecords
A 6 3
B 11 3
If only two records within the same location overlap at a time, I can do it with lag:
data want;
set have;
prevend = lag1(End);
if first.location then prevend = .;
if start < prevend then overlap = 1;else overlap = 0;
overlaptime = -(start-prevend);
by location notsorted;
run;
proc sql;
select location, sum(overlaptime), sum(overlap)
from want
group by location;
quit;
But the thing is, I have an unknown number of overlapping time intervals within the same location. How can I achieve this?

Here's my solution. There's no need to use the lag function; it can be done with a retain statement along with first. and last. processing.
/* create dummy dataset */
data have;
input Location $ Start :time. End :time.;
format start end time.;
datalines;
A 01:02:00 01:05:00
A 01:03:00 01:08:00
A 01:04:00 01:11:00
A 01:13:00 01:15:00
B 02:00:00 02:17:00
B 02:10:00 02:20:00
B 02:11:00 02:15:00
C 01:25:00 01:30:00
D 01:45:00 01:50:00
D 01:51:00 01:55:00
;
run;
/* sort data if necessary */
proc sort data = have;
by location start;
run;
/* calculate overlap */
data want;
set have;
by location start;
retain _stime _etime _gap overlaprecords; /* retain temporary variables from previous row */
if first.location then do; /* reset temporary and new variables at each change in location */
_stime = start;
_etime = end;
_gap = 0;
_cumul_diff = 0;
overlaprecords=0;
end;
_cumul_diff + (end-start); /* calculate cumulative difference between start and end */
if start>_etime then _gap + (start-_etime); /* calculate gap between start time and previous highest end time */
if not first.location and start<_etime then do; /* count number of overlap records */
if overlaprecords=0 then overlaprecords+2; /* first overlap record gets count of 2 to account for previous record */
else overlaprecords + 1;
end;
if end>_etime then _etime=end; /* update _etime if end is greater than current value */
if last.location then do; /* calculate overlap time when last record for current location is read */
overlaptime = intck('minute',(_etime-_stime-_gap),_cumul_diff);
output;
end;
drop _: start end; /* drop unwanted variables */
run;
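To see where the number comes from, the overlap time is the sum of all interval lengths minus the time actually covered (overall span minus gaps). Here is a small hand check for location A of the dummy data, written as plain arithmetic rather than with intck; the variable names are only illustrative:

```sas
data _null_;
  /* location A: three overlapping intervals plus one detached interval */
  cumul_diff = ('01:05:00't - '01:02:00't) + ('01:08:00't - '01:03:00't)
             + ('01:11:00't - '01:04:00't) + ('01:15:00't - '01:13:00't);
  span = '01:15:00't - '01:02:00't;  /* earliest start to latest end */
  gap  = '01:13:00't - '01:11:00't;  /* uncovered time inside the span */
  overlap_min = (cumul_diff - (span - gap)) / 60;  /* seconds to minutes */
  put overlap_min=;  /* writes overlap_min=6 to the log */
run;
```

Note that a stretch covered by three records is counted twice by this definition, which is what the ideal output in the question implies.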

Related

Create Index (ID) and increment (SAS)

I have a very simple table with sale information.
Name | Value | Index
AAC | 1000 | 1
BTR | 500 | 2
GRS | 250 | 3
AAC | 100 | 4
I add a new column Name Index.
And I run the first time
DATA BSP;
Index = _N_;
SET BSP;
RUN;
This works fine for the first time.
But now I keep adding sales items, and each new line should get a new index number: the highest index + 1, and so on. The old sales should keep their index numbers. But if I run the code below, all new lines get index = 1. What is wrong with the code?
proc sql noprint;
select max(Index) into :max_ID from WORK.BSP;
quit;
DATA work.BSP;
SET work.BSP;
RETAIN new_Id &max_ID;
IF Index = . THEN DO;
new_ID + 1;
index = new_id;
END;
RUN;
You defined the value of the Index column in the first step. Where is the new data that you want to add? With data like yours, this code works well. Can you share your base dataset and the final dataset that you want to change? Maybe your data is wrong?
(BTW The index variable name is not a lucky choice :-))
data BSP;
Name="AAC";Value=1000;Index=1;output;
Name="BTR";Value=500;Index=2;output;
Name="GRS";Value=250;Index=3;output;
Name="AAC";Value=100;Index=4;output;
run;
/* the row where Index is not defined */
data BSPNew;
Name="XXX";Value=1000;output;
run;
proc sql noprint;
select max(Index) into :max_ID from WORK.BSP;
quit;
%put &max_Id.;
proc append base=BSP data=BSPNew force;
run;
DATA work.BSP;
SET work.BSP;
RETAIN new_Id &max_ID;
IF Index = . THEN DO;
new_ID + 1;
index = new_id;
END;
RUN;
data _null_;
set BSP;
put Name Value Index;
run ;
/* the result is:
AAC 1000 1
BTR 500 2
GRS 250 3
AAC 100 4
XXX 1000 5
*/
You need to show more of your code that will demonstrate the problem. The following example is the same as yours, but does not 'fail' to assign a desired index
Example:
data master;
do name = 'A','B','C'; OUTPUT; end;
run;
data master;
set master;
index = _n_;
run;
data new;
do name = 'E','F','G'; OUTPUT; end;
run;
proc sql noprint;
insert into master(name) select name from new; * append new rows;
select max(index) into :next_index from master; * compute highest index known;
quit;
data master;
set master;
retain next_index &next_index; * utilize highest index;
if index = . then do;
next_index + 1; * increment highest index before applying;
index = next_index;
end;
drop next_index; * discard 'worker' variable;
run;
You may have inserted a 1 by accident if the insert statement looked like
insert into master select name, 1 from new;
or the new data already has index set to '1'
insert into master select name, index from new;
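One way to check for either accident is to look for index values that occur more than once. A minimal sketch, assuming the table is named master as in the example above:

```sas
proc sql;
  /* list every index value that is shared by more than one row */
  select index, count(*) as n_rows
  from master
  group by index
  having count(*) > 1;
quit;
```

If this returns a row for index 1, the new rows most likely arrived with the index already filled in, so the `if index = .` branch never ran for them.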

Generate multiple lags through loops in SAS?

I'm trying to generate 20 lags for a variable.
To generate the first lag, I use the following statement:
data temp.data2;
set temp.data1;
by gvkey fyear;
lag1 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(mv),.);
lag2 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(lag1),.);
etc.
run;
I don't want to repeat this 20 times. Is there a way to do this through a loop?
Thanks a lot!
You would have to maintain your own array of mv values and assign the lag values from that. The array would be bubbled for each row processed and reset at the start of an fyear group.
Example:
data have;
do gvkey = 1 to 5;
do fyear = 1 to 5;
do day = 1 to ifn(fyear=3, 10, 30);
mv = 366-day;
output;
end;
end;
end;
run;
data want;
set have;
by gvkey fyear;
array mvs(20) _temporary_;
array lags(20) lag1-lag20;
if first.fyear then call missing(of mvs(*));
* assign lags;
do _n_ = 1 to dim(lags);
lags(_n_) = mvs(_n_);
end;
* bubble mvs;
do _n_ = dim(lags) to 2 by -1;
mvs(_n_) = mvs(_n_-1);
end;
mvs(1) = mv;
run;
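If the data has exactly one row per gvkey and fyear, a macro %do loop that generates the LAGn calls may be enough. This is only a sketch under that assumption, and it is a different technique from the array approach above; the data set panel, the macro make_lags and the helper variables _g, _f and _v are made up for illustration. The LAGn functions are called unconditionally on every row, since a lag function that is skipped on some rows does not update its queue.

```sas
/* hypothetical data: one row per gvkey and fiscal year */
data panel;
  do gvkey = 1 to 2;
    do fyear = 2001 to 2005;
      mv = gvkey*100 + fyear - 2000;
      output;
    end;
  end;
run;

%macro make_lags(var=, n=);
  %local i;
  %do i = 1 %to &n;
    /* read the values i rows back, unconditionally */
    _g = lag&i(gvkey);
    _f = lag&i(fyear);
    _v = lag&i(&var);
    /* keep the lag only for the same gvkey and exactly i fiscal years earlier */
    if _g = gvkey and _f = fyear - &i then lag&i = _v;
    else lag&i = .;
  %end;
  drop _g _f _v;
%mend make_lags;

data panel_lags;
  set panel;
  %make_lags(var=mv, n=3)
run;
```

Calling it with n=20 would produce lag1 to lag20 in the same way.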

Calculate rolling sum for one column in time interval in SAS

I have a problem, and I think not much needs to be corrected to make it work.
I have table (with desired output column 'sum_usage'):
id opt t_purchase t_spent bonus usage sum_usage
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
So, for each row I need to sum all usage values from t_purchase until t_spent (for one id, opt combination, i.e. group by id, opt, there is just one unique t_purchase).
Also, I have about a million rows, so a hash table would be the best solution. I've tried:
data want;
if _n_=1 then do;
if 0 then set have(rename=(usage=_usage));
declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
h.definekey('id','opt', 't_purchase', 't_spent');
h.definedata('_usage');
h.definedone();
end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;
The fifth line from the bottom (do i=intck('second', t_purchase, t_spent) to t_spent) is certainly not correct, but I have no idea how to approach this. So the main problem is how to set up the time interval for this calculation. I already have one function in this hash table with the same keys but without the time interval, so it would be good to write this one the same way, but it's not necessary.
Personally, I would ditch the hash and use SQL.
Example Data:
data have;
input id $ opt
t_purchase :datetime20.
t_spent :datetime20.
bonus usage sum_usage;
format
t_purchase datetime20.
t_spent datetime20.;
datalines;
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
;
I'm leaving your sum_usage column here for comparison.
Now, create a table of sums. New value is sum_usage2.
proc sql noprint;
create table sums as
select a.id,
a.opt,
a.t_purchase,
a.t_spent,
sum(b.usage) as sum_usage2
from have as a,
have as b
where a.id = b.id
and a.opt = b.opt
and b.t_spent <= a.t_spent
and b.t_spent >= a.t_purchase
group by a.id,
a.opt,
a.t_purchase,
a.t_spent;
quit;
Now that you have the sums, join them back to the original table:
proc sql noprint;
create table want as
select a.*,
b.sum_usage2
from have as a
left join
sums as b
on a.id = b.id
and a.opt = b.opt
and a.t_spent = b.t_spent
and a.t_purchase = b.t_purchase;
quit;
This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a Data Step (which can be faster given the size).
data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
%create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;
rc = lk.find();
drop rc;
run;
%create_hash() macro available here https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
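If you would rather not depend on the macro, the same lookup can be written with a plain hash declaration. This is a sketch, not a reproduction of what %create_hash does internally (I have not checked that); it assumes the have and sums tables built above, and want3 is just an illustrative name.

```sas
data want3;
  if _n_ = 1 then do;
    /* never executed, but defines sum_usage2 in the program data vector */
    if 0 then set sums(keep=sum_usage2);
    declare hash lk(dataset: 'sums');
    lk.defineKey('id', 'opt', 't_purchase', 't_spent');
    lk.defineData('sum_usage2');
    lk.defineDone();
  end;
  set have;
  /* find() returns 0 on a match; clear the retained value otherwise */
  if lk.find() ne 0 then call missing(sum_usage2);
run;
```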
I believe this question is a morph of one of your earlier ones, where you computed a rolling sum by doing a hash lookup for every second over a 3-hour period for each record in your data set. Hopefully you realized that the simplicity of that approach has a large cost: 3*3600 hash lookups per record, as well as having to load the entire data vector into a hash.
The time log data presented has new records inserted at the top of the data, and I presume the data to be descending monotonic in time.
A DATA Step can, in a single pass of monotonic data, compute the rolling sum over a time window. The technique uses 'ring' arrays, wherein index advancement is adjusted by modulus. One array is for the time and the other is for the metric (usage). The required array size is the maximum number of items that could occur within the time window.
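As a tiny illustration of the modulus idea on its own (the names and the size of 4 are arbitrary), the head index below wraps back to slot 0 after reaching the last slot, so the oldest entries get overwritten:

```sas
data _null_;
  array ring [0:3] _temporary_;
  head = -1;
  do value = 10 to 60 by 10;
    head = mod(head + 1, dim(ring));  /* 0, 1, 2, 3, then 0, 1 again */
    ring[head] = value;
    put value= head=;
  end;
run;
```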
Consider some generated sample data with time steps of 1, 2, and one jump of 200 seconds:
%let have_count = 150; /* row count taken from the log shown further below */
data have;
time = '12oct2017:11:22:32'dt;
usage = 0;
do _n_ = 1 to &have_count;
time + 2; *ceil(25*ranuni(123));
if _n_ > 30 then time + -1;
if _n_ = 145 then time + 200;
usage = floor(180*ranuni(123));
delta = time-lag(time);
output;
end;
run;
Start with the case of computing a rolling sum from prior items when sorted time ascending. (The descending case will follow):
The example parameters are RING_SIZE 16 and TIME_WINDOW of 12 seconds.
%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;
data want;
array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
array ring_time [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
retain ring_tail 0 ring_head -1 span 0 span_usage 0;
set have;
by time ; * cause error if data not sorted per algorithm requirement;
* unload from accumulated usage the tail items that fell out the window;
do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
span + -1;
span_usage + -ring_usage(ring_tail);
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
end;
ring_head = mod ( ring_head + 1, &RING_SIZE );
span + 1;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
* update the ring array;
ring_time(ring_head) = time;
ring_usage(ring_head) = usage;
span_usage + usage;
drop ring_tail ring_head span;
run;
For the case of data sorted descending, you could jiggle things; sort ascending, compute rolling and resort descending.
What to do if such a jiggle can't be done, or you just want a single pass?
The items to be part of the rolling calculation have to come from 'lead' rows, that is, rows not yet read via SET. How is this possible? A second SET statement can be used to open a separate channel to the data set, and thus obtain lead values.
There is a little more bookkeeping for processing lead data -- premature overwrite and diminished window at the end of data need to be handled.
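%let debug = *; /* assumed definition (not shown in the original): '*' turns each &debug line into a comment */
data want2;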
array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
array ring_time [-1:%eval(&RING_SIZE-1)] _temporary_;
retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;
set have;
&debug put / _N_ ':' time= ring_head=;
* unload ring_head slotted item from sum;
span + -1;
span_usage + -ring_usage(ring_head);
* advance ring_head slot by 1, the vacated slot will be overwritten by lead;
ring_head = mod ( ring_head + 1, &RING_SIZE );
&debug put +2 ring_time(ring_head)= span= 'head';
* load ring with lead values via a second SET of the same data;
if not end2 then do;
do until (_n_ > 1 or lead_index = 0 or end2);
set have(keep=time usage rename=(time=t usage=u)) end=end2; * <--- the second SET ;
if end2 then guard_index = lead_index;
&debug if end2 then put guard_index=;
ring_time(lead_index) = t;
ring_usage(lead_index) = u;
&debug put +2 ring_time(lead_index)= 'lead';
lead_index = mod ( lead_index + 1, &RING_SIZE);
end;
end;
* advance ring_tail to cover the time window;
if ring_tail ne guard_index then do;
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
do while (time - ring_time(ring_tail) <= &TIME_WINDOW);
span + 1;
span_usage + ring_usage(ring_tail);
&debug put +2 ring_time(ring_tail)= span= 'seek';
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
if ring_tail_was = guard_index then leave;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
end;
* seek went beyond window, back tail off to prior index;
ring_tail = ring_tail_was;
end;
&debug put +2 ring_time(ring_tail)= span= 'mark';
drop lead_index t u ring_: guard_index span;
format ring: span: usage 6.;
run;
options source;
Confirm both methods have the same computation:
proc sort data=want2; by time;
run;
proc compare noprint data=want compare=want2 out=diff outnoequal;
id time;
var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.
I have not benchmarked ring-array versus SQL versus Proc EXPAND versus Hash.
Caution: Dead reckoning rolling values using +in and -out operations can experience round-off errors when dealing with non-integer values.
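A quick way to see the effect is to add and then remove the same non-integer values and compare the result with zero. This is only a sketch; the 1e-9 rounding unit is an arbitrary choice and should be picked to suit the precision of your data.

```sas
data _null_;
  total = 0;
  total + 0.1;
  total + 0.2;
  total + -0.1;
  total + -0.2;
  /* with binary floating point the exact comparison is expected to fail */
  is_zero = (total = 0);
  is_zero_rounded = (round(total, 1e-9) = 0);
  put total= e20. is_zero= is_zero_rounded=;
run;
```

Rounding the running sum, or recomputing it from the ring contents every so often, keeps such residue from accumulating.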

SAS adding new observations for longitudinal data

I have a longitudinal dataset in SAS with periods of time categorized as either at risk for an event, or not at risk. Unfortunately, some time periods overlap, and I would like to recode them to have a dataset of entirely non-overlapping observations. For example, the dataset currently looks like:
Row 1: ID=123; Start=Jan 1, 1999; End=Dec 31, 1999; At_risk="Yes"
Row 2: ID=123; Start=Feb 1, 1999; End=Feb 15, 1999; At_risk="No"
The dataset I would like looks like:
Row 1: ID=123; Start=Jan 1, 1999; End=Feb 1, 1999; At_risk="Yes"
Row 2: ID=123; Start=Feb 1, 1999; End=Feb 15, 1999; At_risk="No"
Row 3: ID=123; Start=Feb 15, 1999; End=Dec 31, 1999; At_risk="Yes"
Thoughts?
Vasja may have been suggesting something like this (date level) as an alternative.
I will assume here that the most recent row read in your longitudinal dataset will have priority over any other rows with overlapping date ranges. If that is not the case then adjust the priority derivation below as appropriate.
Are you sure your start and end dates are correct? Your desired output still has overlapping dates: Feb 1 and Feb 15 are each both at risk and not at risk. An end date should be at least one day before the next start date, not the same day, so that consecutive periods are contiguous without overlapping. For that reason, coding a solution that produces your desired output (with overlapping dates) is problematic. The solution below produces non-overlapping dates; you will need to modify it if you want the shared boundary dates of your required output.
/* Your longitudinal dataset . */
data orig;
format Id 16. Start End Date9.;
Id = 123;Start='1jan1999'd; End='31dec1999'd; At_risk="Yes";output;
Id = 123;Start='1feb1999'd; End='15feb1999'd; At_risk="No";output;
run;
/* generate a row for each date between start and end dates. */
/* Use row number (_n_) to assign priority. */
Data overlapping_dates;
set orig;
format date date9.;
priority = _n_;
do date = start to end by 1;
output;
end;
Run;
/* Get at_risk details for most recent read date according to priority. */
Proc sql;
create table non_overlapping_dates as
select id, date, at_risk
from overlapping_dates
group by id, date
having priority eq max (priority)
order by id, date
;
Quit;
/* Rebuild longitudinal dataset . */
Data longitudinal_dataset
(keep= id start end at_risk)
;
format id 16. Start End Date9. at_risk $3.;
set non_overlapping_dates;
by id at_risk notsorted;
retain start;
if first.at_risk
then start = date;
/* output a row to longitudinal dataset if at_risk is about to change or last row for id. */
if last.at_risk
then do;
end = date;
output;
end;
Run;
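To confirm that the rebuilt periods no longer overlap, you can compare each start date with the previous end date within the same id. A minimal sketch, assuming the longitudinal_dataset created above; overlap_check, prev_end and overlap_flag are illustrative names:

```sas
data overlap_check;
  set longitudinal_dataset;
  by id;
  prev_end = lag(end);  /* called on every row so the lag queue stays in step */
  /* an overlap exists when a period starts on or before the previous end */
  if not first.id and start <= prev_end then overlap_flag = 1;
  else overlap_flag = 0;
  format prev_end date9.;
run;
```

overlap_flag should be 0 on every row if the date-level rebuild worked as intended.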
Such tasks are exercises in debugging of program logic and fighting data assumptions, playing with old/new values...
Below is my initial code for the exact example you provided; it will surely need some adjustment on real data.
In case there's time overlap on more than current-next record I'm not sure it's doable this way (with reasonable effort). For such cases you'd probably be more effective with splitting original start - end intervals to day level and then summarize details to new intervals.
data orig;
format Id 16. Start End Date9.;
Id = 123;Start='1jan1999'd; End='31dec1999'd; At_risk="Yes";output;
Id = 123;Start='1feb1999'd; End='15feb1999'd; At_risk="No";output;
run;
proc sort data = orig;
by ID Start;
run;
data modified;
format pStart oStart pEnd oEnd Date9.;
set orig;
length pStart pEnd 8 pAt_risk $3;
by ID descending End ;
retain pStart pEnd pAt_risk;
/* keep original values */
oStart = Start;
oEnd = End;
oAt_risk = At_risk;
if first.id then do;
pStart = Start;
pEnd = End;
pAt_risk = At_risk;
/* no output */
end;
else do;
if pAt_risk ne At_risk then do;
if Start > pStart then do;
put _all_;
Start = pStart;
End = oStart;
At_risk = pAt_risk;
output;/* first part of time span */
Start = oStart;
End = oEnd;
At_risk = oAt_risk;
output;/* second part of time span */
if (End < pEnd ) then do;
Start = End;
End = pEnd;
At_risk = pAt_risk;
output; /*third part of time span */
/* keep current values as previous record values */
pStart = max(oStart, Start);
pEnd = End;
pAt_risk = At_risk;
end;
end;
end;
end;
run;
proc print;run;

how to swap first and last observations in sas data set if it contain 3 observations

I have a data set with 3 observations:
1 2 3
4 5 6
7 8 9
Now I have to interchange 1 2 3 and 7 8 9.
How can I do this in Base SAS?
If you just want to sort your dataset by a variable in descending order, use proc sort:
data example;
input number;
datalines;
123
456
789
;
run;
proc sort data = example;
by descending number;
run;
If you want to re-order a dataset in a more complex way, create a new variable containing the position that you want each row to be in, and then sort it by that variable.
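For the three-row case in the question, that could look like the sketch below. The names three_rows, keyed, swapped and position are only illustrative.

```sas
data three_rows;
  input a b c;
  datalines;
1 2 3
4 5 6
7 8 9
;
run;

data keyed;
  set three_rows nobs=nobs;
  /* first row goes last, last row goes first, all others stay put */
  if _n_ = 1 then position = nobs;
  else if _n_ = nobs then position = 1;
  else position = _n_;
run;

proc sort data=keyed out=swapped(drop=position);
  by position;
run;
```

The swapped data set then holds 7 8 9, 4 5 6 and 1 2 3, in that order.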
If you want to swap the contents of the first and last observations while leaving the rest of the dataset in place, you could do something like this.
data class;
set sashelp.class;
run;
data firstobs;
i = 1;
set sashelp.class(obs = 1);
run;
data lastobs;
i = nobs;
set sashelp.class nobs = nobs point = nobs;
output;
stop;
run;
data transaction;
set lastobs firstobs;
/*Swap the values of i for first and last obs*/
retain _i;
if _n_ = 1 then do;
_i = i;
i = 1;
end;
if _n_ = 2 then i = _i;
drop _i;
run;
data class;
set transaction(keep = i);
modify class point = i;
set transaction;
run;
This modifies just the first and last observations, which should be quite a bit faster than sorting or replacing a large dataset. You can do a similar thing with the update statement, but that only works if your dataset is already sorted / indexed by a unique key.
By Sandeep Sharma:sandeep.sharmas091#gmail.com
data testy;
input a;
datalines;
1
2
3
4
5
6
7
8
9
;
run;
data ghj;
drop y;
do i=nobs-2 to nobs;
set testy point=i nobs=nobs;
output;
end;
do n=4 to nobs-3;
set testy point=n;
output;
end;
do y=1 to 3;
set testy;
output;
end;
stop;
run;