SAS: using first. and last. to process a date range - sas

I am trying to go through a list of dates and keep only the date range for dates that 5 or more occurrences and delete all others. The example I have is:
data test;
input dt dt2;
format dt dt2 date9.;
datalines;
20000 20001
20000 20002
20000 20003
21000 21001
21000 21002
21000 21003
21000 21004
21000 21005
;
run;
proc sort data = test;
by dt dt2;
run;
data check;
set test;
by dt dt2;
format dt dt2 date9.;
if last.dt = first.dt then
if abs(last.dt2 - first.dt) < 5 then delete;
run;
What I want returned is just one entry, if possible, but I would be happy with the entire appropriate range as well.
The one entry would be a table that has:
start_dt end_dt
21000 21005
The appropriate range is:
21000 21001
21000 21002
21000 21003
21000 21004
21000 21005
My code doesn't work as desired, and I am not sure what changes I need to make.

last.dt2 and first.dt are flags and can have value in (0,1), so condition abs(last.dt2 - first.dt) < 5 is always true.
Use counter variable to count records in group instead:
data check(drop= count);
length count 8;
count=0;
do until(last.dt);
set test;
by dt dt2;
format dt dt2 date9.;
count = count+1;
if last.dt and count>=5 then output;
end;
run;

I'm not sure why you are looking to use the last.dt2 and the first.dt within your delete function so I have turned it around to create your desired output:
data check2;
set test;
by dt ;
format dt dt2 date9.;
if last.dt then do;
if abs(dt2 - dt) >= 5 then output;
end;
run;
Of course, this will only work if your file is sorted on dt and dt2.
Hope this helps.

Related

SAS cumulative count by unique ID and date

I have a dataset like below
Customer_ID Vistited_Date
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
I am trying find the cumulative unique count of customers by date, assuming my output will be like below
Cust_count Vistited_Date
3 7-Feb-20
2 14-Feb-20
7-Feb-2020 has 3 unique customers, whereas 14-Feb-2020 has only 2 hence customer 1234 has visited already.
Anyone knows how I could develop a data set in these conditions?
Sorry if my question is not clear enough, and I am available to give more details if necessary.
Thanks!
NOTE: #draycut's answer has the same logic but is faster, and I will explain why.
#draycut's code uses one hash method, add(), using the return code as test for conditional increment. My code uses check() to test for conditional increment and then add (which will never fail) to track. The one method approach can be perceived as being anywhere from 15% to 40% faster in performance (depending on number of groups, size of groups and id reuse rate)
You will need to track the IDs that have occurred in all prior groups, and exclude the tracked IDs from the current group count.
Tracking can be done with a hash, and conditional counting can be performed in a DOW loop over each group. A DOW loop places the SET statement inside an explicit DO.
Example:
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data counts(keep=date count);
if _n_ = 1 then do;
declare hash tracker();
tracker.defineKey('id');
tracker.defineDone();
end;
do until (last.date);
set have;
by date;
if tracker.check() ne 0 then do;
count = sum(count, 1);
tracker.add();
end;
end;
run;
Raw performance benchmark - no disk io, cpu required to fill array before doing hashing, so those performance components are combined.
The root performance is how fast can new items be added to the hash.
Simulate 3,000,000 'records', 1,000 groups of 3,000 dates, 10% id reuse (so the distinct ids will be ~2.7M).
%macro array_fill (top=3000000, n_group = 1000, overlap_factor=0.10);
%local group_size n_overlap index P Q;
%let group_size = %eval (&top / &n_group);
%if (&group_size < 1) %then %let group_size = 1;
%let n_overlap = %sysevalf (&group_size * &overlap_factor, floor);
%if &n_overlap < 0 %then %let n_overlap = 0;
%let top = %sysevalf (&group_size * &n_group);
P = 1;
Q = &group_size;
array ids(&top) _temporary_;
_n_ = 0;
do i = 1 to &n_group;
do j = P to Q;
_n_+1;
ids(_n_) = j;
end;
P = Q - &n_overlap;
Q = P + &group_size - 1;
end;
%mend;
options nomprint;
data _null_ (label='check then add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.check() ne 0 then do;
count = sum(count,1);
h.add();
end;
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data _null_ (label='just add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.add() = 0 then
count = sum(count,1);
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data have;
input Customer_ID Vistited_Date :anydtdte12.;
format Vistited_Date date9.;
datalines;
1234 7-Feb-2020
4567 7-Feb-2020
9870 7-Feb-2020
1234 14-Feb-2020
7654 14-Feb-2020
3421 14-Feb-2020
;
data want (drop=Customer_ID);
if _N_=1 then do;
declare hash h ();
h.definekey ('Customer_ID');
h.definedone ();
end;
do until (last.Vistited_Date);
set have;
by Vistited_Date;
if h.add() = 0 then Count = sum(Count, 1);
end;
run;
If your data is not sorted and you like the SQL maybe this solution is same good for you and it is very simple:
/* your example 3 rows */
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
1234 15-Feb-20
7654 15-Feb-20
1111 15-Feb-20
;
run;
/* simple set theory. Final dataset contains your final data like results
below*/
proc sql;
create table temp(where =(mindate=date)) as select
ID, date,min(date) as mindate from have
group by id;
create table final as select count(*) as customer_count,date from temp
group by date;
quit;
/* results:
customer_count Date
3 07.febr.20
2 14.febr.20
1 15.febr.20
*/
Another method cause I dont know hash so well. >_<
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data want;
length Used $200.;
retain Used;
set have;
by Date;
if first.Date then count = .;
if not find(Used,cats(ID)) then do;
count + 1;
Used = catx(',',Used,ID);
end;
if last.Date;
put Date= count=;
run;
If you are not overly concerned with processing speed and want something simple:
proc sort data=have;
by id date;
** Get date of each customer's first unique visit **;
proc sort data=have out=first_visit nodupkey;
by id;
proc freq data=first_visit noprint;
tables date /out=want (keep=date count);
run;

Combining the rows with overlapping data ranges in SAS

Since I am new to SAS I need some help to understand how to combine the overlap date ranges into one row.I want to combine the overlap date ranges when they have matching Id. If the dates don’t overlap then I want to keep them as it is. IF they over lap by Matching Id and drug code Then it should combine into one line. Please look at the same ple data set which I have below and the expected results:
Current Data set:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 1/1/2019
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
Expected results:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 3/31/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/2/2018 8/31/2018
1 100 9/1/2018 5/4/2019
I wrote some SAS code but I am combining the dates even when there is no overlap. I want to write some code which should work in SAS.
PROC SORT DATA=Want OUT=ONE;
BY PERSON_ID BEG_DATE DRUG_CODE END_DATE;
RUN;
data TWO (DROP=PERSON_ID2 DRUG_CODE2 BEG_DATE END_DATE
RENAME=(BEG2=BEG_DOS
END2=END_DOS));
SET ONE;
RETAIN BEG2 END2;
PERSON_ID2=LAG1(PERSON_ID);
DRUG_CODE2=LAG1(DRUG_CODE);
IF PERSON_ID2=PERSON_ID AND DRUG_CODE2=DRUG_CODE AND BEG_DATE LE(END2+1) THEN
DO;
BEG2=MIN(BEG_DATE,BEG2);
END2=MAX(END_DATE,END2);
END;
ELSE
DO;
SEG+1;
BEG2=BEG_DATE;
END2=END_DATE;
END;
FORMAT BEG2 END2 MMDDYY10.;
RUN;
DATA THREE(DROP=BEG_DOS END_DOS SEG);
RETAIN BEG_DATE END_DATE;
SET TWO;
BY PERSON_ID SEG;
FORMAT BEG_DATE END_DATE MMDDYY10.;
IF FIRST.SEG THEN
DO;
BEG_DATE=BEG_DOS;
END;
IF LAST.SEG THEN
DO;
END_DATE = END_DOS;
OUTPUT;
END;
RUN;
This is how I would do it. Create an obs for each ID DRUG and DATE. Flag the gaps and summarize by RUN.
data have;
input ID Drug_Code (BEG End)(:mmddyy.);
format BEG End mmddyyd10.;
cards;
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 90 6/1/2018 8/15/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
;;;;
run;
proc print;
run;
/*1 100 1/1/2018 1/1/2019*/
data exv/ view=exv;
set have;
do date = beg to end;
output;
end;
drop beg end;
format date mmddyyd10.;
run;
proc sort data=exv out=ex nodupkey;
by id drug_code date;
run;
data breaksV / view=BreaksV;
set ex;
by id drug_code;
dif = dif(date);
if first.drug_code then do; dif=1; run=1; end;
if dif ne 1 then run+1;
run;
proc summary data=breaksV nway missing;
class id drug_code run;
var date;
output out=want(drop=_type_) min=Begin max=End;
run;
Proc print;
run;
Computing the extent range composed of overlapping segment ranges requires a good understanding of the range conditions (cases).
Consider the scenarios when sorted by start date (within any larger grouping set, G, such as id and drug)
Let [ and ] be endpoints of a range
# be date values (integers) within
Extent be the combined range that grows
Segment be the range in the current row
Case 1 - Growth. Within G Segment start before Extent end
Segment will either not contribute to Extent or extend it.
[####] Extent
+ [#] Segment range DOES NOT contribute
--------
[####] Extent (do not output a row, still growing)
or
[####] Extent
+ [#####] Segment range DOES contribute
--------
[#######] Extent (do not output a row, still growing)
Case 2 - Terminus. 3 possibilities:
Within G Segment start after Extent end,
Next G reached (different id/drug combination),
End of data reached.
#2 and #3 can be tested by checking the appropriate last. flag.
[####] Extent
+ ..[#] Segment beyond Extent (gap is 2)
--------
[####] output Extent
[#] reset Extent to Segment
You can adjust your rules for Segment being adjacent (gap=0) or close enough (gap < threshold) to mean an Extent is either expanded, or, output and reset to Segment.
Note: The situation is a little more (not shown) complicated for the real world cases of:
missing start means the Segment has an unknown start date (presume it to be epoch (0=01JAN1960, or some date that pre-dates all dates in the data or study)
missing end means the Segment is active today (end date is date when processing data)
Sample code:
data have;
call streaminit(42);
do id = 1 to 10;
do _n_ = 1 to 50;
drug = ceil(rand('UNIFORM', 10));
beg_date = intnx ('MONTH', '01JAN2008'D, rand('UNIFORM',20));
end_date = intnx ('DAY', beg_date, rand('UNIFORM',75));
OUTPUT;
end;
end;
format beg_date end_date yymmdd10.;
run;
proc sort data=have out=segments;
by id drug beg_date end_date;
run;
data want;
set segments;
by id drug beg_date end_date; * will error if incoming data is NOT sorted;
retain ext_beg ext_end;
retain gap_allowed 0; * set to 1 for contiguously adjacent segment ;
if first.drug then do;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 0;
end;
if beg_date <= ext_end + gap_allowed then do;
ext_end = max (ext_end, end_date);
segment_count + 1;
end;
else do;
extent_id + 1;
OUTPUT;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 1;
end;
if last.drug then do;
extent_id + 1;
OUTPUT;
* reset occurs implicitly;
* it will happen at first. logic when control returns to top of step;
end;
format ext_: yymmdd10.;
keep id drug ext_beg ext_end segment_count extent_id;
run;

SAS Fast forward a date until a limit using INTNX/INTCK

I'm looking to take a variable observation's date and essentially keep rolling it forward by its specified repricing parameter until a target date
the dataset being used is:
data have;
input repricing_frequency date_of_last_repricing end_date;
datalines;
3 15399 21367
10 12265 21367
15 13879 21367
;
format date_of_last_repricing end_date date9.;
informat date_of_last_repricing end_date date9.;
run;
so the idea is that i'd keep applying the repricing frequency of either 3 months, 10 months or 15 months to the date_of_last_repricing until it is as close as it can be to the date "31DEC2017". Thanks in advance.
EDIT including my recent workings:
data want;
set have;
repricing_N = intck('Month',date_of_last_repricing,'31DEC2017'd,'continuous');
dateoflastrepricing = intnx('Month',date_of_last_repricing,repricing_N,'E');
format dateoflastrepricing date9.;
informat dateoflastrepricing date9.;
run;
The INTNX function will compute an incremented date value, and allows the resultant interval alignment to be specified (in your case the 'end' of the month n-months hence)
data have;
format date_of_last_repricing end_date date9.;
informat date_of_last_repricing end_date date9.;
* use 12. to read the raw date values in the datalines;
input repricing_frequency date_of_last_repricing: 12. end_date: 12.;
datalines;
3 15399 21367
10 12265 21367
15 13879 21367
;
run;
data want;
set have;
status = 'Original';
output;
* increment and iterate;
date_of_last_repricing = intnx('month',
date_of_last_repricing, repricing_frequency, 'end'
);
do while (date_of_last_repricing <= end_date);
status = 'Computed';
output;
date_of_last_repricing = intnx('month',
date_of_last_repricing, repricing_frequency, 'end'
);
end;
run;
If you want to compute only the nearest end date, as when iterating by repricing frequency, you do not have to iterate. You can divide the months apart by the frequency to get the number of iterations that would have occurred.
data want2;
set have;
nearest_end_month = intnx('month', end_date, 0, 'end');
if nearest_end_month > end_date then nearest_end_month = intnx('month', nearest_end_month, -1, 'end');
months_apart = intck('month', date_of_last_repricing, nearest_end_month);
iterations_apart = floor(months_apart / repricing_frequency);
iteration_months = iterations_apart * repricing_frequency;
nearest_end_date = intnx('month', date_of_last_repricing, iteration_months, 'end');
format nearest: date9.;
run;
proc sql;
select id, max(date_of_last_repricing) as nearest_end_date format=date9. from want group by id;
select id, nearest_end_date from want2;
quit;

SAS: Insert Blank Rows

I'm calculating some interval statistics (standard deviation of one minute intervals for example) of financial time series data. My code managed to get results for all intervals that contain data, but for intervals that do not contain any observations in the time series, I'd like to insert an empty row just to maintain the timestamp consistency.
For example, if there's data between 10:00 to 10:01, 10:02 to 10:03, but not 10:01 to 10:02, my output would be:
10:01 stat1 stat2 stat3
10:03 stat1 stat2 stat3
It would ideal if the result could be (I want some values to be 0, some missing '.'):
10:01 stat1 stat2 stat3
10:02 0 0 .
10:03 stat1 stat2 stat3
What I did:
data v_temp/view = v_temp;
set &taq_ds;
where TIME_M between &start_time and &end_time;
INTV = hms(00, ceil(TIME_M/'00:01:00't),00); *create one minute interval;
format INTV tod.; *format hh:mm:ss;
run;
proc means data = sorted noprint;
by SYM_ROOT DATE INTV;
var PRICE;
weight SIZE;
output
out=oneMinStats(drop=_TYPE_ _FREQ_)
n=NTRADES mean=VWAP sumwgt=SUMSHS max=HI min=LO std=SIGMAPRC
idgroup(max(TIME_M) last out(price size ex time_m)=LASTTRD LASTSIZE LASTEX LASTTIME);
run;
For some non-active stocks, there're many gaps like this. What would be an efficient way to generate those filling rows?
If you have SAS:ETS licensed, PROC EXPAND is a good choice for adding blank rows in a time series. Here's a very short example:
data mydata;
input timevar stat1 stat2 stat3;
format timevar TIME5.;
informat timevar HHMMSS5.;
datalines;
10:01 1 3 5
10:03 2 4 6
;;;;
run;
proc expand data=mydata out=mydata_exp from=minute to=minute observed=beginning method=none;
id timevar;
run;
The documentation has more details if you want to perform inter/extrapolation or anything like that. The important options are from=minute, observed=beginning, method=none (no extrapolation or interpolation), and id (which identifies the time variable).
If you don't have ETS, then a data step should suffice. You can either merge to a known dataset, or add your own rows; the size of your dataset determines somewhat which is easier. Here's the merge variation. The add your own rows in a datastep variation is similar to how I create the extra rows.
*Select the maximum time available.;
proc sql noprint;
select max(timevar) into :endtime from mydata;
quit;
*Create the empty dataset with just times;
data mydata_tomerge;
set mydata_tomerge(obs=1);
do timevar = timevar to &endtime by 60; *by 60 = minutes!;
output;
end;
keep timevar;
run;
*Now merge the one with all the times to the one with all the data!;
data mydata_fin;
merge mydata_tomerge(in=a) mydata;
by timevar;
if a;
run;

Multiple hash objects in SAS

I have two SAS data sets. The first is relatively small, and contains unique dates and a corresponding ID:
date dateID
1jan90 10
2jan90 15
3jan90 20
...
The second data set very large, and has two date variables:
dt1 dt2
1jan90 2jan90
3jan90 1jan90
...
I need to match both dt1 and dt2 to dateID, so the output would be:
id1 id2
10 15
20 10
Efficiency is very important here. I know how to use a hash object to do one match, so I could do one data step to do the match for dt1 and then another step for dt2, but I'd like to do both in one data step. How can this be done?
Here's how I would do the match for just dt1:
data tbl3;
if 0 then set tbl1 tbl2;
if _n_=1 then do;
declare hash dts(dataset:'work.tbl2');
dts.DefineKey('date');
dts.DefineData('dateid');
dts.DefineDone();
end;
set tbl1;
if dts.find(key:date)=0 then output;
run;
A format would probably work just as efficiently given the size of your hash table...
data fmt ;
retain fmtname 'DTID' type 'N' ;
set tbl1 ;
start = date ;
label = dateid ;
run ;
proc format cntlin=fmt ; run ;
data tbl3 ;
set tbl2 ;
id1 = put(dt1,DTID.) ;
id2 = put(dt2,DTID.) ;
run ;
Edited version based on below comments...
data fmt ;
retain fmtname 'DTID' type 'I' ;
set tbl1 end=eof ;
start = date ;
label = dateid ;
output ;
if eof then do ;
hlo = 'O' ;
label = . ;
output ;
end ;
run ;
proc format cntlin=fmt ; run ;
data tbl3 ;
set tbl2 ;
id1 = input(dt1,DTID.) ;
id2 = input(dt2,DTID.) ;
run ;
I don't have SAS in front of me right now to test it but the code would look like this:
data tbl3;
if 0 then set tbl1 tbl2;
if _n_=1 then do;
declare hash dts(dataset:'work.tbl2');
dts.DefineKey('date');
dts.DefineData('dateid');
dts.DefineDone();
end;
set tbl1;
date = dt1;
if dts.find()=0 then do;
id1 = dateId;
end;
date = dt2;
if dts.find()=0 then do;
id2 = dateId;
end;
if dt1 or dt2 then do output; * KEEP ONLY RECORDS THAT MATCHED AT LEAST ONE;
drop date dateId;
run;
I agree with the format solution, for one, but if you want to do the hash solution, here it goes. The basic thing here is that you define the key as the variable you're matching, not in the hash itself.
data tbl2;
informat date DATE7.;
input date dateID;
datalines;
01jan90 10
02jan90 15
03jan90 20
;;;;
run;
data tbl1;
informat dt1 dt2 DATE7.;
input dt1 dt2;
datalines;
01jan90 02jan90
03jan90 01jan90
;;;;
run;
data tbl3;
if 0 then set tbl1 tbl2;
if _n_=1 then do;
declare hash dts(dataset:'work.tbl2');
dts.DefineKey('date');
dts.DefineData('dateid');
dts.DefineDone();
end;
set tbl1;
rc1 = dts.find(key:dt1);
if rc1=0 then id1=dateID;
rc2 = dts.find(key:dt2);
if rc2=0 then id2=dateID;
if rc1=0 and rc2=0 then output;
run;