I am working with a dataset that contains several stocks, and I have merged an events dataset onto it (several events per stock within the period).
Now, for an event study, I want to create several dummy variables that mark windows around each event: -60 to -11 days, -5 to -1 days, and the announcement day plus day +1.
Two things are important:
1. It has to be by stock (the windows should not carry over between stocks).
2. One announcement/event day (ann_day) should not spoil another event's window.
I tried the following, but it just gives me a single window and does not account for different stocks or spoiled windows:
proc sql;
create view event_study as
select distinct b.ann_date,a.date,a.dayid-b.dayid as event_time, a.stock,a.return
from Dataset_full as a,Announcements as b
where a.dayid-b.dayid between -60 and 11 and a.secid=b.secid
order by a.stock, b.ann_date, event_time;
quit;
Some info: announcement days are the events
dataset_full has the stock, date, return, volume. One row per calendar/trading day.
Announcement has stock, announcement date and announcement info (one row per announcement)
Data should look like this:
Stock Date Ann_date flag_minus60_minus11 flag_minus5_minus1 flag_day0_day1
A 1/01/2016 1
A 2/01/2016 1
A 3/01/2016
A 4/01/2016 4/01/2016 1
A 5/01/2016 1
A 6/01/2016
A 7/01/2016
A 8/01/2016
A 9/01/2016
A 10/01/2016
A 11/01/2016
A 12/01/2016
A 13/01/2016
A 14/01/2016
A 15/01/2016
B 1/01/2016 1
B 2/01/2016 1
B 3/01/2016 1
B 4/01/2016 1
B 5/01/2016 1
B 6/01/2016 1
B 7/01/2016
B 8/01/2016
B 9/01/2016
B 10/01/2016
B 11/01/2016 1
B 12/01/2016 1
B 13/01/2016 1
B 14/01/2016 1
B 15/01/2016 1
B 16/01/2016 16/01/2016 1
B 17/01/2016 1
B 18/01/2016 1
B 19/01/2016 1
B 20/01/2016 20/01/2016 1
B 21/01/2016 1
B 22/01/2016
B 23/01/2016
B 24/01/2016
B 25/01/2016
MaBo:
Here is some sample data and SQL. When you examine the output, I presume you will see 'spoiled' information, that is, a date with more than one future announcement in the flagging time frame.
Flagging trading dates relative to an event date is an inner-join problem. An inner join has to be performed for each flag being computed, and each of those inner joins then needs to be left joined to the trading data to get your 'want'.
data trading;
do group = 1 to 4;
do date = today()-1000 to today(); format date yymmdd10.;
output;
end;
end;
run;
data announcement;
do group = 1 to 4;
do date = today()-1000 to today(); format date yymmdd10.;
if ranuni(123) < 0.01 then output;
end;
end;
run;
proc sql;
create table trading_pre_announce_flagged as
select
trading.*
, announcement.date as announce_date
, case when P0.date is not null then 1 else . end as P0_flag label="Announcement was today or yesterday"
, case when P1.date is not null then 1 else . end as P1_flag label="Announcement in 1 to 5 days"
, case when P2.date is not null then 1 else . end as P2_flag label="Announcement in 11 to 60 days"
, case when P2.date is not null then P2.adate else . end as P2_date label="Date of Announcement in 11 to 60 days" format=yymmdd10.
from
trading
left join
announcement
on announcement.date = trading.date and announcement.group = trading.group
left join
( select trading.group, trading.date
from trading
inner join
announcement
on announcement.group = trading.group
and announcement.date - trading.date between -1 and 0
) as P0
on P0.date = trading.date and P0.group = trading.group
left join
( select trading.group, trading.date
from trading
inner join
announcement
on announcement.group = trading.group
and announcement.date - trading.date between 1 and 5
) as P1
on P1.date = trading.date and P1.group = trading.group
left join
( select trading.group, trading.date, announcement.date as adate
from trading
inner join
announcement
on announcement.group = trading.group
where announcement.date - trading.date between 11 and 60
) as P2
on P2.date = trading.date and P2.group = trading.group
order
by trading.group, trading.date
;
quit;
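Because each left join above can match a trading date to more than one announcement, the result may contain duplicated trading rows. A minimal sketch of one way around that, assuming the same trading/announcement layout, is to test each window with a correlated EXISTS subquery, which keeps exactly one row per trading date. The miniature tables and the output name trading_flagged are made up for illustration; this is not the answer's original method.

```sas
/* Hypothetical miniature tables, for illustration only */
data trading;
  do group = 1 to 2;
    do date = '01JAN2016'd to '31JAN2016'd;
      output;
    end;
  end;
  format date yymmdd10.;
run;

data announcement;
  input group date :yymmdd10.;
  format date yymmdd10.;
  datalines;
1 2016-01-10
1 2016-01-12
2 2016-01-20
;
run;

proc sql;
  create table trading_flagged as
  select t.*
       /* announcement was today or yesterday */
       , case when exists
           (select 1 from announcement a
             where a.group = t.group
               and a.date - t.date between -1 and 0)
         then 1 else . end as P0_flag
       /* announcement in 1 to 5 days */
       , case when exists
           (select 1 from announcement a
             where a.group = t.group
               and a.date - t.date between 1 and 5)
         then 1 else . end as P1_flag
       /* announcement in 11 to 60 days */
       , case when exists
           (select 1 from announcement a
             where a.group = t.group
               and a.date - t.date between 11 and 60)
         then 1 else . end as P2_flag
  from trading t
  order by t.group, t.date;
quit;
```

Overlapping events then simply set the same flag once. Whether an overlapping date should instead be excluded as 'spoiled' is a separate rule that this sketch does not decide.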
At some point (I can't find it now) the OP mentioned processing ~750 companies and 500 events overall, and that the SQL solution seemed to be long running.
An alternative would be a DATA Step.
500 events is a small enough cardinality that arrays of group and date can be used to store the events for lookup. Tracking indices into the sorted events keeps the scan needed to evaluate the rules and apply the condition flags to a minimum.
For example:
data trading;
do group = 1 to 700;
do date = today()-1000 to today(); format date yymmdd10.;
output;
end;
end;
run;
data announcement;
do eventid = 1 to 500;
group = ceil(700*ranuni(123));
date = (today()-1000) + ceil(1000*ranuni(123)); format date yymmdd10.;
if mod(eventid,20) = 1 then do;
output;
eventid+1;
date = date + 30 + floor(100*ranuni(123));
output;
eventid+1;
date = date + 30 + floor(100*ranuni(123));
end;
output;
end;
run;
proc sort data=announcement;
by group date;
run;
data _null_;
if 0 then set announcement nobs=nobs;
call symputx ('top', nobs+1);
stop; * no rows are read, so end the step explicitly;
run;
data marked_trading;
array e_group(0:&TOP) _temporary_;
array e_date (0:&TOP) _temporary_;
* load event array;
eix0 = 1;
eix1 = 1;
do _n_ = 1 by 1 until (last_announcement);
set announcement end=last_announcement;
e_group(_n_) = group;
e_date(_n_) = date;
end;
e_group(0) = 0; * sentinel;
e_group(_n_+1) = 1e9; * sentinel in slot nobs+1 (_n_ equals nobs here), so the last event is not overwritten;
* evaluate flagging criteria for each trade group date;
do _n_ = 1 by 1 until (last_trading);
set trading end=last_trading;
by group;
if first.group then do;
* discover indices of events associated with the group;
do eix0 = eix0 by 1 while (e_group(eix0) < group); end;
do eix1 = eix0 by 1 while (e_group(eix1) = group); end; eix1 = eix1 - 1;
eix_group = e_group(eix0);
end;
p3_flag = .; p2_flag = .; p1_flag = .;
if group = eix_group then do;
* NOTE: bounds are evaluated only at loop initialization;
* evaluate events for flagging a trade;
do ix = eix0 to eix1;
days_to_event = e_date(ix) - date;
if not p3_flag then if 11 <= days_to_event <= 60 then p3_flag = 1;
if not p2_flag then if 1 <= days_to_event <= 5 then p2_flag = 1;
if not p1_flag then if -1 <= days_to_event <= 0 then p1_flag = 1;
if days_to_event <= -1 then eix0 = ix+1; * update when applicability exhausted;
end;
end;
output;
end;
keep group date p:;
stop;
run;
Related
Since I am new to SAS, I need some help understanding how to combine overlapping date ranges into one row. I want to combine overlapping date ranges when they have a matching ID and drug code, so that they become one line. If the dates don't overlap, I want to keep the rows as they are. Please look at the sample data set below and the expected results:
Current Data set:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 1/1/2019
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
Expected results:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 3/31/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/2/2018 8/31/2018
1 100 9/1/2018 5/4/2019
I wrote some SAS code, but it combines the dates even when there is no overlap. I am looking for code that works in SAS.
PROC SORT DATA=Want OUT=ONE;
BY PERSON_ID BEG_DATE DRUG_CODE END_DATE;
RUN;
data TWO (DROP=PERSON_ID2 DRUG_CODE2 BEG_DATE END_DATE
RENAME=(BEG2=BEG_DOS
END2=END_DOS));
SET ONE;
RETAIN BEG2 END2;
PERSON_ID2=LAG1(PERSON_ID);
DRUG_CODE2=LAG1(DRUG_CODE);
IF PERSON_ID2=PERSON_ID AND DRUG_CODE2=DRUG_CODE AND BEG_DATE LE(END2+1) THEN
DO;
BEG2=MIN(BEG_DATE,BEG2);
END2=MAX(END_DATE,END2);
END;
ELSE
DO;
SEG+1;
BEG2=BEG_DATE;
END2=END_DATE;
END;
FORMAT BEG2 END2 MMDDYY10.;
RUN;
DATA THREE(DROP=BEG_DOS END_DOS SEG);
RETAIN BEG_DATE END_DATE;
SET TWO;
BY PERSON_ID SEG;
FORMAT BEG_DATE END_DATE MMDDYY10.;
IF FIRST.SEG THEN
DO;
BEG_DATE=BEG_DOS;
END;
IF LAST.SEG THEN
DO;
END_DATE = END_DOS;
OUTPUT;
END;
RUN;
This is how I would do it. Create an obs for each ID DRUG and DATE. Flag the gaps and summarize by RUN.
data have;
input ID Drug_Code (BEG End)(:mmddyy.);
format BEG End mmddyyd10.;
cards;
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 90 6/1/2018 8/15/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
;;;;
run;
proc print;
run;
/*1 100 1/1/2018 1/1/2019*/
data exv/ view=exv;
set have;
do date = beg to end;
output;
end;
drop beg end;
format date mmddyyd10.;
run;
proc sort data=exv out=ex nodupkey;
by id drug_code date;
run;
data breaksV / view=BreaksV;
set ex;
by id drug_code;
dif = dif(date);
if first.drug_code then do; dif=1; run=1; end;
if dif ne 1 then run+1;
run;
proc summary data=breaksV nway missing;
class id drug_code run;
var date;
output out=want(drop=_type_) min=Begin max=End;
run;
Proc print;
run;
Computing the extent range composed of overlapping segment ranges requires a good understanding of the range conditions (cases).
Consider the scenarios when sorted by start date (within any larger grouping set, G, such as id and drug)
Let [ and ] be endpoints of a range
# be date values (integers) within
Extent be the combined range that grows
Segment be the range in the current row
Case 1 - Growth. Within G Segment start before Extent end
Segment will either not contribute to Extent or extend it.
[####] Extent
+ [#] Segment range DOES NOT contribute
--------
[####] Extent (do not output a row, still growing)
or
[####] Extent
+ [#####] Segment range DOES contribute
--------
[#######] Extent (do not output a row, still growing)
Case 2 - Terminus. 3 possibilities:
1. Within G, Segment start after Extent end,
2. Next G reached (different id/drug combination),
3. End of data reached.
#2 and #3 can be tested by checking the appropriate last. flag.
[####] Extent
+ ..[#] Segment beyond Extent (gap is 2)
--------
[####] output Extent
[#] reset Extent to Segment
You can adjust your rules for Segment being adjacent (gap=0) or close enough (gap < threshold) to mean an Extent is either expanded, or, output and reset to Segment.
Note: The situation is a little more complicated (not shown) for the real world cases of:
- missing start, meaning the Segment has an unknown start date (presume it to be the epoch, 0=01JAN1960, or some date that pre-dates all dates in the data or study)
- missing end, meaning the Segment is active today (end date is the date when processing the data)
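As a rough sketch of how those two conventions could be applied before the segments are combined, a preprocessing step might fill the missing endpoints with COALESCE. The table names segments_raw and segments_filled and the three sample rows are hypothetical, chosen only for illustration.

```sas
/* Hypothetical segments with missing endpoints, for illustration only */
data segments_raw;
  input id drug beg_date :yymmdd10. end_date :yymmdd10.;
  format beg_date end_date yymmdd10.;
  datalines;
1 100 . 2018-03-01
1 100 2018-02-01 2018-04-30
1 100 2018-09-01 .
;
run;

data segments_filled;
  set segments_raw;
  /* missing start: presume the SAS epoch (0 = 01JAN1960) */
  beg_date = coalesce(beg_date, '01JAN1960'd);
  /* missing end: presume the segment is still active on the processing date */
  end_date = coalesce(end_date, today());
run;
```

The filled table would still need to be sorted by id, drug, and start date before any extent logic is applied to it.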
Sample code:
data have;
call streaminit(42);
do id = 1 to 10;
do _n_ = 1 to 50;
drug = ceil(10 * rand('UNIFORM'));
beg_date = intnx ('MONTH', '01JAN2008'D, floor(20 * rand('UNIFORM')));
end_date = intnx ('DAY', beg_date, floor(75 * rand('UNIFORM')));
OUTPUT;
end;
end;
format beg_date end_date yymmdd10.;
run;
proc sort data=have out=segments;
by id drug beg_date end_date;
run;
data want;
set segments;
by id drug beg_date end_date; * will error if incoming data is NOT sorted;
retain ext_beg ext_end;
retain gap_allowed 0; * set to 1 for contiguously adjacent segment ;
if first.drug then do;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 0;
end;
if beg_date <= ext_end + gap_allowed then do;
ext_end = max (ext_end, end_date);
segment_count + 1;
end;
else do;
extent_id + 1;
OUTPUT;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 1;
end;
if last.drug then do;
extent_id + 1;
OUTPUT;
* reset occurs implicitly;
* it will happen at first. logic when control returns to top of step;
end;
format ext_: yymmdd10.;
keep id drug ext_beg ext_end segment_count extent_id;
run;
I have some SAS code (SQL) that has to be repeated 25 times, once for each month/year combination (see code below). How can I use a macro in this code?
proc sql;
create table hh_oud_AUG_17 as
select hh_key
,sum(RG_count) as RG_count_aug_17
,case when sum(RG_count) >=2 then 1 else 0 end as loyabo_recht_aug_17
from basis_RG_oud
where valid_from_dt <= "01AUG2017"d <= valid_to_dt
group by hh_key
order by hh_key
;
quit;
proc sql;
create table hh_oud_SEP_17 as
select hh_key
,sum(RG_count) as RG_count_sep_17
,case when sum(RG_count) >=2 then 1 else 0 end as loyabo_recht_sep_17
from basis_RG_oud
where valid_from_dt <= "01SEP2017"d <= valid_to_dt
group by hh_key
order by hh_key
;
quit;
If you use a data step to do this, you can put all the desired columns in the same output dataset rather than using a macro to create 25 separate datasets:
/*Generate lists of variable names*/
data _null_;
stem1 = "RG_count_";
stem2 = "loyabo_recht_";
month = '01aug2017'd;
length suffix $4 vlist1 vlist2 $1000;
do i = 0 to 24;
suffix = put(intnx('month', month, i, 's'), yymmn4.);
vlist1 = catx(' ', vlist1, cats(stem1,suffix));
vlist2 = catx(' ', vlist2, cats(stem2,suffix));
end;
call symput("vlist1",vlist1);
call symput("vlist2",vlist2);
run;
%put vlist1 = &vlist1;
%put vlist2 = &vlist2;
/*Produce output table*/
data want;
  if 0 then set basis_RG_oud; /* establish variable attributes; input must be sorted by hh_key */
  start_month = '01aug2017'd;
  /* the array must not share its name with the input variable RG_count */
  array m[2, 0:24] &vlist1 &vlist2; /* row 1 = monthly sums, row 2 = flags */
  do _n_ = 1 by 1 until(last.hh_key);
    set basis_RG_oud;
    by hh_key;
    do i = 0 to hbound2(m);
      if valid_from_dt <= intnx('month', start_month, i, 's') <= valid_to_dt
        then m[1,i] = sum(m[1,i], RG_count);
    end;
  end;
  /* one output row per hh_key: derive the flags from the sums */
  do i = 0 to hbound2(m);
    m[2,i] = (m[1,i] >= 2);
  end;
  keep hh_key &vlist1 &vlist2;
run;
Create a second data set that enumerates (is a list of) the months to be examined. Cross Join the original data to that second data set. Create a single output table (or view) that contains the month as a categorical variable and aggregates based on that. You will be able to by-group process, classify or subset based on the month variable.
data months;
do month = '01jan2017'd to '31dec2018'd;
output;
month = intnx ('month', month, 0, 'E');
end;
format month monyy7.;
run;
proc sql;
create table want as
select
month, hh_key,
sum(RG_count) as RG_count,
case when sum(RG_count) >=2 then 1 else 0 end as loyabo_recht
from
basis_RG_oud
cross join
months
where
valid_from_dt <= month <= valid_to_dt
group
by month, hh_key
order
by month, hh_key
;
…
/* Some analysis */
BY MONTH;
…
/* Some tabulation */
CLASS MONTH;
TABLE … MONTH …
WHERE year(month) = 2018;
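If separate tables per month really are required, as the question literally asks, a macro loop can generate the 25 PROC SQL steps. The following is only a sketch under assumptions: the macro name hh_oud, its parameters start and n, and the miniature input table are invented for illustration, while the table and column naming follows the pattern in the question.

```sas
/* Hypothetical miniature input, for illustration only */
data basis_RG_oud;
  input hh_key RG_count valid_from_dt :date9. valid_to_dt :date9.;
  format valid_from_dt valid_to_dt date9.;
  datalines;
1 1 01JAN2017 31DEC2019
1 1 01SEP2017 31DEC2019
2 1 01JAN2017 15AUG2017
;
run;

%macro hh_oud(start=01AUG2017, n=25);
  %local i dt mon yy;
  %do i = 0 %to %eval(&n - 1);
    /* first day of the i-th month after &start, as a SAS date value */
    %let dt  = %sysfunc(intnx(month, "&start"d, &i, b));
    %let mon = %sysfunc(putn(&dt, monname3.));  /* e.g. Aug */
    %let yy  = %sysfunc(putn(&dt, year2.));     /* e.g. 17  */
    proc sql;
      create table hh_oud_&mon._&yy as
      select hh_key
            ,sum(RG_count) as RG_count_&mon._&yy
            ,case when sum(RG_count) >= 2 then 1 else 0 end as loyabo_recht_&mon._&yy
      from basis_RG_oud
      where valid_from_dt <= &dt <= valid_to_dt
      group by hh_key
      order by hh_key
      ;
    quit;
  %end;
%mend hh_oud;

%hh_oud(start=01AUG2017, n=25)
```

The single-table approaches in the answers are usually easier to work with afterwards. The macro is mainly useful when downstream code already expects one table per month.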
I have one problem, and I think not much needs correcting for it to work.
I have table (with desired output column 'sum_usage'):
id opt t_purchase t_spent bonus usage sum_usage
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
So, I need to sum all usage values from t_purchase until t_spent (for one id, opt combination, i.e. group by id, opt, there is just one unique t_purchase).
Also, I have about a million rows, so a hash table would probably be the best solution. I've tried:
data want;
if _n_=1 then do;
if 0 then set have(rename=(usage=_usage));
declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
h.definekey('id','opt', 't_purchase', 't_spent');
h.definedata('_usage');
h.definedone();
end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;
The fifth line from the bottom (do i=intck('second', t_purchase, t_spent) to t_spent) is certainly not correct, but I have no idea how to approach this. So the main problem is how to set up the time interval for this calculation. I already have one function that uses a hash table with the same keys, but without a time interval, so it would be good to write this one with a hash too, but that is not necessary.
Personally, I would ditch the hash and use SQL.
Example Data:
data have;
input id $ opt
t_purchase :datetime20. /* colon modifier: stop at the blank instead of reading 20 columns */
t_spent :datetime20.
bonus usage sum_usage;
format
t_purchase datetime20.
t_spent datetime20.;
datalines;
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
;
I'm leaving your sum_usage column here for comparison.
Now, create a table of sums. New value is sum_usage2.
proc sql noprint;
create table sums as
select a.id,
a.opt,
a.t_purchase,
a.t_spent,
sum(b.usage) as sum_usage2
from have as a,
have as b
where a.id = b.id
and a.opt = b.opt
and b.t_spent <= a.t_spent
and b.t_spent >= a.t_purchase
group by a.id,
a.opt,
a.t_purchase,
a.t_spent;
quit;
Now that you have the sums, join them back to the original table:
proc sql noprint;
create table want as
select a.*,
b.sum_usage2
from have as a
left join
sums as b
on a.id = b.id
and a.opt = b.opt
and a.t_spent = b.t_spent
and a.t_purchase = b.t_purchase;
quit;
This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a Data Step (which can be faster given the size).
data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
%create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;
rc = lk.find();
drop rc;
run;
%create_hash() macro available here https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
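If that macro is not at hand, the same lookup can be written with a plain hash object. This is a minimal sketch, assuming the have and sums tables are built as above. It uses a reduced three-row copy of the sample data to stay self-contained, and the name want2 simply mirrors the step above.

```sas
/* Reduced copy of the sample data, for illustration only */
data have;
  input id $ opt t_purchase :datetime20. t_spent :datetime20. bonus usage;
  format t_purchase t_spent datetime20.;
  datalines;
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7
;
run;

proc sql;
  create table sums as
  select a.id, a.opt, a.t_purchase, a.t_spent,
         sum(b.usage) as sum_usage2
  from have as a, have as b
  where a.id = b.id
    and a.opt = b.opt
    and b.t_spent between a.t_purchase and a.t_spent
  group by a.id, a.opt, a.t_purchase, a.t_spent;
quit;

data want2;
  if _n_ = 1 then do;
    if 0 then set sums;                 /* put sum_usage2 into the PDV */
    declare hash lk(dataset: 'sums');   /* load the sums once */
    lk.definekey('id', 'opt', 't_purchase', 't_spent');
    lk.definedata('sum_usage2');
    lk.definedone();
  end;
  set have;
  call missing(sum_usage2);             /* do not carry a value over when no key matches */
  rc = lk.find();                       /* rc = 0 when the key was found */
  drop rc;
run;
```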
I believe this question is a morph of one of your earlier ones, where you compute a rolling sum by doing a hash lookup for every second over a 3 hour period for each record in your data set. Hopefully you realized that the simplicity of that approach comes at a large cost: 3*3600 hash lookups per record, as well as having to load the entire data vector into a hash.
The time log data presented has new records inserted at the top of the data, and I presume the data to be monotonically descending in time.
A DATA Step can, in a single pass over monotonic data, compute the rolling sum over a time window. The technique uses 'ring' arrays, wherein index advancement is adjusted by modulus. One array is for the time and the other is for the metric (usage). The required array size is the maximum number of items that could occur within the time window.
Consider some generated sample data with time steps of 1, 2, and one jump of 200 seconds:
%let have_count = 150; /* number of sample rows, matching the 150 observations in the log below */
data have;
time = '12oct2017:11:22:32'dt;
usage = 0;
do _n_ = 1 to &have_count;
time + 2; *ceil(25*ranuni(123));
if _n_ > 30 then time + -1;
if _n_ = 145 then time + 200;
usage = floor(180*ranuni(123));
delta = time-lag(time);
output;
end;
run;
Start with the case of computing a rolling sum from prior items when sorted time ascending. (The descending case will follow):
The example parameters are RING_SIZE 16 and TIME_WINDOW of 12 seconds.
%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;
data want;
array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
array ring_time [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
retain ring_tail 0 ring_head -1 span 0 span_usage 0;
set have;
by time ; * cause error if data not sorted per algorithm requirement;
* unload from accumulated usage the tail items that fell out the window;
do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
span + -1;
span_usage + -ring_usage(ring_tail);
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
end;
ring_head = mod ( ring_head + 1, &RING_SIZE );
span + 1;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
* update the ring array;
ring_time(ring_head) = time;
ring_usage(ring_head) = usage;
span_usage + usage;
drop ring_tail ring_head span;
run;
For the case of data sorted descending, you could jiggle things; sort ascending, compute rolling and resort descending.
What to do if such a jiggle can't be done, or you just want a single pass?
The items that take part in the rolling calculation have to come from 'lead' rows, that is, rows not yet read via SET. How is this possible? A second SET statement can be used to open a separate channel to the data set, and thus obtain lead values.
There is a little more bookkeeping for processing lead data -- premature overwrite and diminished window at the end of data need to be handled.
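%let debug = *; /* a leading * turns the debug PUT statements into comments; set to blank to enable them */
data want2;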
array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
array ring_time [-1:%eval(&RING_SIZE-1)] _temporary_;
retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;
set have;
&debug put / _N_ ':' time= ring_head=;
* unload ring_head slotted item from sum;
span + -1;
span_usage + -ring_usage(ring_head);
* advance ring_head slot by 1, the vacated slot will be overwritten by lead;
ring_head = mod ( ring_head + 1, &RING_SIZE );
&debug put +2 ring_time(ring_head)= span= 'head';
* load ring with lead values via a second SET of the same data;
if not end2 then do;
do until (_n_ > 1 or lead_index = 0 or end2);
set have(keep=time usage rename=(time=t usage=u)) end=end2; * <--- the second SET ;
if end2 then guard_index = lead_index;
&debug if end2 then put guard_index=;
ring_time(lead_index) = t;
ring_usage(lead_index) = u;
&debug put +2 ring_time(lead_index)= 'lead';
lead_index = mod ( lead_index + 1, &RING_SIZE);
end;
end;
* advance ring_tail to cover the time window;
if ring_tail ne guard_index then do;
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
do while (time - ring_time(ring_tail) <= &TIME_WINDOW);
span + 1;
span_usage + ring_usage(ring_tail);
&debug put +2 ring_time(ring_tail)= span= 'seek';
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
if ring_tail_was = guard_index then leave;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
end;
* seek went beyond window, back tail off to prior index;
ring_tail = ring_tail_was;
end;
&debug put +2 ring_time(ring_tail)= span= 'mark';
drop lead_index t u ring_: guard_index span;
format ring: span: usage 6.;
run;
options source;
Confirm both methods have the same computation:
proc sort data=want2; by time;
run;
proc compare noprint data=want compare=want2 out=diff outnoequal;
id time;
var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.
I have not benchmarked ring-array versus SQL versus Proc EXPAND versus Hash.
Caution: Dead reckoning rolling values using +in and -out operations can experience round-off errors when dealing with non-integer values.
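A small sketch of that caution, using made-up values rather than the data above: adding 0.1 ten times and then removing it ten times need not land exactly on zero in floating point. One possible mitigation, offered here as an assumption rather than part of the ring-array code, is to round the running total to a tolerance suited to the data. Recomputing the sum from the ring contents every so often is another.

```sas
data _null_;
  total = 0;
  /* +in operations with a value that has no exact binary representation */
  do i = 1 to 10;
    total + 0.1;
  end;
  /* -out operations removing the same values again */
  do i = 1 to 10;
    total + -0.1;
  end;
  put 'Dead-reckoned total: ' total best32.;
  /* rounding to a tolerance removes any residue smaller than the tolerance */
  fixed = round(total, 1e-9);
  put 'Rounded total:       ' fixed best32.;
  call symputx('fixed_total', fixed);
run;
```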
I have two variables ID1 and ID2. They are both the same kinds of identifiers. When they appear in the same row of data it means they are in the same group. I want to make a group identifier for each ID. For example, I have
ID1 ID2
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
Then I would want
ID Group
1 1
2 1
3 2
4 1
5 1
6 1
7 2
Because 1,2,4,5,6 are paired by some combination in the original data they share a group. 3 and 7 are only paired with each other so they are a new group. I want to do this for ~20,000 rows. Every ID that is in ID1 is also in ID2 (more specifically if ID1=1 and ID2=2 for an observation, then there is another observation that is ID1=2 and ID2=1).
I've tried merging them back and forth but that doesn't work. I also tried call symput and trying to make a macro variable for each ID's group and then updating it as I move through rows, but I couldn't get that to work either.
I have used Haikuo Bian's answer as a starting point to develop a slightly more complex algorithm that seems to work for all the test cases I have tried so far. It could probably be optimised further, but it copes with 20000 rows in under a second on my PC while using only a few MB of memory. The input dataset does not need to be sorted in any particular order, but as written it assumes that every row is present at least once with id1 < id2.
Test cases:
/* Original test case */
data have;
input id1 id2;
cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;
/* Revised test case - all in one group with connecting row right at the end */
data have;
input ID1 ID2;
/*Make sure each row has id1 < id2*/
if id1 > id2 then do;
t_id2 = id2;
id2 = id1;
id1 = t_id2;
end;
drop t_id2;
cards;
2 5
4 8
2 4
2 6
3 7
4 1
9 1
3 2
6 2
7 3
;
run;
/*Full scale test case*/
data have;
do _N_ = 1 to 20000;
call streaminit(1);
id1 = int(rand('uniform')*100000);
id2 = int(rand('uniform')*100000);
if id1 < id2 then output;
t_id2 = id2;
id2 = id1;
id1 = t_id2;
if id1 < id2 then output;
end;
drop t_id2;
run;
Code:
option fullstimer;
data _null_;
length id group 8;
declare hash h();
rc = h.definekey('id');
rc = h.definedata('id');
rc = h.definedata('group');
rc = h.definedone();
array ids(2) id1 id2;
array groups(2) group1 group2;
/*Initial group guesses (greedy algorithm)*/
do until (eof);
set have(where = (id1 < id2)) end = eof;
match = 0;
call missing(min_group);
do i = 1 to 2;
rc = h.find(key:ids[i]);
match + (rc=0);
if rc = 0 then min_group = min(group,min_group);
end;
/*If neither id was in a previously matched group, create a new one*/
if not(match) then do;
max_group + 1;
group = max_group;
end;
/*Otherwise, assign both to the matched group with the lowest number*/
else group = min_group;
do i = 1 to 2;
id = ids[i];
rc = h.replace();
end;
end;
/*We now need to work through the whole dataset multiple times
to deal with ids that were wrongly assigned to a separate group
at the end of the initial pass, so load the table into a
hash object + iterator*/
declare hash h2(dataset:'have(where = (id1 < id2))');
rc = h2.definekey('id1','id2');
rc = h2.definedata('id1','id2');
rc = h2.definedone();
declare hiter hi2('h2');
change_count = 1;
do while(change_count > 0);
change_count = 0;
rc = hi2.first();
do while(rc = 0);
/*Get the current group of each id from
the hash we made earlier*/
do i = 1 to 2;
rc = h.find(key:ids[i]);
groups[i] = group;
end;
/*If we find a row where the two ids have different groups,
move the id in the higher group to the lower group*/
if groups[1] < groups[2] then do;
id = ids[2];
group = groups[1];
rc = h.replace();
change_count + 1;
end;
else if groups[2] < groups[1] then do;
id = ids[1];
group = groups[2];
rc = h.replace();
change_count + 1;
end;
rc = hi2.next();
end;
pass + 1;
put pass= change_count=; /*For information only :)*/
end;
rc = h.output(dataset:'want');
run;
/*Renumber the groups sequentially*/
proc sort data = want;
by group id;
run;
data want;
set want;
by group;
if first.group then new_group + 1;
drop group;
rename new_group = group;
run;
/*Summarise by # of ids per group*/
proc sql;
select a.group, count(id) as FREQ
from want a
group by a.group
order by freq desc;
quit;
Interestingly, the suggested optimisation of not checking the group of id2 during the initial pass if id1 is already matched actually slows things down a little in this extended algorithm, because it means that more work has to be done in the subsequent passes if id2 is in a lower numbered group. E.g. output from a trial run I did earlier:
With 'optimisation':
pass=0 change_count=4696
pass=1 change_count=204
pass=2 change_count=23
pass=3 change_count=9
pass=4 change_count=2
pass=5 change_count=1
pass=6 change_count=0
NOTE: DATA statement used (Total process time):
real time 0.19 seconds
user cpu time 0.17 seconds
system cpu time 0.04 seconds
memory 9088.76k
OS Memory 35192.00k
Without:
pass=0 change_count=4637
pass=1 change_count=182
pass=2 change_count=23
pass=3 change_count=9
pass=4 change_count=2
pass=5 change_count=1
pass=6 change_count=0
NOTE: DATA statement used (Total process time):
real time 0.18 seconds
user cpu time 0.16 seconds
system cpu time 0.04 seconds
Please try the below code.
data have;
input ID1 ID2;
datalines;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;
* Finding repeating in ID1;
proc sort data=have;by id1;run;
data want_1;
set have;
by id1;
attrib flagrepeat length=8.;
if not (first.id1 and last.id1) then flagrepeat=1;
else flagrepeat=0;
run;
* Finding repeating in ID2;
proc sort data=want_1;by id2;run;
data want_2;
set want_1;
by id2;
if not (first.id2 and last.id2) then flagrepeat=1;
run;
proc sort data=want_2 nodupkey;by id1 ;run;
data want(drop= ID2 flagrepeat rename=(ID1=ID));
set want_2;
attrib Group length=8.;
if(flagrepeat eq 1) then Group=1;
else Group=2;
run;
Hope this answer helps.
As one commenter mentioned, a hash does seem to be a viable approach. In the following code, 'id' and 'group' are maintained in the hash table, and a new 'group' is added only when no 'id' match is found for the entire row. Please note that 'do over' is an undocumented feature; it can easily be replaced with a little more coding.
data have;
input ID1 ID2;
cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
data _null_;
if _n_=1 then
do;
declare hash h(ordered: 'a');
h.definekey('id');
h.definedata('id','group');
h.definedone();
call missing(id,group);
end;
set have end=last;
array ids id1 id2;
do over ids;
rc=sum(rc,h.find(key:ids)=0);
/*you can choose to 'leave' the loop here when first h.find(key:ids)=0 is met, for the sake of better efficiency*/
end;
if not rc > 0 then
group+1;
do over ids;
id=ids;
h.replace();
end;
if last then rc=h.output(dataset:'want');
run;
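For reference, the two 'do over' loops can be replaced with an explicitly indexed array, as in the sketch below. It also keeps the running group number in a separate counter (max_group, a name introduced here) so that a group value fetched by h.find() cannot disturb the numbering of new groups. That counter is an assumption of this sketch, not part of the code above. Like the code above, it is a single greedy pass and does not merge groups that turn out to be connected later.

```sas
data have;
  input ID1 ID2;
  cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;

data _null_;
  length id group 8;
  if _n_ = 1 then do;
    declare hash h(ordered: 'a');
    h.definekey('id');
    h.definedata('id', 'group');
    h.definedone();
    call missing(id, group);
  end;
  set have end=last;
  array ids[2] id1 id2;

  /* look up both ids; a successful find loads that id's group */
  found = 0;
  do i = 1 to dim(ids);
    if h.find(key: ids[i]) = 0 then found = 1;
  end;

  /* neither id seen before: open a new group */
  if not found then do;
    max_group + 1;
    group = max_group;
  end;

  /* store both ids under the chosen group */
  do i = 1 to dim(ids);
    id = ids[i];
    h.replace();
  end;

  if last then h.output(dataset: 'want');
run;
```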
I have monthly data with several observations per day. I have day, month and year variables. How can I retain data from only the first and the last 5 days of each month? I have only weekdays in my data so the first and last five days of the month changes from month to month, ie for Jan 2008 the first five days can be 2nd, 3rd, 4th, 7th and 8th of the month.
Below is an example of the data file. I wasn't sure how to share this so I just copied some lines below. This is from Jan 2, 2008.
Would a variation of first.variable and last.variable work? How can I retain observations from the first 5 days and last 5 days of each month?
Thanks.
1 AA 500 B 36.9800 NH 2 1 2008 9:10:21
2 AA 500 S 36.4500 NN 2 1 2008 9:30:41
3 AA 100 B 36.4700 NH 2 1 2008 9:30:43
4 AA 100 B 36.4700 NH 2 1 2008 9:30:48
5 AA 50 S 36.4500 NN 2 1 2008 9:30:49
If you want to examine the data and determine the minimum 5 and maximum 5 values then you can use PROC SUMMARY. You could then merge the result back with the data to select the records.
So if your data has variables YEAR, MONTH and DAY you can make a new data set that has the top and bottom five days per month using simple steps.
proc sort data=HAVE (keep=year month day) nodupkey
out=ALLDAYS;
by year month day;
run;
proc summary data=ALLDAYS nway;
class year month;
output out=MIDDLE
idgroup(min(day) out[5](day)=min_day)
idgroup(max(day) out[5](day)=max_day)
/ autoname ;
run;
proc transpose data=MIDDLE out=DAYS (rename=(col1=day));
by year month;
var min_day: max_day: ;
run;
proc sql ;
create table WANT as
select a.*
from HAVE a
inner join DAYS b
on a.year=b.year and a.month=b.month and a.day = b.day
;
quit;
/****
get some dates to play with
****/
data dates(keep=i thisdate);
offset = input('01Jan2015',DATE9.);
do i=1 to 100;
thisdate = offset + round(599*ranuni(1)+1); *** within 600 days from offset;
output;
end;
format thisdate date9.;
run;
/****
BTW: intnx('month',thisdate,1) = first day of next month. Deduct 1 to get the last day
of the current month.
intnx('month',thisdate,0,"BEGINNING") = first day of the current month
****/
proc sql;
create table first5_last5 AS
SELECT
*
FROM
dates /* replace with name of your data set */
WHERE
/* replace all occurences of 'thisdate' with name of your date variable */
( intnx('month',thisdate,1)-5 <= thisdate <= intnx('month',thisdate,1)-1 )
OR
( intnx('month',thisdate,0,"BEGINNING") <= thisdate <= intnx('month',thisdate,0,"BEGINNING")+4 )
ORDER BY
thisdate;
quit;
Create some data with the desired structure;
Data inData (drop=_:); * forget all variables starting with an underscore*;
format date yymmdd10. time time8.;
_instant = datetime();
do _i = 1 to 1E5;
date = datepart(_instant);
time = timepart(_instant);
yy = year(date);
mm = month(date);
dd = day(date);
*just some more random data*;
letter = byte(rank('a') +floor(rand('uniform', 0, 26)));
*select week days*;
if weekday(date) in (2,3,4,5,6) then output;
_instant = _instant + 1E5*rand('exponential');
end;
run;
Count the days per month;
proc sql;
create view dayCounts as
select yy, mm, count(distinct dd) as _countInMonth
from inData
group by yy, mm;
quit;
Select the days;
data first_5(drop=_:) last_5(drop=_:);
merge inData dayCounts;
by yy mm;
_newDay = dif(date) ne 0;
retain _nrInMonth;
if first.mm then _nrInMonth = 1;
else if _newDay then _nrInMonth + 1;
if _nrInMonth le 5 then output first_5;
if _nrInMonth gt _countInMonth - 5 then output last_5;
run;
Use the INTNX() function. You can use INTNX('month',...) to find the beginning and ending days of the month and then use INTNX('weekday',...) to find the first 5 week days and last five week days.
You can convert your month, day, year values into a date using the MDY() function. Let's assume that you do that and create a variable called TODAY. Then, to test whether it is within the first 5 weekdays or the last 5 weekdays of the month, you could do something like this:
first5 = intnx('weekday',intnx('month',today,0,'B'),0) <= today
<= intnx('weekday',intnx('month',today,0,'B'),4) ;
last5 = intnx('weekday',intnx('month',today,0,'E'),-4) <= today
<= intnx('weekday',intnx('month',today,0,'E'),0) ;
Note that those ranges will include the week-ends, but it shouldn't matter if your data doesn't have those dates.
But you might have issues if your data skips holidays.