I need to create counter variables in a dataset that are complex (for me). I have tried to explain the rules as clearly as possible; if anything is unclear, please let me know. I hope that with your help I can achieve the result I expect.
I need to create three variables: Probation_Count, Probation_Flag and Cure_Count.
All three variables are CID-specific (we group by CID).
Probation_Count and Probation_Flag conditions
Condition 1 - probation_count starts from 1 when a contract goes from Default_Flag = Y to Default_Flag = N; probation_flag = Y.
Condition 2 - probation_count increments as long as DPD = 0 and Default_Flag = N; probation_flag = Y.
Condition 3 - when DPD > 0 and DPD <= 3 and Default_Flag = N, probation_count stays at the value it had when DPD was last 0; it starts to increase again once DPD = 0 and Default_Flag = N; probation_flag = Y.
Condition 4 - when DPD > 3 and Default_Flag = N, probation_count resets to 0 until DPD = 0 and Default_Flag = N; probation_flag = Y.
Condition 5 - probation_count can increase up to 10 and then resets to 0; probation_flag = Y until probation_count = 10.
Condition 6 - whenever Default_Flag = Y, probation_count = 0 and probation_flag = N. For probation_count to start again, the contract has to move from Default_Flag = Y to Default_Flag = N.
Cure_count conditions
Condition 1 - cure_count starts from 1 when probation_count was 10 on the previous date and Default_Flag = N on the current date.
Condition 2 - cure_count increases until Default_Flag = Y or cure_count = 10.
Please find the sample data below. I have manually calculated probation_count, probation_flag and cure_count.
data sample;
INFILE DATALINES DLM='#';
input CID date ddmmyy10. DPD Default_Flag $ Probation_Count probation_Flag $ Cure_count;
format date ddmmyy10.;
datalines;
111#04/04/2021#87#N#00# #0
111#05/04/2021#88#N#00# #0
111#06/04/2021#89#N#00# #0
111#07/04/2021#90#Y#00# #0
111#08/04/2021#91#Y#00# #0
111#09/04/2021#92#Y#00# #0
111#10/04/2021#93#Y#00# #0
111#11/04/2021#00#N#01#Y#0
111#12/04/2021#00#N#02#Y#0
111#13/04/2021#00#N#03#Y#0
111#14/04/2021#00#N#04#Y#0
111#15/04/2021#00#N#05#Y#0
111#16/04/2021#01#N#05#Y#0
111#17/04/2021#02#N#05#Y#0
111#18/04/2021#00#N#06#Y#0
111#19/04/2021#00#N#07#Y#0
111#20/04/2021#00#N#08#Y#0
111#21/04/2021#00#N#09#Y#0
111#22/04/2021#00#N#10#Y#0
111#23/04/2021#00#N#00# #1
111#24/04/2021#00#N#00# #2
111#25/04/2021#00#N#00# #3
222#04/04/2021#86#N#00# #0
222#05/04/2021#87#N#00# #0
222#06/04/2021#88#N#00# #0
222#07/04/2021#89#N#00# #0
222#08/04/2021#90#Y#00# #0
222#09/04/2021#91#Y#00# #0
222#10/04/2021#92#Y#00# #0
222#11/04/2021#93#Y#00# #0
222#12/04/2021#94#Y#00# #0
222#13/04/2021#95#Y#00# #0
222#14/04/2021#96#Y#00# #0
333#04/04/2021#87#N#00# #0
333#05/04/2021#88#N#00# #0
333#06/04/2021#89#N#00# #0
333#07/04/2021#90#Y#00# #0
333#08/04/2021#91#Y#00# #0
333#09/04/2021#92#Y#00# #0
333#10/04/2021#00#N#01#Y#0
333#11/04/2021#00#N#02#Y#0
333#12/04/2021#00#N#03#Y#0
333#13/04/2021#00#N#04#Y#0
333#14/04/2021#00#N#05#Y#0
333#15/04/2021#00#N#06#Y#0
333#16/04/2021#01#N#05#Y#0
333#17/04/2021#02#N#05#Y#0
333#18/04/2021#03#N#05#Y#0
333#19/04/2021#04#N#00#Y#0
333#20/04/2021#05#N#00#Y#0
333#21/04/2021#00#N#01#Y#0
333#22/04/2021#00#N#02#Y#0
333#23/04/2021#00#N#03#Y#0
333#24/04/2021#00#N#04#Y#0
333#25/04/2021#00#N#05#Y#0
333#26/04/2021#00#N#06#Y#0
333#27/04/2021#00#N#07#Y#0
333#28/04/2021#00#N#08#Y#0
333#29/04/2021#00#N#09#Y#0
333#30/04/2021#00#N#10#Y#0
333#01/05/2021#00#N#00# #1
333#02/05/2021#00#N#00# #2
333#03/05/2021#00#N#00# #3
333#04/05/2021#90#Y#00# #0
333#05/05/2021#91#Y#00# #0
;
run;
Thank you so much for your time and help
The data and explanations are not 100% clear, but this sample code might help you implement the complex rules you are attempting.
I need to create three variables: Probation_Count, Probation_Flag and Cure_Count.
I would expect this to mean that these variables and their values can only be computed from the state, and the change of state, of default_flag and dpd. You don't make it clear how, or whether, a value computed in the prior row should be carried forward into the next row's computation.
Example:
data have;
INFILE DATALINES DLM='#';
input CID date ddmmyy10. DPD Default_Flag $ Probation_Count_X Probation_Flag_X $ Cure_Count_X;
format date ddmmyy10.;
datalines;
111#04/04/2021#87#N#00# #0
111#05/04/2021#88#N#00# #0
111#06/04/2021#89#N#00# #0
111#07/04/2021#90#Y#00# #0
111#08/04/2021#91#Y#00# #0
111#09/04/2021#92#Y#00# #0
111#10/04/2021#93#Y#00# #0
111#11/04/2021#00#N#01#Y#0
111#12/04/2021#00#N#02#Y#0
111#13/04/2021#00#N#03#Y#0
111#14/04/2021#00#N#04#Y#0
111#15/04/2021#00#N#05#Y#0
111#16/04/2021#01#N#05#Y#0
111#17/04/2021#02#N#05#Y#0
111#18/04/2021#00#N#06#Y#0
111#19/04/2021#00#N#07#Y#0
111#20/04/2021#00#N#08#Y#0
111#21/04/2021#00#N#09#Y#0
111#22/04/2021#00#N#10#Y#0
111#23/04/2021#00#N#00# #1
111#24/04/2021#00#N#00# #2
111#25/04/2021#00#N#00# #3
222#04/04/2021#86#N#00# #0
222#05/04/2021#87#N#00# #0
222#06/04/2021#88#N#00# #0
222#07/04/2021#89#N#00# #0
222#08/04/2021#90#Y#00# #0
222#09/04/2021#91#Y#00# #0
222#10/04/2021#92#Y#00# #0
222#11/04/2021#93#Y#00# #0
222#12/04/2021#94#Y#00# #0
222#13/04/2021#95#Y#00# #0
222#14/04/2021#96#Y#00# #0
333#04/04/2021#87#N#00# #0
333#05/04/2021#88#N#00# #0
333#06/04/2021#89#N#00# #0
333#07/04/2021#90#Y#00# #0
333#08/04/2021#91#Y#00# #0
333#09/04/2021#92#Y#00# #0
333#10/04/2021#00#N#01#Y#0
333#11/04/2021#00#N#02#Y#0
333#12/04/2021#00#N#03#Y#0
333#13/04/2021#00#N#04#Y#0
333#14/04/2021#00#N#05#Y#0
333#15/04/2021#00#N#06#Y#0
333#16/04/2021#01#N#05#Y#0
333#17/04/2021#02#N#05#Y#0
333#18/04/2021#03#N#05#Y#0
333#19/04/2021#04#N#00#Y#0
333#20/04/2021#05#N#00#Y#0
333#21/04/2021#00#N#01#Y#0
333#22/04/2021#00#N#02#Y#0
333#23/04/2021#00#N#03#Y#0
333#24/04/2021#00#N#04#Y#0
333#25/04/2021#00#N#05#Y#0
333#26/04/2021#00#N#06#Y#0
333#27/04/2021#00#N#07#Y#0
333#28/04/2021#00#N#08#Y#0
333#29/04/2021#00#N#09#Y#0
333#30/04/2021#00#N#10#Y#0
333#01/05/2021#00#N#00# #1
333#02/05/2021#00#N#00# #2
333#03/05/2021#00#N#00# #3
333#04/05/2021#90#Y#00# #0
333#05/05/2021#91#Y#00# #0
;
data want;
  length rule $1 probation_count 8 probation_flag $1 cure_count 8;
  length trigger_counting pcounting 8;
  retain pcounting probation_count;
  * NOTE: cure_count is declared above but is not computed in this sample;
  set have;
  by cid;
  rule = ' ';
  if first.cid then do;
    probation_count = 0;
    probation_flag = ' ';
    trigger_counting = 0;
    pcounting = 0;
  end;
  * true only on the row where default_flag changes from Y to N within a cid;
  trigger_counting =
    default_flag = 'N'
    and
    ( lag(default_flag) = 'Y' and NOT first.cid )
  ;
  if default_flag = 'N' then do;
    * set the counting flag 'pcounting' and initialize count;
    if trigger_counting then do;
      pcounting = 1;
      probation_count = 1;
      probation_flag = 'Y';
      rule = '1';
      return;
    end;
    * increment count for no dpd, reset if necessary;
    if pcounting and dpd = 0 then do;
      probation_count + 1;
      probation_flag = 'Y';
      rule = '2';
      if probation_count > 10 then do;
        probation_count = 0;
        rule = '5';
      end;
      return;
    end;
    * pause counting for few dpd;
    if pcounting and 0 < dpd <= 3 then do;
      probation_flag = 'Y';
      rule = '3';
      return;
    end;
    * reset counting for high dpd;
    if pcounting and dpd > 3 then do;
      probation_count = 0;
      probation_flag = 'Y';
      rule = '4';
      return;
    end;
  end;
  else
  if default_flag = 'Y' then do;
    probation_count = 0;
    probation_flag = 'N';
    rule = '6';
  end;
  else do;
    put 'ERROR: ' default_flag= _n_=;
    stop;
  end;
  * drop trigger_counting pcounting;
run;
I want to flag Komp and Bauspar: '-' if the value is < 1, '+' if it is > 1, and no flag if the value is blank.
I tried the following, but it somehow produces two 2022_Bauspar_flag columns.
Can you give me a hint?
Thanks a lot.
Kind regards,
Ben
%macro target_years2(table,type);
%local name_Bauspar name_Komp;
data &table ;
set work.&table;
%let name_Komp = "2022_ZZ_Komp"n;
%let name_Bauspar = "2022_ZZ_Bauspar"n;
&name_Komp = (1+("2022_Komposit"n-"2022_Komposit_Ziel"n)/"2022_Komposit_Ziel"n);
&name_Bauspar = (1+("2022_Bausparen"n-"2022_Bausparen_Ziel"n)/"2022_Bausparen_Ziel"n);
/*create ZZ_flags*/
if &name_Komp > 1 THEN do;
"2022_ZZ_Komp_flag"n = '+';
end;
else if &name_Komp < 1 and &name_Komp <> . THEN do;
"2022_ZZ_Komp_flag"n = '-';
end;
else if &name_Bauspar > 1 THEN do;
"2022_ZZ_Baupar_flag"n = '+';
end;
else if &name_Bauspar < 1 and &name_Bauspar <> . THEN do;
"2022_ZZ_Bauspar_flag"n = '-';
end;
else do;
end;
run;
%mend;
%target_years2(Produktion_temp,Produktion)
Difficult to help you as you do not provide any output or detailed explanation of what is wrong.
Note that if you want to compute both columns for each observation, you need to split your IF statement into two separate chains. In a single IF/ELSE IF chain, the later conditions are not evaluated once an earlier condition is true.
I understand you want to compute two derived columns 2022_ZZ_Komp_flag and 2022_ZZ_Bauspar_flag with the following condition:
if associated macro variable &name_ > 1 then flag is +
if associated macro variable &name_ < 1 then flag is -
if associated macro variable &name_ = . then flag is missing
With the following dataset
data have;
input zz_komp zz_bauspar;
cards;
0.9 1.1
1.1 0.8
. 2
0.8 .
;
The following code
data want;
set have;
"2022_ZZ_Komp_flag"n = ifc(zz_komp > 1, '+', '-');
"2022_ZZ_Bauspar_flag"n = ifc(zz_bauspar > 1, '+', '-');
if missing(zz_komp) then "2022_ZZ_Komp_flag"n = '';
if missing(zz_bauspar) then "2022_ZZ_Bauspar_flag"n = '';
run;
Produces
zz_komp  zz_bauspar  2022_ZZ_Komp_flag  2022_ZZ_Bauspar_flag
0.9      1.1         -                  +
1.1      0.8         +                  -
.        2                              +
0.8      .           -
Is it the expected result?
You have a typo in your code. You assign to Baupar_flag in one case, and Bauspar_flag in the other
else if &name_Bauspar > 1 THEN do;
"2022_ZZ_Baupar_flag"n = '+';
------
end;
else if &name_Bauspar < 1 and &name_Bauspar <> . THEN do;
"2022_ZZ_Bauspar_flag"n = '-';
-------
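Putting both observations together (the misspelled name and the single IF/ELSE chain), here is a minimal sketch of the flag logic with one chain per flag. It assumes the two ratios are already computed in columns named zz_komp and zz_bauspar; those names and the test data are hypothetical, chosen only for this illustration, and only the flag names are taken from the question.

```sas
options validvarname=any;  /* required for name literals such as "2022_ZZ_Komp_flag"n */

/* hypothetical test data: one ratio column per product */
data have;
  input zz_komp zz_bauspar;
  datalines;
0.9 1.1
1.1 0.8
. 2
0.8 .
;

data want;
  set have;
  length "2022_ZZ_Komp_flag"n "2022_ZZ_Bauspar_flag"n $1;

  /* chain 1: Komp flag only */
  if zz_komp > 1 then "2022_ZZ_Komp_flag"n = '+';
  else if not missing(zz_komp) and zz_komp < 1 then "2022_ZZ_Komp_flag"n = '-';

  /* chain 2: Bauspar flag, evaluated regardless of the outcome of chain 1 */
  if zz_bauspar > 1 then "2022_ZZ_Bauspar_flag"n = '+';
  else if not missing(zz_bauspar) and zz_bauspar < 1 then "2022_ZZ_Bauspar_flag"n = '-';
run;
```

Because the flag name is spelled the same way in both branches of chain 2, only one Bauspar flag column is created. A value of exactly 1 or a missing value leaves the flag blank.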
I have a large dataset containing over 80,000,000 rows, sorted by "name" and "income" (with duplicates for both name and income). For the first name I would like to have the 5 lowest incomes. For the second name I would also like the 5 lowest incomes, but incomes already drawn for the first name are disqualified from selection. And so on, until the last name (if there are any incomes left by then).
You first want to rank income within names. So:
proc rank data=yourdata out=temp ties=low;
by name;
var income;
ranks incomerank;
run;
Then you want to filter the 5 lowest incomes by name, so:
proc sql;
create table want as
select distinct *
from temp
where incomerank < 6;
quit;
You will need to sort and track incomes:
Use an array to sort and track the five lowest incomes of a name.
Use a hash to track whether an income has already been output, which makes it ineligible for output under later names.
Example:
An insertion sort of the eligible low-valued incomes is used; it will be fast because it only ever handles 5 items.
data have;
call streaminit(1234);
do name = 1 to 1e6;
do seq = 1 to rand('integer', 20);
income = rand('integer', 20000, 1000000);
output;
end;
end;
run;
data
want (label='Lowest 5 incomes (first occurring over all names) of each name')
want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
array X(5) _temporary_;
if _n_ = 1 then do;
if 0 then set have;
declare hash incomes();
incomes.defineKey('income');
incomes.defineDone();
end;
_maxmin5 = 1e15;
x(1) = 1e15;
x(2) = 1e15;
x(3) = 1e15;
x(4) = 1e15;
x(5) = 1e15;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if incomes.check() = 0 then continue;
* insert sort - lowest five not observed previously;
if income > _maxmin5 then continue;
do _i_ = 1 to 5;
if income < x(_i_) then do;
do _j_ = 5 to _i_+1 by -1;
x(_j_) = x(_j_-1);
end;
x(_i_) = income;
_maxmin5 = x(5);
incomes.add();
leave;
end;
end;
end;
_outflag = 0;
do _n_ = 1 to _n_;
set have;
if income in x then do;
_outflag = 1;
OUTPUT want;
end;
end;
if not _outflag then
OUTPUT want_barren;
drop _:;
run;
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
address = cats('Address_', _N_);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(dataset : 'have(obs=0)', ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedata(all : 'y');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Here is my interpretation of your problem and a solution.
Suppose a simplified version of your data looks like this and you want the 2 lowest incomes for each name. For simplicity, I use a numeric variable n as the name, but a character variable will work as well.
data have;
input n income;
datalines;
1 100
1 200
1 300
2 400
2 100
2 500
3 600
3 200
3 500
;
From this data, my guess is that your logic goes like this:
Start with n = 1.
Output the 2 observations with the lowest income (100 and 200).
Go to the next name (n=2).
Output the 2 observations with the lowest income that have not already been output (400 and 500). 100 has already been output in the n=1 group.
...And so on...
This gives the desired result below:
data want;
input n income;
datalines;
1 100
1 200
2 400
2 500
3 600
;
Try out the solution below and verify that you get the result as posted above.
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 2 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Finally, let us create representative example data with 80 million observations. I change the if c = 2 then leave; statement to if c = 5 then leave; to go back to your actual problem.
The code below runs in about 45 seconds on my system and processes the data in a single pass. Let me know if it works for you :-)
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
I have a dataset that looks like:
Hour Flag
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
I want to have an output dataset like:
Total_Hours Count
2 2
3 1
4 1
As you can see, I want to count the number of hours included in each period with consecutive "1s". A missing value ends the consecutive sequence.
How should I go about doing this? Thanks!
You'll need to do this in two steps. First step is making sure the data is sorted properly and determining the number of hours in a consecutive period:
PROC SORT DATA = <your dataset>;
BY hour;
RUN;
DATA work.consecutive_hours;
SET <your dataset> END = lastrec;
RETAIN
total_hours 0
;
IF flag = 1 THEN total_hours = total_hours + 1;
ELSE
DO;
IF total_hours > 0 THEN output;
total_hours = 0;
END;
/* Need to output last record */
IF lastrec AND total_hours > 0 THEN output;
KEEP
total_hours
;
RUN;
Now a simple SQL statement:
PROC SQL;
CREATE TABLE work.hour_summary AS
SELECT
total_hours
,COUNT(*) AS count
FROM
work.consecutive_hours
GROUP BY
total_hours
;
QUIT;
You will have to do two things:
compute the run lengths
compute the frequency of the run lengths
When using the implicit loop:
Each run-length occurrence can be computed and maintained in a retained tracking variable: a non-missing value increments the run length, while a missing value or the end of the data triggers output of the run and resets the length.
The frequencies of the run lengths can then be computed with Proc FREQ.
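The implicit-loop approach described above could be sketched as follows. This is only an illustration under stated assumptions: the dataset names runs and want and the variables run_length and last_row are made up for this example, and the sample data is the one from the question.

```sas
/* sample data from the question, read several observations per line */
data have;
  input Hour Flag @@;
  datalines;
1 1  2 1  3 .  4 1  5 1  6 .  7 1
8 1  9 1  10 .  11 1  12 1  13 1  14 1
;

/* Step 1: one output row per run of consecutive non-missing flags */
data runs(keep=run_length);
  set have end=last_row;
  /* sum statement: run_length starts at 0 and is retained across rows */
  if not missing(Flag) then run_length + 1;
  /* a missing flag or the end of the data closes the current run */
  if (missing(Flag) or last_row) and run_length > 0 then do;
    output;
    run_length = 0;
  end;
run;

/* Step 2: frequency of each run length */
proc freq data=runs noprint;
  tables run_length / out=want(keep=run_length count);
run;
```

For the sample data the runs have lengths 2, 2, 3 and 4, so want holds a count of 2 for run length 2 and a count of 1 for each of run lengths 3 and 4, which matches the output requested in the question.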
An alternative is to use an explicit loop and a hash for frequency counts.
Example:
data have; input
Hour Flag; datalines;
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
;
data _null_;
declare hash counts(ordered:'a');
counts.defineKey('length');
counts.defineData('length', 'count');
counts.defineDone();
do until (end);
set have end=end;
if not missing(flag) then
length + 1;
if missing(flag) or end then do;
if length > 0 then do;
if counts.find() eq 0
then count+1;
else count=1;
counts.replace();
length = 0;
end;
end;
end;
counts.output(dataset:'want');
run;
An alternative
data _null_;
if _N_ = 1 then do;
dcl hash h(ordered : "a");
h.definekey("Total_Hours");
h.definedata("Total_Hours", "Count");
h.definedone();
end;
do Total_Hours = 1 by 1 until (last.Flag);
set have end=lr;
by Flag notsorted;
end;
Count = 1;
if Flag then do;
if h.find() = 0 then Count+1;
h.replace();
end;
if lr then h.output(dataset : "want");
run;
Several weeks ago, @Richard taught me how to use a DOW-loop with a direct-addressing array. Today, I pass it on to you.
data want(keep=Total_Hours Count);
  * bin[k] holds the number of runs of length k (runs longer than 99 would not fit);
  array bin[99] _temporary_;
  do until(eof1);
    set have end=eof1;
    if Flag then count + 1;
    if ^Flag or eof1 then do;
      * skip empty runs, e.g. two consecutive missing flags;
      if count then bin[count] + 1;
      count = .;
    end;
  end;
  do i = 1 to dim(bin);
    Total_Hours = i;
    Count = bin[i];
    if Count then output;
  end;
run;
Thanks again to Richard, who also suggested this article to me.
I am working with a dataset which has several stocks and I merged a data set with events onto it (for the stocks, several events in the period per stock).
Now, for an event study, I want to create several variables that act as dummies and create windows: -60 to -11 days, -5 to -1 day and announcement day plus day +1.
Important are two things:
It has to be by Stock (the windows should not be carried over between stocks)
one announcement/event day (ann_day) should not spoil another event's window.
I tried the following but it is just giving me a window and does not account for different stocks and spoiled windows:
proc sql;
create view event_study as
select distinct b.ann_date,a.date,a.dayid-b.dayid as event_time, a.stock,a.return
from Dataset_full as a,Announcements as b
where a.dayid-b.dayid between -60 and 11 and a.secid=b.secid
order by a.stock,b.ann_date,event_time;
quit;
Some info: announcement days are the events
dataset_full has the stock, date, return, volume. One row per calendar/trading day.
Announcement has stock, announcement date and announcement info (one row per announcement)
Data should look like this:
Stock Date Ann_date flag_minus60_minus11 flag_minus5_minus1 flag_day0_day1
A 1/01/2016 1
A 2/01/2016 1
A 3/01/2016
A 4/01/2016 4/01/2016 1
A 5/01/2016 1
A 6/01/2016
A 7/01/2016
A 8/01/2016
A 9/01/2016
A 10/01/2016
A 11/01/2016
A 12/01/2016
A 13/01/2016
A 14/01/2016
A 15/01/2016
B 1/01/2016 1
B 2/01/2016 1
B 3/01/2016 1
B 4/01/2016 1
B 5/01/2016 1
B 6/01/2016 1
B 7/01/2016
B 8/01/2016
B 9/01/2016
B 10/01/2016
B 11/01/2016 1
B 12/01/2016 1
B 13/01/2016 1
B 14/01/2016 1
B 15/01/2016 1
B 16/01/2016 16/01/2016 1
B 17/01/2016 1
B 18/01/2016 1
B 19/01/2016 1
B 20/01/2016 20/01/2016 1
B 21/01/2016 1
B 22/01/2016
B 23/01/2016
B 24/01/2016
B 25/01/2016
MaBo:
Here is some sample data and SQL. When you examine the output, I presume you will see 'spoiled' information, that is, a date with more than one future announcement in the flagging time frame.
Flagging trading dates with respect to an event date is an inner-join problem. The inner join has to be performed for each flag being computed, and each inner join then needs to be left joined to the trading data to get your 'want'.
data trading;
do group = 1 to 4;
do date = today()-1000 to today(); format date yymmdd10.;
output;
end;
end;
run;
data announcement;
do group = 1 to 4;
do date = today()-1000 to today(); format date yymmdd10.;
if ranuni(123) < 0.01 then output;
end;
end;
run;
proc sql;
create table trading_pre_announce_flagged as
select
trading.*
, announcement.date as announce_date
, case when P0.date is not null then 1 else . end as P0_flag label="Announcement was today or yesterday"
, case when P1.date is not null then 1 else . end as P1_flag label="Announcement in 1 to 5 days"
, case when P2.date is not null then 1 else . end as P2_flag label="Announcement in 11 to 60 days"
, case when P2.date is not null then P2.adate else . end as P2_date label="Date of Announcement in 11 to 60 days" format=yymmdd10.
from
trading
left join
announcement
on announcement.date = trading.date and announcement.group = trading.group
left join
( select trading.group, trading.date
from trading
inner join
announcement
on announcement.group = trading.group
and announcement.date - trading.date between -1 and 0
) as P0
on P0.date = trading.date and P0.group = trading.group
left join
( select trading.group, trading.date
from trading
inner join
announcement
on announcement.group = trading.group
and announcement.date - trading.date between 1 and 5
) as P1
on P1.date = trading.date and P1.group = trading.group
left join
( select trading.group, trading.date, announcement.date as adate
from trading
inner join
announcement
on announcement.group = trading.group
where announcement.date - trading.date between 11 and 60
) as P2
on P2.date = trading.date and P2.group = trading.group
order
by trading.group, trading.date
;
quit;
At some point (can't find it though) the OP mentioned processing ~750 companies and 500 overall events, and that the SQL solution seemed to be long running.
An alternative would be DATA Step.
The 500 events is a small enough cardinality where arrays of group and date could be used to store the events for lookup. Smart index tracking of the sorted events can be used for doing a minimum scan for evaluating the rules and applying the condition flags.
For example:
data trading;
do group = 1 to 700;
do date = today()-1000 to today(); format date yymmdd10.;
output;
end;
end;
run;
data announcement;
do eventid = 1 to 500;
group = ceil(700*ranuni(123));
date = (today()-1000) + ceil(1000*ranuni(123)); format date yymmdd10.;
if mod(eventid,20) = 1 then do;
output;
eventid+1;
date = date + 30 + floor(100*ranuni(123));
output;
eventid+1;
date = date + 30 + floor(100*ranuni(123));
end;
output;
end;
run;
proc sort data=announcement;
by group date;
run;
data _null_;
if 0 then set announcement nobs=nobs;
call symputx ('top', nobs+1);
stop; * no observation is ever read, so end the step explicitly;
run;
data marked_trading;
array e_group(0:&TOP) _temporary_;
array e_date (0:&TOP) _temporary_;
* load event array;
do _n_ = 1 by 1 until (last_announcement);
set announcement end=last_announcement;
e_group(_n_) = group;
e_date(_n_) = date;
eix0 = 1;
eix1 = 1;
end;
e_group(0) = 0; * sentinel;
e_group(_n_+1) = 1e9; * sentinel placed after the last event (_n_ is the number of events here);
* evaluate flagging criteria for each trade group date;
do _n_ = 1 by 1 until (last_trading);
set trading end=last_trading;
by group;
if first.group then do;
* discover indices of events associated with the group;
do eix0 = eix0 by 1 while (e_group(eix0) < group); end;
do eix1 = eix0 by 1 while (e_group(eix1) = group); end; eix1 = eix1 - 1;
eix_group = e_group(eix0);
end;
p3_flag = .; p2_flag = .; p1_flag = .;
if group = eix_group then do;
* NOTE: bounds are evaluated only at loop initialization;
* evaluate events for flagging a trade;
do ix = eix0 to eix1;
days_to_event = e_date(ix) - date;
if not p3_flag then if 11 <= days_to_event <= 60 then p3_flag = 1;
if not p2_flag then if 1 <= days_to_event <= 5 then p2_flag = 1;
if not p1_flag then if -1 <= days_to_event <= 0 then p1_flag = 1;
if days_to_event <= -1 then eix0 = ix+1; * update when applicability exhausted;
end;
end;
output;
end;
keep group date p:;
stop;
run;
I have a dataset like this(sp is an indicator):
datetime sp
ddmmyy:10:30:00 N
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
ddmmyy:10:34:00 N
And I would like to extract observations with "Y" and also the previous and next one:
ID sp
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
I tried to use "lag" and successfully extracted the observations with "Y" and the next one, but I still have no idea how to extract the previous one.
Here is my try:
data surprise_6_step3; set surprise_6_step2;
length lag_sp $1;
lag_sp=lag(sp);
if sp='N' and lag(sp)='N' then delete;
run;
and the result is:
ID sp
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
Any methods to extract the previous observation also?
Thx for any help.
Try using the point option in set statement in data step.
Like this:
data extract;
  set surprise_6_step2 nobs=nobs;
  if sp = 'Y' then do;
    current = _N_;
    prev = current - 1;
    next = current + 1;
    * re-read the neighbouring rows of the same dataset by observation number;
    if prev > 0 then do;
      set surprise_6_step2 point = prev;
      output;
    end;
    set surprise_6_step2 point = current;
    output;
    if next <= nobs then do;
      set surprise_6_step2 point = next;
      output;
    end;
  end;
  drop current;
run;
There is an implicit loop through the dataset when you use it in a set statement.
_N_ is an automatic variable that holds the number of the current iteration of the implicit loop (it starts from 1), which here is also the number of the observation being read. When you find your value, you store the value of _N_ in the variable current, so you know on which row you found it. nobs is the total number of observations in the dataset.
Checking that prev is greater than 0 and that next is not greater than nobs avoids an error if your row is the first in the dataset (then there is no previous row) or the last in the dataset (then there is no next row).
/* generate test data */
data test;
do dt = 1 to 100;
sp = ifc( rand("uniform") > 0.75, "Y", "N" );
output;
end;
run;
proc sql;
create table test2 as
select *,
monotonic() as _n
from test
;
create table test3 ( drop= _n ) as
  select a.*
  from test2 as a
  /* left joins keep every row of a exactly once; b is the previous row, c is the next row */
  left join test2 as b
    on a._n = b._n + 1
  left join test2 as c
    on a._n = c._n - 1
  where a.sp = "Y"
     or b.sp = "Y"
     or c.sp = "Y"
;
quit;