I have a dataset with IDs, and each ID has multiple dates (actually datetime). I want to use PROC SQL to get the minimum datetime and also add 1 year to the minimum. I'm trying to do this all in one PROC SQL but have been fumbling and can't get this to work. Below are two attempts. Would appreciate any advice.
*** GENERATE RANDOM DATES AFTER JAN 1, 2012 AND CREATE DATE/TIME VARIABLE ***;
data have ;
format date mmddyy10. dt datetime15.;
do person_id=100, 200, 300, 400, 500;
do i = 1 to 100;
jdate = int(1000 * ranuni(123987));
date = mdy(1,1,2012) + jdate;
dt = dhms(date, 0,0,0);
output;
end;
end;
run;
*** TRY1: THIS DOES NOT WORK - GETS MIN DATE/TIME AND REMERGES WITH EVERY RECORD***;
proc sql;
create table try1 as
select min(dt) as index_dt format=datetime15. ,
(dt + 365*24*60*60) as followup_date format=datetime15.
from have
;
quit;
*** TRY2: USE MIN() IN "HAVING" STATEMENT ***;
*** PROBLEMATIC IF PERSON_ID HAS MIN(DT) OCCUR MULTIPLE TIMES ***;
proc sql;
create table try2 as
select person_id,
dt as index_dt format=datetime15.,
(dt + 365*24*60*60) as followup_date format=datetime15.
from have
group by person_id
having dt=min(dt)
;
quit;
Try this:
proc sql;
create table try1 as
select
min(dt) as index_dt format=datetime15. ,
calculated index_dt + 365*24*60*60 as followup_date format=datetime15.
from have
;
quit;
The trick here is using the "calculated" keyword.
Also you may want to do the following to add a year on instead of your multiplications:
proc sql;
create table try1 as
select
min(dt) as index_dt format=datetime15. ,
input(compress(
put(intnx('YEAR', datepart(calculated index_dt),1,'SAMEDAY'),date9.)||":"||
put(timepart(calculated index_dt),time5.)),datetime15.) as followup_date format=datetime15.
from have
;
quit;
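If the goal is one row per ID, a minimal sketch along the same lines adds person_id and a GROUP BY, and lets INTNX with the 'dtyear' interval do the calendar arithmetic. The names have_demo and want_demo are made up for illustration, and the 'same' alignment is assumed to carry the time of day across unchanged, so check that on your own data:

```sas
/* Small illustrative input; person_id and dt mirror the names in the question */
data have_demo;
  format dt datetime20.;
  do person_id = 100, 200;
    do dt = '15MAR2013:00:00:00'dt, '10JAN2012:08:30:00'dt;
      output;
    end;
  end;
run;

proc sql;
  create table want_demo as
  select person_id,
         /* earliest datetime within each person_id */
         min(dt) as index_dt format=datetime20.,
         /* shift the minimum forward by one calendar year */
         intnx('dtyear', calculated index_dt, 1, 'same')
           as followup_date format=datetime20.
  from have_demo
  group by person_id;
quit;
```

Because every selected column is either the GROUP BY key or built from a summary function, nothing is remerged and ties on the minimum do not produce extra rows.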
Try using "select distinct person_id" instead of "select person_id"; that should help with your issue with duplicates. Also note that 365*24*60*60 is a fixed offset of 365 days in seconds, so the follow-up date lands one calendar day early whenever the interval spans a Feb. 29. That may be a contributing factor as well.
I don't think you can do this with PROC SQL alone. I would do it this way:
*** GENERATE RANDOM DATES AFTER JAN 1, 2012 AND CREATE DATE/TIME VARIABLE ***;
data have ;
format date mmddyy10. dt datetime15.;
do person_id=100, 200, 300, 400, 500;
do i = 1 to 100;
jdate = int(1000 * ranuni(123987));
date = mdy(1,1,2012) + jdate;
dt = dhms(date, 0,0,0);
output;
end;
end;
run;
%macro do_elaboration(ds=);
/*count how many rows has my table */
%let dataset=&ds.;
%let DSID = %sysfunc(open(&dataset., IS));
%let nobs = %sysfunc(attrn(&DSID., NLOBS));
%let rc=%sysfunc(close(&DSID.));
/*loop over the number of rows*/
%do i=1 %to &nobs.;
/*at each loop get one id*/
data _NULL_;
set &ds. (FIRSTOBS=&i OBS=&i);
call symputx("id", person_id);
run;
/*with proc sql get the min_dt*/
proc sql noprint;
select min(dt) into:min_dt
from &ds.
where person_id=&id.
;
quit;
/*increment the min_dt with the function sas intnx*/
data have_final_tmp;
person_id = &id.;
followup_date = intnx('dtyear',&min_dt,1,'same');
format followup_date datetime15.;
run;
/*put all id with the followup_date in only one dataset*/
proc append base=have_final data=have_final_tmp force;
run;
%end;
%mend do_elaboration;
/*call the macro*/
%do_elaboration(ds=have);
I wrote this code very quickly and haven't tested it, so you should check it, but the concept should be clear.
I was trying to calculate past average stock returns. I find using the following "data step" code is much better than using "proc sql" code.
The data step code:
%macro same(start = ,end = );
proc sql;drop view temp;quit;
proc sql;
create table temp
as select distinct a.*, mean(b.ret_dm) as same_&start._&end, count(b.ret_dm) as sc_&start._&end
from msf1 as a left join msf1 as b
on a.stkcd = b.stkcd and &start <= a.ym - b.ym <= &end and a.month = b.month
group by a.stkcd,a.ym;
quit;
proc sql;
create table same
as select a.*, b.same_&start._&end, b.sc_&start._&end
from same as a left join temp as b
on a.stkcd = b.stkcd and a.ym = b.ym;
quit;
proc sql; drop table temp;quit;
%mend;
data same; set msf;run;
%same(start = 1, end = 12);
The proc sql code:
%macro MA_1;
%do p = 2 %to 9; *;
%put p &p;
proc printto log = junk ; run;
proc sql;
create table price&p
as select distinct a.*, b.count,b.ym
from price&p as a left join tradingdate as b
on a.date = b.date;
quit;
proc sort data = price&p; by stkcd ym date;quit;
data msf;
set price&p;
by stkcd ym date;
if last.ym;
run;
proc printto; run;
%do j = 1 %to %sysfunc(countw(&laglist));
%let lag = %scan(&laglist,&j);
%put lag &lag;
/*********************************************/
proc sql; drop table ma_&lag._&p ;quit;
%do i = 1 %to 2018; *;
proc printto log = junk ; run;
data getname;
set stock;
if _n_ = &i;
call symput('stkcd',stkcd);
run;
proc printto; run;
%put &i &stkcd;
proc printto log = junk ; run;
proc sql;
create table temp
as select distinct a.*, mean(b.prc) as ma_&lag._&p
from msf (where = (stkcd = "&stkcd" )) as a left join price&p (where = (stkcd = "&stkcd" )) as b
on a.stkcd = b.stkcd and 0 <= a.count - b.count <= &lag
group by a.stkcd, a.date
order by a.stkcd, a.date;
quit;
proc append base = ma_&lag._&p data = temp force; quit;
proc printto; run;
%end;
dm "log; clear;";
proc sql;
create table ma_allprc
as select a.*, b.ma_&lag._&p
from ma_allprc as a left join ma_&lag._&p as b
on a.stkcd = b.stkcd and a.date = b.date;
quit;
proc sql; drop table ma_&lag._&p;quit;
%end;
%end;
%mend;
%let laglist = 5 10 20 50 100 200 500 1000 2000; * ;
data ma_allprc; set msf;run;
%ma_1;
"Proc sql" is much slower than I expected. "Data step" takes about 3 hours, but "Proc sql" takes about 2 days.
I even have to loop over each stock when using proc sql, because otherwise it uses too much memory. I have to say that using proc sql to calculate past averages is dumb, but currently I have no better ideas. :(
Does anybody have a solution for this?
I have a dataset in SAS in which the months would be dynamically updated each month. I need to calculate the sum vertically each month and paste the sum below, as shown in the image.
Proc means/ proc summary and proc print are not doing the trick for me.
I was given the following code before:
%let month = month name;
%put &month.;
data new_totals;
set Final_&month. end=end;
&month._sum + &month._final;
/*feb_sum + &month._final;*/
output;
if end then do;
measure = 'Total';
&month._final = &month._sum;
/*Feb_final = feb_sum;*/
output;
end;
drop &month._sum;
run;
The problem is that this has all the months hardcoded, which I don't want. I am not too familiar with loops or arrays, so I need a solution for this, please.
It may be better to use a reporting procedure such as PRINT or REPORT to produce the desired output.
data have;
length group $20;
do group = 'A', 'B', 'C';
array month_totals jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 oct2019 nov2019 dec2019;
do over month_totals;
month_totals = 10 + floor(rand('uniform', 60));
end;
output;
end;
run;
ods excel file='data_with_total_row.xlsx';
proc print noobs data=have;
var group ;
sum jan2020--dec2019;
run;
proc report data=have;
columns group jan2020--dec2019;
define group / width=20;
rbreak after / summarize;
compute after;
group = 'Total';
endcomp;
run;
ods excel close;
Data structure
The data sets you are working with are 'difficult' because the date aspect of the data is actually in the metadata, i.e. the column name. An even better approach, in SAS, is to have categorical data with the columns
group (categorical role)
month (categorical role)
total (continuous role)
Such data can be easily filtered with a where clause, and reporting procedures such as REPORT and TABULATE can use the month variable in a class statement.
Example:
data have;
length group $20;
do group = 'A', 'B', 'C';
do _n_ = 0 by 1 until (month >= '01feb2020'd);
month = intnx('month', '01jan2018'd, _n_);
total = 10 + floor(rand('uniform', 60));
output;
end;
end;
format month monyy5.;
run;
proc tabulate data=have;
class group month;
var total;
table
group all='Total'
,
month='' * total='' * sum=''*f=comma9.
;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
proc report data=have;
column group (month,total);
define group / group;
define month / '' across order=data ;
define total / '' ;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
Here is a basic way. Borrowed sample data from Richard.
data have;
length group $20;
do group = 'A', 'B';
array months jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 oct2019 nov2019 dec2019;
do over months;
months = 10 + floor(rand('uniform', 60, 1));
end;
output;
end;
run;
proc summary data=have;
var _numeric_;
output out=temp(drop=_:) sum=;
run;
data want;
set have temp (in=t);
if t then group='Total';
run;
I'm using PROC SQL within SAS and trying to get a count where the current month is equal to the month on a date field I'm reading. The format of the input date is mmddyy10.
This is a sample of what I'm trying:
data test;
input job $ lastrun;
DateNew = datejul(lastrun);
Format datenew mmddyy10.;
datalines;
joba 19300
jobb 19200
jobc 19303
jobx 19288
run;
proc print; run;
proc sql;
select
count(job) AS cnt_LastMonth
from test
where datepart(datenew) = intnx('month', today(), -1, 'same');
quit;
In this example I'm expecting the cnt_LastMonth to return 3, however it returns 0.
You can't apply datepart to a date variable, only to a datetime. And if you want to compare dates that belong to the same month, don't ignore the year value.
proc sql;
create table qert as
select
count(job) AS cnt_LastMonth
from test
where intnx('month', DateNew, 0, 'b') = intnx('month', today(), -1, 'b');
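/* Aligns both dates to the beginning of their month before comparing.
Alternatively, you could try:
where month(DateNew) = month(today())-1 and year(DateNew)=year(today());
but that comparison fails in January, when the previous month falls in the prior year.
*/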
quit;
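To see why the original WHERE clause matched nothing, here is a small sketch of what datepart does to a value that is already a date. The variable names are made up for illustration:

```sas
data _null_;
  d  = '30OCT2019'd;       /* date value: days since 01JAN1960 */
  dt = dhms(d, 0, 0, 0);   /* datetime value: seconds since 01JAN1960 */
  wrong = datepart(d);     /* the day count is read as seconds, so it collapses to day 0 */
  right = datepart(dt);    /* recovers the original date */
  put wrong= date9. right= date9.;
run;
```

Applied to a date, datepart lands on 01JAN1960, which can never equal a date in the previous month.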
proc sql;
select count(job) AS cnt_LastMonth
from test
where month(DateNew)= 10;
quit;
OR
proc sql;
SELECT count(A2.job) AS cnt_LastMonth
FROM (SELECT *,
MONTH(Date_Minus_1) as Month_filter,
MONTH(DateNew) as Month
FROM(SELECT *,
intnx('Month',today(),-1,'s') as Date_Minus_1 format=mmddyy10.
FROM test) A1)A2
Where A2.Month =A2.Month_filter;
quit;
What I've got:
a table of 20 rows in SAS (originally 100k)
various binary attributes (columns)
What I'm looking to get:
A crosstable displaying the frequency of the attribute combinations
like this:
Attribute1 Attribute2 Attribute3 Attribute4
Attribute1 5 0 1 2
Attribute2 0 3 0 3
Attribute3 2 0 5 4
Attribute4 1 2 0 10
*The actual sum of combinations is made up and probably not 100% logical
The code I currently have:
/*create dummy data*/
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
output;
end;
run;
I guess this can be done in a smarter way, but this seems to work. First I created a table to hold all the frequencies:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;
Then I loop through all the combinations, inserting the count into the crosstable:
%macro lup();
%do i=1 %to 4;
%do j=&i %to 4;
proc sql noprint;
select count(*) into :Antall&i&j
from monthly_sales (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
Note that the frequency count for (i,j) equals the count for (j,i), so you do not need to compute both.
I'd recommend using the built-in SAS tools for this sort of thing, and probably displaying your data slightly differently as well, unless you really want a diagonal table. e.g.
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
count = 1;
output;
end;
run;
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;
proc summary nway data = monthly_sales;
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
Either of these gives you a table with 1 row for each combination of attributes in your data, which is slightly different from what you requested, but conveys the same information. You can force proc summary to include rows for combinations of class variables that don't exist in your data by using the completetypes option in the proc summary statement.
It's definitely worth taking the time to get familiar with proc summary if you're doing statistical analysis in SAS - you can include additional output statistics and process multiple variables with minimal additional code and processing overhead.
Update: it's possible to produce the desired table without resorting to macro logic, albeit a rather complex process:
proc summary data = monthly_sales completetypes;
ways 1 2; /*Calculate only 1 and 2-way summaries*/
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
/*Eliminate unnecessary output rows*/
data summary_table;
set summary_table;
array a{*} attribute:;
sum = sum(of a[*]);
missing = 0;
do i = 1 to dim(a);
missing + missing(a[i]);
a[i] = a[i] * count;
end;
/*We want rows where two attributes are both 1 (sum = 2),
or one attribute is 1 and the others are all missing*/
if sum = 2 or (sum = 1 and missing = dim(a) - 1);
drop i missing sum;
edge = _n_;
run;
/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
by edge;
var attribute:;
run;
/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
create table diagonal as
select a._name_ as aname,
b._name_ as bname,
a.col1 as count
from tr_table a, tr_table b
where a.edge = b.edge
group by a.edge
having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
order by aname, bname
;
quit;
/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
by aname;
id bname;
var count;
run;
/*Re-order variables and set missing values to zero*/
data want;
informat aname attribute1-attribute4;
set want;
array a{*} attribute:;
do i = 1 to dim(a);
a[i] = sum(a[i],0);
end;
drop i;
run;
Yeah, user667489 was right, I just added some extra code to get the cross-frequency table looking good. First, I created a table with 10 million rows and 10 variables:
data monthly_sales (drop=i);
do i=1 to 10000000;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
Attribute5=rand("Normal")>0.5;
Attribute6=rand("Normal")>0.5;
Attribute7=rand("Normal")>0.5;
Attribute8=rand("Normal")>0.5;
Attribute9=rand("Normal")>0.5;
Attribute10=rand("Normal")>0.5;
output;
end;
run;
Create an empty 10x10 crosstable:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
output;output;output;output;output;output;output;output;output;output;
run;
Create a frequency table using proc freq:
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
/ out = frequency_table;
run;
Loop through all the combinations of Attributes and sum the "count" variable. Insert it into the crosstable:
%macro lup();
%do i=1 %to 10;
%do j=&i %to 10;
proc sql noprint;
select sum(count) into :Antall&i&j
from frequency_table (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
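For 0/1 attributes, the count of rows where both Attribute i and Attribute j equal 1 is the uncorrected sum of cross-products, so a possible shortcut is to ask PROC CORR for the SSCP matrix instead of looping. This is only a sketch: the data set names demo, sscp_out and crosstable_demo are made up, and it assumes the OUTP= data set carries _TYPE_='SSCP' rows plus an Intercept row and column, which is worth confirming on your SAS release:

```sas
/* Tiny illustrative input with three binary attributes */
data demo;
  input Attribute1-Attribute3;
  datalines;
1 1 0
1 0 0
0 1 1
1 1 1
;
run;

/* SSCP requests uncorrected sums of squares and cross-products */
proc corr data=demo sscp noprint outp=sscp_out;
  var Attribute1-Attribute3;
run;

/* Keep only the attribute-by-attribute block of the SSCP matrix */
data crosstable_demo;
  set sscp_out;
  where _type_ = 'SSCP' and upcase(_name_) ne 'INTERCEPT';
  drop _type_ intercept;
run;
```

The diagonal holds the number of rows where each attribute is 1, which matches what the macro stores for i = j. Rows with missing attribute values would need separate handling.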
I have to calculate the correlation and covariance of my daily sales values over an event window. The event window is a 45-day period, and my data looks like this:
store_id date sales
5927 12-Jan-07 3,714.00
5927 12-Jan-07 3,259.00
5927 14-Jan-07 3,787.00
5927 14-Jan-07 3,480.00
5927 17-Jan-07 3,646.00
5927 17-Jan-07 3,316.00
4978 18-Jan-07 3,530.00
4978 18-Jan-07 3,103.00
4978 18-Jan-07 3,026.00
4978 21-Jan-07 3,448.00
Now, for every store_id, date combination, I need to go back 45 days (there is more data for each combination in my original data set) calculate the correlation between sales and lag(sales) i.e. autocorrelation of degree one. As you can see, the date column is not continuous. So something like (date - 45) is not going to work.
I have gotten as far as this:
data ds1;
set ds;
by store_id;
LAG_SALE = lag(sales);
IF FIRST.store_id THEN DO;
LAG_SALE = .;
END;
run;
For calculating correlation and covariances -
proc corr data=ds1 outp=Corr cov; /** cov option: include covariances **/
by store_id date;
var sales lag_sale;
run;
But how do I insert the event window for each date, store_id combination? My final output should look something like this -
id date corr cov
5927 12-Jan-07 ... ...
5927 14-Jan-07 ... ...
Here is what I've come up with:
First I convert the date to a SAS date, which is the number of days since Jan. 1 1960:
data ds;
set ds (rename=(date=old_date));
date = input(old_date, date11.);
drop old_date;
run;
Then compute lag_sale (I am using the same calculation you used in the question, but make sure this is what you want to do. For some observations the lag sale is the previous recorded date, but for some it is the same store_id and date, just a different observation.):
proc sort data=ds; by store_id; run;
data ds;
set ds;
by store_id;
lag_sale = lag(sales);
if first.store_id then lag_sale = .;
run;
Then set up the final data set:
data final;
length store_id 8 date 8 cov 8 corr 8;
if _n_ = 0;
run;
Then create a macro which takes a store_id and date and runs proc corr. The first part of the macro selects only the data with that store_id and within the past 45 days of the date. Then it runs proc corr. Then it formats proc corr how you want it and appends the results to the "final" data set.
%macro corr(store_id, date);
data ds2;
set ds;
where store_id = &store_id and %eval(&date-45) <= date <=&date
and lag_sale ne .;
run;
proc corr noprint data=ds2 cov outp=corr;
by store_id;
var sales lag_sale;
run;
data corr2;
set corr;
where _type_ in ('CORR', 'COV') and _name_ = 'sales';
retain cov;
date = &date;
if _type_ = 'COV' then cov = lag_sale;
else do;
corr = lag_sale;
output;
end;
keep store_id date corr cov;
run;
proc append base=final data=corr2 force; run;
%mend corr;
Finally run the macro for each store_id/date combination.
proc sort data=ds out=ds3 nodupkey;
by store_id date;
run;
data _null_;
set ds3;
call execute('%corr('||store_id||','||date||');');
run;
proc sort data=final;
by store_id date;
run;