I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here
I am trying to collapse my multiple rows of binary variables into a single row per patient id as depicted in my illustration. Could someone please help me with the SAS code to do this? Thanks
If the rule is that to set it to 1 if it is ever 1 then take the MAX. If the rule is to set it to one only if all of them are one then take the MIN.
proc summary data=have nway ;
by id;
output out=want max= ;
run;
Update trick
data want;
update have(obs=0) have;
by id;
run;
Or
proc sql;
create table want as
select ID, max('2018'n) as Y2018, max('2019'n) as Y2019, max('2020'n) as Y2020
from have
group by ID
order by ID;
quit;
Untested because you provided data as images, please post as text, preferably as a data step.
Here is a data step-based solution. Certainly more complex than the above answers, but it does show ways you can use arrays, first. and last. processing, and the retain statement.
Use a retained temporary array to hold the values of 2018-2020 until the last observation of each id group. On the last value of each id, check if each held value is 1 and set each value of the year to a 1 or 0.
data want;
set have;
by id;
array year[3] '2018'n--'2020'n;
array hold[3] _TEMPORARY_;
retain hold;
if(first.id) then call missing(of hold[*]);
do i = 1 to dim(year);
if(year[i] = 1) then hold[i] = 1;
end;
if(last.id) then do;
do i = 1 to dim(year);
year[i] = (hold[i] = 1);
end;
output;
end;
drop i;
run;
I have the following code that is being used generate running totals of features for the past 1 day, 7 days, 1 month, 3 months, and 6 months.
LIBNAME A "C:\Users\James\Desktop\data\Base Data";
LIBNAME DATA "C:\Users\James\Desktop\data\Data1";
%MACRO HELPER(P);
data a1;
set data.final_master_&P. ;
QUERY = '%TEST('||STRIP(DATETIME)||','||STRIP(PARTICIPANT)||');';
CALL EXECUTE(QUERY);
run;
%MEND;
%MACRO TEST(TIME,PAR);
proc sql;
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_24, :APP_2_24, :APP_3_24, :APP_4_24, :APP_5_24
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24) AND &TIME.;
/* 7 Days */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_7DAY, :APP_2_7DAY, :APP_3_7DAY, :APP_4_7DAY, :APP_5_7DAY
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7) AND &TIME.;
/* One Month */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_1MONTH, :APP_2_1MONTH, :APP_3_1MONTH, :APP_4_1MONTH, :APP_5_1MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4) AND &TIME.;
/* Three Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_3MONTH, :APP_2_3MONTH, :APP_3_3MONTH, :APP_4_3MONTH, :APP_5_3MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*3) AND &TIME.;
/* Six Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_6MONTH, :APP_2_6MONTH, :APP_3_6MONTH, :APP_4_6MONTH, :APP_5_6MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*6) AND &TIME.;
quit;
DATA T;
PARTICIPANT = &PAR.;
DATETIME = &TIME;
APP_1_24 = &APP_1_24.;
APP_2_24 = &APP_2_24.;
APP_3_24 = &APP_3_24.;
APP_4_24 = &APP_4_24.;
APP_5_24 = &APP_5_24.;
APP_1_7DAY = &APP_1_7DAY.;
APP_2_7DAY = &APP_2_7DAY.;
APP_3_7DAY = &APP_3_7DAY.;
APP_4_7DAY = &APP_4_7DAY.;
APP_5_7DAY = &APP_5_7DAY.;
APP_1_1MONTH = &APP_1_1MONTH.;
APP_2_1MONTH = &APP_2_1MONTH.;
APP_3_1MONTH = &APP_3_1MONTH.;
APP_4_1MONTH = &APP_4_1MONTH.;
APP_5_1MONTH = &APP_5_1MONTH.;
APP_1_3MONTH = &APP_1_3MONTH.;
APP_2_3MONTH = &APP_2_3MONTH.;
APP_3_3MONTH = &APP_3_3MONTH.;
APP_4_3MONTH = &APP_4_3MONTH.;
APP_5_3MONTH = &APP_5_3MONTH.;
APP_1_6MONTH = &APP_1_6MONTH.;
APP_2_6MONTH = &APP_2_6MONTH.;
APP_3_6MONTH = &APP_3_6MONTH.;
APP_4_6MONTH = &APP_4_6MONTH.;
APP_5_6MONTH = &APP_5_6MONTH.;
FORMAT DATETIME DATETIME.;
RUN;
PROC APPEND BASE=DATA.FLAGS_&par. DATA=T;
RUN;
%MEND;
%helper(1);
This code runs perfectly if I limit the number of observations in the %helper macro, using an (obs=) in the creation of the a1 dataset. However, when I put no limit on the obs number, i.e. execute the %test macro for every row in the dataset a1, I get errors. In SAS EG, I get a "server disconnected" popup after the status bar hangs at "running data step", and on Base SAS 9.4 I get the error that none of the macro variables have been resolved that are created in the proc sql into.
I'm confused as the code works fine for a limited amount of observations, but when trying on the whole dataset it hangs or gives errors. The dataset I'm doing this for has around 130,000 observations.
The answer to your actual question is that you're simply generating too much macro code and perhaps even simply taking too much time. The way you are doing this is going to operate on an O=n^2 level, as you're basically doing a cartesian join of every record to every record, and then some. 130,000 * 130,000 is a pretty decent sized number, and on top of that you're actually opening the SQL environment several times for each 130,000 rows. Ouch.
The solution is to do it either in a way that isn't too slow, or if it is, in a way that won't have too much overhead at least.
The fast solution is to not do the cartesian join, or to limit how much needs to be joined. One good solution would be to restructure the problem, not require every record to be compared, but instead consider each calendar day, say, a period, especially in the over-24h periods (24h you might do the way you do above, but not the other four). 1 month, 3 month, etc., do you really need to figure out time of day? Probably won't make much difference. If you can get rid of that, then you can use built in PROCs to precompile all possible 1 month periods, all possible 3 month periods, etc., and then join on the appropriate one. But that won't work with 130,000 of them; it would only work if you could limit it to one per day, probably.
If you must do it at the second level (or worse), what you'll want to do is avoid the cartesian join, and instead keep track of the various records you've seen already, and the sums. The short explanation of the algorithm is:
For each row:
Add this row's values to the rolling sums (at the end of the queue)
Check if the current item of the queue is outside of the period; if it is, subtract it from the rolling sums, and check the next item (repeat until not outside of the period), updating the current queue position
Return the sum at this point
This requires checking each row typically twice (except at odd boundaries where you have no rows popped off for several iterations, due to months having different numbers of days). This operates on O=n time, much faster than the cartesian join, and on top of that has far less memory/space required (the cartesian join might need to hit disk space).
The hash version of this solution is below. This will be the fastest solution I think that compares every row. Note that I intentionally make the test data have 1 for every row and same number of rows for every day; that lets you see how it works on a row-wise manner very easily. (For example, every 24h period has 481 rows, because I made 480 rows per day exactly, and 481 includes the same time yesterday - if you change lt to le it will be 480, if you prefer not to include same time yesterday). You can see that the 'month' based periods will have slightly odd results at the boundaries where months change because the '01FEB20xx' to '01MAY20xx' period has far fewer days (and thus rows) than the '01JUL20xx' to '01OCT20xx' period, for example; better would be 30/90/180 day periods.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add(array=);
array app_&array. app_&array._1-app_&array._5;
%add(array=app_&array.);
%mend process_array_add;
%macro process_array_subtract(array=, period=, number=);
if _n_ eq 1 then do;
declare hiter hi_&array.('td');
rc_&array. = hi_&array..first();
end;
else do;
rc_&array. = hi_&array..setcur(key:firstval_&array.);
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and rc_&array.=0);
%subtract(array=app_&array.);
rc_&array. = hi_&array..next();
end;
retain firstval_&array.;
firstval_&array. = dt_var;
%mend process_array_subtract;
data want;
set test_data;
* if _n_ > 10000 then stop;
curr_dt_var = dt_var;
array app[5] app_1-app_5;
if _n_ eq 1 then do;
declare hash td(ordered:'a');
td.defineKey('dt_var');
td.defineData('dt_var','app_1','app_2','app_3','app_4','app_5');
td.defineDone();
end;
rc_a = td.add();
*start macro territory;
%process_array_add(array=24h);
%process_array_add(array=1wk);
%process_array_add(array=1mo);
%process_array_add(array=3mo);
%process_array_add(array=6mo);
%process_array_subtract(array=24h,period=DTDay, number=1);
%process_array_subtract(array=1wk,period=DTDay, number=7);
%process_array_subtract(array=1mo,period=DTMonth, number=1);
%process_array_subtract(array=3mo,period=DTMonth, number=3);
%process_array_subtract(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var rc: _:;
output;
run;
Here's a pure data step non-hash version. On my machine it's actually faster than the hash solution; I suspect it's not actually faster on a machine with a HDD (I have an SSD, so point access is not substantially slower than hash access, and I avoid having to load the hash). I would recommend using it if you don't know hashes very well or at all, as it'll be easier to troubleshoot, and it scales similarly. For most rows it accesses 11 rows, the current row and five other rows twice (one row, subtract it, then another row) for a total of around a million and a half total reads for 130k rows. (Compare that to about 17 billion reads for the cartesian...)
I suffix the macros with "_2" to differentiate them from the macros in the hash solution.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add_2(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract_2(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add_2(array=);
array app_&array. app_&array._1-app_&array._5; *define array;
%add_2(array=app_&array.); *add current row to array;
%mend process_array_add_2;
%macro process_array_sub_2(array=, period=, number=);
if _n_ eq 1 then do; *initialize point variable;
point_&array. = 1;
end;
else do; *do not have to do this _n_=1 as we only have that row;
set test_data point=point_&array.; *set the row that we may be subtracting;
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and point_&array. < _N_); *until we hit a row that is within the period...;
%subtract_2(array=app_&array.); *subtract the rows values;
point_&array. + 1; *increment the point to look at;
set test_data point=point_&array.; *set the new row;
end;
%mend process_array_sub_2;
data want;
set test_data;
*if _n_ > 10000 then stop; *useful for testing if you want to check time to execute;
curr_dt_var = dt_var; *save dt_var value from originally set record;
array app[5] app_1-app_5; *base array;
*start macro territory;
%process_array_add_2(array=24h); *have to do all of these adds before we start subtracting;
%process_array_add_2(array=1wk); *otherwise we have the wrong record values;
%process_array_add_2(array=1mo);
%process_array_add_2(array=3mo);
%process_array_add_2(array=6mo);
%process_array_sub_2(array=24h,period=DTDay, number=1); *now start checking to subtract what we need to;
%process_array_sub_2(array=1wk,period=DTDay, number=7);
%process_array_sub_2(array=1mo,period=DTMonth, number=1);
%process_array_sub_2(array=3mo,period=DTMonth, number=3);
%process_array_sub_2(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var _:;
output; *unneeded in this version but left for comparison to hash;
run;
I am very new but keen to learn SAS coding.I have 2 data sets a and b namely dt1 and dt2 which consist of columns a for dt1 and b and c for dt2:
a b c
2014 2008 2
2009 3
2014 4
2015 5
I am trying to get the nth row of the c column when the element which is at nth row of b column is equal to a(1)
Here it is c=4;
I wrote a code below.
DATA dt1;
set dt1;
data dt2;
set dt2;
i=1;
do while (b ne a);
i=i+1;
end;
call symput('ROW_NUMBER',i);
run;
proc print data = dt2(keep = c obs = &ROW_NUMBER firstobs = &ROW_NUMBER);
run;
but this code enters in an infinite loop and I could not find any solution for this. I appreciate if you help solve this issue.
Thanks
I think you should learn the basic syntax of the data step before trying to use macro variables. A lot of what you're doing makes little sense. Here is an explanation of how the data step works. You will do yourself a huge favor if you study that.
Here's how to do an inner join in proc sql, which seems to be more in line with your goal here. This simply selects the values of c where dt1.a is equal to dt2.b:
proc sql;
select c
from dt1 inner join dt2 on dt1.a = dt2.b;
quit;
If you were to use a data step, you'd do something like this the following.
data out(keep=c);
set dt1;
do until (a=b or eof);
set dt2 end=eof;
if a=b then output;
end;
run;
proc print data=out noobs;
run;
Use the end= option to create temporary variable eof which allows you to end the loop after the last row of dt2 is read.
This is a simple MERGE. You just need to rename the variables to match. This assumes they're both sorted by the value (a/b). You can then set the macro variable in that data step or do whatever you want.
data want;
merge dt1(in=_a rename=a=b) dt2(in=_b);
by b;
if _a and _b;
call symput("ROW_NUMBER",c);
run;
If you want to define macro variables:
data _null_;
set dt2;
if _n_=1 then set dt1;
if a=b then do;
call symput('c_val',c);
call symput('row_num',_n_);
end;
run;
%put &row_num &c_val;
I have a series of dataset with two variables: uid and timestamp. I want to create a new variable called "session_num" to parse the timestamps into session numbers (when two timestampes are 30 min + apart, it will be marked as a new session).
For example:
I try to use retain statement in sas to loop through the timestamp, but it didn't work.
Here's my code:
Data test;
SET test1;
By uid;
RETAIN session_num session_len;
IF first.uid THEN DO;
session=1;
session_len=0;
END;
session_len=session_len+timpestamp;
IF timpestamp-session_len>1800 THEN session_num=session_num+1;
ELSE session_num=session_num;
IF last.uid;
KEEP uid timestamp session_num;
RUN;
Really appreciate if you could point out my mistake, and suggest the right solution.
Thanks!
First, here is some sample input data (in the future, you should supply your own code to generate the sample input data so others don't have to spend time doing this for you),
data test;
input uid$ timestamp : DATETIME.;
format timestamp DATETIME.;
datalines;
a 05jul2014:03:55:00
a 05jul2014:03:57:00
a 07jul2014:20:08:00
a 10jul2014:19:02:00
a 10jul2014:19:05:00
a 11jul2014:14:39:00
;
Then you can create the session variable as you defined it with
data testsession;
set test;
retain last;
retain session 0;
by uid;
if first.uid then do;
session = session +1;
last = timestamp;
end;
if (timestamp-last)/60>15 then do;
session = session+1;
end;
last = timestamp;
drop last;
run;
MrFlick's method is probably the more normal way to do this, but another option involves the look-ahead self-merge. (Yes, look-ahead, even though this is supposed to look behind - look behind is more complicated in this manner.)
data want;
retain session 1; *have to initialize to 1 for the first record!;
merge have(in=_h) have(rename=(timestamp=next uid=nextuid) firstobs=2);
output; *output first, then we adjust session for the next iteration;
if (uid ne nextuid) then session=1;
else if timestamp + 1800 < next then session+1;
drop next:;
run;