I can't find a way to summarize the same variable using different weights.
Let me explain with an example (three records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously this gives a warning because I used the same variable "a" more than once (and I can't rename it!).
I have to store a huge amount of data in limited physical space, and I would otherwise have to construct about 120 fields (a0-a5, b0-b5, etc.) that are just the same variables multiplied by fixed weights (wgt0-wgt5).
I want to store a dataset with 20 columns (a, b, c, ...) and 6 weights (wgt0-wgt5) and, on demand, run a "summary" without an intermediate data step that obliges me to create 120 fields.
Because of the volume of data (roughly 55 GB every month), I'd also like to avoid a PROC SQL statement like this:
proc sql;
create table pluto
as select sum(db.a * wgt0) as a0, sum(db.a * wgt1) as a1, etc.
quit;
Is there a "super PROC SUMMARY" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is simply running PROC SUMMARY once per weight, and either using ODS OUTPUT with PERSIST=PROC or writing the separate output datasets and then setting them together.
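If you go the repeated-PROC route, here's a minimal sketch (assuming the pippo dataset from the question): one PROC MEANS per weight, with ODS OUTPUT collecting each run's summary table. MATCH_ALL creates one dataset per run (sums, sums1, ...) and stores their names in &dslist.
ods output summary(match_all=dslist persist=proc)=sums;
proc means data=pippo stackods sum;
var a;
weight wgt1;
run;
proc means data=pippo stackods sum;
var a;
weight wgt2;
run;
ods output close;
data all_sums;
set &dslist.; *stack the per-weight summaries;
run;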
A third option, though, is to roll your own summarization. The advantage is that it reads the data only once, so it's faster; the disadvantage is that it takes a bit of work and is more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');
h_summary.defineKey('i','varname'); *also add any CLASS variable to the key;
h_summary.defineData('i','varname'); *also add any CLASS variable to the data portion;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=; *debugging output - remove for production use;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
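On generating the array reference from metadata, as mentioned above: a hedged sketch using dictionary.columns. The WHERE clause (WORK.TEST, numeric names starting with 'n') is an assumption you'd adapt to your real naming pattern.
proc sql noprint;
select name into :varlist separated by ' '
from dictionary.columns
where libname='WORK' and memname='TEST'
and type='num' and lowcase(name) like 'n%';
quit;
/* then, inside the summarizing data step:
   array vars[*] &varlist.;
*/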
One other thing to mention, though: if you're doing this for something like jackknife variance estimation, or anything else that uses replicate weights, consider PROC SURVEYMEANS, which can handle replicate weights for you.
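For instance, a hedged sketch, assuming wgt1-wgt5 are replicate weights and wgt0 is the full-sample weight (the roles are assumptions, the variable names come from the question, and have is a placeholder dataset name):
proc surveymeans data=have varmethod=jackknife mean sum;
var a b c;
weight wgt0; /* full-sample weight - assumed role */
repweights wgt1-wgt5; /* replicate weights - assumed role */
run;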
You can SCORE your data set using a customized SCORE data set that you can generate with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
I've got 50 columns of data, with 4 different measurements in each, as well as designation tags (groups C, D, and E). I've averaged the 4 measurements, so every data point now has an average. Now I'm supposed to take the average of all the data points' averages within each specific group.
So I want all the data in group C to be averaged, and so on for D and E... and I don't know how to do that.
avg1=(MEAS1+MEAS2+MEAS3+MEAS4)/4;
avg_score=round(avg1, .1);
run;
proc print;
run;
This is what I have so far.
There are several procedures, and SQL that can average values over a group.
I'll guess you meant to say 50 rows of data.
Example, using PROC MEANS:
data have;
call streaminit(314159);
do _n_ = 1 to 50;
group = substr('CDE', rand('integer',3),1);
array v meas1-meas4;
do _i_ = 1 to dim(v);
num + 2;
v(_i_) = num;
end;
output;
end;
drop num;
run;
data rowwise_means;
set have;
avg_meas = mean (of meas:);
run;
* group wise means of row means;
proc means noprint data=rowwise_means nway;
class group;
var avg_meas;
output out=want mean=meas_grandmean;
run;
The rowwise_means dataset contains the per-row means; want contains the group grand means (the mean of means).
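For comparison, here is the SQL route mentioned above, as a hedged sketch against the rowwise_means dataset built in the previous step:
proc sql;
create table want_sql as
select group, mean(avg_meas) as meas_grandmean
from rowwise_means
group by group;
quit;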
I have a process flow in SAS Enterprise Guide which is comprised mainly of Data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join this to the final table (shown in the screenshot of my process flow).
Is there any way to specify the output dataset in the UNIVARIATE procedure as a data view, so that the procedure doesn't force everything before it in the flow to be calculated? When the percentiles are left joined to the final table, the flow is calculated again, so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this, e.g.:
data example;
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set sashelp.class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set sashelp.class;
output;
end;
keep name sex q40--q60;
run;
I have the following code that is being used generate running totals of features for the past 1 day, 7 days, 1 month, 3 months, and 6 months.
LIBNAME A "C:\Users\James\Desktop\data\Base Data";
LIBNAME DATA "C:\Users\James\Desktop\data\Data1";
%MACRO HELPER(P);
data a1;
set data.final_master_&P. ;
QUERY = '%TEST('||STRIP(DATETIME)||','||STRIP(PARTICIPANT)||');';
CALL EXECUTE(QUERY);
run;
%MEND;
%MACRO TEST(TIME,PAR);
proc sql;
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_24, :APP_2_24, :APP_3_24, :APP_4_24, :APP_5_24
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24) AND &TIME.;
/* 7 Days */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_7DAY, :APP_2_7DAY, :APP_3_7DAY, :APP_4_7DAY, :APP_5_7DAY
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7) AND &TIME.;
/* One Month */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_1MONTH, :APP_2_1MONTH, :APP_3_1MONTH, :APP_4_1MONTH, :APP_5_1MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4) AND &TIME.;
/* Three Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_3MONTH, :APP_2_3MONTH, :APP_3_3MONTH, :APP_4_3MONTH, :APP_5_3MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*3) AND &TIME.;
/* Six Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_6MONTH, :APP_2_6MONTH, :APP_3_6MONTH, :APP_4_6MONTH, :APP_5_6MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*6) AND &TIME.;
quit;
DATA T;
PARTICIPANT = &PAR.;
DATETIME = &TIME;
APP_1_24 = &APP_1_24.;
APP_2_24 = &APP_2_24.;
APP_3_24 = &APP_3_24.;
APP_4_24 = &APP_4_24.;
APP_5_24 = &APP_5_24.;
APP_1_7DAY = &APP_1_7DAY.;
APP_2_7DAY = &APP_2_7DAY.;
APP_3_7DAY = &APP_3_7DAY.;
APP_4_7DAY = &APP_4_7DAY.;
APP_5_7DAY = &APP_5_7DAY.;
APP_1_1MONTH = &APP_1_1MONTH.;
APP_2_1MONTH = &APP_2_1MONTH.;
APP_3_1MONTH = &APP_3_1MONTH.;
APP_4_1MONTH = &APP_4_1MONTH.;
APP_5_1MONTH = &APP_5_1MONTH.;
APP_1_3MONTH = &APP_1_3MONTH.;
APP_2_3MONTH = &APP_2_3MONTH.;
APP_3_3MONTH = &APP_3_3MONTH.;
APP_4_3MONTH = &APP_4_3MONTH.;
APP_5_3MONTH = &APP_5_3MONTH.;
APP_1_6MONTH = &APP_1_6MONTH.;
APP_2_6MONTH = &APP_2_6MONTH.;
APP_3_6MONTH = &APP_3_6MONTH.;
APP_4_6MONTH = &APP_4_6MONTH.;
APP_5_6MONTH = &APP_5_6MONTH.;
FORMAT DATETIME DATETIME.;
RUN;
PROC APPEND BASE=DATA.FLAGS_&par. DATA=T;
RUN;
%MEND;
%helper(1);
This code runs perfectly if I limit the number of observations in the %helper macro, using (obs=) when creating the a1 dataset. However, when I put no limit on the number of observations, i.e. execute the %test macro for every row in dataset a1, I get errors. In SAS EG, I get a "server disconnected" popup after the status bar hangs at "running data step", and in Base SAS 9.4 I get errors saying that none of the macro variables created in the PROC SQL INTO clauses have been resolved.
I'm confused, as the code works fine for a limited number of observations but hangs or errors on the whole dataset. The dataset I'm doing this for has around 130,000 observations.
The answer to your actual question is that you're simply generating too much macro code, and perhaps simply taking too much time. The way you are doing this operates in O(n²) time: you're essentially doing a cartesian join of every record to every record, and then some. 130,000 * 130,000 is a pretty decent-sized number, and on top of that you're actually opening the SQL environment several times for each of the 130,000 rows. Ouch.
The solution is to do it either in a way that isn't slow, or at least in a way that doesn't have too much overhead.
The fast solution is to not do the cartesian join, or to limit how much needs to be joined. One good approach is to restructure the problem: rather than comparing every record to every record, treat each calendar day as a period, at least for the over-24h windows (the 24h window you might still do as above, but not the other four). For 1 month, 3 months, etc., do you really need time-of-day resolution? It probably won't make much difference. If you can drop it, you can use built-in PROCs to precompute all possible 1-month periods, all possible 3-month periods, etc., and then join each row to the appropriate one. That won't work with 130,000 distinct timestamps, though; it only works if you can limit it to roughly one per day.
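A hedged sketch of that first collapsing step, with column names taken from your code and a DATE variable assumed to have been derived beforehand (e.g. date = datepart(datetime)):
proc summary data=a1 nway;
class participant date; /* date assumed = datepart(datetime) */
var app_1-app_5;
output out=daily(drop=_type_ _freq_) sum=;
run;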
If you must do it at the second level (or worse), you'll want to avoid the cartesian join and instead keep track of the records you've already seen, along with rolling sums. The short explanation of the algorithm is, for each row:
- Add this row's values to the rolling sums (at the end of the queue).
- Check whether the item at the front of the queue is outside the period; if it is, subtract it from the rolling sums and check the next item (repeat until the front item is inside the period), updating the current queue position.
- Output the sums at this point.
This requires looking at each row typically twice (except at odd boundaries, where no rows are popped off for several iterations because months have different numbers of days). It runs in O(n) time, much faster than the cartesian join, and on top of that needs far less memory/space (the cartesian join might have to hit disk).
The hash version of this solution is below. This will, I think, be the fastest solution that compares every row. Note that I intentionally make the test data have 1 for every value and the same number of rows for every day; that lets you verify how it works row-wise very easily. (For example, every 24h period has 481 rows, because I made exactly 480 rows per day, and 481 includes the same time yesterday; if you change lt to le it will be 480, if you prefer not to include the same time yesterday.) You can see that the 'month'-based periods will have slightly odd results at the boundaries where months change, because the '01FEB20xx' to '01MAY20xx' period has far fewer days (and thus rows) than the '01JUL20xx' to '01OCT20xx' period, for example; 30/90/180-day periods would be better.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add(array=);
array app_&array. app_&array._1-app_&array._5;
%add(array=app_&array.);
%mend process_array_add;
%macro process_array_subtract(array=, period=, number=);
if _n_ eq 1 then do;
declare hiter hi_&array.('td');
rc_&array. = hi_&array..first();
end;
else do;
rc_&array. = hi_&array..setcur(key:firstval_&array.);
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and rc_&array.=0);
%subtract(array=app_&array.);
rc_&array. = hi_&array..next();
end;
retain firstval_&array.;
firstval_&array. = dt_var;
%mend process_array_subtract;
data want;
set test_data;
* if _n_ > 10000 then stop;
curr_dt_var = dt_var;
array app[5] app_1-app_5;
if _n_ eq 1 then do;
declare hash td(ordered:'a');
td.defineKey('dt_var');
td.defineData('dt_var','app_1','app_2','app_3','app_4','app_5');
td.defineDone();
end;
rc_a = td.add();
*start macro territory;
%process_array_add(array=24h);
%process_array_add(array=1wk);
%process_array_add(array=1mo);
%process_array_add(array=3mo);
%process_array_add(array=6mo);
%process_array_subtract(array=24h,period=DTDay, number=1);
%process_array_subtract(array=1wk,period=DTDay, number=7);
%process_array_subtract(array=1mo,period=DTMonth, number=1);
%process_array_subtract(array=3mo,period=DTMonth, number=3);
%process_array_subtract(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var rc: _:;
output;
run;
Here's a pure data step, non-hash version. On my machine it's actually faster than the hash solution; I suspect it wouldn't be on a machine with an HDD (I have an SSD, so point access is not substantially slower than hash access, and I avoid having to load the hash). I'd recommend it if you don't know hashes well or at all, as it'll be easier to troubleshoot, and it scales similarly. For most rows it accesses 11 rows: the current row, plus five other rows twice each (one row to subtract, then the next row to check), for a total of around a million and a half reads for 130k rows. (Compare that to about 17 billion reads for the cartesian...)
I suffix the macros with "_2" to differentiate them from the macros in the hash solution.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add_2(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add_2;
%macro subtract_2(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract_2;
%macro process_array_add_2(array=);
array app_&array. app_&array._1-app_&array._5; *define array;
%add_2(array=app_&array.); *add current row to array;
%mend process_array_add_2;
%macro process_array_sub_2(array=, period=, number=);
if _n_ eq 1 then do; *initialize point variable;
point_&array. = 1;
end;
else do; *do not have to do this _n_=1 as we only have that row;
set test_data point=point_&array.; *set the row that we may be subtracting;
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and point_&array. < _N_); *until we hit a row that is within the period...;
%subtract_2(array=app_&array.); *subtract the rows values;
point_&array. + 1; *increment the point to look at;
set test_data point=point_&array.; *set the new row;
end;
%mend process_array_sub_2;
data want;
set test_data;
*if _n_ > 10000 then stop; *useful for testing if you want to check time to execute;
curr_dt_var = dt_var; *save dt_var value from originally set record;
array app[5] app_1-app_5; *base array;
*start macro territory;
%process_array_add_2(array=24h); *have to do all of these adds before we start subtracting;
%process_array_add_2(array=1wk); *otherwise we have the wrong record values;
%process_array_add_2(array=1mo);
%process_array_add_2(array=3mo);
%process_array_add_2(array=6mo);
%process_array_sub_2(array=24h,period=DTDay, number=1); *now start checking to subtract what we need to;
%process_array_sub_2(array=1wk,period=DTDay, number=7);
%process_array_sub_2(array=1mo,period=DTMonth, number=1);
%process_array_sub_2(array=3mo,period=DTMonth, number=3);
%process_array_sub_2(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var _:;
output; *unneeded in this version but left for comparison to hash;
run;
I'm calculating some interval statistics (for example, the standard deviation over one-minute intervals) for financial time series data. My code produces results for every interval that contains data, but for intervals with no observations in the series I'd like to insert an empty row, just to keep the timestamps consistent.
For example, if there's data between 10:00 to 10:01, 10:02 to 10:03, but not 10:01 to 10:02, my output would be:
10:01 stat1 stat2 stat3
10:03 stat1 stat2 stat3
It would be ideal if the result could be as follows (I want some values to be 0 and some to be missing, '.'):
10:01 stat1 stat2 stat3
10:02 0 0 .
10:03 stat1 stat2 stat3
What I did:
data v_temp/view = v_temp;
set &taq_ds;
where TIME_M between &start_time and &end_time;
INTV = hms(00, ceil(TIME_M/'00:01:00't),00); *create one minute interval;
format INTV tod.; *format hh:mm:ss;
run;
proc means data = sorted noprint;
by SYM_ROOT DATE INTV;
var PRICE;
weight SIZE;
output
out=oneMinStats(drop=_TYPE_ _FREQ_)
n=NTRADES mean=VWAP sumwgt=SUMSHS max=HI min=LO std=SIGMAPRC
idgroup(max(TIME_M) last out(price size ex time_m)=LASTTRD LASTSIZE LASTEX LASTTIME);
run;
For some non-active stocks there are many gaps like this. What would be an efficient way to generate those filler rows?
If you have SAS/ETS licensed, PROC EXPAND is a good choice for adding blank rows to a time series. Here's a very short example:
data mydata;
input timevar stat1 stat2 stat3;
format timevar TIME5.;
informat timevar HHMMSS5.;
datalines;
10:01 1 3 5
10:03 2 4 6
;;;;
run;
proc expand data=mydata out=mydata_exp from=minute to=minute observed=beginning method=none;
id timevar;
run;
The documentation has more details if you want to perform inter/extrapolation or anything like that. The important options are from=minute, observed=beginning, method=none (no extrapolation or interpolation), and id (which identifies the time variable).
If you don't have ETS, then a data step should suffice. You can either merge with a known dataset of all the times, or add your own rows; the size of your dataset somewhat determines which is easier. Here's the merge variation; the add-your-own-rows variation (sketched after the merge code below) creates the extra rows in a similar way.
*Select the maximum time available.;
proc sql noprint;
select max(timevar) into :endtime from mydata;
quit;
*Create the empty dataset with just times;
data mydata_tomerge;
set mydata(obs=1); *grab the earliest time;
do timevar = timevar to &endtime by 60; *by 60 = minutes!;
output;
end;
keep timevar;
run;
*Now merge the one with all the times to the one with all the data!;
data mydata_fin;
merge mydata_tomerge(in=a) mydata;
by timevar;
if a;
run;
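For completeness, a hedged sketch of the add-your-own-rows variation: it emits one blank filler row per missing minute while reading the data, using LAG to look at the previous row's time (variable names match the mydata example above).
data mydata_fin2;
set mydata;
_prev = lag(timevar);
*if there is a gap, emit one blank row per missing minute;
if _n_ > 1 and timevar - _prev > 60 then do;
_realt = timevar;
_s1 = stat1; _s2 = stat2; _s3 = stat3; *save the real values;
call missing(stat1, stat2, stat3);
do timevar = _prev + 60 to _realt - 60 by 60;
output;
end;
timevar = _realt;
stat1 = _s1; stat2 = _s2; stat3 = _s3; *restore the real values;
end;
output;
drop _:;
run;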