I am trying to calculate some statistics for a given variable based on the client id and the time horizon. My current solution is shown below; however, I would like to know if there is a way to rewrite the code as a data step instead of an SQL join, because the join takes a very long time to execute on my real dataset.
data have1(drop=t);
id = 1;
dt = '31dec2020'd;
do t=1 to 10;
dt = dt + 1;
var = rand('uniform');
output;
end;
format dt ddmmyyp10.;
run;
data have2(drop=t);
id = 2;
dt = '31dec2020'd;
do t=1 to 10;
dt = dt + 1;
var = rand('uniform');
output;
end;
format dt ddmmyyp10.;
run;
data have_fin;
set have1 have2;
run;
Proc sql;
create table want1 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-3,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Proc sql;
create table want2 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-6,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Use temporary arrays and a single data step instead. Note that the arrays hold the most recent 3/6 rows, so this matches the date-based join only when each ID has one row per day with no gaps. The steps are:
1. Sort the data to ensure the order is correct.
2. Declare a temporary array for each moving average you want to calculate.
3. Ensure the arrays are empty at the start of each ID.
4. Assign each value to the array at the correct index. MOD() lets you index the array dynamically without having to include a separate counter variable.
5. Take the average of the array. If you want to skip the first two rows of each ID, because they have only 1 or 2 data points, you can calculate the average conditionally as well.
*sort to ensure data is in correct order (Step 1);
proc sort data=have_fin;
by id dt;
run;
data want;
*Step 2;
array p3{0:2} _temporary_;
array p6(0:5) _temporary_;
set have_fin;
by ID;
*clear values at the start of each ID for the array Step3;
if first.ID then call missing(of p3{*}, of p6(*));
*assign the value to the array, the mod function indexes the array so it's continuously the most recent 3/6 values;
*Step 4;
p3{mod(_n_,3)} = var;
p6{mod(_n_,6)} = var;
*Step 5 - calculates statistic of interest, average in this case;
mean3d = mean(of p3(*));
mean6d = mean(of p6(*));
;
run;
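The conditional calculation mentioned above (leaving the average empty until the window is full) could look like the sketch below. It is only an illustration: the output name want_full is made up here, and it assumes the same have_fin data, already sorted by id and dt. N() counts the non-missing array elements, so a missing var inside the window also leaves the mean missing.

```sas
data want_full;
  array p3{0:2} _temporary_;
  set have_fin;
  by id;
  /* clear the window at the start of each ID */
  if first.id then call missing(of p3{*});
  /* circular index: each new value overwrites the oldest of the 3 slots */
  p3{mod(_n_, 3)} = var;
  /* N() counts non-missing slots; compute the mean only when all 3 are filled */
  if n(of p3{*}) = 3 then mean3d = mean(of p3{*});
run;
```

The same pattern should extend to the 6-row window by adding a second array and comparing N() against 6.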
And if you have SAS/ETS licensed this is super trivial.
*prints product to log - check if you have SAS/ETS licensed;
proc product_status;run;
*sorts data;
proc sort data=have_fin;
by id dt;
run;
*calculates moving average;
proc expand data=have_fin out=want_expand;
by ID;
id dt;
convert var = mean_3d / method=none transformout= (movave 3);
convert var = mean_6d / method=none transformout= (movave 6);
run;
I have a 2 column dataset - accounts and attributes, where there are 6 types of attributes.
I am trying to use PROC TRANSPOSE in order to set the 6 different attributes as 6 new columns and set 1 where the column has that attribute and 0 where it doesn't
This answer shows two approaches:
Proc TRANSPOSE, and
array based transposition using index lookup via hash.
If every account is missing the same attribute, the data alone cannot reveal the full set of attributes. Ideally, the allowed or expected attributes should be listed in a separate table as part of your data reshaping.
Proc TRANSPOSE
When working with a table of only account and attribute you will need to construct a view adding a numeric variable that can be transposed. After TRANSPOSE the result data will have to be further massaged, replacing missing values (.) with 0.
Example:
data have;
call streaminit(123);
do account = 1 to 10;
do attribute = 'a','b','c','d','e','f';
if rand('uniform') < 0.75 then output;
end;
end;
run;
data stage / view=stage;
set have;
num = 1;
run;
proc transpose data=stage out=want;
by account;
id attribute;
var num;
run;
data want;
set want;
array attrs _numeric_;
do index = 1 to dim(attrs);
if missing(attrs(index)) then attrs(index) = 0;
end;
drop index;
run;
proc sql;
drop view stage;
quit;
Advanced technique - Array and Hash mapping
In some cases Proc TRANSPOSE is deemed unusable by the coder or operator, perhaps because there are very many by groups and very many attributes. An alternative way to transpose attribute values into like-named flag variables is to code two scans:
Scan 1
- Determine the attribute values that will be encountered and used as column names
- Store the list of values in a macro variable
Scan 2
- Arrayify the attribute values as variable names
- Map values to array index using a hash (or a custom informat, per @Joe)
- Process each group. Set the arrayed variable corresponding to each encountered attribute value to 1. The array index is obtained via lookup through the hash map.
Example:
* pass #1, determine attribute values present in data, the values will become column names;
proc sql noprint;
select distinct attribute into :attrs separated by ' ' from have;
* or make list of attributes from table of attributes (if such a table exists outside of 'have');
* select distinct attribute into :attrs separated by ' ' from attributes;
%put NOTE: &=attrs;
* pass #2, perform array based transposition;
data want2(drop=attribute);
* prep pdv, promulgate by group variable attributes;
if 0 then set have(keep=account);
array attrs &attrs.;
format &attrs. 4.;
if _n_=1 then do;
declare hash attrmap();
attrmap.defineKey('attribute');
attrmap.defineData('_n_');
attrmap.defineDone();
do _n_ = 1 to dim(attrs);
attrmap.add(key:vname(attrs(_n_)), data: _n_);
end;
end;
* preset all flags to zero;
do _n_ = 1 to dim(attrs);
attrs(_n_) = 0;
end;
* DOW loop over by group;
do until (last.account);
set have;
by account;
attrmap.find(); * lookup array index for attribute as column;
attrs(_n_) = 1; * set flag for attribute (as column);
end;
* implicit output one row per by group;
run;
One other option for doing this not using PROC TRANSPOSE is the data step array technique.
Here, I have a dataset that hopefully matches yours approximately. ID is probably your account, Product is your attribute.
data have;
call streaminit(2007);
do id = 1 to 4;
do prodnum = 1 to 6;
if rand('Uniform') > 0.5 then do;
product = byte(96+prodnum);
output;
end;
end;
end;
run;
Now, here we transpose it. We make an array with the six variables that could occur in HAVE. Then we iterate through the array to see if that variable is there. You can add a few additional lines to the if first.id block to set all of the variables to 0 instead of missing initially (I think missing is better, but YMMV).
data want;
set have;
by id;
array vars[6] a b c d e f;
retain a b c d e f;
if first.id then call missing(of vars[*]);
do _i = 1 to dim(vars);
if lowcase(vname(vars[_i])) = product then
vars[_i] = 1;
end;
if last.id then output;
run;
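The "few additional lines" for starting the flags at 0 instead of missing might look like this. It is a sketch only: the output name want_zero is hypothetical, and it assumes the same have dataset (id, prodnum, product) built above.

```sas
data want_zero;
  set have;
  by id;
  array vars[6] a b c d e f;
  retain a b c d e f;
  /* start each id with every flag at 0 rather than missing */
  if first.id then do _i = 1 to dim(vars);
    vars[_i] = 0;
  end;
  /* set the flag whose variable name matches the product value */
  do _i = 1 to dim(vars);
    if lowcase(vname(vars[_i])) = product then vars[_i] = 1;
  end;
  if last.id then output;
  drop _i product prodnum;
run;
```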
We could do it a lot faster if we knew how the dataset was constructed, of course.
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[rank(product)-96]=1;
if last.id then output;
run;
Your data doesn't really work that way, but you could build an informat that does the same mapping.
*First we build an informat relating the product to its number in the array order;
proc format;
invalue arrayi
'a'=1
'b'=2
'c'=3
'd'=4
'e'=5
'f'=6
;
quit;
*Now we can use that!;
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[input(product,arrayi.)]=1;
if last.id then output;
run;
This last one is probably the absolute fastest option - most likely much faster than PROC TRANSPOSE, which tends to be one of the slower procs in my book, but at the cost of having to know ahead of time what variables you're going to have in that array.
I am a SAS developer. I am starting a project that requires me to assign an RK number to each unique record. Each extraction will contain some data that already exists in the target table and some that does not.
For example.
Source Data:
Name
A
B
C
D
E
Target Table:
Name RK
A 1
B 2
C 3
When I load, I want to insert D and E into the target table with RK 4 and 5 respectively. Currently, I am thinking of doing a hash lookup from the source against the target table. For records that the hash object does not match, the RK field will be blank. I would then take the max RK number from the target table and increment it by 1 for each new record when appending D and E.
I am not sure if this is the most efficient way of doing so. Is there a more efficient way?
You could use a hash to determine whether a name (I'll call it value) already exists in the target table (I'll call it master). However, new keys would have to be tracked, output at the end of the step, and then appended to the master table with PROC APPEND.
For the case of just updating the master table with new RK values, a traditional SAS approach is to use a DATA step to MODIFY a unique keyed master table. The coding pattern is:
SET <source>
MODIFY <master> KEY=<value> / UNIQUE;
... _IORC_ logic ...
Example:
%* Create some source data and the master table;
data have1 have2 have3 have4 have5;
call streaminit(123);
value = 2020; output; output; output;
do _n_ = 1 to 2500;
value = ceil(5000 * rand('uniform')); %* random integer from 1 to 5000;
select;
when (rand('uniform') < 0.20) output have1;
when (rand('uniform') < 0.20) output have2;
when (rand('uniform') < 0.20) output have3;
when (rand('uniform') < 0.20) output have4;
otherwise output have5;
end;
end;
run;
data have6;
do _n_ = 1 to 20;
value = 2020;
output;
end;
run;
* Create the unique keyed master table;
* Typically done once and stored in a permanent library.;
proc sql;
create table keys (value integer, RK integer);
create unique index value on work.keys(value);
quit;
%* A macro for adding new RK values as needed;
%macro RK_ASSIGN(master, data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &master;
retain newkey %sysevalf(0&last+0); %* trickery for 1st use case when max(RK) is .;
set &data;
modify &master key=value / unique;
if _iorc_ eq %sysrc(_DSENOM);
newkey + 1;
RK = newkey;
output;
_error_ = 0;
run;
%mend;
%* Use the macro to process source data;
%RK_ASSIGN(keys,have1)
%RK_ASSIGN(keys,have2)
%RK_ASSIGN(keys,have3)
%RK_ASSIGN(keys,have4)
%RK_ASSIGN(keys,have5)
%RK_ASSIGN(keys,have6)
You can see that the forced repeats of the value 2020 in the source data are assigned an RK only once in the master table, and there are no errors during processing.
If you want to backfill the source data with the found or assigned RK value, there would be additional steps. You could update a custom format, or do a traditional left join. If you want to focus on backfilling during a read over the source data, the HASH step followed by an APPEND of the new RKs might be preferable.
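As a rough illustration of the left-join backfill, the sketch below attaches the RK from the master table to one source table. The output name have1_rk is made up for this example; keys and have1 are the tables created above, and the step assumes %RK_ASSIGN has already been run for have1 so that every value has an RK.

```sas
proc sql;
  create table have1_rk as
  select s.*, k.RK
  from have1 as s
  left join keys as k
    on s.value = k.value;  /* value is unique in keys, so no row duplication */
quit;
```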
Example 2: the master table is named values.
This is the HASH version, with the RK assignment added to the source data. New RKs are output and appended to the master table.
proc sql;
create table values (value integer, RK integer);
create unique index value on work.values(value);
quit;
%macro RK_HASH_ASSIGN(master,data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &data(drop=next_RK);
set &data end=end;
if _n_ = 1 then do;
declare hash lookup (dataset:"&master");
lookup.defineKey("value");
lookup.defineData("value", "RK");
lookup.defineDone();
declare hash newlookup (dataset:"&master(obs=0)");
newlookup.defineKey("value");
newlookup.defineData("value", "RK");
newlookup.defineDone();
end;
retain next_RK %sysevalf(0&last+0); %* trick;
* either load existing RK from hash, or compute and apply next RK value;
if lookup.find() ne 0 then do;
next_RK + 1;
RK = next_RK;
lookup.add();
newlookup.add();
end;
if end then do;
newlookup.output(dataset:'work.newmasters');
end;
run;
proc append base=&master data=work.newmasters;
proc delete data=work.newmasters;
run;
%mend;
%RK_HASH_ASSIGN(values,have1)
%RK_HASH_ASSIGN(values,have2)
%RK_HASH_ASSIGN(values,have3)
%RK_HASH_ASSIGN(values,have4)
%RK_HASH_ASSIGN(values,have5)
%RK_HASH_ASSIGN(values,have6)
%* Compare the two assignment strategies, no differences!;
proc sort force data=values(index=(value));
by RK;
run;
proc compare noprint base=keys compare=values out=diffs outnoequal;
by RK;
run;
----- LOG -----
2525 proc compare noprint base=keys compare=values out=diffs
outnoequal <------------- do not output when data is identical ;
;
2526 by RK;
2527 run;
NOTE: There were 215971 observations read from the data set WORK.KEYS.
NOTE: There were 215971 observations read from the data set WORK.VALUES.
NOTE: The data set WORK.DIFFS has 0 observations and 4 variables. <--- all the same ---
NOTE: PROCEDURE COMPARE used (Total process time):
real time 0.25 seconds
cpu time 0.26 seconds
I've got a pretty big table in which I want to replace rare values (for this example, values with fewer than 10 occurrences, but the real case is more complicated: a variable might have 1000 levels while I want to keep only 15). The list of possible levels might change, so I don't want to hardcode anything.
My code is like:
%let var = Make;
proc sql;
create table stage1_ as
select &var.,
count(*) as count
from sashelp.cars
group by &var.
having count >= 10
order by count desc
;
quit;
/* Join table with table including only top obs to replace rare
values with "other" category */
proc sql;
create table stage2_ as
select t1.*,
case when t2.&var. is missing then "Other_&var." else t1.&var. end as &var._new
from sashelp.cars t1 left join
stage1_ t2 on t1.&var. = t2.&var.
;
quit;
/* Drop old variable and rename the new as old */
data result;
set stage2_(drop= &var.);
rename &var._new=&var.;
run;
It works, but unfortunately it is not very efficient, as it needs a join for each variable (in the real case I am doing it in a loop).
Is there a better way to do it? Maybe some smart replace function?
Thanks!!
You probably don't want to change the actual data values. Instead consider creating a custom format for each variable that will map the rare values to an 'Other' category.
The ODS output of the FREQ procedure can be used to capture the counts and percentages of every listed variable in a single table. NOTE: the OUT= option of the TABLE statement captures only the last listed variable. Those counts can be used to construct the format according to the 'othering' rules you want to implement.
data have;
do row = 1 to 1000;
array x x1-x10;
do over x;
if row < 600
then x = ceil(100*ranuni(123));
else x = ceil(150*ranuni(123));
end;
output;
end;
run;
ods output onewayfreqs=counts;
proc freq data=have ;
table x1-x10;
run;
data count_stack;
length name $32;
set counts;
array x x1-x10;
do over x;
name = vname(x);
value = x;
if value then output;
end;
keep name value frequency;
run;
proc sort data=count_stack;
by name descending frequency ;
run;
data cntlin;
do _n_ = 1 by 1 until (last.name);
set count_stack;
by name;
length fmtname $32;
fmtname = trim(name)||'top';
start = value;
label = cats(value);
if _n_ < 11 then output;
end;
hlo = 'O';
label = 'Other';
output;
run;
proc format cntlin=cntlin;
run;
ods html;
proc freq data=have;
table x1-x10;
format
x1 x1top.
x2 x2top.
x3 x3top.
x4 x4top.
x5 x5top.
x6 x6top.
x7 x7top.
x8 x8top.
x9 x9top.
x10 x10top.
;
run;
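Since the question asks to avoid hardcoding, the variable-to-format pairs in the FORMAT statement could also be generated from the cntlin table instead of being typed out. This is a sketch under the assumption that cntlin still carries the name and fmtname columns created above; the macro variable name fmt_list is invented for the example.

```sas
proc sql noprint;
  /* build pairs such as "x1 x1top." from the control data set */
  select distinct catx(' ', name, cats(fmtname, '.'))
    into :fmt_list separated by ' '
  from cntlin;
quit;

proc freq data=have;
  table x1-x10;
  format &fmt_list;
run;
```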
I have a process flow in SAS Enterprise Guide which is comprised mainly of Data views rather than tables, for the sake of storage in the work library.
The problem is that I need to calculate percentiles (using proc univariate) from one of the data views and left join this to the final table (shown in the screenshot of my process flow).
Is there any way that I can specify the outfile in the univariate procedure as being a data view, so that the procedure doesn't calculate everything prior to it in the flow? When the percentiles are left joined to the final table, the flow is calculated again so I'm effectively doubling my processing time.
Please find the code for the univariate procedure below
proc univariate data=WORK.QUERY_FOR_SGFIX noprint;
var CSA_Price;
by product_id;
output out= work.CSA_Percentiles_Prod
pctlpre= P
pctlpts= 40 to 60 by 10;
run;
In SAS, my understanding is that procs such as proc univariate cannot generally produce views as output. The only workaround I can think of would be for you to replicate the proc logic within a data step and produce a view from the data step. You could do this e.g. by transposing your variables into temporary arrays and using the pctl function.
Here's a simple example:
data example /view = example;
array _height[19]; /*Number of rows in sashelp.class dataset*/
/*Populate array*/
do _n_ = 1 by 1 until(eof);
set sashelp.class end = eof;
_height[_n_] = height;
end;
/*Calculate quantiles*/
array quantiles[3] q40 q50 q60;
array points[3] (40 50 60);
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/*Keep only the quantiles we calculated*/
keep q40--q60;
run;
With a bit more work, you could also make this approach return percentiles for individual by groups rather than for the whole dataset at once. You would need to write a double-DOW loop to do this, e.g.:
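/*By-group processing needs input sorted by sex; here class is assumed to be a sorted copy of sashelp.class*/
proc sort data = sashelp.class out = class;
by sex;
run;
data example;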
array _height[19];
array quantiles[3] q40 q50 q60;
array points[3] _temporary_ (40 50 60);
/*Clear heights array between by groups*/
call missing(of _height[*]);
/*Populate heights array*/
do _n_ = 1 by 1 until(last.sex);
set class end = eof;
by sex;
_height[_n_] = height;
end;
/*Calculate quantiles*/
do i = 1 to 3;
quantiles[i] = pctl(points[i], of _height{*});
end;
/* Output all rows from input dataset, with by-group quantiles attached*/
do _n_ = 1 to _n_;
set class;
output;
end;
keep name sex q40--q60;
run;
What I've got:
a table of 20 rows in SAS (originally 100k)
various binary attributes (columns)
What I'm looking to get:
A crosstable displaying the frequency of the attribute combinations
like this:
Attribute1 Attribute2 Attribute3 Attribute4
Attribute1 5 0 1 2
Attribute2 0 3 0 3
Attribute3 2 0 5 4
Attribute4 1 2 0 10
*The actual sum of combinations is made up and probably not 100% logical
The code I currently have:
/*create dummy data*/
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
output;
end;
run;
I guess this can be done in a smarter way, but this seems to work. First I created a table to hold all the frequencies:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;
Then I loop through all the combinations, inserting the count into the crosstable:
%macro lup();
%do i=1 %to 4;
%do j=&i %to 4;
proc sql noprint;
select count(*) into :Antall&i&j
from monthly_sales (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
Note that since the frequency count for (i,j) equals the count for (j,i), you do not need to compute both.
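For comparison, the same counts can probably be collected in one pass over the data with a two-dimensional temporary array, rather than one query per pair. This is a swapped-in technique, not part of the macro above; the names crosstable_1pass and row are hypothetical, and the sketch assumes the four 0/1 columns of monthly_sales.

```sas
data crosstable_1pass;
  set monthly_sales end = last;
  array a{4} Attribute1-Attribute4;
  /* running co-occurrence counts; temporary arrays keep their values across rows */
  array c{4,4} _temporary_ (16*0);
  length row $32;
  do i = 1 to 4;
    do j = 1 to 4;
      /* the logical expression is 1 when both flags are set, else 0 */
      c{i,j} = c{i,j} + (a{i} and a{j});
    end;
  end;
  /* after the last input row, write one output row per attribute */
  if last then do i = 1 to 4;
    row = vname(a{i});
    do j = 1 to 4;
      a{j} = c{i,j};
    end;
    output;
  end;
  keep row Attribute1-Attribute4;
run;
```

This fills both halves of the symmetric table directly, at the cost of doing each pair's comparison twice per row.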
I'd recommend using the built-in SAS tools for this sort of thing, and probably displaying your data slightly differently as well, unless you really want a diagonal table. e.g.
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
count = 1;
output;
end;
run;
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;
proc summary nway data = monthly_sales;
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
Either of these gives you a table with 1 row for each combination of attributes in your data, which is slightly different from what you requested, but conveys the same information. You can force proc summary to include rows for combinations of class variables that don't exist in your data by using the completetypes option in the proc summary statement.
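A minimal sketch of the completetypes variant, assuming the monthly_sales table with the count column from above (the output name summary_complete is made up here). Combinations that never occur in the data should appear with _FREQ_ = 0 and a missing sum, which is why _FREQ_ is kept in this version.

```sas
proc summary nway completetypes data = monthly_sales;
  class attribute1 attribute2 attribute3 attribute4;
  var count;
  /* keep _FREQ_ so that absent combinations are visible as 0 */
  output out = summary_complete(drop = _TYPE_) sum(count) = ;
run;
```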
It's definitely worth taking the time to get familiar with proc summary if you're doing statistical analysis in SAS - you can include additional output statistics and process multiple variables with minimal additional code and processing overhead.
Update: it's possible to produce the desired table without resorting to macro logic, albeit a rather complex process:
proc summary data = monthly_sales completetypes;
ways 1 2; /*Calculate only 1 and 2-way summaries*/
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
/*Eliminate unnecessary output rows*/
data summary_table;
set summary_table;
array a{*} attribute:;
sum = sum(of a[*]);
missing = 0;
do i = 1 to dim(a);
missing + missing(a[i]);
a[i] = a[i] * count;
end;
/*We want rows where two attributes are both 1 (sum = 2),
or one attribute is 1 and the others are all missing*/
if sum = 2 or (sum = 1 and missing = dim(a) - 1);
drop i missing sum;
edge = _n_;
run;
/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
by edge;
var attribute:;
run;
/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
create table diagonal as
select a._name_ as aname,
b._name_ as bname,
a.col1 as count
from tr_table a, tr_table b
where a.edge = b.edge
group by a.edge
having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
order by aname, bname
;
quit;
/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
by aname;
id bname;
var count;
run;
/*Re-order variables and set missing values to zero*/
data want;
informat aname attribute1-attribute4;
set want;
array a{*} attribute:;
do i = 1 to dim(a);
a[i] = sum(a[i],0);
end;
drop i;
run;
Yeah, user667489 was right, I just added some extra code to get the cross-frequency table looking good. First, I created a table with 10 million rows and 10 variables:
data monthly_sales (drop=i);
do i=1 to 10000000;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
Attribute5=rand("Normal")>0.5;
Attribute6=rand("Normal")>0.5;
Attribute7=rand("Normal")>0.5;
Attribute8=rand("Normal")>0.5;
Attribute9=rand("Normal")>0.5;
Attribute10=rand("Normal")>0.5;
output;
end;
run;
Create an empty 10x10 crosstable:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
output;output;output;output;output;output;output;output;output;output;
run;
Create a frequency table using proc freq:
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
/ out = frequency_table;
run;
Loop through all the combinations of Attributes and sum the "count" variable. Insert it into the crosstable:
%macro lup();
%do i=1 %to 10;
%do j=&i %to 10;
proc sql noprint;
select sum(count) into :Antall&i&j
from frequency_table (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;