I need to create a new variable, called new_id, that has the same value for tasks that share an id or share a location. In this example:
Table 1
id Task location
a Task1 lat1
b Task2 lat2
b Task3 lat1
c Task4 lat3
c Task5 lat4
d Task6 lat5
e Task7 lat5
Table want
id Task Location New_id
a Task1 lat1 a
b Task2 lat2 a
b Task3 lat1 a
c Task4 lat3 c
c Task5 lat4 c
d Task6 lat5 d
e Task7 lat5 d
Task1 and Task3 must have the same new_id because they have the same location.
Task2 and Task3 must have the same new_id because they have the same id.
I tried to use a retain data step. First I sort on location, retain the first.variable, then sort id, retain the first.variable.
proc sort data=table1;
by location;
data table1_1;
set table1;
by location;
retain new_id_temp;
if first.location then new_id_temp =id;
new_id=new_id_temp;
run;
proc sort data=table1_1;
by id;
data table1_2;
set table1_1;
by id;
retain id_temp;
if first.id then id_temp=id;
new_id=id_temp;
run;
Based on the above code, I still get two different new_id values, and proc sort takes a long time when the datasets are large.
Can anyone help?
Your issue here is that you didn't update the second datastep to use new_id as the source for the retained ID, so it's using b not a.
data table1_2;
set table1_1;
by id;
retain id_temp;
if first.id then
id_temp=new_id;
new_id_fin=id_temp;
run;
I'm not sure this is really an effective way to solve your problem generally, but it should give you the results you want. You might want to search around the site (or the web) for other ways to solve this problem, as it's a well understood but complex issue.
/* To help you understand the algorithm, I print intermediate results */
%let print_diagnostics = 1; * 0 : no diagnostics 1 : diagnostics *;
/* Read in the example, extended with extra data */
options mprint;
title read input data;
data table1;
input id $ Task $ location $;
datalines;
a Task01 lat1
b Task02 lat2
b Task03 lat1
b Task04 lat0
c Task05 lat3
c Task06 lat4
d Task07 lat5
e Task08 lat5
f Task09 lat4
f Task10 lat6
g Task11 lat6
g Task12 lat7
h Task13 lat7
;
proc print;
run;
/* The solution needs some iteration, so we need a macro */
%macro re_identify (got, want);
* Initially, we assign id to new_id *;
data &want.;
set &got.;
new_id = id;
run;
* proceed re-assigning ids until stabilised *;
%let pass = 0;
%let proceed = 1;
%do %while (&proceed);
/* To look up the smallest new_id already used for an id or location, I use hash tables. For more information, read Data Step Hash Objects as Programming Tools */
* We will construct two hash tables
* one with the smallest new_id for each id and *
* one with the smallest new_id for each location *
* To achieve this, the smallest new_id should come first *;
%let pass = %eval(&pass + 1);
title pass &pass;
proc sort data=&want.;
by new_id;
run;
data
%if &print_diagnostics %then %do;
hash_id(keep=id id_id)
hash_loc(keep=location loc_id)
%end;
&want. (drop=rc loc_id id_id proceed);
/* The hash tables have to be loaded only once of course. Mind the declaration of the data variables! */
* Create hash tables with for each id and location
* the smallest new_id used up to now *;
length loc_id id_id $ 1;
if _N_ eq 1 then do;
dcl hash h_id (dataset: "&want.(rename=(new_id=id_id))");
h_id.defineKey('id');
h_id.definedata('id_id','id');
h_id.defineDone();
dcl hash h_loc (dataset: "&want.(rename=(new_id=loc_id))");
h_loc.defineKey('location');
h_loc.definedata('loc_id','location');
h_loc.defineDone();
* Unless we have to lower the new id for any id or location,
* we can stop after this pass *;
proceed = 0;
end;
retain proceed;
* Read in the data *;
set &want. end=last;
* If there is a task with the same id or location
* with a smaller new_id, lower the new_id for this task *;
rc = h_id.find() + h_loc.find();
if rc then put 'WARNING: location not found' _all_;
if id_id lt new_id then new_id = id_id;
if loc_id lt new_id then new_id = loc_id;
output &want.;
* If we lowered the new_id,
* adapt the hash table
* and proceed after this pass *;
if id_id gt new_id then do;
id_id = new_id;
h_id.replace();
proceed = 1;
end;
if loc_id gt new_id then do;
loc_id = new_id;
h_loc.replace();
proceed = 1;
end;
/* Adapting the hash tables with the replace method is optional but can drastically reduce the number of passes. */
* transfer the decision to proceed
* from a data step variable to a macro variable *;
if last then call symputx ('proceed', proceed);
%if &print_diagnostics %then %do;
if last then do;
dcl hiter i_id ('h_id') ;
dcl hiter i_loc ('h_loc') ;
do rc = i_id.first () by 0 while ( rc = 0 ) ;
output hash_id;
rc = i_id.next () ;
end;
do rc = i_loc.first () by 0 while ( rc = 0 ) ;
output hash_loc;
rc = i_loc.next () ;
end;
put "NOTE: after pass &pass." proceed=;
end;
%end;
run;
%if &print_diagnostics %then %do;
* Print intermediate results *;
title2 new id assigned to task; proc print data=&want.; run;
title2 new id assigned to id; proc print data=hash_id; run;
title2 new id assigned to location; proc print data=hash_loc; run;
%end;
%end;
%mend;
%re_identify(table1, table_want);
/* And finally write out the report. */
* sort in task order and print the final results *;
title final result;
proc sort data=table_want;
by Task;
proc print;
run;
This should give you the results you expect, with a caveat: if you were to add a Task8 having id F and location lat1, you'd need a more refined algorithm with 2 or more passes. But this solution will work fine as long as your ids and locations progress in such a way that elements sharing common ids and/or locations are placed after one another.
Generate Sample Dataset
data tasks;
input id $ Task $ location $;
datalines;
a Task1 lat1
b Task2 lat2
b Task3 lat1
c Task4 lat3
c Task5 lat4
d Task6 lat5
e Task7 lat5
;
Generate all Possible Combinations using PROC FREQ
proc freq data=tasks;
table id * task * location / out=combinations (drop=percent count);
run;
Calculate New ID's Based on Your Criteria
data newIDs;
set combinations;
length prev_id $ 1
newID $ 1
prev_location $ 4;
retain newID prev_id prev_location;
* First scenario - first row;
if _N_ = 1 then do;
put _N_= "First scenario - first row";
newID = id;
output;
prev_id = id;
prev_location = location;
end;
* Second scenario - some redundancy between 2 rows;
else if id = prev_id or prev_location=location then do;
put _N_= "Second Scenario - some redundancy";
output;
prev_id = id;
prev_location = location;
end;
* Third scenario - no redundancy;
else do;
put _N_= "Third scenario - no redundancy";
newID = id;
output;
prev_id = id;
prev_location = location;
end;
keep id task location newID;
run;
Merge the Tasks Dataset to the newIDs Dataset
proc sql;
create table tasks_update as
select t.id
,i.newID
,t.Task
,t.location
from tasks as t
left join newIDs as i
on t.id = i.id
and t.task = i.task
and t.location = i.location
order by id;
quit;
Results
id newID Task location
a a Task1 lat1
b a Task2 lat2
b a Task3 lat1
c c Task4 lat3
c c Task5 lat4
d d Task6 lat5
e d Task7 lat5
I have a SAS data set t3. I want to run a data step inside a loop through a set of variables to create additional sets based on the variable value = 1, and rank two variables bal and otheramt in each subset, and then merge the ranks for each subset onto the original data set. Each rank column needs to be dynamically named so I know what subset is getting ranked. I know how to do proc rank and macros basically but do not know how to do this in the most dynamic way inside of a macro. Can you assist?
ID bal otheramt firstvar secondvar lastvar
444 581 100 1 1
555 255 200 1 1 1
666 255 300 -------------- 1 --------------
%macro dog();
data new;
set t3;
ARRAY Indicators(5) FirstVar--LastVar;
/*create data set for each of the subsets if firstvar = 1, secondvar = 1 ... lastvar = 1 */
/*for each new data set, rank by bal and otheramt*/
/*name the new rank columns [FirstVar]BalRank, [FirstVar]OtherAmtRank; */
/*merge the new ranks onto the original data set by ID*/
%mend;
%dog()
The Proc rank section would be something like this, but I would need the rank columns to have information about what subset I am ranking.
proc rank data=subset1 out=subset1ranked;
var bal otheramt;
ranks bal_rank otheramt_rank;
run;
Instead of using macro, use data transformation and reshaping that allows simpler steps to be written.
Example:
Rows are split into multiple rows based on flag so group processing in RANK can occur. Two transposes are required to reshape the results back to a single row per id.
data have;
call streaminit(20230216);
do id = 1 to 100;
foo = rand('integer', 50,150);
bar = rand('integer', 100,200);
flag1 = rand('integer', 0, 1);
flag2 = rand('integer', 0, 1);
flag3 = rand('integer', 0, 1);
output;
end;
run;
data step1;
set have;
/* important: the group value becomes part of the variable name later */
if flag1 then do; group='flag1_'; output; end;
if flag2 then do; group='flag2_'; output; end;
if flag3 then do; group='flag3_'; output; end;
drop flag:;
run;
proc sort data=step1;
by group;
run;
proc rank data=step1 out=step2;
by group;
var foo bar;
ranks foo_rank bar_rank;
run;
proc sort data=step2;
by id group;
run;
* pivot (reshape) so there is one row per ranked var;
proc transpose data=step2 out=step3(drop=_label_);
by id foo bar group;
var foo_rank bar_rank;
run;
* pivot again so there is one row per id;
proc transpose data=step3 out=step4(drop=_name_);
by id;
var col1;
id group _name_;
run;
* merge so those 0 0 0 flag rows remain intact;
data want;
merge have step4;
by id;
run;
Since we don't have much sample data, I created test data from sashelp.class with some indicator variables like yours.
data have;
set sashelp.class;
firstvar=round(rand('uniform',1));
secondvar=round(rand('uniform',1));
thirdvar=round(rand('uniform',1));
drop sex weight;
run;
Partial output:
Name Age Height firstvar secondvar thirdvar
Alfred 14 69 1 0 1
Alice 13 56.5 0 1 1
Barbara 13 65.3 1 0 0
Carol 14 62.8 0 0 0
To dynamically rank data based on indicator variables, I created a macro that accepts a list of indicators and rank variables. The 2 lists help to create the specific variable names you requested. Here's the macro call:
%rank(indicators=firstvar secondvar thirdvar,
rank_vars=age height);
Here's part of the final output. Notice the indicators in the sample output above coincide with the ranks in this output. Also note that Carol is not in the output because she had no indicators set to 1.
Name Age Height firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank thirdvar_height_rank
Alfred 14 69 8 11 . . 6.5 10
Alice 13 56.5 . . 3.5 2 4.5 2
Barbara 13 65.3 6.5 8 . . . .
Henry 14 63.5 . . 5.5 5 . .
The full macro is listed below. It has 3 parts.
1. Create a temp data set with a group variable that contains the number of the indicator variable based on the order of the variable in the list. Whenever an indicator = 1 the obs is output. If an obs has all 3 indicators set to 1 then it will be output 3 times with the group variable set to the number of each indicator variable. This step is important because proc rank will rank groups independently.
2. Generate the rankings on the temp data set. Each group will be ranked independently of the other groups and can be done in one step.
3. Construct the final data set by essentially transposing the ranked data into columns.
%macro rank(indicators=, rank_vars=);
%let cnt_ind = %sysfunc(countw(&indicators));
%let cnt_vars = %sysfunc(countw(&rank_vars));
data temp;
set have;
array indicators(*) &indicators;
do i = 1 to dim(indicators);
if indicators(i) = 1 then do;
group = i; * create a group based on order of indicators;
output; * an obs can be output multiple times;
end;
end;
drop i &indicators;
run;
proc sort data=temp;
by group;
run;
* Generate rankings by group;
proc rank data=temp out=ranks;
by group;
var &rank_vars;
ranks
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
proc sort data=ranks;
by name group;
run;
* Construct final data set by transposing the ranks into columns;
data want;
set ranks;
by name;
* retain statement to declare new variables and retain values;
retain
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let vars = &vars &ivar._&jvar._rank;
%end;
%end;
&vars;
if first.name then call missing (of &vars);
* option 1: build series of IF statements;
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%str(if group = &i then do;)
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let newvar = &ivar._&jvar._rank;
%str(&newvar = &jvar._rank;)
%end;
%str(end;)
%end;
if last.name then output;
drop group
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
%mend;
When constructing the final data set and transposing the rank variables, there are a couple of options. The first option shown above is to dynamically build a series of if statements. Here is what the code generates:
MPRINT(RANK): * option 1: build series of IF statements;
MPRINT(RANK): if group = 1 then do;
MPRINT(RANK): firstvar_age_rank = age_rank;
MPRINT(RANK): firstvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 2 then do;
MPRINT(RANK): secondvar_age_rank = age_rank;
MPRINT(RANK): secondvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 3 then do;
MPRINT(RANK): thirdvar_age_rank = age_rank;
MPRINT(RANK): thirdvar_height_rank = height_rank;
MPRINT(RANK): end;
The 2nd option is to use an array and mathematically calculate the index into the array by the group number and variable number. Here is the snippet of macro code to replace the if series code:
* option 2: create arrays and calculate index into array
* by group number and variable number;
array ranks(*) &vars;
array rankvars(*)
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
%str(idx = dim(rankvars) * (group - 1);)
%str(do i = 1 to dim(rankvars);)
%str(ranks(idx + i) = rankvars(i);)
%str(end;)
Here is the generated code:
MPRINT(RANK): * option 2: create arrays and calculate index into array * by group number and variable number;
MPRINT(RANK): array ranks(*) firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank
thirdvar_height_rank;
MPRINT(RANK): array rankvars(*) age_rank height_rank;
MPRINT(RANK): idx = dim(rankvars) * (group - 1);
MPRINT(RANK): do i = 1 to dim(rankvars);
MPRINT(RANK): ranks(idx + i) = rankvars(i);
MPRINT(RANK): end;
It takes a minute to understand the array option, but once you do, it is preferable to generating if statements. As the number of variables increases, the code generated by the array option stays the same and operates more efficiently.
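To see the index arithmetic on its own, here is a small sketch, assuming the three indicators and two rank variables from the example above. The names n_rankvars and target are made up for illustration; the step only writes the computed positions to the log.

```sas
/* Sketch: where each group's ranks land in the 6-element ranks array. */
/* n_rankvars and target are hypothetical names used for illustration. */
data _null_;
    n_rankvars = 2;                       /* age_rank, height_rank */
    do group = 1 to 3;                    /* firstvar, secondvar, thirdvar */
        idx = n_rankvars * (group - 1);   /* offset of this group's block */
        do i = 1 to n_rankvars;
            target = idx + i;             /* position in the ranks array */
            put group= i= target=;
        end;
    end;
run;
```

For group 2 the offset is 2, so age_rank and height_rank are copied to positions 3 and 4 of the ranks array, which are secondvar_age_rank and secondvar_height_rank.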
I am trying to calculate some statistics for a given variable based on the client id and the time horizon. My current solution is shown below; however, I would like to know if there is a way to reformat the code into a data step instead of an SQL join, because the join takes a very long time to execute on my real dataset.
data have1(drop=t);
id = 1;
dt = '31dec2020'd;
do t=1 to 10;
dt = dt + 1;
var = rand('uniform');
output;
end;
format dt ddmmyyp10.;
run;
data have2(drop=t);
id = 2;
dt = '31dec2020'd;
do t=1 to 10;
dt = dt + 1;
var = rand('uniform');
output;
end;
format dt ddmmyyp10.;
run;
data have_fin;
set have1 have2;
run;
Proc sql;
create table want1 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-3,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Proc sql;
create table want2 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-6,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Use temporary arrays and a single data step instead.
This does the same thing in a single step.
1. Sort data to ensure order is correct.
2. Declare a temporary array for each moving average you want to calculate.
3. Ensure the array is empty at the start of each ID.
4. Assign values to the array in the correct index. MOD() allows you to dynamically index the data without having to include a separate counter variable.
5. Take the average of the array. If you want to skip the first two rows of each ID, because the window then holds only 1 or 2 data points, you can conditionally calculate this as well.
*sort to ensure data is in correct order (Step 1);
proc sort data=have_fin;
by id dt;
run;
data want;
*Step 2;
array p3{0:2} _temporary_;
array p6(0:5) _temporary_;
set have_fin;
by ID;
*clear values at the start of each ID for the array Step3;
if first.ID then call missing(of p3{*}, of p6(*));
*assign the value to the array, the mod function indexes the array so it's continuously the most recent 3/6 values;
*Step 4;
p3{mod(_n_,3)} = var;
p6{mod(_n_,6)} = var;
*Step 5 - calculates statistic of interest, average in this case;
mean3d = mean(of p3(*));
mean6d = mean(of p6(*));
;
run;
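If you would rather leave the average missing until the window is full, one way is to count the non-missing array elements with N() first. This is a sketch only: the output name want_full is an assumption, and have_fin must already be sorted by id and dt as in Step 1.

```sas
/* Sketch: only compute the moving average once the window is full. */
/* want_full is a hypothetical output name.                         */
data want_full;
    array p3{0:2} _temporary_;
    array p6{0:5} _temporary_;
    set have_fin;
    by id;
    if first.id then call missing(of p3{*}, of p6{*});
    p3{mod(_n_, 3)} = var;
    p6{mod(_n_, 6)} = var;
    /* N() counts non-missing values, so a partly filled window is skipped */
    if n(of p3{*}) = 3 then mean3d = mean(of p3{*});
    if n(of p6{*}) = 6 then mean6d = mean(of p6{*});
run;
```

Like the step above, this averages the last 3 or 6 rows per id, which matches the date-based join only when there is one row per day with no gaps.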
And if you have SAS/ETS licensed this is super trivial.
*prints product to log - check if you have SAS/ETS licensed;
proc product_status;run;
*sorts data;
proc sort data=have_fin;
by id dt;
run;
*calculates moving average;
proc expand data=have_fin out=want_expand;
by ID;
id dt;
convert var = mean_3d / method=none transformout= (movave 3);
convert var = mean_6d / method=none transformout= (movave 6);
run;
Here is a simple example I came up with. There are 3 players here (id is 1,2,3) and each player gets 3 attempts at the game (attempt is 1,2,3).
data have;
infile datalines delimiter=",";
input id attempt score;
datalines;
1,1,100
1,2,200
2,1,150
3,1,60
;
run;
I would like to add in rows where the score is missing if they did not play attempt 2 or attempt 3.
data want;
set have;
by id attempt;
* ??? ;
run;
proc print data=have;
run;
The output would look something like this.
1 1 100
1 2 200
1 3 .
2 1 150
2 2 .
2 3 .
3 1 60
3 2 .
3 3 .
How do I go about doing this?
You could solve this by first creating a table where you have the structure you want to see: for each ID three attempts. This structure can then be joined with a 'left join' to your 'have' table to get the actual scores if they exist and a missing value if they don't.
/* Create table with all ids for which the structure needs to be created */
proc sql;
create table ids as
select distinct id from have;
quit;
/* Create table structure with 3 attempts per ID */
data ids (drop = i);
set ids;
do i = 1 to 3;
attempt = i;
output;
end;
run;
/* Join the table structure to the actual scores in the have table */
proc sql;
create table want as
select a.*,
b.score
from ids a left join have b on a.id = b.id and a.attempt = b.attempt;
quit;
A table of possible attempts cross joined with the distinct ids left joined to the data will produce the desired result set.
Example:
data have;
infile datalines delimiter=",";
input id attempt score;
datalines;
1,1,100
1,2,200
2,1,150
3,1,60
;
data attempts;
do attempt = 1 to 3; output; end;
run;
proc sql;
create table want as
select
each_id.id,
each_attempt.attempt,
have.score
from
(select distinct id from have) each_id
cross join
attempts each_attempt
left join
have
on
each_id.id = have.id
& each_attempt.attempt = have.attempt
order by
id, attempt
;
quit;
Update: I figured it out.
proc sort data=have;
by id attempt;
data want;
set have (rename=(attempt=orig_attempt score=orig_score));
by id;
** Previous attempt number **;
retain prev;
if first.id then prev = 0;
** If there is a gap between previous attempt and current attempt, output a blank record for each intervening attempt **;
if orig_attempt > prev + 1 then do attempt = prev + 1 to orig_attempt - 1;
score = .;
output;
end;
** Output current attempt **;
attempt = orig_attempt;
score = orig_score;
output;
** If this is the last record and there are more attempts that should be included, output dummy records for them **;
** (Assumes that you know the maximum number of attempts) **;
if last.id & attempt < 3 then do attempt = attempt + 1 to 3;
score = .;
output;
end;
** Update last attempt used in this iteration **;
prev = attempt;
run;
Here is an alternative DATA step, a DOW way:
data want;
do until (last.id);
set have;
by id;
output;
end;
call missing(score);
do attempt = attempt+1 to 3;
output;
end;
run;
If the absent observations are only at the end then you can just use a couple of OUTPUT statements and a DO loop. So write each observation as it is read and if the last one is NOT attempt 3 then add more observations until you get to attempt 3.
data want1;
set have ;
by id;
output;
score=.;
if last.id then do attempt=attempt+1 to 3;
output;
end;
run;
If the absent attempts can appear anywhere then you need to "look ahead" to see whether the next observation skips any attempts.
data want2;
set have end=eof;
by id ;
if not eof then set have (firstobs=2 keep=attempt rename=(attempt=next));
if last.id then next=3+1;
output;
score=.;
do attempt=attempt+1 to next-1;
output;
end;
drop next;
run;
Looking to automate some checks and print some warnings to a log file. I think I've gotten the general idea but I'm having problems generalising the checks.
For example, I have two datasets my_data1 and my_data2. I wish to print a warning if nobs_my_data2 < nobs_my_data1. Additionally, I wish to print a warning if the number of distinct values of the variable n in my_data2 is less than 11.
Some dummy data and an attempt of the first check:
%LET N = 1000;
DATA my_data1(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N - 100;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA my_data2(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA _NULL_;
FILE "\\filepath\log.txt" MOD;
SET my_data1 NOBS = NOBS1 my_data2 NOBS = NOBS2 END = END;
IF END = 1 THEN DO;
PUT "HERE'S A HEADER LINE";
END;
IF NOBS1 > NOBS2 AND END = 1 THEN DO;
PUT "WARNING!";
END;
IF END = 1 THEN DO;
PUT "HERE'S A FOOTER LINE";
END;
RUN;
How can I set up the check for the number of distinct values of n in my_data2?
A proc sql way to do it -
%macro nobsprint(tab1,tab2);
options nonotes; *suppresses all notes;
proc sql;
select count(*) into:nobs&tab1. from &tab1.;
select count(*) into:nobs&tab2. from &tab2.;
select count(distinct n) into:distn&tab2. from &tab2.;
quit;
%if &&nobs&tab2. < &&nobs&tab1. %then %put |WARNING! &tab2. has less recs than &tab1.|;
%if &&distn&tab2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
options notes; *overrides the previous option;
%mend nobsprint;
%nobsprint(my_data1,my_data2);
This would break if you have to specify libnames with the datasets, because the . in a two-level name such as work.my_data1 would end up in the macro variable name. You can use proc printto to send the log to a file.
To print only the %put output, wrap the macro call like this:
filename mylog temp;
proc printto log=mylog; run;
options nomprint nomlogic;
%nobsprint(my_data1,my_data2);
proc printto; run;
This won't print any extraneous text to the SAS log other than your custom warnings.
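If you do need two-level names, one option is to stop building the macro variable names from the table names. A minimal sketch, assuming fixed local names (nobs1, nobs2, distn2) and a hypothetical macro name nobsprint2; it still assumes the second table has a variable called n:

```sas
/* Sketch: same checks, but macro variable names do not depend on the */
/* table names, so libref.member names are accepted.                  */
%macro nobsprint2(tab1, tab2);
    %local nobs1 nobs2 distn2;
    proc sql noprint;
        select count(*) into :nobs1 trimmed from &tab1.;
        select count(*) into :nobs2 trimmed from &tab2.;
        select count(distinct n) into :distn2 trimmed from &tab2.;
    quit;
    %if &nobs2. < &nobs1. %then
        %put WARNING: &tab2. has fewer rows than &tab1.;
    %if &distn2. < 11 %then
        %put WARNING: fewer than 11 distinct values of n in &tab2.;
%mend nobsprint2;

%nobsprint2(work.my_data1, work.my_data2);
```

Starting the message with WARNING: makes SAS flag the line as a warning in the log, which is easier to search for than a custom prefix.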
@samkart provided perhaps the most direct, easily understood way to compare the obs counts. Another consideration is performance: if your data sets have millions of obs, you can get the counts without reading the data.
One method is to use nobs= option in the set statement like you did in your code, but you unnecessarily read the data sets. The following will get the counts and compare them without reading all of the observations.
data _null_;
if nobs1 ne nobs2 then putlog 'WARNING: Obs counts do not match.';
stop;
set sashelp.cars nobs=nobs1;
set sashelp.class nobs=nobs2;
run;
WARNING: Obs counts do not match.
Another option is to get the counts from sashelp.vtable or dictionary.tables. Note that you can only query dictionary.tables with proc sql.
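As a rough sketch of the dictionary.tables route, assuming the two data sets from the question live in the WORK library (the macro variable names nobs1 and nobs2 are made up here):

```sas
/* Sketch: read the obs counts from the table metadata instead of the data. */
/* libname and memname are stored in upper case in dictionary.tables.       */
proc sql noprint;
    select nobs into :nobs1 trimmed
        from dictionary.tables
        where libname = 'WORK' and memname = 'MY_DATA1';
    select nobs into :nobs2 trimmed
        from dictionary.tables
        where libname = 'WORK' and memname = 'MY_DATA2';
quit;

%put NOTE: my_data1 has &nobs1 obs, my_data2 has &nobs2 obs.;
```

Note that nobs is the physical count and can include observations marked for deletion; the nlobs column holds the logical count if that distinction matters for your data.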
What I've got:
a table of 20 rows in SAS (originally 100k)
various binary attributes (columns)
What I'm looking to get:
A crosstable displaying the frequency of the attribute combinations
like this:
Attribute1 Attribute2 Attribute3 Attribute4
Attribute1 5 0 1 2
Attribute2 0 3 0 3
Attribute3 2 0 5 4
Attribute4 1 2 0 10
*The actual sum of combinations is made up and probably not 100% logical
The code I currently have:
/*create dummy data*/
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
output;
end;
run;
I guess this can be done smarter, but this seems to work. First I created a table that should hold all the frequencies:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;
Then I loop through all the combinations, inserting the count into the crosstable:
%macro lup();
%do i=1 %to 4;
%do j=&i %to 4;
proc sql noprint;
select count(*) into :Antall&i&j
from monthly_sales (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
Note that since the frequency count for (i,j)=(j,i) you do not need to do both.
I'd recommend using the built-in SAS tools for this sort of thing, and probably displaying your data slightly differently as well, unless you really want a diagonal table. e.g.
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
count = 1;
output;
end;
run;
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;
proc summary nway data = monthly_sales;
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
Either of these gives you a table with 1 row for each combination of attributes in your data, which is slightly different from what you requested, but conveys the same information. You can force proc summary to include rows for combinations of class variables that don't exist in your data by using the completetypes option in the proc summary statement.
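For instance, a sketch of the same proc summary call with completetypes added (the output name summary_all is an assumption):

```sas
/* Sketch: completetypes adds a row for every combination of the class */
/* variables, including combinations that never occur in the data.     */
proc summary nway completetypes data = monthly_sales;
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    output out = summary_all(drop = _TYPE_ _FREQ_) sum(count)= ;
run;
```

With four binary attributes this should give 16 rows. As far as I know, combinations that are absent from the data come out with a missing count rather than 0, so you may want to convert those afterwards.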
It's definitely worth taking the time to get familiar with proc summary if you're doing statistical analysis in SAS - you can include additional output statistics and process multiple variables with minimal additional code and processing overhead.
Update: it's possible to produce the desired table without resorting to macro logic, albeit a rather complex process:
proc summary data = monthly_sales completetypes;
ways 1 2; /*Calculate only 1 and 2-way summaries*/
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
/*Eliminate unnecessary output rows*/
data summary_table;
set summary_table;
array a{*} attribute:;
sum = sum(of a[*]);
missing = 0;
do i = 1 to dim(a);
missing + missing(a[i]);
a[i] = a[i] * count;
end;
/*We want rows where two attributes are both 1 (sum = 2),
or one attribute is 1 and the others are all missing*/
if sum = 2 or (sum = 1 and missing = dim(a) - 1);
drop i missing sum;
edge = _n_;
run;
/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
by edge;
var attribute:;
run;
/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
create table diagonal as
select a._name_ as aname,
b._name_ as bname,
a.col1 as count
from tr_table a, tr_table b
where a.edge = b.edge
group by a.edge
having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
order by aname, bname
;
quit;
/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
by aname;
id bname;
var count;
run;
/*Re-order variables and set missing values to zero*/
data want;
informat aname attribute1-attribute4;
set want;
array a{*} attribute:;
do i = 1 to dim(a);
a[i] = sum(a[i],0);
end;
drop i;
run;
Yeah, user667489 was right, I just added some extra code to get the cross-frequency table looking good. First, I created a table with 10 million rows and 10 variables:
data monthly_sales (drop=i);
do i=1 to 10000000;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
Attribute5=rand("Normal")>0.5;
Attribute6=rand("Normal")>0.5;
Attribute7=rand("Normal")>0.5;
Attribute8=rand("Normal")>0.5;
Attribute9=rand("Normal")>0.5;
Attribute10=rand("Normal")>0.5;
output;
end;
run;
Create an empty 10x10 crosstable:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
output;output;output;output;output;output;output;output;output;output;
run;
Create a frequency table using proc freq:
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
/ out = frequency_table;
run;
Loop through all the combinations of Attributes and sum the "count" variable. Insert it into the crosstable:
%macro lup();
%do i=1 %to 10;
%do j=&i %to 10;
proc sql noprint;
select sum(count) into :Antall&i&j
from frequency_table (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;