how to create variables that names are concat with two array variable names - sas

I have a HCC dataset DATA_HCC that with member ID and 79 binary variables:
Member_ID HCC1 HCC2 HCC6 HCC8 ... HCC189
XXXXXXX1 1 0 1 0 ... 0
XXXXXXX2 0 0 1 0 ... 0
XXXXXXX3 0 1 0 0 ... 1
I am trying to create a output dataset that could create new binary variables for all the combination of those 79 variables. Each new variable represents if a member had both of the variables as 1.
%LET hccList = HCC1 HCC2 HCC6 HCC8 HCC9 HCC10 HCC11 HCC12 HCC17 HCC18 HCC19 HCC21 HCC22 HCC23 HCC27
HCC28 HCC29 HCC33 HCC34 HCC35 HCC39 HCC40 HCC46 HCC47 HCC48 HCC54 HCC55 HCC57 HCC58
HCC70 HCC71 HCC72 HCC73 HCC74 HCC75 HCC76 HCC77 HCC78 HCC79 HCC80 HCC82 HCC83 HCC84
HCC85 HCC86 HCC87 HCC88 HCC96 HCC99 HCC100 HCC103 HCC104 HCC106 HCC107 HCC108 HCC110
HCC111 HCC112 HCC114 HCC115 HCC122 HCC124 HCC134 HCC135 HCC136 HCC137 HCC157 HCC158
HCC161 HCC162 HCC166 HCC167 HCC169 HCC170 HCC173 HCC176 HCC186 HCC188 HCC189;
DATA COUNT_HCC; SET DATA_HCC;
ARRAY HCC [*] &hccList.;
DO i = 1 TO DIM(HCC);
DO j = i+1 TO DIM(HCC);
%LET HCC_COMBO = CATX('_', VARNAME(HCC[i]), VARNAME(HCC[j]));
&HCC_COMBO. = MIN(HCC[i], HCC[j]);
END;
END;
RUN;
I tried to use CATX function to just concat the two variable names but it didn't work.
Here is the log error that I got:
ERROR: Undeclared array referenced: CATX.
ERROR: Variable CATX has not been declared as an array.
ERROR 71-185: The VARNAME function call does not have enough arguments.
And the results output sample would like this:
Member_ID HCC1_HCC2 HCC1_HCC6 HCC1_HCC8 ... HCC188_HCC189
XXXXXXX1 0 1 0 ... 0
XXXXXXX2 0 0 0 ... 0
XXXXXXX3 0 0 0 ... 1

To achieve dynamic variable name generation, use a macro to create the variables that you need. The below code generates dynamic variable names and generates data step code to create the variables.
%macro get_hcc_combo_mins;
%do i = 1 %to %sysfunc(countw(&hccList.));
%do j = %eval(&i.+1) %to %sysfunc(countw(&hccList.));
%let hcc1 = %scan(&hccList., &i.);
%let hcc2 = %scan(&hccList., &j.);
&hcc1._&hcc2. = min(&hcc1., &hcc2.);
%end;
%end;
%mend;
DATA COUNT_HCC; SET DATA_HCC;
ARRAY HCC [*] &hccList.;
%get_hcc_combo_mins;
RUN;
The macro %get_hcc_combo_mins generates this code in the data step:
HCC1_HCC2 = min(HCC1, HCC2);
HCC1_HCC6 = min(HCC1, HCC6);
HCC1_HCC8 = min(HCC1, HCC8);
...
There may be other ways to do this all within one data step that I'm not aware of, but macros can get the job done.

A DATA Step with LEXCOMB can generate variable name pairs. CALL EXECUTE submit a statement using those names.
Example:
Presume HCC: variable names, which specific ones not known apriori.
data have;
call streaminit(1234);
do id = 1 to 100;
array hcc hcc1 hcc3 hcc5 hcc7 hcc10-hcc79 hcc150 hcc155 hcc180 hcc190-hcc191;
do over hcc;
hcc = rand('uniform', dim(hcc)) < _i_;
end;
output;
end;
run;
data _null_;
set have;
array hcc hcc:;
do _n_ = 1 to dim(hcc);
hcc(_n_) = _n_;
end;
call execute("data pairwise; set have;");
do _n_ = 1 to comb(dim(hcc),2);
call lexcomb(_n_, 2, of hcc(*));
index1 = hcc(1);
index2 = hcc(2);
name1 = vname(hcc(index1));
name2 = vname(hcc(index2));
put name1=;
call execute (cats(
catx( '_',name1,name2),
'=',
catx(' and ',name1,name2),
';'
));
end;
call execute('run;');
stop;
run;

See if you can use this as a template.
/* Example data */
data have (drop = i j);
array h {*} HCC1 HCC2 HCC6 HCC8 HCC9 HCC10 HCC11 HCC12 HCC17 HCC18 HCC19 HCC21 HCC22 HCC23 HCC27
HCC28 HCC29 HCC33 HCC34 HCC35 HCC39 HCC40 HCC46 HCC47 HCC48 HCC54 HCC55 HCC57 HCC58
HCC70 HCC71 HCC72 HCC73 HCC74 HCC75 HCC76 HCC77 HCC78 HCC79 HCC80 HCC82 HCC83 HCC84
HCC85 HCC86 HCC87 HCC88 HCC96 HCC99 HCC100 HCC103 HCC104 HCC106 HCC107 HCC108 HCC110
HCC111 HCC112 HCC114 HCC115 HCC122 HCC124 HCC134 HCC135 HCC136 HCC137 HCC157 HCC158
HCC161 HCC162 HCC166 HCC167 HCC169 HCC170 HCC173 HCC176 HCC186 HCC188 HCC189;
do i = 1 to 10;
do j = 1 to dim (h);
h [j] = rand('uniform') > .5;
end;
output;
end;
run;
/* Create long version of output data */
data temp (drop = i j);
set have;
array a {*} HC:;
do i = 1 to dim (a)-1;
do j = i+1 to dim (a);
v = catx('_', vname (a[i]), vname (a[j]));
d = a [i] * a [j];
n = _N_;
output;
end;
end;
run;
/* Transpose to wide format */
proc transpose data=temp out=temp2 (drop=_: n);
by n;
id v;
var d;
run;
/* Merge back with original data */
data want;
merge have temp2;
run;

Related

Loop through SAS variables and create data sets

I have a SAS data set t3. I want to run a data step inside a loop through a set of variables to create additional sets based on the variable value = 1, and rank two variables bal and otheramt in each subset, and then merge the ranks for each subset onto the original data set. Each rank column needs to be dynamically named so I know what subset is getting ranked. I know how to do proc rank and macros basically but do not know how to do this in the most dynamic way inside of a macro. Can you assist?
ID
bal
otheramt
firstvar
secondvar
lastvar
444
581
100
1
1
555
255
200
1
1
1
666
255
300
--------------
1
--------------
%macro dog();
data new;
set t3;
ARRAY Indicators(5) FirstVar--LastVar;
/*create data set for each of the subsets if firstvar = 1, secondvar = 1 ... lastvar = 1 */
/*for each new data set, rank by bal and otheramt*/
/*name the new rank columns [FirstVar]BalRank, [FirstVar]OtherAmtRank; */
/*merge the new ranks onto the original data set by ID*/
%mend;
%dog()
The Proc rank section would be something like this, but I would need the rank columns to have information about what subset I am ranking.
proc rank data=subset1 out=subset1ranked;
var bal otheramt;
ranks bal_rank otheramt_rank;
run;
Instead of using macro, use data transformation and reshaping that allows simpler steps to be written.
Example:
Rows are split into multiple rows based on flag so group processing in RANK can occur. Two transposes are required to reshape the results back a single row per id.
data have;
call streaminit(20230216);
do id = 1 to 100;
foo = rand('integer', 50,150);
bar = rand('integer', 100,200);
flag1 = rand('integer', 0, 1);
flag2 = rand('integer', 0, 1);
flag3 = rand('integer', 0, 1);
output;
end;
run;
data step1;
set have;
/* important: the group value becomes part of the variable name later */
if flag1 then do; group='flag1_'; output; end;
if flag2 then do; group='flag2_'; output; end;
if flag3 then do; group='flag3_'; output; end;
drop flag:;
run;
proc sort data=step1;
by group;
run;
proc rank data=step1 out=step2;
by group;
var foo bar;
ranks foo_rank bar_rank;
run;
proc sort data=step2;
by id group;
run;
* pivot (reshape) so there is one row per ranked var;
proc transpose data=step2 out=step3(drop=_label_);
by id foo bar group;
var foo_rank bar_rank;
run;
* pivot again so there is one row per id;
proc transpose data=step3 out=step4(drop=_name_);
by id;
var col1;
id group _name_;
run;
* merge so those 0 0 0 flag rows remain intact;
data want;
merge have step4;
by id;
run;
Since we don't have much sample data, I created test data from sashelp.class with some indicator variables like yours.
data have;
set sashelp.class;
firstvar=round(rand('uniform',1));
secondvar=round(rand('uniform',1));
thirdvar=round(rand('uniform',1));
drop sex weight;
run;
Partial output:
Name Age Height firstvar secondvar thirdvar
Alfred 14 69 1 0 1
Alice 13 56.5 0 1 1
Barbara 13 65.3 1 0 0
Carol 14 62.8 0 0 0
To dynamically rank data based on indicator variables, I created a macro that accepts a list of indicators and rank variables. The 2 lists help to create the specific variable names you requested. Here's the macro call:
%rank(indicators=firstvar secondvar thirdvar,
rank_vars=age height);
Here's part of the final output. Notice the indicators in the sample output above coincide with the ranks in this output. Also note that Carol is not in the output because she had no indicators set to 1.
Name Age Height firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank thirdvar_height_rank
Alfred 14 69 8 11 . . 6.5 10
Alice 13 56.5 . . 3.5 2 4.5 2
Barbara 13 65.3 6.5 8 . . . .
Henry 14 63.5 . . 5.5 5 . .
The full macro is listed below. It has 3 parts.
Create a temp data set with a group variable that contains the number of the indicator variable based on the order of the variable in the list. Whenever an indicator = 1 the obs is output. If an obs has all 3 indicators set to 1 then it will be output 3 times with the group variable set to the number of each indicator variable. This step is important because proc rank will rank groups independently.
Generate the rankings on the temp data set. Each group will be ranked independently of the other groups and can be done in one step.
Construct the final data set by essentially transposing the ranked data into columns.
%macro rank(indicators=, rank_vars=);
%let cnt_ind = %sysfunc(countw(&indicators));
%let cnt_vars = %sysfunc(countw(&rank_vars));
data temp;
set have;
array indicators(*) &indicators;
do i = 1 to dim(indicators);
if indicators(i) = 1 then do;
group = i; * create a group based on order of indicators;
output; * an obs can be output multiple times;
end;
end;
drop i &indicators;
run;
proc sort data=temp;
by group;
run;
* Generate rankings by group;
proc rank data=temp out=ranks;
by group;
var &rank_vars;
ranks
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
proc sort data=ranks;
by name group;
run;
* Contruct final data set by transposing the ranks into columns;
data want;
set ranks;
by name;
* retain statement to declare new variables and retain values;
retain
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let vars = &vars &ivar._&jvar._rank;
%end;
%end;
&vars;
if first.name then call missing (of &vars);
* option 1: build series of IF statements;
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%str(if group = &i then do;)
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let newvar = &ivar._&jvar._rank;
%str(&newvar = &jvar._rank;)
%end;
%str(end;)
%end;
if last.name then output;
drop group
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
%mend;
When constructing the final data set and transposing the rank variables, there are a couple of options. The first option shown above is to dynamically build a series of if statements. Here is what the code generates:
MPRINT(RANK): * option 1: build series of IF statements;
MPRINT(RANK): if group = 1 then do;
MPRINT(RANK): firstvar_age_rank = age_rank;
MPRINT(RANK): firstvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 2 then do;
MPRINT(RANK): secondvar_age_rank = age_rank;
MPRINT(RANK): secondvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 3 then do;
MPRINT(RANK): thirdvar_age_rank = age_rank;
MPRINT(RANK): thirdvar_height_rank = height_rank;
MPRINT(RANK): end;
The 2nd option is to use an array and mathematically calculate the index into the array by the group number and variable number. Here is the snippet of macro code to replace the if series code:
* option 2: create arrays and calculate index into array
* by group number and variable number;
array ranks(*) &vars;
array rankvars(*)
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
%str(idx = dim(rankvars) * (group - 1);)
%str(do i = 1 to dim(rankvars);)
%str(ranks(idx + i) = rankvars(i);)
%str(end;)
Here is the generated code:
MPRINT(RANK): * option 2: create arrays and calculate index into array * by group number and variable number;
MPRINT(RANK): array ranks(*) firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank
thirdvar_height_rank;
MPRINT(RANK): array rankvars(*) age_rank height_rank;
MPRINT(RANK): idx = dim(rankvars) * (group - 1);
MPRINT(RANK): do i = 1 to dim(rankvars);
MPRINT(RANK): ranks(idx + i) = rankvars(i);
MPRINT(RANK): end;
It takes a minute to understand the array option, but once you do, it is preferable over generating if statments. As the number of variables increases, the code generated by the array option is the same and operates more efficiently.

(SAS) how to name the column with the value of other variable

given dataset 'temp' looks like this..
index
code1
code2
code3
A
P1
P2
P3
B
P1
P3
P4
C
P2
P4
N1
then I want to make new dataset like this
index
P1
P2
P3
P4
n1
A
1
1
1
0
0
B
1
0
1
1
0
C
0
1
0
1
1
My code is here...
%macro freq;
%do i = 1 %to 3;
%do j = 1 %to 5;
if substr(code&i.,1,1) = "P" then
if input(substr(code&i.,2,1),1.) = &j. then p&j. = 1;
if substr(code&i.,1,1) = "N" then
if input(substr(code&i.,2,1),1.) = &j. then n&j. = 1;
%end;
%end;
%mend;
But it's not cool :(
How can I create a new column whose name is the value of variables(code1, code2,...)?
Is there any other simple way?
How about
data have;
input (index code1 code2 code3)($);
datalines;
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
;
data temp;
set have;
array c code:;
do over c;
v = c;
d = 1;
output;
end;
run;
proc transpose data = temp out = want(drop = _:);
by index;
id v;
var d;
run;
You can achieve this without a macro by using ARRAY and the VNAME function in a DATA step.
data want;
set have;
/* Initialize flag variables. */
length P1-P4 3 N1 3;
/* Define arrays. */
array code [*] code1-code3;
array flags [*] P1-P4 N1;
/* Loop over the arrays. */
do i = 1 to dim(flags);
flags[i] = 0;
do j = 1 to dim(code);
if vname(flags[i]) = code[j] then flags[i] = 1;
end;
end;
keep index P1-P4 N1;
run;
The simplest way to convert values into variable names is via PROC TRANSPOSE. So first convert your wide dataset into a tall dataset. You could use PROC TRANSPOSE to do that, but to make your target dataset PROC TRANSPOSE will need some numeric variable to transpose. So why not use a data step to make the tall dataset and include a numeric variable that is set to 1.
The PROC TRANSPOSE step will give you a dataset with either a 1 or a missing value for the new variables. You can use PROC STDIZE to change the missing values into zeros.
data have;
input index $ (code1-code3) (:$32.) ;
cards;
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
;
data tall;
set have ;
array code code1-code3;
length _name_ $32 dummy 8;
retain dummy 1;
do column=1 to dim(code);
_name_=code[column];
if not missing(_name_) then output;
end;
run;
proc transpose data=tall out=want(drop=_name_);
by index ;
id _name_;
var dummy;
run;
proc stdize reponly missing=0 data=want ;
var _numeric_;
run;
One more alternative:
proc transpose data=have out=long;
by index;
var code:;
run;
data long2;
set long;
value = 1;
run;
proc transpose data=long2 out=wide;
by index;
id col1;
var value;
run;
/* Convert missing to zeroes */
data want;
set wide;
array vars _NUMERIC_;
do over vars;
if(vars = .) then vars = 0;
end;
drop _NAME_;
run;
Output:
index P1 P2 P3 P4 N1
A 1 1 1 0 0
B 1 0 1 1 0
C 0 1 0 1 1

sas macro : how to make multiful column using macro?

I'm SAS user.
I want to assign year columns using date values.
for example, here is my code, below.
I want to make Y_2010, Y_2011, Y_2012 , Y_2013, Y_2014 in work.total data set.
but there is only Y_2014 as a result.
How can I change the code as I can get right result which I intended first?
options mcompilenote = all;
%let a = Y_ ;
%macro B(YMIN, YMAX) ;
%do i = &YMIN %to &YMAX ;
DATA TOTAL ;
SET SASUSER.EMPDATA ;
IF YEAR(HIREDATE) = &i THEN &a&i = 1 ;
ELSE &a&i = 0 ;
RUN;
%end;
%mend;
%B (2010, 2014) ;
Because you are repeatedly re-creating the output dataset only the final version is available. To fix the macro move the %DO loop inside the DATA step so that you are generating all of the variables in a single data step.
%macro B(YMIN, YMAX) ;
DATA TOTAL ;
SET SASUSER.EMPDATA ;
%do i = &YMIN %to &YMAX ;
IF YEAR(HIREDATE) = &i THEN &a&i = 1 ;
ELSE &a&i = 0 ;
%end;
RUN;
%mend;
But there is no need to a macro for this. Just use normal SAS statements. For example you could use an ARRAY statement to define the variables and then loop over the array and set the values. Note that the result of a boolean expression in SAS is 0 when false and 1 when true so you can eliminate the IF/THEN/ELSE statement and just use an assignment statement.
DATA TOTAL ;
SET SASUSER.EMPDATA ;
array &a &a&ymin - &a&ymax;
do i=&ymin to &ymax ;
&a[i-&ymin+1] = (year(hiredata)=i);
end;
drop i;
RUN;

Generate multiple lags through loops in SAS?

I'm trying to generate 20 lags for a variable.
To generate the first lag, I use the following statement:
data temp.data2;
set temp.data1;
by gvkey fyear;
lag1 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(mv),.);
lag2 = ifn(gvkey=lag(gvkey) and fyear=lag(fyear)+1,lag(lag1),.);
etc.
run;
Don't want to repeat 20 times. Is there a way to do this through a loop?
Thanks a lot!
You would have to maintain your own array of mv values and assign the lag values from that. The array would be bubbled for each row processed and reset at the start of an fyear group.
Example:
data have;
do gvkey = 1 to 5;
do fyear = 1 to 5;
do day = 1 to ifn(fyear=3, 10, 30);
mv = 366-day;
output;
end;
end;
end;
run;
data want;
set have;
by gvkey fyear;
array mvs(20) _temporary_;
array lags(20) lag1-lag20;
if first.fyear then call missing(of mvs(*));
* assign lags;
do _n_ = 1 to dim(lags);
lags(_n_) = mvs(_n_);
end;
* bubble mvs;
do _n_ = dim(lags) to 2 by -1;
mvs(_n_) = mvs(_n_-1);
end;
mvs(1) = mv;
run;

Automate check for number of distinct values SAS

Looking to automate some checks and print some warnings to a log file. I think I've gotten the general idea but I'm having problems generalising the checks.
For example, I have two datasets my_data1 and my_data2. I wish to print a warning if nobs_my_data2 < nobs_my_data1. Additionally, I wish to print a warning if the number of distinct values of the variable n in my_data2 is less than 11.
Some dummy data and an attempt of the first check:
%LET N = 1000;
DATA my_data1(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N - 100;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA my_data2(keep = i u x n);
a = -1;
b = 1;
max = 10;
do i = 1 to &N;
u = rand("Uniform"); /* decimal values in (0,1) */
x = a + (b-a) * u; /* decimal values in (a,b) */
n = floor((1 + max) * u); /* integer values in 0..max */
OUTPUT;
END;
RUN;
DATA _NULL_;
FILE "\\filepath\log.txt" MOD;
SET my_data1 NOBS = NOBS1 my_data2 NOBS = NOBS2 END = END;
IF END = 1 THEN DO;
PUT "HERE'S A HEADER LINE";
END;
IF NOBS1 > NOBS2 AND END = 1 THEN DO;
PUT "WARNING!";
END;
IF END = 1 THEN DO;
PUT "HERE'S A FOOTER LINE";
END;
RUN;
How can I set up the check for the number of distinct values of n in my_data2?
A proc sql way to do it -
%macro nobsprint(tab1,tab2);
options nonotes; *suppresses all notes;
proc sql;
select count(*) into:nobs&tab1. from &tab1.;
select count(*) into:nobs&tab2. from &tab2.;
select count(distinct n) into:distn&tab2. from &tab2.;
quit;
%if &&nobs&tab2. < &&nobs&tab1. %then %put |WARNING! &tab2. has less recs than &tab1.|;
%if &&distn&tab2. < 11 %then %put |WARNING! distinct VAR n count in &tab2. less than 11|;
options notes; *overrides the previous option;
%mend nobsprint;
%nobsprint(my_data1,my_data2);
This would break if you have to specify libnames with the datasets due to the .. And, you can use proc printto log to print it to a file.
For your other part as to just print the %put use the above as a call -
filename mylog temp;
proc printto log=mylog; run;
options nomprint nomlogic;
%nobsprint(my_data1,my_data2);
proc printto; run;
This won't print any erroneous text to SAS log other than your custom warnings.
#samkart provided perhaps the most direct, easily understood way to compare the obs counts. Another consideration is performance. You can get them without reading the entire data set if your data set has millions of obs.
One method is to use nobs= option in the set statement like you did in your code, but you unnecessarily read the data sets. The following will get the counts and compare them without reading all of the observations.
62 data _null_;
63 if nobs1 ne nobs2 then putlog 'WARNING: Obs counts do not match.';
64 stop;
65 set sashelp.cars nobs=nobs1;
66 set sashelp.class nobs=nobs2;
67 run;
WARNING: Obs counts do not match.
Another option is to get the counts from sashelp.vtable or dictionary.tables. Note that you can only query dictionary.tables with proc sql.