Given dataset 'temp' looks like this:
index code1 code2 code3
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
Then I want to make a new dataset like this:
index P1 P2 P3 P4 N1
A 1 1 1 0 0
B 1 0 1 1 0
C 0 1 0 1 1
My code is here...
%macro freq;
%do i = 1 %to 3;
%do j = 1 %to 5;
if substr(code&i.,1,1) = "P" then
if input(substr(code&i.,2,1),1.) = &j. then p&j. = 1;
if substr(code&i.,1,1) = "N" then
if input(substr(code&i.,2,1),1.) = &j. then n&j. = 1;
%end;
%end;
%mend;
But it's not cool :(
How can I create new columns whose names are the values of the variables (code1, code2, ...)?
Is there any other simple way?
How about
data have;
input (index code1 code2 code3)($);
datalines;
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
;
data temp;
set have;
array c code:;
do over c;
v = c;
d = 1;
output;
end;
run;
proc transpose data = temp out = want(drop = _:);
by index;
id v;
var d;
run;
You can achieve this without a macro by using ARRAY and the VNAME function in a DATA step.
data want;
set have;
/* Initialize flag variables. */
length P1-P4 3 N1 3;
/* Define arrays. */
array code [*] code1-code3;
array flags [*] P1-P4 N1;
/* Loop over the arrays. */
do i = 1 to dim(flags);
flags[i] = 0;
do j = 1 to dim(code);
if vname(flags[i]) = code[j] then flags[i] = 1;
end;
end;
keep index P1-P4 N1;
run;
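The flag list in the step above is typed by hand (P1-P4 N1). If the set of codes is not known in advance, it could be collected from the data first. The following is only a sketch under assumptions: every code value must be a valid SAS variable name, and the names codes, value and flag_list are made up for illustration.

```sas
/* Sketch: build the flag list from the data instead of typing P1-P4 N1. */
/* Assumes every code value is a valid SAS variable name.                */
data codes;                         /* hypothetical helper dataset */
    set have;
    array code [*] code1-code3;
    length value $32;
    do k = 1 to dim(code);
        value = code[k];
        if not missing(value) then output;
    end;
    keep value;
run;

/* Store the distinct code values, sorted, in a macro variable. */
proc sql noprint;
    select distinct value
        into :flag_list separated by ' '
        from codes
        order by value;
quit;

data want;
    set have;
    array code  [*] code1-code3;
    array flags [*] &flag_list;     /* one numeric flag per distinct code */
    do i = 1 to dim(flags);
        flags[i] = 0;
        do j = 1 to dim(code);
            if vname(flags[i]) = code[j] then flags[i] = 1;
        end;
    end;
    keep index &flag_list;
run;
```

Because the list is sorted, the flag columns come out in alphabetical order (N1 before P1), which differs from the hand-typed order.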
The simplest way to convert values into variable names is via PROC TRANSPOSE. So first convert your wide dataset into a tall dataset. You could use PROC TRANSPOSE to do that, but to build your target dataset PROC TRANSPOSE will need some numeric variable to transpose. So why not use a data step to make the tall dataset and include a numeric variable that is set to 1?
The PROC TRANSPOSE step will give you a dataset with either a 1 or a missing value for the new variables. You can use PROC STDIZE to change the missing values into zeros.
data have;
input index $ (code1-code3) (:$32.) ;
cards;
A P1 P2 P3
B P1 P3 P4
C P2 P4 N1
;
data tall;
set have ;
array code code1-code3;
length _name_ $32 dummy 8;
retain dummy 1;
do column=1 to dim(code);
_name_=code[column];
if not missing(_name_) then output;
end;
run;
proc transpose data=tall out=want(drop=_name_);
by index ;
id _name_;
var dummy;
run;
proc stdize reponly missing=0 data=want out=want;
var _numeric_;
run;
One more alternative:
proc transpose data=have out=long;
by index;
var code:;
run;
data long2;
set long;
value = 1;
run;
proc transpose data=long2 out=wide;
by index;
id col1;
var value;
run;
/* Convert missing to zeroes */
data want;
set wide;
array vars _NUMERIC_;
do over vars;
if(vars = .) then vars = 0;
end;
drop _NAME_;
run;
Output:
index P1 P2 P3 P4 N1
A 1 1 1 0 0
B 1 0 1 1 0
C 0 1 0 1 1
I have a SAS data set t3. I want to run a data step inside a loop through a set of variables to create additional sets based on the variable value = 1, and rank two variables bal and otheramt in each subset, and then merge the ranks for each subset onto the original data set. Each rank column needs to be dynamically named so I know what subset is getting ranked. I know how to do proc rank and macros basically but do not know how to do this in the most dynamic way inside of a macro. Can you assist?
ID bal otheramt firstvar secondvar lastvar
444 581 100 1 1
555 255 200 1 1 1
666 255 300 -------------- 1 --------------
%macro dog();
data new;
set t3;
ARRAY Indicators(5) FirstVar--LastVar;
/*create data set for each of the subsets if firstvar = 1, secondvar = 1 ... lastvar = 1 */
/*for each new data set, rank by bal and otheramt*/
/*name the new rank columns [FirstVar]BalRank, [FirstVar]OtherAmtRank; */
/*merge the new ranks onto the original data set by ID*/
%mend;
%dog()
The Proc rank section would be something like this, but I would need the rank columns to have information about what subset I am ranking.
proc rank data=subset1 out=subset1ranked;
var bal otheramt;
ranks bal_rank otheramt_rank;
run;
Instead of using a macro, reshape the data so that simpler steps can be written.
Example:
Rows are split into multiple rows based on the flags so that BY-group processing in PROC RANK can occur. Two transposes are required to reshape the results back to a single row per id.
data have;
call streaminit(20230216);
do id = 1 to 100;
foo = rand('integer', 50,150);
bar = rand('integer', 100,200);
flag1 = rand('integer', 0, 1);
flag2 = rand('integer', 0, 1);
flag3 = rand('integer', 0, 1);
output;
end;
run;
data step1;
set have;
/* important: the group value becomes part of the variable name later */
if flag1 then do; group='flag1_'; output; end;
if flag2 then do; group='flag2_'; output; end;
if flag3 then do; group='flag3_'; output; end;
drop flag:;
run;
proc sort data=step1;
by group;
run;
proc rank data=step1 out=step2;
by group;
var foo bar;
ranks foo_rank bar_rank;
run;
proc sort data=step2;
by id group;
run;
* pivot (reshape) so there is one row per ranked var;
proc transpose data=step2 out=step3(drop=_label_);
by id foo bar group;
var foo_rank bar_rank;
run;
* pivot again so there is one row per id;
proc transpose data=step3 out=step4(drop=_name_);
by id;
var col1;
id group _name_;
run;
* merge so those 0 0 0 flag rows remain intact;
data want;
merge have step4;
by id;
run;
Since we don't have much sample data, I created test data from sashelp.class with some indicator variables like yours.
data have;
set sashelp.class;
firstvar=round(rand('uniform',1));
secondvar=round(rand('uniform',1));
thirdvar=round(rand('uniform',1));
drop sex weight;
run;
Partial output:
Name Age Height firstvar secondvar thirdvar
Alfred 14 69 1 0 1
Alice 13 56.5 0 1 1
Barbara 13 65.3 1 0 0
Carol 14 62.8 0 0 0
To dynamically rank data based on indicator variables, I created a macro that accepts a list of indicators and rank variables. The 2 lists help to create the specific variable names you requested. Here's the macro call:
%rank(indicators=firstvar secondvar thirdvar,
rank_vars=age height);
Here's part of the final output. Notice the indicators in the sample output above coincide with the ranks in this output. Also note that Carol is not in the output because she had no indicators set to 1.
Name Age Height firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank thirdvar_height_rank
Alfred 14 69 8 11 . . 6.5 10
Alice 13 56.5 . . 3.5 2 4.5 2
Barbara 13 65.3 6.5 8 . . . .
Henry 14 63.5 . . 5.5 5 . .
The full macro is listed below. It has 3 parts.
1. Create a temp data set with a group variable that contains the number of the indicator variable based on the order of the variable in the list. Whenever an indicator = 1 the obs is output. If an obs has all 3 indicators set to 1 then it will be output 3 times with the group variable set to the number of each indicator variable. This step is important because proc rank will rank groups independently.
2. Generate the rankings on the temp data set. Each group will be ranked independently of the other groups and can be done in one step.
3. Construct the final data set by essentially transposing the ranked data into columns.
%macro rank(indicators=, rank_vars=);
%let cnt_ind = %sysfunc(countw(&indicators));
%let cnt_vars = %sysfunc(countw(&rank_vars));
data temp;
set have;
array indicators(*) &indicators;
do i = 1 to dim(indicators);
if indicators(i) = 1 then do;
group = i; * create a group based on order of indicators;
output; * an obs can be output multiple times;
end;
end;
drop i &indicators;
run;
proc sort data=temp;
by group;
run;
* Generate rankings by group;
proc rank data=temp out=ranks;
by group;
var &rank_vars;
ranks
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
proc sort data=ranks;
by name group;
run;
* Construct final data set by transposing the ranks into columns;
data want;
set ranks;
by name;
* retain statement to declare new variables and retain values;
retain
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let vars = &vars &ivar._&jvar._rank;
%end;
%end;
&vars;
if first.name then call missing (of &vars);
* option 1: build series of IF statements;
%let vars = ;
%do i = 1 %to &cnt_ind;
%let ivar = %scan(&indicators, &i);
%str(if group = &i then do;)
%do j = 1 %to &cnt_vars;
%let jvar = %scan(&rank_vars, &j);
%let newvar = &ivar._&jvar._rank;
%str(&newvar = &jvar._rank;)
%end;
%str(end;)
%end;
if last.name then output;
drop group
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
run;
%mend;
When constructing the final data set and transposing the rank variables, there are a couple of options. The first option shown above is to dynamically build a series of if statements. Here is what the code generates:
MPRINT(RANK): * option 1: build series of IF statements;
MPRINT(RANK): if group = 1 then do;
MPRINT(RANK): firstvar_age_rank = age_rank;
MPRINT(RANK): firstvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 2 then do;
MPRINT(RANK): secondvar_age_rank = age_rank;
MPRINT(RANK): secondvar_height_rank = height_rank;
MPRINT(RANK): end;
MPRINT(RANK): if group = 3 then do;
MPRINT(RANK): thirdvar_age_rank = age_rank;
MPRINT(RANK): thirdvar_height_rank = height_rank;
MPRINT(RANK): end;
The 2nd option is to use an array and mathematically calculate the index into the array by the group number and variable number. Here is the snippet of macro code to replace the if series code:
* option 2: create arrays and calculate index into array
* by group number and variable number;
array ranks(*) &vars;
array rankvars(*)
%let vars = ;
%do i = 1 %to &cnt_vars;
%let var = %scan(&rank_vars, &i);
%let vars = &vars &var._rank;
%end;
&vars;
%str(idx = dim(rankvars) * (group - 1);)
%str(do i = 1 to dim(rankvars);)
%str(ranks(idx + i) = rankvars(i);)
%str(end;)
Here is the generated code:
MPRINT(RANK): * option 2: create arrays and calculate index into array * by group number and variable number;
MPRINT(RANK): array ranks(*) firstvar_age_rank firstvar_height_rank secondvar_age_rank secondvar_height_rank thirdvar_age_rank
thirdvar_height_rank;
MPRINT(RANK): array rankvars(*) age_rank height_rank;
MPRINT(RANK): idx = dim(rankvars) * (group - 1);
MPRINT(RANK): do i = 1 to dim(rankvars);
MPRINT(RANK): ranks(idx + i) = rankvars(i);
MPRINT(RANK): end;
It takes a minute to understand the array option, but once you do, it is preferable to generating if statements. As the number of variables increases, the code generated by the array option stays the same and operates more efficiently.
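To see how the index arithmetic lands on the right slot, a small standalone step can print the mapping. This is just an illustrative sketch using the sizes from the example (3 indicators, 2 rank variables); n_vars is a made-up name.

```sas
/* Sketch: print which slot of the ranks array each (group, i) pair maps to. */
data _null_;
    n_vars = 2;                     /* age_rank and height_rank */
    do group = 1 to 3;              /* firstvar, secondvar, thirdvar */
        do i = 1 to n_vars;
            idx = n_vars * (group - 1) + i;
            put group= i= idx=;
        end;
    end;
run;
```

For instance, group 2 with i = 1 gives idx = 3, which is the third element of the ranks array, secondvar_age_rank.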
I have an HCC dataset DATA_HCC with a member ID and 79 binary variables:
Member_ID HCC1 HCC2 HCC6 HCC8 ... HCC189
XXXXXXX1 1 0 1 0 ... 0
XXXXXXX2 0 0 1 0 ... 0
XXXXXXX3 0 1 0 0 ... 1
I am trying to create an output dataset with new binary variables for every pairwise combination of those 79 variables. Each new variable indicates whether a member has both variables equal to 1.
%LET hccList = HCC1 HCC2 HCC6 HCC8 HCC9 HCC10 HCC11 HCC12 HCC17 HCC18 HCC19 HCC21 HCC22 HCC23 HCC27
HCC28 HCC29 HCC33 HCC34 HCC35 HCC39 HCC40 HCC46 HCC47 HCC48 HCC54 HCC55 HCC57 HCC58
HCC70 HCC71 HCC72 HCC73 HCC74 HCC75 HCC76 HCC77 HCC78 HCC79 HCC80 HCC82 HCC83 HCC84
HCC85 HCC86 HCC87 HCC88 HCC96 HCC99 HCC100 HCC103 HCC104 HCC106 HCC107 HCC108 HCC110
HCC111 HCC112 HCC114 HCC115 HCC122 HCC124 HCC134 HCC135 HCC136 HCC137 HCC157 HCC158
HCC161 HCC162 HCC166 HCC167 HCC169 HCC170 HCC173 HCC176 HCC186 HCC188 HCC189;
DATA COUNT_HCC; SET DATA_HCC;
ARRAY HCC [*] &hccList.;
DO i = 1 TO DIM(HCC);
DO j = i+1 TO DIM(HCC);
%LET HCC_COMBO = CATX('_', VARNAME(HCC[i]), VARNAME(HCC[j]));
&HCC_COMBO. = MIN(HCC[i], HCC[j]);
END;
END;
RUN;
I tried to use CATX function to just concat the two variable names but it didn't work.
Here is the log error that I got:
ERROR: Undeclared array referenced: CATX.
ERROR: Variable CATX has not been declared as an array.
ERROR 71-185: The VARNAME function call does not have enough arguments.
And the desired output would look like this:
Member_ID HCC1_HCC2 HCC1_HCC6 HCC1_HCC8 ... HCC188_HCC189
XXXXXXX1 0 1 0 ... 0
XXXXXXX2 0 0 0 ... 0
XXXXXXX3 0 0 0 ... 1
To achieve dynamic variable name generation, use a macro to create the variables that you need. The below code generates dynamic variable names and generates data step code to create the variables.
%macro get_hcc_combo_mins;
%do i = 1 %to %sysfunc(countw(&hccList.));
%do j = %eval(&i.+1) %to %sysfunc(countw(&hccList.));
%let hcc1 = %scan(&hccList., &i.);
%let hcc2 = %scan(&hccList., &j.);
&hcc1._&hcc2. = min(&hcc1., &hcc2.);
%end;
%end;
%mend;
DATA COUNT_HCC; SET DATA_HCC;
ARRAY HCC [*] &hccList.;
%get_hcc_combo_mins;
RUN;
The macro %get_hcc_combo_mins generates this code in the data step:
HCC1_HCC2 = min(HCC1, HCC2);
HCC1_HCC6 = min(HCC1, HCC6);
HCC1_HCC8 = min(HCC1, HCC8);
...
There may be other ways to do this all within one data step that I'm not aware of, but macros can get the job done.
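Before running the step it may help to know how many assignments the macro will generate. A minimal sketch, assuming &hccList. is defined as in the question (n_hcc and n_pairs are illustrative names):

```sas
/* Sketch: count the names in the list and the number of unordered pairs. */
%let n_hcc   = %sysfunc(countw(&hccList.));
%let n_pairs = %sysfunc(comb(&n_hcc., 2));
%put NOTE: &n_hcc. variables yield &n_pairs. pairwise flags.;

options mprint;   /* echo the statements generated by the macro to the log */
```

With the 79 names listed in the question this works out to 79 * 78 / 2 = 3081 new variables, so the output dataset becomes quite wide.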
A DATA step with CALL LEXCOMB can generate the variable name pairs, and CALL EXECUTE can submit a statement using those names.
Example:
Assume the variable names start with HCC, but which specific ones exist is not known a priori.
data have;
call streaminit(1234);
do id = 1 to 100;
array hcc hcc1 hcc3 hcc5 hcc7 hcc10-hcc79 hcc150 hcc155 hcc180 hcc190-hcc191;
do over hcc;
hcc = rand('uniform', dim(hcc)) < _i_;
end;
output;
end;
run;
data _null_;
set have;
array hcc hcc:;
do _n_ = 1 to dim(hcc);
hcc(_n_) = _n_;
end;
call execute("data pairwise; set have;");
do _n_ = 1 to comb(dim(hcc),2);
call lexcomb(_n_, 2, of hcc(*));
index1 = hcc(1);
index2 = hcc(2);
name1 = vname(hcc(index1));
name2 = vname(hcc(index2));
put name1=;
call execute (cats(
catx( '_',name1,name2),
'=',
catx(' and ',name1,name2),
';'
));
end;
call execute('run;');
stop;
run;
See if you can use this as a template.
/* Example data */
data have (drop = i j);
array h {*} HCC1 HCC2 HCC6 HCC8 HCC9 HCC10 HCC11 HCC12 HCC17 HCC18 HCC19 HCC21 HCC22 HCC23 HCC27
HCC28 HCC29 HCC33 HCC34 HCC35 HCC39 HCC40 HCC46 HCC47 HCC48 HCC54 HCC55 HCC57 HCC58
HCC70 HCC71 HCC72 HCC73 HCC74 HCC75 HCC76 HCC77 HCC78 HCC79 HCC80 HCC82 HCC83 HCC84
HCC85 HCC86 HCC87 HCC88 HCC96 HCC99 HCC100 HCC103 HCC104 HCC106 HCC107 HCC108 HCC110
HCC111 HCC112 HCC114 HCC115 HCC122 HCC124 HCC134 HCC135 HCC136 HCC137 HCC157 HCC158
HCC161 HCC162 HCC166 HCC167 HCC169 HCC170 HCC173 HCC176 HCC186 HCC188 HCC189;
do i = 1 to 10;
do j = 1 to dim (h);
h [j] = rand('uniform') > .5;
end;
output;
end;
run;
/* Create long version of output data */
data temp (drop = i j);
set have;
array a {*} HC:;
do i = 1 to dim (a)-1;
do j = i+1 to dim (a);
v = catx('_', vname (a[i]), vname (a[j]));
d = a [i] * a [j];
n = _N_;
output;
end;
end;
run;
/* Transpose to wide format */
proc transpose data=temp out=temp2 (drop=_: n);
by n;
id v;
var d;
run;
/* Merge back with original data */
data want;
merge have temp2;
run;
I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 0011
The position of the elements of the dept_vect is organized alphabetically. So a '1' in the first position means that the id is found in department A and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in the department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works but there are a couple of problems. For col4 to appear, at least one id must be found in all departments, although that could be fixed by creating a dummy id that is found in all departments. The main problem is that this code is not robust. Is there a way to code this so that it works for any number of departments?
1. Add a 1 to get a count variable
2. Transpose using PROC TRANSPOSE
3. Replace missing with 0
4. Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=have out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
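One thing to watch: PROC TRANSPOSE creates the dept columns in the order the values are first encountered, so the array built from dept: is not necessarily alphabetical. A possible variation, sketched under the assumption that have and wide exist as above (dept_vars and dept_vect are illustrative names), fixes the order explicitly:

```sas
/* Sketch: list the dept columns alphabetically so the vector is ordered A, B, C, ... */
proc sql noprint;
    select distinct cats('dept', dept)
        into :dept_vars separated by ' '
        from have
        order by 1;
quit;

data want;
    set wide;
    array _d(*) &dept_vars;         /* e.g. deptA deptB deptC deptD */
    length dept_vect $ 200;         /* assumes at most 200 departments */
    do i = 1 to dim(_d);
        if missing(_d(i)) then _d(i) = 0;
    end;
    dept_vect = cats(of _d(*));
    keep id dept_vect;
run;
```

For the sample data, id 5 (departments C and D) would then get 0011.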
Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;
I want a mean based on the non-zero values of given variables, using PROC MEANS only.
I know we can calculate this using PROC SQL, but I want to get it done through PROC MEANS or PROC SUMMARY.
In my study I have 8 variables, so how can I calculate the mean based on non-zero values when I am using all of them in the VAR statement, as below?
proc means data=xyz;
var var1 var2 var3 var4 var5 var6 var7 var8;
run;
If we take one variable at a time in the VAR statement and use a WHERE condition to exclude zeros, it works, but is there something that would work for all the variables of interest listed in the VAR statement?
Your suggestions would be highly appreciated.
Thank you !
One method is to change all of your zero values to missing, and then use PROC MEANS.
data zeromiss /view=zeromiss ;
set xyz ;
array n{*} var1-var8 ;
do i = 1 to dim(n) ;
if n{i} = 0 then call missing(n{i}) ;
end ;
drop i ;
run ;
proc means data=zeromiss ;
var var1-var8 ;
run ;
Create a view of your input dataset. In the view, define a weight variable for each variable you want to summarise. Set the weight to 0 if the corresponding variable is 0 and 1 otherwise. Then do a weighted summary via proc means / proc summary. E.g.
data xyz_v /view = xyz_v;
set xyz;
array weights {*} weight_var1-weight_var8;
array vars {*} var1-var8;
do i = 1 to dim(vars);
weights[i] = (vars[i] ne 0);
end;
run;
%macro weighted_var(n);
%do i = 1 %to &n;
var var&i /weight = weight_var&i;
%end;
%mend weighted_var;
proc means data = xyz_v;
%weighted_var(8);
run;
This is less elegant than Chris J's solution for this specific problem, but it generalises slightly better to other situations where you want to apply different weightings to different variables in the same summary.
Can't you use a DATA step?
data lala;
set xyz;
drop qty;
mean = 0;
qty = 0;
if(not missing(var1) and var1 ^= 0) then do;
mean + var1;
qty + 1;
end;
if(not missing(var2) and var2 ^= 0) then do;
mean + var2;
qty + 1;
end;
/* ... repeat to all variables ... */
if(not missing(var8) and var8 ^= 0) then do;
mean + var8;
qty + 1;
end;
mean = mean/qty;
run;
If you want to keep the mean in the same xyz dataset, just replace lala with xyz.
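The repeated IF blocks could probably be collapsed with an array. This is a sketch of the same idea, not a replacement for PROC MEANS: like the step above, it computes one mean per row across var1-var8, and the names v and total are illustrative.

```sas
/* Sketch: per-row mean of the non-zero, non-missing values of var1-var8. */
data lala;
    set xyz;
    array v {*} var1-var8;
    total = 0;
    qty   = 0;
    do i = 1 to dim(v);
        /* skip missing values and zeros */
        if not missing(v{i}) and v{i} ne 0 then do;
            total = total + v{i};
            qty   = qty + 1;
        end;
    end;
    /* leave mean missing when a row has no non-zero values */
    if qty > 0 then mean = total / qty;
    drop i total qty;
run;
```

Guarding the division avoids a division-by-zero note in the log for rows where every variable is zero or missing.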
Could you help me count, for each variable, the number of observations where the variable = 1? I posted how I calculate the number of missing values here. Hopefully this can be done in a similar way. Thanks in advance.
/*y00*/
%let list0=OCALZHMR OCARTERY OCARTH OCCHD OCDIABTS OCHBP OCMENTAL OCMYOCAR
OCOTHART OCPSYCH OCSTROKE;
/*y01 and y02*/
%let list1=D_CFAIL D_CHD D_HBP D_MYOCAR D_OTHHRT D_PSYCH D_RHYTHM D_STROKE
D_VALVE OCALZHMR OCARTERY OCARTH OCCHD OCDIABTS OCHBP OCMENTAL OCMYOCAR
OCOTHART OCPSYCH OCSTROKE;
proc means data=cohort00 nmiss noprint;
var &list0;
output out=y2000_nmiss(drop=_:) nmiss= ;
run;
proc means data=cohort01 nmiss noprint;
var &list1;
output out=y2001_nmiss(drop=_:) nmiss= ;
run;
data y2000_nmiss;
set y2000_nmiss;
j=1;
run;
data y2001_nmiss;
set y2001_nmiss;
j=1;
run;
proc transpose data=y2000_nmiss out=long0(rename=(COL1=Y2000 _name_=VAR));
by j;
run;
proc transpose data=y2001_nmiss out=long1(rename=(COL1=Y2001 _name_=VAR));
by j;
run;
data ATC_missing;
merge long0 long1;
by VAR;
drop j;
run;
Here is the part of output table for the number of missing :
VARS Y2000 Y2001 Y2002
OCDIABTS 0 1 0
OCHBP 0 0 0
OCMENTAL 17 18 10
OCMYOCAR 0 0 0
OCOTHART 0 0 4758
OCOTHHRT . . .
OCPSYCH 0 0 0
%let list1=Width Length Depth;
data work.is_even / view=work.is_even;
set sashelp.lake;
array vars {*} &list1 ;
drop i;
do i=1 to dim(vars);
if mod(round(vars(i), 1),2) = 0 /* would be VARS(I)=1 for your case */
then vars(i)=1;
else vars(i)=.;
end;
run;
proc means data=work.is_even n;
run;
First I create a DATA step view work.is_even (a view, in order to avoid a full copy of the data) that manipulates the data the way I need.
Here it overwrites the original values of the variables (not in the original data, just in the view) with 1 if the rounded value is even and with missing if it is odd.
Then just count the nonmissing values (the N statistic in PROC MEANS).
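Applied to the question's data, the same pattern might look like the sketch below. It assumes the variables in &list0 are numeric and that cohort00 exists as in the question; the names is_one and y2000_n1 are made up for illustration.

```sas
/* Sketch: keep only the 1s in a view, then count nonmissing values per variable. */
data work.is_one / view=work.is_one;
    set cohort00;
    array vars {*} &list0;
    do i = 1 to dim(vars);
        if vars(i) ne 1 then vars(i) = .;   /* anything other than 1 becomes missing */
    end;
    drop i;
run;

proc means data=work.is_one n noprint;
    var &list0;
    output out=y2000_n1(drop=_:) n= ;       /* one row: count of 1s per variable */
run;
```

The one-row result has the same shape as y2000_nmiss, so it could go through the same transpose and merge steps used for the missing counts.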