How to recode values of a variable based on the maxmium value in the variable, for hundreds of variables? - sas

I want to recode the max value of a variable as 1 and 0 when it is not. For each variable, there may be multiple observations with the max value. The max value for each value is not fixed, i.e. from cycle to cycle the max value for each variable may change. And there are hundreds of variables, cannot "hard-code" anything.
The final product would have the same dimensions as the original table, i.e. equal number of rows and columns as a matrix of 0s and 1s.
This is within SAS. I attempted to calculate the max of each variable and then append these max as a new observation into the data. Then comparing down the column of each variable against the "max" observation... looking into examples of the following did not help:
SQL
Array in datastep
proc transpose
formatting
Any insight would be much appreciated.

Here is a version done with SQL:
The idea is that we first calculate the maximum. The Latter select. Then we join the data to original and the outer the case-select specifies if the flag is set up or not.
data begin;
input var value;
cards;
1 1
1 2
1 3
1 2.5
1 1.7
1 3
2 34
2 33
2 33
2 33.7
2 34
2 34
; run;
proc sql;
create table result as
select a.var, a.value, case when a.value = b.maximum then 1 else 0 end as is_max from
(select * from begin) a
left join
(select max(value) as maximum, var from begin group by var) b
on a.var = b.var
;
quit;

To avoid "hard-code" you need to use some code generation.
First let's figure out what code you could use to solve the problem. Later we can look into ways to generate that code.
It is probably easiest to do this with PROC SQL code. SAS will allow you to reference the MAX() value of a variable. Also note that SAS evaluates boolean expressions to 1 (TRUE) or 0 (FALSE). So you just want to generate code like:
proc sql;
create table want as
select var1=max(var1) as var1
, var2=max(var2) as var2
from have
;
quit;
To generate the code you need a list of the variables in your source dataset. You can get those with PROC CONTENTS but also with the metadata table (view) DICTIONARY.COLUMNS (also accessible as SASHELP.VCOLUMN from outside PROC SQL).
If the list of variables is small then you could generate the code into a single macro variable.
proc sql noprint;
select catx(' ',cats(name,'=max(',name,')'),'as',name)
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='HAVE'
order by varnum
;
create table want as
select &varlist
from have
;
quit;
The maximum number of characters that will fit into a macro variable is 64K. So long enough for about 2,000 variables with names of 8 characters each.
Here is little more complex way that uses PROC SUMMARY and a data step with a temporary array. It does not really need any code generation.
%let dsin=sashelp.class(obs=10);
%let dsout=want;
%let varlist=_numeric_;
proc summary data=&dsin nway ;
var &varlist;
output out=summary(drop=_type_ _freq_) max= ;
run;
data &dsout;
if 0 then set &dsin;
array vars &varlist;
array max [10000] _temporary_;
if _n_=1 then do;
set summary ;
do _n_=1 to dim(vars);
max[_n_]=vars[_n_];
end;
end;
set &dsin;
do _n_=1 to dim(vars);
vars[_n_]=vars[_n_]=max[_n_];
end;
run;
Results:
Obs Name Sex Age Height Weight
1 Alfred M 0 1 1
2 Alice F 0 0 0
3 Barbara F 0 0 0
4 Carol F 0 0 0
5 Henry M 0 0 0
6 James M 0 0 0
7 Jane F 0 0 0
8 Janet F 1 0 1
9 Jeffrey M 0 0 0
10 John M 0 0 0

Related

Is there a way in SAS to print the value of a variable in label using proc sql?

I have a situation where I would like to put the value of a variable in the label in SAS.
Example: Median for Total_Days is 2. I would like to put this value in Days_Median_Split label. The median keeps on changing with varying data, so I would like to automate it.
Phy_Activity Total_Days "Days_Median_Split: Number of Days with Median 2"
No 0 0
No 0 0
Yes 2 1
Yes 3 1
Yes 5 1
Sample Dataset
Thanks so much!
* step 1 create data;
data have;
input Phy_Activity $ Total_Days Days_Median_Split;
datalines;
No 0 0
No 0 0
Yes 2 1
Yes 3 1
Yes 5 1
run;
*step 2 sort data on Total_days;
proc sort data = have;
by Total_days;
run;
*step 3 get count of obs;
proc sql noprint;
select count(*) into: cnt
from have;quit;
* step 4 calulate median;
%let median = %sysevalf(&cnt/2 + .5);
*step 5 get median obsevation;
proc sql noprint;
select Total_days into: medianValue
from have
where monotonic()=&median;quit;
*step 6 create label;
data have;
set have;
label Days_Median_split = 'Days_Median_split: Number of Days with Median '
%trim(&medianValue);
run;

Flagging rows that fulfill two conditions in SAS

I have a large dataset, with hundreds of variables and hundreds of observations, coming from a clinical trial. Variable V1 is a yes/no variable, indicating some condition. V2 is numeric, and representing a dose. T is a time variable. The dataset is "long" shaped, every subject has few observations, one for each time point. For every subject, I want to create a new yes/no variable (can be in a new dataset), which is yes if: V1 is "yes" in at least one time point, OR, V2 is above 0 in at least one time point. How do I do that? Thank you.
Try the following:
data ds;
set ds;
if V1="yes" or V2>0 then do;
flag="yes;
end;
else do;
flag= "no";
end;
summarize the dataset to ID level:
proc sql;
create table summary as
select ID, count(flag) as flag_cnt
from ds
where flag="yes"
group by ID;
quit;
These are the IDs which satisfy the condition
You can submit the code on the example below to verify.
Here (V1="yes" or V2>0) gives a dummy variable for eauch row. When we sum it we have the number of rows satisfying the condition you mentioned for each ID.
To have a flag, we compare the sum to 0 and put it between () to create a 0/1 variable that you want to have.
hope it helps !
MK
data have;
input ID V1 $ V2;
cards;
1 yes 0
1 no 0
1 no 0
2 no 0
2 no 0
2 no 0
3 no 1
3 no 0
4 yes 0
4 yes 0
5 yes 1
5 no 1
5 yes 0
;
run;
proc sql;
select ID
, (sum((V1="yes")or(V2>0))>0) as new_flag
from have
group by ID;
quit;
data want (keep=id flag);
flag = 'no ';
do until (last.id);
set have;
by id;
if v1 = 'yes' or v2 > 0 then flag = 'yes';
end;
output;
run;

Count number of 0 values

Similar to here, I can count the number of missing observations:
data dataset;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc means data=dataset NMISS N;
run;
But how can I also count the number of observations that are 0?
If you want to count the number of observations that are 0, you'd want to use proc tabulate or proc freq, and do a frequency count.
If you have a lot of values and you just want "0/not 0", that's easy to do with a format.
data have;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc format;
value zerof
0='Zero'
.='Missing'
other='Not Zero';
quit;
proc freq data=have;
format _numeric_ zerof.;
tables _numeric_/missing;
run;
Something along those lines. Obviously be careful about _numeric_ as that's all numeric variables and could get messy quickly if you have a lot of them...
I add this as an additional answer. It requires you to have PROC IML.
This uses matrix manipulation to do the count.
(ds=0) -- creates a matrix of 0/1 values (false/true) of values = 0
[+,] -- sums the rows for all columns. If we have 0/1 values, then this is the number of value=0 for each column.
' -- operator is transpose.
|| -- merge matrices {0} || {1} = {0 1}
Then we just print the values.
proc iml;
use dataset;
read all var _num_ into ds[colname=names];
close dataset;
ds2 = ((ds=0)[+,])`;
n = nrow(ds);
ds2 = ds2 || repeat(n,ncol(ds),1);
cnames = {"N = 0", "Count"};
mattrib ds2 rowname=names colname=cnames;
print ds2;
quit;
Easiest to use PROC SQL. You will have to use a UNION to replicate the MEANS output;
Each section of the first FROM counts the 0 values for each variable and UNION stacks them up.
The last section just counts the number of observations in DATASET.
proc sql;
select n0.Variable,
n0.N_0 label="Number 0",
n.count as N
from (
select "A" as Variable,
count(a) as N_0
from dataset
where a=0
UNION
select "B" as Variable,
count(b) as N_0
from dataset
where b=0
UNION
select "C" as Variable,
count(c) as N_0
from dataset
where c=0
) as n0,
(
select count(*) as count
from dataset
) as n;
quit;
there is levels options in proc freq you could use.
proc freq data=dataset levels;
table _numeric_;
run;

SAS sort by original order

Say you have three separate data sets consisting of the same number of observations. Each observation has an ID letter, A-Z, followed by some numerical observation. For example:
Data set 1:
B 3 8 1 9 4
C 4 1 9 3 1
A 4 4 5 4 9
Data set 2:
C 3 1 9 4 0
A 4 1 2 0 0
B 0 3 3 1 8
I want to merge the data sets BY that first variable. The problem is, the first variable is NOT already sorted in alphabetical form, and I do not want to sort it in alphabetical form. I want to merge the data but keep the original order. For example, I would get:
Merged data:
B 3 8 1 9 4
B 0 3 3 1 8
C 4 1 9 3 1
C 3 1 9 4 0
A 4 4 5 4 9
A 4 1 2 0 0
Is there any way to do this?
You can create a variable that holds the order and then apply that the new dataset after its "merged". I believe this is an append rather than merge though. I've used a format, though you could use a sql or data set merge as well.
data have1;
input id $ var1-var5;
cards;
B 3 8 1 9 4
C 4 1 9 3 1
A 4 4 5 4 9
;
run;
data have2;
input id $ var1-var5;
cards;
C 3 1 9 4 0
A 4 1 2 0 0
B 0 3 3 1 8
;
run;
data order;
set have1;
fmtname='sort_order';
type='J';
label=_n_;
start=id;
keep id fmtname type label start;
run;
proc format cntlin=order;
run;
data want;
set have1 have2;
order_var=input(id, $sort_order.);
run;
proc sort data=want;
by order_var;
run;
This is just one SQL version which follows along a similar path to Joe's answer. Row order is input via a sub-query rather than a format. However the initial order of the two input tables is lost in the join to the row order sub-query. The original order (have2 follows have1) is re-instated by using the table names as a secondary order variable.
proc sql;
create table want1 as
select want.id
,want.var1
,want.var2
,want.var3
,want.var4
,want.var5
from (
select *
, 'have1' as source
from have1
union all
select *
, 'have2' as source
from have2
) as want
left join
(
select id
, monotonic() as row_no
from have1
) as order
on want.id eq order.id
order by order.row_no
,want.source
;
quit;
proc compare
base=want1
compare=want
;
run;
And this is a data step version without a format. Here the have1 table with row order is re-merged with the concatenated data (have1 and have2) and then re-sorted by row order.
data want2;
set have1 have2;
run;
data have1;
set have1;
order_var = _n_;
run;
proc sort data=want2;
by id;
run;
proc sort data=have1;
by id;
run;
data want2;
merge want2 have1;
by id;
run;
proc sort data=want2;
by order_var;
run;
proc compare
base=want2
compare=want
;
run;

Is there a way to delete 0's from a dataset?

Suppose I want to only apply proc means or the better means macro to only non zero entries in my dataset? Is there an easy option to do this? If I have a dataset:
A B C
0 1 2
2 2 0
2 0 1
How can I use proc means or the better means macro to ignore the 0 values?
You can create a view to convert them on the fly. BETTERMEANS may have a way of handling this; not sure.
data have;
input A B C ;
format a b c zeromissing1.;
datalines;
0 1 2
2 2 0
2 0 1
;;;;
run;
data have_z/view=have_z;
set have;
array num _numeric_;
do _i = 1 to dim(num);
if num[_i]=0 then num[_i]=.;
end;
run;
proc means data=have_z;
var a b c;
run;