How to delete variables with huge percent of missings in table in SAS? - sas

I have table in SAS with missing values like below:
col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111 | | abc | ... | abc
222 | 11 | C1 | ... | 11
333 | 18 | | ... | 12
... | ... | ... | ... | ...
And I need to delete from above table variables where is more than 80% missing values (>=80%).
How can I do taht in SAS ?

The macro below will create a macro variable named &drop_vars that holds a list of variables to drop from your dataset that exceed missing threshold. This works for both character and numeric variables. If you have a ton of them then this macro will fail but it can easily be modified to handle any number of variables. You can save and reuse this macro.
%macro get_missing_vars(lib=, dsn=, threshold=);
%global drop_vars;
/* Generate a select statement that calculates the proportion missing:
nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */
proc sql noprint;
select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
into :calculate_pct_missing separated by ','
from dictionary.columns
where libname = upcase("&lib")
AND memname = upcase("&dsn")
;
quit;
/* Calculate the percent missing */
proc sql;
create table pct_missing as
select &calculate_pct_missing.
from &lib..&dsn.
;
quit;
/* Convert to a long table */
proc transpose data=pct_missing out=drop_list;
var _NUMERIC_;
run;
/* Get a list of variables to drop that are >= the drop threshold */
proc sql noprint;
select _NAME_
into :drop_vars separated by ' '
from drop_list
where COL1 GE &threshold.
;
quit;
%mend;
It has three parameters:
lib: Library of your dataset
dsn: Dataset name without the library
threshold: Proportion of missing values a variable must meet or exceed to be dropped
For example, let's generate some sample data and use this. col1 col2 col3 all have 80% missing values.
data have;
array col[10];
do i = 1 to 10;
do j = 1 to 10;
col[j] = i;
if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
end;
output;
end;
drop i j;
run;
We'll run the macro and check the log:
%get_missing_vars(lib=work, dsn=have, threshold=0.8);
%put &drop_vars;
The log shows:
col1 col2 col3
Now we can pass this into a simple data step.
data want;
set have;
drop &drop_vars;
run;

Related

Proc REPORT move group value (row header) closer to totals

I have some data that is structured as below. I need to create a table with subtotals, a total column that's TypeA + TypeB and a header that spans the columns as a table title. Also, it would be ideal to show different names in the column headings rather than the variable name from the dataset.
I cobbled together some preliminary code to get the subtotals and total, but not the rest.
data tabletest;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
Code that I wrote:
proc report data=tabletest nofs headline headskip;
column referral_total referral_source TypeA TypeB;
define referral_total / group ;
define referral_source / group;
define TypeA / sum ' ';
define TypeB / sum ' ';
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
run;
This is the desired output:
The DEFINE statement has an option NOPRINT that causes the column to not be rendered, however, the variables for it are still available (in a left to right manner) for use in a compute block.
Stacking in the column statement allows you to customize the column headers and spans. In a compute block for non-group columns, the Proc REPORT data vector only allows access to the aggregate values at the detail or total line, so you need to specify .
This sample code shows how the _total column is hidden and the _source cells in the sub- and report- total lines are 'injected' with the hidden _total value. The _source variable has to be lengthened to accommodate the longer values that are in the _total variable.
data tabletest;
* ensure referral_source big enough to accommodate _total || ' TOTAL';
length referral_total $50 referral_source $60;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
run;
proc report data=tabletest;
column
( 'Table 1 - Stacking gives you custom headers and hierarchies'
referral_total
referral_source
TypeA TypeB
TypeTotal
);
define referral_total / group noprint; * hide this column;
define referral_source / group;
define TypeA / sum 'Freq(A)'; * field labels are column headers;
define TypeB / sum 'Freq(B)';
define TypeTotal / computed 'Freq(ALL)'; * specify custom computation;
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
/*
* no thanks, doing this in the _source compute block instead;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
*/
compute referral_source;
* the referral_total value is available because it is left of me. It just happens to be invisible;
* at the break lines override the value that appears in the _source cell, effectively 'moving it over';
select (_break_);
when ('referral_total') referral_source = catx(' ', referral_total, 'Total');
when ('_RBREAK_') referral_source = 'Total';
otherwise;
end;
endcomp;
compute TypeTotal;
* .sum is needed because the left of me are groups and only aggregate values available here;
TypeTotal = Sum(TypeA.sum,TypeB.sum);
endcomp;
run;

SAS: Rename variables in merge according to original dataset

I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F)
males (rename = (VarA = VarA_M VarB = VarB_M)
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being interjected into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.
For a data set that contains variables that are sequentially numbered there is variable list syntax for renaming the whole range of variables:
This example creates sample that has 100 variables.
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=var1-var100=mvar1-mvar100 in=_M)
have2(rename=var1-var100=fvar1-fvar100 in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;
Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_suffix, to be used in a rename statement or data set option ;
* It will be of form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to added to all variables except for the variables specified in &keys. ;
* &keys variables should be given each in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; * So that this macro variable is accessable outside the macro ;
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); * The ' ' is included, so there is no error if no keys are given ;
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females), drop = sex)
males (rename = (&rename_males ), drop = sex) ;
by group ;
pct_diff(1) ;
pct_diff(2) ;
run ;
dm 'vt difference';
The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of variables to compare, then you could further shorten it by automatically detecting the number of comparisons, by running the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then use a do loop in the data step:
do i = 1 to &varct ;
%pct_diff(i) ;
end ;
Use table alias in proc sql to avoid name change:
proc sql;
select a.group,(a.var1-b.var1)/((a.var1+b.var1)/2) as percent_diff1,
(a.var2-b.var2)/((a.var2+b.var2)/2) as percent_diff2
from female as a,male as b
where a.group=b.group;
quit;

Compare value to max of variable

I have a dataset with a lot of lines and I'm studying a group of variables.
For each line and each variable, I want to know if the value is equal to the max for this variable or more than or equal to 10.
Expected output (with input as all variables without _B) :
(you can replace T/F by TRUE/FALSE or 1/0 as you wish)
+----+------+--------+------+--------+------+--------+
| ID | Var1 | Var1_B | Var2 | Var2_B | Var3 | Var3_B |
+----+------+--------+------+--------+------+--------+
| A | 1 | F | 5 | F | 15 | T |
| B | 1 | F | 5 | F | 7 | F |
| C | 2 | T | 5 | F | 15 | T |
| D | 2 | T | 6 | T | 10 | T |
+----+------+--------+------+--------+------+--------+
Note that for Var3, the max is 15 but since 15>=10, any value >=10 will be counted as TRUE.
Here is what I've maid up so far (doubt it will be any help but still) :
%macro pleaseWorkLittleMacro(table, var, suffix);
proc means NOPRINT data=&table;
var &var;
output out=Varmax(drop=_TYPE_ _FREQ_) max=;
run;
proc transpose data=Varmax out=Varmax(rename=(COL1=varmax));
run;
data Varmax;
set Varmax;
varmax = ifn(varmax<10, varmax, 10);
run; /* this outputs the max for every column, but how to use it afterward ? */
%mend;
%pleaseWorkLittleMacro(MY_TABLE, VAR1 VAR2 VAR3 VAR4, _B);
I have the code in R, works like a charm but I really have to translate it to SAS :
#in a for loop over variable names, db is my data.frame, x is the
#current variable name and x2 is the new variable name
x.max = max(db[[x]], na.rm=T)
x.max = ifelse(x.max<10, x.max, 10)
db[[x2]] = (db[[x]] >= x.max) %>% mean(na.rm=T) %>% percent(2)
An old school sollution would be to read the data twice in one data step;
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
run;
data my_input;
set expect;
keep ID Var1 Var2 Var3 ;
proc print;
run;
It is a good habit to declare the most volatile things in your code as macro variables.;
%let varList = Var1 Var2 Var3;
%let markList = Var1_B Var2_B Var3_B;
%let varCount = 3;
Read the data twice;
data my_result;
set my_input (in=maximizing)
my_input (in=marking);
Decklare and Initialize arrays;
format &markList $1.;
array _vars [&&varCount] &varList;
array _maxs [&&varCount] _temporary_;
array _B [&&varCount] &markList;
if _N_ eq 1 then do _varNr = 1 to &varCount;
_maxs(_varNr) = -1E15;
end;
While reading the first time, Calculate the maxima;
if maximizing then do _varNr = 1 to &varCount;
if _vars(_varNr) gt _maxs(_varNr) then _maxs(_varNr) = _vars(_varNr);
end;
While reading the second time, mark upt to &maxMarks maxima;
if marking then do _varNr = 1 to &varCount;
if _vars(_varNr) eq _maxs(_varNr) or _vars(_varNr) ge 10
then _B(_varNr) = 'T';
else _B(_varNr) = 'F';
end;
Drop all variables starting with an underscore, i.e. all my working variables;
drop _:;
Only keep results when reading for the second time;
if marking;
run;
Check results;
proc print;
var ID Var1 Var1_B Var2 Var2_B Var3 Var3_B;
proc compare base=expect compare=my_result;
run;
This is quite simple to solve in sql
proc sql;
create table my_result as
select *
, Var1_B = (Var1 eq max_Var1)
, Var1_B = (Var2 eq max_Var2)
, Var1_B = (Var3 eq max_Var3)
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3)
;
quit;
(Not tested, as our SAS server is currently down, which is the reason I pass my time on Stack Overflow)
If you need that for a lot of variables, consult the system view VCOLUMN of SAS:
proc sql;
select ''|| name ||'_B = ('|| name ||' eq max_'|| name ||')'
, 'max('|| name ||') as max_'|| name
from sasHelp.vcolumn
where libName eq 'WORK'
and memName eq 'MY_RESULT'
and type eq 'num'
and upcase(name) like 'VAR%'
;
into : create_B separated by ', '
, : select_max separated by ', '
create table my_result as
select *, &create_B
, Var1_B = (Var1 eq max_Var1)
, Var1_B = (Var2 eq max_Var2)
, Var1_B = (Var3 eq max_Var3)
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3)
;
quit;
(Again not tested)
After Proc MEANS computes the maximum value for each column you can run a data step that combines the original data with the maximums.
data want;
length
ID $1 Var1 8 Var1_B $1. Var2 8 Var2_B $1. Var3 8 Var3_B $1. var4 8 var4_B $1;
input
ID Var1 Var1_B Var2 Var2_B Var3 Var3_B ; datalines;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
run;
data have;
set want;
drop var1_b var2_b var3_b var4_b;
run;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_) max= / autoname;
run;
The neat thing the VAR statement is that you can easily list numerically suffixed variable names. The autoname option automatically appends _ to the names of the variables in the output.
Now combine the maxes with the original (have). The set varmax automatically retains the *_max variables, and they will not get overwritten by values from the original data because the varmax variable names are different.
Arrays are used to iterate over the values and apply the business logic of flagging a row as at max or above 10.
data want;
if _n_ = 1 then set varmax; * read maxes once from MEANS output;
set have;
array values var1-var4;
array maxes var1_max var2_max var3_max var4_max;
array flags $1 var1_b var2_b var3_b var4_b;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
The difficult part above is that the MEANS output creates variables that can not be listed using the var1 - varN syntax.
When you adjust the naming convention to have all your conceptually grouped variable names end in numeric suffixes the code is simpler.
* number suffixed variable names;
* no autoname, group rename on output;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_ rename=var1-var4=max_var1-max_var4) max= ;
run;
* all arrays simpler and use var1-varN;
data want;
if _n_ = 1 then set varmax;
set have;
array values var1-var4;
array maxes max_var1-max_var4;
array flags $1 flag_var1-flag_var4;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
You can use macro code or arrays, but it might just be easier to transform your data into a tall variable/value structure.
So let's input your test data as an actual SAS dataset.
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
First you can use PROC TRANSPOSE to make the tall structure.
proc transpose data=expect out=tall ;
by id ;
var var1-var3 ;
run;
Now your rules are easy to apply in PROC SQL step. You can derive a new name for the flag variable by appending a suffix to the original variable's name.
proc sql ;
create table want_tall as
select id
, cats(_name_,'_Flag') as new_name
, case when col1 >= min(max(col1),10) then 'T' else 'F' end as value
from tall
group by 2
order by 1,2
;
quit;
Then just flip it back to horizontal and merge with the original data.
proc transpose data=want_tall out=flags (drop=_name_);
by id;
id new_name ;
var value ;
run;
data want ;
merge expect flags;
by id;
run;

Populate SAS variable by repeating values

I have a SAS table with a lot of missing values. This is only a simple example.
The real table is much bigger (>1000 rows) and the numbers is not the same. But what is the same is that I have a column a that have no missing numbers. Column b and c have a sequence that is shorter than the length of a.
a b c
1 1b 1000
2 2b 2000
3 3b
4
5
6
7
What I want is to fill b an c with repeating the sequences until they columns are full. The result should look like this:
a b c
1 1b 1000
2 2b 2000
3 3b 1000
4 1b 2000
5 2b 1000
6 3b 2000
7 1b 1000
I have tried to make a macro but it become to messy.
The hash-of-hashes solution is the most flexible here, I suspect.
data have;
infile datalines delimiter="|";
input a b $ c;
datalines;
1|1b|1000
2|2b|2000
3|3b|
4| |
5| |
6| |
7| |
;;;;
run;
%let vars=b c;
data want;
set have;
rownum = _n_;
if _n_=1 then do;
declare hash hoh(ordered:'a');
declare hiter hih('hoh');
hoh.defineKey('varname');
hoh.defineData('varname','hh');
hoh.defineDone();
declare hash hh();
do varnum = 1 to countw("&vars.");
varname = scan("&vars",varnum);
hh = _new_ hash(ordered:'a');
hh.defineKey("rownum");
hh.defineData(varname);
hh.defineDone();
hoh.replace();
end;
end;
do rc=hih.next() by 0 while (rc=0);
if strip(vvaluex(varname)) in (" ",".") then do;
num_items = hh.num_items;
rowmod = mod(_n_-1,num_items)+1;
hh.find(key:rowmod);
end;
else do;
hh.replace();
end;
rc = hih.next();
end;
keep a &Vars.;
run;
Basically, one hash is built for each variable you are using. They're each added to the hash of hashes. Then we iterate over that, and search to see if the variable requested is populated. If it is then we add it to its hash. If it isn't then we retrieve the appropriate one.
Assuming that you can tell how many rows to use for each variable by counting how many non-missing values are in the column then you could use this code generation technique to generate a data step that will use the POINT= option SET statements to cycle through the first Nx observations for variable X.
First get a list of the variable names;
proc transpose data=have(obs=0) out=names ;
var _all_;
run;
Then use those to generate a PROC SQL select statement to count the number of non-missing values for each variable.
filename code temp ;
data _null_;
set names end=eof ;
file code ;
if _n_=1 then put 'create table counts as select ' ;
else put ',' #;
put 'sum(not missing(' _name_ ')) as ' _name_ ;
if eof then put 'from have;' ;
run;
proc sql noprint;
%include code /source2 ;
quit;
Then transpose that so that again you have one row per variable name but this time it also has the counts in COL1.
proc transpose data=counts out=names ;
var _all_;
run;
Now use that to generate SET statements needed for a DATA step to create the output from the input.
filename code temp;
data _null_;
set names ;
file code ;
length pvar $32 ;
pvar = cats('_point',_n_);
put pvar '=mod(_n_-1,' col1 ')+1;' ;
put 'set have(keep=' _name_ ') point=' pvar ';' ;
run;
Now use the generated statements.
data want ;
set have(drop=_all_);
%include code / source2;
run;
So for your example data file with variables A, B and C and 7 total observations the LOG for the generated data step looks like this:
1229 data want ;
1230 set have(drop=_all_);
1231 %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file .../#LN00026.
1232 +_point1 =mod(_n_-1,7 )+1;
1233 +set have(keep=a ) point=_point1 ;
1234 +_point2 =mod(_n_-1,3 )+1;
1235 +set have(keep=b ) point=_point2 ;
1236 +_point3 =mod(_n_-1,2 )+1;
1237 +set have(keep=c ) point=_point3 ;
NOTE: %INCLUDE (level 1) ending.
1238 run;
NOTE: There were 7 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 7 observations and 3 variables.
Populate a temporary array with the values, then check the row and add the appropriate value.
Setup the data
data have;
infile datalines delimiter="|";
input a b $ c;
datalines;
1|1b|1000
2|2b|2000
3|3b|
4| |
5| |
6| |
7| |
;
Get a count of the non-null values
proc sql noprint;
select count(*)
into :n_b
from have
where b ^= "";
select count(*)
into :n_c
from have
where c ^=.;
quit;
Now populate the missing values by repeating the contents of each array.
data want;
set have;
/*Temporary Arrays*/
array bvals[&n_b] $ 32 _temporary_;
array cvals[&n_c] _temporary_;
if _n_ <= &n_b then do;
/*Populate the b array*/
bvals[_n_] = b;
end;
else do;
/*Fill the missing values*/
b = bvals[mod(_n_+&n_b-1,&n_b)+1];
end;
if _n_ <= &n_c then do;
/*populate C values array*/
cvals[_n_] = c;
end;
else do;
/*fill in the missing C values*/
c = cvals[mod(_n_+&n_c-1,&n_c)+1];
end;
run;
data want;
set have;
n=mod(_n_,3);
if n=0 then b='3b';
else b=cats(n,'b');
if n in (1,0) then c=1000;
else c=2000;
drop n;
run;

Create SAS data set conditionally on other data sets

I have 6 identical SAS data sets. They only differ in terms of the values of the observations.
How can I create one output data, which finds the maximum value across all the 6 data sets for each cell?
The update statement seems a good candidate, but it cannot set a condition.
data1
v1 v2 v3
1 1 1
1 2 3
data2
v1 v2 v3
1 2 3
1 1 1
Result
v1 v2 v3
1 2 3
1 2 3
If need be the following could be automated by "PUT" statements or variable arrays.
***ASSUMES DATA SETS ARE SORTED BY ID;
Data test;
do until(last.id);
set a b c;
by id;
if v1 > updv1 then updv1 = v1;
if v2 > updv2 then updv2 = v2;
if v3 > updv3 then updv3 = v3;
end;
drop v1-v3;
rename updv1-updv3 = v1-v3;
run;
To provide a more complete solution to Rico's question(assuming 6 datasets e.g. d1-d6) one could do it this way:
Data test;
array v(*) v1-v3;
array updv(*) updv1-updv3;
do until(last.id);
set d1-d6;
by id;
do i = 1 to dim(v);
if v(i) > updv(i) then updv(i) = v(i);
end;
end;
drop v1-v3;
rename updv1-updv3 = v1-v3;
run;
proc print;
var id v1-v3;
run;
See below. For a SAS beginner might be too complex. I hope the comments do explain it a bit.
/* macro rename_cols_opt to generate cols_opt&n variables
- cols_opt&n contains generated code for dataset RENAME option for a given (&n) dataset
*/
%macro rename_cols_opt(n);
%global cols_opt&n max&n;
proc sql noprint;
select catt(name, '=', name, "&n") into: cols_opt&n separated by ' '
from dictionary.columns
where libname='WORK' and memname='DATA1'
and upcase(name) ne 'MY_ID_COLUMN'
;
quit;
%mend;
/* prepare macro variables = pre-generate the code */
%rename_cols_opt(1)
%rename_cols_opt(2)
%rename_cols_opt(3)
%rename_cols_opt(4)
%rename_cols_opt(5)
%rename_cols_opt(6)
/* create macro variable keep_list containing names of output variables to keep (based on DATA1 structure, the code expects those variables in other tables as well */
proc sql noprint;
select trim(name) into: keep_list separated by ' '
from dictionary.columns
where libname='WORK' and memname='DATA1'
;
quit;
%put &keep_list;
/* macro variable maxcode contains generated code for calculating all MAX values */
proc sql noprint;
select cat(trim(name), ' = max(of ', trim(name), ":)") into: maxcode separated by '; '
from dictionary.columns
where libname='WORK' and memname='DATA1'
and upcase(name) ne 'MY_ID_COLUMN'
;
quit;
%put "&maxcode";
data result1 / view =result1;
merge
data1 (in=a rename=(&cols_opt1))
data2 (in=b rename=(&cols_opt2))
data3 (in=b rename=(&cols_opt3))
data4 (in=b rename=(&cols_opt4))
data5 (in=b rename=(&cols_opt5))
data6 (in=b rename=(&cols_opt6))
;
by MY_ID_COLUMN;
&maxcode;
keep &keep_list;
run;
/* created a datastep view, now "describing" it to see the generated code */
data view=result1;
describe;
run;
Here's another attempt that is scalable against any number of datasets and variables. I've added in an ID variable this time as well. Like the answer from #vasja, there are some advanced techniques used here. The 2 solutions are in fact very similar, I've used 'call execute' instead of a macro to create the view. My solution also requires the dataset names to be stored in a dataset.
/* create dataset of required dataset names */
data datasets;
input ds_name $;
cards;
data1
data2
;
run;
/* dummy data */
data data1;
input id v1 v2 v3;
cards;
10 1 1 1
20 1 2 3
;
run;
data data2;
input id v1 v2 v3;
cards;
10 1 2 3
20 1 1 1
;
run;
/* create dataset, macro list and count of variables names */
proc sql noprint;
create table variables as
select name as v_name from dictionary.columns
where libname='WORK' and upcase(memname)='DATA1' and upcase(name) ne 'ID';
select name, count(*) into :keepvar separated by ' ',
:numvar
from dictionary.columns
where libname='WORK' and upcase(memname)='DATA1' and upcase(name) ne 'ID';
quit;
/* create view that joins all datasets, renames variables and calculates maximum value per id */
data _null_;
set datasets end=last;
if _n_=1 then call execute('data data_all / view=data_all; merge');
call execute (trim(ds_name)|| '(rename=(');
do i=1 to &numvar.;
set variables point=i;
call execute(trim(v_name)||'='||catx('_',v_name,_n_));
end;
call execute('))');
if last then do;
call execute('; by id;');
do i=1 to &numvar.;
set variables point=i;
call execute(trim(v_name)||'='||'max(of '||trim(v_name)||':);');
end;
call execute('run;');
end;
run;
/* create dataset of maximum values per id per variable */
data result (keep=id &keepvar.);
set data_all;
run;