SAS Proc Summary with class variables that have missing values

I'm having an issue with how proc summary behaves when variables in the class statement have missing values. In the example below, test_out contains all possible combinations of types. test_missing_out does not: the var3 value from the row where var2 is missing is left out of the sums, even though var1 is not missing on that row:
data test;
infile datalines dsd delimiter=' ';
input var1 var2 $ var3;
datalines;
1 data 200
2 data2 103
;
run;
proc summary
data=test;
class var1 var2;
var var3;
output out=test_out sum=sum;
run;
data test_missing;
infile datalines dsd delimiter=' ';
input var1 var2 $ var3;
datalines;
1 data 200
2  103
;
run;
proc summary
data=test_missing;
class var1 var2;
var var3;
output out=test_missing_out sum=sum;
run;

proc summary has a lot in common with proc means concerning syntax.
You can simply add the keyword MISSING to the proc summary statement if you want it to consider missing values as a grouping level:
proc summary
data=test_missing
MISSING;
class var1 var2;
var var3;
output out=test_missing_out sum=sum;
run;
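If only some of the class variables should treat missing as a valid level, the MISSING option can also be given on an individual CLASS statement instead of on the PROC statement. A minimal sketch, assuming the test_missing dataset from the question (the output name test_missing_out2 is made up for illustration):

```sas
proc summary data=test_missing;
  /* var1: rows with a missing var1 are still excluded */
  class var1;
  /* var2: missing is kept as its own grouping level */
  class var2 / missing;
  var var3;
  output out=test_missing_out2 sum=sum;
run;
```

With the PROC-level MISSING keyword the behaviour applies to every class variable. The per-statement form is only worth the extra line when the class variables need to be treated differently.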

Related

How do you add a row Number in SAS by multiple groups with one variable in descending order?

I have discovered this code in SAS that mimics the following window function in SQL server:
ROW_NUMBER() OVER (PARTITION BY Var1,var2 ORDER BY var1, var2)
=
data want;
set have;
by var1 var2;
if first.var1 AND first.var2 then n=1;
else n+1;
run;
"She's a beaut' Clark"... but, How does one mimic this operation:
ROW_NUMBER() OVER (PARTITION BY Var1,var2 ORDER BY var1, var2 Desc)
I've made sure I have before:
PROC SORT DATA=WORK.TEST
OUT=WORK.TEST;
BY var1 DESCENDING var2 ;
RUN;
data WORK.want;
set WORK.Test;
by var1 var2;
if first.var1 AND last.var2 then n=1;
else n+1;
run;
But this doesn't work.
ERROR: BY variables are not properly sorted on data set WORK.TEST.
Sample DataSet:
data test;
infile datalines dlm='#';
INPUT var1 var2;
datalines;
1#5
2#4
1#3
1#6
1#9
2#5
2#2
1#7
;
run;
I was thinking I could make one variable temporarily negative, but I don't want to change the data. I'm looking for a more elegant solution.
You have to tell the data step to expect the data in descending order if that is what you are giving it.
You also don't seem to quite get the logic of the FIRST. and LAST. flags. If it is FIRST.VAR1 then by definition it is FIRST.VAR2. The first observation for this value of VAR1 is also the first observation for the first value of VAR2 within this specific value of VAR1.
Do you want to number the observations within each combination of VAR1 and VAR2?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var2 then n=1;
else n+1;
run;
Or number the distinct values of VAR2 within VAR1?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var1 then n=0;
if first.var2 then n+1;
run;
Or number the distinct combinations of VAR2 and VAR1?
data WORK.want;
set WORK.Test;
BY var1 DESCENDING var2 ;
if first.var2 then n+1;
run;
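For completeness, here is a sketch of the whole flow against the sample data, under the assumption that the goal is the equivalent of ROW_NUMBER() OVER (PARTITION BY var1 ORDER BY var2 DESC), i.e. restarting the counter for each var1. The dataset name test_sorted is made up so that the original WORK.TEST is left untouched:

```sas
/* Sort into a new dataset: var1 ascending, var2 descending within var1 */
proc sort data=work.test out=work.test_sorted;
  by var1 descending var2;
run;

data work.want;
  set work.test_sorted;
  /* The BY statement must describe the same order the data is sorted in */
  by var1 descending var2;
  if first.var1 then n = 0;  /* restart the counter for each var1 */
  n + 1;                     /* sum statement: n is retained across rows */
run;
```

With the sample data, the var1=1 rows come out as var2 = 9, 7, 6, 5, 3 with n = 1 to 5, and the var1=2 rows as var2 = 5, 4, 2 with n = 1 to 3.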

Looping around a whole data step in SAS

I have the following code:
%let macroVar = Var1 Var2;
Data new1;
Set old1 (keep= count &macroVar.);
Run;
Proc means data = new1 nway missing noprint;
Class var1;
Var count;
Output out= out_var1 sum=;
Run;
Proc means data = new1 nway missing noprint;
Class var2;
Var count;
Output out= out_var2 sum=;
Run;
How can I write out the two proc means in one data step, using the macro variable I set up at the beginning?
Many thanks
Why not just do it in one PROC MEANS call?
%let varlist=sex age ;
proc means data=sashelp.class missing noprint;
class &varlist;
ways 1;
var height;
output out=want sum=;
run;
Result:
Obs    Sex    Age    _TYPE_    _FREQ_    Height
  1            11       1         2       108.8
  2            12       1         5       297.2
  3            13       1         3       184.3
  4            14       1         4       259.6
  5            15       1         4       262.5
  6            16       1         1        72.0
  7     F       .       2         9       545.3
  8     M       .       2        10       639.1
If you really must have separate output datasets then it will be faster to generate them from the output above.
%macro sum_counts(varlist,data=,var=count);
%local i n;
%let n=%sysfunc(countw(&varlist));
proc means data=&data missing noprint ;
class &varlist;
ways 1;
var &var;
output out=_summary_ sum=;
run;
%do i=1 %to &n;
data new&i;
set _summary_;
where _type_=2**(&n-&i);
keep %scan(&varlist,&i) &var;
run;
%end;
%mend sum_counts;
Example:
263 options mprint;
264 %sum_counts(varlist=sex age,data=sashelp.class,var=height);
MPRINT(SUM_COUNTS): proc means data=sashelp.class missing noprint ;
MPRINT(SUM_COUNTS): class sex age;
MPRINT(SUM_COUNTS): ways 1;
MPRINT(SUM_COUNTS): var height;
MPRINT(SUM_COUNTS): output out=_summary_ sum=;
MPRINT(SUM_COUNTS): run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK._SUMMARY_ has 8 observations and 5 variables.
MPRINT(SUM_COUNTS): data new1;
MPRINT(SUM_COUNTS): set _summary_;
MPRINT(SUM_COUNTS): where _type_=2**(2-1);
MPRINT(SUM_COUNTS): keep sex height;
MPRINT(SUM_COUNTS): run;
NOTE: There were 2 observations read from the data set WORK._SUMMARY_.
WHERE _type_=2;
NOTE: The data set WORK.NEW1 has 2 observations and 2 variables.
MPRINT(SUM_COUNTS): data new2;
MPRINT(SUM_COUNTS): set _summary_;
MPRINT(SUM_COUNTS): where _type_=2**(2-2);
MPRINT(SUM_COUNTS): keep age height;
MPRINT(SUM_COUNTS): run;
NOTE: There were 6 observations read from the data set WORK._SUMMARY_.
WHERE _type_=1;
NOTE: The data set WORK.NEW2 has 6 observations and 2 variables.

How to load a selected list of files using SAS macro

I have a folder full of csv files (folder path Path). I have a SAS dataset Name which contains a list of selected file names. Dataset Name looks like the following:
FileName
File1
File3
File8
File9
...
I want to load every file listed in dataset Name (name1-name...) and save each one as a SAS dataset with the same name.
Can anyone teach me how to automate this process using macro please. I cannot figure out how to create a loop and do the following process one by one. Many thanks.
data Name1 ;
infile 'Path\Name1.csv' delimiter = ',' truncover DSD lrecl=32767 firstobs=2 ;
informat VAR1 $25. ;
informat VAR2 yymmdd10. ;
informat VAR3 best12. ;
format VAR1 $25. ;
format VAR2 yymmdd10. ;
format VAR3 best12. ;
input
VAR1 $
var2
var3
;
run;
Create code that imports correctly for one file (you've provided above).
Change code to take macro variables instead of default values.
%let name=name1;
data &name;
infile "Path\&name..csv" delimiter = ',' truncover DSD lrecl=32767 firstobs=2 ;
informat VAR1 $25. VAR2 yymmdd10. VAR3 best12. ;
format VAR1 $25. VAR2 yymmdd10. VAR3 best12. ;
input VAR1 $ var2 var3;
run;
Create a macro that takes parameters
%macro import_file(name);
data &name;
infile "Path\&name..csv" delimiter = ',' truncover DSD lrecl=32767 firstobs=2 ;
informat VAR1 $25. VAR2 yymmdd10. VAR3 best12. ;
format VAR1 $25. VAR2 yymmdd10. VAR3 best12. ;
input VAR1 $ var2 var3;
run;
%mend;
Use call execute in a data step that reads your list of file names (dataset Name, variable FileName) to run the macro once per row.
data _null_;
set Name;
str=catt('%import_file(', FileName, ');');
call execute(str);
run;
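One thing to be aware of with call execute: macro code passed to it is expanded immediately, while the calling data step is still running, and only the generated SAS statements are queued until that step ends. The import macro above contains no macro logic that depends on values created at run time, so this makes no difference here. A common defensive idiom is to wrap the macro name in %nrstr so that the macro itself runs only after the data step finishes. A sketch, assuming the file list is the dataset Name with variable FileName from the question:

```sas
data _null_;
  set Name;
  /* %nrstr() delays expansion of %import_file until the queued */
  /* code is executed, after this data step has finished        */
  call execute(cats('%nrstr(%import_file)(', FileName, ')'));
run;
```

Single quotes around the string matter: they stop the macro processor from resolving %import_file while the data step itself is being compiled.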

Concatenate duplicate values

I have a table with some variables, say var1 and var2 and an identifier, and for some reasons, some identifiers have 2 observations.
I would like to know if there is a simple way to fold the second observation for an identifier into the first one. That is,
instead of having two observations, each with var1 and var2, for the same identifier value
ID var1 var2
------------------
A1 12 13
A1 43 53
having just one, but with something like var1 var2 var1_2 var2_2.
ID var1 var2 var1_2 var2_2
--------------------------------------
A1 12 13 43 53
I can probably do that with renaming all my variables, then merging the table with the renamed one and dropping duplicates, but I assume there must be a simpler version.
Actually, your suggestion of merging the values back is probably the best.
This works if you have, at most, 1 duplicate for any given ID.
data first dups;
set have;
by id;
if first.id then output first;
else output dups;
run;
proc sql noprint;
create table want as
select a.id,
a.var1,
a.var2,
b.var1 as var1_2,
b.var2 as var2_2
from first as a
left join
dups as b
on a.id=b.id;
quit;
Another method makes use of PROC TRANSPOSE and a data-step merge:
/* You can experiment by adding more data to this datalines step */
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
;
run;
/* This step puts the var1 values onto one line */
proc transpose data=tab out=new1 (drop=_NAME_) prefix=var1_;
by id;
var var1;
run;
/* This does the same for the var2 values */
proc transpose data=tab out=new2 (drop=_NAME_) prefix=var2_;
by id;
var var2;
run;
/* The two transposed datasets are then merged together to give one line */
data want;
merge new1 new2;
by id;
run;
As an example:
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
A2 199 342
A2 1132 111
A2 91913 199191
B1 1212 43214
;
run;
Gives:
ID var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
---------------------------------------------------
A1 12 43 . 13 53 .
A2 199 1132 91913 342 111 199191
B1 1212 . . 43214 . .
There's a very simple way of doing this, using the IDGROUP option of the OUTPUT statement in PROC SUMMARY.
data have;
input ID $ var1 $ var2 $;
datalines;
A1 12 13
A1 43 53
;
run;
proc summary data=have nway;
class id;
output out=want (drop=_:)
idgroup(out[2] (var1 var2)=);
run;

Is there a built in method to choose non key variables in a SAS merge?

If you have multiple datasets (hundreds) with the same variable names and would like to merge them by a key, is there a simple way to control which value to take for the variables that are not the key? One way to do this would be a rename on the merge statement, followed by another step that uses the renamed variables to calculate the most frequent value with an array, but I'm really wondering if there's a built-in way of handling this. For example:
data ds1;
infile datalines dsd delimiter=' ';
input var1 $ var2;
datalines;
a 1
b 2
;
run;
data ds2;
infile datalines dsd delimiter=' ';
input var1 $ var2;
datalines;
a
b 2
;
run;
data ds3;
infile datalines dsd delimiter=' ';
input var1 $ var2;
datalines;
a 1
b
;
run;
data ds123;
merge ds1 ds2 ds3;
by var1;
run;
This code will 'pick' the 'furthest right' var2 i.e. the dataset ds123:
a 1
b
But I may want it to be:
a 1
b 2
as this would match the most frequent values.
Use an SQL join and the coalesce function. Specify the preference order in the coalesce and the first non-missing in that order will be used.
proc sql noprint;
create table ds123 as
select a.var1,
coalesce(a.var2,b.var2,c.var2) as var2
from ds1 as a,
ds2 as b,
ds3 as c
where a.var1 = b.var1
and b.var1 = c.var1;
quit;
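Note that coalesce returns the first non-missing value in the order listed, which matches the desired output here but is not the same thing as the most frequent value. If the most frequent non-missing value per key is really what is needed, one possible approach is to stack the datasets and count. This is only a sketch using the ds1-ds3 example; the names stacked, counts and ds123_mode are made up for illustration, and ties are broken arbitrarily:

```sas
/* Stack all inputs, keeping only rows where var2 is populated */
data stacked;
  set ds1 ds2 ds3;
  where not missing(var2);
run;

/* Count how often each value of var2 occurs for each key */
proc sql;
  create table counts as
  select var1, var2, count(*) as n
  from stacked
  group by var1, var2;
quit;

/* Put the most frequent value first within each key */
proc sort data=counts;
  by var1 descending n;
run;

/* Keep that first row per key */
data ds123_mode;
  set counts;
  by var1;
  if first.var1;
  drop n;
run;
```

Unlike the inner join above, this keeps a key that appears in only some of the datasets, as long as at least one of them has a non-missing var2 for it.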