label variables in one data set with values from another? - sas

How do you take the labels contained in the metadata data set and apply them to the variables in the set1 data set?
The desired result is that 'set1' still contains variables a-h and the appropriate variables now have labels. For example 'set1' will continue to have a variable 'a' with no label, however variable 'b' will now have the label 'Label1' etc.
The code I have below works, but it's very inefficient because it runs the macro once per variable. So for every label it has to read 'set1', apply the label, and save 'set1' again. With large 'set1' and 'metadata' data sets, it's quite slow.
/**********************************************************
Reads metadata - in the real case it comes from a large
csv file
***********************************************************/
data metadata;
input var $ labels $;
datalines;
b Label1
d Label2
f Label3
;
run;
/**********************************************************
Reads 'set1' in the real case it comes from many
even larger csv files.
***********************************************************/
data set1;
input a b c d e f g h;
datalines;
1 1 0 5 6 4 0 4
2 3 4 5 3 5 0 1
3 2 1 9 6 5 8 1
;
run;
/**********************************************************
Macro to relabel one by one
***********************************************************/
%Macro relabel(var,label);
DATA set1;
set set1;
label %quote(&var) = %quote(&label);
RUN;
%Mend relabel;
/**********************************************************
Steps through 'metadata' and individually calls the macro
for each obs
***********************************************************/
data _null_;
set metadata;
call execute('%relabel('||var||','||labels||')');
run;
proc print;
run;
/**********************************************************
Shows labels applied correctly.
***********************************************************/
proc contents;
run;

If your metadata is small enough then you can use a macro variable to hold it. A macro variable value is limited to 65,534 characters, so the combined text of all the labels has to fit within that.
proc sql noprint;
select catx('=',var,quote(trim(labels)))
into :labels separated by ' '
from metadata
;
quit;
proc datasets nolist lib=work ;
modify set1;
label &labels;
run;
quit;
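If the combined label text could exceed that limit, one possible workaround is to let a DATA step generate the PROC DATASETS code with CALL EXECUTE, so no macro variable is needed. This is only a sketch: it assumes the same metadata and set1 data sets as above and that every value of var is a valid variable name in set1.

```sas
data _null_;
  set metadata end=last;
  /* open one PROC DATASETS step and start a single LABEL statement */
  if _n_ = 1 then
    call execute('proc datasets nolist lib=work; modify set1; label');
  /* append one var="label" pair per metadata row */
  call execute(catx('=', var, quote(trim(labels))));
  /* close the LABEL statement and the procedure after the last row */
  if last then call execute('; run; quit;');
run;
```

The generated code runs once, after the DATA step finishes, so set1 is still only modified in place a single time.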

Related

How to create 2x2 table in sas for fisher exact test

I just performed the fisher test in R and in Excel on a 2x2 table with the contents 1 6 and 7 2. I can't manage to do this in sas.
data my_table;
input var1 var2 @@;
datalines;
1 6 7 2
;
proc freq data=my_table;
tables var1*var2 / fisher;
run;
The test somehow ignores that the table consists of the 4 variables but when I print the table it looks fine. In the test the contents of the table are 0, 1, 1, 0. I guess I need to change something when creating the data but what?
You do NOT have two variables that each have two categories.
Read the data in this way instead.
data have ;
do var1=1,2 ; do var2=1,2;
input count @@;
output;
end; end;
datalines;
1 6 7 2
;
Now VAR1 and VAR2 both have two possible values and COUNT has the number of cases for the particular combination. Use the WEIGHT statement to tell PROC FREQ to use COUNT as the number of cases.
proc freq data=have ;
weight count ;
tables var1*var2 / fisher ;
run;

Replace Missing Values with Mean of column in SAS

I have a data set in SAS that has multiple columns that have missing data. This post replaces all the missing values in the entire data set with zeros. But since it goes through the entire data set you can't just replace the zero with the mean or median for that column. How do I replace missing data with the mean of that column?
There are only 5 or so columns so the script doesn't need to go through the entire data set.
PROC STDIZE has an option to do just this. The REPONLY option tells it you want it to only replace missing values, and METHOD=MEAN tells it how you want to replace those values. (PROC EXPAND also could be used, if you are using time series data, but if you're just using mean, STDIZE is the simpler one.)
For example:
data missing_class;
set sashelp.class;
if _N_=5 then call missing(age);
if _N_=7 then call missing(height);
if _N_=9 then call missing(weight);
run;
proc stdize data=missing_class out=imputed_class
method=mean reponly;
var age height weight;
run;
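The question also mentions the median. PROC STDIZE accepts METHOD=MEDIAN in the same way, so a variant of the step above might look like the following sketch (the output data set name imputed_class_median is made up for the example):

```sas
/* Same idea as above, but missing values are replaced by the column median */
proc stdize data=missing_class out=imputed_class_median
            method=median reponly;
  var age height weight;
run;
```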
Ideally, you would use PROC MI to do multiple imputation and get a more accurate representation of the missing values; however, if you just want the average, an alternative is PROC MEANS followed by a data step. The example below imputes the mean height within each sex group.
/* Set up data */
data have(index=(sex) );
set sashelp.class;
if(_N_ IN(3,7,9,12) ) then call missing(height);
run;
/* Calculate mean of all non-missing values */
proc means data=have noprint;
by sex;
output out=means mean(height) = imp_height;
run;
/* Merge avg. values with original data */
data want;
merge have
means;
by sex;
if(missing(height) ) then height = imp_height;
drop imp_height;
run;
You can use the mean function in proc sql to replace only the missing observations in each column:
data temp;
input var1 var2 var3 var4 var5;
datalines;
. 2 3 4 .
6 7 8 9 10
. 12 . . 15
16 17 18 19 .
21 . 23 24 25
;
run;
proc sql;
create table temp2 as select
case when missing(var1) then mean(var1) else var1 end as var1,
case when missing(var2) then mean(var2) else var2 end as var2,
case when missing(var3) then mean(var3) else var3 end as var3,
case when missing(var4) then mean(var4) else var4 end as var4,
case when missing(var5) then mean(var5) else var5 end as var5
from temp;
quit;
And, as Joe mentioned, you can use coalesce instead if you prefer that syntax:
coalesce(var1, mean(var1)) as var1
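Written out in full for the temp data set above, the coalesce version might look like this sketch (the output table name temp3 is made up for the example; expect a note in the log about remerging summary statistics with the original data):

```sas
proc sql;
  create table temp3 as
  select coalesce(var1, mean(var1)) as var1,  /* column mean where var1 is missing */
         coalesce(var2, mean(var2)) as var2,
         coalesce(var3, mean(var3)) as var3,
         coalesce(var4, mean(var4)) as var4,
         coalesce(var5, mean(var5)) as var5
  from temp;
quit;
```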

outputting multiple data sets into excel workbook

Another question. I have multiple data sets that generate output. How can I output these into one Excel worksheet and apply my own formatting? For example, I have data set 1, data set 2, and data set 3;
each data set has two columns, for example
Col 1 Col 2
1 2
3 4
5 6
I want all the data sets to be in one worksheet, separated by a blank column, so in Excel it should look like
Col 1 Col 2 Blank Col Col 1 Col 2 Blank Col
Someone told me I need to look at DDE for this. Is this true?
Regards,
You can definitely do it using DDE. What DDE does is simulate a user's clicks on Excel's menus, buttons, cells, etc. Here's an example of how you can do that with a macro loop for 3 datasets named have1, have2 and have3. If you need a more general solution (an unknown number of datasets, varying numbers of variables, arbitrary dataset names, etc.), the code will need to be updated, but its DDE part will be essentially the same.
One more assumption: your Excel workbook should be open during code execution, although this can also be automated, since Excel can be started and the file opened using DDE itself.
You can find a very nice introduction to DDE here, where all these tricks are discussed in detail.
data have1;
input Col1 Col2;
datalines;
1 2
3 4
5 6
;
run;
data have2;
input Col1 Col2;
datalines;
1 2
3 4
5 6
7 8
;
run;
data have3;
input Col1 Col2;
datalines;
1 2
3 4
7 8
5 6
9 10
;
run;
%macro xlsout;
/*iterating through your datasets*/
%do i=1 %to 3;
/*determine number of records in the current dataset*/
proc sql noprint;
select count(*) into :noobs
from have&i;
quit;
/*assign a range on the workbook spreadsheet matching to data in the current dataset*/
filename range dde "excel|[myworkbook.xls]sas!r1c%eval((&i-1)*3+1):r%left(&noobs)c%eval((&i-1)*3+2)" notab;
/*put data into selected range*/
data _null_;
set have&i;
file range;
put Col1 '09'x Col2;
run;
%end;
%mend xlsout;
%xlsout
You cannot do exactly this with SAS (DDE is probably possible). I would suggest looking at SaviCells Pro.
http://www.sascommunity.org/wiki/SaviCells
http://www.savian.net/utilities.html
You could likely accomplish what you're asking through ODS TAGSETS.EXCELXP or the new ODS EXCEL (9.4 TS1M1). You would need to arrange the datasets ahead of time (ie, merge them together or transpose or whatnot to get one dataset with the right columns), however, or else use PROC REPORT or some other procedure to get them in the right format.
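A rough sketch of the ODS EXCEL route, assuming SAS 9.4 and the have1, have2 and have3 example data sets from the DDE answer. The file name myworkbook.xlsx, the renamed columns, and the blank spacer variables are all made up for illustration:

```sas
/* One-to-one merge (no BY statement) places the data sets side by side */
data side_by_side;
  merge have1(rename=(Col1=a_Col1 Col2=a_Col2))
        have2(rename=(Col1=b_Col1 Col2=b_Col2))
        have3(rename=(Col1=c_Col1 Col2=c_Col2));
  blank1 = ' ';  /* empty spacer column between data sets 1 and 2 */
  blank2 = ' ';  /* empty spacer column between data sets 2 and 3 */
run;

/* Write the combined table to a single worksheet */
ods excel file="myworkbook.xlsx" options(sheet_name="sas");
proc print data=side_by_side noobs;
  var a_Col1 a_Col2 blank1 b_Col1 b_Col2 blank2 c_Col1 c_Col2;
run;
ods excel close;
```

Because the data sets have different numbers of rows, the shorter ones are padded with missing values, which print as "." unless you change the MISSING= system option. Column headings and other formatting would still need LABEL statements or PROC REPORT.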

replicating a sql function in sas datastep

Hi, another quick question.
In PROC SQL we have ON, which is used for conditional joins. Is there something similar for the SAS data step?
For example:
proc sql;
....
data1 left join data2
on first<value<last
quit;
can we replicate this in sas datastep
like
data work.combined;
set data1(in=a) data2(in=b);
if a then output;
run;
You can also reproduce an SQL join in one DATA step using hash objects. It can be really fast, but it depends on how much RAM your machine has, since this method loads one table into memory: the more RAM, the larger the data set you can load into the hash. This method is particularly effective for lookups in a relatively small reference table.
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
data want;
if _N_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2');
h.defineKey('value');
h.defineData('value');
h.defineDone();
declare hiter hi('h');
end;
set have1;
rc=hi.first();
do while(rc=0);
if first<value<last then output;
rc=hi.next();
end;
drop rc;
run;
The result:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
Yes, there is a simple (but subtle) way in just 7 lines of code.
What you intend to achieve is intrinsically a conditional Cartesian join, which can be done with a SET statement inside a DO loop. The following code uses the test data set from Dmitry and a modified version of the code in the appendix of SUGI Paper 249-30.
data data1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data data2;
input value;
datalines;
2
5
6
7
;
run;
/***** by data step looped SET *****/
DATA CART_data;
SET data1;
DO i=1 TO NN; /*NN can be referenced before set*/
SET data2 point=i nobs=NN; /*point=i - random access*/
if first<value<last then OUTPUT; /*conditional output*/
END;
RUN;
/***** by SQL *****/
proc sql;
create table cart_SQL as
select * from data1
left join data2
on first<value<last;
quit;
For this sample data, one can easily see that the two results coincide.
Also note that from SAS 9.2 documentation: "At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you CAN refer to the NOBS= variable BEFORE the SET statement. The variable is available in the DATA step but is not added to any output data set."
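One caveat: the looped SET above only writes rows that satisfy the condition, so a data1 row with no matching value would be dropped, whereas the left join keeps it with a missing value. A hedged sketch of one way to keep such rows, assuming the same data1 and data2 as above (the names CART_left and found are made up for the example):

```sas
DATA CART_left;
  SET data1;
  found = 0;                          /* reset the match flag for each data1 row */
  DO i = 1 TO NN;
    SET data2 point=i nobs=NN;
    if first < value < last then do;
      found = 1;
      OUTPUT;                         /* one output row per matching value */
    end;
  END;
  if not found then do;
    call missing(value);              /* value is retained from the last SET, so clear it */
    OUTPUT;                           /* keep the unmatched data1 row, as a left join would */
  end;
  drop found;
RUN;
```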
There isn't a direct way to do this with a MERGE. This is one example where the SQL method is clearly superior to any SAS data step methods, as anything you do will take much more code and possibly more time.
However, depending on the data, it's possible a few approaches may make sense. In particular, the format merge.
If data1 is fairly small (even, say, millions of records), you can make a format out of it. Like so:
data fmt_set;
set data1;
format label $8.;
start=first; *set up the names correctly;
end=last;
label='MATCH';
fmtname='DATA1F';
output;
if _n_=1 then do; *put out a hlo='o' line which is for unmatched lines;
start=.; *both unnecessary but nice for clarity;
end=.;
label='NOMATCH';
hlo='o';
output;
end;
run;
proc format cntlin=fmt_set; *import the dataset;
quit;
data want;
set data2;
if put(value,DATA1F.)="MATCH";
run;
This is very fast to run, unless data1 is extremely large (hundreds of millions of rows, on my system) - faster than a data step merge, if you include sort time, since this doesn't require a sort. One major limitation is that this will only give you one row per data2 row; if that is what is desired, then this will work. If you want repeats of data2 then you can't do it this way.
If data1 may have overlapping rows (ie, two rows where start/end overlap each other), you also will need to address this, since start/end aren't allowed to overlap normally. You can set hlo="m" for every row, and "om" for the non-match row, or you can resolve the overlaps.
I'd still do the sql join, however, since it's much shorter to code and much easier to read, unless you have performance issues, or it doesn't work the way you want it to.
Here's another solution, using a temporary array to hold the lookup dataset. Performance is probably similar to Dmitry's hash-based solution, but this should also work for people still using versions of SAS prior to 9.1 (when hash objects were first introduced).
I've reused Dmitry's sample datasets:
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
/*We need a macro var with the number of obs in the lookup dataset*/
/*This is so we can specify the dimension for the array to hold it*/
data _null_;
if 0 then set have2 nobs = nobs;
call symput('have2_nobs',put(nobs,8.));
stop;
run;
data want_temparray;
array v{&have2_nobs} _temporary_;
do _n_ = 1 to &have2_nobs;
set have2 (rename=(value=value_array));
v{_n_}=value_array;
end;
do _n_ = 1 by 1 until (eof_have1);
set have1 end = eof_have1;
value=.;
do i=1 to &have2_nobs;
if first < v{i} < last then do;
value=v{i};
output;
end;
end;
if missing(value) then output;
end;
drop i value_array;
run;
Output:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
This matches the output from the equivalent SQL:
proc sql;
create table want_sql as
select * from
have1 left join have2
on first<value<last
;
quit;
run;

Export the types of variables in a SAS dataset

Is there any easy way to capture and export the type of each variable in a SAS dataset? I'm exporting a dataset to CSV format to read into R, and the read.table procedure in the latter can work more efficiently if it also knows the data type of each variable.
PROC CONTENTS has an OUT= option to output a data set with variable attributes. In that data set, type=1 is numeric and type=2 is character. HTH.
proc contents data=sashelp.class out=vars;
run;
proc print data=vars noobs;
var varnum name type length;
run;
/* on lst
VARNUM NAME TYPE LENGTH
3 Age 1 8
4 Height 1 8
1 Name 2 8
2 Sex 2 1
5 Weight 1 8
*/
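If you would rather get the types as text and write them straight to a file that R can read, a possible alternative is to query DICTIONARY.COLUMNS, where type is stored as 'num' or 'char'. This is a sketch only; the table name vartypes and the file name vartypes.csv are made up for the example:

```sas
/* Collect name, type and length for every variable in SASHELP.CLASS */
proc sql;
  create table vartypes as
  select varnum, name, type, length
  from dictionary.columns
  where libname = 'SASHELP' and memname = 'CLASS'
  order by varnum;
quit;

/* Write the variable attributes to a CSV file alongside the data export */
proc export data=vartypes outfile="vartypes.csv" dbms=csv replace;
run;
```

Note that libname and memname are stored in upper case in the dictionary tables, so the WHERE clause has to match that.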