How can I convert the output of a SAS data column into a macro variable?
For example:
Var1 | Var2
-----------
A | 1
B | 2
C | 3
D | 4
E | 5
What if I want a macro variable containing all of the values in Var1 to use in a PROC REG or other procedure? How can I extract that column into a variable which can be used in other PROCS?
In other words, I would want to generate the equivalent statement:
%LET Var1 =
A
B
C
D
E
;
But I will have different results coming from a previous procedure so I can't just do a '%LET'. I have been exploring SYMPUT and SYMGET, but they seem to apply only to single observations.
Thank you.
proc sql;
select var1
into :varlist separated by ' '
from have;
quit;
This creates the macro variable &varlist, holding every value of var1 separated by the character you specify. If you don't specify a separator, the macro variable holds only the first row's value.
There are a lot of other ways, but this is the simplest. CALL SYMPUTX, for example, can do the same thing, but it takes more work to get it to pull all rows into one macro variable.
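For comparison, here is a minimal sketch of the CALL SYMPUTX route. It assumes the same input data set `have` with the column var1 as in the PROC SQL step; the helper variable `list` and its length of 32767 are illustrative choices, not part of the original answer.

```sas
data _null_;
  set have end=eof;
  length list $32767;             /* assumed upper bound for the combined string */
  retain list;                    /* keep the accumulated value across rows */
  list = catx(' ', list, var1);   /* append the current value, space-delimited */
  if eof then call symputx('varlist', list);  /* store once, after the last row */
run;

%put &=varlist;
```

The RETAIN plus END= pattern is what makes this longer than the PROC SQL version: the data step sees one observation at a time, so the list has to be built up and written out only at the end.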
You can use it in a proc directly, no need for a macro variable. I used numeric values for your var1 for simplicity, but you get the idea.
data test;
input var1 var2 @@;
datalines;
1 100 2 200 3 300 4 400 5 500
;
run;
proc reg data=TEST;
MODEL VAR1 = VAR2;
RUN;
I have calculated a frequency table in a previous step. Excerpt below:
I want to automatically drop all variables from this table where the frequency is missing. In the excerpt above, that would mean the variables "Exkl_UtgUtl_Taxi_kvot" and "Exkl_UtgUtl_Driv_kvot" would need to be dropped.
I try the following step in PROC SQL (which ideally I will repeat for all variables in the table):
PROC SQL;
CREATE TABLE test3 as
SELECT (CASE WHEN Exkl_UtgUtl_Flyg_kvot!=. THEN Exkl_UtgUtl_Flyg_kvot ELSE NULL END)
FROM stickprovsstorlekar;
quit;
This fails, however, since SAS does not like NULL values. How do I do this?
I tried just writing:
PROC SQL;
CREATE TABLE test3 as
SELECT (CASE WHEN Exkl_UtgUtl_Flyg_kvot!=. THEN Exkl_UtgUtl_Flyg_kvot)
FROM stickprovsstorlekar;
quit;
But that just generates a variable with an automatically generated name (like DATA_007). I want all variables containing missing values to be totally excluded from the results.
Let's say you have 10 variables, where var1, var3, var5, var7, and var9 have missing values in the first observation. We want to select only the variables with no missing observations.
var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
. 8 . 9 . 6 . 1 . 4
5 1 2 7 2 7 2 9 7 7
5 9 7 7 6 8 5 6 4 9
...
First, let's find all variables that have missing observations:
proc means data=have noprint;
var _NUMERIC_;
output out=missing nmiss=;
run;
Then transpose this output table so it's easier to work with:
proc transpose data=missing out=missing_tpose;
run;
We now have a table that looks like this:
_NAME_ COL1
_TYPE_ 0
_FREQ_ 10
var1 1
var2 0
var3 1
var4 0
var5 1
var6 0
var7 1
var8 0
var9 1
var10 0
When COL1 is > 0 and the name is not _TYPE_ or _FREQ_, the variable has missing values. We want the opposite set, so let's extract the names of the variables where COL1 = 0 from _NAME_ into a comma-separated list.
proc sql noprint;
select _NAME_
into :vars separated by ','
from missing_tpose
where COL1 = 0 AND _NAME_ NOT IN('_TYPE_', '_FREQ_')
;
quit;
Run %put &vars; and you'll see the names of all variables with no missing values, ready to be passed into SQL:
var2,var4,var6,var8,var10
Now we have a dynamic way to select variables with only non-missing values.
proc sql;
create table want as
select &vars
from have
;
quit;
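Since the question asks to drop variables, the same idea can be turned around. The following is a sketch only, assuming the `missing_tpose` table built above; the names `droplist` and `want2` are hypothetical. Note that if no variable has missing values, the query returns no rows and &droplist is never created, so the DROP= option would fail.

```sas
proc sql noprint;
  /* collect the variables that DO have missing values, space-separated */
  select _NAME_
    into :droplist separated by ' '
    from missing_tpose
    where COL1 > 0 and _NAME_ not in ('_TYPE_', '_FREQ_')
  ;
quit;

data want2;
  set have(drop=&droplist);   /* DROP= expects a space-separated list */
run;
```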
How do I avoid spaces/tabs in columns names when I use proc transpose? The best way to illustrate my problem is by giving an example:
Data tst; input ColA $ ColB; datalines;
Cat1 1
Cat2 2
Cat3 3
; run;
proc transpose data = tst out= tst_out (drop = _name_); id ColA;
run;
When running this code my column names look something like this:
Basically I want the column names to be "Cat1", "Cat2", "Cat3" and not " Cat1", " Cat2", " Cat3".
(If that is not possible then I have an alternative question: How do I remove the spaces AFTER proc transpose? In my real data set I have a lot of columns so I prefer a method where I don't have to type for every column)
Just change the setting of the VALIDVARNAME option to V7 instead of ANY. It won't remove the leading spaces/tabs, but it will change them to underscores so the results are valid names.
Example:
data tst;
input ColA $& ColB;
datalines;
Cat 1  1
Cat 2  2
Cat 3  3
;
options validvarname=v7;
proc transpose data=tst out=tst2; id cola ; var colb; run;
proc print;
run;
Result:
Obs _NAME_ Cat_1 Cat_2 Cat_3
1 ColB 1 2 3
PS When using in-line data in your SAS program make sure to start the lines of data in the first column. That will prevent the accidental inclusion of spaces (or tabs when using SAS/Studio interface) in the lines of data. Placing the DATALINES (also known as CARDS) statement starting in the first column will also prevent the editor from automatically indenting when you start adding lines of data.
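If the names should have no leading characters at all, one option is to clean the ID values before transposing. This is a sketch under the assumption that the unwanted characters are tabs and blanks in ColA of the question's data set `tst`; the data set name `tst_clean` is hypothetical.

```sas
data tst_clean;
  set tst;
  /* remove tab characters ('09'x), then leading/trailing blanks */
  ColA = strip(compress(ColA, '09'x));
run;

proc transpose data=tst_clean out=tst_out(drop=_name_);
  id ColA;
  var ColB;
run;
```

COMPRESS removes every tab in the value, including embedded ones, which is fine here but may matter for other data.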
I have a dataset with X number of categorical variables for a given record. I would like to turn this dataset into a new dataset with dummy variables, but I want one command / macro that will take the dataset and make the dummy variables for all variables in it.
I also don't want to specify the name of each variable, because I could have a dataset with 50 variables, so it would be too cumbersome to specify each variable name.
Let's say I have a table like this, and I want the resulting table, with the above conditions that I want a single command or single macro without specifying each individual variable:
You can use PROC GLMSELECT to generate the design matrix, which is what you are asking for.
data test;
input id v1 $ v2 $ v3 $ ;
datalines;
1 A A A
2 B B B
3 C C C
4 A B C
5 B A A
6 C B A
;
proc glmselect data=test outdesign(fullmodel)=test_design noprint ;
class v1 -- v3;
model id = v1 -- v3 /selection=none noint;
run;
You can use -- to specify all variables between the first and last. Notice I don't have to type v2. So if you know the first and the last, you can get what you want easily.
I prefer GLMMOD myself. One note, if you can, CLASS variables are usually a better way to go, but not supported by all PROCS.
/*Run model within PROC GLMMOD for it to create design matrix
Include all variables that might be in the model*/
proc glmmod data=sashelp.class outdesign=want outparm=p;
class sex age;
model weight=sex age height;
run;
/*Create rename statement automatically
THIS WILL NOT WORK IF YOUR VARIABLE NAMES WILL END UP OVER 32 CHARS*/
data p;
set p;
if _n_=1 and effname='Intercept' then
var='Col1=Intercept';
else
var=catt("Col", _colnum_, "=", catx("_", effname, vvaluex(effname)));
run;
proc sql ;
select var into :rename_list separated by " " from p;
quit;
/*Rename variables*/
proc datasets library=work nodetails nolist;
modify want;
rename &rename_list;
run;
quit;
proc print data=want;
run;
Originally from here and the post has links to several other methods.
https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-dummy-variables-Categorical-Variables/ta-p/308484
Here is a worked example using your simple three-observation dataset and a modified version of the PROC GLMMOD method posted by @Reeza.
First let's make a sample dataset with a long character ID variable. We will introduce a numeric ROW variable that we can later use to merge the design matrix back with the input data.
data have;
input id :$21. education_lvl $ income_lvl $ ;
row+1;
datalines;
1 A A
2 B B
3 C C
;
You could set the list of variables into a macro variable since we will need to use it in multiple places.
%let varlist=education_lvl income_lvl;
Use PROC GLMMOD to generate the design matrix and the parameter list that we will later use to generate user friendly variable names.
proc glmmod data=have outdesign=design outparm=parm noprint;
class &varlist;
model row=&varlist / noint ;
run;
Now let's use the parameter list to generate rename statement to a temporary text file.
filename code temp;
data _null_;
set parm end=eof;
length rename $65 ;
rename = catx('=',cats('col',_colnum_),catx('_',effname,of &varlist));
file code ;
if _n_=1 then put 'rename ' ;
put @3 rename ;
if eof then put ';' ;
run;
Now let's merge back with the input data and rename the variables in the design matrix.
data want;
merge have design;
by row ;
%inc code / source2;
run;
How do you take the labels contained in the metadata data set and apply them to the variables in the set1 data set?
The desired result is that 'set1' still contains variables a-h and the appropriate variables now have labels. For example 'set1' will continue to have a variable 'a' with no label, however variable 'b' will now have the label 'Label1' etc.
The code I have below works, but it's very inefficient because it runs the macro for each variable. So for every one label it has to read 'set1', apply a label and save 'set1'. When doing this with large 'set1' and 'metadata' data sets, it's quite slow.
/**********************************************************
Reads metadata - in the real case it comes from a large
csv file
***********************************************************/
data metadata;
input var $ labels $;
datalines;
b Label1
d Label2
f Label3
;
run;
/**********************************************************
Reads 'set1' in the real case it comes from many
even larger csv files.
***********************************************************/
data set1;
input a b c d e f g h;
datalines;
1 1 0 5 6 4 0 4
2 3 4 5 3 5 0 1
3 2 1 9 6 5 8 1
;
run;
/**********************************************************
Macro to relabel one by one
***********************************************************/
%Macro relabel(var,label);
DATA set1;
set set1;
label %quote(&var) = %quote(&label);
RUN;
%Mend relabel;
/**********************************************************
Steps through 'metadata' and individually calls the macro
for each obs
***********************************************************/
data _null_;
set metadata;
call execute('%relabel('||var||','||labels||')');
run;
proc print;
run;
/**********************************************************
Shows labels applied correctly.
***********************************************************/
proc contents;
run;
If your metadata is small enough then you can use a macro variable to hold it. There is a 65K limit on the size of a macro variable.
proc sql noprint;
select catx('=',var,quote(trim(labels)))
into :labels separated by ' '
from metadata
;
quit;
proc datasets nolist lib=work ;
modify set1;
label &labels;
run;
quit;
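If the metadata is too large for a macro variable, a similar result can be had by writing the LABEL statement to a temporary file and including it. This is only a sketch, assuming the `metadata` and `set1` data sets from the question; the fileref `lbl` and the variable `stmt` (with its length of 300) are hypothetical.

```sas
filename lbl temp;

data _null_;
  set metadata end=eof;
  file lbl;
  length stmt $300;                          /* assumed to be long enough */
  stmt = catx('=', var, quote(trim(labels))); /* e.g. b="Label1" */
  if _n_ = 1 then put 'label';
  put @3 stmt;
  if eof then put ';';
run;

proc datasets nolist lib=work;
  modify set1;
  %include lbl / source2;   /* the generated LABEL statement */
run;
quit;
```

Either way, PROC DATASETS only updates the header of set1, so the data itself is not re-read or rewritten.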
I have a data set in SAS that has multiple columns that have missing data. This post replaces all the missing values in the entire data set with zeros. But since it goes through the entire data set you can't just replace the zero with the mean or median for that column. How do I replace missing data with the mean of that column?
There are only 5 or so columns so the script doesn't need to go through the entire data set.
PROC STDIZE has an option to do just this. The REPONLY option tells it you want it to only replace missing values, and METHOD=MEAN tells it how you want to replace those values. (PROC EXPAND also could be used, if you are using time series data, but if you're just using mean, STDIZE is the simpler one.)
For example:
data missing_class;
set sashelp.class;
if _N_=5 then call missing(age);
if _N_=7 then call missing(height);
if _N_=9 then call missing(weight);
run;
proc stdize data=missing_class out=imputed_class
method=mean reponly;
var age height weight;
run;
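The question also mentions the median. A minimal variation on the step above, assuming the same `missing_class` data set; the output name `imputed_median` is hypothetical.

```sas
proc stdize data=missing_class out=imputed_median
  method=median reponly;   /* replace missing values with the column median */
  var age height weight;
run;
```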
Ideally, you would want to use PROC MI to do multiple imputation and get a more accurate representation of missing values; however, if you wish to use the average, an alternate way of doing so is with PROC MEANS and a data step.
/* Set up data */
data have(index=(sex) );
set sashelp.class;
if(_N_ IN(3,7,9,12) ) then call missing(height);
run;
/* Calculate mean of all non-missing values */
proc means data=have noprint;
by sex;
output out=means mean(height) = imp_height;
run;
/* Merge avg. values with original data */
data want;
merge have
means;
by sex;
if(missing(height) ) then height = imp_height;
drop imp_height;
run;
You can use the mean function in proc sql to replace only the missing observations in each column:
data temp;
input var1 var2 var3 var4 var5;
datalines;
. 2 3 4 .
6 7 8 9 10
. 12 . . 15
16 17 18 19 .
21 . 23 24 25
;
run;
proc sql;
create table temp2 as select
case when missing(var1) then mean(var1) else var1 end as var1,
case when missing(var2) then mean(var2) else var2 end as var2,
case when missing(var3) then mean(var3) else var3 end as var3,
case when missing(var4) then mean(var4) else var4 end as var4,
case when missing(var5) then mean(var5) else var5 end as var5
from temp;
quit;
And, as Joe mentioned, you can use coalesce instead if you prefer that syntax:
coalesce(var1, mean(var1)) as var1
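Written out in full for the `temp` data set above, the COALESCE version might look like this sketch (the output table name `temp3` is hypothetical):

```sas
proc sql;
  create table temp3 as select
    coalesce(var1, mean(var1)) as var1,
    coalesce(var2, mean(var2)) as var2,
    coalesce(var3, mean(var3)) as var3,
    coalesce(var4, mean(var4)) as var4,
    coalesce(var5, mean(var5)) as var5
  from temp;
quit;
```

Because the summary function is mixed with detail rows, SAS remerges the column means back onto every row and writes a note about it to the log; that is expected here.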