SAS - Create Dummy Variables for All Variables

I have a dataset with some number of categorical variables for each record. I would like to turn it into a new dataset of dummy variables, using one command or macro that creates the dummy variables for every variable in the dataset.
I also don't want to specify the name of each variable: a dataset could have 50 variables, and listing every name would be too cumbersome.
Let's say I have a table like this, and I want the resulting table, under the conditions above: a single command or macro, without naming each individual variable:

You can use PROC GLMSELECT to generate the design matrix, which is what you are asking for.
data test;
input id v1 $ v2 $ v3 $ ;
datalines;
1 A A A
2 B B B
3 C C C
4 A B C
5 B A A
6 C B A
;
proc glmselect data=test outdesign(fullmodel)=test_design noprint ;
class v1 -- v3;
model id = v1 -- v3 /selection=none noint;
run;
You can use -- to specify all variables between the first and the last, in dataset order. Notice I don't have to type v2. So if you know the first and the last variable, you can get what you want easily.
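If the first and last variable names are not known in advance, one option is to build the list from the dictionary tables. This is a minimal sketch, assuming the test dataset above lives in WORK and that every character variable should become a CLASS variable; the macro variable name classvars is made up for illustration.

```sas
/* Collect the names of all character variables in WORK.TEST */
proc sql noprint;
select name into :classvars separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = 'TEST' and type = 'char';
quit;

/* Same GLMSELECT call as above, driven by the generated list */
proc glmselect data=test outdesign(fullmodel)=test_design noprint;
class &classvars;
model id = &classvars / selection=none noint;
run;
```

The libname and memname values in dictionary.columns are stored in uppercase, so the WHERE clause compares against uppercase literals.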

I prefer GLMMOD myself. One note: if you can, using CLASS variables directly is usually a better way to go, but the CLASS statement is not supported by all procs.
/*Run model within PROC GLMMOD for it to create design matrix
Include all variables that might be in the model*/
proc glmmod data=sashelp.class outdesign=want outparm=p;
class sex age;
model weight=sex age height;
run;
/*Create rename statement automatically
THIS WILL NOT WORK IF YOUR VARIABLE NAMES WILL END UP OVER 32 CHARS*/
data p;
set p;
if _n_=1 and effname='Intercept' then
var='Col1=Intercept';
else
var=catt("Col", _colnum_, "=", catx("_", effname, vvaluex(effname)));
run;
proc sql ;
select var into :rename_list separated by " " from p;
quit;
/*Rename variables*/
proc datasets library=work nodetails nolist;
modify want;
rename &rename_list;
run;
quit;
proc print data=want;
run;
Originally from here and the post has links to several other methods.
https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-dummy-variables-Categorical-Variables/ta-p/308484

Here is a worked example using your simple three-observation dataset and a modified version of the PROC GLMMOD method posted by @Reeza.
First let's make a sample dataset with a long character ID variable. We will introduce a numeric ROW variable that we can later use to merge the design matrix back with the input data.
data have;
input id :$21. education_lvl $ income_lvl $ ;
row+1;
datalines;
1 A A
2 B B
3 C C
;
You could set the list of variables into a macro variable since we will need to use it in multiple places.
%let varlist=education_lvl income_lvl;
Use PROC GLMMOD to generate the design matrix and the parameter list that we will later use to generate user friendly variable names.
proc glmmod data=have outdesign=design outparm=parm noprint;
class &varlist;
model row=&varlist / noint ;
run;
Now let's use the parameter list to generate rename statement to a temporary text file.
filename code temp;
data _null_;
set parm end=eof;
length rename $65 ;
rename = catx('=',cats('col',_colnum_),catx('_',effname,of &varlist));
file code ;
if _n_=1 then put 'rename ' ;
put @3 rename ;
if eof then put ';' ;
run;
Now let's merge back with the input data and rename the variables in the design matrix.
data want;
merge have design;
by row ;
%inc code / source2;
run;

Related

How can I make the first row of a SAS dataset the variable names?

I have an already imported dataset where the first row contains the variable names. I know that typically when importing a dataset you use getnames = yes. However, if the data is already imported how can I make the first row the variable names using a data step?
Data looks like:
    A       B       C
1   Name 1  Name 2  Name 3
2   2       4       66
3   3       5       6
Since reading the names as data probably made all of your variables character you can try just transposing the data twice to fix it. That will work well for small datasets.
So the first transpose will place the current name into the _NAME_ variable and convert each row into a column. The second proc transpose can drop the original name and use the first row (new COL1 variable) as the names.
proc transpose data=have out=wide ;
var _all_;
run;
proc transpose data=wide(drop=_name_ rename=(col1=_name_)) out=want(drop=_name_ _label_);
var col:;
id _name_;
run;
The problem with the already imported data is that all the numeric data was likely placed in character variables, because the 'first row' of data seen by the import process contained character data and drove the inference for automatic column construction.
Regardless, you will need to construct renaming pairs old-name=new-name for each variable that has to be renamed. Because the new names are in row 1, you can transpose that row to arrange the name parts as data. SQL with INTO : and SEPARATED BY can populate a macro variable for use in a PROC DATASETS step that performs the column renaming without rewriting the entire data set. Finally, a DATA step with MODIFY can remove a row in place, again without rewriting the entire data set.
filename sandbox temp;
data _null_;
file sandbox;
put 'A,B,C';
put 'Name 1, Name 2, Name 3';
put '2,4,66';
put '3,5,6';
run;
proc import datafile=sandbox dbms=csv replace out=work.oops;
run;
proc transpose data=oops(obs=1) out=renames;
var _all_;
run;
proc sql noprint;
select cats(_name_,"=",compress(col1,,"KN"))
into :renames separated by ' '
from renames;
%put NOTE: &=renames;
proc datasets nolist lib=work;
modify oops;
rename &renames;
run;
data oops;
modify oops;
remove;
stop;
run;
%let syslast=oops;

SAS-Creating Panel by several datasets

Suppose there are ten datasets with the same structure (date and price). They cover the same time period but have different prices:
date price
20140604 5
20140605 7
20140607 9
I want to combine them and create a panel dataset. Since the datasets carry no identifying name, I attempt to add a new variable, name, to each dataset and then combine them.
The following code is used to add the name variable to each dataset:
%macro name(sourcelib=,from=,going=);
proc sql noprint; /*read datasets in a library*/
create table mytables as
select *
from dictionary.tables
where libname = &sourcelib
order by memname ;
select count(memname)
into:obs
from mytables;
%let obs=&obs.;
select memname
into : memname1-:memname&obs.
from mytables;
quit;
%do i=1 %to &obs.;
data
&going.&&memname&i;
set
&from.&&memname&i;
name=&&memname&i;
run;
%end;
%mend;
So, is this strategy correct? Is there a different way to create panel data?
There are really two ways to setup repeated measures data. You can use the TALL method that your code will create. That is generally the most flexible. The other would be a wide format with each PRICE being stored in a different variable. That is usually less flexible, but can be easier for some analyses.
You probably do not need to use macro code or even code generation to combine 10 datasets. You might find that it is easier to just type the 10 dataset names than to write complex code to pull the names from metadata. So a data step like this will let you list any number of datasets in the SET statement and use the membername as the value for the new PANEL variable that distinguishes the source dataset.
data want ;
length dsn $41 panel $32 ;
set in1.panel1 in1.panela in1.panelb indsname=dsn ;
panel = scan(dsn,-1,'.') ;
run;
And if your dataset names follow a pattern that can be used as a member list in the SET statement then the code is even easier to write. So you could have a list of names that have a numeric suffix.
set in1.panel1-in1.panel10 indsname=dsn ;
or perhaps names that all start with a particular prefix.
set in1.panel: indsname=dsn ;
If the different panels are for the same dates then perhaps the wide format is easier? You could then merge the datasets by DATE and rename the individual PRICE variables. That is generate a data step that looks like this:
data want ;
merge in1.panel1 (rename=(price=price1))
in1.panel2 (rename=(price=price2))
...
;
by date;
run;
Or perhaps it would be easier to add a BY statement to the data set that makes the TALL dataset and then transpose it into the WIDE format.
data tall;
length dsn $41 panel $32 ;
set in1.panel1 in1.panela in1.panelb indsname=dsn ;
by date ;
panel = scan(dsn,-1,'.') ;
run;
proc transpose data=tall out=want ;
by date;
id panel;
var price ;
run;
I can't comment on the SQL code but the strategy is correct. Add a name to each data set and then panel on the name with the PANELBY statement.
That is a valid way to achieve what you are looking for.
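You are going to need two periods between the macro variable and the member name for library.data syntax. The first . marks the end of the macro variable reference and is consumed; the second is the literal . that appears in the resolved text.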
I assume you will want to append all of these data sets together. You can add
data &going..want;
set
%do i=1 %to &obs;
&from..&&memname&i
%end;
;
run;
You can combine your loop that adds the names and that data step like this:
data &going..want;
set
%do i=1 %to &obs;
&from..&&memname&i (in=d&i)
%end;
;
%do i=1 %to &obs;
if d&i then
name = "&&memname&i";
%end;
run;

Create new variables from format values

What I want to do: I need to create a new variable for each value label of a variable and do some recoding. I have all the value labels output from an SPSS file (see sample).
Sample:
proc format library = library ;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
value ... (many more with different amount of levels)
The new variable name would be the format name without the trailing F, followed by an underscore and the level (example: FUMERT1F level 0 would become FUMERT1_0).
After that I need to recode the variables on this pattern:
data ds; set ds;
FUMERT1_0=0;
if FUMERT1=0 then FUMERT1_0=1;
FUMERT1_1=0;
if FUMERT1=1 then FUMERT1_1=1;
FUMERT1_2=0;
if FUMERT1=2 then FUMERT1_2=1;
FUMERT1_3=0;
if FUMERT1=3 then FUMERT1_3=1;
run;
Any help will be appreciated :)
EDIT: Both of Joe's answers and the one from data_null_ work, but Stack Overflow won't let me accept more than one answer.
Update to add an underscore (_) to the end of each name. It looks like there is no option in PROC TRANSREG to put an underscore between the variable name and the value of the class variable, so we can do a temporary rename instead. Create name=newname pairs to rename each class variable to end in an underscore, plus the reverse pairs to rename them back, using the CAT functions and SQL INTO macro variables.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
%let class=sex fumert1;
proc transpose data=have(obs=0) out=vnames;
var &class;
run;
proc print;
run;
proc sql noprint;
select catx('=',_name_,cats(_name_,'_')), catx('=',cats(_name_,'_'),_name_), cats(_name_,'_')
into :rename1 separated by ' ', :rename2 separated by ' ', :class2 separated by ' '
from vnames;
quit;
%put NOTE: &=rename1;
%put NOTE: &=rename2;
%put NOTE: &=class2;
proc transreg data=have(rename=(&rename1));
model class(&class2 / zero=none);
id caseid;
output out=design(drop=_: inter: rename=(&rename2)) design;
run;
%put NOTE: _TRGIND(&_trgindn)=&_trgind;
First try:
Looking at the code you supplied and the output from Joe's I don't really understand the need for the formats. It looks to me like you just want to create dummies for a list of class variables. That can be done with TRANSREG.
data have;
call streaminit(1234);
do caseID = 1 to 1e4;
fumert1 = rand('table',.2,.2,.2) - 1;
sex = first(substrn('MF',rand('table',.5),1));
output;
end;
stop;
run;
proc transreg data=have;
model class(sex fumert1 / zero=none);
id caseid;
output out=design(drop=_: inter:) design;
run;
proc contents;
run;
proc print data=design(obs=40);
run;
One good alternative to your code is to use proc transpose. It won't get you 0's in the non-1 cells, but those are easy enough to get. It does have the disadvantage that it makes it harder to get your variables in a particular order.
Basically, transpose once to vertical, then transpose back using the old variable name concatenated to the variable value as the new variable name. Hat tip to Data null for showing this feature in a recent SAS-L post. If your version of SAS doesn't support concatenation in PROC TRANSPOSE, do it in the data step beforehand.
I show using PROC EXPAND to then set the missings to 0, but you can do this in a data step as well if you don't have ETS or if PROC EXPAND is too slow. There are other ways to do this - including setting up the dataset with 0s pre-proc-transpose - and if you have a complicated scenario where that would be needed, this might make a good separate question.
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
proc transpose data=have out=want_pre;
by caseID;
var fumert1 sex;
copy fumert1 sex;
run;
data want_pre_t;
set want_pre;
x=1; *dummy variable;
run;
proc transpose data=want_pre_t out=want delim=_;
by caseID;
var x;
id _name_ col1;
copy fumert1 sex;
run;
proc expand data=want out=want_e method=none;
convert _numeric_ /transformin=(setmiss 0);
run;
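The DATA step alternative to PROC EXPAND mentioned above can be sketched as follows. It assumes the want dataset produced by the second PROC TRANSPOSE; the output name want_ds and the index variable i are illustrative only.

```sas
/* Replace missing values with 0 in every numeric variable of WANT */
data want_ds;
set want;
array nums {*} _numeric_;   /* all numeric variables defined so far */
do i = 1 to dim(nums);
  if missing(nums{i}) then nums{i} = 0;
end;
drop i;                     /* i is created after the array, so it is not part of it */
run;
```

This touches caseID, fumert1 and sex as well, which is harmless here because none of them is missing.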
For this method, you need to use two concepts: the cntlout dataset from proc format, and code generation. This method will likely be faster than the other option I presented (as it passes through the data only once), but it does rely on the variable name <-> format relationship being straightforward. If it's not, a slightly more complex variation will be required; you should post to that effect, and this can be modified.
First, the cntlout option in proc format makes a dataset of the contents of the format catalog. This is not the only way to do this, but it's a very easy one. Specify the appropriate libname as you would when you create a format, but instead of making one, it will dump the dataset out, and you can use it for other purposes.
Second, we create a macro that performs your action one time (creating a variable with the name_value name and then assigning it to the appropriate value) and then use proc sql to make a bunch of calls to that macro, once for each row in your cntlout dataset. Note - you may need a where clause here, or some other modifications, if your format library includes formats for variables that aren't in your dataset - or if it doesn't have the nice neat relationship your example does. Then we just make those calls in a data step.
*Set up formats and dataset;
proc format;
value SEXF
1 = 'Homme'
2 = 'Femme' ;
value FUMERT1F
0 = 'Non'
1 = 'Oui , occasionnellement'
2 = 'Oui , régulièrement'
3 = 'Non mais j''ai déjà fumé' ;
quit;
data have;
do caseID = 1 to 1e4;
fumert1 = rand('Binomial',.3,3);
sex = rand('Binomial',.5,1)+1;
output;
end;
run;
*Dump formats into table;
proc format cntlout=formats;
quit;
*Macro that does the above assignment once;
%macro spread_var(var=, val=);
&var._&val.= (&var.=&val.); *result of boolean expression is 1 or 0 (T=1 F=0);
%mend spread_var;
*make the list. May want NOPRINT option here as it will make a lot of calls in your output window otherwise, but I like to see them as output.;
proc sql;
select cats('%spread_var(var=',substr(fmtname,1,length(Fmtname)-1),',val=',start,')')
into :spreadlist separated by ' '
from formats;
quit;
*Actually use the macro call list generated above;
data want;
set have;
&spreadlist.;
run;
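The where-clause caveat above can be handled by joining the format table to the dataset's column list, so that calls are generated only for variables that actually exist. This is a sketch, assuming the dataset is WORK.HAVE and that each format is named as the variable name plus a trailing F, as in the example.

```sas
/* Generate %spread_var calls only for formats whose base name matches a variable in WORK.HAVE */
proc sql noprint;
select cats('%spread_var(var=', c.name, ',val=', f.start, ')')
into :spreadlist separated by ' '
from formats as f
inner join dictionary.columns as c
on upcase(c.name) = substr(f.fmtname, 1, length(f.fmtname) - 1)
where c.libname = 'WORK' and c.memname = 'HAVE';
quit;
```

The resulting spreadlist macro variable is used in the final data step exactly as before.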

Set the labels of a SAS Dataset equal to their variable name

I'm working with several rather large datasets that are provided to me as CSV files. When I attempt to import one of the files the data comes in fine, but the number of variables in the file is too large for SAS, so it stops reading the variable names and starts assigning them sequential numbers. To keep the variable names from the dataset, I read in the file with the data row starting at 1, so that the first row is not read as variable names:
proc import file="X:\xxx\xxx\xxx\Extract\Live\Live.xlsx" out=raw_names dbms=xlsx replace;
SHEET="live";
GETNAMES=no;
DATAROW=1;
run;
I then run a macro to start breaking down the dataset and rename the variables based on the first observation of each variable:
%macro raw_sas_datasets(lib,output,start,end);
data raw_names2;
set raw_names;
if _n_ ne 1 then delete;
keep A -- E &start. -- &end.;
run;
proc transpose data=raw_names2 out=raw_names2;
var A -- &end.;
run;
data raw_names2;
set raw_names2;
col1=compress(col1);
run;
data raw_values;
set raw;
keep A -- E &start. -- &end.;
run;
%macro rename(old,new);
data raw_values;
set raw_values;
rename &old.=&new.;
run;
%mend rename;
data _null_;
set raw_names2;
call execute('%rename('||_name_||","||col1||")");
run;
%macro freq(var);
proc freq data=raw_values noprint;
tables &var. / out=&var.;
run;
%mend freq;
data raw_names3;
set raw_names2;
if _n_ < 6 then delete;
run;
data _null_;
set raw_names3;
call execute('%freq('||col1||")");
run;
proc sort data=raw_values;
by StudySubjectID;
run;
data &lib..&output.;
set raw_values;
run;
%mend raw_sas_datasets;
The problem I'm running into is that the variable names are now all set properly and the data is lined up correctly, but the labels are still the original SAS assigned sequential numbers. Is there any way to set all of the labels equal to the variable names?
If you just want to remove the variable labels (at which point they default to the variable name), that's easy. From the SAS Documentation:
proc datasets lib=&lib.;
modify &output.;
attrib _all_ label=' ';
run;
I suspect you have a simpler solution than the above, though.
The actual renaming step needs to be done differently. Right now it's rewriting the entire dataset over and over again - for a lot of variables that is a terrible idea. Get your rename statements all into one datastep, or into a PROC DATASETS, or something else. Look up 'list processing SAS' for details on how to do that; on this site or on google you will find lots of solutions.
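As one possible shape for that, the sketch below builds every old=new pair in a single macro variable and applies them with one PROC DATASETS call, which only edits the dataset header. It assumes raw_names2 holds _name_ (current name) and col1 (new name) as in the question's macro, that every col1 value is a valid SAS name, and that the full list fits in one macro variable; rename_list is an illustrative name.

```sas
/* Build all rename pairs at once from the transposed name table */
proc sql noprint;
select catx('=', _name_, col1) into :rename_list separated by ' '
from raw_names2;
quit;

/* Apply them in one pass without rewriting the data */
proc datasets library=work nolist;
modify raw_values;
rename &rename_list;
run;
quit;
```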
You likely can get SAS to read in the whole first line. The number of variables isn't the problem; it is probably the length of the line. There's another question that I'll find if I can on this site from a few months ago that deals with this exact problem.
My preferred option is not to use PROC IMPORT for CSVs anyway; I would suggest writing a metadata table that stores the variable names and the length/types for the variables, then using that to write import code. A little more work at first, but only has to be done once per study and you guarantee PROC IMPORT isn't making silly decisions for you.
In the library sashelp is a table vcolumn. vcolumn contains all the names of your variables for each library by table. You could write a macro that puts all your variable names into macro variables and then from there set the label.
Here's some code that I put together (not very pretty) but it does what you're looking for:
data test.label_var;
x=1;
y=1;
label x = 'xx';
label y = 'yy';
run;
proc sql noprint;
select count(*) into: cnt
from sashelp.vcolumn
where memname = 'LABEL_VAR';quit;
%let cnt = &cnt;
proc sql noprint;
select name into: name1 - :name&cnt
from sashelp.vcolumn
where memname = 'LABEL_VAR';quit;
%macro test;
%do i = 1 %to &cnt;
proc datasets library=test nolist;
modify label_var;
label &&name&i="&&name&i";
quit;
%end;
%mend test;
%test;
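The loop above calls PROC DATASETS once per variable. A more compact variant, offered only as a sketch, builds all the name="name" pairs in one query and applies them in a single step. It assumes the same test.label_var dataset; label_list is an illustrative macro variable name, and the libname filter is added so that same-named tables in other libraries are ignored.

```sas
/* Build pairs such as x="x" y="y" for every column of TEST.LABEL_VAR */
proc sql noprint;
select catx('=', name, quote(trim(name))) into :label_list separated by ' '
from dictionary.columns
where libname = 'TEST' and memname = 'LABEL_VAR';
quit;

/* Set every label to its variable name in one pass */
proc datasets library=test nolist;
modify label_var;
label &label_list;
run;
quit;
```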

SAS - Creating variables from macro variables

I have a SAS dataset which has 20 character variables, all of which are names (e.g. Adam, Bob, Cathy etc..)
I would like dynamic code to create variables called Adam_ref, Bob_ref, etc., which will work even if there is a different dataset with different names (i.e. I don't want to manually define each variable).
So far my approach has been to use proc contents to get all variable names and then use a macro to create macro variables Adam_ref, Bob_ref etc..
How do I create actual variables within the dataset from here? Do I need a different approach?
proc contents data=work.names
out=contents noprint;
run;
proc sort data = contents; by varnum; run;
data contents1;
set contents;
Name_Ref = compress(Name||"_Ref");
call symput (NAME, NAME_Ref);
%put _user_;
run;
If you want to create an empty dataset that has variables named after values held in macro variables, you could do something like this.
Save the values into macro variables that are named by some pattern, like v1, v2 ...
proc sql;
select compress(Name||"_Ref") into :v1-:v20 from contents;
quit;
If you don't know how many values there are, you have to count them first, I assumed there are only 20 of them.
Then, if all your variables are character variables of length 100, you create a dataset like this:
%macro create_dataset;
data want;
length %do i=1 %to 20; &&v&i $100 %end;
;
stop;
run;
%mend;
%create_dataset; run;
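To avoid hard-coding 20, the count can be taken from the automatic macro variable SQLOBS after the query. This is a sketch, assuming a SAS release that accepts an open-ended INTO range (:v1-); the names nvars and create_dataset_n are illustrative.

```sas
/* Load one macro variable per row; the open-ended range grows as needed */
proc sql noprint;
select compress(name||"_Ref") into :v1- from contents;
quit;
%let nvars = &sqlobs;   /* number of rows returned by the last query */

%macro create_dataset_n;
%local i;
data want;
length %do i=1 %to &nvars; &&v&i $100 %end;
;
stop;
run;
%mend create_dataset_n;
%create_dataset_n
```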
This is how you can do it if you have the values in macro variable, there is probably a better way to do it in general.
If you don't want to create an empty dataset but only change the variable names, you can do it like this:
proc sql;
select name into :v1-:v20 from contents;
quit;
%macro rename_dataset;
data new_names;
set have(rename=(%do i=1 %to 20; &&v&i = &&v&i.._ref %end;));
run;
%mend;
%rename_dataset; run;
You can use PROC TRANSPOSE with an ID statement.
This step creates an example dataset:
data names;
harry="sally";
dick="gordon";
joe="schmoe";
run;
This step is essentially a copy of your step above that produces a dataset of column names. I will reuse the dataset namerefs throughout.
proc contents data=names out=namerefs noprint;
run;
This step adds the "_Ref" suffix to the names defined before and drops everything else. The variable "name" comes from the column attributes of the dataset output by PROC CONTENTS.
data namerefs;
set namerefs (keep=name);
name=compress(name||"_Ref");
run;
This step produces an empty dataset with the desired columns. The variable "name" is again obtained by looking at column attributes. You might get a harmless warning in the GUI if you try to view the dataset, but you can otherwise use it as you wish and you can confirm that it has the desired output.
proc transpose out=namerefs(drop=_name_) data=namerefs;
id name;
run;
Here is another approach which requires less coding. It does not require running proc contents, does not require knowing the number of variables, nor creating a macro function. It also can be extended to do some additional things.
Step 1 is to use the built-in dictionary views to get the desired variable names. The appropriate view for this is dictionary.columns, which has the alias sashelp.vcolumn. The dictionary libref can be used only in proc sql, while the sashelp alias can be used anywhere. I tend to use the sashelp alias since I work in Windows with DMS and can always view the sashelp library interactively.
proc sql;
select compress(Name||"_Ref") into :name_list
separated by ' '
from sashelp.vcolumn
where libname = 'WORK'
and memname = 'NAMES';
quit;
This produces a space-delimited macro variable with the desired names.
Step 2: to build the empty data set, this code will work. The LENGTH statement needs a length after the list of names, so a character length of 100 is assumed here:
Data New ;
length &name_list $ 100 ;
stop ;
run ;
You can avoid assuming lengths, or create a populated dataset with the new variable names, by using a slightly more complicated select statement.
For example
select compress(Name)||"_Ref $"||compress(put(length,best.))
into :name_list
separated by ' '
will generate a macro variable which retains the previous length for each variable. Since each name then carries its own length, the LENGTH statement in Step 2 becomes length &name_list ; with no trailing $ 100.
To create populated data set for use with rename dataset option, replace the select statement as follows:
select compress(Name)||"="||compress(Name)||"_Ref"
into :name_list
separated by ' '
Then replace the Step 2 code with the following:
Data New ;
set names (rename = ( &name_list)) ;
run ;