Categorical variables with macro - sas

I am trying to create categorical variables in sas. I have written the following macro, but I get an error: "Invalid symbolic variable name xxx" when I try to run. I am not sure this is even the correct way to accomplish my goal.
Here is my code:
%macro addvars;
proc sql noprint;
select distinct coverageid
into :coverageid1 - :coverageid9999999
from save.test;
%do i=1 %to &sqlobs;
%let n=coverageid&i;
%let v=%superq(&n);
%let f=coverageid_&v;
%put &f;
data save.test;
set save.test;
%if coverageid eq %superq(&v)
%then &f=1;
%else &f=0;
run;
%end;
%mend addvars;
%addvars;

You're combining macro code with data step code in a way that isn't correct. %if = macro language, meaning you are actually evaluating whether the text "coverageid" is equal to the text that %superq(&v) evaluates to, not whether the contents of the coverageid variable equal the value in &v. You could just convert %if to if, but even if you got that to work properly it would be hideously inefficient (you're rewriting the dataset N times, so if you have 1500 values for coverageID you rewrite the entire 500MB dataset or whatnot 1500 times, instead of just once).
If what you want to do is take the variable 'coverageid' and convert it to a set of variables that consist of all possible values of coverageid, 1/0 binary, for each, there are a nubmer of ways to do it. I'm fairly sure the ETS module has a procedure that just does this, but I don't recall it off the top of my head - if you were to post this to the SAS mailing list, one of the guys there would undoubtedly have it quickly.
The simple way for me, is to do this with entirely datastep code. First determine how many potential values there are for COVERAGEID, then assign each to a direct value, then assign the value to the correct variable.
If the COVERAGEID values are consecutive (ie, 1 to some number, no skips, or you don't mind skipping) then this is easy - set up an array and iterate over it. I will assume they are NOT consecutive.
*First, get the distinct values of coverageID. There are a dozen ways to do this, this works as well as any;
proc freq data=save.test;
tables coverageid/out=coverage_values(keep=coverageid);
run;
*Then save them into a format. This converts each value to a consecutive number (so the lowest value becomes 1, the next lowest 2, etc.) This is not only useful for this step, but it can be useful in the future in converting back.;
data coverage_values_fmt;
set coverage_values;
start=coverageid;
label=_n_;
fmtname='COVERAGEF';
type='i';
call symputx('CoverageCount',_n_);
run;
*Import the created format;
proc format cntlin=coverage_values_fmt;
quit;
*Now use the created format. If you had already-consecutive values, you could skip to this step and skip the input statement - just use the value itself;
data save.test_fin;
set save.test;
array coverageids coverageid1-coverageid&coveragecount.;
do _t = 1 to &coveragecount.;
if input(coverageid,COVERAGEF.) = _t then coverageids[_t]=1;
else coverageids[_t]=0;
end;
drop _t;
run;

Here's another way that doesn't use formats, and may be easier to follow.
First, just make some test data:
data test;
input coverageid ##;
cards;
3 27 99 105
;
run;
Next, create a data set with no observations but one variable for each level of coverageid. Note that this approach allows arbitrary values here.
proc transpose data=test out=wide(drop=_name_);
id coverageid;
run;
Finally, create a new data set that combines the initial data set and the wide one. Then, for each level of x, look at each categorical variable and decide whether to turn it "on".
data want;
set test wide;
array vars{*} _:;
do i=1 to dim(vars);
vars{i} = (coverageid = substr(vname(vars{i}),2,1));
end;
drop i;
run;
The line
vars{i} = (coverageid = substr(vname(vars{i}),2));
may require more explanation. vname returns the name of the variable, and since we didn't specify a prefix in proc transpose, all variables are named something like _1, _2, etc. So we take the substring of the variable name that starts in the second position, and compare it to coverageid; if they're the same, we set the variable to 1; otherwise it evaluates to 0.

Related

Create a macro that applies translate on multiple columns that you define in a dataset

I'm new to programming in SAS and I would like to do 2 macros, the first one I have done and it consists of giving 3 parameters: name of the input table, name of the column, name of the output table. What this macro does is translate the rare or accented characters, passing it a table and specifying in which column you want the rare characters to be translated:
The code to do this macro is this:
%macro translate_column(table,column,name_output);
*%LET table = TEST_MACRO_TRNSLT;
*%let column = marca;
*%let name_output = COSAS;
PROC SQL;
CREATE TABLE TEST AS
SELECT *
FROM &table.;
QUIT;
data &NAME_OUTPUT;
set TEST;
&column.=tranwrd(&column., "Á", "A");
run;
%mend;
%translate_column(TEST_MACRO_TRNSLT,marca,COSAS);
The problem comes when I try to do the second macro, that I want to replicate what I do in the first one but instead of having the columns that I can introduce to 1, let it be infinite, that is, if in a data set I have 4 columns with characters rare, can you translate the rare characters of those 4 columns. I don't know if I have to put a previously made macro in a parameter and then make a kind of loop or something in the macro.
The same by creating a kind of array (I have no experience with this) and putting those values in a list (these would be the different columns you want to iterate over) or in a macrovariable, it may be that passing this list as a function parameter works.
Could someone give me a hand on this? I would be very grateful
Either use an ARRAY or a %DO loop.
In either case use a space delimited list of variable names as the value of the COLUMN input parameter to your macro.
%translate_column
(table=TEST_MACRO_TRNSLT
,column=var1 varA var2 varB
,name_output=COSAS
);
So here is ARRAY based version:
%macro translate_column(table,column,name_output);
data &NAME_OUTPUT;
set &table.;
array __column &column ;
do over __column;
__column=ktranslate(__column, "A", "Á");
end;
run;
%mend;
Here is %DO loop based version
%macro translate_column(table,column,name_output);
%local index name ;
data &NAME_OUTPUT;
set &table.;
%do index=1 %to %sysfunc(countw(&column,%str( )));
%let name=%scan(&column,&index,%str( ));
&name = ktranslate(&name, "A", "Á");
%end;
run;
%mend;
Notice I switched to using KTRANSLATE() instead of TRANWRD. That means you could adjust the macro to handle multiple character replacements at once
&name = ktranslate(&name,'AO','ÁÓ');
The advantage of the ARRAY version is you could do it without having to create a macro. The advantage of the %DO loop version is that it does not require that you find a name to use for the array that does not conflict with any existing variable name in the dataset.

Is there a way to skip missing data sets when iterating through names?

I have a few data sets in SAS which I am trying to collate into one larger set which I will be filtering later. They're all called something like table_201802. My problem is that there are a few missing months (i.e. there exists table201802 and table201804 and up, but not table201803.
I'm new enough to SAS, but what I've tried so far is to create a new data set called output testing and ran a macro loop iterating over the names (they go from 201802 to 201903, and they're monthly data so anything from 812 to 900 won't exist).
data output_testing;
set
%do i=802 %to 812;
LIBRARY.table_201&i
%end;
;
run;
%mend append;
I want the code to ignore missing tables and just look for ones that do exist and then append them to the new output_testing table.
If the table name prefix is distinct, and you are confident the data structures amongst the tables are consistent (variable names, types and lengths are the same) then the table can be stacked using table name prefix lists (:)
For a specific known range of table names you can also use numbered range lists (-) tab
data have190101 have190102 have190103;
x =1;
run;
data want_version1_stack; /* any table name that starts with have */
set have:;
run;
data want_version1b_stack; /* 2019 and 2020 */
set have19: have20:;
run;
options nodsnferr;
data want_version2_stack; /* any table names in the iterated numeric range */
set have190101-have191231;
run;
options dsnferr;
From helps
Using Data Set Lists with SET
You can use data set lists with the SET
statement. Data set lists provide a quick way to reference existing
groups of data sets. These data set lists must either be name prefix
lists or numbered range lists.
Name prefix lists refer to all data
sets that begin with a specified character string. For example, set
SALES1:; tells SAS to read all data sets that start with "SALES1" such
as SALES1, SALES10, SALES11, and SALES12. >
Numbered range lists
require you to have a series of data sets with the same name, except
for the last character or characters, which are consecutive numbers.
In a numbered range list, you can begin with any number and end with
any number. For example, these lists refer to the same data sets:
sales1 sales2 sales3 sales4
sales1-sales4
Some macro code with proc append should solve the problem.
%let n = 10;
%macro get_list_table;
%do i = 1 %to &n;
%let dsn = data&n;
%if %sysfunc(exist(&dsn)) %then %do;
proc append data = &dsn base = appended_data force;
run;
%end;
%end;
%mend;
You can use shortcuts:
data output_testing;
set LIBRARY.table_201:
;
run;
but in this case you will get in set all tables that start with "table_201".
For example:
LIBRARY.table_201tablesss LIBRARY.table_201ed56

Using SAS SET statement with numbered macro variables

I'm trying to create a custom transformation within SAS DI Studio to do some complicated processing which I will want to reuse often. In order to achieve this, as a first step, I am trying to replicate the functionality of a simple APPEND transformation.
To this end, I've enabled multiple inputs (max of 10) and am trying to leverage the &_INPUTn and &_INPUT_count macro variables referenced here. I would like to simply use the code
data work.APPEND_DATA / view=work.APPEND_DATA;
%let max_input_index = %sysevalf(&_INPUT_count - 1,int);
set &_INPUT0 - &&_INPUT&max_input_index;
keep col1 col2 col3;
run;
However, I receive the following error:
ERROR: Missing numeric suffix on a numbered data set list (WORK.SOME_INPUT_TABLE-WORK.ANOTHER_INPUT_TABLE)
because the macro variables are resolved to the names of the datasets they refer to, whose names do not conform to the format required for the
SET dataset1 - dataset9;
statement. How can I get around this?
Much gratitude.
You need to create a macro that loops through your list and resolves the variables. Something like
%macro list_tables(n);
%do i=1 %to &n;
&&_INPUT&i
%end;
%mend;
data work.APPEND_DATA / view=work.APPEND_DATA;
%let max_input_index = %sysevalf(&_INPUT_count - 1,int);
set %list_tables(&max_input_index);
keep col1 col2 col3;
run;
The SET statement will need a list of the actual dataset names since they might not form a sequence of numeric suffixed names.
You could use a macro %DO loop if are already running a macro. Make sure to not generate any semi-colons inside the %DO loop.
set
%do i=1 %to &_inputcount ; &&_input&i %end;
;
But you could also use a data step to concatenate the names into a single macro variable that you could then use in the SET statement.
data _null_;
call symputx('_input1',symget('_input'));
length str $500 ;
do i=1 to &_inputcount;
str=catx(' ',str,symget(cats('_input',i)));
end;
call symputx('_input',str);
run;
data .... ;
set &_input ;
...
The extra CALL SYMPUTX() at the top of the data step will handle the case when count is one and SAS only creates the _INPUT macro variable instead of creating the series of macro variables with the numeric suffix. This will set _INPUT1 to the value of _INPUT so that the DO loop will still function.

Adding columns to a dataset in SAS using a for loop

I'm coming at SAS from a Python/R/Stata background, and learning that things are rather different in SAS. I'm approaching the following problem from the standpoint of one of these languages, perhaps SAS isn't up to what I want to do.
I have a panel dataset with an age column in it. I want to add new columns to the dataset using this age column. I'm going to simplify the functions of age to keep it simple in my example.
The goal is to loop over a sequence, and use the value of that sequence at each loop step to 1. assign the name of the new column and 2. assign the values of that column. I'm hoping to get my starting dataset, with new columns added to it taking values spline1 spline2... spline7
data somePath.FinalDataset;
do i = 1 to 7;
if i = 1 then
spline&i. = age;
if i ^= 1 then spline&i. = age + i;
end;
set somePath.StartingDataset;
run;
This code won't even run, though in an earlier version I was able to get it to run, but the new columns had their values shifted down one row from what they should have been. I include this code block as pseudocode of what I'm trying to do. Any help is much appreciated
One way to do this in SAS is with arrays. A SAS array can be used to reference a group of variables, and it can also create variables.
data have;
input age;
cards;
5
10
;
run;
data want;
set have;
array spline{7}; *create spline1 spline2 ... spline7;
do i=1 to 7;
if i = 1 then spline{i} = age;
else spline{i} = age + i;
end;
drop i;
run;
Spline{i} referes to the ith variable of the array named spline.
i is a regular variable, the DROP statement prevents it from being written to the output dataset.
When you say new columns were "shifted by one," note that spline1=age and spline2=age+2. You can change your code accordingly, e.g. if you want spline2=age+1, you could change your else statement to else spline{i} = age + i - 1 ; It is also possible to change the array statement to define it with 0 as the lower bound, rather than 1.
Arrays are likely the best way to solve this, but I will demonstrate a macro approach, which is necessary in some cases.
SAS separates its doing-things-with-data language from its writing-code language into the 'data step language' and the 'macro language'. They don't really talk to each other during a data step, because the macro language runs during the compilation stage (before any data is processed) while the data step language runs during the execution stage (while rows of data are being processed).
In any event, for something like this it's quite possible to write a macro to do what you want. Borrowing Quentin's general structure and initial dataset:
data have;
input age;
cards;
5
10
;
run;
%macro make_spline(var=, count=);
%local i;
%do i = 1 %to &count;
%if &i=1 %then &var.&i. = &var.;
%else &var.&i. = &var. + &i.;
; *this semicolon ends the assignment statement;
%end;
/* You end up with the IF statement generating:
age1 = age
and the extra semicolon after the if/else generates the ; for that line, making it
age1 = age;
etc. for the other lines.
*/
%mend make_spline;
data want;
set have;
%make_spline(var=age,count=7);
run;
This would then perform what you're looking to perform. The looping is in the macro language, not in the data step. You can assign parameters however you see fit; I prefer to have parameters like above, or even more (start loop could also be a parameter, and in fact the assignment code could be a parameter!).

How can I peform same datastep across many variables in SAS?

I have data that looks like this and has 500 variables with a target:
var1 var2 var3 var4 ... var500 target
The names of the variables are not sequential as above so I don't think I can use something like var1:var500. I want to loop through the variables to create graphs. Some of the variables are continous and some are nominal.
for var1 through var500
if nominal then create graphtypeA var[i] * target
else if continous then create graphtypeB var[i] * target
end;
I can easily create a second table that has the data type in it to check against. Arrays seem like they might be useful to peform this task of looping through variables. Something like:
data work.mydata;
set archive.mydata;
array myarray{501] myarray1 - myarray501
do i=1 to 500;
proc sgpanel;
panelby myarray[501];
histogram myarray[i];
end;
run;
This doesn't work though and it doens't check to see what type of variable it is. If we assume I have another sas.dataset that has varname and vartype (continuous, nominal) how can I loop through to create the desired graphs for the given vartype? Thanks in advance.
Basically, you need to loop over some variables, apply some logic to determine the variable type, then produce output depending on the variable type. While there are many approaches to this problem, one solution is to select your variables into a macro variable, loop over this "list" (not a formal data structure) of variables, and use macro control logic to designate different subroutines for numeric and character variables.
I'll use the sashelp.cars data set to illustrate. In this example the variable origin is your 'Target' variable and the variables Make, Type, Horsepower, and Cylinders are the numeric and character variables.
* get some data;
data set1 (keep = Make Type Origin Horsepower Cylinders);
set sashelp.cars;
run;
* create dataset of variable names and types;
proc contents data = set1
out = vars
noprint;
run;
* get variable names and variable types (1=numeric, 2=character)
* into two macro variable "lists" where each entry is seperated
* by a space;
proc sql noprint;
select name, type
into :varname separated by ' ', :vartype separated by ' '
from vars
where name <> "Make";
quit;
* put the macro variables to the log to confirm they are what
* you expect
%put &varname;
%put &vartype;
Now, use a macro to loop over the values in the macro variable list. The countw function counts the number of variables, and uses this number as the loop iterator limit. The scan function reads in each variable name and type by its relative position in the respective macro variable lists. For each variable the type is then evaluated and a plot is produced depending on whether it is character or numeric. In this example, a histogram with density plot is produced for numeric variables and a bar chart of frequency counts is produced for character variables.
The loop logic is general, and Proc sgpanel and Proc sgplot cab be modified or replaced with other desired data step processing or procedures.
* turn on options that are useful for
* macro debugging, turn them off
* when using in production;
options mlogic mprint symbolgen;
%macro plotter;
%do i = 1 %to %sysfunc(countw(&varname));
%let nextvar = %scan(&varname, &i, %str( ));
%let nextvartype = %scan(&vartype, &i, %str( ));
%if &nextvartype. = 1 %then %do;
proc sgpanel data=set1 noautolegend;
title "&nextvar. Distribution";
panelby Origin;
histogram &nextvar.;
density &nextvar.;
run;
%end;
%if &nextvartype. = 2 %then %do;
proc sgplot data=set1;
title "&nextvar. Count by Origin";
vbar &nextvar. /group= origin;
run;
%end;
%end;
%mend plotter;
*call the macro;
%plotter;
Unfortunately it is not possible to use arrays outside a data step in the way that you propose here, at least not in any very efficient way. However, there are quite a few options available to you. One would be just to call your graphing proc once and tell it to graph every numeric variable in your dataset, e.g. like so:
proc univariate data = sashelp.class;
var _NUMERIC_;
histogram;
run;
If the variables you want to graph that are of the same type are adjacent in the column order of your dataset, you can use a double-dash list, e.g.
proc univariate data = sashelp.class;
var age--weight;
histogram;
run;
In general you should seek to avoid calling procs or running data steps separately for every variable - it is nearly always more efficient to call them just once and process everything in one go.