I am performing regression where I make dummy variable for categorical variable.
where I have a categorical variable i.e
agecategory
1
2
3
4
5
I would like to know how can I define dummy variable of this categorical variable in SAS.
In your DATA step:
array agecat(*) agecategory1 - agecategory5;
do i=1 to dim(agecat);
if agecategory=i then agecat[i]=1;
else agecat[i]=0;
end;
This is assuming your agecategory variable is numeric, not character.
Related
I need a way to dynamically return the number of variables in the current data step.
Using SAS NOTE 24671: Dynamically determining the number of observations and variables in a SAS data set, I have come up with the following macro.
%macro GetVarCount(dataset);
/* Open assigns ID to open data set. Assigns 0 if DNE */
%let exists = %sysfunc(open(&dataset));
%if &exists %then
%do;
%let returnValue = %sysfunc(attrn(&exists, nvars));
%let closed = %sysfunc(close(&exists));
%end;
/* Output error if no dataset */
%else %put %sysfunc(sysmsg());
&returnValue
%mend;
Unfortunately, this errors out on an initial pass of a data set since the data set has not yet been created. After the first pass, and a dataset with 0 observations has been created, the macro can access the table and the number of variables.
For instance,
data example;
input x y;
put "NOTE: [DEV] There are %GetVarCount(example) variables in the EXAMPLE data set.";
datalines;
1
2
;
run;
The first run produces:
ERROR: File WORK.EXAMPLE.DATA does not exist.
WARNING: Apparent symbolic reference RETURNVALUE not resolved.
NOTE: [DEV] There are &returnValue variables in the EXAMPLE data set.
The second run produces:
NOTE: [DEV] There are 2 variables in the EXAMPLE data set.
Is there a way to get the number of variables in a data set first time the data step is run?
In your example, you're trying to determine the number of active variables in a data step - this isn't necessarily the same as the number of variables that will be in the output data set, because (a) there might not be an output data set and (b) some of the variables might get dropped.
With that caveat in mind, if you really want to do that, then this works:
data fred;
length x y z $ 20 f g 8;
array vars_char _character_;
array vars_num _numeric_;
total_vars = dim(vars_char) + dim(vars_num);
put "Vars in data step: " total_vars;
run;
This works by using the special _character_ and _numeric_ keywords to create arrays of all character and numeric vars in the current buffer, and the dim() function to get the sizes of those arrays.
It will only count variables that exist when the arrays are declared, so it doesn't count total_vars in this case.
You could wrap this in a macro like:
%macro var_count(var_count_name):
array vars_char _character_;
array vars_num _numeric_;
&var_count_name = dim(vars_char) + dim(vars_num);
%mend;
and then use it like:
data fred;
length x y z $ 20 f g 8;
%var_count(total_vars);
put "Vars in data step: " total_vars;
run;
Try to open a dataset that has already been created.
The 'open' function requires the dataset that WILL be open to exist, I think you want 'open' to give you an ID of the already open dataset; that is not the case.
The reason it works only after the first pass (not just the second), is because the first pass created an empty dataset with metadata regarding the variables it contains.
Use a library to permanently store your dataset first and then try your macro to read from it:
Data <lib>.dataset;
update:
#Reeza already gave you the answer in the comments.
Another alternative:
Using put _all_; will print all the variables to the log, if you write the put into a file and then read it and count the '=' signs you can get the variable count too. Just remove _n_ and _ERROR_ from the count.
I have a set of input macro variables in SAS. They are dynamic and generated based on the user selection in a sas stored process.
For example:There are 10 input values 1 to 10.
The name of the macro variable is VAR_. If a user selects 2,5,7 then 4 macro variables are created.
&VAR_0=3;
&VAR_=2;
&VAR_1=5;
&VAR_2=7;
The first one with suffix 0 provides the count. The next 3 provides the values.
Note:If a user select only one value then only one macro variable is created. For example If a user selects 9 then &var_=9; will be created. There will not be any count macro variable.
I am trying to create a sas table using these variables.
It should be like this
OBS VAR
-----------
1 2
2 5
3 7
-----------
This is what I tried. Not sure if this is the right way to do approach it.
It doesn't give me a final solution but I can atleast get the name of the macro variables in a table. How can I get their values ?
data tbl1;
do I=1 to &var_0;
VAR=CAT('&VAR_',I-1);
OUTPUT;
END;
RUN;
PROC SQL;
CREATE TABLE TBL2 AS
SELECT I,
CASE WHEN VAR= '&VAR_0' THEN '&VAR_' ELSE VAR END AS VAR
from TBL1;
QUIT;
Thank You for your help.
Jay
SAS helpfully stores them in a table for you already, you just need to parse out the ones you want. The table is called SASHELP.VMACRO or DICTIONARY.MACROS
Here's an example:
%let var=1;
%let var2=3;
%let var4=5;
proc sql;
create table want as
select * from sashelp.vmacro
where name like 'VAR%';
quit;
proc print data=want;
run;
I think the real issue is the inconsistent behavior of the stored process. It only creates the 0 and 1 variable when there are multiple selections. I think that your example is a little off. If the value of VAR_0 is three then their should be a VAR_3 macro variable. Also the value of VAR_ and VAR_1 should be set to the same thing.
To fix this in the past I have done something like this. First let's assign the parameter name a macro variable so that the code is reusable for other programs.
%let name=VAR_;
Then first make sure the minimal macro variables exist.
%global &name &name.0 &name.1 ;
Then make sure that you have a count by setting the 0 variable to 1 when it is empty.
%let &name.0 = %scan(&&&name.0 1,1);
Then make sure that you have a 1 variable. Since it should have the same value as the macro variable without a suffix just re-assign it.
%let &name.1 = &&&name ;
Now your data step is easier.
data want ;
length var $32 value $200 ;
do i=1 to &&&name.0 ;
var=cats(symget('name'),i);
value=symget(var);
output;
end;
run;
I don't understand your numbering scheme and recommend changing it, if you can; the &var_ variable is very confusing.
Anyway, the easiest way to do this is SYMGET. That returns a value from the macro symbol table which you can specify at runtime.
%let VAR_0=3;
%let VAR_=2;
%let VAR_1=5;
%let VAR_2=7;
data want;
do obs = 1 to &var_0.;
var = input(symget(cats('VAR_',ifc(obs=1,'',put(obs-1,2.)))),2.);
output;
end;
run;
Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example, the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if a certain percentage of the data points in there are "#N/A"s, say more than 50%, more than 30%.
In addition, I would like to make the percentage a configurable value, i.e., I am willing to remove the entire variable if more than x% of the observations under that variable are "#N/A". And I also want to be able to define what an invalid value is, could be "#N/A", could be "Invalid Value", could be " ", could be anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually I thought of a way to make this problem easier. We could probably read in all the data points as numerical values, so that all the "#N/A", "N.A.", " " stuff get turned into ".", which makes the drop criterion easier. Hope that helps you solve this problem for me ...
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables and exclude the one ID character variable. A new list of numeric*/
/*variable names is created from the character variable name with a "_n" */
/*appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the numeric values to character. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
nu(i)=input(ch(i),8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem is, and where I got stuck on, is that I am not able to drop the variables that I want to drop. And the reason is because I do not want to use the variable names in the drop function. Instead, I want it done in a loop where I can reference the variable names with the looper "i". I tried to use the array "myVars(i)" but it doesnt seem to work with the drop function.
My understanding is that SAS processes drop statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
In general, you'll find this sort of thing simplified using built in procs - this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get frequency table, then look at that dataset to find the rows with > 50% of the rows and an invalid value in the F_ column.
This won't work with actual missing (" " or .); you'll need to add the /MISSING option to PROC FREQ if you have those also.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;
I want to create something in SAS that works like an Excel lookup function. Basically, I set the values for macro variables var1, var2, ... and I want to find their index number according to the ref table. But I get the following messages in the data step.
NOTE: Variable A is uninitialized.
NOTE: Variable B is uninitialized.
NOTE: Variable NULL is uninitialized.
When I print the variables &num1,&num2, I get nothing. Here is my code.
data ref;
input index varname $;
datalines;
0 NULL
1 A
2 B
3 C
;
run;
%let var1=A;
%let var2=B;
%let var3=NULL;
data temp;
set ref;
if varname=&var1 then call symput('num1',trim(left(index)));
if varname=&var2 then call symput('num2',trim(left(index)));
if varname=&var3 then call symput('num3',trim(left(index)));
run;
%put &num1;
%put &num2;
%put &num3;
I can get the correct values for &num1,&num2,.. if I type varname='A' in the if-then statement. And if I subsequently change the statement back to varname=&var1, I can still get the required output. But why is it so? I don't want to input the actual string value and then change it back to macro variable to get the result everytime.
Solution to immediate problem
You need to wrap your macro variables in double quotes if you want SAS to treat them as string constants. Otherwise, it will treat them the same way as any other random bits of text it finds in your data step.
Alternatively, you could re-define the macro vars to include the quotes.
As a further option, you could use the symget or resolve functions, but these are not usually needed unless you want to create a macro variable and use it again within the same data step. If you use them as a replacement for double quotes they tend to use a lot more CPU as they will evaluate the macro vars once per row by default - normally, macro vars are evaluated just once, at compile time, before your code executes.
A better approach?
For the sort of lookup you're doing, you actually don't need to use a dataset at all - you can instead define a custom format, which gives you much more flexibility in how you can use it. E.g. this creates a format called lookup:
proc format;
value lookup
1 = 'A'
2 = 'B'
3 = 'C'
other = '#N/A' /*Since this is what vlookup would do :) */
;
run;
Then you can use the format like so:
%let testvar = 1;
%let testvar_lookup = %sysfunc(putn(&testvar, lookup.));
Or in a data step:
data _null_;
var1 = 1;
format var1 lookup.;
put var1=;
run;
I am trying to create categorical variables in sas. I have written the following macro, but I get an error: "Invalid symbolic variable name xxx" when I try to run. I am not sure this is even the correct way to accomplish my goal.
Here is my code:
%macro addvars;
proc sql noprint;
select distinct coverageid
into :coverageid1 - :coverageid9999999
from save.test;
%do i=1 %to &sqlobs;
%let n=coverageid&i;
%let v=%superq(&n);
%let f=coverageid_&v;
%put &f;
data save.test;
set save.test;
%if coverageid eq %superq(&v)
%then &f=1;
%else &f=0;
run;
%end;
%mend addvars;
%addvars;
You're combining macro code with data step code in a way that isn't correct. %if = macro language, meaning you are actually evaluating whether the text "coverageid" is equal to the text that %superq(&v) evaluates to, not whether the contents of the coverageid variable equal the value in &v. You could just convert %if to if, but even if you got that to work properly it would be hideously inefficient (you're rewriting the dataset N times, so if you have 1500 values for coverageID you rewrite the entire 500MB dataset or whatnot 1500 times, instead of just once).
If what you want to do is take the variable 'coverageid' and convert it to a set of variables that consist of all possible values of coverageid, 1/0 binary, for each, there are a nubmer of ways to do it. I'm fairly sure the ETS module has a procedure that just does this, but I don't recall it off the top of my head - if you were to post this to the SAS mailing list, one of the guys there would undoubtedly have it quickly.
The simple way for me, is to do this with entirely datastep code. First determine how many potential values there are for COVERAGEID, then assign each to a direct value, then assign the value to the correct variable.
If the COVERAGEID values are consecutive (ie, 1 to some number, no skips, or you don't mind skipping) then this is easy - set up an array and iterate over it. I will assume they are NOT consecutive.
*First, get the distinct values of coverageID. There are a dozen ways to do this, this works as well as any;
proc freq data=save.test;
tables coverageid/out=coverage_values(keep=coverageid);
run;
*Then save them into a format. This converts each value to a consecutive number (so the lowest value becomes 1, the next lowest 2, etc.) This is not only useful for this step, but it can be useful in the future in converting back.;
data coverage_values_fmt;
set coverage_values;
start=coverageid;
label=_n_;
fmtname='COVERAGEF';
type='i';
call symputx('CoverageCount',_n_);
run;
*Import the created format;
proc format cntlin=coverage_values_fmt;
quit;
*Now use the created format. If you had already-consecutive values, you could skip to this step and skip the input statement - just use the value itself;
data save.test_fin;
set save.test;
array coverageids coverageid1-coverageid&coveragecount.;
do _t = 1 to &coveragecount.;
if input(coverageid,COVERAGEF.) = _t then coverageids[_t]=1;
else coverageids[_t]=0;
end;
drop _t;
run;
Here's another way that doesn't use formats, and may be easier to follow.
First, just make some test data:
data test;
input coverageid ##;
cards;
3 27 99 105
;
run;
Next, create a data set with no observations but one variable for each level of coverageid. Note that this approach allows arbitrary values here.
proc transpose data=test out=wide(drop=_name_);
id coverageid;
run;
Finally, create a new data set that combines the initial data set and the wide one. Then, for each level of x, look at each categorical variable and decide whether to turn it "on".
data want;
set test wide;
array vars{*} _:;
do i=1 to dim(vars);
vars{i} = (coverageid = substr(vname(vars{i}),2,1));
end;
drop i;
run;
The line
vars{i} = (coverageid = substr(vname(vars{i}),2));
may require more explanation. vname returns the name of the variable, and since we didn't specify a prefix in proc transpose, all variables are named something like _1, _2, etc. So we take the substring of the variable name that starts in the second position, and compare it to coverageid; if they're the same, we set the variable to 1; otherwise it evaluates to 0.