How to recode variables in SAS? - sas

I tried to recode the missing values but instead lost all my other variables within a dataset
BEFORE:
AFTER:
data work.newdataset;
if (year =.) then year = 2000;
run;

You are missing the SET statement.
data want;
set have;
myvar=5;
run;
will create a new dataset, want, from have, with the new variable value applied (or the recode or whatever). You could also do
data have;
set have;
myvar=5;
run;
That would replace have with itself plus the recode/whatever. This is actually less common in SAS; it is often preferable to do all recodes in one step, but to create a new dataset (so that the code is reversible easily).

Related

How do I add a variable to a dataset within a SET statement in a data step?

I have a SAS data step that uses a %do loop to build a set statement that combines multiple files. The files have names of the form datafile2020q1 or datafile2021q3, and they're each stored in libraries with names of the form DataLibrary2020 or DataLibrary2021.
%macro processclaimsdata;
data rawclaimsdata (keep=&claimvariables.);
set
%do yr=2019 %to 2021;
%do qrtr=1 %to 4;
DataLibrary&yr..datafile&yr.q&qrtr. (keep=&claimvariables.)
%end;
%end;
;
run;
%mend;
My goal is to add a variable to the dataset that has YYYYqQ in it, e.g., 2020q1 or 2021q3, for each dataset, so I can keep track of which file it came from. Is this possible by modifying my code above, or do I need to rework it to use proc append and/or proc sql?
Use the indsname= option to keep track of the dataset that a row came from. Although it won't be in the format you specify, this will tell you the exact dataset name that a row came from.
The code below will create a new temporary variable in the dataset called dsn that stores the name of the dataset a row came from. You can add it to the dataset by assigning it to a permanent variable.
data want;
set have1
have2
have3
indsname=dsn;
dsname = dsn;
run;
You can then use substr() or other string functions to snag the name of the quarter from each dataset name. For example:
qtr = substr(dsn, -6);

Export/Import Attributes of a SAS dataset

I am working with multiple waves of survey data. I have finished defining formats and labels for the first wave of data.
The second wave of data will be different, but the codes, labels, formats, and variable names will all be the same. I do not want to define all these attributes again...it seems like there should be a way to export the PROC CONTENTS information for one dataset and import it into another dataset. Is there a way to do this?
The closest thing I've found is PROC CPORT but I am totally confused by it and cannot get it to run.
(Just to be clear I'll ask the question another way as well...)
When you run PROC CONTENTS, SAS tells you what format, labels, etc. it is using for each variable in the dataset.
I have a second dataset with the exact same variable names. I would like to use the variable attributes from the first dataset on the variables in the second dataset. Is there any way to do this?
Thanks!
So you have a MODEL dataset and a HAVE dataset, both with data in them. You want to create WANT dataset which has data from HAVE, with attributes of MODEL (formats, labels, and variable lengths). You can do this like:
data WANT ;
if 0 then set MODEL ;
set HAVE ;
run ;
This works because when the DATA step compiles, SAS builds the Program Data Vector (PDV) which defines variable attributes. Even though the SET MODEL never executes (because 0 is not true), all of the variables in MODEL are created in the PDV when the step compiles.
Importantly, note that if there are corresponding variables with different lengths, the length from MODEL will determine the length of the variable in WANT. So if HAVE has a variable that is longer than the same-named variable in MODEL, it may be truncated. Options VARLENCHK determines whether or not SAS throws a warning/error if this happens.
That assumes there are no formats/labels on the HAVE dataset. If there is a variable in HAVE that has a format/label, and the corresponding variable in MODEL does not have a format/label, the format/label from HAVE will be applied to WANT.
Sample code below.
data model;
set sashelp.class;
length FavoriteColor $3;
FavoriteColor="Red";
dob=today();
label
dob='BirthDate'
;
format
dob mmddyy10.
;
run;
data have;
set sashelp.class;
length FavoriteColor $10;
dob=today()-1;
FavoriteColor="Orange";
label
Name="HaveLabel"
dob="HaveLabel"
;
format
Name $1.
dob comma.
;
run;
options varlenchk=warn;
data want;
if 0 then set model;
set have;
run;
I'd create an empty dataset based on the existing one, and then use proc append to append the contents to it.
Create some sample data for the second round of data:
data new_data;
age = 10;
run;
Create an empty dataset based on the original data:
proc sql noprint;
create table want like sashelp.class;
quit;
Append the data into the empty dataset, retaining the details from the original:
proc append base=want data=new_data force nowarn;
run;
Note that I've used the force and nowarn options on proc append. This will ensure the data is appended even if differences are found between the two datasets being used. This is expected if you have, for example, format differences. It will also hide things like if columns exist in the new table that aren't in the old table etc. So be careful that this is doing what you want it to. If the behaviour is undesirable, consider using a datastep to append instead (and list the want dataset first).
Welcome to the stack.
If you want to copy the properties of the table without the data within it, you could use PROC SQL or data step with zero rows read in.
This examples copies all information about the SASHELP.CLASS dataset into a brand new dataset. All formats, attributes, labels, the whole thing is copies over. If you want to only copy some of the columns, specify them in select clause instead of asterix.
PROC SQL outobs=0;
CREATE TABLE WANT as SELECT * FROM SASHELP.CLASS;
QUIT;
Regards,
Vasilij

Naming variable using _n_, a column for each iteration of a datastep

I need to declare a variable for each iteration of a datastep (for each n), but when I run the code, SAS will output only the last one variable declared, the greatest n.
It seems stupid declaring a variable for each row, but I need to achieve this result, I'm working on a dataset created by a proc freq, and I need a column for each group (each row of the dataset).
The result will be in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statement, but none worked.
Always the same result, the ds includes only var&gr where &gr is the maximum n.
It seems that the PDV is overwriting the new variable each iteration, but the name is different.
Please, include the result in a single datastep, or, at least, let the code take less time as possible.
Any idea on how can I achieve the requested result?
Thanks.
Macro variables don't work like you think they do. Any macro variable reference is resolved at compile time, so your call symput is changing the value of the macro variable after all the references have been resolved. The reason you are getting results where the &gr is the maximum n is because that is what &gr was as a result of the last time you ran the code.
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symput('maxn',_n_);
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't like proc transpose (if you need 3 columns you can always use it once for every column and then put together the outputs) what you ask can be done with arrays.
First thing you need to determine the number of groups (i.e. rows) in the input dataset and then define an array with dimension equal to that number.
Then the i-th element of your array can be recalled using _n_ as index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
In SAS there're several methods to determine the number of obs in a dataset, my favorite is the following: (doesn't work with views)
data _null_;
if 0 then set group nobs=n;
call symputx('gr',n);
run;

How to do simple mathematics with min/max variable in SAS

I am currently running a macro code in SAS and I want to do a calculation with regards to max and min. Right now the line of code I have is :
hhincscaled = 100*(hhinc - min(hhinc) )/ (max(hhinc) - min(hhinc));
hhvaluescaled = 100*(hhvalue - min(hhvalue))/ (max(hhvalue) - min(hhvalue));
What I am trying to do is re-scale household income and value variables with the calculations below. I am trying to subtract the minimum value of each variable and subtract it from the respective maximum value and then scale it by multiplying it by 100. I'm not sure if this is the right way or if SAS is recognizing the code the way I want it.
I assume you are in a Data Step. A Data Step has an implicit loop over the records in the data set. You only have access to the record of the current loop (with some exceptions).
The "SAS" way to do this is the calculate the Min and Max values and then add them to your data set.
Proc sql noprint;
create table want as
select *,
min(hhinc) as min_hhinc,
max(hhinc) as max_hhinc,
min(hhvalue) as min_hhvalue,
max(hhvalue) as max_hhvalue
from have;
quit;
data want;
set want;
hhincscaled = 100*(hhinc - min_hhinc )/ (max_hhinc - min_hhinc);
hhvaluescaled = 100*(hhvalue - min_hhvalue)/ (max_hhvalue - min_hhvalue);
/*Delete this if you want to keep the min max*/
drop min_: max_:;
run;
Another SAS way of doing this is to create the max/min table with PROC MEANS (or PROC SUMMARY or your choice of alternatives) and merge it on. Doesn't require SQL knowledge to do, and probably about the same speed.
proc means data=have;
*use a class value if you have one;
var hhinc hhvalue;
output out=minmax min= max= /autoname;
run;
data want;
if _n_=1 then set minmax; *get the min/max values- they will be retained automatically and available on every row;
set have;
*do your calculations, using the new variables hhinc_max hhinc_min etc.;
run;
If you have a class statement - ie, a grouping like 'by state' or similar - add that in proc means and then do a merge instead of a second set in want, by your class variable. It would require a sorted (initial) dataset to merge.
You also have the option of doing this in SAS-IML, which works more similarly to how you are thinking above. IML is the SAS interactive matrix language, and more similar to r or matlab than the SAS base language.

Sas macro with proc sql

I want to perform some regression and i would like to count the number of nonmissing observation for each variable. But i don't know yet which variable i will use. I've come up with the following solution which does not work. Any help?
Here basically I put each one of my explanatory variable in variable. For example
var1 var 2 -> w1 = var1, w2= var2. Notice that i don't know how many variable i have in advance so i leave room for ten variables.
Then store the potential variable using symput.
data _null_;
cntw=countw(&parameters);
i = 1;
array w{10} $15.;
do while(i <= cntw);
w[i]= scan((&parameters"),i, ' ');
i = i +1;
end;
/* store a variable globally*/
do j=1 to 10;
call symput("explanVar"||left(put(j,3.)), w(j));
end;
run;
My next step is to perform a proc sql using the variable i've stored. It does not work as
if I have less than 10 variables.
proc sql;
select count(&explanVar1), count(&explanVar2),
count(&explanVar3), count(&explanVar4),
count(&explanVar5), count(&explanVar6),
count(&explanVar7), count(&explanVar8),
count(&explanVar9), count(&explanVar10)
from estimation
;quit;
Can this code work with less than 10 variables?
You haven't provided the full context for this project, so it's unclear if this will work for you - but I think this is what I'd do.
First off, you're in SAS, use SAS where it's best - counting things. Instead of the PROC SQL and the data step, use PROC MEANS:
proc means data=estimation n;
var &parameters.;
run;
That, without any extra work, gets you the number of nonmissing values for all of your variables in one nice table.
Secondly, if there is a reason to do the PROC SQL, it's probably a bit more logical to structure it this way.
proc sql;
select
%do i = 1 %to %sysfunc(countw(&parameters.));
count(%scan(&parameters.,&i.) ) as Parameter_&i., /* or could reuse the %scan result to name this better*/
%end; count(1) as Total_Obs
from estimation;
quit;
The final Total Obs column is useful to simplify the code (dealing with the extra comma is mildly annoying). You could also put it at the start and prepend the commas.
You finally could also drive this from a dataset rather than a macro variable. I like that better, in general, as it's easier to deal with in a lot of ways. If your parameter list is in a data set somewhere (one parameter per row, in the dataset "Parameters", with "var" as the name of the column containing the parameter), you could do
proc sql;
select cats('%countme(var=',var,')') into :countlist separated by ','
from parameters;
quit;
%macro countme(var=);
count(&var.) as &var._count
%mend countme;
proc sql;
select &countlist from estimation;
quit;
This I like the best, as it is the simplest code and is very easy to modify. You could even drive it from a contents of estimation, if it's easy to determine what your potential parameters might be from that (or from dictionary.columns).
I'm not sure about your SAS macro, but the SQL query will work with these two notes:
1) If you don't follow your COUNT() functions with an identifier such as "COUNT() AS VAR1", your results will not have field headings. If that's ok with you, then you may not need to worry about it. But if you export the data, it will be helpful for you if you name them by adding "...AS "MY_NAME".
2) For observations with fewer than 10 variables, the query will return NULL values. So don't worry about not getting all of the results with what you have, because as long as the table you're querying has space for 10 variables (10 separate fields), you will get data back.