Automating IF and then statement in sas using macro in SAS - if-statement

I have a data where I have various types of loan descriptions, there are at least 100 of them.
I have to categorise them into various buckets using if and then function. Please have a look at the data for reference
data des;
set desc;
if loan_desc in ('home_loan','auto_loan')then product_summary ='Loan';
if loan_desc in ('Multi') then product_summary='Multi options';
run;
For illustration I have shown it just for two loan description, but i have around 1000 of different loan_descr that I need to categorise into different buckets.
How can I categorise these loan descriptions in different buckets without writing the product summary and the loan_desc again and again in the code which is making it very lengthy and time consuming
Please help!

Another option for categorizing is using a format. This example uses a manual statement, but you can also create a format from a dataset if you have the to/from values in a dataset. As indicated by #Tom this allows you to change only the table and the code stays the same for future changes.
One note regarding your current code, you're using If/Then rather than If/ElseIf. You should use If/ElseIf because then it terminates as soon as one condition is met, rather than running through all options.
proc format;
value $ loan_fmt
'home_loan', 'auto_loan' = 'Loan'
'Multi' = 'Multi options';
run;
data want;
set have;
loan_desc = put(loan, $loan_fmt.);
run;

For a mapping exercise like this, the best technique is to use a mapping table. This is so the mappings can be changed without changing code, among other reasons.
A simple example is shown below:
/* create test data */
data desc (drop=x);
do x=1 to 3;
loan_desc='home_loan'; output;
loan_desc='auto_loan'; output;
loan_desc='Multi'; output;
loan_desc=''; output;
end;
data map;
loan_desc='home_loan'; product_summary ='Loan '; output;
loan_desc='auto_loan'; product_summary ='Loan'; output;
loan_desc='Multi'; product_summary='Multi options'; output;
run;
/* perform join */
proc sql;
create table des as
select a.*
,coalescec(b.product_summary,'UNMAPPED') as product_summary
from desc a
left join map b
on a.loan_desc=b.loan_desc;
There is no need to use the macro language for this task (I have updated the question tag accordingly).

Already good solutions have been proposed (I like #Reeza's proc format solution), but here's another route which also minimizes coding.
Generate sample data
data have;
loan_desc="home_loan"; output;
loan_desc="auto_loan"; output;
loan_desc="Multi"; output;
loan_desc=""; output;
run;
Using PROC SQL's case expression
This way doesn't allow, to my knowledge, having several criteria on a single when line, but it really simplifies coding since the resulting variable's name needs to be written down only once.
proc sql;
create table want as
select
loan_desc,
case loan_desc
when "home_loan" then "Loan"
when "auto_loan" then "Loan"
when "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;
Otherwise, using the following syntax is also possible, giving the same results:
proc sql;
create table want as
select
loan_desc,
case
when loan_desc in ("home_loan", "auto_loan") then "Loan"
when loan_desc = "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;

Related

2 ids with 3 distinct groups in SAS

hi I just wanted to know what would be the code in a data step for this outcome
I have the following:
and need the below:
please direct me to another page if this was answered previously...in addition how would I asked something like this in the future.
Thanks
Do you really need to do this in a data step? You might as well use proc sql for a Cartesian product.
proc sql;
create table want as
select code, groupds
from (select distinct code from have) a,
(select distinct groupds from have) b;
quit;
Here is how you can do it in a data step.
data want;
set have(keep=code where=(not missing(code)));
do i=1 to n;
set have(keep=groupds where=(not missing(groupds))) point=i nobs=n;
output;
end;
run;
The issue with this method is if you have a duplicate code or groupds a record will be created for the duplicate entry.

Adding a column is SAS using MODIFY (no sql)

I'm new to SAS and have some problems with adding a column to existing data set in SAS using MODIFY statement (without proc sql).
Let's say I have data like this
id name salary perks
1 John 2000 50
2 Mary 3000 120
What I need to get is a new column with the sum of salary and perks.
I tried to do it this way
data data1;
modify data1;
money=salary+perks;
run;
but apparently it doesn't work.
I would be grateful for any help!
As #Tom mentioned you use SET to access the dataset.
I generally don't recommend programming this way with the same name in set and data statements, especially as you're learning SAS. This is because it's harder to detect errors, since once run and encounter an error, you destroy your original dataset and have to recreate it before you start again.
If you want to work step by step, consider intermediary datasets and then clean up after you're done by using proc datasets to delete any unnecessary intermediary datasets. Use a naming conventions to be able to drop them all at once, i.e. data1, data2, data3 can be referenced as data1-data3 or data:.
data data2;
set data1;
money = salary + perks;
run;
You do now have two datasets but it's easy to drop datasets later on and you can now run your code in sections rather than running all at once.
Here's how you would drop intermediary datasets
proc datasets library=work nodetails holist;
delete data1-data3;
run;quit;
You can't add a column to an existing dataset. You can make a new dataset with the same name.
data data1;
set data1;
money=salary+perks;
run;
SAS will build it as a new physical file (with a temporary name) and when the step finishes without error it deletes the original and renames the new one.
If you want to use a data set you do it like this:
data dataset;
set dataset;
format new_column $12;
new_column = 'xxx';
run;
Or use Proc SQL and ALTER TABLE.
proc sql;
alter table dataset
add new_column char(8) format = $12.
;
quit;

SAS Proc SQL how to perform procedure only on N rows of a big table

I need to perform a procedure on a small set (e.g. 100 rows) of a very big table just to test the syntax and output. I have been running the following code for a while and it's still running. I wonder if it is doing something else. Or what is the right way to do?
Proc sql inobs = 100;
select
Var1,
sum(Var2) as VarSum
from BigTable
Group by
Var1;
Quit;
What you're doing is fine (limiting the maximum number of records taken from any table to 100), but there are a few alternatives. To avoid any execution at all, use the noexec option:
proc sql noexec;
select * from sashelp.class;
quit;
To restrict the obs from a specific dataset, you can use the data set obs option, e.g.
proc sql;
select * from sashelp.class(obs = 5);
quit;
To get a better idea of what SAS is doing behind the scenes in terms of index usage and query planning, use the _method and _tree options (and optionally combine with inobs as above):
proc sql _method _tree inobs = 5;
create table test as select * from sashelp.class
group by sex
having age = max(age);
quit;
These produce quite verbose output which is beyond the scope of this answer to explain fully, but you can easily search for more details if you want.
For further details on debugging SQL in SAS, refer to
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/viewer.htm#a001360938.htm

Running All Variables Through a Function in SAS

I am new to SAS and need to sgplot 112 variables. The variable names are all very different and may change over time. How can I call each variable in the statement without having to list all of them?
Here is what I have done so far:
%macro graph(var);
proc sgplot data=monthly;
series x=date y=var;
title 'var';
run;
%mend;
%graph(gdp);
%graph(lbr);
The above code can be a pain since I have to list 112 %graph() lines and then change the names in the future as the variable names change.
Thanks for the help in advance.
List processing is the concept you need to deal with something like this. You can also use BY group processing or in the case of graphing Paneling in some cases to approach this issue.
Create a dataset from a source convenient to you that contains the list of variables. This could be an excel or text file, or it could be created from your data if there's a way to programmatically tell which variables you need.
Then you can use any of a number of methods to produce this:
proc sql;
select cats('%graph(',var,')')
into: graphlist separated by ' '
from yourdata;
quit;
&graphlist
For example.
In your case, you could also generate a vertical dataset with one row per variable, which might be easier to determine which variables are correct:
data citiwk;
set sashelp.citiwk;
var='COM';
val=WSPCA;
output;
var='UTI';
val=WSPUA;
output;
var='INDU';
val=WSPIA;
output;
val=WSPGLT;
var='GOV';
output;
keep val var date;
run;
proc sort data=citiwk;
by var date;
run;
proc sgplot data=citiwk;
by var;
series x=date y=val;
run;
While I hardcoded those four, you could easily create an array and use VNAME() to get the variable name or VLABEL() to get the variable label of each array element.

Perform Iterative Operations on OUTEST or OUTSTAT variables in SAS?

In SAS, how can I assign a variable coming from either the OUTEST or OUTSTAT functions to be used in a loop?
For example, say I want to run some sort of iterative analysis until my mean (average) reaches a certain threshold. I know how to extract the mean using either OUTEST or OUTSTAT, but then how can I perform operations or blocks of code on it?
Thank you.
If you are interested in details, I am trying to perform backward selection of VIFs (to remove multicollinearity). Unfortunately, SAS doesn't seem to have a 'SELECTION=BACKWARD' feature for this...
EDIT: Updated with sample code:
%MACRO MULTICOLLINEARITY(TABLE_SUFFIX,YVAR,FIELDS,MAX_VIF);
/* PRELIMINARY PROC REG ON ALL FIELDS*/
PROC REG DATA=TABLE_&TABLE_SUFFIX. NOPRINT;
MODEL &YVAR = &FIELDS / VIF COLLIN NOINT;
ODS OUTPUT PARAMETERESTIMATES=PAREST1;
RUN;
/* RETAIN NON-NULL VIF FIELDS ONLY */
DATA NO_NULL_VIF;
SET PAREST1 (WHERE=(VarianceInflation <> .));
RUN;
/* CREATE VARIABLE LIST OF NON-NULL VIF FIELDS */
PROC SQL;
SELECT VARIABLE
INTO :NO_NULL_VIF_FIELDS SEPARATED BY ' '
FROM NO_NULL_VIF;
QUIT;
/* RE-RUN REGRESSION WITH NON-NULL VIF FIELDS ONLY */
PROC REG DATA=TABLE_&TABLE_SUFFIX. NOPRINT;
MODEL &YVAR = &NO_NULL_VIF_FIELDS / VIF COLLIN NOINT;
ODS OUTPUT PARAMETERESTIMATES=PAREST2;
RUN;
/* START ITERATION OF DROPPING THE HIGHEST VIF UNTIL THE CRITERIA IS MET */
???
%MEND;
%MULTICOLLINEARITY(, RESPONSE, &INPUT_FIELDS,???)
And by criteria I mean VIF_MAX < N where N is some threshold specified in the macro. For example, if we want to retain only fields with VIF less than 5, then it should drop the highest one, re-run the PROC REG, drop the highest, re-run, etc. etc. until the highest on is less than 5.
First off - I'd verify that you can't do this using PROC MODEL. I'm not a regression guy so I don't know for sure. Might be worth posting on a more stat-focused site; CV isn't really appropriate since they're not generally trying to answer software questions, but maybe communities.sas.com . I would find it surprising if this wasn't directly possible in PROC MODEL and/or in one of the more complicated procs.
Second, the way I'd approach this is to write a recursive macro. Take out the first part (the non-null VIF fields) and either move that to an outer macro that just runs once, or make it an expectation of the programmer to do on his/her own (unless this is not feasible, and/or can change with iterations - not something I'm knowledgeable of). Then do something like this:
%MACRO MULTICOLLINEARITY(TABLE_SUFFIX,YVAR,FIELDS,MAX_VIF);
ods _all_ close;
%put Running with &fields; *note which fields currently running;
*also may want to include a run # counter as parameter;
PROC REG DATA=TABLE_&TABLE_SUFFIX.;
MODEL &YVAR = &FIELDS / VIF COLLIN NOINT;
ODS OUTPUT PARAMETERESTIMATES=PAREST2;
RUN;
quit;
*Data step to analyse PAREST2 and see if any of the fields can be dropped;
proc sort data=parest2;
by descending varianceinflation;
run;
data _null_;
set parest2(obs=1);
if varianceinflation > &max_vif then do;
fields_run = tranwrd("&fields",trim(variable),' ');
if not missing(fields_run) then do;
call_string = cats('%multicollinearity(',"&table_suffix.,&yvar.,",fields_run,",&max_vif.)");
call execute(call_string);
end;
end;
else do;
put "Stopped with Max VIF:" variable "=" varianceinflation;
run;
ods preferences;
%MEND MULTICOLLINEARITY;
Then you call it once with the full field list, and it calls itself in the CALL EXECUTE if there is still a parameter left. An incremented # of runs may be helpful (both to see how many times it ran in your log, and to be able to make sure that you don't end up in an infinite loop if you make a mistake with the fields variable deletion.)
I would run this with OPTION NONOTES NOSOURCE; and none of the symbogen/mprint stuff on, so you can just get the %put/put statements in your log.