apply keep and where together sas - sas

I am working with sas to manipulate some dataset.
I am using the data step to apply some condition to keep columns and filter on some values.
The problem is that I would like to filter on columns that in the end I will not need, so I would like to apply, first, the where clause, and then the keep clause.
The problem is that sas executes, first the keep clause and then where, so when it is trying to apply the where instruction it doesn't find the columns on which it should be applied on. This is my code:
data newtable;
set mytable(where=(var1<value)
keep var2);
end;
In this case the error is the var1 cannot be found since I decided to keep only var2. I know I can do two data steps, but I would like to do everything in one step only.
How can I achieve this?

This can be achieved by using the keep data set option on the output data set, e.g.(untested):
data newtable(keep=var2);
set mytable(where=(var1<value));
end;
Alternatively a keep statement can be used, e.g. (untested):
data newtable;
set mytable(where=(var1<value));
keep var2;
end;

#Amir has the right of it. #Quentin is worried about the efficiency of it all. Unless your where clause is highly involved, then this will be your most efficient method.
data newtable;
set mytable(where=(var1<value) keep=var1 var2);
keep var2;
end;
Yes, var1 is read into the PDV, but you have limited the PDV to only the variables wanted in the output and needed in the where clause.
If you just want to run a where and do no other data step logic, then PROC SQL is just as efficient as the method above.
proc sql noprint;
create table newtable as
select var2
from mytable
where var1<value;
quit;

Related

How to use call symput on a specific observation in SAS

I'm trying to convert a SAS dataset column to a list of macro variables but am unsure of how indexing works in this language.
DATA _Null_;
do I = 1 to &num_or;
set CondensedOverrides4 nobs = num_or;
call symputx("Item" !! left(put(I,8.))
,"Rule", "G");
end;
run;
Right now this code creates a list of macro variables Item1,Item2,..ItemN etc. and assigns the entire column called "Rule" to each new variable. My goal is to put the first observation of "Rule" in Item1, the second observation in that column in Item2, etc.
I'm pretty new to SAS and understand you can't brute force logic in the same way as other languages but if there's a way to do this I would appreciate the guidance.
Much easier to create a series of macro variables using PROC SQL's INTO clause. You can save the number of items into a macro variable.
proc sql noprint;
select rule into :Item1-
from CondensedOverrides4
;
%let num_or=&sqlobs;
quit;
If you want to use a data step there is no need for a DO loop. The data step iterates over the inputs automatically. Put the code to save the number of observations into a macro variable BEFORE the set statement in case the input dataset is empty.
data _null_;
if eof then call symputx('num_or',_n_-1);
set CondensedOverrides4 end=eof ;
call symputx(cats('Item',_n_),rule,'g');
run;
SAS does not need loops to access each row, it does it automatically. So your code is really close. Instead of I, use the automatic variable _n_ which can function as a row counter though it's actually a step counter.
DATA _Null_;
set CondensedOverrides4;
call symputx("Item" || put(_n_,8. -l) , Rule, "G");
run;
To be honest though, if you're new to SAS using macro variables to start isn't recommended, there are usually multiple ways to avoid it anyways and I only use it if there's no other choice. It's incredibly powerful, but easy to get wrong and harder to debug.
EDIT: I modified the code to remove the LEFT() function since you can use the -l option on the PUT statement to left align the results directly.
EDIT2: Removing the quotes around RULE since I suspect it's a variable you want to store the value of, not the text string 'RULE'. If you want the macro variables to resolve to a string you would add back the quotes but that seems incorrect based on your question.

Automating IF and then statement in sas using macro in SAS

I have a data where I have various types of loan descriptions, there are at least 100 of them.
I have to categorise them into various buckets using if and then function. Please have a look at the data for reference
data des;
set desc;
if loan_desc in ('home_loan','auto_loan')then product_summary ='Loan';
if loan_desc in ('Multi') then product_summary='Multi options';
run;
For illustration I have shown it just for two loan description, but i have around 1000 of different loan_descr that I need to categorise into different buckets.
How can I categorise these loan descriptions in different buckets without writing the product summary and the loan_desc again and again in the code which is making it very lengthy and time consuming
Please help!
Another option for categorizing is using a format. This example uses a manual statement, but you can also create a format from a dataset if you have the to/from values in a dataset. As indicated by #Tom this allows you to change only the table and the code stays the same for future changes.
One note regarding your current code, you're using If/Then rather than If/ElseIf. You should use If/ElseIf because then it terminates as soon as one condition is met, rather than running through all options.
proc format;
value $ loan_fmt
'home_loan', 'auto_loan' = 'Loan'
'Multi' = 'Multi options';
run;
data want;
set have;
loan_desc = put(loan, $loan_fmt.);
run;
For a mapping exercise like this, the best technique is to use a mapping table. This is so the mappings can be changed without changing code, among other reasons.
A simple example is shown below:
/* create test data */
data desc (drop=x);
do x=1 to 3;
loan_desc='home_loan'; output;
loan_desc='auto_loan'; output;
loan_desc='Multi'; output;
loan_desc=''; output;
end;
data map;
loan_desc='home_loan'; product_summary ='Loan '; output;
loan_desc='auto_loan'; product_summary ='Loan'; output;
loan_desc='Multi'; product_summary='Multi options'; output;
run;
/* perform join */
proc sql;
create table des as
select a.*
,coalescec(b.product_summary,'UNMAPPED') as product_summary
from desc a
left join map b
on a.loan_desc=b.loan_desc;
There is no need to use the macro language for this task (I have updated the question tag accordingly).
Already good solutions have been proposed (I like #Reeza's proc format solution), but here's another route which also minimizes coding.
Generate sample data
data have;
loan_desc="home_loan"; output;
loan_desc="auto_loan"; output;
loan_desc="Multi"; output;
loan_desc=""; output;
run;
Using PROC SQL's case expression
This way doesn't allow, to my knowledge, having several criteria on a single when line, but it really simplifies coding since the resulting variable's name needs to be written down only once.
proc sql;
create table want as
select
loan_desc,
case loan_desc
when "home_loan" then "Loan"
when "auto_loan" then "Loan"
when "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;
Otherwise, using the following syntax is also possible, giving the same results:
proc sql;
create table want as
select
loan_desc,
case
when loan_desc in ("home_loan", "auto_loan") then "Loan"
when loan_desc = "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;

Sas macro with proc sql

I want to perform some regression and i would like to count the number of nonmissing observation for each variable. But i don't know yet which variable i will use. I've come up with the following solution which does not work. Any help?
Here basically I put each one of my explanatory variable in variable. For example
var1 var 2 -> w1 = var1, w2= var2. Notice that i don't know how many variable i have in advance so i leave room for ten variables.
Then store the potential variable using symput.
data _null_;
cntw=countw(&parameters);
i = 1;
array w{10} $15.;
do while(i <= cntw);
w[i]= scan((&parameters"),i, ' ');
i = i +1;
end;
/* store a variable globally*/
do j=1 to 10;
call symput("explanVar"||left(put(j,3.)), w(j));
end;
run;
My next step is to perform a proc sql using the variable i've stored. It does not work as
if I have less than 10 variables.
proc sql;
select count(&explanVar1), count(&explanVar2),
count(&explanVar3), count(&explanVar4),
count(&explanVar5), count(&explanVar6),
count(&explanVar7), count(&explanVar8),
count(&explanVar9), count(&explanVar10)
from estimation
;quit;
Can this code work with less than 10 variables?
You haven't provided the full context for this project, so it's unclear if this will work for you - but I think this is what I'd do.
First off, you're in SAS, use SAS where it's best - counting things. Instead of the PROC SQL and the data step, use PROC MEANS:
proc means data=estimation n;
var &parameters.;
run;
That, without any extra work, gets you the number of nonmissing values for all of your variables in one nice table.
Secondly, if there is a reason to do the PROC SQL, it's probably a bit more logical to structure it this way.
proc sql;
select
%do i = 1 %to %sysfunc(countw(&parameters.));
count(%scan(&parameters.,&i.) ) as Parameter_&i., /* or could reuse the %scan result to name this better*/
%end; count(1) as Total_Obs
from estimation;
quit;
The final Total Obs column is useful to simplify the code (dealing with the extra comma is mildly annoying). You could also put it at the start and prepend the commas.
You finally could also drive this from a dataset rather than a macro variable. I like that better, in general, as it's easier to deal with in a lot of ways. If your parameter list is in a data set somewhere (one parameter per row, in the dataset "Parameters", with "var" as the name of the column containing the parameter), you could do
proc sql;
select cats('%countme(var=',var,')') into :countlist separated by ','
from parameters;
quit;
%macro countme(var=);
count(&var.) as &var._count
%mend countme;
proc sql;
select &countlist from estimation;
quit;
This I like the best, as it is the simplest code and is very easy to modify. You could even drive it from a contents of estimation, if it's easy to determine what your potential parameters might be from that (or from dictionary.columns).
I'm not sure about your SAS macro, but the SQL query will work with these two notes:
1) If you don't follow your COUNT() functions with an identifier such as "COUNT() AS VAR1", your results will not have field headings. If that's ok with you, then you may not need to worry about it. But if you export the data, it will be helpful for you if you name them by adding "...AS "MY_NAME".
2) For observations with fewer than 10 variables, the query will return NULL values. So don't worry about not getting all of the results with what you have, because as long as the table you're querying has space for 10 variables (10 separate fields), you will get data back.

Merging Datasets in SAS

I have a rather simplistic question.
Is there any way to merge $n$ data sets in SAS where $n > 2$. I know how to merge 2 data sets.
Thanks
gowers
You can merge multiple data sets using the same syntax as for just two:
data all;
merge ds1 ds2 ds3 ...;
by some_list_of_variables;
run;
If you have many data sets you want to merge, you may want to right a macro that lists them all.
In addition to the code #itzy provided, you can identify your data sets using an IN= option on the MERGE statement. This allows you only accept the matching you need. Also, you must have common variable names to use in your BY statement. You can include a RENAME= statement to create a common variable for use in your BY statement.
(Untested code)
data all;
merge ds1(in=one rename=(ds1_id=id))
ds2(in=two rename=(ds2_id=id))
ds3(in=three rename=(ds3_id=id))
;
by some_list_of_variables;
if one and two and three ; /* Creates only matching records from all */
run;
Even though you have said that you want to "merge" datasets, note that the MERGE statement is not the only option. If your merging key has duplicates in more than 1 dataset, then using the MERGE statement will might give logically wrong results even though it would work without complaining. In that case, you can use PROC SQL - I also recall that PROC SQL can be more efficient from SAS 9.1 onwards.
Example -
proc sql;
select <fieldlist>
from data1 t1, data2 t2, data3 t3, data4 t4
where <join condition>;
quit;

Efficient way to merge/join 2 large datasets in SAS 8.2

I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE:duplicate'keys'in both datasets):
data a;
merge b
c;
by key
if b;
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
You should first sort the two datasets before merging them. This is what will give the performance. using an index when you have to scan the whole table to have a result is usually slower then presorting the datasets and merging them.
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% if it matters, but ANSI SQL would be proc sql; create a as select * from b left outer join c on b.key=C.KEY; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
Proc SQL;
Create table newdata
as select firsttable.aster, secondtable.aster
from table1 as firsttable
inner join table2 as secondtable
on (firstable.keyfield = secondtable.keyfield);
quit;
If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habbit of creating a tempory dataset containing all possible matches and the siveing it down from there. Can blow out your tempory space allocation and slow everyting down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.