About "data merge" in SAS

I am studying data merge in SAS and found the following example:
data newdata;
merge yourdata (in=a) otherdata (in=b);
by permno date;
What do "(in=a)" and "(in=b)" mean? Thanks.

yourdata(in=a) creates a temporary flag variable called 'a' in the program data vector. It contains 1 if yourdata contributed to the current record and 0 if it did not. IN= variables are not written to the output data set. You can use them to perform conditional operations based on the source of the record.
It might be easier to understand if you saw:
data newdata;
merge yourdata(in=ThisRecordIsFromYourData) otherdata(in=ThisRecordIsFromOtherData);
by permno date;
run;
Suppose that records from yourdata need to be manipulated in this step, but not those from otherdata. You could then do something like:
data newdata;
merge yourdata(in=ThisRecordIsFromYourData) otherdata(in=ThisRecordIsFromOtherData);
by permno date;
if ThisRecordIsFromYourData then do;
* some operation here for yourdata records only ;
end;
run;
An obvious use for these variables is to control what kind of 'merge' occurs, using IF statements. For example, the subsetting statement if ThisRecordIsFromYourData and ThisRecordIsFromOtherData; makes SAS keep only rows that match on the BY variables in both input data sets (like an inner join).
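As a rough sketch of that idea, the step below writes an inner-join-style result and a left-join-style result in a single pass. The sample rows, the variables ret and price, and the output names matched and all_yours are made up for illustration; the inputs are assumed to be sorted by permno and date already.

```sas
/* Hypothetical sample data, already in permno/date order */
data yourdata;
  input permno date ret;
  datalines;
1 20200131 0.01
1 20200229 0.02
2 20200131 0.03
;
run;

data otherdata;
  input permno date price;
  datalines;
1 20200131 10
2 20200131 20
2 20200229 21
;
run;

data matched all_yours;
  merge yourdata(in=a) otherdata(in=b);
  by permno date;
  if a and b then output matched;   /* present in both: like an inner join */
  if a then output all_yours;       /* every yourdata row: like a left join */
run;
```

Neither a nor b appears in the output data sets, because IN= variables exist only while the step runs.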

SAS: How to create datasets in loop based on macro variable

I have a macro variable like this:
%let months = 202002 202001 201912 201911 201910;
As one can see, we have 5 months, separated by a space ' '.
I would like to create 5 datasets: a_202002, a_202001, a_201912, a_201911, a_201910. How can I run this in a loop and create 5 datasets, instead of writing the data step 5 times?
Pseudo code:
for m in &months.
data a_m;
....
....
run;
How can I do that in SAS? I tried %do_over but that did not help me.
Use the knowledge you gained from @Tom's answer to an earlier question to create the macro:
%macro datasets_for_months ...
...
%mend;
Specify the output data sets in the DATA statement:
DATA %datasets_for_months(...);
...
RUN;
Direct rows to specific output data sets by naming the data set, such as
OUTPUT a_202002;
Note:
A step with no OUTPUT statements will implicitly output to each data set
A step with an OUTPUT statement will cause records to be written to either ALL data sets, or only the ones named in the statement:
OUTPUT writes records to all output data sets
OUTPUT data-set-name-1 writes records to only data sets specified
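The routing rules above can be sketched as follows. The input data set have and its numeric month variable are assumptions made for illustration, not part of the original question.

```sas
/* Hypothetical input: one row per record with a numeric month key */
data have;
  input month value;
  datalines;
202002 10
202001 20
201912 30
;
run;

/* Name every output data set on the DATA statement, */
/* then route each row with an explicit OUTPUT.      */
data a_202002 a_202001 a_201912;
  set have;
  if      month = 202002 then output a_202002;
  else if month = 202001 then output a_202001;
  else if month = 201912 then output a_201912;
run;
```

Because the step contains explicit OUTPUT statements, a row whose month matches none of the conditions is not written anywhere.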
The DATA step documentation covers what you need to know in greater detail:
DATA Statement
Begins a DATA step and provides names for any output such as SAS data sets, views, or programs.
...
Syntax
Form 1:
DATA statement for creating output data sets
DATA <data-set-name-1 <(data-set-options-1)>>
... <data-set-name-n <(data-set-options-n)>>
... ;
THE ROAD AHEAD
You will likely discover that month will be better served in a conceptual role as a categorical variable in a single large data set, instead of breaking the data into multiple month-named data sets.
A categorical variable will let you leverage the power of SAS' partitioning and segregating statements such as WHERE, BY and CLASS when pursuing processing, reporting and visualization of your data at different combinations of class level values.
How about this approach? Create the data set names in another macro variable and use a single data step.
%let months = 202002 202001 201912 201911 201910;
data _null_;
ds = prxchange('s/(\d+)/a_$1/', -1, "&months.");
call symputx('ds', ds);
run;
options symbolgen;
data &ds.;
run;
You can use a %DO loop and the %SCAN() function. Use the COUNTW() function to find the upper bound.
%do i=1 %to %sysfunc(countw(&months,%str( )));
%let month=%scan(&months,&i,%str( ));
....
%end;
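Since %DO is only valid inside a macro definition, the loop above has to be wrapped in one. A minimal sketch follows; the macro name make_monthly is hypothetical, and sashelp.class stands in for whatever source data each month's step would really read.

```sas
%let months = 202002 202001 201912 201911 201910;

%macro make_monthly(list);
  %local i month;
  %do i = 1 %to %sysfunc(countw(&list, %str( )));
    %let month = %scan(&list, &i, %str( ));
    /* One DATA step is generated per month value */
    data a_&month;
      set sashelp.class;   /* placeholder input, an assumption */
    run;
  %end;
%mend make_monthly;

%make_monthly(&months)
```

Each pass through the loop generates a complete DATA step, so this creates a_202002 through a_201910 as separate data sets.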

apply keep and where together sas

I am working with SAS to manipulate a dataset.
I am using the data step to keep certain columns and to filter on some values.
The problem is that I would like to filter on columns that I will not need in the end, so I would like to apply the where clause first and then the keep clause.
However, SAS applies the keep clause first and then the where clause, so when it tries to apply the where condition it cannot find the columns it refers to. This is my code:
data newtable;
set mytable(where=(var1<value)
keep=var2);
run;
In this case the error is that var1 cannot be found, since I decided to keep only var2. I know I can do this in two data steps, but I would like to do everything in one step only.
How can I achieve this?
This can be achieved by using the keep data set option on the output data set, e.g. (untested):
data newtable(keep=var2);
set mytable(where=(var1<value));
run;
Alternatively, a keep statement can be used, e.g. (untested):
data newtable;
set mytable(where=(var1<value));
keep var2;
run;
@Amir has the right of it. @Quentin is worried about the efficiency of it all. Unless your where clause is highly involved, this will be your most efficient method:
data newtable;
set mytable(where=(var1<value) keep=var1 var2);
keep var2;
run;
Yes, var1 is read into the PDV, but you have limited the PDV to only the variables wanted in the output and needed in the where clause.
If you just want to run a where and do no other data step logic, then PROC SQL is just as efficient as the method above.
proc sql noprint;
create table newtable as
select var2
from mytable
where var1<value;
quit;

Adding a column in SAS using MODIFY (no sql)

I'm new to SAS and have some problems with adding a column to an existing data set in SAS using the MODIFY statement (without proc sql).
Let's say I have data like this
id name salary perks
1 John 2000 50
2 Mary 3000 120
What I need to get is a new column with the sum of salary and perks.
I tried to do it this way
data data1;
modify data1;
money=salary+perks;
run;
but apparently it doesn't work.
I would be grateful for any help!
As @Tom mentioned, you use SET to access the dataset.
I generally don't recommend programming this way, with the same name in the SET and DATA statements, especially while you're learning SAS. It makes errors harder to detect: if the step runs but the logic is wrong, your original dataset has been overwritten and you have to recreate it before you can start again.
If you want to work step by step, consider intermediary datasets and then clean up after you're done by using proc datasets to delete any unnecessary intermediary datasets. Use a naming convention so you can drop them all at once, e.g. data1, data2, data3 can be referenced as data1-data3 or data:.
data data2;
set data1;
money = salary + perks;
run;
You do now have two datasets but it's easy to drop datasets later on and you can now run your code in sections rather than running all at once.
Here's how you would drop intermediary datasets
proc datasets library=work nodetails nolist;
delete data1-data3;
run; quit;
You can't add a column to an existing dataset. You can make a new dataset with the same name.
data data1;
set data1;
money=salary+perks;
run;
SAS will build it as a new physical file (with a temporary name) and when the step finishes without error it deletes the original and renames the new one.
If you want to use a data step, you do it like this:
data dataset;
set dataset;
format new_column $12.;
new_column = 'xxx';
run;
Or use Proc SQL and ALTER TABLE.
proc sql;
alter table dataset
add new_column char(8) format = $12.
;
quit;

Automating IF and then statement in sas using macro in SAS

I have data with various types of loan descriptions; there are at least 100 of them.
I have to categorise them into various buckets using IF/THEN statements. Please have a look at the code below for reference:
data des;
set desc;
if loan_desc in ('home_loan','auto_loan') then product_summary ='Loan';
if loan_desc in ('Multi') then product_summary='Multi options';
run;
For illustration I have shown just two buckets, but I have around 1000 different loan_desc values that I need to categorise.
How can I categorise these loan descriptions into buckets without writing product_summary and loan_desc again and again in the code, which makes it very lengthy and time consuming?
Please help!
Another option for categorizing is to use a format. This example defines the format with a manual VALUE statement, but you can also create a format from a dataset if you have the to/from values in one. As indicated by @Tom, this allows you to change only the table while the code stays the same for future changes.
One note regarding your current code: you're using IF/THEN rather than IF/THEN/ELSE IF. You should use ELSE IF because evaluation then stops as soon as one condition is met, rather than testing every condition.
proc format;
value $loan_fmt
'home_loan', 'auto_loan' = 'Loan'
'Multi' = 'Multi options';
run;
data want;
set have;
/* map the description through the format into the new bucket variable */
product_summary = put(loan_desc, $loan_fmt.);
run;
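The text mentions that a format can also be built from a dataset. A minimal sketch of that, using the CNTLIN= option of PROC FORMAT, might look like the following. The data set names map and ctrl and the sample rows are assumptions for illustration.

```sas
/* Hypothetical mapping table: one row per loan description */
data map;
  length loan_desc $9 product_summary $13;
  loan_desc='home_loan'; product_summary='Loan';          output;
  loan_desc='auto_loan'; product_summary='Loan';          output;
  loan_desc='Multi';     product_summary='Multi options'; output;
run;

/* Reshape into the control data set layout that CNTLIN= reads: */
/* FMTNAME, TYPE, START and LABEL                               */
data ctrl;
  set map(rename=(loan_desc=start product_summary=label));
  retain fmtname 'loan_fmt' type 'C';
run;

proc format cntlin=ctrl;
run;

/* Hypothetical input to categorise */
data have;
  length loan_desc $9;
  loan_desc='home_loan'; output;
  loan_desc='Multi';     output;
run;

data want;
  set have;
  length product_summary $13;
  product_summary = put(loan_desc, $loan_fmt.);
run;
```

With this layout, adding a new loan description means adding a row to map and rerunning PROC FORMAT; the categorising step itself does not change.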
For a mapping exercise like this, the best technique is to use a mapping table. This is so the mappings can be changed without changing code, among other reasons.
A simple example is shown below:
/* create test data */
data desc (drop=x);
do x=1 to 3;
loan_desc='home_loan'; output;
loan_desc='auto_loan'; output;
loan_desc='Multi'; output;
loan_desc=''; output;
end;
run;
data map;
/* set lengths up front so 'Multi options' is not truncated */
length loan_desc $9 product_summary $13;
loan_desc='home_loan'; product_summary='Loan'; output;
loan_desc='auto_loan'; product_summary='Loan'; output;
loan_desc='Multi'; product_summary='Multi options'; output;
run;
/* perform join */
proc sql;
create table des as
select a.*
,coalescec(b.product_summary,'UNMAPPED') as product_summary
from desc a
left join map b
on a.loan_desc=b.loan_desc;
quit;
There is no need to use the macro language for this task (I have updated the question tags accordingly).
Good solutions have already been proposed (I like @Reeza's proc format solution), but here's another route which also minimizes coding.
Generate sample data
data have;
loan_desc="home_loan"; output;
loan_desc="auto_loan"; output;
loan_desc="Multi"; output;
loan_desc=""; output;
run;
Using PROC SQL's case expression
This way doesn't allow, to my knowledge, having several criteria on a single when line, but it really simplifies coding since the resulting variable's name needs to be written down only once.
proc sql;
create table want as
select
loan_desc,
case loan_desc
when "home_loan" then "Loan"
when "auto_loan" then "Loan"
when "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;
Otherwise, using the following syntax is also possible, giving the same results:
proc sql;
create table want as
select
loan_desc,
case
when loan_desc in ("home_loan", "auto_loan") then "Loan"
when loan_desc = "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;

Concatenation vs Interleaving

I cannot understand the difference between interleaving and concatenation
Interleaving
proc sort data=ds1
out=ds1;
by var1;
run;
proc sort data=ds2
out=ds2;
by var1;
run;
data testInterleaving ;
set ds1 ds2 ;
run ;
Concatenation
data testConcatenation;
set ds1 ds2;
run;
I tested these and the resulting datasets were exactly the same except for the order of observations which I think does not really matter. The two resulting datasets contain exactly the same observations. So, what is the difference, except for order?
Interleaving, as CarolinaJay notes, is combining SET with BY. It is not merging, and it is not just sorting prior to setting.
For example, let's create a pair of datasets, the female and the male members of sashelp.class.
data male female;
set sashelp.class;
if sex='F' then output female;
else output male;
run;
proc sort data=male;
by name;
run;
proc sort data=female;
by name;
run;
data concatenated;
set male female;
run;
data interleaved;
set male female;
by name;
run;
Now, look at the datasets. Concatenated is all of the males, then all of the females: it processes the datasets on the SET statement in order, exhausting the first before moving on to the second.
Interleaved is in name order, not in order by sex. That's because it traverses the two (in this case) SET datasets by name, keeping track of where it is in the name ordering. You can add debugging statements (either use the data step debugger, or add a put _all_; to the data step) to see how it works.
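As a small sketch of that debugging approach, the step below rebuilds the two data sets and logs, for every output row, which input it came from. The IN= variable names from_male and from_female are made up for illustration.

```sas
data male female;
  set sashelp.class;
  if sex='F' then output female;
  else output male;
run;

proc sort data=male;   by name; run;
proc sort data=female; by name; run;

data interleaved;
  set male(in=from_male) female(in=from_female);
  by name;
  /* Log the row number, the BY value and the source flags */
  put _n_= name= from_male= from_female=;
run;
```

The log shows the source flags switching back and forth as SAS walks both inputs in name order, which is the behaviour that distinguishes interleaving from plain concatenation.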
SAS defines INTERLEAVING as using a BY statement with a SET statement. The included link shows two data sets, sorted by the same variable(s), generating one data set using a BY statement with a SET statement.
The data steps at the end are exactly the same. You are running the same code in both cases; it doesn't matter whether you sort beforehand.
What I think you mean in the interleaving is
data testInterleaving ;
MERGE ds1 ds2;
by var1;
run;
The set statement reads sequentially through the data sets in the order you list them. The merge statement compares records between the sets and puts them into the output in the order of the variable(s) in the by statement. I recommend looking at the SAS documentation on the merge statement as this is a very simplistic explanation for a very powerful tool.