Efficient way to merge/join 2 large datasets in SAS 8.2 - sas

I have tried the following options, with unacceptable response times. Creating an index on 'key' did not help either (NOTE: there are duplicate 'keys' in both datasets):
data a;
merge b(in=in_b)
c;
by key;
if in_b;
run;
=== OR ===
proc sql;
create table a
as select *
from b
left outer join c
on b.key;
quit;

You should first sort the two datasets before merging them. This is what will give you the performance. Using an index when you have to scan the whole table to get a result is usually slower than presorting the datasets and merging them.
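A minimal sketch of the sort-then-merge approach, reusing the dataset names b and c and the variable key from the question. The IN= variable name in_b is my own choice, not something from the original post.

```sas
/* Sort both inputs by the merge key first. */
proc sort data=b;
  by key;
run;

proc sort data=c;
  by key;
run;

/* Match-merge, keeping every row that came from b. */
data a;
  merge b(in=in_b) c;
  by key;
  if in_b;
run;
```

Note that this only addresses speed. With duplicate keys in both datasets, a match-merge does not pair every b row with every c row the way an SQL join does.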

Be sure to trim your datasets as much as possible, and sort them before the data step or PROC SQL. Also, I'm not 100% sure it matters, but in ANSI SQL the join condition names both sides:
proc sql;
create table a
as select *
from b
left outer join c
on b.key = c.key;
quit;

Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good one. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your datasets and make the merge faster.
However, none of these methods may be enough. I routinely merge files of several million rows, and it can take a while.
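As a rough illustration of the trimming advice, the KEEP= dataset option lets you drop unneeded columns while sorting, so less data is read, sorted and merged. The column names var1, var2 and var3 and the output names b_small and c_small are hypothetical placeholders.

```sas
/* Keep only the key and the columns actually needed (names are placeholders). */
proc sort data=b(keep=key var1 var2) out=b_small;
  by key;
run;

proc sort data=c(keep=key var3) out=c_small;
  by key;
run;

/* Merge the trimmed, sorted copies. */
data a;
  merge b_small(in=in_b) c_small;
  by key;
  if in_b;
run;
```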

You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!

Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with PROC SQL:
proc sql;
create table newdata
as select firsttable.*, secondtable.*
from table1 as firsttable
inner join table2 as secondtable
on (firsttable.keyfield = secondtable.keyfield);
quit;
If you have more than one key field, the join order should be least-matching field first to greatest-matching field last. SAS has a bad habit of creating a temporary dataset containing all possible matches and then sieving it down from there. This can blow out your temporary space allocation and slow everything down.
If you still wish to use a DATA step, then you need to remove the duplicate keys from one of the datasets.
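One way to remove duplicate keys, sketched with the table1, table2 and keyfield names used above, is PROC SORT with the NODUPKEY option. The output names table1_sorted and table2_unique are made up for the example. NODUPKEY keeps only the first observation for each key value, so this is only appropriate if the discarded duplicates really are redundant.

```sas
/* Keep one row per keyfield in table2; later duplicates are discarded. */
proc sort data=table2 out=table2_unique nodupkey;
  by keyfield;
run;

proc sort data=table1 out=table1_sorted;
  by keyfield;
run;

/* Duplicates now exist in at most one input, so the match-merge is well defined. */
data newdata;
  merge table1_sorted(in=in1) table2_unique(in=in2);
  by keyfield;
  if in1 and in2;
run;
```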

Related

Can't use LAG function in Proc SQL in SAS

I have created a PROC SQL query in a SAS program, but I need to use the LAG function, and SAS tells me it can't be used in PROC SQL, only in a data step.
Code:
proc sql;
CREATE TABLE agg_table AS
SELECT USER, MAX(TIME) AS LAST_TIME, SUM(BONUS) AS BONUS_SUM, LAG(EXPDT) AS EXPDT_LAG FROM WORK.MY_DATA GROUP BY USER_ID;
So, I don't know how to combine PROC SQL and a data step into one query to get one table as output.
Or maybe there is a better approach to the whole problem?
Thanks
PROC SQL does not have a concept of rows the same way the datastep does. SQL may process rows in any order, not necessarily sequential, and may use hash tables, parallel processing, or various binary tree and similar methods to process its query; and the same query may be processed in different methods. Thus lag is not usable in SQL, nor are diff or other functions that expect row data.
It's unclear from your question what exactly you're doing, so it's not really possible to give a direct answer how to do this separately; but you may be able to accomplish this entirely in one datastep, or you may combine a datastep and a SQL query, or two datasteps. You can perform the lag in a prior datastep or a view, then the rest in SQL; or you may use a DoW loop datastep to perform the max/sum elements.
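A sketch of the "lag in a prior datastep, then the rest in SQL" option, under some guesses about the question: that the grouping variable is user_id, that rows should be ordered by time within each user, and that the lag should not carry over from one user to the next. The intermediate dataset names are made up.

```sas
/* Put the rows in a defined order so LAG has a meaning. */
proc sort data=work.my_data out=work.my_data_sorted;
  by user_id time;
run;

/* Compute the previous row's EXPDT within each user. */
data work.my_data_lag;
  set work.my_data_sorted;
  by user_id;
  expdt_lag = lag(expdt);               /* call LAG on every row, never conditionally */
  if first.user_id then expdt_lag = .;  /* assumes EXPDT is numeric */
run;

/* Do the aggregation in PROC SQL. */
proc sql;
  create table agg_table as
  select user_id,
         max(time)  as last_time,
         sum(bonus) as bonus_sum
  from work.my_data_lag
  group by user_id;
quit;
```

Which lagged value belongs in the aggregated table depends on what the question actually needs, so expdt_lag is left on the row-level dataset here.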

Filter a SAS dataset to contain only identifiers given in a list

I am working in SAS Enterprise guide and have a one column SAS table that contains unique identifiers (id_list).
I want to filter another SAS table to contain only observations that can be found in id_list.
My code so far is:
proc sql noprint;
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN id_list
quit;
This code gives me the following errors:
ERROR 22-322: Syntax error, expecting one of the following: (, SELECT.
What am I doing wrong?
Thanks up front for the help.
You can't just give it the table name. You need to make a subquery that includes what variable you want it to read from ID_LIST.
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN (select id from id_list)
;
You could use a join in PROC SQL, but it is probably simpler to use a merge in a data step with the IN= dataset option.
data want;
merge oneColData(in = A) otherData(in = B);
by id;
if A and B;
run;
You merge the two datasets together, and then `if A and B` keeps only the observations whose ID appears in both the single-column dataset and the other dataset. For this to work, the key variable (id here) must exist under the same name in both datasets, and both datasets must be sorted by it.
The problem with using a data step instead of PROC SQL is that, for the data step, the dataset must be sorted on the variable used for the merge. If this is not yet the case, the complete dataset must be sorted first.
If I have a very large SAS dataset that is not sorted on the merge variable, I have to sort it first (which can take quite some time). If I use the subquery in PROC SQL, I can read the dataset selectively, so no sort is needed.
My bet is that PROC SQL is much faster for large datasets from which you want only a small subset.

How to find common values of two variables of different length?

I'm very new to SAS. I spent some time looking for a solution for my problem. Unfortunately, I couldn't find any. I'd very much appreciate your help. The problem is actually quite simple.
I have two different datasets (let's call it dat1 and dat2) of different length. Moreover, they both have a variable X, which I'm interested in. I'm looking for a way to find common values of these two columns (let's call them dat1_X and dat2_X). The dataset is quite big though, approximately 10 million observations.
There are many ways!
If one dataset was small, and efficiency was an issue, you could consider using hash tables (or formats) to perform a lookup whilst passing through the larger one.
Otherwise, the following SQL approaches will do the trick (try testing to see which is most efficient):
/* IN subquery (generally slow) */
proc sql;
create table want1 as
select distinct x
from dat1_x
where x in (select x from dat2_x);
/* inner join */
proc sql;
create table want2 as
select distinct a.x
from dat1_x a, dat2_x b
where a.x=b.x;
/* intersect */
proc sql;
create table want3 as
select x
from dat1_x
intersect
select x
from dat2_x;
quit;
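The hash table lookup mentioned above could be sketched roughly as follows, assuming SAS 9 or later (where the hash object is available) and that dat2_x is the smaller table and fits in memory. The output name want_hash is made up.

```sas
data want_hash;
  set dat1_x;                             /* pass through the larger table once */
  if _n_ = 1 then do;
    declare hash h(dataset: "dat2_x");    /* load the smaller table into memory */
    h.defineKey("x");
    h.defineDone();
  end;
  if h.check() = 0;                       /* keep rows whose x also exists in dat2_x */
  keep x;
run;

/* Reduce to distinct common values, as the SQL versions do. */
proc sort data=want_hash nodupkey;
  by x;
run;
```

Neither input needs to be sorted for the lookup itself. Only the final, usually much smaller, result is sorted to remove repeats.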

Merging Datasets in SAS

I have a rather simplistic question.
Is there any way to merge $n$ data sets in SAS where $n > 2$. I know how to merge 2 data sets.
Thanks
gowers
You can merge multiple data sets using the same syntax as for just two:
data all;
merge ds1 ds2 ds3 ...;
by some_list_of_variables;
run;
If you have many data sets you want to merge, you may want to write a macro that lists them all.
In addition to the code @itzy provided, you can identify your data sets using an IN= option on the MERGE statement. This allows you to accept only the matches you need. Also, you must have common variable names to use in your BY statement. You can include a RENAME= option to create a common variable for use in your BY statement.
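A minimal sketch of such a macro, assuming the data sets share a common prefix and are numbered consecutively (ds1, ds2, ...). The macro name ds_list and its parameters are hypothetical.

```sas
/* Hypothetical helper: expands to "ds1 ds2 ... dsN". */
%macro ds_list(prefix=ds, n=3);
  %local i;
  %do i = 1 %to &n;
    &prefix.&i
  %end;
%mend ds_list;

data all;
  merge %ds_list(prefix=ds, n=5);   /* becomes: merge ds1 ds2 ds3 ds4 ds5; */
  by some_list_of_variables;
run;
```

All of the listed data sets still need to be sorted by the BY variables beforehand.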
(Untested code)
data all;
merge ds1(in=one rename=(ds1_id=id))
ds2(in=two rename=(ds2_id=id))
ds3(in=three rename=(ds3_id=id))
;
by id;
if one and two and three ; /* Creates only matching records from all */
run;
Even though you have said that you want to "merge" datasets, note that the MERGE statement is not the only option. If your merging key has duplicates in more than one dataset, then the MERGE statement may give logically wrong results even though it runs without complaining. In that case, you can use PROC SQL. I also recall that PROC SQL can be more efficient from SAS 9.1 onwards.
Example -
proc sql;
select <fieldlist>
from data1 t1, data2 t2, data3 t3, data4 t4
where <join condition>;
quit;
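To see why duplicate keys matter, here is a small made-up example (the dataset and variable names are invented for the illustration). A match-merge pairs rows within a BY group one by one and then reuses the last row of the shorter side, while an SQL join produces every combination.

```sas
data ds_left;
  input key $ lval $;
  datalines;
A l1
A l2
;
run;

data ds_right;
  input key $ rval $;
  datalines;
A r1
A r2
A r3
;
run;

/* Match-merge: 3 rows for key A (l1-r1, l2-r2, l2-r3). */
data merged;
  merge ds_left ds_right;
  by key;
run;

/* SQL join: 2 x 3 = 6 rows for key A, one per combination. */
proc sql;
  create table joined as
  select l.key, l.lval, r.rval
  from ds_left as l, ds_right as r
  where l.key = r.key;
quit;
```

The data step also writes a note to the log saying that the MERGE statement has more than one data set with repeats of BY values, which is worth watching for.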

Is there an easy way to drop all variables from one of the datasets when merging in SAS?

Say I've already sorted set1 and set2 by the variables 'sticks', 'stones', and 'bones' and then I do this:
data merged;
merge set1(in=a) set2(in=b);
by sticks stones bones;
if a and b then output;
*else we don't want to do anything;
run;
Is there an easy way to drop all the variables from set2 in the merged dataset without having to type them all? I keep running into this problem where I have two datasets - both with quite a few variables - and I only want to merge them by a few variables and then only keep the variables from one of the sets.
I usually just use PROC SQL for something like this, but there are a few situations (more complex than above) where I think merge is better.
Also, I find it annoying that SAS requires you to "manually" sort datasets before merging them. If it will not let you merge datasets unless they are sorted correctly, why doesn't it just do it for you when you use merge? Thoughts? Maybe there is a way around this I don't know about.
The sort requirement exists because of the way the MERGE statement and the PDV (program data vector) work.
There is really no way around it.
However, what you're basically doing here is a lookup into set2 to make sure you have a match on the key variables (sticks stones bones), the equivalent of an inner join. You could likely do that more efficiently with a hash table, or with a SET statement using the KEY= option (if you have an index, of course).
The easiest and most convenient way to get what you want here is a KEEP= dataset option on set2, so that only the BY variables are loaded into the PDV.
Something like this:
data merged;
merge set1(in=a) set2(in=b keep=sticks stones bones);
by sticks stones bones;
if a and b then output;
*else we don't want to do anything;
run;
In case hash tables don't scare you and want to learn more on how to implement them in this case feel free to contact me for more help.
EDIT:
Here is a good paper about using hash tables http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
Bear in mind that with hashes you should know what you're doing; they may yield unexpected results if you don't know what's happening under the hood.
Regardless here is the problem solved using a very simple and basic hash table
data merged2;
set set1;
if _N_ = 1 then do;
declare hash h(dataset:"set2");
h.defineKey('sticks','stones','bones');
h.defineData('sticks','stones','bones');
h.defineDone();
end;
rc = h.find();
if rc=0;
drop rc;
run;
The main benefit of this code is that it does not require the datasets to be sorted, which is a great time-saver if set2 is particularly big (as long as its keys still fit in memory, since the hash table is held there).