I would like to know the best way to merge multiple tables. I have a unique identifier that is common to all the tables. Should I join all the tables in one step after sorting them, or should I merge them stepwise, one table at a time?
Does this matter?
You can do multiple merges in a single step. However, this is not the safest way. If there is a possibility that your data has imperfections, it is best to do this step by step. Imho, it is best to merge one step at a time, but it's your call.
proc sort data=data1; by id; run;
proc sort data=data2; by id; run;
proc sort data=data3; by id; run;
data combo;
merge data1(in=a) data2(in=b) data3(in=c);
by id;
if a and b and c; /*Inner join. Change as needed. */
run;
This is equivalent to:
data partial;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
data combo;
merge partial(in=a) data3(in=b);
by id;
if a and b;
run;
There's no particular reason to do it step-by-step, unless you've got conflicting variable names that you're concerned about resolving, or if your combination logic is complicated and you're worried about confusing something. There's no functional reason why not, in any event. merge in SAS is actually somewhat simpler than join in SQL, in particular as the syntax is simpler, so it's somewhat different than the SQL case.
Related
I have created a PROC SQL query in a SAS program, but I need to use the LAG function, and SAS tells me it can't be used in PROC SQL, only in a data step.
Code:
proc sql;
CREATE TABLE agg_table AS
SELECT USER, MAX(TIME) AS LAST_TIME, SUM(BONUS) AS BONUS_SUM, LAG(EXPDT) AS EXPDT_LAG FROM WORK.MY_DATA GROUP BY USER_ID;
So, I don't know how to combine PROC SQL and a data step into one query to get one table as output.
Or maybe there is a better approach to the whole problem?
Thanks
PROC SQL does not have a concept of rows the same way the datastep does. SQL may process rows in any order, not necessarily sequential, and may use hash tables, parallel processing, or various binary tree and similar methods to process its query; and the same query may be processed in different methods. Thus lag is not usable in SQL, nor are diff or other functions that expect row data.
It's unclear from your question what exactly you're doing, so it's not really possible to give a direct answer how to do this separately; but you may be able to accomplish this entirely in one datastep, or you may combine a datastep and a SQL query, or two datasteps. You can perform the lag in a prior datastep or a view, then the rest in SQL; or you may use a DoW loop datastep to perform the max/sum elements.
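One way to read the "lag in a prior datastep, then the rest in SQL" suggestion is sketched below. It is only a sketch under assumptions: USER_ID is taken to be the grouping column, EXPDT is assumed to be numeric, and the dataset names work.sorted and work.with_lag are hypothetical. How the lagged value should be folded into the final aggregate depends on the intent, which the question does not state, so it is left in the intermediate table.

```sas
/* Sort so that "previous row" means the previous TIME within a user. */
proc sort data=work.my_data out=work.sorted;
  by user_id time;
run;

/* Compute the lag in a data step. LAG is called on every row,      */
/* then blanked on the first row of each user so that no value is   */
/* carried over from the previous user.                             */
data work.with_lag;
  set work.sorted;
  by user_id;
  expdt_lag = lag(expdt);
  if first.user_id then expdt_lag = .;
run;

/* Do the aggregation in PROC SQL. */
proc sql;
  create table agg_table as
  select user_id,
         max(time)  as last_time,
         sum(bonus) as bonus_sum
  from work.with_lag
  group by user_id;
quit;
```

Calling LAG unconditionally matters: if it is called only inside an IF branch, it returns the value from the last row on which it was executed rather than from the previous row.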
I know that in PROC SQL we can set the STIMER option:
The PROC SQL option STIMER | NOSTIMER specifies whether PROC SQL writes timing information for each statement to the SAS log, instead of writing a cumulative value for the entire procedure. NOSTIMER is the default.
How can I get the same kind of timing information in a data step? I am not using a PROC SQL step.
data h;
select name,empid
from employeemaster;
quit;
PROC SQL steps individually are effectively separate data steps, so in a sense you always get the identical information from SAS. What you're asking is effectively how to find out how long 'select name' takes versus 'empid'.
There's not a direct way to get the timing of an individual statement in a data step, but you could write data step code to find out. The problem is that the data step is executed row-wise, so it's really quite different from the PROC SQL STIMER details; almost nothing you do in a data step will take very long by itself, unless you are doing something more complex like a hash table lookup. What takes long is writing out the data first, and reading in the data second.
You do have some options for troubleshooting long data steps, if that's your concern. OPTIONS MSGLEVEL=I will give you information about index usage, merge details, etc., which can be helpful if you aren't sure why it is taking a long time to do certain things (see http://goo.gl/bpGWL in SAS documentation for more info). You can write your own timestamp:
data test;
set sashelp.class sashelp.class;
_t=time();
put _t=;
run;
Odds are that won't show you much of use, since most data step iterations won't take very long, but if you are doing something fancy it might help. You could also use conditional statements to print the time only at certain intervals, for example at FIRST.ID in a process that works BY ID.
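The FIRST.ID idea could look roughly like this. The dataset names have and want are hypothetical, and have is assumed to be already sorted by id.

```sas
data want;
  set have;
  by id;
  /* Write a timestamp to the log once per BY group, not once per row. */
  if first.id then do;
    _t = time();
    put id= _t= time12.3;
  end;
  /* ... the rest of the step's processing goes here ... */
  drop _t;
run;
```

The gap between consecutive timestamps in the log then gives a rough idea of how long each BY group takes.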
Ultimately though, the information you already get from the log notes is what is most useful. In PROC SQL you need the STIMER information because SQL is doing several things at once, while base SAS lets (or makes) you lay everything out step-wise. Example:
PROC SQL;
create table X as select * from A,B where A.ID=B.ID;
quit;
is one step - but in SAS this would be:
proc sort data=a; by ID; run;
proc sort data=b; by ID; run;
data x;
merge a(in=a) b(in=b);
by id;
if a and b;
run;
For that you would get information on the duration of each of those steps (the two sorts and the merge) in SAS, which is similar to what STIMER would tell you.
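If the default step notes are not detailed enough, the FULLSTIMER system option (not mentioned in the thread, but a standard SAS option) asks SAS to write fuller resource statistics for each step to the log. A minimal sketch using the same steps as above; the IN= variable names in_a and in_b are just illustrative:

```sas
options fullstimer;   /* request detailed per-step resource notes in the log */

proc sort data=a; by id; run;
proc sort data=b; by id; run;

data x;
  merge a(in=in_a) b(in=in_b);
  by id;
  if in_a and in_b;
run;

options nofullstimer; /* switch the extra detail off again */
```

Each of the three steps then gets its own block of timing and resource notes in the log.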
No way.
PROC SQL STIMER logs timing for each separately executable SQL statement/query.
In data step, as you may know, the data step looping occurs, observation per observation, so the data step statement timing would be something like per observation, let's say transactional. Anyway this would not describe all the details where the time is being spent - waiting for disk reads, writes, etc.
So I guess this won't be very usable. In general, SAS performance is I/O driven.
I have a rather simplistic question.
Is there any way to merge $n$ data sets in SAS where $n > 2$? I know how to merge 2 data sets.
Thanks
gowers
You can merge multiple data sets using the same syntax as for just two:
data all;
merge ds1 ds2 ds3 ...;
by some_list_of_variables;
run;
If you have many data sets you want to merge, you may want to write a macro that lists them all.
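A macro of that kind might look like the sketch below. The macro name ds_list and its parameters are hypothetical, and it assumes the data sets are named with a common prefix and a running number (ds1, ds2, ds3), all sorted by a shared variable called id.

```sas
/* Generates the text "ds1 ds2 ... dsN" for use in a MERGE statement. */
%macro ds_list(prefix, n);
  %local i;
  %do i = 1 %to &n;
    &prefix.&i
  %end;
%mend ds_list;

data all;
  merge %ds_list(ds, 3);   /* expands to: merge ds1 ds2 ds3; */
  by id;
run;
```

The macro emits only data set names and no semicolon, so the semicolon after the macro call is what ends the MERGE statement.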
In addition to the code @itzy provided, you can identify your data sets using the IN= option on the MERGE statement. This lets you keep only the matches you need. Also, you must have common variable names to use in your BY statement. You can include a RENAME= data set option to create a common variable for use in your BY statement.
(Untested code)
data all;
merge ds1(in=one rename=(ds1_id=id))
ds2(in=two rename=(ds2_id=id))
ds3(in=three rename=(ds3_id=id))
;
by id; /* the common variable created by the RENAME= options above */
if one and two and three ; /* Keeps only records that match in all three */
run;
Even though you have said that you want to "merge" datasets, note that the MERGE statement is not the only option. If your merging key has duplicates in more than one dataset, then the MERGE statement may give logically wrong results even though it runs without an error. In that case, you can use PROC SQL. I also recall that PROC SQL can be more efficient from SAS 9.1 onwards.
Example -
proc sql;
select <fieldlist>
from data1 t1, data2 t2, data3 t3, data4 t4
where <join condition>;
quit;
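To illustrate the duplicate-key point, here is a small self-contained sketch with made-up data (the dataset names dsa and dsb and their contents are invented for the example). The key id=1 appears twice in each dataset.

```sas
data dsa;
  input id x $;
  datalines;
1 a
1 b
;
run;

data dsb;
  input id y $;
  datalines;
1 p
1 q
;
run;

/* DATA step merge pairs rows by position within the BY group: */
/* 2 rows (a-p, b-q), plus a log note about repeats of BY values. */
data merged;
  merge dsa dsb;
  by id;
run;

/* SQL join forms every combination within the key: */
/* 4 rows (a-p, a-q, b-p, b-q). */
proc sql;
  create table joined as
  select l.id, l.x, r.y
  from dsa as l inner join dsb as r
  on l.id = r.id;
quit;
```

Which of the two results is "right" depends on what the data mean, but the SQL result is the one that matches the usual definition of a join.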
Say I've already sorted set1 and set2 by the variables 'sticks', 'stones', and 'bones' and then I do this:
data merged;
merge set1(in=a) set2(in=b);
by sticks stones bones;
if a and b then output;
*else we don't want to do anything;
run;
Is there an easy way to drop all the variables from set2 in the merged dataset without having to type them all? I keep running into this problem where I have two datasets - both with quite a few variables - and I only want to merge them by a few variables and then only keep the variables from one of the sets.
I usually just use proc sql for something like this, but there are a few situations (more complex than the one above) where I think merge is better.
Also, I find it annoying that SAS requires you to "manually" sort datasets before merging them. If it will not let you merge datasets unless they are sorted correctly, why doesn't it just do it for you when you use merge? Thoughts? Maybe there is a way around this I don't know about.
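The sorting requirement exists because of the way the MERGE statement and the PDV (program data vector) work: a BY merge reads each dataset sequentially and matches rows by BY group as it goes.
There is really no way around it.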
However, what you're basically doing here is a lookup against set2 to make sure the key variables (sticks stones bones) have a match, the equivalent of an inner join. You could likely do that more efficiently with a hash table or with SET and the KEY= option (if you have an index, of course).
The easiest and most convenient way to get what you want here is a KEEP= data set option on set2, so that only the BY variables are loaded into the PDV.
Something like this:
data merged;
merge set1(in=a) set2(in=b keep=sticks stones bones);
by sticks stones bones;
if a and b then output;
*else we don't want to do anything;
run;
If hash tables don't scare you and you want to learn more about how to implement them in this case, feel free to contact me for more help.
EDIT:
Here is a good paper about using hash tables http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
Bear in mind that with hashes you should know what you're doing; they may yield unexpected results if you don't know what's happening under the hood.
Regardless, here is the problem solved using a very simple and basic hash table:
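data merged2;
set set1;
/* Load the set2 keys into an in-memory hash table once, on the first iteration. */
if _N_ = 1 then do;
declare hash h(dataset:"set2");
h.defineKey('sticks','stones','bones');
h.defineData('sticks','stones','bones');
h.defineDone();
end;
/* find() returns 0 when the current set1 key exists in set2. */
rc = h.find();
if rc=0; /* keep only the matching rows (inner join) */
drop rc;
run;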
The main benefit of this code is that it does not require the datasets to be sorted, which is a great time-saver when set2 is particularly big.
I have tried the following options with unacceptable response times. Creating an index on 'key' did not help either (NOTE: there are duplicate 'key' values in both datasets):
data a;
merge b(in=in_b)
c;
by key;
if in_b; /* keep every row that comes from b */
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
You should first sort the two datasets before merging them. This is what will give you the performance. Using an index when you have to scan the whole table to get a result is usually slower than presorting the datasets and merging them.
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% sure it matters, but ANSI SQL would be proc sql; create table a as select * from b left outer join c on b.key=c.key; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good one. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with PROC SQL:
Proc SQL;
Create table newdata
as select firsttable.*, secondtable.*
from table1 as firsttable
inner join table2 as secondtable
on (firsttable.keyfield = secondtable.keyfield);
quit;
If you have more than one key field, order the join conditions from the least-matching field first to the greatest-matching field last. SAS has a bad habit of creating a temporary dataset containing all possible matches and then sieving it down from there, which can blow out your temporary space allocation and slow everything down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.