How to compare variables in more than two datasets in sas? - sas

How to compare variables in more than two tables?
I even tried with compare but its not comparing more than two tables.

Using PROC COMPARE will not tell you much about the data aside from whether or not 2 tables have the same variable (which I assume is not the info you need).
You'll need to merge the tables together using MERGE. Your code will look something like this:
DATA TABLE1 TABLE2;
MERGE TABLE1 (IN=A) TABLE2 (IN=B);
BY VAR1;
IF A AND B THEN OUTPUT NEWTABLENAME;
ELSE IF A AND NOT B THEN OUTPUT NEWTABLENAME2;
RUN;
The BY statement tells SAS which variable you'd like the tables to merge on, as SAS isn't going to just smush some extra columns onto your existing table.
You can check out more about what PROC COMPARE does here
You can read more about PROC MERGE here

Related

Filter a SAS dataset to contain only identifiers given in a list

I am working in SAS Enterprise guide and have a one column SAS table that contains unique identifiers (id_list).
I want to filter another SAS table to contain only observations that can be found in id_list.
My code so far is:
proc sql noprint;
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN id_list
quit;
This code gives me the following errors:
Error 22-322: Syntax error, expecting on of the following: (, SELECT.
What am I doing wrong?
Thanks up front for the help.
You can't just give it the table name. You need to make a subquery that includes what variable you want it to read from ID_LIST.
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN (select id from id_list)
;
You could use a join in proc sql but probably simpler to use a merge in a data step with an in= statement.
data want;
merge oneColData(in = A) otherData(in = B);
by id_list;
if A;
run;
You merge the two datasets together, and then using if A you only take the ID's that appear in the single column dataset. For this to work you have to merge on id_list which must be in both datasets, and both datasets must be sorted by id_list.
The problem with using a Data Step instead of a PROC SQL is that for the Data step the Data-set must be sorted on the variable used for the merge. If this is not yet the case, the complete Data-set must be sorted first.
If I have a very large SAS Data-set, which is not sorted on the variable to be merged, I have to sort it first (which can take quite some time). If I use the subquery in PROC SQL, I can read the Data-set selectively, so no sort is needed.
My bet is that PROC SQL is much faster for large Data-sets from which you want only a small subset.

Deleting Observations That Are Not Part of a List in SAS

My first SAS data set, ds1 contains dates, firms and share prices. My second data set, ds2 contains a subset of the firms in ds1. I'd like to create ds3 which contains all of the observations in ds1 provided a firm in ds1 is also in ds2. I try doing this as follows:
DATA ds3;
set ds1;
IF firm IN (d2);
run;
The above does not work as planned as ds3 ends up containing no observations. I believe the problem is the IF IN statement. I could manually type all the firms in the parenthesis instead of putting d2 there but that would be very inefficient for me.
You have several options here; the correct one depends largely on your particular needs.
What you're effectively doing is joining two tables together. So, a MERGE or a SQL JOIN would be a simple solution.
data ds3;
merge ds1(in=_ds1) ds2(in=_ds2 keep=firm);
by firm;
if _ds1 and _ds2;
run;
That joins ds1 and ds2, only keeping the firm variable from ds2, and keeps only firms that are in both. Both DS1 and DS2 need to be sorted by firm;, and DS2 should have have only unique values of firm - no duplicates.
SQL is also pretty easy.
proc sql;
create table ds3 as
select * from ds1 where exists (
select 1 from ds2
where ds1.firm=ds2.firm
);
quit;
That's a bit closer to your terminology (unsurprising as SQL attempts to be close to natural language for many simple queries). That doesn't require either sorting or uniqueness, although this won't be particularly fast.
You could also store the DS2 firms in a format, or use a hash table to allow you to open DS2 alongside DS1. You could also use a keyed set statement. All of these are slightly more complicated to implement, but are generally faster as they don't require sorting and don't repeatedly reference the same data as the SQL does.

How to merge multiple datasets by common variable in SAS?

I have multiple datasets. Each of them has different number of attributes. I want to merge them all by common variable. This is 'union' if I use proc SQL. But there is hunderds of variables.
Example.
Dataset_Name Number of columns
dataset1 110
dataset2 120
dataset3 130
... ...
Say they have 100 columns in common. The final dataset which contains all dataset1,dataset2,dataset3..etc
only has common columns(in this case, 100 columns).
How do I do this?
And how do I get columns for each dataset this is not in common with the final dataset.
example: dataset1 will have 10 columns that are not in the final dataset, and list the name of 10 columns.
Thanks!!!!
UNION in SQL is equivalent to sequential SET in SAS.
data want;
set dataset1 dataset2 dataset3;
run;
Now, SAS by default includes all columns present in any dataset. To limit to just what's in all datasets, you have to use a keep statement.
You can determine this using proc sql, among other ways.
proc sql;
select name into :commonlist separated by ' '
from dictionary.columns C, dictionary.columns D
where C.libname=D.libname
and C.memname='DATASET1'
and D.memname='DATASET2'
and C.name=D.name
;
quit;
For more than two datasets it's more complicated and partially depends on your, but if you're comfortable in SQL you can figure that out pretty easily. A similar construct can create a list of just dataset 1 variables. The important part is the into :commonlist separated by ' ', which says to pull the select results into a macro variable called commonlist, separating rows by space. (The colon says to create a macro variable, not a table.)
So you can then run:
data want (keep=&commonlist.) dset1(keep=&dset1list.) dset2(keep=&dset2list.);
set dataset1(in=ds1) dataset2(in=ds2) dataset3(in=ds3);
output want;
if ds1 then output dset1;
else if ds2 then output dset2;
else if ds3 then output dset3;
run;
The in=xyz indicates which dataset a row came from. Each output dataset can have a separate list of variables to keep. You might want to keep the ID variable in those other datasets as well.
I will say that usually in SAS you don't do what you're doing here: it's not easy to do because it doesn't tend to be the best way to handle things - specifically, the little split off datasets. In general you would just keep those extra variables on the master dataset, and they'd just be nulls for anyone not in a dataset with that variable - assuming it makes sense to make this 'master' dataset at all.

Merging Datasets in SAS

I have a rather simplistic question.
Is there any way to merge $n$ data sets in SAS where $n > 2$. I know how to merge 2 data sets.
Thanks
gowers
You can merge multiple data sets using the same syntax as for just two:
data all;
merge ds1 ds2 ds3 ...;
by some_list_of_variables;
run;
If you have many data sets you want to merge, you may want to right a macro that lists them all.
In addition to the code #itzy provided, you can identify your data sets using an IN= option on the MERGE statement. This allows you only accept the matching you need. Also, you must have common variable names to use in your BY statement. You can include a RENAME= statement to create a common variable for use in your BY statement.
(Untested code)
data all;
merge ds1(in=one rename=(ds1_id=id))
ds2(in=two rename=(ds2_id=id))
ds3(in=three rename=(ds3_id=id))
;
by some_list_of_variables;
if one and two and three ; /* Creates only matching records from all */
run;
Even though you have said that you want to "merge" datasets, note that the MERGE statement is not the only option. If your merging key has duplicates in more than 1 dataset, then using the MERGE statement will might give logically wrong results even though it would work without complaining. In that case, you can use PROC SQL - I also recall that PROC SQL can be more efficient from SAS 9.1 onwards.
Example -
proc sql;
select <fieldlist>
from data1 t1, data2 t2, data3 t3, data4 t4
where <join condition>;
quit;

Efficient way to merge/join 2 large datasets in SAS 8.2

I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE:duplicate'keys'in both datasets):
data a;
merge b
c;
by key
if b;
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
You should first sort the two datasets before merging them. This is what will give the performance. using an index when you have to scan the whole table to have a result is usually slower then presorting the datasets and merging them.
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% if it matters, but ANSI SQL would be proc sql; create a as select * from b left outer join c on b.key=C.KEY; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
Proc SQL;
Create table newdata
as select firsttable.aster, secondtable.aster
from table1 as firsttable
inner join table2 as secondtable
on (firstable.keyfield = secondtable.keyfield);
quit;
If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habbit of creating a tempory dataset containing all possible matches and the siveing it down from there. Can blow out your tempory space allocation and slow everyting down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.