SAS (DataFlux) Merge Datasets - sas

I'm totally new to SAS so please be patient. :) I have a Data Job that is producing two data sets that contain a field. I need to compare these two data sets and then output only the rows that do not match, but I'm struggling. Is there a way to accomplish this in SAS? I've tried using "Match Codes", but I couldn't get that to work. I also tried "Cluster Diff", but that didn't work either.
What we have been doing is running two SQL queries and then comparing them in Excel and taking the records that do not match.
Any suggestions would be greatly appreciated.

You need to use PROC COMPARE for this kind of task.
proc compare base=<main dataset> compare=<other dataset> outdif out=<output dataset>;
by <id variables?>;
var <variables to compare, or leave this off for all variables>;
run;
That's the basic structure; read the documentation for more details.

You could do this in the datastep using merge. The below will give you anything in dataset_A that is not in dataset_B
data output_dataset;
merge dataset_A (in=a)
dataset_B (in=b)
;
by <variable>;
if a and not b;
run;

Related

Two way ANOVA in SAS

I have a data set that looks like below. I have to find if the scores are different across the four days. I understand I need to run a Two Way ANOVA for this test. I'm quite new to SAS if anyone could help me how to go about this? Should I be rearranging my data?
Yes you can use PROC TRANSPOSE
proc transpose out=...;
by player;
var round:;
run;

Filter a SAS dataset to contain only identifiers given in a list

I am working in SAS Enterprise guide and have a one column SAS table that contains unique identifiers (id_list).
I want to filter another SAS table to contain only observations that can be found in id_list.
My code so far is:
proc sql noprint;
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN id_list
quit;
This code gives me the following errors:
Error 22-322: Syntax error, expecting on of the following: (, SELECT.
What am I doing wrong?
Thanks up front for the help.
You can't just give it the table name. You need to make a subquery that includes what variable you want it to read from ID_LIST.
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN (select id from id_list)
;
You could use a join in proc sql but probably simpler to use a merge in a data step with an in= statement.
data want;
merge oneColData(in = A) otherData(in = B);
by id_list;
if A;
run;
You merge the two datasets together, and then using if A you only take the ID's that appear in the single column dataset. For this to work you have to merge on id_list which must be in both datasets, and both datasets must be sorted by id_list.
The problem with using a Data Step instead of a PROC SQL is that for the Data step the Data-set must be sorted on the variable used for the merge. If this is not yet the case, the complete Data-set must be sorted first.
If I have a very large SAS Data-set, which is not sorted on the variable to be merged, I have to sort it first (which can take quite some time). If I use the subquery in PROC SQL, I can read the Data-set selectively, so no sort is needed.
My bet is that PROC SQL is much faster for large Data-sets from which you want only a small subset.

Interpreting WHERE= in Proc

Converting a SAS code to Sql based process. Came across this pretty simple snippet.
proc freq data=temp1(where=(SAMERETAIL='Y')) noprint;
tables RETAILER*store/list nocum nopercent out=retailer_list;
run;
My interpretation of this is:
From Temp1:
Choose all observations which fit the criteria (sameretail=Y)
Extract Retail, Store frequency counts:
Store Retailer Count(*)
Output to Retailer_List.
The question I have is on the WHERE=. Is this applied to the Proc or Data? Is my interpretation correct? Business wise this is incorrect since we are only restricting the records with the flag=Y. Hence the question.
Any pointers?
Any help is greatly appreciated.
TIA.
where= is applied to the dataset you're pulling observations from to use in the proc freq. SUGI 24 has a good summary of how this works (see page 3).
THE WHERE DATA SET OPTION
The syntax of the WHERE data set option, called
WHERE=, is a combination of standard data set
option parentheses and a where-expression.

Merging Datasets in SAS

I have a rather simplistic question.
Is there any way to merge $n$ data sets in SAS where $n > 2$. I know how to merge 2 data sets.
Thanks
gowers
You can merge multiple data sets using the same syntax as for just two:
data all;
merge ds1 ds2 ds3 ...;
by some_list_of_variables;
run;
If you have many data sets you want to merge, you may want to right a macro that lists them all.
In addition to the code #itzy provided, you can identify your data sets using an IN= option on the MERGE statement. This allows you only accept the matching you need. Also, you must have common variable names to use in your BY statement. You can include a RENAME= statement to create a common variable for use in your BY statement.
(Untested code)
data all;
merge ds1(in=one rename=(ds1_id=id))
ds2(in=two rename=(ds2_id=id))
ds3(in=three rename=(ds3_id=id))
;
by some_list_of_variables;
if one and two and three ; /* Creates only matching records from all */
run;
Even though you have said that you want to "merge" datasets, note that the MERGE statement is not the only option. If your merging key has duplicates in more than 1 dataset, then using the MERGE statement will might give logically wrong results even though it would work without complaining. In that case, you can use PROC SQL - I also recall that PROC SQL can be more efficient from SAS 9.1 onwards.
Example -
proc sql;
select <fieldlist>
from data1 t1, data2 t2, data3 t3, data4 t4
where <join condition>;
quit;

Efficient way to merge/join 2 large datasets in SAS 8.2

I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE:duplicate'keys'in both datasets):
data a;
merge b
c;
by key
if b;
run;
=== OR ===
proc sql;
create a
as select *
from b
left outer join c
on b.key;
quit;
You should first sort the two datasets before merging them. This is what will give the performance. using an index when you have to scan the whole table to have a result is usually slower then presorting the datasets and merging them.
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% if it matters, but ANSI SQL would be proc sql; create a as select * from b left outer join c on b.key=C.KEY; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
Proc SQL;
Create table newdata
as select firsttable.aster, secondtable.aster
from table1 as firsttable
inner join table2 as secondtable
on (firstable.keyfield = secondtable.keyfield);
quit;
If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habbit of creating a tempory dataset containing all possible matches and the siveing it down from there. Can blow out your tempory space allocation and slow everyting down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.