Comparing two datasets in sas

I have used PROC COMPARE to compare two datasets and get the details of the differences. But I just want to know whether two datasets are the same or not (both in content and in number of rows). For example, I have two datasets A and B and only want to know whether they are identical; I don't need any other difference details. In other words, I need to set a flag to 1 if the datasets are different and to 0 if they are the same. Is there a way to do this? I searched the internet, but all I could find was PROC COMPARE used in various ways.
Thanks in advance

You can use the SYSINFO automatic macro variable, which PROC COMPARE sets to a return code (0 means no differences were found):
proc compare noprint base=baseds compare=compareds;
run;
%if %eval(&sysinfo ge 8) %then %do; ...
There is a great SAS paper describing the return codes in meticulous detail, available here.
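
A minimal sketch of turning that return code into the 0/1 flag the question asks for. The macro variable names compare_rc and flag are placeholders, not part of the original answer. This version treats any nonzero code as "different", which also counts attribute-only differences such as labels or formats; the threshold of 8 used above is a looser test.

```sas
proc compare noprint base=baseds compare=compareds;
run;

/* Copy SYSINFO right away: it is overwritten at the next step boundary. */
%let compare_rc = &sysinfo;

/* flag = 0 when PROC COMPARE found no differences at all, 1 otherwise. */
%let flag = %eval(&compare_rc ne 0);

%put NOTE: flag=&flag (sysinfo=&compare_rc);
```

Because this uses only %let and %eval, it runs in open code without needing a surrounding macro definition.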

Related

Unique combinations

I'm looking to examine unique combinations of 7 binary variables (cannabis modes of delivery [yes/no]) and was under the impression this should be a fairly simple task in SAS. However, all of the coding examples I've come across online seem a bit overcomplicated for such a basic process. If anyone has insight regarding this concept I would appreciate it!

If you want the counts, it is probably easiest to use PROC SUMMARY.
proc summary data=have nway missing ;
class var1-var7 ;
output out=want(rename=(_freq_=count));
run;

proc sort data=yourdataset(keep=<variables to examine, separated by blanks>) out=choose_a_name noduprec;
by <variables to examine, separated by blanks>;
run;
The resulting dataset choose_a_name will contain all unique combinations of the examined variables.

Understanding SAS output data sets

SAS has several forms it uses to create output data sets from within a procedure. It is not always clear whether or not a particular procedure can generate a data set and, if it seems to be able to, it's not always clear how.
Off the top of my head, here are some examples of how widely the syntax can differ.
Example 1
proc sort data = sashelp.baseball out = baseball_sorted;
by
league
division
;
run;
Example 2
proc means noprint data = baseball_sorted;
by
league
division
;
var nHits;
output
out = baseball_avg_hits (drop = _TYPE_ _FREQ_)
mean = mean_hits
;
run;
Example 3
ods exclude all;
ods output
statistics = baseball_statistics
equality = baseball_ftest
;
proc ttest data = baseball_sorted;
class league;
var nHits;
run;
ods exclude none;
Example 4
The PROC ANOVA OUTSTAT= option.
It seems almost as if SAS has implemented each of these willy-nilly. Is the syntax for creating a data set governed by some consistent approach I am not seeing, or is it truly capricious and arbitrary?

For PROC code, the syntax for outputting data is often specific to that procedure (your examples 1, 2, and 4), which often feels willy-nilly. I think PROC developers are given a lot of freedom, and remember that many of these PROCs are 30+ years old.
The great thing about the Output Delivery System (ODS, your example 3) is it provides a single syntax for outputting data, regardless of the procedure. So you can use the ODS OUTPUT statement with (almost?) any PROC. The names and structures of the output objects will of course vary between PROCs. So if you are looking for a consistent approach, I would focus on using ODS OUTPUT. ODS was added in V7 (I think).
It would be interesting to try to find an example of an output dataset which could be made by a PROC but could not be made by ODS OUTPUT. I hope there aren't any. If that is the case, you could consider the range of OUTPUT statements/options within PROCs as legacy code.

Agree with Quentin. You have to remember that there are SAS systems out there running code written in the 80s. SAS would have a huge headache if they forced every team to rewrite all the procedures and then forced their customers to change all their code. SAS has been around since the 60s and the organic growth of the syntax is to be expected.
FWIW, having an OUT= option makes sense on things with no graphical output, e.g. PROC SORT or PROC TRANSPOSE.

The way I see it, there are four main ways to specify output data sets:
1. Options on the PROC statement, such as OUT= or OUTEST=.
2. Options on the main statement of the procedure, e.g. MODEL or TABLES. For example, PROC FREQ has an OUT= option on the TABLES statement.
3. An explicit OUTPUT statement within the procedure. These typically come from older procedures, e.g. PROC MEANS.
4. ODS tables, a relatively newer method that is used more frequently these days because the format aligns with what you'd expect to see.
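
As a sketch of the second approach, the OUT= option on the TABLES statement of PROC FREQ writes the counts to a data set. The output name baseball_counts is made up for illustration; sashelp.baseball is the sample data used in the question.

```sas
/* NOPRINT suppresses the printed table; only the data set is produced. */
proc freq data=sashelp.baseball noprint;
  /* One row per league/division combination, with COUNT and PERCENT. */
  tables league*division / out=baseball_counts;
run;
```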
Yes, there are multiple places to check, but fortunately the SAS documentation for procedures is relatively clear with the options and how to use/specify the outputs.
If I've missed anything that seems different post in the comments and I can update this.
PS. Although SAS is definitely bad, trying to navigate different packages/modules in Python to export an XLSX file isn't straightforward either. Some packages support some options, others don't. I've given up on asking why and just accept these as peculiarities of the different languages at this point.

SAS dynamically declaring macro variable

My company just switched from R to SAS and I am converting a lot of my R code to SAS. I am having a huge issue dynamically declaring variables (macro variables) in SAS.
For example one of my processes needs to take in the mean of a column and then apply it throughout the code in many steps.
%let numm =0;
I have tried the following with my numm variable but both methods do not work and I cannot seem to find anything online.
PROC MEANS DATA = ASSGN3.COMPLETE mean;
#does not work
&numm = VAR MNGPAY;
run;
Proc SQL;
#does not work
&numm =(Select avg(Payment) from CORP.INV);
quit;

I would highly recommend buying a book on SAS or taking a class from SAS Training. SAS Programming II is a good place to start (Programming I if you have not programmed anything else, but that doesn't sound like the case). The code you have shows you need it. It is a complete paradigm shift from R.
That said, try this:
proc sql noprint;
select mean(payment) into :numm from corp.inv;
quit;
%put The mean is: &numm;
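
To show how the stored value can then be reused in later steps, here is a small sketch against sashelp.baseball rather than the asker's tables. The names mean_hits and above_avg are hypothetical. The TRIMMED keyword strips the padding blanks from the stored value, and the explicit FORMAT= is a common precaution so that the value is not rounded when it is converted to text.

```sas
proc sql noprint;
  /* Store the column mean as text in the macro variable mean_hits. */
  select mean(nHits) format=best32.
    into :mean_hits trimmed
    from sashelp.baseball;
quit;

/* The macro variable resolves to a numeric literal before the step runs. */
data above_avg;
  set sashelp.baseball;
  where nHits > &mean_hits;
run;
```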

Here's the proc summary / data step equivalent:
proc summary data = corp.inv;
var payment;
output out = inv_summary mean=;
run;
data _null_;
set inv_summary;
call symputx('numm',payment);
run;
%put The mean is: &numm;
Proc sql is a more compact approach if you just want a simple arithmetic mean, but if you require more complex statistics it makes sense to use proc summary.

SAS (DataFlux) Merge Datasets

I'm totally new to SAS so please be patient. :) I have a Data Job that is producing two data sets that contain a field. I need to compare these two data sets and then output only the rows that do not match, but I'm struggling. Is there a way to accomplish this in SAS? I've tried using "Match Codes", but I couldn't get that to work. I also tried "Cluster Diff", but that didn't work either.
What we have been doing is running two SQL queries and then comparing them in Excel and taking the records that do not match.
Any suggestions would be greatly appreciated.

You need to use PROC COMPARE for this kind of task.
proc compare base=<main dataset> compare=<other dataset> outdif out=<output dataset>;
by <id variables?>;
var <variables to compare, or leave this off for all variables>;
run;
That's the basic structure; read the documentation for more details.

You could do this in a data step using MERGE. Both datasets must be sorted by the BY variable first. The code below will give you anything in dataset_A that is not in dataset_B.
data output_dataset;
merge dataset_A (in=a)
dataset_B (in=b)
;
by <variable>;
if a and not b;
run;
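
If rows that are missing from either side are wanted, the same idea can be extended. This is a sketch only: the key variable id and the output names only_in_one and source are assumptions made for illustration.

```sas
/* MERGE with BY requires both inputs to be sorted by the key. */
proc sort data=dataset_A; by id; run;
proc sort data=dataset_B; by id; run;

data only_in_one;
  merge dataset_A (in=a)
        dataset_B (in=b);
  by id;
  length source $1;
  /* Keep keys found in exactly one dataset and record which one. */
  if a and not b then source = 'A';
  else if b and not a then source = 'B';
  else delete;
run;
```

Note that this compares keys only. Rows that share a key but differ in other columns are dropped here, and PROC COMPARE is the better fit for those.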

Outputting the top 10% in SAS

How can I limit the number of observations that are output in SAS to the top 10%? I know about the OBS= option, but I haven't been able to figure out how to make OBS= take a percentage.
Thanks in advance!

As far as I know there's not a direct way to do what you're asking. You could pretty easily write a macro to do this, though.
Assuming you're asking to view the first 10% of records in a PROC PRINT, you could do something like this.
%macro top10pct(lib=WORK,dataset=);
proc sql noprint;
select max(ceil(0.1*nlobs)) into :_nobs
from dictionary.tables
where upcase(libname)=upcase("&lib.") and upcase(memname)=upcase("&dataset.");
quit;
proc print data=&lib..&dataset.(obs=&_nobs.);
run;
%mend top10pct;
dictionary.tables has all of the PROC CONTENTS information available to it, including the number of logical observations (NLOBS). This number is not 100% guaranteed to be accurate if you've been doing things to the dataset like deleting observations in SQL, but for SAS datasets this is almost always accurate, or close enough. For RDBMS tables this may be undefined or may not be accurate.
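
A data step alternative, sketched here on sashelp.baseball, uses the NOBS= option of the SET statement instead of dictionary.tables. The output name first10pct is made up. Like the macro, it keeps the first 10% of rows in their stored order, so sort the data first if "top" means highest by some variable. NOBS= is reliable for native SAS datasets but not for views or RDBMS tables.

```sas
data first10pct;
  /* n is populated at compile time with the number of observations. */
  set sashelp.baseball nobs=n;
  /* Stop once more than 10% of the rows (rounded up) have been read. */
  if _n_ > ceil(0.1 * n) then stop;
run;
```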
