SAS - Proc Compare - show ALL duplicates - compare

whilst using the PROC COMPARE is SAS, is it possible to list all duplicates found? By default a message will be displayed stating the first duplicate found and the total number of duplicates.
i.e:
data x1;
input x $ y $ z $ ;
datalines;
222 test abc
qqq test abc
aaa test abc
222 test abc
222 test abc
;
run;
data y1;
input x $ y $ z $ ;
datalines;
222 test abc
qqq test abc
aaa test abc
222 test abc
222 test abc
;
run;
***********************************;
*** sort data;
***********************************;
proc sort data=x1;
by x y;
run;
proc sort data=y1;
by x y;
run;
***********************************;
*** compare data;
***********************************;
proc compare listvar
base=x1
compare = y1;
id x y;
run;
************** END *****************;
output
The SAS System
The COMPARE Procedure
Comparison of WORK.X1 with WORK.Y1
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.X1 23OCT14:16:03:38 23OCT14:16:03:38 3 5
WORK.Y1 23OCT14:16:03:38 23OCT14:16:03:38 3 5
Variables Summary
Number of Variables in Common: 3.
Number of ID Variables: 2.
WARNING: The data set WORK.X1 contains a duplicate observation at observation
number 2.
NOTE: At observation 2 the current and previous ID values are:
x=222 y=test.
NOTE: Further warnings for duplicate observations in this data set will not be
printed.
WARNING: The data set WORK.Y1 contains a duplicate observation at observation
number 2.
NOTE: At observation 2 the current and previous ID values are:
x=222 y=test.
NOTE: Further warnings for duplicate observations in this data set will not be
printed.
Observation Summary
Observation Base Compare ID
First Obs 1 1 x=222 y=test
Last Obs 5 5 x=qqq y=test
Number of Observations in Common: 5.
Number of Duplicate Observations found in WORK.X1: 2.
Number of Duplicate Observations found in WORK.Y1: 2.
Total Number of Observations Read from WORK.X1: 5.
Total Number of Observations Read from WORK.Y1: 5.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5.
NOTE: No unequal values were found. All values compared are exactly equal.
# Joe - thanks for the comment!

Proc Freq might be a good approach to find duplicates. Then just print them out with a Proc Print.
PROC FREQ;
TABLES keyvar / noprint out=keylist;
RUN;
PROC PRINT data=keylist;
WHERE count ge 2;
RUN;

I don't think there's a way to get the log or listing to list more than just the first duplicate, if that's what you're going after, using the ID statement.
What you are likely best off doing is using the OUTALL option, and outputting the results to a dataset (if you're not already). Then it would be fairly easy to see the duplicates.
For example:
data class2 class3;
set sashelp.class;
output;
output;
output class3;
run;
proc compare base=class2 compare=class3 out=outclass outall;
id name;
run;
You could also use the BY statement along with the ID statement, if it's sorted; then you'll still have duplicates, but each BY Group has a separate report, so you'd see the duplicates there.
proc compare base=class2 compare=class3 out=outclass outall;
by name;
id name;
run;

Finding exact number of duplicates for each id may be better suited for proc sql.
Something like:
proc sql;
create table x2 as select
*,
count(id_var)
from x1
group by x,y,z;
quit;
This could reveal any duplicate rows in either dataset.

Related

Collapsing a large dataset while conditionally preserving some missing values

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can use SQL here easily, MAX() applies to character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;

Summing values by character in SAS

I created this fakedata as an example:
data fakedata;
length name $5;
infile datalines;
input name count percent;
return;
datalines;
Ania 1 17
Basia 1 3
Ola 1 10
Basia 1 52
Basia 1 2
Basia 1 16
;
run;
The result I want to have is:
---> summed counts and percents for Basia
I would like to have summed count and percent for Basia as she was only once in the table with count 4 and percent 83. I tried exchanging name into a number to do GROUP BY in proc sql but it changes into order by (I had such an error). Suppose that it isn't so difficult, but I can't find the solution. I also tried some arrays without any success. Any help appreciated!
It sounds like proc sql does what you want:
proc sql;
select name, count(*) as cnt, sum(percent) as sum_percent
from fakedata
group by name;
You can add a where clause to get the results just for one name.
Hm, actually I got an answer.
proc summary data=fakedata;
by name;
var count percent;
output out=wynik (drop = _FREQ_ _TYPE_) sum(count)=count sum(percent)=percent;
run;
You can go back a step and use PROC FREQ most likely to generate this output in a single step. Based on counts the percents are not correct, but I'm not sure they're intended to be, right now they add up to over 100%. If you already have some summaries, then use the WEIGHT statement to account for the counts.
proc freq data=fakedata;
table name;
weight count;
run;

SAS Find Top Combinations in Dataset

Hell everyone --
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is what are the most popular combinations of items ordered. For example in the above case, there were 2 orders that contained items A&B, 1 order of A&C, and 1 order of B. What would be the best way to output the different combinations along with the numbers of orders placed?
It seems there is no permutation issue, you could try this:
proc sort data=have;
by order_id item;
run;
data temp;
set have;
by order_id;
retain comb;
length comb $4;
comb=cats(comb,item);
if last.order_id then do;
output;
call missing(comb);
end;
run;
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;
There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.

Sorting an almost sorted dataset in SAS

I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:
ID Label Frequency
1 Jon 20
1 John 5
2 Mathieu 2
2 Mathhew 7
2 Matt 5
3 Nat 1
3 Natalie 4
Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?
If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby information on the dataset, SAS will take less CPU time to sort it *. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.
* As long as you don't use the force option in the proc sort statement, which forces it to do redundant sorting.
Here's a little test I ran:
option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
do _n_ = 1 to 10000000;
var1 = round(rand('uniform'),0.0001);
var2 = round(rand('uniform'),0.0001);
var3 = round(rand('uniform'),0.0001);
output;
end;
run;
/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
by var1 var2 var3;
run;
/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
by var1 var2;
run;
/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
by var1 var2 var3;
run;
/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
modify baseline (sortedby = _NULL_);
run;
quit;
/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
by var1 var2 var3;
run;
The relevant results I got were as follows:
SAS took 8 seconds to sort the original completely unsorted dataset by all 3 variables.
SAS took 4 seconds to sort by 3/3 starting from the baseline dataset already sorted by 2/3 variables.
SAS took 4 seconds to sort by 3/3 starting from the same baseline dataset after removing the sort information from it.
The relevant metric from the log output is the amount of user CPU time.
Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index - this would allow you to do things involving by group processing, for example.
/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
/*Replace the sort information*/
modify baseline(sortedby = var1 var2);
run;
/*Create composite index*/
modify baseline;
index create index1 = (var1 var2 var3);
run;
quit;
Creating an index requires a read of the whole dataset, as does the sort, but only a fraction of the work involved in writing it out again, and might be faster than a 2/3 to 3/3 sort in some situations.

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

I have two datasets in SAS that I would like to merge, but they have no common variables. One dataset has a "subject_id" variable, while the other has a "mom_subject_id" variable. Both of these variables are 9-digit codes that have just 3 digits in the middle of the code with common meaning, and that's what I need to match the two datasets on when I merge them.
What I'd like to do is create a new common variable in each dataset that is just the 3 digits from within the subject ID. Those 3 digits will always be in the same location within the 9-digit subject ID, so I'm wondering if there's a way to extract those 3 digits from the variable to make a new variable.
Thanks!
SQL(using sample data from Data Step code):
proc sql;
create table want2 as
select a.subject_id, a.other, b.mom_subject_id, b.misc
from have1 a JOIN have2 b
on(substr(a.subject_id,4,3)=substr(b.mom_subject_id,4,3));
quit;
Data Step:
data have1;
length subject_id $9;
input subject_id $ other $;
datalines;
abc001def other1
abc002def other2
abc003def other3
abc004def other4
abc005def other5
;
data have2;
length mom_subject_id $9;
input mom_subject_id $ misc $;
datalines;
ghi001jkl misc1
ghi003jkl misc3
ghi005jkl misc5
;
data have1;
length id $3;
set have1;
id=substr(subject_id,4,3);
run;
data have2;
length id $3;
set have2;
id=substr(mom_subject_id,4,3);
run;
Proc sort data=have1;
by id;
run;
Proc sort data=have2;
by id;
run;
data work.want;
merge have1(in=a) have2(in=b);
by id;
run;
an alternative would be to use
proc sql
and then use a join and the substr() just as explained above, if you are comfortable with sql
Assuming that your "subject_id" variable is a number then the substr function wont work as sas will try convert the number to a string. But by default it pads some paces on the left of the number.
You can use the modulus function mod(input, base) which returns the remainder when input is divided by base.
/*First get rid of the last 3 digits*/
temp_var = floor( subject_id / 1000);
/* then get the next three digits that we want*/
id = mod(temp_var ,1000);
Or in one line:
id = mod(floor(subject_id / 1000), 1000);
Then you can continue with sorting the new data sets by id and then merging.