I have removed duplicates from a dataset using the nodupkey feature, but want to compare the deleted duplicates to the first observation that is kept.
proc sort data=matchedfile dupout=deletedduplicate nodupkey
out=dedupedfile;
by ID;
run;
We need a dataset that combines all observations that are duplicates: the removed duplicates in the dupout file and the observation with the same ID that was kept in the dedupedfile.
Thanks!
If your issue is that you want the 'not removed' row together with the 'removed' rows, you can use the NOUNIQUEKEY option added in SAS 9.3. It does the opposite of NODUPKEY - it keeps only the records that are NOT unique and removes the unique records. You can have those removed unique records simply discarded (if, say, you will run a separate query to get them later), or you can use UNIQUEOUT= to put them in a dataset.
proc sort data=have out=dups nouniquekey uniqueout=nodups;
by whatever;
run;
See PROC SORT documentation for more details.
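Applied to the question's own names, a sketch (all_dup_rows and unique_rows are just made-up output dataset names): every row whose ID occurs more than once, including the one that NODUPKEY would have kept, lands together in one dataset.
/* all_dup_rows : every row whose ID occurs more than once */
/* unique_rows  : rows whose ID occurs exactly once         */
proc sort data=matchedfile out=all_dup_rows
          nouniquekey uniqueout=unique_rows;
    by ID;
run;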
I am working in SAS Enterprise guide and have a one column SAS table that contains unique identifiers (id_list).
I want to filter another SAS table to contain only observations that can be found in id_list.
My code so far is:
proc sql noprint;
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN id_list
quit;
This code gives me the following errors:
ERROR 22-322: Syntax error, expecting one of the following: (, SELECT.
What am I doing wrong?
Thanks up front for the help.
You can't just give it the table name. You need to write a subquery that specifies which variable you want it to read from ID_LIST.
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN (select id from id_list)
;
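For completeness, the full step as it might run on its own (a sketch; it assumes the identifier column inside id_list is also named id):
proc sql noprint;
    create table test as
    select *
    from data_sample
    where id in (select id from id_list);
quit;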
You could use a join in proc sql, but it is probably simpler to use a merge in a data step with the in= dataset option.
data want;
merge oneColData(in = A) otherData(in = B);
by id;
if A;
run;
You merge the two datasets together, and then with if A you keep only the IDs that appear in the single-column dataset. For this to work, the ID variable (id here) must exist under the same name in both datasets, and both datasets must be sorted by it.
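A minimal sketch of those sorts, assuming the key variable is named id in both tables:
proc sort data=oneColData;   /* the one-column table of IDs */
    by id;
run;

proc sort data=otherData;    /* the table to be filtered */
    by id;
run;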
The problem with using a data step instead of PROC SQL is that, for the merge, the data set must be sorted on the merge variable. If that is not yet the case, the complete data set has to be sorted first.
If I have a very large SAS data set that is not sorted on the merge variable, I have to sort it first (which can take quite some time). If I use the subquery in PROC SQL, I can read the data set selectively, so no sort is needed.
My bet is that PROC SQL is much faster for large data sets from which you want only a small subset.
I have a dataset structured with an ID and two other variables.
The ID is not unique; it appears in the dataset more than once (a patient could receive more than one clinical treatment). How can I drop the entire observation (the entire line) only if it is a perfect clone of a previous observation (based on the other two variable values)? I don't want to use an insanely long if statement.
Thanks.
proc sql;
create table newdata as
select distinct * from olddata;
quit;
Sounds like an easy SQL fix. The select distinct option will remove any completely duplicate rows in a dataset if you select all columns.
If you specifically want to identify if two consecutive lines are identical (but are not looking to match identical lines separated by other lines), you can use notsorted on a by statement and then first and last variables.
data want;
set have;
by id var1 var2 notsorted;
if first.var2;
run;
That will keep the first record for any group of identical id/var1/var2 values, as long as they're consecutive in the dataset. Of course, if you sort the dataset by id var1 var2 first, this will always remove the duplicates; even unsorted, it still removes consecutive runs of identical records.
I prefer #JJFord's answer, but for the sake of completeness, this can also be done using the noduprecs option in proc sort:
proc sort data=mydata noduprecs;
by _all_;
run;
The important bit is the noduprecs option, which deletes observations that are exact copies of the previous observation. Because it only compares consecutive observations, sort by _all_ (or by id and the two other variables) so that identical records end up next to each other.
I have a database in which some of the observations have an identifier ident, and some not. I want to create a new database in which I have dropped the observations which are duplicates of my ident variable, but to keep the observations where ident is missing.
If I simply do a proc sort nodupkey
proc sort nodupkey data=have;
by ident;
run;
Then it also eliminates the missing values. Is there a simple way to do that (that is, without breaking the dataset apart, running proc sort nodupkey on one part, and then reassembling it)?
You have a couple of options when removing duplicates.
First off, dupout=<dataset> on the proc sort will send all of the removed duplicates to another dataset, which you can then do something with. But this is a back-end version of your 'break the dataset' approach, just probably faster since it only splits off the smaller part.
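A minimal sketch of that route, reusing the have/ident names from the question (kept, extra, and want are made-up dataset names):
/* kept  : one row per ident value (first occurrence) */
/* extra : the rows that nodupkey removed             */
proc sort data=have out=kept dupout=extra nodupkey;
    by ident;
run;

/* add back the removed rows whose ident is missing */
data want;
    set kept extra(where=(missing(ident)));
run;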
Simpler is to do the dedup yourself.
proc sort data=have;
by ident;
run;
data want;
set have;
by ident;
if (first.ident) or missing(ident);
run;
That keeps the first record for each ident, plus any record with ident missing.
I'm trying to keep only the duplicate results for one column in a table. This is what I have.
proc sql;
create table DUPLICATES as
select Address, count(*) as count
from TEST_TABLE
group by Address
having COUNT gt 1
;
quit;
Is there any easier way to do this or an alternative I didn't think of? It seems goofy that I then have to re-join it with the original table to get my answer.
proc sort data=TEST_TABLE;
by Address;
run;
data DUPLICATES;
set TEST_TABLE;
by Address;
if not (first.Address and last.Address) then output;
run;
Using proc sort with nodupkey and dupout= will dedupe the data and give you a dupout dataset containing the duplicate records from the original dataset, but that dataset does not include EVERY record for a duplicated ID value - it only gets the 2nd, 3rd, 4th...Nth occurrences. So you aren't comparing all the duplicate occurrences of the ID value when you use this method. It's great when you know what you want to remove and define enough by variables to limit this precisely, or if you know that your records with duplicate IDs are identical in every way and you just want them removed.
When there are duplicates in a raw file I receive, I like to compare all records where ID has more than one occurrence.
proc sort data=test nouniquekey
    uniqueout=singles
    out=dups;
by ID;
run;
nouniquekey deletes the unique observations from the out= dataset
uniqueout=dsname stores the unique observations
out=dsname stores the remaining (non-unique) observations
Again, this method is great for working with messy raw data, and for debugging if your code might have produced duplicates.
That's easy using proc sort with the nodupkey and dupout= options:
proc sort data=TEST_TABLE nodupkey dupout=dups;
by Address;
run;
Refer to the PROC SORT documentation for further information.
select field,count(field) from table
group by field having count(field) > 1
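In SAS proc sql specifically, you can skip the separate re-join: selecting all columns together with a group by makes PROC SQL remerge the summary statistics back onto the detail rows, so the having clause keeps every row whose Address occurs more than once. A sketch using the question's TEST_TABLE name:
proc sql;
    create table DUPLICATES as
    select *
    from TEST_TABLE
    group by Address
    having count(*) > 1;
quit;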
First off, I know pretty much nothing about SAS and I am not a programmer but an accountant, but here it goes:
I am trying to compare two data sets to identify differences between them, so I am using the 'proc compare' command as follows:
proc compare data=table1 compare=table2
criterion=.01;
run;
This works fine, but it compares line by line and in order, so if table2 is missing a row half way through, then all entries after that row will be returned as not equal.
How do I ask the comparison to be made based on a variable so that the proc compare finds the value associated with variable X in table 1, and then makes sure that the same variable X in table 2 has the same value?
The ID statement in PROC COMPARE is used to match rows. This code may work for you:
proc compare data=table1 compare=table2 criterion=.01;
id X;
run;
You may need to use PROC SORT to sort the data by X before doing the PROC COMPARE. Refer to the PROC COMPARE documentation for details on the ID statement to determine if you should sort or not.
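A sketch of the sorted version, assuming X identifies the rows you want matched in both tables:
/* an ID statement normally expects both tables sorted by the ID variable */
proc sort data=table1; by X; run;
proc sort data=table2; by X; run;

proc compare data=table1 compare=table2 criterion=.01;
    id X;
run;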
Here is a link to the PROC COMPARE documentation:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a000057814.htm