I have a dataset structured with an ID and two other variables.
The ID is not unique; it appears in the dataset more than once (a patient can receive more than one clinical treatment). How can I drop an entire observation (the entire row) only if it is an exact duplicate of a previous observation (based on the values of the other two variables)? I don't want to use an insanely long if statement.
Thanks.
proc sql;
select distinct * from olddata;
quit;
Sounds like an easy SQL fix. The select distinct option will remove any completely duplicate rows in a dataset if you select all columns.
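As written, the query only prints the distinct rows. A minimal sketch of saving them instead, assuming you want a new dataset (newdata is a hypothetical name, not from the question):

```sas
proc sql;
  /* create table writes the distinct rows to a new dataset
     instead of sending them to the output window */
  create table newdata as
  select distinct *
  from olddata;
quit;
```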
If you specifically want to identify whether two consecutive lines are identical (and are not looking to match identical lines separated by other lines), you can use notsorted on a by statement together with the first. and last. variables.
data want;
set have;
by id var1 var2 notsorted;
if first.var2;
run;
That will keep the first record for any group of identical id/var1/var2, so long as they're consecutive in the dataset. Of course, if you sort the dataset by id var1 var2 first, this will always remove the duplicates, but with notsorted it also works on unsorted data for removing consecutive pairs (or longer runs) of identical rows.
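A sketch of the sort-first variant, assuming the same have dataset and variables as above (have_sorted is a hypothetical intermediate name). Once the data is sorted, notsorted is no longer needed:

```sas
/* Sort so that identical id/var1/var2 rows become adjacent */
proc sort data=have out=have_sorted;
  by id var1 var2;
run;

data want;
  set have_sorted;
  by id var1 var2;
  /* first.var2 is 1 only on the first row of each id/var1/var2 group */
  if first.var2;
run;
```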
I prefer #JJFord's answer, but for the sake of completeness, this can also be done using the nodupe option in proc sort:
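proc sort data=mydata nodupe;
by _all_;
run;
The choice of by variables does matter here. The nodupe option compares each record only with the previous record written, so identical rows must end up next to each other after sorting. Sorting by all variables (by _all_) guarantees that; sorting by id alone can leave duplicates that are separated by other rows with the same id.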
I have removed duplicates from a dataset using the nodupkey feature, but want to compare the deleted duplicates to the first observation that is kept.
proc sort data=matchedfile dupout=deletedduplicate nodupkey
out=dedupedfile;
by ID;
run;
We need a dataset that combines all observations that are duplicates: the removed duplicates in the dupout file and the kept observation with the same ID in dedupedfile.
Thanks!
If your issue is that you want the 'not removed' row with the 'removed' rows, you can use the NOUNIKEY option added in SAS 9.3. It does the opposite of NODUPKEY - only keeps records that are NOT unique - and removes the unique records. You can have those removed unique records just discarded (if you will, separately, do a different query to get them) or you can use UNIQUEOUT to put them in a dataset.
proc sort data=have out=dups nounikey uniqueout=nodups;
by whatever;
run;
See PROC SORT documentation for more details.
I have multiple datasets. Each of them has a different number of attributes. I want to merge them all by their common variables. This is a 'union' if I use proc SQL, but there are hundreds of variables.
Example.
Dataset_Name    Number of columns
dataset1        110
dataset2        120
dataset3        130
...             ...
Say they have 100 columns in common. The final dataset, which contains all of dataset1, dataset2, dataset3, etc., should have only the common columns (in this case, 100 columns).
How do I do this?
And how do I get, for each dataset, the columns that are not in the final dataset?
Example: dataset1 will have 10 columns that are not in the final dataset, and I want to list the names of those 10 columns.
Thanks!!!!
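UNION in SQL is roughly equivalent to a sequential SET in SAS (more precisely, SET behaves like OUTER UNION CORR: it stacks rows and matches columns by name without removing duplicates).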
data want;
set dataset1 dataset2 dataset3;
run;
Now, SAS by default includes all columns present in any dataset. To limit to just what's in all datasets, you have to use a keep statement.
You can determine this using proc sql, among other ways.
proc sql;
select C.name into :commonlist separated by ' '
from dictionary.columns C, dictionary.columns D
where C.libname=D.libname
and C.memname='DATASET1'
and D.memname='DATASET2'
and C.name=D.name
;
quit;
For more than two datasets it's more complicated and partially depends on your, but if you're comfortable in SQL you can figure that out pretty easily. A similar construct can create a list of just dataset 1 variables. The important part is the into :commonlist separated by ' ', which says to pull the select results into a macro variable called commonlist, separating rows by space. (The colon says to create a macro variable, not a table.)
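One possible way to extend this to three datasets, offered as a sketch rather than the only approach. It assumes the datasets live in the WORK library and that shared variable names are spelled with the same case in every dataset; the hard-coded 3 is the number of datasets. The macro variable dset1list matches the name used in the data step that follows:

```sas
proc sql noprint;
  /* Variables present in all three datasets: each name must
     appear once per dataset, so its row count must equal 3 */
  select name into :commonlist separated by ' '
  from dictionary.columns
  where libname = 'WORK'
    and memname in ('DATASET1', 'DATASET2', 'DATASET3')
  group by name
  having count(*) = 3;

  /* Variables in DATASET1 that are not in the common list */
  select name into :dset1list separated by ' '
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'DATASET1'
    and name not in
      (select name
       from dictionary.columns
       where libname = 'WORK'
         and memname in ('DATASET1', 'DATASET2', 'DATASET3')
       group by name
       having count(*) = 3);
quit;
```

Repeating the second query with DATASET2 and DATASET3 would give the other per-dataset lists. dictionary.columns stores libname and memname in uppercase, which is why the literals are uppercase.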
So you can then run:
data want (keep=&commonlist.) dset1(keep=&dset1list.) dset2(keep=&dset2list.) dset3(keep=&dset3list.);
set dataset1(in=ds1) dataset2(in=ds2) dataset3(in=ds3);
output want;
if ds1 then output dset1;
else if ds2 then output dset2;
else if ds3 then output dset3;
run;
The in=xyz indicates which dataset a row came from. Each output dataset can have a separate list of variables to keep. You might want to keep the ID variable in those other datasets as well.
I will say that usually in SAS you don't do what you're doing here: it's not easy to do because it doesn't tend to be the best way to handle things - specifically, the little split off datasets. In general you would just keep those extra variables on the master dataset, and they'd just be nulls for anyone not in a dataset with that variable - assuming it makes sense to make this 'master' dataset at all.
I'm trying to keep only the duplicate results for one column in a table. This is what I have.
proc sql;
create table DUPLICATES as
select Address, count(*) as count
from TEST_TABLE
group by Address
having COUNT gt 1
;
quit;
Is there any easier way to do this or an alternative I didn't think of? It seems goofy that I then have to re-join it with the original table to get my answer.
proc sort data=TEST_TABLE;
by Address;
run;
data DUPLICATES;
set TEST_TABLE;
by Address;
if not (first.Address and last.Address) then output;
run;
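If you would rather stay in SQL, a sketch of a single-step alternative follows. It relies on PROC SQL's SAS-specific behavior of remerging summary statistics with the original rows (the log prints a NOTE about remerging), so it is not portable to other SQL dialects:

```sas
proc sql;
  /* select * with group by makes PROC SQL remerge the group
     count back onto every row, so having filters whole rows */
  create table DUPLICATES as
  select *
  from TEST_TABLE
  group by Address
  having count(*) gt 1;
quit;
```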
Using proc sort with nodupkey and dupout will dedupe the data and give you a dupout dataset containing the removed duplicate records, but that dataset does not include EVERY record with the ID variable: it holds only the 2nd, 3rd, 4th...Nth occurrence, not the first one that was kept. So you aren't comparing all the duplicate occurrences of the ID variable when you use this method. It's great when you know what you want to remove and define enough by variables to limit this precisely, or if you know that your records with duplicate IDs are identical in every way and you just want them removed.
When there are duplicates in a raw file I receive, I like to compare all records where ID has more than one occurrence.
proc sort data=test nouniquekeys
uniqueout=singles
out=dups;
by ID;
run;
nouniquekeys deletes unique observations from the "out" DS
uniqueout=dsname stores unique observations
out=dsname stores remaining observations
Again, this method is great for working with messy raw data, and for debugging if your code might have produced duplicates.
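That's easy using proc sort with nodupkey and dupout (note that, without an out= option, this overwrites TEST_TABLE with the deduplicated rows):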
proc sort data=TEST_TABLE nodupkey dupout=dups;
by Address;
run;
Refer to this documentation for further information
select field,count(field) from table
group by field having count(field) > 1
I'm dealing with one data problem in sas.
I have one dataset with 1000 variables and 1000 records for each variable.
I also have a variable list which includes 100 variable names.
I'd like to subset the first dataset to the variables whose names match the variable list.
I tried proc merge and proc sql, but cannot work it out.
Could anyone help me out?
Thanks a lot
SAS keeps or drops variables with the conveniently named keywords 'keep' and 'drop'. PROC SQL can help you generate a list if you don't already have it in text format.
data want;
set have;
keep var1 var2 var3 var4;
run;
If you have the list of variables in dataset "vnames" with the variable "tokeep", you can do this:
proc sql;
select tokeep into :keeplist separated by ' ' from vnames;
quit;
data want;
set have;
keep &keeplist.;
run;
PROC SQL is taking the contents of 'tokeep' and instead of selecting them to a table or the screen, putting them in a space-delimited list inside a macro variable 'keeplist', which then is used as the arguments for the 'keep' statement.
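If some names in the list might not exist in the dataset, one option is to filter the list against dictionary.columns first. This is a sketch that assumes have is in the WORK library and that tokeep is a character variable holding one name per row:

```sas
proc sql noprint;
  /* Keep only the listed names that really exist in WORK.HAVE;
     upcase makes the match case-insensitive */
  select v.tokeep into :keeplist separated by ' '
  from vnames as v
  inner join dictionary.columns as c
    on upcase(v.tokeep) = upcase(c.name)
  where c.libname = 'WORK'
    and c.memname = 'HAVE';
quit;

data want;
  set have;
  keep &keeplist.;
run;
```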
You can output a list of all the variable names of a dataset as another dataset. This makes it much easier to decide which of the big datasets you will use and which you will not (e.g. do a left (or right) join on variable names, then check that the number of rows is at least the number of variables you want to have).
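A minimal sketch of getting the variable names into a dataset with proc contents, assuming the source dataset is called have (varnames is a hypothetical output name):

```sas
/* out= writes one row per variable; noprint suppresses the report */
proc contents data=have out=varnames(keep=name varnum) noprint;
run;

/* proc contents orders rows alphabetically by name;
   sort by varnum to restore the original column order */
proc sort data=varnames;
  by varnum;
run;
```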
First off, I know pretty much nothing about SAS and I am not a programmer but an accountant, but here it goes:
I am trying to compare two data sets to identify differences between them, so I am using the 'proc compare' command as follows:
proc compare data=table1 compare=table2
criterion=.01;
run;
This works fine, but it compares line by line and in order, so if table2 is missing a row half way through, then all entries after that row will be returned as not equal.
How do I ask the comparison to be made based on a variable so that the proc compare finds the value associated with variable X in table 1, and then makes sure that the same variable X in table 2 has the same value?
The ID statement in PROC COMPARE is used to match rows. This code may work for you:
proc compare data=table1 compare=table2 criterion=.01;
id X;
run;
You may need to use PROC SORT to sort the data by X before doing the PROC COMPARE. Refer to the PROC COMPARE documentation for details on the ID statement to determine if you should sort or not.
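A sketch of the sort-then-compare sequence, assuming X identifies rows in both tables (table1_sorted and table2_sorted are hypothetical names, used so the original tables are left untouched):

```sas
/* Sort both tables by the matching variable */
proc sort data=table1 out=table1_sorted;
  by X;
run;

proc sort data=table2 out=table2_sorted;
  by X;
run;

/* With an id statement, rows are matched on X instead of by position,
   so a row missing from one table no longer shifts every later comparison */
proc compare data=table1_sorted compare=table2_sorted criterion=.01;
  id X;
run;
```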
Here is a link to the PROC COMPARE documentation:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a000057814.htm