Keep missing values with nodupkey - sas

I have a dataset in which some of the observations have an identifier, ident, and some do not. I want to create a new dataset in which I have dropped the observations that are duplicates on my ident variable, but kept the observations where ident is missing.
If I simply do a proc sort nodupkey
proc sort nodupkey data=have;
by ident;
run;
Then it also eliminates the observations with missing values. Is there a simple way to do this that does not require breaking the dataset apart (proc sort nodupkey on one part, then reassembling it)?

You have a couple of options when removing duplicates.
First off, dupout=<dataset> on the proc sort statement will send all of the removed duplicates to another dataset, and you can then do something with them if you want. But this is a back-end version of your 'break the dataset' approach, just probably faster, as it only splits off the smaller part.
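For reference, a minimal sketch of that option (the dataset names deduped and dup_records are placeholders, not from the question):
proc sort data=have out=deduped nodupkey dupout=dup_records;
by ident;
run;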
Simpler is to do the dedup yourself.
proc sort data=have;
by ident;
run;
data want;
set have;
by ident;
if (first.ident) or missing(ident);
run;
That keeps the first record for each ident, plus any record with ident missing.


Required ordering for statements and options within SAS procedures

In many cases, one can choose any order for statements and options within SAS procedures.
For instance, as far as statement order is concerned, the two following PROC FREQ steps, in which the order of the BY and the TABLES statements is swapped, are equivalent:
PROC SORT DATA=SASHELP.CLASS OUT=class;
BY Sex;
RUN;
PROC FREQ DATA=class;
BY Sex;
TABLES Age;
RUN;
PROC FREQ DATA=class;
TABLES Age;
BY Sex;
RUN;
In a similar way, as far as option order is concerned, the two following PROC PRINT steps, in which the order of the OBS= and the FIRSTOBS= options is swapped, are equivalent:
PROC PRINT DATA=SASHELP.CLASS (FIRSTOBS=2 OBS=5);
RUN;
PROC PRINT DATA=SASHELP.CLASS (OBS=5 FIRSTOBS=2);
RUN;
But there are some exceptions.
For instance, as far as option placement is concerned, among the two following PROC PRINT steps, which differ only in the location of the NOOBS option, the second one, where NOOBS precedes the parentheses, results in an error, while the first one is correct:
PROC PRINT DATA=SASHELP.CLASS (FIRSTOBS=2 OBS=5) NOOBS;
RUN;
PROC PRINT DATA=SASHELP.CLASS NOOBS (FIRSTOBS=2 OBS=5);
RUN;
Similarly, as far as statement order is concerned, I have occasionally met cases where a certain statement must be placed before other statements, but unfortunately I don't remember in which procedure (probably a statistical one, for duration or multilevel models).
The ordering question within data steps might be seen as a completely different question, because within data steps the order of statements is most of the time a matter of logic. Still, the ordering of some data step statements looks like it is partly a matter of convention, as within procedures. It is for instance the case in the following merge, where the MERGE statement must precede the BY statement, although I suppose SAS could have been designed to understand these statements in either order:
/* to get a simple example of merge I start with artificially cutting the Class dataset in two parts */
PROC SORT DATA=SASHELP.CLASS OUT=class;
BY Name;
RUN;
DATA sex_and_age;
SET class (KEEP=Name Sex Age);
RUN;
DATA height_and_weight;
SET class (KEEP=Name Height Weight);
RUN;
DATA all_variables;
MERGE sex_and_age height_and_weight;
BY Name;
RUN;
Because I am unable to find such a guide, my question is: does there exist a text devoted to the question of the required order of statements and options within SAS procedures?
Joel,
Let me address the NOOBS example to help clarify. The 2 statements:
PROC PRINT DATA=SASHELP.CLASS (FIRSTOBS=2 OBS=5) NOOBS;
PROC PRINT DATA=SASHELP.CLASS NOOBS (FIRSTOBS=2 OBS=5);
The items in parentheses are dataset options, and they affect how the dataset is read. There are a number of them, including KEEP=, DROP=, WHERE=, etc. NOOBS is not a dataset name, so the parenthetical options that follow it in the second statement cause an error. Dataset options must come right after the dataset name.
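As an illustration (the KEEP= list and the WHERE= condition here are made up for SASHELP.CLASS), dataset options sit in parentheses directly after the dataset name, while NOOBS stays outside as a statement option:
PROC PRINT DATA=SASHELP.CLASS (KEEP=Name Sex Age WHERE=(Age > 13)) NOOBS;
RUN;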
The order of statements, in many cases, is important because it sets up the PDV (program data vector). That is why an ATTRIB statement should be at the top of a data step. In some procs it doesn't matter, since the statements are all combined before execution.
data test;
/* ATTRIB at the top defines the new variables in the PDV before they are used */
attrib myNewVar length=$9 format=$20.
myNewVar2 format=date.
;
set sashelp.class;
myNewVar = 'Hey Joel!';
myNewVar2 = '24FEB2020'd;
run;
A parenthetical list of name=value pairs after a data set specifier is known as a list of data set options. Thus you need to be able to anticipate what the SAS parser will be doing.
* (...) applies to SASHELP.CLASS;
PROC PRINT DATA=SASHELP.CLASS (FIRSTOBS=2 OBS=5);
* (...) is where an option name or option name=value is expected -- error ensues;
PROC PRINT DATA=SASHELP.CLASS NOOBS (FIRSTOBS=2 OBS=5);
* (...) applies to SASHELP.CLASS, NOOBS is in a proper option location within the PROC statement;
PROC PRINT NOOBS DATA=SASHELP.CLASS (FIRSTOBS=2 OBS=5);
Any special statement ordering is found in the PROC documentation. Some procs have common syntax and documentation will redirect you.
Your first point appears to be caused by not understanding what dataset options are. Otherwise, the order of the optional parts of a statement (like the PROC PRINT statement) is specified in the documentation for that statement.
On the second point, it appears you are confusing the purpose of the BY statement in a PROC step with the BY statement in a DATA step. In a PROC step the BY statement tells the procedure to process the data in groups. In a DATA step the BY statement must be linked to a specific MERGE/SET/UPDATE statement.
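A minimal sketch of that contrast (the output dataset names and the flag variable are made up for illustration):
PROC SORT DATA=SASHELP.CLASS OUT=class_by_sex;
BY Sex;
RUN;
/* BY in a PROC step: the procedure processes the data in groups */
PROC MEANS DATA=class_by_sex;
BY Sex;
VAR Age;
RUN;
/* BY in a DATA step: tied to the SET statement, and it creates FIRST./LAST. flags */
DATA flagged;
SET class_by_sex;
BY Sex;
is_first_of_sex = first.Sex;
RUN;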

How to create a dataset with variables that are duplicates in SAS

I have removed duplicates from a dataset using the nodupkey feature, but want to compare the deleted duplicates to the first observation that is kept.
proc sort data=matchedfile dupout=deletedduplicate nodupkey
out=dedupedfile;
by ID;
run;
We need a dataset that combines all observations that are duplicated: the removed duplicates in the dupout file and the observations with the same ID in the dedupedfile.
Thanks!
If your issue is that you want the 'not removed' row together with the 'removed' rows, you can use the NOUNIQUEKEY option, added in SAS 9.3. It does the opposite of NODUPKEY: it keeps only the records that are NOT unique and removes the unique records. You can either let those removed unique records be discarded (if you will, separately, run a different query to get them) or use UNIQUEOUT= to put them in a dataset.
proc sort data=have out=dups nouniquekey uniqueout=nodups;
by whatever;
run;
See PROC SORT documentation for more details.

Automating IF and then statement in sas using macro in SAS

I have a dataset with various types of loan descriptions; there are at least 100 of them.
I have to categorise them into various buckets using IF-THEN statements. Please have a look at the code below for reference:
data des;
set desc;
if loan_desc in ('home_loan','auto_loan')then product_summary ='Loan';
if loan_desc in ('Multi') then product_summary='Multi options';
run;
For illustration I have shown it for just two loan descriptions, but I have around 1,000 different loan_desc values that I need to categorise into different buckets.
How can I categorise these loan descriptions into different buckets without writing the product_summary and loan_desc values again and again in the code, which is making it very lengthy and time consuming?
Please help!
Another option for categorizing is using a format. This example uses a manually written VALUE statement, but you can also create a format from a dataset if you have the to/from values in a dataset (see the sketch after the code below). As indicated by @Tom, this allows you to change only the table while the code stays the same for future changes.
One note regarding your current code: you're using IF/THEN rather than IF/ELSE IF. You should use IF/ELSE IF because evaluation then stops as soon as one condition is met, rather than running through all the remaining conditions.
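In your own data step that pattern would look something like this (a minimal sketch of your existing code, with a LENGTH statement added so the longer label is not truncated; this is not yet the format approach):
data des;
length product_summary $13; /* wide enough for 'Multi options' */
set desc;
if loan_desc in ('home_loan','auto_loan') then product_summary='Loan';
else if loan_desc in ('Multi') then product_summary='Multi options';
run;
The format-based version is then: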
proc format;
value $loan_fmt
'home_loan', 'auto_loan' = 'Loan'
'Multi' = 'Multi options';
run;
data want;
set have;
product_summary = put(loan_desc, $loan_fmt.);
run;
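If the to/from values live in a dataset instead, here is a minimal sketch of building the same format through PROC FORMAT's CNTLIN= option (the dataset name and values are assumptions, not taken from the question):
data loan_cntlin;
length fmtname $32 type $1 start $20 label $20;
fmtname = 'loan_fmt'; /* format name; type 'C' makes it the character format $loan_fmt */
type = 'C';
start = 'home_loan'; label = 'Loan'; output;
start = 'auto_loan'; label = 'Loan'; output;
start = 'Multi'; label = 'Multi options'; output;
run;
proc format cntlin=loan_cntlin;
run;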
For a mapping exercise like this, the best technique is to use a mapping table. This is so the mappings can be changed without changing code, among other reasons.
A simple example is shown below:
/* create test data */
data desc (drop=x);
do x=1 to 3;
loan_desc='home_loan'; output;
loan_desc='auto_loan'; output;
loan_desc='Multi'; output;
loan_desc=''; output;
end;
run;
data map;
length loan_desc $9 product_summary $13; /* long enough for the longest values */
loan_desc='home_loan'; product_summary='Loan'; output;
loan_desc='auto_loan'; product_summary='Loan'; output;
loan_desc='Multi'; product_summary='Multi options'; output;
run;
/* perform join */
proc sql;
create table des as
select a.*
,coalescec(b.product_summary,'UNMAPPED') as product_summary
from desc a
left join map b
on a.loan_desc=b.loan_desc;
quit;
There is no need to use the macro language for this task (I have updated the question tag accordingly).
Good solutions have already been proposed (I like @Reeza's proc format solution), but here's another route which also minimizes coding.
Generate sample data
data have;
loan_desc="home_loan"; output;
loan_desc="auto_loan"; output;
loan_desc="Multi"; output;
loan_desc=""; output;
run;
Using PROC SQL's case expression
This form doesn't allow, to my knowledge, several criteria on a single WHEN line, but it really simplifies coding, since the tested variable's name (loan_desc) needs to be written only once.
proc sql;
create table want as
select
loan_desc,
case loan_desc
when "home_loan" then "Loan"
when "auto_loan" then "Loan"
when "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;
Otherwise, using the following syntax is also possible, giving the same results:
proc sql;
create table want as
select
loan_desc,
case
when loan_desc in ("home_loan", "auto_loan") then "Loan"
when loan_desc = "Multi" then "Multi options"
else "Unknown"
end as product_summary
from have;
quit;

Concatenation vs Interleaving

I cannot understand the difference between interleaving and concatenation.
Interleaving
proc sort data=ds1
out=ds1;
by var1;
run;
proc sort data=ds2
out=ds2;
by var1;
run;
data testInterleaving;
set ds1 ds2;
run;
Concatenation
data testConcatenation;
set ds1 ds2;
run;
I tested these and the two resulting datasets contained exactly the same observations; they differed only in the order of observations, which I think does not really matter. So, what is the difference, apart from the order?
Interleaving, as CarolinaJay notes, is combining SET with BY. It is not merging, and it is not just sorting prior to setting.
For example, let's create a pair of datasets, the female and the male members of sashelp.class.
data male female;
set sashelp.class;
if sex='F' then output female;
else output male;
run;
proc sort data=male;
by name;
run;
proc sort data=female;
by name;
run;
data concatenated;
set male female;
run;
data interleaved;
set male female;
by name;
run;
Now, look at the datasets. Concatenated is all of the males, then all of the females: the data step processes the datasets on the SET statement in order, exhausting the first before moving on to the second.
Interleaved is in name order, not grouped by sex. That's because the step traverses the two (in this case) SET datasets by name, keeping track of where it is in the name ordering. You can add debugging statements (either use the DATA step debugger, or add a put _all_; to the data step) to see how it works.
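For instance, a minimal sketch of the PUT-based approach (just the interleaving step from above with one extra statement):
data interleaved;
set male female;
by name;
put _all_; /* writes the current PDV, including automatic variables, to the log on every iteration */
run;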
SAS defines INTERLEAVING as using a BY statement with a SET statement. The linked documentation shows two data sets, sorted by the same variable(s), generating one data set using a BY statement with a SET statement.
The data steps at the end are exactly the same. You are running the same code; sorting beforehand does not change what the data step does.
What I think you meant for the interleaving is:
data testInterleaving ;
MERGE ds1 ds2;
by var1;
run;
The set statement reads sequentially through the data sets in the order you list them. The merge statement compares records between the sets and puts them into the output in the order of the variable(s) in the by statement. I recommend looking at the SAS documentation on the merge statement as this is a very simplistic explanation for a very powerful tool.

Keeping only the duplicates

I'm trying to keep only the duplicate results for one column in a table. This is what I have.
proc sql;
create table DUPLICATES as
select Address, count(*) as count
from TEST_TABLE
group by Address
having COUNT gt 1
;
quit;
Is there any easier way to do this or an alternative I didn't think of? It seems goofy that I then have to re-join it with the original table to get my answer.
proc sort data=TEST_TABLE;
by Address;
run;
data DUPLICATES;
set TEST_TABLE;
by Address;
if not (first.Address and last.Address) then output;
run;
Using proc sort with nodupkey and dupout will dedupe the data and give you an "out" dataset with duplicate records from the original dataset, but the "out" dataset does not include EVERY record with the ID variable - it gives you the 2nd, 3rd, 4th...Nth. So you aren't comparing all the duplicate occurrences of the ID variable when you use this method. It's great when you know what you want to remove and define enough by variables to limit this precisely, or if you know that your records with duplicate IDs are identical in every way and you just want them removed.
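For example (a minimal sketch with made-up dataset names), after
proc sort data=have out=clean nodupkey dupout=extras;
by ID;
run;
the extras dataset holds only the second and later occurrences of each ID; the first occurrence of each ID stays in clean.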
When there are duplicates in a raw file I receive, I like to compare all records where ID has more than one occurrence.
proc sort data=test nouniquekey
uniqueout=singles
out=dups;
by ID;
run;
nouniquekey deletes unique observations from the "out" dataset
uniqueout=dsname stores the unique observations
out=dsname stores the remaining (duplicated) observations
Again, this method is great for working with messy raw data, and for debugging if your code might have produced duplicates.
That's easy using PROC SORT with the NODUPKEY and DUPOUT= options:
proc sort data=TEST_TABLE nodupkey dupout=dups;
by Address;
run;
Refer to the PROC SORT documentation for further information.
select field, count(field)
from table
group by field
having count(field) > 1