Suppose i have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all names that are not unique in the table:
Name Age
Pop 5
Yoy 6
At the moment, my solution is to make a new table with a count for each name:
Name Count
Bob 2
Pop 1
Yoy 1
And then drop every name whose Count is greater than 1.
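Roughly, that two-step approach looks like this (just a sketch; the dataset names have, counts, and want are placeholders):
proc sql;
/* Step 1: count how many times each name occurs */
create table counts as
select name, count(*) as n
from have
group by name;
/* Step 2: keep only the rows whose name occurs exactly once */
create table want as
select a.name, a.age
from have as a
inner join counts as b
on a.name = b.name
where b.n = 1;
quit;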
I believe there are much more elegant solutions.
If I understand you correctly, there are two ways to do it:
The SQL Procedure
In SAS you may not need a summarisation function such as MIN() as I have used here, since when there is only one row per name, min(age) = age anyway; but when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above; use whichever one you find more readable.
You can use proc sort with the nouniquekey option. Then use uniqueout= to output the unique values and out= to output the duplicates (the out= option is necessary if you don't want to overwrite your original dataset).
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;
Related
Dataset HAVE includes id values and a character variable, names. Values in names are usually missing. If names is non-missing for exactly one observation of an id, the observations with missing names for that id can be deleted. If names is missing for every observation of a given id (like id = 2 or 5 below), one record for that id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can use SQL here easily; MAX() works on character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;
I have a dataset with many columns like this:
ID Indicator Name C1 C2 C3....C90
A 0001 Black 0 1 1.....0
B 0001 Blue 1 0 0.....1
B 0002 Blue 1 0 0.....1
Some of the IDs are duplicated because the indicator differs, but they're essentially the same record. To find duplicates, I want to select distinct ID, Name, and then C1 through C90, because some claims that have the same ID and indicator have different C1...C90 values.
Is there a way to select c1...c90 either through proc sql or a SAS data step? The only way I can think of is to set the dataset and then drop the non-essential columns, but in the actual dataset it's not only Indicator but at least 15 other columns.
It would be nice if PROC SQL supported the : variable-name wildcard like other procs do. When no other alternative is reasonable, I usually use a macro to select bulk columns. This might work for you:
%macro sel_C(n);
%do i=1 %to %eval(&n.-1);
C&i.,
%end;
C&n.
%mend sel_C;
proc sql;
select ID,
Indicator,
Name,
%sel_C(90)
from have_data;
quit;
If I understand the question properly, the easiest way would be to concatenate the columns into one. RETAIN that value from row to row, and you can compare it across rows to see whether it's the same or not.
data want;
set have;
by id indicator;
retain last_cols;
length last_cols $500;
cols = catx('|',of c1-c90);
if first.id then call missing(last_cols);
else do;
identical = (cols = last_cols); *or whatever check you need to perform;
end;
output;
last_cols = cols;
run;
There are a few different ways you can do this, and it will be much easier if the actual column names are C1-C90. If you're just looking to remove anything you know is a duplicate, you can use proc sort.
proc sort data=dups out=nodups nodupkey;
by ID Name C1-C90;
run;
The nodupkey option automatically removes any observations that are duplicates with respect to the variables in the by statement.
Alternatively, if you want to know which records contain duplicates, you could use proc summary.
proc summary data=dups nway missing;
class ID Name C1-C90;
output out=onlydups(where=(_freq_ > 1));
run;
proc summary creates two new variables, _type_ and _freq_. If you specify _freq_ > 1, only the duplicated records are output. Also note that this will drop the Indicator variable.
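If you still need Indicator in the output, an id statement may do the trick (a sketch: proc summary's id statement keeps the maximum Indicator value per class group, so this assumes any one value per group is acceptable):
proc summary data=dups nway missing;
class ID Name C1-C90;
id Indicator; /* carries one Indicator value (the maximum) per group */
output out=onlydups(where=(_freq_ > 1));
run;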
I want to add an auto-increment column to a table in SAS. The following code adds a column but does not increment the value.
Thanks In Advance.
proc sql;
alter table pmt.W_cur_qtr_recoveries
add ID integer;
quit;
Wow, going to try for my second "SAS doesn't do that" answer this morning. Risky stuff.
A SAS dataset cannot define an auto-increment column. Whether you are creating a new dataset or inserting records into an existing dataset, you are responsible for creating any increment counters (i.e. they are just normal numeric variables whose values you set yourself).
That said, there are DATA step statements such as the sum statement (e.g. MyCounter+1) that make it easier to implement counters. If you describe your problem in more detail, people could suggest some alternatives.
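For example, a minimal sketch of such a counter (the dataset names have and want are placeholders):
data want;
set have;
MyCounter + 1; /* sum statement: MyCounter is retained across iterations and starts at 0 */
run;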
The correct answer at this time is to create the ID yourself, BUT the discussion wouldn't be complete without mentioning that there is an unsupported SQL function, MONOTONIC(), that can do what you want. It's not reliable, yet it persists.
The code pattern for its usage is
select monotonic() as ID, ....
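Spelled out a little more fully (a sketch; the dataset names are placeholders, and because monotonic() is unsupported its numbering is not guaranteed, particularly with WHERE clauses or multi-threaded processing):
proc sql;
create table want as
select monotonic() as ID, a.*
from have as a;
quit;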
Use the _N_ automatic variable in a data step, like this:
DATA TEMPLIB.my_dataset (label="my dataset with auto increment variables");
SET TEMPREP.my_dataset;
sas_incr_num = _N_; * add an auto increment 'sas_incr_num' variable;
sas_incr_cat = cat("AB.",cats(repeat("0",5-ceil(log10(sas_incr_num+1))),sas_incr_num),".YZ"); * auto increment the sas_incr_num variable and add 5 leading zeros and concatenate strings on either end;
LABEL
sas_incr_num="auto number each row"
sas_incr_cat="auto number each row, leading zeros, and add strings along for fun"
...
There is no such thing as an auto increment column in a SAS dataset. You can use a data step to create a new dataset that has the new variable. You can use the same name to have it replace the old one when done.
data pmt.W_cur_qtr_recoveries;
set pmt.W_cur_qtr_recoveries;
ID+1;
run;
It really depends on what your intended outcome is, but I have thrown together an example of how you may want to tackle this. It is a little rough, but it gives you something to work from.
/*JUST SETTING UP THE DAY ONE DATA WITH AN ID ATTACHED
YOU WOULD MAKE THE FIRST RUN EXECUTE DIFFERENTLY TO SUBSEQUENT RUNS BY USING THE EXISTS FUNCTION AND MACRO LANGUAGE,
BUT I WILL LET YOU INVESTIGATE THIS FURTHER AS IT MAY BE IRRELEVANT.*/
DATA DAY1;
SET SASHELP.CLASS;
ID+1;
RUN;
/*ON DAY 2 WE ARE APPENDING ADDITIONAL RECORDS TO THE EXISTING DATASET*/
DATA DAY2;
/*APPEND DATASETS*/
SET DAY1 SASHELP.CLASS;
/*HOLD VALUE IN PROGRAM DATA VECTOR (PDV) UNTIL EXPLICITLY CHANGED*/
RETAIN _ID;
/*ADD VARIABLE _ID AND POPULATE WITH ID. IN DOING THIS THE LAST INSTANCE OF THE ID WILL BE HELD IN THE PDV FOR THE
FIRST OF THE NEW RECORDS*/
IF ID ~= . THEN _ID = ID;
/*INCREMENT THE VALUE IN _ID BY 1 AND DO SO FOR EACH RECORD ADDED*/
ELSE DO;
_ID+1;
END;
/*DROP THE ORIGINAL ID;*/
DROP ID;
/*RENAME _ID TO ID*/
RENAME _ID = ID;
RUN;
where "W_prv_qtr_recoveries" is a table Name and "pmt" is a library name.
Thanks to user2337871.
DATA pmt.W_prv_qtr_recoveries;
SET pmt.W_prv_qtr_recoveries;
RETAIN _ID;
IF ID ~= . THEN _ID = ID;
ELSE DO;
_ID+1;
END;
DROP ID;
RENAME _ID = ID;
RUN;
Assuming that this auto-increment column will be used for every record that is inserted, we can accomplish the same as follows:
We will first check the latest key in the dataset
PROC SQL;
SELECT MAX(KEY) INTO :MK FROM MYDATA;
QUIT;
%put KeyOld=&MK;
Then we increment this key
Data _NULL_;
call symput('KeyNew',&MK+1);
run;
%put KeyNew=&KeyNew;
Here we hold the new record that we want to insert and add the corresponding key:
Data TEMP1;
set TEMP;
Key=&KeyNew;
run;
Finally we load the new record in our dataset
PROC APPEND BASE=MYDATA DATA=TEMP1 FORCE;
RUN;
I have to join 2 tables on a key (say XYZ). I have to update a single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A: contains some 100 columns; key column ABC.
TABLE B: contains just 2 columns; key column ABC and status_cd.
TABLE A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL, without creating a new column in the PROC SQL CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options so that you can still use the a.* wildcard in the select statement. Note that the order of the columns could change when doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
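For example, a rough sketch of explicit pass-through to SQL Server (the DSN, table and column names are placeholders, and the UPDATE ... FROM syntax here belongs to the target database, not PROC SQL):
proc sql;
connect to odbc (dsn="my_sqlserver_dsn"); /* hypothetical connection */
execute (
update a
set a.status_cd = coalesce(a.status_cd, b.status_cd)
from tableA a
inner join tableB b
on a.abc = b.abc
) by odbc;
disconnect from odbc;
quit;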
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.
I have a table with postings by category (a number) that I transposed. I got a table where each column is named _number, for example _16, _881, _853, etc. (they aren't in order).
I need to sum all of them in a proc sql, but I don't want to create the variable in a data step, and I don't want to write out all of the column names either. I tried this, but it doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going from the first number to the last, and also from the number corresponding to the first position to the one corresponding to the last position. It gives me a number that is not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed;
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this, but from my experience I believe SAS will assume that any sum() call in SQL is the SQL aggregate function, unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the way arguments are passed in. In the example below, the inner sum() has 3 arguments passed to it, so SAS treats it as the SAS sum() function (the SQL aggregate only allows a single argument). The result of the SAS function is then passed as the single argument to the SQL aggregate sum() function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
Obs    Sex    sum_height_and_weight
  1     F                    1356.3
  2     M                    1728.6
Also note a trick I've used in the code: passing 0 to the SAS function is an easy way to add an extra argument without changing the intended result. Depending on your data, you may want to swap the 0 for a null value (i.e. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario, i.e. matching fields that begin with an underscore instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.