Efficient way to select unique records in SAS

The dataset looks like this:
Code Type Rating
0001 NULL 1
0002 NULL 1
0003 NULL 1
0003 PA 1 3
0004 NULL 1
0004 PB 1 2
0005 AC 1 3
0005 NULL 6
0006 AC 1 2
I want the output dataset to look like this:
Code Type Rating
0001 NULL 1
0002 NULL 1
0003 PA 1 4
0004 PB 1 3
0005 AC 1 9
0006 AC 1 2
For each Code, Type has at most two values. I want one record per Code, with Rating summed. The complication is Type: if a Code has only one Type value, that value should pass to the output dataset. If it has two values (one of which has to be NULL), the value not equal to NULL should pass to the output dataset.
The total number of observations is N > 100,000,000, so is there an efficient way to achieve this?

If the data is sorted as per your example, then you can achieve this in a single data step. I've assumed that the NULL values are actually missing; if not, change [if missing(type)] to [if type='NULL']. The step sums the Rating values for each Code, then outputs the last record, keeping the non-null Type. If your data isn't sorted or indexed on Code then you'll need to sort it first, which will add quite a bit to the execution time.
/* create input file */
data have;
infile datalines dsd; /* INFILE must execute before INPUT so that DSD applies */
input Code Type $ Rating;
datalines;
0001,,1
0002,,1
0003,,1
0003,PA 1,3
0004,,1
0004,PB 1,2
0005,AC 1,3
0005,,6
0006,AC 1,2
;
run;
/* create summarised dataset */
data want;
set have;
by code;
retain _type; /* temporary variable */
if first.code then do;
_type = type;
_rating_sum = 0; /* reset sum */
end;
_rating_sum + rating; /* sum rating per Code */
if last.code then do;
if missing(type) then type = _type; /* pick non-null value */
rating = _rating_sum; /* insert sum */
output;
end;
drop _type _rating_sum; /* temporary variables are not needed in the output */
run;
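If the data is not already in Code order, the sort mentioned above could look something like the following minimal sketch. The output name have_sorted is hypothetical; the summarising data step would then read have_sorted instead of have.

```sas
/* Sort by Code so that BY-group processing (first./last.) works */
proc sort data=have out=have_sorted;
  by code;
run;
```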

Given the comments, another possibility presents itself: the hash solution. It is memory-constrained, so it may or may not be able to work with the actual data (each hash entry is small, but 100M rows might imply 60 or 70M entries in the hash table, and at 40 or 50 bytes each that would still be pretty big).
This is almost certainly inferior to the plain data step method if the dataset is sorted by code, so it should only be used on unsorted data.
The concepts:
Create hash table keyed on code
If incoming record is new, add to hash table
If incoming record is not a new code, take the retrieved value and sum the rating. Check to see if type needs to be replaced.
Output to dataset.
Code:
data _null_;
if _n_=1 then do;
if 0 then set have;
declare hash h(ordered:'a');
h.defineKey('code');
h.defineData('code','type','rating');
h.defineDone();
end;
set have(rename=(type=type_in rating=rating_in)) end=eof;
rc_1 = h.find();
if rc_1 eq 0 then do;
if type ne type_in and type='NULL' then type=type_in;
rating=sum(rating,rating_in);
h.replace();
end;
else do;
type=type_in;
rating=rating_in;
h.add();
end;
if eof then do;
h.output(dataset:'want');
end;
run;

It's pretty easy to do in one SQL step as well. Just use a CASE...WHEN...END to remove the NULLs and a MAX to then get the non-null value.
data have;
input @1 Code 4.
@9 Type $4.
@19 Rating 1.;
datalines;
0001    NULL      1
0002    NULL      1
0003    NULL      1
0003    PA 1      3
0004    NULL      1
0004    PB 1      2
0005    AC 1      3
0005    NULL      6
0006    AC 1      2
;;;;
run;
proc sql;
create table want as
select code,
max(case type when 'NULL' then '' else type end) as type,
sum(Rating) as rating
from have
group by code;
quit;
If you want the NULLs back, then you need to wrap the select in a select code, case type when ' ' then 'NULL' else type end as type, rating from ( ... );, though I would suggest leaving them blank.
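As a rough sketch of that wrapped form, assuming the have table from this answer (where the placeholder is stored as the literal string 'NULL'), the complete query might look like this. The output name want_null is hypothetical.

```sas
proc sql;
  create table want_null as
  select code,
         /* turn the blank produced by the inner query back into 'NULL' */
         case type when ' ' then 'NULL' else type end as type,
         rating
  from (select code,
               max(case type when 'NULL' then '' else type end) as type,
               sum(rating) as rating
        from have
        group by code);
quit;
```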

Related

Select an observation if it has another within 24 hours of it

I am trying to create a table that only populates entries of a contact to a customer at a business number if they were NOT first contacted at a home number within 24 hours prior to the attempt at the business number.
So if I have
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
I want to be able to get
1 20MAY2018:06:24:28 B
2 24MAY2018:06:24:28 B
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
I have tried adding a count to the ID but I'm not sure how I'd go about using that, or if there's a way to use a subquery within a proc sql to create a count of observations that have more than one in a 24 hour period.
So, your approach will work, but will be quite messy with large numbers - as you're doing a cartesian join within ID. If each ID has few records it's not so bad, but if each ID has many records you make a lot of connections.
Fortunately, there's an easy way to do this in SAS!
data want;
do _n_ = 1 by 1 until (last.id); *for each ID:;
set have;
by id;
if first.id then last_home=0; *initialize last_home to 0;
if type='H' then last_home = record; *if it is a home then save it aside;
if type='B' and intck('Hour',last_home,record,'c') gt 24 then output; *if it is business then check if 24 hours have passed;
end;
format last_home datetime.;
run;
A few notes:
I use a DoW loop, but that really isn't mandatory, I just like it from a clarity perspective (it makes it clear I'm doing something at an ID-repetition level). You could remove that loop and add a RETAIN for last_home and it would be the same.
I use INTCK instead of INTNX - again this is for clarity, your INTNX is fine too, but INTCK just does the comparison, while INTNX is for advancing dates by an amount. I use the one that matches what I am trying to do, so someone reading the code can see easily what I'm doing.
This will be much faster than SQL on larger datasets, if for no other reason than it only passes the data once. SQL will necessarily do it multiple times, even if you don't separate HAVEA/HAVEB and do that within the SQL query.
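For reference, the RETAIN variant mentioned in the first note might look like the sketch below. It assumes have is sorted by id and that type is a character variable; the output name want_retain is hypothetical.

```sas
data want_retain;
  set have;
  by id;
  retain last_home;                      /* carry the last home call across rows */
  if first.id then last_home = 0;        /* reset at the start of each ID */
  if type = 'H' then last_home = record; /* remember the most recent home call */
  /* output a business call only when more than 24 hours have passed */
  if type = 'B' and intck('hour', last_home, record, 'c') gt 24 then output;
  format last_home datetime.;
run;
```

The logic is the same as in the DoW version; only the mechanism for keeping last_home between observations differs.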
I believe I figured it out!
I have HAVEA and HAVEB tables hosting type H and type B entries respectively.
Then I ran the following PROC SQL's.
PROC SQL;
CREATE TABLE WANTA AS
SELECT A.RECORD AS PREVIOUS_CALL, B.* FROM HAVEB B
JOIN HAVEA A ON (B.ID=A.ID AND A.RECORD LE B.RECORD);
CREATE TABLE WANTB AS
SELECT * FROM WANTA
GROUP BY ID, RECORD
HAVING PREVIOUS_CALL = MAX(PREVIOUS_CALL);
CREATE TABLE WANTC AS
SELECT * FROM WANTB
WHERE INTNX('HOUR',RECORD,-24,'SAME') GT PREVIOUS_CALL;
QUIT;
Please let me know if this is not a sustainable answer for larger sums of data or if there is a much better method of approaching this.
You can perform the selection and get the final result set without creating intermediate tables. Here are two alternatives:
First way
Similar to your 'figuring it out'. A reflexive join with grouping finds the "to_business" calls whose most recent prior "to_home" call did NOT occur within the preceding 24 hours (86,400 seconds).
proc sql;
create table want as
select distinct
business.*
from have as business
join have as home
on business.id = home.id
& business.type = 'B'
& home.type = 'H'
& home.record < business.record
group by
business.id, business.record
having
max(home.record) < business.record - 86400
;
quit;
Second way
Perform a NOT EXISTS check for every to_business call: keep the call only when no to_home call occurred in the prior 24 hours.
proc sql;
create table want2 as
select
business.*
from
have as business
where
business.type = 'B'
and
not exists (
select * from have as home
where home.id = business.id
and home.type = 'H'
and home.record < business.record
and home.record >= business.record - 86400
)
;
quit;
A HASH solution does have some dependencies (amount of data and RAM)...but it is another alternative
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
/* Keep only HOME TYPE records and
rename RECORD for use in the comparison */
Data HOME(Keep=ID RECORD rename=(record=hrecord));
Set HAVE(where=(Type="H"));
Run;
Data WANT(Keep=ID RECORD TYPE);
/* Use only BUSINESS TYPE records */
Set HAVE(where=(Type="B"));
/* Set up HASH object */
If _N_=1 Then Do;
/* Multidata:YES for looping through
all successful FINDs */
Declare HASH HOME(dataset:"HOME", multidata:'yes');
home.DEFINEKEY('id');
home.DEFINEDATA('hrecord');
home.DEFINEDONE();
/* To prevent warnings in the log */
Call Missing(HRECORD);
End;
/* Assume the BUSINESS record is good until a HOME call
in the prior 24 hours is found */
Good_For_Output=1;
/* FIND first KEY match */
rc=home.FIND();
/* Successful FINDs result in RC=0 */
Do While (RC=0);
/* A HOME call up to 86,400 seconds before this BUSINESS call
disqualifies the record; later HOME calls are ignored */
If 0 < record - hrecord <= 86400 Then Good_For_Output=0;
/* Keep comparing HOME/BUS for all HOME records */
rc=home.FIND_NEXT();
End;
If Good_For_Output=1 Then Output;
Run;

How to recode values of a variable based on the maximum value in the variable, for hundreds of variables?

I want to recode each variable so that its max value becomes 1 and every other value becomes 0. For each variable, there may be multiple observations with the max value. The max value for each variable is not fixed, i.e. it may change from cycle to cycle. And there are hundreds of variables, so I cannot "hard-code" anything.
The final product would have the same dimensions as the original table, i.e. the same number of rows and columns, as a matrix of 0s and 1s.
This is within SAS. I attempted to calculate the max of each variable and append these maxima as a new observation to the data, then compare each variable down its column against the "max" observation. Looking into examples of the following did not help:
SQL
Array in datastep
proc transpose
formatting
Any insight would be much appreciated.
Here is a version done with SQL:
The idea is to first calculate the maximum of each variable (the inner select, aliased b). We then join that back to the original data, and the CASE expression in the outer select sets the flag to 1 when the value equals the maximum and 0 otherwise.
data begin;
input var value;
cards;
1 1
1 2
1 3
1 2.5
1 1.7
1 3
2 34
2 33
2 33
2 33.7
2 34
2 34
; run;
proc sql;
create table result as
select a.var, a.value, case when a.value = b.maximum then 1 else 0 end as is_max from
(select * from begin) a
left join
(select max(value) as maximum, var from begin group by var) b
on a.var = b.var
;
quit;
To avoid "hard-code" you need to use some code generation.
First let's figure out what code you could use to solve the problem. Later we can look into ways to generate that code.
It is probably easiest to do this with PROC SQL code. SAS will allow you to reference the MAX() value of a variable. Also note that SAS evaluates boolean expressions to 1 (TRUE) or 0 (FALSE). So you just want to generate code like:
proc sql;
create table want as
select var1=max(var1) as var1
, var2=max(var2) as var2
from have
;
quit;
To generate the code you need a list of the variables in your source dataset. You can get those with PROC CONTENTS but also with the metadata table (view) DICTIONARY.COLUMNS (also accessible as SASHELP.VCOLUMN from outside PROC SQL).
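As a small sketch of the PROC CONTENTS route, assuming the source table is WORK.HAVE, the variable list can be written to a dataset instead of the listing. The output name varinfo is hypothetical.

```sas
/* Write the variable metadata of WORK.HAVE to a dataset */
proc contents data=work.have
              out=varinfo(keep=name type varnum)
              noprint;
run;
```

In the resulting dataset, TYPE is 1 for numeric variables and 2 for character variables, which is useful if only numeric columns should be recoded.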
If the list of variables is small then you could generate the code into a single macro variable.
proc sql noprint;
select catx(' ',cats(name,'=max(',name,')'),'as',name)
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='HAVE'
order by varnum
;
create table want as
select &varlist
from have
;
quit;
The maximum number of characters that will fit into a macro variable is 64K. That is long enough for about 2,000 variables with names of 8 characters each.
Here is a slightly more complex way that uses PROC SUMMARY and a data step with a temporary array. It does not need any code generation.
%let dsin=sashelp.class(obs=10);
%let dsout=want;
%let varlist=_numeric_;
proc summary data=&dsin nway ;
var &varlist;
output out=summary(drop=_type_ _freq_) max= ;
run;
data &dsout;
if 0 then set &dsin;
array vars &varlist;
array max [10000] _temporary_;
if _n_=1 then do;
set summary ;
do _n_=1 to dim(vars);
max[_n_]=vars[_n_];
end;
end;
set &dsin;
do _n_=1 to dim(vars);
vars[_n_]=vars[_n_]=max[_n_];
end;
run;
Results:
Obs Name Sex Age Height Weight
1 Alfred M 0 1 1
2 Alice F 0 0 0
3 Barbara F 0 0 0
4 Carol F 0 0 0
5 Henry M 0 0 0
6 James M 0 0 0
7 Jane F 0 0 0
8 Janet F 1 0 1
9 Jeffrey M 0 0 0
10 John M 0 0 0

SAS 9.4 Replacing all values after current line based on current values

I am matching files based on ID numbers. I need to format a data set with the IDs to be matched, so that the same ID number is not repeated in column a (because column b's ID is the surviving ID after the match is completed). My list of IDs has over 1 million observations, and the same ID may be repeated multiple times in either or both columns.
Here is an example of what I've got/need:
Sample Data
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
The surviving IDs would be:
2
4
5
error - 1 no longer exists
error - 1 no longer exists
8
WHAT I NEED
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I am, probably very obviously, a SAS novice, but here is what I have tried, re-running it over and over again because some IDs are repeated 50 times or more.
Proc sort data=Have;
by ID1;
run;
This sort makes the repeated ID1 values consecutive, so that I could use LAG to replace the destroyed ID1s with the surviving ID2 from the line above.
Data Want;
set Have;
by ID1;
lagID1=LAG(ID1);
lagID2=LAG(ID2);
If NOT first.ID1 THEN DO;
If ID1=lagID1 THEN ID1=lagID2;
KEEP ID1 ID2;
IF ID1=ID2 then delete;
end;
run;
That sort of works, but I still end up with some duplicates that won't resolve no matter how many times I run it (I would have looped it, but I don't know how), because they just switch back and forth between IDs that have other duplicates (I can get down to about 2,000 of these).
I have figured out that instead of using LAG, I need to replace all values after the current line with ID2 for each ID1 value, but I cannot figure out how to do that.
I want to read observation 1, find all later instances of the value of ID1, in either the ID1 or ID2 column, and replace that value with the current observation's ID2 value. Then I want to repeat that process with line 2 and so on.
For the example, I would want to look for any instances after line one of the value 1, and replace them with 2, since that is the surviving ID of that pair - 1 may appear further down multiple times in either of the columns, and I need all of them to be replaced. Line two would look for later values of 3 and replace them with 4, and so on. The end result should be that an ID number only ever appears once in the ID1 column (though it may appear multiple times in the ID2 column).
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
After first line has been read, data set would look as follows:
ID1 ID2
1 2
3 4
2 5
6 2
2 7
5 8
Reading observation two would make no changes since 3 does not appear again; after observation 3, the set would be:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
5 8
Again, there would be no changes from observation four, but observation 5 would cause the final change:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I have tried using the following statement but I can't even tell if I am on the complete wrong track or if I just can't get the syntax figured out.
Data want;
Set have;
Do i=_n_;
ID=ID2;
Replace next var{EUID} where (EUID1=EUID1 AND EUID2=EUID1);
End;
Run;
Thanks for your help!
There is no need to work back and forth through the data file. You just need to retain the replacement information so that you can process the file in a single pass.
One way to do that is to make a temporary array using the values of the ID variables as the index. That is easy to do for your simple example with small ID values.
So for example if all of the ID values are integers between 1 and 1000 then this step will do the job.
data want ;
set have ;
array xx (1000) _temporary_;
do while (not missing(xx(id1))); id1=xx(id1); end;
do while (not missing(xx(id2))); id2=xx(id2); end;
output;
xx(id1)=id2;
run;
You probably need to add a test to prevent cycles (1 -> 2 -> 1).
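One possible form of that test, as a sketch only: after both IDs have been resolved to the end of their chains, skip any pair whose two IDs are now equal, since storing it would make an ID point to itself and the next lookup would loop forever. The dataset names have_cycle and want_cycle and the sample rows are hypothetical; the 1000-element array bound is carried over from the step above.

```sas
/* Hypothetical sample containing a cycle: 1 -> 2, then 2 -> 1 */
data have_cycle;
  input id1 id2;
datalines;
1 2
2 1
3 1
;
run;

data want_cycle;
  set have_cycle;
  array xx (1000) _temporary_;
  /* follow each ID to the end of its replacement chain */
  do while (not missing(xx(id1))); id1 = xx(id1); end;
  do while (not missing(xx(id2))); id2 = xx(id2); end;
  /* both IDs are now chain ends; equal IDs would create a self-loop */
  if id1 ne id2 then do;
    output;
    xx(id1) = id2;
  end;
run;
```

Because a mapping is only ever stored from one chain end to a different chain end, no chain can loop back on itself. Pairs that collapse to the same ID are dropped here; writing them to a separate dataset for review would be an equally reasonable choice.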
For a more general solution you should replace the array with a hash object instead. So something like this:
data want ;
if _n_=1 then do;
declare hash h();
h.definekey('old');
h.definedata('new');
h.definedone();
call missing(new,old);
end;
set have ;
do while (not h.find(key:id1)); id1=new; end;
do while (not h.find(key:id2)); id2=new; end;
output;
h.add(key: id1,data: id2);
drop old new;
run;
Here's an implementation of the algorithm you've suggested, using a modify statement to load and rewrite each row one at a time. It works with your trivial example but with messier data you might get duplicate values in ID1.
data have;
input ID1 ID2 ;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
title "Before making replacements";
proc print data = have;
run;
/*Optional - should improve performance at cost of increased memory usage*/
sasfile have load;
data have;
do i = 1 to nobs;
do j = i to nobs;
modify have point = j nobs = nobs;
/* Make copies of target and replacement value for this pass */
if j = i then do;
id1_ = id1;
id2_ = id2;
end;
else do;
flag = 0; /* Keep track of whether we made a change */
if id1 = id1_ then do;
id1 = id2_;
flag = 1;
end;
if id2 = id1_ then do;
id2 = id2_;
flag = 1;
end;
if flag then replace; /* Only rewrite the row if we made a change */
end;
end;
end;
stop;
run;
sasfile have close;
title "After making replacements";
proc print data = have;
run;
Please bear in mind that as this modifies the dataset in place, interrupting the data step while it is running could result in data loss. Make sure you have a backup first in case you need to roll your changes back.
Seems like this should do the trick and is fairly straight forward. Let me know if it is what you are looking for:
data have;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test();
proc sql noprint;
select count(*) into: cnt
from have;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have
where monotonic() = &i;quit;
data have;
set have;
if (_n_ > input("&i",8.))then do;
if (id1 = input("&id1",8.))then id1 = input("&id2",8.);
if (id2 = input("&id1",8.))then id2 = input("&id2",8.);
end;
run;
%end;
%mend test;
%test();
This might be a little faster:
data have2;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test2();
proc sql noprint;
select count(*) into: cnt
from have2;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have2
where monotonic() = &i;
update have2 set id1 = &id2
where monotonic() > &i
and id1 = &id1;
quit;
proc sql noprint;
update have2 set id2 = &id2
where monotonic() > &i
and id2 = &id1;
quit;
%end;
%mend test2;
%test2();

Delete a group if none of its observations contains a certain value in SAS

I want to delete every group in which none of the observations has NUM=14.
So something like this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the observations with ID=2 has NUM=14, I delete group 2.
And it should look like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The sub query selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
The equivalent data step approach would be:
/* Get all the groups that contain an observation where NUM = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicates values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both data steps a subsetting IF statement is used to output observations only when the condition is met. In the second case, k is a temporary variable with value 1 (true) when a value is read from keepGroups and 0 (false) otherwise.
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (Assuming the DATA/SET names are mistyped and not actually wrong in your program) is the first data step doesn't append that 100 to every row - only to the 14 row. What you need is one 'line' per ID value with a keep/no keep decision.
You can either do this by doing your first data step, but RETAIN score, and only output one row per ID. Your code would actually work, based on 14 being the first row, if you just fixed your data/set typo; but it only works when 14 is the first row.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
data has_fourteen;
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
run;
data newdata;
merge originaldat has_fourteen(keep=id keep); /* bring in only the flag so NUM is not overwritten */
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice; for each ID, it iterates once through all records, looking for a 14, if it finds one then setting keep to 1. Then it reads all records again for that ID, and keeps if keep=1. Then it goes on to the next set of records by ID.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find out the list of groups which contains num=14, use below SQL */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only groups containing num=14 then use following data step */
data out;
set in;
where id in (&lst.);
run;

SAS: select several observations with same identifier based on a condition true for just one of them

I have a dataset with an identifier (let us call it ident), several observations for each identifier, and a categorical variable var that can take several values, among them 1.
How do I keep all observations sharing an identifier if var=var1 holds for at least one of them?
For instance, with
data Test;
input identifier var;
datalines;
1023 1
1023 3
1023 5
1064 2
1064 3
1098 1
1098 1
;
Then I want to keep
1023 1
1023 3
1023 5
1098 1
1098 1
Here's the one pass solution that works for any arbitrary value. (It is a one pass solution as long as your BY group is small enough to fit into memory, which usually is the case).
%let var1=3;
data want;
do _n_ = 1 by 1 until (last.identifier);
set test;
by identifier;
if var=&var1. then keepflag=1;
end;
do _n_ = 1 by 1 until (last.identifier);
set test;
by identifier;
if keepflag then output;
end;
run;
That's going through the by group once, setting keepflag=1 if any row in the by group is equal to the value, then keeping all rows from that by group. Buffering will mean this doesn't reread the data twice as long as the by group fits into memory.
The easiest way I can think of is to create a table of the identifier and then join back to it.
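data temp_ID;
set TEST;
where var = 1;
keep identifier;
run;
/* Remove duplicate identifiers so the join does not multiply rows */
proc sort data=temp_ID nodupkey;
by identifier;
run;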
proc sql;
create table output_data as select
b.*
from temp_ID a
left join TEST b
on a.identifier=b.identifier;
quit;
Assuming your data is already sorted by identifier and var, you can do this with one pass. You can tell at the first line whether or not that identifier should be output.
data want (drop=keeper);
set test;
by identifier;
length keeper 3;
retain keeper;
if first.identifier then do;
if var = 1 then keeper = 1;
else keeper= 0;
end;
if keeper = 1 then output;
run;