Proc SQL left join with aggregate function omitting zeroes - sas

I have two data sets in SAS 9.4. One is type_list, which contains 14 different account types. The other is account_list, which contains thousands of accounts.
I'm trying to get a list of all types with a count of how many accounts in each type meet certain criteria. Crucially, if there are zero matching accounts, I'd like a row with zero in the output. For example:
TYPE ACCOUNTS
type-1 104
type-2 0
type-3 56
... ...
Here's the query I'm using:
PROC SQL;
CREATE TABLE summary AS
SELECT
type_list.type,
COUNT(account_list.account_number) AS accounts
FROM type_list
LEFT JOIN account_list
ON type_list.type = account_list.type
WHERE account_list.curr_balance > 1000
GROUP BY type_list.type;
QUIT;
The actual output I'm getting doesn't have a row for types that don't have any matching accounts. For example:
TYPE ACCOUNTS
type-1 104
type-3 56
... ...

Left join the types with a computational sub-query of the accounts. Use COALESCE to handle the case of some type never occurring in accounts.
Example:
Type 7 never occurs in accounts.
data types;
do type = 1 to 14;
input count @@;
output;
end;
stop;
datalines;
104 0 56 123 4 0 7 8 9 0 11 12 13 14
;
data accounts;
call streaminit (20220818);
do id = 1 to 1e5;
do until (type ne 7);
type = rand('integer', 1, 14);
end;
balance = rand ('integer', 25, 25000);
output;
end;
run;
proc sql;
create table want as
select
types.type
, types.count as type_count
, coalesce(amc.accounts_count,0) as accounts_count
from
types
left join
(select type, count(*) as accounts_count
from accounts
where balance > 21000
group by type
) as amc /* counts of accounts meeting criteria */
on
types.type = amc.type
;
quit;
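The original query drops the zero rows because the WHERE clause is applied after the join: for a type with no qualifying account, account_list.curr_balance is missing (or not above 1000), so the row is filtered out before GROUP BY runs. As an alternative to the sub-query above, here is a minimal sketch that moves the balance test into the ON clause. It uses the table and column names from the question and has not been run against the asker's data.

```sas
proc sql;
  create table summary as
  select
    type_list.type,
    /* COUNT of a column skips missing values, so unmatched types count as 0 */
    count(account_list.account_number) as accounts
  from type_list
  left join account_list
    on  type_list.type = account_list.type
    and account_list.curr_balance > 1000   /* filter inside the join, not in WHERE */
  group by type_list.type;
quit;
```

Because the condition is part of the join, every row of type_list survives, and COALESCE is not needed.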

Related

Extracting row with highest value in a column while also calculating averages by group

I have been tasked with taking the following data and creating two permanent data sets from it. One of these permanent data sets is supposed to contain the average of the "value" column for each group (meaning there should only be four rows in the end, with a new column that represents the average of respective values for A, B, C, and D). Averages should exclude missing values, meaning that if category A has a missing value, it should be divided by 3, not 4. The second permanent data set needs to be the one row with the highest overall value in the "value" column (in this case, the row with D 09JUL2021 951 should be the only row exported). I am having a tough time extracting that single row for the second data set. If you know of a way to perform these operations simultaneously, please let me know. Thank you for your time!
Example data:
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
Here is what I tried:
data mylib.data1(keep=type date value value_avg) mylib.data2;
set work.have;
by type;
if value ne . then NotMissing=1; else NotMissing=0;
if first.type then call missing(of value_avg);
value_avg+value;
if first.type then call missing(of num_per_cat);
num_per_cat+NotMissing;
Avg=divide((value_avg+value),(num_per_cat+NotMissing));
if last.type then output mylib.data1;
run;
This was successful for me with calculating averages, but I have no idea how to extract the row with the highest value in the "value" column to a second data set.
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
proc summary data = have nway;
class type;
var value;
output out = want_mean(drop = _:) mean = ;
run;
proc summary data = have nway;
class type;
var value;
output out = want_max(drop = _:) max = ;
run;
Both sets are easily done with proc sql.
First one:
proc sql;
create table want1 as
select distinct type, max(value) as Max_value, mean(value) as Average_value
from have
group by type
;
quit;
Second one:
proc sql;
create table want2 as
select *
from have
having value = max(value)
;
quit;
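The question also asks whether both data sets can be produced at the same time. A single DATA step can write to two outputs in one pass. This is only a sketch: the output names work.avg_by_type and work.max_row and the max_* helper variables are made up for illustration, and it assumes work.have is sorted by type, as the example data is.

```sas
data work.avg_by_type(keep=type value_avg)
     work.max_row(keep=max_type max_date max_value
                  rename=(max_type=type max_date=date max_value=value));
  length max_type $ 8;                 /* same length as type in work.have */
  retain max_type max_date max_value;  /* carry the best row across iterations */
  format max_date date9.;
  set work.have end=last_obs;
  by type;

  /* Reset the running total and non-missing count at the start of each type */
  if first.type then do;
    total = 0;
    n = 0;
  end;

  /* Sum statements retain total and n; missing values are skipped */
  if not missing(value) then do;
    total + value;
    n + 1;
  end;

  /* Remember the row holding the largest value seen so far */
  if value > max_value then do;
    max_type  = type;
    max_date  = date;
    max_value = value;
  end;

  /* One average row per type; value_avg stays missing if every value was missing */
  if last.type then do;
    if n > 0 then value_avg = total / n;
    output work.avg_by_type;
  end;

  /* One row overall: the observation with the maximum value */
  if last_obs then output work.max_row;
run;
```

If several rows tie for the maximum, this keeps the first one encountered, whereas the HAVING query above returns all of them.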

Select an observation if it has another within 24 hours of it

I am trying to create a table that only populates entries of a contact to a customer at a business number if they were NOT first contacted at a home number within 24 hours prior to the attempt at the business number.
So if I have
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
I want to be able to get
1 20MAY2018:06:24:28 B
2 24MAY2018:06:24:28 B
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
I have tried adding a count to the ID but I'm not sure how I'd go about using that, or if there's a way to use a subquery within a proc sql to create a count of observations that have more than one in a 24 hour period.
So, your approach will work, but will be quite messy with large numbers - as you're doing a cartesian join within ID. If each ID has few records it's not so bad, but if each ID has many records you make a lot of connections.
Fortunately, there's an easy way to do this in SAS!
data want;
do _n_ = 1 by 1 until (last.id); *for each ID:;
set have;
by id;
if first.id then last_home=0; *initialize last_home to 0;
if type='H' then last_home = record; *if it is a home then save it aside;
if type='B' and intck('Hour',last_home,record,'c') gt 24 then output; *if it is business then check if 24 hours have passed;
end;
format last_home datetime.;
run;
A few notes:
I use a DoW loop, but that really isn't mandatory, I just like it from a clarity perspective (it makes it clear I'm doing something at an ID-repetition level). You could remove that loop and add a RETAIN for last_home and it would be the same.
I use INTCK instead of INTNX - again this is for clarity, your INTNX is fine too, but INTCK just does the comparison, while INTNX is for advancing dates by an amount. I use the one that matches what I am trying to do, so someone reading the code can see easily what I'm doing.
This will be much faster than SQL on larger datasets, if for no other reason than it only passes the data once. SQL will necessarily do it multiple times, even if you don't separate HAVEA/HAVEB and do that within the SQL query.
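For comparison, the RETAIN variant mentioned in the first note might look like the sketch below. It assumes HAVE is sorted by ID and RECORD and that TYPE is a character variable; the output name want_retain is made up.

```sas
data want_retain;
  set have;
  by id;
  retain last_home;                        /* keep the value across observations */
  if first.id then last_home = 0;          /* reset at the start of each ID */
  if type = 'H' then last_home = record;   /* remember the latest home contact */
  /* output a business contact only if more than 24 whole hours have passed */
  if type = 'B' and intck('hour', last_home, record, 'c') > 24 then output;
  format last_home datetime.;
run;
```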
I believe I figured it out!
I have HAVEA and HAVEB tables hosting type H and type B entries respectively.
Then I ran the following PROC SQL statements.
PROC SQL;
CREATE TABLE WANTA AS
SELECT A.RECORD AS PREVIOUS_CALL, B.* FROM HAVEB B
JOIN HAVEA A ON (B.ID=A.ID AND A.RECORD LE B.RECORD);
CREATE TABLE WANTB AS
SELECT * FROM WANTA
GROUP BY ID, RECORD
HAVING PREVIOUS_CALL = MAX(PREVIOUS_CALL);
CREATE TABLE WANTC AS
SELECT * FROM WANTB
WHERE INTNX('HOUR',RECORD,-24,'SAME') GT PREVIOUS_CALL;
QUIT;
Please let me know if this is not a sustainable answer for larger sums of data or if there is a much better method of approaching this.
You can perform a selection to get the final result set without creating intermediate tables. Here are two alternatives:
First way
Similar to your 'figuring it out'. A reflexive join with grouping detects the "to_home" calls prior to the "to_business" calls that did NOT occur in the last 24 hours (86,400 seconds)
proc sql;
/* RECORD is the datetime column in HAVE */
create table want as
select distinct
business.*
from have as business
join have as home
on business.id = home.id
& business.type = 'B'
& home.type = 'H'
& home.record < business.record
group by
business.id, business.record
having
max(home.record) < business.record - 86400
;
Second way
Perform a NOT existential check, for a to_home call in prior 24hr, for every to_business call.
/* RECORD is the datetime column in HAVE */
create table want2 as
select
business.*
from
have as business
where
business.type = 'B'
and
not exists (
select * from have as home
where home.id = business.id
and home.type = 'H'
and home.record < business.record
and home.record >= business.record - 86400
)
;
quit;
A HASH solution does have some dependencies (amount of data and RAM)...but it is another alternative
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
/* Keep only HOME TYPE records and
rename RECORD for use in the comparison */
Data HOME(Keep=ID RECORD rename=(record=hrecord));
Set HAVE(where=(Type="H"));
Run;
Data WANT(Keep=ID RECORD TYPE);
/* Use only BUSINESS TYPE records */
Set HAVE(where=(Type="B"));
/* Set up HASH object */
If _N_=1 Then Do;
/* Multidata:YES for looping through
all successful FINDs */
Declare HASH HOME(dataset:"HOME", multidata:'yes');
home.DEFINEKEY('id');
home.DEFINEDATA('hrecord');
home.DEFINEDONE();
/* To prevent warnings in the log */
Call Missing(HRECORD);
End;
/* FIND first KEY match */
rc=home.FIND();
/* Successful FINDs result in RC=0 */
Do While (RC=0);
/* This will keep the result of the most recent, in datetime,
HOME/BUS record comparison */
If intck('Hour',hrecord,record,'c') > 24 Then Good_For_Output=1;
Else Good_For_Output=0;
/* Keep comparing HOME/BUS for all HOME records */
rc=home.FIND_NEXT();
End;
If Good_For_Output=1 Then Output;
Run;

Random ordered sampling with replacement in SAS

I have a data set from which I'd like to draw a sample with replacement. When I use proc surveyselect, the samples drawn are in the exact same order as in the original dataset, and multiple draws of the same record are written one below the other.
proc surveyselect data=sashelp.baseball outhits method=urs n=1000 out=mydata;
run;
However, it's important to me that the position in the output table is sampled as well. Is there an option in proc surveyselect, or am I better off just sampling the row number myself and outputting it, as outlined in this paper (p. 4)?
As a toy example (not in SAS notation), suppose I have a list of values [a, b, c, d] and I draw five times with repetition (and keeping the order of draws):
First a, then c, then a, then b, then c. The result I want is [a, c, a, b, c], but SAS only gives output of the type
[a,a,b,c,c] (with outhits)
[a 2, b 1, c 2, d 0] (with outall) or
[a 2, b 1, c 2] (without an additional option).
So here is a solution which only requires BASE SAS. Minor changes would be needed to allow inclusion of additional columns such as an ID or a DATE, for instance. I don't claim it's the most efficient way to do this. It relies heavily on PROC SQL which is my preference. Having said that, it should produce the results you wish in quite reasonable time.
The length of the generated SQL code justifies the need for a separate sas program. If you don't want to show the whole %included file in the log, just leave out the /source2 option.
Generate Sample Data
data mymatrix;
input c1 c2 c3 c4 c5;
datalines;
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
;
Declare Macro %DrawSample
Parameters:
lib = library in which ds is found
ds = table to sample from
out = table to generate
outfile = path/name of the sas program containing the insert strings
n = number of repetitions
%macro DrawSample(lib, ds, out, outfile, n);
%local nrows ncols cols;
proc sql;
/* get number of rows in source table */
select count(*)
into :nrows
from &lib..&ds;
/* get variable names */
select name, count(name)
into :cols separated by " ",
:ncols
from dictionary.columns
where libname = upcase("&lib")
and memname = upcase("&ds");
quit;
data _null_;
file "&outfile";
length query $ 256;
array column(&ncols) $32;
put "PROC SQL;";
put " /* create an empty table with same structure */";
put " create table &out as";
put " select *";
put " from &lib..&ds";
put " where 1 = 2;";
put " ";
do i = 1 to &n;
%* Randomize column order;
do j = 1 to &ncols;
column(j) = scan("&cols", 1 + floor((&ncols)*rand("uniform")));
end;
%* Build the query;
query = cat(" INSERT INTO &out SELECT ", column(1));
do j = 2 to &ncols;
query = catx(", ", query, column(j));
end;
rownumber = 1 + floor(&nrows * rand("uniform"));
query = catx(" ", query, "FROM &lib..&ds(firstobs=", rownumber,
"obs=", rownumber, ");");
put query;
end;
put "QUIT;";
run;
%include "&outfile" / source2;
%mend;
Calling the Macro
%DrawSample(lib=work, ds=mymatrix, out=matrixSample, outfile=myRandomSample.sas, n=1000);
Et voilà!
Not sure exactly what you're after, but something that may help is to use the option OUTALL instead of OUTHITS. This will create an output dataset the same size as the original, with a selected column to show if the record has been sampled and a numberhits column to show how many times that record has been selected. It won't create a row for each time a record is selected.
You can then select the observation number for all records in the sample.
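Sampling the row number yourself, as the question suggests, can be sketched with direct access through the POINT= option of the SET statement. The seed and the variable names draw, pick and n_rows are arbitrary choices for illustration. Rows are written in the order they are drawn, so repeated draws of the same record are not grouped together.

```sas
data mydata;
  call streaminit(2024);                    /* arbitrary seed for reproducibility */
  do draw = 1 to 1000;
    /* n_rows is filled in at compile time by NOBS=, so it is usable here */
    pick = ceil(n_rows * rand('uniform'));  /* random row number in 1..n_rows */
    set sashelp.baseball point=pick nobs=n_rows;
    output;                                 /* one output row per draw, in draw order */
  end;
  stop;                                     /* needed: POINT= never reaches end of file */
run;
```

The POINT= and NOBS= variables are not written to the output data set; draw is kept and records the position of each draw.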

Delete the group that none of its observation contain the certain value in SAS

I want to delete every group in which no observation has NUM=14.
So something like this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the ID=2 observations contain NUM=14, I delete group 2.
And it should look like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The sub query selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
The equivalent data step approach would be:
/* Get all the groups that contain an observation where NUM = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicate values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both data steps, the subsetting IF statement outputs an observation only when the condition is met. In the second case, k is a temporary variable with value 1 (true) when a value is read from keepGroups and 0 (false) otherwise.
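A more compact PROC SQL variant relies on the fact that SAS evaluates a comparison to 1 or 0 and remerges the group statistic back onto the detail rows. This is a sketch, with the output name newdat_having made up for illustration. The log will show a note about remerging, and the row order of the result is not guaranteed.

```sas
proc sql;
  create table newdat_having as
  select *
  from originaldat
  group by ID
  /* (NUM = 14) is 1 or 0 per row; the group maximum is 1 if any row matches */
  having max(NUM = 14) = 1;
quit;
```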
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (Assuming the DATA/SET names are mistyped and not actually wrong in your program) is the first data step doesn't append that 100 to every row - only to the 14 row. What you need is one 'line' per ID value with a keep/no keep decision.
You can either do this by doing your first data step, but RETAIN score, and only output one row per ID. Your code would actually work, based on 14 being the first row, if you just fixed your data/set typo; but it only works when 14 is the first row.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
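/* Keep only ID and the flag, so NUM from this table cannot overwrite NUM in the merge */
data has_fourteen(keep=ID keep);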
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
run;
data newdata;
merge originaldat has_fourteen;
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice; for each ID, it iterates once through all records, looking for a 14, if it finds one then setting keep to 1. Then it reads all records again for that ID, and keeps if keep=1. Then it goes on to the next set of records by ID.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find out the list of groups which contains num=14, use below SQL */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only groups containing num=14 then use following data step */
data out;
set in;
where id in (&lst.);
run;

SAS : select several observations with same identifier based on a condition true for just one of them

I have a dataset with an identifier, let us call it ident, with several observations for each identifier, and a categorical variable var that can take several values, among them 1.
How do I keep all observations that share an identifier if at least one of those observations has var=1?
For instance, with
data Test;
input identifier var;
datalines;
1023 1
1023 3
1023 5
1064 2
1064 3
1098 1
1098 1
;
Then I want to keep
1023 1
1023 3
1023 5
1098 1
1098 1
Here's the one pass solution that works for any arbitrary value. (It is a one pass solution as long as your BY group is small enough to fit into memory, which usually is the case).
%let var1=3;
data want;
do _n_ = 1 by 1 until (last.identifier);
set test;
by identifier;
if var=&var1. then keepflag=1;
end;
do _n_ = 1 by 1 until (last.identifier);
set test;
by identifier;
if keepflag then output;
end;
run;
That's going through the by group once, setting keepflag=1 if any row in the by group is equal to the value, then keeping all rows from that by group. Buffering will mean this doesn't reread the data twice as long as the by group fits into memory.
The easiest way I can think of is to create a table of the identifier and then join back to it.
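data temp_ID;
set TEST;
where var = 1;
keep identifier;
run;
/* One row per identifier, so the join does not duplicate rows */
proc sort data=temp_ID nodupkey;
by identifier;
run;
proc sql;
create table output_data as select
b.*
from temp_ID a
left join TEST b
on a.identifier=b.identifier;
quit;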
Assuming your data is already sorted by identifier and var, you can do this with one pass. You can tell at the first line whether or not that identifier should be output.
data want (drop=keeper);
set test;
by identifier;
length keeper 3;
retain keeper;
if first.identifier then do;
if var = 1 then keeper = 1;
else keeper= 0;
end;
if keeper = 1 then output;
run;