Concatenate column names with non zero values - sas

I have a table with ID, event and date columns.
The current script counts events by ID, event_type and date, transposes the counts, and then labels each ID by the combination of event types that occurred that day.
The concern is that this breaks if new event types appear.
I am hoping to find a method that takes all the column names with a non-zero count and concatenates them. Ideally the method does not involve hard-coding the event types.
data test;
infile datalines delimiter=':' truncover;
informat id 10. event_dt DDMMYY10. event_type $10. event $10.;
input id event_dt event_type event;
datalines;
1:01-03-2017:BB:b1
1:01-03-2017:AA:A2
1:02-03-2017:CC:C1
2:01-03-2017:CC:C2
3:03-03-2017:BB:b2
4:02-03-2017:AA:A1
;
run;
proc sql;
create table test2 as
select distinct ID, event_dt format worddate. as event_dt, event_type, count(distinct event) as event_count
from test
group by ID, event_dt, event_type;
quit;
proc transpose data=test2 out=test3 (drop = _name_);
by id event_dt;
id event_type;
run;
proc stdize data=test3 out=test3z reponly missing=0;
run;
proc sql;
create table test4 as
select event_dt,
case when AA = 0 and BB = 0 and CC = 0 then 'No Event'
when AA = 0 and BB = 0 and CC > 0 then 'CC only'
when AA = 0 and BB > 0 and CC = 0 then 'BB only'
when AA = 0 and BB > 0 and CC > 0 then 'BB & CC'
when AA > 0 and BB = 0 and CC = 0 then 'AA only'
when AA > 0 and BB = 0 and CC > 0 then 'AA & CC'
when AA > 0 and BB > 0 and CC = 0 then 'AA & BB'
when AA > 0 and BB > 0 and CC > 0 then 'AA & BB & CC'
else 'Other' end as tag, count(id) as ID_COUNT
from test3z group by
event_dt,
case when AA = 0 and BB = 0 and CC = 0 then 'No Event'
when AA = 0 and BB = 0 and CC > 0 then 'CC only'
when AA = 0 and BB > 0 and CC = 0 then 'BB only'
when AA = 0 and BB > 0 and CC > 0 then 'BB & CC'
when AA > 0 and BB = 0 and CC = 0 then 'AA only'
when AA > 0 and BB = 0 and CC > 0 then 'AA & CC'
when AA > 0 and BB > 0 and CC = 0 then 'AA & BB'
when AA > 0 and BB > 0 and CC > 0 then 'AA & BB & CC'
else 'Other' end;
quit;
Thank you
Ben

Some might say: transpose and use arrays.
Consider this alternative instead of assigning the tag value by processing explicit columns across a row:
compute the tag value while iterating through each id/date group, sort, and then count the ids while iterating through each date/tag group.
* same as in question;
proc sql;
create table test2 as
select distinct ID, event_dt format worddate. as event_dt, event_type, count(distinct event) as event_count
from test
group by ID, event_dt, event_type;
quit;
* compute tag value (I call it type_list);
data test2a;
length type_list $30;
do _n_ = 1 by 1 until (last.event_dt);
set test2;
by ID event_dt;
type_list = catx(',',type_list,event_type);
end;
keep id event_dt type_list;
run;
proc sort data=test2a;
by event_dt type_list;
run;
* count number of ids with same type list on each event day;
data want;
do id_count = 1 by 1 until (last.type_list);
set test2a;
by event_dt type_list;
end;
run;
You can add extra logic and TRANWRD to change type_list (a comma-separated list) into a more verbose representation (containing words such as "&" or "only").
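As a minimal sketch of that relabelling, assuming the test2a table built above: a single event type gets the suffix "only" and several types are joined with " & ". The table name test2b and the column name tag are made up for illustration.

```sas
data test2b;               /* hypothetical output table name */
  set test2a;
  length tag $60;          /* assumed long enough for the verbose label */
  /* one event type in the list: "AA" becomes "AA only" */
  if countw(type_list, ',') = 1 then tag = catx(' ', type_list, 'only');
  /* several event types: "AA,BB" becomes "AA & BB" */
  else tag = tranwrd(strip(type_list), ',', ' & ');
run;
```

If tag is used in place of type_list, the sort and count steps shown earlier would name tag in their BY statements.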

Related

PROC SQL equivalent of data step merge using IN

I have to merge two very large files and I want to avoid doing this in a data step, as that would mean sorting the data. I need all observations from the left file, excluding the IDs that are present in the second file.
data leftdata;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;run;
data rightdata;
input id $ ;
datalines;
AA
BB
;
run;
*Using datastep;
PROC SORT DATA=leftdata; BY id;
PROC SORT DATA=rightdata; BY id; RUN;
DATA datastep;
MERGE leftdata(IN=a) rightdata(IN=b);
BY id; IF a and b=0;
RUN;
How can the same be achieved using PROC SQL?
Final output must include the following observations:
CC 50
CC 80
DD 60
Here are two ways:
WHERE NOT IN (SELECT ...) filtering, or
LEFT JOIN where missing the right id.
Example:
data have;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;
data excluded_ids;
input id $ ;
datalines;
AA
BB
;
proc sql;
create table want as
select * from have
where id not in (select id from excluded_ids)
;
create table want as
select have.* from have
left join excluded_ids as remove
on have.id = remove.id
where remove.id is null
;
quit;
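For the second way, a repeated id in the exclusion list does not duplicate any output rows, because only rows of have with no match survive the where remove.id is null filter, so no SELECT DISTINCT is needed.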
Data step
Use a hash object to store the exclusion list and check method to test for removal
Example:
data want;
set have;
if _n_ = 1 then do;
declare hash exclude(dataset:'excluded_ids');
exclude.defineKey('id');
exclude.defineDone();
end;
if exclude.check() = 0 then delete;
run;

SAS: how to avoid multiple left joins?

I have a dataset like this (just with many more accounts and thus observations)
data dataset;
input account_id month date9. default_flag;
format month date9.;
datalines;
22 31JAN2004 0
22 29FEB2004 0
22 31MAR2004 0
22 30APR2004 0
22 31MAY2004 0
22 30JUN2004 0
22 31JUL2004 0
22 31AUG2004 0
22 30SEP2004 0
22 31OCT2004 0
22 30NOV2004 0
22 31DEC2004 0
22 31JAN2005 0
22 28FEB2005 0
22 31MAR2005 0
22 30APR2005 0
22 31MAY2005 1
;
run;
The default_flag variable denotes whether the account is in default (default_flag=1) or not (default_flag=0).
I want to create an X-months forward default flag variable defXMON_flag which is equal to 1 if the account enters default in the following X months and there are at least X months of observations. If the account enters default, but there are fewer than X months of observations, then the variable will be equal to 3.
I need to create several such variables, for several different X. Let's say I want to create two such variables - one for X=12 months (def12MON_flag) and one for X=15 months (def15MON_flag).
I came up with the following code which produces the results I want:
proc sql noprint;
create table defaults as
select a.account_id, a.month, a.default_flag,
sum(b.default_flag) as nr_defaults_12 format = comma17.,
count(distinct b.month) as nr_months_12,
case when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) eq 0
and coalesce(calculated nr_months_12, 0) = 12 then 0
when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) ne 0
and coalesce(calculated nr_months_12, 0) = 12 then 1
when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) eq 0
and coalesce(calculated nr_months_12, 0) ne 12 then 2
when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) ne 0
and coalesce(calculated nr_months_12, 0) ne 12 then 3
else .
end as def12MON_flag,
sum(c.default_flag) as nr_defaults_15 format = comma17.,
count(distinct c.month) as nr_months_15,
case when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) eq 0
and coalesce(calculated nr_months_15, 0) = 15 then 0
when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) ne 0
and coalesce(calculated nr_months_15, 0) = 15 then 1
when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) eq 0
and coalesce(calculated nr_months_15, 0) ne 15 then 2
when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) ne 0
and coalesce(calculated nr_months_15, 0) ne 15 then 3
else .
end as def15MON_flag
from ( select account_id, month, default_flag from dataset where default_flag = 0) a
left outer join ( select account_id, default_flag, month from dataset ) b
on 1=1
and a.account_id = b.account_id
and b.month between intnx('month', a.month, 1, 'end')
and intnx('month', a.month, 12, 'end')
left outer join ( select account_id, default_flag, month from dataset ) c
on 1=1
and a.account_id = c.account_id
and c.month between intnx('month', a.month, 1, 'end')
and intnx('month', a.month, 15, 'end')
group by a.account_id, a.month, a.default_flag
order by a.account_id, a.month;
quit;
run;
This is my end result. It is not the important part; the important part is the creation of the table defaults, which I need to optimize.
PROC SQL;
CREATE TABLE RESULT AS
SELECT T.*, S.DEF12MON_FLAG, S.DEF15MON_FLAG
FROM DATASET T
LEFT JOIN DEFAULTS S
ON T.ACCOUNT_ID = S.ACCOUNT_ID
AND T.MONTH = S.MONTH
ORDER BY ACCOUNT_ID, MONTH
;
QUIT;
However, the problem with this code is that it uses two left joins of the huge dataset when creating the defaults table (in reality, I would need even more left joins for more defXMON_flag variables), and it is not feasible to run because of the size of those datasets. Is it possible to get the same result without those left joins? Can you suggest a more efficient way to calculate def12MON_flag and def15MON_flag that gives the same result as this code?
As far as I understand the result you want, step 1 is to take the month in which each account_id has the default flag set:
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_DATASET AS
SELECT DISTINCT t1.account_id,
t1.month,
t1.default_flag
FROM WORK.DATASET t1
WHERE t1.default_flag = 1;
QUIT;
Then calculate the difference in months using the intck() function:
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_DATASET_0000 AS
SELECT DISTINCT t1.account_id,
t1.month,
t1.default_flag,
t2.month AS month_when_default_flag_1,
/* month_difference */
(intck('month', t1.month, t2.month)) AS month_difference
FROM WORK.DATASET t1
LEFT JOIN WORK.QUERY_FOR_DATASET t2 ON (t1.account_id = t2.account_id);
QUIT;
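Building on the same idea (locate the default month per account, then measure the distance to it), the flags themselves could be derived with a single join against a one-row-per-account summary instead of one self-join per horizon. This is only a sketch, and it rests on assumptions that the question does not state: every account has one row per month with no gaps, and default_flag never returns to 0 after a default. The names acct_summary, result_sketch and the macro fwd_flag are made up for illustration.

```sas
/* One row per account: first default month and last observed month */
proc sql;
  create table acct_summary as
  select account_id,
         min(case when default_flag = 1 then month else . end)
           as first_default_month format=date9.,
         max(month) as last_month format=date9.
  from dataset
  group by account_id;
quit;

/* Hypothetical helper: emits the CASE expression for an X-month horizon.
   "Full window" means the account is still observed X months ahead.
   "Default" means the first default falls in the next 1..X months. */
%macro fwd_flag(x);
  case
    when t.default_flag ne 0 then .
    when s.last_month >= intnx('month', t.month, &x, 'b')
         and s.first_default_month between intnx('month', t.month, 1, 'b')
                                       and intnx('month', t.month, &x, 'e') then 1
    when s.last_month >= intnx('month', t.month, &x, 'b') then 0
    when s.first_default_month between intnx('month', t.month, 1, 'b')
                                   and intnx('month', t.month, &x, 'e') then 3
    else 2
  end
%mend fwd_flag;

proc sql;
  create table result_sketch as
  select t.*,
         %fwd_flag(12) as def12MON_flag,
         %fwd_flag(15) as def15MON_flag
  from dataset t
  left join acct_summary s
    on t.account_id = s.account_id
  order by t.account_id, t.month;
quit;
```

Adding another horizon then only needs one more %fwd_flag call in the select list. If accounts can have gaps between months or can leave default again, the window counts no longer follow from the first and last dates alone, and this shortcut would not reproduce the original query.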

Combine information into a department vector

I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 0011
The positions of the elements of dept_vect are ordered alphabetically. So a '1' in the first position means that the id is found in department A and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in the department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works, but there are a couple of problems. For col4 to appear, at least one id must be found in all departments, although that could be fixed by creating a dummy id that is found in all departments. The main problem is that this code is not robust. Is there a way to code this so that it works for any number of departments?
Add a 1 to get a count variable
Transpose using PROC TRANSPOSE
Replace missing with 0
Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=have out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
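One thing to watch with the transpose approach: PROC TRANSPOSE creates the new columns in the order in which the dept values first appear (A, D, B, C in the sample data), so the positions in the concatenated string are not necessarily alphabetical. A possible alternative that fixes the positions explicitly is sketched below. It assumes the test table from the question and department names without embedded blanks. The names dept_list, n_dept, test_sorted and pos are made up for illustration.

```sas
/* Alphabetical, blank-separated list of distinct departments */
proc sql noprint;
  select distinct dept into :dept_list separated by ' '
  from test
  order by dept;
quit;
%let n_dept = &sqlobs;     /* number of distinct departments */

proc sort data=test out=test_sorted;
  by id;
run;

data want;
  length dept_vect $ &n_dept;
  do until (last.id);
    set test_sorted;
    by id;
    /* start each id with all zeros */
    if first.id then dept_vect = repeat('0', &n_dept - 1);
    /* 'e' modifier: return the word number, not the character position */
    pos = findw("&dept_list", strip(dept), ' ', 'e');
    if pos > 0 then substr(dept_vect, pos, 1) = '1';
  end;
  keep id dept_vect;
run;
```

Because the position comes from the sorted list rather than from the column order, a new department only changes the macro variable, not the data step.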
Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;

SAS make summary statistic not available in proc mean

I have a table with very many columns, but in order to explain my
problem I will use this simple table.
data test;
input a b c;
datalines;
0 0 0
1 1 1
. 4 2
;
run;
I need to calculate common summary statistics such as min, max and number of missing values. But I also need some special ones, such as the number of values above a certain level (in this example >0 and >1).
I can use PROC MEANS, but it only gives me the standard statistics like min, max, etc.
What I want is result on the following format:
var minval maxval nmiss n_above1 n_above2
a 0 1 1 1 0
b 0 4 0 2 1
c 0 2 0 2 1
I have been able to produce this format for one variable with this rather
stupid code:
data result;
set test(keep =b) end=last;
variable = 'b';
retain minval maxval;
if _n_ = 1 then do;
minval = 1e50;
maxval = -1e50;
end;
if minval > b then minval = b;
if maxval < b then maxval = b;
if b=. then nmiss+1;
if b>0 then n_above1+1;
if b>1 then n_above2+1;
if last then do;
output;
end;
drop b;
run;
This produces the following table:
variable minval maxval nmiss n_above1 n_above2
b 0 4 0 2 1
I know there has to be a better way to do this. I am used to Python and pandas. There I would simply loop through each variable, calculate the different summary statistics and append the result to a new dataframe for each variable.
I can probably also use PROC SQL, as in the next example:
proc sql;
create table res as
select count(case when a > 0 then 1 end) as n_above1_a,
count(case when b > 0 then 1 end) as n_above1_b,
count(case when c > 0 then 1 end) as n_above1_c
from test;
quit;
This gives me:
n_above1_a n_above1_b n_above1_c
1 2 2
But this does not solve my problem.
If you add a unique identifier to each row then you can just use PROC TRANSPOSE and PROC SQL to get your result.
data test;
input a b c;
id+1;
datalines;
0 0 0
1 1 1
. 4 2
;
proc transpose data=test out=tall ;
by id ;
run;
proc sql noprint ;
create table want as
select _name_
, min(col1) as minval
, max(col1) as maxval
, sum(missing(col1)) as nmiss
, sum(col1>0) as n_above1
, sum(col1>1) as n_above2
from tall
group by _name_
;
quit;
Result
Obs _NAME_ minval maxval nmiss n_above1 n_above2
1 a 0 1 1 1 0
2 b 0 4 0 2 1
3 c 0 2 0 2 1

How to modify the SNP values?

My dataset has 3 SNPs and looks like this:
Id SNP1 SNP2 SNP3
1 AA AA AA
2 AG AC AG
3 GG CC GG
(rows 4, 5, 6 and so on follow the same pattern)
In SNP1, I would like to recode the values as AA = 2, AG = 1, GG = 0, and likewise in SNP2 and SNP3.
How can I do this?
I would put the new values in a proc format, so that you can either keep the existing values but displayed with the formatted value, or convert the existing values using the format. Here are both ways to do this.
/* create format */
proc format;
value $snpfmt 'AA' = '2'
'AG' = '1'
'GG' = '0'
;
run;
/* create initial dataset */
data have;
input Id SNP1 $ SNP2 $ SNP3 $;
datalines;
1 AA AA AA
2 AG AC AG
3 GG CC GG
;
/* option1 - format the values */
proc datasets lib=work nodetails nolist;
modify have;
format snp1 snp2 snp3 $snpfmt2. ;
quit;
/* option2 - change the values using the format */
data want;
set have;
snp1 = put(snp1,$snpfmt2.);
snp2 = put(snp2,$snpfmt2.);
snp3 = put(snp3,$snpfmt2.);
run;
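Both options above leave the SNP columns as character values. If numeric codes are needed (for example, to use the SNPs as counts in a model), one possible variation is a numeric informat applied with INPUT. This is a sketch only: the informat name snpnum, the table want_num and the new columns snp1_n to snp3_n are made up for illustration, and genotypes that the question does not map (such as AC or CC) are set to missing rather than guessed.

```sas
proc format;
  invalue snpnum           /* hypothetical informat name */
    'AA' = 2
    'AG' = 1
    'GG' = 0
    other = .;             /* unmapped genotypes such as AC or CC become missing */
run;

data want_num;             /* hypothetical output table */
  set have;
  array snp[3] $ snp1-snp3;              /* existing character columns */
  array snp_n[3] snp1_n snp2_n snp3_n;   /* new numeric columns */
  do i = 1 to dim(snp);
    snp_n[i] = input(snp[i], snpnum.);
  end;
  drop i;
run;
```

If the other SNPs use different alleles, further value pairs would have to be added to the informat once their coding is decided.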