I have a dataset like this (just with many more accounts, and thus many more observations):
data dataset;
input account_id month date9. default_flag;
format month date9.;
datalines;
22 31JAN2004 0
22 29FEB2004 0
22 31MAR2004 0
22 30APR2004 0
22 31MAY2004 0
22 30JUN2004 0
22 31JUL2004 0
22 31AUG2004 0
22 30SEP2004 0
22 31OCT2004 0
22 30NOV2004 0
22 31DEC2004 0
22 31JAN2005 0
22 28FEB2005 0
22 31MAR2005 0
22 30APR2005 0
22 31MAY2005 1
;
run;
The default_flag variable denotes whether the account is in default (default_flag=1) or not (default_flag=0).
I want to create an X-month forward default flag variable, defXMON_flag, which is equal to 1 if the account enters default in the following X months and there are at least X months of observations. If the account enters default but there are fewer than X months of observations, the variable is equal to 3. (In the code below, the values 0 and 2 cover the same two cases when the account does not enter default.)
I need to create several such variables, for several different X. Let's say I want to create two such variables - one for X=12 months (def12MON_flag) and one for X=15 months (def15MON_flag).
I came up with the following code which produces the results I want:
proc sql noprint;
create table defaults as
select a.account_id, a.month, a.default_flag,
sum(b.default_flag) as nr_defaults_12 format = comma17.,
count(distinct b.month) as nr_months_12,
case when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) eq 0
and coalesce(calculated nr_months_12, 0) = 12 then 0
when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) ne 0
and coalesce(calculated nr_months_12, 0) = 12 then 1
when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) eq 0
and coalesce(calculated nr_months_12, 0) ne 12 then 2
when a.default_flag = 0
and coalesce(calculated nr_defaults_12, 0) ne 0
and coalesce(calculated nr_months_12, 0) ne 12 then 3
else .
end as def12MON_flag,
sum(c.default_flag) as nr_defaults_15 format = comma17.,
count(distinct c.month) as nr_months_15,
case when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) eq 0
and coalesce(calculated nr_months_15, 0) = 15 then 0
when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) ne 0
and coalesce(calculated nr_months_15, 0) = 15 then 1
when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) eq 0
and coalesce(calculated nr_months_15, 0) ne 15 then 2
when a.default_flag = 0
and coalesce(calculated nr_defaults_15, 0) ne 0
and coalesce(calculated nr_months_15, 0) ne 15 then 3
else .
end as def15MON_flag
from ( select account_id, month, default_flag from dataset where default_flag = 0) a
left outer join ( select account_id, default_flag, month from dataset ) b
on 1=1
and a.account_id = b.account_id
and b.month between intnx('month', a.month, 1, 'end')
and intnx('month', a.month, 12, 'end')
left outer join ( select account_id, default_flag, month from dataset ) c
on 1=1
and a.account_id = c.account_id
and c.month between intnx('month', a.month, 1, 'end')
and intnx('month', a.month, 15, 'end')
group by a.account_id, a.month, a.default_flag
order by a.account_id, a.month;
quit;
This is how I build my end result. It is not the important part; the important part is the creation of the table defaults, which is what I need to optimize.
PROC SQL;
CREATE TABLE RESULT AS
SELECT T.*, S.DEF12MON_FLAG, S.DEF15MON_FLAG
FROM DATASET T
LEFT JOIN DEFAULTS S
ON T.ACCOUNT_ID = S.ACCOUNT_ID
AND T.MONTH = S.MONTH
ORDER BY ACCOUNT_ID, MONTH
;
QUIT;
However, the problem with this code is that it uses two left joins of the huge dataset when creating the defaults table (in reality, I would need even more left joins for more defXMON_flag variables), and it is not feasible to run because of the size of those datasets. Is it possible to get the same result without those left joins? Can you please suggest a more efficient way to calculate def12MON_flag and def15MON_flag that gives the same result as this code?
So, as far as I understand, you want your result to look like this:
If so, step 1: take the month in which the account_id has the default flag set:
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_DATASET AS
SELECT DISTINCT t1.account_id,
t1.month,
t1.default_flag
FROM WORK.DATASET t1
WHERE t1.default_flag = 1;
QUIT;
Then calculate the difference in months using the intck() function:
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_DATASET_0000 AS
SELECT DISTINCT t1.account_id,
t1.month,
t1.default_flag,
t2.month AS month_when_default_flag_1,
/* month_difference */
(intck('month', t1.month, t2.month)) AS month_difference
FROM WORK.DATASET t1
LEFT JOIN WORK.QUERY_FOR_DATASET t2 ON (t1.account_id = t2.account_id);
QUIT;
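The month difference can also drive the flags directly, without joining the table to itself. What follows is only a sketch under stated assumptions: each account has exactly one row per month with no gaps, and the names desc, fwd_flag, defaults_fast, last_month, next_default, months_left, months_to_default and hit are made up for illustration. It reads each account backwards, so the nearest later default month is already known when a row is processed.

```sas
/* Sketch only. Assumes WORK.DATASET from the question, with exactly one
   row per account per month and no gaps between months. */
proc sort data=dataset out=desc;
  by account_id descending month;
run;

/* Hypothetical helper: emits the flag logic for one horizon X.
   The flag stays missing for rows that are already in default. */
%macro fwd_flag(x);
  if default_flag = 0 then do;
    hit = (0 < months_to_default <= &x);   /* 1 if a default falls in the window */
    if months_left >= &x then def&x.MON_flag = hit;   /* full window: 0 or 1 */
    else def&x.MON_flag = 2 + hit;                    /* short window: 2 or 3 */
  end;
%mend fwd_flag;

data defaults_fast;
  set desc;
  by account_id descending month;
  retain last_month next_default;
  if first.account_id then do;
    last_month = month;    /* latest month observed for this account */
    next_default = .;      /* no later default seen yet */
  end;
  months_left = intck('month', month, last_month);
  if next_default ne . then
    months_to_default = intck('month', month, next_default);
  else months_to_default = .;
  %fwd_flag(12)
  %fwd_flag(15)
  /* remember this row's default for the earlier months still to come */
  if default_flag = 1 then next_default = month;
  drop last_month next_default months_left months_to_default hit;
run;

proc sort data=defaults_fast;
  by account_id month;
run;
```

On the sample account this gives def12MON_flag = 0 for January through April 2004, 1 for 31MAY2004, and 3 from 30JUN2004 through 30APR2005, while the 31MAY2005 row (already in default) keeps a missing flag. Each additional horizon only needs another %fwd_flag call rather than another join. If months can be missing for an account, the count of available months would have to be derived differently.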
I've got the code below, which works beautifully for comparing rows in a group when the first row doesn't matter.
data want_Find_Change;
set WORK.IA;
by ID;
array var[*] $ RATING;
array lagvar[*] $ zRATING;
array changeflag[*] RATING_UPDATE;
do i = 1 to dim(var);
lagvar[i] = lag(var[i]);
end;
do i = 1 to dim(var) ;
changeflag[i] = (var[i] NE lagvar[i] AND NOT first.ID);
end;
drop i;
run;
Unfortunately, when I use a dataset that has two rows per group I get incorrect results, I'm assuming because the first row has to be used in the comparison. How can I compare the only two rows and return a flag only on the second row? This did not work:
data Change;
set WORK.Two;
by ID;
changeflag = last.RATING NE first.RATING;
run;
Example of the data I have and want:
Group  Name   Sport     DogName  Eligibility
1      Tom    BBALL     Toto     Yes
1      Tom    golf      spot     Yes
2      Nancy  vllyball  Jimmy    yes
2      Nancy  vllyball  rover    no
want
Group  Name   Sport     DogName  Eligibility  N_change  S_change  D_Change  E_change
1      Tom    BBall     Toto     Yes          0         0         0         0
1      Tom    golf      spot     Yes          0         1         1         0
2      Nancy  vllyball  Jimmy    yes          0         0         0         0
2      Nancy  vllyball  rover    no           0         0         1         1
If you want only the first row to not be flagged, you first need to create a variable enumerating the rows within each group. You can do so with:
data temp;
set have;
count + 1;
by Group;
if first.Group then count = 1;
run;
In a second step, you can run a proc sql with a subquery, count distinct by groups, and case when:
proc sql;
create table want as
select
Group, Name, Sport, DogName, Eligibility,
case when count_name > 1 and count > 1 then 1 else 0 end as N_change,
case when count_sport > 1 and count > 1 then 1 else 0 end as S_change,
case when count_dog > 1 and count > 1 then 1 else 0 end as D_change,
case when count_E > 1 and count > 1 then 1 else 0 end as E_change
from (select *,
count(distinct(Name)) as count_name,
count(distinct(Sport)) as count_sport,
count(distinct(DogName)) as count_dog,
count(distinct(Eligibility)) as count_E
from temp
group by Group);
quit;
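If the goal is to flag every row that differs from the row before it within the same group, a single data step with LAG may be enough. This is a sketch, assuming the data are sorted by Group and use the column names from the example; want_lag and the prev_ variables are hypothetical names.

```sas
data have;
  input Group Name $ Sport $ DogName $ Eligibility $;
  datalines;
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
;
run;

data want_lag;
  set have;
  by Group;
  /* call LAG on every row so its queue stays in step with the data */
  prev_name  = lag(Name);
  prev_sport = lag(Sport);
  prev_dog   = lag(DogName);
  prev_elig  = lag(Eligibility);
  /* the first row of a group has nothing to compare against */
  N_change = (not first.Group) and (Name ne prev_name);
  S_change = (not first.Group) and (Sport ne prev_sport);
  D_change = (not first.Group) and (DogName ne prev_dog);
  E_change = (not first.Group) and (Eligibility ne prev_elig);
  drop prev_:;
run;
```

For the four example rows this flags Sport and DogName on the second row of group 1, and DogName and Eligibility on the second row of group 2. The comparison is case-sensitive, and unlike the count-distinct approach it compares each row with its immediate predecessor, which matters once a group has more than two rows.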
I have a patient dataset that looks like the below table and I would like to see which diseases run together and ultimately make a heatmap. I used PROC FREQ to make this list table, but it is too laborious to go through like this because it gives me every combination (thousands).
Moya Hypothyroid Hyperthyroid Celiac
1 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
1 1 0 0
1 0 1 0
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1
proc freq data=new;
tables HOHT*HOGD*CroD*Psor*Viti*CelD*UlcC*AddD*SluE*Rhea*PerA/list;
run;
I would ultimately like a bunch of cross tabs as I show below, so I can see how many patients have each combination. Obviously it's possible to copy paste each variable like this manually, but is there any way to see this quickly or automate this?
proc freq data=new;
tables HOHT*HOGD/list;
run;
proc freq data=new;
tables HOHT*CroD/list;
run;
proc freq data=new;
tables HOHT*Psor/list;
run;
Thanks!
One can control the tables generated in PROC FREQ with the TABLES statement. To generate tables that are 2-way contingency tables of all pairs of columns in a data set, one can write a SAS macro that loops through a list of variables, and generates TABLES statements to create all of the correct contingency tables.
For example, using the data from the original post:
data xtabs;
input Moya Hypothyroid Hyperthyroid Celiac;
datalines;
1 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
1 1 0 0
1 0 1 0
1 1 0 0
1 1 0 0
0 0 1 1
0 0 1 1
;
run;
%macro gentabs(varlist=);
%let word_count = %sysfunc(countw(&varlist));
%do i = 1 %to (&word_count - 1);
tables %scan(&varlist,&i,%str( )) * (
%do j = %eval(&i + 1) %to &word_count;
%scan(&varlist,&j,%str( ))
%end; )
; /* end tables statement */
%end;
%mend;
options mprint;
proc freq data = xtabs;
%gentabs(varlist=Moya Hypothyroid Hyperthyroid Celiac)
run;
The code generated by the SAS macro is:
73 proc freq data = xtabs;
74 %gentabs(varlist=Moya Hypothyroid Hyperthyroid Celiac)
MPRINT(GENTABS): tables Moya * ( Hypothyroid Hyperthyroid Celiac ) ;
MPRINT(GENTABS): tables Hypothyroid * ( Hyperthyroid Celiac ) ;
MPRINT(GENTABS): tables Hyperthyroid * ( Celiac ) ;
75 run;
...and the first few tables from the resulting output look like:
To add options to the TABLES statement, one would add code before the semicolon on the line commented as /* end tables statement */.
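As a rough illustration of that last point, the macro could take the options as a parameter. This is a sketch only: gentabs_opts and its opts= parameter are hypothetical names, and it assumes the xtabs dataset created above.

```sas
/* Hypothetical variant of gentabs that appends TABLES options, if any */
%macro gentabs_opts(varlist=, opts=);
  %local i j word_count;
  %let word_count = %sysfunc(countw(&varlist, %str( )));
  %do i = 1 %to %eval(&word_count - 1);
    tables %scan(&varlist, &i, %str( )) * (
      %do j = %eval(&i + 1) %to &word_count;
        %scan(&varlist, &j, %str( ))
      %end; )
      /* emit the slash only when options were supplied */
      %if %length(&opts) %then %do; / &opts %end;
    ; /* end tables statement */
  %end;
%mend gentabs_opts;

proc freq data = xtabs;
  %gentabs_opts(varlist=Moya Hypothyroid Hyperthyroid Celiac, opts=list nocum)
run;
```

With MPRINT on, the log should show statements along the lines of tables Moya * ( Hypothyroid Hyperthyroid Celiac ) / list nocum ;. Options containing commas would need extra quoting in the macro call.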
Proc MEANS is one common tool for obtaining a variety of statistics for combinatoric groups within the data. In your case you want only the count of each combination.
Suppose you had 10,000 patients with 10 binary factors
data patient_factors;
do patient_id = 1 to 10000;
array factor(10);
do _n_ = 1 to dim(factor);
factor(_n_) = ranuni(123) < _n_/(dim(factor)+3);
end;
output;
end;
format factor: 4.;
run;
As you mentioned, Proc FREQ can compute the counts of each 10-level combination.
proc freq noprint data=patient_factors;
table
factor1
* factor2
* factor3
* factor4
* factor5
* factor6
* factor7
* factor8
* factor9
* factor10
/ out = pf_10deep
;
run;
FREQ does not have syntax to support creating output data that contains each pairwise combination involving factor1.
Proc MEANS does have the syntax for such output.
proc means noprint data=patient_factors;
class factor1-factor10;
output out=counts_paired_with_factor1 n=n;
types factor1 * ( factor2 - factor10 );
run;
I have a table with very many columns, but in order to explain my
problem I will use this simple table.
data test;
input a b c;
datalines;
0 0 0
1 1 1
. 4 2
;
run;
I need to calculate common summary statistics such as min, max, and the number of missing values. But I also need to calculate some special numbers, such as the number of values above a certain level (in this example, >0 and >1).
I can use proc means, but it only gives me results for the usual statistics like min, max, etc.
What I want is a result in the following format:
var minval maxval nmiss n_above1 n_above2
a 0 1 1 1 0
b 0 4 0 2 1
c 0 2 0 2 1
I have been able to produce a result in this format for one variable with this rather
stupid code:
data result;
set test(keep =b) end=last;
variable = 'b';
retain minval maxval;
if _n_ = 1 then do;
minval = 1e50;
maxval = -1e50;
end;
if minval > b then minval = b;
if maxval < b then maxval = b;
if b=. then nmiss+1;
if b>0 then n_above1+1;
if b>1 then n_above2+1;
if last then do;
output;
end;
drop b;
run;
This produces the following table:
variable minval maxval nmiss n_above1 n_above2
b 0 4 0 2 1
I know there has to be a better way to do this. I am used to Python and Pandas. There I would simply loop through each variable, calculate the different summary statistics, and append the result to a new dataframe for each variable.
I can probably also use proc sql, as in the next example:
proc sql;
create table res as
select count(case when a > 0 then 1 end) as n_above1_a,
count(case when b > 0 then 1 end) as n_above1_b,
count(case when c > 0 then 1 end) as n_above1_c
from test;
quit;
This gives me:
n_above1_a n_above1_b n_above1_c
1 2 2
But this does not solve my problem.
If you add a unique identifier to each row, then you can just use PROC TRANSPOSE and PROC SQL to get your result.
data test;
input a b c;
id+1;
datalines;
0 0 0
1 1 1
. 4 2
;
proc transpose data=test out=tall ;
by id ;
run;
proc sql noprint ;
create table want as
select _name_
, min(col1) as minval
, max(col1) as maxval
, sum(missing(col1)) as nmiss
, sum(col1>1) as n_above1
, sum(col1>2) as n_above2
from tall
group by _name_
;
quit;
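Result (this code counts values >1 and >2; use col1>0 and col1>1 instead to reproduce the table in the question):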
Obs _NAME_ minval maxval nmiss n_above1 n_above2
1 a 0 1 1 0 0
2 b 0 4 0 1 1
3 c 0 2 0 1 0
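For comparison, the same statistics can be collected in one pass over the wide table with arrays, which avoids the transpose. This is a sketch that assumes the three numeric variables a, b and c from the question and the thresholds >0 and >1; want_arrays and the array names are hypothetical.

```sas
data test;
  input a b c;
  datalines;
0 0 0
1 1 1
. 4 2
;
run;

data want_arrays;
  set test end=last;
  array v[*] a b c;
  /* temporary arrays keep their values from row to row */
  array mn[3]  _temporary_;
  array mx[3]  _temporary_;
  array nm[3]  _temporary_ (3*0);
  array ab1[3] _temporary_ (3*0);
  array ab2[3] _temporary_ (3*0);
  length variable $32;
  do i = 1 to dim(v);
    mn[i]  = min(mn[i], v[i]);          /* MIN and MAX ignore missing values */
    mx[i]  = max(mx[i], v[i]);
    nm[i]  = nm[i]  + missing(v[i]);
    ab1[i] = ab1[i] + (v[i] > 0);       /* a missing value is never > 0 */
    ab2[i] = ab2[i] + (v[i] > 1);
  end;
  /* after the last row, write one output row per variable */
  if last then do i = 1 to dim(v);
    variable = vname(v[i]);
    minval   = mn[i];
    maxval   = mx[i];
    nmiss    = nm[i];
    n_above1 = ab1[i];
    n_above2 = ab2[i];
    output;
  end;
  keep variable minval maxval nmiss n_above1 n_above2;
run;
```

The temporary arrays are sized to match the three variables; with a longer variable list, the 3s would need to change accordingly. The log may carry a note about missing values for a variable whose first value is missing.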
I want a data step or SQL statement to do the following.
Consider this table:
(Before)
id div dlenfol repurch rlenfol
1 0 145 1 25
2 0 114 0 114
2 0 114 0 114
3 0 189 1 53
3 0 189 0 189
3 1 149 0 189
4 1 14 0 182
4 0 182 1 46
4 0 182 0 182
Grouping by id, how do I convert all the values of dlenfol to the minimum value in the dlenfol column, and all the values of rlenfol to the minimum value in the rlenfol column?
Meanwhile I also want to create a variable called choice that:
=1 if a certain id EVER had a div=1;
=0 if a certain id EVER had a repurch=1 (but never had a div=1);
=1 if a certain id EVER had a div=1 AND EVER had a repurch=1;
and =. if the certain id never had a div=1 nor repurch=1.
i.e. Like this:
(After)
id div dlenfol repurch rlenfol choice
1 0 145 1 25 0
2 0 114 0 114 .
2 0 114 0 114 .
3 0 149 1 53 1
3 0 149 0 53 1
3 1 149 0 53 1
4 1 14 0 46 1
4 0 14 1 46 1
4 0 14 0 46 1
The code I've been trying is not working:
data comb2d;
set comb;
do;
set comb;
by id;
dmin = min(dlenfol, dmin);
rmin = min(rlenfol, rmin);
if dlenfol=dmin and rlenfol^=rmin then CHOICE=1;
else if dlenfol^=dmin and rlenfol=rmin then CHOICE=0;
else if dlenfol=dmin and rlenfol=rmin then CHOICE=1;
else CHOICE=.;
/* if DIV=1 and REPURCH=0 then CHOICE=1;
else if DIV=0 and REPURCH=1 then CHOICE=0;
else if DIV=1 and REPURCH=1 then CHOICE=1;
else CHOICE=.; */
end;
dlenfol = dmin;
rlenfol = rmin;
/* drop dmin;
drop rmin; */
run;
The following SQL code seems to solve the minimum value issue but it creates 2 variables (dmin and rmin) that I don't really need:
proc sql;
create table comb3 as
select *, min(dlenfol) as dmin, min(rlenfol) as rmin
from comb
group by comb.id;
quit;
proc sort data=before out=sort1;
by id dlenfol;
run;
data sort1;
drop temp;
set sort1;
retain temp;
by id;
if first.id then temp = dlenfol;
else dlenfol = temp;
run;
then do the same thing for rlenfol
How about something like:
proc sql;
create table comb3 as
select id, div , repurch , min(dlenfol) as dlenfol, min(rlenfol) as rlenfol
from comb
group by comb.id;
quit;
The following code seems to be working now:
proc sql;
create table comb3 as
select *, min(dlenfol) as dmin, min(rlenfol) as rmin, max(choice) as choicemax
from comb
group by comb.gvkey
order by comb.gvkey, comb.fyear;
quit;
This is a good example of a problem that can be handled by a double DOW loop. In the first loop find the minimums and check if the flag variables are ever true.
You then have the information needed to define the new CHOICE variable and set the variables to their minimum.
data want ;
do until(last.id);
set have ;
by id ;
mind=min(mind,dlenfol);
minr=min(minr,rlenfol);
anydiv=anydiv or div;
anyrep=anyrep or repurch;
end;
if anydiv then choice=1;
else if anyrep then choice=0;
else choice=.;
do until(last.id);
set have;
by id;
dlenfol=mind;
rlenfol=minr;
output;
end;
drop mind minr anydiv anyrep;
run;
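The same result can probably also be reached in a single PROC SQL step, because SAS remerges group-level summaries back onto the detail rows when non-grouped columns are selected. A sketch, assuming the Before table from the question is stored as have and that div and repurch only take the values 0 and 1; want_sql is a hypothetical name.

```sas
data have;
  input id div dlenfol repurch rlenfol;
  datalines;
1 0 145 1 25
2 0 114 0 114
2 0 114 0 114
3 0 189 1 53
3 0 189 0 189
3 1 149 0 189
4 1 14 0 182
4 0 182 1 46
4 0 182 0 182
;
run;

proc sql;
  create table want_sql as
  select id,
         div,
         min(dlenfol) as dlenfol,   /* group minimum replaces the row value */
         repurch,
         min(rlenfol) as rlenfol,
         case when max(div) = 1     then 1   /* ever had div=1 */
              when max(repurch) = 1 then 0   /* never div, but ever repurch */
              else .
         end as choice
  from have
  group by id
  order by id;
quit;
```

SAS writes a note to the log about remerging summary statistics, which is expected here. Unlike the double DOW loop, this does not guarantee the original row order within each id.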
In a summarized dataset, I have the status of an event at each hour after baseline in which it was recorded. I also have the last hour the event could have been recorded. I want to create a new dataset with one record for each hour from the first through the last hour, with the status for each record being the one from the last recorded status.
Here is an example dataset:
data new;
input hour status last_hour;
cards;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
run;
In this case, the first recorded hour was the second, and the last recorded hour was the 10th. The last possible hour to record data was the 12th.
The final dataset should look like so:
0 . 12
1 . 12
2 1 12
3 1 12
4 1 12
5 1 12
6 1 12
7 0 12
8 0 12
9 1 12
10 0 12
11 0 12
12 0 12
I sort of have it working with this series of data steps, but I'm not sure if there's a cleaner way I'm not seeing.
data step1;
set new (keep=id last_hour); /* last_hour is needed as the loop bound; hour is created by the loop */
by id;
do hour = 0 to last_hour;
output;
end;
run;
proc sort data=step1;
by id hour;
run;
proc sql;
create table step2 as
select distinct a.id, a.hour, b.status
from step1 as a
left join new as b
on a.id = b.id
and a.hour = b.hour
order by a.id, a.hour;
quit;
data step3;
set step2;
by id hour;
retain previous_status;
if first.id then do;
previous_status = .;
if status > . then previous_status = status;
end;
if not first.id then do;
if status = . and previous_status > . then status = previous_status;
if status > . then previous_status = status;
end;
run;
Judging by your code, your question left out the fact that you also have ids. So here is a newer solution that handles different ids. See further below for my first solution, which ignores ids.
Since last_hour is always 12, I left it out of the have dataset. It will be added later on.
data have;
input id hour status;
cards;
1 2 1
1 4 1
1 5 1
1 6 1
1 7 0
1 9 1
1 10 0
2 2 1
2 4 1
2 5 1
2 6 1
2 7 0
2 9 1
2 10 0
;
Create an hours dataset, containing just the numbers 0 through 12:
data hours;
do i = 0 to 12;
hour = i;
output;
end;
drop i;
run;
Create a temporary dataset that will have the right number of rows (13 rows for every id, with valid hour values where they exist in the have table).
proc sql;
create table tmp as
select distinct t1.id, t2.hour, 12 as last_hour
from have as t1
cross join
(select hour from hours) as t2;
quit;
Then use merge and retain to fill in the missing status values where appropriate.
data want;
merge have
tmp;
by id hour;
retain status_previous;
/* reset the carried value at the start of each id */
if first.id then status_previous = .;
/* a recorded status is remembered, even on the first row of an id */
if status ne . then status_previous = status;
else status = status_previous;
drop status_previous;
run;
Previous solution (no id's)
If last_hour is always 12, then this should do it:
data have;
input hour status last_hour;
datalines;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
data hours;
do i = 0 to 12;
hour = i;
last_hour = 12;
output;
end;
drop i;
run;
data want;
merge have
hours;
by hour;
retain status_previous;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
drop status_previous;
run;
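The gaps can also be filled in a single data step, without building an hours table or merging. This is a sketch, assuming the data are sorted by id and hour and that last_hour is present on every row; have_ids, want_fill and the prev_ and rec_ variables are hypothetical names.

```sas
data have_ids;
  input id hour status last_hour;
  datalines;
1 2 1 12
1 4 1 12
1 5 1 12
1 6 1 12
1 7 0 12
1 9 1 12
1 10 0 12
2 3 0 6
2 5 1 6
;
run;

data want_fill;
  set have_ids;
  by id;
  retain prev_hour prev_status;
  if first.id then do;
    prev_hour = -1;        /* so that filling starts at hour 0 */
    prev_status = .;
  end;
  rec_hour = hour;         /* keep the record's own values safe */
  rec_status = status;
  /* hours between the previous record and this one carry the previous status */
  do hour = prev_hour + 1 to rec_hour - 1;
    status = prev_status;
    output;
  end;
  hour = rec_hour;
  status = rec_status;
  output;
  prev_hour = rec_hour;
  prev_status = rec_status;
  /* after the last record, carry its status through to last_hour */
  if last.id then do hour = rec_hour + 1 to last_hour;
    output;
  end;
  drop prev_: rec_:;
run;
```

For id 1 this yields hours 0 through 12, with a missing status at hours 0 and 1 and the last recorded status carried forward afterwards, as in the desired output from the question. Because last_hour is read from the data, it does not have to be 12 for every id.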