SAS: Count number of a particular type of disease with patient data on multiple lines - sas

I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
We have found that a particular type of disease is frequently comorbid with a common condition.
I would like to count the number of this type of disease that each patient has, and then create a histogram showing how many people have 1,2,3,4, etc. additional diseases.
This is the format of the data.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 282.1 1/2/10 F 87
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 282.1 12/15/13 M 41
2 232.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 601.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Generally, I was thinking of a DO loop, but I'm not sure where to start because there are duplicates in the dataset, like with patient 1 (282.1 is listed twice). I'm not sure how to account for that. Any thoughts?
Target diagnoses to count would be 282.1, 232.1, 250.02. In this example, patient 1 would have a count of 3, patient 2 would have 2, etc.
Edit:
This is what I have used, but the output is showing each PatientID on multiple lines in the output.
PROC SQL;
create table want as
select age, gender, patientID,
count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID;
quit;
This is what the output table looks like. Why is this patientID showing up so many times?
Obs AGE GENDER PATIENTID count
1 55 Male 107828695 1
2 54 Male 107828695 1
3 54 Male 107828695 1
4 54 Male 107828695 1
5 54 Male 107828695 1

If you include variables that are neither grouping variables or summary statistics then SAS will happily re-merge your summary statistics back with all of the source records. That is why you are getting multiple records. AGE can usually vary if your dataset covers many years. And GENDER can also vary if your data is messy. So for a quick analysis you might try something like this.
create table want as
select patientID
, min(age) as age_at_onset
, min(gender) as gender
, count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID
;

I think you can get what you want with an SQL statement
PROC SQL NOPRINT;
create table want as
select PatientID,
count(distinct Diagnosis) as count
from have
where Diagnosis in (282.1, 232.1, 250.02)
group by PatientID;
quit;
This filters to only the diagnoses you are interested in, counts the distinct times they are seen, by the PatientID, and saves the results to a new table.

Related

Rolling Window Model for Unbalanced Dataset in SAS

I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;​
I now want to estimate the following model for each ID:
proc reg;
by ID;
EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;​
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.
Try this macro. This will only output if there are non-missing values of lags that you specify.
%macro lag(data=, out=, window=);
data _want_;
set &data.;
by ID;
LAG_EARN = lag&window.(earn);
if(first.ID) then call missing(lag_earn);
if(NOT missing(lag_earn));
run;
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);

removing common prefix or suffix

I have a data set contains a series variables named; PG_86xt, AG_86xt,... with same suffix _86xt. How can I remove such suffix while renaming these variables?
I know how to add prefix or suffix. But the logic of removing them seems to be a little bit different. I think proc dataset modify is still the way to go. But the length of substring before suffix (or after prefix) is unknown.
The example on how to add prefix or suffix
data one;
input id name :$10. age score1 score2 score3;
datalines;
1 George 10 85 90 89
2 Mary 11 99 98 91
3 John 12 100 100 100
4 Susan 11 78 89 100
;
run;
proc datasets library = work nolist;
modify one;
rename &suffixlist;
quit;
You can use the scan function to get the desired result.
By altering the example you have in the link to fit your example:
data one;
input id name :$10. age PG_86xt AG_86xt IG_86xt;
datalines;
1 George 10 85 90 89
2 Mary 11 99 98 91
3 John 12 100 100 100
4 Susan 11 78 89 100
;
run;
By filtering on only those column that fits your convention (XX_86xt), you could use the first part of the scan for renaming.
proc sql noprint;
select cats(name,'=',scan(name, 1, '_'))
into :suffixlist
separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = 'ONE' and '86xt' = scan(name, 2, '_');
quit;
You can use the index function to find the (first) place in each variable name where the suffix / prefix starts, then use that to construct appropriate parameters for substr. It's a bit more work than the code in your example, but you'll get there.

SAS - group individual observations

Sorry I'm new to a lot of the features of SAS - I've only been using for a couple months, mostly for survey data analysis but now I'm working with a dataset which has individual level data for a cross-over study. It's in the form: ID treatment period measure1 measure2 ....
What I want to do is be able to group these individuals by their treatment group and then output a variable with a group average for measure 1 and measure 2 and another variable with the count of observations in each group.
ie
ID trt per m1 m2
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
what I want is the data in the form:
group a = where trt=1 & per=1
group b = where trt=2 & per=2
group c = where trt=2 & per=1
group d = where trt=1 & per=2
trtgrp avg_m1 avg_m2 n
A 102 76 2
B ... ... ...
C
D
Thank you for the help.
/Creating Sample dataset/
data test;
infile datalines dlm=" ";
input ID : 8.
trt : 8.
per : 8.
m1 : 8.
m2 : 8.;
put ID=;
datalines;
1 1 1 101 75
1 2 2 135 89
2 1 1 103 77
2 2 2 140 87
3 2 1 134 79
3 1 2 140 80
4 2 1 156 98
4 1 2 104 78
;
run;
/Using proc summary to summarize trt and per/
Variables(dimensions) on which you want to summarize would go into class
Variables(measures) for which you want to have average would go into var
Since you want to have produce average so you will have to write mean as the desired statistics.
Read more about proc summary here
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473735.htm
and here
http://web.utk.edu/sas/OnlineTutor/1.2/en/60476/m41/m41_19.htm
proc summary data=test nway;
class trt per;
var m1 m2;
output out=final(drop= _type_)
mean=;
run;
The alternative method uses PROC SQL, the advantage being that it makes use of plain-English syntax, so the concept of a group in your question is maintained in the syntax:
PROC SQL;
CREATE TABLE final AS
SELECT
trt,
per,
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY trt, per;
QUIT;
You can even add your own group headings by applying conditional CASE logic as you did in your question:
PROC SQL;
CREATE TABLE final AS
SELECT
CASE
WHEN trt=1 AND per=1 THEN 'A'
WHEN trt=2 AND per=2 THEN 'B'
WHEN trt=2 AND per=1 THEN 'C'
WHEN trt=1 AND per=2 THEN 'D'
END AS group
avg(m1) AS avg_m1,
avg(m2) AS avg_m2,
count(*) AS n
FROM
test
GROUP BY group;
QUIT;
COUNT(*) simply counts the number of rows found within the group. The AVG function calculates the average for the given column.
In each example, you can replace the explicitly named columns in the GROUP BY clause with a number representing column position in the SELECT clause.
GROUP BY 1,2
However, take care with this method, as adding columns to the SELECT clause later can cause problems.

Creating All Possible Combinations in a Table Using SAS

I have a table with four variables and i want the table a table with combination of all values. Showing a table with only 2 columns as an example.
NAME AMOUNT COUNT
RAJ 90 1
RAVI 20 4
JOHN 30 5
JOSEPH 40 3
The following output is to show the values only for raj and the output should be for all names.
NAME AMOUNT COUNT
RAJ 90 1
RAJ 90 4
RAJ 90 5
RAJ 90 3
RAJ 20 1
RAJ 20 4
RAJ 20 5
RAJ 20 3
RAJ 30 1
RAJ 30 4
RAJ 30 5
RAJ 30 3
RAJ 40 1
RAJ 40 4
RAJ 40 5
RAJ 40 3
.
.
.
.
There are a couple of useful options in SAS to do this; both create a table with all possible combinations of variables, and then you can just drop the summary data that you don't need. Given your initial dataset:
data have;
input NAME $ AMOUNT COUNT;
datalines;
RAJ 90 1
RAVI 20 4
JOHN 30 5
JOSEPH 40 3
;;;;
run;
There is PROC FREQ with SPARSE.
proc freq data=have noprint;
tables name*amount*count/sparse out=want(drop=percent);
run;
There is also PROC TABULATE.
proc tabulate data=have out=want(keep=name amount count);
class name amount count;
tables name*amount,count /printmiss;
run;
This has the advantage of not conflicting with the name for the COUNT variable.
Try
PROC SQL;
CREATE TABLE tbl_out AS
SELECT a.name AS name
,b.amount AS amount
,c.count AS count
FROM tbl_in AS a, tbl_in AS b, tbl_in AS c
;
QUIT;
This performs a double self-join and should have the desired effect.
Here's a variation on #JustinJDavies's answer, using an explicit CROSS JOIN clause:
data have;
input NAME $ AMOUNT COUNT;
datalines;
RAJ 90 1
RAVI 20 4
JOHN 30 5
JOSEPH 40 3
run;
PROC SQL;
create table combs as
select *
from have(keep=NAME)
cross join have(keep=AMOUNT)
cross join have(keep=COUNT)
order by name, amount, count;
QUIT;
Results:
NAME AMOUNT COUNT
JOHN 20 1
JOHN 20 3
JOHN 20 4
JOHN 20 5
JOHN 30 1
JOHN 30 3
JOHN 30 4
JOHN 30 5
...

how to find the maximum value across dynamically change observations in sas?

so i have the following dataset with three variables: account, balance, and time.
account balance time
1 110 01/2006
1 111 02/2006
1 88 03/2006
1 61 04/2006
1 1203 05/2006
2 112 01/2006
2 111 02/2006
2 665 03/2006
2 61 04/2006
2 1243 05/2006
3 110 01/2006
3 111 02/2006
3 88 03/2006
3 61 04/2006
3 1203 05/2006
each account has more records. so the starting time might be before what I wrote and the ending time might be after what i wrote.
so my question is:
I am trying to find the maximum balance for each account in prievious 12 monthes. For example, for account 3 on 05/2006, i am trying to find max(account 3 balance at 04/2006, account 3 balance at 03/2006, account 3 balance at 03/2006,............., account 3 balance at 04/2006).
what is in your mind? what i did is to use lag function with array. however, it is NOT so efficient. because i will be in trouble if i need previous 120 months.
Thank you.
Best
Xintong
Try this:
PROC SQL;
create table max_balance as
select *
from your_table
group by account
having balance=max(balance)
;
QUIT;
options mprint;
%macro lag;
data temp (drop=count i);
set balance;
by acct time;
array x(*) lag_bal1 - lag_bal3;
%do i=1 %to 3;
lag_bal&i=lag&i.(balance);
%end;
if first.acct then count=1;
do i=count to dim(x);
call missing(x(i));
end;
count+1;
max_ball=max(balance, of lag_bal1-lag_bal3);
%mend;
%lag;