How to create 2 new columns with appropriate prefix based on values in columns with same prefix in SAS Enterprise Guide / PROC SQL? - sas

I have table in SAS Enterprise Guide like below:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 1 | 5 | 110 | 350
444 | 20 | 5 | 670 | 0
Requirements:
I need to create a new column "TOP_COUNT" holding the name of the column (COUNT_COL_A or COUNT_COL_B) with the highest value for each ID.
If an ID has the same value in both "COUNT_" columns, "TOP_COUNT" should take the name of the column whose counterpart with the SUM_ prefix (SUM_COL_A or SUM_COL_B) has the higher value.
I need to create a new column "TOP_SUM" holding the name of the column (SUM_COL_A or SUM_COL_B) with the highest value for each ID.
If an ID has the same value in both "SUM_" columns, "TOP_SUM" should take the name of the column whose counterpart with the COUNT_ prefix (COUNT_COL_A or COUNT_COL_B) has the higher value.
For a given ID, the columns with the COUNT_ prefix are never all 0, and neither are the columns with the SUM_ prefix.
There are no nulls in the table.
Desired output:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111 | 10 | 10 | 320 | 120 | COUNT_COL_A | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_B
333 | 1 | 5 | 110 | 350 | COUNT_COL_B | SUM_COL_B
444 | 20 | 5 | 670 | 0 | COUNT_COL_A | SUM_COL_A
How can I do that in SAS Enterprise Guide or in PROC SQL?

Use an array with loops methodology:
Declare an array of the count variables
Set the maximum value to 0
Loop through the array
Check if each value is more than the current maximum
If yes, assign the value to the current maximum and store the name
If no, keep looping
Non looping, function methodology:
Use MAX to find the maximum value of the array
Use WHICHN() to find the position of that maximum value in the array
Use VNAME to get the variable name based on the position
*for count - you can extend the same logic to the sum columns;
data want;
set have;
array _count(*) count_col_:;
*looping methodology;
top_count_value=0;
do i=1 to dim(_count);
if _count(i) > top_count_value then do;
top_count = vname(_count(i));
top_count_value = _count(i);
end;
end;
/*or function methodology*/
top_count_max = max(of _count(*));
index_top_count = whichn(top_count_max, of _count(*));
top_count_name_2 = vname(_count(index_top_count));
run;
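Neither methodology above applies the tie-break rule from the question (equal COUNT_ values are decided by the SUM_ counterpart, and vice versa). A minimal sketch of one way to add it, assuming the COUNT_ and SUM_ columns are stored in the same order so that position i in one array pairs with position i in the other. The names want_tiebreak, c and s are made up for illustration, and a full tie on both measures keeps the first column.

```sas
data have;
  input ID COUNT_COL_A COUNT_COL_B SUM_COL_A SUM_COL_B;
datalines;
111 10 10 320 120
222 15 80 500 500
333 1 5 110 350
444 20 5 670 0
;

data want_tiebreak;
  set have;
  array _count(*) count_col_:;
  array _sum(*) sum_col_:;
  length top_count top_sum $32;
  c = 1; /* position of the best COUNT_ column so far */
  s = 1; /* position of the best SUM_ column so far */
  do i = 2 to dim(_count);
    /* larger count wins; on equal counts the larger paired sum wins */
    if _count(i) > _count(c)
       or (_count(i) = _count(c) and _sum(i) > _sum(c)) then c = i;
    /* larger sum wins; on equal sums the larger paired count wins */
    if _sum(i) > _sum(s)
       or (_sum(i) = _sum(s) and _count(i) > _count(s)) then s = i;
  end;
  top_count = vname(_count(c));
  top_sum = vname(_sum(s));
  drop c s i;
run;
```

With the four sample rows this gives COUNT_COL_A / SUM_COL_A for IDs 111 and 444, and COUNT_COL_B / SUM_COL_B for IDs 222 and 333.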

Just do the same thing as in your other question. But because you want to transpose two sets of variables, it is probably going to be easier to use a data step and arrays to do the first transform.
data tall;
set have;
array counts count_col_a count_col_b;
array sums sum_col_a sum_col_b;
do index=1 to dim(sums);
length type $5 name $32 ;
type='COUNT';
name=vname(counts[index]);
value1=counts[index];
value2=sums[index];
output;
type='SUM';
name=vname(sums[index]);
value1=sums[index];
value2=counts[index];
output;
end;
run;
Now sort and take the last per ID/TYPE combination to find the largest.
proc sort;
by id type value1 value2 name;
run;
data top;
set tall;
by id type value1 value2;
if last.type;
run;
And then transpose and re-merge.
proc transpose data=top out=want(drop=_name_) prefix=TOP_;
by id;
id type;
var name;
run;
data want;
merge have want;
by id;
run;
Result:
COUNT_ COUNT_ SUM_ SUM_
Obs ID COL_A COL_B COL_A COL_B TOP_COUNT TOP_SUM
1 111 10 10 320 120 COUNT_COL_A SUM_COL_A
2 222 15 80 500 500 COUNT_COL_B SUM_COL_B
3 333 1 5 110 350 COUNT_COL_B SUM_COL_B
4 444 20 5 670 0 COUNT_COL_A SUM_COL_A
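Since the question also asks about PROC SQL, here is a rough sketch of the same rule written as CASE expressions. It assumes there are only the two column pairs shown in the question; the table name want_sql is hypothetical, and a row that ties on both the count and the sum falls back to the _A column, which is an assumption the question does not cover.

```sas
data have;
  input ID COUNT_COL_A COUNT_COL_B SUM_COL_A SUM_COL_B;
datalines;
111 10 10 320 120
222 15 80 500 500
333 1 5 110 350
444 20 5 670 0
;

proc sql;
  create table want_sql as
  select h.*,
         case
           when count_col_a > count_col_b then 'COUNT_COL_A'
           when count_col_a < count_col_b then 'COUNT_COL_B'
           /* counts are equal: decide on the paired SUM_ column */
           when sum_col_a >= sum_col_b then 'COUNT_COL_A'
           else 'COUNT_COL_B'
         end as TOP_COUNT,
         case
           when sum_col_a > sum_col_b then 'SUM_COL_A'
           when sum_col_a < sum_col_b then 'SUM_COL_B'
           /* sums are equal: decide on the paired COUNT_ column */
           when count_col_a >= count_col_b then 'SUM_COL_A'
           else 'SUM_COL_B'
         end as TOP_SUM
  from have as h;
quit;
```

This stays readable for two pairs; with more columns the array or transpose approaches scale better.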

Related

How to aggregate col1 per ID and val1 per ID and values in col1 in SAS Enterprise Guide or PROC SQL?

I have table in SAS Enterprise Guide like below:
ID | COL1 | VAL1 |
----|------|------|
111 | A | 10 |
111 | A | 5 |
111 | B | 10 |
222 | B | 20 |
333 | C | 25 |
... | ... | ... |
And I need to aggregate above table to know:
the number of occurrences of each COL1 value per ID
sum of values from VAL1 per COL1 per ID
So, as a result I need something like below:
ID | COL1_A | COL1_B | COL1_C | COL1_A_VAL1_SUM | COL1_B_VAL1_SUM | COL1_C_VAL1_SUM
----|--------|--------|---------|-----------------|-----------------|------------------
111 | 2 | 1 | 0 | 15 | 10 | 0
222 | 0 | 1 | 0 | 0 | 20 | 0
333 | 0 | 0 | 1 | 0 | 0 | 25
for example because:
COL1_A = 2 for ID 111, because ID=111 has 2 times "A" in COL1
COL1_A_VAL1_SUM = 15 for ID 111, because ID=111 has 10+5=15 in VAL1 for "A" in COL1
How can I do that in SAS Enterprise Guide or in PROC SQL?
First, we'll create the counts that we need by group with SQL:
proc sql;
create table totals_by_group as
select id
, col1
, count(col1) as count_col1
, sum(val1) as sum_val1
from have
group by id, col1
;
quit;
This produces the following table:
id col1 count_col1 sum_val1
111 A 2 15
111 B 1 10
222 B 1 20
333 C 1 25
Now we need to transpose this into the way we want it. We'll do this with two transpose steps: one for count_col1, and one for sum_val1. proc transpose has a few handy options to make this easy, namely the id, prefix, and suffix options.
First, we'll consider our ID variable col1. This creates columns named A, B, and C. For example:
id A B C
111 2 1 .
222 . 1 .
333 . . 1
The prefix and suffix options let us add a prefix and suffix to these names.
proc transpose
data = totals_by_group
out = count_by_group(drop=_NAME_)
prefix = COL1_;
by id;
id col1;
var count_col1;
run;
proc transpose
data = totals_by_group
out = sum_by_group(drop=_NAME_)
prefix = COL1_
suffix = _VAL1_SUM;
by id;
id col1;
var sum_val1;
run;
This gives us two tables:
COUNT_BY_GROUP
id COL1_A COL1_B COL1_C
111 2 1 .
222 . 1 .
333 . . 1
SUM_BY_GROUP
id COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 15 10 .
222 . 20 .
333 . . 25
Now we just need to merge them together, then set all missing values to 0 by iterating over each numeric column and checking if it's missing.
data want;
merge count_by_group
sum_by_group
;
by id;
array numvars[*] _NUMERIC_;
do i = 1 to dim(numvars);
if(missing(numvars[i])) then numvars[i] = 0;
end;
drop i;
run;
Final table:
id COL1_A COL1_B COL1_C COL1_A_VAL1_SUM COL1_B_VAL1_SUM COL1_C_VAL1_SUM
111 2 1 0 15 10 0
222 0 1 0 0 20 0
333 0 0 1 0 0 25
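If the COL1 categories are known in advance, the same table can also be sketched in a single PROC SQL step with conditional aggregation. This is an alternative to the transpose approach above, not a replacement for it: the category values A, B and C are hard-coded here, which is an assumption, and want_sql is a made-up table name.

```sas
data have;
  input ID COL1 $ VAL1;
datalines;
111 A 10
111 A 5
111 B 10
222 B 20
333 C 25
;

proc sql;
  create table want_sql as
  select id,
         /* a comparison evaluates to 1 or 0 in SAS, so SUM counts matches */
         sum(col1 = 'A') as COL1_A,
         sum(col1 = 'B') as COL1_B,
         sum(col1 = 'C') as COL1_C,
         /* add VAL1 only for rows of the given category */
         sum(case when col1 = 'A' then val1 else 0 end) as COL1_A_VAL1_SUM,
         sum(case when col1 = 'B' then val1 else 0 end) as COL1_B_VAL1_SUM,
         sum(case when col1 = 'C' then val1 else 0 end) as COL1_C_VAL1_SUM
  from have
  group by id;
quit;
```

Because every expression returns 0 when nothing matches, no separate step is needed to replace missing values. The trade-off is that a new category requires editing the query, whereas PROC TRANSPOSE picks it up automatically.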

How can I add observations to the existing dataset based on dates?

I have a dataset like this:
data have;
input date :date9. index;
format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;
run;
I would like to add observations with dates from the maximum date (hence from 30JUN2019) until 31DEC2019 (by months) with the value of index being the last available value: 14. How can I achieve this in SAS? I want the code to be flexible, thus for every such dataset, take the maximum of date and add monthly observations from that maximum until DEC2019 with the value of index being equal to the last available value (here in the example the value in JUN2019).
An explicit DO loop over the SET provides the foundation for a concise solution with no extraneous worker variables. Automatic variable last is automatically dropped.
data have;
input date :date9. index;
format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;
data want;
do until (last);
set have end=last;
output;
end;
do last = month(date) to 11; %* repurpose automatic variable last as a loop index;
date = intnx ('month',date,1,'e');
output;
end;
run;
Always helpful to refresh understanding. From SET Options documentation
END=variable
creates and names a temporary variable that contains an end-of-file indicator. The variable, which is initialized to zero, is set to 1 when SET reads the last observation of the last data set listed. This variable is not added to any new data set.
You can do it using the END= option on the SET statement together with a RETAIN statement.
data want(drop=i tIndex tDate);
set have end=eof;
retain tIndex tDate;
if eof then do;
tIndex=Index;
tDate=Date;
end;
output;
if eof then do;
do i=1 to 12-month(tDate);
index=tIndex;
date = intnx('month',tDate,i,'e');
output;
end;
end;
run;
INPUT:
+-----------+-------+
| date | index |
+-----------+-------+
| 31MAR2019 | 10 |
| 30APR2019 | 12 |
| 31MAY2019 | 15 |
| 30JUN2019 | 14 |
+-----------+-------+
OUTPUT:
+-----------+-------+
| date | index |
+-----------+-------+
| 31MAR2019 | 10 |
| 30APR2019 | 12 |
| 31MAY2019 | 15 |
| 30JUN2019 | 14 |
| 31JUL2019 | 14 |
| 31AUG2019 | 14 |
| 30SEP2019 | 14 |
| 31OCT2019 | 14 |
| 30NOV2019 | 14 |
| 31DEC2019 | 14 |
+-----------+-------+
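Both answers derive the number of extra rows from month(date), which ties them to a December end date in the same year. A possible variation, sketched under the assumption that the input is already sorted so the last row holds the maximum date: the macro variable end_date and the names want_flex, last_date and n are hypothetical, and INTCK counts the month boundaries between the last date and the target.

```sas
data have;
  input date :date9. index;
  format date date9.;
datalines;
31MAR2019 10
30APR2019 12
31MAY2019 15
30JUN2019 14
;

%let end_date = '31DEC2019'd; /* hypothetical parameter: last month to fill */

data want_flex;
  set have end=eof;
  output;                     /* keep every original row */
  if eof then do;
    last_date = date;         /* anchor so the loop bound is not affected by updates to date */
    do n = 1 to intck('month', last_date, &end_date);
      date = intnx('month', last_date, n, 'e'); /* end of the n-th following month */
      output;                 /* index still holds the last observed value */
    end;
  end;
  drop n last_date;
run;
```

For the sample data this adds the six month-end rows from 31JUL2019 to 31DEC2019, each with index 14. If the end date is not later than the last date, the loop does not execute and no rows are added.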

SAS - Combine like values within rows, then add new variable for non like value(s)

I have a large dataset and am trying to run an analysis on each customer (same account and routing #); each customer has hundreds of transactions in the dataset. I was able to add a SEQ # for like acct #s and routing #s. How would I run an analysis to, say, take SEQ #1 and give the total # of deposits (Amount), the max and min of deposits, and potentially some other helpful data?
+-----------+--------+---------+--------+-------+
| Routing#  | Acct#  | AMOUNT  | TOTAL  | SEQ # |
+-----------+--------+---------+--------+-------+
| 518       | 0      | 490.50  | 3777.5 | 1     |
| 518       | 0      | 170.00  | 3777.5 | 1     |
| 518       | 0      | 3117.00 | 3777.5 | 1     |
| 518       | 99     | 875.00  | 875    | 2     |
| 518       | 999    | 499.00  | 499    | 3     |
| 519       | 2      | 100.00  | 200.00 | 4     |
| 519       | 2      | 100.00  | 200.00 | 4     |
+-----------+--------+---------+--------+-------+
Thanks
There are multiple ways to do this, but here is a data step way
data have;
input Routing Acct AMOUNT;
datalines;
518 0 490.50
518 0 170.00
518 0 3117.00
518 99 875.00
518 999 499.00
519 2 100.00
519 2 100.00
;
data want;
do until (last.Acct);
set have;
by Routing Acct notsorted;
total+amount;
end;
seq+1;
do until (last.Acct);
set have;
by Routing Acct notsorted;
output;
end;
total=0;
run;
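The data step above reproduces the TOTAL and SEQ columns. For the per-customer statistics the question asks about (number of deposits, max, min), one option is PROC MEANS with a CLASS statement. This is only a sketch; the output dataset name stats and the statistic names are made up for illustration.

```sas
data have;
  input Routing Acct AMOUNT;
datalines;
518 0 490.50
518 0 170.00
518 0 3117.00
518 99 875.00
518 999 499.00
519 2 100.00
519 2 100.00
;

/* one output row per Routing/Acct combination */
proc means data=have noprint nway;
  class Routing Acct;
  var Amount;
  output out=stats(drop=_type_ _freq_)
         n=n_deposits sum=total_amount min=min_amount max=max_amount;
run;
```

For Routing 518 / Acct 0 this yields 3 deposits with a total of 3777.5, a minimum of 170 and a maximum of 3117. Further statistics such as mean= or median= can be added to the OUTPUT statement in the same way.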

Proc sql subquery based on nonexistent column returns not null

Here is sample code derived from an actual application. There are two datasets: "aa" for the query and "bb" for the subquery. Column "m" in dataset "aa" matches column "y" in dataset "bb". Table "aa" also has a column "yy" whose value is always 30. Column "m" in "aa" contains the value 30 in one of its rows; column "y" in "bb" does not.
The first proc sql uses the values from column "y" of table "bb" to subset table "aa" on matching values in column "m". It is a correct query and produces the expected results. The second proc sql block has column "y" intentionally misspelled as "yy" in the subquery, on the line that starts with the where clause. Otherwise the whole block is the same as the first one.
Given that there is no column "yy" in dataset "bb", I would expect an error message and the whole query to fail. However, it returns one row without failing and without any error message. A closer look suggests that it actually uses column "yy" from table "aa" (see the tree in the log output). I do not think this is correct behavior. If you have any comments or explanations, I would greatly appreciate them. Otherwise, maybe I should report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=
0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag= 0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes Description
sqxcrta Create table as Select
Sqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary stats with GROUP BY
sqxsumn Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and table names used, provided that the variable name from AA is the one used in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name starts the same as the name of the corresponding variable in AA, which then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "myvar_plus_suffix" is not on "bb" table*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports Correlated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;
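To make the scoping explanation concrete, here is a small sketch that writes the outer reference out explicitly. If the explanation from SAS support holds, the first query should behave like the misspelled one from the question and return only the row where m = 30, while the second, with the column qualified by the inner table's alias, is the intended query. The aliases a and b and the table names dd_explicit and cc_qualified are made up for illustration.

```sas
data aa;
  do i = 1 to 20;
    m = i * 5;
    yy = 30;
    output;
  end;
run;

data bb;
  do i = 10 to 20;
    y = i * 5;
    output;
  end;
run;

proc sql;
  /* explicit correlated reference: for each row of aa the subquery */
  /* returns that row's own yy (30) once per row of bb              */
  create table dd_explicit as
  select *
  from aa as a
  where a.m in (select a.yy from bb as b);

  /* qualifying with the inner alias ties the column to bb, so a */
  /* misspelled name here is reported instead of being resolved  */
  /* against aa                                                  */
  create table cc_qualified as
  select *
  from aa as a
  where a.m in (select b.y from bb as b);
quit;
```

Qualifying every column inside a subquery with its table alias is a cheap habit that turns this kind of silent typo into an error.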

Percentage calculation based on multiple columns SAS

I have data with a patient ID column and a second column with the illness, which can hold more than one category separated by commas. I need to find the total number of patients per illness category and the percentage of patients per category. I tried the normal way; it gives the correct frequency but not the correct percent.
The data looks like this.
ID Type_of_illness
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
where the empty spaces represent no illness. I first separated the illnesses into separate columns, but then got stuck there not knowing how to process to find out the percent.
The code I wrote is as below:
Data new;
set old;
array P(3) $ L1 L2 L3;
do i=1 to dim(p);
p(i)=scan(type_of_illness,i,',');
end;
run;
Then I created a new column to copy all the illnesses to it so I thought it would give me the correct frequency, but it did not give me the correct percent.
data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;
I have to create a table something like
*Total number of patients Percent*
.......................................
lf5
lf7
lf6
lf11
lf12
lf13
Please help.
You're trying to output percentages on non-mutually exclusive groups (each illness). It isn't obvious in SAS how to do this.
The following takes Joe's input code but takes an alternative route in determining percentages from event data (a 'long' dataset, if you will). I prefer this to creating a binary variable for each illness at the patient level (a 'wide' dataset) as, for me, this soon gets unwieldy. That said, if you then go on to do some modelling then a 'wide' dataset is usually more useful.
The following code produces output as follows:-
| | Pats | Pats | | | Mean | | |
| | with 0 |with 1+ | % with | Num | events | | |
| |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12 | 19| 7| 27| 7| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7 | 24| 2| 8| 2| 1.0| 0.00| 1|
---------------------------------------------------------------------------------------|
Note that patient 5 is repeated in your data for illness lf5. My code only counts this record once. This is fine if a chronic illness but not if acute. Also, my code includes patients in the denominator who do not have an event.
Finally, you can see another example of this code using dates - with test data - here at the mycodestock.com code sharing site => https://mycodestock.com/public/snippet/11251
Here's the code for the table above:-
options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
proc sort;
by id;
run;
** Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
by id;
run;
** Create event table (1 row per patient*event);
** NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
set have;
attrib grp length = $5 label = 'Illness';
do i = 1 to countc(type_of_illness, ',') + 1;
grp = scan(type_of_illness, i, ',');
if grp ne '' then output;
end;
run;
** Count the number of events each patient had for each grp;
** NOTE: The NODUPKEY in the PROC SORT remove duplicate records (within PAT & GRP);
** NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort data = events out = perc2_summ_grp_pat nodupkey;
by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
by grp;
class id;
output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;
** Add a denominator variable - value '1' for each row.;
** Ensure when num_events = 0 the value is set to missing;
** Create a flag variable - set to 1 - if a patient has a record (no matter how many);
data perc2_summ_grp_pat;
set perc2_summ_grp_pat;
denom = 1;
if num_events = 0 then num_events = .;
flg_scripts = ifn(num_events, 1, .);
run;
proc tabulate data = perc2_summ_grp_pat format=comma8.;
title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
title2 "Units: Patients - within each group level";
title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";
class grp;
var denom flg_scripts num_events;
table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record')
num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;
You're going about this the right way, but you need to pick the percent differently. Normally percent is 'percent of whole dataset', which means that it is going to triplicate your base. You want the percent based on each illness. This means you need a 1/0 for each illness.
The one downside is you have the 0's in your automatic tables; you would have to output the table to a dataset and remove them, then proc print/report the resulting dataset to get the 1's only - or use PROC SQL to generate the table.
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;
data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
else L[_t]=0;
end;
run;
proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;
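The PROC SQL route mentioned above could look roughly like the sketch below. It builds one row per patient and illness, then divides the number of distinct patients per illness by the number of distinct patients overall, so patients without an illness stay in the denominator. To keep it short it uses only a handful of rows copied from the question; the names have_small, events, n_pat and pct_by_illness are made up for illustration.

```sas
data have_small;
  infile datalines truncover;
  input ID Type_of_illness :$30.;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
131 lf6,lf7,lf12
5 lf5,lf7
;

/* one row per patient and illness; blank illness lists produce no rows */
data events(keep=id grp);
  set have_small;
  length grp $5;
  do i = 1 to countw(type_of_illness, ',');
    grp = scan(type_of_illness, i, ',');
    output;
  end;
run;

proc sql noprint;
  /* denominator: all distinct patients, with or without an illness */
  select count(distinct id) into :n_pat trimmed
  from have_small;

  create table pct_by_illness as
  select grp,
         count(distinct id) as n_patients,
         calculated n_patients / &n_pat * 100 as pct_patients format=8.1
  from events
  group by grp;
quit;
```

On this subset there are five distinct patients, so lf5 (patient 5 only, counted once despite appearing twice) comes out at 20% and lf12 (patients 13 and 131) at 40%. Only illnesses that actually occur appear in the result, which avoids the 0 rows of the PROC TABULATE output.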
The multilabel format solution is interesting, so I present it separately.
Using the same have, we create a format that takes every combination of illnesses and outputs a row for each illness in it, ie, if you have "1,2,3", it outputs rows
1,2,3 = 1
1,2,3 = 2
1,2,3 = 3
Enabling multilabel formats and using a class-enabled proc like proc tabulate, you can then use this to allow each respondent to count in each of the label values, but not be counted more than once against the total.
data for_procformat;
set have;
start=type_of_illness; *start is the input to the format;
hlo=' m'; *m means multilabel, adding a space
here to leave room for the o later;
type='c'; *character format - n is numeric;
fmtname='$ILLF'; *whatever name you like;
do _t = 1 to countw(type_of_illness,','); *for each 'word' do this once;
label=scan(type_of_illness,_t,','); *label is the 'result' of the format;
if not missing(label) then output;
end;
if _n_=1 then do; *this block adds a row to deal with values;
hlo='om'; *not defined (in this case, just missings);
label='No Illness'; *the o means 'other';
output;
end;
run;
proc sort data=for_procformat nodupkey; *remove duplicates (which there will be many);
by start label;
run;
proc format cntlin=for_procformat; *import the formats;
quit;
proc tabulate data=have;
class type_of_illness/mlf missing ; *mlf means multilabel formats;
format type_of_illness $ILLF.; *apply said format;
tables type_of_illness,n pctn; *and your table;
run;