Counting categorical variables on row in SAS - sas

Sample Data
I was wondering if it is possible to use data instead of proc to count the number of categorical variables on a row as shown in 'count' example above. This will allow me to further use the data e.g COUNT=1 or COUNT > 1 to check morbidity.
Also will it be possible to then count the number of each diagnosis in the entire data set per patient while accounting for duplicates if there is any? For example there are 3 CB's and 2 AA's in this data set but CB should be 2 because patient 2 had it recorded twice.
Thank you for your time and have a lovely new year.

Your question is not clear but your could manage your diag using union all and count distinct
selec patient count(distinct diag )
from (
select patient, diag1 as diag
from my_table
uniona all
select patient, diag2
from my_table
uniona all
select patient, diag3
from my_table
uniona all
select patient, diag4
from my_table
) t
group by patient
or simply union and count
selec patient count(diag )
from (
select patient, diag1 as diag
from my_table
uniona
select patient, diag2
from my_table
uniona
select patient, diag3
from my_table
uniona
select patient, diag4
from my_table
) t
group by patient

The image indicates that for each row you want a count of the number of columns with non-missing values. Additionally, you apparently have some way to do this using a PROC step, but would like to know how using a DATA step.
In DATA step you can count the number of non-missing values indirectly using CMISS, or directly using COUNTC against a constructed value:
data have;
attrib pid length=8 diag1-diag4 length=$5;
input pid & diag1-diag4;
datalines;
1 AA J9 HH6 .
2 CB . . CB
3 J10 AA CB J10
4 B B . F90 .
5 J10 . . .
6 . . . .
run;
data have_with_count;
set have;
count = 4 - cmiss (of diag1-diag4);
count_way2 = countc(catx('~', of diag1-diag4, 'SENTINEL'), '~');
run;
In order to work again MySQL data source you will also need a libref that connects you to that remote data server.
Added
Counting distinct values across a row can be accomplished using a hash or sortc. Consider this example that sorts a copy of the row data (as an array) and counts the unique values within:
data want;
set have;
array diag diag1-diag4;
array v(4) $5 _temporary_;
do _n_ = 1 to dim(diag);
v(_n_) = diag(_n_);
end;
call sortc(of v(*));
uniq = 0;
do _n_ = 1 to dim(v);
if missing(v(_n_)) then continue;
if uniq = 0 then
uniq + 1;
else
uniq + ( v(_n_) ne v(_n_-1) );
end;
run;

With Richard's dummy data to count number of diagnosis and unique number of diagnosis:
data want;
set have;
array var diag:;
length temp $30.;
call missing(diag_num);
do over var;
if not missing(var) then do;
diag_num+1;
temp=ifc(whichc(var, temp),temp,catx(' ',temp,var));
end;
end;
unique_diag=countw(temp);
drop temp;
run;

Related

Create Index (ID) and increment (SAS)

I have a very simple table with sale information.
Name | Value | Index
AAC | 1000 | 1
BTR | 500 | 2
GRS | 250 | 3
AAC | 100 | 4
I add a new column Name Index.
And I run the first time
DATA BSP;
Index = _N_;
SET BSP;
RUN;
This works fine for the first time.
But now I add more and more sales items and the new line should be get a new indexnumber. The highest index + 1 .... The old sales should keep the indexnumber. But if I run the code below all new lines get the index = 1. What is wrong with the code.
proc sql noprint;
select max(Index) into :max_ID from WORK.BSP;
quit;
DATA work.BSP;
SET work.BSP;
RETAIN new_Id &max_ID;
IF Index = . THEN DO;
new_ID + 1;
index = new_id;
END;
RUN;
You definied the value of Index column in the first step. Where is a new data that you want set? This code like your data it's work well. Can you share your base datase and the last dataset that you want change? Maybe your data is wrong?
(BTW The index variable name is not a lucky choice :-))
data BSP;
Name="AAC";Value=1000;Index=1;output;
Name="BTR";Value=500;Index=2;output;
Name="GRS";Value=250;Index=3;output;
Name="AAC";Value=100;Index=4;output;
run;
/* the row where Index not definied */
data BSPNew;
Name="XXX";Value=1000;output;
run;
proc sql noprint;
select max(Index) into :max_ID from WORK.BSP;
quit;
%put &max_Id.;
proc append base=BSP data=BSPNew force;
run;
DATA work.BSP;
SET work.BSP;
RETAIN new_Id &max_ID;
IF Index = . THEN DO;
new_ID + 1;
index = new_id;
END;
RUN;
data _null_;
set BSP;
put Name Value Index;
run ;
/* the result is:
AAC 1000 1
BTR 500 2
GRS 250 3
AAC 100 4
XXX 1000 5
*/
You need to show more of your code that will demonstrate the problem. The following example is the same as yours, but does not 'fail' to assign a desired index
Example:
data master;
do name = 'A','B','C'; OUTPUT; end;
run;
data master;
set master;
index = _n_;
run;
data new;
do name = 'E','F','G'; OUTPUT; end;
run;
proc sql noprint;
insert into master(name) select name from new; * append new rows;
select max(index) into :next_index from master; * compute highest index known;
data master;
set master;
retain next_index &next_index; * utilize highest index;
if index = . then do;
next_index + 1; * increment highest index before applying;
index = next_index;
end;
drop next_index; * discard 'worker' variable;
run;
You may have inserted a 1 by accident if the insert statement looked like
insert into master select name, 1 from new;
or the new data already has index set to '1'
insert into master select name, index from new;

How do I add in rows with specific values missing in a single DATA step?

Here is a simple example I came up with. There are 3 players here (id is 1,2,3) and each player gets 3 attempts at the game (attempt is 1,2,3).
data have;
infile datalines delimiter=",";
input id attempt score;
datalines;
1,1,100
1,2,200
2,1,150
3,1,60
;
run;
I would like to add in rows where the score is missing if they did not play attempt 2 or attempt 3.
data want;
set have;
by id attempt;
* ??? ;
run;
proc print data=have;
run;
The output would look something like this.
1 1 100
1 2 200
1 3 .
2 1 150
2 2 .
2 3 .
3 1 60
3 2 .
3 3 .
How do I go about doing this?
You could solve this by first creating a table where you have the structure you want to see: for each ID three attempts. This structure can then be joined with a 'left join' to your 'have' table to get the actual scores if they exist and missing variable if they don't.
/* Create table with all ids for which the structure needs to be created */
proc sql;
create table ids as
select distinct id from have;
quit;
/* Create table structure with 3 attempts per ID */
data ids (drop = i);
set ids;
do i = 1 to 3;
attempt = i;
output;
end;
run;
/* Join the table structure to the actual scores in the have table */
proc sql;
create table want as
select a.*,
b.score
from ids a left join have b on a.id = b.id and a.attempt = b.attempt;
quit;
A table of possible attempts cross joined with the distinct ids left joined to the data will produce the desired result set.
Example:
data have;
infile datalines delimiter=",";
input id attempt score;
datalines;
1,1,100
1,2,200
2,1,150
3,1,60
;
data attempts;
do attempt = 1 to 3; output; end;
run;
proc sql;
create table want as
select
each_id.id,
each_attempt.attempt,
have.score
from
(select distinct id from have) each_id
cross join
attempts each_attempt
left join
have
on
each_id.id = have.id
& each_attempt.attempt = have.attempt
order by
id, attempt
;
Update: I figured it out.
proc sort data=have;
by id attempt;
data want;
set have (rename=(attempt=orig_attempt score=orig_score));
by id;
** Previous attempt number **;
retain prev;
if first.id then prev = 0;
** If there is a gap between previous attempt and current attempt, output a blank record for each intervening attempt **;
if orig_attempt > prev + 1 then do attempt = prev + 1 to orig_attempt - 1;
score = .;
output;
end;
** Output current attempt **;
attempt = orig_attempt;
score = orig_score;
output;
** If this is the last record and there are more attempts that should be included, output dummy records for them **;
** (Assumes that you know the maximum number of attempts) **;
if last.id & attempt < 3 then do attempt = attempt + 1 to 3;
score = .;
output;
end;
** Update last attempt used in this iteration **;
prev = attempt;
run;
Here is a alternative DATA step, a DOW way:
data want;
do until (last.id);
set have;
by id;
output;
end;
call missing(score);
do attempt = attempt+1 to 3;
output;
end;
run;
If the absent observations are only at the end then you can just use a couple of OUTPUT statements and a DO loop. So write each observation as it is read and if the last one is NOT attempt 3 then add more observations until you get to attempt 3.
data want1;
set have ;
by id;
output;
score=.;
if last.id then do attempt=attempt+1 to 3;
output;
end;
run;
If the absent attempts can appear any where then you need to "look ahead" to see whether the next observations skips any attempts.
data want2;
set have end=eof;
by id ;
if not eof then set have (firstobs=2 keep=attempt rename=(attempt=next));
if last.id then next=3+1;
output;
score=.;
do attempt=attempt+1 to next-1;
output;
end;
drop next;
run;

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need keep track of whether any ID is ever different than the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;

In SAS --Summing binary is producing Binary Results

I am trying to convert a categorical variable (Product) in binary and then want to know how many products per customer.
data is in the following format:
ID Product
C1 A
C1 B
C2 A
C3 B
C4 A
The code I am using for converting category to binary
IF PRODUCT="A" THEN PROD_A =1 ; ELSE PROD_A=0;
IF PRODUCT="B" THEN PROD_B =1 ; ELSE PROD_B=0;
TOT_PROD = SUM(PROD_A, PROD_B);
But when I count no. of product it gives me '1' for all customer and I am expecting 1 or 2.
I have tried
TOT_PROD = PROD_A + PROD_B;
but I get the same results
This is all inside one datastep, correct? If so you're processing only one line at a time. For each individual line the only possible values for PROD_A and PROD_B are one or zero. You need an aggregate function. For example, if your dataset is named PRODUCTS:
DATA X;
SET PRODUCTS;
IF PRODUCT="A" THEN PROD_A = 1 ; ELSE PROD_A=0;
IF PRODUCT="B" THEN PROD_B = 1 ; ELSE PROD_B=0;
TOT_PROD = SUM(PROD_A, PROD_B);
RUN;
(TOT_PROD will always be equal to 1 in X, but never mind for now).
Now sum them up:
proc sql;
create table prod_totals as
select product, sum(tot_prod) as total_products
from x
group by product;
quit;
More simply just skip the data step:
proc sql;
create table prod_totals as
select product, count(*) as total_products
from products
group by product;
quit;
Or use PROC SUMMARIZE or PROC MEANS instead of PROC SQL.
I have assumed you only want 1 record output per id.
In the solutions below I have employed the DOW-Loop (DO-Whitlock).
If you wanted prod_a and prod_b just to help with the totals and if they're not required in the output, then you could use something like:
data want;
do until(last.id);
set have;
by id;
tot_prod=sum(tot_prod,product='A',product='B');
end;
run;
If you need prod_a and prod_b in the output, then you could use:
data want;
do until(last.id);
set have;
by id;
prod_a=(product='A');
prod_b=(product='B');
tot_prod=sum(tot_prod,prod_a,prod_b);
end;
run;
In both data steps the last product per id will be output along with the other variables and in the case of the 2nd data step example the last prod_a & prod_b per id will also be output.
To do this in the data step, you need retain. Make sure you've sorted the dataset by id first.
data prod_totals;
set products;
by ID;
retain prod_a prod_b;
if first.id then do; *initialize to zero for each new ID;
prod_a=0; prod_b=0;
end;
if product='A' then prod_a=1; *set to 1 for each one found;
else if product='B' then prod_b=1;
if last.id then do; *for last record in each ID, output and sum total;
total_products=sum(prod_a,prod_b);
output;
end;
keep id prod_a prod_b total_products;
run;

Update one dataset with another without using PROC SQL

I have the below two datasets
Dataset A
id age mark
1 . .
2 . .
1 . .
Dataset B
id age mark
2 20 200
1 10 100
I need the below dataset as output
Output Dataset
id age mark
1 10 100
2 20 200
1 10 100
How to carry out this without using PROC SQL i.e. using DATA STEP?
There are many ways to do this. The easiest is to sort the two data sets and then use MERGE. For example:
proc sort data=A;
by id;
run;
proc sort data=B;
by id;
run;
data WANT;
merge A(drop=age mark) B;
by ID;
run;
The trick is to drop the variables you are adding from the first data set A; the new variables will come from the second data set B.
Of course, this solution does not preserve the original order of the observations in your data set AND only works because your second data set contains unique values of id.
I tried this and it worked for me, even if you have data you would like to preserve in that column. Just for completeness sake I added an SQL variant too.
data a;
input id a;
datalines;
1 10
2 20
;
data b;
input id a;
datalines;
1 .
1 5
1 .
2 .
3 4
;
data c (drop=b);
merge a (rename = (a=b) in=ina) b (in = inb);
by id;
if b ne . then a = b;
run;
proc sql;
create table d as
select a.id, a.a from a right join b on a.id=b.id where a.id is not null
union all
select b.id, b.a from a right join b on a.id = b.id where a.id is null
;
quit;