Creating groups of specific data per subject using SAS

My objective is to determine, for each subject, how many runs of at least two consecutive observations have "Y" as the eligibility value. For most subjects this occurs only once, but looking at the data I realized that for some subjects it can happen two or three times. So I need to create an extra variable (called GROUP) to keep track of these multiple occurrences within a subject. Using SAS, could someone help me create the GROUP variable properly? Below is an example dataset of subjects (ID) with different study days (CVDY) and an eligibility flag (ELIG, Y/N) for a specific lab parameter (not included in the example).
Thanks for your support.
data WHAT_I_HAVE;
length ID CVDY $3. ELIG $2.;
infile datalines TRUNCOVER;
input ID $ CVDY $ ELIG $ ;
datalines;
101 1 N
101 2 Y
101 3 Y
101 4 N
201 1 Y
201 2 Y
201 3 N
201 4 Y
201 5 Y
201 6 Y
201 7 N
201 8 Y
201 9 Y
301 1 Y
301 2 Y
301 3 N
301 4 N
301 5 Y
;
run;
data WHAT_I_WANT;
length ID CVDY $3. ELIG GROUP $2.;
infile datalines TRUNCOVER;
input ID $ CVDY $ ELIG $ GROUP $;
datalines;
101 1 N .
101 2 Y 1
101 3 Y 1
101 4 N .
201 1 Y 1
201 2 Y 1
201 3 N .
201 4 Y 2
201 5 Y 2
201 6 Y 2
201 7 N .
201 8 Y 3
201 9 Y 3
301 1 Y 3
301 2 Y 3
301 3 N .
301 4 N .
301 5 Y .
301 6 N .
;
run;

You can use a double DOW loop. Use the first loop to count how many rows are in the current run of contiguous ELIG values (within the current ID). You need the NOTSORTED keyword on the BY statement so that the data step tracks each change in the value of ELIG even though the data is not sorted by ELIG.
Now you have the information you need to decide whether to increment your counter of runs of two or more Y values in a row. To get your exact output you need two variables: one that keeps the running count and another that holds the value you want to write.
The second DO loop just allows you to re-read the detail lines and write them back out so that the same value of GROUP is attached to each row in the run.
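data want;
  /* First loop: count the rows in the current run of ELIG within ID */
  do rows=1 by 1 until(last.elig);
    set have;
    by id elig notsorted;
    if first.id then cnt=0;   /* restart the run counter for each subject */
  end;
  /* A run of two or more Y rows gets the next group number */
  if elig='Y' and rows>1 then do;
    cnt+1;
    group=cnt;
  end;
  /* Second loop: re-read the same rows and write them out with GROUP attached */
  do rows=1 to rows;
    set have;
    output;
  end;
  drop rows cnt;
run;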
Results
Obs ID CVDY ELIG group
1 101 1 N .
2 101 2 Y 1
3 101 3 Y 1
4 101 4 N .
5 201 1 Y 1
6 201 2 Y 1
7 201 3 N .
8 201 4 Y 2
9 201 5 Y 2
10 201 6 Y 2
11 201 7 N .
12 201 8 Y 3
13 201 9 Y 3
14 301 1 Y 1
15 301 2 Y 1
16 301 3 N .
17 301 4 N .
18 301 5 Y .
Note there appears to be a typo in your expected results as the last ID only has one run of length 2.
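To see what BY with NOTSORTED contributes, here is a minimal sketch that prints the run boundaries for a few rows. The dataset name DEMO and the variables RUN_START and RUN_END are made up for illustration; the rows are a subset of subject 201 from the question.

```sas
/* Hypothetical four-row subset of the question's data */
data demo;
  input ID $ CVDY ELIG $;
  datalines;
201 1 Y
201 2 Y
201 3 N
201 4 Y
;
run;

data _null_;
  set demo;
  by id elig notsorted;
  /* Copy the automatic FIRST./LAST. flags into ordinary variables for PUT */
  run_start = first.elig;
  run_end   = last.elig;
  put id= cvdy= elig= run_start= run_end=;
run;
```

Days 1 and 2 form one run (RUN_START=1 on day 1, RUN_END=1 on day 2), while days 3 and 4 are each a run of length one. Those boundaries are what the first DO loop in the answer relies on when it counts rows until LAST.ELIG.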

Related

SAS arrays to set a value if it matches a specific value in any of the columns

I'm trying to set any5 = 'Yes' if there is a 5 in any of the columns Q1 to Q5. However, my code below only reflects the last column.
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
array score{5} _temporary_ (5,5,5,5,5);
array Ques{5} Q1-Q5;
do i =1 to 5;
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
drop i;
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Keep it simple :-)
data survey;
infile datalines;
input ID 3. Q1-Q5;
array Ques{*} Q1 - Q5;
any5 = ifc(5 in Ques, 'Yes', 'No');
datalines;
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
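Use the COUNTC function to count how many times the character 5 appears in the concatenated Q1-Q5 values, then use the IFC function to return a character value based on whether the expression is true, false, or missing. Because this counts characters, it is only reliable when every answer is a single digit (a value such as 15 would also match).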
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
any5 = ifc(countc(cats(of Q:),'5')>0,'Yes','No');
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Result:
535 1 3 5 4 2 Yes
12 5 5 4 4 3 Yes
723 2 1 2 1 1 No
7 3 5 1 4 2 Yes
Use the WHICHN function to determine the index of the target value in a list of values.
In your case assign the test for any index matching
any5 = whichn (5, of ques(*)) > 0;
From the documentation:
WHICHN Function
Searches for a numeric value that is equal to the first argument, and
returns the index of the first matching value.
Syntax
WHICHN(argument, value-1 <, value-2, ...>)
It is a simple mistake in your logic. You are setting ANY5 to YES or NO on every pass through the loop. Since you keep looping even after a match is found, you overwrite the results of the previous passes, so only the result of the last test survives.
Here is one way. Set the answer to NO before the loop and remove the ELSE clause.
any5='No ';
do i =1 to 5;
if Ques{i} = 5 then any5='Yes';
end;
Or stop when you have your answer.
do i =1 to 5 until(any5='Yes');
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
Or skip the looping altogether.
if whichn(5, of Q1-Q5) then any5='Yes';
else any5='No';
Or even easier create any5 as numeric instead of character. SAS will return 1 for TRUE and 0 for FALSE as the result of a boolean expression.
any5 = ( 0 < whichn(5, of Q1-Q5) );

Function to sum group based on id independently based off id

I am currently trying to write some code that goes through my dataset and sums MinUnits for each group every time the group appears, rather than over every row that shares the group number. This is what the data currently looks like versus what I want. I thought it would be simple, but SAS 9.3 does not support sum over statements.
week ID var2 ... MinUnits group
24jun2019 1 x 5 0
01jul2019 1 x 4 1
08jul2019 1 x 7 1
15jul2019 1 x 2 1
22jul2019 1 x 0 2
29jul2019 1 x 5 2
05aug2019 1 x 2 2
24jun2019 1 x 9 0
01jul2019 2 x 5 1
08jul2019 2 x 6 1
15jul2019 2 x 8 1
22jul2019 2 x 1 2
29jul2019 2 x 5 2
05aug2019 3 x 3 2
what i want it to show
week ID var2 ... MinUnits group SumMinUnits
24jun2019 1 x 5 0 5
01jul2019 1 x 4 1 13
08jul2019 1 x 7 1
15jul2019 1 x 2 1
22jul2019 1 x 0 2 7
29jul2019 1 x 5 2
05aug2019 1 x 2 2
24jun2019 1 x 9 0 9
01jul2019 2 x 5 1 19
08jul2019 2 x 6 1
15jul2019 2 x 8 1
22jul2019 2 x 1 2 9
29jul2019 2 x 5 2
05aug2019 2 x 3 2
As you can see, simply summing by group would not work because the group number gets repeated for different IDs (and eventually for the same ID, but in those cases a location variable differs from the first time the ID showed up).
Please note I am not asking you to code it for me, as that is too much work. I just want to know if there is a function I could use to do this. I thought about using a loop and GROUP BY, but that would sum up the total for each group number.
You can use the NOTSORTED keyword on the BY statement and use the GROUP variable to define the BY groups.
data want;
do until (last.group);
set have ;
by group notsorted;
SumMinUnits=sum(SumMinUnits,MinUnits);
end;
do until (last.group);
set have ;
by group notsorted;
output;
end;
run;
Note this will set SUMMINUNITS to the same value for all observations in the group. You could add extra code to set it to missing inside the second DO loop when it is not the first observation for the group.
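One way that extra code might look is sketched below. It assumes the same HAVE layout as the question; the sample rows and the helper variable TOTAL are made up for illustration, and only the first row of each run keeps the sum.

```sas
/* Hypothetical sample in the shape of the question's data */
data have;
  input week :date9. ID MinUnits group;
  format week date9.;
  datalines;
24jun2019 1 5 0
01jul2019 1 4 1
08jul2019 1 7 1
15jul2019 1 2 1
22jul2019 1 0 2
;
run;

data want;
  /* First pass: total MinUnits over one contiguous run of GROUP */
  do until (last.group);
    set have;
    by group notsorted;
    total = sum(total, MinUnits);
  end;
  /* Second pass: re-read the same rows; only the first row keeps the total */
  do until (last.group);
    set have;
    by group notsorted;
    if first.group then SumMinUnits = total;
    else SumMinUnits = .;
    output;
  end;
  drop total;
run;
```

If two different IDs can sit next to each other with the same GROUP value, using BY ID GROUP NOTSORTED in both loops would keep their runs separate.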
Wouldn't something like this work? It adds the total to every record of the group, but otherwise your data seems to be ordered by ID and GROUP.
proc sql;
create table want as
select *, sum(minUnits) as total_units
from have
group by ID, GROUP;
quit;

Generating Unique ID for same group

I have data set,
CustID Rating
1 A
1 A
1 B
2 A
2 B
2 C
2 D
3 X
3 X
3 Z
4 Y
4 Y
5 M
6 N
7 O
8 U
8 T
8 U
And expecting Output
CustID Rating ID
1 A 1
1 A 1
1 B 1
2 A 1
2 B 2
2 C 3
2 D 4
3 X 1
3 X 1
3 Z 2
4 Y 1
4 Y 1
5 M 1
6 N 1
7 O 1
8 U 1
8 T 2
8 U 1
In the solution below, I select the distinct ratings into a macro variable to be used in an ARRAY statement. Each value in the rating column is then searched for among these distinct values, and ID is set to the index of the match.
You can avoid the macro in this case by replacing the %sysfunc call with 3 (the number of distinct ratings), if you know it beforehand. The %sysfunc call resolves the count for you when you don't.
data have;
input CustomerID Rating $;
cards;
1 A
1 A
1 B
2 A
2 A
3 A
3 A
3 B
3 C
;
run;
proc sql noprint;
select distinct quote(strip(rating)) into :list separated by ' '
from have
order by 1;
%put &list.;
quit;
If you know the number beforehand:
data want;
set have;
array num(3) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
Otherwise:
%macro Y;
data want;
set have;
array num(%sysfunc(countw(&list., %str( )))) $ _temporary_ (&list.);
do i = 1 to dim(num);
if findw(rating,num(i),'tips')>0 then id = i;
end;
drop i;
run;
%mend;
%Y;
The output:
Obs CustomerID Rating id
1 1 A 1
2 1 A 1
3 1 B 2
4 2 A 1
5 2 A 1
6 3 A 1
7 3 A 1
8 3 B 2
9 3 C 3
Assuming the data is sorted by customerid and rating (as in the original, unedited question), is the following what you want?
data want;
set have;
by customerid rating;
if first.customerid then
id = 0;
if first.rating then
id + 1;
run;

Splitting string and creating variable using SAS

I have a string, 28,16OB4N7L8O4L, which I have split into separate variables using two arrays.
hrs1 hrs2 hrs3 hrs4 hrs5 hrs6 hrs7
28 16 1 4 7 8 4
cd1 cd2 cd3 cd4 cd5 cd6 cd7
, O B N L O L
Now I want to summarize across the variables: if the same value repeats in the character variables (in the example above, 'O' and 'L' are repeated), I want to merge them into one and add up the respective hrs.
Output should be:
, O B N L -COLUMN
28 24 1 4 11 -VALUES
Here's an example of transposing to normalized (long skinny format). I added a second sample record.
data have;
input id hrs1-hrs7 (cd1-cd7) ($1.);
cards;
1 28 16 1 4 7 8 4 ,OBNLOL
2 1 2 3 4 5 6 7 AAAABBB
;
run;
data tran (keep=id hr cd) / view=tran ;
set have ;
array hrs{*} hrs1-hrs7 ;
array cds{*} cd1-cd7 ;
do i=1 to dim(hrs) ;
hr=hrs{i} ;
cd=cds{i} ;
output ;
end ;
run ;
proc sql ;
select id, cd, sum(hr)
from tran
group by id, cd
;
quit ;
Returns:
id cd
________________
1 , 28
1 B 1
1 L 11
1 N 4
1 O 24
2 A 10
2 B 18

Detect the difference b/w ages greater than some value using SAS

I am trying to detect groups in which the difference between the first age and the second age is greater than 5. For example, in the following data the difference between the ages in grp=1 is 39, so I want to output that group to a separate data set. The same goes for grp 4.
id grp age sex
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
My initial idea was to sort them by grp and then get the absolute difference between ages using something like if first.grp then do;. But I don't know how to get the absolute difference between the first age and the second age by group, or really how I should start.
Thanks in advance.
Here's one way that I think works.
data have;
input id $ grp $ age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
proc sort data=have ;
by grp descending age;
run;
data temp(keep=grp);
retain old;
set have;
by grp descending age;
if first.grp then old=age;
if last.grp then do;
diff=old-age;
if diff>5 then output ;
end;
run;
Data want;
merge temp(in=a) have(in=b);
by grp ;
if a and b;
run;
I would use PROC TRANSPOSE so the values in each group can easily be compared. For example:
data groups1;
input id $ grp age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;
proc sort data=groups1;
by grp; /* This maintains age order */
run;
proc transpose data=groups1 out=groups2;
by grp;
var age;
run;
With the transposed data you can do whatever comparison you like (I can't tell from your question what exactly you want, so I just compare first two ages):
/* With all ages of a particular group in a single row, it is easy to compare */
data outgroups1(keep=grp);
set groups2;
if abs(col1-col2)>5 then output;
run;
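If the comparison should cover every age in the group rather than only the first two, one option is the RANGE function over the transposed columns. The sketch below rebuilds GROUPS1 and GROUPS2 so that it runs on its own. The output name OUTGROUPS_ANY and the variable AGE_SPREAD are made up for illustration, and it assumes PROC TRANSPOSE's default COL1, COL2, ... naming.

```sas
data groups1;
  input id $ grp age sex $;
  datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;

proc transpose data=groups1 out=groups2;
  by grp;
  var age;
run;

/* Keep groups whose oldest and youngest members differ by more than 5 */
data outgroups_any(keep=grp);
  set groups2;
  age_spread = range(of col:);  /* max minus min of the non-missing ages */
  if age_spread > 5 then output;
run;
```

This swaps the "first versus second age" rule for "largest versus smallest age", so it is a different criterion from the one in the question and only fits if that is what is actually wanted.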
In this instance this would be my preferred method for creating a separate data set for each group that satisfies whatever condition is applied (generate and include code dynamically):
/* A separate data set per GRP value in OUTGROUPS1 */
filename dynacode catalog "work.dynacode.mycode.source";
data _null_;
set outgroups1;
file dynacode;
put "data grp" grp ";";
put " set groups1(where=(grp=" grp "));";
put "run;" /;
run;
%inc dynacode;
If you are after the difference between just the 1st and 2nd ages, then the following code is a fairly straightforward way of extracting these. It reads through the dataset to identify the groups, then uses the direct access method, POINT=, to extract the relevant records. I put in an extra condition, grp=lag(grp), just in case you have any groups with only 1 record.
data want;
set have;
by grp;
if first.grp then do;
num_grp=0;
outflag=0;
end;
outflag+ifn(lag(first.grp)=1 and grp=lag(grp) and abs(dif(age))>5,1,0) /* set flag to determine if group meets criteria */;
if not first.grp then num_grp+1; /* count number of records in group */
if last.grp and outflag=1 then do i=_n_-num_grp to _n_;
set have point=i; /* extract required group records */
drop num_grp outflag;
output;
end;
run;
Here's an SQL approach (using CarolinaJay's code to create the dataset):
data groups1;
input id grp age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;
proc sql noprint;
create table xx as
select a.*
from groups1 a
where grp in (select b.grp
from groups1 b
join groups1 c on c.id = b.id+1
and c.grp = b.grp
and abs(c.age - b.age) > 5
left join groups1 d on d.id = b.id-1
and d.grp = b.grp
where d.id eq .
)
;
quit;
The join on C finds all records whose subsequent record in the same group differs in age by more than 5 in absolute value. The join on D (and the WHERE clause) makes sure we only consider the results from the C join if the record is the very first record in the group. Note that both joins rely on id being consecutive integers in row order.
data have;
input id $ grp $ age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
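data want;
  /* First pass: capture the first and second ages of the group */
  do i = 1 by 1 until(last.grp);
    set have;
    by grp notsorted;
    if i = 1 then age1 = age;
    if i = 2 then age2 = age;
  end;
  /* DIFF stays missing for single-record groups, so they are never output */
  if i >= 2 then diff = abs(age1 - age2);
  /* Second pass: re-read the same rows and keep the whole group if it qualifies */
  do until(last.grp);
    set have;
    by grp notsorted;
    if diff > 5 then output;
  end;
  drop i age1 age2 diff;
run;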