Compare rows within group of size two - sas

I've got the below code that works beautifully for comparing rows in a group when the first row doesn't matter.
data want_Find_Change;
set WORK.IA;
by ID;
array var[*] $ RATING;
array lagvar[*] $ zRATING;
array changeflag[*] RATING_UPDATE;
do i = 1 to dim(var);
lagvar[i] = lag(var[i]);
end;
do i = 1 to dim(var) ;
changeflag[i] = (var[i] NE lagvar[i] AND NOT first.ID);
end;
drop i;
run;
Unfortunately, when I use a dataset that has two rows per group I get incorrect results, I assume because the first row has to be used in the comparison. How can I compare the only two rows and return a flag only on the second row? This did not work:
data Change;
set WORK.Two;
by ID;
changeflag = last.RATING NE first.RATING;
run;
Example of the data I have and want
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
want
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1

If the first row of each group should never be flagged, first create a variable that numbers the rows within each group. You can do so with:
data temp;
set have;
count + 1;
by Group;
if first.Group then count = 1;
run;
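In a second step, run PROC SQL with a subquery that counts the distinct values of each column per group, then set the flags with CASE WHEN. In a two-row group a column has changed exactly when it has more than one distinct value, and the condition count > 1 restricts the flag to the second row: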
proc sql;
create table want as
select
Group, Name, Sport, DogName, Eligibility,
case when count_name > 1 and count > 1 then 1 else 0 end as N_change,
case when count_sport > 1 and count > 1 then 1 else 0 end as S_change,
case when count_dog > 1 and count > 1 then 1 else 0 end as D_change,
case when count_E > 1 and count > 1 then 1 else 0 end as E_change
from (select *,
count(distinct(Name)) as count_name,
count(distinct(Sport)) as count_sport,
count(distinct(DogName)) as count_dog,
count(distinct(Eligibility)) as count_E
from temp
group by Group);
quit;
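For comparison, the same flags can probably be produced in a single DATA step with LAG. This is only a sketch, assuming the sample data from the question is loaded into a data set called have; the prev_ variables are hypothetical helper names. Note that the comparison is case-sensitive, so BBALL and BBall would count as different values.

```sas
/* Sample data from the question */
data have;
  input Group Name $ Sport $ DogName $ Eligibility $;
  datalines;
1 Tom BBALL Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
;
run;

data want;
  set have;
  by Group;
  /* Call LAG unconditionally so each queue advances on every row */
  prev_name  = lag(Name);
  prev_sport = lag(Sport);
  prev_dog   = lag(DogName);
  prev_elig  = lag(Eligibility);
  /* A logical expression is 1 or 0; the first row of a group is never flagged */
  N_change = (not first.Group and Name ne prev_name);
  S_change = (not first.Group and Sport ne prev_sport);
  D_change = (not first.Group and DogName ne prev_dog);
  E_change = (not first.Group and Eligibility ne prev_elig);
  drop prev_:;
run;
```

Unlike the count(distinct) approach, this compares each row with the one directly before it, so it should also behave sensibly for groups with more than two rows.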

Related

How to write a foreach loop statement in SAS?

I'm working in SAS as a novice. I have two datasets:
Dataset1
Unique ID ColumnA
1 15
1 39
2 20
3 10
Dataset2
Unique ID ColumnB
1 40
2 55
2 10
For each UniqueID, I want to subtract each value of ColumnA from all values of ColumnB, and create a NewColumn that is 1 whenever 1 < ColumnB - ColumnA < 30.
For the first row of Dataset1, where UniqueID = 1, I want SAS to go through all the rows in Dataset2 that also have UniqueID = 1 and determine whether any of them has a difference between ColumnB and ColumnA that is greater than 1 and less than 30. For the first row of Dataset1, NewColumn should be 1 because 40 - 15 = 25. For the second row of Dataset1, NewColumn should be 0 because 40 - 39 = 1 (which is not greater than 1). For the third row of Dataset1, I again want SAS to go through every row of ColumnB in Dataset2 that has the same UniqueID as in Dataset1, so 55 - 20 = 35 (which is greater than 30), but NewColumn would still be 1 because (moving to row 3 of Dataset2, which has UniqueID = 2) 20 - 10 = 10, which satisfies the if statement.
So I want my output to be:
Unique ID ColumnA NewColumn
1 15 1
1 30 0
2 20 1
I have tried concatenating Dataset1 and Dataset2 into a FullDataset. Then I tried using a do loop statement but I can't figure out how to do the loop for each value of UniqueID. I tried using BY but that of course produces an error because that is only used for increments.
DATA FullDataset;
set Dataset1 Dataset2; /*Concatenate datasets*/
do i=ColumnB-ColumnA by UniqueID;
if 1<ColumnB-ColumnA<30 then NewColumn=1;
output;
end;
RUN;
I know I'm probably way off but any help would be appreciated. Thank you!
So, the way that answers your question most directly is the keyed set. This isn't necessarily how I'd do this, but it is fairly simple to understand (as opposed to a hash table, which is what I'd use, or a SQL join, probably what most people would use). This does exactly what you say: grabs a row of A, says for each matching row of B check a condition. It requires having an index on the datasets (well, at least on the B dataset).
data colA(index=(id));
input ID ColumnA;
datalines;
1 15
1 39
2 20
3 10
;;;;
data colB(index=(id));
input ID ColumnB;
datalines;
1 40
2 55
2 30
;;;;
run;
data want;
*base: the colA dataset - you want to iterate through that once per row;
set colA;
*now, loop while the check variable shows 0 (match found);
do while (_iorc_ = 0);
*bring in other dataset using ID as key;
set colB key=ID ;
* check to see if it matches your requirement, and also only check when _IORC_ is 0;
if _IORC_ eq 0 and 1 lt ColumnB-ColumnA lt 30 then result=1;
* This is just to show you what is going on, can remove;
put _all_;
end;
*reset things for next pass;
_ERROR_=0;
_IORC_=0;
run;
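The answer mentions that a SQL join is probably what most people would use. A minimal sketch of that variant follows, assuming the colA and colB data sets created above; want_sql is a hypothetical name. It relies on a logical expression evaluating to 1 or 0, so MAX over the joined rows is 1 when at least one matching colB row satisfies the condition.

```sas
proc sql;
  create table want_sql as
  select a.ID,
         a.ColumnA,
         /* 1 if any matching colB row has a difference strictly between 1 and 30 */
         max(b.ColumnB - a.ColumnA > 1 and b.ColumnB - a.ColumnA < 30) as NewColumn
  from colA as a
       left join colB as b
       on a.ID = b.ID
  group by a.ID, a.ColumnA;
quit;
```

The left join keeps IDs that have no rows in colB (ID 3 here); their difference is missing, so the condition is false and NewColumn is 0. One caveat of this sketch: exact duplicate (ID, ColumnA) rows in colA would collapse into a single output row because of the GROUP BY.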

Needing to retain Lab category tests based on individual positive test result

Hello, this is a sample of my data. (There is an additional column, LBCAT = URINALYSIS, for that panel of tests.)
I've been asked to include only the panels of tests where LBNRIND is populated for at least one test in the panel, and to remove the rest. Some subjects have multiple test results at different visit timepoints and others have only one. I can't use a simple where LBNRIND ne '' in the data step, because I need the entire panel of urinalysis tests and not just that particular test result. What would be the best approach here? I think transposing the data would be too messy, but maybe putting the variables in an array or macro and using a do loop over the panel of tests?
Update: I've tried this code, but it doesn't keep the corresponding tests where lb_nrind is populated. I get the same result whether the HAVING clause uses sum(lb_nrind > '') > 0 or just lb_nrind > ''.
proc sql;
create table want as
select * from labUA
group by ptno and day and lb_cat
having sum(lb_nrind > '') > 0 ;
data want2;
do _n_ = 1 by 1 until (last.ptno);
set labUA;
by ptno period day hour ;
if not flag_group then flag_group = (lb_nrind > '');
end;
do _n_ = 1 to _n_;
set want;
if flag_group then output;
end;
drop flag_group;
run;
You can use a SQL HAVING clause to retain the rows of a group that meets some aggregate condition. In your case the group might be the combination of patient ID and panel ID, and the condition is that at least one LBNRIND in the group is not missing.
Example:
Consider this example, where a group of rows is kept only if at least one row in the group meets the criterion result7=77.
Both code blocks use the SAS feature that a logical evaluation is 1 for true and 0 for false.
SQL
data have;
infile datalines missover;
input id test $ parm $ result1-result10;
datalines;
1 A P 1 2 . 9 8 7 . . . .
1 B Q 1 2 3
1 C R 4 5 6
1 D S 8 9 . . . 6 77
1 E T 1 1 1
1 F U 1 1 1
1 G V 2
2 A Z 3
2 B K 1 2 3 4 5 6 78
2 C L 4
2 D M 9
3 G N 8
4 B Q 7
4 D S 6
4 C 1 1 1 . . 5 0 77
;
proc sql;
create table want as
select * from have
group by id
having sum(result7=77) > 0
;
quit;
DOW Loop
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if not flag_group then flag_group = (result7=77);
end;
do _n_ = 1 to _n_;
set have;
if flag_group then output;
end;
drop flag_group;
run;
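Applied to the lab data in the question, the HAVING approach might look like the sketch below. It uses the names from the question's update (labUA, ptno, day, lb_cat, lb_nrind) and assumes that one panel is identified by patient, day and category; adjust the GROUP BY list if a panel is defined differently. The output name want_panels is hypothetical.

```sas
proc sql;
  create table want_panels as
  select *
  from labUA
  /* GROUP BY takes a comma-separated list; joining the names with AND
     would group by a single logical expression instead */
  group by ptno, day, lb_cat
  /* keep the whole panel when at least one lb_nrind in it is populated */
  having sum(not missing(lb_nrind)) > 0;
quit;
```

In the DOW-loop variant, the second loop has to read the same input data set as the first loop (labUA in this case) with the same BY grouping, otherwise the row counts of the two passes do not line up.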

Last Changed row +1 in a group

I've got grouped data with flags created any time a name changes within a group. I can pull the last two or first two observations within the group, but I am struggling to figure out how to pull the last observation with a name change AND the row right after it.
The below code gives me the first or last two observations per group, depending on how I sort the data.
DATA LastTwo;
SET WhatIveGot;
count + 1;
BY group_ID /*data pre sorted*/;
IF FIRST.group_ID THEN count=1;
IF count<=2 THEN OUTPUT;
RUN;
What I need is the LAST observation with a name change AND the following row.
group_ID NAME DATE NAME_CHange
1 TOM 1/1/19 0
1 Jill 1/30/19 1
1 Jill 1/20/19 0
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 TOM 2/1/19 0
2 Jill 2/30/19 1
2 Jill 2/20/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
2 Jim 4/15/19 0
3 Joe 2/20/19 0
3 Kim 3/10/19 1
3 Kim 3/30/19 0
3 Ken 4/15/19 1
4 Tim 3/10/19 0
4 Tim 3/30/19 0
The desired output:
group_ID NAME DATE NAME_CHange
1 Bob 2/10/19 1
1 Bob 2/30/19 0
2 Jim 3/10/19 1
2 Jim 3/30/19 0
3 Ken 4/15/19 1
The cases for Group_ID 2 and 3 are the roadblock. The data is already sorted by date.
Thank you for any help in advance
Use DOW processing to determine where the last name change was. Apply that information in a succeeding loop.
Example:
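data want;
* Assumes a data set named have, sorted by group, with group variable id (group_ID in the question);
* First pass over the group: remember the row number of the last name change;
do _n_ = 1 by 1 until (last.id);
set have;
by id name notsorted;
* first.name is also true on the first row of a group, which is not a change;
if first.name and not first.id then _index_of_last_name_change = _n_;
end;
* Second pass: output the change row and the row after it, if a change was found;
do _n_ = 1 to _n_;
set have;
if not missing(_index_of_last_name_change)
and _index_of_last_name_change <= _n_ <= _index_of_last_name_change+1 then OUTPUT;
end;
drop _:;
run;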

SAS creating identifier

I have a dataset in which a patient has multiple courses during a treatment phase.
Data set looks like:
C 1 1 0
C 0 0 1
C 1 1 0
C 0 0 1
The first two rows: the patient starts at row 1 and finishes at row 2. This is the first course of patient C.
The second two rows: patient C starts again at row 3 and finishes at row 4.
How can I create an identifier for these two courses using the first. and last. variables in SAS?
Expected output should look like this:
C 1 1 0 23
C 0 0 1 23
C 1 1 0 24
C 0 0 1 24
C 1 1 1 25
The identifier should be the same for all rows of one course and differ from course to course within the same patient.
Thanks.
Assuming the third variable, whatever it is, is your 'end state' the following works. Probably not the easiest method but hopefully clear. I don't know if First/Last will actually help in this situation except for when the ID switches.
The idea is to look for V3=1 and then set a flag to 1. If the flag is 1, the next record increments the course counter and resets the flag, and the process continues. RETAIN is used to hold the values of flag and course across the rows.
data have;
input ID $ v1-v3;
cards;
C 1 1 0
C 0 0 1
C 1 1 0
C 0 0 1
D 1 0 0
D 0 1 0
D 0 0 1
;
run;
data want;
set have;
BY ID;
retain flag 0 course;
if first.ID then do;
Course=1;
flag=0;
end;
if flag=1 then do;
course=course+1;
flag=0;
end;
else if v3=1 and flag=0 then flag=1;
run;
proc print;
run;

SAS Dataset : Counting observation that match an IF condition

Here is a very basic question, but I'm unable to find an easy way to do it.
I have a dataset that references different highschools and students :
Highschool Students Sexe
A 1 m
A 2 m
A 3 m
A 4 f
A 5 f
B 1 m
B 2 m
And I'd like to create two new variables that count the number of males and females in each school:
Highschool Students Sexe Nb_m Nb_f
A 1 m 1 0
A 2 m 2 0
A 3 m 3 0
A 4 f 3 1
A 5 f 3 2
B 1 m 1 0
B 2 m 2 0
And I can finally extract the last line with the total that would look like this :
Highschool Students Sexe Nb_m Nb_f
A 5 f 3 2
B 2 m 2 0
Any ideas ?
You can do this in a single PROC SQL step...
Also, I don't think you really need the value of Sexe from the last row.
proc sql ;
create table want as
select Highschool,
sum(case when Sexe = 'f' then 1 else 0 end) as Nb_f,
sum(case when Sexe = 'm' then 1 else 0 end) as Nb_m,
/* CALCULATED refers to columns computed earlier in this SELECT */
calculated Nb_f + calculated Nb_m as Students
/* "have" is an assumed name for the input data set */
from have
group by Highschool
order by Highschool ;
quit ;
First you have to sort your dataset by Highschool:
proc sort data = your_dataset;
by Highschool;
run;
then you use
- retain, so that Nb_m and Nb_f are not reset at every record;
- last.Highschool and an output statement, to write only the last observation for every school.
data new_dataset;
set your_dataset;
by Highschool;
retain Nb_m Nb_f;
if Sexe = 'm' then
Nb_m + 1;
else
Nb_f + 1;
if last.Highschool then do;
Students = Nb_m + Nb_f;
output;
Nb_m = 0;
Nb_f = 0;
end;
run;