sas recursive lag by id - sas

I am trying to do a recursive lag in sas, the problem that I just learned is that x = lag(x) does not work in SAS.
The data I have is similar in format to this:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
What I want is that given x for the first count, I want each successive x by id to be the lag(x) + some constant.
For example, lets say: if count > 1 then x = lag(x) + 3.
The output that I would want is:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 13
a 1/1/2001 3 16
b 1/1/1997 1 51
b 1/1/1998 2 54

Yes, the lag function in SAS requires some understanding. You should read through the documentation on it (http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#n0l66p5oqex1f2n1quuopdvtcjqb.htm)
When you have conditional statements with a lag inside the "then", I tend to use a retained variable.
data test;
input id $ date count x;
informat date anydtdte.;
format date date9.;
datalines;
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
;
run;
data test(drop=last);
set test;
by id;
retain last;
if ^first.id then do;
if count > 1 then
x = last + 3;
end;
last = x;
run;

Related

Needing to retain Lab category tests based on individual positive test result

Hello so this is a sample of my data (There is an additional column of LBCAT =URINALYSIS for those panel of tests)
I've been asked to only include the panel of tests where LBNRIND is populated for any of those tests and the rest to be removed. Some subjects have multiple test results at different visit timepoints and others only have 1.I can't utilise a simple where LBNRIND ne '' in the data step because I need the entire panel of Urinalysis tests and not just that particular test result. What would be the best approach here? I think transposing the data would be too messy but maybe putting the variables in an array/macro and utilising a do loop for those panel of tests?.
Update:I've tried this code but it doesn't keep the corresponding tests for where lb_nrind >0. If I apply the sum(lb_nrind > '' ) the same when applying lb_nrind > '' to the having clause
*proc sql;
*create table want as
select * from labUA
group by ptno and day and lb_cat
having sum(lb_nrind > '') > 0 ;
data want2;
do _n_ = 1 by 1 until (last.ptno);
set labUA;
by ptno period day hour ;
if not flag_group then flag_group = (lb_nrind > '');
end;
do _n_ = 1 to _n_;
set want;
if flag_group then output;
end;
drop flag_group; run;*
You can use a SQL HAVING clause to retain rows of a group meeting some aggregate condition. In your case that group might be a patientid, panelid and condition at least one LBNRIND not NULL
Example:
Consider this example where a group of rows is to be kept only if at least one of the rows in the group meets the criteria result7=77
Both code blocks use the SAS feature that a logical evaluation is 1 for true and 0 for false.
SQL
data have;
infile datalines missover;
input id test $ parm $ result1-result10;
datalines;
1 A P 1 2 . 9 8 7 . . . .
1 B Q 1 2 3
1 C R 4 5 6
1 D S 8 9 . . . 6 77
1 E T 1 1 1
1 F U 1 1 1
1 G V 2
2 A Z 3
2 B K 1 2 3 4 5 6 78
2 C L 4
2 D M 9
3 G N 8
4 B Q 7
4 D S 6
4 C 1 1 1 . . 5 0 77
;
proc sql;
create table want as
select * from have
group by id
having sum(result7=77) > 0
;
DOW Loop
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if not flag_group then flag_group = (result7=77);
end;
do _n_ = 1 to _n_;
set have;
if flag_group then output;
end;
drop flag_group;
run;

SAS Comparing values across multiple columns of the same observation?

I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct within each row, I can do it among multiple rows but within the row by the columns, I don't know.
Any help would be appreciated. Thank you
********************************************UPDATE*******************************************************
Thank you to everyone that has replied!!
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering:
I created four tables where in each table I created a variable named for example ‘abcd’ and placed a variable under that name.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows but that ok because I just want all of the variables in 1 column and I can do distinct count.
data ALL_STACK;
set
table1_a
table1_b
table1_c
table1_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
RUN;
Obviously, it’s not efficient to replicate a table 4 times just to put a variables under the same name and then stack them. But my tables were somewhat small enough that I could do it and then immediately delete them after the stack. If you have a very large dataset this method would most certainly be troublesome. I used this method over the others because I was trying to use Procs more than loops, etc.
A linear search for duplicates in an array is O(n2) and perfectly fine for small n. The n for a b c d is four.
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;
The linear search for duplicates will occur on every row, and the count_distinct will be initialized automatically in each row to a missing (.) value. The sum function is used to increment the count when a non-missing value is not found in any prior array indices.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Try to transpose dataset, each ID becomes one column, frequency each ID column by option nlevels, which count frequency of value, then merge back with original dataset.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=compress(TableVar,,'kd');
drop TableVar;
run;
data want;
merge have count;
by id;
run;
one more way using sortn and using conditions.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
77 . 3 . 4
88 . 9 5 .
99 . . 2 2
76 . . . 2
58 1 1 . .
50 2 . 2 .
66 2 . 7 .
89 1 1 1 .
75 1 2 3 .
76 . 5 6 7
88 . 1 1 1
43 1 . . 1
31 1 . . 2
;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1

How to add a column of repeated numbers in SAS?

How to generate a repeating series of numbers in a column in SAS, from 1 to x?
Suppose x is 3.
Data is like:
name age
A 15
D 16
C 21
B 35
E 79
F 85
G 64
and I want to add a column named list, like this:
name age list
A 15 1
D 16 2
C 21 3
B 35 1
E 79 2
F 85 3
G 64 1
data class;
set sashelp.class;
if list>=3 then list=0;
list+1;
run;
Easiest way I can think of is to use mod and the iteration counter.
data want;
set have;
list = 1 + mod(_N_ - 1,3);
run;
mod is the modulo function (gives the remainder after dividing).
So if you want that to vary based on some parameter, well, change the 3 to a parameter.
%let num_atwork = 2;
data want;
set have;
list = 1 + mod(_N_ - 1, &num_atwork.);
run;

SAS: Output last value across several by-groups

I want to output the last value of a variable pr. sub-group to a SAS dataset, preferably in just a few steps. The code below do it, but I was hoping to do it in one step a la by variable; if last.variable then output; as for the case with just 1 by-variable.
data two;
input year firm price;
cards;
1 1 48
1 1 45
2 2 50
1 2 42
2 1 41
2 2 51
2 1 52
1 1 43
1 2 52;
run;
proc sort data = two;by year firm;run;
/* a) Create id across both sub-groups */
data two1;
set two;
by year firm;
retain case_id;
if FIRST.year OR first.firm then case_id + 1;
run;
/* b) Use id to output last values across both by-groups */
data two2;
set two1;
by case_id;
if last.case_id then output;
run;
proc print data = two1;run;
proc print data = two2;run;
With just 1 by-variable the two steps marked a) and b) can be combined. Is it possible with more than one by-group?
In data step a) add condition if lst.firm then output two2.
The final code should looks like:
data two1 two2;
set two;
by year firm;
retain case_id;
if FIRST.year OR first.firm then case_id + 1;
if last.firm then output two2;
output two1;
run;

Detect the difference b/w ages greater than some value using SAS

I am trying to detect groups which contain the difference between first age and second age are greater than 5. For example, if I have the following data, the difference between age in grp=1 is 39 so I want to output that group in a separate data set. Same goes for grp 4.
id grp age sex
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
My initial idea was to sort them by grp and then get the absolute value between ages using something like if first.grp then do;. But I don't know how to get the absolute value between first age and second age by group or actually I don't know how should I start this.
Thanks in advance.
Here's one way that I think works.
data have;
input id $ grp $ age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
proc sort data=have ;
by grp descending age;
run;
data temp(keep=grp);
retain old;
set have;
by grp descending age;
if first.grp then old=age;
if last.grp then do;
diff=old-age;
if diff>5 then output ;
end;
run;
Data want;
merge temp(in=a) have(in=b);
by grp ;
if a and b;
run;
I would use PROC TRANSPOSE so the values in each group can easily be compared. For example:
data groups1;
input id $ grp age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;
proc sort data=groups1;
by grp; /* This maintains age order */
run;
proc transpose data=groups1 out=groups2;
by grp;
var age;
run;
With the transposed data you can do whatever comparison you like (I can't tell from your question what exactly you want, so I just compare first two ages):
/* With all ages of a particular group in a single row, it is easy to compare */
data outgroups1(keep=grp);
set groups2;
if abs(col1-col2)>5 then output;
run;
In this instance this would be my preferred method for creating a separate data set for each group that satisfies whatever condition is applied (generate and include code dynamically):
/* A separate data set per GRP value in OUTGROUPS1 */
filename dynacode catalog "work.dynacode.mycode.source";
data _null_;
set outgroups1;
file dynacode;
put "data grp" grp ";";
put " set groups1(where=(grp=" grp "));";
put "run;" /;
run;
%inc dynacode;
If you are after the difference between just the 1st and 2nd ages, then the following code is a fairly straightforward way of extracting these. It reads though the dataset to identify the groups, then uses the direct access method, POINT=, to extract the relevant records. I put in an extra condition, grp=lag(grp) just in case you have any groups with only 1 record.
data want;
set have;
by grp;
if first.grp then do;
num_grp=0;
outflag=0;
end;
outflag+ifn(lag(first.grp)=1 and grp=lag(grp) and abs(dif(age))>5,1,0) /* set flag to determine if group meets criteria */;
if not first.grp then num_grp+1; /* count number of records in group */
if last.grp and outflag=1 then do i=_n_-num_grp to _n_;
set have point=i; /* extract required group records */
drop num_grp outflag;
output;
end;
run;
Here's an SQL approach (using CarolinaJay's code to create the dataset):
data groups1;
input id grp age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;
proc sql noprint;
create table xx as
select a.*
from groups1 a
where grp in (select b.grp
from groups1 b
join groups1 c on c.id = b.id+1
and c.grp = b.grp
and abs(c.age - b.age) > 5
left join groups1 d on d.id = b.id-1
and d.grp = b.grp
where d.id eq .
)
;
quit;
The join on C finds all occurrences where the subsequent record in the same group has an absolute value > 5. The join on D (and the where clause) makes sure we only consider the results from the C join if the record is the very first record in the group.
data have;
input id $ grp $ age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
data want;
do i = 1 by 1 until(last.grp);
set have;
by grp notsorted;
if first.grp then cnt = 0;
cnt + 1;
if cnt = 1 then age1 = age;
if cnt = 2 then age2 = age;
diff = sum( age1, -age2 );
end;
do until(last.grp);
set have;
by grp;
if diff > 5 then output;
end;
run;