Longitudinal calculation in SAS with lag function [duplicate]

Hi, I have data in columns representing patient visits.
Some patient visits have no recorded value, and I want to copy the value from the previous visit. I am using the lag function, which is not working. Any ideas?
The data looks something like this:
ID value
A 22
A .
A 23
B .
B 12
C 3
C .
C .
C .
C 21
The required output:
ID value
A 22
A 22
A 23
B 23
B 12
C 3
C 3
C 3
C 3
C 21

You would use RETAIN not LAG here.
Retain:
data want;
set have;
retain newval;
if not missing(oldval) then newval=oldval;
run;
If you need the same variable name, drop+rename to get newval into oldval name.
Normally, you would also check that the ID is the same. Your example carries values across IDs (the first B row takes 23 from A), so I left that out. If you don't want a B record to receive a value from A, add by id; and then if first.id then call missing(newval); to reset it at the start of each new ID.
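A minimal sketch of the drop-and-rename step, assuming the input dataset is called have and the column is called value as in the question (oldval and newval in the answer are placeholders):

```sas
data want;
  set have;
  retain newval;                              /* keeps its value from row to row */
  if not missing(value) then newval = value;  /* remember the latest recorded value */
  drop value;                                 /* discard the original, partly missing column */
  rename newval = value;                      /* write the filled column under the old name */
run;
```

RENAME only affects the output dataset, so inside the step the filled column is still referred to as newval.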

I'm assuming that the ID field represents your patient ID? And that you don't want to use values recorded against patient A for patient B etc... If so, then this code will do the job:
data test;
infile datalines truncover;
input ID $ value ;
datalines;
A 22
A
A 23
B
B 12
C 3
C
C
C
C 21
;
run;
Sort it first so that we can use by-group processing:
proc sort data=test;
by id;
run;
I prefer to use the retain statement rather than the lag() function as people are less likely to make mistakes using retain:
data final;
set test;
by id;
retain prev_value .;
if first.id then do;
prev_value = .; * RESET THIS VALUE EVERY TIME WE GET TO A NEW PATIENT;
end;
if value eq . then do;
value = prev_value; * VALUE IS MISSING SO ASSIGN THE PREVIOUS RECORDED VALUE FOR THE PATIENT AGAINST IT;
end;
else do;
prev_value = value; * PATIENT HAS A NEW VALUE TO RECORD SO SAVE IT INTO THE PREV_VALUE VARIABLE;
end;
run;
Incidentally, this gives a slightly different result from what you requested: patient B did not supply a value on his first visit, so his first record will remain missing. If you need to fill that in with the value from his second visit, sort the dataset in the opposite direction and run the same code against it.
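The reverse pass could be sketched as follows. The example data has no visit number, so this assumes the row order within each ID is the visit order and records it in a helper variable; _seq, next_value, and the dataset names seq and backfilled are made up for illustration.

```sas
/* Record the current row order so it can be reversed and restored. */
data seq;
  set final;
  _seq = _n_;
run;

/* Reverse the visit order within each patient. */
proc sort data=seq;
  by id descending _seq;
run;

/* Same retain logic as before; it now carries the NEXT visit's value back. */
data backfilled;
  set seq;
  by id;
  retain next_value .;
  if first.id then next_value = .;
  if value eq . then value = next_value;
  else next_value = value;
  drop prev_value next_value;
run;

/* Restore the original visit order. */
proc sort data=backfilled;
  by id _seq;
run;
```

Because final has already been filled forward, the only values still missing are those at the start of a patient's records, so the reverse pass changes nothing else.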

Related

Deleting missing values when dealing with panel data

I am working with a panel dataset: many countries and many variables over a period of time. The problem is that some countries have no value for certain variables across the whole period, and I would like to get rid of them. I found this code for deleting rows with missing values:
DATA data0;
SET data1;
IF cmiss(of _all_) then delete;
RUN;
But all this does is check every row, while I would like to delete a whole country if it has no observations in at least one variable.
Here's a part of the data :
If you want to delete the whole country if it has any information missing, you are on the right track, you just need to add a (group) by statement.
If your data is already sorted by country, as it appears to be in the picture, you can just run:
data want;
set have;
by country; * requires input sorted by country. Rows are still tested one at a time;
IF cmiss(of _all_) then delete;
run;
If it is not sorted, you need to first run:
proc sort data=have;
by country;
run;
However, if you have 60 years of data for every country, my guess is that you will not find a single one that has all the information for every year. It will probably be better to make some substantive choices about the countries and periods you want to analyze, and then perform multiple imputation of the missing data: https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf
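Note that a BY statement on its own does not change what the IF deletes: each row is still tested individually, so only the incomplete rows are removed. A minimal sketch of dropping a whole country when any of its rows has a missing value, assuming have is sorted by country (_any_missing is a helper name made up for this example):

```sas
data want;
  _any_missing = 0;
  /* Pass 1: read one country and note whether any row has a missing value. */
  do until (last.country);
    set have;
    by country;
    if cmiss(of _all_) > 0 then _any_missing = 1;
  end;
  /* Pass 2: re-read the same country and output it only if it was complete. */
  do until (last.country);
    set have;
    by country;
    if not _any_missing then output;
  end;
  drop _any_missing;
run;
```

Each pass of the outer data step handles exactly one country, so the flag set in the first loop applies to all rows read in the second.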
You can use a DOW loop to compute which variable(s) contain only missing values within a group.
A second DOW loop outputs only those groups in which all variables contain at least one value.
Example:
data have;
call streaminit (2020);
do country = 1 to 6;
do year = 1960 to 1999;
array x gini kof tradegdp fdi gdp age_dep educ;
do over x;
x = rand('integer', 20, 100);
end;
if country = 1 then call missing (gini);
if country = 2 then call missing (educ);
if country = 4 then call missing (fdi);
output;
end;
end;
run;
data want;
* count number of non-missing values over group for each arrayed variable;
do _n_ = 1 by 1 until (last.country);
set have;
by country;
array x gini kof tradegdp fdi gdp age_dep educ;
array flag(100) _temporary_; * flag if variable has a non-missing value in group;
do _index = 1 to dim(x);
if not(flag(_index)) then flag(_index) = 1 - missing(x(_index));
end;
end;
* check if at least one variable has no values;
_remove_group_flag = sum(of flag(*)) ne dim(x);
do _n_ = 1 to _n_;
set have;
if not _remove_group_flag then output;
end;
call missing (of flag(*));
run;
Will LOG
NOTE: There were 240 observations read from the data set WORK.HAVE. First DOW loop
NOTE: There were 240 observations read from the data set WORK.HAVE. Second DOW loop
NOTE: The data set WORK.WANT has 120 observations and 11 variables. Conditional output
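As a cross-check, the same filter can be sketched in PROC SQL, assuming the have dataset generated above (want_sql is a name chosen for this example). COUNT(variable) counts non-missing values, so a country is kept only when every variable has at least one value within the group.

```sas
proc sql;
  create table want_sql as
  select *
  from have
  group by country
  /* keep a country only if each variable has at least one non-missing value */
  having count(gini) > 0 and count(kof) > 0 and count(tradegdp) > 0
     and count(fdi) > 0 and count(gdp) > 0 and count(age_dep) > 0
     and count(educ) > 0
  order by country, year;
quit;
```

Because the query selects detail rows alongside group-level counts, SAS remerges the summary statistics with the original data and says so in a log note. Unlike the DOW version, the variable names have to be listed explicitly.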

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need keep track of whether any ID is ever different than the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;

Filling in missing values with forward-backward method with lag in SAS

Assume that you have a table with user name, counter and score for each counter.
data have;
input user $ counter score;
cards;
A 1 .
A 2 .
A 3 40
A 4 .
A 5 20
A 6 .
B 1 30
B 2 .
C 1 .
C 2 .
C 3 .
;
run;
Some scores are missing between some counters, and you want to fill them with the score from the previous counter. So the result will look like below:
A 1 40
A 2 40
A 3 40
A 4 40
A 5 20
A 6 20
B 1 30
B 2 30
C 1 .
C 2 .
C 3 .
I managed to fill the missing score values forward by using the lag function like below:
data result1a;
set have(keep=user);
by user;
*Look ahead;
merge have have(firstobs=2 keep=score rename=(score=_NextScore));
if first.user then do;
if score= . then score=_NextScore;
end;
else do;
_PrevScore = lag(score);
if score= . then score=_PrevScore;
end;
output;
run;
Then I sorted the table in reverse order by using the descending option on counter, as below:
proc sort data = result1a out= result1b;
by user descending counter ;
run;
Finally, I wanted to fill the missing values forward in the rearranged table (that is, backward relative to the initial table) by using the lag function again, as below.
I used the lag function inside a do block because I wanted to update the previous value at each step (for example, the value 40 would be carried from the first score all the way to the last score in the group).
However, I get a strange result: not all missing values get a real value. Any idea how to fix the last data step?
data result1c;
set result1b;
by user;
if first.user then do;
if score= . then score=_NextScore;
else score = score;
end;
else do;
_PrevScore = lag(score);
if score= . then
score=_PrevScore;
else score = score;
end;
output;
run;
Don't need to use lag, use retain (or equivalent). Here's a double DoW loop solution that does it in one datastep (and, effectively, one read - it buffers the read so this is as efficient as a single read).
First we loop through the dataset to get the first score found, so we can grab that for the initial prev_score value. Then setting that, and re-looping through the rows for that user and outputting. There's no actual retain here since I am doing the looping myself, but it's similar to if there were a retain prev_score; and this was a normal data step loop. I don't actually retain it since I want it to go missing when a new user is met.
data want;
do _n_ = 1 by 1 until (last.user);
set have;
by user;
if missing(first_score) and not missing(score) then
first_score = score;
end;
prev_score = first_score;
do _n_ = 1 by 1 until (last.user);
set have;
by user;
if missing(score) then
score = prev_score;
prev_score = score;
output;
end;
run;
lag() is a commonly misunderstood function. The name implies that when you call it SAS looks back at the previous row and grabs the value, but this is not at all the case.
In fact, lag<n>() is a function that creates a "queue" with n values. When you call lag<n>(x), it pushes the current value of x into that queue and reads a previous value from it (of course the pushing only occurs once per row). So if you have lag<n>() within a condition, the pushing only occurs when that condition is satisfied.
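A small sketch of the queue behaviour described above; the dataset demo and its variables are invented for illustration. One lag call sits inside a condition, the other runs on every row:

```sas
data demo;
  input x;
  /* This queue is only pushed on even-numbered rows. */
  if mod(_n_, 2) = 0 then lag_cond = lag(x);
  /* This queue is pushed on every row. */
  lag_all = lag(x);
  datalines;
1
2
3
4
5
6
;
run;

proc print data=demo;
run;
```

lag_all is missing, 1, 2, 3, 4, 5: the value from the previous row. lag_cond is missing on rows 1 to 3, then 2 on row 4 and 4 on row 6. It returns the value of x from the last time that particular lag call executed, not from the previous row.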
To fix your problem, you need the lag() function to run for every row, and to run after score has been corrected:
data result1c;
set result1b;
by user;
if first.user then do;
if score= . then score=_NextScore;
else score = score;
end;
else do;
if score= . then
score=_PrevScore;
else score = score;
end;
_PrevScore = lag(score);
output;
run;
EDIT: I got hung up on the misuse of lag and didn't present a working alternative. Because you're modifying score, it's a bad idea to use lag at all. Retain will work here:
data result1c;
set result1b;
by user;
retain _PrevScore;
if first.user then do;
if score= . then score=_NextScore;
else score = score;
end;
else do;
if score= . then
score=_PrevScore;
else score = score;
end;
_PrevScore = score;
output;
run;

proc summary with statistic "multiply"

Is it possible to make a new statistic with proc summary that multiplies every value in each column, for example instead of just mean? SAS is so rigid it makes me crazy.
data test;
input b c ;
datalines;
50 11
35 12
75 13
;
Desired output would be 50*35*75, and 11*12*13, and _FREQ_ (as is normal output in proc summary)
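This is an uncommon aggregate, so you essentially need to roll your own. Since a data step loops over the rows, this is easily accomplished by using RETAIN to carry a running product from row to row and outputting the result at the last record. The running products must start at 1; left uninitialized they start as missing, and every product would then be missing.
Data want;
Set test end=eof;
Retain prod_b 1 prod_c 1; * start at 1, the multiplicative identity;
prod_b = prod_b * b;
prod_c = prod_c * c;
Freq= _n_; * number of rows read so far;
If eof then OUTPUT; * write only the final totals;
Keep prod: freq;
Run;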
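If every value is strictly positive, as in the example data, the product can also be sketched as a sum of logarithms in PROC SQL. This is an alternative technique rather than a built-in "multiply" statistic; the table name want_sql is made up, and the result is subject to floating-point rounding.

```sas
proc sql;
  create table want_sql as
  select exp(sum(log(b))) as prod_b,   /* product of b, valid only when all b > 0 */
         exp(sum(log(c))) as prod_c,   /* product of c, valid only when all c > 0 */
         count(*)         as freq      /* number of rows */
  from test;
quit;
```

Zeros or negative values would need separate handling, since LOG is undefined for them.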

Delete the group that none of its observation contain the certain value in SAS

I want to delete every group in which no observation has NUM=14.
Something like this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the observations with ID=2 contain NUM=14, I delete group 2.
The result should look like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The sub query selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
The equivalent data step approach would be:
/* Get all the groups that contain an observation where NUM = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicate values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both data steps, the subsetting IF statement is used to output observations only when the condition is met. In the second case, k is a temporary variable with value 1 (true) when an observation is read from keepGroups and 0 (false) otherwise.
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (Assuming the DATA/SET names are mistyped and not actually wrong in your program) is the first data step doesn't append that 100 to every row - only to the 14 row. What you need is one 'line' per ID value with a keep/no keep decision.
You can either do this by doing your first data step, but RETAIN score, and only output one row per ID. Your code would actually work, based on 14 being the first row, if you just fixed your data/set typo; but it only works when 14 is the first row.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
data has_fourteen;
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
run;
data newdata;
merge originaldat has_fourteen;
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice; for each ID, it iterates once through all records, looking for a 14, if it finds one then setting keep to 1. Then it reads all records again for that ID, and keeps if keep=1. Then it goes on to the next set of records by ID.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find out the list of groups which contains num=14, use below SQL */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only groups containing num=14 then use following data step */
data out;
set in;
where id in (&lst.);
run;