Attempting to create a variable that calculates the number of days between the current records date and the previous records date (grouped by patid). I am obviously missing something in my attempt, and I have unfortunately made of mess of things.
Any answers would be appreciated.
data sample;
informat patid $1. test_date mmddyy10.;
input patid test_date;
format test_date date9.;
datalines;
a 10/01/2005
a 10/05/2005
a 10/17/2005
b 10/02/2005
b 10/18/2005
b 10/22/2005
;
run;
I would like my data set to look like this:
data desire;
informat patid $1. test_date mmddyy10. dif;
input patid test_date dif;
format test_date date9.;
datalines;
a 10/01/2005 0
a 10/05/2005 4
a 10/17/2005 12
b 10/02/2005 0
b 10/18/2005 16
b 10/22/2005 4
;
run;
My attempt:
data increment;
set sample;
by patid;
if first.patid then do;
dys_btwn_trmnt = 0;
end;
else do;
dys_btwn_trmnt=dif(test_date);
end;
run;
Then I thought I may need retain, but no success -
data increment;
set sample;
by patid;
retain dys_btwn_trmnt 0;
if first.patid then do;
dys_btwn_trmnt = 0;
end;
else do;
dys_btwn_trmnt=dif(test_date);
end;
run;
This is an example of what happens when you conditionally execute a lag function (in this case the DIF() function). The lag functions to do return the value from the previous observation. Instead they return the value from a stack that is created by executing the function. By not executing the function for every observation not all values are placed onto the stack.
The simple solution is to run the DIF() function for every observation and then use the FIRST. condition to fix the value for the first observation in the group.
data increment;
set sample;
by patid;
dys_btwn_trmnt=dif(test_date);
if first.patid then dys_btwn_trmnt=0;
run;
Related
I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need keep track of whether any ID is ever different than the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
I have the following problem:
I want to fill missing values with proc expand be simply taking the value from the next data row.
My data looks like this:
date;index;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
As you can see for some dates the index is missing. I want to achieve the following:
date;index;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;-1688
05.Jul09;-1688
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;-1683
12.Jul09;-1683
13.Jul09;-1683
As you can see the values for the missing data where taken from the next row (11.Jul09 and 12Jul09 got the value from 13Jul09)
So proc expand seems to be the right approach and i started using this code:
PROC EXPAND DATA=DUMMY
OUT=WORK.DUMMY_TS
FROM = DAY
ALIGN = BEGINNING
METHOD = STEP
OBSERVED = (BEGINNING, BEGINNING);
ID date;
CONVERT index /;
RUN;
QUIT;
This filled the gaps but from the previous row and whatever I set for ALIGN, OBSERVED or even sorting the data descending I do not achieve the behavior I want.
If you know how to make it right it would be great if you could give me a hint. Good papers on proc expand are apprechiated as well.
Thanks for your help and kind regards
Stephan
I don't know about proc expand. But apparently this can be done with a few steps.
Read the dataset and create a new variable that will get the value of n.
data have;
set have;
pos = _n_;
run;
Sort this dataset by this new variable, in descending order.
proc sort data=have;
by descending pos;
run;
Use Lag or retain to fill the missing values from the "next" row (After sorting, the order will be reversed).
data want;
set have (rename=(index=index_old));
retain index;
if not missing(index_old) then index = index_old;
run;
Sort back if needed.
proc sort data=want;
by pos;
run;
I'm no PROC EXPAND expert but this is what I came up with. Create LEADS for the maximum gap run (2) then coalesce them into INDEX.
data index;
infile cards dsd dlm=';';
input date:date11. index;
format date date11.;
cards4;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
;;;;
run;
proc print;
run;
PROC EXPAND DATA=index OUT=index2 method=none;
ID date;
convert index=lead1 / transform=(lead 1);
CONVERT index=lead2 / transform=(lead 2);
RUN;
QUIT;
proc print;
run;
data index3;
set index2;
pocb = coalesce(index,lead1,lead2);
run;
proc print;
run;
Modified to work for any reasonable gap size.
data index;
infile cards dsd dlm=';';
input date:date11. index;
format date date11.;
cards4;
27.Jun09;
28.Jun09;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
14.Jul09;
15.Jul09;
16.Jul09;
17.Jul09;-1694
;;;;
run;
proc print;
run;
/* find the largest gap */
data gapsize(keep=n);
set index;
by index notsorted;
if missing(index) then do;
if first.index then n=0;
n+1;
if last.index then output;
end;
run;
proc summary data=gapsize;
output out=maxgap(drop=_:) max(n)=maxgap;
run;
/* Gen the convert statement for LEADs */
filename FT67F001 temp;
data _null_;
file FT67F001;
set maxgap;
do i = 1 to maxgap;
put 'Convert index=lead' i ' / transform=(lead ' i ');';
end;
stop;
run;
proc expand data=index out=index2 method=none;
id date;
%inc ft67f001;
run;
quit;
data index3;
set index2;
pocb = coalesce(index,of lead:);
drop lead:;
run;
proc print;
run;
I have a dataset looks like the following:
Name Number
a 1
b 2
c 9
d 6
e 5.5
Total ???
I want to calculate the sum of variable Number and record the sum in the last row (corresponding with Name = 'total'). I know I can do this using proc means then merge the output backto this file. But this seems not very efficient. Can anyone tell me whether there is any better way please.
you can do the following in a dataset:
data test2;
drop sum;
set test end = last;
retain sum;
if _n_ = 1 then sum = 0;
sum = sum + number;
output;
if last then do;
NAME = 'TOTAL';
number = sum;
output;
end;
run;
it takes just one pass through the dataset
It is easy to get by report procedure.
data have;
input Name $ Number ;
cards;
a 1
b 2
c 9
d 6
e 5.5
;
proc report data=have out=want(drop=_:);
rbreak after/ summarize ;
compute after;
name='Total';
endcomp;
run;
The following code uses the DOW-Loop (DO-Whitlock) to achieve the result by reading through the observations once, outputting each one, then lastly outputting the total:
data want(drop=tot);
do until(lastrec);
set have end=lastrec;
tot+number;
output;
end;
name='Total';
number=tot;
output;
run;
For all of the data step solutions offered, it is important to keep in mind the 'Length' factor. Make sure it will accommodate both 'Total' and original values.
proc sql;
select max(5,length) into :len trimmed from dictionary.columns WHERE LIBNAME='WORK' AND MEMNAME='TEST' AND UPCASE(NAME)='NAME';
QUIT;
data test2;
length name $ &len;
set test end=last;
...
run;
This question is an expansion of this one : SAS: Create a frequency variable
The code provided in the first response work well, but what if I want to add another categorical variable ? I have a date variable and an ID, categorical variable. I've tried multiple things, but here's what seemed the most logical to me (but doesn't work):
data work.frequencycounts;
do _n_ =1 by 1 until (last.Date);
set work.dataset;
by Date ID;
if first.Date & first.ID then count=0;
count+1;
end;
frequency= count;
do _n_ = 1 by 1 until (last.Date);
set work.dataset;
by Date ID;
output;
end;
run;
Should I add a do loop ?
Thanks for your help.
Edits:
Example of what I have:
Date ID
1 19736 H-3-10
2 19736 H-3-12
3 19737 E-2-10
4 19737 E-2-10
Example of what I want:
Date ID Count
1 19736 H-3-10 1
2 19736 H-3-12 1
3 19737 E-2-10 2
4 19737 E-2-10 2
This produces the desired output.
What is happening here is that you need to use the last variable in the BY statement for everything with first./last. processing. If you need to know why, put a few put _all_; in the datastep to see what is what value at different points. You shouldn't check for first.Date at any point, because if first.Date is true then first.ID is always true (by definition, first propagates rightwards); and you want a different count for [first.ID and not first.date].
Basically, treat the initial example as correct, and the variable in the initial example should be the last variable in your by statement; add as many additional variables as you want to the left of it, and nothing will change. This does require the data be sorted by the by-group variables.
data have;
input date id $;
datalines;
19736 H-3-10
19736 H-3-12
19737 E-2-10
19737 E-2-10
;;;;;
run;
data work.want;
do _n_ =1 by 1 until (last.ID); *last.<last variable in by group>;
set work.have;
by Date ID;
if first.ID then count=0; *first.ID is what you want here.;
count+1;
end;
frequency= count; *this is not really needed - can use just the one variable consistently;
do _n_ = 1 by 1 until (last.ID); *again, last.<last var in by group>;
set work.have;
by Date ID;
output;
end;
run;
ESN is an id column that has multiple observations per esn, so repeated values of esn occur. For a given esn, I want to find the earliest service start date (and call it first), and I want to find the proper end date (called last) the if/then statement for how "last" is chosen is correct, but I get the following errors when I run the code below:
340 first = min(of start(*));
---
71
ERROR 71-185: The MIN function call does not have enough arguments.
here is the code I used
data three_1; /*first and last date created ?? used to ignore ? in data*/
set three;
format first MMDDYY10. last MMDDYY10.;
by esn;
array start(*) service_start_date;
array stop(*) service_end_date entry_date_est ;
do i=1 to dim(start);
first = min(of start(*));
end;
do i=1 to dim(stop);
if esn_status = 'Cancelled' then last = min(input(service_end_date, MMDDYY10.), input(entry_date_est, MMDDYY10.));
else last = max(input(service_end_date, MMDDYY10.), input(entry_date_est, MMDDYY10.));
end;
run;
"esn" "service_start_date" "service_end_date" "entry_date_est" "esn_status"
1 10/12/2010 01/01/2100 10/12/2012 cancelled
1 05/02/2009 02/12/2010 10/09/2012 cancelled
1 04/05/2011 03/04/2100 10/02/2012 cancelled
the results should be first= 05/02/2009 and last=10/12/2012
Arrays and the min(), max(), etc. functions operate horizontally across rows of a data set, not vertically across multiple records.
Assuming esn_status is constant for a given esn, then you need to sort your input by esn and service_start_date. You can use a data step to collect the values you want.
data three; /*thanks Joe for the data step to create the example data*/
length esn_status $10;
format service_start_date service_end_date entry_date_est MMDDYY10.;
input esn (service_start_date service_end_date entry_date_est) (:mmddyy10.) esn_status $;
datalines;
1 10/12/2010 01/01/2100 10/12/2012 cancelled
1 05/02/2009 02/12/2010 10/09/2012 cancelled
1 04/05/2011 03/04/2100 10/02/2012 cancelled
;;;;
run;
proc sort data=three;
by esn service_start_date;
run;
data three_1(keep=esn esn_status start last);
set three;
format start last date9.;
by esn;
retain start last;
if first.esn then do;
start = service_start_date;
last = service_end_date;
end;
if esn_status = "cancelled" then
last = min(last,service_end_date,entry_date_est);
else
last = max(last,service_end_date,entry_date_est);
if last.esn then
output;
run;
A DoW loop will get to where you want, or you could do it in SQL. Your desired results don't match with the actual results as far as I can tell, so you may need to make some adjustments. You'd need a second WANT dataset for the non-cancelled folks, I don't think there's an easy way to put it in one data step.
data have;
length esn_status $10;
format service_start_date service_end_date entry_date_est MMDDYY10.;
input esn (service_start_date service_end_date entry_date_est) (:mmddyy10.) esn_status $;
datalines;
1 10/12/2010 01/01/2100 10/12/2012 cancelled
1 05/02/2009 02/12/2010 10/09/2012 cancelled
1 04/05/2011 03/04/2100 10/02/2012 cancelled
;;;;
run;
data want_cancelled;
first = 99999;
last = 99999;
do _n_ = 1 by 1 until (last.esn);
set have(where=(esn_status='cancelled'));
by esn;
first = min(first,service_start_date);
last = min(last,service_end_date,entry_date_est);
end;
output;
keep first last esn;
format first last mmddyy10.;
run;