check if value exists in a group - sas

I have my desired output table:
data work.employees;
length employee timepoint visit realvisit $30;
input employee $ timepoint $ visit $ realvisit $;
datalines;
Smith 1 Screening Screening
Smith 1 Randomization Randomization
Williams 1 Screening Baseline
Williams 2 Randomization Randomization
Jones 1 Visit1 Visit1
Jones 2 Visit3 Visit3
;
run;
and I want to derive realvisit such that, within a group of (Employee, Timepoint), if the group has no record where visit = Randomization, then a record where visit = Screening gets realvisit = Baseline.
Realvisit in the above table is already derived correctly as an example of what I'm trying to achieve.
This is what I've tried so far:
proc sort data = work.employees;
by employee timepoint;
run;
data work.employees2;
set work.employees;
by employee timepoint;
if visit = 'Randomization' then exists = "Y";
else exists = "N";
if visit = 'Screening' and exists = "N" then
realvisit = 'Baseline';
run;

I think you need to check the whole group.
You could use a double DOW loop. The first one to check. The second to re-read the data so you can write it back out.
data work.employees2;
  * first pass: check whether the group contains a Randomization record;
  do until (last.timepoint);
    set work.employees;
    by employee timepoint;
    if visit = 'Randomization' then exists = "Y";
  end;
  * second pass: re-read the same group and write it back out;
  do until (last.timepoint);
    set work.employees;
    by employee timepoint;
    if visit = 'Screening' and exists ne "Y" then realvisit = 'Baseline';
    output;
  end;
run;


How to transform Table data to another Table format in SAS

I am stuck in transforming the data table from one format to another format using the SAS Programming function. The structure of the Table is given as below:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 Jun/1/11 1:00 8w^201 10E^45 Transfer to 8w
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 Jun/1/11 8:00 10E^45 Death
“Id” is the randomly generated patient identifier.
“Date” and “Time” are the timestamp of the event.
“Assigned_pat_loc” is the current patient location in the hospital, formatted as “unit^room^bed”. EIAB is the internal code for the emergency department; most admissions are processed through the emergency department.
“Prior_pat_loc” is the location where the patient was immediately prior to the current location.
“Activity” is the description of the event. It includes entries like “Admission”, “Transfer to”, “Transfer from”, “Discharge”, and “Death”.
You will notice a lot of duplicate records, where the same transfer is recorded in both the departing and the receiving unit. You will be able to tell by looking at the time stamp – they are identical for duplicate records.
I want to transform it into the following table.
Here are the details of the variables.
r_id is the name of the variable you will generate for the id of the other patient.
patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
Patient 2 also had two-room sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note that the room-sharing episodes are listed twice, once for each patient.
Can anyone guide me on how this could be done?
We need to process the data by location
proc sort data=HAVE;
by assigned_pat_loc date time;
run;
In the result, we do not need temporary variables (starting with an underscore), and date and time must be renamed to end_date and end_time.
data WANT (drop= _: rename=(date=end_date time=end_time));
set HAVE;
by assigned_pat_loc date time;
I generalize the problem to rooms with a capacity above 2 and use arrays.
Extending the temporary arrays beyond &max_patients saves me a few if-statements.
Note that temporary arrays are not written to the result and are retained automatically.
%let max_patients = 9;
array id_r {%eval(&max_patients - 1)} id_1 - id_%eval(&max_patients - 1);
array patients {%eval(&max_patients + 1)} _temporary_;
array admissions {%eval(&max_patients + 1)} _temporary_;
if _N_ eq 1 then patient_count = 0;
retain patient_count;
for every pat_loc, start all over
if first.assigned_pat_loc then do;
do _patient_nr = 1 to patient_count;
patients[patient_nr] = .;
end;
patient_count = 0;
end;
if a patient leaves, calculate the time she spent
if Activity in ("Discharge", "Death") then do;
_found_patient = 0;
do _patient_nr = 1 to patient_count;
if patients[_patient_nr] eq id then do;
start_date = datepart(admissions[_patient_nr]);
start_time = timepart(admissions[_patient_nr]);
duration = (dhms(date,0,0,time) - admissions[_patient_nr]) / 3600;
_found_patient = _patient_nr;
end;
end;
shift the patients that arrived later
if _found_patient then do;
do _patient_nr = _found_patient to patient_count;
patients[_patient_nr] = patients[_patient_nr + 1];
admissions[_patient_nr] = admissions[_patient_nr + 1];
end;
patient_count = patient_count - 1;
end;
find out who else was in the pat_loc and write the result
do _patient_nr = 1 to patient_count;
id_r[_patient_nr] = patients[_patient_nr];
end;
output;
end;
if a patient arrives, register that for later
else do;
patient_count = patient_count + 1;
patients[patient_count] = id;
admissions[patient_count] = dhms(date,0,0,time);
end;
run;
sort the results
proc sort;
by id start_date start_time;
run;
Disclaimer: this is a draft, which might need debugging.
When dealing with ranges in which there is a possibility of an unexpected overlap case you can enumerate over the range and perform simpler logic for finding shared time/unit/room.
Example:
data have;
length id date time 8 loc ploc $20 activity $20;
* the & modifier lets activity contain single blanks (e.g. Transfer to 8w);
input
id date :date11. time :time5. loc ploc activity & $20.;
format date date9. time time5.;
datetime = dhms (date,0,0,0) + time;
length unit room bed punit proom pbed $4;
unit = scan(loc,1,'^');
room = scan(loc,2,'^');
bed = scan(loc,3,'^');
punit = scan(ploc,1,'^');
proom = scan(ploc,2,'^');
pbed = scan(ploc,3,'^');
drop loc ploc;
datalines;
1 31-May-2011 8:00 EIAB^EIAB^6 . Admission
1 31-May-2011 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 8-Jun-2011 15:00 8w^201 . Discharge
2 31-May-2011 5:00 EIAB^EIAB^4 . Admission
2 31-May-2011 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 1-Jun-2011 1:00 8w^201 10E^45 Transfer to 8w
2 1-Jun-2011 8:00 8w^201 . Discharge
3 31-May-2011 9:00 EIAB^EIAB^2 . Admission
3 1-Jun-2011 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 5-Jun-2011 9:00 8w^201 . Discharge
4 31-May-2011 9:00 EIAB^EIAB^9 . Admission
4 31-May-2011 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 1-Jun-2011 8:00 10E^45 . Death
;
* Fill in the ranges to get data by hour;
data hours(keep=id in_unit in_room at_dt);
set have;
by id;
retain at_dt in_unit in_room;
if first.id then do;
at_dt = datetime;
in_unit = unit;
in_room = room;
end;
else do;
do at_dt = at_dt to datetime-1 by dhms(0,1,0,0);
output;
end;
in_unit = unit;
in_room = room;
end;
format at_dt datetime16.;
run;
* prepare for transposition;
proc sort data=hours;
by at_dt in_unit in_room id;
run;
* transpose to know which time/unit/room has multiple patients;
proc transpose data=hours out=roomies_by_hour(drop=_name_ where=(not missing(patid2))) prefix=patid;
by at_dt in_unit in_room ;
var id;
run;
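* sort so each pair of roommates in a room is contiguous and in time order;
proc sort data=roomies_by_hour;
by in_unit in_room patid1 patid2 at_dt;
run;
* 'unfill' the individual hours to get ranges again;
data roomies;
set roomies_by_hour;
by in_unit in_room patid1 patid2;
retain start_dt end_dt;
format start_dt end_dt datetime16.;
if first.patid2 then
start_dt = at_dt;
if last.patid2 then do;
end_dt = at_dt + dhms(0,1,0,0); * each row covers the hour starting at at_dt;
length_hrs = intck('hour', start_dt, end_dt);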
output;
end;
run;
* stack data flipping perspective of who shared with who;
data roomies_mirrored;
set
roomies /* patid1 centric */
roomies(rename=(patid1=patid2 patid2=patid1)) /* patid2 centric */
;
run;
proc sort data=roomies_mirrored;
by patid1 start_dt;
run;

condense multiple records into single record in sas

I have row-level data by account, and I wish to collapse it to one row per account owner in a new data set. Yes takes priority: if any of the owner's accounts has Yes for a flag, the owner's result for that flag is Yes.
Account_Owner Account_No Ever_Purchase Ever_Purchase_within_2days Ever_Deliver_in_2weeks
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
Desired Outcome
Account_Owner Ever_Purchase Ever_Purchase_within_2days Ever_Deliver_in_2weeks
Tom Yes Yes Yes
I am sorry that I don't have any code because I don't know where to start.
You can use a DOW loop to track the group result for each ever_* variable in a temporary array.
proc format;
value yesno .,0 = 'No' other='Yes';
data have; input
Account_Owner $ Account_No Ever_Purchase $ Ever_Purchase_within_2days $ Ever_Deliver_in_2weeks $;
datalines;
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
;
data want;
array evals(100) _temporary_; * presume never more than 100 flag variables;
call missing (of evals(*));
* dow loop;
do until (last.account_owner);
set have;
by account_owner;
array flags ever:;
do _n_ = 1 to dim(flags);
evals(_n_) = evals(_n_) or flags(_n_) = 'Yes'; * compute aggregate result;
end;
end;
* move results back into original variables;
do _n_ = 1 to dim(flags);
flags(_n_) = put(evals(_n_), yesno.);
end;
* implicit output, one row per group combination;
run;
Note: As an alternative, you can convert Yes/No to numeric 1/0 and use Proc SUMMARY or Proc MEANS to compute the group result (the max of a variable would be 1 if any value is Yes and 0 if all are No).
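The numeric alternative in the note could look roughly like the sketch below. It is only an illustration: the data set names have_num and want_num and the numeric variable names purchase, purchase_2d and deliver_2w are made up here, and the sample rows are copied from the question.

```sas
* Sample data copied from the question;
data have;
  input Account_Owner $ Account_No Ever_Purchase $ Ever_Purchase_within_2days $ Ever_Deliver_in_2weeks $;
  datalines;
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
;

* Convert each Yes/No flag to 1/0 (the numeric names are made up for this sketch);
data have_num;
  set have;
  purchase    = (Ever_Purchase = 'Yes');
  purchase_2d = (Ever_Purchase_within_2days = 'Yes');
  deliver_2w  = (Ever_Deliver_in_2weeks = 'Yes');
  keep Account_Owner purchase purchase_2d deliver_2w;
run;

* MAX is 1 if any account has Yes and 0 if all are No;
proc summary data=have_num nway;
  class Account_Owner;
  var purchase purchase_2d deliver_2w;
  output out=want_num(drop=_type_ _freq_) max=;
run;
```

If the Yes/No text is needed in the result, the yesno. format defined in the answer above can be attached to the three numeric variables.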

SAS cumulative count by unique ID and date

I have a dataset like below
Customer_ID Vistited_Date
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
I am trying to find the cumulative unique count of customers by date; my output should look like this:
Cust_count Vistited_Date
3 7-Feb-20
2 14-Feb-20
7-Feb-2020 has 3 unique customers, whereas 14-Feb-2020 has only 2 because customer 1234 has already visited.
Does anyone know how I could build a data set under these conditions?
Sorry if my question is not clear enough, and I am available to give more details if necessary.
Thanks!
NOTE: @draycut's answer has the same logic but is faster, and I will explain why.
@draycut's code uses one hash method, add(), using the return code as the test for the conditional increment. My code uses check() to test for the conditional increment and then add() (which will never fail) to track. The one-method approach can be anywhere from 15% to 40% faster (depending on the number of groups, the size of the groups and the id reuse rate).
You will need to track the IDs that have occurred in all prior groups, and exclude the tracked IDs from the current group count.
Tracking can be done with a hash, and conditional counting can be performed in a DOW loop over each group. A DOW loop places the SET statement inside an explicit DO.
Example:
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data counts(keep=date count);
if _n_ = 1 then do;
declare hash tracker();
tracker.defineKey('id');
tracker.defineDone();
end;
do until (last.date);
set have;
by date;
if tracker.check() ne 0 then do;
count = sum(count, 1);
tracker.add();
end;
end;
run;
Raw performance benchmark - no disk io, cpu required to fill array before doing hashing, so those performance components are combined.
The root performance is how fast can new items be added to the hash.
Simulate 3,000,000 'records', 1,000 groups of 3,000 dates, 10% id reuse (so the distinct ids will be ~2.7M).
%macro array_fill (top=3000000, n_group = 1000, overlap_factor=0.10);
%local group_size n_overlap index P Q;
%let group_size = %eval (&top / &n_group);
%if (&group_size < 1) %then %let group_size = 1;
%let n_overlap = %sysevalf (&group_size * &overlap_factor, floor);
%if &n_overlap < 0 %then %let n_overlap = 0;
%let top = %sysevalf (&group_size * &n_group);
P = 1;
Q = &group_size;
array ids(&top) _temporary_;
_n_ = 0;
do i = 1 to &n_group;
do j = P to Q;
_n_+1;
ids(_n_) = j;
end;
P = Q - &n_overlap;
Q = P + &group_size - 1;
end;
%mend;
options nomprint;
data _null_ (label='check then add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.check() ne 0 then do;
count = sum(count,1);
h.add();
end;
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data _null_ (label='just add');
length id 8;
declare hash h();
h.defineKey('id');
h.defineDone();
%array_fill;
do index = 1 to dim(ids);
id = ids(index);
if h.add() = 0 then
count = sum(count,1);
end;
_n_ = h.num_items;
put 'num_items=' _n_ comma12.;
put index= comma12.;
stop;
run;
data have;
input Customer_ID Vistited_Date :anydtdte12.;
format Vistited_Date date9.;
datalines;
1234 7-Feb-2020
4567 7-Feb-2020
9870 7-Feb-2020
1234 14-Feb-2020
7654 14-Feb-2020
3421 14-Feb-2020
;
data want (drop=Customer_ID);
if _N_=1 then do;
declare hash h ();
h.definekey ('Customer_ID');
h.definedone ();
end;
do until (last.Vistited_Date);
set have;
by Vistited_Date;
if h.add() = 0 then Count = sum(Count, 1);
end;
run;
If your data is not sorted and you prefer SQL, this solution may work just as well for you, and it is very simple:
/* your example, plus 3 extra rows */
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
1234 15-Feb-20
7654 15-Feb-20
1111 15-Feb-20
;
run;
/* simple set theory. Final dataset contains your final data like results
below*/
proc sql;
create table temp(where =(mindate=date)) as select
ID, date,min(date) as mindate from have
group by id;
create table final as select count(*) as customer_count,date from temp
group by date;
quit;
/* results:
customer_count Date
3 07.febr.20
2 14.febr.20
1 15.febr.20
*/
Another method, because I don't know hashes so well. >_<
data have;
input ID Date: date9.; format date date11.;
datalines;
1234 7-Feb-20
4567 7-Feb-20
9870 7-Feb-20
1234 14-Feb-20
7654 14-Feb-20
3421 14-Feb-20
;
data want;
length Used $200.;
retain Used;
set have;
by Date;
if first.Date then count = .;
if not find(Used,cats(ID)) then do;
count + 1;
Used = catx(',',Used,ID);
end;
if last.Date;
put Date= count=;
run;
If you are not overly concerned with processing speed and want something simple:
proc sort data=have;
by id date;
** Get date of each customer's first unique visit **;
proc sort data=have out=first_visit nodupkey;
by id;
proc freq data=first_visit noprint;
tables date /out=want (keep=date count);
run;

Combining the rows with overlapping data ranges in SAS

Since I am new to SAS, I need some help understanding how to combine overlapping date ranges into one row. I want to combine the overlapping date ranges when they have a matching ID. If the dates don't overlap, I want to keep them as they are. If they overlap with a matching ID and drug code, they should be combined into one line. Please look at the sample data set below and the expected results:
Current Data set:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 1/1/2019
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
Expected results:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 3/31/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/2/2018 8/31/2018
1 100 9/1/2018 5/4/2019
I wrote some SAS code but I am combining the dates even when there is no overlap. I want to write some code which should work in SAS.
PROC SORT DATA=Want OUT=ONE;
BY PERSON_ID BEG_DATE DRUG_CODE END_DATE;
RUN;
data TWO (DROP=PERSON_ID2 DRUG_CODE2 BEG_DATE END_DATE
RENAME=(BEG2=BEG_DOS
END2=END_DOS));
SET ONE;
RETAIN BEG2 END2;
PERSON_ID2=LAG1(PERSON_ID);
DRUG_CODE2=LAG1(DRUG_CODE);
IF PERSON_ID2=PERSON_ID AND DRUG_CODE2=DRUG_CODE AND BEG_DATE LE(END2+1) THEN
DO;
BEG2=MIN(BEG_DATE,BEG2);
END2=MAX(END_DATE,END2);
END;
ELSE
DO;
SEG+1;
BEG2=BEG_DATE;
END2=END_DATE;
END;
FORMAT BEG2 END2 MMDDYY10.;
RUN;
DATA THREE(DROP=BEG_DOS END_DOS SEG);
RETAIN BEG_DATE END_DATE;
SET TWO;
BY PERSON_ID SEG;
FORMAT BEG_DATE END_DATE MMDDYY10.;
IF FIRST.SEG THEN
DO;
BEG_DATE=BEG_DOS;
END;
IF LAST.SEG THEN
DO;
END_DATE = END_DOS;
OUTPUT;
END;
RUN;
This is how I would do it. Create an obs for each ID DRUG and DATE. Flag the gaps and summarize by RUN.
data have;
input ID Drug_Code (BEG End)(:mmddyy.);
format BEG End mmddyyd10.;
cards;
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 90 6/1/2018 8/15/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
;;;;
run;
proc print;
run;
/*1 100 1/1/2018 1/1/2019*/
data exv/ view=exv;
set have;
do date = beg to end;
output;
end;
drop beg end;
format date mmddyyd10.;
run;
proc sort data=exv out=ex nodupkey;
by id drug_code date;
run;
data breaksV / view=BreaksV;
set ex;
by id drug_code;
dif = dif(date);
if first.drug_code then do; dif=1; run=1; end;
if dif ne 1 then run+1;
run;
proc summary data=breaksV nway missing;
class id drug_code run;
var date;
output out=want(drop=_type_) min=Begin max=End;
run;
Proc print;
run;
Computing the extent range composed of overlapping segment ranges requires a good understanding of the range conditions (cases).
Consider the scenarios when sorted by start date (within any larger grouping set, G, such as id and drug)
Let [ and ] be endpoints of a range
# be date values (integers) within
Extent be the combined range that grows
Segment be the range in the current row
Case 1 - Growth. Within G, Segment start is before Extent end.
Segment will either not contribute to Extent or extend it.
  [####]      Extent
+  [#]        Segment range DOES NOT contribute
  --------
  [####]      Extent (do not output a row, still growing)
or
  [####]      Extent
+   [#####]   Segment range DOES contribute
  ---------
  [#######]   Extent (do not output a row, still growing)
Case 2 - Terminus. 3 possibilities:
1. Within G, Segment start is after Extent end,
2. Next G reached (different id/drug combination),
3. End of data reached.
#2 and #3 can be tested by checking the appropriate last. flag.
  [####]        Extent
+       ..[#]   Segment beyond Extent (gap is 2)
  -----------
  [####]        output Extent
          [#]   reset Extent to Segment
You can adjust your rules for Segment being adjacent (gap=0) or close enough (gap < threshold) to mean an Extent is either expanded, or, output and reset to Segment.
Note: The situation is a little more complicated (not shown) for these real-world cases:
missing start means the Segment has an unknown start date (presume it to be the epoch, 0=01JAN1960, or some date that pre-dates all dates in the data or study)
missing end means the Segment is active today (end date is the date when the data is processed)
Sample code:
data have;
call streaminit(42);
do id = 1 to 10;
do _n_ = 1 to 50;
drug = ceil(10 * rand('UNIFORM'));
beg_date = intnx ('MONTH', '01JAN2008'D, floor(20 * rand('UNIFORM')));
end_date = intnx ('DAY', beg_date, floor(75 * rand('UNIFORM')));
OUTPUT;
end;
end;
format beg_date end_date yymmdd10.;
run;
proc sort data=have out=segments;
by id drug beg_date end_date;
run;
data want;
set segments;
by id drug beg_date end_date; * will error if incoming data is NOT sorted;
retain ext_beg ext_end;
retain gap_allowed 0; * set to 1 for contiguously adjacent segment ;
if first.drug then do;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 0;
end;
if beg_date <= ext_end + gap_allowed then do;
ext_end = max (ext_end, end_date);
segment_count + 1;
end;
else do;
extent_id + 1;
OUTPUT;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 1;
end;
if last.drug then do;
extent_id + 1;
OUTPUT;
* reset occurs implicitly;
* it will happen at first. logic when control returns to top of step;
end;
format ext_: yymmdd10.;
keep id drug ext_beg ext_end segment_count extent_id;
run;
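The note above about missing start and end dates is not covered by the sample code. One possible way to handle it, as a rough sketch only, is to fill the missing endpoints before sorting. The data set names have_raw and have_filled and the three sample rows are invented for illustration, and using today() as the processing date is an assumption.

```sas
* Tiny made-up sample with one unknown-start and one open-ended segment;
data have_raw;
  input id drug beg_date :yymmdd10. end_date :yymmdd10.;
  format beg_date end_date yymmdd10.;
  datalines;
1 5 2008-01-01 2008-02-15
1 5 . 2008-01-10
1 5 2008-03-01 .
;

data have_filled;
  set have_raw;
  * unknown start: presume the epoch (0 = 01JAN1960);
  if missing(beg_date) then beg_date = 0;
  * unknown end: presume the segment is still active on the processing date;
  if missing(end_date) then end_date = today();
run;
```

The filled data set can then go through the same sort and extent-building step as the complete data.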

longitudinal calculation in SAS with lag function [duplicate]

Hi, I have data in columns holding patient visits.
Some patient visits have no recorded value, and I want to copy the value from the previous visit. I am using the lag function, which is not working. Any idea?
The data is something like this:
ID value
A 22
A .
A 23
B .
B 12
C 3
C .
C .
C .
C 21
The required output:
ID value
A 22
A 22
A 23
B 23
B 12
C 3
C 3
C 3
C 3
C 21
You would use RETAIN not LAG here.
Retain:
data want;
set have;
retain newval;
if not missing(oldval) then newval=oldval;
run;
If you need the same variable name, drop+rename to get newval into oldval name.
Normally, you would also check that the ID is the same. Your example fills across IDs, so I left that out. If you don't want to fill a B record with a value from A, add by id; and then if first.id then call missing(newval); to reset the value at the start of each new ID.
I'm assuming that the ID field represents your patient ID? And that you don't want to use values recorded against patient A for patient B etc... If so, then this code will do the job:
data test;
infile datalines truncover;
input ID $ value ;
datalines;
A 22
A
A 23
B
B 12
C 3
C
C
C
C 21
;
run;
Sort it first so that we can use by-group processing:
proc sort data=test;
by id;
run;
I prefer to use the retain statement rather than the lag() function as people are less likely to make mistakes using retain:
data final;
set test;
by id;
retain prev_value .;
if first.id then do;
prev_value = .; * RESET THIS VALUE EVERY TIME WE GET TO A NEW PATIENT;
end;
if value eq . then do;
value = prev_value; * VALUE IS MISSING SO ASSIGN THE PREVIOUS RECORDED VALUE FOR THE PATIENT AGAINST IT;
end;
else do;
prev_value = value; * PATIENT HAS A NEW VALUE TO RECORD SO SAVE IT INTO THE PREV_VALUE VARIABLE;
end;
run;
Incidentally this will give a slightly different result to what you requested as patient B did not supply a value on his first visit so his first record will remain null. If you need to fill that in with the value from his second visit, simply sort the dataset in the opposite direction, and run the same code against it.
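The reverse-sort idea could be sketched as below. It assumes the forward-filled data set final from the step above (a few rows are recreated here so the sketch runs on its own), and it adds a helper variable seq, a name invented for this example, to remember the original row order. The names final_seq, final_back and next_value are also made up.

```sas
* Forward-filled rows as produced by the earlier step (recreated so the sketch runs alone);
data final;
  infile datalines truncover;
  input ID $ value;
  datalines;
A 22
A 22
A 23
B
B 12
C 3
;

* seq is a helper invented for this sketch to remember the original row order;
data final_seq;
  set final;
  seq = _n_;
run;

* reverse the visit order within each patient;
proc sort data=final_seq;
  by id descending seq;
run;

data final_back;
  set final_seq;
  by id;
  retain next_value;
  if first.id then next_value = .;
  if missing(value) then value = next_value;   * fill from the later visit;
  else next_value = value;
  drop next_value;
run;

* restore the original order;
proc sort data=final_back;
  by id seq;
run;
```

Because the forward fill has already run, only leading missing values are left, so this pass touches only the first records of a patient such as B.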