I have the data below in the visit variable:
Screening DayXX
CycleXX DayXX
CycleXX DayXX
CycleXX DayXX
CycleXX DayXX
CycleXX DayXX
CycleXX DayXX
CycleXX DayXX
CycleXX DayXX
Endofthetreatment DayXX
The visit values contain both cycles and days. The sponsor is now asking us to populate only the cycle in the visit variable, leaving out the Screening and End of Treatment records.
Use a where statement to only keep lines where the word "Cycle" is in it. You can then use the input, compress, and scan functions to select a specific string, keep only digits, and convert it to a number.
data have;
infile datalines dlm='|';
length visit $30.;
input visit$;
datalines;
Screening Day01
Cycle01 Day02
Cycle01 Day03
Cycle01 Day04
Cycle02 Day04
Cycle02 Day06
Cycle02 Day07
Cycle03 Day08
Cycle03 Day09
Endofthetreatment Day10
;
run;
data want;
set have;
where upcase(visit) contains 'CYCLE';
/* Keep only digits from the first and second words and
convert it to a number */
cycle = input(compress(scan(visit, 1),, 'DK'), 2.);
day = input(compress(scan(visit, 2),, 'DK'), 2.);
drop visit;
run;
Output:
cycle day
1 2
1 3
1 4
2 4
2 6
2 7
3 8
3 9
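If the visit variable itself should end up holding only the cycle text, as the question asks, one possible variation is to overwrite visit with its first word. This is a minimal sketch; the data set names have_demo and want_visit are made up for illustration, and it assumes the cycle is always the first space-delimited word.

```sas
data have_demo;                      /* hypothetical sample, same layout as above */
  infile datalines dlm='|';
  length visit $30;
  input visit $;
  datalines;
Screening Day01
Cycle01 Day02
Cycle02 Day04
Endofthetreatment Day10
;
run;

data want_visit;
  set have_demo;
  where upcase(visit) contains 'CYCLE';   /* drop Screening and end-of-treatment rows */
  visit = scan(visit, 1, ' ');            /* "Cycle01 Day02" becomes "Cycle01" */
run;
```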
I have my desired output table:
data work.employees;
length employee timepoint visit realvisit $30;
input employee $ timepoint $ visit $ realvisit $;
datalines;
Smith 1 Screening Screening
Smith 1 Randomization Randomization
Williams 1 Screening Baseline
Williams 2 Randomization Randomization
Jones 1 Visit1 Visit1
Jones 2 Visit3 Visit3
;
run;
and I want to derive realvisit such that, within a group of (Employee, Timepoint), if there is no record where visit = Randomization and the record has visit = Screening, then realvisit = Baseline.
Realvisit in the above table is already derived correctly as an example of what I'm trying to achieve.
This is what I've tried so far:
proc sort data = work.employees;
by employee timepoint;
run;
data work.employees2;
set work.employees;
by employee timepoint;
if visit = 'Randomization' then exists = "Y";
else exists = "N";
if visit = 'Screening' and exists = "N" then
realvisit = 'Baseline';
run;
I think you need to check the whole group.
You could use a double DOW loop. The first one to check. The second to re-read the data so you can write it back out.
data work.employees2;
do until (last.timepoint);
set work.employees;
by employee timepoint;
if visit = 'Randomization' then exists = "Y";
end;
do until (last.timepoint);
set work.employees;
by employee timepoint;
if visit = 'Screening' and exists ne "Y" then realvisit = 'Baseline';
output;
end;
run;
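As an alternative to the double DOW loop (a different technique, not the one shown above), the same group-level check can be sketched in PROC SQL, which remerges the group summary back onto each row. The table name employees_sql and the small sample are assumptions for illustration; note that PROC SQL does not guarantee the output row order.

```sas
data work.employees_demo;            /* hypothetical sample based on the question */
  length employee timepoint visit $30;
  input employee $ timepoint $ visit $;
  datalines;
Smith 1 Screening
Smith 1 Randomization
Williams 1 Screening
Williams 2 Randomization
Jones 1 Visit1
;
run;

proc sql;
  create table work.employees_sql as
  select employee,
         timepoint,
         visit,
         case
           /* max() of a 0/1 comparison is 1 if any row in the group matches */
           when visit = 'Screening' and max(visit = 'Randomization') = 0 then 'Baseline'
           else visit
         end as realvisit length = 30
  from work.employees_demo
  group by employee, timepoint;
quit;
```

SAS writes a note to the log about remerging summary statistics; that is expected here.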
I am stuck transforming a data table from one format to another in SAS. The structure of the table is given below:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 Jun/1/11 1:00 8w^201 10E^45 Transfer to 8w
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 Jun/1/11 8:00 10E^45 Death
“Id” is the randomly generated patient identifier.
“Date” and “Time” are the timestamp of the event.
“Assigned_pat_loc” is the current patient location in the hospital, formatted as “unit^room^bed”. EIAB is the internal code for the emergency department; most admissions are processed through the emergency department.
“Prior_pat_loc” is the location where the patient was immediately prior to the current location.
“Activity” is the description of the event. It includes entries like “Admission”, “Transfer to”, “Transfer from”, “Discharge”, and “Death”.
You will notice a lot of duplicate records, where the same transfer is recorded in both the departing and the receiving unit. You will be able to tell by looking at the time stamp – they are identical for duplicate records.
I want to transform it into the following table.
Here are the details of the variables.
r_id is the name of the variable you will generate for the id of the other patient.
Patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
Patient 2 also had two room-sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note that the room-sharing episodes are listed twice, once for each patient.
Could anyone guide me on how this could be done?
We need to process the data by location:
proc sort data=HAVE;
by assigned_pat_loc date time;
run;
In the result, we do not need temporary variables (starting with an underscore), and date and time must be renamed to end_date and end_time.
data WANT (drop= _: rename=(date=end_date time=end_time));
set HAVE;
by assigned_pat_loc date time;
I generalize the problem to rooms with a capacity above 2 and use arrays.
Extending the temporary arrays beyond &max_patients saves me a few if-statements.
Note that temporary arrays are automatically retained and are not written to the result.
%let max_patients = 9;
array id_r {%eval(&max_patients - 1)} id_1 - id_%eval(&max_patients - 1);
array patients {%eval(&max_patients + 1)} _temporary_;
array admissions {%eval(&max_patients + 1)} _temporary_;
if _N_ eq 1 then patient_count = 0;
retain patient_count;
For every pat_loc, start all over:
if first.assigned_pat_loc then do;
do _patient_nr = 1 to patient_count;
patients[_patient_nr] = .;
end;
patient_count = 0;
end;
If a patient leaves, calculate the time she spent:
if Activity in ("Discharge", "Death") then do;
_found_patient = 0;
do _patient_nr = 1 to patient_count;
if patients[_patient_nr] eq id then do;
start_date = datepart(admissions[_patient_nr]);
start_time = timepart(admissions[_patient_nr]);
duration = (dhms(date,0,0,time) - admissions[_patient_nr]) / 3600;
_found_patient = 1;
end;
/* shift the patients that arrived later (inside the loop, so every later slot moves down) */
if _found_patient then do;
patients[_patient_nr] = patients[_patient_nr + 1];
admissions[_patient_nr] = admissions[_patient_nr + 1];
end;
end;
patient_count = patient_count - 1;
find out who else was in the pat_loc and write the result
do _patient_nr = 1 to patient_count;
id_r[_patient_nr] = patients[_patient_nr];
end;
output;
end;
if a patient arrives, register that for later
else do;
patient_count = patient_count + 1;
patients[patient_count] = id;
admissions[patient_count] = dhms(date,0,0,time);
end;
run;
sort the results
proc sort;
by id start_date start_time;
run;
Disclaimer: this is a draft, which might need debugging.
When ranges may overlap in unexpected ways, you can enumerate each range hour by hour and then use simpler logic to find the shared time/unit/room.
Example:
data have;
length id date time 8 loc ploc $20 activity $10;
input
id Date& date11. Time time5. loc ploc Activity;
format date date9. time time5.;
datetime = dhms (date,0,0,0) + time;
length unit room bed punit proom pbed $4;
unit = scan(loc,1,'^');
room = scan(loc,2,'^');
bed = scan(loc,3,'^');
punit = scan(ploc,1,'^');
proom = scan(ploc,2,'^');
pbed = scan(ploc,3,'^');
drop loc ploc;
datalines;
1 31-May-2011 8:00 EIAB^EIAB^6 . Admission
1 31-May-2011 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 8-Jun-2011 15:00 8w^201 . Discharge
2 31-May-2011 5:00 EIAB^EIAB^4 . Admission
2 31-May-2011 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 1-Jun-2011 1:00 8w^201 10E^45 Transfer to 8w
2 1-Jun-2011 8:00 8w^201 . Discharge
3 31-May-2011 9:00 EIAB^EIAB^2 . Admission
3 1-Jun-2011 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 5-Jun-2011 9:00 8w^201 . Discharge
4 31-May-2011 9:00 EIAB^EIAB^9 . Admission
4 31-May-2011 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 1-Jun-2011 8:00 10E^45 . Death
;
* Fill in the ranges to get data by hour;
data hours(keep=id in_unit in_room at_dt);
set have;
by id;
retain at_dt in_unit in_room;
if first.id then do;
at_dt = datetime;
in_unit = unit;
in_room = room;
end;
else do;
do at_dt = at_dt to datetime-1 by dhms(0,1,0,0);
output;
end;
in_unit = unit;
in_room = room;
end;
format at_dt datetime16.;
run;
* prepare for transposition;
proc sort data=hours;
by at_dt in_unit in_room id;
run;
* transpose to know which time/unit/room has multiple patients;
proc transpose data=hours out=roomies_by_hour(drop=_name_ where=(not missing(patid2))) prefix=patid;
by at_dt in_unit in_room ;
var id;
run;
* 'unfill' the individual hours to get ranges again;
data roomies;
set roomies_by_hour;
by in_unit in_room patid1 patid2;
retain start_dt end_dt;
format start_dt end_dt datetime16.;
if first.patid2 then
start_dt = at_dt;
if last.patid2 then do;
end_dt = at_dt;
length_hrs = intck('hour', start_dt, end_dt);
output;
end;
run;
* stack data flipping perspective of who shared with who;
data roomies_mirrored;
set
roomies /* patid1 centric */
roomies(rename=(patid1=patid2 patid2=patid1)) /* patid2 centric */
;
run;
proc sort data=roomies_mirrored;
by patid1 start_dt;
run;
I would like to know if my data would be included in a specified month. Please see reprex below:
id Period_start Period_end
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016
4 05-10-2009 26-12-2015
I've tried the code below for a single month. It worked the first time but not after that.
data dates2;
set work.dates;
by id;
if (period_start>='01MAR2016'd and period_end<='01MAR2016'd) or (period_start>='31MAR2016'd and period_end<='31MAR2016'd) then flag='March 2016';
else flag='';
run;
/* Or */
data dates2;
set work.dates;
by id;
if ('01MAR2016'd ge period_start and '01MAR2016'd le period_end) or ('31MAR2016'd ge period_start and '31MAR2016'd le period_end) then flag='March 2016';
else flag='';
run;
My intended outcome for this example is below:
id Period_start Period_end Flag
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020 March 2016
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017 March 2016
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016 March 2016
4 05-10-2009 26-12-2015
Please note that I have a number of months to compare against, which is why I didn't use a where statement.
You can flag multiple months (i.e. the "number of months to compare") in one go if you store those months in a separate data set, as opposed to hard-coding the month in the DATA step source code.
Example:
The months to flag are stored in a control data set, which is then transposed to create flag variables. The flag variables are reloaded at every iteration of the DATA step implicit loop using SET and POINT=, and are conditionally cleared, based on a date range comparison, in an explicit loop over the flag variables.
data have;
attrib
id length=8
period_start period_end informat=ddmmyy10. format=ddmmyyd10.;
input
id Period_start Period_end; datalines;
1 01-01-2012 12-03-2015
1 21-03-2014 12-11-2014
2 09-05-2018 31-01-2019
3 08-12-2013 30-03-2015
3 26-03-2016 22-03-2020
4 31-07-2018 07-08-2018
4 29-09-2014 03-03-2017
4 13-06-2020 17-02-2021
4 23-01-2008 15-08-2016
4 05-10-2009 26-12-2015
;
data flag_months;
attrib month informat=monyy7. format=monyy7.;
input month; datalines;
MAR2016
AUG2018
;
proc transpose data=flag_months out=flag_vars(drop=_name_) prefix=FLAG_;
id month;
var month;
run;
data want;
set have;
retain one 1;
set flag_vars point=one; drop one; * load flag values;
array flag_vars flag_:;
do _i_ = 1 to dim(flag_vars);
* clear flag value if month does not touch any day in the period;
if not
( intnx('month', period_start, 0)
<=
flag_vars(_i_)
<=
intnx('month', period_end, 0, 'E')
)
then
call missing(flag_vars(_i_));
end;
run;
The flag months are transposed into flag variables, which are loaded and conditionally cleared during a pass over the data set containing the date range information.
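For a single hard-coded month, the check reduces to the usual interval-overlap test: the period touches the month when it starts on or before the month's last day and ends on or after the month's first day. A minimal sketch, assuming the data set names dates_demo and want_single (made up here) and a three-row subset of the question's data:

```sas
data dates_demo;                     /* hypothetical subset of the question's data */
  input id period_start :ddmmyy10. period_end :ddmmyy10.;
  format period_start period_end ddmmyyd10.;
  datalines;
1 01-01-2012 12-03-2015
3 26-03-2016 22-03-2020
4 23-01-2008 15-08-2016
;
run;

data want_single;
  set dates_demo;
  length flag $10;
  /* overlap test: period starts before the month ends AND ends after the month starts */
  if period_start <= '31MAR2016'd and period_end >= '01MAR2016'd then flag = 'March 2016';
run;
```

With these three rows, the first period ends in 2015 and stays unflagged, while the other two touch March 2016 and are flagged.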
This question is an expansion of this one: SAS: Create a frequency variable
The code provided in the first response works well, but what if I want to add another categorical variable? I have a date variable and an ID, which is a categorical variable. I've tried multiple things, but here's what seemed the most logical to me (it doesn't work):
data work.frequencycounts;
do _n_ =1 by 1 until (last.Date);
set work.dataset;
by Date ID;
if first.Date & first.ID then count=0;
count+1;
end;
frequency= count;
do _n_ = 1 by 1 until (last.Date);
set work.dataset;
by Date ID;
output;
end;
run;
Should I add a do loop ?
Thanks for your help.
Edits:
Example of what I have:
Date ID
1 19736 H-3-10
2 19736 H-3-12
3 19737 E-2-10
4 19737 E-2-10
Example of what I want:
Date ID Count
1 19736 H-3-10 1
2 19736 H-3-12 1
3 19737 E-2-10 2
4 19737 E-2-10 2
This produces the desired output.
What is happening here is that you need to use the last variable in the BY statement for all the first./last. processing. To see why, add a few put _all_; statements to the data step and inspect the values at different points. You shouldn't check first.Date at all: whenever first.Date is true, first.ID is also true (by definition, first. propagates rightwards), and you also want the count to restart when first.ID is true but first.Date is not.
Basically, treat the initial example as correct, and the variable in the initial example should be the last variable in your by statement; add as many additional variables as you want to the left of it, and nothing will change. This does require the data be sorted by the by-group variables.
data have;
input date id $;
datalines;
19736 H-3-10
19736 H-3-12
19737 E-2-10
19737 E-2-10
;;;;;
run;
data work.want;
do _n_ =1 by 1 until (last.ID); *last.<last variable in by group>;
set work.have;
by Date ID;
if first.ID then count=0; *first.ID is what you want here.;
count+1;
end;
frequency= count; *this is not really needed - can use just the one variable consistently;
do _n_ = 1 by 1 until (last.ID); *again, last.<last var in by group>;
set work.have;
by Date ID;
output;
end;
run;
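To see how the first./last. flags behave with two BY variables, a small diagnostic step along these lines may help. It is only a sketch: have_demo and by_flags are made-up names, and the flags are copied into ordinary variables because first./last. variables are not written to the output data set.

```sas
data have_demo;                      /* same four rows as the example above */
  input date id $;
  datalines;
19736 H-3-10
19736 H-3-12
19737 E-2-10
19737 E-2-10
;
run;

data by_flags;
  set have_demo;
  by date id;
  first_date = first.date;   last_date = last.date;
  first_id   = first.id;     last_id   = last.id;
  put _all_;                 /* writes every variable, including the flags, to the log */
run;
```

On the second row the date is unchanged but the ID differs, so first.date is 0 while first.id is 1, which is why the count has to be reset on first.ID.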
I have a database like this. This corresponds to a single person and I have this type of data for multiple persons.
data test;
input date YYMMDD10. real_length min_length;
format date YYMMDD10.;
cards;
2000-02-23 1 7
2000-02-24 12 15
2000-03-07 15 7
2000-03-22 7 15
2000-03-29 13 7
2000-04-11 17 7
2000-04-28 . 7
run;
What I am looking for is this: if the interval between the dates on two consecutive lines (real_length) is less than a certain length (min_length), I want to replace the date on the next line with the previous date + min_length. So far, this is not a problem, and here is the code I used to achieve it:
data test2;
set test;
format lagdate min_date YYMMDD10.;
retain lagmin lagdate;
if lag(real_length) < lag(min_length) and lag(real_length) ~= . then min_date = lagdate + lagmin;
else min_date = date;
lagdate = min_date;
lagmin = min_length;
run;
Which gives :
date min_date min_length
2000-02-23 2000-02-23 7
2000-02-24 2000-03-01 15
2000-03-07 2000-03-16 7
2000-03-22 2000-03-22 15
...
The problem is that the interval between two consecutive dates can now become less than the minimum length, e.g. 2000-03-22 - 2000-03-16 = 6 days < min_length = 7. I would like to get 2000-03-23 = 2000-03-16 + 7 (= min_length) instead of 2000-03-22, like this:
date min_date min_length
2000-02-23 2000-02-23 7
2000-02-24 2000-03-01 15
2000-03-07 2000-03-16 7
2000-03-22 2000-03-23 15
...
So I've tried this code, but it does not work... I believe the problem could be in the if condition.
data test2;
set test;
format lagdate min_date YYMMDD10.;
retain lagmin lagdate;
if (lag(real_length) < lag(min_length) and lag(real_length) ~= .) or (adjust_length < lag(min_length) and adjust_length ~=.) then min_date = lagdate + lagmin;
else min_date = date;
adjust_length = min_date - lagdate;
lagdate = min_date;
lagmin = min_length;
run;
Does anybody see why this isn't working, or do you have another way of doing this?
Thank you!
The problem is that each time you adjust one date, you have to move all the subsequent dates as well if they're bunched up together. I think you can do this by keeping a running total of how many days you've added on to all the previous rows and then adding on only what's needed after that to get to the min_length between dates:
data want;
set test;
format t_min_date min_date yymmdd10.;
if _n_ = 1 then total_adj = 0;
t_min_date = date + min_length + total_adj;
min_date = lag1(t_min_date);
total_adj + max(0,min_length - real_length);
run;
Is that what you were aiming for?
N.B. you'll need to replace the if _n_ = 1 with some first.id and last.id logic to make this work for multiple individuals in the same dataset.
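A possible shape for the multi-person version is sketched below. It assumes an id variable (not present in the question's data) and data sorted by id and date; test_ids and want_ids are made-up names. Setting min_date to date on each person's first row is also an assumption, chosen to match the desired table in the question.

```sas
data test_ids;                       /* hypothetical two-person sample */
  input id date :yymmdd10. real_length min_length;
  format date yymmdd10.;
  datalines;
1 2000-02-23 1 7
1 2000-02-24 12 15
1 2000-03-07 15 7
1 2000-03-22 . 15
2 2000-01-01 3 7
2 2000-01-04 . 7
;
run;

data want_ids;
  set test_ids;
  by id;
  format t_min_date min_date yymmdd10.;
  if first.id then total_adj = 0;            /* restart the running total per person */
  t_min_date = date + min_length + total_adj;
  min_date = lag1(t_min_date);               /* call lag on every row, never conditionally */
  if first.id then min_date = date;          /* discard the value lagged from the previous person */
  total_adj + max(0, min_length - real_length);
run;
```

The lag1 call is kept outside any IF so that its queue is updated on every row; only its result is overridden at the start of each person.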