I want to delete records in the Have dataset which meets all the following conditions. ID_num here stands for the 3-digit part of the ID field
ID = Mxxx
Type = blood
located prior to any of the following records WITHIN EACH GROUP OF (ID_num, drug).
ID=Mxxx and Type=milk
ID=Infxxx
Below are Have and the desired output.
data have;
input ID $ Type $ Drug $;
cards;
M001 blood A
M001 blood A
M001 blood A
M001 blood B
M001 blood B
M001 milk B
M001 blood C
M001 blood C
M002 blood A
M002 blood A
Inf002 blood A
M002 blood A
M002 blood B
M002 milk C
Inf003 blood B
M003 blood B
;
run;
data want;
input ID $ Type $ Drug $;
cards;
M001 milk B
Inf002 blood A
M002 blood A
M002 milk C
Inf003 blood B
M003 blood B
;
run;
For example, the M002 (blood, drug A) that is under the inf002 drug A observation stays because it occurs after an infant sample in the same drug group. But two M002 (blood, A) observations above it should get deleted as they occur before the first infant sample in same drug group. Conversely, the two M001 (blood, C) observations following M001 (milk, B) should be deleted as the drug groups are different.
Edit: group by (gp, Drug).
Keys
Extract the ID grouping number (gp in the code) using SAS regex (prxmatch(patt, var) here).
The keep condition can be examined row-by-row while also grouped by (gp, Drug). A change in gp is identified by FIRST.drug.
The dataset must be sorted before the use of BY statement. Since SAS sorting is stable, the original ordering won't break.
The original ordering can be tracked by recording _n_ in the regex parsing phase.
Code
* "have" is in your post;
data tmp;
set have;
pos = prxmatch('(\d{3})', ID);
gp = substr(ID, pos, pos+2); * group number;
mi = substr(ID, 1, 1); * mother or infant;
n = _n_; * keep track of the original ordering;
drop pos;
run;
proc sort data=tmp out=tmp;
by gp drug;
run;
data want(drop=flag_keep gp mi);
set tmp;
by gp drug;
* state variables;
retain flag_keep 0;
if FIRST.drug then flag_keep = 0;
* mark keep;
if (flag_keep = 1) or (mi = "I") or ((mi = "M") and (Type = "milk"))
then flag_keep = 1;
if flag_keep = 1 then output;
run;
proc sort data=want out=want;
by n;
run;
Result: the original row number n is shown for clarity.
ID Type Drug n
1 M001 milk B 6
2 Inf002 blood A 11
3 M002 blood A 12
4 M002 milk C 14
5 Inf003 blood B 15
6 M003 blood B 16
Related
I am stuck in transforming the data table from one format to another format using the SAS Programming function. The structure of the Table is given as below:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 Jun/1/11 1:00 8w^201 10E^45 Transfer to 8w
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 Jun/1/11 8:00 10E^45 Death
“Id” is the randomly generated patient identifier.
“Date” and “Time” is the timestamp of the event.
“Assigned_pat_loc” is the current patient location in the hospital, formatted as “unit^room^bed”. EIAB is the internal code for the emergency department, with most of the admissions process through the emergency department.
"Prior_pat_loc” is the location where the patient was immediately prior to the current location.
“Activity” is the description of the event. It includes entries like “Admission”, “Transfer to” “Transfer from” “Discharge”, and “Death”.
You will notice a lot of duplicate records, where the same transfer is recorded in both the departing and the receiving unit. You will be able to tell by looking at the time stamp – they are identical for duplicate records.
I want to transform it into the following table.
Here are the details of the variables.
r_id is the name of the variable you will generate for the id of the other patient.
patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
Patient 2 also had two-room sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note that the room-sharing episodes are listed twice, once for each patient.
Please anyone guide me how it could be done?
We need to process the data by location
proc sort HAVE;
by assigned_pat_loc data time;
run;
In the result, we don not need temporary variables (starting with underscore) and the date and time must be renamed to end_date and end_time.
data WANT (drop= _: rename=(date=end_date time=end_time));
set HAVE;
by assigned_pat_loc data time;
I generalize the problem to rooms with a capacity above 2 and use arrays.
Extending the temporary arrays beyond &max_patients, saves me a few if-statements.
Note that temporary arrays are dropped in the result and are retained anyway.
%let max_patients = 9;
array id_r {&max_patients - 1} id_1 - id_%eval(&max_patients - 1);
array patients temporary {&max_patients + 1};
array admissions temporary {&max_patients + 1};
if _N_ eq 1 then patient_count = 0;
retain patient_count;
for every pat_loc, start all over
if first.assigned_pat_loc then do;
do patient_nr = 1 to patient_count;
patients[patient_nr] = .;
end;
patient_count = 0;
end;
if a patient leaves, calculate the time she spent
if Activity in (“Discharge”, “Death”) then do;
_found_patient = 0;
do _patient_nr = 1 to patient_count;
if patients[_patient_nr] eq id then do;
start_date = datepart(admissions[_patient_nr]);
start_time = timepart(admissions[_patient_nr]);
duration = (dhms(date,0,0,time) - admissions[_patient_nr]) / 3600;
_found_patient = 1;
end;
end;
shift the patients that arrived later
if _found_patient then do;
patients[_patient_nr] = patients[_patient_nr + 1];
admissions[_patient_nr] = admissions[_patient_nr + 1];
end;
patient_count = patient_count - 1;
find out who else was in the pat_loc and write the result
do _patient_nr = 1 to patient_count;
id_r[_patient_nr] = patents[_patient_nr];
end;
output;
end;
if a patient arrives, register that for later
else do;
patient_count = patient_count + 1;
patients[_patient_nr] = id;
admissions[_patient_nr] = dhms(date,0,0,time);
end;
run;
sort the results
proc sort;
by id start_date start_time;
run;
Disclaimer: this is a draft, which might need debugging.
When dealing with ranges in which there is a possibility of an unexpected overlap case you can enumerate over the range and perform simpler logic for finding shared time/unit/room.
Example:
data have;
length id date time 8 loc ploc $20 activity $10;
input
id Date& date11. Time time5. loc ploc Activity;
format date date9. time time5.;
datetime = dhms (date,0,0,0) + time;
length unit room bed punit proom pbed $4;
unit = scan(loc,1,'^');
room = scan(loc,2,'^');
bed = scan(loc,3,'^');
punit = scan(ploc,1,'^');
proom = scan(ploc,2,'^');
pbed = scan(ploc,3,'^');
drop loc ploc;
datalines;
1 31-May-2011 8:00 EIAB^EIAB^6 . Admission
1 31-May-2011 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 8-Jun-2011 15:00 8w^201 . Discharge
2 31-May-2011 5:00 EIAB^EIAB^4 . Admission
2 31-May-2011 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 1-Jun-2011 1:00 8w^201 10E^45 Transfer to 8w
2 1-Jun-2011 8:00 8w^201 . Discharge
3 31-May-2011 9:00 EIAB^EIAB^2 . Admission
3 1-Jun-2011 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 5-Jun-2011 9:00 8w^201 . Discharge
4 31-May-2011 9:00 EIAB^EIAB^9 . Admission
4 31-May-2011 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 1-Jun-2011 8:00 10E^45 . Death
;
* Fill in the ranges to get data by hour;
data hours(keep=id in_unit in_room at_dt);
set have;
by id;
retain at_dt in_unit in_room;
if first.id then do;
at_dt = datetime;
in_unit = unit;
in_room = room;
end;
else do;
do at_dt = at_dt to datetime-1 by dhms(0,1,0,0);
output;
end;
in_unit = unit;
in_room = room;
end;
format at_dt datetime16.;
run;
* prepare for transposition;
proc sort data=hours;
by at_dt in_unit in_room id;
run;
* transpose to know which time/unit/room has multiple patients;
proc transpose data=hours out=roomies_by_hour(drop=_name_ where=(not missing(patid2))) prefix=patid;
by at_dt in_unit in_room ;
var id;
run;
* 'unfill' the individual hours to get ranges again;
data roomies;
set roomies_by_hour;
by in_unit in_room patid1 patid2;
retain start_dt end_dt;
format start_dt end_dt datetime16.;
if first.patid2 then
start_dt = at_dt;
if last.patid2 then do;
end_dt = at_dt;
length_hrs = intck('hours', start_dt, end_dt);
output;
end;
run;
* stack data flipping perspective of who shared with who;
data roomies_mirrored;
set
roomies /* patid1 centric */
roomies(rename=(patid1=patid2 patid2=patid1)) /* patid2 centric */
;
run;
proc sort data=roomies_mirrored;
by patid1 start_dt;
run;
I'd like to make the dataset like the below. I got it, but it’s a long program.
I think it would become more simple. If you have a good idea, please give me some advice.
This is the data.
data test;
input ID $ NO DAT1 $ TIM1 $ DAT2 $ TIM2 $;
cards;
1 1 2020/8/4 8:30 2020/8/5 8:30
1 2 2020/8/18 8:30 2020/8/19 8:30
1 3 2020/9/1 8:30 2020/9/2 8:30
1 4 2020/9/15 8:30 2020/9/16 8:30
2 1 2020/8/4 8:34 2020/8/5 8:34
2 2 2020/8/18 8:34 2020/8/19 8:34
2 3 2020/9/1 8:34 2020/9/2 8:34
2 4 2020/9/15 8:34 2020/9/16 8:34
3 1 2020/8/4 8:46 2020/8/5 8:46
3 2 2020/8/18 8:46 2020/8/19 8:46
3 3 2020/9/1 8:46 2020/9/2 8:46
3 4 2020/9/15 8:46 2020/9/16 8:46
;
run;
This is my program.
data
t1(keep = ID A1 A2 A3 A4)
t2(keep = ID B1 B2 B3 B4)
t3(keep = ID C1 C2 C3 C4)
t4(keep = ID D1 D2 D3 D4);
set test;
if NO = 1 then do;
A1 = DAT1;
A2 = TIM1;
A3 = DAT2;
A4 = TIM2;
end;
*--- cut (NO = 2, 3, 4 are same as NO = 1)--- ;
end;
if NO = 1 then output t1;
if NO = 2 then output t2;
if NO = 3 then output t3;
if NO = 4 then output t4;
run;
proc sort data = t1;by ID; run;
proc sort data = t2;by ID; run;
proc sort data = t3;by ID; run;
proc sort data = t4;by ID; run;
data test2;
merge t1 t2 t3 t4;
by ID;
run;
Since the result looks like a report use a reporting tool.
proc report data=test ;
column id no,(dat1 tim1 dat2 tim2 n) ;
define id / group width=5;
define no / across ' ' ;
define n / noprint;
run;
Tall to very wide data transformations are typically
sketchy, you put data into metadata (column names or labels) or lose referential context, or
a reporting layout for human consumption
Presuming your "as dataset like below" is accurate and you want to pivot your data in such a manner.
Way 1 - self merging subsets with renaming
You should see that the NO field is a sequence number that can be used as a BY variable when merging data sets.
Consider this example code as a template that could be the source code generation of a macro:
NO is changed name to seq for better clarity
data want;
merge
have (where=(seq=1) rename=(dat1=A1 tim1=B1 dat2=C1 tim2=D1)
have (where=(seq=2) rename=(dat1=A2 tim1=B2 dat2=C2 tim2=D2)
have (where=(seq=3) rename=(dat1=A3 tim1=B3 dat2=C3 tim2=D3)
have (where=(seq=4) rename=(dat1=A4 tim1=B4 dat2=C4 tim2=D4)
;
by id;
run;
For unknown data sets organized like the above pattern, the code generation requirements should be obvious; determine maximum seq and have the names of variables to pivot be specified (as macro parameters, in which loop over the names occurs).
Way 2 - multiple transposes
Caution, all pivoted columns will be character type and contain the formatted result of original values.
proc transpose data=have(rename=(dat1=A tim1=B dat2=C tim2=D)) out=stage1;
by id seq;
var a b c d;
run;
proc transpose data=stage1 out=want;
by id;
var col1;
id _name_ seq;
run;
Way 3 - Use array and DOW loop
* presume SEQ is indeed a unit monotonic sequence value;
data want (keep=id a1--d4);
do until (last.id);
array wide A1-A4 B1-B4 C1-C4 D1-D4;
wide [ (seq-1)*4 + 1 ] = dat1;
wide [ (seq-1)*4 + 2 ] = tim1;
wide [ (seq-1)*4 + 3 ] = dat2;
wide [ (seq-1)*4 + 4 ] = tim2;
end;
keep id A1--D4;
* format A1 A3 B1 B3 C1 C3 D1 D3 your-date-format;
* format A2 A4 ................. your-time-format;
Way 4 - change your data values to datetime
I'll leave this to esteemed others
Assume you have a data file called VIRUS_PROLIF from an infectious disease research center. Each observation has 3 variables COUNTRY START_DATE, and DOUBLE_RATE, where START_DATE is the date that the Country registered its 100th case of COVID-19. For each country, DOUBLE_RATE is the number of days it takes for the number of cases to double in that country. Write the SAS code using DO UNTIL to calculate the date at which that Country would be predicted to register 200,000 cases of COVID-19.
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate ;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
;
run;
data VIRUS_PROLIF1 (drop=start_date);
set VIRUS_PROLIF;
do until (num_of_cases>200000);
double_rate+1;
num_of_cases+ (num_of_cases*1);
end;
run;
proc print data=VIRUS_PROLIF1;
run;
The key concept you're missing here is how to employ the growth rate. That would be using the following formula, similar to interest growth for money.
If you have one dollar today and you get 100% interest it becomes
StartingAmount * (1 + interestRate) where the interest rate here is 100/100 = 1.
*fake data;
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
AB 03/17/2020 100 20
;
run;
data VIRUS_PROLIF1;
set VIRUS_PROLIF;
*assign date to starting date so both are in output;
date=start_date;
*save record to data set;
output;
do until (num_of_cases>200000);
*increment your day;
date=date+1;
;
*doubling rate is represented as a percent so add it to 1 to show the rate;
num_of_cases=num_of_cases*(1+double_rate/100);
*save record to data set;
output;
end;
*control date display;
format date start_date date9.;
run;
*check results;
proc print data=VIRUS_PROLIF1;
run;
The problem 200,000 < N0 (1+R/100) k can be solved for integer k without iterations
day_of_200K = ceil (
LOG ( 200000 / NUM_OF_CASES )
/ LOG ( 1 + R / 100 )
);
I have the following dataset:
dataseta:
No. Name1 Name2 Sales Inv Comp
1 TC Tribal Council Inc 100 100 0
2. TC Tribal Council Limited INC 20 25 65
desired output:
datasetb:
No. Name1 Name2 Sales Inv Comp
1 TC Tribal Council Limited Inc 120 125 0
Basically, I need to choose the row with the maximum length of characters for the column name2.
I tried the following, but it didn't work
proc sql;
create table datasetb as select no,name1,name2,sum(sales),sum(inv),min(comp) from dataseta group by 1,2,3 having length(name2)=max(length(name2));quit;
If I do the following code, it only partially resolves it, and I get duplicate rows
proc sql;
create table datasetb as select no,name1,max(length(name2)),sum(sales),sum(inv),min(comp) from dataseta group by 1,2 having length(name2)=max(length(name2));quit;
You appear to be joining the results of two separate aggregate computations.
Presuming:
no is unique so as to allow a tie breaker criteria and the first (per no) longest name2 is to be joined with the cost, inv, comp totals over name1.
The query will have lots going on...
1st longest name2 within name1, nested subqueries are needed to:
Determine the longest name2, then
Select first one, according to no, if more than one.
totals over name1
The totals will be a sub-query that is joined to, for delivering the desired result set.
Example (SQL)
data have;
length no 8 name1 $6 name2 $35 sales inv comp 8;
input
no name1& name2& sales inv comp; datalines;
1 TC Tribal Council Inc 100 100 0 * name1=TC group
2 TC Tribal Council Limited INC 20 25 65
3 TC Tribal council co 0 0 0
4 TC The Tribal council Assoctn 10 10 10
7 LS Longshore association 10 10 0 * name=LS group
8 LS The Longshore Group, LLC 2 4 8
9 LS The Longshore Group, llc 15 15 6
run;
proc sql;
create table want as
select
first_longest_name2.no,
first_longest_name2.name1,
first_longest_name2.name2,
name1_totals.sales,
name1_totals.inv,
name1_totals.comp
FROM
(
select
no, name1, name2
from
( select
no, name1, name2
from have
group by name1
having length(name2) = max(length(name2))
) longest_name2s
group by name1
having no = min(no)
) as
first_longest_name2
LEFT JOIN
(
select
name1,
sum(sales) as sales,
sum(inv) as inv,
sum(comp) as comp
from
have
group by name1
) as
name1_totals
ON
first_longest_name2.name1 = name1_totals.name1
;
quit;
Example (DATA Step)
Processing the data in a serial manner, when name1 groups are contiguous rows, can be accomplished using a DOW loop technique -- that is a loop with a SET statement within it.
data want2;
do until (last.name1);
set have;
by name1 notsorted;
if length(name2) > longest then do;
longest = length(name2);
no_at_longest = no;
name2_at_longest = name2;
end;
sales_sum = sum(sales_sum,sales);
inv_sum = sum(inv_sum,inv);
comp_sum = sum(comp_sum,comp);
end;
drop name2 no sales inv comp longest;
rename
no_at_longest = no
name2_at_longest = name2
sales_sum = sales
inv_sum = inv
comp_sum = comp
;
run;
I am using SAS Enterprise Guide.
I have a new file and i was asked to generate output.
Source:
Name feeder_in feeder_out NickName
ABBA 1,2 A,B ABBA
POLA 1,2 C,D,E CONS POLA
and the desire output:
Name feeder_final
ABBA 1
ABBA 2
ABBA A
ABBA B
POLA 1
POLA 2
CONS POLA C
CONS POLA D
CONS POLA E
I have been trying myself on handling this but no luck at all.
I tried
data test;
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
output;
end;
xequipment=newequipment;
run;
Does anyone have clue for this?
Here's my understanding of your requirements, based on the desired output: you want your output to have one observation for each combination of NAME and FEEDER_IN, plus another observation for each combination of NICKNAME and FEEDER_OUT.
On that assumption, the code would look something like (not tested):
data want;
set have;
keep name feeder_final
* Loop over FEEDER_IN and output one obs for each delimited value;
do i = 1 to countw(feeder_in, ',');
feeder_final = scan(feeder_in, i, ',');
output;
end;
* Move the NICKNAME value into NAME;
name = nickname;
* Loop over FEEDER_OUT and output one obs for each delimited value;
do i = 1 to countw(feeder_out, ',');
feeder_final = scan(feeder_out, i, ',');
output;
end;
run;
When transposing multiple columns you might want to also maintain the source row and column identifiers for further downstream analytics. The sequence of the values in the csv might also be important if you need to do pairwise joining on sequence position of the categorical form -- such as needing to match 1A 2B in row 1 and 1C 2D in row 2.
data have;
length name feeder_in feeder_out nickname $20;
input
Name& feeder_in& feeder_out& NickName&; datalines;
ABBA 1,2 A,B ABBA
POLA 1,2 C,D,E CONS POLA
run;
data want;
_row_ + 1;
set have;
feeder = 'in ';
do seq = 1 to countw(feeder_in,',');
value = scan(feeder_in,seq,',');
OUTPUT;
end;
feeder = 'out';
do seq = 1 to countw(feeder_out,',');
value = scan(feeder_out,seq,',');
OUTPUT;
end;
keep _row_ Name feeder seq value NickName;
run;