Using do loops in SAS

Assume you have a data file called VIRUS_PROLIF from an infectious disease research center. Each observation has 3 variables: COUNTRY, START_DATE, and DOUBLE_RATE, where START_DATE is the date on which the country registered its 100th case of COVID-19. For each country, DOUBLE_RATE is the number of days it takes for the number of cases to double in that country. Write the SAS code using DO UNTIL to calculate the date on which that country would be predicted to register 200,000 cases of COVID-19.
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate ;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
;
run;
data VIRUS_PROLIF1 (drop=start_date);
set VIRUS_PROLIF;
do until (num_of_cases>200000);
double_rate+1;
num_of_cases+ (num_of_cases*1);
end;
run;
proc print data=VIRUS_PROLIF1;
run;

The key concept you're missing here is how to apply the growth rate. Use the following formula, which works like interest growth on money.
If you have one dollar today and you get 100% interest, it becomes
StartingAmount * (1 + interestRate), where the interest rate here is 100/100 = 1.
*fake data;
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
AB 03/17/2020 100 20
;
run;
data VIRUS_PROLIF1;
set VIRUS_PROLIF;
*assign date to starting date so both are in output;
date=start_date;
*save record to data set;
output;
do until (num_of_cases>200000);
*increment your day;
date=date+1;
*doubling rate is represented as a percent so add it to 1 to show the rate;
num_of_cases=num_of_cases*(1+double_rate/100);
*save record to data set;
output;
end;
*control date display;
format date start_date date9.;
run;
*check results;
proc print data=VIRUS_PROLIF1;
run;

The inequality 200,000 < N0 * (1 + R/100)^k can be solved for the integer k directly, without iterating. Here N0 is NUM_OF_CASES and R is DOUBLE_RATE:
day_of_200K = ceil (
LOG ( 200000 / NUM_OF_CASES )
/ LOG ( 1 + DOUBLE_RATE / 100 )
);
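As a rough sketch, the closed-form expression could be dropped into a DATA step like the one below. It assumes the VIRUS_PROLIF data set from the answer above, with DOUBLE_RATE read as a percentage growth per day; the names PREDICTED, days_to_200K and date_200K are made up for illustration.

```sas
data PREDICTED;
  set VIRUS_PROLIF;
  /* smallest whole number of days k with num_of_cases * (1 + rate)**k >= 200000 */
  days_to_200K = ceil( log(200000 / num_of_cases)
                     / log(1 + double_rate / 100) );
  /* SAS dates are day counts, so adding days gives the predicted date */
  date_200K = start_date + days_to_200K;
  format start_date date_200K date9.;
run;

proc print data=PREDICTED;
run;
```

For the US row (100 cases growing 100% per day) this works out to 11 days, since 100 * 2**11 = 204,800 is the first value past 200,000. Note that CEIL stops at the first day the count reaches 200,000, while the DO UNTIL loop tests for strictly more than 200,000, so the two would differ by a day if the count landed exactly on 200,000.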

Related

How to transform Table data to another Table format in SAS

I am stuck in transforming the data table from one format to another format using the SAS Programming function. The structure of the Table is given as below:
id  Date       Time   assigned_pat_loc  prior_pat_loc  Activity
1   May/31/11  8:00   EIAB^EIAB^6                      Admission
1   May/31/11  9:00   8w^201            EIAB^EIAB^6    Transfer to 8w
1   Jun/8/11   15:00  8w^201                           Discharge
2   May/31/11  5:00   EIAB^EIAB^4                      Admission
2   May/31/11  7:00   10E^45            EIAB^EIAB^4    Transfer to 10E
2   Jun/1/11   1:00   8w^201            10E^45         Transfer to 8w
2   Jun/1/11   8:00   8w^201                           Discharge
3   May/31/11  9:00   EIAB^EIAB^2                      Admission
3   Jun/1/11   9:00   8w^201            EIAB^EIAB^2    Transfer to 8w
3   Jun/5/11   9:00   8w^201                           Discharge
4   May/31/11  9:00   EIAB^EIAB^9                      Admission
4   May/31/11  7:00   10E^45            EIAB^EIAB^9    Transfer to 10E
4   Jun/1/11   8:00   10E^45                           Death
“Id” is the randomly generated patient identifier.
“Date” and “Time” together form the timestamp of the event.
“Assigned_pat_loc” is the current patient location in the hospital, formatted as “unit^room^bed”. EIAB is the internal code for the emergency department, since most admissions are processed through the emergency department.
“Prior_pat_loc” is the location where the patient was immediately prior to the current location.
“Activity” is the description of the event. It includes entries like “Admission”, “Transfer to”, “Transfer from”, “Discharge”, and “Death”.
You will notice a lot of duplicate records, where the same transfer is recorded in both the departing and the receiving unit. You will be able to tell by looking at the time stamp – they are identical for duplicate records.
I want to transform it into the following table.
Here are the details of the variables.
r_id is the name of the variable you will generate for the id of the other patient.
patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
Patient 2 also had two-room sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note that the room-sharing episodes are listed twice, once for each patient.
Could anyone please guide me on how this could be done?
We need to process the data by location:
proc sort data=HAVE;
by assigned_pat_loc date time;
run;
In the result, we do not need temporary variables (starting with underscore), and the date and time must be renamed to end_date and end_time.
data WANT (drop= _: rename=(date=end_date time=end_time));
set HAVE;
by assigned_pat_loc date time;
I generalize the problem to rooms with a capacity above 2 and use arrays.
Extending the temporary arrays beyond &max_patients saves me a few if-statements.
Note that temporary arrays are never written to the result and are retained automatically.
%let max_patients = 9;
array id_r {%eval(&max_patients - 1)} id_1 - id_%eval(&max_patients - 1);
array patients {%eval(&max_patients + 1)} _temporary_;
array admissions {%eval(&max_patients + 1)} _temporary_;
if _N_ eq 1 then patient_count = 0;
retain patient_count;
For every pat_loc, start all over:
if first.assigned_pat_loc then do;
do _patient_nr = 1 to patient_count;
patients[_patient_nr] = .;
end;
patient_count = 0;
end;
If a patient leaves, calculate the time she spent:
if Activity in ("Discharge", "Death") then do;
_found_patient = 0;
do _patient_nr = 1 to patient_count;
if patients[_patient_nr] eq id then do;
start_date = datepart(admissions[_patient_nr]);
start_time = timepart(admissions[_patient_nr]);
duration = (dhms(date,0,0,time) - admissions[_patient_nr]) / 3600;
_found_patient = 1;
end;
Shift the patients that arrived later (still inside the loop over _patient_nr):
if _found_patient then do;
patients[_patient_nr] = patients[_patient_nr + 1];
admissions[_patient_nr] = admissions[_patient_nr + 1];
end;
end;
patient_count = patient_count - 1;
Find out who else was in the pat_loc and write the result:
do _patient_nr = 1 to patient_count;
id_r[_patient_nr] = patients[_patient_nr];
end;
output;
end;
If a patient arrives, register that for later:
else do;
patient_count = patient_count + 1;
patients[patient_count] = id;
admissions[patient_count] = dhms(date,0,0,time);
end;
run;
Sort the results:
proc sort;
by id start_date start_time;
run;
Disclaimer: this is a draft, which might need debugging.
When dealing with ranges in which an unexpected overlap case is possible, you can enumerate over the range and use simpler logic to find shared time/unit/room.
Example:
data have;
length id date time 8 loc ploc $20 activity $10;
input
id Date :date11. Time :time5. loc ploc Activity;
format date date9. time time5.;
datetime = dhms (date,0,0,0) + time;
length unit room bed punit proom pbed $4;
unit = scan(loc,1,'^');
room = scan(loc,2,'^');
bed = scan(loc,3,'^');
punit = scan(ploc,1,'^');
proom = scan(ploc,2,'^');
pbed = scan(ploc,3,'^');
drop loc ploc;
datalines;
1 31-May-2011 8:00 EIAB^EIAB^6 . Admission
1 31-May-2011 9:00 8w^201 EIAB^EIAB^6 Transfer to 8w
1 8-Jun-2011 15:00 8w^201 . Discharge
2 31-May-2011 5:00 EIAB^EIAB^4 . Admission
2 31-May-2011 7:00 10E^45 EIAB^EIAB^4 Transfer to 10E
2 1-Jun-2011 1:00 8w^201 10E^45 Transfer to 8w
2 1-Jun-2011 8:00 8w^201 . Discharge
3 31-May-2011 9:00 EIAB^EIAB^2 . Admission
3 1-Jun-2011 9:00 8w^201 EIAB^EIAB^2 Transfer to 8w
3 5-Jun-2011 9:00 8w^201 . Discharge
4 31-May-2011 9:00 EIAB^EIAB^9 . Admission
4 31-May-2011 7:00 10E^45 EIAB^EIAB^9 Transfer to 10E
4 1-Jun-2011 8:00 10E^45 . Death
;
* Fill in the ranges to get data by hour;
data hours(keep=id in_unit in_room at_dt);
set have;
by id;
retain at_dt in_unit in_room;
if first.id then do;
at_dt = datetime;
in_unit = unit;
in_room = room;
end;
else do;
do at_dt = at_dt to datetime-1 by dhms(0,1,0,0);
output;
end;
in_unit = unit;
in_room = room;
end;
format at_dt datetime16.;
run;
* prepare for transposition;
proc sort data=hours;
by at_dt in_unit in_room id;
run;
* transpose to know which time/unit/room has multiple patients;
proc transpose data=hours out=roomies_by_hour(drop=_name_ where=(not missing(patid2))) prefix=patid;
by at_dt in_unit in_room ;
var id;
run;
* 'unfill' the individual hours to get ranges again;
data roomies;
set roomies_by_hour;
by in_unit in_room patid1 patid2;
retain start_dt end_dt;
format start_dt end_dt datetime16.;
if first.patid2 then
start_dt = at_dt;
if last.patid2 then do;
end_dt = at_dt;
length_hrs = intck('hour', start_dt, end_dt);
output;
end;
run;
* stack data flipping perspective of who shared with who;
data roomies_mirrored;
set
roomies /* patid1 centric */
roomies(rename=(patid1=patid2 patid2=patid1)) /* patid2 centric */
;
run;
proc sort data=roomies_mirrored;
by patid1 start_dt;
run;

Multiple operations on a single value in SAS?

I'm trying to create a column that applies two different interest rates based on each customer's cumulative purchases. I was thinking that I'd need a DO WHILE statement, but I'm not entirely sure. :S
This is what I have so far, but I don't know how to get it to perform two operations on one value, such that it applies one interest rate up to, say, 4000, and then applies the other interest rate to the amount above 4000.
data cards;
set sortedccards;
by Cust_ID;
if first.Cust_ID then cp=0;
cp+Purchase;
if cp<=4000 then cb=(cp*.2);
if cp>4000 then cb=(cp*.2)+(cp*.1);
format cp dollar10.2 cb dollar10.2;
run;
What I'd like my output to look like.
You will want to also track the prior cumulative purchase in order to detect when a purchase causes the cumulative to cross the threshold (or breakpoint) $4,000. Breakpoint crossing purchases would be split into pre and post portions for different bonus rates.
Example:
Program flow causes retained variable pcp to act like a LAGged variable.
data have;
input id $ p;
datalines;
C001 1000
C001 2300
C001 2000
C001 1500
C001 800
C002 6200
C002 800
C002 300
C003 2200
C003 1700
C003 2500
C003 600
;
data want;
set have;
by id;
if first.id then do;
cp = 0;
pcp = 0; retain pcp; /* prior cumulative purchase */
end;
cp + p; /* sum statement causes cp to be implicitly retained */
* break point is 4,000;
if (cp > 4000 and pcp > 4000) then do;
* entire purchase in post breakpoint territory;
b = 0.01 * p;
end;
else
if (cp > 4000) then do;
* split purchase into pre and post breakpoint portions;
b = 0.10 * (4000 - pcp) + 0.01 * (p - (4000 - pcp));
end;
else do;
* entire purchase in pre breakpoint territory;
b = 0.10 * p;
end;
* update prior for next implicit iteration;
pcp = cp;
run;
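To make the split concrete, here is a small check of the third C001 purchase under the rates used in the step above (10% below the breakpoint, 1% above it). The variable names pre and post are only for illustration and do not appear in the answer's code.

```sas
data _null_;
  pcp = 3300;          /* cumulative purchases before the third C001 purchase (1000 + 2300) */
  p   = 2000;          /* the purchase that crosses the 4,000 breakpoint */
  pre  = 4000 - pcp;   /* portion that still falls below the breakpoint */
  post = p - pre;      /* portion that falls above the breakpoint */
  b = 0.10 * pre + 0.01 * post;
  put pre= post= b=;   /* writes the three values to the log */
run;
```

Here 700 of the purchase earns 10% and the remaining 1,300 earns 1%, so the bonus for that row should come to 70 + 13 = 83.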
Here is a fairly straightforward solution which is not optimized but works. We calculate the cumulative purchases and cumulative bonus at each step (which can be done quite simply), and then calculate the current period bonus as cumulative bonus minus previous cumulative bonus.
This is assuming that the percentage is 20% up to $4000 and 30% over $4000.
data have;
input id $ period MMDDYY10. purchase;
datalines;
C001 01/25/2019 1000
C001 02/25/2019 2300
C001 03/25/2019 2000
C001 04/25/2019 1500
C001 05/25/2019 800
C002 03/25/2019 6200
C002 04/25/2019 800
C002 05/25/2019 300
C003 02/25/2019 2200
C003 03/25/2019 1700
C003 04/25/2019 2500
C003 05/25/2019 600
;
run;
data want (drop=cumul_bonus);
set have;
by id;
retain cumul_purchase cumul_bonus;
if first.id then call missing(cumul_purchase,cumul_bonus);
** Calculate total cumulative purchase including current purchase **;
cumul_purchase + purchase;
** Calculate total cumulative bonus including current purchase **;
cumul_bonus = (0.2 * cumul_purchase) + ifn(cumul_purchase > 4000, 0.1 * (cumul_purchase - 4000), 0);
** Bonus for current purchase = total cumulative bonus - previous cumulative bonus **;
bonus = ifn(first.id,cumul_bonus,dif(cumul_bonus));
format period MMDDYY10.
purchase cumul_purchase bonus DOLLAR10.2
;
run;
proc print data=want;
run;

How to average a subset of data in SAS conditional on a date?

I'm trying to write SAS code that can loop over a dataset that contains event dates that looks like:
Data event;
input Date;
cards;
20200428
20200429
;
run;
And calculate averages for the prior three-days from another dataset that contains dates and volume that looks like:
Data vol;
input Date Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
For example, for date 20200428 the average should be 88.33 [(95+80+90)/3] and for date 20200429 the average should be 87.00 [(86+95+80)/3]. I want these values and the volume of the date to be saved on a new dataset that looks like the following if possible.
Data clean;
input Date Vol Avg;
cards;
20200428 86 88.33
20200429 110 87.00
;
run;
The actual data that I'm working with is from 1970-2010. I may also increase my average period from 3 days prior to 10 days prior, so I want to have flexible code. From what I've read I think a macro and/or call symput might work very well for this, but I'm not sure how to code these to do what I want. Honestly, I don't know where to start. Can anyone point me in the right direction? I'm open to any advice/ideas. Thanks.
A SQL statement is by far the most succinct code for obtaining your result set.
The query will join with 2 independent references to volume data. The first for obtaining the event date volume, and the second for computing the average volume over the three prior days.
The date data should be read in as a SAS date, so that the BETWEEN condition will be correct.
Data event;
input Date: yymmdd8.;
cards;
20200428
20200429
;
run;
Data vol;
input Date: yymmdd8. Volume;
cards;
20200430 100
20200429 110
20200428 86
20200427 95
20200426 80
20200425 90
;
run;
* SQL query with GROUP BY ;
proc sql;
create table want as
select
event.date
, volume_one.volume
, mean(volume_two.volume) as avg
from event
left join vol as volume_one
on event.date = volume_one.date
left join vol as volume_two
on volume_two.date between event.date-1 and event.date-3
group by
event.date, volume_one.volume
;
* alternative query using correlated sub-query;
create table want_2 as
select
event.date
, volume
, ( select mean(volume) as avg from vol where vol.date between event.date-1 and event.date-3 )
as avg
from event
left join vol
on event.date = vol.date
;
quit;
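Since the question asks for a flexible window (3 days now, perhaps 10 later), the first query could be parameterized with a macro variable. This is only a sketch: N_PRIOR and want_n are names I made up, and it assumes the event and vol data sets were read with the yymmdd8. informat as shown above.

```sas
%let N_PRIOR = 3;   /* hypothetical macro variable: number of prior days to average */

proc sql;
  create table want_n as
  select
      event.date format=yymmdd10.
    , volume_one.volume
    , mean(volume_two.volume) as avg_prior_&N_PRIOR.
  from event
  left join vol as volume_one
    on event.date = volume_one.date
  left join vol as volume_two
    /* the N_PRIOR calendar days before the event date, excluding the event date itself */
    on volume_two.date between event.date - &N_PRIOR. and event.date - 1
  group by
    event.date, volume_one.volume
  ;
quit;
```

Changing the %LET to 10 would widen the window without touching the query. As with the original query, the window is counted in calendar days, so gaps in the volume data shrink the number of values averaged.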
When the volume data has date gaps, a better solution would be to separately compute the rolling average of N prior volumes. The date gaps could be from weekends, holidays, or a date not present due to data entry problems or operator error. Conceptually, for the averaging, the only role of date is to order the data.
After the rolling averages are computed, a simple join or merge can be done.
Example:
* Simulate some volume data that excludes weekends, holidays, and a 2% rate of missing dates;
data volumes(keep=date volume);
call streaminit(20200502);
do date = '01jan1970'd to today();
length holiday $25;
year = year(date);
holiday = 'NEWYEAR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USINDEPENDENCE'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'THANKSGIVING'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'CHRISTMAS'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'MEMORIAL'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'LABOR'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'EASTER'; hdate = holiday(holiday, year); if date=hdate then continue;
holiday = 'USPRESIDENTS'; hdate = holiday(holiday, year); if date=hdate then continue;
if weekday(date) in (1,7) then continue; *1=Sun, 7=Sat;
volume = 100 + ceil(75 * sin (date / 8));
if rand('uniform') < 0.02 then continue;
output;
end;
format date yymmdd10.;
run;
* Compute an N item rolling average from N prior values;
%let ROLLING_N = 5;
data volume_averages;
set volumes;
by date; * enforce sort order requirement;
array v[0:&ROLLING_N] _temporary_; %* <---- &ROLLING_N ;
retain index -1;
avg_prior_&ROLLING_N. = mean (of v(*)); %* <---- &ROLLING_N ;
OUTPUT;
index = mod(index + 1,&ROLLING_N); %* <---- Modular arithmetic, the foundation of rolling ;
v[index] = volume;
format v: 6.;
drop index;
run;
* merge;
data want_merge;
merge event(in=event_date) volume_averages;
by date;
if event_date;
run;
* join;
proc sql;
create table want_join as
select event.*, volume_averages.avg_prior_&ROLLING_N.
from event join volume_averages
on event.date = volume_averages.date;
quit;
You want to loop over a series of dates in an input data set. Therefore I use a PROC SQL statement where I select the distinct dates in this input data set into a macro variable.
This macro variable is then used to loop over. In your example the macro variable will thus be: 20200428 20200429. You can then use the %SCAN macro function to start looping over these dates.
For each date in the loop, we will then calculate the average: in your example the average of the 3 days prior to the looping date. As the number of days for which you want to calculate the average is variable, this is also passed as a parameter in the macro. I then use the INTNX function to calculate the lower bound of dates you want to select to calculate the average over. Then the PROC MEANS procedure is used to calculate the average volume over the days: lower bound - looping date.
I then put a minor data step in between to attach the looping date again to the calculated average. Finally everything is appended in a final data set.
%macro dayAverage(input = , range = , selectiondata = );
/* Input = input dataset
range = number of days prior to the selected date for which you want to calculate
the average
selectiondata = data where the volumes are in */
/* Create a macro variable with the dates for which you want to calculate the
average, to loop over */
proc sql noprint;
select distinct date into: datesrange separated by " "
from &input.;
quit;
/*Start looping over the dates for which you want to calculate the average */
%let I = 1;
%do %while (%scan(&datesrange.,&I.) ne %str());
/* Assign the current date in the loop to the variable currentdate */
%let currentdate = %scan(&datesrange.,&I.);
/* Create the minimum date in the range based on input parameter range */
%let mindate = %sysfunc(putn(%sysfunc(intnx(day,%sysfunc(inputn(&currentdate.,yymmdd8.)),-&range.)),yymmddn8.));
/* Calculate the mean volume for the selected date and selected range */
proc means data = &selectiondata.(where = (date >= &mindate. and date < &currentdate.)) noprint ;
output out = averagecurrent(drop = _type_ _freq_) mean(volume)=average_volume;
run;
/* Add the current date to the calculated average */
data averagecurrent;
retain date average_volume;
set averagecurrent;
date = &currentdate.;
run;
/* Append the result to a final list */
proc datasets nolist;
append base = final data = averagecurrent force;
run;
quit;
%let I = %eval(&I. + 1);
%end;
%mend;
This macro can in your example be called as:
%dayAverage(input = event, range = 3, selectiondata = vol);
It will give you a data set called final in your WORK library.

Combining the rows with overlapping data ranges in SAS

Since I am new to SAS, I need some help understanding how to combine overlapping date ranges into one row. I want to combine overlapping date ranges when they have a matching ID. If the dates don't overlap, I want to keep them as they are. If they overlap for a matching ID and drug code, they should be combined into one line. Please look at the sample data set below and the expected results:
Current Data set:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 1/1/2019
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
Expected results:
ID Drug Code BEG_Date End_Date
1 100 1/1/2018 3/31/2018
1 90 4/1/2018 04/30/2018
1 100 5/1/2018 6/1/2018
1 98 6/2/2018 8/31/2018
1 100 9/1/2018 5/4/2019
I wrote some SAS code, but it combines the dates even when there is no overlap. I want to write code that handles this correctly in SAS.
PROC SORT DATA=Want OUT=ONE;
BY PERSON_ID BEG_DATE DRUG_CODE END_DATE;
RUN;
data TWO (DROP=PERSON_ID2 DRUG_CODE2 BEG_DATE END_DATE
RENAME=(BEG2=BEG_DOS
END2=END_DOS));
SET ONE;
RETAIN BEG2 END2;
PERSON_ID2=LAG1(PERSON_ID);
DRUG_CODE2=LAG1(DRUG_CODE);
IF PERSON_ID2=PERSON_ID AND DRUG_CODE2=DRUG_CODE AND BEG_DATE LE(END2+1) THEN
DO;
BEG2=MIN(BEG_DATE,BEG2);
END2=MAX(END_DATE,END2);
END;
ELSE
DO;
SEG+1;
BEG2=BEG_DATE;
END2=END_DATE;
END;
FORMAT BEG2 END2 MMDDYY10.;
RUN;
DATA THREE(DROP=BEG_DOS END_DOS SEG);
RETAIN BEG_DATE END_DATE;
SET TWO;
BY PERSON_ID SEG;
FORMAT BEG_DATE END_DATE MMDDYY10.;
IF FIRST.SEG THEN
DO;
BEG_DATE=BEG_DOS;
END;
IF LAST.SEG THEN
DO;
END_DATE = END_DOS;
OUTPUT;
END;
RUN;
This is how I would do it. Create an obs for each ID DRUG and DATE. Flag the gaps and summarize by RUN.
data have;
input ID Drug_Code (BEG End)(:mmddyy.);
format BEG End mmddyyd10.;
cards;
1 100 1/1/2018 3/1/2018
1 100 2/1/2018 04/30/2018
1 90 4/1/2018 04/30/2018
1 90 6/1/2018 8/15/2018
1 100 5/1/2018 6/1/2018
1 98 6/1/2018 8/31/2018
1 100 9/1/2018 5/4/2019
;;;;
run;
proc print;
run;
/*1 100 1/1/2018 1/1/2019*/
data exv/ view=exv;
set have;
do date = beg to end;
output;
end;
drop beg end;
format date mmddyyd10.;
run;
proc sort data=exv out=ex nodupkey;
by id drug_code date;
run;
data breaksV / view=BreaksV;
set ex;
by id drug_code;
dif = dif(date);
if first.drug_code then do; dif=1; run=1; end;
if dif ne 1 then run+1;
run;
proc summary data=breaksV nway missing;
class id drug_code run;
var date;
output out=want(drop=_type_) min=Begin max=End;
run;
Proc print;
run;
Computing the extent range composed of overlapping segment ranges requires a good understanding of the range conditions (cases).
Consider the scenarios when sorted by start date (within any larger grouping set, G, such as id and drug)
Let
  [ and ] be endpoints of a range
  # be date values (integers) within
  Extent be the combined range that grows
  Segment be the range in the current row
Case 1 - Growth. Within G Segment start before Extent end
Segment will either not contribute to Extent or extend it.
  [####]       Extent
+  [#]         Segment range DOES NOT contribute
  --------
  [####]       Extent (do not output a row, still growing)
or
  [####]       Extent
+   [#####]    Segment range DOES contribute
  --------
  [#######]    Extent (do not output a row, still growing)
Case 2 - Terminus. 3 possibilities:
  1. Within G Segment start after Extent end,
  2. Next G reached (different id/drug combination),
  3. End of data reached.
#2 and #3 can be tested by checking the appropriate last. flag.
  [####]         Extent
+       ..[#]    Segment beyond Extent (gap is 2)
  --------
  [####]         output Extent
          [#]    reset Extent to Segment
You can adjust your rules for Segment being adjacent (gap=0) or close enough (gap < threshold) to mean an Extent is either expanded, or, output and reset to Segment.
Note: The situation is a little more complicated (not shown) for the real-world cases of:
missing start, which means the Segment has an unknown start date (presume it to be the epoch, 0=01JAN1960, or some date that pre-dates all dates in the data or study)
missing end, which means the Segment is active today (end date is the date when processing the data)
Sample code:
data have;
call streaminit(42);
do id = 1 to 10;
do _n_ = 1 to 50;
drug = ceil(rand('UNIFORM', 10));
beg_date = intnx ('MONTH', '01JAN2008'D, rand('UNIFORM',20));
end_date = intnx ('DAY', beg_date, rand('UNIFORM',75));
OUTPUT;
end;
end;
format beg_date end_date yymmdd10.;
run;
proc sort data=have out=segments;
by id drug beg_date end_date;
run;
data want;
set segments;
by id drug beg_date end_date; * will error if incoming data is NOT sorted;
retain ext_beg ext_end;
retain gap_allowed 0; * set to 1 for contiguously adjacent segment ;
if first.drug then do;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 0;
end;
if beg_date <= ext_end + gap_allowed then do;
ext_end = max (ext_end, end_date);
segment_count + 1;
end;
else do;
extent_id + 1;
OUTPUT;
ext_beg = beg_date;
ext_end = end_date;
segment_count = 1;
end;
if last.drug then do;
extent_id + 1;
OUTPUT;
* reset occurs implicitly;
* it will happen at first. logic when control returns to top of step;
end;
format ext_: yymmdd10.;
keep id drug ext_beg ext_end segment_count extent_id;
run;
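The note above about missing start and end dates is not covered by the sample code. One possible way to handle it, as a sketch only, is to fill the missing endpoints before sorting. The data set name have_filled is hypothetical, and the choice of the SAS epoch and today's date simply follows the presumptions stated in the note.

```sas
data have_filled;
  set have;
  /* missing start: presume the SAS epoch, 01JAN1960 (date value 0) */
  if missing(beg_date) then beg_date = '01JAN1960'd;
  /* missing end: treat the segment as still active on the processing date */
  if missing(end_date) then end_date = today();
  format beg_date end_date yymmdd10.;
run;

/* then sort the filled data instead of HAVE before building extents */
proc sort data=have_filled out=segments;
  by id drug beg_date end_date;
run;
```

If a study has its own cutoff dates, those would be a better substitute than the epoch and today's date.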

How can I select the first and last week of each month in SAS?

I have monthly data with several observations per day. I have day, month and year variables. How can I retain data from only the first and the last 5 days of each month? I have only weekdays in my data so the first and last five days of the month changes from month to month, ie for Jan 2008 the first five days can be 2nd, 3rd, 4th, 7th and 8th of the month.
Below is an example of the data file. I wasn't sure how to share this so I just copied some lines below. This is from Jan 2, 2008.
Would a variation of first.variable and last.variable work? How can I retain observations from the first 5 days and last 5 days of each month?
Thanks.
1 AA 500 B 36.9800 NH 2 1 2008 9:10:21
2 AA 500 S 36.4500 NN 2 1 2008 9:30:41
3 AA 100 B 36.4700 NH 2 1 2008 9:30:43
4 AA 100 B 36.4700 NH 2 1 2008 9:30:48
5 AA 50 S 36.4500 NN 2 1 2008 9:30:49
If you want to examine the data and determine the minimum 5 and maximum 5 values then you can use PROC SUMMARY. You could then merge the result back with the data to select the records.
So if your data has variables YEAR, MONTH and DAY you can make a new data set that has the top and bottom five days per month using simple steps.
proc sort data=HAVE (keep=year month day) nodupkey
out=ALLDAYS;
by year month day;
run;
proc summary data=ALLDAYS nway;
class year month;
output out=MIDDLE
idgroup(min(day) out[5](day)=min_day)
idgroup(max(day) out[5](day)=max_day)
/ autoname ;
run;
proc transpose data=MIDDLE out=DAYS (rename=(col1=day));
by year month;
var min_day: max_day: ;
run;
proc sql ;
create table WANT as
select a.*
from HAVE a
inner join DAYS b
on a.year=b.year and a.month=b.month and a.day = b.day
;
quit;
/****
get some dates to play with
****/
data dates(keep=i thisdate);
offset = input('01Jan2015',DATE9.);
do i=1 to 100;
thisdate = offset + round(599*ranuni(1)+1); *** within 600 days from offset;
output;
end;
format thisdate date9.;
run;
/****
BTW: intnx('month',thisdate,1) = first day of next month. Deduct 1 to get the last day
of the current month.
intnx('month',thisdate,0,"BEGINNING") = first day of the current month
****/
proc sql;
create table first5_last5 AS
SELECT
*
FROM
dates /* replace with name of your data set */
WHERE
/* replace all occurences of 'thisdate' with name of your date variable */
( intnx('month',thisdate,1)-5 <= thisdate <= intnx('month',thisdate,1)-1 )
OR
( intnx('month',thisdate,0,"BEGINNING") <= thisdate <= intnx('month',thisdate,0,"BEGINNING")+4 )
ORDER BY
thisdate;
quit;
Create some data with the desired structure:
Data inData (drop=_:); * forget all variables starting with an underscore*;
format date yymmdd10. time time8.;
_instant = datetime();
do _i = 1 to 1E5;
date = datepart(_instant);
time = timepart(_instant);
yy = year(date);
mm = month(date);
dd = day(date);
*just some more random data*;
letter = byte(rank('a') +floor(rand('uniform', 0, 26)));
*select week days*;
if weekday(date) in (2,3,4,5,6) then output;
_instant = _instant + 1E5*rand('exponential');
end;
run;
Count the days per month:
proc sql;
create view dayCounts as
select yy, mm, count(distinct dd) as _countInMonth
from inData
group by yy, mm;
quit;
Select the days:
data first_5(drop=_:) last_5(drop=_:);
merge inData dayCounts;
by yy mm;
_newDay = dif(date) ne 0;
retain _nrInMonth;
if first.mm then _nrInMonth = 1;
else if _newDay then _nrInMonth + 1;
if _nrInMonth le 5 then output first_5;
if _nrInMonth gt _countInMonth - 5 then output last_5;
run;
Use the INTNX() function. You can use INTNX('month',...) to find the beginning and ending days of the month and then use INTNX('weekday',...) to find the first 5 week days and last five week days.
You can convert your month, day, year values into a date using the MDY() function. Let's assume that you do that and create a variable called TODAY. Then to test if it is within the first 5 weekdays or last 5 weekdays of the month you could do something like this:
first5 = intnx('weekday',intnx('month',today,0,'B'),0) <= today
<= intnx('weekday',intnx('month',today,0,'B'),4) ;
last5 = intnx('weekday',intnx('month',today,0,'E'),-4) <= today
<= intnx('weekday',intnx('month',today,0,'E'),0) ;
Note that those ranges will include the week-ends, but it shouldn't matter if your data doesn't have those dates.
But you might have issues if your data skips holidays.
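Here is a sketch of how those expressions might sit in a complete DATA step. It assumes numeric variables MONTH, DAY and YEAR in a data set called HAVE; first_wd and last_wd are names I made up. As far as I understand the WEEKDAY interval, a Saturday or Sunday belongs to the interval that starts on the preceding Friday, so when a month begins on a weekend the first-five range above may start counting from the previous month's Friday. The sketch steps off the weekend first to avoid that.

```sas
data want;
  set have;
  today = mdy(month, day, year);   /* assumes numeric MONTH, DAY, YEAR variables */

  /* first weekday of the month: move a weekend start forward to Monday */
  first_wd = intnx('month', today, 0, 'B');
  if weekday(first_wd) = 7 then first_wd = first_wd + 2;        /* Saturday -> Monday */
  else if weekday(first_wd) = 1 then first_wd = first_wd + 1;   /* Sunday -> Monday */

  /* last weekday of the month: move a weekend end back to Friday */
  last_wd = intnx('month', today, 0, 'E');
  if weekday(last_wd) = 7 then last_wd = last_wd - 1;           /* Saturday -> Friday */
  else if weekday(last_wd) = 1 then last_wd = last_wd - 2;      /* Sunday -> Friday */

  /* four weekday steps from the first (or back from the last) weekday spans five weekdays */
  first5 = (first_wd <= today <= intnx('weekday', first_wd, 4));
  last5  = (intnx('weekday', last_wd, -4) <= today <= last_wd);

  if first5 or last5;              /* keep only rows in either range */
  format today first_wd last_wd date9.;
run;
```

For March 2008, which starts on a Saturday, this gives March 3 to March 7 as the first five weekdays. The holiday caveat still applies: the ranges are based on the calendar, not on the days actually present in the data.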