I have a dataset that looks something like this:
/********************************************************************************/
YYMM Sector
1701 Agriculture
1611 Retail
1501 CRE
/*************/
There is another dataset that looks something like this:
/*************/
Customer_ID YYMM
XXXX 1702
XXXX 1701
XXXX 1612
XXXX 1611
XXXX 1610
XXXX 1510
XXXX 1509
/********************************************************/
So basically I want to merge these two datasets on YYMM and bring in the sectors. But since the first dataset has only a few YYMM values, all I want to do is carry each sector down until a new YYMM from the first dataset is encountered.
So the sector from 1701 to 1612 should be Agriculture, the sector from 1611 to 1502 should be Retail, and for 1501 and any earlier month it has to be CRE.
Can you please tell me how to do it?
Here is a SQL based solution (similar to the one proposed by pinegulf).
Let us create test datasets:
data T01;
length Sector $20;
infile cards;
input YYMM_to Sector;
cards;
1701 Agriculture
1611 Retail
1501 CRE
;
run;
data T02;
length Customer_id $10;
infile cards;
input Customer_ID YYMM;
cards;
AXXX 1702
BXXX 1701
CXXX 1612
DXXX 1611
EXXX 1610
FXXX 1510
GXXX 1509
;
run;
We can add a "YYMM_from" column to T01:
proc sort data=T01;
by YYMM_to;
run;
data T01;
set T01;
by YYMM_to;
YYMM_from=lag(YYMM_to);
if _N_=1 then YYMM_from=0;
run;
proc print data=T01;
run;
We get:
Obs Sector YYMM_to YYMM_from
------------------------------------------
1 CRE 1501 0
2 Retail 1611 1501
3 Agriculture 1701 1611
Then comes the join:
proc sql;
create table T03 as
select a.*, b.Sector
from T02 a LEFT JOIN T01 b
on YYMM_from<a.YYMM<=YYMM_to;
quit;
proc print data=T03;
run;
We get:
Obs Customer_id YYMM Sector
-----------------------------------------
1 DXXX 1611 Retail
2 EXXX 1610 Retail
3 FXXX 1510 Retail
4 GXXX 1509 Retail
5 BXXX 1701 Agriculture
6 CXXX 1612 Agriculture
7 AXXX 1702
Here is a solution with PROC FORMAT. Since your data is in YYMM form, you could define the ranges directly on those values without any conversion, but I feel more comfortable with actual dates.
data Begin;
input Customer_ID $ YYMM $;
cards;
XXXX 1702
YYYY 1701
ZZZZ 1612
OOOO 1611
AAAA 1610
FFFF 1510
DDDD 1509
; run;
data with_date;
set begin;
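/* Convert the character pieces explicitly to avoid automatic conversion notes in the log. */
date = mdy(input(substr(yymm,3,2),2.), 1, input(substr(yymm,1,2),2.));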
run;
proc format; /* Bins follow the rule stated in the question. Adjust as needed. */
value sector
low - '1jan2015'd = 'CRE'
'1jan2015'd <- '1nov2016'd = 'Retail'
'1nov2016'd <- '1jan2017'd = 'Agriculture'
'1jan2017'd <- high = ' ' /* months after 1701 are not covered by the question */
;
run;
data wanted;
set with_date;
format date sector.;
run;
For more on proc format see http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473474.htm
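Note that a FORMAT statement only changes how date is displayed; the stored value is still a SAS date. If an actual character variable holding the sector is needed, one option is to apply the format with PUT. This is a minimal sketch, assuming the with_date dataset and the sector. format defined above; the output name wanted_char is only illustrative.

```sas
data wanted_char;                 /* hypothetical output dataset name */
  set with_date;
  length Sector $20;
  /* PUT applies the user-defined format and stores the resulting label */
  Sector = put(date, sector.);
run;

proc print data=wanted_char;
run;
```

The resulting Sector column can then be used in BY, CLASS or WHERE processing like any other character variable.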
I am trying to derive ADT for each SUBJID. If the SUBJID has two or more consecutive missing DAT values, ADT should be the last non-missing DAT before the first such run; otherwise ADT should be the latest DAT.
The code below produces the data I have. I want ADT derived according to the rule illustrated below (either merged back into the set HAVE or created in a totally new set):
for subjid 1001: 1997-05-01, as this subject has no consecutive missing values (only a single, non-consecutive missing value)
for subjid 1002: 1998-02-01, as this subject has consecutive missing values at AVISIT 2-5
for subjid 1003: 1999-03-08, as the first run of consecutive missing values starts at AVISIT 4, and DAT is non-missing at AVISIT 3.
Hope you can help me. Thanks.
data have;
infile datalines truncover;
input subjid avisit dat : yymmdd10.;
format dat yymmdd10.;
datalines;
1001 0 1997-01-01
1001 1 1997-02-01
1001 2
1001 3 1997-05-01
1002 0 1998-01-01
1002 1 1998-02-01
1002 2
1002 3
1002 4
1002 5
1002 6 1998-12-01
1003 0 1999-01-01
1003 1 1999-02-01
1003 2
1003 3 1999-03-08
1003 4
1003 5
1003 6 1999-05-01
1003 7
1003 8
;
run;
The code below creates a dataset that includes the last non-missing dat whenever the count of consecutively missing dat values is greater than or equal to 2. Each time we encounter a missing dat, we increase the consecutive-missing counter nmiss by 1.
We are always storing the last valid value of dat in the variables last_nonmissing_dat and last_nonmissing_avisit such that they always carry forward when a missing value is encountered. When two consecutive missing values occur, we output the results.
nmiss and last_nonmissing are reset whenever we move to a new subjid.
data want;
set have;
by subjid;
retain last_nonmissing_dat
last_nonmissing_avisit
;
if(first.subjid) then call missing(nmiss, of last_nonmissing:);
if(missing(dat)) then nmiss+1;
else do;
nmiss = 0;
last_nonmissing_dat = dat;
last_nonmissing_avisit = avisit;
end;
if(nmiss GE 2) then output;
format last_nonmissing_dat yymmdd10.;
run;
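The step above writes one row for every record where nmiss is at least 2, and nothing for subjects that never have a consecutive gap. If a single ADT per SUBJID is wanted, as the question describes, the same counter idea can be adapted. This is a minimal sketch, assuming HAVE is sorted by subjid and avisit; the names adt_per_subject, last_dat and found are illustrative and not part of the original code.

```sas
data adt_per_subject (keep=subjid adt);   /* hypothetical output name */
  set have;
  by subjid;
  retain last_dat adt nmiss found;
  format adt yymmdd10.;

  /* reset the state at the start of each subject */
  if first.subjid then do;
    call missing(last_dat, adt);
    nmiss = 0;
    found = 0;
  end;

  if missing(dat) then nmiss = nmiss + 1;
  else do;
    nmiss = 0;
    last_dat = dat;
  end;

  /* freeze ADT at the first run of two consecutive missing values */
  if nmiss = 2 and not found then do;
    adt = last_dat;
    found = 1;
  end;

  /* one row per subject; fall back to the latest DAT if no run was found */
  if last.subjid then do;
    if not found then adt = last_dat;
    output;
  end;
run;
```

The result can be merged back onto HAVE by subjid if ADT is needed on every record.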
Assume you have a temporary SAS data set called EPISODES that contains information about hospital episodes. The data set contains the variables ID_NO (patient ID), ADMIT_DATE (date of admission), DISC_DATE (date of discharge), and TOTAL_COST.
Using this data set, create a new data set with a separate observation for each day of each hospital episode. If a patient had a hospital episode that was 3 days long, they would have three observations in the new data set from that episode: one for each day.
Each observation in the new data set should have only three variables: the patient identifier ID_NO, the date for that particular day of hospitalization XDATE, and the cost for that day of hospitalization DAILY_COST = TOTAL_COST divided by the number of days in the episode.
My thought is to do this as a loop. Something like the following.
data new_data;
set input_data ;
do xdate = admit_date to disc_date;
daily_cost = .... ;
output new_data ( keep = xdate daily_cost id_no );
end;
run;
*This program block sets up our data set;
data episodes;
INPUT ID_NO $ ADMIT_DATE :mmddyy10. TOTAL_COST DISC_DATE :mmddyy10.;
DATALINES;
1 01/01/2017 3000 01/03/2017
2 01/01/2017 14000 01/14/2017
;
run;
data new_episodes (keep= ID_NO XDATE DAILY_COST);
set episodes;
*Length of stay, counting the discharge date as the end of the last day;
NUM_DAYS= DISC_DATE-ADMIT_DATE;
DAILY_COST= TOTAL_COST/NUM_DAYS;
*Using the Do While loop to create one observation per day;
XDATE=ADMIT_DATE;*initializing our variable;
do while(XDATE<DISC_DATE);
put XDATE=;
output;*outputting the current date before moving to the next day;
XDATE+1;
end;
format XDATE mmddyy10.;
run;
proc print data=new_episodes;
run;
Since SAS stores dates as number of days you can just use a DO loop to increment XDATE from ADMIT_DATE to DISC_DATE.
But you need to decide how to count dates. If you are admitted on Monday and discharged on Tuesday is that one day or two days? If it is one day then do you want XDATE records for Monday or Tuesday? Or both?
Let's make some test data:
data have;
input id_no $ admit_date :yymmdd. total_cost disc_date :yymmdd.;
format admit_date disc_date yymmdd10.;
put (_all_) (+0);
datalines;
1 2017-01-01 3000 2017-01-04
2 2017-01-01 5000 2017-01-06
3 2020-02-23 500 2020-02-23
;
Here is code that treats Monday to Tuesday as one day. So it doesn't output the discharge date (unless it is the same as the admission date).
data want;
set have ;
if admit_date=disc_date then daily_cost=total_cost;
else daily_cost = total_cost / (disc_date - admit_date);
do xdate=admit_date to max(admit_date,disc_date-1) ;
output;
end;
keep id_no xdate daily_cost;
format xdate yymmdd10.;
run;
Results:
daily_
Obs id_no cost xdate
1 1 1000 2017-01-01
2 1 1000 2017-01-02
3 1 1000 2017-01-03
4 2 1000 2017-01-01
5 2 1000 2017-01-02
6 2 1000 2017-01-03
7 2 1000 2017-01-04
8 2 1000 2017-01-05
9 3 500 2020-02-23
If you want to treat a stay from Monday to Tuesday as 2 days then the code is easier.
data want;
set have ;
daily_cost = total_cost / (disc_date - admit_date + 1);
do xdate=admit_date to disc_date ;
output;
end;
keep id_no xdate daily_cost;
format xdate yymmdd10.;
run;
Results:
daily_
Obs id_no cost xdate
1 1 750.000 2017-01-01
2 1 750.000 2017-01-02
3 1 750.000 2017-01-03
4 1 750.000 2017-01-04
5 2 833.333 2017-01-01
6 2 833.333 2017-01-02
7 2 833.333 2017-01-03
8 2 833.333 2017-01-04
9 2 833.333 2017-01-05
10 2 833.333 2017-01-06
11 3 500.000 2020-02-23
This is my code:
DATA sales;
INFILE 'D:\Users\...\Desktop\Onions.dat';
INPUT VisitingTeam $ 1-20 ConcessionSales 21-24 BleacherSales 25-28
OurHits 29-31 TheirHits 32-34 OurRuns 35-37 TheirRuns 38-40;
PROC PRINT DATA = sales;
TITLE 'SAS Data Set Sales';
RUN;
This is the data, but the spacing may be incorrect.
Columbia Peaches 35 67 1 10 2 1
Plains Peanuts 210 . 2 5 0 2
Gilroy Garlics 151035 12 11 7 6
Sacramento Tomatoes 124 85 15 4 9 1
;
I need to add or delete a blank column at the 19th column. Can someone help?
Just open the dataset and then look at what the variable name is. Then do:
Data Want (drop=variable_name_you_are_dropping); /*This is your output dataset*/
Set have; /*this is your dataset you have*/
Run;
I'm struggling to conceptualize code that would output the average number of patients seen per hour by each provider. Here is a snippet of my dataset, which spans 3 years of data. It has three variables: the patient ID, the provider name, and the time at which the provider saw the patient, stored as a date/time value:
patient_fin first_Md_seen Provider_Seen_Date_Time
1 Bob 5/1/2018 4:19:00 AM
2 Bob 5/1/2018 4:29:00 AM
3 Bob 5/1/2018 4:30:00 PM
4 Sally 5/1/2018 7:39:00 AM
5 Sally 5/1/2018 7:49:00 AM
6 Sally 5/1/2018 8:55:00 PM
7 Bubba 5/3/2018 12:19:00 AM
8 Bob 5/3/2018 4:10:00 AM
....
To calculate the number of patients seen by a provider, I wrote the following code:
data ED_TAT3;
SET ED_TAT2;
if patient_fin ne . then Patient_fin_count=1;
run;
proc means data = ED_TAT3;
class first_Md_seen;
var Patient_fin_count;
run;
Now, I need to figure out how many hours a provider worked so I can divide the number of patients seen by the number of hours worked.
I think I can use the Provider_Seen_Date_Time variable as a proxy after running the following code to get the hour 'hour = hour (datepart(Provider_Seen_Date_Time))'.
Would code like this give me the correct number of hours a provider worked?
data new1;
set new;
hour = hour (datepart(Provider_Seen_Date_Time));
if Provider_Name = 'Bob' and hour ne . then hour_worked = 1;
run;
Is there:
1) a more accurate or efficient (there are hundreds of different providers) way to figure out the total number of hours worked per provider?
OR
2) which is the more ideal code, to simply figure out the number of patients per hour a provider saw.
Desired output:
Provider Avg Patients Seen per Hour
Bob 5
Sally 4
Bubba 6
Thanks in advance!
Based on what is given, you can try the following code. However, I still have concerns about the data.
data ed_tat2;
input patient_fin first_Md_seen$ Provider_Seen_Date_Time mdyampm25.2;
format Provider_Seen_Date_Time mdyampm25.;
hour = hour (Provider_Seen_Date_Time);
date_seen=datepart(Provider_Seen_Date_Time);
format date_seen date9.;
datalines;
1 Bob 5/1/2018 4:19:00 AM
2 Bob 5/1/2018 4:30:00 PM
3 Sally 5/1/2018 7:39:00 AM
4 Sally 5/1/2018 7:59:00 PM
5 Bubba 5/3/2018 12:19:00 AM
6 Bob 5/3/2018 4:10:00 AM
7 Bob 5/3/2018 4:30:00 AM
8 Bob 5/3/2018 5:10:00 AM
;
run;
proc sort data=ed_tat2; by first_Md_seen date_seen hour; run;
data ed_tat3;
set ed_tat2;
by first_Md_seen date_seen hour;
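/* first.hour flags the first record of each provider/date/hour group, so each distinct hour counts once. */
/* This avoids calling lag() inside a compound condition, where its queue may not update on every row. */
if first.hour then hour=1;
else hour=0;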
run;
proc sql;
select first_Md_seen, date_seen, count(patient_fin) as number_of_patients_seen, sum(hour) as number_of_hours, count(patient_fin)/sum(hour) as patients_seen_per_hour
from ed_tat3
where hour ne .
group by first_Md_seen, date_seen;
select first_Md_seen, count(patient_fin) as number_of_patients_seen, sum(hour) as number_of_hours, count(patient_fin)/sum(hour) as patients_seen_per_hour
from ed_tat3
where hour ne .
group by first_Md_seen;
quit;
You can do this easily within two proc freqs.
The first will calculate the number of patients seen by doctor per hour and the second uses the first output to calculate the number of hours worked per doctor, per day. You can easily modify these by modifying the TABLE statements.
data ed_tat2;
input patient_fin first_Md_seen $ Provider_Seen_Date_Time mdyampm25.2;
format Provider_Seen_Date_Time mdyampm25.;
hour=hour (Provider_Seen_Date_Time);
date_seen=datepart(Provider_Seen_Date_Time);
format date_seen date9.;
datalines;
1 Bob 5/1/2018 4:19:00 AM
2 Bob 5/1/2018 4:30:00 PM
3 Sally 5/1/2018 7:39:00 AM
4 Sally 5/1/2018 7:59:00 PM
5 Bubba 5/3/2018 12:19:00 AM
6 Bob 5/3/2018 4:10:00 AM
7 Bob 5/3/2018 4:30:00 AM
8 Bob 5/3/2018 5:10:00 AM
;
run;
*counts per hour;
proc freq data=ed_tat2 noprint;
table first_Md_seen*date_seen*hour / out=provider_counts;
run;
*hours worked per doctor;
proc freq data=provider_counts noprint;
table first_Md_seen*date_seen / out=provider_hours;
run;
title 'Number of patients seen';
proc print data=provider_counts label;
label count='# of patients per hour';
title 'Number of hours worked';
proc print data=provider_hours label;
label count='# of hours worked in a day';
run;
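To get from these two tables to a single patients-per-hour figure per provider, the hourly counts can be aggregated once more. This is a minimal sketch, assuming provider_counts was created by the first PROC FREQ above (one row per provider, date and hour, with the frequency in COUNT); the table name provider_rate and the column names are illustrative.

```sas
proc sql;
  create table provider_rate as     /* hypothetical output name */
  select first_Md_seen,
         sum(count) as patients_seen,   /* total patients for the provider */
         count(*)   as hours_worked,    /* distinct date/hour slots with at least one patient */
         calculated patients_seen / calculated hours_worked as patients_per_hour
  from provider_counts
  group by first_Md_seen;
quit;

proc print data=provider_rate;
run;
```

As in the answers above, an hour only counts as worked if at least one patient was seen in it, so this is a proxy rather than a true measure of hours on shift.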
I'm trying to transpose a dataset, using values as variable names and summarizing numeric data by group. I tried PROC TRANSPOSE and PROC REPORT (ACROSS) but couldn't get it to work. The only way I know to do this is with a DATA step (IF/ELSE and SUM), but then the result doesn't adapt dynamically to new values.
For example I have this data set:
school name subject picked saving expenses
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
and I need this in one line per school: the sum of 'picked' by student name, then the sum of 'picked' by subject, and in the last 3 columns the totals for picked, saving and expenses:
school john ruby peter noname math spanish geography nosubject picked saving expenses
raget 15 15 2 0 13 5 2 12 32 22700 8200
Is it possible for this to adapt dynamically if a new student or subject is added to the school?
It's a little difficult because you're summarising at more than one level, so I've used PROC SUMMARY and chosen different _TYPE_ values. See below:
data have;
infile datalines;
input school $ name $ subject : $10. picked saving expenses;
datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
;
run;
proc summary data=have;
class school name subject;
var picked saving expenses;
output out=want1 sum(picked)=picked sum(saving)=saving sum(expenses)=expenses;
run;
proc transpose data=want1 (where=(_type_=5)) out=subs (where=(_NAME_='picked'));
by school;
id subject;
run;
proc transpose data=want1 (where=(_type_=6)) out=names (where=(_NAME_='picked'));
by school;
id name;
run;
proc sql;
create table want (drop=_TYPE_ _FREQ_ name subject) as
select
n.*,
s.*,
w.*
from want1 (where=(_TYPE_=4)) w,
names (drop=_NAME_) n,
subs (drop=_NAME_) s
where w.school = n.school
and w.school = s.school;
quit;
I've also tested this code by adding new schools, names and subjects and they do appear in the final table. You'll note that I haven't hardcoded anything (e.g. no reference to math or John), so the code is dynamic enough.
PROC REPORT is an interesting alternative, particularly if you want the printed output rather than as a dataset. You can use ODS OUTPUT to get the output dataset, but it's messy as the variable names aren't defined for some reason (they're "C2" etc.). The printed output of this one is a little messy also as the header rows don't line up, but that can be fixed with some finagling if that's desired.
data have;
input school $ name $ subject : $10. picked saving expenses; /* $10. keeps 'geography' and 'nosubject' from being truncated to 8 characters */
datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
;;;;
run;
ods output report=want;
proc report nowd data=have;
columns school (name subject),(picked) picked=picked2 saving expenses;
define picked/analysis sum ' ';
define picked2/analysis sum;
define saving/analysis sum ;
define expenses/analysis sum;
define name/across;
define subject/across;
define school/group;
run;