I have data that looks like -
data abc;
input ID $ drug $ episode start_date :date9. end_date :date9.;
format start_date end_date date9.;
informat start_date end_date date9.;
datalines ;
1 A 1 01Jan2012 30Mar2012
1 A 2 01May2012 03Jul2012
1 A 3 28Sep2012 28Oct2012
1 A 4 01Nov2012 30Dec2012
1 B 1 01Apr2012 10May2012
1 B 2 02Nov2012 28Dec2012
1 B 3 01Jan2012 30Mar2012
1 C 1 01Jul2012 02Aug2012
;
run;
Here we have subjects and the drugs they take. A new episode of a drug means that the person discontinued it.
If the start date of the second drug (the start date of its first episode) falls between the episodes of the first drug, then all further episodes of the first drug are ignored.
E.g., here 1 April (the start date of drug B) falls after the first episode of drug A, so episodes 2, 3 and 4 of drug A are deleted.
Similarly, the start date of drug C falls after the end date of episode 1 of drug B, so the later episodes of drug B are deleted.
The maximum number of episodes a subject can have is 15.
The resultant dataset should look like -
ID Drug Episode start_date end_date
1 A 1 1-Jan 30-Mar
1 B 1 1-Apr 10-May
1 C 1 1-Jul 2-Aug
How about this? I added another ID to the example data for demonstration.
data abc;
input ID $ drug $ episode start_date :date9. end_date :date9.;
format start_date end_date date9.;
datalines ;
1 A 1 01Jan2012 30Mar2012
1 A 2 01May2012 03Jul2012
1 A 3 28Sep2012 28Oct2012
1 A 4 01Nov2012 30Dec2012
1 B 1 01Apr2012 10May2012
1 B 2 02Nov2012 28Dec2012
1 B 3 01Jan2012 30Mar2012
1 C 1 01Jul2012 02Aug2012
2 A 1 01Jan2012 30Mar2012
2 A 2 01May2012 03Jul2012
2 A 3 28Sep2012 28Oct2012
2 A 4 01Nov2012 30Dec2012
2 B 1 01Apr2012 10May2012
2 B 2 02Nov2012 28Dec2012
2 B 3 01Jan2012 30Mar2012
2 C 1 01Jul2012 02Aug2012
;
run;
data want;
format ID drug episode start_date end_date;
keep ID drug episode start_date end_date;
declare hash h ();
h.definekey ('ID', 'd');
h.definedata ('_start_date');
h.definedone ();
do until (lr1);
set abc (rename= (start_date = _start_date)) end=lr1;
by ID drug;
if first.ID then d = 0;
if first.drug then d + 1;
if episode = 1 then h.add();
end;
do until (lr2);
set abc end=lr2;
by ID drug;
if first.ID then d = 0;
if first.drug then do;
d + 1; flag = 0;
end;
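rc = h.find(key : ID, key : d+1); /* first start date of the next drug, if there is one */
if rc = 0 and start_date > _start_date then flag=1; /* rc ne 0 means no next drug, so a stale _start_date is not compared */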
if flag = 0 then output;
end;
retain flag;
run;
Result:
ID drug episode start_date end_date
1 A 1 01JAN2012 30MAR2012
1 B 1 01APR2012 10MAY2012
1 C 1 01JUL2012 02AUG2012
2 A 1 01JAN2012 30MAR2012
2 B 1 01APR2012 10MAY2012
2 C 1 01JUL2012 02AUG2012
I'm new to SAS and I'm at a dead end.
I need to build the final table C with the full set of attributes and the "intersection" of the versioning: as soon as a version changes in either the Tarifs or the Abonents table, the version in C should change too. If the version changes in both tables at the same time, the version in C should change only once.
Tarifs
abon_id tariff_plan type from_date to_date
1 1 1 01OCT2005 01JAN2040
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
3 0 0 07NOV1917 11JUN1991
3 1 1 12JUN1991 01JAN2040
4 1 1 12JUN1991 01JAN2040
Abonents
abon_id name sex from_date to_date
1 Igor M 01OCT2005 01JAN2040
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
3 USSR Country 07NOV1917 11JUN1991
3 Russia Country 12JUN1991 01JAN2040
4 Petya M 12AUG1991 01JAN2040
Resulting table should be:
C:
abon_id tariff_plan type name sex fd td
1 1 1 Igor M 01oct2005 01jan2040
2 1 2 Vasya M 05nov2005 01aug2006
2 1 2 Lena F 02aug2006 01dec2006
2 2 2 Lena F 02dec2006 02sep2007
2 2 2 Julia F 03sep2007 01dec2007
2 2 1 Julia F 02dec2007 01jan2040
3 0 0 USSR Country 07nov1917 11jun1991
3 1 1 Russia Country 12jun1991 01jan2040
4 1 1 . . 12jun1991 11aug1991
4 1 1 Petya M 12aug1991 01jan2040
So far I have something like:
data out;
retain fd1 fd2 td1 td2;
format fd1 fd2 td1 td2 ddmmyy10.;
merge Tarifs(in=x) Abonents(in=y);
by abon_id fd;
fd1 = 0; fd2 = 0; td1 = 0; td2 = 0;
if x then do;
fd1 = fd;
td1 = td;
end;
if y then do;
fd2 = fd;
td2 = td;
end;
if fd1 <= fd2 then do;
fd = fd1;
if fd2 < td1 and fd2 < td2 then td = fd2;
else if td1 < td2 then td = td1;
else td = td2;
end;
else do;
fd = fd2;
if fd1 < td1 and fd1 < td2 then td = fd1;
else if td1 < td2 then td = td1;
else td = td2;
end;
run;
But I think I'm doing something wrong. Please help me!
You can use an SQL UNION to combine the overlapping periods with the part of each tariff that precedes the first abonent record.
data tarifs;
input
abon_id tariff_plan type from_date: date9. to_date date9.;
format _numeric_ 4. from_date to_date date9.;
datalines;
1 1 1 01OCT2005 01JAN2040
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
3 0 0 07NOV1917 11JUN1991
3 1 1 12JUN1991 01JAN2040
4 1 1 12JUN1991 01JAN2040
data abonents;
length abon_id 8 name $10 sex $10;
input
abon_id name sex from_date: date9. to_date date9.;
format from_date to_date date9.;
datalines;
1 Igor M 01OCT2005 01JAN2040
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
3 USSR Country 07NOV1917 11JUN1991
3 Russia Country 12JUN1991 01JAN2040
4 Petya M 12AUG1991 01JAN2040
;
proc sql;
create table want as
(
select
A.abon_id, A.tariff_plan, A.type
, B.name, B.sex
, case
when A.from_date < B.from_date then B.from_date else A.from_date
end as fd format=date9.
, case
when A.to_date > B.to_date then B.to_date else A.to_date
end as td format=date9.
from tarifs A
left join abonents B
on A.abon_id = B.abon_id
where
/* two periods overlap when each one starts on or before the other ends */
A.from_date <= B.to_date
and
B.from_date <= A.to_date
)
union
(
select
A.abon_id, A.tariff_plan, A.type
, ' ' as name , ' ' as sex
, A.from_date as fd
, min(B.from_date)-1 as td
from tarifs A
left join abonents B
on A.abon_id = B.abon_id
group by
B.abon_id
having
A.from_date < min(B.from_date)
)
;
quit;
A simple merge cannot accomplish the task because every tariff row has to be compared with every abonent row that has the same abon_id (a cross join within abon_id).
In a DATA step, that cross join can be done by loading abonents into a multidata hash, reading tarifs sequentially with SET, and iterating over the matches with find/find_next.
Example
data tarifs;
input
abon_id tariff_plan type from_date: date9. to_date date9.;
format _numeric_ 4. from_date to_date date9.;
datalines;
1 1 1 01OCT2005 01JAN2040
2 1 2 05NOV2005 01DEC2006
2 2 2 02DEC2006 01DEC2007
2 2 1 02DEC2007 01JAN2040
3 0 0 07NOV1917 11JUN1991
3 1 1 12JUN1991 01JAN2040
4 1 1 12JUN1991 01JAN2040
5 1 1 06JAN2021 31DEC2031
data abonents;
length abon_id 8 name $10 sex $10;
input
abon_id name sex from_date: date9. to_date date9.;
format from_date to_date date9.;
datalines;
1 Igor M 01OCT2005 01JAN2040
2 Vasya M 05NOV2005 01AUG2006
2 Lena F 02AUG2006 02SEP2007
2 Julia F 03SEP2007 01JAN2040
3 USSR Country 07NOV1917 11JUN1991
3 Russia Country 12JUN1991 01JAN2040
4 Petya M 12AUG1991 01JAN2040
;
data want(keep=abon_id tariff_plan type name sex tariffed:);
if 0 then set tarifs abonents;
if _n_ = 1 then do;
declare hash abon (dataset:'abonents', multidata:'y');
abon.defineKey('abon_id');
abon.defineData('name', 'sex', 'from_date', 'to_date');
abon.defineDone();
end;
set tarifs (rename=(from_date=fd to_date=td));
min_from = 1e9;
if abon.find() = 0 then do until (abon.find_next() ne 0);
if fd <= from_date <= td then tariffed_fd = from_date;
else
if from_date <= fd <= to_date then tariffed_fd = fd;
if fd <= to_date <= td then tariffed_td = to_date;
else
if from_date <= td <= to_date then tariffed_td = td;
if nmiss(of tariffed:) = 0 then output;
if from_date < min_from then min_from = from_date;
call missing (of tariffed:);
end;
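if fd < min_from then do;
/* the tariff starts before any abonent record (or there is none): output the uncovered leading period */
tariffed_fd = fd;
tariffed_td = min(td, min_from - 1); /* use the earliest abonent start, not the last one read */
call missing (name, sex);
output;
end;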
format min_from tariffed: date9.;
run;
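Both answers come down to the same interval test: two periods overlap when each one starts on or before the other ends, and the overlapping part runs from the later start to the earlier end. A minimal sketch of just that step, assuming the tarifs and abonents datasets created above (the output name overlaps is made up for illustration, and the rows for a tariff period not covered by any abonent are not produced here):

```sas
proc sql;
  create table overlaps as
  select a.abon_id, a.tariff_plan, a.type, b.name, b.sex
       /* with two arguments, MAX and MIN act as row-wise SAS functions */
       , max(a.from_date, b.from_date) as fd format=date9.
       , min(a.to_date, b.to_date) as td format=date9.
  from tarifs a
  inner join abonents b
    on a.abon_id = b.abon_id
   and a.from_date <= b.to_date   /* tariff starts before the abonent version ends */
   and b.from_date <= a.to_date   /* abonent version starts before the tariff ends */
  order by a.abon_id, fd;
quit;
```

The test also catches the case where one period lies entirely inside the other, which a check on the end points alone can miss.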
I'm pretty new to SAS, so I'm struggling to figure out how to rearrange my data. My data set looks like this:
CPT DATE A B C D etc.
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
etc.
where cpt(i) represents each counterparty, date(i) represents the date of my cash flows and A,B,C,D are the different types of cash flows. Since this dataset has lots of columns, I'd like to rearrange the data by increasing the number of rows when there is more than one cash flow in date(i). So the output is supposed to be this one:
CPT DATE Cash Flow Type
1 date1 20.000 A
1 date1 5.000 B
1 date2 30.000 D
1 date3 10.000 B
1 date3 10.000 C
2 date1 3.000 A
2 date1 3.000 B
2 date2 5.000 C
2 date2 3.000 D
etc.
Any tips on how to get what I want? Cheers
Datalines format of data is below.
data have;
input CPT DATE$ A B C D;
format a b c d 8.3;
datalines;
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
;
run;
This is a 'wide to long' transpose. It's really easy!
data have;
input CPT DATE $ A B C D ;
datalines;
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
;;;;
run;
proc transpose data=have out=want;
by cpt date;
var a b c d;
run;
If there are more complexities than this, you can also do this in the data step.
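As a rough sketch of that data step route (not part of the original answer; the names want_long, type and cash_flow are my own choices), an array over the cash flow columns can write one row per non-zero value:

```sas
data want_long;
  set have;
  array flows {*} a b c d;        /* the wide cash flow columns */
  length type $32;
  do i = 1 to dim(flows);
    if flows{i} ne 0 then do;     /* skip cash flows that are zero */
      type = vname(flows{i});     /* name of the column the value came from */
      cash_flow = flows{i};
      output;
    end;
  end;
  keep cpt date type cash_flow;
run;
```

Missing values would still be written, because a missing value is not equal to 0; add a missing() check if they should be skipped as well.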
Use proc transpose. It's the easiest way to transpose data in SAS. It automatically names the transposed value columns COL1, COL2, etc. Use the rename= output dataset option to rename COL1 to cash_flow.
proc transpose data = have
out = want(rename=(COL1 = cash_flow) )
name = type
;
by cpt date;
run;
A more tricked out TRANSPOSE can set the pivot column label and restrict the output to non-zero cashflow.
proc transpose data=have
out=want(
rename=(_name_=Type col1=cashflow)
where=(cashflow ne 0)
)
;
by cpt date;
var a b c d;
label cashflow='Cash Flow';
run;
You will have to endure this log message:
WARNING: Variable CASHFLOW not found in data set WORK.HAVE.
The goal is to add a new row wherever there is a gap in the date variable between two consecutive rows with the same id.
When a gap occurs, duplicate the row that comes before it. Only the date should differ from that row: it should be incremented by one day.
Also, everything needs to be grouped by id. I need to achieve it without expanding the function.
data sample;
input id date numeric_feature character_feature $;
informat date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-03 1 B
2 2020-01-05 9 F
;
data sample;
set sample;
format date yymmdd10.;
run;
The desired result:
data sample;
input id date numeric_feature character_feature $;
informat date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-03 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-02 4 V
2 2020-01-03 1 B
2 2020-01-04 1 B
2 2020-01-05 9 F
;
data sample;
set sample;
format date yymmdd10.;
run;
You can perform a 1:1 self merge with the second self starting at row 2 in order to provide a lead value. A 1:1 merge does not use a BY statement.
Example:
data have;
input id date numeric_feature character_feature $;
informat date yymmdd10.;
format date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-03 1 B
2 2020-01-05 9 F
;
data want;
* 1:1 merge without by statement;
merge
have /* start at row 1 */
have ( firstobs=2 /* start at row 2 for lead values */
keep=id date /* more data set options that prepare the lead */
rename = ( id=nextid
date=nextdate
))
;
output;
flag = '*'; /* marker for filled in dates */
if id = nextid then
do date=date+1 to nextdate-1;
output;
end;
drop next:;
run;
To "look ahead" you can re-read the same dataset starting from the second observation. SAS will stop when you read past the end of the input so add an extra empty observation.
data sample;
input id date numeric_feature character_feature $;
informat date yymmdd.;
format date yymmdd10.;
datalines;
1 2020-01-01 5 A
1 2020-01-02 3 Z
1 2020-01-04 2 D
1 2020-01-05 7 B
2 2020-01-01 4 V
2 2020-01-03 1 B
2 2020-01-05 9 F
;
data want;
set sample;
by id;
set sample(firstobs=2 keep=date rename=(date=next_date)) sample(obs=1 drop=_all_);
output;
if not last.id then do date=date+1 to next_date-1; output; end;
run;
Results:
numeric_ character_
Obs id date feature feature next_date
1 1 2020-01-01 5 A 2020-01-02
2 1 2020-01-02 3 Z 2020-01-04
3 1 2020-01-03 3 Z 2020-01-04
4 1 2020-01-04 2 D 2020-01-05
5 1 2020-01-05 7 B 2020-01-01
6 2 2020-01-01 4 V 2020-01-03
7 2 2020-01-02 4 V 2020-01-03
8 2 2020-01-03 1 B 2020-01-05
9 2 2020-01-04 1 B 2020-01-05
10 2 2020-01-05 9 F .
I am trying to figure out how many customers get their product from a certain store. The problem is that each prod_id can have up to 12 weeks of data for each customer. I have tried a multitude of approaches: some add up all of the observations for each customer, while others, like the one below, remove all but the last observation.
proc sort data= have; BY Prod_ID cust; run;
Data want;
Set have;
by Prod_Id cust;
if (last.Prod_Id and last.cust);
count= +1;
run;
data have
prod_id cust week store
1 A 7/29 ABC
1 A 8/5 ABC
1 A 8/12 ABC
1 A 8/19 ABC
1 B 7/29 ABC
1 B 8/5 ABC
1 B 8/12 ABC
1 B 8/19 ABC
1 B 8/26 ABC
1 C 7/29 XYZ
1 C 8/5 XYZ
1 F 7/29 XYZ
1 F 8/5 XYZ
2 A 7/29 ABC
2 A 8/5 ABC
2 A 8/12 ABC
2 A 8/19 ABC
2 C 7/29 EFG
2 C 8/5 EFG
2 C 8/12 EFG
2 C 8/19 EFG
2 C 8/26 EFG
What I want it to look like:
prod_id store count
1 ABC 2
1 XYZ 2
2 ABC 1
2 EFG 1
First, read up on the subsetting IF statement.
I've edited your code to make it work:
proc sort data=have;
by prod_id store cust;
run;
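data want(drop=cust week);
set have;
by prod_id store cust;
if first.store then count = 0; /* reset before counting; first.store is also set when prod_id changes */
if last.cust then count + 1; /* count each customer once; the sum statement retains count */
if last.store then output; /* one row per prod_id and store */
run;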
If you have any questions, ask.
The only place where the result of the COUNT() aggregate function in SQL might be confusing is that it does not count missing values of the variable.
proc sql;
select prod_id
, store
, count(distinct cust) as count
, count(distinct cust)+max(missing(cust)) as count_plus_missing
from have
group by prod_id ,store
;
quit;
In a summarized dataset, I have the status of an event at each hour after baseline in which it was recorded. I also have the last hour the event could have been recorded. I want to create a new dataset with one record for each hour from the first through the last hour, with the status for each record being the one from the last recorded status.
Here is an example dataset:
data new;
input hour status last_hour;
cards;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
run;
In this case, the first recorded hour was the second, and the last recorded hour was the 10th. The last possible hour to record data was the 12th.
The final dataset should look like so:
0 . 12
1 . 12
2 1 12
3 1 12
4 1 12
5 1 12
6 1 12
7 0 12
8 0 12
9 1 12
10 0 12
11 0 12
12 0 12
I sort of have it working with this series of data steps, but I'm not sure if there's a cleaner way I'm not seeing.
data step1;
set new (keep=id hour last_hour);
by id;
do hour = 0 to last_hour;
output;
end;
run;
proc sort data=step1;
by id hour;
run;
proc sql;
create table step2 as
select distinct a.id, a.hour, b.status
from step1 as a
left join new as b
on a.id = b.id
and a.hour = b.hour
order by a.id, a.hour;
quit;
data step3;
set step2;
by id hour;
retain previous_status;
if first.id then do;
previous_status = .;
if status > . then previous_status = status;
end;
if not first.id then do;
if status = . and previous_status > . then status = previous_status;
if status > . then previous_status = status;
end;
run;
Seeing your code, it seems your question left out the fact that you also have IDs. So this is a newer solution that deals with different IDs. See further below for my first solution, which ignores IDs.
Since last_hour is always 12, I left it out of the have dataset. It will be added later on.
data have;
input id hour status;
cards;
1 2 1
1 4 1
1 5 1
1 6 1
1 7 0
1 9 1
1 10 0
2 2 1
2 4 1
2 5 1
2 6 1
2 7 0
2 9 1
2 10 0
;
Create an hours dataset containing just the numbers 0 through 12:
data hours;
do i = 0 to 12;
hour = i;
output;
end;
drop i;
run;
Create a temporary dataset that will have the right number of rows (13 rows for every id, with valid hour values where they exist in the have table).
proc sql;
create table tmp as
select distinct t1.id, t2.hour, 12 as last_hour
from have as t1
cross join
(select hour from hours) as t2
order by t1.id, t2.hour; /* the merge below needs tmp sorted by id and hour */
quit;
Then use merge and retain to fill in the missing status values where appropriate.
data want;
merge have
tmp;
by id hour;
retain status_previous;
if first.id then status_previous = .; /* never carry a status over from the previous id */
if status ne . then status_previous = status; /* remember the last recorded status */
else status = status_previous; /* fill the gap; stays missing before the first record */
drop status_previous;
run;
Previous solution (no IDs)
If last_hour is always 12, then this should do it:
data have;
input hour status last_hour;
datalines;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
data hours;
do i = 0 to 12;
hour = i;
last_hour = 12;
output;
end;
drop i;
run;
data want;
merge have
hours;
by hour;
retain status_previous;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
drop status_previous;
run;
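If last_hour is not always 12, the gaps can also be filled in a single pass without a helper dataset. This is only a sketch under assumptions: the input is the have dataset from this no-ID solution (sorted by hour, with last_hour on every row), and the names want_onepass, prev_hour, prev_status, rec_hour and rec_status are made up for illustration.

```sas
data want_onepass;
  set have end=eof;
  retain prev_hour -1;            /* hour of the previous record; -1 so filling starts at hour 0 */
  retain prev_status .;           /* status of the previous record; missing before the first one */
  rec_hour = hour;                /* save the values just read, the loops below overwrite them */
  rec_status = status;
  /* fill the hours between the previous record and this one */
  do hour = prev_hour + 1 to rec_hour - 1;
    status = prev_status;
    output;
  end;
  /* write the recorded row itself */
  hour = rec_hour;
  status = rec_status;
  output;
  prev_hour = rec_hour;
  prev_status = rec_status;
  /* after the final record, carry its status through last_hour */
  if eof then do hour = rec_hour + 1 to last_hour;
    output;
  end;
  drop prev_: rec_:;
run;
```

With IDs, the same idea would need BY-group processing: reset prev_hour and prev_status at first.id and run the final loop at last.id instead of at the end of the file.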