I need help retaining values in a SAS dataset and completing the column datetime (to the level of seconds) when not existing.
My dataset looks like:
data HAVE;
input type$ DATE:datetime18. value;
format date datetime18.;
cards;
A 19JUN01:21:06:55 534
A 19JUN01:21:06:58 590
A 19JUN01:21:07:02 600
A 19JUN01:21:07:04 602
B 18JUN01:22:06:58 105
B 18JUN01:22:07:03 110
;
run;
I need to fill the missing datetime and repeat the value when needed.
My result dataset should be:
data WANT;
input type$ DATE:datetime18. value;
format date datetime18.;
cards;
A 19JUN01:21:06:55 534
A 19JUN01:21:06:56 534
A 19JUN01:21:06:57 534
A 19JUN01:21:06:58 590
A 19JUN01:21:06:59 590
A 19JUN01:21:07:00 590
A 19JUN01:21:07:01 590
A 19JUN01:21:07:02 600
A 19JUN01:21:07:03 600
A 19JUN01:21:07:04 602
B 18JUN01:22:06:58 105
B 18JUN01:22:06:59 105
B 18JUN01:22:07:00 105
B 18JUN01:22:07:01 105
B 18JUN01:22:07:02 105
B 18JUN01:22:07:03 110
;
run;
Thanks for your suggestions.
Regards
If you have SAS/ETS, proc expand will do the entire conversion for you with the step method.
proc expand data=have out=want to=second;
by type;
id date;
convert value / method=step;
run;
If you don't, you can do this using a few DATA Steps.
First, create a template of all datetimes that you want for each by-group. We want something that looks like this:
type date_start date_end
A 19JUN01:21:06:55 19JUN01:21:07:04
B 18JUN01:22:06:58 18JUN01:22:07:03
The following code will do this:
data date_start_end;
set have;
by type date;
retain date_start;
if(first.type) then date_start = date;
if(last.type) then do;
date_end = date;
output;
end;
format date_start date_end datetime.;
keep type date_start date_end;
run;
Next, we need to create a template that fills in all the possible seconds between start/end for each type. We want something that looks like this:
type date
A 19JUN01:21:06:55
A 19JUN01:21:06:56
A 19JUN01:21:06:57
...
B 18JUN01:22:07:01
B 18JUN01:22:07:02
B 18JUN01:22:07:03
The following code does this:
data date_template;
set date_start_end;
do date = date_start to date_end;
output;
end;
format date datetime.;
keep type date;
run;
Now we just need to merge this template with our original data and retain the last non-missing value.
data want;
merge have(rename=(value = _value_) )
date_template
;
by type date;
retain value;
if(NOT missing(_value_) ) then value = _value_;
drop _value_;
run;
Note that we rename value to _value_ in the original dataset since retain will not work the way we expect after we merge. We need to create a new variable in order for it to retain properly.
Related
During some data cleaning process, there is a need to compare the data between different rows. For example, if the rows have the same countryID and subjectID then keep the largest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In this case like this, I will use the lag() function as follows.
proc sort table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above may work.
But I still want to know if there are any better ways to solve this kind of questions?
I think you complicate task. You can use proc sql and max function:
proc sql noprint;
create table table_laged as
select CountryID,SubjectID,max(Temperature)
from table
group by CountryID,SubjectID;
quit;
I don't know if you want it that way but you code would keep the highest temperatures
So when you have 2 1 3 for one subject if will keep 3. But when you have 1 4 3 4 4 it will keep 4 4 4. Better is to keep simple the first row for each subject which is the highest because of descending order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use double DOW technique to:
Compute a measure over a group,
Apply the measure to items in the group.
The benefit of DOW looping is a single pass over the data set when incoming data is already grouped.
In this question, 1. is to identify the row in the group with the first highest temperature, and 2. is to select the row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is Proc MEANS
data have;input
CountryID SubjectID Temperature; datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is disordered in CountryID and SubjectID going into the data step, a hash object can be used or SQL per #Aurieli.
I am trying to match max daily data within a month to a monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following; (I want to match daily max data to one month lag monthly data. )
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want to have maximum ret value within a month so I have created yrmon to assign same yyyymm data for the same month daily data
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
In order to choose the maximum value(here, ret) within same yrmon group for the same permno, I used code below
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, it only got me permno yrmon ret data, leaving the original date data away.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it won't work since yrmon is no longer date format.
Thank you in advance.
Hello
I am trying to match two different sets by permno(same company) but with one month lag (eg. daily9 dataset yrmon=198601 and monthly2 dataset yrmon=198602)
it is pretty difficult to handle for me because if I just add +1 in yrmon, 198612 +1 will not be 198701 and I am confused with handling these issues.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
I am trying to go through a list of dates and keep only the date range for dates that 5 or more occurrences and delete all others. The example I have is:
data test;
input dt dt2;
format dt dt2 date9.;
datalines;
20000 20001
20000 20002
20000 20003
21000 21001
21000 21002
21000 21003
21000 21004
21000 21005
;
run;
proc sort data = test;
by dt dt2;
run;
data check;
set test;
by dt dt2;
format dt dt2 date9.;
if last.dt = first.dt then
if abs(last.dt2 - first.dt) < 5 then delete;
run;
What I want returned is just one entry, if possible, but I would be happy with the entire appropriate range as well.
The one entry would be a table that has:
start_dt end_dt
21000 21005
The appropriate range is:
21000 21001
21000 21002
21000 21003
21000 21004
21000 21005
My code doesn't work as desired, and I am not sure what changes I need to make.
last.dt2 and first.dt are flags and can have value in (0,1), so condition abs(last.dt2 - first.dt) < 5 is always true.
Use counter variable to count records in group instead:
data check(drop= count);
length count 8;
count=0;
do until(last.dt);
set test;
by dt dt2;
format dt dt2 date9.;
count = count+1;
if last.dt and count>=5 then output;
end;
run;
I'm not sure why you are looking to use the last.dt2 and the first.dt within your delete function so I have turned it around to create your desired output:
data check2;
set test;
by dt ;
format dt dt2 date9.;
if last.dt then do;
if abs(dt2 - dt) >= 5 then output;
end;
run;
Of course, this will only work if your file is sorted on dt and dt2.
Hope this helps.
I have a data set in SAS containing individuals as rows and a variable for each period as columns. It looks something like this:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
What I want is for SAS to count how many there is of each number for each time period. So I want to get something like it:
data want;
input count t1 t2 t3;
cards;
111 1 3 1
112 3 1 0
123 0 0 3
;
run;
I could do this with proc freq, but outputting this doesn't work very well, when I have a lot of columns.
Thanks
In general having data in the meta data is a bad idea, as here where PERIOD is coded into the Tn variables and you really want that to be a group. Having said that you can still have your cake and eat it too.
PROC SUMMARY can get the counts for each Tn quickly and then you will have smaller data set to fiddle with. Here is one approach that should work well for many time periods.
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;;;;
run;
proc print;
run;
proc summary data=have chartype;
class t:;
ways 1;
output out=want;
run;
proc print;
run;
data want;
set want;
p = findc(_type_,'1');
c = coalesce(of t1-t3);
run;
proc print;
run;
proc summary data=want nway completetypes;
class c p;
freq _freq_;
output out=final;
run;
proc print;
run;
proc transpose data=final out=morefinal(drop=_name_) prefix=t;
by c;
id p;
var _freq_;
run;
proc print;
run;
First restructure the data so that it is in more of a vertical fashion. This will be easier to work with. We also want to create a flag that we will use as a counter later on.
data have2;
set have;
array arr[*] t1-t3;
flag = 1;
do period=lbound(arr) to hbound(arr);
val = arr[period];
output;
end;
keep period val flag;
run;
Summarize the data so we have the number of times that value occurred in each of the periods.
proc sql noprint;
create table smry as
select val,
period,
sum(flag) as count
from have3
group by 1,2
order by 1,2
;
quit;
Transpose the data so we have one line per value and then the counts for each period after that:
proc transpose data=smry out=want(drop=_name_);
by val;
id period;
var count;
run;
Note that when you define the array in the first step you could use this notation which would allow for a dynamic number of periods:
array arr[*] t:;
This assumes every variable beginning with 't' in the dataset should go into the array.
If your computer memory is large enough to hold the entire output, then Hash could be a viable solution:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
data _null_;
if _n_=1 then
do;
/*This is to construct a Hash, where count is tracked and t1-t3 is maintained*/
declare hash h(ordered:'a');
h.definekey('count');
h.definedata('count', 't1','t2','t3');
h.definedone();
call missing(count, t1,t2,t3);
end;
set have(rename=(t1-t3=_t1-_t3))
/*rename to avoid conflict between input data and Hash object*/
end=last;
array _t(*) _t:;
array t(*) t:;
/*The key is to set up two arrays, one is for input data,
another is for Hash feed, and maneuver their index variable accordingly*/
do i=1 to dim(_t);
count=_t(i);
rc=h.find(); /*search the Hash and bring back data elements if found*/
/*If there is a match, then corresponding 't' will increase by '1'*/
if rc=0 then
t(i)+1;
else
do;
/*If there is no match, then corresponding 't' will be initialized as '1',
and all of the other 't' reset to '0'*/
do j=1 to dim(t);
t(j)=0;
end;
t(i)=1;
end;
rc=h.replace(); /*Update the Hash*/
end;
if last then
rc=h.output(dataset:'want');
run;
Try this:
%macro freq(dsn);
proc sql;
select name into:name separated by ' ' from dictionary.columns where libname='WORK' and memname='HAVE' and name like 't%';
quit;
%let ncol=%sysfunc(countw(&name,%str( )));
%do i=1 %to &ncol;
%let col=%scan(&name,&i);
proc freq data=have;
table &col/out=col_&i(keep=&col count rename=(&col=count count=&col));
run;
%end;
data temp;
merge
%do i=1 %to &ncol;
col_&i
%end;
;
by count;
run;
data want;
set temp;
array vars t:;
do over vars;
if missing(vars) then vars=0;
end;
run;
%mend;
%freq(have)
I have to calculate the correlation and covariance for my daily sales values for an event window. The event window is of 45 day period and my data looks like -
store_id date sales
5927 12-Jan-07 3,714.00
5927 12-Jan-07 3,259.00
5927 14-Jan-07 3,787.00
5927 14-Jan-07 3,480.00
5927 17-Jan-07 3,646.00
5927 17-Jan-07 3,316.00
4978 18-Jan-07 3,530.00
4978 18-Jan-07 3,103.00
4978 18-Jan-07 3,026.00
4978 21-Jan-07 3,448.00
Now, for every store_id, date combination, I need to go back 45 days (there is more data for each combination in my original data set) calculate the correlation between sales and lag(sales) i.e. autocorrelation of degree one. As you can see, the date column is not continuous. So something like (date - 45) is not going to work.
I have gotten till this part -
data ds1;
set ds;
by store_id;
LAG_SALE = lag(sales);
IF FIRST.store_idTHEN DO;
LAG_SALE = .;
END;
run;
For calculating correlation and covariances -
proc corr data=ds1 outp=Corr
by store_id date;
cov; /** include covariances **/
var sales lag_sale;
run;
But how do I insert the event window for each date, store_id combination? My final output should look something like this -
id date corr cov
5927 12-Jan-07 ... ...
5927 14-Jan-07 ... ...
Here is what I've come up with:
First I convert the date to a SAS date, which is the number of days since Jan. 1 1960:
data ds;
set ds (rename=(date=old_date));
date = input(old_date, date11.);
drop old_date;
run;
Then compute lag_sale (I am using the same calculation you used in the question, but make sure this is what you want to do. For some observations the lag sale is the previous recorded date, but for some it is the same store_id and date, just a different observation.):
proc sort data=ds; by store_id; run;
data ds;
set ds;
by store_id;
lag_sale = lag(sales);
if first.store_id then lag_sale = .;
run;
Then set up the final data set:
data final;
length store_id 8 date 8 cov 8 corr 8;
if _n_ = 0;
run;
Then create a macro which takes a store_id and date and runs proc corr. The first part of the macro selects only the data with that store_id and within the past 45 days of the date. Then it runs proc corr. Then it formats proc corr how you want it and appends the results to the "final" data set.
%macro corr(store_id, date);
data ds2;
set ds;
where store_id = &store_id and %eval(&date-45) <= date <=&date
and lag_sale ne .;
run;
proc corr noprint data=ds2 cov outp=corr;
by store_id;
var sales lag_sale;
run;
data corr2;
set corr;
where _type_ in ('CORR', 'COV') and _name_ = 'sales';
retain cov;
date = &date;
if _type_ = 'COV' then cov = lag_sale;
else do;
corr = lag_sale;
output;
end;
keep store_id date corr cov;
run;
proc append base=final data=corr2 force; run;
%mend corr;
Finally run the macro for each store_id/date combination.
proc sort data=ds out=ds3 nodupkey;
by store_id date;
run;
data _null_;
set ds3;
call execute('%corr('||store_id||','||date||');');
run;
proc sort data=final;
by store_id date;
run;