Case 1
Suppose the data are sorted by year and then by month (the data always contain 3 observations).
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
I need to copy the Index of the last month to a new observation for the following month:
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
2015 2 1.2
Case 2
Year is removed from data. So we only have Month and Index.
Month Index
1 1.2
11 1.1
12 1.5
The data always cover 3 consecutive months, so here month 1 is the last month.
Still, ideal output is
Month Index
1 1.2
2 1.2
11 1.1
12 1.5
I solved it by creating another dataset that contains only Month (1, 2, ..., 12) and then right joining the original dataset to it twice. But I think there is a more elegant way to deal with this.
Case 1 can be a straightforward data step. Add end=eof to the set statement to create a variable eof that has the value 1 when the data step is reading the last row of the data set. The output statement writes a row during each iteration. If eof=1, a do block runs that advances the month by 1 (wrapping from 12 to 1 and incrementing the year) and outputs another row.
data want;
  set have end=eof;
  output;
  if eof then do;
    /* mod(month,12)+1 maps 12 to 1; mod(month+1,12) would map 11 to 0 */
    month = mod(month, 12) + 1;
    /* a wrap to January means the new row belongs to the next year */
    if month = 1 then year = year + 1;
    output;
  end;
run;
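When rolling a month number forward, the wrap at December is easy to get wrong. The following small check is not part of the original answer; it only compares two candidate formulas for "next month" on the months around the year end (the variable names next_a and next_b are made up for illustration).

```sas
data _null_;
  do month = 10 to 12;
    next_a = mod(month + 1, 12);  /* yields 0 for month 11 */
    next_b = mod(month, 12) + 1;  /* yields 12 for month 11 and 1 for month 12 */
    put month= next_a= next_b=;
  end;
run;
```

Only the second form stays inside the range 1 to 12 for every input month.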
For case 2, I would switch to an SQL solution. Join the table to itself on month, with the month incremented by 1 in the second copy. Use the coalesce function to keep the values from the first copy where they exist, and fall back to the values from the second copy where they do not. Since a case crossing December-January produces 5 rows, limit the output to four rows using the outobs= option in proc sql to exclude the unwanted second January.
proc sql outobs=4;
  create table want as
  select
    /* mod(t2.month,12)+1 keeps the new month in 1..12 (December is not turned into 0) */
    coalesce(t1.month, mod(t2.month, 12) + 1) as month,
    coalesce(t1.index, t2.index) as index
  from
    have t1
    full outer join have t2
    on t1.month = t2.month + 1
  order by
    coalesce(t1.month, t2.month + 1)
  ;
quit;
Related
I am working with crime data. Now, I have the following table crimes. Each row contains a specific crime (e.g. assault): the date it was committed (date) and a person-ID of the offender (person).
date person
------------------------------
02JAN2017 1
03FEB2017 1
04JAN2018 1 --> not to be counted (more than a year after 02JAN2017)
27NOV2017 2
28NOV2018 2 --> should not be counted (more than a year after 27NOV2017)
01MAY2017 3
24FEB2018 3
10OCT2017 4
I am interested in whether each person has committed (relapse=1) or not committed (relapse=0) another crime within 1 year after the first crime committed by the same person. Another condition is that the first crime has to be committed within a specific year (here 2017).
The result should therefore look like this:
date person relapse
------------------------------
02JAN2017 1 1
03FEB2017 1 1
04JAN2018 1 1
27NOV2017 2 0
28NOV2018 2 0
01MAY2017 3 1
24FEB2018 3 1
10OCT2017 4 0
Can anyone please give me a hint on how to do this in SAS?
Obviously, the real data are much larger, so I cannot do it manually.
One approach is to use DATA step by group processing.
The BY <var> statement sets up binary variables first.<var> and last.<var> that flag the first row in a group and the last row in a group.
You appear to be assigning the computed relapse flag over the entire group, and that kind of computation can be done with what SAS coders call a DOW loop: a loop with the SET statement inside the loop, followed by a second loop that assigns the computed value to each row in the group.
The INTCK function can compute the number of years between two dates.
For example:
data want(keep=person date relapse);
  * DOW loop computes assertion that relapse occurred;
  relapse = 0;
  do _n_ = 1 by 1 until (last.person);
    set crimes; * <-------------- CRIMES;
    by person date;
    * check if persons first crime was in 2017;
    if _n_ = 1 and year(date) = 2017 then _first = date;
    * check if persons second crime was within 1 year of first;
    if _n_ = 2 and _first then relapse = intck('year', _first, date, 'C') < 1;
  end;
  * at this point the relapse flag has been computed, and its value
  * will be repeated for each row output;
  * serial loop over same number of rows in the group, but
  * read in through a second SET statement;
  do _n_ = 1 to _n_;
    set crimes; * <-------------- CRIMES;
    output;
  end;
run;
The process would be more complex, with more bookkeeping variables, if the actual process is to classify different time frames of a person as either relapsed or reformed based on rules more nuanced than "1st in 2017 and next within 1 year".
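The 'C' (continuous) argument to INTCK matters for this kind of rule. As a small illustration using two date pairs from the question, the default method counts the 1 January boundaries crossed, while 'C' counts complete years elapsed since the start date. The variable names below are chosen only for this sketch.

```sas
data _null_;
  /* person 3: 01MAY2017 to 24FEB2018, less than a full year apart */
  discrete_p3   = intck('year', '01MAY2017'd, '24FEB2018'd);       /* 1 */
  continuous_p3 = intck('year', '01MAY2017'd, '24FEB2018'd, 'C');  /* 0 */
  /* person 2: 27NOV2017 to 28NOV2018, one year and one day apart */
  discrete_p2   = intck('year', '27NOV2017'd, '28NOV2018'd);       /* 1 */
  continuous_p2 = intck('year', '27NOV2017'd, '28NOV2018'd, 'C');  /* 1 */
  put discrete_p3= continuous_p3= discrete_p2= continuous_p2=;
run;
```

With the default method both pairs look identical, so only the continuous count separates "within 1 year" from "more than a year later".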
I started using SAS relatively recently, so I'm not by any means attempting to create perfect code here.
I'd sort the data by id/person and date first (date should be numeric), and then use a retain statement to check against the date of the first crime. It's not perfect, but if your data is good (no missing dates), it'll work, and it is easy to follow imho.
This only works if the person's first recorded crime is supposed to happen in 2017. If you have crimes happening in 2016, and want to check whether 'a crime' is committed in 2017 and then check the relapse, then this code is not going to work - but I think that is covered in the comments beneath your question.
data test;
input tmp_year $ 1-9 person;
datalines;
02JAN2017 1
03FEB2017 1
04JAN2018 1
27NOV2017 2
28NOV2018 2
01MAY2017 3
24FEB2018 3
10OCT2017 4
;
run;
data test2;
set test;
crime_date = input(tmp_year, date9.);
act_year = year(crime_date);
run;
proc sort data=test2;
by person crime_date ;
run;
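data want;
  set test2;
  by person crime_date;
  retain date_of_crime;
  if first.person and act_year = 2017 then date_of_crime = crime_date;
  else if first.person then call missing(date_of_crime);
  /* 'C' counts complete years elapsed; the default method counts 1 JAN boundaries crossed.
     The missing() check is needed because a missing value compares as less than any number. */
  if not first.person and not missing(date_of_crime)
     and intck('YEAR', date_of_crime, crime_date, 'C') < 1
    then relapse = 1;
  else relapse = 0;
run;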
The above code flags each crime committed within one year after a person's first crime in 2017. You can then retrieve the unique persons with a proc sql statement, and join them with whatever dataset you have.
I have a dataset with several subjects with hour timepoints (0.02,24.02,48.02 etc) for each record per subject.
Each record has 4 dates with a single record assigned to each timepoint (0.02= 28AUG2019, 24.02= 29AUG2019 etc).
The date should be the same for each hour timepoint.
What SAS function could I use to validate that the dates assigned to each hour timepoint are the same for each subject?
Would the IFC/IFN function work in this scenario?
Sample data for one subject
For the case of a data set with variables time1-time4 and date1-date4 I see at least two interpretations of the question.
One
Are the date values all the same across the record?
The date values can be arrayed and examined in a loop for a change from one index to the previous.
data have;
format time1-time4 6.2 date1-date4 date9.;
informat date1-date4 date9.;
input time1-time4 date1-date4;
datalines;
0.02 24.02 48.02 72.02 28AUG2019 29AUG2019 28AUG2019 28AUG2019
1.02 34.02 58.02 82.02 01SEP2019 01SEP2019 01SEP2019 01SEP2019
run;
data want;
set have;
array dates date1-date4;
do index=2 to dim(dates);
if dates(index) ne dates(index-1) then at_least_one_date_different_flag = 1;
end;
drop index;
run;
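As an alternative sketch for the same check, not taken from the answer above: the RANGE function returns the difference between the largest and smallest nonmissing arguments, so a nonzero range across date1-date4 means at least one date differs. The data set name check and the flag name any_date_differs are assumptions made for this example.

```sas
data check;
  input (date1-date4) (:date9.);
  format date1-date4 date9.;
  /* 1 when the dates on the record are not all equal, 0 otherwise */
  any_date_differs = (range(of date1-date4) > 0);
  datalines;
28AUG2019 29AUG2019 28AUG2019 28AUG2019
01SEP2019 01SEP2019 01SEP2019 01SEP2019
;
run;
```

Unlike the loop version, this always writes 0 or 1 rather than leaving the flag missing when the dates agree. Missing dates are ignored by RANGE, so they would need a separate check.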
Two
The date should be the same for each hour timepoint.
This could be construed to mean that, looking down the data set, each distinct time value found in any of the time1-time4 variables must always be paired with the same date value in the corresponding date1-date4 variable.
row 1: 1 2 3 4 A B C D
row 2: 1 2 3 5 A B C D
row 3: 1 3 5 2 A X D B
Looking down you have 3:C, 3:C, 3:X which would be different (3 has C's and X's)
If this is the case, please update the question with more sample data of what you have and want.
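If the second interpretation is the intended one, a rough sketch could reshape the data to one row per time/date pair and then list any time value paired with more than one date. This assumes the have data set from the first interpretation (time1-time4, date1-date4); the names long and n_dates are made up here, and a subject variable would need to be added to the keep= list and the GROUP BY if the check is per subject.

```sas
/* one row per (time, date) pair */
data long(keep=time date);
  set have;
  array times time1-time4;
  array dates date1-date4;
  do i = 1 to dim(times);
    time = times(i);
    date = dates(i);
    output;
  end;
  format date date9.;
run;

/* time values that appear with more than one distinct date */
proc sql;
  select time, count(distinct date) as n_dates
  from long
  group by time
  having count(distinct date) > 1;
quit;
```

An empty result would mean every time value maps to a single date.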
The date in the table is not stored as a single value:
days are in the day column, months in the month column and years in the year column.
I concatenated the columns, used the concatenation in the where clause and compared it with the parameter I created, but I got no result.
I assume you are querying a date dimension table, and you want to extract the record that matches a certain date.
Solution:
I created a dates table to match with,
data dates;
input key day month year ;
datalines;
1 19 2 2018
2 20 2 2018
3 21 2 2018
4 22 2 2018
;;;
run;
In the where clause I split the date '20feb2018'd into its parts with the day, month and year functions. In SAS a date literal is written as a quoted date followed by d, for example '20feb2018'd.
proc sql;
select * from dates
/*if you want to match todays' date: replace '20feb2018'd with today()*/
where day('20feb2018'd)=day and month('20feb2018'd)=month and year('20feb2018'd)=year;
quit;
If you want to compare a date built from the day, month and year columns, use the mdy function in the where clause as shown below. It is not totally clear what you are looking for.
proc sql;
  select * from dates
  where mdy(month, day, year) between '19feb2018'd and '21feb2018'd;
quit;
For the data set below (the actual one is several thousand rows long) I would like SAS to aggregate the income by machine daily (there are many income lines per machine every day), weekly and monthly (the week starts on Monday, the month starts on the 1st in any given year). Is there straightforward code for this? Any help is appreciated.
MachineNo Date income
1 01Jan2012 1500
1 02Jan2012 2000
1 27Aug2012 300
2 02Jan2012 1200
2 15Jun2012 50
3 03Mar2012 1000
4 08Apr2012 500
proc expand and proc timeseries are excellent tools for accumulation and aggregation to different frequencies of series. You can combine both with by-group processing to convert to any time period that you need.
Step 1: Sort by MachineNo and Date
proc sort data=have;
by MachineNo Date;
run;
Step 2: Find the min/max end dates of your series for date alignment
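The format=date9. modifier is important: the macro variables are used below inside date literals ("&min_date"d), so they must hold the dates in a form such as 01JAN2012. For whatever reason, some SAS/ETS and HPF procedures require date literals for certain arguments.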
proc sql noprint;
select min(date) format=date9.,
max(date) format=date9.
into :min_date,
:max_date
from have;
quit;
Step 3: Align each MachineNo by start/end date, and accumulate days per MachineNo
The below code will get you aligned daily accumulation, remove duplicate days per machine, and set Income on any missing days to 0. This step will also guarantee that your series has equal time intervals per by-group, allowing you to run hierarchical time-series analyses without violating the equal-spaced interval assumption.
proc timeseries data=have
out=want_day;
by MachineNo;
id date interval=day
align=both
start="&min_date"d
end="&max_date"d;
var income / accumulate=total setmiss=0;
run;
Step 4: Aggregate aligned Daily to Weekly shifted by 1 day, Monthly
SAS time intervals can be both multiplied and shifted. Since the standard week starts on a Sunday, we want to shift by 1 day to have it start on a Monday.
Standard Week
2 3 4 5 6 7 1
Mon Tue Wed Thu Fri Sat Sun
Shifted
1 2 3 4 5 6 7
Mon Tue Wed Thu Fri Sat Sun
Intervals follow the format:
TimeInterval<Multiplier>.<Shift>
The standard shift interval is 1. For all intents and purposes, consider 1 as 0: 1 means it's unshifted. 2 means it's shifted by 1 period. Thus, for a week to start on a Monday, we want to use the interval Week.2.
proc expand data=want_day
out=want_week
from=day
to=week.2;
id date;
convert income / method=aggregate observed=total;
run;
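To see what the shift does on a single date, here is a small check using INTNX with the 'b' (beginning) alignment. It is only an illustration; the variable names are made up. 04JAN2012 is a Wednesday.

```sas
data _null_;
  d = '04JAN2012'd;
  sunday_start = intnx('week',   d, 0, 'b');  /* start of the default Sunday-based week */
  monday_start = intnx('week.2', d, 0, 'b');  /* start of the week shifted to Monday */
  put sunday_start= date9. monday_start= date9.;
run;
```

The log shows sunday_start=01JAN2012 and monday_start=02JAN2012, which is the Monday alignment that to=week.2 relies on.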
Step 5: Convert Week to Month
proc expand data=want_week
out=want_month
from=week.2
to=month;
id date;
convert income / method=aggregate observed=total;
run;
In case you don't have a license for SAS/ETS here's another way.
For the monthly data you can format the date in a proc means output.
I think WeekW. starts on Monday but it may not be in a format you want, so you'll need to create a new variable for week first if you wanted to use this method.
proc means data=have nway noprint;
class machineno date;
format date monyy7.;
var income;
output out=want sum(income)=income;
run;
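For the weekly totals without SAS/ETS, one possible sketch is to compute the Monday that starts each date's week and use it as a class variable. The sample rows are copied from the question; the names week_start, have_week and want_week are assumptions for this example.

```sas
data have;
  input MachineNo Date :date9. income;
  format Date date9.;
  datalines;
1 01Jan2012 1500
1 02Jan2012 2000
1 27Aug2012 300
2 02Jan2012 1200
2 15Jun2012 50
3 03Mar2012 1000
4 08Apr2012 500
;
run;

data have_week;
  set have;
  /* Monday on or before Date: week.2 is the week interval shifted to start on Monday */
  week_start = intnx('week.2', Date, 0, 'b');
  format week_start date9.;
run;

proc means data=have_week nway noprint;
  class MachineNo week_start;
  var income;
  output out=want_week(drop=_type_ _freq_) sum(income)=income;
run;
```

Unlike the proc timeseries approach, this produces rows only for weeks that actually contain data, so weeks with no income are absent rather than 0.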
To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error is that you used where instead of if. In SAS, the subsetting where in a data step is only aware of columns that already exist in the data set it is subsetting. SAS keeps both options because where is faster when it is usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest the solution below (I would like @Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
data _null_;
  set transaction;
  by change_week;

  do until(last.week and _found);
    set master;
    by week;

    if week=change_week then do;
      sum = sum(value, sum);
      _found=1;
    end;
  end;

  *file print;
  put week= sum= ;
run;
week=1 sum=60
week=3 sum=75
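The hash alternative mentioned above could look roughly like the following. This is a sketch under assumptions, not a tested production solution: it uses the transaction and master data sets from the question, assumes master is sorted by week (needed for the BY statement), and the hash name t and output data set want are made up for the example.

```sas
data want(keep=week sum);
  /* load the transaction keys once, renaming change_week to match master */
  if _n_ = 1 then do;
    declare hash t(dataset: 'transaction(rename=(change_week=week))');
    t.defineKey('week');
    t.defineDone();
  end;

  /* one pass over master, one BY group per data step iteration */
  do until (last.week);
    set master;
    by week;
    /* check() returns 0 when the current week is a transaction key */
    if t.check() = 0 then sum = sum(sum, value);
  end;

  /* sum is not retained, so it starts missing for every group */
  if t.check() = 0 then output;
run;
```

Each data set is read once, and a transaction week with no match in master is simply ignored instead of stalling the loop.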