I'm trying to use SAS to compute a moving average over x periods that uses forecasted values in the calculation. For example, suppose I have a data set with ten observations for a variable and I want a 3-month moving average. The first forecast value should be the average of the last 3 observations, and the second forecast value should be the average of the last two observations and the first forecast value.
If you have, for example, data like this:
data input;
infile datalines;
length product $10 period value 8;
informat period yymmdd10.;
format period yymmdd10.;
input product $ period value;
datalines;
car 2016-01-01 10
car 2015-12-01 20
car 2015-11-01 30
car 2015-10-01 40
car 2015-09-01 30
car 2015-08-01 15
;
run;
You can left join the input table to itself with a condition:
input t1 left join input t2
on t1.product = t2.product
and t2.period between intnx('month',t1.period,-2,'b') and t1.period
group by t1.product, t1.period, t1.value
With this you have t1.value as the current value and avg(t2.value) as the 3-month average. To compute the 2-month average, use the ifn() function to set every value older than the previous period to missing:
avg(ifn( t2.period >= intnx('month',t1.period,-1,'b'),t2.value,. ))
The full code could look like this:
proc sql;
create table want as
select t1.product, t1.period, t1.value as currentValue,
ifn(count(t2.period)>1,avg(ifn( t2.period >= intnx('month',t1.period,-1,'b'),t2.value,. )),.) as twoMonthsAVG,
ifn(count(t2.period)>2,avg(t2.value),.) as threeMonthsAVG
from input t1 left join input t2
on t1.product = t2.product
and t2.period between intnx('month',t1.period,-2,'b') and t1.period
group by t1.product, t1.period, t1.value
;
quit;
I've also added a count(t2.period) condition to return missing values when there aren't enough records to compute the measure. My result set looks like this:
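Note that the join above averages observed values only. For the recursive case in the question, where each forecast feeds the next one, a DATA step can carry the last three values forward. The following is a minimal sketch, not part of the original answer: the macro variable horizon, the flag is_forecast, and the helper variables v1-v3 are hypothetical names, and it assumes monthly data with no gaps.

```sas
%let horizon = 3; /* hypothetical: number of periods to forecast */

proc sort data=input out=sorted;
  by product period;
run;

data forecast;
  set sorted;
  by product;
  retain v1 v2 v3;                 /* three most recent values, v3 is the latest */
  if first.product then call missing(v1, v2, v3);
  v1 = v2; v2 = v3; v3 = value;    /* shift the window by one observation */
  is_forecast = 0;
  output;                          /* keep the observed row */
  if last.product then do;
    is_forecast = 1;
    do h = 1 to &horizon;
      period = intnx('month', period, 1, 'b');
      value = mean(v1, v2, v3);    /* mean() ignores missing values */
      v1 = v2; v2 = v3; v3 = value; /* the forecast enters the window */
      output;
    end;
  end;
  drop v1 v2 v3 h;
run;
```

With the sample data, the first forecast for car is (30 + 20 + 10) / 3 = 20, and each later forecast replaces the oldest value in the window with the previous forecast.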
The following is an example of the data I have
startdate enddate amount
1/1/2010 2/2/2020 10
1/5/2011 2/3/2015 10
1/3/2012 2/2/2023 10
1/4/2013 2/2/2014 10
5/5/2015 2/2/2028 10
1/6/2016 2/2/2032 10
I want to calculate the sum of all existing amounts as of each start date so it should look like this:
startdate amount
1/1/2010 10
1/5/2011 20
1/3/2012 30
1/4/2013 40
5/5/2015 30
1/6/2016 40
How do I do this in SAS?
Essentially what I want to do is for each of the start dates, calculate the cumulative sum of any amounts that haven't expired. So for the first four dates, it is just a running cumulative sum because none of the amounts have expired. But at 5/5/2015, two of the previous amounts have expired hence a cumulative sum of 30. Same for the last date, where the same two have previously expired and you have the additional amount as of 1/6/2016 therefore 40.
One way to accomplish this is with a self-join via Proc SQL:
proc sql;
create table out_dset as
select a.startdate, sum(b.amount) as amount
from in_dset as a left join in_dset as b
on a.startdate >= b.startdate and a.startdate < b.enddate
group by a.startdate
order by a.startdate;
quit;
For each observation in the original dataset, this code will find observations in the same dataset that meet the date range criteria and will sum up the amount column.
You can change the second comparison operator from < to <= if you want to include situations when a previous amount expired on the same date as a given startdate.
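To try the join on the sample data, the table can be loaded as below. This is a sketch that assumes the dates are month/day/year (the question does not say). Instead of summing, the query lists the rows that count as active on 5/5/2015; the column aliases active_start and active_end are hypothetical.

```sas
data in_dset;
  /* assumption: dates are in mmddyy order */
  input startdate :mmddyy10. enddate :mmddyy10. amount;
  format startdate enddate mmddyy10.;
  datalines;
1/1/2010 2/2/2020 10
1/5/2011 2/3/2015 10
1/3/2012 2/2/2023 10
1/4/2013 2/2/2014 10
5/5/2015 2/2/2028 10
1/6/2016 2/2/2032 10
;
run;

proc sql;
  /* show which rows are matched for a single start date */
  select a.startdate,
         b.startdate as active_start,
         b.enddate as active_end,
         b.amount
  from in_dset as a left join in_dset as b
    on a.startdate >= b.startdate and a.startdate < b.enddate
  where a.startdate = '05MAY2015'd;
quit;
```

Three rows match (the amounts starting in 2010, 2012 and 2015), which is where the expected total of 30 comes from.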
I have a dataset that looks like:
Month Cost_Center Account Actual Annual_Budget
June 53410 Postage 13 234
June 53420 Postage 0 432
June 53430 Postage 48 643
June 53440 Postage 0 917
June 53710 Postage 92 662
June 53410 Phone 73 267
June 53420 Phone 103 669
June 53430 Phone 90 763
...
I would like to first sum the Actual and Annual_Budget columns, respectively, and then create a variable that flags whether the Actual extrapolated to the entire year is greater than the Annual_Budget column.
I have the following code:
Data Test;
set Combined;
%All_CC; /*MACRO TO INCLUDE ALL COST CENTERS*/
%Total_Other_Expenses;/*MACRO TO INCLUDE SPECIFIC Account Descriptions*/
Sum_Actual = sum(Actual);
Sum_Annual = sum(Annual_Budget);
Run_Rate = Sum_Actual*12;
if Run_Rate > Sum_Annual then Over_Budget_Alarm = 1;
run;
However, when I run this code, it does not sum by group, for example, this is the output I get:
Account_Description Sum_Actual Sum_Annual Run_Rate Over_Budget_Alarm
Postage 13 234 146
Postage 0 432 0
Postage 48 643 963 1
Postage 0 917 0
Postage 92 662 634 1
I'm looking for output where all the 'postage' are summed for Actual and Annual, leaving just one row of data.
Use PROC MEANS to summarize the data
Use a data step and IF/THEN statement to create your flags.
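proc means data=have N SUM NWAY STACKODS;
class account;
var actual annual_budget;
ods output summary = summary_stats1; /* stacked table: one row per analysis variable */
output out = summary_stats2 N = SUM= / AUTONAME; /* creates actual_N, actual_Sum, annual_budget_N, annual_budget_Sum */
run;
data want;
set summary_stats2;
/* use actual_Sum * 12 here to compare the annualized run rate instead */
if actual_Sum > annual_budget_Sum then flag=1;
else flag=0;
run;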
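SAS DATA step behavior is quite complex (see "About DATA Step Execution" in SAS Language Reference: Concepts). The default behavior, which you're seeing, is this: at the end of each iteration (i.e. for each input row) the row is written to the output data set, and variables created in the step that are not retained are reset to missing at the start of the next iteration. Also, sum(Actual) with a single argument only returns that row's value; it does not add up rows.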
You can't expect to write Base SAS "intuitively" without spending a few days learning it first, so I recommend using PROC SQL, unless you have a reason not to.
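For reference, a PROC SQL version of the grouped sum might look like the sketch below. It assumes the input is the Combined data set from the question with a single month of data, so that multiplying by 12 annualizes it, and it reuses the question's variable names; the output name want_sql is hypothetical.

```sas
proc sql;
  create table want_sql as
  select Account,
         sum(Actual) as Sum_Actual,
         sum(Annual_Budget) as Sum_Annual,
         sum(Actual) * 12 as Run_Rate,          /* one month extrapolated to a year */
         case when sum(Actual) * 12 > sum(Annual_Budget) then 1 else 0 end
           as Over_Budget_Alarm
  from Combined
  group by Account;                             /* one output row per account */
quit;
```

Because of the GROUP BY clause, the aggregate functions are evaluated once per Account, which is the grouping the DATA step in the question does not perform.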
If you really want to aggregate in a DATA step, you have to use BY-group processing: after ensuring the input data set is sorted by the BY variables, you can use something like the following:
data Test (keep = Month Account Sum_Actual Sum_Annual /*...your Run_Rate and Over_Budget_Alarm...*/);
set Combined; /* the input table */
by Month Account; /* must be sorted by these */
retain Sum_Actual Sum_Annual; /* don't clobber for each input row */
if first.account then do; /* instead do it manually for each group */
Sum_Actual = 0;
Sum_Annual = 0;
end;
/* accumulate the values from each row */
Sum_Actual = sum(Sum_Actual, Actual);
Sum_Annual = sum(Sum_Annual, Annual_Budget);
/* Note that Sum_Actual = Sum_Actual+Actual; will not work if any of the input values is 'missing'. */
if last.account then do;
/* The group has been processed.
Do any additional processing for the group as a whole, e.g.
calculate Over_Budget_Alarm. */
output; /* write one output row per group */
end;
run;
PROC SQL can be very effective for examining aggregated data. Without seeing what the macros do, I would suggest performing the run rate checks after creating the data set test.
You don't show rows for other months, but I must presume the annual_budget values are constant across all months -- if so, I don't see a reason to ever sum annual_budget; comparing anything to sum(annual_budget) is probably at the incorrect time scale and not useful.
From the data shown it's hard to tell whether you want to know either of these:
which (or whether any) months had a run_rate that exceeded the annual_budget
which (or whether any) months had a run_rate that exceeded the balance of annual_budget (i.e. the annual_budget less the prior months' expenditure)
Presume each row in test is for a single year/month/costCenter/account -- if not the underlying data would have to be aggregated to that level.
Proc SQL;
* retrieve presumed constant annual_budget values from data;
* this information might (should) already exist in another table;
* presume constant annual budget value at each cost center | account combination;
* distinct because there are multiple months with the same info;
create table annual_budgets as
select distinct Cost_Center, Account, Annual_Budget
from test;
create table account_budgets as
select account, sum(annual_budget) as annual_budget
from annual_budgets
group by account;
* flag for some run rate condition;
create table annual_budget_mon_runrate_check as
select
2019 as year,
t.account,
sum(t.actual) as yr_actual, /* across all month/cost center */
min(b.annual_budget) as account_budget, /* account_budgets has one row per account */
max (
case when t.actual * 12 > t.annual_budget then 1 else 0 end
) as
excessive_runrate_flag label="At least one month had a cost center run rate that would exceed its annual_budget"
from
test as t
inner join account_budgets as b
on t.account = b.account
group by
t.account;
quit;
You can add a where clause to restrict the accounts processed.
Changing the max to sum in the flag computation would return the number of cost center months with excessive run rates.
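As a sketch of that last variation, the flag expression can be summed instead of maximized. This assumes the same test table as above; the output table name runrate_counts and the column name excessive_runrate_count are hypothetical.

```sas
proc sql;
  create table runrate_counts as
  select account,
         /* each qualifying cost center month contributes 1 to the total */
         sum(case when actual * 12 > annual_budget then 1 else 0 end)
           as excessive_runrate_count
  from test
  group by account;
quit;
```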
I have a SAS dataset with the following 13 fields:
base_price tax_month1-12
base_price is the price of a product before taxes, and each tax_month variable is the tax percentage to be collected for that month.
I want to create a new variable group tax_paid1-12 which is the product of the base_price and each of the tax months.
Is there an efficient way to do this in SAS without multiplying the fields 1 at a time? The number of months could vary in the future, so I do not want to hard code the number of fields in the variable group.
Unfortunately, you need an additional step to compute the number of months.
You can use arrays to calculate tax_paid for each month.
data source;
infile datalines;
input base_price tax_month1 tax_month2 tax_month3;
datalines;
1 2 . 4
5 6 7 8
;
run;
data _null_;
set source;
array tax_month(*) tax_month:;
call symputx('n', dim(tax_month));
stop;
run;
data result;
set source;
array tax_month(&n) tax_month1-tax_month&n;
array tax_paid(&n) tax_paid1-tax_paid&n;
do i = 1 to dim(tax_month);
tax_paid(i) = base_price * tax_month(i);
end;
drop i;
run;
I have 4 columns in my SAS dataset as shown in first image below. I need to compare the dates of consecutive rows by ID. For each ID, if Date2 occurs before the next row's Date1 for the same ID, then keep the Bill amount. If Date2 occurs after the Date1 of the next row, delete the bill amount. So for each ID, only keep the bill where the Date2 is less than the next rows Date1. I have placed what the result set should look like at the bottom.
Result set should look like
You'll want to create a new variable that moves the next row's DATE1 up one row to make the comparison. Assuming your date variables are in a date format, use PROC EXPAND and make the comparison ensuring that you're not comparing the last value which will have a missing LEAD value:
DATA TEST;
INPUT ID: $3. DATE1: MMDDYY10. DATE2: MMDDYY10. BILL: 8.;
FORMAT DATE1 DATE2 MMDDYY10.;
DATALINES;
AA 07/23/2015 07/31/2015 34
AA 07/30/2015 08/10/2015 50
AA 08/12/2015 08/15/2015 18
BB 07/23/2015 07/24/2015 20
BB 07/30/2015 08/08/2015 20
BB 08/06/2015 08/08/2015 20
;
RUN;
PROC EXPAND DATA = TEST OUT=TEST1 METHOD=NONE;
BY ID;
CONVERT DATE1 = DATE1_LEAD / TRANSFORMOUT=(LEAD 1);
RUN;
DATA TEST2; SET TEST1;
IF DATE1_LEAD NE . AND DATE2 GT DATE1_LEAD THEN BILL=.;
RUN;
If you sort your data so that you look at the previous observation to compare your dates, you can use the LAG function in a DATA step.
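A minimal sketch of the LAG approach, assuming the TEST data set created above: sorting DATE1 in descending order within each ID makes the "next" row the previous observation. The names test_desc, test_lag, test2_lag and next_date1 are hypothetical.

```sas
proc sort data=test out=test_desc;
  by id descending date1;
run;

data test_lag;
  set test_desc;
  by id;
  /* call LAG on every row; calling it conditionally gives wrong values */
  next_date1 = lag(date1);
  /* the first row per ID in this order is the last bill, so it has no next row */
  if first.id then next_date1 = .;
  if next_date1 ne . and date2 gt next_date1 then bill = .;
  format next_date1 mmddyy10.;
run;

/* restore the original ascending order */
proc sort data=test_lag out=test2_lag;
  by id date1;
run;
```

This does not require PROC EXPAND, which is part of SAS/ETS, and it should blank the same bills as the PROC EXPAND version on the sample rows.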
I have a credit card transaction dataset (let's call it "Trans") with transaction amount, zip code, and date. I have another dataset (let's call it "Key") that lists sales tax rates based on date and geocode. The Key dataset also includes a range of zip codes associated with each geocode represented by 2 variables: Zip Start and Zip End.
Because Geocodes don't align with zip codes, some of the zip code ranges overlap. If this happens, I want to use the lowest sales tax rate associated with the zip code shown in Trans.
Trans dataset:
TransAmount TransDate TransZip
$200 01/07/1998 90010
$12 02/09/2002 90022
Key dataset:
Geocode Rate StartDate EndDate ZipStart ZipEnd
1001 .0825 199701 200012 90001 90084
1001 .085 200101 200812 90001 90084
1002 .0825 199701 200012 90022 90024
1002 .08 200101 200812 90022 90024
Desired output:
TransAmount TransDate TransZip Rate
$200 01/07/1998 90010 .0825
$12 02/09/2002 90022 .08
I used this basic SQL code in SAS, but I run into the problem of overlapping zip codes.
proc sql;
create table output as
select a.*, b.zipstart, b.zipend, b.startdate, b.enddate, b.rate
from Trans.CA_Zip_Cd_Testing a left join Key.CA_rates b
on a.TranZip ge b.zipstart
and a.TranZip le b.zipend
and a.TransDate ge b.StartDate
and a.transDate le b.EndDate
;
quit;
Well, the easiest way to do this, as far as the query portion goes, is to add a subquery to get the minimum rate.
Select t.transamount, t.transdate,t.transzip
,(Select MIN(rate) from Key where t.transzip between ZipStart and ZipEnd and t.transdate between startdate and enddate) as Rate
from trans t
You could also do it as a subquery and join on it.
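The join variant could be sketched as below. This is one possible reading of that suggestion, not the answerer's code: it assumes tables named trans and key, and that the transaction date and the key's start and end dates are stored in comparable forms. The aliases t2 and r and the output name want_join are hypothetical.

```sas
proc sql;
  create table want_join as
  select t.transamount, t.transdate, t.transzip, r.rate
  from trans as t
  left join
    /* lowest rate for each distinct zip/date combination */
    (select t2.transzip, t2.transdate, min(k.rate) as rate
     from (select distinct transzip, transdate from trans) as t2
     inner join key as k
       on t2.transzip between k.zipstart and k.zipend
      and t2.transdate between k.startdate and k.enddate
     group by t2.transzip, t2.transdate) as r
    on t.transzip = r.transzip
   and t.transdate = r.transdate;
quit;
```

Computing the minimum on distinct zip/date pairs first keeps every transaction row in the output, including duplicates, while the range lookup runs only once per pair.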
The SAS SQL optimizer can be good sometimes. Other times, it can be a challenge. This code is a bit more complicated, but it will likely be faster, subject to size constraints on your key table (the hash object must fit in memory).
data key;
set key;
dummy_key=1;
run;
data want(drop=dummy_key geocode rate startDate endDate zipStart zipEnd rc i);
if _n_ = 1 then do;
if 0 then set key;
declare hash k (dataset:'key',multidata:'y');
k.defineKey('dummy_key');
k.defineData('geocode','rate','startdate','enddate','zipstart','zipend');
k.defineDone();
end;
call missing (of _all_);
set trans;
dummy_key=1;
rc = k.find();
do i=1 to 1000 while (rc=0);
transZipNum = input(transZip,8.); *converts character zip to number. if its already a number then remove;
zipStartNum = input(zipStart,8.);
zipEndNum = input(zipEnd,8.);
if startDate <= transDate <= endDate then do;
if zipStartNum <= transZipNum <= zipEndNum then do;
rate_out = min(rate_out,rate);
end;
end;
rc=k.find_next();
end;
run;