Calculate average of the last x years - sas

I have the following data
Date value_idx
2002-01-31 .
2002-01-31 24.533
2002-01-31 26.50
2018-02-28 25.2124
2019-09-12 22.251
2019-01-31 24.214
2019-05-21 25.241
2019-05-21 .
2020-05-21 25.241
2020-05-21 23.232
I would need to calculate the average of value_idx of the last 3 years and 7 years.
I tried first to calculate it as follows:
proc sql;
create table table1 as
select date, avg(value_idx) as avg_value_idx
from table
group by date;
quit;
The problem is that I do not know how to calculate the average of value_idx not per each month but for the last two years. So I think I should extract the year, group by that, and then calculate the average.
I hope someone of you can help me with this.

You can use CASE to decide which records contribute to which MEAN. You need to clarify what you mean by last 2 or last 7 years. This code will find the value of the maximum date and then compare the year of that date to the year of the other dates.
select
mean(case when year(max_date)-year(date) < 2 then value_idx else . end) as mean_yr2
,mean(case when year(max_date)-year(date) < 7 then value_idx else . end) as mean_yr7
from have,(select max(date) as max_date from have)
;
Results
mean_yr2 mean_yr7
------------------
24.0358 24.2319

The best way to do this sort of thing in SAS is with native PROCs, as they have a lot of functionality related to grouping.
In this case, we use multilabel formats to control the grouping. I assume you mean 'Last Three Years' as in calendar 2018/2019/2020 and 'Last Seven Years' as calendar 2014-2020. Presumably you can see how to modify this for other time periods - so long as you aren't trying to make the time period relative to each data point.
We create a format that uses the MULTILABEL option (which allows data points to fall in multiple categories), and the NOTSORTED option (to allow us to force the ordering of the labels, otherwise SEVEN is earlier than THREE).
Then, we use it in PROC TABULATE, enabling it with MLF (MultiLabel Format) and preloadfmt order=data which again keeps the ordering correct. This produces a report with the two averages only.
data have;
informat date yymmdd10.;
input Date value_idx;
datalines;
2002-01-31 .
2002-01-31 24.533
2002-01-31 26.50
2017-02-28 25.2124
2017-09-12 22.251
2018-01-31 24.214
2018-05-21 25.241
2019-05-21 .
2020-05-21 25.241
2020-05-21 23.232
;;;;
run;
proc format;
value yeartabfmt (multilabel notsorted)
'01JAN2018'd-'31DEC2020'd = 'Last Three Years'
'01JAN2014'd-'31DEC2020'd = 'Last Seven Years'
other=' '
;
quit;
proc tabulate data=have;
class date/mlf preloadfmt order=data;
var value_idx;
format date yeartabfmt.;
tables date,value_idx*mean;
run;

Related

SAS Macro help to loop monthly sas datasets

I have monthly datasets in SAS Library for customers from Jan 2013 onwards with datasets name as CUST_JAN2013,CUST_FEB2013........CUST_OCT2017. These customers datasets have huge records of 2 million members for each month.This monthly datset has two columns (customer number and customer monthly expenses).
I have one input dataset Cust_Expense with customer number and month as columns. This Cust_Expense table has only 250,000 members and want to pull expense data for each member from SPECIFIC monthly SAS dataset by joining customer number.
Cust_Expense
------------
Customer_Number Month
111 FEB2014
987 APR2017
784 FEB2014
768 APR2017
.....
145 AUG2017
345 AUG2014
I have tried using call execute, but it tries to loop thru each 250,000 records of input dataset (Cust_Expense) and join with corresponding monthly SAS customer tables which takes too much of time.
Is there a way to read input tables (Cust_Expense) by month so that we read all customers for a specific month and then read the same monthly table ONCE to pull all the records from that month, so that it does not loop 250,000 times.
Depending on what you want the result to be, you can create one output per month by filtering on cust_expenses per month and joining with the corresponding monthly dataset
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expenses
;
quit;
proc sql;
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
create table want_&month. as
select *
from cust_expense(where=(month="&month.")) t1
inner join cust_&month. t2
on t1.customer_number=t2.customer_number
;
%end;
quit;
%mend;
%want;
Or you could have one output using one join by 'unioning' all those monthly datasets into one and dynamically adding a month column.
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expenses
;
quit;
proc sql;
create table want as
select *
from cust_expense t1
inner join (
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
%if &i>1 %then union;
select *, "&month." as month
from cust_&month
%end;
) t2
on t1.customer_number=t2.customer_number
and t1.month=t2.month
;
quit;
%mend;
%want;
In either case, I don't really see the point in joining those monthly datasets with the cust_expense dataset. The latter does not seem to hold any information that isn't already present in the monthly datasets.
Your first, best answer is to get rid of these monthly separate tables and make them into one large table with ID and month as key. Then you can simply join on this and go on your way. Having many separate tables like this where a data element determines what table they're in is never a good idea. Then index on month to make it faster.
If you can't do that, then try creating a view that is all of those tables unioned. It may be faster to do that; SAS might decide to materialize the view but maybe not (but if it's extremely slow, then look in your temp table space to see if that's what's happening).
Third option then is probably to make use of SAS formats. Turn the smaller table into a format, using the CNTLIN option. Then a single large datastep will allow you to perform the join.
data want;
set jan feb mar apr ... ;
where put(id,CUSTEXPF1.) = '1';
run;
That only makes one pass through the 250k table and one pass through the monthly tables, plus the very very fast format lookup which is undoubtedly zero cost in this data step (as the disk i/o will be slower).
I guess you could output your data in specific dataset like this example :
data test;
infile datalines dsd;
input ID : $2. MONTH $3. ;
datalines;
1,JAN
2,JAN
3,JAN
4,FEB
5,FEB
6,MAR
7,MAR
8,MAR
9,MAR
;
run;
data JAN FEB MAR;
set test;
if MONTH = "JAN" then output JAN;
if MONTH = "FEB" then output FEB;
if MONTH = "MAR" then output MAR;
run;
You will avoid to loop through all your ID (250000)
and you will use dataset statement from SAS
At the end you will get 12 DATASET containing the ID related.
If you case, FEB2014 , for example, you will use a substring fonction and the condition in your dataset will become :
...
set test;
...
if SUBSTR(MONTH,1,3)="FEB" then output FEB;
...
Regards

Counting working days in SAS EG

Hi to all and good time of a day!
Here is my case I need to solve I will very gratefull if you can help me.
I have some data set it contains only one variable date format.
Example:
01JAN2016
06JAN2016
15FEB2016
The second data set is days - holidays for a period 5 years.
Example:
01JAN2016
02JAN2016
and etc, all these days are not working days.
The case is I need to count number of working days from date for every observation from first data set till now. It seems that I need to count number of days
"Now date" minus Date(from first data set) and minus number of days from second data set with holidays (count(date) where Date(from first data set)< date < "Now"
You can define your own type of interval to use with SAS funcions intck and intnx. Here's how to do it:
First create a table of weekdays for whichever years you have holidays for, up to present (or a future) year.
Here we'll start by including all weekdays from 2014 to 2016. This is assuming you don't want to count weekend days. If that's not the case, just modify the code so that the condition "weekday(date) in (2:6)" is not applied. You'll get the full 365 days of the year.
data mon_fri;
do date = "01JAN2014"d to "31DEC2016"d;
if weekday(date) in (2:6) then output;
end;
format date date9.;
run;
Then we'll create a table having all those dates we just created, minus the holidays we have in the table Holidays. We'll place the table in a library called myLib, and rename the date column to "Begin" for compliance with SAS custom intervals.
libname myLib "some/place/on/your/drive";
data mylib.workdays(RENAME=(date=Begin));
merge mon_fri (in=weekday)
Holidays (in=holiday);
by date;
if weekday and not holiday then output;
run;
Now we set up a custom interval which we'll simply call "workdays".
options intervalds=(workdays=mylib.workdays);
From there, all you have left to do is something like this:
data dateCalculations;
set mydata;
numOfDays = intck("workdays", theDate, today());
run;
SAS will take care of counting the number of dates (lines in the workdays dataset) separating the startdate (column called theDate) from the enddate (today's date).
Et voilĂ !
This is wonderful and very helpful. I use two different SAS systems (both on remote Unix servers). Setting the intervalds option only seems to work on one of them. I copy/paste the same code and on the other nothing happens - no warning, no error, it simply doesn't work.
Here is how I'm setting it (download the CSV from Yahoo! Finance for the S&P500, daily data, starting January 1950):
PROC IMPORT DATAFILE="sp500_1950_2016.csv"
OUT=sp500_1950_2016
DBMS=DLM
REPLACE;
delimiter=',';
getnames=yes;
RUN;
data trading_days;
set sp500_1950_2016 (keep = date rename=(date=begin));
where year(begin) < 2017;
run;
options intervalds=(TradingDay=trading_days) ;
Then I call it like so to count number of observations I should have from fund inception to Dec 31, 2016 or when the fund closed, whichever is sooner:
data ops2; set operations_master; where ~missing(inception);
if missing(enddate) then enddate = '31dec2016'd;
datadays = INTCK('TradingDay',inception,enddate);run;
proc univariate; var datadays;run;quit;
On system 1, this works just fine. On system 2, I get 0 for the variable datadays. I've already checked to see if there is a sys admin override on setting the intervalds option, and there is not. Is there another reason why this might not work on a given system?

SAS forward looking Moving standard deviation

Hi does anyone know how to calculate the standard deviation over the next four quarters for each quarter? Thanks :)
My attempt is below:
date1 is the sas date for the quarter in a year
Proc sql ; create table th.totalroll as
Select distinct permco, date1 ,
(select std(adjret) from th.returns1 where qtr between
intnx('quarter',qtr(date),0) and intnx('quarter', qtr(date),+3)) as
TOTALroll From th.returns1 group by permco ,date1;
QUIT;
It's hard to tell how close you are because I'm not entirely certain what your data looks like, but here's an example assuming you have more than one date in each quarter. Create sample data:
data have;
format date date9.;
do m = 1 to 128;
date = intnx('month','01JAN2008'd,m-1);
amount = round(ranuni(date)*10);
output;
end;
drop m;
run;
Using proc sql, create quarter variable (you might already have this variable?) and group by this variable. Use a having clause to restrict results to the first date of each quarter.
proc sql;
create table want as
select
yyq(year(t1.date),qtr(t1.date)) as quarter format=yyq.,
(select std(t2.amount)
from have t2
where t2.date >= yyq(year(t1.date),qtr(t1.date))
and t2.date < intnx('quarter',yyq(year(t1.date),qtr(t1.date)),4)) as stddev
from
have t1
group by
calculated quarter
having
t1.date = min(t1.date)
;
quit;
You should be able to adapt this to work for your data.
You can use proc expand if your dataset is already in quarterly. So something like this:
proc expand data=th.returns1
out=th.totalroll
from=quarter
to=quarter;
by permco date1;
id date;
convert adjret=TOTALroll / transformout=( MOVSTD 4 );
run;
Don't forget to sort you data first. And MOVSTD gives you backward moving standard deviation. You may need to shift the output stream back by 4 quarters if you want the forward moving STD.
Transformation Operations for proc expand:
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_expand_sect026.htm

Contingency table in SAS

I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 X 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 susequent categories. But I don't know how to proceed to show whether this is the case of not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc, but to no avail. . .
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shoudn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
x = year1;
y = year2;
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;

Transposing one column in a dataset but by year and another column

I have this dataset here which looks like this:
Basically I want to manipulate the data set so that I have
GVKEY1 as unique such as 1004 then a unique year number such as 1996 then several gvkey2 after that. However the number of gvkey2 for each year is not the same. Does anyone know how to get around this problem? This means I will have several 12 lines of data for gvkey1 for 1004 since i have years from 1996 to 2008. Then for each year I will have many columns where each column will have a gvkey2.
Best Regards,
Naz
Can you not just use PROC TRANSPOSE?
proc sort data=your_data_set out=temp1;
by gvkey1 year;
run;
proc transpose data=temp1 out=temp2;
by gvkey1 year;
var gvkey2;
run;
This will give you a series of variables COL1 - COLx. Use the PREFIX option for different variable names.
I'm not sure I've understood your question, but if you're looking for unique gvkey1/year pairs, you could do either of these:
proc sql;
create table results as
select distinct gvkey1, year
from _your_data_set;
quit;
or
proc sort data=_your_data_set(keep=gvkey1 year) out=results nodupkey;
by gvkey1 year;
run;
If that's not what you're looking for, I suggest posting an example of the results you want.