Rolling Window Model for Unbalanced Dataset in SAS - sas

I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;​
I now want to estimate the following model for each ID:
proc reg;
by ID;
EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;​
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.

Try this macro. This will only output if there are non-missing values of lags that you specify.
%macro lag(data=, out=, window=);
data _want_;
set &data.;
by ID;
LAG_EARN = lag&window.(earn);
if(first.ID) then call missing(lag_earn);
if(NOT missing(lag_earn));
run;
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);

Related

How to fetch a value based on another column

Please find the dataset below:
ID amt order_type no_of_order
1 200 em 6
1 300 on 5
2 600 em 10
Output desired:
ID amt order_type no_of_order
1 500 on 11
2 600 em 10
based on the highest amount i need to pick the order_type.
How can this be achieved in sas code
Sounds like you want to get the sum of the two numeric variables for each value of ID and also select one value for ORDER_TYPE. You appear to want to take the value of ORDER_TYPE which had the largest AMT. Here is simply way using PROC SUMMARY.
data have;
input ID amt order_type $ no_of_order;
cards;
1 200 em 6
1 300 on 5
2 600 em 10
;
proc summary data=have ;
by id;
var amt no_of_order;
output out=want sum= idgroup(max(amt) out[1] (order_type)=);
run;
Results:
no_of_ order_
Obs ID _TYPE_ _FREQ_ amt order type
1 1 0 2 500 11 on
2 2 0 1 600 10 em

SAS proc iml use to impute the missing sales from the first or second month based on the average of the next two observations

I'm looking a way to impute using proc iml in sas the average of the sales of the next two months.
As you can see sometimes I dont have the sales of 201901 and sometimes is missing on 201902
For example for the first barcode I want to find the sales[1]= mean(sales[2],sales[3]) and I want to do this for each unique barcode.
The "table A" is like this:
Obs. | Barcode | date | sales | Position
---------------------------------------------------------------
1 |21220000000| 201901 | . | 1
2 |21220000000| 201902| 311 | 2
3 |21220000000| 201903| 349 | 3
4 |21220000000| 201904| 360 | 4
5 |21220000000| 201905| 380 | 5
6 |21220000000| 201906| 440 | 6
7 |21220000000| 201907| 360 | 7
8 |21220000000| 201908| 390 | 8
9 |21220000000| 201909| 410 | 9
10 |21220000000| 201910| 520 | 10
11 |21220000000| 201911| 410 | 11
12 |21220000000| 201912| 390 | 12
13 |31350000000| 201901| 360 | 1
14 |31350000000| 201902| . | 2
.etc.
24 |31350000000| 201912| . | 12
25 |45480000000| 201901| 310 | 1
26 |45480000000| 201902| . | 2
.etc.
I tried something like this but it doesnt work:
proc iml;
t_a= TableCreateFromDataSet("work","table_a");
call TablePrint(t_a);
do i =1 to nrow(t_a);
if t_a[i,4]=. and t_a[i,5]=1 then t_a[1,4]= mean(t_a[i+1,4],t_a[i+2,4]) ;
i=i+1;
end;
run;
Is there a way to do it using matrices or lists in proc iml or would you recommend any other ways?
Thank you in advance!
This problem only involves an ID variable (='BarCode') and a variable that has missing values (='Sales'), so you really only need to read and process two vectors.
An efficient approach is to iterate over the unique levels of the "Barcode" variable (an ID variable) and process each missing value. Thus you can reduce the problem to a "BY group analysis" in which each ID value is processed in turn. There are several ways to perform a BY-group analysis in IML. The easiest to understand and implement is the UNIQUE-LOC technique. For large data, the UNIQUEBY technique is more efficient.
The following example uses the UNIQUE-LOC technique:
proc iml;
use table_a;
read all var {"BarCode"} into ID;
read all var {"Sales"} into X;
close;
imputeX = X; /* make copy of X */
u = unique(ID); /* unique categories of the ID variable */
do i = 1 to ncol(u); /* for each ID level */
groupIdx = loc(ID=u[i]);
y = x[groupIdx]; /* get the values for this level */
k = loc( y=. ); /* which are missing? */
if ncol(k)>0 then do; /* if some are missing, do imputation */
n = nrow(y);
startIdx = ((k+1) >< n); /* starting location, don't exceed n */
stopIdx = ((k+2) >< n); /* ending location, don't exceed n */
values = y[ startIdx ] || y[ stopIdx ];
mean = values[ , :]; /* find mean of each row */
y[k] = mean; /* copy mean to missing values */
imputeX[groupIdx] = y; /* update imputed vector (optional: write data) */
end;
end;
print ID[F=Z11.] X imputeX;
I don't think this is a good solution in PROC IML to your problem, but I can tell you where you're going wrong in your particular approach. Hopefully Rick or someone else can stop by to show the right IML way to solve this using Matrix operations, or you can browse the Do Loop as I'm fairly sure Rick has articles on imputation there.
That said, your issue here is that SAS IML doesn't really have very much support for tables as a data structure. They were recently added, and mostly added just to make it easier to import datasets into IML from SAS without a lot of trouble. However, you can't treat them like Pandas data frames or similar; they're really just data storage devices that you need to extract things from. You need to move data into matrices to actually use them.
Here's how I would directly translate your nonfunctional code into functional code. Again, please remember this is probably not a good way to do this - matrices have a lot of features that make them good at this sort of thing, if you use them right, and you probably don't need to use a DO loop to iterate here - you should rather use matrix multiplication to do what you want. That's really the point of using IML; if you're just iterating, then use base SAS to do this, it's much easier to write the same program in base SAS (or, even better, use the imputation procedures if you have them licensed).
data table_a;
input Obs Barcode date :$6. sales Position;
datalines;
1 21220000000 201901 . 1
2 21220000000 201902 311 2
3 21220000000 201903 349 3
4 21220000000 201904 360 4
5 21220000000 201905 380 5
6 21220000000 201906 440 6
7 21220000000 201907 360 7
8 21220000000 201908 390 8
9 21220000000 201909 410 9
10 21220000000 201910 520 10
11 21220000000 201911 410 11
12 21220000000 201912 390 12
13 31350000000 201901 360 1
14 31350000000 201902 . 2
24 31350000000 201912 . 12
25 45480000000 201901 310 1
26 45480000000 201902 . 2
;;;;
run;
proc iml;
t_a= TableCreateFromDataSet("work","table_a");
call TablePrint(t_a);
sales = TableGetVarData(t_a,4);
barcode = TableGetVarData(t_a,2);
do i =1 to nrow(sales);
if missing(sales[i]) then do; *if the sales value is missing, then ...;
if i <= (nrow(sales) - 2) then do; *make sure we are not going over the total;
if all(j(2,1,barcode[i])=barcode[i+1:i+2]) then do; *and see if the rows are all the same barcode;
sales[i] = mean(sales[i+1:i+2]); *compute the mean!;
end;
end;
end;
end;
call TableAddVar(t_a,'sales_i',sales); *Put the matrix back in the table;
call TablePrint(t_a); *Take a peek!;
quit;
What I do first is extract the Barcode and Sales columns into matrices. Barcode is to check to make sure we're imputing from the same ID. Then, we check to see if sales is missing for that iteration, and further make sure we're not on the last two iterations (or it'll give an out-of-range error). Lastly, we compare the barcode with the next two barcodes and make sure they're the same. (The way I do that is pretty silly, honestly, but it's the quickest way I can think of.) If those all pass, then we calculate the mean.
Finally, we add the matrix back on the t_a table, and you can export it at your leisure back to a SAS dataset, or do whatever it is you want to do with it!
Again - this isn't really a good way to do this, it's more a direct answer of "what is wrong with your code". Find a better solution to imputation than this!

Construct centered moving average window in SAS

I am trying to construct centered moving average in SAS.
my table is in below
date number average
01/01/2015 18 ...
01/01/2015 15 ...
01/01/2015 5 ...
02/01/2015 66 ...
02/01/2015 7 ...
03/01/2015 7 ...
04/01/2015 19 ...
04/01/2015 7 ...
04/01/2015 11 ...
04/01/2015 17 ...
05/01/2015 3 ...
06/01/2015 7 ...
... ... ...
I need to obtain the average number for a surrounding period over (-2,+2) days, instead of (-2,+2) observations
I know that for Centered Moving Average, I can use.
convert number=av_number/transformout=(cmovave 3)
but here we have different number of observations in each day.
Anyone can tell me how to include only (-2, +2) days of centered moving average in this case ?
Thanks in advance !
Best
The suggestion from #Joe to aggregate to a daily level is the right approach, however you have to be careful that you don't lose the number of entries per day, otherwise you won't calculate the correct moving average. In other words, you need to weight the daily value by the number of entries for that day.
I've taken 3 steps to calculate the moving average, it may be possible to do it in 2 but I can't see how.
Step 1 is to calculate the sum and count of number per day.
Step 2 is to calculate the moving 5 day sum for both variables.
Step 3 then divides the sum by the count to get the weighted 5 day average.
I've added the trim function to exclude the first and last 2 records, obviously you can include those if you wish. You'll probably want to drop some of the extra variables as well.
data have;
input date :ddmmyy10. number;
format date date9.;
datalines;
01/01/2015 18
01/01/2015 15
01/01/2015 5
02/01/2015 66
02/01/2015 7
03/01/2015 7
04/01/2015 19
04/01/2015 7
04/01/2015 11
04/01/2015 17
05/01/2015 3
06/01/2015 7
;
run;
proc summary data=have nway;
class date;
var number;
output out=daily_agg sum=;
run;
proc expand data=daily_agg out=daily_agg_mov_sum;
convert number=tot_number / transformout = (cmovsum 5 trim 2);
convert _freq_=tot_count / transformout = (cmovsum 5 trim 2);
run;
data want;
set daily_agg_mov_sum;
if not missing(tot_number) then av_number = tot_number / tot_count;
run;

SAS transpose using values as column names and summarize

I'm trying to transpose a data using values as variable names and summarize numeric data by group, I tried with proc transpose and with proc report (across) but I can't do this, the unique way that I know to do this is with data set (if else and sum but the changes aren't dynamically)
For example I have this data set:
school name subject picked saving expenses
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
and I need this in 1 line, sum of 'picked' by the names of students, and later sum of picked by subject, the last 3 columns is the sum total for picked, saving and expense:
school john ruby peter noname math spanish geography nosubject picked saving expenses
raget 15 15 2 0 13 5 2 12 32 22700 8200
If it's possible to be dynamically changed if I have a new student in the school or subject?
It's a little difficult because you're summarising at more than one level, so I've used PROC SUMMARY and chosen different _TYPE_ values. See below:
data have;
infile datalines;
input school $ name $ subject : $10. picked saving expenses;
datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
;
run;
proc summary data=have;
class school name subject;
var picked saving expenses;
output out=want1 sum(picked)=picked sum(saving)=saving sum(expenses)=expenses;
run;
proc transpose data=want1 (where=(_type_=5)) out=subs (where=(_NAME_='picked'));
by school;
id subject;
run;
proc transpose data=want1 (where=(_type_=6)) out=names (where=(_NAME_='picked'));
by school;
id name;
run;
proc sql;
create table want (drop=_TYPE_ _FREQ_ name subject) as
select
n.*,
s.*,
w.*
from want1 (where=(_TYPE_=4)) w,
names (drop=_NAME_) n,
subs (drop=_NAME_) s
where w.school = n.school
and w.school = s.school;
quit;
I've also tested this code by adding new schools, names and subjects and they do appear in the final table. You'll note that I haven't hardcoded anything (e.g. no reference to math or John), so the code is dynamic enough.
PROC REPORT is an interesting alternative, particularly if you want the printed output rather than as a dataset. You can use ODS OUTPUT to get the output dataset, but it's messy as the variable names aren't defined for some reason (they're "C2" etc.). The printed output of this one is a little messy also as the header rows don't line up, but that can be fixed with some finagling if that's desired.
data have;
input school $ name $ subject $ picked saving expenses;
datalines;
raget John math 10 10500 3500
raget John spanish 5 1200 2000
raget Ruby nosubject 10 5000 1000
raget Ruby nosubject 2 3000 0
raget Ruby math 3 2000 500
raget peter geography 2 1000 0
raget noname nosubject 0 0 1200
;;;;
run;
ods output report=want;
proc report nowd data=have;
columns school (name subject),(picked) picked=picked2 saving expenses;
define picked/analysis sum ' ';
define picked2/analysis sum;
define saving/analysis sum ;
define expenses/analysis sum;
define name/across;
define subject/across;
define school/group;
run;

Confidence intervals on a matrix of data in SAS

I have the following matrix of data, which I am reading into SAS:
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
It represents 5 different species of animal (the rows) in 5 different areas of an environment (the columns). I want to get a Shannon diversity index for the whole environment, so I sum the rows to get:
48 54 68 79 84
Then calculate the Shannon index from this, to get:
1.5873488
What I need to do, however, is calculate a confidence interval for this Shannon index. So I want to perform a nonparametric bootstrap on the initial matrix.
Can anyone advise how this is possible in SAS?
There are several ways to do this in SAS. I would use proc surveyselect to generate the bootstrap samples, and then calculate the Shannon Index for each replicate. (I didn't know what the Shannon Index was, so my code is just based on what I read on Wikipedia.)
data animals;
input v1-v5;
cards;
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
run;
/* Generate 5000 bootstrap samples, with replacement */
proc surveyselect data=animals method=urs n=5 reps=5000 seed=10024 out=boots;
run;
/* For each replicate, calculate the sum of each variable */
proc means data=boots noprint nway;
class replicate;
var v:;
output out=sums sum=;
run;
/* Calculate the proportions, and p*log(p), which will be used next */
data sums;
set sums;
ttl=sum(of v1-v5);
array ps{*} p1-p5;
array vs{*} v1-v5;
array hs{*} h1-h5;
do i=1 to dim(vs);
ps{i}=vs{i}/ttl;
hs{i}=ps{i}*log(ps{i});
end;
keep replicate h:;
run;
/* Calculate the Shannon Index, again for each replicate */
data shannon;
set sums;
shannon = -sum(of h:);
keep replicate shannon;
run;
We now have a data set, shannon, which contains the Shannon Index calculated for each of 5000 bootstrap samples. You could use this to calculate p-values, but if you just want critical values, you can run proc means (or univariate if you want a 5% value, as I don't think it's possible to get 97.5 quantiles with proc means).
proc means data=shannon mean p1 p5 p95 p99;
var shannon;
run;