I am trying to construct a centered moving average in SAS.
My table is below:
date number average
01/01/2015 18 ...
01/01/2015 15 ...
01/01/2015 5 ...
02/01/2015 66 ...
02/01/2015 7 ...
03/01/2015 7 ...
04/01/2015 19 ...
04/01/2015 7 ...
04/01/2015 11 ...
04/01/2015 17 ...
05/01/2015 3 ...
06/01/2015 7 ...
... ... ...
I need to obtain the average number over a surrounding period of (-2,+2) days, rather than (-2,+2) observations.
I know that for a centered moving average over observations I can use:
convert number=av_number/transformout=(cmovave 3)
but here we have a different number of observations on each day.
Can anyone tell me how to restrict the centered moving average to (-2,+2) days in this case?
Thanks in advance!
Best
The suggestion from #Joe to aggregate to a daily level is the right approach; however, you have to be careful not to lose the number of entries per day, otherwise you won't calculate the correct moving average. In other words, you need to weight each daily value by the number of entries for that day.
I've taken 3 steps to calculate the moving average. It may be possible to do it in 2, but I can't see how.
Step 1 is to calculate the sum and count of number per day.
Step 2 is to calculate the moving 5 day sum for both variables.
Step 3 then divides the sum by the count to get the weighted 5 day average.
I've added the trim operation to exclude the first and last 2 records (their windows would be incomplete); you can include them if you wish. You'll probably want to drop some of the extra variables as well.
data have;
input date :ddmmyy10. number;
format date date9.;
datalines;
01/01/2015 18
01/01/2015 15
01/01/2015 5
02/01/2015 66
02/01/2015 7
03/01/2015 7
04/01/2015 19
04/01/2015 7
04/01/2015 11
04/01/2015 17
05/01/2015 3
06/01/2015 7
;
run;
proc summary data=have nway;
class date;
var number;
output out=daily_agg sum=;
run;
proc expand data=daily_agg out=daily_agg_mov_sum;
convert number=tot_number / transformout = (cmovsum 5 trim 2);
convert _freq_=tot_count / transformout = (cmovsum 5 trim 2);
run;
data want;
set daily_agg_mov_sum;
if not missing(tot_number) then av_number = tot_number / tot_count;
run;
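As a cross-check on the three steps above, here is a minimal sketch of the same weighted average written as a PROC SQL self-join on the date range. The names want_sql, d and h are made up for illustration. Unlike the PROC EXPAND version, which (as far as I know, without an ID statement) counts rows of daily_agg rather than calendar days, this join works directly on dates, so it should also cope with missing days. It does not trim the edges, so the first and last two days get averages over partial windows.

```sas
proc sql;
  create table want_sql as
  select d.date format=date9.,
         sum(h.number)   as tot_number, /* sum over the 5-day window   */
         count(h.number) as tot_count,  /* entries in the window       */
         mean(h.number)  as av_number   /* equals tot_number/tot_count */
  from (select distinct date from have) as d
       inner join have as h
       on h.date between d.date - 2 and d.date + 2
  group by d.date;
quit;
```

For 03JAN2015 the window covers 01JAN2015 to 05JAN2015, which holds 11 entries summing to 175, so av_number is 175 / 11, about 15.91.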
I'm looking for a way to impute missing sales in SAS using proc iml, replacing a missing value with the average of the sales of the next two months.
As you can see, sometimes the sales for 201901 are missing and sometimes the sales for 201902 are missing.
For example, for the first barcode I want sales[1] = mean(sales[2], sales[3]), and I want to do this for each unique barcode.
The "table A" is like this:
Obs. | Barcode | date | sales | Position
---------------------------------------------------------------
1 |21220000000| 201901 | . | 1
2 |21220000000| 201902| 311 | 2
3 |21220000000| 201903| 349 | 3
4 |21220000000| 201904| 360 | 4
5 |21220000000| 201905| 380 | 5
6 |21220000000| 201906| 440 | 6
7 |21220000000| 201907| 360 | 7
8 |21220000000| 201908| 390 | 8
9 |21220000000| 201909| 410 | 9
10 |21220000000| 201910| 520 | 10
11 |21220000000| 201911| 410 | 11
12 |21220000000| 201912| 390 | 12
13 |31350000000| 201901| 360 | 1
14 |31350000000| 201902| . | 2
.etc.
24 |31350000000| 201912| . | 12
25 |45480000000| 201901| 310 | 1
26 |45480000000| 201902| . | 2
.etc.
I tried something like this, but it doesn't work:
proc iml;
t_a= TableCreateFromDataSet("work","table_a");
call TablePrint(t_a);
do i =1 to nrow(t_a);
if t_a[i,4]=. and t_a[i,5]=1 then t_a[1,4]= mean(t_a[i+1,4],t_a[i+2,4]) ;
i=i+1;
end;
run;
Is there a way to do it using matrices or lists in proc iml or would you recommend any other ways?
Thank you in advance!
This problem only involves an ID variable (='BarCode') and a variable that has missing values (='Sales'), so you really only need to read and process two vectors.
An efficient approach is to iterate over the unique levels of the "Barcode" variable (an ID variable) and process each missing value. Thus you can reduce the problem to a "BY group analysis" in which each ID value is processed in turn. There are several ways to perform a BY-group analysis in IML. The easiest to understand and implement is the UNIQUE-LOC technique. For large data, the UNIQUEBY technique is more efficient.
The following example uses the UNIQUE-LOC technique:
proc iml;
use table_a;
read all var {"BarCode"} into ID;
read all var {"Sales"} into X;
close;
imputeX = X; /* make copy of X */
u = unique(ID); /* unique categories of the ID variable */
do i = 1 to ncol(u); /* for each ID level */
groupIdx = loc(ID=u[i]);
y = x[groupIdx]; /* get the values for this level */
k = loc( y=. ); /* which are missing? */
if ncol(k)>0 then do; /* if some are missing, do imputation */
n = nrow(y);
startIdx = ((k+1) >< n); /* starting location, don't exceed n */
stopIdx = ((k+2) >< n); /* ending location, don't exceed n */
values = y[ startIdx ] || y[ stopIdx ];
mean = values[ , :]; /* find mean of each row */
y[k] = mean; /* copy mean to missing values */
imputeX[groupIdx] = y; /* update imputed vector (optional: write data) */
end;
end;
print ID[F=Z11.] X imputeX;
I don't think this is a good solution in PROC IML to your problem, but I can tell you where you're going wrong in your particular approach. Hopefully Rick or someone else can stop by to show the right IML way to solve this using Matrix operations, or you can browse the Do Loop as I'm fairly sure Rick has articles on imputation there.
That said, your issue here is that SAS IML doesn't really have very much support for tables as a data structure. They were recently added, and mostly added just to make it easier to import datasets into IML from SAS without a lot of trouble. However, you can't treat them like Pandas data frames or similar; they're really just data storage devices that you need to extract things from. You need to move data into matrices to actually use them.
Here's how I would directly translate your nonfunctional code into functional code. Again, please remember this is probably not a good way to do this - matrices have a lot of features that make them good at this sort of thing, if you use them right, and you probably don't need to use a DO loop to iterate here - you should rather use matrix multiplication to do what you want. That's really the point of using IML; if you're just iterating, then use base SAS to do this, it's much easier to write the same program in base SAS (or, even better, use the imputation procedures if you have them licensed).
data table_a;
input Obs Barcode date :$6. sales Position;
datalines;
1 21220000000 201901 . 1
2 21220000000 201902 311 2
3 21220000000 201903 349 3
4 21220000000 201904 360 4
5 21220000000 201905 380 5
6 21220000000 201906 440 6
7 21220000000 201907 360 7
8 21220000000 201908 390 8
9 21220000000 201909 410 9
10 21220000000 201910 520 10
11 21220000000 201911 410 11
12 21220000000 201912 390 12
13 31350000000 201901 360 1
14 31350000000 201902 . 2
24 31350000000 201912 . 12
25 45480000000 201901 310 1
26 45480000000 201902 . 2
;;;;
run;
proc iml;
t_a= TableCreateFromDataSet("work","table_a");
call TablePrint(t_a);
sales = TableGetVarData(t_a,4);
barcode = TableGetVarData(t_a,2);
do i =1 to nrow(sales);
if missing(sales[i]) then do; *if the sales value is missing, then ...;
if i <= (nrow(sales) - 2) then do; *make sure we are not going over the total;
if all(j(2,1,barcode[i])=barcode[i+1:i+2]) then do; *and see if the rows are all the same barcode;
sales[i] = mean(sales[i+1:i+2]); *compute the mean!;
end;
end;
end;
end;
call TableAddVar(t_a,'sales_i',sales); *Put the matrix back in the table;
call TablePrint(t_a); *Take a peek!;
quit;
What I do first is extract the Barcode and Sales columns into matrices. Barcode is to check to make sure we're imputing from the same ID. Then, we check to see if sales is missing for that iteration, and further make sure we're not on the last two iterations (or it'll give an out-of-range error). Lastly, we compare the barcode with the next two barcodes and make sure they're the same. (The way I do that is pretty silly, honestly, but it's the quickest way I can think of.) If those all pass, then we calculate the mean.
Finally, we add the matrix back on the t_a table, and you can export it at your leisure back to a SAS dataset, or do whatever it is you want to do with it!
Again - this isn't really a good way to do this, it's more a direct answer of "what is wrong with your code". Find a better solution to imputation than this!
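For comparison, the base SAS route mentioned above could look something like the following sketch. It uses a one-to-one MERGE (no BY statement) of table_a with two shifted copies of itself to look ahead one and two rows. The names imputed, sales_i, bc1, bc2, s1 and s2 are invented for this example, and it assumes the data is already sorted by Barcode and date, as in the sample.

```sas
data imputed;
  /* Look-ahead: second copy starts at row 2, third copy at row 3 */
  merge table_a
        table_a(firstobs=2 keep=Barcode sales rename=(Barcode=bc1 sales=s1))
        table_a(firstobs=3 keep=Barcode sales rename=(Barcode=bc2 sales=s2));
  sales_i = sales;
  /* Impute only when both following rows belong to the same barcode */
  if missing(sales) and Barcode = bc1 and Barcode = bc2 then
    sales_i = mean(s1, s2);
  drop bc1 bc2 s1 s2;
run;
```

With the sample data, the first row gets sales_i = mean(311, 349) = 330, while the missing values for barcode 31350000000 are left alone because the following two rows do not both share that barcode. This mirrors the checks in the IML loop above.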
I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg data=have;
by ID;
model EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.
Try this macro. This will only output if there are non-missing values of lags that you specify.
%macro lag(data=, out=, window=);
data _want_;
set &data.;
by ID;
LAG_EARN = lag&window.(earn);
if(first.ID) then call missing(lag_earn);
if(NOT missing(lag_earn));
run;
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);
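The macro above reshapes the lagged values but does not fit a regression per window. One possible way to get the rolling coefficients themselves is sketched below, under some assumptions: a self-join builds one group of rows per (ID, window end year), keeping only windows with exactly &window non-missing observations; PROC REG then runs once per group; and the slope is transposed to one row per ID. The names windows, end_year, est and want_wide are made up here. The SQL step relies on PROC SQL remerging the group count back onto the detail rows, which SAS reports with a note in the log.

```sas
%let window = 2;

proc sql;
  create table windows as
  select a.id, a.year as end_year, b.year, b.earn, b.lag_earn
  from have as a inner join have as b
    on a.id = b.id
   and b.year between a.year - &window + 1 and a.year
  where not missing(b.lag_earn)
  group by a.id, a.year
  having count(*) = &window          /* keep complete windows only */
  order by a.id, end_year, b.year;
quit;

proc reg data=windows outest=est noprint;
  by id end_year;
  model earn = lag_earn;             /* slope is stored in est.lag_earn */
run;
quit;

proc transpose data=est out=want_wide(drop=_name_) prefix=y;
  by id;
  id end_year;
  var lag_earn;
run;
```

With the sample data this should yield windows ending in 1962 for ID 1, 1980 to 1984 for ID 2, and 1974 for ID 3, matching the layout of the desired table. Note that with a window of 2 and an intercept, each fit passes exactly through its two points, so a larger window is needed for the estimates to be meaningful.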
This is my code:
DATA sales;
INFILE 'D:\Users\...\Desktop\Onions.dat';
INPUT VisitingTeam $ 1-20 ConcessionSales 21-24 BleacherSales 25-28
OurHits 29-31 TheirHits 32-34 OurRuns 35-37 TheirRuns 38-40;
PROC PRINT DATA = sales;
TITLE 'SAS Data Set Sales';
RUN;
This is the data, but the spacing may be incorrect.
Columbia Peaches 35 67 1 10 2 1
Plains Peanuts 210 . 2 5 0 2
Gilroy Garlics 151035 12 11 7 6
Sacramento Tomatoes 124 85 15 4 9 1
;
I need to add or delete a blank column at the 19th column. Can someone help?
Just open the dataset and then look at what the variable name is. Then do:
Data Want (drop=variable_name_you_are_dropping); /*This is your output dataset*/
Set have; /*this is your dataset you have*/
Run;
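If "column" in the question means a character position in the raw file rather than a variable, it may help to look at where each field actually sits before changing anything. A small sketch: the LIST statement writes each raw record to the log under a column ruler, so you can compare the real positions against the ranges in the INPUT statement (1-20, 21-24, and so on). The file path is copied from the question as-is, and obs=5 is just an arbitrary limit for this example.

```sas
data _null_;
  infile 'D:\Users\...\Desktop\Onions.dat' obs=5;
  input;   /* read the record into the buffer without creating variables */
  list;    /* print the record with a column ruler in the log            */
run;
```

If the ruler shows the fields one position off, you can either edit the raw file or shift the column ranges in the INPUT statement to match.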
I am wondering about the best way to transpose data in SAS when I have multiple occurrences of my id variable. I know I can use the let option in the proc transpose statement to do this, but I do not want to get rid of any data, as I intend to compute averages.
Here is an example of my data and my code:
data grades;
input student testnum grade;
cards;
1 1 30
1 1 25
1 2 45
1 3 67
2 1 22
2 2 63
2 2 12
2 2 77
3 1 22
3 1 17
3 2 14
3 4 17
;
run;
proc sort data=grades;
by student testnum;
run;
proc transpose data=grades out=trgrades;
by student;
id testnum;
var grade;
run;
Here is how I would like my resulting dataset to look:
student testnum1 testnum2 testnum3 testnum4 avg12 avg34
1 30 45 67 . 33.33 67
1 25 . . . 33.33 67
2 22 63 . . 43.5 .
2 . 12 . . 43.5 .
2 . 77 . . 43.5 .
3 22 14 . 17 17.67 17
3 17 . . . 17.67 17
I want to use this new dataset (not sure how yet) to create the new columns that are the average score of all testnum1's and testnum2's for a student (avg12) and the average of all testnum3's and testnum4's (avg34) for a student.
There may be a much more efficient way to do this but I am stumped.
Any advice is appreciated.
If all you really need is the average of all test 1's and 2's, and 3's and 4's for each student, then you don't need to transpose at all. All you need is a simple data step:
data grouped;
set grades;
if testnum In (1,2) then group=1;
else if testnum in (3,4) then group=2;
run;
Then a basic proc means:
proc means data=grouped;
by student group;
var grade;
output out=averages mean=groupaverage;
run;
If you need the averages in a single observation, you can easily transpose the averages dataset.
proc transpose data=averages out=trgrades;
by student;
id group;
var groupaverage;
run;
Update:
As mentioned by #Keith, using a format to group the tests is an excellent choice as well. Skip the data step and create the format like so:
proc format;
value TestGroup
1,2 = 'Tests 1 and 2'
3,4 = 'Tests 3 and 4'
;
run;
Then the proc means becomes:
proc means data=grades;
by student testnum;
var grade;
format testnum TestGroup.;
output out=averages mean=groupaverage;
run;
End Update
If, for some reason, you really need to have all the test scores in one observation, then I would recommend using a data step to make them uniquely identifiable. Use by, first.testnum, retain, and a simple counter to assign each score a retake number. Your transpose then uses retake as a by variable and testnum as the id variable. You should be able to figure it out from there.
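A rough sketch of that retake-number idea, in case it helps. The names numbered, retake and wide are invented for this example; the sum statement (retake + 1) takes the place of an explicit retain.

```sas
proc sort data=grades;
  by student testnum;
run;

data numbered;
  set grades;
  by student testnum;
  if first.testnum then retake = 0;  /* restart the counter for each test   */
  retake + 1;                        /* sum statement: retained across rows */
run;

proc sort data=numbered;
  by student retake testnum;
run;

proc transpose data=numbered out=wide(drop=_name_) prefix=testnum;
  by student retake;
  id testnum;
  var grade;
run;
```

For student 1 this should give one row with testnum1=30, testnum2=45, testnum3=67 (retake 1) and a second row with only testnum1=25 (retake 2), which is the layout asked for in the question, minus the average columns.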
Really hoping right now that I didn't just do your SAS homework assignment for you.
I have the following matrix of data, which I am reading into SAS:
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
It represents 5 different species of animal (the rows) in 5 different areas of an environment (the columns). I want to get a Shannon diversity index for the whole environment, so I add the rows together to get one total per column:
48 54 68 79 84
Then calculate the Shannon index from this, to get:
1.5873488
What I need to do, however, is calculate a confidence interval for this Shannon index. So I want to perform a nonparametric bootstrap on the initial matrix.
Can anyone advise how this is possible in SAS?
There are several ways to do this in SAS. I would use proc surveyselect to generate the bootstrap samples, and then calculate the Shannon Index for each replicate. (I didn't know what the Shannon Index was, so my code is just based on what I read on Wikipedia.)
data animals;
input v1-v5;
cards;
1 5 12 19 13
6 3 1 3 14
2 7 12 19 21
22 24 21 29 18
17 15 22 9 18
;
run;
/* Generate 5000 bootstrap samples, with replacement */
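/* OUTHITS writes one row per draw; without it, a row drawn several times appears only once (with a NumberHits count) */
proc surveyselect data=animals method=urs n=5 reps=5000 seed=10024 out=boots outhits;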
run;
/* For each replicate, calculate the sum of each variable */
proc means data=boots noprint nway;
class replicate;
var v:;
output out=sums sum=;
run;
/* Calculate the proportions, and p*log(p), which will be used next */
data sums;
set sums;
ttl=sum(of v1-v5);
array ps{*} p1-p5;
array vs{*} v1-v5;
array hs{*} h1-h5;
do i=1 to dim(vs);
ps{i}=vs{i}/ttl;
hs{i}=ps{i}*log(ps{i});
end;
keep replicate h:;
run;
/* Calculate the Shannon Index, again for each replicate */
data shannon;
set sums;
shannon = -sum(of h:);
keep replicate shannon;
run;
We now have a data set, shannon, which contains the Shannon Index calculated for each of 5000 bootstrap samples. You could use this to calculate p-values, but if you just want critical values, you can run proc means (or proc univariate if you want the 2.5% and 97.5% values, as I don't think proc means can produce those quantiles).
proc means data=shannon mean p1 p5 p95 p99;
var shannon;
run;
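If a 95% percentile interval is what you are after, a sketch along these lines with proc univariate should give the 2.5% and 97.5% quantiles of the bootstrap distribution. The output data set name ci and the prefix shannon_ are arbitrary choices for this example.

```sas
proc univariate data=shannon noprint;
  var shannon;
  /* Creates variables shannon_2_5 and shannon_97_5 in work.ci */
  output out=ci pctlpts=2.5 97.5 pctlpre=shannon_;
run;

proc print data=ci;
run;
```

This is the simple percentile interval; other bootstrap intervals (for example bias-corrected ones) would need extra steps.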