I am trying to calculate a moving average for a test data set in SAS, where each newly calculated moving average should be used as an input to the next one. I have added a sample calculation below.
I have data like this:
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
I want to calculate a 4-week moving average at the category level.
Here is the expected output:
data want;
input category $ week value moving_average;
datalines;
a 1 10 .
a 2 5 .
a 3 . .
a 4 30 .
a 5 50 .
a 6 . 28.33
a 7 . 36.11
a 8 . 34.86
b 1 30 .
b 2 5 .
b 3 . .
b 4 0 .
b 5 50 .
b 6 . 18.33
b 7 . 22.77
b 8 . 22.775
b 9 . 28.46
;
Here is the logic for b:
For Week 6: (50+0+5)/3 = 18.33
For Week 7: (18.33+50+0)/3 = 22.77
For Week 8: (22.77+18.33+50+0)/4 = 22.775
A similar calculation can be done for a.
**Weeks 1 through 5 can be treated as training data; everything after week 5 is test data.**
I hope the problem statement is clear this time.
So you want to create new observations? You will need an explicit OUTPUT statement.
You can use a "circular array" to make it easier to calculate the average.
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
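data want;
  set have;
  by category ;
  /* four slots indexed 0-3; slot mod(week,4) holds that week's value */
  array c_array [0:3] _temporary_ ;
  if first.category then call missing(of c_array[*]);
  if week <= 5 then c_array[mod(week,4)]=value;
  output;
  /* after the last observed week, append weeks 6 to 9 */
  if week=5 then do week=6 to 9;
    value=.;
    average=mean(of c_array[*]);  /* MEAN ignores missing slots */
    output;
    c_array[mod(week,4)]=average; /* feed the new average back into the window */
  end;
run;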
Results
Obs category week value average
1 a 1 10 .
2 a 2 5 .
3 a 3 . .
4 a 4 30 .
5 a 5 50 .
6 a 6 . 28.3333
7 a 7 . 36.1111
8 a 8 . 36.1111
9 a 9 . 37.6389
10 b 1 30 .
11 b 2 5 .
12 b 3 . .
13 b 4 0 .
14 b 5 50 .
15 b 6 . 18.3333
16 b 7 . 22.7778
17 b 8 . 22.7778
18 b 9 . 28.4722
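To see why four slots are enough, it can help to list which slot each week lands in. This is only an illustrative sketch; the `slots` data set and the `slot` variable are made-up names and not part of the answer above.

```sas
data slots;
  do week = 1 to 9;
    slot = mod(week, 4);  /* index into c_array[0:3] that this week overwrites */
    output;
  end;
run;

proc print data=slots noobs;
run;
```

Week 5 reuses slot 1 (overwriting week 1), week 6 reuses slot 2, and so on, so the array always holds the four most recent values.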
How to Capture previous row value and perform subtraction
Refer to Table 1 as the main data and Table 2 as the desired output. Let me explain in detail: Closing_Bal is derived as (Opening_Bal - EMI), e.g. (20 - 2) = 18. I want that value 18 to appear in the 2nd row under the Opening_Bal column, then compute (Opening_Bal - EMI) again, and so on until a new LAN starts. When a new LAN appears, the loop should start again.
I have tried the LAG function but I am not able to run the loop.
Try this
data A;
infile datalines dlm = '|' dsd;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov|1|20|2|18
2_Dec|1| |3|
3_Jan|1| |5|
4_Feb|1| |3|
1_Nov|2|30|4|26
2_Dec|2| |3|
3_Jan|2| |2|
4_Feb|2| |5|
5_Mar|2| |6|
;
data B(drop = c);
set A;
by LAN;
if first.LAN then c = Closing_Bal;
if Opening_Bal = . then do;
Opening_Bal = c;
Closing_Bal = Opening_Bal - EMI;
c = Closing_Bal;
end;
retain c;
run;
Result:
Month LAN Opening_Bal EMI Closing_Bal
1_Nov 1 20 2 18
2_Dec 1 18 3 15
3_Jan 1 15 5 10
4_Feb 1 10 3 7
1_Nov 2 30 4 26
2_Dec 2 26 3 23
3_Jan 2 23 2 21
4_Feb 2 21 5 16
5_Mar 2 16 6 10
The problem is that you already have CLOSING_BAL on the input dataset, so when the SET statement reads a new observation it will overwrite the value calculated on the previous observation. Either drop or rename the variable in the source dataset.
Example:
data have;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov 1 20 2 18
2_Dec 1 . 3 .
3_Jan 1 . 5 .
4_Feb 1 . 3 .
1_Nov 2 30 4 26
2_Dec 2 . 3 .
3_Jan 2 . 2 .
4_Feb 2 . 5 .
5_Mar 2 . 6 .
;
data want;
set have (drop=closing_bal);
retain Closing_Bal;
Opening_Bal=coalesce(Opening_Bal,Closing_Bal);
Closing_bal=Opening_bal - EMI ;
run;
Results:
Opening_ Closing_
Obs Month LAN Bal EMI Bal
1 1_Nov 1 20 2 18
2 2_Dec 1 18 3 15
3 3_Jan 1 15 5 10
4 4_Feb 1 10 3 7
5 1_Nov 2 30 4 26
6 2_Dec 2 26 3 23
7 3_Jan 2 23 2 21
8 4_Feb 2 21 5 16
9 5_Mar 2 16 6 10
I am not sure this works
data B;
set A;
by lan;
if not first.lan then do;
opening_bal = lag(closing_bal);
closing_bal = opening_bal - EMI;
end;
run;
because you don't execute lag for each observation.
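A small sketch of why conditional LAG calls are a problem. LAG returns the value stored the last time that particular call executed, not the value from the previous row. The `demo` data set and the variable names here are hypothetical.

```sas
data demo;
  input x;
  /* runs on every row, so it really is the previous row's x */
  prev_always = lag(x);
  /* runs only on even rows, so its queue only ever sees even-row values */
  if mod(_n_, 2) = 0 then prev_cond = lag(x);
datalines;
1
2
3
4
;

proc print data=demo noobs;
run;
```

On row 4, `prev_always` is 3 but `prev_cond` is 2, the value stored when the conditional call last ran on row 2. The usual workaround is to call LAG unconditionally and then decide whether to use the result.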
I want to count the distinct values of a variable grouped by MEMBER_ID and a rolling date range of 5 years. I have seen a similar post.
How to Count Distinct for SAS PROC SQL with Rolling Date Window?
When I change h2.DATE BETWEEN h.DATE - 180 AND h.DATE to h2.year BETWEEN h.year-5 AND h.year, should it give me the correct distinct count within the last 5 years? Thank you in advance.
data have;
input permno year Cand_ID$;
datalines;
1 2000 1
1 2001 2
1 2002 3
1 2003 1
1 2004 3
1 2005 1
2 2000 1
2 2001 3
2 2002 1
2 2003 2
2 2004 2
2 2005 2
2 2006 1
2 2007 1
3 2001 3
3 2002 3
3 2003 3
3 2004 1
3 2005 1
;
run;
Here's how you can do it with a data step. This assumes you have values for all years. If you do not, fill it in with zeros.
Use the lag function to keep a rolling list of the IDs from the last 5 years. If that list is kept in a sorted array, we can count the distinct values in it on each row to get a rolling 5-year count.
In other words, we're going to create and count a list that looks like this:
permno year id1 id2 id3 id4 id5
1 2000 . . . . 1
1 2001 . . . 1 2
1 2002 . . 1 2 3
1 2003 . 1 1 2 3
Code:
data want;
set have;
by permno year;
array lagid[4] $;
array id[5] $;
id1 = cand_id;
lagid1 = lag1(cand_id);
lagid2 = lag2(cand_id);
lagid3 = lag3(cand_id);
lagid4 = lag4(cand_id);
/* Reset the counter for the first group */
if(first.permno) then n = 0;
/* Count the number of rows within a group */
n+1;
/* Save the last 5 years by using the lag function,
but do not get lags from previous groups
*/
do i = 1 to 4;
if(i < n) then id[i+1] = lagid[i];
end;
/* Sort the array of IDs into ascending order */
call sortc(of id:);
/* Count the number of distinct IDs in the array. Do not count
missing values.
*/
n_distinct = 1;
do i = 2 to dim(id);
if(id[i] > id[i-1] AND NOT missing(id[i-1]) ) then n_distinct+1;
end;
drop lag: n i;
run;
Output (without id: dropped):
permno year Cand_ID id1 id2 id3 id4 id5 n_distinct
1 2000 1 . . . . 1 1
1 2001 2 . . . 1 2 2
1 2002 3 . . 1 2 3 3
1 2003 1 . 1 1 2 3 3
1 2004 3 1 1 2 3 3 3
1 2005 1 1 1 2 3 3 3
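For the PROC SQL variant the question asks about, here is a hedged sketch using a self-join. Note that `BETWEEN h.year - 5 AND h.year` is inclusive on both ends and therefore spans six calendar years; `h.year - 4` matches the five-row window used in the data step above. The names `have_small` and `want_sql` are made up for this example.

```sas
data have_small;   /* hypothetical subset of the question's data */
  input permno year cand_id $ @@;
datalines;
1 2000 1  1 2001 2  1 2002 3  1 2003 1  1 2004 3  1 2005 1
;

proc sql;
  create table want_sql as
  select h.permno, h.year,
         count(distinct h2.cand_id) as n_distinct
  from have_small as h
       left join have_small as h2
         on  h2.permno = h.permno
         and h2.year between h.year - 4 and h.year  /* 5 years inclusive */
  group by h.permno, h.year;
quit;
```

Unlike the lag-based approach, this compares actual year values, so it does not depend on every year being present.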
I need to do this:
table 1:
ID Cod.
1 20
2 102
4 30
7 10
9 201
10 305
table 2:
ID Cod.
1 20
2 50
3 15
4 30
5 25
7 10
10 300
Now, I got a table like this with an outer join:
ID Cod. ID1 Cod1.
1 20 1 20
2 50 . .
. . 2 102
3 15 . .
4 30 4 30
5 25 . .
7 10 7 10
. . 9 201
10 300 . .
. . 10 305
Now I want to add flags that tell me whether an ID is present in both tables and whether its codes match, like this:
ID Cod. ID1 Cod1. Flag_ID Flag_cod
1 20 1 20 0 0
2 50 . . 0 1
. . 2 102 0 1
3 15 . . 1 1
4 30 4 30 0 0
5 25 . . 1 1
7 10 7 10 0 0
. . 9 201 1 1
10 300 . . 0 1
. . 10 305 0 1
I would like to know how I can get Flag_ID, specifically so that it covers cases like ID = 2 or ID = 10.
Thank you
You can group by a coalescence of id in order to count and compare details.
Example
data table1;
input id code @@; datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
input id code @@; datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;
proc sql;
create table got as
select
table2.id, table2.code
, table1.id as id1, table1.code as code1
, case
when count(table1.id) = 1 and count(table2.id) = 1 then 0 else 1
end as flag_id
, case
when table1.code - table2.code ne 0 then 1 else 0
end as flag_code
from
table1
full join
table2
on
table2.id=table1.id and table2.code=table1.code
group by
coalesce(table2.id,table1.id)
;
quit;
You might also want to look into
Proc COMPARE with BY
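A minimal sketch of what the PROC COMPARE route could look like. It uses an ID statement to match rows by `id` rather than a BY statement, which is my assumption about the simplest setup for this data. The output data set name `diffs` is made up.

```sas
data table1;
  input id code @@;
datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
  input id code @@;
datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;

proc compare base=table1 compare=table2
             out=diffs outnoequal noprint;
  id id;   /* match rows on id; both inputs must already be sorted by id */
run;

proc print data=diffs noobs;
run;
```

Dropping NOPRINT gives the full comparison report, which also lists IDs found in only one of the two tables.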
I have data which is as follows.
data have;
input group replicate $ sex $ count;
datalines;
1 A F 3
1 A M 2
1 B F 4
1 B M 2
1 C F 4
1 C M 5
2 A F 5
2 A M 4
2 B F 6
2 B M 3
2 C F 2
2 C M 2
3 A F 5
3 A M 1
3 B F 3
3 B M 4
3 C F 3
3 C M 1
;
run;
I want to break the count column into two separate columns based on gender.
count_ count_
Obs group replicate female male
1 1 A 3 2
2 1 B 4 2
3 1 C 4 5
4 2 A 5 4
5 2 B 6 3
6 2 C 2 2
7 3 A 5 1
8 3 B 3 4
9 3 C 3 1
This can be done by first creating a separate data set for each level of sex and then merging them.
data just_female;
set have;
where sex = 'F';
rename count = count_female;
run;
data just_male;
set have;
where sex = 'M';
rename count = count_male;
run;
data want;
merge
just_female
just_male
;
by
group
replicate
;
keep
group
replicate
count_female
count_male
;
run;
Is there a less verbose way to do this that doesn't require sorting or explicitly dropping/keeping variables?
You can do this using proc transpose but you will need to sort the data. I believe this is what you're looking for though.
proc sort data=have;
by group replicate;
run;
The data is sorted so now you have your by-group for transposing.
proc transpose data=have out=want(drop=_name_) prefix=count_;
by group replicate;
id sex;
var count;
run;
proc print data=want;
run;
Then you get:
Obs group replicate count_F count_M
1 1 A 3 2
2 1 B 4 2
3 1 C 4 5
4 2 A 5 4
5 2 B 6 3
6 2 C 2 2
7 3 A 5 1
8 3 B 3 4
9 3 C 3 1
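If you would rather stay in a data step, the two intermediate data sets can be folded into data set options on a single MERGE. This is only a sketch: it still needs the input sorted by the BY variables (the sample data already is) and it still drops `sex` explicitly. `have_small` and `want_merge` are made-up names.

```sas
data have_small;   /* hypothetical subset of the question's data */
  input group replicate $ sex $ count;
datalines;
1 A F 3
1 A M 2
1 B F 4
1 B M 2
;

data want_merge;
  /* read the same data set twice, once per sex, renaming count on the way in */
  merge have_small(where=(sex='F') rename=(count=count_female))
        have_small(where=(sex='M') rename=(count=count_male));
  by group replicate;
  drop sex;
run;
```

This keeps the column names `count_female` and `count_male` from the question, whereas PROC TRANSPOSE builds the names from the values of `sex`.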
Dv1 Dv2 Dv3 Dv4 Dv5 Dv6 Dv7 Dv8
1 1 2 5 5 7 9 9
3 4 8 8 8 9 10 .
2 5 9 11 13 13 . .
4 4 5 9 9 . . .
2 6 7 9 . . . .
2 4 6 . . . . .
1 3 . . . . . .
3 . . . . . . .
I have a much larger version of the above data. Each column has a factor which when multiplied by the previous column data gives the current column data.
The factor = (sum of the previous 5 rows)/(sum of the previous 5 rows one column to the left)
eg. Column 2 factor = (3+4+6+4+5)/(1+2+2+4+2) = 2 and the resulting data being:
Dv1 Dv2 Dv3 Dv4 Dv5 Dv6 Dv7 Dv8
1 1 2 5 5 7 9 9
3 4 8 8 8 9 10 .
2 5 9 11 13 13 . .
4 4 5 9 9 . . .
2 6 7 9 . . . .
2 4 6 . . . . .
1 3 . . . . . .
3 6 . . . . . .
Use any available rows if 5 do not exist above the data.
I want to fill out this data using SAS. My problem is how to sum the previous 5 rows; I'm fairly confident I can proceed from there.
Many thanks in advance!
LAG function.
sum_prev5 = lag(x) + lag2(x) + lag3(x) + lag4(x) + lag5(x);
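One caveat with the `+` operator: if any of the lags is missing (as on the first five rows), the whole sum is missing. Since the question says to use whatever rows are available, the SUM function, which ignores missing arguments, may be a better fit. A sketch using the Dv1 column, with `demo` and `x` as placeholder names:

```sas
data demo;
  input x;
  /* SUM() skips missing arguments, so early rows use the lags that exist */
  sum_prev5 = sum(lag1(x), lag2(x), lag3(x), lag4(x), lag5(x));
datalines;
1
3
2
4
2
2
1
3
;

proc print data=demo noobs;
run;
```

On the last row this gives 2+4+2+2+1 = 11, the denominator used in the question's worked example.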