Sum previous rows - SAS

Dv1 Dv2 Dv3 Dv4 Dv5 Dv6 Dv7 Dv8
1 1 2 5 5 7 9 9
3 4 8 8 8 9 10 .
2 5 9 11 13 13 . .
4 4 5 9 9 . . .
2 6 7 9 . . . .
2 4 6 . . . . .
1 3 . . . . . .
3 . . . . . . .
I have a much larger version of the above data. Each column has a factor which, when multiplied by the previous column's value, gives the current column's value.
The factor = (sum of the previous 5 rows)/(sum of the previous 5 rows one column to the left)
e.g. Column 2 factor = (3+4+6+4+5)/(1+2+2+4+2) = 2, and the resulting data is:
Dv1 Dv2 Dv3 Dv4 Dv5 Dv6 Dv7 Dv8
1 1 2 5 5 7 9 9
3 4 8 8 8 9 10 .
2 5 9 11 13 13 . .
4 4 5 9 9 . . .
2 6 7 9 . . . .
2 4 6 . . . . .
1 3 . . . . . .
3 6 . . . . . .
Use any available rows if fewer than 5 exist above the data.
I want to fill out this data using SAS. My problem is how to sum the previous 5 rows; I'm fairly confident I can proceed from there.
Many thanks in advance!

Use the LAG function:
sum_prev5 = lag(x) + lag2(x) + lag3(x) + lag4(x) + lag5(x);
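A minimal sketch of that idea, using the Dv1 column from the question as a hypothetical variable x. It swaps the + operators for the SUM function, on the assumption that "use any available rows" means missing lags should be ignored rather than making the whole sum missing. The dataset names are made up for illustration.

```sas
data have;
  input x;
  datalines;
1
3
2
4
2
2
1
3
;

data want;
  set have;
  /* Call every LAG on every row so each queue stays in step with the data. */
  /* SUM ignores missing arguments, so rows with fewer than 5 predecessors   */
  /* use whatever is available.                                              */
  sum_prev5 = sum(lag(x), lag2(x), lag3(x), lag4(x), lag5(x));
run;
```

For the eighth row this gives 2+4+2+2+1 = 11, the denominator in the question's example. On the first row every lag is missing, so sum_prev5 is missing as well.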

Related

SAS: How to calculate moving average in SAS using current observation?

I am trying to calculate a moving average for a test data set in SAS, where I want each calculated moving average to feed into the next one. I have added a sample calculation below.
I have data like this:
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
I want to calculate a 4-week moving average at the category level.
The expected output is below:
data want;
input category week value moving_average;
datalines;
a 1 10 .
a 2 5 .
a 3 . .
a 4 30 .
a 5 50 .
a 6 . 28.33
a 7 . 36.11
a 8 . 34.86
b 1 30 .
b 2 5 .
b 3 . .
b 4 0 .
b 5 50 .
b 6 . 18.33
b 7 . 22.77
b 8 . 22.775
b 9 . 28.46
So here is the logic for b:
For Week 6: (50+0+5)/3 = 18.33
For Week 7: (18.33+50+0)/3 = 22.77
For Week 8: (22.77+18.33+50+0)/4 = 22.775
A similar calculation can be done for a.
Weeks up to 5 can be considered training data; the weeks after that are test data.
I hope I have made my problem statement clear this time.
So you want to create new observations? You will need an explicit OUTPUT statement.
You can use a "circular array" to make it easier to calculate the average.
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
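data want;
set have;
by category ;
/* circular buffer holding the 4 most recent values */
array c_array [0:3] _temporary_ ;
if first.category then call missing(of c_array[*]);
if week <= 5 then c_array[mod(week,4)]=value;
output;
/* after the last observed week, append weeks 6 to 9 */
if week=5 then do week=6 to 9;
value=.;
average=mean(of c_array[*]);
output;
/* feed the new average back into the buffer */
c_array[mod(week,4)]=average;
end;
run;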
Results
Obs category week value average
1 a 1 10 .
2 a 2 5 .
3 a 3 . .
4 a 4 30 .
5 a 5 50 .
6 a 6 . 28.3333
7 a 7 . 36.1111
8 a 8 . 36.1111
9 a 9 . 37.6389
10 b 1 30 .
11 b 2 5 .
12 b 3 . .
13 b 4 0 .
14 b 5 50 .
15 b 6 . 18.3333
16 b 7 . 22.7778
17 b 8 . 22.7778
18 b 9 . 28.4722

How to Capture previous row value and perform subtraction

Refer to Table 1 as the main data and Table 2 as the desired output. Let me explain in detail: Closing_Bal is derived from (Opening_Bal - EMI), e.g. (20 - 2) = 18. I want that value, 18, in the 2nd row under the Opening_Bal column, then (Opening_Bal - EMI) again, and so on until a new LAN. When a new LAN appears, the loop should start again.
I have tried the lag function but am not able to run the loop.
Try this
data A;
infile datalines dlm = '|' dsd;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov|1|20|2|18
2_Dec|1| |3|
3_Jan|1| |5|
4_Feb|1| |3|
1_Nov|2|30|4|26
2_Dec|2| |3|
3_Jan|2| |2|
4_Feb|2| |5|
5_Mar|2| |6|
;
data B(drop = c);
set A;
by LAN;
if first.LAN then c = Closing_Bal;
if Opening_Bal = . then do;
Opening_Bal = c;
Closing_Bal = Opening_Bal - EMI;
c = Closing_Bal;
end;
retain c;
run;
Result:
Month LAN Opening_Bal EMI Closing_Bal
1_Nov 1 20 2 18
2_Dec 1 18 3 15
3_Jan 1 15 5 10
4_Feb 1 10 3 7
1_Nov 2 30 4 26
2_Dec 2 26 3 23
3_Jan 2 23 2 21
4_Feb 2 21 5 16
5_Mar 2 16 6 10
The problem is that you already have CLOSING_BAL on the input dataset, so when the SET statement reads a new observation it will overwrite the value calculated on the previous observation. Either drop or rename the variable in the source dataset.
Example:
data have;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov 1 20 2 18
2_Dec 1 . 3 .
3_Jan 1 . 5 .
4_Feb 1 . 3 .
1_Nov 2 30 4 26
2_Dec 2 . 3 .
3_Jan 2 . 2 .
4_Feb 2 . 5 .
5_Mar 2 . 6 .
;
data want;
set have (drop=closing_bal);
retain Closing_Bal;
Opening_Bal=coalesce(Opening_Bal,Closing_Bal);
Closing_bal=Opening_bal - EMI ;
run;
Results:
Opening_ Closing_
Obs Month LAN Bal EMI Bal
1 1_Nov 1 20 2 18
2 2_Dec 1 18 3 15
3 3_Jan 1 15 5 10
4 4_Feb 1 10 3 7
5 1_Nov 2 30 4 26
6 2_Dec 2 26 3 23
7 3_Jan 2 23 2 21
8 4_Feb 2 21 5 16
9 5_Mar 2 16 6 10
I am not sure this works
data B;
set A;
by lan;
if not first.lan then do;
opening_bal = lag(closing_bal);
closing_bal = opening_bal - EMI;
end;
run;
because you don't execute lag for each observation.
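A small sketch of that pitfall, using a made-up dataset and variable names. Each LAG call keeps its own queue, and the queue is only updated when the call actually executes, so a LAG inside a condition returns the value from the last time that statement ran, not the value from the previous observation.

```sas
data demo;
  input x;
  /* executed on every row: returns the previous row's x */
  prev_always = lag(x);
  /* executed on even rows only: returns x from the previous even row */
  if mod(_n_, 2) = 0 then prev_cond = lag(x);
  datalines;
1
2
3
4
;
```

On the fourth row prev_always is 3, while prev_cond is 2, because the conditional LAG last ran on the second row.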

Calculate mean of last 5 years exclude missing

I have a dataset like this:
Year Dv1 Dv2 Dv3 Dv4
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
I want to average the last 5 years with data in each column for a summary line, so ideally I would like my results to look like:
Year Dv1 Dv2 Dv3 Dv4
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
Avg5 2.4 4.4 7 8.4
Is there a way to do this in SAS? I tried some things with Proc Expand and Lag, but not getting what I want with those.
I don't quite see the need, but if you must: I assume that Year is a character variable, since you want the value 'Avg5' in it. I also assume you want the result to be a data set, since that is what Proc Expand and the Data Step produce.
data have;
input Year $ Dv1 Dv2 Dv3 Dv4;
datalines;
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
;
data want;
array lag1[0:4] _temporary_;
array lag2[0:4] _temporary_;
array lag3[0:4] _temporary_;
array lag4[0:4] _temporary_;
do _N_ = 1 by 1 until (z);
set have end = z;
array dv dv:;
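/* test for missing explicitly so that a value of 0 is still stored; */
/* indexing by _N_ assumes missing values occur only at the end of each column */
if not missing(Dv1) then lag1[mod(_N_, 5)] = Dv1;
if not missing(Dv2) then lag2[mod(_N_, 5)] = Dv2;
if not missing(Dv3) then lag3[mod(_N_, 5)] = Dv3;
if not missing(Dv4) then lag4[mod(_N_, 5)] = Dv4;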
output;
end;
Year = 'Avg5';
Dv1 = mean(of lag1[*]);
Dv2 = mean(of lag2[*]);
Dv3 = mean(of lag3[*]);
Dv4 = mean(of lag4[*]);
output;
run;
An easy way to get statistics on the last N values is to store them in a "wrap around" array. It would be simpler if you transposed the data. Then you would only need to find the last five non-missing observations for one variable per year/dv# group.
But here is a solution using multiple arrays: one to keep track of the number of non-missing values seen and the other to store the last 5 values.
First let's convert your listing into a dataset.
data have ;
input Year $ dv1-dv4 ;
cards;
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
;
Now process the data, storing the last 5 values in the array and copying the input back out. When you get to the end, calculate the means and output the extra observation.
data want;
set have end=eof;
array next [4] _temporary_;
array last5 [4,0:4] _temporary_ ;
array dv dv1-dv4 ;
do index=1 to dim(dv);
if not missing(dv[index]) then do;
next[index]+1;
last5[index,mod(next[index],5)]=dv[index];
end;
end;
output;
if eof then do ;
year='Avg5';
do index=1 to dim(dv);
dv[index]=mean(last5[index,0],last5[index,1],last5[index,2],last5[index,3],last5[index,4]);
end;
output;
end;
drop index;
run;
Results
Obs Year dv1 dv2 dv3 dv4
1 2014 1.0 1.0 2 5.0
2 2015 3.0 4.0 8 8.0
3 2016 2.0 5.0 9 11.0
4 2017 4.0 4.0 5 9.0
5 2018 2.0 6.0 7 9.0
6 2019 2.0 4.0 6 .
7 2020 1.0 3.0 . .
8 2021 3.0 . . .
9 Avg5 2.4 4.4 7 8.4

Bidirectional Vlookup - flag in the same table - SAS

I need to do this:
table 1:
ID Cod.
1 20
2 102
4 30
7 10
9 201
10 305
table 2:
ID Cod.
1 20
2 50
3 15
4 30
5 25
7 10
10 300
Now, I got a table like this with an outer join:
ID Cod. ID1 Cod1.
1 20 1 20
2 50 . .
. . 2 102
3 15 . .
4 30 4 30
5 25 . .
7 10 7 10
. . 9 201
10 300 . .
. . 10 305
Now I want to add a flag that tells me whether the ID has common values, so:
ID Cod. ID1 Cod1. Flag_ID Flag_cod
1 20 1 20 0 0
2 50 . . 0 1
. . 2 102 0 1
3 15 . . 1 1
4 30 4 30 0 0
5 25 . . 1 1
7 10 7 10 0 0
. . 9 201 1 1
10 300 . . 0 1
. . 10 305 0 1
I would like to know how I can get Flag_ID, specifically to cover the cases of ID = 2 and ID = 10.
Thank you
You can group by a coalescence of id in order to count and compare details.
Example
data table1;
input id code @@; datalines;
1 20 2 102 4 30 7 10 9 201 10 305
;
data table2;
input id code @@; datalines;
1 20 2 50 3 15 4 30 5 25 7 10 10 300
;
proc sql;
create table got as
select
table2.id, table2.code
, table1.id as id1, table1.code as code1
, case
when count(table1.id) = 1 and count(table2.id) = 1 then 0 else 1
end as flag_id
, case
when table1.code - table2.code ne 0 then 1 else 0
end as flag_code
from
table1
full join
table2
on
table2.id=table1.id and table2.code=table1.code
group by
coalesce(table2.id,table1.id)
;
quit;
You might also want to look into
Proc COMPARE with BY

Convert one to many with 2 digits

I am currently handling a data set in Stata generated through ODK, the open data kit.
There is an option to answer questions with multiple answers. E.g. in my questionnaire "Which of these assets do you own?" and the interviewer tagged all the answers out of 20 options.
This generated for me a string variable with contents such as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
As this is difficult to analyse for several hundred participants, I wanted to generate new variables creating a 1 or 0 for each of the answer options.
For the variable hou_as I tried to generate the variables hou_as_1, hou_as_2 etc. with the following code:
foreach p in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 {
local P : subinstr local p "-" ""
gen byte hou_as_`P' = strpos(hou_as, "`p'") > 0
}
For the single digits this brings the problem that the variable hou_as_1 is also filled with a 1 if any of the 10 11 12 ... 19 is filled even if the option 1 was not chosen. Similarly hou_as_2 is filled when the option 2, 12 or 20 is checked.
How can I avoid this issue?
You want 20 indicator or dummy variables. Note first that it's much easier to use forval to loop 1(1)20, e.g.
forval j = 1/20 {
gen hou_as_`j' = 0
}
initialises 20 such variables as 0.
I think it's easier to loop over the words of your answer variables, words being here just whatever is separated by spaces. There are at most 20 words, and it is a little crude but likely to be fast enough to go
forval j = 1/20 {
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Let's put that together and try it out on your example:
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end
forval j = 1/20 {
gen hou_as_`j' = 0
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Just to show that it worked:
. list in 3
+----------------------------------------------------------------------------+
3. | hou_as | hou_as_1 | hou_as_2 | hou_as_3 | hou_as_4 | hou_as_5 | hou_as_6 |
| 1 3 9 11 | 1 | 0 | 1 | 0 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_as_7 | hou_as_8 | hou_as_9 | hou_a~10 | hou_a~11 | hou_a~12 | hou_a~13 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_a~14 | hou_a~15 | hou_a~16 | hou_a~17 | hou_a~18 | hou_a~19 | hou_a~20 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------------------------------------------+
Incidentally, your line
local P : subinstr local p "-" ""
does nothing useful. The local macro p only ever has contents which are integer digits, so there is no punctuation at all to remove.
See also this explanation and
. search multiple responses, sj
Search of official help files, FAQs, Examples, SJs, and STBs
SJ-5-1 st0082 . . . . . . . . . . . . . . . Tabulation of multiple responses
(help _mrsvmat, mrgraph, mrtab if installed) . . . . . . . . B. Jann
Q1/05 SJ 5(1):92--122
introduces new commands for the computation of one- and
two-way tables of multiple responses
SJ-3-1 pr0008 Speaking Stata: On structure & shape: the case of mult. resp.
. . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox & U. Kohler
Q1/03 SJ 3(1):81--99 (no commands)
discussion of data manipulations for multiple response data