I have a dataset like this:
Year Dv1 Dv2 Dv3 Dv4
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
I want to average the last 5 years that have data for each column and add the result as a summary line, so ideally I would like my results to look like:
Year Dv1 Dv2 Dv3 Dv4
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
Avg5 2.4 4.4 7 8.4
Is there a way to do this in SAS? I tried some things with Proc Expand and Lag, but I am not getting what I want with those.
I don't quite see the need, but if you must. I assume that Year is a character variable, since you want the value 'Avg5' in it. I also assume you want the result to be a data set, since that is what Proc Expand and the Data Step produce.
data have;
input Year $ Dv1 Dv2 Dv3 Dv4;
datalines;
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
;
data want;
array lag1[0:4] _temporary_;
array lag2[0:4] _temporary_;
array lag3[0:4] _temporary_;
array lag4[0:4] _temporary_;
do _N_ = 1 by 1 until (z);
set have end = z;
array dv dv:;
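/* Test for missing explicitly so that a legitimate value of 0 is not skipped */
if not missing(Dv1) then lag1[mod(_N_, 5)] = Dv1;
if not missing(Dv2) then lag2[mod(_N_, 5)] = Dv2;
if not missing(Dv3) then lag3[mod(_N_, 5)] = Dv3;
if not missing(Dv4) then lag4[mod(_N_, 5)] = Dv4;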
output;
end;
Year = 'Avg5';
Dv1 = mean(of lag1[*]);
Dv2 = mean(of lag2[*]);
Dv3 = mean(of lag3[*]);
Dv4 = mean(of lag4[*]);
output;
run;
An easy way to get statistics on the last N values is to store them in a "wrap around" array. It would be simpler if you transposed the data. Then you only need to find the last five non-missing observations for one variable per year/dv# group.
But here is a solution using multiple arrays. One to keep track of the number of non-missing values seen and the other to store the last 5 values.
First let's convert your listing into a dataset.
data have ;
input Year $ dv1-dv4 ;
cards;
2014 1 1 2 5
2015 3 4 8 8
2016 2 5 9 11
2017 4 4 5 9
2018 2 6 7 9
2019 2 4 6 .
2020 1 3 . .
2021 3 . . .
;
Now process the data, storing the last five non-missing values of each variable in the array and copying the input back out. When you get to the end, calculate the means and output the extra observation.
data want;
set have end=eof;
array next [4] _temporary_;
array last5 [4,0:4] _temporary_ ;
array dv dv1-dv4 ;
do index=1 to dim(dv);
if not missing(dv[index]) then do;
next[index]+1;
last5[index,mod(next[index],5)]=dv[index];
end;
end;
output;
if eof then do ;
year='Avg5';
do index=1 to dim(dv);
dv[index]=mean(last5[index,0],last5[index,1],last5[index,2],last5[index,3],last5[index,4]);
end;
output;
end;
drop index;
run;
Results
Obs Year dv1 dv2 dv3 dv4
1 2014 1.0 1.0 2 5.0
2 2015 3.0 4.0 8 8.0
3 2016 2.0 5.0 9 11.0
4 2017 4.0 4.0 5 9.0
5 2018 2.0 6.0 7 9.0
6 2019 2.0 4.0 6 .
7 2020 1.0 3.0 . .
8 2021 3.0 . . .
9 Avg5 2.4 4.4 7 8.4
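The transposed alternative mentioned above can be sketched roughly as follows. This is only an illustration, not the answerer's code: it assumes the `have` dataset created above (character Year, dv1-dv4), and the names `long`, `name`, `value`, `avg5`, `n`, `total` and `mean5` are made up for the example. It produces one average per variable in long form rather than appending an 'Avg5' row.

```sas
/* Reshape to long form: one row per Year/variable pair, non-missing only */
data long;
  set have;
  array dv dv1-dv4;
  length name $32;
  do i = 1 to dim(dv);
    if not missing(dv[i]) then do;
      name = vname(dv[i]);   /* name of the variable behind this array element */
      value = dv[i];
      output;
    end;
  end;
  keep Year name value;
run;

/* Newest year first within each variable */
proc sort data=long;
  by name descending Year;
run;

/* Average the first (most recent) five rows of each variable */
data avg5;
  set long;
  by name;
  if first.name then do;
    n = 0;
    total = 0;
  end;
  n + 1;                          /* sum statement: n is retained across rows */
  if n <= 5 then total + value;   /* only the five newest values contribute */
  if last.name then do;
    mean5 = total / min(n, 5);
    output;
  end;
  keep name mean5;
run;
```

With the sample data this should give 2.4, 4.4, 7 and 8.4 for dv1 to dv4, the same values as the Avg5 row above.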
Related
I am trying to calculate a moving average for a test data set in SAS, where I want each calculated moving average to feed into the next moving average. I have added a sample calculation below.
I have data something like this
data have;
input category week value ;
datalines;
a 1 10
a 2 5
a 3
a 4 30
a 5 50
b 1 30
b 2 5
b 3
b 4 0
b 5 50
;
I want to calculate a 4-week moving average at the category level.
Here is the expected output:
data want;
input category week value moving_average;
datalines;
a 1 10 .
a 2 5 .
a 3 . .
a 4 30 .
a 5 50 .
a 6 . 28.33
a 7 . 36.11
a 8 . 34.86
b 1 30 .
b 2 5 .
b 3 . .
b 4 0 .
b 5 50 .
b 6 . 18.33
b 7 . 22.77
b 8 . 22.775
b 9 . 28.46
Here is the logic for b:
For Week 6: (50+0+5)/3 = 18.33
For Week 7: (18.33+50+0)/3 = 22.77
For Week 8: (22.77+18.33+50+0)/4 = 22.775
A similar calculation can be done for a.
Weeks 1 to 5 can be treated as training data, and the weeks after week 5 as test data.
I hope I have made my problem statement clear this time.
So you want to create new observations? You will need an explicit OUTPUT statement.
You can use a "circular array" to make it easier to calculate the average.
data have;
input category $ week value ;
datalines;
a 1 10
a 2 5
a 3 .
a 4 30
a 5 50
b 1 30
b 2 5
b 3 .
b 4 0
b 5 50
;
data want;
set have;
by category ;
array c_array [0:3] _temporary_ ;
if first.category then call missing(of c_array[*]);
if week <= 5 then c_array[mod(week,4)]=value;
output;
if week=5 then do week=6 to 9;
value=.;
average=mean(of c_array[*]);
output;
c_array[mod(week,4)]=average;
end;
run;
Results
Obs category week value average
1 a 1 10 .
2 a 2 5 .
3 a 3 . .
4 a 4 30 .
5 a 5 50 .
6 a 6 . 28.3333
7 a 7 . 36.1111
8 a 8 . 36.1111
9 a 9 . 37.6389
10 b 1 30 .
11 b 2 5 .
12 b 3 . .
13 b 4 0 .
14 b 5 50 .
15 b 6 . 18.3333
16 b 7 . 22.7778
17 b 8 . 22.7778
18 b 9 . 28.4722
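The circular array in the answer above relies on mod(week,4) cycling through the four slots 0 to 3, so each new value overwrites the one stored four weeks earlier. A tiny sketch that only prints the slot used for each week (the variable name `slot` is made up for the illustration):

```sas
data _null_;
  do week = 1 to 9;
    slot = mod(week, 4);   /* array index that this week reads or overwrites */
    put week= slot=;       /* written to the SAS log */
  end;
run;
```

Week 5 lands in slot 1, the same slot as week 1, which is why only the four most recent values (observed or computed) ever take part in the mean.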
How to Capture previous row value and perform subtraction
Refer to Table 1 as the main data and Table 2 as the desired output. Closing_Bal is derived as (Opening_Bal - EMI); for example, 20 - 2 = 18. I want that value of 18 in the 2nd row under the Opening_Bal column, then (Opening_Bal - EMI) again, and so on until a new LAN. When a new LAN appears, the loop should start again.
I have tried the lag function but am not able to run the loop.
Try this
data A;
/* INFILE must execute before INPUT so the delimiter options apply to the read */
infile datalines dlm = '|' dsd;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov|1|20|2|18
2_Dec|1| |3|
3_Jan|1| |5|
4_Feb|1| |3|
1_Nov|2|30|4|26
2_Dec|2| |3|
3_Jan|2| |2|
4_Feb|2| |5|
5_Mar|2| |6|
;
data B(drop = c);
set A;
by LAN;
if first.LAN then c = Closing_Bal;
if Opening_Bal = . then do;
Opening_Bal = c;
Closing_Bal = Opening_Bal - EMI;
c = Closing_Bal;
end;
retain c;
run;
Result:
Month LAN Opening_Bal EMI Closing_Bal
1_Nov 1 20 2 18
2_Dec 1 18 3 15
3_Jan 1 15 5 10
4_Feb 1 10 3 7
1_Nov 2 30 4 26
2_Dec 2 26 3 23
3_Jan 2 23 2 21
4_Feb 2 21 5 16
5_Mar 2 16 6 10
The problem is that you already have CLOSING_BAL on the input dataset, so when the SET statement reads a new observation it will overwrite the value calculated on the previous observation. Either drop or rename the variable in the source dataset.
Example:
data have;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov 1 20 2 18
2_Dec 1 . 3 .
3_Jan 1 . 5 .
4_Feb 1 . 3 .
1_Nov 2 30 4 26
2_Dec 2 . 3 .
3_Jan 2 . 2 .
4_Feb 2 . 5 .
5_Mar 2 . 6 .
;
data want;
set have (drop=closing_bal);
retain Closing_Bal;
Opening_Bal=coalesce(Opening_Bal,Closing_Bal);
Closing_bal=Opening_bal - EMI ;
run;
Results:
Opening_ Closing_
Obs Month LAN Bal EMI Bal
1 1_Nov 1 20 2 18
2 2_Dec 1 18 3 15
3 3_Jan 1 15 5 10
4 4_Feb 1 10 3 7
5 1_Nov 2 30 4 26
6 2_Dec 2 26 3 23
7 3_Jan 2 23 2 21
8 4_Feb 2 21 5 16
9 5_Mar 2 16 6 10
I am not sure this works
data B;
set A;
by lan;
if not first.lan then do;
opening_bal = lag(closing_bal);
closing_bal = opening_bal - EMI;
end;
run;
because you don't execute lag for each observation.
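To illustrate the point about conditional LAG calls: each LAG call keeps its own queue and only updates it when the call actually executes, so a LAG inside an IF block returns the value from the last time that block ran, not from the previous observation. A small standalone sketch (the dataset and variable names are invented for the example):

```sas
data lag_demo;
  input x;
  lag_always = lag(x);                                /* runs on every row */
  if mod(_n_, 2) = 0 then lag_conditional = lag(x);   /* runs on even rows only */
  datalines;
1
2
3
4
;
run;

proc print data=lag_demo;
run;
```

On row 4, lag_always is 3 (the previous row), while lag_conditional is 2 (the value of x the last time the conditional LAG executed, on row 2).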
I want to count the distinct values of a variable grouped by MEMBER_ID and a rolling date range of 5 years. I have seen a similar post.
How to Count Distinct for SAS PROC SQL with Rolling Date Window?
When I change h2.DATE BETWEEN h.DATE - 180 AND h.DATE to h2.year BETWEEN h.year-5 AND h.year, should it give me the correct distinct count within the last 5 years? Thank you in advance.
data have;
input permno year Cand_ID$;
datalines;
1 2000 1
1 2001 2
1 2002 3
1 2003 1
1 2004 3
1 2005 1
2 2000 1
2 2001 3
2 2002 1
2 2003 2
2 2004 2
2 2005 2
2 2006 1
2 2007 1
3 2001 3
3 2002 3
3 2003 3
3 2004 1
3 2005 1
;
run;
Here's how you can do it with a data step. This assumes you have values for all years. If you do not, fill it in with zeros.
Keep a rolling list of the IDs from the last 5 years by using the lag function. If we sort that list on each row, we can count the distinct values in it to get a rolling 5-year count.
In other words, we're going to create and count a list that looks like this:
permno year id1 id2 id3 id4 id5
1 2000 . . . . 1
1 2001 . . . 1 2
1 2002 . . 1 2 3
1 2003 . 1 1 2 3
Code:
data want;
set have;
by permno year;
array lagid[4] $;
array id[5] $;
id1 = cand_id;
lagid1 = lag1(cand_id);
lagid2 = lag2(cand_id);
lagid3 = lag3(cand_id);
lagid4 = lag4(cand_id);
/* Reset the counter for the first group */
if(first.permno) then n = 0;
/* Count the number of rows within a group */
n+1;
/* Save the last 5 years by using the lag function,
but do not get lags from previous groups
*/
do i = 1 to 4;
if(i < n) then id[i+1] = lagid[i];
end;
/* Sort the array of IDs into ascending order */
call sortc(of id:);
/* Count the number of distinct IDs in the array. Do not count
missing values.
*/
n_distinct = 1;
do i = 2 to dim(id);
if(id[i] > id[i-1] AND NOT missing(id[i-1]) ) then n_distinct+1;
end;
drop lag: n i;
run;
Output (without id: dropped):
permno year Cand_ID id1 id2 id3 id4 id5 n_distinct
1 2000 1 . . . . 1 1
1 2001 2 . . . 1 2 2
1 2002 3 . . 1 2 3 3
1 2003 1 . 1 1 2 3 3
1 2004 3 1 1 2 3 3 3
1 2005 1 1 1 2 3 3 3
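For comparison, the PROC SQL variant that the question asks about might look like the sketch below. It assumes the `have` dataset from the question; the table name `want_sql` is made up. Because BETWEEN is inclusive on both ends, `h.year - 4` gives a window of five calendar years including the current one, whereas `h.year - 5` would span six.

```sas
proc sql;
  create table want_sql as
  select h.permno,
         h.year,
         h.Cand_ID,
         /* correlated subquery: distinct IDs for the same permno
            within the current year and the four years before it */
         (select count(distinct h2.Cand_ID)
            from have as h2
           where h2.permno = h.permno
             and h2.year between h.year - 4 and h.year) as n_distinct
  from have as h
  order by h.permno, h.year;
quit;
```

Unlike the lag-based data step, which looks at the last five rows, this filters on the year values themselves, so it does not depend on every year being present. It may be slower on large tables because the subquery is evaluated per row.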
I am new to SAS; I used to work with Oracle SQL.
I asked a similar question before:
How to tricky rank SAS?
I thought that question would solve the problem, but I got stuck.
My code is this:
data stepstep;
input emplid KEY:$3. count;
cards;
11 11Y 1
11 11Y 2
11 11N 3
11 11N 4
11 11Y 5
11 11N 6
12 12Y 1
12 12Y 2
12 12N 3
;
run;
and then I tried
data stepstep2;
set stepstep;
by key emplid NOTSORTED;
if first.key AND first.emplid then rank=1;
ELSE rank+1;
run;
Output is this
I want to show
emplid key count rank
11 11Y 1 1
11 11Y 2 1
11 11N 3 2
11 11N 4 2
11 11Y 5 3
11 11N 6 4
12 12Y 1 1
12 12Y 2 1
12 12N 3 2
So when a new emplid comes, I want "Rank" to go back and start counting from 1.
In this example, when the first emplid "12" comes, rank goes back to 1.
How can I do that?
You need to leverage your BY groups properly and I think you have them in the wrong order for starters. Try this instead:
data stepstep2;
set stepstep;
by emplid KEY NOTSORTED;
if first.emplid then rank=1; *start of each emplid group;
ELSE if first.key then rank+1; *start of each new key;
run;
You can also use a sum statement:
data stepstep2;
set stepstep;
by emplid key NOTSORTED;
if first.emplid then rank=0;
rank + first.key;
run;
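Both versions depend on how FIRST. flags behave with NOTSORTED: first.key is 1 whenever the key value changes from the previous row (or emplid changes), even if that key value has appeared earlier. A small sketch that copies the flags into ordinary variables so they can be inspected, assuming the `stepstep` dataset from the question (the names `flags`, `first_emplid` and `first_key` are made up):

```sas
data flags;
  set stepstep;
  by emplid KEY notsorted;
  first_emplid = first.emplid;   /* 1 on the first row of each emplid run */
  first_key    = first.key;      /* 1 on the first row of each run of KEY within emplid */
run;

proc print data=flags;
run;
```

With the sample data, first_key is 1 on rows 1, 3, 5, 6, 7 and 9, so adding it to rank after resetting rank to 0 at first.emplid gives the ranks 1, 1, 2, 2, 3, 4 for emplid 11 and 1, 1, 2 for emplid 12.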
I have a file that records the ratings that teacher X gives to teacher Y and the date each rating occurs
clear
rating_id RatingTeacher RatedTeacher Rating Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they make the rating and the cumulative score. The result would look like this.
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there was a more efficient way to do this within the file
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist but there is no reason for lower standards to apply. See the Statalist FAQ
I rarely see even programmers be precise about what they mean by "efficient", whether it means fewer lines of code, less use of memory, more speed, something else or is just some all-purpose term of praise. This code loops over observations, which can certainly be slow for large datasets. More in this paper
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
count if Date < Date[`i'] & RatedT == RatingT[`i']
replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
The building block is that the rater and ratee are a pair. You can use egen's group() to give a unique ID to each rater ratee pair.
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n