SAS_Conditional Cumulative Sum - sas

My question is about a conditional cumulative sum in SAS. I think it is best explained with an example. I have the following dataset:
Date Value
01/01/2001 10
02/01/2001 20
03/01/2001 30
04/01/2001 15
05/01/2001 25
06/01/2001 35
07/01/2001 20
08/01/2001 45
09/01/2001 35
I want to compute the cumulative sum of Value, subject to one condition: if the cumulative sum exceeds 70, it should be set to 70, and the next cumulative sum should start from the excess over 70, and so on. More precisely, my new data should be:
Date Value Cumulative
01/01/2001 10 10
02/01/2001 20 30
03/01/2001 30 60
04/01/2001 15 70
05/01/2001 25 30 (75 - 70 = 5 carried over; 5 + 25 = 30)
06/01/2001 35 65
07/01/2001 20 70
08/01/2001 45 60 (85 - 70 = 15 carried over; 15 + 45 = 60)
09/01/2001 35 95 (not capped, because it is the last value)
Many thanks in advance

Here is a solution, although there is bound to be a more elegant one. It is split into two parts, using the end= flag (eof) so that the last observation is handled separately and never capped.
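data want;
    set test end = eof;
    if eof ^= 1 then do;
        /* The previous row was capped: restart from the carried-over excess */
        if cumulative = 70 then cumulative = extra;
        cumulative + value;
        extra = cumulative - 70;
        /* Cap the running total at 70; extra keeps the excess for the next row */
        if extra > 0 then do;
            cumulative = 70;
        end;
    end;
    retain extra;
    retain cumulative;
    /* The last observation is added without capping */
    if eof = 1 then cumulative + value;
run;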
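As a language-neutral cross-check of the rule above, here is a minimal sketch in Python (not SAS). The function name capped_cumsum and the cap parameter are made up for illustration. One assumption differs from the data step: the carried-over excess is also applied when the row before the last one was capped, which the SAS code as posted does not do.

```python
def capped_cumsum(values, cap=70):
    """Running total capped at `cap`; the excess over the cap seeds the next row.

    The final row is left uncapped, as in the expected output of the question.
    """
    result = []
    total = 0   # running total written to the output
    carry = 0   # excess over the cap left from the previous row
    for idx, value in enumerate(values):
        is_last = idx == len(values) - 1
        # A capped row restarts the total from the carried-over excess.
        if total == cap:
            total = carry
        total += value
        carry = total - cap
        # Cap every row except the last one (assumption taken from the question).
        if carry > 0 and not is_last:
            total = cap
        result.append(total)
    return result


values = [10, 20, 30, 15, 25, 35, 20, 45, 35]
print(capped_cumsum(values))
```

Running it on the sample values should reproduce the Cumulative column of the question.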

Related

Average over number of variables where number of variables is dictated by separate column

I would like to create a new column whose values equal the average of values in other columns. But the number of columns I am taking the average of is dictated by a variable. My data look like this, with 'length' dictating the number of columns x1-x5 that I want to average:
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
run;
I would like to end up with the below where 'avg' is the average of the specified columns.
data want;
input ID $ length avg;
datalines;
A 5 87
B 4 156.5
C 3 558.3
D 5 39.6
;
run;
Any suggestions? Thanks! Sorry about the awful title, I did my best.
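You have to do a little more work, since mean(of x[1]-x[length]) is not valid syntax. Instead, copy the first 'length' values into a temporary array, take the mean of that array, and reset it on each row. For the sample data the temporary array holds the values below, and the data step after them builds it: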
tmp1 tmp2 tmp3 tmp4 tmp5
8 234 79 36 78
8 26 589 3 .
19 892 764 . .
72 48 65 4 9
data want;
    set have;
    array x[*] x:;
    array tmp[5] _temporary_;
    /* Reset the temp array */
    call missing(of tmp[*]);
    /* Save each value of x to the temp array */
    do i = 1 to length;
        tmp[i] = x[i];
    end;
    /* Get the average of the non-missing values in the temp array */
    avg = mean(of tmp[*]);
    drop i;
run;
Use an array: sum the first 'length' elements of the array, then divide the sum by length.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
    set have;
    array x(5) x1-x5;
    sum=0;
    do i=1 to length;
        sum + x(i);
    end;
    avg = sum/length;
    keep id length avg;
    format avg 8.2;
run;
@Reeza's solution is good, but if x contains missing values it will not always produce the desired result, because it still divides by length. It is better to use the SUM function and count the missing values. The code is also slightly simplified:
data want (drop=i s);
    set have;
    array a{*} x:;
    s=0; nm=0;
    do i=1 to length;
        if missing(a{i}) then nm+1;
        s=sum(s,a{i});
    end;
    avg=s/(length-nm);
run;
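The same idea (average the first 'length' values and divide by the number of non-missing ones) can be checked against the expected output with a short Python sketch. This is not SAS; mean_of_first is a hypothetical helper name, and None stands in for a SAS missing value.

```python
def mean_of_first(values, n):
    """Average the first n values, skipping missing ones (None)."""
    present = [v for v in values[:n] if v is not None]
    # With no non-missing values there is nothing to average.
    return sum(present) / len(present) if present else None


# Sample rows from the question: ID -> (length, [x1..x5])
rows = {
    "A": (5, [8, 234, 79, 36, 78]),
    "B": (4, [8, 26, 589, 3, 54]),
    "C": (3, [19, 892, 764, 89, 43]),
    "D": (5, [72, 48, 65, 4, 9]),
}

for key, (length, xs) in rows.items():
    print(key, length, round(mean_of_first(xs, length), 1))
```

Dividing by the count of non-missing values mirrors avg=s/(length-nm) in the data step above.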
Rather than writing your own code to calculate the mean, you could calculate all of the possible means and then use length as an index into an array to select the one you need.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
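data want;
    set have;
    /* means1-means5 hold the mean of the first 1, 2, ..., 5 columns */
    array means[5] ;
    means[1]=x1;
    means[2]=mean(x1,x2);
    means[3]=mean(of x1-x3);
    means[4]=mean(of x1-x4);
    means[5]=mean(of x1-x5);
    /* Pick the mean that matches this row's length */
    avg = means[length];
run;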

Use the dif function to obtain the difference with several lags without specifying the number of lags

I want a new data set in which the variable y equals the value of x in the current row minus the sum of all previous (lagged) values of x.
The original data set:
data test;
input x;
datalines;
20
40
2
5
74
;
run;
I used the dif function, but it only returns the difference with a single lag:
data want;
set test;
y = dif(x);
run;
And I want:
_n_ = 1 y = 20
_n_ = 2 y = 40 - 20 = 20
_n_ = 3 y = 2 - (40 + 20) = -58
_n_ = 4 y = 5 - (2 + 40 + 20) = -57
_n_ = 5 y = 74 - (5 + 2 + 40 + 20) = 7
Thanks.
No need for lag() or dif(). Just make another variable to retain the running total.
data want ;
    set test;
    y=x-cumm;
    output;
    cumm+x;
run;
I kept the extra column and output the values before updating the running total to make it clearer what value was used in the calculation of Y.
Obs x y cumm
1 20 20 0
2 40 20 20
3 2 -58 60
4 5 -57 62
5 74 7 67
Possible solution (thanks to Longfish for suggestions):
data want;
    set test;
    retain total 0;
    total = total + x;
    y = x - coalesce(lag(total), 0);
run;
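Both answers amount to subtracting the running total of all earlier rows from the current value. A minimal Python sketch of that logic (not SAS; the function name is made up for illustration) makes it easy to check against the numbers in the question:

```python
def diff_from_running_total(xs):
    """Return x minus the sum of all previous x values, row by row."""
    out = []
    total = 0  # running total of the rows already seen
    for x in xs:
        out.append(x - total)  # use the total BEFORE adding the current row
        total += x
    return out


print(diff_from_running_total([20, 40, 2, 5, 74]))
```

The order of the two statements inside the loop is the whole point: y is computed before the running total is updated, just as the first data step outputs the row before executing cumm+x.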

SAS_Add value for specific rows

I want to assign a value to specific groups of rows. I think an example shows it best. I have the following dataset:
Date Value
01/01/2001 10
02/01/2001 20
03/01/2001 35
04/01/2001 15
05/01/2001 25
06/01/2001 35
07/01/2001 20
08/01/2001 45
09/01/2001 35
My result should be:
Date Value Spec.Value
01/01/2001 10 1
02/01/2001 20 1
03/01/2001 35 1
04/01/2001 15 2
05/01/2001 25 2
06/01/2001 35 2
07/01/2001 20 3
08/01/2001 45 3
09/01/2001 35 3
As you can see, my condition value is 35, and it occurs three times. I need to group my dates using this condition value: each group ends on a row where Value is 35.
data want;
    set have;
    retain specvalue 1;
    if lag(value) = 35 then do;
        specvalue +1;
    end;
run;
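The data step above starts a new group on the row after each 35. A small Python sketch of the same counter logic (not SAS; group_ids and marker are hypothetical names) can be used to verify the expected Spec.Value column:

```python
def group_ids(values, marker=35):
    """Number the rows so that a new group starts after each `marker` value."""
    ids = []
    group = 1
    previous = None  # value of the preceding row, like lag(value)
    for value in values:
        if previous == marker:
            group += 1
        ids.append(group)
        previous = value
    return ids


print(group_ids([10, 20, 35, 15, 25, 35, 20, 45, 35]))
```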

Cumulative sum in multiple columns in SAS

I have been searching for a solution for a while, but I could not find a similar question in the SAS communities. So here is my question: I have a big SAS table, let's say with 2 class variables and 26 numeric variables:
A B Var1 Var2 ... Var25 Var26
-----------------------------
1 1 10 20 ... 35 30
1 2 12 24 ... 32 45
1 3 20 23 ... 24 68
2 1 13 29 ... 22 57
2 2 32 43 ... 33 65
2 3 11 76 ... 32 45
...................
...................
I need to calculate the cumulative sum of all 26 variables along class B within each value of A, which means that for A=1 it accumulates through B=1,2,3, and for A=2 it starts over and accumulates through B=1,2,3. The resulting table will look like:
A B Cum1 Cum2 ... Cum25 Cum26
-----------------------------
1 1 10 20 ... 35 30
1 2 22 44 ... 67 75
1 3 42 67 ... 91 143
2 1 13 29 ... 22 57
2 2 45 72 ... 55 122
2 3 56 148 ... 87 167
...................
...................
I could do it the hard way, handling each of the 26 variables explicitly in a loop and then computing the cumulative sums along B. But I want a more practical solution that does not require listing all the variables.
One website suggested a solution like this:
proc sort data= (drop=percent cum_pct rename=(count=demand cum_freq=cal));
weight var1;
run;
I am not sure whether PROC SORT has an option like "Weight", but if it works, I thought I could modify it by putting a numeric variable list in place of Var1, so that PROC SORT processes all the numeric variables:
proc sort data= (drop=percent cum_pct rename=(count=demand cum_freq=cal));
weight _numerical_;
run;
Any ideas?
One way to accomplish this is to use 2 'parallel' arrays, one for your input values and another for the cumulative values.
%LET N = 26 ;
data cum ;
    set have ;
    by A B ; /* have must be sorted by A B */
    array v{*} var1-var&N ;
    array c{*} cum1-cum&N ;
    retain c . ;
    if first.A then call missing(of c{*}) ; /* reset on new values of A */
    do i = 1 to &N ;
        c{i} + v{i} ;
    end ;
    drop i ;
run ;
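The parallel-array idea (one accumulator per input column, reset whenever A changes) can be sketched in Python to sanity-check the cumulative values. This is not SAS; cumulative_by_group is a hypothetical name, and only the four columns visible in the question (Var1, Var2, Var25, Var26) are used because the others are elided there.

```python
def cumulative_by_group(rows):
    """rows: (a, b, values) tuples already sorted by a, then b.

    Returns the same rows with values replaced by running totals that
    restart whenever a changes.
    """
    out = []
    current_a = object()  # sentinel that never equals a real key
    totals = []
    for a, b, values in rows:
        if a != current_a:
            # New value of A: reset every accumulator.
            totals = [0] * len(values)
            current_a = a
        totals = [t + v for t, v in zip(totals, values)]
        out.append((a, b, totals))
    return out


rows = [
    (1, 1, [10, 20, 35, 30]),
    (1, 2, [12, 24, 32, 45]),
    (1, 3, [20, 23, 24, 68]),
    (2, 1, [13, 29, 22, 57]),
    (2, 2, [32, 43, 33, 65]),
    (2, 3, [11, 76, 32, 45]),
]

for a, b, totals in cumulative_by_group(rows):
    print(a, b, *totals)
```

As in the data step, the input must already be sorted by A and B for the reset to work.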

Use collapse like summarize

I want to use Stata's collapse like summarize. Say I have data (the 1's correspond to the same person, so do the 2's and the 3's) that, when summarized, looks like this:
Obs Mean Std. Dev. Min Max
Score1 54 17 3 11 22
Score2 32 13 2 5 28
Score3 43 22 4 17 33
Value1 54 9 3 2 12
Value2 32 31 7 22 44
Value3 43 38 4 31 45
Speed1 54 3 1 1 11
Speed2 32 6 3 2 12
Speed3 43 8 2 2 15
How would I create a new dataset (using collapse or something else) that looks somewhat like what summarize gives, but looks like the following? Note that the numbers after the variables correspond to observations in my data. So Score1, Value1, and Speed1 all correspond to _n==1.
_n ScoreMean ValueMean SpeedMean ScoreMax ValueMax SpeedMax
1 17 9 3 22 12 11
2 13 31 6 28 44 12
3 22 38 8 33 45 15
(I have omitted Std. Dev. and Min for brevity.)
When I run collapse (mean) Score1 Score2 Score3 Value1 Value2 Value3 Speed1 Speed2 Speed3, I get the following, which is not very helpful:
Score1 Score2 Score3 Value1 Value2 Value3 Speed1 Speed2 Speed3
1 17 13 22 9 31 38 3 6 8
This is on the right track. It only gives me the mean, though. I am not sure how to have it give me more than one statistic at once. I think I need to somehow use reshape at some point.
One way, following your lead:
*clear all
set more off
input ///
score1 score2 value1 value2 speed1 speed2
5 8 346 235 80 89
2 10 642 973 65 78
end
list
summarize
*-----
collapse (mean) score1m=score1 score2m=score2 ///
value1m=value1 value2m=value2 ///
speed1m=speed1 speed2m=speed2 ///
(max) score1max=score1 score2max=score2 ///
value1max=value1 value2max=value2 ///
speed1max=speed1 speed2max=speed2
gen obs = _n
reshape long score@m score@max value@m value@max speed@m speed@max, i(obs) j(n)
drop obs
list
Asking for several statistics is easy. Use the (stat) target_var=varname syntax so that you do not get conflicting names when asking for several statistics. Then reshape.
If there are many variables or subjects, this becomes very tedious. There are other ways. I will revise the answer later if no one posts an alternative by then.
This starts with Roberto's example toy dataset. I think it generalises more easily to 800 objects. (By the way, in Stata _n always and only means observation number in current dataset or group defined by by:, so your usage is mild abuse of syntax.)
clear
input score1 score2 value1 value2 speed1 speed2
5 8 346 235 80 89
2 10 642 973 65 78
end
gen j = _n
reshape long score value speed, i(j) j(i)
rename score yscore
rename value yvalue
rename speed yspeed
reshape long y, i(i j) j(what) string
collapse (mean) mean=y (min) min=y (max) max=y, by(what i)
reshape wide mean min max, j(what) i(i) string
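To see what the reshape/collapse/reshape sequence is meant to produce, here is a rough Python sketch (not Stata) that computes the same statistics directly from the toy data. The function name summarize_by_index and the output key names such as score_mean are made up for illustration; the sketch assumes every variable name is a lowercase stub followed by a number.

```python
import re
from statistics import mean

# Toy data from the answers above: one list of observations per variable.
data = {
    "score1": [5, 2], "score2": [8, 10],
    "value1": [346, 642], "value2": [235, 973],
    "speed1": [80, 65], "speed2": [89, 78],
}


def summarize_by_index(columns):
    """Build one row per numeric suffix, with mean/min/max per stub."""
    table = {}
    for name, values in columns.items():
        # Split e.g. "score2" into the stub "score" and the index 2.
        stub, index = re.fullmatch(r"([a-z]+)(\d+)", name).groups()
        row = table.setdefault(int(index), {})
        row[stub + "_mean"] = mean(values)
        row[stub + "_min"] = min(values)
        row[stub + "_max"] = max(values)
    return table


for index, row in sorted(summarize_by_index(data).items()):
    print(index, row)
```

Each printed row corresponds to one value of the suffix, which is the wide layout the question asks for.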