How to retain calculated values between rows when calculating running totals? - sas

I have a tricky question about a conditional sum in SAS. It is complicated enough that I cannot explain it well in words, so I will show an example:
A B
5 3
7 2
8 6
6 4
9 5
8 2
3 1
4 3
As you can see, I have a dataset with two columns. First, I calculated the conditional cumulative sum of column A (I can do this myself, so no help is needed for that step):
A B CA
5 3 5
7 2 12
8 6 18
6 4 8 ((12+8)-18)+6
9 5 17
8 2 18
3 1 10 ((17+8)-18)+3
4 3 14
So my condition value is 18. If the cumulative sum exceeds 18, it is set to 18, and the next row's value is the excess over 18 plus that row's A value. (As I said, I can do this myself.)
The tricky part is that I have to calculate the cumulative sum of column B according to column A:
A B CA CB
5 3 5 3
7 2 12 5
8 6 18 9.5 (5+(6*((18-12)/8)))
6 4 8 5.5 ((5+6)-9.5)+4
9 5 17 10.5 (5.5+5)
8 2 18 10.75 (10.5+(2*((18-17)/8)))
3 1 10 2.75 ((10.5+2)-10.75)+1
4 3 14 5.75 (2.75+3)
As you can see from the example, the cumulative sum of column B is very specific. When column CA reaches the condition value (18), we calculate the proportion of the current A value that was needed to reach 18, and then apply that proportion to the current B value when computing the cumulative sum of column B.

It looks like when the sum of A reaches 18 or more you want to split the values of A and B between the current and the next record. One way is to remember the leftover values for A and B and carry them forward in your new cumulative variables. Just make sure to output the observation before resetting those variables.
data want ;
  set have ;
  /* sum statements: ca and cb are retained across rows */
  ca+a;
  cb+b;
  if ca >= 18 then do;
    /* part of A beyond the cap, and the matching share of B */
    extra_a=ca - 18;
    extra_b=b - b*((a - extra_a)/a) ;
    ca=18;
    cb=cb-extra_b ;
  end;
  /* write the capped row before carrying the leftovers forward */
  output;
  if ca=18 then do;
    ca=extra_a;
    cb=extra_b;
  end;
  drop extra_a extra_b ;
run;
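As a cross-check of the logic above, here is a minimal Python sketch of the same carry-forward idea. The function name `capped_running_totals` and the `cap` parameter are hypothetical, and the sketch assumes, like the SAS step, that a single row never overshoots the cap by more than one full cap.

```python
def capped_running_totals(rows, cap=18):
    """Return (ca, cb) for each (a, b) row, splitting a row at the cap."""
    out = []
    ca = cb = 0.0
    for a, b in rows:
        ca += a
        cb += b
        extra_a = extra_b = 0.0
        if ca >= cap:
            # Part of A beyond the cap, and the proportional share of B.
            extra_a = ca - cap
            extra_b = b * extra_a / a
            ca = cap
            cb -= extra_b
        out.append((ca, cb))
        if ca == cap:
            # Carry the leftovers into the next row's running totals.
            ca, cb = extra_a, extra_b
    return out


rows = [(5, 3), (7, 2), (8, 6), (6, 4), (9, 5), (8, 2), (3, 1), (4, 3)]
result = capped_running_totals(rows)
```

With the sample data from the question, `result` should match the CA and CB columns shown there.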

Related

Sorted list of random repeated numbers to sorted list of repeated and continuos numbers in google sheets

I think the best way to show the problem is with an example. Column A is what I have now, and column B is what I want.
A B
1 1
1 1
2 2
2 2
5 3
5 3
5 3
8 4
8 4
9 5
9 5
14 6
14 6
17 7
17 7
17 7
Update: Based on your comment, use this formula
=ArrayFormula(IF(ISNUMBER(A1:A), VLOOKUP(A1:A, {UNIQUE(A1:A), ArrayFormula(RANK(UNIQUE(A1:A), UNIQUE(A1:A), 1))}, 2, 0), ""))
Previous answer: Have you already used the SORT formula?
Try =SORT(A1:A, 1, 1) in cell B1
Assuming your data runs from row 2 through row 10 in column A, enter this in B2:
=arrayformula(1/COUNTIF($A$2:$A$10,$A$2:$A$10))
and this in C2:
=sumproduct(($B$1:$B1)*($A$1:$A1<A2))+1
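What both answers compute is a dense rank: each value is replaced by its position among the distinct values. A small Python sketch of that idea, using the sample column from the question (the name `dense_rank` is hypothetical and this is not a translation of either formula):

```python
def dense_rank(values):
    """Map each value to its 1-based position among the sorted distinct values."""
    rank_of = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]


column_a = [1, 1, 2, 2, 5, 5, 5, 8, 8, 9, 9, 14, 14, 17, 17, 17]
column_b = dense_rank(column_a)
```

Because the lookup goes through a dictionary, the input does not need to be sorted for the ranks to come out right.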

How to select a percentage of values from a column in SAS?

I have 70 databases of different sizes (same number of columns, different numbers of lines).
I need to get the highest 25% and the lowest 25% of values for a given column VAR1.
I have:
id VAR1
1 10
2 -5
3 -12
4 7
5 12
6 7
7 -9
8 -24
9 0
10 6
11 -18
12 22
Sorting by VAR1, I need to select the rows (all columns) containing the 3 smallest and the 3 largest (25% from each extreme), i.e.,
id VAR1
8 -24
11 -18
3 -12
7 -9
2 -5
9 0
10 6
4 7
6 7
1 10
5 12
12 22
I need to keep in the database the rows (all columns) that contain the VAR1 equal to -24, -18, -12, 10, 12 and 22.
id VAR1
8 -24
11 -18
3 -12
1 10
5 12
12 22
What I’ve been thinking:
Order column VAR1 in ascending order;
Create a numbered column from 1 to N (n=_N_) - in this case, N=12;
I do a=N*0.25 (to have the value that represents 25%);
I do b=N-a (to have the value that represents the "last" 25%).
So, I can use keep:
if N<a.... I will have the first 25% (the smallest).
if N>b.... I will have the last 25% (the largest).
I can calculate a and b.
But I’m not getting the maximum value of N (in this case, 12).
I will repeat this for the 70 database, I would not like to have to enter this maximum value every time (it varies from one database to another).
I need help to "fix" the maximum value (N) without having to type it (even if it is repeated in all the lines of another "auxiliary column").
Or if there’s some better way to get those 25% from each end.
My code:
proc sort data=have; by VAR1; run;
data want; set have;
seq=_N_;
N=max(seq); *N=max. value of lines. (I stopped here and don’t know if below is right);
a=N*0.25;
b=N-b;
if N<a;
if N>b;
run;
Thank you very much!
Proc RANK computes percentiles that you can use to select the desired rows.
Example:
data have1 have2 have3 have4 have5;
  do id = 1 to 100;
    X = ceil(rand('normal', 0, 10));
    if id < 60 then output have1;
    if id < 70 then output have2;
    if id < 80 then output have3;
    if id < 90 then output have4;
    if id < 100 then output have5;
  end;
run;
proc rank data=have1 percent out=want1(where=(pct not between 25 and 75)) ;
  var x;
  ranks pct;
run;
proc rank data=have2 percent out=want2(where=(pct not between 25 and 75)) ;
  var x;
  ranks pct;
run;
proc rank data=have3 percent out=want3(where=(pct not between 25 and 75)) ;
  var x;
  ranks pct;
run;
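For comparison, the count-based approach described in the question (sort, then keep the first and last 25% of rows) can be sketched in Python as below. The function name `extremes` is hypothetical, the row count at each end is rounded down, and `fraction` is assumed to be at most 0.5. This is not the PROC RANK method, and the two may select different rows where there are ties or where a row sits exactly on the 25% or 75% boundary.

```python
def extremes(rows, key, fraction=0.25):
    """Return the lowest and highest `fraction` of rows after sorting by `key`."""
    ordered = sorted(rows, key=key)
    k = int(len(ordered) * fraction)  # rows kept at each end, rounded down
    if k == 0:
        return []
    return ordered[:k] + ordered[-k:]


# (id, VAR1) pairs from the question
have = [(1, 10), (2, -5), (3, -12), (4, 7), (5, 12), (6, 7),
        (7, -9), (8, -24), (9, 0), (10, 6), (11, -18), (12, 22)]
want = extremes(have, key=lambda row: row[1])
kept_ids = [row[0] for row in want]
```

The row count comes from `len(ordered)`, so nothing has to be typed in per dataset.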

count and group the column by sequence

I have a dataset that has to be grouped by number as follows.
ID dept count
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
So for every 3rd row I need a new level. The output should be as follows.
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
I have tried counting the number of rows based on the dept and count.
data want;
set have;
by dept count;
if first.count then level=1;
else level+1;
run;
This generates a count, but not exactly what I am looking for:
ID dept count Level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 2
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 2
10 30 4 2
It isn't quite clear what output you want. I've extended your input data a bit - please could you clarify what output you'd expect for this input and what the logic is for generating it?
I've made a best guess at roughly what you might be aiming for - incrementing every 3 rows with the same dept and count - perhaps this will be enough for you to get to the answer you want?
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 30 4
8 30 4
9 30 4
10 30 4
11 30 4
12 30 4
13 30 4
14 30 4
;
run;
data want;
  set have;
  by dept count;
  if first.count then do;
    level = 0;
    dummy = 0;
  end;
  if mod(dummy,3) = 0 then level + 1;
  dummy + 1;
  drop dummy;
run;
Output:
ID dept count level
1 10 2 1
2 10 2 1
3 20 4 1
4 20 4 1
5 20 4 1
6 20 4 2
7 30 4 1
8 30 4 1
9 30 4 1
10 30 4 2
11 30 4 2
12 30 4 2
13 30 4 3
14 30 4 3
One way to do this is to nest the SET statement inside a DO loop, or in this case two DO loops: one to generate the LEVEL (within DEPT) and a second to count by twos. Use the LAST.DEPT flag to handle an odd number of observations.
Suppose I modify the input to include an odd number of observations in some groups:
data have;
input ID dept count;
cards;
1 10 2
2 10 2
3 20 4
4 20 4
5 20 4
6 20 4
7 20 4
8 30 4
9 30 4
10 30 4
;
Then you can use this step to assign the LEVEL variable.
data want ;
  do level=1 by 1 until(last.dept);
    do sublevel=1 to 2 until(last.dept);
      set have;
      by dept;
      output;
    end;
  end;
run;
Results:
Obs level sublevel ID dept count
1 1 1 1 10 2
2 1 2 2 10 2
3 1 1 3 20 4
4 1 2 4 20 4
5 2 1 5 20 4
6 2 2 6 20 4
7 3 1 7 20 4
8 1 1 8 30 4
9 1 2 9 30 4
10 2 1 10 30 4
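Both answers come down to the same rule: within each group, the level goes up by one after every fixed number of rows. A minimal Python sketch of that rule follows. The name `assign_levels` and the `size` parameter are hypothetical, and the input is assumed to be already sorted by dept, as the BY statement requires in SAS.

```python
def assign_levels(depts, size):
    """Number rows within each run of equal dept, bumping the level every `size` rows."""
    levels = []
    previous = object()  # sentinel that never equals a real dept value
    position = 0
    for dept in depts:
        if dept != previous:
            # New group: restart the row counter.
            position = 0
            previous = dept
        levels.append(position // size + 1)
        position += 1
    return levels


# dept column from the second answer's input (odd-sized group for dept 20)
depts = [10, 10, 20, 20, 20, 20, 20, 30, 30, 30]
levels_by_two = assign_levels(depts, 2)

# dept column from the first answer's extended input
depts_extended = [10] * 2 + [20] * 4 + [30] * 8
levels_by_three = assign_levels(depts_extended, 3)
```

With `size=2` this should reproduce the level column of the nested DO loop output, and with `size=3` the output of the `mod(dummy,3)` step.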

SAS Function to calculate percentage for row for two stratifications

I have a dataset that looks like this
data test;
input id1$ id2$ score1 score2 score3 total;
datalines;
A D 9 36 6 51
A D 9 8 6 23
A E 5 3 2 10
B D 5 3 3 11
B E 7 4 7 18
B E 5 3 3 11
C D 8 7 9 24
C E 8 52 6 66
C D 4 5 3 12
;
run;
I want to add a column that calculates what percentage of the corresponding total is of the summation within id1 and id2.
What I mean is this: id1 has a value of A. Within the value of A, there are two id2 values, D and E. There are two rows with D and one with E. The two total values for D are 51 and 23, and they sum to 74. The one total value for E is 10, and it sums to 10. The column I'd like to create would hold the values of .68 (51/74), .31 (23/74), and 1 (10/10) in row 1, row 2, and row 3 respectively.
I need to perform this calculation for the rest of the id1 values and their corresponding id2 values. So when complete, I want a table that would look like this:
id1 id2 score1 score2 score3 total percent_of_total
A D 9 36 6 51 0.689189189
A D 9 8 6 23 0.310810811
A E 5 3 2 10 1
B D 5 3 3 11 1
B E 7 4 7 18 0.620689655
B E 5 3 3 11 0.379310345
C D 8 7 9 24 0.666666667
C E 8 52 6 66 1
C D 4 5 3 12 0.333333333
I realize a loop might be able to solve the problem I've given, but I'm dealing with EIGHT levels of stratification, with as many as 98 sublevels within those levels. A loop is not practical. I'm thinking something along the lines of PROC SUMMARY but I'm not too familiar with the function.
Thank you.
It is easy to do with a data step. Make sure the records are sorted.
You can find the grand total for the ID1*ID2 combination and then use it to calculate the percentage.
proc sort data=test;
by id1 id2;
run;
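data want ;
  /* first pass over the BY group: accumulate the group total */
  do until (last.id2);
    set test ;
    by id1 id2 ;
    grand = sum(grand,total);
  end;
  /* second pass over the same rows: compute each row's share */
  do until (last.id2);
    set test ;
    by id1 id2 ;
    percent_of_total = total/grand ;
    output;
  end;
run;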
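The two-pass idea (first total each id1*id2 group, then divide each row by its group total) can be sketched in Python as follows. The function name `percent_of_group_total` is hypothetical, and only the id1, id2 and total columns are carried. Because the totals are keyed by a dictionary, this version does not need the rows to be sorted.

```python
from collections import defaultdict


def percent_of_group_total(rows):
    """Return total / (sum of total within the same id1, id2 group) for each row."""
    totals = defaultdict(float)
    # Pass 1: accumulate the total for each (id1, id2) combination.
    for id1, id2, total in rows:
        totals[(id1, id2)] += total
    # Pass 2: each row's share of its group total.
    return [total / totals[(id1, id2)] for id1, id2, total in rows]


rows = [("A", "D", 51), ("A", "D", 23), ("A", "E", 10),
        ("B", "D", 11), ("B", "E", 18), ("B", "E", 11),
        ("C", "D", 24), ("C", "E", 66), ("C", "D", 12)]
percents = percent_of_group_total(rows)
```

Adding more stratification levels only means widening the key tuple, which mirrors adding variables to the BY statement.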

How to group data in kdb+ using customized groups?

I have a table (allsales) with a column for time (sale_time). I want to group the data by sale_time, but in buckets. For example, any data where the time is between 00:00:00 and 03:00:00 should be grouped together, 03:00:00 to 06:00:00 should be grouped together, and so on. Is there a way to write such a query?
xbar is useful for rounding to interval values e.g.
q)5 xbar 1 3 5 8 10 11 12 14 18
0 0 5 5 10 10 10 10 15
We can then use this to group rows into time groups, for your example:
q)s:([] t:13:00t+00:15t*til 24; v:til 24)
q)s
t v
--------------
13:00:00.000 0
13:15:00.000 1
13:30:00.000 2
13:45:00.000 3
14:00:00.000 4
14:15:00.000 5
..
q)select count i,sum v by xbar[`int$03:00t;t] from s
t | x v
------------| ------
12:00:00.000| 8 28
15:00:00.000| 12 162
18:00:00.000| 4 86
"by xbar[`int$03:00t;t]" rounds each value in the time column t down to the start of its three-hour interval, and the result is used as the group-by key.
There are a few more ways to achieve the same result.
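To make the rounding behaviour concrete, here is a small Python sketch that mimics xbar for non-negative integers and for times of day. The names `xbar` and `bucket_start` are hypothetical helpers written for this illustration, not part of any kdb+ API.

```python
def xbar(width, values):
    """Floor each value to the largest multiple of `width` that does not exceed it."""
    return [width * (v // width) for v in values]


def bucket_start(hour, minute, width_minutes=180):
    """Return the HH:MM start of the bucket that contains the given time of day."""
    total = hour * 60 + minute
    start = total - total % width_minutes
    return f"{start // 60:02d}:{start % 60:02d}"


buckets = xbar(5, [1, 3, 5, 8, 10, 11, 12, 14, 18])
```

Note that 14:45 lands in the 12:00 bucket even though 15:00 is closer, because values are floored rather than rounded to the nearest multiple.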
q)select count i , sum v by t:01:00u*3 xbar t.hh from s
q)select count i , sum v by t:180 xbar t.minute from s
t | x v
-----| ------
12:00| 8 28
15:00| 12 162
18:00| 4 86
In all cases, if the table also has a date column, include it in the by clause; otherwise rows from the same time window on different dates are merged into one group and the results will be wrong.
q)s:([] d:24#2013.05.07 2013.05.08; t:13:00t+00:15t*til 24; v:til 24)
q)select count i , sum v by d, t:180 xbar t.minute from s
d t | x v
----------------| ----
2013.05.07 12:00| 4 12
2013.05.07 15:00| 6 78
2013.05.07 18:00| 2 42
2013.05.08 12:00| 4 16
2013.05.08 15:00| 6 84
2013.05.08 18:00| 2 44
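The same grouping can be sketched in Python to show why the date belongs in the key. The function name `summarize` is hypothetical, times are held as minutes since midnight, and the sample rows are rebuilt from the table definition above (dates alternating, times starting at 13:00 in 15-minute steps).

```python
from collections import defaultdict


def summarize(rows, width_minutes=180):
    """Count rows and sum v per (date, bucket start in minutes since midnight)."""
    groups = defaultdict(lambda: [0, 0])
    for day, minutes, v in rows:
        start = minutes - minutes % width_minutes
        key = (day, start)  # the date is part of the key, so days never mix
        groups[key][0] += 1
        groups[key][1] += v
    return {key: tuple(val) for key, val in sorted(groups.items())}


days = ["2013.05.07", "2013.05.08"]
rows = [(days[i % 2], 13 * 60 + 15 * i, i) for i in range(24)]
summary = summarize(rows)
```

Dropping `day` from the key would collapse the six groups into three, which is the mistake the note above warns about.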