SAS DATA step loop to calculate new rows of data

I have the requirement below and am wondering whether it can be implemented in a simple DATA step.
I will be starting with a simple dataset with two variables.
input:
x y
1 0
logic:
x y z
1 0 x+y
prev z prev y +1 x+y
prev z prev y +1 x+y
output:
x y z
1 0 1
1 1 2
2 2 4
4 3 7
7 4 11

Just output the computed rows one-by-one within a do-while loop.
data have;
input x y;
cards;
1 0
;run;
data want(drop=i);
set have;
z = x + y;
output;
i = 2; /*next row*/
do while (i <= 5); /* put the total number of rows here */
x = z;
y = y + 1;
z = x + y;
output;
i = i + 1;
end;
run;
Result
proc print data=want;
run;
Obs x y z
1 1 0 1
2 1 1 2
3 2 2 4
4 4 3 7
5 7 4 11
Macro version:
%macro gen(have, want, n_rows);
data &want(drop=i);
set &have;
z = x + y;
output;
i = 2;
do while (i <= &n_rows);
x = z;
y = y + 1;
z = x + y;
output;
i = i + 1;
end;
run;
%mend gen;
/* execute */
%gen(have, want, 5)

OK, I did some research to try to answer your question, and found that you (probably) asked the same question in a SAS forum.
As any interested reader can see, it was answered very elegantly there (by Mr. Reinhard; credit to him).
While I was pasting Reinhard's answer here, I saw that @Bill Huang had posted an original answer of his own, so you should probably accept his. Still, Reinhard's answer is elegant enough that I thought it worth recording here, mainly for other users, because it is such an easy way of creating additional iterative rows in SAS:
data have;
x=1; y=0;
run;
data want;
set have;
do y=y to 4;
z=x+y;
output;
x=z;
end;
run;

@LuizZ's do y=y to 4 version presumes the first row has y=0. If that is not the case, try:
data have;
x=1; y=0;
run;
%let iterations = 4;
data want;
set have;
do y = y to y+&iterations;
z = x + y;
output;
x = z;
end;
run;

Related

Use the dif function to obtain the difference with several lags without specifying the number of lags

I want a new data set in which the variable y equals the value in row n minus the sum of the values in all previous rows.
The original data set:
data test;
input x;
datalines;
20
40
2
5
74
;
run;
I used the dif function, but it returns only the difference at one lag:
data want;
set test;
y = dif(x);
run;
And I want:
_n_ = 1 y = 20
_n_ = 2 y = 40 - 20 = 20
_n_ = 3 y = 2 - (40 + 20) = -58
_n_ = 4 y = 5 - (2 + 40 + 20) = -57
_n_ = 5 y = 74 - (5 + 2 + 40 + 20) = 7
Thanks.
No need for lag() or dif(). Just make another variable to retain the running total.
data want ;
set test;
y=x-cumm;
output;
cumm+x;
run;
I kept the extra column and output the values before updating the running total to make it clearer what value was used in the calculation of Y.
Obs x y cumm
1 20 20 0
2 40 20 20
3 2 -58 60
4 5 -57 62
5 74 7 67
Possible solution (thanks to Longfish for suggestions):
data want;
set test;
retain total 0;
total = total + x;
y = x - coalesce(lag(total), 0);
run;

SAS DO loop with SET statement

Given two simple datasets A and B as follows
DATA A; INPUT X @@;
CARDS;
1 2 3 4
;
RUN;
DATA B; INPUT Y @@;
CARDS;
1 2
;
RUN;
I am trying to create two datasets named C and D, one using repeated SET and OUTPUT statements and another using DO loop.
DATA C;
SET B;
K=1; DO; SET A; OUTPUT; END;
K=K+1; DO; SET A; OUTPUT; END;
K=K+1;
RUN;
DATA D;
SET B;
DO K = 1 TO 2;
SET A; OUTPUT;
END;
RUN;
I thought that C and D should be the same as the DO loop is supposed to be repeating those statements as shown in the DATA step for C, but it turns out that they are different.
Dataset C:
Obs Y K X
1 1 1 1
2 1 2 1
3 2 1 2
4 2 2 2
Dataset D:
Obs Y K X
1 1 1 1
2 1 2 2
3 2 1 3
4 2 2 4
Could someone please explain this?
The two SET A statements in the first data step are independent. So on each iteration of the data step they will both read the same observation. So it is as if you ran this step instead.
data c;
set b;
set a;
do k=1 to 2; output; end;
run;
The SET A statement in the second data step sits inside the DO loop, so it executes twice on each iteration of the data step and reads two observations from A per iteration.
If you really wanted to do a cross-join, you would need to use the POINT= option so that you could re-read one of the data sets.
data want ;
set b ;
do p=1 to nobs ;
set a point=p nobs=nobs ;
output;
end;
run;
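For comparison, here is a minimal sketch of the same cross-join written with PROC SQL instead of POINT=. It recreates the question's A and B so it can be run on its own; the output table name want_sql is just a placeholder, and the ORDER BY is only there to make the row order predictable.

```sas
/* Recreate the two small inputs from the question */
data a;
  input x @@;
  cards;
1 2 3 4
;
run;

data b;
  input y @@;
  cards;
1 2
;
run;

/* Listing both tables without a join condition pairs every row of B
   with every row of A, giving 2 x 4 = 8 rows */
proc sql;
  create table want_sql as
  select b.y, a.x
  from b, a
  order by b.y, a.x;
quit;
```

The data step version avoids a separate sort and keeps B's original order, so which one is preferable probably depends on how large A is and whether order matters.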
Your table B has two observations, so your data step only runs two iterations.
Every time you read a new observation, K is reset to 1. Solution: use the RETAIN statement.
When the current record is obs 1 and you execute an OUTPUT, you keep outputting the first row from each table; that is why the first and second rows of table A are each output twice.
Debugging:
Iteration 1 current view:
Obs Table X
1 A 1
Obs Table Y k
1 B 1 1
Output:
K=1; DO; SET A; OUTPUT; END;
Obs Y K X
1 1 1 1
K=K+1; DO; SET A; OUTPUT; END;
Obs Y K X
2 1 2 1
Iteration 2 current view:
Obs Table X
2 A 2
Obs Table Y k
2 B 2 1
Output:
K=1; DO; SET A; OUTPUT; END;
Obs Y K X
3 2 1 2
K=K+1; DO; SET A; OUTPUT; END;
Obs Y K X
4 2 2 2

Keeping or deleting a group of observations based on a characteristic of a BY-group

I answered a SAS question a few minutes ago and realized there is a generalization that might be more useful than that one (here). I didn't see this question already in StackOverflow.
The general question is: How can you process and keep an entire BY-group based on some characteristic of the BY-group that you might not know until you have looked at all the observations in the group?
Using input data similar to that from the earlier question:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
Double DoW loop solution:
data have;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
data want;
do _n_ = 1 by 1 until(last.id_2);
set have;
by id id_2;
flag = sum(flag,value=0);
end;
do _n_ = 1 to _n_;
set have;
if flag then output;
end;
drop flag;
run;
I've tested this against the point approach using ~55m rows and found no appreciable difference in performance. Dataset used:
data have;
do ID = 1 to 10000000;
do id_2 = 1 to ceil(ranuni(1)*10);
do value = floor(ranuni(2) * 5);
output;
end;
end;
end;
run;
My answer might not be the most efficient, especially for large datasets, and I'm interested in seeing other possible answers. Here it is:
* For some reason, we are tasked with keeping only observations that
* are in groups of ID and ID_2 that contain at least one obs with
* a VALUE of 0.;
* In the following data, the following ID and ID_2 groups should be
* kept:
* A 2 (2 obs)
* B 1 (3 obs)
* B 3 (2 obs)
* B 4 (1 obs)
* The resulting dataset will have 8 observations.;
data x;
input id $ id_2 value;
datalines;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 1 0
B 1 1
B 1 3
B 2 1
B 3 0
B 3 0
B 4 0
C 2 4
;
run;
* I realize the data are already sorted, but I think it is better
* not to assume they are.;
proc sort data=x;
by id id_2;
run;
data obstokeep;
keep id id_2 value;
retain startptr haszero;
* This SET statement reads through the dataset in sequence and
* uses the CUROBS option to obtain the observation number. In
* most situations, this will be the same as the _N_ automatic
* variable, but CUROBS is probably safer.;
set x curobs=myptr;
by id id_2;
* When this is the first observation in a BY-group, save the
* current observation number (pointer).
* Also initialize a flag variable that will become 1 if any
* obs contains a VALUE of 0;
* The variables are in a RETAIN statement, so they keep their
* values as the SET statement above is executed for each obs
* in the BY-group.;
if first.id_2
then do;
startptr=myptr;
haszero=0;
end;
* This statement is executed for each observation. We check
* whether VALUE is 0 and, if so, record that fact.;
if value = 0
then haszero=1;
* At the end of the BY-group, we check to see if there were
* any observations with VALUE = 0. If so, we go back using
* another SET statement, re-read them via direct access, and
* write them to the output dataset.
* (Note that if VALUE order is not relevant, you can gain a bit
* more efficiency by writing the current obs first, then going
* back to get the rest.);
if last.id_2 and haszero
then do;
* When LAST and FIRST at the same time, there is only one
* obs, so no need to backtrack, just output and go on.;
if first.id_2
then output obstokeep;
else do;
* Here we assume that the observations are sequential
* (which they will be for a sequential SET statement),
* so we re-read these observations using another SET
* statement with the POINT option for direct access
* starting with the first obs of the by-group (the
* saved pointer) and ending with the current one (the
* current pointer).;
do i=startptr to myptr;
set x point=i;
output obstokeep;
end;
end;
end;
run;
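proc sql;
/* DISTINCT stops a group with several zero rows (e.g. B 3) from being joined more than once */
select a.*,b.value from (select distinct id,id_2 from have where value=0)a left join have b
on a.id=b.id and a.id_2=b.id_2;
quit;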

Simulating probabilities in a do loop

I am trying to determine the probability that the mean of a sample from a uniform distribution lies between .4 and .5.
data sample1 (drop= i x z) ;
z=0;
do i=1 to 50;
x= ranuni(234);
z= z+x;
meanz= z/50;
end;
output;
run;
This gives me the mean, but is there some nice way in the loop to output P(.4 <= meanz <= .5)?
Try this:
It gives you the average of the 100 simulated values of meanz and the proportion of them that fall between .4 and .5.
data sample1 (drop= i x z sim) ;
between4_5 = 0;
meanz = 0;
do sim=1 to 100;
z=0;
do i=1 to 50;
x= ranuni(234);
z= z+x;
end;
meanz = meanz + z/(50*100);
if .4 < z/50 < .5 then
between4_5 = between4_5 + 1/100;
end;
output;
run;
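A variation on the same idea, as a rough sketch: keep one row per simulated sample and let PROC MEANS average a 0/1 indicator, so the estimate and the number of simulations are easy to change. The dataset name sims, the variable inrange, and the choice of 10,000 simulations are my own assumptions, and RAND('uniform') with CALL STREAMINIT is swapped in for RANUNI.

```sas
/* Sketch: 10,000 independent sample means, each from 50 uniform draws */
data sims(keep=sim meanz inrange);
  call streaminit(234);              /* seed for RAND; the value is arbitrary */
  do sim = 1 to 10000;               /* number of simulations, chosen for illustration */
    z = 0;
    do i = 1 to 50;
      z = z + rand('uniform');
    end;
    meanz = z / 50;
    inrange = (0.4 <= meanz <= 0.5); /* 1 if inside the interval, else 0 */
    output;
  end;
run;

/* The mean of the 0/1 indicator estimates P(.4 <= meanz <= .5) */
proc means data=sims mean;
  var meanz inrange;
run;
```

As a sanity check, the normal approximation (mean 0.5, standard deviation sqrt(1/600)) suggests the estimate should land somewhere near 0.49.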

Saving states according to an indicator, perhaps by using the Retain statement

As can be seen, I sorted the data by rk and descending version:
data have;
rk = 1;
version = 7;
ind = 0;
output;
rk = 1;
version = 6;
ind = 1;
output;
rk = 1;
version = 5;
ind = 0;
output;
rk = 1;
version = 4;
ind = 0;
output;
rk = 1;
version = 3;
ind = 1;
output;
rk = 1;
version = 2;
ind = 0;
output;
rk = 1;
version = 1;
ind = 0;
output;
rk = 1;
version = 0;
ind = 0;
output;
run;
I thought of the RETAIN statement, but any solution for this problem will suit me just fine.
What I need to do is this:
if at some point ind = 1, I want all previous rows (versions) for the same rk to carry some sort of indication of that.
So basically,
versions 0,1,2 should be flagged, because version 3 has ind = 1;
versions 4,5 should be flagged , because version 6 has ind = 1;
but version 7 should not be affected at all, as it appears after rows of ind = 1,
and not before them.
It would be even better if every flagged row affected by a row with ind = 1
had an indicator stating the version number that caused the change,
meaning:
versions 0,1,2 will have a field named "affected_by" equal to 3
versions 4,5 will have that field equal to 6
Your help is very much appreciated!
Since the data set is already sorted, we will go "forward" (which I think is easier) through your sorted set. We'll use the SELECT statement because we want only one branch to execute per iteration. We'll also use the RETAIN statement that you suggested, and the CAT function to concatenate strings and build the indicator flag:
data test;
set have;
drop N count x;
select;
when(ind = 1) do;
N = 1;
count = version;
retain N count;
output;
end;
when(N = 1) do;
x = ind;
flag = cat('Flagged because of version ', count);
N = .;
retain x count;
output;
end;
when(x = ind) do;
flag = cat('Flagged because of version ', count);
retain x count;
output;
end;
otherwise do;
output;
end;
end;
run;
OUTPUT:
rk version ind flag
1 7 0
1 6 1
1 5 0 Flagged because of version 6
1 4 0 Flagged because of version 6
1 3 1
1 2 0 Flagged because of version 3
1 1 0 Flagged because of version 3
1 0 0 Flagged because of version 3
In this case, N indicates that the previous observation had ind = 1. We then reset it (i.e. N = .); otherwise it would satisfy the N = 1 condition again in the next iteration.
Note that we retain the variables x and count so that x can be compared with the next ind. The variable count holds the version of the row that has ind = 1. For the flag indicator, the CAT function appends the numeric variable count to a string.
Cheers.
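If the numeric "affected_by" field from the question is what is wanted, a shorter RETAIN-based sketch might look like the following. It assumes the data are sorted by rk and descending version, as in the question; the helper variable last_ind is a name I made up, and the input is rebuilt with DATALINES only to keep the example self-contained.

```sas
/* Same rows as the question's HAVE, entered more compactly */
data have;
  input rk version ind;
  datalines;
1 7 0
1 6 1
1 5 0
1 4 0
1 3 1
1 2 0
1 1 0
1 0 0
;
run;

data want(drop=last_ind);
  set have;
  by rk descending version;
  retain last_ind;                      /* version of the most recent ind = 1 row */
  if first.rk then last_ind = .;        /* do not carry a flag across rk groups */
  if ind = 1 then last_ind = version;   /* remember it; this row itself stays unflagged */
  else affected_by = last_ind;          /* missing until an ind = 1 row has been seen */
run;
```

With this input, versions 5 and 4 get affected_by = 6, versions 2, 1 and 0 get 3, and versions 7, 6 and 3 are left missing.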