SAS Do loop over Set statement - sas

I have a dataset named test with one numeric variable, id, and 3 observations:
1
2
3
I am creating another dataset using a DO loop, as below:
DATA abc;
DO i = 1 to 3;
SET test;
m+1;
OUTPUT;
END;
RUN;
This returns 3 observations.
If I change the DO loop to run from 1 to 4 and remove the OUTPUT statement, I get an empty dataset. I am unable to get my head around this. Can someone please explain this?

Most SAS data steps actually end when the step executes a SET or INPUT statement and finds there is no more input available. That is what is happening here.
SAS normally writes an observation at the end of each data step iteration. The exception is when you have coded an explicit OUTPUT statement. So without the OUTPUT statement, SAS will only write out an observation when it gets to the end of the data step. When you removed the OUTPUT, you made the step the same as:
DATA abc;
DO i = 1 to 3;
SET test;
m+1;
END;
OUTPUT;
RUN;
But if your DO loop iterates more times than there are observations for the SET statement to read, the step never reaches the end to write the output: it reads past the end of the input dataset and stops.
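A minimal sketch of that failure mode, assuming the three-row test dataset from the question (rebuilt here so the example is self-contained). The dataset name abc_empty and the PUTLOG messages are illustrative, not from the original post. The log should show a fourth SET being attempted and the step ending there, so the statement after the loop and the implied OUTPUT are never reached:

```sas
/* Rebuild the 3-observation input described in the question */
data test;
  do id = 1 to 3;
    output;
  end;
run;

/* The loop asks for 4 reads per iteration, but only 3 rows exist */
data abc_empty;
  do i = 1 to 4;
    putlog 'About to SET: ' i= _n_=;
    set test;     /* the 4th execution finds no more input and ends the step */
    m + 1;
  end;
  putlog 'After loop (not reached in this example)';
run;              /* the implied OUTPUT here is never executed */
```

Changing the loop bound back to 3 lets the first iteration finish, so one observation is written before the next iteration's first SET runs out of input.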

Related

What is the difference between one SET statement in a DO loop and multiple SET statements?

I am learning how to use double SET statements and ran into trouble with the following code:
data test1;
do i = 1 to 2;
set sashelp.class;
end;
run;
data test2;
set sashelp.class;
set sashelp.class;
run;
Test1 has 9 observations (all of the even rows) and Test2 has 19 observations. Can somebody explain this for me?
The SAS output statement writes out observations to your output data set. When no explicit output statement is used (as in your data steps) an implicit output at the end of the data step outputs the current observation to the output data set.
In your first data step the do loop causes the set statement to be executed twice, the first time reading obs #1, the second time reading obs #2. The loop finishes and the next statement is run, so the implicit output outputs the current observation which is #2. The next iteration of the data step causes the do loop to read obs #3 and then #4, so the last obs (#4) is output, and so on until the end of the data set.
The second data step executes the first set statement reading in obs #1, then it executes the second set statement, reading obs #1 from that input data set, overwriting the current observation. The implicit output causes this obs to be written out. The data step reiterates causing the same to happen to obs #2, and so on until all 19 obs are read and output.
Inserting some diagnostics can help you understand what is happening, e.g. submit the following and check the log:
data test1;
do i = 1 to 2;
set sashelp.class;
putlog 'In loop: ' i= name=;
end;
putlog 'About to output: ' name=;
run;
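A comparable diagnostic for the second data step, as a sketch. Renaming name on each SET keeps both reads visible in the log instead of one overwriting the other; test2_diag, name_first and name_second are hypothetical names chosen for illustration:

```sas
data test2_diag;
  /* each SET statement keeps its own position in sashelp.class */
  set sashelp.class(keep=name rename=(name=name_first));
  set sashelp.class(keep=name rename=(name=name_second));
  putlog _n_= name_first= name_second=;
run;
```

On every iteration the two names in the log should match, which shows that both SET statements advance through the input in step with each other.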

How do one or more data step OUTPUT statements work, and can OUTPUT be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built-in output statement for either the run statement or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
The output statement is implied at the run boundary of every SAS data step that uses set, merge, update, or (nothing).
The set statement takes the contents of the current row and reads them into the PDV, if you have a single set statement.
The output statement simply outputs the contents of the PDV at that moment into your output dataset.
SAS reads the next row from the source dataset each time the set statement executes. Normally that happens once per iteration, after the step returns to the top at a run boundary, a delete statement, a return statement, or a failed subsetting if (an if without then).
point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one.
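The point= behavior in the last item can be sketched as follows, assuming the 100-row test dataset from the question (rebuilt here). The names picked and obsnum are hypothetical. Because direct access never reaches end of file, the step needs an explicit STOP, and the rows are written with an explicit OUTPUT:

```sas
/* 100-row input, as in the question */
data test;
  do i = 1 to 100;
    output;
  end;
run;

data picked;
  do obsnum = 4, 8, 11;      /* obsnum: illustrative name for the pointer variable */
    set test point=obsnum;   /* jump straight to that observation number */
    output;                  /* write the row that was just read */
  end;
  stop;                      /* required: with point= the SET never hits end of file */
run;
```

Unlike the sequential version with if _N_ in (4 8 11), this reads only the three requested rows rather than all 100.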
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why test2 has only three observations: its explicit OUTPUT suppresses the implied one. In test3 the three conditional outputs are added to the 100 unconditional ones.
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is no direct way to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example, if you felt your class didn't have enough eleven-year-olds, you could make two copies of all eleven-year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

Set in the first row, why do I have two output rows?

I tried this code in SAS but the output isn't the same as I expect.
data temp;
input sumy;
datalines;
36
;
run;
data aaa;
if _n_ = 1 then
set temp;
run;
proc print data = aaa;
run;
May I ask why there are two observations? Did SAS execute "set" twice? How do "set" and the PDV work here during iteration? Thank you in advance.
Because you executed the implied OUTPUT at the end of the data step twice. The first time, with _N_=1, you read one observation from the input dataset. The second time you did not read a new observation, since _N_ now is 2, and the values from the previous observation were retained. After the second observation SAS stops because it has detected that your data step is in a loop.
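If you want only one observation, then either add an explicit OUTPUT statement followed by a STOP statement before the RUN statement (STOP on its own ends the step before the implied OUTPUT runs), or recode the data step to use the OBS=1 dataset option on the input dataset instead of the IF statement.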
Note that if the input data set was empty then you would have output zero observations because the data step would have stopped when the SET statement read past the end of the input dataset.
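Both fixes can be sketched as below, using the temp dataset from the question. The output dataset names aaa_stop and aaa_obs are illustrative. One assumption worth stating: since STOP ends the step immediately and skips the implied OUTPUT, the first variant writes the observation explicitly before stopping.

```sas
data temp;
  input sumy;
  datalines;
36
;
run;

/* Option 1: write the observation explicitly, then stop the step */
data aaa_stop;
  if _n_ = 1 then set temp;
  output;   /* explicit write, because STOP skips the implied OUTPUT */
  stop;
run;

/* Option 2: limit the input to one observation and let SET end the step */
data aaa_obs;
  set temp(obs=1);
run;
```

In the second variant the step ends the usual way: the second iteration's SET finds no more input.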
There are two observations because during the second DATA Step iteration no read operation occurred.
The SET statement has two roles.
Compile time role (unconditional) - the data set header is read by the compiler and its variables are added to the step's PDV.
Run time role (conditional) - a row from the data set is read and the values are placed in the PDV each time the code reaches the SET statement.
Additionally, every variable coming from (or corresponding to) a SET statement has its value automatically retained. That is why the second observation created by your sample code has sumy=36.
Additional detail from the SAS support site:
Usage Note 8914: DATA step stopped due to looping message
If a DATA step is written such that no data reading statements (e.g.
SET, INPUT) are executed, the step is terminated after one iteration
and the following message is written to the SAS log:
NOTE: DATA STEP stopped due to looping.
As SAS creates a new data set, it reads one record at a time, saving values from that record in the Program Data Vector (PDV) until values from the next record replace them. SAS continues in this way until it has reached the last record.
You can refer to this link for a better understanding:
http://www.lexjansen.com/nesug/nesug07/cc/cc45.pdf
You can also go through this answer on Stack Overflow:
SAS: Are variables set to missing at every iteration of a data step?

Explain the order in which SAS reads data step (conceptual)

I need to understand how SAS reads/executes data steps. When I've looked up info on how SAS reads data steps, all I seem to find is info on how it reads for merging purposes, which I don't understand in relation to a regular data step. Let's say, for example, I have this code:
data work.DATA;
if amount_a= . then
amount_a= 1;
amount_b= 1;
amount_a= . ;
total = (amount_a + amount_b) + 0 ;
run;
Now, given this, what would "total" equal?
I want to know, essentially, how SAS would read this step -- which line would it read/execute first? Does it start at the last, then work its way up? Or start at the top, and work its way down?
Thanks.
A SAS data step processes code from top to bottom, beginning with the DATA statement and ending with the RUN; statement. Data steps have an implied OUTPUT; statement included immediately before the RUN; if the code does not have an explicit output statement.
Although SAS is often described as an "interpreted" language, the code for each data step is compiled before execution. Part of the compilation involves creating a structure called the Program Data Vector (PDV), which contains execution attributes of all variables used by the program. Variables are defined to the PDV in the order they appear in the code (from top to bottom).
A handy debugging tool is the PUTLOG statement, with which you can cause output to be written to your SAS log file during program execution. For example, consider this:
data work.DATA;
if amount_a= . then
amount_a= 1;
amount_b= 1;
putlog amount_a= amount_b=;
amount_a= . ;
putlog amount_a= amount_b=;
total = (amount_a + amount_b) + 0 ;
putlog amount_a= amount_b= total=;
output;
run;
Notice that I added an explicit OUTPUT; statement to illustrate. The result is a SAS data set with one observation and three variables. Your variable total will be a missing value because at the time it is calculated, amount_a is missing. You will also get a NOTE in the SAS log indicating that "Missing values were generated".
The best place to learn all about how SAS does this is in the SAS Language Reference: Concepts book. Here is a link to the book for SAS version 9.3. In particular, read the chapter on Data Step Processing.

Understanding the SAS PDV in by-group processing

While I've read quite a bit about conceptualizing the Program Data Vector when using a SAS data step, I still don't understand how the PDV works when there is by group processing. For example if I have the dataset olddata
GROUP VAL
A 10
A 5
B 20
And I call a datastep on it with a by statement, such as:
data newdata;
set olddata;
by group;
...
run;
then the compiler adds two temporary variables to the PDV: first.group and last.group. When you read any tutorial on the PDV it will tell you that on the first pass of the SET statement, the PDV will look like:
_N_ _ERROR_ FIRST.GROUP LAST.GROUP GROUP VAL
1 0 1 0 A 10
and that LAST.GROUP is zero because observation 1 is not the last observation in group A.
Herein lies my question: How does SAS know that this is not the last observation?
If SAS is processing olddata row-by-row, how is the PDV aware that the next row holds another group A observation instead of a new group? In other words, it seems like SAS must be using information from previous or future rows to update the FIRST and LAST variables, but I'm not sure how. Is there some trick in how the PDV retains values from row to row when the BY statement is called?
SAS actually looks ahead to the next record to see if it should set LAST.(var) or not. I haven't been able to find an article explaining that in any detail, unfortunately. I was a bit disappointed to see that even papers like http://www.wuss.org/proceedings09/09WUSSProceedings/papers/ess/ESS-Li1.pdf just gloss over how LAST is determined.
SAS also looks ahead to see if the END= variable should be set, when specified, and a few other things. It's not just using metadata to determine those; you can remove or modify records without modifying the metadata, and it will still work - and SQL tables that don't have the usual SAS metadata will still allow you to perform normal BY group processing and such.
The FIRST variable doesn't need a look-behind, of course; it remembers where it was after all.
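To watch the FIRST. and LAST. flags directly, a small sketch using the olddata rows from the question (the DATALINES step that builds olddata is an assumption about how the data was created). PUTLOG can print the temporary BY variables even though they are not written to the output dataset:

```sas
/* Recreate the three-row example; group must be sorted for BY processing */
data olddata;
  input group $ val;
  datalines;
A 10
A 5
B 20
;
run;

data newdata;
  set olddata;
  by group;
  /* first.group and last.group exist only in the PDV, so log them here */
  putlog _n_= group= val= first.group= last.group=;
run;
```

On the first iteration the log should show LAST.group=0 for the first A row, which is only possible if SAS has already seen that the following row is also in group A.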
Edit: I crossposted this to SAS-L, and got the same answer - there doesn't seem to be any documentation of the subject, but it must read ahead. See http://listserv.uga.edu/cgi-bin/wa?A1=ind1303a&L=sas-l#8 for example.
Edit2: From SAS-L, Dan Nordlund linked to a paper that confirms this. http://support.sas.com/resources/papers/proceedings12/222-2012.pdf
The paper's logic that confirms the lookahead - look at the number of observations read from the data set.
DATA DS_Sample1;
Input Sum_Var
Product;
Cards;
100 3
100 2
100 1
;
*With BY statement - reads 3 observations even though it stops after 2.;
DATA DS_Sample2;
Set DS_Sample1;
by Sum_Var;
cnt+1; If CNT > 1 then stop;
Run;
*no BY statement - reads 2 observations as expected;
DATA DS_Sample2;
Set DS_Sample1;
cnt+1; If CNT > 1 then stop;
Run;
* END= option - again, a lookahead;
DATA DS_Sample2;
Set DS_Sample1 end=eof;
cnt+1; If CNT > 1 then stop;
Run;