SAS do-loops and set statements

Why does this macro work? (it does) The loop is able to start, despite the fact that the nrows variable is defined in the set statement inside the loop. Does SAS read the set statement before starting the loop? Where can I find documentation on this issue (which statements inside loops, if any, are executed before the loop starts)?
%macro get_last_n_rows(n, existing, new);
data &new.;
do _i_ = 1 + nrows - &n. to nrows;
set &existing. point = _i_ nobs = nrows;
output;
end;
stop;
run;
%mend get_last_n_rows;

The short answer to your question is: yes, SAS reads the number of rows available prior to the loop executing. In fact, SAS reads the number of rows available before the data step executes; it's determined at data step compile time. See for example, this paper and this paper, among many others.
See specifically the SAS documentation for SET:
At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you can refer to the NOBS= variable before the SET statement. The variable is available in the DATA step but is not added to any output data set.
Note this has nothing to do with the do loop; this is true for the entire data step (which is itself one large loop, of course).
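A quick way to see this for yourself is to print the NOBS= variable in a statement that comes before SET, and then stop before SET ever executes. This is only a minimal sketch; it assumes the sashelp.class sample data set is available in your installation, and nrows is just the variable name from the question.

```sas
data _null_;
  /* nrows already holds the row count: it was assigned at compile time */
  putlog "rows available: " nrows;
  /* STOP ends the step here, so the SET statement below never executes */
  stop;
  set sashelp.class nobs=nrows;
run;
```

The log should show the row count even though no observation was ever read.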

Related

SAS: How to create a variable like 1,2,3,4,...,N

I need to introduce a time trend to a regression model for a course but have no idea how to create a variable that's just (1,2,3,4,...,108). In R or Python I would just create an empty vector of 0's and then loop through to fill them with the loop index but I have no clue how to do it in SAS.
Thank you in advance
data want;
set have;
time_trend+1;
run;
SAS is an inherently looping language. The code above does four things:
Read a row
Add 1 to a variable called time_trend
Output the row to a dataset named want
Read the next row and execute the statements again
SAS automatically initialized the variable time_trend for us at compilation, so we do not need to declare a length or type. SAS assumes it is a numeric variable by default.
The statement time_trend+1 is a sum statement, which is shorthand for the logic below:
data want;
set have;
retain time_trend 0;
time_trend = time_trend + 1;
run;
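If you only need the sequence itself, with no input data set to attach it to, a do loop with an explicit output produces the same counter. A minimal sketch: the data set name trend is made up for illustration, and 108 is the length mentioned in the question.

```sas
data trend;
  /* one iteration of the loop per row: time_trend = 1, 2, ..., 108 */
  do time_trend = 1 to 108;
    output;
  end;
run;
```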

How does one or more data step OUTPUT statements work and can it be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built-in output statement for either the run statement, or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
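As a rough illustration of naming a single dataset in output, the sketch below splits one input into two outputs. It assumes the sashelp.class sample data set; the names young and old and the cutoff of 13 are made up for the example.

```sas
data young old;
  set sashelp.class;
  /* each row is written to exactly one of the two datasets */
  if age < 13 then output young;
  else output old;
run;
```

A bare output; in the same step would instead write the row to both young and old.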
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
The output statement is implied at the run boundary of every SAS data step that uses set, merge, update, or (nothing).
The set statement takes the contents of the current row and reads them into the PDV, if you have a single set statement
The output statement simply outputs the contents of the PDV at that moment into your output dataset
SAS only moves on to a new row in the source dataset defined by your set statement when it reaches a run boundary, a delete statement, a return statement, or a subsetting if (an if without then) whose condition fails
point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
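1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why your second step wrote only three observations: its explicit OUTPUT removed the implied one. Your first step wrote 100 observations because its explicit OUTPUT executed once per loop iteration.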
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is not a direct way to do what you want to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example if you felt your class didn't have enough eleven year-olds you could make two copies of all eleven year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

Why does undefined variable not return an error in DATA step?

I wrote the code below to create a "running total" column named "sum". Although it seems to work, I don't understand how SAS is executing this code. When it encounters the statement sum + var, how does it know what to do given that sum is undefined? Based on the book "The Little SAS Book: A Primer", the SAS data step has a built-in loop that executes the program observation by observation. Given this, how does the program know to do the equivalent of sum[row2] = sum[row1] + var[row2] when it gets to the second row?
data df;
input var;
datalines;
1
3
.
5
1
;
run;
data df2;
set df;
sum+var;
run;
This syntax is known as an implicit retain, and is equivalent to:
retain sum 0;
sum=sum(sum,var);
When you retain a variable, its value is not set to missing when the PDV is reloaded (it 'retains' the previous value). It does NOT read from the previous row - a common misconception.
More information on the retain statement is available in the SAS documentation.
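The sum() call in that equivalent form matters for the missing value in the third row. The sketch below compares the sum statement with a retained variable updated by plain addition; the names compare, running_total and plain_total are made up for illustration.

```sas
data df;
  input var;
  datalines;
1
3
.
5
1
;
run;

data compare;
  set df;
  retain plain_total 0;
  /* sum statement: implicit retain, and a missing var is ignored */
  running_total + var;
  /* plain addition: a missing var makes the result missing, and it stays missing */
  plain_total = plain_total + var;
run;

proc print data=compare;
run;
```

Here running_total comes out as 1, 4, 4, 9, 10, while plain_total is 1, 4 and then missing from the third row onward.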

Set in the first row, why do I have two output rows?

I tried this code in SAS but the output isn't the same as I expect.
data temp;
input sumy;
datalines;
36
;
run;
data aaa;
if _n_ = 1 then
set temp;
run;
proc print data = aaa;
run;
May I ask why there are two observations? Does SAS execute "set" twice? How do "set" and the PDV work here during iteration? Thank you in advance.
Because you executed the implied OUTPUT at the end of the data step twice. The first time, with _N_=1, you read one observation from the input dataset. The second time you did not read a new observation, since _N_ now is 2, and the values from the previous observation were retained. After the second observation SAS stops because it has detected that your data step is in a loop.
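If you want only one observation then either add an explicit OUTPUT statement followed by a STOP statement before the RUN statement (STOP on its own ends the step without writing the current observation), or recode the data step to use the OBS=1 dataset option on the input dataset instead of the IF statement.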
Note that if the input data set was empty then you would have output zero observations because the data step would have stopped when the SET statement read past the end of the input dataset.
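Both fixes can be sketched as follows. Note that STOP ends the step without running the implied OUTPUT, so the first variant writes the row explicitly before stopping. The output names aaa_stop and aaa_obs are made up for illustration.

```sas
data temp;
  input sumy;
  datalines;
36
;
run;

/* Variant 1: write the row explicitly, then end the step */
data aaa_stop;
  if _n_ = 1 then set temp;
  output;
  stop;
run;

/* Variant 2: let the OBS= dataset option limit the input to one row */
data aaa_obs;
  set temp(obs=1);
run;
```

Each variant should leave a single observation with sumy=36.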
There are two observations because during the second DATA Step iteration no read operation occurred.
The SET statement has two roles.
Compile time role (unconditional) - the data set header is read by the compiler and its variables are added to the step's PDV.
Run time role (conditional) - a row from the data set is read and the values are placed in the PDV each time the code reaches the SET statement.
Additionally, every variable coming from (or corresponding to) a SET statement has its value automatically retained. That is why the second observation created by your sample code has sumy=36.
Additional detail from the SAS support site:
Usage Note 8914: DATA step stopped due to looping message
If a DATA step is written such that no data reading statements (e.g.
SET, INPUT) are executed, the step is terminated after one iteration
and the following message is written to the SAS log:
NOTE: DATA STEP stopped due to looping.
As SAS creates a new data set, it reads one record at a time, saving values from that record in the Program Data Vector (PDV) until values from the next record replace them. SAS continues in this way until it has reached the last record.
You can refer to this link for a better understanding:
http://www.lexjansen.com/nesug/nesug07/cc/cc45.pdf
Also you can go through this answer on stack overflow
SAS . Are variables set to missing at every iteration of a data step?

Explain the order in which SAS reads data step (conceptual)

I need to understand how SAS reads/executes data steps. When I've looked up info on how SAS reads data steps, all I seem to find is info on how it reads for merging purposes, which I don't understand in relation to a regular data step. Let's say, for example, I have this code:
data work.DATA;
if amount_a= . then
amount_a= 1;
amount_b= 1;
amount_a= . ;
total = (amount_a + amount_b) + 0 ;
run;
Now, given this, what would "total" equal?
I want to know, essentially, how SAS would read this step -- which line would it read/execute first? Does it start at the last, then work its way up? Or start at the top, and work its way down?
Thanks.
A SAS data step processes code from top to bottom, beginning with the DATA statement and ending with the RUN; statement. Data steps have an implied OUTPUT; statement included immediately before the RUN; if the code does not have an explicit output statement.
Since SAS is an "interpreted" language, the code for each data step is compiled before execution. Part of the compilation involves creating a structure called the Program Data Vector (PDV) which contains execution attributes of all variables used by the program. Variables are defined to the PDV in the order they appear in the code (from top to bottom).
A handy debugging tool is the PUTLOG statement, with which you can cause output to be written to your SAS log file during program execution. For example, consider this:
data work.DATA;
if amount_a= . then
amount_a= 1;
amount_b= 1;
putlog amount_a= amount_b=;
amount_a= . ;
putlog amount_a= amount_b=;
total = (amount_a + amount_b) + 0 ;
putlog amount_a= amount_b= total=;
output;
run;
Notice that I added an explicit OUTPUT; statement to illustrate. The result is a SAS data set with one observation and three variables. Your variable total will be a missing value because at the time it is calculated, amount_a is missing. You will also get a NOTE in the SAS log indicating that "Missing values were generated".
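If the intent was to treat a missing amount as zero, the SUM function is one option, since it ignores missing arguments while the + operator propagates them. A minimal sketch reusing the variable names from the question; the data set name work.totals and the variables total_plus and total_sum are made up for illustration.

```sas
data work.totals;
  amount_a = .;
  amount_b = 1;
  /* the + operator returns missing if either operand is missing */
  total_plus = amount_a + amount_b;
  /* SUM() skips missing arguments, so only amount_b and 0 are added */
  total_sum = sum(amount_a, amount_b, 0);
  putlog total_plus= total_sum=;
run;
```

The log should show total_plus=. and total_sum=1.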
The best place to learn all about how SAS does this is in the SAS Language Reference: Concepts book. Here is a link to the book for SAS version 9.3. In particular, read the chapter on Data Step Processing.