Set in the first row, why do I have two output rows? - sas

I tried this code in SAS but the output isn't the same as I expect.
data temp;
input sumy;
datalines;
36
;
run;
data aaa;
if _n_ = 1 then
set temp;
run;
proc print data = aaa;
run;
May I ask why there are two observations? Does SAS execute "set" twice? How do "set" and the PDV work here during iteration? Thank you in advance.

Because you executed the implied OUTPUT at the end of the data step twice. The first time, with _N_=1, you read one observation from the input dataset. The second time you did not read a new observation, since _N_ now is 2, and the values from the previous observation were retained. After the second observation SAS stops because it has detected that your data step is in a loop.
If you want only one observation then either add OUTPUT and STOP statements before the RUN statement (STOP on its own ends the step without writing the current observation) or recode the data step to use the OBS=1 dataset option on the input dataset instead of the IF statement.
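A minimal sketch of both fixes, assuming the temp dataset from the question; the output names one_stop and one_obs are made up for illustration. STOP by itself ends the step without writing the current observation, so the first variant pairs it with an explicit OUTPUT.

```sas
/* Recreate the one-row input from the question. */
data temp;
  input sumy;
  datalines;
36
;
run;

/* Variant 1: write the row explicitly, then end the step. */
data one_stop;
  if _n_ = 1 then set temp;
  output;  /* explicit OUTPUT, because STOP skips the implied one */
  stop;    /* ends the step during the first iteration */
run;

/* Variant 2: let the SET statement end the step.            */
/* OBS=1 limits the input to one row, so the second          */
/* iteration reads past the end and the step stops normally. */
data one_obs;
  set temp(obs=1);
run;
```

Both steps should leave exactly one observation with sumy=36.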
Note that if the input data set was empty then you would have output zero observations because the data step would have stopped when the SET statement read past the end of the input dataset.

There are two observations because during the second DATA Step iteration no read operation occurred.
The SET statement has two roles.
Compile time role (unconditional) - the data set header is read by the compiler and its variables are added to the step's PDV.
Run time role (conditional) - a row from the data set is read and the values are placed in the PDV each time the code reaches the SET statement.
Additionally, every variable coming from (or corresponding to) a SET statement has its value automatically retained. That is why the second observation created by your sample code has sumy=36.
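To watch the two roles and the automatic retain at work, one option is to trace the PDV in the log with PUT statements. This is only a sketch built on the question's own temp and aaa steps; the 'top:' and 'bottom:' labels are arbitrary.

```sas
data temp;
  input sumy;
  datalines;
36
;
run;

data aaa;
  /* sumy is already in the PDV here (compile time role of SET) */
  put 'top:    ' _n_= sumy=;
  /* run time role: a row is read only when this statement executes */
  if _n_ = 1 then set temp;
  put 'bottom: ' _n_= sumy=;
run;
```

In the log, sumy should be missing only at the top of the first iteration. In the second iteration nothing is read, yet sumy is still 36 because variables from a SET statement are not reset to missing between iterations.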
Additional detail from the SAS support site:
Usage Note 8914: DATA step stopped due to looping message
If a DATA step is written such that no data reading statements (e.g.
SET, INPUT) are executed, the step is terminated after one iteration
and the following message is written to the SAS log:
NOTE: DATA STEP stopped due to looping.

As SAS creates a new data set, it reads one record at a time, saving values from that record in the Program Data Vector (PDV) until values from the next record replace them. SAS continues in this way until it has reached the last record.
You can refer to this link for a better understanding:
http://www.lexjansen.com/nesug/nesug07/cc/cc45.pdf
You can also go through this answer on Stack Overflow:
SAS . Are variables set to missing at every iteration of a data step?

Related

Common Values in Two Tables in SAS (without Proc SQL)

I am learning SAS at the moment and I want to know how to join two tables without using any SQL, keeping only the values that are common to both tables.
Both tables have a common unique id. Apart from that, the tables don't have common variables.
Please don't give any documentation links as I already have them and I know merge. I am trying it with an IN operator.
Table 1 : Screenshot
Table 2 : Screenshot
Description: The first table has 157 records and the other has 161 records.
I tried searching for a solution but didn't find one. Please suggest a solution.
Thanks !
In a DATA step you will want to use the MERGE statement and the IN= option, which sets up flags indicating 'contribution' to the current state of the program data vector (PDV).
data want;
merge
have1 (in=_from1)
have2 (in=_from2)
;
by uniqueid; * variable of same name, type and length should be in have1 and have2;
if _from1 and _from2; * subsetting if;
run;
A DATA step is an implicit loop. The MERGE statement automatically advances through the contributing data sets, keeping the reads synchronized on the BY variables.
When a DATA step has no explicit OUTPUT statement, there is an implicit OUTPUT of the values in the PDV when control reaches the bottom of the step. Thus, the IF without a THEN is called a subsetting IF: control only goes past it (and reaches the bottom for the implicit output) when both flags are true, that is, when both tables contribute data at a common key value.
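Here is a small self-contained run of the same pattern. The two input tables and the variables x and y are invented for illustration and are already sorted by uniqueid, which MERGE with BY requires.

```sas
/* Hypothetical inputs: ids 2 and 3 appear in both tables. */
data have1;
  input uniqueid x;
  datalines;
1 10
2 20
3 30
;
run;

data have2;
  input uniqueid y;
  datalines;
2 200
3 300
4 400
;
run;

data want;
  merge have1 (in=_from1)
        have2 (in=_from2);
  by uniqueid;
  if _from1 and _from2;  /* keep only ids found in both tables */
run;

proc print data=want;
run;
```

With these inputs, want should contain only uniqueid 2 and 3, each carrying x from have1 and y from have2. The IN= flags themselves are temporary and are not written to want.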

SAS do-Loops and set statements

Why does this macro work? (it does) The loop is able to start, despite the fact that the nrows variable is defined in the set statement inside the loop. Does SAS read the set statement before starting the loop? Where can I find documentation on this issue (which statements inside loops, if any, are executed before the loop starts)?
%macro get_last_n_rows(n, existing, new);
data &new.;
do _i_ = 1 + nrows - &n. to nrows;
set &existing. point = _i_ nobs = nrows;
output;
end;
stop;
run;
%mend get_last_n_rows;
The short answer to your question is: yes, SAS reads the number of rows available prior to the loop executing. In fact, SAS reads the number of rows available before the data step executes; it's determined at data step compile time. See for example, this paper and this paper, among many others.
See specifically the SAS documentation for SET:
At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you can refer to the NOBS= variable before the SET statement. The variable is available in the DATA step but is not added to any output data set.
Note this has nothing to do with the do loop; this is true for the entire data step (which is itself one large loop, of course).
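A quick way to convince yourself is a step in which the SET statement is never executed at all. The sketch below uses sashelp.class as a stand-in input and keeps the variable name nrows from the macro; everything else is illustrative.

```sas
data _null_;
  /* nrows is already populated here, before any SET has run */
  put 'Rows available: ' nrows=;
  stop;
  /* never executed, but compiling it is enough to fill nrows */
  set sashelp.class nobs=nrows;
run;
```

The log should show the row count even though STOP ends the step before the SET statement is reached. The same idea is often written as `if 0 then set ... nobs=...;`.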

SAS Do loop over Set statement

I have a dataset named as test, with one numeric variable named 'id' having 3 observations as:
1
2
3
I am creating another one using do loop as below:
DATA abc;
DO i = 1 to 3;
SET test;
m+1;
OUTPUT;
END;
RUN;
This returns 3 observations.
If I change do loop from 1 to 4 and remove the output statement, I get an empty dataset. I am unable to get my head around this. Can someone please explain this?
Most SAS data steps actually end when the step executes a SET or INPUT statement and finds there is no more input available. That is what is happening.
SAS normally writes the observation at the end of the data step iteration. The exception is when you have an explicit OUTPUT statement coded. So without the OUTPUT statement SAS will only write out an observation when it gets to the end of the data step. So when you removed the OUTPUT you made the step the same as:
DATA abc;
DO i = 1 to 3;
SET test;
m+1;
END;
OUTPUT;
RUN;
But if your DO loop iterates more times than there are observations for the SET statement to read then it will never get to the end to write the output since it will read past the end of the input dataset and stop.
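If the goal is one summary row regardless of how many observations the input has, one option is to let the loop end on the END= flag instead of a fixed count. This is a sketch, assuming the three-row test dataset from the question; the flag name last_row is made up.

```sas
/* Recreate the question's input: id = 1, 2, 3. */
data test;
  do id = 1 to 3;
    output;
  end;
run;

data abc;
  do until (last_row);
    set test end=last_row;  /* last_row becomes 1 on the final row */
    m + 1;
  end;
  /* the loop ended without reading past the end, so control gets here */
  /* and the implied OUTPUT writes one observation                     */
run;
```

abc should end up with a single observation (id=3, m=3). The step still terminates the usual way: on the second iteration the SET statement finds no more input and stops the step before anything else is written.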

How does one or more data step OUTPUT statements work and can it be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built in output statement for either the run statement, or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
The output statement is implied at the run boundary of every SAS data step (whether it uses set, merge, update, or no input statement at all), unless the step contains an explicit output statement.
The set statement takes the contents of the current row and reads them into the PDV, if you have a single set statement.
The output statement simply outputs the contents of the PDV at that moment into your output dataset.
SAS only goes to a new row in the source dataset defined by your set statement when it reaches a run boundary, a delete statement, a return statement, or when the condition of an if without then statement fails.
point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one.
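To illustrate the last point, here is a sketch of direct access with point=, using sashelp.class as a stand-in input; the dataset name picked and the pointer variable p are made up. A step that reads only with point= never hits an end-of-file condition, so it needs an explicit stop.

```sas
data picked;
  do p = 4, 8, 11;
    set sashelp.class point=p;  /* jump straight to observation p */
    output;                     /* write that row */
  end;
  stop;  /* without this the step would loop indefinitely */
run;
```

picked should contain three observations, rows 4, 8 and 11 of the input. The point= variable is temporary and is not written to the output dataset.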
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why your first step wrote all 100 observations and the second only three.
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is not a direct way to do what you want to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example if you felt your class didn't have enough eleven year-olds you could make two copies of all eleven year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

Understanding the SAS PDV in by-group processing

While I've read quite a bit about conceptualizing the Program Data Vector when using a SAS data step, I still don't understand how the PDV works when there is by group processing. For example if I have the dataset olddata
GROUP VAL
A 10
A 5
B 20
And I call a datastep on it with a by statement, such as:
data newdata;
set olddata;
by group;
...
run;
then the compiler adds two temporary variables to the PDV: first.group and last.group. When you read any tutorial on the PDV it will tell you that on the first pass of the SET statement, the PDV will look like:
_N_ _ERROR_ FIRST.GROUP LAST.GROUP GROUP VAL
1 0 1 0 A 10
and that LAST.GROUP is zero because observation 1 is not the last observation in group A.
Herein lies my question: How does SAS know that this is not the last observation?
If SAS is processing olddata row-by-row, how is the PDV aware that the next row holds another group A observation instead of a new group? In other words, it seems like SAS must be using information from previous or future rows to update the FIRST and LAST variables, but I'm not sure how. Is there some trick in how the PDV retains values from row to row when the BY statement is called?
SAS actually looks ahead to the next record to see if it should set LAST.(var) or not. I haven't been able to find an article explaining that in any detail, unfortunately. I was a bit disappointed to see that even papers like http://www.wuss.org/proceedings09/09WUSSProceedings/papers/ess/ESS-Li1.pdf just gloss over how LAST is determined.
SAS also looks ahead to see if the END= variable should be set, when specified, and a few other things. It's not just using metadata to determine those; you can remove or modify records without modifying the metadata, and it will still work - and SQL tables that don't have the usual SAS metadata will still allow you to perform normal BY group processing and such.
The FIRST variable doesn't need a look-behind, of course; it remembers where it was after all.
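You can see the flags for each row by writing them to the log. A minimal sketch using the olddata table from the question:

```sas
data olddata;
  input group $ val;
  datalines;
A 10
A 5
B 20
;
run;

data _null_;
  set olddata;
  by group;
  /* FIRST. and LAST. exist only in the PDV, so print them to the log */
  put _n_= group= val= first.group= last.group=;
run;
```

The log should show first.group=1 last.group=0 for the first A row, 0 and 1 for the second A row, and 1 and 1 for the single B row. The last.group=0 on the first row is the value that SAS can only know by checking the next record.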
Edit: I crossposted this to SAS-L, and got the same answer - there doesn't seem to be any documentation of the subject, but it must read ahead. See http://listserv.uga.edu/cgi-bin/wa?A1=ind1303a&L=sas-l#8 for example.
Edit2: From SAS-L, Dan Nordlund linked to a paper that confirms this. http://support.sas.com/resources/papers/proceedings12/222-2012.pdf
The paper's logic that confirms the lookahead - look at the number of observations read from the data set.
DATA DS_Sample1;
Input Sum_Var Product;
Cards;
100 3
100 2
100 1
;
*With BY statement - reads 3 observations even though it stops after 2.;
DATA DS_Sample2;
Set DS_Sample1;
by Sum_Var;
cnt+1; If CNT > 1 then stop;
Run;
*no BY statement - reads 2 observations as expected;
DATA DS_Sample2;
Set DS_Sample1;
cnt+1; If CNT > 1 then stop;
Run;
* END= option - again, a lookahead;
DATA DS_Sample2;
Set DS_Sample1 end=eof;
cnt+1; If CNT > 1 then stop;
Run;