Understanding the SAS PDV in by-group processing

While I've read quite a bit about conceptualizing the Program Data Vector when using a SAS data step, I still don't understand how the PDV works when there is by group processing. For example if I have the dataset olddata
GROUP VAL
A 10
A 5
B 20
And I call a datastep on it with a by statement, such as:
data newdata;
set olddata;
by group;
...
run;
then the compiler adds two temporary variables to the PDV: first.group and last.group. When you read any tutorial on the PDV it will tell you that on the first pass of the SET statement, the PDV will look like:
_N_ _ERROR_ FIRST.GROUP LAST.GROUP GROUP VAL
1 0 1 0 A 10
and that LAST.GROUP is zero because observation 1 is not the last observation in group A.
Herein lies my question: How does SAS know that this is not the last observation?
If SAS is processing olddata row-by-row, how is the PDV aware that the next row holds another group A observation instead of a new group? In other words, it seems like SAS must be using information from previous or future rows to update the FIRST and LAST variables, but I'm not sure how. Is there some trick in how the PDV retains values from row to row when the BY statement is called?

SAS actually looks ahead to the next record to see if it should set LAST.(var) or not. I haven't been able to find an article explaining that in any detail, unfortunately. I was a bit disappointed to see that even papers like http://www.wuss.org/proceedings09/09WUSSProceedings/papers/ess/ESS-Li1.pdf just gloss over how LAST is determined.
SAS also looks ahead to see if the END= variable should be set, when specified, and a few other things. It's not just using metadata to determine those; you can remove or modify records without modifying the metadata, and it will still work - and SQL tables that don't have the usual SAS metadata will still allow you to perform normal BY group processing and such.
The FIRST variable doesn't need a look-behind, of course; it remembers where it was after all.
Edit: I crossposted this to SAS-L, and got the same answer - there doesn't seem to be any documentation of the subject, but it must read ahead. See http://listserv.uga.edu/cgi-bin/wa?A1=ind1303a&L=sas-l#8 for example.
Edit2: From SAS-L, Dan Nordlund linked to a paper that confirms this. http://support.sas.com/resources/papers/proceedings12/222-2012.pdf
The paper's logic that confirms the lookahead - look at the number of observations read from the data set.
DATA DS_Sample1;
Input Sum_Var
Product;
Cards;
100 3
100 2
100 1
;
*With BY statement - reads 3 observations even though it stops after 2.;
DATA DS_Sample2;
Set DS_Sample1;
by Sum_Var;
cnt+1; If CNT > 1 then stop;
Run;
*no BY statement - reads 2 observations as expected;
DATA DS_Sample2;
Set DS_Sample1;
cnt+1; If CNT > 1 then stop;
Run;
* END= option - again, a lookahead;
DATA DS_Sample2;
Set DS_Sample1 end=eof;
cnt+1; If CNT > 1 then stop;
Run;
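The effect of the lookahead can also be seen directly on the three-row data set from the question. This is only a minimal sketch: it assumes olddata is already sorted by group, and the eof variable name is chosen for illustration.

```sas
/* Recreate the question's data set */
data olddata;
  input group $ val;
  datalines;
A 10
A 5
B 20
;
run;

/* Print the automatic BY-group flags and the END= flag to the log */
data _null_;
  set olddata end=eof;
  by group;
  put _n_= group= val= first.group= last.group= eof=;
run;
```

The log should show FIRST.group=1 and LAST.group=0 for the first row, 0 and 1 for the second, and both flags (plus eof) equal to 1 for the third. LAST.group and eof can only be known at that point if SAS has already checked what follows the current row.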

Related

Recursively advancing a variable by a growth rate variable in SAS

I have a panel/longitudinal dataset in SAS.
One field indicates a class or type, another a point in time without breaks, another is the observed history and another is the log difference forecast for said history. I'd like to add a new field: the history field, advanced by the forecast field.
So if the time field is in the 'future', I want to recursively advance my goal variable with its own lag, multiplied by the exp of the log-difference forecast variable. A trivial operation it seems to me.
I've attempted to replicate the problem with a toy dataset below.
data in;
input class time hist forecast;
datalines;
1 1 100 .
1 2 . .1
1 3 . .15
1 4 . .17
2 1 100 .
2 2 . .18
2 3 . .12
2 4 . .05
run;
proc sort data=work.in;
by class time;
run;
data out;
set in;
by class time;
retain goal hist;
if time > 1 then goal= lag1(goal) * exp(forecast);
run;
JP:
You might want this:
data out;
set in;
by class time;
retain goal;
if first.class
then goal=hist;
else goal = goal * exp(forecast);
run;
Retaining a non data set variable can mostly be considered a lag1 type of stack. The initial goal needs to be reset at the start of each group.
Your first attempt is conditionally LAG1'ng a retained variable while BY group processing -- makes my head spin. LAG-n is tricky because the implicit LAG stack is updated only when processing flow goes through it. If a conditional bypasses the LAG function invocation there is no way the LAG stack can get updated. If you do see LAG in other SAS coding, it might appear in an unconditional place prior to any ifs.
NOTE: retaining data set variables (such as hist) is atypical because their values are overwritten when the SET statement is reached. The atypical case is when testing the retained data set variable prior to the SET statement has a functional purpose.
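To make the point about conditional LAG calls concrete, here is a small sketch on made-up data (the data set and variable names are hypothetical). Each LAG call keeps its own queue, and that queue only advances when the call actually executes.

```sas
data lag_demo;
  input x;
  /* Unconditional call: the queue is updated on every iteration */
  prev_always = lag(x);
  /* Conditional call: the queue is updated on even iterations only */
  if mod(_n_, 2) = 0 then prev_even = lag(x);
  datalines;
1
2
3
4
;
run;

proc print data=lag_demo;
run;
```

prev_always holds the previous row's x (missing, 1, 2, 3). prev_even is missing on the first three rows and 2 on the fourth, because the conditional call returns the value of x from the last time that call executed, not from the previous row.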

How does one or more data step OUTPUT statements work and can it be implicit?

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?
I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.
1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)
2) What is happening when I ask SAS to output certain observations - in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? or does it have some kind of similarities with other statements such as the point= statement?
3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?
For example:
data test;
do i = 1 to 100;
output;
end;
run;
data test2;
set test;
if _N_ in (4 8 11) then output;
run;
data test3;
set test;
if _N_ in (4 8 11) then output;
output;
run;
test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built in output statement for either the run statement, or the data step itself.
output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
run, in addition to ending the step (meaning no statements after run are processed until that data step is completed - equivalent to the ending } in a c-style programming language module, basically) contains an implicit return statement.
Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an output statement in the data step code - in which case that is not present.
It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:
data x;
do row = 1 to 100;
output;
row_prev+1;
end;
run;
That row_prev+1 statement is executed, even though it's after the output statement - its presence can be seen on the next row. In your example where you told it to just output three rows, it still processed the other 97 - just nothing was output from them. If any effects had happened from that processing, it would occur - in fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the iteration count of data step looping).
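A second small sketch of the same idea, using the sashelp.class sample data set (the flag variable and the output data set name are hypothetical). The assignment after OUTPUT runs on every iteration, but its result is never written because no later OUTPUT follows it.

```sas
data after_output;
  set sashelp.class(obs=3);
  flag = 'before';
  output;            /* writes the row with flag='before' */
  flag = 'after';    /* executed, but no OUTPUT follows, so never written */
run;

proc print data=after_output;
run;
```

All three rows in after_output have flag='before'. Because the step contains an explicit OUTPUT, there is no implied one at the end to pick up the later value.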
You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:
The output statement is implied at the run boundary of every SAS data step that uses set, merge, update, or (nothing).
The set statement takes the contents of the current row and reads them into the PDV, if you have a single set statement
The output statement simply outputs the contents of the PDV at that moment into your output dataset
SAS only goes to a new row in the source dataset defined by your set statement when it reaches a run boundary, a delete statement, a return statement, or a subsetting if (an if without then) whose condition fails
point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one
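As a rough illustration of the POINT= item in the list above, the sketch below reads only the third observation of sashelp.class. The variable name p and the output data set name are hypothetical. With direct access SAS never reaches an end-of-file condition, so the step needs an explicit STOP, and because STOP ends the step before the implied OUTPUT, the row is written explicitly.

```sas
data third_row;
  p = 3;                        /* observation number to fetch */
  set sashelp.class point=p;    /* direct access, no sequential read */
  output;                       /* write the row explicitly */
  stop;                         /* POINT= never hits end of file */
run;
```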
It's implicit at the end, unless it's used in one or more places in that data step.
Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.
You are very close.
1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why your first step wrote all 100 observations and the second only three.
2) The OUTPUT statement tells SAS to write the current record to the output dataset.
3) There is not a direct way to do what you want to duplicate records without using OUTPUT statements, but for some similar problems you can cause the duplication on the input side instead of the output side.
For example if you felt your class didn't have enough eleven year-olds you could make two copies of all eleven year-olds by reading them twice.
data want;
set sashelp.class
sashelp.class(where=(age=11))
;
by name;
run;

Set in the first row, why do I have two output rows?

I tried this code in SAS but the output isn't the same as I expect.
data temp;
input sumy;
datalines;
36
;
run;
data aaa;
if _n_ = 1 then
set temp;
run;
proc print data = aaa;
run;
May I ask why there are two observations? Has SAS executed "set" twice? How do "set" and the PDV work here during iteration? Thank you in advance.
Because you executed the implied OUTPUT at the end of the data step twice. The first time, with _N_=1, you read one observation from the input dataset. The second time you did not read a new observation, since _N_ now is 2, and the values from the previous observation were retained. After the second observation SAS stops because it has detected that your data step is in a loop.
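If you want only one observation then either add an explicit OUTPUT followed by a STOP statement before the RUN statement (STOP alone would end the step before the implied OUTPUT runs, leaving no observations), or recode the data step to use the OBS=1 dataset option on the input dataset instead of the IF statement.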
Note that if the input data set was empty then you would have output zero observations because the data step would have stopped when the SET statement read past the end of the input dataset.
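The two fixes could look roughly like this. It is a sketch that assumes the temp data set from the question; the output data set names are hypothetical. Note that STOP ends the step before the implied OUTPUT would run, so the first variant writes the row explicitly.

```sas
/* Variant 1: write the row, then end the step */
data aaa_stop;
  if _n_ = 1 then set temp;
  output;
  stop;
run;

/* Variant 2: let the OBS= data set option limit the read */
data aaa_obs;
  set temp(obs=1);
run;
```

Both variants produce a single observation with sumy=36. In the second one, the step ends on its own when the SET statement tries to read past the one permitted observation.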
There are two observations because during the second DATA Step iteration no read operation occurred.
The SET statement has two roles.
Compile time role (unconditional) - the data set header is read by the compiler and its variables are added to the step's PDV.
Run time role (conditional) - a row from the data set is read and the values are placed in the PDV each time the code reaches the SET statement.
Additionally, every variable coming from (or corresponding to) a SET statement has its value automatically retained. That is why the second observation created by your sample code has sumy=36.
Additional detail from the SAS support site:
Usage Note 8914: DATA step stopped due to looping message
If a DATA step is written such that no data reading statements (e.g.
SET, INPUT) are executed, the step is terminated after one iteration
and the following message is written to the SAS log:
NOTE: DATA STEP stopped due to looping.
As SAS creates a new data set, it reads one record at a time, saving values from that record in the Program Data Vector (PDV) until values from the next record replace them. SAS continues in this way until it has reached the last record.
You can refer to this link for a better understanding
http://www.lexjansen.com/nesug/nesug07/cc/cc45.pdf
Also you can go through this answer on stack overflow
SAS . Are variables set to missing at every iteration of a data step?

SAS Transpose Comma Separated Field

This is a follow-up to an earlier question of mine.
Transposing Comma-delimited field
The answer I got worked for the specific case, but now I have a much larger dataset, so reading it in a datalines statement is not an option. I have a dataset similar to the one created by this process:
data MAIN;
input ID STATUS STATE $;
cards;
123 7 AL,NC,SC,NY
456 6 AL,NC
789 7 ALL
;
run;
There are two problems here:
1: I need a separate row for each state in the STATE column
2: Notice the third observation says 'ALL'. I need to replace that with a list of the specific states, which I can get from a separate dataset (below).
data STATES;
input STATE $;
cards;
AL
NC
SC
NY
TX
;
run;
So, here is the process I am attempting that doesn't seem to be working.
First, I create a list of the STATES needed for the imputation, and a count of said states.
proc sql;
select distinct STATE into :all_states separated by ','
from STATES;
select count(distinct STATE) into :count_states
from STATES;
quit;
Second, I try to impute that list where the 'ALL' value appears for STATE. This is where the first error appears. How can I ensure that the variable STATE is long enough for the new value? Also, how do I handle the commas?
data x_MAIN;
set MAIN;
if STATE='ALL' then STATE="&all_states.";
run;
Finally, I use a SCAN function to read in one state at a time. I'm also getting an error here, but I think fixing the above part may solve it.
data x_MAIN_mod;
set x_MAIN;
array state(&count_states.) state:;
do i=1 to dim(state);
state(i) = scan(STATE,i,',');
end;
run;
Thanks in advance for the help!
Looks like you are almost there. Try this on the last Data Step.
data x_MAIN_mod;
set x_MAIN;
format out_state $2.;
nstate = countw(state,",");
do i=1 to nstate;
out_state = scan(state,i,",");
output;
end;
run;
Do you have to actually have two steps like that? You can use a 'big number' in a temporary variable and not have much effect on things, if you don't have the intermediate dataset.
data x_MAIN;
length state_temp $150;
set MAIN;
if STATE='ALL' then STATE_temp="&all_states.";
else STATE_temp=STATE;
* the array gets its own name (st) so it does not clash with the STATE variable;
array st(&count_states.) $2;
do i=1 to dim(st);
st(i) = scan(STATE_temp,i,',');
end;
drop STATE_temp;
run;
If you actually do need the STATE, then honestly I'd go with the big number (=50*3, so not all that big) and then add OPTIONS COMPRESS=CHAR; which will (give or take) turn your CHAR fields into VARCHAR (at the cost of a tiny bit of CPU time, but usually far less than the disk read/write time saved).
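Putting the two answers together, a single step can both expand 'ALL' and write one row per state. This is a sketch under stated assumptions: MAIN exists with a STATE variable long enough to hold its full comma-separated list, the macro variable &all_states. was created by the PROC SQL step in the question, the length of 200 is an arbitrary upper bound, and the names x_MAIN_long and state_list are hypothetical.

```sas
data x_MAIN_long;
  length state_list $200 out_state $2;
  set MAIN;
  /* Replace the 'ALL' placeholder with the full list of states */
  if STATE = 'ALL' then state_list = "&all_states.";
  else state_list = STATE;
  /* Write one output row per comma-separated entry */
  do i = 1 to countw(state_list, ',');
    out_state = scan(state_list, i, ',');
    output;
  end;
  drop i state_list;
run;
```

Keeping the expanded list in a temporary variable avoids having to widen STATE itself, and the explicit OUTPUT inside the loop is what produces the extra rows.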

SAS: How do I point to a specific observation of a value?

I'm very new to SAS and I'm trying to figure out some basic things available in other languages.
I have a table
ID Number
-- ------
1 2
2 5
3 6
4 1
I would like to create a new variable where I sum the value of one observation of Number to each other observations, like
Number2 = Number + Number[3]
ID Number Number2
-- ------ ------
1 2 8
2 5 11
3 6 12
4 1 7
How do I get the value of the third observation of Number and add it to each observation of Number in a new variable?
There are several ways to do this; here is one using the SAS POINT= option:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
run;
data want;
retain adder;
drop adder;
if _n_=1 then do;
adder = 3;
set have point=adder;
adder = number;
end;
set have;
number = number + adder;
run;
The RETAIN and DROP statements define a temp variable to hold the value you want to add. RETAIN means the value is not to be re-initialized to missing each time through the data step and DROP means you do not want to include that variable in the output data set.
The POINT= option allows one to read a specific observation from a SAS data set. The _n_=1 part is a control mechanism to only execute that bit of code once, assigning the variable adder to the value of the third observation.
The next section reads the data set one observation at a time and applies your change.
Note that the same data set is read twice; a handy SAS feature.
I'll start by suggesting that Base SAS doesn't really work this way, normally; it's not that it can't, but normally you can solve most problems without pointing to a specific row.
So while this answer will solve your explicit problem, it's probably not something useful in a real world scenario; usually in the real world you'd have a match key or some other element other than 'row number' to combine with, and if you did then you could do it much more efficiently. You also likely could rearrange your data structure in a way that made this operation more convenient.
That said, the specific example you give is trivial:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;;;;
run;
data want;
set have;
_t = 3;
set have(rename=number=number3 keep=number) point=_t ;
number2=number+number3;
run;
If you have SAS/IML (SAS's matrix language), which is somewhat similar to R, then this is a very different story both in your likelihood to perform this operation and in how you'd do it.
proc iml;
a= {1 2, 2 5, 3 6, 4 1}; *create initial matrix;
b = a[,2] + a[3,2]; *create a new matrix which is the 2nd column of a added
elementwise to the value in the third row second column;
c = a||b; *append new matrix to a - could be done in same step of course;
print b c;
quit;
To do this with the First observation, it's a lot easier.
data want;
set have;
retain _firstpoint; *prevents _firstpoint from being set to missing each iteration;
if _n_ = 1 then _firstpoint=number; *on the first iteration (usually first row) set to number's value;
number = number - _firstpoint; *now subtract that from number to get relative value;
run;
I'll elaborate a little more on this. SAS works on a record-by-record level, where each record is independently processed in the DATA step. (PROCs on the other hand may not behave this way, though many do at some level). SAS, like SQL and similar databases, doesn't truly acknowledge that any row is "first" or "second" or "nth"; however, unlike SQL, it does let you pretend that it is, based on the current sort. The POINT= random access method is one way to go about doing that.
Most of the time, though, you're going to be using something in the data to determine what you want to do rather than something related to the ordering of the data. Here's a way you could do the same thing as the POINT= method, but using the value of ID:
data want;
if _n_ = 1 then set have(where=(ID=3) rename=number=number3);
set have;
number2=number+number3;
run;
On the first iteration of the data step (_N_=1), that reads the row from HAVE where ID=3, and then it reads the rows from HAVE in order. Really it does this:
*check to see if _n_=1; it is; so take row id=3;
*take first row (id=1);
*check to see if _n_=1; it is not;
*take second row (id=2);
... continue ...
Variables that are in a SET statement are automatically retained, so NUMBER3 is automatically retained (yay!) and not set to missing between iterations of the data step loop. As long as you don't modify the value, it will stay for each iteration.