To 'copy' the PDV structure of a data set, it has been advised to "reference a data set at compile time" using
if 0 then set <data-set>
For example,
data toBeCopied;
length var1 $ 4. var2 $ 4. ;
input var1 $ var2 $;
datalines;
this is
just some
fake data
;
run;
data copyPDV;
if 0 then set toBeCopied;
do var1 = 'cutoff' ;
do var2 = 'words';
output;
end;
end;
run;
When you run this, however, the following NOTE appears in the log:
NOTE: DATA STEP stopped due to looping.
This is because the data step never reaches the EOF marker and gets stuck in an infinite loop, as explained in Data Set Looping. (It turns out the DATA step recognizes this and terminates the loop, hence the NOTE in the log).
It seems like usage of the if 0 then set <data-set> statement is a longstanding practice, dating as far back as 1987. Although it seems hacky to me, I can't think of another way to produce the same result (i.e. copying PDV structure), aside from manually restating the attribute requirements. It also strikes me as poor form to allow ERRORs, WARNINGs, and NOTEs which imply unintended program behavior to remain in the log.
Is there a way to suppress this note, or an altogether better method to achieve the same result (i.e. copying the PDV structure of a data set)?
If you include a stop; statement, as in
if 0 then do;
set toBeCopied;
stop;
end;
the NOTE still persists.
Trying to restrict the SET to a single observation also seems to have no effect:
if 0 then set toBeCopied (obs=1);
Normally SAS terminates a data step at the point where you read past the end of the input, whether that input is raw data or SAS data sets. For example, this data step will stop when it executes the SET statement for the 6th time and finds there are no more observations to read.
data want;
put _n_=;
set sashelp.class(obs=5);
run;
To prevent loops SAS checks to see if you read any observations in this iteration of the data step. It is smart enough not to warn you if you are not reading from any datasets. So this program does not get a warning.
data want ;
do age=10 to 15;
output;
end;
run;
But by adding that SET statement you triggered the checking. You can prevent the warning by having a dataset that you are actually reading so that it stops when it reads past the end of the actual input data.
data want;
if 0 then set sashelp.class ;
set my_class;
run;
Or a file you are reading.
data want ;
if 0 then set sashelp.class ;
infile 'my_class.csv' dsd firstobs=2 truncover ;
input (_all_) (:) ;
run;
Otherwise add a STOP statement to manually end the data step.
data want ;
if 0 then set sashelp.class;
do age=10 to 15;
do sex='M','F';
output;
end;
end;
stop;
run;
The stop must not be inside the if 0 branch, because that branch is never executed. The stop statement has to be executed, and executed at the spot where you want the data step to stop.
data copyPDV;
if 0 then set toBeCopied;
do var1 = 'cutoff' ;
do var2 = 'words';
output;
end;
end;
stop;
run;
This does not suppress the note in your situation, but it is another method to get the structure of a data set.
data class;
set sashelp.class(obs=0);
run;
or
proc sql;
create table class1 like sashelp.class;
quit;
The code below gets the last 2 observations from the data set without using loops, the first. and last. variables, or sorting.
data a;
set sashelp.cars nobs=_nobs_;/*create the temporary variable to store total no of obs*/
if _N_ ge _nobs_-1;/*Now compare the automatic variable _N_ to _nobs_*/
run;
Not sure there is a question here.
You can also use the POINT= option on the SET statement. You have to explicitly end the data step since most data steps end when they read past the end of the input data and this step cannot do that.
data want;
do p=max(1,nobs-1) to nobs;
set have point=p nobs=nobs;
output;
end;
stop;
run;
A DATA step is an implicit loop with an automatic index variable _N_. You can leverage that fact to output rows implicitly, without an explicit DO loop. As Tom notes, the point= option is used so that the entire data set does not need to be read to reach the last two rows.
Example:
data want;
if _N_ > min(2,_Z_) then stop;
_P_ = _Z_ - min(2,_Z_) + _N_;
set sashelp.class point=_P_ nobs=_Z_;
run;
I am learning how to use two set statements in one data step and ran into trouble with the following code:
data test1;
do i = 1 to 2;
set sashelp.class;
end;
run;
data test2;
set sashelp.class;
set sashelp.class;
run;
Test1 has 9 observations (all of the even rows) and Test2 has 19 observations. Can somebody explain this to me?
The SAS output statement writes out observations to your output data set. When no explicit output statement is used (as in your data steps) an implicit output at the end of the data step outputs the current observation to the output data set.
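In your first data step the do loop causes the set statement to be executed twice, the first time reading obs #1, the second time reading obs #2. The loop finishes and the next statement is run, so the implicit output outputs the current observation, which is #2. The next iteration of the data step causes the do loop to read obs #3 and then #4, so the last obs (#4) is output, and so on until the end of the data set. On the tenth iteration the loop reads obs #19, then the set statement tries to read a 20th observation that does not exist. That ends the data step before the implicit output is reached, so obs #19 is never written and test1 has 9 observations.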
The second data step executes the first set statement reading in obs #1, then it executes the second set statement, reading obs #1 from that input data set, overwriting the current observation. The implicit output causes this obs to be written out. The data step reiterates causing the same to happen to obs #2, and so on until all 19 obs are read and output.
Inserting some diagnostics can help you understand what is happening, e.g. submit the following and check the log:
data test1;
do i = 1 to 2;
set sashelp.class;
putlog 'In loop: ' i= name=;
end;
putlog 'About to output: ' name=;
run;
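The same kind of diagnostic can be applied to the second step. This is only a sketch that adds putlog statements to the test2 step from the question. The message text is my own wording, and no particular log output is assumed beyond what the explanation above describes (both set statements reading the same observation number on each iteration).

```sas
data test2;
    /* first SET: reads observation number _N_ from sashelp.class */
    set sashelp.class;
    putlog 'After first SET:  ' _n_= name=;
    /* second SET: has its own read pointer, so it also reads observation _N_ */
    set sashelp.class;
    putlog 'After second SET: ' _n_= name=;
    /* the implicit OUTPUT happens here, once per iteration */
run;
```

Each set statement keeps its own position in its input, which is why the two reads stay in step instead of advancing through the data twice as fast.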
I have a data set that contains 30 variables and 2000 observations.
I want to run a regression in a loop, where in each step I delete the i-th row of the data.
So in the end I need my output to be 2001 regressions: one on all the data and 2000 where each time one row is dropped.
I am new to SAS, and I tried to find out how to do it with a macro, but I didn't understand.
Any comments and help will be appreciated!
This will create the data set I was talking about in my comment to Chris.
data del1V /view=del1v;
length group _obs_ 8;
set sashelp.class nobs=nobs;
_obs_ = _n_;
group=0;
output;
do group=1 to nobs;
if group ne _n_ then output;
end;
run;
proc sort data=del1v out=analysis;
by group;
run;
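data new;
set old;
/* write each row to every group except the one that leaves it out; */
/* with 2000 rows, group 2001 keeps every row (the full-data model) */
do group = 1 to 2001;
if _n_ ^= group then output;
end;
run;
proc sort data=new;
by group;
run;
proc reg data=new;
by group;
/* your MODEL statement goes here */
run;
quit;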
This will create a data set that is much longer. You will only call proc reg once, but it will run 2001 models.
Examining 2001 regression outputs will be difficult just written as output. You will likely need to go read the PROC REG support documentation and look into the output options for whatever type of output you're interested in. SAS can create a data set with the GROUP column to differentiate the results.
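As one possible way to collect the results in a data set, PROC REG has an OUTEST= option that writes the parameter estimates, and with a BY statement the output carries the BY variable. The sketch below assumes the ANALYSIS data set built from sashelp.class in the earlier answer (sorted by GROUP). The model weight = height and the data set name estimates are purely illustrative choices, not something from the question.

```sas
/* Assumes ANALYSIS exists and is sorted by GROUP; the MODEL is only an illustration. */
proc reg data=analysis outest=estimates noprint;
    by group;                 /* one model per leave-one-out group        */
    model weight = height;    /* hypothetical model on sashelp.class data */
run;
quit;

/* One row of estimates per GROUP value */
proc print data=estimates(obs=5);
    var group Intercept height _RMSE_;
run;
```

NOPRINT suppresses the printed output, so only the ESTIMATES data set is produced.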
I edited my original answer per data null's suggestion. I agree that the above is probably faster, though I'm not as confident that it would be 100x faster. I do not know enough about the overhead of proc reg versus the cost of the by statement and a larger data set. Regardless, the answer above is simpler programming. Here is my original answer/alternate approach.
You can do this within a macro program. It will have this general structure:
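%macro regress;
%local i;
%do i=1 %to 2001;
data new;
set old;
/* drop row &i; on pass 2001 no row matches, so all 2000 rows are kept */
if _n_=&i then delete;
run;
proc reg data=new;
/* your MODEL statement goes here */
run;
quit;
%end;
%mend;
%regress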
Macros are an advanced programming function in SAS. The macro program is required in order to do a loop of proc reg. The %'s are indicative of macro functions. &i is a macro variable (& is the prefix of a macro variable that is being called). The macro is created in a block that starts and ends with %macro / %mend, and called by %regress.
Examining 2001 regression outputs will be difficult just written as output. You will likely need to go read the PROC REG support documentation and look into the output options for whatever type of output you're interested in. Use &i to create a different data set each time and then append together as part of the macro loop.
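A rough sketch of that "one data set per iteration, then append" idea follows. Everything named here is an assumption for illustration: the macro name regress_loo, its ds= and n= parameters, the work data sets _loo, _est and all_estimates, the dropped_row variable, and the model weight = height on sashelp.class (19 rows). Substitute your own data and MODEL statement.

```sas
%macro regress_loo(ds=sashelp.class, n=19);
    %local i;
    /* start from a clean slate so repeated runs do not accumulate rows */
    %if %sysfunc(exist(work.all_estimates)) %then %do;
        proc delete data=work.all_estimates;
        run;
    %end;
    %do i = 0 %to &n;
        data _loo;
            set &ds;
            if _n_ = &i then delete;   /* i=0 matches no row: full-data model */
        run;
        proc reg data=_loo outest=_est noprint;
            model weight = height;     /* hypothetical model */
        run;
        quit;
        data _est;
            set _est;
            dropped_row = &i;          /* tag the estimates with the dropped row */
        run;
        proc append base=all_estimates data=_est force;
        run;
    %end;
%mend regress_loo;

%regress_loo()
```

After the call, all_estimates should hold one row per fitted model, identified by dropped_row.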
When using VTYPE on a dataset with 0 observations, I do not get the information I need.
Here is an MWE:
Create a simple set with 1 variable and 1 observation.
data fullset;
myvar=1;
run;
Create another set with the same 1 variable and 0 observations.
data emptyset;
set fullset;
stop;
run;
Make a macro that opens the set, checks vtype and prints it to the log.
%macro mwe(inset);
%local TYPE;
data _NULL_;
set &inset.;
CALL SYMPUT("TYPE", VTYPE(myvar));
put TYPE;
stop;
run;
%put &=TYPE.;
%mend mwe;
When run on a set with observations, everything works fine:
%mwe(fullset);
TYPE=N
But when run on an empty set, TYPE does not get assigned:
%mwe(emptyset);
TYPE=
I guess the reason is that no code lines are processed since the set has no observations. Is there any workaround for that?
NOTE: Using proc contents and parsing the result table is certainly overkill for such a simple task.
Your problem is not vtype(), but how the data step works with an empty dataset.
When the set statement attempts to pull a row and fails, the data step immediately terminates. This can be useful, for example when you don't want it to do things after the last row in the dataset has been read. But in this case, it is less useful. Your data step terminates instantly upon the set statement, meaning your call symput never occurs.
However, you can take advantage of a different thing: the fact that SAS will happily create all of the metadata even before set, during compilation.
%macro mwe(inset);
%local TYPE;
data _NULL_;
CALL SYMPUT("TYPE", VTYPE(myvar));
set &inset.;
stop;
run;
%put &=TYPE.;
%mend mwe;
Notice I moved the call symput before the set. Yes, vtype() works fine even before set - the variables are still defined in the PDV even before anything happens in the data step.
(I also took out the spurious put statement that never will do anything as no TYPE variable is ever created in either version.)
An alternative approach is to use the vartype function instead, which does not require a set statement and unlike the vtype function can be used in pure macro code outside a data step (without resorting to dosubl or the like).
What all this means in practice is that you can use vartype to make a function-style macro version of vtype, like so:
%macro vtype(ds,var);
%local dsid varnum rc vartype;
%let dsid = %sysfunc(open(&ds));
%let varnum = %sysfunc(varnum(&dsid,&var));
%let vartype = %sysfunc(vartype(&dsid,&varnum));
%let rc = %sysfunc(close(&dsid));
&vartype
%mend vtype;
/*Example*/
%put %vtype(emptyset,myvar);
/*Output*/
N
When running a data step with two datasets in the set statement, sometimes variables do not reset to missing between iterations. This is also true of merge when you have duplicate by values (i.e., when your by variables do not guarantee a unique record).
For example:
data have1;
do x=1 to 5;
y=1;
output;
end;
run;
data have2;
do x = 6 to 10;
z=x+1;
output;
end;
run;
data want;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
run;
Here, y is given a value of 2 for every record coming from have2 once the first even z value has been reached (including the later odd z values), as opposed to only the even z values.
Similarly,
data have1;
do x = 1 to 5;
y=1;
output;
end;
run;
data have2;
do x = 1 to 5;
do z = 1 to 4;
output;
end;
end;
run;
data want;
merge have1 have2;
by x;
if mod(z,4)=3 then y=3;
run;
Why does this happen, and how can I prevent it from causing unexpected consequences?
Why is this happening?
As discussed at length in the SAS documentation in Combining SAS Data Sets: Methods, this arises from the fact that variables that are defined on a set, merge, or update statement are not set to missing on each iteration of the data step. (This is the equivalent of using retain for all variables on the incoming data sets.)
For the first example, this follows naturally from the retain concept: y is retained, so when it is not replaced by a new record from set having a value of y on it, it keeps its last value. (As we'll see later, it is cleared once: when the set dataset changes, hence why it doesn't still have the earlier value from the previous dataset).
However, this doesn't quite explain the functionality of the merge (how it goes back and forth). That is caused by a different behavior when a by group is involved.
Specifically, variables are not set to missing between each data step iteration; however, they are set to missing for each new by group or data set. From the documentation:
The values of the variables in the program data vector are set to
missing each time SAS starts to read a new data set and when the BY
group changes.
This is why, in the second example, y is back at 1 for the first two iterations of each by group (z=1 and z=2) but stays at 3 for the z=4 iteration.
In order, labelling each iteration by its z value:
Z=1: first record of by group, so everything is set to missing. HAVE1 is read, HAVE2 is read. X=1, Y=1, Z=1 are all set.
Z=2: Second record of have2 is read. y retains its value of 1 from the previous iteration.
Z=3: Third record of have2 is read. y is set to 3.
Z=4: Fourth record of have2 is read. y retains its value of 3 from the previous iteration.
Note that HAVE1 is only read once, on the z=1 iteration. If this were a many-to-many merge, HAVE1 would be read once for each different row that had the same x value on it.
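To watch this happen, one option is to trace the PDV in the log. The sketch below reuses have1 and have2 from the second example and only adds putlog statements. The message labels are my own, and it writes to _null_ so no output data set is created.

```sas
data _null_;
    merge have1 have2;
    by x;
    /* value of y as it stands right after the MERGE statement executes */
    putlog 'After MERGE: ' x= z= y=;
    if mod(z,4)=3 then y=3;
    /* value of y that the implicit OUTPUT would have written */
    putlog 'At output:   ' x= z= y=;
run;
```

Comparing the two lines for z=4 with those for z=1 of the next x value shows where y is carried over and where it is reset.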
How do we prevent it from happening?
You have several options to deal with this, assuming you want it to act as if it was not automatically retained.
Add a by statement
As was noted before, on new by values it will automatically reset everything to missing. So if you ran
data want;
set have1 have2;
by x;
if missing(y) and mod(z,2)=0 then y=2;
run;
This would work as expected (though giving a slightly different result here).
Set some or all variables to missing on your own
You can do this in two places:
data want;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
output;
call missing(of _all_);
run;
or
data want;
y=.;
set have1 have2;
if missing(y) and mod(z,2)=0 then y=2;
run;
One or the other may be more appropriate for your program depending on your needs. The first sets everything to missing, but requires an extra statement (output;). The second only sets y to missing (which is all that's needed), but changes the variable order by putting y first.
For a merge with duplicate by values, if you want to preserve the value of y you may need to do something like:
data want;
merge have1 have2;
by x;
y_new=y;
if mod(z,4)=3 then y_new=3;
rename y_new=y;
drop y;
run;
which gets around things by using a separate variable to store the new value. You also can set it to missing similarly to the above, if that is what is desired.