Reproducible random number generation in multiple SAS data steps - sas

I am struggling to figure out the best way to generate random numbers reproducibly using multiple SAS data steps.
To do it in one data step is straightforward: just use CALL STREAMINIT at the start of the data step.
However, if I then use a second data step, I can't figure out any way to continue the sequence of random numbers. If I don't use CALL STREAMINIT at all in the second data step, then the random numbers in the second data step are not reproducible. If I use CALL STREAMINIT with the same seed, I get the same random numbers as in the first data step.
The only thing I can think of is to use CALL STREAMINIT with a different seed in each data step. Somehow that seems less satisfactory to me than using just one long random number sequence starting with the first data step.
So for example I could do something like this:
%macro myrandom;
%do i = 1 %to 10;
data dataset&i;
call streaminit(&i);
[do stuff involving random numbers]
run;
%end;
%mend;
But somehow using a predictable sequence of seeds seems like cheating. Should I be worried about that? Is that actually a perfectly acceptable way of doing it, or is there a better way?

Here is my attempt at this:
%macro dataset_rand(_num,_rows);
data dataset;
call streaminit(123); /* seed once, before the loop; repeated calls have no further effect */
do i = 0 to &_rows - 1;
c = rand("UNIFORM");
varnum = mod(i,&_num.) +1;
output;
end;
run;
data %do i = 1 %to &_num.;
dataset&i.
%end;
;
set dataset;
%do j = 1 %to &_num;
if varnum = &j. then
output dataset&j.;
%end;
run;
%mend;
%dataset_rand(10,100);
Here I ran one step to create every row, with a single random variable and another variable that is used to assign the row to a dataset.
The inputs are _num and _rows, which let you choose the number of tables and the total number of rows, so the example (10,100) creates 10 tables of 10 rows, with dataset1 holding the 1st, 11th, ..., 91st members of the random sequence.
That said, I don't know of any reason why 10 datasets with 10 seeds would be any better or worse than 1 dataset with 1 seed split into 10.

Using RANUNI or similar (the 'old' random number streams), you would use CALL RANUNI to accomplish this. It lets you save the seed for the next round, so you could pass that value to the next data step with CALL SYMPUTX and restart the same stream. That works because, in that algorithm, the output value for one pseudorandom number directly determines the seed for the next.
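As a rough sketch of the CALL RANUNI approach just described (the seed 12345 and the macro variable name next_seed are illustrative assumptions, not part of the original question), the seed variable is updated in place on every call, so its final value can be handed to the next step:

```sas
/* First step: start the stream and save the updated seed. */
data first;
  seed = 12345;                      /* assumed starting seed */
  do i = 1 to 5;
    call ranuni(seed, x);            /* updates seed, returns x */
    output;
  end;
  call symputx('next_seed', seed);   /* hypothetical macro variable name */
  drop seed;
run;

/* Second step: resume the same stream from the saved seed. */
data second;
  seed = &next_seed;
  do i = 1 to 5;
    call ranuni(seed, x);
    output;
  end;
  drop seed;
run;
```

If this works as intended, the ten values in first and second together should match a single data step that calls CALL RANUNI ten times from the same starting seed.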
However, with RAND the seed is more complicated (after the first number is generated, the state is no longer just one value). From the documentation:
The RAND function is started with a single seed. However, the state of the process cannot be captured by a single seed. You cannot stop and restart the generator from its stopping point.
This is of course a simplification (obviously SAS is capable of doing so, it just doesn't open up the right hooks for you to do so, presumably as it's not as straightforward as call ranuni is).
What you can do, though, is use the macro language, depending on exactly what you're trying to do. Using %syscall and %sysfunc, you can get a single stream that goes across data steps.
However, one caveat: it doesn't look like you can ever reset it. From documentation on Seed Values:
When the RANUNI function is called through the macro language by using %SYSFUNC, one pseudo-random number stream is created. You cannot change the seed value unless you close SAS and start a new SAS session. The %SYSFUNC macro produces the same pseudo-random number stream as the DATA steps that generated the data sets A, B, and C for the first macro invocation only. Any subsequent macro calls produce a continuation of the single stream.
This is specific to the ranuni family, but it looks like it is also true for the rand family.
So, start up a new session of SAS, and run this:
%macro get_rands(seed=0, n=, var=, randtype=Uniform, randargs=);
%local i;
%syscall streaminit(seed);
%do i = 1 %to &n;
&var. = %sysfunc(rand(&randtype. &randargs.));
output;
%end;
%mend get_rands;
data first;
%get_rands(seed=7,n=10,var=x);
run;
data second;
%get_rands(n=10,var=x);
run;
data whole;
call streaminit(7);
do _i = 1 to 20;
x = rand('Uniform');
output;
end;
run;
But don't make the mistake of running it twice in one session.
Otherwise, your best bet is to generate your random numbers once, then use them in multiple data steps. If you use BY groups, it's easy to manage things this way. If you have specific questions how to implement your project in this way, let us know in a new question.
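A minimal sketch of that "generate once, use many times" idea, assuming two downstream steps of ten numbers each (the dataset name all_rand and the variable step are made up for illustration):

```sas
/* Generate the whole stream once, tagging each value with the step that will use it. */
data all_rand;
  call streaminit(7);
  do step = 1 to 2;
    do i = 1 to 10;
      x = rand('Uniform');
      output;
    end;
  end;
run;

/* Each later data step reads only its own slice of the stream. */
data first;
  set all_rand;
  where step = 1;
run;

data second;
  set all_rand;
  where step = 2;
run;
```

Because only one CALL STREAMINIT is involved, rerunning the program reproduces the same values, and the step variable can double as a BY variable.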

Not sure if it's much easier, but you can use the STREAM subroutine to generate multiple independent streams from the same initial seed. Below is an example, slightly modified from the doc on CALL STREAM.
%macro RepeatRand(N=, Repl=, seed=);
%do k = 1 %to &Repl;
data dataset&k;
call streaminit('PCG', &seed);
call stream(&k);
do i = 1 to &N;
u = rand("uniform");
output;
end;
run;
%end;
%mend;
%RepeatRand(N=8, Repl=2, seed=36457);

Related

SYSPBUFF OPTION

I'm trying to write macro code where I've used several keyword parameters, and I want one of those parameters to be able to read in multiple arguments/values.
I want to achieve something like this:
%MACRO TEST(CONDITION=, VVAR=, OUT_VAR=)/PARMBUFF;
%LET CNT = %sysfunc(countw(&syspbuff));
&OUT_VAR = .;
%DO I =1 %TO &CNT;
%IF &CONDITION=Y %THEN %DO;
&OUT_VAR=(SALARY+BONUS)/COUNT(VALUES PASSED TO VVAR PARAMETER);
%END;
%END;
%MEND;
data person;
input SALARY BONUS COND $;
datalines;
100 50 Y
200 75 Y
300 0 N
;
%TEST(CONDITION=COND,VVAR=SALARY BONUS,OUT_VAR=AVG_SAL);
RUN;
Can anyone suggest how I can achieve that? I tried using the SYSPBUFF option to read in the values for the VVAR parameter, but it holds all the values passed to all the parameters.
Thanks!
You appear to be confused about the timing of when macro code executes and when the code that the macro has generated is executed by SAS. The macro processor does its work first and then passes the code onto SAS to interpret. When SAS sees a complete data or proc step then it runs that step.
Here is my attempt to translate your post into something that could execute. Not sure if it is what you are looking for.
First you have a macro that takes in three parameters. The first is some SAS expression that evaluates to a string. The middle is a variable list. And the last is a single variable name. It uses these parameter values to generate SAS code that could be used inside of a data step to conditionally generate the mean of the variables in the list.
%MACRO TEST(CONDITION=, VVAR=, OUT_VAR=);
IF &CONDITION='Y' THEN DO;
&OUT_VAR=mean(of &vvar);
END;
else &out_var=.;
%MEND;
Now you will need to call that macro as part of a data step so that the code is generated in a place where SAS will understand it. Note that when your data step is using inline data (CARDS/DATALINES), the inline data must be the last thing in the data step.
data person;
input SALARY BONUS COND $;
%TEST(CONDITION=COND,VVAR=SALARY BONUS,OUT_VAR=AVG_SAL);
datalines;
100 50 Y
200 75 Y
300 0 N
;
If you run it with the MPRINT option on you can see the SAS code that the macro has generated. It is just the same as if you had typed that code directly into the data step instead of asking the macro to generate it for you.
615 data person;
616 input SALARY BONUS COND $;
617 %TEST(CONDITION=COND,VVAR=SALARY BONUS,OUT_VAR=AVG_SAL);
MPRINT(TEST): IF COND='Y' THEN DO;
MPRINT(TEST): AVG_SAL=mean(of SALARY BONUS);
MPRINT(TEST): END;
MPRINT(TEST): else AVG_SAL=.;
618 datalines;
NOTE: The data set WORK.PERSON has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.03 seconds
622 ;
If you wanted to generate multiple new variables then just call it multiple times. You could probably make the macro much more complicated by treating each input parameter as a delimited list of values and process the first set and then the second set etc. But why?
If you want to know how many words appear in the value of a macro variable, like the VVAR parameter in the macro above, then you can use the %sysfunc() macro function to call the SAS function countw().
%let cnt=%sysfunc(countw(&vvar));
You could then use the value of &cnt where you need it. Say as the upper bound in a macro %do loop or as an integer constant in an expression in the generated SAS code.
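For instance, a small sketch of using that count as the upper bound of a macro %do loop (the macro name list_words is hypothetical):

```sas
%macro list_words(vvar=);
  %local cnt i;
  /* Count the blank-delimited words in the parameter value. */
  %let cnt = %sysfunc(countw(&vvar));
  %do i = 1 %to &cnt;
    /* %scan pulls out the i-th word; here it is only written to the log. */
    %put Word &i of &cnt is %scan(&vvar, &i);
  %end;
%mend list_words;

%list_words(vvar=SALARY BONUS)
```

The same loop body could generate SAS statements instead of %put lines, one per variable in the list.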

SAS macro modification

I have two values which represent dates:
a=101 and b=103
Below is first macro saved in separate file one.sas:
%global time nmall;
%let nmall =;
%macro pmall;
%do i=&a. %to &b;
%if &i = &a. %then %do;
%let nmall=&nmall.&i;
%end;
%else %let nmall=&nmall.,&i;
%end;
%put (&nmall);
%mend;
%pmall;
So above pmall give me values 101,102,103.
Below is second macro:
%include "one.sas";
%macro c(a=,b=);
%let m=;
%let m1=;
%do i =&a %to &b;
%let o=&i;
proc sql;
create table new&o as select * from data where nb in(&o.);quit;
%let m =&m.date&o;
data date&o.;
set date&o.;
if pass =&o.;
run;
proc sort data=date&o.;
by flag;
%end;
data output&a._&b.;
set &m;
%mend;
The above macro creates three datasets, date101, date102 and date103, then appends them to output101_103.
I am trying to modify above macros in such a way that I will not use %macro and %mend approach. Below is the modified macro code:
data a_to_c;
do o=&a to &c;
output;
end;
run;
so above code will have values 101 102 103 in variable o for dataset a_to_c.
data _null_;
set a_to_c;
call execute ('create table new'||strip(o)||' as select * from data
where nb in('||strip(o)||' );quit;');
run;
I want to know how to do the following things:
Create the pmall values in a macro variable inside the data step data a_to_c in my modified code, so that I can use it further.
Carry over the %let m logic from the first macro code into the new code I am developing above.
Geetha:
I think you will find the macro-ization of the process to be far easier if you go from a data-centric explicit solution and proceed abstracting the salient features into macro symbols (aka variables)
The end run solution appears to be:
data output_101_to_103;
set original_data;
where nb between 101 and 103;
run;
proc sort data=output_101_to_103;
by nb flag;
run;
In which case you could code a macro that abstracts 101 to FIRST and 103 to LAST. The data sets could also be abstracted. The abstracted parts are specified as the macro parameters.
%macro subsetter(DATA=, FIRST=, LAST=, OUTPREFIX=OUTPUT);
%local out;
%let out = &OUTPREFIX._&FIRST._&LAST.;
data &out;
set &DATA.;
where nb between &FIRST. and &LAST.;
* condition = "between &FIRST. and &LAST."; * uncomment if you want to carry along the condition into your output data set;
run;
proc sort data=&out;
by nb flag;
run;
%mend;
And use as
%subsetter (data=original_data, first=101, last=103, outprefix=output)
Note: If you did keep the condition variable in the output data, you WOULD NOT be able to use it directly as a source code statement in a future data step, as in if nb condition then ...
I suppose you could also pass the NB and FLAG as parameters -- but you approach a point of diminishing returns on the utility of the macro.
Macro-izing the specific example I showed doesn't make too much sense unless you need to perform a lot of different variations of FIRST and LAST in a well documented framework. Sometimes it is just better to not abstract the code and work with the specific cases. Why? Because when there are too many abstracted pieces the macro invocation is almost as long as the specific code you are generating and the abstraction just gets in the way of understanding.
If the macro is simply chopping up data and reassembling data, you might be better served rethinking the flow using where, by, and class statements and abstracting around that.
Pmall is a macro variable which will hold a list of values separated by commas. In my modified macro, I want to create pmall as a macro variable in the data step data a_to_c; do o=&a to &c; output; end; run; – geetha anand
To create a macro variable from within a data step, use the CALL SYMPUTX() routine.
data a_to_c;
length pmall $200 ;
do o=&a to &c;
pmall=catx(',',pmall,o);
output;
end;
call symputx('pmall',pmall);
drop pmall;
run;
If you really want to generate code without a SAS macro you can use CALL EXECUTE() or write the code to a file and use %INCLUDE to run it. Or for small pieces of code you could try putting the code in a macro variable, but macro variables can only contain 64K bytes.
It is really hard to tell from what you posted what code you want to generate. Let's assume that you want to generate a new dataset for each value in the sequence and then append that to some aggregate dataset. So for the first pass through the loop your code might be as simple as these two steps: the first creates the proper subset in the right order, and the second appends the result to the aggregate dataset.
proc sort data=nb out=date101 ;
where nb=101 ;
by flag ;
run;
proc append base=date101_103 data=date101 force;
run;
Then next two times through the loop will look the same only the "101" will be replaced by the current value in the sequence.
So using CALL EXECUTE your program might look like:
%let a=101;
%let c=103;
proc delete data=date&a._&c ;
run;
data _null_;
do nb=&a to &c;
call execute(catx(' ','proc sort data=nb out=',cats('date',nb),';'));
call execute(cats('where nb=',nb,';')) ;
call execute('by flag; run;');
call execute("proc append base=date&a._&c data=");
call execute(cats('date',nb));
call execute(' force; run;');
end;
run;
Writing it to a file to run via %INCLUDE would look like this:
filename code temp ;
data _null_;
file code ;
do nb=&a to &c;
put 'proc sort data=nb out=date' nb ';'
/ ' where ' nb= ';'
/ ' by flag;'
/ 'run;'
/ "proc append base=date&a._&c data=date" nb 'force;'
/ 'run;'
;
end;
run;
proc delete data=date&a._&c ;
run;
%include code / source2;
If the goal is just to create the aggregate dataset and you do not need to keep the smaller intermediate datasets, then you could use the same name for the intermediate dataset on each pass through the loop. That makes the code generation easier, as there is then only one place that needs to change based on the current value. It also means you only need two dataset names even for a sequence of 10 or 20 values, which takes less space and reduces clutter in the work library.
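A sketch of that single-intermediate-dataset variant using CALL EXECUTE, assuming &a and &c are already set, that the source dataset is named nb as in the code above, and using subset as a made-up name for the reusable intermediate dataset:

```sas
%let a=101;
%let c=103;

/* Clear any earlier copy of the aggregate before appending to it. */
proc delete data=date&a._&c;
run;

data _null_;
  do nb = &a to &c;
    /* Only the WHERE value changes from pass to pass. */
    call execute(cats('proc sort data=nb out=subset; where nb=', nb, '; by flag; run;'));
    /* The append step is identical on every pass. */
    call execute("proc append base=date&a._&c data=subset force; run;");
  end;
run;
```

The generated steps run after the DATA _NULL_ step finishes, in the order they were queued.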

Naming variable using _n_, a column for each iteration of a datastep

I need to declare a variable for each iteration of a data step (for each _n_), but when I run the code, SAS outputs only the last variable declared, the one with the greatest _n_.
It seems stupid to declare a variable for each row, but I need to achieve this result: I'm working on a dataset created by a proc freq, and I need a column for each group (each row of the dataset).
The result will be in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statements, but none worked.
Always the same result: the dataset includes only var&gr, where &gr is the maximum _n_.
It seems that the PDV is overwriting the new variable each iteration, but the name is different.
Please include the result in a single data step, or at least make the code take as little time as possible.
Any idea on how can I achieve the requested result?
Thanks.
Macro variables don't work like you think they do. Any macro variable reference is resolved at compile time, so your call symput is changing the value of the macro variable after all the references have been resolved. The reason you are getting results where the &gr is the maximum n is because that is what &gr was as a result of the last time you ran the code.
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symput('maxn',_n_);
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't like proc transpose (if you need 3 columns you can always use it once for every column and then put together the outputs) what you ask can be done with arrays.
First thing you need to determine the number of groups (i.e. rows) in the input dataset and then define an array with dimension equal to that number.
Then the i-th element of your array can be recalled using _n_ as index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
In SAS there are several methods to determine the number of observations in a dataset. My favorite is the following (it doesn't work with views):
data _null_;
if 0 then set group nobs=n;
call symputx('gr',n);
run;

Holding Sampled Macro Variable Constant

Hopefully a simple answer. I'm doing a simulation study, where I need to sample a random number of individuals, N, from a uniform distribution, U(25,200), at each of a thousand or so replications. Code for one replication is shown below:
%LET U = RAND("UNIFORM");
%LET N = ROUND(25 + (200 - 25)*&U.);
I created both of these macro variables outside of a DATA step because I need to call the N variable repeatedly in subsequent DATA steps and DO loops in both SAS and IML.
The problem is that every time I call N within a replication, it re-samples U, which necessarily modifies N. Thus, N is not held constant within a replication. This issue is shown in the code below, where I first create N as a variable (that is constant across individuals) and sample predictor values for X for each individual using a DO loop. Note that the value in N is not the same as the total number of individuals, which is also a problem.
DATA ID;
N = &N.;
DO PersonID = 1 TO &N.;
X = RAND("NORMAL",0,1); OUTPUT;
END;
RUN;
I'm guessing that what I need to do is to somehow hold U constant throughout the entirety of one replication, and then allow it to be re-sampled for replication 2, and so on. By holding U constant, N will necessarily be held constant.
Is there a way to do this using macro variables?
&N does not store a value. &N stores the code "ROUND(...(RAND..." etc. You're misusing macro variables, here: while you could store a number in &N you aren't doing so; you have to use %sysfunc, and either way it's not really the right answer here.
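For illustration only, this is roughly what the %sysfunc route looks like; it evaluates the functions once, at %LET time, so the macro variables hold fixed numbers rather than code. It is a sketch under the assumption that no control over the seed is needed, and it is not a recommendation over the data step approach:

```sas
/* Evaluate RAND once; &U now holds a number, not the text of a function call. */
%let U = %sysfunc(rand(UNIFORM));

/* %sysevalf does the floating-point arithmetic; ROUND then gives an integer. */
%let N = %sysfunc(round(%sysevalf(25 + (200 - 25) * &U)));

%put U=&U N=&N;
```

Every later reference to &N within the replication then resolves to the same integer.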
First, if you're repeatedly sampling replicates, look at the paper "Don't Be Loopy", which has some applications here. Also consider Rick Wicklin's paper, Sampling with Replacement; the book he references there ("Simulating Data in SAS") is quite good as well. If you're running your process on a one-sample-one-execution model, that's the slow and difficult way to work. Do all the replicates at once and process them all at once; IML and SAS are both happy to do that for you. Your uniform random sample size is a bit more difficult to work with, but it's not insurmountable.
If you must do it the way you're doing it, I would ask the data step to create the macro variable, if there's a reason to do that. At the end of the sample, you can use call symput to put out the value of N. IE:
%let iter=7; *we happen to be on the seventh iteration of your master macro;
DATA ID;
CALL STREAMINIT(&iter.);
U = RAND("UNIFORM");
N = ROUND(25 + (200 - 25)*U);
DO PersonID = 1 TO N;
X = RAND("NORMAL",0,1);
OUTPUT;
END;
CALL SYMPUTX('N',N);
CALL SYMPUTX('U',U);
RUN;
But again, a one-data-step model is probably your most efficient model.
I'm not sure how to do it in the macro world, but this is how you could convert your code to a data step to accomplish the same thing.
The key is setting the random number stream initialization value, using CALL STREAMINIT.
Data _null_;
call streaminit(35);
u=rand('uniform');
call symput('U', u);
call symput('N', ROUND(25 + (200 - 25)*U));
run;
%put &n;
%put &u;
As Joe points out, the efficient way to perform this simulation is to generate all 1000 samples in a single data step, as follows:
data AllSamples;
call streaminit(123);
do SampleID = 1 to 1000;
N = ROUND(25 + (200 - 25)*RAND("UNIFORM"));
/* simulate sample of size N HERE */
do PersonID = 1 to N;
X = RAND("NORMAL",0,1);
OUTPUT;
end;
end;
run;
This ensures independence of the random number streams, and it takes a fraction of a second to produce the 1000 samples. You can then use a BY statement to analyze the sampling distributions of the statistics on each sample. For example, the following call to PROC MEANS outputs the sample size, sample mean, and sample standard deviation for each of the 1000 samples:
proc means data=AllSamples noprint;
by SampleID;
var X;
output out=OutStats n=SampleN mean=SampleMean std=SampleStd;
run;
proc print data=OutStats(obs=5);
var SampleID SampleN SampleMean SampleStd;
run;
For more details about why the BY-group approach is more efficient (total time= less than 1 second!) see the article "Simulation in SAS: The slow way or the BY way."

Sas macro with proc sql

I want to perform some regressions and I would like to count the number of nonmissing observations for each variable. But I don't know yet which variables I will use. I've come up with the following solution, which does not work. Any help?
Here I basically put each one of my explanatory variables in its own variable. For example,
var1 var2 -> w1 = var1, w2 = var2. Notice that I don't know in advance how many variables I have, so I leave room for ten variables.
Then I store the potential variables using symput.
data _null_;
cntw=countw("&parameters");
i = 1;
array w{10} $15.;
do while(i <= cntw);
w[i]= scan("&parameters",i, ' ');
i = i +1;
end;
/* store a variable globally*/
do j=1 to 10;
call symput("explanVar"||left(put(j,3.)), w(j));
end;
run;
My next step is to perform a proc sql using the variables I've stored. It does not work if I have fewer than 10 variables.
proc sql;
select count(&explanVar1), count(&explanVar2),
count(&explanVar3), count(&explanVar4),
count(&explanVar5), count(&explanVar6),
count(&explanVar7), count(&explanVar8),
count(&explanVar9), count(&explanVar10)
from estimation
;quit;
Can this code work with less than 10 variables?
You haven't provided the full context for this project, so it's unclear if this will work for you - but I think this is what I'd do.
First off, you're in SAS, use SAS where it's best - counting things. Instead of the PROC SQL and the data step, use PROC MEANS:
proc means data=estimation n;
var &parameters.;
run;
That, without any extra work, gets you the number of nonmissing values for all of your variables in one nice table.
Secondly, if there is a reason to do the PROC SQL, it's probably a bit more logical to structure it this way.
proc sql;
select
%do i = 1 %to %sysfunc(countw(&parameters.));
count(%scan(&parameters.,&i.) ) as Parameter_&i., /* or could reuse the %scan result to name this better*/
%end; count(1) as Total_Obs
from estimation;
quit;
The final Total Obs column is useful to simplify the code (dealing with the extra comma is mildly annoying). You could also put it at the start and prepend the commas.
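A sketch of that "total first, commas prepended" variant. Since an iterative %DO is only valid inside a macro definition, it is wrapped in a macro here; the macro name count_nonmissing and the use of sashelp.class in the call are assumptions made for illustration:

```sas
%macro count_nonmissing(data=, parameters=);
  %local i v;
  proc sql;
    select count(1) as Total_Obs
    %do i = 1 %to %sysfunc(countw(&parameters));
      %let v = %scan(&parameters, &i);
      /* Leading comma, so no special handling is needed for the last item. */
      , count(&v) as &v._count
    %end;
    from &data;
  quit;
%mend count_nonmissing;

%count_nonmissing(data=sashelp.class, parameters=Age Height Weight)
```

Reusing the %scan result to build the column alias gives each count a name tied to its variable, at the cost of needing variable names short enough to leave room for the _count suffix.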
You finally could also drive this from a dataset rather than a macro variable. I like that better, in general, as it's easier to deal with in a lot of ways. If your parameter list is in a data set somewhere (one parameter per row, in the dataset "Parameters", with "var" as the name of the column containing the parameter), you could do
proc sql;
select cats('%countme(var=',var,')') into :countlist separated by ','
from parameters;
quit;
%macro countme(var=);
count(&var.) as &var._count
%mend countme;
proc sql;
select &countlist from estimation;
quit;
This I like the best, as it is the simplest code and is very easy to modify. You could even drive it from a contents of estimation, if it's easy to determine what your potential parameters might be from that (or from dictionary.columns).
I'm not sure about your SAS macro, but the SQL query will work with these two notes:
1) If you don't follow your COUNT() functions with an alias, as in "COUNT(var1) AS VAR1", your results will not have field headings. If that's OK with you, then you may not need to worry about it. But if you export the data, it will help to name the columns by adding "... AS MY_NAME".
2) For observations with fewer than 10 variables, the query will return NULL values. So don't worry about not getting all of the results with what you have, because as long as the table you're querying has space for 10 variables (10 separate fields), you will get data back.