Holding Sampled Macro Variable Constant - SAS

Hopefully a simple answer. I'm doing a simulation study, where I need to sample a random number of individuals, N, from a uniform distribution, U(25,200), at each of a thousand or so replications. Code for one replication is shown below:
%LET U = RAND("UNIFORM");
%LET N = ROUND(25 + (200 - 25)*&U.);
I created both of these macro variables outside of a DATA step because I need to call the N variable repeatedly in subsequent DATA steps and DO loops in both SAS and IML.
The problem is that every time I call N within a replication, it re-samples U, which necessarily modifies N. Thus, N is not held constant within a replication. This issue is shown in the code below, where I first create N as a variable (that is constant across individuals) and sample predictor values for X for each individual using a DO loop. Note that the value in N is not the same as the total number of individuals, which is also a problem.
DATA ID;
    N = &N.;
    DO PersonID = 1 TO &N.;
        X = RAND("NORMAL",0,1);
        OUTPUT;
    END;
RUN;
I'm guessing that what I need to do is to somehow hold U constant throughout the entirety of one replication, and then allow it to be re-sampled for replication 2, and so on. By holding U constant, N will necessarily be held constant.
Is there a way to do this using macro variables?

&N does not store a value; it stores the code "ROUND(...(RAND..." etc. You're misusing macro variables here: while you could store a number in &N, you aren't doing so; you'd have to use %SYSFUNC, and either way it's not really the right answer here.
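For illustration, here is a minimal sketch of storing an evaluated number (computed once, at %LET time) rather than code:
%let U = %sysfunc(rand(uniform));
%let N = %sysfunc(round(%sysevalf(25 + (200 - 25)*&U)));
%put N=&N;  /* N now resolves to a fixed number, not to code */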
First, if you're repeatedly sampling replicates, look at the paper 'Don't Be Loopy', which has some applications here. Also consider Rick Wicklin's paper 'Sampling with Replacement'; the book he references there ("Simulating Data with SAS") is quite good as well. If you're running your process on a one-sample-one-execution model, that's the slow and difficult-to-work-with way. Do all the replicates at once and process them all at once; IML and the DATA step are both happy to do that for you. Your uniform random sample size makes things a bit more difficult, but it's not insurmountable.
If you must do it the way you're doing it, I would have the DATA step create the macro variable, if there's a reason to do that. At the end of the sample, you can use CALL SYMPUTX to write out the value of N, i.e.:
%let iter=7; *we happen to be on the seventh iteration of your master macro;
DATA ID;
    CALL STREAMINIT(&iter.);
    U = RAND("UNIFORM");
    N = ROUND(25 + (200 - 25)*U);
    DO PersonID = 1 TO N;
        X = RAND("NORMAL",0,1);
        OUTPUT;
    END;
    CALL SYMPUTX('N',N);
    CALL SYMPUTX('U',U);
RUN;
But again, a one-data-step model is probably your most efficient model.

I'm not sure how to do it in the macro world, but this is how you could convert your code to a data step to accomplish the same thing.
The key is setting the random number stream initialization value, using CALL STREAMINIT.
data _null_;
    call streaminit(35);
    u = rand('uniform');
    call symputx('U', u);                      /* SYMPUTX avoids leading blanks */
    call symputx('N', round(25 + (200 - 25)*u));
run;
%put &n;
%put &u;

As Joe points out, the efficient way to perform this simulation is to generate all 1000 samples in a single data step, as follows:
data AllSamples;
    call streaminit(123);
    do SampleID = 1 to 1000;
        N = round(25 + (200 - 25)*rand("UNIFORM"));
        /* simulate sample of size N HERE */
        do PersonID = 1 to N;
            X = rand("NORMAL",0,1);
            output;
        end;
    end;
run;
This ensures independence of the random number streams, and it takes a fraction of a second to produce the 1000 samples. You can then use a BY statement to analyze the sampling distributions of the statistics on each sample. For example, the following call to PROC MEANS outputs the sample size, sample mean, and sample standard deviation for each of the 1000 samples:
proc means data=AllSamples noprint;
    by SampleID;
    var X;
    output out=OutStats n=SampleN mean=SampleMean std=SampleStd;
run;

proc print data=OutStats(obs=5);
    var SampleID SampleN SampleMean SampleStd;
run;
For more details about why the BY-group approach is more efficient (total time: less than 1 second!), see the article "Simulation in SAS: The slow way or the BY way."

Related

Reproducible random number generation in multiple SAS data steps

I am struggling to figure out the best way to generate random numbers reproducibly using multiple SAS data steps.
To do it in one data step is straightforward: just use CALL STREAMINIT at the start of the data step.
However, if I then use a second data step, I can't figure out any way to continue the sequence of random numbers. If I don't use CALL STREAMINIT at all in the second data step, then the random numbers in the second data step are not reproducible. If I use CALL STREAMINIT with the same seed, I get the same random numbers as in the first data step.
The only thing I can think of is to use CALL STREAMINIT with a different seed in each data step. Somehow that seems less satisfactory to me than using just one long random number sequence starting with the first data step.
So for example I could do something like this:
%macro myrandom;
%do i = 1 %to 10;
data dataset&i;
call streaminit(&i);
[do stuff involving random numbers]
run;
%end;
%mend;
But somehow using a predictable sequence of seeds seems like cheating. Should I be worried about that? Is that actually a perfectly acceptable way of doing it, or is there a better way?
Here is my attempt at this:
%macro dataset_rand(_num,_rows);
data dataset;
    call streaminit(123);  /* only the first call takes effect, so set it before the loop */
    do i = 0 to &_rows - 1;
        c = rand("UNIFORM");
        varnum = mod(i,&_num.) + 1;
        output;
    end;
run;

data %do i = 1 %to &_num.;
         dataset&i.
     %end; ;
    set dataset;
    %do j = 1 %to &_num.;
        if varnum = &j. then output dataset&j.;
    %end;
run;
%mend;
%dataset_rand(10,100);
Here I ran one step to create every single row with a single random variable and another variable that will be used to assign it to a dataset.
The inputs are _num and _rows, which let you choose how many rows total and how many tables; the example (10,100) creates 10 tables of 10 rows each, with dataset1 holding the 1st, 11th, ..., 91st members of the random sequence.
That said, I don't know of any reason why 10 datasets with 10 seeds would be any better or worse than 1 dataset with 1 seed split into 10.
Using RANUNI or similar (the 'old' random number generators), you would use CALL RANUNI to accomplish this. It lets you save the seed for the next round; you can then CALL SYMPUTX that value, pass it to the next data step, and restart the same stream. That works because, in that algorithm, each output value is a direct variation on the seed for the next one.
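A minimal sketch of that pattern, assuming the older RANUNI stream is acceptable:
/* draw five numbers, then save the stream state for the next step */
data first;
    seed = 12345;
    do i = 1 to 5;
        call ranuni(seed, x);    /* seed is updated in place with each call */
        output;
    end;
    call symputx('seed', seed);  /* pass the state forward as a macro variable */
run;

/* resume the same stream exactly where the first step left off */
data second;
    seed = &seed;
    do i = 6 to 10;
        call ranuni(seed, x);
        output;
    end;
run;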
However, using RAND, the seed is more complicated (it's not really just one value, after the first number was called). From the documentation:
The RAND function is started with a single seed. However, the state of the process cannot be captured by a single seed. You cannot stop and restart the generator from its stopping point.
This is of course a simplification (obviously SAS is capable of doing so, it just doesn't open up the right hooks for you to do so, presumably as it's not as straightforward as call ranuni is).
What you can do, though, is use the macro language, depending on exactly what you're trying to do. Using %syscall and %sysfunc, you can get a single stream that goes across data steps.
However, one caveat: it doesn't look like you can ever reset it. From documentation on Seed Values:
When the RANUNI function is called through the macro language by using %SYSFUNC, one pseudo-random number stream is created. You cannot change the seed value unless you close SAS and start a new SAS session. The %SYSFUNC macro produces the same pseudo-random number stream as the DATA steps that generated the data sets A, B, and C for the first macro invocation only. Any subsequent macro calls produce a continuation of the single stream.
This is specific to the RANUNI family, but it looks like it is also true for the RAND family.
So, start up a new session of SAS, and run this:
%macro get_rands(seed=0, n=, var=, randtype=Uniform, randargs=);
    %local i;
    %syscall streaminit(seed);
    %do i = 1 %to &n;
        &var. = %sysfunc(rand(&randtype. &randargs.));
        output;
    %end;
%mend get_rands;
data first;
    %get_rands(seed=7,n=10,var=x);
run;

data second;
    %get_rands(n=10,var=x);
run;

data whole;
    call streaminit(7);
    do _i = 1 to 20;
        x = rand('Uniform');
        output;
    end;
run;
But don't make the mistake of running it twice in one session.
Otherwise, your best bet is to generate your random numbers once, then use them in multiple data steps. If you use BY groups, it's easy to manage things this way. If you have specific questions how to implement your project in this way, let us know in a new question.
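For instance, a minimal sketch of the generate-once approach, with a step variable marking which later data step each number is destined for (the names are just for illustration):
/* generate the full sequence once, tagged by the step that will consume it */
data rands;
    call streaminit(7);
    do step = 1 to 2;
        do i = 1 to 10;
            u = rand('uniform');
            output;
        end;
    end;
run;

/* each later data step reads its own slice of the one reproducible stream */
data first;
    set rands(where=(step=1));
run;

data second;
    set rands(where=(step=2));
run;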
Not sure if it's much easier, but you can use the STREAM subroutine to generate multiple independent streams from the same initial seed. Below is an example, slightly modified from the doc on CALL STREAM.
%macro RepeatRand(N=, Repl=, seed=);
    %do k = 1 %to &Repl;
        data dataset&k;
            call streaminit('PCG', &seed);
            call stream(&k);
            do i = 1 to &N;
                u = rand("uniform");
                output;
            end;
        run;
    %end;
%mend;
%RepeatRand(N=8, Repl=2, seed=36457);

Summing Multiple Lags in SAS Using LAG

I'm trying to make a data step that creates a column in my table holding the sum of the last ten, fifteen, twenty, or forty-five lagged values. What I have below works for ten, but it is not practical to write this code out for the twenty and forty-five summed lags. I'm new to SAS and can't find a good way to write the code. Any help would be greatly appreciated.
Here's what I have:
data averages;
    set work.cuts;
    sum_lag_ten = lag10(col) + lag9(col) + lag8(col) + lag7(col) + lag6(col) +
                  lag5(col) + lag4(col) + lag3(col) + lag2(col) + lag1(col);
run;
PROC EXPAND allows easy calculation of moving statistics.
Technically it requires a time component, but if you don't have one you can make one up, just make sure it's consecutive. A row number would work.
Given this, I'm not sure it's less code, but it's easier to read and type. And if you're calculating for multiple variables it's much more scalable.
TRANSFORMOUT specifies the transformation, in this case a moving sum with a window of 10 periods. TRIMLEFT/TRIMRIGHT can be used to ensure that only records with a full 10 periods are included.
You may need to tweak these depending on what exactly you want. The third example under PROC EXPAND has examples.
data have;
    set have;
    RowNum = _n_;
run;

proc expand data=have out=want;
    id RowNum;
    convert col=col_lag10 / transformout=(MOVSUM 10 TRIMLEFT 9);
run;
Documentation (SAS/ETS 14.1):
http://support.sas.com/documentation/cdl/en/etsug/68148/HTML/default/viewer.htm#etsug_expand_examples04.htm
If you must do this in the datastep (and if you do things like this regularly, SAS/ETS has better tools for sure), I would do it like this.
data want;
    set sashelp.steel;
    array lags[20];
    retain lags1-lags20;
    *move everything up one;
    do _i = dim(lags) to 2 by -1;
        lags[_i] = lags[_i-1];
    end;
    *assign the current record value;
    lags[1] = steel;
    *now calculate sums. If you want only earlier records and NOT this record,
     use lags2-lags11, or do the sums before the move-everything-up step;
    lag_sum_10 = sum(of lags1-lags10);
    lag_sum_15 = sum(of lags1-lags15); *etc.;
run;
Note: this is not the best solution (I think a hash table is better), but it is more approachable for an intermediate-level programmer, as it uses only data step variables.
I don't use a temporary array because you need the variable-list shortcuts to do the sums; with a temporary array you don't get those, unfortunately (there's no way to sum just elements 1-10, you can only sum the whole array with [*]).
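A small sketch of that limitation (the log note about missing values is expected here, since nothing is assigned):
data _null_;
    array lags[20];              /* ordinary array: elements are variables lags1-lags20 */
    array t[20] _temporary_;     /* temporary array: elements have no variable names */
    s1 = sum(of lags1-lags10);   /* fine: name-range shortcut over the first 10 */
    s2 = sum(of t[*]);           /* the whole temporary array is the only option */
run;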

Is sorting more favorable (efficient) in if-else statement?

Assume two functions fun1, fun2 have been defined to carry out some calculation given input x.
The structure of data have is:
Day      Group  x
01Jul14  A      1.5
02Jul14  B      2.7
I want to do something like this:
data want;
    set have;
    if Group = 'A' then y = fun1(x);
    if Group = 'B' then y = fun2(x);
run;
Is it better to do proc sort data=have;by Group;run; first then move on to the data step? Or it doesn't matter because each time it just picks one observation and determines which if statement it falls into?
So long as you are not doing anything to alter the normal input of observations - such as using random access (point=), building a hash table, using a by statement, etc. - sorting will have no impact: you read each row regardless, check both conditions, and execute whichever applies. Nothing different occurs sorted or unsorted.
This is easy to test. Write something like this:
%put Before Unsorted Time: %sysfunc(time(),time8.);
***your datastep here***;
%put After Unsorted Time: %sysfunc(time(),time8.);
proc sort data=your_dataset;
by x;
run;
%put Before Sorted Time: %sysfunc(time(),time8.);
***your datastep here***;
%put After Sorted Time: %sysfunc(time(),time8.);
Or just run your datasteps and look at the execution time!
You may be confusing this with sorting your if statements (i.e., changing their order in the code). That could have an impact if your data is skewed and you use else, because SAS then won't have to evaluate the downstream conditionals. It's not very common for this to have any real impact - it only matters when you have extremely skewed data, large numbers of observations, and certain other conditions based on your code - so I wouldn't program for it.
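For example, a sketch assuming fun1/fun2 from the question and that most rows fall in group 'A':
data want;
    set have;
    /* with ELSE IF, rows matching the first condition skip the rest,
       so putting the most frequent group first saves comparisons */
    if Group = 'A' then y = fun1(x);       /* say 95% of rows */
    else if Group = 'B' then y = fun2(x);  /* evaluated only for the remaining 5% */
run;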

Macro that outputs table with testing results of SAS table

Problem
I'm not a very experienced SAS user, but unfortunately the lab where I can access data is restricted to SAS. Also, I don't currently have access to the data since it is only available in the lab, so I've created simulated data for testing.
I need to create a macro that gets the values and dimensions from a PROC MEANS table and performs some tests that check whether or not the top two values from the data make up 90% of the results.
As an example, assume I have panel data that lists firms' revenue, costs, and profits. I've created a table that lists n, sum, mean, median, and std. Now I need to check whether or not the top two firms make up 90% of the results and, if so, flag whether it's profit, revenue, or costs that makes up the 90%.
I'm not sure how to get started
Here are the steps:
1. Read the data.
2. Read the PROC MEANS table created, and get its dimensions and variables.
3. Get the top two firms for each variable and perform the check.
4. Create a new table that lists the variable, the value from the read table, the largest and second largest, and the flag.
5. Print the table.
Simulated data:
https://www.dropbox.com/s/ypmri8s6i8irn8a/dataset.csv?dl=0
PROC MEANS Table
proc import datafile="/folders/myfolders/dataset.csv"
    out=dt
    dbms=csv
    replace;
    getnames=yes;
run;

TITLE "Macro Project Sample";
PROC MEANS DATA=dt n sum mean median std;
    VAR V1 V2 V3;
RUN;
Desired Results:
     Value      Largest  Sec. Largest  Flag
V1   463138.09  9888.09  9847.13
V2   148.92     1.99     1.99
V3   11503375   9999900  1000000       Y
At the moment I can't open your simulated dataset, but I can give you some advice; I hope it helps.
You can output the n extreme values of given variables using the OUTPUT OUT= statement with the IDGROUP option.
Here is an example using the Charity dataset (run this to create it: http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p1oii7oi6k9gfxn19hxiiszb70ms.htm).
proc means data=Charity;
    var MoneyRaised HoursVolunteered;
    output out=try sum=
        IDGROUP (MAX(MoneyRaised HoursVolunteered) OUT[2] (MoneyRaised HoursVolunteered)=max1 max2);
run;
data var1 (keep=name1 _freq_ MoneyRaised max1_1 max1_2
           rename=(MoneyRaised=value max1_1=largest max1_2=seclargest name1=name))
     var2 (keep=name2 _freq_ HoursVolunteered max2_1 max2_2
           rename=(HoursVolunteered=value max2_1=largest max2_2=seclargest name2=name));
    length name1 name2 $4;
    set try;
    name1='VAR1';
    name2='VAR2';
run;
data finalmerge;
    length flag $1;
    set var1 var2;
    if largest + seclargest > value*0.9 then flag='Y';
run;
In the PROC MEANS I chose two variables, MoneyRaised and HoursVolunteered; you will choose your V1, V2, and V3 and make the corresponding changes throughout the program.
IDGROUP outputs the max values for both variables, as you see in the parentheses; with OUT[2] you get the two largest, i.e. the largest and second largest.
You must rename them; I chose the names max1 and max2, and SAS automatically adds _1 and _2 suffixes for the first and second max values.
All the output will be on the same line, so I use a data step that writes two datasets (data var1 var2), keeping the variables needed and renaming them for the next step; I also chose a naming scheme, as you see.
Finally I concatenate the two datasets and add the flag.
Here are some initial steps and pointers for a non-macro approach which restructures the data in such a manner that no array processing is required. This approach should be good for teaching you a bit about manipulating data in SAS, but it will not be as fast as a single-pass approach (like the macros you originally posted), since it transposes and sorts the data.
First create some nice looking dummy data.
/* Create some dummy data with three variables to assess */
data have;
    do firm = 1 to 3;
        revenue = rand("uniform");
        costs = rand("uniform");
        profits = rand("uniform");
        output;
    end;
run;
Transpose the data so all the values are in one column (with the variable names in another).
/* Move from wide to deep table */
proc transpose
        data = have
        out = trans
        name = Variable;
    by firm;
    var revenue costs profits;
run;
Sort the data so each variable is in a contiguous group of rows and the highest values are at the end of each Variable group.
/* Sort by Variable and then value so the biggest
   values are at the end of each Variable group */
proc sort data = trans;
    by Variable COL1;
run;
Because of the structure of this data, you could go down through each observation in turn, creating a running total, which, when you get to the final observation in a Variable group, is the group total. In this observation you also have the largest value (the second largest was in the previous observation).
At this point you can create a data step (see the sketch after this list) that:
- Is aware when it is in the first and last values of each variable group
  - a by statement to make the data step aware of your groups
  - the first.Variable temporary variable so you can initialise your total variable to 0
  - the last.Variable temporary variable so you can output only the last line of each group
- Sums up the values in each group
  - a retain statement so SAS doesn't empty your total with each new observation
  - the sum() function or + operator to create your total
- Creates and populates new variables for the largest and second largest values in each group
  - the lag() function or a retain statement to keep the previous value (the second largest)
- Creates your flag
- Outputs your new variables at the end of each group
  - an output statement to request an observation be stored
  - a keep statement to select which variables you want
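Putting those pieces together, here is a minimal sketch, assuming the sorted dataset trans from above with the values in COL1 (groups with a single row would need extra handling, since lag() crosses group boundaries):
data want;
    set trans;
    by Variable;
    if first.Variable then Total = 0;  /* reset the running total per group */
    Total + COL1;                      /* sum statement: implicitly RETAINed */
    SecLargest = lag(COL1);            /* previous row's value, taken on every row */
    if last.Variable then do;
        Largest = COL1;                /* sorted ascending, so the last row is the max */
        if Largest + SecLargest > 0.9 * Total then Flag = 'Y';
        output;                        /* one summary row per Variable group */
    end;
    keep Variable Total Largest SecLargest Flag;
run;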
The macros you posted originally looked like they were meant to perform the analysis you are describing, but with some extras (only positive values contributed to the total, an arbitrary number of values could be included rather than just the top 2, the total was multiplied by another variable k1198, negative values were caught in the second largest, and extra flags and values were calculated).

Extracting sub-data from a SAS dataset & applying to a different dataset

I have written a macro that uses proc univariate to calculate custom quantiles for variables in a dataset (say dsn1): %cust_quants(dsn= , varlist= , quant_list= ). The output is a summary dataset (say dsn2) that looks something like the following:
q_1     q_2.5  q_50  q_80  q_97.5  q_99   var_name
1       2.5    50    80    97.5    99     ex_var_1_100
-2      10     25    150   500     20000  ex_var_pos_skew
-20000  -500   -150  0     10      50     ex_var_neg_skew
What I would like to do is to use the summary dataset to cap/floor extreme values in the original dataset. My idea is to extract the column of interest (say q_99) and put it into a vector of macro-variables (say q_99_1, q_99_2, ..., q_99_n). I can then do something like the following:
/* create summary of dsn1 as above example */
%cust_quants(dsn=dsn1, varlist=ex_var_1_100 ex_var_pos_skew ex_var_neg_skew,
             quant_list=1 2.5 50 80 97.5 99);
/* cap dsn1 var's at 99th percentile */
data dsn1_cap;
    set dsn1;
    if ex_var_1_100 > &q_99_1 then ex_var_1_100 = &q_99_1;
    if ex_var_pos_skew > &q_99_2 then ex_var_pos_skew = &q_99_2;
    /* don't cap neg skew */
run;
In R, it is very easy to do this. One can extract sub-data from a data frame using matrix-like indexing and assign this sub-data to an object. This second object can then be referenced later. R example - extracting b from data frame a:
> a <- as.data.frame(cbind(c(1,2,3), c(4,5,6)))
> print(a)
V1 V2
1 1 4
2 2 5
3 3 6
> a[, 2]
[1] 4 5 6
> b <- a[, 2]
> b[1]
[1] 4
Is it possible to do the same thing in SAS? I want to be able to assign a column(s) of sub-data to a macro variable / array, such that I can then use the macro / array within a 2nd data step. One thought is PROC SQL's INTO:
proc sql noprint;
    select v2 into :v2_macro separated by " "
    from a;
quit;
However, this creates a single string variable, when what I really want is a vector of variables (or an array - there are no vectors in SAS). Another thought is to add %scan (assuming this is inside a macro):
proc sql noprint;
    select v2 into :v2_macro separated by " "
    from a;
quit;

%let i = 1;
%do %while(%scan(&v2_macro, &i) ne );
    %let var_&i = %scan(&v2_macro, &i);
    %let i = %eval(&i + 1);
%end;
This seems inefficient and takes a lot of code. It also requires the programmer to remember which var_&i corresponds to each future purpose. Is there a simpler / cleaner way to do this?
Please let me know in the comments if this is enough background / example. I'm happy to give a more complete description of why I'm doing what I'm attempting if needed.
First off, I assume you are talking about SAS/Base not SAS/IML; SAS/IML is essentially similar to R and has the same kind of operations available in the same manner.
SAS/Base is more similar to a database language than a matrix language (though it has some elements of both, and some elements of an OOP language, as well as being a full-featured functional programming language).
As a result, you do things somewhat differently in order to achieve the same goal. Additionally, because of the cost of moving data in a large data table, you are given multiple methods to achieve the same result; you can choose the appropriate method for the required situation.
To begin with, you generally should not store data in a macro variable in the manner you suggest. It is bad programming practice, and it is inefficient (as you have already noticed). SAS Datasets exist to store data; SAS macro variables exist to help simplify your programming tasks and drive the code.
Creating the dataset "b" as above is trivial in Base SAS:
data b;
    set a;
    keep v2;
run;
That creates a new dataset with the same rows as A, but only the second column. KEEP and DROP allow you to control which columns are in the dataset.
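Equivalently, KEEP= can be used as a dataset option, which is slightly more efficient since only that column is even read in:
data b;
    set a(keep=v2);  /* dataset option: only v2 is read from disk */
run;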
However, there would be very little point in this dataset, unless you were planning on modifying the data; after all, it contains the same information as A, just less. So for example, if you wanted to merge V2 into another dataset, rather than creating b, you could simply use a dataset option with A:
data c;
    merge z a(keep=v2);
    by id;
run;
(Note: I presuppose an ID variable of some form to combine A and Z.)
This merge combines the v2 column onto z, in a new dataset, c. This is equivalent to horizontally concatenating two matrices, i.e. adding columns (a straight-up concatenation would remove the by id; requirement, but in databases you typically don't do that, as row order is not guaranteed to be what you expect).
If you plan on using b to do something else, how you create and/or use it depends on that usage. You can create a format, which is a mapping of values (i.e., 1='Hello', 2='Goodbye') and thus allows you to convert one value to another with a single programming statement. You can load it into a hash table. You can transpose it into a row (proc transpose). Supply more detail and a more specific answer can be provided.
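For the capping problem in the question, here is a minimal sketch of the hash-table route, assuming the summary dataset is named dsn2 with columns var_name and q_99 (names assumed from the printed table above):
data dsn1_cap;
    if 0 then set dsn2(keep=var_name q_99);  /* define host variables for the hash */
    if _n_ = 1 then do;
        declare hash h(dataset: 'dsn2');     /* load the quantile summary once */
        h.defineKey('var_name');
        h.defineData('q_99');
        h.defineDone();
    end;
    set dsn1;
    /* look up each variable's 99th percentile and cap at it */
    if h.find(key: 'ex_var_1_100') = 0 and ex_var_1_100 > q_99
        then ex_var_1_100 = q_99;
    if h.find(key: 'ex_var_pos_skew') = 0 and ex_var_pos_skew > q_99
        then ex_var_pos_skew = q_99;
    drop var_name q_99;
run;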