SAS/IML: how to use individual variance components in RANDNORMAL - sas

This is a programming question, but I'll give you a little of the stats background first. This question refers to part of a data sim for a mixed-effects location scale model (i.e., heterogeneous variances). I'm trying to simulate two MVN variance components using the RANDNORMAL function in IML. Because both variance components are heterogeneous, the variances used by RANDNORMAL will differ across people. Thus, I need IML to select the specific row (e.g., row 1 = person 1) and use the RANDNORMAL function before moving onto the next row, and so on.
My example code below is for 2 people. I use DO to loop through each person's specific variance components (VC1 and VC2). I get the error: "Module RANDNORMAL called again before exit from prior call." I am assuming I need some kind of BREAK or EXIT function in the DO loop, but none I have tried work.
PROC IML;
ColNames = {"ID" "VC1" "VC2"};
A = {1 2 3,
2 8 9};
PRINT A[COLNAME=ColNames];
/*Set men of each variance component to 0*/
MeanVector = {0, 0};
/*Loop through each person's data using THEIR OWN variances*/
DO i = 1 TO 2;
VC1 = A[i,2];
VC2 = A[i,3];
CovMatrix = {VC1 0,
0 VC2};
CALL RANDSEED(1);
U = RANDNORMAL(2, MeanVector, CovMatrix);
END;
QUIT;
Any help is appreciated. Oh, and I'm using SAS 9.4.

You want to move some things around, but mostly you don't want to rewrite U twice: you need to write U's 1st row, then U's 2nd row, if I understand what you're trying to do. The below is a bit more efficient also, since I j() the U and _cv matrices rather than constructing then de novo every time through the loop (which is slow).
proc iml;
a = {1 2 3,2 8 9};
print(a);
_mv = {0,0};
U = J(2,2);
_cv = J(2,2,0);
CALL RANDSEED(1);
do i = 1 to 2;
_cv[1,1] = a[i,2];
_cv[2,2] = a[i,3];
U[i,] = randnormal(1,_mv, _cv);
end;
print(u);
quit;

Your mistake is the line
CovMatrix = {VC1 0, 0 VC2}; /* wrong */
which is not valid SAS/IML syntax. Instead, use #Joe's approach or use
CovMatrix = (VC1 || 0) // (0 || VC2);
For details, see the article "How to build matrices from expressions."
You might also be interested in this article that describes how to carry out this simulation with a block-diagonal matrix: "Constructing block matrices with applications to mixed models."

Related

SAS Studio: Finding Values of Column 2 in Column 1 Until Column 2 is Specific Value

I have a simple question that I can't seem to answer. I HAVE a large data set where I am searching for values of column 2 that are found in column 1, until column 2 is a specific value. Sounds like a DO loop but I don't have much experience using them. Please see image as this likely will explain better.
Essentially, I have a "starting" point (with the first_match flag=1). Then, I want to grab the value of column 2 in this row (B in this example). Next, I want to search for this value (B) in column 1. Once I find that row (with column 1 = B & column 2 = C), I again grab the value in column 2 (C). Again, I find where in column 1 this new value occurs and obtain the corresponding value of column 2. I repeat this process until column 2 has a value of Z. That's my stopping point. The WANT table shows my desired output.
My apologies if the above is confusing, but it seems like a simple exercise that I can't seem to solve. Any help would be greatly appreciated. Glad to supply further clarification as well.
Have & Want
I have tried PROC SQL to create flags and grab the appropriate rows, but the code is extremely bulky and doesn't seem efficient. Also, the example I laid out has a desired output table with 3 rows. This may not be the case as the desired output could contain between 1 and 10 rows.
This question has been asked and answered previously.
Path traversal can be done using a DATA Step hash object.
Example:
data have;
length vertex1 vertex2 $8;
input vertex1 vertex2;
datalines;
A B
X B
D B
E B
B C
Q C
C Z
Z X
;
data want(keep=vertex1 vertex2 crumb);
length vertex1 vertex2 $8 crumb $1;
declare hash edges ();
edges.defineKey('vertex1');
edges.defineData('vertex2', 'crumb');
edges.defineDone();
crumb = ' ';
do while (not last_edge);
set have end=last_edge;
edges.add();
end;
trailhead = 'A';
vertex1 = trailhead;
do while (0 = edges.find());
if not missing(crumb) then leave;
output;
edges.replace(key:vertex1, data:vertex2, data:'*');
vertex1 = vertex2;
end;
if not missing(crumb) then output;
stop;
run;
All paths in the data can be discovered with an additional outer loop iterating (HITER) over a hash of the vertex1 values.

How to perform rolling window calculations without SSC packages

Goal: perform rolling window calculations on panel data in Stata with variables PanelVar, TimeVar, and Var1, where the window can change within a loop over different window sizes.
Problem: no access to SSC for the packages that would take care of this (like rangestat)
I know that
by PanelVar: gen Var1_1 = Var1[_n]
produces a copy of Var1 in Var1_1. So I thought it would make sense to try
by PanelVar: gen Var1SumLag = sum(Var1[(_n-3)/_n])
to produce a rolling window calculation for _n-3 to _n for the whole variable. But it fails to produce the results I want, it just produces zeros.
You could use sum(Var1) - sum(Var1[_n-3]), but I also want to be able to make the rolling window left justified (summing future observations) as well as right justified (summing past observations).
Essentially I would like to replicate Python's ".rolling().agg()" functionality.
In Stata _n is the index of the current observation. The expression (_n - 3) / _n yields -2 when _n is 1 and increases slowly with _n but is always less than 1. As a subscript applied to extract values from observations of a variable it always yields missing values given an extra rule that Stata rounds down expressions so supplied. Hence it reduces to -2, -1 or 0: in each case it yields missing values when given as a subscript. Experiment will show you that given any numeric variable say numvar references to numvar[-2] or numvar[-1] or numvar[0] all yield missing values. Otherwise put, you seem to be hoping that the / yields a set of subscripts that return a sequence you can sum over, but that is a long way from what Stata will do in that context: the / is just interpreted as division. (The running sum of missings is always returned as 0, which is an expression of missings being ignored in that calculation: just as 2 + 3 + . + 4 is returned as 9 so also . + . + . + . is returned as 0.)
A fairly general way to do what you want is to use time series operators, and this is strongly preferable to subscripts as (1) doing the right thing with gaps (2) automatically working for panels too. Thus after a tsset or xtset
L0.numvar + L1.numvar + L2.numvar + L3.numvar
yields the sum of the current value and the three previous and
L0.numvar + F1.numvar + F2.numvar + F3.numvar
yields the sum of the current value and the three next. If any of these terms is missing, the sum will be too; a work-around for that is to return say
cond(missing(L3.numvar), 0, L3.numvar)
More general code will require some kind of loop.
Given a desire to loop over lags (negative) and leads (positive) some code might look like this, given a range of subscripts as local macros i <= j
* example i and j
local i = -3
local j = 0
gen double wanted = 0
forval k = `i'/`j' {
if `k' < 0 {
local k1 = -(`k')
replace wanted = wanted + L`k1'.numvar
}
else replace wanted = wanted + F`k'.numvar
}
Alternatively, use Mata.
EDIT There's a simpler method, to use tssmooth ma to get moving averages and then multiply up by the number of terms.
tssmooth ma wanted1=numvar, w(3 1)
tssmooth ma wanted2=numvar, w(0 1 3)
replace wanted1 = 4 * wanted1
replace wanted2 = 4 * wanted2
Note that in contrast to the method above tssmooth ma uses whatever is available at the beginning and end of each panel. So, the first moving average, the average of the first value and the three previous, is returned as just the first value at the beginning of each panel (when the three previous values are unknown).

Nearest Neighbor Matching in Stata

I need to program a nearest neighbor algorithm in stata from scratch because my dataset does not allow me to use any of the available solutions (as far as I am concerned).
To be pecise. I have a dataset that is of similar structure to that of the following (original has around 14k observations)
input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end
I want to generate a variable called match (already included in the example above). For each observation with treatment == 1 the variable match should store the id of another observation from within treatment == 0 whose value is closest to value of the considered observation (treatment == 1).
I am new to stata programming, so I am not yet familiar with the syntax. My first shot is the following however it does not produce any changes to the match variable. I am sure this is a novice question but I am hoping for some advice on how to make the code running.
EDIT: I have changed the code slightly and now it seems to work. Do you see any problems that may arise if I run it on a bigger dataset?
set more off
clear all
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end
gen match = .
forval i = 1/`= _N' {
if treatment[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treatment[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
Consider some simulated data: 1,000 observations, 200 of them untreated (treat == 0) and the rest treated (treat == 1). Then the code included below will be much more efficient than the originally posted. (Ties, like in your code, are not explicitly handled.)
clear
set more off
*----- example data -----
set obs 1000
set seed 32956
gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)
*----- new method -----
timer clear
timer on 1
// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = first[_N]
local lastnt = `firstt' - 1
// start loop
gen match = .
gen dif = .
quietly forvalues i = `firstt'/`=_N' {
// compute distances
replace dif = (pscore[`i'] - pscore)^2
summarize dif in 1/`lastnt', meanonly
// identify id of minimum-distance observation
replace match = . in 1/`lastnt'
replace match = id in 1/`lastnt' if dif == r(min)
summarize match in 1/`lastnt', meanonly
// save the minimum-distance id
replace match = r(max) in `i'
}
// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat
timer off 1
tempfile first
save `first'
*----- your method -----
drop match
timer on 2
gen match = .
quietly forval i = 1/`= _N' {
if treat[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treat[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
timer off 2
tempfile second
save `second'
// check for equality of results
cf _all using `first'
// check times
timer list
The results in seconds to finish execution:
. timer list
1: 0.19 / 1 = 0.1930
2: 10.79 / 1 = 10.7900
The difference is huge, specially considering this data set has only 1,000 observations.
An interesting thing to notice is that as the number of non-treated cases increases relative to the number of treated, then the original method improves, but never reaches the levels of efficiency of the new method. As an example, invert the number of cases, so there is now 800 untreated and 200 treated (change data setup to gen treat = cond(_n <= 800, 0, 1)). The result is
. timer list
1: 0.07 / 1 = 0.0720
2: 4.45 / 1 = 4.4470
You can see that the new method also improves and is still much faster. In fact, the relative difference is still the same.
Another way to do this is using joinby or cross. The problem is they temporarily expand (a lot) the size of your data base. In many cases, they are not feasible due to the hard limit Stata has on the number of possible observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.
Edit
If there's a large number of treated relative to untreated, your code suffers
because you go through the whole first loop many more times (due to the first if).
Furthermore, going through
that whole loop once, implies going through another loop that
has itself two if conditions, _N more times.
The opposite case in which there are few treated observations means that you go through the whole
first loop only in a small number of occasions, speeding up your code substantially.
The reason my code can maintain its efficiency is due to the use of in. This always
offers speed gains over if. Stata will go directly to those observations with no
logical checking needed. Your problem provides an opportunity for that replacement
and it's wise to seize it.
If my code used if where in is in place, the results would be different.
Your code would be faster for the
case in which there's a large number of untreated relative to treated, and again, that
is because in your code there would not be the need to go through the complete loop,
requiring very little work;
the first loop is short-circuited with the first if. For the opposite case,
my code would still dominate.
The key is to "separate" treated from untreated and work on each group using in.

SAS Randgen call with Weibull distribution

I am trying to use call randgen within proc IML to create 10 random numbers that follow a Weibull distribution with certain parameters. Here is the code I am using (obviously there will be more than just one loop but I am just testing right now):
do i = 1 to 1;
Call randgen(Rands[i,1:Ntimes], 'Weibull', alpha[i], beta[i]);
print (rands[1,1:Ntimes]);
print (alpha[i]) (beta[i]);
end;
For this example Ntimes = 10, alpha[i] = 4.5985111, and beta[i] = 131.79508. My issue is that each of the 10 iterations/random numbers comes back as 1. I used the rweibull function in R with the same parameters and got results that made sense so I am thinking it has something to do with SAS or my code rather than an issue with the parameters. Am I using the Randgen call correctly? Does anyone know why the results would be coming out this way?
This works:
proc iml;
alpha=j(10);
beta=j(10);
alpha[1]=4.59;
beta[1] = 131.8;
Ntimes=10;
rands = j(1,10);
print (rands);
do i = 1 to 1;
Call randgen(Rands, 'WEIB', alpha[1],beta[1]);
print (rands);
end;
quit;
I don't think you can use Rands[1:Ntimes] that way. I think you would want to assign it to a temporary matrix and then assign that matrix's results to a larger matrix.
IE:
allRands=j(10,10);
do i = 1 to 10;
Call randgen(Rands, 'WEIB', alpha[1],beta[1]);
print (rands);
allRands[i,1:10]=Rands;
end;
print(allRands);
Actually, unless you are using an ancient version of SAS/IML, you don't need any loops. Since SAS/IML 12.3, the RANDGEN subroutine accepts a vector of parameters. In your case, define a vector for the alpha and beta parameters. Let's say there are 'Nparam' parameters. Then allocate an N x Nparam matrix to hold the results. With a single call to RANDGEN, you can fill the matrix so that the i_th column is a sample of size N from Weibull(alpha[i], beta[i]), as shown in the following example:
proc iml;
Nparam = 8; N = 1000;
alpha= 1:Nparam; /* assign parameter values */
beta = 10 + (Nparam:1);
rands = j(N,Nparam);
call randgen(rands, 'WEIB', alpha,beta); /* SAS/IML 12.1 */
/* DONE. The i_th column is a sample from Weibul(alpha[i], beta[i])
TEST IT: Compute the mean of each sample: */
mean = mean(rands); std = std(rands);
print (alpha//beta//mean//std)[r={"alpha" "beta" "mean" "std"}];
/* TEST IT: Plot the distribution of each sample (SAS/IML 12.3) */
title "First param"; call histogram(rands[,1]);
title "Last param"; call histogram(rands[,Nparam]);

SAS Proc IML: Minimizing a function with respect to one variable

Using SAS proc IML, I have a function: CVF(m,p,h,pi,e);
I would like to guess h which minimizes this function. Is there some built-in subroutine to minimize it? Or how can I construct an iterative process for it? Every other variables are defined.
Use call nlpqn();. The function you pass needs to have only 1 parameter, the vector you want to optimize over. So here I define a quadratic function where the a, b, and c parameters are also able to be defined. Use the GLOBAL statement and define the non-moving variables before the call.
Alternatively, you can put everything in the input vector, and then add constraints to keep those values from moving.
proc iml;
start myfun(x) global(a, b, c);
out = a*x**2 + b*x + c;
return (out);
finish myfun;
a = 1;
b = 2;
c = 4;
optn = {0, /* option 1: 0 -> MIN, 1 -> MAX*/
0 /* Print options 0-5 0 least, 5 most*/
};
init = 0;
call nlpqn(rc, res, "myfun", init);
/*rc > 0 means success*/
print rc res;
quit;
Returns:
The SAS System 08:59 Friday, June 27, 2014 3
rc res
3 -1
You can build your own optimization routine with IML language, of course, but you have also optimization routines built-in, see chapter 11 of IML User's Guide, Nonlinear Optimization Examples.