How SAS computes Ridge values in PROC PHREG

The ITPRINT option in the MODEL statement of SAS PROC PHREG displays the iteration history, which includes a Ridge value along with the beta values and log likelihoods for each iteration. Ridge is usually zero but becomes non-zero whenever a log likelihood would otherwise be more negative than the log likelihood for the previous iteration. I need to know how SAS computes that Ridge value, and I can find nothing in the Details section for the procedure, or anywhere else.
It appears that, by default, the Ridge value is always 0.0001 * 2^n, and that SAS starts with n=0 and increments n until the log likelihood is less negative than in the previous iteration. But I have tested at least one example where SAS used Ridge=0.4096 when Ridge=0.2048 would have sufficed.
Update: I now think that SAS is iterating 4^n rather than 2^n. That would explain why Ridge=0.2048 was skipped, and it is consistent with my testing so far.
So I think I have answered my own question and would now like academic support for this method. I'll likely seek that at Cross Validated, as Robert Penridge and Joe suggest.

When PHREG fails to converge, that is, when a log likelihood value is more negative than in the previous iteration, the procedure computes a ridge value. This value is RIDGEINIT * 2^n, with n incremented until either the log likelihood value becomes less negative, or the ridge value reaches RIDGEMAX.
The default RIDGEINIT is 1e-4.
The default RIDGEMAX is MAX(1, RIDGEINIT) * 2000.
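A minimal sketch of that ridging schedule, written in Python purely for illustration (it is not SAS source code, and loglik_with_ridge is a placeholder for one ridged Newton-Raphson step, not anything PHREG exposes):
def choose_ridge(loglik_with_ridge, prev_loglik, ridge_init=1e-4, ridge_max=None):
    # default RIDGEMAX as stated above
    if ridge_max is None:
        ridge_max = max(1.0, ridge_init) * 2000
    n = 0
    while True:
        ridge = ridge_init * 2 ** n
        # stop once the ridged step makes the log likelihood less negative,
        # or the ridge value reaches RIDGEMAX
        if loglik_with_ridge(ridge) > prev_loglik or ridge >= ridge_max:
            return ridge
        n += 1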

Related

Correct values for SsaSpikeEstimator's pvalueHistoryLength

When creating an SsaSpikeEstimator instance with the DetectSpikeBySsa method, there is a parameter called pvalueHistoryLength. Could anybody please help me understand, for a given time series with X points, what the optimal value for this parameter is?
I ran into a similar issue. When I read the paper https://arxiv.org/pdf/1206.6910.pdf, I noticed this paragraph:
Also, simulations and theory (Golyandina, 2010) show that it is
better to choose window length L smaller than half of the time series length
N. One of the recommended values is N/3.
Maybe that's why, in the ML.NET Power Anomaly example, the value is chosen to be 30 for the 90-point dataset.
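A rough illustration of that heuristic in Python (the helper below is hypothetical, not part of ML.NET):
def suggested_window_length(n_points):
    # Golyandina (2010), as quoted above: keep the window length below N/2; N/3 is a common choice
    return max(2, n_points // 3)

print(suggested_window_length(90))  # 30, matching the ML.NET Power Anomaly example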

BY processing in PROC NLMIXED; procedure stops due to error

I simulated 500 replications and planned to analyze each in NLMIXED using BY processing. My NLMIXED code is below:
PROC NLMIXED DATA=MELS GCONV=1E-12 QPOINTS=11;
BY Rep;
PARMS LMFI=&LMFI.
SMFI=&SMFI.
LMRIvar=&LMRIvar.
SMRIvar=0 TO 0.15 BY 0.005;
mu = LMFI + b0i;
evar = EXP(SMFI + t0i);
MODEL Y ~ NORMAL(mu,evar);
RANDOM b0i t0i ~ NORMAL([0,0],[LMRIvar,0,SMRIvar]) SUBJECT=PersonID;
ODS OUTPUT FitStatistics=Fit2 ConvergenceStatus=Conv2 ParameterEstimates=Parm2;
RUN;
For some of these replications, the variance components were sampled to be small, so some non-zero number of convergence errors are expected (note the ConvergenceStatus request on the ODS OUTPUT statement). However, when I get the warning below, NLMIXED quits processing regardless of the number of replications remaining to be analyzed.
WARNING: The final Hessian matrix is full rank but has at least one negative eigenvalue. Second-order optimality condition violated.
ERROR: QUANEW Optimization cannot be completed.
Am I missing something? I would think that NLMIXED could acknowledge the error for that replication, but continue with the remaining replications. Thoughts are appreciated!
Best,
Ryan
Here is what I believe is occurring. The requirement that variances be non-negative, together with the long-tailed distribution of variance estimates, makes variances troublesome to estimate. An update of the variance component estimates may produce a negative value for one or more of them. The NLMIXED procedure then attempts to compute eigenvalues of the random-effects covariance matrix, and at that point it fails with the error shown above.
But note that
V[Y] = (sd[Y])^2
V[Y] = exp(ln(V[Y]))
V[Y] = exp(2*ln(sd[Y]))
V[Y] = exp(2*ln_sd_Y)
Now, suppose that we make ln_sd_Y the parameter. References to V[Y] would then be written as the function shown in the last statement above. Because the domain of the parameter ln_sd_Y is (-infinity, infinity), there is no lower bound on ln_sd_Y, yet the function exp(2*ln_sd_Y) always produces a non-negative variance estimate. In fact, given the limitations of digital computers (negative infinity cannot be represented, only finite values that approach it), exp(2*ln_sd_Y) always produces a strictly positive variance estimate. The estimate may be very, very close to 0, but it always approaches 0 from above. This should prevent SAS from trying to compute eigenvalues of a covariance matrix with a negative variance component.
A slight alteration of your code writes LMRIvar and SMRIvar as functions of ln_sd_LMRIvar and ln_sd_SMRIvar.
PROC NLMIXED DATA=MELS GCONV=1E-12 QPOINTS=11;
BY Rep;
PARMS LMFI=&LMFI.
SMFI=&SMFI.
ln_sd_LMRIvar=%sysfunc(log(%sysfunc(sqrt(&LMRIvar.))))
ln_sd_SMRIvar=-5 to -1 by 0.1;
mu = LMFI + b0i;
evar = EXP(SMFI + t0i);
MODEL Y ~ NORMAL(mu,evar);
RANDOM b0i t0i ~ NORMAL([0,0],
[exp(2*ln_sd_LMRIvar), 0,
exp(2*ln_sd_SMRIvar)]) SUBJECT=PersonID;
ODS OUTPUT FitStatistics=Fit2 ConvergenceStatus=Conv2 ParameterEstimates=Parm2;
RUN;
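As a quick numeric check of the reparameterized PARMS grid above, here is a short Python sketch (purely illustrative; the grid endpoints are taken from the code):
import math

# ln_sd values from the PARMS grid map back to variances via exp(2 * ln_sd)
for ln_sd in (-5.0, -3.0, -1.0):
    print(ln_sd, math.exp(2 * ln_sd))
# -5 gives about 4.5e-05 and -1 gives about 0.135, roughly spanning the original
# SMRIvar grid of 0 to 0.15, and every value is strictly positive.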
Alternatively, you could employ a bounds statement in an attempt to prevent updates of LMRIvar and/or SMRIvar from going negative. You could keep your original code, inserting the statement
bounds LMRIvar SMRIvar > 0;
This is simpler than writing the model in terms of parameters which are allowed to go negative. However, my experience has been that employing parameters which have domain (-infinity, infinity) is actually the better approach.

Stata code to conditionally sum values based on a group rank

I'm trying to write code for a fairly large dataset (3 million observations) that has been divided into smaller groups (ID). For each observation (described in the table below), I want to create a cumulative sum of the variable Value over all observations ranked below mine, restricted to those whose Condition equals mine.
I want to write this code without using loops, if there is a way to do so.
Could someone help me?
Thank you!
UPDATE:
I have pasted the equation for the output variable below.
UPDATE 2:
The CSV format of the above table is:
ID,Rank,Condition,Value,Expected output
1,1,30,10,0
1,2,40,20,0
1,3,20,30,0
1,4,30,40,10
1,5,40,50,20
1,6,20,60,30
1,7,30,70,80
2,1,40,80,0
2,2,20,90,0
2,3,30,100,0
2,4,40,110,80
2,5,20,120,90
2,6,30,130,100
2,7,40,140,190
2,8,20,150,210
2,9,30,160,230
(The equation for the expected output was posted as an image and is not reproduced here.)
If I understand correctly, for each combination of ID and Condition, you want to calculate a running sum, ordered by Rank, of the variable Value, excluding the current observation. If that is indeed your goal, the following untested code might set you on the path to a solution
sort ID Condition Rank
// be sure there is a single observation for each combination
isid ID Condition Rank
// generate the running sum
by ID Condition (Rank): generate output = sum(Value)
// subtract out the current observation
replace output = output - Value
// return to the original order
sort ID Rank
As I said, this is untested, because my copy of Stata cannot read pictures of data. If your testing shows that it is imperfect and you cannot resolve the problem yourself, providing your sample data in a usable format will increase the likelihood someone will be able to help.
Added in edit: Corrected the isid command.
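For cross-checking the logic (not the Stata syntax), here is a small pandas sketch of the same computation; it assumes the sample above is saved as sample.csv (a hypothetical file name) and follows the interpretation stated in this answer, which may or may not match every row of the question's expected-output column.
import pandas as pd

# hypothetical file containing the CSV sample from the question
df = pd.read_csv("sample.csv")

# running sum of Value within each (ID, Condition), ordered by Rank,
# then subtract the current observation's Value to exclude it
df = df.sort_values(["ID", "Condition", "Rank"])
df["output"] = df.groupby(["ID", "Condition"])["Value"].cumsum() - df["Value"]

# restore the original ordering
df = df.sort_values(["ID", "Rank"]).reset_index(drop=True)
print(df)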

RRDTool Counter increment lower than time

I created a standard RRDTool database with a default step of 5 min (300 s).
I have different types of values in it: some GAUGEs, which are easily processed, and other values I would like to store as COUNTERs, but here is my problem.
I read the data in a program, and the difference between values over two steps is correct, but the counter increments by less than the elapsed time (it can increase by less than 300 during a step), so my output value is wrong.
Is it possible to configure the COUNTER so that it is not a rate per second but per step, or something like that? If not, I suppose I have to calculate the difference in my program.
Thank you for helping.
RRDTool is capable of handling fractional values, so there is no problem if the counter increments by less than the seconds interval since the last update.
RRDTool stores everything as a Rate. If your DS is of type GAUGE, then RRDTool assumes that the incoming value is already a rate, and only applies Data Normalisation (more on this later). If the type is COUNTER or DERIVE, then the value/timepoint you are updating with is compared to the previous value/timepoint to obtain a rate: r = (x2 - x1)/(t2 - t1). The rate obtained is then Normalised. The other DS type is ABSOLUTE, which assumes the counter was reset on the last read, giving r = x2/(t2 - t1).
The Normalisation step adjusts the data point, assuming a linear progression from the last data point, so that it lies exactly on an interval boundary. For example, if your step is 5 min and you update at 12:06, the data point is adjusted back to what it would have been at 12:05 and stored against 12:05. However, the last unadjusted data point is still preserved for use at the next update, so that overall rates remain correct.
So, if you have a 300s (5min) interval, and the value increased by 150, the rate stored will be 0.5.
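A small arithmetic sketch of that calculation, in Python and purely illustrative (it ignores counter wrap handling):
def counter_rate(prev_value, prev_time, value, time):
    # COUNTER/DERIVE rate between two updates: r = (x2 - x1) / (t2 - t1)
    return (value - prev_value) / (time - prev_time)

print(counter_rate(1000, 0, 1150, 300))  # 0.5 per second for an increase of 150 over a 300 s step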
If the value you are graphing is something small, e.g. 'number of pages printed', this might seem counterintuitive, but it works well for large rates such as network traffic counters (which is what RRDTool was designed for).
If you really do not want to display fractional values in the generated graphs or output, then you can use a format string such as %.0f to enforce no decimal places and the displayed number will be rounded to the nearest integer.

Stata seems to be ignoring my starting values in maximum likelihood estimation

I am trying to estimate a maximum likelihood model and it is running into convergence problems in Stata. The actual model is quite complicated, but it converges with no troubles in R when it is supplied with appropriate starting values. I however cannot seem to get Stata to accept the starting values I provide.
I have included a simple example below estimating the mean of a Poisson distribution. This is not the actual model I am trying to estimate, but it demonstrates my problem. I set the trace option, which allows you to see the parameters as Stata searches the likelihood surface.
Although I use init to set a starting value of 0.5, the first iteration still shows that Stata is trying a coefficient of 4.
Why is this? How can I force the estimation procedure to use my starting values?
Thanks!
clear
set obs 1000  // the original post omits the data setup; any number of observations will do
generate y = rpoisson(4)
capture program drop mypoisson
program define mypoisson
    args lnf mu
    quietly replace `lnf' = $ML_y1*ln(`mu') - `mu' - lnfactorial($ML_y1)
end
ml model lf mypoisson (mean:y=)
ml init 0.5, copy
ml maximize, iterate(2) trace
Output:
Iteration 0:
Parameter vector:
           mean:
           _cons
 r1            4
Added: Stata doesn't ignore the initial value. If you look at the output of the ml maximize command, the first line in the listing will be titled
initial: log likelihood =
Following the equal sign is the value of the likelihood for the parameter value set in the init statement.
I don't know how the search(off) or search(norescale) solutions affect the subsequent likelihood calculations, so these solutions might still be worthwhile.
Original "solutions":
To force a start at your initial value, add the search(off) option to ml maximize:
ml maximize, iterate(2) trace search(off)
You can also force the use of the initial values with search(norescale). See Jeff Pitblado's post at http://www.stata.com/statalist/archive/2006-07/msg00499.html.
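If you want to verify that initial value outside Stata, here is a short Python sketch of the per-observation Poisson log likelihood that the mypoisson program evaluates (purely for cross-checking; y stands for whatever your sample contains):
import math

def poisson_loglik(y, mu):
    # matches the expression in mypoisson: y*ln(mu) - mu - lnfactorial(y), summed over observations
    return sum(yi * math.log(mu) - mu - math.lgamma(yi + 1) for yi in y)

# compare poisson_loglik(y, 0.5) with the "initial: log likelihood" line that
# ml maximize prints after ml init 0.5, copy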