Random slopes Cox Proportional Hazards

I have been trying to use coxme to extract random slopes for each of the covariates in my model.
library (coxme)
Start <- runif(5000, 1985, 2015)
Stop <- Start + runif(5000, 2, 10)
S <- data.frame (
X1 <- runif(5000, 5.0, 7.5),
X2 <- runif(5000, 5.0, 7.5),
D <- rbinom(5000, 1, 0.8),
Letters <- sample(LETTERS, 5000, replace = TRUE),
Start <- Start,
Stop <- Stop
)
S_ind1 <- Surv (time = S$Start, time2 = S$Stop, event = S$D)
a <- coxme (S_ind1 ~ X1 + X2 + (X1 + X2|Letters), data = S)
All I get is:
Error in gchol(kfun(theta, varlist, vparm, ntheta, ncoef)) :
NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning messages:
1: In sqrt(xvar * zvar) : NaNs produced
2: In sqrt(xvar * zvar) : NaNs produced
When using my own data I often get:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
'Calloc' could not allocate memory (56076596 of 8 bytes)
Is it possible at all to include random slopes using coxme?
If not, is there an alternative using another package?

Answer from Terry Therneau, author of the coxme package, via email - he asked me to post this here.
Below is my rewrite of your example, removing the Surv indirection and using '=' inside the data.frame call (I'm a bit surprised that <- works in that context), and adding set.seed so that the example is reproducible.
library (coxme)
set.seed(1953)
time1 <- runif(5000, 1985, 2015)
time2 <- time1 + runif(5000, 2, 10)
test <- data.frame (
x1 = runif(5000, 5.0, 7.5),
x2 = runif(5000, 5.0, 7.5),
death = rbinom(5000, 1, 0.8),
letters = sample(LETTERS, 5000, replace = TRUE),
time1 = time1,
time2 = time2)
fit1 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1|letters), data=test)
fit2 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+x1 | letters), test)
fit3 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+x2 | letters), test)
fit4 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+ x1 + x2 | letters),
data=test, vinit= c(1e-6, 1e-8, 1e-8))
1. All the models work until fit4.
2. I find your model worrisome, since it has a random slope but no random intercept, in the same way that all regressions through the origin worry me: I have a hard time interpreting the results. Although lme puts intercept terms in by default, coxme does not.
3. I was hopeful that fit4 would work, and perhaps with better starting estimates it would. The underlying code for coxme is the hardest maximization problem that I have encountered in all my survival work, hard in the sense that the maximizer gets easily lost and never finds its way. This is a function that sometimes needs hand-holding, via limited iteration counts and/or starting estimates. I wish it were not so, and I have some long-term plans to improve this by adding an alternate MCMC-based maximizer, which will in theory never get lost, but at the expense of much longer computation time.
4. If any of the variances get too close to zero then the sqrt() message tends to arise as a function of round-off error. In your test case, of course, the actual MLE is at a variance of 0. When this happens, I will often check for a zero variance directly by doing fits with a sequence of fixed variances (vfixed argument). If the likelihood is constant or increasing as the variance goes to values of 1e-6 or less, then I assume the MLE is zero and remove that random term from the model.
Terry T.
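A minimal sketch of the vfixed check described in point 4 (my own illustration, not part of the email, reusing the simulated test data above and an arbitrary grid of variance values):
# fit the random-intercept model at a sequence of fixed variances; if the
# integrated log-likelihood is flat or increasing as the variance shrinks
# toward 1e-6 or below, treat the MLE as zero and drop the random term
for (v in c(1e-2, 1e-4, 1e-6, 1e-8)) {
  fv <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1 | letters),
              data = test, vfixed = v)
  print(c(variance = v, fv$loglik))
}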

Related

lp_solve return uniform solution

Can lp_solve return a uniform solution? (Is there a flag or something that will force this kind of behavior?)
Say that I have this:
max: x + y + z + w;
x + y + z + w <= 100;
Results in:
Actual values of the variables:
x 100
y 0
z 0
w 0
However, I would like to have something like:
Actual values of the variables:
x 25
y 25
z 25
w 25
This is an oversimplified example, but the idea is that if the variables have the same coefficient in the objective function, then the result should ideally be more uniform, not everything assigned to one variable with the others getting whatever is left.
Is this possible to do? (I've tested other libraries and some of them seem to do this by default, like the Solver in Excel or Gekko for Python.)
EDIT:
For instance, Gekko already has this behavior without me specifying anything...
from gekko import GEKKO
m = GEKKO()
x1,x2,x3,x4 = [m.Var() for i in range(4)]
#upper bounds
x1.upper = 100
x2.upper = 100
x3.upper = 100
x4.upper = 100
# Constrain
m.Equation(x1 + x2 + x3 + x4 <= 100)
# Objective
m.Maximize(x1 + x2 + x3 + x4)
m.solve(disp=False)
print(x1.value, x2.value, x3.value, x4.value)
>> [24.999999909] [24.999999909] [24.999999909] [24.999999909]
You would need to explicitly model this (as another objective). A solver does nothing automatically: it just finds a solution that obeys the constraints and optimizes the objective function.
Also, note that many linear solvers will produce so-called basic solutions (corner points). So "all variables in the middle" does not come naturally at all.
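For example, here is a sketch in R using the lpSolve package (an R interface to the lp_solve library); the uniform split comes entirely from the extra equality constraints you add, not from the solver itself:
library(lpSolve)
obj <- c(1, 1, 1, 1)               # maximize x + y + z + w
con <- rbind(c(1,  1,  1,  1),     # x + y + z + w <= 100
             c(1, -1,  0,  0),     # x = y
             c(1,  0, -1,  0),     # x = z
             c(1,  0,  0, -1))     # x = w
sol <- lp("max", obj, con, c("<=", "=", "=", "="), c(100, 0, 0, 0))
sol$solution                       # 25 25 25 25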
The example in Gekko ended on [25,25,25,25] because of how the solver took a step towards the solution from an initial guess of [0,0,0,0] (default in Gekko). The problem is under-specified so there are an infinite number of feasible solutions. Changing the guess values gives a different solution.
from gekko import GEKKO
m = GEKKO()
x1,x2,x3,x4 = m.Array(m.Var,4,lb=0,ub=100)
x1.value=50 # change initial guess
m.Equation(x1 + x2 + x3 + x4 <= 100)
m.Maximize(x1 + x2 + x3 + x4)
m.solve(disp=False)
print(x1.value, x2.value, x3.value, x4.value)
Solution with guess values [50,0,0,0]
[3.1593723566] [32.280209196] [32.280209196] [32.280209196]
Here is one method with equality constraints m.Equations([x1==x2,x1==x3,x1==x4]) to modify the problem to guarantee a unique solution that can be used by any linear programming solver.
from gekko import GEKKO
m = GEKKO()
x1,x2,x3,x4 = m.Array(m.Var,4,lb=0,ub=100)
x1.value=50 # change initial guess
m.Equation(x1 + x2 + x3 + x4 <= 100)
m.Maximize(x1 + x2 + x3 + x4)
m.Equations([x1==x2,x1==x3,x1==x4])
m.solve(disp=False)
print(x1.value, x2.value, x3.value, x4.value)
This gives a solution:
[25.000000002] [25.000000002] [25.000000002] [25.000000002]
QP Solution
Switching to a QP solver allows a slight penalty for deviations but doesn't consume a degree of freedom.
from gekko import GEKKO
m = GEKKO()
x1,x2,x3,x4 = m.Array(m.Var,4,lb=0,ub=100)
x1.value=50 # change initial guess
m.Equation(x1 + x2 + x3 + x4 <= 100)
m.Maximize(x1 + x2 + x3 + x4)
penalty = 1e-5
m.Minimize(penalty*(x1-x2)**2)
m.Minimize(penalty*(x1-x3)**2)
m.Minimize(penalty*(x1-x4)**2)
m.solve(disp=False)
print(x1.value, x2.value, x3.value, x4.value)
Solution with QP penalty
[24.999998377] [25.000000544] [25.000000544] [25.000000544]

R function inside C++ loop

I have a double loop in R. It works well, but the problem is that it runs slow with big data frames. So I would like to do the loop in C++ through the Rcpp package, but using an R function inside the loop. The R loop is:
> output2=list()
> for (j in r){
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+ }
+ }
And the output is going to be a list.
An example of DF and r is:
> r
[1] "A==A[i] & B==B[i] " "A==A[i] & C==C[i] "
[3] "B==B[i] & C==C[i] " "A==A[i] & B==B[i] & C==C[i] "
> DF
A B C
1 11 22 88
2 11 22 47
3 2 30 21
4 3 30 21
My question is how I can put the expression in the C++ code. Another question is whether this approach is better than writing the entire code in C++.
I would be grateful if someone could help me with this issue.
Regards,
For loops aren't necessarily slow in R. What is slow is calling a set of functions a very large number of times (and with more recent versions of R, even that isn't as slow as it used to be). However, for loops can often be completely avoided by using vectorised code, which is many times faster.
In general, using eval and parse is not needed and is usually an indication of a suboptimal solution. In this case (without knowing the complete problem), I am not completely sure how to avoid it. However, by writing the loops more efficiently, a speed gain of more than a factor of 20 can be achieved without using Rcpp.
Generate data
r <- c("A==A[i] & B==B[i]", "A==A[i] & C==C[i] ", "B==B[i] & C==C[i] ",
"A==A[i] & B==B[i] & C==C[i] ")
DF <- read.table(textConnection(" A B C
1 11 22 88
2 11 22 47
3 2 30 21
4 3 30 21"))
DF <- DF[sample(nrow(DF), 1E3, replace=TRUE), ]
Measure time of initial implementation
> system.time({
+ output2=list()
+ for (j in r){
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+ }
+ }
+ })
user system elapsed
1.120 0.007 1.127
Preallocate result; doesn't help much in this case
> system.time({
+ output2=vector(length(r), mode = "list")
+ names(output2) <- r
+ for (j in r){
+ output2[[j]] <- numeric(nrow(DF))
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+ }
+ }
+ })
user system elapsed
1.116 0.000 1.116
subset is not needed as we only need the number of rows. subset creates a completely new data.frame, which generates overhead
> system.time({
+ output2=vector(length(r), mode = "list")
+ names(output2) <- r
+ for (j in r){
+ output2[[j]] <- numeric(nrow(DF))
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=sum(eval(parse(text=j), envir = DF))
+ }
+ }
+ })
user system elapsed
0.622 0.003 0.626
Parsing r takes time and is repeated nrow(DF) times; move it out of the inner loop
> system.time({
+ output2=vector(length(r), mode = "list")
+ names(output2) <- r
+ for (j in r){
+ output2[[j]] <- numeric(nrow(DF))
+ expr <- parse(text=j)
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=sum(eval(expr, envir = DF))
+ }
+ }
+ })
user system elapsed
0.054 0.000 0.054
A more readable and even faster implementation using dplyr
> library(dplyr)
> system.time({
+ output3 <- DF %>% group_by(A,B) %>% mutate(a = n()) %>%
+ group_by(A,C) %>% mutate(b = n()) %>%
+ group_by(B,C) %>% mutate(c = n()) %>%
+ group_by(A,B,C) %>% mutate(d = n())
+ })
user system elapsed
0.010 0.000 0.009
I would have preferred to post this as a comment since it doesn't fully answer the question, but I don't have enough reputation to do so.
R is an interpreted language whereas C is a compiled one. Loops are slow in R, but your expression output2[[j]][i]=nrow(subset(DF,eval(parse(text=j)))) represents at least 99% of the execution time. Therefore, it won't help to find a way to mix both languages. I advise you to keep both loops in R and find a way to speed up the process (maybe only one loop with a different expression?) or find a way to translate your expression to a C one. I know that a lot of basic functions of R are coded in C (as you can see here), so maybe it's already the case for nrow, subset and parse.
You can also use a LAPACK/BLAS library to speed up some R functions:
LAPACK/BLAS handles matrix math in R. If that's all you need, you can find libraries that are much faster than the vanilla ones in R (you can use some of them in R too to improve performance!).
quoted from this topic on Stack Overflow

Do loop with 2 variables changing in each step of loop

I'm working in Fortran 90. I need to calculate a recursion like x(n) = 6*x(n-1) + 7*x(n-2), where x(n-1) is the value at step n-1 and x(n-2) is the value at step n-2. So if we set x1 = 2 and x0 = 1, we get x2 = 6*x1 + 7*x0 and so on for each n.
So I wrote
x0 = 1.
x1 = 2.
do i = 1,20
xn = 6.*x1 + 7.*x0
x1 = xn
x0 = x1
end do
but this code replaces both x0 and x1 with xn, and I need x1 to become xn and x0 to become the old x1 in each step. I've tried many things but failed. Any idea how to do that?
Though the answer has already been added to this question, let me answer a more general question which is encountered more frequently. Consider the problem where the next value in the iteration depends on the n previous values. In the present case n = 2. The general strategy is to construct another 1-d array of size n and save all the initial values x(1), x(2), ..., x(n) in this array. Then in each iteration we use these values to calculate the next value x(n+1) and update the array, replacing x(1) by x(2), x(2) by x(3), ..., x(n) by x(n+1), and again use these values to calculate the next value of x, and so on. A particular example where such a strategy must necessarily be used is the integration of time-delayed systems.
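A minimal sketch of that history-array strategy, written in R rather than Fortran only for brevity (the translation to a Fortran array with a shifted assignment is direct); the recursion and starting values are the ones from the question:
n <- 2
xprev <- c(1, 2)                            # x(1) = x0, x(2) = x1
for (i in 1:20) {
  xnew <- 6 * xprev[n] + 7 * xprev[n - 1]   # next value from the n stored ones
  xprev <- c(xprev[-1], xnew)               # shift: drop the oldest, append the newest
}
xprev[n]                                    # most recent value of the recursion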
#parthian-shot has given the correct answer in the comment. But that leaves the question marked as unanswered, so I am repeating it here:
You are assigning the value of xn to x1, and then the value of x1 (which is now the same as xn) to x0. You just need to flip it around:
do i = 1,20
xn = 6.*x1 + 7.*x0
x0 = x1
x1 = xn
end do

Solving for polynomial roots in Stata

I am trying to solve for the roots of a function in Stata. There is the "polyeval" command under Mata, but I am not sure how to apply it here. It seems that with polyeval, functions must follow a very specific polynomial structure like x^2 + x + c.
I would like to find out more about how to use Stata to solve this type of problem in general. But here is my current one, if that provides some idea of what I am working with.
I am currently trying to solve the Black (1976) American Options pricing model:
C = e^{-rt} [ F N(d1) - E N(d2)]
where,
d1 = [ln(F/E) + 1/2 sigma^2 t] / [sigma sqrt{t}]
d2 = d1 - sigma sqrt{t}
where C is the price of the call option, t is the time to expiration, r is the interest rate, F is the current futures price of the contract, E is the strike price, and sigma is the annualized standard deviation of the futures contract. N(d1) and N(d2) are cumulative normal probability functions. All variables are known except for sigma.
As an aside, this seems to be really easy to do in R:
fun <- function(sigma) {
  d1 <- (log(futures/Strike) + sigma^2 * T/2) / (sigma * sqrt(T))  # d1 as defined above
  exp(-int.rate * T) * (futures * pnorm(d1) - Strike * pnorm(d1 - sigma * sqrt(T))) - Option
}
uni <- uniroot(fun, c(0, 1), tol = 0.001 )
uni$root
Does anyone have any ideas/pointers on how to use Stata to solve this type of function?

Variation on set cover problem in R / C++

Given a universe of elements U = {1, 2, 3,...,n} and a number of sets in this universe {S1, S2,...,Sm}, what is the smallest set we can create that will cover at least one element in each of the m sets?
For example, given the following elements U = {1,2,3,4} and sets S = {{4,3,1},{3,1},{4}}, the following sets will cover at least one element from each set:
{1,4}
or
{3,4}
so the minimum sized set required here is 2.
Any thoughts on how this can be scaled up to solve the problem for m=100 or m=1000 sets? Or thoughts on how to code this up in R or C++?
The sample data, from above, using R's library(sets).
s1 <- set(4, 3, 1)
s2 <- set(3, 1)
s3 <- set(4)
s <- set(s1, s2, s3)
Cheers
This is the hitting set problem, which is basically set cover with the roles of elements and sets interchanged. Letting A = {4, 3, 1} and B = {3, 1} and C = {4}, the element-set containment relation is
A B C
1 + + -
2 - - -
3 + + -
4 + - +
so you basically want to solve the problem of covering {A, B, C} with sets 1 = {A, B} and 2 = {} and 3 = {A, B} and 4 = {A, C}.
Probably the easiest way to solve nontrivial instances of set cover in practice is to find an integer programming package with an interface to R or C++. Your example would be rendered as the following integer program, in LP format.
Minimize
obj: x1 + x2 + x3 + x4
Subject To
A: x1 + x3 + x4 >= 1
B: x1 + x3 >= 1
C: x4 >= 1
Binary
x1 x2 x3 x4
End
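For instance, here is a sketch of that same integer program in R with the lpSolve package (one of several IP packages with an R interface); all.bin = TRUE makes every x_j a 0/1 variable:
library(lpSolve)
obj <- c(1, 1, 1, 1)                 # minimize x1 + x2 + x3 + x4
con <- rbind(c(1, 0, 1, 1),          # A: x1 + x3 + x4 >= 1
             c(1, 0, 1, 0),          # B: x1 + x3      >= 1
             c(0, 0, 0, 1))          # C: x4           >= 1
sol <- lp("min", obj, con, rep(">=", 3), rep(1, 3), all.bin = TRUE)
sol$solution                         # e.g. 1 0 0 1, i.e. the hitting set {1, 4}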
At first I misunderstood the complexity of the problem and came up with a function that finds a set that covers the m sets - but I then realized that it isn't necessarily the smallest one:
cover <- function(sets, elements = NULL) {
if (is.null(elements)) {
# Build the union of all sets
su <- integer()
for(si in sets) su <- union(su, si)
} else {
su <- elements
}
s <- su
for(i in seq_along(s)) {
# create set candidate with one element removed
sc <- s[-i]
ok <- TRUE
for(si in sets) {
if (!any(match(si, sc, nomatch=0L))) {
ok <- FALSE
break
}
}
if (ok) {
s <- sc
}
}
# The resulting set
s
}
sets <- list(s1=c(1,3,4), s2=c(1,3), s3=c(4))
> cover(sets) # [1] 3 4
Then we can time it:
n <- 100 # number of elements
m <- 1000 # number of sets
sets <- lapply(seq_len(m), function(i) sample.int(n, runif(1, 1, n)))
system.time( s <- cover(sets) ) # 0.53 seconds
Not too bad, but still not the smallest one.
The obvious solution: generate all permutations of the elements, pass each one to the cover function, and keep the smallest result. This will take close to "forever".
Another approach is to generate a limited number of random permutations - this way you get an approximation of the smallest set.
ns <- 10 # number of samples
elements <- seq_len(n)
smin <- sets
for(i in seq_len(ns)) {
s <- cover(sets, sample(elements))
if (length(s) < length(smin)) {
smin <- s
}
}
length(smin) # approximate smallest length
If you restrict each set to have 2 elements, you have the NP-complete problem node cover. I would guess the more general problem would also be NP-complete (for the exact version).
If you're just interested in an algorithm (rather than an efficient/feasible algorithm), you can simply generate subsets of the universe of increasing cardinality and check that the intersection with all the sets in S is non-empty. Stop as soon as you get one that works; the cardinality is the minimum possible.
The complexity of this is 2^|U| in the worst case, I think. Given Foo Bah's answer, I don't think you're going to get a polynomial-time answer...
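A sketch of that brute-force search in R (min_hitting_set is a hypothetical helper; exponential in |U|, so only usable for small universes):
min_hitting_set <- function(sets) {
  u <- sort(unique(unlist(sets)))           # the universe actually used by the sets
  for (k in seq_along(u)) {                 # candidate sizes in increasing order
    for (s in combn(u, k, simplify = FALSE)) {
      if (all(vapply(sets, function(si) any(si %in% s), logical(1))))
        return(s)                           # first subset found at size k is a minimum
    }
  }
  NULL
}
min_hitting_set(list(c(4, 3, 1), c(3, 1), c(4)))  # e.g. 1 4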