R function inside C++ loop - c++

I have a double loop in R. It works well, but the problem is that it runs slowly with big data frames. So I would like to do the loop in C++ through the Rcpp package, but using an R function inside the loop. The R loop is:
> output2=list()
> for (j in r){
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+ }
+ }
And the output is going to be a list.
An example of DF and r is:
> r
[1] "A==A[i] & B==B[i] " "A==A[i] & C==C[i] "
[3] "B==B[i] & C==C[i] " "A==A[i] & B==B[i] & C==C[i] "
> DF
A B C
1 11 22 88
2 11 22 47
3 2 30 21
4 3 30 21
My question is how I can put the expression into the C++ code. Another question is whether this approach is better than writing the entire code in C++.
I would be grateful if someone could help me with this issue.

For loops aren't necessarily slow in R. What is slow is calling a set of functions a very large number of times (and with more recent versions of R, even that isn't as slow as it used to be). However, for loops can often be completely avoided by using vectorised code, which is many times faster.
In general, using eval and parse is not needed and is usually an indication of a suboptimal solution. In this case (without knowing the complete problem), I am not completely sure how to avoid it. However, by writing the loops more efficiently, a speedup of more than a factor of 20 can be achieved without using Rcpp.
Generate data
r <- c("A==A[i] & B==B[i]", "A==A[i] & C==C[i] ", "B==B[i] & C==C[i] ",
"A==A[i] & B==B[i] & C==C[i] ")
DF <- read.table(textConnection(" A B C
1 11 22 88
2 11 22 47
3 2 30 21
4 3 30 21"))
DF <- DF[sample(nrow(DF), 1E3, replace=TRUE), ]
Measure time of initial implementation
> system.time({
+ output2=list()
+ for (j in r){
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+ }
+ }
+ })
user system elapsed
1.120 0.007 1.127
Preallocate result; doesn't help much in this case
> system.time({
+ output2=vector(length(r), mode = "list")
+ names(output2) <- r
+ for (j in r){
+ output2[[j]] <- numeric(nrow(DF))
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+ }
+ }
+ })
user system elapsed
1.116 0.000 1.116
subset is not needed, as we only need the number of rows; subset creates a completely new data.frame, which generates overhead
> system.time({
+ output2=vector(length(r), mode = "list")
+ names(output2) <- r
+ for (j in r){
+ output2[[j]] <- numeric(nrow(DF))
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=sum(eval(parse(text=j), envir = DF))
+ }
+ }
+ })
user system elapsed
0.622 0.003 0.626
Parsing r takes time and is repeated nrow(DF) times; move the parse call out of the inner loop
> system.time({
+ output2=vector(length(r), mode = "list")
+ names(output2) <- r
+ for (j in r){
+ output2[[j]] <- numeric(nrow(DF))
+ expr <- parse(text=j)
+ for (i in 1:nrow(DF)){
+ output2[[j]][i]=sum(eval(expr, envir = DF))
+ }
+ }
+ })
user system elapsed
0.054 0.000 0.054
A more readable and even faster implementation using dplyr
> library(dplyr)
> system.time({
+ output3 <- DF %>% group_by(A,B) %>% mutate(a = n()) %>%
+ group_by(A,C) %>% mutate(b = n()) %>%
+ group_by(B,C) %>% mutate(c = n()) %>%
+ group_by(A,B,C) %>% mutate(d = n())
+ })
user system elapsed
0.010 0.000 0.009

I would have preferred to post this as a comment, as it doesn't fully answer the question, but I don't have enough reputation to do so.
R is an interpreted language whereas C is a compiled one. Loops are slow in R, but your expression output2[[j]][i]=nrow(subset(DF,eval(parse(text=j)))) represents at least 99% of the execution time, so it won't help to find a way to mix both languages. I advise you to either keep everything in R and find a way to speed up the process (maybe a single loop with a different expression?), or find a way to translate your expression to a C one. A lot of R's basic functions are coded in C (as you can see here); maybe that's already the case for nrow, subset and parse.
You can also use the LAPACK/BLAS libraries to speed up some R functions:
LAPACK/BLAS handles matrix math in R. If that's all you need, you can find libraries that are much faster than the vanilla ones in R (you can use some of them in R too to improve performance!).
quoted from this topic on Stack Overflow

Related

Random slopes Cox Proportional Hazards

I have been trying to use coxme to extract random slopes for each of the covariates in my model.
library (coxme)
Start <- runif(5000, 1985, 2015)
Stop <- Start + runif(5000, 2, 10)
S <- data.frame (
X1 <- runif(5000, 5.0, 7.5),
X2 <- runif(5000, 5.0, 7.5),
D <- rbinom(5000, 1, 0.8),
Letters <- sample(LETTERS, 5000, replace = TRUE),
Start <- Start,
Stop <- Stop
)
S_ind1 <- Surv (time = S$Start, time2 = S$Stop, event = S$D)
a <- coxme (S_ind1 ~ X1 + X2 + (X1 + X2|Letters), data = S)
All I get is:
Error in gchol(kfun(theta, varlist, vparm, ntheta, ncoef)) :
NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning messages:
1: In sqrt(xvar * zvar) : NaNs produced
2: In sqrt(xvar * zvar) : NaNs produced
When using my own data I often get:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
'Calloc' could not allocate memory (56076596 of 8 bytes)
Is it possible at all to include random slopes using coxme?
If not, is there an alternative in another package?
Answer from Terry Therneau, author of the coxme package, via email - he asked me to post this here.
Below is my rewrite of your example, removing the Surv indirection and using '=' inside the data.frame call (I'm a bit surprised that <- works in that context), and adding set.seed so that the example is reproducible.
library (coxme)
set.seed(1953)
time1 <- runif(5000, 1985, 2015)
time2 <- time1 + runif(5000, 2, 10)
test <- data.frame (
x1 = runif(5000, 5.0, 7.5),
x2 = runif(5000, 5.0, 7.5),
death = rbinom(5000, 1, 0.8),
letters = sample(LETTERS, 5000, replace = TRUE),
time1 = time1,
time2 = time2)
fit1 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1|letters), data=test)
fit2 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+x1 | letters), test)
fit3 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+x2 | letters), test)
fit4 <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1+ x1 + x2 | letters),
data=test, vinit= c(1e-6, 1e-8, 1e-8))
All the models work until fit4.
I find your model worrisome, since it has a random slope but no random intercept, in the same way that all regressions through the origin worry me: I have a hard time interpreting the results. Although lme puts intercept terms in by default, coxme does not.
I was hopeful that fit4 would work, and perhaps with better starting estimates it would. The underlying code for coxme is the hardest maximization problem that I have encountered in all my survival work, hard in the sense that the maximizer easily gets lost and never finds its way. This is a function that sometimes needs hand-holding, via limited iteration counts and/or starting estimates. I wish it were not so, and I have some long-term plans to improve this by adding an alternate MCMC-based maximizer, which will in theory never get lost, but at the expense of much longer computation time.
If any of the variances get too close to zero, then the sqrt() message tends to arise as a result of round-off error. In your test case, of course, the actual MLE is at a variance of 0. When this happens, I will often check for a zero variance directly by doing fits with a sequence of fixed variances (the vfixed argument). If the likelihood is constant or increasing as the variance goes to values of 1e-6 or less, then I assume the MLE is zero and remove that random term from the model.
Terry T.
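To make the fixed-variance check above concrete, here is a minimal sketch (not from Terry's email): it refits the random-intercept model over an arbitrary grid of fixed variances via the vfixed argument and prints the integrated log-likelihood, assuming the test data from the rewrite above and that fit$loglik carries an "Integrated" component.
for (v in c(0.1, 0.01, 1e-4, 1e-6)) {
  fit <- coxme(Surv(time1, time2, death) ~ x1 + x2 + (1 | letters),
               data = test, vfixed = v)
  # if the log-likelihood is flat or rising as v shrinks, treat the MLE as zero
  cat("variance =", v, " integrated loglik =", fit$loglik["Integrated"], "\n")
}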

R use apply function on xts zoo class

I am new to R and am trying to use the apply function on the xts zoo class; however, it shows an error. I have a formula: ((2*Close - High - Low)/(High - Low)) * Volume
Input:
library(quantmod)
y <- getSymbols("0005.HK", auto.assign = FALSE, src = "yahoo")
Error:
y$II <- apply(y,2,function(x) (2Cl(x) - Hi(x) - Lo(x)) / ((Hi(x) - Lo(x)) * Vo(stk)))
Error: unexpected symbol in "apply(y,2,function(x) (2Cl"
and then I tried another one:
Error:
y$II <- apply(y,2,function(x) (2(x[,4]) - x[,2] - x[,3]) / (x[,2] - x[,3]) * x[,5])
Error in FUN(newX[, i], ...) : attempt to apply non-function
After that, I would like to sum y$II over 21 days, but I don't know how to use an apply function to compute the sums over each 21-day window:
IIstd = 21-day sum of ((2*C-H-L)/(H-L)) * V
IInorm = (IIstd / 21-day sum of V) * 100
Can anyone help me? Please advise, thanks.
There are two problems here:
2Cl(x) is not valid R -- use 2 * Cl(x)
all operations on the right hand side are already vectorized so we do not need apply in the first place
For clarity, here we have assumed that II = (2C - H - L)/((H-L) * V) and that you want 100 times the 21-period volume-weighted moving average of that. Modify if that is not what you want.
Try this:
y$II <- (2*Cl(y) - Hi(y) - Lo(y)) / ((Hi(y) - Lo(y)) * Vo(y))
Regarding the second part of the question try this -- rollapplyr is in the zoo package.
wmean <- function(x) weighted.mean(x$II, Vo(x))
y$MeanII <- 100 * rollapplyr(y, 21, wmean, by.column = FALSE, fill = NA)
Also check out the TTR package.
UPDATE: Added answer to second part of question.
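For the question's original definition of II (with volume in the numerator, as in the IIstd and IInorm formulas above), here is a hedged sketch of the 21-day sums using runSum from the TTR package just mentioned (an assumption; rollapplyr with sum would work equally well):
library(TTR)
ii <- (2*Cl(y) - Hi(y) - Lo(y)) / (Hi(y) - Lo(y)) * Vo(y)  # the question's II
IIstd  <- runSum(ii, n = 21)                   # 21-day sum of II
IInorm <- 100 * IIstd / runSum(Vo(y), n = 21)  # normalised by 21-day volume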

Mathematics with large factorials (e.g. division?)

I'm trying to find the percentage of permutations of 100 numbers that contain cycles of length more than 50. This involves mathematics with division of large factorials, which can't be done by hand very quickly, so I need to resort to programming. For instance, one of these terms contains
(3!/100!)*((99!/3!) + (98!/2!) + (97!/1!) + (96!/0!))
I could re-arrange all terms to provide one large number (>2^64) that just needs to be divided by 100! to get my answer.
I've thought about it for quite a bit, still being new to C++, and I'm not sure how to do division with large numbers. Normally when I've dealt with large factorials, I've written the digits of the number into an array and done multiplication through that, but I'm not entirely sure how to do division that way. What is the best way to deal with mathematics of large numbers in C++?
It is obvious from the structure of the equation (the reciprocal of a very large factorial multiplied by some large factorials of about the same magnitude) that a lot of cancellation can happen. That means this problem can be solved with some very simple algebra, and with a bit of extra luck even without any kind of computation.
Let's replace the factorials with some innocent little letters to avoid intimidation by the large numbers.
With 0! = 1 by definition and 1! = 1, we can skip these values and use the following substitutions:
a = 2!, b = 3!, v = 96!, w = 97!, x = 98!, y = 99!, z = 100!
That gives
(b/z)*(y/b + x/a + w + v)
expand
b*(y/b + x/a + w + v) * 1/z
expand numerator (let's use some ASCII art for legibility)
b*x
y + (b*w) + (b*v) + ---
a
------------------------
z
squeeze it all in one fraction
(a*y) + (a*b*w) + (a*b*v) + (x*b)
---------------------------------
(a*z)
take it apart
a*y a*b*w a*b*v x*b
----- + ------- + ------- + -----
a*z a*z a*z a*z
Yepp, looks good, we can put the numbers back in
2!*99! 2!*3!*96! 2!*3!*97! 3!*98!
--------- + ------------- + ------------- + ---------
2!*100! 2!*100! 2!*100! 2!*100!
First round of cancellations (could have already been done at the letter stage)
99! 3!*96! 3!*97! 3!*98!
------ + ---------- + -------- + --------
100! 100! 100! 2!*100!
The factorials cancel each other out, but only partially
First step
1 1*2*3 1*2*3 1*2*3
--- + ------------ + --------- + ------
100 97*98*99*100 98*99*100 1*2*99*100
Second step
1 1 1 1
--- + ------------ + -------- + -------
100 97*98*33*50 98*33*50 2*33*50
Common denominator
97*98*33*50*2 + 100*2 + 100*97*2 + 100*97*98
--------------------------------------------
100*97*98*33*50*2
Massage the numerator a bit by factoring out 100 (noting that 2 + 97*2 = 2*98)
97*98*33*50*2 + 100*(2*98 + 97*98)
----------------------------------
100*97*98*33*50*2
Part
97*98*33*50*2 100*(2*98 + 97*98)
----------------- + -------------------
100*97*98*33*50*2 100*97*98*33*50*2
Cancel
1 2*98 + 97*98
--- + --------------
100 97*98*33*50*2
Part
1 2*98 97*98
--- + -------------- + --------------
100 97*98*33*50*2 97*98*33*50*2
Cancel
1 1 1
--- + -------- + -------
100 97*33*50 33*50*2
And we have one summand less and have gotten rid of the 98. Rinse and repeat until the final result is found:
1
----
97
Yes, sometimes a sharp pencil and a blank sheet of paper are all you need ;-)
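(And if you would rather let the machine confirm the pencil work: the identity checks out in exact rational arithmetic, for example in R, assuming the gmp package.)
library(gmp)
lhs <- as.bigq(factorialZ(3), factorialZ(100)) *
  (as.bigq(factorialZ(99), factorialZ(3)) +
   as.bigq(factorialZ(98), factorialZ(2)) +
   as.bigq(factorialZ(97), factorialZ(1)) +
   as.bigq(factorialZ(96), factorialZ(0)))
lhs == as.bigq(1, 97)  # TRUE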

Calculate modulus for large numbers

Hi, I need to calculate (2^n + (-1)^n) % 10000007
where 1 < n < 10^9.
How should I go about writing a program for it in C++?
I know this mod property
(a + b)%m = (a%m + b%m)%m but this won't help me.
Given
(a + b)%m = (a%m + b%m)%m
Then, replace both a and b with the same power of 2, and you get the recurrence:
2^(k+1) % m = (2^k % m + 2^k % m) % m
You probably already figured your formula allows you to break down your problem into:
(2^n + (-1)^n) % P = (2^n % P + (-1)^n % P) % P
Then, note that (-1)^k is either 1 or -1, and you should be able to calculate your problem in O(n) time.
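Here is a sketch of both versions in R (the question asks for C++, but the logic carries over directly; the modulus 10000007 is taken from the question). Plain doubles stay exact here because every intermediate product stays below 2^53. The literal recurrence is O(n); binary exponentiation, a standard replacement, reduces it to O(log n), which matters for n up to 10^9.
# literal recurrence 2^(k+1) %% m = (2^k %% m + 2^k %% m) %% m -- O(n)
pow2_mod_slow <- function(n, m) {
  r <- 1
  for (k in seq_len(n)) r <- (r + r) %% m
  r
}
# binary exponentiation (square and multiply) -- O(log n)
pow2_mod <- function(n, m) {
  r <- 1; b <- 2 %% m
  while (n > 0) {
    if (n %% 2 == 1) r <- (r * b) %% m  # fold in the current bit
    b <- (b * b) %% m                   # square the base
    n <- n %/% 2
  }
  r
}
f <- function(n, m = 10000007) (pow2_mod(n, m) + (-1)^n) %% m
f(10)  # (1024 + 1) %% 10000007 = 1025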

Variation on set cover problem in R / C++

Given a universe of elements U = {1, 2, 3,...,n} and a number of sets in this universe {S1, S2,...,Sm}, what is the smallest set we can create that will cover at least one element in each of the m sets?
For example, given the following elements U = {1,2,3,4} and sets S = {{4,3,1},{3,1},{4}}, the following sets will cover at least one element from each set:
{1,4}
or
{3,4}
so the minimum sized set required here is 2.
Any thoughts on how this can be scaled up to solve the problem for m=100 or m=1000 sets? Or thoughts on how to code this up in R or C++?
The sample data, from above, using R's library(sets).
s1 <- set(4, 3, 1)
s2 <- set(3, 1)
s3 <- set(4)
s <- set(s1, s2, s3)
This is the hitting set problem, which is basically set cover with the roles of elements and sets interchanged. Letting A = {4, 3, 1} and B = {3, 1} and C = {4}, the element-set containment relation is
A B C
1 + + -
2 - - -
3 + + -
4 + - +
so you basically want to solve the problem of covering {A, B, C} with sets 1 = {A, B} and 2 = {} and 3 = {A, B} and 4 = {A, C}.
Probably the easiest way to solve nontrivial instances of set cover in practice is to find an integer programming package with an interface to R or C++. Your example would be rendered as the following integer program, in LP format.
Minimize
obj: x1 + x2 + x3 + x4
Subject To
A: x1 + x3 + x4 >= 1
B: x1 + x3 >= 1
C: x4 >= 1
Binary
x1 x2 x3 x4
End
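As a hedged sketch, the same integer program can be solved from R with, for example, the lpSolve package (one of several IP interfaces; the constraint matrix simply transcribes the rows A, B, C above):
library(lpSolve)
# rows = sets A, B, C; columns = indicator variables x1..x4
con <- rbind(A = c(1, 0, 1, 1),
             B = c(1, 0, 1, 0),
             C = c(0, 0, 0, 1))
res <- lp(direction = "min",
          objective.in = rep(1, 4),  # minimise the number of chosen elements
          const.mat = con,
          const.dir = rep(">=", 3),  # hit every set at least once
          const.rhs = rep(1, 3),
          all.bin = TRUE)            # each x_i is binary
res$objval                # 2
which(res$solution == 1)  # e.g. 3 4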
At first I misunderstood the complexity of the problem and came up with a function that finds a set that covers the m sets - but I then realized that it isn't necessarily the smallest one:
cover <- function(sets, elements = NULL) {
if (is.null(elements)) {
# Build the union of all sets
su <- integer()
for(si in sets) su <- union(su, si)
} else {
su <- elements
}
s <- su
for(e in su) {
# create a set candidate with one element removed; iterating over element
# values (rather than positions) avoids skipping elements as s shrinks
sc <- setdiff(s, e)
ok <- TRUE
for(si in sets) {
if (!any(match(si, sc, nomatch=0L))) {
ok <- FALSE
break
}
}
if (ok) {
s <- sc
}
}
# The resulting set
s
}
sets <- list(s1=c(1,3,4), s2=c(1,3), s3=c(4))
> cover(sets) # [1] 3 4
Then we can time it:
n <- 100 # number of elements
m <- 1000 # number of sets
sets <- lapply(seq_len(m), function(i) sample.int(n, runif(1, 1, n)))
system.time( s <- cover(sets) ) # 0.53 seconds
Not too bad, but still not the smallest one.
The obvious solution: generate all permutations of the elements, pass each one to the cover function, and keep the smallest result. This will take close to "forever".
Another approach is to generate a limited number of random permutations - this way you get an approximation of the smallest set.
ns <- 10 # number of samples
elements <- seq_len(n)
smin <- cover(sets) # start from the unpermuted cover so smin is always a valid cover
for(i in seq_len(ns)) {
s <- cover(sets, sample(elements))
if (length(s) < length(smin)) {
smin <- s
}
}
length(smin) # approximate smallest length
If you restrict each set to have 2 elements, you have the NP-complete vertex cover problem. I would guess the more general problem is also NP-complete (for the exact version).
If you're just interested in an algorithm (rather than an efficient/feasible algorithm), you can simply generate subsets of the universe of increasing cardinality and check that the intersection with all the sets in S is non-empty. Stop as soon as you get one that works; the cardinality is the minimum possible.
The complexity of this is 2^|U| in the worst case, I think. Given Foo Bah's answer, I don't think you're going to get a polynomial-time answer...
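For completeness, a small sketch of that brute-force search (exponential, so only viable for tiny universes; the function name is made up for illustration):
min_hitting_set <- function(universe, sets) {
  for (k in seq_along(universe)) {  # try cardinalities 1, 2, ...
    combos <- combn(universe, k)
    for (j in seq_len(ncol(combos))) {
      cand <- combos[, j]
      # return the first candidate that intersects every set in S
      if (all(vapply(sets, function(s) any(s %in% cand), logical(1))))
        return(cand)
    }
  }
}
min_hitting_set(1:4, list(c(4, 3, 1), c(3, 1), c(4)))  # e.g. 1 4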