How to optimise an Rcpp function (calling another R function)

I'm solving the following problem: given two vectors x, y and a vectorised function f, I'd like to compute, for each element x_0 of x, the average of f(x_0 - y).
I already implemented the function in R like this
sol <- function(x, y, f) {
  ret <- numeric(length(x))
  for (y0 in y) {
    ret <- ret + f(x - y0)
  }
  ret/length(y)
}
We could use the function like this: sol(1:100, 1:100, exp). Since this function is a crucial part of my code, I'd like to optimise it. The length of x is in the range 1 to 100,000 and the length of y in the range 1 to 1,000. I tried using Rcpp like this:
library(Rcpp)
cppFunction('NumericVector cppEval(NumericVector x, NumericVector y, Function f) {
  int num_y = y.size();
  NumericVector out(x.size());
  for (int i = 0; i < num_y; ++i) {
    out += Rcpp::as<NumericVector>(f(x - y[i]));
  }
  return out/num_y;
}')
Sadly, this piece of code is much slower than the R equivalent. What could I do to write efficient C++ here? I don't know how to get rid of the loop completely.
microbenchmark::microbenchmark(sol(1:100, 1:100, exp), cppEval(1:100, 1:100, exp))
Unit: microseconds
expr min lq mean median uq max neval
sol(1:100, 1:100, exp) 157.572 178.336 244.4421 210.4775 221.7085 4199.367 100
cppEval(1:100, 1:100, exp) 1451.395 1628.367 1829.2443 1697.7480 1794.4390 12868.237 100
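For what it's worth, the overhead here is consistent with the cost of calling back into R: each f(...) invocation crosses the C++/R boundary, so cppEval pays that price length(y) times on top of the work R itself does. A hedged sketch of the difference, specializing the body to exp via Rcpp sugar instead of a Function callback (this only helps if f can be hard-coded, which is an assumption on my part):
library(Rcpp)
cppFunction('NumericVector cppEvalExp(NumericVector x, NumericVector y) {
  int num_y = y.size();
  NumericVector out(x.size());
  for (int i = 0; i < num_y; ++i) {
    out += exp(x - y[i]);  // Rcpp sugar: vectorised exp, no callback into R
  }
  return out/num_y;
}')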

Related

Fastest way to compute Gaussian kernel vector in CppArmadillo?

I am trying to compute a vector of gaussian kernel evaluations as quickly as possible. I have a data point x in R^p, and a matrix X of n vectors x_i. I would like to compute exp( -||x-x_i||^2 / t) for every x_i and return the result as a vector.
I have tried implementing this in both R and RcppArmadillo via the following code
R CODE:
kernel <- function(x, Data, sigma){
  if (sigma <= 0) stop('Gaussian kernel parameter <= 0.')
  DiffPart <- (t(t(Data) - x))^2 ## squared differences between each row of Data and x
  DiffPart <- rowSums(DiffPart)  # sum of squares per row
  exp(-DiffPart / sigma)         # divide by the kernel parameter and evaluate the exponential
}
RcppArmadillo:
arma::Col<double> kernelCPP(arma::Row<double> x, arma::Mat<double> Data, double sigma){
  arma::Mat<double> Diff = Data.each_row() - x;
  int n = Data.n_rows;
  arma::Col<double> kern(n);
  for (int k = 0; k < n; k++){
    kern(k) = exp(-arma::accu(square(Diff.row(k)))/sigma);
  }
  return(kern);
}
Unfortunately, my RcppArmadillo code is not much faster than the original R code. I'll be computing kernel vectors hundreds of thousands of times in future computations, so I would like this to be as fast as I can make it.
When microbenchmarking, I get the following results:
> microbenchmark(
+ kernel(x= TrainX1[1,], Data = TrainX1, sigma = 100)
+ )
Unit: milliseconds
     min       lq   mean   median       uq     max neval
2.223359 2.274559 2.5199 2.308052 2.575144 4.73301   100
and
> microbenchmark(
+ kernelCPP(x= TrainX1[1,], Data = TrainX1, sigma = 100)
+ )
Unit: milliseconds
     min       lq     mean   median       uq      max neval
1.697706 1.732053 1.826743 1.775786 1.871786 2.493439   100
A little faster, but not by much.
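One hedged suggestion: let Armadillo perform the row reduction in a single expression instead of looping over rows, so the intermediate difference matrix is traversed once. A minimal sketch (kernelCPP2 is a name I made up; whether it actually beats the loop will depend on your matrix sizes and BLAS):
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::vec kernelCPP2(const arma::rowvec& x, const arma::mat& Data, double sigma) {
  // sum(..., 1) reduces over each row, giving one squared distance per row of Data
  arma::vec d = arma::sum(arma::square(Data.each_row() - x), 1);
  return arma::exp(-d / sigma);
}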

Is it legitimate to use Rcpp to speed up replacing elements of lists and vectors in iterative algorithm?

Context
I have been working on an iterative algorithm lately, where each iteration n depends on iteration n-1. During each iteration, most of the computation time is taken by subsetting and/or replacing elements of vectors, lists or data.tables (N > 10^6).
I recently came across Rcpp and, playing a little with it, I discovered that replacing element k of vectors or lists can be sped up by two or three orders of magnitude (a few benchmark tests below).
However, when using the Rcpp subsetting code within a for and a while loop, R seems to become unstable and the session aborts at random points with no hint of what went wrong.
Question
My question: is this use of Rcpp legitimate, or can it lead to problems I am not aware of?
Example
Below is the Rcpp code I am using and a few benchmarks. Overall, the algorithm should call the replacing functions ~5.5 billion times and subset functions ~50 billion times.
Note that replacing elements of lists and double vectors is faster using Rcpp, while for integer vectors base R solutions are preferred (benchmark 1); data.table is a good option for replacing elements, but if you have to subset repeatedly to access its elements, the vector approach is much faster (benchmark 2).
Functions:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void assign_list(List x, int k, NumericVector new_element){
  x[k-1] = new_element;
}

// [[Rcpp::export]]
void assign_dbl(NumericVector x, int k, double y){
  x[k-1] = y;
}

// [[Rcpp::export]]
void assign_int(IntegerVector x, int k, int y){
  x[k-1] = y;
}
Benchmarks:
Inputs
set.seed(747474)
int <- 1:10^7
dou <- rnorm(10^7, 1000, 300)
l <- lapply(sample(5:20, 10^7, replace = T), rnorm, mean = 1000, sd = 300)
dt <- data.table(int = int, dou = dou, l = l) # requires library(data.table)
i <- 999999
z <- 2222
k <- 30000
s <- 552877
1)
Unit: nanoseconds
expr min lq mean median uq max neval
int[i] <- -1L 488 2439 36938108.9 4146.0 15651119 799720107 30
dou[i] <- -1 732 3170 19101960.4 6609193.5 16187500 212369197 30
l[i] <- -1 489 3902 159442538.1 186035314.5 227131872 618326686 30
assign_int 19853910 22028692 81055363.5 24665494.0 39352345 872241539 30
assign_dbl 1220 5852 48023.2 8534.5 13167 1158828 30
assign_list 1464 6828 52866.9 10850.5 13411 1243430 30
dt[k, ':=' (int = -1, dou = -1, l = -1)] 206020 340116 481850.0 425326.5 529312 1414341 30
2)
microbenchmark(times = 30L,
"subset vector + list" = {int[s]; dou[s]; l[s]},
"subset datatable" = {dt[s, int]; dt[s, dou]; dt[s, l]})
Unit: nanoseconds
expr min lq mean median uq max neval
subset vector + list 488 488 1715.533 1585.5 2926 4389 30
subset datatable 563688 574417 719304.467 600138.0 875765 1308040 30
This is very dangerous because of the side effect shown here, where both x and y are changed even though you only pass x into the Rcpp function:
> x <- y <- 1:10
> assign_int(x, 1, 2)
> y
[1] 2 2 3 4 5 6 7 8 9 10
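If in-place semantics are not what you want, cloning the input inside the function removes the aliasing at the cost of a copy. A minimal sketch (assign_int_safe is a hypothetical name, not from the question):
library(Rcpp)
cppFunction('IntegerVector assign_int_safe(IntegerVector x, int k, int y) {
  IntegerVector out = clone(x);  // deep copy, so the caller sees no side effect
  out[k-1] = y;
  return out;
}')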
It does not seem to be faster; for these functions
f0 <- function(x) {
  for (i in seq_along(x))
    x[i] = -i
}
f1 <- function(x) {
  for (i in seq_along(x))
    assign_int(x, i, -i)
}
I have
> int <- 1:10^5
> microbenchmark(f0(int), f1(int), times=5)
Unit: milliseconds
expr min lq mean median uq max neval
f0(int) 14.78777 14.80264 15.05683 15.03138 15.17678 15.48556 5
f1(int) 659.67346 669.00095 672.93343 670.48917 676.16930 689.33429 5
In your benchmark int[i] <- 1, the '1' is a numeric (double) value, so you are coercing int to a double vector (check class(int) after the assignment), which triggers a complete copy. Use int[i] <- 1L to force the right-hand side to be an integer.
> int0 <- int1 <- 1:10^7
> microbenchmark(int0[1] <- 1, int1[1] <- 1L)
Unit: microseconds
expr min lq mean median uq max neval
int0[1] <- 1 1.047 1.102 1770.9911 1.143 1.2650 176960.52 100
int1[1] <- 1L 1.105 1.176 339.4264 1.213 1.2655 33815.97 100
> class(int0)
[1] "numeric"
> class(int1)
[1] "integer"
Updating just a single element per call is an expensive benchmark in base R because it triggers a copy of the vector on each assignment, whereas in f0() the copy occurs just once. On the first iteration, R makes a copy because it knows that the vector of integer values is referenced by at least two symbols (the argument to the function, int, and the symbol used in the function, x), so it copies the memory and assigns it to x inside the function. It does this to avoid exactly the side effect seen in your Rcpp code (i.e., to avoid modifying int). On subsequent iterations, R recognizes that only a single symbol references the vector, so it does the replacement without making a copy.
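A quick way to watch this happen (assuming an R build with memory profiling, which the CRAN binaries have) is tracemem(), which prints a line each time the vector is duplicated; with f0() it typically fires only once, at the first modification:
int <- 1:10^5
tracemem(int)  # report every duplication of this vector
f0(int)        # expect a single tracemem line, from the copy at the first x[i] <- -i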
Note that
> int <- 1:10^5
> f1(int)
> head(int)
[1] -1 -2 -3 -4 -5 -6
illustrates the subtle way that the side-effects of your Rcpp code can have unexpected outcomes.
Also, there are several ways of slowing down iterative loops, e.g.,
f2 <- function(x) {
  for (i in seq_along(x)) {
    x[i] = -i
    y <- x
  }
}
f3 <- function(x) {
  result <- integer()
  for (i in seq_along(x))
    result <- c(result, -i)
}
with
> int <- 1:10^3
> microbenchmark(f0(int), f2(int), f3(int), times = 1)
Unit: microseconds
expr min lq mean median uq max neval
f0(int) 150.507 150.507 150.507 150.507 150.507 150.507 1
f2(int) 667.201 667.201 667.201 667.201 667.201 667.201 1
f3(int) 4379.005 4379.005 4379.005 4379.005 4379.005 4379.005 1
f2() causes R to make a copy of x each time through the loop (to avoid the side-effect of modifying y). f3() copies a vector of length 0, 1, 2, 3, ... n - 1 (where n = length(x)) on successive iterations, leading to n * (n - 1) / 2 elements copied, and an algorithm that scales as the square of the length of x.
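The usual fix for the f3() pattern is to preallocate the result once, which restores linear scaling; a minimal sketch (f4 is a name introduced here):
f4 <- function(x) {
  result <- integer(length(x))  # allocate the full result up front
  for (i in seq_along(x))
    result[i] <- -i
  result
}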
The general principle applies to other types as well, including lists with
f0l <- function(x) {
  for (i in seq_along(x))
    x[[i]] <- i
  x
}
f1l <- function(x) {
  for (i in seq_along(x))
    assign_list(x, i, i)
}
leading to
> set.seed(123)
> l0 <- lapply(sample(5:20, 10^6, replace = T), rnorm, mean = 1000, sd = 300)
> set.seed(123)
> l1 <- lapply(sample(5:20, 10^6, replace = T), rnorm, mean = 1000, sd = 300)
> microbenchmark(f0l(l0), f1l(l1), times=1)
Unit: milliseconds
expr min lq mean median uq max neval
f0l(l0) 239.9865 239.9865 239.9865 239.9865 239.9865 239.9865 1
f1l(l1) 6767.9172 6767.9172 6767.9172 6767.9172 6767.9172 6767.9172 1

Armadillo C++: Sorting a vector in terms of two other vectors

My question relates to a sorting exercise, which I can undertake easily (but perhaps slowly) in R and would like to undertake in C++ in order to speed up my code.
Consider three vectors of the same size a,b and c. In R, the following command would sort the vector first in terms of b and then, in case of ties, would further sort in terms of c.
a <- a[order(b, c)]
Example:
a<-c(1,2,3,4,5)
b<-c(1,2,1,2,1)
c<-c(5,4,3,2,1)
> a[order(b,c)]
[1] 5 3 1 4 2
Is there an efficient way to undertake this in C++ using Armadillo vectors?
We can write the following C++ solution, which I have in a file SO_answer.cpp:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace arma;

// [[Rcpp::export]]
vec arma_sort(vec x, vec y, vec z) {
  // Order the elements of x by sorting y and z;
  // we order by y unless there's a tie, then order by z.
  // First create a vector of indices
  uvec idx = regspace<uvec>(0, x.size() - 1);
  // Then sort that vector by the values of y and z
  std::sort(idx.begin(), idx.end(), [&](int i, int j){
    if ( y[i] == y[j] ) {
      return z[i] < z[j];
    }
    return y[i] < y[j];
  });
  // And return x in that order
  return x(idx);
}
What we've done is take advantage of the fact that std::sort() allows you to sort based on a custom comparator. We use a comparator that compares the elements of z only if the elements of y are equal; otherwise it compares the values of y.¹ Then we can compile the file and test the function in R:
library(Rcpp)
sourceCpp("SO_answer.cpp")
set.seed(1234)
x <- sample(1:10)
y <- sample(1:10)
z <- sample(1:10)
y[sample(1:10, 1)] <- 1 # create a tie
all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
Of course, we must also consider whether this actually gives you any performance increase, which is the whole reason why you're doing this. Let's benchmark:
library(microbenchmark)
microbenchmark(r = x[order(y, z)],
arma = arma_sort(x, y, z),
times = 1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
r 36.040 37.23 39.386160 37.64 38.32 3316.286 10000 b
arma 5.055 6.07 7.155676 7.00 7.53 107.230 10000 a
On my machine, it looks like you get about a 5-6X increase in speed with small vectors, though this advantage doesn't hold as well when you scale up:
x <- sample(1:100)
y <- sample(1:100)
z <- sample(1:100)
y[sample(1:100, 10)] <- 1 # create some ties
all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
microbenchmark(r = x[order(y, z)],
arma = arma_sort(x, y, z),
times = 1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
r 44.50 46.360 48.01275 46.930 47.755 294.051 10000 b
arma 10.76 12.045 16.30033 13.015 13.715 5262.132 10000 a
x <- sample(1:1000)
y <- sample(1:1000)
z <- sample(1:1000)
y[sample(1:100, 10)] <- 1 # create some ties
all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
microbenchmark(r = x[order(y, z)],
arma = arma_sort(x, y, z),
times = 1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
r 113.765 118.7950 125.7387 120.5075 122.4475 3373.696 10000 b
arma 82.690 91.3925 104.0755 95.2350 99.4325 6040.162 10000 a
It's still faster, but by less than 2X once you're at vectors of length 1000. This is probably why F. Privé said this operation should be fast enough in R. While moving to C++ using Rcpp can give you great performance advantages, the extent to which you get gains is largely dependent on context, as mentioned many times by Dirk Eddelbuettel in answers to various questions here.
¹ Note that typically for sorting Armadillo vectors I would suggest using sort() or sort_index() (see the Armadillo docs here). If you're trying to sort a vec by the values of a second vec, you could use x(arma::sort_index(y)), as I indicated in an answer to a related question here. You can even use stable_sort_index() to preserve ties. However, I couldn't figure out how to use these functions to solve the specific problem you present here.
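That said, the two-key ordering can be built from those functions with the classic two-pass trick: sort by the secondary key first, then stable-sort by the primary key so ties keep the secondary order. A hedged sketch (arma_sort2 is a name I made up, and I haven't benchmarked it against the comparator version):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec arma_sort2(const arma::vec& x, const arma::vec& y, const arma::vec& z) {
  arma::uvec iz = arma::sort_index(z);           // pass 1: order by the tie-breaker z
  arma::vec y_z = y(iz);                         // y rearranged into z-order
  arma::uvec iy = arma::stable_sort_index(y_z);  // pass 2: stable sort by y keeps z-order within ties
  arma::uvec idx = iz(iy);                       // compose the two permutations
  return x(idx);
}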

Matrix multiplication in Rcpp

First of all, I am a novice user, so forgive my general ignorance. I am looking for a faster alternative to the %*% operator in R. Even though older posts suggest the use of RcppArmadillo, I have tried for 2 hours to make RcppArmadillo work without success; I always run into lexical issues that yield 'unexpected ...' errors. I have found the following function in Rcpp which I can make work:
library(Rcpp)
func <- '
NumericMatrix mmult(NumericMatrix m, NumericMatrix v, bool byrow = true)
{
  if (m.nrow() != v.nrow()) stop("Non-conformable arrays");
  if (m.ncol() != v.ncol()) stop("Non-conformable arrays");
  NumericMatrix out(m);
  for (int i = 0; i < m.nrow(); i++)
  {
    for (int j = 0; j < m.ncol(); j++)
    {
      out(i,j) = m(i,j) * v(i,j);  // element-wise product, not %*%
    }
  }
  return out;
}
'
This function, however, performs element-wise multiplication and does not behave as %*%. Is there an easy way to modify the above code to achieve the intended result?
EDIT:
I have come up with a function using RcppEigen that seems to beat %*%:
library(inline)  # provides cxxfunction()
etest <- cxxfunction(signature(tm = "NumericMatrix",
                               tm2 = "NumericMatrix"),
                     plugin = "RcppEigen",
                     body = "
NumericMatrix tm22(tm2);
NumericMatrix tmm(tm);
const Eigen::Map<Eigen::MatrixXd> ttm(as<Eigen::Map<Eigen::MatrixXd> >(tmm));
const Eigen::Map<Eigen::MatrixXd> ttm2(as<Eigen::Map<Eigen::MatrixXd> >(tm22));
Eigen::MatrixXd prod = ttm * ttm2;
return(wrap(prod));
")
set.seed(123)
M1 <- matrix(sample(1e3),ncol=50)
M2 <- matrix(sample(1e3),nrow=50)
identical(etest(M1,M2), M1 %*% M2)
[1] TRUE
res <- microbenchmark(
+ etest(M1,M2),
+ M1 %*% M2,
+ times=10000L)
res
Unit: microseconds
expr min lq mean median uq max neval
etest(M1, M2) 5.709 6.61 7.414607 6.611 7.211 49.879 10000
M1 %*% M2 11.718 12.32 13.505272 12.621 13.221 58.592 10000
There are good reasons to rely on existing libraries / packages for standard tasks. The routines in the libraries are
optimized
thoroughly tested
a good means to keep the code compact, human-readable, and easy to maintain.
Therefore I think that using RcppArmadillo or RcppEigen should be preferable here. However, to answer your question, below is a possible Rcpp code to perform a matrix multiplication:
library(Rcpp)
cppFunction('NumericMatrix mmult(const NumericMatrix& m1, const NumericMatrix& m2){
  if (m1.ncol() != m2.nrow()) stop("Incompatible matrix dimensions");
  NumericMatrix out(m1.nrow(), m2.ncol());
  NumericVector rm1, cm2;
  for (int i = 0; i < m1.nrow(); ++i) {
    rm1 = m1(i, _);
    for (int j = 0; j < m2.ncol(); ++j) {
      cm2 = m2(_, j);
      out(i,j) = std::inner_product(rm1.begin(), rm1.end(), cm2.begin(), 0.);
    }
  }
  return out;
}')
Let's test it:
A <- matrix(c(1:6),ncol=2)
B <- matrix(c(0:7),nrow=2)
mmult(A,B)
# [,1] [,2] [,3] [,4]
#[1,] 4 14 24 34
#[2,] 5 19 33 47
#[3,] 6 24 42 60
identical(mmult(A,B), A %*% B)
#[1] TRUE
Hope this helps.
As benchmark tests show, the above Rcpp code is slower than R's built-in %*% operator. I assume that, while my Rcpp code can certainly be improved, it will be hard to beat the optimized code behind %*% in terms of performance:
library(microbenchmark)
set.seed(123)
M1 <- matrix(rnorm(1e4),ncol=100)
M2 <- matrix(rnorm(1e4),nrow=100)
identical(M1 %*% M2, mmult(M1,M2))
#[1] TRUE
res <- microbenchmark(
mmult(M1,M2),
M1 %*% M2,
times=1000L)
#> res
#Unit: microseconds
# expr min lq mean median uq max neval cld
# mmult(M1, M2) 1466.855 1484.8535 1584.9509 1494.0655 1517.5105 2699.643 1000 b
# M1 %*% M2 602.053 617.9685 687.6863 621.4335 633.7675 2774.954 1000 a
I would encourage you to try to work out your issues with RcppArmadillo. Using it is as simple as this example, also created by calling RcppArmadillo.package.skeleton():
// another simple example: outer product of a vector,
// returning a matrix
//
// [[Rcpp::export]]
arma::mat rcpparma_outerproduct(const arma::colvec & x) {
  arma::mat m = x * x.t();
  return m;
}

// and the inner product returns a scalar
//
// [[Rcpp::export]]
double rcpparma_innerproduct(const arma::colvec & x) {
  double v = arma::as_scalar(x.t() * x);
  return v;
}
There is actually more code in the example but this should give you an idea.
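For the %*% replacement specifically, a minimal RcppArmadillo sketch (assuming the RcppArmadillo package is installed; Armadillo hands the product to whatever BLAS your R uses):
library(Rcpp)
cppFunction(depends = "RcppArmadillo",
            'arma::mat arma_mmult(const arma::mat& A, const arma::mat& B) {
  return A * B;  // Armadillo matrix product, delegated to BLAS
}')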
The following approach can also be used:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix mmult(NumericMatrix m, NumericMatrix v)
{
  // Look up %*% in the base environment and call it from C++
  Environment base("package:base");
  Function mat_Mult = base["%*%"];
  return mat_Mult(m, v);
}
With this approach, we use R's own %*% operator, at the cost of a round trip back into R on each call.

Intersect function with Rcpp

I'm having a hard time implementing a function with Rcpp using cppFunction. I need something like R's intersect that takes two NumericVector arguments and returns another NumericVector with the result, just like in R.
This document has been of some help but unfortunately I'm pretty much a noob in C++ atm.
How could I implement the intersect R function with cppFunction ?
Thanks
You would probably want to use something like std::unordered_set to implement intersect:
File myintersect.cpp:
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;

// Enable C++11 via this plugin (Rcpp 0.10.3 or later)
// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
NumericVector myintersect(NumericVector x, NumericVector y) {
  std::vector<double> res;
  std::unordered_set<double> s(y.begin(), y.end());
  for (int i = 0; i < x.size(); ++i) {
    auto f = s.find(x[i]);
    if (f != s.end()) {
      res.push_back(x[i]);
      s.erase(f);  // remove the match so duplicates in x are kept only once
    }
  }
  return Rcpp::wrap(res);
}
We can load the function and verify it works:
library(Rcpp)
sourceCpp(file="myintersect.cpp")
set.seed(144)
x <- c(-1, -1, sample(seq(1000000), 10000, replace=T))
y <- c(-1, sample(seq(1000000), 10000, replace=T))
all.equal(intersect(x, y), myintersect(x, y))
# [1] TRUE
However, it seems this approach is a good deal less efficient than the intersect function:
library(microbenchmark)
microbenchmark(intersect(x, y), myintersect(x, y))
# Unit: microseconds
# expr min lq median uq max neval
# intersect(x, y) 424.167 495.861 501.919 523.7835 989.997 100
# myintersect(x, y) 1778.609 1798.111 1808.575 1835.1570 2571.426 100
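Worth noting: Rcpp also ships sugar set operations, including intersect() (alongside union_() and setdiff()), which may be the shortest route; a minimal sketch (the element order of the result may differ from base R's intersect):
library(Rcpp)
cppFunction('NumericVector sugarIntersect(NumericVector x, NumericVector y) {
  return Rcpp::intersect(x, y);  // Rcpp sugar set intersection
}')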