Fastest way to compute Gaussian kernel vector in CppArmadillo? - c++

I am trying to compute a vector of gaussian kernel evaluations as quickly as possible. I have a data point x in R^p, and a matrix X of n vectors x_i. I would like to compute exp( -||x-x_i||^2 / t) for every x_i and return the result as a vector.
I have tried implementing this in both R and RcppArmadillo via the following code
R CODE:
kernel <- function(x, Data, sigma){
if(sigma <= 0 ) stop('Gaussian kernel parameter <= 0.')
DiffPart <- (t(t(Data) - x))^2 ## Computes the distance squared of the data and point x
DiffPart <- rowSums(DiffPart) # Sum of squares
exp( - DiffPart / sigma) #Divide by kernel parameter and evluate exponential function
}
RcppArmadillo:
arma::Col<double> kernelCPP(arma::Row<double> x, arma::Mat<double> Data, double sigma){
arma::Mat<double> Diff=Data.each_row()-x;
int n = Data.n_rows;
arma::Col<double> kern(n);
for(int k = 0 ; k < n; k++){
kern(k) = exp(-arma::accu(square(Diff.row(k)))/sigma);
}
return(kern);
}
Unfortunately, my RcppArmadillo code is not much faster than the original R code. I'll be computing kernel vectors hundreds of thousands of times in future code/computation, and so I would like this to be as fast of a process as I can make it.
When microbenchmarking, I get the following results:
> microbenchmark(
+ kernel(x= TrainX1[1,], Data = TrainX1, sigma = 100)
+ )
Unit: milliseconds
min lq mean median
2.223359 2.274559 2.5199 2.308052
uq max neval
2.575144 4.73301 100
and
> microbenchmark(
+ kernelCPP(x= TrainX1[1,], Data = TrainX1, sigma = 100)
+ )
Unit: milliseconds
min lq mean
1.697706 1.732053 1.826743
median uq max neval
1.775786 1.871786 2.493439 100
A little faster, but not by much.

Related

How to optimise Rcpp function (calling another R function)

I'm solving the following problem: Given two vectors x,y and a vectorised function f. I'd like to compute for each element x0 of x the average of f(x_0 - y).
I already implemented the function in R like this
sol <- function(x, y, f) {
ret <- numeric(length(x))
for (y0 in y) {
ret <- ret + f(x - y0)
}
ret/length(y)
}
We could use the function like this sol(1:100, 1:100, exp). Since this function is a crucial part of my code I'd like to optimise it. The length of x is in the range (1 - 100,000) and y is in the range (1 - 1,000). I tried using Rcpp like this
library(Rcpp)
cppFunction('NumericVector cppEval(NumericVector x, NumericVector y, Function f) {
int num_y = y.size();
NumericVector out(x.size());
for(int i = 0; i < num_y; ++i) {
out += Rcpp::as<NumericVector>(f(x - y[i]));
}
return out/num_y;
}')
Sadly this piece of code is much slower than the R equivalent. What could I do to efficiently write Cpp here? I don't know how to completely get rid of the loop.
microbenchmark::microbenchmark(sol(1:100, 1:100, exp), cppEval(1:100, 1:100, exp))
Unit: microseconds
expr min lq mean median uq max neval
sol(1:100, 1:100, exp) 157.572 178.336 244.4421 210.4775 221.7085 4199.367 100
cppEval(1:100, 1:100, exp) 1451.395 1628.367 1829.2443 1697.7480 1794.4390 12868.237 100

Is it legitimate to use Rcpp to speed up replacing elements of lists and vectors in iterative algorithm?

Context
I have been working on iterative algorithm lately, where each iteration n depends on the iteration n-1. During each iteration most of the computation time is taken by sub-setting and/or replacing elements of vectors, lists or data.tables (N > 10^6).
I recently came across Rcpp and playing a little bit with it I discovered that replacing element k of vectors or lists can be sped up by two or three orders of magnitudes (few benchmark tests below).
However, when using the Rcpp subsetting code within a for and a while loop, R seems to become unstable and the session aborts at random points with no hint of what went wrong.
Question
My question: is this use of Rcpp legitimate or it can leads to problems I am not aware of?
Example
Below is the Rcpp code I am using and a few benchmarks. Overall, the algorithm should call the replacing functions ~5.5 billion times and subset functions ~50 billion times.
Note that replacing elements of lists and double vectors is faster using Rcpp, while for integer vectors base R solutions are preferred (benchmark 1); data table is a good option to replace elements but if you have to subset repeatedly to access its elements the vector approach is much faster (benchmark 2).
Functions:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void assign_list(List x, int k, NumericVector new_element){
x[k-1] = new_element;
}
// [[Rcpp::export]]
void assign_dbl(NumericVector x, int k, double y){
x[k-1] = y;
}
// [[Rcpp::export]]
void assign_int(IntegerVector x, int k, int y){
x[k-1] = y;
}
Benchmarks:
Inputs
set.seed(747474)
int <- 1:10^7
dou <- rnorm(10^7, 1000, 300)
l <- lapply(sample(5:20, 10^7, replace = T), rnorm, mean = 1000, sd = 300)
dt <- data.table(int = int, dou = dou, l = l)
i <- 999999
z <- 2222
k <- 30000
s <- 552877
1)
Unit: nanoseconds
expr min lq mean median uq max neval
int[i] <- -1L 488 2439 36938108.9 4146.0 15651119 799720107 30
dou[i] <- -1 732 3170 19101960.4 6609193.5 16187500 212369197 30
l[i] <- -1 489 3902 159442538.1 186035314.5 227131872 618326686 30
assign_int 19853910 22028692 81055363.5 24665494.0 39352345 872241539 30
assign_dbl 1220 5852 48023.2 8534.5 13167 1158828 30
assign_list 1464 6828 52866.9 10850.5 13411 1243430 30
dt[k, ':=' (int = -1, dou = -1, l = -1)] 206020 340116 481850.0 425326.5 529312 1414341 30
2)
microbenchmark(times = 30L,
"subset vector + list" = {int[s]; dou[s]; l[s]},
"subset datatable" = {dt[s, int]; dt[s, dou]; dt[s, l]})
Unit: nanoseconds
expr min lq mean median uq max neval
subset vector + list 488 488 1715.533 1585.5 2926 4389 30
subset datatable 563688 574417 719304.467 600138.0 875765 1308040 30
This is very dangerous, because of the side effect shown here, where x and y are changed even though you only pass x into the Rcpp function
> x <- y <- 1:10
> assign_int(x, 1, 2)
> y
[1] 2 2 3 4 5 6 7 8 9 10
It does not seem to be faster; for these functions
f0 <- function(x) {
for (i in seq_along(x))
x[i] = -i
}
f1 <- function(x) {
for (i in seq_along(x))
assign_int(x, i, -i)
}
I have
> int <- 1:10^5
> microbenchmark(f0(int), f1(int), times=5)
Unit: milliseconds
expr min lq mean median uq max neval
f0(int) 14.78777 14.80264 15.05683 15.03138 15.17678 15.48556 5
f1(int) 659.67346 669.00095 672.93343 670.48917 676.16930 689.33429 5
In your benchmark, int[i] <- 1, '1' is a numeric (double) value, so you are coercing to a double vector (check out class(int) after the assignment), triggering a complete copy. Use int[i] <- 1L to force the right-hand side to be an integer.
> int0 <- int1 <- 1:10^7
> microbenchmark(int0[1] <- 1, int1[1] <- 1L)
Unit: microseconds
expr min lq mean median uq max neval
int0[1] <- 1 1.047 1.102 1770.9911 1.143 1.2650 176960.52 100
int1[1] <- 1L 1.105 1.176 339.4264 1.213 1.2655 33815.97 100
> class(int0)
[1] "numeric"
> class(int1)
[1] "integer"
Updating just a single element as benchmark is expensive in base R because it triggers a copy of the vector on each assignment; in f0() the copy occurs just once. On the first iteration, R makes a copy because it knows that the vector of integer values is referenced by at least two symbols (the argument to the function int, and the symbol used in the function x) so it makes a copy of the memory and assigns it to x inside the function. It does this to avoid the side-effect seen in your Rcpp code (i.e., to avoid modifying int). The next time through the loop R recognizes that only a single symbol references the vector, so does the replacement without making a copy.
Note that
> int <- 1:10^5
> f1(int)
> head(int)
[1] -1 -2 -3 -4 -5 -6
illustrates the subtle way that the side-effects of your Rcpp code can have unexpected outcomes.
Also, there are several ways of slowing down iterative loops, e.g.,
f2 <- function(x) {
for (i in seq_along(x)) {
x[i] = -i
y <- x
}
}
f3 <- function(x) {
result <- integer()
for (i in seq_along(x))
result <- c(result, -i)
}
with
> int <- 1:10^3
> microbenchmark(f0(int), f2(int), f3(int), times = 1)
Unit: microseconds
expr min lq mean median uq max neval
f0(int) 150.507 150.507 150.507 150.507 150.507 150.507 1
f2(int) 667.201 667.201 667.201 667.201 667.201 667.201 1
f3(int) 4379.005 4379.005 4379.005 4379.005 4379.005 4379.005 1
f2() causes R to make a copy of x each time through the loop (to avoid the side-effect of modifying y). f3() copies a vector of length 0, 1, 2, 3, ... n - 1 (where n = length(x)) on successive iterations, leading to n * (n - 1) / 2 elements copied, and an algorithm that scales as the square of the length of x.
The general principle applies to other types as well, including lists with
f0l <- function(x) {
for (i in seq_along(x))
x[[i]] <- i
x
}
f1l <- function(x) {
for (i in seq_along(x))
assign_list(x, i, i)
}
leading to
> set.seed(123)
> l0 <- lapply(sample(5:20, 10^6, replace = T), rnorm, mean = 1000, sd = 300)
> set.seed(123)
> l1 <- lapply(sample(5:20, 10^6, replace = T), rnorm, mean = 1000, sd = 300)
> microbenchmark(f0l(l0), f1l(l1), times=1)
Unit: milliseconds
expr min lq mean median uq max neval
f0l(l0) 239.9865 239.9865 239.9865 239.9865 239.9865 239.9865 1
f1l(l1) 6767.9172 6767.9172 6767.9172 6767.9172 6767.9172 6767.9172 1

Arithmetic in R faster on numerics as opposed to integers. What's going on?

I was in the middle of converting some code that utilized mostly numeric data (i.e. doubles) to integers and did a quick benchmark to see how much efficiency I gained.
To my surprise it was slower... by about 20%. I thought I had done something wrong, but the original code was only a few basic arithmetical operations on moderately sized vectors, so I knew it wasn't that. Maybe my environment was messed up? I restarted fresh, and the same result... integers were less efficient.
This started a series of test and a dive into the rabbit hole. Here is my first test. We sum one million elements using base R's sum. Note that with R version 3.5.0 the timings are quite a bit different and with v 3.5.1, the timings are about the same (still not what one would expect):
set.seed(123)
int1e6 <- sample(1:10, 1e6, TRUE)
dbl1e6 <- runif(1e6, 1, 10)
head(int1e6)
# [1] 5 3 6 8 6 2
class(int1e6)
# [1] "integer"
head(dbl1e6)
# [1] 5.060628 2.291397 2.992889 5.299649 5.217105 9.769613
class(dbl1e6)
#[1] "numeric"
mean(dbl1e6)
# [1] 5.502034
mean(int1e6)
# [1] 5.505185
## R 3.5.0
library(microbenchmark)
microbenchmark(intSum = sum(int1e6), dblSum = sum(dbl1e6), times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intSum 1033.677 1043.991 1147.9711 1111.438 1200.725 2723.834 1000
dblSum 817.719 835.486 945.6553 890.529 998.946 2736.024 1000
## R 3.5.1
Unit: microseconds
expr min lq mean median uq max neval
intSum 836.243 877.7655 966.4443 950.1525 997.9025 2077.257 1000
dblSum 866.939 904.7945 1015.3445 986.4770 1046.4120 2541.828 1000
class(sum(int1e6))
# [1] "integer"
class(sum(dbl1e6))
#[1] "numeric"
From here on out both version 3.5.0 and 3.5.1 give nearly identical results.
Here is our first dive into the rabbit hole. Along with the documentation for sum (see ?sum), we see that sum is simply a generic function that is dispatched via standardGeneric. Digging deeper, we see it eventually calls R_execMethod here on line 516. This is where I get lost. It looks to me, like R_execClosure is called next followed by many different possible branches. I think the standard path is to call eval next, but I'm not sure. My guess is that eventually, a function is called in arithimetic.c but I can't find anything that specifically sums a vector of numbers. Either way, based off of my limited knowledge of method dispatching and C in general, my naive assumption is that a function that looks like the following is called:
template <typename T>
T sum(vector<T> x) {
T mySum = 0;
for (std::size_t i = 0; i < x.size(); ++i)
mySum += x[i];
return mySum;
}
I know there is no function overloading or vectors in C, but you get my point. My belief is that eventually, a bunch of the same type of elements are added to an element of the same type and eventually returned. In Rcpp we would have something like:
template <typename typeReturn, typename typeRcpp>
typeReturn sumRcpp(typeRcpp x) {
typeReturn mySum = 0;
unsigned long int mySize = x.size();
for (std::size_t i = 0; i < mySize; ++i)
mySum += x[i];
return mySum;
}
// [[Rcpp::export]]
SEXP mySumTest(SEXP Rx) {
switch(TYPEOF(Rx)) {
case INTSXP: {
IntegerVector xInt = as<IntegerVector>(Rx);
int resInt = sumRcpp<int>(xInt);
return wrap(resInt);
}
case REALSXP: {
NumericVector xNum = as<NumericVector>(Rx);
double resDbl = sumRcpp<double>(xNum);
return wrap(resDbl);
}
default: {
Rcpp::stop("Only integers and numerics are supported");
}
}
}
And the benchmarks confirm my normal thinking about the inherit efficiency dominance of integers:
microbenchmark(mySumTest(int1e6), mySumTest(dbl1e6))
Unit: microseconds
expr min lq mean median uq max neval
mySumTest(int1e6) 103.455 160.776 185.2529 180.2505 200.3245 326.950 100
mySumTest(dbl1e6) 1160.501 1166.032 1278.1622 1233.1575 1347.1660 1644.494 100
Binary Operators
This got me thinking further. Maybe it is just the complexity wrapped around standardGeneric that makes the different data types behave strangely. So, let's skip all that jazz and go straight to the binary operators (+, -, *, /, %/%)
set.seed(321)
int1e6Two <- sample(1:10, 1e6, TRUE)
dbl1e6Two <- runif(1e6, 1, 10)
## addition
microbenchmark(intPlus = int1e6 + int1e6Two,
dblPlus = dbl1e6 + dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intPlus 2.531220 3.214673 3.970903 3.401631 3.668878 82.11871 1000
dblPlus 1.299004 2.045720 3.074367 2.139489 2.275697 69.89538 1000
## subtraction
microbenchmark(intSub = int1e6 - int1e6Two,
dblSub = dbl1e6 - dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intSub 2.280881 2.985491 3.748759 3.166262 3.379755 79.03561 1000
dblSub 1.302704 2.107817 3.252457 2.208293 2.382188 70.24451 1000
## multiplication
microbenchmark(intMult = int1e6 * int1e6Two,
dblMult = dbl1e6 * dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intMult 2.913680 3.573557 4.380174 3.772987 4.077219 74.95485 1000
dblMult 1.303688 2.020221 3.078500 2.119648 2.299145 10.86589 1000
## division
microbenchmark(intDiv = int1e6 %/% int1e6Two,
dblDiv = dbl1e6 / dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intDiv 2.892297 3.210666 3.720360 3.228242 3.373456 62.12020 1000
dblDiv 1.228171 1.809902 2.558428 1.842272 1.990067 64.82425 1000
The classes are preserved as well:
unique(c(class(int1e6 + int1e6Two), class(int1e6 - int1e6Two),
class(int1e6 * int1e6Two), class(int1e6 %/% int1e6Two)))
# [1] "integer"
unique(c(class(dbl1e6 + dbl1e6Two), class(dbl1e6 - dbl1e6Two),
class(dbl1e6 * dbl1e6Two), class(dbl1e6 / dbl1e6Two)))
# [1] "numeric"
With every case, we see that arithmetic is 40% - 70% faster on numeric data type. What is really strange is that we get an even larger discrepancy when the two vectors being operated on are identical:
microbenchmark(intPlus = int1e6 + int1e6,
dblPlus = dbl1e6 + dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intPlus 2522.774 3148.464 3894.723 3304.189 3531.310 73354.97 1000
dblPlus 977.892 1703.865 2710.602 1767.801 1886.648 77738.47 1000
microbenchmark(intSub = int1e6 - int1e6,
dblSub = dbl1e6 - dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intSub 2236.225 2854.068 3467.062 2994.091 3214.953 11202.06 1000
dblSub 893.819 1658.032 2789.087 1730.981 1873.899 74034.62 1000
microbenchmark(intMult = int1e6 * int1e6,
dblMult = dbl1e6 * dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intMult 2852.285 3476.700 4222.726 3658.599 3926.264 78026.18 1000
dblMult 973.640 1679.887 2638.551 1754.488 1875.058 10866.52 1000
microbenchmark(intDiv = int1e6 %/% int1e6,
dblDiv = dbl1e6 / dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intDiv 2879.608 3355.015 4052.564 3531.762 3797.715 11781.39 1000
dblDiv 945.519 1627.203 2706.435 1701.512 1829.869 72215.51 1000
unique(c(class(int1e6 + int1e6), class(int1e6 - int1e6),
class(int1e6 * int1e6), class(int1e6 %/% int1e6)))
# [1] "integer"
unique(c(class(dbl1e6 + dbl1e6), class(dbl1e6 - dbl1e6),
class(dbl1e6 * dbl1e6), class(dbl1e6 / dbl1e6)))
# [1] "numeric"
That is nearly a 100% increase with every operator type!!!
How about a regular for loop in base R?
funInt <- function(v) {
mySumInt <- 0L
for (element in v)
mySumInt <- mySumInt + element
mySumInt
}
funDbl <- function(v) {
mySumDbl <- 0
for (element in v)
mySumDbl <- mySumDbl + element
mySumDbl
}
microbenchmark(funInt(int1e6), funDbl(dbl1e6))
Unit: milliseconds
expr min lq mean median uq max neval
funInt(int1e6) 25.44143 25.75075 26.81548 26.09486 27.60330 32.29436 100
funDbl(dbl1e6) 24.48309 24.82219 25.68922 25.13742 26.49816 29.36190 100
class(funInt(int1e6))
# [1] "integer"
class(funDbl(dbl1e6))
# [1] "numeric"
The difference isn't amazing, but still one would expect the integer sum to outperform the double sum. I really don't know what to think about this.
So my question is:
Why exactly do numeric data types outperform integer data types on basic arithmetical operations in base R?
Edit. Forgot to mention this:
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
F.Privé's "random guess" in the comments is really good! The function
do_arith seems to be the starting point within arithmetic.c. First for scalars we see that the case of REALSXP is simple: e.g., standard + is used. For INTSXP there is a dispatch to, for example, R_integer_plus, which does indeed check for integer overflow:
static R_INLINE int R_integer_plus(int x, int y, Rboolean *pnaflag)
{
if (x == NA_INTEGER || y == NA_INTEGER)
return NA_INTEGER;
if (((y > 0) && (x > (R_INT_MAX - y))) ||
((y < 0) && (x < (R_INT_MIN - y)))) {
if (pnaflag != NULL)
*pnaflag = TRUE;
return NA_INTEGER;
}
return x + y;
}
Similar for other binary operations. For vectors it is also similar. Within integer_binary there is a dispatch to the same method, while in real_binary the standard operations are used without any checks.
We can see this in action using the following Rcpp code:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
#include <cstdint>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector sumInt(IntegerVector a, IntegerVector b) {
IntegerVector result(no_init(a.size()));
std::transform(a.begin(), a.end(), b.begin(), result.begin(),
[] (int32_t x, int32_t y) {return x + y;});
return result;
}
// [[Rcpp::export]]
IntegerVector sumIntOverflow(IntegerVector a, IntegerVector b) {
IntegerVector result(no_init(a.size()));
std::transform(a.begin(), a.end(), b.begin(), result.begin(),
[] (int32_t x, int32_t y) {
if (x == NA_INTEGER || y == NA_INTEGER)
return NA_INTEGER;
if (((y > 0) && (x > (INT32_MAX - y))) ||
((y < 0) && (x < (INT32_MIN - y))))
return NA_INTEGER;
return x + y;
});
return result;
}
// [[Rcpp::export]]
NumericVector sumReal(NumericVector a, NumericVector b) {
NumericVector result(no_init(a.size()));
std::transform(a.begin(), a.end(), b.begin(), result.begin(),
[] (double x, double y) {return x + y;});
return result;
}
/*** R
set.seed(123)
int1e6 <- sample(1:10, 1e6, TRUE)
int1e6two <- sample(1:10, 1e6, TRUE)
dbl1e6 <- runif(1e6, 1, 10)
dbl1e6two <- runif(1e6, 1, 10)
microbenchmark::microbenchmark(int1e6 + int1e6two,
sumInt(int1e6, int1e6two),
sumIntOverflow(int1e6, int1e6two),
dbl1e6 + dbl1e6two,
sumReal(dbl1e6, dbl1e6two),
times = 1000)
*/
Result:
Unit: microseconds
expr min lq mean median uq max neval
int1e6 + int1e6two 1999.698 2046.2025 2232.785 2061.7625 2126.970 5461.816 1000
sumInt 812.560 846.1215 1128.826 861.9305 892.089 44723.313 1000
sumIntOverflow 1664.351 1690.2455 1901.472 1702.6100 1760.218 4868.182 1000
dbl1e6 + dbl1e6two 1444.172 1501.9100 1997.924 1526.0695 1641.103 47277.955 1000
sumReal 1459.224 1505.2715 1887.869 1530.5995 1675.594 5124.468 1000
Introducing the overflow checking into the C++ code produces a significant reduction in performance. Even though it is not as bad as the standard +. So if you know that your integer numbers are "well behaved", you can gain quite a bit of performance by skipping R's error checking by going straight to C/C++. This reminds me of another question with a similar conclusion. The error checking done by R can be costly.
For the case with identical vectors, I get the following benchmark results:
Unit: microseconds
expr min lq mean median uq max neval
int1e6 + int1e6 1761.285 2000.720 2191.541 2011.5710 2029.528 47397.029 1000
sumInt 648.151 761.787 1002.662 767.9885 780.129 46673.632 1000
sumIntOverflow 1408.109 1647.926 1835.325 1655.6705 1670.495 44958.840 1000
dbl1e6 + dbl1e6 1081.079 1119.923 1443.582 1137.8360 1173.807 44469.509 1000
sumReal 1076.791 1118.538 1456.917 1137.2025 1250.850 5141.558 1000
There is a significant performance increase for doubles (both R and C++). For integers there is also some performance increase, but not as seizable as for doubles.

Multiplying complex matrices in R using C++

Suppose that A is a complex matrix. I am interested in computing the product A%*%Conj(t(A)) in R efficiently. As far as I understand, using C++ would speed up things significantly, so that is what I am trying to do.
I have the following code for real matrices that I can use in R.
library(Rcpp);
library(inline);
library(RcppEigen);
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::Lower;
const Map<MatrixXd> A(as<Map<MatrixXd> >(AA));
const int m(A.rows());
MatrixXd AAt(MatrixXd(m, m).setZero().selfadjointView<Lower>().rankUpdate(A));
return wrap(AAt);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
A<-matrix(rnorm(100^2),100)
all.equal(fcprd(A),tcrossprod(A))
fcprd(A) runs much faster on my laptop than tcrossprod(A). This is what I get for A<-matrix(rnorm(1000^2),1000):
microbenchmark::microbenchmark('tcrossprod(A)'=tcrossprod(A),'A%*%t(A)'=A%*%t(A),fcprd=fcprd(A))
Unit: milliseconds
expr min lq mean median uq max neval
tcrossprod(A) 428.06452 435.9700 468.9323 448.8168 504.2628 618.7681 100
A%*%t(A) 722.24053 736.6197 775.4814 767.7668 809.8356 903.8592 100
fcprd 95.04678 100.0733 111.5021 103.6616 107.2551 197.4479 100
However, this code only works for matrices with double precision entries. How could I modify this code so that it works for complex matrices?
I have a very limited knowledge of programming, but I am trying to learn.
Any help is much appreciated!
The Eigen library supports also complex entries via Eigen::MatrixXcd. So in principle it should work if you replace MatrixXd with MatrixXcd. However, this does not compile probably because there is no as-function for complex matrices using Map (c.f. https://github.com/RcppCore/RcppEigen/blob/master/inst/unitTests/runit.RcppEigen.R#L205). The as-function are needed to convert between R data types and C++/Eigen data types (c.f. http://dirk.eddelbuettel.com/code/rcpp/Rcpp-extending.pdf). If you do not use Map, then you can use this:
crossprodCpp <- '
using Eigen::MatrixXcd;
using Eigen::Lower;
const MatrixXcd A(as<MatrixXcd>(AA));
const int m(A.rows());
MatrixXcd AAt(MatrixXcd(m, m).setZero().selfadjointView<Lower>().rankUpdate(A));
return wrap(AAt);
'
fcprd <- inline::cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
N <- 100
A <- matrix(complex(real = rnorm(N), imaginary = rnorm(N)), N)
all.equal(fcprd(A), A %*% Conj(t(A)))
However, this is slower than the base R version in my tests:
N <- 1000
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
all.equal(fcprd(A), A %*% Conj(t(A)))
#> [1] TRUE
microbenchmark::microbenchmark(base = A %*% Conj(t(A)), eigen = fcprd(A))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 111.6512 124.4490 145.7583 140.9199 160.3420 241.8986 100
#> eigen 453.6702 501.5419 535.0192 537.2925 564.8746 628.4999 100
Note that matrix multiplication in R is done via BLAS. However, the default BLAS implementation used by R is not very fast. One way to improve R's performance is to use an optimized BLAS library, c.f. https://csgillespie.github.io/efficientR/set-up.html#blas-and-alternative-r-interpreters.
Alternatively you can use the BLAS function zherk if you have a full BLAS available. Very rough:
dyn.load("/usr/lib/libblas.so")
zherk <- function(a, uplo = 'u', trans = 'n') {
n <- nrow(a)
k <- ncol(a)
c <- matrix(complex(real = 0, imaginary = 0), nrow = n, ncol = n)
z <- .Fortran("zherk",
uplo = as.character(uplo),
trans = as.character(trans),
n = as.integer(n),
k = as.integer(k),
alpha = as.double(1),
a = as.complex(a),
lda = as.integer(n),
beta = as.double(0),
c = as.complex(c),
ldc = as.integer(n))
matrix(z$c, nrow = n, ncol = n)
}
N <- 2
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
zherk(A, uplo = "l") - A %*% Conj(t(A))
Note that this fills only the upper (or lower) triangular part but is quite fast:
microbenchmark::microbenchmark(base = A %*% Conj(t(A)), blas = zherk(A))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 112.5588 117.12531 146.10026 138.37565 167.6811 282.3564 100
#> blas 66.9541 70.12438 91.44617 82.74522 108.4979 188.3728 100
Here is a way to bind an Eigen::Map<Eigen::MatrixXcd> object in Rcpp. The solution works in a R package setup, but I'm not sure about an easy way to put it together using the inline library.
First, you need to provide the following specialization in your inst/include/mylib.h such that this header get included in the RcppExports.cpp:
#include <complex>
#include <Eigen/Core>
#include <Eigen/Dense>
#include <Rcpp.h>
namespace Rcpp {
namespace traits {
template<>
class Exporter<Eigen::Map<Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> > > {
using OUT = typename Eigen::Map<Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> >;
const static int RTYPE = ::Rcpp::traits::r_sexptype_traits<std::complex<double>>::rtype;
Rcpp::Vector<RTYPE> vec;
int d_ncol, d_nrow;
public:
Exporter(SEXP x)
: vec(x), d_ncol(1)
, d_nrow(Rf_xlength(x)) {
if (TYPEOF(x) != RTYPE)
throw std::invalid_argument("Wrong R type for mapped matrix");
if (::Rf_isMatrix(x)) {
int* dims = INTEGER(::Rf_getAttrib(x, R_DimSymbol));
d_nrow = dims[0];
d_ncol = dims[1];
}
}
OUT get() { return OUT(reinterpret_cast<std::complex<double>*>(vec.begin()), d_nrow, d_ncol); }
};
}}
The only difference with the unspecialized Exporter available in RcppEigenWrap.h being the reinterpret_cast on the last line. Both std::complex and Rcomplex having C99 complex compatible types, they are supposed to have identical memory layouts regardless of the implementation.
Wrapping it up, you can now create your function as:
// [[Rcpp::export]]
Eigen::MatrixXd selfadj_mult(const Eigen::Map<Eigen::MatrixXcd>& mat) {
Eigen::MatrixXd result = (mat * mat.adjoint()).real();
return result;
}
and then invoke the function in R as:
library(mylib)
library(microbenchmark)
N <- 1000
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
microbenchmark::microbenchmark(
base = A %*% Conj(t(A))
, eigen = mylib::selfadj_mult(A)
, times = 100L
)
the code is compiled on centos7/gcc83 with -O3 -DNDEBUG -flto -march=generic. R has been build from source with the exact same compiler/flags (using the default BLAS binding). Results are:
Unit: seconds
expr min lq mean median uq max neval
base 2.9030320 2.9045865 2.9097162 2.9053835 2.9093232 2.9614318 100
eigen 1.1978697 1.2004888 1.2134219 1.2031046 1.2057647 1.3035751 100

Speeding up computation of Dice coefficient in C / Rcpp

I need to compute a similarity measure call the Dice coefficient over large matrices (600,000 x 500) of binary vectors in R. For speed I use C / Rcpp. The function runs great but as I am not a computer scientist by background I would like to know if it could run faster. This code is suitable for parallelisation but I have no experience parallelising C code.
The Dice coefficient is a simple measure of similarity / dissimilarity (depending how you take it). It is intended to compare asymmetric binary vectors, meaning one of the combination (usually 0-0) is not important and agreement (1-1 pairs) have more weight than disagreement (1-0 or 0-1 pairs). Imagine the following contingency table:
1 0
1 a b
0 c d
The Dice coef is: (2*a) / (2*a +b + c)
Here is my Rcpp implementation:
library(Rcpp)
cppFunction('
NumericMatrix dice(NumericMatrix binaryMat){
int nrows = binaryMat.nrow(), ncols = binaryMat.ncol();
NumericMatrix results(ncols, ncols);
for(int i=0; i < ncols-1; i++){ // columns fixed
for(int j=i+1; j < ncols; j++){ // columns moving
double a = 0;
double d = 0;
for (int l = 0; l < nrows; l++) {
if(binaryMat(l, i)>0){
if(binaryMat(l, j)>0){
a++;
}
}else{
if(binaryMat(l, j)<1){
d++;
}
}
}
// compute Dice coefficient
double abc = nrows - d;
double bc = abc - a;
results(j,i) = (2*a) / (2*a + bc);
}
}
return wrap(results);
}
')
And here is a running example:
x <- rbinom(1:200000, 1, 0.5)
X <- matrix(x, nrow = 200, ncol = 1000)
system.time(dice(X))
user system elapsed
0.814 0.000 0.814
The solution proposed by Roland was not entirely satisfying for my use case. So based on the source code from the arules package I implement a much faster version. The code in arules rely on an algorithm from Leisch (2005) using the tcrossproduct() function in R.
First, I wrote a Rcpp / RcppEigen version of crossprod that is 2-3 time faster. This is based on the example code in the RcppEigen vignette.
library(Rcpp)
library(RcppEigen)
library(inline)
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::Lower;
const Map<MatrixXi> A(as<Map<MatrixXi> >(AA));
const int m(A.rows()), n(A.cols());
MatrixXi AtA(MatrixXi(n, n).setZero().selfadjointView<Lower>().rankUpdate(A.adjoint()));
return wrap(AtA);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
Then I wrote a small R function to compute the Dice coefficient.
diceR <- function(X){
a <- fcprd(X)
nx <- ncol(X)
rsx <- colSums(X)
c <- matrix(rsx, nrow = nx, ncol = nx) - a
# b <- matrix(rsx, nrow = nx, ncol = nx, byrow = TRUE) - a
b <- t(c)
m <- (2 * a) / (2*a + b + c)
return(m)
}
This new function is ~8 time faster than the old one and ~3 time faster than the one in arules.
m <- microbenchmark(dice(X), diceR(X), dissimilarity(t(X), method="dice"), times=100)
m
# Unit: milliseconds
# expr min lq median uq max neval
# dice(X) 791.34558 809.8396 812.19480 814.6735 910.1635 100
# diceR(X) 62.98642 76.5510 92.02528 159.2557 507.1662 100
# dissimilarity(t(X), method = "dice") 264.07997 342.0484 352.59870 357.4632 520.0492 100
I cannot run your function at work, but is the result the same as this?
library(arules)
plot(dissimilarity(X,method="dice"))
system.time(dissimilarity(X,method="dice"))
#user system elapsed
#0.04 0.00 0.04