Performance of R stats::sd() vs. arma::stddev() vs. Rcpp implementation

Just for the purpose of working on my C++ / Rcpp programming, I took a shot at implementing a (sample) standard deviation function:
#include <Rcpp.h>
#include <vector>
#include <cmath>
#include <numeric>

// [[Rcpp::export]]
double cppSD(Rcpp::NumericVector rinVec)
{
  std::vector<double> inVec(rinVec.begin(), rinVec.end());
  int n = inVec.size();
  double sum = std::accumulate(inVec.begin(), inVec.end(), 0.0);
  double mean = sum / inVec.size();

  for (std::vector<double>::iterator iter = inVec.begin();
       iter != inVec.end(); ++iter) {
    double temp = (*iter - mean) * (*iter - mean);
    *iter = temp;
  }

  double sd = std::accumulate(inVec.begin(), inVec.end(), 0.0);
  return std::sqrt( sd / (n - 1) );
}
I also decided to test out the stddev function from the Armadillo library, considering that it can be called on a vector:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]
double armaSD(arma::colvec inVec)
{
  return arma::stddev(inVec);
}
Then I benchmarked these two functions against the base R function sd() for a few vectors of varying size:
Rcpp::sourceCpp('G:/CPP/armaSD.cpp')
Rcpp::sourceCpp('G:/CPP/cppSD.cpp')
require(microbenchmark)
##
## sample size = 1,000: armaSD() < cppSD() < sd()
X <- rexp(1000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 4.181 4.562 4.942 5.322 12.924 100
# sd(X) 17.865 19.766 20.526 21.287 86.285 100
# cppSD(X) 4.561 4.941 5.321 5.701 29.269 100
##
## sample size = 10,000: armaSD() < cppSD() < sd()
X <- rexp(10000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 24.707 25.847 26.4175 29.6490 52.455 100
# sd(X) 51.315 54.356 55.8760 61.1980 100.730 100
# cppSD(X) 26.608 28.128 28.8885 31.7395 114.413 100
##
## sample size = 25,000: armaSD() < cppSD() < sd()
X <- rexp(25000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 66.900 67.6600 68.040 76.403 155.845 100
# sd(X) 108.332 111.5625 122.016 125.817 169.910 100
# cppSD(X) 70.320 71.0805 74.692 80.203 102.250 100
##
## sample size = 50,000: cppSD() < sd() < armaSD()
X <- rexp(50000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 249.733 267.4085 297.8175 337.729 642.388 100
# sd(X) 203.740 229.3975 240.2300 260.186 303.709 100
# cppSD(X) 162.308 185.1140 239.6600 256.575 290.405 100
##
## sample size = 75,000: sd() < cppSD() < armaSD()
X <- rexp(75000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 445.110 479.8900 502.5070 520.5625 642.388 100
# sd(X) 310.931 334.8780 354.0735 379.7310 429.146 100
# cppSD(X) 346.661 380.8715 400.6370 424.0140 501.747 100
I was not terribly surprised that my C++ function cppSD() was faster than stats::sd() for smaller samples, but slower for larger vectors, since stats::sd() is vectorized. However, I did not expect to see such performance degradation from the arma::stddev() function, since it also appears to operate in a vectorized manner. Is there a problem with the way that I'm using arma::stddev(), or is it simply that stats::sd() was written (in C, I'm assuming) in such a way that it can handle larger vectors much more efficiently? Any input would be appreciated.
Update:
Although my question was originally about the correct use of arma::stddev, and not so much about trying to find the most efficient way possible to calculate the sample standard deviation, it is very interesting to see that the Rcpp::sd sugar function performs so well. To make things a little more interesting, I benchmarked the arma::stddev and Rcpp::sd functions below against an RcppParallel version that I adapted from two of JJ Allaire's Rcpp Gallery posts (the links are in the code comments below):
library(microbenchmark)
set.seed(123)
x <- rnorm(5.5e06)
##
Res <- microbenchmark(
  armaSD(x),
  par_sd(x),
  sd_sugar(x),
  times = 500L,
  control = list(warmup = 25))
##
R> print(Res)
Unit: milliseconds
expr min lq mean median uq max neval
armaSD(x) 24.486943 24.960966 26.994684 25.255584 25.874139 123.55804 500
par_sd(x) 8.130751 8.322682 9.136323 8.429887 8.624072 22.77712 500
sd_sugar(x) 13.713366 13.984638 14.628911 14.156142 14.401138 32.81684 500
This was on my laptop running 64-bit Linux with an i5-4200U CPU @ 1.60GHz; but I'm guessing the difference between par_sd and sd_sugar would be less substantial on a Windows machine.
And here is the code for the RcppParallel version (which is considerably longer, and requires a C++11 compatible compiler for the lambda expression used in the overloaded operator() of the InnerProduct functor):
#include <Rcpp.h>
#include <RcppParallel.h>
#include <numeric>
#include <cmath>
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]

/*
 * based on: http://gallery.rcpp.org/articles/parallel-vector-sum/
 */
struct Sum : public RcppParallel::Worker {
  const RcppParallel::RVector<double> input;
  double value;

  Sum(const Rcpp::NumericVector input)
    : input(input), value(0) {}

  Sum(const Sum& sum, RcppParallel::Split)
    : input(sum.input), value(0) {}

  void operator()(std::size_t begin, std::size_t end) {
    value += std::accumulate(input.begin() + begin,
                             input.begin() + end,
                             0.0);
  }

  void join(const Sum& rhs) {
    value += rhs.value;
  }
};

/*
 * based on: http://gallery.rcpp.org/articles/parallel-inner-product/
 */
struct InnerProduct : public RcppParallel::Worker {
  const RcppParallel::RVector<double> x;
  const RcppParallel::RVector<double> y;
  double mean;
  double product;

  InnerProduct(const Rcpp::NumericVector x,
               const Rcpp::NumericVector y,
               const double mean)
    : x(x), y(y), mean(mean), product(0) {}

  InnerProduct(const InnerProduct& innerProduct,
               RcppParallel::Split)
    : x(innerProduct.x), y(innerProduct.y),
      mean(innerProduct.mean), product(0) {}

  void operator()(std::size_t begin, std::size_t end) {
    product += std::inner_product(x.begin() + begin,
                                  x.begin() + end,
                                  y.begin() + begin,
                                  0.0, std::plus<double>(),
                                  [&](double lhs, double rhs) -> double {
                                    return (lhs - mean) * (rhs - mean);
                                  });
  }

  void join(const InnerProduct& rhs) {
    product += rhs.product;
  }
};

// [[Rcpp::export]]
double par_sd(const Rcpp::NumericVector& x_)
{
  int N = x_.size();
  Rcpp::NumericVector y_(x_);
  Sum sum(x_);
  RcppParallel::parallelReduce(0, x_.length(), sum);
  double mean = sum.value / N;
  InnerProduct innerProduct(x_, y_, mean);
  RcppParallel::parallelReduce(0, x_.length(), innerProduct);
  return std::sqrt( innerProduct.product / (N - 1) );
}

You made a subtle mistake in how you instantiate the Armadillo object -- which leads to copies and hence degraded performance.
Use an interface of const arma::colvec & inVec instead, and all is good:
R> sourceCpp("/tmp/sd.cpp")
R> library(microbenchmark)
R> X <- rexp(500)
R> microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
Unit: microseconds
expr min lq median uq max neval
armaSD(X) 3.745 4.0280 4.2055 4.5510 19.375 100
armaSD2(X) 3.305 3.4925 3.6400 3.9525 5.154 100
sd(X) 22.463 23.6985 25.1525 26.0055 52.457 100
cppSD(X) 3.640 3.9495 4.2030 4.8620 13.609 100
R> X <- rexp(5000)
R> microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
Unit: microseconds
expr min lq median uq max neval
armaSD(X) 18.627 18.9120 19.3245 20.2150 34.684 100
armaSD2(X) 14.583 14.9020 15.1675 15.5775 22.527 100
sd(X) 54.507 58.8315 59.8615 60.4250 84.857 100
cppSD(X) 18.585 19.0290 19.3970 20.5160 22.174 100
R> X <- rexp(50000)
R> microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
Unit: microseconds
expr min lq median uq max neval
armaSD(X) 186.307 187.180 188.575 191.825 405.775 100
armaSD2(X) 142.447 142.793 143.207 144.233 155.770 100
sd(X) 382.857 384.704 385.223 386.075 405.713 100
cppSD(X) 181.601 181.895 182.279 183.350 194.588 100
R>
This is based on my version of your code, where everything is in one file and armaSD2 is defined as I suggested -- leading to the winning performance:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
#include <vector>
#include <cmath>
#include <numeric>

// [[Rcpp::export]]
double cppSD(Rcpp::NumericVector rinVec) {
  std::vector<double> inVec(rinVec.begin(), rinVec.end());
  int n = inVec.size();
  double sum = std::accumulate(inVec.begin(), inVec.end(), 0.0);
  double mean = sum / inVec.size();
  for (std::vector<double>::iterator iter = inVec.begin();
       iter != inVec.end();
       ++iter) {
    double temp = (*iter - mean) * (*iter - mean);
    *iter = temp;
  }
  double sd = std::accumulate(inVec.begin(), inVec.end(), 0.0);
  return std::sqrt( sd / (n - 1) );
}

// [[Rcpp::export]]
double armaSD(arma::colvec inVec) {
  return arma::stddev(inVec);
}

// [[Rcpp::export]]
double armaSD2(const arma::colvec & inVec) { return arma::stddev(inVec); }

/*** R
library(microbenchmark)
X <- rexp(500)
microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))

X <- rexp(5000)
microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))

X <- rexp(50000)
microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
*/

I think the sd function built into Rcpp sugar is much more efficient. See the code below:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
#include <vector>
#include <cmath>
#include <numeric>
using namespace Rcpp;

// [[Rcpp::export]]
double sd_cpp(NumericVector& xin) {
  std::vector<double> xres(xin.begin(), xin.end());
  int n = xres.size();
  double sum = std::accumulate(xres.begin(), xres.end(), 0.0);
  double mean = sum / n;
  for (std::vector<double>::iterator iter = xres.begin(); iter != xres.end(); ++iter) {
    double tmp = (*iter - mean) * (*iter - mean);
    *iter = tmp;
  }
  double sd = std::accumulate(xres.begin(), xres.end(), 0.0);
  return std::sqrt(sd / (n - 1));
}

// [[Rcpp::export]]
double sd_arma(arma::colvec& xin) {
  return arma::stddev(xin);
}

// [[Rcpp::export]]
double sd_sugar(NumericVector& xin) {
  return sd(xin);
}
> sourceCpp("sd.cpp")
> microbenchmark(sd(X),sd_cpp(X),sd_arma(X),sd_sugar(X))
Unit: microseconds
expr min lq mean median uq max neval
sd(X) 47.655 49.4120 51.88204 50.5395 51.1950 113.643 100
sd_cpp(X) 28.145 28.4410 29.01541 28.6695 29.4570 37.118 100
sd_arma(X) 23.706 23.9615 24.65931 24.1955 24.9520 50.375 100
sd_sugar(X) 19.197 19.478 20.38872 20.0785 21.2015 28.664 100

Two questions about the Rcpp function:
1) How likely is it that you would want the standard deviation without the mean? If it will be uncommon to require the SD without the mean, why not return both, rather than ask the R user to make two function calls which in effect calculate the mean twice?
2) Is there any reason for cloning the vector inside the function?
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::export]]
NumericVector cppSD(NumericVector rinVec)
{
  double n = (double)rinVec.size();
  double sum = 0;
  for (double& v : rinVec)
    sum += v;
  double mean = sum / n;

  double varianceNumerator = 0;
  for (double& v : rinVec)
    varianceNumerator += (v - mean) * (v - mean);

  return NumericVector::create(_["std.dev"] = std::sqrt(varianceNumerator / (n - 1.0)),
                               _["mean"] = mean);
}
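For reference, a quick usage sketch (assuming the function above has been compiled with Rcpp::sourceCpp()): a single call returns a named numeric vector carrying both statistics.
x <- rexp(1000)
res <- cppSD(x)
res["std.dev"]   # should match sd(x)
res["mean"]      # should match mean(x)
all.equal(unname(res), c(sd(x), mean(x)))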

Related

Fastest way to compute Gaussian kernel vector in CppArmadillo?

I am trying to compute a vector of Gaussian kernel evaluations as quickly as possible. I have a data point x in R^p and a matrix X of n vectors x_i. I would like to compute exp( -||x - x_i||^2 / t ) for every x_i and return the result as a vector.
I have tried implementing this in both R and RcppArmadillo via the following code
R CODE:
kernel <- function(x, Data, sigma) {
  if (sigma <= 0) stop('Gaussian kernel parameter <= 0.')
  DiffPart <- (t(t(Data) - x))^2  ## squared differences between the data and point x
  DiffPart <- rowSums(DiffPart)   ## sum of squares per row
  exp(-DiffPart / sigma)          ## divide by the kernel parameter and evaluate the exponential
}
RcppArmadillo:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::Col<double> kernelCPP(arma::Row<double> x, arma::Mat<double> Data, double sigma) {
  arma::Mat<double> Diff = Data.each_row() - x;
  int n = Data.n_rows;
  arma::Col<double> kern(n);
  for (int k = 0; k < n; k++) {
    kern(k) = exp(-arma::accu(square(Diff.row(k))) / sigma);
  }
  return kern;
}
Unfortunately, my RcppArmadillo code is not much faster than the original R code. I'll be computing kernel vectors hundreds of thousands of times in future code, so I would like this to be as fast as I can make it.
When microbenchmarking, I get the following results:
> microbenchmark(
+ kernel(x= TrainX1[1,], Data = TrainX1, sigma = 100)
+ )
Unit: milliseconds
      min       lq   mean   median       uq     max neval
 2.223359 2.274559 2.5199 2.308052 2.575144 4.73301   100
and
> microbenchmark(
+ kernelCPP(x= TrainX1[1,], Data = TrainX1, sigma = 100)
+ )
Unit: milliseconds
      min       lq     mean   median       uq      max neval
 1.697706 1.732053 1.826743 1.775786 1.871786 2.493439   100
A little faster, but not by much.
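One variant that might be worth trying (a sketch only, not benchmarked here; kernelCPP2 is just an illustrative name): let Armadillo perform the row-wise reduction itself instead of looping over rows. It may or may not help much, since the computation is largely memory-bound.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Same computation as kernelCPP, but the per-row sum of squares is written
// as a single Armadillo reduction: sum(square(Diff), 1) collapses each row.
// [[Rcpp::export]]
arma::vec kernelCPP2(const arma::rowvec& x, const arma::mat& Data, double sigma) {
  arma::mat Diff = Data.each_row() - x;
  return arma::exp(-arma::sum(arma::square(Diff), 1) / sigma);
}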

Armadillo - Norm of each small block in a long vector

I am using Armadillo in C++.
I have a long vector with 10 elements. I want to take the 2-norm of each block of 2 adjacent values; in the end I will have 5 values.
In R I can convert that vector into a matrix and use apply, but I am not sure how to do it in Armadillo. I appreciate any help.
You just have to create a matrix from your vector and then loop through the columns.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec foo_Cpp(arma::vec x) {
  // Note that the dimension of x must be divisible by two.
  arma::mat X = arma::mat(x.memptr(), 2, x.n_elem / 2);
  arma::uword n = X.n_cols;
  arma::vec norms = arma::vec(n);
  for (arma::uword i = 0; i < n; i++) {
    norms(i) = arma::norm(X.col(i), 2);
  }
  return norms;
}

/*** R
foo_R <- function(x) {
  X <- matrix(x, 2, length(x)/2)
  apply(X, 2, norm, type = "2")
}

x <- rnorm(1000)
all.equal(foo_R(x), c(foo_Cpp(x)))
microbenchmark::microbenchmark(foo_R(x), foo_Cpp(x))
*/
> all.equal(foo_R(x), c(foo_Cpp(x)))
[1] TRUE
> microbenchmark::microbenchmark(foo_R(x), foo_Cpp(x))
Unit: microseconds
expr min lq mean median uq max neval
foo_R(x) 17907.290 19640.24 21548.06789 20386.5815 21212.609 50780.584 100
foo_Cpp(x) 5.133 6.34 26.48266 19.4705 21.734 1191.124 100

Is it legitimate to use Rcpp to speed up replacing elements of lists and vectors in iterative algorithm?

Context
I have been working on an iterative algorithm lately, where each iteration n depends on iteration n-1. During each iteration most of the computation time is taken by subsetting and/or replacing elements of vectors, lists or data.tables (N > 10^6).
I recently came across Rcpp and, playing a little bit with it, I discovered that replacing element k of vectors or lists can be sped up by two or three orders of magnitude (a few benchmark tests below).
However, when using the Rcpp subsetting code within a for and a while loop, R seems to become unstable and the session aborts at random points with no hint of what went wrong.
Question
My question: is this use of Rcpp legitimate, or can it lead to problems I am not aware of?
Example
Below is the Rcpp code I am using and a few benchmarks. Overall, the algorithm should call the replacing functions ~5.5 billion times and the subsetting functions ~50 billion times.
Note that replacing elements of lists and double vectors is faster using Rcpp, while for integer vectors base R solutions are preferred (benchmark 1); data.table is a good option for replacing elements, but if you have to subset repeatedly to access its elements, the vector approach is much faster (benchmark 2).
Functions:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void assign_list(List x, int k, NumericVector new_element) {
  x[k - 1] = new_element;
}

// [[Rcpp::export]]
void assign_dbl(NumericVector x, int k, double y) {
  x[k - 1] = y;
}

// [[Rcpp::export]]
void assign_int(IntegerVector x, int k, int y) {
  x[k - 1] = y;
}
Benchmarks:
Inputs
set.seed(747474)
int <- 1:10^7
dou <- rnorm(10^7, 1000, 300)
l <- lapply(sample(5:20, 10^7, replace = T), rnorm, mean = 1000, sd = 300)
dt <- data.table(int = int, dou = dou, l = l)
i <- 999999
z <- 2222
k <- 30000
s <- 552877
1)
Unit: nanoseconds
expr min lq mean median uq max neval
int[i] <- -1L 488 2439 36938108.9 4146.0 15651119 799720107 30
dou[i] <- -1 732 3170 19101960.4 6609193.5 16187500 212369197 30
l[i] <- -1 489 3902 159442538.1 186035314.5 227131872 618326686 30
assign_int 19853910 22028692 81055363.5 24665494.0 39352345 872241539 30
assign_dbl 1220 5852 48023.2 8534.5 13167 1158828 30
assign_list 1464 6828 52866.9 10850.5 13411 1243430 30
dt[k, ':=' (int = -1, dou = -1, l = -1)] 206020 340116 481850.0 425326.5 529312 1414341 30
2)
microbenchmark(times = 30L,
"subset vector + list" = {int[s]; dou[s]; l[s]},
"subset datatable" = {dt[s, int]; dt[s, dou]; dt[s, l]})
Unit: nanoseconds
expr min lq mean median uq max neval
subset vector + list 488 488 1715.533 1585.5 2926 4389 30
subset datatable 563688 574417 719304.467 600138.0 875765 1308040 30
This is very dangerous because of the side effect shown here, where x and y are both changed even though you only pass x into the Rcpp function:
> x <- y <- 1:10
> assign_int(x, 1, 2)
> y
[1] 2 2 3 4 5 6 7 8 9 10
It does not seem to be faster; for these functions
f0 <- function(x) {
  for (i in seq_along(x))
    x[i] = -i
}

f1 <- function(x) {
  for (i in seq_along(x))
    assign_int(x, i, -i)
}
I have
> int <- 1:10^5
> microbenchmark(f0(int), f1(int), times=5)
Unit: milliseconds
expr min lq mean median uq max neval
f0(int) 14.78777 14.80264 15.05683 15.03138 15.17678 15.48556 5
f1(int) 659.67346 669.00095 672.93343 670.48917 676.16930 689.33429 5
In your benchmark, int[i] <- 1, the '1' is a numeric (double) value, so you are coercing int to a double vector (check class(int) after the assignment), which triggers a complete copy. Use int[i] <- 1L to force the right-hand side to be an integer.
> int0 <- int1 <- 1:10^7
> microbenchmark(int0[1] <- 1, int1[1] <- 1L)
Unit: microseconds
expr min lq mean median uq max neval
int0[1] <- 1 1.047 1.102 1770.9911 1.143 1.2650 176960.52 100
int1[1] <- 1L 1.105 1.176 339.4264 1.213 1.2655 33815.97 100
> class(int0)
[1] "numeric"
> class(int1)
[1] "integer"
Updating just a single element is an expensive benchmark in base R because it triggers a copy of the vector on each assignment; in f0() the copy occurs just once. On the first iteration, R makes a copy because it knows that the vector of integer values is referenced by at least two symbols (the argument to the function, int, and the symbol used inside the function, x), so it copies the memory and assigns it to x inside the function. It does this to avoid the side effect seen in your Rcpp code (i.e., to avoid modifying int). The next time through the loop, R recognizes that only a single symbol references the vector, so it does the replacement without making a copy.
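A minimal sketch (assuming f0() from above and an R build with memory profiling enabled, which is the case for the CRAN binaries) to watch that single copy happen; the printed addresses will vary:
int <- 1:10^5
tracemem(int)      # mark the vector so duplications are reported
f0(int)            # prints one tracemem[...] line: the single copy made on the
                   # first modification inside f0(); later iterations modify in place
untracemem(int)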
Note that
> int <- 1:10^5
> f1(int)
> head(int)
[1] -1 -2 -3 -4 -5 -6
illustrates the subtle way that the side-effects of your Rcpp code can have unexpected outcomes.
Also, there are several ways of slowing down iterative loops, e.g.,
f2 <- function(x) {
  for (i in seq_along(x)) {
    x[i] = -i
    y <- x
  }
}

f3 <- function(x) {
  result <- integer()
  for (i in seq_along(x))
    result <- c(result, -i)
}
with
> int <- 1:10^3
> microbenchmark(f0(int), f2(int), f3(int), times = 1)
Unit: microseconds
expr min lq mean median uq max neval
f0(int) 150.507 150.507 150.507 150.507 150.507 150.507 1
f2(int) 667.201 667.201 667.201 667.201 667.201 667.201 1
f3(int) 4379.005 4379.005 4379.005 4379.005 4379.005 4379.005 1
f2() causes R to make a copy of x each time through the loop (to avoid the side-effect of modifying y). f3() copies a vector of length 0, 1, 2, 3, ... n - 1 (where n = length(x)) on successive iterations, leading to n * (n - 1) / 2 elements copied, and an algorithm that scales as the square of the length of x.
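For contrast, preallocating the result avoids the quadratic copying altogether; a small sketch (f4 is just an illustrative name, not part of the original post):
f4 <- function(x) {
  result <- integer(length(x))  # allocate the full vector once
  for (i in seq_along(x))
    result[i] <- -i             # in-place replacement, no repeated copying
  result
}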
The general principle applies to other types as well, including lists with
f0l <- function(x) {
  for (i in seq_along(x))
    x[[i]] <- i
  x
}

f1l <- function(x) {
  for (i in seq_along(x))
    assign_list(x, i, i)
}
leading to
> set.seed(123)
> l0 <- lapply(sample(5:20, 10^6, replace = T), rnorm, mean = 1000, sd = 300)
> set.seed(123)
> l1 <- lapply(sample(5:20, 10^6, replace = T), rnorm, mean = 1000, sd = 300)
> microbenchmark(f0l(l0), f1l(l1), times=1)
Unit: milliseconds
expr min lq mean median uq max neval
f0l(l0) 239.9865 239.9865 239.9865 239.9865 239.9865 239.9865 1
f1l(l1) 6767.9172 6767.9172 6767.9172 6767.9172 6767.9172 6767.9172 1

Arithmetic in R faster on numerics as opposed to integers. What's going on?

I was in the middle of converting some code that utilized mostly numeric data (i.e. doubles) to integers and did a quick benchmark to see how much efficiency I gained.
To my surprise it was slower... by about 20%. I thought I had done something wrong, but the original code was only a few basic arithmetical operations on moderately sized vectors, so I knew it wasn't that. Maybe my environment was messed up? I restarted fresh, and the same result... integers were less efficient.
This started a series of tests and a dive into the rabbit hole. Here is my first test: we sum one million elements using base R's sum. Note that with R version 3.5.0 the timings are quite a bit different, while with 3.5.1 the timings are about the same (still not what one would expect):
set.seed(123)
int1e6 <- sample(1:10, 1e6, TRUE)
dbl1e6 <- runif(1e6, 1, 10)
head(int1e6)
# [1] 5 3 6 8 6 2
class(int1e6)
# [1] "integer"
head(dbl1e6)
# [1] 5.060628 2.291397 2.992889 5.299649 5.217105 9.769613
class(dbl1e6)
#[1] "numeric"
mean(dbl1e6)
# [1] 5.502034
mean(int1e6)
# [1] 5.505185
## R 3.5.0
library(microbenchmark)
microbenchmark(intSum = sum(int1e6), dblSum = sum(dbl1e6), times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intSum 1033.677 1043.991 1147.9711 1111.438 1200.725 2723.834 1000
dblSum 817.719 835.486 945.6553 890.529 998.946 2736.024 1000
## R 3.5.1
Unit: microseconds
expr min lq mean median uq max neval
intSum 836.243 877.7655 966.4443 950.1525 997.9025 2077.257 1000
dblSum 866.939 904.7945 1015.3445 986.4770 1046.4120 2541.828 1000
class(sum(int1e6))
# [1] "integer"
class(sum(dbl1e6))
#[1] "numeric"
From here on out both version 3.5.0 and 3.5.1 give nearly identical results.
Here is our first dive into the rabbit hole. Along with the documentation for sum (see ?sum), we see that sum is simply a generic function that is dispatched via standardGeneric. Digging deeper, we see it eventually calls R_execMethod here on line 516. This is where I get lost. It looks to me like R_execClosure is called next, followed by many different possible branches. I think the standard path is to call eval next, but I'm not sure. My guess is that eventually a function is called in arithmetic.c, but I can't find anything that specifically sums a vector of numbers. Either way, based on my limited knowledge of method dispatching and C in general, my naive assumption is that a function that looks like the following is called:
template <typename T>
T sum(vector<T> x) {
  T mySum = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    mySum += x[i];
  return mySum;
}
I know there is no function overloading or vectors in C, but you get my point. My belief is that eventually a bunch of elements of the same type are added to an accumulator of the same type and returned. In Rcpp we would have something like:
template <typename typeReturn, typename typeRcpp>
typeReturn sumRcpp(typeRcpp x) {
  typeReturn mySum = 0;
  unsigned long int mySize = x.size();
  for (std::size_t i = 0; i < mySize; ++i)
    mySum += x[i];
  return mySum;
}

// [[Rcpp::export]]
SEXP mySumTest(SEXP Rx) {
  switch(TYPEOF(Rx)) {
    case INTSXP: {
      IntegerVector xInt = as<IntegerVector>(Rx);
      int resInt = sumRcpp<int>(xInt);
      return wrap(resInt);
    }
    case REALSXP: {
      NumericVector xNum = as<NumericVector>(Rx);
      double resDbl = sumRcpp<double>(xNum);
      return wrap(resDbl);
    }
    default: {
      Rcpp::stop("Only integers and numerics are supported");
    }
  }
}
And the benchmarks confirm my usual thinking about the inherent efficiency advantage of integers:
microbenchmark(mySumTest(int1e6), mySumTest(dbl1e6))
Unit: microseconds
expr min lq mean median uq max neval
mySumTest(int1e6) 103.455 160.776 185.2529 180.2505 200.3245 326.950 100
mySumTest(dbl1e6) 1160.501 1166.032 1278.1622 1233.1575 1347.1660 1644.494 100
Binary Operators
This got me thinking further. Maybe it is just the complexity wrapped around standardGeneric that makes the different data types behave strangely. So let's skip all that jazz and go straight to the binary operators (+, -, *, /, %/%):
set.seed(321)
int1e6Two <- sample(1:10, 1e6, TRUE)
dbl1e6Two <- runif(1e6, 1, 10)
## addition
microbenchmark(intPlus = int1e6 + int1e6Two,
dblPlus = dbl1e6 + dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intPlus 2.531220 3.214673 3.970903 3.401631 3.668878 82.11871 1000
dblPlus 1.299004 2.045720 3.074367 2.139489 2.275697 69.89538 1000
## subtraction
microbenchmark(intSub = int1e6 - int1e6Two,
dblSub = dbl1e6 - dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intSub 2.280881 2.985491 3.748759 3.166262 3.379755 79.03561 1000
dblSub 1.302704 2.107817 3.252457 2.208293 2.382188 70.24451 1000
## multiplication
microbenchmark(intMult = int1e6 * int1e6Two,
dblMult = dbl1e6 * dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intMult 2.913680 3.573557 4.380174 3.772987 4.077219 74.95485 1000
dblMult 1.303688 2.020221 3.078500 2.119648 2.299145 10.86589 1000
## division
microbenchmark(intDiv = int1e6 %/% int1e6Two,
dblDiv = dbl1e6 / dbl1e6Two, times = 1000)
Unit: milliseconds
expr min lq mean median uq max neval
intDiv 2.892297 3.210666 3.720360 3.228242 3.373456 62.12020 1000
dblDiv 1.228171 1.809902 2.558428 1.842272 1.990067 64.82425 1000
The classes are preserved as well:
unique(c(class(int1e6 + int1e6Two), class(int1e6 - int1e6Two),
class(int1e6 * int1e6Two), class(int1e6 %/% int1e6Two)))
# [1] "integer"
unique(c(class(dbl1e6 + dbl1e6Two), class(dbl1e6 - dbl1e6Two),
class(dbl1e6 * dbl1e6Two), class(dbl1e6 / dbl1e6Two)))
# [1] "numeric"
With every case, we see that arithmetic is 40% - 70% faster on numeric data type. What is really strange is that we get an even larger discrepancy when the two vectors being operated on are identical:
microbenchmark(intPlus = int1e6 + int1e6,
dblPlus = dbl1e6 + dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intPlus 2522.774 3148.464 3894.723 3304.189 3531.310 73354.97 1000
dblPlus 977.892 1703.865 2710.602 1767.801 1886.648 77738.47 1000
microbenchmark(intSub = int1e6 - int1e6,
dblSub = dbl1e6 - dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intSub 2236.225 2854.068 3467.062 2994.091 3214.953 11202.06 1000
dblSub 893.819 1658.032 2789.087 1730.981 1873.899 74034.62 1000
microbenchmark(intMult = int1e6 * int1e6,
dblMult = dbl1e6 * dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intMult 2852.285 3476.700 4222.726 3658.599 3926.264 78026.18 1000
dblMult 973.640 1679.887 2638.551 1754.488 1875.058 10866.52 1000
microbenchmark(intDiv = int1e6 %/% int1e6,
dblDiv = dbl1e6 / dbl1e6, times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
intDiv 2879.608 3355.015 4052.564 3531.762 3797.715 11781.39 1000
dblDiv 945.519 1627.203 2706.435 1701.512 1829.869 72215.51 1000
unique(c(class(int1e6 + int1e6), class(int1e6 - int1e6),
class(int1e6 * int1e6), class(int1e6 %/% int1e6)))
# [1] "integer"
unique(c(class(dbl1e6 + dbl1e6), class(dbl1e6 - dbl1e6),
class(dbl1e6 * dbl1e6), class(dbl1e6 / dbl1e6)))
# [1] "numeric"
That is nearly a 100% difference with every operator type!
How about a regular for loop in base R?
funInt <- function(v) {
  mySumInt <- 0L
  for (element in v)
    mySumInt <- mySumInt + element
  mySumInt
}

funDbl <- function(v) {
  mySumDbl <- 0
  for (element in v)
    mySumDbl <- mySumDbl + element
  mySumDbl
}
microbenchmark(funInt(int1e6), funDbl(dbl1e6))
Unit: milliseconds
expr min lq mean median uq max neval
funInt(int1e6) 25.44143 25.75075 26.81548 26.09486 27.60330 32.29436 100
funDbl(dbl1e6) 24.48309 24.82219 25.68922 25.13742 26.49816 29.36190 100
class(funInt(int1e6))
# [1] "integer"
class(funDbl(dbl1e6))
# [1] "numeric"
The difference isn't amazing, but still one would expect the integer sum to outperform the double sum. I really don't know what to think about this.
So my question is:
Why exactly do numeric data types outperform integer data types on basic arithmetical operations in base R?
Edit. Forgot to mention this:
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
F.Privé's "random guess" in the comments is really good! The function
do_arith seems to be the starting point within arithmetic.c. First for scalars we see that the case of REALSXP is simple: e.g., standard + is used. For INTSXP there is a dispatch to, for example, R_integer_plus, which does indeed check for integer overflow:
static R_INLINE int R_integer_plus(int x, int y, Rboolean *pnaflag)
{
    if (x == NA_INTEGER || y == NA_INTEGER)
        return NA_INTEGER;

    if (((y > 0) && (x > (R_INT_MAX - y))) ||
        ((y < 0) && (x < (R_INT_MIN - y)))) {
        if (pnaflag != NULL)
            *pnaflag = TRUE;
        return NA_INTEGER;
    }
    return x + y;
}
It is similar for the other binary operations, and for vectors as well: within integer_binary there is a dispatch to the same method, while in real_binary the standard operations are used without any checks.
We can see this in action using the following Rcpp code:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
#include <cstdint>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerVector sumInt(IntegerVector a, IntegerVector b) {
IntegerVector result(no_init(a.size()));
std::transform(a.begin(), a.end(), b.begin(), result.begin(),
[] (int32_t x, int32_t y) {return x + y;});
return result;
}
// [[Rcpp::export]]
IntegerVector sumIntOverflow(IntegerVector a, IntegerVector b) {
IntegerVector result(no_init(a.size()));
std::transform(a.begin(), a.end(), b.begin(), result.begin(),
[] (int32_t x, int32_t y) {
if (x == NA_INTEGER || y == NA_INTEGER)
return NA_INTEGER;
if (((y > 0) && (x > (INT32_MAX - y))) ||
((y < 0) && (x < (INT32_MIN - y))))
return NA_INTEGER;
return x + y;
});
return result;
}
// [[Rcpp::export]]
NumericVector sumReal(NumericVector a, NumericVector b) {
NumericVector result(no_init(a.size()));
std::transform(a.begin(), a.end(), b.begin(), result.begin(),
[] (double x, double y) {return x + y;});
return result;
}
/*** R
set.seed(123)
int1e6 <- sample(1:10, 1e6, TRUE)
int1e6two <- sample(1:10, 1e6, TRUE)
dbl1e6 <- runif(1e6, 1, 10)
dbl1e6two <- runif(1e6, 1, 10)
microbenchmark::microbenchmark(int1e6 + int1e6two,
sumInt(int1e6, int1e6two),
sumIntOverflow(int1e6, int1e6two),
dbl1e6 + dbl1e6two,
sumReal(dbl1e6, dbl1e6two),
times = 1000)
*/
Result:
Unit: microseconds
expr min lq mean median uq max neval
int1e6 + int1e6two 1999.698 2046.2025 2232.785 2061.7625 2126.970 5461.816 1000
sumInt 812.560 846.1215 1128.826 861.9305 892.089 44723.313 1000
sumIntOverflow 1664.351 1690.2455 1901.472 1702.6100 1760.218 4868.182 1000
dbl1e6 + dbl1e6two 1444.172 1501.9100 1997.924 1526.0695 1641.103 47277.955 1000
sumReal 1459.224 1505.2715 1887.869 1530.5995 1675.594 5124.468 1000
Introducing the overflow checking into the C++ code produces a significant reduction in performance, even though it is not as bad as the standard +. So if you know that your integer numbers are "well behaved", you can gain quite a bit of performance by skipping R's error checking and going straight to C/C++. This reminds me of another question with a similar conclusion: the error checking done by R can be costly.
For the case with identical vectors, I get the following benchmark results:
Unit: microseconds
expr min lq mean median uq max neval
int1e6 + int1e6 1761.285 2000.720 2191.541 2011.5710 2029.528 47397.029 1000
sumInt 648.151 761.787 1002.662 767.9885 780.129 46673.632 1000
sumIntOverflow 1408.109 1647.926 1835.325 1655.6705 1670.495 44958.840 1000
dbl1e6 + dbl1e6 1081.079 1119.923 1443.582 1137.8360 1173.807 44469.509 1000
sumReal 1076.791 1118.538 1456.917 1137.2025 1250.850 5141.558 1000
There is a significant performance increase for doubles (both R and C++). For integers there is also some performance increase, but not as sizeable as for doubles.

Multiplying complex matrices in R using C++

Suppose that A is a complex matrix. I am interested in computing the product A%*%Conj(t(A)) in R efficiently. As far as I understand, using C++ would speed things up significantly, so that is what I am trying to do.
I have the following code for real matrices that I can use in R.
library(Rcpp)
library(inline)
library(RcppEigen)

crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::Lower;
const Map<MatrixXd> A(as<Map<MatrixXd> >(AA));
const int m(A.rows());
MatrixXd AAt(MatrixXd(m, m).setZero().selfadjointView<Lower>().rankUpdate(A));
return wrap(AAt);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")

A <- matrix(rnorm(100^2), 100)
all.equal(fcprd(A), tcrossprod(A))
fcprd(A) runs much faster on my laptop than tcrossprod(A). This is what I get for A<-matrix(rnorm(1000^2),1000):
microbenchmark::microbenchmark('tcrossprod(A)'=tcrossprod(A),'A%*%t(A)'=A%*%t(A),fcprd=fcprd(A))
Unit: milliseconds
expr min lq mean median uq max neval
tcrossprod(A) 428.06452 435.9700 468.9323 448.8168 504.2628 618.7681 100
A%*%t(A) 722.24053 736.6197 775.4814 767.7668 809.8356 903.8592 100
fcprd 95.04678 100.0733 111.5021 103.6616 107.2551 197.4479 100
However, this code only works for matrices with double precision entries. How could I modify this code so that it works for complex matrices?
I have a very limited knowledge of programming, but I am trying to learn.
Any help is much appreciated!
The Eigen library also supports complex entries via Eigen::MatrixXcd. So in principle it should work if you replace MatrixXd with MatrixXcd. However, this does not compile, probably because there is no as-function for complex matrices using Map (c.f. https://github.com/RcppCore/RcppEigen/blob/master/inst/unitTests/runit.RcppEigen.R#L205). The as-functions are needed to convert between R data types and C++/Eigen data types (c.f. http://dirk.eddelbuettel.com/code/rcpp/Rcpp-extending.pdf). If you do not use Map, then you can use this:
crossprodCpp <- '
using Eigen::MatrixXcd;
using Eigen::Lower;
const MatrixXcd A(as<MatrixXcd>(AA));
const int m(A.rows());
MatrixXcd AAt(MatrixXcd(m, m).setZero().selfadjointView<Lower>().rankUpdate(A));
return wrap(AAt);
'
fcprd <- inline::cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
N <- 100
A <- matrix(complex(real = rnorm(N), imaginary = rnorm(N)), N)
all.equal(fcprd(A), A %*% Conj(t(A)))
However, this is slower than the base R version in my tests:
N <- 1000
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
all.equal(fcprd(A), A %*% Conj(t(A)))
#> [1] TRUE
microbenchmark::microbenchmark(base = A %*% Conj(t(A)), eigen = fcprd(A))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 111.6512 124.4490 145.7583 140.9199 160.3420 241.8986 100
#> eigen 453.6702 501.5419 535.0192 537.2925 564.8746 628.4999 100
Note that matrix multiplication in R is done via BLAS. However, the default BLAS implementation used by R is not very fast. One way to improve R's performance is to use an optimized BLAS library, c.f. https://csgillespie.github.io/efficientR/set-up.html#blas-and-alternative-r-interpreters.
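If you are unsure which BLAS your R installation is linked against, a quick check (assuming R >= 3.4, which reports this in sessionInfo()):
## sessionInfo() lists the BLAS and LAPACK libraries under "Matrix products:";
## with the reference BLAS you will see R's own libRblas / libRlapack there.
sessionInfo()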
Alternatively you can use the BLAS function zherk if you have a full BLAS available. Very rough:
dyn.load("/usr/lib/libblas.so")

zherk <- function(a, uplo = 'u', trans = 'n') {
  n <- nrow(a)
  k <- ncol(a)
  c <- matrix(complex(real = 0, imaginary = 0), nrow = n, ncol = n)
  z <- .Fortran("zherk",
                uplo = as.character(uplo),
                trans = as.character(trans),
                n = as.integer(n),
                k = as.integer(k),
                alpha = as.double(1),
                a = as.complex(a),
                lda = as.integer(n),
                beta = as.double(0),
                c = as.complex(c),
                ldc = as.integer(n))
  matrix(z$c, nrow = n, ncol = n)
}

N <- 2
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
zherk(A, uplo = "l") - A %*% Conj(t(A))
N <- 2
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
zherk(A, uplo = "l") - A %*% Conj(t(A))
Note that this fills only the upper (or lower) triangular part but is quite fast:
microbenchmark::microbenchmark(base = A %*% Conj(t(A)), blas = zherk(A))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 112.5588 117.12531 146.10026 138.37565 167.6811 282.3564 100
#> blas 66.9541 70.12438 91.44617 82.74522 108.4979 188.3728 100
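If the full matrix is needed, the other triangle can be filled in afterwards; a small sketch (not benchmarked, and the extra copy adds some overhead):
C <- zherk(A, uplo = "l")                    # lower triangle holds A %*% Conj(t(A))
C[upper.tri(C)] <- Conj(t(C))[upper.tri(C)]  # mirror the conjugate into the upper triangle
all.equal(C, A %*% Conj(t(A)))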
Here is a way to bind an Eigen::Map<Eigen::MatrixXcd> object in Rcpp. The solution works in an R package setup, but I'm not sure about an easy way to put it together using the inline library.
First, you need to provide the following specialization in your inst/include/mylib.h so that this header gets included in RcppExports.cpp:
#include <complex>
#include <Eigen/Core>
#include <Eigen/Dense>
#include <Rcpp.h>

namespace Rcpp {
namespace traits {

template<>
class Exporter<Eigen::Map<Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> > > {
  using OUT = typename Eigen::Map<Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> >;
  const static int RTYPE = ::Rcpp::traits::r_sexptype_traits<std::complex<double>>::rtype;
  Rcpp::Vector<RTYPE> vec;
  int d_ncol, d_nrow;

public:
  Exporter(SEXP x)
    : vec(x), d_ncol(1)
    , d_nrow(Rf_xlength(x)) {
    if (TYPEOF(x) != RTYPE)
      throw std::invalid_argument("Wrong R type for mapped matrix");
    if (::Rf_isMatrix(x)) {
      int* dims = INTEGER(::Rf_getAttrib(x, R_DimSymbol));
      d_nrow = dims[0];
      d_ncol = dims[1];
    }
  }

  OUT get() { return OUT(reinterpret_cast<std::complex<double>*>(vec.begin()), d_nrow, d_ncol); }
};

}}
The only difference from the unspecialized Exporter available in RcppEigenWrap.h is the reinterpret_cast on the last line. Since both std::complex and Rcomplex are compatible with the C99 complex type, they are supposed to have identical memory layouts regardless of the implementation.
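As a cheap safeguard for that layout assumption, a compile-time check can be added next to the specialization (a sketch; R_ext/Complex.h is where R defines Rcomplex as a struct of two doubles):
#include <complex>
#include <R_ext/Complex.h>

// If this ever fails to compile, the reinterpret_cast above would not be safe.
static_assert(sizeof(std::complex<double>) == sizeof(Rcomplex),
              "std::complex<double> and Rcomplex must have the same size");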
Wrapping it up, you can now create your function as:
// [[Rcpp::export]]
Eigen::MatrixXd selfadj_mult(const Eigen::Map<Eigen::MatrixXcd>& mat) {
  Eigen::MatrixXd result = (mat * mat.adjoint()).real();
  return result;
}
and then invoke the function in R as:
library(mylib)
library(microbenchmark)

N <- 1000
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)

microbenchmark::microbenchmark(
  base  = A %*% Conj(t(A)),
  eigen = mylib::selfadj_mult(A),
  times = 100L
)
The code was compiled on centos7/gcc83 with -O3 -DNDEBUG -flto -march=generic. R has been built from source with the exact same compiler/flags (using the default BLAS binding). The results are:
Unit: seconds
expr min lq mean median uq max neval
base 2.9030320 2.9045865 2.9097162 2.9053835 2.9093232 2.9614318 100
eigen 1.1978697 1.2004888 1.2134219 1.2031046 1.2057647 1.3035751 100