Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?) - c++

I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calculations.
Consider the following example:
testmat <- matrix(1:9, nrow=3)
testvec <- 1:3
testmat*testvec
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 4 10 16
#[3,] 9 18 27
Here, R recycled testvec so that, loosely speaking, testvec "became" a matrix of the same dimensions as testmat for the purpose of this multiplication. Then the Hadamard product is returned. I wish to implement this behavior using Rcpp, that is, I want each element of the i-th row of the matrix testmat to be multiplied by the i-th element of the vector testvec. My benchmarks tell me that my implementations are extremely slow, and I would appreciate advice on how to speed this up. Here is my code:
First, using Eigen:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using namespace Rcpp;
using namespace Eigen;
// [[Rcpp::export]]
NumericMatrix E_matvecprod_elwise(NumericMatrix Xs, NumericVector ys){
Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
Map<VectorXd> y(as<Map<VectorXd> >(ys));
int k = X.cols();
int n = X.rows();
MatrixXd Y(n,k) ;
// here, I emulate R's recycling. I did not find an easier way of doing this. Any hint appreciated.
for(int i = 0; i < k; ++i) {
Y.col(i) = y;
}
MatrixXd out = X.cwiseProduct(Y);
return wrap(out);
}
Here is my implementation using Armadillo (adjusted to follow Dirk's example, see answer below):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
int k = X.n_cols ;
arma::mat Y = repmat(y, 1, k) ; //
arma::mat out = X % Y;
return out;
}
Benchmarking these solutions using R, Eigen or Armadillo shows that both Eigen and Armadillo are about 2.5 times slower than R. Is there a way to speed these computations up, or to get at least as fast as R? Are there more elegant ways of setting this up? Any advice is appreciated and welcome. (I also encourage tangential remarks about programming style in general, as I am new to Rcpp / C++.)
Here are some reproducible benchmarks:
# for comparison, define R function:
R_matvecprod_elwise <- function(mat, vec) mat*vec
n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)
benchmark(R_matvecprod_elwise(X, e), A_matvecprod_elwise(X, e), E_matvecprod_elwise(X,e),
columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
This yields
test replications elapsed relative
1 R_matvecprod_elwise(X, e) 1000 10.89 1.000
2 A_matvecprod_elwise(X, e) 1000 26.87 2.467
3 E_matvecprod_elwise(X, e) 1000 27.73 2.546
As you can see, my Rcpp-solutions perform quite miserably. Any way to do it better?

If you want to speed up your calculations, you will have to be careful not to make copies. This usually means sacrificing readability. Here is a version which makes no copies and modifies the matrix X in place (so the X passed in from R is changed as well).
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix Rcpp_matvecprod_elwise(NumericMatrix & X, NumericVector & y){
unsigned int ncol = X.ncol();
unsigned int nrow = X.nrow();
int counter = 0;
for (unsigned int j=0; j<ncol; j++) {
for (unsigned int i=0; i<nrow; i++) {
X[counter++] *= y[i];
}
}
return X;
}
Here is what I get on my machine
> library(microbenchmark)
> microbenchmark(R=R_matvecprod_elwise(X, e), Arma=A_matvecprod_elwise(X, e), Rcpp=Rcpp_matvecprod_elwise(X, e))
Unit: milliseconds
expr min lq median uq max neval
R 8.262845 9.386214 10.542599 11.53498 12.77650 100
Arma 18.852685 19.872929 22.782958 26.35522 83.93213 100
Rcpp 6.391219 6.640780 6.940111 7.32773 7.72021 100
> all.equal(R_matvecprod_elwise(X, e), Rcpp_matvecprod_elwise(X, e))
[1] TRUE
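The counter-based loop above works because R stores a matrix column by column, so recycling a vector whose length equals the number of rows amounts to indexing the vector modulo its length. A minimal sketch of that idea in plain C++ (no Rcpp), where recycle_multiply is a hypothetical helper name used only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply a column-major matrix (stored as one flat vector) by a shorter
// vector in place, recycling the vector the way R does for `mat * vec`.
// Hypothetical helper for illustration; not part of Rcpp.
void recycle_multiply(std::vector<double>& x, const std::vector<double>& y) {
    assert(!y.empty());
    for (std::size_t k = 0; k < x.size(); ++k) {
        // Flat element k is paired with y[k mod length(y)]. When length(y)
        // equals the number of rows, k mod length(y) is the row index.
        x[k] *= y[k % y.size()];
    }
}
```

With the 3 x 3 example from the question (flat storage 1, 2, ..., 9 and vector 1, 2, 3) this gives the columns (1, 4, 9), (4, 10, 18) and (7, 16, 27), the same values R prints for testmat*testvec.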

For starters, I'd write the Armadillo version (interface) as
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
int k = X.n_cols ;
arma::mat Y = repmat(y, 1, k) ; //
arma::mat out = X % Y;
return out;
}
as you're doing an additional conversion in and out (though the wrap() gets added by the glue code). The const & is notional (as you learned via your last question, a SEXP is a pointer object that is lightweight to copy) but better style.
You didn't show your benchmark results so I can't comment on the effect of matrix size etc pp. I suspect you might get better answers on rcpp-devel than here. Your pick.
Edit: If you really want something cheap and fast, I would just do this:
// [[Rcpp::export]]
mat cheapHadamard(mat X, vec y) {
// should check the row dimension of X against the length of y here
for (unsigned int i=0; i<y.n_elem; i++) X.row(i) *= y(i);
return X;
}
which allocates no new memory and will hence be faster, and probably be competitive with R.
Test output:
R> cheapHadamard(testmat, testvec)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 4 10 16
[3,] 9 18 27
R>

My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.
Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.
// it may be necessary on your system to uncomment one of the following
//#define restrict __restrict__ // gcc/clang
//#define restrict __restrict // MS Visual Studio
//#define restrict // remove it completely
#include <Rcpp.h>
using namespace Rcpp;
#include <cstdlib>
using std::size_t;
void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
size_t numRows, size_t numCols,
const double* restrict y)
{
if (numRows == 0 || numCols == 0) return;
for (size_t col = 0; col < numCols; ++col) {
double* restrict x_col = x + col * numRows;
for (size_t row = 0; row < numRows; ++row) {
x_col[row] *= y[row];
}
}
}
// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
const NumericVector& y)
{
// do some dimension checking here
hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
y.begin());
return X;
}
Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively and not incur a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.
#include <Rcpp.h>
using namespace Rcpp;
#include <cstdlib>
using std::size_t;
#include <R.h>
#include <Rdefines.h>
void hadamardMultiplyMatrixByVector(const double* restrict x,
size_t numRows, size_t numCols,
const double* restrict y,
double* restrict z)
{
if (numRows == 0 || numCols == 0) return;
for (size_t col = 0; col < numCols; ++col) {
const double* restrict x_col = x + col * numRows;
double* restrict z_col = z + col * numRows;
for (size_t row = 0; row < numRows; ++row) {
z_col[row] = x_col[row] * y[row];
}
}
}
// [[Rcpp::export]]
SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
{
size_t numRows = X.nrow();
size_t numCols = X.ncol();
// do some dimension checking here
SEXP Z = PROTECT(Rf_allocVector(REALSXP, (int) (numRows * numCols)));
SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
int* dims = INTEGER(dimsExpr);
dims[0] = (int) numRows;
dims[1] = (int) numCols;
Rf_setAttrib(Z, R_DimSymbol, dimsExpr);
hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));
UNPROTECT(2);
return Z;
}
If you're curious about the use of restrict: it means that you, as the programmer, enter a contract with the compiler that the different bits of memory do not overlap, which allows the compiler to make certain optimizations. The restrict keyword is part of C99 but not of any C++ standard; most C++ compilers offer it as an extension (for example __restrict__ in gcc/clang or __restrict in MS Visual Studio), hence the #define lines at the top of the first listing.
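As a small illustration of that no-overlap contract, here is a sketch in plain C++. Standard C++ has no restrict keyword, so the sketch maps a macro onto the usual compiler extensions; the macro name RESTRICT and the function multiply_into are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>

// Map a portable macro onto the compiler-specific spelling (assumption:
// gcc, clang or MSVC; on other compilers the qualifier is simply dropped).
#if defined(_MSC_VER)
  #define RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
  #define RESTRICT __restrict__
#else
  #define RESTRICT
#endif

// z[i] = x[i] * y[i]. The qualifiers promise the compiler that z does not
// overlap x or y, so it may keep values in registers or vectorize freely.
void multiply_into(const double* RESTRICT x, const double* RESTRICT y,
                   double* RESTRICT z, std::size_t n) {
    // Cheap sanity check only; it does not detect partially overlapping ranges.
    assert(z != x && z != y);
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = x[i] * y[i];
    }
}
```

Breaking the promise (for example passing the same buffer as x and z) is undefined behavior, which is why the in-place and copying variants above are written as separate functions.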
Some R code to benchmark:
require(rbenchmark)
n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)
R_matvecprod_elwise <- function(mat, vec) mat*vec
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
X_dup <- X + 0
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))
benchmark(R_matvecprod_elwise(X, e),
C_matvecprod_elwise(X, e),
C_matvecprod_elwise_inplace(X, e),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 1000)
And the results:
test replications elapsed relative
3 C_matvecprod_elwise_inplace(X, e) 1000 3.317 1.000
2 C_matvecprod_elwise(X, e) 1000 7.174 2.163
1 R_matvecprod_elwise(X, e) 1000 10.670 3.217
Finally, the in-place version may actually be faster than this benchmark suggests: every replication multiplies into the same matrix again, so over 1000 replications the values can overflow or underflow, which may distort the timing.
Edit:
Removed the loop unrolling, as it provided no benefit and was otherwise distracting.

Related

efficient distance calculations in armadillo

I'm new to armadillo. I have the below code, which I assume is inefficient. Any suggestions to make it more memory efficient and/or speedy? Following the armadillo docs and Rcpp gallery, I was unable to get .colptr's, uvec's, or batch insertion to work. But I assume any of them would be improvements.
With an input of X (~100 x 30000), even my stupidly large work VM crashes.
Linux release 7.3.1611 (Core)
117GB RAM / 0GB SWAP
(24 x 2.494 GHz) processor(s)
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
code
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
sp_mat arma_distmat_LT(const arma::mat& x) { // input expected X_{n x p} n << p
int nr, nc;
Col<double> col0, col1;
nr = x.n_rows;
nc = x.n_cols;
sp_mat out(nc, nc);
for (int i = 0; i < nc; i++) {
col0 = x.col(i);
for (int j = i + 1; j < nc; j++) {
col1 = x.col(j);
out(j, i) = as_scalar(col0.t() * col1);
}
}
return out;
}
Call: sourceCpp("<file>"); dist_x <- arma_distmat_LT(X)
Note: these are distances because I am calculating cosine similarities where I have set L2 norm == 1.
It looks to me as if you're just computing the (strictly lower triangular part of the) matrix product t(X) %*% X. You can actually do that directly in R with the underused crossprod function.
X <- matrix(rnorm(100*30000), ncol=30000)
res <- crossprod(X, X)
This takes a few minutes on my laptop. If you change your code to use the Armadillo library then you can use
sp_mat arma_distmat_LT2(const arma::mat& x) { // input expected X_{n x p} n << p
// strictly lower triangle of t(x) %*% x; the second argument -1 excludes the main diagonal
// (C++ has no named arguments, so the offset is passed positionally)
sp_mat out(trimatl(x.t() * x, -1));
return out;
}
Still takes a few minutes. It uses an awful amount of memory though so I doubt you can have a lot of objects in memory at the same time.
The code could probably be optimized to only compute the lower/upper triangular matrix.
Just to show the speedup for a 100*800 matrix:
> microbenchmark(crossprod(X, X), arma_distmat_LT(X), arma_distmat_LT2(X))
Unit: milliseconds
expr min lq mean median uq max neval cld
crossprod(X, X) 50.25574 53.72049 57.98812 56.29532 58.71277 160.81227 100 a
arma_distmat_LT(X) 1331.83243 1471.42465 1523.74060 1492.84611 1512.45416 3080.37891 100 b
arma_distmat_LT2(X) 29.69420 33.23954 36.24613 35.54700 38.05208 66.07351 100 a
As you can see there is a substantial speedup to be gained by brute-forcing it. That being said I'm sure that the cross product can be optimised further.
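The remark that only one triangle needs to be computed can be sketched as follows in plain C++ (no Armadillo). The function name crossprod_lower and the flat column-major std::vector layout are assumptions made for this illustration. Note that the sketch still allocates a dense p x p result, which for p = 30000 doubles is about 7.2 GB, so it does not by itself solve the memory problem from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strictly lower triangle of t(X) %*% X for a column-major n x p matrix X.
// Returns a column-major p x p matrix; entries on and above the diagonal stay 0.
// Hypothetical helper for illustration only.
std::vector<double> crossprod_lower(const std::vector<double>& x,
                                    std::size_t n, std::size_t p) {
    assert(x.size() == n * p);
    std::vector<double> out(p * p, 0.0);
    for (std::size_t i = 0; i < p; ++i) {
        const double* ci = x.data() + i * n;      // column i of X
        for (std::size_t j = i + 1; j < p; ++j) {
            const double* cj = x.data() + j * n;  // column j of X
            double dot = 0.0;
            for (std::size_t r = 0; r < n; ++r) {
                dot += ci[r] * cj[r];
            }
            out[j + i * p] = dot;                 // row j, column i (j > i)
        }
    }
    return out;
}
```

Unlike the original loop, this reads the columns through pointers instead of copying them into temporaries, and it performs roughly half the dot products of the full product.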

Extract elements from a matrix based on the row and column indices with Armadillo

In R, I could extract matrix elements based on their indices as follows:
> m <- matrix(1:6, nrow = 3)
> m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> row_index <- c(1, 2)
> col_index <- c(2, 2)
> m[cbind(row_index, col_index)]
[1] 4 5
Is there a native way to do this in Armadillo / RcppArmadillo? The best I could do is a custom function that uses the row and column indices to calculate the element index (see below). I'm mostly worried that a custom function won't perform as well.
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
NumericVector Rsubmatrix(arma::uvec rowInd, arma::uvec colInd, arma::mat m) {
arma::uvec ind = (colInd - 1) * m.n_rows + (rowInd - 1);
arma::vec ret = m.elem(ind);
return wrap(ret);
}
/*** R
Rsubmatrix(row_index, col_index, m)
*/
From the docs:
X.submat( vector_of_row_indices, vector_of_column_indices )
but that seems to only return matrix blocks. For non-simply-connected regions, I think your solution is the best, but you don't really need a function,
m.elem((colInd - 1) * m.n_rows + (rowInd - 1));
returns the vector without any problem. For clarity you could define a function to deal with the row+col to indices conversion,
inline arma::uvec arr2ind(arma::uvec c, arma::uvec r, int nrow)
{
return c * nrow + r;
}
// m.elem(arr2ind(colInd - 1, rowInd - 1, m.n_rows));
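The conversion from (row, column) pairs to linear element indices relies only on column-major storage. A minimal sketch in plain C++ (no Armadillo), using 1-based indices as R does; the name extract_pairs is a hypothetical helper for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return m[row[k], col[k]] for each k, where m is a column-major matrix with
// n_rows rows stored as one flat vector and the indices are 1-based as in R.
// Hypothetical helper for illustration only.
std::vector<double> extract_pairs(const std::vector<double>& m,
                                  std::size_t n_rows,
                                  const std::vector<std::size_t>& row,
                                  const std::vector<std::size_t>& col) {
    assert(row.size() == col.size());
    std::vector<double> out;
    out.reserve(row.size());
    for (std::size_t k = 0; k < row.size(); ++k) {
        assert(row[k] >= 1 && row[k] <= n_rows && col[k] >= 1);
        // Column-major: skip (col - 1) full columns, then (row - 1) elements.
        const std::size_t idx = (col[k] - 1) * n_rows + (row[k] - 1);
        assert(idx < m.size());
        out.push_back(m[idx]);
    }
    return out;
}
```

For the 3 x 2 matrix from the question with row indices (1, 2) and column indices (2, 2), the linear indices are 3 and 4 (0-based), which select the values 4 and 5.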
Let's try this...
In particular, you can subset by rowInd and colInd by writing your own loop that uses the (i,j) element-access operator. Otherwise, the only other option is the solution that you proposed in the question...
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// Optimized OP method
// [[Rcpp::export]]
arma::vec Rsubmatrix(const arma::mat& m, const arma::uvec& rowInd, const arma::uvec& colInd) {
return m.elem((colInd - 1) * m.n_rows + (rowInd - 1));
}
// Proposed Alternative
// [[Rcpp::export]]
arma::rowvec get_elements(const arma::mat& m, const arma::uvec& rowInd, const arma::uvec& colInd){
unsigned int n = rowInd.n_elem;
arma::rowvec out(n);
for(unsigned int i = 0; i < n; i++){
out(i) = m(rowInd[i]-1,colInd[i]-1);
}
return out;
}
Where:
m <- matrix(1:6, nrow = 3)
row_index <- c(1, 2)
col_index <- c(2, 2)
m[cbind(row_index, col_index)]
Gives:
[1] 4 5
And we have:
get_elements(m, row_index, col_index)
Giving:
[,1] [,2]
[1,] 4 5
Edit
Microbenchmark:
microbenchmark(Rsubmatrix(m, row_index, col_index), get_elements(m, row_index, col_index), times = 1e4)
Gives:
Unit: microseconds
expr min lq mean median uq max neval
Rsubmatrix(m, row_index, col_index) 2.836 3.111 4.129051 3.281 3.502 5016.652 10000
get_elements(m, row_index, col_index) 2.699 2.947 3.436844 3.115 3.335 716.742 10000
The two methods are close in timing. Note that the latter should be better, as it avoids having two separate loops (one to calculate the indices, one to subset) and the additional temporary vector created to store the intermediate indices.
Edit
Per armadillo 7.200.0 release, the sub2ind() function has received the ability to take matrix notation. This function takes a matrix subscript via a 2 x n matrix, where n denotes the number of elements to subset, and converts them into element notation.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::rowvec matrix_locs(arma::mat M, arma::umat locs) {
arma::uvec eids = sub2ind( size(M), locs ); // Obtain Element IDs
arma::vec v = M.elem( eids ); // Values of the Elements
return v.t(); // Transpose to mimic R
}
Calling in R:
cpp_locs <- locs - 1 # Shift indices from R to C++
(cpp_locs <- t(cpp_locs)) # Transpose matrix for 2 x n form
matrix_locs(M, cpp_locs) # Subset the matrix

Matrix multiplication in Rcpp

First of all, I am a novice user, so forgive my general ignorance. I am looking for a faster alternative to the %*% operator in R. Even though older posts suggest the use of RcppArmadillo, I have tried for 2 hours to make RcppArmadillo work without success. I always run into lexical issues that yield 'unexpected ...' errors. I have found the following function in Rcpp which I can make work:
library(Rcpp)
func <- '
NumericMatrix mmult( NumericMatrix m , NumericMatrix v, bool byrow=true )
{
if( m.nrow() != v.nrow() ) stop("Non-conformable arrays") ; // "! a == b" would negate a first
if( m.ncol() != v.ncol() ) stop("Non-conformable arrays") ;
NumericMatrix out = clone(m) ; // deep copy, so that m itself is not overwritten
for (int i = 0; i < m.nrow(); i++)
{
for (int j = 0; j < m.ncol(); j++)
{
out(i,j)=m(i,j) * v(i,j) ;
}
}
return out ;
}
'
This function, however, performs element-wise multiplication and does not behave as %*%. Is there an easy way to modify the above code to achieve the intended result?
EDIT:
I have come up with a function using RcppEigen that seems to beat %*%:
etest <- cxxfunction(signature(tm="NumericMatrix",
tm2="NumericMatrix"),
plugin="RcppEigen",
body="
NumericMatrix tm22(tm2);
NumericMatrix tmm(tm);
const Eigen::Map<Eigen::MatrixXd> ttm(as<Eigen::Map<Eigen::MatrixXd> >(tmm));
const Eigen::Map<Eigen::MatrixXd> ttm2(as<Eigen::Map<Eigen::MatrixXd> >(tm22));
Eigen::MatrixXd prod = ttm*ttm2;
return(wrap(prod));
")
set.seed(123)
M1 <- matrix(sample(1e3),ncol=50)
M2 <- matrix(sample(1e3),nrow=50)
identical(etest(M1,M2), M1 %*% M2)
[1] TRUE
res <- microbenchmark(
+ etest(M1,M2),
+ M1 %*% M2,
+ times=10000L)
res
Unit: microseconds
expr min lq mean median uq max neval
etest(M1, M2) 5.709 6.61 7.414607 6.611 7.211 49.879 10000
M1 %*% M2 11.718 12.32 13.505272 12.621 13.221 58.592 10000
There are good reasons to rely on existing libraries / packages for standard tasks. The routines in the libraries are
optimized
thoroughly tested
a good means to keep the code compact, human-readable, and easy to maintain.
Therefore I think that using RcppArmadillo or RcppEigen should be preferable here. However, to answer your question, below is a possible Rcpp code to perform a matrix multiplication:
library(Rcpp)
cppFunction('NumericMatrix mmult(const NumericMatrix& m1, const NumericMatrix& m2){
if (m1.ncol() != m2.nrow()) stop ("Incompatible matrix dimensions");
NumericMatrix out(m1.nrow(),m2.ncol());
NumericVector rm1, cm2;
for (size_t i = 0; i < m1.nrow(); ++i) {
rm1 = m1(i,_);
for (size_t j = 0; j < m2.ncol(); ++j) {
cm2 = m2(_,j);
out(i,j) = std::inner_product(rm1.begin(), rm1.end(), cm2.begin(), 0.);
}
}
return out;
}')
Let's test it:
A <- matrix(c(1:6),ncol=2)
B <- matrix(c(0:7),nrow=2)
mmult(A,B)
# [,1] [,2] [,3] [,4]
#[1,] 4 14 24 34
#[2,] 5 19 33 47
#[3,] 6 24 42 60
identical(mmult(A,B), A %*% B)
#[1] TRUE
Hope this helps.
As benchmark tests show, the above Rcpp code is slower than R's built-in %*% operator. I assume that, while my Rcpp code can certainly be improved, it will be hard to beat the optimized code behind %*% in terms of performance:
library(microbenchmark)
set.seed(123)
M1 <- matrix(rnorm(1e4),ncol=100)
M2 <- matrix(rnorm(1e4),nrow=100)
identical(M1 %*% M2, mmult(M1,M2))
#[1] TRUE
res <- microbenchmark(
mmult(M1,M2),
M1 %*% M2,
times=1000L)
#> res
#Unit: microseconds
# expr min lq mean median uq max neval cld
# mmult(M1, M2) 1466.855 1484.8535 1584.9509 1494.0655 1517.5105 2699.643 1000 b
# M1 %*% M2 602.053 617.9685 687.6863 621.4335 633.7675 2774.954 1000 a
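One likely contributor to the gap is that the loop above copies a row and a column for every output element. As a sketch of an alternative (not a claim that it matches an optimized BLAS), the plain C++ function below multiplies two column-major matrices with a loop order that walks memory contiguously and makes no temporaries. The name matmul_colmajor and the flat std::vector storage are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A %*% B for column-major matrices stored as flat vectors.
// A is n x k, B is k x m, and the result C is n x m.
// Hypothetical helper for illustration only.
std::vector<double> matmul_colmajor(const std::vector<double>& a,
                                    std::size_t n, std::size_t k,
                                    const std::vector<double>& b,
                                    std::size_t m) {
    assert(a.size() == n * k && b.size() == k * m);
    std::vector<double> c(n * m, 0.0);
    for (std::size_t j = 0; j < m; ++j) {          // column of B and C
        double* c_col = c.data() + j * n;
        for (std::size_t l = 0; l < k; ++l) {      // shared dimension
            const double b_lj = b[l + j * k];
            const double* a_col = a.data() + l * n;
            // Innermost loop runs down one column of A and one column of C,
            // so both are accessed contiguously.
            for (std::size_t i = 0; i < n; ++i) {
                c_col[i] += a_col[i] * b_lj;
            }
        }
    }
    return c;
}
```

Applied to A = matrix(1:6, ncol=2) and B = matrix(0:7, nrow=2) from the test above, it produces the columns (4, 5, 6), (14, 19, 24), (24, 33, 42) and (34, 47, 60).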
I would encourage you to try to work out your issues with RcppArmadillo. Using it is as simple as this example, which is also created by calling RcppArmadillo.package.skeleton():
// another simple example: outer product of a vector,
// returning a matrix
//
// [[Rcpp::export]]
arma::mat rcpparma_outerproduct(const arma::colvec & x) {
arma::mat m = x * x.t();
return m;
}
// and the inner product returns a scalar
//
// [[Rcpp::export]]
double rcpparma_innerproduct(const arma::colvec & x) {
double v = arma::as_scalar(x.t() * x);
return v;
}
There is actually more code in the example but this should give you an idea.
The following approach can also be used :
NumericMatrix mmult(NumericMatrix m, NumericMatrix v)
{
Environment base("package:base");
Function mat_Mult = base["%*%"];
return(mat_Mult(m, v));
}
With this approach, we use the operator %*% of R.

Trying to write a setdiff() function using RcppArmadillo gives compilation error

I'm trying to write a sort of analogue of R's setdiff() function in C++ using RcppArmadillo. My rather crude approach:
// [[Rcpp::export]]
arma::uvec my_setdiff(arma::uvec x, arma::uvec y){
// Coefficients of unsigned integer vector y form a subset of the coefficients of unsigned integer vector x.
// Returns set difference between the coefficients of x and those of y
int n2 = y.n_elem;
uword q1;
for (int j=0 ; j<n2 ; j++){
q1 = find(x==y[j]);
x.shed_row(q1);
}
return x;
}
fails at compilation time. The error reads:
fnsauxarma.cpp:622:29: error: no matching function for call to ‘arma::Col<double>::shed_row(const arma::mtOp<unsigned int, arma::mtOp<unsigned int, arma::Col<double>, arma::op_rel_eq>, arma::op_find>)’
I really have no idea what's going on, any help or comments would be greatly appreciated.
The problem is that arma::find returns a uvec, and the compiler doesn't know how to make the implicit conversion to arma::uword, as pointed out by @mtall. You can help the compiler out by using the templated arma::conv_to<T>::from() function. Also, I included another version of my_setdiff that returns an Rcpp::NumericVector, because although the first version returns the correct values, the result is technically a matrix (i.e. it has dimensions), and I assume you would want this to be as compatible with R's setdiff as possible. This is accomplished by setting the dim attribute of the return vector to NULL, using R_NilValue and the attr member function.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec my_setdiff(arma::uvec& x, const arma::uvec& y){
for (size_t j = 0; j < y.n_elem; j++) {
arma::uword q1 = arma::conv_to<arma::uword>::from(arma::find(x == y[j]));
x.shed_row(q1);
}
return x;
}
// [[Rcpp::export]]
Rcpp::NumericVector my_setdiff2(arma::uvec& x, const arma::uvec& y){
for (size_t j = 0; j < y.n_elem; j++) {
arma::uword q1 = arma::conv_to<arma::uword>::from(arma::find(x == y[j]));
x.shed_row(q1);
}
Rcpp::NumericVector x2 = Rcpp::wrap(x);
x2.attr("dim") = R_NilValue;
return x2;
}
/*** R
x <- 1:8
y <- 2:6
R> all.equal(setdiff(x,y), my_setdiff(x,y))
#[1] "Attributes: < target is NULL, current is list >" "target is numeric, current is matrix"
R> all.equal(setdiff(x,y), my_setdiff2(x,y))
#[1] TRUE
R> setdiff(x,y)
#[1] 1 7 8
R> my_setdiff(x,y)
# [,1]
# [1,] 1
# [2,] 7
# [3,] 8
R> my_setdiff2(x,y)
#[1] 1 7 8
*/
Edit:
For the sake of completeness, here is a more robust version of setdiff than the two implementations presented above:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::export]]
Rcpp::NumericVector arma_setdiff(arma::uvec& x, arma::uvec& y){
x = arma::unique(x);
y = arma::unique(y);
for (size_t j = 0; j < y.n_elem; j++) {
arma::uvec q1 = arma::find(x == y[j]);
if (!q1.empty()) {
x.shed_row(q1(0));
}
}
Rcpp::NumericVector x2 = Rcpp::wrap(x);
x2.attr("dim") = R_NilValue;
return x2;
}
/*** R
x <- 1:10
y <- 2:8
R> all.equal(setdiff(x,y), arma_setdiff(x,y))
#[1] TRUE
X <- 1:6
Y <- c(2,2,3)
R> all.equal(setdiff(X,Y), arma_setdiff(X,Y))
#[1] TRUE
*/
The previous versions would throw an error if you passed them vectors with non-unique elements, e.g.
R> my_setdiff2(X,Y)
error: conv_to(): given object doesn't have exactly one element
To solve the problem and more closely mirror R's setdiff, we just make x and y unique. Additionally, I switched out the arma::conv_to<>::from with q1(0) (where q1 is now a uvec instead of a uword), because uvec's are just a vector of uwords, and the explicit cast seemed a little inelegant.
I've used std::set_difference from the STL instead, converting back and forth from arma::uvec.
#include <RcppArmadillo.h>
#include <algorithm>
#include <iterator> // std::inserter
#include <vector>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec std_setdiff(arma::uvec& x, arma::uvec& y) {
std::vector<int> a = arma::conv_to< std::vector<int> >::from(arma::sort(x));
std::vector<int> b = arma::conv_to< std::vector<int> >::from(arma::sort(y));
std::vector<int> out;
std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
std::inserter(out, out.end()));
return arma::conv_to<arma::uvec>::from(out);
}
Edit: I thought a performance comparison might be in order. The difference becomes smaller when the relative sizes of the sets are in the opposite order.
a <- sample.int(350)
b <- sample.int(150)
microbenchmark::microbenchmark(std_setdiff(a, b), arma_setdiff(a, b))
> Unit: microseconds
> expr min lq mean median uq max neval cld
> std_setdiff(a, b) 11.548 14.7545 17.29930 17.107 19.245 36.779 100 a
> arma_setdiff(a, b) 60.727 65.0040 71.77804 66.714 72.702 138.133 100 b
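R's setdiff() drops duplicates and keeps the elements in their order of first appearance in x, whereas the sorted-range approach above returns sorted output. If that ordering matters, a hash-based variant is one option. The following is a sketch in plain C++ (no Armadillo); the name setdiff_ordered is hypothetical.

```cpp
#include <unordered_set>
#include <vector>

// Elements of x that are not in y, without duplicates, in order of first
// appearance in x. Hypothetical helper for illustration only.
std::vector<int> setdiff_ordered(const std::vector<int>& x,
                                 const std::vector<int>& y) {
    // Start with everything that must be excluded.
    std::unordered_set<int> seen(y.begin(), y.end());
    std::vector<int> out;
    for (int v : x) {
        // insert() reports false if v is in y or was already emitted,
        // so one set handles both exclusion and de-duplication.
        if (seen.insert(v).second) {
            out.push_back(v);
        }
    }
    return out;
}
```

The expected cost is linear in the combined length of x and y, since no sorting is involved.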
The Questioner might have already got the answer. However, the following template version may be more general. This is equivalent to setdiff function in Matlab
If P and Q are two sets, then their difference is given by P - Q or Q - P. If P = {1, 2, 3, 4} and Q = {4, 5, 6}, P - Q means elements of P which are not in Q. i.e., in the above example P - Q = {1, 2, 3}.
/* setdiff(t1, t2) is similar to the setdiff() function in MATLAB. It removes the common elements and
gives the uncommon elements in the vectors t1 and t2.
Caveat: common elements are marked by overwriting them with 0, so genuine zeros in the input
are dropped as well. Assumes "using namespace arma;" and a column vector type T. */
template <typename T>
T setdiff(T t1, T t2)
{
T Intersection_Elements;
uvec iA, iB;
intersect(Intersection_Elements, iA, iB, t1, t2);
// size() returns a size object, not an integer, so the loops use n_elem instead
for (uword i = 0; i < iA.n_elem; i++)
{
t1(iA(i)) = 0;
}
for (uword i = 0; i < iB.n_elem; i++)
{
t2(iB(i)) = 0;
}
T t1_t2_vec = join_vert(t1, t2);
T DiffVec = nonzeros(t1_t2_vec);
return DiffVec;
}
Any suggestions for improving the performance of the algorithm are welcome.

How to speed up this Rcpp function?

I wish to implement a simple split-apply-combine routine in Rcpp where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R, but it often takes quite some time. I have managed to implement an Rcpp solution that beats the performance of R, but I wonder if I can further improve upon it. To illustrate, here is some code, first using R:
n <- 50000
k <- 50
set.seed(42)
X <- matrix(rnorm(n*k), nrow=n)
g=rep(1:8,length.out=n )
use.for <- function(mat, ind){
sums <- matrix(NA, nrow=length(unique(ind)), ncol=ncol(mat))
for(i in seq_along(unique(ind))){
sums[i,] <- colSums(mat[ind==i,])
}
return(sums)
}
use.apply <- function(mat, ind){
apply(mat,2, function(x) tapply(x, ind, sum))
}
use.dt <- function(mat, ind){ # based on Roland's answer
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
It turns out that the for loop is actually quite fast and is the easiest (for me) to implement with Rcpp. It works by creating a submatrix for each group and then calling colSums on that submatrix. This is implemented using RcppArmadillo:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat use_arma(arma::mat X, arma::colvec G){
arma::colvec gr = arma::unique(G);
int gr_n = gr.n_rows;
int ncol = X.n_cols;
arma::mat out = zeros(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
arma::uvec subvec = find(G==g_id);
arma::mat submat = X.rows(subvec);
arma::rowvec res = sum(submat,0);
out.row(g) = res;
}
return out;
}
However, based on answers to this question, I learned that creating copies is expensive in C++ (just as in R), but that loops are not as bad as they are in R. Since the arma-solution relies on creating matrixes (submat in the code) for each group, my guess is that avoiding this will speed up the process even further. Hence, here a second implementation based on Rcpp only using a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
out(g,j) = total;
}
}
return out;
}
Benchmarking these solutions, including the use.dt version provided by @Roland (my previous version discriminated unfairly against data.table), as well as the dplyr solution suggested by @beginneR, yields the following:
library(rbenchmark)
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dplyr(X,g), use_arma(X,g), use_Rcpp(X,g),
+ columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
test replications elapsed relative
# 5 use_arma(X, g) 1000 29.65 1.000
# 4 use.dplyr(X, g) 1000 42.05 1.418
# 3 use.dt(X, g) 1000 56.94 1.920
# 1 use.for(X, g) 1000 60.97 2.056
# 6 use_Rcpp(X, g) 1000 113.96 3.844
# 2 use.apply(X, g) 1000 301.14 10.156
My intuition (use_Rcpp better than use_arma) did not turn out to be right. Having said that, I guess that the line if (G(i) != g_id) continue; in my use_Rcpp function slows everything down. I am happy to learn about alternative ways to set this up.
I am happy that I have achieved the same task in half the time it takes R to do it, but maybe the several "Rcpp is much faster than R" examples have skewed my expectations, and I am wondering if I can speed this up even more. Does anyone have an idea? I also welcome any programming / coding comments in general, since I am relatively new to Rcpp and C++.
No, it's not the for loop that you need to beat:
library(data.table)
#it doesn't seem fair to include calls to library in benchmarks
#you need to do that only once in your session after all
use.dt2 <- function(mat, ind){
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
all.equal(use.dt(X,g), use.dt2(X,g))
#TRUE
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dt2(X,g),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 50)
# test replications elapsed relative
#4 use.dt2(X, g) 50 3.12 1.000
#1 use.for(X, g) 50 4.67 1.497
#3 use.dt(X, g) 50 7.53 2.413
#2 use.apply(X, g) 50 17.46 5.596
Maybe you're looking for (the oddly named) rowsum
library(microbenchmark)
use.rowsum = rowsum
and
> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
expr min lq median uq max neval
use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579 5
use.rowsum(X, g) 10.56727 10.93942 11.01106 11.38697 11.38918 5
Here are my critiques, as in-line comments on your Rcpp solution.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
// Rcpp has a sort_unique() function, which combines the
// sort and unique steps into one, and is often faster than
// performing the operations separately. Try `sort_unique(G)`
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
// This constructor zero-initializes memory (kind of like
// making a copy). You should use:
//
// NumericMatrix out = no_init(gr_n, ncol)
//
// to ensure the memory is allocated, but not zeroed.
//
// EDIT: We don't have no_init for matrices right now, but you can hack
// around that with:
//
// NumericMatrix out(Rf_allocMatrix(REALSXP, gr_n, ncol));
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
// subsetting with operator[] is cheaper, so use gr[g] when
// you can be sure bounds checks are not necessary
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
// similarly here
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
// IIUC, you are filling the matrix row-wise. This is slower as
// R matrices are stored in column-major format, and so filling
// matrices column-wise will be faster.
out(g,j) = total;
}
}
return out;
}
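The critique above keeps the loop over groups, so every row is still tested once per group. A possible alternative, sketched here in plain C++ rather than Rcpp, is a single pass in which each element is added directly to the output row of its group. The sketch assumes the group labels are the integers 1..n_groups (as with g = rep(1:8, length.out=n) in the question); group_col_sums and the flat column-major std::vector layout are hypothetical choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Group-wise column sums of a column-major nrow x ncol matrix.
// g[i] is the group of row i and must lie in 1..n_groups (assumption).
// Returns a column-major n_groups x ncol matrix.
// Hypothetical helper for illustration only.
std::vector<double> group_col_sums(const std::vector<double>& x,
                                   std::size_t nrow, std::size_t ncol,
                                   const std::vector<int>& g,
                                   std::size_t n_groups) {
    assert(x.size() == nrow * ncol && g.size() == nrow);
    std::vector<double> out(n_groups * ncol, 0.0);
    for (std::size_t j = 0; j < ncol; ++j) {
        const double* x_col = x.data() + j * nrow;
        double* out_col = out.data() + j * n_groups;
        for (std::size_t i = 0; i < nrow; ++i) {
            assert(g[i] >= 1 && static_cast<std::size_t>(g[i]) <= n_groups);
            // No comparison against a group id: the label itself is the index.
            out_col[g[i] - 1] += x_col[i];
        }
    }
    return out;
}
```

This touches each matrix element once instead of once per group, and it reads the input and writes the output column by column. Arbitrary group labels would first need to be mapped to 1..n_groups, which the sketch leaves out.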