Armadillo - Norm of each small block in a long vector - c++

I am using Armadillo in C++.
I have a long vector with 10 elements. I want to take norm 2 of each block of 2 adjacent values. In the end I will have 5 values.
In R I can convert that vector into a matrix and use apply but I am not sure how to do it in Armadillo. Appreciate any help

You just have to create a matrix from your vector and then loop through the columns.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::vec foo_Cpp(arma::vec x) {
// Note that the dimension of x must be divisible by two.
arma::mat X = arma::mat(x.memptr(), 2, x.n_elem/2);
arma::uword n = X.n_cols;
arma::vec norms = arma::vec(n);
for (arma::uword i = 0; i < n; i++) {
norms(i) = arma::norm(X.col(i), 2);
}
return norms;
}
/*** R
foo_R <- function(x) {
X <- matrix(x, 2, length(x)/2)
apply(X, 2, norm, type = "2")
}
x <- rnorm(1000)
all.equal(foo_R(x), c(foo_Cpp(x)))
microbenchmark::microbenchmark(foo_R(x), foo_Cpp(x))
*/
> all.equal(foo_R(x), c(foo_Cpp(x)))
[1] TRUE
> microbenchmark::microbenchmark(foo_R(x), foo_Cpp(x))
Unit: microseconds
expr min lq mean median uq max neval
foo_R(x) 17907.290 19640.24 21548.06789 20386.5815 21212.609 50780.584 100
foo_Cpp(x) 5.133 6.34 26.48266 19.4705 21.734 1191.124 100

Related

Multiplying complex matrices in R using C++

Suppose that A is a complex matrix. I am interested in computing the product A%*%Conj(t(A)) in R efficiently. As far as I understand, using C++ would speed up things significantly, so that is what I am trying to do.
I have the following code for real matrices that I can use in R.
library(Rcpp);
library(inline);
library(RcppEigen);
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXd;
using Eigen::Lower;
const Map<MatrixXd> A(as<Map<MatrixXd> >(AA));
const int m(A.rows());
MatrixXd AAt(MatrixXd(m, m).setZero().selfadjointView<Lower>().rankUpdate(A));
return wrap(AAt);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
A<-matrix(rnorm(100^2),100)
all.equal(fcprd(A),tcrossprod(A))
fcprd(A) runs much faster on my laptop than tcrossprod(A). This is what I get for A<-matrix(rnorm(1000^2),1000):
microbenchmark::microbenchmark('tcrossprod(A)'=tcrossprod(A),'A%*%t(A)'=A%*%t(A),fcprd=fcprd(A))
Unit: milliseconds
expr min lq mean median uq max neval
tcrossprod(A) 428.06452 435.9700 468.9323 448.8168 504.2628 618.7681 100
A%*%t(A) 722.24053 736.6197 775.4814 767.7668 809.8356 903.8592 100
fcprd 95.04678 100.0733 111.5021 103.6616 107.2551 197.4479 100
However, this code only works for matrices with double precision entries. How could I modify this code so that it works for complex matrices?
I have a very limited knowledge of programming, but I am trying to learn.
Any help is much appreciated!
The Eigen library supports also complex entries via Eigen::MatrixXcd. So in principle it should work if you replace MatrixXd with MatrixXcd. However, this does not compile probably because there is no as-function for complex matrices using Map (c.f. https://github.com/RcppCore/RcppEigen/blob/master/inst/unitTests/runit.RcppEigen.R#L205). The as-function are needed to convert between R data types and C++/Eigen data types (c.f. http://dirk.eddelbuettel.com/code/rcpp/Rcpp-extending.pdf). If you do not use Map, then you can use this:
crossprodCpp <- '
using Eigen::MatrixXcd;
using Eigen::Lower;
const MatrixXcd A(as<MatrixXcd>(AA));
const int m(A.rows());
MatrixXcd AAt(MatrixXcd(m, m).setZero().selfadjointView<Lower>().rankUpdate(A));
return wrap(AAt);
'
fcprd <- inline::cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
N <- 100
A <- matrix(complex(real = rnorm(N), imaginary = rnorm(N)), N)
all.equal(fcprd(A), A %*% Conj(t(A)))
However, this is slower than the base R version in my tests:
N <- 1000
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
all.equal(fcprd(A), A %*% Conj(t(A)))
#> [1] TRUE
microbenchmark::microbenchmark(base = A %*% Conj(t(A)), eigen = fcprd(A))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 111.6512 124.4490 145.7583 140.9199 160.3420 241.8986 100
#> eigen 453.6702 501.5419 535.0192 537.2925 564.8746 628.4999 100
Note that matrix multiplication in R is done via BLAS. However, the default BLAS implementation used by R is not very fast. One way to improve R's performance is to use an optimized BLAS library, c.f. https://csgillespie.github.io/efficientR/set-up.html#blas-and-alternative-r-interpreters.
Alternatively you can use the BLAS function zherk if you have a full BLAS available. Very rough:
dyn.load("/usr/lib/libblas.so")
zherk <- function(a, uplo = 'u', trans = 'n') {
n <- nrow(a)
k <- ncol(a)
c <- matrix(complex(real = 0, imaginary = 0), nrow = n, ncol = n)
z <- .Fortran("zherk",
uplo = as.character(uplo),
trans = as.character(trans),
n = as.integer(n),
k = as.integer(k),
alpha = as.double(1),
a = as.complex(a),
lda = as.integer(n),
beta = as.double(0),
c = as.complex(c),
ldc = as.integer(n))
matrix(z$c, nrow = n, ncol = n)
}
N <- 2
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
zherk(A, uplo = "l") - A %*% Conj(t(A))
Note that this fills only the upper (or lower) triangular part but is quite fast:
microbenchmark::microbenchmark(base = A %*% Conj(t(A)), blas = zherk(A))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> base 112.5588 117.12531 146.10026 138.37565 167.6811 282.3564 100
#> blas 66.9541 70.12438 91.44617 82.74522 108.4979 188.3728 100
Here is a way to bind an Eigen::Map<Eigen::MatrixXcd> object in Rcpp. The solution works in a R package setup, but I'm not sure about an easy way to put it together using the inline library.
First, you need to provide the following specialization in your inst/include/mylib.h such that this header get included in the RcppExports.cpp:
#include <complex>
#include <Eigen/Core>
#include <Eigen/Dense>
#include <Rcpp.h>
namespace Rcpp {
namespace traits {
template<>
class Exporter<Eigen::Map<Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> > > {
using OUT = typename Eigen::Map<Eigen::Matrix<std::complex<double>, Eigen::Dynamic, Eigen::Dynamic> >;
const static int RTYPE = ::Rcpp::traits::r_sexptype_traits<std::complex<double>>::rtype;
Rcpp::Vector<RTYPE> vec;
int d_ncol, d_nrow;
public:
Exporter(SEXP x)
: vec(x), d_ncol(1)
, d_nrow(Rf_xlength(x)) {
if (TYPEOF(x) != RTYPE)
throw std::invalid_argument("Wrong R type for mapped matrix");
if (::Rf_isMatrix(x)) {
int* dims = INTEGER(::Rf_getAttrib(x, R_DimSymbol));
d_nrow = dims[0];
d_ncol = dims[1];
}
}
OUT get() { return OUT(reinterpret_cast<std::complex<double>*>(vec.begin()), d_nrow, d_ncol); }
};
}}
The only difference with the unspecialized Exporter available in RcppEigenWrap.h being the reinterpret_cast on the last line. Both std::complex and Rcomplex having C99 complex compatible types, they are supposed to have identical memory layouts regardless of the implementation.
Wrapping it up, you can now create your function as:
// [[Rcpp::export]]
Eigen::MatrixXd selfadj_mult(const Eigen::Map<Eigen::MatrixXcd>& mat) {
Eigen::MatrixXd result = (mat * mat.adjoint()).real();
return result;
}
and then invoke the function in R as:
library(mylib)
library(microbenchmark)
N <- 1000
A <- matrix(complex(real = rnorm(N * N), imaginary = rnorm(N * N)), N)
microbenchmark::microbenchmark(
base = A %*% Conj(t(A))
, eigen = mylib::selfadj_mult(A)
, times = 100L
)
the code is compiled on centos7/gcc83 with -O3 -DNDEBUG -flto -march=generic. R has been build from source with the exact same compiler/flags (using the default BLAS binding). Results are:
Unit: seconds
expr min lq mean median uq max neval
base 2.9030320 2.9045865 2.9097162 2.9053835 2.9093232 2.9614318 100
eigen 1.1978697 1.2004888 1.2134219 1.2031046 1.2057647 1.3035751 100

Extract elements from a matrix based on the row and column indices with Armadillo

In R, I could extract matrix elements based on their indices as follow
> m <- matrix(1:6, nrow = 3)
> m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> row_index <- c(1, 2)
> col_index <- c(2, 2)
> m[cbind(row_index, col_index)]
[1] 4 5
Is there a native way to do this is Armadillo / Rcpp::Armadillo? The best I could do is a custom function that uses the row and column indices to calculate the element index (see below). I'm mostly worried that custom function won't perform as well.
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
NumericVector Rsubmatrix(arma::uvec rowInd, arma::uvec colInd, arma::mat m) {
arma::uvec ind = (colInd - 1) * m.n_rows + (rowInd - 1);
arma::vec ret = m.elem(ind);
return wrap(ret);
}
/*** R
Rsubmatrix(row_index, col_index, m)
/
From the docs:
X.submat( vector_of_row_indices, vector_of_column_indices )
but that seems to only return matrix blocks. For non-simply-connected regions, I think your solution is the best, but you don't really need a function,
m.elem((colInd - 1) * m.n_rows + (rowInd - 1));
returns the vector without any problem. For clarity you could define a function to deal with the row+col to indices conversion,
inline arma::uvec arr2ind(arma::uvec c, arma::uvec r, int nrow)
{
return c * nrow + r;
}
// m.elem(arr2ind(colInd - 1, rowInd - 1, m.n_rows));
Let's try this...
In particular, you can subset by rowInd and colInd through writing your own loop to use the .(i,j) subset operator. Otherwise, the only other option is the solution that you proposed to start the question off...
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// Optimized OP method
// [[Rcpp::export]]
arma::vec Rsubmatrix(const arma::mat& m, const arma::uvec& rowInd, const arma::uvec& colInd) {
return m.elem((colInd - 1) * m.n_rows + (rowInd - 1));
}
// Proposed Alternative
// [[Rcpp::export]]
arma::rowvec get_elements(const arma::mat& m, const arma::uvec& rowInd, const arma::uvec& colInd){
unsigned int n = rowInd.n_elem;
arma::rowvec out(n);
for(unsigned int i = 0; i < n; i++){
out(i) = m(rowInd[i]-1,colInd[i]-1);
}
return out;
}
Where:
m <- matrix(1:6, nrow = 3)
row_index <- c(1, 2)
col_index <- c(2, 2)
m[cbind(row_index, col_index)]
Gives:
[1] 4 5
And we have:
get_elements(m, row_index, col_index)
Giving:
[,1] [,2]
[1,] 4 5
Edit
Microbenchmark:
microbenchmark(Rsubmatrix(m, row_index, col_index), get_elements(m, row_index, col_index), times = 1e4)
Gives:
Unit: microseconds
expr min lq mean median uq max neval
Rsubmatrix(m, row_index, col_index) 2.836 3.111 4.129051 3.281 3.502 5016.652 10000
get_elements(m, row_index, col_index) 2.699 2.947 3.436844 3.115 3.335 716.742 10000
The methods are both close time wise. Note that the later should be better as it avoids having two separate loops (1. to calculate & 2. to subset) and an additional temporary vector created to store the results.
Edit
Per armadillo 7.200.0 release, the sub2ind() function has received the ability to take matrix notation. This function takes a matrix subscript via a 2 x n matrix, where n denotes the number of elements to subset, and converts them into element notation.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::rowvec matrix_locs(arma::mat M, arma::umat locs) {
arma::uvec eids = sub2ind( size(M), locs ); // Obtain Element IDs
arma::vec v = M.elem( eids ); // Values of the Elements
return v.t(); // Transpose to mimic R
}
Calling in R:
cpp_locs <- locs - 1 # Shift indices from R to C++
(cpp_locs <- t(cpp_locs)) # Transpose matrix for 2 x n form
matrix_locs(M, cpp_locs) # Subset the matrix

Trying to write a setdiff() function using RcppArmadillo gives compilation error

I'm trying to write a sort of analogue of R's setdiff() function in C++ using RcppArmadillo. My rather crude approach:
// [[Rcpp::export]]
arma::uvec my_setdiff(arma::uvec x, arma::uvec y){
// Coefficientes of unsigned integer vector y form a subset of the coefficients of unsigned integer vector x.
// Returns set difference between the coefficients of x and those of y
int n2 = y.n_elem;
uword q1;
for (int j=0 ; j<n2 ; j++){
q1 = find(x==y[j]);
x.shed_row(q1);
}
return x;
}
fails at compilation time. The error reads:
fnsauxarma.cpp:622:29: error: no matching function for call to ‘arma::Col<double>::shed_row(const arma::mtOp<unsigned int, arma::mtOp<unsigned int, arma::Col<double>, arma::op_rel_eq>, arma::op_find>)’
I really have no idea what's going on, any help or comments would be greatly appreciated.
The problem is that arma::find returns a uvec, and doesn't know how to make the implicit conversion to arma::uword, as pointed out by #mtall. You can help the compiler out by using the templated arma::conv_to<T>::from() function. Also, I included another version of my_setdiff that returns an Rcpp::NumericVector because although the first version returns the correct values, it's technically a matrix (i.e. it has dimensions), and I assume you would want this to be as compatible with R's setdiff as possible. This is accomplished by setting the dim attribute of the return vector to NULL, using R_NilValue and the Rcpp::attr member function.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec my_setdiff(arma::uvec& x, const arma::uvec& y){
for (size_t j = 0; j < y.n_elem; j++) {
arma::uword q1 = arma::conv_to<arma::uword>::from(arma::find(x == y[j]));
x.shed_row(q1);
}
return x;
}
// [[Rcpp::export]]
Rcpp::NumericVector my_setdiff2(arma::uvec& x, const arma::uvec& y){
for (size_t j = 0; j < y.n_elem; j++) {
arma::uword q1 = arma::conv_to<arma::uword>::from(arma::find(x == y[j]));
x.shed_row(q1);
}
Rcpp::NumericVector x2 = Rcpp::wrap(x);
x2.attr("dim") = R_NilValue;
return x2;
}
/*** R
x <- 1:8
y <- 2:6
R> all.equal(setdiff(x,y), my_setdiff(x,y))
#[1] "Attributes: < target is NULL, current is list >" "target is numeric, current is matrix"
R> all.equal(setdiff(x,y), my_setdiff2(x,y))
#[1] TRUE
R> setdiff(x,y)
#[1] 1 7 8
R> my_setdiff(x,y)
# [,1]
# [1,] 1
# [2,] 7
# [3,] 8
R> my_setdiff2(x,y)
#[1] 1 7 8
*/
Edit:
For the sake of completeness, here is a more robust version of setdiff than the two implementations presented above:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::export]]
Rcpp::NumericVector arma_setdiff(arma::uvec& x, arma::uvec& y){
x = arma::unique(x);
y = arma::unique(y);
for (size_t j = 0; j < y.n_elem; j++) {
arma::uvec q1 = arma::find(x == y[j]);
if (!q1.empty()) {
x.shed_row(q1(0));
}
}
Rcpp::NumericVector x2 = Rcpp::wrap(x);
x2.attr("dim") = R_NilValue;
return x2;
}
/*** R
x <- 1:10
y <- 2:8
R> all.equal(setdiff(x,y), arma_setdiff(x,y))
#[1] TRUE
X <- 1:6
Y <- c(2,2,3)
R> all.equal(setdiff(X,Y), arma_setdiff(X,Y))
#[1] TRUE
*/
The previous versions would throw an error if you passed them vectors with non-unique elements, e.g.
R> my_setdiff2(X,Y)
error: conv_to(): given object doesn't have exactly one element
To solve the problem and more closely mirror R's setdiff, we just make x and y unique. Additionally, I switched out the arma::conv_to<>::from with q1(0) (where q1 is now a uvec instead of a uword), because uvec's are just a vector of uwords, and the explicit cast seemed a little inelegant.
I've used std::set_difference from the STL instead, converting back and forth from arma::uvec.
#include <RcppArmadillo.h>
#include <algorithm>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec std_setdiff(arma::uvec& x, arma::uvec& y) {
std::vector<int> a = arma::conv_to< std::vector<int> >::from(arma::sort(x));
std::vector<int> b = arma::conv_to< std::vector<int> >::from(arma::sort(y));
std::vector<int> out;
std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
std::inserter(out, out.end()));
return arma::conv_to<arma::uvec>::from(out);
}
Edit: I thought a performance comparison might be in order. The difference becomes smaller when the relative sizes of the sets are in the opposite order.
a <- sample.int(350)
b <- sample.int(150)
microbenchmark::microbenchmark(std_setdiff(a, b), arma_setdiff(a, b))
> Unit: microseconds
> expr min lq mean median uq max neval cld
> std_setdiff(a, b) 11.548 14.7545 17.29930 17.107 19.245 36.779 100 a
> arma_setdiff(a, b) 60.727 65.0040 71.77804 66.714 72.702 138.133 100 b
The Questioner might have already got the answer. However, the following template version may be more general. This is equivalent to setdiff function in Matlab
If P and Q are two sets, then their difference is given by P - Q or Q - P. If P = {1, 2, 3, 4} and Q = {4, 5, 6}, P - Q means elements of P which are not in Q. i.e., in the above example P - Q = {1, 2, 3}.
/* setdiff(t1, t2) is similar to setdiff() function in MATLAB. It removes the common elements and
gives the uncommon elements in the vectors t1 and t2. */
template <typename T>
T setdiff(T t1, T t2)
{
int size_of_t1 = size(t1);
int size_of_t2 = size(t2);
T Intersection_Elements;
uvec iA, iB;
intersect(Intersection_Elements, iA, iB, t1, t2);
for (int i = 0; i < size(iA); i++)
{
t1(iA(i)) = 0;
}
for (int i = 0; i < size(iB); i++)
{
t2(iB(i)) = 0;
}
T t1_t2_vec(size_of_t1 + size_of_t2);
t1_t2_vec = join_vert(t1, t2);
T DiffVec = nonzeros(t1_t2_vec);
return DiffVec;
}
Any suggestions for improving the performance of the algorithm are welcome.

How to speed up this Rcpp function?

I wish to implement a simple split-apply-combine routine in Rcpp where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R, but often takes quite some time. I have managed to implement an Rcpp solution that beats the performance of R, but I wonder if I can further improve upon it. To illustrate, here some code, first for the use of R:
n <- 50000
k <- 50
set.seed(42)
X <- matrix(rnorm(n*k), nrow=n)
g=rep(1:8,length.out=n )
use.for <- function(mat, ind){
sums <- matrix(NA, nrow=length(unique(ind)), ncol=ncol(mat))
for(i in seq_along(unique(ind))){
sums[i,] <- colSums(mat[ind==i,])
}
return(sums)
}
use.apply <- function(mat, ind){
apply(mat,2, function(x) tapply(x, ind, sum))
}
use.dt <- function(mat, ind){ # based on Roland's answer
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
It turns out that the for-loops is actually quite fast and is the easiest (for me) to implement with Rcpp. It works by creating a submatrix for each group and then calling colSums on the matrix. This is implemented using RcppArmadillo:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat use_arma(arma::mat X, arma::colvec G){
arma::colvec gr = arma::unique(G);
int gr_n = gr.n_rows;
int ncol = X.n_cols;
arma::mat out = zeros(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
arma::uvec subvec = find(G==g_id);
arma::mat submat = X.rows(subvec);
arma::rowvec res = sum(submat,0);
out.row(g) = res;
}
return out;
}
However, based on answers to this question, I learned that creating copies is expensive in C++ (just as in R), but that loops are not as bad as they are in R. Since the arma-solution relies on creating matrixes (submat in the code) for each group, my guess is that avoiding this will speed up the process even further. Hence, here a second implementation based on Rcpp only using a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
out(g,j) = total;
}
}
return out;
}
Benchmarking these solutions, including the use_dt version provided by #Roland (my previous version discriminted unfairly against data.table), as well as the dplyr-solution suggested by #beginneR, yields the following:
library(rbenchmark)
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dplyr(X,g), use_arma(X,g), use_Rcpp(X,g),
+ columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
test replications elapsed relative
# 5 use_arma(X, g) 1000 29.65 1.000
# 4 use.dplyr(X, g) 1000 42.05 1.418
# 3 use.dt(X, g) 1000 56.94 1.920
# 1 use.for(X, g) 1000 60.97 2.056
# 6 use_Rcpp(X, g) 1000 113.96 3.844
# 2 use.apply(X, g) 1000 301.14 10.156
My intution (use_Rcpp better than use_arma) did not turn out right. Having said that, I guess that the line if (G(i) != g_id) continue; in my use_Rcpp function slows down everything. I am happy to learn about alternatives to set this up.
I am happy that I have achieved the same task in half the time it takes R to do it, but maybe the several Rcpp is much faster than R-examples have messed with my expectations, and I am wondering if I can speed this up even more. Does anyone have an idea? I also welcome any programming / coding comments in general since I am relatively new to Rcpp and C++.
No, it's not the for loop that you need to beat:
library(data.table)
#it doesn't seem fair to include calls to library in benchmarks
#you need to do that only once in your session after all
use.dt2 <- function(mat, ind){
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
all.equal(use.dt(X,g), use.dt2(X,g))
#TRUE
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dt2(X,g),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 50)
# test replications elapsed relative
#4 use.dt2(X, g) 50 3.12 1.000
#1 use.for(X, g) 50 4.67 1.497
#3 use.dt(X, g) 50 7.53 2.413
#2 use.apply(X, g) 50 17.46 5.596
Maybe you're looking for (the oddly named) rowsum
library(microbenchmark)
use.rowsum = rowsum
and
> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
expr min lq median uq max neval
use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579 5
use.rowsum(X, g) 10.56727 10.93942 11.01106 11.38697 11.38918 5
Here's my critiques with in-line comments for your Rcpp solution.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
// Rcpp has a sort_unique() function, which combines the
// sort and unique steps into one, and is often faster than
// performing the operations separately. Try `sort_unique(G)`
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
// This constructor zero-initializes memory (kind of like
// making a copy). You should use:
//
// NumericMatrix out = no_init(gr_n, ncol)
//
// to ensure the memory is allocated, but not zeroed.
//
// EDIT: We don't have no_init for matrices right now, but you can hack
// around that with:
//
// NumericMatrix out(Rf_allocMatrix(REALSXP, gr_n, ncol));
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
// subsetting with operator[] is cheaper, so use gr[g] when
// you can be sure bounds checks are not necessary
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
// similarily here
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
// IIUC, you are filling the matrice row-wise. This is slower as
// R matrices are stored in column-major format, and so filling
// matrices column-wise will be faster.
out(g,j) = total;
}
}
return out;
}

Performance of R stats::sd() vs. arma::stddev() vs. Rcpp implementation

Just for the purpose of working on my C++ / Rcpp programming, I took a shot at implementing a (sample) standard deviation function:
#include <Rcpp.h>
#include <vector>
#include <cmath>
#include <numeric>
// [[Rcpp::export]]
double cppSD(Rcpp::NumericVector rinVec)
{
std::vector<double> inVec(rinVec.begin(),rinVec.end());
int n = inVec.size();
double sum = std::accumulate(inVec.begin(), inVec.end(), 0.0);
double mean = sum / inVec.size();
for(std::vector<double>::iterator iter = inVec.begin();
iter != inVec.end(); ++iter){
double temp;
temp= (*iter - mean)*(*iter - mean);
*iter = temp;
}
double sd = std::accumulate(inVec.begin(), inVec.end(), 0.0);
return std::sqrt( sd / (n-1) );
}
I also decided to test out the stddev function from the Armadillo library, considering that it can be called on a vector:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
double armaSD(arma::colvec inVec)
{
return arma::stddev(inVec);
}
Then I benchmarked these two functions against the base R function sd() for a few vectors of varying size:
Rcpp::sourceCpp('G:/CPP/armaSD.cpp')
Rcpp::sourceCpp('G:/CPP/cppSD.cpp')
require(microbenchmark)
##
## sample size = 1,000: armaSD() < cppSD() < sd()
X <- rexp(1000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 4.181 4.562 4.942 5.322 12.924 100
# sd(X) 17.865 19.766 20.526 21.287 86.285 100
# cppSD(X) 4.561 4.941 5.321 5.701 29.269 100
##
## sample size = 10,000: armaSD() < cppSD() < sd()
X <- rexp(10000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 24.707 25.847 26.4175 29.6490 52.455 100
# sd(X) 51.315 54.356 55.8760 61.1980 100.730 100
# cppSD(X) 26.608 28.128 28.8885 31.7395 114.413 100
##
## sample size = 25,000: armaSD() < cppSD() < sd()
X <- rexp(25000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 66.900 67.6600 68.040 76.403 155.845 100
# sd(X) 108.332 111.5625 122.016 125.817 169.910 100
# cppSD(X) 70.320 71.0805 74.692 80.203 102.250 100
##
## sample size = 50,000: cppSD() < sd() < armaSD()
X <- rexp(50000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 249.733 267.4085 297.8175 337.729 642.388 100
# sd(X) 203.740 229.3975 240.2300 260.186 303.709 100
# cppSD(X) 162.308 185.1140 239.6600 256.575 290.405 100
##
## sample size = 75,000: sd() < cppSD() < armaSD()
X <- rexp(75000)
microbenchmark(armaSD(X),sd(X), cppSD(X))
#Unit: microseconds
# expr min lq median uq max neval
# armaSD(X) 445.110 479.8900 502.5070 520.5625 642.388 100
# sd(X) 310.931 334.8780 354.0735 379.7310 429.146 100
# cppSD(X) 346.661 380.8715 400.6370 424.0140 501.747 100
I was not terribly surprised at the fact that my C++ function cppSD() was faster than stats::sd() for smaller samples, but slower for larger sized vectors since stats::sd() is vectorized. However, I did not expect to see such a performance degradation from the arma::stddev() function since it appears to also be operating in a vectorized manner. Is there a problem with the way that I'm using arma::stdev(), or is it simply that stats::sd() was written (in C I'm assuming) in such a way that it can handle larger vectors much more efficiently? Any input would be appreciated.
Update:
Although my question was originally about the correct use of arma::stddev, and not so much about trying find the most efficient way possible to calculate the sample standard deviation, it is very interesting to see that the Rcpp::sd sugar function performs so well. To make things a little more interesting, I benchmarked the arma::stddev and Rcpp::sd functions below against an RcppParallel version that I adapted from two of JJ Allaire's Rcpp Gallery posts - here and here:
library(microbenchmark)
set.seed(123)
x <- rnorm(5.5e06)
##
Res <- microbenchmark(
armaSD(x),
par_sd(x),
sd_sugar(x),
times=500L,
control=list(warmup=25))
##
R> print(Res)
Unit: milliseconds
expr min lq mean median uq max neval
armaSD(x) 24.486943 24.960966 26.994684 25.255584 25.874139 123.55804 500
par_sd(x) 8.130751 8.322682 9.136323 8.429887 8.624072 22.77712 500
sd_sugar(x) 13.713366 13.984638 14.628911 14.156142 14.401138 32.81684 500
This was on my laptop running 64-bit linux with a i5-4200U CPU # 1.60GHz processor; but I'm guessing the difference between par_sd and sugar_sd would be less substantial on a Windows machine.
And the code for the RcppParallel version (which is considerably longer, and requires a C++11 compatible compiler for the lambda expression used in overloaded operator() of the InnerProduct functor):
#include <Rcpp.h>
#include <RcppParallel.h>
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
/*
* based on: http://gallery.rcpp.org/articles/parallel-vector-sum/
*/
struct Sum : public RcppParallel::Worker {
const RcppParallel::RVector<double> input;
double value;
Sum(const Rcpp::NumericVector input)
: input(input), value(0) {}
Sum(const Sum& sum, RcppParallel::Split)
: input(sum.input), value(0) {}
void operator()(std::size_t begin, std::size_t end) {
value += std::accumulate(input.begin() + begin,
input.begin() + end,
0.0);
}
void join(const Sum& rhs) {
value += rhs.value;
}
};
/*
* based on: http://gallery.rcpp.org/articles/parallel-inner-product/
*/
struct InnerProduct : public RcppParallel::Worker {
const RcppParallel::RVector<double> x;
const RcppParallel::RVector<double> y;
double mean;
double product;
InnerProduct(const Rcpp::NumericVector x,
const Rcpp::NumericVector y,
const double mean)
: x(x), y(y), mean(mean), product(0) {}
InnerProduct(const InnerProduct& innerProduct,
RcppParallel::Split)
: x(innerProduct.x), y(innerProduct.y),
mean(innerProduct.mean), product(0) {}
void operator()(std::size_t begin, std::size_t end) {
product += std::inner_product(x.begin() + begin,
x.begin() + end,
y.begin() + begin,
0.0, std::plus<double>(),
[&](double lhs, double rhs)->double {
return ( (lhs-mean)*(rhs-mean) );
});
}
void join(const InnerProduct& rhs) {
product += rhs.product;
}
};
// [[Rcpp::export]]
double par_sd(const Rcpp::NumericVector& x_)
{
int N = x_.size();
Rcpp::NumericVector y_(x_);
Sum sum(x_);
RcppParallel::parallelReduce(0, x_.length(), sum);
double mean = sum.value / N;
InnerProduct innerProduct(x_, y_, mean);
RcppParallel::parallelReduce(0, x_.length(), innerProduct);
return std::sqrt( innerProduct.product / (N-1) );
}
You made a subtle mistake in how you instantiate the Armadillo object -- which leads to copies and hence degraded performance.
Use an interface of const arma::colvec & invec instead, and all is good:
R> sourceCpp("/tmp/sd.cpp")
R> library(microbenchmark)
R> X <- rexp(500)
R> microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
Unit: microseconds
expr min lq median uq max neval
armaSD(X) 3.745 4.0280 4.2055 4.5510 19.375 100
armaSD2(X) 3.305 3.4925 3.6400 3.9525 5.154 100
sd(X) 22.463 23.6985 25.1525 26.0055 52.457 100
cppSD(X) 3.640 3.9495 4.2030 4.8620 13.609 100
R> X <- rexp(5000)
R> microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
Unit: microseconds
expr min lq median uq max neval
armaSD(X) 18.627 18.9120 19.3245 20.2150 34.684 100
armaSD2(X) 14.583 14.9020 15.1675 15.5775 22.527 100
sd(X) 54.507 58.8315 59.8615 60.4250 84.857 100
cppSD(X) 18.585 19.0290 19.3970 20.5160 22.174 100
R> X <- rexp(50000)
R> microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
Unit: microseconds
expr min lq median uq max neval
armaSD(X) 186.307 187.180 188.575 191.825 405.775 100
armaSD2(X) 142.447 142.793 143.207 144.233 155.770 100
sd(X) 382.857 384.704 385.223 386.075 405.713 100
cppSD(X) 181.601 181.895 182.279 183.350 194.588 100
R>
which is based on my version of your code where everything is one file and armaSD2 is defined as I suggested -- leading to the winning performance.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
#include <vector>
#include <cmath>
#include <numeric>
// [[Rcpp::export]]
double cppSD(Rcpp::NumericVector rinVec) {
std::vector<double> inVec(rinVec.begin(),rinVec.end());
int n = inVec.size();
double sum = std::accumulate(inVec.begin(), inVec.end(), 0.0);
double mean = sum / inVec.size();
for(std::vector<double>::iterator iter = inVec.begin();
iter != inVec.end();
++iter){
double temp = (*iter - mean)*(*iter - mean);
*iter = temp;
}
double sd = std::accumulate(inVec.begin(), inVec.end(), 0.0);
return std::sqrt( sd / (n-1) );
}
// [[Rcpp::export]]
double armaSD(arma::colvec inVec) {
return arma::stddev(inVec);
}
// [[Rcpp::export]]
double armaSD2(const arma::colvec & inVec) { return arma::stddev(inVec); }
/*** R
library(microbenchmark)
X <- rexp(500)
microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
X <- rexp(5000)
microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
X <- rexp(50000)
microbenchmark(armaSD(X), armaSD2(X), sd(X), cppSD(X))
*/
I think the sd function built in Rcpp sugar is much more efficient. See the code below:
#include <RcppArmadillo.h>
//[[Rcpp::depends(RcppArmadillo)]]
#include <vector>
#include <cmath>
#include <numeric>
using namespace Rcpp;
//[[Rcpp::export]]
double sd_cpp(NumericVector& xin){
std::vector<double> xres(xin.begin(),xin.end());
int n=xres.size();
double sum=std::accumulate(xres.begin(),xres.end(),0.0);
double mean=sum/n;
for(std::vector<double>::iterator iter=xres.begin();iter!=xres.end();++iter){
double tmp=(*iter-mean)*(*iter-mean);
*iter=tmp;
}
double sd=std::accumulate(xres.begin(),xres.end(),0.0);
return std::sqrt(sd/(n-1));
}
//[[Rcpp::export]]
double sd_arma(arma::colvec& xin){
return arma::stddev(xin);
}
//[[Rcpp::export]]
double sd_sugar(NumericVector& xin){
return sd(xin);
}
> sourcecpp("sd.cpp")
> microbenchmark(sd(X),sd_cpp(X),sd_arma(X),sd_sugar(X))
Unit: microseconds
expr min lq mean median uq max neval
sd(X) 47.655 49.4120 51.88204 50.5395 51.1950 113.643 100
sd_cpp(X) 28.145 28.4410 29.01541 28.6695 29.4570 37.118 100
sd_arma(X) 23.706 23.9615 24.65931 24.1955 24.9520 50.375 100
sd_sugar(X) 19.197 19.478 20.38872 20.0785 21.2015 28.664 100
2 questions about the Rcpp function:
1) how likely is it that you would want the standard deviation without the mean? If it will be uncommon to require SD without the mean, why not return both rather than ask the R user to make 2 function calls which in effect calculate the mean twice.
2) Is there any reason for cloning the vector inside the function?
using namespace Rcpp;
// [[Rcpp::plugins(cpp14)]]
// [[Rcpp::export]]
NumericVector cppSD(NumericVector rinVec)
{
double n = (double)rinVec.size();
double sum = 0;
for (double& v : rinVec)
sum += v;
double mean = sum / n;
double varianceNumerator = 0;
for(double& v : rinVec)
varianceNumerator += (v - mean) * (v - mean);
return NumericVector::create(_["std.dev"] = sqrt(varianceNumerator/ (n - 1.0)),
_["mean"] = mean);
}