element wise matrix multiplication in Rcpp - c++

I am trying to speed up some R code with Rcpp that takes a vector of length L (psi) and a matrix of dimensions (L,L) and does some element-wise operations. Is there a more efficient way to be doing these element-wise operations with Rcpp?
r:
UpdateLambda <- function(psi,phi){
# updated full-day infection probabilites
psi.times.phi <- apply(phi,1,function(x) x*psi)
## return Lambda_{i,j} = 1 - \prod_{j} (1 - \psi_{i,j,t} \phi_{i,j})
apply(psi.times.phi,2,function(x) 1-prod(1-x))
}
cpp:
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector UpdateLambdaC(NumericVector psi,
NumericMatrix phi
){
int n = psi.size();
NumericMatrix psi_times_phi(n,n);
NumericVector tmp(n,1.0);
NumericVector lambda(n);
for(int i=0; i<n;i++){
psi_times_phi(i,_) = psi*phi(i,_);
}
for(int i=0; i<n;i++){
// \pi_{j} (1- \lambda_{i,j,t})
for(int j=0; j<n;j++){
tmp[i] *= 1-psi_times_phi(i,j);
}
lambda[i] = 1-tmp[i];
}
return lambda;
}

You can replace you apply loops with vectorised alternatives.
The first one is equivalent to:
t(phi)*psi
And the second:
1-exp(colSums(log(1-psi.times.phi)))
#test data
phi <- matrix(runif(1e6),1e3)
psi <- runif(1e3)
#new function
UpdateLambda2 <- function(psi,phi) 1-exp(colSums(log(1-t(phi)*psi)))
#sanity check
identical(UpdateLambda(psi,phi),UpdateLambda2(psi,phi))
[1] TRUE
#timings
library(rbenchmark)
benchmark(UpdateLambda(psi,phi),UpdateLambda2(psi,phi))
test replications elapsed relative user.self sys.self
1 UpdateLambda(psi, phi) 100 16.05 1.041 15.06 0.93
2 UpdateLambda2(psi, phi) 100 15.42 1.000 14.19 1.19
Well, it appears that it does not make much of a difference, which is very surprising as colSums is typically much faster than apply. I'm not sure if the test data I've used is relevant as the output is all 1's due number of multiplications of number less than 1 in the second part. You may be better off working in a log scale anyway if you want to note the detail of such small numbers.

Related

Why is my Rcpp code is much slower than glmnet's?

I edited the lasso code from this site to use it for multiple lambda values.
I used lassoshooting package for one lambda value (this package works for one lambda value) and glmnet for multiple lambda values for comparison.
The coefficient estimates are different and this is expected because of standardization and scaling back to original scale. This is out of scope and not important here.
For one parameter case, lassoshooting is 1.5 times faster.
Both methods used all 100 lambda values in my code for multiple lambda case. But glmnet is 7.5 times faster than my cpp code. Of course, I expected that glmnet was faster, but this amount seems too much. Is it normal or is my code wrong?
EDIT
I also attached lshoot function which calculates coefficient path in an R loop. This outperforms my cpp code too.
Can I improve my cpp code?
C++ code:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
vec softmax_cpp(const vec & x, const vec & y) {
return sign(x) % max(abs(x) - y, zeros(x.n_elem));
}
// [[Rcpp::export]]
mat lasso(const mat & X, const vec & y, const vec & lambda,
const double tol = 1e-7, const int max_iter = 10000){
int p = X.n_cols; int lam = lambda.n_elem;
mat XX = X.t() * X;
vec Xy = X.t() * y;
vec Xy2 = 2 * Xy;
mat XX2 = 2 * XX;
mat betas = zeros(p, lam); // to store the betas
vec beta = zeros(p); // initial beta for each lambda
bool converged = false;
int iteration = 0;
vec beta_prev, aj, cj;
for(int l = 0; l < lam; l++){
while (!converged && (iteration < max_iter)){
beta_prev = beta;
for (int j = 0; j < p; j++){
aj = XX2(j,j);
cj = Xy2(j) - dot(XX2.row(j), beta) + beta(j) * XX2(j,j);
beta(j) = as_scalar(softmax_cpp(cj / aj, as_scalar(lambda(l)) / aj));
}
iteration = iteration + 1;
converged = norm(beta_prev - beta, 1) < tol;
}
betas.col(l) = beta;
iteration = 0;
converged = false;
}
return betas;
}
R code:
library(Rcpp)
library(rbenchmark)
library(glmnet)
library(lassoshooting)
sourceCpp("LASSO.cpp")
library(ElemStatLearn)
X <- as.matrix(prostate[,-c(9,10)])
y <- as.matrix(prostate[,9])
lambda_one <- 0.1
benchmark(cpp=lasso(X,y,lambda_one),
lassoshooting=lassoshooting(X,y,lambda_one)$coefficients,
order="relative", replications=100)[,1:4]
################################################
lambda <- seq(0,10,len=100)
benchmark(cpp=lasso(X,y,lambda),
glmn=coef(glmnet(X,y,lambda=lambda)),
order="relative", replications=100)[,1:4]
####################################################
EDIT
lambda <- seq(0,10,len=100)
lshoot <- function(lambda){
betas <- matrix(NA,8,100)
for(l in 1:100){
betas[, l] <- lassoshooting(X,y,lambda[l])$coefficients
}
return(betas)
}
benchmark(cpp=lasso(X,y,lambda),
lassoshooting_loop=lshoot(lambda),
order="relative", replications=300)[,1:4]
Results for one parameter case:
test replications elapsed relative
2 lassoshooting 300 0.06 1.0
1 cpp 300 0.09 1.5
Results for multiple parameter case:
test replications elapsed relative
2 glmn 300 0.70 1.000
1 cpp 300 5.24 7.486
Results for lassoshooting loop and cpp:
test replications elapsed relative
2 lassoshooting_loop 300 4.06 1.000
1 cpp 300 6.38 1.571
Package {glmnet} uses warm starts and special rules for discarding lots of predictors, which makes fitting the whole "regularization path" very fast.
See their paper.

efficient distance calculations in armadillo

I'm new to armadillo. I have the below code, which I assume is inefficient. Any suggestions to make it more memory efficient and/or speedy? Following the armadillo docs and Rcpp gallery, I was unable to get .colptr's, uvec's, or batch insertion to work. But I assume any of them would be improvements.
With an input of X (~100 x 30000), even my stupidly large work VM crashes.
Linux release 7.3.1611 (Core)
117GB RAM / 0GB SWAP
(24 x 2.494 GHz) processor(s)
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
code
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
sp_mat arma_distmat_LT(const arma::mat& x) { // input expected X_{n x p} n << p
int nr, nc;
Col<double> col0, col1;
nr = x.n_rows;
nc = x.n_cols;
sp_mat out(nc, nc);
for (int i = 0; i < nc; i++) {
col0 = x.col(i);
for (int j = i + 1; j < nc; j++) {
col1 = x.col(j);
out(j, i) = as_scalar(col0.t() * col1);
}
}
return out;
}
Call: sourceCpp("<file>"); dist_x <- arma_distmat_LT(X)
Note: these are distances because I am calculating cosine similarities where I have set L2 norm == 1.
It looks to me as if you're just computing the (upper triangular) matrix product t(X)%*%X. You can actually do that directly in R with the underused crossprod function.
X <- matrix(rnorm(100*30000), ncol=30000)
res <- crossprod(X, X)
This takes a few minutes on my laptop. If you change your code to use the Armadillo library then you can use
sp_mat arma_distmat_LT2(const arma::mat& x) { // input expected X_{n x p} n << p
int nr, nc;
Col<double> col0, col1;
nr = x.n_rows;
nc = x.n_cols;
sp_mat out(nc, nc);
out = trimatl(x.t() * x, k=-1);
return out;
}
Still takes a few minutes. It uses an awful amount of memory though so I doubt you can have a lot of objects in memory at the same time.
The code could probably be optimized to only compute the lower/upper triangular matrix.
Just to show the speedup for a 100*800 matrix:
> microbenchmark(crossprod(X, X), arma_distmat_LT(X), arma_distmat_LT2(X))
Unit: milliseconds
expr min lq mean median uq
crossprod(X, X) 50.25574 53.72049 57.98812 56.29532 58.71277
arma_distmat_LT(X) 1331.83243 1471.42465 1523.74060 1492.84611 1512.45416
arma_distmat_LT2(X) 29.69420 33.23954 36.24613 35.54700 38.05208
max neval cld
160.81227 100 a
3080.37891 100 b
66.07351 100 a
As you can see there is a substantial speedup to be gained by brute-forcing it. That being said I'm sure that the cross product can be optimised further.

Fastest way to compute the cdf of the Normal distribution over vectors - R::pnorm vs erfc vs?

I hope my reworded question now fits the criteria of Stackoverflow. Please consider the example below. I am writing a Log-Likelihood function in which computing the cdf over vectors is the most time consuming part. Example 1 uses the R::pnorm, Example 2 approximates the normal cdf with erfc. As you can see the results are sufficiently similar, the ercf version is a bit faster.
In practice (within an MLE) however it turns out that the ercf is not as precise, which lets the algorithm run into inf areas unless one sets the constraints accurately. My questions:
1) Am I missing something? Is it necessary to implement some error handling (for the erfc)?
2) Do you have any other suggestions to speed up the code, or alternatives? Does it pay off to look into parallelizing the for-loop?
require(Rcpp)
require(RcppArmadillo)
require(microbenchmark)
#Example 1 : standard R::pnorm
src1 <- '
NumericVector ppnorm(const arma::vec& x,const arma::vec& mu,const arma::vec& sigma, int lt, int lg) {
int n = x.size();
arma::vec res(n);
for (int i=0; i<n; i++) {
res(i) = R::pnorm(x(i),mu(i),sigma(i),lt,lg);
}
return wrap(res);
}
'
#Example 2: approximation with ercf
src2 <- '
NumericVector ppnorm(const arma::vec& x,const arma::vec& mu,const arma::vec& sigma, int lt, int lg) {
int n = x.size();
arma::vec res(n);
for (int i=0; i<n; i++) {
res(i) = 0.5 * erfc(-(x(i) - mu(i))/sigma(i) * M_SQRT1_2);
}
if (lt==0 & lg==0) {
return wrap(1 - res);
}
if (lt==1 & lg==0) {
return wrap(res);
}
if (lt==0 & lg==1) {
return wrap(log(1 - res));
}
if (lt==1 & lg==1) {
return wrap(log(res));
}
}
'
#some random numbers
xex = rnorm(100,5,4)
muex = rnorm(100,3,1)
siex = rnorm(100,0.8,0.3)
#compile c++ functions
func1 = cppFunction(depends = "RcppArmadillo",code=src1) #R::pnorm
func2 = cppFunction(depends = "RcppArmadillo",code=src2) #ercf
#run with exemplaric data
res1 = func1(xex,muex,siex,1,0)
res2 = func2(xex,muex,siex,1,0)
# sum of squared errors
sum((res1 - res2)^2,na.rm=T)
# 6.474419e-32 ... very small
#benchmarking
microbenchmark(func1(xex,muex,siex,1,0),func2(xex,muex,siex,1,0),times=10000)
#Unit: microseconds
#expr min lq mean median uq max neval
#func1(xex, muex, siex, 1, 0) 11.225 11.9725 13.72518 12.460 13.617 103.654 10000
#func2(xex, muex, siex, 1, 0) 8.360 9.1410 10.62114 9.669 10.769 205.784 10000
#my machine: Ubuntu 14.04 LTS, i7 2640M 2.8 Ghz x 4, 8GB memory, RRO 3.2.0 based on version R 3.2.0
1) Well, you really should use R's pnorm() as your 0-th example.
You don't, you use the Rcpp interface to it. R's pnorm() is already nicely vectorized R-internally (i.e. on C level) so may well be comparative or even faster than Rcpp. Also it does have the advantage to cover cases of NA, NaN, Inf, etc..
2) If you are talking about MLE, and you are concerned about speed and accuracy, you almost surely should rather work with the logarithms, and maybe not withpnorm() but rather dnorm() ?

How to speed up this Rcpp function?

I wish to implement a simple split-apply-combine routine in Rcpp where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R, but often takes quite some time. I have managed to implement an Rcpp solution that beats the performance of R, but I wonder if I can further improve upon it. To illustrate, here some code, first for the use of R:
n <- 50000
k <- 50
set.seed(42)
X <- matrix(rnorm(n*k), nrow=n)
g=rep(1:8,length.out=n )
use.for <- function(mat, ind){
sums <- matrix(NA, nrow=length(unique(ind)), ncol=ncol(mat))
for(i in seq_along(unique(ind))){
sums[i,] <- colSums(mat[ind==i,])
}
return(sums)
}
use.apply <- function(mat, ind){
apply(mat,2, function(x) tapply(x, ind, sum))
}
use.dt <- function(mat, ind){ # based on Roland's answer
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
It turns out that the for-loops is actually quite fast and is the easiest (for me) to implement with Rcpp. It works by creating a submatrix for each group and then calling colSums on the matrix. This is implemented using RcppArmadillo:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat use_arma(arma::mat X, arma::colvec G){
arma::colvec gr = arma::unique(G);
int gr_n = gr.n_rows;
int ncol = X.n_cols;
arma::mat out = zeros(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
arma::uvec subvec = find(G==g_id);
arma::mat submat = X.rows(subvec);
arma::rowvec res = sum(submat,0);
out.row(g) = res;
}
return out;
}
However, based on answers to this question, I learned that creating copies is expensive in C++ (just as in R), but that loops are not as bad as they are in R. Since the arma-solution relies on creating matrixes (submat in the code) for each group, my guess is that avoiding this will speed up the process even further. Hence, here a second implementation based on Rcpp only using a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
out(g,j) = total;
}
}
return out;
}
Benchmarking these solutions, including the use_dt version provided by #Roland (my previous version discriminted unfairly against data.table), as well as the dplyr-solution suggested by #beginneR, yields the following:
library(rbenchmark)
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dplyr(X,g), use_arma(X,g), use_Rcpp(X,g),
+ columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
test replications elapsed relative
# 5 use_arma(X, g) 1000 29.65 1.000
# 4 use.dplyr(X, g) 1000 42.05 1.418
# 3 use.dt(X, g) 1000 56.94 1.920
# 1 use.for(X, g) 1000 60.97 2.056
# 6 use_Rcpp(X, g) 1000 113.96 3.844
# 2 use.apply(X, g) 1000 301.14 10.156
My intution (use_Rcpp better than use_arma) did not turn out right. Having said that, I guess that the line if (G(i) != g_id) continue; in my use_Rcpp function slows down everything. I am happy to learn about alternatives to set this up.
I am happy that I have achieved the same task in half the time it takes R to do it, but maybe the several Rcpp is much faster than R-examples have messed with my expectations, and I am wondering if I can speed this up even more. Does anyone have an idea? I also welcome any programming / coding comments in general since I am relatively new to Rcpp and C++.
No, it's not the for loop that you need to beat:
library(data.table)
#it doesn't seem fair to include calls to library in benchmarks
#you need to do that only once in your session after all
use.dt2 <- function(mat, ind){
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
all.equal(use.dt(X,g), use.dt2(X,g))
#TRUE
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dt2(X,g),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 50)
# test replications elapsed relative
#4 use.dt2(X, g) 50 3.12 1.000
#1 use.for(X, g) 50 4.67 1.497
#3 use.dt(X, g) 50 7.53 2.413
#2 use.apply(X, g) 50 17.46 5.596
Maybe you're looking for (the oddly named) rowsum
library(microbenchmark)
use.rowsum = rowsum
and
> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
expr min lq median uq max neval
use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579 5
use.rowsum(X, g) 10.56727 10.93942 11.01106 11.38697 11.38918 5
Here's my critiques with in-line comments for your Rcpp solution.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
// Rcpp has a sort_unique() function, which combines the
// sort and unique steps into one, and is often faster than
// performing the operations separately. Try `sort_unique(G)`
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
// This constructor zero-initializes memory (kind of like
// making a copy). You should use:
//
// NumericMatrix out = no_init(gr_n, ncol)
//
// to ensure the memory is allocated, but not zeroed.
//
// EDIT: We don't have no_init for matrices right now, but you can hack
// around that with:
//
// NumericMatrix out(Rf_allocMatrix(REALSXP, gr_n, ncol));
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
// subsetting with operator[] is cheaper, so use gr[g] when
// you can be sure bounds checks are not necessary
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
// similarily here
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
// IIUC, you are filling the matrice row-wise. This is slower as
// R matrices are stored in column-major format, and so filling
// matrices column-wise will be faster.
out(g,j) = total;
}
}
return out;
}

Speeding up computation of Dice coefficient in C / Rcpp

I need to compute a similarity measure call the Dice coefficient over large matrices (600,000 x 500) of binary vectors in R. For speed I use C / Rcpp. The function runs great but as I am not a computer scientist by background I would like to know if it could run faster. This code is suitable for parallelisation but I have no experience parallelising C code.
The Dice coefficient is a simple measure of similarity / dissimilarity (depending how you take it). It is intended to compare asymmetric binary vectors, meaning one of the combination (usually 0-0) is not important and agreement (1-1 pairs) have more weight than disagreement (1-0 or 0-1 pairs). Imagine the following contingency table:
1 0
1 a b
0 c d
The Dice coef is: (2*a) / (2*a +b + c)
Here is my Rcpp implementation:
library(Rcpp)
cppFunction('
NumericMatrix dice(NumericMatrix binaryMat){
int nrows = binaryMat.nrow(), ncols = binaryMat.ncol();
NumericMatrix results(ncols, ncols);
for(int i=0; i < ncols-1; i++){ // columns fixed
for(int j=i+1; j < ncols; j++){ // columns moving
double a = 0;
double d = 0;
for (int l = 0; l < nrows; l++) {
if(binaryMat(l, i)>0){
if(binaryMat(l, j)>0){
a++;
}
}else{
if(binaryMat(l, j)<1){
d++;
}
}
}
// compute Dice coefficient
double abc = nrows - d;
double bc = abc - a;
results(j,i) = (2*a) / (2*a + bc);
}
}
return wrap(results);
}
')
And here is a running example:
x <- rbinom(1:200000, 1, 0.5)
X <- matrix(x, nrow = 200, ncol = 1000)
system.time(dice(X))
user system elapsed
0.814 0.000 0.814
The solution proposed by Roland was not entirely satisfying for my use case. So based on the source code from the arules package I implement a much faster version. The code in arules rely on an algorithm from Leisch (2005) using the tcrossproduct() function in R.
First, I wrote a Rcpp / RcppEigen version of crossprod that is 2-3 time faster. This is based on the example code in the RcppEigen vignette.
library(Rcpp)
library(RcppEigen)
library(inline)
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::Lower;
const Map<MatrixXi> A(as<Map<MatrixXi> >(AA));
const int m(A.rows()), n(A.cols());
MatrixXi AtA(MatrixXi(n, n).setZero().selfadjointView<Lower>().rankUpdate(A.adjoint()));
return wrap(AtA);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
Then I wrote a small R function to compute the Dice coefficient.
diceR <- function(X){
a <- fcprd(X)
nx <- ncol(X)
rsx <- colSums(X)
c <- matrix(rsx, nrow = nx, ncol = nx) - a
# b <- matrix(rsx, nrow = nx, ncol = nx, byrow = TRUE) - a
b <- t(c)
m <- (2 * a) / (2*a + b + c)
return(m)
}
This new function is ~8 time faster than the old one and ~3 time faster than the one in arules.
m <- microbenchmark(dice(X), diceR(X), dissimilarity(t(X), method="dice"), times=100)
m
# Unit: milliseconds
# expr min lq median uq max neval
# dice(X) 791.34558 809.8396 812.19480 814.6735 910.1635 100
# diceR(X) 62.98642 76.5510 92.02528 159.2557 507.1662 100
# dissimilarity(t(X), method = "dice") 264.07997 342.0484 352.59870 357.4632 520.0492 100
I cannot run your function at work, but is the result the same as this?
library(arules)
plot(dissimilarity(X,method="dice"))
system.time(dissimilarity(X,method="dice"))
#user system elapsed
#0.04 0.00 0.04