Why is my Rcpp code is much slower than glmnet's? - c++

I edited the lasso code from this site to use it for multiple lambda values.
I used lassoshooting package for one lambda value (this package works for one lambda value) and glmnet for multiple lambda values for comparison.
The coefficient estimates are different and this is expected because of standardization and scaling back to original scale. This is out of scope and not important here.
For one parameter case, lassoshooting is 1.5 times faster.
Both methods used all 100 lambda values in my code for multiple lambda case. But glmnet is 7.5 times faster than my cpp code. Of course, I expected that glmnet was faster, but this amount seems too much. Is it normal or is my code wrong?
EDIT
I also attached lshoot function which calculates coefficient path in an R loop. This outperforms my cpp code too.
Can I improve my cpp code?
C++ code:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
vec softmax_cpp(const vec & x, const vec & y) {
return sign(x) % max(abs(x) - y, zeros(x.n_elem));
}
// [[Rcpp::export]]
mat lasso(const mat & X, const vec & y, const vec & lambda,
const double tol = 1e-7, const int max_iter = 10000){
int p = X.n_cols; int lam = lambda.n_elem;
mat XX = X.t() * X;
vec Xy = X.t() * y;
vec Xy2 = 2 * Xy;
mat XX2 = 2 * XX;
mat betas = zeros(p, lam); // to store the betas
vec beta = zeros(p); // initial beta for each lambda
bool converged = false;
int iteration = 0;
vec beta_prev, aj, cj;
for(int l = 0; l < lam; l++){
while (!converged && (iteration < max_iter)){
beta_prev = beta;
for (int j = 0; j < p; j++){
aj = XX2(j,j);
cj = Xy2(j) - dot(XX2.row(j), beta) + beta(j) * XX2(j,j);
beta(j) = as_scalar(softmax_cpp(cj / aj, as_scalar(lambda(l)) / aj));
}
iteration = iteration + 1;
converged = norm(beta_prev - beta, 1) < tol;
}
betas.col(l) = beta;
iteration = 0;
converged = false;
}
return betas;
}
R code:
library(Rcpp)
library(rbenchmark)
library(glmnet)
library(lassoshooting)
sourceCpp("LASSO.cpp")
library(ElemStatLearn)
X <- as.matrix(prostate[,-c(9,10)])
y <- as.matrix(prostate[,9])
lambda_one <- 0.1
benchmark(cpp=lasso(X,y,lambda_one),
lassoshooting=lassoshooting(X,y,lambda_one)$coefficients,
order="relative", replications=100)[,1:4]
################################################
lambda <- seq(0,10,len=100)
benchmark(cpp=lasso(X,y,lambda),
glmn=coef(glmnet(X,y,lambda=lambda)),
order="relative", replications=100)[,1:4]
####################################################
EDIT
lambda <- seq(0,10,len=100)
lshoot <- function(lambda){
betas <- matrix(NA,8,100)
for(l in 1:100){
betas[, l] <- lassoshooting(X,y,lambda[l])$coefficients
}
return(betas)
}
benchmark(cpp=lasso(X,y,lambda),
lassoshooting_loop=lshoot(lambda),
order="relative", replications=300)[,1:4]
Results for one parameter case:
test replications elapsed relative
2 lassoshooting 300 0.06 1.0
1 cpp 300 0.09 1.5
Results for multiple parameter case:
test replications elapsed relative
2 glmn 300 0.70 1.000
1 cpp 300 5.24 7.486
Results for lassoshooting loop and cpp:
test replications elapsed relative
2 lassoshooting_loop 300 4.06 1.000
1 cpp 300 6.38 1.571

Package {glmnet} uses warm starts and special rules for discarding lots of predictors, which makes fitting the whole "regularization path" very fast.
See their paper.

Related

Memory leaks in a simple Rcpp function

I am developing a package in R that I would like to convert to Rcpp for better performance. I'm new to Rcpp (and C++ in general.) My problem is that the Rcpp function I've written works fine if I run it many times with one set of arguments, but if I try to loop it over many combinations of arguments, it springs memory leaks and causes the R session to abort.
Here is the code in R, which holds up well to any test I throw at it:
raw_noise <- function(timesteps, mu, sigma, phi) {
delta <- mu * (1 - phi)
variance <- sigma^2 * (1 - phi^2)
noise <- vector(mode = "double", length = timesteps)
noise[1] <- c(rnorm(1, mu, sigma))
for (i in (1:(timesteps - 1))) {
noise[i + 1] <- delta + phi * noise[i] + rnorm(1, 0, sqrt(variance))
}
return(noise)
}
Here is the code in Rcpp, using three Rcpp sugar functions (pow, sqrt, rnorm):
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
double delta = mu * (1 - phi);
double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
NumericVector noise(timesteps);
noise[0] = R::rnorm(mu, sigma);
for(int i = 0; i < timesteps; ++i) {
noise[i+1] = delta + phi*noise[i] + R::rnorm(0, sqrt(variance));
}
return noise;
}
What really confuses me is that this code runs without problems:
library(purrr)
rerun(10000, raw_noise(timesteps = 30, mu = 0.5, sigma = 0.2, phi = 0.3))
But when I run this code:
test_loop <- function(timesteps, mu, sigma, phi, replicates) {
params <- cross_df(list(timesteps = timesteps, phi = phi, mu = mu, sigma =
sigma))
for (i in 1:nrow(params)) {
print(params[i,])
pmap(params[i,], raw_noise)
}
}
library(purrr)
test_loop(timesteps=c(5, 6, 7, 8, 9, 10), mu=c(0.2, 0.5), sigma=c(0.2, 0.5),
phi=c(0, 0.1))
More often than not, the R session aborts and RStudio crashes altogether. But sometimes I manage to catch this error message before the R session aborts:
Error in match(x, table, nomatch = 0L) : GC encountered a node
(0x10db7af50) with an unknown SEXP type: NEWSXP at memory.c:1692
As I understand it, NEWSXP is an exotic object type in R that doesn't come up very often. What's happening looks to me like a memory leak, but I'm not at all sure how to fix it. Like I said, I'm new to Rcpp and C++ generally so I'd appreciate any nudges in the right direction.
You have an out of bounds error:
for(int i = 0; i < timesteps; ++i)
causes
noise[i+1]
to exceed the defined range since C++ indices start at 0 and not 1.
For example, 0 to timesteps - 1 has a length of timesteps and, thus, is okay.
but
0 to timesteps would have a length of timesteps + 1
This can be seen if you change noise[i+1] to noise(i+1), which performs a bounds check on the requested index.
Error in raw_noise(100, 2, 3, 0.2) :
Index out of bounds: [index=100; extent=100].
To address this, make the following change:
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
double delta = mu * (1 - phi);
double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
NumericVector noise(timesteps);
noise[0] = R::rnorm(mu, sigma);
// change here
for(int i = 0; i < timesteps - 1; ++i) { // 1 less time step
noise[i+1] = delta + phi*noise[i] + R::rnorm(0, sqrt(variance));
}
return noise;
}

efficient distance calculations in armadillo

I'm new to armadillo. I have the below code, which I assume is inefficient. Any suggestions to make it more memory efficient and/or speedy? Following the armadillo docs and Rcpp gallery, I was unable to get .colptr's, uvec's, or batch insertion to work. But I assume any of them would be improvements.
With an input of X (~100 x 30000), even my stupidly large work VM crashes.
Linux release 7.3.1611 (Core)
117GB RAM / 0GB SWAP
(24 x 2.494 GHz) processor(s)
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
code
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
sp_mat arma_distmat_LT(const arma::mat& x) { // input expected X_{n x p} n << p
int nr, nc;
Col<double> col0, col1;
nr = x.n_rows;
nc = x.n_cols;
sp_mat out(nc, nc);
for (int i = 0; i < nc; i++) {
col0 = x.col(i);
for (int j = i + 1; j < nc; j++) {
col1 = x.col(j);
out(j, i) = as_scalar(col0.t() * col1);
}
}
return out;
}
Call: sourceCpp("<file>"); dist_x <- arma_distmat_LT(X)
Note: these are distances because I am calculating cosine similarities where I have set L2 norm == 1.
It looks to me as if you're just computing the (upper triangular) matrix product t(X)%*%X. You can actually do that directly in R with the underused crossprod function.
X <- matrix(rnorm(100*30000), ncol=30000)
res <- crossprod(X, X)
This takes a few minutes on my laptop. If you change your code to use the Armadillo library then you can use
sp_mat arma_distmat_LT2(const arma::mat& x) { // input expected X_{n x p} n << p
int nr, nc;
Col<double> col0, col1;
nr = x.n_rows;
nc = x.n_cols;
sp_mat out(nc, nc);
out = trimatl(x.t() * x, k=-1);
return out;
}
Still takes a few minutes. It uses an awful amount of memory though so I doubt you can have a lot of objects in memory at the same time.
The code could probably be optimized to only compute the lower/upper triangular matrix.
Just to show the speedup for a 100*800 matrix:
> microbenchmark(crossprod(X, X), arma_distmat_LT(X), arma_distmat_LT2(X))
Unit: milliseconds
expr min lq mean median uq
crossprod(X, X) 50.25574 53.72049 57.98812 56.29532 58.71277
arma_distmat_LT(X) 1331.83243 1471.42465 1523.74060 1492.84611 1512.45416
arma_distmat_LT2(X) 29.69420 33.23954 36.24613 35.54700 38.05208
max neval cld
160.81227 100 a
3080.37891 100 b
66.07351 100 a
As you can see there is a substantial speedup to be gained by brute-forcing it. That being said I'm sure that the cross product can be optimised further.

How to speed up this Rcpp function?

I wish to implement a simple split-apply-combine routine in Rcpp where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R, but often takes quite some time. I have managed to implement an Rcpp solution that beats the performance of R, but I wonder if I can further improve upon it. To illustrate, here some code, first for the use of R:
n <- 50000
k <- 50
set.seed(42)
X <- matrix(rnorm(n*k), nrow=n)
g=rep(1:8,length.out=n )
use.for <- function(mat, ind){
sums <- matrix(NA, nrow=length(unique(ind)), ncol=ncol(mat))
for(i in seq_along(unique(ind))){
sums[i,] <- colSums(mat[ind==i,])
}
return(sums)
}
use.apply <- function(mat, ind){
apply(mat,2, function(x) tapply(x, ind, sum))
}
use.dt <- function(mat, ind){ # based on Roland's answer
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
It turns out that the for-loops is actually quite fast and is the easiest (for me) to implement with Rcpp. It works by creating a submatrix for each group and then calling colSums on the matrix. This is implemented using RcppArmadillo:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat use_arma(arma::mat X, arma::colvec G){
arma::colvec gr = arma::unique(G);
int gr_n = gr.n_rows;
int ncol = X.n_cols;
arma::mat out = zeros(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
arma::uvec subvec = find(G==g_id);
arma::mat submat = X.rows(subvec);
arma::rowvec res = sum(submat,0);
out.row(g) = res;
}
return out;
}
However, based on answers to this question, I learned that creating copies is expensive in C++ (just as in R), but that loops are not as bad as they are in R. Since the arma-solution relies on creating matrixes (submat in the code) for each group, my guess is that avoiding this will speed up the process even further. Hence, here a second implementation based on Rcpp only using a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
out(g,j) = total;
}
}
return out;
}
Benchmarking these solutions, including the use_dt version provided by #Roland (my previous version discriminted unfairly against data.table), as well as the dplyr-solution suggested by #beginneR, yields the following:
library(rbenchmark)
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dplyr(X,g), use_arma(X,g), use_Rcpp(X,g),
+ columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
test replications elapsed relative
# 5 use_arma(X, g) 1000 29.65 1.000
# 4 use.dplyr(X, g) 1000 42.05 1.418
# 3 use.dt(X, g) 1000 56.94 1.920
# 1 use.for(X, g) 1000 60.97 2.056
# 6 use_Rcpp(X, g) 1000 113.96 3.844
# 2 use.apply(X, g) 1000 301.14 10.156
My intution (use_Rcpp better than use_arma) did not turn out right. Having said that, I guess that the line if (G(i) != g_id) continue; in my use_Rcpp function slows down everything. I am happy to learn about alternatives to set this up.
I am happy that I have achieved the same task in half the time it takes R to do it, but maybe the several Rcpp is much faster than R-examples have messed with my expectations, and I am wondering if I can speed this up even more. Does anyone have an idea? I also welcome any programming / coding comments in general since I am relatively new to Rcpp and C++.
No, it's not the for loop that you need to beat:
library(data.table)
#it doesn't seem fair to include calls to library in benchmarks
#you need to do that only once in your session after all
use.dt2 <- function(mat, ind){
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
all.equal(use.dt(X,g), use.dt2(X,g))
#TRUE
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dt2(X,g),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 50)
# test replications elapsed relative
#4 use.dt2(X, g) 50 3.12 1.000
#1 use.for(X, g) 50 4.67 1.497
#3 use.dt(X, g) 50 7.53 2.413
#2 use.apply(X, g) 50 17.46 5.596
Maybe you're looking for (the oddly named) rowsum
library(microbenchmark)
use.rowsum = rowsum
and
> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
expr min lq median uq max neval
use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579 5
use.rowsum(X, g) 10.56727 10.93942 11.01106 11.38697 11.38918 5
Here's my critiques with in-line comments for your Rcpp solution.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
// Rcpp has a sort_unique() function, which combines the
// sort and unique steps into one, and is often faster than
// performing the operations separately. Try `sort_unique(G)`
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
// This constructor zero-initializes memory (kind of like
// making a copy). You should use:
//
// NumericMatrix out = no_init(gr_n, ncol)
//
// to ensure the memory is allocated, but not zeroed.
//
// EDIT: We don't have no_init for matrices right now, but you can hack
// around that with:
//
// NumericMatrix out(Rf_allocMatrix(REALSXP, gr_n, ncol));
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
// subsetting with operator[] is cheaper, so use gr[g] when
// you can be sure bounds checks are not necessary
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
// similarily here
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
// IIUC, you are filling the matrice row-wise. This is slower as
// R matrices are stored in column-major format, and so filling
// matrices column-wise will be faster.
out(g,j) = total;
}
}
return out;
}

Speeding up computation of Dice coefficient in C / Rcpp

I need to compute a similarity measure call the Dice coefficient over large matrices (600,000 x 500) of binary vectors in R. For speed I use C / Rcpp. The function runs great but as I am not a computer scientist by background I would like to know if it could run faster. This code is suitable for parallelisation but I have no experience parallelising C code.
The Dice coefficient is a simple measure of similarity / dissimilarity (depending how you take it). It is intended to compare asymmetric binary vectors, meaning one of the combination (usually 0-0) is not important and agreement (1-1 pairs) have more weight than disagreement (1-0 or 0-1 pairs). Imagine the following contingency table:
1 0
1 a b
0 c d
The Dice coef is: (2*a) / (2*a +b + c)
Here is my Rcpp implementation:
library(Rcpp)
cppFunction('
NumericMatrix dice(NumericMatrix binaryMat){
int nrows = binaryMat.nrow(), ncols = binaryMat.ncol();
NumericMatrix results(ncols, ncols);
for(int i=0; i < ncols-1; i++){ // columns fixed
for(int j=i+1; j < ncols; j++){ // columns moving
double a = 0;
double d = 0;
for (int l = 0; l < nrows; l++) {
if(binaryMat(l, i)>0){
if(binaryMat(l, j)>0){
a++;
}
}else{
if(binaryMat(l, j)<1){
d++;
}
}
}
// compute Dice coefficient
double abc = nrows - d;
double bc = abc - a;
results(j,i) = (2*a) / (2*a + bc);
}
}
return wrap(results);
}
')
And here is a running example:
x <- rbinom(1:200000, 1, 0.5)
X <- matrix(x, nrow = 200, ncol = 1000)
system.time(dice(X))
user system elapsed
0.814 0.000 0.814
The solution proposed by Roland was not entirely satisfying for my use case. So based on the source code from the arules package I implement a much faster version. The code in arules rely on an algorithm from Leisch (2005) using the tcrossproduct() function in R.
First, I wrote a Rcpp / RcppEigen version of crossprod that is 2-3 time faster. This is based on the example code in the RcppEigen vignette.
library(Rcpp)
library(RcppEigen)
library(inline)
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::Lower;
const Map<MatrixXi> A(as<Map<MatrixXi> >(AA));
const int m(A.rows()), n(A.cols());
MatrixXi AtA(MatrixXi(n, n).setZero().selfadjointView<Lower>().rankUpdate(A.adjoint()));
return wrap(AtA);
'
fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")
Then I wrote a small R function to compute the Dice coefficient.
diceR <- function(X){
a <- fcprd(X)
nx <- ncol(X)
rsx <- colSums(X)
c <- matrix(rsx, nrow = nx, ncol = nx) - a
# b <- matrix(rsx, nrow = nx, ncol = nx, byrow = TRUE) - a
b <- t(c)
m <- (2 * a) / (2*a + b + c)
return(m)
}
This new function is ~8 time faster than the old one and ~3 time faster than the one in arules.
m <- microbenchmark(dice(X), diceR(X), dissimilarity(t(X), method="dice"), times=100)
m
# Unit: milliseconds
# expr min lq median uq max neval
# dice(X) 791.34558 809.8396 812.19480 814.6735 910.1635 100
# diceR(X) 62.98642 76.5510 92.02528 159.2557 507.1662 100
# dissimilarity(t(X), method = "dice") 264.07997 342.0484 352.59870 357.4632 520.0492 100
I cannot run your function at work, but is the result the same as this?
library(arules)
plot(dissimilarity(X,method="dice"))
system.time(dissimilarity(X,method="dice"))
#user system elapsed
#0.04 0.00 0.04

element wise matrix multiplication in Rcpp

I am trying to speed up some R code with Rcpp that takes a vector of length L (psi) and a matrix of dimensions (L,L) and does some element-wise operations. Is there a more efficient way to be doing these element-wise operations with Rcpp?
r:
UpdateLambda <- function(psi,phi){
# updated full-day infection probabilites
psi.times.phi <- apply(phi,1,function(x) x*psi)
## return Lambda_{i,j} = 1 - \prod_{j} (1 - \psi_{i,j,t} \phi_{i,j})
apply(psi.times.phi,2,function(x) 1-prod(1-x))
}
cpp:
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector UpdateLambdaC(NumericVector psi,
NumericMatrix phi
){
int n = psi.size();
NumericMatrix psi_times_phi(n,n);
NumericVector tmp(n,1.0);
NumericVector lambda(n);
for(int i=0; i<n;i++){
psi_times_phi(i,_) = psi*phi(i,_);
}
for(int i=0; i<n;i++){
// \pi_{j} (1- \lambda_{i,j,t})
for(int j=0; j<n;j++){
tmp[i] *= 1-psi_times_phi(i,j);
}
lambda[i] = 1-tmp[i];
}
return lambda;
}
You can replace you apply loops with vectorised alternatives.
The first one is equivalent to:
t(phi)*psi
And the second:
1-exp(colSums(log(1-psi.times.phi)))
#test data
phi <- matrix(runif(1e6),1e3)
psi <- runif(1e3)
#new function
UpdateLambda2 <- function(psi,phi) 1-exp(colSums(log(1-t(phi)*psi)))
#sanity check
identical(UpdateLambda(psi,phi),UpdateLambda2(psi,phi))
[1] TRUE
#timings
library(rbenchmark)
benchmark(UpdateLambda(psi,phi),UpdateLambda2(psi,phi))
test replications elapsed relative user.self sys.self
1 UpdateLambda(psi, phi) 100 16.05 1.041 15.06 0.93
2 UpdateLambda2(psi, phi) 100 15.42 1.000 14.19 1.19
Well, it appears that it does not make much of a difference, which is very surprising as colSums is typically much faster than apply. I'm not sure if the test data I've used is relevant as the output is all 1's due number of multiplications of number less than 1 in the second part. You may be better off working in a log scale anyway if you want to note the detail of such small numbers.