Alternating Error: "Invalid dimension for argument 0" - c++

While converting the example below to a gfor loop, I encountered an error of the type "Invalid dimension for argument 0"; the full error message is below. However, the error occurs, then the function runs, then the same error occurs again, and this pattern repeats. I am confused and am wondering if this error is in some way system dependent.
Full error message:
Error in random_shuffle(theta, 5, 1) :
ArrayFire Exception (Invalid input size:203):
In function af_err af_assign_seq(af_array *, const af_array, const unsigned int, const af_seq *, const af_array)
In file src/api/c/assign.cpp:168
Invalid dimension for argument 0
Expected: (outDims.ndims() >= inDims.ndims())
A second problem: the seed fails to change with the input parameter when using the gfor loop.
#include "RcppArrayFire.h"
using namespace Rcpp;
using namespace RcppArrayFire;
// [[Rcpp::export]]
af::array random_shuffle(const RcppArrayFire::typed_array<f64> theta, int counts, int seed){
const int theta_size = theta.dims()[0];
af::array out(counts, theta_size, f64);
af::array seed_seq = af::seq(seed, seed+counts);
// for(int f = 0; f < counts; f++){
gfor ( af::seq f, counts-1 ){
af::randomEngine engine;
engine.setSeed(af::sum<double>(seed_seq(f)));
af::array index_shuffle(1, u16);
af::array temp_rand(1, f64);
af::array temp_end(1, f64);
af::array shuffled = theta;
// implementation of the Knuth-Fisher-Yates shuffle algo
for(int i = theta_size-1; i > 1; i --){
index_shuffle = af::round(af::randu(1, u16, engine)/(65536/(i+1)));
temp_rand = shuffled(index_shuffle);
temp_end = shuffled(i);
shuffled(index_shuffle) = temp_end;
shuffled(i) = temp_rand;
}
out(f, af::span) = shuffled;
}
return out;
}
/*** R
theta <- 10:20
random_shuffle(theta, 5, 1)
random_shuffle(theta, 5, 2)
*/
Updated with Ralf Stubner's solution, but 'shuffled' samples in column space.
// [[Rcpp::export]]
af::array random_shuffle2(const RcppArrayFire::typed_array<f64> theta, int counts, int seed) {
  int len = theta.dims(0);
  af::setSeed(seed);
  af::array tmp = af::randu(len, counts, 1);
  af::array val, idx;
  af::sort(val, idx, tmp, 1);
  af::array shuffled = theta(idx);
  return af::moddims(shuffled, len, counts);
}
/*** R
random_shuffle2(theta, 5, 1)
*/
Here is a picture of the output, sampling with replacement:
In the second part, over 50 repetitions, the samples move towards an ergodic outcome.

Why do you want to use multiple RNG engines in parallel? There is really no need for this. In general, it should be sufficient to use only the global RNG engine, and to set the seed of this engine only once. You can do this from R with RcppArrayFire::arrayfire_set_seed. Besides, random number generation within a gfor loop does not work as one might expect, cf. http://arrayfire.org/docs/page_gfor.htm.
Anyway, I am not an expert in writing efficient GPU algorithms, which is why I like using the methods implemented in libraries like ArrayFire. Unfortunately ArrayFire does not have a shuffle algorithm, but the corresponding issue has a nice implementation, which can be generalized to your case with multiple shuffles:
// [[Rcpp::depends(RcppArrayFire)]]
#include "RcppArrayFire.h"

// [[Rcpp::export]]
af::array random_shuffle(const RcppArrayFire::typed_array<f64> theta, int counts, int seed) {
  int len = theta.dims(0);
  af::setSeed(seed);
  af::array tmp = af::randu(counts, len, 1);
  af::array val, idx;
  af::sort(val, idx, tmp, 1);
  af::array shuffled = theta(idx);
  return af::moddims(shuffled, counts, len);
}
BTW, depending on the later usage it might make more sense to arrange the different samples in columns instead of rows, since both R and ArrayFire use column-major layout.
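For instance, a minimal sketch of that column-wise arrangement, assuming the same RcppArrayFire setup as above (the name random_shuffle_cols is ours): generate one column of random keys per sample and sort along dimension 0, so each column gets its own permutation.
// [[Rcpp::export]]
af::array random_shuffle_cols(const RcppArrayFire::typed_array<f64> theta, int counts, int seed) {
  int len = theta.dims(0);
  af::setSeed(seed);
  // one column of random sort keys per requested sample
  af::array tmp = af::randu(len, counts, 1);
  af::array val, idx;
  // sorting along dim 0 yields an independent permutation of 0..len-1 in each column
  af::sort(val, idx, tmp, 0);
  af::array shuffled = theta(idx);
  // each column of the result is one shuffle of theta
  return af::moddims(shuffled, len, counts);
}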

Adding 3D vectors using SIMD intrinsics

I've got two streams of 3D vectors which I'd like to add using x86 AVX2 intrinsics. I'm using the GNU compiler 11.1.0. Hopefully, the code illustrates what I want to do:
// Example program
#include <cstddef>     // std::size_t
#include <immintrin.h>

struct v3
{
    float data[3] = {};
};

void add(const v3* a, const v3* b, v3* c, const std::size_t& n)
{
    // c <- a + b
    for (auto i = std::size_t{}; i < n; i += 2) // 2 vector3s at a time ~6 floats
    {
        // masking
        // [95:0] of a[i] land in [95:0] of one 256-bit register,
        // [95:0] of a[i+1] land in [191:96] of *another* 256-bit register
        // ^same with b[i]
        static const auto p1_mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
        static const auto p2_mask = _mm256_setr_epi32(0, 0, 0, -1, -1, -1, 0, 0);
        const auto p1_leftop_packed = _mm256_maskload_ps(a[i].data, p1_mask);
        const auto p2_leftop_packed = _mm256_maskload_ps(a[i].data, p2_mask);
        const auto p1_rightop_packed = _mm256_maskload_ps(b[i].data, p1_mask);
        const auto p2_rightop_packed = _mm256_maskload_ps(b[i].data, p2_mask);
        // addition is being done inefficiently with 2 AVX2 instructions!
        const auto result1_packed = _mm256_add_ps(p1_leftop_packed, p1_rightop_packed);
        const auto result2_packed = _mm256_add_ps(p2_leftop_packed, p2_rightop_packed);
        // store them back
        _mm256_maskstore_ps(c[i].data, p1_mask, result1_packed);
        _mm256_maskstore_ps(c[i].data, p2_mask, result2_packed);
    }
}

int main()
{
    // data
    const auto n = std::size_t{1000};
    v3 a[n] = {};
    v3 b[n] = {};
    v3 c[n] = {};
    // run
    add(a, b, c, n);
    return 0;
}
The above code works but the performance is quite terrible. To correct it, I think I need a version which looks approximately like the following:
// c <- a + b
for (auto i = std::size_t{}; i < n; i += 2) // 2 vector3s at a time ~6 floats
{
    // masking
    // [95:0] of a[i] move into [95:0], [95:0] of a[i+1] move into [223:128]
    const auto leftop_packed = /*code required here*/;
    const auto rightop_packed = /*code required here*/;
    // addition is being done with only 1 AVX2 instruction
    const auto result_packed = _mm256_add_ps(leftop_packed, rightop_packed);
    // store them back
    // [95:0] of result_packed move into c[i], [223:128] of result_packed into c[i+1]
    /*code required here*/
}
How do I achieve this? I will gladly provide any additional information when needed. Any help would be much appreciated.
The two comments below make the same point, and it is good advice: do as they say.
I think you can just load 8 floats at a time and then if you have anything left over at the end you can do a masked store (not sure about this part). – LHLaurini
Use char*, float*, or __m256* to work in 32-byte or 8-float chunks, ignoring vector boundaries since you're just doing pure vertical addition. float* should be good for cleanup of the last up-to-7 floats – Peter Cordes
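A minimal sketch of what these comments suggest (the name add_flat and the unaligned loads are our choices): treat the v3 arrays as one contiguous float stream, add 8 floats per iteration, and finish the last up-to-7 floats with scalar code.
#include <cstddef>
#include <immintrin.h>

void add_flat(const float* a, const float* b, float* c, std::size_t n_floats)
{
    std::size_t i = 0;
    // pure vertical addition: vector3 boundaries are irrelevant here
    for (; i + 8 <= n_floats; i += 8)
    {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    // scalar cleanup for the remaining floats
    for (; i < n_floats; ++i)
        c[i] = a[i] + b[i];
}
Called as add_flat(a[0].data, b[0].data, c[0].data, 3 * n), every lane of every addition does useful work, unlike the masked version above.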
The Eigen library supports vectorization. It also has a lot of the vector/matrix math algorithms already implemented, and quite efficiently too. If you can, I'd recommend looking into using it instead of rolling your own logic.
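A small sketch of the Eigen alternative, using the same flat-float view of the data (the name add_eigen is ours):
#include <cstddef>
#include <Eigen/Core>

void add_eigen(const float* a, const float* b, float* c, std::size_t n_floats)
{
    // Map the raw storage as Eigen arrays; no copies are made
    Eigen::Map<const Eigen::ArrayXf> va(a, n_floats);
    Eigen::Map<const Eigen::ArrayXf> vb(b, n_floats);
    Eigen::Map<Eigen::ArrayXf> vc(c, n_floats);
    vc = va + vb; // Eigen vectorizes this expression, tail handling included
}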

What is the most efficient way to apply a functor to a subset of a device array?

I am rewriting a library that performs calculations and other operations on data stored in contiguous chunks of memory, so that it can work on GPUs using the CUDA framework. The data represents information that lives on a 4-dimensional grid. The total size of the grid can range from thousands to millions of grid points. Along each direction, the grid may have as few as 8 or as many as several hundred points. My question is about the best way to implement operations on a subset of the grid. For example, suppose that my grid is [0,nx)x[0,ny)x[0,nz)x[0,nq), and I want to implement a transformation that multiplies all the points whose indices belong to [1,nx-1)x[1,ny-1)x[1,nz-1)x[0,nq-1) by minus 1.
Right now, I do this via nested loops. Here is a skeleton of the code:
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

{
    int nx, ny, nz, nq;
    nx = 10, ny = 10, nz = 10, nq = 10;
    typedef thrust::device_vector<double> Array;
    Array A(nx * ny * nz * nq);
    thrust::fill(A.begin(), A.end(), (double) 1);
    for (auto q = 1; q < nq - 1; ++q){
        for (auto k = 1; k < nz - 1; ++k){
            for (auto j = 1; j < ny - 1; ++j){
                int offset1 = 1 + j * nx + k * nx * ny + q * nx * ny * nz;
                int offset2 = offset1 + nx - 2;
                thrust::transform(A.begin() + offset1,
                                  A.begin() + offset2,
                                  A.begin() + offset1, // output iterator: negate in place
                                  thrust::negate<double>());
            }
        }
    }
}
However, I wonder if this is the most efficient way, because it seems to me that in this case at most nx-2 threads can run simultaneously. So I was thinking that perhaps a better way would be to generate a sequence iterator (returning the linear position along the array), zip it to the array with a zip iterator, and define a functor that examines the second element of the tuple (the position value) and, if that value falls into the accepted range, modifies the first element of the tuple. However, there may be a better way to do that. I am new to CUDA, and to make matters worse I really cut my teeth on Fortran, so it is hard for me to think outside the for-loop box...
I'm not sure what the most efficient way is, but I can suggest something that I think will be more efficient than your skeleton code.
Your proposal in the text is headed in the right direction. Rather than use a set of nested for-loops that will iterate potentially quite a few times, we should seek to get everything done in one thrust call. But we still need to have that one thrust call only modify the array values at the indices within the "cubic" volume to be operated on.
However, we don't want a method that tests each generated index against the valid index volume, as you seem to be suggesting. That would require launching a grid as large as the whole array, even if we only wanted to modify a small volume of it.
Instead, we launch an operation that is just large enough to cover the needed number of elements to modify, and we create a functor which does a linear index -> 4D index -> adjusted linear index conversion. That functor then operates within a transform iterator to convert an ordinary linear sequence starting at 0, 1, 2, etc. to a sequence that starts and stays within the volume to be modified. A permutation iterator is then used with this modified sequence to select the values of the array to modify. For example, with nx=ny=nz=nq=64 and all lower bounds at 1, linear index 0 maps to the 4D point (1,1,1,1), i.e. adjusted linear index 1 + 64 + 64^2 + 64^3 = 266305.
Here's an example showing the difference in timing for your nested loop method (1) vs. mine (2) for an array of 64x64x64x64 and a modified volume of 62x62x62x62:
$ cat t39.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/equal.h>
#include <cassert>
#include <iostream>

struct my_idx
{
  int nx, ny, nz, nq, lx, ly, lz, lq, dx, dy, dz, dq;
  my_idx(int _nx, int _ny, int _nz, int _nq, int _lx, int _ly, int _lz, int _lq, int _hx, int _hy, int _hz, int _hq) {
    nx = _nx;
    ny = _ny;
    nz = _nz;
    nq = _nq;
    lx = _lx;
    ly = _ly;
    lz = _lz;
    lq = _lq;
    dx = _hx - lx;
    dy = _hy - ly;
    dz = _hz - lz;
    dq = _hq - lq;
    // could do a lot of assert checking here
  }
  __host__ __device__
  int operator()(int idx){
    // linear index -> 4D index within the modified volume -> adjusted linear index
    int rx = idx / dx;
    int ix = idx - (rx * dx);
    int ry = rx / dy;
    int iy = rx - (ry * dy);
    int rz = ry / dz;
    int iz = ry - (rz * dz);
    int rq = rz / dq;
    int iq = rz - (rq * dq);
    return (((iq + lq) * nz + iz + lz) * ny + iy + ly) * nx + ix + lx;
  }
};

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec * USECPSEC) + tv.tv_usec) - start;
}

int main()
{
  int nx, ny, nz, nq, lx, ly, lz, lq, hx, hy, hz, hq;
  nx = 64, ny = 64, nz = 64, nq = 64;
  lx = 1, ly = 1, lz = 1, lq = 1;
  hx = nx - 1, hy = ny - 1, hz = nz - 1, hq = nq - 1;
  thrust::device_vector<double> A(nx * ny * nz * nq);
  thrust::device_vector<double> B(nx * ny * nz * nq);
  thrust::fill(A.begin(), A.end(), (double) 1);
  thrust::fill(B.begin(), B.end(), (double) 1);
  // method 1
  unsigned long long m1_time = dtime_usec(0);
  for (auto q = lq; q < hq; ++q){
    for (auto k = lz; k < hz; ++k){
      for (auto j = ly; j < hy; ++j){
        int offset1 = lx + j * nx + k * nx * ny + q * nx * ny * nz;
        int offset2 = offset1 + (hx - lx);
        thrust::transform(A.begin() + offset1,
                          A.begin() + offset2, A.begin() + offset1,
                          thrust::negate<double>());
      }
    }
  }
  cudaDeviceSynchronize();
  m1_time = dtime_usec(m1_time);
  // method 2
  unsigned long long m2_time = dtime_usec(0);
  auto p = thrust::make_permutation_iterator(B.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_idx(nx, ny, nz, nq, lx, ly, lz, lq, hx, hy, hz, hq)));
  thrust::transform(p, p + (hx - lx) * (hy - ly) * (hz - lz) * (hq - lq), p, thrust::negate<double>());
  cudaDeviceSynchronize();
  m2_time = dtime_usec(m2_time);
  if (thrust::equal(A.begin(), A.end(), B.begin()))
    std::cout << "method 1 time: " << m1_time / (float)USECPSEC << "s method 2 time: " << m2_time / (float)USECPSEC << "s" << std::endl;
  else
    std::cout << "mismatch error" << std::endl;
}
$ nvcc -std=c++11 t39.cu -o t39
$ ./t39
method 1 time: 1.6005s method 2 time: 0.013182s
$

Rcppparallel bootstrap

I presume, or rather hope, that I have a single fixable problem, or perhaps I have many smaller ones and should give up. Either way, I am relatively new to Rcpp, extremely uninformed on parallel computation, and can't find a solution online.
The problem is typically a 'fatal error' in R, or R gets stuck in a loop: something like 5 minutes for 10 iterations, when the non-parallel version will do 5K iterations in the same time, roughly speaking.
As this algorithm fits into a much larger project, I call several other functions; these are all in Rcpp, and I rewrote them using only 'arma' objects, as that seemed to help other people here. I also ran the optimization part with a 'heat map' optimizer I wrote in Rcpp, again exclusively in 'arma', without improvement. I should also point out that it returned an 'arma::vec'.
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends("RcppParallel")]]
#include <RcppArmadillo.h>
#include <RcppParallel.h>
using namespace Rcpp;
using namespace std;
using namespace arma;
using namespace RcppParallel;
struct Boot_Worker : public Worker {
//Generate Inputs
// Source vector to keep track of the number of bootstraps
const arma::vec Boot_reps;
// Initial non-linear theta parameter values
const arma::vec init_val;
// Decimal date vector
const arma::colvec T_series;
// Generate the price series observational vector
const arma::colvec Y_est;
const arma::colvec Y_res;
// Generate the optimization constants
const arma::mat U;
const arma::colvec C;
const int N;
// Generate Output Matrix
arma::mat Boots_out;
// Initialize with the proper input and output
Boot_Worker( const arma::vec Boot_reps, const arma::vec init_val, const arma::colvec T_series, const arma::colvec Y_est, const arma::colvec Y_res, const arma::mat U, const arma::colvec C, const int N, arma::mat Boots_out)
: Boot_reps(Boot_reps), init_val(init_val), T_series(T_series), Y_est(Y_est), Y_res(Y_res), U(U), C(C), N(N), Boots_out(Boots_out) {}
void operator()(std::size_t begin, std::size_t end){
//load necessary stuffs from around
Rcpp::Environment stats("package:stats");
Rcpp::Function constrOptim = stats["constrOptim"];
Rcpp::Function SDK_pred_mad( "SDK_pred_mad");
arma::mat fake_data(N,2);
arma::colvec index(N);
for(unsigned int i = begin; i < end; i ++){
// Need a nested loop to create and fill the fake data matrix
arma::vec pool = arma::regspace(0, N-1) ;
std::random_shuffle(pool.begin(), pool.end());
for(int k = 0; k <= N-1; k++){
fake_data(k, 0) = Y_est[k] + Y_res[ pool[k] ];
fake_data(k, 1) = T_series[k];
}
// Call the optimization
Rcpp::List opt_results = constrOptim(Rcpp::_["theta"] = init_val,
Rcpp::_["f"] = SDK_pred_mad,
Rcpp::_["data_in"] = fake_data,
Rcpp::_["grad"] = "NULL",
Rcpp::_["method"] = "Nelder-Mead",
Rcpp::_["ui"] = U,
Rcpp::_["ci"] = C );
/// fill the output matrix ///
// need to create an place holder arma vector for the parameter output
arma::vec opt_param = Rcpp::as<arma::vec>(opt_results[0]);
Boots_out(i, 0) = opt_param[0];
Boots_out(i, 1) = opt_param[1];
Boots_out(i, 2) = opt_param[2];
// for the cost function value at optimization
arma::vec opt_value = Rcpp::as<arma::vec>(opt_results[1]);
Boots_out(i, 3) = opt_value[0];
// for the number of function calls (?)
arma::vec counts = Rcpp::as<arma::vec>(opt_results[2]);
Boots_out(i, 4) = counts[0];
// for thhe convergence code
arma::vec convergence = Rcpp::as<arma::vec>(opt_results[3]);
Boots_out(i, 5) = convergence[0];
}
}
};
// [[Rcpp::export]]
arma::mat SDK_boots_test(arma::vec init_val, arma::mat data_in, int boots_n){
//First establish theta_sp, estimate and residuals
const int N = arma::size(data_in)[0];
// Create the constraints for the constrained optimization
// Make a boundry boundry condition matrix of the form Ui*theta - ci >= 0
arma::mat U(6, 3);
U(0, 0) = 1;
U(1, 0) = -1;
U(2, 0) = 0;
U(3, 0) = 0;
U(4, 0) = 0;
U(5, 0) = 0;
U(0, 1) = 0;
U(1, 1) = 0;
U(2, 1) = 1;
U(3, 1) = -1;
U(4, 1) = 0;
U(5, 1) = 0;
U(0, 2) = 0;
U(1, 2) = 0;
U(2, 2) = 0;
U(3, 2) = 0;
U(4, 2) = 1;
U(5, 2) = -1;
arma::colvec C(6);
C[0] = 0;
C[1] = -data_in(N-1, 9)-0.5;
C[2] = 0;
C[3] = -3;
C[4] = 0;
C[5] = -50;
Rcpp::Function SDK_est( "SDK_est");
Rcpp::Function SDK_res( "SDK_res");
arma::vec Y_est = as<arma::vec>(SDK_est(init_val, data_in));
arma::vec Y_res = as<arma::vec>(SDK_res(init_val, data_in));
// Generate feed items for the Bootstrap Worker
arma::vec T_series = data_in( span(0, N-1), 9);
arma::vec Boots_reps(boots_n+1);
// Allocate the output matrix
arma::mat Boots_out(boots_n, 6);
// Pass input and output the Bootstrap Worker
Boot_Worker Boot_Worker(Boots_reps, init_val, T_series, Y_est, Y_res, U, C, N, Boots_out);
// Now finnaly call the parallel for loop
parallelFor(0, Boots_reps.size(), Boot_Worker);
return Boots_out;
}
So I put my 'heat algorithm' back in to solve the optimization; this is entirely in Rcpp/Armadillo, which simplifies the code massively, as the constraints are written into the optimizer. Additionally, I removed the randomization, so it just has to solve the same optimization every time, to see if that was the only problem. Without fail, I am still getting the same 'fatal error'.
As it stands, here is the code:
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends("RcppParallel")]]
#include <RcppArmadillo.h>
#include <RcppParallel.h>
#include <random>
using namespace Rcpp;
using namespace std;
using namespace arma;
using namespace RcppParallel;
struct Boot_Worker : public Worker {
//Generate Inputs
// Source vector to keep track of the number of bootstraps
const arma::vec Boot_reps;
// Initial non-linear theta parameter values
const arma::vec init_val;
// Decimal date vector
const arma::colvec T_series;
// Generate the price series observational vector
const arma::colvec Y_est;
const arma::colvec Y_res;
const int N;
// Generate Output Matrix
arma::mat Boots_out;
// Initialize with the proper input and output
Boot_Worker( const arma::vec Boot_reps, const arma::vec init_val, const arma::colvec T_series, const arma::colvec Y_est, const arma::colvec Y_res, const int N, arma::mat Boots_out)
: Boot_reps(Boot_reps), init_val(init_val), T_series(T_series), Y_est(Y_est), Y_res(Y_res), N(N), Boots_out(Boots_out) {}
void operator()(std::size_t begin, std::size_t end){
//load necessary stuffs from around
Rcpp::Function SDK_heat( "SDK_heat");
arma::mat fake_data(N,2);
arma::colvec index(N);
for(unsigned int i = begin; i < end; i ++){
// Need a nested loop to create and fill the fake data matrix
//arma::vec pool = arma::shuffle( arma::regspace(0, N-1) );
for(int k = 0; k <= N-1; k++){
fake_data(k, 0) = Y_est[k] + Y_res[ k ];
//fake_data(k, 0) = Y_est[k] + Y_res[ pool[k] ];
fake_data(k, 1) = T_series[k];
}
// Call the optimization
arma::vec opt_results = Rcpp::as<arma::vec>( SDK_heat(Rcpp::_["data_in"] = fake_data, Rcpp::_["tol"] = 0.1) );
/// fill the output matrix ///
// need to create an place holder arma vector for the parameter output
Boots_out(i, 0) = opt_results[0];
Boots_out(i, 1) = opt_results[1];
Boots_out(i, 2) = opt_results[2];
// for the cost function value at optimization
Boots_out(i, 3) = opt_results[3];
}
}
};
// [[Rcpp::export]]
arma::mat SDK_boots_test(arma::vec init_val, arma::mat data_in, int boots_n){
//First establish theta_sp, estimate and residuals
const int N = arma::size(data_in)[0];
Rcpp::Function SDK_est( "SDK_est");
Rcpp::Function SDK_res( "SDK_res");
const arma::vec Y_est = as<arma::vec>(SDK_est(init_val, data_in));
const arma::vec Y_res = as<arma::vec>(SDK_res(init_val, data_in));
// Generate feed items for the Bootstrap Worker
const arma::vec T_series = data_in( span(0, N-1), 9);
arma::vec Boots_reps(boots_n+1);
// Allocate the output matrix
arma::mat Boots_out(boots_n, 4);
// Pass input and output the Bootstrap Worker
Boot_Worker Boot_Worker(Boots_reps, init_val, T_series, Y_est, Y_res, N, Boots_out);
// Now finnaly call the parallel for loop
parallelFor(0, Boots_reps.size(), Boot_Worker);
return Boots_out;
}
Looking at your code I see the following:
struct Boot_Worker : public Worker {
  [...]
  void operator()(std::size_t begin, std::size_t end){
    //load necessary stuffs from around
    Rcpp::Environment stats("package:stats");
    Rcpp::Function constrOptim = stats["constrOptim"];
    Rcpp::Function SDK_pred_mad("SDK_pred_mad");
    [...]
    // Call the optimization
    Rcpp::List opt_results = constrOptim(Rcpp::_["theta"] = init_val,
                                         Rcpp::_["f"] = SDK_pred_mad,
                                         Rcpp::_["data_in"] = fake_data,
                                         Rcpp::_["grad"] = "NULL",
                                         Rcpp::_["method"] = "Nelder-Mead",
                                         Rcpp::_["ui"] = U,
                                         Rcpp::_["ci"] = C );
You are calling an R function from a multi-threaded C++ context. That's something you should not do. R is single-threaded so this will lead to undefined behavior or crashes:
API Restrictions
The code that you write within parallel workers should not call the R or Rcpp API in any fashion. This is because R is single-threaded and concurrent interaction with its data structures can cause crashes and other undefined behavior. Here is the official guidance from Writing R Extensions:
Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe. In particular, code which makes use of the stack-checking mechanism must not be called from threaded code.
Besides, calling back to R from C++, even in a single-threaded context, is not the best thing you can do for performance. It should be more efficient to use an optimization library that offers a direct C(++) interface. One possibility might be the development version of nlopt, cf. this issue for a discussion and references to examples. In addition, std::random_shuffle is not only deprecated in C++14 and removed from C++17, but it is also not thread-safe.
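A minimal sketch of a thread-safe replacement, assuming C++11 (the helper name shuffle_pool is ours): std::shuffle with an explicit, thread-local engine, so no RNG state is shared between worker threads.
#include <algorithm>
#include <random>
#include <armadillo>

void shuffle_pool(arma::vec& pool)
{
    // each thread gets its own engine; the seeding strategy is up to you
    thread_local std::mt19937 engine(std::random_device{}());
    std::shuffle(pool.begin(), pool.end(), engine);
}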
In your second example, you say that the function SDK_heat is actually implemented in C++. In that case you can call it directly:
Remove importing the corresponding R function, i.e. the Rcpp::Function SDK_heat( "SDK_heat");
Make sure that the compiler knows the declaration of the C++ function and that the linker has the actual function:
Quick and dirty: copy the function definition into your cpp file before the definition of Boot_Worker.
For a cleaner approach, see section "1.10 Sharing code" in the Rcpp attributes vignette
Call the function like any other C++ function, i.e. using positional arguments with types compatible with the function declaration (see the sketch below).
All this assumes you are using sourceCpp, as indicated by your usage of [[Rcpp::depends(...)]]. You are reaching a complexity that warrants building a package from this.
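A hedged sketch of the "quick and dirty" route; SDK_heat's real definition is not shown in the question, so a stand-in signature and body are used here:
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::depends("RcppParallel")]]
#include <RcppArmadillo.h>
#include <RcppParallel.h>

// stand-in: paste the real C++ definition of SDK_heat here, before the worker
arma::vec SDK_heat(const arma::mat& data_in, double tol)
{
    return arma::vec(4, arma::fill::zeros); // placeholder body
}

struct Boot_Worker : public RcppParallel::Worker {
    const arma::mat& fake_data;
    arma::mat& Boots_out;
    Boot_Worker(const arma::mat& fake_data, arma::mat& Boots_out)
        : fake_data(fake_data), Boots_out(Boots_out) {}
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            // direct C++ call with positional arguments; no R API in the worker
            arma::vec opt_results = SDK_heat(fake_data, 0.1);
            for (int k = 0; k < 4; ++k)
                Boots_out(i, k) = opt_results[k];
        }
    }
};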

Armadillo SpMat<int> extremely slow compared to Mat<int>

I am trying to utilize sparse matrices in Armadillo, and am noticing a significant difference in access times with SpMat<int> compared to equivalent code using Mat<int>.
Description:
Below are two methods, which are identical in every respect except that Method_One uses regular matrices and Method_Two uses sparse matrices.
Both methods take the following arguments:
WS, DS: Pointers to NN-dimensional arrays
WW: 13 K [max(WS)]
DD: 1.7 K [max(DS)]
NN: 2.3 M
TT: 50
I am using Visual Studio 2017 for compiling the code into a .mexw64 file which can be called from Matlab.
Code:
void Method_One(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
    Mat<int> WP(WW, TT, fill::zeros); // (13000 x 50) matrix
    Mat<int> DP(DD, TT, fill::zeros); // (1700 x 50) matrix
    Col<int> ZZ(NN, fill::zeros);     // 2,300,000 column vector
    for (int n = 0; n < NN; n++)
    {
        int w_n = (int) WS[n] - 1;
        int d_n = (int) DS[n] - 1;
        int t_n = rand() % TT;
        WP(w_n, t_n)++;
        DP(d_n, t_n)++;
        ZZ(n) = t_n + 1;
    }
    return;
}

void Method_Two(int WW, int DD, int TT, int NN, double* WS, double* DS)
{
    SpMat<int> WP(WW, TT); // (13000 x 50) matrix
    SpMat<int> DP(DD, TT); // (1700 x 50) matrix
    Col<int> ZZ(NN, fill::zeros); // 2,300,000 column vector
    for (int n = 0; n < NN; n++)
    {
        int w_n = (int) WS[n] - 1;
        int d_n = (int) DS[n] - 1;
        int t_n = rand() % TT;
        WP(w_n, t_n)++;
        DP(d_n, t_n)++;
        ZZ(n) = t_n + 1;
    }
    return;
}
Timing:
I am timing both methods using the wall_clock timer object in Armadillo. For example:
wall_clock timer;
timer.tic();
Method_One(WW, DD, TT, NN, WS, DS);
double t = timer.toc();
Results:
Timing elapsed for Method_One using Mat<int>: 0.091 sec
Timing elapsed for Method_Two using SpMat<int>: 30.227 sec (almost 300 times slower)
Any insights into this are highly appreciated!
UPDATE:
This issue has been fixed with a newer version (8.100.1) of Armadillo.
Here are the new results:
Timing elapsed for Method_One using Mat<int>: 0.141 sec
Timing elapsed for Method_Two using SpMat<int>: 2.127 sec (15 times slower, which is acceptable!)
Thanks to Conrad and Ryan.
As hbrerkere already mentioned, the problem stems from the fact that the values of the matrix are stored in a packed format (CSC) that makes it time-consuming to
Find the index of an already existing entry: Depending on whether the column entries are sorted by their row index you need either linear or binary search.
Insert a value that was previously zero: Here you need to find the insertion point for your new value and move all elements after that, leading to Ω(n) worst case time for a single insertion!
All these operations are constant-time operations for dense matrices, which mostly explains the runtime difference.
My usual solution was to use a separate sparse matrix type for assembly (where you usually access an element multiple times) based on the coordinate format (storing triples (i, j, value)) that uses a map like std::map or std::unordered_map to store the triple index corresponding to a position (i,j) in the matrix.
Some similar approaches are also discussed in this question about matrix assembly
Example from my most recent use:
class DynamicSparseMatrix {
    using Number = double;
    using Index = std::size_t;
    using Entry = std::pair<Index, Index>;
    std::vector<Number> values;
    std::vector<Index> rows;
    std::vector<Index> cols;
    std::map<Entry, Index> map; // unordered_map might be faster,
                                // but you need a suitable hash function
                                // like boost::hash<Entry> for this.
    Index num_rows;
    Index num_cols;
    ...
    Number& value(Index row, Index col) {
        // just to prevent misuse
        assert(row >= 0 && row < num_rows);
        assert(col >= 0 && col < num_cols);
        // Find the entry in the matrix
        Entry e{row, col};
        auto it = map.find(e);
        // If the entry hasn't previously been stored
        if (it == map.end()) {
            // Add a new entry by adding its value and coordinates
            // to the end of the storage vectors.
            it = map.insert(std::make_pair(e, values.size())).first;
            rows.push_back(row);
            cols.push_back(col);
            values.push_back(0);
        }
        // Return the value
        return values[(*it).second];
    }
    ...
};
After assembly you can store all the values from rows, cols, values (which actually represent the matrix in coordinate format), possibly sort them, and do a batch insertion into your Armadillo matrix.
Sparse matrices are stored in a compressed format (CSC). Every time a non-zero element is inserted into a sparse matrix, the entire internal representation has to be updated. This is time-consuming.
It's much faster to construct the sparse matrix using batch constructors.
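A minimal sketch of that batch construction, assuming Armadillo's batch-insertion constructor with value accumulation (the surrounding collection step is ours): gather all (row, col) increments first, then build the matrix once.
#include <armadillo>

// With add_values = true, duplicate locations are summed, which reproduces
// the WP(w_n, t_n)++ counting pattern from the question in a single pass.
arma::SpMat<int> batch_count(const arma::umat& locations, int n_rows, int n_cols)
{
    // locations is 2 x n_events: row indices in row 0, column indices in row 1
    arma::Col<int> values(locations.n_cols, arma::fill::ones); // one count per event
    return arma::SpMat<int>(true, locations, values, n_rows, n_cols);
}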

OpenCV construct range of size 1

In the following code, is there a better way to go about constructing the singleton ranges cv::Range(i, i+1) and cv::Range(j, j+1)? I would expect there to exist somewhere in OpenCV a function that creates a singleton range, e.g. just a constructor cv::Range(i) equivalent to cv::Range(i, i+1).
const int sizeA[] = { 100, 100, 100 };
cv::Mat matrix(3, sizeA, cv::DataType<int>::type);
// get submatrix (i, j, :)
int i = 8;
int j = 15;
const cv::Range ranges[] = { cv::Range(i, i+1), cv::Range(j, j+1), cv::Range::all() };
cv::Mat submatrix = matrix(ranges);
There is nothing built into OpenCV to do this. Simply write cv::Range(i, i+1) everywhere or write your own helper function.
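For instance, a tiny helper along those lines (the name single is ours):
#include <opencv2/core.hpp>

// stands in for the hypothetical cv::Range(i) constructor the question asks for
inline cv::Range single(int i) { return cv::Range(i, i + 1); }
Then the ranges above become { single(i), single(j), cv::Range::all() }.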