Defining and filling a sparse matrix using the Eigen library in C++

I am trying to build a sparse matrix using the Eigen or Armadillo library in C++ to solve a system of linear equations Ax = b. A is the n*n coefficient matrix, and b is the right-hand-side vector of dimension n.
The sparse matrix A looks like this (see the figure).
I had a look through the Eigen documentation, but I have a problem with defining and filling the sparse matrix in C++.
Could you please give me example code that defines the sparse matrix and fills in its values using the Eigen library in C++?
Consider, for example, a simple sparse matrix A:
1 2 0 0
0 3 0 0
0 0 4 5
0 0 6 7
int main()
{
    SparseMatrix<double> A;
    // fill the A matrix ????
    VectorXd b, x;
    SparseCholesky<SparseMatrix<double> > solver;
    x = solver.solve(b);
    return 0;
}

The sparse matrix could be filled with the values mentioned in the post by using the .coeffRef() member function, as shown in this routine:
SparseMatrix<double> fillMatrix() {
    int N = 4;
    int M = 4;
    SparseMatrix<double> m1(N,M);
    m1.reserve(VectorXi::Constant(M, 4)); // 4: estimated number of non-zero entries per column
    m1.coeffRef(0,0) = 1;
    m1.coeffRef(0,1) = 2.;
    m1.coeffRef(1,1) = 3.;
    m1.coeffRef(2,2) = 4.;
    m1.coeffRef(2,3) = 5.;
    m1.coeffRef(3,2) = 6.;
    m1.coeffRef(3,3) = 7.;
    return m1;
}
However, the SparseCholesky module (SimplicialCholesky<SparseMatrix<double> >) won't work in this case because the matrix is not Hermitian. The system could be solved with an LU or BiCGSTAB solver instead. Also note that the sizes of x and b need to be defined:
VectorXd b(A.rows()), x(A.cols());
For larger sparse matrices you may also want to look at the .reserve() function, which allocates memory before the elements are filled in. .reserve() takes an estimate of the number of non-zero entries per column (or per row, depending on the storage order; the default is column-major). In the example above that estimate is 4, although reserving does not really matter for such a small matrix. The documentation states that it is preferable to overestimate the number of non-zeros per column.

Since this question also asks about Armadillo, here is the corresponding Armadillo-based code. Best to use Armadillo version 9.100 or later, and link with SuperLU.
#include <armadillo>
using namespace arma;
int main()
{
    sp_mat A(4,4); // don't need to explicitly reserve the number of non-zeros
    // fill with direct element access
    A(0,0) = 1.0;
    A(0,1) = 2.0;
    A(1,1) = 3.0;
    A(2,2) = 4.0;
    A(2,3) = 5.0;
    A(3,2) = 6.0;
    A(3,3) = 7.0; // etc
    // or load the sparse matrix from a text file with the data stored in coord format
    sp_mat AA;
    AA.load("my_sparse_matrix.txt", coord_ascii);
    vec b; // ... fill b here ...
    vec x = spsolve(A,b); // solve sparse system
    return 0;
}
See also the documentation for SpMat, element access, .load(), spsolve().
The coord file format is simple. It stores only the non-zero values.
Each line contains:
row col value
The row and column counts start at zero. Example:
0 0 1.0
0 1 2.0
1 1 3.0
2 2 4.0
2 3 5.0
3 2 6.0
3 3 7.0
1000 2000 9.0
Values not explicitly listed are assumed to be zero.

#include <vector>
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <Eigen/Core>
#include <cstdlib>
using namespace Eigen;
using namespace std;
int main()
double L = 5; // Length
const int N = 120; // No of cells
double L_cell = L / N;
double k = 100; // Thermal Conductivity
double T_A = 100.;
double T_B = 200.;
double S = 1000.;
Vector<double, N> d, D, A, aL, aR, aP, S_u, S_p;
vector<double> xp;
xp.push_back((0 + L_cell) / 2.0);
double xm = xp[0];
for (int i = 0; i < N - 1; i++)
xm = xm + L_cell;
for (int i = 0; i < N; i++)
A(i) = .1;
d(i) = L_cell;
D(i) = k / d(i);
aL(0) = 0;
aR(0) = D(0) * A(0);
S_p(0) = -2 * D(0) * A(0);
aP(0) = aL(0) + aR(0) - S_p(0);
S_u(0) = 2 * D(0) * A(0) * T_A + S * L_cell * A(0);
for (int i = 1; i < N - 1; i++)
aL(i) = D(i) * A(i);
aR(i) = D(i) * A(i);
S_p(i) = 0;
aP(i) = aL(i) + aR(i) - S_p(i);
S_u(i) = S * A(i) * L_cell;
aL(N - 1) = D(N - 1) * A(N - 1);
aR(N - 1) = 0;
S_p(N - 1) = -2 * D(N - 1) * A(N - 1);
aP(N - 1) = aL(N - 1) + aR(N - 1) - S_p(N - 1);
S_u(N - 1) = 2 * D(N - 1) * A(N - 1) * T_B + S * L_cell * A(N - 1);
typedef Eigen::Triplet<double> T;
std::vector<T> tripletList;
tripletList.reserve(N * 3);
Matrix<double, N, 3> v; // v is declared here
v << (-1) * aL, aP, (-1) * aR;
for (int i = 0, j = 0; i < N && j < N; i++, j++) {
    tripletList.push_back(T(i, j, v(i, 1)));
    if (i + 1 < N && j + 1 < N) {
        tripletList.push_back(T(i + 1, j, v(i + 1, 0)));
        tripletList.push_back(T(i, j + 1, v(i, 2)));
    }
}
SparseMatrix<double> coeff(N, N);
coeff.setFromTriplets(tripletList.begin(), tripletList.end());
SimplicialLDLT<SparseMatrix<double> > solver;
solver.compute(coeff); // factorize before solving
if (solver.info() != Success) {
    cout << "decomposition failed" << endl;
}
Vector<double, N> temperature;
temperature = solver.solve(S_u);
if (solver.info() != Success)
    cout << "Solving failed" << endl;
vector<double> Te = {}, x = {};
for (int i = 0; i < N; i++)
for (int i = 0; i < N + 2; i++)
cout << x[i] << " " << Te[i] << endl;
return 0;
Here is the full code for a numerical problem that uses SparseMatrix. Look at the matrix v: it holds the values of all the non-zero elements of the coeff matrix, which is not yet defined at that point. In the loop that follows, a series of tripletList.push_back(...) calls adds one triplet (row index, column index, and the corresponding value taken from v) for each non-zero element of coeff. Then declare a SparseMatrix coeff of the appropriate size and use the setFromTriplets method (documentation) to set its non-zero elements from the triplets in tripletList.
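To make the triplet idea concrete without depending on Eigen, here is a minimal library-free sketch. The Entry struct and both function names are hypothetical stand-ins (Entry plays the role of Eigen::Triplet<double>). The sketch only illustrates that a matrix is fully described by its (row, column, value) entries and that entries with the same position add up, which is also how setFromTriplets treats duplicates by default. It is not how Eigen stores the matrix internally.

```cpp
#include <vector>

// Hypothetical stand-in for Eigen::Triplet<double>: one (row, col, value) entry.
struct Entry {
    int row;
    int col;
    double value;
};

// Computes y = A * x where A is given only as a list of entries.
// Entries that share the same (row, col) contribute additively, which mirrors
// how duplicate triplets are summed when a matrix is assembled from them.
std::vector<double> multiplyFromEntries(const std::vector<Entry>& entries,
                                        const std::vector<double>& x,
                                        int rows) {
    std::vector<double> y(rows, 0.0);
    for (const Entry& e : entries) {
        y[e.row] += e.value * x[e.col];
    }
    return y;
}

// The 4x4 example matrix from the question, as a list of its seven non-zeros.
std::vector<Entry> exampleEntries() {
    return {
        {0, 0, 1.0}, {0, 1, 2.0},
        {1, 1, 3.0},
        {2, 2, 4.0}, {2, 3, 5.0},
        {3, 2, 6.0}, {3, 3, 7.0},
    };
}
```

Multiplying the example matrix by a vector of ones, multiplyFromEntries(exampleEntries(), {1, 1, 1, 1}, 4), gives the row sums 3, 3, 9 and 13.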


Eigendecomposition of a Hermitian matrix using cuSOLVER does not match the MATLAB result

I am following the eigendecomposition example from here,
but I need to do it for a complex Hermitian matrix. The problem is that the eigenvectors do not match the MATLAB result at all.
Does anyone have any idea why this mismatch is happening?
I have also tried the cusolverDn SVD method to get the eigenvalues and eigenvectors, and that gives yet another result.
My code is here for convenience:
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include "cusolver_utils.h"
int N = 16;
void BuildMatrix(cuComplex* input);
void main()
cusolverDnHandle_t cusolverH = NULL;
cudaStream_t stream = NULL;
cuComplex* h_idata = (cuComplex*)malloc(sizeof(cuComplex) * N);
cuComplex* h_eigenVector = (cuComplex*)malloc(sizeof(cuComplex) * N); // eigen vector
float* h_eigenValue = (float*)malloc(sizeof(float) * 4); // eigen Value
int count = 0;
for (int i = 0; i < N / 4; i++)
for (int j = 0; j < 4; j++)
printf("%f + %f\t", h_idata[count].x, h_idata[count].y);
/* step 1: create cusolver handle, bind a stream */
CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
CUSOLVER_CHECK(cusolverDnSetStream(cusolverH, stream));
// step 2: reserve memory in cuda and copy input data from host to device
cuComplex* d_idata;
float* d_eigenValue = nullptr;
int* d_info = nullptr;
CUDA_CHECK(cudaMalloc((void**)&d_idata, N * sizeof(cuComplex)));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_eigenValue), N * sizeof(float)));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_info), sizeof(int)));
CUDA_CHECK(cudaMemcpyAsync(d_idata, h_idata, N * sizeof(cuComplex), cudaMemcpyHostToDevice, stream));
// step 3: query working space of syevd
cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvalues and eigenvectors.
cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
int lwork = 0; /* size of workspace */
cuComplex* d_work = nullptr; /* device workspace*/
const int m = 4;
const int lda = m;
cusolverDnCheevd_bufferSize(cusolverH, jobz, uplo, m, d_idata, lda, d_eigenValue, &lwork);
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_work), sizeof(cuComplex) * lwork));
// step 4: compute spectrum
cusolverDnCheevd(cusolverH, jobz, uplo, m, d_idata, lda, d_eigenValue, d_work, lwork, d_info);
CUDA_CHECK(cudaMemcpyAsync(h_eigenVector, d_idata, N * sizeof(cuComplex), cudaMemcpyDeviceToHost, stream));
CUDA_CHECK(cudaMemcpyAsync(h_eigenValue, d_eigenValue, 4 * sizeof(double), cudaMemcpyDeviceToHost, stream));
int info = 0;
CUDA_CHECK(cudaMemcpyAsync(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost, stream));
std::printf("after syevd: info = %d\n", info);
if (0 > info)
std::printf("%d-th parameter is wrong \n", -info);
count = 0;
for (int i = 0; i < N / 4; i++)
for (int j = 0; j < 4; j++)
printf("%f + %f\t", h_eigenVector[count].x, h_eigenVector[count].y);
for (int i = 0; i < N / 4; i++)
std::cout << h_eigenValue[i] << std::endl;
/* free resources */
//0.5560 + 0.0000i - 0.4864 + 0.0548i 0.8592 + 0.2101i - 1.5374 - 0.2069i
//- 0.4864 - 0.0548i 0.4317 + 0.0000i - 0.7318 - 0.2698i 1.3255 + 0.3344i
//0.8592 - 0.2101i - 0.7318 + 0.2698i 1.4099 + 0.0000i - 2.4578 + 0.2609i
//- 1.5374 + 0.2069i 1.3255 - 0.3344i - 2.4578 - 0.2609i 4.3333 + 0.0000i
void BuildMatrix(cuComplex* input)
{
    std::vector<float> realVector = { 0.5560, -0.4864, 0.8592, -1.5374, -0.4864, 0.4317, -0.7318, 1.3255,
        0.8592, -0.7318, 1.4099, -2.4578, -1.5374, 1.3255, -2.4578, 4.3333 };
    std::vector<float> imagVector = { 0, -0.0548, -0.2101, 0.2069, 0.0548, 0.0000, 0.2698, -0.3344,
        0.2101, -0.2698, 0, -0.2609, -0.2069, 0.3344, 0.2609, 0 };
    for (int i = 0; i < N; i++)
    {
        input[i].x = realVector[i] * std::pow(10, 11);
        input[i].y = imagVector[i] * std::pow(10, 11);
    }
}
I raised this issue on their GitHub, but unfortunately no one has answered.
If anyone can help me solve this, that would be very helpful.
Please follow the post for the clear answer,
The theory says that A*V - lambda*V = 0 should hold; in floating point, however, it will not be exactly zero. My expectation was that it would be very close to zero, on the order of e-14 or so. If the equation gives a value close to zero, the result is acceptable.
There are different algorithms for computing an eigendecomposition, such as the Jacobi algorithm, Cholesky factorization... The program in my post uses the function cusolverDnCheevd, which is based on LAPACK. The LAPACK documentation says that it uses a divide-and-conquer algorithm for Hermitian matrices. Here is the link,
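The acceptance test described above can be sketched in plain C++ on the host. This is only an illustration under stated assumptions: the function names are hypothetical, the matrix is assumed to be stored column-major (the layout cuSOLVER works with), and double precision is used for the check even though the original code computes in single precision. Checking the residual is also a fairer comparison against MATLAB than comparing eigenvector entries directly, because an eigenvector is only determined up to a complex factor of modulus 1.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Returns the largest entry magnitude of A*v - lambda*v for one eigenpair.
// A is n x n, stored column-major (element (r, c) lives at index r + c * n).
// v has length n. Hypothetical helper, not part of cuSOLVER.
double eigenResidual(const std::vector<cplx>& A, const std::vector<cplx>& v,
                     double lambda, std::size_t n) {
    double worst = 0.0;
    for (std::size_t r = 0; r < n; ++r) {
        cplx acc(0.0, 0.0);
        for (std::size_t c = 0; c < n; ++c) {
            acc += A[r + c * n] * v[c];
        }
        worst = std::max(worst, std::abs(acc - lambda * v[r]));
    }
    return worst;
}

// Small self-check: the Hermitian matrix [[2, i], [-i, 2]] has the
// eigenpair lambda = 3, v = (i, 1) / sqrt(2).
double exampleResidual() {
    const std::vector<cplx> A = {cplx(2, 0), cplx(0, -1),   // column 0
                                 cplx(0, 1), cplx(2, 0)};   // column 1
    const double s = 1.0 / std::sqrt(2.0);
    const std::vector<cplx> v = {cplx(0, s), cplx(s, 0)};
    return eigenResidual(A, v, 3.0, 2);
}
```

With float data scaled by 10^11 as in the question, a residual near e-14 is probably not a realistic target; judging the residual relative to the size of the matrix entries is likely more meaningful.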

failed assertion during applying BDCSVD

I am using the following struct in my project and the problem occurs in the second constructor. (I am using Visual Studio 2019.)
struct optimal_subspace {
vector<Eigen::VectorXd> span;
//empty constructor
optimal_subspace() {}
//constructor taking a pointset, the cluster number i and the size of subspaces q
//used for the k-means subspace algorithm on the whole pointset
optimal_subspace(vector<point>& pointset, int i, int q) {
//declare a vector to contain the span
vector<Eigen::VectorXd> subspace_span;
//declare integers n,d to save the dimensions of the current data matrix
int n, d;
//declare integer r to save the minimum of n and d
int r;
//using the constructor of the struct subspacematrix to get the data matrix of cluster i, the cluster mean is already subtracted
subspacematrix sm(pointset, i);
Eigen::MatrixXd m = sm.matrix;
//save the dimensions of m
n = m.rows();
d = m.cols();
//determine min(n,d)
r = min(n, d);
//check if the cluster contains points
if (sm.status == true) {
//use either Jacobi or BDCSVD according to the size of m, declare v to save V from the SVD D = U E V^T or thin SVD
//Jacobi better for matrices smaller than 16x16
Eigen::MatrixXd v;
//if r < q compute the Full decomposition as otherwise there are not enough singular vectors to obtain a q-dimensional subspace
//else compute the thin decomposition
clock_t start = clock();
if (n < 16 & d < 16) {
Eigen::JacobiSVD<Eigen::MatrixXd> svd(m, Eigen::ComputeThinU | Eigen::ComputeThinV);
v = svd.matrixV();
else {
Eigen::BDCSVD<Eigen::MatrixXd> svd(m, Eigen::ComputeThinU | Eigen::ComputeThinV);
v = svd.matrixV();
clock_t stop = clock();
svd_time += (double) (stop - start) / CLOCKS_PER_SEC;
for (int j = 0; j < min(q, r); j++) {
//V is of the form dxr, so, we take the r columns
//currentsubspace.push_back(v.col(j) + mean);
//if r < q, we fill the subspaces by taking the coordinates of random points outside the cluster
if (min(q, r) < q) {
vector<int> non_cluster_indices = opp_ind(sm.cluster_indices, pointset.size());
uniform_int_distribution<int> uniform_dist(0, non_cluster_indices.size());
//pick randomly a point outside the cluster and add it
for (int j = min(q, r); j < q; j++) {
Eigen::VectorXd non_cluster_vector = pointset[non_cluster_indices[uniform_dist(mt)]].getcoord();
//orthonormalize the span
stableGramSchmidt(subspace_span, min(q, r));
if (subspace_span.size() == 0) cout << "error: empty subspace added" << endl;
span = subspace_span;
//constructor taking a pointset, the cluster number i and the size of subspaces q and a vector of indices representing a sample
//used for sampling k-means
optimal_subspace(vector<point>& pointset, int i, int q, vector<int> indices) {
//declare a vector to contain the span
vector<Eigen::VectorXd> subspace_span;
//declare integers n,d to save the dimensions of the current data matrix
int n, d;
//declare integer r to save the minimum of n and d
int r;
//using the constructor of the struct subspacematrix to get the data matrix of cluster i, the cluster mean is already subtracted
subspacematrix sm(pointset, indices, i);
Eigen::MatrixXd m = sm.matrix;
//save the dimensions of m
n = m.rows();
d = m.cols();
//check if the cluster contains points
if (sm.status == true) {
//use either Jacobi or BDCSVD according to the size of m, declare v to save V from the SVD D = U E V^T or thin SVD
//Jacobi better for matrices smaller than 16x16
Eigen::MatrixXd v;
//if r < q compute the Full decomposition as otherwise there are not enough singular vectors to obtain a q-dimensional subspace
//else compute the thin decomposition
clock_t start = clock();
if (n < 16 & d < 16) {
Eigen::JacobiSVD<Eigen::MatrixXd> svd(m, Eigen::ComputeThinU | Eigen::ComputeThinV);
v = svd.matrixV();
else {
//ofstream file("problematicmatrix.txt", ofstream::trunc);
//file << sm.matrix.format(CommaInitFmt) << endl;
//Eigen::MatrixXd matrix = load_csv<Eigen::MatrixXd>("problematicmatrix.txt");
Eigen::BDCSVD<Eigen::MatrixXd> svd(sm.matrix, Eigen::ComputeThinU | Eigen::ComputeThinV);
v = svd.matrixV();
clock_t stop = clock();
svd_time += (double) (stop - start) / CLOCKS_PER_SEC;
int v_cols = v.cols();
int fill_up_index = min(q, v_cols);
for (int j = 0; j < fill_up_index; j++) {
//if we don't have enough columns, we fill the subspaces by taking the coordinates of random points outside the cluster
if (fill_up_index < q) {
vector<int> non_cluster_indices = opp_ind(sm.cluster_indices, indices);
uniform_int_distribution<int> uniform_dist(0, non_cluster_indices.size() - 1);
//pick randomly a point outside the cluster and add it
for (int j = fill_up_index; j < q; j++) {
Eigen::VectorXd non_cluster_vector = pointset[non_cluster_indices[uniform_dist(mt)]].getcoord();
//orthonormalize the span
stableGramSchmidt(subspace_span, fill_up_index);
if (subspace_span.size() == 0) cout << "error: empty subspace added" << endl;
span = subspace_span;
I get the following exception:
Unhandled exception at 0x00007FF6CB72BD3B in MAaktuell.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
When I debug after getting it, I end up in BDCSVD.h.
I also ran it in debug mode and got the following error message:
Assertion failed: index >= 0 && index < size(), file C:\Users\Marcel\Desktop\eigen-3.3.7\eigen-3.3.7\Eigen\src\Core\DenseCoeffsBase.h, line 180
I stored the matrix in a .txt file using the IOFormat provided by Eigen, as follows (I also included this in the second constructor; it is commented out right now):
ofstream file("problematicmatrix.txt", ofstream::trunc);
file << sm.matrix.format(CommaInitFmt) << endl;
and uploaded it here:
problematic matrix in a txt.file
However, I tried to compute the BDCSVD for this matrix again as follows:
Eigen::MatrixXd matrix = load_csv<Eigen::MatrixXd>("problematicmatrix.txt");
Eigen::BDCSVD<Eigen::MatrixXd> svd(matrix, Eigen::ComputeThinU | Eigen::ComputeThinV);
and then it works. If I include saving and loading the matrix in my method, it fails again. Can anyone help me find the error? Why do I end up in the BDCSVD header when debugging?
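Independent of the assertion itself, one detail in the code above is worth checking: std::uniform_int_distribution samples from a closed interval, so the first constructor's upper bound of non_cluster_indices.size() can produce an index one past the end, and the second constructor's size() - 1 is only valid when the vector is not empty. Whether this is related to the crash is not clear from the post. A minimal sketch of a guarded pick, with a hypothetical helper name and std::mt19937 assumed as the generator type behind mt:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Picks a uniformly random element of `candidates` and stores it in `picked`.
// std::uniform_int_distribution uses the closed interval [a, b], so the upper
// bound must be size() - 1. An empty vector is reported through the return
// value instead of being sampled.
bool pickRandomIndex(const std::vector<int>& candidates, std::mt19937& rng, int& picked) {
    if (candidates.empty()) {
        return false;
    }
    std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
    picked = candidates[dist(rng)];
    return true;
}
```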

Trying to compute e^x when x_0 = 1

I am trying to compute the Taylor series expansion for e^x at x_0 = 1. I am having a very hard time understanding what it really is I am looking for. I am pretty sure I am trying to find a decimal approximation of e^x at x_0 = 1. However, when I run this code with x_0 = 0, I get the wrong output, which leads me to believe that I am computing this incorrectly.
Here is my class e.hpp
#ifndef E_HPP
#define E_HPP
class E
{
public:
    int factorial(int n);
    double computeE();
    int fact = 1;
    int x_0 = 1;
    int x = 1;
    int N = 10;
    double e = 2.718;
    double sum = 0.0;
};
#endif
Here is my e.cpp
#include "e.hpp"
#include <cmath>
#include <iostream>
int E::factorial(int n)
{
    if(n == 0) return 1;
    for(int i = 1; i <= n; ++i)
    {
        fact = fact * i;
    }
    return fact;
}
double E::computeE()
{
    sum = std::pow(e,x_0);
    for(int i = 1; i < N; ++i)
    {
        sum += ((std::pow(x-x_0,i))/factorial(i));
    }
    return e * sum;
}
In main.cpp
#include "e.hpp"
#include <iostream>
#include <cmath>
int main()
{
    E a;
    std::cout << "E calculated at x_0 = 1: " << a.computeE() << std::endl;
    std::cout << "E Calculated with std::exp: " << std::exp(1) << std::endl;
}
E calculated at x_0 = 1: 7.38752
E calculated with std::exp: 2.71828
When I change to x_0 = 0.
E calculated at x_0 = 0: 7.03102
E calculated with std::exp: 2.71828
What am I doing wrong? Am I implementing the Taylor Series incorrectly? Is my logic incorrect somewhere?
Yeah, your logic is incorrect somewhere.
Like Dan says, you have to reset fact to 1 each time you calculate the factorial. You might even make it local to the factorial function.
In the return statement of computeE you are multiplying the sum by e, which you do not need to do. The sum is already the Taylor approximation of e^x.
The Taylor series for e^x about 0 is sum_{i=0}^{infinity} (x^i / i!), so x_0 should indeed be 0 in your program.
Technically your computeE computes the right value for sum when you have x_0 = 0, but it's kind of strange. The Taylor series starts at i = 0, but you start the loop with i = 1. However, the first term of the Taylor series is x^0 / 0! = 1 and you initialize sum to std::pow(e, x_0) = std::pow(e, 0) = 1, so it works out mathematically.
(Your computeE function also computed the right value for sum when you had x_0 = 1. You initialized sum to std::pow(e, 1) = e, and then the for loop didn't change its value at all because x - x_0 = 0.)
However, as I said, in either case you don't need to multiply it by e in the return statement.
I would change the computeE code to this:
double E::computeE()
{
    sum = 0;
    for(int i = 0; i < N; ++i)
    {
        sum += ((std::pow(x-x_0,i))/factorial(i));
        std::cout << sum << std::endl;
    }
    return sum;
}
and set x_0 = 0.
"fact" must be reset to 1 each time you calculate factorial. It should be a local variable instead of a class variable.
When "fact" is a class variable and you let "factorial" change it to, say, 6, it will still have the value 6 when you call "factorial" a second time. And this will only get worse. Remove your declaration of "fact" and use this instead:
int E::factorial(int n)
{
    int fact = 1;
    if(n == 0) return 1;
    for(int i = 1; i <= n; ++i)
    {
        fact = fact * i;
    }
    return fact;
}
Write less code.
Don't use factorial.
Here it is in Java. You should have no trouble converting this to C++:
public class TaylorSeries {
    private static final int DEFAULT_NUM_TERMS = 50;
    public static void main(String[] args) {
        int xmax = (args.length > 0) ? Integer.valueOf(args[0]) : 10;
        for (int i = 0; i < xmax; ++i) {
            System.out.println(String.format("x: %10.5f series exp(x): %10.5f function exp(x): %10.5f", (double)i, exp(i), Math.exp(i)));
        }
    }
    public static double exp(double x) {
        return exp(DEFAULT_NUM_TERMS, x);
    }
    // This is the Taylor series for exp that you want to port to C++
    public static double exp(int n, double x) {
        double value = 1.0;
        double term = 1.0;
        for (int i = 1; i <= n; ++i) {
            term *= x/i;
            value += term;
        }
        return value;
    }
}
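A possible C++ port of the exp(int, double) method above, as a sketch (the names taylorExp and taylorExpError are made up for this example):

```cpp
#include <cmath>

// Taylor series for exp(x) about 0, summing the terms x^i / i! for i = 0..n.
// Each term is obtained from the previous one (term_i = term_{i-1} * x / i),
// so no factorial or pow call is needed and no integer can overflow.
double taylorExp(int n, double x) {
    double value = 1.0;  // the i = 0 term
    double term = 1.0;
    for (int i = 1; i <= n; ++i) {
        term *= x / i;
        value += term;
    }
    return value;
}

// Absolute difference from the library value, handy for a quick sanity check.
double taylorExpError(int n, double x) {
    return std::fabs(taylorExp(n, x) - std::exp(x));
}
```

Because the running term replaces factorial(i), the shared fact member from the question is not needed at all.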

RcppParallel Parallelizing distance computation: segfault

I have a matrix, for which I want to compute the distance (let's say Euclidean) between the ith row and every other row(i.e. I want the ith row of the pairwise distance matrix).
#include <Rcpp.h>
#include <cmath>
#include <algorithm>
#include <RcppParallel.h>
//#include <RcppArmadillo.h>
#include <queue>
using namespace std;
using namespace Rcpp;
using namespace RcppParallel;
// [[Rcpp::export]]
double dist_fun(NumericVector row1, NumericVector row2){
    double rval = 0;
    for (int i = 0; i < row1.length(); i++){
        rval += (row1[i] - row2[i]) * (row1[i] - row2[i]);
    }
    return rval;
}
// [[Rcpp::export]]
NumericVector dist_row(NumericMatrix mat, int i){
    NumericVector row(mat.nrow());
    NumericMatrix::Row row1 = mat.row(i - 1);
    for (int j = 0; j < mat.nrow(); j++){
        NumericMatrix::Row row2 = mat.row(j);
        row(j) = dist_fun(row1, row2);
    }
    return row;
}
// [[Rcpp::depends(RcppParallel)]]
struct JsDistance: public Worker {
    // input matrix to read from
    const NumericMatrix mat;
    int i;
    // output vector to write to
    NumericVector output;
    // initialize from Rcpp input and output matrixes (the RMatrix class
    // can be automatically converted to from the Rcpp matrix type)
    JsDistance(const NumericMatrix mat, int i, NumericVector output)
        : mat(mat), i(i), output(output) {}
    // function call operator that work for the specified range (begin/end)
    void operator()(std::size_t begin, std::size_t end) {
        NumericVector row1 = mat.row(i);
        for (std::size_t j = begin; j < end; j++) {
            NumericVector row2 = mat.row(j);
            output[j] = dist_fun(row1, row2);
        }
    }
};
// [[Rcpp::export]]
NumericVector parallel_dist_row(NumericMatrix mat, int i) {
    // allocate the matrix we will return
    NumericVector output(mat.nrow());
    // create the worker
    JsDistance JsDistance(mat, i, output);
    // call it with parallelFor
    parallelFor(0, mat.nrow(), JsDistance);
    return output;
}
The sequential way using Rcpp is the function 'dist_row' as written above. However, the matrix I want to work with is very large, so I want to parallelize it. When I do, I run into a segfault that I don't quite understand. To trigger the error you can run the following code:
setThreadOptions(numThreads = 20)
X = matrix(rnorm(10000 * 400), 10000, 400)
start1 = proc.time()
print(dist_row(X, 2)[1:30])
print(proc.time() - start1)
start2 = proc.time()
print(parallel_dist_row(X, 2)[1:30])
print(proc.time() - start2)
Can someone give me some hint about what I did wrong? Thanks in advance for your time!
inline double d(double a, double b){
    return fabs(a - b);
}
// [[Rcpp::depends(RcppParallel)]]
struct dtwDistance: public Worker {
    // Input matrix to read from must be of the RMatrix<T> form
    // if using Rcpp objects
    const RMatrix<double> mat;
    int i;
    // Output vector to write to must be of the RVector<T> form
    // if using Rcpp objects
    RVector<double> output;
    // initialize from Rcpp input and output matrixes (the RMatrix class
    // can be automatically converted to from the Rcpp matrix type)
    dtwDistance(const NumericMatrix mat, int i, NumericVector output)
        : mat(mat), i(i - 1), output(output) {}
    // Note the -1 ^^^^ to match results from prior function
    // Function call operator to iterate over a specified range (begin/end)
    void operator()(std::size_t begin, std::size_t end) {
        RMatrix<double>::Row row1 = mat.row(i);
        for (std::size_t j = begin; j < end; ++j) {
            RMatrix<double>::Row row2 = mat.row(j);
            size_t n = row1.length();
            size_t m = row2.length();
            NumericMatrix cost(n + 1, m + 1);
            for (int ii = 1; ii <= n; ii++){
                cost(i, 0) = numeric_limits<double>::infinity();
            }
            for (int jj = 1; jj <= m; jj++){
                cost(0, j) = numeric_limits<double>::infinity();
            }
            for (int ii = 1; ii <= n; ii++){
                for (int jj = 1; jj <= m; jj++){
                    double dist = d(row1[ii - 1], row2[jj - 1]);
                    cost(ii, jj) = dist + min(min(cost(ii - 1, jj), cost(ii, jj - 1)), cost(ii - 1, jj - 1));
                    //cout << ii << ", " << jj << ", " << cost(ii, jj) << "\n";
                }
            }
            output[j] = cost(n, m);
        }
    }
};
// [[Rcpp::export]]
NumericVector parallel_dist_row_dtw(NumericMatrix mat, int i) {
    // allocate the matrix we will return
    //RMatrix<double> input(mat);
    NumericVector y(mat.nrow());
    //RVector<double> output(y);
    // create the worker
    dtwDistance dtwDistance(mat, i, y);
    // call it with parallelFor
    parallelFor(0, mat.nrow(), dtwDistance);
    return y;
}
The distance I actually need to calculate is the dynamic time warping distance, which I implemented as above. When I run it, however, it gives a 'stack imbalance' warning, and after several runs there is a segfault. I'm wondering what the problem is now.
To trigger the problem, I did:
setThreadOptions(numThreads = 4)
X = matrix(rnorm(1000), 100, 10)
parallel_dist_row_dtw(X, 1)
parallel_dist_row_dtw(X, 2)
parallel_dist_row_dtw(X, 3)
parallel_dist_row_dtw(X, 4)
parallel_dist_row_dtw(X, 5)
The issue is that you are not using the thread-safe wrappers around R objects, RMatrix<T> and RVector<T>. These classes matter because the parallel work is executed on background threads, where it is not safe to call the R or Rcpp APIs. The official documentation emphasizes this in the Safe Accessors section.
In particular, we have:
To provide safe and convenient access to the arrays underlying R vectors and matrices RcppParallel introduces several accessor classes:
RVector<T> — Wrap R vectors of various types
RMatrix<T> — Wrap R matrices of various types (also includes Row and Column classes)
To create a thread safe accessor for an Rcpp vector or matrix just construct an instance of RVector or RMatrix with it.
Code Fix
So, your work can be fixed by switching *Matrix to RMatrix<T> and *Vector to RVector<T>.
struct JsDistance: public Worker {
    // Input matrix to read from must be of the RMatrix<T> form
    // if using Rcpp objects
    const RMatrix<double> mat;
    int i;
    // Output vector to write to must be of the RVector<T> form
    // if using Rcpp objects
    RVector<double> output;
    // initialize from Rcpp input and output matrixes (the RMatrix class
    // can be automatically converted to from the Rcpp matrix type)
    JsDistance(const NumericMatrix mat, int i, NumericVector output)
        : mat(mat), i(i - 1), output(output) {}
    // Note the -1 ^^^^ to match results from prior function
    // Function call operator to iterate over a specified range (begin/end)
    void operator()(std::size_t begin, std::size_t end) {
        RMatrix<double>::Row row1 = mat.row(i);
        for (std::size_t j = begin; j < end; ++j) {
            RMatrix<double>::Row row2 = mat.row(j);
            double rval = 0;
            for (unsigned int k = 0; k < row1.length(); ++k) {
                rval += (row1[k] - row2[k]) * (row1[k] - row2[k]);
            }
            output[j] = rval;
        }
    }
};
In particular, the data types used here are of the form RMatrix<double> even for accessing the matrix.
Also, the parallelized version was missing the i - 1 offset. To remedy this, I've opted to have it taken care of in the constructor of JsDistance.
X = matrix(rnorm(10000 * 400), 10000, 400)
start1 = proc.time()
print(dist_row(X, 2)[1:30])
# [1] 811.8873 0.0000 799.8153 810.1442 720.3232 730.6083 797.8441 781.8066 827.1511 834.1863 842.9392 850.2476 724.5842 673.1428 775.0994
# [16] 805.5752 804.9281 774.9770 799.7669 870.3187 815.1129 934.7581 726.1554 804.2097 758.4943 772.8931 806.6026 715.8257 847.8980 831.7555
print(proc.time() - start1)
# user system elapsed
# 0.22 0.00 0.23
start2 = proc.time()
print(parallel_dist_row(X, 2)[1:30])
# [1] 811.8873 0.0000 799.8153 810.1442 720.3232 730.6083 797.8441 781.8066 827.1511 834.1863 842.9392 850.2476 724.5842 673.1428 775.0994
# [16] 805.5752 804.9281 774.9770 799.7669 870.3187 815.1129 934.7581 726.1554 804.2097 758.4943 772.8931 806.6026 715.8257 847.8980 831.7555
print(proc.time() - start2)
# user system elapsed
# 0.28 0.00 0.06
all.equal(parallel_dist_row(X, 2), dist_row(X, 2))
# [1] TRUE
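The same reasoning probably explains the 'stack imbalance' warning in the dynamic time warping follow-up: dtwDistance constructs a NumericMatrix inside operator(), and creating an Rcpp object allocates through R, which is not safe on a worker thread. The initialisation loops there also write to cost(i, 0) and cost(0, j) rather than cost(ii, 0) and cost(0, jj). A minimal sketch of the recurrence that avoids both, assuming the two rows have first been copied into std::vector<double> (the function name dtwDistancePlain is hypothetical):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Dynamic time warping distance between two sequences with |a - b| as the
// local cost. The cost table is a std::vector, so nothing here touches the
// R or Rcpp API.
double dtwDistancePlain(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    const double inf = std::numeric_limits<double>::infinity();
    const std::size_t stride = m + 1;  // row-major (n + 1) x (m + 1) table

    // Every cell starts at infinity except the origin, which covers the
    // first-row and first-column initialisation in one step.
    std::vector<double> cost((n + 1) * stride, inf);
    cost[0] = 0.0;

    for (std::size_t ii = 1; ii <= n; ++ii) {
        for (std::size_t jj = 1; jj <= m; ++jj) {
            const double dist = std::fabs(a[ii - 1] - b[jj - 1]);
            const double best = std::min({cost[(ii - 1) * stride + jj],
                                          cost[ii * stride + (jj - 1)],
                                          cost[(ii - 1) * stride + (jj - 1)]});
            cost[ii * stride + jj] = dist + best;
        }
    }
    return cost[n * stride + m];
}
```

Inside the worker, output[j] would then be assigned the value returned for rows i and j.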

Memory Overflow? std::bad_alloc

I have a program that solves 1D Brownian motion in general using Euler's method.
Because it is a stochastic process, I want to average over many particles. But I find that as I ramp up the number of particles, it overloads and I get a std::bad_alloc error, which I understand is a memory error.
Here is my full code:
#include <iostream>
#include <vector>
#include <fstream>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <ctime>
using namespace std;
// Box-Muller Method to generate gaussian numbers
double generateGaussianNoise(double mu, double sigma) {
const double epsilon = std::numeric_limits<double>::min();
const double tau = 2.0 * 3.14159265358979323846;
static double z0, z1;
static bool generate;
generate = !generate;
if (!generate) return z1 * sigma + mu;
double u1, u2;
do {
u1 = rand() * (1.0 / RAND_MAX);
u2 = rand() * (1.0 / RAND_MAX);
} while (u1 <= epsilon);
z0 = sqrt(-2.0 * log(u1)) * cos(tau * u2);
z1 = sqrt(-2.0 * log(u1)) * sin(tau * u2);
return z0 * sigma + mu;
int main() {
// Initialize Variables
double gg; // Gaussian Number Picked from distribution
// Integrator
double t0 = 0; // Setting the Time Window
double tf = 10;
double n = 5000; // Number of Steps
double h = (tf - t0) / n; // Time Step Size
// Set Constants
const double pii = atan(1) * 4; // pi
const double eta = 1; // viscous constant
const double m = 1; // mass
const double aa = 1; // radius
const double Temp = 30; // Temperature in Kelvins
const double KB = 1; // Boltzmann Constant
const double alpha = (6 * pii * eta * aa);
// More Constants
const double mu = 0; // Gaussian Mean
const double sigma = 1; // Gaussian Std Deviation
const double ng = n; // No. of pts to generate for Gauss distribution
const double npart = 1000; // No. of Particles
// Initial Conditions
double x0 = 0;
double y0 = 0;
double t = t0;
// Vectors
vector<double> storX; // Vector that keeps displacement values
vector<double> storY; // Vector that keeps velocity values
vector<double> storT; // Vector to store time
vector<double> storeGaussian; // Vector to store Gaussian numbers generated
vector<double> holder; // Placeholder Vector for calculation operations
vector<double> mainstore; // Vector that holds the final value desired
// Prepares mainstore
for (int z = 0; z < (n+1); z++) {
for (int NN = 0; NN < npart; NN++) {
// Prepares holder
for (int z = 0; z < (n+1); z++) {
// Gaussian Generator
for (double iiii = 0; iiii < ng; iiii++) {
gg = generateGaussianNoise(0, 1); // generateGaussianNoise(mu,sigma)
// Solver
for (int ii = 0; ii < n; ii++) {
storY[ii + 1] =
storY[ii] - (alpha / m) * storY[ii] * h +
(sqrt(2 * alpha * KB * Temp) / m) * sqrt(h) * storeGaussian[ii];
storX[ii + 1] = storX[ii] + storY[ii] * h;
holder[ii + 1] =
pow(storX[ii + 1], 2); // Finds the displacement squared
t = t + h;
// Updates the Main Storage
for (int z = 0; z < storX.size(); z++) {
mainstore[z] = mainstore[z] + holder[z];
// Average over the number of particles
for (int z = 0; z < storX.size(); z++) {
mainstore[z] = mainstore[z] / (npart);
// Outputs the data
ofstream fout("LangevinEulerTest.txt");
for (int jj = 0; jj < storX.size(); jj++) {
fout << storT[jj] << '\t' << mainstore[jj] << '\t' << storX[jj] << endl;
return 0;
As you can see, npart is the variable that I change to vary the number of particles. After each iteration I do clear my storage vectors such as storX and storY, so on paper the number of particles should not affect memory use. I am just asking the program to repeat the loop many more times and add the result onto the main storage vector mainstore. I am running my code on a computer with 4 GB of RAM.
I would greatly appreciate it if anyone could point out errors in my logic or suggest improvements.
Edit: Currently the number of particles is set to npart = 1000.
When I try to ramp it up to something like npart = 20000 or npart = 50000, it gives me memory errors.
Edit 2: I've edited the code to allocate an extra index for each of the storage vectors, but that does not seem to fix the memory overflow.
There is an out-of-bounds access in the solver part. storY has size n and you access index ii+1, where ii goes up to n-1. So for the code provided, storY has size 5000: indices 0 to 4999 (inclusive) are valid, but you try to access index 5000. The same applies to storX, holder and mainstore.
Also, storeGaussian does not get cleared before new values are added. It grows by n on each iteration of the npart loop, although the solver part only ever reads its first n values.
Please note that vector::clear removes all elements from the vector but does not necessarily change the vector's capacity (i.e. its storage array); see the documentation.
This won't cause the problem here, because you'll reuse the same array in the next runs, but it's something to be aware of when using vectors.
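Both points can be sidestepped by keeping a single running sum of n + 1 values and drawing each Gaussian number at the moment it is used. The following is a sketch under assumptions, not the asker's program: the function name and parameter list are hypothetical, and std::normal_distribution is swapped in for the hand-written Box-Muller routine. The Euler update itself follows the question.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Averages x(t)^2 over `npart` independent particles. Only one vector of
// n + 1 running sums is kept, so memory use does not depend on npart.
std::vector<double> meanSquareDisplacement(std::size_t npart, std::size_t n, double h,
                                           double alpha, double m, double kBT,
                                           unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    const double kick = std::sqrt(2.0 * alpha * kBT) / m * std::sqrt(h);

    std::vector<double> sumSq(n + 1, 0.0);  // indices 0..n are all valid
    for (std::size_t p = 0; p < npart; ++p) {
        double x = 0.0;  // displacement
        double v = 0.0;  // velocity
        for (std::size_t step = 0; step < n; ++step) {
            const double vNext = v - (alpha / m) * v * h + kick * gauss(rng);
            x += v * h;  // uses the velocity from before the update, as in the question
            v = vNext;
            sumSq[step + 1] += x * x;
        }
    }
    for (double& s : sumSq) {
        s /= static_cast<double>(npart);
    }
    return sumSq;
}
```

With this layout, raising npart from 1000 to 50000 changes only the running time, since no per-particle data outlives one pass of the outer loop.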