I have a performance critical piece of code, where I need to check one array for values below a threshold and then conditionally set the values of two other arrays. My code looks like this:
#include <Eigen/Dense>
int main(){
    Eigen::ArrayXXd
        a (1, 100),
        b (2, 100),
        c (3, 100);
    constexpr double minVal { 1e-8 };
    /* the code segment in question */
    /* option 1 */
    for ( int i=0; i<2; ++i ){
        b.row(i) = (a < minVal).select( 0, c.row(i+1) / a );
        c.row(i+1) = (a < minVal).select( 0, c.row(i+1) );
    }
    /* option 2, which is slower */
    b = (a < minVal).replicate(2,1).select( 0, c.bottomRows(2) / a.replicate(2,1) );
    c.bottomRows(2) = (a < minVal).replicate(2,1).select( 0, c.bottomRows(2) );
    return 0;
}
The array a, whose values are checked for reaching the threshold minVal, has one row and a dynamic number of columns. The other two arrays b and c have two and three rows, respectively, and the same number of columns as a.
Now I would like to express the above logic in a more Eigen-like way, without the loop in option 1, because Eigen typically has performance tricks up its sleeve that I can never hope to match when writing raw loops.
However, the only way I could think of was option 2, which is noticeably slower than option 1.
What would be the right and efficient way to do the above? Or is the loop already my best option?
You can try the following:
Define your array types with a fixed number of rows and a dynamic number of columns, i.e., replace Eigen::ArrayXXd with Eigen::Array<double, 1/2/3, Eigen::Dynamic>.
Use the fixed-size versions of the block operations, i.e., replace bottomRows(N) with bottomRows<N>() and, similarly, replicate(2,1) with replicate<2,1>().
I have changed the array types in your code and included a third option with the possible improvements that I have mentioned:
#include <Eigen/Dense>
#include <iostream>
#include <chrono>
constexpr int numberOfTrials = 1000000;
constexpr double minVal{ 1e-8 };
typedef Eigen::Array<double, 1, Eigen::Dynamic> Array1Xd;
typedef Eigen::Array<double, 2, Eigen::Dynamic> Array2Xd;
typedef Eigen::Array<double, 3, Eigen::Dynamic> Array3Xd;
inline void option1(const Array1Xd& a, Array2Xd& b, Array3Xd& c)
{
    for (int i = 0; i < 2; ++i) {
        b.row(i) = (a < minVal).select(0, c.row(i + 1) / a);
        c.row(i + 1) = (a < minVal).select(0, c.row(i + 1));
    }
}
inline void option2(const Array1Xd& a, Array2Xd& b, Array3Xd& c)
{
    b = (a < minVal).replicate(2, 1).select(0, c.bottomRows(2) / a.replicate(2, 1));
    c.bottomRows(2) = (a < minVal).replicate(2, 1).select(0, c.bottomRows(2));
}
inline void option3(const Array1Xd& a, Array2Xd& b, Array3Xd& c)
{
    b = (a < minVal).replicate<2, 1>().select(0, c.bottomRows<2>() / a.replicate<2, 1>());
    c.bottomRows<2>() = (a < minVal).replicate<2, 1>().select(0, c.bottomRows<2>());
}
int main() {
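Array1Xd a(1, 100);
Array2Xd b(2, 100);
Array3Xd c(3, 100);
// fill with random values so the benchmark does not read uninitialized memory
a.setRandom();
b.setRandom();
c.setRandom();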
auto tpBegin1 = std::chrono::steady_clock::now();
for (int i = 0; i < numberOfTrials; i++)
option1(a, b, c);
auto tpEnd1 = std::chrono::steady_clock::now();
auto tpBegin2 = std::chrono::steady_clock::now();
for (int i = 0; i < numberOfTrials; i++)
option2(a, b, c);
auto tpEnd2 = std::chrono::steady_clock::now();
auto tpBegin3 = std::chrono::steady_clock::now();
for (int i = 0; i < numberOfTrials; i++)
option3(a, b, c);
auto tpEnd3 = std::chrono::steady_clock::now();
std::cout << "(Option 1) Average execution time: " << std::chrono::duration_cast<std::chrono::microseconds>(tpEnd1 - tpBegin1).count() / (long double)(numberOfTrials) << " us" << std::endl;
std::cout << "(Option 2) Average execution time: " << std::chrono::duration_cast<std::chrono::microseconds>(tpEnd2 - tpBegin2).count() / (long double)(numberOfTrials) << " us" << std::endl;
std::cout << "(Option 3) Average execution time: " << std::chrono::duration_cast<std::chrono::microseconds>(tpEnd3 - tpBegin3).count() / (long double)(numberOfTrials) << " us" << std::endl;
return 0;
}
Average execution times that I have obtained are as follows (i7-9700K, msvc2019, optimizations enabled, NDEBUG):
(Option 1) Average execution time: 0.527717 us
(Option 2) Average execution time: 3.25618 us
(Option 3) Average execution time: 0.512029 us
And with AVX2+OpenMP enabled:
(Option 1) Average execution time: 0.374309 us
(Option 2) Average execution time: 3.31356 us
(Option 3) Average execution time: 0.260551 us
I'm not sure if it is the most "Eigen" way but I hope it helps!
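For reference, the per-element logic that all three options express can be written as one fused pass over the columns. The sketch below uses plain std::vector storage instead of Eigen arrays; the function name clampAndDivide and the flat row-major layout are assumptions made for illustration and are not part of the code above. It may be useful as a baseline when checking results or timings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference implementation (plain C++, not Eigen).
// a is 1 x n, b is 2 x n, c is 3 x n, each stored row-major in a flat vector.
void clampAndDivide(const std::vector<double>& a,
                    std::vector<double>& b,
                    std::vector<double>& c,
                    double minVal)
{
    const std::size_t n = a.size();
    assert(b.size() == 2 * n && c.size() == 3 * n);

    for (std::size_t j = 0; j < n; ++j) {
        const bool belowMin = a[j] < minVal;
        for (std::size_t i = 0; i < 2; ++i) {
            double& cij = c[(i + 1) * n + j];            // rows 1 and 2 of c
            b[i * n + j] = belowMin ? 0.0 : cij / a[j];  // uses c before it is zeroed
            if (belowMin) cij = 0.0;
        }
    }
}
```

Row 0 of c is never touched, which matches the use of c.row(i+1) and bottomRows(2) in the Eigen versions.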
I am running code that simply creates 2 matrices: one of dimensions arows x nsame and the other of dimensions nsame x bcols. The result is an array of dimensions arows x bcols. This is fairly simple to implement using BLAS, and the following code appears to work as intended when using the master-slave model below with OpenMPI:
#include <iostream>
#include <stdio.h>
#include <cmath>
#include <mpi.h>
#include <gsl/gsl_blas.h>
using namespace std;
int main(int argc, char** argv){
int noprocs, nid;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &nid);
MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
int master = 0;
const int nsame = 500; //must be same if matrices multiplied together = acols = brows
const int arows = 500;
const int bcols = 527; //works for 500 x 500 x 527 and 6000 x 100 x 36
int rowsent;
double buff[nsame];
double b[nsame*bcols];
double c[arows][bcols];
double CC[1*bcols]; //here ncols corresponds to numbers of rows for matrix b
for (int i = 0; i < bcols; i++){
CC[i] = 0.;
// Master part
if (nid == master ) {
double a [arows][nsame]; //creating identity matrix of dimensions arows x nsame (it is I if arows = nsame)
for (int i = 0; i < arows; i++){
for (int j = 0; j < nsame; j++){
if (i == j)
a[i][j] = 1.;
else
a[i][j] = 0.;
double b[nsame*bcols];//here ncols corresponds to numbers of rows for matrix b
for (int i = 0; i < (nsame*bcols); i++){
b[i] = (10.*i + 3.)/(3.*i - 2.) ;
MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
for (int i=1; i < (noprocs); i++) {
// Note A is a 2D array so A[rowsent]=&A[rowsent][0]
MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
for (int i=0; i<arows; i++) {
MPI_COMM_WORLD, &status);
int sender = status.MPI_SOURCE;
int anstype = status.MPI_TAG; //row number+1
int IND_I = 0;
while (IND_I < bcols){
c[anstype - 1][IND_I] = CC[IND_I];
if (rowsent < arows) {
MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
else { // tell sender no more work to do via a 0 TAG
// Slave part
else {
MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
while(status.MPI_TAG != 0) {
int crow = status.MPI_TAG;
gsl_matrix_view AAAA = gsl_matrix_view_array(buff, 1, nsame);
gsl_matrix_view BBBB = gsl_matrix_view_array(b, nsame, bcols);
gsl_matrix_view CCCC = gsl_matrix_view_array(CC, 1, bcols);
/* Compute C = A B */
gsl_blas_dgemm (CblasNoTrans, CblasNoTrans, 1.0, &AAAA.matrix, &BBBB.matrix,
0.0, &CCCC.matrix);
// output c here on master node //uncomment the below lines if I wish to see the output
// if (nid == master){
// if (rowsent == arows){
// // cout << rowsent;
// int IND_F = 0;
// while (IND_F < arows){
// int IND_K = 0;
// while (IND_K < bcols){
// cout << "[" << IND_F << "]" << "[" << IND_K << "] = " << c[IND_F][IND_K] << " ";
// IND_K++;
// }
// cout << "\n";
// IND_F++;
// }
// }
// }
//free any allocated space here
return 0;
What appears odd is that when I increase the size of the matrices (e.g. from nsame = 500 to nsame = 501), the code no longer works. I receive the following error:
mpirun noticed that process rank 0 with PID 0 on node Users-MacBook-Air exited on signal 11 (Segmentation fault: 11).
I have tried this with other combinations of sizes for the matrices and there always appears to be an upper limit for the size of the matrices themselves (which seems to vary based on how I vary the different dimensions themselves). I have also tried modifying the values of the matrices themselves although this does not appear to change anything. I realize there are alternative ways to initialize the matrices in my example (e.g. using vector) but am simply wondering why my current scheme of multiplying matrices of arbitrary size seems to only work to a certain extent.
You're declaring too many big local variables, which exhausts the stack. a, in particular, is 500x500 doubles (250,000 8-byte elements, or 2 million bytes). b is even larger.
You'll need to dynamically allocate space for some or all of those arrays.
There might be a compiler option to increase the initial stack space, but that isn't a good long-term solution.
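As a rough sketch of the dynamic-allocation suggestion, the large arrays can live in std::vector objects, whose element storage is on the heap rather than the stack. The helper names stackBytesNeeded and makeMatrix are made up for illustration; the byte count covers only the arrays declared on the master rank in the question and assumes 8-byte doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes occupied by the question's local arrays on the master rank:
// buff[nsame], b[nsame*bcols] (declared twice), c[arows][bcols],
// CC[bcols] and a[arows][nsame].
std::size_t stackBytesNeeded(std::size_t arows, std::size_t nsame, std::size_t bcols)
{
    const std::size_t doubles =
        nsame + 2 * nsame * bcols + arows * bcols + bcols + arows * nsame;
    return doubles * sizeof(double);
}

// Heap-backed replacement for a rows x cols array, zero-initialized.
// Element (i, j) lives at m[i * cols + j].
std::vector<double> makeMatrix(std::size_t rows, std::size_t cols)
{
    assert(rows > 0 && cols > 0);
    return std::vector<double>(rows * cols, 0.0);
}
```

For the 500 x 500 x 527 case, stackBytesNeeded reports about 8.3 million bytes, which is already close to the 8 MB default stack limit found on many systems (an assumption about the platform; ulimit -s shows the actual value). With a vector created as makeMatrix(arows, nsame), the pointer a.data() + rowsent * nsame can be passed to MPI_Send in place of a[rowsent].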
Consider the following dataset and centroids. There are 7 individuals and two means, each with 8 dimensions. They are stored in row-major order.
short dim = 8;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114
};
I want to calculate every Euclidean distance between a centroid and an individual: c1 - d1, c1 - d2, and so on.
On CPU I would do:
float dist = 0.0, dist_sqrt;
for(int i = 0; i < 2; i++)
{
    for(int j = 0; j < 7; j++)
    {
        float dist_sum = 0.0;
        for(int k = 0; k < dim; k++)
        {
            dist = centroids[i * dim + k] - data[j * dim + k];
            dist_sum += dist * dist;
        }
        dist_sqrt = sqrt(dist_sum);
        // do something with the distance
        std::cout << dist_sqrt << std::endl;
    }
}
Is there a built-in solution for vector distance calculation in Thrust?
It can be done in thrust. Explaining how will be rather involved, and the code is rather dense.
The key observation to start with is that the core operation can be done via a transformed reduction. The thrust transform operation is used to perform the elementwise subtraction of the vectors (individual-centroid) and squaring of each result, and the reduction sums the results together to produce the square of the euclidean distance. The starting point for this operation is thrust::reduce_by_key, but it gets rather involved to present the data correctly to reduce_by_key.
The final results are produced by taking the square root of each result from above, and we can use an ordinary thrust::transform for this.
The above is a summary description of the only 2 lines of thrust code that do all the work. However, the first line has considerable complexity to it. In order to exploit parallelism, the approach I took was to virtually "lay out" the necessary vectors in sequence, to be presented to reduce_by_key. To take a simple example, suppose we have 2 centroids and 4 individuals, and suppose our dimension is 2.
centroid 0: C00 C01
centroid 1: C10 C11
individ 0: I00 I01
individ 1: I10 I11
individ 2: I20 I21
individ 3: I30 I31
We can "lay out" the vectors like this:
C00 C01 C00 C01 C00 C01 C00 C01 C10 C11 C10 C11 C10 C11 C10 C11
I00 I01 I10 I11 I20 I21 I30 I31 I00 I01 I10 I11 I20 I21 I30 I31
To facilitate the reduce_by_key, we will also need to generate key values to delineate the vectors, one distinct key per (centroid, individual) pair:
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
The above "laid-out" data sets can be quite large, and we don't want to incur storage and retrieval cost, so we will generate them "on-the-fly" using thrust's collection of fancy iterators. This is where things get quite dense. With the above strategy in mind, we will use thrust::reduce_by_key to do the work. We'll create a custom functor provided to a transform_iterator to do the subtraction (and squaring) of the I and C vectors, which will be zipped together for this purpose. The "lay out" of the vectors will be created on the fly using permutation iterators with additional custom index-creation functors, to help with the replicated patterns in each of I and C.
Therefore, working from the "inside out", the sequence of steps is as follows:
1. For both I (data) and C (centr), use a counting_iterator combined with a custom indexing functor inside of a transform_iterator to produce the indexing sequences we will need.
2. Using the indexing sequences created in step 1 and the base I and C vectors, virtually "lay out" the vectors via a permutation_iterator (one for each laid-out vector).
3. Zip the 2 "laid out" virtual I and C vectors together, to create a <float, float> tuple vector (virtual).
4. Take the zip_iterator from step 3, and combine it with a custom distance-calculation functor ((I-C)^2) in a transform_iterator.
5. Use another transform_iterator, combining a counting_iterator with a custom key-generating functor, to produce the key sequence (virtual).
6. Pass the iterators from steps 4 and 5 to reduce_by_key as the inputs (keys, values) to be reduced. The output vectors for reduce_by_key are also keys and values. We don't need the keys, so we'll use a discard_iterator to dump those. The values we will save.
The above steps are all accomplished in a single line of thrust code.
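Before reading the full Thrust version, it can help to check the index arithmetic on the host. The sketch below is plain C++ with no Thrust; it copies the integer formulas used by the c_idx, d_idx and dkeygen functors into ordinary functions, and the names centroidIndex, dataIndex, keyOf and layout are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Host-only copies of the index arithmetic used by c_idx, d_idx and dkeygen.
int centroidIndex(int val, int dim, int numd) { return (val % dim) + dim * (val / (dim * numd)); }
int dataIndex(int val, int dim, int numd)     { return val % (dim * numd); }
int keyOf(int val, int dim)                   { return val / dim; }

// Hypothetical helper: materialize one virtual index sequence for inspection.
// which: 0 = centroid indices, 1 = data indices, 2 = keys.
std::vector<int> layout(int which, int numCentroids, int dim, int numd)
{
    assert(which >= 0 && which <= 2);
    std::vector<int> out;
    for (int val = 0; val < numCentroids * dim * numd; ++val) {
        if (which == 0)      out.push_back(centroidIndex(val, dim, numd));
        else if (which == 1) out.push_back(dataIndex(val, dim, numd));
        else                 out.push_back(keyOf(val, dim));
    }
    return out;
}
```

For the small example (2 centroids, 4 individuals, dimension 2), layout(0, 2, 2, 4) gives 0 1 0 1 0 1 0 1 2 3 2 3 2 3 2 3, which selects C00 C01 four times and then C10 C11 four times, and layout(1, 2, 2, 4) gives 0 through 7 twice. The keys from val / dim run 0 0 1 1 up to 7 7; any key sequence that changes value at every vector boundary serves reduce_by_key equally well.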
Here's a code illustrating the above:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#define MAX_DATA 100000000
#define MAX_CENT 5000
#define TOL 0.001
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
  timeval tv1;
  gettimeofday(&tv1, 0);  // fill tv1 with the current wall-clock time
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
unsigned verify(float *d1, float *d2, int len){
  unsigned pass = 1;
  for (int i = 0; i < len; i++)
    if (fabsf(d1[i] - d2[i]) > TOL){
      std::cout << "mismatch at: " << i << " val1: " << d1[i] << " val2: " << d2[i] << std::endl;
      pass = 0;
    }
  return pass;
}
void eucl_dist_cpu(const float *centroids, const float *data, float *rdist, int num_centroids, int dim, int num_data, int print){
  int out_idx = 0;
  float dist, dist_sqrt;
  for(int i = 0; i < num_centroids; i++){
    for(int j = 0; j < num_data; j++){
      float dist_sum = 0.0;
      for(int k = 0; k < dim; k++){
        dist = centroids[i * dim + k] - data[j * dim + k];
        dist_sum += dist * dist;
      }
      dist_sqrt = sqrt(dist_sum);
      // do something with the distance
      rdist[out_idx++] = dist_sqrt;
      if (print) std::cout << dist_sqrt << ", ";
    }
  }
  if (print) std::cout << std::endl;
}
// key generator: one key per (centroid, individual) pair
struct dkeygen : public thrust::unary_function<int, int>
{
  int dim;
  int numd;
  dkeygen(const int _dim, const int _numd) : dim(_dim), numd(_numd) {};
  __host__ __device__ int operator()(const int val) const {
    return (val/dim);
  }
};
typedef thrust::tuple<float, float> mytuple;
// squared difference of one (centroid, data) component pair
struct my_dist : public thrust::unary_function<mytuple, float>
{
  __host__ __device__ float operator()(const mytuple &my_tuple) const {
    float temp = thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple);
    return temp*temp;
  }
};
// index into the data vector for the virtual "laid out" sequence
struct d_idx : public thrust::unary_function<int, int>
{
  int dim;
  int numd;
  d_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
  __host__ __device__ int operator()(const int val) const {
    return (val % (dim*numd));
  }
};
// index into the centroid vector for the virtual "laid out" sequence
struct c_idx : public thrust::unary_function<int, int>
{
  int dim;
  int numd;
  c_idx(int _dim, int _numd) : dim(_dim), numd(_numd) {};
  __host__ __device__ int operator()(const int val) const {
    return (val % dim) + (dim * (val/(dim*numd)));
  }
};
struct my_sqrt : public thrust::unary_function<float, float>
{
  __host__ __device__ float operator()(const float val) const {
    return sqrtf(val);
  }
};
unsigned long long eucl_dist_thrust(thrust::host_vector<float> &centroids, thrust::host_vector<float> &data, thrust::host_vector<float> &dist, int num_centroids, int dim, int num_data, int print){
thrust::device_vector<float> d_data = data;
thrust::device_vector<float> d_centr = centroids;
thrust::device_vector<float> values_out(num_centroids*num_data);
unsigned long long compute_time = dtime_usec(0);
thrust::reduce_by_key(thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), dkeygen(dim, num_data)), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(dim*num_data*num_centroids), dkeygen(dim, num_data)),thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_centr.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), c_idx(dim, num_data))), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator<int>(0), d_idx(dim, num_data))))), my_dist()), thrust::make_discard_iterator(), values_out.begin());
thrust::transform(values_out.begin(), values_out.end(), values_out.begin(), my_sqrt());
compute_time = dtime_usec(compute_time);
if (print){
  thrust::copy(values_out.begin(), values_out.end(), std::ostream_iterator<float>(std::cout, ", "));
  std::cout << std::endl;
}
thrust::copy(values_out.begin(), values_out.end(), dist.begin());
return compute_time;
}
int main(int argc, char *argv[]){
int dim = 8;
int num_centroids = 2;
float centroids[] = {
0.223, 0.002, 0.223, 0.412, 0.334, 0.532, 0.244, 0.612,
0.742, 0.812, 0.817, 0.353, 0.325, 0.452, 0.837, 0.441
};
int num_data = 8;
float data[] = {
0.314, 0.504, 0.030, 0.215, 0.647, 0.045, 0.443, 0.325,
0.731, 0.354, 0.696, 0.604, 0.954, 0.673, 0.625, 0.744,
0.615, 0.936, 0.045, 0.779, 0.169, 0.589, 0.303, 0.869,
0.275, 0.406, 0.003, 0.763, 0.471, 0.748, 0.230, 0.769,
0.903, 0.489, 0.135, 0.599, 0.094, 0.088, 0.272, 0.719,
0.112, 0.448, 0.809, 0.157, 0.227, 0.978, 0.747, 0.530,
0.908, 0.121, 0.321, 0.911, 0.884, 0.792, 0.658, 0.114,
0.721, 0.555, 0.979, 0.412, 0.007, 0.501, 0.844, 0.234
};
std::cout << "cpu results: " << std::endl;
float dist[num_data*num_centroids];
eucl_dist_cpu(centroids, data, dist, num_centroids, dim, num_data, 1);
thrust::host_vector<float> h_data(data, data + (sizeof(data)/sizeof(float)));
thrust::host_vector<float> h_centr(centroids, centroids + (sizeof(centroids)/sizeof(float)));
thrust::host_vector<float> h_dist(num_centroids*num_data);
std::cout << "gpu results: " << std::endl;
eucl_dist_thrust(h_centr, h_data, h_dist, num_centroids, dim, num_data, 1);
float *data2, *centroids2, *dist2;
num_centroids = 10;
num_data = 1000000;
if (argc > 2) {
num_centroids = atoi(argv[1]);
num_data = atoi(argv[2]);
if ((num_centroids < 1) || (num_centroids > MAX_CENT)) {std::cout << "Num centroids out of range" << std::endl; return 1;}
if ((num_data < 1) || (num_data > MAX_DATA)) {std::cout << "Num data out of range" << std::endl; return 1;}
if (num_data * dim * num_centroids > 2000000000) {std::cout << "data set out of range" << std::endl; return 1;}}
std::cout << "Num Data: " << num_data << std::endl;
std::cout << "Num Cent: " << num_centroids << std::endl;
std::cout << "result size: " << ((num_data*num_centroids*4)/1048576) << " Mbytes" << std::endl;
data2 = new float[dim*num_data];
centroids2 = new float[dim*num_centroids];
dist2 = new float[num_data*num_centroids];
for (int i = 0; i < dim*num_data; i++) data2[i] = rand()/(float)RAND_MAX;
for (int i = 0; i < dim*num_centroids; i++) centroids2[i] = rand()/(float)RAND_MAX;
unsigned long long dtime = dtime_usec(0);
eucl_dist_cpu(centroids2, data2, dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "cpu time: " << dtime/(float)USECPSEC << "s" << std::endl;
thrust::host_vector<float> h_data2(data2, data2 + (dim*num_data));
thrust::host_vector<float> h_centr2(centroids2, centroids2 + (dim*num_centroids));
thrust::host_vector<float> h_dist2(num_data*num_centroids);
dtime = dtime_usec(0);
unsigned long long ctime = eucl_dist_thrust(h_centr2, h_data2, h_dist2, num_centroids, dim, num_data, 0);
dtime = dtime_usec(dtime);
std::cout << "gpu total time: " << dtime/(float)USECPSEC << "s, gpu compute time: " << ctime/(float)USECPSEC << "s" << std::endl;
if (!verify(dist2, &(h_dist2[0]), num_data*num_centroids)) {std::cout << "Verification failure." << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
The code is set up to do 2 passes, a short one using a data set similar to yours, with printout for visual check. Then a larger data set can be entered, via command-line sizing parameters (number of centroids, then number of individuals), for benchmark comparison and validation of results.
Contrary to what I stated in the comments, the thrust code is only running about 25% faster than the naive single-threaded CPU code. Your mileage may vary.
This is just one way to think about handling it. I have had other ideas, but not enough time to flesh them out.
The data sets can become rather large. The code right now is intended to be limited to data sets where the product of dimension*number_of_centroids*number_of_individuals is less than 2 billion. However, as you approach even this number, you will need a GPU and CPU that both have a few GB of memory. I briefly explored larger data set sizes. A few code changes would be needed in various places to extend from e.g. int to unsigned long long, etc. However I haven't provided that as I am still investigating an issue with that code.
For another, non-thrust-related look at computing euclidean distances on the GPU, you may be interested in this question. If you follow the sequence of optimizations that were made there, it may shed some light on either how this thrust code might be improved, or else how another non-thrust realization could be used.
Sorry I wasn't able to squeeze more performance out.
I have 2 rotations represented as yaw, pitch, roll (Tait-Bryan intrinsic right-handed). What is the recommended way to construct a single rotation that is equivalent to both of them?
EDIT: if I understand correctly from the answers, I must first convert yaw, pitch, roll to either matrix or quaternion, compose them and then transform the result back to yaw, pitch, roll representation.
Also, my first priority is simplicity, then numerical stability and efficiency.
Thanks :)
As a general answer, if you make a rotation matrix for each of the two rotations, you can then make a single matrix which is the product of the two (order is important!) to represent the effect of applying both rotations.
It is possible to conceive of instances where "gimbal lock" could make this numerically unstable for certain angles (typically involving angles very close to 90 degrees).
It is faster and more stable to use quaternions. In summary, every rotation can be represented by a quaternion, and a sequence of rotations is represented by the product of their quaternions. They tend to have better stability properties.
Formulas for doing this can be found at
UPDATE: using the formulas for the yaw, pitch, roll rotation matrix, you can adapt the following code to do a sequence of rotations. While the code is written in (and compiles as) C++, I am not taking advantage of certain built-in C++ types and methods that might make this code more elegant - showing my C roots here. The point is really to show how the rotation equations work, and how you can concatenate multiple rotations.
The two key functions are calcRot which computes the rotation matrix for given yaw, pitch and roll; and mMult which multiplies two matrices together. When you have two successive rotations, the product of their rotation matrices is the "composite" rotation - you do have to watch out for the order in which you do things. The example that I used shows this. First I rotate a vector by two separate rotations; then I compute a single matrix that combines both rotations and get the same result; finally I reverse the order of the rotations, and get a different result. All of which should help you solve your problem.
Make sure that the conventions I used make sense for you.
#include <iostream>
#include <cmath>
#define PI (2.0*acos(0.0))
//#define DEBUG
void calcRot(double ypr[3], double M[3][3]) {
  // extrinsic rotations: using the world frame of reference
  // ypr: yaw, pitch, roll in radians
  double cy, sy, cp, sp, cr, sr;
  // compute sin and cos of each just once:
  cy = cos(ypr[0]);
  sy = sin(ypr[0]);
  cp = cos(ypr[1]);
  sp = sin(ypr[1]);
  cr = cos(ypr[2]);
  sr = sin(ypr[2]);
  // compute this rotation matrix, M = Rz(yaw) * Ry(pitch) * Rx(roll):
  // source:
  M[0][0] = cy*cp;
  M[0][1] = cy*sp*sr - sy*cr;
  M[0][2] = cy*sp*cr + sy*sr;
  M[1][0] = sy*cp;
  M[1][1] = sy*sp*sr + cy*cr;
  M[1][2] = sy*sp*cr - cy*sr;  // note cr (not sr) in the first term
  M[2][0] = -sp;
  M[2][1] = cp*sr;
  M[2][2] = cp*cr;
}
void mMult(double M[3][3], double R[3][3]) {
  // multiply M * R, returning result in M
  double T[3][3] = {0};
  for(int ii = 0; ii < 3; ii++) {
    for(int jj = 0; jj < 3; jj++) {
      for(int kk = 0; kk < 3; kk++ ) {
        T[ii][jj] += M[ii][kk] * R[kk][jj];
      }
    }
  }
  // copy the result:
  for(int ii = 0; ii < 3; ii++) {
    for(int jj = 0; jj < 3; jj++ ) {
      M[ii][jj] = T[ii][jj];
    }
  }
}
void printRotMat(double M[3][3]) {
  // print 3x3 matrix - for debug purposes
#ifdef DEBUG
  std::cout << "rotation matrix is: " << std::endl;
  for(int ii = 0; ii < 3; ii++) {
    for(int jj = 0; jj < 3; jj++ ) {
      std::cout << M[ii][jj] << " ";
    }
    std::cout << std::endl;
  }
  std::cout << std::endl;
#endif
}
void applyRot(double before[3], double after[3], double M[3][3]) {
  // apply rotation matrix M to vector 'before'
  // returning result in vector 'after'
  double sumBefore = 0, sumAfter = 0;
  std::cout << "Result of rotation:" << std::endl;
  for(int ii = 0; ii < 3; ii++) {
    std::cout << before[ii] << " -> ";
    sumBefore += before[ii] * before[ii];
    after[ii] = 0;
    for( int jj = 0; jj < 3; jj++) {
      after[ii] += M[ii][jj]*before[jj];
    }
    sumAfter += after[ii] * after[ii];
    std::cout << after[ii] << std::endl;
  }
  std::cout << std::endl;
#ifdef DEBUG
  std::cout << "length before: " << sqrt(sumBefore) << "; after: " << sqrt(sumAfter) << std::endl;
#endif
}
int main(void) {
double r1[3] = {0, 0, PI/2}; // order: yaw, pitch, roll
double r2[3] = {0, PI/2, 0};
double initPoint[3] = {3,4,5}; // initial point before rotation
double rotPoint[3], rotPoint2[3];
// initialize rotation matrix to I
double R[3][3];
double R2[3][3];
// compute first rotation matrix in-place:
calcRot(r1, R);
applyRot(initPoint, rotPoint, R);
// apply second rotation on top of first:
calcRot(r2, R2);
std::cout << std::endl << "second rotation matrix: " << std::endl;
// applying second matrix to result of first rotation:
std::cout << std::endl << "applying just the second matrix to result of first: " << std::endl;
applyRot(rotPoint, rotPoint2, R2);
mMult(R2, R);
std::cout << "after multiplication: " << std::endl;
std::cout << "Applying the combined matrix to the intial vector: " << std::endl;
applyRot(initPoint, rotPoint2, R2);
// now in the opposite order:
double S[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
calcRot(r2, S);
calcRot(r1, R2);
mMult(R2, S);
std::cout << "applying rotation in the opposite order: " << std::endl;
applyRot(initPoint, rotPoint, R2);
}
Output (with #DEBUG not defined - commented out):
Result of rotation:
3 -> 3
4 -> -5
5 -> 4
second rotation matrix:
applying just the second matrix to result of first:
Result of rotation:
3 -> 4
-5 -> -5
4 -> -3
after multiplication:
Applying the combined matrix to the intial vector:
Result of rotation:
3 -> 4
4 -> -5
5 -> -3
Note that these last two give the same result, showing that you can combine rotation matrices.
applying rotation in the opposite order:
Result of rotation:
3 -> 5
4 -> 3
5 -> 4
Now the result is different - the order is important.
If you are familiar with matrix operations, you may try Rodrigues' rotation formula. If you are familiar with quaternions, you may try the P' = q*P*q' approach.
Quaternion math is a bit more complicated to grasp, but the code is simpler and faster.
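A minimal sketch of the quaternion route, under stated assumptions: angles are in radians, the convention is intrinsic yaw (Z), pitch (Y), roll (X), which corresponds to the matrix product Rz(yaw) * Ry(pitch) * Rx(roll), and vectors are rotated actively as v' = q * v * conj(q). The type and function names (Quat, Ypr, mul, fromYpr, toYpr, compose, rotate) are invented for this example; check the conventions against your own before relying on it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Quat { double w, x, y, z; };
struct Ypr  { double yaw, pitch, roll; };  // radians

// Hamilton product a*b. As a rotation, b is applied first, then a.
Quat mul(const Quat& a, const Quat& b) {
    return { a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
             a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
             a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
             a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w };
}

// q = qz(yaw) * qy(pitch) * qx(roll), built from three axis rotations.
Quat fromYpr(const Ypr& e) {
    const Quat qz{std::cos(e.yaw / 2),   0.0, 0.0, std::sin(e.yaw / 2)};
    const Quat qy{std::cos(e.pitch / 2), 0.0, std::sin(e.pitch / 2), 0.0};
    const Quat qx{std::cos(e.roll / 2),  std::sin(e.roll / 2), 0.0, 0.0};
    return mul(mul(qz, qy), qx);
}

// Back to yaw, pitch, roll. Ill-conditioned near pitch = +/-90 degrees (gimbal lock).
Ypr toYpr(const Quat& q) {
    const double n2 = q.w*q.w + q.x*q.x + q.y*q.y + q.z*q.z;
    assert(std::abs(n2 - 1.0) < 1e-9);  // expects a unit quaternion
    const double sp = std::max(-1.0, std::min(1.0, 2.0 * (q.w*q.y - q.x*q.z)));
    return { std::atan2(2.0 * (q.w*q.z + q.x*q.y), 1.0 - 2.0 * (q.y*q.y + q.z*q.z)),
             std::asin(sp),
             std::atan2(2.0 * (q.w*q.x + q.y*q.z), 1.0 - 2.0 * (q.x*q.x + q.y*q.y)) };
}

// Single rotation equivalent to "first, then second", both in the world frame.
Quat compose(const Quat& first, const Quat& second) { return mul(second, first); }

// Rotate v in place: v' = q * (0, v) * conj(q).
void rotate(const Quat& q, double v[3]) {
    const Quat p{0.0, v[0], v[1], v[2]};
    const Quat c{q.w, -q.x, -q.y, -q.z};
    const Quat r = mul(mul(q, p), c);
    v[0] = r.x; v[1] = r.y; v[2] = r.z;
}
```

With these helpers, toYpr(compose(fromYpr(r1), fromYpr(r2))) returns the combined rotation as yaw, pitch, roll. Composing a roll of 90 degrees followed by a pitch of 90 degrees and rotating (3, 4, 5) gives (4, -5, -3), the same result as the matrix example above; that particular composite has a pitch of exactly 90 degrees, so converting it back to yaw, pitch, roll is ambiguous.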
I'm hoping that the answer to the question in the title is that I'm doing something stupid!
Here is the problem. I want to compute all the eigenvalues and eigenvectors of a real, symmetric matrix. I have implemented code in MATLAB (actually, I run it using Octave), and C++, using the GNU Scientific Library. I am providing my full code below for both implementations.
As far as I can understand, GSL comes with its own implementation of the BLAS API, (hereafter I refer to this as GSLCBLAS) and to use this library I compile using:
g++ -O3 -lgsl -lgslcblas
GSL suggests here to use an alternative BLAS library, such as the self-optimizing ATLAS library, for improved performance. I am running Ubuntu 12.04, and have installed the ATLAS packages from the Ubuntu repository. In this case, I compile using:
g++ -O3 -lgsl -lcblas -latlas -lm
For all three cases, I have performed experiments with randomly-generated matrices of sizes 100 to 1000 in steps of 100. For each size, I perform 10 eigendecompositions with different matrices, and average the time taken. The results are these:
The difference in performance is ridiculous. For a matrix of size 1000, Octave performs the decomposition in under a second; GSLCBLAS and ATLAS take around 25 seconds.
I suspect that I may be using the ATLAS library incorrectly. Any explanations are welcome; thanks in advance.
Some notes on the code:
In the C++ implementation, there is no need to make the matrix symmetric, because the function only uses the lower triangular part of it.
In Octave, the line triu(A) + triu(A, 1)' forces the matrix to be symmetric.
If you wish to compile the C++ code on your own Linux machine, you also need to add the flag -lrt, because of the clock_gettime function.
Unfortunately I don't think clock_gettime exists on other platforms. Consider changing it to gettimeofday.
Octave Code
K = 10;
fileID = fopen('octave_out.txt','w');
for N = 100:100:1000
    AverageTime = 0.0;
    for k = 1:K
        A = randn(N, N);
        A = triu(A) + triu(A, 1)';
        tic;
        eig(A);
        AverageTime = AverageTime + toc/K;
    end
    disp([num2str(N), " ", num2str(AverageTime), "\n"]);
    fprintf(fileID, '%d %f\n', N, AverageTime);
end
fclose(fileID);
C++ Code
#include <iostream>
#include <fstream>
#include <time.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_eigen.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
int main()
{
    const int K = 10;
    gsl_rng * RandomNumberGenerator = gsl_rng_alloc(gsl_rng_default);
    gsl_rng_set(RandomNumberGenerator, 0);
    std::ofstream OutputFile("atlas.txt", std::ios::trunc);
    for (int N = 100; N <= 1000; N += 100)
    {
        gsl_matrix* A = gsl_matrix_alloc(N, N);
        gsl_eigen_symmv_workspace* EigendecompositionWorkspace = gsl_eigen_symmv_alloc(N);
        gsl_vector* Eigenvalues = gsl_vector_alloc(N);
        gsl_matrix* Eigenvectors = gsl_matrix_alloc(N, N);
        double AverageTime = 0.0;
        for (int k = 0; k < K; k++)
        {
            for (int i = 0; i < N; i++)
            {
                for (int j = 0; j < N; j++)
                {
                    gsl_matrix_set(A, i, j, gsl_ran_gaussian(RandomNumberGenerator, 1.0));
                }
            }
            timespec start, end;
            clock_gettime(CLOCK_MONOTONIC_RAW, &start);
            gsl_eigen_symmv(A, Eigenvalues, Eigenvectors, EigendecompositionWorkspace);
            clock_gettime(CLOCK_MONOTONIC_RAW, &end);
            double TimeElapsed = (double) ((1e9*end.tv_sec + end.tv_nsec) - (1e9*start.tv_sec + start.tv_nsec))/1.0e9;
            AverageTime += TimeElapsed/K;
            std::cout << "N = " << N << ", k = " << k << ", Time = " << TimeElapsed << std::endl;
        }
        OutputFile << N << " " << AverageTime << std::endl;
        // release the per-size allocations before moving to the next N
        gsl_matrix_free(A);
        gsl_eigen_symmv_free(EigendecompositionWorkspace);
        gsl_vector_free(Eigenvalues);
        gsl_matrix_free(Eigenvectors);
    }
    gsl_rng_free(RandomNumberGenerator);
    return 0;
}
I disagree with the previous post. This is not a threading issue; it is an algorithm issue. The reason MATLAB, R, and Octave wipe the floor with C++ libraries is that their C++ libraries use more complex, better algorithms. If you read the Octave page you can find out what they do [1]:
Eigenvalues are computed in a several step process which begins with a Hessenberg decomposition, followed by a Schur decomposition, from which the eigenvalues are apparent. The eigenvectors, when desired, are computed by further manipulations of the Schur decomposition.
Solving eigenvalue/eigenvector problems is non-trivial. In fact, it's one of the few things "Numerical Recipes in C" recommends you don't implement yourself (p. 461). GSL is often slow, which was my initial response. ALGLIB is also slow for its standard implementation (I'm getting about 12 seconds!):
#include <cstdlib>   // atoi
#include <iostream>
#include <iomanip>
#include <ctime>
#include <linalg.h>
using std::cout;
using std::setw;
using std::endl;
const int VERBOSE = false;
int main(int argc, char** argv)
{
    int size = 0;
    if(argc != 2) {
        cout << "Please provide a size of input" << endl;
        return -1;
    } else {
        size = atoi(argv[1]);
        cout << "Array Size: " << size << endl;
    }
    alglib::real_2d_array mat;
    alglib::hqrndstate state;
    alglib::hqrndrandomize(state);   // seed the generator before drawing from it
    mat.setlength(size, size);
    // Fill the matrix so that mat[rr][cc] == mat[cc][rr]
    for(int rr = 0 ; rr < mat.rows(); rr++) {
        for(int cc = 0 ; cc < mat.cols(); cc++) {
            mat[rr][cc] = mat[cc][rr] = alglib::hqrndnormal(state);
        }
    }
    cout << "Matrix: " << endl;
    for(int rr = 0 ; rr < mat.rows(); rr++) {
        for(int cc = 0 ; cc < mat.cols(); cc++) {
            cout << setw(10) << mat[rr][cc];
        }
        cout << endl;
    }
    cout << endl;
    alglib::real_1d_array d;   // eigenvalues
    alglib::real_2d_array z;   // eigenvectors, one per column
    auto t = clock();
    alglib::smatrixevd(mat, mat.rows(), 1, 0, d, z);
    t = clock() - t;
    cout << (double)t/CLOCKS_PER_SEC << "s" << endl;
    for(int cc = 0 ; cc < mat.cols(); cc++) {
        cout << "lambda: " << d[cc] << endl;
        cout << "V: ";
        for(int rr = 0 ; rr < mat.rows(); rr++) {
            cout << setw(10) << z[rr][cc];
        }
        cout << endl;
    }
}
If you really need a fast library, you will probably need to do some real hunting.
I have also run into this problem. The real cause is that eig() in MATLAB does not calculate the eigenvectors here, whereas the C version above does. The difference in time spent can be more than one order of magnitude, as shown in the figure below, so the comparison is not fair.
In MATLAB, the function that is actually called depends on the requested return values. To force the calculation of eigenvectors, [V,D] = eig(A) should be used (see the code below).
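To make the "values only" versus "values and vectors" distinction concrete, here is a minimal sketch using the classical cyclic Jacobi iteration on a small symmetric matrix. This is not the algorithm GSL, MATLAB, or ALGLIB use; `jacobiEigen` and the `Matrix` alias are hypothetical names chosen for illustration. The point is only that the eigenvector accumulation is additional work per rotation, which a values-only call can skip.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: cyclic Jacobi iteration for a real symmetric matrix.
// Returns the (unordered) eigenvalues. If `vectors` is non-null, the rotations
// are also accumulated so that column k of *vectors is the eigenvector for
// eigenvalue k. That accumulation is the extra cost of asking for vectors.
std::vector<double> jacobiEigen(Matrix a, Matrix* vectors = nullptr,
                                int maxSweeps = 50, double tol = 1e-12)
{
    const std::size_t n = a.size();
    for (const auto& row : a) assert(row.size() == n);  // matrix must be square

    if (vectors) {  // start the accumulation from the identity matrix
        vectors->assign(n, std::vector<double>(n, 0.0));
        for (std::size_t i = 0; i < n; ++i) (*vectors)[i][i] = 1.0;
    }

    for (int sweep = 0; sweep < maxSweeps; ++sweep) {
        // Stop once the off-diagonal part is negligible.
        double off = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j) off += a[i][j] * a[i][j];
        if (std::sqrt(off) < tol) break;

        for (std::size_t p = 0; p < n; ++p) {
            for (std::size_t q = p + 1; q < n; ++q) {
                if (a[p][q] == 0.0) continue;
                // Rotation angle that zeroes a[p][q].
                const double theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
                const double t = (theta >= 0.0 ? 1.0 : -1.0) /
                                 (std::abs(theta) + std::sqrt(theta * theta + 1.0));
                const double c = 1.0 / std::sqrt(t * t + 1.0);
                const double s = t * c;
                // A <- A * J (update columns p and q)
                for (std::size_t k = 0; k < n; ++k) {
                    const double akp = a[k][p], akq = a[k][q];
                    a[k][p] = c * akp - s * akq;
                    a[k][q] = s * akp + c * akq;
                }
                // A <- J^T * A (update rows p and q)
                for (std::size_t k = 0; k < n; ++k) {
                    const double apk = a[p][k], aqk = a[q][k];
                    a[p][k] = c * apk - s * aqk;
                    a[q][k] = s * apk + c * aqk;
                }
                // Only needed when eigenvectors are requested: V <- V * J
                if (vectors) {
                    for (std::size_t k = 0; k < n; ++k) {
                        const double vkp = (*vectors)[k][p], vkq = (*vectors)[k][q];
                        (*vectors)[k][p] = c * vkp - s * vkq;
                        (*vectors)[k][q] = s * vkp + c * vkq;
                    }
                }
            }
        }
    }

    std::vector<double> values(n);
    for (std::size_t i = 0; i < n; ++i) values[i] = a[i][i];
    return values;
}
```

Calling `jacobiEigen(m)` corresponds to the eigenvalues-only case, and `jacobiEigen(m, &v)` to the case with eigenvectors. In production libraries the two paths differ by much more than one inner loop, so the sketch understates the real gap.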
The actual time to compute eigenvalue problem depends heavily on the matrix properties and the desired results, such as
Real or complex
Hermitian/Symmetric or not
Dense or sparse
Eigenvalues only, Eigenvectors, Maximum eigenvalue only, etc
Serial or parallel
There are algorithms optimized for each of the above cases. In GSL, these algorithms are picked manually, so a wrong selection will decrease performance significantly. Some C++ wrapper classes, and languages such as MATLAB and Mathematica, pick an optimized version automatically.
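As a rough illustration of what "picking manually" means in GSL, the sketch below maps a few of the properties listed above to the name of the dense GSL routine one would call. The `Problem` struct and `gslRoutineFor` are hypothetical helpers, not part of GSL; the returned names are existing GSL functions, and the empty result for complex non-Hermitian input reflects my understanding that GSL has no dense solver for that case.

```cpp
#include <string>

// Hypothetical description of an eigenproblem (not a GSL type).
struct Problem {
    bool complexEntries;  // real or complex matrix
    bool selfAdjoint;     // symmetric (real) or Hermitian (complex)
    bool wantVectors;     // eigenvalues only, or eigenvectors as well
};

// Returns the name of the dense GSL routine matching the problem,
// or an empty string when no matching routine is assumed to exist.
std::string gslRoutineFor(const Problem& p)
{
    std::string name;
    if (p.selfAdjoint) {
        name = p.complexEntries ? "gsl_eigen_herm" : "gsl_eigen_symm";
    } else if (!p.complexEntries) {
        name = "gsl_eigen_nonsymm";
    } else {
        return "";  // assumption: no dense complex non-Hermitian solver in GSL
    }
    if (p.wantVectors) name += "v";  // the "...v" variants also return vectors
    return name;
}
```

For the benchmark in the question this yields gsl_eigen_symmv, while an eigenvalues-only comparison would use gsl_eigen_symm.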
Also, MATLAB and Mathematica use parallelization, which widens the gap you see by a further factor of a few, depending on the machine. It is reasonable to say that computing the eigenvalues and the eigenvectors of a general complex 1000x1000 matrix takes about one second and ten seconds, respectively, without parallelization.
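When parallel and serial codes are compared, it matters which clock is used. The sketch below records both wall-clock time and `std::clock()` time for the same piece of work; `Timing` and `timeIt` are hypothetical names. On many systems `std::clock()` reports processor time summed over all threads, so for a multithreaded run it can be several times larger than the elapsed time, while on some platforms it behaves like a wall clock.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <utility>

// Hypothetical result type: the same interval measured with two clocks.
struct Timing {
    double wallSeconds;  // elapsed real time
    double cpuSeconds;   // std::clock() based, typically process CPU time
};

// Runs `work` once and measures it with both clocks.
template <typename Work>
Timing timeIt(Work&& work)
{
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();
    std::forward<Work>(work)();
    const std::clock_t cpuEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();
    assert(wallEnd >= wallStart);  // steady_clock never goes backwards
    return { std::chrono::duration<double>(wallEnd - wallStart).count(),
             static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC };
}
```

A call such as `timeIt([&] { gsl_eigen_symm(A, Eigenvalues, EigendecompositionWorkspace); })` would wrap the decomposition from the listing in this answer. Comparing `wallSeconds` across programs is the fair measure, and the ratio `cpuSeconds / wallSeconds` gives a rough idea of how many cores were busy.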
Fig. Comparison of MATLAB and C. "+ vec" means the code included the calculation of the eigenvectors. CPU% is a very rough observation of CPU usage at N=1000; it is bounded above by 800%, though the programs are supposed to fully use all 8 cores. The gap between MATLAB and C is smaller than a factor of 8.
Fig. Comparison of different matrix types in Mathematica. The algorithms were picked automatically by the program.
Matlab (WITH the calculation of eigenvectors)
K = 10;
fileID = fopen('octave_out.txt','w');
for N = 100:100:1000
    AverageTime = 0.0;
    for k = 1:K
        A = randn(N, N);
        A = triu(A) + triu(A, 1)';
        tic;
        [V,D] = eig(A);
        AverageTime = AverageTime + toc/K;
    end
    disp([num2str(N), ' ', num2str(AverageTime)]);
    fprintf(fileID, '%d %f\n', N, AverageTime);
end
fclose(fileID);
C++ (WITHOUT the calculation of eigenvectors)
#include <iostream>
#include <fstream>
#include <time.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <gsl/gsl_eigen.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_matrix.h>
int main()
{
    const int K = 10;
    gsl_rng * RandomNumberGenerator = gsl_rng_alloc(gsl_rng_default);
    gsl_rng_set(RandomNumberGenerator, 0);
    std::ofstream OutputFile("atlas.txt", std::ios::trunc);
    for (int N = 100; N <= 1000; N += 100)
    {
        gsl_matrix* A = gsl_matrix_alloc(N, N);
        gsl_eigen_symm_workspace* EigendecompositionWorkspace = gsl_eigen_symm_alloc(N);
        gsl_vector* Eigenvalues = gsl_vector_alloc(N);
        double AverageTime = 0.0;
        for (int k = 0; k < K; k++)
        {
            // Fill both triangles so that A is symmetric
            for (int i = 0; i < N; i++)
            {
                for (int j = i; j < N; j++)
                {
                    double rn = gsl_ran_gaussian(RandomNumberGenerator, 1.0);
                    gsl_matrix_set(A, i, j, rn);
                    gsl_matrix_set(A, j, i, rn);
                }
            }
            // Time the eigenvalues-only routine
            timespec start, end;
            clock_gettime(CLOCK_MONOTONIC_RAW, &start);
            gsl_eigen_symm(A, Eigenvalues, EigendecompositionWorkspace);
            clock_gettime(CLOCK_MONOTONIC_RAW, &end);
            double TimeElapsed = (double) ((1e9*end.tv_sec + end.tv_nsec) - (1e9*start.tv_sec + start.tv_nsec))/1.0e9;
            AverageTime += TimeElapsed/K;
            std::cout << "N = " << N << ", k = " << k << ", Time = " << TimeElapsed << std::endl;
        }
        OutputFile << N << " " << AverageTime << std::endl;
        // Release the per-size allocations before moving to the next N
        gsl_vector_free(Eigenvalues);
        gsl_eigen_symm_free(EigendecompositionWorkspace);
        gsl_matrix_free(A);
    }
    gsl_rng_free(RandomNumberGenerator);
    return 0;
}
(* Symmetric real matrix + eigenvectors *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Symmetric real matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Asymmetric real matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[], {i, NN}, {j, NN}];
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Hermitian matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[] + I Random[], {i, NN}, {j, NN}];
M = M + Transpose[Conjugate[M]];
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
(* Random complex matrix *)
Table[{NN, Mean[Table[(
M = Table[Random[] + I Random[], {i, NN}, {j, NN}];
), {K, 10}]]
}, {NN, Range[100, 1000, 100]}]
In the C++ implementation, there is no need to make the matrix
symmetric, because the function only uses the lower triangular part of it.
This may not be the case. In the reference, it is stated that:
int gsl_eigen_symmv(gsl_matrix *A,gsl_vector *eval, gsl_matrix *evec, gsl_eigen_symmv_workspace * w)
This function computes the eigenvalues and eigenvectors of the real symmetric matrix
A. Additional workspace of the appropriate size must be provided in w.
The diagonal and lower triangular part of A are destroyed during the
computation, but the strict upper triangular part is not referenced.
The eigenvalues are stored in the vector eval and are unordered. The
corresponding eigenvectors are stored in the columns of the matrix
evec. For example, the eigenvector in the first column corresponds to
the first eigenvalue. The eigenvectors are guaranteed to be mutually
orthogonal and normalised to unit magnitude.
So it seems that you also need to apply a similar symmetrization in C++, at least to get correct results, even though the performance should stay the same.
On the MATLAB side, the eigenvalue decomposition may be faster because of its multithreaded execution, as stated in this reference:
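A minimal sketch of such a symmetrization, written with standard-library containers as a stand-in for `gsl_matrix_set`; `randomSymmetric` and `isSymmetric` are hypothetical helper names. Each random value is drawn once and written to both triangles, so the input is the same symmetric matrix no matter which triangle a routine reads.

```cpp
#include <cstddef>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Builds an n x n matrix of standard normal entries with a[i][j] == a[j][i].
Matrix randomSymmetric(std::size_t n, unsigned seed)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);
    Matrix a(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            const double x = gauss(gen);  // draw once ...
            a[i][j] = x;                  // ... write the upper triangle
            a[j][i] = x;                  // ... and mirror it below
        }
    }
    return a;
}

// Exact symmetry check; assumes `a` is square.
bool isSymmetric(const Matrix& a)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j)
            if (a[i][j] != a[j][i]) return false;
    return true;
}
```

This does the same job as the MATLAB line `A = triu(A) + triu(A, 1)'`, although the random values themselves will differ between the two programs.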
Built-in Multithreading
Linear algebra and numerical functions such as fft, \ (mldivide), eig,
svd, and sort are multithreaded in MATLAB. Multithreaded computations
have been on by default in MATLAB since Release 2008a. These
functions automatically execute on multiple computational threads in a
single MATLAB session, allowing them to execute faster on
multicore-enabled machines. Additionally, many functions in Image
Processing Toolbox™ are multithreaded.
In order to test the performance of MATLAB for single core, you can disable multithreading by
in R2007a or newer as stated here.