Efficient Eigen Matrix From Function - C++

I'm trying to build a matrix from a kernel, such that A(i,j) = f(x_i, y_j), where x_i and y_j are both vectors (hence I build A from two matrices x and y in which each row corresponds to a point/vector). My current function looks similar to this:
Eigen::MatrixXd get_kernel_matrix(const Eigen::MatrixXd& x, const Eigen::MatrixXd& y,
                                  double (&kernel)(const Eigen::VectorXd&, const Eigen::VectorXd&)) {
    Eigen::MatrixXd res(x.rows(), y.rows());
    for(int i = 0; i < res.rows(); i++) {
        for(int j = 0; j < res.cols(); j++) {
            res(i, j) = kernel(x.row(i), y.row(j));
        }
    }
    return res;
}
Along with some logic for the diagonal (which in my case would likely cause division by zero).
Is there a more efficient/idiomatic way to do this? In some of my tests it appears that MATLAB code beats the speed of my C++/Eigen implementation (I'm guessing due to vectorization).
I've looked through a considerable amount of documentation (such as the unaryExpr function), but can't seem to find what I'm looking for.
Thanks for any help.

You can use NullaryExpr with an appropriate lambda to remove your for loops:
MatrixXd res = MatrixXd::NullaryExpr(x.rows(), y.rows(),
    [&x, &y, &kernel](int i, int j) { return kernel(x.row(i), y.row(j)); });
Here is a working self-contained example reproducing a matrix product:
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;
double my_kernel(const MatrixXd::ConstRowXpr &x, const MatrixXd::ConstRowXpr &y) {
    return x.dot(y);
}

template<typename Kernel>
MatrixXd apply_kernel(const MatrixXd& x, const MatrixXd& y, Kernel kernel) {
    return MatrixXd::NullaryExpr(x.rows(), y.rows(),
        [&x, &y, &kernel](int i, int j) { return kernel(x.row(i), y.row(j)); });
}

int main()
{
    int n = 10;
    MatrixXd X = MatrixXd::Random(n, n);
    MatrixXd Y = MatrixXd::Random(n, n);
    MatrixXd R = apply_kernel(X, Y, my_kernel); // note: std::ptr_fun was removed in C++17; the function name suffices
    std::cout << R << "\n\n";
    std::cout << X * Y.transpose() << "\n\n";
}
If you don't want to make apply_kernel a template function, you can use std::function to pass the kernel.
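For instance, here is a minimal sketch of that non-template variant, assuming the same my_kernel signature as above (apply_kernel_fn is an illustrative name). The trade-off is that std::function adds an indirect call per coefficient, so the template version is generally faster:

#include <functional>

MatrixXd apply_kernel_fn(const MatrixXd& x, const MatrixXd& y,
    std::function<double(const MatrixXd::ConstRowXpr&, const MatrixXd::ConstRowXpr&)> kernel) {
    return MatrixXd::NullaryExpr(x.rows(), y.rows(),
        [&x, &y, &kernel](int i, int j) { return kernel(x.row(i), y.row(j)); });
}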

Related

Efficient matrix implementation

I have the following problem:
I have a precomputed 2D matrix of values which I need to look up very often and compute only once
The size of the matrix is about 4000x4000 at most
The matrix won't be sparse; I typically need almost all values.
The values in the matrix can be boolean, integer or double. In any case they are always small objects
Currently I am storing the precomputed values in a std::vector<std::vector<T>>, and I've noticed that lookups into this data structure take quite some time in heavy computations. I've googled around, and so far the suggested implementation seems to be to store all the memory contiguously in a 1D array, where the location in this array is computed based on i and j.
Does anybody have a good example implementation of this, or an even better suggestion? I couldn't find a modern C++ example, while it seems a very common problem to me. I'd prefer to use someone else's code instead of reinventing the wheel here. Of course I will measure the differences to see whether it actually improves performance.
Examples i've found:
https://medium.com/@patdhlk/c-2d-array-a-different-better-solution-6d371363ebf8
https://secure.eld.leidenuniv.nl/~moene/Home/tips/matrix2d/
Here is a very simple and efficient 2-d matrix. The 'main' creates a 10000x10000 double array 'mat', then fills it with random numbers. The array 'mat' is copied into another array 'mat2'. You may input two integers 'n' and 'm' between 0 and 9999 to fetch the double data at mat2(n,m).
Feel free to use or test it. Let me know if you encounter problems or need some more functions to be implemented. Good luck!
#ifndef ytlu_simple_matrix_class_
#define ytlu_simple_matrix_class_

#include <iostream>
#include <iomanip>
#include <complex>
#include <algorithm> // std::copy

template <typename T> class tMatrix
{
public:
    T *ptr;
    int col, row, size;

    inline T* begin() const { return ptr; }
    inline T* end()   const { return this->ptr + this->size; }
    inline T  operator()(const int i, const int j) const { return ptr[i*col + j]; } // r-value
    inline T& operator()(const int i, const int j)       { return ptr[i*col + j]; } // l-value

    inline tMatrix(): ptr(nullptr), col(0), row(0), size(0) {}
    tMatrix(const int i, const int j): col(j), row(i), size(i*j)
    {
        ptr = new T[this->size];
    }
    tMatrix(const tMatrix<T> &a) : tMatrix<T>(a.row, a.col)
    {
        std::copy(a.begin(), a.end(), this->ptr);
    }
    tMatrix<T>& operator=(tMatrix<T> &&a)
    {
        this->col  = a.col;
        this->row  = a.row;
        this->size = a.size; // keep size consistent with the stolen buffer
        delete [] this->ptr;
        this->ptr = a.ptr;
        a.ptr = nullptr;
        return *this;
    }
    tMatrix<T>& operator=(const tMatrix<T> &a)
    {
        if (col == a.col && row == a.row) std::copy(a.begin(), a.end(), this->ptr);
        else { tMatrix<T> v(a); *this = std::move(v); } // copy, then steal the copy's buffer
        return *this;
    }
    ~tMatrix() { delete [] this->ptr; }
}; // end of class tMatrix

template <typename X> std::ostream& operator<<(std::ostream &p, const tMatrix<X> &a)
{
    p << std::fixed;
    for (int i = 0; i < a.row; i++) {
        for (int j = 0; j < a.col; j++) p << std::setw(12) << a(i, j);
        p << std::endl;
    }
    return p;
}

using iMatrix = tMatrix<int>;
using rMatrix = tMatrix<double>;
using cMatrix = tMatrix<std::complex<double> >;
#endif
//
//
#include <ctime>
#include <cstdlib>
#define N1 10000

int main()
{
    int n, m;
    std::srand(time(NULL)); // randomize
    rMatrix mat(N1, N1);    // declare a 10000 x 10000 double matrix
    //
    // fill the whole matrix with double random numbers 0.0 - 1.0
    //
    for (int i = 0; i < mat.row; i++)
    {
        for (int j = 0; j < mat.col; j++) mat(i, j) = (double)std::rand() / (double)RAND_MAX;
    }
    //
    // copy mat to mat2 just for test
    //
    rMatrix mat2 = mat;
    //
    // fetch data test: input 0 <= n, m < 10000 to print mat2(n, m)
    //
    while (1)
    {
        std::cout << "Fetch 2d array at (n m) = ";
        std::cin >> n >> m;
        if ((n < 0) || (m < 0) || (n >= mat2.row) || (m >= mat2.col)) break;
        std::cout << "mat(" << n << ", " << m << ") = " << mat2(n, m) << std::endl << std::endl;
    }
    return 0;
}
The compile parameters I used and a test run are below. It takes a couple of seconds to fill in the random numbers, and I felt no lag at all when fetching data, running on my aged PC.
ytlu@ytlu-PC MINGW32 /d/ytlu/working/cpptest
$ g++ -O3 -s mtx_class.cpp -o a.exe
ytlu@ytlu-PC MINGW32 /d/ytlu/working/cpptest
$ ./a.exe
Fetch 2d array at (n m) = 7000 9950
mat(7000, 9950) = 0.638447
Fetch 2d array at (n m) = 2904 5678
mat(2904, 5678) = 0.655934
Fetch 2d array at (n m) = -3 4

how could I change memory layout of multi dimensional arrays with c++ [duplicate]

I have a matrix (relatively big) that I need to transpose. For example assume that my matrix is
a b c d e f
g h i j k l
m n o p q r
I want the result to be as follows:
a g m
b h n
c i o
d j p
e k q
f l r
What is the fastest way to do this?
This is a good question. There are many reasons you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.
First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer where I found a much faster solution)
void transpose(float *src, float *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n < N*M; n++) {
        int i = n / N;
        int j = n % N;
        dst[n] = src[M*j + i];
    }
}
Now let's see why the transpose is useful. Consider matrix multiplication C = A*B. We could do it this way.
for(int i = 0; i < N; i++) {
    for(int j = 0; j < K; j++) {
        float tmp = 0;
        for(int l = 0; l < M; l++) {
            tmp += A[M*i+l]*B[K*l+j];
        }
        C[K*i + j] = tmp;
    }
}
That way, however, is going to have a lot of cache misses. A much faster solution is to take the transpose of B first
transpose(B);
for(int i = 0; i < N; i++) {
    for(int j = 0; j < K; j++) {
        float tmp = 0;
        for(int l = 0; l < M; l++) {
            tmp += A[M*i+l]*B[K*j+l];
        }
        C[K*i + j] = tmp;
    }
}
transpose(B);
Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication loop tiling is even more effective than taking the transpose but that's much more complicated.
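For illustration, here is a minimal sketch of loop tiling applied to the same multiplication (the tile size TS is an assumption and should be tuned to your cache):

#define TS 32 // tile size, an assumption -- tune to your cache

void matmul_tiled(const float *A, const float *B, float *C, int N, int M, int K) {
    for(int i = 0; i < N*K; i++) C[i] = 0.0f;
    for(int i0 = 0; i0 < N; i0 += TS)
    for(int l0 = 0; l0 < M; l0 += TS)
    for(int j0 = 0; j0 < K; j0 += TS)
        // work on one TS x TS tile at a time so the A, B, C blocks stay in cache
        for(int i = i0; i < (i0+TS < N ? i0+TS : N); i++)
        for(int l = l0; l < (l0+TS < M ? l0+TS : M); l++) {
            float a = A[M*i + l];
            for(int j = j0; j < (j0+TS < K ? j0+TS : K); j++)
                C[K*i + j] += a * B[K*l + j];
        }
}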
I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case, but I could imagine gathering a column and writing out a row. Maybe it will make the transpose unnecessary.
For Gaussian smearing what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem so what you do is
Smear image horizontally
transpose output
Smear output horizontally
transpose output
Here is a paper by Intel explaining that
http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use
void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
    #pragma omp parallel for
    for(int n = 0; n < M*N; n++) {
        int k = vec_size*(n/N/vec_size);
        int i = (n/vec_size)%N;
        int j = n%vec_size;
        B[n] = A[M*i + k + j];
    }
}
EDIT:
I tried several function to find the fastest transpose for large matrices. In the end the fastest result is to use loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking - see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).
inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i = 0; i < block_size; i++) {
        for(int j = 0; j < block_size; j++) {
            B[j*ldb + i] = A[i*lda + j];
        }
    }
}

inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i = 0; i < n; i += block_size) {
        for(int j = 0; j < m; j += block_size) {
            transpose_scalar_block(&A[i*lda + j], &B[j*ldb + i], lda, ldb, block_size);
        }
    }
}
The values lda and ldb are the width of the matrix. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this
#define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
const int n = 3000;
const int m = 1001;
int lda = ROUND_UP(m, 16);
int ldb = ROUND_UP(n, 16);
float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
For 3000x1001 this returns ldb = 3008 and lda = 1008
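One caveat worth noting: memory obtained with _mm_malloc must be released with _mm_free rather than free, and the aligned loads/stores used below will fault if the pointers or leading dimensions break the 16-byte alignment.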
Edit:
I found an even faster solution using SSE intrinsics:
inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
    __m128 row1 = _mm_load_ps(&A[0*lda]);
    __m128 row2 = _mm_load_ps(&A[1*lda]);
    __m128 row3 = _mm_load_ps(&A[2*lda]);
    __m128 row4 = _mm_load_ps(&A[3*lda]);
    _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
    _mm_store_ps(&B[0*ldb], row1);
    _mm_store_ps(&B[1*ldb], row2);
    _mm_store_ps(&B[2*ldb], row3);
    _mm_store_ps(&B[3*ldb], row4);
}
inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
    #pragma omp parallel for
    for(int i = 0; i < n; i += block_size) {
        for(int j = 0; j < m; j += block_size) {
            int max_i2 = i + block_size < n ? i + block_size : n;
            int max_j2 = j + block_size < m ? j + block_size : m;
            for(int i2 = i; i2 < max_i2; i2 += 4) {
                for(int j2 = j; j2 < max_j2; j2 += 4) {
                    transpose4x4_SSE(&A[i2*lda + j2], &B[j2*ldb + i2], lda, ldb);
                }
            }
        }
    }
}
This is going to depend on your application but in general the fastest way to transpose a matrix would be to invert your coordinates when you do a look up, then you do not have to actually move any data.
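A minimal sketch of that lookup-inversion idea (all names here are illustrative):

struct View2D {
    const float *data;
    int rows, cols;      // physical dimensions
    bool transposed;     // logical transpose: no data is moved
    float at(int i, int j) const {
        // when "transposed", logical (i,j) maps to physical (j,i)
        return transposed ? data[j*cols + i] : data[i*cols + j];
    }
};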
Some details about transposing 4x4 square float (I will discuss 32-bit integer later) matrices with x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.
_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd whereas MSVC uses only shufps. We can actually combine these two approaches together like this.
t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);
r0 = _mm_shuffle_ps(t0,t2, 0x44);
r1 = _mm_shuffle_ps(t0,t2, 0xEE);
r2 = _mm_shuffle_ps(t1,t3, 0x44);
r3 = _mm_shuffle_ps(t1,t3, 0xEE);
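(For reference: in _mm_shuffle_ps, the immediate 0x44 selects elements 0 and 1 from each operand and 0xEE selects elements 2 and 3, which is how the unpacked halves are recombined into full rows.)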
One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.
t0 = _mm_unpacklo_ps(r0, r1);
t1 = _mm_unpackhi_ps(r0, r1);
t2 = _mm_unpacklo_ps(r2, r3);
t3 = _mm_unpackhi_ps(r2, r3);
v = _mm_shuffle_ps(t0,t2, 0x4E);
r0 = _mm_blend_ps(t0,v, 0xC);
r1 = _mm_blend_ps(t2,v, 0x3);
v = _mm_shuffle_ps(t1,t3, 0x4E);
r2 = _mm_blend_ps(t1,v, 0xC);
r3 = _mm_blend_ps(t3,v, 0x3);
This effectively converted 4 shuffles into 2 shuffles and 4 blends. This uses 2 more instructions than the implementation of GCC, ICC, and MSVC. The advantage is that it reduces port pressure which may have a benefit in some circumstances.
Currently all the shuffles and unpacks can go only to one particular port whereas the blends can go to either of two different ports.
I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends but it did not work. I still had to use 4 unpacks.
I used this same technique for a 8x8 float transpose (see towards the end of that answer).
https://stackoverflow.com/a/25627536/2542702. In that answer I still had to use 8 unpacks, but I managed to convert the 8 shuffles into 4 shuffles and 8 blends.
For 32-bit integers there is nothing like shufps (except for 128-bit shuffles with AVX512), so it can only be implemented with unpacks, which I don't think can be converted to blends (efficiently). With AVX512, vshufi32x4 acts effectively like shufps except on 128-bit lanes of 4 integers instead of 32-bit floats, so this same technique might be possible with vshufi32x4 in some cases. On Knights Landing shuffles are four times slower (throughput) than blends.
If the sizes of the arrays are known beforehand, then we could use a union to help us, like this (note that this merely reinterprets the row-major data as a different shape rather than transposing element positions, and that reading from a union member other than the one last written is technically undefined behavior in C++):
#include <iostream>
#include <cstring> // memcpy
using namespace std;

union ua {
    int arr[2][3];
    int brr[3][2];
};

int main() {
    union ua uav;
    int karr[2][3] = {{1,2,3},{4,5,6}};
    memcpy(uav.arr, karr, sizeof(karr));
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 2; j++)
            cout << uav.brr[i][j] << " ";
        cout << '\n';
    }
    return 0;
}
Consider each row as a column and each column as a row: use (j,i) instead of (i,j).
demo: http://ideone.com/lvsxKZ
#include <iostream>
using namespace std;
int main ()
{
    char A [3][3] =
    {
        { 'a', 'b', 'c' },
        { 'd', 'e', 'f' },
        { 'g', 'h', 'i' }
    };

    cout << "A = " << endl << endl;

    // print matrix A
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++) cout << A[i][j];
        cout << endl;
    }

    cout << endl << "A transpose = " << endl << endl;

    // print A transpose
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++) cout << A[j][i];
        cout << endl;
    }
    return 0;
}
transposing without any overhead (class not complete):
class Matrix{
    double *data; // suppose this will point to data
    double _get1(int i, int j){ return data[i*M+j]; } // used to access normally
    double _get2(int i, int j){ return data[j*N+i]; } // used when transposed
public:
    int M, N; // dimensions
    double (Matrix::*get_p)(int, int); // pointer to member function used to access elements
    Matrix(int _M, int _N): M(_M), N(_N){
        // allocate data
        get_p = &Matrix::_get1; // initialised with normal access
    }
    double get(int i, int j){
        // there should be a way to use get_p for the call directly, but I think even this
        // doesn't incur overhead because it is inline and the compiler should be intelligent
        // enough to remove the extra call
        return (this->*get_p)(i, j);
    }
    void transpose(){ // twice transpose gives the original
        if (get_p == &Matrix::_get1) get_p = &Matrix::_get2;
        else get_p = &Matrix::_get1;
        std::swap(M, N);
    }
};
can be used like this:
Matrix M(100,200);
double x=M.get(17,45);
M.transpose();
x=M.get(17,45); // = original M(45,17)
Of course I didn't bother with the memory management here, which is crucial but a different topic.
template <class T>
void transpose( const std::vector< std::vector<T> > & a,
                std::vector< std::vector<T> > & b,
                int width, int height)
{
    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            b[j][i] = a[i][j];
        }
    }
}
Modern linear algebra libraries include optimized versions of the most common operations. Many of them include dynamic CPU dispatch, which chooses the best implementation for the hardware at program execution time (without compromising on portability).
This is commonly a better alternative to performing manual optimization of your functions via vector-extension intrinsics. The latter will tie your implementation to a particular hardware vendor and model: if you decide to swap to a different vendor (e.g. Power, ARM) or to newer vector extensions (e.g. AVX512), you will need to re-implement it again to get the most out of them.
MKL transposition, for example, includes the BLAS extensions function imatcopy. You can find it in other implementations such as OpenBLAS as well:
#include <mkl.h>

void transpose( float* a, int n, int m ) {
    const char row_major = 'R';
    const char transpose = 'T';
    const float alpha = 1.0f;
    // lda is the leading dimension of the row-major source (m columns),
    // ldb that of the transposed result (n columns)
    mkl_simatcopy(row_major, transpose, n, m, alpha, a, m, n);
}
For a C++ project, you can make use of the Armadillo C++:
#include <armadillo>

void transpose( arma::mat &matrix ) {
    arma::inplace_trans(matrix);
}
Intel MKL offers in-place and out-of-place transposition/copying of matrices; here is the link to the documentation. I would recommend trying the out-of-place implementation, as it is faster than in-place, and note that the documentation of the latest version of MKL contains some mistakes.
I think the fastest way should not take more than O(n^2); this way you also use just O(1) extra space:
The way to do that is to swap in pairs, because when you transpose a matrix what you do is M[i][j] = M[j][i]: store M[i][j] in temp, then M[i][j] = M[j][i], and as the last step M[j][i] = temp. This can be done in one pass, so it takes O(n^2).
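A minimal sketch of that pairwise swap for a square n x n matrix stored row-major (visiting only the upper triangle so each pair is swapped exactly once):

#include <utility> // std::swap

void transpose_inplace(double *M, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            std::swap(M[i*n + j], M[j*n + i]); // temp-based swap of the (i,j)/(j,i) pair
}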
My answer is the transpose of a 3x3 matrix:
#include <iostream>
using namespace std;

int main()
{
    int a[3][3];
    cout << "You must give us an array 3x3 and then we will give you its transpose" << endl;
    for(int i = 0; i < 3; i++)
    {
        for(int j = 0; j < 3; j++)
        {
            cout << "Enter a[" << i << "][" << j << "]: ";
            cin >> a[i][j];
        }
    }
    cout << "Matrix you entered is :" << endl;
    for (int e = 0; e < 3; e++)
    {
        for (int f = 0; f < 3; f++)
            cout << a[e][f] << "\t";
        cout << endl;
    }
    cout << "\nTransposed of matrix you entered is :" << endl;
    for (int c = 0; c < 3; c++)
    {
        for (int d = 0; d < 3; d++)
            cout << a[d][c] << "\t";
        cout << endl;
    }
    return 0;
}

why my function doesn't change my object's attribute

I'm trying to write a Matrix class using C++ vectors, but I don't know why the contents of the "Matrix result" built inside my function aren't passed on to my object; they remain enclosed inside the function.
For simplicity, so far I've tried only to implement an addition function for two matrices.
I have tried to work with pointers, but that way (to my knowledge) I can't chain calls on an object like this:
foo.function1(bar1).function2(bar2);
Working with pointers, I have to call the functions in this manner:
foo.function1(bar1);
foo.function2(bar2);
//and so on..
This is my header file:
#include <iostream>
#include <vector>
using namespace std;
class Matrix
{
public:
    Matrix(int height, int width);
    Matrix add(Matrix m);
    Matrix applyFunction(double (*function)(double));
    void print();
private:
    vector<vector<double> > matrix;
    int height;
    int width;
};
This is the .cpp file:
Matrix::Matrix(int height, int width)
{
    this->height = height;
    this->width = width;
    this->matrix = vector<vector<double> >(this->height, vector<double>(this->width));
}
Matrix Matrix::add(Matrix m)
{
    Matrix result(this->height, this->width);
    if (m.height == this->height && m.width == this->width)
    {
        for (int i = 0; i < this->height; i++)
        {
            for (int j = 0; j < this->width; j++)
            {
                result.matrix[i][j] = this->matrix[i][j] + m.matrix[i][j];
            }
        }
        return result;
    }
    else
    {
        cout << "Impossible to do addition, matrices don't have the same dimension" << endl;
        return result;
    }
}
Matrix Matrix::applyFunction(double (*function)(double))
{
    Matrix result(this->height, this->width);
    for (int i = 0; i < this->height; i++)
    {
        for (int j = 0; j < this->width; j++)
        {
            result.matrix[i][j] = (*function)(this->matrix[i][j]);
        }
    }
    return result;
}

void Matrix::print()
{
    for (int i = 0; i < this->height; i++)
    {
        for (int j = 0; j < this->width; j++)
        {
            cout << this->matrix[i][j] << " ";
        }
        cout << endl;
    }
    cout << endl;
}
The output should be the addition of two 2x2 matrices A and B:
x1 x2
x3 x4
but the computer shows only zeros.
Your member functions all return a new object (they return "by value").
From your usage of chaining, it seems like you actually want to modify the object and return *this by reference.
Otherwise you'll need something like:
auto bar2 = foo.function1(bar1);
auto bar3 = foo.function2(bar2);
// etc
There are no pointers here at present.
There are two variants of how you can implement your add:
Matrix add(Matrix m)
{
    // optimisation: you don't need a separate result, m already IS a copy!
    // so you can just calculate:
    ...
    {
        m.matrix[i][j] += this->matrix[i][j];
    }
    return m;
}
or:
Matrix& add(Matrix const& m)
//             ^ accept a const reference to avoid an unnecessary copy
// ^ returning a reference (!)
{
    ...
    {
        // modifies itself!
        this->matrix[i][j] += m.matrix[i][j];
    }
    return *this; // <- (!)
}
This now allows you to do:
Matrix m0, m1, m2;
m0.add(m1).add(m2);
// m0 now contains the result, original value is lost (!)
So you don't need the final assignment as with the first variant:
m0 = m0.add(m1).add(m2);
// or assign to a new variable, if you want to retain m0's original values
which is what you lacked in your question (thus you did not get the desired result).
Maybe you want to have both variants, and you might rename one of them. But there's a nice feature in C++ that you might like even better: operator overloading. Consider an ordinary int:
int n0, n1;
n0 += n1;
int n2 = n0 + n1;
You surely know what's going on there. What if you could do exactly the same with your matrices? Actually, you can! All you need to do is overload the operators:
Matrix& operator+=(Matrix const& m)
{
    // identical to the second variant of add above!
}

Matrix operator+(Matrix m) // again: the copy!
{
    // now implement in terms of operator+=:
    return m += *this;
}
Yes, now you can do:
Matrix m0, m1, m2;
m0 += m1 += m2;
m2 = m1 + m0;
Alternatively (and I'd prefer it) you can implement the second operator (operator+) as a free-standing function as well:
// defined OUTSIDE the Matrix class!
Matrix operator+(Matrix first, Matrix const& second)
{
    return first += second;
}
Finally: if the dimensions don't match, rather than returning some dummy matrix it is better to throw an exception; std::domain_error might be a candidate, or you can define your own exception, something like SizeMismatch. And please don't print anything to the console or elsewhere in such operators; that is not what anybody would expect from them, and you impose console output on others who might consider it inappropriate (perhaps they want output in another language?).
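For example, a minimal sketch of the checked operator+= using std::domain_error, reusing the members of the question's Matrix class and assuming the operator is declared in it:

#include <stdexcept>

Matrix& Matrix::operator+=(Matrix const& m)
{
    if (m.height != this->height || m.width != this->width)
        throw std::domain_error("matrix dimensions do not match");
    for (int i = 0; i < this->height; i++)
        for (int j = 0; j < this->width; j++)
            this->matrix[i][j] += m.matrix[i][j];
    return *this;
}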

Eliminate Intermediate Eigen Arrays

Does Eigen make any intermediate array for the calculation of x, or does Eigen just put the values into SIMD registers and do the calculation?
In general, how can I know how many intermediates Eigen made?
Will Eigen allocate new memory for the intermediates in every cycle of the loop?
Is there any way to ensure that Eigen does not make any intermediates? Does it have a macro like "EIGEN_NO_INTERMEDIATE"?
#include <Eigen/Eigen>
#include <iostream>
using namespace Eigen;
template<typename T>
void fill(T& x) {
    for (int i = 0; i < x.size(); ++i) x.data()[i] = i + 1;
}

int main() {
    int n = 10; // n is actually about 400
    ArrayXXf x(n, n);
    ArrayXf y(n);
    fill(x);
    fill(y);
    for (int i = 0; i < 10; ++i) { // many cycles
        x = x * ((x.colwise() / y).rowwise() / y.transpose()).exp();
    }
    std::cout << x << "\n";
}
You can add a hook into the DenseStorage constructor like so:
#include <iostream>

static long int nb_temporaries;
inline void on_temporary_creation(long int size) {
    if(size != 0) nb_temporaries++;
}

// must be defined before including any Eigen header!
#define EIGEN_DENSE_STORAGE_CTOR_PLUGIN { on_temporary_creation(size); }

#define VERIFY_EVALUATION_COUNT(XPR,N) {\
    nb_temporaries = 0; \
    XPR; \
    if(nb_temporaries != N) { std::cerr << "nb_temporaries == " << nb_temporaries << "\n"; }\
}

#include <Eigen/Core>
using namespace Eigen;

template<typename T>
void fill(T& x) { for(int i = 0; i < x.size(); ++i) x(i) = i+1; }

int main() {
    int n = 10;
    ArrayXXf x(n,n); fill(x);
    ArrayXf y(n); fill(y);
    for(int i = 0; i < 10; ++i)
    {
        VERIFY_EVALUATION_COUNT( x = x * ((x.colwise()/y).rowwise()/y.transpose()).exp(), 0 );
    }
    std::cout << x << '\n';
}
Essentially, this is what Eigen does in its test suite at some points; see here for the original definition in the test suite and here for an example usage there.
Alternatively, if you only care about intermediate memory allocations, you can try the macro EIGEN_RUNTIME_NO_MALLOC -- this would allow fixed-sized expressions to evaluate into temporaries, as they would only allocate on the stack.
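A minimal sketch of that alternative (note that the check relies on Eigen's runtime assertions, so it only fires in builds where eigen_assert is active):

// must be defined before including any Eigen header!
#define EIGEN_RUNTIME_NO_MALLOC
#include <Eigen/Core>

int main() {
    Eigen::ArrayXXf x = Eigen::ArrayXXf::Random(10, 10);
    Eigen::internal::set_is_malloc_allowed(false); // any heap allocation now triggers an assertion
    x *= 2.0f; // fine: evaluated in place, no heap-allocated temporary needed
    Eigen::internal::set_is_malloc_allowed(true);  // re-enable heap allocation
}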

Casting a BigMatrix/array to a Armadillo matrix

I have a big.matrix that I want to cast to an arma::Mat so that I can use the linear algebra functionality of Armadillo.
However, I can't seem to get the cast to work.
As far as I can gather from reading, both are internally stored in column-major format, and the actual matrix component of a big.matrix is simply a pointer of type <T> (char/short/int/double).
The following code compiles, but the cast to the arma::Mat doesn't work, segfaulting when iterating over the cast matrix.
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(BH, bigmemory, RcppArmadillo)]]
#include <bigmemory/BigMatrix.h>
template <typename T>
void armacast(const arma::Mat<T>& M) {
    // This segfaults
    for (int j = 0; j < 2; j++) {
        for (int i = 0; i < 2; i++) {
            std::cout << M.at(j, i) << std::endl;
        }
    }
    std::cout << "Success!" << std::endl;
}

// [[Rcpp::export]]
void armacast(SEXP pDat) {
    XPtr<BigMatrix> xpDat(pDat);
    if (xpDat->matrix_type() == 8) {
        // I can iterate over this *mat and get sensible output.
        double *mat = (double *)xpDat->matrix();
        for (int j = 0; j < 2; j++) {
            for (int i = 0; i < 2; i++) {
                std::cout << *(mat + 2 * j + i) << std::endl; // column-major offset
            }
        }
        armacast((const arma::Mat<double> &)mat);
    } else {
        std::cout << "Not implemented yet!" << std::endl;
    }
}
In R:
library(Rcpp)
library(RcppArmadillo)
library(bigmemory)
sourceCpp("armacast.cpp")
m <- as.big.matrix(matrix(1:4, 2), type="double")
armacast(m@address)
Great question! We may spin this into another Rcpp Gallery post.
There is one important detail you may have glossed over. Bigmemory objects are held in external memory precisely so that R's memory management does not interfere. Armadillo does have constructors for this (please read the docs and the warnings there), so in a first instance we can just do
arma::mat M( (double*) xpDat->matrix(), xpDat->nrow(), xpDat->ncol(), false);
where we use a pointer to the matrix data, as well as the row and column counts. A complete version:
// [[Rcpp::export]]
void armacast(SEXP pDat) {
    XPtr<BigMatrix> xpDat(pDat);
    if (xpDat->matrix_type() == 8) {
        arma::mat M((double*) xpDat->matrix(), xpDat->nrow(), xpDat->ncol(), false);
        M.print("Arma matrix M");
    } else {
        std::cout << "Not implemented yet!" << std::endl;
    }
}
It correctly invokes the print method from Armadillo:
R> armacast(m@address)
Arma matrix M
1.0000 3.0000
2.0000 4.0000
R>
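One follow-up worth stressing, per Armadillo's documented advanced constructors: passing false for the copy_aux_mem argument makes M borrow the big.matrix buffer rather than copy it, so the R object must outlive M, and writes through M modify the big.matrix in place.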