Eliminate Intermediate Eigen Arrays - c++

Does Eigen create any intermediate arrays when computing x, or does it just load the values into SIMD registers and do the calculation there?
In general, how can I find out how many intermediates Eigen created?
Will Eigen allocate new memory for the intermediates in every iteration of the loop?
Is there any way to ensure that Eigen does not create any intermediates? Does it have a macro like "EIGEN_NO_INTERMEDIATE"?
#include <Eigen/Eigen>
#include <iostream>
using namespace Eigen;
template<typename T>
void fill(T& x) {
for (int i = 0; i < x.size(); ++i) x.data()[i] = i + 1;
}
int main() {
int n = 10; // n is actually about 400
ArrayXXf x(n, n);
ArrayXf y(n);
fill(x);
fill(y);
for (int i = 0; i < 10; ++i) { // many cycles
x = x * ((x.colwise() / y).rowwise() / y.transpose()).exp();
}
std::cout << x << "\n";
}

You can add a hook into the DenseStorage constructor like so:
#include <iostream>
static long int nb_temporaries;
inline void on_temporary_creation(long int size) {
if(size!=0) nb_temporaries++;
}
// must be defined before including any Eigen header!
#define EIGEN_DENSE_STORAGE_CTOR_PLUGIN { on_temporary_creation(size); }
#define VERIFY_EVALUATION_COUNT(XPR,N) {\
nb_temporaries = 0; \
XPR; \
if(nb_temporaries!=N) { std::cerr << "nb_temporaries == " << nb_temporaries << "\n"; }\
}
#include <Eigen/Core>
using namespace Eigen;
template<typename T>
void fill(T& x) { for(int i=0; i<x.size(); ++i) x(i)= i+1; }
int main() {
int n=10;
ArrayXXf x(n,n); fill(x);
ArrayXf y(n); fill(y);
for(int i=0; i<10; ++i)
{
VERIFY_EVALUATION_COUNT( x = x * ((x.colwise()/y).rowwise()/y.transpose()).exp(), 0);
}
std::cout << x << '\n';
}
Essentially, this is what Eigen does at some points in its own test suite, which contains both the original definition of this macro and example usages of it.
Alternatively, if you only care about intermediate memory allocations, you can try the macro EIGEN_RUNTIME_NO_MALLOC. This still allows fixed-size expressions to evaluate into temporaries, since those only allocate on the stack.

Related

Can we pass an array to any function in C++?

I have passed an array of size 10 to a function to put it in reverse order, but it goes wrong after correctly handling the first five elements of the array.
I want to reverse the array 'std' here:
# include <iostream>
using namespace std;
int reverse(int a[]); // function prototype
int main()
{
int std[10] = {0,1,2,3,4,5,6,7,8,9};
reverse(std);
}
int reverse(int a[]) // function definition
{
int index = 0;
for (int i = 9; i >= 0; i--)
{
a[index] = a[i]; // swapping values of the array
cout << a[index] << " ";
index++;
}
}
There are basically three things wrong with your code.
1. You aren't swapping anything; you only overwrite the front of the array with values from the back.
2. You have to swap the first half of the array with the second half, not swap across the whole array. If you go over the whole array, everything gets swapped twice, so nothing changes.
3. You should print the reversed array after you have finished the reverse, not while you are doing it.
Here's some code that fixes all of these problems:
# include <iostream>
# include <utility>
void reverse(int a[]);
int main()
{
int std[10] = {0,1,2,3,4,5,6,7,8,9};
reverse(std);
// print the array after reversing it
for (int i = 0; i < 10; ++i)
std::cout << std[i] << ' ';
std::cout << '\n';
}
void reverse(int a[])
{
for (int i = 0; i < 5; ++i) // swap the first half of the array with the second half
{
std::swap(a[i], a[9 - i]); // real swap
}
}
Yes, you can.
I usually don't use "C" style arrays anymore (they can still be useful, but they don't behave like objects). When passing "C" style arrays to functions you more or less always have to pass the size of the array manually as well (or make assumptions about it). That can lead to bugs, not to mention pointer decay.
Here is an example:
#include <array>
#include <iostream>
// using namespace std; // no, unlearn this habit
template<std::size_t N>
void reverse(std::array<int, N>& values)
{
int index = 0;
// only run until the two indices meet in the middle of the array,
// otherwise you start swapping values back.
for (int i = static_cast<int>(values.size()) - 1; index < i; i--, index++)
{
// for swapping objects/values C++ has std::swap
// using functions like this shows WHAT you are doing by giving it a name
std::swap(values[index], values[i]);
}
}
int main()
{
std::array<int,10> values{ 0,1,2,3,4,5,6,7,8,9 };
reverse(values);
for (const int value : values)
{
std::cout << value << " ";
}
return 0;
}
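If a "C" style array has to stay, one way to avoid passing the size by hand is to take the array by reference, so that it does not decay to a pointer. The following is a minimal sketch under that assumption; the name reverse_in_place is made up for this illustration and is not part of either answer.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not from the answers above): taking the array by
// reference keeps its length N in the type, so no size argument is needed.
template <std::size_t N>
void reverse_in_place(int (&a)[N])
{
    // Swap mirrored pairs and stop at the middle; going past it would
    // swap every pair a second time and restore the original order.
    for (std::size_t i = 0; i < N / 2; ++i)
    {
        std::swap(a[i], a[N - 1 - i]);
    }
}
```

Called on the int std[10] array from the question, N would be deduced as 10. std::reverse from <algorithm> does the same job for any pair of iterators.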

Armadillo C++ bad performance ifft

I have current test code
#include <iostream>
#define ARMA_DONT_USE_WRAPPER
#include <armadillo>
using namespace std::complex_literals;
int main()
{
arma::cx_mat testMat { };
testMat.set_size(40, 19586);
auto nPositions = static_cast<arma::sword>(floor(19586/2));
arma::cx_rowvec a_vec {19586, arma::fill::randu};
arma::cx_rowvec b_vec {19586, arma::fill::randu};
arma::cx_rowvec c_vec {19586, arma::fill::randu};
for (size_t nCo=0; nCo < 3; nCo++) {
arma::rowvec d {19586, arma::fill::randu};
for(size_t iDop = 0; iDop < 40; ++iDop)
{
arma::cx_rowvec signalFi = (b_vec % arma::exp(-1i*M_PI*a_vec));
testMat.row(iDop) += arma::ifft(arma::shift(arma::fft(signalFi), nPositions).eval() % c_vec).eval();
}
}
return 0;
}
I am trying to perform some computation.
StopWatch reports around 300 ms per iteration, which is too slow for my needs.
Can someone explain what I am doing wrong, or suggest some tricks to increase the performance?
I used .eval() to force 'eager' evaluation.
gcc 11.2
armadillo 10.8.2
Release Mode -O3
Updated version. Is it possible to redesign the ifft function?
Test Code
#include <iostream>
#include <fftw3.h>
#include <armadillo>
#include "StopWatch.h"
using namespace std;
inline arma::cx_mat ifftshift(arma::cx_mat const &axx)
{
return arma::shift(axx, -ceil(axx.n_rows/2), 0);
}
void ifft(arma::cx_mat &inMat, arma::cx_mat &outMat)
{
size_t N = inMat.n_rows;
size_t n_cols = inMat.n_cols;
for (size_t index = 0; index < n_cols; ++index)
{
fftw_complex *in1 = reinterpret_cast<fftw_complex *>(inMat.colptr(index));
fftw_complex *out1 = reinterpret_cast<fftw_complex *>(outMat.colptr(index));
fftw_plan pl_ifft_cx1 = fftw_plan_dft_1d(N, in1, out1, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute_dft(pl_ifft_cx1, in1, out1);
}
outMat /= N;
}
int main()
{
arma::cx_mat B;
B << std::complex<double>(+1.225e-01,+8.247e-01) << std::complex<double>(+4.078e-01,+5.632e-01) << std::complex<double>(+8.866e-01,+8.386e-01) << arma::endr
<< std::complex<double>(+5.958e-01,+1.015e-01) << std::complex<double>(+7.857e-01,+4.267e-01) << std::complex<double>(+7.997e-01,+9.176e-01) << arma::endr
<< std::complex<double>(+1.877e-01,+3.378e-01) << std::complex<double>(+2.921e-01,+9.651e-01) << std::complex<double>(+1.056e-01,+6.901e-01) << arma::endr
<< std::complex<double>(+2.322e-01,+6.990e-01) << std::complex<double>(+1.547e-01,+4.256e-01) << std::complex<double>(+9.094e-01,+1.194e-01) << arma::endr
<< std::complex<double>(+3.917e-01,+3.886e-01) << std::complex<double>(+2.166e-01,+4.962e-01) << std::complex<double>(+9.777e-01,+4.464e-01) << arma::endr;
arma::cx_mat output(5,3);
arma::cx_mat shifted = ifftshift(B);
arma::cx_mat arma_result = arma::ifft(shifted);
B.print("B");
arma_result.print("arma_result");
ifft(shifted, output);
output.print("output");
return 0;
}
I just tried a similar operation with my own library and, according to my measurements, you are right that each iteration of the loop shouldn't take more than about 1 millisecond (instead of 300 ms).
This is the equivalent code. Sorry that this is not an Armadillo answer; I am just pointing out the concrete goals: minimizing operations and allocations.
#include<multi/adaptors/fftw.hpp>
#include<multi/array.hpp>
namespace fftw = multi::fftw;
int main() {
multi::array<std::complex<double>, 1> const arr = n_random_complex<double>(19586);
multi::array<std::complex<double>, 1> res(arr.extensions()); // output allocated only once
fftw::plan fdft{arr, res, fftw::forward}; // fftw plan and internal buffers allocated only once
auto const N = 40;
for(int i = 0; i != N; ++i) { // each iteration takes ~1ms in an intel-i7
fdft(arr.base(), res.base()); // fft operation with precalculated plan
std::rotate(res.begin(), res.begin() + res.size()/2, res.end()); // rotation (shift on size/2) done in place, no allocation either
}
}
The full code and library is here: https://gitlab.com/correaa/boost-multi/-/blob/master/adaptors/fftw/test/shift.cpp#L45-58 (the extra code is for the timing measurement).
What is also telling is that I then tried to make every possible mistake to pessimize the code,
to mimic what I think Armadillo is doing "wrong": allocating inside the loop and making copies all the time. Even so, each iteration takes only about 1.5 milliseconds (see the second listing below).
My conclusion is that something is terribly wrong in your Armadillo usage or in the library itself.
multi::array<std::complex<double>, 1> const arr = n_random_complex<double>(19586); BOOST_REQUIRE(arr.size() == 19586);
auto const N = 40;
for(int i = 0; i != N; ++i) {
multi::array<std::complex<double>, 1> res(arr.extensions(), 0.);
fftw::plan fdft{arr, res, fftw::forward};
fdft(arr.base(), res.base());
multi::array<std::complex<double>, 1> res_copy(arr.extensions(), 0.);
std::rotate_copy(res.begin(), res.begin() + res.size()/2, res.end(), res_copy.begin());
}
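The in-place shift in the first listing needs nothing beyond the standard library. As a small sketch, with std::vector standing in for multi::array and the helper name half_shift_in_place invented for this illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: circularly shifts the buffer by half its length
// without allocating, using the same std::rotate call as the loop above.
void half_shift_in_place(std::vector<std::complex<double>>& v)
{
    // std::rotate moves the element at begin() + size()/2 to the front.
    const auto half = static_cast<std::ptrdiff_t>(v.size() / 2);
    std::rotate(v.begin(), v.begin() + half, v.end());
}
```

For an even length such as 19586, rotating left or right by half the size gives the same result. For odd lengths the two differ by one position, so the direction should be checked against whatever convention the shift it replaces uses.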

merging a collection of `Eigen::VectorXd`s into one large `Eigen::VectorXd`

If you go to this Eigen page, you'll see you can initialize VectorXd objects with the << operator. You can also dump a few vector objects into one big VectorXd object (e.g. look at the third example in the section called "The comma initializer").
I want to dump a few vectors into a big vector, but I'm having a hard time writing code that will work for an arbitrarily sized collection of vectors. The following doesn't work, and I'm having a hard time writing it in a way that does (that isn't a double for loop). Any suggestions?
#include <iostream>
#include <Eigen/Dense>
#include <vector>
int main(int argc, char **argv)
{
// make some random VectorXds
std::vector<Eigen::VectorXd> vOfV;
Eigen::VectorXd first(3);
Eigen::VectorXd second(4);
first << 1,2,3;
second << 4,5,6,7;
vOfV.push_back(first);
vOfV.push_back(second);
// here is the problem
Eigen::VectorXd flattened(7);
for(int i = 0; i < vOfV.size(); ++i)
flattened << vOfV[i];
//shows that this doesn't work
for(int i = 0; i < 7; ++i)
std::cout << flattened(i) << "\n";
return 0;
}
The comma initializer does not work like that: a single << statement has to initialize the whole vector or matrix, so it cannot be continued across loop iterations. Instead, allocate a large enough vector, then iterate and assign the blocks.
#include <iostream>
#include <vector>
#include <Eigen/Dense>
// http://eigen.tuxfamily.org/dox/group__TopicStlContainers.html
#include <Eigen/StdVector>
EIGEN_DEFINE_STL_VECTOR_SPECIALIZATION(Eigen::VectorXd)
int main()
{
// make some random VectorXds
std::vector<Eigen::VectorXd> vOfV;
Eigen::VectorXd first(3);
Eigen::VectorXd second(4);
first << 1,2,3;
second << 4,5,6,7;
vOfV.push_back(first);
vOfV.push_back(second);
int len = 0;
for (auto const &v : vOfV)
len += v.size();
Eigen::VectorXd flattened(len);
int offset = 0;
for (auto const &v : vOfV)
{
flattened.middleRows(offset,v.size()) = v;
offset += v.size();
}
std::cout << flattened << "\n";
}

Efficient Eigen Matrix From Function

I'm trying to build a matrix from a kernel, such that A(i,j) = f(i,j) where i,j are both vectors (hence I build A from two matrices x,y in which each row corresponds to a point/vector). My current function looks similar to this:
Eigen::MatrixXd get_kernel_matrix(const Eigen::MatrixXd& x, const Eigen::MatrixXd& y, double(&kernel)(const Eigen::VectorXd&, const Eigen::VectorXd&)) {
Eigen::MatrixXd res (x.rows(), y.rows());
for(int i = 0; i < res.rows() ; i++) {
for(int j = 0; j < res.cols(); j++) {
res(i, j) = kernel(x.row(i), y.row(j));
}
}
return res;
}
Along with some logic for the diagonal (which would in my case likely cause division by zero).
Is there a more efficient/idiomatic way to do this? In some of my tests it appears that Matlab code beats the speed of my C++/Eigen implementation (I'm guessing due to vectorization).
I've looked through a considerable amount of documentation (such as the unaryExpr function), but can't seem to find what I'm looking for.
Thanks for any help.
You can use NullaryExpr with an appropriate lambda to remove your for loops:
MatrixXd res = MatrixXd::NullaryExpr(x.rows(), y.rows(),
[&x,&y,&kernel](int i,int j) { return kernel(x.row(i), y.row(j)); });
Here is a working self-contained example reproducing a matrix product:
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;
double my_kernel(const MatrixXd::ConstRowXpr &x, const MatrixXd::ConstRowXpr &y) {
return x.dot(y);
}
template<typename Kernel>
MatrixXd apply_kernel(const MatrixXd& x, const MatrixXd& y, Kernel kernel) {
return MatrixXd::NullaryExpr(x.rows(), y.rows(),
[&x,&y,&kernel](int i,int j) { return kernel(x.row(i), y.row(j)); });
}
int main()
{
int n = 10;
MatrixXd X = MatrixXd::Random(n,n);
MatrixXd Y = MatrixXd::Random(n,n);
MatrixXd R = apply_kernel(X,Y,my_kernel); // a plain function pointer works; std::ptr_fun was removed in C++17
std::cout << R << "\n\n";
std::cout << X*Y.transpose() << "\n\n";
}
If you don't want to make apply_kernel a template function, you can use std::function to pass the kernel.

Generating random numbers with uniform distribution using Thrust

I need to generate a vector with random numbers between 0.0 and 1.0 using Thrust. The only documented example I could find produces very large random numbers (thrust::generate(myvector.begin(), myvector.end(), rand)).
I'm sure the answer is simple, but I would appreciate any suggestions.
Thrust has random generators you can use to produce sequences of random numbers. To use them with a device vector you will need to create a functor which returns a different element of the random generator sequence. The most straightforward way to do this is using a transformation of a counting iterator. A very simple complete example (in this case generating random single precision numbers between 1.0 and 2.0) could look like:
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
struct prg
{
float a, b;
__host__ __device__
prg(float _a=0.f, float _b=1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
int main(void)
{
const int N = 20;
thrust::device_vector<float> numbers(N);
thrust::counting_iterator<unsigned int> index_sequence_begin(0);
thrust::transform(index_sequence_begin,
index_sequence_begin + N,
numbers.begin(),
prg(1.f,2.f));
for(int i = 0; i < N; i++)
{
std::cout << numbers[i] << std::endl;
}
return 0;
}
In this example, the functor prg takes the lower and upper bounds of the random number as arguments, with (0.f,1.f) as the default. Note that in order to get a different vector each time you call the transform operation, you should use a counting iterator initialised to a different starting value.
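To illustrate the point about the starting value without a GPU, here is a host-only sketch that uses std::minstd_rand from <random> as a stand-in for thrust::default_random_engine. The offset parameter and the function names nth_uniform and fill_uniform are assumptions made for this example; in the Thrust version the same effect would come from constructing the counting iterator with offset instead of 0.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Host-only stand-in for the prg functor: element n is produced by a fresh
// engine advanced n steps, so each index maps to a fixed point in one sequence.
float nth_uniform(unsigned int n, float a, float b)
{
    std::minstd_rand rng;                              // default seed, as in prg
    std::uniform_real_distribution<float> dist(a, b);
    rng.discard(n);                                    // skip to position n
    return dist(rng);
}

// Hypothetical helper: 'offset' plays the role of the counting iterator's
// starting value; a different offset selects a different slice of the sequence.
std::vector<float> fill_uniform(std::size_t count, unsigned int offset, float a, float b)
{
    assert(a < b);
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        out[i] = nth_uniform(offset + static_cast<unsigned int>(i), a, b);
    }
    return out;
}
```

Two calls with the same offset return identical vectors, which is why the starting value has to change between calls. Shifting the offset by one only slides the window by one element, so successive calls would normally advance the offset by at least the vector length.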
It might not be a direct answer to your question, but the cuRAND library is quite powerful for this purpose. It can generate random numbers on both the GPU and the CPU, and it offers many distributions (normal distribution, etc.).
Search for the title: "An NVIDIA CURAND implementation" on this link: http://adnanboz.wordpress.com/tag/nvidia-curand/
//Create a new generator
curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
//Set the generator options
curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
//Generate random numbers
curandGenerateUniform(m_prng, d_randomData, dataCount);
One note: do not create the generator again and again, since it performs some precalculations on creation. Calling curandGenerateUniform is quite fast and produces values between 0.0 and 1.0.
The approach suggested by @talonmies has a number of useful characteristics. Here's another approach that mimics the example you quoted:
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <cstdlib> // rand, RAND_MAX
#include <iostream>
#define DSIZE 5
__host__ static __inline__ float rand_01()
{
return ((float)rand()/RAND_MAX);
}
int main(){
thrust::host_vector<float> h_1(DSIZE);
thrust::generate(h_1.begin(), h_1.end(), rand_01);
std::cout<< "Values generated: " << std::endl;
for (unsigned i=0; i<DSIZE; i++)
std::cout<< h_1[i] << " : ";
std::cout<<std::endl;
return 0;
}
Similar to the example you quoted, this uses rand(), and therefore can only be used to generate host vectors. Likewise, it will produce the same sequence each time unless you re-seed rand() appropriately.
There are already satisfactory answers to this question. In particular, the OP and Robert Crovella have dealt with thrust::generate, while talonmies has proposed using thrust::transform.
I think there is another possibility, namely using thrust::for_each, so I'm posting a fully worked example using that primitive, just for the record.
I'm also timing the different solutions.
THE CODE
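#include <cstdio>  // printf
#include <cstdlib> // rand, RAND_MAX
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/for_each.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <thrust/random.h>
#include "TimingCPU.h"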
/**************************************************/
/* RANDOM NUMBERS GENERATION STRUCTS AND FUNCTION */
/**************************************************/
template<typename T>
struct rand_01 {
__host__ T operator()(T& VecElem) const { return (T)rand() / RAND_MAX; }
};
template<typename T>
struct rand_01_for_each {
__host__ void operator()(T& VecElem) const { VecElem = (T)rand() / RAND_MAX; }
};
template<typename T>
__host__ T rand_01_fcn() { return ((T)rand() / RAND_MAX); }
struct prg
{
float a, b;
__host__ __device__
prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
/********/
/* MAIN */
/********/
int main() {
TimingCPU timerCPU;
const int N = 2 << 18;
//const int N = 64;
const int numIters = 50;
thrust::host_vector<double> h_v1(N);
thrust::host_vector<double> h_v2(N);
thrust::host_vector<double> h_v3(N);
thrust::host_vector<double> h_v4(N);
printf("N = %d\n", N);
double timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::transform(thrust::host, h_v1.begin(), h_v1.end(), h_v1.begin(), rand_01<double>());
timing = timing + timerCPU.GetCounter();
}
printf("Timing using transform = %f\n", timing / numIters);
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::counting_iterator<unsigned int> index_sequence_begin(0);
thrust::transform(index_sequence_begin,
index_sequence_begin + N,
h_v2.begin(),
prg(0.f, 1.f));
timing = timing + timerCPU.GetCounter();
}
printf("Timing using transform and internal Thrust random generator = %f\n", timing / numIters);
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::for_each(h_v3.begin(), h_v3.end(), rand_01_for_each<double>());
timing = timing + timerCPU.GetCounter();
}
timerCPU.StartCounter();
printf("Timing using for_each = %f\n", timing / numIters);
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N; k++)
// std::cout << h_v3[k] << " : ";
//std::cout << std::endl;
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::generate(h_v4.begin(), h_v4.end(), rand_01_fcn<double>);
timing = timing + timerCPU.GetCounter();
}
timerCPU.StartCounter();
printf("Timing using generate = %f\n", timing / numIters);
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N; k++)
// std::cout << h_v4[k] << " : ";
//std::cout << std::endl;
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N * 2; k++)
// std::cout << h_v[k] << " : ";
//std::cout << std::endl;
return 0;
}
On a laptop Core i5 platform, I got the following timings:
N = 2097152
Timing using transform = 33.202298
Timing using transform and internal Thrust random generator = 264.508662
Timing using for_each = 33.155237
Timing using generate = 35.309399
The timings are equivalent, apart from the second one, which uses Thrust's internal random number generator instead of rand().
Please note that, unlike the other solutions, the one using thrust::generate is somewhat more rigid, since a plain function used to generate the random numbers cannot take input parameters. So, for example, it is not possible to scale the generated values by a constant passed in as an argument.
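As a hedged counterpoint: like std::generate, thrust::generate takes a generator function object, so parameters can be carried as members of a functor rather than as call arguments. A host-only sketch with std::generate follows. The functor name scaled_rand, its fields, and the helper make_random are invented for this example, and it uses std::mt19937 instead of rand() so that the seed is explicit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical generator functor: the bounds and the seed are stored in the
// object, so operator() itself needs no arguments.
struct scaled_rand
{
    std::mt19937 engine;
    std::uniform_real_distribution<double> dist;

    scaled_rand(unsigned int seed, double lo, double hi) : engine(seed), dist(lo, hi) {}

    double operator()() { return dist(engine); }
};

// Fills a host vector with values drawn between lo and hi; the same seed
// reproduces the same sequence, a different seed gives a different one.
std::vector<double> make_random(std::size_t n, unsigned int seed, double lo, double hi)
{
    assert(lo < hi);
    std::vector<double> v(n);
    std::generate(v.begin(), v.end(), scaled_rand(seed, lo, hi));
    return v;
}
```

This is only a sequential host sketch. Whether a stateful functor behaves as intended under a parallel Thrust backend is a separate question, which is one reason the prg approach indexes into the sequence by position instead of keeping state.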