I have an application where I need to pass N Eigen matrices to some functions. N is a compile-time constant, and these functions are called a significant number of times within tight loops. To avoid dynamic allocations at runtime, I thought it might be nice to store these matrices in an std::array, and then pass iterators to this array as function arguments. As a trivial example, consider:
#include <array>
#include <Eigen/Dense>

const int N = 3;
const int SIZE = 125;
typedef std::array<Eigen::Matrix<double, SIZE, 1>, N> MatrixArray;

void computeMatrixProductArray(MatrixArray::const_iterator BeginIn,
                               MatrixArray::const_iterator EndIn,
                               MatrixArray::iterator BeginOut,
                               MatrixArray::iterator EndOut)
{
    Eigen::Matrix<double, SIZE, 1> Test; // left uninitialized; the values are irrelevant here
    for (int J = 0; J < N; ++J) {
        *(BeginOut + J) = Test.array() * (*(BeginIn + J)).array();
    }
}

int main()
{
    MatrixArray ArrayIn, ArrayOut;
    computeMatrixProductArray(ArrayIn.cbegin(), ArrayIn.cend(),
                              ArrayOut.begin(), ArrayOut.end());
}
My question has to do with the alignment of the matrices stored in MatrixArray, and how Eigen 3.3 treats unaligned memory. The size of the matrices (1000 bytes each, a multiple of 8 but not of 16) and the contiguous layout of std::array ensure that the individual matrices in a MatrixArray cannot all fall on a nice alignment boundary. However, to my understanding, Eigen 3.3 can still vectorize this case with unaligned operations.
Can anyone provide insight into what happens in the above example when J == 1 and the first entry of that matrix is not aligned? Does Eigen 3.3 treat this case the same way it treats equivalent dynamically-sized matrices? My understanding there is that scalar (or unaligned) operations are used until an appropriate alignment boundary is reached, at which point fully aligned operations take over. Or, since the matrix is not aligned to begin with, is there no chance of aligned vectorization? Or is something else happening entirely?
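To make the alignment claim concrete: each Matrix<double, 125, 1> occupies 1000 bytes, so consecutive elements of the std::array start 1000 bytes apart and cannot all share a 16- or 32-byte alignment. Here is a quick diagnostic sketch (not part of the original program) that prints each element's address modulo the common SIMD alignments:

#include <array>
#include <cstdint>
#include <iostream>
#include <Eigen/Dense>

const int N = 3;
const int SIZE = 125; // sizeof(Eigen::Matrix<double, SIZE, 1>) is 1000 bytes here
typedef std::array<Eigen::Matrix<double, SIZE, 1>, N> MatrixArray;

int main()
{
    MatrixArray A;
    for (int J = 0; J < N; ++J) {
        std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(A[J].data());
        std::cout << "matrix " << J << ": address mod 16 = " << addr % 16
                  << ", mod 32 = " << addr % 32 << '\n';
    }
}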
Thanks for any insight, and thanks to the developers for maintaining such a powerful library.
I'm hesitating on how to organize the memory layout of my 2D data.
Basically, what I want is an N*M 2D double array, where N and M are both in the thousands (and are derived from user-supplied data).
The way I see it, I have two choices:
double *data = new double[N*M];
or
double **data = new double*[N];
for (size_t i = 0; i < N; ++i)
    data[i] = new double[M];
The first choice is what I'm leaning towards.
The main advantages I see are shorter new/delete syntax, a contiguous memory layout (which means adjacent memory accesses at runtime if I arrange my accesses correctly), and possibly better performance for vectorized code (auto-vectorized, or using vector libraries such as vDSP or vecLib).
On the other hand, it seems to me that allocating one big chunk of contiguous memory could fail, or take more time, compared to allocating a bunch of smaller ones. And the second method also has the advantage of the shorter syntax data[i][j] compared to data[i*M+j].
What would be the most common / better way to do this, mainly from a performance standpoint? (Even though any improvements are going to be small, I'm curious to see which would perform better.)
Between the first two choices, for reasonable values of M and N, I would almost certainly go with choice 1. You skip a pointer dereference, and you get nice caching if you access data in the right order.
In terms of your concerns about size, we can do some back-of-the-envelope calculations.
Since M and N are in the thousands, suppose each is 10000 as an upper bound. Then your total memory consumed is
10000 * 10000 * sizeof(double) = 8 * 10^8 bytes.
This is roughly 800 MB, which, while large, is quite reasonable given the amount of memory in modern machines.
If N and M are constants, it is better to just statically declare the memory you need as a two-dimensional array. Or, you could use std::array.
std::array<std::array<double, M>, N> data;
If only M is a constant, you could use a std::vector of std::array instead.
std::vector<std::array<double, M>> data(N);
If M is not constant, you need to perform some dynamic allocation. But, std::vector can be used to manage that memory for you, so you can create a simple wrapper around it. The wrapper below returns a row intermediate object to allow the second [] operator to actually compute the offset into the vector.
#include <cstddef>
#include <vector>

template <typename T>
class matrix {
    const size_t N;
    const size_t M;
    std::vector<T> v_;
    struct row {
        matrix &m_;
        const size_t r_;
        row (matrix &m, size_t r) : m_(m), r_(r) {}
        T & operator [] (size_t c) { return m_.v_[r_ * m_.M + c]; }
    };
    struct const_row {
        const matrix &m_;
        const size_t r_;
        const_row (const matrix &m, size_t r) : m_(m), r_(r) {}
        T operator [] (size_t c) const { return m_.v_[r_ * m_.M + c]; }
    };
public:
    matrix (size_t n, size_t m) : N(n), M(m), v_(N*M) {}
    // the proxies are returned by value; returning a reference to a
    // temporary row here would dangle
    row operator [] (size_t r) { return row(*this, r); }
    const_row operator [] (size_t r) const { return const_row(*this, r); }
};
matrix<double> data(10,20);
data[1][2] = .5;
std::cout << data[1][2] << '\n';
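A function taking the matrix by const reference then reads through the const proxy (a sketch; the wrapper above does not expose its dimensions, so they are passed separately here):

// reading through a const reference exercises the const_row proxy
double sum(const matrix<double> &m, size_t rows, size_t cols)
{
    double s = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += m[r][c];
    return s;
}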
To address your particular concern about performance: your rationale for wanting a single contiguous allocation is correct. However, you should avoid doing new and delete yourself (which is something this wrapper provides), and if the data is more naturally interpreted as multi-dimensional, then showing that in the code will make it easier to read as well.
Multiple allocations, as in your second technique, are inferior because they take more time; their one advantage is that they may succeed more often if your system's memory is fragmented (the free memory consists of smaller holes, and no free chunk is large enough to satisfy the single allocation request). But multiple allocations have another downside: some extra memory is needed for the pointers to each row.
My suggestion provides the single-allocation technique without needing to call new and delete explicitly, as the memory is managed by vector. At the same time, it allows the data to be addressed with the two-dimensional syntax [x][y]. So it provides the benefits of a single allocation together with the convenient syntax of the multi-allocation approach, provided you have enough memory to fulfill the allocation request.
Consider using something like the following:
// array of pointers to doubles, to point at the beginning of each row
double **data = new double*[N];
// allocate one block big enough to hold all N * M doubles, anchored at row 0
data[0] = new double[N * M];
// point the remaining row pointers into that block
for (size_t i = 1; i < N; i++)
    data[i] = data[0] + i * M;
I'm not sure if this is a general practice or not; I just came up with it. Some downsides still apply to this approach, but I think it eliminates most of them, while keeping the ability to access individual doubles as data[i][j].
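For completeness (not shown in the snippet above), cleanup needs only two delete[] calls, because every double lives in the single block owned by data[0]:

delete[] data[0]; // releases the one big block of N * M doubles
delete[] data;    // releases the row-pointer table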
I dug into the boost ublas code and found that the ublas implementation of memory allocation in compressed_matrix is not as standard as in CSC or CSR.
There is one line that causes the trouble, namely
non_zeros = (std::max) (non_zeros, (std::min) (size1_, size2_));
in the private restrict_capacity method.
Does that mean that if I create a sparse matrix, the number of nonzeros allocated in boost ublas will always be at least min(nrow, ncol)?
The following code demonstrates the problem. The output has zeros in the unused part of the vectors allocated by compressed_matrix.
typedef boost::numeric::ublas::compressed_matrix<double,
    boost::numeric::ublas::column_major, 0,
    std::vector<std::size_t>, std::vector<double> > Matrix;

long nrow = 5;
long ncol = 4;
long nnz = 2;
Matrix m(nrow, ncol, nnz);
cout << "setting" << endl;
m(1,2) = 1.1;
m(2,2) = 2.1;
for (int i = 0; i < m.index1_data().size(); i++) {
    cout << "ind1 -" << i << " " << m.index1_data()[i] << endl;
}
for (int i = 0; i < m.index2_data().size(); i++) {
    cout << "ind2 -" << i << " " << m.index2_data()[i] << endl;
}
for (int i = 0; i < m.value_data().size(); i++) {
    cout << "val -" << i << " " << m.value_data()[i] << endl;
}
Perhaps it is a performance-design choice with certain use cases in mind.
The idea is that when filling the compressed_matrix, one might try to minimize reallocations of the arrays that maintain the index/value data. If one starts from zero allocated space, reallocation has to happen right away and then repeatedly; implementations therefore reallocate speculatively (e.g. reserving twice the space each time the allocated space is exceeded, like std::vector does).
Since the point of a sparse matrix is to kill the $N^2$ scaling of a dense matrix, a good guess is that you will use roughly $N$ of the $N^2$ possible elements. If you use more than $N$, reallocation will still happen at some point, but not many times; and at that point you would probably be better off switching to a dense matrix anyway.
What is a little more surprising is that it overwrites the value you pass in. But still, the above applies.
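The speculative growth described above is the same amortized strategy std::vector uses; the exact growth factor is implementation-defined (commonly 1.5 or 2). A small sketch that makes the reallocation pattern visible:

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<double> v;
    std::size_t last_cap = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(0.0);
        if (v.capacity() != last_cap) { // a reallocation just happened
            last_cap = v.capacity();
            std::cout << "size " << v.size()
                      << " -> capacity " << last_cap << '\n';
        }
    }
}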
Is the access speed (as opposed to the creation speed) of dynamic and classical multi-dimensional arrays different?
I mean: when I access all the values of a three-dimensional array with the help of loops, is there any speed difference between arrays created by the dynamic and the classical methods?
When I say "dynamic three-dimensional array", I mean matris_cos[kuanta][d][angle_scale] is created like this:
matris_cos = new float**[kuanta];
for (int i = 0; i < kuanta; ++i) {
    matris_cos[i] = new float*[d];
    for (int j = 0; j < d; ++j)
        matris_cos[i][j] = new float[angle_scale];
}
When I say "classical three-dimensional array", I mean matris_cos[kuanta][d][angle_scale] is simply created like this.
float matris_cos[kuanta][d][angle_scale];
But please note: I'm not asking about the creation speed of these arrays. I want to access the values of these arrays via some loops. Is there any speed difference when accessing the values?
An array of pointers (to arrays of pointers) will require extra levels of indirection to access a random element, while a multi-dimensional array will require basic arithmetic (multiplication and pointer addition). On most modern platforms, indirection is likely to be slower unless you use cache-friendly access patterns. Also, all the elements of the multi-dimensional array will be contiguous, which could help caching if you iterate over the whole array.
Whether this difference is measurable or not is something you can only tell by measuring it.
If the extra indirection does prove to be a bottleneck, you could replace the array-of-pointers with a class to represent the multi-dimensional array with a flat array:
#include <cstddef>
#include <vector>

class array_3d {
    size_t d1, d2, d3;
    std::vector<float> flat;
public:
    array_3d(size_t d1, size_t d2, size_t d3) :
        d1(d1), d2(d2), d3(d3), flat(d1*d2*d3)
    {}
    float & operator()(size_t x, size_t y, size_t z) {
        return flat[x*d2*d3 + y*d3 + z];
    }
    // and a similar const overload
};
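Usage might then look like this (a sketch, assuming the class above is in scope; the placeholder sizes stand in for kuanta, d, and angle_scale):

#include <iostream>

int main()
{
    array_3d matris_cos(10, 20, 30); // placeholder sizes for kuanta, d, angle_scale
    matris_cos(1, 2, 3) = 0.5f;      // replaces matris_cos[1][2][3]
    std::cout << matris_cos(1, 2, 3) << '\n';
}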
I believe that the next C++ standard (due next year) will include dynamically sized arrays, so you should be able to use the multi-dimensional form in all cases.
You won't be able to spot any difference between them in a typical application unless your arrays are pretty huge and you spend a lot of time reading/writing to them, but nonetheless, there is a difference.
float matris_cos[kuanta][d][angle_scale];
1) The memory for this multidimensional array will be contiguous. There will be less cache misses as a result.
2) The array will require space only for the floats themselves.
matris_cos = new float**[kuanta];
for (int i = 0; i < kuanta; ++i) {
    matris_cos[i] = new float*[d];
    for (int j = 0; j < d; ++j)
        matris_cos[i][j] = new float[angle_scale];
}
1) The memory for this multidimensional array is allocated in blocks and is thus much less likely to be contiguous. This may result in cache misses.
2) This method requires space for the pointers as well as the floats themselves.
Since there's indirection in the second case, you can expect a tiny speed difference when attempting to access or change values.
To recap:
Second case uses more memory
Second case involves indirection
Second case does not have guaranteed cache locality.
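If you want numbers rather than intuition, a rough micro-benchmark along the following lines can expose the difference (a sketch, not from the original answers; absolute results depend heavily on compiler flags, cache sizes, and access order):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t N = 256, M = 256, K = 64;

    // flat, contiguous storage
    std::vector<float> flat(N * M * K, 1.0f);

    // pointer-chasing storage, built like the second case above
    float ***nested = new float**[N];
    for (std::size_t i = 0; i < N; ++i) {
        nested[i] = new float*[M];
        for (std::size_t j = 0; j < M; ++j)
            nested[i][j] = new float[K]();
    }

    auto t0 = std::chrono::steady_clock::now();
    double sum1 = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < M; ++j)
            for (std::size_t k = 0; k < K; ++k)
                sum1 += flat[(i * M + j) * K + k];
    auto t1 = std::chrono::steady_clock::now();

    double sum2 = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < M; ++j)
            for (std::size_t k = 0; k < K; ++k)
                sum2 += nested[i][j][k];
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "flat:   " << std::chrono::duration<double>(t1 - t0).count()
              << " s (checksum " << sum1 << ")\n"
              << "nested: " << std::chrono::duration<double>(t2 - t1).count()
              << " s (checksum " << sum2 << ")\n";

    // tear down the nested allocations
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < M; ++j)
            delete[] nested[i][j];
        delete[] nested[i];
    }
    delete[] nested;
}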
I have a 4x4 matrix class that holds the values as a 2D float array: float mat[4][4]; and I overloaded the [] operator:
inline float *Matrix4::operator[](int row)
{
    return mat[row];
}
and I have a uniform mat4 array in my shader (uniform mat4 Bones[64];) onto which I will upload my data. I hold the bones as a Matrix4 pointer (Matrix4 *finalBones), which is a matrix array containing numJoints elements. This is what I use to upload the data:
glUniformMatrix4fv(shader.getLocation("Bones"), numJoints, false, finalBones[0][0]);
I am totally unsure about what is going to happen here, so I need to ask whether this will work, or whether I need to extract everything into 16*numJoints-sized float arrays, which seems expensive to do each tick.
Edit:
This is what I have done, which seems really expensive for each draw operation:
void MD5Loader::uploadToGPU()
{
    float* finalmat = new float[16 * numJoints];
    for (int i = 0; i < numJoints; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                finalmat[16*i + j*4 + k] = finalBones[i][j][k];
    glUniformMatrix4fv(shader.getLocation("Bones"), numJoints, false, finalmat);
    delete[] finalmat;
}
It depends on two things. Assuming your compiler is C++03 (and you care about standards compliance), your Matrix4 class must be a POD type; it must, in particular, have no constructors. C++11 relaxes these rules significantly.
The other thing is that your matrix appears to be row-major: the first coordinate passed to operator[] selects the row rather than the column, so the storage order is row-major.
If you're going to give OpenGL row-major matrices, you need to tell it that they are row-major:
glUniformMatrix4fv(shader.getLocation("Bones"), numJoints, GL_TRUE, finalBones[0][0]);
With C++11, a class with standard layout can be cast to a pointer, addressing its first member, and back again. With a sizeof check, you then have a guarantee of contiguous matrix data:
#include <type_traits>
static_assert((sizeof(Matrix4) == (sizeof(GLfloat) * 16)) &&
(std::is_standard_layout<Matrix4>::value),
"Matrix4 does not satisfy contiguous storage requirements");
Trivial (or POD) layouts are just too restrictive once you have non-trivial constructors, etc. Your post suggests you will need to set the transpose parameter to GL_TRUE.
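Once that static_assert passes, the copy loop in uploadToGPU becomes unnecessary: numJoints contiguous Matrix4 objects are already a tightly packed run of 16 * numJoints floats. A sketch of the direct upload, assuming finalBones points at numJoints contiguous Matrix4 objects as in the question:

// row-major data, so transpose = GL_TRUE, as discussed above
glUniformMatrix4fv(shader.getLocation("Bones"), numJoints, GL_TRUE,
                   reinterpret_cast<const GLfloat*>(finalBones));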
I need to create a large two dimensional array of objects. I've read some related questions on this site and others regarding multi_array, matrix, vector, etc, but haven't been able to put it together. If you recommend using one of those, please go ahead and translate the code below.
Some considerations:
The array is somewhat large (1300 x 1372).
I might be working with more than one of these at a time.
I'll have to pass it to a function at some point.
Speed is a large factor.
The two approaches that I thought of were:
Pixel pixelArray[1300][1372];
for(int i=0; i<1300; i++) {
    for(int j=0; j<1372; j++) {
        pixelArray[i][j].setOn(true);
        ...
    }
}
and
Pixel* pixelArray[1300][1372];
for(int i=0; i<1300; i++) {
    for(int j=0; j<1372; j++) {
        pixelArray[i][j] = new Pixel();
        pixelArray[i][j]->setOn(true);
        ...
    }
}
What's the right approach/syntax here?
Edit:
Several answers have assumed Pixel is small - I left out details about Pixel for convenience, but it's not small/trivial. It has ~20 data members and ~16 member functions.
Your first approach allocates everything on the stack, which is otherwise fine, but leads to a stack overflow when you try to allocate too much. The limit is usually around 8 megabytes on modern OSes, so allocating an array of 1300 * 1372 elements on the stack is not an option.
Your second approach allocates 1300 * 1372 elements on the heap individually, which is a tremendous load for the allocator, which maintains multiple linked lists to chunks of allocated and free memory. Also a bad idea, especially since Pixel seems to be rather small.
What I would do is this:
Pixel* pixelArray = new Pixel[1300 * 1372];
for(int i=0; i<1300; i++) {
    for(int j=0; j<1372; j++) {
        pixelArray[i * 1372 + j].setOn(true);
        ...
    }
}
This way you allocate one large chunk of memory on heap. Stack is happy and so is the heap allocator.
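For completeness (not in the original answer), the matching cleanup is a single call, since there was a single new[]:

delete[] pixelArray; // one delete[] pairs with the one new[]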
If you want to pass it to a function, I'd vote against using simple arrays. Consider:
void doWork(Pixel array[][]);
This is not even valid C++ (only the first array dimension may be omitted in a parameter declaration), and it contains no size information. You could pass the size info via separate arguments, but I'd rather use something like std::vector<Pixel>. Of course, this requires that you define an addressing convention (row-major or column-major).
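For illustration, a flat vector with a documented row-major convention might be passed around like this (a sketch; Pixel and setOn are taken from the question):

#include <cstddef>
#include <vector>

// convention: element (row, col) of a height x width image
// lives at index row * width + col (row-major)
void doWork(std::vector<Pixel> &pixels, std::size_t width)
{
    const std::size_t height = pixels.size() / width;
    for (std::size_t i = 0; i < height; ++i)
        for (std::size_t j = 0; j < width; ++j)
            pixels[i * width + j].setOn(true);
}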
An alternative is std::vector<std::vector<Pixel> >, where each level of vectors is one array dimension. Advantage: the double subscript pixelArray[x][y] works. Disadvantages: creating such a structure is tedious, copying is more expensive because it happens per contained vector instance instead of with a simple memcpy, and nothing forces the vectors contained in the top-level vector to all have the same size.
These are basically your options using the Standard Library. The right solution would be something like std::vector with two dimensions. Numerical libraries and image manipulation libraries come to mind, but matrix and image classes are most likely limited to primitive data types in their elements.
EDIT: Forgot to make it clear that everything above is only a set of arguments for and against each approach. In the end, your personal taste and the context will have to be taken into account. If you're on your own in the project, a flat vector plus a defined and documented addressing convention should be good enough. But if you're in a team, and it's likely that someone will disregard the documented convention, the cascaded vector-in-vector structure is probably better, because the tedious parts can be implemented in helper functions.
I'm not sure how complicated your Pixel data type is, but maybe something like this will work for you?:
std::fill(array, array+100, 42); // sets every value in the array to 42
Reference:
Initialization of a normal array with one default value
Check out Boost's Generic Image Library.
gray8_image_t pixelArray;
pixelArray.recreate(1300, 1372);
// iterate over a view of the image; GIL images are not iterated directly
gray8_view_t pixelView = view(pixelArray);
for (gray8_view_t::iterator pIt = pixelView.begin(); pIt != pixelView.end(); ++pIt) {
    *pIt = 1;
}
My personal preference would be to use std::vector:
typedef std::vector<Pixel> PixelRow;
typedef std::vector<PixelRow> PixelMatrix;
PixelMatrix pixelArray(1300, PixelRow(1372, Pixel(true)));
//                     ^^^^          ^^^^  ^^^^^^^^^^^
//                    Size 1        Size 2 default value
While I wouldn't necessarily make this a struct, this demonstrates how I would approach storing and accessing the data. If Pixel is rather large, you may want to use a std::deque instead.
#include <cstddef>
#include <vector>

struct Pixel2D {
    Pixel2D (size_t rsz_, size_t csz_) : data(rsz_*csz_), rsz(rsz_), csz(csz_) {
        for (size_t r = 0; r < rsz; r++)
            for (size_t c = 0; c < csz; c++)
                at(r, c).setOn(true);
    }
    // row-major addressing into the single flat vector
    Pixel &at(size_t row, size_t col) { return data.at(row*csz + col); }
    std::vector<Pixel> data;
    size_t rsz;
    size_t csz;
};
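Usage might then look like this (hypothetical, again assuming the Pixel type from the question):

Pixel2D grid(1300, 1372);      // one contiguous allocation, all pixels set on
grid.at(10, 20).setOn(false);  // bounds-checked access via vector::at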