Efficient way of making data contiguous to transfer it among nodes


struct Face
{
// Matrixd is 1D representation of 2D matrix
std::array < Matrixd<5,5>, 2 > M;
};
std::vector <Face> face;
I have a for-loop distributed among nodes. After all nodes finish working on their elements, I would like to exchange the corresponding elements among nodes. But AFAIK, to use MPI_Allgatherv the data must be contiguous. First of all, I switched to a 1D representation of 2D matrices (I was using [][] notation before). Now I want to make the M member of each Face contiguous. I am thinking of copying all elements of, say, M[0] into a std::array and transferring that among nodes. Is this approach efficient? To give an idea of the amount of data I work with: if I have 20k cells, I have at most 20k*3=60k faces. I might have a million cells, too.

A true 2D array in C/C++, e.g. int foo[5][5] is already contiguous in memory; it's basically just syntactic sugar for int foo[25] where accesses like foo[3][2] implicitly look up foo[3*5 + 2] in the flat equivalent. Switching to a Matrixd defined in a single dimension won't change the actual memory layout.
std::array is (mostly) just a wrapper for C-style arrays as well; with no virtual members, and compile time defined size with no internal pointers (just the raw array), it's also going to be contiguous. I strongly suspect if you checked the assembly produced, you'd find that the array of Matrixds is already contiguous.
In short, I don't think you need to change anything; you're already contiguous, so MPI should be fine.
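The contiguity claim above can be checked at compile time instead of by reading assembly. The following is a minimal sketch, assuming a hypothetical Matrixd that simply wraps a flat std::array of doubles (the question does not show its definition); the names kDoublesPerFace and pack_faces are also made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the question's Matrixd: R*C doubles stored flat.
template <std::size_t R, std::size_t C>
struct Matrixd {
    std::array<double, R * C> v{};
};

struct Face {
    std::array<Matrixd<5, 5>, 2> M;
};

constexpr std::size_t kDoublesPerFace = 2 * 5 * 5;

// Compile-time checks: Face is plain data with no padding or hidden pointers.
static_assert(std::is_trivially_copyable<Face>::value,
              "Face must be plain data");
static_assert(sizeof(Face) == kDoublesPerFace * sizeof(double),
              "Face must hold exactly 50 packed doubles");

// Copies every face into one flat buffer of doubles (50 per face).
// Only needed if a separate buffer is wanted; the vector is already contiguous.
inline std::vector<double> pack_faces(const std::vector<Face>& faces) {
    std::vector<double> flat(faces.size() * kDoublesPerFace);
    if (!faces.empty()) {
        std::memcpy(flat.data(), faces.data(), flat.size() * sizeof(double));
    }
    return flat;
}
```

If both static_asserts pass for the real Matrixd, the storage behind face.data() is already one block of face.size() * 50 doubles, so it could presumably be handed to MPI as a buffer of MPI_DOUBLE without the extra copy.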

Related

In-place conversion between row major and column major storage in three dimensions

I have a program that receives three-dimensional data as flat arrays in row-major (a.k.a. "C") order as input.
I need to pass these to a library that expects the same three-dimensional data in column-major (a.k.a. "Fortran") order.
Preprocessing the arrays outside of my program is not an option.
Transforming the data while copying is no problem except for performance - there are quite a few arrays of several million elements each, and the allocation and copying is my major bottleneck - so I would like to do the transformation in-place and see if that helps.
However, I have been unable to work out the maths behind this transformation, and my googling has been less than helpful.
Is there an efficient way to perform this transformation in-place?

An in-place transformation (if possible) would still have to move every element of these big arrays, so it would not be cache-friendly either.
Each allocation is done once per big array (followed by its long transformation), and if you have to deal with a stream of such arrays you can reuse old buffers to avoid repeated alloc/free calls.
I would simply recommend loading the data in the predictable, cache-friendly row-major order and relying on the store-buffer machinery to deal with the column-major store anti-pattern when writing to the second (allocated) array.
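A minimal sketch of that out-of-place copy, assuming double elements and dimensions ni x nj x nk (the function name to_column_major is hypothetical). The source is read sequentially in row-major order and each element is written to its column-major position.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies an ni x nj x nk array from row-major order (k varies fastest)
// to column-major order (i varies fastest).
inline std::vector<double> to_column_major(const std::vector<double>& src,
                                           std::size_t ni,
                                           std::size_t nj,
                                           std::size_t nk) {
    assert(src.size() == ni * nj * nk);
    std::vector<double> dst(src.size());
    std::size_t r = 0;  // row-major index, advances sequentially
    for (std::size_t i = 0; i < ni; ++i) {
        for (std::size_t j = 0; j < nj; ++j) {
            for (std::size_t k = 0; k < nk; ++k) {
                // Sequential load, strided store.
                dst[i + ni * (j + nj * k)] = src[r++];
            }
        }
    }
    return dst;
}
```

For a stream of arrays, the destination buffer could be passed in and reused rather than allocated on every call.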

What are the Disadvantages of Nested Vectors?

I'm still fairly new to C++ and have a lot left to learn, but something that I've become quite attached to recently is using nested (multidimensional) vectors. So I may typically end up with something like this:
std::vector<std::vector<std::string> > table;
Which I can then easily access elements of like this:
std::string data = table[3][5];
However, recently I've been getting the impression that it's better (in terms of performance) to have a single-dimensional vector and then just use "index arithmetic" to access elements correspondingly. I assume this performance impact is significant for much larger or higher dimensional vectors, but I honestly have no idea and haven't been able to find much information about it so far.
While, intuitively, it kind of makes sense that a single vector would have better performance than a higher dimensional one, I honestly don't understand the actual reasons why. Furthermore, if I were to just use single-dimensional vectors, I would lose the intuitive syntax I have for accessing elements of multidimensional ones. So here are my questions:
Why are multidimensional vectors inefficient? If I were to only use a single-dimensional vector instead (to represent data in higher dimensions), what would be the best, most intuitive way to access its elements?

It depends on the exact conditions. I'll talk about the case, when the nested version is a true 2D table (i.e., all rows have equal length).
A 1D vector will usually be faster for every usage pattern. Or, at least, it won't be slower than the nested version.
The nested version can be considered worse because:
it needs one allocation per row instead of a single allocation.
accessing an element takes an additional indirection, so it is slower (an additional indirection is usually slower than the multiply needed in the 1D case).
if you process your data sequentially, it could be much slower when the 2D data is scattered around memory. There could be a lot of cache misses, depending on where the memory allocator places the different rows.
So, if you are after performance, I'd recommend creating a 2D wrapper class around a 1D vector. This way, you get an API as simple as the nested version along with the best performance. And if, for some reason, you later decide to use the nested version instead, you only have to change the internal implementation of this wrapper class.
The most intuitive way to access 1D elements is y*width+x. But, if you know your access patterns, you can choose a different one. For example, in a painting program, a tile based indexing could be better for storing and manipulating the image. Here, data can be indexed like this:
int tileMask = (1<<tileSizeL)-1; // tileSizeL is log of tileSize
int tileX = x>>tileSizeL;
int tileY = y>>tileSizeL;
int tileIndex = tileY*numberOfTilesInARow + tileX;
int index = (tileIndex<<(tileSizeL*2)) + ((y&tileMask)<<tileSizeL) + (x&tileMask);
This method has a better spatial locality in memory (pixels near to each other tend to have a near memory address). Index calculation is slower than a simple y*width+x, but this method could have much less cache misses, so in the end, it could be faster.
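The 2D wrapper recommended in this answer could look roughly like the sketch below. The class name Grid2D and its interface are assumptions for illustration; it uses the plain y*width+x mapping, and only the private indexing would need to change to switch to a tiled layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 2D view over a single contiguous std::vector (row-major).
template <typename T>
class Grid2D {
public:
    Grid2D(std::size_t width, std::size_t height, const T& init = T())
        : width_(width), height_(height), cells_(width * height, init) {}

    // Element access with table(x, y) syntax; one multiply, no extra indirection.
    T& operator()(std::size_t x, std::size_t y) {
        return cells_[index(x, y)];
    }
    const T& operator()(std::size_t x, std::size_t y) const {
        return cells_[index(x, y)];
    }

    std::size_t width() const { return width_; }
    std::size_t height() const { return height_; }

    // Flat storage, useful for a single linear pass over all elements.
    const std::vector<T>& cells() const { return cells_; }

private:
    std::size_t index(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        return y * width_ + x;
    }

    std::size_t width_;
    std::size_t height_;
    std::vector<T> cells_;
};
```

Access then reads as table(5, 3) instead of table[3][5], which keeps most of the convenience of the nested form.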

Should I use simple classes or a highly dimensional matrix?

C++ newbie here! I would like to simulate a population containing patches, containing individuals, containing chromosomes, containing genes.
What are the pros and cons of using a series of simple classes versus a highly dimensional matrix in C++? Typically, does the time to access a memory slot vary between the two techniques?
Highly dimensional Matrix
One could make "a vector of vectors of vectors of vectors" (or a C-style highly dimensional arrays of integers) and access any gene in memory with
for (int patch=0;patch<POP.size();patch++)
{
    for (int ind=0;ind<POP[patch].size();ind++) // increment ind, not patch
    {
        for (int chrom=0;chrom<POP[patch][ind].size();chrom++)
        {
            for (int gene=0;gene<POP[patch][ind][chrom].size();gene++)
            {
                POP[patch][ind][chrom][gene];
            }
        }
    }
}
Series of Simple Classes
One could use a series of simple classes and access any gene in memory with
for (int patch=0;patch<POP->PATCHES.size();patch++)
{
    for (int ind=0;ind<POP->PATCHES[patch]->INDIVIDUALS.size();ind++) // increment ind, not patch
    {
        for (int chrom=0;chrom<POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES.size();chrom++)
        {
            for (int gene=0;gene<POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES[chrom]->GENES.size();gene++)
            {
                POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES[chrom]->GENES[gene];
            }
        }
    }
}

While a high-dimensional matrix would work, consider that you might want to add more information to an individual. It might not just have chromosomes, but also an age, siblings, parents, phenotypes, et cetera. It is then natural to have a class Individual, which can contain all that information along with its list of chromosomes. Using classes will group relevant information together.
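As a rough sketch of that class-based layout: the type names, the int gene representation, and the age field below are illustrative assumptions, not taken from the question. Objects are held by value, so no -> chains are needed, and range-based for loops replace the index bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: genes are plain ints here.
struct Chromosome {
    std::vector<int> genes;
};

struct Individual {
    int age = 0;  // example of extra per-individual information
    std::vector<Chromosome> chromosomes;
};

struct Patch {
    std::vector<Individual> individuals;
};

struct Population {
    std::vector<Patch> patches;
};

// Visits every gene in the population and returns how many there are.
inline std::size_t count_genes(const Population& pop) {
    std::size_t total = 0;
    for (const Patch& patch : pop.patches) {
        for (const Individual& ind : patch.individuals) {
            for (const Chromosome& chrom : ind.chromosomes) {
                total += chrom.genes.size();
            }
        }
    }
    return total;
}
```

Adding a new attribute to Individual later only touches that one struct, which is the grouping benefit described above.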

While I generally agree with @g-sliepen's answer, there is an additional point you should know about:
C++ gives you the ability to make a distinction between interface and type. You can leave the type abstract for the users of your code (even if that is only you) and provide a finite set of operations on it.
Using this pattern allows you to change the implementation completely (e.g. back to vectors for parallel computation etc.) later without having to change the code using it (e.g. a concrete simulation).

I won't cover what's already been suggested, as it is generally a good idea to store your individual entities as a class with all relevant fields associated with it, but I'll just address your first suggestion:
The issue with using something like a std::vector<std::vector<std::vector<std::vector<type>>>> (apart from the fact it's a pain to handle generically) is that whilst the overall std::vector enclosing the structure has contiguous storage (so long as you aren't storing bools in your std::vector that is) the inner vectors are not contiguous with each other or the other elements.
Due to this, if you are storing a large amount of data in your structure and need access and iteration to be as fast as possible, this method of storage is not ideal - it also complicates matters of iterating through the entire structure.
A good solution for storing a large multi-dimensional "matrix" (technically a rank 4 tensor in this case I suppose) when you require fast iteration and random access is to write a wrapper around a single std::vector in some row-major / column-major configuration such that all your data is stored as a contiguous block and you can iterate over it all via a single loop or call to std::for_each (for example). Then each index by which you access the structure would correspond to patch, ind, chrom and gene in order.
An example of a pre-built data structure which could handle this is boost::multi_array if you'd rather not code the wrapper yourself.
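If you do write the wrapper yourself, a minimal sketch could look like the following. It assumes a rectangular structure (every patch has the same number of individuals, and so on), int genes, and row-major order with gene varying fastest; the class name Population4D is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Rank-4 container over one contiguous std::vector<int>, row-major:
// the gene index varies fastest, the patch index slowest.
class Population4D {
public:
    Population4D(std::size_t patches, std::size_t inds,
                 std::size_t chroms, std::size_t genes)
        : np_(patches), ni_(inds), nc_(chroms), ng_(genes),
          data_(patches * inds * chroms * genes, 0) {}

    // Flat offset of element (p, i, c, g).
    std::size_t index(std::size_t p, std::size_t i,
                      std::size_t c, std::size_t g) const {
        assert(p < np_ && i < ni_ && c < nc_ && g < ng_);
        return ((p * ni_ + i) * nc_ + c) * ng_ + g;
    }

    int& at(std::size_t p, std::size_t i, std::size_t c, std::size_t g) {
        return data_[index(p, i, c, g)];
    }
    int at(std::size_t p, std::size_t i, std::size_t c, std::size_t g) const {
        return data_[index(p, i, c, g)];
    }

    std::size_t size() const { return data_.size(); }

    // Whole-structure iteration is a single linear pass over the buffer.
    long long sum() const {
        return std::accumulate(data_.begin(), data_.end(), 0LL);
    }

private:
    std::size_t np_, ni_, nc_, ng_;
    std::vector<int> data_;
};
```

This only fits when the dimensions are uniform; a population whose individuals differ in chromosome count would need a jagged structure or per-level offset tables instead.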

There are two major ways to do multidimensional arrays: a vector of vectors (aka jagged array) and a true multidimensional array, i.e. an n-dimensional cube. Using the latter means, for example, that all chromosomes have the same number of genes and every individual has the same number of chromosomes. If you can accept those restrictions, you get some advantages, such as contiguous memory storage.

storing multi-dimensional arrays in c

I am working on a simple lisp-style pre-processor language.
In the API I want users to be able to pass arrays of any dimension and size to the pre-processor, which can be manipulated using the language.
Currently I have an enum of types:
typedef enum _LISP_TYPE
{
    LT_UINT,
    LT_FLOAT,
    LT_ARRAY,
    /* ... */
} LISP_TYPE;
I am having trouble finding an efficient and easy to use method of storing arrays and also accessing them.
There is another structure I use specifically for arrays:
typedef struct _lisp_array
{
LISP_TYPE type;
unsigned int length;
void* data;
} lisp_array;
When the pre-processor sees a list atom with type LT_ARRAY, it will convert its void* (cdr in Lisp terms) to the above structure. Where I am having problems is figuring out how to access multi-dimensional arrays. I have thought of calculating a step value to traverse the array, but can I guarantee that all arrays passed will be contiguously allocated?
Any help is appreciated.

C built-in (single- and multi-dimensional) arrays are guaranteed to be stored in one contiguous region of memory in row-major order. This may not answer your question, however. What is the expected layout of the data structure pointed to by the _lisp_array::data member?

Since you're writing the interpreter, it's up to you to decide on the representation and make the array contiguous - that is, if you need it to be contiguous. If you make it contiguous, you can access elements by, for example (assuming zero-based indices a, b, c... and dimension size sa, sb, sc...):
(a*sb + b) * sc + c ... (row major order)
(c * sb + b) * sa + a ... (column major order)
There are other ways of representing arrays, of course - you could use arrays-of-pointers-to-arrays, etc. Each has its own advantages and disadvantages; without any specifics on the use case, if the bounds of the array are fixed, and the array is not expected to be sparse, then a contiguous buffer is usually a reasonable approach.
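The two formulas above generalize to any number of dimensions with a Horner-style loop. The sketch below is written in C++ (the arithmetic carries over to C unchanged); the function names and the use of std::vector for the index and dimension lists are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major offset: the last index varies fastest.
// For three dimensions this evaluates (a*sb + b)*sc + c.
inline std::size_t row_major_offset(const std::vector<std::size_t>& idx,
                                    const std::vector<std::size_t>& dims) {
    assert(idx.size() == dims.size());
    std::size_t off = 0;
    for (std::size_t d = 0; d < dims.size(); ++d) {
        assert(idx[d] < dims[d]);
        off = off * dims[d] + idx[d];
    }
    return off;
}

// Column-major offset: the first index varies fastest.
// For three dimensions this evaluates (c*sb + b)*sa + a.
inline std::size_t column_major_offset(const std::vector<std::size_t>& idx,
                                       const std::vector<std::size_t>& dims) {
    assert(idx.size() == dims.size());
    std::size_t off = 0;
    for (std::size_t d = dims.size(); d-- > 0;) {
        assert(idx[d] < dims[d]);
        off = off * dims[d] + idx[d];
    }
    return off;
}
```

Storing the dimension sizes alongside the data pointer in the array structure would give the interpreter everything it needs to compute these offsets for arrays of any rank.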

It would depend on how lisp-like you wanted to make it, really. Lisp doesn't have the strict definition of multi-dimensional arrays you're thinking of - everything is either an atom or a list. The closest thing it would have is an array of arrays:
((1 2 3) (4) (5 6))
Note, though, that the sub-arrays aren't the same length. But it's inherently lispy that they don't have to be, and I don't think there's a way to force the issue...
If you need strictly "rectangular" arrays this won't work, obviously, but if you've got wiggle-room, this is how I'd implement it - it's a nice, clean structure (check out the Wikipedia page for more details).
Cheers!

Choice of the most performant container (array)

This is my little big question about containers, in particular, arrays.
I am writing a physics code that mainly manipulates a big (> 1 000 000) set of "particles" (with 6 double coordinates each). I am looking for the best way (in terms of performance) to implement a class that will contain a container for these data and that will provide manipulation primitives for these data (e.g. instantiation, operator[], etc.).
There are a few restrictions on how this set is used:
its size is read from a configuration file and won't change during execution
it can be viewed as a big two dimensional array of N (e.g. 1 000 000) lines and 6 columns (each one storing the coordinate in one dimension)
the array is manipulated in a big loop, each "particle / line" is accessed and computation takes place with its coordinates, and the results are stored back for this particle, and so on for each particle, and so on for each iteration of the big loop.
no new elements are added or deleted during the execution
First conclusion, as the access on the elements is essentially done by accessing each element one by one with [], I think that I should use a normal dynamic array.
I have explored a few things, and I would like to have your opinion on the one that can give me the best performances.
As I understand there is no advantage to use a dynamically allocated array instead of a std::vector, so things like double** array2d = new ..., loop of new, etc are ruled out.
So is it a good idea to use std::vector<double> ?
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > my_array that can be indexed like my_array[i][j], or is it a bad idea and it would be better to use std::vector<double> other_array and access it with other_array[6*i+j]?
Maybe this can give better performance, especially as the number of columns is fixed and known from the beginning.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
Another option, the one that I am using so far is to use Blitz, in particular blitz::Array:
typedef blitz::Array<double,TWO_DIMENSIONS> store_t;
store_t my_store;
Where my elements are accessed like that: my_store(line, column);.
I think there is not much advantage in using Blitz in my case, because I am accessing each element one by one, and Blitz would be interesting if I were using operations directly on arrays (like matrix multiplication), which I am not.
Do you think that Blitz is OK, or is it useless in my case ?
These are the possibilities I have considered so far, but maybe the best one is still another one, so don't hesitate to suggest other things.
Thanks a lot for your help on this problem !
Edit:
From the very interesting answers and comments below, a good solution seems to be the following:
Use a structure particle (containing 6 doubles) or a static array of 6 doubles (this avoids the use of two-dimensional dynamic arrays)
Use a vector or a deque of this particle structure or array. It is then good to traverse them with iterators, and that will allow to change from one to another later.
In addition I can also use a Blitz::TinyVector<double,6> instead of a structure.

So is it a good idea to use std::vector<double> ?
Usually, a std::vector should be the first choice of container. You could use either std::vector<>::reserve() or std::vector<>::resize() to avoid reallocations while populating the vector. Whether any other container is better can be found by measuring. And only by measuring. But first measure whether anything the container is involved in (populating, accessing elements) is worth optimizing at all.
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > [...]?
No. IIUC, you are accessing your data per particle, not per row. If that's the case, why not use a std::vector<particle>, where particle is a struct holding six values? And even if I understood incorrectly, you should rather write a two-dimensional wrapper around a one-dimensional container. Then align your data either in rows or columns, whichever is faster with your access patterns.
Do you think that Blitz is OK, or is it useless in my case?
I have no practical knowledge about blitz++ and the areas it is used in. But isn't blitz++ all about expression templates to unroll loop operations and optimizing away temporaries when doing matrix manipulations? ICBWT.

First of all, you don't want to scatter the coordinates of one given particle all over the place, so I would begin by writing a simple struct:
struct Particle { /* coords */ };
Then we can make a simple one dimensional array of these Particles.
I would probably use a deque, because that's the default container, but you may wish to try a vector. With a vector, 1,000,000 particles of 6 doubles each means a single chunk of about 48 MB (assuming 8-byte doubles). It should hold, but it might strain your system if this ever grows, while the deque will allocate several chunks.
WARNING:
As Alexandre C remarked, if you go the deque road, refrain from using operator[] and prefer to use iteration style. If you really need random access and it's performance sensitive, the vector should prove faster.

The first rule when choosing from containers is to use std::vector. Then, only after your code is complete and you can actually measure performance, you can try other containers. But stick to vector first. (And use reserve() from the start)
Then, you shouldn't use a std::vector<std::vector<double> >. You know the size of your data: it's 6 doubles. No need for it to be dynamic. It is constant and fixed. You can define a struct to hold your particle members (the six doubles), or you can simply use a fixed-size array type: typedef std::array<double, 6> particle (a raw array typedef such as double[6] is not assignable, so it cannot be used as a vector element type). Then, use a vector of particles: std::vector<particle>.
Furthermore, as your program uses the particle data contained in the vector sequentially, you will take advantage of the modern CPU cache read-ahead feature at its best performance.
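A minimal sketch of this layout, assuming each particle is a std::array of six doubles whose first three entries are a position and last three a velocity (that split, and the names make_particles and advance, are illustrative assumptions, not from the question).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: p[0..2] is the position, p[3..5] the velocity.
using Particle = std::array<double, 6>;

// Allocates all particles once, zero-initialized; the size never changes.
inline std::vector<Particle> make_particles(std::size_t n) {
    return std::vector<Particle>(n);
}

// One pass of the "big loop": each particle is read, updated and stored
// back in order, so memory is walked sequentially.
inline void advance(std::vector<Particle>& particles, double dt) {
    for (Particle& p : particles) {
        for (std::size_t k = 0; k < 3; ++k) {
            p[k] += dt * p[k + 3];
        }
    }
}
```

Because the six coordinates of one particle sit next to each other and particles follow one another in memory, each iteration touches a small, predictable region.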

You could go several ways. But in your case, don't declare a std::vector<std::vector<double> >. You're allocating a vector (and copying it around) for every 6 doubles. That's way too costly.

If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
(other_array[i,j] won't work too well, as i,j employs the comma operator to evaluate the value of "i", then discards that and evaluates and returns "j", so it's equivalent to other_array[j]).
You will need to use one of:
other_array[i][j]
other_array(i, j) // if other_array implements operator()(int, int),
// but std::vector<> et al don't.
other_array[i].identifier // identifier is a member variable
other_array[i].identifier() // member function getting value
other_array[i].identifier(double) // member function setting value
You may or may not prefer to put get_ and set_ or similar on the last two functions should you find them useful, but from your question I think you won't: functions are preferred in APIs between parts of large systems involving many developers, or when the data items may vary and you want the algorithms working on the data to be independent thereof.
So, a good test: if you find yourself writing code like other_array[i][3] where you've decided "3" is the double with the speed in it, and other_array[i][5] because "5" is the acceleration, then stop doing that and give them proper identifiers so you can say other_array[i].speed and .acceleration. Then other developers can read and understand it, and you're much less likely to make accidental mistakes. On the other hand, if you are iterating over those 6 elements doing exactly the same things to each, then you probably do want Particle to hold a double[6], or to provide an operator[](int). There's no problem doing both:
struct Particle
{
double x[6];
double& speed() { return x[3]; }
double speed() const { return x[3]; }
double& acceleration() { return x[5]; }
...
};
BTW, the reason that vector<vector<double> > may be too costly is that each set of 6 doubles will be allocated on the heap, and for fast allocation and deallocation many heap implementations use fixed-size buckets, so your small request will be rounded up to the next size: that may be a significant overhead. The outer vector will also need to record an extra pointer to that memory. Further, heap allocation and deallocation is relatively slow. In your case, you'd only be doing it at startup and shutdown, but there's no particular point in making your program slower for no reason. Even more importantly, the areas on the heap may jump around in memory, so your operator[] may have cache faults pulling in more distinct memory pages than necessary, slowing the entire program. Put another way, vectors store elements contiguously, but the pointed-to vectors may not be contiguous with each other.
