I am currently trying to represent a 2D array using tbb::concurrent_vector<T>. This 2D array will be accessed by many different threads, which is why I want it to handle parallel accesses as efficiently as possible.
I came up with 2 solutions:
Use a tbb::concurrent_vector<tbb::concurrent_vector<T> > to store it.
Store everything in a tbb::concurrent_vector<T> and access elements with x * width + y
I had a preference for the second one because I don't want to lock an entire row to access one element (since I assume that to access the element array[x][y], the tbb implementation will lock the xth row and then the yth element).
I would like to know which solution seems better to you.
First, I think there might be some confusion concerning tbb::concurrent_vector. This container is similar to std::vector, with thread-safe resizing but slower element access due to the internal storage layout.
You can read more about it here.
In your case, because of the second solution you proposed (1D array with x * width + y indexing), I'm assuming your algorithm doesn't involve intensive multi-threaded resizing of the array. Therefore, you wouldn't benefit from tbb::concurrent_vector compared to a single-threaded container like std::vector.
I guess you were assuming that tbb::concurrent_vector guaranteed thread-safe element access, but it does not - quote from tbb::concurrent_vector::operator[] doc:
Get reference to element at given index.
This method is thread-safe for concurrent reads, and also while growing the vector, as long as the calling thread has checked that index
If you are not resizing the array, you are only interested in the first part: thread-safe concurrent reads. But std::vector or even a raw C array gives you the same. On the other hand, neither offers thread-safe arbitrary access (read/write elements).
You would have to implement it using locks, or find another library that does this for you (maybe std::vector from STLPort, but I'm not sure). This would be quite inefficient, though, since each access to an element of your 2D array would involve thread synchronization overhead. While I don't know what exactly you are trying to implement, it's quite possible that the synchronization would take longer than your actual computations.
Now to answer your question: even in a single-threaded setting, it is always better to use a 1D array for representing ND arrays, because computing the index (x * width + y) is faster than the additional memory access needed by a true ND array.
For concurrent vectors, this is even more true because you would have twice the lock overhead in a best-case scenario (no conflicting row access), and a lot more in case there are conflicts.
So out of the two solutions you proposed, I wouldn't hesitate and would go for the second one: a 1D array (not necessarily tbb::concurrent_vector) with adequate locking for element access.
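As a rough sketch of what "a 1D array with adequate locking" could look like: the class below keeps the elements in a flat std::vector and guards each access with one of a fixed pool of mutexes. The name LockedGrid and the stripe count of 64 are assumptions made for illustration, not something taken from TBB or from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper: a 2D grid stored as one flat vector, with a small
// pool of mutexes ("stripes") protecting element reads and writes.
template <typename T>
class LockedGrid {
public:
    LockedGrid(std::size_t height, std::size_t width, const T& init = T())
        : height_(height), width_(width), cells_(height * width, init) {}

    // Copy the element at row x, column y while holding its stripe lock.
    T get(std::size_t x, std::size_t y) const {
        const std::size_t i = index(x, y);
        std::lock_guard<std::mutex> guard(locks_[i % kStripes]);
        return cells_[i];
    }

    // Overwrite the element at row x, column y while holding its stripe lock.
    void set(std::size_t x, std::size_t y, const T& value) {
        const std::size_t i = index(x, y);
        std::lock_guard<std::mutex> guard(locks_[i % kStripes]);
        cells_[i] = value;
    }

private:
    static constexpr std::size_t kStripes = 64;  // arbitrary; tune per workload

    // Same flat indexing as in the question: x * width + y.
    std::size_t index(std::size_t x, std::size_t y) const {
        assert(x < height_ && y < width_);
        return x * width_ + y;
    }

    std::size_t height_, width_;
    std::vector<T> cells_;
    mutable std::array<std::mutex, kStripes> locks_;
};
```

Usage would be along the lines of LockedGrid<int> g(rows, cols); g.set(2, 1, 7); int v = g.get(2, 1);. Every access still pays for one lock, so this only makes sense if the per-element work is heavy enough to hide that cost.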
Depending on your algorithm and the access pattern by the different threads, another approach - used in image editing software (gimp, photoshop...) - is tile based:
template<typename T> struct Tile {
int offsetX, offsetY; // position of the tile inside the full array
int width, height;    // tile dimensions
Not_ThreadSafe_Vector<T> data;
};
ThreadSafe_Vector< Tile<T> > data;
Where Not_ThreadSafe_Vector could be any container without locking on element access, e.g. std::vector ; and ThreadSafe_Vector is a container with thread safe read/write element access (not tbb::concurrent_vector!).
This way, if you have some locality in your access pattern (a thread is more likely to access an element near to its previous access than far away), then each thread can be made to work on the data from a single tile most of the time, and you only have synchronization overhead when switching to another tile.
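One possible way to sketch the tile idea is with one mutex per tile, so that a thread locks a tile once and then works on its elements without further synchronization. LockedTile, TiledGrid, withTile and the square tile size are all hypothetical names and choices for this example; x is taken as the column and y as the row, as in image coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical tile: a square block of elements guarded by one mutex.
template <typename T>
struct LockedTile {
    std::mutex guard;
    std::vector<T> data;  // tileSize * tileSize elements, row-major
};

template <typename T>
class TiledGrid {
public:
    // Assumes width and height are multiples of tileSize.
    TiledGrid(std::size_t width, std::size_t height, std::size_t tileSize)
        : tileSize_(tileSize), tilesPerRow_(width / tileSize) {
        assert(tileSize > 0 && width % tileSize == 0 && height % tileSize == 0);
        const std::size_t count = tilesPerRow_ * (height / tileSize);
        for (std::size_t i = 0; i < count; ++i) {
            tiles_.push_back(std::unique_ptr<LockedTile<T>>(new LockedTile<T>()));
            tiles_.back()->data.resize(tileSize * tileSize);
        }
    }

    // Lock the tile containing (x, y) once, then run fn on the tile's data.
    template <typename Fn>
    void withTile(std::size_t x, std::size_t y, Fn fn) {
        LockedTile<T>& tile = *tiles_[tileIndex(x, y)];
        std::lock_guard<std::mutex> lock(tile.guard);
        fn(tile.data);
    }

    // Index of the tile that contains column x, row y.
    std::size_t tileIndex(std::size_t x, std::size_t y) const {
        return (y / tileSize_) * tilesPerRow_ + (x / tileSize_);
    }

    // Position of (x, y) inside its tile's data vector.
    std::size_t localIndex(std::size_t x, std::size_t y) const {
        return (y % tileSize_) * tileSize_ + (x % tileSize_);
    }

private:
    std::size_t tileSize_, tilesPerRow_;
    std::vector<std::unique_ptr<LockedTile<T>>> tiles_;
};
```

A thread that processes a whole tile would call withTile once and loop over the tile's data inside the callback, so the lock cost is paid per tile rather than per element.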
tbb::concurrent_vector is completely thread safe for accessing elements (read, write, take the address) and growing the vector. Check response from Intel employee Arch D. Robison here.
However, operations for clearing or destroying a vector are not thread safe (refer to §6.2.1 of the Intel TBB Tutorial PDF document). That is, don't invoke clear() if there are other operations underway on the tbb::concurrent_vector.
As Antoine mentioned, handling the ND array as a 1D array is always to be preferred due to efficiency & performance.
Related
Imagine the following requirements:
measurement data should be logged and the user should be able to iterate through the data.
uint32_t timestamp;
uint16_t place;
struct SomeData someData;
have a timestamp (uint32_t), a place (uint16_t) and some data in a struct
have a constant number of datasets. If a new one arrives, the oldest is thrown away.
the number of "place" is dynamic, the user can insert new ones during runtime
it should be possible to iterate through the data to the next newer or older dataset but only if the place is the same
need to insert at the end only
memory should be allocated once at program start
insertion need not to be fast but should not block other threads for a long time which might be iterating through the container
memory requirement should be low
EDIT: - The container should use all the memory which is not used otherwise, therefore it can be large.
I am not sure which container I should use. It is an embedded system and should not use boost etc.
I see the following possibilities:
std::vector - drawbacks: Insertion at the end requires that all objects are copied, and during this time another thread cannot access the vector. Edit: This can be avoided by implementing it as a circular buffer - see comments below. When iterating through the vector, I have to test the place ID. It might also be a problem to allocate that much memory as one block, because the memory could be segmented?
std::deque - compared to std::vector, insertion (and pop_back) is faster, but what about the memory requirement? Iterators do not become invalid if the insertion is at the end. But I still have to iterate and test the second ID ("place"). I think it does not need to allocate all the memory in one big block, as is the case with vector or array. If an element is added at the front and another one is removed at the end (or removed first and added after), I guess no memory allocation takes place?
std::queue - instead of deque, should I rather use a queue? Is it true that in many implementations a queue is implemented just as a deque?
std::map - Like deque any iterators to existing elements will not become invalid. If I make the key a combination of place and timestamp, then iteration through the map is maybe faster because it is already sorted? Memory requirements of a map?
std::multimap - as the number of places is not constant I cannot make a multimap with "place" as the index.
std::list - has no advantage over deque here?
Some suggested the use of a circular buffer. If I do not want that the memory is allocated as one big block I still have to use a container and most questions above stay valid.
Update:
I will use a ring buffer as suggested here, but with a deque as the underlying container. To be able to scroll quickly through the datasets with the preselected "place", I will eventually introduce two additional indices into the data struct which will point to the previous and the next index with the same place.
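A minimal sketch of such a ring buffer with per-place links, under several assumptions: the names Dataset, PlaceRing and kNone are made up for this example, the payload struct is left out, a std::vector allocated once stands in for the deque, and the links are stored as running sequence numbers instead of raw indices so that a link to an overwritten dataset can be detected. It is not thread-safe; any locking would have to be added around push and the accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for "no such dataset" (hypothetical name).
const std::uint64_t kNone = ~static_cast<std::uint64_t>(0);

struct Dataset {
    std::uint32_t timestamp;
    std::uint16_t place;
    // The SomeData payload from the question would go here.
    std::uint64_t prev;  // sequence number of the previous dataset with the same place
    std::uint64_t next;  // sequence number of the next dataset with the same place
};

class PlaceRing {
public:
    // All memory is allocated here, once.
    explicit PlaceRing(std::size_t capacity) : slots_(capacity), count_(0) {
        assert(capacity > 0);
    }

    // Append a dataset, overwriting the oldest one when the ring is full.
    // Returns the sequence number of the new dataset.
    std::uint64_t push(std::uint32_t timestamp, std::uint16_t place) {
        const std::uint64_t seq = count_++;
        Dataset& d = slot(seq);
        d.timestamp = timestamp;
        d.place = place;
        d.prev = kNone;
        d.next = kNone;
        // Scan backwards for the newest stored dataset with the same place.
        for (std::uint64_t s = seq; s-- > oldest();) {
            if (slot(s).place == place) {
                d.prev = s;
                slot(s).next = seq;
                break;
            }
        }
        return seq;
    }

    // Sequence number of the oldest dataset still stored.
    std::uint64_t oldest() const {
        return count_ > slots_.size() ? count_ - slots_.size() : 0;
    }

    const Dataset& get(std::uint64_t seq) const {
        assert(seq >= oldest() && seq < count_);
        return slots_[seq % slots_.size()];
    }

    // Next older dataset with the same place, or kNone if it was overwritten
    // or never existed.
    std::uint64_t older(std::uint64_t seq) const {
        const std::uint64_t p = get(seq).prev;
        return (p != kNone && p >= oldest()) ? p : kNone;
    }

    // Next newer dataset with the same place, or kNone.
    std::uint64_t newer(std::uint64_t seq) const { return get(seq).next; }

private:
    Dataset& slot(std::uint64_t seq) { return slots_[seq % slots_.size()]; }

    std::vector<Dataset> slots_;
    std::uint64_t count_;  // total number of datasets ever pushed
};
```

The backward scan in push keeps the sketch free of any extra lookup table, at the price of a linear search on insertion. The 64-bit links are chosen for clarity; on a memory-constrained target, narrower index types would reduce the per-dataset overhead.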
How much memory will be used? In my special case the size of the struct is 56 bytes. The GNU library uses 512 bytes as the minimum block size, the IAR compiler 16 bytes. Hence the block size used will be 512 or 56 bytes, respectively. Besides two iterators (using 4 pointers each) and the size, a pointer is stored for each block. Therefore, in the implementation of the IAR compiler (block size 56 bytes) there is about 7 % overhead (on a 32-bit system) compared to a std::vector or array. In the gcc implementation, 9 objects fit in a block (504 bytes), while 512 + 4 bytes are needed per block, which is about 2 % more.
The block size is not large but the continuous memory size needed for the pointer array is already relatively large, especially for the implementation where one block is one struct.
A std::list would need 2 pointers per struct which is 14 % overhead in my case on 32 bit systems.
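Spelled out, the overhead figures above follow from the numbers given in the question, assuming 4-byte pointers, a 56-byte struct, and ignoring the fixed cost of the deque's iterators and size field:

```latex
\begin{align*}
\text{IAR deque (1 object per block):}\quad & \frac{4}{56} \approx 7.1\,\% \\
\text{gcc deque (9 objects per block):}\quad & \frac{512 + 4}{9 \cdot 56} - 1 = \frac{516}{504} - 1 \approx 2.4\,\% \\
\text{std::list (2 pointers per node):}\quad & \frac{2 \cdot 4}{56} \approx 14.3\,\%
\end{align*}
```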
std::vector
... the memory could be segmented?
No, std::vector allocates contiguous memory, as is documented in that link. Arrays are also contiguous, but you might just as well use vector for this.
std::deque is segmented, which you said you didn't want. Or do you want to avoid a single large allocated block? It's not clear.
Anyway, it has no benefit over vector if you really want a circular buffer (because you'll never be adding/removing elements from the front/back anyway), and you can't control the block size.
std::queue
... Is it true that in many implementations a queue is implemented just as a deque?
Yes, that's the default in all implementations. See the linked documentation or any decent book.
It doesn't sound like you want a FIFO queue, so I don't know why you're considering this one - the interface doesn't match your stated requirement.
std::map
... iteration through the map is maybe faster because it is already sorted?
On most modern server/desktop architectures, map will be slower because advancing an iterator involves a pointer chase (which impairs pipelining) and a likely cache miss. Your anonymous embedded architecture may be less sensitive to these effects, so map may be faster for you.
... Memory requirements of a map?
Higher. You have the node size (at least a couple of pointers) added to each element.
I'm trying to take a (very) large vector, and reassign all of the values in it into a multidimensional (2D) vector<vector<string>>.
The multidimensional vector has both dimensions resized to the correct size prior to value population, to avoid reallocation.
Currently, I am doing it single-threaded, but it is something that needs to happen repeatedly, and is very slow due to the large size (~7 seconds). The question is whether it is thread-safe for me to use, for instance, a thread per 2D element.
Some pseudocode:
vector<string> source{/*assume that it is populated by 8,000,000 strings
of varying length*/};
vector<vector<string>> destination;
destination.resize(8);
for(size_t loop=0;loop<8;loop++)destination[loop].resize(1000000);
//current style
for(size_t loop=0;loop<source.size();loop++)destination[loop/1000000][loop%1000000]=source[loop];
//desired style
void Populate(int index){
for(size_t loop=0;loop<destination[index].size();loop++)destination[index][loop]=source[index*1000000+loop];
}
//one thread per row; the group keeps the threads alive until they are joined
boost::thread_group populators;
for(int loop=0;loop<8;loop++)populators.create_thread(boost::bind(Populate,loop));
populators.join_all();
I would think that the threaded version should work, since they're writing to separate 2nd dimensional elements. However, I'm not sure whether writing the strings would break things, since they are being resized.
When considering only thread-safety, this is fine.
Writing concurrently to distinct objects is allowed. C++ considers objects distinct even if they are neighboring fields in a struct or elements in the same array. The data type of the object does not matter here, so this holds true for string just as well as it does for int. The only important thing is that you must ensure that the ranges that you operate on are really fully distinct. If there is any overlap, you have a data race on your hands.
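To make the "distinct objects" point concrete, here is a small sketch using std::thread instead of boost::thread. The function name populateRows and the requirement that the source size is an exact multiple of the row count are assumptions for this example. Each thread writes only to its own row, and all vectors are sized before any thread starts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Copy source into rowCount rows of equal length, one thread per row.
// Assumes source.size() is an exact multiple of rowCount.
std::vector<std::vector<std::string>> populateRows(
        const std::vector<std::string>& source, std::size_t rowCount) {
    assert(rowCount > 0 && source.size() % rowCount == 0);
    const std::size_t rowLength = source.size() / rowCount;

    // Size everything up front so no vector is resized while threads run.
    std::vector<std::vector<std::string>> destination(
        rowCount, std::vector<std::string>(rowLength));

    std::vector<std::thread> workers;
    workers.reserve(rowCount);
    for (std::size_t row = 0; row < rowCount; ++row) {
        // Each worker touches only destination[row], so the ranges are disjoint.
        workers.emplace_back([&source, &destination, row, rowLength] {
            for (std::size_t i = 0; i < rowLength; ++i) {
                destination[row][i] = source[row * rowLength + i];
            }
        });
    }
    for (std::thread& worker : workers) {
        worker.join();  // wait before the caller reads destination
    }
    return destination;
}
```

Assigning a string may reallocate that string's own buffer, but that buffer belongs to a single element and therefore to a single thread, so it does not introduce a race.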
There is, however, another thing to take into consideration here, and that is performance. This is highly platform dependent, so the language standard does not give you any rules here, but there are some effects to look out for. For instance, neighboring elements in an array might reside on the same cache line. So in order for the hardware to be able to fulfill the thread-safety guarantees of the language, it must synchronize access to such elements. For instance, partitioning array access so that one thread works on all the elements with even indices while another works on the odd indices is technically thread-safe, but puts a lot of stress on the hardware, as both threads are likely to contend for data stored on the same cache line.
Similarly, in your case there is contention on the memory bus. If your threads are able to complete calculation of the data much faster than they are able to write it to memory, you might not actually gain anything by using multiple threads, because all the threads will end up waiting for the memory in the end.
Keep these things in mind when deciding whether parallelism is really the right solution to your problem.
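One common way to limit the cache-line contention described above is to give each thread a contiguous block of indices rather than interleaved ones, so that threads share at most the cache lines at block boundaries. The helper below is only an illustrative sketch; blockRange is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open index range [first, second) handled by worker k when n elements
// are split into `workers` contiguous blocks of nearly equal size.
std::pair<std::size_t, std::size_t> blockRange(std::size_t n,
                                               std::size_t workers,
                                               std::size_t k) {
    assert(workers > 0 && k < workers);
    const std::size_t base = n / workers;
    const std::size_t extra = n % workers;  // the first `extra` workers get one more element
    const std::size_t begin = k * base + (k < extra ? k : extra);
    const std::size_t end = begin + base + (k < extra ? 1 : 0);
    return std::make_pair(begin, end);
}
```

For 10 elements and 3 workers this yields the ranges [0, 4), [4, 7) and [7, 10). Whether this actually helps depends on the element size and the platform's cache-line size, so it is worth measuring.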
I have a program where multiple threads share the same data structure, which is basically a 2D array of vectors. Sometimes two or more threads might have to insert at the same position (i.e. into the same vector), which might result in a crash if no precautions are taken. What is the fastest and most efficient way to implement a safe solution for this issue? Since this does not happen very often (no high contention), I have a 2D array of mutexes where each mutex maps to a vector; each thread locks the mutex, updates the corresponding vector, and then unlocks it. If this is a good solution, I would like to know if there is something faster than a mutex to use.
Note, I am using OpenMP for the multithreading.
The solution greatly depends on how the problem is. For example:
If the vector size may exceed its capacity (i.e. reallocation is required).
Whether the vector is only being read, elements are being inserted or elements can be both inserted and removed.
In the first case, you don't have any other possibility than using locks, since you always need to check whether the vector is being reallocated, and wait for the reallocation to complete if necessary.
On the other hand, if you are completely sure that the vector is only initialized once by a single thread (which is not your case), probably you would not need any synchronization mechanism to perform access to vector elements (inside-element access synchronization may still be required though).
If elements are being inserted and removed from the back of the vector only (queue style), then using atomic compare and swap would be enough (atomically increase the size of the vector, and insert in position size-1 when the swap was successful).
If elements may be removed at any point of the vector, its contents may need to be moved to remove empty holes. This case is similar to a reallocation. You can use a customized heap to manage the empty positions in your vector, although this will increase the complexity.
At the end of the day, probably you will need to either develop your own parallel data structure or rely on a library, such as TBB or Boost.
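To illustrate the atomic-size idea for the insert-only case, here is a rough sketch under strong assumptions: the capacity is fixed up front (so there is never a reallocation), elements are only appended, and readers only look at the data after all writers have finished. It uses std::atomic with fetch_add rather than an explicit compare-and-swap loop, and AppendOnlyBuffer is a hypothetical name. How std::atomic interacts with OpenMP threads is implementation-specific, so treat this as a starting point rather than a drop-in solution.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical append-only buffer whose capacity is fixed at construction,
// so the storage is never reallocated while threads are appending.
template <typename T>
class AppendOnlyBuffer {
public:
    explicit AppendOnlyBuffer(std::size_t capacity)
        : slots_(capacity), reserved_(0) {}

    // Atomically reserve the next free slot, then write into it.
    // Returns false when the buffer is already full.
    bool tryAppend(const T& value) {
        const std::size_t slot = reserved_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= slots_.size()) {
            return false;
        }
        slots_[slot] = value;  // no other thread was handed this slot
        return true;
    }

    // Number of appended elements. Only meaningful once all writers are done,
    // for example after joining them or after an OpenMP barrier.
    std::size_t size() const {
        const std::size_t n = reserved_.load(std::memory_order_relaxed);
        return n < slots_.size() ? n : slots_.size();
    }

    const T& operator[](std::size_t i) const {
        assert(i < size());
        return slots_[i];
    }

private:
    std::vector<T> slots_;
    std::atomic<std::size_t> reserved_;
};
```

If the vectors must be able to grow or shrink, this no longer applies and a per-vector lock, as in the question, is the simpler choice.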
Excuse me if this question is common or trivial, I am not very familiar with MPI so bear with me.
I have a matrix of vectors. Each vector is empty or has a few items in it.
std::vector<someStruct*> partitions[matrix_size][matrix_size];
When I start the program each process will have the same data in this matrix, but as the code progresses each process might remove several items from some vectors and put them in other vectors.
So when I reach a barrier I somehow have to make sure each process has the latest version of this matrix. The big problem is that each process might manipulate any or all vectors.
How would I go about to make sure that every process has the correct updated matrix after the barrier?
EDIT:
I am sorry I was not clear. Each process may move one or more objects to another vector but only one process may move each object. In other words each process has a list of objects it may move, but the matrix may be altered by everyone. And two processes can't move the same object ever.
In that case you'll need to send messages using MPI_Bcast that inform the other processors about this and instruct them to do the same. Alternatively, if the ordering doesn't matter until you hit the barrier, you can only send the messages to the root process which performs the permutations and then after the barrier sends it to all the others using MPI_Bcast.
One more thing: vectors of pointers are usually quite a bad idea, as you'll need to manage the memory manually in there. If you can use C++11, use std::unique_ptr or std::shared_ptr instead (depending on what your semantics are), or use Boost which provides very similar facilities.
And lastly, representing a matrix as a fixed-size array of fixed-size arrays is really bad. First: the matrix size is fixed. Second: adjacent rows are not necessarily stored in contiguous memory, slowing your program down like crazy (it literally can be orders of magnitude). Instead represent the matrix as a linear array of size Nrows*Ncols, and then index the elements as Ncols*i + j, where Ncols is the number of columns and i and j are the row and column indices, respectively. If you want column-major storage instead, address the elements by i + Nrows*j. You can wrap this index-juggling in inline functions that have virtually zero overhead.
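A short sketch of such index wrappers, assuming a row index i and a column index j. Row-major storage multiplies the row index by the number of columns; column-major storage multiplies the column index by the number of rows. The names rowMajorIndex, colMajorIndex and Matrix are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major: the elements of one row are adjacent in memory.
inline std::size_t rowMajorIndex(std::size_t i, std::size_t j, std::size_t ncols) {
    return i * ncols + j;
}

// Column-major: the elements of one column are adjacent in memory.
inline std::size_t colMajorIndex(std::size_t i, std::size_t j, std::size_t nrows) {
    return i + j * nrows;
}

// Minimal row-major matrix over one contiguous allocation.
template <typename T>
class Matrix {
public:
    Matrix(std::size_t nrows, std::size_t ncols)
        : nrows_(nrows), ncols_(ncols), data_(nrows * ncols) {}

    T& operator()(std::size_t i, std::size_t j) {
        assert(i < nrows_ && j < ncols_);
        return data_[rowMajorIndex(i, j, ncols_)];
    }

    const T& operator()(std::size_t i, std::size_t j) const {
        assert(i < nrows_ && j < ncols_);
        return data_[rowMajorIndex(i, j, ncols_)];
    }

    // Pointer to the flat storage, e.g. for passing to MPI calls.
    const T* data() const { return data_.data(); }

private:
    std::size_t nrows_, ncols_;
    std::vector<T> data_;
};
```

With this layout the matrix size can be chosen at run time, and the whole matrix can be sent as a single buffer.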
I would suggest to lay out the data differently:
Each process has a map of its objects and their position in the matrix. How that is implemented depends on how you identify objects. If all local objects are numbered, you could just use a vector<pair<int,int>>.
Treat that as the primary structure you manipulate, and communicate that structure with MPI_Allgather (each process sends its data to all other processes; at the end everyone has all data). If you need fast lookup by coordinates, then you can build up a cache.
That may or may not be performing well. Other optimizations (like sharing 'transactions') totally depend on your objects and the operations you perform on them.
Say you have a buffer of size N which must be set to definite values (say to zero, or something else). This value setting in the buffer is divided over M threads, each handling N / M elements of the buffer.
The buffer cannot be immutable, since we change the values. Message passing won't work either, since it is forbidden to pass ref or array (= pointer) types. So it must happen through shared? No, since in my case the buffer elements are of type creal and thus arithmetics are not atomic.
At the end, the main program must wait until all threads are finished. It is given that each thread only writes to a subset of the array, and none of the threads overlap in the array with another thread or depend on each other in any way.
How would I go about writing to (or modifying) a buffer in a concurrent manner?
PS: sometimes I can simply divide the array in M consecutive pieces, but sometimes I go over the array (the array is 1D but represents 2D data) column-by-column. Which makes the individual arrays the threads use be actually interleaved in the mother-array. Argh.
EDIT: I figured out that the type shared(creal)[] would work, since now the elements are shared and not the array itself. You could parallelize interleaved arrays I bet. There is some disadvantage though:
The shared storage class is so strict, that the allocation must be supplied with the keyword. Which makes it hardly encapsulated; since the caller must supply the array, it is obligated to pass a shared array and can't just generically pass a regular array and let the processing function worry about parallelism. No, the calling function must worry about parallelism too, so that the processing function receives a shared array and needn't reallocate the array into shared space.
There is also a very strange bug, that when I dynamically allocate shared(creal)[] at certain spots, it simply hangs at allocation. Seems very random and can't find the culprit...
In the test example this works, but not in my project... This turned out to be a bug in DMD / OptLink.
EDIT2: I never mentioned it, but this is for implementing the FFT (Fast Fourier Transform). So I have no power over selecting precise cache-aligned slices. All I know is that the elements are of type creal and the number of elements is a power of 2 (per row / column).
You can use the std.parallelism module:
T[] buff;
foreach(ref elem;parallel(buff))elem=0;
But if you want to reinvent this, you can just use shared. It is thread-safe as long as only one thread accesses a given element at a time, and if you enforce this with the appropriate join() or Task.*force() calls, so much the better.