D concurrent writing to buffer - concurrency

Say you have a buffer of size N which must be set to definite values (say to zero, or something else). Setting the values is divided over M threads, each handling N / M elements of the buffer.
The buffer cannot be immutable, since we change the values. Message passing won't work either, since it is forbidden to pass ref or array (= pointer) types. So it must happen through shared? No, because in my case the buffer elements are of type creal, and arithmetic on creal is not atomic.
At the end, the main program must wait until all threads are finished. It is given that each thread writes only to its own subset of the array, that no two threads' subsets overlap, and that the threads do not depend on each other in any way.
How would I go about writing to (or modifying) a buffer in a concurrent manner?
PS: sometimes I can simply divide the array into M consecutive pieces, but sometimes I go over the array (the array is 1D but represents 2D data) column by column, which means the sub-arrays the individual threads use are actually interleaved in the mother array. Argh.
EDIT: I figured out that the type shared(creal)[] would work, since then the elements are shared rather than the array itself. You could parallelize interleaved arrays that way, I bet. There are some disadvantages though:
The shared storage class is so strict that the allocation itself must carry the keyword. This breaks encapsulation: since the caller must supply the array, it is obligated to pass a shared array and can't just generically pass a regular array and let the processing function worry about parallelism. Instead, the calling function must worry about parallelism too, so that the processing function receives a shared array and needn't reallocate it into shared space.
There is also a very strange bug: when I dynamically allocate shared(creal)[] at certain spots, the program simply hangs at the allocation. It seems very random, and I can't find the culprit...
In the test example this works, but not in my project... This turned out to be a bug in DMD / OptLink.
EDIT2: I never mentioned it, but this is for implementing the FFT (Fast Fourier Transform). So I have no control over selecting precisely cache-aligned slices. All I know is that the elements are of type creal and that the number of elements is a power of 2 (per row / column).

You can use the std.parallelism module:

import std.parallelism;

T[] buff;
foreach (ref elem; parallel(buff))
    elem = 0;

But if you want to reinvent this, you can just use shared: it is thread-safe as long as only one thread accesses a given element at a time, and if you enforce this with the appropriate join() or Task.*force() calls, so much the better.

Related

Is writing to elements within children of a multi-dimensional vector thread-safe?

I'm trying to take a (very) large vector<string>, and reassign all of the values in it into a multidimensional (2D) vector<vector<string>>.
The multidimensional vector has both dimensions resized to the correct size prior to value population, to avoid reallocation.
Currently, I am doing it single-threaded, but it is something that needs to happen repeatedly, and is very slow due to the large size (~7 seconds). The question is whether it is thread-safe for me to use, for instance, one thread per element of the outer dimension.
Some pseudocode:
vector<string> source{/* assume that it is populated by 8,000,000 strings
                         of varying length */};
vector<vector<string>> destination;
destination.resize(8);
for (int i = 0; i < 8; i++)
    destination[i].resize(1000000);

// current style
for (size_t i = 0; i < source.size(); i++)
    destination[i / 1000000][i % 1000000] = source[i];

// desired style
void Populate(int index) {
    for (size_t i = 0; i < destination[index].size(); i++)
        destination[index][i] = source[index * 1000000 + i];
}

for (int i = 0; i < 8; i++)
    boost::thread populator(Populate, i); // note: the threads should be kept and joined
I would think that the threaded version should work, since they're writing to separate 2nd dimensional elements. However, I'm not sure whether writing the strings would break things, since they are being resized.
When considering only thread-safety, this is fine.
Writing concurrently to distinct objects is allowed. C++ considers objects distinct even if they are neighboring fields in a struct or elements in the same array. The data type of the object does not matter here, so this holds true for string just as well as it does for int. The only important thing is that you must ensure that the ranges that you operate on are really fully distinct. If there is any overlap, you have a data race on your hands.
There is, however, another thing to take into consideration here, and that is performance. This is highly platform dependent, so the language standard does not give you any rules here, but there are some effects to look out for. For instance, neighboring elements in an array might reside on the same cache line. In order for the hardware to fulfill the thread-safety guarantees of the language, it must then synchronize access to such elements. For instance: partitioning array access so that one thread works on all the elements with even indices while another works on the odd indices is technically thread-safe, but puts a lot of stress on the hardware, as both threads are likely to contend for data stored on the same cache line.
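To make that concrete, here is a minimal sketch (illustrative names, not from the question) contrasting the two partitioning schemes; both are thread-safe, but the interleaved one invites cache-line contention:

#include <cstddef>
#include <vector>

// Interleaved partitioning: thread t touches indices t, t + nthreads, ...
// Neighboring indices belong to different threads, so the threads keep
// writing to the same cache lines.
void fill_interleaved(std::vector<int>& buf, int t, int nthreads) {
    for (std::size_t i = t; i < buf.size(); i += nthreads)
        buf[i] = 0;
}

// Block partitioning: thread t owns one contiguous chunk, so threads only
// share a cache line at the chunk boundaries.
void fill_blocks(std::vector<int>& buf, int t, int nthreads) {
    std::size_t chunk = buf.size() / nthreads;
    std::size_t begin = t * chunk;
    std::size_t end = (t == nthreads - 1) ? buf.size() : begin + chunk;
    for (std::size_t i = begin; i < end; ++i)
        buf[i] = 0;
}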
Similarly, in your case there is contention on the memory bus. If your threads can produce the data much faster than it can be written to memory, you might not actually gain anything by using multiple threads, because in the end all the threads will be waiting for the memory.
Keep these things in mind when deciding whether parallelism is really the right solution to your problem.

Why is deque using so much more RAM than vector in C++?

I have a problem I am working on where I need to use some sort of 2 dimensional array. The array is fixed width (four columns), but I need to create extra rows on the fly.
To do this, I have been using vectors of vectors, and I have been using some nested loops that contain this:
array.push_back(vector<float>(4));
array[n][0] = a;
array[n][1] = b;
array[n][2] = c;
array[n][3] = d;
n++;
to add the rows and their contents. The trouble is that I appear to be running out of memory with the number of elements I was trying to create, so I reduced the number that I was using. But then I started reading about deque, and thought it would allow me to use more memory because it doesn't have to be contiguous. I changed all mentions of "vector" to "deque" in this loop, as well as in all declarations. But then it appeared that I ran out of memory again, this time even with the reduced number of rows.
I looked at how much memory my code is using, and when I am using deque, the memory rises steadily to above 2GB, and the program closes soon after, even when using the smaller number of rows. I'm not sure exactly where in the loop it runs out of memory.
When I use vectors, the memory usage (for the same number of rows) is still under 1GB, even when the loop exits. It then goes on to a similar loop where more rows are added, still only reaching about 1.4GB.
So my question is: is it normal for deque to use more than twice the memory of vector, or am I making an erroneous assumption in thinking I can just replace the word "vector" with "deque" in the declarations/initializations and the above code?
Thanks in advance.
I'm using:
MS Visual C++ 2010 (32-bit)
Windows 7 (64-bit)
The real answer here has little to do with the core data structure. The answer is that MSVC's implementation of std::deque is especially awful and degenerates to an array of pointers to individual elements, rather than the array of arrays it should be. Frankly, only twice the memory use of vector is surprising. If you had a better implementation of deque you'd get better results.
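If the rows really are fixed at four floats, one way to sidestep the problem entirely (a sketch under that assumption, not a drop-in replacement for the asker's code) is a single vector of fixed-size rows, which uses one contiguous allocation and no per-row heap blocks:

#include <array>
#include <vector>

// One allocation grows for the whole table; each row is four floats stored inline.
std::vector<std::array<float, 4>> table;

void add_row(float a, float b, float c, float d) {
    std::array<float, 4> row = {a, b, c, d};
    table.push_back(row);
}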
It all depends on the internal implementation of deque (I won't speak about vector since it is relatively straightforward).
Fact is, deque has completely different guarantees than vector (the most important one being that it supports O(1) insertion at both ends while vector only supports O(1) insertion at the back). This in turn means the internal structures managed by deque have to be more complex than vector.
To allow that, a typical deque implementation will split its memory into several non-contiguous blocks. But each individual memory block has a fixed overhead to allow the memory management to work (e.g., whatever the size of the block, the system may need another 16 or 32 bytes or whatever in addition, just for bookkeeping). Since, contrary to a vector, a deque requires many small, independent blocks, the overhead stacks up, which can explain the difference you see. Also note that those individual memory blocks need to be managed (maybe in separate structures?), which probably means some (or a lot of) additional overhead too.
As for a way to solve your problem, you could try what @BasileStarynkevitch suggested in the comments. This will indeed reduce your memory usage, but it will only get you so far, because at some point you'll still run out of memory. And what if you try to run your program on a machine that only has 256MB of RAM? Any other solution whose goal is to reduce your memory footprint while still trying to keep all your data in memory will suffer from the same problem.
A proper solution when handling large datasets like yours would be to adapt your algorithms and data structures so that they can handle small partitions of the whole dataset at a time, loading and saving those partitions as needed to make room for the others. Unfortunately, since this probably means disk access, it also means a big drop in performance, but hey, you can't eat the cake and have it too.
Theory
There are two common ways to efficiently implement a deque: with a modified dynamic array or with a doubly linked list.
The modified dynamic array is basically a dynamic array that can grow from both ends, sometimes called an array deque. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends instead of just one end.
There are several implementations of the modified dynamic array:
1. Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
2. Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
3. Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
Conclusion
Different libraries may implement deques in different ways, but generally as a modified dynamic array. Most likely your standard library uses approach #1 to implement std::deque, and since you append elements only at one end, you ultimately waste a lot of space. This creates the illusion that std::deque takes up more space than the usual std::vector.
Furthermore, if std::deque were implemented as a doubly-linked list, that would waste space too, since each element would need to accommodate two pointers in addition to your custom data.
An implementation using approach #3 (also a modified dynamic array) would again waste some space on additional metadata, such as the pointers to all those small arrays.
In any case, std::deque is less efficient in terms of storage than plain old std::vector. Without knowing what you want to achieve, I cannot confidently suggest which data structure you need. However, it seems you don't actually need what deques are for; therefore, what you really want in your situation is std::vector. Deques, in general, have a different application.
Deque can have additional memory overhead over vector because it's made of a few blocks instead of one contiguous block.
From en.cppreference.com/w/cpp/container/deque:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The primary issue is running out of memory.
So, do you need all the data in memory at once?
You may never be able to accomplish this.
Partial Processing
You may want to consider processing the data in "chunks" or smaller sub-matrices. For example, using the standard rectangular grid (a sketch follows the list):
Read data of first quadrant.
Process data of first quadrant.
Store results (in a file) of first quadrant.
Repeat for remaining quadrants.
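A minimal sketch of that loop, assuming binary float data and a fixed quadrant size (both assumptions, as is every name below):

#include <cstddef>
#include <cstdio>
#include <vector>

// Assumed chunk size: one quadrant of the grid.
const std::size_t kQuadrantSize = 1000 * 1000;

// Read one quadrant's worth of floats; returns false at end of input.
bool read_quadrant(std::FILE* in, std::vector<float>& q) {
    return std::fread(q.data(), sizeof(float), q.size(), in) == q.size();
}

// Placeholder for whatever computation the application performs.
void process_quadrant(std::vector<float>& q) { /* ... */ }

void process_in_chunks(std::FILE* in, std::FILE* out) {
    std::vector<float> quad(kQuadrantSize);
    while (read_quadrant(in, quad)) {                             // 1. read
        process_quadrant(quad);                                   // 2. process
        std::fwrite(quad.data(), sizeof(float), quad.size(), out); // 3. store
    }                                                             // 4. repeat
}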
Searching
If you are searching for a particular item or set of data, you can do that without reading the entire data set into memory.
Allocate a block (array) of memory.
Read a portion of the data into this block of memory.
Search the block of data.
Repeat steps 2 and 3 until the data is found.
Streaming Data
If your application is receiving the raw data from an input source (other than a file), you will want to store the data for later processing.
This will require more than one buffer and is more efficient using at least two threads of execution.
The Reading Thread will read data into a buffer until the buffer is full. When the buffer is full, it will read data into another, empty one.
The Writing Thread will initially wait until either the first read buffer is full or the read operation is finished. Next, the Writing Thread takes data out of the read buffer and writes it to a file. It then starts writing from the next read buffer.
This technique is called Double Buffering or Multiple Buffering.
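A minimal sketch of the scheme using the standard thread primitives (the "input" here is just ten dummy blocks, and all names are illustrative; a real implementation would do the actual I/O outside the lock):

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> buffers[2];          // the two read buffers
bool full[2] = {false, false};
bool done = false;
std::mutex m;
std::condition_variable cv;

void reading_thread() {               // fills the buffers alternately
    for (int block = 0; block < 10; ++block) {
        int i = block % 2;
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !full[i]; });  // wait for a free buffer
        buffers[i].assign(1024, block);         // "read" a block of raw data
        full[i] = true;
        cv.notify_all();
    }
    std::lock_guard<std::mutex> lk(m);
    done = true;                                // signal end of input
    cv.notify_all();
}

void writing_thread() {               // drains the buffers alternately
    for (int block = 0; ; ++block) {
        int i = block % 2;
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return full[i] || done; });
        if (!full[i]) break;                    // reading finished, nothing left
        std::printf("writing block %d\n", buffers[i][0]);  // "write" it out
        full[i] = false;
        cv.notify_all();
    }
}

int main() {
    std::thread reader(reading_thread), writer(writing_thread);
    reader.join();
    writer.join();
}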
Sparse Data
If there is a lot of zero or unused data in the matrix, you should try using sparse matrices. Essentially, this is a list of structures that hold each datum's coordinates and value. This also works when most of the data is a common value other than zero. It saves a lot of memory space but costs a little more execution time.
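A coordinate-list sketch of the idea (names illustrative; a sorted vector or hash map would speed up the lookup):

#include <cstddef>
#include <vector>

// Only the cells that differ from the default value are stored.
struct Cell {
    int row, col;
    float value;
};

std::vector<Cell> cells;   // replaces a dense rows*cols matrix

float get(const std::vector<Cell>& s, int r, int c, float def = 0.0f) {
    for (std::size_t i = 0; i < s.size(); ++i)   // linear scan for brevity
        if (s[i].row == r && s[i].col == c)
            return s[i].value;
    return def;                                  // unstored cells hold the default
}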
Data Compression
You could also change your algorithms to use data compression. The idea here is to store the data location, the value, and the number of contiguous equal values (a.k.a. runs). So instead of storing 100 consecutive data points of the same value, you would store the starting position (of the run), the value, and 100 as the quantity. This saves a lot of space but requires more processing time when accessing the data.
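A run-length-encoding sketch of that idea (illustrative only):

#include <cstddef>
#include <vector>

// A run of equal values collapses to (start position, value, count).
struct Run {
    std::size_t start;
    float value;
    std::size_t count;
};

std::vector<Run> compress(const std::vector<float>& data) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < data.size();) {
        std::size_t j = i;
        while (j < data.size() && data[j] == data[i]) ++j;  // extend the run
        Run r = {i, data[i], j - i};
        runs.push_back(r);
        i = j;                                              // next run starts here
    }
    return runs;
}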
Memory Mapped File
There are libraries that can treat a file as memory. Essentially, they read a "page" of the file into memory. When a request falls outside that "page", they read in another page. All this is performed behind the scenes; all you need to do is treat the file like memory.
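On POSIX systems this might look like the sketch below (on the asker's Windows setup the analogues are CreateFileMapping / MapViewOfFile, or boost::iostreams::mapped_file); error handling is minimal:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map 'count' floats of a binary file into the address space, read-only.
float* map_floats(const char* path, std::size_t count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    void* p = mmap(0, count * sizeof(float), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          // the mapping stays valid after close
    if (p == MAP_FAILED) return 0;
    return static_cast<float*>(p);      // index it like an ordinary array;
}                                       // the OS pages data in behind the scenes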
Summary
Arrays and deques are not your primary issue, quantity of data is. Your primary issue can be resolved by processing small pieces of data at a time, compressing the data storage, or treating the data in the file as memory. If you are trying to process streaming data, don't. Ideally, streaming data should be placed into a file and then processed later.
A historical purpose of a file is to contain data that doesn't fit into memory.

MPI synchronize matrix of vectors

Excuse me if this question is common or trivial, I am not very familiar with MPI so bear with me.
I have a matrix of vectors. Each vector is empty or has a few items in it.
std::vector<someStruct*> partitions[matrix_size][matrix_size];
When I start the program each process will have the same data in this matrix, but as the code progresses each process might remove several items from some vectors and put them in other vectors.
So when I reach a barrier I somehow have to make sure each process has the latest version of this matrix. The big problem is that each process might manipulate any or all vectors.
How would I go about to make sure that every process has the correct updated matrix after the barrier?
EDIT:
I am sorry I was not clear. Each process may move one or more objects to other vectors, but only one process may move each object. In other words, each process has a list of objects it may move, the matrix may be altered by every process, but two processes can never move the same object.
In that case you'll need to send messages using MPI_Bcast that inform the other processors about this and instruct them to do the same. Alternatively, if the ordering doesn't matter until you hit the barrier, you can send the messages only to the root process, which performs the permutations and then, after the barrier, sends the result to all the others using MPI_Bcast.
One more thing: vectors of pointers are usually quite a bad idea, as you'll need to manage the memory manually in there. If you can use C++11, use std::unique_ptr or std::shared_ptr instead (depending on what your semantics are), or use Boost which provides very similar facilities.
And lastly, representing a matrix as a fixed-size array of fixed-size arrays is really bad. First, the matrix size is fixed. Second, adjacent rows are not necessarily stored in contiguous memory, which can slow your program down like crazy (it can literally be orders of magnitude). Instead, represent the matrix as a linear array of size Nrows*Ncols and index the elements as i*Ncols + j (row-major storage), where i and j are the row and column indices; if you prefer column-major storage, address the elements as i + Nrows*j instead. You can wrap this index juggling in inline functions that have virtually zero overhead.
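A sketch of such a wrapper (illustrative; row-major):

#include <cstddef>
#include <vector>

// Flattened row-major matrix; the inline accessors hide the index math.
struct Matrix {
    std::size_t nrows, ncols;
    std::vector<double> a;

    Matrix(std::size_t r, std::size_t c) : nrows(r), ncols(c), a(r * c) {}

    double& operator()(std::size_t i, std::size_t j) {
        return a[i * ncols + j];       // row-major: i*Ncols + j
    }
    const double& operator()(std::size_t i, std::size_t j) const {
        return a[i * ncols + j];
    }
};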
I would suggest to lay out the data differently:
Each process has a map of its objects and their positions in the matrix. How that is implemented depends on how you identify objects. If all local objects are numbered, you could just use a vector<pair<int,int>>.
Treat that as the primary structure you manipulate, and communicate it with MPI_Allgather (each process sends its data to all other processes; at the end everyone has all the data). If you need fast lookup by coordinates, you can build up a cache.
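Since each rank contributes a different number of entries, the variable-count variant MPI_Allgatherv is the natural fit. A sketch (Entry and all other names are illustrative; it assumes Entry is a padding-free triple of ints, and empty-rank corner cases are omitted for brevity):

#include <mpi.h>
#include <vector>

struct Entry { int objectId, row, col; };  // assumed padding-free

std::vector<Entry> allgather_entries(const std::vector<Entry>& local,
                                     MPI_Comm comm) {
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    // 1. Everyone learns how many ints each rank will contribute.
    int mycount = (int)local.size() * 3;
    std::vector<int> counts(nprocs);
    MPI_Allgather(&mycount, 1, MPI_INT, &counts[0], 1, MPI_INT, comm);

    // 2. Compute displacements and gather the variable-length payloads.
    std::vector<int> displs(nprocs, 0);
    for (int r = 1; r < nprocs; ++r)
        displs[r] = displs[r - 1] + counts[r - 1];
    int total = displs[nprocs - 1] + counts[nprocs - 1];

    std::vector<Entry> all(total / 3);
    MPI_Allgatherv((void*)&local[0], mycount, MPI_INT,
                   &all[0], &counts[0], &displs[0], MPI_INT, comm);
    return all;
}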
That may or may not be performing well. Other optimizations (like sharing 'transactions') totally depend on your objects and the operations you perform on them.

Efficiently collect data from multiple 1-D arrays in to a single 1-D array

I've got a prewritten function in C that fills a 1-D array with data, e.g.
int myFunction(myData **arr, ...);

myData *array;
int arraySize;
arraySize = myFunction(&array, ...);
I would like to call the function n times in a row with slightly different parameters (n is dependent on user input), and I need all the data collected in a single C array afterwards. The size of the returned array is not always fixed. Oh, and myFunction does the memory allocation internally. I want to do this in a memory-efficient way, but using realloc in each iteration does not sound like a good idea.
I do have all the C++ functionality available (the project is in C++, just using a C library), but using std::vector is no good because the collected data is later sent in to a function with a definition similar to:
void otherFunction(myData *data, int numData, ...);
Any ideas? Only things I can think of are realloc or using a std::vector and copying the data into an array afterwards, and those don't sound too promising.
Using realloc() in each iteration sounds like a very fine idea to me, for two reasons:
"does not sound like a good idea" is what people usually say when they have not established a performance requirement for their software, and they have not tested their software against the performance requirement to see if there is any need to improve it.
Instead of reallocating a new block each time, the realloc method will simply keep expanding your memory block, which will presumably be at the top of the memory heap, so it won't waste any time either traversing memory-block lists or copying data around. This holds true provided that whatever memory is allocated by myFunction() gets freed before it returns. You can verify it by looking at the pointer returned by realloc() and seeing that it always (or almost always (*1)) is the exact same pointer as the one you gave it to reallocate.
EDIT (*1) some C++ runtimes implement two heaps, one for small allocations and one for large allocations, so if your block gets allocated in the heap for small blocks, and then it grows large, there is a possibility that it will be moved once to the heap for large blocks. So, don't expect the pointer to always be the same; just most of the time.
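A sketch of that accumulation loop, reusing the asker's declarations (the ownership assumption, that the chunk returned through myFunction() was malloc()ed and may be freed by the caller, needs checking against the library's documentation):

#include <cstdlib>
#include <cstring>

myData* all = NULL;
int total = 0;

for (int i = 0; i < n; ++i) {
    myData* chunk = NULL;
    int count = myFunction(&chunk /*, parameters for run i */);

    // Grow the accumulator and append the new chunk (realloc's return
    // value is unchecked here for brevity).
    all = (myData*)std::realloc(all, (total + count) * sizeof(myData));
    std::memcpy(all + total, chunk, count * sizeof(myData));
    total += count;
    std::free(chunk);   // ownership assumption, see above
}

otherFunction(all, total /*, ... */);
std::free(all);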
Just copy all of the data into an std::vector. You can call otherFunction on a vector v with
otherFunction(&v[0], v.size(), ...)
or
otherFunction(v.data(), v.size(), ...)
As for your efficiency requirement: it looks to me like you're optimizing prematurely. First try this option, then measure how fast it is, and only look for other solutions if it's really too slow.
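A sketch of the vector-based accumulation, again reusing the asker's declarations and the same ownership assumption about the returned chunk:

#include <cstdlib>
#include <vector>

std::vector<myData> all;

for (int i = 0; i < n; ++i) {
    myData* chunk = NULL;
    int count = myFunction(&chunk /*, parameters for run i */);
    all.insert(all.end(), chunk, chunk + count);  // copy the chunk in
    std::free(chunk);                             // ownership assumption
}

otherFunction(&all[0], (int)all.size() /*, ... */);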
If you know that you are going to call the function N times, and returned arrays are always M long, then why don't you just allocate one array M*N initially? Or if you don't know one of M or N, then set a worst case maximum. Or are M and N both dependent on user-input?
Then, change how you call your user-input-getting function, such that the array pointer you pass it is actually an offset into that large array, so that it stores the data in the right location. Then, next iteration, offset further, and call again.
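A sketch of that approach; it assumes the filling routine can be changed (or wrapped) to write into caller-provided storage, here as a hypothetical myFunctionInto() that returns the number of elements written:

#include <cstdlib>

// Hypothetical variant of myFunction() that fills caller-provided storage.
int myFunctionInto(myData* dest /*, ... */);

myData* all = (myData*)std::malloc((size_t)N * M * sizeof(myData));
int total = 0;

for (int i = 0; i < N; ++i)
    total += myFunctionInto(all + total /*, parameters for run i */);

otherFunction(all, total /*, ... */);
std::free(all);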
I think the best solution would be to write your own 1D array class with the methods you need. Depending on how you write the class, you'll get exactly the behavior you want.
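A minimal sketch of such a class (in practice std::vector already provides all of this):

#include <cstdlib>
#include <cstring>

struct DataArray {
    myData* data;
    int size, capacity;

    DataArray() : data(NULL), size(0), capacity(0) {}
    ~DataArray() { std::free(data); }

    void append(const myData* chunk, int count) {
        if (size + count > capacity) {
            capacity = (size + count) * 2;   // geometric growth
            data = (myData*)std::realloc(data, capacity * sizeof(myData));
        }
        std::memcpy(data + size, chunk, count * sizeof(myData));
        size += count;
    }
};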

concurrent_vector for 2d array

I am currently trying to represent a 2D array using tbb::concurrent_vector<T>. This 2D array will be accessed by a lot of different threads, and that's why I want it to handle parallel accesses as efficiently as possible.
I came up with 2 solutions:
Use a tbb::concurrent_vector<tbb::concurrent_vector<T> > to store it.
Store everything in a tbb::concurrent_vector<T> and access elements with x * width + y.
I had a preference for the second one because I don't want to lock an entire row to access one element (since I assume that to access the element array[x][y], the tbb implementation will lock the xth row and then the yth element).
I would like to know which solution seems better to you.
First, I think there might be some confusion concerning tbb::concurrent_vector. This container is similar to std::vector, but with thread-safe resizing and slower element access due to the internal storage layout.
You can read more about it here.
In your case, because of the second solution you proposed (a 1D array with x * width + y indexing), I'm assuming your algorithm doesn't involve intensive multi-threaded resizing of the array. Therefore, you wouldn't benefit from tbb::concurrent_vector compared to a single-threaded container like std::vector.
I guess you were assuming that tbb::concurrent_vector guaranteed thread-safe element access, but it does not - quote from tbb::concurrent_vector::operator[] doc:
Get reference to element at given index.
This method is thread-safe for concurrent reads, and also while growing the vector, as long as the calling thread has checked that index < size().
If you are not resizing the array, you are only interested in the first part: thread-safe concurrent reads. But std::vector or even a raw C array gives you the same. On the other hand, neither offers thread-safe arbitrary access (reading and writing elements).
You would have to implement that using locks, or find another library that does it for you (maybe std::vector from STLPort, but I'm not sure). This would be quite inefficient, though, since each access to an element of your 2D array would involve thread-synchronization overhead. While I don't know exactly what you are trying to implement, it's quite possible that the synchronization would take longer than your actual computations.
Now to answer your question: even in a single-threaded setting, it is always better to use a 1D array for representing ND arrays, because computing the index (x * width + y) is faster than the additional memory access needed by a true ND array.
For concurrent vectors, this is even more true because you would have twice the lock overhead in a best-case scenario (no conflicting row access), and a lot more in case there are conflicts.
So out of the two solutions you proposed, I wouldn't hesitate and go for the second one: a 1D array (not necessary tbb::concurrent_vector) with adequate locking for element access.
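A sketch of what "adequate locking" might look like: a plain 1D array guarded by a small pool of striped mutexes, so the lock overhead is bounded without locking per element (all names and the stripe count are illustrative):

#include <cstddef>
#include <mutex>
#include <vector>

template <typename T>
class LockedGrid {
    std::size_t width_;
    std::vector<T> cells_;            // the flat 1D storage
    std::vector<std::mutex> locks_;   // one mutex per stripe, not per cell

public:
    LockedGrid(std::size_t w, std::size_t h, std::size_t stripes = 64)
        : width_(w), cells_(w * h), locks_(stripes) {}

    void set(std::size_t x, std::size_t y, const T& v) {
        std::size_t i = x * width_ + y;
        std::lock_guard<std::mutex> g(locks_[i % locks_.size()]);
        cells_[i] = v;
    }

    T get(std::size_t x, std::size_t y) {
        std::size_t i = x * width_ + y;
        std::lock_guard<std::mutex> g(locks_[i % locks_.size()]);
        return cells_[i];
    }
};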
Depending on your algorithm and the access pattern by the different threads, another approach - used in image editing software (gimp, photoshop...) - is tile based:
template<typename T> struct Tile {
    int offsetX, offsetY;
    int width, height;
    Not_ThreadSafe_Vector<T> data;
};

ThreadSafe_Vector< Tile<T> > data;
Where Not_ThreadSafe_Vector could be any container without locking on element access, e.g. std::vector; and ThreadSafe_Vector is a container with thread-safe read/write element access (not tbb::concurrent_vector!).
This way, if you have some locality in your access pattern (a thread is more likely to access an element near to its previous access than far away), then each thread can be made to work on the data from a single tile most of the time, and you only have synchronization overhead when switching to another tile.
tbb::concurrent_vector is completely thread-safe for accessing elements (read, write, take the address) and for growing the vector. Check the response from Intel employee Arch D. Robison here.
However, operations for clearing or destroying a vector are not thread-safe (refer to §6.2.1 of the Intel TBB Tutorial PDF). That is, don't invoke clear() if there are other operations underway on the tbb::concurrent_vector.
As Antoine mentioned, handling the ND array as a 1D array is always to be preferred due to efficiency & performance.