Multithreaded array of arrays? - c++

I have a data structure which consists of 1,000 array elements, each array element is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"

You are correct that your current algorithm is not thread-safe: there are a number of places where contention could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach will then be to just lock the entire structure for the full process. This will only work if the threads are doing a lot of other processing as well; if not, you will actually lose performance compared to single threading.
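A minimal sketch of that coarse-grained approach; the wrapper type, member names, and the compute_new_index placeholder are illustrative, not taken from the question:

#include <array>
#include <cstddef>
#include <mutex>

struct Structure {
    std::array<std::array<int, 8>, 1000> data{};
    std::size_t min = 0;
    std::size_t max = 0;
    std::mutex m;

    // Placeholder for whatever placement logic the question actually computes.
    std::size_t compute_new_index(const std::array<int, 8>& e) const {
        return static_cast<std::size_t>(e[0]) % data.size();
    }

    void move_element(std::size_t from) {
        std::lock_guard<std::mutex> lock(m);          // held for the whole operation
        std::array<int, 8> element = data[from];      // pop element from current index
        std::size_t to = compute_new_index(element);  // calculate new index
        data[to] = element;                           // write element to new index
        if (to < min) min = to;                       // update min and max "pointers"
        if (to > max) max = to;
    }
};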
After that you can consider having a separate lock for different sections of the data structure. You will need to properly analyse what you are using when and where and work out what would be useful to split. For example you might have chunks of the sub arrays with each chunk having its own lock.
Be careful of deadlocks in this situation though, you might have a thread claim 32 then want 79 while another thread already has 79 and then wants 32. Make sure you always claim locks in the same order.
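A hedged sketch of per-chunk locking with a consistent acquisition order; the split into ten chunks and all names are illustrative. std::scoped_lock (C++17) locks both mutexes with a deadlock-avoidance algorithm, which achieves the same effect as always claiming the lower-numbered lock first:

#include <array>
#include <cstddef>
#include <mutex>

std::array<std::mutex, 10> chunk_locks;   // e.g. one lock per 100 outer elements

void move_between_chunks(std::size_t from_chunk, std::size_t to_chunk)
{
    if (from_chunk == to_chunk) {
        std::lock_guard<std::mutex> lock(chunk_locks[from_chunk]);
        // ... pop, recompute and write within a single chunk ...
        return;
    }
    std::scoped_lock both(chunk_locks[from_chunk], chunk_locks[to_chunk]);
    // ... pop from one chunk, write to the other, update min/max ...
}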
The fastest option (if possible) may even be to give each thread its own copy of the data structure, have each process 1/N of the work, and then merge the results at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.

Related

Is writing to elements within children of a multi-dimensional vector thread-safe?

I'm trying to take a (very) large vector, and reassign all of the values in it into a multidimensional (2D) vector<vector<string>>.
The multidimensional vector has both dimensions resized to the correct size prior to value population, to avoid reallocation.
Currently, I am doing it single-threaded, but it is something that needs to happen repeatedly, and is very slow due to the large size (~7 seconds). The question is whether it is thread-safe for me to use, for instance, a thread per 2D element.
Some pseudocode:
vector<string> source{/*assume that it is populated by 8,000,000 strings
of varying length*/};
vector<vector<string>> destination;
destination.resize(8);
for(size_t loop=0;loop<8;loop++) destination[loop].resize(1000000);

//current style
for(size_t loop=0;loop<source.size();loop++)
    destination[loop/1000000][loop%1000000]=source[loop];

//desired style
void Populate(size_t index){
    for(size_t loop=0;loop<destination[index].size();loop++)
        destination[index][loop]=source[index*1000000+loop];
}
for(size_t loop=0;loop<8;loop++) boost::thread populator(Populate,loop);
I would think that the threaded version should work, since they're writing to separate 2nd dimensional elements. However, I'm not sure whether writing the strings would break things, since they are being resized.
When considering only thread-safety, this is fine.
Writing concurrently to distinct objects is allowed. C++ considers objects distinct even if they are neighboring fields in a struct or elements in the same array. The data type of the object does not matter here, so this holds true for string just as well as it does for int. The only important thing is that you must ensure that the ranges that you operate on are really fully distinct. If there is any overlap, you have a data race on your hands.
There is however, another thing to take into consideration here and that is performance. This is highly platform dependent, so the language standard does not give you any rules here, but there are some effects to look out for. For instance, neighboring elements in an array might reside on the same cache line. So in order for the hardware to be able to fulfill the thread-safety guarantees of the language, it must synchronize access to such elements. For instance: Partitioning array access in a way that one thread works out all the elements with even indices, while another works on the odd indices is technically thread-safe, but puts a lot of stress on the hardware as both threads are likely to contend for data stored on the same cache line.
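To illustrate, here is a minimal sketch (names are mine, not the asker's) contrasting the two partitionings. Both are thread-safe because every written element is a distinct object, but the contiguous split also keeps each thread on its own cache lines:

#include <cstddef>
#include <string>
#include <thread>
#include <vector>

void fill_range(std::vector<std::string>& dest, const std::vector<std::string>& src,
                std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        dest[i] = src[i];                     // distinct destination objects: no data race
}

int main()
{
    std::vector<std::string> source(1000, "payload");
    std::vector<std::string> destination(source.size());   // sized up front, never resized
    const std::size_t half = source.size() / 2;

    // Each thread owns a contiguous half, so cache lines are rarely shared.
    std::thread a(fill_range, std::ref(destination), std::cref(source), std::size_t{0}, half);
    std::thread b(fill_range, std::ref(destination), std::cref(source), half, source.size());
    a.join();
    b.join();
    // Giving thread a the even indices and thread b the odd ones would still be
    // race-free, but both threads would then contend for the same cache lines.
}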
Similarly, in your case there is contention on the memory bus. If your threads are able to compute the data much faster than you are able to write it to memory, you might not actually gain anything by using multiple threads, because all the threads will end up waiting for the memory in the end.
Keep these things in mind when deciding whether parallelism is really the right solution to your problem.

What is the Optimal Memory Setup for OpenCL where the host needs access at regular time steps?

I'm looking to find the best way to setup the CL memory objects for my project, which does a device side physics simulation. The buffers will be accessed by the host every frame, approx every 16ms, to get the updated data for rendering. Unfortunately, I cannot send the new data straight to the GPU via a VBO.
The data in the buffer consists of structs with 3 cl_float4's and one cl_float. I also want to have the ability for the host to update some of the structs in the buffer, this will not be per-frame.
Currently I'm looking to have all the data be allocated/stored on the GPU and using map/unmap whenever the host requires access. But this brings up two issues that I can see:
Still require a device to host copy for rendering
Buffer must be rebuilt whenever objects are added/removed from the simulation. Or additional validation data must exist per struct to check if this object is "alive"/valid...
Any advice is appreciated. If you need any additional info or code snippets, just let me know.
Thank you.
Algorithm for efficient memory management
You're asking for the best setup for OpenCL memory. I'm assuming you mostly care about high performance and not so much about size overhead.
This means you should perform as many operations as possible on the GPU. Syncing between CPU/GPU should be minimized.
Memory model
I will now describe in detail how such a memory and processing model should look.
Preallocate buffers with the maximum size and fill them over time.
Track how many elements are currently in the buffer
Have separate buffers for validity and your data; the validity buffer indicates the validity of each element
Adding elements
Adding elements can be done via the following principle:
Have a buffer with host pointer for input data. The size of the buffer is determined by the maximum number of input elements
When you receive data, copy it onto the host buffer and sync it to the GPU
(Optional) Preprocess input data on the GPU
In a kernel, append the input data and the corresponding validity flags behind the last element in the global buffer. For input slots that are empty (maybe you just got 100 input points instead of 10000), just mark them as invalid (see the sketch at the end of this section).
This has several effects:
Adding can be completely done in parallel
You only have to sync a small buffer (input data buffer) to the GPU
When adding input data, you always add the maximum number of input elements to the buffer, but most of them will be empty/invalid. So if you add points frequently, invalid entries accumulate and the buffer fills up faster than the number of real elements would suggest.
If your rendering step is not able to discard invalid points, you must remove invalid points from the model before rendering.
Otherwise, you can postpone cleaning up to a point where it is only needed because the size of the model becomes too big and threatens to overflow.
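As a rough illustration of the kernel-side append described above, here is a minimal OpenCL C sketch. The Body struct (three float4s and one float, as in the question) and all names are illustrative, and host/device struct padding issues are ignored:

typedef struct {
    float4 pos;
    float4 vel;
    float4 acc;
    float  mass;
} Body;

// One work-item per input slot: copy it behind the last valid element and
// flag slots beyond the actually received count as invalid.
__kernel void append_elements(__global Body       *model,      // preallocated global buffer
                              __global int        *validity,   // 1 = valid, 0 = invalid
                              __global const Body *input,      // small host-written input buffer
                              const int            num_valid,  // elements already in the model
                              const int            num_new)    // populated input slots
{
    int i   = get_global_id(0);
    int dst = num_valid + i;

    model[dst]    = input[i];
    validity[dst] = (i < num_new) ? 1 : 0;
}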
Removing elements
Removing elements should be done via the following principle:
Have a kernel that determines whether an element becomes invalid. If so, just mark its validity accordingly (if you want, you can zero or NaN out the data too, but that is not necessary). A minimal sketch follows after this list.
Have an algorithm that is able to remove invalid elements from the buffer and give you the number of valid, consecutive elements in the buffer (that information is needed when adding elements).
Such an algorithm will require you to perform sorts and a search using parallel reduction.
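A minimal OpenCL C sketch of the invalidation step, reusing the illustrative Body/validity layout from the sketch above; the invalidation criterion (leaving a bounding sphere) is purely an example:

__kernel void invalidate_elements(__global const Body *model,
                                  __global int        *validity,
                                  const float          max_radius)
{
    int i = get_global_id(0);
    if (validity[i] && length(model[i].pos.xyz) > max_radius)
        validity[i] = 0;      // zeroing or NaN-ing the data itself is optional
}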
Sorting elements in parallel
Sorting a buffer, especially one with many elements, is highly demanding. You should use available implementations to do so.
Simple Bitonic sort:
If you do not need the maximum possible performance and want simple code, this is your choice.
Implementation available: https://software.intel.com/en-us/articles/bitonic-sorting
Simple to integrate, just a single kernel.
Can only sort 4*2^n elements (as far as I remember).
WARNING: This sort does not work with numbers larger than one billion (1,000,000,000). Not sure why but finding that out cost me quite some time.
Fast radix sort:
If you care about maximum performance and have lots of elements to sort (1 million up to 1 billion or even more), this is your choice.
Implementation available: https://github.com/sschaetz/nvidia-opencl-examples/tree/master/OpenCL/src/oclRadixSort
More difficult to integrate, several kernel calls
Can only sort 2^n elements (as far as I remember)
Faster than Bitonic sort, especially with more than 1 million elements
Finding out the number of valid elements
If the buffer has been sorted and all invalid elements have been removed, you could simply count the valid values in parallel, or simply find the index of the first invalid element (this requires you to have the unused buffer space invalidated). Both ways will give you the number of valid elements.
Problem size vs. sorting size restrictions
To overcome the problems that arise with only being able to sort a fixed number of elements, just pad out with values whose sorting behavior you know.
Example:
You want to sort 10,000 integers with values between 0 and 10 million in ascending order.
You can only sort 2^n elements
The closest you will get is 2^14 = 16384.
Have a buffer for sorting with 2^14 elements
Fill the buffer with the 10000 values to sort.
Fill all remaining values of the buffer with a value you know will be sorted behind the 10,000 actually existing values.
Since you know your value range (0 to 10 million), you could pick 11 million as filling value.
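A small host-side C++ sketch of that padding step (function name and value type are illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint32_t> pad_for_sort(std::vector<std::uint32_t> values,
                                        std::uint32_t max_value)
{
    std::size_t n = 1;
    while (n < values.size())
        n *= 2;                           // next power of two >= the real element count

    values.resize(n, max_value + 1);      // padding values sort behind every real value
    return values;                        // e.g. 10,000 reals become 16,384 sorted elements
}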
In-place sorting problem
In-place sorting and removing of elements is difficult (but possible) to implement. An easier solution is to determine the indices of consecutive valid elements and write them to a new buffer in that order and then swap buffers.
But this requires you to swap buffers or copy back, which costs both performance and space. Choose the lesser evil in your case.
More advice
Only add wait-events if you are still not content with the performance. However, this will complicate your code and possibly introduce bugs (which won't even be your fault - there is a nasty bug with Nvidia cards and OpenCL where wait-events are not destroyed and memory leaks; this will slowly but surely cause problems).
Be very careful with syncing/mapping buffers to the CPU too early, as this sync-call will force all kernels using this buffer to finish.
If adding elements rarely occurs, and your rendering step is able to discard invalid elements, you can postpone removing elements from the buffer until it is really needed (too many elements threaten to overflow your buffer).

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a C or C++ computational intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]) // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, plus I need to be very careful about when to introduce them;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, update them when necessary, and write them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
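A minimal scalar sketch of the register-accumulation idea (names and the computation itself are illustrative); the same pattern carries over to SSE registers via intrinsics or inline assembly:

#include <cstddef>

void compute(const float *input1, const float *input2, float *output, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        float acc = 0.0f;                 // partial result lives in a register
        acc += input1[i] * 2.0f;          // every intermediate update hits the register...
        acc += input2[i] * 3.0f;
        output[i] = acc;                  // ...and output[i] is written exactly once
    }
}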
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements in output[] are strictly write-only. If so, all you would ever need to "reserve" is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific cache way for NTA data, i.e., with an 8-way cache, 1/8th per thread.
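A hedged x86 sketch of both options using SSE intrinsics; the function, the computation, and the prefetch distance are illustrative, and it assumes n is a multiple of 4 and output is 16-byte aligned:

#include <cstddef>
#include <xmmintrin.h>   // _mm_loadu_ps, _mm_mul_ps, _mm_stream_ps, _mm_prefetch, _mm_sfence

void combine(const float *input1, const float *input2, float *output, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        // Option 2: pull the inputs in with a non-temporal hint to limit cache pollution.
        _mm_prefetch(reinterpret_cast<const char *>(input1 + i + 64), _MM_HINT_NTA);
        _mm_prefetch(reinterpret_cast<const char *>(input2 + i + 64), _MM_HINT_NTA);

        __m128 a = _mm_loadu_ps(input1 + i);
        __m128 b = _mm_loadu_ps(input2 + i);

        // Option 1: streaming store writes output without fetching it into the cache.
        _mm_stream_ps(output + i, _mm_mul_ps(a, b));
    }
    _mm_sfence();   // make the non-temporal stores globally visible
}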
I guess the solution to this is hidden in the algorithm employed, the L1 cache size, and the cache line size.
Though I am not sure how much performance improvement we will see with this.
We can probably introduce artificial reads which cleverly dodge the compiler and, at execution time, do not hurt the computation either. A single artificial read should fill as many cache lines as are needed to accommodate one page. Therefore, the algorithm should be modified to compute blocks of the output array, something like the blocking used in the multiplication of huge matrices on GPUs, where blocks of the matrices are used both for computation and for writing the result.
As pointed out earlier, the write to output array should happen in a stream.
To bring in the artificial reads, we should initialize the output array at the right places at compile time, once in each block, probably with 0 or 1.

Why is deque using so much more RAM than vector in C++?

I have a problem I am working on where I need to use some sort of 2 dimensional array. The array is fixed width (four columns), but I need to create extra rows on the fly.
To do this, I have been using vectors of vectors, and I have been using some nested loops that contain this:
array.push_back(vector<float>(4));
array[n][0] = a;
array[n][1] = b;
array[n][2] = c;
array[n][3] = d;
n++;
to add the rows and their contents. The trouble is that I appear to be running out of memory with the number of elements I was trying to create, so I reduced the number that I was using. But then I started reading about deque, and thought it would allow me to use more memory because it doesn't have to be contiguous. I changed all mentions of "vector" to "deque", in this loop, as well as all declarations. But then it appeared that I ran out of memory again, this time even with the reduced number of rows.
I looked at how much memory my code is using, and when I am using deque, the memory rises steadily to above 2GB, and the program closes soon after, even when using the smaller number of rows. I'm not sure exactly where in this loop it runs out of memory.
When I use vectors, the memory usage (for the same number of rows) is still under 1GB, even when the loop exits. It then goes on to a similar loop where more rows are added, still only reaching about 1.4GB.
So my question is: is it normal for deque to use more than twice the memory of vector, or am I making an erroneous assumption in thinking I can just replace the word "vector" with "deque" in the declarations/initializations and the above code?
Thanks in advance.
I'm using:
MS Visual C++ 2010 (32-bit)
Windows 7 (64-bit)
The real answer here has little to do with the core data structure. The answer is that MSVC's implementation of std::deque is especially awful and degenerates to an array of pointers to individual elements, rather than the array of arrays it should be. Frankly, only twice the memory use of vector is surprising. If you had a better implementation of deque you'd get better results.
It all depends on the internal implementation of deque (I won't speak about vector since it is relatively straightforward).
Fact is, deque has completely different guarantees than vector (the most important one being that it supports O(1) insertion at both ends while vector only supports O(1) insertion at the back). This in turn means the internal structures managed by deque have to be more complex than vector.
To allow that, a typical deque implementation will split its memory into several non-contiguous blocks. But each individual memory block has a fixed overhead to allow the memory management to work (e.g. whatever the size of the block, the system may need another 16 or 32 bytes or whatever in addition, just for bookkeeping). Since, contrary to a vector, a deque requires many small, independent blocks, the overhead stacks up, which can explain the difference you see. Also note that those individual memory blocks need to be managed (maybe in separate structures?), which probably means some (or a lot of) additional overhead too.
As for a way to solve your problem, you could try what @BasileStarynkevitch suggested in the comments. This will indeed reduce your memory usage, but it will only get you so far because at some point you'll still run out of memory. And what if you try to run your program on a machine that only has 256MB of RAM? Any other solution whose goal is to reduce your memory footprint while still trying to keep all your data in memory will suffer from the same problems.
A proper solution when handling large datasets like yours would be to adapt your algorithms and data structures in order to be able to handle small partitions at a time of your whole dataset, and load/save those partitions as needed in order to make room for the other partitions. Unfortunately since it probably means disk access, it also means a big drop in performance but hey, you can't eat the cake and have it too.
Theory
There are two common ways to efficiently implement a deque: either with a modified dynamic array or with a doubly linked list.
The modified dynamic array is basically a dynamic array that can grow from both ends, sometimes called an array deque. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end.
There are several implementations of the modified dynamic array:
1. Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
2. Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
3. Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
Conclusion
Different libraries may implement deques in different ways, but generally as a modified dynamic array. Most likely your standard library uses approach #1 to implement std::deque, and since you append elements only at one end, you ultimately waste a lot of space. That creates the illusion that std::deque takes up more space than a plain std::vector.
Furthermore, if std::deque were implemented as a doubly linked list, that would result in a waste of space too, since each element would need to accommodate two pointers in addition to your custom data.
An implementation with approach #3 (also a modified dynamic array) would again result in some wasted space to accommodate additional metadata, such as pointers to all those small arrays.
In any case, std::deque is less efficient in terms of storage than a plain old std::vector. Without knowing what you want to achieve, I cannot confidently suggest which data structure you need. However, it seems like you don't even know what deques are for; therefore, what you really want in your situation is std::vector. Deques, in general, have a different application.
Deque can have additional memory overhead over vector because it's made of a few blocks instead of one contiguous block.
From en.cppreference.com/w/cpp/container/deque:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.
The primary issue is running out of memory.
So, do you need all the data in memory at once?
You may never be able to accomplish this.
Partial Processing
You may want to consider processing the data into "chunks" or smaller sub-matrices. For example, using the standard rectangular grid:
Read data of first quadrant.
Process data of first quadrant.
Store results (in a file) of first quadrant.
Repeat for remaining quadrants.
Searching
If you are searching for a particular datum or a set of data, you can do that without reading the entire data set into memory.
Allocate a block (array) of memory.
Read a portion of the data into this block of memory.
Search the block of data.
Repeat steps 2 and 3 until the data is found.
Streaming Data
If your application is receiving the raw data from an input source (other than a file), you will want to store the data for later processing.
This will require more than one buffer and is more efficient using at least two threads of execution.
The Reading Thread will be reading data into a buffer until the buffer is full. When the buffer is full, it will read data into another empty one.
The Writing Thread will initially wait until either the first read buffer is full or the read operation is finished. Next, the Writing Thread takes data out of the read buffer and writes it to a file. The Writing Thread then starts writing from the next read buffer.
This technique is called Double Buffering or Multiple Buffering.
Sparse Data
If there is a lot of zero or unused data in the matrix, you should try using Sparse Matrices. Essentially, this is a list of structures that hold the data's coordinates and the value. This also works when most of the data is a common value other than zero. This saves a lot of memory space; but costs a little bit more execution time.
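A minimal sketch of such a coordinate-list ("COO") entry, assuming the four-column float layout from the question (names are illustrative):

#include <cstddef>
#include <vector>

struct Entry {
    std::size_t row;
    std::size_t col;    // 0..3 in the four-column case
    float       value;
};

std::vector<Entry> sparse;    // holds only the cells that differ from the common value
// instead of: array[row][col] = v;
// use:        sparse.push_back({row, col, v});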
Data Compression
You could also change your algorithms to use data compression. The idea here is to store the data location, the value, and the number of contiguous equal values (a.k.a. runs). So instead of storing 100 consecutive data points of the same value, you would store the starting position (of the run), the value, and 100 as the quantity. This saves a lot of space, but requires more processing time when accessing the data.
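A small run-length-encoding sketch along those lines (illustrative names):

#include <cstddef>
#include <vector>

struct Run {
    std::size_t start;   // index of the first element of the run
    float       value;   // the repeated value
    std::size_t count;   // how many consecutive elements share it
};

std::vector<Run> compress(const std::vector<float>& data)
{
    std::vector<Run> runs;
    for (std::size_t i = 0; i < data.size(); ) {
        std::size_t j = i;
        while (j < data.size() && data[j] == data[i]) ++j;   // extend the current run
        runs.push_back({i, data[i], j - i});
        i = j;
    }
    return runs;
}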
Memory Mapped File
There are libraries that can treat a file as memory. Essentially, they read in a "page" of the file into memory. When the requests go out of the "page", they read in another page. All this is performed "behind the scenes". All you need to do is treat the file like memory.
Summary
Arrays and deques are not your primary issue, quantity of data is. Your primary issue can be resolved by processing small pieces of data at a time, compressing the data storage, or treating the data in the file as memory. If you are trying to process streaming data, don't. Ideally, streaming data should be placed into a file and then processed later.
A historical purpose of a file is to contain data that doesn't fit into memory.

What is the appropriate madvise setting for reading a file backwards?

I am using gcc 4.7.2 on a 64-bit Linux box.
I have 20 large sorted binary POD files that I need to read as a part of the final merge in an external merge-sort.
Normally, I would mmap all the files for reading and use a multiset<T,LessThan> to manage the merge sort from small to large, before doing a mmap write out to disk.
However, I realised that if I keep a std::mutex on each of these files, I can create a second thread which reads the file backwards, and sort from large to small at the same time. If I decide, beforehand, that the first thread will take exactly n/2 elements and the second thread will take the rest, I will have no need for a mutex on the output end of things.
Read-lock contention can be expected to occur, on average, maybe 1 time in 20 in this particular case, so that's acceptable.
Now, here's my question. In the first case, it is obvious that I should call madvise with MADV_SEQUENTIAL, but I have no idea what I should do for the second case, where I'm reading the file backwards.
I see no MADV_REVERSE in the man pages. Should I use MADV_NORMAL or maybe don't call madvise at all?
Recall that an external sort is needed when the volume of data is so large that it will not fit into memory. So we are left with a more complex algorithm to use disk as a temporary store. Divide-and-conquer algorithms will usually involve breaking up the data, doing partial sorts, and then merging the partial sorts.
My steps for an external merge-sort
Take n=1 billion random numbers and break them into 20 shards of equal sizes.
Sort each shard individually from small to large, and write each out into its own file.
Open 40 mmap's, 2 for each file, one for going forward, one for going backwards, associate a mutex with each file.
Instantiate a std::multiset<T,LessThan> buff_fwd; for the forward thread and a std::multiset<T,GreaterThan> buff_rev for the reverse thread. Some people prefer to use priority queues here, but essentially any sort-on-insert container will work here.
I like to call the two buffers surface and rockbottom, representing the smallest and largest numbers not yet added to the final sort.
Add items from the shards until n/2 is used up, and flush the buffers to one output file using mmap, from the beginning towards the middle in one thread and from the end towards the middle in the other thread. You can basically flush at will, but at least one flush should happen before either buffer uses up too much memory.
I would suggest:
MADV_RANDOM
To prevent useless read-ahead (which is in the wrong direction).
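A minimal POSIX sketch of that suggestion (function name is illustrative, error handling omitted): the backward-reading thread maps its shard and hints MADV_RANDOM, while the forward-reading thread would use MADV_SEQUENTIAL on its own mapping:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void* map_for_backward_read(const char* path, std::size_t* out_len)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    *out_len = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, *out_len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                              // the mapping stays valid after close

    madvise(p, *out_len, MADV_RANDOM);      // suppress the kernel's forward read-ahead
    return p;
}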