c++ multithread optimization - c++

In my code I have 2 or 4 threads performing Monte Carlo simulations. Each of them runs a number of experiments, and they all collect the results into an STL vector.
My question is this: suppose each thread runs 1000 experiments sequentially. Is it better to store each result into the shared vector one at a time, or to store results in batches every once in a while? If the threads wait until they have accumulated a fair amount of data, each write into the vector will take longer, so I'm not sure whether the second solution is necessarily better than the first.
PS: each experiment is a purely numerical computation, so there are no I/O operations.
Thanks

If you are going to wait until all the results are computed before you use any of the results, preallocate space for 4,000 results in the vector and have each thread write into one range of elements in the vector. No locking is required because no two threads access the same element in the vector.
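A minimal sketch of the preallocation approach, assuming 4 threads and 1000 experiments per thread as in the question. The names run_experiment and run_all are hypothetical, and run_experiment is only a stand-in for the real simulation:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for one Monte Carlo experiment.
double run_experiment(std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(rng);
}

// Each thread fills its own contiguous slice of a vector that is sized up front.
std::vector<double> run_all(std::size_t num_threads, std::size_t per_thread) {
    assert(num_threads > 0 && per_thread > 0);
    std::vector<double> results(num_threads * per_thread);  // never resized afterwards
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&results, t, per_thread] {
            std::mt19937 rng(static_cast<unsigned>(t) + 1);  // one generator per thread
            for (std::size_t i = 0; i < per_thread; ++i) {
                // Thread t only ever writes indices [t * per_thread, (t + 1) * per_thread).
                results[t * per_thread + i] = run_experiment(rng);
            }
        });
    }
    for (auto& w : workers) w.join();  // after join() the results are safe to read
    return results;
}
```

Calling run_all(4, 1000) returns the 4,000 results. The important detail is that the vector is never resized while the threads are running, since a reallocation would invalidate the elements the other threads are writing to.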
If you want to use the results as they are computed, use some sort of a concurrent queue data structure instead of a vector.

If you're only putting 2000 to 4000 elements in the vector I doubt it would make much of a difference either way.
Do whatever is most natural for the algorithm. If that doesn't work well enough look into doing it the other way.
After thinking about it for a bit, it might serve both purposes (simplicity and speed) to have each thread store results to a local vector then copy the contents of the local vector to the 'global' vector (protected by a lock) when the thread is done. Of course, that's as long as whatever's waiting for the results can wait until a thread is fully finished before getting an update.
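The local-vector idea could look roughly like this. It is only a sketch: run_and_merge is a hypothetical name, and the value pushed into the local vector is a placeholder for a real experiment result:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

std::vector<double> run_and_merge(std::size_t num_threads, std::size_t per_thread) {
    assert(num_threads > 0 && per_thread > 0);
    std::vector<double> global_results;
    global_results.reserve(num_threads * per_thread);
    std::mutex global_mutex;  // protects global_results

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&global_results, &global_mutex, t, per_thread] {
            std::vector<double> local;  // private to this thread, so no locking needed
            local.reserve(per_thread);
            for (std::size_t i = 0; i < per_thread; ++i) {
                local.push_back(static_cast<double>(t * per_thread + i));  // placeholder result
            }
            // One lock acquisition per thread instead of one per experiment.
            std::lock_guard<std::mutex> lock(global_mutex);
            global_results.insert(global_results.end(), local.begin(), local.end());
        });
    }
    for (auto& w : workers) w.join();
    return global_results;
}
```

The order in which the threads' blocks end up in the global vector depends on which thread finishes first, so this only fits if the order of the results does not matter.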

A singly linked list may be a better choice than a vector here.
If there is exactly one thread reading and exactly one thread writing to a FIFO, you don't need any locks. The trick is to always keep at least one 'dummy' element in the list, so that the FIFO is empty when head == tail. Push then only modifies the tail pointer and pop only modifies the head pointer, so the two threads never write to the same pointer. (In C++ the pointers that both threads touch still have to be std::atomic, otherwise the accesses are a data race.)
Using this, you can create several queues, one per reader/writer pair, none of which needs a lock.
If new/delete is taking too much time, you can add queues that hold reusable elements for recycling.
If your queues do run empty, all you need is some sleep()/wakeup() functionality.
Remember: exactly one reader and exactly one writer per queue, no more, no less.
Best of luck.
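Here is one way such a single-reader, single-writer queue could be sketched in C++. The class name SpscQueue and the helper spsc_demo_sum are made up for illustration. Note one deviation from the description above: instead of comparing head with tail, this version treats the queue as empty when the dummy node has no successor, so the only state shared between the two threads is the atomic next pointer of each node. It also assumes T is default-constructible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

template <typename T>
class SpscQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    Node* head_;  // used only by the consumer; always points at the dummy node
    Node* tail_;  // used only by the producer

public:
    SpscQueue() : head_(new Node), tail_(head_) {}
    ~SpscQueue() {
        while (head_ != nullptr) {
            Node* next = head_->next.load(std::memory_order_relaxed);
            delete head_;
            head_ = next;
        }
    }
    SpscQueue(const SpscQueue&) = delete;
    SpscQueue& operator=(const SpscQueue&) = delete;

    // Call from the single producer thread only.
    void push(T value) {
        Node* node = new Node;
        node->value = std::move(value);
        tail_->next.store(node, std::memory_order_release);  // publish the new node
        tail_ = node;
    }

    // Call from the single consumer thread only. Returns false if the queue is empty.
    bool pop(T& out) {
        Node* next = head_->next.load(std::memory_order_acquire);
        if (next == nullptr) return false;
        out = std::move(next->value);
        delete head_;  // free the old dummy
        head_ = next;  // the node just read becomes the new dummy
        return true;
    }
};

// Usage sketch: one producer pushes 1..count, one consumer sums what it pops.
long long spsc_demo_sum(int count) {
    assert(count >= 0);
    SpscQueue<int> q;
    std::thread producer([&q, count] {
        for (int i = 1; i <= count; ++i) q.push(i);
    });
    long long sum = 0;
    int received = 0;
    int v = 0;
    while (received < count) {
        if (q.pop(v)) { sum += v; ++received; }
        else std::this_thread::yield();  // queue ran empty; a real program might sleep here
    }
    producer.join();
    return sum;
}
```

With more than one producer or more than one consumer this sketch is no longer safe, which is exactly the restriction the answer keeps repeating.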

Related

Thread Safe Integer Array?

I have a legacy multi-threaded application that I'm trying to move to a Linux platform and convert to C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one foreground task that mostly reads the values, but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update the value if the value has actually changed. The worker is constantly collecting data and doing calculation and storing the data whether it changes or not.
So should I create a custom class MyInt which has the structure and then include an array of mutexes to lock for updating/reading each value and then overload the [], =, ++, +=, -=, etc? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try and keep the above notation for doing the updates...but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
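In C++ the same thing is usually written with std::lock_guard, so that the mutex is released even if the code in between returns early or throws. A small sketch using the array from the question (R_mutex, update_slot_5 and read_slot are hypothetical names):

```cpp
#include <cassert>
#include <mutex>

int R[5000];         // the shared array from the question
std::mutex R_mutex;  // guards every read and write of R

// The whole read-modify-write sequence happens under one lock.
void update_slot_5() {
    std::lock_guard<std::mutex> lock(R_mutex);
    R[5] = (R[10] + R[20]) / 50;
    R[5]++;
}

// Reads take the same lock so they never observe a half-finished update.
int read_slot(int index) {
    assert(index >= 0 && index < 5000);
    std::lock_guard<std::mutex> lock(R_mutex);
    return R[index];
}
```

This does not keep the plain R[5]++ notation at the call sites, but it keeps the locking in one place.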
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
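The multiple-mutex idea might be sketched like this, with one mutex per block of 100 elements. StripedIntArray and the constants are hypothetical names chosen for the example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

constexpr std::size_t kArraySize = 5000;
constexpr std::size_t kStripeSize = 100;  // elements covered by one mutex
constexpr std::size_t kNumStripes = kArraySize / kStripeSize;

class StripedIntArray {
    std::array<int, kArraySize> data_{};
    std::array<std::mutex, kNumStripes> locks_;

    // Map an element index to the mutex that guards its block.
    std::mutex& lock_for(std::size_t i) {
        assert(i < kArraySize);
        return locks_[i / kStripeSize];
    }

public:
    int get(std::size_t i) {
        std::lock_guard<std::mutex> guard(lock_for(i));
        return data_[i];
    }
    void set(std::size_t i, int value) {
        std::lock_guard<std::mutex> guard(lock_for(i));
        data_[i] = value;
    }
    // Read-modify-write on a single element under that element's lock.
    int increment(std::size_t i) {
        std::lock_guard<std::mutex> guard(lock_for(i));
        return ++data_[i];
    }
};
```

Each call here is only atomic for a single element. An update such as R[5] = (R[10] + R[20]) / 50 that has to be consistent across several elements would need to hold all the mutexes involved at once, which this sketch does not attempt.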
If things still aren't fast enough for you, you could then look into having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.

Multithreaded array of arrays?

I have a data structure which consists of 1,000 array elements, where each array element is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"
You are correct that your current algorithm is not thread safe; there are a number of places where race conditions could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach is then to just lock the entire structure for the full process. This only pays off if the threads are also doing a lot of other processing outside the lock; if not, you will actually lose performance compared to single threading.
After that you can consider having a separate lock for different sections of the data structure. You will need to properly analyse what you are using when and where and work out what would be useful to split. For example you might have chunks of the sub arrays with each chunk having its own lock.
Be careful of deadlocks in this situation though: one thread might claim lock 32 and then want 79, while another thread already holds 79 and then wants 32. Make sure you always claim locks in the same order.
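One way to enforce a fixed lock order is to always lock the lower-numbered chunk first, whichever direction the data is moving. The following is only a sketch with made-up names (chunk_locks, chunk_totals, transfer), and it uses a simple counter per chunk instead of the real sub-arrays:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>

constexpr std::size_t kChunks = 100;
std::array<std::mutex, kChunks> chunk_locks;  // one lock per chunk
std::array<int, kChunks> chunk_totals{};      // stand-in for the per-chunk data

// Move `amount` from one chunk to another while holding both locks.
void transfer(std::size_t from, std::size_t to, int amount) {
    assert(from < kChunks && to < kChunks);
    if (from == to) return;  // nothing to do, and locking the same mutex twice would deadlock
    // Always lock the lower index first, so that two threads doing
    // transfer(32, 79, ...) and transfer(79, 32, ...) cannot deadlock.
    const std::size_t first = from < to ? from : to;
    const std::size_t second = from < to ? to : from;
    std::lock_guard<std::mutex> lock_first(chunk_locks[first]);
    std::lock_guard<std::mutex> lock_second(chunk_locks[second]);
    chunk_totals[from] -= amount;
    chunk_totals[to] += amount;
}
```

If C++17 is available, std::scoped_lock can lock several mutexes at once with built-in deadlock avoidance, which removes the need to sort the indices by hand.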
The fastest option (if possible) may even be to give each thread its own copy of the data structure: each thread processes 1/N of the work, and the results are merged at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.

iterate over a vector using multiple threads (no data sharing, or vector modification)

I have a large vector of objects and I just need to iterate over the vector using multiple threads and read the objects (no modification to the data or the vector). What is the most efficient method to do this? Could it be done in a lock-free fashion, maybe using an atomic variable? What is the easiest-to-read implementation of such a multithreaded process?
Edit:
I do not want more than one thread to read the same element of the vector (reading is time consuming in this case). When one thread is reading an element, I want the next thread to read the first not-yet-read element. For example, when thread 1 is reading object 1, I want thread 2 to read object 2. Whenever one of them is done, it can read object 3, and so on.
Splitting the input in equal parts is really easy; it doesn't use locks and doesn't cause memory sharing. So try that first, measure how much time each thread needs to complete, and check whether the difference between threads is significant.
If the difference is significant, consider using an array of one atomic<bool> per element. Before reading an element, the thread does a compare_exchange_strong on the flag related to that element, and only actually processes the element if the exchange succeeds. Otherwise it tries the next element, because another thread is processing or has already processed the current one. (I think you can even use memory_order_relaxed, but use memory_order_acq_rel at first and only try relaxed if the performance doesn't satisfy you.)
If you can't do that, you can use a single atomic<int> to store the index of the next element to be processed. The threads just use fetch_add or the postfix ++ to atomically get the next element to process and increment the counter (the considerations for the memory ordering are the same as above). If the variance in reading times is high (as determined by the measurement in the first step), you will have low contention on the atomic variable, so it will perform well.
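A sketch of the shared-counter variant, under the assumption that the per-element work can be written as a function of the element alone. The name process_all is hypothetical and squaring the value is just a placeholder for the expensive read. The fetch_add uses the default (sequentially consistent) ordering; whether a weaker ordering is enough is the tuning question mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<long long> process_all(const std::vector<int>& items, std::size_t num_threads) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next_index{0};  // index of the next unclaimed element
    std::vector<long long> out(items.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&items, &out, &next_index] {
            for (;;) {
                // Atomically claim one index; no two threads ever get the same value.
                const std::size_t i = next_index.fetch_add(1);
                if (i >= items.size()) break;
                // Placeholder for the time-consuming read of items[i].
                out[i] = static_cast<long long>(items[i]) * items[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

A thread that finishes an element early simply claims the next index, so slow elements do not hold up the other threads the way a fixed equal split can.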
If the contention is still too high, and you get a significant slowdown, try to estimate in advance how much time it will take to read an element. If you can, then sort your vector by estimated read time, and make the n-th thread read every n-th element, so that the load will be split more evenly.

How to design a datastructure that spits out one available space for each thread in CUDA

In my project with CUDA I need a data structure (available to all threads in the block) that is similar to a "stash". In this stash there are multiple spaces, each of which can be either empty or full. I need this data structure to hand out an empty space whenever a thread asks for one. The thread will ask for a space in the stash, put something in it, and mark this position as full. I cannot use a FIFO because fetching from the stash is random. Any position (and multiple positions) could be marked as empty or full.
In my initial version I use an array to represent whether each space is empty or not. Each thread loops through the positions (using atomicCAS) until it finds an empty spot. But with this algorithm the search time depends on how full the stash is, which is not acceptable in my design.
How could I design a data structure whose fetch time and write-back time do not depend on how full the stash is?
Does this remind anyone of a similar algorithm?
Thanks
You could implement this with a FIFO containing a list of free locations.
At startup you fill the FIFO with all locations.
Then whenever you want a space, you take the next element from the FIFO.
When you are finished with the slot, you can place the address back into the FIFO again.
This should have O(1) allocation and deallocation time.
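To illustrate the free-list idea, here is a host-side C++ sketch, not CUDA device code. SlotAllocator is a hypothetical name, and the std::mutex used here is not available inside a kernel, so a device version would need a different synchronization mechanism (for example atomics), which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>

class SlotAllocator {
    std::queue<std::size_t> free_slots_;  // FIFO of currently empty positions
    std::size_t capacity_;
    std::mutex mutex_;

public:
    // At startup every position is free.
    explicit SlotAllocator(std::size_t capacity) : capacity_(capacity) {
        for (std::size_t i = 0; i < capacity; ++i) free_slots_.push(i);
    }

    // Take the next free position; returns false if the stash is full.
    bool acquire(std::size_t& slot) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_slots_.empty()) return false;
        slot = free_slots_.front();
        free_slots_.pop();
        return true;
    }

    // Hand a position back once it is empty again.
    void release(std::size_t slot) {
        assert(slot < capacity_);
        std::lock_guard<std::mutex> lock(mutex_);
        free_slots_.push(slot);
    }
};
```

Both acquire and release do a constant amount of work regardless of how many positions are in use, which is the property the question asks for. Positions can be released in any order, so the random access pattern of the stash is not a problem.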
You could implement a hash table (separate chaining) with the thread ID as the key.
It is more or less similar to an array of linked lists. This way you need not put a lock on the entire array as you did earlier. Instead, you use atomicCAS only while reading a linked list from a specific index. Thereby, you can have n threads running in parallel, where n is the array size.
Note: The distribution of threads however depends on the hash function.

Multithreading - In an array what should I protect?

I'm working on some code that has a global array that can be accessed by two threads for reading and writing.
There will be no batch processing where a range of indexes are read or written, so I'm trying to figure out if I should lock the entire array or only the array index I am currently using.
The easiest solution would be to consider the array a CS and put a big fat lock around it, but can I avoid this and just lock an index?
Cheers.
Locking one index implies that you can keep track of which thread is accessing what part of the array. Keeping track of this information, which is shared between the reading and the writing thread, implies that you have one lock around this information. So, you still end up with a global lock.
In this situation, I think that the most efficient approaches are:
- using a reader/writer lock
- or dividing the big array into a few subsets, each subset using a distinct lock.
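The reader/writer lock option could be sketched with std::shared_mutex, which requires C++17. SharedArray and its fixed size of 1024 are assumptions made for the example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>

class SharedArray {
    std::array<int, 1024> data_{};
    mutable std::shared_mutex mutex_;

public:
    // Any number of readers may hold the shared lock at the same time.
    int read(std::size_t i) const {
        assert(i < data_.size());
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return data_[i];
    }

    // A writer takes the lock exclusively, blocking readers and other writers.
    void write(std::size_t i, int value) {
        assert(i < data_.size());
        std::unique_lock<std::shared_mutex> lock(mutex_);
        data_[i] = value;
    }
};
```

Whether this beats a plain mutex depends on the ratio of reads to writes, so it is worth measuring both.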
If this is C++, I suggest you use STL containers: std::vector or something else which suits your job. They are fast, easy to use, and don't leak memory.
If you want to do it all by yourself, then of course one method is to use a single mutex (which is bad),
or you can use some reader/writer lock for the whole array.
I think it's not feasible to make each element of an array thread safe with its own lock; that would eat your memory. Check the link and there are 3 solutions with different outcomes. Test them out and use the best one for your case. (Don't think like "OK, I think my program needs the readers-preference algorithm". Try using it in your system and then decide, because we really can't assume such things sometimes.)
There is no way of knowing what will be optimal unless you profile under realistic running conditions. I would suggest implementing an array-like class, where you can lock a varying number of elements in groups. Then you fine-tune the size of these groups.
Another option would be to enqueue all read/write operations using an active object. This would make all access sequential, and means you could use a non-concurrent array type to store the data. It would require some sort of concurrent queue data structure under the hood.