I wrote a container class in C++ that is meant to be thread-safe: I lock a mutex while accessing a member and release it when finished.
Now I'm trying to write a test case to check whether it really is thread-safe.
Let's say I have a Container container and two threads, Thread1 and Thread2.
Container container;
Thread1()
{
//Add N items to the container
}
Thread2()
{
//Add N items to the container
}
Written this way, it runs with no problem for N=1000.
But I'm not sure whether this regression test is enough. Is there a deterministic way to test a class like this?
Thanks.
There is no real way to write a test that proves it's safe.
You can only design it so that it is safe and then test that your design is implemented correctly. The best you can do is stress-test it.
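For example, a stress test along these lines at least checks that no insertions are lost. This is a minimal sketch; the Container here is a stand-in with the same shape assumed of the asker's class (an add() guarded by a mutex, and a size()), so swap in the real class under test:

#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

class Container {                       // stand-in for the class under test
    std::mutex mtx_;
    std::vector<int> items_;
public:
    void add(int v) { std::lock_guard<std::mutex> l(mtx_); items_.push_back(v); }
    std::size_t size() { std::lock_guard<std::mutex> l(mtx_); return items_.size(); }
};

int main()
{
    constexpr int kThreads = 8;
    constexpr int kItemsPerThread = 100000;

    Container container;
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([&container] {
            for (int i = 0; i < kItemsPerThread; ++i)
                container.add(i);       // hammer the container concurrently
        });
    for (auto& th : threads)
        th.join();

    // If no insertions were lost to a race, every add() is accounted for.
    assert(container.size() ==
           static_cast<std::size_t>(kThreads) * kItemsPerThread);
}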
I guess that you wrote a generic container and that you want to verify that two different threads cannot insert items at the same time.
If my assumptions are correct, then my proposal would be to write a custom element class in which you overload the copy constructor, inserting a sleep whose duration can be parametrized.
To test your container, create an instance of it for your custom class. Then, in the first thread, insert an instance of the custom class with a long sleep; meanwhile, start the second thread inserting an instance of the custom class with a short sleep. If the second insertion comes back before the first one, you know that the test failed.
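A minimal sketch of such a slow-copy element (the name SlowCopy and the Container::add call in the comments are illustrative placeholders for your own types):

#include <chrono>
#include <thread>

struct SlowCopy {
    std::chrono::milliseconds copy_delay{0};

    SlowCopy() = default;
    explicit SlowCopy(std::chrono::milliseconds d) : copy_delay(d) {}

    SlowCopy(const SlowCopy& other) : copy_delay(other.copy_delay) {
        // Sleeping inside the copy constructor widens the window in which
        // a second thread could sneak into the container's critical section.
        std::this_thread::sleep_for(copy_delay);
    }
};

// Thread 1: container.add(SlowCopy{std::chrono::milliseconds(500)});
// Thread 2: container.add(SlowCopy{std::chrono::milliseconds(10)});
// If thread 2's add() returns while thread 1's is still in progress,
// the container admitted two writers at once and the test failed.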
That's a reasonable starting point, though I'd make a few suggestions:
Run the test on a quad-core machine to improve the odds of real resource contention.
Instead of having a fixed number of threads, I'd suggest spawning a random number of threads with a lower bound equal to the number of processors on the test machine and an upper bound that's four times that number.
Consider doing occasional runs with a substantially larger number of items (say 100,000).
Run your tests on optimized, release (non-debug) builds.
If you're targeting Windows, you may want to consider using critical sections rather than mutexes, as they're generally cheaper to acquire; a minimal sketch follows this list.
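A minimal sketch of a Windows CRITICAL_SECTION protecting shared state (here just a counter for brevity); this assumes a Windows build, and the function names are illustrative:

#include <windows.h>

CRITICAL_SECTION g_cs;
int g_shared_count = 0;

void init_lock()    { InitializeCriticalSection(&g_cs); }
void destroy_lock() { DeleteCriticalSection(&g_cs); }

void add_item()
{
    EnterCriticalSection(&g_cs);   // user-mode fast path when uncontended
    ++g_shared_count;              // stands in for the container insertion
    LeaveCriticalSection(&g_cs);
}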
Proving that it's safe is not possible, but to improve stress-testing's chances of finding bugs, you can modify the container's add method so it looks like this:
// Assuming these flag accesses are themselves thread safe
if ( in_use_flag == true ) {
    abort(); // error! another thread is already inside add()
}
in_use_flag = true;
// ... original add method code ...
sleep( long_time );
in_use_flag = false;
This way you can almost make sure that the two threads will try to access the container at the same time, and also detect such occurrences - thus verifying that the thread safety actually works.
P.S. I would also remove the mutex protection just to see it fail once.
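A minimal sketch of the same idea with a genuinely race-free flag, using std::atomic. The check lives inside add(), after the container's own mutex is taken, so it only fires if that mutex fails to keep a second writer out; the member names are assumptions about the asker's class:

#include <atomic>
#include <chrono>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

class Container {
    std::mutex mtx_;
    std::atomic<bool> in_use_flag_{false};
    std::vector<int> items_;
public:
    void add(int v)
    {
        std::lock_guard<std::mutex> lock(mtx_);  // remove to watch the test fail
        // exchange() atomically sets the flag and returns the old value:
        // true means another thread is already past the lock - a bug.
        if (in_use_flag_.exchange(true))
            std::abort();
        items_.push_back(v);                     // original add code
        std::this_thread::sleep_for(
            std::chrono::milliseconds(10));      // widen the race window
        in_use_flag_.store(false);
    }
};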
Related
I have a situation where I have a legacy multi-threaded application that I'm trying to move to a Linux platform and convert into C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one foreground task that mostly reads the values but can on occasion update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update a value if it has actually changed. The worker is constantly collecting data, doing calculations, and storing the data whether it changes or not.
So should I create a custom class MyInt which has the structure, include an array of mutexes to lock for updating/reading each value, and then overload [], =, ++, +=, -=, etc.? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to keep the above notation for doing the updates, but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
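A minimal sketch of that striping idea, assuming the original R[5000] array and an illustrative stripe size of 100 elements per mutex:

#include <array>
#include <mutex>

constexpr int kSize   = 5000;
constexpr int kStripe = 100;    // items per mutex - illustrative
int R[kSize];
std::array<std::mutex, kSize / kStripe> stripe_mutexes;

int read_value(int i)
{
    std::lock_guard<std::mutex> lock(stripe_mutexes[i / kStripe]);
    return R[i];
}

void write_value(int i, int v)
{
    std::lock_guard<std::mutex> lock(stripe_mutexes[i / kStripe]);
    R[i] = v;
}

// Note: an expression like R[5] = (R[10] + R[20]) / 50 touches several
// stripes; for a consistent snapshot you would need to lock all of them,
// always in a fixed order to avoid deadlock.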
If things still aren't fast enough for you, you could then look in to having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.
If I have 8 threads and an array of 1,000,000,000 elements, I could have 1,000,000,000 mutexes, where each index represents the element of the array being locked and written to. However this seems fairly wasteful to me and requires a lot of memory.
Is there a way that I can use only 8 mutexes and have the same functionality?
Thinking out loud here, and not really sure how efficient this would be, but:
You could create a method of locking certain indexes:
#include <algorithm>
#include <mutex>
#include <vector>

std::vector<int> mutexed_slots;  // indexes currently locked
std::mutex mtx;                  // guards mutexed_slots itself

bool lock_element(int index)
{
    std::lock_guard<std::mutex> lock(mtx);
    // Check if the index is already in the locked list
    if (std::find(mutexed_slots.begin(), mutexed_slots.end(), index) == mutexed_slots.end())
    {
        // It's not there, so add it - now that array value is safe from other threads
        mutexed_slots.emplace_back(index);
        return true;
    }
    return false;  // another thread holds this index
}

void unlock_element(int index)
{
    std::lock_guard<std::mutex> lock(mtx);
    // No need to check for absence, because you will only unlock an index
    // that you previously locked (unless you are a very naughty boy indeed)
    mutexed_slots.erase(std::find(mutexed_slots.begin(), mutexed_slots.end(), index));
}
Note: This is the start of an idea, so don't knock it too hard just yet! It's also untested pseudo-code. It's not really intended as a final answer, but as a starting point. Please add comments to improve it or to suggest whether it is or isn't plausible.
Further points:
There may be a more efficient STL container to use
You could probably wrap all of this up in a class, along with your data
You would need to loop on lock_element() until it returns true - again, not pretty at the moment. This mechanism could be improved.
Each thread needs to remember which index it is currently working on so that it only unlocks that particular one - again, this could be integrated within a class to enforce that behaviour.
But as a concept, is it workable? I would think that if you need really fast access (which maybe you do), this might not be that efficient. Thoughts?
Update
This could be made much more efficient if each thread/worker "registers" its own entry in mutexed_slots. Then there would be no push_back/erase on the vector (except at the start/end): each thread just sets the index that it has locked, and if it has nothing locked, its entry is set to -1 (or similar). I am thinking there are many more such efficiency improvements to be made. Again, a complete class to do all this for you would be the way to implement it.
Testing / Results
I implemented a tester for this, just because I quite enjoy that sort of thing. My implementation is here.
I think it's a public GitHub repo, so you are welcome to take a look. I posted the results in the top-level readme (scroll a little to see them). I implemented some improvements such that:
There are no inserts/removals to the protection array at run-time
There is no need for a lock_guard to do the "unlock", because I am relying on a std::atomic index.
Below is a printout of my summary:
Summary:
When the workload is 1ms (the time taken to perform each action) then the amount of work done was:
9808 for protected
8117 for normal
Note these values varied; sometimes the normal was higher. There appeared to be no clear winner.
When the workload is 0ms (basically increment a few counters) then the amount of work done was:
9791264 for protected
29307829 for normal
So here you can see that the mutexed protection slows the work down to roughly a third (1/3) of the unprotected rate. This ratio was consistent between tests.
I also ran the same tests for 1 worker, and roughly the same ratios held. However, when I made the array smaller (~1000 elements), the amount of work done was still roughly the same when the workload was 1ms. But when the workload was very light I got results like:
5621311
39157931
Which is about 7 times slower.
Conclusion
The larger the array, the fewer collisions occur and the better the performance.
The longer the workload is (per item), the less noticeable the difference made by the protection mechanism.
It appears that the locking generally only adds an overhead 2-3 times slower than incrementing a few counters. This is probably skewed by actual collisions, because (from the results) the longest lock time recorded was a huge 40ms - but that was when the work time was very fast, so many collisions occurred (~8 successful locks per collision).
It depends on the access pattern: do you have a way to partition the work effectively? Basically, you can partition the array into 8 chunks (or as many as you can afford) and cover each part with a mutex, but if the access pattern is random you're still going to have a lot of collisions.
Do you have TSX support on your system? It would be a classic use case: just have one global lock, and have the threads ignore it unless there's an actual collision.
You can write a class that creates locks on the fly when a particular index requires one; std::optional is helpful for this (C++17 code ahead):
class IndexLocker {
public:
explicit IndexLocker(size_t size) : index_locks_(size) {}
std::mutex& get_lock(size_t i) {
if (std::lock_guard guard(instance_lock_); index_locks_[i] == std::nullopt) {
index_locks_[i].emplace();
}
return *index_locks_[i];
}
private:
std::vector<std::optional<std::mutex>> index_locks_;
std::mutex instance_lock_;
};
You could also use std::unique_ptr to minimize the in-place storage of unused slots while maintaining identical semantics:
class IndexLocker {
public:
explicit IndexLocker(size_t size) : index_locks_(size) {}
std::mutex& get_lock(size_t i) {
if (std::lock_guard guard(instance_lock_); index_locks_[i] == nullptr) {
index_locks_[i] = std::make_unique<std::mutex>();
}
return *index_locks_[i];
}
private:
std::vector<std::unique_ptr<std::mutex>> index_locks_;
std::mutex instance_lock_;
};
Using this class doesn't necessarily mean you need to create all 1,000,000,000 mutexes. You can use modulo operations to treat the locker as a "hash table" of mutexes:
constexpr size_t kLockLimit = 8;
IndexLocker index_locker(kLockLimit);
auto thread_code = [&](size_t i) {
std::lock_guard guard(index_locker.get_lock(i % kLockLimit));
// Do work with lock.
};
Worth mentioning that the "hash table" approach makes it very easy to deadlock: with kLockLimit == 8, get_lock(0) followed by get_lock(16) tries to acquire the same underlying mutex twice, for example. If each thread does work on exactly one element at a time, however, this shouldn't be an issue.
There are other trade-offs with fine-grained locking. Atomic operations are expensive, so a parallel algorithm that locks every element can take longer than the sequential version.
How to lock efficiently depends. Are the array elements dependent on other elements in the array? Are you mostly reading? Mostly writing? The asker clarified in a comment:
"I don't want to split the array into 8 parts because that will cause a high likelihood of waiting (access is random). The elements of the array are a class that I will write that will hold multiple Golomb-coded values."
I don't think having 8 mutexes is the way to go here. If a given lock protects an array section, you can't switch it to protect a different section in the midst of parallel execution without introducing a race condition (rendering the mutex pointless).
Are the array items small? If you can get them down to 8 bytes, you can declare your class with alignas(8) and instantiate std::atomic<YourClass> objects. (The size limit depends on the architecture; verify that is_lock_free() returns true.) That could open up the possibility of lock-free algorithms. It almost seems like a variant of hazard pointers would be useful here. That's complex, though, so it's probably better to look into other approaches to parallelism if time is limited.
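A minimal sketch of that atomic-element idea: a small, trivially copyable class packed into 8 bytes so std::atomic can (on most 64-bit targets) handle it lock-free. The field layout is an illustrative assumption:

#include <atomic>
#include <cassert>
#include <cstdint>

struct alignas(8) Item {
    std::uint32_t value;   // e.g. the decoded value
    std::uint32_t aux;     // e.g. coding state packed alongside
};

int main()
{
    std::atomic<Item> slot{Item{0, 0}};
    assert(slot.is_lock_free());   // verify before relying on this

    // Whole-struct loads and stores are atomic: a reader never sees a
    // torn mix of old and new fields.
    slot.store(Item{42, 7});
    Item seen = slot.load();
    (void)seen;
}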
I'm writing some settings classes that can be accessed from everywhere in my multithreaded application. I will read these settings very often (so read access should be fast), but they are not written very often.
For primitive datatypes it looks like boost::atomic offers what I need, so I came up with something like this:
class UInt16Setting
{
private:
boost::atomic<uint16_t> _Value;
public:
uint16_t getValue() const { return _Value.load(boost::memory_order_relaxed); }
void setValue(uint16_t value) { _Value.store(value, boost::memory_order_relaxed); }
};
Question 1: I'm not sure about the memory ordering. I think in my application I don't really care about memory ordering (do I?). I just want to make sure that getValue() always returns a non-corrupted value (either the old one or the new one). So are my memory-ordering settings correct?
Question 2: Is this approach using boost::atomic recommended for this kind of synchronization? Or are there other constructs that offer better read performance?
I will also need some more complex setting types in my application, like std::string or for example a list of boost::asio::ip::tcp::endpoints. I consider all these setting values as immutable. So once I set the value using setValue(), the value itself (the std::string or the list of endpoints itself) does not change anymore. So again I just want to make sure that I get either the old value or the new value, but not some corrupted state.
Question 3: Does this approach work with boost::atomic<std::string>? If not, what are alternatives?
Question 4: How about more complex setting types like the list of endpoints? Would you recommend something like boost::atomic<boost::shared_ptr<std::vector<boost::asio::ip::tcp::endpoint>>>? If not, what would be better?
Q1: Correct, provided you don't try to read any shared non-atomic variables after reading the atomic. Memory barriers only synchronize access to non-atomic variables that may happen between atomic operations.
Q2: I don't know (but see below).
Q3: Should work (if it compiles). However, atomic<string> probably isn't lock-free.
Q4: Should work, but again the implementation probably isn't lock-free (implementing a lock-free shared_ptr is a challenging and patent-mined field).
So a readers-writer lock (as Damon suggests in the comments) may be simpler and even more effective if your config includes data larger than one machine word (for which the CPU's native atomics usually work).
[EDIT] However, atomic<shared_ptr<TheWholeStructContainingAll>> may make some sense even without being lock-free: this approach minimizes the collision probability for readers that need more than one coherent value, though the writer has to make a new copy of the whole "parameter sheet" every time it changes something.
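A minimal sketch of that copy-on-write settings holder, using the C++11 std::atomic_load/std::atomic_store overloads for shared_ptr (they may lock internally, but readers stay simple); the Settings fields are illustrative assumptions:

#include <atomic>
#include <memory>
#include <string>
#include <vector>

struct Settings {
    std::string server_name;
    std::vector<int> ports;
};

std::shared_ptr<const Settings> g_settings =
    std::make_shared<const Settings>();

// Readers: grab a snapshot; every field in it is mutually consistent.
std::shared_ptr<const Settings> load_settings()
{
    return std::atomic_load(&g_settings);
}

// Single writer: copy the whole sheet, modify the copy, publish atomically.
void update_server_name(const std::string& name)
{
    auto copy = std::make_shared<Settings>(*std::atomic_load(&g_settings));
    copy->server_name = name;
    std::atomic_store(&g_settings,
                      std::shared_ptr<const Settings>(std::move(copy)));
}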
For question 1, the answer is "it depends, but probably not". If you really only care that a single value isn't garbled, then yes, this is fine, and you don't care about memory order either.
Usually, though, this is a false premise.
For questions 2, 3, and 4: yes, this will work, but it will likely use locking internally for complex objects such as string (for every access, without you knowing). Only rather small objects, roughly the size of one or two pointers, can normally be accessed or changed atomically in a lock-free manner. This depends on your platform, too.
It's a big difference whether one successfully updates one or two values atomically. Say you have the values left and right which delimit the left and right boundaries of where a task will do some processing in an array. Assume they are 50 and 100, respectively, and you change them to 101 and 150, each atomically. So the other thread picks up the change from 50 to 101 and starts doing its calculation, sees that 101 > 100, finishes, and writes the result to a file. After that, you change the output file's name, again, atomically.
Everything was atomic (and thus, more expensive than normal), but none of it was useful. The result is still wrong, and was written to the wrong file, too.
This may not be a problem in your particular case, but usually it is (and, your requirements may change in the future). Usually you really want the complete set of changes being atomic.
That said, if you have either many or complex (or, both many and complex) updates like this to do, you might want to use one big (reader-writer) lock for the whole config in the first place anyway, since that is more efficient than acquiring and releasing 20 or 30 locks or doing 50 or 100 atomic operations. Do however note that in any case, locking will severely impact performance.
As pointed out in the comments above, I would preferably make a deep copy of the configuration in the one thread that modifies it, and schedule updates of the reference (shared pointer) used by consumers as normal tasks. This copy-modify-publish approach is a bit similar to how MVCC databases work (these, too, have the problem that locking kills their performance).
Modifying a copy ensures that only readers are accessing any shared state, so no synchronization is necessary, either for the readers or for the single writer. Reading and writing are fast.
Swapping in the new configuration happens only at well-defined points in time, when the set is guaranteed to be in a complete, consistent state and threads are guaranteed not to be doing something else, so no ugly surprises of any kind can happen.
A typical task-driven application would look somewhat like this (in C++-like pseudocode):
// consumer/worker thread(s)
for(;;)
{
task = queue.pop();
switch(task.code)
{
case EXIT:
return;
case SET_CONFIG:
my_conf = task.data;
break;
default:
task.func(task.data, &my_conf); // can read without sync
}
}
// thread that interacts with user (also producer)
for(;;)
{
input = get_input();
if(input.action == QUIT)
{
queue.push(task(EXIT, 0, 0));
for(auto& t : threads)
    t.join();
return 0;
}
else if(input.action == CHANGE_SETTINGS)
{
new_config = new config(config); // copy, readonly operation, no sync
// assume we have operator[] overloaded
new_config[...] = ...; // I own this exclusively, no sync
task t(SET_CONFIG, 0, shared_ptr<config>(new_config));
queue.push(t);
}
else if(input.action == ADD_TASK)
{
task t(RUN, input.func, input.data);
queue.push(t);
}
...
}
For anything more substantial than a pointer, use a mutex. The TBB (open-source) library supports the concept of reader-writer mutexes, which allow multiple simultaneous readers; see the documentation.
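A minimal reader-writer lock sketch using C++17's std::shared_mutex (TBB's and Boost's reader-writer mutexes follow the same idea); the SettingsStore name and field are illustrative:

#include <mutex>
#include <shared_mutex>
#include <string>

class SettingsStore {
    mutable std::shared_mutex rw_;
    std::string server_name_;
public:
    std::string server_name() const
    {
        std::shared_lock<std::shared_mutex> lock(rw_);  // many readers at once
        return server_name_;
    }
    void set_server_name(const std::string& n)
    {
        std::unique_lock<std::shared_mutex> lock(rw_);  // exclusive writer
        server_name_ = n;
    }
};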
I have a for loop that I would like to make parallel; however, the threads must share an unordered_map and a vector.
Because the for loop is somewhat big, I will post a concise overview of it here so that I can make my main problem clear. Please read the comments.
unordered_map<string, vector<int>> sharedUM;
/*
here I call a function that updates the unordered_map with some
initial data, however the unordered_map will need to be updated by
the threads inside the for loop
*/
vector<int> sharedVector;
/*
the shared vector initially is empty, the threads will
fill it with integers, the order of these integers should be in ascending
order, however I can simply sort the array after all the
threads finish executing so I guess we can assume that the order
does not matter
*/
#pragma omp parallel for
for(int i=0; i<N; i++){
key = generate_a_key_value_according_to_an_algorithm();
std::unordered_map<string, vector<int>>::iterator it = sharedUM.find(key);
/*
according to the data inside it->second(the value),
the thread makes some conclusions which then
uses in order to figure out whether
it should run a high complexity algorithm
or not.
*/
bool conclusion = make_conclusion();
if(conclusion == true){
results = run_expensive_algorithm();
/*
According to the results,
the thread updates some values of
the key that it previously searched for inside the unordered_map
this update may help other threads avoid running
the expensive algorithm
*/
}
sharedVector.push_back(i);
}
Initially I left the code as it was, so I just used that #pragma over the for loop; however, I got a few problems regarding the update of sharedVector. So I decided to use simple locks to force a thread to acquire the lock before writing to the vector. In my implementation I had something like this:
omp_lock_t sharedVectorLock;
omp_init_lock(&sharedVectorLock);
...
for(...)
...
omp_set_lock(&sharedVectorLock);
sharedVector.push_back(i);
omp_unset_lock(&sharedVectorLock);
...
omp_destroy_lock(&sharedVectorLock);
I had run my application many times and everything seemed to be working great - until I reran it automatically enough times that I eventually got wrong results. Because I'm very new to the world of OpenMP and threads in general, I wasn't aware that we should block all readers while a writer is updating shared data. As you can see, in my application the threads always read some data from the unordered_map in order to draw conclusions about the key that was assigned to them. But what happens if two threads have to work with the same key, and while one thread is trying to read the values of that key, another one has reached the point of updating those values? I believe that's where my problem occurs.
However, my main problem right now is that I'm not sure what the best way to prevent this would be. My system works 99% of the time, but that 1% ruins everything; two threads are rarely assigned the same key, because my unordered_map is usually big.
Would locking the whole unordered_map do the job? Most likely, but that wouldn't be efficient, because a thread A that wants to work with key x would have to wait for a thread B that is already working with some key y (which can be different from x) to finish.
So my main question is: how should I approach this problem? How can I lock the unordered_map if and only if two threads are working with the same key?
Thank you in advance
1. On using locks and mutexes: you must declare and initialise the lock variables outside of the parallel block (before #pragma omp parallel) and then use them inside the parallel block: (1) acquire the lock (this may block if another thread has locked it), (2) change the variable with the race condition, (3) release the lock. Finally, destroy the lock after exiting the parallel block. A lock declared inside the parallel block is local to the thread and hence cannot provide synchronisation.
This may explain your problems.
2. On writing into complicated C++ containers: OpenMP was originally designed for simple FORTRAN do loops (similar to C/C++ for loops with integer control variables). Everything more complicated will give you a headache. To be on the safe side, any non-const operation on a C++ container must be performed within a lock (use the same lock for every such operation on the same container) or an omp critical region (use the same name for every such operation on the same container). This includes pop(), push(), etc. - anything but simple reads. It can only remain efficient if such non-const container operations take a tiny fraction of the total time.
3. If I were you, I wouldn't bother with OpenMP (I have used it, but am regretting that now). With C++ you could use TBB, which also comes with some thread-safe but lock-free containers. It also allows you to think in terms of tasks, not threads, which are executed recursively (a parent task spawns child tasks, etc.), and TBB has some simple implementations of parallel for loops, for instance.
An alternative approach would be to use TBB's concurrent_unordered_map.
You don't have to use the rest of TBB's parallelism support (though if you're starting from scratch in C++, it's certainly more "C++-ish" than OpenMP).
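For illustration, a minimal sketch against the question's names. Note that concurrent_unordered_map only makes the map operations themselves (insert, find, traversal) safe to call concurrently; mutating the mapped vector<int> in place still needs its own per-key synchronization:

#include <string>
#include <vector>
#include <tbb/concurrent_unordered_map.h>

tbb::concurrent_unordered_map<std::string, std::vector<int>> sharedUM;

void worker(const std::string& key)
{
    auto it = sharedUM.find(key);    // safe alongside concurrent inserts
    if (it == sharedUM.end()) {
        // Concurrent insert is safe; if two threads race, one insert wins
        // and both get an iterator to the surviving entry.
        it = sharedUM.insert({key, std::vector<int>{}}).first;
    }
    // it->second still needs a per-key lock before being modified in place.
}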
Maybe this could help:
std::vector<char> sv(N, 0); // not vector<bool>: it packs bits, so concurrent writes to adjacent elements would race
replace
sharedVector.push_back(i);
by
sv[i] = 1;
This avoids locks (which are very time-consuming), and sharedVector can easily be rebuilt in sorted order afterwards, e.g.
for(int i=0; i<N; i++){
    if(sv[i]) sharedVector.push_back(i);
}
I have a queue with elements that need to be processed. I want to process these elements in parallel. There will be some sections of each element's processing that need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q
process_element(e)
{
lock()
some synchronized area
// a matrix access performed here so a spin lock would do
unlock()
...
unsynchronized area
...
if( condition )
{
new_element = generate_new_element()
q.push(new_element) // synchonized access to queue
}
}
process_queue()
{
while( elements in q ) // algorithm is finished condition
{
e = get_elem_from_queue(q) // synchronized access to queue
process_element(e)
}
}
I can use
pthreads
openmp
intel thread building blocks
The top problems I have:
Making sure that at any point in time there are at most num_threads running threads
Finding lightweight synchronization methods to use on the queue
My plan is to use the Intel TBB concurrent_queue for the queue container. But then, will I be able to use pthreads functions (mutexes, conditions)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads at any point in time? I was thinking of creating the threads once and then, after one element is processed, accessing the queue to get the next element. However it is more complicated than that, because I have no guarantee that the algorithm is finished just because there is no element in the queue.
My question
Before I start implementing, I'd like to know if there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want - more precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability, which is hard to walk away from. The following appear to be true from your question - let us know if they aren't, because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to the queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements as well
Termination is tricky - one way to do this is to add a TERMINATE element to the queue which, upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter-weight than pthread_mutex_* to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct. A minimal sketch of the whole scheme follows.
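A minimal sketch of the scheme above with plain pthread primitives: num_threads workers, a mutex-and-condition-protected queue, and a TERMINATE sentinel that re-queues itself so every worker eventually sees it. The fixed-capacity ring buffer, the int element type, and the sentinel value are illustrative assumptions:

#include <pthread.h>

const int NUM_THREADS = 4;
const int CAPACITY    = 1024;   // fixed capacity for brevity; unchecked
const int TERMINATE   = -1;     // sentinel element

int ring[CAPACITY];
int head = 0, tail = 0;
pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;

void enqueue(int e)
{
    pthread_mutex_lock(&q_mutex);
    ring[tail++ % CAPACITY] = e;
    pthread_cond_signal(&q_cond);    // wake one waiting dequeuer
    pthread_mutex_unlock(&q_mutex);
}

int dequeue()
{
    pthread_mutex_lock(&q_mutex);
    while (head == tail)             // sleep until there is work
        pthread_cond_wait(&q_cond, &q_mutex);
    int e = ring[head++ % CAPACITY];
    pthread_mutex_unlock(&q_mutex);
    return e;
}

void* worker(void*)
{
    for (;;) {
        int e = dequeue();
        if (e == TERMINATE) {
            enqueue(TERMINATE);      // pass the sentinel to the next worker
            return nullptr;
        }
        // process_element(e); it may enqueue() new elements here
    }
}

int main()
{
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], nullptr, worker, nullptr);
    // ... enqueue the initial work; once no more real work can be
    // produced, send the sentinel:
    enqueue(TERMINATE);
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], nullptr);
    // One extra TERMINATE remains in the queue, as noted above.
    return 0;
}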
TBB is compatible with other threading packages.
TBB also emphasizes scalability, so when you port your program from a dual-core to a quad-core machine you do not have to adjust it. With data-parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application: it offers flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works correctly with a std::queue without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing till after all "older" items). Also, you don't have to worry about termination: parallel_do completes automatically as soon as all initial queue items and new items created on the fly are processed.
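For illustration, a minimal sketch of how the question's template might map onto parallel_do; the Element type, the spin mutex around the matrix access, and the refill condition are illustrative stand-ins:

#include <deque>
#include <tbb/parallel_do.h>
#include <tbb/spin_mutex.h>

struct Element { int value; };
tbb::spin_mutex matrix_mutex;   // guards the synchronized matrix access

void process_queue(std::deque<Element>& q)
{
    tbb::parallel_do(q.begin(), q.end(),
        [](Element& e, tbb::parallel_do_feeder<Element>& feeder) {
            {
                tbb::spin_mutex::scoped_lock lock(matrix_mutex);
                // ... synchronized matrix access ...
            }
            // ... unsynchronized work on e ...
            if (e.value > 0) {
                // Adding work on the fly replaces q.push(new_element).
                feeder.add(Element{e.value - 1});
            }
        });
    // parallel_do returns once the initial elements and everything added
    // through the feeder have been processed - no termination logic needed.
}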
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.