I want to put objects into a std::vector from multiple threads. So I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?
I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.
#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread>
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const size_t size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            // Spin until we own the flag, yielding while we wait.
            while (sync.exchange(true)) {
                std::this_thread::yield();
            }
            first_result[counter] = counter;
            sync.store(false);
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex;
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            // Mutex-protected write.
            std::unique_lock<std::mutex> lock(mutex);
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    return 0;
}
I don't think your question can be answered by referring only to the standard: mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.
Mutexes are not slow. You may have seen articles that compare their performance against custom spin locks and other "lightweight" primitives, but that's not the right comparison: these are not interchangeable.
Spin locks are very fast when they are held (acquired) for a relatively short time: acquiring them is cheap, but the other threads that are trying to lock stay active for the whole wait (spinning constantly in a loop).
A custom spin lock could be implemented this way:
class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
        : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while(_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear();
    }
};
A mutex is a much more complicated primitive. In particular, on Windows we have two such primitives: the Critical Section, which works on a per-process basis, and the Mutex, which doesn't have that limitation.
Locking a mutex (or critical section) is much more expensive, but the OS can really put the other waiting threads to "sleep", which improves performance and helps the task scheduler manage resources efficiently.
Why do I write this? Because modern mutexes are often so-called "hybrid mutexes": when such a mutex is locked, other waiting threads first perform some number of "spins", and only then is the heavy mutex locked, to avoid wasting resources.
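To make the idea concrete, here is a minimal sketch of that spin-then-block pattern; the class name and spin count are made up, and real hybrid mutexes live inside the OS/standard library and are far more sophisticated:
#include <mutex>

// Sketch: spin a bounded number of times on try_lock(), then fall back to a
// blocking lock(). The spin count is arbitrary; real implementations tune it.
class HybridLock
{
    std::mutex _heavy;
public:
    void lock()
    {
        for (int spins = 0; spins < 4000; ++spins)   // fast path: short spin
            if (_heavy.try_lock())
                return;
        _heavy.lock();                               // slow path: block in the OS
    }
    void unlock()
    {
        _heavy.unlock();
    }
};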
In your case, the mutex is locked in each loop iteration to perform this single instruction:
second_result[counter] = counter;
It looks fast, so the "real" mutex may never actually be locked. That means that in this case your "mutex" solution can be as fast as the atomic-based one (because it effectively becomes an atomic-based solution itself).
Also, in the first solution you implemented a kind of spin-lock-like behaviour, but I am not sure whether it behaves predictably in a multi-threaded environment. In any case, "locking" should have at least acquire semantics, while unlocking should be a release operation; anything weaker (such as relaxed ordering) would be too weak for this use case.
I edited the code above to be more compact and correct. It uses std::atomic_flag, which is the only type (unlike the std::atomic<> specializations) that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).
Also, regarding the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behavior. For example, the Boost library implements spinlock::lock() as follows:
void lock()
{
    for( unsigned k = 0; !try_lock(); ++k )
    {
        boost::detail::yield( k );
    }
}
source: boost/smart_ptr/detail/spinlock_std_atomic.hpp
Where detail::yield() is (Win32 version):
inline void yield( unsigned k )
{
    if( k < 4 )
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if( k < 16 )
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if( k < 32 )
    {
        Sleep( 0 );
    }
    else
    {
        Sleep( 1 );
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}
[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]
First, the thread spins a fixed number of times (4 in this case). If the mutex is still locked, the pause instruction is used (if available) or Sleep(0) is called, which basically causes a context switch and lets the scheduler give another blocked thread a chance to do something useful. Then Sleep(1) is called to perform an actual (short) sleep. Very nice!
Also, this statement:
The purpose of a spinlock is busy waiting
is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement locking primitive, but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):
In the spin-wait loop, the pause intrinsic improves the speed at which the code detects the release of the lock and provides especially significant performance gain.
So implementations like
void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); }
may not be as good as they seem.
There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potentially) modifying a memory location (such as exchange or test_and_set). On typical modern architectures, those operations generate instructions that require the cache line holding the lock to be in the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a plain load/read only, and try to acquire the lock only when there is a chance that the attempt will succeed.
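To illustrate the spin-on-read idea, here is a minimal test-and-test-and-set sketch. It uses std::atomic<bool> (which, as noted above, is not guaranteed to be lock-free everywhere), because std::atomic_flag only gained a read-only test() in C++20; the class name is made up:
#include <atomic>

// Test-and-test-and-set ("TTAS") spin lock: spin on a plain load and only
// attempt the modifying exchange when the lock looks free.
class TtasSpinLock
{
    std::atomic<bool> _locked{false};
public:
    void lock()
    {
        for (;;)
        {
            // Attempt to acquire; exchange returns the previous value.
            if (!_locked.exchange(true, std::memory_order_acquire))
                return;
            // Someone else holds it: wait with read-only loads.
            while (_locked.load(std::memory_order_relaxed))
            { }
        }
    }
    void unlock()
    {
        _locked.store(false, std::memory_order_release);
    }
};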
A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++
Recently I have often found myself in situations where shared data is read a lot but written rarely, so I began to wonder whether it is possible to speed up the synchronization a bit.
Take the following as an example, in which multiple threads occasionally write the data, a single thread frequently reads the data, and everything is synchronized with a normal mutex.
#include <iostream>
#include <unistd.h>
#include <unordered_map>
#include <mutex>
#include <thread>
#include <cstdio>
#include <cstdlib>

using namespace std;

unordered_map<int, int> someData({{1, 10}});
mutex mu;

void writeData() {
    while (true) {
        {
            lock_guard<mutex> lock(mu);
            int r = rand() % 10;
            someData[1] = r;
            printf("data changed to %d\n", r);
        }
        usleep(rand() % 100000000 + 100000000);
    }
}

void readData() {
    while (true) {
        {
            lock_guard<mutex> lock(mu);
            for (auto &i : someData) {
                printf("%d:%d\n", i.first, i.second);
            }
        }
        usleep(100);
    }
}

int main() {
    thread writeT1(&writeData);
    thread writeT2(&writeData);
    thread readT(&readData);
    readT.join();  // never returns in this demo; the writers are never joined
}
Using the normal lock mechanism, every read requires a lock, so I'm thinking of speeding it up to a single atomic read in most cases:
#include <atomic>  // additionally needed for atomic_int

unordered_map<int, int> someData({{1, 10}});
mutex mu;
atomic_int dataVersion{0};

void writeData2() {
    while (true) {
        {
            lock_guard<mutex> lock(mu);
            dataVersion.fetch_add(1, memory_order_acquire);
            int r = rand() % 10;
            someData[1] = r;
            printf("data changed to %d\n", r);
        }
        usleep(rand() % 100000000 + 100000000);
    }
}

void readData2() {
    mu.lock();
    int versionCopy = dataVersion.load();
    auto dataCopy = someData;
    mu.unlock();
    while (true) {
        if (versionCopy != dataVersion.load(memory_order_relaxed)) {
            lock_guard<mutex> lock(mu);
            versionCopy = dataVersion.load(memory_order_relaxed);
            dataCopy = someData;
        }
        else {
            for (auto &i : dataCopy) {
                printf("%d:%d\n", i.first, i.second);
            }
            usleep(100);
        }
    }
}
The data type unordered_map here is just an example; it could be any type, and I'm not looking for a pure lock-free algorithm, as that might be a whole other story. Just for normal lock-based synchronization, in a situation where most operations are reads, is a trick like this logically OK? Are there any established approaches for this?
[edit]
I'm aware of the shared mutex, but it isn't really the situation I was talking about. Firstly, a shared lock is not cheap, probably more expensive than a plain mutex and certainly heavier than atomics; secondly, the example has a single reading thread, which can't take much advantage of it.
I was interested particularly in the cost of the locking operation. Reducing blocking and shrinking the critical section are certainly the first things to look at in a real case, but I wasn't targeting that here.
The unordered_map data type is just an example; I'm not looking for a data structure that better suits a specific task, or for a lock-free algorithm. The data type could be anything.
The sleep times are there to demonstrate that reads happen far more often than writes, to the point where we stop caring about the extra lock and copy in the if block.
Thanks~
You are storing the data in an unordered_map. What guarantees does the unordered_map class make about concurrent access for readers and writers? If it is unhappy with that prospect, the atomics are not your friend.
In most (every?) OS, locking primitives are themselves handled with atomics in the uncontested case, only falling back to the kernel when contested. With that in mind, you are best off minimizing the amount of code run while the lock is held, so your first loop should be:
int r = rand()%10;
mu.lock();
someData[1] = r;
mu.unlock();
printf("data changed to %d\n", r);
I don't know how you would fix the read side, but if you chose a friendlier data store, you could minimize access to it in the same way.
I will first try to describe my own understanding of your idea:
Frequent reads, occasional write.
Locks are expensive ... that should be benchmarked. Try std::shared_mutex or Slim Reader/Writer (SRW) locks (Windows only), or some other slim implementation; these usually use a cheap, optimistic (atomic/spin-lock) mechanism that has little to no impact when there is no collision (no writer most of the time).
You don't seem to care how old/recent your copy is. That may be acceptable for informative performance counters, but I would think twice about it: it is not something somebody else maintaining your code would expect or even think about. The consequences can be catastrophic.
You only modify the data under the lock, and the reader makes its copy while holding the lock. That means your approach is safe from a simple thread-synchronization point of view, except for the previous point (readers working with old data, and multiple readers possibly holding different copies ... is it worth it?).
Anyway, you should really benchmark first and then look for a better solution that somebody else has already written (slim rw-locks, as sketched below), before even attempting to come up with your own synchronization mechanism (which is generally very hard to do correctly).
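For reference, a minimal reader/writer sketch with std::shared_mutex (C++17); the container and key mirror the question's example, and the function names are made up:
#include <shared_mutex>
#include <unordered_map>

std::unordered_map<int, int> someData({{1, 10}});
std::shared_mutex rwMu;

void writer(int value)
{
    std::unique_lock<std::shared_mutex> lock(rwMu);  // exclusive: blocks all readers and writers
    someData[1] = value;
}

int reader()
{
    std::shared_lock<std::shared_mutex> lock(rwMu);  // shared: readers don't block each other
    return someData.at(1);
}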
EDIT: Found an article with a concrete shared_mutex implementation using std::atomic:
Code Project: We make a std::shared_mutex 10 times faster
Coliru test here
I am trying to use multi-threading for encoding with Random Linear Network Coding (RLNC) to increase performance.
However, I have a performance problem: my multi-threaded solution is much slower than the current non-threaded version. I suspect that the atomic access to m_completed and the std::mutex used for inserting elements into m_result are killing my performance, but I am not sure how to confirm this.
A bit more information: the function completed() is called in a while loop in the main thread, while(!encoder.completed()){}, which results in a huge number of atomic accesses, but I cannot find a proper way to do it without the atomic or the mutex lock. You can find the code below.
So please, if someone can see a mistake or guide me towards a better way of doing this, I would be very grateful. I have spent 1.5 weeks figuring out what is wrong, and my only ideas so far are the atomic and the std::mutex locks.
#include <cstdint>
#include <vector>
#include <mutex>
#include <memory>
#include <atomic>
...
namespace master_thesis
{
namespace encoder
{
class smart_encoder
{
    ...
    void start()
    {
        ...
        // In case there is an uneven number of symbols,
        // we adjust for this above
        else
        {
            m_pool.enqueue([this, encoder]() {
                std::vector<std::vector<uint8_t>> total_payload(this->m_coefficients,
                    std::vector<uint8_t>(encoder->payload_size()));
                std::vector<uint8_t> payload(encoder->payload_size());

                for (uint32_t j = 0; j < this->m_coefficients; ++j)
                {
                    encoder->write_payload(payload.data());
                    total_payload[j] = payload; //.insert(total_payload.begin() + j, payload);
                }

                this->m_mutex.lock();
                this->m_result.insert(std::end(this->m_result),
                                      std::begin(total_payload),
                                      std::end(total_payload));
                ++(this->m_completed);
                this->m_mutex.unlock();
            });
        }
        }
    }

    bool completed()
    {
        return m_completed.load() >= (m_threads - 1);
    }

    std::vector<std::vector<uint8_t>> result()
    {
        return m_result;
    }

private:
    uint32_t m_symbols;
    uint32_t m_symbol_size;
    std::atomic<uint32_t> m_completed;
    unsigned int m_threads;
    uint32_t m_coefficients;
    std::mutex m_mutex;
    std::vector<uint8_t> m_data;
    std::vector<std::vector<uint8_t>> m_result;
    ThreadPool m_pool;
    std::vector<std::shared_ptr<rlnc_encoder>> m_encoders;
};
}
}
The bottleneck is probably not the call to completed().
On x86, a read from an aligned uint32_t is automatically an atomic operation, std::atomic or not. The only things std::atomic does for uint32_t on x86 are ensuring it is properly aligned and preventing the compiler from reordering the access or optimizing it out.
A tight-loop load also isn't the cause of bus contention. There will be a cache miss on the first read, but subsequent loads will be cache hits until the cache line is invalidated by a write to that address from another thread. There is a caveat: accidental sharing of cache lines ("false sharing"). One way to rule this out is to switch to an array with 60 bytes of unused padding on both sides of your atomic (only use the middle element):
std::atomic<uint32_t> m_buffered[31];
std::atomic<uint32_t>& m_completed = m_buffered[15];
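An alternative sketch uses C++11 alignas to give the counter its own cache line (64 bytes is typical on x86; the struct name here is made up):
#include <atomic>
#include <cstdint>

// Pad the atomic counter to a full cache line so it cannot share one with
// neighbouring members (avoids false sharing).
struct alignas(64) PaddedCounter
{
    std::atomic<uint32_t> value{0};
};

PaddedCounter m_completed;  // used in place of std::atomic<uint32_t> m_completed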
Keep in mind that a tight loop ties up one of your cores doing nothing but looking in its cache. That's a waste of money... ;) It could very well be the cause of your issue. You could change your code to something like this:
int m_completed = 0;           // no longer atomic: now protected by m_mutex
std::condition_variable m_cv;  // added next to the existing members

// in main (replaces the spinning while(!encoder.completed()){};
// shown as if it had access to the encoder's members, for brevity):
{
    std::unique_lock<std::mutex> lock(m_mutex);   // the m_mutex from the class
    while (!completed())                          // completed() now reads the plain int under the lock
        m_cv.wait(lock);
}

// in each worker task:
bool toSignal = false;
{
    std::lock_guard<std::mutex> guard(m_mutex);
    this->m_result.insert(std::end(this->m_result),
                          std::begin(total_payload),
                          std::end(total_payload));
    ++m_completed;
    toSignal = completed();
}
if (toSignal)
    m_cv.notify_one();
It could also be that your performance loss has to do with the mutex critical section. That critical section has the potential to be many orders of magnitude longer than a cache miss. I would recommend comparing the times for 1 thread, 2 threads, and 4 threads in the thread pool: if 2 threads aren't faster than 1 thread, your code is essentially running sequentially.
How to measure? Profiling tools are good when you don't know what to optimize. I don't have a lot of experience with them, but I know that (at least some of the older ones) can get a little sketchy when it comes to multithreading. You can also use a good old-fashioned timer: C++11 has a high_resolution_clock that probably has a resolution of single microseconds on decent hardware.
Lastly, I see a lot of opportunity for algorithmic/scalar optimization: pre-allocate vectors instead of allocating them every time, use pointers or std::move to avoid unnecessary deep copies, and pre-allocate m_result so the threads can write to specific index offsets.
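As a sketch of that last point (the thread count, payload size, and fill value are made up, and the question's write_payload() is replaced by a trivial fill):
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

int main()
{
    const std::size_t threads = 4, per_thread = 1000, payload_size = 32;

    // Pre-size the result once; afterwards no thread ever changes the
    // container's size, only its own elements, so no mutex is needed.
    std::vector<std::vector<std::uint8_t>> result(threads * per_thread);

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t)
    {
        workers.emplace_back([&result, t, per_thread, payload_size]() {
            for (std::size_t j = 0; j < per_thread; ++j)
            {
                // Each worker owns indices [t * per_thread, (t + 1) * per_thread).
                result[t * per_thread + j] =
                    std::vector<std::uint8_t>(payload_size, std::uint8_t(t));
            }
        });
    }
    for (auto& w : workers)
        w.join();
}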
I've been trying to use C++ primitive variables and operators, like int, if, and while, to develop a thread-safe mechanism.
My idea is to use two integer variables called sync and lock, incrementing and checking sync and after that incrementing and checking lock. If all the checks succeed, the lock is guaranteed; otherwise it tries again.
It seems that my idea is not working properly, as the final assertion fires.
#include <assert.h>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

class Resource {
    // Shared resource to be made thread safe.
    int resource;
    // Mutual exclusion variables.
    volatile int lock;
    volatile int sync;

public:
    Resource() : resource( 0 ), lock( 0 ), sync( 0 ) {}
    ~Resource() {}

    int sharedResource() {
        return resource;
    }

    void sharedResourceAction( std::string id ) {
        bool done = false;
        do {
            int oldSync = sync;
            // ++ should be atomic.
            sync++;
            if ( sync == oldSync + 1 ) {
                // ++ should be atomic.
                lock++;
                if ( lock == 1 ) {
                    // For the sake of the example, the read-modify-write
                    // is not atomic and not thread safe if there is no
                    // mutex surrounding it.
                    int oldResource = resource;
                    resource = oldResource + 1;
                    done = true;
                }
                // -- should be atomic.
                lock--;
            }
            if ( !done ) {
                // Pseudo-random sleep to break up the race condition
                // between the threads.
                std::this_thread::sleep_for(
                    std::chrono::microseconds( resource % 5 ) );
            }
        } while( !done );
    }
};

static const int maxThreads = 10;
static const int maxThreadActions = 1000;

// The name is passed by value; a reference would dangle once the
// creating loop moves on to the next iteration.
void threadAction( Resource& resource, std::string name ) {
    for ( int i = 0; i < maxThreadActions; i++) {
        resource.sharedResourceAction( name );
    }
}

int main() {
    std::vector< std::thread* > threadVec;
    Resource resource;
    // Create the threads.
    for (int i = 0; i < maxThreads; ++i) {
        std::string name = "t";
        name += std::to_string( i );
        std::thread *thread = new std::thread( threadAction,
                                               std::ref( resource ),
                                               name );
        threadVec.push_back( thread );
    }
    // Join the threads.
    for ( auto threadVecIter = threadVec.begin();
          threadVecIter != threadVec.end(); threadVecIter++ ) {
        (*threadVecIter)->join();
    }
    std::cout << "Shared resource is " << resource.sharedResource()
              << std::endl;
    assert( resource.sharedResource() == ( maxThreads * maxThreadActions ) );
    return 0;
}
Is there a thread-safe mechanism to protect shared resources using only primitive variables and operators?
No. There are a few reasons why this doesn't work.
Firstly, the standard says it doesn't work: you have (explicit) read/write and write/write race conditions, and the standard forbids them (they are undefined behaviour).
Secondly, ++i is in no way atomic. Even on mainstream Intel processors it isn't: it will usually compile to an inc instruction where it would need to be a lock inc instruction.
Thirdly, volatile has no threading meaning in C++ the way it does in Java or C#. It is neither necessary nor sufficient to achieve anything to do with thread safety (outside of nasty compiler extensions like MSVC's /volatile:ms). See this answer for more information about volatile in C++.
There may be more issues in your code but this list should be enough to dissuade you.
Edit: And to actually answer your final question: no, I don't think it is possible to implement thread-safety mechanisms from primitive types and operations in a standard-compliant way. Basically you need the memory subsystem, the CPU and the compiler to all agree not to perform certain kinds of transformations when implementing thread-safety mechanisms. That generally means you need compiler hooks or guarantees outside the standard, plus knowledge of the target CPU's guarantees or intrinsics, to achieve it.
volatile is absolutely no good for multithreading:
Within a thread of execution, accesses (reads and writes) through volatile glvalues cannot be reordered past observable side-effects (including other volatile accesses) that are sequenced-before or sequenced-after within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization.
In addition, volatile accesses are not atomic (concurrent read and write is a data race) and do not order memory (non-volatile memory accesses may be freely reordered around the volatile access).
If you want to have atomic operations on an integer, the proper way to do it is with std::atomic<int>. That gives you guarantees on memory ordering that will be observed by other threads. If you really want to do this sort of lock-free programming, you should sit and absorb the memory model documentation, and, if you're anything like me, strongly reconsider attempting lock-free programming as you try to stop your head exploding.
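For example, a minimal sketch of a shared counter done with std::atomic<int>; the thread and iteration counts mirror the question's constants:
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main()
{
    std::atomic<int> counter{0};

    std::vector<std::thread> threads;
    for (int t = 0; t < 10; ++t)                  // maxThreads in the question
    {
        threads.emplace_back([&counter]() {
            for (int i = 0; i < 1000; ++i)        // maxThreadActions in the question
                counter.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
        });
    }
    for (auto& t : threads)
        t.join();

    assert(counter.load() == 10 * 1000);          // always holds, unlike the volatile version
}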
I'm trying to implement a spinning thread barrier using atomics, specifically __sync_fetch_and_add. https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html
I basically want an alternative to the pthread barrier. I'm using Ubuntu on a system that can run about a hundred threads in parallel.
int bar = 0; //global variable
int P = MAX_THREADS; //number of threads
__sync_fetch_and_add(&bar,1); //each thread comes and adds atomically
while(bar<P){} //threads spin until bar increments to P
bar=0; //a thread sets bar=0 to be used in the next spinning barrier
This does not work for obvious reasons (a thread may set bar=0 while another thread is still stuck in the while loop, etc.). I saw an implementation here: Writing a (spinning) thread barrier using C++11 atomics; however, it seems too complex and I think its performance might be worse than a pthread barrier.
This implementation is also expected to produce more traffic within the memory hierarchy due to bar's cache line being ping-ponged among threads.
Any ideas on how to use these atomic instructions to make a simple barrier? A communication-optimal scheme would also be helpful additionally.
Instead of spinning on the counter of threads, it is better to spin on the number of barriers passed, which is incremented only by the last thread to reach the barrier. That way you also reduce memory cache pressure, because the spinning variable is now updated by only a single thread.
int P = MAX_THREADS;
int bar = 0;              // Count of threads that have reached the barrier.
volatile int passed = 0;  // Number of barriers passed by all threads.

void barrier_wait()
{
    int passed_old = passed; // Should be evaluated before incrementing *bar*!

    if(__sync_fetch_and_add(&bar,1) == (P - 1))
    {
        // The last thread to reach the barrier.
        bar = 0;
        // *bar* must be reset strictly before the barrier counter is updated.
        __sync_synchronize();
        passed++; // Mark the barrier as passed.
    }
    else
    {
        // Not the last thread. Wait for the others.
        while(passed == passed_old) {};
        // Need to synchronize the cache with the other threads that passed the barrier.
        __sync_synchronize();
    }
}
Note that you need the volatile modifier on the spinning variable.
C++ code could be somewhat faster than the C version, as it can use acquire/release memory barriers instead of the full barrier, which is the only one available through the __sync functions:
int P = MAX_THREADS;
std::atomic<int> bar{0};    // Count of threads that have reached the barrier.
std::atomic<int> passed{0}; // Number of barriers passed by all threads.

void barrier_wait()
{
    int passed_old = passed.load(std::memory_order_relaxed);

    if(bar.fetch_add(1) == (P - 1))
    {
        // The last thread to reach the barrier.
        bar = 0;
        // Synchronize and store in one operation.
        passed.store(passed_old + 1, std::memory_order_release);
    }
    else
    {
        // Not the last thread. Wait for the others.
        while(passed.load(std::memory_order_relaxed) == passed_old) {};
        // Need to synchronize the cache with the other threads that passed the barrier.
        std::atomic_thread_fence(std::memory_order_acquire);
    }
}
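A minimal usage sketch for the C++ version above; it assumes the definitions of P, bar, passed and barrier_wait() from the previous block, with MAX_THREADS set to the number of participating threads, and the phase count is arbitrary:
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// ... P, bar, passed and barrier_wait() as defined in the block above ...

int main()
{
    std::vector<std::thread> threads;
    for (int id = 0; id < P; ++id)
    {
        threads.emplace_back([id]() {
            for (int phase = 0; phase < 3; ++phase)
            {
                std::printf("thread %d working in phase %d\n", id, phase);
                barrier_wait();  // no thread starts phase+1 before all reach this point
            }
        });
    }
    for (auto& t : threads)
        t.join();
}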
For example, I've got a some work that is computed simultaneously by multiple threads.
For demonstration purposes the work is performed inside a while loop. In a single iteration each thread performs its own portion of the work, and before the next iteration begins a counter should be incremented exactly once.
My problem is that the counter is updated by every thread, rather than just once per iteration.
As this seems like a relatively simple thing to want to do, I presume there is a 'best practice' or common way to go about it?
Here is some sample code to illustrate the issue and help the discussion along.
(I'm using Boost threads)
class someTask {
public:
    int mCounter;  // initialized to 0
    int mTotal;    // initialized to e.g. 100000
    boost::mutex cntmutex;

    int getCount()
    {
        boost::mutex::scoped_lock lock( cntmutex );
        return mCounter;
    }

    void process( int thread_id, int numThreads )
    {
        while ( getCount() < mTotal )
        {
            // The main task is performed here and is divided
            // into sub-tasks based on the thread_id and numThreads

            // Wait for all threads to get to this point

            cntmutex.lock();
            mCounter++;  // <---- how to ensure this is only updated once?
            cntmutex.unlock();
        }
    }
};
The main problem I see here is that you are reasoning at too low a level. Therefore, I am going to present an alternative solution based on the new C++11 thread API.
The main idea is that you essentially have a schedule -> dispatch -> do -> collect -> loop routine. In your example you try to reason about all of this within the "do" phase, which is quite hard. Your pattern can be expressed much more easily using the opposite approach.
First we isolate the work to be done in its own routine:
void process_thread(size_t id, size_t numThreads) {
    // do something
}
Now, we can easily invoke this routine:
#include <future>
#include <thread>
#include <vector>
void process(size_t const total, size_t const numThreads) {
    for (size_t count = 0; count != total; ++count) {
        std::vector< std::future<void> > results;

        // Create all threads, launch the work!
        for (size_t id = 0; id != numThreads; ++id) {
            results.push_back(std::async(process_thread, id, numThreads));
        }

        // The destruction of `std::future`
        // requires waiting for the task to complete (*)
    }
}
(*) See this question.
You can read more about std::async here, and a short introduction is offered here (they appear to be somewhat contradictory on the effect of the launch policy, oh well). It is simpler here to let the implementation decide whether or not to create OS threads: it can adapt to the number of available cores.
Note how the code is simplified by removing shared state. Because the threads share nothing, we no longer have to worry about synchronization explicitly!
You protected the counter with a mutex, ensuring that no two threads can access the counter at the same time. Your other option would be to use Boost.Atomic, C++11 atomic operations, or platform-specific atomic operations.
However, your code seems to access mCounter without holding the mutex:
while ( mCounter < mTotal )
That's a problem. You need to hold the mutex to access the shared state.
You may prefer to use this idiom (a rough code sketch follows the list):
1. Acquire lock.
2. Do tests and other things to decide whether we need to do work or not.
3. Adjust accounting to reflect the work we've decided to do.
4. Release lock.
5. Do work.
6. Acquire lock.
7. Adjust accounting to reflect the work we've done.
8. Loop back to step 2 unless we're totally done.
9. Release lock.
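Here is a rough, self-contained sketch of that idiom. It uses std::mutex so it compiles on its own (the question's boost::mutex works the same way); mCounter, mTotal and cntmutex mirror the question's members, and doSubTask() is a placeholder for the real per-chunk work:
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex cntmutex;
int mCounter = 0;          // protected by cntmutex
const int mTotal = 100;

void doSubTask(int chunk, int thread_id)
{
    std::printf("thread %d does chunk %d\n", thread_id, chunk);
}

void process(int thread_id)
{
    std::unique_lock<std::mutex> lock(cntmutex);   // acquire lock
    while (mCounter < mTotal)                      // decide whether more work is needed
    {
        int claimed = mCounter++;                  // account for the work we decided to do
        lock.unlock();                             // release lock
        doSubTask(claimed, thread_id);             // do the work without holding the lock
        lock.lock();                               // re-acquire lock and loop back to the test
    }
}                                                  // final release via the lock's destructor

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(process, t);
    for (auto& th : threads)
        th.join();
}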
You need to use a message-passing solution. This is more easily enabled by libraries like TBB or PPL. PPL is included for free in Visual Studio 2010 and above, and TBB can be downloaded for free under a FOSS licence from Intel.
concurrent_queue<unsigned int> done;
std::vector<Work> work;
// fill work here
parallel_for(0, work.size(), [&](unsigned int i) {
    processWorkItem(work[i]);
    done.push(i);
});
It's lockless and you can have an external thread monitor the done variable to see how much, and what, has been completed.
I would like to disagree with David on doing multiple lock acquisitions to do the work.
Mutexes are expensive, and with more threads contending for a mutex it basically falls back to a system call, which results in a user-space to kernel-space context switch, with the calling thread(s) forced to sleep: a lot of overhead.
So if you are using a multiprocessor system, I would strongly recommend using spin locks instead [1].
So what I would do is:
=> Get rid of the scoped lock acquisition used just to check the condition.
=> Make the counter volatile to support the above.
=> Inside the while loop, check the condition again after acquiring the lock.
class someTask {
public:
    volatile int mCounter;  // initialized to 0 : make the counter volatile
    int mTotal;             // initialized to e.g. 100000
    boost::mutex cntmutex;

    void process( int thread_id, int numThreads )
    {
        while ( mCounter < mTotal )  // compare without acquiring the lock
        {
            // The main task is performed here and is divided
            // into sub-tasks based on the thread_id and numThreads

            cntmutex.lock();

            // Now compare again to make sure the condition still holds.
            // This saves all the acquisitions and releases we would
            // otherwise do just to check whether the condition is true.
            if ( mCounter < mTotal )
            {
                mCounter++;
            }

            cntmutex.unlock();
        }
    }
};
[1] http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock