Is this approach of barriers right?

Is this approach of barriers right? - c++

I have found that pthread_barrier_wait is quite slow, so at one place in my code I replaced pthread_barrier_wait with my version of barrier (my_barrier), which uses an atomic variable. I found it to be much faster than pthread_barrier_wait. Is there any flaw of using this approach? Is it correct? Also, I don't know why it is faster than pthread_barrier_wait? Any clue?
EDIT
I am primarily interested in cases where there are equal number of threads as cores.
atomic<int> thread_count = 0;
void my_barrier()
{
thread_count++;
while( thread_count % NUM_OF_THREADS )
sched_yield();
}

Your barrier implementation does not work, at least not if the barrier will be used more than once. Consider this case:
NUM_OF_THREADS-1 threads are waiting at the barrier, spinning.
Last thread arrives and passes through the barrier.
Last thread exits barrier, continues processing, finishes its next task, and reenters the barrier wait.
Only now do the other waiting threads get scheduled, and they can't exit the barrier because the counter was incremented again. Deadlock.
In addition, one often-overlooked but nasty issue to deal with using dynamically allocated barriers is destroying/freeing them. You'd like any one of the threads to be able to perform the destroy/free after the barrier wait returns as long as you know nobody will be trying to wait on it again, but this requires making sure all waiters have finished touching memory in the barrier object before any waiters wake up - not an easy problem to solve. See my past questions on implementing barriers...
How can barriers be destroyable as soon as pthread_barrier_wait returns?
Can a correct fail-safe process-shared barrier be implemented on Linux?
And unless you know you have a special-case where none of the difficult problems apply, don't try implementing your own for an application.

AFAICT it's correct, and it looks like it's faster, but in the high contended case it'll be a lot worse. The hight contended case being when you have lots of threads, way more than CPUs.
There's a way to make fast barriers though, using eventcounts (look at it through google).
struct barrier {
atomic<int> count;
struct eventcount ec;
};
void my_barrier_wait(struct barrier *b)
{
eventcount_key_t key;
if (--b->count == 0) {
eventcount_broadcast(&b->ec);
return;
}
for (;;) {
key = eventcount_get(&b->ec);
if (!b->count)
return;
eventcount_wait(&b->ec);
}
}
This should scale way better.
Though frankly, when you use barriers, I don't think performance matters much, it's not supposed to be an operation that needs to be fast, it looks a lot like too early optimization.

Your barrier should be correct from what I can see, as long as you don't use the barrier to often or your thread number is a power of two. Theoretically your atomic will overflow somewhere (after hundreds of millions of uses for typical core counts, but still), so you might want to add some functionality to reset that somewhere.
Now to why it is faster: I'm not entirely sure, but I think pthread_barrier_wait will let the thread sleep till it is time to wake up. Yours is spinning on the condition, yielding in each iteration. However if there is no other application/thread which needs the processing time the thread will likely be scheduled again directly after the yield, so the wait time is shorter. At least thats what playing around with that kind of barriers seemed to indicate on my system.
As a side note: since you use atomic<int> I assume you use C++11. Wouldn't it make sense to use std::this_thread::yield() instead of sched_yield() in that case to remove the dependency on pthreads?
This link might also be intressting for you, it measures the performance of various barrier implementations (yours is rougly the lock xadd+while(i<NCPU) case, except for the yielding)

Related

Windows critical sections fairness

I've a question about the fairness of the critical sections on Windows, using EnterCriticalSection and LeaveCriticalSection methods. The MSDN documentation specifies: "There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads."
The problem comes with an application I wrote, which blocks some threads that never enter critical section, even after a long time; so I perfomed some tests with a simple c program, to verify this behaviour, but I noticed strange results when you have many threads an some wait times inside.
This is the code of the test program:
CRITICAL_SECTION CriticalSection;
DWORD WINAPI ThreadFunc(void* data) {
int me;
int i,c = 0;;
me = *(int *) data;
printf(" %d started\n",me);
for (i=0; i < 10000; i++) {
EnterCriticalSection(&CriticalSection);
printf(" %d Trying to connect (%d)\n",me,c);
if(i!=3 && i!=4 && i!=5)
Sleep(500);
else
Sleep(10);
LeaveCriticalSection(&CriticalSection);
c++;
Sleep(500);
}
return 0;
}
int main() {
int i;
int a[20];
HANDLE thread[20];
InitializeCriticalSection(&CriticalSection);
for (i=0; i<20; i++) {
a[i] = i;
thread[i] = CreateThread(NULL, 0, ThreadFunc, (LPVOID) &a[i], 0, NULL);
}
}
The results of this is that some threads are blocked for many many cycles, and some others enter critical section very often. I also noticed if you change the faster Sleep (the 10 ms one), everything might returns to be fair, but I didn't find any link between sleep times and fairness.
However, this test example works much better than my real application code, which is much more complicated, and shows actually starvation for some threads. To be sure that starved threads are alive and working, I made a test (in my application) in which I kill threads after entering 5 times in critical section: the result is that, at the end, every thread enters, so I'm sure all of them are alive and blocked on the mutex.
Do I have to assume that Windows is really NOT fair with threads?
Do you know any solution for this problem?
EDIT: The same code in linux with pthreads, works as expected (no thread starves).
EDIT2: I found a working solution, forcing fairness, using a CONDITION_VARIABLE.
It can be inferred from this post (link), with the required modifications.

You're going to encounter starvation issues here anyway since the critical section is held for so long.
I think MSDN is probably suggesting that the scheduler is fair about waking up threads but since there is no lock acquisition order then it may not actually be 'fair' in the way that you expect.
Have you tried using a mutex instead of a critical section? Also, have you tried adjusting the spin count?
If you can avoid locking the critical section for extended periods of time then that is probably a better way to deal with this.
For example, you could restructure your code to have a single thread that deals with your long running operation and the other threads queue requests to that thread, blocking on a completion event. You only need to lock the critical section for short periods of time when managing the queue. Of course if these operations must also be mutually exclusive to other operations then you would need to be careful with that. If all of this stuff can't operate concurrently then you may as well serialize that via the queue too.
Alternatively, perhaps take a look at using boost asio. You could use a threadpool and strands to prevent multiple async handlers from running concurrently where synchronization would otherwise be an issue.

I think you should review a few things:
in 9997 of 10000 cases you branch to Sleep(500). Each thread holds the citical section for as much as 500 ms on almost every successful attempt to acquire the critical section.
The threads do another Sleep(500) after releasing the critical section. As a result a single thread occupies almost 50 % (49.985 %) of the availble time by holding the critical section - no matter what!
Behind the scenes: Joe Duffy: The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues.
Assuming you did that on purpose to show the behavior: Starting 20 of those threads may result in a minimum wait time of 10 seconds for the last thread to get access to the critical section on a single logical processor when the processor is completely available for this test.
For how long dif you do the test / What CPU? And what Windows version? You should be able to write down some more facts: A histogram of thread active vs. thread id could tell a lot about fairness.
Critical sections shall be acquired for short periods of time. In most cases shared resources can be dealt with much quicker. A Sleep inside a critical section almost certainly points to a design flaw.
Hint: Reduce the time spent inside the critical section or investigate Semaphore Objects.

InterlockedExchange vs InterlockedCompareExchange spin locks

I've written a basic spin lock (see below) using InterlockedExchange. However I've seen a lot of implementations use InterlockedCompareExchange instead. Is mine incorrect in some unforeseen way and if not what are the pro's and cons of each way (if indeed there are any)?
PS I know the sleep is expensive and I'd want to have an attempt count before I call it.
class SpinLock
{
public:
SpinLock() : m_lock( 0 ) {}
~SpinLock(){}
void Lock()
{
while( InterlockedExchange( &m_lock, 1 ) == 1 )
{
Sleep( 0 );
}
}
void Unlock()
{
InterlockedExchange( &m_lock, 0 );
}
private:
volatile unsigned int m_lock;
};

First of all, InterlockedExchange takes a LONG. Please repeat after me: a LONG isn't the same an an int. This may seem like a small thing but it can cause you grief.
Now, to elaborate a little on what Mats Petersson said:
Your spinlock will have horrible performance since the InterlockedExchange loop in Lock will modify the m_lock variable unconditionally, causing a lot of work to be done by the processors behind the scenes to maintain cache coherency.
To make matters worse, by not ensuring that your m_lock variable is on a cache line by itself, the above effect is amplified and could affect other data, unlucky enough to share the cache line with the instance of your spinlock.
These are just two moderately subtle issues with this code. There are others. The simple fact is that locks aren't easy to get right, and you shouldn't be implementing custom locking primitives. Please don't reinvent the wheel. Use the facilities provided to you by the operating system. It's unlikely they themselves are a bottleneck.
If you do find you have a performance issue (that is, you have profiling data that suggests a performance bottleneck) first focus on algorithmic changes and on improving parallelization and reducing lock contention. If the problem persists then and only then look elsewhere.

There is very little difference between CMPXCHG and XCHG (which is the x86 instructions that you'd get from the two intrinsic functions you mention).
I think the main difference is that in a SMP system with a lot of contention on the lock, you don't get a bunch of writes when the value is already "locked" - which means that the other processors don't have to read back a value that is already there in the cache.
In a debug build, you'd also want to ensure that Unlock() is only called from the current owner of the lock!

C++11 Thread waiting behaviour: std::this_thread::yield() vs. std::this_thread::sleep_for( std::chrono::milliseconds(1) )

I was told when writing Microsoft specific C++ code that writing Sleep(1) is much better than Sleep(0) for spinlocking, due to the fact that Sleep(0) will use more of the CPU time, moreover, it only yields if there is another equal-priority thread waiting to run.
However, with the C++11 thread library, there isn't much documentation (at least that I've been able to find) about the effects of std::this_thread::yield() vs. std::this_thread::sleep_for( std::chrono::milliseconds(1) ); the second is certainly more verbose, but are they both equally efficient for a spinlock, or does it suffer from potentially the same gotchas that affected Sleep(0) vs. Sleep(1)?
An example loop where either std::this_thread::yield() or std::this_thread::sleep_for( std::chrono::milliseconds(1) ) would be acceptable:
void SpinLock( const bool& bSomeCondition )
{
// Wait for some condition to be satisfied
while( !bSomeCondition )
{
/*Either std::this_thread::yield() or
std::this_thread::sleep_for( std::chrono::milliseconds(1) )
is acceptable here.*/
}
// Do something!
}

The Standard is somewhat fuzzy here, as a concrete implementation will largely be influenced by the scheduling capabilities of the underlying operating system.
That being said, you can safely assume a few things on any modern OS:
yield will give up the current timeslice and re-insert the thread into the scheduling queue. The amount of time that expires until the thread is executed again is usually entirely dependent upon the scheduler. Note that the Standard speaks of yield as an opportunity for rescheduling. So an implementation is completely free to return from a yield immediately if it desires. A yield will never mark a thread as inactive, so a thread spinning on a yield will always produce a 100% load on one core. If no other threads are ready, you are likely to lose at most the remainder of the current timeslice before you get scheduled again.
sleep_* will block the thread for at least the requested amount of time. An implementation may turn a sleep_for(0) into a yield. The sleep_for(1) on the other hand will send your thread into suspension. Instead of going back to the scheduling queue, the thread goes to a different queue of sleeping threads first. Only after the requested amount of time has passed will the scheduler consider re-inserting the thread into the scheduling queue. The load produced by a small sleep will still be very high. If the requested sleep time is smaller than a system timeslice, you can expect that the thread will only skip one timeslice (that is, one yield to release the active timeslice and then skipping the one afterwards), which will still lead to a cpu load close or even equal to 100% on one core.
A few words about which is better for spin-locking. Spin-locking is a tool of choice when expecting little to no contention on the lock. If in the vast majority of cases you expect the lock to be available, spin-locks are a cheap and valuable solution. However, as soon as you do have contention, spin-locks will cost you. If you are worrying about whether yield or sleep is the better solution here spin-locks are the wrong tool for the job. You should use a mutex instead.
For a spin-lock, the case that you actually have to wait for the lock should be considered exceptional. Therefore it is perfectly fine to just yield here - it expresses the intent clearly and wasting CPU time should never be a concern in the first place.

I just did a test with Visual Studio 2013 on Windows 7, 2.8GHz Intel i7, default release mode optimizations.
sleep_for(nonzero) appears sleep for a minimium of around one millisecond and takes no CPU resources in a loop like:
for (int k = 0; k < 1000; ++k)
std::this_thread::sleep_for(std::chrono::nanoseconds(1));
This loop of 1,000 sleeps takes about 1 second if you use 1 nanosecond, 1 microsecond, or 1 millisecond. On the other hand, yield() takes about 0.25 microseconds each but will spin the CPU to 100% for the thread:
for (int k = 0; k < 4,000,000; ++k) (commas added for clarity)
std::this_thread::yield();
std::this_thread::sleep_for((std::chrono::nanoseconds(0)) seems to be about the the same as yield() (test not shown here).
In comparison, locking an atomic_flag for a spinlock takes about 5 nanoseconds. This loop is 1 second:
std::atomic_flag f = ATOMIC_FLAG_INIT;
for (int k = 0; k < 200,000,000; ++k)
f.test_and_set();
Also, a mutex takes about 50 nanoseconds, 1 second for this loop:
for (int k = 0; k < 20,000,000; ++k)
std::lock_guard<std::mutex> lock(g_mutex);
Based on this, I probably wouldn't hesitate to put a yield in the spinlock, but I would almost certainly wouldn't use sleep_for. If you think your locks will be spinning a lot and are worried about cpu consumption, I would switch to std::mutex if that's practical in your application. Hopefully, the days of really bad performance on std::mutex in Windows are behind us.

What you want is probably a condition variable. A condition variable with a conditional wake up function is typically implemented like what you are writing, with the sleep or yield inside the loop a wait on the condition.
Your code would look like:
std::unique_lock<std::mutex> lck(mtx)
while(!bSomeCondition) {
cv.wait(lck);
}
Or
std::unique_lock<std::mutex> lck(mtx)
cv.wait(lck, [bSomeCondition](){ return !bSomeCondition; })
All you need to do is notify the condition variable on another thread when the data is ready. However, you cannot avoid a lock there if you want to use condition variable.

if you are interested in cpu load while using yield - it's very bad, except one case-(only your application is running, and you are aware that it will basically eat all your resources)
here is more explanation:
running yield in loop will ensure that cpu will release execution of thread, still, if system try to come back to thread it will just repeat yield operation. This can make thread use full 100% load of cpu core.
running sleep() or sleep_for() is also a mistake, this will block thread execution but you will have something like wait time on cpu. Don't be mistaken, this IS working cpu but on lowest possible priority. While somehow working for simple usage examples ( fully loaded cpu on sleep() is half that bad as fully loaded working processor ), if you want to ensure application responsibility, you would like something like third example:
combining! :
std::chrono::milliseconds duration(1);
while (true)
{
if(!mutex.try_lock())
{
std::this_thread::yield();
std::this_thread::sleep_for(duration);
continue;
}
return;
}
something like this will ensure, cpu will yield as fast as this operation will be executed, and also sleep_for() will ensure that cpu will wait some time before even trying to execute next iteration. This time can be of course dynamicaly (or staticaly) adjusted to suits your needs
cheers :)

How do I safely read a variable from one thread and modify it from another?

I have a class instances which is being used in multiple threads. I am updating multiple member variables from one thread and reading the same member variables from one thread. What is the correct way to maintain the thread safety?
eg:
phthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
//unlock here?
Should I unlock the mutex over here? ( if another thread access the obj1 member variables 1 and 2 now, the accessed data might not be correct because memberV2 has not yet be updated. However, if I does not release the lock, the other thread might block because there is time consuming operation below.
//perform some time consuming operation which must be done before the assignment to memberV2 and after the assignment to memberV1
obj1.memberV2 = update field 2 from some calculation
pthread_mutex_unlock(&mutex1) //should I only unlock here?
Thanks

Your locking is correct. You should not release the lock early just to allow another thread to proceed (because that would allow the other thread to see the object in an inconsistent state.)

Perhaps it would be better to do something like:
//perform time consuming calculation
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = result;
pthread_mutex_unlock(&mutex1)
This of course assumes that the values used in the calculation won't be modified on any other thread.

Its hard to tell what you are doing that is causing problems. The mutex pattern is pretty simple. You Lock the mutex, access the shared data, unlock the mutex. This protects data, becuase the mutex will only let one thread get the lock at a time. Any thread that fails to get the lock has to wait till the mutex is unlocked. Unlocking wakes the waiters up. They will then fight to attain the lock. Losers go back to sleep. The time it takes to wake up might be multiple ms or more from the time the lock is released. Make sure you always unlock the mutex eventually.
Make sure you don't to keep locks locked for a long period of time. Most of the time, a long period of time is like a micro second. I prefer to keep it down around "a few lines of code." Thats why people have suggested that you do the long running calculation outside the lock. The reason for not keeping locks a long time is you increase the number of times other threads will hit the lock and have to spin or sleep, which decreases performance. You also increase the probability that your thread might be pre-empted while owning the lock, which means the lock is enabled while that thread sleeps. Thats even worse performance.
Threads that fail a lock dont have to sleep. Spinning means a thread encountering a locked mutex doesn't sleep, but loops repeatedly testing the lock for a predefine period before giving up and sleeping. This is a good idea if you have multiple cores or cores capable of multiple simultaneous threads. Multiple active threads means two threads can be executing the code at the same time. If the lock is around a small amount of code, then the thread that got the lock is going to be done real soon. the other thread need only wait a couple nano secs before it will get the lock. Remember, sleeping your thread is a context switch and some code to attach your thread to the waiters on the mutex, all have costs. Plus, once your thread sleeps, you have to wait for a period of time before the scheduler wakes it up. that could be multiple ms. Lookup spinlocks.
If you only have one core, then if a thread encounters a lock it means another sleeping thread owns the lock and no matter how long you spin it aint gonna unlock. So you would use a lock that sleeps a waiter immediately in hopes that the thread owning the lock will wake up and finish.
You should assume that a thread can be preempted at any machine code instruction. Also you should assume that each line of c code is probably many machine code instructions. The classic example is i++. This is one statement in c, but a read, an increment, and a store in machine code land.
If you really care about performance, try to use atomic operations first. Look to mutexes as a last resort. Most concurrency problems are easily solved with atomic operations (google gcc atomic operations to start learning) and very few problems really need mutexes. Mutexes are way way way slower.
Protect your shared data wherever it is written and wherever it is read. else...prepare for failure. You don't have to protect shared data during periods of time when only a single thread is active.
Its often useful to be able to run your app with 1 thread as well as N threads. This way you can debug race conditions easier.
Minimize the shared data that you protect with locks. Try to organize data into structures such that a single thread can gain exclusive access to the entire structure (perhaps by setting a single locked flag or version number or both) and not have to worry about anything after that. Then most of the code isnt cluttered with locks and race conditions.
Functions that ultimately write to shared variables should use temp variables until the last moment and then copy the results. Not only will the compiler generate better code, but accesses to shared variables especially changing them cause cache line updates between L2 and main ram and all sorts of other performance issues. Again if you don't care about performance disregard this. However i recommend you google the document "everything a programmer should know about memory" if you want to know more.
If you are reading a single variable from the shared data you probably don't need to lock as long as the variable is an integer type and not a member of a bitfield (bitfield members are read/written with multiple instructions). Read up on atomic operations. When you need to deal with multiple values, then you need a lock to make sure you didn't read version A of one value, get preempted, and then read version B of the next value. Same holds true for writing.
You will find that copies of data, even copies of entire structures come in handy. You can be working on building a new copy of the data and then swap it by changing a pointer in with one atomic operation. You can make a copy of the data and then do calculations on it without worrying if it changes.
So maybe what you want to do is:
lock the mutex
Make a copy of the input data to the long running calculation.
unlock the mutex
L1: Do the calculation
Lock the mutex
if the input data has changed and this matters
read the input data, unlock the mutex and go to L1
updata data
unlock mutex
Maybe, in the example above, you still store the result if the input changed, but go back and recalc. It depends if other threads can use a slightly out of date answer. Maybe other threads when they see that a thread is already doing the calculation simply change the input data and leave it to the busy thread to notice that and redo the calculation (there will be a race condition you need to handle if you do that, and easy one). That way the other threads can do other work rather than just sleep.
cheers.

Probably the best thing to do is:
temp = //perform some time consuming operation which must be done before the assignment to memberV2
pthread_mutex_lock(&mutex1)
obj1.memberV1 = 1;
obj1.memberV2 = temp; //result from previous calculation
pthread_mutex_unlock(&mutex1)

What I would do is separate the calculation from the update:
temp = some calculation
pthread_mutex_lock(&mutex1);
obj.memberV1 = 1;
obj.memberV2 = temp;
pthread_mutex_unlock(&mutex1);

Overhead due to use of Events

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following : I have around 1000 loops of each around 10'000 iterations do to. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
// some housekeeping
HANDLE signals[2];
signals[0] = waitSignal;
signals[1] = endSignal;
do {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
// try to get the next job parameters;
if (tp->getNextJob(threadId, data)) {
// execute job
void* output = jobFunc(data.params);
// tell thread pool that we're done and collect output
tp->collectOutput(data.ID, output);
}
tp->threadDone(threadId);
}
while (waitResult - WAIT_OBJECT_0 == 0);
// if we reach this point, endSignal was sent, so we are done !
return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
threadData data;
unsigned int threadId = 0;
char eventName[20];
sprintf_s(eventName, 20, "WaitSignal_%d", i);
data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
this, CREATE_SUSPENDED, &threadId);
data.threadId = threadId;
data.busy = false;
data.waitSignal = CreateEvent(NULL, true, false, eventName);
this->threads[threadId] = data;
// start thread
ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
// housekeeping
EnterCriticalSection(&(this->mutex));
// first, insert parameters in the list
this->jobs.push_back(job);
// then, find the first free thread and wake it
for (it = this->threads.begin(); it != this->threads.end(); ++it) {
thread = (threadData) it->second;
if (!thread.busy) {
this->threads[thread.threadId].busy = true;
++(this->nbActiveThreads);
// wake thread such that it gets the next params and runs them
SetEvent(thread.waitSignal);
break;
}
}
LeaveCriticalSection(&(this->mutex));
}

This looks to me as a producer consumer pattern, which can be implented with two semaphores, one guarding the queue overflow, the other the empty queue.
You can find some details here.

Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.

Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
if( waitResult - WAIT_OBJECT_0 != 0 )
return;
//....
}

Since you say that it is much slower in parallel than sequential execution, I assume that your processing time for your internal 2500 loop iterations is tiny (in the few micro seconds range). Then there is not much you can do except review your algorithm to split larger chunks of precessing; OpenMP won't help and every other synchronization techniques won't help either because they fundamentally all rely on events (spin loops do not qualify).
On the other hand, if your processing time of the 2500 loop iterations is larger than 100 micro seconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it to four processors will not give you more bandwidth, it will actually give you less because of collisions. You could also be running into problems of cache cycling where each of your top 1000 iteration will flush and reload the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.

If you are just parallelizing loops and using vs 2008, I'd suggest looking at OpenMP. If you're using visual studio 2010 beta 1, I'd suggesting looking at the parallel pattern library, particularly the "parallel for" / "parallel for each"
apis or the "task group class because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, here it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode, under a debugger and that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf the f1 profiler in visual studio 2010 beta 1 (it has 2 new concurrency modes which help see contention) or Intel's vtune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick

It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. that way, you can pass a hundred jobs to the thread pool and the sync objects will be used just the once to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.

The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but should not be that expensive, unless your processing is really fast to achieve. In this case, switching between thredas is expensive too, hence my answer first part on doing things sequencially ...
You should look for inter-threads synchronisation bottlenecks. You can trace threads waiting times to begin with ...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer cores/processors to parralellize some processing essencialy sequential.
Take that your have 4 cores and 10000 loops to compute as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process. You just need to give your four threads thr nth, nth+1, nth+2, nth+3 loops, wait for the four threads to complete then going on. You should use a rendezvous or barrier (a synchronization mechanism that wait for n threads to complete). Boost has such a mechanism. You can look the windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time. Not the event part.

It appears that this whole event thing
(along with the synchronization
between the threads using critical
sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with threadpools here in the past, I'd suggest some debug
to count the number of times that your threadfunction is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.

As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work. But it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are idle processors always. I would suggest reorganizing the code so that addJob() does what it is named -- adds a job ONLY (without finding or even caring who does the job) while each worker thread actively gets new work when it is done with its existing work.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js