How to match processing time with reception time in C++ multithreading

I'm writing a C++ application in which I'll receive 4096 bytes of data every 0.5 seconds. The data is processed and the output is sent to another application. Processing each set of data takes nearly 2 seconds.
This is how exactly I'm doing this.
In my main function, I'm receiving the data and pushing it into a vector.
I've created a thread that always processes the first element and deletes it immediately after processing. Below is a simulation of the receiving part of my application.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <mutex>
#include <pthread.h>
using namespace std;

struct Student{
    int id;
    int age;
};

vector<Student> dustBin;
pthread_mutex_t lock1;
bool isEven = true;

void *processData(void* arg){
    Student st1;
    while(true)
    {
        if(dustBin.size())
        {
            printf("front: %d\tSize: %zu\n", dustBin.front().id, dustBin.size());
            st1 = dustBin.front();
            cout << "Currently Processing ID " << st1.id << endl;
            sleep(2);
            pthread_mutex_lock(&lock1);
            dustBin.erase(dustBin.begin());
            cout << "Deleted" << endl;
            pthread_mutex_unlock(&lock1);
        }
    }
    return NULL;
}

int main()
{
    pthread_t ptid;
    Student st;
    dustBin.clear();
    pthread_mutex_init(&lock1, NULL);
    pthread_create(&ptid, NULL, &processData, NULL);
    while(true)
    {
        for(int i = 0; i < 4096; i++)
        {
            st.id = i + 1;
            st.age = i + 2;
            pthread_mutex_lock(&lock1);
            dustBin.push_back(st);
            printf("Pushed: %d\n", st.id);
            pthread_mutex_unlock(&lock1);
            usleep(500000);
        }
    }
    pthread_join(ptid, NULL);
    pthread_mutex_destroy(&lock1);
}
The output of this code is:
[output image]
In the output image posted here, you can observe the exact sequence of the processing: only one item is processed for every four insertions.
Note that the reception time of data <<< processing time.
Because of this, my input buffer is growing very rapidly. Also, since the main thread and the processData thread share a mutex, each depends on the other to release the lock; this sometimes blocks my receiving loop and leads to data misses. Please suggest how I should handle this, or point me to a better method.
Thanks & Regards
Vamsi

Undefined behavior
When you read data, you must lock before getting the size or the front element; reading dustBin without holding the mutex while the other thread modifies it is a data race, and thus undefined behavior.
Busy waiting
You should always avoid a tight loop that does nothing. Here, if dustBin is empty, you will immediately check it again, forever, which will use 100% of that core, slow down everything else, drain a laptop battery, and make the machine hotter than it should be. It is a very bad idea to write such code!
Learn multithreading first
You should read a book or two on multithreading. Doing multithreading right is hard, and almost impossible without taking the time to learn it properly. C++ Concurrency in Action is highly recommended for standard C++ multithreading.
Condition variable
Usually you would use a condition variable, or some sort of event, to tell the consumer thread when data is added, so that it does not wake up uselessly to check whether that is the case.
Since you have a typical producer/consumer, you should be able to find a lot of information on how to do it, as well as special containers or other constructs that will help implement your code. A minimal sketch follows.
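As a rough illustration (one common shape for this, not the only one), a C++11 sketch with std::condition_variable might look like the following; the Student type is taken from the question, everything else is assumed:

#include <condition_variable>
#include <deque>
#include <mutex>

struct Student { int id; int age; };

std::deque<Student> dustBin;
std::mutex m;
std::condition_variable cv;

// Producer: push one item and wake the consumer.
void produce(const Student& st) {
    {
        std::lock_guard<std::mutex> lk(m);
        dustBin.push_back(st);
    }
    cv.notify_one();
}

// Consumer: sleeps until there is data, then takes one item.
Student consume() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return !dustBin.empty(); }); // no busy waiting
    Student st = dustBin.front();
    dustBin.pop_front();
    return st;
}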
Output
Your printf and cout calls will have an impact on performance, and since some are inside the lock and others are not, you will probably get improperly interleaved output. If you really need output, a third thread might be a better option. In any case, you want to minimize the time you hold a lock, so formatting into a temporary buffer might be a good idea too.
By the way, standard output is relatively slow, and it is perfectly possible that it is the very reason you are not able to process all the data rapidly enough.
Processing rate
Obviously, if you are able to produce 4096 bytes of data every 0.5 seconds but need 2 seconds to process that data, you have a serious problem.
You should really think about what you want to do in such a case before asking a question here; without that information, we can only guess at possible solutions.
Here are some possibilities:
Slow down the producer. Obviously, this does not work if you get data in real time.
Optimize the consumer (better algorithms, better hardware, optimal parallelism…)
Skip some data
Obviously, for performance problems, you should use a profiler to find out where you lose your time. Once you know that, you will have a better idea of where to look to improve your code.
Taking 2 seconds to process the data is really slow but we cannot help you since we have no idea of what your code is doing.
For example, if you add the data into a database and it is not able to follow up, you might want to batch multiple insert into a single command to reduce the overhead of communicating with the database over the network.
Another example, would be if you append the data to a file, you might want to keep the file open and accumulate some data before doing each write.
Container
A vector is not a good choice if you remove items from the head one by one and its size becomes somewhat large (say, more than 100 small items), as every remaining item has to be moved each time.
In addition to changing the container as suggested in a comment, another possibility is to use two vectors and swap them. That way, you reduce the number of times you lock the mutex and can process many items without holding the lock, as sketched below.
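A rough sketch of the two-vector idea (the Student type is from the question; in a real program you would pair this with the condition variable shown above instead of spinning):

#include <mutex>
#include <vector>

struct Student { int id; int age; };

std::vector<Student> inbox; // the producer push_backs into this, under the mutex
std::mutex m;

void consumerLoop() {
    std::vector<Student> batch;
    for (;;) {
        {
            std::lock_guard<std::mutex> lk(m);
            batch.swap(inbox); // one short lock takes everything received so far
        }
        for (const Student& st : batch) {
            // process st here, without holding the mutex
            (void)st;
        }
        batch.clear();
    }
}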
How to optimize
You should accumulate enough data (say 30 seconds' worth), stop accumulating, and then test your processing speed on that data. If you cannot process it in less than about half the time (15 seconds), then you clearly need to improve your processing speed one way or another. Once your consumer(s) are fast enough, you can optimize communication from the producer to the consumer(s).
You have to know if your bottleneck is I/O, database or what and if some part might be done in parallel.
There are probably a lot of optimizations that can be done in the code you have not shown...

If you can't handle messages fast enough, you have to drop some.
Use a circular buffer of a fixed size.
Then if the provider is faster than the consumer, older entries will be overwritten.
If you cannot skip some data and you cannot process it fast enough, you are doomed.
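A minimal sketch of such an overwriting ring buffer (the Message type and capacity are illustrative; in a multithreaded program push and pop would also need a mutex or atomics):

#include <array>
#include <cstddef>

struct Message { char data[4096]; };

constexpr std::size_t kCapacity = 64;

std::array<Message, kCapacity> ring;
std::size_t head = 0;  // index of the oldest entry
std::size_t count = 0; // number of valid entries

// Producer: when full, overwrite the oldest entry.
void push(const Message& msg) {
    ring[(head + count) % kCapacity] = msg;
    if (count == kCapacity)
        head = (head + 1) % kCapacity; // the oldest message is dropped
    else
        ++count;
}

// Consumer: returns false when there is nothing to read.
bool pop(Message& out) {
    if (count == 0) return false;
    out = ring[head];
    head = (head + 1) % kCapacity;
    --count;
    return true;
}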

Create two const variables, NBUFFERS and NTHREADS; make them both 8 initially if you have 16 cores and your processing is 4x too slow. Play with these values later.
Create NBUFFERS data buffers, each big enough to hold 4096 samples. In practice, just create a single large buffer and use offsets into it to divide it up.
Start NTHREADS threads. Each will continuously wait to be told which buffer to process, process it, and then wait for another buffer.
In your main program, go into a loop, receiving data. Receive the first 4096 samples into the first buffer and notify the first thread. Receive the second 4096 samples into the second buffer and notify the second thread.
buffer = (buffer + 1) % NBUFFERS
thread = (thread + 1) % NTHREADS
Rinse and repeat. As you have 8 threads and data arrives only every 0.5 seconds, each thread gets a new buffer only every 4 seconds, but needs only 2 seconds to clear the previous one.
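A rough sketch of this scheme (receive() and process() are assumed stand-ins for the question's I/O and its 2-second processing step; the notification uses one condition variable per worker):

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

constexpr int NBUFFERS = 8;
constexpr int NTHREADS = 8;
constexpr int NSAMPLES = 4096;

char bigBuffer[NBUFFERS][NSAMPLES]; // one large buffer, divided by offsets

void receive(char* buf, int n) { /* assumed: reads one chunk (~0.5 s) */ }
void process(char* buf, int n) { /* assumed: the ~2 s processing step */ }

struct Worker {
    std::mutex m;
    std::condition_variable cv;
    int pending = -1; // index of the buffer to process; -1 means none
};

Worker workers[NTHREADS];

void workerLoop(int id) {
    Worker& w = workers[id];
    for (;;) {
        std::unique_lock<std::mutex> lk(w.m);
        w.cv.wait(lk, [&] { return w.pending >= 0; });
        int buf = w.pending;
        w.pending = -1;
        lk.unlock();
        process(bigBuffer[buf], NSAMPLES); // 2 s of work; this slot recurs in 4 s
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < NTHREADS; ++i)
        pool.emplace_back(workerLoop, i);
    int buffer = 0, thread = 0;
    for (;;) {
        receive(bigBuffer[buffer], NSAMPLES);
        {
            std::lock_guard<std::mutex> lk(workers[thread].m);
            workers[thread].pending = buffer;
        }
        workers[thread].cv.notify_one();
        buffer = (buffer + 1) % NBUFFERS;
        thread = (thread + 1) % NTHREADS;
    }
}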

Related

empty std::queue pushing data to end of stale items

I am using a std::queue to buffer messages on my network (CAN bus, in this case). During an interrupt I add the message to the "inbox". Then my main program checks every cycle whether the queue is empty and, if not, handles the messages. The problem is that the queue is popped until empty (it exits from while (!inbox.empty())), but the next time I push data to it, it works as normal BUT the old data is still hanging out at the back.
For example, first message pushes a "1" to the queue. Loop reads
1
Next message is "2". Next read is
2
1
If I were to get in TWO messages before another read, "3", "4", then next read would be
3
4
2
1
I am very confused. I am also working with an STM32F0 ARM chip and mbed online, and have no idea if this is working poorly on the hardware or what!
I was concerned about thread safety, so I added an extra buffer queue and only push to the inbox when it "unlocked". And once I ran this I have not seen any conflict occur anyway!
Pusher code:
if (bInboxUnlocked) {
    while (!inboxBuffer.empty()) {
        inbox.push(inboxBuffer.front());
        inboxBuffer.pop();
    }
    inbox.push(msg);
} else {
    inboxBuffer.push(msg);
    printf("LOCKED!");
}
Main program read code
bInboxUnlocked = 0;
while (!inbox.empty()) {
    printf("%d\r\n", inbox.front().data);
    inbox.pop();
}
bInboxUnlocked = 1;
Thoughts anyone? Am I using this wrong? Any other ways to easily accomplish what I am doing? I expect the buffers to be small enough to implement a small circular array, but with queue on hand I was hoping not to have to do that.
Based on what I can figure out from a basic Google search, your CPU is essentially a single-core CPU. If so, then there should not be any memory-fencing issues to deal with here.
If, on the other hand, you had multiple CPU cores to deal with here, it will be necessary to either cram in explicit fences, in key places, or employ C++11 classes like std::mutex, that will take care of this for you.
But going with the original use case of a single CPU, and no memory fencing issues, if you can guarantee that:
A) There's some definite upper limit on the number of messages you expect to buffer by your interrupt handling code in the queue before it gets drained, and:
B) the messages you're buffering are PODs
Then a potential alternative to std::queue worth exploring here is to roll your own simple queue, using nothing more than a static std::array (or maybe a std::vector), an int head pointer, and an int tail pointer. A Google search should find plenty of examples of this simple algorithm:
The puller checks if head != tail; if so, it reads the message in queue[head] and increments head (increment means: head = (head + 1) % queuesize). The pusher checks whether incrementing tail (also modulo queuesize) would make it equal to head; if so, the queue has filled up (something that shouldn't happen, according to the prerequisites of this approach). If not, it puts the message into queue[tail] and increments tail.
If all of these operations are done in the right order, the net effect would be the same as using std::queue but:
1) Without the overhead of std::queue and the heap allocation it uses; this should be a major win on an embedded platform.
2) Since the queue is a contiguous block of memory, it takes advantage of the CPU caching that traditional CPUs rely on.
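A sketch of that roll-your-own queue (a Message struct and a size of 16 are assumed; one slot is kept unused to distinguish full from empty, and on a single-core MCU with an interrupt producer the indices should at least be volatile):

#include <array>

struct Message { int data; };

constexpr int kQueueSize = 16;

std::array<Message, kQueueSize> queue;
volatile int head = 0; // read index, advanced by the puller (main loop)
volatile int tail = 0; // write index, advanced by the pusher (interrupt)

// Pusher, called from the interrupt handler.
bool push(const Message& msg) {
    int next = (tail + 1) % kQueueSize;
    if (next == head)
        return false; // full; shouldn't happen per the prerequisites above
    queue[tail] = msg;
    tail = next;
    return true;
}

// Puller, called from the main loop.
bool pop(Message& out) {
    if (head == tail)
        return false; // empty
    out = queue[head];
    head = (head + 1) % kQueueSize;
    return true;
}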

Execute Functions on an Interval Basis C++

So I have a Kinect program that has three main functions that collect data and save it. I want one of these functions to execute as often as possible, while the other two run maybe 10 times every second.
while(1)
{
    ...
    //multi-threading to make sure color and depth events are aligned -> get skeletal data
    if (WaitForSingleObject(colorEvent, 0) == 0 && WaitForSingleObject(depthEvent, 0) == 0)
    {
        std::thread first(getColorImage, std::ref(colorEvent), std::ref(colorStreamHandle), std::ref(colorImage));
        std::thread second(getDepthImage, std::ref(depthEvent), std::ref(depthStreamHandle), std::ref(depthImage));
        if (WaitForSingleObject(skeletonEvent, INFINITE) == 0)
        {
            first.join();
            second.join();
            std::thread third(getSkeletonImage, std::ref(skeletonEvent), std::ref(skeletonImage), std::ref(colorImage), std::ref(depthImage), std::ref(myfile));
            third.join();
        }
        //if (check == 1)
        //check = 2;
    }
}
Currently my threading makes them all run at exactly the same time, but this slows down my computer a lot. I only need 'getColorImage' and 'getDepthImage' to run maybe 5-10 times/second, whereas 'getSkeletonImage' should run as much as possible.
I want 'getSkeletonImage' to run at max frequency (~30 times/second through the while loop) and then the 'getColorImage' and 'getDepthImage' to time synchronize (~5-10 times/second through the while loop)
What is a way I can do this? I am already using threads, but I need one to run consistently, and then the other two to join in intermittently essentially. Thank you for your help.
Currently, your main loop is creating the threads every iteration, which suggests each thread function runs once to completion. That introduces the overhead of creating and destroying threads every time.
Personally, I wouldn't bother with threads at all. Instead, in the main thread I'd do
void runSkeletonEvent(int n)
{
    for (int i = 0; i < n; ++i)
    {
        // wait required time (i.e. to next multiple of 1/30 second)
        skeletonEvent();
    }
}
// and, in your main function ....
while (termination_condition_not_met)
{
    runSkeletonEvent(3);
    colorEvent();
    runSkeletonEvent(3);
    depthEvent();
}
This interleaves the events, so skeletonEvent() runs six times for every time depthEvent() and colorEvent() are run. Just adjust the numbers as needed to get required behaviour.
You'll need to design the code for all the events so they don't run over time (if they do, all subsequent events will be delayed - there is no means to stop that).
The problem you'll then need to resolve is how to wait for the time to fire the skeleton event. A process of retrieving clock time, calculating how long to wait, and sleeping for that interval will do it. By sleeping (the thread yielding its time slice) your program will also be a bit better mannered (e.g. it won't be starving other processes of processor time).
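For example, the waiting inside runSkeletonEvent might be sketched with std::chrono like this (the 1/30-second tick is taken from the question's ~30 Hz target, and skeletonEvent is stubbed):

#include <chrono>
#include <thread>

void skeletonEvent() { /* the ~30 Hz work (assumed) */ }

void runSkeletonEvent(int n)
{
    using clock = std::chrono::steady_clock;
    static auto next = clock::now();
    const auto tick = std::chrono::microseconds(33333); // ~1/30 second

    for (int i = 0; i < n; ++i)
    {
        next += tick;
        std::this_thread::sleep_until(next); // yields the CPU instead of spinning
        skeletonEvent();
    }
}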
One advantage is that, if data is to be shared between the "events" (e.g. all of the events modify some global data) there is no need for synchronisation, because the looping above guarantees that only one "event" accesses shared data at one time.
Note: your usage of WaitForSingleObject() indicates you are using Windows. Windows (except, arguably, CE in a weak sense) is not really a realtime system, so it does not guarantee precise timing. In other words, the actual intervals you achieve will vary.
It is still possible to restructure to use threads. From your description, there is no evidence you really need anything like that, so I'll leave this reply at that.

Design of concurrent processing of a dual buffer system?

I have a long-running application that basically:
read packets off network
save it somewhere
process it and output to disk
A very common use-case indeed, except both the data size and the data rate can be quite large. To avoid overflowing memory and to improve efficiency, I am thinking of a dual-buffer design in which buffers A and B alternate: while A is holding network packets, B is processed for output. Once buffer A reaches a soft bound, A is due for output processing, and B will be used for holding network packets.
I am not particularly experienced with the concurrency/multi-threaded programming paradigm. I have read some past discussions on circular buffers that handle the multiple-producer, multiple-consumer case. I am not sure that is the best solution, and the dual-buffer design seems simpler.
My question is: is there a design pattern I can follow to tackle the problem? or better design for that matter? If possible, please use pseudo code to help to illustrate the solution. Thanks.
I suggest that you should, instead of assuming "two" (or any fixed number of ...) buffers, simply use a queue, and therefore a "producer/consumer" relationship.
The process that is receiving packets simply adds them to a buffer of some certain size, and, either when the buffer is sufficiently full or a specified (short...) time interval has elapsed, places the (non-empty) buffer onto a queue for processing by the other. It then allocates a new buffer for its own use.
The receiving ("other...") process is woken up any time there might be a new buffer in the queue for it to process. It removes the buffer, processes it, then checks the queue again. It goes to sleep only when it finds that the queue is empty. (Use care to be sure that the process cannot decide to go to sleep at the precise instant that the other process decides to signal it... there must be no "race condition" here.)
Consider simply allocating storage "per-message" (whatever a "message" may mean to you), and putting that "message" onto the queue, so that there is no unnecessary delay in processing caused by "waiting for a buffer to fill up."
It might be worth mentioning a technique used in real-time audio processing/recording: a single ring buffer (or FIFO, if you prefer that term) of sufficient size can be used for this case.
You will then need a read cursor and a write cursor. (Whether you actually need a lock or can get by with volatile plus memory barriers is a touchy subject, but the people at PortAudio suggest you do this without locks if performance is important.)
You can use one thread to read and another thread to write. The read thread should consume as much of the buffer as possible. You will be safe unless you run out of buffer space, but that risk exists for the dual-buffer solution as well. So the underlying assumption is that you can write to disk faster than the input comes in; otherwise you will need to expand on the solution.
Find a producer-consumer queue class that works. Use one to create a buffer pool to improve performance and control memory use. Use another to transfer the buffers from the network thread to the disk thread:
#define CnumBuffs 128
#define CbufSize 8182
#define CcacheLineSize 128

class netBuf {
public:
    char cacheLineFiller[CcacheLineSize]; // anti false-sharing space
    int dataLen;
    char bigBuf[CbufSize];
};

PCqueue* pool;      // recycles empty buffers
PCqueue* diskQueue; // carries filled buffers from the net thread to the disk thread
netThread* net;
diskThread* disk;

pool = new PCqueue;
diskQueue = new PCqueue;
// make an object pool
for (int i = 0; i < CnumBuffs; i++) {
    pool->push(new netBuf);
}
net = new netThread;
disk = new diskThread;
net->start();
disk->start();
..
void* netThread::run() {
    netBuf* thisBuf;
    for (;;) {
        pool->pop(&thisBuf); // blocks if pool is empty
        thisBuf->dataLen = network.read(thisBuf->bigBuf, sizeof(thisBuf->bigBuf));
        diskQueue->push(thisBuf);
    }
}

void* diskThread::run() {
    fileStream myFile("someFolder\\fileSpec", someEnumWrite);
    netBuf* thisBuf;
    for (;;) {
        diskQueue->pop(&thisBuf); // blocks until a buffer is available
        myFile.write(thisBuf->bigBuf, thisBuf->dataLen);
        pool->push(thisBuf); // return the buffer to the pool
    }
}
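PCqueue above is assumed rather than a real library; a minimal blocking version with the same push/pop shape can be built from standard parts, e.g.:

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class PCqueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(item));
        }
        cv.notify_one();
    }
    void pop(T* out) { // blocks while the queue is empty
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        *out = std::move(q.front());
        q.pop();
    }
private:
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
};

Both pool and diskQueue would then be PCqueue<netBuf*>, and pop blocking on an empty pool gives the flow control described above.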

UDP - lost data during microbursts

The code below runs great (i.e. doesn't drop messages) 99.9% of the time. But when there's a microburst of datagrams coming in at a rate of 2-3 microseconds between datagrams, I experience data loss. The Boost notify_one() call takes 5 to 10 microseconds to complete, so by itself it is the key bottleneck under these conditions. Any suggestions on how to improve performance?
Receiver/"producer" code thread:
if (bytes_recvd > 0) {
    InQ.mut.lock();
    string t;
    t.append(data_, bytes_recvd);
    InQ.msg_queue.push(t);  // < 1 microsec
    InQ.mut.unlock();
    InQ.cond.notify_one();  // 5 - 10 microsecs
}
Consumer code thread:
//snip......
std::string s;
while (1) {
    InQ.mut.lock();
    if (!InQ.msg_queue.empty()) {
        s.clear();
        s = InQ.msg_queue.front();
        InQ.msg_queue.pop();
    }
    InQ.mut.unlock();
    if (s.length()) {
        processDatagram((char *)s.c_str(), s.length());
        s.clear();
    }
    boost::mutex::scoped_lock lock(InQ.mut);
    InQ.cond.wait(lock);
}
Just change
if (!InQ.msg_queue.empty()) {
to
while (!InQ.msg_queue.empty()) {
That way packets don't have to wake the thread to get processed; if the thread is already awake and busy, it will see the new packet before sleeping.
OK, it's not quite that simple, because you need to release the lock between packets, but the idea will work: before sleeping, check whether the queue is empty.
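Putting those pieces together, a sketch of the reworked consumer (keeping the question's InQ names and Boost types; the drain-then-wait shape is the point):

std::string s;
boost::mutex::scoped_lock lock(InQ.mut);
while (1) {
    // Drain everything already queued before going back to sleep.
    while (!InQ.msg_queue.empty()) {
        s = InQ.msg_queue.front();
        InQ.msg_queue.pop();
        lock.unlock(); // don't hold the lock while processing
        processDatagram((char *)s.c_str(), s.length());
        lock.lock();
    }
    InQ.cond.wait(lock); // sleeps only once the queue is known to be empty
}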
If you're losing data, try increasing your socket receive buffer size. If you're using boost::asio, look into this option: boost::asio::socket_base::receive_buffer_size. Generally, for our high-throughput UDP applications, we set the socket buffer size to 1 MB (more in some cases).
Also, make sure that the buffers you're using in your receive calls are not too large, they should only be large enough to handle your maximum expected datagram size (which is obviously implementation dependent).
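For example (a sketch; the socket object is assumed to be an open boost::asio UDP socket):

boost::asio::socket_base::receive_buffer_size option(1024 * 1024); // 1 MB
socket.set_option(option);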
Your obvious clog is in the condition-variable signalling.
Your main hope would be to use a lock-free queue implementation. This is probably an obvious statement to you.
The only way to really get a lock-free queue to work for you, of course, is if you have multiple cores and don't mind dedicating one to the consuming task.
Some general suggestions:
Increase socket receive buffer size.
Read all available datagrams, then pass them all on for processing.
Avoid data copying, pass pointers around.
Reduce lock scope to absolute minimum, say, only push/pop a pointer onto/off the queue under that mutex.
If all above fails you, look into lock-free data structures to pass data around.
Hope this helps.

Overhead due to use of Events

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following: I have around 1000 outer loops, each of around 10'000 iterations, to do. These outer loops must be executed sequentially, but I have 4 CPUs available. What I try to do is split each 10'000-iteration loop into four 2'500-iteration loops, i.e. one per thread. But I have to wait for the 4 small loops to finish before going on to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive!
My question is: is it normal that using events takes "a lot of" time? If so, is there another mechanism that I could use that would be less time-expensive?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
    // some housekeeping
    HANDLE signals[2];
    signals[0] = waitSignal;
    signals[1] = endSignal;
    do {
        // wait for one of the signals
        waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
        // try to get the next job's parameters
        if (tp->getNextJob(threadId, data)) {
            // execute job
            void* output = jobFunc(data.params);
            // tell thread pool that we're done and collect output
            tp->collectOutput(data.ID, output);
        }
        tp->threadDone(threadId);
    } while (waitResult - WAIT_OBJECT_0 == 0);
    // if we reach this point, endSignal was sent, so we are done!
    return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
    threadData data;
    unsigned int threadId = 0;
    char eventName[20];
    sprintf_s(eventName, 20, "WaitSignal_%d", i);
    data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
                                          this, CREATE_SUSPENDED, &threadId);
    data.threadId = threadId;
    data.busy = false;
    data.waitSignal = CreateEvent(NULL, true, false, eventName);
    this->threads[threadId] = data;
    // start thread
    ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
    // housekeeping
    EnterCriticalSection(&(this->mutex));
    // first, insert parameters in the list
    this->jobs.push_back(job);
    // then, find the first free thread and wake it
    for (it = this->threads.begin(); it != this->threads.end(); ++it) {
        thread = (threadData) it->second;
        if (!thread.busy) {
            this->threads[thread.threadId].busy = true;
            ++(this->nbActiveThreads);
            // wake thread so that it gets the next params and runs them
            SetEvent(thread.waitSignal);
            break;
        }
    }
    LeaveCriticalSection(&(this->mutex));
}
This looks to me like a producer-consumer pattern, which can be implemented with two semaphores: one guarding the queue against overflow, the other signalling a non-empty queue.
You can find some details here.
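A minimal sketch of that two-semaphore scheme, written with C++20's std::counting_semaphore for concreteness (the original answer predates it; Windows semaphores work the same way):

#include <mutex>
#include <queue>
#include <semaphore>

constexpr int kCapacity = 64;

std::queue<int> jobs; // int stands in for a job description
std::mutex m;
std::counting_semaphore<kCapacity> freeSlots(kCapacity); // guards against overflow
std::counting_semaphore<kCapacity> usedSlots(0);         // guards the empty queue

void produce(int job) {
    freeSlots.acquire(); // blocks while the queue is full
    {
        std::lock_guard<std::mutex> lk(m);
        jobs.push(job);
    }
    usedSlots.release();
}

int consume() {
    usedSlots.acquire(); // blocks while the queue is empty
    int job;
    {
        std::lock_guard<std::mutex> lk(m);
        job = jobs.front();
        jobs.pop();
    }
    freeSlots.release();
    return job;
}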
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out: you are still asking for the next job after endSignal is emitted.
for (;;) {
    // wait for one of the signals
    waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
    if (waitResult - WAIT_OBJECT_0 != 0)
        return 0;
    // ...
}
Since you say that it is much slower in parallel than in sequential execution, I assume that the processing time of your inner 2'500-iteration loops is tiny (in the few-microseconds range). Then there is not much you can do except review your algorithm to split off larger chunks of processing; OpenMP won't help, and neither will other synchronization techniques, because they all fundamentally rely on events (spin loops do not qualify).
On the other hand, if the processing time of the 2'500 loop iterations is larger than 100 microseconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it across four processors will not give you more bandwidth; it will actually give you less because of collisions. You could also be running into cache-cycling problems, where each of your top 1000 iterations flushes and reloads the caches of the 4 cores. Then there is no single solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using VS 2008, I'd suggest looking at OpenMP. If you're using Visual Studio 2010 beta 1, I'd suggest looking at the Parallel Pattern Library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work items are small, which is why I suggest using an implementation already built. You also need to ensure that you aren't running a debug build or under a debugger, that the tasks themselves aren't blocking on a lock, I/O, or memory allocation, and that you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf, the F1 profiler in Visual Studio 2010 beta 1 (it has 2 new concurrency modes which help see contention), or Intel's VTune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just once, to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference, or iterator to them to the thread instead of copying the data.
The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but it should not be that expensive unless your processing is really fast. In that case, switching between threads is expensive too, hence the first part of my answer on doing things sequentially...
You should look for inter-thread synchronisation bottlenecks. You can trace thread waiting times to begin with...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer's cores/processors to parallelize some processing that is essentially sequential.
Say you have 4 cores and 10'000 loops to compute, as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process: you just need to give your four threads the nth, n+1th, n+2th, and n+3th loops, wait for the four threads to complete, and then go on. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. You can look at the Windows implementation for efficiency. Your thread pool is not really suited to the task: the search for an available thread inside a critical section is what is killing your CPU time, not the event part.
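A sketch of that barrier approach using boost::barrier, which the answer mentions (processChunk stands in for one thread's 2'500-iteration share; C++20's std::barrier would work similarly):

#include <boost/thread/barrier.hpp>
#include <boost/thread/thread.hpp>

const int kThreads = 4;
const int kOuterLoops = 1000;

boost::barrier rendezvous(kThreads); // releases everyone once all 4 arrive

void processChunk(int iteration, int threadId) {
    // assumed: this thread's 2'500 iterations of the 10'000-iteration loop
}

void worker(int threadId) {
    for (int i = 0; i < kOuterLoops; ++i) {
        processChunk(i, threadId);
        rendezvous.wait(); // wait for the other three before the next big iteration
    }
}

int main() {
    boost::thread_group pool;
    for (int t = 0; t < kThreads; ++t)
        pool.create_thread([t] { worker(t); });
    pool.join_all();
}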
"It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive!"
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with thread pools here in the past, I'd suggest adding some debug code to count the number of times your thread function is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system already keeps track of that and does it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work, yet it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are always idle processors. I would suggest reorganizing the code so that addJob() does what it is named for: it adds a job ONLY (without finding or even caring who does the job), while each worker thread actively takes new work when it is done with its existing work.