Related
I am using ALSA to read in an audio stream. I have set my period size to 960 frames which are received every 20ms. I read in the PCM values using: snd_pcm_readi(handle, buffer, period_size);
Every time I fill this buffer, I need to iterate through it and perform multiple checks on the received values. Iterating through this using a simple for loop takes too long and I get a buffer overrun error on subsequent calls to snd_pcm_readi(). I have been told not to increase the ALSA buffer_size to prevent this from happening. A friend suggested I create a seperate thread to iterate through this buffer and perform the checks? How would this work given that I don't know exactly how long it will take for snd_pcm_readi() to fill the buffer, so knowing when to lock the buffer is a bit confusing to me.
A useful way to multithread an application for signal processing and computation is to use OpenMP (OMP). This avoids the developer having to use locking mechanisms themselves to synchronise multiple computation threads. Typically using locking is a bad thing in real time application programming.
In this example, a for loop is multithreaded in the C language :
#include <omp.h>
int N=960;
float audio[960]; // the signal to process
#pragma omp parallel for
for (int n=0; i<N; n++){ // perform the function on each sample in parallel
printf("n=%d thread num %d\n", n, omp_get_thread_num()); // remove this to speed up the code
audio[n]; // process your signal here
}
You can see a concrete example of this in the gtkIOStream FIR code base. The FIR filter's channels are processed independently, one per available thread.
To initalise the actual OMP subsystem specify the number of threads you want to use like this to maximised the available threads :
int MProcessors = omp_get_max_threads();
omp_set_num_threads(MProcessors);
If you would prefer to look at an approach which uses locking techniques, then you could go for a concurrency pattern such as that developed for the Nuclear Processing.
I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles well whether it be using boost or pthreads. Remember this is pseudo code designed to illustrate the problem and not directly compilable.
The problem I am having is that for a multithreaded function the memory usage and processing time is always greater than if the same function is acheived using serial programming e.g for/while loop.
Here is a simplified version of the problem I am facing:
class aproject(){
public:
typedef struct
{
char** somedata;
double output,fitness;
}entity;
entity **entity_array;
int whichthread,numthreads;
pthread_mutex_t mutexdata;
aproject(){
numthreads = 100;
*entity_array=new entity[numthreads];
for(int i;i<numthreads;i++){
entity_array[i]->somedata[i] = new char[100];
}
/*.....more memory allocations for entity_array.......*/
this->initdata();
this->eval_thread();
}
void initdata(){
/**put zeros and ones in entity_array**/
}
float somefunc(char *somedata){
float output=countzero(); //someother function not listed
return output;
}
void* thread_function()
{
pthread_mutex_lock (&mutexdata);
int currentthread = this->whichthread;
this->whichthread+=1;
pthread_mutex_unlock (&mutexdata);
entity *ent = this->entity_array[currentthread];
double A=0,B=0,C=0,D=0,E=0,F=0;
int i,j,k,l;
A = somefunc(ent->somedata[0]);
B = somefunc(ent->somedata[1]);
t4 = anotherfunc(A,B);
ent->output = t4;
ent->fitness = sqrt(pow(t4,2));
}
static void* staticthreadproc(void* p){
return reinterpret_cast<ga*>(p)->thread_function();
}
void eval_thread(){
//use multithreading to evaluate individuals in parallel
int i,j,k;
nthreads = this->numthreads;
pthread_t threads[nthreads];
//create threads
pthread_mutex_init(&this->mutexdata,NULL);
this->whichthread=0;
for(i=0;i<nthreads;i++){
pthread_create(&threads[i],NULL,&ga::staticthreadproc,this);
//printf("creating thread, %d\n",i);
}
//join threads
for(i=0;i<nthreads;i++){
pthread_join(threads[i],NULL);
}
}
};
I am using pthreads here because it works better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there aswell. I am using a mutex to ensure every thread starts with the correct index for the entity_array, as each thread only applys its work to its respective entity_array indexed by the variable this->whichthread. This variable is the only thing that needs to be locked by the mutex as it is updated for every thread and must not be changed by other threads. You can happily ignore everything else apart from the thread_function, eval_threads, and the staticthreadproc as they are the only relevent functions assume that all the other functions apart from init to be both processor and memory intensive.
So my question is why is it that using multithreading in this way is IT more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE THE CODE IS PSEUDO CODE AND THE PROBLEM ISNT WHETHER IT WILL COMPILE
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires it's own call-stack, which consumes memory. Every local variable of your function (and all other functions on the call-stack) counts to that memory.
When creating a new thread, space for its call-stack is reserved. I don't know what the default-value is for pthreads, but you might want to look into that. If you know you require less stack-space than is reserved by default, you might be able to reduce memory-consumption significantly by explicitly specifying the desired stack-size when spawning the thread.
As for the performance-part - it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (don't know if that is the case here). This might end up being slower, due to the additional overhead of context-switches, increased amount of cache-misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache-misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation; when one thread writes its results in to its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.
I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores- user should be larger than real as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
// let the camera know to use this raytracer for probing the scene
cam.setSamplingFunc(getSamplingFunction());
int i, j;
#pragma omp parallel private(i, j)
{
// Construct a ray for each pixel.
#pragma omp for schedule(dynamic, 4)
for (i = 0; i < cam.height(); ++i) {
for (j = 0; j < cam.width(); ++j) {
cam.computePixel(i, j);
}
}
}
}
When reading this question I thought I had found my answer. It talks about the implementation of gclib rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for monte carlo sampling, so i thought that was the problem. I got rid of calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops turns out I didn't test this correctly, it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection, however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
// constructors destructors
private:
// this is the array that is being written to, but not read from.
Colour* _sensor; // allocated using new at construction.
}
void Camera::computePixel(int i, int j) const {
Colour col;
// simple code to construct appropriate ray for the pixel
Ray3D ray(/* params */);
col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
_sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There is quite a few factors that could affect the performance of your code, one thing that comes to mind is the cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for the data come from the RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely, that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity, also plays a role. A thread will be scheduled on a particular core, and normally it will be attempted to keep this thread on the same core. If more then one of your threads are running on a single core, they will have to share the same core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end, data that other threads need to be able to do computePixel ?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked and the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see it that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times of my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.
I have a classic physics-thread vs. graphics-thread problem:
Say I'm running one thread for the physics update and one thread for rendering.
In the physics thread (pseudo-code):
while(true)
{
foreach object in simulation
SomeComplicatedPhysicsIntegration( &object->modelviewmatrix);
//modelviewmatrix is a vector of 16 floats (ie. a 4x4 matrix)
}
and in the graphics thread:
while(true)
{
foreach object in simulation
RenderObject(object->modelviewmatrix);
}
Now in theory this would not require locks, as one thread is only writing to the matrices and another is only reading, and I don't care about stale data so much.
The problem that updating the matrix is not an atomic operation and sometimes the graphics thread will read only partially updated matrices (ie. not all 16 floats have been copied, only part of them) which means part of the matrix is from one physics frame and part is from the previous frame, which in turn means the matrix is nolonger affine (ie. it's basically corrupted).
Is there any good method of preventing this without using locks? I read about a possible implementation using double buffering, but I cannot imagine a way that would work without syncing the threads.
Edit: I guess what I'd really like to use is some sort of triple buffering like they use on graphic displays.. anyone know of a good presentation of the triple buffering algorithm?
Edit 2: Indeed using non-synced triple buffering is not a good ideea (as suggested in the answers below). The physics thread can run mutiple cycles eating a lot of CPU and stalling the graphics thread, computing frames that never even get rendered in the end.
I have opted for a simple double-buffered algorithm with a single lock, where the physics thread only computes as much as 1 frame in advance of the graphics thread before swapping buffers. Something like this:
Physics:
while(true)
{
foreach physicstimestep
foreach object in simulation
SomeComplicatedPhysicsIntegration( &object->modelviewmatrix.WriteBuffer);
LockSemaphore()
SwapBuffers()
UnlockSemaphore()
}
Graphics:
while(true)
{
LockSemaphore()
foreach object in simulation
RenderObject(object->modelviewmatrix.ReadBuffer);
UnlockSemaphore()
}
How does that sound?
You could maintain a shared queue between the two threads, and implement the physics thread such that it only adds a matrix to the queue after it has fully populated all of the values in that matrix. This assumes that the physics thread allocates a new matrix on each iteration (or more specifically that the matrices are treated as read-only once they are placed in the queue).
So any time your graphics thread pulls a matrix out of the queue, it is guaranteed to be fully populated and a valid representation of the simulation state at the time at which the matrix was generated.
Note that the graphics thread will need to be able to handle cases in which the queue is empty for one or more iterations, and that it would probably be a good idea to world-timestamp each queue entry so that you have a mechanism of keeping the two threads reasonably in sync without using any formal synchronization techniques (for instance, by not allowing the graphics thread to consume any matrices that have a timestamp that is in the future, and by allowing it to skip ahead in the queue if the next matrix is from too far in the past). Also note that whatever queue you use must be implemented such that it will not explode if the physics thread tries to add something at the same time that the graphics thread is removing something.
but I cannot imagine a way that would work without syncing the threads.
No matter what kind of scheme you are using, synchronizing the threads is an absolute essential here. Without synchronization you run the risk that your physics thread will race far ahead of the graphics thread, or vice versa. Your program, typically a master thread that advances time, needs to be in control of thread operations, not the threading mechanism.
Double buffering is one scheme that lets your physics and graphics threads run in parallel (for example, you have a multi-CPU or multi-core machine). The physics thread operates on one buffer while the graphics thread operates on the other. Note that this induces a lag in the graphics, which may or may not be an issue.
The basic gist behind double buffering is that you duplicate your data to be rendered on screen.
If you run with some sort of locking, then your simulation thread will always be rendering exactly one frame ahead of the display thread. Every piece of data that gets simulated gets rendered. (The synchronization doesn't have to be very heavy: a simple condition variable can frequently be updated and wake the rendering thread pretty cheaply.)
If you run without synchronization, your simulation thread might simulate events that never get rendered, if the rendering thread cannot keep up. If you include a monotonically increasing generation number in your data (update it after each complete simulation cycle), then your rendering thread can simply busy-wait on the two generation numbers (one for each buffer of data).
Once one (or both) of the generation numbers is greater than the most-recently-rendered generation, copy the newest buffer into the rendering thread, update the most-recently-rendered counter, and start rendering. When it's done, return to busy waiting.
If your rendering thread is too fast, you may chew through a lot of processor in that busy wait. So this only makes sense if you expect to periodically skip rendering some data and almost never need to wait for more simulation.
Don't update the matrix in the physics thread?
Take a chunk, (perhaps a row you have just rendered), and queue its position/size/whatever to the physics thread. Invert/transpose/whateverCleverMatrixStuff the row of modelviewmatrix's into another, new row. Post it back to the render thread. Copy the new row in at some suitable time in your rendering. Perhaps you do not need to copy it in - maybe you can just swap out an 'old' vector for the new one and free the old one?
Is this possible, or is the structure/manipulation of your matrices/whatever too complex for this?
All kinda depends on the structure of your data, so this solution may well be inappropriate/impossible.
Rgds,
Martin
Now in theory this would not require locks, as one thread is only writing to the matrices and another is only reading, and I don't care about stale data so much.
Beware: without proper synchronization, there is no guarantee that the reading thread will ever observe any changes by the writing thread. This aspect is known as visibility, and sadly, it is often overlooked.
Assuming lockless or near-lockless updates is actually what would solve your problem best, it sounds like you want the physics thread to calculate a new matrix, and then instantaneously update all those values at once, so it doesn't matter what version of the matrices the graphics thread gets, as long as (a) it gets them eventually and (b) it never gets half of one and half of the old one.
In which case, it sounds like you want a physics thread something like:
/* pseudocode */
while (true) foreach (object in simulation) {
auto new_object = object;
SomeComplicatedPhysicsIntegrationInPlace(new_object)
atomic_swap(object, new_object); // In pseudocode, ignore return value since nowhere
// else changes value of object. In code, use assert, etc
}
Alternativelty, you can calculate a new state of the whole simulation, and then swap the the values. An easy way of implementing this would be:
/* Psudocode */
while (true) {
simulation[ 1-global_active_idx ] = simulation[ global_active_idx ];
foreach (object in simulation[global_inactive_idx]) {
SomeComplicatedPhysicsIntegrationInPlace(object);
}
global_active_idx = 1-global_active_idx; // implicitly assume this is atomic
}
The graphics thread should constantly render simulation[global_active_idx].
In fact, this doesn't work. It will work in many situations because typically, writing a 1 to memory location containing 0 is in fact atomic on most processors, but it's not guaraneed to work. Specifically, the other thread may never reread that value. Many people bodge this by declaring the variable volatile, while works on many compilers, but is not guaranteed to work.
However, to make either example work, all you need is an atomic write instruction, while C++ doesn't provide until C++0x, but is fairly easy for compilers to implement, since most "write an int" instructions ARE atomic, the compiler just has to ensure that.
So, you can write your code using an atomic_swap function at the end of the physics loop, and implement that in terms of (a) a lock, write, unlock sequence -- which shouldn't block the graphics thread significantly, beause it's only blocking for the length of time of one memory write, and possibly only doing so once per whole frame or (b) a compiler's built in atomic support, eg. http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
There are similar solutions, eg. the physics thread updating a semaphore, which the graphics thread treats simply as a variable with value 0 or 1; eg. the physics thread posts finished calculations to a queue (which is implemented internally in a similar way to the above), and the graphics thread constantly renders the top of the queue, repeating the last one if the queue underflows.
However, I'm not sure I'm understanding your problem. Why is there any point updating the graphics if the physics hasn't changed? Why is there any point updating the physics faster than the physics, can't it extrapolate further in each chunk? Does locking the update actually make any difference?
You can make two matrices fit in one cache line size (128 bytes) and aligned it to 128 bytes. This way the matrix is loaded in a cache line and therefore write to and from memory in one block. This is not for the feint of heart though and will require much more work. (This only fixes the problem of reading a matrix while it is being updated and getting a non-orthogonal result.)
I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following : I have around 1000 loops of each around 10'000 iterations do to. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000 iteration loops into 4 2'500 iterations loops, ie one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
// some housekeeping
HANDLE signals[2];
signals[0] = waitSignal;
signals[1] = endSignal;
do {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
// try to get the next job parameters;
if (tp->getNextJob(threadId, data)) {
// execute job
void* output = jobFunc(data.params);
// tell thread pool that we're done and collect output
tp->collectOutput(data.ID, output);
}
tp->threadDone(threadId);
}
while (waitResult - WAIT_OBJECT_0 == 0);
// if we reach this point, endSignal was sent, so we are done !
return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
threadData data;
unsigned int threadId = 0;
char eventName[20];
sprintf_s(eventName, 20, "WaitSignal_%d", i);
data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
this, CREATE_SUSPENDED, &threadId);
data.threadId = threadId;
data.busy = false;
data.waitSignal = CreateEvent(NULL, true, false, eventName);
this->threads[threadId] = data;
// start thread
ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
// housekeeping
EnterCriticalSection(&(this->mutex));
// first, insert parameters in the list
this->jobs.push_back(job);
// then, find the first free thread and wake it
for (it = this->threads.begin(); it != this->threads.end(); ++it) {
thread = (threadData) it->second;
if (!thread.busy) {
this->threads[thread.threadId].busy = true;
++(this->nbActiveThreads);
// wake thread such that it gets the next params and runs them
SetEvent(thread.waitSignal);
break;
}
}
LeaveCriticalSection(&(this->mutex));
}
This looks to me as a producer consumer pattern, which can be implented with two semaphores, one guarding the queue overflow, the other the empty queue.
You can find some details here.
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
if( waitResult - WAIT_OBJECT_0 != 0 )
return;
//....
}
Since you say that it is much slower in parallel than sequential execution, I assume that your processing time for your internal 2500 loop iterations is tiny (in the few micro seconds range). Then there is not much you can do except review your algorithm to split larger chunks of precessing; OpenMP won't help and every other synchronization techniques won't help either because they fundamentally all rely on events (spin loops do not qualify).
On the other hand, if your processing time of the 2500 loop iterations is larger than 100 micro seconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it to four processors will not give you more bandwidth, it will actually give you less because of collisions. You could also be running into problems of cache cycling where each of your top 1000 iteration will flush and reload the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using vs 2008, I'd suggest looking at OpenMP. If you're using visual studio 2010 beta 1, I'd suggesting looking at the parallel pattern library, particularly the "parallel for" / "parallel for each"
apis or the "task group class because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, here it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode, under a debugger and that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf the f1 profiler in visual studio 2010 beta 1 (it has 2 new concurrency modes which help see contention) or Intel's vtune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. that way, you can pass a hundred jobs to the thread pool and the sync objects will be used just the once to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.
The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but should not be that expensive, unless your processing is really fast to achieve. In this case, switching between thredas is expensive too, hence my answer first part on doing things sequencially ...
You should look for inter-threads synchronisation bottlenecks. You can trace threads waiting times to begin with ...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer cores/processors to parralellize some processing essencialy sequential.
Take that your have 4 cores and 10000 loops to compute as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process. You just need to give your four threads thr nth, nth+1, nth+2, nth+3 loops, wait for the four threads to complete then going on. You should use a rendezvous or barrier (a synchronization mechanism that wait for n threads to complete). Boost has such a mechanism. You can look the windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time. Not the event part.
It appears that this whole event thing
(along with the synchronization
between the threads using critical
sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with threadpools here in the past, I'd suggest some debug
to count the number of times that your threadfunction is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work. But it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are idle processors always. I would suggest reorganizing the code so that addJob() does what it is named -- adds a job ONLY (without finding or even caring who does the job) while each worker thread actively gets new work when it is done with its existing work.