C++ Multithreading decoding audio data [closed] - c++

I need to decode audio data as fast as possible using the Opus decoder.
Currently my application is not fast enough.
The decoding is as fast as it can get, but I need to gain more speed.
I need to decode about 100 sections of audio. These sections are not consecutive (they are not related to each other).
I was thinking about using multithreading so that I don't have to wait until each of the 100 decodings is completed. In my dreams I could start everything in parallel.
I have not used multithreading before.
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Thank you.

This answer is probably going to need a little refinement from the community, since it's been a long while since I worked in this environment, but here's a start -
Since you're new to multi-threading in C++, start with a simple project to create a bunch of pthreads doing a simple task.
Here's a quick and small example of creating pthreads:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* ThreadStart(void* arg);

int main(int argc, char** argv) {
    pthread_t thread1, thread2;
    int* threadArg1 = (int*)malloc(sizeof(int));
    int* threadArg2 = (int*)malloc(sizeof(int));
    *threadArg1 = 1;
    *threadArg2 = 2;

    pthread_create(&thread1, NULL, &ThreadStart, (void*)threadArg1);
    pthread_create(&thread2, NULL, &ThreadStart, (void*)threadArg2);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    free(threadArg1);
    free(threadArg2);
    return 0;
}

void* ThreadStart(void* arg) {
    int threadNum = *((int*)arg);
    printf("hello world from thread %d\n", threadNum);
    return NULL;
}
Next, you're going to be using multiple opus decoders. Opus appears to be thread safe, so long as you create separate OpusDecoder objects for each thread.
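As a rough illustration of per-thread decoder ownership, here is a sketch (not the author's code; the 48 kHz stereo parameters, buffer sizes, and decode_section() itself are assumptions to check against your streams):

#include <opus/opus.h>   // or <opus.h>, depending on your include paths

// Decode one independent section with a decoder owned by this call/thread.
// The function name and its parameters are illustrative assumptions.
void decode_section(const unsigned char* packet, int packetLen) {
    int err = 0;
    // One OpusDecoder per thread/section; decoders must not be shared.
    OpusDecoder* dec = opus_decoder_create(48000, 2, &err);
    if (err != OPUS_OK || dec == NULL)
        return;

    // 5760 = maximum frame size (120 ms at 48 kHz), times 2 channels.
    opus_int16 pcm[5760 * 2];
    int samples = opus_decode(dec, packet, packetLen, pcm, 5760, 0);
    if (samples > 0) {
        // ... hand `pcm` (samples per channel) to the rest of the pipeline ...
    }
    opus_decoder_destroy(dec);
}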
To feed jobs to your threads, you'll need a list of pending work units that can be accessed in a thread-safe manner. You can use std::vector or std::queue, but you'll have to use locks around it when adding and removing items, and you'll want a counting semaphore so that the threads block, but stay alive, while you slowly add work units to the queue (say, buffers of files read from disk).
Here's some example code, similar to the above, that shows how to use a shared queue and how to make the threads wait while you fill the queue:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <queue>
#include <semaphore.h>
#include <unistd.h>
void* ThreadStart(void* arg);
static std::queue<int> workunits;
static pthread_mutex_t workunitLock;
static sem_t workunitCount;
int main(int argc, char** argv) {
    pthread_t thread1, thread2;

    pthread_mutex_init(&workunitLock, NULL);
    sem_init(&workunitCount, 0, 0);

    pthread_create(&thread1, NULL, &ThreadStart, NULL);
    pthread_create(&thread2, NULL, &ThreadStart, NULL);

    // Make a bunch of workunits while the threads are running.
    for (int i = 0; i < 200; i++) {
        pthread_mutex_lock(&workunitLock);
        workunits.push(i);
        sem_post(&workunitCount);
        pthread_mutex_unlock(&workunitLock);

        // Pretend that it takes some effort to create work units;
        // this shows that the threads really do block patiently
        // while we generate workunits.
        usleep(5000);
    }

    // Sometime in the next while, the threads will be blocked on
    // sem_wait because they're waiting for more workunits. None
    // of them are quitting because they never saw an empty queue.
    // Pump the semaphore once for each thread so they can wake
    // up, see the empty queue, and return.
    sem_post(&workunitCount);
    sem_post(&workunitCount);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    pthread_mutex_destroy(&workunitLock);
    sem_destroy(&workunitCount);
    return 0;
}

void* ThreadStart(void* arg) {
    int workUnit;
    bool haveUnit;
    do {
        sem_wait(&workunitCount);
        pthread_mutex_lock(&workunitLock);

        // Figure out if there's a unit, grab it under
        // the lock, then release the lock as soon as we can.
        // After we release the lock, then we can 'process'
        // the unit without blocking everybody else.
        haveUnit = !workunits.empty();
        if (haveUnit) {
            workUnit = workunits.front();
            workunits.pop();
        }
        pthread_mutex_unlock(&workunitLock);

        // Now that we're not under the lock, we can spend
        // as much time as we want processing the workunit.
        if (haveUnit) {
            printf("Got workunit %d\n", workUnit);
        }
    } while (haveUnit);
    return NULL;
}

You would break your work up by task. Let's assume your process is in fact CPU bound (you indicate it is but… it's not usually that simple).
Right now, you decode 100 sections:
I was thinking about using multithreading so that I don't have to wait until each of the 100 decodings is completed. In my dreams I could start everything in parallel.
Actually, you should use a number close to the number of cores on the machine.
Assuming a modern desktop (e.g. 2-8 cores), running 100 threads at once will just slow it down; the kernel will waste a lot of time switching from one thread to another, and the process is also likely to hit a higher peak of resource usage and contend for shared resources.
So just create a task pool which restricts the number of active tasks to the number of cores. Each task would (generally) represent the decoding work to perform for one input file (section). This way, the decoding process is not actually sharing data across multiple threads (allowing you to avoid locking and other resource contention).
When complete, go back and fine-tune the number of tasks in the pool (e.g. using the exact same inputs and a stopwatch on multiple machines). The fastest setting may be lower or higher than the number of cores (most likely because of disk I/O). It also helps to profile.
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Yes, if the problem is CPU bound, then that is generally fine. This also assumes your decoder/dependent software is capable of running with multiple threads.
The problem you will realize, if these are files on disk, is that you will probably need to optimize how you read (and write?) the files from many cores. Allowing 8 jobs to run at once can make your problem disk bound, and 8 simultaneous readers/writers is a bad way to use hard disks, so you may find that it is not as fast as you expected. Therefore, you may need to optimize I/O for your concurrent decode implementation. In this regard, larger buffer sizes help, but that comes at a cost in memory.
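For illustration, here is a minimal sketch of the "one task slot per core" idea using std::thread (decode_section() is an assumed stand-in for the per-section Opus work; the fallback core count is arbitrary):

#include <thread>
#include <vector>

void decode_section(int section);   // assumed: decodes one of the 100 sections

void decode_all(int numSections) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;       // fallback when the count is unknown
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < cores; t++) {
        pool.emplace_back([=] {
            // static partition: thread t handles sections t, t+cores, ...
            for (int s = (int)t; s < numSections; s += (int)cores)
                decode_section(s);
        });
    }
    for (auto& th : pool)
        th.join();
}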

Instead of making your own threads and managing them, I suggest you use a thread pool and give your decoding tasks to the pool. The pool will assign tasks to as many threads as it and the system can handle. There are different types of thread pools, so you can set parameters such as forcing a specific number of threads or allowing the pool to keep increasing the number of threads.
One thing to keep in mind is that more threads doesn't mean they execute in parallel. I think the correct term is concurrently, unless you have the guarantee that each thread runs on a different CPU (which would give true parallelism).
Your entire pool can come to a halt if blocked for IO.

Before jumping into multithreading as a solution to speed things up, study the concepts of oversubscription and undersubscription.
If the processing of audio involves long blocking I/O calls, then multithreading is worth it.

Although the vagueness of your question doesn't really help... how about:
Create a list of audio files to convert.
While there is a free processor,
launch the decoder application with the next file in the queue.
Repeat until there is nothing else in the list.
If, during testing, you discover the processors aren't always 100% busy, launch 2 decodes per processor.
It could be done quite easily with a bit of bash/tcl/python.

You can use threads in general, but locking has some issues. I will base the answer around POSIX threads and locks, but this is fairly general and you will be able to port the idea to any platform. If your jobs require any kind of locking, you may find the following useful. It is also best to keep reusing the same threads again and again, because thread creation is costly (see thread pooling).
Locking is a bad idea in general for "realtime" audio since it adds latency, but for non-realtime jobs such as batch decoding/encoding, locks are perfectly OK; even for realtime work you can get better performance and avoid dropped frames with some threading knowledge.
For audio, semaphores are a bad, bad idea. They were too slow on at least my system (POSIX semaphores) when I tried them, but you will need them if you are thinking of cross-thread locking (not the type of locking where you lock and unlock in the same thread). POSIX mutexes only allow self-lock and self-unlock (you have to do both in the same thread); otherwise the program might work, but it's undefined behavior and should be avoided.
Lock-free atomic operations might give you enough freedom from locks to implement some of the same functionality with better performance.
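As a small illustration of that last point (a sketch, not from the answer above; nextUnit and totalUnits are made-up names), work units can be claimed with a single atomic increment instead of a mutex or semaphore:

#include <atomic>

std::atomic<int> nextUnit{0};
const int totalUnits = 100;

void workerLoop() {
    for (;;) {
        // fetch_add atomically claims a unique index: no mutex, no
        // semaphore, and no two threads ever get the same work unit
        int unit = nextUnit.fetch_add(1, std::memory_order_relaxed);
        if (unit >= totalUnits)
            break;
        // ... decode/process work unit `unit` ...
    }
}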

Related

How to match processing time with reception time in c++ multithreading

I'm writing a C++ application in which I receive 4096 bytes of data every 0.5 seconds. The data is processed, and the output is sent to some other application. Processing each set of data takes nearly 2 seconds.
This is exactly how I'm doing it.
In my main function, I'm receiving the data and pushing it into a vector.
I've created a thread which always processes the first element and deletes it immediately after processing. Below is a simulation of the receiving part of my application.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <mutex>
#include <pthread.h>
using namespace std;

struct Student{
    int id;
    int age;
};

vector<Student> dustBin;
pthread_mutex_t lock1;
bool isEven = true;

void *processData(void* arg){
    Student st1;
    while(true)
    {
        if(dustBin.size())
        {
            printf("front: %d\tSize: %zu\n", dustBin.front().id, dustBin.size());
            st1 = dustBin.front();
            cout << "Currently Processing ID " << st1.id << endl;
            sleep(2);
            pthread_mutex_lock(&lock1);
            dustBin.erase(dustBin.begin());
            cout << "Deleted" << endl;
            pthread_mutex_unlock(&lock1);
        }
    }
    return NULL;
}

int main()
{
    pthread_t ptid;
    Student st;
    dustBin.clear();
    pthread_mutex_init(&lock1, NULL);
    pthread_create(&ptid, NULL, &processData, NULL);
    while(true)
    {
        for(int i = 0; i < 4096; i++)
        {
            st.id = i + 1;
            st.age = i + 2;
            pthread_mutex_lock(&lock1);
            dustBin.push_back(st);
            printf("Pushed: %d\n", st.id);
            pthread_mutex_unlock(&lock1);
            usleep(500000);
        }
    }
    pthread_join(ptid, NULL);
    pthread_mutex_destroy(&lock1);
}
The output of this code (posted as an image in the original question) shows the exact sequence of processing: only one item is processed for every 4 insertions.
Note that the reception time of data <<< processing time.
Because of this, my input buffer is growing very rapidly. One more thing: as the main thread and the processData thread share a mutex, they depend on each other to release the lock. Because of this, my incoming buffer sometimes stays locked, leading to data misses. Please suggest how to handle this, or point me to some method for doing it.
Thanks & Regards
Vamsi
Undefined behavior
When you read data, you must lock before getting the size.
Busy waiting
You should always avoid a tight loop that does nothing. Here, if dustBin is empty, you will immediately check it again, forever; this will use 100% of that core, slow down everything else, drain a laptop's battery, and make the machine hotter than it should be. It is a very bad idea to write such code!
Learn multithreading first
You should read a book or two on multithreading. Doing multithreading right is hard, and almost impossible without taking the time to learn it properly. C++ Concurrency in Action is highly recommended for standard C++ multithreading.
Condition variable
Usually you would use a condition variable or some sort of event to tell the consumer thread when data is added, so that it does not have to wake up uselessly to check whether that is the case.
Since you have a typical producer/consumer, you should be able to find a lot of information on how to do it, as well as special containers or other constructs that will help you implement your code.
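For instance, a minimal condition-variable sketch for this producer/consumer, reusing the question's Student type (the function names are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>

// Student as defined in the question.
struct Student { int id; int age; };

std::queue<Student> work;
std::mutex m;
std::condition_variable cv;

void produce(const Student& s) {
    {
        std::lock_guard<std::mutex> lk(m);
        work.push(s);
    }
    cv.notify_one();   // wake the consumer only when there is data
}

Student consume() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, []{ return !work.empty(); });   // sleeps instead of spinning
    Student s = work.front();
    work.pop();
    return s;
}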
Output
Your printf and cout calls will have an impact on performance, and since some are inside a lock and others are not, you will probably get improperly interleaved output. If you really need output, a third thread might be a better option. In any case, you want to minimize the time you hold a lock, so formatting into a temporary buffer might be a good idea too.
By the way, standard output is relatively slow, and it is perfectly possible that it is the reason why you are not able to process all the data rapidly.
Processing rate
Obviously, if you are able to produce 4096 bytes of data every 0.5 seconds but need 2 seconds to process that data, you have a serious problem.
You should really think about what you want to do in such a case before asking a question here; without that information, we are only guessing at possible solutions.
Here are some possibilities:
Slow down the producer. Obviously, this does not work if you get data in real time.
Optimize the consumer (better algorithms, better hardware, optimal parallelism…)
Skip some data
Obviously, for performance problems you should use a profiler to know where you lose your time. Once you know that, you will have a better idea of where to look to improve your code.
Taking 2 seconds to process the data is really slow, but we cannot help you since we have no idea what your code is doing.
For example, if you add the data to a database and the database cannot keep up, you might want to batch multiple inserts into a single command to reduce the overhead of communicating with the database over the network.
As another example, if you append the data to a file, you might want to keep the file open and accumulate some data before each write.
Container
A vector is not a good choice if you remove items from the head one by one and its size becomes somewhat large (say, more than 100 small items), because all remaining items have to be moved every time.
In addition to changing the container as suggested in a comment, another possibility would be to use two vectors and swap them, as sketched below. That way, you reduce the number of times you lock the mutex and can process many items without needing the lock.
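A minimal sketch of that swap idea (all names are illustrative):

#include <mutex>
#include <vector>

// Student as defined in the question.
struct Student { int id; int age; };

std::mutex m;
std::vector<Student> incoming;    // filled by the producer under the lock
std::vector<Student> processing;  // consumed by the worker without the lock

void drain_batch() {
    {
        std::lock_guard<std::mutex> lk(m);
        processing.swap(incoming);   // O(1): just exchanges internal pointers
    }
    for (const Student& s : processing) {
        // ... process s; no lock is held here ...
    }
    processing.clear();
}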
How to optimize
You should accumulate enough data (say, 30 seconds' worth), stop accumulating, and then test your processing speed with that data. If you cannot process it in less than about half that time (15 seconds), then you clearly need to improve your processing speed one way or another. Once your consumer(s) are fast enough, you can optimize communication from the producer to the consumer(s).
You have to know whether your bottleneck is I/O, the database, or something else, and whether some parts can be done in parallel.
There are probably a lot of optimizations that can be done in the code you have not shown...
If you can't handle messages fast enough, you have to drop some.
Use a circular buffer of a fixed size.
Then if the provider is faster than the consumer, older entries will be overwritten.
If you cannot skip some data and you cannot process it fast enough, you are doomed.
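A minimal sketch of such an overwriting ring (all names are illustrative):

#include <cstddef>
#include <mutex>

template <typename T, size_t N>
class OverwritingRing {
    T buf_[N];
    size_t head_ = 0, count_ = 0;
    std::mutex m_;
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> lk(m_);
        buf_[(head_ + count_) % N] = v;
        if (count_ < N)
            ++count_;
        else
            head_ = (head_ + 1) % N;   // full: drop the oldest entry
    }
    bool pop(T& out) {
        std::lock_guard<std::mutex> lk(m_);
        if (count_ == 0) return false;
        out = buf_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return true;
    }
};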
Create two const variables, NBUFFERS and NTHREADS; make them both 8 initially if you have 16 cores and your processing is 4x too slow. Play with these values later.
Create NBUFFERS data buffers, each big enough to hold 4096 samples. In practice, just create a single large buffer and use offsets into it to divide it up.
Start NTHREADS threads. Each will continuously wait to be told which buffer to process, process it, and then wait for another buffer.
In your main program, go into a loop, receiving data. Receive the first 4096 samples into the first buffer and notify the first thread. Receive the second 4096 samples into the second buffer and notify the second thread.
buffer = (buffer + 1) % NBUFFERS
thread = (thread + 1) % NTHREADS
Rinse and repeat. As you have 8 threads, and data only arrives every 0.5 seconds, each thread will only get a new buffer every 4 seconds but only needs 2 seconds to clear the previous buffer.
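A possible sketch of that rotation with POSIX semaphores (receive_samples() and process_buffer() are assumed stand-ins; the `done` semaphores are added so main never overwrites a buffer that is still being processed):

#include <pthread.h>
#include <semaphore.h>

const int NBUFFERS = 8;
const int NTHREADS = 8;   // keep NBUFFERS a multiple of NTHREADS
const int NSAMPLES = 4096;

static short buffers[NBUFFERS][NSAMPLES];
static sem_t ready[NTHREADS];   // posted when a thread's next buffer is full
static sem_t done[NBUFFERS];    // posted when a buffer may be reused

void receive_samples(short* buf, int n);   // assumed to exist
void process_buffer(short* buf, int n);    // assumed to exist

static void* worker(void* arg) {
    long id = (long)arg;
    // with NBUFFERS == NTHREADS, thread id always owns buffer id
    for (int b = (int)id; ; b = (b + NTHREADS) % NBUFFERS) {
        sem_wait(&ready[id]);              // wait until the buffer is full
        process_buffer(buffers[b], NSAMPLES);
        sem_post(&done[b]);                // buffer may be overwritten again
    }
    return NULL;
}

int main() {
    pthread_t threads[NTHREADS];
    for (int b = 0; b < NBUFFERS; b++)
        sem_init(&done[b], 0, 1);          // all buffers start out free
    for (long i = 0; i < NTHREADS; i++) {
        sem_init(&ready[i], 0, 0);
        pthread_create(&threads[i], NULL, worker, (void*)i);
    }
    for (int buffer = 0, thread = 0; ; ) {
        sem_wait(&done[buffer]);           // never clobber an in-flight buffer
        receive_samples(buffers[buffer], NSAMPLES);
        sem_post(&ready[thread]);          // hand the buffer to its thread
        buffer = (buffer + 1) % NBUFFERS;
        thread = (thread + 1) % NTHREADS;
    }
}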

Recommended pattern for a queue accessed by multiple threads...what should the worker thread do?

I have a queue of objects that is being added to by a thread A. Thread B is removing objects from the queue and processing them. There may be many threads A and many threads B.
I am using a mutex when the queue is being "push"ed to, and also when it is "front"ed and "pop"ped from, as shown in the pseudo-code below:
Thread A calls this to add to the queue:
void Add(object)
{
    mutex->lock();
    queue.push(object);
    mutex->unlock();
}
Thread B processes the queue as follows:
object GetNextTargetToWorkOn()
{
    object = NULL;
    mutex->lock();
    if (!queue.empty())
    {
        object = queue.front();
        queue.pop();
    }
    mutex->unlock();
    return object;
}

void DoTheWork(int param)
{
    while (true)
    {
        object structure;
        while ((object = GetNextTargetToWorkOn()) == NULL)
            boost::thread::sleep(100ms); // sleep a very short time
        // do something with the object
    }
}
What bothers me is the while---get object---sleep-if-no-object paradigm. While there are objects to process it is fine. But while the thread is waiting for work there are two problems
a) The while loop is whirling, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
Is there a better pattern to achieve the same thing?
You're using spin-waiting; a better design is to use a monitor. Read more about the details on Wikipedia.
And a cross-platform solution using std::condition_variable with a good example can be found here.
a) The while loop is whirling, consuming resources
b) the sleep means wasted time if a new object comes in to be processed
It has been my experience that the sleep you used actually 'fixes' both of these issues.
a) The consumed resources are a small amount of RAM and a remarkably small fraction of the available CPU cycles.
b) Sleep is not a wasted time on the OS's I've worked on.
c) Sleep can affect 'reaction time' (aka latency), but has seldom been an issue (outside of interrupts.)
The time spent in sleep is likely to be several orders of magnitude longer than the time spent in this simple loop. i.e. It is not significant.
IMHO - this is an ok implementation of the 'good neighbor' policy of relinquishing the processor as soon as possible.
On my desktop, AMD64 Dual Core, Ubuntu 15.04, a semaphore enforced context switch takes ~13 us.
100 ms ==> 100,000 us .. that is 4 orders of magnitude difference, i.e. VERY insignificant.
In the 5 OS's (Linux, vxWorks, OSE, and several other embedded system OS's) I have worked on, sleep (or their equivalent) is the correct way to relinquish the processor, so that it is not blocked from running another thread while the one thread is in sleep.
Note: It is feasible that some OS's sleep might not relinquish the processor. So, you should always confirm. I've not found one. Oh, but I admit I have not looked / worked much on Windows.

Windows critical sections fairness

I've a question about the fairness of the critical sections on Windows, using EnterCriticalSection and LeaveCriticalSection methods. The MSDN documentation specifies: "There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads."
The problem comes from an application I wrote, which blocks some threads that never enter the critical section, even after a long time; so I performed some tests with a simple C program to verify this behaviour, but I noticed strange results when you have many threads and some wait times inside.
This is the code of the test program:
CRITICAL_SECTION CriticalSection;

DWORD WINAPI ThreadFunc(void* data) {
    int me;
    int i, c = 0;
    me = *(int *) data;
    printf(" %d started\n", me);
    for (i = 0; i < 10000; i++) {
        EnterCriticalSection(&CriticalSection);
        printf(" %d Trying to connect (%d)\n", me, c);
        if (i != 3 && i != 4 && i != 5)
            Sleep(500);
        else
            Sleep(10);
        LeaveCriticalSection(&CriticalSection);
        c++;
        Sleep(500);
    }
    return 0;
}

int main() {
    int i;
    int a[20];
    HANDLE thread[20];
    InitializeCriticalSection(&CriticalSection);
    for (i = 0; i < 20; i++) {
        a[i] = i;
        thread[i] = CreateThread(NULL, 0, ThreadFunc, (LPVOID) &a[i], 0, NULL);
    }
    // wait for the threads; otherwise main returns and kills them immediately
    WaitForMultipleObjects(20, thread, TRUE, INFINITE);
    DeleteCriticalSection(&CriticalSection);
    return 0;
}
The result of this is that some threads are blocked for many, many cycles, while some others enter the critical section very often. I also noticed that if you change the faster Sleep (the 10 ms one), everything may become fair again, but I didn't find any link between sleep times and fairness.
However, this test example works much better than my real application code, which is much more complicated and actually shows starvation for some threads. To be sure that the starved threads are alive and working, I made a test (in my application) in which I kill threads after they have entered the critical section 5 times: the result is that, in the end, every thread enters, so I'm sure all of them are alive and blocked on the mutex.
Do I have to assume that Windows is really NOT fair with threads?
Do you know any solution for this problem?
EDIT: The same code in linux with pthreads, works as expected (no thread starves).
EDIT2: I found a working solution that forces fairness, using a CONDITION_VARIABLE.
It can be inferred from this post (link), with the required modifications.
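For reference, here is a hypothetical ticket-based sketch of that CONDITION_VARIABLE approach (not the code from the linked post, just one way to force FIFO fairness):

// A FIFO "ticket" discipline built on CRITICAL_SECTION + CONDITION_VARIABLE.
CRITICAL_SECTION cs;
CONDITION_VARIABLE cv;
LONG nextTicket = 0, nowServing = 0;

void FairInit(void) {
    InitializeCriticalSection(&cs);
    InitializeConditionVariable(&cv);
}

void FairEnter(void) {
    EnterCriticalSection(&cs);
    LONG myTicket = nextTicket++;
    while (myTicket != nowServing)                     // handles spurious wakeups
        SleepConditionVariableCS(&cv, &cs, INFINITE);  // releases cs while asleep
    LeaveCriticalSection(&cs);
    // Tickets are granted strictly in FIFO order, so exactly one thread
    // (the one whose ticket equals nowServing) runs past this point.
}

void FairLeave(void) {
    EnterCriticalSection(&cs);
    nowServing++;
    WakeAllConditionVariable(&cv);   // let the next ticket holder proceed
    LeaveCriticalSection(&cs);
}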
You're going to encounter starvation issues here anyway since the critical section is held for so long.
I think MSDN is probably suggesting that the scheduler is fair about waking up threads but since there is no lock acquisition order then it may not actually be 'fair' in the way that you expect.
Have you tried using a mutex instead of a critical section? Also, have you tried adjusting the spin count?
If you can avoid locking the critical section for extended periods of time then that is probably a better way to deal with this.
For example, you could restructure your code to have a single thread that deals with your long running operation and the other threads queue requests to that thread, blocking on a completion event. You only need to lock the critical section for short periods of time when managing the queue. Of course if these operations must also be mutually exclusive to other operations then you would need to be careful with that. If all of this stuff can't operate concurrently then you may as well serialize that via the queue too.
Alternatively, perhaps take a look at using Boost.Asio. You could use a thread pool and strands to prevent multiple async handlers from running concurrently where synchronization would otherwise be an issue.
I think you should review a few things:
In 9997 of 10000 cases you branch to Sleep(500). Each thread holds the critical section for as much as 500 ms on almost every successful attempt to acquire it.
The threads do another Sleep(500) after releasing the critical section. As a result, a single thread occupies almost 50% (49.985%) of the available time by holding the critical section, no matter what!
Behind the scenes: Joe Duffy: The wait lists for mutually exclusive locks are kept in FIFO order, and the OS always wakes the thread at the front of such wait queues.
Assuming you did that on purpose to show the behavior: Starting 20 of those threads may result in a minimum wait time of 10 seconds for the last thread to get access to the critical section on a single logical processor when the processor is completely available for this test.
For how long did you run the test? On what CPU? And what Windows version? You should be able to write down some more facts: a histogram of thread activity vs. thread ID could tell a lot about fairness.
Critical sections should be held for short periods of time. In most cases shared resources can be dealt with much more quickly. A Sleep inside a critical section almost certainly points to a design flaw.
Hint: Reduce the time spent inside the critical section or investigate Semaphore Objects.

How to get multithreads working properly using pthreads and not boost in class using C++

I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles well whether it be using boost or pthreads. Remember this is pseudo code designed to illustrate the problem and not directly compilable.
The problem I am having is that, for a multithreaded function, the memory usage and processing time are always greater than if the same work is achieved using serial programming, e.g. a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject{
public:
    typedef struct
    {
        char** somedata;
        double output, fitness;
    } entity;

    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject(){
        numthreads = 100;
        entity_array = new entity*[numthreads];
        for(int i = 0; i < numthreads; i++){
            entity_array[i] = new entity;
            entity_array[i]->somedata = new char*[2];
            entity_array[i]->somedata[0] = new char[100];
            entity_array[i]->somedata[1] = new char[100];
        }
        /*.....more memory allocations for entity_array.......*/
        this->initdata();
        this->eval_thread();
    }

    void initdata(){
        /**put zeros and ones in entity_array**/
    }

    float somefunc(char *somedata){
        float output = countzero(); // some other function, not listed
        return output;
    }

    void* thread_function()
    {
        // grab a unique index under the lock, then release it quickly
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);

        entity *ent = this->entity_array[currentthread];
        double A = 0, B = 0, t4 = 0;
        A = somefunc(ent->somedata[0]);
        B = somefunc(ent->somedata[1]);
        t4 = anotherfunc(A, B);
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
        return NULL;
    }

    static void* staticthreadproc(void* p){
        return reinterpret_cast<aproject*>(p)->thread_function();
    }

    void eval_thread(){
        // use multithreading to evaluate individuals in parallel
        int nthreads = this->numthreads;
        pthread_t threads[nthreads];

        // create threads
        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;
        for(int i = 0; i < nthreads; i++){
            pthread_create(&threads[i], NULL, &aproject::staticthreadproc, this);
            //printf("creating thread, %d\n", i);
        }
        // join threads
        for(int i = 0; i < nthreads; i++){
            pthread_join(threads[i], NULL);
        }
        pthread_mutex_destroy(&this->mutexdata);
    }
};
I am using pthreads here because it works better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index into entity_array, as each thread only applies its work to its respective entity_array element, indexed by the variable this->whichthread. This variable is the only thing that needs to be locked by the mutex, as it is updated for every thread and must not be changed by other threads. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume that all the other functions apart from init are both processor and memory intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE: THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE.
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call stack, which consumes memory. Every local variable of your function (and all other functions on the call stack) counts toward that memory.
When creating a new thread, space for its call-stack is reserved. I don't know what the default-value is for pthreads, but you might want to look into that. If you know you require less stack-space than is reserved by default, you might be able to reduce memory-consumption significantly by explicitly specifying the desired stack-size when spawning the thread.
As for the performance-part - it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (don't know if that is the case here). This might end up being slower, due to the additional overhead of context-switches, increased amount of cache-misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache-misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation: when one thread writes its results into its entity structure, it may invalidate nearby cached memory and force other threads to fetch the data again from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.
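A minimal sketch of that layout with pthreads (evaluate() is an assumed per-entity work function, and `entity` stands in for the struct from the question):

#include <pthread.h>

struct Chunk { entity** ents; int begin, end; };

static void* chunkWorker(void* p) {
    Chunk* c = (Chunk*)p;
    // each thread walks a contiguous slice of the array, so its writes
    // stay away from cache lines the other threads are using
    for (int i = c->begin; i < c->end; i++)
        evaluate(c->ents[i]);   // assumed per-entity work function
    return NULL;
}

void eval_parallel(entity** ents, int n, int ncores) {
    pthread_t tid[64];
    Chunk chunks[64];
    int per = (n + ncores - 1) / ncores;   // ceil(n / ncores)
    for (int t = 0; t < ncores; t++) {
        int begin = t * per;
        int end = (begin + per < n) ? begin + per : n;
        chunks[t] = { ents, begin, end };
        pthread_create(&tid[t], NULL, chunkWorker, &chunks[t]);
    }
    for (int t = 0; t < ncores; t++)
        pthread_join(tid[t], NULL);
}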

Overhead due to use of Events

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following: I have around 1000 outer loops, each of around 10'000 iterations, to do. These outer loops must be executed sequentially, but I have 4 CPUs available. What I try to do is split each 10'000-iteration loop into 4 loops of 2'500 iterations, i.e. one per thread. But I have to wait for the 4 small loops to finish before going on to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
    // some housekeeping
    HANDLE signals[2];
    signals[0] = waitSignal;
    signals[1] = endSignal;

    do {
        // wait for one of the signals
        waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);

        // try to get the next job parameters;
        if (tp->getNextJob(threadId, data)) {
            // execute job
            void* output = jobFunc(data.params);

            // tell thread pool that we're done and collect output
            tp->collectOutput(data.ID, output);
        }
        tp->threadDone(threadId);
    }
    while (waitResult - WAIT_OBJECT_0 == 0);

    // if we reach this point, endSignal was sent, so we are done !
    return 0;
}

// create all threads
for (int i = 0; i < nbThreads; ++i) {
    threadData data;
    unsigned int threadId = 0;
    char eventName[20];

    sprintf_s(eventName, 20, "WaitSignal_%d", i);

    data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
                                          this, CREATE_SUSPENDED, &threadId);
    data.threadId = threadId;
    data.busy = false;
    data.waitSignal = CreateEvent(NULL, true, false, eventName);

    this->threads[threadId] = data;

    // start thread
    ResumeThread(data.handle);
}

// add job
void ThreadPool::addJob(int jobId, void* params) {
    // housekeeping
    EnterCriticalSection(&(this->mutex));

    // first, insert parameters in the list
    this->jobs.push_back(job);

    // then, find the first free thread and wake it
    for (it = this->threads.begin(); it != this->threads.end(); ++it) {
        thread = (threadData) it->second;
        if (!thread.busy) {
            this->threads[thread.threadId].busy = true;
            ++(this->nbActiveThreads);

            // wake thread such that it gets the next params and runs them
            SetEvent(thread.waitSignal);
            break;
        }
    }

    LeaveCriticalSection(&(this->mutex));
}
This looks to me like a producer/consumer pattern, which can be implemented with two semaphores: one guarding against queue overflow, the other against an empty queue.
You can find some details here.
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out: you are still asking for the next job after the endSignal is emitted.

for (;;) {
    // wait for one of the signals
    waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
    if (waitResult - WAIT_OBJECT_0 != 0)
        return 0;
    // ....
}
Since you say that it is much slower in parallel than in sequential execution, I assume that your processing time for the inner 2'500 loop iterations is tiny (in the few-microsecond range). Then there is not much you can do except review your algorithm to split larger chunks of processing; OpenMP won't help, and every other synchronization technique won't help either, because they all fundamentally rely on events (spin loops do not qualify).
On the other hand, if your processing time for the 2'500 loop iterations is larger than 100 microseconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it across four processors will not give you more bandwidth; it will actually give you less because of collisions. You could also be running into problems of cache cycling, where each of your top 1000 iterations will flush and reload the caches of the 4 cores. Then there is no single solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using VS 2008, I'd suggest looking at OpenMP. If you're using Visual Studio 2010 Beta 1, I'd suggest looking at the Parallel Patterns Library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small, which is why I suggest using an implementation already built. You also need to ensure that you aren't running in debug mode or under a debugger, that the tasks themselves aren't blocking on a lock, I/O, or memory allocation, and that you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf, the F1 profiler in Visual Studio 2010 Beta 1 (it has 2 new concurrency modes which help you see contention), or Intel's VTune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just once, to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference, or iterator to them to the thread instead of copying the data.
The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly ? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more than your processing in some cases, but it should not be that expensive unless your processing is really fast. In that case, switching between threads is expensive too, hence the first part of my answer about doing things sequentially...
You should look for inter-thread synchronisation bottlenecks. You can trace thread waiting times to begin with...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer's cores/processors to parallelize some processing that is essentially sequential.
Say you have 4 cores and 10000 loops to compute, as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process: you just need to give your four threads the nth, nth+1, nth+2, and nth+3 loops, wait for the four threads to complete, and then go on. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. You can look at the Windows implementation for efficiency. Your thread pool is not really suited to the task: the search for an available thread in a critical section is what is killing your CPU time, not the event part.
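A minimal sketch of that rendezvous with a POSIX barrier (the worker count and the 2'500-iteration split come from the question; loopWorker and the data-preparation step are illustrative):

#include <pthread.h>

#define NWORKERS 4
static pthread_barrier_t start_bar, done_bar;

static void* loopWorker(void* arg) {
    long id = (long)arg;
    for (int iter = 0; iter < 1000; iter++) {
        pthread_barrier_wait(&start_bar);   // wait for main to publish work
        // ... process iterations [id*2500, (id+1)*2500) of the big loop ...
        pthread_barrier_wait(&done_bar);    // rendezvous: all four are done
    }
    return NULL;
}

int main() {
    pthread_barrier_init(&start_bar, NULL, NWORKERS + 1); // workers + main
    pthread_barrier_init(&done_bar, NULL, NWORKERS + 1);
    pthread_t tid[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, loopWorker, (void*)i);
    for (int iter = 0; iter < 1000; iter++) {
        // ... prepare the data for this outer iteration ...
        pthread_barrier_wait(&start_bar);   // release the workers
        pthread_barrier_wait(&done_bar);    // wait for all of them to finish
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&start_bar);
    pthread_barrier_destroy(&done_bar);
}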
It appears that this whole event thing
(along with the synchronization
between the threads using critical
sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with thread pools here in the past, I'd suggest some debugging to count the number of times your thread function is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that, and it does it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work, but it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are always idle threads available. I would suggest reorganizing the code so that addJob() does what it is named for: it adds a job ONLY (without finding or even caring who does the job), while each worker thread actively gets new work when it is done with its existing work.