Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
In short: Does an un-delayed while loop consume significant processing power, compared to a similar loop which is slowed down by a delay?
In not-so-short:
I have run into this question more than once. I am writing the core part of a program (either for a microcontroller unit or a computer application) and it consists of a semi-infinite while loop to stay alive and look for events.
I will take this example: I have a small application that uses an SDL window and the console. In a while loop I would like to listen to events for this SDL window, but I would also like to break this loop according to the command line input by means of a global variable. Possible solution (pseudo-code):
// Global
bool running = true;
// ...
while (running)
{
if (getEvent() == quit)
{
running = false;
}
}
shutdown();
The core while loop will quit because of the listened event or something external. However, this loop runs continuously, maybe even 1000 times per second. That's a little overkill; I don't need that response time. Therefore I often add a delaying statement:
while (running)
{
if (getEvent() == quit)
{
running = false;
}
delay(50); // Wait 50 milliseconds
}
This limits the refresh rate to 20 times per second, which is plenty.
So. Is there a real difference between the two? Is it significant? Would it be more significant on the microcontroller unit (where processing power is very limited (but nothing else besides the program needs to run...))?
Well, in fact it's not a question about C++, but rather the answer depends on CPU architecture / Host OS / delay() implementation.
If it's a multi-tasking environment, then delay() could (and probably will) help the OS scheduler do its job more effectively. However, the real difference could be too small to notice (except under old cooperative multi-tasking, where delay() is a must).
If it's a single-task environment (possibly some microcontroller) then delay() could still be useful if the underlying implementation is able to execute some dedicated low power consumption instructions instead of your ordinary loop. But, of course, there's no guarantee it will, unless your manual explicitly states so.
As for performance issues, the obvious one is that you may receive and process an event with a significant delay (or even miss it completely), but if you believe that's not the case, then there are no other cons against delay().
You will make your code much harder to read, and you are doing asynchronism the old-style way: you explicitly wait for something to happen instead of relying on mechanisms that do the job for you.
Also, you delay by 50ms. Is it always optimal? Does it depend on which programs are running?
In C++11 you can use condition_variable. This allows you to wait for an event to happen, without coding the waiting loops.
Documentation here:
http://en.cppreference.com/w/cpp/thread/condition_variable
I have adapted the example from that page to your context and simplified it so that it just waits for a single event:
// Example program
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
using namespace std::chrono_literals;
std::mutex m;
std::condition_variable cv;
std::string data;
bool processed = false;
void worker_thread()
{
    std::cout << "Worker thread starts processing data\n";
    std::this_thread::sleep_for(10s); // simulates the work
    {
        // Hold the lock only while touching the shared state
        std::lock_guard<std::mutex> lk(m);
        data += " after processing";
        // Send data back to main()
        processed = true;
        std::cout << "Worker thread signals data processing completed" << std::endl;
        std::cout << "Corresponds to your getEvent() == quit" << std::endl;
    }
    // The lock is released before notifying, to avoid waking up
    // the waiting thread only to block again (see notify_one for details)
    cv.notify_one();
}
int main()
{
    data = "Example data";
    std::thread worker(worker_thread);
    // Wait for the worker
    {
        std::unique_lock<std::mutex> lk(m);
        // Block until the processing is finished; this thread is woken when it is done.
        // No explicit waiting loop
        cv.wait(lk, []{ return processed; });
    }
    std::cout << "data processed" << std::endl;
    // A joinable std::thread must be joined (or detached) before it is destroyed
    worker.join();
}
In my experience, you must do something that will relinquish the processor. sleep works OK, and on most Windows systems even sleep(1) is adequate to completely unload the processor in a loop.
You can get the best of all worlds, however, if you use something like std::condition_variable. It is possible to come up with constructions using condition variables that are similar to 'events' and WaitForSingleObject in the Windows API.
One thread can block on a condition variable that is notified by another thread. This way, one thread can call condition_variable.wait_for(lock, some_time), and it will either wait for the timeout period (without loading the processor) or continue execution as soon as another thread notifies it.
I use this method where one thread is sending messages to another thread. I want the receiving thread to respond as soon as possible, not after waiting for a sleep(20) to complete. The receiving thread calls condition_variable.wait_for(lock, 20 ms), for example. The sending thread sends a message and makes a corresponding condition_variable.notify_one() call. The receiving thread wakes up immediately and processes the message.
This solution gives very fast response to messages, and does not unduly load the processor.
If you don't care about portability and you happen to be using Windows, events and WaitForSingleObject do the same thing.
Your loop would look something like:
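while (!done)
{
    // queue_mutex is a placeholder name for the mutex that guards the message queue
    std::unique_lock<std::mutex> lk(queue_mutex);
    // Returns on notify_one() or after 20 ms, whichever comes first
    cond_var.wait_for(lk, std::chrono::milliseconds(20));
    // process messages...
    msg = dequeue_message();
    lk.unlock();
    if (msg == done_message)
        done = true;
    else
        process_message(msg);
}
In another thread...
void send_message(std::string msg)
{
    {
        std::lock_guard<std::mutex> lk(queue_mutex);
        enqueue_message(msg);
    }
    cond_var.notify_one();
}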
Your message processing loop will spend most of its time idle, waiting on the condition variable. When a message is sent and the condition variable is notified by the send thread, your receive thread will respond immediately.
This allows your receive thread to loop at a minimum rate set by the wait time, and a maximum rate determined by the sending thread.
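A slightly fuller sketch of the receive/send pair described above, written against the standard library, where the timed wait is spelled wait_for and the wake-up is notify_one. The class name MessageQueue and its member names are made up for illustration; the predicate passed to wait_for is an addition that guards against spurious wake-ups and against a message that arrives just before the wait starts.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical helper: a thread-safe queue whose consumer blocks with a timeout.
class MessageQueue {
public:
    // Called by the sending thread.
    void send(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();  // wake the receiver right away
    }

    // Called by the receiving thread. Returns true and fills `out` as soon as
    // a message is available; returns false if nothing arrived within `timeout`.
    bool receive_for(std::chrono::milliseconds timeout, std::string& out) {
        std::unique_lock<std::mutex> lk(m_);
        if (!cv_.wait_for(lk, timeout, [this] { return !q_.empty(); }))
            return false;  // timed out, queue is still empty
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};
```

The receiving loop would call receive_for(std::chrono::milliseconds(20), msg) and treat a false return as "nothing to do this round", which keeps the 20 ms minimum loop rate from the pseudo-code while still reacting immediately to send().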
What you are asking is how to properly implement an event loop. Use OS calls. You ask the OS for an event or message. If no message is present, the OS simply puts the process to sleep. In a microcontroller environment you probably don't have an OS. There the concept of interrupts has to be used, which is pretty much a "message" (or event) on a lower level.
And on microcontrollers where you don't have concepts like sleeping or interrupts, you end up just looping.
In your example, a properly implemented getEvent() should block and do nothing until something actually happens, e.g. a key press.
The best way to determine that is to measure it yourself.
An undelayed loop will result in 100% usage of the specific core the app is running on. With the delay statement, it will be around 0-1%.
(This assumes that getEvent() returns immediately.)
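A rough way to measure this yourself is to compare the CPU time a loop consumes with and without the delay. The sketch below is only an illustration: the function name cpu_seconds_used is made up, the empty loop body stands in for a getEvent() that returns immediately, and it assumes std::clock() reports processor time (true on POSIX systems; some platforms report wall-clock time instead, which would make both variants look the same).

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Runs a polling loop for roughly `wall` of wall-clock time and returns the
// CPU time the process consumed meanwhile, in seconds.
// Assumption: std::clock() measures processor time, not wall-clock time.
double cpu_seconds_used(bool with_delay, std::chrono::milliseconds wall) {
    const auto deadline = std::chrono::steady_clock::now() + wall;
    const std::clock_t cpu_start = std::clock();
    while (std::chrono::steady_clock::now() < deadline) {
        // A real loop would poll for events here.
        if (with_delay)
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    return static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
}
```

Calling it once with with_delay set to false and once with true, for the same wall-clock duration, should show the undelayed variant consuming CPU time close to the full duration and the delayed variant consuming almost none.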
Well, that depends on a few factors - if you don't need to run anything else besides that loop in parallel, it makes no performance difference, obviously.
But a problem that might come up is power consumption - depending on how long this loop is, you might save like 90% of the power consumed by the microcontroller in the second variant.
To call it a bad practice overall doesn't seem right to me - it works in a lot of scenarios.
As far as I know, the process stays in RAM while the loop runs, so the delay does not free that resource. The only difference the second version makes is the number of executions of the while loop in a given amount of time. This helps if the program runs for a long time. Otherwise there is no problem with the first version.
Here is a small piece of C++ code to test a cyclic call using a thread. It sometimes fails because of an unexpected delay. The for-loop should be called every 10 ms. The runtime of the for-loop is usually just 1 ms, but sometimes the execution time is longer than 200 ms. It looks like another process interrupts this for-loop and control only returns after 200 ms, which is a surprisingly long time. The program runs under GNU Linux 5.10.41 on ARM aarch64.
How can I prevent the main thread from being preempted by other processes or threads? Thanks a lot!
while(1)
{
auto start_time = std::chrono::high_resolution_clock::now();
LOG_F(INFO, "cyclic start time");
for(int j = 0; j < 150000; j++); //
auto end_time = std::chrono::high_resolution_clock::now();
auto exec_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
LOG_F(INFO, "execution time: %lldms", static_cast<long long>(exec_time.count()));
if (exec_time.count() < 10)
{
LOG_F(INFO, "i = %d", i);
std::this_thread::sleep_for(std::chrono::milliseconds(10 - exec_time.count()));
}
else
{
LOG_F(ERROR, "execution time was higher than 10ms (%lldms)", static_cast<long long>(exec_time.count()));
break;
}
}
It works after removing the logging statements. Logging really is a mutex hog, as @Frebreeze mentioned. Actually, the logging statements can be kept in the code; just call the program with -v ERROR. Thank you very much for the help!
It looks like logging is the reason. But why do the logging statements take so long, and not always, but periodically, like a heartbeat?
Correct answer, originally written in the comments:
Shot in the dark. Logging mutex hog. There likely is a mutex/lock
within your logging system. Another thread may be a resource hog
acquiring the lock and not releasing it quickly. This would cause THIS
thread to wait much longer than expected. Check this behaviour by
disabling all other threads and see if behaviour continues.
Follow up help below:
@Jung Glad you figured it out!
If logging is taking variable amounts of time, look into the data you are passing to the logging system. A general thing to look out for is large data structs passed by value. Yes, I know compilers are smart now, but it is something to keep an eye on.
One of the methods I sometimes use to reduce log hogs is a dedicated thread for writing log statements to file. This thread uses a std::vector or a queue. When vector.size() > 8, it dumps the contents to the file(s). If you have a way of setting thread priority, you can also assign a priority to this thread. This gets us into RTOS ideas, which you can look into if you need strict timings.
Workflow
Suppose sensor1 thread writes log statement.
Sensor1 thread adds data to std::vec
Suppose sensor2 thread writes log statement
Sensor2 thread adds data to std::vec
....
Once std::vec > 8 fire event to wake up thread
Thread wakes up, and once given CPU time, begins writing to file
This will help minimize the time spent locking mutexes as well as the time spent opening files. However, it comes at the cost of memory. Play with the queue size to reach your desired goals.
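The workflow above could be sketched roughly as follows. This is not the API of any particular logging library: AsyncLogger, Sink, and batch_size are hypothetical names, the sink callback stands in for "write to file", and the threshold is a parameter rather than a fixed 8.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical logger with a dedicated writer thread.
class AsyncLogger {
public:
    using Sink = std::function<void(const std::string&)>;

    explicit AsyncLogger(Sink sink, std::size_t batch_size = 8)
        : sink_(std::move(sink)), batch_size_(batch_size ? batch_size : 1),
          writer_([this] { run(); }) {}

    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lk(m_);
            stop_ = true;
        }
        cv_.notify_one();
        writer_.join();  // the writer flushes whatever is left before exiting
    }

    // Called from any thread; only appends to a vector under a short lock.
    void log(std::string line) {
        bool wake = false;
        {
            std::lock_guard<std::mutex> lk(m_);
            pending_.push_back(std::move(line));
            wake = pending_.size() >= batch_size_;
        }
        if (wake) cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return stop_ || pending_.size() >= batch_size_; });
            std::vector<std::string> batch;
            batch.swap(pending_);
            const bool stopping = stop_;
            lk.unlock();  // do the slow output without holding the lock
            for (const auto& line : batch) sink_(line);
            if (stopping) return;
            lk.lock();
        }
    }

    Sink sink_;
    std::size_t batch_size_;
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::string> pending_;
    bool stop_ = false;
    std::thread writer_;  // declared last so it starts after the other members exist
};
```

The sensor threads only ever pay for a push_back under a short lock; the slow output happens on the writer thread with the lock released. A larger batch_size means fewer wake-ups but more memory and later output, which is the trade-off described above.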
I'm currently writing code for a simulator to sync with ROS time.
Essentially, the problem becomes "write a get_time and sleep that scales according to ROS time"? Doing this will allow no change to the codebase and just require linking to the custom get_time and sleep. get_time seems to work perfectly; however, I've been having trouble getting the sleep to run accurately.
My current design is like this (code attached at the bottom):
Thread calls sleep
Sleep will add the time when to unlock this thread (current_time + sleep_time) into a priority queue, and then wait on a condition variable.
A separate thread (let's call it watcher) will constantly loop and check the top of the queue; if the current time >= the top of the prio queue, then it will notify_all on the condition variable and then pop the prio queue
However, it seems like the watcher thread is not accurate enough (I see discrepancies of 0~50ms), meaning the sleep calls make the threads sleep too long sometimes. I also visibly notice lag/jagged behavior in the simulator compared to if I were to replace the sleep with a usleep(1000*ms).
Unfortunately, I'm not too experienced at these types of designs, and I feel like there are lots of ways to optimize/rewrite this to make it run more accurately.
So my question is, are condition variables the right way? Am I even using them correctly? Here are some things I tried:
reduce the number of unnecessary notify_all calls by having an array of condition variables and assigning them based on time like this: (ms/100)%256. The idea being that close together times will share the same cv because they are likely to actually wake up from the notify_all. This made the performance worse
keep the threads and prio_queue pushing etc. but instead use usleep. I found out that the usleep will make it work so much better, which probably means the mutex, locking, and pushing/popping operations do not contribute to a noticeable amount of lag, meaning it must be in the condition variable part
Code:
Watcher (this is run on startup)
void watcher()
{
while (true)
{
usleep(1);
{
std::lock_guard<std::mutex> lk(m_queue);
if (prio_queue.empty())
continue;
if (get_time_in_ms() >= prio_queue.top())
{
cv.notify_all();
prio_queue.pop();
}
}
}
}
Sleep
void sleep(int ms)
{
int wakeup = get_time_in_ms() + ms;
{
std::lock_guard<std::mutex> lk(m_queue);
prio_queue.push(wakeup);
}
std::unique_lock<std::mutex> lk(m_time);
cv.wait(lk, [wakeup] {return get_time_in_ms() >= wakeup;});
lk.unlock();
}
Any help would be appreciated.
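One possible variation on the design above, sketched under the assumption that the simulator has a hook that is called whenever simulated time changes: the code that updates the time also notifies the sleepers, under the same mutex the sleepers check their predicate with, so no polling watcher is needed and a wake-up cannot slip in between the predicate check and the wait. SimClock, set_time_ms, sleep_ms, and demo_sleep are hypothetical names, not part of ROS.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical simulated clock: whoever advances simulated time wakes the sleepers.
class SimClock {
public:
    // Called by the time source whenever simulated time changes
    // (assumption: such a hook exists in the simulator).
    void set_time_ms(std::int64_t t_ms) {
        {
            std::lock_guard<std::mutex> lk(m_);
            now_ms_ = t_ms;
        }
        cv_.notify_all();
    }

    std::int64_t now_ms() const {
        std::lock_guard<std::mutex> lk(m_);
        return now_ms_;
    }

    // Blocks the caller until simulated time has advanced by `ms`.
    void sleep_ms(std::int64_t ms) {
        std::unique_lock<std::mutex> lk(m_);
        const std::int64_t wakeup = now_ms_ + ms;
        cv_.wait(lk, [this, wakeup] { return now_ms_ >= wakeup; });
    }

private:
    mutable std::mutex m_;
    std::condition_variable cv_;
    std::int64_t now_ms_ = 0;
};

// Usage sketch: one thread sleeps for 100 simulated ms while the caller
// advances the clock in 10 ms steps. Returns the simulated time at wake-up.
std::int64_t demo_sleep() {
    SimClock sim;
    std::atomic<bool> done{false};
    std::int64_t woke_at = -1;
    std::thread sleeper([&] {
        sim.sleep_ms(100);
        woke_at = sim.now_ms();
        done = true;
    });
    for (std::int64_t t = 10; !done; t += 10) {
        sim.set_time_ms(t);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    sleeper.join();
    return woke_at;
}
```

Every sleeper is woken on each time update and re-checks its own deadline, so the wake-up accuracy is bounded by how often set_time_ms is called rather than by a polling interval.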
So I have a Kinect program that has three main functions that collect data and saves it. I want one of these functions to execute as much as possible, while the other two run maybe 10 times every second.
while(1)
{
...
//multi-threading to make sure color and depth events are aligned -> get skeletal data
if (WaitForSingleObject(colorEvent, 0) == 0 && WaitForSingleObject(depthEvent, 0) == 0)
{
std::thread first(getColorImage, std::ref(colorEvent), std::ref(colorStreamHandle), std::ref(colorImage));
std::thread second(getDepthImage, std::ref(depthEvent), std::ref(depthStreamHandle), std::ref(depthImage));
if (WaitForSingleObject(skeletonEvent, INFINITE) == 0)
{
first.join();
second.join();
std::thread third(getSkeletonImage, std::ref(skeletonEvent), std::ref(skeletonImage), std::ref(colorImage), std::ref(depthImage), std::ref(myfile));
third.join();
}
//if (check == 1)
//check = 2;
}
}
Currently my threads are making them all run at the same exact time, but this slows down my computer a lot and I only need to run 'getColorImage' and 'getDepthImage' maybe 5-10 times/second, whereas 'getSkeletonImage' I would want to run as much as possible.
I want 'getSkeletonImage' to run at max frequency (~30 times/second through the while loop) and then the 'getColorImage' and 'getDepthImage' to time synchronize (~5-10 times/second through the while loop)
What is a way I can do this? I am already using threads, but I need one to run consistently, and then the other two to join in intermittently essentially. Thank you for your help.
Currently, your main loop is creating the threads every iteration, which suggests each thread function runs once to completion. That introduces the overhead of creating and destroying threads every time.
Personally, I wouldn't bother with threads at all. Instead, in the main thread I'd do
void runSkeletonEvent(int n)
{
    for (int i = 0; i < n; ++i)
    {
        // wait required time (i.e. to next multiple of 1/30 second)
        skeletonEvent();
    }
}
// and, in your main function ....
while (termination_condition_not_met)
{
    runSkeletonEvent(3);
    colorEvent();
    runSkeletonEvent(3);
    depthEvent();
}
This interleaves the events, so skeletonEvent() runs six times for every time depthEvent() and colorEvent() are run. Just adjust the numbers as needed to get required behaviour.
You'll need to design the code for all the events so they don't run over time (if they do, all subsequent events will be delayed - there is no means to stop that).
The problem you'll then need to resolve is how to wait for the time to fire the skeleton event. A process of retrieving clock time, calculating how long to wait, and sleeping for that interval will do it. By sleeping (the thread yielding its time slice) your program will also be a bit better mannered (e.g. it won't be starving other processes of processor time).
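The "retrieve the clock time, work out how long to wait, sleep" step can be sketched with std::this_thread::sleep_until, which sleeps to an absolute deadline. The function name run_interleaved and its parameters are illustrative only; fast and slow stand in for the skeleton work and the color/depth work.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls `fast` once per period and `slow` on every `slow_every`-th tick,
// all on the calling thread.
void run_interleaved(int ticks, std::chrono::milliseconds period, int slow_every,
                     const std::function<void()>& fast,
                     const std::function<void()>& slow) {
    auto next = std::chrono::steady_clock::now();
    for (int tick = 1; tick <= ticks; ++tick) {
        fast();
        if (tick % slow_every == 0) slow();
        // Sleep to an absolute deadline so the time spent in fast()/slow()
        // does not accumulate as drift from one tick to the next.
        next += period;
        std::this_thread::sleep_until(next);
    }
}
```

With a period of about 33 ms and slow_every set to 3, fast would run at roughly 30 Hz and slow at roughly 10 Hz, provided each call finishes within its tick. If a tick overruns, sleep_until returns immediately and the loop catches up as quickly as it can.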
One advantage is that, if data is to be shared between the "events" (e.g. all of the events modify some global data) there is no need for synchronisation, because the looping above guarantees that only one "event" accesses shared data at one time.
Note: your usage of WaitForSingleObject() indicates you are using Windows. Windows (except, arguably, CE in a weak sense) is not really a realtime system, so it does not guarantee precise timing. In other words, the actual intervals you achieve will vary.
It is still possible to restructure to use threads. From your description, there is no evidence you really need anything like that, so I'll leave this reply at that.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I need to decode audio data as fast as possible using the Opus decoder.
Currently my application is not fast enough.
The decoding is as fast as it can get, but I need to gain more speed.
I need to decode about 100 sections of audio. These sections are not consecutive (they are not related to each other).
I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings are completed. In my dreams I could start everything in parallel.
I have not used multithreading before.
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Thank you.
This answer is probably going to need a little refinement from the community, since it's been a long while since I worked in this environment, but here's a start -
Since you're new to multi-threading in C++, start with a simple project to create a bunch of pthreads doing a simple task.
Here's a quick and small example of creating pthreads:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void* ThreadStart(void* arg);
int main( int count, char** argv) {
pthread_t thread1, thread2;
int* threadArg1 = (int*)malloc(sizeof(int));
int* threadArg2 = (int*)malloc(sizeof(int));
*threadArg1 = 1;
*threadArg2 = 2;
pthread_create(&thread1, NULL, &ThreadStart, (void*)threadArg1 );
pthread_create(&thread2, NULL, &ThreadStart, (void*)threadArg2 );
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
free(threadArg1);
free(threadArg2);
}
void* ThreadStart(void* arg) {
int threadNum = *((int*)arg);
printf("hello world from thread %d\n", threadNum);
return NULL;
}
Next, you're going to be using multiple opus decoders. Opus appears to be thread safe, so long as you create separate OpusDecoder objects for each thread.
To feed jobs to your threads, you'll need a list of pending work units that can be accessed in a thread safe manner. You can use std::vector or std::queue, but you'll have to use locks around it when adding to it and when removing from it, and you'll want to use a counting semaphore so that the threads will block, but stay alive, while you slowly add workunits to the queue (say, buffers of files read from disk).
Here's some example code, similar to the above, that shows how to use a shared queue and how to make the threads wait while you fill the queue:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <queue>
#include <semaphore.h>
#include <unistd.h>
void* ThreadStart(void* arg);
static std::queue<int> workunits;
static pthread_mutex_t workunitLock;
static sem_t workunitCount;
int main( int count, char** argv) {
pthread_t thread1, thread2;
pthread_mutex_init(&workunitLock, NULL);
sem_init(&workunitCount, 0, 0);
pthread_create(&thread1, NULL, &ThreadStart, NULL);
pthread_create(&thread2, NULL, &ThreadStart, NULL);
// Make a bunch of workunits while the threads are running.
for (int i = 0; i < 200; i++ ){
pthread_mutex_lock(&workunitLock);
workunits.push(i);
sem_post(&workunitCount);
pthread_mutex_unlock(&workunitLock);
// Pretend that it takes some effort to create work units;
// this shows that the threads really do block patiently
// while we generate workunits.
usleep(5000);
}
// Sometime in the next while, the threads will be blocked on
// sem_wait because they're waiting for more workunits. None
// of them are quitting because they never saw an empty queue.
// Pump the semaphore once for each thread so they can wake
// up, see the empty queue, and return.
sem_post(&workunitCount);
sem_post(&workunitCount);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_mutex_destroy(&workunitLock);
sem_destroy(&workunitCount);
}
void* ThreadStart(void* arg) {
int workUnit;
bool haveUnit;
do{
sem_wait(&workunitCount);
pthread_mutex_lock(&workunitLock);
// Figure out if there's a unit, grab it under
// the lock, then release the lock as soon as we can.
// After we release the lock, then we can 'process'
// the unit without blocking everybody else.
haveUnit = !workunits.empty();
if ( haveUnit ) {
workUnit = workunits.front();
workunits.pop();
}
pthread_mutex_unlock(&workunitLock);
// Now that we're not under the lock, we can spend
// as much time as we want processing the workunit.
if ( haveUnit ) {
printf("Got workunit %d\n", workUnit);
}
}
while(haveUnit);
return NULL;
}
You would break your work up by task. Let's assume your process is in fact CPU bound (you indicate it is but… it's not usually that simple).
Right now, you decode 100 sections:
I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings are completed. In my dreams I could start everything in parallel.
Actually, you should use a number close to the number of cores on the machine.
Assuming a modern desktop (e.g. 2-8 cores), running 100 threads at once will just slow it down; the kernel will waste a lot of time switching from one thread to another, and the process is also likely to use higher peak resources and contend for similar resources.
So just create a task pool which restricts the number of active tasks to the number of cores. Each task would (generally) represent the decoding work to perform for one input file (section). This way, the decoding process is not actually sharing data across multiple threads (allowing you to avoid locking and other resource contention).
When complete, go back and fine tune the number of processes in the task pool (e.g. using the exact same inputs and a stopwatch on multiple machines). The fastest may be lower or higher than the number of cores (most likely because of disk I/O). It also helps to profile.
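A minimal sketch of such a task pool using only the standard library. The name run_jobs and its parameters are made up for illustration; job(i) stands in for "decode section i", and it assumes each job touches only its own data (for example, its own decoder object and its own output buffer).

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs job(i) for every i in [0, job_count) on a fixed number of threads.
// Each worker claims the next unclaimed index, so no job runs twice and no
// lock is needed as long as the jobs do not share mutable data.
void run_jobs(std::size_t job_count,
              const std::function<void(std::size_t)>& job,
              unsigned workers = std::thread::hardware_concurrency()) {
    workers = std::max(1u, workers);  // hardware_concurrency() may return 0
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (std::size_t i = next.fetch_add(1); i < job_count;
                 i = next.fetch_add(1)) {
                job(i);
            }
        });
    }
    for (auto& t : pool) t.join();  // wait until every job has finished
}
```

Passing an explicit workers value makes it easy to run the stopwatch comparison suggested above, for example with the core count, half of it, and twice it.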
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Yes, if the problem is CPU bound, then that is generally fine. This also assumes your decoder/dependent software is capable of running with multiple threads.
The problem you will run into if these are files on disk is that you will probably need to optimize how you read (and write?) the files from many cores. Allowing it to run 8 jobs at once can make your problem disk bound, and 8 simultaneous readers/writers is a bad way to use hard disks, so you may find that it is not as fast as you expected. Therefore, you may need to optimize I/O for your concurrent decode implementation. In this regard, using larger buffer sizes can help, but that comes at a cost in memory.
Instead of making your own threads and managing them, I suggest you use a thread pool and give your decoding tasks to the pool. The pool will assign tasks to as many threads as it and the system can handle. There are different types of thread pools, though, so you can set some parameters, such as forcing the pool to use a specific number of threads or allowing it to keep increasing the number of threads.
One thing to keep in mind is that more threads doesn't mean they execute in parallel. I think the correct term is concurrently, unless you have the guarantee that each thread is run on a different CPU (which would give true parallelism)
Your entire pool can come to a halt if blocked for IO.
Before jumping into multithreading as a solution to speed things up, study the concepts of oversubscription and undersubscription.
If the processing of audio involves long blocking I/O calls, then multithreading is worth it.
Although the vagueness of your question doesn't really help... how about:
Create a list of audio files to convert.
While there is a free processor,
launch the decoder application with the next file in the queue.
Repeat until there is nothing else in the list
If, during testing, you discover the processors aren't always 100% busy, launch 2 decodes per processor.
It could be done quite easily with a bit of bash/tcl/python.
You can use threads in general, but locking has some issues. I will base the answer around POSIX threads and locks, but this is fairly general and you will be able to port the idea to any platform. If your jobs require any kind of locking, you may find the following useful. Also, it is best to keep reusing the existing threads again and again, because thread creation is costly (see thread pooling).
Locking is a bad idea in general for "realtime" audio since it adds latency, but that only applies to real-time jobs; for decoding/encoding, locks are perfectly OK. Even for real-time jobs you can get better performance and avoid dropping frames by using some threading knowledge.
For audio, semaphores are a bad, bad idea. They were too slow on at least my system (POSIX semaphores) when I tried, but you will need them if you are thinking of cross-thread locking (not the type of locking where you lock in one thread and unlock in the same thread). POSIX mutexes only allow self lock and self unlock (you have to do both in the same thread); otherwise the program might work, but it's undefined behavior and should be avoided.
Most lock-free atomic operations might give you enough freedom from locks to use some functionality(like locking) but with better performance.
I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following: I have around 1000 loops of around 10'000 iterations each to do. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is to split the 10'000-iteration loops into 4 loops of 2'500 iterations, i.e. one per thread. But I have to wait for the 4 small loops to finish before going to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
// some housekeeping
HANDLE signals[2];
signals[0] = waitSignal;
signals[1] = endSignal;
do {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
// try to get the next job parameters;
if (tp->getNextJob(threadId, data)) {
// execute job
void* output = jobFunc(data.params);
// tell thread pool that we're done and collect output
tp->collectOutput(data.ID, output);
}
tp->threadDone(threadId);
}
while (waitResult - WAIT_OBJECT_0 == 0);
// if we reach this point, endSignal was sent, so we are done !
return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
threadData data;
unsigned int threadId = 0;
char eventName[20];
sprintf_s(eventName, 20, "WaitSignal_%d", i);
data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
this, CREATE_SUSPENDED, &threadId);
data.threadId = threadId;
data.busy = false;
data.waitSignal = CreateEvent(NULL, true, false, eventName);
this->threads[threadId] = data;
// start thread
ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
// housekeeping
EnterCriticalSection(&(this->mutex));
// first, insert parameters in the list
this->jobs.push_back(job);
// then, find the first free thread and wake it
for (it = this->threads.begin(); it != this->threads.end(); ++it) {
thread = (threadData) it->second;
if (!thread.busy) {
this->threads[thread.threadId].busy = true;
++(this->nbActiveThreads);
// wake thread such that it gets the next params and runs them
SetEvent(thread.waitSignal);
break;
}
}
LeaveCriticalSection(&(this->mutex));
}
This looks to me like a producer-consumer pattern, which can be implemented with two semaphores, one guarding against queue overflow, the other against an empty queue.
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
if( waitResult - WAIT_OBJECT_0 != 0 )
return 0; // threadFunction returns unsigned, so return a value
//....
}
Since you say that it is much slower in parallel than in sequential execution, I assume that the processing time for your internal 2500 loop iterations is tiny (in the range of a few microseconds). Then there is not much you can do except review your algorithm to split off larger chunks of processing; OpenMP won't help, and no other synchronization technique will help either, because they all fundamentally rely on events (spin loops do not qualify).
On the other hand, if the processing time of the 2500 loop iterations is larger than 100 microseconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it across four processors will not give you more bandwidth; it will actually give you less because of collisions. You could also be running into problems of cache cycling, where each of your top 1000 iterations flushes and reloads the cache of the 4 cores. Then there is no one solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using VS 2008, I'd suggest looking at OpenMP. If you're using Visual Studio 2010 beta 1, I'd suggest looking at the parallel pattern library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small, which is why I suggest using an implementation that is already built. You also need to ensure that you aren't running in debug mode or under a debugger, that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and that you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf, the F1 profiler in Visual Studio 2010 beta 1 (it has 2 new concurrency modes which help you see contention), or Intel's VTune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues: how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just once to kick off the threads. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.
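One way this could look, as a sketch in standard C++11 rather than the pool from the question. The `Job` record, `processJob`, `drainJobs` and `runAll` are hypothetical names; an atomic index stands in for the queue so that workers claim jobs by reference without any per-job event or copy:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Job {          // hypothetical job record, processed in place
    int input;
    int output;
};

void processJob(Job& job) { job.output = job.input * 2; }  // stand-in for real work

// Each worker keeps claiming the next unprocessed job until none are left.
void drainJobs(std::vector<Job>& jobs, std::atomic<std::size_t>& next) {
    for (;;) {
        const std::size_t i = next.fetch_add(1);
        if (i >= jobs.size()) return;   // everything has been claimed
        processJob(jobs[i]);            // works on the stored job, no copy
    }
}

// Start the workers once, let them drain the whole batch, then wait for them.
void runAll(std::vector<Job>& jobs, unsigned threadCount) {
    assert(threadCount > 0);
    std::atomic<std::size_t> next(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
        workers.emplace_back(drainJobs, std::ref(jobs), std::ref(next));
    for (auto& w : workers) w.join();
}
```

Here the only synchronization per job is a single atomic increment; the expensive part (starting and joining the workers) happens once per batch.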
The context switching between threads can be expensive too. In some cases it is worthwhile to develop a framework that can process your jobs either sequentially with one thread or with multiple threads. This way you can have the best of both worlds.
By the way, what is your question exactly? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can cost more than your processing in some cases, but it should not be that expensive unless your processing itself is really fast. In that case, switching between threads is expensive too, hence the first part of my answer about doing things sequentially...
You should look for inter-thread synchronisation bottlenecks. You can start by tracing how long the threads spend waiting...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer's cores/processors to parallelize some processing that is essentially sequential.
Say you have 4 cores and 10000 loops to compute, as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process. You just need to give your four threads the nth, (n+1)th, (n+2)th and (n+3)th loops, wait for the four threads to complete, then go on. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. You can look at the Windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread in a critical section is what is killing your CPU time, not the event part.
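As an illustration only, here is a small hand-rolled barrier in C++11 with each thread taking every Nth loop. `Barrier`, `runSteps` and the per-element work are assumptions made for this sketch; in real code the Boost barrier (or `std::barrier` in C++20) would be the obvious candidates:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (hypothetical helper for this sketch).
class Barrier {
public:
    explicit Barrier(std::size_t count)
        : threshold_(count), waiting_(0), generation_(0) {
        assert(count > 0);
    }

    // Blocks until `count` threads have arrived, then releases them all.
    void arriveAndWait() {
        std::unique_lock<std::mutex> lock(m_);
        const std::size_t gen = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;
            ++generation_;          // start the next round
            cv_.notify_all();
        } else {
            cv_.wait(lock, [this, gen] { return gen != generation_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t threshold_;
    std::size_t waiting_;
    std::size_t generation_;
};

// Runs `steps` sequential steps; inside each step, thread t handles loops
// t, t + threadCount, t + 2 * threadCount, ... and then waits for the others.
std::vector<long> runSteps(std::size_t steps, std::size_t itemsPerStep,
                           std::size_t threadCount) {
    std::vector<long> data(itemsPerStep, 0);
    Barrier barrier(threadCount);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t s = 0; s < steps; ++s) {
                for (std::size_t i = t; i < itemsPerStep; i += threadCount)
                    data[i] += static_cast<long>(i);   // stand-in for real work
                barrier.arriveAndWait();               // rendezvous before next step
            }
        });
    }
    for (auto& w : workers) w.join();
    return data;
}
```

The threads are created once and never search for work: the assignment is fixed by the thread index. Note that with this interleaved split neighbouring threads write to adjacent elements, so giving each thread one contiguous block may be kinder to the cache.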
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive!
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with threadpools here in the past, I'd suggest adding some debug code to count the number of times that your threadfunction is actually invoked... does it match what you expect?
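Such a debug counter could be as small as the following sketch (C++11; `g_jobCalls`, `countedJobFunction` and `countCalls` are hypothetical names, and the job body is left empty):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<long> g_jobCalls(0);   // debug-only invocation counter

void countedJobFunction() {
    g_jobCalls.fetch_add(1, std::memory_order_relaxed);
    // ... the real job would run here ...
}

// Calls the job from several threads and reports how often it actually ran.
long countCalls(unsigned threadCount, unsigned callsPerThread) {
    assert(threadCount > 0);
    g_jobCalls.store(0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([callsPerThread] {
            for (unsigned i = 0; i < callsPerThread; ++i) countedJobFunction();
        });
    }
    for (auto& w : workers) w.join();
    return g_jobCalls.load();
}
```

If the reported count differs from the number of jobs you submitted, jobs are being dropped or run twice somewhere in the pool.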
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work, but it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are always idle processors. I would suggest reorganizing the code so that addJob() does what it is named -- adds a job ONLY (without finding or even caring who does the job) while each worker thread actively gets new work when it is done with its existing work.
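A sketch of that reorganization, written with C++11 primitives rather than the Win32 events and critical sections from the question. `WorkerPool` and its members are assumed names, not the original `ThreadPool` class:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned threadCount) : stopping_(false) {
        assert(threadCount > 0);
        for (unsigned i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    // Finishes all queued jobs, then stops and joins the workers.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Only enqueues; it neither looks for nor cares about an idle thread.
    void addJob(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    // Each worker fetches its own next job as soon as it is free.
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;   // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // run outside the lock so other workers can fetch jobs
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_;
    std::vector<std::thread> workers_;
};
```

Because addJob() can never fail to find a free thread, no return code is needed: a job simply waits in the queue until some worker picks it up.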