I'm writing a thread pool class in C++ which receives tasks to be executed in parallel. I want all cores to be busy, if possible, but sometimes some threads are idle because they are blocked for a time for synchronization purposes. When this happens I would like to start a new thread, so that there are always approximately as many threads awake as there are cpu cores. For this purpose I need a way to find out whether a certain thread is awake or sleeping (blocked). How can I find this out?
I'd prefer to use the C++11 standard library or boost for portability purposes. But if necessary I would also use WinAPI. I'm using Visual Studio 2012 on Windows 7. But really, I'd like to have a portable way of doing this.
Preferably this thread pool should be able to handle cases like
MyThreadPool pool;
for ( int i = 0; i < 100; ++i )
    pool.addTask( &block_until_this_function_has_been_called_a_hundred_times );
pool.join(); // waits until all tasks have been dispatched.
where the function block_until_this_function_has_been_called_a_hundred_times() blocks until 100 threads have called it. At that point all threads should continue running. One requirement for the thread pool is that it must not deadlock because it has too few threads.
Add a facility to your thread pool for a thread to say "I'm blocked" and then "I'm no longer blocked". Before every significant blocking action (see below for what I mean by that) signal "I'm blocked", and then "I'm no longer blocked" afterwards.
What constitutes a "significant blocking action"? Certainly not a simple mutex lock: mutexes should only be held for a short period of time, so blocking on a mutex is not a big deal. I mean things like:
Waiting for I/O to complete
Waiting for another pool task to complete
Waiting for data on a shared queue
and other similar events.
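As a minimal sketch of such a facility (all names here, BlockTracker, BlockedScope, enterBlocked and leaveBlocked, are made up for illustration and are not part of any library), the pool could keep an atomic count of blocked workers, and tasks could use an RAII guard so that the "no longer blocked" signal is sent even when the wait exits via an exception:

```cpp
#include <atomic>

// Hypothetical bookkeeping a pool could expose; names are illustrative only.
class BlockTracker {
public:
    void enterBlocked() { ++blocked_; }   // "I'm blocked"
    void leaveBlocked() { --blocked_; }   // "I'm no longer blocked"
    int blockedCount() const { return blocked_.load(); }

private:
    std::atomic<int> blocked_{0};
};

// RAII guard: the calling thread counts as blocked for the guard's lifetime.
class BlockedScope {
public:
    explicit BlockedScope(BlockTracker& tracker) : tracker_(tracker) {
        tracker_.enterBlocked();
    }
    ~BlockedScope() { tracker_.leaveBlocked(); }

    BlockedScope(const BlockedScope&) = delete;
    BlockedScope& operator=(const BlockedScope&) = delete;

private:
    BlockTracker& tracker_;
};
```

A task would wrap each significant wait in a scope such as `{ BlockedScope guard(tracker); /* wait here */ }`. What the pool does with the count (for example, starting an extra worker when the number of non-blocked threads drops below the core count) is left open here.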
Use Boost Asio. It has its own thread pool management and scheduling framework. The basic idea is to push tasks to the io_service object using the post() method, and to call run() from as many threads as you have CPU cores. You should create a work object while the calculation is running to keep the threads from exiting when they temporarily have no jobs.
The important thing about Asio is never to use any blocking calls. For I/O calls, use the asynchronous calls of Asio's own I/O objects. For synchronization, use strand objects instead of mutexes. If you post functions to the io_service wrapped in a strand, Asio ensures that at most one task belonging to that strand runs at any time. If there is a conflict, the task remains in Asio's event queue instead of blocking a worker thread.
There is one drawback to asynchronous programming, though. Code that is scattered across several asynchronous calls is much harder to read than code with a clear control flow. You should be aware of this when designing your program.
In my app I will receive various events that I would like to process asynchronously in a prioritised order.
I could do this with a boost::asio::io_service, but my application is single threaded. I don't want to pay for locks and mallocs you might need for a multi threaded program (the performance cost really is significant to me). I'm basically looking for a boost::asio::io_service that is written for single threaded execution.
I'm pretty sure I could implement this myself using boost::coroutine, but before I do, does something like a boost::asio::io_service written for single-threaded execution already exist? I scanned the list of Boost libraries and nothing stood out to me.
Be aware that you have to pay for synchronization as soon as you use any non-blocking calls of Asio.
Even though you might use a single thread for scheduling work and processing the resulting callbacks, Asio might still have to spawn additional threads internally for executing asynchronous calls. Those will access the io_service concurrently.
Think of an async_read on a socket: As soon as the received data becomes available, the socket has to notify the io_service. This happens concurrent to your main thread, so additional synchronization is required.
For blocking I/O this problem goes away in theory, but since asynchronous I/O is sort of the whole point of the library, I would not expect to find too many optimizations for this case in the implementation.
As was pointed out in the comments already, the contention on the io_service will be very low with only one main thread, so unless profiling indicates a clear performance bottleneck there, you should not worry about it too much.
I suggest using boost::asio together with boost::coroutine via boost::asio::yield_context, which already couples coroutines with the io_service. If you detect a task with higher priority, you could suspend the current task and start processing the higher-priority one.
The problem is that you have to define and call check-points in the code of your task so that the task can be suspended when the condition (a higher-priority task has been enqueued) holds.
I need a threadpool for my application, and I'd like to rely on standard (C++11 or boost) stuff as much as possible. I realize there is an unofficial(!) boost thread pool class, which basically solves what I need, however I'd rather avoid it because it is not in the boost library itself -- why is it still not in the core library after so many years?
In some posts on this page and elsewhere, people suggested using boost::asio to achieve thread-pool-like behavior. At first sight, that looked like what I wanted to do; however, I found out that all implementations I have seen have no means to join on the currently active tasks, which makes them useless for my application. To perform a join, they send a stop signal to all the threads and subsequently join them. However, that completely nullifies the advantage of thread pools in my use case, because new tasks then require the creation of new threads.
What I want to do is:
ThreadPool pool(4);
for (...)
{
    for (int i=0;i<something;i++)
        pool.pushTask(...);
    pool.join();
    // do something with the results
}
Can anyone suggest a solution (except for using the existing unofficial thread pool on sourceforge)? Is there anything in C++11 or core boost that can help me here?
I think you might have misunderstood the asio example:
IIRC (and it's been a while) each thread running in the thread pool has called io_service::run, which means that effectively each thread has an event loop and a scheduler. To get asio to complete tasks, you post them to the io_service using the io_service::post method, and asio's scheduling mechanism takes care of the rest. As long as you don't call io_service::stop, the thread pool will continue running with as many threads as you started (assuming that each thread has work to do or that the io_service has been given an io_service::work object).
So you don't need to create new threads for new tasks, that would go against the concept of a threadpool.
Have each task class derive from a Task that has an 'OnCompletion(task)' method/event. The threadpool threads can then call that after calling the main run() method of the task.
Waiting for a single task to complete is then easy. The OnCompletion() can perform whatever is required to signal the originating thread: signaling a condvar, queueing the task to a producer-consumer queue, calling the SendMessage/PostMessage APIs, Invoke/BeginInvoke, whatever.
If an originating thread needs to wait for several tasks to all complete, you could extend the above and issue a single 'wait task' to the pool. The wait task has its own OnCompletion to communicate the completion of the other tasks and has a thread-safe 'task counter' (atomic ops or a lock), set to the number of 'main' tasks to be issued. The wait task is issued to the pool first, and the thread that runs it waits on a private 'allDone' condvar in the wait task. The 'main' tasks are then issued to the pool with their OnCompletion set to call a method of the wait task that decrements the task counter towards zero. When the task counter reaches zero, the thread that achieves this signals the allDone condvar. The wait task's OnCompletion then runs and so signals the completion of all the main tasks.
Such a mechanism does not require the continual create/terminate/join/delete of threadpool threads, places no restriction on how the originating thread needs to be signaled, and lets you issue as many such task groups as you wish. You should note, however, that each wait task blocks one threadpool thread, so make sure you create a few extra threads in the pool (not usually a problem).
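The counter-plus-condvar core of this idea can be sketched with the standard library alone. This is a simplified illustration, not the full 'wait task' scheme described above: the originating thread waits directly instead of going through a wait task, plain std::thread objects stand in for pool workers, and the names TaskGroup, taskDone and runGroupDemo are made up for the example.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Counts outstanding tasks and lets one thread wait until all are done.
class TaskGroup {
public:
    explicit TaskGroup(int count) : remaining_(count) {}

    // Called from each task's completion hook.
    void taskDone() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--remaining_ == 0) allDone_.notify_all();
    }

    // Blocks the caller until every task has reported completion.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        allDone_.wait(lock, [this] { return remaining_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable allDone_;
    int remaining_;
};

// Demo: each stand-in worker computes i*i, then reports completion.
int runGroupDemo(int taskCount) {
    TaskGroup group(taskCount);
    std::vector<int> results(taskCount, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < taskCount; ++i)
        workers.emplace_back([&, i] {
            results[i] = i * i;
            group.taskDone();
        });

    group.wait();  // returns once all tasks have called taskDone()

    int sum = 0;
    for (int r : results) sum += r;
    for (auto& w : workers) w.join();  // only cleans up the stand-in threads
    return sum;
}
```

With real pool threads the final join loop disappears: the workers simply go back to waiting for the next task, and a fresh TaskGroup can be used for the next batch.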
This seems like a job for boost::futures. The example in the docs seems to demonstrate exactly what you're looking to do.
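For illustration, here is roughly what waiting on futures looks like. Note the substitutions: this sketch uses std::future and std::async from C++11 rather than Boost's futures, and std::async starts its own threads instead of using a pool (with a pool you would more likely hand out std::packaged_task objects). The function names square and sumOfSquares are invented for the example.

```cpp
#include <future>
#include <vector>

int square(int x) { return x * x; }

// Launches one asynchronous task per value and collects the results.
int sumOfSquares(int n) {
    std::vector<std::future<int>> futures;
    for (int i = 0; i < n; ++i)
        futures.push_back(std::async(std::launch::async, square, i));

    int sum = 0;
    for (auto& f : futures)
        sum += f.get();  // get() blocks until that task's result is ready
    return sum;
}
```

Calling get() on every future acts as the "join" on the batch of tasks, without stopping any long-lived worker threads.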
Joining a thread means waiting until it has finished, and once it has finished you must create a new thread to run another task. So in your case you should instead wait on a condition (for example a boost::condition_variable) that indicates the end of the tasks. With this technique it is very easy to implement using boost::asio and boost::condition_variable. Each thread calls boost::asio::io_service::run, the tasks are scheduled and executed on the different threads, and at the end each task signals a boost::condition_variable or even decrements a std::atomic to indicate that its job is done. That's really easy, isn't it?
Since the handlers of a strand are never executed concurrently, what is the difference in performance between a strand and a single thread? Moreover, a lock is not necessary to protect the shared data in the handler passed to post, right?
Suppose an application performs several jobs; below is some sample code.
strand.post(boost::bind(&onJob, this, job1));

void onJob(tJobType oType)
{
    if (oType == job1)
    {
        // do something
    }
    else if (oType == job2)
    {
        // do something
    }
}
Edit: I measured the latency between the call to post and the call to onJob, and it is quite high. I would like to know if there is any way to reduce it.
A strand will typically perform better than a single thread. This is because a strand gives the scheduler and the program logic more flexibility. However, the differences are typically not significant (except in the special case I discuss below).
For example, consider the case where something happens that requires service. With a strand, there can be more than one thread that could perform the service, and whichever of those threads gets scheduled first will do the job. With a thread, that very thread must get scheduled for the job to start.
Suppose, for example, a timer fires that creates some new work to be done by the strand. If the timer thread then calls into the strand's dispatch routine, the timer thread can do the work with no context switch. If you had a dedicated thread rather than a strand, then the timer thread could not do the work, and a context switch would be needed before the work created by the timer routine could even begin.
Note that if you just have one thread that executes the strand, you don't get these benefits. (But, IMO, that's a dumb way to do things if you care about performance at this fine a level.)
For some applications, carefully breaking your program into strands can significantly reduce the amount of lock operations required. Objects that are only accessed in a single strand need not be locked. But you can still get a lot of the advantages of multi-threading. (One big disadvantage though -- if any of your code ever blocks, it will stall the entire strand. So you either have to not mind if a strand stalls or make sure none of your code for a critical strand ever blocks.)
In this case, you can have three strands, A, B, and C, and a single thread can do some work for strand A, some for strand B, and some for strand C with no context switches (and with the data hot in the cache). Using a thread for each task would require two context switches to do the same job, and each task would likely not find the data in cache. If you constantly "hand things" from strand to strand, strands can significantly outperform dedicated threads.
As to your second question, a lock is not needed unless data is being accessed in one thread while it could possibly be being modified in another thread. If all accesses to an object are through a single strand, locks are not needed because a strand can only execute in one thread at a time. Typically, strands will access some data that is only accessed by that strand and some that is shared with other threads or strands.
I have a program that should get the maximum out of my cpu.
It is multithreaded via pthreads that do their job well apart from the fact that they "only" get my cores to about 60% load which is not enough in my opinion.
I am searching for the reason and am asking myself (and hereby you) if the blocking functions mutex_lock/cond_wait are candidates?
What happens when a thread cannot continue in such a function?
Does pthread switch to another thread it handles or
does the thread yield its time to the system and if the latter is the case, can I change this behavior?
Regards,
Nobody
More Information
The setting is one main thread that fills the task pool and countless workers that fetch jobs from there and wait on a condition variable that is signaled via broadcast when a serialized calculation is done. They go on with the values from this calculation until they are done, deliver their mail and fetch the next job...
On a typical modern pthreads implementation, each thread is managed by the kernel not unlike a separate process. Any blocking call like pthread_mutex_lock or pthread_cond_wait (but also, say, read) will yield its time to the system. The system will then find another eligible thread to schedule, whether in your process or another process, and run it.
If your program is only taking 60% of the CPU, it is more likely blocked on I/O than on pthread operations, unless you have done something way too granular with your pthread operations.
If a thread is waiting on a mutex/condition, it doesn't use resources (well, uses just a tiny amount). Whenever the thread enters waiting state, control switches to other threads. When the mutex is released (or condition variable signalled), the thread wakes up and may acquire the mutex (if no other thread grabs it first), and continue to run. If however some other thread acquires the mutex (this can happen if several threads are waiting for it), the thread returns to sleeping state.
I am developing a C++ application that needs to process a large amount of data. I am not in a position to partition the data so that multiple processes can handle each partition independently. I am hoping to get ideas on frameworks/libraries that can manage threads and work allocation among worker threads.
Thread management should include at least the following functionality.
1. Decide on how many workers threads are required. We may need to provide user-defined function to calculate number of threads.
2. Create required number of threads.
3. Kill/stop unnecessary threads to reduce resource wastage.
4. Monitor healthiness of each worker thread.
Work allocation should include the following functionality.
1. Using callback functionality, the library should get a piece of work.
2. Allocate the work to available worker thread.
3. Master/slave configuration or pipeline-of-worker-threads should be possible.
Many thanks in advance.
Your question essentially boils down to "how do I implement a thread pool?"
Writing a good thread pool is tricky. I recommend hunting for a library that already does what you want rather than trying to implement it yourself. Boost has a thread-pool library in the review queue, and both Microsoft's concurrency runtime and Intel's Threading Building Blocks contain thread pools.
With regard to your specific questions, most platforms provide a function to obtain the number of processors. In C++11 this is std::thread::hardware_concurrency(), which may return 0 if the value cannot be determined. You can then use this in combination with information about the work to be done to pick a number of worker threads.
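A small sketch of picking a worker count this way follows. The function name pickWorkerCount, its fallback parameter, and the policy of capping at the number of pending tasks are assumptions made for the example, not a recommendation for every workload.

```cpp
#include <algorithm>
#include <thread>

// Picks a worker count from the core count and the amount of pending work.
// 'fallback' (hypothetical parameter) is used when the core count is unknown.
unsigned pickWorkerCount(unsigned pendingTasks, unsigned fallback = 2) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = fallback;  // 0 means "not computable"
    // Never more workers than tasks, and always at least one worker.
    return std::max(1u, std::min(cores, pendingTasks));
}
```

A user-defined policy, as asked for in the question, could replace the body of this function while keeping the same call site.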
Since creating threads is actually quite time consuming on many platforms, and blocked threads do not consume significant resources beyond their stack space and thread info block, I would recommend that you just block worker threads with no work to do on a condition variable or similar synchronization primitive rather than killing them in the first instance. However, if you end up with a large number of idle threads, it may be a signal that your pool has too many threads, and you could reduce the number of waiting threads.
Monitoring the "healthiness" of each thread is tricky, and typically platform dependent. The simplest way is just to check that (a) the thread is still running, and hasn't unexpectedly died, and (b) the thread is processing tasks at an acceptable rate.
The simplest means of allocating work to threads is just to use a single shared job queue: all tasks are added to the queue, and each thread takes a task when it has completed the previous task. A more complex alternative is to have a queue per thread, with a work-stealing scheme that allows a thread to take work from others if it has run out of tasks.
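The single-shared-queue variant can be sketched as follows. This is a bare-bones illustration using only the C++11 standard library, not a production pool: the names SimplePool, push, waitIdle and poolDemo are invented here, tasks that throw are not handled, and there is no work stealing. Idle workers block on a condition variable rather than exiting, and waitIdle() waits for all pushed tasks without stopping the threads.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimplePool {
public:
    explicit SimplePool(unsigned threads) {
        for (unsigned i = 0; i < threads; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }
    // Blocks until every pushed task has finished; the workers stay alive.
    void waitIdle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
            std::lock_guard<std::mutex> lock(mutex_);
            if (--pending_ == 0) idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_, idle_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    unsigned pending_ = 0;
    bool stopping_ = false;
};

// Demo: three rounds of ten tasks each, reusing the same four threads.
int poolDemo() {
    std::atomic<int> counter(0);
    SimplePool pool(4);
    for (int round = 0; round < 3; ++round) {
        for (int i = 0; i < 10; ++i) pool.push([&counter] { ++counter; });
        pool.waitIdle();  // all ten tasks of this round are done here
    }
    return counter.load();
}
```

Because every worker contends on one mutex, this design is simple but can become a bottleneck with very small tasks, which is where the per-thread queues with work stealing come in.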
If your threads can submit tasks to the work queue and wait for the results then you need to have a scheme for ensuring that your worker threads do not all get stalled waiting for tasks that have not yet been scheduled. One option is to spawn a new thread when a task gets blocked, and another is to run the not-yet-scheduled task that is blocking a given thread on that thread directly in a recursive manner. There are advantages and disadvantages with both these schemes, and with other alternatives.