C++ multithreading: managing threads

I have a pool of N jobs and I would like to have a constant load of M (= number of CPUs) threads. How can I do that in C++?
I already know the basics of the C++ thread library. The easiest way is to enqueue M jobs and wait for all of them to finish, then enqueue another M jobs, and so on as long as there are jobs left.
This simple approach works fine as long as each job/thread takes approximately the same amount of time. If that is not the case, it can easily happen that one long-running thread is still working while all the others are finished, so only one of the M CPUs is loaded.
So I need some kind of thread management. Instead of waiting for all threads, I have to constantly check how many threads are running and enqueue a new one if necessary.
Is there something similar already implemented in C++? Otherwise, what would be the easiest/smartest way to implement such a manager?

So I need some kind of thread management. Instead of waiting for all threads, I have to constantly check how many threads are running and enqueue a new one if necessary.
Is there something similar already implemented in C++?
There is a task scheduler in the Intel Threading Building Blocks (TBB) library, which load-balances user tasks onto the available CPUs in a highly efficient fashion.
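For a feel of what that looks like, here is a minimal sketch using TBB's task_group (this assumes the Intel TBB / oneTBB headers are available, and process_job stands in for one of your N jobs; it is a sketch, not TBB's only interface):

#include <tbb/task_group.h>

void process_job(int /*job*/) { /* ... the real work ... */ }

void run_all_jobs(int n_jobs)
{
    tbb::task_group group;
    for (int i = 0; i < n_jobs; ++i)
        group.run([i] { process_job(i); });   // hand job i to TBB's worker pool
    group.wait();                             // block until every job has finished
}

TBB sizes its worker pool to the machine, so a single long-running job on one core does not keep the other cores waiting the way the batch-of-M approach does.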

Related

Running only a given amount of threads at any given time in C++

I have a (let's say) function in C++ that takes a random amount of time to return a result (from 0.001 to 10 seconds). I would like to run this function N ~ 10^5 independent times, so it is natural to use lots of threads.
My question is: how can I run only 10 threads at a time? By this I mean that I would like to start 10 threads and launch a new one only when another finishes. I tried launching 10^5 threads and letting the computer figure it out, but it doesn't work. Any suggestions?
You can either use std::async and let the system manage the number of threads, or you can set up a limited number of threads yourself (a thread pool), preferably sized with std::thread::hardware_concurrency(). There are many ways to do this...
You need to start reading the documentation for:
std::promise, std::future (std::shared_future)
std::packaged_task
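As a small illustration of how those pieces fit together, here is a hedged sketch of std::packaged_task and std::future used to run one call on a separate thread and collect its result (slow_function is a stand-in for your function with the random runtime):

#include <future>
#include <thread>

int slow_function(int x) { return x * 2; }    // stand-in for the real slow call

int run_one()
{
    // Wrap the call so its result can be retrieved through a std::future.
    std::packaged_task<int(int)> task(slow_function);
    std::future<int> result = task.get_future();
    std::thread worker(std::move(task), 42);  // execute it on another thread
    worker.join();
    return result.get();                      // the value returned by slow_function(42)
}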
C++ Concurrency In Action has an excellent chapter 9 on thread pools.
One good, more advanced idea is to make a thread-safe function queue (the simplest such queue can be obtained by wrapping std::queue in a class with a much reduced interface that locks a std::mutex before pushing and before popping/returning the popped value) to which you push your functions in advance... Then have your threads pull from the queue and execute the functions.
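A minimal sketch of that idea (task_queue and run_with_10_threads are illustrative names, not standard facilities): a std::queue guarded by a std::mutex, with ten workers popping functions until the queue is drained.

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class task_queue {
public:
    void push(std::function<void()> f) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(f));
    }
    bool pop(std::function<void()>& f) {        // returns false when the queue is empty
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        f = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::queue<std::function<void()>> q_;
};

void run_with_10_threads(task_queue& tasks) {
    std::vector<std::thread> workers;
    for (int i = 0; i < 10; ++i)
        workers.emplace_back([&tasks] {
            std::function<void()> f;
            while (tasks.pop(f))                // each worker keeps pulling until empty
                f();
        });
    for (auto& t : workers) t.join();
}

This works when all the functions are pushed up front, as suggested above; if tasks can keep arriving while the workers run, the pop side needs a condition variable so idle workers can wait instead of exiting.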

Find minimum queue size among threads

I am trying to implement a new scheduling technique with multiple threads. Each thread has its own private local queue. The idea is that each time a task is created by the program thread, it should find the queue with the minimum size (the queue with the fewest tasks) among the queues and enqueue the task there.
It is a way of load balancing among threads, where the less busy queues get more tasks enqueued.
Can you please suggest some logic or ideas for how to find the minimum-size queue among the given queues dynamically, from a programming point of view?
I am working in Visual Studio 2008 with C++, in our own multithreading library implementing a multi-rate synchronous dataflow paradigm.
As you can see, trying to find the least loaded queue is cumbersome and can be inefficient: you may add more work to a queue that holds only one heavy task, whereas queues with small tasks get no more jobs and quickly become inactive.
You'd be better off using a work-stealing heuristic: when a thread is done with its own jobs, it looks at the other threads' queues and "steals" some work instead of remaining idle or being terminated.
The system then balances itself, with each thread staying active until there is not enough work for everyone.
You should never end up with idle threads while work is still waiting to be processed.
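A much simplified sketch of that heuristic, assuming tasks are std::function<void()> objects and each per-thread deque is guarded by an ordinary mutex (a production scheduler would use a lock-free or Chase-Lev deque instead):

#include <deque>
#include <functional>
#include <mutex>
#include <vector>

using task = std::function<void()>;

struct worker_queue {                           // one of these per worker thread
    std::mutex m;
    std::deque<task> tasks;
};

// Take from the back of our own queue, or steal from the front of someone else's.
bool get_task(std::vector<worker_queue>& queues, size_t self, task& out) {
    {
        std::lock_guard<std::mutex> lock(queues[self].m);
        if (!queues[self].tasks.empty()) {
            out = std::move(queues[self].tasks.back());
            queues[self].tasks.pop_back();
            return true;
        }
    }
    for (size_t i = 0; i < queues.size(); ++i) {   // our queue is empty: try to steal
        if (i == self) continue;
        std::lock_guard<std::mutex> lock(queues[i].m);
        if (!queues[i].tasks.empty()) {
            out = std::move(queues[i].tasks.front());
            queues[i].tasks.pop_front();
            return true;
        }
    }
    return false;                               // no work left anywhere
}

void worker_loop(std::vector<worker_queue>& queues, size_t self) {
    task t;
    while (get_task(queues, self, t))           // exits when everything is drained;
        t();                                    // a long-lived pool would wait instead
}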
If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead, and the occasional 'mistake' where a task is queued to a thread that happens to be running a particularly lengthy job while another thread is just about to dequeue a very short one, is another issue.
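If you do want to try it, a hedged sketch of that counter (counted_queue is an illustrative name, and the task type is simplified to an int): each queue keeps a std::atomic<int> alongside its data, and the dispatcher reads the counters without taking any lock.

#include <atomic>
#include <mutex>
#include <queue>
#include <vector>

struct counted_queue {
    std::atomic<int> count{0};                  // approximate size, readable without the lock
    std::mutex m;
    std::queue<int> tasks;

    void push(int t) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(t); }
        count.fetch_add(1, std::memory_order_relaxed);
    }
    bool pop(int& t) {
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        t = tasks.front(); tasks.pop();
        count.fetch_sub(1, std::memory_order_relaxed);
        return true;
    }
};

// Pick the queue that currently reports the fewest tasks; the value is only a
// snapshot and may already be slightly stale when the new task is pushed.
size_t shortest(const std::vector<counted_queue*>& queues) {
    size_t best = 0;
    for (size_t i = 1; i < queues.size(); ++i)
        if (queues[i]->count.load(std::memory_order_relaxed) <
            queues[best]->count.load(std::memory_order_relaxed))
            best = i;
    return best;
}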
Why aren't the threads fetching their work from a 'master' work queue?
If you are really trying to distribute work items from a master source to a set of workers, then you are doing load balancing, as you say. In that case you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in computing; you can easily spend weeks, or months, learning about it.
You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy if you want a single-reader/single-writer pattern, because you probably have to take a lock for this operation, which is not what you want. Note: I'm just guessing that you can't use lock-free queues here; whether you keep a counter or take the difference of two pointers, either way you need a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling it that worker thread x has consumed a job. With this approach you have n more queues, one from each worker thread back to the dispatcher thread.
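A simplified sketch of that second option (the answer suggests lock-free queues; a mutex-protected queue is used here only to keep the sketch short, and feedback_queue/pick_worker are illustrative names): workers report their id after each finished job, and the dispatcher folds those reports into per-worker counts before choosing where to enqueue the next job.

#include <mutex>
#include <queue>
#include <vector>

struct feedback_queue {
    std::mutex m;
    std::queue<int> finished;                   // ids of workers that completed a job

    void report(int worker_id) {                // called by a worker thread
        std::lock_guard<std::mutex> lock(m);
        finished.push(worker_id);
    }
    bool try_report(int& worker_id) {           // polled by the dispatcher thread
        std::lock_guard<std::mutex> lock(m);
        if (finished.empty()) return false;
        worker_id = finished.front();
        finished.pop();
        return true;
    }
};

// Dispatcher side: bring the counts up to date, then pick the least-loaded worker.
int pick_worker(std::vector<int>& outstanding, feedback_queue& fb) {
    int id;
    while (fb.try_report(id))                   // fold in completions reported so far
        --outstanding[id];
    int target = 0;
    for (int i = 1; i < static_cast<int>(outstanding.size()); ++i)
        if (outstanding[i] < outstanding[target]) target = i;
    ++outstanding[target];
    return target;                              // enqueue the new job on this worker's queue
}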

Is it safe to use hundreds of threads if they're only created once?

Basically I have a Task and a Thread class, and I create threads equal to the number of physical cores (or logical cores, since on Intel CPUs they're double the count).
So basically the threads take tasks from a list of tasks and execute them. However, I do have to make sure everything is safe and that multiple threads don't try to take the same task at once, and of course this introduces extra overhead (and headaches).
What if I put the tasks' functionality inside the threads? I mean, instead of 4 threads grabbing tasks from a pool of 200 tasks, why not 200 threads that execute in groups of 4 by 4? Basically I won't need to synchronize anything: no locking, no nothing. Of course I won't be creating the threads during the whole run-time, just at initialization.
What pros and cons would such a method have? One problem I can think of is that since I only create the threads at initialization, their count is fixed, while with tasks I can keep dumping more tasks into the task pool.
Threads have a cost - each one requires space for thread-local storage and a stack, as a minimum.
Keeping your Task and Thread classes separate would be a cleaner and more manageable approach in the long run, and it keeps overhead down by allowing you to limit how many Threads are created and running at any given time (also, a Task is likely to take up less memory than a Thread, and to be faster to create and free when needed). A Task controls what gets done. A Thread controls when a Task is run. Yes, you would need to store the Task objects in a thread-safe list, but that is very simple to implement using a critical section, mutex, semaphore, etc. On Windows specifically, you could alternatively use an I/O Completion Port to submit Tasks to Threads, and let the OS handle the synchronization and scheduling for you.
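For the Windows-specific suggestion, here is a hedged sketch of using an I/O completion port as a task queue (Task and WorkerThread are illustrative; the port is created with CreateIoCompletionPort and tasks are handed over with PostQueuedCompletionStatus):

#include <windows.h>

struct Task { void (*run)(void*); void* arg; };   // made-up task representation

DWORD WINAPI WorkerThread(void* param)
{
    HANDLE port = static_cast<HANDLE>(param);
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    OVERLAPPED* ov = nullptr;
    // Block until a task is posted; a key of 0 is used here as the stop signal.
    while (GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE) && key != 0) {
        Task* task = reinterpret_cast<Task*>(key);
        task->run(task->arg);
        delete task;
    }
    return 0;
}

// Producer side (sketch):
//   HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
//   ... start a few WorkerThread threads with CreateThread, passing `port` ...
//   PostQueuedCompletionStatus(port, 0, reinterpret_cast<ULONG_PTR>(new Task{...}), NULL);

The OS then decides which waiting worker wakes up for each posted task, which is the synchronization and scheduling referred to above.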
It will definitely take longer to have 200 threads running at once than to run 200 "tasks" on 4 threads. You can test this with a simple program that does some simple math (e.g. count the primes among the first 20000000 numbers, either by asking each of 4 threads to check 100000 numbers at a time and then grab the next lot, or by making 200 threads with 100000 numbers each).
How much slower? Don't know, depends on so many things.
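As a rough sketch of the first variant of that test (the block size and thread count are just the numbers from the example above, and is_prime is a deliberately naive check): a fixed set of workers repeatedly claim the next block of 100000 numbers through an atomic counter.

#include <atomic>
#include <thread>
#include <vector>

static bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

long count_primes_chunked(long limit, int n_threads, long block = 100000) {
    std::atomic<long> next_block{0};
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                long start = next_block.fetch_add(block);   // claim the next batch
                if (start >= limit) break;
                long found = 0;
                for (long n = start; n < start + block && n < limit; ++n)
                    if (is_prime(n)) ++found;
                total.fetch_add(found);
            }
        });
    for (auto& w : workers) w.join();
    return total.load();
}

Timing count_primes_chunked(20000000, 4) against a version that creates 200 threads of 100000 numbers each is the comparison described above.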

Possible frameworks/ideas for thread managment and work allocation in C++

I am developing a C++ application that needs to process a large amount of data. I am not in a position to partition the data so that multiple processes can handle each partition independently. I am hoping to get ideas on frameworks/libraries that can manage threads and work allocation among worker threads.
Thread management should include at least the functionality below.
1. Decide how many worker threads are required. We may need to provide a user-defined function to calculate the number of threads.
2. Create the required number of threads.
3. Kill/stop unnecessary threads to reduce resource wastage.
4. Monitor the healthiness of each worker thread.
Work allocation should include the functionality below.
1. Using callback functionality, the library should get a piece of work.
2. Allocate the work to an available worker thread.
3. Master/slave configuration or pipeline-of-worker-threads should be possible.
Many thanks in advance.
Your question essentially boils down to "how do I implement a thread pool?"
Writing a good thread pool is tricky. I recommend hunting for a library that already does what you want rather than trying to implement it yourself. Boost has a thread-pool library in the review queue, and both Microsoft's Concurrency Runtime and Intel's Threading Building Blocks contain thread pools.
With regard to your specific questions, most platforms provide a function to obtain the number of processors. In C++0x this is std::thread::hardware_concurrency(). You can then use this in combination with information about the work to be done to pick a number of worker threads.
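For example (the fallback value below is just an illustrative choice; hardware_concurrency() is allowed to return 0 when the count cannot be determined):

#include <thread>

unsigned pick_worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 2;      // fall back to a small default if the count is unknown
}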
Since creating threads is actually quite time consuming on many platforms, and blocked threads do not consume significant resources beyond their stack space and thread info block, I would recommend that you just block worker threads with no work to do on a condition variable or similar synchronization primitive rather than killing them in the first instance. However, if you end up with a large number of idle threads, it may be a signal that your pool has too many threads, and you could reduce the number of waiting threads.
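A minimal sketch of that blocking behaviour, assuming a mutex-protected job queue (pool_state and idle_worker_loop are illustrative names): an idle worker sleeps on a condition variable until a job arrives or the pool shuts down, rather than being destroyed.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

struct pool_state {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    bool shutting_down = false;
};

void idle_worker_loop(pool_state& s) {
    for (;;) {
        std::function<void()> job;
        {
            std::unique_lock<std::mutex> lock(s.m);
            s.cv.wait(lock, [&] { return s.shutting_down || !s.jobs.empty(); });
            if (s.jobs.empty()) return;         // woken only because the pool is shutting down
            job = std::move(s.jobs.front());
            s.jobs.pop();
        }
        job();                                  // run the task outside the lock
    }
}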
Monitoring the "healthiness" of each thread is tricky, and typically platform dependent. The simplest way is just to check that (a) the thread is still running, and hasn't unexpectedly died, and (b) the thread is processing tasks at an acceptable rate.
The simplest means of allocating work to threads is just to use a single shared job queue: all tasks are added to the queue, and each thread takes a task when it has completed the previous task. A more complex alternative is to have a queue per thread, with a work-stealing scheme that allows a thread to take work from others if it has run out of tasks.
If your threads can submit tasks to the work queue and wait for the results then you need to have a scheme for ensuring that your worker threads do not all get stalled waiting for tasks that have not yet been scheduled. One option is to spawn a new thread when a task gets blocked, and another is to run the not-yet-scheduled task that is blocking a given thread on that thread directly in a recursive manner. There are advantages and disadvantages with both these schemes, and with other alternatives.

My threadpool just makes 4~5 threads. Why?

I use the QueueUserWorkItem() function to invoke the threadpool.
And I queued a lot of work items with it (about 30000).
But according to the task manager my application only makes 4~5 threads after I push the start button.
I read on MSDN that the default thread limit is about 500.
Why are only a few threads made in my application?
I'm trying to speed up my application and I suspect this threadpool is one of the reasons it is slow.
thanks
It is important to understand how the threadpool scheduler works. It was designed to fine-tune the number of running threads against the capabilities of your machine. Your machine can probably run only two threads at the same time; dual-core CPUs are the current standard. Maybe four.
So when you dump a bunch of threads in its lap, it starts out by activating only two threads. The rest of them sit in a queue, waiting for CPU cores to become available. As soon as one of those two threads completes, it activates another one. Twice a second, it evaluates what's going on with active threads that didn't complete. It makes the rough assumption that those threads are blocking and thus not making progress, and allows another thread to activate. You've now got three running threads. Getting up to 500 threads, the default maximum number of threads, will take 249 seconds.
Clearly, this behavior spells out what a thread should do to be suitable to run as a threadpool thread: it should complete quickly and not block often. Note that blocking on I/O requests is dealt with separately.
If this behavior doesn't suit you then you can use a regular Thread. It will start running right away and compete with other threads in your program (and the operating system) for CPU time. Creating 30,000 such threads is not possible; there isn't enough virtual memory available for that. A 32-bit operating system poops out somewhere south of 2000 threads, consuming all available virtual memory. You can get about 50,000 threads on a 64-bit operating system before the paging file runs out. Testing these limits in a production program is not recommended.
I think you may have misunderstood the use of the threadpool. Spawning and killing threads involves the Windows kernel and is an expensive operation. If you continuously create threads to perform an asynchronous operation and then throw them away, you end up making many system calls.
So the threadpool is actually a group of threads which are created once and which, instead of exiting when they complete their task, wait for another item queued with QueueUserWorkItem. The threadpool then tunes itself based on how many threads are required concurrently by your process. If you wish to test this, write this code:
for (int i = 0; i < 30000; i++)
{
    ThreadPool.QueueUserWorkItem(myMethod);
}
You will see that this creates a whole bunch of threads, though maybe not 30000, since some of the threads that are created will be reused as the ThreadPool starts to work through your function calls.
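Since the question itself is about the native Win32 pool, here is a hedged C++ equivalent of the same experiment (DoWork is a made-up callback; this QueueUserWorkItem is the Win32 API rather than the .NET one):

#include <windows.h>

static DWORD WINAPI DoWork(void* /*context*/)
{
    // ... one unit of work ...
    return 0;
}

void queue_30000_items()
{
    for (int i = 0; i < 30000; ++i)
        QueueUserWorkItem(DoWork, nullptr, WT_EXECUTEDEFAULT);   // hand the item to the OS pool
}

Watching the process in the task manager while this runs should show the same effect the question describes: the pool keeps only a handful of threads alive and feeds the 30000 items through them.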
The threadpool is there so you can avoid creating a thread for every asynchronous operation for the very reason that threads are expensive. If you want 30,000 threads you're going to use a lot of memory for the thread stacks plus waste a lot of CPU time doing context switches. Now creating that many threads would be justified if you had 30,000 CPU cores...