Boost thread_group thread limiter - c++

Am trying to limit the number of threads at anytime to be equal to utmost the number of cores available. Is the following a reasonable method? Is there a better alternative? Thanks!
boost::thread_group threads;
iThreads = 0;
for (int i = 0; i < Utility::nIterations; i++)
{
threads.create_thread(
boost::bind(&ScenarioInventory::BuildInventoryWorker, this,i));
thread_limiter.lock();
iThreads++;
thread_limiter.unlock();
while (iThreads > nCores)
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
threads.join_all();
void ScenarioInventory::BuildInventoryWorker(int i)
{
//code code code....
thread_limiter.lock();
iThreads--;
thread_limiter.unlock();
}

What you are likely looking for is thread_pool with a task queue.
Have a fixed number of threads blocking an queue. Whenever a task is pushed onto the queue a worker thread gets signalled (condition variable) and processes the task.
That way you
don't have the (inefficient) waiting lock
don't have any more threads than the "maximum"
don't have to block in the code that pushes a task
don't have redundant creation of threads each time around
See this answer for two different demos of such a thread pool w/ task queue: Calculating the sum of a large vector in parallel

Related

How to make two threads take turns executing their respective critical sections after one thread ends

In modern C++ with STL threads I want to have two worker threads that take turns doing their work. Only one can be working at a time and each may only get one turn before the other takes a turn. I have this part working.
The added constraint is that one thread needs to keep taking turns after the other thread finishes. But in my code the remaining worker thread deadlocks after the first worker thread finishes. I don't understand why, given that the last things the first worker did was unlock and notify the condition variable, which should've woken the second one up. Here's the code:
{
std::mutex mu;
std::condition_variable cv;
int turn = 0;
auto thread_func = [&](int tid, int iters) {
std::unique_lock<std::mutex> lk(mu);
lk.unlock();
for (int i = 0; i < iters; i++) {
lk.lock();
cv.wait(lk, [&] {return turn == tid; });
printf("tid=%d turn=%d i=%d/%d\n", tid, turn, i, iters);
fflush(stdout);
turn = !turn;
lk.unlock();
cv.notify_all();
}
};
auto th0 = std::thread(thread_func, 0, 20);
auto th1 = std::thread(thread_func, 1, 25); // Does more iterations
printf("Made the threads.\n");
fflush(stdout);
th0.join();
th1.join();
printf("Both joined.\n");
fflush(stdout);
}
I don't know whether this is something I don't understand about concurrency in STL threads, or whether I just have a logic bug in my code. Note that there is a question on SO that's similar to this, but without the second worker having to run longer than the first. I can't find it right now to link to it. Thanks in advance for your help.
When one thread is done, the other will wait for a notification that nobody will send. When only one thread is left, you need to either stop using the condition variable or signal the condition variable some other way.

Boost Thread_Group in a loop is very slow

I wanted to use threading to run check multiple images in a vector at the same time. Here is the code
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops thru lines of pixel data and each pixel and then multiple images. Its a weird project but anyway I bind the thread to a method in the same instance of the class this code is in so "this" is used. This runs through a population of about 20 images, binding each thread as it goes and then when it is done looping the join_all function takes effect when the threads are done. Then it goes to the next pixel and starts over again.
I'v tested running 50 threads at the same time with this simple program
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs because you are basically pausing execution of any new threads until ALL threads that need to be synchronized, which in this case is all the threads that are active, are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.

Activate threads from the slowest or from the faster?

I have an application on Linux on an i7 using boost::thread, and currently I have about 10 threads (between the 8 cores) that run simultaneously doing image processing on images of sizes of approximately 240 x 180 to 960 x 720, so naturally the smaller images finish faster than the larger images. At some point in the future I may need to bump up the number of threads so there will definitely be many more threads than cores.
So, is there a rule of thumb to decide which order to start and wait for threads; fastest first to get the small tasks out of the way, or slowest first to get it started sooner so finished sooner? Or is my synchronisation code wonky?
For reference, my code is approximately this, where the lower-numbered threads are slower:
Globals
static boost::mutex mutexes[MAX_THREADS];
static boost::condition_variable_any conditions[MAX_THREADS];
Main thread
// Wake threads
for (int thread = 0; thread < MAX_THREADS; thread++)
{
mutexes[thread].unlock();
conditions[thread].notify_all();
}
// Wait for them to sleep again
for (int thread = 0; thread < MAX_THREADS; thread++)
{
boost::unique_lock<boost::mutex> lock(mutexes[thread]);
}
Processing thread
static void threadFunction(int threadIdx)
{
while(true)
{
boost::unique_lock<boost::mutex> lock(mutexes[threadIdx]);
// DO PROCESSING
conditions[threadIdx].notify_all();
conditions[threadIdx].wait(lock);
}
}
Thanks to the the hints from the commenters and much Googling, I've completely reworked my code, which seems to be slightly faster without the mutexes.
Globals
// None now
Main thread
boost::asio::io_service ioService;
boost::thread_group threadpool;
{
boost::asio::io_service::work work(ioService);
for (size_t i = 0; i < boost::thread::hardware_concurrency(); i++)
threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
for (int thread = 0; thread < MAX_THREADS; thread++)
ioService.post(std::bind(threadFunction, thread));
}
threadpool.join_all();
Processing thread
static void threadFunction(int threadIdx)
{
// DO PROCESSING
}
(I've made this a Community Wiki as it's not really my answer.)

c++ pthread limit number of thread

I tried to use pthread to do some task faster. I have thousands files (in args) to process and i want to create just a small number of thread many times.
Here's my code :
void callThread(){
int nbt = 0;
pthread_t *vp = (pthread_t*)malloc(sizeof(pthread_t)*NBTHREAD);
for(int i=0;i<args.size();i+=NBTHREAD){
for(int j=0;j<NBTHREAD;j++){
if(i+j<args.size()){
pthread_create(&vp[j],NULL,calcul,&args[i+j]);
nbt++;
}
}
for(int k=0;k<nbt;k++){
if(pthread_join(vp[k], NULL)){
cout<<"ERROR pthread_join()"<<endl;
}
}
}
}
It returns error, i don't know if it's a good way to solve my problem. All the resources are in args (vector of struct) and are independants.
Thanks for help.
You're better off making a thread pool with as many threads as the number of cores the cpu has. Then feed the tasks to this pool and let it do its job. You should take a look at this blog post right here for a great example of how to go about creating such thread pool.
A couple of tips that are not mentioned in that post:
Use std::thread::hardware_concurrency() to get the number of cores.
Figure out a way how to store the tasks, hint: std::packaged_task or something along
those lines wrapped in a class so you can track things such as when a task is done, or implement task.join().
Also, github with the code of his implementation plus some extra stuff such as std::future support can be found here.
You can use a semaphore to limit the number of parallel threads, here is a pseudo code:
Semaphore S = MAX_THREADS_AT_A_TIME // Initial semaphore value
declare handle_array[NUM_ITERS];
for(i=0 to NUM_ITERS)
{
wait-while(S<=0);
Acquire-Semaphore; // S--
handle_array[i] = Run-Thread(MyThread);
}
for(i=0 to NUM_ITERS)
{
Join_thread(handle_array[i])
Close_handle(handle_array[i])
}
MyThread()
{
mutex.lock
critical-section
mutex.unlock
release-semaphore // S++
}

boost::thread_group - is it ok to call create_thread after join_all?

I have the following situation:
I create a boost::thread_group instance, then create threads for parallel-processing on some data, then join_all on the threads.
Initially I created the threads for every X elements of data, like so:
// begin = someVector.begin();
// end = someVector.end();
// batchDispatcher = boost::function<void(It, It)>(...);
boost::thread_group processors;
// create dispatching thread every ASYNCH_PROCESSING_THRESHOLD notifications
while(end - begin > ASYNCH_PROCESSING_THRESHOLD)
{
NotifItr split = begin + ASYNCH_PROCESSING_THRESHOLD;
processors.create_thread(boost::bind(batchDispatcher, begin, split));
begin = split;
}
// create dispatching thread for the remainder
if(begin < end)
{
processors.create_thread(boost::bind(batchDispatcher, begin, end));
}
// wait for parallel processing to finish
processors.join_all();
but I have a problem with this: When I have lots of data, this code is generating lots of threads (> 40 threads) which keeps the processor busy with thread-switching contexts.
My question is this: Is it possible to call create_thread on the thread_group after the call to join_all.
That is, can I change my code to this?
boost::thread_group processors;
size_t processorThreads = 0; // NEW CODE
// create dispatching thread every ASYNCH_PROCESSING_THRESHOLD notifications
while(end - begin > ASYNCH_PROCESSING_THRESHOLD)
{
NotifItr split = begin + ASYNCH_PROCESSING_THRESHOLD;
processors.create_thread(boost::bind(batchDispatcher, begin, split));
begin = split;
if(++processorThreads >= MAX_ASYNCH_PROCESSORS) // NEW CODE
{ // NEW CODE
processors.join_all(); // NEW CODE
processorThreads = 0; // NEW CODE
} // NEW CODE
}
// ...
Whoever has experience with this, thanks for any insight.
I believe this is not possible. The solution you want might actually be to implement a producer-consumer or a master-worker (main 'master' thread divides the work in several fixed size tasks, creates pool of 'workers' threads and sends one task to each worker until all tasks are done).
These solutions will demand some synchronization through semaphores but they will equalize well the performance one you can create one thread for each available core in the machine avoiding waste of time on context switches.
Another not-so-good-and-fancy option is to join one thread at a time. You can have a vector with 4 active threads, join one and create another. The problem of this approach is that you may waste processing time if your tasks are heterogeneous.