How to avoid deadlock when synching loops with barrier - c++

The program I am implementing involves iterating over a medium amount of independent data, performing some computation, collecting the result, and then looping back over again. This loop needs to execute very quickly. To solve this, I am trying to implement the following thread pattern.
Rather than spawn threads in setup and join them in collect, I would like to spawn all threads initially and keep them synchronized throughout their loops. This question regarding thread barriers had initially seemed to point me in the right direction, but my implementation of them is not working. Below is my example:
int main() {
int counter = 0;
int threadcount = 10;
auto on_completion = [&]() noexcept {
++counter; // Increment counter
};
std::barrier sync_point(threadcount, on_completion);
auto work = [&]() {
while(true)
sync_point.arrive_and_wait(); // Keep cycling the sync point
};
std::vector<std::thread> threads;
for (int i = 0; i < threadcount; ++i)
threads.emplace_back(work); // Start every thread
for (auto& thread : threads)
thread.join();
}
To keep things as simple as possible, there is no computation being done in the worker threads, and I have done away with the setup thread. I am simply cycling the threads, syncing them after each cycle, and keeping a count of how many times they have looped. However, this code is deadlocking very quickly. More threads = faster deadlock. Adding work/delay inside the compute threads slows down the deadlock, but does not stop it.
Am I abusing the thread barrier? Is this unexpected behavior? Is there a cleaner way to implement this pattern?
Edit
It looks like removing the on_completion gets rid of the deadlock. I tried a different approach to meet the synchronization requirements without using the function, but it still deadlocks fairly quickly.
int threadcount = 10;
std::barrier start_point(threadcount + 1);
std::barrier stop_point(threadcount + 1);
auto work = [&](int i) {
while(true) {
start_point.arrive_and_wait();
stop_point.arrive_and_wait();
}
};
std::vector<std::thread> threads;
for (int i = 0; i < threadcount; ++i) {
threads.emplace_back(work, i);
}
while (true) {
std::cout << "Setup" << std::endl;
start_point.arrive_and_wait(); // Sync to start
// Workers do work here
stop_point.arrive_and_wait(); // Sync to end
std::cout << "Collect" << std::endl;
}
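Not a diagnosis of the deadlock above, but as a point of comparison, here is a minimal sketch of the same start/stop pattern using a hand-rolled reusable barrier built on std::mutex and std::condition_variable instead of std::barrier. CyclicBarrier and run_cycles are hypothetical names, and the loops run a fixed number of rounds (an assumption, so that the threads can be joined at the end).

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper, not part of the standard library: a reusable barrier.
class CyclicBarrier {
public:
    explicit CyclicBarrier(std::size_t count)
        : threshold_(count), waiting_(0), generation_(0) {}

    // Blocks until threshold_ threads have arrived, then releases all of them.
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t gen = generation_;
        if (++waiting_ == threshold_) {
            // Last arrival: reset the count, start a new generation, wake everyone.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            // The generation check guards against spurious wakeups and barrier reuse.
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t threshold_;
    std::size_t waiting_;
    std::size_t generation_;
};

// Runs `rounds` setup/collect cycles with `threadcount` workers and
// returns the number of cycles the main thread completed.
int run_cycles(int threadcount, int rounds) {
    CyclicBarrier start_point(threadcount + 1);  // workers + main thread
    CyclicBarrier stop_point(threadcount + 1);

    auto work = [&] {
        for (int r = 0; r < rounds; ++r) {
            start_point.arrive_and_wait();
            // Per-round computation would go here.
            stop_point.arrive_and_wait();
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < threadcount; ++i)
        threads.emplace_back(work);

    int completed = 0;
    for (int r = 0; r < rounds; ++r) {
        // Setup would go here.
        start_point.arrive_and_wait();  // Sync to start
        stop_point.arrive_and_wait();   // Sync to end
        ++completed;                    // Collect would go here.
    }
    for (auto& t : threads)
        t.join();
    return completed;
}
```

Calling run_cycles(10, 1000) mirrors the ten-worker setup from the question. Because only the main thread touches `completed`, and only between the two sync points, no completion callback is needed.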

Related

How to start several threads in C++?

I have some class objects and want to hand them over to several threads. The number of threads is given by the command line.
When I write it the following way, it works fine:
thread t1(thread(thread(tasks[0], parts[0])));
thread t2(thread(thread(tasks[1], parts[1])));
thread t3(thread(thread(tasks[2], parts[2])));
thread t4(thread(thread(tasks[3], parts[3])));
t1.join();
t2.join();
t3.join();
t4.join();
But as I mentioned, the number of threads shall be given by the command line, so it must be more dynamic. I tried the following code, which doesn't work, and I have no idea what is wrong with it:
for(size_t i=0; i < threads.size(); i++) {
threads.push_back(thread(tasks[i], parts[i]));
}
for(auto &t : threads) {
t.join();
}
I hope someone has an idea on how to correct it.
In this statement:
thread t1(thread(thread(tasks[0], parts[0])));
You don't need to move a thread into another thread and then move that into another thread. Just pass your task parameters directly to t1's constructor:
thread t1(tasks[0], parts[0]);
Same with t2, t3, and t4.
As for your loop:
for(size_t i=0; i < threads.size(); i++) {
threads.push_back(thread(tasks[i], parts[i]));
}
Assuming you are using std::vector<std::thread> threads, then your loop is populating threads wrong. At best, the loop simply won't do anything at all if threads is initially empty, because i < threads.size() will be false when size()==0. At worst, if threads is not initially empty then the loop will run and continuously increase threads.size() with each call to threads.push_back(), causing an endless loop because i < threads.size() will never be false, thus pushing more and more threads into threads until memory blows up.
Try something more like this instead:
size_t numThreads = ...; // taken from cmd line...
std::vector<std::thread> threads(numThreads);
for(size_t i = 0; i < numThreads; i++) {
threads[i] = std::thread(tasks[i], parts[i]);
}
for(auto &t : threads) {
t.join();
}
Or this:
size_t numThreads = ...; // taken from cmd line...
std::vector<std::thread> threads;
threads.reserve(numThreads);
for(size_t i = 0; i < numThreads; i++) {
threads.emplace_back(tasks[i], parts[i]);
}
for(auto &t : threads) {
t.join();
}
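For reference, here is the second variant as a self-contained sketch. square_task is a hypothetical stand-in for tasks[i] and parts[i] (which the question does not show), and run_tasks is an assumed wrapper name.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for calling tasks[i] with parts[i].
void square_task(std::size_t part, std::size_t& out) {
    out = part * part;
}

// Starts numThreads threads, joins them all, and returns each thread's result.
std::vector<std::size_t> run_tasks(std::size_t numThreads) {
    std::vector<std::size_t> results(numThreads, 0);
    std::vector<std::thread> threads;
    threads.reserve(numThreads);

    // Loop over the requested count, not over threads.size().
    for (std::size_t i = 0; i < numThreads; i++) {
        // Each thread writes only to its own slot, so no locking is needed.
        threads.emplace_back(square_task, i, std::ref(results[i]));
    }
    for (auto& t : threads) {
        t.join();
    }
    return results;
}
```

The count itself could come from something like std::stoul(argv[1]) after checking argc; that part is left out here.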
Threads are not copyable; try this:
threads.emplace_back(std::thread(task));
Emplace back thread on vector

C/C++ - Single semaphore of type sem_t to print numbers in order

Problem: Let's say we have n threads where each thread receives a random unique number between 1 and n. And we want the threads to print the numbers in sorted order.
Trivial Solution (using n semaphore/mutex): We can use n mutex locks (or similarly semaphores) where thread i waits to acquire mutex lock number i and unlocks number i + 1. Also, thread 1 has no wait.
However, I'm wondering if it's possible to simulate a similar logic using a single semaphore (of type sem_t) to implement the following logic: (i is a number between 1 to n inclusive)
Thread with number i as input, waits to acquire a count of (i-1) on the semaphore, and after
printing, releases a count of i. Needless to say, thread one does not
wait.
I know that unlike Java, sem_t does not support arbitrary increase/decrease in the semaphore value. Moreover, writing a for loop to do (i-1) wait and i release won't work because of asynchrony.
I've been looking for the answer for so long but couldn't find any. Is this possible in plain C? If not, is it possible in C++ using only one variable or semaphore? Overall, what is the least wasteful way to do this with ONE semaphore.
Please feel free to edit the question since I'm new to multi-threaded programming.
You can do this with a condition_variable in C++, which is equivalent to a pthread_cond_t with the pthreads library in C.
What you want to share between threads is a pointer to a condition_variable, number, and a mutex to guard access to the number.
struct GlobalData
{
std::condition_variable cv;
int currentValue;
std::mutex mut;
};
Each thread simply invokes a function that waits for its number to be set:
void WaitForMyNumber(std::shared_ptr<GlobalData> gd, int number)
{
std::unique_lock<std::mutex> lock(gd->mut);
while (gd->currentValue != number)
{
gd->cv.wait(lock);
}
std::cout << number << std::endl;
gd->currentValue++;
gd->cv.notify_all(); // notify all other threads that it can wake up and check
}
And then a program to test it all out. This one uses 10 threads. You can modify it to use more and then have your own randomization algorithm of the numbers list.
int main()
{
int numbers[10] = { 9, 1, 0, 7, 5, 3, 2, 8, 6, 4 };
std::shared_ptr<GlobalData> gd = std::make_shared<GlobalData>();
// gd->currentValue is value-initialized to 0 by make_shared.
std::thread threads[10];
for (int i = 0; i < 10; i++)
{
int num = numbers[i];
auto fn = [gd, num] {WaitForMyNumber(gd, num); };
threads[i] = std::thread(fn); // a temporary is already an rvalue, so std::move is redundant
}
// wait for all the threads to finish
for (int i = 0; i < 10; i++)
{
threads[i].join();
}
return 0;
}
All of the above is in C++. But it would be easy to transpose the above solution to C using pthreads. But I'll leave that as an exercise for the OP.
I'm not sure if this satisfies your "one semaphore requirement". The mutex technically has a semaphore. Not sure if the condition_variable itself has a semaphore for its implementation.
That's a good question, although I fear you might have an XY problem, since I cannot imagine a good reason for your problem scenario. Nevertheless, after 1-2 minutes I came up with two solutions, each with pros and cons, and I think one of them is perfect for you:
A. When your threads finish at almost the same time and/or need to print ASAP, you could use a shared std::atomic<T> with T = unsigned, int, size_t, uint32_t, whatever you like (or any of the integer atomics in the C standard library when using C). Initialise it with 0, and have every thread i busy-wait until its value is i-1. Once it is, the thread prints and then adds 1 to the atomic. Of course, because of the busy wait, you will have a lot of CPU load when threads wait for a long time, and things slow down when many threads are waiting. But you get your print ASAP.
B. You just store the result of thread i in a container, maybe along with its index (I guess you want to do more than just print i), and after all threads are finished, or periodically, sort this container and then print it.
A.:
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#include <functional>
void thread_function(unsigned i, std::atomic<unsigned>& atomic) {
while (atomic < i - 1) {}
std::cout << i << " ";
atomic += 1;
}
int main() {
std::atomic<unsigned> atomic = 0;
std::vector<std::thread> threads;
for (auto i : {3,1,2}) {
threads.push_back(std::thread(thread_function, i, std::ref(atomic)));
}
for (auto& t : threads) {
t.join();
}
std::cout << "\n";
}
Works also in C, just use the atomics there.
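Option B is not shown above, so here is a rough sketch of it. It assumes each thread's "result" is just its number, and collect_sorted is a hypothetical name. A mutex guards the shared container, which is sorted once after all threads have been joined.

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Each thread appends its value to a shared container; the caller sorts afterwards.
std::vector<unsigned> collect_sorted(const std::vector<unsigned>& inputs) {
    std::vector<unsigned> results;
    std::mutex results_mutex;  // guards `results`
    std::vector<std::thread> threads;

    for (unsigned i : inputs) {
        threads.emplace_back([i, &results, &results_mutex] {
            // The real per-thread work would happen before taking the lock.
            std::lock_guard<std::mutex> lock(results_mutex);
            results.push_back(i);
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    // All threads are done, so no lock is needed to sort.
    std::sort(results.begin(), results.end());
    return results;
}
```

Unlike option A there is no busy wait, but nothing is printed until the threads have finished.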
The following code uses pthread_cond_t and works in C.
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#define n 100
int counter = 0;
int used[n];
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
/* Thread entry point: pthread_create expects a void *(*)(void *). */
void *foo(void *given_number){
    /* The number is smuggled through the pointer argument; go via intptr_t. */
    int number = (int)(intptr_t)given_number;
    pthread_mutex_lock(&mutex);
    /* Sleep until it is this thread's turn. */
    while(counter != number){
        pthread_cond_wait(&cond, &mutex);
    }
    printf("%d\n", number);
    counter++;
    /* Wake all waiters so the next number in line can proceed. */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mutex);
    return NULL;
}
/* Returns a not-yet-used random number in [0, n). */
int get_random_number(){
    while(1){
        int x = rand()%n;
        if(!used[x]){
            used[x] = 1;
            return x;
        }
    }
}
int main(){
    pthread_t threads[n];
    for(int i = 0; i < n; i++){
        int num = get_random_number();
        pthread_create(&threads[i], NULL, foo, (void *)(intptr_t)num);
    }
    for(int i = 0; i < n; i++){
        pthread_join(threads[i], NULL);
    }
    return 0;
}

How to multithread line by line pixels using std::thread?

I want to learn how to adapt pseudocode I have for multithreading line by line to C++. I understand the pseudocode but I am not very experienced with C++ nor the std::thread function.
This is the pseudocode I have and that I've often used:
myFunction
{
int threadNr=previous;
int numberProcs = countProcessors();
// Every thread calculates a different line
for (y = y_start+threadNr; y < y_end; y+=numberProcs) {
// Horizontal lines
for (int x = x_start; x < x_end; x++) {
psetp(x,y,RGB(255,128,0));
}
}
}
int numberProcs = countProcessors();
// Launch threads: e.g. for 1 processor launch no other thread, for 2 processors launch 1 thread, for 4 processors launch 3 threads
for (i=0; i<numberProcs-1; i++)
triggerThread(50,FME_CUSTOMEVENT,i); //The last parameter is the thread number
triggerEvent(50,FME_CUSTOMEVENT,numberProcs-1); //The last thread used for progress
// Wait for all threads to finished
waitForThread(0,0xffffffff,-1);
I know I can call my current function using one thread via std::thread like this:
std::thread t1(FilterImage,&size_param, cdepth, in_data, input_worldP, output_worldP);
t1.join();
But this is not efficient as it is calling the entire function over and over again per thread.
I would expect every processor to tackle a horizontal line on its own.
Any example code would be highly appreciated, as I tend to learn best through example.
Invoking thread::join() forces the calling thread to wait for the child thread to finish executing. For example, if I use it to create a number of threads in a loop, and call join() on each one, it'll be the same as though everything happened in sequence.
Here's an example. I have two methods that print out the numbers 1 through n. The first one does it single threaded, and the second one joins each thread as they're created. Both have the same output, but the threaded one is slower because you're waiting for each thread to finish before starting the next one.
#include <iostream>
#include <thread>
void printN_nothreads(int n) {
for(int i = 0; i < n; i++) {
std::cout << i << "\n";
}
}
void printN_threaded(int n) {
for(int i = 0; i < n; i++) {
std::thread t([=](){ std::cout << i << "\n"; });
t.join(); //This forces synchronization
}
}
Doing threading better.
To gain benefit from using threads, you have to start all the threads before joining them. In addition, to avoid false sharing, each thread should work on a separate region of the image (ideally a section that's far away in memory).
Let's look at how this would work. I don't know what library you're using, so instead I'm going to show you how to write a multi-threaded transform on a vector.
auto transform_section = [](auto func, auto begin, auto end) {
for(; begin != end; ++begin) {
func(*begin);
}
};
This transform_section function will be called once per thread, each on a different section of the vector. Let's write transform so it's multithreaded.
template<class Func, class T>
void transform(Func func, std::vector<T>& data, int num_threads) {
size_t size = data.size();
auto section_start = [size, num_threads](int thread_index) {
return size * thread_index / num_threads;
};
auto section_end = [size, num_threads](int thread_index) {
return size * (thread_index + 1) / num_threads;
};
std::vector<std::thread> threads(num_threads);
// Each thread works on a different section
for(int i = 0; i < num_threads; i++) {
// Pointer arithmetic on data() avoids indexing data[size] for the last section.
T* start = data.data() + section_start(i);
T* end = data.data() + section_end(i);
threads[i] = std::thread(transform_section, func, start, end);
}
// We only join AFTER all the threads are started
for(std::thread& t : threads) {
t.join();
}
}
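If you would rather keep the interleaved layout from the question's pseudocode (thread k handles rows k, k + numberProcs, and so on), a sketch with std::thread could look like this. The flat std::vector<std::uint32_t> image and the names process_rows and fill_image are assumptions standing in for your own buffer and for psetp.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-pixel work (psetp in the pseudocode).
// Thread `thread_nr` handles rows thread_nr, thread_nr + num_threads, ...
void process_rows(std::vector<std::uint32_t>& image, std::size_t width,
                  std::size_t height, std::size_t thread_nr,
                  std::size_t num_threads, std::uint32_t color) {
    for (std::size_t y = thread_nr; y < height; y += num_threads) {
        for (std::size_t x = 0; x < width; ++x) {
            image[y * width + x] = color;
        }
    }
}

void fill_image(std::vector<std::uint32_t>& image, std::size_t width,
                std::size_t height, std::uint32_t color) {
    std::size_t num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 1;  // 0 means "unknown"

    // Launch num_threads - 1 helpers; the calling thread takes the last stripe.
    std::vector<std::thread> threads;
    threads.reserve(num_threads - 1);
    for (std::size_t t = 0; t + 1 < num_threads; ++t) {
        threads.emplace_back(process_rows, std::ref(image), width, height,
                             t, num_threads, color);
    }
    process_rows(image, width, height, num_threads - 1, num_threads, color);

    // Join only after every thread has been started.
    for (auto& t : threads) {
        t.join();
    }
}
```

Every pixel is written by exactly one thread, so no locking is needed. Neighbouring rows belong to different threads, though, so for very narrow images the contiguous sections shown above may behave better with respect to false sharing.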

Why is this C++11 code containing rand() slower with multiple threads than with one?

I'm trying out the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up some squared random numbers.
#include <iostream>
#include <thread>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <cmath>
double add_single(int N) {
double sum=0;
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand()/RAND_MAX);
}
return sum/N;
}
void add_multi(int N, double& result) {
double sum=0;
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand()/RAND_MAX);
}
result = sum/N;
}
int main() {
srand (time(NULL));
int N = 1000000;
// single-threaded
auto t1 = std::chrono::high_resolution_clock::now();
double result1 = add_single(N);
auto t2 = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
std::cout << "time single: " << time_elapsed << std::endl;
// multi-threaded
std::vector<std::thread> th;
int nr_threads = 3;
double partual_results[] = {0,0,0};
t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < nr_threads; ++i)
th.push_back(std::thread(add_multi, N/nr_threads, std::ref(partual_results[i]) ));
for(auto &a : th)
a.join();
double result_multicore = 0;
for(double result:partual_results)
result_multicore += result;
result_multicore /= nr_threads;
t2 = std::chrono::high_resolution_clock::now();
time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count();
std::cout << "time multi: " << time_elapsed << std::endl;
return 0;
}
Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux and a 3core machine, a typical result is
time single: 33
time multi: 565
So the multi threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and prone to compiler optimizations, so I'm out of ideas.
edit:
This problem scales for larger N, so the problem is not the short runtime
The time for creating the threads is not the problem. Excluding it does not change the result significantly
Wow, I found the problem. It was indeed rand(). I replaced it with a C++11 equivalent and now the runtime scales perfectly. Thanks everyone!
On my system the behavior is same, but as Maxim mentioned, rand is not thread safe. When I change rand to rand_r, then the multi threaded code is faster as expected.
void add_multi(int N, double& result) {
double sum=0;
unsigned int seed = time(NULL);
for (int i = 0; i < N; ++i){
sum+= sqrt(1.0*rand_r(&seed)/RAND_MAX);
}
result = sum/N;
}
As you discovered, rand is the culprit here.
For those who are curious, it's possible that this behavior comes from your implementation of rand using a mutex for thread safety.
For example, eglibc defines rand in terms of __random, which is defined as:
long int
__random ()
{
int32_t retval;
__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);
return retval;
}
This kind of locking would force multiple threads to run serially, resulting in lower performance.
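One way to sidestep a shared lock like this is to give every thread its own engine from <random>. The sketch below is only a guess at what a "C++11 equivalent" might look like, not the asker's actual code. The names add_multi_local and mean_sqrt_parallel are hypothetical, as is seeding each engine from std::random_device.

```cpp
#include <cmath>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Same loop as add_multi, but with a private engine instead of rand().
void add_multi_local(int N, unsigned seed, double& result) {
    std::mt19937 engine(seed);  // per-thread state, nothing shared
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += std::sqrt(dist(engine));
    }
    result = sum / N;
}

// Splits N samples across nr_threads threads and averages the partial means.
double mean_sqrt_parallel(int N, int nr_threads) {
    std::vector<double> partial(nr_threads, 0.0);
    std::vector<std::thread> th;
    std::random_device rd;  // used on the launching thread only
    for (int i = 0; i < nr_threads; ++i) {
        th.emplace_back(add_multi_local, N / nr_threads, rd(),
                        std::ref(partial[i]));
    }
    for (auto& t : th) {
        t.join();
    }
    double total = 0;
    for (double p : partial) {
        total += p;
    }
    return total / nr_threads;
}
```

For uniform samples on [0, 1) the mean of the square roots should come out close to 2/3, which gives a quick sanity check on the result.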
The time needed to execute the program is very small (33msec). This means that the overhead to create and handle several threads may be more than the real benefit. Try using programs that need longer times for the execution (e.g., 10 sec).
To make this faster, use a thread pool pattern.
This will let you enqueue tasks in other threads without the overhead of creating a std::thread each time you want to use more than one thread.
Don't count the overhead of setting up the queue in your performance metrics, just the time to enqueue and extract the results.
Create a set of threads and a queue of tasks (a structure containing a std::function<void()>) to feed them. The threads wait on the queue for new tasks to do, do them, then wait on new tasks.
The tasks are responsible for communicating their "done-ness" back to the calling context, such as via a std::future<>. The code that lets you enqueue functions into the task queue might do this wrapping for you, ie this signature:
template<typename R=void>
std::future<R> enqueue( std::function<R()> f ) {
std::packaged_task<R()> task(f);
std::future<R> retval = task.get_future();
this->add_to_queue( std::move( task ) ); // if we had move semantics, could be easier
return retval;
}
which turns a naked std::function returning R into a nullary packaged_task, then adds that to the task queue. Note that the task queue needs to be move-aware, because packaged_task is move-only.
Note 1: I am not all that familiar with std::future, so the above could be in error.
Note 2: If tasks put into the above described queue are dependent on each other for intermediate results, the queue could deadlock, because no provision to "reclaim" threads that are blocked and execute new code is described. However, "naked computation" non-blocking tasks should work fine with the above model.
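To make the description above more concrete, here is a minimal sketch of such a pool. ThreadPool and its members are hypothetical names, and one detail differs from the enqueue signature above: the packaged_task is held in a shared_ptr so that the queue can store copyable std::function<void()> objects. As in Note 2, nothing here handles tasks that block on each other.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }

    // Lets queued tasks drain, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) {
            w.join();
        }
    }

    // Queues a nullary callable and returns a future for its result.
    template <typename F>
    auto enqueue(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        // shared_ptr makes the move-only packaged_task storable in std::function.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    // Each worker waits on the queue, runs one task, then waits again.
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();  // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};
```

Usage would be along the lines of `ThreadPool pool(3); auto f = pool.enqueue([] { return 6 * 7; });` followed by `f.get()`, with the pool created once, outside the timed region.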

Concurrency in C++ with Boost

I have a very simple function in C++:
double testSpeed()
{
using namespace boost;
int temp = 0;
timer aTimer;
//1 billion iterations.
for(int i = 0; i < 1000000000; i++) {
temp = temp + i;
}
double elapsedSec = aTimer.elapsed();
double speed = 1.0/elapsedSec;
return speed;
}
I want to run this function with multiple threads. I saw examples online that I can
do it as follows:
// start two new threads that call the testSpeed function
boost::thread my_thread1(&testSpeed);
boost::thread my_thread2(&testSpeed);
// wait for both threads to finish
my_thread1.join();
my_thread2.join();
However, this will run two threads that each will iterate billion times, right? I want the
two threads to do the job concurrently so the entire thing will run faster. I don't care
about sync, it's just a speed test.
Thank you!
There may be a nicer way, but this should work. It passes the range to iterate over into each thread, starts a single timer before the threads are started, and stops the timer after they're both done. It should be pretty obvious how to scale this up to more threads.
void testSpeed(int start, int end)
{
long long temp = 0; // 64-bit accumulator: the sum exceeds the range of a 32-bit int
for(int i = start; i < end; i++)
{
temp = temp + i;
}
}
using namespace boost;
timer aTimer;
// start two new threads that call the testSpeed function
boost::thread my_thread1(&testSpeed, 0, 500000000);
boost::thread my_thread2(&testSpeed, 500000000, 1000000000);
// wait for both threads to finish
my_thread1.join();
my_thread2.join();
double elapsedSec = aTimer.elapsed();
double speed = 1.0/elapsedSec;
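The same splitting can be sketched with std::thread instead of Boost and generalised to any number of threads. This version also keeps each thread's partial sum in a 64-bit accumulator, so the work has an observable result. sum_range and parallel_sum are hypothetical names, and the timing code is left out.

```cpp
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Sums the integers in [start, end) into `out`.
void sum_range(std::uint64_t start, std::uint64_t end, std::uint64_t& out) {
    std::uint64_t temp = 0;
    for (std::uint64_t i = start; i < end; ++i) {
        temp += i;
    }
    out = temp;
}

// Splits [0, n) into num_threads contiguous ranges and adds up the partial sums.
std::uint64_t parallel_sum(std::uint64_t n, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::uint64_t begin = n * t / num_threads;
        std::uint64_t end = n * (t + 1) / num_threads;
        threads.emplace_back(sum_range, begin, end, std::ref(partial[t]));
    }
    // Join only after all threads have been started.
    for (auto& th : threads) {
        th.join();
    }
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) {
        total += p;
    }
    return total;
}
```

With optimisation enabled a compiler may still reduce such a simple loop to a closed-form expression, so treat any timing of it with some suspicion.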