Multithreaded program blocks when compiled with optimizations (C++)

I am developing a project in which I have to model (arbitrary) computations that happen in a pipeline.
The pipeline is made of stages; each stage takes its input from the previous stage (except the first, which receives tasks directly from the pipeline object), performs a computation and sends the result to the next stage. Each stage is implemented with a separate thread of execution.
The pipeline should have a basic load-balancing capability: if (after a while) it recognizes that the sum of the execution times of two consecutive stages is smaller than the execution time of the slowest stage, it "collapses" those two stages, that is, it makes both of them run sequentially on a single thread.
There are three classes in the project: classes Pipeline and Stage are obvious, while class TSOHeap (Thread-Safe Ordered heap) is the input buffer used by each stage. It has a maximum size and the ability to give highest priority to special messages indicating that a Stage has to be collapsed.
My question is: why does the code run smoothly (or at least not block) when compiled without optimizations, but block when compiled with optimizations (-O2, -O3)? If I run the program under the debugger it blocks only a few times; if I run the program "normally" from the terminal it blocks almost always.
The strange thing is that a thread blocks on a line containing a simple print. Before I added that print (for debugging purposes), the program blocked on the previous line, which is the guard of a while loop.
I guess the problem is related to synchronization among threads, but I don't know how to track down the faulty part. The only constant is that the program blocks after the method collapse_next_stage() has been invoked, that is, after a thread has been stopped.
Any suggestion would be appreciated, even general procedures for discovering bugs like these.
Here is the code for a runnable example:
Class "TSOHeap.hpp":
#include <mutex>
#include <queue>
#include <vector>
#include <atomic>
#include <climits>

using namespace std;

template<typename T>
struct Comparator{
    bool operator()(pair<T,int> p1, pair<T,int> p2){
        return p1.second > p2.second;
    }
};

//Thread-Safe Ordered Heap
template<typename T>
struct TSOHeap
{
    TSOHeap(int _max=10):size{0},max{_max}{};
    ~TSOHeap(){}

    void push(T* item, int id){
        while(size==max);
        {
            lock_guard<mutex> lock(heap_mutex);
            heap.push(pair<T*,int>(item, id));
            size++;
        }
    }

    pair<T*,int> pop(){
        while(size==0);
        {
            lock_guard<mutex> lock(heap_mutex);
            pair<T*,int> p = heap.top();
            heap.pop();
            size--;
            return p;
        }
    }

    priority_queue<pair<T*,int>, vector<pair<T*,int>>, Comparator<T*>> heap;
    atomic<int> size;
    int max;
    mutex heap_mutex;
};
Class "Stage.hpp":
#include "TSOHeap.hpp"
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <mutex>
using namespace std;;
struct IStage{
virtual void run() = 0;
virtual void wait_end() = 0;
virtual void stage_func() = 0;
virtual double get_exec_time() = 0;
virtual void reset_exec_time()=0;
virtual void add_next(IStage&)=0;
virtual IStage* get_next() = 0;
virtual void* get_input_ptr() = 0;
virtual void set_input(void*) = 0;
virtual void collapse() = 0;
virtual bool is_collapsed() = 0;
virtual void collapse_next_stage() = 0;
virtual int num_collapsed() = 0;
~IStage(){};
};
template <typename Tin, typename Tf, typename Tout>
struct Stage : IStage{
Stage(Tf function, int ind):fun{function}, input_ptr{new(TSOHeap<Tin>)},_end{false},
next{nullptr}, collapsed{0}, i{ind}, exec_time{0.0},count{0},collapsing{false},c{0}{};
~Stage(){delete input_ptr;}
void stage_func(){
Tin * input = input_ptr->pop().first;
if (input!=nullptr){
auto start = chrono::system_clock::now();
Tout out = fun(*input);
auto end = chrono::system_clock::now();
chrono::duration<double> diff = end-start;
set_exec_time(diff.count());
if (next!=nullptr)
next->set_input(new Tout(out));
}
else
_end = true;
}
void run_thread(){
while(!_end){
cout << "t " << i << ", r " << ++c << endl; // BLOCKS HERE
while(collapsing); //waiting that next stage finishes the remaining tasks
stage_func();
if(collapsed==1 && !_end)
next->stage_func();
}
if(collapsed!=-1){
IStage * nptr = next;
if(nptr!=nullptr && nptr->is_collapsed())
nptr = nptr->get_next();
if(nptr!=nullptr)
nptr->set_input(nullptr);
}
else{
while((input_ptr->size)>0)
stage_func();
}
}
void run()
{
thread _t(&Stage::run_thread, this);
t = move(_t);
return;
}
void wait_end()
{
t.join();
}
void set_input(void * iptr)
{
input_ptr->push(static_cast<Tin*>(iptr), ++count);
}
void* get_input_ptr()
{
return input_ptr;
}
void add_next(IStage &n)
{
next = &n;
output_ptr = static_cast<TSOHeap<Tout>*>(n.get_input_ptr());
}
void collapse()
{
collapsed=-1;
input_ptr->push(nullptr, INT_MIN);
// First condition is to avoid deadlock, in case this thread finished the execution in the meanwhile
while(!_end && (input_ptr->size) > 0);
}
bool is_collapsed()
{
return collapsed==-1;
}
void collapse_next_stage()
{
collapsing = true;
next->collapse();
collapsed++;
collapsing = false;
cout << "Stage # " << i << " has collapsed the successive Stage" << endl;
}
IStage* get_next()
{
return next;
}
double get_exec_time()
{
return exec_time;
}
void reset_exec_time()
{
set_exec_time(0.0);
}
void set_exec_time(double value)
{
lock_guard<mutex> lock(et_mutex);
exec_time = value;
}
int num_collapsed()
{
return collapsed;
}
Tf fun;
TSOHeap<Tin> * input_ptr;
bool _end;
IStage * next;
int collapsed;
int const i;
double exec_time;
int count;
mutex et_mutex;
bool collapsing;
int c;
TSOHeap<Tout> * output_ptr;
thread t;
};
Class "Pipe.hpp":
#include "Stage.hpp"
#include <list>
#include <thread>
#include <algorithm>
using namespace std;;
template <typename Tin, typename Tout>
struct Pipe{
Pipe(list<IStage*>li, int n_samples=10):slowest{-1},end{false},num_samples{n_samples}
{
for(auto& s:li)
add_node(s);
}
void add_node(IStage* sptr)
{
if(!nodes.empty())
nodes.back()->add_next(*sptr);
nodes.push_back(sptr);
}
void set_input(void * in_ptr)
{
nodes.front()->set_input(in_ptr);
}
int num_nodes()
{
return nodes.size();
}
void run()
{
for(auto &x: nodes)
x->run();
}
void run(list<Tin>&& input)
{
thread t(&Pipe::run_manager, this, ref(input));
while(!end)
monitor_times();
t.join();
}
void run_manager(list<Tin>& input)
{
run();
for(auto& x:input)
set_input(&x);
set_input(nullptr);
end=true;
for(auto& s : nodes)
s->wait_end();
}
void monitor_times()
{ // initialization phase
vector<int> count;
vector<double> avg;
vector<priority_queue<pair<double,int>, vector<pair<double,int>>,Comparator<double>>> measures;
for(auto& x : nodes){
count.push_back(0);
avg.push_back(0);
measures.push_back(priority_queue<pair<double,int>,
vector<pair<double,int>>,Comparator<double>>());
}
while(!end){
// monitoring phase
for(int i=0; i<nodes.size(); i++){
if(nodes[i]->get_exec_time()!=0){
pair<double,int> measure = pair<double,int>(nodes[i]->get_exec_time(),++count[i]);
nodes[i]->reset_exec_time();
measures[i].push(measure);
if(count[i]<=num_samples){
avg[i] = (avg[i]*(count[i]-1) + measure.first) / count[i];
}
else
{
double old = measures[i].top().first;
// the ordering of the heap guarantees that I drop the oldest measure
measures[i].pop();
avg[i] = (avg[i] * num_samples - old + measure.first) / num_samples;
}
}
}
// updating phase
if(is_steady_state(count)){
int slowest = get_slowest_stage(avg);
for(int i=0; i<nodes.size()-1; i++){
if(avg[i]+avg[i+1]<avg[slowest]){
if(nodes[i]->num_collapsed()==0 && nodes[i+1]->num_collapsed()==0){
nodes[i]->collapse_next_stage();
break;
}
}
}
}
}
}
bool is_steady_state(vector<int>& count){
for(auto& c: count){
if(c < num_samples) return false;
}
return true;
}
int get_slowest_stage(vector<double>& avg){
double max = 0.0;
int index = -1;
for(int i=0; i<avg.size(); i++){
if(avg[i]>max){
max=avg[i];
index = i;
}
}
return index;
}
int slowest;
bool end;
int num_samples;
vector<IStage*> nodes;
};
Class "main.cpp":
#include <iostream>
#include <functional>
#include <chrono>
#include <cmath>
#include "Pipe.hpp"

using namespace std;

auto f = [](int x){
    int c = 0;
    for(int i=0; i<300; i++)
        c=sin(i);
    return x;
};

auto fast = [] (int x) {return x;};

auto fast_init = [](int x){
    if(x < 5)
        return x;
    int c=0;
    for(int i=0; i<300; i++)
        c=sin(i);
    return x;
};

auto print = [] (int x) {
    cout << "Result: " << x << " " << endl;
    return x;
};

int main(int argc, char* argv[])
{
    auto print_usage_msg = [&](){
        cout << "Usage: " << argv[0] << " <func_type> \n" <<
            "<func_type> = \n"
            "  0 to have 2 consecutive stages running the fast function\n"
            "  1 to have 2 consecutive stages running the fast function "
            "but after a short time reaching steady state " << endl;
    };
    if(argc!=2){
        print_usage_msg();
        return 1;
    }
    int fun_code = atoi(argv[1]);
    if (fun_code!=0 && fun_code!=1){
        print_usage_msg();
        return 1;
    }

    Stage<int,function<int(int)>,int> s1{f,1};
    Stage<int,function<int(int)>,int> s2{f,2};
    Stage<int,function<int(int)>,int> s3{f,3};
    Stage<int,function<int(int)>,int> s4{f,4};
    Stage<int,function<int(int)>,int> s5{f,5};
    Stage<int,function<int(int)>,int> s6{f,6};
    Stage<int,function<int(int)>,int> s7{f,7};
    Stage<int,function<int(int)>,int> sp{print,8};

    if(fun_code==0){
        s2.fun = fast;
        s3.fun = fast;
    }
    else{
        s2.fun = fast_init;
        s3.fun = fast_init;
    }

    Pipe<int,int> p ({&s1, &s2, &s3, &s4, &s5, &s6, &s7, &sp});
    cout << "Pipe length: " << p.num_nodes() << endl;

    list<int> li {};
    for(int i=0; i<100; i++)
        li.push_back(i);
    p.run(move(li));
    return 0;
}
Compile with:
g++ main.cpp -std=c++11 -pthread -O3 -o gpipe -g
Run with:
./gpipe 1
Thanks for any help!

Imagine the following code for a single-threaded program:
void func()
{
    bool a = true;
    while(a)
    {
        // busy wait...
    }
}
Will this function ever return? Obviously not. If you were a compiler, how would you write optimized code for this?
1: NOP
2: GOTO 1
This is exactly what you're doing with this bit of code. Twice.
while(!_end){                // here #1
    cout << "t " << i << ", r " << ++c << endl;
    while(collapsing)        // here #2
        ;                    // for the love of God, move your semicolon here or use braces
    stage_func();
    if(collapsed==1 && !_end)
        next->stage_func();
}
Your compiler has absolutely no obligation to realize that you're doing multi-threaded programming. (It's your job to tell it.)
The compiler needs to know not to perform optimizations on _end and collapsed. DO NOT USE volatile. Why? volatile will keep the compiler from optimizing a variable, but... heh heh... the CPU can also potentially optimize away your writes to _end and collapsed from different threads (by keeping them in its cache and not writing to main memory). Compilers and CPUs will also re-order your instructions, which can cause similar problems.
Memory fences (aka memory barriers) can be used to instruct the CPU to do things like push out pending writes or re-update its cached value for reading. They also give guidelines for command re-ordering. AFAIK the std::atomic_thread_fence will prevent compiler reordering but I've read conflicting things about this...
By far the simplest, most-pragmatic, and easiest-to-prove-correct thing to do is just to switch all your inter-thread communicating variables to std::atomic<> types, which incorporate memory barriers. So
std::atomic<bool> _end;
std::atomic<int> collapsed;
As a general rule, any data that is shared between threads should be protected by a mutex OR be an std::atomic<> if race conditions are not an issue (as you are doing with the simple signaling). You can break this rule if you really know what you're doing and really know the architecture, compiler, and standard implementation really well, but that's a tall order even for an expert.
By the way, a mutex's lock and unlock operation both incorporate a memory barrier, in case you were worried about that. So when you get a pointer from the TSOHeap, that's fine (assuming your TSOHeap implementation is correct...I didn't look at it).

You have race conditions in TSOHeap when using size. While size is atomic, it is part of a larger state that is not atomic, so changes in size are not synchronized with changes to the rest of the state.
Make size non-atomic and access it only when the mutex is locked. Add condition variables to notify threads waiting in push and pop.
Alternatively, remove size entirely. Example:
template<typename T>
struct TSOHeap
{
    TSOHeap(size_t _max=10): max{_max}{}

    void push(T* item, int id){
        unique_lock<mutex> lock(heap_mutex);
        while(heap.size() == max)
            cnd_pop.wait(lock);
        heap.push(pair<T*,int>(item, id));
        cnd_push.notify_one();
    }

    pair<T*,int> pop() {
        pair<T*,int> result = {};
        {
            unique_lock<mutex> lock(heap_mutex);
            while(heap.empty())
                cnd_push.wait(lock);
            bool notify = heap.size() == max;
            result = heap.top();
            heap.pop();
            if(notify)
                cnd_pop.notify_one();
        }
        return result;
    }

    mutex heap_mutex;
    condition_variable cnd_push, cnd_pop;
    priority_queue<pair<T*,int>, vector<pair<T*,int>>, Comparator<T*>> heap;
    size_t const max;
};
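For reference, a minimal driver for the fixed buffer might look like the following (a sketch, assuming the class above replaces the original in TSOHeap.hpp and that <condition_variable> is added to its includes; the nullptr sentinel mimics how the pipeline signals end-of-stream):

#include "TSOHeap.hpp"
#include <thread>
#include <iostream>

int main() {
    TSOHeap<int> buf(4); // small max, to exercise the "full" wait in push()
    std::thread producer([&buf] {
        for (int i = 1; i <= 20; ++i)
            buf.push(new int(i), i); // blocks while the heap is full
        buf.push(nullptr, INT_MAX);  // end-of-stream sentinel, lowest priority
    });
    while (true) {
        auto p = buf.pop();          // blocks while the heap is empty
        if (p.first == nullptr) break;
        std::cout << *p.first << " (id " << p.second << ")\n";
        delete p.first;
    }
    producer.join();
}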


Thread pool with job queue gets stuck

I want to split jobs among multiple std::thread workers and continue once they are all done.
To do so, I implemented a thread pool class mainly based on this SO answer.
I noticed, however, that my benchmarks can get stuck, running forever, without any errors thrown.
I wrote a minimal reproducing code, enclosed at the end.
Based on terminal output, the issue seems to occur when the jobs are being queued.
I checked videos (1, 2), documentation (3) and blog posts (4).
I tried replacing the type of the locks, using atomics.
I could not find the underlying cause.
Here is the snippet to replicate the issue.
The program repeatedly counts the odd elements in the test vector.
#include <atomic>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <algorithm> // for std::min (missing in the original post)

class Pool {
public:
    const int worker_count;
    bool to_terminate = false;
    std::atomic<int> unfinished_tasks = 0;
    std::mutex mutex;
    std::condition_variable condition;
    std::vector<std::thread> threads;
    std::queue<std::function<void()>> jobs;

    void thread_loop()
    {
        while (true) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                condition.wait(lock, [&] { return (!jobs.empty()) || to_terminate; });
                if (to_terminate)
                    return;
                job = jobs.front();
                jobs.pop();
            }
            job();
            unfinished_tasks -= 1;
        }
    }

public:
    Pool(int size) : worker_count(size)
    {
        if (size < 0)
            throw std::invalid_argument("Worker count needs to be a positive integer");
        for (int i = 0; i < worker_count; ++i)
            threads.push_back(std::thread(&Pool::thread_loop, this));
    };

    ~Pool()
    {
        {
            std::unique_lock lock(mutex);
            to_terminate = true;
        }
        condition.notify_all();
        for (auto &thread : threads)
            thread.join();
        threads.clear();
    };

    void queue_job(const std::function<void()> &job)
    {
        {
            std::unique_lock<std::mutex> lock(mutex);
            jobs.push(job);
            unfinished_tasks += 1;
            // std::cout << unfinished_tasks;
        }
        condition.notify_one();
    }

    void wait()
    {
        while (unfinished_tasks) {
            ; // spinlock
        };
    }
};

int main()
{
    constexpr int worker_count = 8;
    constexpr int vector_size = 1 << 10;

    Pool pool = Pool(worker_count);

    std::vector<int> test_vector;
    test_vector.reserve(vector_size);
    for (int i = 0; i < vector_size; ++i)
        test_vector.push_back(i);

    std::vector<int> worker_odd_counts(worker_count, 0);
    std::function<void(int)> worker_task = [&](int thread_id) {
        int chunk_size = vector_size / (worker_count) + 1;
        int my_start = thread_id * chunk_size;
        int my_end = std::min(my_start + chunk_size, vector_size);
        int local_odd_count = 0;
        for (int ii = my_start; ii < my_end; ++ii)
            if (test_vector[ii] % 2 != 0)
                ++local_odd_count;
        worker_odd_counts[thread_id] = local_odd_count;
    };

    for (int iteration = 0;; ++iteration) {
        std::cout << "Jobs.." << std::flush;
        for (int i = 0; i < worker_count; ++i)
            pool.queue_job([&worker_task, i] { worker_task(i); });
        std::cout << "..queued. " << std::flush;

        pool.wait();

        int odd_count = 0;
        for (auto elem : worker_odd_counts)
            odd_count += elem;
        std::cout << "Iter:" << iteration << ". Odd:" << odd_count << '\n';
    }
}
Here is the terminal output of one specific run:
[...]
Jobs....queued. Iter:2994. Odd:512
Jobs....queued. Iter:2995. Odd:512
Jobs..
Edit:
The error occurs using GCC 12.2.0 x86_64-w64-mingw32 on Windows 10 with an AMD Ryzen 4750U CPU. I do not get past 15k iterations.
Using Visual Studio Community 2022, I got past 1.5M iterations (and stopped it myself). Thanks @IgorTandetnik for pointing out the latter.
Mingw doesn't natively support multithreading on Windows. It supports threads in its C++ standard library through the POSIX API, plus the winpthreads compatibility layer, which implements that API on top of the Windows OS threads.
I think your error is not in the C++ code, but in the computer setup. Do the following.
Use the compiler from the x86_64-12.2.0-release-posix-seh-ucrt-rt_v10-rev2.7z archive.
Don't forget that a binary built this way depends on a bunch of DLL files provided by the compiler: libgcc_s_seh-1.dll, libwinpthread-1.dll and libstdc++-6.dll. You must use exactly the same versions of these DLLs that were shipped with mingw. If you have other versions of these DLLs anywhere in your %PATH%, expect all kinds of failures.
A couple of general notes.
Linux-first C++ compilers like gcc have issues on Windows. A path of least resistance is using Visual C++ instead. If you want your software to build on other platforms as well, consider cmake to abstract away the compiler.
Windows already includes a thread pool implementation, since Vista. The API is easy to use; you only need 4 functions: CreateThreadpoolWork, SubmitThreadpoolWork, WaitForThreadpoolWorkCallbacks, and CloseThreadpoolWork.
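Since the example link doesn't survive in this copy, a minimal sketch of those four calls (my illustration, not the answer's linked example; error handling mostly omitted) might look like:

#include <windows.h>
#include <cstdio>

// Callback run on a pool thread; Context is the pointer passed to CreateThreadpoolWork.
VOID CALLBACK work_callback(PTP_CALLBACK_INSTANCE, PVOID context, PTP_WORK) {
    int* value = static_cast<int*>(context);
    std::printf("processing %d\n", *value);
}

int main() {
    int payload = 42;
    PTP_WORK work = CreateThreadpoolWork(work_callback, &payload, nullptr);
    if (!work) return 1;
    SubmitThreadpoolWork(work);                  // each submit queues one callback run
    SubmitThreadpoolWork(work);
    WaitForThreadpoolWorkCallbacks(work, FALSE); // block until outstanding callbacks finish
    CloseThreadpoolWork(work);
    return 0;
}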
The first thing you should do is split the queue from the thread pool. They are both tricky enough, writing both of them comingled in one class is asking for trouble.
This also allows you to unit test the queue without the pool.
template<class Payload>
class MutexQueue {
public:
    std::optional<Payload> wait_and_pop();
    void push(Payload);
    void terminate_queue();
    bool queue_is_terminated() const;
private:
    mutable std::mutex m;
    std::condition_variable cv;
    std::deque<Payload> q;
    bool terminated = false;

    std::unique_lock<std::mutex> lock() const {
        return std::unique_lock<std::mutex>(m);
    }
};
This is a bit easier to write than the thread pool.
void push(Payload p) {
    {
        auto l = lock();
        if (terminated) return;
        q.push_back(std::move(p));
    }
    cv.notify_one();
}

void terminate_queue() {
    {
        auto l = lock(); // YOU CANNOT SKIP THIS LOCK, even if terminated is atomic
        terminated = true;
        q.clear();
    }
    cv.notify_all();
}

bool queue_is_terminated() const {
    auto l = lock(); // if you make terminated atomic, you CAN skip this lock
    return terminated;
}

std::optional<Payload> wait_and_pop() {
    auto l = lock();
    cv.wait(l, [&]{ return terminated || !q.empty(); });
    if (terminated) return std::nullopt;
    auto r = std::move(q.front());
    q.pop_front();
    return std::move(r);
}
There we go.
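As claimed earlier, splitting the queue out lets you unit test it without the pool. A quick smoke test might look like this (my sketch, assuming the method bodies above are placed inside the MutexQueue definition, with <optional>, <deque>, <mutex>, <condition_variable>, <thread> and <iostream> included):

int main() {
    MutexQueue<int> q;
    std::thread consumer([&q] {
        // wait_and_pop() returns std::nullopt once the queue is terminated
        while (auto item = q.wait_and_pop())
            std::cout << "got " << *item << "\n";
    });
    for (int i = 0; i < 5; ++i)
        q.push(i);
    // terminate_queue() discards anything still pending, so depending on
    // timing the consumer may observe only some of the five pushes.
    q.terminate_queue();
    consumer.join();
}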
Now our thread pool is simpler.
struct ThreadPool {
    explicit ThreadPool(std::size_t n) {
        create_threads(n);
    }
    std::future<void> push_task(std::function<void()> f) {
        std::packaged_task<void()> p(std::move(f)); // direct-init: this ctor is explicit
        auto r = p.get_future();
        q.push(std::move(p));
        return r;
    }
    void terminate_pool() {
        q.terminate_queue();
        terminate_threads();
    }
    ~ThreadPool() {
        terminate_pool();
    }
private:
    MutexQueue<std::packaged_task<void()>> q;
    std::vector<std::thread> threads;

    void terminate_threads() {
        for(auto& thread:threads)
            thread.join();
        threads.clear();
    }
    static void thread_task(MutexQueue<std::packaged_task<void()>>* pq) {
        if (!pq) return;
        while (auto task = pq->wait_and_pop()) {
            (*task)();
        }
    }
    void create_threads(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            threads.push_back(std::thread(thread_task, &q));
        }
    }
};
I cannot spot an error in your code. But with the above, you can test a split of the queue from the pool.
The queue will work with pthreads or other primitives.
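For completeness, here is roughly how the original benchmark loop could sit on top of this pool, waiting on the returned futures instead of spinning on an atomic counter as Pool::wait() did (my sketch, assuming the MutexQueue and ThreadPool pieces above are assembled in one translation unit with <future>, <functional>, <vector> and <iostream> included):

int main() {
    ThreadPool pool(8);
    for (int iteration = 0; iteration < 3; ++iteration) {
        std::vector<std::future<void>> pending;
        for (int i = 0; i < 8; ++i)
            pending.push_back(pool.push_task([i] {
                // per-worker chunk of work goes here
            }));
        for (auto& f : pending)
            f.get(); // blocks until that task has run; no spinning
        std::cout << "Iter:" << iteration << " done\n";
    }
} // ~ThreadPool terminates the queue and joins the workers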

How to reduce the (writer side) performance penalty for sharing a small structure between two threads?

I'm adding a profiler to an existing evaluator. The evaluator is implemented in C++ (std17 x86 msvc) and I need to store 3 int32, 1 uint16 and 1 uint8 to represent the context of the execution frame. As this is interpreted code, I cannot write any kernel-level driver to take snapshots, so we have to add it to the eval loop. As with all profiling, we want to avoid slowing down the actual computation. So far for the motivation/context of this question.
At a high level, the behavior is as follows: before every instruction gets executed by the evaluator, it calls a small function ("hey, I'm here right now"); the sampling profiler is only interested in that frame position once every X ms (or μs, but that's most likely pushing it). So we need two threads (let's ignore the serialization for now) that want to share data. We have a very frequent writer (one single thread) and an infrequent reader (a different single thread). We'd like to minimize the performance penalty on the writer side. Note that sometimes the writer becomes slow: it might be stuck on a frame for a few seconds, and we would like to be able to observe this.
So to help this question, I've written a small benchmark setup.
#include <memory>
#include <chrono>
#include <string>
#include <iostream>
#include <thread>
#include <immintrin.h>
#include <atomic>
#include <cstring>

using namespace std;

typedef struct frame {
    int32_t a;
    int32_t b;
    uint16_t c;
    uint8_t d;
} frame;

class ProfilerBase {
public:
    virtual void EnterFrame(int32_t a, int32_t b, uint16_t c, uint8_t d) = 0;
    virtual void Stop() = 0;
    virtual ~ProfilerBase() {}
    virtual string Name() = 0;
};

class NoOp : public ProfilerBase {
public:
    void EnterFrame(int32_t a, int32_t b, uint16_t c, uint8_t d) override {}
    void Stop() override {}
    string Name() override { return "NoOp"; }
};

class JustStore : public ProfilerBase {
private:
    frame _current = { 0 };
public:
    string Name() override { return "OnlyStoreInMember"; }
    void EnterFrame(int32_t a, int32_t b, uint16_t c, uint8_t d) override {
        _current.a = a;
        _current.b = b;
        _current.c = c;
        _current.d = d;
    }
    void Stop() override {
        if ((_current.a + _current.b + _current.c + _current.d) == _current.a) {
            cout << "Make sure optimizer keeps the record around";
        }
    }
};

class WithSampler : public ProfilerBase {
private:
    unique_ptr<thread> _sampling;
    atomic<bool> _keepSampling = true;
protected:
    const chrono::milliseconds _sampleEvery;
    virtual void _snap() = 0;
    virtual string _subname() = 0;
public:
    WithSampler(chrono::milliseconds sampleEvery): _sampleEvery(sampleEvery) {
        _sampling = make_unique<thread>(&WithSampler::_sampler, this);
    }
    void Stop() override {
        _keepSampling = false;
        _sampling->join();
    }
    string Name() override {
        return _subname() + to_string(_sampleEvery.count()) + "ms";
    }
private:
    void _sampler() {
        auto nextTick = chrono::steady_clock::now();
        while (_keepSampling)
        {
            const auto sleepTime = nextTick - chrono::steady_clock::now();
            if (sleepTime > chrono::milliseconds(0))
            {
                this_thread::sleep_for(sleepTime);
            }
            _snap();
            nextTick += _sampleEvery;
        }
    }
};

struct checkedFrame {
    frame actual;
    int32_t check;
};

// https://rigtorp.se/spinlock/
struct spinlock {
    std::atomic<bool> lock_ = { 0 };
    void lock() noexcept {
        for (;;) {
            // Optimistically assume the lock is free on the first try
            if (!lock_.exchange(true, std::memory_order_acquire)) {
                return;
            }
            // Wait for lock to be released without generating cache misses
            while (lock_.load(std::memory_order_relaxed)) {
                // Issue X86 PAUSE or ARM YIELD instruction to reduce contention between
                // hyper-threads
                _mm_pause();
            }
        }
    }
    void unlock() noexcept {
        lock_.store(false, std::memory_order_release);
    }
};

class Spinlock : public WithSampler {
private:
    spinlock _loc;
    checkedFrame _current;
public:
    using WithSampler::WithSampler;
    string _subname() override { return "Spinlock"; }
    void EnterFrame(int32_t a, int32_t b, uint16_t c, uint8_t d) override {
        _loc.lock();
        _current.actual.a = a;
        _current.actual.b = b;
        _current.actual.c = c;
        _current.actual.d = d;
        _current.check = a + b + c + d;
        _loc.unlock();
    }
protected:
    void _snap() override {
        _loc.lock();
        auto snap = _current;
        _loc.unlock();
        if ((snap.actual.a + snap.actual.b + snap.actual.c + snap.actual.d) != snap.check) {
            cout << "Corrupted snap!!\n";
        }
    }
};
static constexpr int32_t LOOP_MAX = 1000 * 1000 * 1000;

int measure(unique_ptr<ProfilerBase> profiler) {
    cout << "Running profiler: " << profiler->Name() << "\n ";
    cout << "\tProgress: ";
    auto start_time = std::chrono::steady_clock::now();
    int r = 0;
    for (int32_t x = 0; x < LOOP_MAX; x++)
    {
        profiler->EnterFrame(x, x + x, x & 0xFFFF, x & 0xFF);
        r += x;
        if (x % (LOOP_MAX / 1000) == 0)
        {
            this_thread::sleep_for(chrono::nanoseconds(10)); // simulate that sometimes we do other stuff, not just storing
        }
        if (x % (LOOP_MAX / 10) == 0)
        {
            cout << static_cast<int>((static_cast<double>(x) / LOOP_MAX) * 10);
        }
        if (x % 1000 == 0) {
            _mm_pause(); // give the other threads some time
        }
        if (x == (LOOP_MAX / 2)) {
            // the first half of the loop we take as warmup
            // so now we take the actual time
            start_time = std::chrono::steady_clock::now();
        }
    }
    cout << "\n";
    const auto done_calc = std::chrono::steady_clock::now();
    profiler->Stop();
    const auto done_writing = std::chrono::steady_clock::now();
    cout << "\tcalc: " << chrono::duration_cast<chrono::milliseconds>(done_calc - start_time).count() << "ms\n";
    cout << "\tflush: " << chrono::duration_cast<chrono::milliseconds>(done_writing - done_calc).count() << "ms\n";
    return r;
}

int main() {
    measure(make_unique<NoOp>());
    measure(make_unique<JustStore>());
    measure(make_unique<Spinlock>(chrono::milliseconds(1)));
    measure(make_unique<Spinlock>(chrono::milliseconds(10)));
    return 0;
}
Compiling this with /O2 in x86 mode on my machine gives this output:
Running profiler: NoOp
Progress: 0123456789
calc: 1410ms
flush: 0ms
Running profiler: OnlyStoreInMember
Progress: 0123456789
calc: 1368ms
flush: 0ms
Running profiler: Spinlock1ms
Progress: 0123456789
calc: 3952ms
flush: 4ms
Running profiler: Spinlock10ms
Progress: 0123456789
calc: 3985ms
flush: 11ms
(while this was compiled with msvc in VS2022, I think g++ --std=c++17 -O2 -m32 -pthread -o testing small-test-case.cpp should come close enough).
Here we see that the Spinlock-based sampler adds a ~2.5x overhead compared to the one without any. I've profiled it, and as expected, a lot of time is spent taking the lock (even though in most cases there was no need for it).

Simple worker thread in C++ class

Assume that there is a class which contains some data and calculates some results given queries, and the queries take a relatively large amount of time.
An example class (everything dummy) is:
#include <vector>
#include <numeric>
#include <thread>
#include <chrono>   // for sleep_for / milliseconds (missing in the original)

struct do_some_work
{
    do_some_work(std::vector<int> data)
        : _data(std::move(data))
        , _current_query(0)
        , _last_calculated_result(0)
    {}

    void update_query(size_t x) {
        if (x < _data.size()) {
            _current_query = x;
            recalculate_result();
        }
    }

    int get_result() const {
        return _last_calculated_result;
    }

private:
    void recalculate_result() {
        //dummy some work here
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        _last_calculated_result = std::accumulate(_data.cbegin(), _data.cbegin() + _current_query, 0);
    }

    std::vector<int> const _data;
    size_t _current_query;
    int _last_calculated_result;
};
and this can be used in the main code like:
#include <algorithm>
#include <iostream> // for std::cout (missing in the original)

int main()
{
    //make some dummy data
    std::vector<int> test_data(20, 0);
    std::iota(test_data.begin(), test_data.end(), 0);
    {
        do_some_work work(test_data);
        for (size_t i = 0; i < test_data.size(); ++i) {
            work.update_query(i);
            std::cout << "result = {" << i << "," << work.get_result() << "}" << std::endl;
        }
    }
}
The above will wait in the main function a lot.
Now, assume we want to run this querying in a tight loop (say, a GUI) and only care about getting a "recent" result quickly when we query.
So, we want to move the work to a separate thread which calculates the results and updates them, and when we get the result, we get the last calculated one. That is, we want to change the do_some_work class to do its work on a thread, with minimal changes (essentially, find a pattern of changes that can be applied to (mostly) any class of this type).
My stab at this is the following:
#include <vector>
#include <numeric>
#include <mutex>
#include <thread>
#include <condition_variable>
#include <iostream>
#include <chrono>   // for sleep_for / milliseconds (missing in the original)

struct do_lots_of_work
{
    do_lots_of_work(std::vector<int> data)
        : _data(std::move(data))
        , _current_query(0)
        , _last_calculated_result(0)
        , _worker()
        , _data_mtx()
        , _result_mtx()
        , _cv()
        , _do_exit(false)
        , _work_available(false)
    {
        start_worker();
    }

    void update_query(size_t x) {
        {
            if (x < _data.size()) {
                std::lock_guard<std::mutex> lck(_data_mtx);
                _current_query = x;
                _work_available = true;
                _cv.notify_one();
            }
        }
    }

    int get_result() const {
        std::lock_guard<std::mutex> lck(_result_mtx);
        return _last_calculated_result;
    }

    ~do_lots_of_work() {
        stop_worker();
    }

private:
    void start_worker() {
        if (!_worker.joinable()) {
            std::cout << "starting worker..." << std::endl;
            _worker = std::thread(&do_lots_of_work::worker_loop, this);
        }
    }

    void stop_worker() {
        std::cout << "worker stopping..." << std::endl;
        if (_worker.joinable()) {
            std::unique_lock<std::mutex> lck(_data_mtx);
            _do_exit = true;
            lck.unlock();
            _cv.notify_one();
            _worker.join();
        }
        std::cout << "worker stopped" << std::endl;
    }

    void worker_loop() {
        std::cout << "worker started" << std::endl;
        while (true) {
            std::unique_lock<std::mutex> lck(_data_mtx);
            _cv.wait(lck, [this]() {return _work_available || _do_exit; });
            if (_do_exit) { break; }
            if (_work_available) {
                _work_available = false;
                int query = _current_query; //take local copy
                lck.unlock(); //unlock before doing lots of work.
                recalculate_result(query);
            }
        }
    }

    void recalculate_result(int query) {
        //dummy lots of work here
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        int const result = std::accumulate(_data.cbegin(), _data.cbegin() + query, 0);
        set_result(result);
    }

    void set_result(int result) {
        std::lock_guard<std::mutex> lck(_result_mtx);
        _last_calculated_result = result;
    }

    std::vector<int> const _data;
    size_t _current_query;
    int _last_calculated_result;
    std::thread _worker;
    mutable std::mutex _data_mtx;
    mutable std::mutex _result_mtx;
    std::condition_variable _cv;
    bool _do_exit;
    bool _work_available;
};
and the usage is (example):
#include <algorithm>

int main()
{
    //make some dummy data
    std::vector<int> test_data(20, 0);
    std::iota(test_data.begin(), test_data.end(), 0);
    {
        do_lots_of_work work(test_data);
        for (size_t i = 0; i < test_data.size(); ++i) {
            work.update_query(i);
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
            std::cout << "result = {" << i << "," << work.get_result() << "}" << std::endl;
        }
    }
}
This seems to work, giving the last result, not stopping the main function etc.
But it looks like a LOT of changes are required to add a worker thread to a simple class like do_some_work: two mutexes (one for the worker/main interaction data, and one for the result), one condition_variable, one more-work-available flag and one do-exit flag; that is quite a bit. I guess we don't want an async kind of mechanism, because we don't want to potentially launch a new thread every time.
Now, I am not sure if there is a MUCH simpler pattern to make this kind of change, but it feels like there should be. A kind of pattern that can be used to off-load work to a thread.
So finally, my question is, can do_some_work be converted into do_lots_of_work in a much simpler way than the implementation above?
Edit (Solution 1) ThreadPool based:
Using a threadpool, the worker loop can be skipped; we need two mutexes, one for the result and one for the query. Lock when updating the query, lock when getting the result, and lock both in recalculate (take a local copy of the query, and write to the result).
Note: also, when pushing work onto the queue, since we do not care about the older results, we can clear the work queue.
Example implementation (using the CTPL threadpool)
#include "CTPL\ctpl_stl.h"
#include <vector>
#include <mutex>
struct do_lots_of_work_with_threadpool
{
do_lots_of_work_with_threadpool(std::vector<int> data)
: _data(std::move(data))
, _current_query(0)
, _last_calculated_result(0)
, _pool(1)
, _result_mtx()
, _query_mtx()
{
}
void update_query(size_t x) {
if (x < _data.size()) {
std::lock_guard<std::mutex> lck(_query_mtx);
_current_query = x;
}
_pool.clear_queue(); //clear as we don't want to calculate any out-date results.
_pool.push([this](int id) { recalculate_result(); });
}
int get_result() const {
std::lock_guard<std::mutex> lck(_result_mtx);
return _last_calculated_result;
}
private:
void recalculate_result() {
//dummy some work here
size_t query;
{
std::lock_guard<std::mutex> lck(_query_mtx);
query = _current_query;
}
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
int result = std::accumulate(_data.cbegin(), _data.cbegin() + query, 0);
{
std::lock_guard<std::mutex> lck(_result_mtx);
_last_calculated_result = result;
}
}
std::vector<int> const _data;
size_t _current_query;
int _last_calculated_result;
ctpl::thread_pool _pool;
mutable std::mutex _result_mtx;
mutable std::mutex _query_mtx;
};
Edit (Solution 2) With ThreadPool and Atomic:
This solution changes the shared variables to atomic, and so we do not need any mutexes and do not have to consider taking/releasing locks etc. This is much simpler and very close to the original class (of course assumes a threadpool type exists somewhere as it is not part of the standard).
#include "CTPL\ctpl_stl.h"
#include <vector>
#include <mutex>
#include <atomic>
struct do_lots_of_work_with_threadpool_and_atomics
{
do_lots_of_work_with_threadpool_and_atomics(std::vector<int> data)
: _data(std::move(data))
, _current_query(0)
, _last_calculated_result(0)
, _pool(1)
{
}
void update_query(size_t x) {
if (x < _data.size()) {
_current_query.store(x);
}
_pool.clear_queue(); //clear as we don't want to calculate any out-date results.
_pool.push([this](int id) { recalculate_result(); });
}
int get_result() const {
return _last_calculated_result.load();
}
private:
void recalculate_result() {
//dummy some work here
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
_last_calculated_result.store(std::accumulate(_data.cbegin(), _data.cbegin() + _current_query.load(), 0));
}
std::vector<int> const _data;
std::atomic<size_t> _current_query;
std::atomic<int> _last_calculated_result;
ctpl::thread_pool _pool;
};

std::thread throwing "resource deadlock would occur"

I have a list of objects; each object has member variables which are calculated by an "update" function. I want to update the objects in parallel, that is, I want to create a thread for each object to execute its update function.
Is this a reasonable thing to do? Any reasons why this may not be a good idea?
Below is a program which attempts to do what I described; this is a complete program, so you should be able to run it (I'm using VS2015). The goal is to update each object in parallel. The problem is that once the update function completes, the thread throws a "resource deadlock would occur" exception and aborts.
Where am I going wrong?
#include <iostream>
#include <thread>
#include <vector>
#include <algorithm>
#include <mutex>
#include <chrono>

class Object
{
public:
    Object(int sleepTime, unsigned int id)
        : m_pSleepTime(sleepTime), m_pId(id), m_pValue(0) {}

    void update()
    {
        if (!isLocked()) // if an object is not locked
        {
            // create a thread to perform it's update
            m_pThread.reset(new std::thread(&Object::_update, this));
        }
    }

    unsigned int getId()
    {
        return m_pId;
    }

    unsigned int getValue()
    {
        return m_pValue;
    }

    bool isLocked()
    {
        bool mutexStatus = m_pMutex.try_lock();
        if (mutexStatus) // if mutex is locked successfully (meaning it was unlocked)
        {
            m_pMutex.unlock();
            return false;
        }
        else // if mutex is locked
        {
            return true;
        }
    }

private:
    // private update function which actually does work
    void _update()
    {
        m_pMutex.lock();
        {
            std::cout << "thread " << m_pId << " sleeping for " << m_pSleepTime << std::endl;
            std::chrono::milliseconds duration(m_pSleepTime);
            std::this_thread::sleep_for(duration);
            m_pValue = m_pId * 10;
        }
        m_pMutex.unlock();
        try
        {
            m_pThread->join();
        }
        catch (const std::exception& e)
        {
            std::cout << e.what() << std::endl; // throws "resource dead lock would occur"
        }
    }

    unsigned int m_pSleepTime;
    unsigned int m_pId;
    unsigned int m_pValue;
    std::mutex m_pMutex;
    std::shared_ptr<std::thread> m_pThread; // store reference to thread so it doesn't go out of scope when update() returns
};

typedef std::shared_ptr<Object> ObjectPtr;

class ObjectManager
{
public:
    ObjectManager()
        : m_pNumObjects(0){}

    void updateObjects()
    {
        for (int i = 0; i < m_pNumObjects; ++i)
        {
            m_pObjects[i]->update();
        }
    }

    void removeObjectByIndex(int index)
    {
        m_pObjects.erase(m_pObjects.begin() + index);
    }

    void addObject(ObjectPtr objPtr)
    {
        m_pObjects.push_back(objPtr);
        m_pNumObjects++;
    }

    ObjectPtr getObjectByIndex(unsigned int index)
    {
        return m_pObjects[index];
    }

private:
    std::vector<ObjectPtr> m_pObjects;
    int m_pNumObjects;
};

void main()
{
    int numObjects = 2;

    // Generate sleep time for each object
    std::vector<int> objectSleepTimes;
    objectSleepTimes.reserve(numObjects);
    for (int i = 0; i < numObjects; ++i)
        objectSleepTimes.push_back(rand());

    ObjectManager mgr;

    // Create some objects
    for (int i = 0; i < numObjects; ++i)
        mgr.addObject(std::make_shared<Object>(objectSleepTimes[i], i));

    // Print expected object completion order
    // Sort from smallest to largest
    std::sort(objectSleepTimes.begin(), objectSleepTimes.end());
    for (int i = 0; i < numObjects; ++i)
        std::cout << objectSleepTimes[i] << ", ";
    std::cout << std::endl;

    // Update objects
    mgr.updateObjects();

    int numCompleted = 0; // number of objects which finished updating
    while (numCompleted != numObjects)
    {
        for (int i = 0; i < numObjects; ++i)
        {
            auto objectRef = mgr.getObjectByIndex(i);
            if (!objectRef->isLocked()) // if object is not locked, it is finished updating
            {
                std::cout << "Object " << objectRef->getId() << " completed. Value = " << objectRef->getValue() << std::endl;
                mgr.removeObjectByIndex(i);
                numCompleted++;
            }
        }
    }

    system("pause");
}
Looks like you've got a thread that is trying to join itself.
While I was trying to understand your solution I simplified it a lot, and I came to the point that you are using the std::thread::join() method in the wrong way.
std::thread provides the ability to wait for its completion (a non-spin wait); in your example you wait for thread completion in an infinite loop (a spin wait) that will consume CPU time heavily.
You should call std::thread::join() from another thread to wait for thread completion. The mutex in Object in your example is not necessary. Moreover, you are missing a mutex to synchronize access to std::cout, which is not thread-safe. I hope the example below will help.
#include <iostream>
#include <thread>
#include <vector>
#include <algorithm>
#include <mutex>
#include <chrono>
#include <cassert>
#include <memory>    // for std::shared_ptr/make_shared (missing in the original)
#include <cstdlib>   // for srand/rand (missing in the original)
#include <ctime>     // for time (missing in the original)

// cout is not thread-safe
std::recursive_mutex cout_mutex;

class Object {
public:
    Object(int sleepTime, unsigned int id)
        : _sleepTime(sleepTime), _id(id), _value(0) {}

    void runUpdate() {
        if (!_thread.joinable())
            _thread = std::thread(&Object::_update, this);
    }

    void waitForResult() {
        _thread.join();
    }

    unsigned int getId() const { return _id; }
    unsigned int getValue() const { return _value; }

private:
    void _update() {
        {
            {
                std::lock_guard<std::recursive_mutex> lock(cout_mutex);
                std::cout << "thread " << _id << " sleeping for " << _sleepTime << std::endl;
            }
            std::this_thread::sleep_for(std::chrono::seconds(_sleepTime));
            _value = _id * 10;
        }
        std::lock_guard<std::recursive_mutex> lock(cout_mutex);
        std::cout << "Object " << getId() << " completed. Value = " << getValue() << std::endl;
    }

    unsigned int _sleepTime;
    unsigned int _id;
    unsigned int _value;
    std::thread _thread;
};

class ObjectManager : public std::vector<std::shared_ptr<Object>> {
public:
    void runUpdate() {
        for (auto it = this->begin(); it != this->end(); ++it)
            (*it)->runUpdate();
    }

    void waitForAll() {
        auto it = this->begin();
        while (it != this->end()) {
            (*it)->waitForResult();
            it = this->erase(it);
        }
    }
};

int main(int argc, char* argv[]) {
    enum {
        TEST_OBJECTS_NUM = 2,
    };

    srand(static_cast<unsigned int>(time(nullptr)));

    ObjectManager mgr;

    // Generate sleep time for each object
    std::vector<int> objectSleepTimes;
    objectSleepTimes.reserve(TEST_OBJECTS_NUM);
    for (int i = 0; i < TEST_OBJECTS_NUM; ++i)
        objectSleepTimes.push_back(rand() * 9 / RAND_MAX + 1); // 1..10 seconds

    // Create some objects
    for (int i = 0; i < TEST_OBJECTS_NUM; ++i)
        mgr.push_back(std::make_shared<Object>(objectSleepTimes[i], i));
    assert(mgr.size() == TEST_OBJECTS_NUM);

    // Print expected object completion order
    // Sort from smallest to largest
    std::sort(objectSleepTimes.begin(), objectSleepTimes.end());
    for (size_t i = 0; i < mgr.size(); ++i)
        std::cout << objectSleepTimes[i] << ", ";
    std::cout << std::endl;

    // Update objects
    mgr.runUpdate();
    mgr.waitForAll();

    //system("pause"); // use Ctrl+F5 to run the app instead. That's more reliable in case of sudden app exit.
}
About whether it is a reasonable thing to do...
A better approach is to create an object update queue. Objects that need to be updated are added to this queue, which can be fulfilled by a group of threads instead of one thread per object (see the sketch after the list below).
The benefits are:
No 1-to-1 correspondence between thread and objects. Creating a thread is a heavy operation, probably more expensive than most update code for a single object.
Supports thousands of objects: with your solution you would need to create thousands of threads, which you will find exceeds your OS capacity.
Can support additional features like declaring dependencies between objects or updating a group of related objects as one operation.
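A minimal sketch of such an update queue (my illustration, not from the original answer): a fixed group of workers drains a queue of objects and runs each update synchronously on whichever worker picks it up.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical update queue: N workers service any number of objects.
template <class Obj>
class UpdateQueue {
public:
    explicit UpdateQueue(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker_loop(); });
    }
    ~UpdateQueue() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void enqueue(Obj* obj) {
        {
            std::lock_guard<std::mutex> lock(m_);
            pending_.push(obj);
        }
        cv_.notify_one();
    }
private:
    void worker_loop() {
        for (;;) {
            Obj* obj = nullptr;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
                if (done_) return;
                obj = pending_.front();
                pending_.pop();
            }
            obj->update(); // assumed to do its work synchronously, unlike the question's Object::update
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Obj*> pending_;
    std::vector<std::thread> threads_;
    bool done_ = false;
};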

Concurrently push()ing to a shared queue with pthread?

I am practicing pthread.
In my original program, it pushes an instance of a class called request to the shared queue, but first I at least want to make sure that I am pushing something to a shared queue.
It is very simple code, but it just throws a lot of errors whose cause I could not figure out.
I guess it's probably the syntax, but whatever I tried it did not work.
Do you see why it is not working?
Following is the code I have been trying.
extern "C" {
#include<pthread.h>
#include<unistd.h>
}
#include<queue>
#include<iostream>
#include<string>
using namespace std;
class request {
public:
string req;
request(string s) : req(s) {}
};
int n;
queue<request> q;
pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
void * putToQueue(string);
int main ( void ) {
pthread_t t1, t2;
request* ff = new request("First");
request* trd = new request("Third");
int result1 = pthread_create(&t1, NULL, &putToQueue, reinterpret_cast<void*>(&ff));
if (result1 != 0) cout << "error 1" << endl;
int result2 = pthread_create(&t2, NULL, &putToQueue, reinterpret_cast<void*>(&trd));
if (result2 != 0) cout << "error 2" << endl;
pthread_join(t1, NULL);
pthread_join(t2, NULL);
for(int i=0; i<q.size(); ++i) {
cout << q.front().req << " is in queue" << endl;
q.pop();
--n;
}
return 0;
}
void * putToQueue(void* elem) {
pthread_mutex_lock(&mut);
q.push(reinterpret_cast<request>(elem));
++n;
cout << n << " items are in the queue." << endl;
pthread_mutex_unlock(&mut);
return 0;
}
The code below comments on everything that had to be changed. I would write up a detailed description of why they had to change, but I hope the code speaks for itself. It still isn't bullet-proof. There are plenty of things that could be done differently or better (exception handling for failed new, etc) but at least it compiles, runs, and doesn't leak memory.
#include <queue>
#include <iostream>
#include <string>
#include <pthread.h>
#include <unistd.h>

using namespace std;

// MINOR: param should be a const-ref
class request {
public:
    string req;
    request(const string& s) : req(s) {}
};

int n;
queue<request> q;
pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;

// FIXED: made prototype a proper pthread-proc signature
void * putToQueue(void*);

int main ( void )
{
    pthread_t t1, t2;

    // FIXED: made thread param the actual dynamic allocation address
    int result1 = pthread_create(&t1, NULL, &putToQueue, new request("First"));
    if (result1 != 0) cout << "error 1" << endl;

    // FIXED: made thread param the actual dynamic allocation address
    int result2 = pthread_create(&t2, NULL, &putToQueue, new request("Third"));
    if (result2 != 0) cout << "error 2" << endl;

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    // FIXED: was skipping elements because the queue size was shrinking
    //        with each pop in the while-body.
    while (!q.empty())
    {
        cout << q.front().req << " WAS in queue" << endl;
        q.pop();
    }

    return 0;
}

// FIXED: pretty much a near-total-rewrite
void* putToQueue(void* elem)
{
    request *req = static_cast<request*>(elem);
    if (pthread_mutex_lock(&mut) == 0)
    {
        q.push(*req);
        cout << ++n << " items are in the queue." << endl;
        pthread_mutex_unlock(&mut);
    }
    delete req; // FIXED: squelched memory leak
    return 0;
}
Output (yours may vary)
1 items are in the queue.
2 items are in the queue.
Third WAS in queue
First WAS in queue
As noted in the comment, I'd advise skipping direct use of pthreads, and use the C++11 threading primitives instead. I'd start with a simple protected queue class:
#include <deque>     // assumed includes; the original snippets omitted them
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <iostream>

template <class T, template<class, class> class Container=std::deque>
class p_q {
    typedef Container<T, std::allocator<T>> container; // FIXED: no `typename` on an unqualified template-id
    typedef typename container::iterator iterator;

    container data;
    std::mutex m;
public:
    void push(T a) {
        std::lock_guard<std::mutex> l(m);
        data.emplace_back(a);
    }
    iterator begin() { return data.begin(); }
    iterator end() { return data.end(); }
    // omitting front() and pop() for now, because they're not used in this code
};
Using this, the main-stream of the code stays nearly as simple and clean as single-threaded code, something like this:
int main() {
    p_q<std::string> q;
    auto pusher = [&q](std::string const& a) { q.push(a); };
    std::thread t1{ pusher, "First" };
    std::thread t2{ pusher, "Second" };
    t1.join();
    t2.join();
    for (auto s : q)
        std::cout << s << "\n";
}
As it stands right now, this is a multiple-producer, single-consumer queue. Further, it depends on the fact that the producers are no longer running when the consuming is happening. That's true in this case, but wouldn't/won't always be. When it's not the case, you'll need a (marginally) more complex queue that does locking as it reads/pops from the queue, not just when writing to it.