How can I refactor this code into multi-thread version? - c++

I have a loop that takes quite a long time, and I'm considering refactoring it into a multi-threaded version. Here is the model:
Photon photon;
for (int i=0;i<1000000;++i){
func(){
photon.launch(args...){
// do something
}
}
}
I have to call this function thousands upon thousands of times, so I was wondering how I can create some threads to run it at the same time.
But the photon has to be a separate object every single time.
The index i can be converted to this:
atomic<int> i{0};
while(i<1000000){
func(){
photon.launch(args...){
// do something
++i;
}
}
}

With threading you have to pay attention to object lifetime and sharing far more than normal.
But the basic solution is
void do_tasks( std::size_t count, std::function<void( std::size_t start, std::size_t finish )> task ) {
auto thread_count = std::thread::hardware_concurrency();
if (thread_count <= 0) thread_count = 1;
std::vector<std::future<void>> threads( thread_count-1 );
auto get_task = [=](std::size_t index) {
auto start = count * index / thread_count;
auto finish = count * (index+1) / thread_count;
// std::cout << "from " << start << " to " << finish << "\n";
return [task, start, finish]{ task(start, finish); };
};
for( auto& thread : threads ) {
auto index = &thread-threads.data();
thread = std::async( std::launch::async, get_task(index) );
}
get_task( threads.size() )();
for (auto& thread : threads) {
thread.get();
}
}
This is a little multi threading library.
You use it like this:
do_tasks( 100, [&](size_t start, size_t finish) {
// do subtasks starting at index start, up to and not including finish
});
There are other more complex threading libraries, but writing a small half-decent one isn't hard so I did it.
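To see how the count * index / thread_count arithmetic in do_tasks divides the work, here is a small sketch that only computes the ranges. The helper names chunk_bounds and all_chunks are made up for illustration and are not part of the answer's code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open range [first, second) handled by chunk `index` out of `parts`,
// using the same count * index / parts arithmetic as do_tasks.
std::pair<std::size_t, std::size_t>
chunk_bounds(std::size_t count, std::size_t parts, std::size_t index) {
    assert(parts > 0 && index < parts);
    return { count * index / parts, count * (index + 1) / parts };
}

// Collects every chunk so the split can be inspected.
std::vector<std::pair<std::size_t, std::size_t>>
all_chunks(std::size_t count, std::size_t parts) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t i = 0; i < parts; ++i)
        out.push_back(chunk_bounds(count, parts, i));
    return out;
}
```

For count = 10 and parts = 4 this yields [0,2), [2,5), [5,7) and [7,10): the ranges are contiguous, never overlap, and differ in size by at most one. Note that count * (index + 1) can overflow for counts close to the maximum of std::size_t, which is not a concern for a million iterations.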
To be explicit:
Photon photon;
do_tasks( 1000000, [&](size_t start, size_t finish) {
for (size_t i = start; i < finish; ++i) {
photon.launch(args...);
}
});
but you'll have to be extremely careful making sure there is no unsafe data sharing between the threads, and you aren't just blocking each thread on a common mutex.
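One way to sidestep the sharing problem, given that the question wants a separate photon every time, is to construct the photon inside the loop and let each thread accumulate into its own result slot. The following is only a sketch under assumptions: the Photon below is a hypothetical stand-in whose launch() returns a number and touches no shared state, and run_launches is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for the question's Photon: launch() touches only
// this object's own state and returns a per-call contribution.
struct Photon {
    double launch(std::size_t i) { return static_cast<double>(i % 7); }
};

// Runs `count` launches on `workers` threads. Each thread creates its own
// Photon objects and writes only to its own slot of `partial`, so no lock
// is needed.
double run_launches(std::size_t count, unsigned workers) {
    assert(workers > 0);
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t start = count * w / workers;
        std::size_t finish = count * (w + 1) / workers;
        threads.emplace_back([&partial, w, start, finish] {
            double local = 0.0;            // accumulate in a local first
            for (std::size_t i = start; i < finish; ++i) {
                Photon photon;             // a fresh photon per iteration
                local += photon.launch(i);
            }
            partial[w] = local;            // one write per thread
        });
    }
    for (auto& t : threads) t.join();
    // Combine the per-thread results on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Accumulating into a local variable and writing the slot once also keeps threads from repeatedly writing to neighbouring elements of the same cache line.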

An awful lot depends on how and to what extent photon.launch() can be parallelised.
The code below divides a range into (approximately) equal segments and then executes each segment in a separate thread.
As stated, whether that helps depends on how much of photon.launch() can be done in parallel. If it spends most of its time modifying shared state and essentially has the form:
void launch(int index){
std::lock_guard<std::mutex> guard{m};
//.....
}
Where m is a member of Photon then little if anything will be gained.
If (at the other extreme) the individual calls to launch never contend for the same data then it can be parallelised up to the number of cores the system can provide.
#include <thread>
#include <vector>
class Photon {
public:
void launch(int index){
//... what goes here matters a lot...
}
};
void photon_launch(Photon& photon,int from,int to){
for(auto i=from;i<=to;++i){
photon.launch(i);
}
}
int main() {
const size_t loop_count=100000;//How big is the loop?
const size_t thread_count=4;//How many threads can we utilize?
std::vector< std::thread > threads;
Photon photon;
int from=1;
for(size_t i=1;i<=thread_count;++i){
//If loop_count isn't divisible by thread_count evens out the remainder.
int to=(loop_count*i)/thread_count;
threads.emplace_back(photon_launch,std::ref(photon),from,to);
from=to+1;
}
//Now the threads are launched we block until they all finish.
//If we don't the program may (will?) finish before the threads.
for(auto& curr : threads){
curr.join();
}
return 0;
}

Related

Fast process std::bitset<65536> in parallel

There was once a question, since deleted, to which I wrote a long answer; the author declined to undelete it.
So I'm posting a short summary of that question here and answering it myself right away, just to share my results.
The question was: given a std::bitset<65536> that is processed bit by bit (by some formula) inside an inner loop, how can we speed up this computation?
The outer loop just calls the inner loop many times (let's say 50,000 times) and can't be parallelized, because each iteration depends on the results of the previous one.
Example code of this process:
std::bitset<65536> bits{};
uint64_t hash = 0;
for (size_t i = 0; i < 50000; ++i) {
// Process Bits
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
hash = Hash(bits, hash);
}
The code above is just one sample way of processing; it is not the real case. In the real case we process a std::bitset<65536> many times, in such a way that all bits can be processed independently.
The question is how to process the bits in parallel as fast as possible inside the inner loop.
One important note: the formula that modifies the bits is generic, meaning that we don't know it in advance and can't turn it into SIMD instructions.
What we do know is that all bits can be processed independently and that we need to parallelize this processing. We also can't parallelize the outer loop, as each of its iterations depends on the results of the previous one.
Another note: std::bitset<65536> is quite small, just 1K 64-bit words. This means that directly using a pool of std::thread or std::async threads will not work, because each thread's work takes only around 50-200 nanoseconds, far too little compared with the time needed to start and stop threads and send work to them. Even std::mutex takes 75 nanoseconds on my Windows machine (although 20 nanoseconds on Linux), so using std::mutex is also a big overhead.
One may assume that ModifyBit() takes about the same time for each bit; otherwise there is no obvious way to schedule a balanced parallelization of the loop other than slicing it into very many tiny tasks and hoping that longer tasks are balanced out by several shorter ones.
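The mutex cost quoted above varies by platform, so it may be worth measuring on your own machine. A minimal sketch follows; the function name ns_per_lock_unlock is invented here, and it only measures the uncontended case on a single thread, which is the best case for a mutex.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>

// Average cost in nanoseconds of one uncontended lock()/unlock() pair.
double ns_per_lock_unlock(std::size_t iterations) {
    assert(iterations > 0);
    std::mutex m;
    auto begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        m.lock();
        m.unlock();
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = end - begin;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling ns_per_lock_unlock(1000000) a few times and taking the smallest value gives a rough per-operation figure to compare with the 50-200 nanoseconds of useful work per task.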
I implemented quite a large and complex solution for your task, but it works very fast. On my 4-core (8 hardware threads) laptop I get a 6x multi-core speedup compared to the single-threaded version (your version of the code).
The main idea of the solution below is to implement a very fast multi-core thread pool with small overhead for running arbitrary tasks. My implementation can handle 1-10 million tasks per second (depending on CPU speed and core count).
The regular way of starting multiple tasks asynchronously is to use std::async or to create a std::thread directly. Both are considerably slower than my own implementation; they can't reach the throughput of 5 million tasks per second that my implementation gives. And your code needs millions of tasks per second to run at good speed. That's why I implemented everything from scratch.
With the fast thread pool implemented, we can slice your 64K bitset into smaller subsets and process them in parallel. I sliced the 64K bitset into 16 equal parts (see BitSize / 16 in the code); you can choose another power-of-two number of parts, but not too many, otherwise the thread pool overhead becomes too large. Usually it is good to slice into twice as many parts as there are hardware threads (or 4 times the number of cores).
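When choosing a different number of parts, it helps to keep block boundaries on storage-word boundaries. Common std::bitset implementations pack bits into machine words, so two threads writing different bits of the same word would race. The sketch below computes such ranges; aligned_blocks is an invented helper, and the 64-bit word size is an assumption, since the standard does not specify the storage layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits `bit_count` bits into at most `parts` half-open ranges whose
// boundaries are multiples of `word_bits` (64 is an assumption about the
// bitset's storage word).
std::vector<std::pair<std::size_t, std::size_t>>
aligned_blocks(std::size_t bit_count, std::size_t parts,
               std::size_t word_bits = 64) {
    assert(parts > 0 && word_bits > 0);
    std::size_t words = (bit_count + word_bits - 1) / word_bits;  // round up
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    for (std::size_t p = 0; p < parts; ++p) {
        std::size_t lo = std::min(words * p / parts * word_bits, bit_count);
        std::size_t hi = std::min(words * (p + 1) / parts * word_bits, bit_count);
        if (lo < hi) blocks.emplace_back(lo, hi);   // skip empty blocks
    }
    return blocks;
}
```

For 65536 bits and 16 parts this gives sixteen blocks of 4096 bits, the same split as BitSize / 16; for sizes that do not divide evenly it returns fewer, word-aligned blocks instead of splitting a word between two tasks.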
I implemented several classes in the C++ code. The AtomicMutex class uses std::atomic_flag to implement a very fast spin-lock replacement for a mutex. This AtomicMutex protects the queue of tasks submitted to the thread pool.
The RingBuffer class is based on std::vector and implements a simple and fast queue for storing any objects. It uses two indices (head and tail) into the vector. When a new element is added to the queue, the tail index advances to the right; if it reaches the end of the vector, it wraps around to position 0. In the same way, when an element is taken out of the queue, the head index advances to the right with wrap-around. RingBuffer is used to store the thread pool's tasks.
The Queue class is a wrapper around RingBuffer with AtomicMutex protection. This spin-lock mutex guards against worker threads adding elements to and taking elements from the queue simultaneously.
Pool implements the multi-core task pool itself. It creates as many worker threads as there are CPU hardware threads (double the number of cores) minus one. Each worker thread just polls the queue for new tasks and executes them immediately. The main thread adds new tasks to the queue. Pool also has a Wait() method that waits until all current tasks are finished; it is used as a barrier to wait until the whole 64K bitset (all sub-parts) is processed. Pool accepts any lambdas (function closures) to run. You can see that the 64K bitset, sliced into smaller parts, is processed by calling pool.Emplace(lambda), and pool.Wait() is later used to wait until all sub-parts are finished. Exceptions from pool workers are collected and reported to the user if there is any error. During Wait() the pool also runs tasks on the main thread, so as not to waste one core just waiting for tasks to finish.
Timings reported in the console are measured with std::chrono.
Both versions can be run: single-threaded (your original version) and multi-threaded using all cores. You switch between them by passing the MultiThreaded template parameter (true or false) to ProcessBitset().
#include <cstdint>
#include <atomic>
#include <vector>
#include <array>
#include <queue>
#include <functional>
#include <thread>
#include <future>
#include <exception>
#include <stdexcept>
#include <optional>
#include <memory>
#include <iostream>
#include <iomanip>
#include <bitset>
#include <string>
#include <chrono>
#include <algorithm>
#include <any>
#include <type_traits>
class AtomicMutex {
class LockerC;
public:
void lock() {
while (f_.test_and_set(std::memory_order_acquire))
//f_.wait(true, std::memory_order_acquire)
;
}
void unlock() {
f_.clear(std::memory_order_release);
//f_.notify_all();
}
LockerC Locker() { return LockerC(*this); }
private:
class LockerC {
public:
LockerC() = delete;
LockerC(AtomicMutex & mux) : pmux_(&mux) { mux.lock(); }
LockerC(LockerC const & other) = delete;
LockerC(LockerC && other) : pmux_(other.pmux_) { other.pmux_ = nullptr; }
~LockerC() { if (pmux_) pmux_->unlock(); }
LockerC & operator = (LockerC const & other) = delete;
LockerC & operator = (LockerC && other) = delete;
private:
AtomicMutex * pmux_ = nullptr;
};
std::atomic_flag f_ = ATOMIC_FLAG_INIT;
};
template <typename T>
class RingBuffer {
public:
RingBuffer() : buf_(1 << 8), last_(buf_.size() - 1) {}
T & front() { return buf_[first_]; }
T const & front() const { return buf_[first_]; }
T & back() { return buf_[last_]; }
T const & back() const { return buf_[last_]; }
size_t size() const { return size_; }
bool empty() const { return size_ == 0; }
template <typename ... Args>
void emplace(Args && ... args) {
while (size_ >= buf_.size()) {
std::rotate(&buf_[0], &buf_[first_], &buf_[buf_.size()]);
first_ = 0;
last_ = buf_.size() - 1;
buf_.resize(buf_.size() * 2);
}
++size_;
++last_;
if (last_ >= buf_.size())
last_ = 0;
buf_[last_] = T(std::forward<Args>(args)...);
}
void pop() {
if (size_ == 0)
return;
--size_;
++first_;
if (first_ >= buf_.size())
first_ = 0;
}
private:
std::vector<T> buf_;
size_t first_ = 0, last_ = 0, size_ = 0;
};
template <typename T>
class Queue {
public:
size_t Size() const { return q_.size(); }
bool Empty() const { return q_.size() == 0; }
template <typename ... Args>
void Emplace(Args && ... args) {
auto lock = m_.Locker();
q_.emplace(std::forward<Args>(args)...);
}
T Pop(std::function<void()> const & on_empty = []{},
std::function<void()> const & on_full = []{}) {
while (true) {
if (q_.empty()) {
on_empty();
continue;
}
auto lock = m_.Locker();
if (q_.empty()) {
on_empty();
continue;
}
on_full();
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
}
std::optional<T> TryPop() {
auto lock = m_.Locker();
if (q_.empty())
return std::nullopt;
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
private:
AtomicMutex m_;
RingBuffer<T> q_;
};
class RunInDestr {
public:
RunInDestr(std::function<void()> const & f) : f_(f) {}
~RunInDestr() { f_(); }
private:
std::function<void()> const & f_;
};
class Pool {
public:
struct FinishExc {};
struct Worker {
std::unique_ptr<std::atomic<bool>> pdone = std::make_unique<std::atomic<bool>>(true);
std::unique_ptr<std::exception_ptr> pexc = std::make_unique<std::exception_ptr>();
std::unique_ptr<std::thread> thr;
};
Pool(size_t nthreads = size_t(-1)) {
if (nthreads == size_t(-1))
nthreads = std::thread::hardware_concurrency() - 1;
std::cout << "Pool has " << nthreads << " worker threads." << std::endl;
for (size_t i = 0; i < nthreads; ++i) {
workers_.emplace_back(Worker{});
workers_.back().thr = std::make_unique<std::thread>(
[&, pdone = workers_.back().pdone.get(), pexc = workers_.back().pexc.get()]{
try {
std::function<void()> f_done = [pdone]{
pdone->store(true, std::memory_order_relaxed);
}, f_empty = [this]{
CheckFinish();
}, f_full = [pdone]{
pdone->store(false, std::memory_order_relaxed);
};
while (true) {
RunInDestr set_done(f_done);
tasks_.Pop(f_empty, f_full)();
}
} catch (...) {
exc_.store(true, std::memory_order_relaxed);
*pexc = std::current_exception();
}
});
}
}
~Pool() {
Wait();
Finish();
}
void CheckExc() {
if (!exc_.load(std::memory_order_relaxed))
return;
Finish();
throw std::runtime_error("Pool: Exception occured!");
}
void Finish() {
finish_ = true;
for (auto & w: workers_)
try {
w.thr->join();
if (*w.pexc)
std::rethrow_exception(*w.pexc);
} catch (FinishExc const &) {}
workers_.clear();
}
template <typename ... Args>
void Emplace(Args && ... args) {
CheckExc();
tasks_.Emplace(std::forward<Args>(args)...);
}
void Wait() {
while (true) {
auto task = tasks_.TryPop();
if (!task)
break;
(*task)();
}
while (true) {
bool done = true;
for (auto & w: workers_)
if (!w.pdone->load(std::memory_order_relaxed)) {
done = false;
break;
}
if (done)
break;
}
CheckExc();
}
private:
void CheckFinish() {
if (finish_)
throw FinishExc{};
}
Queue<std::function<void()>> tasks_;
std::vector<Worker> workers_;
std::atomic<bool> finish_ = false; // written by Finish(), read by worker threads
std::atomic<bool> exc_ = false;
};
template <bool MultiThreaded = true, size_t BitSize>
void ProcessBitset(Pool & pool, std::bitset<BitSize> & bset,
std::string const & businessLogicCriteria) {
static size_t constexpr block = BitSize / 16;
for (int j = 0; j < BitSize; j += block) {
auto task = [&bset, j]{
int const hi = std::min(j + block, BitSize);
for (int i = j; i < hi; ++i) {
if (i % 2 == 0)
bset[i] = 0;
else
bset[i] = 1;
}
};
if constexpr(MultiThreaded)
pool.Emplace(std::move(task));
else
task();
}
if constexpr(MultiThreaded)
pool.Wait();
}
static auto const gtb = std::chrono::high_resolution_clock::now();
double Time() {
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
void Compute() {
Pool pool;
std::bitset<65536> bset;
std::string businessLogicCriteria;
int const hi = 50000;
for (int j = 0; j < hi; ++j) {
if ((j & 0x1FFF) == 0 || j + 1 >= hi)
std::cout << j / 1000 << "K (" << std::fixed << std::setprecision(3) << Time() << " sec), " << std::flush;
ProcessBitset(pool, bset, businessLogicCriteria);
businessLogicCriteria = "...";
}
}
void TimeMeasure() {
size_t constexpr A = 1 << 16, B = 1 << 5;
{
Pool pool;
auto const tb = Time();
int64_t volatile x = 0;
for (size_t i = 0; i < A; ++i) {
for (size_t j = 0; j < B; ++j)
pool.Emplace([&]{ x = x + 1; });
pool.Wait();
}
std::cout << "AtomicPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << A * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (A * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (A * B) << std::endl;
}
{
auto const tb = Time();
//size_t const nthr = std::thread::hardware_concurrency();
size_t constexpr C = A / 8;
std::vector<std::future<void>> asyncs;
int64_t volatile x = 0;
for (size_t i = 0; i < C; ++i) {
asyncs.clear();
for (size_t j = 0; j < B; ++j)
asyncs.emplace_back(std::async(std::launch::async, [&]{ x = x + 1; }));
asyncs.clear();
}
std::cout << "AsyncPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << C * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (C * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (C * B) << std::endl;
}
}
int main() {
try {
TimeMeasure();
Compute();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
} catch (...) {
std::cout << "Unknown Exception!" << std::endl;
return -1;
}
}
Output for 4 cores (8 hardware threads):
Pool has 7 worker threads.
AtomicPool time 0.903 sec, speed 2321.831 empty K-tasks/sec, 0.431 sec/M-task, no-collisions 0.9999967
AsyncPool time 0.982 sec, speed 266.789 empty K-tasks/sec, 3.750 sec/M-task, no-collisions 0.9999123
Pool has 7 worker threads.
0K (0.074 sec), 8K (0.670 sec), 16K (1.257 sec), 24K (1.852 sec), 32K (2.435 sec), 40K (2.984 sec), 49K (3.650 sec), 49K (3.711 sec),
For comparison, below are the timings of the single-threaded version, which is about 6x slower:
0K (0.125 sec), 8K (3.786 sec), 16K (7.754 sec), 24K (11.202 sec), 32K (14.662 sec), 40K (18.056 sec), 49K (21.470 sec), 49K (21.841 sec),
You have this inner loop you want to parallelize:
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
So a good idea is to split it into chunks, and have multiple threads do each chunk in parallel. You can submit chunks to workers easily with a std::atomic<int> counter that increments to identify which chunk to work on. You can also make sure the threads all stop working after one loop before starting the next with a std::barrier:
std::bitset<65536> bits{};
std::thread pool[8]; // Change size accordingly
std::atomic<int> task_number{0};
constexpr std::size_t tasks_per_loop = 32; // Arbitrarily chosen
constexpr std::size_t block_size = (bits.size()+tasks_per_loop-1) / tasks_per_loop;
// (only written to by one thread by the barrier, so not atomic)
uint64_t hash = 0;
int i = 0;
std::barrier barrier(std::size(pool), [&]() noexcept { // std::barrier requires a nothrow-invocable completion function
task_number = 0;
++i;
hash = Hash(bits, hash);
});
for (std::thread& t : pool) {
t = std::thread([&]{
while (i < 50000) {
for (int t; (t = task_number++) < tasks_per_loop;) {
int block_start = t * block_size;
int block_end = std::min(block_start + block_size, bits.size());
for (int j = block_start; j < block_end; ++j) {
bits[j] = ModifyBit(i, j, hash, bits[j]);
}
}
// Wait for other threads to finish and hash
// to be calculated before starting next loop
barrier.arrive_and_wait();
}
});
}
for (std::thread& t : pool) t.join();
(The seemingly easy way of parallelizing the for loop with OpenMP #pragma omp parallel for seemed slower with some testing, perhaps because the tasks were so small)
Here it is against your implementation running similar code: https://godbolt.org/z/en76Kv4nn
And on my machine, running this a few times with 1 million iterations took 28 to 32 seconds with my approach and 44 to 50 seconds with your general thread pool approach (granted this is much less general because it can't execute arbitrary std::function<void()> tasks).

What am I doing wrong with my C++ threading?

I'm trying to solve the following problem:
Given a vector V[] of positive and negative integers, a number N is paired with its negative counterpart, -N. If there are pairs of such numbers in the given vector V[], take the positive integer of each pair and push it to a result vector to return.
Example:
If input is V = [1,-1,0,2,-3,3]
return [1,3]
I tried to solve this problem in 3 flavors:
Single Threaded | Runtime: 404000
Multithreaded coarse-grained lock | Runtime: 39882000
Multithreaded fine-grained lock | Runtime: 43921000
My idea with fine-grained locking is to update memory at discrete memory locations based on the input.
I see that my multithreaded coarse-grained lock version performs worse than the single-threaded one (which is kind of expected). But what I don't understand is why my multithreaded fine-grained lock version performs worse than the coarse-grained one most of the time, and poorly compared to the single-threaded version. I expected the multithreaded fine-grained lock version to perform better than the single-threaded version.
What is wrong with my implementation? What am I doing wrong? How can I improve the performance of this code with multithreading?
#include <atomic>
#include <iostream>
#include <unordered_map>
#include <vector>
#include <mutex>
#include <thread>
#include <chrono>
#include <cstdlib>
#include <memory>
using namespace std;
class Solution
{
private:
const static uint32_t THREAD_N = 5;
unordered_map<uint32_t, int32_t> records;
vector<uint32_t> results;
vector<atomic<uint32_t>> atm_results;
mutex mut[THREAD_N];
mutex mutrec;
bool bzero;
public:
Solution(): bzero(true){
records.reserve(100);
}
void InsertVal(const vector<int32_t> &vin)
{
for (auto iter : vin) {
if(iter < 0)
{
if(records[0-iter] > 0) results.emplace_back(0-iter);
records[0-iter]--;
}
else if(iter > 0)
{
if(records[iter] < 0) results.emplace_back(iter);
records[iter]++;
}
else
{
bzero = !bzero;
if (bzero) {
results.emplace_back(0);
}
}
}
}
void InsertValEach(const int32_t &val)
{
lock_guard<mutex> lock(mutrec); // single block of lock
if(val < 0)
{
if(records[0-val] > 0) results.emplace_back(0-val);
records[0-val]--;
}
else if(val > 0)
{
if(records[val] < 0) results.emplace_back(val);
records[val]++;
}
else
{
bzero = !bzero;
if (bzero) {
results.emplace_back(0);
}
}
}
void InsertValEachFree(const int32_t &val)
{
if(val < 0)
{
lock_guard<mutex> lock(mut[(0-val)%THREAD_N]); // finer lock based on input
if(records[0-val] > 0)
{
lock_guard<mutex> l(mutrec); // yet another finer lock to update results
results.emplace_back(0-val);
}
records[0-val]--;
}
else if(val > 0)
{
lock_guard<mutex> lock(mut[(val)%THREAD_N]);
if(records[val] < 0)
{
lock_guard<mutex> l(mutrec);
results.emplace_back(val);
}
records[val]++;
}
else
{
lock_guard<mutex> lock(mut[0]);
bzero = !bzero;
if (bzero) {
lock_guard<mutex> l(mutrec);
results.emplace_back(0);
}
}
}
vector<uint32_t> GetResult()
{
lock_guard<mutex> l(mutrec);
return results;
}
void reset()
{
lock_guard<mutex> l(mutrec);
results = vector<uint32_t>();
}
};
void Display(Solution &s)
{
auto v = s.GetResult();
// for (auto &iter : v) {
// cout<<iter<<" ";
// }
cout<<v.size()<<"\n";
}
size_t SingleThread(Solution &s, const vector<int32_t> &vec)
{
chrono::time_point<chrono::system_clock> start, stop;
start = chrono::system_clock::now();
s.InsertVal(vec);
stop = chrono::system_clock::now();
chrono::duration<double> elapse_time = stop - start;
Display(s);
s.reset();
return chrono::duration_cast<chrono::nanoseconds>(elapse_time).count();
}
size_t CourseGrainLock(Solution &s, const vector<int32_t> &vec)
{
chrono::time_point<chrono::system_clock> start, stop;
vector<thread> vthreads;
auto vsize = vec.size();
start = chrono::system_clock::now();
for (int32_t iter=0; iter<vsize; iter++) {
vthreads.push_back(thread(&Solution::InsertValEach, &s, vec[iter]));
}
stop = chrono::system_clock::now();
for (auto &th : vthreads) {
th.join();
}
chrono::duration<double> elapse_time = stop - start;
Display(s);
s.reset();
return chrono::duration_cast<chrono::nanoseconds>(elapse_time).count();
}
size_t FineGrainLock(Solution &s, const vector<int32_t> &vec)
{
chrono::time_point<chrono::system_clock> start, stop;
vector<thread> vthreads;
auto vsize = vec.size();
start = chrono::system_clock::now();
for (int32_t iter=0; iter<vsize; iter++) {
vthreads.push_back(thread(&Solution::InsertValEachFree, &s, vec[iter]));
}
stop = chrono::system_clock::now();
for (auto &th : vthreads) {
th.join();
}
chrono::duration<double> elapse_time = stop - start;
Display(s);
s.reset();
return chrono::duration_cast<chrono::nanoseconds>(elapse_time).count();
}
int main(int argc, const char * argv[]) {
vector<int32_t> vec;
int count = 1000;
while(count--)
{
vec.emplace_back(rand()%50);
vec.emplace_back(0-(rand()%50));
}
Solution s;
auto nanosec = SingleThread(s, vec);
cout<<"Time of Execution (nano) Single Thread: "<<nanosec<<"\n";
nanosec = CourseGrainLock(s, vec);
cout<<"Time of Execution (nano) Course Grain: "<<nanosec<<"\n";
nanosec = FineGrainLock(s, vec);
cout<<"Time of Execution (nano) Fine Grain: "<<nanosec<<"\n";
return 0;
}
You're creating one thread for each number in vec. There is a considerable cost in creating a thread. You should create a few threads (no more than the number of execution units in your hardware) and have each thread process multiple entries of the vector. main can run one set of results, thus avoiding the creation of one thread.
With the locking in CourseGrainLock (in InsertValEach), since the first thing each thread does is grab a lock that is not released until the function is done, your code is effectively single threaded but with the cost of creating all those threads.
The locking in your FineGrainLock (in InsertValEachFree) is not much better. You have several locks, but you make changes to records in multiple threads under different locks. Adding elements to an unordered map (which you do with records[val] or records[0-val]) is not thread safe, and you risk undefined behavior with that code.
A reasonable approach here is to have each thread keep track of its own results independently, thus avoiding the need for locks at all, and combine them into the main results once all the threads are done.
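A sketch of that idea follows. It assumes a simplified pairing rule that the question does not spell out: each positive N is reported once if both N and -N occur anywhere in the input, and zeros are ignored. The function name paired_values is invented, and values equal to INT32_MIN are not handled because they cannot be negated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <thread>
#include <vector>

// Returns each positive N, in ascending order, for which both N and -N
// occur in `v` (a simplified pairing rule, assumed for illustration).
std::vector<int32_t> paired_values(const std::vector<int32_t>& v,
                                   unsigned workers) {
    assert(workers > 0);
    struct Seen { std::set<int32_t> pos, neg; };  // private to one thread
    std::vector<Seen> local(workers);
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        std::size_t start = v.size() * w / workers;
        std::size_t finish = v.size() * (w + 1) / workers;
        threads.emplace_back([&v, &local, w, start, finish] {
            Seen& mine = local[w];                // no other thread touches it
            for (std::size_t i = start; i < finish; ++i) {
                if (v[i] > 0) mine.pos.insert(v[i]);
                else if (v[i] < 0) mine.neg.insert(-v[i]);
            }
        });
    }
    for (auto& t : threads) t.join();

    // Merge on the calling thread once every worker has finished.
    std::set<int32_t> pos, neg;
    for (const Seen& s : local) {
        pos.insert(s.pos.begin(), s.pos.end());
        neg.insert(s.neg.begin(), s.neg.end());
    }
    std::vector<int32_t> result;
    std::set_intersection(pos.begin(), pos.end(), neg.begin(), neg.end(),
                          std::back_inserter(result));
    return result;
}
```

The only synchronization is the join: the input is read-only while the workers run, and each worker writes to its own Seen object. For an input as small as the 2000 elements in the question, the thread start-up cost is still likely to outweigh the work.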
You probably can't improve it with multithreading. All of the threads have to access the same shared input vector and result vector. The tremendous slowdown that you see vs. the single-threaded solution is the overhead of serializing the access to the shared structure.
Multithreading is not a panacea. If you need to do something like this to an array, "just do it." Single-thread.
The major issue I see with your code is that the majority of the work is done inside the mutex. That completely blocks the other threads, so there is no benefit. Even the fine-grained version does only a very small calculation outside the mutex compared to the cost of updating the map and the output vector.
I'm not even totally convinced your fine-grained locking is completely thread safe. If the array index might create nodes in the map for a value that hasn't been seen before, then that invalidates any other thread's simultaneous searches. You could use a separate map for each locked value range, I think.
But to be honest I think you are just doing too little work in each thread. Try creating a smaller number of threads and have each one do a range of the input values - calling the existing code for each entry in that range.

C++ Reusing a vector of threads that call the same function

I would like to reuse a vector of threads that call the same function several times with different parameters. There is no writing (with the exception of an atomic parameter), so no need for a mutex. To depict the idea, I created a basic example of a parallelized code that finds the maximum value of a vector. There are clearly better ways to find the max of a vector, but for the sake of the explanation and to avoid getting into further details of the real code I am writing, I am going with this silly example.
The code finds the maximum number of a vector by calling a function pFind that checks whether the vector contains the number k (k is initialized with an upper bound). If it does, the execution stops, otherwise k is reduced by one and the process repeats.
The code below generates a vector of threads that parallelize the search for k in the vector. The issue is that, for every value of k, the vector of threads is regenerated and each time the new threads are joined.
Generating the vector of threads and joining them every time comes with an overhead that I want to avoid.
I am wondering if there is a way of generating a vector (a pool) of threads only once and reuse them for the new executions. Any other speedup tip will be appreciated.
void pFind(
vector<int>& a,
int n,
std::atomic<bool>& flag,
int k,
int numTh,
int val
) {
int i = k;
while (i < n) {
if (a[i] == val) {
flag = true;
break;
} else
i += numTh;
}
}
int main() {
std::atomic<bool> flag;
flag = false;
int numTh = 8;
int val = 1000;
int pos = 0;
while (!flag) {
vector<thread>threads;
for (int i = 0; i < numTh; i++){
thread th(&pFind, std::ref(a), size, std::ref(flag), i, numTh, val);
threads.push_back(std::move(th));
}
for (thread& th : threads)
th.join();
if (flag)
break;
val--;
}
cout << val << "\n";
return 0;
}
There is no way to assign a different execution function (closure) to a std::thread after construction. This is generally true of all thread abstractions, though often implementations try to memoize or cache lower-level abstractions internally to make thread fork and join fast so just constructing new threads is viable. There is a debate in systems programming circles about whether creating a new thread should be incredibly lightweight or whether clients should be written to not fork threads as frequently. (Given this has been ongoing for a very long time, it should be clear there are a lot of tradeoffs involved.)
There are a lot of other abstractions which try to do what you really want. They have names such as "threadpools," "task executors" (or just "executors"), and "futures." All of them tend to map onto threads by creating some set of threads, often related to the number of hardware cores in the system, and then having each of those threads loop and look for requests.
As the comments indicated, the main way you would do this yourself is to have threads with a top-level loop that accepts execution requests, processes them, and then posts the results. To do this you will need to use other synchronization methods such as mutexes and condition variables. It is generally faster to do things this way if there are a lot of requests and requests are not incredibly large.
As much as standard C++ concurrency support is a good thing, it is also rather significantly lacking for real world high performance work. Something like Intel's TBB is far more of an industrial strength solution.
By piecing together some code from different online searches, I got the following to work, but it is not as fast as the approach that regenerates the threads at each iteration of the while loop.
Perhaps someone can comment on this approach.
The following class describes the thread pool
class ThreadPool {
public:
ThreadPool(int threads) : shutdown_(false){
threads_.reserve(threads);
for (int i = 0; i < threads; ++i)
threads_.emplace_back(std::bind(&ThreadPool::threadEntry, this, i));
}
~ThreadPool(){
{
// Unblock any threads and tell them to stop
std::unique_lock<std::mutex>l(lock_);
shutdown_ = true;
condVar_.notify_all();
}
// Wait for all threads to stop
std::cerr << "Joining threads" << std::endl;
for (auto & thread : threads_) thread.join();
}
void doJob(std::function<void(void)>func){
// Place a job on the queue and unblock a thread
std::unique_lock<std::mutex>l(lock_);
jobs_.emplace(std::move(func));
condVar_.notify_one();
}
void threadEntry(int i){
std::function<void(void)>job;
while (1){
{
std::unique_lock<std::mutex>l(lock_);
while (!shutdown_ && jobs_.empty()) condVar_.wait(l);
if (jobs_.empty()){
// No jobs to do and we are shutting down
std::cerr << "Thread " << i << " terminates" << std::endl;
return;
}
std::cerr << "Thread " << i << " does a job" << std::endl;
job = std::move(jobs_.front());
jobs_.pop();
}
// Do the job without holding any locks
job();
}
}
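private:
std::mutex lock_; // guards shutdown_ and jobs_
std::condition_variable condVar_; // signals new jobs and shutdown
bool shutdown_;
std::queue<std::function<void(void)>> jobs_;
std::vector<std::thread> threads_;
};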
Here is the rest of the code
void pFind(
vector<int>& a,
int n,
std::atomic<bool>& flag,
int k,
int numTh,
int val,
std::atomic<int>& completed) {
int i = k;
while (i < n) {
if (a[i] == val) {
flag = true;
break;
} else
i += numTh;
}
completed++;
}
int main() {
std::atomic<bool> flag;
flag = false;
int numTh = 8;
int val = 1000;
int pos = 0;
std::atomic<int> completed;
completed=0;
ThreadPool p(numTh);
while (!flag) {
for (int i = 0; i < numTh; i++) {
p.doJob(std::bind(pFind, std::ref(a), size, std::ref(flag), i, numTh, val, std::ref(completed)));
}
while (completed < numTh) {}
if (flag) {
break;
} else {
completed = 0;
val--;
}
}
cout << val << "\n";
return 0;
}
Your code has a race condition: bool is not an atomic type and is therefore not safe for multiple threads to write to concurrently. You need to use std::atomic_bool or std::atomic_flag.
To answer your question, you're recreating the threads vector each iteration of the loop, which you can avoid by moving its declaration outside the loop body. Reusing the threads themselves is a much more complex topic that's hard to get right or describe concisely.
vector<thread> threads;
threads.reserve(numTh);
while (!flag) {
for (size_t i = 0; i < numTh; ++i)
threads.emplace_back(pFind, a, size, flag, i, numTh, val);
for (auto &th : threads)
th.join();
threads.clear();
}

Thread pooling in C++11

Relevant questions:
About C++11:
C++11: std::thread pooled?
Will async(launch::async) in C++11 make thread pools obsolete for avoiding expensive thread creation?
About Boost:
C++ boost thread reusing threads
boost::thread and creating a pool of them!
How do I get a pool of threads to send tasks to, without creating and deleting them over and over again? This means persistent threads to resynchronize without joining.
I have code that looks like this:
namespace {
std::vector<std::thread> workers;
int total = 4;
int arr[5] = {0}; // arr[0..3] are the per-thread slots, arr[4] stores their minimum
void each_thread_does(int i) {
arr[i] += 2;
}
}
int main(int argc, char *argv[]) {
for (int i = 0; i < 8; ++i) { // for 8 iterations,
for (int j = 0; j < 4; ++j) {
workers.push_back(std::thread(each_thread_does, j));
}
for (std::thread &t: workers) {
if (t.joinable()) {
t.join();
}
}
arr[4] = *std::min_element(arr, arr + 4);
}
return 0;
}
Instead of creating and joining threads each iteration, I'd prefer to send tasks to my worker threads each iteration and only create them once.
This is adapted from my answer to another very similar post.
Let's build a ThreadPool class:
class ThreadPool {
public:
void Start();
void QueueJob(const std::function<void()>& job);
void Stop();
bool busy();
private:
void ThreadLoop();
bool should_terminate = false; // Tells threads to stop looking for jobs
std::mutex queue_mutex; // Prevents data races to the job queue
std::condition_variable mutex_condition; // Allows threads to wait on new jobs or termination
std::vector<std::thread> threads;
std::queue<std::function<void()>> jobs;
};
ThreadPool::Start
For an efficient threadpool implementation, once threads are created according to num_threads, it's better not to
create new ones or destroy old ones (by joining). There will be a performance penalty, and it might even make your
application go slower than the serial version. Thus, we keep a pool of threads that can be used at any time (if they
aren't already running a job).
Each thread should be running its own infinite loop, constantly waiting for new tasks to grab and run.
void ThreadPool::Start() {
const uint32_t num_threads = std::thread::hardware_concurrency(); // Max # of threads the system supports
threads.resize(num_threads);
for (uint32_t i = 0; i < num_threads; i++) {
threads.at(i) = std::thread(&ThreadPool::ThreadLoop, this);
}
}
ThreadPool::ThreadLoop
The infinite loop function. This is a while (true) loop waiting for the task queue to open up.
void ThreadPool::ThreadLoop() {
while (true) {
std::function<void()> job;
{
std::unique_lock<std::mutex> lock(queue_mutex);
mutex_condition.wait(lock, [this] {
return !jobs.empty() || should_terminate;
});
if (should_terminate) {
return;
}
job = jobs.front();
jobs.pop();
}
job();
}
}
ThreadPool::QueueJob
Add a new job to the pool; use a lock so that there isn't a data race.
void ThreadPool::QueueJob(const std::function<void()>& job) {
{
std::unique_lock<std::mutex> lock(queue_mutex);
jobs.push(job);
}
mutex_condition.notify_one();
}
To use it:
thread_pool->QueueJob([] { /* ... */ });
ThreadPool::busy
bool ThreadPool::busy() {
bool poolbusy;
{
std::unique_lock<std::mutex> lock(queue_mutex);
poolbusy = !jobs.empty();
}
return poolbusy;
}
The busy() function can be used in a while loop, so that the main thread can wait for the thread pool to complete all the tasks before calling the thread pool destructor.
ThreadPool::Stop
Stop the pool.
void ThreadPool::Stop() {
{
std::unique_lock<std::mutex> lock(queue_mutex);
should_terminate = true;
}
mutex_condition.notify_all();
for (std::thread& active_thread : threads) {
active_thread.join();
}
threads.clear();
}
Once you integrate these ingredients, you have your own dynamic thread pool. These threads always run, waiting for jobs to do.
I apologize if there are some syntax errors; I typed this code and I have a bad memory. Sorry that I cannot provide you the complete thread pool code; that would violate my job integrity.
Notes:
The anonymous code blocks are used so that when they are exited, the std::unique_lock variables created within them
go out of scope, unlocking the mutex.
ThreadPool::Stop will not terminate any currently running jobs, it just waits for them to finish via active_thread.join().
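For reference, here is a minimal sketch of how the pieces above could fit together in one compilable file. It is not the answer author's complete pool. Two things are assumptions added for illustration: Start takes the thread count as a parameter, and WaitIdle is a hypothetical helper that counts unfinished jobs. The busy() check above only reports whether the queue still holds jobs, so a job that is currently running is not covered by it.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    // Assumption: the caller picks the thread count explicitly.
    void Start(unsigned num_threads) {
        assert(num_threads > 0);
        for (unsigned i = 0; i < num_threads; ++i) {
            threads.emplace_back(&ThreadPool::ThreadLoop, this);
        }
    }

    void QueueJob(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            jobs.push(std::move(job));
            ++unfinished;  // counts queued plus running jobs
        }
        mutex_condition.notify_one();
    }

    // Hypothetical helper: blocks until every queued job has finished running.
    void WaitIdle() {
        std::unique_lock<std::mutex> lock(queue_mutex);
        idle_condition.wait(lock, [this] { return unfinished == 0; });
    }

    void Stop() {
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            should_terminate = true;
        }
        mutex_condition.notify_all();
        for (std::thread& t : threads) {
            t.join();
        }
        threads.clear();
    }

private:
    void ThreadLoop() {
        while (true) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                mutex_condition.wait(lock, [this] {
                    return !jobs.empty() || should_terminate;
                });
                if (should_terminate) {
                    return;
                }
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();  // run the job without holding the lock
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (--unfinished == 0) {
                idle_condition.notify_all();
            }
        }
    }

    bool should_terminate = false;
    std::mutex queue_mutex;
    std::condition_variable mutex_condition;  // new job or termination
    std::condition_variable idle_condition;   // no unfinished jobs left
    std::size_t unfinished = 0;
    std::vector<std::thread> threads;
    std::queue<std::function<void()>> jobs;
};

// Mirrors the question's loop: 8 rounds of 4 jobs, threads created only once.
int run_rounds() {
    ThreadPool pool;
    pool.Start(4);
    int arr[4] = {0};
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 4; ++j) {
            pool.QueueJob([&arr, j] { arr[j] += 2; });
        }
        pool.WaitIdle();  // every job of this round has finished
    }
    pool.Stop();
    return arr[0] + arr[1] + arr[2] + arr[3];
}
```

Because the counter is updated under the same mutex that WaitIdle locks, the main thread also sees the workers' writes to arr once WaitIdle returns.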
You can use C++ Thread Pool Library, https://github.com/vit-vit/ctpl.
Then the code you wrote can be replaced with the following
#include <ctpl.h> // or <ctpl_stl.h> if you do not have Boost library
int main (int argc, char *argv[]) {
ctpl::thread_pool p(2 /* two threads in the pool */);
int arr[5] = {0}; // arr[4] stores the minimum of arr[0..3]
std::vector<std::future<void>> results(4);
for (int i = 0; i < 8; ++i) { // for 8 iterations,
for (int j = 0; j < 4; ++j) {
results[j] = p.push([&arr, j](int){ arr[j] +=2; });
}
for (int j = 0; j < 4; ++j) {
results[j].get();
}
arr[4] = *std::min_element(arr, arr + 4);
}
}
You will get the desired number of threads and will not create and delete them over and over again on the iterations.
A pool of threads means that all your threads are running, all the time – in other words, the thread function never returns. To give the threads something meaningful to do, you have to design a system of inter-thread communication, both for the purpose of telling the thread that there's something to do, as well as for communicating the actual work data.
Typically this will involve some kind of concurrent data structure, and each thread would presumably sleep on some kind of condition variable, which would be notified when there's work to do. Upon receiving the notification, one or several of the threads wake up, recover a task from the concurrent data structure, process it, and store the result in an analogous fashion.
The thread would then go on to check whether there's even more work to do, and if not go back to sleep.
The upshot is that you have to design all this yourself, since there isn't a natural notion of "work" that's universally applicable. It's quite a bit of work, and there are some subtle issues you have to get right. (You can program in Go if you like a system which takes care of thread management for you behind the scenes.)
A threadpool is at core a set of threads all bound to a function working as an event loop. These threads will endlessly wait for a task to be executed, or their own termination.
The threadpool job is to provide an interface to submit jobs, define (and perhaps modify) the policy of running these jobs (scheduling rules, thread instantiation, size of the pool), and monitor the status of the threads and related resources.
So for a versatile pool, one must start by defining what a task is, how it is launched, interrupted, what is the result (see the notion of promise and future for that question), what sort of events the threads will have to respond to, how they will handle them, how these events shall be discriminated from the ones handled by the tasks. This can become quite complicated as you can see, and impose restrictions on how the threads will work, as the solution becomes more and more involved.
The current tooling for handling events is fairly barebones(*): primitives like mutexes, condition variables, and a few abstractions on top of that (locks, barriers). But in some cases, these abstractions may turn out to be unfit (see this related question), and one must revert to using the primitives.
Other problems have to be managed too:
signal
i/o
hardware (processor affinity, heterogeneous setup)
How would these play out in your setting?
This answer to a similar question points to an existing implementation meant for boost and the stl.
I offered a very crude implementation of a threadpool for another question, which doesn't address many of the problems outlined above. You might want to build on it. You might also want to have a look at existing frameworks in other languages, to find inspiration.
(*) I don't see that as a problem, quite to the contrary. I think it's the very spirit of C++ inherited from C.
Following [PhD EcE](https://stackoverflow.com/users/3818417/phd-ece)'s suggestion, I implemented the thread pool:
function_pool.h
#pragma once
#include <queue>
#include <functional>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <cassert>
class Function_pool
{
private:
std::queue<std::function<void()>> m_function_queue;
std::mutex m_lock;
std::condition_variable m_data_condition;
std::atomic<bool> m_accept_functions;
public:
Function_pool();
~Function_pool();
void push(std::function<void()> func);
void done();
void infinite_loop_func();
};
function_pool.cpp
#include "function_pool.h"
Function_pool::Function_pool() : m_function_queue(), m_lock(), m_data_condition(), m_accept_functions(true)
{
}
Function_pool::~Function_pool()
{
}
void Function_pool::push(std::function<void()> func)
{
std::unique_lock<std::mutex> lock(m_lock);
m_function_queue.push(func);
// when we send the notification immediately, the consumer will try to get the lock , so unlock asap
lock.unlock();
m_data_condition.notify_one();
}
void Function_pool::done()
{
std::unique_lock<std::mutex> lock(m_lock);
m_accept_functions = false;
lock.unlock();
// when we send the notification immediately, the consumer will try to get the lock , so unlock asap
m_data_condition.notify_all();
//notify all waiting threads.
}
void Function_pool::infinite_loop_func()
{
std::function<void()> func;
while (true)
{
{
std::unique_lock<std::mutex> lock(m_lock);
m_data_condition.wait(lock, [this]() {return !m_function_queue.empty() || !m_accept_functions; });
if (!m_accept_functions && m_function_queue.empty())
{
//lock will be release automatically.
//finish the thread loop and let it join in the main thread.
return;
}
func = m_function_queue.front();
m_function_queue.pop();
//release the lock
}
func();
}
}
main.cpp
#include "function_pool.h"
#include <string>
#include <iostream>
#include <mutex>
#include <functional>
#include <thread>
#include <vector>
Function_pool func_pool;
class quit_worker_exception : public std::exception {};
void example_function()
{
std::cout << "bla" << std::endl;
}
int main()
{
std::cout << "starting operation" << std::endl;
int num_threads = std::thread::hardware_concurrency();
std::cout << "number of threads = " << num_threads << std::endl;
std::vector<std::thread> thread_pool;
for (int i = 0; i < num_threads; i++)
{
thread_pool.push_back(std::thread(&Function_pool::infinite_loop_func, &func_pool));
}
//here we should send our functions
for (int i = 0; i < 50; i++)
{
func_pool.push(example_function);
}
func_pool.done();
for (unsigned int i = 0; i < thread_pool.size(); i++)
{
thread_pool.at(i).join();
}
}
You can use thread_pool from boost library:
void my_task(){...}
int main(){
int threadNumbers = thread::hardware_concurrency();
boost::asio::thread_pool pool(threadNumbers);
// Submit a function to the pool.
boost::asio::post(pool, my_task);
// Submit a lambda object to the pool.
boost::asio::post(pool, []() {
...
});
// Wait for all submitted tasks to finish before the pool is destroyed.
pool.join();
}
You also can use threadpool from open source community:
void first_task() {...}
void second_task() {...}
int main(){
int threadNumbers = thread::hardware_concurrency();
pool tp(threadNumbers);
// Add some tasks to the pool.
tp.schedule(&first_task);
tp.schedule(&second_task);
}
Something like this might help (taken from a working app).
#include <memory>
#include <boost/asio.hpp>
#include <boost/thread.hpp>
struct thread_pool {
typedef std::unique_ptr<boost::asio::io_service::work> asio_worker;
thread_pool(int threads) :service(), service_worker(new asio_worker::element_type(service)) {
for (int i = 0; i < threads; ++i) {
auto worker = [this] { return service.run(); };
grp.add_thread(new boost::thread(worker));
}
}
template<class F>
void enqueue(F f) {
service.post(f);
}
~thread_pool() {
service_worker.reset();
grp.join_all();
service.stop();
}
private:
boost::asio::io_service service;
asio_worker service_worker;
boost::thread_group grp;
};
You can use it like this:
thread_pool pool(2);
pool.enqueue([] {
std::cout << "Hello from Task 1\n";
});
pool.enqueue([] {
std::cout << "Hello from Task 2\n";
});
Keep in mind that reinventing an efficient asynchronous queuing mechanism is not trivial.
Boost::asio::io_service is a very efficient implementation, or actually is a collection of platform-specific wrappers (e.g. it wraps I/O completion ports on Windows).
Edit: This now requires C++17 and concepts. (As of 9/12/16, only g++ 6.0+ is sufficient.)
The template deduction is a lot more accurate because of it, though, so it's worth the effort of getting a newer compiler. I've not yet found a function that requires explicit template arguments.
It also now takes any appropriate callable object (and is still statically typesafe!!!).
It also now includes an optional green threading priority thread pool using the same API. This class is POSIX only, though. It uses the ucontext_t API for userspace task switching.
I created a simple library for this. An example of usage is given below. (I'm answering this because it was one of the things I found before I decided it was necessary to write it myself.)
bool is_prime(int n){
// Determine if n is prime.
}
int main(){
thread_pool pool(8); // 8 threads
list<future<bool>> results;
for(int n = 2;n < 10000;n++){
// Submit a job to the pool.
results.emplace_back(pool.async(is_prime, n));
}
int n = 2;
for(auto i = results.begin();i != results.end();i++, n++){
// i is an iterator pointing to a future representing the result of is_prime(n)
cout << n << " ";
bool prime = i->get(); // Wait for the task is_prime(n) to finish and get the result.
if(prime)
cout << "is prime";
else
cout << "is not prime";
cout << endl;
}
}
You can pass async any function with any (or void) return value and any (or no) arguments and it will return a corresponding std::future. To get the result (or just wait until a task has completed) you call get() on the future.
Here's the github: https://github.com/Tyler-Hardin/thread_pool.
Looks like a thread pool is a very popular problem/exercise :-)
I recently wrote one in modern C++; it's owned by me and publicly available here - https://github.com/yurir-dev/threadpool
It supports templated return values, core pinning, and ordering of some tasks.
The whole implementation is in two .h files.
So, the original question will be something like this:
#include "tp/threadpool.h"
int arr[5] = { 0 };
concurency::threadPool<void> tp;
tp.start(std::thread::hardware_concurrency());
std::vector<std::future<void>> futures;
for (int i = 0; i < 8; ++i) { // for 8 iterations,
for (int j = 0; j < 4; ++j) {
futures.push_back(tp.push([&arr, j]() {
arr[j] += 2;
}));
}
}
// wait until all pushed tasks are finished.
for (auto& f : futures)
f.get();
// or just tp.end(); // will kill all the threads
arr[4] = *std::min_element(arr, arr + 4);
I found that a pending task's future.get() call hangs on the caller side if the thread pool is terminated while some tasks are still in the task queue. How can I set an exception on the future inside the thread pool when it only holds the std::function wrapper?
template <class F, class... Args>
std::future<std::result_of_t<F(Args...)>> enqueue(F &&f, Args &&...args) {
using return_type = std::result_of_t<F(Args...)>;
auto task = std::make_shared<std::packaged_task<return_type()>>(
std::bind(std::forward<F>(f), std::forward<Args>(args)...));
std::future<return_type> res = task->get_future();
{
std::unique_lock<std::mutex> lock(_mutex);
_tasks.push([task]() -> void { (*task)(); });
}
return res;
}
class StdThreadPool {
std::vector<std::thread> _workers;
std::priority_queue<TASK> _tasks;
...
}
struct TASK {
//int _func_return_value;
std::function<void()> _func;
int priority;
...
}
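One standard-library behaviour is relevant here: when a std::packaged_task is destroyed without ever having been invoked, its shared state is abandoned and the associated future's get() throws std::future_error with std::future_errc::broken_promise instead of blocking. The sketch below only illustrates that idea; it assumes the pool can drop its pending wrappers on shutdown, and it uses a plain std::queue as a stand-in for the pool's _tasks container. The function name is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <queue>

// Returns true if get() on a task that was queued but never run
// reports broken_promise instead of hanging.
bool abandoned_task_breaks_promise() {
    std::queue<std::function<void()>> tasks;  // stand-in for the pool's queue

    auto task = std::make_shared<std::packaged_task<int()>>([] { return 42; });
    std::future<int> res = task->get_future();
    tasks.push([task] { (*task)(); });
    task.reset();  // the queued wrapper now holds the only reference

    // Shutdown path: drop pending jobs instead of leaving them in the queue.
    // Destroying the wrapper destroys the packaged_task it owns.
    std::queue<std::function<void()>>().swap(tasks);

    assert(res.valid());
    try {
        res.get();
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::broken_promise;
    }
    return false;
}
```

With a priority queue of TASK objects the same effect should follow from emptying the container when the pool stops, as long as no other copy of the shared_ptr keeps the packaged_task alive.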
The Stroika library has a threadpool implementation.
Stroika ThreadPool.h
ThreadPool p;
p.AddTask ([] () {doIt ();});
Stroika's thread library also supports cancelation (cooperative) - so that when the ThreadPool above goes out of scope - it cancels any running tasks (similar to c++20's jthread).

Extend the life of threads with synchronization (C++11)

I have a program with a function which takes a pointer as an argument, and a main. The main creates n threads, each of them running the function on a different memory area depending on the passed argument. The threads are then joined, the main performs some data mixing between the areas and creates n new threads which do the same operation as the old ones.
To improve the program I would like to keep the threads alive, removing the long time necessary to create them. Threads should sleep while the main is working and be notified when they have to start again. In the same way, the main should wait while the threads are working, as it did with join.
I cannot come up with a robust implementation of this; I always end up in a deadlock.
Here is simple baseline code; any hints about how to modify it would be much appreciated.
#include <thread>
#include <climits>
...
void myfunc(void * p) {
do_something(p);
}
int main(){
void * myp[n_threads] {a_location, another_location,...};
std::thread mythread[n_threads];
for (unsigned long int j=0; j < ULONG_MAX; j++) {
for (unsigned int i=0; i < n_threads; i++) {
mythread[i] = std::thread(myfunc, myp[i]);
}
for (unsigned int i=0; i < n_threads; i++) {
mythread[i].join();
}
mix_data(myp);
}
return 0;
}
Here is a possible approach using only classes from the C++11 Standard Library. Basically, each thread you create has an associated command queue (with commands encapsulated in std::packaged_task<> objects) which it continuously checks. If the queue is empty, the thread will just wait on a condition variable (std::condition_variable).
While data races are avoided through the use of std::mutex and std::unique_lock<> RAII wrappers, the main thread can wait for a particular job to finish by storing the std::future<> object associated with each submitted std::packaged_task<> and calling wait() on it.
Below is a simple program that follows this design. Comments should be sufficient to explain what it does:
#include <thread>
#include <iostream>
#include <sstream>
#include <future>
#include <queue>
#include <condition_variable>
#include <mutex>
#include <vector>
// Convenience type definition
using job = std::packaged_task<void()>;
// Some data associated to each thread.
struct thread_data
{
int id; // Could use thread::id, but this is filled before the thread is started
std::thread t; // The thread object
std::queue<job> jobs; // The job queue
std::condition_variable cv; // The condition variable to wait for threads
std::mutex m; // Mutex used for avoiding data races
bool stop = false; // When set, this flag tells the thread that it should exit
};
// The thread function executed by each thread
void thread_func(thread_data* pData)
{
std::unique_lock<std::mutex> l(pData->m, std::defer_lock);
while (true)
{
l.lock();
// Wait until the queue won't be empty or stop is signaled
pData->cv.wait(l, [pData] () {
return (pData->stop || !pData->jobs.empty());
});
// Stop was signaled, let's exit the thread
if (pData->stop) { return; }
// Pop one task from the queue...
job j = std::move(pData->jobs.front());
pData->jobs.pop();
l.unlock();
// Execute the task!
j();
}
}
// Function that creates a simple task
job create_task(int id, int jobNumber)
{
job j([id, jobNumber] ()
{
std::stringstream s;
s << "Hello " << id << "." << jobNumber << std::endl;
std::cout << s.str();
});
return j;
}
int main()
{
const int numThreads = 4;
const int numJobsPerThread = 10;
std::vector<std::future<void>> futures;
// Create all the threads (will be waiting for jobs)
thread_data threads[numThreads];
int tdi = 0;
for (auto& td : threads)
{
td.id = tdi++;
td.t = std::thread(thread_func, &td);
}
//=================================================
// Start assigning jobs to each thread...
for (auto& td : threads)
{
for (int i = 0; i < numJobsPerThread; i++)
{
job j = create_task(td.id, i);
futures.push_back(j.get_future());
std::unique_lock<std::mutex> l(td.m);
td.jobs.push(std::move(j));
}
// Notify the thread that there is work to do...
td.cv.notify_one();
}
// Wait for all the tasks to be completed...
for (auto& f : futures) { f.wait(); }
futures.clear();
//=================================================
// Here the main thread does something...
std::cin.get();
// ...done!
//=================================================
//=================================================
// Posts some new tasks...
for (auto& td : threads)
{
for (int i = 0; i < numJobsPerThread; i++)
{
job j = create_task(td.id, i);
futures.push_back(j.get_future());
std::unique_lock<std::mutex> l(td.m);
td.jobs.push(std::move(j));
}
// Notify the thread that there is work to do...
td.cv.notify_one();
}
// Wait for all the tasks to be completed...
for (auto& f : futures) { f.wait(); }
futures.clear();
// Send stop signal to all threads and join them...
for (auto& td : threads)
{
std::unique_lock<std::mutex> l(td.m);
td.stop = true;
td.cv.notify_one();
}
// Join all the threads
for (auto& td : threads) { td.t.join(); }
}
The concept you want is the threadpool. This SO question deals with existing implementations.
The idea is to have a container for a number of thread instances. Each instance is associated with a function which polls a task queue, and when a task is available, pulls it and runs it. Once the task is over (if it terminates, but that's another problem), the thread simply loops back to the task queue.
So you need a synchronized queue, a thread class which implements the loop on the queue, an interface for the task objects, and maybe a class to drive the whole thing (the pool class).
Alternatively, you could make a very specialized thread class for the task it has to perform (with only the memory area as a parameter for instance). This requires a notification mechanism for the threads to indicate that they are done with the current iteration.
The thread main function would be a loop on that specific task, and at the end of one iteration, the thread signals its end and waits on a condition variable to start the next loop. In essence, you would be inlining the task code within the thread, dropping the need for a queue altogether.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
using namespace std;
// semaphore class based on C++11 features
class semaphore {
private:
mutex mMutex;
condition_variable mCond;
int mV;
public:
semaphore(int v): mV(v){}
void signal(int count=1){
unique_lock<mutex> lock(mMutex);
mV+=count;
// waiters block only while mV < 0, so wake them once it is non-negative
if (mV >= 0) mCond.notify_all();
}
void wait(int count = 1){
unique_lock<mutex> lock(mMutex);
mV-= count;
while (mV < 0)
mCond.wait(lock);
}
};
template <typename Task>
class TaskThread {
Task *mTask;
semaphore *mSemStart, *mSemFinished;
atomic<bool> mRunning;
thread mThread; // declared last so it starts after the other members are initialized
public:
TaskThread(Task *task, semaphore *start, semaphore *finish):
mTask(task),
mSemStart(start), mSemFinished(finish),
mRunning(true),
mThread(&TaskThread<Task>::psrun, this){}
~TaskThread(){ mThread.join(); }
void run(){
do {
(*mTask)();
mSemFinished->signal();
mSemStart->wait();
} while (mRunning);
}
void finish() { // end the thread after the current loop
mRunning = false;
mSemStart->signal(); // wake the thread if it is blocked waiting for a start
}
private:
static void psrun(TaskThread<Task> *self){ self->run();}
};
class MyTask {
public:
MyTask(){}
void operator()(){
// some code here
}
};
int main(){
MyTask task1;
MyTask task2;
semaphore start(2), finished(0);
TaskThread<MyTask> t1(&task1, &start, &finished);
TaskThread<MyTask> t2(&task2, &start, &finished);
for (int i = 0; i < 10; i++){
finished.wait(2);
start.signal(2);
}
t1.finish();
t2.finish();
}
The proposed (crude) implementation above relies on the Task type, which must provide operator() (i.e. a functor-like class). I said earlier that you could incorporate the task code directly in the thread function body, but since I don't know it, I kept it as abstract as I could. There's one condition variable for the start of threads, and one for their end, both encapsulated in semaphore instances.
Seeing the other answer proposing the use of boost::barrier, I can only support this idea: make sure to replace my semaphore class with that class if possible, the reason being that it is better to rely on well tested and maintained external code rather than a self implemented solution for the same feature set.
All in all, both approaches are valid, but the former gives up a tiny bit of performance in favor of flexibility. If the task to be performed takes a sufficiently long time, the management and queue synchronization cost becomes negligible.
Update: code fixed and tested. Replaced a simple condition variable by a semaphore.
It can easily be achieved using a barrier (just a convenience wrapper over a condition variable and a counter). It basically blocks until all N threads have reached the "barrier". It then "recycles" again. Boost provides an implementation.
void myfunc(void * p, boost::barrier& start_barrier, boost::barrier& end_barrier) {
while (!stop_condition) // You'll need to tell them to stop somehow
{
start_barrier.wait ();
do_something(p);
end_barrier.wait ();
}
}
int main(){
void * myp[n_threads] {a_location, another_location,...};
boost::barrier start_barrier (n_threads + 1); // child threads + main thread
boost::barrier end_barrier (n_threads + 1); // child threads + main thread
std::thread mythread[n_threads];
for (unsigned int i=0; i < n_threads; i++) {
mythread[i] = std::thread(myfunc, myp[i], std::ref(start_barrier), std::ref(end_barrier)); // barriers are not copyable, pass by reference
}
start_barrier.wait (); // first unblock the threads
for (unsigned long int j=0; j < ULONG_MAX; j++) {
end_barrier.wait (); // mix_data must not execute before the threads are done
mix_data(myp);
start_barrier.wait (); // threads must not start new iteration before mix_data is done
}
return 0;
}
The following is simple code that compiles and works, performing some arbitrary computation. It implements aleguna's barrier concept. The task length differs for each thread, so a strong synchronization mechanism is really necessary. I will try to do a pool on the same tasks and benchmark the result, and then maybe with futures as pointed out by Andy Prowl.
#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <complex>
#include <random>
const unsigned int n_threads=4; //varying this will not (almost) change the total amount of work
const unsigned int task_length=30000/n_threads;
const float task_length_variation=task_length/n_threads;
unsigned int rep=1000; //repetitions of tasks
class t_chronometer{
private:
std::chrono::steady_clock::time_point _t;
public:
t_chronometer(): _t(std::chrono::steady_clock::now()) {;}
void reset() {_t = std::chrono::steady_clock::now();}
double get_now() {return std::chrono::duration_cast<std::chrono::duration<double>>(std::chrono::steady_clock::now() - _t).count();}
double get_now_ms() {return
std::chrono::duration_cast<std::chrono::duration<double,std::milli>>(std::chrono::steady_clock::now() - _t).count();}
};
class t_barrier {
private:
std::mutex m_mutex;
std::condition_variable m_cond;
unsigned int m_threshold;
unsigned int m_count;
unsigned int m_generation;
public:
t_barrier(unsigned int count):
m_threshold(count),
m_count(count),
m_generation(0) {
}
bool wait() {
std::unique_lock<std::mutex> lock(m_mutex);
unsigned int gen = m_generation;
if (--m_count == 0)
{
m_generation++;
m_count = m_threshold;
m_cond.notify_all();
return true;
}
while (gen == m_generation)
m_cond.wait(lock);
return false;
}
};
using namespace std;
void do_something(complex<double> * c, unsigned int max) {
complex<double> a(1.,0.);
complex<double> b(1.,0.);
for (unsigned int i = 0; i<max; i++) {
a *= polar(1.,2.*M_PI*i/max);
b *= polar(1.,4.*M_PI*i/max);
*(c)+=a+b;
}
}
bool done=false;
void task(complex<double> * c, unsigned int max, t_barrier* start_barrier, t_barrier* end_barrier) {
while (!done) {
start_barrier->wait ();
do_something(c,max);
end_barrier->wait ();
}
cout << "task finished" << endl;
}
int main() {
t_chronometer t;
std::default_random_engine gen;
std::normal_distribution<double> dis(.0,1000.0);
complex<double> cpx[n_threads];
for (unsigned int i=0; i < n_threads; i++) {
cpx[i] = complex<double>(dis(gen), dis(gen));
}
t_barrier start_barrier (n_threads + 1); // child threads + main thread
t_barrier end_barrier (n_threads + 1); // child threads + main thread
std::thread mythread[n_threads];
unsigned long int sum=0;
for (unsigned int i=0; i < n_threads; i++) {
unsigned int max = task_length + i * task_length_variation;
cout << i+1 << "th task length: " << max << endl;
mythread[i] = std::thread(task, &cpx[i], max, &start_barrier, &end_barrier);
sum+=max;
}
cout << "total task length " << sum << endl;
complex<double> c(0,0);
for (unsigned long int j=1; j < rep+1; j++) {
start_barrier.wait (); //give to the threads the missing call to start
if (j==rep) done=true;
end_barrier.wait (); //wait for the call from each thread
if (j%100==0) cout << "cycle: " << j << endl;
for (unsigned int i=0; i<n_threads; i++) {
c+=cpx[i];
}
}
for (unsigned int i=0; i < n_threads; i++) {
mythread[i].join();
}
cout << "result: " << c << " it took: " << t.get_now() << " s." << endl;
return 0;
}