How to multithread reading a file in C++11?

I have a big file that I have to read in chunks. Each time I read a chunk, I have to do some time-consuming operation on it, so I think multithreaded reading might help: each thread reads one chunk at a time and does its operation. Here is my code in C++11:
#include<iostream>
#include<fstream>
#include <condition_variable>
#include <mutex>
#include <thread>
using namespace std;
const int CHAR_PER_FILE = 1e8;
const int NUM_THREAD = 2;
int order = -1;
bool is_reading = false;
mutex mtx;
condition_variable file_not_reading;
void partition(ifstream& is)
{
while (is.peek() != EOF)
{
unique_lock<mutex> lock(mtx);
while (is_reading)
file_not_reading.wait(lock);
is_reading = true;
char *c = new char[CHAR_PER_FILE];
is.read(c, CHAR_PER_FILE);
order++;
is_reading = false;
file_not_reading.notify_all();
lock.unlock();
char oc[3];
sprintf(oc, "%d", order);
this_thread::sleep_for(chrono::milliseconds(2000));//some operations that take long time
ofstream os(oc, ios::binary);
os.write(c, CHAR_PER_FILE);
delete[] c;
os.close();
}
}
int main()
{
ifstream is("bigfile.txt",ios::binary);
thread threads[NUM_THREAD];
for (int i = 0; i < NUM_THREAD; i++)
threads[i] = thread(partition, ref(is));
for (int i = 0; i < NUM_THREAD; i++)
threads[i].join();
is.close();
system("pause");
return 0;
}
But my code didn't work: it only created 4 files instead of bigfilesize/CHAR_PER_FILE, and the threads seem to get stuck. How can I make it work?
Is there any C++11 multithreaded file-reading implementation or example?
Thanks.

My advice:
Use one thread to read chunks from the file. Every time a chunk is read, post it to a request queue. Reading from multiple threads is not worth it, as there will be internal locking/blocking when reading a common resource.
Use a pool of threads. Each of them reads from the queue, retrieves a chunk, executes the expensive operation and goes back to waiting for a new request.
The queue must be mutex protected.
Don't use more threads than the number of processing units (CPUs/cores/hyperthreads) you have.
The main caveat of the above is that it does not guarantee the processing order. You will probably need to post the results to a central place that can reorder them (again, a central place must be mutex protected).

You could use task-based parallelism with std::async:
class result; // result of expensive operation
result expensive_operation(std::vector<char> const& data)
{
result r = // long computation
return r;
}
std::vector<char>::size_type BLOCK_SIZE = 4096;
std::vector<std::future<result>> partition(ifstream& in)
{
std::vector<std::future<result>> tasks;
while (!in.eof() && !in.fail())
{
std::vector<char> c(BLOCK_SIZE);
in.read(c.data(), BLOCK_SIZE);
c.resize(in.gcount());
tasks.push_back( std::async( [](std::vector<char> data)
{
return expensive_operation(data);
},
std::move(c) ));
}
return tasks;
}
int main()
{
ifstream is("bigfile.txt",ios::binary);
auto results = partition(is);
// iterate over results and do something with it
}

Does the file have to be read in "sequential" order, i.e. do the chunks have to be "operated" on in a particular order? Otherwise you could e.g. make 4 threads and let each thread read 1/4 of the file (you could do this by getting the file size with tellg, saving each thread's start position in e.g. a vector or variable, and moving to it with seekg). If each thread opens its own stream, you wouldn't have to use locks.
Maybe you could tell us how the data you read in has to be evaluated.
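A rough sketch of that idea, with a few assumptions: process_in_slices and count_newlines are hypothetical names, the per-slice work is a stand-in (counting newlines), and each slice is small enough to hold in memory (for a really big file each thread would read its slice in smaller chunks). Every thread opens its own std::ifstream, so no stream is shared and no lock is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-slice work: here it just counts newline characters.
static std::size_t count_newlines(const std::vector<char>& data) {
    return static_cast<std::size_t>(std::count(data.begin(), data.end(), '\n'));
}

// Splits the file at `path` into `parts` byte ranges. Each thread opens its own
// stream, seeks to the start of its range and processes it independently.
std::vector<std::size_t> process_in_slices(const std::string& path, unsigned parts)
{
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    if (!probe || parts == 0) return {};
    const std::streamoff size = probe.tellg();            // opened at end: position == size
    const std::streamoff slice = (size + parts - 1) / parts;

    std::vector<std::size_t> results(parts, 0);
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < parts; ++i) {
        threads.emplace_back([&, i] {
            const std::streamoff begin = std::min<std::streamoff>(size, i * slice);
            const std::streamoff end = std::min<std::streamoff>(size, begin + slice);
            std::vector<char> buf(static_cast<std::size_t>(end - begin));
            if (buf.empty()) return;
            std::ifstream in(path, std::ios::binary);      // private stream, no locking
            in.seekg(begin);
            in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            buf.resize(static_cast<std::size_t>(in.gcount()));
            results[i] = count_newlines(buf);              // each thread writes only its own slot
        });
    }
    for (auto& t : threads) t.join();
    return results;
}
```

Note that slicing by byte offset can cut a record in half at a slice boundary, so this only fits data that can be processed in arbitrary byte ranges.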

Perhaps...
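void partition(ifstream& is)
{
unique_lock<mutex> lock(mtx);
std::vector<char> c(CHAR_PER_FILE);
is.read(c.data(), CHAR_PER_FILE);
// query the shared stream and claim a chunk number while still holding the lock
const bool failed = is.fail() && !is.eof();
const std::streamsize num_bytes_read = is.gcount();
const int my_order = ++order;
lock.unlock();
if (failed || num_bytes_read == 0) return;
std::ostringstream oc; // needs <sstream>
oc << my_order;
this_thread::sleep_for(chrono::milliseconds(2000)); //take long time
std::ofstream os(oc.str(), ios::binary);
if (os)
os.write(c.data(), num_bytes_read); // write only what was actually read
}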
Notes:
The mutex serialises the operations already - no need for a condition variable.
I've added a little input error and bytes-read handling - you should check after os.write() too, add an else for failed ofstream creation etc.

Related

Do you need to pair memory_order_acquire and memory_order_release around a block of code?

I'm making a profiler for an interpreter, I need the interpreter to write the current frame position somewhere on every call. Then sample that information every X ms. I initially started with rigtorp's spinlock around the frame position, but that had quite an effect on the runtime of the interpreter (profiling pointed at the locking acquire time for every loop through the interpreter). So after reading quite some pages about memory fences I came up with a more efficient solution, but I would like to know if this is the correct interpretation of the relation between memory_order_relaxed & acquire/release.
#include <memory>
#include <chrono>
#include <string>
#include <iostream>
#include <thread>
#include <immintrin.h>
#include <cstring>
#include <atomic>
using namespace std;
typedef struct frame {
uint8_t op;
uint16_t arg;
uint32_t check;
} frame;
static constexpr unsigned int FRAME_BUFFER_SIZE = 8 * 1024;
static atomic<unsigned int> index;
static frame frames[FRAME_BUFFER_SIZE];
static void writer() {
uint8_t op = 0;
uint16_t arg = 0;
for (;;) {
op++;
arg++;
const auto newIndex = index.load(memory_order_relaxed) + 1;
auto &target = frames[newIndex % FRAME_BUFFER_SIZE];
target.op = op;
target.arg = arg;
target.check = static_cast<uint32_t>(arg) + op;
index.store(newIndex, memory_order_release);
_mm_pause(); // give the other threads some time
}
}
static void reader() {
for (;;) {
const auto lastValidIndex = index.load(memory_order_acquire);
// we race, hoping that the FRAME_BUFFER_SIZE is enough room
// to avoid the writer catching up to us
const auto snapshot = frames[lastValidIndex % FRAME_BUFFER_SIZE];
if ((static_cast<uint32_t>(snapshot.arg) + snapshot.op) != snapshot.check) {
cout << "Invalid snapshot\n";
exit(1);
}
// we sleep a bit, since the reader is only intended to read once in a while
this_thread::sleep_for(chrono::milliseconds(1));
}
}
int main() {
cout << "Starting race\n";
index = 0;
memset(frames, 0, sizeof(frames));
thread w(writer);
thread r(reader);
w.join();
r.join();
return 0;
}
So the strategy is as follows: we have a circular buffer that the writer writes into. The writer is the only thread that mutates the index variable, so its first read is memory_order_relaxed. Then we update the value in the array, and then we store the new "full" index, this time with memory_order_release. The reader only reads the index (in this case with memory_order_acquire) and then indexes into that location in the array.
So my questions are:
does this pairing of fences guarantee that the writes to frames array happen before the update of the index?
is the cpu "cache" of the fences cleared in the read thread every time we do a memory_order_acquire?
is it indeed safe to do a memory_order_relaxed read of the index variable, since we know that our thread is the only one reading this, and we don't care about the values in the frames array?

Are I/O streams really thread-safe?

I wrote a program that writes random numbers to one file in the first thread, while another thread reads them from there and writes those that are prime numbers to another file. The third thread is needed to stop/start the work. I read that I/O streams are thread-safe. Since writing to a single shared resource is thread-safe, what could be the problem?
Output: always correct record in numbers.log, sometimes no record in numbers_prime.log when there are prime numbers, sometimes they are all written.
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <vector>
#include <condition_variable>
#include <future>
#include <random>
#include <chrono>
#include <string>
using namespace std::chrono_literals;
std::atomic_int ITER_NUMBERS = 30;
std::atomic_bool _var = false;
bool ret() { return _var; }
std::atomic_bool _var_log = false;
bool ret_log() { return _var_log; }
std::condition_variable cv;
std::condition_variable cv_log;
std::mutex mtx;
std::mutex mt;
std::atomic<int> count{0};
std::atomic<bool> _FL = 1;
int MIN = 100;
int MAX = 200;
bool is_empty(std::ifstream& pFile) // function that checks if the file is empty
{
return pFile.peek() == std::ifstream::traits_type::eof();
}
bool isPrime(int n) // function that checks if the number is prime
{
if (n <= 1)
return false;
for (int i = 2; i <= sqrt(n); i++)
if (n % i == 0)
return false;
return true;
}
void Log(int min, int max) { // function that generates random numbers and writes them to a file numbers.log
std::string str;
std::ofstream log;
std::random_device seed;
std::mt19937 gen{seed()};
std::uniform_int_distribution dist{min, max};
log.open("numbers.log", std::ios_base::trunc);
for (int i = 0; i < ITER_NUMBERS; ++i, ++count) {
std::unique_lock<std::mutex> ulm(mtx);
cv.wait(ulm,ret);
str = std::to_string(dist(gen)) + '\n';
log.write(str.c_str(), str.length());
log.flush();
_var_log = true;
cv_log.notify_one();
//_var_log = false;
//std::this_thread::sleep_for(std::chrono::microseconds(500000));
}
log.close();
_var_log = true;
cv_log.notify_one();
_FL = 0;
}
void printCheck() { // Checking function to start/stop printing
std::cout << "Log to file? [y/n]\n";
while (_FL) {
char input;
std::cin >> input;
std::cin.clear();
if (input == 'y') {
_var = true;
cv.notify_one();
}
if (input == 'n') {
_var = false;
}
}
}
void primeLog() { // a function that reads files from numbers.log and writes prime numbers to numbers_prime.log
std::unique_lock ul(mt);
int number = 0;
std::ifstream in("numbers.log");
std::ofstream out("numbers_prime.log", std::ios_base::trunc);
if (is_empty(in)) {
cv_log.wait(ul, ret_log);
}
int oldCount{};
for (int i = 0; i < ITER_NUMBERS; ++i) {
if (oldCount == count && count != ITER_NUMBERS) { // check if primeLog is faster than Log. If it is faster, then we wait to continue
cv_log.wait(ul, ret_log);
_var_log = false;
}
if (!in.eof()) {
in >> number;
if (isPrime(number)) {
out << number;
out << "\n";
}
oldCount = count;
}
}
}
int main() {
std::thread t1(printCheck);
std::thread t2(Log, MIN, MAX);
std::thread t3(primeLog);
t1.join();
t2.join();
t3.join();
return 0;
}
This has nothing to do with the I/O stream thread safety. The shown code's logic is broken.
The shown code seems to follow a design pattern of breaking up a single logical algorithm into multiple pieces, and scattering them far and wide. This makes it more difficult to understand what it's doing. So let's rewrite a little bit of it, to make the logic more clear. In primeLog let's do this instead:
cv_log.wait(ul, []{ return _var_log; });
_var_log = false;
It's now clearer that this waits for _var_log to be set before proceeding on its merry way. Once it is set, it gets immediately reset.
The code that follows reads exactly one number from the file, before looping back here. So, primeLog's main loop will always handle exactly one number, on each iteration of the loop.
The problem now is very easy to see, once we head over to the other side, and do the same clarification:
std::unique_lock<std::mutex> ulm(mtx);
cv.wait(ulm, []{ return _var; });
// Code that generates one number and writes it to the file
_var_log = true;
cv_log.notify_one();
Once _var is set to true, it remains true. This loop starts running full blast, iterating continuously. On each iteration of the loop it blindly sets _var_log to true and signals the other thread's condition variable.
C++ execution threads are completely independent of each other unless they are explicitly synchronized in some way.
Nothing is preventing this loop from running full blast, getting through its entire number range, before the other execution thread wakes up and decides to read the first number from the file. It'll do that, then go back and wait for its condition variable to be signaled again, for the next number. Its hopes and dreams of the 2nd number will be left unsatisfied.
On each iteration of the generating thread's loop the condition variable, for the other execution thread, gets signaled.
Condition variables are not semaphores. If nothing is waiting on a condition variable when it's signaled -- too bad. When some execution thread decides to wait on a condition variable, it may or may not be immediately woken up.
One of these two execution threads relies on receiving a condition variable notification for every iteration of its loop.
The logic in the other execution thread fails to implement this guarantee. This may not be the only flaw; there might be others, subject to further analysis. This was just the most apparent logical flaw.
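One common way to avoid depending on one notification per item is to make the shared state itself carry the pending work, so a notification that arrives while nobody is waiting loses nothing. The following is only a sketch of that technique, not a rewrite of the program above: NumberChannel, produce and consume are hypothetical names, and the numbers travel through an in-memory queue rather than a file.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical channel: the queue is the shared state that the predicate checks.
// If notify_one() fires before the consumer waits, the item is still in the queue.
struct NumberChannel {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> pending;
    bool done = false;
};

void produce(NumberChannel& ch, const std::vector<int>& numbers) {
    for (int n : numbers) {
        { std::lock_guard<std::mutex> lock(ch.m); ch.pending.push(n); }
        ch.cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(ch.m); ch.done = true; }
    ch.cv.notify_one();
}

std::vector<int> consume(NumberChannel& ch) {
    std::vector<int> seen;
    std::unique_lock<std::mutex> lock(ch.m);
    for (;;) {
        // The predicate is re-checked under the lock, so missed or spurious
        // wake-ups do not matter; only the queue contents do.
        ch.cv.wait(lock, [&] { return !ch.pending.empty() || ch.done; });
        while (!ch.pending.empty()) {   // drain everything produced so far
            seen.push_back(ch.pending.front());
            ch.pending.pop();
        }
        if (ch.done) return seen;
    }
}
```

The consumer may handle several numbers per wake-up or none at all; correctness no longer depends on how the two threads happen to interleave.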
Thanks to those who wrote about read-behind-write, now I know more. But that was not the problem. The main problem was that if it was a new file, when calling pFile.peek() in the is_empty function, we permanently set the file flag to eofbit. Thus, until the end of the program in.rdstate() == std::ios_base::eofbit.
Fix: reset the flag state.
if (is_empty(in)) {
cv_log.wait(ul, ret_log);
}
in.clear(); // reset state
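The effect can be reproduced without files or threads. The sketch below uses a std::istringstream as a stand-in for the still-empty numbers.log (an assumption made for brevity; PeekDemo and peek_demo are hypothetical names): peek() on an empty stream sets eofbit, every later extraction fails until clear() is called, and afterwards reading works again.

```cpp
#include <sstream>

// Observations collected by the demo below.
struct PeekDemo {
    bool eof_after_peek;
    bool read_ok_without_clear;
    bool read_ok_after_clear;
    int value;
};

PeekDemo peek_demo() {
    PeekDemo d{};
    std::istringstream in("");       // stands in for the empty file
    in.peek();                       // hits end of input and sets eofbit
    d.eof_after_peek = in.eof();

    in.str("42\n");                  // "new data arrives"; the state flags are untouched
    int n = 0;
    d.read_ok_without_clear = static_cast<bool>(in >> n);  // fails: stream is not good()

    in.clear();                      // reset eofbit/failbit
    d.read_ok_after_clear = static_cast<bool>(in >> n);
    d.value = n;
    return d;
}
```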
There was also a problem with the peculiarity of reading and writing one file from different threads. It was not the cause of my program error, but it led to another one.
If, when I run the program again, primeLog() opens std::ifstream in("numbers.log") for reading before log.open("numbers.log", std::ios_base::trunc) runs, then in will load the old data into its buffer before log.open erases it with the std::ios_base::trunc flag. Hence we will read the old data and write it to numbers_prime.log.

What am I doing wrong with my C++ threading?

I am trying to solve the following problem:
Given a vector V[] of positive and negative integers. A number N is paired with its negative counterpart, which is -N. If there are pairs of such numbers in the given vector V[], take the positive integers and push them to a result vector.
Example:
If input is V = [1,-1,0,2,-3,3]
return [1,3]
I tried to solve this problem in 3 flavors:
Single Threaded | Runtime: 404000
Multithreaded coarse grained lock | Runtime: 39882000
Multithreaded fine grained lock | Runtime: 43921000
My idea with fine grained locking is to update memory at discrete memory locations based upon the input.
I see that my Multithreaded coarse grained lock is performing worse than the Single Threaded one (which is kind of expected). But what I don't understand is why my Multithreaded fine grained lock is most of the time performing worse than the Multithreaded coarse grained lock, and poorly compared to the Single Threaded version. I expected the Multithreaded fine grained lock to perform better than the Single Threaded version.
What is wrong with my implementation? What am I doing wrong? How can I improve the performance of this code with multithreading?
#include <iostream>
#include <unordered_map>
#include <vector>
#include <mutex>
#include <thread>
#include <chrono>
#include <cstdlib>
#include <memory>
using namespace std;
class Solution
{
private:
const static uint32_t THREAD_N = 5;
unordered_map<uint32_t, int32_t> records;
vector<uint32_t> results;
vector<atomic<uint32_t>> atm_results;
mutex mut[THREAD_N];
mutex mutrec;
bool bzero;
public:
Solution(): bzero(true){
records.reserve(100);
}
void InsertVal(const vector<int32_t> &vin)
{
for (auto iter : vin) {
if(iter < 0)
{
if(records[0-iter] > 0) results.emplace_back(0-iter);
records[0-iter]--;
}
else if(iter > 0)
{
if(records[iter] < 0) results.emplace_back(iter);
records[iter]++;
}
else
{
bzero = !bzero;
if (bzero) {
results.emplace_back(0);
}
}
}
}
void InsertValEach(const int32_t &val)
{
lock_guard<mutex> lock(mutrec); // single block of lock
if(val < 0)
{
if(records[0-val] > 0) results.emplace_back(0-val);
records[0-val]--;
}
else if(val > 0)
{
if(records[val] < 0) results.emplace_back(val);
records[val]++;
}
else
{
bzero = !bzero;
if (bzero) {
results.emplace_back(0);
}
}
}
void InsertValEachFree(const int32_t &val)
{
if(val < 0)
{
lock_guard<mutex> lock(mut[(0-val)%THREAD_N]); // finer lock based on input
if(records[0-val] > 0)
{
lock_guard<mutex> l(mutrec); // yet another finer lock to update results
results.emplace_back(0-val);
}
records[0-val]--;
}
else if(val > 0)
{
lock_guard<mutex> lock(mut[(val)%THREAD_N]);
if(records[val] < 0)
{
lock_guard<mutex> l(mutrec);
results.emplace_back(val);
}
records[val]++;
}
else
{
lock_guard<mutex> lock(mut[0]);
bzero = !bzero;
if (bzero) {
lock_guard<mutex> l(mutrec);
results.emplace_back(0);
}
}
}
vector<uint32_t> GetResult()
{
lock_guard<mutex> l(mutrec);
return results;
}
void reset()
{
lock_guard<mutex> l(mutrec);
results = vector<uint32_t>();
}
};
void Display(Solution &s)
{
auto v = s.GetResult();
// for (auto &iter : v) {
// cout<<iter<<" ";
// }
cout<<v.size()<<"\n";
}
size_t SingleThread(Solution &s, const vector<int32_t> &vec)
{
chrono::time_point<chrono::system_clock> start, stop;
start = chrono::system_clock::now();
s.InsertVal(vec);
stop = chrono::system_clock::now();
chrono::duration<double> elapse_time = stop - start;
Display(s);
s.reset();
return chrono::duration_cast<chrono::nanoseconds>(elapse_time).count();
}
size_t CourseGrainLock(Solution &s, const vector<int32_t> &vec)
{
chrono::time_point<chrono::system_clock> start, stop;
vector<thread> vthreads;
auto vsize = vec.size();
start = chrono::system_clock::now();
for (int32_t iter=0; iter<vsize; iter++) {
vthreads.push_back(thread(&Solution::InsertValEach, &s, vec[iter]));
}
stop = chrono::system_clock::now();
for (auto &th : vthreads) {
th.join();
}
chrono::duration<double> elapse_time = stop - start;
Display(s);
s.reset();
return chrono::duration_cast<chrono::nanoseconds>(elapse_time).count();
}
size_t FineGrainLock(Solution &s, const vector<int32_t> &vec)
{
chrono::time_point<chrono::system_clock> start, stop;
vector<thread> vthreads;
auto vsize = vec.size();
start = chrono::system_clock::now();
for (int32_t iter=0; iter<vsize; iter++) {
vthreads.push_back(thread(&Solution::InsertValEachFree, &s, vec[iter]));
}
stop = chrono::system_clock::now();
for (auto &th : vthreads) {
th.join();
}
chrono::duration<double> elapse_time = stop - start;
Display(s);
s.reset();
return chrono::duration_cast<chrono::nanoseconds>(elapse_time).count();
}
int main(int argc, const char * argv[]) {
vector<int32_t> vec;
int count = 1000;
while(count--)
{
vec.emplace_back(rand()%50);
vec.emplace_back(0-(rand()%50));
}
Solution s;
auto nanosec = SingleThread(s, vec);
cout<<"Time of Execution (nano) Single Thread: "<<nanosec<<"\n";
nanosec = CourseGrainLock(s, vec);
cout<<"Time of Execution (nano) Course Grain: "<<nanosec<<"\n";
nanosec = FineGrainLock(s, vec);
cout<<"Time of Execution (nano) Fine Grain: "<<nanosec<<"\n";
return 0;
}
You're creating one thread for each number in vec. There is a considerable cost in creating a thread. You should create a few threads (no more than the number of execution units in your hardware) and have each thread process multiple entries of the vector. main can run one set of results, thus avoiding creating of one thread.
With the locking in CourseGrainLock (in InsertValEach), since the first thing each thread does is grab a lock that is not released until the function is done, your code is effectively single threaded but with the cost of creating all those threads.
The locking in your FineGrainLock (in InsertValEachFree) is not much better. You have several locks, but you make changes to records in multiple threads under different locks. Adding elements to an unordered map (which you do with records[val] or records[0-val]) is not thread safe, and you risk Undefined Behavior with that code.
A reasonable approach here is to have each thread keep track of its own results independently, thus avoiding the need for locks at all, and combine them into the main results once all the threads are done.
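One possible shape for this, as a sketch only: the work is partitioned by value rather than by position, so a thread owns every occurrence of the values it is responsible for and can keep a private map and a private result vector. The names pair_worker and find_pairs are hypothetical, the pairing rule is copied from InsertVal in the question, and zeros are skipped for brevity.

```cpp
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <thread>
#include <unordered_map>
#include <vector>

// Worker `id` of `nthreads` handles only values with |v| % nthreads == id,
// so no two workers ever touch the same key and no lock is needed.
static void pair_worker(const std::vector<std::int32_t>& in, unsigned id,
                        unsigned nthreads, std::vector<std::uint32_t>& out)
{
    std::unordered_map<std::uint32_t, std::int32_t> balance;  // private to this worker
    for (std::int32_t v : in) {
        if (v == 0) continue;                                  // zero handling omitted in this sketch
        const std::uint32_t key = static_cast<std::uint32_t>(std::abs(v));
        if (key % nthreads != id) continue;                    // another worker owns this value
        std::int32_t& b = balance[key];
        if (v < 0) { if (b > 0) out.push_back(key); --b; }     // same rule as InsertVal
        else       { if (b < 0) out.push_back(key); ++b; }
    }
}

std::vector<std::uint32_t> find_pairs(const std::vector<std::int32_t>& in, unsigned nthreads)
{
    std::vector<std::vector<std::uint32_t>> partial(nthreads);  // one result vector per thread
    std::vector<std::thread> pool;
    for (unsigned id = 0; id < nthreads; ++id)
        pool.emplace_back(pair_worker, std::cref(in), id, nthreads, std::ref(partial[id]));
    for (auto& t : pool) t.join();

    std::vector<std::uint32_t> merged;                          // combine once all threads are done
    for (auto& p : partial) merged.insert(merged.end(), p.begin(), p.end());
    return merged;
}
```

Every thread still scans the whole input, so for work this cheap the single-threaded loop may well remain faster; the sketch only shows how the locks can be removed. The merged results are grouped by thread, not in input order.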
You probably can't improve it with multithreading. All of the threads have to access the same shared input vector and result vector. The tremendous slowdown that you see vs. the single-threaded solution is the overhead of serializing the access to the shared structure.
Multithreading is not a panacea. If you need to do something like this to an array, "just do it." Single-thread.
The major issue I see with your code is that the majority of the work is being done inside the mutex. That completely blocks the other threads, so there is no benefit. Even the fine grained one is only doing a very small calculation outside the mutex compared to the cost of updating the map and the output vector.
I'm not even totally convinced your finegrained locking is completely thread safe? If the array index might create nodes in the map for a value that hasn't been seen before, then that invalidates any other thread's simultaneous searches. You could use a separate map for each locked value range I think.
But to be honest I think you are just doing too little work in each thread. Try creating a smaller number of threads and have each one do a range of the input values - calling the existing code for each entry in that range.

C++11 thread to modify std::list

I'll post my code, and then tell you what I think it's doing.
#include <thread>
#include <mutex>
#include <list>
#include <iostream>
using namespace std;
...
//List of threads and ints
list<thread> threads;
list<int> intList;
//Whether or not a thread is running
bool running(false);
//Counters
int busy(0), counter(0);
//Add 10000 elements to the list
for (int i = 0; i < 10000; ++i){
//push back an int
intList.push_back(i);
counter++;
//If the thread is running, make a note of it and continue
if (running){
busy++;
continue;
}
//If we haven't yet added 10 elements before a reset, continue
if (counter < 10)
continue;
//If we've added more than 10 ints, and there's no active thread,
//reset the counter and launch
counter = 0;
threads.push_back(std::thread([&]
//These iterators are function args
(list<int>::iterator begin, list<int>::iterator end){
//mutex for the running bool
mutex m;
m.lock();
running = true;
m.unlock();
//Remove either 10 elements or every element till the end
int removed(0);
while (removed < 10 && begin != end){
begin = intList.erase(begin);
removed++;
}
//unlock the running bool
m.lock();
running = false;
m.unlock();
//Pass into the thread func the current beginning and end of the list
}, intList.begin(), intList.end()));
}
for (auto& thread : threads){
thread.join();
}
What I think this code is doing is adding 10000 elements to the end of a list. For every 10 we add, launch a (single) thread that deletes the first 10 elements of the list (at the time the thread was launched).
I don't expect this to remove every list element, I was just interested in seeing if I could add to the end of a list while removing elements from the beginning. In Visual Studio I get a "list iterators incompatible" error quite often, but I figure the problem is cross platform.
What's wrong with my thinking? I know it's something
EDIT:
So I see now that this code is very incorrect. Really I just want one auxiliary thread active at a time to delete elements, which is why I thought calling erase was ok. However I don't know how to declare a thread without joining it, and if I wait for that then I don't really see the point of doing any of this.
Should I declare my thread before the loop and have it wait for a signal from the main thread?
To clarify, my goal here is to do the following: I want to grab keyboard presses on one thread and store them in a list, and every so often log them to a file on a separate thread while removing the things I've logged. Since I don't want to spend a lot of time writing to the disk, I'd like to write in discrete chunks (of 10).
Thanks to Christophe, and everyone else. Here's my code now... I may be using lock_guard incorrectly.
#include <thread>
#include <mutex>
#include <list>
#include <iostream>
#include <atomic>
using namespace std;
...
atomic<bool> running(false);
list<int> intList;
int busy(0), counter(0);
mutex m;
thread * t(nullptr);
for (int i = 0; i < 100000; ++i){
//Would a lock_guard here be inappropriate?
m.lock();
intList.push_back(i);
m.unlock();
counter++;
if (running){
busy++;
continue;
}
if (counter < 10)
continue;
counter = 0;
if (t){
t->join();
delete t;
}
t = new thread([&](){
running = true;
int removed(0);
while (removed < 10){
lock_guard<mutex> lock(m);
if (intList.size())
intList.erase(intList.begin());
removed++;
}
running = false;
});
}
if (t){
t->join();
delete t;
}
Your code won't work because:
your mutex is local to each thread (each thread has its own copy used only by itself: no chance of interthread synchronisation!)
intList is not thread-safe, but you access it from several threads, causing race conditions and undefined behaviour.
the begin and end that you send to your threads at their creation might no longer be valid during the execution.
Here are some improvements (look at the commented lines):
atomic<bool> running(false); // <=== atomic (to avoid unnecessary use of mutex)
int busy(0), counter(0);
mutex l; // define the mutex here, so that it will be the same for all threads
for (int i = 0; i < 10000; ++i){
l.lock(); // <===you need to protect each access to the list
intList.push_back(i);
l.unlock(); // <===and unlock
counter++;
if (running){
busy++;
continue;
}
if (counter < 10)
continue;
counter = 0;
threads.push_back(std::thread([&]
(){ //<====No iterator args as they might be outdated during executionof threads!!
running = true; // <=== no longer surrounded from lock/unlock as it is now atomic
int removed(0);
while (removed < 10){
l.lock(); // <====you really need to protect access to the list
if (intList.size()) // <=== check if elements exist NOW
intList.erase(intList.begin()); // <===use current data, not a prehistoric outdated local begin !!
l.unlock(); // <====end of protected section
removed++;
}
running = false; // <=== no longer surrounded from lock/unlock as it is now atomic
})); //<===No other arguments
}
...
By the way, I'd suggest that you have a look at lock_guard<mutex> for the locks, as it ensures the unlock in all circumstances (especially when there are exceptions or other surprises like this).
Edit: I've avoided the lock protection of running with a mutex, by making it atomic<bool>.

Thread pooling in C++11

Relevant questions:
About C++11:
C++11: std::thread pooled?
Will async(launch::async) in C++11 make thread pools obsolete for avoiding expensive thread creation?
About Boost:
C++ boost thread reusing threads
boost::thread and creating a pool of them!
How do I get a pool of threads to send tasks to, without creating and deleting them over and over again? This means persistent threads to resynchronize without joining.
I have code that looks like this:
namespace {
std::vector<std::thread> workers;
int total = 4;
int arr[4] = {0};
void each_thread_does(int i) {
arr[i] += 2;
}
}
int main(int argc, char *argv[]) {
for (int i = 0; i < 8; ++i) { // for 8 iterations,
for (int j = 0; j < 4; ++j) {
workers.push_back(std::thread(each_thread_does, j));
}
for (std::thread &t: workers) {
if (t.joinable()) {
t.join();
}
}
arr[4] = std::min_element(arr, arr+4);
}
return 0;
}
Instead of creating and joining threads each iteration, I'd prefer to send tasks to my worker threads each iteration and only create them once.
This is adapted from my answer to another very similar post.
Let's build a ThreadPool class:
class ThreadPool {
public:
void Start();
void QueueJob(const std::function<void()>& job);
void Stop();
bool busy();
private:
void ThreadLoop();
bool should_terminate = false; // Tells threads to stop looking for jobs
std::mutex queue_mutex; // Prevents data races to the job queue
std::condition_variable mutex_condition; // Allows threads to wait on new jobs or termination
std::vector<std::thread> threads;
std::queue<std::function<void()>> jobs;
};
ThreadPool::Start
For an efficient threadpool implementation, once threads are created according to num_threads, it's better not to create new ones or destroy old ones (by joining). There will be a performance penalty, and it might even make your application go slower than the serial version. Thus, we keep a pool of threads that can be used at any time (if they aren't already running a job).
Each thread should be running its own infinite loop, constantly waiting for new tasks to grab and run.
void ThreadPool::Start() {
const uint32_t num_threads = std::thread::hardware_concurrency(); // Max # of threads the system supports
threads.resize(num_threads);
for (uint32_t i = 0; i < num_threads; i++) {
threads.at(i) = std::thread(&ThreadPool::ThreadLoop, this); // member function needs the object pointer
}
}
ThreadPool::ThreadLoop
The infinite loop function. This is a while (true) loop waiting for the task queue to open up.
void ThreadPool::ThreadLoop() {
while (true) {
std::function<void()> job;
{
std::unique_lock<std::mutex> lock(queue_mutex);
mutex_condition.wait(lock, [this] {
return !jobs.empty() || should_terminate;
});
if (should_terminate) {
return;
}
job = jobs.front();
jobs.pop();
}
job();
}
}
ThreadPool::QueueJob
Add a new job to the pool; use a lock so that there isn't a data race.
void ThreadPool::QueueJob(const std::function<void()>& job) {
{
std::unique_lock<std::mutex> lock(queue_mutex);
jobs.push(job);
}
mutex_condition.notify_one();
}
To use it:
thread_pool->QueueJob([] { /* ... */ });
ThreadPool::busy
bool ThreadPool::busy() {
bool poolbusy;
{
std::unique_lock<std::mutex> lock(queue_mutex);
poolbusy = !jobs.empty(); // busy while jobs are still queued
}
return poolbusy;
}
The busy() function can be used in a while loop, such that the main thread can wait for the threadpool to complete all the tasks before calling the threadpool destructor.
ThreadPool::Stop
Stop the pool.
void ThreadPool::Stop() {
{
std::unique_lock<std::mutex> lock(queue_mutex);
should_terminate = true;
}
mutex_condition.notify_all();
for (std::thread& active_thread : threads) {
active_thread.join();
}
threads.clear();
}
Once you integrate these ingredients, you have your own dynamic threading pool. These threads always run, waiting for a job to do.
I apologize if there are some syntax errors, I typed this code and I have a bad memory. Sorry that I cannot provide you the complete thread pool code; that would violate my job integrity.
Notes:
The anonymous code blocks are used so that when they are exited, the std::unique_lock variables created within them go out of scope, unlocking the mutex.
ThreadPool::Stop will not terminate any currently running jobs, it just waits for them to finish via active_thread.join().
You can use C++ Thread Pool Library, https://github.com/vit-vit/ctpl.
Then the code you wrote can be replaced with the following
#include <ctpl.h> // or <ctpl_stl.h> if you do not have Boost library
int main (int argc, char *argv[]) {
ctpl::thread_pool p(2 /* two threads in the pool */);
int arr[4] = {0};
std::vector<std::future<void>> results(4);
for (int i = 0; i < 8; ++i) { // for 8 iterations,
for (int j = 0; j < 4; ++j) {
results[j] = p.push([&arr, j](int){ arr[j] +=2; });
}
for (int j = 0; j < 4; ++j) {
results[j].get();
}
int smallest = *std::min_element(arr, arr + 4); // arr[4] would be out of bounds; needs <algorithm>
}
}
You will get the desired number of threads and will not create and delete them over and over again on the iterations.
A pool of threads means that all your threads are running, all the time – in other words, the thread function never returns. To give the threads something meaningful to do, you have to design a system of inter-thread communication, both for the purpose of telling the thread that there's something to do, as well as for communicating the actual work data.
Typically this will involve some kind of concurrent data structure, and each thread would presumably sleep on some kind of condition variable, which would be notified when there's work to do. Upon receiving the notification, one or several of the threads wake up, recover a task from the concurrent data structure, process it, and store the result in an analogous fashion.
The thread would then go on to check whether there's even more work to do, and if not go back to sleep.
The upshot is that you have to design all this yourself, since there isn't a natural notion of "work" that's universally applicable. It's quite a bit of work, and there are some subtle issues you have to get right. (You can program in Go if you like a system which takes care of thread management for you behind the scenes.)
A threadpool is at core a set of threads all bound to a function working as an event loop. These threads will endlessly wait for a task to be executed, or their own termination.
The threadpool's job is to provide an interface to submit jobs, define (and perhaps modify) the policy of running these jobs (scheduling rules, thread instantiation, size of the pool), and monitor the status of the threads and related resources.
So for a versatile pool, one must start by defining what a task is, how it is launched and interrupted, what its result is (see the notion of promise and future for that question), what sort of events the threads will have to respond to, how they will handle them, and how these events are distinguished from the ones handled by the tasks. As you can see, this can become quite complicated and can impose restrictions on how the threads work as the solution becomes more and more involved.
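As a small illustration of the promise/future notion mentioned for task results, here is a minimal sketch using only the standard library. It hands the result of a single task back to the caller; compute_on_thread is a hypothetical name, and a real pool would keep the thread alive and feed it from a queue instead of spawning one thread per call.

```cpp
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: doubles x on another thread and returns the result.
int compute_on_thread(int x) {
    std::promise<int> result;                        // write end, owned by the task
    std::future<int> pending = result.get_future();  // read end, kept by the caller

    // The promise is move-only, so it is moved into the thread.
    std::thread worker(
        [](std::promise<int> p, int value) { p.set_value(value * 2); },
        std::move(result), x);

    int answer = pending.get();   // blocks until set_value() has been called
    worker.join();
    return answer;
}
```

If the task throws, calling set_exception() on the promise would make get() rethrow on the caller's side, which is one way to answer the "what is the result" question for failing tasks.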
The current tooling for handling events is fairly barebones(*): primitives like mutexes, condition variables, and a few abstractions on top of that (locks, barriers). But in some cases, these abstractions may turn out to be unfit (see this related question), and one must revert to using the primitives.
Other problems have to be managed too:
signals
I/O
hardware (processor affinity, heterogeneous setup)
How would these play out in your setting?
This answer to a similar question points to an existing implementation meant for boost and the stl.
I offered a very crude implementation of a threadpool for another question, which doesn't address many of the problems outlined above. You might want to build on it. You might also want to have a look at existing frameworks in other languages to find inspiration.
(*) I don't see that as a problem, quite to the contrary. I think it's the very spirit of C++ inherited from C.
Following [PhD EcE](https://stackoverflow.com/users/3818417/phd-ece)'s suggestion, I implemented the thread pool:
function_pool.h
#pragma once
#include <queue>
#include <functional>
#include <mutex>
#include <condition_variable>
#include <atomic>
#include <cassert>
class Function_pool
{
private:
std::queue<std::function<void()>> m_function_queue;
std::mutex m_lock;
std::condition_variable m_data_condition;
std::atomic<bool> m_accept_functions;
public:
Function_pool();
~Function_pool();
void push(std::function<void()> func);
void done();
void infinite_loop_func();
};
function_pool.cpp
#include "function_pool.h"
Function_pool::Function_pool() : m_function_queue(), m_lock(), m_data_condition(), m_accept_functions(true)
{
}
Function_pool::~Function_pool()
{
}
void Function_pool::push(std::function<void()> func)
{
std::unique_lock<std::mutex> lock(m_lock);
m_function_queue.push(func);
// when we send the notification immediately, the consumer will try to get the lock , so unlock asap
lock.unlock();
m_data_condition.notify_one();
}
void Function_pool::done()
{
std::unique_lock<std::mutex> lock(m_lock);
m_accept_functions = false;
lock.unlock();
// when we send the notification immediately, the consumer will try to get the lock , so unlock asap
m_data_condition.notify_all();
//notify all waiting threads.
}
void Function_pool::infinite_loop_func()
{
std::function<void()> func;
while (true)
{
{
std::unique_lock<std::mutex> lock(m_lock);
m_data_condition.wait(lock, [this]() {return !m_function_queue.empty() || !m_accept_functions; });
if (!m_accept_functions && m_function_queue.empty())
{
//lock will be released automatically.
//finish the thread loop and let it join in the main thread.
return;
}
func = m_function_queue.front();
m_function_queue.pop();
//release the lock
}
func();
}
}
main.cpp
#include "function_pool.h"
#include <string>
#include <iostream>
#include <mutex>
#include <functional>
#include <thread>
#include <vector>
Function_pool func_pool;
class quit_worker_exception : public std::exception {};
void example_function()
{
std::cout << "bla" << std::endl;
}
int main()
{
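std::cout << "starting operation" << std::endl;
int num_threads = std::thread::hardware_concurrency();
if (num_threads == 0) num_threads = 2; // hardware_concurrency() may return 0 when the count cannot be determined
std::cout << "number of threads = " << num_threads << std::endl;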
std::vector<std::thread> thread_pool;
for (int i = 0; i < num_threads; i++)
{
thread_pool.push_back(std::thread(&Function_pool::infinite_loop_func, &func_pool));
}
//here we should send our functions
for (int i = 0; i < 50; i++)
{
func_pool.push(example_function);
}
func_pool.done();
for (unsigned int i = 0; i < thread_pool.size(); i++)
{
thread_pool.at(i).join();
}
}
You can use thread_pool from the Boost library:
#include <boost/asio.hpp>
#include <thread>
void my_task(){...}
int main(){
    int threadNumbers = std::thread::hardware_concurrency();
    boost::asio::thread_pool pool(threadNumbers);
    // Submit a function to the pool.
    boost::asio::post(pool, my_task);
    // Submit a lambda object to the pool.
    boost::asio::post(pool, []() {
        ...
    });
    // Wait for all submitted tasks to complete.
    pool.join();
}
You can also use threadpool from the open source community:
void first_task() {...}
void second_task() {...}
int main(){
int threadNumbers = thread::hardware_concurrency();
pool tp(threadNumbers);
// Add some tasks to the pool.
tp.schedule(&first_task);
tp.schedule(&second_task);
}
Something like this might help (taken from a working app).
#include <memory>
#include <boost/asio.hpp>
#include <boost/thread.hpp>
struct thread_pool {
typedef std::unique_ptr<boost::asio::io_service::work> asio_worker;
thread_pool(int threads) :service(), service_worker(new asio_worker::element_type(service)) {
for (int i = 0; i < threads; ++i) {
auto worker = [this] { return service.run(); };
grp.add_thread(new boost::thread(worker));
}
}
template<class F>
void enqueue(F f) {
service.post(f);
}
~thread_pool() {
service_worker.reset();
grp.join_all();
service.stop();
}
private:
boost::asio::io_service service;
asio_worker service_worker;
boost::thread_group grp;
};
You can use it like this:
thread_pool pool(2);
pool.enqueue([] {
std::cout << "Hello from Task 1\n";
});
pool.enqueue([] {
std::cout << "Hello from Task 2\n";
});
Keep in mind that reinventing an efficient asynchronous queuing mechanism is not trivial.
boost::asio::io_service is a very efficient implementation; more precisely, it is a collection of platform-specific wrappers (e.g. it wraps I/O completion ports on Windows).
Edit: This now requires C++17 and concepts. (As of 9/12/16, only g++ 6.0+ is sufficient.)
The template deduction is a lot more accurate because of it, though, so it's worth the effort of getting a newer compiler. I've not yet found a function that requires explicit template arguments.
It also now takes any appropriate callable object (and is still statically typesafe!!!).
It also now includes an optional green threading priority thread pool using the same API. This class is POSIX only, though. It uses the ucontext_t API for userspace task switching.
I created a simple library for this. An example of usage is given below. (I'm answering this because it was one of the things I found before I decided it was necessary to write it myself.)
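bool is_prime(int n){
    // Determine if n is prime by trial division.
    if(n < 2) return false;
    for(int d = 2; d * d <= n; d++)
        if(n % d == 0) return false;
    return true;
}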
int main(){
thread_pool pool(8); // 8 threads
list<future<bool>> results;
for(int n = 2;n < 10000;n++){
// Submit a job to the pool.
results.emplace_back(pool.async(is_prime, n));
}
int n = 2;
for(auto i = results.begin();i != results.end();i++, n++){
// i is an iterator pointing to a future representing the result of is_prime(n)
cout << n << " ";
bool prime = i->get(); // Wait for the task is_prime(n) to finish and get the result.
if(prime)
cout << "is prime";
else
cout << "is not prime";
cout << endl;
}
}
You can pass async any function with any (or void) return value and any (or no) arguments and it will return a corresponding std::future. To get the result (or just wait until a task has completed) you call get() on the future.
Here's the github: https://github.com/Tyler-Hardin/thread_pool.
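For comparison, the same submit-then-get() pattern can be tried with nothing but the standard library. The sketch below uses std::async instead of the library's pool.async, so it is not that library's API and it does not reuse threads: with std::launch::async each call runs as if on a new thread. To keep the thread count small, it submits one job per range of numbers; count_primes and count_primes_parallel are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <vector>

bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Hypothetical helper: counts the primes in [lo, hi).
int count_primes(int lo, int hi) {
    int count = 0;
    for (int n = lo; n < hi; ++n)
        if (is_prime(n)) ++count;
    return count;
}

// Splits [0, limit) into `parts` ranges and counts the primes of each range
// in its own asynchronous task.
int count_primes_parallel(int limit, int parts) {
    assert(parts > 0);
    std::vector<std::future<int>> results;
    const int step = (limit + parts - 1) / parts;    // ceiling division
    for (int lo = 0; lo < limit; lo += step) {
        const int hi = std::min(lo + step, limit);
        results.push_back(std::async(std::launch::async, count_primes, lo, hi));
    }
    int total = 0;
    for (auto& f : results)
        total += f.get();    // waits for the task and fetches its result
    return total;
}
```

As with the pool, each future carries either the return value or any exception thrown by its task, and get() may be called only once per future.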
It looks like a thread pool is a very popular problem/exercise :-)
I recently wrote one in modern C++; it’s owned by me and publicly available here - https://github.com/yurir-dev/threadpool
It supports templated return values, core pinning, and ordering of some tasks.
The whole implementation is in two .h files.
So, the code from the original question would look something like this:
#include "tp/threadpool.h"
int arr[5] = { 0 };
concurency::threadPool<void> tp;
tp.start(std::thread::hardware_concurrency());
std::vector<std::future<void>> futures;
for (int i = 0; i < 8; ++i) { // for 8 iterations,
for (int j = 0; j < 4; ++j) {
futures.push_back(tp.push([&arr, j]() {
arr[j] += 2;
}));
}
}
// wait until all pushed tasks are finished.
for (auto& f : futures)
f.get();
// or just tp.end(); // will kill all the threads
arr[4] = *std::min_element(arr, arr + 4);
I found that a pending task's future.get() call hangs on the caller's side if the thread pool is terminated while some tasks are still in the task queue. How can I set an exception on the future from inside the thread pool when the pool only holds the std::function wrapper?
template <class F, class... Args>
std::future<std::result_of_t<F(Args...)>> enqueue(F &&f, Args &&...args) {
    using return_type = std::result_of_t<F(Args...)>; // result type of calling f(args...)
    auto task = std::make_shared<std::packaged_task<return_type()>>(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    std::future<return_type> res = task->get_future();
    {
        std::unique_lock<std::mutex> lock(_mutex);
        _tasks.push([task]() -> void { (*task)(); });
    }
    return res;
}
class StdThreadPool {
std::vector<std::thread> _workers;
std::priority_queue<TASK> _tasks;
...
}
struct TASK {
//int _func_return_value;
std::function<void()> _func;
int priority;
...
}
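One possible approach, sketched under the assumption that the pool can drop its pending wrappers on shutdown: when a std::packaged_task is destroyed without having been invoked, the standard library stores a std::future_error with the code broken_promise in its shared state. Since the wrapper lambda holds the last shared_ptr to the task, destroying the queued std::function objects is enough to wake the waiting caller with an exception. The plain std::queue and the function name below are stand-ins for illustration, not the StdThreadPool from the question.

```cpp
#include <functional>
#include <future>
#include <memory>
#include <queue>

// Returns true if a future whose task was discarded without being run
// reports std::future_errc::broken_promise.
bool discarded_task_breaks_promise() {
    std::queue<std::function<void()>> tasks;   // stand-in for the pool's task queue

    auto task = std::make_shared<std::packaged_task<int()>>([] { return 42; });
    std::future<int> res = task->get_future();
    tasks.push([task] { (*task)(); });
    task.reset();   // as after enqueue() returns: only the wrapper owns the task

    // Simulated shutdown: drop every pending wrapper without running it.
    while (!tasks.empty()) tasks.pop();

    try {
        res.get();  // does not hang: the shared state was abandoned
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::broken_promise;
    }
    return false;
}
```

In a real pool the same effect would come from clearing the task container, under its mutex, in the stop or destructor path once the workers have exited. The hang appears when the workers are gone but the queue, and with it the unrun packaged_task objects, stays alive.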
The Stroika library has a threadpool implementation.
Stroika ThreadPool.h
ThreadPool p;
p.AddTask ([] () {doIt ();});
Stroika's thread library also supports cancelation (cooperative) - so that when the ThreadPool above goes out of scope - it cancels any running tasks (similar to c++20's jthread).
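The idea of cooperative cancellation on scope exit can be illustrated without Stroika or C++20. The sketch below is not Stroika's API: ScopedWorker and worker_stops_on_scope_exit are hypothetical names, and the cancellation signal is a plain std::atomic<bool> that the task has to poll itself.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Hypothetical wrapper: runs `body` on its own thread and, on destruction,
// asks it to stop and then joins it.
class ScopedWorker {
public:
    explicit ScopedWorker(std::function<void(const std::atomic<bool>&)> body)
        : cancelled_(false),
          thread_([this, body] { body(cancelled_); }) {}

    ~ScopedWorker() {
        cancelled_ = true;   // request cancellation; the body must check the flag
        thread_.join();      // wait until the body has actually returned
    }

    ScopedWorker(const ScopedWorker&) = delete;
    ScopedWorker& operator=(const ScopedWorker&) = delete;

private:
    std::atomic<bool> cancelled_;   // declared before thread_ so it is initialised first
    std::thread thread_;
};

// Starts a worker that spins until cancelled, then lets it go out of scope.
bool worker_stops_on_scope_exit() {
    std::atomic<long> ticks(0);
    {
        ScopedWorker w([&ticks](const std::atomic<bool>& cancelled) {
            while (!cancelled) {          // cooperative: the task polls the flag
                ++ticks;
                std::this_thread::yield();
            }
        });
        while (ticks == 0) std::this_thread::yield();   // wait until the task runs
    }   // ~ScopedWorker sets the flag and joins here
    return ticks > 0;
}
```

Because the cancellation is cooperative, a task that never checks the flag would keep the destructor blocked in join(), so long-running tasks need to poll at reasonable intervals.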