Remove finished threads from vector - C++

I have a number of jobs and I want to run a subset of them in parallel, e.g. I have 100 jobs to run and I want to run 10 threads at a time. This is my current code for this problem:
#include <thread>
#include <vector>
#include <iostream>
#include <atomic>
#include <random>
#include <mutex>
#include <chrono>

int main() {
    constexpr std::size_t NUMBER_OF_THREADS(10);
    std::atomic<std::size_t> numberOfRunningJobs(0);
    std::vector<std::thread> threads;
    std::mutex maxThreadsMutex;
    std::mutex writeMutex;
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, 2);
    for (std::size_t id(0); id < 100; ++id) {
        if (numberOfRunningJobs >= NUMBER_OF_THREADS - 1) {
            maxThreadsMutex.lock();
        }
        ++numberOfRunningJobs;
        threads.emplace_back([id, &numberOfRunningJobs, &maxThreadsMutex, &writeMutex, &distribution, &generator]() {
            auto waitSeconds(distribution(generator));
            std::this_thread::sleep_for(std::chrono::seconds(waitSeconds));
            writeMutex.lock();
            std::cout << id << " " << waitSeconds << std::endl;
            writeMutex.unlock();
            --numberOfRunningJobs;
            maxThreadsMutex.unlock();
        });
    }
    for (auto &thread : threads) {
        thread.join();
    }
    return 0;
}
In the for loop I check how many jobs are running, and if a slot is free, I add a new thread to the vector. At the end of each thread I decrement the number of running jobs and unlock the mutex to start one new thread. This solves my task, but there is one point I don't like: I need a vector of size 100 to store all threads, and I need to join all 100 threads at the end. I want to remove each thread from the vector after it finishes, so that the vector contains a maximum of 10 threads and I only have to join 10 threads at the end. I thought about passing the vector and an iterator by reference to the lambda so that I can remove the element at the end, but I don't know how. How can I optimize my code to use a maximum of 10 elements in the vector?

Since you don't seem to require extremely fine-grained thread control, I'd recommend approaching this problem with OpenMP. OpenMP is an industry-standard, directive-based approach for parallelizing C, C++, and Fortran code. Every major compiler for these languages implements it.
Using it results in a significant reduction in the complexity of your code:
#include <chrono>
#include <iostream>
#include <random>
#include <thread>

int main() {
    constexpr std::size_t NUMBER_OF_THREADS(10);
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, 2);
    //Distribute the loop between threads ensuring that only
    //a specific number of threads are ever active at once.
    #pragma omp parallel for num_threads(NUMBER_OF_THREADS)
    for (std::size_t id(0); id < 100; ++id) {
        int waitSeconds;
        #pragma omp critical(rng)    //Serialize access to generator
        waitSeconds = distribution(generator);
        std::this_thread::sleep_for(std::chrono::seconds(waitSeconds));
        #pragma omp critical(output) //Serialize access to cout
        std::cout << id << " " << waitSeconds << std::endl;
    }
    return 0;
}
To use OpenMP you compile with:
g++ main.cpp -fopenmp
Generating and directly coordinating threads is sometimes necessary, but the massive number of new languages and libraries designed to make parallelism easier speaks to the number of use cases in which a simpler path to parallelism is sufficient.

The keyword "thread pool" helped me a lot. I tried boost::asio::thread_pool, and it does what I want in the same way as my first approach. I solved my problem with:
#include <thread>
#include <chrono>
#include <iostream>
#include <atomic>
#include <random>
#include <mutex>
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/post.hpp>

int main() {
    boost::asio::thread_pool threadPool(10);
    std::mutex writeMutex;
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, 2);
    std::atomic<std::size_t> currentlyRunning(0);
    for (std::size_t id(0); id < 100; ++id) {
        boost::asio::post(threadPool, [id, &writeMutex, &distribution, &generator, &currentlyRunning]() {
            ++currentlyRunning;
            auto waitSeconds(distribution(generator));
            writeMutex.lock();
            std::cout << "Start: " << id << " " << currentlyRunning << std::endl;
            writeMutex.unlock();
            std::this_thread::sleep_for(std::chrono::seconds(waitSeconds));
            writeMutex.lock();
            std::cout << "Stop: " << id << " " << waitSeconds << std::endl;
            writeMutex.unlock();
            --currentlyRunning;
        });
    }
    threadPool.join();
    return 0;
}

Related

Parallel version of `std::generate` performs worse than the sequential one

I'm trying to parallelize some old code using the execution policies from C++17. My sample code is below:
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <algorithm>
#include <execution>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::duration<double>;
constexpr auto NUM = 100'000'000U;

double func()
{
    return rand();
}

int main()
{
    std::vector<double> v(NUM);
    // ------ feature testing
    std::cout << "__cpp_lib_execution : " << __cpp_lib_execution << std::endl;
    std::cout << "__cpp_lib_parallel_algorithm: " << __cpp_lib_parallel_algorithm << std::endl;
    // ------ fill the vector with random numbers sequentially
    auto const startTime1 = Clock::now();
    std::generate(std::execution::seq, v.begin(), v.end(), func);
    Duration const elapsed1 = Clock::now() - startTime1;
    std::cout << "std::execution::seq: " << elapsed1.count() << " sec." << std::endl;
    // ------ fill the vector with random numbers in parallel
    auto const startTime2 = Clock::now();
    std::generate(std::execution::par, v.begin(), v.end(), func);
    Duration const elapsed2 = Clock::now() - startTime2;
    std::cout << "std::execution::par: " << elapsed2.count() << " sec." << std::endl;
}
The program output on my Linux desktop:
__cpp_lib_execution : 201902
__cpp_lib_parallel_algorithm: 201603
std::execution::seq: 0.971162 sec.
std::execution::par: 25.0349 sec.
Why does the parallel version perform 25 times worse than the sequential one?
Compiler: g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0
The thread-safety of rand is implementation-defined. Which means either:
Your code has a data race in the parallel case, or
rand is effectively serial behind a highly contended lock, which would dramatically increase the overhead in the parallel case and give incredibly poor performance.
Based on your results, I'm guessing #2 applies, but it could be both.
Either way, the answer is: rand is a terrible test case for parallelism.

No speed up with openmp for emplace_back a vector in a loop

I'm trying to emplace_back into a vector in a loop with OpenMP. I took my inspiration from this post: C++ OpenMP Parallel For Loop - Alternatives to std::vector. So I wrote a test code:
// Example program
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <chrono>
#include <omp.h>

int main()
{
    std::cout << "Numbers of thread available : " << omp_get_max_threads() << std::endl;
    std::random_device dev;
    std::mt19937 gen(dev());
    std::uniform_int_distribution<unsigned> distrib(1, 5);
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        std::vector<std::pair<uint32_t, uint32_t> > result;
        #pragma omp declare reduction (merge : std::vector<std::pair<uint32_t, uint32_t> > : omp_out.insert(omp_out.end(), std::make_move_iterator(omp_in.begin()), std::make_move_iterator(omp_in.end())))
        #pragma omp parallel for reduction(merge: result)
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                result.emplace_back(std::make_pair(distrib(gen), distrib(gen)));
            }
        }
        end = std::chrono::system_clock::now();
        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "With openmp " << " : " << elapsed_seconds << "ms\n";
    }
    {
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        std::vector<std::pair<uint32_t, uint32_t> > result;
        for(int i=0; i<100000000; ++i)
        {
            if(distrib(gen) == 1)
            {
                result.emplace_back(std::make_pair(distrib(gen), distrib(gen)));
            }
        }
        end = std::chrono::system_clock::now();
        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
        std::cout << "Without openmp " << " : " << elapsed_seconds << "ms\n";
    }
}
I compile this code with
g++ -o main -std=c++17 -fopenmp main.cpp
and the output is :
Numbers of thread available : 12
With openmp : 3982ms
Without openmp : 3887ms
Obviously, I don't get any speedup from my OpenMP implementation. Why?
The current code has a data race: the shared `gen`/`distrib` objects are read and modified concurrently by all threads, which is undefined behaviour. As a result, an OpenMP implementation is free to generate a fast but completely "wrong" program or a slow "correct" one.
To get a correct implementation and a not-too-bad speedup with OpenMP, one solution is to replicate the generator/distribution in each worker (by moving the variable declarations into a #pragma omp parallel section) and to protect the (sequential) emplace_back with a critical section (#pragma omp critical).
Due to possible false sharing and lock contention, that parallel implementation may still scale poorly. It is probably better to fill thread-private arrays and merge the sub-arrays into a big shared one at the end, rather than using a naive critical section (note, however, that even this is not ideal, since the computation will likely be limited by the bandwidth of shared memory).
Please note that the result can differ from the sequential implementation when a specific seed needs to be used (not a problem here, since the seed is extracted from random_device).

Parallel std::fill has different performance on different architectures; why?

I'm attempting to write a parallel vector fill, using the following code:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <algorithm>

using namespace std;
using namespace std::chrono;

void fill_part(vector<double> & v, int ii, int num_threads)
{
    fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0);
}

int main()
{
    vector<double> v(200*1000*1000);
    high_resolution_clock::time_point t = high_resolution_clock::now();
    fill(v.begin(), v.end(), 0);
    duration<double> d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in serial.\n";
    unsigned num_threads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 1;
    cout << "Num threads: " << num_threads << '\n';
    vector<thread> threads;
    t = high_resolution_clock::now();
    for(int ii = 0; ii < num_threads; ++ii)
    {
        threads.emplace_back(fill_part, std::ref(v), ii, num_threads);
    }
    for(auto & t : threads)
    {
        if(t.joinable()) t.join();
    }
    d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in parallel.\n";
}
I tried this code on four different architectures (all Intel CPUs--but no matter).
The first machine had 4 CPUs, and the parallelization gave no speedup; the second had 4 and was 4 times as fast; the third had 4 and was twice as fast; and the last had 2 and gave no speedup.
My hypothesis is that the differences arise because the RAM bus can either be saturated by a single CPU or not, but is this correct? How can I predict what architectures will benefit from this parallelization?
Bonus question: The void fill_part function is awkward, so I wanted to do it with a lambda:
threads.emplace_back([&]{fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0); });
This compiles but terminates with a bus error; what's wrong with the lambda syntax?

Forcing race between threads using C++11 threads

I just got started with the C++11 threading library (and multithreading in general), and wrote a small snippet of code.
#include <iostream>
#include <thread>

int x = 5; //variable to be affected by race

//This function will be called from a thread
void call_from_thread1() {
    for (int i = 0; i < 5; i++) {
        x++;
        std::cout << "In Thread 1 :" << x << std::endl;
    }
}

int main() {
    //Launch a thread
    std::thread t1(call_from_thread1);
    for (int j = 0; j < 5; j++) {
        x--;
        std::cout << "In Thread 0 :" << x << std::endl;
    }
    //Join the thread with the main thread
    t1.join();
    std::cout << x << std::endl;
    return 0;
}
I was expecting to get different results every time (or nearly every time) I ran this program, due to the race between the two threads. However, the output is always 0, i.e. the two threads run as if they ran sequentially. Why am I getting the same results, and is there any way to simulate or force a race between the two threads?
Your sample size is rather small, and somewhat self-stalls on the continuous stdout flushes. In short, you need a bigger hammer.
If you want to see a real race condition in action, consider the following. I purposely added an atomic and non-atomic counter, sending both to the threads of the sample. Some test-run results are posted after the code:
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#include <algorithm>

void racer(std::atomic_int& cnt, int& val)
{
    for (int i = 0; i < 1000000; ++i)
    {
        ++val;
        ++cnt;
    }
}

int main(int argc, char *argv[])
{
    unsigned int N = std::thread::hardware_concurrency();
    std::atomic_int cnt{0};
    int val = 0;
    std::vector<std::thread> thrds;
    std::generate_n(std::back_inserter(thrds), N,
                    [&cnt, &val]() { return std::thread(racer, std::ref(cnt), std::ref(val)); });
    std::for_each(thrds.begin(), thrds.end(),
                  [](std::thread& thrd) { thrd.join(); });
    std::cout << "cnt = " << cnt << std::endl;
    std::cout << "val = " << val << std::endl;
    return 0;
}
Some sample runs from the above code:
cnt = 4000000
val = 1871016
cnt = 4000000
val = 1914659
cnt = 4000000
val = 2197354
Note that the atomic counter is accurate (I'm running on a dual-core i7 MacBook Air with hyperthreading, so 4 hardware threads, hence 4 million). The same cannot be said for the non-atomic counter.
There will be significant startup overhead to get the second thread going, so its execution will almost always begin after the first thread has finished the for loop, which by comparison takes almost no time at all. To see a race condition, you will need to run a computation that takes much longer, or that includes I/O or other operations taking significant time, so that the execution of the two computations actually overlaps.

unique random number generation using threads

I have a program which uses pthreads. In each thread a random number is generated using the rand() (stdlib.h) function, but it seems like every thread is generating the same random number. What is the reason for that? Am I doing something wrong? Thanks.
rand() is pseudo-random and not guaranteed to be thread-safe. Regardless, you need to seed rand():
std::srand(std::time(0)); // use current time as seed for random generator
See std::rand() at cppreference.com for more details.
A sample program may look like this:
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

boost::mutex output_mutex;

void print_n_randoms(unsigned thread_id, unsigned n)
{
    while (n--)
    {
        boost::mutex::scoped_lock lock(output_mutex);
        std::cout << "Thread " << thread_id << ": " << std::rand() << std::endl;
    }
}

int main()
{
    std::srand(std::time(0));
    boost::thread_group threads;
    for (unsigned thread_id = 1; thread_id <= 10; ++thread_id)
    {
        threads.create_thread(boost::bind(print_n_randoms, thread_id, 100));
    }
    threads.join_all();
}
Note how the pseudo-random number generator is seeded with the time only once (and not per thread).