Parallel std::fill has different performance on different architectures; why?

I'm attempting to write a parallel vector fill, using the following code:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <algorithm>
using namespace std;
using namespace std::chrono;
void fill_part(vector<double> & v, int ii, int num_threads)
{
    fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0);
}

int main()
{
    vector<double> v(200*1000*1000);

    high_resolution_clock::time_point t = high_resolution_clock::now();
    fill(v.begin(), v.end(), 0);
    duration<double> d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in serial.\n";

    unsigned num_threads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 1;
    cout << "Num threads: " << num_threads << '\n';
    vector<thread> threads;

    t = high_resolution_clock::now();
    for(int ii = 0; ii < num_threads; ++ii)
    {
        threads.emplace_back(fill_part, std::ref(v), ii, num_threads);
    }
    for(auto & t : threads)
    {
        if(t.joinable()) t.join();
    }
    d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in parallel.\n";
}
I tried this code on four different architectures (all Intel CPUs, though that may not matter).
The first had 4 CPUs, and the parallelization gave no speedup; the second had 4 and was 4 times as fast; the third had 4 and was twice as fast; and the last had 2 and gave no speedup.
My hypothesis is that the differences arise because on some machines the RAM bus can be saturated by a single CPU and on others it cannot, but is this correct? How can I predict which architectures will benefit from this parallelization?
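(One way to sanity-check the bandwidth hypothesis from the numbers the program prints: the vector holds 200e6 doubles, i.e. 200e6 × 8 bytes = 1.6 GB, so a fill that takes T milliseconds is writing at roughly 1600/T GB/s. Comparing the serial figure against the machine's rated memory bandwidth should show whether a single core is already at the limit.)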
Bonus question: The void fill_part function is awkward, so I wanted to do it with a lambda:
threads.emplace_back([&]{fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0); });
This compiles but terminates with a bus error; what's wrong with the lambda syntax?
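A likely answer, sketched under the assumption that the crash comes from the default capture: [&] captures ii and num_threads by reference, so every thread reads ii while the loop is still mutating it, and possibly after ii has gone out of scope. Capturing those by value, and only v by reference, avoids the dangling reference:

// ii and num_threads are captured by value (copied per thread);
// only the vector itself is captured by reference.
threads.emplace_back([&v, ii, num_threads] {
    fill(v.begin() + ii * v.size() / num_threads,
         v.begin() + (ii + 1) * v.size() / num_threads, 0);
});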

Related

OpenMP parallel for does not speed up array sum code [duplicate]

This question already has answers here:
C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
I'm trying to test the speedup of OpenMP on an array-sum program.
The elements are generated with a random generator to prevent the compiler from optimizing the sum away.
The array length is also set large enough to show the performance difference.
The program is built with g++ -fopenmp -g -O0 -o main main.cpp; -g -O0 are used to avoid optimization.
However, the OpenMP parallel for code is significantly slower than the sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>
using namespace std;
long long llsum(char *vec, size_t size, int threadCount) {
    long long result = 0;
    size_t i;
    #pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
    for (i = 0; i < size; ++i) {
        result += vec[i];
    }
    return result;
}

int main(int argc, char **argv) {
    int threadCount = 12;
    omp_set_num_threads(threadCount);
    cout << "Your thread count is: " << threadCount << endl;

    const size_t TEST_SIZE = 8000000000;
    char *testArray = new char[TEST_SIZE];

    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);

    cout << "Filling arrays\n";
    auto fillingStartTime = clock();
    for (int i = 0; i < TEST_SIZE; ++i) {
        testArray[i] = dist6(rng);
    }
    auto fillingEndTime = clock();
    auto fillingTime = fillingEndTime - fillingStartTime;
    cout << "filling time:" << fillingTime << endl;

    // test omp time
    for (int i = 1; i <= 3; ++i) {
        cout << "Now running omp code\n";
        auto ompStartTime = clock();
        auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
        auto ompEndTime = clock();
        auto ompTime = ompEndTime - ompStartTime;
        cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
    }

    // test sequential addition time
    cout << "Now running sequential code\n";
    auto seqStartTime = clock();
    long long expectedResult = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedResult += testArray[i];
    }
    auto seqEndTime = clock();
    auto seqTime = seqEndTime - seqStartTime;
    cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;

    delete[] testArray;
    return 0;
}
As pointed out by @High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() measures 'active processor time', not 'elapsed time', so it sums the CPU time of all threads.
See:
OpenMP time and clock() give two different results
https://en.cppreference.com/w/c/chrono/clock
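For reference, a minimal self-contained sketch of timing an OpenMP region with omp_get_wtime() (the loop body here is a stand-in, not the original llsum):

#include <cstdio>
#include <omp.h>

int main()
{
    double t0 = omp_get_wtime();            // wall-clock start, in seconds
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (long long i = 0; i < 100000000LL; ++i) {
        sum += 1.0;
    }
    double elapsed = omp_get_wtime() - t0;  // elapsed wall time, not summed CPU time
    std::printf("parallel sum took %.3f s (sum = %.0f)\n", elapsed, sum);
    return 0;
}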
After switching to omp_get_wtime() and fixing the int i loop counters to size_t i, the results are more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788

Parallel version of the `std::generate` performs worse than the sequential one

I'm trying to parallelize some old code using the execution policies from C++17. My sample code is below:
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <algorithm>
#include <execution>
#include <vector>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::duration<double>;
constexpr auto NUM = 100'000'000U;
double func()
{
    return rand();
}

int main()
{
    std::vector<double> v(NUM);

    // ------ feature testing
    std::cout << "__cpp_lib_execution : " << __cpp_lib_execution << std::endl;
    std::cout << "__cpp_lib_parallel_algorithm: " << __cpp_lib_parallel_algorithm << std::endl;

    // ------ fill the vector with random numbers sequentially
    auto const startTime1 = Clock::now();
    std::generate(std::execution::seq, v.begin(), v.end(), func);
    Duration const elapsed1 = Clock::now() - startTime1;
    std::cout << "std::execution::seq: " << elapsed1.count() << " sec." << std::endl;

    // ------ fill the vector with random numbers in parallel
    auto const startTime2 = Clock::now();
    std::generate(std::execution::par, v.begin(), v.end(), func);
    Duration const elapsed2 = Clock::now() - startTime2;
    std::cout << "std::execution::par: " << elapsed2.count() << " sec." << std::endl;
}
The program output on my Linux desktop:
__cpp_lib_execution : 201902
__cpp_lib_parallel_algorithm: 201603
std::execution::seq: 0.971162 sec.
std::execution::par: 25.0349 sec.
Why does the parallel version perform 25 times worse than the sequential one?
Compiler: g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0
The thread-safety of rand is implementation-defined, which means either:
1. your code is wrong in the parallel case, or
2. it's effectively serial, with a highly contended lock, which dramatically increases the overhead in the parallel case and yields incredibly poor performance.
Based on your results, I'm guessing #2 applies, but it could be both.
Either way, the answer is: rand is a terrible test case for parallelism.
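As an illustration of the fix (a sketch, not necessarily the fastest option): give each thread its own generator, so there is no shared state to lock. Note that with GCC's libstdc++ the parallel algorithms are implemented on top of Intel TBB, so you may also need to link with -ltbb.

#include <algorithm>
#include <execution>
#include <random>
#include <vector>

int main()
{
    std::vector<double> v(100'000'000);
    std::generate(std::execution::par, v.begin(), v.end(), [] {
        // one independent generator per thread: no shared state, no lock
        thread_local std::mt19937_64 rng{std::random_device{}()};
        return static_cast<double>(rng());
    });
}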

HDF5 write fails even on OpenMP critical section

I'm trying to write a big 4-dimensional HDF5 file whose values are computed in parallel and written under mutual exclusion.
This is a minimal working example of my problem.
#include <fstream>
#include <iostream>
#include <algorithm>
#include <vector>
#include <omp.h>
#include "H5Cpp.h"
using namespace H5;
int main(int argc, char *argv[])
{
    H5File file("out.h5", H5F_ACC_TRUNC);
    const uint F_RANK = 4;
    const uint M_RANK = 3;
    const uint D = 8;
    const uint MD = 2;
    std::vector<hsize_t> fdim = {D, D, 6, 1024};
    std::vector<hsize_t> count = {1, 1, 6, 1024};
    std::vector<hsize_t> mdim = {MD, 6, 1024};

    float fillvalue = -1.0;
    DSetCreatPropList plist;
    plist.setFillValue(PredType::NATIVE_FLOAT, &fillvalue);

    DataSpace fspace(F_RANK, fdim.data());
    DataSet dataset = file.createDataSet("data", PredType::NATIVE_FLOAT, fspace, plist);

    #pragma omp parallel
    {
        uint id = omp_get_thread_num();
        std::vector<hsize_t> start(F_RANK, 0);
        DataSpace private_fspace = dataset.getSpace();
        DataSpace private_mspace(M_RANK, mdim.data());
        private_mspace.selectAll();
        std::cout << "THREAD " << id << " STARTS" << std::endl;

        #pragma omp for ordered schedule(static)
        for (uint b = 0; b < D*D / MD; b++)
        {
            // Assume this is an expensive computation
            std::vector<float> data(MD*6*1024, b);
            #pragma omp critical
            {
                private_fspace.selectNone();
                std::cout << "THREAD " << id << " SELECTED NONE" << std::endl;
                // Store batch
                for (uint p = 0; p < MD; p++)
                {
                    start[0] = b / (D*D) + p;
                    start[1] = p;
                    private_fspace.selectHyperslab(H5S_SELECT_OR, count.data(), start.data());
                    std::cout << "THREAD " << id << " SELECTED " << start[0] << ' ' << start[1] << std::endl;
                }
                std::cout << "ABOUT TO WRITE " << private_fspace.getSelectNpoints() << std::endl;
                dataset.write(data.data(), PredType::NATIVE_FLOAT, private_mspace, private_fspace);
                private_fspace.selectNone();
            }
            std::cout << "WRITTEN" << std::endl;
        }
    }
}
I'm compiling it using h5c++ example.cpp -fopenmp.
When run, many different errors can occur: some runs end in segmentation faults, others in aborts, and some even in bus errors.
There are even HDF5 library errors like these:
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,E,SL,FL,FL,.....
Sometimes it fails before writing anything, sometimes it finishes without any errors. Of course, it only fails when OMP_NUM_THREADS is set to more than 1, so the problem is in the parallelism. However, I've made every relevant variable I could private to each thread, and everything happens within a critical section, so I don't see how this is failing.
Debugging in GDB reveals that HDF5 may fail in DataSet::getSpace() with a segmentation fault, but there are so many ways it can fail that I'd say the problem is more general than this.
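One observation, offered as a guess rather than a confirmed diagnosis: stock HDF5 builds are not thread-safe (thread safety is an optional compile-time feature of the C library, and it does not cover the C++ wrappers), so even the HDF5 calls outside the critical section, such as the DataSet::getSpace() in the parallel region, can race. A minimal sketch of a safer structure, with simplified dimensions, that keeps the computation parallel but confines every HDF5 call to a serialized ordered region:

#include <vector>
#include "H5Cpp.h"
using namespace H5;

int main()
{
    // All HDF5 objects are created on the master thread before the
    // parallel region starts.
    H5File file("out.h5", H5F_ACC_TRUNC);
    std::vector<hsize_t> fdim = {64, 6, 1024};
    DataSpace fspace(3, fdim.data());
    DataSet dataset = file.createDataSet("data", PredType::NATIVE_FLOAT, fspace);

    #pragma omp parallel for ordered schedule(static)
    for (int b = 0; b < 64; b++)
    {
        // Expensive computation runs in parallel...
        std::vector<float> data(6 * 1024, float(b));

        // ...but every HDF5 call, including getSpace(), stays inside the
        // serialized ordered region.
        #pragma omp ordered
        {
            std::vector<hsize_t> start = {hsize_t(b), 0, 0};
            std::vector<hsize_t> count = {1, 6, 1024};
            DataSpace private_fspace = dataset.getSpace();
            private_fspace.selectHyperslab(H5S_SELECT_SET, count.data(), start.data());
            std::vector<hsize_t> mdim = {6, 1024};
            DataSpace private_mspace(2, mdim.data());
            dataset.write(data.data(), PredType::NATIVE_FLOAT, private_mspace, private_fspace);
        }
    }
    return 0;
}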

Remove finished threads from vector

I have a number of jobs and I want to run a subset of them in parallel, e.g. I have 100 jobs to run and want at most 10 threads running at a time. This is my current code for this problem:
#include <thread>
#include <vector>
#include <iostream>
#include <atomic>
#include <random>
#include <mutex>
int main() {
    constexpr std::size_t NUMBER_OF_THREADS(10);
    std::atomic<std::size_t> numberOfRunningJobs(0);
    std::vector<std::thread> threads;
    std::mutex maxThreadsMutex;
    std::mutex writeMutex;
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, 2);

    for (std::size_t id(0); id < 100; ++id) {
        if (numberOfRunningJobs >= NUMBER_OF_THREADS - 1) {
            maxThreadsMutex.lock();
        }
        ++numberOfRunningJobs;
        threads.emplace_back([id, &numberOfRunningJobs, &maxThreadsMutex, &writeMutex, &distribution, &generator]() {
            auto waitSeconds(distribution(generator));
            std::this_thread::sleep_for(std::chrono::seconds(waitSeconds));
            writeMutex.lock();
            std::cout << id << " " << waitSeconds << std::endl;
            writeMutex.unlock();
            --numberOfRunningJobs;
            maxThreadsMutex.unlock();
        });
    }
    for (auto &thread : threads) {
        thread.join();
    }
    return 0;
}
In the for loop I check how many jobs are running, and if a slot is free I add a new thread to the vector. At the end of each thread I decrement the number of running jobs and unlock the mutex so that a new thread can start. This solves my task, but there is one point I don't like: I need a vector of size 100 to store all the threads, and I have to join all 100 threads at the end. I want to remove each thread from the vector after it finishes, so that the vector holds at most 10 threads and I only have to join 10 at the end. I thought about passing the vector and an iterator by reference to the lambda so that I can remove the element at the end, but I don't know how. How can I optimize my code so the vector uses a maximum of 10 elements?
Since you don't seem to require extremely fine-grained thread control, I'd recommend approaching this problem with OpenMP. OpenMP is an industry-standard, directive-based approach to parallelizing C, C++, and Fortran code; every major compiler for these languages implements it.
Using it results in a significant reduction in the complexity of your code:
#include <chrono>
#include <iostream>
#include <random>
#include <thread>

int main() {
    constexpr std::size_t NUMBER_OF_THREADS(10);
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, 2);
    //Distribute the loop between threads ensuring that only
    //a specific number of threads are ever active at once.
    #pragma omp parallel for num_threads(NUMBER_OF_THREADS)
    for (std::size_t id = 0; id < 100; ++id) {
        int waitSeconds;
        #pragma omp critical //Serialize access to generator
        waitSeconds = distribution(generator);
        std::this_thread::sleep_for(std::chrono::seconds(waitSeconds));
        #pragma omp critical //Serialize access to cout
        std::cout << id << " " << waitSeconds << std::endl;
    }
    return 0;
}
To use OpenMP you compile with:
g++ main.cpp -fopenmp
Generating and directly coordinating threads is sometimes necessary, but the massive number of new languages and libraries designed to make parallelism easier speaks to the number of use cases in which a simpler path to parallelism is sufficient.
The keyword "thread pool" helped me a lot. I tried boost::asio::thread_pool, and it does what I want in the same way as my first approach. I solved my problem with:
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/post.hpp>

int main() {
    boost::asio::thread_pool threadPool(10);
    std::mutex writeMutex;
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, 2);
    std::atomic<std::size_t> currentlyRunning(0);

    for (std::size_t id(0); id < 100; ++id) {
        boost::asio::post(threadPool, [id, &writeMutex, &distribution, &generator, &currentlyRunning]() {
            ++currentlyRunning;
            auto waitSeconds(distribution(generator));
            writeMutex.lock();
            std::cout << "Start: " << id << " " << currentlyRunning << std::endl;
            writeMutex.unlock();
            std::this_thread::sleep_for(std::chrono::seconds(waitSeconds));
            writeMutex.lock();
            std::cout << "Stop: " << id << " " << waitSeconds << std::endl;
            writeMutex.unlock();
            --currentlyRunning;
        });
    }
    threadPool.join();
    return 0;
}

Boost: Creating objects and populating a vector with threads

Using this boost-asio-based thread pool (the class is named ThreadPool), I want to parallelize the population of a vector of type std::vector<boost::shared_ptr<T>>, where T is a struct containing a std::vector<int> whose content and size are determined dynamically after the struct is initialized.
Unfortunately, I'm a newbie at both C++ and multithreading, so my attempts at solving this problem have failed spectacularly. Here's an oversimplified sample program that times the non-threaded and threaded versions of the task. The threaded version's performance is horrendous...
#include "thread_pool.hpp"
#include <ctime>
#include <iostream>
#include <vector>
using namespace boost;
using namespace std;
struct T {
    vector<int> nums = {};
};

typedef boost::shared_ptr<T> Tptr;
typedef vector<Tptr> TptrVector;

void create_T(const int i, TptrVector& v) {
    v[i] = Tptr(new T());
    T& t = *v[i].get();
    for (int i = 0; i < 100; i++) {
        t.nums.push_back(i);
    }
}

int main(int argc, char* argv[]) {
    clock_t begin, end;
    double elapsed;

    // define and parse program options
    if (argc != 3) {
        cout << argv[0] << " <num iterations> <num threads>" << endl;
        return 1;
    }
    int iterations = stoi(argv[1]),
        threads = stoi(argv[2]);

    // create thread pool
    ThreadPool tp(threads);

    // non-threaded
    cout << "non-thread" << endl;
    begin = clock();
    TptrVector v(iterations);
    for (int i = 0; i < iterations; i++) {
        create_T(i, v);
    }
    end = clock();
    elapsed = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed << " seconds" << endl;

    // threaded
    cout << "threaded" << endl;
    begin = clock();
    TptrVector v2(iterations);
    for (int i = 0; i < iterations; i++) {
        tp.submit(boost::bind(create_T, i, v2));
    }
    tp.stop();
    end = clock();
    elapsed = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed << " seconds" << endl;
    return 0;
}
After doing some digging, I think the poor performance may be due to the threads vying for memory access, but my newbie status is keeping me from exploiting this insight. Can you efficiently populate the pointer vector using multiple threads, ideally with a thread pool?
You haven't provided enough details or a Minimal, Complete, and Verifiable example, so expect lots of guessing.
create_T is a "cheap" function, and scheduling a task plus the overhead of executing it is much more expensive; that is why your performance is bad. To get a benefit from parallelism you need proper work granularity and a sufficient amount of work. Granularity means that each task (in your case, one call to create_T) should be big enough to pay for the multithreading overhead. The simplest approach would be to group the create_T calls into bigger tasks, as sketched below.
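A minimal sketch of that grouping, building on the sample above (create_T_range and the chunk size of 4096 are illustrative, not taken from the original code; boost::ref makes the binder share the vector instead of copying it, and std::min requires <algorithm>):

// Each task populates a whole range of slots, amortizing the
// scheduling overhead over many create_T calls.
void create_T_range(const int begin, const int end, TptrVector& v) {
    for (int i = begin; i < end; i++) {
        create_T(i, v);
    }
}

// in main(), submit one task per chunk instead of one per element:
const int chunk = 4096; // tune so each task clearly outweighs the overhead
for (int i = 0; i < iterations; i += chunk) {
    tp.submit(boost::bind(create_T_range, i,
                          std::min(i + chunk, iterations), boost::ref(v2)));
}
tp.stop();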