This question already has answers here:
C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
(3 answers)
Closed 4 months ago.
I'm trying to test the speedup of OpenMP on an array sum program.
The elements are generated with a random generator to avoid optimization.
The length of the array is also set large enough to make the performance difference visible.
The program is built with g++ -fopenmp -g -O0 -o main main.cpp; -g -O0 are used to avoid optimization.
However, the OpenMP parallel for code is significantly slower than the sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>

using namespace std;

long long llsum(char *vec, size_t size, int threadCount) {
    long long result = 0;
    size_t i;
#pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
    for (i = 0; i < size; ++i) {
        result += vec[i];
    }
    return result;
}

int main(int argc, char **argv) {
    int threadCount = 12;
    omp_set_num_threads(threadCount);
    cout << "Your thread count is: " << threadCount << endl;

    const size_t TEST_SIZE = 8000000000;
    char *testArray = new char[TEST_SIZE];

    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);

    cout << "Filling arrays\n";
    auto fillingStartTime = clock();
    for (int i = 0; i < TEST_SIZE; ++i) {
        testArray[i] = dist6(rng);
    }
    auto fillingEndTime = clock();
    auto fillingTime = fillingEndTime - fillingStartTime;
    cout << "filling time:" << fillingTime << endl;

    // test omp time
    for (int i = 1; i <= 3; ++i) {
        cout << "Now running omp code\n";
        auto ompStartTime = clock();
        auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
        auto ompEndTime = clock();
        auto ompTime = ompEndTime - ompStartTime;
        cout << i * 2 << "thread omp time:" << ompTime << endl
             << "result: " << ompResult << endl;
    }

    // test sequential addition time
    cout << "Now running sequential code\n";
    auto seqStartTime = clock();
    long long expectedResult = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedResult += testArray[i];
    }
    auto seqEndTime = clock();
    auto seqTime = seqEndTime - seqStartTime;
    cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;

    delete[] testArray;
    return 0;
}
As pointed out by @High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() measures 'active processor time', not 'elapsed (wall-clock) time'.
See:
OpenMP time and clock() give two different results
https://en.cppreference.com/w/c/chrono/clock
After switching to omp_get_wtime() and changing the loop counters from int i to size_t i, the results are more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788
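The updated code isn't shown here; a minimal sketch of the changed parts (my own reconstruction, assuming nothing else in the program changed) looks like this:

// omp_get_wtime() returns elapsed wall-clock time in seconds, and a size_t
// counter avoids overflowing int with TEST_SIZE = 8000000000.
double ompStartTime = omp_get_wtime();
auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
double ompEndTime = omp_get_wtime();
cout << i * 2 << "thread omp time:" << (ompEndTime - ompStartTime) << endl;

double seqStartTime = omp_get_wtime();
long long expectedResult = 0;
for (size_t i = 0; i < TEST_SIZE; ++i) {
    expectedResult += testArray[i];
}
double seqEndTime = omp_get_wtime();
cout << "sequential time: " << (seqEndTime - seqStartTime) << endl;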
Related
When allocating a lot of memory on 4 distinct NVIDIA V100 GPUs, I observe the following behavior with regard to parallelization via OpenMP:
Using the #pragma omp parallel for directive, and therefore making the cudaMalloc calls on each GPU in parallel, results in the same performance as doing it completely serially. I tested this, and observed the same effect, on two HPC systems: an IBM Power AC922 and an AWS EC2 p3dn.24xlarge. (The numbers below were obtained on the Power machine.)
./test 4000000000
# serial
GPU 0: 0.472018550
GPU 1: 0.325776811
GPU 2: 0.334342752
GPU 3: 0.337432169
total: 1.469773541
# parallel
GPU 0: 1.199741600
GPU 2: 1.200597044
GPU 3: 1.200619017
GPU 1: 1.482700315
total: 1.493352924
How can I make the parallelization faster?
Here is my code:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>   // std::stoull

int main(int argc, char* argv[]) {
    size_t num_elements = std::stoull(argv[1]);

    auto t0s = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
    for (int i = 0; i < 4; ++i)
    {
        auto t0is = std::chrono::high_resolution_clock::now();
        cudaSetDevice(i);
        int* ptr;
        cudaMalloc((void**)&ptr, sizeof(int) * num_elements);
        auto t1is = std::chrono::high_resolution_clock::now();
        std::cout << "GPU " << i << ": " << std::fixed << std::setprecision(9)
                  << std::chrono::duration<double>(t1is - t0is).count() << std::endl;
    }
    auto t1s = std::chrono::high_resolution_clock::now();
    std::cout << "total: " << std::fixed << std::setprecision(9)
              << std::chrono::duration<double>(t1s - t0s).count() << std::endl;

    return 0;
}
You can compile the microbenchmark with:
nvcc -std=c++11 -Xcompiler -fopenmp -O3 test.cu -o test
I also tried std::thread instead of OpenMP with the same results.
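That std::thread attempt isn't shown; a sketch of what it might look like (my reconstruction, not the original code) is:

#include <thread>
#include <vector>

// One thread per GPU: each thread binds to its device and performs the allocation.
std::vector<std::thread> workers;
for (int i = 0; i < 4; ++i) {
    workers.emplace_back([i, num_elements] {
        cudaSetDevice(i);
        int* ptr;
        cudaMalloc((void**)&ptr, sizeof(int) * num_elements);
    });
}
for (auto& w : workers) {
    w.join();
}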
I was trying to measure the time required to sample 7680-bit primes with the FLINT library. I ran a loop for 100 iterations and computed the average time per prime. On my Mac, the code below took more than 5 hours to run (I left it running at 2 o'clock without closing the lid; when I checked again at 7 o'clock it was still running). Yet it finally reported an average of "33.3442" seconds. How is this possible?
#include "fmpz.h"
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
int main() {
int count = 100;
int length = 7680;
fmpz_t primes[count];
flint_rand_t state;
flint_randinit(state);
for (int i = 0; i < count; i++)
fmpz_init(primes[i]);
auto start = high_resolution_clock::now();
for (int i = 0; i < count; i++)
{
while (true)
{
fmpz_randbits(primes[i], state, length);
if (fmpz_is_probabprime(primes[i]))
break;
}
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Generating random primes of length " << length << " " << ((double)duration.count()/1000000)/count << endl;
}
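One way to narrow this down (a diagnostic sketch, not part of the original program) is to time the same loop with both a wall clock and CPU time; if the two disagree badly, the puzzle lies in how the time was measured rather than in the work itself:

#include <chrono>
#include <ctime>
#include <iostream>

int main() {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    // ... the prime-sampling loop would go here ...

    std::clock_t cpu_end = std::clock();
    auto wall_end = std::chrono::steady_clock::now();

    std::cout << "wall time: "
              << std::chrono::duration<double>(wall_end - wall_start).count() << " s\n"
              << "CPU time:  "
              << double(cpu_end - cpu_start) / CLOCKS_PER_SEC << " s\n";
}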
I'm trying to write a big 4-dimensional HDF5 file where the values are computed in parallel and written under mutual exclusion.
This is a minimal working example of my problem.
#include <fstream>
#include <iostream>
#include <algorithm>
#include <vector>
#include <omp.h>
#include "H5Cpp.h"

using namespace H5;

int main(int argc, char *argv[])
{
    H5File file("out.h5", H5F_ACC_TRUNC);

    const uint F_RANK = 4;
    const uint M_RANK = 3;
    const uint D = 8;
    const uint MD = 2;

    std::vector<hsize_t> fdim = {D, D, 6, 1024};
    std::vector<hsize_t> count = {1, 1, 6, 1024};
    std::vector<hsize_t> mdim = {MD, 6, 1024};

    float fillvalue = -1.0;
    DSetCreatPropList plist;
    plist.setFillValue(PredType::NATIVE_FLOAT, &fillvalue);

    DataSpace fspace(F_RANK, fdim.data());
    DataSet dataset = file.createDataSet("data", PredType::NATIVE_FLOAT, fspace, plist);

    #pragma omp parallel
    {
        uint id = omp_get_thread_num();
        std::vector<hsize_t> start(F_RANK, 0);
        DataSpace private_fspace = dataset.getSpace();
        DataSpace private_mspace(M_RANK, mdim.data());
        private_mspace.selectAll();
        std::cout << "THREAD " << id << " STARTS" << std::endl;

        #pragma omp for ordered schedule(static)
        for (uint b = 0; b < D*D / MD; b++)
        {
            // Assume this is an expensive computation
            std::vector<float> data(MD*6*1024, b);

            #pragma omp critical
            {
                private_fspace.selectNone();
                std::cout << "THREAD " << id << " SELECTED NONE" << std::endl;
                // Store batch
                for (uint p = 0; p < MD; p++)
                {
                    start[0] = b / (D*D) + p;
                    start[1] = p;
                    private_fspace.selectHyperslab(H5S_SELECT_OR, count.data(), start.data());
                    std::cout << "THREAD " << id << " SELECTED " << start[0] << ' ' << start[1] << std::endl;
                }
                std::cout << "ABOUT TO WRITE " << private_fspace.getSelectNpoints() << std::endl;
                dataset.write(data.data(), PredType::NATIVE_FLOAT, private_mspace, private_fspace);
                private_fspace.selectNone();
            }
            std::cout << "WRITTEN" << std::endl;
        }
    }
}
I'm compiling it using h5c++ example.cpp -fopenmp.
When run, many different errors can occur: some runs end in segmentation faults, others in aborts or even bus errors.
There are even HDF5 library errors like these:
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,E,SL,FL,FL,.....
Sometimes it fails before writing anything, sometimes it finishes without any errors. Of course, it only fails when OMP_NUM_THREADS is set to more than 1, so the problem lies in the parallelism. However, I've made every relevant variable I could private to each thread, and everything happens within a critical section, so I don't see how this is failing.
Debugging with GDB shows HDF5 can fail in DataSet::getSpace() with a segmentation fault, but there are so many ways it can fail that I'd say the problem is more general than that.
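One experiment worth sketching (this is my own sketch, under the assumption that a default, non-thread-safe HDF5 build is in use, in which case no two threads may be inside the library at once) is to serialize even the per-thread setup calls such as dataset.getSpace():

#include <memory>  // std::unique_ptr, only needed for this sketch

#pragma omp parallel
{
    uint id = omp_get_thread_num();
    std::vector<hsize_t> start(F_RANK, 0);
    std::unique_ptr<DataSpace> private_fspace, private_mspace;

    // Serialize every HDF5 call, including the per-thread dataspace setup.
    #pragma omp critical
    {
        private_fspace.reset(new DataSpace(dataset.getSpace()));
        private_mspace.reset(new DataSpace(M_RANK, mdim.data()));
        private_mspace->selectAll();
    }

    #pragma omp for ordered schedule(static)
    for (uint b = 0; b < D*D / MD; b++)
    {
        std::vector<float> data(MD*6*1024, b);  // the expensive computation stays parallel
        #pragma omp critical
        {
            // ... the original selectNone()/selectHyperslab()/write() sequence,
            //     using *private_fspace and *private_mspace, unchanged ...
        }
    }
}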
I am sort of new to C++, but I was wondering whether this basic (and possibly sloppy) program is actually running multiple threads at a time or whether it's just pooling.
This is a console application running in Visual C++ 2015:
#include <string>
#include "stdafx.h"
#include <iostream>
#include <sstream>
#include <thread>
#include <stdio.h>

using namespace std;

int temp1 = 0;
int num1 = 0;
int temp2 = 0;
int num2 = 0;

void math1() {
    int running_total = 23;
    for (int i = 0; i < 999999999; i++)
    {
        running_total = 58 * running_total + i;
    }
}

int math2() {
    int running_total = 23;
    for (int i = 0; i < 999999999; i++)
    {
        running_total = 58 * running_total + i;
    }
    return 0;
}

int main()
{
    unsigned concurentThreadsSupported = std::thread::hardware_concurrency();
    cout << "Current Number of CPU threads: " << concurentThreadsSupported << endl;

    thread t1(math1);
    thread t2(math2);
    t1.join();
    t2.join();

    cout << "1: " << num1 << endl;
    cout << "2: " << num2 << endl;

    system("pause");
    return 0;
}
I notice that when I run the code with thread t1(math1); thread t2(math2); t1.join(); t2.join();, it uses 25% of my CPU in total for 3.5 seconds, but when I use
thread t1(math1);
t1.join();
thread t2(math2);
t2.join();
it uses ~13% of the CPU for almost 7 seconds.
Is this actually multithreading?
thread t1(math1); thread t2(math2); t1.join(); t2.join(); waits for t1 to finish while t2 is also running. The math1 and math2 functions do the same work, so they finish at approximately the same time, which is optimal (they could just as well be a single function).
As for the numbers you're seeing: you clearly have a CPU with 8 logical cores. The multithreaded version keeps two hardware threads busy (2 / 8 = 25%), while the single-threaded version keeps only one busy (1 / 8 = 12.5%), and it therefore takes twice as long; it's that simple.
I'm attempting to write a parallel vector fill, using the following code:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <algorithm>

using namespace std;
using namespace std::chrono;

void fill_part(vector<double> & v, int ii, int num_threads)
{
    fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0);
}

int main()
{
    vector<double> v(200*1000*1000);

    high_resolution_clock::time_point t = high_resolution_clock::now();
    fill(v.begin(), v.end(), 0);
    duration<double> d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in serial.\n";

    unsigned num_threads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 1;
    cout << "Num threads: " << num_threads << '\n';

    vector<thread> threads;
    t = high_resolution_clock::now();
    for(int ii = 0; ii < num_threads; ++ii)
    {
        threads.emplace_back(fill_part, std::ref(v), ii, num_threads);
    }
    for(auto & t : threads)
    {
        if(t.joinable()) t.join();
    }
    d = high_resolution_clock::now() - t;
    cout << "Filling the vector took " << duration_cast<milliseconds>(d).count()
         << " ms in parallel.\n";
}
I tried this code on four different machines (all with Intel CPUs, but that may not matter).
The first had 4 CPUs, and the parallelization gave no speedup; the second had 4 and was 4 times as fast; the third had 4 and was twice as fast; and the last had 2 and gave no speedup.
My hypothesis is that the differences come down to whether the RAM bus can be saturated by a single CPU, but is this correct? How can I predict which architectures will benefit from this parallelization?
Bonus question: The void fill_part function is awkward, so I wanted to do it with a lambda:
threads.emplace_back([&]{fill(v.begin() + ii*v.size()/num_threads, v.begin() + (ii+1)*v.size()/num_threads, 0); });
This compiles but terminates with a bus error; what's wrong with the lambda syntax?
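For reference, here is a version of that call that captures ii and num_threads by value (a sketch, assuming the crash comes from the [&] capture of the loop variable ii, which keeps changing and goes out of scope while the threads are still running):

threads.emplace_back([&v, ii, num_threads] {
    // ii and num_threads are copied, so each thread keeps the values it was
    // created with; only the vector is captured by reference.
    fill(v.begin() + ii * v.size() / num_threads,
         v.begin() + (ii + 1) * v.size() / num_threads, 0);
});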