OpenMP only using one thread - C++

I am having a bit of a frustrating problem with OpenMP. When I run the following code it only seems to be running on one thread.
omp_set_num_threads(8);
#pragma omp parallel for schedule(dynamic)
for (size_t i = 0; i < jobs.size(); i++) // jobs is a vector
{
    std::cout << omp_get_thread_num() << "\t" << omp_get_num_threads() << "\t" << omp_in_parallel() << std::endl;
    jobs[i].run();
}
This prints...
0 1 1
for every line.
I can see with top that OpenMP is spawning as many threads as I have allowed the process via taskset, but they are mostly idle while the loop runs. The program is both compiled and linked with the -fopenmp flag using GCC, on Red Hat 6. I also tried adding the num_threads(8) clause to the pragma, which made no difference. The program is linked against another library that also uses OpenMP, so maybe that is the issue. Does anyone know what might cause this behavior? In all my past OpenMP experience it has just worked.

Can you print your jobs.size()?
I made a quick test and it does work:
#include <stdio.h>
#include <omp.h>
#include <iostream>

int main()
{
    omp_set_num_threads(2);
    #pragma omp parallel for ordered schedule(dynamic)
    for (size_t i = 0; i < 4; i++)
    {
        #pragma omp ordered
        std::cout << i << "\t" << omp_get_thread_num() << "\t" << omp_get_num_threads() << "\t" << omp_in_parallel() << std::endl;
    }
    return 0;
}
I got:
icpc -qopenmp test.cpp && ./a.out
0 0 2 1
1 1 2 1
2 0 2 1
3 1 2 1
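If jobs.size() turns out to be large enough, it may also be worth checking what the runtime reports just before the loop. A minimal sketch of such a check (illustrative, not the original code):
#include <omp.h>
#include <cstdlib>
#include <iostream>

int main()
{
    omp_set_num_threads(8);
    const char* env = std::getenv("OMP_NUM_THREADS");
    // If max threads is already 1 here, something (the other library, an
    // environment variable, or an earlier omp_set_num_threads call) has
    // limited the team size before the parallel region is ever reached.
    std::cout << "max threads: " << omp_get_max_threads() << "\n"
              << "dynamic adjustment: " << omp_get_dynamic() << "\n"
              << "OMP_NUM_THREADS: " << (env ? env : "(unset)") << std::endl;
    return 0;
}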

Related

OpenMP parallel for does not speed up array sum code [duplicate]

This question already has answers here:
C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
(3 answers)
Closed 4 months ago.
I'm trying to test the speed-up of OpenMP on an array sum program.
The elements are generated with a random generator to avoid compiler optimization.
The array length is also set large enough to show the performance difference.
This program is built with g++ -fopenmp -g -O0 -o main main.cpp; -g -O0 are used to avoid optimization.
However, the OpenMP parallel for code is significantly slower than the sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>
using namespace std;

long long llsum(char *vec, size_t size, int threadCount) {
    long long result = 0;
    size_t i;
    #pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
    for (i = 0; i < size; ++i) {
        result += vec[i];
    }
    return result;
}

int main(int argc, char **argv) {
    int threadCount = 12;
    omp_set_num_threads(threadCount);
    cout << "Your thread count is: " << threadCount << endl;

    const size_t TEST_SIZE = 8000000000;
    char *testArray = new char[TEST_SIZE];
    std::mt19937 rng;
    rng.seed(std::random_device()());
    std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);

    cout << "Filling arrays\n";
    auto fillingStartTime = clock();
    for (int i = 0; i < TEST_SIZE; ++i) {
        testArray[i] = dist6(rng);
    }
    auto fillingEndTime = clock();
    auto fillingTime = fillingEndTime - fillingStartTime;
    cout << "filling time:" << fillingTime << endl;

    // test omp time
    for (int i = 1; i <= 3; ++i) {
        cout << "Now running omp code\n";
        auto ompStartTime = clock();
        auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
        auto ompEndTime = clock();
        auto ompTime = ompEndTime - ompStartTime;
        cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
    }

    // test sequential addition time
    cout << "Now running sequential code\n";
    auto seqStartTime = clock();
    long long expectedResult = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedResult += testArray[i];
    }
    auto seqEndTime = clock();
    auto seqTime = seqEndTime - seqStartTime;
    cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;

    delete[] testArray;
    return 0;
}
As pointed out by @High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() measures 'active processor time' (CPU time summed over all threads), not 'elapsed time'.
See
OpenMP time and clock() give two different results
https://en.cppreference.com/w/c/chrono/clock
After switching to omp_get_wtime() and changing the loop index from int i to size_t i, the result is more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788
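For reference, a minimal self-contained sketch of the corrected measurement (a size_t loop index and omp_get_wtime() wall-clock timing; the array size here is smaller than in the listing above just to keep the check quick):
#include <omp.h>
#include <iostream>

int main()
{
    const size_t TEST_SIZE = 1000000000;
    char *testArray = new char[TEST_SIZE];
    for (size_t i = 0; i < TEST_SIZE; ++i)   // size_t index, as in the fix above
        testArray[i] = static_cast<char>(i % 5);

    double t0 = omp_get_wtime();             // wall-clock seconds, not summed CPU time
    long long result = 0;
    #pragma omp parallel for reduction(+: result) schedule(guided)
    for (size_t i = 0; i < TEST_SIZE; ++i)
        result += testArray[i];
    double t1 = omp_get_wtime();

    std::cout << "omp time: " << (t1 - t0) << " s, result: " << result << std::endl;
    delete[] testArray;
    return 0;
}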

How to make parallel cudaMalloc fast?

When allocating a lot of memory on 4 distinct NVIDIA V100 GPUs, I observe the following behavior with regard to parallelization via OpenMP:
Using the #pragma omp parallel for directive, and therefore making the cudaMalloc calls on each GPU in parallel, results in the same performance as doing it completely serially. I tested this and validated the same effect on two HPC systems: an IBM Power AC922 and an AWS EC2 p3dn.24xlarge. (The numbers below were obtained on the Power machine.)
./test 4000000000
# serial
GPU 0: 0.472018550
GPU 1: 0.325776811
GPU 2: 0.334342752
GPU 3: 0.337432169
total: 1.469773541
# parallel
GPU 0: 1.199741600
GPU 2: 1.200597044
GPU 3: 1.200619017
GPU 1: 1.482700315
total: 1.493352924
How can I make the parallelization faster?
Here is my code:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>  // std::stoull

int main(int argc, char* argv[]) {
    size_t num_elements = std::stoull(argv[1]);

    auto t0s = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i)
    {
        auto t0is = std::chrono::high_resolution_clock::now();
        cudaSetDevice(i);
        int* ptr;
        cudaMalloc((void**)&ptr, sizeof(int) * num_elements);
        auto t1is = std::chrono::high_resolution_clock::now();
        std::cout << "GPU " << i << ": " << std::fixed << std::setprecision(9)
                  << std::chrono::duration<double>(t1is - t0is).count() << std::endl;
    }
    auto t1s = std::chrono::high_resolution_clock::now();
    std::cout << "total: " << std::fixed << std::setprecision(9)
              << std::chrono::duration<double>(t1s - t0s).count() << std::endl;
    return 0;
}
You can compile the microbenchmark with:
nvcc -std=c++11 -Xcompiler -fopenmp -O3 test.cu -o test
I also tried std::thread instead of OpenMP with the same results.
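A std::thread variant of the benchmark might look roughly like the following sketch (illustrative only, not the exact code that was run):
// Sketch of a std::thread variant of the benchmark above (illustrative only).
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

int main(int argc, char* argv[])
{
    size_t num_elements = std::stoull(argv[1]);
    auto t0 = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([i, num_elements] {
            cudaSetDevice(i);                                    // bind this thread to GPU i
            int* ptr;
            cudaMalloc((void**)&ptr, sizeof(int) * num_elements);
        });
    }
    for (auto& t : workers) t.join();
    auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << "total: " << std::chrono::duration<double>(t1 - t0).count() << std::endl;
    return 0;
}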

Using std::cout in a parallel for loop using OpenMP [duplicate]

This question already has answers here:
Parallelize output using OpenMP
(2 answers)
Closed 3 years ago.
I want to parallelize a for-loop in C++ using OpenMP. In that loop I want to print some results:
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
    if (test.successful() == true) {
        std::cout << test.id() << " " << test.result() << std::endl;
    }
}
The result I get without using OpenMP:
3 23923.23
1 32329.32
2 23239.45
However I get the following using a parallel for-loop:
314924.4244.5
434.
4343.43123
How can I avoid such an output?
The reason is that the separate stream insertions are not performed as one atomic operation, so output from different threads is interleaved while a line is still being written to standard output.
Solution: protect the printing statement with a critical section using #pragma omp critical (#pragma omp atomic only applies to simple scalar updates and cannot be applied to a stream insertion):
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
    if (test.successful() == true) {
        #pragma omp critical
        std::cout << test.id() << " " << test.result() << std::endl;
    }
}
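If serializing every print with a critical section is too costly, another common pattern is to format each line into a thread-private string and issue a single insertion per line, which in practice keeps lines intact. A sketch (it assumes test can be queried concurrently and requires #include <sstream>):
#pragma omp parallel for
for (int i = 0; i < 1000000; i++) {
    if (test.successful()) {
        std::ostringstream line;                           // private to this iteration
        line << test.id() << " " << test.result() << '\n';
        std::cout << line.str();                           // one insertion per line
    }
}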

omp_get_max_threads() returns 1 in parallel region, but it should be 8

I'm building a complex C++ project on Linux that uses OpenMP, compiled with CMake and GCC 7.
The strange problem I'm encountering in this particular project is that OpenMP is clearly working, but it thinks that only 1 thread is supported when it should be 8. However, if I manually specify the number of threads, it does indeed accelerate the code.
logOut << "In parallel? " << omp_in_parallel() << std::endl;
logOut << "Num threads = " << omp_get_num_threads() << std::endl;
logOut << "Max threads = " << omp_get_max_threads() << std::endl;
logOut << "Entering my parallel region: " << std::endl;

// without num_threads(5), only 1 thread is created
#pragma omp parallel num_threads(5)
{
    #pragma omp single nowait
    {
        logOut << "In parallel? " << omp_in_parallel() << std::endl;
        logOut << "Num threads = " << omp_get_num_threads() << std::endl;
        logOut << "Max threads = " << omp_get_max_threads() << std::endl;
    }
}
Output:
[openmp_test] In parallel? 0
[openmp_test] Num threads = 1
[openmp_test] Max threads = 1
[openmp_test] Entering my parallel region:
[openmp_test] In parallel? 1
[openmp_test] Num threads = 5
[openmp_test] Max threads = 1
What makes it even stranger is that a simple standalone OpenMP test program correctly reports the maximum number of threads as 8, both inside and outside a parallel region.
I've been combing through all the CMake files trying to find any indicator of why this project behaves differently, but I've turned up nothing so far. There is no call to omp_set_num_threads in any of my project files, and I can confirm that OMP_NUM_THREADS is not set. Furthermore, this problem never happened when I compiled the same project on Windows with MSVC.
Any ideas what the problem could be?
(EDIT: I've expanded the code sample to show that it is not a nested parallel block)
CPU: Intel(R) Core(TM) i7-6700K
OS: Manjaro Linux 17.0.2
Compiler: GCC 7.1.1 20170630
_OPENMP = 201511 (I'm guessing that means OpenMP 4.5)
The values you are seeing inside the parallel region seem correct (assuming that OMP_NESTED is not true). omp_get_max_threads() returns the maximum number of threads that you might obtain if you were to go parallel from the current thread. Since you are already inside a parallel region (and we're assuming that nested parallelism is disabled), that will be one.
3.2.3 omp_get_max_threads
Summary
The omp_get_max_threads routine returns an upper bound on the number of threads that could be used
to form a new team if a parallel construct without a num_threads
clause were encountered after execution returns from this routine.
That doesn't explain why you see the value one outside the parallel region, though. (But it does answer the question in the title, to which the answer is "one is the correct answer").
Your program behaves exactly as if omp_set_num_threads(1) had been called beforehand.
Consider this snippet:
#include <iostream>
#include <string>
#include <vector>
#include <omp.h>

int main() {
    omp_set_num_threads(1);
    std::cout << "before parallel section: " << std::endl;
    std::cout << "Num threads = " << omp_get_num_threads() << std::endl;
    std::cout << "Max threads = " << omp_get_max_threads() << std::endl;

    // without num_threads(5), only 1 thread is created
    #pragma omp parallel num_threads(5)
    {
        #pragma omp single
        {
            std::cout << "inside parallel section: " << std::endl;
            std::cout << "Num threads = " << omp_get_num_threads() << std::endl;
            std::cout << "Max threads = " << omp_get_max_threads() << std::endl;
        }
    }
    return 0;
}
the output is
before parallel section:
Num threads = 1
Max threads = 1
inside parallel section:
Num threads = 5
Max threads = 1
When I run it setting the number of threads to 4 instead of 1 (it would be 8 on your machine), the output is as expected:
before parallel section:
Num threads = 1
Max threads = 4
inside parallel section:
Num threads = 5
Max threads = 4
Have you tried calling omp_set_num_threads(8) at the beginning of your code? Or could the number of threads have been set to 1 before this code runs (for example in a function called earlier)?
Another explanation could be that the OpenMP runtime doesn't find it necessary to use more than one thread, because only a single construct is executed inside the parallel region. In that case, try adding some work that several threads can execute (e.g. incrementing all the values of a large array of integers, or calling omp_get_thread_num()) outside the single block but inside the parallel region, as sketched below; the reported number of threads should then be different. Calling omp_set_num_threads only sets an upper limit on the number of threads used.
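A minimal sketch of that suggestion (illustrative only; it just gives every thread something to report outside the single block):
#include <omp.h>
#include <iostream>

int main()
{
    #pragma omp parallel num_threads(5)
    {
        // Work executed by every thread in the team, outside the single block.
        #pragma omp critical
        std::cout << "hello from thread " << omp_get_thread_num()
                  << " of " << omp_get_num_threads() << std::endl;

        #pragma omp single
        std::cout << "max threads inside: " << omp_get_max_threads() << std::endl;
    }
    return 0;
}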

OpenMP - parallel code has unexpected results

#include "/usr/lib/gcc/i686-linux-gnu/4.6/include/omp.h"
#include <iostream>
#include<list>
using namespace std;
int main()
{
list<int> lst;
for(int i=0;i<5;i++)
lst.push_back(i);
#pragma omp parallel for
for(int i=0;i<5;i++)
{
cout<<i<<" "<<omp_get_thread_num()<<endl;
}
}
I expect to get something like this:
0 0
1 0
2 0
3 1
4 1
However, sometimes I can get this result:
30 0
1 0
2 0
1
4 1
or even this kind of result:
30 1 0
4 1
1 0
2 0
I know this is because the output statement:
cout << i << " " << omp_get_thread_num() << endl;
gets split into small pieces, and the pieces from different threads are interleaved in the output.
Can anyone tell me how to prevent this from happening?
Thanks.
Standard output streams are NOT synchronized at the statement level!
The only guarantee the standard gives is that individual characters are written without a data race; characters from different threads may still interleave.
You either need a lock - which defeats the point of parallelization - or you could drop the << i, which should result in quasi-synchronized behavior.
The loop iterations are executed concurrently and in no particular order, which is why the output is unordered.
If your problem is the 30 in
30 0
1 0
2 0
1
4 1
then stay cool, there is no 30, but 3 and 0. You still have, as expected, an unordered row of [0..4]:
3 0 0
1 0
2 0
1
4 1
The only thing you can't tell is which of the 0s and which of the 1s is not a thread number.
Your code
#pragma omp parallel for
for (int i = 0; i < 5; i++)
{
    cout << i << " " << omp_get_thread_num() << endl;
}
is equivalent to
#pragma omp parallel for
for (int i = 0; i < 5; i++)
{
    cout << i;
    cout << " ";
    cout << omp_get_thread_num();
    cout << endl;
}
Calls to << in the different threads may be executed in any order. For instance, cout << i; in thread 3 may be followed by cout << i; in thread 0, which may be followed by cout << " "; in thread 3, and so on, resulting in the garbled output 30 ....
The correct way is to rewrite the code so that each thread calls cout << only once in the loop:
#pragma omp parallel for
for (int i = 0; i < 5; i++)
{
    stringstream ss;
    ss << i << " " << omp_get_thread_num() << '\n';
    cout << ss.str();
}
You can create an array (of size 5) containing which thread handled which index and then print it outside the parallel loop.
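A minimal sketch of that idea (the size 5 matches the loop above):
#include <omp.h>
#include <iostream>

int main()
{
    int owner[5];
    #pragma omp parallel for
    for (int i = 0; i < 5; i++)
        owner[i] = omp_get_thread_num();   // record which thread handled index i

    for (int i = 0; i < 5; i++)            // print sequentially, so the output stays ordered
        std::cout << i << " " << owner[i] << std::endl;
    return 0;
}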