Using std::cout in a parallel for loop using OpenMP [duplicate] - c++

This question already has answers here:
Parallelize output using OpenMP
(2 answers)
Closed 3 years ago.
I want to parallelize a for-loop in C++ using OpenMP. In that loop I want to print some results:
#pragma omp parallel for
for (int i=0; i<1000000; i++) {
if (test.successful() == true) {
std::cout << test.id() << " " << test.result() << std::endl;
}
}
The result I get without using OpenMP:
3 23923.23
1 32329.32
2 23239.45
However I get the following using a parallel for-loop:
314924.4244.5
434.
4343.43123
How can I avoid such an output?

The reason is that a chained statement like std::cout << a << " " << b << std::endl is really a sequence of separate operator<< calls, so the scheduler can switch threads mid-line and the output from different threads interleaves.
Solution: serialize the printing with #pragma omp critical. (Note that #pragma omp atomic applies only to simple scalar updates such as x += 1; it will not even compile on a stream insertion.)
#pragma omp parallel for
for (int i=0; i<1000000; i++) {
if (test.successful() == true) {
#pragma omp critical
std::cout << test.id() << " " << test.result() << std::endl;
}
}

Related

OpenMP parallel for does not speed up array sum code [duplicate]

This question already has answers here:
C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
(3 answers)
Closed 4 months ago.
I'm trying to test the speed up of OpenMP on an array sum program.
The elements are generated with a random generator to avoid optimization.
The length of the array is also set large enough to show the performance difference.
This program is built using g++ -fopenmp -g -O0 -o main main.cpp, -g -O0 are used to avoid optimization.
However, the OpenMP parallel-for code is significantly slower than the sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>
using namespace std;
long long llsum(char *vec, size_t size, int threadCount) {
long long result = 0;
size_t i;
#pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
for (i = 0; i < size; ++i) {
result += vec[i];
}
return result;
}
int main(int argc, char **argv) {
int threadCount = 12;
omp_set_num_threads(threadCount);
cout << "Your thread count is: " << threadCount << endl;
const size_t TEST_SIZE = 8000000000;
char *testArray = new char[TEST_SIZE];
std::mt19937 rng;
rng.seed(std::random_device()());
std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);
cout << "Filling arrays\n";
auto fillingStartTime = clock();
for (int i = 0; i < TEST_SIZE; ++i) {
testArray[i] = dist6(rng);
}
auto fillingEndTime = clock();
auto fillingTime = fillingEndTime - fillingStartTime;
cout << "filling time:" << fillingTime << endl;
// test omp time
for (int i = 1; i <= 3; ++i) {
cout << "Now running omp code\n";
auto ompStartTime = clock();
auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
auto ompEndTime = clock();
auto ompTime = ompEndTime - ompStartTime;
cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
}
// test sequential addition time
cout << "Now running sequential code\n";
auto seqStartTime = clock();
long long expectedResult = 0;
for (int i = 0; i < TEST_SIZE; ++i) {
expectedResult += testArray[i];
}
auto seqEndTime = clock();
auto seqTime = seqEndTime - seqStartTime;
cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;
delete[]testArray;
return 0;
}
As pointed out by @High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() measures 'active processor time', not 'elapsed time'.
See
OpenMP time and clock() give two different results
https://en.cppreference.com/w/c/chrono/clock
After switching to omp_get_wtime() and changing the loop counters from int i to size_t i, the result is more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788

OpenMP only using one thread

I am having a bit of a frustrating problem with OpenMP. When I run the following code it only seems to be running on one thread.
omp_set_num_threads(8);
#pragma omp parallel for schedule(dynamic)
for(size_t i = 0; i < jobs.size(); i++) //jobs is a vector
{
std::cout << omp_get_thread_num() << "\t" << omp_get_num_threads() << "\t" << omp_in_parallel() << std::endl;
jobs[i].run();
}
This prints...
0 1 1
for every line.
I can see using top that OpenMP is spawning as many threads as the process's taskset allows, but they are mostly idle while the program runs. The program is both compiled and linked with the -fopenmp flag with gcc. I am using Red Hat 6. I also tried adding the num_threads(8) clause to the pragma, which made no difference. The program is linked against another library which also uses OpenMP, so maybe this is the issue. Does anyone know what might cause this behavior? In all my past OpenMP experience it has just worked.
Can you print your jobs.size()?
I made a quick test and it does work:
#include <stdio.h>
#include <omp.h>
#include <iostream>
int main()
{
omp_set_num_threads(2);
#pragma omp parallel for ordered schedule(dynamic)
for(size_t i = 0; i < 4; i++) //jobs is a vector
{
#pragma omp ordered
std::cout << i << "\t" << omp_get_thread_num() << "\t" << omp_get_num_threads() << "\t" << omp_in_parallel() << std::endl;
}
return 0;
}
I got:
icpc -qopenmp test.cpp && ./a.out
0 0 2 1
1 1 2 1
2 0 2 1
3 1 2 1

num_threads clause not setting number of threads [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 4 years ago.
Improve this question
I have the following simple program
#include <iostream>
#include <omp.h>
int main() {
std::cout << "max threads: " << omp_get_max_threads() << "\n";
#pragma parallel num_threads(4)
{
int tid = omp_get_thread_num();
std::cout << "Hello from " << tid << " of " << omp_get_num_threads() << "\n";
#pragma omp for
for (int i = 0; i < 5; i++) {
std::cout << "(" << tid << ", " << i << ")\n";
}
}
}
And I am compiling with clang++ -fopenmp=libomp main.cpp. I am able to compile and run other OpenMP programs compiled in this way.
I would expect the num_threads(4) to cause the parallel region to run across 4 threads. Instead I experience the following output:
max threads: 4
Hello from 0 of 1
(0, 0)
(0, 1)
(0, 2)
(0, 3)
(0, 4)
Why is the parallel region not running across 4 threads?
You left the omp out of your parallel pragma.
#pragma omp parallel num_threads(4)

Can iterating over an unsorted data structure (like an array or tree) with multiple threads make iteration faster?

Can iterating over an unsorted data structure like an array or tree with multiple threads make it faster?
For example, I have a big array with unsorted data:
int array[1000];
and I'm searching for array[i] == 8.
Can running:
Thread 1:
for(auto i = 0; i < 500; i++)
{
if(array[i] == 8)
std::cout << "found" << std::endl;
}
Thread 2:
for(auto i = 500; i < 1000; i++)
{
if(array[i] == 8)
std::cout << "found" << std::endl;
}
be faster than normal iteration?
#update
I've written a simple test which describes the problem better:
searching int* array = new int[100000000];
and repeating it 1000 times,
I got the result:
a
Number of threads = 2
End of multithread iteration
End of normal iteration
Time with 2 threads 73581
Time with 1 thread 154070
Bool values:0
0
0
Process returned 0 (0x0) execution time : 256.216 s
Press any key to continue.
What's more, when the program was running with 2 threads, the CPU usage of the process was around ~90%, and when iterating with 1 thread it was never more than 50%.
So Smeeheey and erip are right that it can make iteration faster.
Of course it can be more tricky for less trivial problems.
And what I've learned from this test is that the compiler can optimize away the main thread's work (when I was not printing the booleans, the search loop in the main thread was ignored), but it will not do that for the other threads.
This is code I have used:
#include<cstdlib>
#include<thread>
#include<ctime>
#include<iostream>
#define SIZE_OF_ARRAY 100000000
#define REPEAT 1000
inline bool threadSearch(int* array){
for(auto i = 0; i < SIZE_OF_ARRAY/2; i++)
if(array[i] == 101) // there is no array[i]==101
return true;
return false;
}
int main(){
int i;
std::cin >> i; // stops program enabling to set real time priority of the process
clock_t with_multi_thread;
clock_t normal;
srand(time(NULL));
std::cout << "Number of threads = "
<< std::thread::hardware_concurrency() << std::endl;
int* array = new int[SIZE_OF_ARRAY];
bool true_if_found_t1 =false;
bool true_if_found_t2 =false;
bool true_if_found_normal =false;
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
array[i] = rand()%100;
with_multi_thread=clock();
for(auto j=0; j<REPEAT; j++){
std::thread t([&](){
if(threadSearch(array))
true_if_found_t1=true;
});
std::thread u([&](){
if(threadSearch(array+SIZE_OF_ARRAY/2))
true_if_found_t2=true;
});
if(t.joinable())
t.join();
if(u.joinable())
u.join();
}
with_multi_thread=(clock()-with_multi_thread);
std::cout << "End of multithread iteration" << std::endl;
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
array[i] = rand()%100;
normal=clock();
for(auto j=0; j<REPEAT; j++)
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
if(array[i] == 101) // there is no array[i]==101
true_if_found_normal=true;
normal=(clock()-normal);
std::cout << "End of normal iteration" << std::endl;
std::cout << "Time with 2 threads " << with_multi_thread<<std::endl;
std::cout << "Time with 1 thread " << normal<<std::endl;
std::cout << "Bool values:" << true_if_found_t1<<std::endl
<< true_if_found_t2<<std::endl
<<true_if_found_normal<<std::endl;// showing bool values to prevent compiler from optimization
return 0;
}
The answer is yes, it can make it faster - but not necessarily. In your case, when you're iterating over pretty small arrays, it is likely that the overhead of launching a new thread will be much higher than the benefit gained. If your array was much bigger, this overhead would shrink as a proportion of the overall runtime and eventually become worth it. Note you will only get a speed-up if your system has more than 1 physical core available to it.
Additionally, you should note that whilst the code that reads the array in your case is perfectly thread-safe, writing to std::cout is not (you will get very strange-looking output if you try this). Instead, perhaps your thread should do something like return an integer indicating the number of instances found.

Openmp can't create threads automatically

I am trying to learn how to use openmp for multi threading.
Here is my code:
#include <iostream>
#include <math.h>
#include <omp.h>
//#include <time.h>
//#include <cstdlib>
using namespace std;
bool isprime(long long num);
int main()
{
cout << "There are " << omp_get_num_procs() << " cores." << endl;
cout << 2 << endl;
//clock_t start = clock();
//clock_t current = start;
#pragma omp parallel num_threads(6)
{
#pragma omp for schedule(dynamic, 1000)
for(long long i = 3LL; i <= 1000000000000; i = i + 2LL)
{
/*if((current - start)/CLOCKS_PER_SEC > 60)
{
exit(0);
}*/
if(isprime(i))
{
cout << i << " Thread: " << omp_get_thread_num() << endl;
}
}
}
}
bool isprime(long long num)
{
if(num == 1)
{
return 0;
}
for(long long i = 2LL; i <= sqrt(num); i++)
{
if (num % i == 0)
{
return 0;
}
}
return 1;
}
The problem is that I want openmp to automatically create a number of threads based on how many cores are available. If I take out the num_threads(6), then it just uses 1 thread yet the omp_get_num_procs() correctly outputs 64.
How do I get this to work?
You neglected to mention which compiler and OpenMP implementation you are using. I'm going to guess you're using one of the ones, like PGI, which does not automatically assume the number of threads to create in a default parallel region unless asked to do so. Since you did not specify the compiler I cannot be certain that these options will actually help you, but for PGI's compilers the necessary option is -mp=allcores when compiling and linking the executable. With that added, it will cause the system to create one thread per core for parallel regions which do not specify the number of threads or have the appropriate environment variable set.
The number you're getting from omp_get_num_procs is used by default to set the limit on the number of threads, but not necessarily the number created. If you want to dynamically set the number created, set the environment variable OMP_NUM_THREADS to the desired number before running your application and it should behave as expected.
I'm not sure if I understand your question correctly, but it seems that you are almost there. Do you mean something like:
#include <omp.h>
#include <iostream>
int main(){
const int num_procs = omp_get_num_procs();
std::cout<<num_procs;
#pragma omp parallel for num_threads(num_procs) default(none)
for(long long i=0; i<10000000000LL; ++i){ // (int)1E20 would overflow int
}
return 0;
}
Unless I'm rather badly mistaken, OpenMP normally serializes I/O (at least to a single stream) so that's probably at least part of where your problem is arising. Removing that from the loop, and massaging a bit of the rest (not much point in working at parallelizing until you have reasonably efficient serial code), I end up with something like this:
#include <iostream>
#include <math.h>
#include <omp.h>
using namespace std;
bool isprime(long long num);
int main()
{
unsigned long long total = 0;
cout << "There are " << omp_get_num_procs() << " cores.\n";
#pragma omp parallel for reduction(+:total)
for(long long i = 3LL; i < 100000000; i += 2LL)
if(isprime(i))
total += i;
cout << "Total: " << total << "\n";
}
bool isprime(long long num) {
if (num == 2)
return 1;
if(num == 1 || num % 2 == 0)
return 0;
unsigned long long limit = sqrt(num);
for(long long i = 3LL; i <= limit; i+=2)
if (num % i == 0)
return 0;
return 1;
}
This doesn't print out the thread number, but timing it I get something like this:
Real 78.0686
User 489.781
Sys 0.125
Note the fact that the "User" time is more than 6x as large as the "Real" time, indicating that the load is being distributed across the 8 cores available on this machine with about 80% efficiency. With a little more work, you might be able to improve that further, but even with this simple version we're seeing considerably more than one core being used (on your 64-core machine, we should see at least a 50:1 improvement over single-threaded code, and probably quite a bit better than that).
The only problem I see with your code is that when you do the output you need to put it in a critical section, otherwise multiple threads can write to the same line at the same time.
See my code corrections.
In terms of seeing only one thread, I think what you might be observing is due to using dynamic scheduling. A thread running over small numbers is much quicker than one running over large numbers. When the thread with small numbers finishes and gets another batch of small numbers, it finishes quickly again while the thread with large numbers is still running. This does not mean you're only running one thread, though. In my output I see long streams of the same thread finding primes, but eventually others report as well. You have also set the chunk size to 1000, so if you, for example, only ran over 1000 numbers, only one thread would be used in the loop.
It looks to me like you're trying to find a list of primes or a sum of the number of primes. You're using trial division for that, which is much less efficient than the Sieve of Eratosthenes.
Here is an example of the Sieve of Eratosthenes which finds the primes in the first billion numbers in less than one second on my 4-core system with OpenMP.
http://create.stephan-brumme.com/eratosthenes/
I cleaned up your code a bit but did not try to optimize anything since the algorithm is inefficient anyway.
int main() {
//long long int n = 1000000000000;
long long int n = 1000000;
cout << "There are " << omp_get_num_procs() << " cores." << endl;
double dtime = omp_get_wtime();
#pragma omp parallel
{
#pragma omp for schedule(dynamic)
for(long long i = 3LL; i <= n; i = i + 2LL) {
if(isprime(i)) {
#pragma omp critical
{
cout << i << "\tThread: " << omp_get_thread_num() << endl;
}
}
}
}
dtime = omp_get_wtime() - dtime;
cout << "time " << dtime << endl;
}