I am experimenting with OpenMP. I wrote some code to check its performance. On a 4-core single Intel CPU with Kubuntu 11.04, the following program compiled with OpenMP is around 20 times slower than the program compiled without OpenMP. Why?
I compiled it with: g++ -g -O2 -funroll-loops -fomit-frame-pointer -march=native -fopenmp
#include <math.h>
#include <iostream>
using namespace std;
int main ()
{
long double i=0;
long double k=0.7;
#pragma omp parallel for reduction(+:i)
for(int t=1; t<300000000; t++){
for(int n=1; n<16; n++){
i=i+pow(k,n);
}
}
cout << i<<"\t";
return 0;
}
The problem is that the variable k is considered to be a shared variable, so it has to be synced between the threads.
A possible solution to avoid this is:
#include <math.h>
#include <iostream>
using namespace std;
int main ()
{
long double i=0;
#pragma omp parallel for reduction(+:i)
for(int t=1; t<30000000; t++){
long double k=0.7;
for(int n=1; n<16; n++){
i=i+pow(k,n);
}
}
cout << i<<"\t";
return 0;
}
Following the hint of Martin Beckett in the comment below, instead of declaring k inside the loop, you can also declare k const and outside the loop.
Otherwise, ejd is correct: the problem here does not seem to be bad parallelization, but bad optimization when the code is parallelized. Remember that the OpenMP implementation in gcc is pretty young and far from optimal.
Fastest code:
for (int i = 0; i < 100000000; i ++) {;}
Slightly slower code:
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 100000000; i ++) {;}
2-3 times slower code:
#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}
no matter what it is in between { and }. A simple ; or a more complex computation, same results. I compiled under Ubuntu 13.10 64-bit, using both gcc and g++, trying different parameters -ansi -pedantic-errors -Wall -Wextra -O3, and running on an Intel quad-core 3.5GHz.
I guess thread management overhead is at fault? It doesn't seem smart for OMP to create a thread every time you need one and destroy it afterwards. I thought there would be four (or eight) threads that are either running when needed or sleeping.
I am observing similar behavior with GCC. However, I wonder whether in my case it is somehow related to templates or inline functions. Is your code also inside a template or an inline function? Please look here.
However, for very short for loops you may observe some small overhead related to thread switching, as in your case:
#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}
If your loop runs for a reasonably long time, from a few ms up to seconds, you should observe a performance boost when using OpenMP, but only when you have more than one core. The more cores you have, the more performance you can gain with OpenMP.
I couldn't find out why the code below works fine on my local system without including the vector header file, but not on online judges or online compilers.
#include<iostream>
#include<algorithm>
using namespace std;
int main(){
vector<int> v(10);
for(int i = 0; i<10; i++) v[i] = i;
sort(v.begin(),v.end());
for(int i = 0; i<10; i++) cout<<v[i]<<" ";
return 0;
}
I am compiling the code by enabling the warning flags as g++ -Wall -Wextra ./ex.cpp but g++ doesn't give me any warnings at all. Removing the #include<algorithm> does give me the error I wanted, identifier "vector" is undefined, but I don't know what's the relationship between them.
Your algorithm header itself includes the vector header (either directly or indirectly). Because of this, the code after the preprocessor looks the same as if you had included the vector header yourself.
You should not rely on this behavior though, as it depends on the standard library implementation you are using and can change at any time.
As the title says, using the #pragma omp critical directive in an R package with Rcpp significantly slows execution compared with the same C++ code compiled and run on its own, because the package does not use all of the CPU power.
Consider a simple C++ program (with cmake):
test.h as:
#ifndef RCPP_TEST_TEST_H
#define RCPP_TEST_TEST_H
#include <limits>
#include <cstdio>
#include <chrono>
#include <iostream>
#include <omp.h>
namespace rcpptest {
class Test {
public:
static unsigned int test();
};
}
#endif //RCPP_TEST_TEST_H
implementation of test.h in test.cpp:
#include "test.h"
namespace rcpptest {
unsigned int Test::test() {
omp_set_num_threads(8);
unsigned int x = 0;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
#pragma omp parallel for
for (unsigned int i = 0; i < 100000000; ++i) {
#pragma omp critical
++x;
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "finished (ms): " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() <<std::endl;
return x;
}
}
and main as:
#include "src/test.h"
int main() {
unsigned int x = rcpptest::Test::test();
return 0;
}
If I build and run this program in an IDE (CLion), everything works as expected.
Then I created an R package using Rcpp:
library(Rcpp)
Rcpp.package.skeleton('rcppTestLib')
and used the SAME C++ source codes for the package + "Rcpp" file to export my test function to be usable from R (rcppTestLib.cpp):
#include <Rcpp.h>
#include "test.h"
// [[Rcpp::export]]
void rcppTest() {
rcpptest::Test::test();
}
If I then run the test from R using the package
library(rcppTestLib)
rcppTest()
the execution is much slower.
I ran a few tests using both the compiled C++ program and the Rcpp package. The results are:
program      | execution time
-------------+---------------
compiled C++ | ~7 200 ms
Rcpp package | ~551 000 ms
The difference is that using Rcpp package, 8 threads are spawned but each one of them is using only ~1% of CPU while using compiled C++ the 8 threads combined used all of the CPU power.
I tried switching #pragma omp critical for #pragma omp atomic with results:
program      | execution time
-------------+---------------
compiled C++ | ~2 900 ms
Rcpp package | ~3 300 ms
With #pragma omp atomic, the Rcpp package spawns 8 threads and uses all of the CPU power. There is still a difference in execution times, but it is not that significant.
So my question is: why does the R / Rcpp package not use all the CPU power with #pragma omp critical, while it does with #pragma omp atomic, even though the same code built and run in CLion uses all the CPU power in BOTH cases?
What am I missing here?
Two possible options here:
In package form, the OpenMP flag options were not yet set in src/Makevars (unix) or src/Makevars.win (windows)
Missing num_threads(x) as critical rolls out
For one, place in the src/Makevars or src/Makevars.win file:
PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CFLAGS)
PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS)
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
For details, see: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support
Regarding a missing num_threads(x)... I've been able to speed up the problem a bit...
Changing:
#pragma omp parallel for
to
#pragma omp parallel for num_threads(4)
Yields:
Before
finished (ms): 30822
[1] 1e+08
vs.
After
finished (ms): 17979
[1] 1e+08
or about a 1.7x speedup. My thought is that somewhere in cmake a global thread option is being set, via either
omp_set_num_threads(x)
or
set OMP_NUM_THREADS=x
https://gcc.gnu.org/onlinedocs/libgomp/omp_005fset_005fnum_005fthreads.html
https://software.intel.com/en-us/mkl-linux-developer-guide-setting-the-number-of-threads-using-an-openmp-environment-variable
@coatless is once again entirely correct. The default src/Makevars* we create has no OpenMP flags. You can see this with a recent enough compiler:
ccache g++ -I/usr/share/R/include -DNDEBUG -I"/usr/local/lib/R/site-library/Rcpp/include" -fpic -g -O3 -Wall -pipe -march=native -c test.cpp -o test.o
test.cpp:10:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
#pragma omp parallel for
test.cpp:13:0: warning: ignoring #pragma omp critical [-Wunknown-pragmas]
#pragma omp critical
Once you add the src/Makevars as needed, all is good. htop shows as many CPUs as I chose to be pegged.
But your example is still bad because the loop does too little, so the overhead becomes dominant. I have multiple cores here, but there is no reason it should run faster with OMP_NUM_THREADS=2 than with OMP_NUM_THREADS=3 or OMP_NUM_THREADS=4, apart from the fact that we seem to have nothing but overhead here.
I have a problem compiling code that uses OpenMP.
Consider the following code:
#include <iostream>
#include <pthread.h>
#include <omp.h>
#include <semaphore.h>
#include <stack>
using namespace std;
sem_t empty,full;
stack<int> stk;
void produce(int i)
{
{
sem_wait(&empty);
cout<<"produce "<<i*i<<endl;
stk.push(i*i);
sem_post(&full);
}
}
void consume1(int &x)
{
sem_wait(&full);
int data=stk.top();
stk.pop();
x=data;
sem_post(&empty);
}
void consume2()
{
sem_wait(&full);
int data=stk.top();
stk.pop();
cout<<"consume2 "<<data<<endl;
sem_post(&empty);
}
int main()
{
sem_init(&empty,0,1);
sem_init(&full,0,0);
pthread_t t1,t2,t3;
omp_set_num_threads(3);
int TID=0;
#pragma omp parallel private(TID)
{
TID=omp_get_thread_num();
if(TID==0)
{
cout<<"There are "<<omp_get_num_threads()<<" threads"<<endl;
for(int i=0;i<5;i++)
produce(i);
}
else if(TID==1)
{
int x;
while(true)
{
consume1(x);
cout<<"consume1 "<<x<<endl;
}
}
else if(TID==2)
{
int x;
while(true)
{
consume1(x);
cout<<"consume2 "<<x<<endl;
}
}
}
return 0;
}
Firstly, I compile it using:
g++ test.cpp -fopenmp -lpthread
This gives the right answer: there are 3 threads in total.
But when I compile and link in separate steps like this:
g++ -c test.cpp -o test.o
g++ test.o -o test -fopenmp -lpthread
there is only ONE thread.
Can anyone tell me how to compile this code correctly? Thank you in advance.
OpenMP is a set of code-transforming pragmas, i.e. they are only applied at compile time. You cannot apply code transformation to already compiled object code (OK, you can, but it is a far more involved process and outside the scope of what most compilers do these days). You need -fopenmp during the link phase only for the compiler to automatically link the OpenMP runtime library libgomp; it does nothing else to the object code.
On a side note, although technically correct, your code does OpenMP in a very non-OpenMP way. First, you have reimplemented the OpenMP sections construct. The parallel region in your main function could be rewritten in a more OpenMP way:
#pragma omp parallel sections
{
#pragma omp section
{
cout<<"There are "<<omp_get_num_threads()<<" threads"<<endl;
for(int i=0;i<5;i++)
produce(i);
}
#pragma omp section
{
int x;
while(true)
{
consume1(x);
cout<<"consume1 "<<x<<endl;
}
}
#pragma omp section
{
int x;
while(true)
{
consume1(x);
cout<<"consume2 "<<x<<endl;
}
}
}
(if you get SIGILL while running this code with more than three OpenMP threads, you have encountered a bug in GCC, that will be fixed in the upcoming release)
Second, you might want to take a look at OpenMP task construct. With it you can queue pieces of code to be executed concurrently as tasks by any idle thread. Unfortunately it requires a compiler which supports OpenMP 3.0, which rules out MSVC++ from the equation, but only if you care about portability to Windows (and you obviously don't, because you are using POSIX threads).
The OpenMP pragmas are only enabled when compiled with -fopenmp. Otherwise they are completely ignored by the compiler. (Hence, only 1 thread...)
Therefore, you will need to add -fopenmp to the compilation of every single module that uses OpenMP. (As opposed to just the final linking step.)
g++ -c test.cpp -o test.o -fopenmp
g++ test.o -o test -fopenmp -lpthread
I am trying to use the C++11 threading library with g++ 4.7.
First, a question: is a future release expected to remove the need to link the pthread library by hand?
My program is:
#include <iostream>
#include <vector>
#include <thread>
void f(int i) {
std::cout<<"Hello world from : "<<i<<std::endl;
}
int main() {
const int n = 4;
std::vector<std::thread> t;
for (int i = 0; i < n; ++i) {
t.push_back(std::thread(f, i));
}
for (int i = 0; i < n; ++i) {
t[i].join();
}
return 0;
}
I compile with:
g++-4.7 -Wall -Wextra -Winline -std=c++0x -pthread -O3 helloworld.cpp -o helloworld
And it returns:
Hello world from : Hello world from : Hello world from : 32
2
pure virtual method called
terminate called without an active exception
Segmentation fault (core dumped)
What is the problem, and how can I solve it?
UPDATE:
Now using mutex:
#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
static std::mutex m;
void f(int i) {
m.lock();
std::cout<<"Hello world from : "<<i<<std::endl;
m.unlock();
}
int main() {
const int n = 4;
std::vector<std::thread> t;
for (int i = 0; i < n; ++i) {
t.push_back(std::thread(f, i));
}
for (int i = 0; i < n; ++i) {
t[i].join();
}
return 0;
}
It returns :
pure virtual method called
Hello world from : 2
terminate called without an active exception
Aborted (core dumped)
UPDATE 2:
Hum... It works with my default GCC (g++ 4.6) but fails with the version of gcc I compiled by hand (g++ 4.7.1). Was there an option I forgot when compiling g++ 4.7.1?
General edit:
In order to prevent use of cout by multiple threads simultaneously, that will result in character interleaving, proceed as follows:
1) declare before the declaration of f():
static std::mutex m;
2) then, guard the "cout" line between:
m.lock();
std::cout<<"Hello world from : "<<i<<std::endl;
m.unlock();
Apparently, linking against the -lpthread library is a must, for some unclear reasons. At least on my machine, not linking against -lpthread results in a core dump. Adding -lpthread results in proper functionality of the program.
The possibility of character interleaving if locking is not used when accessing cout from different threads is expressed here:
https://stackoverflow.com/a/6374525/1284631
more exactly: "[ Note: Users must still synchronize concurrent use of these objects and streams by multiple threads if they wish to avoid interleaved characters. — end note ]"
OTOH, the race condition is guaranteed to be avoided, at least in the C++11 standard (beware, the gcc/g++ implementation of this standard is still at experimental level).
Note that the Microsoft implementation (see: http://msdn.microsoft.com/en-us/library/c9ceah3b.aspx, credit to @SChepurin) is stricter than the standard (apparently, it guarantees that character interleaving is avoided), but this might not be the case for the gcc/g++ implementation.
This is the command line that I use to compile (both updated and original code versions, everything works well on my PC):
g++ threadtest.cpp -std=gnu++11 -lpthread -O3
OTOH, without the -lpthread, it compiles but I have a core dump (gcc 4.7.2 on Linux 64).
I understand that you are using two different versions of the gcc/g++ compiler on the same machine. Just be sure that you use them properly (not mixing different library versions).