I have a program that does independent computations on a bunch of images. This seems like a good case for OpenMP:
// file: WoodhamData.cpp
#include <omp.h>
...
void WoodhamData::GenerateLightingDirection() {
    int imageWidth = (this->normalMap)->width();
    int imageHeight = (this->normalMap)->height();
    #pragma omp paralell for num_threads(2)
    for (int r = 0; r < RadianceMaps.size(); r++) {
        if (omp_get_thread_num() == 0) {
            std::cout << "threads=" << omp_get_num_threads() << std::endl;
        }
        ...
    }
}
In order to use OpenMP, I add -fopenmp to my makefile, so it outputs:
g++ -g -o test.exe src/test.cpp src/WoodhamData.cpp -pthread -L/usr/X11R6/lib -fopenmp --std=c++0x -lm -lX11 -Ilib/eigen/ -Ilib/CImg
However, I am sad to say, my program reports threads=1 (run from terminal ./test.exe ...)
Does anyone know what might be wrong? This is the slowest part of my program, and it would be great to speed it up a bit.
Your OpenMP directive is wrong - it is "parallel" not "paralell".
In the following example, the elimination of unused code is performed for sin() but not for pow(), and I was wondering why. I tried both gcc and clang.
Here are some more details about this example, which is otherwise mostly code.
The code contains a loop over an integer from which a floating-point number is computed. The number is passed to a mathematical function, either pow() or sin(), depending on which macro is defined. If the macro USE is defined, the sum of all returned values is accumulated in another variable, which is then copied to a volatile variable to prevent the optimizer from removing the code entirely.
// main.cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>

int main() {
    std::chrono::steady_clock clock;
    auto start = clock.now();
    double s = 0;
    const size_t count = 1 << 27;
    for (size_t i = 0; i < count; ++i) {
        const double x = double(i) / count;
        double a = 0;
#ifdef POW
        a = std::pow(x, 0.5);
#endif
#ifdef SIN
        a = std::sin(x);
#endif
#ifdef USE
        s += a;
#endif
    }
    auto stop = clock.now();
    printf("%.0f ms\n",
           std::chrono::duration<double>(stop - start).count() * 1e3);
    volatile double a = s;
    (void)a;
}
As seen from the output, the computation of sin() is completely eliminated if the results are unused. This is not the case for pow() since the execution time does not decrease.
I normally observe this if the call may return a NaN (log(-x) but not log(+x)).
# g++ 10.2.0
g++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3064 ms
g++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3172 ms
g++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
0 ms
g++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1391 ms
# clang++ 11.0.1
clang++ -std=c++14 -O3 -DPOW main.cpp -o main && ./main
3288 ms
clang++ -std=c++14 -O3 -DPOW -DUSE main.cpp -o main && ./main
3351 ms
clang++ -std=c++14 -O3 -DSIN main.cpp -o main && ./main
177 ms
clang++ -std=c++14 -O3 -DSIN -DUSE main.cpp -o main && ./main
1524 ms
Q: Is it possible to improve the IO of this code with LLVM Clang under OS X?
test_io.cpp:
#include <iostream>
#include <string>

constexpr int SIZE = 1000 * 1000;

int main(int argc, const char* argv[]) {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::string command(argv[1]);
    if (command == "gen") {
        for (int i = 0; i < SIZE; ++i) {
            std::cout << 1000 * 1000 * 1000 << " ";
        }
    } else if (command == "read") {
        int x;
        for (int i = 0; i < SIZE; ++i) {
            std::cin >> x;
        }
    }
}
Compile:
clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
Benchmark:
> time ./test_io gen | ./test_io read
real 0m2.961s
user 0m3.675s
sys 0m0.012s
Apart from the sad fact that reading a 10 MB file costs 3 seconds, it's much slower than with g++ (installed via Homebrew):
> gcc-6 -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
> time ./test_io gen | ./test_io read
real 0m0.149s
user 0m0.167s
sys 0m0.040s
My clang version is Apple LLVM version 7.0.0 (clang-700.0.72). Clang versions installed from Homebrew (3.7 and 3.8) also produce slow IO. Clang installed on Ubuntu (3.8) generates fast IO. Apple LLVM version 8.0.0 also generates slow IO (two people have confirmed this).
I also dtrussed it a bit (sudo dtruss -c "./test_io gen | ./test_io read") and found that the clang version makes 2686 write_nocancel syscalls, while the gcc version makes 2079 writev syscalls, which probably points to the root of the problem.
The issue is in libc++, which does not implement sync_with_stdio.
Your command line clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io does not use libstdc++; it uses libc++. To force the use of libstdc++ you need -stdlib=libstdc++.
Minimal example if you have the input file ready:
#include <iostream>

constexpr int SIZE = 1000 * 1000;

int main(int argc, const char* argv[]) {
    std::ios_base::sync_with_stdio(false);
    int x;
    for (int i = 0; i < SIZE; ++i) {
        std::cin >> x;
    }
}
Timings:
$ clang++ test_io.cpp -o test -O2 -std=c++11
$ time ./test read < input
real 0m2.802s
user 0m2.780s
sys 0m0.015s
$ clang++ test_io.cpp -o test -O2 -std=c++11 -stdlib=libstdc++
clang: warning: libstdc++ is deprecated; move to libc++
$ time ./test read < input
real 0m0.185s
user 0m0.169s
sys 0m0.012s
I use Ubuntu and wrote the several lines of code below, but it creates only one thread. When I run the nproc command in my terminal, the output is 2. My code is below:
int nthreads, tid;
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Thread = %d\n", tid);
    /* for only main thread */
    if (tid == 0) {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
}
The output:
Thread = 0
Number of threads = 1
How can I do parallelism?
If you are using gcc/g++ you must make sure you enable the OpenMP extensions with the -fopenmp compiler and linker option. Specifying it during linking links in the appropriate library (-lgomp).
Compile with something like:
g++ -fopenmp myfile.c -o exec
or:
g++ -c myfile.c -fopenmp
g++ -o exec myfile.o -fopenmp
If you leave out the -fopenmp compile option, your program will compile but run as if OpenMP weren't being used. If your program doesn't call omp_set_num_threads to set the number of threads, it can be set from the command line:
OMP_NUM_THREADS=8 ./exec
I think the default is generally the number of cores on a particular system.
In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error-free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
// fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
    sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known during compile time and all operations can directly be mapped to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization / FMA instructions ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirection, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where that region lies. While this is certainly possible in principle, I wonder whether the typical C++ compiler will actually do it.
So my question is: do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?
As suggested I did some measurements and
for the code below
compiled with VS2013 for x64 in release mode
executed on a Win8.1 Machine with an i7-2640M,
the algorithm version is consistently slower by about 20% (15.6-15.7 s vs 12.9-13.1 s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: Using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM

using namespace std;
using namespace std::chrono;

static const size_t N = 10000000; // size of the arrays
static const size_t REPS = 1000;  // number of repetitions

array<double, N> a1;
array<double, N> a2;

int main() {
    srand(10);
    for (size_t i = 0; i < N; ++i) {
        a1[i] = static_cast<double>(rand()) * 0.01;
        a2[i] = static_cast<double>(rand()) * 0.01;
    }
    double res = 0.0;
    auto start = high_resolution_clock::now();
    for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
        res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
        for (size_t t = 0; t < N; ++t) {
            res += a1[t] * a2[t];
        }
#endif
    }
    auto end = high_resolution_clock::now();
    std::cout << res << " "; // <-- necessary, so that the loop isn't optimized away
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms" << std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5 s vs 14 s).
I am building a C++ application that uses nested OpenMP. However, it crashes. The problem goes away when either one of the two OpenMP pragmas is removed, or when the wait function is inside the main file itself. The OS is MacOS X Lion; the compiler should be either llvm-gcc or gcc-4.2 (I am not sure, I simply used cmake...). I built the following app to demonstrate:
EDIT: I now tried the same on a linux machine, it works fine. So it's a pure MACOS X (lion) issue.
OMP_NESTED is set to true.
The main:
#include "waiter.h"
#include <iostream>
#include <ctime>
#include <omp.h>

void wait() {
    int seconds = 1;
    #pragma omp parallel for
    for (int i = 0; i < 2; i++) {
        clock_t endwait = clock() + seconds * CLOCKS_PER_SEC;
        while (clock() < endwait) {}
        std::cout << i << "\n";
    }
}

int main() {
    std::cout << "blub\n";
    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        Waiter w; // causes crash
        // wait(); // works
    }
    std::cout << "blub\n";
    return 0;
}
header:
#ifndef WAITER_H_
#define WAITER_H_

class Waiter {
public:
    Waiter();
};

#endif // WAITER_H_
implementation:
#include "waiter.h"
#include <omp.h>
#include <ctime>
#include <iostream>

Waiter::Waiter() {
    int seconds = 1;
    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        clock_t endwait = clock() + seconds * CLOCKS_PER_SEC;
        while (clock() < endwait) {}
        std::cout << i << "\n";
    }
}
CMakeLists.txt:
cmake_minimum_required (VERSION 2.6)
project (waiter)

set(CMAKE_CXX_FLAGS "-fPIC -fopenmp")
set(CMAKE_C_FLAGS "-fPIC -fopenmp")
set(CMAKE_SHARED_LINKER_FLAGS "-fPIC -fopenmp")
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${PROJECT_BINARY_DIR}/lib)
set(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR}/bin)

add_library(waiter SHARED waiter.cpp waiter.h)
add_executable(use_waiter use_waiter.cpp)
target_link_libraries(use_waiter waiter)
Thanks for the help!
EDIT: rewritten with more details.
OpenMP causes intermittent failures on gcc 4.2, but this is fixed by gcc 4.6.1 (or perhaps 4.6). You can get the 4.6.1 binary from http://hpc.sourceforge.net/ (look for gcc-lion.tar.gz).
The failure of OpenMP on Lion with gcc versions earlier than 4.6.1 is intermittent. It seems to happen after many OpenMP calls, so it is likely made more probable by nesting, but nesting is not required. This link doesn't involve nested OpenMP (there is a parallel for within a standard single-threaded for loop) but still fails. My own code intermittently hung or crashed due to OpenMP after many minutes of working fine with gcc 4.2 (with no nested pragmas) on Lion, and was completely fixed by gcc 4.6.1.
I downloaded your code and compiled it with gcc 4.2 and it ran fine on my machine (with both the Waiter w; and wait(); options :-). I just used:
g++ -v -fPIC -fopenmp use_waiter.cpp waiter.cpp -o waiter
I tried increasing the loop maxes but still couldn't get it to fail. I see both the starting and ending blub.
What error message do you see?
Are you sure that the gcc 4.6 you downloaded is being used (use -v to make sure)?
See also here.