I am new to OpenACC and I am writing a new program from scratch (I have a fairly good idea which loops will be computationally costly from working on a similar problem before). I am getting an "Undefined reference" error from nvlink. From my research, I found this is because no device code is being generated for the class I created. However, I don't understand why this is happening or how to fix it.
Below is an MWE from my code.
include/vec1.h
#ifndef VEC1_H
#define VEC1_H
class Vec1{
public:
double data[1];
#pragma acc routine seq
Vec1();
#pragma acc routine seq
Vec1(double x);
#pragma acc routine seq
Vec1 operator* (double x);
};
#endif
src/vec1.cpp
#include "vec1.h"
Vec1::Vec1(){
data[0] = .0;
}
Vec1::Vec1(double x){
data[0] = x;
}
Vec1 Vec1::operator*(double c){
Vec1 r = Vec1(0.);
r.data[0] = c*data[0];
return r;
}
vec1_test_gpu.cpp
#include "vec1.h"
#define NUM_VECTORS 1000000
int main(){
Vec1 vec1_array[NUM_VECTORS];
for(int iv=0; iv<NUM_VECTORS; ++iv){
vec1_array[iv] = Vec1(iv);
}
#pragma acc data copyin(vec1_array)
#pragma acc parallel loop
for(int iv=0; iv<NUM_VECTORS; ++iv){
vec1_array[iv] = vec1_array[iv]*2;
}
return 0;
}
I compile them in the following way
$ nvc++ src/vec1.cpp -c -I./include -O3 -march=native -ta=nvidia:cuda11.2 -fPIC
$ nvc++ -shared -o libvec1.so vec1.o
$ nvc++ vec1_test_gpu.cpp -I./include -O3 -march=native -ta=nvidia:cuda11.2 -L./ -lvec1
The error message appears just after the last command and reads: nvlink error : Undefined reference to '_ZN4Vec1mlEd' in '/tmp/nvc++jOtCBiT_m38d.o'
The problem here is that you're trying to call a device routine, "Vec1::operator*", that's contained in a shared object from a kernel in the main program. nvc++'s OpenACC implementation uses CUDA to target NVIDIA devices. Since CUDA doesn't have a dynamic linker for device code, at least not yet, this isn't supported.
You'll need to either link this statically, or move the "parallel loop" into the shared object.
Note that the "-ta" flag has been deprecated. Please consider using "-acc -gpu=cuda11.2" instead.
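The second option (moving the "parallel loop" into the shared object) could look roughly like the sketch below. It has not been tested with nvc++; scale_all is a hypothetical helper name, and the Vec1 members are defined inline here only to keep the sketch self-contained. When OpenACC is not enabled, the pragmas are ignored and the loop simply runs on the host.

```cpp
// Sketch only: Vec1 mirrors the class from the question, with the members
// defined inline so the example is self-contained.
class Vec1 {
public:
    double data[1];

    #pragma acc routine seq
    Vec1() { data[0] = 0.0; }

    #pragma acc routine seq
    Vec1(double x) { data[0] = x; }

    #pragma acc routine seq
    Vec1 operator*(double c) const {
        Vec1 r(0.0);
        r.data[0] = c * data[0];
        return r;
    }
};

// Hypothetical helper (not in the original code). Because the kernel lives in
// the same library as Vec1::operator*, the device code for both is linked
// together and no cross-library device call is needed.
void scale_all(Vec1* v, int n, double c) {
    #pragma acc parallel loop copy(v[0:n])
    for (int iv = 0; iv < n; ++iv) {
        v[iv] = v[iv] * c;
    }
}
```

The main program would then call scale_all(vec1_array, NUM_VECTORS, 2) as an ordinary host function instead of containing the loop itself.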
I'm trying to build a piece of software with the Tensor module provided as unsupported from eigen3. I've written a simple piece of code that builds with a simple application of VectorXd (just printing it to stdout), and also builds with an analogous application of Tensor in place of the VectorXd, but it WILL NOT build when I omit the optimization flag (-On). Note that my build is from within a conda environment that uses conda-forge compilers, so the g++ in what follows is the g++ obtained from conda-forge for Ubuntu. Its name appears in the error messages below, in case that turns out to be the issue.
I have a feeling this is not about the program I'm trying to write, but just in case I've included an mwe.cpp that seems to produce the error. The code follows:
#include <eigen3/Eigen/Dense>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>
#include <iostream>
using namespace Eigen;
using namespace std;
int main(int argc, char const *argv[])
{
VectorXd v(6);
v << 1, 2, 3, 4, 5, 6;
cout << v.cwiseSqrt() << "\n";
Tensor<double, 1> t(6);
for (auto i=0; i<v.size(); i++){
t(i) = v(i);
}
cout << "\n";
for (auto i=0; i<t.size(); i++){
cout << t(i) << " ";
}
cout << "\n";
return 0;
}
If the above code is compiled without any optimizations, like:
g++ -I ~/miniconda3/envs/myenv/include/ mwe.cpp -o mwe
I get the following compiler error:
/home/myname/miniconda3/envs/myenv/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: /tmp/cc2q8gj4.o: in function `Eigen::internal::(anonymous namespace)::get_random_seed()':
mwe.cpp:(.text+0x15): undefined reference to `clock_gettime'
collect2: error: ld returned 1 exit status
If instead I ask for 'n' optimization level, like the following:
g++ -I ~/miniconda3/envs/loos/include/ -On mwe.cpp -o mwe
The program builds without complaint and I get expected output:
$ ./mwe
1
1.41421
1.73205
2
2.23607
2.44949
1 2 3 4 5 6
I have no clue why this little program, or the real program I'm trying to write, would be trying to get a random seed for anything. Any advice would be appreciated. The reason why I would like to build without optimization is so that debugging is easier. I actually thought all this was being caused by debug flags, but I realized that my build tool's debug setting didn't ask for optimization and narrowed that down to the apparent cause. If I throw -g -O1 I do not see the error.
Obviously, if one comments out all the code that has to do with the Tensor module (that is, everything in main above 'return' and below the cwiseSqrt() line, plus the corresponding include statement), the code builds and produces the expected output.
Technically, this is a linker error (g++ calls the compiler as well as the linker, depending on the command line arguments). You get linker errors if an externally defined function is referenced anywhere in the compiled code, even if that code is never reached.
When compiling with optimizations enabled, g++ will optimize away uncalled functions that are not visible outside the translation unit (for example, those in an anonymous namespace), so the reference disappears and you get no linker error. You may want to try -Og instead of -O1 for a better debugging experience.
The following code should produce similar behavior:
int foo(); // externally defined
namespace { // anonymous namespace
// defined inside this module, but never called
int bar() {
return foo();
}
}
int main() {
// if you un-comment this line, the
// optimized version will fail as well:
// ::bar();
}
According to man clock_gettime you need to link with -lrt if your glibc version is older than 2.17 -- maybe that is the case for your setup:
g++ -I ~/miniconda3/envs/myenv/include/ mwe.cpp -o mwe -lrt
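For context, the unresolved symbol is the POSIX function clock_gettime. A minimal sketch of a function that references it is shown below; nanoseconds_now is a hypothetical name and this is not Eigen's actual code. On a glibc older than 2.17, a translation unit that keeps a function like this would need -lrt at link time.

```cpp
#include <time.h>   // clock_gettime, timespec, CLOCK_REALTIME (POSIX)

// Hypothetical helper: returns the nanosecond part of the current wall-clock
// time, or -1 if the call fails.
long nanoseconds_now() {
    timespec ts;
    if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
        return -1;
    }
    return ts.tv_nsec;
}
```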
According to this question, the use of threadprivate with OpenMP is problematic. Here is a minimum (non-)working example of the problem:
#include"omp.h"
#include<iostream>
extern const int a;
#pragma omp threadprivate(a)
const int a=2;
void my_call(){
std::cout<<a<<std::endl;
};
int main(){
#pragma omp parallel for
for(unsigned int i=0;i<8;i++){
my_call();
}
}
This code compiles with intel 15.0.2.164 but not with gcc 4.9.2-10.
gcc says:
g++ -std=c++11 -O3 -fopenmp -O3 -fopenmp test.cpp -o test
test.cpp:5:29: error: ‘a’ declared ‘threadprivate’ after first use
#pragma omp threadprivate(a)
I would be very happy to find a way to compile it with gcc.
Note: I know that global variables are a nightmare, but this example comes from code that I haven't written and that I need to use... It's >11000 lines and I don't want to rewrite everything.
Consider the following scheme. We have 3 files:
main.cpp:
#include <cstdio>   // printf
#include <ctime>    // clock, clock_t, CLOCKS_PER_SEC
#include "class.h"
int main() {
clock_t begin = clock();
int a = 0;
for (int i = 0; i < 1000000000; ++i) {
a += i;
}
clock_t end = clock();
printf("Number: %d, Elapsed time: %f\n",
a, double(end - begin) / CLOCKS_PER_SEC);
begin = clock();
C b(0);
for (int i = 0; i < 1000000000; ++i) {
b += C(i);
}
end = clock();
printf("Number: %d, Elapsed time: %f\n",
b.m_number, double(end - begin) / CLOCKS_PER_SEC);
return 0;
}
class.h:
#include <iostream>
struct C {
public:
int m_number;
C(int number);
void operator+=(const C & rhs);
};
class.cpp:
#include "class.h"
C::C(int number)
: m_number(number)
{
}
void
C::operator+=(const C & rhs) {
m_number += rhs.m_number;
}
Files are compiled using clang++ with flags -std=c++11 -O3.
I expected very similar performance results, since I thought that the compiler would optimize the operators so that they are not called as functions. The reality was a bit different, though. Here is the result:
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751
I played around a bit and found out that if I paste all of the code from class.* into main.cpp, the speed improves dramatically and the results are very similar.
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003
Then I realized that this behavior is probably caused by the fact that main.cpp and class.cpp are compiled completely separately, so the compiler is unable to perform adequate optimizations.
My question: Is there any way of keeping the 3-file scheme and still achieving the same optimization level as if the files were merged into one and then compiled? I have read something about 'unity builds', but that seems like overkill.
Short answer
What you want is link time optimization. Try the answer from this question. I.e., try:
clang++ -O4 -emit-llvm main.cpp -c -o main.bc
clang++ -O4 -emit-llvm class.cpp -c -o class.bc
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable
You should see that your performance reaches the one that you had when pasting everything into main.cpp.
Long answer
The problem is that the compiler cannot inline your overloaded operator during linking, because it no longer has its definition in a form which it can use to inline it (it cannot inline bare machine code). Thus, the operator call in main.cpp will stay a real function call to the function declared in class.cpp. A function call is very expensive in comparison to a simple inlined addition which can be optimized further (e.g., vectorized).
When you enable link time optimization, the compiler is able to do this. As you see above, you first create llvm intermediate representation byte code (the .bc files, which I will simply call llvm code hereinafter) instead of machine code.
You then link these files to a new .bc file which still contains llvm code instead of machine code. In contrast to machine code, the compiler is able to perform inlining on llvm code. opt is the llvm optimizer (be sure to install llvm), which performs the inlining and further link time optimizations. Then, we call clang++ a final time to generate executable machine code from the optimized llvm code.
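A different approach that is not part of the answer above: if the member functions are small, defining them inside the class body in class.h makes their definitions visible to every translation unit that includes the header, so the compiler can inline them without link time optimization. A minimal sketch, assuming the struct C from the question (the include guard name is made up):

```cpp
// class.h (sketch): member functions defined in the class body are
// implicitly inline, so each including translation unit sees their bodies.
#ifndef CLASS_H
#define CLASS_H

struct C {
    int m_number;

    C(int number) : m_number(number) {}

    void operator+=(const C& rhs) { m_number += rhs.m_number; }
};

#endif
```

With this layout class.cpp would no longer hold these definitions, so it trades the strict three-file scheme for inlining.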
For People with GCC
The answer above is only for clang. GCC (g++) users must use the -flto flag during both compilation and linking to enable link time optimization. This is simpler than with clang, since you just add -flto everywhere:
g++ -c -O2 -flto main.cpp
g++ -c -O2 -flto class.cpp
g++ -o myprog -flto -O2 main.o class.o
The technique you are looking for is called Link Time Optimization.
From the timing data, it is obvious that the compiler doesn't just generate better code for the trivial case, but that it doesn't execute any code at all to sum up a billion numbers. That doesn't happen in real life. You are not performing a useful benchmark. You want to test code that is at least complicated enough to avoid stupid/clever things like this.
I'd re-run the test, but change the loop to
for (int i = 0; i < 1000000000; ++i) if (i != 1000000) {
// ...
}
so that the compiler is forced to actually add up the numbers.
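A sketch of a loop along those lines, packaged as a function that takes the bounds as arguments. sum_skipping is a hypothetical name, and unsigned arithmetic is used so that wrap-around is well defined (the int sum in the question overflows). Whether a particular compiler still folds the loop depends on the compiler and flags, so the timings should be checked rather than assumed.

```cpp
// Hypothetical helper: sums the integers 0..n-1 but skips one index, which
// makes the loop harder to replace with a precomputed constant.
unsigned int sum_skipping(unsigned int n, unsigned int skip) {
    unsigned int a = 0;
    for (unsigned int i = 0; i < n; ++i) {
        if (i != skip) {
            a += i;
        }
    }
    return a;
}
```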
I have a problem when using OpenMP in combination with firstprivate and std::vector on the Intel c++ compiler. Take the following three functions:
#include <omp.h>
#include <vector>
void pass_vector_by_value(std::vector<double> p) {
#pragma omp parallel
{
//do sth
}
}
void pass_vector_by_value_and_use_firstprivate(std::vector<double> p) {
#pragma omp parallel firstprivate(p)
{
//do sth
}
}
void create_vector_locally_and_use_firstprivate() {
std::vector<double> p(3, 7);
#pragma omp parallel firstprivate(p)
{
//do sth
}
}
The code compiles without warnings doing:
icc filename.cpp -openmp -Wall -pedantic
(icc version 14.0.1 (gcc version 4.7.0 compatibility))
or:
g++ filename.cpp -fopenmp -Wall -pedantic
(gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux))
but after compiling with icc I am getting runtime errors such as:
*** Error in `./a.out': munmap_chunk(): invalid pointer: 0x00007fff31bcc980 ***
when calling the second function (pass_vector_by_value_and_use_firstprivate)
So the error only occurs when the firstprivate clause is used (which should invoke the copy constructor) and the vector is passed by value to the function (which should invoke the copy constructor as well). When either not passing the vector but creating it locally in the function or not using firstprivate there is no error! On gcc I do not get any errors.
I am wondering whether the code somehow produces undefined behavior or whether this is a bug in icc.
I get the same problem with ICC but not GCC. Looks like a bug. Here is a workaround:
void pass_vector_by_value2(std::vector<double> p) {
#pragma omp parallel
{
std::vector<double> p_private = p;
//do sth with p_private
}
}
On the other hand, in general, I don't pass non-POD types by value to functions anyway. I would use a reference, but if you do that with firstprivate you get the error:
error: ‘p’ has reference type for ‘firstprivate’
The solution to that is the code I posted above anyway. Pass it by value or by reference and then define a private copy inside the parallel region as I did in the code above.
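The by-reference variant of that workaround might look like the sketch below. scaled_sum is a hypothetical function invented for illustration; it has not been tested with icc. Each thread makes its own copy of the vector, modifies only that copy, and the threads then share the work of summing.

```cpp
#include <vector>

// Hypothetical example: every thread doubles its private copy of p, then the
// threads split the summation. The caller's vector is never modified.
double scaled_sum(const std::vector<double>& p) {
    double total = 0.0;
    #pragma omp parallel reduction(+:total)
    {
        std::vector<double> p_private = p;  // per-thread copy, replaces firstprivate(p)
        for (double& x : p_private) {
            x *= 2.0;
        }
        const int n = static_cast<int>(p_private.size());
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            total += p_private[i];
        }
    }
    return total;
}
```

Without OpenMP enabled the pragmas are ignored and the function computes the same value serially.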
I implemented a simple matrix vector multiplication for sparse matrices in CRS using an implicit openMP directive in the multiplication loop.
The complete code is in GitHub: https://github.com/torbjoernk/openMP-Examples/blob/icc_gcc_problem/matxvec_sparse/matxvec_sparse.cpp
Note: It's ugly ;-)
To control the private and shared memory I'm using restrict pointers. Compiling it with GCC 4.6.3 on 64bit Linux works fine (besides two warnings about %u and unsigned int in a printf command, but that's not the point).
However, compiling it with ICC 12.1.0 on 64bit Linux fails with the error:
matxvec_sparse.cpp(79): error: "default_n_row" must be specified in a variable list at enclosing OpenMP parallel pragma
#pragma omp parallel \
^
with the definition of the variable and pointer in question
int default_n_row = 4;
int *n_row = &default_n_row;
and the openMP directive defined as
#pragma omp parallel \
default(none) \
shared(n_row, aval, acolind, arowpt, vval, yval) \
private(x, y)
{
#pragma omp for \
schedule(static)
for ( x = 0; x < *n_row; x++ ) {
yval[x] = 0;
for ( y = arowpt[x]; y < arowpt[x+1]; y++ ) {
yval[x] += aval[y] * vval[ acolind[y] ];
}
}
} /* end PARALLEL */
Compiled with g++:
c++ -fopenmp -O0 -g -std=c++0x -Wall -o matxvec_sparse matxvec_sparse.cpp
Compiled with icc:
icc -openmp -O0 -g -std=c++0x -Wall -restrict -o matxvec_sparse matxvec_sparse.cpp
Is it an error in usage of GCC/ICC?
Is this a design issue in my code causing undefined behaviour?
If so, which line(s) is/are causing it?
Is it just inconsistency between ICC and GCC?
If so, what would be a good way to achieve compiler independence and compatibility?
Huh. Looking at the code, it's clear what icpc thinks the problem is, but I'm not sure without going through the specification which compiler is doing the right thing here, g++ or icpc.
The issue isn't the restrict keyword; if you take all those out and lose the -restrict option to icpc, the problem remains. The issue is that you've got in that parallel section default(none) shared(n_row...), but n_row is, at the start of the program, a pointer to default_n_row. And icpc is requiring that default_n_row also be shared (or, at least, something) in that omp parallel section.
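One way to sidestep this, sketched below under the assumption that the row count does not change during the multiplication, is to copy it into a local variable before the parallel region and list only that local in the data-sharing clauses. crs_matvec and its parameter names are hypothetical, and this has not been checked against icpc 12.1.

```cpp
#include <vector>

// Hypothetical CRS matrix-vector product. All variables used inside the
// parallel region are plain locals that appear in the clause list, so
// default(none) has nothing left to complain about.
std::vector<double> crs_matvec(const std::vector<double>& aval,
                               const std::vector<int>& acolind,
                               const std::vector<int>& arowpt,
                               const std::vector<double>& vval) {
    int rows = static_cast<int>(arowpt.size()) - 1;  // local copy of the row count
    std::vector<double> yval(rows > 0 ? rows : 0, 0.0);

    const double* a = aval.data();
    const int* col = acolind.data();
    const int* rowpt = arowpt.data();
    const double* v = vval.data();
    double* y = yval.data();

    #pragma omp parallel for default(none) \
        shared(a, col, rowpt, v, y, rows) schedule(static)
    for (int x = 0; x < rows; ++x) {
        double acc = 0.0;  // declared inside the loop, so private to each iteration
        for (int k = rowpt[x]; k < rowpt[x + 1]; ++k) {
            acc += a[k] * v[col[k]];
        }
        y[x] = acc;
    }
    return yval;
}
```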