I am testing to see if I can even use GSL functions within OpenACC compute regions. In Main.c I try the following (silly) for loop which uses GSL functions,
#pragma acc kernels
for(int i=0; i<100; i++){
    gsl_matrix *C = gsl_matrix_calloc(10, 10);
    gsl_matrix_free(C);
}
which allocates memory for a 10x10 matrix of zeroes and then frees it, 100 times. However, when I compile,
pgcc -pg -fast -acc -Minfo=all,intensity -lgsl -lgslcblas -lm -o Main Main.c
I get the following messages,
PGC-S-0155-Procedures called in a compute region must have acc routine information: gsl_matrix_calloc (Main.c: 60)
PGC-S-0155-Accelerator region ignored; see -Minfo messages (Main.c: 57)
main:
57, Accelerator region ignored
58, Intensity = 1.00
Loop not vectorized/parallelized: contains call
60, Accelerator restriction: call to 'gsl_matrix_calloc' with no acc routine information
In particular, do the first and last messages regarding "acc routine information", mean it is not possible to use GSL functions within acc compute regions?
I haven't seen direct support for the GSL libraries.
You will need to obtain the source code for the GSL routines that you are using and insert "acc routine" directives ("#pragma acc routine" in C) where the functions are defined.
This will instruct the compiler to generate device versions of those routines for the GPU. After inserting those directives, you should compile the GSL sources with the
-acc flag.
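As a rough sketch of where such a directive goes (this is not actual GSL source; the function name and body below are placeholders):

/* Placeholder library function annotated so the compiler builds a device version
   that can be called from inside an OpenACC compute region. */
#pragma acc routine seq
double my_library_fn(double x)
{
    return x * x;
}

A matching "#pragma acc routine" directive typically also has to be visible at the call site (usually on the declaration in the header), and the annotated sources are then compiled with -acc.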
I have a project written in C++, and the platform it will be deployed on has a 256KB limit on binary size.
The toolchain is wasi-sdk-16.0 clang++; we use this compiler to compile the source code into a WASM-format binary. In this step we compile the sources with the following CXX_FLAGS:
-fstack-protector-strong -Os -D_FORTIFY_SOURCE=2 -fPIC -Wformat -Wformat-security -Wl,-z,relro,-z,now -Wno-psabi
Then we strip the binary with
strip -s output_binary.wasm
After the steps above, the compiled binary is 254KB.
Then we use wamrc from WAMR to compile the WASM binary for the AOT runtime; the command is shown below.
wamrc --enable-indirect-mode --disable-llvm-intrinsics -o output.aot output.wasm
The output binary size becomes 428KB, much bigger than the limit (256KB).
After some googling, I used wasm-opt to reduce the size:
wasm-opt -Oz output.wasm -o output.wasm
The size becomes only 4KB smaller (almost useless).
I tried to confirm how much my own code affects the binary size, so I wrote a minimal sample that just calls the standard C++ library, as follows:
#include <vector>
#include <iostream>

int main() {
    std::vector<int> vec = {1, 2, 3, 4, 5};
    for (auto v : vec) {
        std::cout << v << " ";
    }
}
The compiled binary is already 205KB.
I also tried to use a binary size profiler (twiggy) to track the size of each part of the binary, but the tool failed to open this binary.
So I want to ask:
Since just including two standard C++ headers already brings the binary close to the size limit, how can I strip the C++ standard library down to only the functions I actually use? (I cannot use a strip-unused-functions flag because my project is a library provided to others.) Or is the C++ standard library really what drives the binary size?
Are there any other compiler flags, strip flags, or other optimization tools that can significantly reduce the binary size?
One of the things I didn't see mentioned in your post is wabt's wasm-strip. I am not familiar enough with it to know if it does more than the simple strip command, but maybe it is worth a try. You can install it with apt install wasm-strip on a Debian system.
From the few micro-benchmarks I see on the internet, C++ wasm binaries have large overhead. For a totally-not-scientific slide you can watch this.
If, for whatever reason, you cannot work the language to produce smaller binaries, you may try to optimize also at the link level, as busybox does.
I solved this issue by simply replacing iostream and fstream with cstdio. That reduced the size from 254KB to 85KB, because iostream pulls in a very large amount of template code.
Using iostream | Count of functions (with readelf -Ws) | Size of binary
Yes            | 685                                   | 254KB
No             | 76                                    | 85KB
Specifying compiler flags such as -Oz also saves some size, but the main factor is the amount of code generated from templates. So avoid the C++ stream API (and any other template-heavy general-purpose library) when there are binary size limits; the C library equivalents are a safe bet.
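As a minimal sketch of that replacement, using the same toy loop as in the question:

#include <vector>
#include <cstdio>   // printf instead of the template-heavy iostream

int main() {
    std::vector<int> vec = {1, 2, 3, 4, 5};
    for (auto v : vec) {
        std::printf("%d ", v);   // same output as std::cout << v << " "
    }
}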
I have a very simple C++ program (it was a large one, but I stripped it down to the essentials), but it fails to compile. I'm providing all the details below.
The code
#include <vector>

const int SIZE = 43691;
std::vector<int> v[SIZE];

int main() {
    return 0;
}
Compilation command: g++ -std=c++17 code.cpp -o code
Compilation error:
/var/folders/l5/mcv9tnkx66l65t30ypt260r00000gn/T//ccAtIuZq.s:449:29: error: unexpected token in '.section' directive
.section .data.rel.ro.local
^
GCC version: gcc version 12.1.0 (Homebrew GCC 12.1.0_1)
Operating system: MacOS Monterey, version 12.3, 64bit architecture (M1)
My findings and remarks:
The constant SIZE isn't random here. I tried many different values, and SIZE = 43691 is the first one that causes the compilation error.
My guess is that it is caused by stack overflow. So I tried to compile using the flag -Wl,-stack_size,0x20000000, and also tried ulimit -s 65520. But neither of them has any effect on the issue; the code still fails to compile once SIZE exceeds 43690.
I also calculated the amount of stack memory this code consumes when SIZE = 43690. AFAIK, a vector uses 24B of stack memory, so the total comes to 24B * 43690 = 1048560B. This number is very close to 2^20 = 1048576. In fact, SIZE = 43691 is the first value for which the consumed stack memory exceeds 2^20 B. Unless this is quite some coincidence, my stack memory is somehow limited to 2^20 B = 1MB. If that really is the case, I still cannot find any way to increase it via the compilation command arguments. All the solutions on the internet lead to the stack_size linker argument, but it doesn't seem to work on my machine. I'm wondering now if it's because of the M1 chip somehow.
I'm aware that I can change this code to use a vector of vectors so the memory comes from the heap, but I often have to deal with other people's code that is written this way.
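For reference, the heap-based alternative I mean is something like this sketch:

#include <vector>

const int SIZE = 43691;
// The outer object itself is small; the SIZE inner vectors live on the heap
std::vector<std::vector<int>> v(SIZE);

int main() {
    return 0;
}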
Let me know if I need to provide any more details. I've been stuck with this the whole day. Any help would be extremely appreciated. Thanks in advance!
I had the same issue, and after adding the -O2 flag to the compilation command, it started working. No idea why.
So, something like this:
g++-12 -O2 -std=c++17 code.cpp -o code
It does seem to be an M1 / M1 Pro issue. I tested your code on two separate M1 Pro machines with the same result as yours. One workaround I found is to use the x86_64 version of gcc under Rosetta, which doesn't have these allocation problems.
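For example, assuming the x86_64 Homebrew gcc is installed under /usr/local (the prefix and the g++-12 name here are assumptions about the setup), something like this runs it through Rosetta:

arch -x86_64 /usr/local/bin/g++-12 -std=c++17 code.cpp -o code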
Works on an M1 Max running Monterey 12.5.1 with Xcode 13.4.1 and the clang 13.1.6 compiler:
% cat code.cpp
#include <vector>
const int SIZE = 43691;
std::vector<int> v[SIZE];
int main() {
return 0;
}
% cc -std=c++17 code.cpp -o code -lc++
% ./code
Also fails with gcc-12.2.0:
% g++-12 -std=c++17 code.cpp -o code
/var/tmp/ccEnyMCk.s:449:29: error: unexpected token in '.section' directive
.section .data.rel.ro.local
^
So it seems to be a gcc-on-M1 issue.
This is a gcc-12 problem on Darwin AArch64 targets. It should not emit sections like .section .data.rel.ro.local. Section names on macOS must use the Mach-O segment,section form and start with __, e.g.: .section __DATA,...
See Mach-O reference.
I am not able to get mkoctfile to successfully create an .oct file that wraps one of my functions (e.g. void my_fun(double*,double)). In particular, my problem arises from the fact that the wrapper code my_fun_wrap.cpp requires the inclusion of <octave/oct.h>, which only provides C++ headers (see here), while the original code of my_fun also uses source code written in C. E.g.
// my_fun_wrapper.cpp
#include <octave/oct.h>
#include "custom_functions_libc.h"

DEFUN_DLD(my_fun_wrapper, args, , "EI MF network model A with delays (Brunel, JCN 2000)"){
    // Input arguments
    NDArray xvar = args(0).array_value();
    double x = xvar(0);
    // Output arguments
    double dy[4];
    dim_vector dv (4,1);
    NDArray dxvars(dv);
    // Invoke my C function which also includes code in the lib file custom_functions_libc.c
    my_fun(dy,x);
    // Then assign output value to NDArray
    for(int i=0;i<4;i++) dxvars(i) = dy[i];
    // Cast output as octave_value as required by the octave guidelines
    return octave_value (dxvars);
}
Then suppose that my custom_functions_libc.h and custom_functions_libc.c files are somewhere in a folder <path_to_folder>/my_libs. Ideally, from the Octave command line I would compile the above with:
mkoctfile -g -v -O -I<path_to_folder>/my_libs <path_to_folder>/my_libs/custom_functions_libc.c my_fun_wrapper.cpp -output my_fun_wrapper -lm -lgsl -lgslcblas
This actually generates my_fun_wrapper.oct as required. Then I can call this latter from within some octave code, e.g.
...
...
xx = [0., 2.5, 1.];
yy = [1e-5, 0.1, 2.];
dxv = test_my_function(xx,yy);
function dy = test_my_function(xx,yy)
xx += yy**2;
dy = my_fun_wrapper(xx);
endfunction
It turns out that the above code exits with an error in test_my_function saying that within my_fun_wrapper the symbol Zmy_fundd is not recognized. Upon receiving that kind of error I suspected that something had gone wrong in the linking process. But strangely enough the compiler did not produce any error, as I said. Yet a closer inspection of the verbose compiler output revealed that mkoctfile automatically switches compilers between files depending on their extension. So my_fun_wrapper.cpp is compiled by g++ -std=gnu++11 while custom_functions_libc.c is compiled by gcc -std=gnu11, and somehow the custom_functions_libc.o produced this way, when linked with my_fun_wrapper.o, leaves that symbol unresolved.
The example above is very simplistic. In practice, in my case custom_functions_libc includes many more custom C libraries. A workaround so far has been to clone the .c source files of those libraries into .cpp files. But I do not like this solution very much.
How can I mix C++ and C code safely and compile it successfully with mkoctfile? The Octave manual suggests prepending an extern "C" specification (see here), which I am afraid I am not very familiar with. Is this the best way, or could you alternatively suggest another potential solution?
So apparently the easiest solution, following my post above, is to correct the wrapper with the following preprocessor directives:
// my_fun_wrapper.cpp
#include <octave/oct.h>
// ADDED code to include the C source code
#ifdef __cplusplus
extern "C"
{
#endif
// END ADDITION
#include "custom_functions_libc.h"
// ADDED code to include the C source code
#ifdef __cplusplus
} /* end extern "C" */
#endif
// END ADDITION
...
...
This will compile and link fine: the extern "C" block tells the C++ compiler not to apply C++ name mangling to the declarations from custom_functions_libc.h, so the symbols the wrapper references match the ones gcc produces in custom_functions_libc.o.
I'm using PGI to compile the following program which uses OpenMP's target directives to offload work to a GPU:
#include <iostream>
#include <cmath>

int main(){
    const int SIZE = 400000;
    double *m;
    m = new double[SIZE];

    #pragma omp target teams distribute parallel for
    for(int i=0;i<SIZE;i++)
        m[i] = std::sin((double)i);

    for(int i=0;i<SIZE;i++)
        std::cout<<m[i]<<"\n";
}
My compilation string is as follows:
pgc++ -omp -ta=tesla,pinned,cc60 -Minfo=accel -fast test2.cpp
Compilation succeeds, but it lacks the series of outputs that I get with OpenACC that tell me what the compiler actually did with the directive, like so:
main:
8, Accelerator kernel generated
Generating Tesla code
11, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
8, Generating implicit copyout(m[:400000])
How can I get similar information for OpenMP? -Minfo by itself didn't seem to yield anything useful.
"-Minfo" (which is the same as "-Minfo=all"), or "-Minfo=mp" will give you compiler feedback messages for OpenMP compilation.
Note, though, that PGI only supports OpenMP 4.5 directives with our LLVM back-end compilers. These are available by default on IBM Power-based systems, or as part of our LLVM beta compilers on x86. The x86 beta compilers can be found at http://www.pgroup.com/support/download_llvm.php but do require a Professional Edition license.
Also, our current OpenMP 4.5 support only targets multicore CPUs. We're working on GPU target offload as well, but that support won't be available for a while.
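For example, reusing the compile line from the question but requesting the OpenMP feedback messages mentioned above (only the -Minfo value is changed; the other flags are taken verbatim from the question):

pgc++ -omp -ta=tesla,pinned,cc60 -Minfo=mp -fast test2.cpp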
I have the Intel compiler installed at $HOME/tpl/intel. When I compile a simple hello_omp.cpp with OpenMP enabled,
#include <omp.h>
#include <iostream>

int main ()
{
    #pragma omp parallel
    {
        std::cout << "Hello World" << std::endl;
    }
    return 0;
}
I compile with ~/tpl/intel/bin/icpc -O3 -qopenmp hello_omp.cpp but when I run I get the following error:
./a.out: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory.
How can I explicitly link against the appropriate Intel runtime library during the make process, without using LD_LIBRARY_PATH?
You have 2 simple solutions for your problem:
Linking statically with the Intel run time libraries:
~/tpl/intel/bin/icpc -O3 -qopenmp -static_intel hello_omp.cpp
Pros: you don't have to care where the Intel run time environment is installed on the machine where you run the binary, or even whether it is installed at all;
Cons: your binary becomes bigger and won't let you select a different (ideally more recent) run time environment even when one is available.
Adding the search path for dynamic library into the binary using the linker option -rpath:
~/tpl/intel/bin/icpc -O3 -qopenmp -Wl,-rpath=$HOME/tpl/intel/lib/intel64 hello_omp.cpp
Notice the use of -Wl, to transmit the option to the linker.
I guess this is closer to what you were after than the first solution I proposed, so I'll let you work out what the pros and cons are for you in comparison.
The Intel compiler ships a compilervars.sh script in its bin directory which, when sourced, sets the appropriate environment variables like LD_LIBRARY_PATH, LIBRARY_PATH and PATH to the right directories, which host the OpenMP runtime library and other compiler-specific libraries like libsvml (short vector math library) or libimf (a more optimized version of libm).
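For example, assuming the script sits next to icpc under $HOME/tpl/intel/bin and takes the usual architecture argument, something like this would set up the environment before building and running:

source ~/tpl/intel/bin/compilervars.sh intel64
~/tpl/intel/bin/icpc -O3 -qopenmp hello_omp.cpp
./a.out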