OpenCL not usable when compiling host application with Address Sanitizer - c++

I'm debugging a crash in my OpenCL application and attempted to use ASan to pin down where the problem originates. But I discovered that when I recompile with ASan enabled, my application cannot find any OpenCL devices: simply adding -fsanitize=address to the compiler options makes my program unable to use OpenCL.
With further testing, I am certain ASan is the reason.
Why is this happening? How can I use ASan with OpenCL?
An MVCE:
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    if (platforms.size() == 0)
        std::cout << "Compiled with ASan\n";
    else
        std::cout << "Compiled normally\n";
}
cl::Platform::get returns CL_SUCCESS but an empty list of platforms.
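A compile invocation that reproduces this looks like the following (any flag besides -fsanitize=address is assumed here):
$ g++ main.cpp -fsanitize=address -lOpenCL -o mvce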
Some information about my setup:
GPU: GTX 780Ti
Driver: 418.56
OpenCL SDK: Nvidia OpenCL / POCL 1.3 with CPU and CUDA backend
Compiler: GCC 8.2.1
OS: Arch Linux (Kernel 5.0.7 x64)

The NVIDIA driver is known to conflict with ASAN. It attempts to mmap(2) memory into a fixed virtual memory range within the process, which coincides with ASAN's write-protected shadow gap region. Given that ASAN reserves about 20TB of virtual address space on startup, such conflicts are not unlikely with other programs or drivers, too.
ASAN recognizes certain flags that may be set in the ASAN_OPTIONS environment variable. To resolve the shadow gap range conflict, set the protect_shadow_gap option to 0. For example, assuming a POSIX-like shell, you may run your program like
$ ASAN_OPTIONS=protect_shadow_gap=0 ./mandelbrot
The writable shadow gap incurs additional performance costs under ASAN, since an unprotected gap requires its own shadowing. This is why it's not recommended to set this option globally (e.g., in your shell startup script). Enable it only for the programs that in fact require it.
I'm nearly certain this is the root cause of your issue. I use ASAN with CUDA programs and always need to set this option. The failure CUDA reports without it is very similar: a cudaErrorNoDevice error when I attempt to select a device.
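If you'd rather not remember the environment variable for every run, ASAN also lets a program bake in default options through the __asan_default_options hook; a minimal sketch (the option string is the same one as above):
// Define this hook anywhere in the instrumented binary; ASAN calls it at
// startup to obtain default options. ASAN_OPTIONS still overrides these.
extern "C" const char *__asan_default_options() {
    return "protect_shadow_gap=0";
}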

Related

Fix crash due to SIGILL in OpenSSL

I've read several Q&As here about the fact that OpenSSL tries different instructions to test whether the CPU supports them, which causes SIGILL. But those answers usually state that the OP was running the app under gdb, and I'm not. So my app on an OpenWrt MIPS router actually crashes when using OpenSSL, whenever I make a call into the OpenSSL library. The crash is an illegal instruction. I don't have a backtrace, even though my app is a debug build. It works fine on Ubuntu and macOS.
I made sure that both my executable and ssl libs are of the same cpu architecture.
Result of cat /proc/cpuinfo:
system type : Atheros AR9330 rev 1
machine : 8devices Carambola2 board
processor : 0
cpu model : MIPS 24Kc V7.4
BogoMIPS : 265.42
wait instruction : yes
microsecond timers : yes
tlb_entries : 16
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa : mips1 mips2 mips32r1 mips32r2
ASEs implemented : mips16
shadow register sets : 1
kscratch registers : 0
package : 0
core : 0
VCED exceptions : not available
VCEI exceptions : not available
What worries me is that the toolchain, toolchain-mips_34kc_gcc-5.2.0_musl-1.1.11, mentions 34Kc in its name. I wonder if it's OK to build with this toolchain for a 24Kc CPU, though everything else except OpenSSL works fine.
So could you please answer what are my options to fix it?
I don't know what the problem was, but the app didn't work with the OpenSSL library provided with the toolchain and copied to the target board. When libopenssl was installed via opkg from the official Carambola2 repos, the problem was gone. So it must have been some incompatibility.
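For anyone hitting the same thing, the fix amounted to the standard opkg procedure on the router (assuming the package in the Carambola2 feed is named libopenssl):
$ opkg update && opkg install libopenssl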

Dr Memory will not run with SDL_ttf (2.0.10)

Upon adding SDL_ttf (2.0.10), DrMemory refuses to work anymore. The console went from printing out the messages to outputting nothing and sending the following to stdout:
~~Dr.M~~ WARNING: unable to locate results file: can't open D:\DrMemory
\drmemory\logs/resfile.6188 (code=2). Dr. Memory failed to start the
target application, perhaps due to interference from invasive security
software. Try disabling other software or running in a virtual machine.
Is there any way around this with some command line flag for Dr Memory or will I have to forego using Dr Memory?
Note: It works perfectly fine with other SDL stuff until I add the TTF library and a TTF_Font *font somewhere. The code I have works fine and there are no loading errors or anything wrong with it; it's at a very primitive level and fresh/new. I just cannot get Dr Memory to work as soon as any TTF element is added to the source code.
It works with a 32-bit build, but not a 64-bit build... so I switched to using 32-bit.
This is not a full answer as to why, but if anyone else finds it breaking for them, try a 32-bit build.

OpenCL not finding platforms?

I am trying to utilize the C++ API for OpenCL. I have installed my NVIDIA drivers and I have tested that I can run the simple vector addition program provided here. I can compile this program with the following gcc call, and the program runs without problem.
gcc main.c -o vectorAddition -l OpenCL -I/usr/local/cuda-6.5/include
However, I would very much prefer to use the C++ API as opposed to the very verbose host files needed for C.
I downloaded the C++ bindings from Khronos from here and placed the cl.hpp file in the same location as my other cl.h file. The code uses some C++11, so I can compile the code with:
g++ main.cpp -o vectorAddition_cpp -std=c++11 -l OpenCL -I/usr/local/cuda-6.5/include
but when I try to run the program I get the error:
clGetPlatformIDs(-1001)
I also tried the example provided here, which gave a more helpful error message.
No platforms found. Check OpenCL installation!
The particular code which provides this error is this:
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if (all_platforms.size() == 0) {
    std::cout << " No platforms found. Check OpenCL installation!\n";
    exit(1);
}
This seems so strange given that the C implementation runs without problem. Any insights would be sincerely appreciated.
EDIT
The C implementation actually isn't running correctly. Each addition is printed as equal to zero. Checking ret_num_platforms also shows 0. For some reason my setup is failing to find my GPU. What could I have missed? My install consists of the nvidia-340 driver and cuda-6.5, installed via apt-get and the .run file respectively.
My sincerest thanks to @pasternak for helping me troubleshoot this problem. To solve it, however, I ended up needing to avoid essentially all Ubuntu apt-get installs and just use the CUDA run file for the full installation. Here is what fixed the problem.
Purge existing nvidia and cuda installations (sudo apt-get purge cuda* nvidia-*)
Download cuda-6.5 toolkit from the CUDA toolkit archive
Reboot computer
Switch to tty1 (Ctrl-Alt-F1)
Stop the X server (sudo stop lightdm)
Run the cuda run file (sh cuda_6.5.14_linux_64.run)
Select 'yes' and accept all defaults
Required reboot
Switch to tty1, stop the X server, and run the cuda run file again, selecting 'yes' and the default for everything (including the driver again)
Update PATH to include /usr/local/cuda-6.5/bin and LD_LIBRARY_PATH to include /usr/local/cuda-6.5/lib64 (see the export lines after this list)
Reboot again
Compile main.c program (gcc main.c -o vectorAddition -l OpenCL -I/usr/local/cuda-6.5/include)
Verify works with ./vectorAddition
C++ API
Download cl.hpp file from Khronos here noting that it is version 1.1
Place cl.hpp file in /usr/local/cuda-6.5/include/CL with other cl headers.
Compile main.cpp (g++ main.cpp -o vectorAddition_cpp -std=c++11 -l OpenCL -I/usr/local/cuda-6.5/include)
Verify it works (./vectorAddition_cpp)
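For the PATH/LD_LIBRARY_PATH step above, the updates look like this in a shell startup file (paths taken from the steps):
export PATH=/usr/local/cuda-6.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH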
All output from both programs show the correct output for addition between vectors.
I personally find it interesting that Ubuntu's nvidia drivers don't seem to play well with the CUDA toolkits. Possibly just for the older versions, but still very unexpected.
It is hard to say without running the specific code on your machine, but looking at the difference between the example C code you said was working and cl.hpp might give us a clue. In particular, notice that the C example uses the following lines to simply read a single platform ID:
cl_platform_id platform_id = NULL;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
Notice that it passes 1 as its first argument. This assumes that at least one OpenCL platform exists and requests that the first one found be placed in platform_id. Additionally, note that even though the return code is assigned to "ret", it is not used to actually check whether an error is returned.
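For comparison, a version of that C query with the missing checks added might look like this (a sketch; names follow the sample code):
cl_uint ret_num_platforms = 0;
cl_platform_id platform_id = NULL;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
/* Bail out if the call failed or no platform was reported. */
if (ret != CL_SUCCESS || ret_num_platforms == 0) {
    fprintf(stderr, "clGetPlatformIDs failed (err=%d, platforms=%u)\n",
            (int)ret, (unsigned)ret_num_platforms);
    exit(1);
}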
Now if we look at the implementation of the static method used to query the set of platforms in cl.hpp, i.e. cl::Platform::get:
static cl_int get(VECTOR_CLASS<Platform>* platforms)
{
    cl_uint n = 0;
    cl_int err = ::clGetPlatformIDs(0, NULL, &n);
    if (err != CL_SUCCESS) {
        return detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
    }

    cl_platform_id* ids = (cl_platform_id*) alloca(n * sizeof(cl_platform_id));
    err = ::clGetPlatformIDs(n, ids, NULL);
    if (err != CL_SUCCESS) {
        return detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
    }

    platforms->assign(&ids[0], &ids[n]);
    return CL_SUCCESS;
}
we see that it first calls
::clGetPlatformIDs(0, NULL, &n);
Notice that the first parameter is 0, which tells the OpenCL runtime to return the number of platforms in "n". If this is successful, it then goes on to request the actual "n" platform IDs.
So the difference is that the C version does not check that at least one platform exists and simply assumes that one does, while the cl.hpp variant does check, so it may well be this call that is failing.
The most likely reason for all this is that the ICD is not correctly installed. You can see this thread for an example of how to fix this issue:
ERROR: clGetPlatformIDs -1001 when running OpenCL code (Linux)
I hope this helps.

TI DM3730 (Design reference: beagleboard) computes wrong floating point operation results

The Situation
We have a board with a TI DM3730 processor (also known from the Beagleboard) with a Cortex A8 core (r3p2) in use with the following parameters:
Beagleboard Reference Design: Beagleboard-xM Rev-C
Kernel version: 3.2.8
Open CV library: 2.4.6
U-Boot: uboot-2013.04
Toolchain: Sourcery CodeBench ARM 2011.03
Buildroot: 2012.02
The setup is derived from this blog
Now we have written a program (in C++, compiled with GCC version 4.5.2) which uses the OpenCV library (to calculate some scores using support vector machines) and which behaves in a strange way:
The program runs on the board in its own process using defined test data: it repeatedly produces correct results.
The program runs in two or more processes (with the same defined test data): the results start to become wrong for each process, and processes die with segfaults. The last remaining process runs correctly again.
The program runs in its own process (with the same defined test data again), and another process changes some exposure settings of an attached camera: the program starts to produce wrong results.
So we assume this is a very low level floating point problem.
What we tried
The complete system (all libraries, kernel, boot loader, etc.) has been compiled with the compiler flags suggested on pandorawiki.org regarding Floating_Point_Optimization:
-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -ffast-math -fsingle-precision-constant
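As an aside, and unrelated to the OpenCV code itself, -ffast-math alone already changes observable floating-point behavior, because it lets the compiler assume NaN and Inf never occur. A minimal illustration (not our program):
#include <cmath>
#include <cstdio>

int main() {
    volatile double zero = 0.0;      // volatile keeps the division from being folded away
    double maybe_nan = zero / zero;  // produces NaN at run time
    // Under -ffast-math the compiler may assume NaN cannot occur and
    // optimize this branch away, so the output can differ between builds.
    if (std::isnan(maybe_nan))
        std::printf("NaN detected\n");
    else
        std::printf("no NaN detected\n");
    return 0;
}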
We tried to enable L1NEON in the Cortex-A8 auxiliary control register according to the Beagleboard FAQ and tried the other options mentioned there as well, but unfortunately to no avail.
All three different behaviors are reproducible, but not in the form of a minimal working example.
The same program source runs the first and second scenarios correctly on Windows (using Visual Studio) and on a desktop running Linux (GCC), so it's probably not something our code does.
So the questions are now:
Are there any other known bugs with this setup and floating point operations which we are not aware of?
Are there any known compiler options which should be set or omitted which can lead to the observed results?
If an MWE would be helpful, we will look into providing one.
Any clues are welcome.
OK, we now use an up-to-date Buildroot (2014.08) with the included toolchain (arm-buildroot-linux-uclibcgnueabi-), Linux kernel 3.9.11, Boost 1.55, Qt 4.8.6, and still OpenCV 2.4.6.
When compiling, we optimize for size (-Os), and for target optimization we only use -pipe.
The following compiler-flags are currently not used anymore:
-mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -ffast-math -fsingle-precision-constant
Unfortunately, we still don't know the exact reason for the original problem, but we are quite happy that the problem went away with this setup.
So maybe this answer helps some poor soul in the future... ;)

Compiling on Vortex86: "Illegal instruction"

I'm using an embedded PC with a Vortex86-SG CPU running Ubuntu 10.04 with kernel 2.6.34.10-vortex86-sg. Unfortunately we can't compile a new kernel, because we don't have any source code, not even drivers or patches.
I have to run a small project written in C++ with OpenFrameworks. The framework compiles fine after running the install_*.sh scripts in of_v0071_linux_release/scripts/linux/ubuntu/.
I noticed that in order to compile against Vortex86/Ubuntu 10.04, the following options must be added in every config.make file:
USER_CFLAGS = -march=i486
USER_LDFLAGS = -lGLEW
Indeed, it compiles without errors, but the generated binary doesn't start at all:
root@jb:~/openframeworks/of_v0071_linux_release/apps/myApps/emptyExample/bin# ./emptyExample
Illegal instruction
root@jb:~/openframeworks/of_v0071_linux_release/apps/myApps/emptyExample/bin# echo $?
132
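(Exit status 132 is 128 + 4, i.e. the process was killed by signal 4, SIGILL.)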
Strace last lines:
munmap(0xb77c3000, 4096) = 0
rt_sigprocmask(SIG_BLOCK, [PIPE], NULL, 8) = 0
--- SIGILL (Illegal instruction) @ 0 (0) ---
+++ killed by SIGILL +++
Illegal instruction
root@jb:~/openframeworks/of_v0071_linux_release/apps/myApps/emptyExample/bin#
Any idea to solve this problem?
I know I am a bit late on this, but I recently had my own issues trying to compile the kernel for the Vortex86DX, and I was finally able to build it. Use these steps at your own risk, as I am not a Linux guru, and you may have to change some settings for your own preference/hardware:
Download and use a Linux distribution that runs a kernel version similar to the one you plan on compiling. Since I will be compiling Linux 2.6.34.14, I downloaded and installed Debian 6 in VirtualBox with adequate RAM and processor allocations. You could potentially compile on the Vortex86DX itself, but that would likely take forever.
Make sure you have the dependencies: # apt-get install ncurses-dev kernel-package
Download the kernel from kernel.org (I grabbed linux-2.6.34.14.tar.xz) and extract the files from the package.
Grab the config file from the DMP FTP site: ftp://vxmx:gc301@ftp.dmp.com.tw/Linux/Source/config-2.6.34-vortex86-sg-r1.zip. Please note the vxmx user name. Copy the config file to the freshly extracted Linux source folder.
Grab the patch at ftp://vxdx:gc301@ftp.dmp.com.tw/Driver/Linux/config%26patch/patch-2.6.34-hda.zip. Please note the vxdx user name. Copy it to the kernel source folder.
Patch the kernel: # patch -p1 < patchfilename
Configure the kernel with # make menuconfig
Load Alternate Configuration File
Enable generic x86 support
Enable Math Emulation
I disabled generic IDE support because I will be using legacy mode (selectable in BIOS)
Under Device Drivers -> Ethernet (10 or 100Mbit) -> Make sure RDC R6040 Fast Ethernet Adapter Support is selected
USB support -> Select Support for Host-side USB, EHCI HCD (USB 2.0) support, OHCI HCD support
Save the config as .config
Check serial ports: edit .config manually and make sure CONFIG_SERIAL_8250_NR_UARTS=4 (or more if you have additional ports) and CONFIG_SERIAL_8250_RUNTIME_UARTS=4 (likewise). If you are going to use more than 4 serial ports, make sure CONFIG_SERIAL_8250_MANY_PORTS is set.
Compile the kernel headers and source: # make-kpkg --initrd kernel_image kernel_source kernel_headers modules_image