I have the following code:
#include <iostream>
#include <random>
int main() {
    std::mt19937_64 rng(std::random_device{}());
    std::cout << std::uniform_int_distribution<>(0, 100)(rng) << '\n';
}
I try to profile it using valgrind, but it says:
vex amd64->IR: unhandled instruction bytes: 0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==2092== valgrind: Unrecognised instruction at address 0x4cdc1b5.
==2092== at 0x4CDC1B5: std::(anonymous namespace)::__x86_rdrand() (random.cc:69)
==2092== by 0x4CDC321: std::random_device::_M_getval() (random.cc:130)
==2092== by 0x4009D4: main (random.h:1619)
Preceded by multiple instances of:
--2092-- WARNING: Serious error when reading debug info
--2092-- When reading debug info from /lib/x86_64-linux-gnu/ld-2.22.so:
--2092-- Ignoring non-Dwarf2/3/4 block in .debug_info
I am on Debian with standard packages, on an x86-64 platform, compiling with gcc 5.3.1 and running valgrind-3.11.0. The illegal instruction appears to be inside libstdc++6.
How do I get valgrind to profile my code?
Valgrind emulates your program by translating it into an intermediate language (VEX) so that it can detect memory violations.
This VEX language captures the instructions of several architectures, such as i386, amd64, arm, and so on. But from time to time it misses a few instructions, especially specialized ones such as rdrand, the hardware random-number instruction that std::random_device executes here.
That is exactly what happened with your program: Valgrind stumbled on an instruction it does not know and could not translate it into the VEX intermediate language.
You are not the only one waiting for a fix:
Same issue on Launchpad.
Same issue on KDE bugtracker.
... and so on ...
Here is a patch that has been applied to Valgrind and that may solve the problem for you (depending on your CPU).
In practice, though, the only thing you can do is install a newer version of Valgrind and hope that the instruction is supported there.
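Until then, one possible workaround for profiling runs is to avoid std::random_device (and thus rdrand) altogether. A minimal sketch, assuming a clock-based seed is acceptable for your purposes:
#include <chrono>
#include <iostream>
#include <random>

int main() {
    // Seed from the clock instead of std::random_device, so libstdc++
    // never executes rdrand under Valgrind. Note this is a much weaker
    // seed than std::random_device; use it for profiling runs only.
    const auto seed = static_cast<std::mt19937_64::result_type>(
        std::chrono::steady_clock::now().time_since_epoch().count());
    std::mt19937_64 rng(seed);
    std::cout << std::uniform_int_distribution<>(0, 100)(rng) << '\n';
}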
Related
I'm debugging a crash of my OpenCL application. I attempted to use ASan to pin down where the problem originates, but then I discovered that when I recompile with ASan enabled, my application cannot find any OpenCL devices. Simply adding -fsanitize=address to the compiler options made my program unable to use OpenCL.
With further testing, I am certain ASan is the reason.
Why is this happening? How can I use ASan with OpenCL?
An MCVE:
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    if (platforms.size() == 0)
        std::cout << "Compiled with ASan\n";
    else
        std::cout << "Compiled normally\n";
}
cl::Platform::get returns CL_SUCCESS but an empty list of platforms.
Some information about my setup:
GPU: GTX 780Ti
Driver: 418.56
OpenCL SDK: Nvidia OpenCL / POCL 1.3 with CPU and CUDA backend
Compiler: GCC 8.2.1
OS: Arch Linux (Kernel 5.0.7 x64)
The NVIDIA driver is known to conflict with ASAN. It attempts to mmap(2) memory into a fixed virtual memory range within the process, which coincides with ASAN's write-protected shadow gap region. Given that ASAN reserves about 20TB of virtual address space on startup, such conflicts are not unlikely with other programs or drivers, too.
ASAN recognizes certain flags that may be set in the ASAN_OPTIONS environment variable. To resolve the shadow gap range conflict, set the protect_shadow_gap option to 0. For example, assuming a POSIX-like shell, you may run your program like
$ ASAN_OPTIONS=protect_shadow_gap=0 ./mandelbrot
The writable shadow gap incurs additional performance costs under ASAN, since an unprotected gap requires its own shadowing. This is why it's not recommended to set this option globally (e.g., in your shell startup script). Enable it only for the programs that in fact require it.
I'm nearly certain this is the root cause of your issue. I am using ASAN with CUDA programs, and always need to set this option. The failure reported by CUDA without it is very similar: cudaErrorNoDevice error when I attempt to select a device.
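If you would rather bake the option into a specific binary than remember the environment variable, here is a minimal sketch using the sanitizer runtime's default-options hook (anything set in ASAN_OPTIONS still takes precedence):
// Compile this into the program along with -fsanitize=address.
extern "C" const char *__asan_default_options() {
    // Leave ASan's shadow gap unprotected so the NVIDIA driver can mmap
    // its fixed ranges; this costs some performance, as noted above.
    return "protect_shadow_gap=0";
}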
I am learning debugging with gdb and registers, but I am stuck at one point. Following the instructions, I should print
print $esp
result: $1 = -9008
but I was expecting a result like this:
$2 = (void *) 0x7fffffffdcd0
Next, I need to enter this command:
x/24 $esp
but gdb says it cannot access that memory:
Cannot access memory at address 0xffffffffffffdce0
You appear to be reading instructions from some i386 tutorial while using an x86_64 (64-bit) platform.
On x86_64, there is no $esp register, only $rsp.
Also note that the calling convention on x86_64 is different (arguments are not necessarily passed on the stack), so your best course of action is either to find a new 64-bit tutorial or to debug a 32-bit target (you can usually build and run 32-bit programs on 64-bit hosts by compiling and linking them with gcc -m32 ...).
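For example, on x86_64 the same steps would look like this (the address shown is just illustrative):
(gdb) print $rsp
$1 = (void *) 0x7fffffffdcd0
(gdb) x/24x $rsp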
I'm trying to profile my code but run into problems.
If I run the following code:
#include <iostream>
int main() {
    size_t val = 8;
    std::cout << sizeof(val) << std::endl;
    std::cout << __builtin_ctz(val) << std::endl;
}
It prints, as expected:
8
3
If I run valgrind on it, it returns:
==28602== Memcheck, a memory error detector
==28602== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==28602== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==28602== Command: ./test
==28602==
8
vex amd64->IR: unhandled instruction bytes: 0xF3 0xF 0xBC 0xC0 0x89 0xC6 0xBF 0x60
==28602== valgrind: Unrecognised instruction at address 0x400890.
==28602== at 0x400890: main (in /home/magu_/sod/test/test)
==28602== Your program just tried to execute an instruction that Valgrind
==28602== did not recognise. There are two possible reasons for this.
==28602== 1. Your program has a bug and erroneously jumped to a non-code
==28602== location. If you are running Memcheck and you just saw a
==28602== warning about a bad jump, it's probably your program's fault.
==28602== 2. The instruction is legitimate but Valgrind doesn't handle it,
==28602== i.e. it's Valgrind's fault. If you think this is the case or
==28602== you are not sure, please let us know and we'll try to fix it.
==28602== Either way, Valgrind will now raise a SIGILL signal which will
==28602== probably kill your program.
==28602==
==28602== Process terminating with default action of signal 4 (SIGILL)
==28602== Illegal opcode at address 0x400890
==28602== at 0x400890: main (in /home/magu_/sod/test/test)
==28602==
==28602== HEAP SUMMARY:
==28602== in use at exit: 0 bytes in 0 blocks
==28602== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==28602==
==28602== All heap blocks were freed -- no leaks are possible
==28602==
==28602== For counts of detected and suppressed errors, rerun with: -v
==28602== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
Illegal instruction (core dumped)
Is this a bug in valgrind, or should I not use __builtin_ctz on my machine? __builtin_popcount does not raise any errors.
My system:
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
CPU : Intel Core Duo T7500
You need to upgrade valgrind to at least 3.8.1 or use a gcc older than 4.8.
The opcode you ran into -- F3 0F BC -- is the TZCNT instruction, introduced with BMI1, which your CPU doesn't implement. However, the same bytes also decode as REP;BSF (F3 is the REP prefix), and older CPUs, including yours, ignore the REP for this opcode; the similar LZCNT == REP;BSR pair behaves the same way. There is very little difference between TZCNT and BSF (they differ only in how they handle an input of 0).
Older gcc versions used BSF for older CPUs and TZCNT for newer ones, but since the opcode is relatively rare, newer gcc versions simplified the logic and always emit TZCNT, since both older and newer CPUs execute it (older ones simply as BSF).
Unfortunately, valgrind did not correctly fall back from TZCNT to BSF until 3.8.1. See bug 295808.
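To make the semantic difference concrete, a small sketch (the one case where the two instructions disagree, an input of 0, is undefined for the builtin anyway):
#include <iostream>

int main() {
    // For non-zero inputs TZCNT and BSF give the same answer, which is
    // why encoding TZCNT as REP;BSF is safe on older CPUs. For 0, TZCNT
    // returns the operand width, while BSF leaves the result undefined --
    // mirrored by __builtin_ctz(0) being undefined behavior in GCC.
    unsigned long val = 8;
    std::cout << __builtin_ctzl(val) << '\n';  // prints 3 (8 == 0b1000)
    return 0;
}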
On Debian/Sid/x86-64 (Intel i7-4750HQ processor) with gcc 4.9.1 (Debian 4.9.1-4) and valgrind-3.9.0, your test works fine (and valgrind runs successfully without reporting any errors).
So I suggest you upgrade your GCC compiler and, most importantly, valgrind. Start by compiling valgrind from its valgrind-3.9.0 source code tarball (run aptitude build-dep valgrind first).
BTW, your distribution version is quite old. Did you consider upgrading to Ubuntu 14.04 LTS?
If you don't have root access, consider passing an explicit --prefix (e.g. $HOME/pub) to valgrind-3.9.0/configure.
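A sketch of that build, assuming the 3.9.0 tarball and a $HOME/pub prefix:
$ aptitude build-dep valgrind
$ tar xf valgrind-3.9.0.tar.bz2 && cd valgrind-3.9.0
$ ./configure --prefix=$HOME/pub
$ make && make install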
When analyzing a core file, my gdb 7.0 outputs several warnings:
warning: Wrong size gregset in core file.
warning: Wrong size fpregset in core file.
warning: Wrong size gregset in core file.
warning: Wrong size fpregset in core file.
warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
I am not sure if it's related, but I am unable to get a backtrace:
(gdb) bt
#0 0x00000000 in ?? ()
The OS/architecture is Sun Solaris 10 on SPARC.
Questions:
What is the reason/cause of these warnings?
Why can't I retrieve a backtrace?
How to fix these problems?
The problem can be in gdb as well as in your program.
I would recommend updating gdb to the most recent version (7.3.1). It could also be helpful to create a simple test program and analyze its core with gdb, to be sure that your toolchain works at all; a sketch follows below.
The "gregset" and "fpregset" warnings indicate that gdb was unable to read data from the core file. This can happen if your program went wild and corrupted the stack. The gregset error means that gdb could not read the general-purpose register set from the core file; fpregset is the floating-point register set. The expected register size is platform dependent.
bt will not work if the core file cannot be read properly.
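A throwaway crasher for that sanity check (illustrative; compile with -g, enable core dumps with ulimit -c unlimited, run it, then load the core in gdb and check that bt shows main):
int main() {
    // Deliberate null-pointer write: raises SIGSEGV and produces a core
    // dump, giving you a known-good core file to test gdb against.
    int *volatile p = 0;
    *p = 42;
    return 0;  // never reached
}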
I also got the fpregset warnings (and no stack trace) when I tried to work on a 64-bit core dump with gdb 7.6.2 on Solaris 10. The cause seems to be that userspace applications on Solaris 10 are compiled as 32-bit by default, and thus without support for 64-bit core dumps.
The guys in GDB's IRC channel gave me the following configure parameter:
--enable-64-bit-bfd
I also compiled a 64-bit version of gdb (-m64), but that shouldn't be necessary. With this, gdb could work on the 64-bit core dump and produce the stack trace without any warnings.
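For reference, a sketch of such a configure invocation, assuming you are building gdb from source:
$ ./configure --enable-64-bit-bfd
$ make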
I'm trying to build a library for a Cortex-A9 ARM processor (an OMAP4, to be more specific), and I'm a little confused about which of NEON and VFP to use, and when, in the context of floating-point operations and SIMD. Note that I know the difference between the two hardware coprocessor units (as also outlined here on SO); I just have some misunderstandings regarding their proper usage.
Related to this I'm using the following compilation flags:
GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp
I've read through the ARM documentation, a lot of wikis (like this one), and forum and blog posts, and everybody seems to agree that using NEON is better than using VFP,
or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea; I'm not 100% sure yet whether this applies to the entire application/library or just to specific places (functions) in the code.
So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a bit of trouble, and my confusion about how best to use these features (NEON vs VFP) on the Cortex A9 only deepens instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes
whose calculations are based on double-precision floating point. Using NEON as the FPU gives completely inappropriate results (trying to print those values results mostly in inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single-precision floating point, as it is documented that NEON does not handle double-precision floating point. My benchmarks still don't give the proper results (and what's worse is that now it does not work anymore on x86; I think it's because of the loss in precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for its SIMD capabilities, but using it as the FPU does not give the proper results; on the other hand, mixing it with the VFP does not seem like a very good idea.
Any advice in this area will be greatly appreciated !!
In the above-mentioned wiki I found a summary of what should be done for floating-point optimization in the context of NEON:
"
Only use single precision floating point
Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
Minimize Conditional Branches
Enable RunFast mode
For softfp:
Inline floating point code (unless it's very large)
Pass FP arguments via pointers instead of by value and do integer work in between function calls.
"
I cannot use hard for the float ABI, as I cannot link against the libraries I have available.
Most of the recommendations make sense to me (except "RunFast mode", where I don't understand exactly what it is supposed to do, and the claim that at this point I could do better than the compiler), but I keep getting inconsistent results and I'm not sure of anything right now.
Could anyone shed some light on how to properly use floating point and NEON on the Cortex A9/A8, and which compilation flags I should use?
... forum and blog posts, and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea
I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:
The NEON register bank consists of 32 64-bit registers. If both
Advanced SIMD and VFPv3 are implemented, they share this register
bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that
supports 32 double-precision floating-point registers. This
integration simplifies implementing context switching support, because
the same routines that save and restore VFP context also save and
restore NEON context.
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers
and each of the Q0-Q15 registers map onto a pair of D registers.
Figure 1.3 shows the different views of the shared NEON and VFP
register bank. All of these views are accessible at any time. Software
does not have to explicitly switch between them, because the
instruction used determines the appropriate view.
The registers don't compete; rather, they co-exist as views of the same register bank. There is no way to separate the NEON and FPU register files.
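You can see the aliasing from C with a minimal sketch (assuming an ARMv7 toolchain with NEON enabled, e.g. -mfpu=neon): a Q-register value and its D-register halves are just two views of the same bits.
#include <arm_neon.h>
#include <cstdio>

int main() {
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float32x4_t q = vld1q_f32(data);    // one 128-bit Q-register view
    float32x2_t lo = vget_low_f32(q);   // D-register view of its low half
    float32x2_t hi = vget_high_f32(q);  // D-register view of its high half
    // No data is moved by vget_low/vget_high; they only select a view.
    std::printf("%g %g | %g %g\n",
                vget_lane_f32(lo, 0), vget_lane_f32(lo, 1),
                vget_lane_f32(hi, 0), vget_lane_f32(hi, 1));
    return 0;
}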
Related to this I'm using the following compilation flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and the compiler.
gnueabihf tells me the platform uses hard floats, which can speed up procedure calls. If in doubt, use softfp, because it's compatible with hard floats.
BeagleBone Black:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 2 (v7l)
Features : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
...
So the BeagleBone uses:
-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
CubieTruck v5:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4
So the CubieTruck uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Banana Pi Pro:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
So the Banana Pi uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Raspberry Pi 3:
The RPI3 is unique in that it's ARMv8, but it's running a 32-bit OS. That means it's effectively 32-bit ARM, or Aarch32. There's a little more to 32-bit ARM vs Aarch32, but this will show you the Aarch32 flags.
Also, the RPI3 uses a Broadcom SoC with Cortex-A53 cores, and it has NEON and the optional CRC32 instructions, but lacks the optional crypto extensions.
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 4 (v7l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...
So the Raspberry Pi can use:
-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
Or it can use (I don't know what to use for -mtune):
-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard
ODROID C2:
The ODROID C2 uses an Amlogic SoC with Cortex-A53 cores, but it runs a 64-bit OS. The ODROID C2 has NEON and the optional CRC32 instructions, but lacks the optional crypto extensions (a similar configuration to the RPI3).
$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu
$ cat /proc/cpuinfo
Features : fp asimd evtstrm crc32
So the ODROID uses:
-march=armv8-a+crc -mtune=cortex-a53
In the recipes above, I identified the ARM processor (like a Cortex-A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers the output of /proc/cpuinfo:
CPU part: Part number. 0xd03 indicates Cortex-A53 processor.
So we may be able to look up the value from a database. I don't know if one exists or where it's located.
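If you'd rather probe features at runtime than parse /proc/cpuinfo, here is a minimal sketch (assuming 32-bit ARM Linux with glibc; HWCAP_NEON is specific to the 32-bit ARM hwcaps):
#include <sys/auxv.h>    // getauxval, AT_HWCAP
#include <asm/hwcap.h>   // HWCAP_NEON (32-bit ARM only)
#include <cstdio>

int main() {
    // The kernel exposes the same feature bits it prints in
    // /proc/cpuinfo through the ELF auxiliary vector.
    unsigned long hwcaps = getauxval(AT_HWCAP);
    std::printf("NEON: %s\n", (hwcaps & HWCAP_NEON) ? "yes" : "no");
    return 0;
}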
I think this question should be split up into several, adding some code examples and detailing the target platform and the versions of the toolchains used.
But to cover one part of confusion:
The recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine; the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.
-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.
See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.
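For illustration, a minimal sketch of that kind of single-precision SIMD (assuming GCC targeting ARMv7 with -mfpu=neon):
#include <arm_neon.h>
#include <cstdio>

int main() {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float32x4_t va = vld1q_f32(a);       // load 4 floats into a Q register
    float32x4_t vb = vld1q_f32(b);
    float32x4_t vc = vaddq_f32(va, vb);  // 4 single-precision adds at once
    float c[4];
    vst1q_f32(c, vc);                    // store the 4 results
    std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}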
I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in hand-optimizing for it.
It might sound rude, but I really don't see much point in NEON intrinsics either. They're more trouble than help, if any.
Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.
Then you can start writing native NEON code without worrying about the compiler doing something astral and spitting out tons of errors/warnings.
Learning the NEON instructions is less demanding than all those intrinsic macros. And above all, the results are so much better.
Fully optimized native NEON code usually runs more than twice as fast as well-written intrinsics counterparts.
Just compare the OP's version with mine in the link below, and you'll see what I mean.
Optimizing RGBA8888 to RGB565 conversion with NEON
regards