How to compile node-v4.2.4 for ARMv7 without an FPU? - c++

I have a device whose CPU is ARMv7 but has no FPU.
I can compile node with the option --with-arm-float-abi=soft, but when I run "node", I get "Illegal instruction (core dumped)".
root@router:/tmp/target/bin# ./node -v
v4.2.4
root@router:/tmp/target/bin# ./node --v8-options | head -2
target arm v7 vfp3 soft
ARMv7=1 VFP3=1 VFP32DREGS=0 NEON=0 SUDIV=0 UNALIGNED_ACCESSES=1
MOVW_MOVT_IMMEDIATE_LOADS=0 COHERENT_CACHE=0 USE_EABI_HARDFLOAT=0
objdump showed me that the binary uses instructions (such as vpush, vpop, ...) which are not supported by my CPU (ARMv7 without an FPU).
Digging further, I found that the openssl and v8 sources bundled with node both use FPU instructions.
The configure line is as below:
./configure \
--prefix=target \
--dest-cpu=arm \
--dest-os=linux \
--without-snapshot \
--with-arm-float-abi=soft \
--fully-static
Can somebody tell me how to compile node-v4.2.4 without FPU support?
source code: nodejs-v4.2.2
ARM core: Cortex-A9 (its Floating-Point Unit (FPU) is optional)
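To rule out the cross-toolchain itself, a minimal check (my own sketch; the file name is arbitrary) can be built with the same compiler and -mfloat-abi=soft and run on the device:
// fputest.cc - if even this soft-float binary dies with SIGILL on the
// device, the toolchain itself emits FPU instructions; if it runs fine,
// the problem is specific to node (v8/openssl).
#include <cstdio>
int main() {
    volatile double a = 1.5, b = 2.25;  // volatile prevents constant folding
    std::printf("a * b = %f\n", a * b); // forces a runtime FP multiply
    return 0;
}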

After many tries, I used node-v0.10.14 instead, which works well without FPU support. ;-)
So I still do not know how to compile nodejs-v4.2.2 without FPU support.

It's impossible.
V8 has not supported a no-FPU mode since 3.18 (https://github.com/nodejs/node/issues/4447#issuecomment-168549889); the assumption is that the kernel can emulate the FPU for you. And Node.js is based on V8.
Relevant comment in the source code:
https://github.com/v8/v8/blob/master/src/arm/assembler-arm.cc#L174
It's clarified in the v8-users mailing list too.

Related

Building and using a pure llvm toolchain for c++ on linux

Assuming this is possible, could someone tell me how I have to configure the cmake build to create a "pure" llvm toolchain on ubuntu-16.04 consisting of
clang
lld
libc++
libc++abi
libunwind (llvm)
compiler-rt
any other pieces that might be relevant and are "production ready"
The resulting compiler should
be as fast as possible (optimizations turned on, no unnecessary asserts or other checks in the compiler binary itself)
be installed in a separate, local directory (let's call it <llvm_install>)
not have dependencies on the llvm toolchain provided by the package manager
use libc++, libc++abi, etc. by default.
support the sanitizers (ubsan, address, memory, thread) (which probably means that I have to compile libc++ a second time)
So far I have cloned
llvm from http://llvm.org/git/llvm.git into <llvm_root>
clang from http://llvm.org/git/clang.git into <llvm_root>/tools/clang
lld from http://llvm.org/git/lld.git into <llvm_root>/tools/lld
compiler-rt, libcxx, libcxxabi, libunwind from http://llvm.org/git/<project_name> into <llvm_root>/projects/<project_name>
Then I ran ccmake in a separate directory - I have tried various settings, but as soon as I try anything fancier than turning optimizations on, I almost always get some sort of build error. Unfortunately, I have yet to find a way to export my changes from ccmake; otherwise I'd give you an example with the settings and the corresponding error, but I'm more interested in a best practice than in a fix for my test configs anyway.
Bonus points: By default, this should build with the default g++ toolchain, but I'd also be interested in a two stage build if that improves the performance of the final toolchain (e.g. by using LTO).
Btw.: The whole idea came from watching Chandler's talk
Pacific++ 2017: Chandler Carruth "LLVM: A Modern, Open C++ Toolchain"
My usual procedure is to build a small enough LLVM/Clang so that I have something working with libc++ and libc++abi. I guess you can use the system-provided LLVM, but I haven't tried it. For this step, what you have checked out is probably enough. A sample script for this:
cmake \
-G Ninja \
-DCMAKE_EXPORT_COMPILE_COMMANDS=On \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_SHARED_LIBS=On \
-DLLVM_ENABLE_ASSERTIONS=Off \
-DLLVM_TARGETS_TO_BUILD="X86" \
-DLLVM_ENABLE_SPHINX=Off \
-DLLVM_ENABLE_THREADS=On \
-DLIBCXX_ENABLE_EXCEPTIONS=On \
-DLIBCXX_ENABLE_RTTI=On \
-DCMAKE_INSTALL_PREFIX=[path-to-install-dir] \
[path-to-source-dir]
With the aforementioned clang in your PATH environment variable, you can use the build script below, again adjusting it to your needs (sanitizers, etc.). Apart from the main documentation page on the subject, poking around the CMakeLists.txt of each respective tool is also illuminating and helps adjust the build process from version to version.
LLVM_TOOLCHAIN_LIB_DIR=$(llvm-config --libdir)
LD_FLAGS=""
LD_FLAGS="${LD_FLAGS} -Wl,-L ${LLVM_TOOLCHAIN_LIB_DIR}"
LD_FLAGS="${LD_FLAGS} -Wl,-rpath-link ${LLVM_TOOLCHAIN_LIB_DIR}"
LD_FLAGS="${LD_FLAGS} -lc++ -lc++abi"
CXX_FLAGS=""
CXX_FLAGS="${CXX_FLAGS} -stdlib=libc++ -pthread"
CC=clang CXX=clang++ \
cmake -G Ninja \
-DCMAKE_EXPORT_COMPILE_COMMANDS=On \
-DBUILD_SHARED_LIBS=On \
-DLLVM_ENABLE_LIBCXX=On \
-DLLVM_ENABLE_LIBCXXABI=On \
-DLLVM_ENABLE_ASSERTIONS=On \
-DLLVM_TARGETS_TO_BUILD="X86" \
-DLLVM_ENABLE_SPHINX=Off \
-DLLVM_ENABLE_THREADS=On \
-DLLVM_INSTALL_UTILS=On \
-DLIBCXX_ENABLE_EXCEPTIONS=On \
-DLIBCXX_ENABLE_RTTI=On \
-DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_CXX_FLAGS="${CXX_FLAGS}" \
-DCMAKE_SHARED_LINKER_FLAGS="${LD_FLAGS}" \
-DCMAKE_MODULE_LINKER_FLAGS="${LD_FLAGS}" \
-DCMAKE_EXE_LINKER_FLAGS="${LD_FLAGS}" \
-DCMAKE_POLICY_DEFAULT_CMP0056=NEW \
-DCMAKE_POLICY_DEFAULT_CMP0058=NEW \
-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} \
[path-to-source-dir]
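After installing, a quick way to confirm that the resulting clang++ really picks up libc++ is a tiny smoke test (my own sketch; _LIBCPP_VERSION is the version macro defined by the libc++ headers):
// check_libcxx.cc - build with <llvm_install>/bin/clang++ -stdlib=libc++ check_libcxx.cc
#include <ciso646>  // pulls in the standard library's configuration macros
#include <iostream>
int main() {
#ifdef _LIBCPP_VERSION
    std::cout << "using libc++ " << _LIBCPP_VERSION << "\n";
#else
    std::cout << "not libc++ (probably libstdc++)\n";
#endif
}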
A note on performance: I haven't watched that talk yet, but my motivation behind this two-step build was to have a toolchain that I can easily relocate between systems, since the only system dependency that matters is libc.
Lastly, relevant to the above procedure is this older question of mine, which still bugs me. If you have any insight on this, please don't hesitate.
PS: Scripts have been tested with LLVM 3.7 through 3.9 and current trunk 6.0.0.
Update: I've also applied these suggestions, and there is marked improvement when using the gold linker instead of ld. LTO is also a boost.

Compile OpenCV with TBB on Raspberry Pi 2

I've tried to build OpenCV on a Raspberry Pi 2 with TBB. I've installed TBB from source on the Pi and specified the path to the TBB libs in the cmake config, but I'm getting the error:
/home/mihai/tbb43_20150316oss/include/tbb/machine/gcc_armv7.h:31:2: error: #error compilation requires an ARMv7-a architecture.
I think the error is because I have to include the flags for ARMv7 in the OpenCV makefile:
-DTBB_USE_GCC_BUILTINS=1 -D__TBB_64BIT_ATOMICS=0
The problem is that I don't know where to include them. Has anyone had this problem and wants to share a solution?
I have resolved it :D. For those having this problem, follow these steps:
1. Go to the file gcc_armv7.h at line 31 and comment out these lines:
30 #if !(__ARM_ARCH_7A__)
31 #error compilation requires an ARMv7-a architecture.
32 #endif
2. Next, in the same file gcc_armv7.h, go to line 56 and replace it with:
56 #define __TBB_full_memory_fence() ((void (*)(void))0xffff0fa0)() // was: __asm__ __volatile__("dmb ish": : :"memory")
For those who want an explanation of how I did it: after the first step, I got the following errors:
/tmp/ccnkbkfd.s:313: Error: selected processor does not support ARM mode `dmb ish'
/tmp/ccnkbkfd.s:386: Error: selected processor does not support ARM mode `dmb ish'
/tmp/ccnkbkfd.s:533: Error: selected processor does not support ARM mode `dmb ish'
/tmp/ccnkbkfd.s:562: Error: selected processor does not support ARM mode `dmb ish'
I then searched on Google and found this:
The alternative to using dmb is to call the Linux kernel's __kuser_memory_barrier helper. The __kuser_memory_barrier helper operation is found in all ARM kernels 2.6.15 and later and provides a way to issue a memory barrier that will work across all ARM architectures. The __kuser_memory_barrier helper function is found at address 0xffff0fa0.
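Calling a helper that lives at a fixed kernel-provided address means casting that address to a function pointer and invoking it; a sketch (mine, Linux/ARM only) of what the replacement macro boils down to:
// __kuser_memory_barrier lives at 0xffff0fa0, an address fixed by the
// Linux/ARM kernel ABI; it must be *called*, not merely referenced.
typedef void (*kuser_memory_barrier_t)(void);
static inline void full_memory_fence(void) {
    ((kuser_memory_barrier_t)0xffff0fa0)();
}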
Alternatively, you can run
sudo make CXXFLAGS="-DTBB_USE_GCC_BUILTINS=1 -D__TBB_64BIT_ATOMICS=0"
instead of just running
sudo make

TI DM3730 (Design reference: beagleboard) computes wrong floating point operation results

The Situation
We have a board with a TI DM3730 processor (also known from the Beagleboard) with a Cortex A8 core (r3p2) in use with the following parameters:
Beagleboard Reference Design: Beagleboard-xM Rev-C
Kernel version: 3.2.8
Open CV library: 2.4.6
U-Boot: uboot-2013.04
Toolchain: Sourcery CodeBench ARM 2011.03
Buildroot: 2012.02
The setup is derived from this blog
Now we have written a program (written in C++ and compiled with GCC version 4.5.2) which uses the OpenCV library (to calculate some scores using support vector machines) and which behaves in some strange way:
The program runs on the board in its own process using defined test data: It produces repeatedly correct results.
The program runs in two or more processes (with the same defined test data): The results start to become wrong for each process; processes die with segfaults. The last remaining process runs correctly again.
The program runs in its own process (with the same defined test data again). Additionally, another process changes some exposure settings of an attached camera: The program starts to produce wrong results.
So we assume this is a very low-level floating-point problem.
What we tried
The complete system (all libraries, kernel, boot loader, etc.) has been compiled with compiler flags as suggested on pandorawiki.org regarding Floating_Point_Optimization:
-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
-ffast-math -fsingle-precision-constant
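As an aside (my own illustration, not from the original report), -fsingle-precision-constant alone can silently change results, because every unsuffixed floating-point literal is rounded to float before it is used:
// constants.cc - with -fsingle-precision-constant the literal 0.1 is
// rounded to float first, so x prints as 0.10000000149011612 instead of
// the closest double to 0.1.
#include <cstdio>
int main() {
    double x = 0.1;
    std::printf("%.17g\n", x);
    return 0;
}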
We tried to enable L1NEON in the Cortex-A8 auxiliary control register according to the Beagleboard FAQ, and tried the other options mentioned there as well, but unfortunately to no avail.
All three different behaviors are reproducible, but not in the form of a minimal working example.
The same program source and the first and second scenario run correctly on Windows (using Visual Studio) and on a desktop running Linux (GCC), so it's probably not something our code does.
So the questions are now:
Are there any other known bugs with this setup and floating point operations which we are not aware of?
Are there any known compiler options which should be set or omitted which can lead to the observed results?
If an MWE would be helpful, we will look into providing one.
Any clues are welcome.
OK, we now use an up-to-date buildroot (2014.08) with the included toolchain (arm-buildroot-linux-uclibcgnueabi-), Linux kernel 3.9.11, boost 1.55, Qt 4.8.6, and still OpenCV 2.4.6.
When compiling, we optimize for size (-Os), and for target optimization we only use -pipe.
The following compiler-flags are currently not used anymore:
-mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -ffast-math -fsingle-precision-constant
Unfortunately, we still don't know the exact reason for the original problem, but we are quite happy that the problem went away with this setup.
So maybe this answer helps some poor soul in the future... ;)

Finding which version of valgrind is running

In C/C++, I can include the valgrind headers to know at runtime whether or not my software is running under valgrind:
#include <valgrind/valgrind.h>
bool RunningOnValgrind()
{
    return RUNNING_ON_VALGRIND ? true : false;
}
This is documented in the valgrind manual.
I would like to be able to know if the valgrind my program runs under supports AVX instructions. How do I write a function that returns this information?
From the valgrind release notes, I know that these are supported from version 3.8 onwards. Hence one solution would be to spawn a process to execute valgrind --version and then parse the output, but there must be a better way.
If you look in valgrind/valgrind.h, you will see that you can check the valgrind version number this way after including valgrind.h:
#if defined(__VALGRIND_MAJOR__) && defined(__VALGRIND_MINOR__) \
&& (__VALGRIND_MAJOR__ > 3 \
|| (__VALGRIND_MAJOR__ == 3 && __VALGRIND_MINOR__ >= 8))
/* code to say avx is supported */
#endif
This has many limitations, however: it assumes you are using a globally installed version and that your path isn't pointing at some personal, user-built valgrind outside the default location (which I have done). It also assumes that you are building on the machine where you will run valgrind, and not shipping around a pre-built executable (which in my experience happens often), so you can't rely on it.
With that in mind, spawning a sub-process with valgrind --version and checking the output may truly be the best alternative.
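A sketch of that alternative (the function name is mine; "valgrind-X.Y.Z" is the format valgrind --version prints):
#include <cstdio>  // popen/pclose are POSIX; declared here on Linux
// Returns true if the valgrind found on PATH reports version >= 3.8,
// the first release whose notes list AVX support.
bool ValgrindSupportsAvx()
{
    FILE* p = popen("valgrind --version 2>/dev/null", "r");
    if (!p) return false;
    char buf[64] = {0};
    const bool got = fgets(buf, sizeof buf, p) != nullptr;
    pclose(p);
    int major = 0, minor = 0;
    return got && sscanf(buf, "valgrind-%d.%d", &major, &minor) == 2
               && (major > 3 || (major == 3 && minor >= 8));
}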
You can use __VALGRIND_MAJOR__ to detect the version at compile time, or use the --version flag to get the exact version.

Cortex A9 NEON vs VFP usage confusion

I'm trying to build a library for a Cortex-A9 ARM processor (an OMAP4, to be more specific) and I'm a little confused about which/when to use NEON vs VFP in the context of floating-point operations and SIMD. Note that I know the difference between the two hardware coprocessor units (as also outlined here on SO); I just have some misunderstanding regarding their proper usage.
Related to this I'm using the following compilation flags:
GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp
I've read through the ARM documentation, a lot of wikis (like this one), forum and blog posts, and everybody seems to agree that using NEON is better than using VFP,
or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea; I'm not 100% sure yet if this applies to the entire application/library or just to specific places (functions) in the code.
So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a bit of trouble, and my confusion about how to best use these features (NEON vs VFP) on the Cortex-A9 only deepens instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes
in which calculations are based on double-precision floating point. Using NEON as the FPU gives completely wrong results (trying to print those values prints mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single-precision floating point, as NEON is documented not to handle double-precision floating point. My benchmarks still don't give the proper results (and what's worse, they now no longer work on x86 either; I think it's because of the loss in precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for the SIMD capabilities, but using it as the FPU does not give the proper results; on the other hand, mixing it with the VFP does not seem like a very good idea.
Any advice in this area will be greatly appreciated!!
In the article on the above-mentioned wiki I found a summary of what should be done for floating-point optimization in the context of NEON:
"
Only use single precision floating point
Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
Minimize Conditional Branches
Enable RunFast mode
For softfp:
Inline floating-point code (unless it's very large)
Pass FP arguments via pointers instead of by value and do integer work in between function calls.
"
I cannot use hard for the float ABI as I cannot link with the libraries I have available.
Most of the recommendations make sense to me (except "RunFast mode", where I don't understand exactly what it's supposed to do, and the claim that at this moment in time I could do better than the compiler), but I keep getting inconsistent results and I'm not sure of anything right now.
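For what it's worth, "RunFast mode" is the VFP's fast execution mode: flush-to-zero plus default-NaN, traded against strict IEEE 754 behavior. It is controlled through the FPSCR register; a hedged sketch in GCC inline assembly (ARM32; FZ is bit 24 and DN is bit 25 per the ARM architecture manual):
// Enable flush-to-zero and default-NaN in the VFP status register.
// Note this sacrifices IEEE-conformant handling of denormals and NaNs.
static inline void enable_runfast(void)
{
    unsigned int fpscr;
    __asm__ volatile("vmrs %0, fpscr" : "=r"(fpscr));
    fpscr |= (1u << 24) | (1u << 25); // FZ (flush-to-zero), DN (default NaN)
    __asm__ volatile("vmsr fpscr, %0" : : "r"(fpscr));
}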
Could anyone shed some light on how to properly use the floating point and the NEON for the Cortex A9/A8 and which compilation flags should I use?
... forum and blog posts and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea
I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:
The NEON register bank consists of 32 64-bit registers. If both
Advanced SIMD and VFPv3 are implemented, they share this register
bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that
supports 32 double-precision floating-point registers. This
integration simplifies implementing context switching support, because
the same routines that save and restore VFP context also save and
restore NEON context.
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers
and each of the Q0-Q15 registers map onto a pair of D registers.
Figure 1.3 shows the different views of the shared NEON and VFP
register bank. All of these views are accessible at any time. Software
does not have to explicitly switch between them, because the
instruction used determines the appropriate view.
The registers don't compete; rather, they co-exist as views of the same register bank. There's no way to separate the NEON and FPU hardware.
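That sharing is directly visible from intrinsics; in this small sketch (mine, not from the ARM text) a 128-bit Q register is consumed as its two 64-bit D-register halves with no mode switch:
#include <arm_neon.h>
// vget_low_f32/vget_high_f32 are just the D-register views (D2n, D2n+1)
// of the same Q register holding the four floats.
float32x2_t halves_sum(float32x4_t q) {
    float32x2_t lo = vget_low_f32(q);
    float32x2_t hi = vget_high_f32(q);
    return vadd_f32(lo, hi); // two-lane add of the two halves
}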
Related to this I'm using the following compilation flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and the compiler.
gnueabihf tells me the platform uses hard floats, which can speed up procedure calls. If in doubt, use softfp: softfp code still uses the FPU hardware but keeps the soft-float calling convention, so it stays link-compatible with soft-float code.
BeagleBone Black:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 2 (v7l)
Features : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
...
So the BeagleBone uses:
-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
CubieTruck v5:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4
So the CubieTruck uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Banana Pi Pro:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
So the Banana Pi uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Raspberry Pi 3:
The RPI3 is unique in that it's ARMv8, but it's running a 32-bit OS. That means it's effectively 32-bit ARM, or Aarch32. There's a little more to 32-bit ARM vs Aarch32 than that, but this will show you the Aarch32 flags.
Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 4 (v7l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...
So the Raspberry Pi can use:
-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
Or it can use (I don't know what to use for -mtune):
-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard
ODROID C2:
The ODROID C2 uses an Amlogic A53 SoC, but runs a 64-bit OS. Like the RPI3, it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.
$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu
$ cat /proc/cpuinfo
Features : fp asimd evtstrm crc32
So the ODROID uses:
-march=armv8-a+crc -mtune=cortex-a53
In the above recipes, I identified the ARM processor (like the Cortex-A9 or A53) by inspecting datasheets. According to this answer on Unix and Linux Stack Exchange, which deciphers the output of /proc/cpuinfo:
CPU part: Part number. 0xd03 indicates Cortex-A53 processor.
So we may be able to look up the value from a database. I don't know if one exists or where it's located.
I think this question should be split up into several, adding some code examples and detailing target platform and versions of toolchains used.
But to cover one part of confusion:
The recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine, the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.
-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.
See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.
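To make that concrete, here is a minimal intrinsics sketch (mine, not from the answer) doing four single-precision additions in one NEON instruction; it builds with the question's -mfpu=neon -mfloat-abi=softfp flags:
#include <arm_neon.h>
// out[i] = a[i] + b[i] for i = 0..3, in a single vector add.
void add4(const float* a, const float* b, float* out) {
    float32x4_t va = vld1q_f32(a); // load 4 floats
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
}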
I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in hand-optimizing for it.
It might sound rude, but I really don't see much point in NEON intrinsics either. They're more trouble than help - if any.
Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.
Then you can start writing native NEON code without worrying about the compiler doing something astral or spitting out tons of errors/warnings.
Learning the NEON instructions is less demanding than all those intrinsic macros. And above all this, the results are so much better.
Fully optimized native NEON code usually runs more than twice as fast as well-written intrinsics counterparts.
Just compare the OP's version with mine in the link below, and you'll know what I mean.
Optimizing RGBA8888 to RGB565 conversion with NEON
regards