How to check if compiled code uses SSE and AVX instructions?

I wrote some code to do a bunch of math, and it needs to go fast, so I need it to use SSE and AVX instructions. I'm compiling it using g++ with the flags -O3 and -march=native, so I think it's using SSE and AVX instructions, but I'm not sure. Most of my code looks something like the following:
for (int i = 0; i < size; i++) {
    a[i] = b[i] * c[i];
}
Is there any way I can tell if my code (after compilation) uses SSE and AVX instructions? I think I could look at the assembly to see, but I don't know assembly, and I don't know how to see the assembly that the compiler outputs.

Under Linux, you can also disassemble your binary:
objdump -d YOURFILE > YOURFILE.asm
Then find all SSE instructions:
awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2si|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm
Find only packed SSE instructions (suggested by @Peter Cordes in comments):
awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm
Find all SSE2 instructions (except MOVSD and CMPSD, which were first introduced in 80386):
awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvttps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm
Find only packed SSE2 instructions:
awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movhpd|movlpd|movmskpd|movntdq|movntpd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm
Find all SSE3 instructions:
awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm
Find all SSSE3 instructions:
awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm
Find all SSE4 instructions:
awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm
Find the most common AVX instructions (including scalar, AVX2, the AVX-512 family, and some FMA such as vfmadd132pd):
awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmovsd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm
NOTE: tested with gawk and nawk.
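Putting it together, a minimal end-to-end check might look like this (file names are placeholders; the pattern also matches the VEX-encoded vmulps/vaddps forms as substrings):
g++ -O3 -march=native -o mathtest mathtest.cpp
objdump -d mathtest > mathtest.asm
# count lines containing packed-single multiplies/adds (mulps/addps, vmulps/vaddps)
grep -c -E '(mul|add)ps' mathtest.asm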

There is no need to check the assembly. Most compilers provide optimisation reports that tell you exactly whether or not your loops were vectorised using SIMD instructions.
If you compile using GCC, set -O3 -march=native to make sure vectorisation is performed using whichever SIMD instruction set (SSE, AVX, ...) the CPU you are compiling on supports, and add -fopt-info to make the compiler verbose about optimisations:
g++ -O3 -march=native -fopt-info -c main.cpp -o main.o
This will give you output like:
main.cpp:12:20: note: loop vectorized
main.cpp:12:20: note: loop peeled for vectorization to enhance alignment
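On reasonably recent GCC you can narrow the report to vectorisation, and also ask why a loop was not vectorised; a sketch, reusing main.cpp from above:
# vectorisation successes only
g++ -O3 -march=native -fopt-info-vec -c main.cpp
# loops that failed to vectorise, with the reason
g++ -O3 -march=native -fopt-info-vec-missed -c main.cpp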
Hope that helps.

Notice that most packed SSE instructions end with PS/PD, so there is a simpler way to check for packed SSEx instructions after dumping the binary's contents to asmfile:
grep %xmm asmfile | grep -P '([[:xdigit:]]{2}\s)+\s*[[:alnum:]]+p[sd]\s+'
Alternatively, the xmm check can be combined into the pattern:
grep -P '([[:xdigit:]]{2}\s)+\s*[[:alnum:]]+p[sd]\s+.+xmm' asmfile
This will suffice for programs that only use floating-point operations. For better coverage, however, you also need to check for instructions beginning with P, so you need to change the regex a bit:
grep -P '([[:xdigit:]]{2}\s)+\s*([[:alnum:]]+p[sd]\s+|p[[:alnum:]]+).+%xmm' asmfile
To also include MMX instructions in 32-bit code, change the %xmm part at the end to %x?mm.
To check for AVX1/2, you just need to look for ymm or %ymm usage instead of checking instruction names, because any instruction that touches a ymm register must be AVX or later (note that 128-bit VEX-encoded AVX instructions still use xmm registers, so this only catches 256-bit usage):
grep ymm asmfile
Similarly AVX-512 can be checked with
grep zmm asmfile
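A quick summary along these lines (the binary name is a placeholder; this assumes objdump's default AT&T syntax, where registers carry a % prefix):
objdump -d myprog | grep -c '%xmm'   # SSE, or 128-bit AVX
objdump -d myprog | grep -c '%ymm'   # 256-bit AVX/AVX2
objdump -d myprog | grep -c '%zmm'   # AVX-512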

The only way to tell is to disassemble the generated code and see which instructions it uses.
objdump -d <your executable or shared library>

As others have pointed out, you can use -S to generate assembly code.
Beyond that, you can use external tools to disassemble the compiled binary, such as objdump, or a more specialized one such as IDA.
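For example (the file name is a placeholder; -fverbose-asm makes GCC annotate the assembly with source-level comments, which helps if you don't read assembly fluently):
g++ -S -O3 -march=native -fverbose-asm main.cpp -o main.s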

Related

how to compile node-v4.2.4 with armv7 without fpu?

I have a device whose cpu is armv7 but without fpu.
I can compile node with option --with-arm-float-abi=soft, but when I run "node", "Illegal instruction (core dumped)" happened.
root@router:/tmp/target/bin# ./node -v
v4.2.4
root@router:/tmp/target/bin# ./node --v8-options | head -2
target arm v7 vfp3 soft
ARMv7=1 VFP3=1 VFP32DREGS=0 NEON=0 SUDIV=0 UNALIGNED_ACCESSES=1
MOVW_MOVT_IMMEDIATE_LOADS=0 COHERENT_CACHE=0 USE_EABI_HARDFLOAT=0
The tool objdump showed me that there are instructions (such as vpush, vpop...) in use which are not supported by my cpu (arm v7 without fpu).
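A check along these lines reproduces that finding (the cross-toolchain prefix is illustrative):
# list VFP/NEON instructions present in the binary
arm-linux-gnueabi-objdump -d node | grep -E '\b(vpush|vpop|vldr|vstr|vmov)\b' | head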
Furthermore, I found that openssl and v8 in the node source tree use FPU instructions.
The configure line is as below:
./configure \
--prefix=target \
--dest-cpu=arm \
--dest-os=linux \
--without-snapshot \
--with-arm-float-abi=soft \
--fully-static
Can somebody tell me how to compile node-v4.2.4 without fpu supported?
source code: nodejs-v4.2.2
arm version: Cortex-A9 Floating-Point Unit (FPU)(Optional)
After many tries, I used node-v0.10.14 instead, which works well without FPU support. ;-)
So I still do not know how to compile nodejs-v4.2.2 without FPU support.
It's impossible.
V8 has not supported no-FPU mode since 3.18 (https://github.com/nodejs/node/issues/4447#issuecomment-168549889); the assumption is that the kernel can emulate the FPU for you. And Node.js is based on V8.
Relevant comment in the source code:
https://github.com/v8/v8/blob/master/src/arm/assembler-arm.cc#L174
It's clarified in v8-users mailing list too.

Measure static memory usage for C++ ported to embedded platform

I have created a small program as a proof of concept for a system which is to be implemented on an embedded platform. The program is written in C++11 with use of std and compiled to run on a laptop. The final program is to be implemented later on an embedded system. We do not have access to the compiler of the embedded platform.
I would like to know if there is a way to determine a program's static memory footprint (the size of the compiled binary) in a sensible and comparable way before it is ported to the embedded platform.
The requirement is that the size of the binary is less than 10kb.
Our binary has a size of 700Kb when compiled and stripped with the following flags:
g++ options: -Os -s -ffunction-sections -fdata-sections
linker options: -s -Wl,--gc-sections
strip libmodel.a -s -R .comment -R .gnu.version --strip-unneeded -R .note
It took up 4MB before we used strip and the optimization options.
I am still way off, and it is not really that big a program. How can I make a meaningful comparison with an equivalent program on an embedded platform?
Note that the size of the binary can be a little deceptive, in the sense that uninitialised variables (the .bss section) will not necessarily take up physical space in the binary, as these are generally just noted as present without actually having any space given to them; the allocation normally happens when the OS loader runs your program.
objdump (http://www.gnu.org/software/binutils/) or perhaps elfdump or the ELF Tool Chain (http://sourceforge.net/apps/trac/elftoolchain/) will help you determine the size of your various segments, data and text, as well as the size of individual functions and globals, etc. All these programs "look" into your compiled binary and extract a lot of information, such as the size of the .text and .data sections; they can list the various symbols, their locations and sizes, and can even disassemble the .text section...
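For a quick overall picture, binutils also ships the size utility, which prints the text/data/bss totals (the binary name here is a placeholder):
size test.elf      # text/data/bss summary
size -A test.elf   # per-section breakdown (System V format)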
An example of using elfdump on an ELF image test.elf might be elfdump -z test.elf > output.txt. This will dump everything, including a disassembly of the text section. For example, from an elfdump on my system I saw:
Section #6: .text, type=NOBITS, addr=0x500, off=0x5f168
size=149404(0x2479c), link=0, info=0, align=16, entsize=1
flags=<WRITE,ALLOC,EXECINSTR>
Section #7: .text, type=NOBITS, addr=0x24c9c, off=0x5f168
size=362822(0x58946), link=0, info=0, align=4, entsize=1
flags=<WRITE,ALLOC,EXECINSTR,INCLUDE>
....
Section #9: .rodata, type=NOBITS, addr=0x7d5e4, off=0x5f168
size=7670(0x1df6), link=0, info=0, align=4, entsize=1
flags=<WRITE,ALLOC>
So I can see how much my code is taking up (the .text sections) and my read only data. Later in the file I then see...
Symbol table ".symtab"
Value Size Bind Type Section Name
----- ---- ---- ---- ------- ----
218 0x7c090 130 LOC FUNC .text IRemovedThisName
So I can see that my function IRemovedThisName takes 130 bytes. A quick script (or nm, as sketched below) would allow you to list functions and variables sorted by size. This could point you at places to optimize...
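A minimal version of that using binutils' nm (the binary name is a placeholder; --size-sort sorts ascending, so the largest symbols come last):
# print symbol sizes in decimal, sorted by size
nm --print-size --size-sort --radix=d test.elf | tail -20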
For a good example of objdump try http://www.thegeekstuff.com/2012/09/objdump-examples/, specifically section 3, which shows you how to get the contents of the section headers using the -h option.
As to how the program will compare on two different platforms, I think you will just have to compile on both platforms and compare the results you get from your obj/elfdump on each system; the results will depend on the system instruction set, how well each compiler can optimize, general hardware architecture differences, etc.
If you don't have access to the embedded system, you might try using a cross-compiler, configured for your eventual target, on your laptop. This would give you a binary suited to the embedded platform and the tools to analyze the file (i.e. the cross-platform version of objdump). This would give you some ball-park figures for how the program would look on the eventual embedded sys.
Hope this helps.
EDIT: This will also help How to get the size of a C function from inside a C program or with inline assembly?
It appeared that the included libraries took up an enormous amount of space (as was pointed out in the comments), and by removing these it was possible to reduce the size to nearly nothing, in combination with the following flags:
set(CMAKE_CXX_FLAGS "-Os -s -ffunction-sections -fdata-sections -DNO_STD -fno-rtti -fno-exceptions")
set(CMAKE_EXE_LINKER_FLAGS "-s -Wl,--gc-sections")
And stripping away any unnecessary code using:
strip libmodel.a -s -R .comment -R .gnu.version --strip-unneeded -R .note
The 4MB could be reduced to 9.4kb which is below our limit.
In summary, std takes up a tremendous amount of space.

Threaded DGEMM on ifort

I've implemented a working program (FORTRAN, compiler: Intel ifort 11.x) that makes calls to DGEMM. I've read that there's a quick way to parallelize this by compiling with:
ifort -mkl=parallel -O3 myprog.f -o myprog
I have a quad-core processor, so I run the program with (via bash):
export OMP_NUM_THREADS=4
./myprog
My assumption was that DGEMM would automatically summon 4 threads, resulting in faster matrix multiplication. That doesn't seem to be happening. Am I missing something? Any help would be appreciated.
I think -mkl=parallel is the default for the Intel compiler, so you don't have to set this flag explicitly. Try -mkl=sequential instead and see whether the calculations slow down.
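If that doesn't help, MKL's thread count can also be pinned explicitly; the MKL_* variables take precedence over OMP_NUM_THREADS for MKL routines (a sketch, assuming a bash shell):
export MKL_NUM_THREADS=4
export MKL_DYNAMIC=FALSE   # stop MKL from choosing fewer threads on its own
./myprog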

Cortex A9 NEON vs VFP usage confusion

I'm trying to build a library for a Cortex A9 ARM processor (an OMAP4 to be more specific), and I'm in a little bit of confusion regarding which/when to use NEON vs VFP in the context of floating-point operations and SIMD. To be noted that I know the difference between the two hardware coprocessor units (as also outlined here on SO); I just have some misunderstanding regarding their proper usage.
Related to this, I'm using the following compilation flags:
GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp
I've read through the ARM documentation, a lot of wiki (like this one), forum and blog posts, and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea; I'm not 100% sure yet if this applies to the entire application/library or just to specific places (functions) in the code.
So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a bit of trouble, and my confusion about how to best use these features (NEON vs VFP) on the Cortex A9 only deepens further instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes in which calculations are based on double-precision floating point. Using NEON as the FPU gives completely inappropriate results (trying to print those values prints mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single-precision floating point, as it is documented that NEON does not handle double-precision floating point. My benchmarks still don't give the proper results (and what's worse is that now it does not work on x86 anymore; I think it's because of the loss of precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for the SIMD capabilities, but using it as the FPU does not provide the proper results; on the other hand, mixing it with the VFP does not seem a very good idea.
Any advice in this area will be greatly appreciated !!
In the article in the above-mentioned wiki I found a summary of what should be done for floating-point optimization in the context of NEON:
"
Only use single precision floating point
Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
Minimize Conditional Branches
Enable RunFast mode
For softfp:
Inline floating point code (unless it's very large)
Pass FP arguments via pointers instead of by value and do integer work in between function calls.
"
I cannot use hard for the float ABI as I cannot link with the libraries I have available.
Most of the recommendations make sense to me (except "RunFast mode", which I don't understand exactly what it's supposed to do, and the claim that at this moment in time I could do better than the compiler), but I keep getting inconsistent results, and I'm not sure of anything right now.
Could anyone shed some light on how to properly use floating point and NEON on the Cortex A9/A8, and which compilation flags to use?
... forum and blog posts and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea
I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:
The NEON register bank consists of 32 64-bit registers. If both
Advanced SIMD and VFPv3 are implemented, they share this register
bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that
supports 32 double-precision floating-point registers. This
integration simplifies implementing context switching support, because
the same routines that save and restore VFP context also save and
restore NEON context.
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers
and each of the Q0-Q15 registers map onto a pair of D registers.
Figure 1.3 shows the different views of the shared NEON and VFP
register bank. All of these views are accessible at any time. Software
does not have to explicitly switch between them, because the
instruction used determines the appropriate view.
The registers don't compete; rather, they co-exist as views of the register bank. There's no way to separate the NEON and VFP hardware.
Related to this I'm using the following compilation flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
Here's what I do; your mileage may vary. It's derived from a mash-up of information gathered from the platform and the compiler.
gnueabihf tells me the platform uses hard floats, which can speed up procedure calls. If in doubt, use softfp, because it's compatible with hard floats.
BeagleBone Black:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 2 (v7l)
Features : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
...
So the BeagleBone uses:
-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
CubieTruck v5:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4
So the CubieTruck uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Banana Pi Pro:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
So the Banana Pi uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Raspberry Pi 3:
The RPI3 is unique in that it's ARMv8, but it's running a 32-bit OS. That means it's effectively 32-bit ARM, or Aarch32. There's a little more to 32-bit ARM vs Aarch32, but this will show you the Aarch32 flags.
Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 4 (v7l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...
So the Raspberry Pi can use:
-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
Or it can use (I don't know what to use for -mtune):
-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard
ODROID C2:
ODROID C2 uses an Amlogic A53 SoC, but it runs a 64-bit OS. The ODROID C2 has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (a similar configuration to the RPI3).
$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu
$ cat /proc/cpuinfo
Features : fp asimd evtstrm crc32
So the ODROID uses:
-march=armv8-a+crc -mtune=cortex-a53
In the above recipes, I identified the ARM processor (like Cortex-A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers output from /proc/cpuinfo:
CPU part: Part number. 0xd03 indicates Cortex-A53 processor.
So we may be able to look up the value from a database. I don't know if one exists or where it's located.
I think this question should be split up into several, adding some code examples and detailing target platform and versions of toolchains used.
But to cover one part of confusion:
The recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine, the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.
-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.
See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.
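In the spirit of the main question above, you can also disassemble the binary to see which unit the compiler actually targeted: NEON data-processing instructions use the q (quadword) registers, while scalar VFP code sticks to the s/d registers (the library name and toolchain prefix are placeholders):
arm-linux-gnueabihf-objdump -d libfoo.so | grep -E '\bq([0-9]|1[0-5])\b' | head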
I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in optimizing for them.
It might sound rude, but I really don't see any point in NEON intrinsics either. They're more trouble than help, if any.
Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.
Then you can start writing native NEON code without worrying about the compiler doing something astral and spitting out tons of errors/warnings.
Learning the NEON instructions is less demanding than learning all those intrinsics macros. And above all, the results are so much better.
Fully optimized native NEON code usually runs more than twice as fast as its well-written intrinsics counterpart.
Just compare the OP's version with mine in the link below, and you'll know what I mean.
Optimizing RGBA8888 to RGB565 conversion with NEON
regards

Disassemble x86 code in a program at a specific address

Is there any x86 disassembler framework that can be used to analyze code from a specific address in a program, as in:
info = disassemble(startAddress, stopAddress)
It should show every instruction, its operands, and any other information that is useful for analysis, but it should also have a fast mode where it isn't important to obtain that much detail for every instruction, only for those that can be specified.
Is GNU binutils not good enough? Here's how to do that with the objdump utility:
# Disassemble from virtual addresses 0x80000000 to 0x80000100
objdump -d program --start-address=0x80000000 --stop-address=0x80000100
Google's protobuf uses libdisasm for this purpose. The sad thing is that (judging from the source code) it only supports ia32/x86, and the homepage states that "it is x86 specific and will not be expanded to include other CPU architectures". But since you didn't mention other archs, this library may be sufficient.
The cross-platform InstructionAPI suits this purpose: it will analyze the code and can print out the disassembly, or provide a machine-independent view of the instructions for you to query. InstructionAPI is a shared library that you would link your code against.
http://www.paradyn.org/html/manuals.html
I would run under a debugger to the start point and look at the debugger's disassembly output. A more brute-force method is to disassemble the whole thing and search for the function name in the disassembly file; objconv can do this, but it is slow on very big files.
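For the debugger route, gdb can do this non-interactively; a sketch reusing the address range from the objdump example above:
gdb -batch -ex 'disassemble 0x80000000,0x80000100' ./program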
Try BeaEngine