Enable popcnt in VirtualBox

I have Oracle VirtualBox 4.3.8 RC1 and installed the stable version of Debian.
With this version of VirtualBox I can use this command to enable SSE4.1 and SSE4.2:
VBoxManage setextradata "VM name" VBoxInternal/CPUM/SSE4.1 1
I wanted to compile DPDK, http://dpdk.org, but there is an error:
"implicit declaration of function ‘_mm_popcnt_u32’"
When I look at the flags with
cat /proc/cpuinfo
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 sse4_1 sse4_2 lahf_lm
there is no "popcnt". Why? Can I enable it, or what am I doing wrong?
Thanks

My case: POPCNT is missing on VirtualBox v6.1.22 with Hyper-V.
Run
VBoxManage setextradata VMName VBoxInternal/CPUM/IsaExts/POPCNT 1
and enable nested paging on the VM.
It works.

You can use __builtin_popcountll to replace _mm_popcnt_u32, so that only SSE3 intrinsics are pulled in and used.
See here:
http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/4560
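For illustration, here is a minimal sketch of that replacement (the count_bits wrapper is hypothetical; __builtin_popcountll itself is a real GCC/Clang built-in that, when POPCNT is unavailable, compiles to a portable bit-twiddling sequence instead of the POPCNT instruction):

#include <cstdint>

// Hypothetical helper: count set bits without requiring the POPCNT flag.
static inline uint32_t count_bits(uint32_t v) {
    // Previously: return _mm_popcnt_u32(v);  // needs <nmmintrin.h> and POPCNT
    return static_cast<uint32_t>(__builtin_popcountll(v));
}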

How to run OpenCL Kernel (khronos example)?

I am trying to run the C++ code example from the OpenCL C++ Bindings Doc: Example.
Compilation of the C++ code works fine, but compilation of the kernel gives errors in connection with pipes:
<kernel>:10:71: error: unknown type name 'pipe'
global int *output, int val, write_only pipe int outPipe, queue_t childQueue)
^
<kernel>:10:76: error: expected ')'
global int *output, int val, write_only pipe int outPipe, queue_t childQueue)
^
<kernel>:9:30: note: to match this '('
kernel void vectorAdd(global const Foo* aNum, global const int *inputA, global const int *inputB,
^
<kernel>:10:76: error: parameter name omitted
global int *output, int val, write_only pipe int outPipe, queue_t childQueue)
^
<kernel>:13:11: warning: implicit declaration of function 'write_pipe' is invalid in C99
write_pipe(outPipe, &val);
^
<kernel>:13:22: error: use of undeclared identifier 'outPipe'
write_pipe(outPipe, &val);
^
<kernel>:25:26: error: use of undeclared identifier 'childQueue'
enqueue_kernel(childQueue, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
My setup:
NVIDIA GPU
Debian
used "sudo apt install opencl-headers ocl-icd-opencl-dev -y" to install the OpenCL packages
clinfo output:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 3.0 CUDA 11.4.264
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid cl_khr_pci_bus_info
Platform Extensions with Version cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_fp64 0x400000 (1.0.0)
cl_khr_3d_image_writes 0x400000 (1.0.0)
cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_icd 0x400000 (1.0.0)
cl_khr_gl_sharing 0x400000 (1.0.0)
cl_nv_compiler_options 0x400000 (1.0.0)
cl_nv_device_attribute_query 0x400000 (1.0.0)
cl_nv_pragma_unroll 0x400000 (1.0.0)
cl_nv_copy_opts 0x400000 (1.0.0)
cl_nv_create_buffer 0x400000 (1.0.0)
cl_khr_int64_base_atomics 0x400000 (1.0.0)
cl_khr_int64_extended_atomics 0x400000 (1.0.0)
cl_khr_device_uuid 0x400000 (1.0.0)
cl_khr_pci_bus_info 0x400000 (1.0.0)
Platform Numeric Version 0xc00000 (3.0.0)
Platform Extensions function suffix NV
Platform Host timer resolution 0ns
Platform Name NVIDIA CUDA
Number of devices 1
Device Name NVIDIA GeForce RTX 3060 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 3.0 CUDA
Device UUID c6edf95f-d769-661d-a242-e9a192a0dcb1
Driver UUID c6edf95f-d769-661d-a242-e9a192a0dcb1
Valid Device LUID No
Device LUID 6d69-637300000000
Device Node Mask 0
Device Numeric Version 0xc00000 (3.0.0)
Driver Version 470.141.03
Device OpenCL C Version OpenCL C 1.2
Device OpenCL C all versions OpenCL C 0x400000 (1.0.0)
OpenCL C 0x401000 (1.1.0)
OpenCL C 0x402000 (1.2.0)
OpenCL C 0xc00000 (3.0.0)
Device OpenCL C features __opencl_c_fp64 0xc00000 (3.0.0)
__opencl_c_images 0xc00000 (3.0.0)
__opencl_c_int64 0xc00000 (3.0.0)
__opencl_c_3d_image_writes 0xc00000 (3.0.0)
Latest conformance test passed v2021-02-01-00
Device Type GPU
Device Topology (NV) PCI-E, 0000:02:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 38
Max clock frequency 1695MHz
Compute Capability (NV) 8.6
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple (device) 32
Preferred work group size multiple (kernel) 32
Warp size (NV) 32
Max sub-groups per work group 0
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 8367570944 (7.793GiB)
Error Correction support No
Max memory allocation 2091892736 (1.948GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Atomic memory capabilities relaxed, work-group scope
Atomic fence capabilities relaxed, acquire/release, work-group scope
Max size for global variable 0
Preferred total size of global vars 0
Global Memory cache type Read/Write
Global Memory cache size 1089536 (1.039MiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 268435456 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 32768x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 32
Max number of read/write image args 0
Pipe support No
Max number of pipe args 0
Max active pipe reservations 0
Max pipe packet size 0
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Generic address space support No
Max size of kernel argument 4352 (4.25KiB)
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Device enqueue capabilities (n/a)
Queue properties (on device)
Out-of-order execution No
Profiling No
Preferred size 0
Max size 0
Max queues on device 0
Max events on device 0
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Non-uniform work-groups No
Work-group collective functions No
Sub-group independent forward progress No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
IL version (n/a)
ILs with version <printDeviceInfo:186: get CL_DEVICE_ILS_WITH_VERSION : error -30>
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Built-in kernels with version <printDeviceInfo:190: get CL_DEVICE_BUILT_IN_KERNELS_WITH_VERSION : error -30>
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid cl_khr_pci_bus_info
Device Extensions with Version cl_khr_global_int32_base_atomics 0x400000 (1.0.0)
cl_khr_global_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_local_int32_base_atomics 0x400000 (1.0.0)
cl_khr_local_int32_extended_atomics 0x400000 (1.0.0)
cl_khr_fp64 0x400000 (1.0.0)
cl_khr_3d_image_writes 0x400000 (1.0.0)
cl_khr_byte_addressable_store 0x400000 (1.0.0)
cl_khr_icd 0x400000 (1.0.0)
cl_khr_gl_sharing 0x400000 (1.0.0)
cl_nv_compiler_options 0x400000 (1.0.0)
cl_nv_device_attribute_query 0x400000 (1.0.0)
cl_nv_pragma_unroll 0x400000 (1.0.0)
cl_nv_copy_opts 0x400000 (1.0.0)
cl_nv_create_buffer 0x400000 (1.0.0)
cl_khr_int64_base_atomics 0x400000 (1.0.0)
cl_khr_int64_extended_atomics 0x400000 (1.0.0)
cl_khr_device_uuid 0x400000 (1.0.0)
cl_khr_pci_bus_info 0x400000 (1.0.0)
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) NVIDIA CUDA
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [NV]
clCreateContext(NULL, ...) [default] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.14
ICD loader Profile OpenCL 3.0
I would be happy if someone could help me. I have already researched extensions and OpenCL versions but found nothing that fixed my problem.
Pipes are an OpenCL 2 feature. Nvidia does not implement OpenCL 2, only OpenCL 1.2. You can't run this code on an Nvidia GPU.
Clarification re/from comments:
Note how your clinfo output states:
Device OpenCL C Version OpenCL C 1.2
In other words, your kernel code must only use features supported by OpenCL 1.2 and any extensions offered by the runtime and selected in your code.
So although the implementation complies with the OpenCL 3.0 specification - which, confusingly, requires fewer features than OpenCL 2.x - you can't simply use all OpenCL 2.x features.
After further research, I came to the conclusion that NVIDIA's OpenCL 3.0 does not support all OpenCL 3.0 features. Apparently it is essentially OpenCL 1.2 with some new features. Information about this was hidden on page 4 of the 465.89 driver release notes.
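If you want to check this programmatically instead of reading clinfo output, here is a minimal sketch using the standard OpenCL C API (error handling omitted; it assumes the headers and ICD loader installed above, and links with -lOpenCL):

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // CL_DEVICE_OPENCL_C_VERSION reports the kernel language version the
    // device actually compiles, which can be lower than the platform version.
    char version[256];
    clGetDeviceInfo(device, CL_DEVICE_OPENCL_C_VERSION,
                    sizeof(version), version, nullptr);
    std::printf("%s\n", version);  // prints "OpenCL C 1.2" on this driver
    return 0;
}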

VirtualBox enable nested VT-x/AMD-V greyed out

On my Ubuntu 18.04, I've installed VirtualBox 6.0 in order to have nested virtualization. Virtualization is enabled in my bios.
However, when I open the settings of my (powered off) virtual machine and go to System -> Processor, the option "Enable Nested VT-x/AMD-V" is greyed out and I cannot enable it.
Execute this:
$ VBoxManage modifyvm <VirtualMachineName> --nested-hw-virt on
For Windows
In Windows, go to the VirtualBox installation folder, type cmd in the address bar (it will open a command prompt in that folder), then run VBoxManage modifyvm <YourVirtualMachineName> --nested-hw-virt on and press Enter.
Now the option should be checked.
On VirtualBox 6.1.2 that worked (Intel i7-2630QM):
(VBoxManage modifyvm lubuntu18 --nested-hw-virt on)
From what I understand, this option is only available with AMD CPUs and cannot be enabled on Intel CPUs. This is a little misleading, since the option clearly states both the Intel and AMD virtualization technologies.
Here is an official confirmation in VirtualBox doc:
https://www.virtualbox.org/manual/ch03.html
Chapter 3.5.2. Processor Tab
Enable Nested VT-x/AMD-V: Enables nested virtualization, with passthrough of hardware virtualization functions to the guest VM.
This feature is available on host systems that use an AMD CPU. For Intel CPUs, the option is grayed out.
So far it only works with AMD CPUs (forget about the confusing option title).
Initially this is for AMD CPUs only.
All Intel CPU posts will be deleted/split.
https://forums.virtualbox.org/viewtopic.php?f=1&t=90831
https://forums.virtualbox.org/viewtopic.php?f=7&t=90874
In Windows 10 this problem is caused by having Memory Integrity active.
Windows Security -> Device security -> Core isolation details
Disable Memory integrity and then restart Windows.
The VirtualBox option "Enable Nested VT-x/AMD-V" will still be greyed out.
Now open a new PowerShell in your VirtualBox installation folder and type: ./VBoxManage modifyvm "Virtual Machine Name" --nested-hw-virt on
You'll find detailed information here (I don't know why Microsoft does not mention this issue anywhere).
Recently this popped up for me out of the blue on Windows 11. I already had Hyper-V disabled from previous tweaks and everything had been working. In the end I had to use this command:
bcdedit /set hypervisorlaunchtype off
which fixed it, but it broke the Windows Subsystem for Android recently introduced in Windows 11, so there's that...
From the directory where VirtualBox is executed, I run a similar command that works (note the placement of the quotes!):
VBoxManage modifyvm "path\to\ubuntu 18.04.3.vbox" --nested-hw-virt on
Hope this helps.
It's alive on VirtualBox 6.1.2 r135662 (Qt5.6.2) and an Intel Core i3-8100!
CMD's output from image as text:
C:\WINDOWS\system32>ssh myuser@192.168.56.111
myuser@192.168.56.111's password:
Last login: Mon Feb 17 10:11:06 2020 from 192.168.56.1
myuser@nestedvt ~ $ su
Password:
root@nestedvt /home/myuser # egrep "svm|vmx" /proc/cpuinfo
root@nestedvt /home/myuser #
root@nestedvt /home/myuser # poweroff
Connection to 192.168.56.111 closed by remote host.
Connection to 192.168.56.111 closed.
C:\WINDOWS\system32>cd "C:\Program Files\Oracle\VirtualBox"
C:\Program Files\Oracle\VirtualBox>VBoxManage modifyvm CentOS7_nestedVT --nested-hw-virt on
C:\Program Files\Oracle\VirtualBox>VBoxManage startvm CentOS7_nestedVT
Waiting for VM "CentOS7_nestedVT" to power on...
VM "CentOS7_nestedVT" has been successfully started.
C:\Program Files\Oracle\VirtualBox>ssh myuser@192.168.56.111
myuser@192.168.56.111's password:
Last login: Mon Feb 17 10:12:08 2020 from 192.168.56.1
myuser@nestedvt ~ $ su
Password:
root@nestedvt /home/myuser # egrep "svm|vmx" /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow flexpriority fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow flexpriority fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d
root@nestedvt /home/myuser # exit
exit
myuser@nestedvt ~ $ exit
logout
Connection to 192.168.56.111 closed.
C:\Program Files\Oracle\VirtualBox>wmic cpu get name
Name
Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz
C:\Program Files\Oracle\VirtualBox>wmic os get caption
Caption
Microsoft Windows 10 Pro
It turned out it was greyed out for a reason! I have a Windows 10 host and used Docker for some time; I uninstalled it, but it kept the Hyper-V technology enabled (which is incompatible with VirtualBox's hardware virtualization).
DO NOT DO ON A SERVER | THIS WILL DISABLE Hyper-V Technology - USE AT YOUR OWN RISK
Open command prompt as admin and run the following then restart your PC
DISM /Online /Disable-Feature:Microsoft-Hyper-V
PowerShell Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-Hypervisor -All
bcdedit /set hypervisorlaunchtype off
The cause of the problem is Hyper-V.
If you want to use nested virtualization, you should turn off hypervisorlaunchtype.
This worked for me: bcdedit /set hypervisorlaunchtype off
FYI,
Oracle VM VirtualBox supports nested virtualization on host systems that run AMD and Intel CPUs.
For more details, check:
https://docs.oracle.com/en/virtualization/virtualbox/6.0/admin/nested-virt.html
VBoxManage modifyvm <vm-name> --nested-hw-virt on
This works.
Enable VT-x/AMD-V in VirtualBox from a Windows host PC.
Open the Oracle VirtualBox installation folder from cmd as administrator: cd C:\Program Files\Oracle\VirtualBox
Then run the command:
VBoxManage modifyvm <vm-name> --nested-hw-virt on
where <vm-name> is your VM name. This enables nested VT-x/AMD-V in your VirtualBox.
Sometimes the problem is that your machine has a saved state, but the saved state is not the correct one, so you click on your machine and then on "forget" at the top to forget any saved state. In my case this solved the problem.

How to check if compiled code uses SSE and AVX instructions?

I wrote some code to do a bunch of math, and it needs to go fast, so I need it to use SSE and AVX instructions. I'm compiling it using g++ with the flags -O3 and -march=native, so I think it's using SSE and AVX instructions, but I'm not sure. Most of my code looks something like the following:
for(int i = 0; i < size; i++){
    a[i] = b[i] * c[i];
}
Is there any way I can tell if my code (after compilation) uses SSE and AVX instructions? I think I could look at the assembly to see, but I don't know assembly, and I don't know how to see the assembly that the compiler outputs.
Under Linux, you could also disassemble your binary:
objdump -d YOURFILE > YOURFILE.asm
Then find all SSE instructions:
awk '/[ \t](addps|addss|andnps|andps|cmpps|cmpss|comiss|cvtpi2ps|cvtps2pi|cvtsi2ss|cvtss2s|cvttps2pi|cvttss2si|divps|divss|ldmxcsr|maxps|maxss|minps|minss|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movss|movups|mulps|mulss|orps|rcpps|rcpss|rsqrtps|rsqrtss|shufps|sqrtps|sqrtss|stmxcsr|subps|subss|ucomiss|unpckhps|unpcklps|xorps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|psadbw|pshufw)[ \t]/' YOURFILE.asm
Find only packed SSE instructions (suggested by @Peter Cordes in comments):
awk '/[ \t](addps|andnps|andps|cmpps|cvtpi2ps|cvtps2pi|cvttps2pi|divps|maxps|minps|movaps|movhlps|movhps|movlhps|movlps|movmskps|movntps|movntq|movups|mulps|orps|pavgb|pavgw|pextrw|pinsrw|pmaxsw|pmaxub|pminsw|pminub|pmovmskb|pmulhuw|psadbw|pshufw|rcpps|rsqrtps|shufps|sqrtps|subps|unpckhps|unpcklps|xorps)[ \t]/' YOURFILE.asm
Find all SSE2 instructions (except MOVSD and CMPSD, whose mnemonics already existed as string instructions on the 80386 and would give false matches):
awk '/[ \t](addpd|addsd|andnpd|andpd|cmppd|comisd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvtsd2si|cvtsd2ss|cvtsi2sd|cvtss2sd|cvttpd2dq|cvttpd2pi|cvtps2dq|cvttsd2si|divpd|divsd|maxpd|maxsd|minpd|minsd|movapd|movhpd|movlpd|movmskpd|movupd|mulpd|mulsd|orpd|shufpd|sqrtpd|sqrtsd|subpd|subsd|ucomisd|unpckhpd|unpcklpd|xorpd|movdq2q|movdqa|movdqu|movq2dq|paddq|pmuludq|pshufhw|pshuflw|pshufd|pslldq|psrldq|punpckhqdq|punpcklqdq)[ \t]/' YOURFILE.asm
Find only packed SSE2 instructions:
awk '/[ \t](addpd|andnpd|andpd|cmppd|cvtdq2pd|cvtdq2ps|cvtpd2dq|cvtpd2pi|cvtpd2ps|cvtpi2pd|cvtps2dq|cvtps2pd|cvttpd2dq|cvttpd2pi|cvttps2dq|divpd|maxpd|minpd|movapd|movapd|movhpd|movhpd|movlpd|movlpd|movmskpd|movntdq|movntpd|movupd|movupd|mulpd|orpd|pshufd|pshufhw|pshuflw|pslldq|psrldq|punpckhqdq|shufpd|sqrtpd|subpd|unpckhpd|unpcklpd|xorpd)[ \t]/' YOURFILE.asm
Find all SSE3 instructions:
awk '/[ \t](addsubpd|addsubps|haddpd|haddps|hsubpd|hsubps|movddup|movshdup|movsldup|lddqu|fisttp)[ \t]/' YOURFILE.asm
Find all SSSE3 instructions:
awk '/[ \t](psignw|psignd|psignb|pshufb|pmulhrsw|pmaddubsw|phsubw|phsubsw|phsubd|phaddw|phaddsw|phaddd|palignr|pabsw|pabsd|pabsb)[ \t]/' YOURFILE.asm
Find all SSE4 instructions:
awk '/[ \t](mpsadbw|phminposuw|pmulld|pmuldq|dpps|dppd|blendps|blendpd|blendvps|blendvpd|pblendvb|pblenddw|pminsb|pmaxsb|pminuw|pmaxuw|pminud|pmaxud|pminsd|pmaxsd|roundps|roundss|roundpd|roundsd|insertps|pinsrb|pinsrd|pinsrq|extractps|pextrb|pextrd|pextrw|pextrq|pmovsxbw|pmovzxbw|pmovsxbd|pmovzxbd|pmovsxbq|pmovzxbq|pmovsxwd|pmovzxwd|pmovsxwq|pmovzxwq|pmovsxdq|pmovzxdq|ptest|pcmpeqq|pcmpgtq|packusdw|pcmpestri|pcmpestrm|pcmpistri|pcmpistrm|crc32|popcnt|movntdqa|extrq|insertq|movntsd|movntss|lzcnt)[ \t]/' YOURFILE.asm
Find the most common AVX instructions (including scalar, AVX2, the AVX-512 family, and some FMA like vfmadd132pd):
awk '/[ \t](vmovapd|vmulpd|vaddpd|vsubpd|vfmadd213pd|vfmadd231pd|vfmadd132pd|vmulsd|vaddsd|vmovsd|vsubsd|vbroadcastss|vbroadcastsd|vblendpd|vshufpd|vroundpd|vroundsd|vxorpd|vfnmadd231pd|vfnmadd213pd|vfnmadd132pd|vandpd|vmaxpd|vmovmskpd|vcmppd|vpaddd|vbroadcastf128|vinsertf128|vextractf128|vfmsub231pd|vfmsub132pd|vfmsub213pd|vmaskmovps|vmaskmovpd|vpermilps|vpermilpd|vperm2f128|vzeroall|vzeroupper|vpbroadcastb|vpbroadcastw|vpbroadcastd|vpbroadcastq|vbroadcasti128|vinserti128|vextracti128|vpminud|vpmuludq|vgatherdpd|vgatherqpd|vgatherdps|vgatherqps|vpgatherdd|vpgatherdq|vpgatherqd|vpgatherqq|vpmaskmovd|vpmaskmovq|vpermps|vpermd|vpermpd|vpermq|vperm2i128|vpblendd|vpsllvd|vpsllvq|vpsrlvd|vpsrlvq|vpsravd|vblendmpd|vblendmps|vpblendmd|vpblendmq|vpblendmb|vpblendmw|vpcmpd|vpcmpud|vpcmpq|vpcmpuq|vpcmpb|vpcmpub|vpcmpw|vpcmpuw|vptestmd|vptestmq|vptestnmd|vptestnmq|vptestmb|vptestmw|vptestnmb|vptestnmw|vcompresspd|vcompressps|vpcompressd|vpcompressq|vexpandpd|vexpandps|vpexpandd|vpexpandq|vpermb|vpermw|vpermt2b|vpermt2w|vpermi2pd|vpermi2ps|vpermi2d|vpermi2q|vpermi2b|vpermi2w|vpermt2ps|vpermt2pd|vpermt2d|vpermt2q|vshuff32x4|vshuff64x2|vshuffi32x4|vshuffi64x2|vpmultishiftqb|vpternlogd|vpternlogq|vpmovqd|vpmovsqd|vpmovusqd|vpmovqw|vpmovsqw|vpmovusqw|vpmovqb|vpmovsqb|vpmovusqb|vpmovdw|vpmovsdw|vpmovusdw|vpmovdb|vpmovsdb|vpmovusdb|vpmovwb|vpmovswb|vpmovuswb|vcvtps2udq|vcvtpd2udq|vcvttps2udq|vcvttpd2udq|vcvtss2usi|vcvtsd2usi|vcvttss2usi|vcvttsd2usi|vcvtps2qq|vcvtpd2qq|vcvtps2uqq|vcvtpd2uqq|vcvttps2qq|vcvttpd2qq|vcvttps2uqq|vcvttpd2uqq|vcvtudq2ps|vcvtudq2pd|vcvtusi2ps|vcvtusi2pd|vcvtusi2sd|vcvtusi2ss|vcvtuqq2ps|vcvtuqq2pd|vcvtqq2pd|vcvtqq2ps|vgetexppd|vgetexpps|vgetexpsd|vgetexpss|vgetmantpd|vgetmantps|vgetmantsd|vgetmantss|vfixupimmpd|vfixupimmps|vfixupimmsd|vfixupimmss|vrcp14pd|vrcp14ps|vrcp14sd|vrcp14ss|vrndscaleps|vrndscalepd|vrndscaless|vrndscalesd|vrsqrt14pd|vrsqrt14ps|vrsqrt14sd|vrsqrt14ss|vscalefps|vscalefpd|vscalefss|vscalefsd|valignd|valignq|vdbpsadbw|vpabsq|vpmaxsq|vpmaxuq|vpminsq|vpminuq|vprold|vprolvd|vprolq|vprolvq|vprord|vprorvd|vprorq|vprorvq|vpscatterdd|vpscatterdq|vpscatterqd|vpscatterqq|vscatterdps|vscatterdpd|vscatterqps|vscatterqpd|vpconflictd|vpconflictq|vplzcntd|vplzcntq|vpbroadcastmb2q|vpbroadcastmw2d|vexp2pd|vexp2ps|vrcp28pd|vrcp28ps|vrcp28sd|vrcp28ss|vrsqrt28pd|vrsqrt28ps|vrsqrt28sd|vrsqrt28ss|vgatherpf0dps|vgatherpf0qps|vgatherpf0dpd|vgatherpf0qpd|vgatherpf1dps|vgatherpf1qps|vgatherpf1dpd|vgatherpf1qpd|vscatterpf0dps|vscatterpf0qps|vscatterpf0dpd|vscatterpf0qpd|vscatterpf1dps|vscatterpf1qps|vscatterpf1dpd|vscatterpf1qpd|vfpclassps|vfpclasspd|vfpclassss|vfpclasssd|vrangeps|vrangepd|vrangess|vrangesd|vreduceps|vreducepd|vreducess|vreducesd|vpmovm2d|vpmovm2q|vpmovm2b|vpmovm2w|vpmovd2m|vpmovq2m|vpmovb2m|vpmovw2m|vpmullq|vpmadd52luq|vpmadd52huq|v4fmaddps|v4fmaddss|v4fnmaddps|v4fnmaddss|vp4dpwssd|vp4dpwssds|vpdpbusd|vpdpbusds|vpdpwssd|vpdpwssds|vpcompressb|vpcompressw|vpexpandb|vpexpandw|vpshld|vpshldv|vpshrd|vpshrdv|vpopcntd|vpopcntq|vpopcntb|vpopcntw|vpshufbitqmb|gf2p8affineinvqb|gf2p8affineqb|gf2p8mulb|vpclmulqdq|vaesdec|vaesdeclast|vaesenc|vaesenclast)[ \t]/' YOURFILE.asm
NOTE: tested with gawk and nawk.
There is no need to check the assembly. Most compilers provide optimisation reports that exactly tell you whether or not your loops were vectorised using SIMD instructions.
If you compile using GCC, set -O3 -march=native to make sure vectorisation is performed using whichever SIMD instruction set (SSE, AVX, ...) the CPU you are compiling on supports, and add -fopt-info to make the compiler verbose about optimisations:
g++ -O3 -march=native -fopt-info -o main.o main.cpp
This will give you output like:
main.cpp:12:20: note: loop vectorized
main.cpp:12:20: note: loop peeled for vectorization to enhance alignment
Hope that helps.
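For reference, a minimal test file that should reproduce those notes (the file name, array size, and variable names here are made up for this example):

// main.cpp -- compile with: g++ -O3 -march=native -fopt-info -o main main.cpp
#include <cstddef>

int main(int argc, char**) {
    const std::size_t size = 1 << 16;
    static float a[1 << 16], b[1 << 16], c[1 << 16];

    // Fill the inputs from argc so the compiler cannot fold the work away.
    for (std::size_t i = 0; i < size; i++) {
        b[i] = static_cast<float>(argc) + i;
        c[i] = static_cast<float>(argc) - i;
    }

    // The element-wise product from the question: exactly the kind of
    // loop the compiler reports with "note: loop vectorized".
    for (std::size_t i = 0; i < size; i++)
        a[i] = b[i] * c[i];

    return a[size - 1] > 0.0f;  // keep the result observable
}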
Notice that most packed SSE instructions end with PS/PD, so there is a simpler way to check for packed SSEx instructions after dumping the binary content to asmfile:
grep %xmm asmfile | grep -P '([[:xdigit:]]{2}\s)+\s*[[:alnum:]]+p[sd]\s+'
or the xmm check can be combined into the pattern:
grep -P '([[:xdigit:]]{2}\s)+\s*[[:alnum:]]+p[sd]\s+.+xmm' asmfile
This will suffice for programs that only use floating-point operations. However, for better coverage you also need to check for instructions beginning with P, so you need to change the regex a bit:
grep -P '([[:xdigit:]]{2}\s)+\s*([[:alnum:]]+p[sd]\s+|p[[:alnum:]]+).+%xmm' asmfile
To also include MMX instructions in 32-bit code, change the %xmm part at the end to %x?mm.
To check for AVX1/2 you just need to find ymm or %ymm usage instead of checking the instruction names, because AVX1/2 instructions only have the vector version:
grep ymm asmfile
Similarly AVX-512 can be checked with
grep zmm asmfile
The only way to tell is to disassemble the generated code and see which instructions it's using.
objdump -d <your executable or shared library>
As others have pointed out, you may use -S to generate assembly code.
What's more, you could use external tools to disassemble the compiled binary, like objdump, or a more professional one, IDA.

std::tan() extremely slow after updating glibc

I have a C++ program that calls lots of trig functions. It has been running fine for more than a year. I recently installed gcc-4.8, and in the same go, updated glibc. This resulted in my program slowing down by almost a factor of 1000. Using gdb I discovered that the cause of the slowdown was a call to std::tan(). When the argument is either pi or pi/2, the function takes very long to return.
Here's an MWE that reproduces the problem if compiled without optimization (the real program has the same problem both with and without the -O2 flag).
#include <cmath>

int main() {
    double pi = 3.141592653589793;
    double approxPi = 3.14159;
    double ret = 0.;
    for(int i = 0; i < 100000; ++i) ret = std::tan(pi);       //Very slow
    for(int i = 0; i < 100000; ++i) ret = std::tan(approxPi); //Not slow
}
Here's a sample backtrace from gdb (obtained after interrupting the program randomly with Ctrl+c). Starting from the call to tan, the backtrace is the same in the MWE and my real program.
#0 0x00007ffff7b1d048 in __mul (p=32, z=0x7fffffffc740, y=0x7fffffffcb30, x=0x7fffffffc890) at ../sysdeps/ieee754/dbl-64/mpa.c:458
#1 __mul (x=0x7fffffffc890, y=0x7fffffffcb30, z=0x7fffffffc740, p=32) at ../sysdeps/ieee754/dbl-64/mpa.c:443
#2 0x00007ffff7b1e348 in cc32 (p=32, y=0x7fffffffc4a0, x=0x7fffffffbf60) at ../sysdeps/ieee754/dbl-64/sincos32.c:111
#3 __c32 (x=<optimized out>, y=0x7fffffffcf50, z=0x7fffffffd0a0, p=32) at ../sysdeps/ieee754/dbl-64/sincos32.c:128
#4 0x00007ffff7b1e170 in __mptan (x=<optimized out>, mpy=0x7fffffffd690, p=32) at ../sysdeps/ieee754/dbl-64/mptan.c:57
#5 0x00007ffff7b45b46 in tanMp (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:503
#6 __tan_avx (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:488
#7 0x00000000004005b8 in main ()
I've tried running the code (both the MWE and the real program) on four different systems. Two of them are in clusters where I run my code. Two are my laptops. The MWE runs without issues on one of the clusters and one laptop. I checked which version of libm.so.6 each system uses in case that's relevant. The following list shows the system description (taken from cat /etc/*-release), whether the CPU is 32 or 64 bit, whether the MWE is slow, and finally the output of running /lib/libc.so.6 and cat /proc/cpuinfo.
SUSE Linux Enterprise Server 11 (x86_64), 64 bit, using libm-2.11.1.so (MWE is fast)
GNU C Library stable release version 2.11.1 (20100118), by Roland McGrath et al.
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Configured for x86_64-suse-linux.
Compiled by GNU CC version 4.3.4 [gcc-4_3-branch revision 152973].
Compiled on a Linux 2.6.32 system on 2012-04-12.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping : 2
microcode : 53
cpu MHz : 1200.000
cache size : 30720 KB
physical id : 0
siblings : 24
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips : 5000.05
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
CentOS release 6.7 (Final), 64 bit, using libm-2.12.so (MWE is slow)
GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-16).
Compiled on a Linux 2.6.32 system on 2015-09-22.
Available extensions:
The C stubs add-on version 2.1.2.
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
stepping : 5
cpu MHz : 1596.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4533.16
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
Ubuntu precise (12.04.5 LTS), 64 bit, using libm-2.15.so (my first laptop, MWE is slow)
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.15) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.79 system on 2016-05-26.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz
stepping : 7
microcode : 0x1a
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 5387.59
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Ubuntu precise (12.04.5 LTS), 32 bit, using libm-2.15.so (my second laptop, MWE is fast)
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.12) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.68 system on 2015-03-26.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T5800 @ 2.00GHz
stepping : 13
microcode : 0xa3
cpu MHz : 800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips : 3989.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
I hope I have managed to provide sufficient background info. These are my questions.
Why did std::tan() turn slow?
Is there a way to restore it to normal speed?
I would very much prefer a solution that does not require installing/replacing a bunch of libraries. That might work on my laptop, but I don't have the necessary permissions on the cluster nodes.
Update #1:
I removed my observation about passing constants to tan as it was explained by Sam Varshavchik. I added the output of running /lib/libc.so.6 to my system list. Also added a fourth system. As for timing, here's the output of running time ./mwe with the pi loop (approxPi commented out).
real 0m11.483s
user 0m11.465s
sys 0m0.004s
Here it is with the approxPi loop (pi commented out).
real 0m0.011s
user 0m0.008s
sys 0m0.000s
Update #2:
For each system, added whether the CPU is 32 or 64 bit as well as the output of cat /proc/cpuinfo for the first core.
Accuracy of transcendental functions (things like trigonometric functions and exponentials) has always been a problem1.
Why some trig function calls are slower than others
For many arguments to trigonometric functions there is a fast approximation that produces a highly accurate result. However, for certain arguments the approximation can be quite drastically wrong. As such, more precise methods need to be employed, but these take much longer (as you've noticed).
Why the new library might be slower now
For a long time Intel made misleading claims about the accuracy of its hardware implementations of the trigonometric functions, saying they were much more accurate than they really were2. So much so that glibc used to just have sin(double) as a wrapper around the hardware fsin instruction3. You have likely upgraded to a version of glibc that has rectified this mistake. I can't speak for AMD's libm, but it is likely still relying on incorrect claims of accuracy around the hardware versions of the trigonometric functions4,5.
What to do
If you want speed and aren't too fussed about accuracy, use the float version of tan (tanf, i.e. the single-precision overload of std::tan). Otherwise, if you need accuracy, you're stuck with the slower methods. The best you can do is cache the results of tan(pi) and tan(pi/2) and use the precomputed values when you know you'll need them.
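To make the workaround concrete, here is a small sketch (exact timings will vary; the comments restate the behavior described above, not a guarantee of any particular glibc version):

#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.141592653589793;

    // Double precision may fall into the slow multi-precision path for
    // arguments like pi, where the fast approximation is not accurate enough.
    double slow = std::tan(pi);

    // Single precision avoids that particular slow path; use it only if
    // the reduced accuracy is acceptable.
    float fast = std::tan(static_cast<float>(pi));  // the float overload (tanf)

    // Or compute the expensive value once and reuse it in hot loops.
    static const double tan_pi = std::tan(pi);

    std::printf("%g %g %g\n", slow, static_cast<double>(fast), tan_pi);
}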

Cortex A9 NEON vs VFP usage confusion

I'm trying to build a library for a Cortex A9 ARM processor (an OMAP4 to be more specific) and I'm in a little bit of confusion regarding which/when to use NEON vs VFP in the context of floating point operations and SIMD. To be noted that I know the difference between the 2 hardware coprocessor units (as also outlined here on SO), I just have some misunderstanding regarding their proper usage.
Related to this I'm using the following compilation flags:
GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp
I've read through the ARM documentation, a lot of wiki (like this one), forum and blog posts, and everybody seems to agree that using NEON is better than using VFP,
or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea; I'm not 100% sure yet if this applies in the context of the entire application/library or just to specific places (functions) in code.
So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a little bit of trouble, and my confusion on how to best use these features (NEON vs VFP) on the Cortex A9 just deepens further instead of clearing up. I have some code that does benchmarking for my app and uses some custom made timer classes
in which calculations are based on double precision floating point. Using NEON as the FPU gives completely inappropriate results (trying to print those values results in printing mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single precision floating point, as it is documented that NEON does not handle double precision floating point. My benchmarks still don't give the proper results (and what's worse is that now it does not work anymore on x86; I think it's because of the loss in precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for the SIMD capabilities and using it as the FPU does not provide the proper results, on the other hand mixing it with the VFP does not seem a very good idea.
Any advice in this area will be greatly appreciated!
In the article from the above-mentioned wiki I found a summary of what should be done for floating point optimization in the context of NEON:
"
Only use single precision floating point
Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
Minimize Conditional Branches
Enable RunFast mode
For softfp:
Inline floating point code (unless it's very large)
Pass FP arguments via pointers instead of by value and do integer work in between function calls.
"
I cannot use hard for the float ABI as I cannot link with the libraries I have available.
Most of the recommendations make sense to me (except "RunFast mode", where I don't understand exactly what it's supposed to do, and the claim that at this moment in time I could do better than the compiler), but I keep getting inconsistent results and I'm not sure of anything right now.
Could anyone shed some light on how to properly use the floating point unit and NEON on the Cortex A9/A8, and which compilation flags I should use?
... forum and blog posts, and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea
I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:
The NEON register bank consists of 32 64-bit registers. If both
Advanced SIMD and VFPv3 are implemented, they share this register
bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that
supports 32 double-precision floating-point registers. This
integration simplifies implementing context switching support, because
the same routines that save and restore VFP context also save and
restore NEON context.
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers
and each of the Q0-Q15 registers map onto a pair of D registers.
Figure 1.3 shows the different views of the shared NEON and VFP
register bank. All of these views are accessible at any time. Software
does not have to explicitly switch between them, because the
instruction used determines the appropriate view.
The registers don't compete; rather, they co-exist as views of the register bank. There's no way to disgorge the NEON and FPU gear.
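A small intrinsics sketch of that shared view, assuming a toolchain with arm_neon.h (the function is purely illustrative):

#include <arm_neon.h>

float32x4_t register_views(float32x2_t d0, float32x2_t d1) {
    // Two 64-bit D registers viewed together as one 128-bit Q register.
    float32x4_t q = vcombine_f32(d0, d1);

    // And back again: the low/high halves of a Q register are D registers.
    float32x2_t low  = vget_low_f32(q);
    float32x2_t high = vget_high_f32(q);

    // No mode switch is needed between views; each instruction simply
    // selects the view of the shared register bank that it needs.
    return vcombine_f32(vadd_f32(low, d1), high);
}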
Related to this I'm using the following compilation flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and compiler.
gnueabihf tells me the platform uses hard floats, which can speed up procedure calls. If in doubt, use softfp because it's compatible with hard floats.
BeagleBone Black:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 2 (v7l)
Features : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
...
So the BeagleBone uses:
-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
CubieTruck v5:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4
So the CubieTruck uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Banana Pi Pro:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
So the Banana Pi uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Raspberry Pi 3:
The RPI3 is unique in that it's ARMv8, but it's running a 32-bit OS. That means it's effectively 32-bit ARM, or Aarch32. There's a little more to 32-bit ARM vs Aarch32, but this will show you the Aarch32 flags.
Also, the RPI3 uses a Broadcom SoC with Cortex-A53 cores, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 4 (v7l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...
So the Raspberry Pi can use:
-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
Or it can use (I don't know what to use for -mtune):
-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard
ODROID C2:
The ODROID C2 uses an Amlogic SoC with Cortex-A53 cores, but it runs a 64-bit OS. The ODROID C2 has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (a similar configuration to the RPI3).
$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu
$ cat /proc/cpuinfo
Features : fp asimd evtstrm crc32
So the ODROID uses:
-march=armv8-a+crc -mtune=cortex-a53
In the above recipes, I identified the ARM processor (like Cortex A9 or A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers output from /proc/cpuinfo:
CPU part: Part number. 0xd03 indicates Cortex-A53 processor.
So we may be able to look up the value from a database. I don't know if one exists or where it's located.
I think this question should be split up into several, adding some code examples and detailing the target platform and the versions of the toolchains used.
But to cover one part of the confusion:
The recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine, the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.
-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.
See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.
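As a concrete illustration of the 4-wide single-precision point, here is a minimal sketch (the function name is made up; compile with e.g. -mfpu=neon -mfloat-abi=softfp):

#include <arm_neon.h>

// Multiply two float arrays four elements at a time with NEON.
// n is assumed to be a multiple of 4 to keep the sketch short.
void mul4(float* dst, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmulq_f32(va, vb));  // 4 multiplies in parallel
    }
}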
I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in optimizing for it.
It might sound rude, but I really don't see any point in NEON intrinsics either. They're more trouble than help - if any.
Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.
Then you can start writing native NEON code without worrying about the compiler doing something astral and spitting out tons of errors/warnings.
Learning the NEON instructions is less demanding than all those intrinsic macros. And above all, the results are so much better.
Fully optimized NEON native code usually runs more than twice as fast as well-written intrinsics counterparts.
Just compare the OP's version with mine in the link below, and you'll know what I mean.
Optimizing RGBA8888 to RGB565 conversion with NEON