I have a C++ program that calls lots of trig functions. It has been running fine for more than a year. I recently installed gcc-4.8 and, in the same go, updated glibc. This resulted in my program slowing down by almost a factor of 1000. Using gdb I discovered that the cause of the slowdown was a call to std::tan(). When the argument is either pi or pi/2, the function takes a very long time to return.
Here's an MWE that reproduces the problem if compiled without optimization (the real program has the same problem both with and without the -O2 flag).
#include <cmath>

int main() {
    double pi = 3.141592653589793;
    double approxPi = 3.14159;
    double ret = 0.;
    for (int i = 0; i < 100000; ++i) ret = std::tan(pi);       // Very slow
    for (int i = 0; i < 100000; ++i) ret = std::tan(approxPi); // Not slow
    (void)ret; // silence the unused-variable warning
}
Here's a sample backtrace from gdb (obtained by interrupting the program at a random point with Ctrl+C). Starting from the call to tan, the backtrace is the same in the MWE and in my real program.
#0 0x00007ffff7b1d048 in __mul (p=32, z=0x7fffffffc740, y=0x7fffffffcb30, x=0x7fffffffc890) at ../sysdeps/ieee754/dbl-64/mpa.c:458
#1 __mul (x=0x7fffffffc890, y=0x7fffffffcb30, z=0x7fffffffc740, p=32) at ../sysdeps/ieee754/dbl-64/mpa.c:443
#2 0x00007ffff7b1e348 in cc32 (p=32, y=0x7fffffffc4a0, x=0x7fffffffbf60) at ../sysdeps/ieee754/dbl-64/sincos32.c:111
#3 __c32 (x=<optimized out>, y=0x7fffffffcf50, z=0x7fffffffd0a0, p=32) at ../sysdeps/ieee754/dbl-64/sincos32.c:128
#4 0x00007ffff7b1e170 in __mptan (x=<optimized out>, mpy=0x7fffffffd690, p=32) at ../sysdeps/ieee754/dbl-64/mptan.c:57
#5 0x00007ffff7b45b46 in tanMp (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:503
#6 __tan_avx (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:488
#7 0x00000000004005b8 in main ()
I've tried running the code (both the MWE and the real program) on four different systems. Two of them are cluster nodes where I run my code, and two are my laptops. The MWE runs without issues on one of the clusters and on one laptop. In case it's relevant, I checked which version of libm.so.6 each system uses. The list below shows, for each system: the description (taken from cat /etc/*-release), whether the CPU is 32- or 64-bit, whether the MWE is slow, and the output of running /lib/libc.so.6 and cat /proc/cpuinfo.
SUSE Linux Enterprise Server 11 (x86_64), 64 bit, using libm-2.11.1.so (MWE is fast)
GNU C Library stable release version 2.11.1 (20100118), by Roland McGrath et al.
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Configured for x86_64-suse-linux.
Compiled by GNU CC version 4.3.4 [gcc-4_3-branch revision 152973].
Compiled on a Linux 2.6.32 system on 2012-04-12.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping : 2
microcode : 53
cpu MHz : 1200.000
cache size : 30720 KB
physical id : 0
siblings : 24
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips : 5000.05
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
CentOS release 6.7 (Final), 64 bit, using libm-2.12.so (MWE is slow)
GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-16).
Compiled on a Linux 2.6.32 system on 2015-09-22.
Available extensions:
The C stubs add-on version 2.1.2.
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
stepping : 5
cpu MHz : 1596.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips : 4533.16
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
Ubuntu precise (12.04.5 LTS), 64 bit, using libm-2.15.so (my first laptop, MWE is slow)
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.15) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.79 system on 2016-05-26.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz
stepping : 7
microcode : 0x1a
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 5387.59
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Ubuntu precise (12.04.5 LTS), 32 bit, using libm-2.15.so (my second laptop, MWE is fast)
GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.12) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.68 system on 2015-03-26.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T5800 @ 2.00GHz
stepping : 13
microcode : 0xa3
cpu MHz : 800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips : 3989.79
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
I hope I have managed to provide sufficient background info. Here are my questions:
Why did std::tan() turn slow?
Is there a way to restore it to normal speed?
I would very much prefer a solution that does not require installing/replacing a bunch of libraries. That might work on my laptop, but I don't have the necessary permissions on the cluster nodes.
Update #1:
I removed my observation about passing constants to tan, as it was explained by Sam Varshavchik. I added the output of running /lib/libc.so.6 to my system list, and also added a fourth system. As for timing, here's the output of running time ./mwe with only the pi loop (the approxPi loop commented out).
real 0m11.483s
user 0m11.465s
sys 0m0.004s
Here it is with only the approxPi loop (the pi loop commented out).
real 0m0.011s
user 0m0.008s
sys 0m0.000s
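(For completeness, the two cases can also be timed in a single run; below is a minimal in-process sketch using <chrono>. The loop structure mirrors the MWE; the volatile sink is my own addition to keep the loops from being optimized away.)
#include <chrono>
#include <cmath>
#include <cstdio>

// Time 100000 calls of std::tan for one argument; returns seconds.
static double timeTan(double arg) {
    volatile double sink = 0.;   // prevents the loop from being elided
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000; ++i) sink = std::tan(arg);
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::printf("pi:       %f s\n", timeTan(3.141592653589793));
    std::printf("approxPi: %f s\n", timeTan(3.14159));
}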
Update #2:
For each system, I added whether the CPU is 32- or 64-bit, as well as the output of cat /proc/cpuinfo for the first core.
Accuracy for transcendental functions (things like trigonometric functions and exponentials) has always been a problem1.
Why some trig function calls are slower than others
For many arguments to the trigonometric functions there is a fast approximation that produces a highly accurate result. For certain arguments, however, that approximation can be drastically wrong, so a more precise method must be used instead, and it takes much longer (as you've noticed).
Why might the new library be slower now
For a long time Intel made misleading claims about the accuracy of its hardware trigonometric instructions, saying they were much more accurate than they really were2. So much so that glibc used to implement sin(double) as little more than a wrapper around the x87 fsin instruction3. You have likely upgraded to a version of glibc that has rectified this mistake. I can't speak for AMD's libm, but it is likely still relying on those incorrect claims of accuracy4,5.
What to do
If you want speed and aren't too fussed about accuracy, use the float version of tan (tanf). Otherwise, if you need accuracy, you're stuck with the slower method. The best you can do is cache the results of tan(pi) and tan(pi/2) and use the precomputed values when you know you'll need them.
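As a rough illustration of both suggestions, here is a minimal sketch (the names are my own invention); note that the cached-argument comparison only helps when the argument matches bit for bit:
#include <cmath>

// Caching: pay the slow-path cost once, at startup.
const double pi     = 3.141592653589793;
const double tanPi  = std::tan(pi);
const double tanPi2 = std::tan(pi / 2);

double cachedTan(double x) {
    if (x == pi)     return tanPi;   // exact bit-for-bit match only
    if (x == pi / 2) return tanPi2;
    return std::tan(x);              // everything else takes the usual path
}

// Speed over accuracy: the float overload trades roughly half the
// significant digits for speed.
float fastTan(float x) { return std::tan(x); }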
Related
On my Ubuntu 18.04 host, I've installed VirtualBox 6.0 in order to have nested virtualization. Virtualization is enabled in my BIOS.
However, when I open the settings of my (powered off) virtual machine and go to System -> Processor, the option "Enable Nested VT-x/AMD-V" is greyed out and I cannot enable it.
Execute this:
$ VBoxManage modifyvm <VirtualMachineName> --nested-hw-virt on
For Windows
In Windows, go to the VirtualBox installation folder and type cmd in the Explorer address bar (this opens a command prompt in that folder), then run VBoxManage modifyvm <YourVirtualMachineName> --nested-hw-virt on and press Enter.
Now the option should be checked.
On VirtualBox 6.1.2 that worked (Intel i7-2630QM):
(VBoxManage modifyvm lubuntu18 --nested-hw-virt on)
From what I understand, this option is only available for AMD CPUs and cannot be enabled on Intel CPUs. This is a little misleading, since the option's title clearly names both Intel's and AMD's virtualization technologies.
Here is an official confirmation in VirtualBox doc:
https://www.virtualbox.org/manual/ch03.html
Chapter 3.5.2. Processor Tab
Enable Nested VT-x/AMD-V: Enables nested virtualization, with passthrough of hardware virtualization functions to the guest VM.
This feature is available on host systems that use an AMD CPU. For Intel CPUs, the option is grayed out.
So far it only works with AMD CPUs (forget about the confusing option title).
Initially this is for AMD CPUs only.
All Intel CPU posts will be deleted/split.
https://forums.virtualbox.org/viewtopic.php?f=1&t=90831
https://forums.virtualbox.org/viewtopic.php?f=7&t=90874
In Windows 10 this problem is caused by having Memory Integrity active.
Windows Security -> Device security -> Core isolation details
Disable Memory integrity and then restart Windows.
The VirtualBox option "Enable Nested VT-x/AMD-V" will still be greyed out.
Now, open a new PowerShell in your VB installation folder and type: ./VBoxManage modifyvm "Virtual Machine Name" --nested-hw-virt on
You'll find detailed information here (I don't know why Microsoft doesn't mention this issue anywhere).
Recently this popped up for me out of the blue on Windows 11. I already had Hyper-V disabled from previous tweaks and everything had been working. In the end I had to use this command:
bcdedit /set hypervisorlaunchtype off
which fixed it, but it broke the Windows Subsystem for Android recently introduced in Windows 11, so there's that...
From the directory where VirtualBox is installed, I ran a similar command that works (note the placement of the quotes!):
VBoxManage modifyvm "path\to\ubuntu 18.04.3.vbox" --nested-hw-virt on
Hope this helps.
BD
It's alive on VirtualBox 6.1.2 r135662 (Qt5.6.2) and Intel Core i3-8100!
CMD output, transcribed from the image:
C:\WINDOWS\system32>ssh myuser@192.168.56.111
myuser@192.168.56.111's password:
Last login: Mon Feb 17 10:11:06 2020 from 192.168.56.1
myuser@nestedvt ~ $ su
Password:
root@nestedvt /home/myuser # egrep "svm|vmx" /proc/cpuinfo
root@nestedvt /home/myuser #
root@nestedvt /home/myuser # poweroff
Connection to 192.168.56.111 closed by remote host.
Connection to 192.168.56.111 closed.
C:\WINDOWS\system32>cd "C:\Program Files\Oracle\VirtualBox"
C:\Program Files\Oracle\VirtualBox>VBoxManage modifyvm CentOS7_nestedVT --nested-hw-virt on
C:\Program Files\Oracle\VirtualBox>VBoxManage startvm CentOS7_nestedVT
Waiting for VM "CentOS7_nestedVT" to power on...
VM "CentOS7_nestedVT" has been successfully started.
C:\Program Files\Oracle\VirtualBox>ssh myuser@192.168.56.111
myuser@192.168.56.111's password:
Last login: Mon Feb 17 10:12:08 2020 from 192.168.56.1
myuser@nestedvt ~ $ su
Password:
root@nestedvt /home/myuser # egrep "svm|vmx" /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow flexpriority fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow flexpriority fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d
root@nestedvt /home/myuser # exit
exit
myuser@nestedvt ~ $ exit
logout
Connection to 192.168.56.111 closed.
C:\Program Files\Oracle\VirtualBox>wmic cpu get name
Name
Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz
C:\Program Files\Oracle\VirtualBox>wmic os get caption
Caption
Microsoft Windows 10 Pro
It turned out it was greyed out for a reason! I have a Windows 10 host; I used Docker for some time and then uninstalled it, but that left the Hyper-V technology enabled (which is incompatible with VirtualBox's hardware virtualization).
DO NOT DO ON A SERVER | THIS WILL DISABLE Hyper-V Technology - USE AT YOUR OWN RISK
Open a command prompt as admin, run the following, then restart your PC:
DISM /Online /Disable-Feature:Microsoft-Hyper-V
PowerShell Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-Hypervisor -All
bcdedit /set hypervisorlaunchtype off
The cause of the problem is Hyper-V.
If you want to use nested virtualization, you should turn off hypervisorlaunchtype.
This worked for me: bcdedit /set hypervisorlaunchtype off
FYI,
Oracle VM VirtualBox supports nested virtualization on host systems that run AMD and Intel CPUs.
For more details, check:
https://docs.oracle.com/en/virtualization/virtualbox/6.0/admin/nested-virt.html
VBoxManage modifyvm <vmname> --nested-hw-virt on
This works (<vmname> is your VM's name).
Enable VT-x/AMD-V in VirtualBox from a Windows host PC:
Open the VirtualBox installation folder in a command prompt run as administrator: cd C:\Program Files\Oracle\VirtualBox
Then run the command:
VBoxManage modifyvm <vmname> --nested-hw-virt on
where <vmname> is your VM's name. This enables nested VT-x/AMD-V in your VirtualBox.
Sometimes the problem is that your machine has a saved state, but the saved state is not a correct one. Click on your machine and then on Discard at the top to discard any saved state. In my case this solved it.
I've been using TensorFlow for nearly two years and have never seen this one. On a new Ubuntu box, I have a fresh install of TensorFlow in a virtualenv. When I ran some sample code, I got an Invalid Device error. It occurred when tf.Session() was called.
WARNING:tensorflow:From full_code.py:27: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
2017-06-05 11:01:55.853842: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-05 11:01:55.853867: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-05 11:01:55.853876: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-05 11:01:55.853886: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-05 11:01:55.853893: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-06-05 11:01:55.937978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 660 Ti
major: 3 minor: 0 memoryClockRate (GHz) 1.0455
pciBusID 0000:04:00.0
Total memory: 2.95GiB
Free memory: 2.91GiB
2017-06-05 11:01:55.938063: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x19e5370
2017-06-05 11:01:56.014220: E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Here is the full spec.
Ubuntu 14.04
CUDA 8.0
GeForce GTX 660 Ti
python 3.4.3
Thanks to someone from Google, I figured out what went wrong. This Dell box has two Nvidia graphics cards. The first one came from the manufacturer and is an NVS 310. As far as I know it has no useful compute capability, and I never intended to use it for computation.
I then added a second card, a GTX 660 Ti, which I intended to use for all computations.
When TensorFlow is invoked, it defaults to Device 0, which is the NVS 310, and of course it throws an invalid device error.
When I do this,
CUDA_VISIBLE_DEVICES=1 python myscript.py
it works.
I have Oracle VirtualBox 4.3.8 RC1 and installed the stable version of Debian.
With this version of VirtualBox I can use this command to enable SSE4.1 and SSE4.2:
VBoxManage setextradata "VM name" VBoxInternal/CPUM/SSE4.1 1
I wanted to compile DPDK, http://dpdk.org, but there is an error:
implicit declaration of function '_mm_popcnt_u32'
When I look at the flags with
cat /proc/cpuinfo
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 sse4_1 sse4_2 lahf_lm
There is no "popcnt". Why? Can i enable it or what i am doing wrong?
Thanks
My case: POPCNT was missing on VirtualBox v6.1.22 with Hyper-V.
Run
VBoxManage setextradata VMName VBoxInternal/CPUM/IsaExts/POPCNT 1
and enable nested paging on the VM.
It works.
You can use __builtin_popcountll to replace _mm_popcnt_u32, so that only SSE3 intrinsics are pulled in and used.
See here:
http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/4560
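For illustration, here is what that substitution could look like (a sketch of my own, not DPDK code). __builtin_popcountll is a GCC/Clang built-in: the compiler emits a POPCNT instruction only when the target supports it and otherwise falls back to plain integer code, so no SSE4.2 intrinsic header is needed.
#include <cstdint>

// Drop-in replacement for _mm_popcnt_u32(v): counts set bits without
// requiring the POPCNT/SSE4.2 CPU flag.
static inline unsigned popcnt_u32(uint32_t v) {
    return static_cast<unsigned>(__builtin_popcountll(v));
}

// Usage, e.g.: unsigned bits = popcnt_u32(0xF0F0u); // yields 8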
I developed a C++ server application for an embedded i386-compatible environment, so no cross compiler was needed. A colleague developed a dynamic library, which makes heavy use of exceptions, to implement the network communications. Once the library is copied to the target file system, a client connection causes an abort with the common message terminated after throwing an instance of..., even though libstdc++ is available on the embedded OS.
After several tries, including statically linking the libraries, we apparently found a solution: copying the libgcc_s.so.1 used at compile time on a Fedora 3 virtual machine to the embedded file system, and launching the server with the environment variable LD_LIBRARY_PATH set to the path of the Fedora lib.
On the embedded OS we have a BusyBox with few, reduced tools, but using the uptime command we noticed that after the client connects, the CPU usage rises from 20% to 100% (and I don't know how, possibly even more). The first impression is an application bug, but it was never noticed during the debug sessions on the Fedora machine, and /proc/<pid>/status shows this:
Name: taskname
State: S (sleeping)
SleepAVG: 97%
Tgid: 589
Pid: 589
PPid: 1
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 256
Groups: 0
VmSize: 3396 kB
VmLck: 0 kB
VmRSS: 1604 kB
VmData: 492 kB
VmStk: 84 kB
VmExe: 84 kB
VmLib: 2512 kB
VmPTE: 20 kB
Threads: 1
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000080000000
SigIgn: 0000000000001004
SigCgt: 0000000380004a02
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff
So I cannot figure out what is using the CPU so heavily, even after the client disconnects.
This behaviour is not present when the server is launched on the Fedora machine.
I suspect that mixing the Fedora 3 libgcc_s.so.1 with the embedded system could lead to some strange side effect, but I don't have any clue.
So I tried other ways to deploy the server:
Copying the other required libraries (libstdc++ and libc) from Fedora 3 to the embedded system: same behaviour.
Reversing the process: copying the embedded system's libraries into the source tree and forcing the linker to use them. Launching the application (on the compiler-side machine), the error message terminated after throwing an instance of... reappeared.
Additional info:
If useful: running ldd -v libgcc_s.so.1 (ldd is not available on the embedded system) on the two libraries gave the following results:
HOST LIBRARY:
libc.so.6 => /lib/tls/libc.so.6 (0x00694000)
/lib/ld-linux.so.2 (0x0067b000)
Version information:
/lib/libgcc_s.so.1:
libc.so.6 (GLIBC_2.2.4) => /lib/tls/libc.so.6
libc.so.6 (GLIBC_2.1.3) => /lib/tls/libc.so.6
libc.so.6 (GLIBC_2.0) => /lib/tls/libc.so.6
/lib/tls/libc.so.6:
ld-linux.so.2 (GLIBC_2.1) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_2.3) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_PRIVATE) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_2.0) => /lib/ld-linux.so.2
EMBEDDED LIBRARY:
libc.so.6 => /lib/tls/libc.so.6 (0xf6ec3000)
/lib/ld-linux.so.2 (0x0067b000)
Version information:
./libgcc_s.so.1:
libc.so.6 (GLIBC_2.1.3) => /lib/tls/libc.so.6
libc.so.6 (GLIBC_2.0) => /lib/tls/libc.so.6
/lib/tls/libc.so.6:
ld-linux.so.2 (GLIBC_2.1) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_2.3) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_PRIVATE) => /lib/ld-linux.so.2
ld-linux.so.2 (GLIBC_2.0) => /lib/ld-linux.so.2
Does anyone have an explanation or a suggestion?
Thank you
A. Cappelli
More info about the processor types:
Compiler host /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.40GHz
stepping : 1
cpu MHz : 3390.524
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss nx pni
bogomips : 6471.68
Embedded machine /proc/cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 4
model : 9
model name : 486 DX/4-WB
stepping : 4
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu
bogomips : 65.40
If your embedded system has a recent enough Linux kernel, you can try using Linux performance counters (perf). Once perf is installed, run
perf record ./server
on your embedded system. This generates a perf.data file when the server exits. You can then analyze the file by running
perf report
in the same directory. It will show how much CPU each library and executable symbol used, which lets you narrow the issue down to particular libraries or to your server code. More info about perf here.
I'm trying to build a library for a Cortex-A9 ARM processor (an OMAP4, to be more specific), and I'm a little confused about which of NEON and VFP to use, and when, in the context of floating-point operations and SIMD. To be clear, I know the difference between the two hardware coprocessor units (as also outlined here on SO); I just have some misunderstanding regarding their proper usage.
Related to this I'm using the following compilation flags:
GCC
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
ARMCC
--cpu=Cortex-A9 --apcs=/softfp
--cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp
I've read through the ARM documentation, a lot of wikis (like this one), and forum and blog posts, and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g., using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea. I'm not 100% sure yet whether this applies to the entire application/library or just to specific places (functions) in the code.
So I'm using NEON as the FPU for my application, since I also want to use the intrinsics. As a result I'm in a bit of trouble, and my confusion about how best to use these features (NEON vs. VFP) on the Cortex-A9 deepens instead of clearing up. I have some code that benchmarks my app using custom-made timer classes whose calculations are based on double-precision floating point. Using NEON as the FPU gives completely inappropriate results (trying to print those values yields mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single-precision floating point, since NEON is documented not to handle double precision. My benchmarks still don't give the proper results (and worse, they now fail on x86 too; I think it's because of the lost precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for its SIMD capabilities, but using it as the FPU does not give correct results; on the other hand, mixing it with the VFP does not seem to be a very good idea.
Any advice in this area would be greatly appreciated!
In the above-mentioned wiki I found a summary of what should be done for floating-point optimization in the context of NEON:
"
Only use single precision floating point
Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
Minimize Conditional Branches
Enable RunFast mode
For softfp:
Inline floating point code (unless it's very large)
Pass FP arguments via pointers instead of by value and do integer work in between function calls.
"
I cannot use hard for the float ABI as I cannot link with the libraries I have available.
Most of the recommendations make sense to me (except for "RunFast mode", which I don't understand exactly, and the claim that at this moment I could do better than the compiler), but I keep getting inconsistent results and I'm not sure of anything right now.
Could anyone shed some light on how to properly use floating point and NEON on the Cortex-A9/A8, and which compilation flags I should use?
... forum and blog posts and everybody seems to agree that using NEON is better than using VFP or at least mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea
I'm not sure this is correct. According to ARM's Introducing NEON Development Article, under NEON registers:
The NEON register bank consists of 32 64-bit registers. If both
Advanced SIMD and VFPv3 are implemented, they share this register
bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that
supports 32 double-precision floating-point registers. This
integration simplifies implementing context switching support, because
the same routines that save and restore VFP context also save and
restore NEON context.
The NEON unit can view the same register bank as:
sixteen 128-bit quadword registers, Q0-Q15
thirty-two 64-bit doubleword registers, D0-D31.
The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers
and each of the Q0-Q15 registers map onto a pair of D registers.
Figure 1.3 shows the different views of the shared NEON and VFP
register bank. All of these views are accessible at any time. Software
does not have to explicitly switch between them, because the
instruction used determines the appropriate view.
The registers don't compete; rather, they coexist as views of the register bank. There is no way to separate the NEON and FPU hardware.
Related to this I'm using the following compilation flags:
-O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
-O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and the compiler.
gnueabihf tells me the platform uses hard floats, which can speed up procedural calls. If in doubt, use softfp, because it's compatible with hard floats.
BeagleBone Black:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 2 (v7l)
Features : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
...
So the BeagleBone uses:
-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
CubieTruck v5:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 5 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4
So the CubieTruck uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Banana Pi Pro:
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
So the Banana Pi uses:
-march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
Raspberry Pi 3:
The RPI3 is unique in that it's ARMv8 but runs a 32-bit OS. That means it's effectively 32-bit ARM, or AArch32. There's a little more to 32-bit ARM vs. AArch32, but this will show you the AArch32 flags.
Also, the RPI3 uses a Broadcom SoC with Cortex-A53 cores; it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.
$ gcc -v 2>&1 | grep Target
Target: arm-linux-gnueabihf
$ cat /proc/cpuinfo
model name : ARMv7 Processor rev 4 (v7l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
...
So the Raspberry Pi can use:
-march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
Or it can use (I don't know what to use for -mtune):
-march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard
ODROID C2:
The ODROID C2 uses an Amlogic SoC with Cortex-A53 cores, but runs a 64-bit OS. It has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (a similar configuration to the RPI3).
$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu
$ cat /proc/cpuinfo
Features : fp asimd evtstrm crc32
So the ODROID uses:
-march=armv8-a+crc -mtune=cortex-a53
In the recipes above, I identified the ARM processor (like Cortex-A9 or Cortex-A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers the output of /proc/cpuinfo:
CPU part: Part number. 0xd03 indicates Cortex-A53 processor.
So we may be able to look up the value in a database. I don't know if one exists or where it's located.
I think this question should be split up into several, with some code examples added and the target platform and toolchain versions detailed. But to cover one part of the confusion:
The recommendation to "use NEON as the FPU" sounds like a misunderstanding. NEON is a SIMD engine; the VFP is an FPU. You can use NEON for single-precision floating-point operations on up to 4 single-precision values in parallel, which (when possible) is good for performance.
-mfpu=neon can be seen as shorthand for -mfpu=neon-vfpv3.
See http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html for more information.
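To make the "up to 4 single-precision values in parallel" point concrete, here is a small intrinsics sketch of my own (it assumes <arm_neon.h> and something like -mfpu=neon -mfloat-abi=softfp on the OP's toolchain):
#include <arm_neon.h>

// out[i] += a[i] * b[i], four lanes per iteration.
// n is assumed to be a multiple of 4 to keep the sketch short.
void multiplyAccumulate(float* out, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vo = vld1q_f32(out + i);
        vo = vmlaq_f32(vo, va, vb);           // vo += va * vb, per lane
        vst1q_f32(out + i, vo);               // store 4 floats
    }
}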
I'd stay away from VFP. It's just like Thumb mode: it's meant for compilers. There's no point in hand-optimizing for it.
It might sound rude, but I really don't see any point in the NEON intrinsics either. They're more trouble than help, if any.
Just invest two or three days in basic ARM assembly: you only need to learn a few instructions for loop control/termination.
Then you can start writing native NEON code without worrying about the compiler doing something astral and spitting out tons of errors/warnings.
Learning the NEON instructions is less demanding than learning all those intrinsic macros. And above all that, the results are so much better.
Fully optimized native NEON code usually runs more than twice as fast as well-written intrinsics counterparts.
Just compare the OP's version with mine in the link below, and you'll see what I mean.
Optimizing RGBA8888 to RGB565 conversion with NEON
Regards