100% CPU for Docker Container when started via entrypoint - c++

We have written a small C++ application which mainly does some supervising of other processes via ZeroMQ. So most of the time the application idles and periodically sends and receives some requests.
We built a Docker image based on Ubuntu which contains just this application, some dependencies and an entrypoint.sh. The entrypoint is basically a /bin/bash script that manipulates some configuration files based on environment variables and then starts the application via exec.
Now here's the strange part. When we start the application manually without Docker, we get a CPU usage of nearly 0%. When we start the same application as a Docker container, the CPU usage goes up to 100% and saturates exactly one CPU core.
To find out what was happening, we set the entrypoint of our image to /bin/yes (just to make sure the container keeps running) and then started a bash inside the running container. From there we started entrypoint.sh manually and the CPU again was at 0%.
So we are wondering, what could cause this situation. Is there anything we need to add to our Dockerfile to prevent this?
Here is some output generated with strace. I used strace -p <pid> -f -c and waited five minutes to collect some insights.
1. Running with docker run (100% CPU)
strace: Process 12621 attached with 9 threads
strace: [ Process PID=12621 runs in x32 mode. ]
[...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.26   17.866443         144    124127           nanosleep
 14.40    3.610578       55547        65        31 futex
 14.07    3.528224        1209      2918           epoll_wait
  0.10    0.024760        4127         6         1 restart_syscall
  0.10    0.024700           0     66479           poll
  0.05    0.011339           4      2902         3 recvfrom
  0.02    0.005517           2      2919           write
  0.01    0.001685           1      2909           read
  0.00    0.000070          70         1         1 connect
  0.00    0.000020          20         1           socket
  0.00    0.000010           1        18           epoll_ctl
  0.00    0.000004           1         6           sendto
  0.00    0.000004           1         4           fcntl
  0.00    0.000000           0         1           close
  0.00    0.000000           0         1           getpeername
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         1           getsockopt
------ ----------- ----------- --------- --------- ----------------
100.00   25.073354                202359        36 total
2. Running with a dummy entrypoint and docker exec (0% CPU)
strace: Process 31394 attached with 9 threads
[...]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.32   12.544007         102    123355           nanosleep
 14.94    2.784310       39216        71        33 futex
 14.01    2.611210         869      3005           epoll_wait
  2.01    0.373797           6     66234           poll
  1.15    0.213487          71      2999           recvfrom
  0.41    0.076113       15223         5         1 restart_syscall
  0.09    0.016295           5      3004           write
  0.08    0.014458           5      3004           read
------ ----------- ----------- --------- --------- ----------------
100.00   18.633677                201677        34 total
Note that in the first case I started strace slightly earlier, so there are some different calls which can all be traced back to initialization code.
The only difference I could find is the message "Process PID=12621 runs in x32 mode." when using docker run. Could this be an issue?
Also note that in both measurements the total runtime is about 20 seconds while the process was running for five minutes.
Some further investigation of the 100% CPU case: I checked the process with top -H -p <pid> and only the parent process was using 100% CPU, while the child threads were all mostly idling. But when calling strace -p <pid> on the parent process I could verify that the process did not do anything (no output was generated).
So I have a process which is using one whole CPU core while doing exactly nothing.

As it turned out, a legacy part of the software was waiting for console input in a while loop:
while (!finished) {
    std::cin >> command;
    processCommand(command);
}
This worked fine when running locally and with docker exec. But since the executable was started as a Docker service, there was no console attached to stdin. Therefore std::cin reached end-of-file and every read returned immediately instead of blocking. This turned the loop into an endless busy loop without any sleeps, which naturally caused 100% CPU usage.
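One simple way to guard against this (a minimal sketch, not the original code; processCommand and finished stand in for the application's own symbols) is to check the stream state, so the loop exits once std::cin fails or reaches EOF instead of spinning:

#include <iostream>
#include <string>

// Stand-in for the real command handler from the snippet above.
void processCommand(const std::string &command)
{
    std::cout << "got command: " << command << '\n';
}

int main()
{
    bool finished = false;
    std::string command;

    // operator>> returns the stream, which evaluates to false on EOF or error.
    // Without a console attached (e.g. a detached Docker container), stdin
    // reaches EOF immediately, so the loop exits instead of burning a core.
    while (!finished && (std::cin >> command)) {
        processCommand(command);
    }
    return 0;
}

Alternatively, running the container with docker run -i keeps stdin open, which restores the blocking behaviour of the original loop.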
Thanks to @Botje for guiding us through the debugging process.

Related

Intel MPI Benchmarks result exceeds network bandwidth

I was running the Intel MPI Benchmarks on a mini MPI cluster of two nodes on AWS EC2.
The executables of the benchmark were compiled with make CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx.
The cluster was set up based on this tutorial.
(What I did differently is that I installed OpenMPI instead of MPICH.)
The type of the AWS instances is t3a.xlarge.
According to this spec the network performance is up to 5 Gbps (i.e. 625 MB/s).
I chose Ubuntu 18.04 as the OS for these instances.
The PingPong benchmark gave me surprising results:
mpiuser@mpin0:~$ mpirun -rf rankfile -np 2 ./IMB-MPI1 PingPong
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part
#------------------------------------------------------------
# Date : Thu Nov 28 15:17:38 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1054-aws
# Version : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
# MPI Version : 3.1
# MPI_Datatype : MPI_BYTE
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        32.05         0.00
            1         1000        31.74         0.03
            2         1000        31.98         0.06
            4         1000        32.08         0.12
            8         1000        31.86         0.25
           16         1000        31.30         0.51
           32         1000        31.40         1.02
           64         1000        32.86         1.95
          128         1000        32.66         3.92
          256         1000        33.56         7.63
          512         1000        35.00        14.63
         1024         1000        34.74        29.48
         2048         1000        36.56        56.02
         4096         1000        39.80       102.91
         8192         1000        49.92       164.10
        16384         1000        57.73       283.79
        32768         1000        78.86       415.54
        65536          640       170.26       384.91
       131072          320       239.80       546.60
       262144          160       357.34       733.61
       524288           80       609.82       859.74
      1048576           40      1106.36       947.77
      2097152           20      2514.62       833.98
      4194304           10      5830.39       719.39
Some of the speeds exceed 625 MB/s.
The rankfile was used to force distributing the two processes onto two nodes.
It is not the case that the two MPI processes are running on the same node.
The content of the rankfile is:
rank 0=mpin0 slot=0
rank 1=mpin1 slot=0
What are the possible reasons that caused the benchmark result to exceed the theoretical limit?

Ember CLI build killed

I build my Ember CLI app inside a Docker container on startup. The build fails without an error message; it just says Killed:
root@fstaging:/frontend/source# node_modules/ember-cli/bin/ember build -prod
version: 1.13.15
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
Buildingember-auto-register-helpers is not required for Ember 2.0.0 and later please remove from your `package.json`.
Building.DEPRECATION: The `bind-attr` helper ('app/templates/components/file-selector.hbs' @ L1:C7) is deprecated in favor of HTMLBars-style bound attributes.
at isBindAttrModifier (/app/source/bower_components/ember/ember-template-compiler.js:11751:34)
Killed
The same docker image successfully starts up in another environment, but without hardware constraints. Does Ember CLI have hard-coded hardware constraints for the build process? The RAM is limited to 128m and swap to 2g.
That is likely not enough memory for Ember CLI to do what it needs. You are correct in that the process is being killed because of an OOM situation. If you log in to the host and take a look at the dmesg output, you will probably see something like:
V8 WorkerThread invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
V8 WorkerThread cpuset=867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 mems_allowed=0
CPU: 0 PID: 2027 Comm: V8 WorkerThread Tainted: G O 4.1.13-boot2docker #1
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
0000000000000000 00000000000000d0 ffffffff8154e053 ffff880039381000
ffffffff8154d3f7 ffff8800395db528 ffff8800392b4528 ffff88003e214580
ffff8800392b4000 ffff88003e217080 ffffffff81087faf ffff88003e217080
Call Trace:
[<ffffffff8154e053>] ? dump_stack+0x40/0x50
[<ffffffff8154d3f7>] ? dump_header.isra.10+0x8c/0x1f4
[<ffffffff81087faf>] ? finish_task_switch+0x4c/0xda
[<ffffffff810f46b1>] ? oom_kill_process+0x99/0x31c
[<ffffffff811340e6>] ? task_in_mem_cgroup+0x5d/0x6a
[<ffffffff81132ac5>] ? mem_cgroup_iter+0x1c/0x1b2
[<ffffffff81134984>] ? mem_cgroup_oom_synchronize+0x441/0x45a
[<ffffffff8113402f>] ? mem_cgroup_is_descendant+0x1d/0x1d
[<ffffffff810f4d77>] ? pagefault_out_of_memory+0x17/0x91
[<ffffffff815565d8>] ? page_fault+0x28/0x30
Task in /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032 killed as a result of limit of /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032
memory: usage 131072kB, limit 131072kB, failcnt 2284203
memory+swap: usage 262032kB, limit 262144kB, failcnt 970540
kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/867781e35d8a0a231ef60a272ae5d418796c45e92b5aa0233df317ce659b0032: cache:340KB rss:130732KB rss_huge:10240KB mapped_file:8KB writeback:0KB swap:130960KB inactive_anon:72912KB active_anon:57880KB inactive_file:112KB active_file:40KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 1993] 0 1993 380 1 6 3 17 0 sh
[ 2025] 0 2025 203490 32546 221 140 32713 0 npm
Memory cgroup out of memory: Kill process 2025 (npm) score 1001 or sacrifice child
Killed process 2025 (npm) total-vm:813960kB, anon-rss:130184kB, file-rss:0kB
It might be worthwhile to profile the container using something like https://github.com/google/cadvisor to find out what kind of memory maximums it may need.

How to interpret addresses in Google perf tools CPU profiler

My C++ program is consuming a lot of CPU, and more so as it runs. I used Google Performance Tools to profile CPU usage, and this is what I got:
(pprof) top
Total: 1343 samples
    1330  99.0%  99.0%     1330  99.0% 0x0000000801dcb11c
       7   0.5%  99.6%        7   0.5% 0x0000000801dcb11e
       4   0.3%  99.9%        4   0.3% program::threadWorker
       1   0.1%  99.9%        1   0.1% 0x0000000801dcb110
       1   0.1% 100.0%        1   0.1% 0x00007fffffffffc0
However, only 1 out of the 5 entries shown here is an actual function name; the rest are addresses. How can I find out what these addresses pertain to? (Of course, I am most interested in the first address shown above.)
Edit: This is how I ran the profiler:
env CPUPROFILE=prof.out ./a.out
[kill program]
pprof ./a.out prof.out
Also, I found the root cause by code inspection. But it would still be nice to have the profiler pinpoint the culprit function rather than an address.
Is it possible you haven't specified the executable when loading the results in google-pprof?
I run it as:
$ google-pprof executable /tmp/executable.hprof --text | less
and can see the function names just fine. Or could it be that those methods are in some shared library that is not on your path when you run google-pprof?
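As a side note, and unrelated to the symbol resolution itself: instead of driving the profiler via the CPUPROFILE environment variable and killing the program, the profile can also be started and stopped from the code with the gperftools API, which flushes it cleanly when ProfilerStop returns. A minimal sketch (doWork is a hypothetical stand-in for the real workload; link with -lprofiler):

#include <gperftools/profiler.h>

// Hypothetical CPU-bound workload standing in for the real program's hot path.
void doWork()
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; ++i) {
        x += i * 0.5;
    }
}

int main()
{
    // Samples are written to prof.out; only the code between Start and Stop
    // is profiled, and the file is written out when ProfilerStop returns.
    ProfilerStart("prof.out");
    doWork();
    ProfilerStop();
    return 0;
}

The resulting prof.out can then be analyzed with google-pprof ./a.out prof.out --text as before.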

Jython + Django not ready for production?

So recently I was playing around with Django on the Jython platform and wanted to see its performance in "production". The site I tested with was just a simple return HttpResponse("Time %.2f" % time.time()) view, so no database involved.
I tried the following two combinations (measurements done with ab -c15 -n500 -k <url>, everything in Ubuntu Server 10.10 on VirtualBox):
J2EE application server (Tomcat/Glassfish), deployed WAR file
I get results like
Requests per second: 143.50 [#/sec] (mean)
[...]
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 16
90% 31
95% 31
98% 641
99% 3219
100% 3219 (longest request)
Obviously, the server hangs for a few seconds once in a while, which is not acceptable. I assume it has something to do with reloading Jython because starting the jython shell takes about 3 seconds, too.
AJP serving using patched flup package (+ Apache as frontend)
Note: flup is the package used by manage.py runfcgi; I had to patch it because flup's threading/forking support doesn't seem to work on Jython (so AJP was the only working method).
Almost the same results here, but sometimes the last 100 requests don't even get answered at all (but server process still alive).
I'm asking this on SO (instead of serverfault) because it's very Django/Jython-specific. Does anyone have experience with deploying Django sites on Jython? Is there maybe another (faster) way to serve the site? Or is it just too early to use Django on the Java platform?
Since nobody replied, I investigated a bit more, and it seems like my problem might have to do with VirtualBox. Using different server OSes (Debian Squeeze, Ubuntu Server), I had similar problems. For example, with simple static file serving, I got this result from the Apache web server (on Debian):
> ab -c50 -n1000 http://ip.of.my.vm/some/static/file.css
Requests per second: 91.95 [#/sec] (mean) <--- quite impossible for static serving
[...]
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   22.1      0     688
Processing:     0  206  991.4     31    9188
Waiting:        0   96  401.2     16    3031
Total:          0  208  991.7     31    9203
Percentage of the requests served within a certain time (ms)
50% 31
66% 47
75% 63
80% 78
90% 156
95% 781
98% 844
99% 9141 <--- !!!!!!
100% 9203 (longest request)
This leads me to think (although I don't have a firm conclusion yet) that the Java reloading might not be the problem here, but rather the virtualization. I will try it on a real host and leave this question unanswered till then.
FOLLOWUP
Now I successfully tested a bare-bones Django site (really just the welcome page) using Jython + AJP over TCP/mod_proxy_ajp on Apache (again with patched flup package). This time on a real host (i7 920, 6 GB RAM). The result proved that my above assumption was correct and that I really should never benchmark on a virtual host again. Here's the result for the welcome page:
Document Path: /jython-test/
Document Length: 2059 bytes
Concurrency Level: 40
Time taken for tests: 24.688 seconds
Complete requests: 20000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 43640000 bytes
HTML transferred: 41180000 bytes
Requests per second: 810.11 [#/sec] (mean)
Time per request: 49.376 [ms] (mean)
Time per request: 1.234 [ms] (mean, across all concurrent requests)
Transfer rate: 1726.23 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1    1.5      0      20
Processing:     2   49   16.5     44     255
Waiting:        0   48   16.5     44     255
Total:          2   49   16.5     45     256
Percentage of the requests served within a certain time (ms)
50% 45
66% 48
75% 51
80% 53
90% 69
95% 80
98% 90
99% 97
100% 256 (longest request) # <-- no multiple seconds of waiting anymore
Very promising, I would say. The only downside is that the average request time is > 40 ms whereas the development server has a mean of < 3 ms.

Non voluntary context switches: How can I prevent them?

I have a small application that I was just running, and I wanted to check whether it has any memory leaks, so I put in this piece of code:
for (unsigned int i = 0; i < 10000; i++) {
    for (unsigned int j = 0; j < 10000; j++) {
        std::ifstream &a = s->fhandle->open("test");
        char temp[30];
        a.getline(temp, 30);
        s->fhandle->close("test");
    }
}
When I ran the application I cat'ed /proc/<pid>/status to see if the memory increases.
The output is the following after about 2 minutes of runtime:
Name: origin-test
State: R (running)
Tgid: 7267
Pid: 7267
PPid: 6619
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
FDSize: 256
Groups: 4 20 24 46 110 111 119 122 1000
VmPeak: 183848 kB
VmSize: 118308 kB
VmLck: 0 kB
VmHWM: 5116 kB
VmRSS: 5116 kB
VmData: 9560 kB
VmStk: 136 kB
VmExe: 28 kB
VmLib: 11496 kB
VmPTE: 240 kB
VmSwap: 0 kB
Threads: 2
SigQ: 0/16382
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000002004
SigCgt: 00000001800044c2
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: 3f
Cpus_allowed_list: 0-5
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 120
nonvoluntary_ctxt_switches: 26475
None of the values changed except the last one, so does that mean there are no memory leaks?
But what's more important, and what I would like to know, is whether it is bad that the last value is increasing rapidly (about 26475 switches in about 2 minutes!).
I looked at some other applications to compare how many non-voluntary switches they have:
Firefox: about 200
Gdm: 2
Netbeans: 19
Then I googled and found some information, but it was too technical for me to understand.
What I got from it is that this happens when the application is switched off the processor or something? (I have an AMD 6-core processor, by the way.)
How can I prevent my application from doing that, and to what extent could this be a problem when running the application?
Thanks in advance,
Robin.
A voluntary context switch occurs when your application is blocked in a system call and the kernel decides to give its time slice to another process.
A non-voluntary context switch occurs when your application has used up the whole timeslice the scheduler attributed to it (the kernel tries to pretend that each application has the whole computer to itself and can use as much CPU as it wants, but it has to switch from one application to another so that the user has the illusion that they are all running in parallel).
In your case, since you're opening, closing and reading from the same file, it probably stays in the virtual file system cache during the whole execution of the process, and your program is being preempted by the kernel because it never blocks (either because of system or library caches). On the other hand, Firefox, Gdm and Netbeans are mostly waiting for input from the user or from the network, so they rarely need to be preempted by the kernel.
Those context switches are not harmful. On the contrary, they allow your processor to be used fairly by all applications, even when one of them is waiting for some resource.
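If you want to look at these counters from inside the program rather than by reading /proc/<pid>/status, getrusage reports the same numbers; a small sketch (the busy loop is only there to give the scheduler a reason to preempt the process):

#include <sys/resource.h>
#include <iostream>

int main()
{
    // Burn some CPU without blocking, similar to the file loop in the question.
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; ++i) {
        x += i * 0.5;
    }

    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        std::cout << "voluntary_ctxt_switches:    " << usage.ru_nvcsw  << '\n'
                  << "nonvoluntary_ctxt_switches: " << usage.ru_nivcsw << '\n';
    }
    return 0;
}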
And BTW, to detect memory leaks, a better solution would be to use a tool dedicated to this, such as valgrind.
To add to @Sylvain's info, there is a nice background article on Linux scheduling here: "Inside the Linux scheduler" (developerWorks, June 2006).
To look for a memory leak it is much better to install and use valgrind, http://www.valgrind.org/. It will identify memory leaks in the heap and memory error conditions (using uninitialized memory, tons of other problems). I use it almost every day.
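For illustration, here is a tiny self-contained program with a deliberate leak (a made-up example, not taken from the question) and the kind of report Valgrind gives for it:

// leak.cpp -- build: g++ -g leak.cpp -o leak
// run:           valgrind --leak-check=full ./leak

// Deliberately leaks one heap allocation per call.
void leak()
{
    int *buffer = new int[1024];  // never deleted
    buffer[0] = 42;
}

int main()
{
    for (int i = 0; i < 10; ++i) {
        leak();
    }
    return 0;
}

Valgrind then reports the ten allocations as "definitely lost" with a stack trace pointing at the new int[1024] in leak(), which is usually enough to find and fix the leak.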