runtime.futex taking 50% to 70% of time in Go profile

While profiling a couple of Go services, we are seeing that all of them spend 55% to 70% of their time in the runtime.futex function.
Note that these services involve use of goroutines, locks, and channels for synchronization and communication between the goroutines.
Is this expected, so that we should only care about the time spent in the non-"runtime" functions? Or could this be an artifact of incorrect usage of goroutines/locks/channels that is causing this overhead in the futex call?
(pprof) top20
Showing nodes accounting for 43.80s, 80.25% of 54.58s total
Dropped 654 nodes (cum <= 0.27s)
Showing top 20 nodes out of 156
flat flat% sum% cum cum%
31.05s 56.89% 56.89% 31.05s 56.89% runtime.futex
3.55s 6.50% 63.39% 4.51s 8.26% runtime.cgocall
2.25s 4.12% 67.52% 2.25s 4.12% runtime.usleep
1.14s 2.09% 69.60% 1.53s 2.80% runtime.scanobject
0.73s 1.34% 70.94% 1.56s 2.86% syscall.Syscall
0.71s 1.30% 72.24% 2.27s 4.16% runtime.mallocgc
(pprof) top20
Showing nodes accounting for 12.49s, 95.93% of 13.02s total
Dropped 93 nodes (cum <= 0.07s)
Showing top 20 nodes out of 47
flat flat% sum% cum cum%
9.69s 74.42% 74.42% 9.69s 74.42% runtime.futex
1.25s 9.60% 84.02% 1.63s 12.52% runtime.cgocall
0.64s 4.92% 88.94% 0.64s 4.92% runtime.usleep
0.17s 1.31% 90.25% 0.22s 1.69% runtime.timeSleepUntil
0.11s 0.84% 91.09% 0.17s 1.31% runtime.scanobject
0.08s 0.61% 91.71% 0.08s 0.61% runtime.nanotime1
0.08s 0.61% 92.32% 0.26s 2.00% runtime.retake
0.07s 0.54% 92.86% 0.07s 0.54% runtime.lock2
0.07s 0.54% 93.39% 0.13s 1% runtime.traceEventLocked
0.06s 0.46% 93.86% 0.36s 2.76% runtime.notetsleep_internal
0.05s 0.38% 94.24% 0.13s 1% runtime.mallocgc

This is not unusual for a program that isn't doing very much. If your program spends most of its time waiting for something to happen, it will often be sleeping in a futex.
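To separate real lock contention from the scheduler merely parking idle threads, Go's block and mutex profiles are usually more telling than the CPU profile. Here is a minimal sketch of enabling them, assuming you can expose the standard net/http/pprof endpoint (the port and sampling rates here are illustrative, not from the original post):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
    "runtime"
)

func main() {
    // Record every blocking event and every mutex contention event.
    // A production service would normally use coarser sampling rates.
    runtime.SetBlockProfileRate(1)
    runtime.SetMutexProfileFraction(1)

    // Inspect with:
    //   go tool pprof http://localhost:6060/debug/pprof/block
    //   go tool pprof http://localhost:6060/debug/pprof/mutex
    http.ListenAndServe("localhost:6060", nil)
}

If those profiles stay quiet while runtime.futex dominates the CPU profile, the futex time is mostly idle waiting rather than contention.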

FYI:
I faced the same problem. In my case, I had a worker pool implementation in which 200 goroutines were listening on the same channel, which caused the contention. Switching to one channel per goroutine reduced runtime.futex considerably; a sketch of that layout follows.
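For illustration, here is a hedged sketch of the per-goroutine-channel design described above. The Job type, worker count, and dispatcher are made up for the example, not taken from the original code:

package main

import (
    "fmt"
    "sync"
)

// Job is a hypothetical work item; the original post does not show one.
type Job int

func main() {
    const workers = 4
    jobs := make(chan Job)

    // Instead of all workers receiving from the single jobs channel
    // (many goroutines parked on one channel, contending on every send),
    // give each worker its own channel so a send wakes at most one goroutine.
    chans := make([]chan Job, workers)
    var wg sync.WaitGroup
    for i := range chans {
        chans[i] = make(chan Job)
        wg.Add(1)
        go func(ch <-chan Job) {
            defer wg.Done()
            for j := range ch {
                fmt.Println("processed", j)
            }
        }(chans[i])
    }

    // Dispatcher: the only goroutine listening on jobs; it fans work
    // out across the per-worker channels round-robin.
    go func() {
        i := 0
        for j := range jobs {
            chans[i%workers] <- j
            i++
        }
        for _, ch := range chans {
            close(ch)
        }
    }()

    for j := 0; j < 8; j++ {
        jobs <- Job(j)
    }
    close(jobs)
    wg.Wait()
}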

Related

Error reported while running pyomo optimization with cbc solver and using timelimit

I am trying to solve an optimisation problem with Pyomo (Pyomo 5.3, CPython 2.7.13 on Linux 3.10.0-514.26.2.el7.x86_64) using the CBC solver (version 2.9.8) and specifying a solver time limit of 60 seconds. The solver finds a feasible solution (-1415.8392) that is apparently not yet optimal (best possible -1415.84), as you can see below.
When the time limit is reached, the model seemingly exits with an error code. I want to print or get the values of all variables of the feasible solution CBC found within the specified time limit. Alternatively, is there a way to make the model exit and print the feasible solution once it reaches 99% of the optimal value?
The error code is posted below.
Cbc0004I Integer solution of -1415.8392 found after 357760 iterations and 29278 nodes (47.87 seconds)
Cbc0010I After 30000 nodes, 6350 on tree, -1415.8392 best solution, best possible -1415.84 (48.87 seconds)
Cbc0010I After 31000 nodes, 6619 on tree, -1415.8392 best solution, best possible -1415.84 (50.73 seconds)
Cbc0010I After 32000 nodes, 6984 on tree, -1415.8392 best solution, best possible -1415.84 (52.49 seconds)
Cbc0010I After 33000 nodes, 7384 on tree, -1415.8392 best solution, best possible -1415.84 (54.31 seconds)
Cbc0010I After 34000 nodes, 7419 on tree, -1415.8392 best solution, best possible -1415.84 (55.73 seconds)
Cbc0010I After 35000 nodes, 7824 on tree, -1415.8392 best solution, best possible -1415.84 (57.37 seconds)
Traceback (most recent call last):
  File "model_final.py", line 392, in <module>
    solver.solve(model, timelimit = 60*1, tee=True)
  File "/home/aditya/0r/lib/python2.7/site-packages/pyomo/opt/base/solvers.py", line 655, in solve
    default_variable_value=self._default_variable_value)
  File "/home/aditya/0r/lib/python2.7/site-packages/pyomo/core/base/PyomoModel.py", line 242, in load_from
    % str(results.solver.status))
ValueError: Cannot load a SolverResults object with bad status: error
When I manually run the model generated by Pyomo, using the same command-line parameters that Pyomo uses (/usr/bin/cbc -sec 60 -printingOptions all -import /tmp/tmpJK1ieR.pyomo.lp -import -stat=1 -solve -solu /tmp/tmpJK1ieR.pyomo.soln), CBC seems to exit normally and also writes the solution, as shown below.
Cbc0010I After 35000 nodes, 7824 on tree, -1415.8392 best solution, best possible -1415.84 (57.06 seconds)
Cbc0038I Full problem 205 rows 289 columns, reduced to 30 rows 52 columns
Cbc0010I After 36000 nodes, 8250 on tree, -1415.8392 best solution, best possible -1415.84 (58.73 seconds)
Cbc0020I Exiting on maximum time
Cbc0005I Partial search - best objective -1415.8392 (best possible -1415.84), took 464553 iterations and 36788 nodes (60.11 seconds)
Cbc0032I Strong branching done 15558 times (38451 iterations), fathomed 350 nodes and fixed 2076 variables
Cbc0035I Maximum depth 203, 5019 variables fixed on reduced cost
Cbc0038I Probing was tried 31933 times and created 138506 cuts of which 0 were active after adding rounds of cuts (4.431 seconds)
Cbc0038I Gomory was tried 30898 times and created 99534 cuts of which 0 were active after adding rounds of cuts (4.855 seconds)
Cbc0038I Knapsack was tried 30898 times and created 12926 cuts of which 0 were active after adding rounds of cuts (8.271 seconds)
Cbc0038I Clique was tried 100 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Cbc0038I MixedIntegerRounding2 was tried 30898 times and created 13413 cuts of which 0 were active after adding rounds of cuts (3.652 seconds)
Cbc0038I FlowCover was tried 100 times and created 4 cuts of which 0 were active after adding rounds of cuts (0.019 seconds)
Cbc0038I TwoMirCuts was tried 30898 times and created 15292 cuts of which 0 were active after adding rounds of cuts (2.415 seconds)
Cbc0038I Stored from first was tried 30898 times and created 15734 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Cbc0012I Integer solution of -1411.9992 found by Reduced search after 467825 iterations and 36838 nodes (60.12 seconds)
Cbc0020I Exiting on maximum time
Cbc0005I Partial search - best objective -1411.9992 (best possible -1415.4522), took 467825 iterations and 36838 nodes (60.12 seconds)
Cbc0032I Strong branching done 476 times (1776 iterations), fathomed 1 nodes and fixed 18 variables
Cbc0035I Maximum depth 21, 39 variables fixed on reduced cost
Cuts at root node changed objective from -1484.12 to -1415.45
Probing was tried 133 times and created 894 cuts of which 32 were active after adding rounds of cuts (0.060 seconds)
Gomory was tried 133 times and created 1642 cuts of which 0 were active after adding rounds of cuts (0.047 seconds)
Knapsack was tried 133 times and created 224 cuts of which 0 were active after adding rounds of cuts (0.083 seconds)
Clique was tried 100 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.001 seconds)
MixedIntegerRounding2 was tried 133 times and created 163 cuts of which 0 were active after adding rounds of cuts (0.034 seconds)
FlowCover was tried 100 times and created 5 cuts of which 0 were active after adding rounds of cuts (0.026 seconds)
TwoMirCuts was tried 133 times and created 472 cuts of which 0 were active after adding rounds of cuts (0.021 seconds)
ImplicationCuts was tried 25 times and created 41 cuts of which 0 were active after adding rounds of cuts (0.003 seconds)
Result - Stopped on time limit
Objective value: -1411.99922848
Lower bound: -1415.452
Gap: 0.00
Enumerated nodes: 36838
Total iterations: 467825
Time (CPU seconds): 60.13
Time (Wallclock seconds): 60.98
Total time (CPU seconds): 60.13 (Wallclock seconds): 61.01
The top few lines of the CBC solution file are:
Stopped on time - objective value -1411.99922848
0 c_e_x1454_ 0 0
1 c_e_x1455_ 0 0
2 c_e_x1456_ 0 0
3 c_e_x1457_ 0 0
4 c_e_x1458_ 0 0
5 c_e_x1459_ 0 0
6 c_e_x1460_ 0 0
7 c_e_x1461_ 0 0
8 c_e_x1462_ 0 0
Can anyone tell me how I can get these values without generating an error?
Thanks in advance.
You could try setting the bound gap tolerance (e.g. CBC's ratioGap option) so that the solver accepts the incumbent solution as final. I'm surprised that the solver status comes back as an error when a feasible solution has been found. Could you print out the whole results object?
Created a pull request https://github.com/Pyomo/pyomo/pull/265 to address this issue.
For those who want the patch right away:
In file pyomo/solvers/plugins/solvers/CBCplugin.py
@@ -264,7 +264,8 @@ def _check_and_escape_options(options):
cmd.append('-AMPL')
if self._timelimit is not None and self._timelimit > 0.0:
- cmd.extend(['-sec', str(self._timelimit)])
+ cmd.extend(['-sec', str(self._timelimit - 1 )])
+ cmd.extend(['-timeMode', "elapsed"])
if "debug" in self.options:
cmd.extend(["-log","5"])
for key, val in _check_and_escape_options(self.options):
@@ -276,7 +277,8 @@ def _check_and_escape_options(options):
#"-stat"])
else:
if self._timelimit is not None and self._timelimit > 0.0:
- cmd.extend(['-sec', str(self._timelimit)])
+ cmd.extend(['-sec', str(self._timelimit - 1 )])
+ cmd.extend(['-timeMode', "elapsed"])
if "debug" in self.options:
cmd.extend(["-log","5"])
# these must go after options that take a value
The time limit is set 1 second lower than the provided option to ensure the CBC process has time to exit properly before pyutilib.subprocess.processmngr checks the exit code. In my test runs, the process manager checked the exit status at T+0.02 seconds, while the CBC process typically exited after T+0.1 seconds.
The CBC invocation was also changed to use wall-clock seconds instead of the default CPU seconds, since wall-clock time is what the process manager checks as well.

How can I stream multiple files at the same time using HTTP::Server?

I'm working on an HTTP service that serves big files. I noticed that parallel downloads are not possible. The process serves only one file at a time and all other downloads are waiting until the previous downloads finish. How can I stream multiple files at the same time?
require "http/server"
server = HTTP::Server.new(3000) do |context|
context.response.content_type = "application/data"
f = File.open "bigfile.bin", "r"
IO.copy f, context.response.output
end
puts "Listening on http://127.0.0.1:3000"
server.listen
Request one file at a time:
$ ab -n 10 -c 1 127.0.0.1:3000/
[...]
Percentage of the requests served within a certain time (ms)
50% 9
66% 9
75% 9
80% 9
90% 9
95% 9
98% 9
99% 9
100% 9 (longest request)
Request 10 files at once:
$ ab -n 10 -c 10 127.0.0.1:3000/
[...]
Percentage of the requests served within a certain time (ms)
50% 52
66% 57
75% 64
80% 69
90% 73
95% 73
98% 73
99% 73
100% 73 (longest request)
The problem here is that both File#read and context.response.output will never block. Crystal's concurrency model is based on cooperatively scheduled fibers, where switching fibers only happens when IO blocks. Reading from the disk using nonblocking IO is impossible, which means the only part that could block is writing to context.response.output. However, disk IO is a lot slower than network IO on the same machine, meaning that writing will never block, because ab is reading at a rate much faster than the disk can provide data, even from the disk cache. This example is practically the perfect storm to break Crystal's concurrency.
In the real world, it's much more likely that clients of the service will be across the network from the machine, making the response write block occasionally. Furthermore, if you were reading from another network service or a pipe/socket, you would also block. Another solution would be to use a thread pool to implement nonblocking file IO, which is what libuv does. As a side note, Crystal moved to libevent because libuv doesn't allow a multithreaded event loop (i.e. having any thread resume any fiber).
Calling Fiber.yield to pass execution to any pending fiber is the correct solution. Here's an example of how to block (and yield) while reading files:
def copy_in_chunks(input, output, chunk_size = 4096)
  size = 1
  while size > 0
    size = IO.copy(input, output, chunk_size)
    Fiber.yield
  end
end

File.open("bigfile.bin", "r") do |file|
  copy_in_chunks(file, context.response)
end
This is a transcription of the discussion here: https://github.com/crystal-lang/crystal/issues/4628
Props to GitHub users #cschlack, #RX14 and #ysbaddaden

Why do I get such huge jitter in time measurement?

I'm trying to measure a function's performance by measuring the time for each iteration.
During the process, I found even if I do nothing, the results still vary quite a bit.
e.g.
volatile long count = 0;
for (int i = 0; i < N; ++i) {
  measure.begin();
  ++count;
  measure.end();
}
In measure.end(), I compute the time difference and maintain an unordered_map to track the time-count histogram.
I've used clock_gettime as well as rdtsc, but about 1% of the data points always lie far away from the mean, by a factor of up to 1000.
Here's what the above loop generates:
T: count percentile
18 117563 11.7563%
19 111821 22.9384%
21 201605 43.0989%
22 541095 97.2084%
23 2136 97.422%
24 2783 97.7003%
...
406 1 99.9994%
3678 1 99.9995%
6662 1 99.9996%
17945 1 99.9997%
18148 1 99.9998%
18181 1 99.9999%
22800 1 100%
mean:21
So whether it's ticks or nanoseconds, the worst case of 22800 is about 1000 times the mean.
I set isolcpus in GRUB and ran this with taskset. The simple loop does almost nothing, and the hash table used for the time-count statistics is outside of the timed region.
What am I missing?
I'm running this on a laptop with Ubuntu installed; the CPU is Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz.
Thank you for all the answers.
The main interrupt that I couldn't stop is the local timer interrupt. It seems the new 3.10 kernel supports full tickless operation, so I'll try that.

Jython + Django not ready for production?

So recently I was playing around with Django on the Jython platform and wanted to see its performance in "production". The site I tested with was just a simple view returning HttpResponse("Time %.2f" % time.time()), so no database was involved.
I tried the following two combinations (measurements done with ab -c15 -n500 -k <url>, everything on Ubuntu Server 10.10 in VirtualBox):
J2EE application server (Tomcat/Glassfish), deployed WAR file
I get results like
Requests per second: 143.50 [#/sec] (mean)
[...]
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 16
90% 31
95% 31
98% 641
99% 3219
100% 3219 (longest request)
Obviously, the server hangs for a few seconds once in a while, which is not acceptable. I assume it has something to do with reloading Jython, because starting the Jython shell takes about 3 seconds, too.
AJP serving using patched flup package (+ Apache as frontend)
Note: flup is the package used by manage.py runfcgi; I had to patch it because flup's threading/forking support doesn't seem to work on Jython (hence AJP was the only working method).
Almost the same results here, but sometimes the last 100 requests don't get answered at all (though the server process stays alive).
I'm asking this on SO (instead of serverfault) because it's very Django/Jython-specific. Does anyone have experience with deploying Django sites on Jython? Is there maybe another (faster) way to serve the site? Or is it just too early to use Django on the Java platform?
So, as nobody replied, I investigated a bit more, and it seems my problem might have to do with VirtualBox. Using different server OSes (Debian Squeeze, Ubuntu Server), I saw similar problems. For example, with simple static file serving, I got this result from the Apache web server (on Debian):
> ab -c50 -n1000 http://ip.of.my.vm/some/static/file.css
Requests per second: 91.95 [#/sec] (mean) <--- quite impossible for static serving
[...]
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 22.1 0 688
Processing: 0 206 991.4 31 9188
Waiting: 0 96 401.2 16 3031
Total: 0 208 991.7 31 9203
Percentage of the requests served within a certain time (ms)
50% 31
66% 47
75% 63
80% 78
90% 156
95% 781
98% 844
99% 9141 <--- !!!!!!
100% 9203 (longest request)
This led me to think (though it's not a firm conclusion) that the Java reloading might not be the problem here, but rather the virtualization. I will try it on a real host and leave this question unanswered till then.
FOLLOWUP
Now I have successfully tested a bare-bones Django site (really just the welcome page) using Jython + AJP over TCP/mod_proxy_ajp on Apache (again with the patched flup package), this time on a real host (i7 920, 6 GB RAM). The result proved that my assumption above was correct and that I really should never benchmark on a virtual host again. Here's the result for the welcome page:
Document Path: /jython-test/
Document Length: 2059 bytes
Concurrency Level: 40
Time taken for tests: 24.688 seconds
Complete requests: 20000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 43640000 bytes
HTML transferred: 41180000 bytes
Requests per second: 810.11 [#/sec] (mean)
Time per request: 49.376 [ms] (mean)
Time per request: 1.234 [ms] (mean, across all concurrent requests)
Transfer rate: 1726.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 1.5 0 20
Processing: 2 49 16.5 44 255
Waiting: 0 48 16.5 44 255
Total: 2 49 16.5 45 256
Percentage of the requests served within a certain time (ms)
50% 45
66% 48
75% 51
80% 53
90% 69
95% 80
98% 90
99% 97
100% 256 (longest request) # <-- no multiple seconds of waiting anymore
Very promising, I would say. The only downside is that the average request time is above 40 ms, whereas the development server has a mean below 3 ms.

What exactly does C++ profiling (google cpu perf tools) measure?

I'm trying to get started with Google Perf Tools to profile a CPU-intensive application. It's a statistical calculation that dumps each step to a file using ofstream. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The "_write$UNIX2003" entry (where can I find out what this is?) appears to come from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% of samples are in SliceStep::DoStep and _write$UNIX2003 goes away. However, my application does not speed up, as measured by total time; the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except that I see no visible increase in performance (0.1 second on a 10 second calculation). The code is essentially:
ofstream outfile("out.txt");
for (/* each iteration */) {
  SliceStep::DoStep();
  outfile << result;
}
outfile.close();
Update: I am timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should give you a rough idea of where your time is spent.
If the results are still not conclusive, you need a better test case:
Use a larger problem
Do a warmup before measuring. Do some loops and start any measurement afterwards (in the same process).
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines to the console using python results in something like:
for i in xrange(100000):
    print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python test.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 probably refers to the write POSIX system call, which performs the actual output to a file descriptor. I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
I'm not sure why your program wouldn't speed up when you remove the output, but I can't really make a guess on only the information you've given. It would be nice to see some of the code, or even the perftools output when the cout statement is removed.
Google perftools collects samples of the call stack, so what you need is some visibility into those.
According to the docs, you can display the call graph at statement or address granularity. That should tell you what you need to know.