NSURLConnection stops receiving data

I've recently finished an app which connects to a number of web services provided by my client. I've done this countless times in the past without any problems.
However, for some reason, with one of the services, which downloads about 250 KB - 1 MB of data, the delegate method
+ (void)connection:(MCURLConnection *)connection didReceiveData:(NSData *)data
occasionally just stops being called after the first invocation.
Here's the log when it works:
2014-08-19 11:57:43.270 MiniCheckout[886:60b] didReceiveResponse: didGetProductsForShop:
2014-08-19 11:57:43.270 MiniCheckout[886:60b] Code: 200
2014-08-19 11:57:43.271 MiniCheckout[886:60b] didReceiveData
2014-08-19 11:57:43.457 MiniCheckout[886:60b] didReceiveData
2014-08-19 11:57:43.645 MiniCheckout[886:60b] didReceiveData
2014-08-19 11:57:43.830 MiniCheckout[886:60b] didReceiveData
2014-08-19 11:57:44.003 MiniCheckout[886:60b] didReceiveData
2014-08-19 11:57:44.007 MiniCheckout[886:60b] didReceiveData
2014-08-19 11:57:44.169 MiniCheckout[886:60b] ConnectionFinished:didGetProductsForShop:
And when it doesn't (which is far too often!):
2014-08-19 11:57:43.270 MiniCheckout[886:60b] didReceiveResponse: didGetProductsForShop:
2014-08-19 11:57:43.270 MiniCheckout[886:60b] Code: 200
2014-08-19 11:57:43.271 MiniCheckout[886:60b] didReceiveData
That's it, it just stops!
I have a timeout timer which fires after 60 seconds just so the app doesn't "lock up", but the failure occurs far too often and the client is not happy with it.
The web server and services are provided by the client, and the issue occurs over both Wi-Fi and 3G.
I've handled web services countless times without problems, but I'm stumped as to why this is happening.

You can specify a timeout in your NSURLRequest object. One way to do this is to construct it via the requestWithURL:cachePolicy:timeoutInterval: method. (You can pass in the default NSURLRequestUseProtocolCachePolicy cachePolicy parameter if you don't want to worry about that part.) The timeout is a floating-point value in seconds, as are basically all time intervals in the iPhone SDK.
Also make sure your NSURLConnection's delegate is set and responds to the connection:didFailWithError: method. A connection always calls either this method or connectionDidFinishLoading: upon connection completion.


Delphi - mORMot: Installation went wrong. Automated test error on UTF8 test! How do I fix this?

I added the mORMot folders to Delphi's library path and tested whether everything works properly by running TestSQL3 in the SQLite3 folder. It shows an error at UTF8:
! - UTF8: 14,000 / 1,099,792 FAILED 1.15s
How do I fix this? Please help! Thank you in advance.
Synopse mORMot Framework Automated tests
Synopse libraries
1.1. Low level common:
System copy record: 162 assertions passed 108us
TRawUTF8List: 190,172 assertions passed 61.62ms
TDynArray: 1,092,815 assertions passed 137.96ms
TDynArrayHashed: 1,599,067 assertions passed 1.09s
TSynDictionary: 139,850 assertions passed 324.01ms
TSynQueue: 6,541,501 assertions passed 215.78ms
TObjectListHashed: 2,996,100 assertions passed 1.49s
TObjectListSorted: 79,912 assertions passed 51.59ms
TSynNameValue: 40,032 assertions passed 5.54ms
TRawUTF8Interning: 2,000,013 assertions passed 122.39ms
500000 interning 8 KB in 40.91ms i.e. 12,219,262/s, aver. 0us, 186.4 MB/s
500000 direct 7.6 MB in 12.76ms i.e. 39,175,742/s, aver. 0us, 597.7 MB/s
TObjectDynArrayWrapper: 167,501 assertions passed 13.25ms
TObjArray: 3,230 assertions passed 1.72ms
Custom RTL: 77,552 assertions passed 1s
FillChar in 30.56ms, 12.7 GB/s
Move in 4.51ms, 3.4 GB/s
small Move in 5.86ms, 3.7 GB/s
big Move in 106.81ms, 3.6 GB/s
FillCharFast [] in 33.54ms, 11.5 GB/s
MoveFast [] in 3.61ms, 4.3 GB/s
small MoveFast [] in 5.76ms, 3.8 GB/s
big MoveFast [] in 105.27ms, 3.7 GB/s
Fast string compare: 71 assertions passed 268us
IdemPropName: 216 assertions passed 207us
Url encoding: 152 assertions passed 1.08ms
GUID: 10,007 assertions passed 2.75ms
ParseCommandArguments: 232 assertions passed 370us
IsMatch: 4,250 assertions passed 2.27ms
TExprParserMatch: 140 assertions passed 663us
Soundex: 35 assertions passed 518us
Numerical conversions: 2,545,159 assertions passed 351.35ms
100000 FloatToText in 16.49ms i.e. 6,062,443/s, aver. 0us, 109.9 MB/s
100000 str in 23.31ms i.e. 4,290,004/s, aver. 0us, 94 MB/s
100000 DoubleToShort in 18.31ms i.e. 5,460,899/s, aver. 0us, 99 MB/s
Integers: 33,860 assertions passed 48.08ms
crc32c: 290,087 assertions passed 80.93ms
pas 286.7 MB/s fast 2.4 GB/s sse42 4.1 GB/s
Random32: 201,002 assertions passed 25.81ms
Bloom filters: 2,010,072 assertions passed 128.92ms
DeltaCompress: 87 assertions passed 6.38ms
Curr 64: 20,056 assertions passed 1.83ms
CamelCase: 11 assertions passed 116us
Bits: 22,985 assertions passed 14.47ms
Ini files: 7,028 assertions passed 188.97ms
! - UTF8: 14,000 / 1,099,792 FAILED 1.15s
Url decoding: 1,101 assertions passed 561us
Baudot code: 10,007 assertions passed 21.87ms
Iso 8601 date and time: 200,831 assertions passed 16.80ms
Time zones: 408 assertions passed 212.13ms
Mime types: 30 assertions passed 651us
Quick select: 4,015 assertions passed 124.33ms
TSynTable: 875 assertions passed 2.34ms
TSynCache: 404 assertions passed 404us
TSynFilter: 1,005 assertions passed 2.57ms
TSynValidate: 677 assertions passed 774us
TSynLogFile: 49 assertions passed 977us
TSynUniqueIdentifier: 1,300,002 assertions passed 515.62ms
Total failed: 14,000 / 22,692,553 - Low level common FAILED 7.45s
Windows 10 64bit (10.0.18362) (cp874)
8 x Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz (x86)
Using mORMot 1.18.6102
TSQLite3LibraryStatic 3.32.3 with internal MM
Generated with: Delphi 10.3 Rio 32 bit compiler
Time elapsed for all tests: 2m29
Performed 2020-08-06 23:58:11 by LENOVO on LAPTOP-BED954TL
Total assertions failed for all test suits: 14,000 / 45,919,717 !
Some tests FAILED: please correct the code.
Done - Press ENTER to Exit
There is a restriction with the regression tests.
As your output states:
Windows 10 64bit (10.0.18362) (cp874)
you are using a system with code page 874 (Thai).
During some of the tests, some UTF-8 to WinAnsi (aka code page 1252) conversions are performed via the AnsiString type, and some characters are likely to be missing from your code page.
Therefore some test failures are reported.
This is a false positive, due to a restriction of the current tests. I will try to avoid such issues in the future.
If you can compile TestSQL3, it is very likely that your installation is correct. It will work as expected with internal UTF-8 content (mORMot works internally with UTF-8 JSON to avoid unneeded conversions), and the regular VCL string type, which is UTF-16, will safely be available via the UTF8ToString()/StringToUTF8() functions.
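To see the failure mode outside of mORMot, here is a minimal Win32 sketch (not the framework's test code; the sample strings are mine) showing how converting to a narrow Windows code page silently drops characters the target code page cannot represent:
#include <windows.h>
#include <cstdio>

int main() {
    // "é" (U+00E9) is representable in code page 1252; "Ř" (U+0158) is not.
    const wchar_t *text = L"caf\u00E9 \u0158";
    char ansi[32] = {};
    BOOL lost = FALSE;
    WideCharToMultiByte(1252,                  // target code page (WinAnsi)
                        WC_NO_BEST_FIT_CHARS,  // don't silently approximate
                        text, -1, ansi, sizeof ansi,
                        nullptr, &lost);       // default char substituted for "Ř"
    std::printf("converted: %s, characters lost: %s\n",
                ansi, lost ? "yes" : "no");
    return 0;
}
On a cp874 system the same kind of substitution happens for characters outside the Thai code page, which is what trips the assertion.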

What's the meaning of the ethereum Parity console output lines?

Parity doesn't seem to have any documentation on what its console output means. At least none that I've found, which admittedly doesn't mean a whole lot. Can anyone give me a breakdown of the meaning of the following line?
2018-03-09 00:05:12 UTC Syncing #4896969 61ee…bdad 2 blk/s 508 tx/s 16 Mgas/s 645+ 1 Qed #4897616 17/25 peers 4 MiB chain 135 MiB db 42 MiB queue 5 MiB sync RPC: 0 conn, 0 req/s, 182 µs
Thanks.
Why document when you can just read code? (bleh)
2018-03-09 00:05:12 UTC(1) Syncing #4896969(2) 61ee…bdad(3) 2 blk/s(4) 508 tx/s(5) 16 Mgas/s(6) 645+(7) 1(8) Qed #4897616(9) 17/25 peers(10) 4 MiB chain(11) 135 MiB db(12) 42 MiB queue(13) 5 MiB sync(14) RPC: 0 conn(15), 0 req/s(16), 182 µs(17)
1. Timestamp
2. Best block number (latest verified block number)
3. Best block hash
4. Blocks downloaded per second
5. Transactions downloaded per second
6. Millions of gas processed per second
7. Unverified queue size
8. Verified queue size
9. Latest block number
10. Number of active peer nodes / number of total peer nodes
11. Blockchain header cache size
12. Blockchain state cache size
13. Queue cache size
14. Node sync metadata cache size
15. Number of open RPC sessions to your node
16. RPC requests per second
17. Approximate roundtrip ping
The answer to this question is now also included in Parity's FAQ, which provides a comprehensive explanation of the different command-line outputs:
What does Parity's command line output mean?

Writing multiple files slows down after x seconds

I have code which gets frames from a camera and then saves them to disk. The structure of the code is: multiple threads malloc buffers, copy their frames into that new memory, and enqueue the buffers. Finally, another thread removes frames from the queue and writes them (using the ffmpeg API, raw video, no compression) to their files (actually I'm using my own memory pool, so malloc is only called when I need more buffers). I can have up to 8 files/cams open and enqueuing at the same time.
The problem is that for the first 45 seconds everything works fine: there's never more than one frame in the queue. But after that my queue gets backed up: processing takes just a few ms longer, and because I cannot save the frames fast enough I have to malloc more memory to store them, so RAM usage climbs.
I have an 8-core, 16 GB RAM Windows 7 64-bit computer (NTFS, lots of free space on the second disk drive). The disk is supposed to be able to write up to 6 Gbit/s. To save my data in time I need to be able to write at 50 MB/s. I tested disk speed with "PassMark PerformanceTest": with 8 threads writing files simultaneously exactly like ffmpeg saves files (synchronized, uncached I/O), it was able to achieve 100 MB/s. So why aren't my writes able to achieve that?
Here's how the ffmpeg writes look from Process Monitor logs:
Time of Day Operation File# Result Detail
2:30:32.8759350 PM WriteFile 8 SUCCESS Offset: 749,535,120, Length: 32,768
2:30:32.8759539 PM WriteFile 8 SUCCESS Offset: 749,567,888, Length: 32,768
2:30:32.8759749 PM WriteFile 8 SUCCESS Offset: 749,600,656, Length: 32,768
2:30:32.8759939 PM WriteFile 8 SUCCESS Offset: 749,633,424, Length: 32,768
2:30:32.8760314 PM WriteFile 8 SUCCESS Offset: 749,666,192, Length: 32,768
2:30:32.8760557 PM WriteFile 8 SUCCESS Offset: 749,698,960, Length: 32,768
2:30:32.8760866 PM WriteFile 8 SUCCESS Offset: 749,731,728, Length: 32,768
2:30:32.8761259 PM WriteFile 8 SUCCESS Offset: 749,764,496, Length: 32,768
2:30:32.8761452 PM WriteFile 8 SUCCESS Offset: 749,797,264, Length: 32,768
2:30:32.8761629 PM WriteFile 8 SUCCESS Offset: 749,830,032, Length: 32,768
2:30:32.8761803 PM WriteFile 8 SUCCESS Offset: 749,862,800, Length: 32,768
2:30:32.8761977 PM WriteFile 8 SUCCESS Offset: 749,895,568, Length: 32,768
2:30:32.8762235 PM WriteFile 8 SUCCESS Offset: 749,928,336, Length: 32,768, Priority: Normal
2:30:32.8762973 PM WriteFile 8 SUCCESS Offset: 749,961,104, Length: 32,768
2:30:32.8763160 PM WriteFile 8 SUCCESS Offset: 749,993,872, Length: 32,768
2:30:32.8763352 PM WriteFile 8 SUCCESS Offset: 750,026,640, Length: 32,768
2:30:32.8763502 PM WriteFile 8 SUCCESS Offset: 750,059,408, Length: 32,768
2:30:32.8763649 PM WriteFile 8 SUCCESS Offset: 750,092,176, Length: 32,768
2:30:32.8763790 PM WriteFile 8 SUCCESS Offset: 750,124,944, Length: 32,768
2:30:32.8763955 PM WriteFile 8 SUCCESS Offset: 750,157,712, Length: 32,768
2:30:32.8764072 PM WriteFile 8 SUCCESS Offset: 750,190,480, Length: 4,104
2:30:32.8848241 PM WriteFile 4 SUCCESS Offset: 750,194,584, Length: 32,768
2:30:32.8848481 PM WriteFile 4 SUCCESS Offset: 750,227,352, Length: 32,768
2:30:32.8848749 PM ReadFile 4 END OF FILE Offset: 750,256,128, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
2:30:32.8848989 PM WriteFile 4 SUCCESS Offset: 750,260,120, Length: 32,768
2:30:32.8849157 PM WriteFile 4 SUCCESS Offset: 750,292,888, Length: 32,768
2:30:32.8849319 PM WriteFile 4 SUCCESS Offset: 750,325,656, Length: 32,768
2:30:32.8849475 PM WriteFile 4 SUCCESS Offset: 750,358,424, Length: 32,768
2:30:32.8849637 PM WriteFile 4 SUCCESS Offset: 750,391,192, Length: 32,768
2:30:32.8849880 PM WriteFile 4 SUCCESS Offset: 750,423,960, Length: 32,768, Priority: Normal
2:30:32.8850400 PM WriteFile 4 SUCCESS Offset: 750,456,728, Length: 32,768
2:30:32.8850727 PM WriteFile 4 SUCCESS Offset: 750,489,496, Length: 32,768, Priority: Normal
This looks very efficient. However, according to DiskMon, the actual disk writes are ridiculously fragmented, jumping back and forth, which may account for the slow speed. See the table below for the write speed computed from this data (~5 MB/s).
Time Write duration Sector Length MB/sec
95.6 0.00208855 1490439632 896 0.409131784
95.6 0.00208855 1488197000 128 0.058447398
95.6 0.00009537 1482323640 128 1.279965529
95.6 0.00009537 1482336312 768 7.679793174
95.6 0.00009537 1482343992 384 3.839896587
95.6 0.00009537 1482350648 768 7.679793174
95.6 0.00039101 1489278984 1152 2.809730729
95.6 0.00039101 1489393672 896 2.185346123
95.6 0.0001812 1482349368 256 1.347354443
95.6 0.0001812 1482358328 896 4.715740549
95.6 0.0001812 1482370616 640 3.368386107
95.6 0.0001812 1482378040 256 1.347354443
95.6 0.00208855 1488197128 384 0.175342193
95.6 0.00208855 1488202512 640 0.292236989
95.6 0.00208855 1488210320 1024 0.467579182
95.6 0.00009537 1482351416 256 2.559931058
95.6 0.00009537 1482360120 896 8.959758703
95.6 0.00009537 1482371896 640 6.399827645
95.6 0.00009537 1482380088 256 2.559931058
95.7 0.00039101 1489394568 1152 2.809730729
95.7 0.00039101 1489396744 352 0.858528834
95.7 0.00039101 1489507944 544 1.326817289
95.7 0.0001812 1482378296 768 4.042063328
95.7 0.0001812 1482392120 768 4.042063328
95.7 0.0001812 1482400568 512 2.694708885
95.7 0.00208855 1488224144 768 0.350684386
95.7 0.00208855 1488232208 384 0.175342193
I'm pretty confident it's not my code, because I timed everything, and for example enqueuing takes a few µs, suggesting that threads don't get stuck waiting for each other. It must be the disk writes. So the question is: how can I improve my disk writes, and what can I do to profile the actual disk writes (remember that I rely on the FFmpeg DLLs to save, so I cannot access the low-level writing functions directly)? If I cannot figure it out, I'll dump all the frames into a single sequential binary file (which should increase I/O speed) and then split it into video files in post-processing.
I don't know how much my disk I/O is caching (CacheSet only shows the disk C cache size), but the following image from the performance monitor, taken at 0 and 45 seconds into the video (just before my queue starts piling up), looks weird to me. Basically, the modified set and standby set grew from very little to a large value. Is that the data being cached? Is it possible that data only starts being written to disk at 45 seconds, so that suddenly everything slows down?
(FYI, LabVIEW is the program that loads my dll.)
I'd appreciate any help.
M.
With CreateFile it looks like you want one or both of these parameters:
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
http://msdn.microsoft.com/en-us/library/cc644950(v=vs.85).aspx
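For reference, a minimal sketch of opening a file with those flags (the path is a placeholder; note that FILE_FLAG_NO_BUFFERING additionally requires each write to be aligned to, and a multiple of, the sector size):
#include <windows.h>

int main() {
    // Placeholder path; adjust for your capture files.
    HANDLE h = CreateFileW(L"D:\\capture\\cam0.raw",
                           GENERIC_WRITE,
                           0,                       // no sharing
                           nullptr,                 // default security
                           CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    // ... WriteFile() calls must use sector-aligned buffers and sizes ...
    CloseHandle(h);
    return 0;
}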
Your delayed performance hit occurs when the OS starts pushing data to the disk.
6 Gbit/s is the performance capability of the SATA 3 bus, not of the actual devices connected or the physical platters or flash underneath.
A common problem with AV systems is that a constant high-rate stream of writes can be periodically interrupted by disk overhead tasks. There used to be special AV disks you could purchase that don't do this; these days you can purchase disks with special high-throughput firmware explicitly for security video recording.
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=210671&NewLang=en
The problem is the repeated malloc and free, which puts a load on the system. I suggest creating a buffer pool, i.e. allocating N buffers in the initialization stage and reusing them instead of mallocing and freeing the memory. Since you mentioned ffmpeg, to give an example from multimedia: in GStreamer, buffer management occurs in the form of buffer pools, and in a GStreamer pipeline buffers are usually taken from and passed around via buffer pools. Most multimedia systems do this.
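A minimal C++ sketch of such a pool (the buffer count/size and the nullptr-on-exhaustion policy are illustrative choices, not from the question):
#include <cstddef>
#include <mutex>
#include <vector>

// Fixed pool of equally sized buffers, allocated once at startup.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t size) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(new char[size]);
    }
    ~BufferPool() {                  // all buffers must be released before this
        for (char *b : free_) delete[] b;
    }
    // Returns nullptr when the pool is exhausted instead of allocating more,
    // so the caller can decide to wait or to drop a frame.
    char *acquire() {
        std::lock_guard<std::mutex> lock(m_);
        if (free_.empty()) return nullptr;
        char *b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(char *b) {
        std::lock_guard<std::mutex> lock(m_);
        free_.push_back(b);
    }
private:
    std::vector<char *> free_;
    std::mutex m_;
};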
Regarding:
The problem is that for the first 45 sec everything works fine: there's never more than one frame on queue. But after that my queue gets backed up, processing takes just a few ms longer resulting in increased ram usage because I cannot save the frames fast enough so I have to malloc more memory to store them.
The application is thrashing at this point, and calling malloc will make matters even worse. I suggest implementing a producer-consumer model, where one side waits depending on the case. In your case, set up a threshold of N buffers: if there are already N buffers in the queue, new frames from the camera are not enqueued until the existing buffers have been processed (see the sketch below).
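A sketch of that threshold in C++: a bounded queue whose push() blocks once `limit` frames are waiting, so the camera threads back off instead of allocating more memory (the names and the blocking policy are mine):
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Bounded queue: producers block once `limit` items are waiting, which caps
// memory use instead of letting the backlog grow without bound.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t limit) : limit_(limit) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < limit_; });
        q_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return item;
    }

private:
    std::size_t limit_;
    std::deque<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};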
Another idea: instead of writing raw frames, why not write encoded data? Assuming you want video, you can at least write an elementary H.264 stream (and ffmpeg comes with a good H.264 encoder!) or, even better, if you have access to an MPEG-4 muxer, an mp4 file. This will reduce the memory requirements and the I/O load dramatically.
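A condensed sketch in the spirit of FFmpeg's own encode_video example, using the send/receive encoding API of newer FFmpeg releases (the question predates that exact API; resolution, frame rate and file name are placeholders):
extern "C" {
#include <libavcodec/avcodec.h>
}
#include <cstdio>

int main() {
    const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
    if (!codec) return 1;                      // needs an FFmpeg built with an H.264 encoder
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    ctx->width = 1280;                         // placeholder camera geometry
    ctx->height = 720;
    ctx->time_base = AVRational{1, 30};
    ctx->framerate = AVRational{30, 1};
    ctx->pix_fmt = AV_PIX_FMT_YUV420P;
    if (avcodec_open2(ctx, codec, nullptr) < 0) return 1;

    std::FILE *out = std::fopen("cam0.h264", "wb");
    AVFrame *frame = av_frame_alloc();
    frame->format = ctx->pix_fmt;
    frame->width = ctx->width;
    frame->height = ctx->height;
    av_frame_get_buffer(frame, 0);
    AVPacket *pkt = av_packet_alloc();

    for (int i = 0; i < 90; ++i) {             // 3 s of frames, for illustration
        av_frame_make_writable(frame);
        // ... copy the captured image into frame->data[0..2] here ...
        frame->pts = i;
        avcodec_send_frame(ctx, frame);
        while (avcodec_receive_packet(ctx, pkt) == 0) {
            std::fwrite(pkt->data, 1, pkt->size, out);  // raw H.264 elementary stream
            av_packet_unref(pkt);
        }
    }
    avcodec_send_frame(ctx, nullptr);          // flush delayed packets
    while (avcodec_receive_packet(ctx, pkt) == 0) {
        std::fwrite(pkt->data, 1, pkt->size, out);
        av_packet_unref(pkt);
    }

    std::fclose(out);
    av_packet_free(&pkt);
    av_frame_free(&frame);
    avcodec_free_context(&ctx);
    return 0;
}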
I am copying a directory with 100,000+ files to an exFAT disk. The copy starts at around 60 MB/s and degrades to about 1-2 MB/s. For me the quick & dirty fix is to put the computer to sleep during the operation and then wake it up. The original speed returns instantly. Not very sophisticated, but it works for me.
I think it's ultimately a consequence of using an "unwise" write mode.
Why does it work with 8 threads writing in your synthetic test? Well, because in that case, little does it matter if a thread blocks. You're pushing data towards the drive full throttle. It's not surprising that you get 100 MB/s that way, which is roughly the reported real-life speed of the drive that you named.
What happens in the "real" program? You have 8 producers and a single consumer that pushes data to disk. You use non-buffered, synchronized writes, which means that for as long as it takes the drive to receive all the data, your writer thread blocks and does nothing. Producers keep producing. Well, that is fine, says you. You only need 50 MB/s or so, and the drive can do 100 MB/s.
Yes, except... that's the maximum with several threads hammering the drive concurrently. You have zero concurrency here, so you don't even get those 100 MB/s. Or maybe you do, for a while.
And then, the drive needs to do something. Something? Something, anything, whatever; say, a seek. Or you get scheduled out for an instant, or for whatever reason the controller or OS doesn't push data for a tenth of a millisecond, and when you next look, the disk has spun on, so you need to wait one full rotation (no joke, that happens!). Maybe there are just one or two sectors in the way on a fragmented disk, who knows. It doesn't matter what... something happens.
This takes, say, anywhere from one to ten milliseconds, and your thread does nothing during that time, it's still waiting for the blocking WriteFile call to finish already. Producers keep producing and keep pulling memory blocks from the pool. The pool runs out of memory blocks, and allocates some more. Producers keep producing, and producing. Meanwhile, your writer thread keeps the buffer locked, and still does nothing.
See where this is going? That cannot be sustained forever. Even if it temporarily recovers, it's always fighting uphill. And losing.
Caching (also write caching) doesn't exist for no reason. Things in the real world do not always go as fast as we wish, both in terms of bandwidth and latency.
On the internet, we can easily crank up bandwidth to gigabit uplinks, but we are bound by latency, and there is absolutely nothing that we can do about it. That's why trans-atlantic communications still suck like they did 30 years ago, and they will still suck in 30 years (the speed of light very likely won't change).
Inside a computer, things are very similar. You can easily push two-digit gigabytes per second over a PCIe bus, but doing so in the first place takes a painfully long time. Graphics programmers know that only too well.
The same applies to disks, and it is no surprise that exactly the same strategy is applied to solve the issue: Overlap transfers, do them asynchronously. That's what virtually every operating system does by default. Data is copied to buffers, and written out lazily, and asynchronously.
That, however, is exactly what you disable by using unbuffered writes. Which is almost always (with very, very few singular exceptions) a bad idea.
For some reason, people have kept around this mindset of DMA doing magic things and being so totally superior speed-wise. Which, maybe, was even true at some point in the distant past, and maybe still is true for some very rare corner cases (e.g. read-once from optical disks). In all other cases, it's a disastrous anti-optimization.
Things that could run in parallel run sequentially. Latencies add up. Threads block, and do nothing.
Simply using "normal" writes means data is copied to buffers, and WriteFile returns almost instantly (well, with a delay equivalent to a memcpy). The drive never (well, hopefully, can never be 100% sure) starves, and your writer thread never blocks, doing nothing. Memory blocks don't stack up in the queue, and pools don't run out, they don't need to allocate more.

Jython + Django not ready for production?

So recently I was playing around with Django on the Jython platform and wanted to see its performance in "production". The site I tested with was just a simple return HttpResponse("Time %.2f" % time.time()) view, so no database involved.
I tried the following two combinations (measurements done with ab -c15 -n500 -k <url>, everything in Ubuntu Server 10.10 on VirtualBox):
J2EE application server (Tomcat/Glassfish), deployed WAR file
I get results like
Requests per second: 143.50 [#/sec] (mean)
[...]
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 16
90% 31
95% 31
98% 641
99% 3219
100% 3219 (longest request)
Obviously, the server hangs for a few seconds once in a while, which is not acceptable. I assume it has something to do with reloading Jython because starting the jython shell takes about 3 seconds, too.
AJP serving using patched flup package (+ Apache as frontend)
Note: flup is the package used by manage.py runfcgi; I had to patch it because flup's threading/forking support doesn't seem to work on Jython (so AJP was the only working method).
Almost the same results here, but sometimes the last 100 requests don't even get answered at all (but server process still alive).
I'm asking this on SO (instead of serverfault) because it's very Django/Jython-specific. Does anyone have experience with deploying Django sites on Jython? Is there maybe another (faster) way to serve the site? Or is it just too early to use Django on the Java platform?
Since nobody replied, I investigated a bit more, and it seems my problem might have to do with VirtualBox. Using different server OSes (Debian Squeeze, Ubuntu Server), I had similar problems. For example, with simple static file serving, I got this result from the Apache web server (on Debian):
> ab -c50 -n1000 http://ip.of.my.vm/some/static/file.css
Requests per second: 91.95 [#/sec] (mean) <--- quite impossible for static serving
[...]
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 22.1 0 688
Processing: 0 206 991.4 31 9188
Waiting: 0 96 401.2 16 3031
Total: 0 208 991.7 31 9203
Percentage of the requests served within a certain time (ms)
50% 31
66% 47
75% 63
80% 78
90% 156
95% 781
98% 844
99% 9141 <--- !!!!!!
100% 9203 (longest request)
This leads me to think that the Java reloading might not be the problem here, but rather the virtualization. I will try it on a real host and leave this question unanswered till then.
FOLLOWUP
Now I have successfully tested a bare-bones Django site (really just the welcome page) using Jython + AJP over TCP/mod_proxy_ajp on Apache (again with the patched flup package), this time on a real host (i7 920, 6 GB RAM). The results proved that my assumption above was correct and that I really should never benchmark on a virtual host again. Here's the result for the welcome page:
Document Path: /jython-test/
Document Length: 2059 bytes
Concurrency Level: 40
Time taken for tests: 24.688 seconds
Complete requests: 20000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 43640000 bytes
HTML transferred: 41180000 bytes
Requests per second: 810.11 [#/sec] (mean)
Time per request: 49.376 [ms] (mean)
Time per request: 1.234 [ms] (mean, across all concurrent requests)
Transfer rate: 1726.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 1.5 0 20
Processing: 2 49 16.5 44 255
Waiting: 0 48 16.5 44 255
Total: 2 49 16.5 45 256
Percentage of the requests served within a certain time (ms)
50% 45
66% 48
75% 51
80% 53
90% 69
95% 80
98% 90
99% 97
100% 256 (longest request) # <-- no multiple seconds of waiting anymore
Very promising, I would say. The only downside is that the average request time is > 40 ms whereas the development server has a mean of < 3 ms.

What exactly does C++ profiling (google cpu perf tools) measure?

I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using `ofstream`. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The _write$UNIX2003 (where can I find out what this is?) appears to be coming from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and _write$UNIX2003 goes away. However, my application does not speed up, as measured by total time. The whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except I see no visible increase in performance (0.1 second on a 10-second calculation). The code is essentially:
std::ofstream outfile("out.txt");
for (/* each step */) {
    SliceStep::DoStep();
    outfile << result;  // write this step's result
}
outfile.close();
Update: I'm timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another simple tool: the time command. This should give you a rough idea of where your time is spent.
If the results are still not conclusive you need a better testcase:
Use a larger problem
Do a warmup before measuring: run some loops first and start any measurement afterwards (in the same process), e.g. the skeleton below.
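A minimal warmup-then-measure skeleton (the dummy do_step() is a stand-in for the real calculation):
#include <chrono>
#include <cstdio>

volatile double sink = 0;                       // dummy work, stands in for DoStep
void do_step() { for (int i = 0; i < 1000; ++i) sink += i * 0.5; }

int main() {
    for (int i = 0; i < 100; ++i) do_step();    // warmup: not measured

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; ++i) do_step();  // measured region only
    auto stop = std::chrono::steady_clock::now();

    std::printf("%.3f s\n",
                std::chrono::duration<double>(stop - start).count());
}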
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines to the console using python results in something like:
for i in xrange(100000):
    print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python test.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 probably refers to the POSIX write system call, which does the actual output (to your file, in this case). I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
I'm not sure why your program doesn't speed up when you remove the output, but I can't really make a guess based only on the information you've given. It would be nice to see some of the code, or even the perftools output when the output statements are removed.
Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the doc, you can display the call graph at statement or address granularity. That should tell you what you need to know.
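If it helps, you can also narrow the samples to just the region you care about by driving the profiler explicitly from the code (gperftools' CPU-profiler API; link with -lprofiler; the function name is a stand-in):
#include <gperftools/profiler.h>

// Stand-in for the real work; replace with the SliceStep loop.
void run_calculation() {}

int main() {
    ProfilerStart("slicestep.prof");  // samples collected from here...
    run_calculation();
    ProfilerStop();                   // ...to here; then inspect with pprof
    return 0;
}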