epoll vs select for very small number of connections - c++

I have been using select to handle connections; recently there was a change in our socket library and select was replaced by epoll for the Linux platform.
My application architecture is such that I make only one or at most two socket connections and epoll/select on them in a single thread.
Now, with the recent switch to epoll, I noticed that the performance of the application has diminished. I was actually surprised, as I was expecting performance to go up or remain the same. I tried looking at various other parts and this is the only piece of code that has changed.
Does epoll have a performance penalty in terms of speed if used for a very small number of sockets (like 1 or 2)?
Another thing to note: I run around 125 such processes on the same box (8 CPU cores).
Could it be that too many processes doing epoll_wait on the same machine is the problem? The setup was similar when I was using select.
I noticed on the box that the load average is much higher but the CPU usage is about the same, which makes me think that more time is spent in I/O, probably coming from the epoll-related changes.
Any ideas/pointers on what I should look at to identify the problem?
Although the absolute latency increase is quite small (around 1 millisecond on average), this is a realtime system and such latencies are generally unacceptable.
Thanks
Updating this question with the latest findings: apart from switching from select to epoll, I found another related change. Earlier, the timeout used with select was 10 milliseconds, but with epoll the timeout is much smaller (on the order of 1 microsecond). Can setting too low a timeout in select or epoll decrease performance in any way?
thanks

From the sounds of it, throughput may be unaffected with epoll() vs select(), but you're finding extra latency in individual requests that seems to be related to the use of epoll().
I think that in the case of watching only one or two sockets, epoll() should perform much like select(). epoll() is designed to scale well as you watch more descriptors, whereas select() scales badly (and has a hard limit, FD_SETSIZE, on the number of descriptors). So it's not that epoll() has a penalty for a small number of descriptors; it simply loses its performance advantage over select() in this case.
Can you change the code so you can easily go back and forth between the two event notification mechanisms? Get more data about the performance difference. If you conclusively find that select() gives lower latency and the same throughput in your situation, then I'd just switch back to the "old and deprecated" API without hesitation :) To me it's fairly conclusive if you measure a performance difference from this specific code change. Perhaps previous testing of epoll() versus select() has focused on throughput rather than the latency of individual requests?
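To make that comparison easy, here is a minimal sketch (Linux-only, one descriptor; wait_readable and use_epoll are names I made up) of a wait call you can flip between the two mechanisms. One thing worth checking given your update: epoll_wait()'s timeout is a whole number of milliseconds, so a "1 microsecond" timeout becomes 0 and degenerates into a busy poll, whereas select()'s struct timeval gives you microsecond granularity.

#include <sys/epoll.h>
#include <sys/select.h>
#include <unistd.h>

// Returns >0 if fd is readable, 0 on timeout, -1 on error.
int wait_readable(int fd, int timeout_ms, bool use_epoll)
{
    if (use_epoll) {
        int ep = epoll_create1(0);
        struct epoll_event ev = {};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
        struct epoll_event out;
        int n = epoll_wait(ep, &out, 1, timeout_ms);   // timeout in whole milliseconds
        close(ep);                                     // for brevity only; real code keeps ep around
        return n;
    } else {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
        return select(fd + 1, &rfds, NULL, NULL, &tv); // microsecond-granularity timeout
    }
}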

Related

How to do something every millisecond or better on Windows

This question is not about timing something accurately on Windows (XP or better), but rather about doing something very rapidly via callback or interrupt.
I need to be doing something regularly every 1 millisecond, or preferably even every 100 microseconds. What I need to do is drive some asynchronous hardware (Ethernet) at this rate to output a steady stream of packets to the network, and make that stream appear as regular and synchronous as possible. But if the question can be separated from the (Ethernet) device, it would be good to know the general answer.
Before you say "don't even think about using Windows!!!!", a little context. Not all real-time systems have the same demands. Most of the time songs and video play acceptably on Windows despite needing to handle blocks of audio or images every 10-16ms or so on average. With appropriate buffering, Windows can have its variable latencies, but the hardware can be broadly immune to them, and keep a steady synchronous stream of events happening. Even so, most of us tolerate the occasional glitch. My application is like that - probably quite tolerant.
The expensive option for me is to port my entire application to Linux. But Linux is simply different software running on the same hardware, so my strong preference is to write some better software, and stick with Windows. I have the luxury of being able to eliminate all competing hardware and software (no internet or other network access, no other applications running, etc). Do I have any prospect of getting Windows to do this? What limitations will I run into?
I am aware that my target hardware has a High Performance Event Timer, and that this timer can be programmed to interrupt, but that there is no driver for it. Can I write one? Are there useful examples out there? I have not found one yet. Would this interfere with QueryPerformanceCounter? Does the fact that I'm going to be using an ethernet device mean that it all becomes simple if I use select() judiciously?
Pointers to useful articles welcomed - I have found dozens of overviews on how to get accurate times, but none yet on how to do something like this other than by using what amounts to a busy wait. Is there a way to avoid a busy wait? Is there a kernel mode or device driver option?
You should consider looking at the Multimedia Timers. These are timers that are intended to provide the sort of resolution you are looking for.
Have a look here on MSDN.
I did this using DirectX 9, using the QueryPerformanceCounter, but you will need to hog at least one core, as task switching will mess you up.
For a good comparison of timers you can look at
http://www.geisswerks.com/ryan/FAQS/timing.html
If you run into timer granularity issues, I would suggest using good old Sleep() with a spin loop. Essentially, the code should do something like:
#include <windows.h>
#include <cstdint>

// Current time in microseconds via the high-resolution performance counter.
uint64_t NowMicros()
{
    LARGE_INTEGER f, t;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t);
    return (uint64_t)t.QuadPart * 1000000ULL / (uint64_t)f.QuadPart;
}

void PrecisionSleep(uint64_t microSec)
{
    const uint64_t start_time = NowMicros();
    // Sleep away the whole 10 ms intervals (Sleep() takes milliseconds).
    Sleep((DWORD)(10 * (microSec / 10000)));
    // Spin loop to spend the rest of the time in.
    while (NowMicros() - start_time < microSec)
    {}
}
This way, you get a high-precision sleep that won't tax your CPU much, as long as most of the sleeps are longer than the scheduling granularity (assumed to be 10 ms here). You can send your packets in a loop while using the high-precision sleep to time them.
The reason audio works fine on most systems is that the audio device has its own clock. You just buffer the audio data to it and it takes care of playing it and interrupts the program when the buffer is empty. In fact, a time skew between the audio card clock and the CPU clock can cause problems if a playback engine relies on the CPU clock.
EDIT:
You can make a timer abstraction out of this by using a thread that maintains a lock-protected min-heap of timed entries (the heap comparison is done on the expiry timestamp), and then invokes a callback() or SetEvent() whenever the PrecisionSleep() to the next timestamp completes.
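A rough sketch of that abstraction, assuming the NowMicros()/PrecisionSleep() helpers from the snippet above (TimerEntry and timer_loop are names I picked for illustration; a real implementation would block on a condition variable instead of spinning when the heap is empty):

#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

extern uint64_t NowMicros();            // from the snippet above
extern void PrecisionSleep(uint64_t);   // from the snippet above

struct TimerEntry {
    uint64_t expiry_us;                 // absolute expiry timestamp, microseconds
    std::function<void()> callback;     // or an event handle to SetEvent()
    bool operator>(const TimerEntry& o) const { return expiry_us > o.expiry_us; }
};

// Min-heap on expiry timestamp, protected by a lock.
std::priority_queue<TimerEntry, std::vector<TimerEntry>, std::greater<TimerEntry>> g_timers;
std::mutex g_timers_lock;

void timer_loop()
{
    for (;;) {
        TimerEntry next;
        {
            std::lock_guard<std::mutex> lock(g_timers_lock);
            if (g_timers.empty()) continue;   // real code: wait on a condition variable
            next = g_timers.top();
            g_timers.pop();
        }
        uint64_t now = NowMicros();
        if (next.expiry_us > now)
            PrecisionSleep(next.expiry_us - now);
        next.callback();                      // or SetEvent() on the entry's event
    }
}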
Use NtSetTimerResolution when the program starts up to set the timer resolution. Yes, it is an undocumented function, but it works well. You may also use NtQueryTimerResolution to read the timer resolution (before and after setting the new resolution, to be sure the change took effect).
You need to get the addresses of these functions dynamically using GetProcAddress on NTDLL.DLL, as they are not declared in any header or LIB file.
Setting the timer resolution this way affects Sleep, Windows timers, functions that return the current time, etc.
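For reference, a sketch of the dynamic lookup. The Nt* signatures below follow the commonly cited unofficial documentation (the resolution values are in 100-nanosecond units, so 10000 means 1 ms); since they are undocumented, verify them on your target before relying on this:

#include <windows.h>
#include <winternl.h>   // NTSTATUS

typedef NTSTATUS (NTAPI *NtSetTimerResolution_t)(ULONG DesiredResolution,
                                                 BOOLEAN SetResolution,
                                                 PULONG CurrentResolution);
typedef NTSTATUS (NTAPI *NtQueryTimerResolution_t)(PULONG MinimumResolution,
                                                   PULONG MaximumResolution,
                                                   PULONG CurrentResolution);

void RaiseTimerResolutionTo1ms()
{
    HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");
    if (!ntdll) return;
    auto query = (NtQueryTimerResolution_t)GetProcAddress(ntdll, "NtQueryTimerResolution");
    auto set   = (NtSetTimerResolution_t)GetProcAddress(ntdll, "NtSetTimerResolution");
    if (!query || !set) return;

    ULONG min = 0, max = 0, cur = 0;
    query(&min, &max, &cur);   // resolution before the change
    set(10000, TRUE, &cur);    // request 1 ms (10000 x 100 ns)
    query(&min, &max, &cur);   // confirm what the kernel actually granted
}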

Where's the balance between thread amount and thread block times?

Elongated question:
When there are more blocking threads than CPU cores, where is the balance between thread count and thread block times that maximizes CPU efficiency by reducing context-switch overhead?
I have a wide variety of IO devices that I need to control on Windows 7, with an x64 multi-core processor: PCI devices, network devices, data being saved to hard drives, big chunks of data being copied, and so on. The most common policy is: "Put a thread on it!". Several dozen threads later, this is starting to feel like a bad idea.
None of my cores is being used 100%, and several cores are still idling, but there are delays showing up in the range of 10 to 100 ms that cannot be explained by IO blockage or CPU-intensive usage. Other processes don't seem to require the resources either. I suspect context-switch overhead.
There are a couple of possible solutions I can think of:
Reduce the thread count by bundling work for the same IO device: this mainly applies to the hard drive, but maybe to the network as well. If I'm saving 20 MB to the hard drive in one thread and 10 MB in another, wouldn't it be better to post it all to the same thread? How would this work in the case of multiple hard drives?
Reduce the thread count by bundling similar IO devices, and increase their priority: dozens of threads with increased priority would probably make my user-interface thread stutter, but I could bundle all that functionality into one or a few threads and increase their priority.
Any case studies tackling similar problems are much appreciated.
First, it sounds like these tasks should be performed using asynchronous I/O (IO Completion Ports, preferably), rather than with separate threads. Blocking threads are generally the wrong way to do I/O.
Second, blocked threads shouldn't affect context switching. The scheduler has to juggle all the active threads, and so, having a lot of threads running (not blocked) might slow down context switching a bit. But as long as most of your threads are blocked, they shouldn't affect the ones that aren't.
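For illustration, here is a bare-bones sketch of that IO completion port pattern (error handling omitted; IoWorker and StartIoPool are just names for the sketch): one port is shared by all device/file/socket handles, and a small pool of worker threads services the completions instead of one blocking thread per device.

#include <windows.h>

DWORD WINAPI IoWorker(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;           // identifies which device/handle completed
        OVERLAPPED* ov = nullptr;    // identifies which operation completed
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
            continue;
        // Dispatch on (key, ov): hand the completed buffer to the next stage,
        // then issue the next overlapped ReadFile/WriteFile/WSARecv.
    }
    return 0;
}

HANDLE StartIoPool(int workers)
{
    // One completion port shared by all handles; after opening each device,
    // associate it with CreateIoCompletionPort(deviceHandle, iocp, key, 0).
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    for (int i = 0; i < workers; ++i)
        CreateThread(NULL, 0, IoWorker, iocp, 0, NULL);
    return iocp;
}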
10-100ms with some cores idle: it's not context-switching overhead in itself since a switch is orders of magnitude faster than these delays, even with a core swap and cache flush.
Async I/O would not help much here. The kernel thread pools that implement ASIO also have to be scheduled/swapped, albeit this is faster than user-space threads since there are fewer Wagnerian ring-cycles. I would certainly head for ASIO if the CPU loading was becoming an issue, but it's not.
You are not short of CPU, so what is it? Is there much thrashing - RAM shortage? Excessive paging can surely result in large delays. Where is your page file? I've shoved mine off Drive C onto another fast SATA drive.
PCI bandwidth? You got a couple of TV cards in there?
Disk controller flushing activity - have you got an SSD that's approaching capacity? That's always a good one for unexplained pauses. I get the odd pause even though my 128G SSD is only 2/3 full.
I've never had a problem specifically related to context-swap time, and I've been writing multithreaded apps for decades. Windows schedules and dispatches the ready threads onto cores reasonably quickly. 'Several dozen threads' in itself (i.e. not all running!) is not remotely a problem. Looking now at my Task Manager performance tab, I have 1213 threads loaded and no performance issues at all with ~6% CPU usage (app under test running in the background, BitTorrent, etc.). Firefox has 30 threads, VLC media player 27, my test app 23. No problem at all writing this post.
Given your issue of 10-100ms delays, I would be amazed if fiddling with thread priorities and/or changing the way your work is loaded onto threads provides any improvement - something else is stuffing your system, (you haven't got any drivers that I coded, have you? :).
Does perfmon give any clues?
Rgds,
Martin
I don't think that there is a conclusive answer, and it probably depends on your OS as well; some handle threads better than others. Still, delays in the 10 to 100 ms range are not due to context switching itself (although they could be due to characteristics of the scheduling algorithm). My experience under Windows is that I/O is very inefficient, and if you're doing I/O of any type, you will block. And I/O by one process or thread will end up blocking other processes or threads. (Under Windows, for example, there's probably no point in having more than one thread handle the hard drive. You can't read or write several sectors at the same time, and my impression is that Windows doesn't optimize accesses like some other systems do.)
With regard to your exact questions:
"If I'm saving 20MB to the hard drive in one thread, and 10MB in the other, wouldn't it be better to post it all to the same?": It depends on the OS. Normally, there should be no reduction in time or latency from using separate threads, and depending on other activity and the OS, there could even be an improvement. (If there are several disk requests pending at once, most OSs will optimize the accesses, reordering the requests to reduce head movement.) The simplest solution would be to try both and see which works better on your system.
"How would this work in case of multiple hard drives?": The OS should be able to do the I/O in parallel if the requests are to different drives.
With regard to increasing the priority of one or more threads, it's very OS dependent, but probably worth trying. Unless significant CPU time is used in the higher-priority threads, it shouldn't impact the user interface; these threads are mostly blocked for I/O, remember.
Well, my Windows 7 is currently running 950 threads. I don't think that adding another few dozen would make a significant difference. However, you should definitely be looking at a thread pool or other work-stealing device for this - you shouldn't create new threads just to let them block. If Windows provides asynchronous I/O by default, then use it.
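If you only want to stop creating a dedicated thread per task, the simplest built-in option is QueueUserWorkItem(); a tiny sketch (CopyChunk and SubmitCopy are placeholder names, error handling omitted):

#include <windows.h>

DWORD WINAPI CopyChunk(LPVOID ctx)
{
    // ... perform one bounded piece of I/O or copying described by 'ctx' ...
    return 0;
}

void SubmitCopy(void* ctx)
{
    // WT_EXECUTELONGFUNCTION hints that the work item may run for a while,
    // so the pool can grow instead of starving other items.
    QueueUserWorkItem(CopyChunk, ctx, WT_EXECUTELONGFUNCTION);
}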

Unexpected Socket CPU Utilization

I'm having a performance issue that I don't understand. The system I'm working on has two threads that look something like this:
Version A:
Thread 1: Data Processing -> Data Selection -> Data Formatting -> FIFO
Thread 2: FIFO -> Socket
Where 'Selection' thins down the data and the FIFO at the end of thread 1 is the FIFO at the beginning of thread 2 (the FIFOs are actually TBB Concurrent Queues). For performance reasons, I've altered the threads to look like this:
Version B:
Thread 1: Data Processing -> Data Selection -> FIFO
Thread 2: FIFO -> Data Formatting -> Socket
Initially, this optimization proved to be successful. Thread 1 is capable of much higher throughput. I didn't look too hard at Thread 2's performance because I expected the CPU usage would be higher and (due to data thinning) it wasn't a major concern. However, one of my colleagues asked for a performance comparison of version A and version B. To test the setup I had thread 2's socket (a boost asio tcp socket) write to an instance of iperf on the same box (127.0.0.1) with the goal of showing the maximum throughput.
To compare the two setups, I first tried forcing the system to write data out of the socket at 500 Mbps. As part of the performance testing I monitored top. What I saw surprised me: version A did not show up in 'top -H', nor did iperf (this was actually as expected). However, version B (my 'enhanced' version) was showing up in 'top -H' with ~10% CPU utilization and (oddly) iperf was showing up with 8%.
Obviously, that implied to me that I was doing something wrong. I can't seem to prove that I am though! Things I've confirmed:
Both versions are giving the socket 32k chunks of data
Both versions are using the same boost library (1.45)
Both have the same optimization setting (-O3)
Both receive the exact same data, write out the same data, and write it at the same rate.
Both use the same blocking write call.
I'm testing from the same box with the exact same setup (Red Hat)
The 'formatting' part of thread 2 is not the issue (I removed it and reproduced the problem)
Small packets across the network is not the issue (I'm using TCP_CORK and I've confirmed via wireshark that the TCP Segments are all ~16k).
Putting a 1 ms sleep right after the socket write makes the CPU usage on both the socket thread and iperf(?!) go back to 0%.
Poor man's profiler reveals very little (the socket thread is almost always sleeping).
Callgrind reveals very little (the socket write barely even registers)
Switching iperf for netcat (writing to /dev/null) doesn't change anything (actually netcat's cpu usage was ~20%).
The only thing I can think of is that I've introduced a tighter loop around the socket write. However, at 500 Mbps I wouldn't expect the CPU usage of both my process and iperf to increase.
I'm at a loss to why this is happening. My coworkers and I are basically out of ideas. Any thoughts or suggestions? I'll happily try anything at this point.
This is going to be very hard to analyze without code snippets or actual data quantities.
One thing that comes to mind: if the pre-formatted data stream is significantly larger than post-format, you may be expending more bandwidth/cycles copying a bunch more data through the FIFO (socket) boundary.
Try estimating or measuring the data rate at each stage. If the data rate is higher at the output of 'selection', consider the effects of moving formatting to the other side of the boundary. Is it possible that no copy is required for the select->format transition in configuration A, and configuration B imposes lots of copies?
... just guesses without more insight into the system.
What if the FIFO was the bottleneck in version A? Then both threads would sit and wait on the FIFO most of the time, while in version B you'd be handing the data off to iperf faster.
What exactly do you store in the FIFO queues? Do you store packets of data, i.e. buffers?
In version A, you were writing formatted data (probably bytes) to the queue, so sending it on the socket involved just writing out a fixed-size buffer.
However, in version B you are storing higher-level data in the queues. Formatting it now creates bigger buffers that are written directly to the socket. This causes the TCP/IP stack to spend CPU cycles on fragmentation and other overhead...
This is my theory based on what you have said so far.
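To make the discussion concrete, here is roughly what I understand the version B consumer thread to look like; this is only a sketch, and Message/format_message are placeholders rather than the OP's actual types:

#include <boost/asio.hpp>
#include <tbb/concurrent_queue.h>
#include <string>

struct Message { /* high-level data pushed by thread 1 */ };

// Stand-in for the formatting step: produce a 32 KB chunk for the socket.
std::string format_message(const Message&) { return std::string(32 * 1024, 'x'); }

void consumer_thread(tbb::concurrent_bounded_queue<Message>& fifo,
                     boost::asio::ip::tcp::socket& socket)
{
    Message msg;
    for (;;) {
        fifo.pop(msg);                                          // blocks until thread 1 pushes
        std::string wire = format_message(msg);                 // formatting now happens here
        boost::asio::write(socket, boost::asio::buffer(wire));  // blocking write
    }
}

If the version A loop looked the same minus the formatting step, comparing the per-iteration byte counts on each side of the FIFO should show whether extra copying explains the CPU difference.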

Resource recommendations for Windows performance tuning (realtime)

Any recommendations out there for Windows application tuning resources (books, web sites, etc.)?
I have a C++ console application that needs to feed a hardware device with a considerable amount of data at a fairly high rate. (buffer is 32K in size and gets consumed at ~800k bytes per second)
It will stream data without underruns, except when I perform file IO like opening a folder, etc. (it seems to be only marginally meeting its timing requirements).
Anyway, a good book or resource to brush up on realtime performance with Windows would be helpful.
Thanks!
The best you can hope for on commodity Windows is "usually meets timing requirements". If the system is running any processes other than your target app, it will occasionally miss deadlines due to scheduling inconsistencies. However, if your app/hardware can handle the rare but occasional misses, there are a few things you can do to reduce the number of misses.
Set your process's priority to REALTIME_PRIORITY_CLASS
Change the scheduler's granularity to 1ms resolution via the timeBeginPeriod() function (part of the Windows Multimedia libraries)
Avoid as many system calls in your main loop as possible (this includes allocating memory). Each syscall is an opportunity for the OS to put the process to sleep and, consequently, is an opportunity for the non-deterministic scheduler to miss the next deadline
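A minimal sketch of the first two points (these are documented Win32/WinMM calls; timeBeginPeriod needs winmm.lib):

#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod / timeEndPeriod (link winmm.lib)

void EnterLowLatencyMode()
{
    // Ask the scheduler for 1 ms timer granularity for the lifetime of the app
    // (pair with timeEndPeriod(1) on shutdown).
    timeBeginPeriod(1);

    // Run the whole process in the realtime priority class. Use with care:
    // a runaway loop at this priority can starve the rest of the system.
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
}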
If this doesn't get the job done for you, you might consider trying a Linux distribution with realtime kernel patches applied. I've found those to provide near-perfect timing (within 10s of microseconds accuracy over the course of several hours). That said, nothing short of a true-realtime OS will actually give you perfection but the realtime-linux distros are much closer than commodity Windows.
The first thing I would do is tune it to where it's as lean as possible. I use this method, for these reasons. Since it's a console app, another option is to try out LTProf, which will show you if there is anything you can fruitfully optimize. When that's done, you will be in the best position to look for buffer timing issues, as @Hans suggested.
Optimizing software in C++ from agner.com is a great optimization manual.
As Rakis said, you will need to be very careful in the processing loop:
No memory allocation. Use the stack and preallocated memory instead.
No throws. Exceptions are quite expensive; in Win32 they have a cost even when nothing is thrown.
No polymorphism. You will save some indirections.
Use inline extensively.
No locks. Try lock-free approaches when possible.
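To illustrate the last two points (preallocated memory, no locks), here is a minimal sketch of a single-producer/single-consumer ring buffer built only on atomics; SpscRing is my name for it, and the capacity must be a power of two for the index masking to work:

#include <array>
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& item)             // producer thread only
    {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;                // full
        buf_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& item)                    // consumer (real-time) thread only
    {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                // empty
        item = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> buf_{};             // storage preallocated up front
    std::atomic<size_t> head_{0}, tail_{0};
};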
The buffer will last for only 40 milliseconds (32 KB at ~800 KB/s). You can't guarantee zero underruns on Windows with such strict timing requirements. In user-mode land you are looking at, potentially, hundreds of milliseconds when kernel threads do what they need to do. They run at higher priorities than you can ever gain. The thread quantum on the workstation version is 3 times the clock tick, which is already beyond 40 milliseconds (3 x 15.625 msec ≈ 47 msec). You can't even reliably compete with user-mode threads that have boosted their priority and take their sweet old time.
If a bigger buffer is not an option then you are looking at a device driver to get this kind of service guarantee. Or something in between that can provide a larger buffer.

C++ Socket Server - Unable to saturate CPU

I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients, and I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance, and getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).
Details:
The server fires up one thread per core
Requests are received, parsed, processed, and responses are written out
The requests are for data, which is read out of memory (read-only for this test)
I'm 'loading' the server using two machines, each running a java application, running 25 threads, sending requests
I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)
So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
Ideas I've had:
The requests are very small, and often fulfilled in a few ms, I could modify the client to send/compose bigger requests (perhaps using batching)
I could modify the HTTP server to use the Select design pattern, is this appropriate here?
I could do some profiling to try to understand what the bottleneck is
boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).
Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.
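For anyone curious what the fully multithreaded edge-triggered approach looks like at the syscall level, here is a hedged sketch (not nginetd's actual code): the worker threads share one epoll fd, each descriptor is armed with EPOLLET | EPOLLONESHOT so only one thread wakes per socket, and the descriptor is re-armed with EPOLL_CTL_MOD once that thread has drained it.

#include <sys/epoll.h>
#include <unistd.h>

// Descriptors are assumed to have been added once with EPOLL_CTL_ADD using
// the same EPOLLIN | EPOLLET | EPOLLONESHOT flags.
void worker(int epfd)
{
    struct epoll_event ev;
    for (;;) {
        int n = epoll_wait(epfd, &ev, 1, -1);
        if (n <= 0) continue;
        int fd = ev.data.fd;

        // ... read until EAGAIN and process the request (edge-triggered!) ...

        // Re-arm the descriptor so another (or the same) thread can pick up
        // the next event for it.
        struct epoll_event rearm = {};
        rearm.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
        rearm.data.fd = fd;
        epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &rearm);
    }
}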
As you are using EC2, all bets are off.
Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.
I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.
From your comments on network utilization, you do not seem to have much network traffic: 3 + 2.5 MiB/s is in the 50 Mbps ballpark (compared to your 1 Gbps port).
I'd say you have one of the following two problems:
Insufficient workload (a low request rate from your clients)
Blocking in the server (something interfering with response generation)
Looking at cmeerw's notes and your CPU utilization figures (50% + 20% + 0% + 0%), it seems most likely a limitation in your server implementation.
I second cmeerw's answer (+1).
230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of un-needed locking may get things up to speed.
This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?
ASIO is fine for small to medium tasks, but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows, but if you are experienced you will always do better than ASIO. Either way, there is a lot of overhead with all of those methods, just more with ASIO.
For what it is worth, using raw socket calls my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It serves from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound in my app. The post by @cmeerw mostly explains why.
One way to improve performance is by implementing a UDP proxy. Intercepting HTTP requests and then routing them over UDP to your backend UDP-HTTP server you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of a HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. And FWIW You should always have a frontend-backend setup for any performant site because the frontends can cache data without slowing down the more important dynamic requests backend.
The future seems to be writing your own driver that implements its own network stack so you can get as close to the requests as possible and implement your own protocol there. Which probably isn't what most programmers want to hear as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, however you will need more servers - though if you are doing this many requests per second you will usually need multiple network cards and multiple frontends to handle the bandwidth so having a couple lightweight UDP proxies in there isn't that big a deal.
Hope some of this can be useful to you.
How many instances of io_service do you have? Boost.Asio has an example that creates an io_service per CPU and uses them in a round-robin manner.
You can still create four threads and assign one per CPU, but each thread can poll its own io_service.
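Something along these lines, sketched against the classic io_service/io_service::work API (IoServicePool and next() are names I picked; the official Boost example is structured similarly, but this is not its exact code):

#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

// One io_service (and one worker thread) per core; hand sockets out round-robin
// so each connection's handlers run on a single, private reactor.
class IoServicePool {
public:
    explicit IoServicePool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            services_.emplace_back(new boost::asio::io_service);
            work_.emplace_back(new boost::asio::io_service::work(*services_.back()));
        }
    }
    void run()
    {
        for (auto& s : services_) {
            boost::asio::io_service* svc = s.get();
            threads_.emplace_back([svc] { svc->run(); });
        }
        for (auto& t : threads_) t.join();
    }
    boost::asio::io_service& next()   // call this when accepting a new connection
    {
        boost::asio::io_service& svc = *services_[index_];
        index_ = (index_ + 1) % services_.size();
        return svc;
    }
private:
    std::vector<std::unique_ptr<boost::asio::io_service>> services_;
    std::vector<std::unique_ptr<boost::asio::io_service::work>> work_;
    std::vector<std::thread> threads_;
    std::size_t index_ = 0;
};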