epoll performance for smaller timeout values - c++

I have a single thread server process that watches few (around 100) sockets via epoll in a loop, my question is that how to decide the optimum value of epoll_wait timeout value, since this is a single threaded process, everything is triggered off epoll_wait , if there is no activity on sockets, program remains idle, my guess is that if i give too small timeout, which causes too many epoll_wait calls there is no harm because even though my process is doing too many epoll_wait calls, it would be sitting idle otherwise, but there is another point, I run many other processes on this (8 core) box, something like 100 other process which are clients to this process, I am wondering how timeout value impacts cpu context switiching, i.e if i give too small timeout which results in many epoll_wait call will my server process be put in waiting many more times vs when I give a larger timeout value which results in fewer epoll_wait calls.
any thoughts/ideas.
Thanks

I believe there is no good reason to make your process wake up if it has nothing to do. Simply set the timeout to when you first need to do something. For example, if your server has a semantic of disconnecting a client after N seconds of inactivity, set the epoll timeout to the time after the first client would have to be disconnected assuming no activity. In other words, set it to:
min{expire_time(client); for each client} - current_time
Or, if that's negative, you can disconnect at least one client immediately. In general, this works not only for disconnecting clients; you can abstract the above into "software timers" within your application.
I'm failing to see this compromise you've mentioned. If you use a timeout any smaller than you have to, you'll wake up before you have to, then, presumably, go back to sleep because you have nothing to do. What good does that do? On the other hand, you must not use a timeout any larger than what you have to - because that would make your program not respect the disconnect timeout policy.
If your program is not waiting for any time-based event (like disconnecting clients), just give epoll_wait() timeout value -1, making it wait forever.
UPDATE If you're worried that this process being given less CPU when other processes are active, just give it lower nice value (scheduler priority). On the other hand, if you're worried that your server process will be swapped out to disk in favour of other processes when it's idle, it is possible to avoid swapping it out. (or you can just lower /proc/sys/vm/swappiness, affecting all processes)

Related

closesocket() not completing pending operations of IOCP

I am currently working on a server application in C++. My main inspirations are these examples:
Windows SDK IOCP Excample
The I/O Completion Port IPv4/IPv6 Server Program Example
My app is strongly similar to these (socketobj, packageobj, ...).
In general, my app is running without issues. The only things which still causes me troubles are half open connections.
My strategy for this is: I check every connected client in a time period and count an "idle counter" up. If one completion occurs, I reset this timer. If the Idle counter goes too high, I set a boolean to prevent other threads from posting operations, and then call closesocket().
My assumption was that now the socket is closed, the pending operations will complete (maybe not instantly but after a time). This is also the behavior the MSDN documentation is describing (hints, second paragraph). I need this because only after all operations are completed can I free the resources.
Long story short: this is not the case for me. I did some tests with my testclient app and some cout and breakpoint debugging, and discovered that pending operations for closed sockets are not completing (even after waiting 10 min). I also already tried with a shutdown() call before the closesocket(), and both returned no error.
What am I doing wrong? Does this happen to anyone else? Is the MSDN documentation wrong? What are the alternatives?
I am currently thinking of the "linger" functionality, or to cancel every operation explicitly with the CancelIoEx() function
Edit: (thank you for your responses)
Yesterday evening I added a chained list for every sockedobj to hold the per io obj of the pending operations. With this I tried the CancelIOEx() function. The function returned 0 and GetLastError() returned ERROR_NOT_FOUND for most of the operations.
Is it then safe to just free the per Io Obj in this case?
I also discovered, that this is happening more often, when I run my server app and the client app on the same machine. It happens from time to time, that the server is then not able to complete write operations. I thought that this is happening because the client side receive buffer gets to full. (The client side does not stop to receive data!).
Code snipped follows as soon as possible.
The 'linger' setting can used to reset the connection, but that way you will (a) lose data and (b) deliver a reset to the peer, which may terrify it.
If you're thinking of a positive linger timeout, it doesn't really help.
Shutdown for read should terminate read operations, but shutdown for write only gets queued after pending writes so it doesn't help at all.
If pending writes are the problem, and not completing, they will have to be cancelled.

Using timers with performance-critical software (Qt)

I am developing an application that is responsible of moving and managing robots over an UDP connection.
The application needs to:
Read joystick/user input using SDL.
Generate and send a control packet to the robot every 20 milliseconds (UDP)
Receive and decode response packets from the robot (~20 msecs). This was implemented with the signal/slot mechanism and does not require a timer.
Receive and process robot messages for debugging reasons. This is not time-regulated.
Update the UI regularly to keep the user notified about the status of the robot (e.g. battery voltage). For most cases, I have also used Qt's signal/slot mechanism.
Use a watchdog that disables the robot if no response is received after 1 second. The watchdog is reset when the application receives a robot packet (~20 msecs)
For the moment, I have implemented all of the above. However, the application fails to send the packets regularly when the watchdog is activated or when two or more QTimer objects are used. The application would generally work, but I would not consider it "production ready". I have tried to use the precision flags of the timers (Qt::Precise, Qt::Coarse and Qt::VeryCoarse), but I still experienced problems.
Notes:
The code is generally well organized, there are no "god objects" in the code base (most source files are less than 150 lines long and only create the necessary dependencies).
Most of the times, I use QTimer::singleShot() (e.g. I will only send the next packet once the current packet has been sent).
Where we use timers:
To read joystick input (~50 msecs, precise timer)
To send robot packets (~20 msecs, precise timer)
To update some aspects of the UI (~500 msecs, coarse timer)
To update the elapsed time since the robot was enabled (~100 msecs, precise timer)
To implement a watchdog (put the application and robot in safe state if 1000 msecs have passed without a robot response)
Note: the watchdog is feed when we receive a response packet from the robot (~20 msecs)
Do you have any recommendations for using QTimer objects with performance-critical code (any idea is welcome). Note that I have also tried to use different threads, but it has caused me more problems, since the application would not be in "sync", thus failing to effectively control the robots that we have tested.
Actually, I seem to have underestimated Qt's timer and event loop performance. On my system I get on average around 20k nanoseconds for an event loop cycle plus the overhead from scheduling a queued function call, and a timer with interval 1 millisecond is rarely late, most of the timeouts are a few thousand nanoseconds short of a millisecond. But it is a high end system, on embedded hardware it may be a lot worse.
You should take the time and profile your target system and Qt build to determine whether it can indeed run snappy enough, and based on those measurements, adjust your timings to compensate for the system delays to get your events scheduled more on time.
You should definitely keep the timer thread as free as possible, because if you block it by IO or extensive computation, your timer will not be accurate. Use a dedicated thread to schedule work and extra worker threads to do the actual work. You may also try playing with thread priorities a bit.
Worst case scenario, look for 3rd party high performance event loop implementations or create your own and potentially, also a faster signaling mechanism as ell. As I already mentioned in the comments, Qt's inter-thread queued signals are very slow, at least compared to something like indirect function calls.
Last but not least, if you want to do task X every N units of time, it will only be only possible if task X takes N units of time or less on your system. You need to make this consideration for each task, and for all tasks running concurrently. And in order to get accurate scheduling, you should measure how long did task X took, and if less than its frequency, schedule the next execution in the time remaining, otherwise execute immediately.

QProcess ProcessState sufficient for Blocked Processes?

I want to know if a process (started with a QProcess class) doesn't respond anymore. For instance, my process is an application that only prints 1 every seconds.
My problem is that I want to know if (for some mystical reason), that process is blocked for a short period of time (more than 1 second, something noticeable by a human).
However, the different states of a QProcess (Not Running, Starting, Running) don't include a "Blocked" state.
I mean blocked as "Don't Answer to the OS" when we got the "Non Responding" message in the Task Manager. Such as when a Windows MMI (like explorer.exe) is blocked and becomes white.
But : I want to detect that "Not Responding" state for ANY processes. Not just MMI.
Is there a way to detect such a state ?
Qt doesn't provide any api for that. You'd need to use platform-specific mechanisms. On some platforms (Windows!), there is no notion of a hung application, merely that of a hung window. You can have one application that has both responsive and unresponsive windows :)
On Windows, you'd enumerate all windows using EnumWindows, check if they belong to your process by comparing the pid from GetWindowThreadProcessId to process->pid(), and finally checking if the window is hung through IsHungAppWindow.
Caveats
Generally, there's is no such thing as an all-encompassing notion of a "non responding" process.
Suppose you have a web server. What does it mean that it's not responding? It's under heavy load, so it may deny some incoming connections. Is that "non responding" from your perspective? It may be, but there's nothing you can do about it - killing and restarting the process won't fix it. If anything, it will make things worse for the already connected clients.
Suppose you have a process that is blocking on a filesystem read because the particular drive it tries to access is slow, or under heavy load. Does it mean that it's not responding? Will killing and restarting it always fix this? If the process then retries the read from the beginning of the file, it may well make things worse.
Suppose you have a poorly designed process with a GUI. It's doing blocking serial port reads in the GUI thread. The read it's doing takes long time, and the GUI is nonresponsive for several seconds. You kill the process, it restarts and tries that long read again - you've only made things worse.
You have to tread very carefully here.
Solution Ideas
There are multiple approaches to determining what is a "responsive" process. It was already mentioned that processes with a GUI are monitored by the operating system on both Windows and OS X. Thus one can use native APIs that can query whether a window or a process is hung or not. This makes sense for applications that offer a UI, and subject to caveats above.
If the process is providing a service, you may periodically use the service to determine if it's still available, subject to some deadlines. Any elections as to what to do with a "hung" process should take into account CPU and I/O load of the system.
It may be worthwhile to keep a history of the latency of the service's response to the service request. Only "large" changes to the latency should be taken to be an indication of a problem. Suppose you're keeping track of the average latency. One could have set an ultimate deadline to 50x the previous average latency. Missing this deadline, the service is presumed dead and up for forced recycling. An "action flag" deadline may be set to 5-10x the average latency. A human would then be given an option to orderly restart the service. The flag would be automatically removed when latency backs down to, say, 30% below the deadline that triggered the flag.
If you are the developer of the monitored process, then you can invert the monitoring aspect and become a passive watchdog of the monitored process. The monitored process must then periodically, actively "wake" the watchdog to indicate that it's alive. The emission of the wake signal (in generic terms) should be performed in strategic location(s) in the code. Periodic reception of wake "signals" should allow you to reason that the process is still alive. You may have multiple wake signals, tagged with the location in the watched process. Everything depends on how many threads the process has, what is it doing, etc.

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every .2 seconds to keep-alive some hardware which has a timeout of about .9 seconds. Typically the heartbeat call takes 10-20ms and the thread spends the rest of the time sleeping.
Occasionally however there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL which is 15 for a normal priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slow down but other theads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call and turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: Turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. Next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio")
Update: Scheduling heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24) but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter so turned off the weekly scan. Still can't explain some delays so we're going to increase to a 5-10 second watchdog timeout.
Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc., etc. It will take some insight to selectively disable or use an alternate driver to see if the problem system hesitation is affected. Once you find that, then figuring out a way to prevent it might range from trivial to impossible depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?
Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unoficially you could play with the undocumented NtSetInformationThread
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as it was said before, you can never be sure that the OS will not take its time when your thread's quantum will expire. Certain poorly written drivers are often the cause of such latency.
Otherwise there is a software which can tell you if you have misbehaving kernel parts:
http://www.thesycon.de/deu/latency_check.shtml
I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.

With a single file descriptor, Is there any performance difference between select, poll and epoll and ...?

The title really says it all.
The and ... means also include pselect and ppoll..
The server project I'm working on basically structured with multiple threads. Each
thread handles one or more sessions. All the threads are identical. The protocol
takes care of which thread will host the session.
I'm using an inhouse socket class that wraps things up. The point of interest is a checkread call which calls either poll (linux) or select (windows).
In summary each thread currently calls poll on a single socket. From what I can tell, using epoll would only be of benefit if this thread was looking at multiple sockets such as what you'd get in say an HTTP server. That's not what I'm doing in my case. And the class only handles a single socket at a time.
There is some brief discussion about edge and level triggering in the man pages for epoll. I'm not really sure what it means. In the socket class I see an optimization in the windows part of the code that shortcuts the select call with an ioctlsocket & FIONREAD to check if there is any data. Wondering if that would return > 0 even if a complete UDP packet hadn't arrived at the time of the call. Is this what edge triggering is in epoll?
In some rudimentary testing, I'm also seeing no noticeable difference between using select and poll.
I can see that using ppoll might be of benefit though due to greater precision in the timeout. Any thoughts?
And yes, I am trying to optimize throughput for a session that is receiving lots of data. The server is more Network & Disk bound than CPU.
The main difference between epoll vs select or poll is that epoll scales a lot better when run in a single thread. I don't know how this would compare to using a multithreaded server using select or poll.
Look at this http://monkey.org/~provos/libevent/libevent-benchmark2.jpg
The reason for this(as far as I can tell) is that when you are using select or poll you must loop through all the connected sockets to determine which ones have data to be read. When you are using epoll, it keeps a seperate array which contains references only to sockets which have data to be read. This saves you lots of loop cycles, and the difference becomes more and more noticeable the more sockets that are connected.
Another thing to look into if performance ever becomes a major issue is io completion ports(windows only) and kqueue(FreeBSD only). It's also important to remember that epoll is linux only. In most cases select or poll will work just fine.
In the case of a single file descriptor, select and poll are more efficient than epoll due to being much simpler. (epoll has some overhead which doesn't make itself useful with only a single socket)
According to the link: http://www.intelliproject.net/articles/showArticle/index/io_multiplexing.
If you use only one descriptor:
select: 201 micro seconds.
poll: 159 micro seconds.
epoll: 176 micro seconds.
Seems poll will be a better solution in such situation.
If you have only a single socket, what's the point of polling in the first place? Wouldn't the best performance then be by just using blocking read/write?
Wrt. the performance, with only a single file descriptor I don't think there is much, if any, difference between the various approaches. If you really care, I suppose you could measure, but I find it difficult that this would particularly matter for the overall performance of your program.
Level/edge triggering. Consider you're monitoring a signal, for simplicity say some voltage in a line. Edge triggering means that something triggers when the voltage goes over or under some specific limit. Level triggering means that something is considered to be in a triggered state as long as the voltage is over/under the limit. That is, edge triggering triggers when some event happens (crossing some threshold), level triggering reflects the state of some "thing" (in this case, voltage).
To get back to network programming, and edge triggered system might be one where you get some kind of signal when a packet is received. If you don't handle the event then the signal is lost. A level triggered system, OTOH, is something like asking "is there data waiting in the buffer for me?"; if you don't handle the event and ask again, the data will still be there waiting for you.