I want to know if a process (started with the QProcess class) stops responding. For instance, my process is an application that simply prints 1 every second.
My problem is that I want to know if, for some mystical reason, that process is blocked for a short period of time (more than 1 second, something noticeable by a human).
However, the different states of a QProcess (Not Running, Starting, Running) don't include a "Blocked" state.
By blocked I mean "not answering the OS", as when the "Not Responding" message appears in the Task Manager, such as when a Windows MMI (like explorer.exe) hangs and its window turns white.
But: I want to detect that "Not Responding" state for ANY process, not just MMIs.
Is there a way to detect such a state?
Qt doesn't provide any API for that. You'd need to use platform-specific mechanisms. On some platforms (Windows!), there is no notion of a hung application, merely that of a hung window. You can have one application that has both responsive and unresponsive windows :)
On Windows, you'd enumerate all windows using EnumWindows, check whether they belong to your process by comparing the pid from GetWindowThreadProcessId to process->pid(), and finally check whether the window is hung through IsHungAppWindow.
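A minimal sketch of that check (the helper name processHasHungWindow is mine; pass it the pid you already get from QProcess):

#include <windows.h>

struct HungCheck { DWORD pid; bool hung; };

static BOOL CALLBACK checkWindow(HWND hwnd, LPARAM lp)
{
    HungCheck* c = reinterpret_cast<HungCheck*>(lp);
    DWORD windowPid = 0;
    GetWindowThreadProcessId(hwnd, &windowPid);
    if (windowPid == c->pid && IsHungAppWindow(hwnd)) {
        c->hung = true;
        return FALSE;   // stop enumerating
    }
    return TRUE;        // keep looking
}

// Returns true if any top-level window owned by `pid` is not responding.
static bool processHasHungWindow(DWORD pid)
{
    HungCheck c{pid, false};
    EnumWindows(checkWindow, reinterpret_cast<LPARAM>(&c));
    return c.hung;
}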
Caveats
Generally, there is no such thing as an all-encompassing notion of a "non-responding" process.
Suppose you have a web server. What does it mean that it's not responding? It's under heavy load, so it may deny some incoming connections. Is that "non responding" from your perspective? It may be, but there's nothing you can do about it - killing and restarting the process won't fix it. If anything, it will make things worse for the already connected clients.
Suppose you have a process that is blocking on a filesystem read because the particular drive it tries to access is slow, or under heavy load. Does it mean that it's not responding? Will killing and restarting it always fix this? If the process then retries the read from the beginning of the file, it may well make things worse.
Suppose you have a poorly designed process with a GUI. It's doing blocking serial port reads in the GUI thread. The read it's doing takes a long time, and the GUI is nonresponsive for several seconds. You kill the process, it restarts and tries that long read again - you've only made things worse.
You have to tread very carefully here.
Solution Ideas
There are multiple approaches to determining what a "responsive" process is. It was already mentioned that processes with a GUI are monitored by the operating system on both Windows and OS X. Thus one can use native APIs to query whether a window or a process is hung. This makes sense for applications that offer a UI, subject to the caveats above.
If the process provides a service, you may periodically use the service to determine if it's still available, subject to some deadlines. Any decisions about what to do with a "hung" process should take into account the CPU and I/O load of the system.
It may be worthwhile to keep a history of the latency of the service's responses to requests. Only "large" changes in the latency should be taken as an indication of a problem. Suppose you're keeping track of the average latency. One could set an ultimate deadline at 50x the previous average latency; if that deadline is missed, the service is presumed dead and is up for forced recycling. An "action flag" deadline may be set at 5-10x the average latency; a human would then be given the option to restart the service in an orderly fashion. The flag would be automatically cleared when the latency drops back to, say, 30% below the deadline that triggered it.
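A rough sketch of that bookkeeping (the smoothing factor and the exact multipliers are arbitrary illustrations, not prescribed values):

#include <chrono>

// Tracks a running average of request latency and derives the two deadlines
// described above. Names and constants are illustrative only.
class LatencyMonitor {
public:
    // Call after each successful service request.
    void recordLatency(std::chrono::milliseconds latency) {
        avgMs = avgMs == 0.0 ? latency.count()
                             : 0.9 * avgMs + 0.1 * latency.count();
        if (latency.count() < 0.7 * flagDeadlineMs())
            actionFlag = false;            // latency recovered, clear the flag
    }
    // Call when a request has been outstanding for `elapsed` without a reply.
    void checkOutstanding(std::chrono::milliseconds elapsed) {
        if (elapsed.count() > killDeadlineMs())
            presumedDead = true;           // candidate for forced recycling
        else if (elapsed.count() > flagDeadlineMs())
            actionFlag = true;             // ask a human to consider a restart
    }
    double flagDeadlineMs() const { return 7.0 * avgMs; }   // 5-10x average
    double killDeadlineMs() const { return 50.0 * avgMs; }  // ultimate deadline

    bool actionFlag = false;
    bool presumedDead = false;
private:
    double avgMs = 0.0;
};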
If you are the developer of the monitored process, then you can invert the monitoring aspect and become a passive watchdog of the monitored process. The monitored process must then periodically, actively "wake" the watchdog to indicate that it's alive. The emission of the wake signal (in generic terms) should be performed in strategic location(s) in the code. Periodic reception of wake "signals" should allow you to reason that the process is still alive. You may have multiple wake signals, tagged with the location in the watched process. Everything depends on how many threads the process has, what it is doing, etc.
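For instance, on the watcher's side this could be as simple as a single-shot timer that is re-armed by every wake message (a sketch; how the wake messages reach the watcher - local socket, pipe, stdout - is up to you):

#include <QObject>
#include <QTimer>

// Watcher side: restart the timer every time a wake message arrives.
// If no wake is seen for `timeoutMs`, the watched process is presumed stuck.
class Watchdog : public QObject {
    Q_OBJECT
public:
    explicit Watchdog(int timeoutMs, QObject* parent = nullptr)
        : QObject(parent) {
        timer.setSingleShot(true);
        timer.setInterval(timeoutMs);
        connect(&timer, &QTimer::timeout, this, &Watchdog::presumedHung);
        timer.start();
    }
public slots:
    void wakeReceived() { timer.start(); }   // call this for each wake signal
signals:
    void presumedHung();                     // nobody pinged us in time
private:
    QTimer timer;
};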
Related
We are having an issue when waiting in a thread on macOS: when the main window is hidden, the wait function takes up to 10 seconds even if we ask it to wait 100 ms.
The main program runs in a Cocoa window, and another thread runs permanently, waiting 100 ms on every iteration.
Everything works fine when the main window is visible, but once the window is hidden, the problem starts happening after some time, i.e. the wait starts waiting for several seconds. We suspect the system stops waking up the application as often because it's no longer visible.
We are using pthread_cond_wait, but the same problem happens using usleep or boost::sleep (which probably use the same mechanism underneath).
Is there a way to prevent this or a flag to set to tell the system we are still running and we want to be woken up?
Thanks
If it's OS X v10.9 or later, your app could be napping away:
Power Efficiency Guide for Mac Apps
The document says it can be prevented with the NSProcessInfo class.
Managing Activities
The system has heuristics to improve battery life, performance, and
responsiveness of applications for the benefit of the user. You can
use the following methods to manage activities that give hints to the
system that your application has special requirements:
beginActivityWithOptions:reason:
endActivity:
performActivityWithOptions:reason:usingBlock:
In response to creating an activity, the system will disable some or
all of the heuristics so your application can finish quickly while
still providing responsive behavior if the user needs it.
...
id activity = [[NSProcessInfo processInfo]
    beginActivityWithOptions:NSActivityLatencyCritical
                      reason:@"Good Reason"];
// Perform some work.
[[NSProcessInfo processInfo] endActivity:activity];
Note,
NSActivityLatencyCritical
Flag to indicate the activity requires the highest amount of timer and I/O precision available.
IMPORTANT Very few applications should need to use this constant.
I'm using C++ and I need the equivalent of SIGCHLD for a process I'm aware of (i.e. I know its pid), but did not spawn.
Is there a well-established design pattern to listen/watch/monitor another process's lifespan when it is not your child or in your group or session?
EDIT: I am specifically trying to be aware of abnormal terminations (i.e. seg faults, signals, etc...). I would like to eavesdrop on signals the process in question receives.
I don't know if it follows a specific pattern per se, but one technique is to have the process establish a connection to the watcher. The watcher monitors the connection, and when it is closed, it knows the process has shut down.
If the watcher wants to know if the watched process is responsive or not, you can use the connection to monitor heartbeat messages that the process is obliged to provide.
If the watcher wants to know whether the watched process is making progress, the heartbeat message could provide state information that would allow the watcher to monitor that.
Different operating systems may provide different ways to achieve the same objective. For example, on Linux, the watcher could use inotify to monitor the /proc entry for that process to determine if the process is up or down. The BSD kqueue has a similar capability. The process could export its heartbeat/state into shared memory, and the watcher could use a timed wait on a semaphore to see if the data is being updated.
If the process is a third-party program, and source is not available, then you would have to resort to some method similar to inotify/kqueue, or as a last resort, poll the kernel state (similar to the way the top utility works).
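To illustrate the kqueue route (a sketch for BSD/macOS; error handling omitted), you can be notified when an arbitrary pid you did not spawn exits:

#include <sys/types.h>
#include <sys/event.h>
#include <unistd.h>
#include <cstdio>

// Block until the given (non-child) pid exits, using kqueue's EVFILT_PROC filter.
void waitForExit(pid_t pid)
{
    int kq = kqueue();
    struct kevent change;
    EV_SET(&change, pid, EVFILT_PROC, EV_ADD | EV_ONESHOT, NOTE_EXIT, 0, nullptr);

    struct kevent event;
    if (kevent(kq, &change, 1, &event, 1, nullptr) == 1) {
        // On some systems event.data carries the wait status of the process.
        std::printf("pid %ld exited\n", (long)pid);
    }
    close(kq);
}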
I have a data acquisition application running on Windows 7, built with VC2010 in C++. One thread is a heartbeat which sends out a change every 0.2 seconds to keep alive some hardware which has a timeout of about 0.9 seconds. Typically the heartbeat call takes 10-20 ms and the thread spends the rest of the time sleeping.
Occasionally, however, there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL, which is 15 for a normal-priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call; it turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: It turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. The next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio").
Update: Scheduling the heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24), but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter, so I turned off the weekly scan. I still can't explain some delays, so we're going to increase the watchdog timeout to 5-10 seconds.
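For reference, joining the MMCSS "Pro Audio" class from the heartbeat thread looks roughly like this (a sketch; needs avrt.h and linking against Avrt.lib):

#include <windows.h>
#include <avrt.h>
#pragma comment(lib, "Avrt.lib")

// Call at the start of the heartbeat thread.
void heartbeatThread()
{
    DWORD taskIndex = 0;
    HANDLE mmcss = AvSetMmThreadCharacteristics(TEXT("Pro Audio"), &taskIndex);

    // ... existing heartbeat loop ...

    if (mmcss)
        AvRevertMmThreadCharacteristics(mmcss);   // restore normal scheduling
}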
Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc. It will take some insight to selectively disable drivers or use alternate ones to see if the system hesitation is affected. Once you find the culprit, figuring out a way to prevent it might range from trivial to impossible depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?
Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially, you could play with the undocumented NtSetInformationThread.
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Poorly written drivers are often the cause of such latency.
Otherwise, there is a tool which can tell you if you have misbehaving kernel components:
http://www.thesycon.de/deu/latency_check.shtml
I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.
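A minimal sketch of what that heartbeat loop could look like with a 200 ms periodic waitable timer (error handling omitted; the keep-alive call is a placeholder for your existing code):

#include <windows.h>

// Heartbeat loop driven by a periodic waitable timer instead of Sleep().
void heartbeatLoop()
{
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);   // auto-reset
    LARGE_INTEGER dueTime;
    dueTime.QuadPart = -2000000LL;   // first fire in 200 ms (100 ns units, negative = relative)
    SetWaitableTimer(timer, &dueTime, 200 /* period in ms */, NULL, NULL, FALSE);

    for (;;) {
        WaitForSingleObject(timer, INFINITE);
        // SendHeartbeatToHardware();   // hypothetical: your existing keep-alive call
    }
}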
The title really says it all.
The "and ..." means: also include pselect and ppoll.
The server project I'm working on is basically structured with multiple threads. Each thread handles one or more sessions. All the threads are identical. The protocol takes care of which thread will host a session.
I'm using an in-house socket class that wraps things up. The point of interest is a checkread call which calls either poll (Linux) or select (Windows).
In summary, each thread currently calls poll on a single socket. From what I can tell, using epoll would only be of benefit if this thread were looking at multiple sockets, such as what you'd get in, say, an HTTP server. That's not what I'm doing in my case, and the class only handles a single socket at a time.
There is some brief discussion about edge and level triggering in the man pages for epoll. I'm not really sure what it means. In the socket class I see an optimization in the Windows part of the code that shortcuts the select call with an ioctlsocket & FIONREAD check to see if there is any data. I'm wondering if that would return > 0 even if a complete UDP packet hadn't arrived at the time of the call. Is this what edge triggering is in epoll?
In some rudimentary testing, I'm also seeing no noticeable difference between using select and poll.
I can see that using ppoll might be of benefit though due to greater precision in the timeout. Any thoughts?
And yes, I am trying to optimize throughput for a session that is receiving lots of data. The server is more network- and disk-bound than CPU-bound.
The main difference between epoll and select or poll is that epoll scales a lot better when run in a single thread. I don't know how this would compare to a multithreaded server using select or poll.
Look at this http://monkey.org/~provos/libevent/libevent-benchmark2.jpg
The reason for this (as far as I can tell) is that when you are using select or poll, you must loop through all the connected sockets to determine which ones have data to be read. When you are using epoll, the kernel keeps a separate list containing references only to sockets which have data to be read. This saves you lots of loop cycles, and the difference becomes more and more noticeable the more sockets are connected.
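A minimal sketch of the epoll pattern, for comparison (single thread, many sockets; error handling and socket setup omitted):

#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// Register `fds` with one epoll instance, then only handle the sockets the
// kernel reports as ready instead of scanning all of them each iteration.
void epollLoop(const std::vector<int>& fds)
{
    int ep = epoll_create1(0);
    for (int fd : fds) {
        epoll_event ev{};
        ev.events = EPOLLIN;          // level-triggered read readiness
        ev.data.fd = fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    }

    std::vector<epoll_event> ready(fds.size());
    for (;;) {
        int n = epoll_wait(ep, ready.data(), (int)ready.size(), -1);
        for (int i = 0; i < n; ++i) {
            // read() from ready[i].data.fd here
        }
    }
}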
Another thing to look into if performance ever becomes a major issue is I/O completion ports (Windows only) and kqueue (FreeBSD only). It's also important to remember that epoll is Linux only. In most cases select or poll will work just fine.
In the case of a single file descriptor, select and poll are more efficient than epoll because they are much simpler. (epoll has some overhead which isn't useful with only a single socket.)
According to the link: http://www.intelliproject.net/articles/showArticle/index/io_multiplexing.
If you use only one descriptor:
select: 201 microseconds.
poll: 159 microseconds.
epoll: 176 microseconds.
It seems poll would be the better solution in such a situation.
If you have only a single socket, what's the point of polling in the first place? Wouldn't the best performance then come from just using blocking read/write?
Regarding the performance, with only a single file descriptor I don't think there is much, if any, difference between the various approaches. If you really care, I suppose you could measure, but I find it hard to believe that this would particularly matter for the overall performance of your program.
Level/edge triggering. Consider you're monitoring a signal, for simplicity say some voltage in a line. Edge triggering means that something triggers when the voltage goes over or under some specific limit. Level triggering means that something is considered to be in a triggered state as long as the voltage is over/under the limit. That is, edge triggering triggers when some event happens (crossing some threshold), level triggering reflects the state of some "thing" (in this case, voltage).
To get back to network programming, an edge-triggered system might be one where you get some kind of signal when a packet is received. If you don't handle the event then the signal is lost. A level-triggered system, OTOH, is something like asking "is there data waiting in the buffer for me?"; if you don't handle the event and ask again, the data will still be there waiting for you.
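In epoll terms, the difference shows up as the EPOLLET flag: with edge triggering you are only notified on the transition to "readable", so you must drain the socket before waiting again (a sketch, assuming a non-blocking fd):

#include <sys/epoll.h>
#include <sys/socket.h>

// Edge-triggered registration: one notification per "became readable" event.
// Level-triggered (the default) keeps reporting the fd as long as data remains.
void addEdgeTriggered(int ep, int fd)
{
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
}

// On each EPOLLET notification, read until the buffer is empty (recv returns -1
// with errno == EAGAIN), or you may never be woken for that data again.
void drain(int fd)
{
    char buf[4096];
    while (recv(fd, buf, sizeof buf, 0) > 0) { /* consume data */ }
}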
I'm designing a networking framework which uses WSAEventSelect for asynchronous operations. I spawn one thread for every 64 sockets due to the maximum of 64 events per thread, and everything works as expected except for one thing:
Threads keep getting spawned uncontrollably by Winsock during connect and disconnect, threads that won't go away.
With the current design of the framework, two threads should be running when only a few sockets are active, and as expected, two threads are running in total. However, when I connect with a few sockets (1-5 sockets), an additional 3 threads are spawned which persist until I close the application. Also, when I lose the connection on any of the sockets, 2 more threads are spawned (also persisting until closure). That's 7 threads in total, 5 of which I have no idea what they are for.
If they were required by Winsock for connecting or whatever and then disappeared, that would be fine, but it bothers me that they persist until I close my application.
Is there anyone who could shed some light on this? Possibly a solution to avoid these threads or force them to close when no connections are active?
(Application is written in C++ with Win32 and Winsock 2.2)
Information from Process Explorer:
Expected threads:
MyApp.exe!WinMainCRTStartup
MyApp.exe!Netfw::NetworkThread::ThreadProc
Unexpected threads:
ntdll.dll!RtlpUnWaitCriticalSection+0x2dc
mswsock.dll+0x7426
ntdll.dll!RtlGetCurrentPeb+0x155
ntdll.dll!RtlGetCurrentPeb+0x155
ntdll.dll!RtlGetCurrentPeb+0x155
All of the unexpected threads have call stacks with calls to functions such as ntkrnlpa.exe!IoSetCompletionRoutineEx+0x46e, which probably means they are part of the notification mechanism.
Download the Sysinternals tool Process Explorer. Install the appropriate Debugging Tools for Windows. In Process Explorer, set the Options -> Symbols path to:
SRV*C:\Websymbols*http://msdl.microsoft.com/download/symbols
Where C:\Websymbols is just a place to store the symbol cache (I'd create a new empty directory for it.)
Now you can inspect your program with Process Explorer. Double-click the process, go to the Threads tab, and it will show you where the threads started, how busy they are, and what their current call stack is.
That usually gives you a very good idea of what the threads are. If they're Winsock internal threads, I wouldn't worry about them, even if there are hundreds.
One direction to look in (just a guess): If these are TCP connections, these may be background threads to handle internal TCP-related timers. I don't know why they would use one thread per connection, but something has to do the background work there.