How to get the number of timers in a process? - c++

I'm having exactly the same problem described here:
timer_create() : -1 EAGAIN (Resource temporarily unavailable)
In short, some process is reserving a lot of timers via timer_create but never releases them.
What I cannot figure out is how to determine which process is affected by the leak in our production environment.
How can I tell which process is the bad one without randomly killing everything that is running?
Is there any debug info under /proc/`pidof myprocess`/ that tells me how many timers are reserved?
Thank you in advance!

Why yes, actually. Use the stap tool to trace system calls and determine which calls processes make most often.
The SystemTap Beginners Guide is a good resource. In particular, see the script on this page for an example of counting specific system calls per process.
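As a complement to tracing, Linux kernels from 3.10 onward can expose a /proc/<pid>/timers file that lists a process's POSIX timers, one entry per timer, each starting with an "ID:" line. Whether the file is present depends on the kernel version and build options, so treat the following as a minimal sketch rather than a guaranteed interface. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Counts timer entries in text formatted like /proc/<pid>/timers,
// where each entry begins with a line of the form "ID: <n>".
std::size_t count_timers(std::istream& in) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 3, "ID:") == 0) {
            ++count;
        }
    }
    return count;
}

// Same count, taken from an in-memory copy of the file contents.
std::size_t count_timers_in_text(const std::string& text) {
    std::istringstream in(text);
    return count_timers(in);
}

// Reads /proc/<pid>/timers directly. Returns -1 if the file cannot be
// opened (process gone, no permission, or kernel without this file).
long count_timers_for_pid(long pid) {
    assert(pid > 0);
    std::ifstream in("/proc/" + std::to_string(pid) + "/timers");
    if (!in) {
        return -1;
    }
    return static_cast<long>(count_timers(in));
}
```

Calling count_timers_for_pid for each PID found under /proc and sorting by the result should point at the process holding the most timers.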

Related

Critical process check at WDK development

I have been searching for information on process info. I would like to know whether a process is a critical one or not. I have already done this in the application layer. However, I cannot determine whether a process is critical in the driver layer (kernel).
I need to know whether a process is critical so that I do not kill a critical process and crash Windows.
Can you help me with this issue?
Lots of love,
If you insist on doing the check from the kernel, you can use ZwQueryInformationProcess with the ProcessBreakOnTermination information class.
You can also use two-way IPC to communicate with a Windows service which will perform the check for you.

Cgroup usage to limit resources

My Goal: To provide user a way to limit resources like CPU, memory for the given process (C++).
Someone suggested that I use cgroups, which look like an ideal fit.
After doing some research I have a concern:
When we use memory.limit_in_bytes to limit the memory usage of a given process, is there a way to handle the out-of-memory condition in the process? I see that control groups provide a parameter called "memory.oom_control" which, when enabled, kills the process that requests more memory than allowed. When disabled, it just pauses the process.
I want a way to let the process know that it is requesting more memory than expected and should throw out of memory exception. This is so that the process gracefully exits.
Does cgroups provide such kind of behaviour?
Also, are cgroups available in all flavors of Linux? I am mainly interested in RHEL 5+, CentOS 6+ and Ubuntu 12+ machines.
Any help is appreciated.
Thanks
I want a way to let the process know that it is requesting more memory than expected and should throw out of memory exception. This is so that the process gracefully exits.
Does cgroups provide such kind of behaviour?
All processes in recent releases already run inside a cgroup, the default one. If you create a new cgroup and then migrate the process into the new cgroup, everything works as before but using the constraints from the new cgroup. If your process allocates more memory than permitted, it gets an ENOSPC or a malloc failure just as it presently does.
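Since the question notes that memory.oom_control either kills or pauses the offending process, one way for a process to react before that point is to compare its own cgroup's usage against the limit and back off early. The sketch below assumes a cgroup v1 memory controller directory containing memory.usage_in_bytes and memory.limit_in_bytes; the directory path, the threshold, and the function names are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Reads a single unsigned integer from a cgroup control file.
// Returns false if the file cannot be opened or parsed.
bool read_u64(const std::string& path, std::uint64_t& out) {
    std::ifstream in(path);
    return static_cast<bool>(in >> out);
}

// True when usage has reached the given fraction (0, 1] of the limit.
bool near_limit(std::uint64_t usage, std::uint64_t limit, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    return limit > 0 &&
           static_cast<double>(usage) >= fraction * static_cast<double>(limit);
}

// cgroup_dir is an assumed path such as /sys/fs/cgroup/memory/mygroup.
// Returns false when either control file is unavailable.
bool cgroup_near_limit(const std::string& cgroup_dir, double fraction) {
    std::uint64_t usage = 0;
    std::uint64_t limit = 0;
    if (!read_u64(cgroup_dir + "/memory.usage_in_bytes", usage)) {
        return false;
    }
    if (!read_u64(cgroup_dir + "/memory.limit_in_bytes", limit)) {
        return false;
    }
    return near_limit(usage, limit, fraction);
}
```

A process could call cgroup_near_limit before large allocations and throw its own out-of-memory exception when it returns true. This is polling, so a sudden burst of allocation can still overshoot between checks.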

Is it possible to get a list of WinAPIs that call SendMessage or other locking APIs internally?

I'm facing a task of locating what is causing our production app to sporadically lock up its main/GUI thread on an end-user's machine. Unfortunately the intermittent nature of this bug and the inability to install and run any debugging/tracking tools in that environment makes locating the cause way more difficult. The current approach is to try to place traces around possible locking APIs and see which one may be causing the lock-up.
In light of that, I'm curious whether there is a list of WinAPIs that call SendMessage or other locking WinAPIs internally. (I believe these may be a primary reason for deadlocks in the main thread.)
PS. I already checked the following kernel APIs that may be involved in the lock-up: WaitForSingleObject, WaitForMultipleObjects, etc. that were used in the app and all of them checked out.

Extracting userspace thread stack from kernel core dump on FreeBSD

I am trying to debug a multi-process solution on FreeBSD. When the system/appliance experienced a hang-like scenario, we forced a kernel dump through 'sysctl debug.panic=1'. The intention was to capture the state of all processes at the same point in time. However, I am not able to look into the thread stacks of userspace applications. Using 'ps', I am able to list all userspace processes/threads, but I am not able to set their stack frame and unwind using 'bt'.
Is it possible to achieve something like what I am attempting? I have seen the OpenVMS debugger (IIRC even windbg) allow one to peek into userspace threads.
Use DDB. It supports tracing of threads. See this article. The same article also names kgdb commands to trace userspace threads. But those are not found in the manual page. :-(
In DDB, "bt/u" will trace the userland portion of a thread's stack. See "man 4 ddb". That, combined with textdump, may be enough.
If all you have to work with is a core, things get a bit more complicated.
In kgdb, "info threads" will list all threads that were running at the time of the kernel crash. After that, "thread X" followed by "bt" will give you the in-kernel portion of the thread's stack.
Getting the userland portion of the application's stack will be harder. The easiest way to do that would probably be to modify the gcore application so that it uses libkvm to dig into the VM structures associated with a given process and essentially reconstruct the process's coredump. It is possible, but I don't think there's a ready-to-use solution at the moment.

Detecting application hang

I have a very large, complex (million+ LOC) Windows application written in C++. We receive a handful of reports every day that the application has locked up, and must be forcefully shut down.
While we have extensive reporting about crashes in place, I would like to expand this to include these hang scenarios -- even with heavy logging in place, we have not been able to track down root causes for some of these. We can clearly see where activity stopped - but not why it stopped, even in evaluating output of all threads.
The problem is detecting when a hang occurs. So far, the best I can come up with is a watchdog thread (as we have evidence that background threads are continuing to run w/out issues) which periodically pings the main window with a custom message, and confirms that it is handled in a timely fashion. This would only capture GUI thread hangs, but this does seem to be where the majority of them are occurring. If a reply was not received within a configurable time frame, we would capture a memory and stack dump, and give the user the option of continuing to wait or restarting the app.
Does anyone know of a better way to do this than periodically polling the main window? It seems painfully clumsy, but I have not seen alternatives that will work on our platforms -- Windows XP and Windows 2003 Server. I see that Vista has much better tools for this, but unfortunately that won't help us.
Suffice it to say that we have done extensive diagnostics on this and have been met with only limited success. Note that attaching windbg in real-time is not an option, as we don't get the reports until hours or days after the incident. We would be able to retrieve a memory dump and log files, but nothing more.
Any suggestions beyond what I'm planning above would be appreciated.
The answer is simple: SendMessageTimeout!
Using this API you can send a message to a window and wait up to a timeout before continuing; if the application responds before the timeout, it is still running; otherwise it is hung.
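The watchdog idea from the question and the timeout check described here share one core piece of logic: record when the main thread last answered, and declare a hang once that answer is older than the timeout. The sketch below shows only that logic in portable C++; it is not the Win32 call itself, and the Heartbeat class with its beat and is_hung members is a hypothetical name chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Shared between the monitored (GUI) thread and a watchdog thread.
class Heartbeat {
public:
    explicit Heartbeat(Clock::time_point start = Clock::now())
        : last_(start.time_since_epoch().count()) {}

    // Called by the monitored thread each time it handles the ping.
    void beat(Clock::time_point now = Clock::now()) {
        last_.store(now.time_since_epoch().count(), std::memory_order_relaxed);
    }

    // Called by the watchdog: true if the last beat is older than timeout.
    bool is_hung(Clock::duration timeout,
                 Clock::time_point now = Clock::now()) const {
        assert(timeout > Clock::duration::zero());
        const Clock::duration since_last(
            now.time_since_epoch().count() -
            last_.load(std::memory_order_relaxed));
        return since_last > timeout;
    }

private:
    std::atomic<Clock::rep> last_;  // tick count of the most recent beat
};
```

In a real application the main window's handler for the custom ping message would call beat(), and the watchdog thread would call is_hung() on a timer and trigger the dump when it returns true. Passing the time points explicitly, as the parameters allow, keeps the check easy to test.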
One option is to run your program under your own "debugger" all the time. Some programs, such as GetRight, do this for copy protection, but you can also do it to detect hangs. Essentially, you include in your program some code to attach to a process via the debugging API and then use that API to periodically check for hangs. When the program first starts, it checks if there's a debugger attached to it and, if not, it runs another copy of itself and attaches to it - so the first instance does nothing but act as the debugger and the second instance is the "real" one.
How you actually check for hangs is another question entirely, but with access to the debugging API there should be some way to check reasonably efficiently whether the stack has changed or not (i.e. without loading all the symbols). Still, you might only need to do this every few minutes or so, so even if it's not efficient it might be OK.
It's a somewhat extreme solution, but should be effective. It would also be quite easy to turn this behaviour on and off - a command-line switch will do or a #define if you prefer. I'm sure there's some code out there that does things like this already, so you probably don't have to do it from scratch.
A suggestion:
Assuming that the problem is due to locking, you could dump your mutex & semaphore states from a watchdog thread. With a little bit of work (tracing your call graph), you can determine how you've arrived at a deadlock, which call paths are mutually blocking, etc.
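One way to make lock state dumpable is to wrap each mutex so that it records where it was last acquired. The following is a minimal sketch under that assumption; TrackedMutex, its site labels, and describe() are hypothetical names, and the site strings are expected to be string literals so the stored pointers stay valid.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>

// A mutex that remembers the call site of its current holder so that a
// watchdog thread can report lock state without taking the lock itself.
class TrackedMutex {
public:
    explicit TrackedMutex(const char* name) : name_(name) {
        assert(name != nullptr);
    }

    // site should be a string literal naming the acquiring code path.
    void lock(const char* site) {
        mutex_.lock();
        holder_.store(site);
    }

    void unlock() {
        holder_.store(nullptr);
        mutex_.unlock();
    }

    // Safe to call from any thread, including while the mutex is held.
    std::string describe() const {
        const char* site = holder_.load();
        if (site == nullptr) {
            return std::string(name_) + ": free";
        }
        return std::string(name_) + ": held at " + site;
    }

private:
    const char* name_;
    std::mutex mutex_;
    std::atomic<const char*> holder_{nullptr};
};
```

A watchdog that detects a hang could call describe() on every registered TrackedMutex and write the results to the log next to the stack dump. The sketch tracks only the holder; recording waiters as well would need a per-thread structure.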
While a crashdump analysis seems to provide a solution for identifying the problem, in my experience this rarely bears much fruit since it lacks sufficient unambiguous detail of what happened just before the crash. Even with the tool you propose, it would provide little more than circumstantial evidence of what happened. I bet the cause is unprotected shared data, so a lock trace wouldn't show it.
The most productive way of finding this, in my experience, is distilling the application's logic to its essence and identifying where conflicts must be occurring. How many threads are there? How many are GUI? At how many points do the threads interact? Yep, this is good old desk checking. Leading suspect interactions can be identified in a day or two, then just convince a small group of skeptics that the interaction is correct.