How to find why my application becomes unresponsive at unaccessible customer sites - c++

My application is deployed at customer sites, that I can not access, and has no internet connection.
There are complains that in several sites, once in a week or so, the application become unresponsive, so that the operators need to kill and restart it.
We were unable to observe it in our site.
Is there something I can do that may help me find the problem?
It is a VC2008 Win32 MFC applications.
The application is quite complex, and includes many threads, synchronization mechanisms, database access, HMI, communication channels...
Note: The custmer can send us log files.
Note: The application does not crash. It just hangs. Since I don't know what is the nature of the problem, I have no way to know programmatically that something went wrong (or do I?)

I have had great success with ADplus and WinDBG in the past. You may check it out. Especially check out the Hang mode in ADplus.

I would start with some questions - is the CPU hogged during these unresponsive times? Is there a specific process that's hogging it? (You can use PerfMon to get the answers). Depending on the answers I would probably proceed by taking a dump of the process at this stage (ProcDump by sysinternals is great for these purposes) and investigate it offline.

In similar situation on a non-windows platform we have the capability to gather system dumps. Get a thread dump of the entire system for off-site analysis. This enables us to find deadlocks quite easily. For slow problems rather than stop a single dump is not enough. Then we need a sequence of dumps, and some good luck.
Another, rather messier technique is to have enough trace, and enough fine-grained control of trace in the app. Then turn on some trace and hope to spot where the delays are happening.

My experience with finding bugs in installations on the other side of the planet shows three helpful techniques: Logging, logging, and logging.
What do those log files say your customers sent you? If they aren't detailed enough, send them a version that logs more. Use binary approximation to home in on the error.

To know where the process is hung is better to start with the stack trace at that instant.
Now since your program is installed remotely and you can't access it, you can write a monitoring program which can periodically check the stack of your program and log it. This information along with your logging mechanism will make things easier to identify and debug.
Since I am not a windows programmer, i don't know much about such tools availability in windows, however i think you need something similar to this http://www.codeproject.com/KB/threads/StackWalker.aspx

Related

CPU Usage gradually increases in dotnet core webservice

I have a .net Core web service which seems to slowly increase its cpu usage.
meaning at the first day it won't go past 10%, the second day it can go up to 20% and so on.
Using the TOP command in linux, all my webservices seems to sometime be shown there (probably when a request is made) and afterward disappear.
This specific process after running for a while just stays there constantly consuming cpu even when no request has been made.
the API still working fine, it seems like there are some threads that just keeps hanging and consuming cpu. last time I checked I had 5 threads that consumed 3-4% cpu and didn't die for some reason.
My guess is that in some specific scenario a thread just stays alive consuming cpu.
The app runs on ubuntu machine, my first step was trying to create a dump file with ProcDump so I can analyze those threads and maybe find where they are hanging.
ProcDump generates a huge 21gb file, which trying to analyze with lldb throws out of memory exception. even tried transferring it to a windows machine to debug with windbg , no help there as it couldn't open the file.
As there is no specific exception or anything I can't really share any piece of code as I have no idea where the issue is... just kind a hoping for some suggestion that might help me get to a solution or at least understand where the problem is.
Thanks a lot for reading, cheers
You could try using something like jetBrains’ DotMemory, they also have a fairly high level but helpful guide https://www.jetbrains.com/help/dotmemory/How_to_Find_a_Memory_Leak.html it also worth checking your startup file and double checking the services you’ve registered are used in the correct way ie not added as scoped when they should be transient or even a singleton etc
so iv'e been at it for a while.
Eventually found out that my problem was with HttpClient
Probably some bad mix of static class and creating new instances of HttpClient that causes the issue Iv'e explained above.
Solved it by utilizing HttpClientFactory as explained here -
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.1
Lesson learned :)
A little late but Procdump for Linux just added .NET Core 3 support that generates much more managable sized core dumps. It automatically detects if the target process is .NET Core and does the right thing (i.e., no need to specify switches).

How to find where the program is waiting

I am working on a big code base. It is heavily multithreaded.
After running the linux based application for a few hours, in the end, right before reporting, the application silences. It doesn't die, it doesn't crash, it just waits there. Joins, mutexes, condition variables ... any of these can be the culprit.
If it had crashed, I would at least have a chance to find the source using debugger. But this way, I have no clue how to use what tool to find the bug. I can't even post a code sample for you. The only thing that can possibly help is to tap MANY places with cout to get a visual where the application is.
Have you been in such a situation? What do you recommend?
If you're running under Linux then just use gdb to run the program. When the application 'silences', interrupt it with CTRL+C, then type backtrace to see the call stack. With this you will find out the function where your application was blocked.
Incase of linux, gdb will be great help. Another tool that can be of great help is strace (This can also be used where there are problems with program for with source is not readily available because strace does not need recompilation to trace them.)
strace shall intercept/record system calls that are called by a process and also the signals that are received by a process. It will be able to show the order of events and all the return/resumption paths of calls. This can take you almost closer to the area of problem.
iotop, LTTng and Ftrace are few of other tools that be helpful to you in this scenario.

Missing heartbeat of client machine

My application launches hundreds of child processes sent to SGE. Few of them take a lot of memory due to which jobs are being failed.
i need some way to monitor the memory usage of clients from main process and relaunch/resubmit them to grid with higher memory requirement in case of such job failures.
i have heard something about missing heartbeat algo for such requirements but I am not much aware of them.
Can experts here please help me in finding out a nice solution for this issue? my application is a c++ application on Linux/Solaris.
Thanks
Ruchi
A solution I have used before is to have a script that captures the output from the qstat-command (using rsh in my case). I filter on my jobs and store the information I need (in my case it was CPU) in a continuously updated list. When a job aborted or was killed, it was easy to go back and look at the CPU usage. It is not 100% real-time, but good enough for me.
My language of choice was Python, as it contains easy-to-use libraries for catching output and logging in to remote machines. However, it should be easy to implement something like capturing rsh-output in C++. You can for example use popen() to pipe the output into your application. I hope this helps.

Failed to resume in time Crashlog

I am trying to figure out a "Failed to resume in time" problem. In one of our testers devices (which is an iPhone 4S with the latest OS) it happens very frequently, whereas in my own device it doesn't seem to happen at all.
Anyway, I got a few crashlogs. I am unable to trace the root of the cause though. I understand that the issue might be
1.When a process is holding up the main thread for too long.
2.When there is a memory issue.
I don't think the memory is much of an issue since it seems to happen when the user leaves the main menu and comes back. Nothing much is happening in the main menu so it probably is a task that runs too long.
Here is an excerpt from the crash log:
Can somebody help me or guide me on who I can trace the cause of the issue? Is there anyway to turn off the watchdog timer(probably not huh?) Also, what does highlighted thread refer to?
I have already checked my applicationDidBecomeActive & applicationWillEnterForeground to make sure there is nothing going on there.
To my knowledge there are no synchronous calls being made at this point. Does Reachability use synchronous calls to check for internet? How can I check for that?
I am not making any large data transfers upon resume.
I notice that GameCenter automatically logs in or check for log in upon resuming your app. Is there anyway to prevent this? Could this possibly cause a time out issue?
I tried doing a time profile, but I am not able to understand how to use it to analyze. If you can provide a good resource for that, that would be amazing.
Thanks!!!
You're currently in "trying to find the issue mode". You should switch to "try to find out how much of an issue this really is" mode.
So go find another 4S (actually as many as you can) to rule out that it's a device-specific issue. If it happens on all 4S it should be easier to pinpoint. If not, have someone else look over it, discuss possible causes. The peer programming approach often helps when you're stuck in a dead-end situation.
If the issue is only on that one device, you might want to check if it's broken (or "jailbroken") or might simply need a hard reboot (hold power and home for 10+ seconds).
If it only happens on some devices but not all, try to find what they have in common. This could be language/locale, or dictation, practically any kind of setting the user might have changed. If necessary, write a logger that logs as many settings as possible to your (web) server so you can compare settings one-by-one and quickly discard those that aren't in synch.
If only very few devices are affected, you could also ignore the issue and hope that additional crash logs from users will reveal the key to the issue.
Finally, there's always the option to disable suspend on terminate and instead terminate the app when the home button is pressed (as it was pre iOS 4). Unless of course the app has to run in background.

How to write an unkillable process for Windows?

I'm looking for a way to write an application. I use Visual C++ 6.0.
I need to prevent the user from closing this process via task manager.
You can't do it.
Raymond Chen on why this is a bad idea.
You can make an unkillable process, but it won't be able to accomplish anything useful while it's unkillable. For example, one way to make a process unkillable is to have it make synchronous I/O requests to a driver that can never complete (for example, by deliberately writing a buggy driver). The kernel will not allow a process to terminate until the I/O requests finish.
So it's not quite true that you "can't do it" as some people are saying. But you wouldn't want to anyway.
That all depends on who shouldn't be able to kill that process. You usually have one interactively logged-on user. Running the process in that context will alow the user to kill it. It is her process so she can kill it, no surprise here.
If your user has limited privileges you can always start the process as another user. A user can't kill a process belonging to another user (except for the administrator), no surprise here as well.
You can also try to get your process running with Local System privileges where, I think not even an administrator could kill it (even though he could gain permission to do so, iirc).
In general, though, it's a terribly bad idea. Your process does not own the machine, the user does. The only unkillable process on a computer I know is the operating system and rightly so. You have to make sure that you can't hog resources (which can't be released because you're unkillable) and other malicious side-effects. Usually stuff like this isn't the domain of normal applications and they should stay away from that for a reason.
It's a Win32 FAQ for decades. See Google Groups and Und. boards for well-known methods.(hooking cs and others...)
Noobs who answer "You can't do it" know nothing to Win32 programming : you can do everything with Win32 api...
What I've learned from malware:
Create a process that spawns a dozen of itself
Each time you detect that one is missing (it was killed) spawn a dozen more.
Each one should be a unique process name so that a batch process could not easily kill all of them by name
Sequentially close and restart some of the processes to keep the pids changing which would also prevent a batch kill
Depends on the users permission. If you run the program as administrator a normal user will not have enough permissions to kill your process. If an administrator tries to kill the process he will in most cases succeed. If you really want someone not to kill you process you should take a look at windows system services and driver development. In any case, please be aware that if a user cannot kill a process he is stuck with it, even though it behaves abnormally duo to bugs! You will find a huge wealth of these kind of programs/examples on the legal! site rootkit.com. Please respect the user.
I just stumbled upon this post while trying to find a solution to my own (unintentional) unkillable process problem. Maybe my problem will be your solution.
Use jboss Web Native to install a service that will run a batch file (modify service.bat so that it invokes your own batch file)
In your own batch file, invoke a java process that performs whatever task you'd like to persist
Start the service. If you view the process in process explorer, the resulting tree will look like:
jbosssvc.exe -> cmd.exe -> java.exe
use taskkill from an administrative command prompt to kill cmd.exe. Jbosssvc.exe will terminate, and java.exe will be be an orphaned running process that (as far as I can tell) can't be killed. So far, I've tried with Taskmanager, process explorer (running as admin), and taskkill to no avail.
Disclaimer: There are very few instances where doing this is a good idea, as everyone else has said.
There's not a 100% foolproof method, but it should be possible to protect a process this way. Unfortunately, it would require more knowlegde of the Windows security system API than I have right now, but the principle is simple: Let the application run under a different (administrator) account and set the security properties of the process object to the maximum. (Denying all other users the right to close the process, thus only the special administrator account can close it.)
Set up a secondary service and make it run as a process guardian. It should have a lifeline to the protected application and when this lifeline gets cut (the application closes) then it should restart the process again. (This lifeline would be any kind of inter-process communications.)
There are still ways to kill such an unkillable process, though. But that does require knowledge that most users don't really know about, so about 85% of all users won't have a clue to stop your process.
Do keep in mind that there might be legal consequences to creating an application like this. For example, Sony created a rootkit application that installed itself automatically when people inserted a Sony music CD or game CD in their computer. This was part of their DRM solution. Unfortunately, it was quite hard to kill this application and was installed without any warnings to the users. Worse, it had a few weaknesses that would provide hackers with additional ways to get access to those systems and thus to get quite a few of them infected. Sony had to compensate quite a lot of people for damages and had to pay a large fine. (And then I won't even mention the consequences it had on their reputation.)
I would consider such an application to be legal only when you install it on your own computer. If you're planning to sell this application to others, you must tell those buyers how to kill the process, if need be. I know Symantec is doing something similar with their software, which is exactly why I don't use their software anymore. It's my computer, so I should be able to kill any process I like.
The oldest idea in the world, two processes that respawn each other?