Application crash at customer machine - c++

Our DCOM server crashes on a customer's machine. The application does not crash if I enable Page Heap, install the PDB files, or attach ADPlus. It does not crash on any of our machines.
I generated a crash dump with NTSD using the Just-In-Time debugging feature of Windows on the customer's machine, but the crash location is different each time.
What technique should I use to identify the cause of the crash?

This sounds like memory corruption. The stack trace is generally not reliable at this point. The first thing to do is look at the stack segment: dump the raw stack rather than a stack trace and see whether the stack can be reconstructed manually. In addition, once memory has been overwritten, check whether the overwritten data shows a recognizable pattern.

Related

Buffer Overflow into a different exe's memory? Or onto csrss.exe from a remote desktop prog?

Short, Question Form:
I did some googling but wasn't able to come up with the answer to this: is it possible to buffer overflow memory into another exe's memory? And/or, is it possible to overflow csrss.exe's memory from an exe running on a remote desktop session?
Longer Story - Here's Our Situation:
We've got a server with an always-running remote desktop session that has a 24/7 program running - a C++ .exe. To make things worse, the C++ exe was programmed using all sorts of unsafe memory operations (raw strcpy, sprintf, etc) You don't need to tell me how bad this is structurally - I completely agree.
Recently, our server has been hitting the Blue Screen of Death, and the dump file indicates that csrss.exe is being terminated by our C++ exe (which will cause a BSOD; csrss.exe is also responsible for managing remote desktop sessions).
So I wanted to know if anyone knew whether it was possible for one app to do a memory buffer overflow that overflowed onto another app's memory space, or whether it'd be possible for an app on a remote desktop session to do so onto csrss.exe?
Any help would be greatly appreciated!
Short answer: no, it is not.
Simplified explanation of why: each program runs in its own virtual address space. This virtual address space is controlled by the page table, which is essentially a lookup table that maps virtual addresses (the addresses in the pointers of the executable) onto physical memory addresses. When the OS switches to a task, it hands the correct table to the CPU/core running the task. Any physical address not mentioned in this table is not accessible from the program. Physical addresses belonging to another application should not appear in this table, so it would be impossible to access memory belonging to another application. When a program misbehaves and accesses an invalid memory location, it attempts to use virtual addresses not mentioned in the table. This triggers an exception/fault on the CPU, which Windows normally reports as an "Access violation".
Of course the OS and the CPU can contain bugs so it is impossible to guarantee that it doesn't happen. But if your C++ program misbehaves then still most of the time this would be caught by the CPU and reported as an access violation and not result in a BSOD. If you do not see your C++ program generating access violations I would expect it to be much more likely that the problem is caused by faulty memory or a buggy driver (drivers run at a higher privilege and can do things normal programs can't).
I would say start with doing an extensive memory test with a program like memtest86. BTW if the server is a "real" server with ECC memory, faulty memory shouldn't be the problem as this should have been reported by the system.
Update
It doesn't matter how the bad memory access happens: underflow, overflow, or uninitialized pointer. The virtual address used is either mapped to a physical memory location reserved for the program or it is not mapped at all. BTW, the checking is done by the CPU; the OS only maintains the tables used to do the lookups.
However, this doesn't mean every error by the program will be detected, because as long as the program accesses addresses for which it was assigned memory, the access is fine as far as the CPU is concerned. The heap manager in your program might think otherwise, but it has no way of detecting this. So even a buffer overflow doesn't always cause an access violation, because memory is assigned to the program in pages of at least 4 kB and the heap manager subdivides those pages into the smaller chunks the program asks for. Your small 10-byte buffer can sit at the start of such a page, and writing a thousand bytes to it will be perfectly fine as far as the CPU is concerned, because all that memory was set up for use by the program. However, when your 10-byte buffer is at the end of the page and the next page is not mapped to a physical address, an access violation will occur.
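To make the page argument concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the 4096-byte page size is an assumption passed in as a parameter. It only tells you whether a write leaves the page that holds the start of the buffer; whether that actually faults depends on whether the following page happens to be mapped.

```cpp
#include <cassert>
#include <cstdint>

// Index of the page that contains `addr`, for a given page size.
std::uint64_t page_index(std::uint64_t addr, std::uint64_t page_size) {
    assert(page_size != 0);
    return addr / page_size;
}

// True when writing `write_len` bytes starting at `buf_addr` touches a page
// other than the one that holds the first byte of the buffer.
bool write_leaves_first_page(std::uint64_t buf_addr, std::uint64_t write_len,
                             std::uint64_t page_size) {
    if (write_len == 0) return false;
    const std::uint64_t last_byte = buf_addr + write_len - 1;
    return page_index(last_byte, page_size) != page_index(buf_addr, page_size);
}
```

With a buffer at the start of a page, a 1000-byte write stays inside that page and the CPU sees nothing wrong. With the same buffer 10 bytes before the end of the page, the write spills into the next page and faults if that page is unmapped.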

Win32: Is there a difference between Dr. Watson's full/mini dumps and writing my own?

I have an application that is occasionally crashing in the release build; unfortunately, it looks like it is crashing in a 3rd party DLL. While trying to get a handle on it I've been swimming in a sea of HOW TOs and descriptions of how Windows creates crash dumps.
I was thinking about using this suggested mini-dump:
Getting a dump of a process that crashes on startup
I was planning on leaving this functionality in the code so the dump is always created for my application without having to have the PC set up beforehand. BTW, this application is not for distribution; it will be paired with our own hardware so I'm not concerned about random users having dump files building on their machines if the application happens to crash.
Additional note: all of the code is C/C++.
Is there a difference between what Dr. Watson (drwtsn32.exe) and this code will produce for me?
With Dr. Watson you'll only get the dumps when the Dr. sees that you 'crashed'. Using the dumper API you'll be able to invoke it from any point in the app. E.g., you can trampoline the ordinary asserts to dump instead of showing a dialog. In my experience, once you have dump support in your app you'll find it easier down the road to investigate, troubleshoot and fix various problems, simply because you can produce a full dump (or even a minidump) at any place you see fit in code.
There isn't much difference except that if you create your own minidump you have more control over the level of detail in it. By default Minidumps have the stack and some local variables, but creating your own gives you the option of creating a full memory dump also which may prove to be more useful (though this then may make collection of these dumps more problematic if the memory image is large).
If the crash happens reasonably frequently it may be worth just collecting some minidumps that drwatson (or werfault in Vista onwards) produces for you, as that may give you enough information. If it doesn't then you have the option of adding your own unhandled exception filter. Another thing that can happen is that the minidump you receive is the site of the crash rather than a first chance exception that may have arisen. Creating your own minidumps means that you're more likely to get a stack trace closer to where the problem is.
Another option, if you have a machine that exhibits the problem more often, is to run ADPlus in the background. It will sit and wait until your app crashes or throws exceptions, then produce some useful log files. It does much the same thing as the unhandled exception filter, except that it requires no changes to your app.
The biggest thing to watch out for is that MiniDumpWriteDump has to do memory allocation and file I/O. Calling it from inside the failed process may fail if e.g. heap structures are corrupted.
Calling MiniDumpWriteDump from a helper process works the same as using Dr. Watson, except that you have control over the dump options.
Recommended reading: loader lock deadlock in MiniDumpWriteDump
I don't think so. Although Dr. Watson will generate full or mini dumps, you could use the ntsd debugger instead to get a lot more control over what data is included in the dumps.
Dr. Watson's minidumps are good enough for most things: you get a call stack and variables. If you need more, ntsd has a load of options.
The only benefit of using Dr. Watson is that it comes pre-installed on Windows.

How can I create memory dumps and analyze memory leaks?

I need to get the following to analyze a memory leak issue. How do I do that?
Orphan Block Addresses
Orphan Call Stack
Are there any good resources/tools for finding and fixing memory leaks?
Thanks
If you're on linux, use valgrind. It's your new best friend. I'm not sure what tools are available for Windows.
valgrind --leak-check=full
The Microsoft Application Verifier performs memory analysis similar to valgrind if you are on a Windows platform.
In Windows, you can use the MiniDumpWriteDump function in dbghelp.dll.
How to create minidump for my process when it crashes?
This can be very helpful in tracking down errors in deployed applications because you can use your debug symbols to inspect a minidump made in the field with no debug info. It's not very useful for tracking memory leaks, however.
For memory leaks under Windows (aside from commercial tools like Purify, BoundsChecker and GlowCode, of course) you can use WinDbg from the free Debugging Tools for Windows package, along with Win32 heap tags to track down the source of memory leaks.
http://www.codeproject.com/KB/cpp/MemoryLeak.aspx
http://blogs.msdn.com/alikl/archive/2009/02/15/identifying-memory-leak-with-process-explorer-and-windbg.aspx
Yes, as J. Paulett commented, at least on the Linux platform Valgrind is an excellent starting point.
On Windows I was able to get the necessary details using UIforETW, which is handling the necessary command line arguments for xperf.
This blog post explains everything in great detail: https://randomascii.wordpress.com/2015/04/27/etw-heap-tracingevery-allocation-recorded/
Recording
Step 1: A TracingFlags registry entry is created and set to ‘1’ in the Image File Execution Options for each process name that will be traced to tell the Windows heap to configure itself for tracing when a process with that name is launched. As is always the case with Image File Execution Options the options don’t affect already running processes – only processes launched when the registry key is set are affected.
Step 2: An extra ETW session is created using the “-heap -Pids 0” incantation. This session will record information from processes that had a TracingFlags registry entry of ‘1’ when they started.
The details are a bit messy but now that UIforETW is written I don’t have to bother explaining the details, and you don’t have to pretend to listen. If you want to record a heap trace use UIforETW, and if you want to know how it works then look at the code, or click the Show commands button to see most of the dirty laundry.
Analysis
The recording can be inspected with WPA (Windows Performance Analyzer) which can be conveniently launched from UIforETW.
The recommended columns are: Process, Handle, Type, Stack.
The allocation types are:
AIFO – Allocated Inside Freed Outside (hint, hint)
AOFI – Allocated Outside Freed Inside
AOFO – Allocated Outside Freed Outside
AIFI – Allocated Inside Freed Inside

Software to track several memory errors in old project?

I have been programming a game for two years.
Sometimes memory errors (e.g. a function returning junk instead of what it was supposed to return, or a crash that only happens on Linux and never happens under GDB or on Windows) occur seemingly at random. That is, I try to fix one, and some months later the same errors return to haunt me.
Is there software (not Valgrind, I already tried it and it does not find the errors) that can help me with this problem? Or a method for solving these errors? I want to fix them permanently.
On Windows, you can automatically capture a crashing exception in a production environment and analyze it as if the error occurred on your developer PC under the debugger. This is done using a "mini-dump" file. You basically use the Windows "dbghelp.dll" DLL to generate a copy of the thread stacks, parts or all of the heap, the register values, the loaded modules, and the unhandled exception that resulted in the crash. You can launch this ".dmp" file in the MS Visual Studio debugger as if it were an executable and it will show you exactly where the crash occurred.
You can set up a trap for unhandled exceptions and delegate the creation of the mini-dump file to dbghelp.dll in that trap. You need to keep the ".pdb" files that were generated with the deployed binaries to match up memory addresses with source code locations for a better debugging experience. This topic is too deep to fully cover here. See Microsoft's documentation on this DLL.
You do need to be able to copy the .dmp file from the PC where it crashed to your development environment to fully debug it. If you have a hands-off relationship with your users, you'll need the option of having a separate utility app "phone home" over the internet to transfer the .dmp file to a location where you can access it. You can launch the app from the unhandled exception trap after the .dmp file has been generated. For user privacy, you should give the user the option of whether or not to do this.
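As a rough sketch of such a trap (not a complete crash reporter), the code below installs an unhandled exception filter that calls MiniDumpWriteDump. The names format_dump_name, write_minidump and install_minidump_filter are hypothetical, as is the "<app>_<pid>.dmp" naming scheme. The file name is prepared at install time so the crash path itself does no string handling. Even so, MiniDumpWriteDump runs inside the failing process here, so it can still fail if the process state is badly damaged. The Windows-specific part is guarded so the helper compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not a Windows API): formats "<app>_<pid>.dmp" into a
// caller-supplied buffer. Returns true if the whole name fit.
bool format_dump_name(char* out, std::size_t out_size, const char* app,
                      unsigned long pid) {
    assert(out != nullptr && app != nullptr);
    const int n = std::snprintf(out, out_size, "%s_%lu.dmp", app, pid);
    return n > 0 && static_cast<std::size_t>(n) < out_size;
}

#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

static char g_dump_name[MAX_PATH];  // filled in before any crash happens

// Called by Windows when no other handler takes the exception.
static LONG WINAPI write_minidump(EXCEPTION_POINTERS* info) {
    HANDLE file = CreateFileA(g_dump_name, GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei;
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = info;
        mei.ClientPointers = FALSE;
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, nullptr, nullptr);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;  // let the process terminate
}

bool install_minidump_filter(const char* app) {
    if (!format_dump_name(g_dump_name, sizeof g_dump_name, app,
                          GetCurrentProcessId()))
        return false;
    SetUnhandledExceptionFilter(write_minidump);
    return true;
}
#else
// The technique is Windows-only; elsewhere this is a no-op.
bool install_minidump_filter(const char*) { return false; }
#endif
```

Calling install_minidump_filter("myapp") early in main would be the intended use. MiniDumpNormal keeps the file small; a richer MINIDUMP_TYPE trades size for detail.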
The Totalview debugger (commercial software) may catch the crash.
Purify (commercial software) can help you find memory leaks.
Does your code compile free of compiler warnings? Did you run lint?
One thing you could try is using the Hans Boehm GC with your project. It can be used as a leak detector, allowing you to remove suspicious-looking free() or delete statements and easily see whether they cause memory leaks.
AFAIK, BoundsChecker on Windows does a very good job. In one of my projects, it caught some very weird errors.
To avoid this in my own projects (on Windows), I wrote my own memory allocator which simply called VirtualAlloc and VirtualFree. It allocated an extra page for each request, aligned it just to the left of the last page, and used VirtualProtect to generate an exception whenever the last page was accessed. This detected out-of-bounds accesses, even just reads, on the spot.
Disclaimer: I was by no means the first to have this idea.
For example, if pages are 4096 bytes, and new int[1] was called, the allocator would:
Allocate 8192 bytes (4 bytes are needed, which is one page, and the extra guard page brings the total to 2 pages)
Mark the last page inaccessible
Determine the address to return (the last allocated page starts at 4096... 4096 - 4 = 4092)
The following code:
int main() {
    int *array = new int[10];
    return array[10];  // one element past the end: lands in the guard page
}
would then generate an access violation on the spot.
It also had a (compile-time) option to detect accesses beyond the left side of the allocation (ie, array[-1]), but these kinds of errors seemed rare, so I never used the option.
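The layout arithmetic from the worked example above can be sketched as follows. GuardLayout, guard_layout and guarded_alloc are illustrative names, not the original author's code. The Windows part uses VirtualAlloc and VirtualProtect as described and is guarded so the arithmetic compiles anywhere. Freeing is left out, and the returned pointer is not rounded for alignment, which a real allocator would have to consider.

```cpp
#include <cassert>
#include <cstddef>

struct GuardLayout {
    std::size_t total_bytes;   // data pages plus one guard page
    std::size_t guard_offset;  // where the inaccessible page starts
    std::size_t user_offset;   // where the caller's block starts
};

// Places the block so that its last byte sits directly before the guard page.
GuardLayout guard_layout(std::size_t request, std::size_t page_size) {
    assert(request > 0 && page_size > 0);
    const std::size_t data_pages = (request + page_size - 1) / page_size;
    const std::size_t guard_offset = data_pages * page_size;
    return {guard_offset + page_size, guard_offset, guard_offset - request};
}

#ifdef _WIN32
#include <windows.h>

// Sketch only: allocates `request` bytes followed by a PAGE_NOACCESS page.
void* guarded_alloc(std::size_t request) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const GuardLayout l = guard_layout(request, si.dwPageSize);
    char* base = static_cast<char*>(VirtualAlloc(
        nullptr, l.total_bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (!base) return nullptr;
    DWORD old_protect = 0;
    if (!VirtualProtect(base + l.guard_offset, si.dwPageSize, PAGE_NOACCESS,
                        &old_protect)) {
        VirtualFree(base, 0, MEM_RELEASE);
        return nullptr;
    }
    return base + l.user_offset;
}
#endif
```

For a 4-byte request with 4096-byte pages this gives 8192 bytes in total, a guard page at offset 4096, and a returned block at offset 4092, so any access past the block touches the guard page immediately.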

Heisenbug: WinApi program crashes on some computers

Please help! I'm really at my wits' end.
My program is a little personal notes manager (google for "cintanotes").
On some computers (and of course I own none of them) it crashes with an unhandled exception just after start.
Nothing special about these computers could be said, except that they tend to have AMD CPUs.
Environment: Windows XP, Visual C++ 2005/2008, raw WinApi.
Here is what is certain about this "Heisenbug":
1) The crash happens only in the Release version.
2) The crash goes away as soon as I remove all GDI-related stuff.
3) BoundsChecker has no complaints.
4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?
Any ideas would be greatly appreciated!
UPDATE: I've managed to get the app debugged on a "faulty" PC. The results:
"Unhandled exception at 0x0044a26a in CintaNotes.exe: 0xC000001D: Illegal Instruction."
and code breaks on
0044A26A cvtsi2sd xmm1,dword ptr [esp+14h]
So it seems that the problem was in the "Code Generation/Enable Enhanced Instruction Set" compiler option. It was set to "/arch:SSE2" and was crashing on the machines that didn't support SSE2. I've set this option to "Not Set" and the bug is gone. Phew!
Thank you all very much for help!!
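For builds that really do need SSE2, one possible safeguard is to check the CPU at startup and refuse to run with a clear message instead of dying on an illegal instruction. The sketch below is an assumption-laden illustration: cpu_has_sse2 and startup_exit_code are made-up names. It uses IsProcessorFeaturePresent on Windows and __builtin_cpu_supports on GCC/Clang x86 targets. Note that the check must itself live in code that does not execute SSE2 instructions before it runs, which a compiler given /arch:SSE2 does not guarantee.

```cpp
#include <cassert>

#if defined(_WIN32)
#include <windows.h>
#endif

// Returns true if the CPU reports SSE2 support.
bool cpu_has_sse2() {
#if defined(_WIN32)
    return IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE) != 0;
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;  // not an x86 target, so SSE2 does not apply
#endif
}

// Hypothetical startup policy: 0 means continue, 1 means exit with a message.
int startup_exit_code(bool build_needs_sse2, bool cpu_sse2) {
    return (build_needs_sse2 && !cpu_sse2) ? 1 : 0;
}
```

A program would call startup_exit_code(true, cpu_has_sse2()) at the top of main and show an explanatory message when it returns 1.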
4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?
What is the underlying code in the executable / assembly? Declaration of int is no code at all, and as such cannot crash. Do you initialize the int somehow?
To see the code where the crash happened you should perform what is called a postmortem analysis.
Windows Error Reporting
If you want to analyse the crash, you should get a crash dump. One option for this is to register for Windows Error Reporting - requires some money (you need a digital code signing ID) and some form filling. For more visit https://winqual.microsoft.com/ .
Get the crash dump intended for WER directly from the customer
Another option is to get in touch with a user who is experiencing the crash and get the crash dump intended for WER from them directly. The user can do this by clicking on Technical details before sending the crash to Microsoft; the crash dump file location can be checked there.
Your own minidump
Another option is to register your own exception handler, handle the exception and write a minidump anywhere you wish. Detailed description can be found at Code Project Post-Mortem Debugging Your Application with Minidumps and Visual Studio .NET article.
So it doesn't crash in the DEBUG configuration? Many things differ from a RELEASE configuration:
1.) Initialization of globals
2.) Actual machine Code generated etc..
So first step is find out what are exact settings for each parameter in the RELEASE mode as compared to the DEBUG mode.
-AD
1) The crash happens only in the Release version.
That's usually a sign that you're relying on some behaviour that's not guaranteed, but happens to be true in the debug build. For example, if you forget to initialize your variables, or access an array out of bounds. Make sure you've turned on all the compiler checks (/RTCsuc). Also check things like relying on the order of evaluation of function parameters (which isn't guaranteed).
2) The crash goes away as soon as I remove all GDI-related stuff.
Maybe that's a hint that you're doing something wrong with the GDI related stuff? Are you using HANDLEs after they've been freed, for example?
Download the Debugging Tools for Windows package. Set the symbol paths correctly, then run your application under WinDbg. At some point, it will break with an Access Violation. Then you should run the command "!analyze -v", which is quite smart and should give you a hint about what's going wrong.
Most heisenbugs / release-only bugs are due to either flow of control that depends on reads from uninitialised memory / stale pointers / past end of buffers, or race conditions, or both.
Try overriding your allocators so they zero out memory when allocating. Does the problem go away (or become more reproducible?)
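One way to try this in C++ is to replace the global operator new so that every block comes back zero-filled. This is a minimal sketch under the assumption that the allocations in question go through the default operator new; malloc callers, custom pools, and over-aligned allocations are not covered. all_zero is a made-up helper for checking the effect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Replacement global allocator: every block is zero-filled via calloc.
void* operator new(std::size_t size) {
    if (size == 0) size = 1;  // calloc may return null for a zero-byte request
    void* p = std::calloc(1, size);
    if (!p) throw std::bad_alloc();
    return p;
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Helper for experiments: true if the first n bytes at p are all zero.
bool all_zero(const void* p, std::size_t n) {
    assert(p != nullptr);
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (std::size_t i = 0; i < n; ++i) {
        if (bytes[i] != 0) return false;
    }
    return true;
}
```

If the crash disappears with this in place, a read of uninitialised heap memory is a likely suspect. If it becomes reliably reproducible, a null or zero value is now being used where garbage used to be.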
Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?
Stack overflow! ;)
4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?
I've found the cause of numerous "strange crashes" to be dereferencing a broken this pointer inside a member function of the object in question.
What does the crash say? Access violation? Exception? That would be a further clue for solving this.
Ensure you have no preceding memory corruption, using PageHeap.exe.
Ensure you have no stack overflow (CBig array[1000000]).
Ensure that you have no uninitialized memory.
Further, you can also run the release version inside the debugger once you generate debug symbols for the process (not the same as creating a debug version). Step through and see if you are getting any warnings in the debugger trace window.
"4) Writing a log shows that the crash happens on a declaration of a local int variable! How could that be? Memory corruption?"
This could be a sign that the hardware is in fact faulty or being pushed too hard. Find out if they've overclocked their computer.
When I get this type of thing, I try running the code through Gimpel's PC-Lint (static code analysis), as it checks for different classes of errors than BoundsChecker. If you are using BoundsChecker, turn on the memory poisoning options.
You mention AMD CPUs. Have you investigated whether there is a similar graphics card / driver version and / or configuration in place on the machines that crash? Does it always crash on these machines or just occasionally? Maybe run the System Information tool on these machines and see what they have in common.
Sounds like stack corruption to me. My favorite tool to track those down is IDA Pro. Of course, you don't have that kind of access to the user's machine.
Some memory checkers have a hard time catching stack corruption (if it is indeed that). The surest way to catch it, I think, is runtime analysis.
This can also be due to corruption in an exception path, even if the exception was handled. Do you debug with 'catch first-chance exceptions' turned on? You should as long as you can. It does get annoying after a while in many cases.
Can you send those users a checked version of your application? Check out Minidump: handle that exception and write out a dump. Then use WinDbg to debug on your end.
Another method is writing very detailed logs. Create a "Log every single action" option, and ask the user to turn that on and send the log to you. Dump out memory to the logs. Check out '_CrtDbgReport()' on MSDN.
Good Luck!
EDIT:
Responding to your comment: An error on a local variable declaration is not surprising to me. I've seen this a lot. It's usually due to a corrupted stack.
Some variable on the stack may be running over its boundaries, for example. All hell breaks loose after that. Then stack variable declarations throw random memory errors, virtual tables get corrupted, etc.
Any time I've seen those for a prolonged period of time, I've had to go to IDA Pro. Detailed runtime disassembly debugging is the only thing I know that really gets those reliably.
Many developers use WinDbg for this kind of analysis. That's why I also suggested Minidump.
Try Rational (IBM) PurifyPlus. It catches a lot of errors that BoundsChecker doesn't.