The process crashes intermittently on Windows 7. I ran the !analyze -v command in WinDbg to analyze the exception, and it reported the information below. The exception is actually raised inside the WaitForSingleObject function, which is called by IrsSim!IrsNet_BlockOutput. WinDbg's exception analysis classified it as an INVALID_POINTER_READ error.
In the calling code, pChannel->hMutex is not NULL. I dumped it and checked its value.
IRSNETRET IrsNet_BlockOutput(IRSNET *pChannel)
{
    // Check channel
    IRSNET_CHECK_CHANNEL(pChannel);
    // Wait for synchronization mutex
    switch (WaitForSingleObject(pChannel->hMutex, INFINITE))
    {
    ...
    }
<<<<<==========
FAULTING_IP:
IrsSim!Channel::SendIrsMessage+285 [s:\som5\ics\scsv\isv\test.u\irssim\irsiftransport.cpp # 539]
00520ed5 8b06 mov eax,dword ptr [esi]
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 77db4639 (ntdll!RtlDeactivateActivationContextUnsafeFast+0x00000058)
ExceptionCode: c0150010
ExceptionFlags: 00000001
NumberParameters: 3
Parameter[0]: 00000000
Parameter[1]: 07befc58
Parameter[2]: 00000000
DEFAULT_BUCKET_ID: INVALID_POINTER_READ
PROCESS_NAME: IrsSim.exe
ERROR_CODE: (NTSTATUS) 0xc0150010 - The activation context being deactivated is not active for the current thread of execution.
EXCEPTION_CODE: (NTSTATUS) 0xc0150010 - The activation context being deactivated is not active for the current thread of execution.
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 07befc58
EXCEPTION_PARAMETER3: 00000000
STACK_TEXT:
07d2fce0 00520ed5 irssim!Channel::SendIrsMessage+0x285
07d2fd1c 00521072 irssim!CChannelArray::SendIrsMessage+0x132
07d2fd50 0052208a irssim!CNetLibInterface::SendIrsMessage+0xba
07d2fd78 005c01b6 irssim!CSendActivity::Execute+0x76
07d2fdac 005e0b3f irssim!SimulationThreadState::ExecuteOneActivity+0x11f
07d2fdf8 005cc937 irssim!CSimulationSubThreadState::ExecuteState+0x267
07d2fe8c 005ccf02 irssim!ThreadFctSubSimulation+0xf2
07d2fec4 73b1e3ee mfc90u!_AfxThreadEntry+0xf2
07d2ff4c 739f3433 msvcr90!_endthreadex+0x44
07d2ff84 739f34c7 msvcr90!_endthreadex+0xd8
07d2ff90 767d339a kernel32!BaseThreadInitThunk+0xe
07d2ff9c 77d69ed2 ntdll!__RtlUserThreadStart+0x70
07d2ffdc 77d69ea5 ntdll!_RtlUserThreadStart+0x1b
================================
After that, I used the !teb command to try to get more stack information.
0:011> k L=07beec2c 100
ChildEBP RetAddr
07bef54c 76be0bdd ntdll!NtWaitForMultipleObjects+0x15
07bef5e8 767d1a2c KERNELBASE!WaitForMultipleObjectsEx+0x100
07bef630 767d4208 kernel32!WaitForMultipleObjectsExImplementation+0xe0
07bef64c 767f80a4 kernel32!WaitForMultipleObjects+0x18
07bef6b8 767f7f63 kernel32!WerpReportFaultInternal+0x186
07bef6cc 767f7858 kernel32!WerpReportFault+0x70
07bef6dc 767f77d7 kernel32!BasepReportFault+0x20
07bef768 77da21d7 kernel32!UnhandledExceptionFilter+0x1af
07bef770 77da20b4 ntdll!__RtlUserThreadStart+0x62
07bef784 77da1f59 ntdll!_EH4_CallFilterFunc+0x12
07bef7ac 77d76ab9 ntdll!_except_handler4+0x8e
07bef7d0 77d76a8b ntdll!ExecuteHandler2+0x26
07bef7f4 77d76a2d ntdll!ExecuteHandler+0x24
07bef880 77d40143 ntdll!RtlDispatchException+0x127
07bef880 77db4639 ntdll!KiUserExceptionDispatcher+0xf
07befc34 76be0ad7 ntdll!RtlDeactivateActivationContextUnsafeFast+0x58
07befc38 76be0abc KERNELBASE!WaitForSingleObjectEx+0xde
07befc98 767d1194 KERNELBASE!WaitForSingleObjectEx+0xc3
07befcb0 767d1148 kernel32!WaitForSingleObjectExImplementation+0x75
07befcc4 005e3b6e kernel32!WaitForSingleObject+0x12
07befcd4 00520d3b IrsSim!IrsNet_BlockOutput+0x1e
07befd14 00521072 IrsSim!Channel::SendIrsMessage+0xeb
07befd48 0052208a IrsSim!CChannelArray::SendIrsMessage+0x132
07befd70 005c01b6 IrsSim!CNetLibInterface::SendIrsMessage+0xba
07befda4 005e0b3f IrsSim!CSendActivity::Execute+0x76
07befdf0 005cc937 IrsSim!SimulationThreadState::ExecuteOneActivity+0x11f
07befe84 005ccf02 IrsSim!CSimulationSubThreadState::ExecuteState+0x267
07befebc 73b1e3ee IrsSim!ThreadFctSubSimulation+0xf2
07beff44 739f3433 mfc90u!_AfxThreadEntry+0xf2
07beff7c 739f34c7 msvcr90!_endthreadex+0x44
07beff88 767d339a msvcr90!_endthreadex+0xd8
07beff94 77d69ed2 kernel32!BaseThreadInitThunk+0xe
07beffd4 77d69ea5 ntdll!__RtlUserThreadStart+0x70
07beffec 00000000 ntdll!_RtlUserThreadStart+0x1b
====================================>>>>>>
This looks a lot like the 0xC015000f exception encountered in MFC applications ("The activation context being deactivated is not the most recently activated one.")
In all cases where I have encountered this exception, the exception is not the primary issue. It is a side effect of an earlier exception, usually an access violation, where the stack is not unwound properly. Somewhere a call frame that used a macro such as the AFX_MANAGE_STATE macro is missed in the exception handling. The result is that the next time the activation context is manipulated, say by another routine that results in a call to something like AFX_MAINTAIN_STATE2::~AFX_MAINTAIN_STATE2, the system detects a cookie mismatch and throws the exception.
In your case, you may be causing an exception (most likely an AV) in one piece of code that later manifests as the context exception. To trap the root cause, run the debugger with first-chance exception handling enabled. That way, the AV that is being swallowed elsewhere up the call stack, perhaps by someone using a try/catch(...), will be exposed. Since you appear to be using threads, you may simply have a race condition on a memory access that causes the primary exception (if that is indeed what is happening).
I see in a previous post:
"In fact, this problem comes from porting the program from 64-bit Win XP to 64-bit Win7. The compiler is switched therefore from VC6 to VC9. "
This is not a bug in MFC. MFC 6 did not include the cookie-based activation context switching code, which was added, I think, in Visual Studio 2005, so you would not have encountered this exception before. We too thought the newer MFC had issues, but in every case we have encountered, it was our code that caused the problem. The original problems were masked by code flows that started with a try/catch (usually catch(...)), then called code that used one of the MFC manage-state macros, which in turn called more code where the AV would eventually occur. Since the catch was far up the stack and, depending on the corruption, not all frames were unwound properly, the back side of the MFC macros was missed (some destructor failed to pop its context). To make matters worse for debugging, the eventual context crash can occur anywhere in your code (we experienced a lot of them in CWnd's base window message processing routing method). We eventually created another tool for users to run that attached itself as a debugger to our release executable, trapped first-chance exceptions, and created a .dmp file so we could find the initial point where the exception occurred. A dump of the context exception itself was almost never useful, because the original source of the problem was long past execution.
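The cookie mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ActivationStack, ScopedActivation and DeactivateResult are hypothetical names invented for illustration, and the real Windows activation context stack is more involved. The model only shows why a guard whose destructor never runs leaves a stale entry behind, so that a later and perfectly innocent deactivation is the one that fails.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only (hypothetical names, not the Windows implementation).
enum class DeactivateResult {
    Ok,
    NotMostRecent,  // cookie is on the stack but not on top (compare 0xC015000F)
    NotActive       // cookie is not on the stack at all (compare 0xC0150010)
};

class ActivationStack {
public:
    // Push a new activation and hand its cookie back to the caller.
    std::uint32_t activate() {
        cookies_.push_back(next_cookie_);
        return next_cookie_++;
    }

    // Pop the activation only if `cookie` is the most recent one.
    DeactivateResult deactivate(std::uint32_t cookie) {
        if (!cookies_.empty() && cookies_.back() == cookie) {
            cookies_.pop_back();
            return DeactivateResult::Ok;
        }
        const bool on_stack =
            std::find(cookies_.begin(), cookies_.end(), cookie) != cookies_.end();
        return on_stack ? DeactivateResult::NotMostRecent
                        : DeactivateResult::NotActive;
    }

    std::size_t depth() const { return cookies_.size(); }

private:
    std::vector<std::uint32_t> cookies_;
    std::uint32_t next_cookie_ = 1;
};

// RAII guard playing the role of a manage-state macro: activate in the
// constructor, deactivate in the destructor.
class ScopedActivation {
public:
    explicit ScopedActivation(ActivationStack& stack)
        : stack_(stack), cookie_(stack.activate()) {}
    ~ScopedActivation() {
        const DeactivateResult result = stack_.deactivate(cookie_);
        assert(result == DeactivateResult::Ok);
        (void)result;  // avoid an unused-variable warning when NDEBUG is set
    }
    ScopedActivation(const ScopedActivation&) = delete;
    ScopedActivation& operator=(const ScopedActivation&) = delete;

private:
    ActivationStack& stack_;
    std::uint32_t cookie_;
};
```

In this model, if an inner guard's destructor is skipped, the outer guard's cookie is no longer on top and deactivate reports NotMostRecent. A cookie that is not on the stack at all reports NotActive, which corresponds to the message text of the 0xc0150010 code in the question.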
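The only way that call can fail in that manner is if
pChannel->hMutex
is invalid. Either pChannel itself is invalid, or hMutex is. Most likely the former.
You should check that the handle is actually valid, not simply that it is not NULL. For handles returned by functions such as CreateFile, that looks like:
if (myHandle != INVALID_HANDLE_VALUE)
{
    // do something
}
Functions such as CreateFile return this value if there is an error. Note that CreateMutex returns NULL on failure instead, so compare against the failure value documented for whichever function created the handle.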
This looks like a problem in activation context deactivation (based on the WinDbg dump). Refer to the article at http://blogs.msdn.com/b/junfeng/archive/2006/03/19/sxs-activation-context-activate-and-deactivate.aspx.
At least one user of my software has encountered a very strange crash after a Windows 10 update. This crash always happens in the same place, and it appears as if the IDirect3DDevice9 has been destroyed or invalidated in some way during a previous call.
There is nothing else in the program that would release or destroy this device prematurely, and there are no other threads that could possibly interfere. The user has said updating their video drivers did not fix the problem, and their graphics card is an Nvidia GTX 1060 6GB, so a little older but by no means a potato.
IDirect3DSurface9 *s;
HRESULT hr = m_d3dDevice->GetBackBuffer(0,0,D3DBACKBUFFER_TYPE_MONO,&s);
if(FAILED(hr)) {
...
return;
}
// crash happens here, when pushing m_d3dDevice to the stack before the call
m_d3dDevice->SetRenderTarget(0,s);
The above code crashes before calling SetRenderTarget. The m_d3dDevice value is read successfully from this, but when the pointer is dereferenced again to get the vftable, the program crashes. Here's the disassembly:
mov eax, [edi+1Ch] ; read m_d3dDevice
push [ebp+var_E0] ; push s
push 0 ; push 0
mov ecx, [eax] ; load vftable; crashes here
push eax ; push m_d3dDevice (this)
call dword ptr [ecx+94h] ; call SetRenderTarget
The call to GetBackBuffer() completes successfully just before this point. Without a successful completion, it would bail out of the function. Nothing else in my code could possibly be destroying the device, or the object this code belongs to, during this time.
Also, I should mention that this code is in a final presentation routine, which is usually only called after other rendering steps have been done. (After SetRenderTarget(), a temporary surface that was used for all the drawing is rendered to the back buffer using a special shader for upscaling, before Present() is called.) Just prior to this code being called, the device has been confirmed to still be active via TestCooperativeLevel(), so this code will not be reached if the device is not ready to do any of this.
As far as I know this crash does not happen to every user, only to some (one confirmed, possibly two). Is it possible, even perhaps likely, that some other program on their system is the issue? I don't know why it would appear out of the blue even if so, but I have no idea why the device is destroyed/invalid when the second call happens yet perfectly valid during the first.
Update to this issue: The user who reported this was using MSI Afterburner, which apparently was the cause of this. After shutting down Afterburner, the application ran correctly. I was correct that an outside program was interfering, although it still isn't clear why the Windows 10 update impacted it. This suggests Afterburner does some DirectX hooks.
Ours is a PowerPC-based embedded system running Linux. We are encountering a random SIGILL crash that is seen across a wide variety of applications. The root cause of the crash is that the instruction to be executed has been zeroed out. This indicates corruption of the text segment residing in memory. As the text segment is loaded read-only, the application cannot corrupt it, so I suspect that some common subsystem (DMA?) is causing the corruption. Since the problem takes days to reproduce (crash due to SIGILL), it is difficult to investigate. To begin with, I want to be able to know if and when the text segment of any application has been corrupted.
I have looked at the stack trace, and all the pointers and registers look correct.
Do you guys have any suggestions how I can go about it?
Some Info:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux
(gdb) bt
0 0x10457dc0 in xxx
Disassembly output:
=> 0x10457dc0 <+80>: mr r1,r11
0x10457dc4 <+84>: blr
Instruction expected at address 0x10457dc0: 0x7d615b78
Instruction found after catching SIGILL 0x10457dc0: 0x00000000
(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
Expected (from the application binary):
(gdb) x /32 0x10457da0
0x10457da0 : 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0 : 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0 : 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0 : 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0
Actual (after handling SIGILL and dumping nearby memory locations):
Faulting instruction address: 0x10457dc0
0x10457da0 : 0x913E0000
0x10457db0 : 0x83ABFFF4
=> 0x10457dc0 : 0x00000000
0x10457dd0 : 0x4BCD8BE5
0x10457de0 : 0x93E1000C
Edit:
One lead that we have is that the corruption is always occurring at an offset that ends with 0xdc0.
For example:
Faulting instruction address: 0x10653dc0 << printed by our application after catching SIGILL
Faulting instruction address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
Edit 2: Another lead is that the corruption always occurs at physical frame number 0x00a4d. I suppose that with a PAGE_SIZE of 4096 this translates to a physical address of 0x00A4DDC0. We suspect a couple of our kernel drivers and are investigating further. Is there a better, more efficient approach (such as setting a hardware watchpoint)? How about KASAN, as suggested below?
Any help is appreciated. Thanks.
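The address arithmetic behind the two leads can be written down as a short sketch. It assumes a 4096-byte page, as in Edit 2; page_offset and phys_addr are hypothetical helper names, not kernel APIs.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t kPageSize = 4096;  // PAGE_SIZE assumed in Edit 2

// Offset of an address within its page. Address translation leaves these
// low bits unchanged, so the offset is the same for the virtual and the
// physical address.
inline std::uint64_t page_offset(std::uint64_t addr) {
    return addr & (kPageSize - 1);
}

// Physical address of a byte, given its page frame number and in-page offset.
inline std::uint64_t phys_addr(std::uint64_t pfn, std::uint64_t offset) {
    assert(offset < kPageSize);
    return pfn * kPageSize + offset;
}
```

With the values from the question, page_offset(0x10653dc0) and page_offset(0x0fed6dc0) both give 0xdc0, and phys_addr(0x00a4d, 0xdc0) gives 0x00A4DDC0. Seeing the same in-page offset in every crashing process is consistent with a single physical location being overwritten.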
1.) The text segment is read-only, but the permissions could be changed with mprotect. Check for that if you think it is possible.
2.) If it is a kernel problem:
Run the kernel with the KASAN and UBSAN (undefined behaviour) sanitizers
Focus on driver code that is not included in mainline
The hint here is how small the corruption is. Maybe I'm wrong, but it suggests that DMA is not to blame. It looks like some kind of invalid store.
3.) Hardware. I think your problem looks like a hardware problem (a RAM issue).
You can try decreasing the RAM frequency in the bootloader
Check whether the problem reproduces on stable mainline software; if it does, that points to the hardware
Using the following setup:
Cortex-M3 based µC
gcc-arm cross toolchain
using C and C++
FreeRtos 7.5.3
Eclipse Luna
Segger Jlink with JLinkGDBServer
Code Confidence FreeRtos debug plugin
Using JLinkGDBServer and eclipse as debug frontend, I always have a nice stacktrace when stepping through my code. When using the Code Confidence freertos tools (eclipse plugin), I also see the stacktraces of all threads which are currently not running (without that plugin, I see just the stacktrace of the active thread). So far so good.
But now, when my application falls into a hardfault, the stacktrace is lost.
Well, I know the technique on how to find out the code address which causes the hardfault (as seen here).
But this is very poor information compared to full stacktrace.
OK, sometimes when falling into a hardfault there is no way to retain a stacktrace, e.g. when the stack has been corrupted by the faulty code. But if the stack is healthy, I think that getting a stacktrace might be possible (isn't it?).
I think the reason for losing the stacktrace in a hardfault is that the stack pointer is switched from PSP to MSP automatically by the Cortex-M3 architecture. One idea is to set the MSP to the previous PSP value (and maybe do some additional stack preparation?).
Any suggestions on how to do that or other approaches to retain a stacktrace when in hardfault?
Edit 2015-07-07, added more details.
I use this code to provoke a hardfault:
__attribute__((optimize("O0"))) static void checkHardfault() {
volatile uint32_t* varAtOddAddress = (uint32_t*)-1;
(*varAtOddAddress)++;
}
When stepping into checkHardfault(), my stacktrace looks good like this:
gdb-> backtrace
#0 checkHardfault () at Main.cxx:179
#1 0x100360f6 in GetOneEvent () at Main.cxx:185
#2 0x1003604e in executeMainLoop () at Main.cxx:121
#3 0x1001783a in vMainTask (pvParameters=0x0) at Main.cxx:408
#4 0x00000000 in ?? ()
When I run into the hardfault (at (*varAtOddAddress)++;) and find myself inside HardFault_Handler(), the stacktrace is:
gdb-> backtrace
#0 HardFault_Handler () at Hardfault.c:312
#1 <signal handler called>
#2 0x10015f36 in prvPortStartFirstTask () at freertos/portable/GCC/ARM_CM3/port.c:224
#3 0x10015fd6 in xPortStartScheduler () at freertos/portable/GCC/ARM_CM3/port.c:301
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
The quickest way to get the debugger to give you the details of the state prior to the hard fault is to return the processor to the state prior to the hard fault.
In the debugger, write a script that takes the information from the various hardware registers and restores PC, LR, and R0-R14 to their state just prior to the hard fault, then do your stack dump.
Of course, this isn't always helpful when you end up at the hard fault because of popping stuff off of a blown stack or stomping on stuff in memory. You generally tend to corrupt a bunch of the important registers, return back to some crazy spot in memory, and then execute whatever's there. You can end up hard faulting many thousands (millions?) of cycles after your real problem happens.
Consider using the following gdb macro to restore the register contents:
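define hfstack
    # $sp must point at the exception frame that was stacked on fault entry
    set $frame_ptr = (unsigned *)$sp
    # EXC_RETURN bit 4 set: basic 8-word frame; clear: 26-word frame with FPU state.
    # Arithmetic on an unsigned * already counts in 4-byte words.
    if $lr & 0x10
        set $sp = $frame_ptr + 8
    else
        set $sp = $frame_ptr + 26
    end
    # the stacked LR and PC are words 5 and 6 of the frame
    set $lr = $frame_ptr[5]
    set $pc = $frame_ptr[6]
    bt
end
document hfstack
set the correct stack context after a hard fault on Cortex M
end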
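For reference, the layout that the macro indexes into can be sketched in C++. This is an illustrative decoder, not part of any debugger or FreeRTOS API: StackedFrame, decode_frame, frame_is_on_psp, frame_is_basic and pre_fault_sp are hypothetical names. It assumes the standard Cortex-M exception entry sequence, where the core pushes R0-R3, R12, LR, PC and xPSR in ascending address order.

```cpp
#include <cassert>
#include <cstdint>

// Registers pushed by the core on exception entry (basic frame), in
// ascending address order starting at the stacked SP.
struct StackedFrame {
    std::uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

// EXC_RETURN (the LR value inside the handler), bit 2:
// 1 = frame was pushed on the process stack (PSP), 0 = main stack (MSP).
inline bool frame_is_on_psp(std::uint32_t exc_return) {
    return (exc_return & 0x4u) != 0;
}

// EXC_RETURN bit 4: 1 = basic 8-word frame, 0 = extended frame with FPU state.
inline bool frame_is_basic(std::uint32_t exc_return) {
    return (exc_return & 0x10u) != 0;
}

// Interpret eight consecutive words read from the frame address.
inline StackedFrame decode_frame(const std::uint32_t words[8]) {
    return StackedFrame{words[0], words[1], words[2], words[3],
                        words[4], words[5], words[6], words[7]};
}

// Stack pointer value just before the fault: skip the frame (8 or 26 words)
// plus the 4-byte alignment padding flagged by bit 9 of the stacked xPSR.
inline std::uint32_t pre_fault_sp(std::uint32_t frame_addr,
                                  const StackedFrame& frame, bool basic) {
    assert((frame_addr & 0x3u) == 0);  // frames are word aligned
    std::uint32_t sp = frame_addr + (basic ? 8u : 26u) * 4u;
    if (frame.xpsr & (1u << 9)) {
        sp += 4u;
    }
    return sp;
}
```

In a FreeRTOS task the faulting code normally runs on the PSP, so frame_is_on_psp would be true for the handler's LR value, and the frame would have to be read from PSP rather than from the handler's own stack pointer.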
Hello and good day to you.
Need a bit of assistance here:
Situation:
I have an obscure DirectX 9 application (name and application details are irrelevant to the question) that causes blue screen of death on all nvidia cards (GeForce 8400GS and up) since certain driver version. I believe that the problem is indirectly caused by DirectX 9 call or a flag that triggers driver bug.
Goal:
I'd like to track down offending flag/function call (for fun, this isn't my job/homework) and bypass error condition by writing proxy dll. I already have a finished proxy dll that provides wrappers for IDirect3D9, IDirect3DDevice9, IDirect3DVertexBuffer9 and IDirect3DIndexBuffer9 and provides basic logging/tracing of Direct3D calls. However, I can't pinpoint function which causes crash.
Problems:
No source code or technical support is available. There will be no assistance, and nobody else will fix the problem.
The memory dump produced by the kernel wasn't helpful: apparently an access violation happens within nv4_disp.dll, but I can't use the stack trace to get back to the IDirect3DDevice9 method call, and there's a chance that the bug happens asynchronously.
(Main problem) Because of the large number of IDirect3DDevice9 method calls, I can't reliably log them to a file or over the network:
Logging to a file causes a significant slowdown even without flushing, and without flushing the last contents of the log are lost when the system BSODs.
Logging over the network (using UDP and Winsock's sendto) also causes a significant slowdown and must not be done asynchronously (asynchronous packets are lost on BSOD); in addition, the packets around the crash are sometimes lost even when sent synchronously.
When the application is slowed down by the logging routines, the BSOD is less likely to happen, which makes tracking it down harder.
Question:
I normally don't write drivers and don't do this level of debugging, so I have the impression that I'm missing something important and that there's a simpler way to track down the problem than writing an IDirect3DDevice9 proxy dll with a custom logging mechanism. What is it? What is the standard way of diagnosing/handling/fixing a problem like this (no source code, COM interface method triggers BSOD)?
Minidump analysis(WinDBG):
Loading User Symbols
Loading unloaded module list
...........
Unable to load image nv4_disp.dll, Win32 error 0n2
*** WARNING: Unable to verify timestamp for nv4_disp.dll
*** ERROR: Module load completed but symbols could not be loaded for nv4_disp.dll
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck 1000008E, {c0000005, bd0a2fd0, b0562b40, 0}
Probably caused by : nv4_disp.dll ( nv4_disp+90fd0 )
Followup: MachineOwner
---------
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: bd0a2fd0, The address that the exception occurred at
Arg3: b0562b40, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
FAULTING_IP:
nv4_disp+90fd0
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi
TRAP_FRAME: b0562b40 -- (.trap 0xffffffffb0562b40)
ErrCode = 00000000
eax=00000808 ebx=e37f8200 ecx=e4ae1c68 edx=e37f8328 esi=e37f8400 edi=00000000
eip=bd0a2fd0 esp=b0562bb4 ebp=e37e09c0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202
nv4_disp+0x90fd0:
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi ds:0023:00000900=????????
Resetting default scope
CUSTOMER_CRASH_COUNT: 3
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x8E
LAST_CONTROL_TRANSFER: from bd0a2e33 to bd0a2fd0
STACK_TEXT:
WARNING: Stack unwind information not available. Following frames may be wrong.
b0562bc4 bd0a2e33 e37f8200 e37f8200 e4ae1c68 nv4_disp+0x90fd0
b0562c3c bf8edd6b b0562cfc e2601714 e4ae1c58 nv4_disp+0x90e33
b0562c74 bd009530 b0562cfc bf8ede06 e2601714 win32k!WatchdogDdDestroySurface+0x38
b0562d30 bd00b3a4 e2601008 e4ae1c58 b0562d50 dxg!vDdDisableSurfaceObject+0x294
b0562d54 8054161c e2601008 00000001 0012c518 dxg!DxDdDestroySurface+0x42
b0562d54 7c90e4f4 e2601008 00000001 0012c518 nt!KiFastCallEntry+0xfc
0012c518 00000000 00000000 00000000 00000000 0x7c90e4f4
STACK_COMMAND: kb
FOLLOWUP_IP:
nv4_disp+90fd0
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: nv4_disp+90fd0
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nv4_disp
IMAGE_NAME: nv4_disp.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4e390d56
FAILURE_BUCKET_ID: 0x8E_nv4_disp+90fd0
BUCKET_ID: 0x8E_nv4_disp+90fd0
Followup: MachineOwner
nv4_disp+90fd0
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi
This is the important part. Looking at this, it is most probable that eax is invalid, so the instruction attempts to access an invalid memory address. The trap frame supports this: eax=00000808, and the faulting access is at eax+0F8h = 00000900.
What you need to do is load nv4_disp.dll into IDA (you can get a free version), check the image base that IDA loads nv4_disp at, and press 'g' to go to an address. Adding 90fd0 to the image base IDA is using should take you directly to the offending instruction (depending on section structure).
From here you can analyze the control flow, and how eax is set and used. If you have a good kernel level debugger you can set a breakpoint on this address and try and get it to hit.
Analysing the function, you should attempt to figure out what the function does, what eax is meant to be pointing to at that point, what its actually pointing to, and why. This is the hard part and is a great part of the difficulty and skill of reverse engineering.
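The address arithmetic for locating the instruction can be sketched as follows. The load base is derived from the dump (nv4_disp+90fd0 resolved to bd0a2fd0); the new base passed to rebase is a placeholder for whatever image base your disassembler reports, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Base at which the module was loaded, given an address inside it and
// that address's offset from the start of the module.
inline std::uint32_t load_base(std::uint32_t addr, std::uint32_t module_offset) {
    assert(addr >= module_offset);
    return addr - module_offset;
}

// Translate an address from the base used in the crash dump to the base
// used by the disassembler.
inline std::uint32_t rebase(std::uint32_t addr, std::uint32_t old_base,
                            std::uint32_t new_base) {
    assert(addr >= old_base);
    return new_base + (addr - old_base);
}

// Address touched by an operand of the form [reg + displacement].
inline std::uint32_t effective_address(std::uint32_t reg,
                                       std::uint32_t displacement) {
    return reg + displacement;
}
```

For the dump above, load_base(0xbd0a2fd0, 0x90fd0) is 0xbd012000, and effective_address(0x808, 0xF8) is 0x900, which matches the faulting address ds:0023:00000900 shown in the trap frame.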
Found a solution.
Problem:
Logging is unreliable since messages (when dumped to file) disappear during bsod, packets are sometimes lost when logging over network, and there's slowdown due to logging.
Solution:
Instead of logging to a file or over the network, configure the system to produce a full physical memory dump on BSOD and log all messages into an in-memory buffer. It'll be faster. Once the system has crashed, it will dump the entire memory into a file, and you'll be able to either view the contents of the log buffer using WinDbg's dt command (if you have debug symbols), or search for and locate the log stored in memory using the "memory" view.
I used a circular buffer of std::strings to store messages and a separate array of const char* to make things easier to read in WinDbg, but you could simply create a huge array of char and store all messages in it as plain text.
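A minimal sketch of the simpler variant, a fixed array of char slots used as a ring buffer. The names MemLog, g_memlog and memlog_write and the slot sizes are assumptions made for illustration; this is not the code that was actually used. It is not thread-safe, so calls from multiple threads would need a lock or an atomic index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kSlotCount = 1024;  // number of messages kept (assumed)
constexpr std::size_t kSlotSize = 128;    // bytes per message incl. terminator (assumed)

struct MemLog {
    char slots[kSlotCount][kSlotSize];
    std::size_t next;  // total messages written; next slot is next % kSlotCount
};

// Global, so the buffer lives in the module's data section and can be
// looked up by name in the dump.
MemLog g_memlog = {};

// Store one message in the next slot, overwriting the oldest entry once
// the buffer has wrapped. Overlong messages are truncated.
void memlog_write(MemLog& log, const char* msg) {
    assert(msg != nullptr);
    char* slot = log.slots[log.next % kSlotCount];
    // strncpy zero-pads the rest of the slot, which wipes the older message
    std::strncpy(slot, msg, kSlotSize - 1);
    slot[kSlotSize - 1] = '\0';
    ++log.next;
}
```

With symbols loaded, the buffer could then be inspected with dt -b module!g_memlog, where module is the name of your dll; the next field tells you which slot was written last.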
Details:
Entire process on winxp:
Ensure that the minimum page file size is equal to or larger than the total amount of RAM + 1 megabyte. (Right-click "My Computer"->Properties->Advanced->Performance->Advanced->Change)
Configure the system to produce a complete memory dump on BSOD (right-click "My Computer"->Properties->Advanced->Startup and Recovery->Settings->Write Debugging Information. Select "Complete memory dump" and specify the path you want).
Ensure that the disk where the file will be written has the required amount of free space (the total amount of RAM on your system).
Build the app/dll (the one that does the logging) with debug symbols, and trigger the BSOD.
Wait till memory dump is finished, reboot. Feel free to swear at driver developer while system writes memory dump and reboots.
Copy MEMORY.DMP system produced to a safe place, so you won't lose everything if system crashes again.
Launch windbg.
Open Memory Dump (File->Open Crash Dump).
If you want to see what happened, use !analyze -v command.
Access memory buffer that stores logged messages using one of those methods:
To see the contents of a global variable, use dt module!variable, where "module" is the name of your library (without *.dll) and "variable" is the name of the variable. You can use wildcards. You can also use an address instead of module!variable.
To see the contents of one field of the global variable (if the global variable is a struct), use dt module!variable field, where "field" is the member name.
To see more details about a variable (contents of arrays and substructures), use dt -b module!variable field or dt -b module!variable.
If you don't have symbols, you'll need to search for your "logfile" using memory window.
At this point you'll be able to see contents of log that were stored in memory, plus you'll have snapshot of the entire system at the moment when it crashed.
Also...
To see info about process that crashed the system, use !process.
To see loaded modules use lm
For info about thread there's !thread id where id is hexadecimal id you saw in !process output.
It looks like the crash may either be caused by a bad pointer, or heap corruption. You can tell this because the crash occurs in a memory-freeing function (DxDdDestroySurface). Destroying surfaces is something that you absolutely need to do - you can't just stub this out, the surface will still get freed when the program exits, and if you disable it inside the kernel, you'll run out of on-card memory very quickly and crash that way, as well.
You can try to figure out what sequence of events leads up to this heap corruption, but there's no silver bullet here - as fileoffset suggested, you'll need to actually reverse engineer the driver to see why this happens (it may help to compare drivers before and after the offending driver version as well!)
UPDATE
Thanks to feedback below I was able to home in on ADPlus.vbs, which is part of the debugging tools for Windows.
Don't forget to set up _NT_SYMBOL_PATH before you run it.
Using this, we've been able to see into the application with far greater clarity than we ever have using the regular dumps produced by Windows when the application crashes.
Many thanks to all for the responses.
ORIGINAL QUESTION
We have a server application written in Visual C++ that sometimes (relatively rarely) crashes on customer sites. We haven't been able to understand why this happens based on looking at our own log files, so the next step is to start looking at crash dumps.
We've just purposefully put a bug into our app (a null pointer) so that we can generate a crash dump and verify that the dumps produced are valuable, but thus far I can't make head or tail of what I'm seeing.
I think my first question is whether I've even got WinDbg set up correctly (the other developer here is loading the dump into Visual Studio 2010 and seeing the same errors, so I'm assuming it's fine, or we're both wrong :) ), and the next question is how to understand what it's telling me.
The main confusion is that the dump seems to be telling me it has reached a break point, which seems odd to me since there was no debugger connected.
The app was running on a Windows Server 2003 system when it crashed. I believe I have pointed WinDbg at the PDB file for the DLL and EXE correctly.
FAULTING_IP:
ntdll!DbgBreakPoint+0
7c81a3e1 cc int 3
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 7c81a3e1 (ntdll!DbgBreakPoint)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 00000000
Parameter[1]: 8779fdb0
Parameter[2]: 00000003
DEFAULT_BUCKET_ID: STATUS_BREAKPOINT
PROCESS_NAME: CallPlusServerLauncher.exe
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 8779fdb0
EXCEPTION_PARAMETER3: 00000003
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
ADDITIONAL_DEBUG_TEXT: Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[ffffffff]
FAULTING_THREAD: ffffffff
PRIMARY_PROBLEM_CLASS: STATUS_BREAKPOINT
BUGCHECK_STR: APPLICATION_FAULT_STATUS_BREAKPOINT
STACK_TEXT:
1bd0ffc8 7c83fe08 00000005 00000004 00000001 ntdll!DbgBreakPoint
1bd0fff4 00000000 00000000 00000000 00000000 ntdll!DbgUiRemoteBreakin+0x36
FOLLOWUP_IP:
ntdll!DbgBreakPoint+0
7c81a3e1 cc int 3
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: ntdll!DbgBreakPoint+0
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: ntdll
IMAGE_NAME: ntdll.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 49900d60
STACK_COMMAND: ddS 1bd10000 1bd0c000 ; dt ntdll!LdrpLastDllInitializer BaseDllName ; dt ntdll!LdrpFailureData ; ~439s; .ecxr ; kb
BUCKET_ID: MANUAL_BREAKIN
FAILURE_BUCKET_ID: STATUS_BREAKPOINT_80000003_ntdll.dll!DbgBreakPoint
WATSON_STAGEONE_URL: http://watson.microsoft.com/StageOne/CallPlusServerLauncher_exe/0_0_0_0/4df87414/ntdll_dll/5_2_3790_4455/49900d60/80000003/0001a3e1.htm?Retriage=1
Followup: MachineOwner
DbgBreakPoint -- Looks to me like you broke execution using a remote debugger.
If you didn't, then I have seen DbgBreakPoint show up when you have page heap turned on (you should know if you did this) and an invalid memory access was detected.
Asserts can also trigger a breakpoint exception. For example, I have (too often) seen them come out of the heap checking around a delete when the heap has been corrupted by a double delete or an overflow. But I thought that happens only with the debug runtime; is that what you have deployed?