I'm having some issues with regsvr32.exe hanging during an installation. A DLL, let's call it common.dll, is registered as part of the installation process using regsvr32.exe. Common.dll utilises another DLL, utility.dll.
Part of utility.dll contains logging functionality. This logging functionality uses a static 'Timer' object to periodically check log file sizes and split the files accordingly. The Timer object incorporates its own thread, which it uses to fire the timer. The timer object inside the logger is static, so it is shared across multiple logger instances, which use static ofstreams to point to the same file.
The timer has two events, a timer (created using CreateWaitableTimer()) and a standard synchronisation event (CreateEvent()) for triggering thread shutdown. The thread is started in the constructor (_beginthreadex()). Inside the thread function there is a WaitForMultipleObjects() call waiting on both the timer and the shutdown event. The Wait...() is INFINITE, and the thread function returns when the shutdown event is set (SetEvent()).
(The above is provided as background; no part of it can be changed as part of the solution, and all DLL files, the logger and the timer are working properly.)
The issue arises during regsvr32.exe running. It loads up common.dll, which loads up utility.dll, which initialises the static timer thread object. The thread is started properly, and it reaches the WaitForMultipleObjects() call inside the thread function. As soon as registration completes (almost instantly), the timer destructor is called. The destructor calls SetEvent() on the shutdown event, but WaitForMultipleObjects() never returns. As part of trying to figure out this issue I've put a WaitForSingleObject() call immediately after the SetEvent() call, waiting on the shutdown event. This also never returns, which leads me to believe there is an issue with the event itself. I had the following possible explanations:
A timing issue. The registration process is over very quickly, so maybe the thread is getting into a state where it isn't ready to shut down? The thread does reach the WaitForMultipleObjects() call though, which leads me to believe this isn't the issue.
Utility.dll is being unloaded by regsvr32.exe. I'm not really up on how this all works, but using Process Explorer it looks like regsvr32.exe still has the DLL loaded while it is hanging, so I don't think this is the issue.
A tight loop inside regsvr32.exe during shutdown. If the destruction process for regsvr32.exe is taking place in a tight loop (e.g. while(NotShutdown()) etc.), maybe it isn't relinquishing any CPU time for the timer thread, which would explain the hang.
Any more thoughts on the issue? I've scoured the internet and can't find anything remotely related to this problem.
P.S. I know the solution to the problem is to use a static pointer and initialise the timer when it is actually needed, and that's the solution I'm going with. However, I'd still like to understand why this is happening, as to me it seems completely ridiculous that SetEvent() would not work.
Output from windbg !locks command:
0:000> !locks
CritSec ntdll!LdrpLoaderLock+0 at 7c97e178
LockCount 2
RecursionCount 1
OwningThread d8
EntryCount 4
ContentionCount 4
*** Locked
Scanned 253 critical sections
0:000> ~*kv
. 0 Id: a40.d8 Suspend: 0 Teb: 7ffdf000 Unfrozen
ChildEBP RetAddr Args to Child
0007e5ec 7c90df5a 7c8025db 00000764 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0007e5f0 7c8025db 00000764 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
0007e654 7c802542 00000764 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xa8 (FPO: [Non-Fpo])
*** WARNING: Unable to verify checksum for Utilityd.dll
0007e668 00a84e37 00000764 ffffffff 0007e71c kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])
0007e6c8 00a2e5af 0007e798 0007e754 00aa02e0 Utilityd!CThreadTimer::~CThreadTimer+0x97 [C:\xxx\ThreadTimer.cpp @ 49]
0007e71c 00aa02ae 00fe7a18 0007e740 00aa039b Utilityd!$E177+0x3f
0007e728 00aa039b 00a10000 00000000 00000000 Utilityd!_CRT_INIT+0xde [crtdll.c @ 236]
0007e740 7c90118a 00a10000 00000000 00000000 Utilityd!_DllMainCRTStartup+0xbb [crtdll.c @ 289]
0007e760 7c91e044 00aa02e0 00a10000 00000000 ntdll!LdrpCallInitRoutine+0x14
0007e858 7c80ac97 00950000 00000000 0003415e ntdll!LdrUnloadDll+0x41c (FPO: [Non-Fpo])
0007e86c 0100214e 00950000 00000000 00020bca kernel32!FreeLibrary+0x3f (FPO: [Non-Fpo])
0007ff1c 010024bf 01000000 00000000 00020bca regsvr32!wWinMain+0xad1 (FPO: [Non-Fpo])
0007ffc0 7c817077 00000018 00000000 7ffd4000 regsvr32!wWinMainCRTStartup+0x198 (FPO: [Non-Fpo])
0007fff0 00000000 01002327 00000000 78746341 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])
1 Id: a40.e98 Suspend: 0 Teb: 7ffde000 Unfrozen
ChildEBP RetAddr Args to Child
0104fe34 7c90df5a 7c91b24b 00000760 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0104fe38 7c91b24b 00000760 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
0104fec0 7c901046 0197e178 7c913978 7c97e178 ntdll!RtlpWaitForCriticalSection+0x132 (FPO: [Non-Fpo])
0104fec8 7c913978 7c97e178 00000000 7ffde000 ntdll!RtlEnterCriticalSection+0x46 (FPO: [1,0,0])
0104ff34 7c80c136 006e0065 00560074 00fe43d8 ntdll!LdrShutdownThread+0x22 (FPO: [Non-Fpo])
*** ERROR: Symbol file could not be found. Defaulted to export symbols for MSVCRTD.DLL -
0104ff6c 1020c061 00000000 00fe43d8 0104ffb4 kernel32!ExitThread+0x3e (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
0104ff7c 1020bfd8 00000000 006e0065 00560074 MSVCRTD!endthreadex+0x41
0104ffb4 7c80b729 00fe43d8 006e0065 00560074 MSVCRTD!beginthreadex+0x178
0104ffec 00000000 1020bf20 00fe43d8 00000000 kernel32!BaseThreadStart+0x37 (FPO: [Non-Fpo])
2 Id: a40.1708 Suspend: 0 Teb: 7ffdd000 Unfrozen
ChildEBP RetAddr Args to Child
0136fc0c 7c90df5a 7c91b24b 00000760 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0136fc10 7c91b24b 00000760 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
0136fc98 7c901046 0197e178 7c91e3b5 7c97e178 ntdll!RtlpWaitForCriticalSection+0x132 (FPO: [Non-Fpo])
0136fca0 7c91e3b5 7c97e178 0136fd2c 00000004 ntdll!RtlEnterCriticalSection+0x46 (FPO: [1,0,0])
0136fd18 7c90e457 0136fd2c 7c900000 00000000 ntdll!_LdrpInitialize+0xf0 (FPO: [Non-Fpo])
00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x7
Global destructors and constructors are called from DllMain with the loader lock held, as you can see from your stack traces. The thread calling ~CThreadTimer has the loader lock and it is waiting for the event to be set. If another thread needs the loader lock to continue, it will be blocked until the event is set. If the thread that sets the event is one of the threads that needs the loader lock, you'll end up with a deadlock like this one. The loader lock is taken when DLLs are loaded, when threads are created or destroyed, when DLLs are unloaded, at process exit and startup, and in a few other places (GetModuleHandle, for example).
An easy way to create a deadlock like this is:
static struct Foo { Foo() { HANDLE h = CreateThread(...); WaitForSingleObject(h, INFINITE); } } g_foo;
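The stack traces in the question fit this pattern: thread 0 holds the loader lock inside ~CThreadTimer, while thread 1 has already left its thread function and is blocked in ExitThread() waiting for that same lock. As a rough, portable analogy (not Windows code), the sketch below uses a plain std::mutex as a stand-in for the loader lock. The names loader_lock and wait_times_out_while_lock_held are made up for illustration, and a timeout replaces the INFINITE wait so that the demonstration terminates.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Analogy only: `loader_lock` is a hypothetical stand-in for the Windows
// loader lock. The worker must take it before it can finish, much as a real
// thread takes the loader lock while exiting.
// Returns true if the wait timed out while the lock was held.
bool wait_times_out_while_lock_held() {
    std::mutex loader_lock;
    std::mutex state_mutex;
    std::condition_variable state_cv;
    bool shutdown_requested = false;
    bool worker_exited = false;

    std::thread worker([&] {
        {
            // Equivalent of waiting on the shutdown event.
            std::unique_lock<std::mutex> lk(state_mutex);
            state_cv.wait(lk, [&] { return shutdown_requested; });
        }
        // "Thread exit" needs the loader lock, which the destructor holds.
        std::lock_guard<std::mutex> exit_guard(loader_lock);
        std::lock_guard<std::mutex> lk(state_mutex);
        worker_exited = true;
        state_cv.notify_all();
    });

    bool timed_out = false;
    {
        // The "static destructor" runs with the loader lock held.
        std::lock_guard<std::mutex> held(loader_lock);
        std::unique_lock<std::mutex> lk(state_mutex);
        shutdown_requested = true;   // the SetEvent() equivalent
        state_cv.notify_all();
        // A real destructor would wait forever here.
        timed_out = !state_cv.wait_for(lk, std::chrono::milliseconds(200),
                                       [&] { return worker_exited; });
    }   // both locks released: only now can the worker finish
    worker.join();
    return timed_out;
}
```

The signal is delivered and acted on, yet the waiter never sees the worker finish, because finishing requires the lock the waiter is holding.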
That said, you implied SetEvent was indeed being called. If it indeed is, there's probably more going on.
You can also use !handle to take a look at the event you're waiting on to see if you can gain some insight. Again, I would try running with Application Verifier; it may lead you to the problem.
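The workaround mentioned in the question's P.S. (a static pointer, with the timer created only when it is needed) could look roughly like the sketch below. It is a portable illustration under stated assumptions, not the real CThreadTimer: LazyTimer, ensure_timer_started and shutdown_timer are hypothetical names, and std::thread stands in for _beginthreadex(). The point is that no thread is started or joined from a static constructor or destructor.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the timer: owns a worker thread that sleeps
// until shutdown is requested.
class LazyTimer {
public:
    LazyTimer() : worker_([this] { run(); }) {}
    ~LazyTimer() {
        {
            std::lock_guard<std::mutex> lk(m_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return stop_; });
    }
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread worker_;   // declared last so the members above exist first
};

namespace {
std::mutex g_timer_mutex;
LazyTimer* g_timer = nullptr;   // plain pointer: nothing runs at load time
}

// Creates the timer on first use; returns true if this call created it.
bool ensure_timer_started() {
    std::lock_guard<std::mutex> lk(g_timer_mutex);
    if (g_timer != nullptr) return false;
    g_timer = new LazyTimer();
    return true;
}

// Meant to be called from ordinary code before unloading, never from a
// static destructor. Returns true if a running timer was stopped.
bool shutdown_timer() {
    std::lock_guard<std::mutex> lk(g_timer_mutex);
    if (g_timer == nullptr) return false;
    delete g_timer;
    g_timer = nullptr;
    return true;
}
```

With this shape, a host such as regsvr32.exe that loads and unloads the DLL without ever logging never starts the thread at all. If shutdown_timer() is never called, the timer is deliberately left alone at unload rather than joined under the loader lock.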
I created a daemon which I use as a proxy to the Cassandra database. I call it snapdbproxy as it proxies my CQL commands from my other servers and tools.
Whenever I access that tool, it creates a new thread, manages various CQL commands, and then I cleanly exit the thread once the connection is lost.
Looking at the memory footprint, it grows really fast (the most active systems quickly reach gigabytes of virtual memory, and that spills into swap, which grows constantly). On startup, it is around 300 MB.
The software is written in C++ with destructors, RAII, smart pointers, etc... but I still verified:
With -fsanitize=address (I use g++ under Linux) I get no leaks (okay, a very few... under 300 bytes, because I can't find how to get rid of a few Crypto buffers created by OpenSSL)
With valgrind massif, which says I use 4.7 MB at initialization time and then under 4 MB ongoing (I ran the same code for over 1h and got the same results!)
Here is some output of ms_print (I removed the stack, since it's all zeroes).
--------------------------------------------------------------------------
  n        time(i)       total(B)   useful-heap(B)   extra-heap(B)
--------------------------------------------------------------------------
  0              0              0                0               0
  1     78,110,172      4,663,704        4,275,532         388,172
  2    172,552,798      3,600,840        3,369,538         231,302
  3    269,590,806      3,611,600        3,379,648         231,952
  4    350,518,548      3,655,208        3,420,483         234,725
  5    425,873,410      3,653,856        3,419,390         234,466
...
 67  4,257,283,952      3,693,160        3,459,545         233,615
 68  4,302,665,173      3,694,624        3,460,827         233,797
 69  4,348,046,440      3,693,728        3,457,524         236,204
 70  4,393,427,319      3,685,064        3,449,697         235,367
 71  4,438,812,133      3,698,352        3,461,918         236,434
As we can see, after one hour and many accesses from various other daemons (at least 100 accesses), valgrind tells me that I am using only around 4 MB of memory. I tried twice, thinking that the first attempt had probably failed. Same results.
So... I'm more or less out of ideas. Why would my process continue to grow in terms of virtual memory even though everything is correctly freed on exit of each thread (as shown by the massif output) and of the entire process (as shown by -fsanitize=address; okay, I'm not showing the output of the sanitizer here, but trust me, it's under 300 bytes, not gigabytes of leaks)?
Here is the output of a watch command after a while, as I'm looking at the memory (Virtual Memory) status:
Every 1.0s: grep ^Vm /proc/1773/status Tue Oct 2 21:36:42 2018
VmPeak: 1124060 kB <-- starts at under 300 MB...
VmSize: 1124060 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 108776 kB
VmRSS: 108776 kB
VmData: 963920 kB <-- this tags along
VmStk: 132 kB
VmExe: 1936 kB
VmLib: 65396 kB
VmPTE: 888 kB <-- this increases too (necessary to handle the large Vm)
VmPMD: 20 kB
VmSwap: 0 kB
The VmPeak, VmSize, and VmData all increase each time the other daemons run (about once every 5 min.)
However, the memory (malloc/free) is not changing. I am now logging sbrk(0) (an idea from 1201ProgramAlarm's comment, or at least my interpretation of the first part of it) and that address remains the same:
sbrk() = 0x4228000
As suggested by phd, I looked at the contents of /proc/<pid>/maps over time. Here are one or two increments. It's unfortunate that I'm not told what creates these buffers. The only thing I could think of is my threads... (i.e. the stack and a little space for the thread status)
--- a1 2018-10-02 21:50:21.887583577 -0700
+++ a2 2018-10-02 21:52:04.823169545 -0700
@@ -522,6 +522,10 @@
59dd0000-5a5d0000 rw-p 00000000 00:00 0
5a5d0000-5a5d1000 ---p 00000000 00:00 0
5a5d1000-5add1000 rw-p 00000000 00:00 0
+5add1000-5add2000 ---p 00000000 00:00 0
+5add2000-5b5d2000 rw-p 00000000 00:00 0
+5b5d2000-5b5d3000 ---p 00000000 00:00 0
+5b5d3000-5bdd3000 rw-p 00000000 00:00 0
802001000-802b8c000 rwxp 00000000 00:00 0
802b8c000-802b8e000 ---p 00000000 00:00 0
802b8e000-802c8e000 rwxp 00000000 00:00 0
Oh... Yep! My latest change, from detached threads to joined ones... didn't actually join the threads at all. Testing with a proper join now... and it works right! My bad!
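For readers who land here with the same symptom: each added pair in the maps diff above is a 4 KiB inaccessible page followed by an 8 MiB read-write region (0x5b5d2000 - 0x5add2000 = 0x800000), which is consistent with a default-sized thread stack and its guard page on Linux. A thread that is neither detached nor joined generally keeps that reservation after it finishes. Below is a minimal sketch of the joining pattern, using std::thread rather than whatever thread wrapper snapdbproxy actually uses; the function name serve_connections is hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Runs `connections` short-lived workers and joins every one of them.
// Returns how many workers ran to completion.
int serve_connections(int connections) {
    std::atomic<int> finished{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(connections));
    for (int i = 0; i < connections; ++i) {
        workers.emplace_back([&finished] {
            // Placeholder for the per-connection work.
            finished.fetch_add(1);
        });
    }
    for (std::thread& t : workers) {
        t.join();   // joining lets the runtime release the thread's resources
    }
    return finished.load();
}
```

std::thread makes a missing join hard to overlook, since destroying a still-joinable std::thread calls std::terminate(). With raw pthreads or a hand-written wrapper, forgetting pthread_join() fails silently, and the growth only shows up in VmSize.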
I have had some problems with a server today, and I have now boiled it down to this: the server is not able to get rid of processes that get a segfault.
After the process gets a segfault, it just keeps hanging and is not killed.
A test that should cause the error Segmentation fault (core dumped).
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
char *buf;
buf = malloc(1<<31);
fgets(buf, 1024, stdin);
printf("%s\n", buf);
return 1;
}
Compile and set permissions with gcc segfault.c -o segfault && chmod +x segfault.
Running this (and pressing Enter once) on the problematic server causes it to hang. I also ran it on another server with the same kernel version (and most of the same packages), where it gets the segfault and then quits.
Here are the last few lines after running strace ./segfault on both of the servers.
Bad server
"\n", 1024) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
# It hangs here....
Working server
"\n", 1024) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
root@server { ~ }# echo $?
139
When the process hangs (after it has segfaulted), this is how it looks.
Not able to ^c it
root@server { ~ }# ./segfault
^C^C^C
Entry from ps aux
root 22944 0.0 0.0 69700 444 pts/18 S+ 15:39 0:00 ./segfault
cat /proc/22944/stack
[<ffffffff81223ca8>] do_coredump+0x978/0xb10
[<ffffffff810850c7>] get_signal_to_deliver+0x1c7/0x6d0
[<ffffffff81013407>] do_signal+0x57/0x6c0
[<ffffffff81013ad9>] do_notify_resume+0x69/0xb0
[<ffffffff8160bbfc>] retint_signal+0x48/0x8c
[<ffffffffffffffff>] 0xffffffffffffffff
Another funny thing is that I am unable to attach strace to a hanging segfault process. Doing so actually causes it to be killed.
root@server { ~ }# strace -p 1234
Process 1234 attached
+++ killed by SIGSEGV (core dumped) +++
ulimit -c 0 is set, and ulimit -c, ulimit -H -c, and ulimit -S -c all show the value 0
Kernel version: 3.10.0-229.14.1.el7.x86_64
Distro-version: Red Hat Enterprise Linux Server release 7.1 (Maipo)
Running in vmware
The server is working as it should on everything else.
Update
Shutting down abrt (systemctl stop abrtd.service) fixed the problem with processes already hung after core-dump, and new processes core-dumping. Starting up abrt again did not bring back the problem.
Update 2016-01-26
We got a problem that looked similar, but not quite the same. The initial code used to test:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
char *buf;
buf = malloc(1<<31);
fgets(buf, 1024, stdin);
printf("%s\n", buf);
return 1;
}
was hanging. The output of cat /proc/<pid>/maps was
00400000-00401000 r-xp 00000000 fd:00 13143328 /root/segfault
00600000-00601000 r--p 00000000 fd:00 13143328 /root/segfault
00601000-00602000 rw-p 00001000 fd:00 13143328 /root/segfault
7f6c08000000-7f6c08021000 rw-p 00000000 00:00 0
7f6c08021000-7f6c0c000000 ---p 00000000 00:00 0
7f6c0fd5b000-7f6c0ff11000 r-xp 00000000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c0ff11000-7f6c10111000 ---p 001b6000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c10111000-7f6c10115000 r--p 001b6000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c10115000-7f6c10117000 rw-p 001ba000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c10117000-7f6c1011c000 rw-p 00000000 00:00 0
7f6c1011c000-7f6c1013d000 r-xp 00000000 fd:00 14274 /usr/lib64/ld-2.17.so
7f6c10330000-7f6c10333000 rw-p 00000000 00:00 0
7f6c1033b000-7f6c1033d000 rw-p 00000000 00:00 0
7f6c1033d000-7f6c1033e000 r--p 00021000 fd:00 14274 /usr/lib64/ld-2.17.so
7f6c1033e000-7f6c1033f000 rw-p 00022000 fd:00 14274 /usr/lib64/ld-2.17.so
7f6c1033f000-7f6c10340000 rw-p 00000000 00:00 0
7ffc13b5b000-7ffc13b7c000 rw-p 00000000 00:00 0 [stack]
7ffc13bad000-7ffc13baf000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
However, the smaller C code (int main(void){*(volatile char*)0=0;}) to trigger a segfault did cause a segfault and did not hang...
WARNING - this answer contains a number of suppositions based on the incomplete information to hand. Hopefully it is still useful though!
Why does the segfault appear to hang?
As the stack trace shows, the kernel is busy creating a core dump of the crashed process.
But why does this take so long? A likely explanation is that the method you are using to create the segfaults is resulting in the process having a massive virtual address space.
As pointed out in the comments by M.M., the outcome of the expression 1<<31 is undefined by the C standards, so it is difficult to say what actual value is being passed to malloc, but based on the subsequent behavior I am assuming it is a large number.
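To make the size question concrete, here is a small sketch of how the request could be written so that its value is unambiguous. It assumes a 32-bit int and a 64-bit size_t, as on the x86_64 server described; two_gib, wrapped_request and try_allocate are hypothetical helper names. The second function shows what the argument becomes if the shift happens to wrap to INT_MIN before being converted to size_t. That is only one plausible outcome, not something the C standard promises.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// 2 GiB computed in size_t, so the shift cannot overflow a 32-bit int.
std::size_t two_gib() {
    return static_cast<std::size_t>(1) << 31;
}

// The value malloc would receive if `1 << 31` wrapped to INT32_MIN and was
// then converted to size_t. The conversion itself is well defined (modular),
// and on a 64-bit size_t it yields a request of nearly 2^64 bytes.
std::size_t wrapped_request() {
    return static_cast<std::size_t>(INT32_MIN);
}

// Allocates `bytes` and reports failure instead of passing NULL onwards.
bool try_allocate(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p == nullptr) {
        return false;
    }
    std::free(p);
    return true;
}
```

The si_addr=0 in the strace output would be consistent with malloc having returned NULL and fgets then writing through it, so checking the return value, as try_allocate does, is the first thing worth adding to the test program.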
Note that for malloc to succeed it is not necessary for you to actually have this much RAM in your system - the kernel will expand the virtual size of your process but actual RAM will only be allocated when your program actually accesses this RAM.
I believe the call to malloc succeeds, or at least returns, because you state that it segfaults after you press enter, so after the call to fgets.
In any case, the segfault is leading the kernel to perform a core dump. If the process has a large virtual size, that could take a long time, especially if the kernel decides to dump all pages, even those that have never been touched by the process. I am not sure if it will do that, but if it did, and if there was not enough RAM in the system, it would have to begin swapping pages in and out of memory in order to write them to the core dump. This would generate a high I/O load, which could make the process appear unresponsive (and overall system performance would be degraded).
You may be able to verify some of this by looking in the abrtd dump directory (possibly /var/tmp/abrt, or check /etc/abrt/abrt.conf) where you may find the core dumps (or perhaps partial core dumps) that have been created.
If you are able to reproduce the behavior, then you can check:
/proc/[pid]/maps to see the address space map of the process and see if it really is large
Use a tool like vmstat to see if the system is swapping, the amount of I/O going on, and how much I/O wait is being experienced
If you had sar running then you may be able to see similar information even for the period prior to restarting abrtd.
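For the first check, the total size of the address space can be computed from the maps text itself. This is a minimal sketch that assumes only the documented "start-end" hexadecimal prefix of each maps line; total_mapped_bytes is a hypothetical helper, and it takes the file contents as a string so that it can be tried on a saved copy.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <sstream>
#include <string>

// Sums the sizes of all regions in text formatted like /proc/[pid]/maps,
// where every line starts with "start-end" in hexadecimal.
std::uint64_t total_mapped_bytes(const std::string& maps_text) {
    std::istringstream lines(maps_text);
    std::string line;
    std::uint64_t total = 0;
    while (std::getline(lines, line)) {
        const std::size_t dash = line.find('-');
        const std::size_t space = line.find(' ');
        if (dash == std::string::npos || space == std::string::npos || dash > space) {
            continue;   // not a mapping line
        }
        try {
            const std::uint64_t start = std::stoull(line.substr(0, dash), nullptr, 16);
            const std::uint64_t end =
                std::stoull(line.substr(dash + 1, space - dash - 1), nullptr, 16);
            if (end > start) {
                total += end - start;
            }
        } catch (const std::exception&) {
            continue;   // skip lines whose addresses do not parse
        }
    }
    return total;
}
```

To use it on a live process, read /proc/[pid]/maps into a std::string (for example with std::ifstream and a std::stringstream) and pass that in. The result is the amount of mapped address space, not of resident memory.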
Why is a core dump created, even though ulimit -c is 0?
According to this bug report, abrtd will trigger collection of a core dump regardless of ulimit settings.
Why did this not start happening again when abrtd was started up once more?
There are a couple of possible explanations for that. For one thing, it would depend on the amount of free RAM in the system. It might be that a single core dump of a large process would not take that long, and not be perceived as hanging, if there is enough free RAM and the system is not pushed to swap.
If in your initial experiments you had several processes in this state, then the symptoms would be far worse than is the case when just getting a single process to misbehave.
Another possibility is that the configuration of abrtd had been altered but the service not yet reloaded, so that when you restarted it, it began using the new configuration, perhaps changing its behavior.
It is also possible that a yum update had updated abrtd, but not restarted it, so that when you restarted it, the new version was running.
I have written a small program using Debug engine API to read a dump file.
I am executing !analyze -v command through code.
I am able to get almost every detail that can be extracted with the above command, but not the process name, image name or module name.
I really don't know where I'm going wrong.
Things I tried:
Copied the DLLs ext, exts, Kdexts and kext to the same folder where my exe is present.
Also copied symsrv.dll.
For the symbol path I am using
symbols->SetSymbolPath("srv*http://msdl.microsoft.com/download/symbols") where symbols is an IDebugSymbols pointer
But so far it didn't work.
The result I'm getting is:
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated.
Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no
longer function.
Arguments:
Arg1: 00000003, Process
Arg2: 84d97860, Terminating object
Arg3: 84d979cc, Process image file name
Arg4: 8285cec0, Explanatory message (ascii)
Debugging Details:
------------------
***** Debugger could not find nt in module list, module list might be incorrect, error 0x80070057.
-----------------------------------------------
| NT symbols are not available |
| reduced functionality |
| |
------------------------------------------------
unable to get nt!KiCurrentEtwBufferOffset
unable to get nt!KiCurrentEtwBufferBase
PROCESS_OBJECT: 84d97860
IMAGE_NAME: Unknown_Image
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAULTING_MODULE: 00000000
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0xF4
CURRENT_IRQL: 0
STACK_TEXT:
WARNING: Frame IP not in any known module. Following frames may be wrong.
950dbc9c 829223af 000000f4 00000003 84d97860 0x82722bfc
950dbcc0 828a0009 8285cec0 84d979cc 84d97ad0 0x829223af
950dbcf0 8289ff4c 84d97860 8447b030 00000001 0x828a0009
950dbd24 826818c6 000001e0 00000001 001cebb0 0x8289ff4c
950dbd34 77be70f4 badb0d00 001ceba8 00000000 0x826818c6
950dbd38 badb0d00 001ceba8 00000000 00000000 0x77be70f4
950dbd3c 001ceba8 00000000 00000000 00000000 0xbadb0d00
950dbd40 00000000 00000000 00000000 00000000 0x1ceba8
STACK_COMMAND: kb
BUCKET_ID: CORRUPT_MODULELIST
MODULE_NAME: Unknown_Module
*** Followup info cannot be found !!! Please contact "Debugger Team"
Our application (written in C++, VS 2010 project) has been running fine on all operating systems prior to Windows 8 (and still does). On Windows 8, however, when orderly exiting the application, an access violation occurs:
mfc100.dll!_DllMain@12() <<< Crash here
mfc100.dll!__CRT_INIT@12()
mfc100.dll!__DllMainCRTStartup@12()
ntdll.dll!_LdrxCallInitRoutine@16()
ntdll.dll!LdrpCallInitRoutine()
ntdll.dll!LdrShutdownProcess()
ntdll.dll!RtlExitUserProcess()
kernel32.dll!_ExitProcessImplementation@4()
mscoreei.dll!RuntimeDesc::ShutdownAllActiveRuntimes(unsigned int,class RuntimeDesc *,enum RuntimeDesc::ShutdownCompatMode)
mscoreei.dll!_CorExitProcess@4()
mscoree.dll!_ShellShim_CorExitProcess@4()
msvcr100d.dll!__crtCorExitProcess(int status) line 693 C
msvcr100d.dll!__crtExitProcess(int status) line 699 C
msvcr100d.dll!doexit(int code, int quick, int retcaller) line 621 C
msvcr100d.dll!exit(int code) line 393 C
my.exe!__tmainCRTStartup() line 568 C
my.exe!WinMainCRTStartup() line 371 C
kernel32.dll!@BaseThreadInitThunk@12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart@8()
In an MSDN forum topic it has been suggested to run GC.Collect() before exit, but adding such a call shortly before exit made no difference.
I am a bit at a loss about how I should debug the problem. As far as I understand, CorExitProcess takes care of cleaning up the managed resources of the application. So could this be a fault in a managed component?
Or is it more likely that some function pointer in _DllMain has been overwritten/corrupted? If so, how would I set a data breakpoint at the address in question? There is a post explaining how to debug a similar issue, but he's having the issue in his own DLL, so he can actually peek at the exact source of the problem, which I can't.
Any suggestions?
Edit:
Additional information, windbg !analyze -v:
FAULTING_IP:
mfc100+258e6c
64298e6c 8b4654 mov eax,dword ptr [esi+54h]
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 64298e6c (mfc100+0x00258e6c)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 53f21f0c
Attempt to read from address 53f21f0c
CONTEXT: 00000000 -- (.cxr 0x0;r)
eax=53f21eb8 ebx=00000000 ecx=64187d2d edx=7fcde000 esi=53f21eb8 edi=00000001
eip=64298e6c esp=00c3f1b8 ebp=00c3f2ec iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00210206
mfc100+0x258e6c:
64298e6c 8b4654 mov eax,dword ptr [esi+54h] ds:0023:53f21f0c=????????
FAULTING_THREAD: 00000520
DEFAULT_BUCKET_ID: WRONG_SYMBOLS
PROCESS_NAME: ww.exe
ADDITIONAL_DEBUG_TEXT:
You can run '.symfix; .reload' to try to fix the symbol path and load symbols.
MODULE_NAME: mfc100
FAULTING_MODULE: 77bc0000 ntdll
DEBUG_FLR_IMAGE_TIMESTAMP: 4d5f29b8
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 53f21f0c
READ_ADDRESS: 53f21f0c
FOLLOWUP_IP:
mfc100+258e6c
64298e6c 8b4654 mov eax,dword ptr [esi+54h]
APP: ww.exe
ANALYSIS_VERSION: 6.3.9600.17029 (debuggers(dbg).140219-1702) x86fre
MANAGED_STACK: !dumpstack -EE
OS Thread Id: 0x520 (0)
Current frame:
ChildEBP RetAddr Caller, Callee
PRIMARY_PROBLEM_CLASS: WRONG_SYMBOLS
BUGCHECK_STR: APPLICATION_FAULT_WRONG_SYMBOLS
LAST_CONTROL_TRANSFER: from 6429da08 to 64298e6c
STACK_TEXT:
WARNING: Stack unwind information not available. Following frames may be wrong.
00c3f2ec 6429da08 64040000 00000000 00000001 mfc100+0x258e6c
00c3f330 6429dac7 64040000 00c3f35c 77be077a mfc100+0x25da08
00c3f33c 77be077a 64040000 00000000 00000001 mfc100+0x25dac7
00c3f35c 77be07f0 6429daa9 64040000 00000000 ntdll!RtlAddMandatoryAce+0x14e
00c3f3a4 77bfa529 6429daa9 64040000 00000000 ntdll!RtlAddMandatoryAce+0x1c4
00c3f49c 77bfa40e 00000000 00000000 6f2d4890 ntdll!RtlExitUserProcess+0x1e7
00c3f4b0 76ff4231 00000000 77e8f3b0 ffffffff ntdll!RtlExitUserProcess+0xcc
00c3f4c4 6f8b3712 00000000 bd3cbe8b 01f1c054 KERNEL32!ExitProcess+0x15
00c3f74c 6f8c19a2 00000001 00c3f76c 6f1686ad mscoreei!GetFileVersion+0x1835
00c3f758 6f1686ad 00000000 77bdab85 6f8a0000 mscoreei!CorExitProcess+0x27
00c3f76c 70737954 00000000 00c3f784 7073798d mscoree!CorExitProcess+0x94
00c3f778 7073798d 00000000 00c3f7c8 70737ab0 MSVCR100!_query_new_mode+0x159
00c3f784 70737ab0 00000000 a2b843a9 00375f5c MSVCR100!_query_new_mode+0x192
00c3f7c8 70737b1d 00000000 00000000 00000000 MSVCR100!_query_new_mode+0x2b5
00c3f7dc 003274ab 00000000 d1ef1931 00000000 MSVCR100!exit+0x11
00c3f864 76ff173e 7fcdf000 00c3f8b4 77c16911 ww!_enc$textbss$begin+0x64ab
00c3f870 77c16911 7fcdf000 a613e810 00000000 KERNEL32!BaseThreadInitThunk+0x12
00c3f8b4 77c168bd ffffffff 77c8560a 00000000 ntdll!LdrInitializeThunk+0x1f0
00c3f8c4 00000000 003275da 7fcdf000 00000000 ntdll!LdrInitializeThunk+0x19c
STACK_COMMAND: .cxr 0x0 ; kb
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: mfc100+258e6c
FOLLOWUP_NAME: MachineOwner
IMAGE_NAME: mfc100.dll
BUCKET_ID: WRONG_SYMBOLS
FAILURE_BUCKET_ID: WRONG_SYMBOLS_c0000005_mfc100.dll!Unknown
ANALYSIS_SOURCE: UM
FAILURE_ID_HASH_STRING: um:wrong_symbols_c0000005_mfc100.dll!unknown
FAILURE_ID_HASH: {9e516b68-081f-78d6-cf23-b42f2b3cb573}
Followup: MachineOwner
---------
Screenshot of where the crash occurs:
As discussed in comments, our similar problem was where we had a native C++ application that communicated with a managed C# application running as a COM server. To allow the managed component to communicate events to the C++ app, an event sink was exposed as a simple ATL COM interface from the native side, which on the .NET side was automatically encapsulated in a Runtime Callable Wrapper.
The access violation on application close - which wasn't always visible except in the event logs - was due to the fact that the RCW didn't call Release() on our ATL COM interfaces until it was garbage collected. As this happened when the .NET runtime closed, which was after the native runtime had shut down, it tried to call back into dead code.
The solution for us was to expose a "shutdown" method on the .NET side that disposed of all the communicating objects, then called:
GC.Collect();
GC.WaitForPendingFinalizers();
OK, I understand that this might not exactly mirror your problem, but the route to finding out what was causing it was to use the Managed Debugging Assistants, particularly reportAvOnCOMRelease.
We activated the MDA by registry keys and ran the native app via a debugger to see the additional output that identified the COM interfaces that were being held too long. Probably as a first step, it would be wise to activate all of the MDA options to glean as much info as possible from the crash.
I tried debugging this using data breakpoints, but that didn't help a lot. I could see that at some point the data being accessed was overwritten, but that didn't happen in a call stack containing any of my own code.
So I resorted to a simpler method and started removing parts of the program until the error disappeared. In a large application it may be hard to remove some parts without breaking others, but I was able to narrow down the source of the issue.
Basically, the problem stopped occurring after removing a certain call to FreeLibrary. After further investigation it turned out that this call happens during DllMain, which is not allowed:
The entry-point function should perform only simple initialization or termination tasks. It must not call the LoadLibrary or LoadLibraryEx function (or a function that calls these functions), because this may create dependency loops in the DLL load order. This can result in a DLL being used before the system has executed its initialization code. Similarly, the entry-point function must not call the FreeLibrary function (or a function that calls FreeLibrary) during process termination, because this can result in a DLL being used after the system has executed its termination code.
In another SO question, one user apparently noticed a change since Windows 8 in this regard, which would explain why the error only happens on this version of Windows.
We'll now change our application so that FreeLibrary is called at a different point of time.
UPDATE
Thanks to feedback below I was able to home in on ADPlus.vbs, which is part of the debugging tools for Windows.
Don't forget to set up _NT_SYMBOL_PATH before you run it.
Using this we've been able to see into the application with far greater clarity than we ever could using the regular dumps produced by Windows when the application crashes.
Many thanks to all for the responses.
ORIGINAL QUESTION
We have a server application written in Visual C++ that sometimes (relatively rarely) crashes on customer sites. We haven't been able to understand why this happens from looking at our own log files, so the next step is to start looking at crash dumps.
We've just purposefully put a bug into our app (a null pointer) so that we can generate a crash dump and verify that the dumps produced are valuable, but thus far I can't make head or tail of what I'm seeing.
I think my first question is whether I've even got WinDbg set up correctly (the other developer here is loading the dump into Visual Studio 2010 and seeing the same errors, so I'm assuming it's fine, or we're both wrong :) ) - and then the next question is, how do I understand what it's telling me?
The main confusion is that the dump seems to be telling me it has reached a break point, which seems odd to me since there was no debugger connected.
The app was running on a Windows Server 2003 system when it crashed. I believe I have pointed WinDbg at the PDB file for the DLL and EXE correctly.
FAULTING_IP:
ntdll!DbgBreakPoint+0
7c81a3e1 cc int 3
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 7c81a3e1 (ntdll!DbgBreakPoint)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 00000000
Parameter[1]: 8779fdb0
Parameter[2]: 00000003
DEFAULT_BUCKET_ID: STATUS_BREAKPOINT
PROCESS_NAME: CallPlusServerLauncher.exe
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 8779fdb0
EXCEPTION_PARAMETER3: 00000003
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
ADDITIONAL_DEBUG_TEXT: Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[ffffffff]
FAULTING_THREAD: ffffffff
PRIMARY_PROBLEM_CLASS: STATUS_BREAKPOINT
BUGCHECK_STR: APPLICATION_FAULT_STATUS_BREAKPOINT
STACK_TEXT:
1bd0ffc8 7c83fe08 00000005 00000004 00000001 ntdll!DbgBreakPoint
1bd0fff4 00000000 00000000 00000000 00000000 ntdll!DbgUiRemoteBreakin+0x36
FOLLOWUP_IP:
ntdll!DbgBreakPoint+0
7c81a3e1 cc int 3
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: ntdll!DbgBreakPoint+0
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: ntdll
IMAGE_NAME: ntdll.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 49900d60
STACK_COMMAND: ddS 1bd10000 1bd0c000 ; dt ntdll!LdrpLastDllInitializer BaseDllName ; dt ntdll!LdrpFailureData ; ~439s; .ecxr ; kb
BUCKET_ID: MANUAL_BREAKIN
FAILURE_BUCKET_ID: STATUS_BREAKPOINT_80000003_ntdll.dll!DbgBreakPoint
WATSON_STAGEONE_URL: http://watson.microsoft.com/StageOne/CallPlusServerLauncher_exe/0_0_0_0/4df87414/ntdll_dll/5_2_3790_4455/49900d60/80000003/0001a3e1.htm?Retriage=1
Followup: MachineOwner
DbgBreakPoint -- Looks to me like you broke execution using a remote debugger.
If you didn't then I have seen DbgBreakPoint show up when you have code pages (Edit: I meant page heap) turned on (you should know if you did this) and there was a detection of invalid memory access.
Asserts can also trigger a breakpoint exception. For example, I have (too often) seen them come out of the heap checking around a delete when the heap has been corrupted by a double delete or an overflow. But only with the debug runtime, I thought; is that what you have deployed?