what means bytes in the stack trace of c++ application - c++

i got a crash dump and a stack trace for an app on windows 2012 r2 and the stack trace line ends up with function + N bytes.
What bytes / offset means? Could it be used to identify the line in the source code of the function that crashed? For example below what means 6092 bytes? It is not the line number
I see that in other application crash the offset is in hexal base.
app.exe caused exception C0000005 (EXCEPTION_ACCESS_VIOLATION) at 000000009067B6BC
Register dump:
Crash dump successfully written to file 'C:\Program Files\App\appcrash.dmp'
Stack Trace:
0x000000009067B6BC (0xA95C44A0 0xA86FCE60 0xA95C44A0 0x00000008) app.exe, appFunction1()+6092 bytes

Related

Stack allocation fail when close to heap

Issue Description
C++ server got several crashes at the same stack frame as below
enter image description here
Investigation
Memory was not used up, 400M-800M
Stack space used is not large, about 270K-350K
Haven't found any other code issue caused crash
The distance between stack and heap is about 1M.
Binary is built with PIE
OS is 2.6.32-696.16.1.el6.i686 #1 SMP Wed Nov 15 16:16:47 UTC 2017 i686 i686 i386 GNU/Linux
Stack space used in one crash
(gdb) f 0
#0 0xb717f616 in WDMS_Byte_Stream::write (this=Cannot access memory at
address 0xbfa16b1c) at wdms_bs.cpp:63 in wdms_bs.cpp
(gdb) set $top=$esp
(gdb) f 38
#38 0xb6a1bcbf in main (argc=1, argv=0xbfa5a8b4) at main.cpp:66
main.cpp: No such file or directory.
in main.cpp
(gdb) p $esp-$top
$1 = 277728
(gdb)
Distance between stack and heap
Memory allocated
enter image description here
One heap segment close to stack space, 0xb8a8f000->0xbf917000
enter image description here
Stack space 0xbfa18000->0xbfa5c000
enter image description here
the distance between them
(gdb) p 0xbfa18000- 0xbf917000
$3 = 1052672
Why crash
Program need to allocate 8268 bytes (0x204c) stack size at last frame
OS only expand stack size to 0xbfa18000, actual size is
(gdb) p $esp+0x204c-0xbfa18000
$19 = (void *) 0xb4c(2892 bytes)
So, when execute next instruction
call 0xb6a19ec9 <__i686.get_pc_thunk.bx>, where need access $esp, which is out of accessible memory space.
Why crash at the same stack frame? It could be because this call sequence has most call frames and to use most stack space.
enter image description here
Question-1
Is this related CVE-2017-1000364 Fix ?
enter image description here
Question-2
Why does OS allocate heap segment so close to stack? how does it avoid stack and heap collision and not allocation failure?

arm_data abort failure in case of running my program for the second time and thereafter

I add my program (load a file and do some computation) into the app of TizenRT on ARTIK053. The program can run successfully in the first time, but the data abort failure will be met when running it second time. The specific error info is as follows:
arm_dataabort:
Data abort. PC: 040d25a0 DFAR: 00000011 DFSR: 0000080d
up_assert: Assertion failed at file:armv7-r/arm_dataabort.c line: 111 task: ghsom_test
up_dumpstate: Current sp: 020c3eb0
up_dumpstate: User stack:
up_dumpstate: base: 020c3fd0
up_dumpstate: size: 00000fd4
up_dumpstate: used: 00000220
up_dumpstate: User Stack
up_stackdump: 020c3ea0: 00000003 020c3eb0 040c9638 041d38b8 00000000 040c9644 00000011 0000080
.....
.....
up_taskdump: Idle Task: PID=0 Stack Used=1024 of 1024
up_taskdump: hpwork: PID=1 Stack Used=164 of 2028
up_taskdump: lpwork: PID=2 Stack Used=164 of 2028
up_taskdump: logm: PID=3 Stack Used=300 of 2028
up_taskdump: LWIP_TCP/IP: PID=4 Stack Used=228 of 4068
up_taskdump: tash: PID=6 Stack Used=948 of 4076
up_taskdump: ghsom_test: PID=10 Stack Used=616 of 4052
I checked the remaining free RAM space, it is enough for my program. And I added some printing info into my main function to check on which line the error come out. I found that if I commented some lines before the line that the error come out, in the next time I running the program, the error line will move downward some lines. It seems like I released some stack space. So I guess it might be an issue related with the stack size that I can assign to a single proc. Anyone knows the reason, and how to solve the issue? To be mentioned, it only happens for the second time and thereafter I running the program.
With the stackdump you can almost always figure out where the error originated from.
Since you have the image file for your build you can do
arm-none-eabi-addr2line -f -p -i -b build/out/bin/tinyara 0xADDR
where ADDR would be the addr is one of the relevant addresses in the stack dump.
You can usually check the "current sp" (stack pointer) but often it points to the arm_dataabort shown in the failure above.
Then you can check the PC address and also look for addresses in the stack dump (starting from the back of it) that looks like the PC in value.
In your case it could be addresses like (in that order): 040c9644, 041d38b8, 040c9638
So basically:
arm-none-eabi-addr2line -f -p -i -b build/out/bin/tinyara 0x040c9644
notice the 0x in front of the address.
The command will give you a good indication for where this address is coming from in your binary like:
up_idlepm_static at /home/user/tizenrt/os/arch/arm/src/chip/s5j_idle.c:111
(inlined by) up_idle at /home/user/tizenrt/os/arch/arm/src/chip/s5j_idle.c:254
if the address is not pointing to code lines then it will look like:
?? ??:0
hope that helps

Debugging a nasty SIGILL crash: Text Segment corruption

Ours is a PowerPC based embedded system running Linux. We are encountering a random SIGILL crash which is seen for wide variety of applications. The root-cause for the crash is zeroing out of the instruction to be executed. This indicates corruption of the text segment residing in memory. As the text segment is loaded read-only, the application cannot corrupt it. So I am suspecting some common sub-system (DMA?) causing this corruption. Since the problem takes days to reproduce (crash due to SIGILL) it is getting difficult to investigate. So to begin with I want to be able to know if and when the text segment of any application has been corrupted.
I have looked at the stack trace and all the pointers, registers are proper.
Do you guys have any suggestions how I can go about it?
Some Info:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux
(gdb) bt
0 0x10457dc0 in xxx
Disassembly output:
=> 0x10457dc0 <+80>: mr r1,r11
0x10457dc4 <+84>: blr
Instruction expected at address 0x10457dc0: 0x7d615b78
Instruction found after catching SIGILL 0x10457dc0: 0x00000000
(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
Expected (from the application binary):
(gdb) x /32 0x10457da0
0x10457da0 : 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0 : 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0 : 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0 : 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0
Actual (after handling SIGILL and dumping nearby memory locations):
Faulting instruction address: 0x10457dc0
0x10457da0 : 0x913E0000
0x10457db0 : 0x83ABFFF4
=> 0x10457dc0 : 0x00000000
0x10457dd0 : 0x4BCD8BE5
0x10457de0 : 0x93E1000C
Edit:
One lead that we have is that the corruption is always occurring at an offset that ends with 0xdc0.
For e.g.
Faulting instruction address: 0x10653dc0 << printed by our application after catching SIGILL
Faulting instruction address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
Edit 2: Another lead is that the corruption is always occurring at physical frame number 0x00a4d. I suppose with PAGE_SIZE of 4096 this translates to physical address of 0x00A4DDC0. We are suspecting couple of our kernel drivers and investigating further. Is there any better idea (like putting hardware watchpoint) which could be more efficient? How about KASAN as suggested below?
Any help is appreciated. Thanks.
1.) Text segment is RO, but the permissions could be changed by mprotect, you can check that if you think it is possible
2.) If it is kernel problem:
Run kernel with KASAN and KUBSAN (undefined behaviour) sanitizers
Focus on drivers code not included in mainline
The hint here is one byte corruption. Maybe i'm wrong, but it means that DMA is not to blame. It looks like some kind of invalid store.
3.) Hardware. I think, your problem looks like a hardware problem (RAM issue).
You can try to decrease RAM system frequency in bootloader
Check if this problem reproduces on stable mainline software, that is how you can prove that it's it

Windows heap allocations call stacks - strange callstack

I'm trying to analyze a managed process memory dump an suspect if for native memory leaks. In order to be able to use windbg (and use !heap extension from there) I activated user mode call stacks for the server process
I see a lot of blocks of size 68. And among those blocks (those that I could manually verify using !heap -p -a) there are many call-stacks of the form
!heap -p -a 000000003ca5cfd0
address 000000003ca5cfd0 found in
_HEAP # 1ea0000
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
000000003ca5cfa0 0009 0000 [00] 000000003ca5cfd0 00068 - (busy)
7766bbed ntdll! ?? ::FNODOBFM::`string'+0x000000000001913b
7fef7b76a57 msvcr120!malloc+0x000000000000005b
7fef7b76967 msvcr120!operator new+0x000000000000001f
7fe9a5cdaf8 +0x000007fe9a5cdaf8
Do you have any idea what are these allocations because they take hundreds of MB on my dump file ?
EDIT
lm shows the following around the area 7fe9a5cdaf8 (truncated)
start end module name
00000000`773b0000 00000000`774cf000 kernel32 (pdb symbols)
00000000`774d0000 00000000`775ca000 user32 (deferred)
00000000`775d0000 00000000`77779000 ntdll (pdb symbols)
00000000`77790000 00000000`77797000 psapi (deferred)
00000000`777a0000 00000000`777a3000 normaliz (deferred)
00000001`3f810000 00000001`3f818000 ManagedService (deferred)
000007fe`dd2d0000 000007fe`de398000 System_Web_ni (deferred)
I assume that there was no native image created for your application (using NGen). In that case the module (DLL) only contains IL code which will never be executed. So, from native point of view, there won't be any stacks pointing to inside the module.
Instead the IL code will be JIT compiled to another location in memory, e.g. 7fe9a5cdaf8 in your case. That's where real code is executed, so that's what you see from native side.
To revert a JIT compiled instruction into its .NET method descriptor, do the following:
0:000> .symfix
0:000> .loadby sos mscorwks ; *** .NET 2
0:000> .loadby sos clr ; *** .NET 4
0:000> !ip2md 7fe9a5cdaf8
The output should then show the .NET method name (example here, since I don't have your dump):
MethodDesc: 000007ff00033450
Method Name: ManagedService.Program.Main()
Class: 000007ff00162438
MethodTable: 000007ff00033460
mdToken: 0600001f
Module: 000007ff00032e30
IsJitted: yes
CodeAddr: 000007ff00170120

Debugging/bypassing BSOD without source code

Hello and good day to you.
Need a bit of assitance here:
Situation:
I have an obscure DirectX 9 application (name and application details are irrelevant to the question) that causes blue screen of death on all nvidia cards (GeForce 8400GS and up) since certain driver version. I believe that the problem is indirectly caused by DirectX 9 call or a flag that triggers driver bug.
Goal:
I'd like to track down offending flag/function call (for fun, this isn't my job/homework) and bypass error condition by writing proxy dll. I already have a finished proxy dll that provides wrappers for IDirect3D9, IDirect3DDevice9, IDirect3DVertexBuffer9 and IDirect3DIndexBuffer9 and provides basic logging/tracing of Direct3D calls. However, I can't pinpoint function which causes crash.
Problems:
No source code or technical support is available. There will be no assitance, and nobody else will fix the problem.
Memory dump produced by kernel wasn't helpful - apparently an access violation happens within nv4_disp.dll, but I can't use stacktrace to go to IDirect3DDevice9 method call, plus there's a chance that bug happens asynchronously.
(Main problem) Because of large number of Direct3D9Device method calls, I can't reliably log them into file or over network:
Logging into file causes significant slowdown even without flushing, and because of that all last contents of the log are lost when system BSODs.
Logging over network (using UDP and WINSOck's sendto)also causes significant slowdown and must not be done asynchronously (asynchronous packets are lost on BSOD), plus packets (the ones around the crash) are sometimes lost even when sent synchronously.
When application is "slowed" down by logging routines, BSOD is less likely to happen, which makes tracking it down harder.
Question:
I normally don't write drivers, and don't do this level of debugging, so I have impression that I'm missing something important there's a more trivial way to track down the problem than writing IDirect3DDevice9 proxy dll with custom logging mechanism. What is it? What is the standard way of diagnosing/handling/fixing problem like this (no source code, COM interface method triggers BSOD)?
Minidump analysis(WinDBG):
Loading User Symbols
Loading unloaded module list
...........
Unable to load image nv4_disp.dll, Win32 error 0n2
*** WARNING: Unable to verify timestamp for nv4_disp.dll
*** ERROR: Module load completed but symbols could not be loaded for nv4_disp.dll
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck 1000008E, {c0000005, bd0a2fd0, b0562b40, 0}
Probably caused by : nv4_disp.dll ( nv4_disp+90fd0 )
Followup: MachineOwner
---------
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
KERNEL_MODE_EXCEPTION_NOT_HANDLED_M (1000008e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: bd0a2fd0, The address that the exception occurred at
Arg3: b0562b40, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
FAULTING_IP:
nv4_disp+90fd0
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi
TRAP_FRAME: b0562b40 -- (.trap 0xffffffffb0562b40)
ErrCode = 00000000
eax=00000808 ebx=e37f8200 ecx=e4ae1c68 edx=e37f8328 esi=e37f8400 edi=00000000
eip=bd0a2fd0 esp=b0562bb4 ebp=e37e09c0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202
nv4_disp+0x90fd0:
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi ds:0023:00000900=????????
Resetting default scope
CUSTOMER_CRASH_COUNT: 3
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x8E
LAST_CONTROL_TRANSFER: from bd0a2e33 to bd0a2fd0
STACK_TEXT:
WARNING: Stack unwind information not available. Following frames may be wrong.
b0562bc4 bd0a2e33 e37f8200 e37f8200 e4ae1c68 nv4_disp+0x90fd0
b0562c3c bf8edd6b b0562cfc e2601714 e4ae1c58 nv4_disp+0x90e33
b0562c74 bd009530 b0562cfc bf8ede06 e2601714 win32k!WatchdogDdDestroySurface+0x38
b0562d30 bd00b3a4 e2601008 e4ae1c58 b0562d50 dxg!vDdDisableSurfaceObject+0x294
b0562d54 8054161c e2601008 00000001 0012c518 dxg!DxDdDestroySurface+0x42
b0562d54 7c90e4f4 e2601008 00000001 0012c518 nt!KiFastCallEntry+0xfc
0012c518 00000000 00000000 00000000 00000000 0x7c90e4f4
STACK_COMMAND: kb
FOLLOWUP_IP:
nv4_disp+90fd0
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: nv4_disp+90fd0
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nv4_disp
IMAGE_NAME: nv4_disp.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4e390d56
FAILURE_BUCKET_ID: 0x8E_nv4_disp+90fd0
BUCKET_ID: 0x8E_nv4_disp+90fd0
Followup: MachineOwner
nv4_disp+90fd0
bd0a2fd0 39b8f8000000 cmp dword ptr [eax+0F8h],edi
This is the important part. Looking at this, it is most probable that eax is invalid, hence attempting to access an invalid memory address.
What you need to do is load nv4_disp.dll into IDA (you can get a free version), check the image base that IDA loads nv4_disp at and hit 'g' to goto address, try adding 90fd0 to the image base IDA is using, and it should take you directly to the offending instruction (depending on section structure).
From here you can analyze the control flow, and how eax is set and used. If you have a good kernel level debugger you can set a breakpoint on this address and try and get it to hit.
Analysing the function, you should attempt to figure out what the function does, what eax is meant to be pointing to at that point, what its actually pointing to, and why. This is the hard part and is a great part of the difficulty and skill of reverse engineering.
Found a solution.
Problem:
Logging is unreliable since messages (when dumped to file) disappear during bsod, packets are sometimes lost when logging over network, and there's slowdown due to logging.
Solution:
Instead of logging to file or over network, configure system to produce full physical memory dump on BSOD and log all messages into any memory buffer. It'll be faster. Once system crashed, it'll dump entire memory into file, and it'll be possible to either view contents of log-file buffer using WinDBG's dt (if you have debug symbols) command, or you'll be able to search and locate logfile stored in memory using "memory" view.
I used circular buffer of std::strings to store messages and separate array of const char* to make things easier to read in WinDBG, but you could simply create huge array of char and store all messages within it in plaintext.
Details:
Entire process on winxp:
Ensure that minimum page file size is equal or larger than total amount of RAM + 1 megabytes. (Right Click "My Computer"->Properties->Advanced->Performance->Advanced->Change)
Configure system to produce complete memory dump on BSOD (RIght click "My Computer'->Properties->Advanced->Startup and Recovery->Settings->Write Debugging Information . Select "Complete memory dump" and specify path you want).
Ensure that disk (where the file will be written) has required amount of free space (total amount of RAM on your system.
Build app/dll (the one that does logging) with debug symbol, and Trigger BSOD.
Wait till memory dump is finished, reboot. Feel free to swear at driver developer while system writes memory dump and reboots.
Copy MEMORY.DMP system produced to a safe place, so you won't lose everything if system crashes again.
Launch windbg.
Open Memory Dump (File->Open Crash Dump).
If you want to see what happened, use !analyze -v command.
Access memory buffer that stores logged messages using one of those methods:
To see contents of global variable, use dt module!variable where "module" is name of your library (without *.dll), and "variable" is name of variable. You can use wildcards. You can use address without module!variable
To see contents of one field of the global variable (if global variable is a struct), use dt module!variable field where "field" is variable member.
To see more details about varaible (content of arrays and substructures) use dt -b module!variable field or dt -b module!variable
If you don't have symbols, you'll need to search for your "logfile" using memory window.
At this point you'll be able to see contents of log that were stored in memory, plus you'll have snapshot of the entire system at the moment when it crashed.
Also...
To see info about process that crashed the system, use !process.
To see loaded modules use lm
For info about thread there's !thread id where id is hexadecimal id you saw in !process output.
It looks like the crash may either be caused by a bad pointer, or heap corruption. You can tell this because the crash occurs in a memory-freeing function (DxDdDestroySurface). Destroying surfaces is something that you absolutely need to do - you can't just stub this out, the surface will still get freed when the program exits, and if you disable it inside the kernel, you'll run out of on-card memory very quickly and crash that way, as well.
You can try to figure out what sequence of events leads up to this heap corruption, but there's no silver bullet here - as fileoffset suggested, you'll need to actually reverse engineer the driver to see why this happens (it may help to compare drivers before and after the offending driver version as well!)