Proper gdb backtrace from RISC-V trap / HardFault

I'm using a RISC-V (rv32imac_zicsr) chip and have trouble debugging traps/hardfaults with gdb (riscv-none-elf-gcc toolchain v12.2.0-1 from xPack).
With ARM chips, gdb bt lists some, but not all, of the functions leading up to the hardfault.
Assume we have these functions in the error stack:
[ HardFault_Handler, Erroneous_Function, Calling_Function ]
ARM GDB would by default show the following backtrace:
[ HardFault_Handler, Erroneous_Function ]
Assuming a simple RISC-V HardFault_Handler which would work similarly in ARM:
void HardFault_Handler(void) __attribute__((naked));
void HardFault_Handler(void)
{
    __asm("EBREAK;");
}
it will show the backtrace
[ HardFault_Handler, Calling_Function ]
followed by "Backtrace stopped: frame did not save the PC"
gdb seems to read only the addresses in pc (HardFault_Handler) and in ra, the return address at the time the trap occurred (Calling_Function). This skips Erroneous_Function, whose address is in mepc, completely.
Additionally, with the (naked) attribute the stack pointer is not modified when entering HardFault_Handler, so sp is still the stack of the Erroneous_Function, and inspecting the Calling_Function would not be valid, either (similarly without the naked attribute).
So I tested the following HardFault_Handler:
void HardFault_Handler(void) __attribute__((naked));
void HardFault_Handler(void)
{
    __asm(
        "csrr ra, mepc;"
        "EBREAK;"
    );
}
With this, the backtrace looks like this:
[ HardFault_Handler, Erroneous_Function ]
And thanks to the (naked) attribute Erroneous_Function IS valid and can be inspected, but all the information about the Calling_Function stored in ra is lost. I can of course store it temporarily (possibly discarding some relevant registers in Erroneous_Function), but I have no way to easily inspect the stack.
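As a minimal sketch of the "store it temporarily" idea, the handler could write ra to a global before overwriting it. The global fault_saved_ra is a hypothetical name, and the sketch assumes that nothing else in the firmware uses mscratch, which is borrowed here to keep t0 intact. It has not been tested on hardware.

```c
#include <stdint.h>

/* Hypothetical global (not part of the original handler): receives the
   value ra had when the trap was taken, i.e. the return address into
   Calling_Function. */
volatile uint32_t fault_saved_ra;

#if defined(__riscv)
void HardFault_Handler(void) __attribute__((naked));
void HardFault_Handler(void)
{
    __asm(
        "csrw mscratch, t0;"       /* assumption: mscratch is otherwise unused */
        "la   t0, fault_saved_ra;"
        "sw   ra, 0(t0);"          /* keep the original return address */
        "csrr t0, mscratch;"       /* restore t0, so only ra is changed */
        "csrr ra, mepc;"           /* make gdb unwind into Erroneous_Function */
        "EBREAK;"
    );
}
#endif
```

When the debugger stops at the EBREAK, printing fault_saved_ra (for example with p/x fault_saved_ra) gives the address inside Calling_Function that the backtrace no longer shows.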
While this also does not seem to be possible in ARM, since the information IS available, I wonder if this is something that has to be done in GDB or if there is a way to make it better.
For example, I would not mind having only [ Erroneous_Function, Calling_Function ] in the backtrace, but this would require EBREAK to trigger with a delay: after mret has executed and the PC points at the erroneous instruction again. I do not know of a way to make EBREAK trigger one instruction late, or anything like it.
This post serves both as a reference for others who want gdb to at least return a working backtrace (I didn't find anything like it for RISC-V) and as a request for input:
What is the best way to get a better backtrace in a trap on RISC-V processors (preferably with rv32imac)?
Thanks!

Related

GDB Patching results in "Cannot access memory at address 0x

I have a program that I need to patch using GDB. The issue is there is a line of code that makes a "less than or equal test" and fails causing the program to end with a Segmentation fault. The program is already compiled and I do not have the source so I cannot change the source code obviously. However, using GDB, I was able to locate where the <= test is done and then I was able to locate the memory address which you can see below.
(gdb) x/100i $pc
... removed extra lines ...
0x7ffff7acb377: jle 0x7ffff7acb3b1
....
All I need to do is change the test to a 'greater than or equal to' test and then the program should run fine. The opcode for jle is 0x7e and I need to change it to 0x7d. My assignment gives instructions on how to do this as follows:
$ gdb -write -q programtomodify
(gdb) set {unsigned char} 0x8040856f = 0x7d
(gdb) quit
So I try it and get...
$ gdb -write -q player
(gdb) set {unsigned char} 0x7ffff7acb377 = 0x7d
Cannot access memory at address 0x7ffff7acb377
I have tried various other memory addresses and no matter what I try I get the same response. That is my only problem, I don't care if it's the wrong address or wrong opcode instruction at this point, I just want to be able to modify the memory.
I am running Linux Mint 14 via VMware Player.
Thanks
Cannot access memory at address 0x7ffff7acb377
You are trying to write to an address where some shared library resides. You can find out which library that is with info sym 0x7ffff7acb377.
At the time when you are trying to perform the patch, the said shared library has not been loaded yet, which explains the message you get.
Run the program to main (for example, set a breakpoint with break main and then run). After that you should be able to write to the address. However, you'll need write permission on the library file to make your write "stick".
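To make such a change stick outside a debugging session, the byte can also be changed in the file on disk. The following is a minimal sketch, not the assignment's method: patch_byte is a hypothetical helper, and the offset it takes is an offset into the file, which is not the same as the runtime address 0x7ffff7acb377 and would have to be worked out from where the library is mapped.

```c
#include <stdio.h>

/* Replace one byte at a file offset, but only if it currently holds the
   expected value. Returns 0 on success, -1 on any failure. */
int patch_byte(const char *path, long offset,
               unsigned char expected, unsigned char replacement)
{
    FILE *f = fopen(path, "r+b");
    if (f == NULL)
        return -1;                 /* missing file or no write permission */

    int rc = -1;
    int current;
    if (fseek(f, offset, SEEK_SET) == 0 &&
        (current = fgetc(f)) != EOF &&
        (unsigned char)current == expected &&
        fseek(f, offset, SEEK_SET) == 0 &&  /* reposition between read and write */
        fputc(replacement, f) != EOF)
        rc = 0;

    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

For the case in the question, a call would look like patch_byte(path, offset, 0x7e, 0x7d). Checking the expected byte first guards against patching the wrong location.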

How to read frames from a core dump (without GDB)?

I would like to access the frames stored in a core dump of a program that doesn't have debug symbols (I want to do this in C). When I open up the program and the core dump inside GDB I get a stack trace including the names of the functions. For example:
(gdb) bt
#0 0x08048443 in layer3 ()
#1 0x08048489 in layer2 ()
#2 0x080484c9 in layer1 ()
#3 0x0804854e in main ()
The names of all functions are stored in the executable in the .strtab section. How can I build up the stack trace with the different frames? Running GDB in batch mode is not an option. Just "copying the parts from gdb that are needed" is also a bad idea, because the code is not independently written.
So to make my question more precise: Where do I find the point inside a core dump where I can start reading the stack information? Is there a library of some sort for accessing that information? A struct I can use? Or even better, documentation on how that information is structured inside a core dump?
(I have already seen the question "how to generate a stack trace from a core dump file in C, without invoking an external tool such as gdb", but since there is no valid answer, I thought I would ask it again)
[Edit] I'm doing this under Linux x86
A core dump contains the stack contents as well. If you combine that stack information with the EBP and EIP register values stored in the core file, you can print the stack trace. I have written a program that does this. You can find it at the following link.
http://www.emntech.com/programs/corestrace.c
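The idea of following EBP and EIP can be sketched as below. This is not the linked program, only a minimal illustration under several assumptions: the code was built with frame pointers, the caller has already copied the stack memory out of the core file (a core file is an ELF file whose PT_LOAD segments hold memory and whose NT_PRSTATUS note holds the registers), and the host has the same byte order as the dump. The name walk_frames and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Walk a 32-bit x86 frame-pointer chain inside a copy of the stack.
   stack      : stack bytes copied out of the core file
   stack_base : virtual address that stack[0] had in the crashed process
   stack_len  : size of the copy in bytes
   Returns the number of code addresses written to pcs[]. */
size_t walk_frames(const uint8_t *stack, uint32_t stack_base, size_t stack_len,
                   uint32_t eip, uint32_t ebp, uint32_t *pcs, size_t max_pcs)
{
    size_t n = 0;
    if (max_pcs == 0)
        return 0;
    pcs[n++] = eip;                    /* frame #0 is the faulting PC itself */

    while (n < max_pcs) {
        /* Each frame stores the saved EBP at EBP and the return address at EBP+4. */
        if (ebp < stack_base || (size_t)(ebp - stack_base) + 8 > stack_len)
            break;                     /* frame pointer leaves the copied stack */
        size_t off = ebp - stack_base;
        uint32_t saved_ebp, ret;
        memcpy(&saved_ebp, stack + off, 4);
        memcpy(&ret, stack + off + 4, 4);
        if (ret == 0)
            break;
        pcs[n++] = ret;
        if (saved_ebp <= ebp)
            break;                     /* chain must move toward higher addresses */
        ebp = saved_ebp;
    }
    return n;
}
```

The checks on ebp keep a corrupted chain from looping forever or reading outside the copied stack.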
Usage: Compile the above program and give the corefile when you execute it.
$ corestrace core
If you also want symbols to be printed, do it like this. Let's assume the program that generated the core is 'test'.
$ nm -n test > symbols
$ corestrace core symbols
Sample output looks like this:
$ ./corestrace core symbols
0x80483cd foo+0x9
0x8048401 func+0x1f
0x8048430 main+0x2d
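Output of the form name+offset can be produced from a table sorted by address, which is what nm -n provides. The sketch below shows one way to do that lookup. It assumes the symbol file has already been parsed into an array, and struct sym and find_symbol are hypothetical names rather than anything taken from corestrace.c.

```c
#include <stddef.h>
#include <stdint.h>

struct sym {
    uint32_t addr;       /* start address of the symbol */
    const char *name;
};

/* Binary search in a table sorted by ascending address: return the last
   symbol whose address is <= pc, or NULL if pc lies below all symbols. */
const struct sym *find_symbol(const struct sym *table, size_t count, uint32_t pc)
{
    size_t lo = 0, hi = count;         /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].addr <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &table[lo - 1];
}
```

The offset to print is then pc minus the address of the symbol that was found.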

Linux Kernel Text Symbols

When I look through a Linux kernel OOPS output, the EIP and other code addresses have values in the range of 0xC01-----. In my System.map and objdump -S vmlinux output, all the code addresses are at least above 0xC1------. My vmlinux has debug symbols included (CONFIG_DEBUG_INFO).
When I debug over a serial connection (kgdb), and I load gdb with gdb ./vmlinux, again I have the same issue that I cannot reconcile $eip with what I have in System.map and objdump output. When I run where in gdb, I get a jumbled mess on the stack:
#0 0xC01----- in ?? ()
#1 0xC01----- in ?? ()
#2 0xC01----- in ?? ()
...
Can anyone make any suggestions on how to resolve this/these issues? My main concern is how I actually map an eip value from an OOPS to System.map or objdump -S vmlinux. I know that the OOPS will give me the function name and offset into the object code, but I am more concerned about the previously mentioned issue and why gdb can't correctly display a stack backtrace.
Looks like the OOPS is because you jumped into a place that's not a function.
This would easily cause a crash, and would also prevent the debugger from resolving the address as a symbol.
You can check this by disassembling the area around this EIP. If I'm correct, it won't make sense as machine code.
There are generally two causes for such things:
1. Function call using a corrupt function pointer. In this case, the stack frame before the last should show the caller. But you don't have this frame, so it may be the other reason.
2. Stack overrun - your return address is corrupt, so you've returned to a bad location. If it's so, the data ESP points to should contain the address in EIP. Debugging stack overruns is hard, because the most important source of information is missing. You can try to print the stack in "raw" format (x/xa addr), and try to make sense of it.
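One way to make sense of a raw stack dump is to pick out the words that look like code addresses. The sketch below assumes the stack words have already been copied into an array and that the bounds of the kernel text are known, for example from the _stext and _etext entries in System.map. The function name scan_for_text is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Record the indexes of all stack words that fall inside the text range
   [text_start, text_end). Returns how many indexes were stored in hits[]. */
size_t scan_for_text(const uint32_t *words, size_t count,
                     uint32_t text_start, uint32_t text_end,
                     size_t *hits, size_t max_hits)
{
    size_t n = 0;
    for (size_t i = 0; i < count && n < max_hits; i++) {
        if (words[i] >= text_start && words[i] < text_end)
            hits[n++] = i;             /* candidate return address */
    }
    return n;
}
```

Each hit is only a candidate: stale values left over from earlier calls also match, so the results still need to be checked against the disassembly.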

Function name from Windows stack trace

How do I restore the stack trace function name instead of <UNKNOWN>?
Event Type: Error
Event Source: abcd
Event Category: None
Event ID: 16
Date: 1/3/2010
Time: 10:24:10 PM
User: N/A
Computer: CMS01
Description:
Server.exe caused a <UNKNOWN> in module <UNKNOWN> at 2CA001B:77E4BEF7
Build 6.0.0.334
WorkingSetSize: 1291071488 bytes
EAX=02CAF420 EBX=00402C88 ECX=00000000 EDX=7C82860C ESI=02CAF4A8
EDI=02CAFB68 EBP=02CAF470 ESP=02CAF41C EIP=77E4BEF7 FLG=00000206
CS=2CA001B DS=2CA0023 SS=7C820023 ES=2CA0023 FS=7C82003B GS=2CA0000
2CA001B:77E4BEF7 (0xE06D7363 0x00000001 0x00000003 0x02CAF49C)
2CA001B:006DFAC7 (0x02CAF4B8 0x00807F50 0x00760D50 0x007D951C)
2CA001B:006DFC87 (0x00003561 0x7F6A0F38 0x008E7290 0x00021A6F)
2CA001B:0067E4C3 (0x00003561 0x00000000 0x02CAFBB8 0x02CAFB74)
2CA001B:00674CB2 (0x00003561 0x006EBAC7 0x02CAFB68 0x02CAFA64)
2CA001B:00402CA4 (0x00003560 0x00000000 0x00000000 0x02CAFBB8)
2CA001B:00402B29 (0x00003560 0x00000001 0x02CAFBB8 0x00000000)
2CA001B:00683096 (0x00003560 0x563DDDB6 0x00000000 0x02CAFC18)
2CA001B:00688E32 (0x02CAFC58 0x7C7BE590 0x00000000 0x00000000)
2CA001B:00689F0C (0x02CAFC58 0x7C7BE590 0x00000000 0x00650930)
2CA001B:0042E8EA (0x7F677648 0x7F677648 0x00CAFD6C 0x02CAFD6C)
2CA001B:004100CA (0x563DDB3E 0x00000000 0x00000000 0x008E7290)
2CA001B:0063AC39 (0x7F677648 0x02CAFD94 0x02CAFD88 0x77B5B540)
2CA001B:0064CB51 (0x7F660288 0x563DD9FE 0x00000000 0x00000000)
2CA001B:0063A648 (0x00000000 0x02CAFFEC 0x77E6482F 0x008E7290)
2CA001B:0063A74D (0x008E7290 0x00000000 0x00000000 0x008E7290)
2CA001B:77E6482F (0x0063A740 0x008E7290 0x00000000 0x00000000)
Get the exact same build of the program, objdump it, and have a look at what function is at the address. If symbol names have been stripped from the executable, this might be a little bit difficult.
If the program code is in any way dynamic, you may have to run it into a debugger to find the addresses of functions.
If the program is deliberately obfuscated and nasty, and it randomises its function addresses in some way at runtime (evasive things like viruses, or copy-protection code sometimes do this) then all bets are off.
In general:
The easiest way to find out what caused the crash is to follow the steps necessary to reproduce it in an instance of the application running in a debugger. All other approaches are much more difficult. This is why developers will often not spend time trying to tackle bugs which there are no known methods to reproduce.
You need to have the debug info file (.pdb) next to the exe when the crash occurs.
Then hopefully, your crash dumping code can load it and use the information in it.
Try to run the same build from your IDE, pause it and jump to the addresses in the disassembly. Then you can switch to source code and see the functions there.
At least that's how it works in Delphi. And I know of this possibility in Visual Studio as well.
If you don't have the build in your IDE, you have to use a debugger like OllyDbg. Run the exe that caused the errors and pause the application in OllyDbg. Go to the addresses from the stack trace.
Open a similar application project in your IDE, run it and pause it. Search for the same binary pattern you see in OllyDbg and then switch to source code.
The last possibility I know of is analysing the map file, if you generated one during your build.
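Whichever lookup is used (debugger, objdump or map file), the code address first has to be taken out of each trace line. The sketch below does that for the line format shown in the question. The names parse_frame and rva_of are hypothetical, and the image base passed to rva_of is an assumption that has to be checked against the actual build (0x00400000 is only the traditional default for 32-bit executables).

```c
#include <stdint.h>
#include <stdio.h>

/* Parse the "SEG:ADDR" prefix of a trace line such as
   "2CA001B:006DFAC7 (0x02CAF4B8 ...)". Returns 1 on success, 0 otherwise. */
int parse_frame(const char *line, uint32_t *segment, uint32_t *address)
{
    unsigned int seg, addr;
    if (sscanf(line, "%x:%x", &seg, &addr) != 2)
        return 0;                      /* not a stack frame line */
    *segment = seg;
    *address = addr;
    return 1;
}

/* Offset of an address from the module's load address. */
uint32_t rva_of(uint32_t address, uint32_t image_base)
{
    return address - image_base;
}
```

The resulting address or offset can then be compared with the addresses listed for each function in the map file.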
I use StackWalker - just compile it into the source of your program, and if it crashes you can generate a stack trace at that time, including function names.
The most common cause is that you in fact don't have a module at the specified address. This can happen e.g. when you dereference an uninitialized function pointer, or call a virtual function using an invalid this pointer.
This is probably not the case here: "77E4BEF7" is with very high probability a Windows DLL, and 006DFAC7 an address in one of your modules. You don't need PDB files for this: Windows should always know the module name (.EXE or .DLL) - in fact, it needs first that module name to even find the proper PDB file.
Now, the remaining question is why you don't have the module information. This I can't tell from the information above. You need information about the actual system. For instance, does it have DEP? Is DEP enabled for your process?

What settings should I be using with Minidumps?

Currently we call MiniDumpWriteDump with the MiniDumpNormal | MiniDumpWithIndirectlyReferencedMemory flags. That works just fine for internal builds in Debug configuration, but isn't giving as much information as we need in Release configuration.
In Release, the minidump data contains enough stack information for the debugger to work out where in the code the failure occurred, but no other data. I don't simply mean local variables are missing due to being optimised out, as you'd expect in a Release build - I mean, there is nothing useful except for the call stack and current code line. No registers, no locals, no globals, no objects pointed to by locals - nothing. We don't even get 'this' which would allow us to view the current object. That was the point of using MiniDumpWithIndirectlyReferencedMemory - it should have included memory referenced by locals and stack variables, but doesn't seem to.
What flags should we be using instead? We don't want to use MiniDumpWithFullMemory and start generating 600MB+ dumps, but would happily expand the dumps somewhat beyond the 90KB we currently get if it meant getting more useful data. Perhaps we should be using MiniDumpWithDataSegs (globals) or...?
WinDbg uses the following flags for a .dump /ma:
0:003> .dumpdebug
----- User Mini Dump Analysis
MINIDUMP_HEADER:
Version A793 (62F0)
NumberOfStreams 13
Flags 41826
0002 MiniDumpWithFullMemory
0004 MiniDumpWithHandleData
0020 MiniDumpWithUnloadedModules
0800 MiniDumpWithFullMemoryInfo
1000 MiniDumpWithThreadInfo
40000 MiniDumpWithTokenInformation
I suggest you replace MiniDumpWithFullMemory by MiniDumpWithIndirectlyReferencedMemory.
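As a sketch of what that substitution means for the flag value: the constants below repeat the MINIDUMP_TYPE values from dbghelp.h under hypothetical k-prefixed names, so that the example compiles outside Windows as well. Real code would include dbghelp.h and pass the result to MiniDumpWriteDump.

```c
#include <stdint.h>

/* Copies of MINIDUMP_TYPE values from dbghelp.h (names prefixed with k
   to avoid clashing with the real enumeration). */
enum {
    kMiniDumpWithFullMemory                 = 0x00000002,
    kMiniDumpWithHandleData                 = 0x00000004,
    kMiniDumpWithUnloadedModules            = 0x00000020,
    kMiniDumpWithIndirectlyReferencedMemory = 0x00000040,
    kMiniDumpWithFullMemoryInfo             = 0x00000800,
    kMiniDumpWithThreadInfo                 = 0x00001000,
    kMiniDumpWithTokenInformation           = 0x00040000
};

/* The flag set listed in the .dumpdebug output above. */
uint32_t ma_flags(void)
{
    return kMiniDumpWithFullMemory | kMiniDumpWithHandleData |
           kMiniDumpWithUnloadedModules | kMiniDumpWithFullMemoryInfo |
           kMiniDumpWithThreadInfo | kMiniDumpWithTokenInformation;
}

/* Same set with full memory swapped for indirectly referenced memory. */
uint32_t suggested_flags(void)
{
    return (ma_flags() & ~(uint32_t)kMiniDumpWithFullMemory) |
           kMiniDumpWithIndirectlyReferencedMemory;
}
```

Here ma_flags() gives 0x41826, matching the Flags line in the dump header, and suggested_flags() gives 0x41864.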