Bad align value for a ELF section causes the program to be loaded wrong - c++

I'm currently building a toy OS using a custom linker script to create the binary :
ENTRY(entry_point)
/* base virtual address of the kernel */
VIRT_BASE = 0xFFFFFFFF80000000;
SECTIONS
{
. = 0x100000;
/*
* Place multiboot header at 0x10000 as it is where Grub will be looking
* for it.
* Immediately followed by the boot code
*/
.boot :
{
*(.mbhdr)
_load_start = .;
*(.boot)
. = ALIGN(4096);
/* reserve space for paging data structures */
pml4 = .;
. += 0x1000;
pdpt = .;
. += 0x1000;
pagedir = .;
. += 0x1000;
. += 0x8000;
/* stack segment for loader */
stack = .;
}
/*
* Kernel code section is placed at his virtual address
*/
. += VIRT_BASE;
.text ALIGN(0x1000) : AT(ADDR(.text) - VIRT_BASE)
{
*(.text)
*(.gnu.linkonce.t*)
}
.data ALIGN(0x1000) : AT(ADDR(.data) - VIRT_BASE)
{
*(.data)
*(.gnu.linkonce.d*)
}
.rodata ALIGN(0x1000) : AT(ADDR(.rodata) - VIRT_BASE)
{
*(.rodata*)
*(.gnu.linkonce.r*)
}
_load_end = . - VIRT_BASE;
.bss ALIGN(0x1000) : AT(ADDR(.bss) - VIRT_BASE)
{
*(COMMON)
*(.bss)
*(.gnu.linkonce.b*)
}
_bss_end = . - VIRT_BASE;
/DISCARD/ :
{
*(.comment)
*(.eh_frame)
}
}
Since I use virtual memory, I use the AT() directive to make a distinction between the relocation address (the value of the symbol) and the actual physical load address, because virtual memory is not enabled when the binary is loaded. The address I give to AT correspond to the physical address mapped to the virtual relocation address. this works well.
But in some cases, I notice there is a strange shift after my code is loaded in memory. the .text section (which does not include the assembly boot code, only C++ code) is located 8 bytes higher than where it should be. An objdump shows the virtual relocation address is correct in the binary, as well as the load address. The only visible change is in readelf -l :
Elf file type is EXEC (Executable file)
Entry point 0x10003c
There are 3 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x0000e8 0x0000000000100000 0x0000000000100000 0x00c000 0x00c000 R 0x8
LOAD 0x00c0f0 0xffffffff8010c000 0x000000000010c000 0x004032 0x005002 RWE 0x10
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RWE 0x8
Section to Segment mapping:
Segment Sections...
00 .boot
01 .text .text._ZN2io3outEth .data .rodata .bss
02
The ALIGN of my text section is 0x10 here, whereas it is 0x8 when everything works properly.
Why is this align only present in some cases ? e.g. when I use this kind of C++ static initialization :
struct interrupt_desc
{
uint16_t clbk_low = 0x0;
uint16_t selector = 0x08;
uint8_t zero = 0x0;
uint8_t flags = 0xE;
uint16_t clbk_mid = 0x0;
uint32_t clbk_high = 0x0;
uint32_t zero2 = 0x0;
} __attribute__((packed));
interrupt_desc bar[32];
or when I implement these assembly structures for IDT handling, if I put them in the .data section, everything if fine, but if they go to the .text, the align is there again :
[SECTION .data]
[GLOBAL _interrupt_table_register]
[GLOBAL _interrupt_vector_table]
[BITS 64]
align 8
_interrupt_table_register:
DW 0xABCD
DQ _interrupt_vector_table
_interrupt_vector_table:
%rep 256
DW 0x0000
DW 0x0008
DB 0x00
DB 0x0E
DW 0x0000
DD 0x00000000
DD 0x00000000
%endrep
And finally, I don't really understand what this align value means and how I can adjust it. Moreover the load address is already 0x10 bytes-aligned, so why does this align changes the effective load address ?
Maybe I fail to understand something important here so any explanation about the internals of ELF is very welcome.
In case anyone wonders, the binary is an ELF64 linked with GNU ld, booted by Grub2 using Multiboot2.
PS: there is nothing wrong with my virtual memory mappings as they work perfectly when the align is 8. also, the shift is present event when I dump physical memory.
also, previously, I used to use Rust for the main code instead of C++. With Rust, this align bug was always present since the implementation of LTO in the compiler, but fine previously.
Thanks for your answers.

Related

GNU LD section attributes and flags for 64-bit kernel ELF

I am trying to link a 64-bit kernel ELF using GNU LD. I have a executable section named lowerhalf and then the other usual sections. The linker script I use is this
ENTRY(kernelCompatibilityModeStart);
SECTIONS {
. = 0x80000000; /* lowerhalf origin */
.lowerhalf : ALIGN(0x1000) {
*(.lowerhalf);
}
. = 0xffffffff80000000; /* higherhalf origin */
.GDT64 : ALIGN(0x1000) {
*(.GDT64);
}
.IDT64 : ALIGN(0x1000) {
*(.IDT64);
}
.TSS64 : ALIGN(0x1000) {
*(.TSS64);
}
.KERNELSTACK : ALIGN(0x1000) {
*(.KERNELSTACK);
}
.ISTs : ALIGN(0x1000) {
*(.ISTs);
}
.text : ALIGN(0x1000) {
*(.text);
}
.data : ALIGN(0x1000) {
*(.data);
}
.rodata : ALIGN(0x1000) {
*(.rodata*);
}
.bss : ALIGN(0x1000) {
*(COMMON);
*(.bss);
}
}
When I look at the generated segments using readelf I see something like
Elf file type is EXEC (Executable file)
Entry point 0x80000000
There are 3 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000001000 0x0000000080000000 0x0000000080000000
0x0000000000000030 0x0000000000000030 R 0x1000
LOAD 0x0000000000002000 0xffffffff80000000 0xffffffff80000000
0x0000000000026718 0x0000000000026718 R E 0x1000
LOAD 0x0000000000029000 0xffffffff80027000 0xffffffff80027000
0x0000000000004954 0x00000000000053d8 RW 0x1000
Section to Segment mapping:
Segment Sections...
00 .lowerhalf
01 .GDT64 .IDT64 .TSS64 .KERNELSTACK .ISTs .text
02 .data .ctors .rodata .eh_frame .bss
Some problems with the output ELF:
the .text section gets bundled up with other sections like .GDT64 and .KERNELSTACK
.lowerhalf should have RE attributes instead of just R
all .data, .ctors, and .rodata end up with the RW flags.
I tried using the MEMORY directive but that results in a load of linker errors or warnings and doesn't give the desired output.
I'd like for the sections to have appropriate RWE attributes so I can create my paging entries accordingly.

Initialized vs uninitialized global variables, in which part of the RAM are the stored?

The problem is that When I define the arrays like this :
float Data_Set_X[15000];
float Data_Set_Y[15000];
float Data_Set_Z[15000];
I get the RAM overflow error which is: .bss will not fit in region RAM Timer-Blink-Test_CM7 C/C++ Problem
When I initilaze at least one of the arrays or three of them , the error will be disappeared.
float Data_Set_X[15000]={0};
float Data_Set_Y[15000];
float Data_Set_Z[15000];
My variables are global.
In linker script file it is written that:
/* Specify the memory areas */
MEMORY
{
RAM_EXEC (rx) : ORIGIN = 0x24000000, LENGTH = 256K
RAM (xrw) : ORIGIN = 0x24040000, LENGTH = 256K
}
/* The startup code goes first into RAM_EXEC */
/* The program code and other data goes into RAM_EXEC */
/* Constant data goes into RAM_EXEC */
/* Initialized data sections goes into RAM, load LMA copy after code */
And there is a separated part of the RAM for /* Uninitialized data section */ according to the linker script.
The RAM size is 1MB and around 800KB is accessible for the user. MCU has dual core and I use the M7 Core. this core can access to a 512KB RAM area as it is mentioned in the Linker Script file. the whole size of these three arrays are 180KB
Here is the linker Script file of my Micro Controller
/*
******************************************************************************
**
** File : LinkerScript.ld
**
**
** Abstract : Linker script for STM32H7 series
** 256Kbytes RAM_EXEC and 256Kbytes RAM
**
** Set heap size, stack size and stack location according
** to application requirements.
**
** Set memory bank area and size if external memory is used.
**
** Target : STMicroelectronics STM32
**
** Distribution: The file is distributed as is, without any warranty
** of any kind.
**
*****************************************************************************
** #attention
**
** Copyright (c) 2019 STMicroelectronics.
** All rights reserved.
**
** This software component is licensed by ST under BSD 3-Clause license,
** the "License"; You may not use this file except in compliance with the
** License. You may obtain a copy of the License at:
** opensource.org/licenses/BSD-3-Clause
**
****************************************************************************
*/
/* Entry Point */
ENTRY(Reset_Handler)
/* Highest address of the user mode stack */
_estack = 0x24080000; /* end of RAM */
/* Generate a link error if heap and stack don't fit into RAM */
_Min_Heap_Size = 0x200 ; /* required amount of heap */
_Min_Stack_Size = 0x400 ; /* required amount of stack */
/* Specify the memory areas */
MEMORY
{
RAM_EXEC (rx) : ORIGIN = 0x24000000, LENGTH = 256K
RAM (xrw) : ORIGIN = 0x24040000, LENGTH = 256K
}
/* Define output sections */
SECTIONS
{
/* The startup code goes first into RAM_EXEC */
.isr_vector :
{
. = ALIGN(4);
KEEP(*(.isr_vector)) /* Startup code */
. = ALIGN(4);
} >RAM_EXEC
/* The program code and other data goes into RAM_EXEC */
.text :
{
. = ALIGN(4);
*(.text) /* .text sections (code) */
*(.text*) /* .text* sections (code) */
*(.glue_7) /* glue arm to thumb code */
*(.glue_7t) /* glue thumb to arm code */
*(.eh_frame)
KEEP (*(.init))
KEEP (*(.fini))
. = ALIGN(4);
_etext = .; /* define a global symbols at end of code */
} >RAM_EXEC
/* Constant data goes into RAM_EXEC */
.rodata :
{
. = ALIGN(4);
*(.rodata) /* .rodata sections (constants, strings, etc.) */
*(.rodata*) /* .rodata* sections (constants, strings, etc.) */
. = ALIGN(4);
} >RAM_EXEC
.ARM.extab : { *(.ARM.extab* .gnu.linkonce.armextab.*) } >RAM_EXEC
.ARM : {
__exidx_start = .;
*(.ARM.exidx*)
__exidx_end = .;
} >RAM_EXEC
.preinit_array :
{
PROVIDE_HIDDEN (__preinit_array_start = .);
KEEP (*(.preinit_array*))
PROVIDE_HIDDEN (__preinit_array_end = .);
} >RAM_EXEC
.init_array :
{
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT(.init_array.*)))
KEEP (*(.init_array*))
PROVIDE_HIDDEN (__init_array_end = .);
} >RAM_EXEC
.fini_array :
{
PROVIDE_HIDDEN (__fini_array_start = .);
KEEP (*(SORT(.fini_array.*)))
KEEP (*(.fini_array*))
PROVIDE_HIDDEN (__fini_array_end = .);
} >RAM_EXEC
/* used by the startup to initialize data */
_sidata = LOADADDR(.data);
/* Initialized data sections goes into RAM, load LMA copy after code */
.data :
{
. = ALIGN(4);
_sdata = .; /* create a global symbol at data start */
*(.data) /* .data sections */
*(.data*) /* .data* sections */
. = ALIGN(4);
_edata = .; /* define a global symbol at data end */
} >RAM AT> RAM_EXEC
/* Uninitialized data section */
. = ALIGN(4);
.bss :
{
/* This is used by the startup in order to initialize the .bss secion */
_sbss = .; /* define a global symbol at bss start */
__bss_start__ = _sbss;
*(.bss)
*(.bss*)
*(COMMON)
. = ALIGN(4);
_ebss = .; /* define a global symbol at bss end */
__bss_end__ = _ebss;
} >RAM
/* User_heap_stack section, used to check that there is enough RAM left */
._user_heap_stack :
{
. = ALIGN(8);
PROVIDE ( end = . );
PROVIDE ( _end = . );
. = . + _Min_Heap_Size;
. = . + _Min_Stack_Size;
. = ALIGN(8);
} >RAM
/* Remove information from the standard libraries */
/DISCARD/ :
{
libc.a ( * )
libm.a ( * )
libgcc.a ( * )
}
.ARM.attributes 0 : { *(.ARM.attributes) }
}
Thank you!
Then using this linker script the initialized static storage objects will be stored in the RAM memory section as the .data segment is stored there.
The initial values will be stored in the RAM_EXEC memory section. The startup code will have to copy this data from the RAM_EXEC to RAM.
The problem is that something (probably the debug probe or the bootloader) will have to program the RAM_EXEC memory section first
The not initialized static storage objects will be stored in the RAM memory section as the .bss section is located there. The startup code will have to zero those
The best way is to see what is actually in the memory and where. Generate the .map file to see the exact placement (you can increase the size of the RAM to pass the linking. It will not be the valid executable, but you will be able to generate the .map file)

Why does the dynamic linker *subtract* the virtual address to find out location of loaded shared library executable in memory?

According to the ld source code here in dl_main. Without the context of what phdr is passed in to dl_main, I'm a bit confused on why main_map's load address is deduced by subtracting the virtual address.
The code that I've traced through:
1124 static void
1125 dl_main (const ElfW(Phdr) *phdr,
1126 ElfW(Word) phnum,
1127 ElfW(Addr) *user_entry,
1128 ElfW(auxv_t) *auxv)
1129 {
1130 const ElfW(Phdr) *ph;
1131 enum mode mode;
1132 struct link_map *main_map;
1133 size_t file_size;
1134 char *file;
1135 bool has_interp = false;
1136 unsigned int i;
1137 bool prelinked = false;
1138 bool rtld_is_main = false;
1139 void *tcbp = NULL;
... // Before this else, it thinks you're calling `ld.so.<version>` directly. This is not usually the case.
1366 else
1367 {
1368 /* Create a link_map for the executable itself.
1369 This will be what dlopen on "" returns. */
1370 main_map = _dl_new_object ((char *) "", "", lt_executable, NULL,
1371 __RTLD_OPENEXEC, LM_ID_BASE);
1372 assert (main_map != NULL);
1373 main_map->l_phdr = phdr;
1374 main_map->l_phnum = phnum;
1375 main_map->l_entry = *user_entry;
1376
1377 /* Even though the link map is not yet fully initialized we can add
1378 it to the map list since there are no possible users running yet. */
1379 _dl_add_to_namespace_list (main_map, LM_ID_BASE);
1380 assert (main_map == GL(dl_ns)[LM_ID_BASE]._ns_loaded);
1399 }
... // Loops through program headers loaded in sequence from the ELF header.
1409 for (ph = phdr; ph < &phdr[phnum]; ++ph)
1410 switch (ph->p_type)
1411 {
1412 case PT_PHDR:
1413 /* Find out the load address. */
1414 main_map->l_addr = (ElfW(Addr)) phdr - ph->p_vaddr;
1415 break;
So why is the load address subtracted here?
In particular, on line 1414, we see main_map->l_addr = (ElfW(Addr)) phdr - ph->p_vaddr;. From the file link.h which defines the type link_map for main_map, I see that link_map is a struct that describes a loaded shared object. The l_addr field is to describe the difference between where the .so is loaded into memory vs. where it says it is loaded in the VirtAddr field when you run readelf:
❯ readelf -l main
Elf file type is EXEC (Executable file)
Entry point 0x401020
There are 11 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000400040 0x0000000000400040
0x0000000000000268 0x0000000000000268 R 0x8
INTERP 0x00000000000002a8 0x00000000004002a8 0x00000000004002a8
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000004c0 0x00000000000004c0 R 0x1000
...
This means that l_addr is NOT the load address. It's actually the offset for which you need to add from the current memory to access the shared object contents.

Identifying crash with hs_err_pid*.log and gdb

Update Sept. 12, 2011
I was able to get the core file and immediately dissabled the instruction that crashed. As per advice I tracked the value of r28 (by the way, no registry entry was log to hs_err_pid*.log) and check where did the value come from (see below w/ <---). However, I was not able to determine the value of r32.
Could the reason for the miss-alignment is that r28 is a 8-byte integer loaded to a 4-byte integer r31?
;;; 1053 if( Transfer( len ) == FALSE ) {
0xc00000000c0c55c0:2 <TFM::PrintTrace(..)+0x32>: adds r44=0x480,r32;; <---
0xc00000000c0c55d0:0 <TFM::PrintTrace(..)+0x40>: ld8 r43=[ret2]
0xc00000000c0c55d0:1 <TFM::PrintTrace(..)+0x41>: (p6) st4 [r35]=ret3
0xc00000000c0c55d0:2 <TFM::PrintTrace(..)+0x42>: adds r48=28,r33
0xc00000000c0c55e0:0 <TFM::PrintTrace(..)+0x50>: mov ret0=0;;
0xc00000000c0c55e0:1 <TFM::PrintTrace(..)+0x51>: ld8.c.clr r62=[r45]
0xc00000000c0c55e0:2 <TFM::PrintTrace(..)+0x52>: cmp.eq.unc p6,p1=r0,r62
;;; 1056 throw MutexLock ;
0xc00000000c0c55f0:0 <TFM::PrintTrace(..)+0x60>: nop.m 0x0
0xc00000000c0c55f0:1 <TFM::PrintTrace(..)+0x61>: nop.m 0x0
0xc00000000c0c55f0:2 <TFM::PrintTrace(..)+0x62>: (p6) br.cond.dpnt.many _NZ10TFM07PrintTraceEPi+0x800;;
;;; 1057 }
0xc00000000c0c5600:0 <TFM::PrintTrace(..)+0x70>: adds r41=0x488,r32
0xc00000000c0c5600:1 <TFM::PrintTrace(..)+0x71>: adds r40=0x490,r32
0xc00000000c0c5600:2 <TFM::PrintTrace(..)+0x72>: br.call.dptk.many rp=0xc00000000c080620;;
;;; 1060 dwDataLen = len ;
0xc00000000c0c5610:0 <TFM::PrintTrace(..)+0x80>: ld8 r16=[r44] <---
0xc00000000c0c5610:1 <TFM::PrintTrace(..)+0x81>: mov gp=r36
0xc00000000c0c5610:2 <TFM::PrintTrace(..)+0x82>: (p1) mov r62=8;;
0xc00000000c0c5620:0 <TFM::PrintTrace(..)+0x90>: cmp.eq.unc p6=r0,r16
0xc00000000c0c5620:1 <TFM::PrintTrace(..)+0x91>: nop.m 0x0
0xc00000000c0c5620:2 <TFM::PrintTrace(..)+0x92>: (p6) br.cond.dpnt.many _NZ10TFM07PrintTraceEPi+0xda0;;
0xc00000000c0c5630:0 <TFM::PrintTrace(..)+0xa0>: adds r21=16,r16 <---
0xc00000000c0c5630:1 <TFM::PrintTrace(..)+0xa1>: (p1) mov r62=8;;
0xc00000000c0c5630:2 <TFM::PrintTrace(..)+0xa2>: nop.i 0x0
0xc00000000c0c5640:0 <TFM::PrintTrace(..)+0xb0>: ld8 r42=[r21];; <---
0xc00000000c0c5640:1 <TFM::PrintTrace(..)+0xb1>: cmp.eq.unc p6=r0,r42
0xc00000000c0c5640:2 <TFM::PrintTrace(..)+0xb2>: nop.i 0x0
0xc00000000c0c5650:0 <TFM::PrintTrace(..)+0xc0>: nop.m 0x0
0xc00000000c0c5650:1 <TFM::PrintTrace(..)+0xc1>: mov r47=5
0xc00000000c0c5650:2 <TFM::PrintTrace(..)+0xc2>: (p6) br.cond.dpnt.many _NZ10TFM07PrintTraceEPi+0xdf0;;
0xc00000000c0c5660:0 <TFM::PrintTrace(..)+0xd0>: ld4.a r27=[r48]
;;; 1064 if( dwDataLen <= dwViewLen ) {
0xc00000000c0c5660:1 <TFM::PrintTrace(..)+0xd1>: adds r28=28,r42 <--
0xc00000000c0c5660:2 <TFM::PrintTrace(..)+0xd2>: cmp.ne.unc p6=r0,r46;;
0xc00000000c0c5670:0 <TFM::PrintTrace(..)+0xe0>: ld4.sa r26=[r28],
0xc00000000c0c5670:1 <TFM::PrintTrace(..)+0xe1>: (p6) ld4 r31=[r28] <-- instruction that crashed
Let me know if register values are needed. I think I can acquire the register value using info reg command of gdb.
This is the result of info registers (I excluded values of prXXX and brXXX), I don't have any idea how to map these to the disassembled instruction above.
gr1: 0x9fffffffbf716588
gr2: 0x9fffffff5f667c00
gr3: 0x9fffffff5f667c00
gr4: 0x6000000000e0b000
gr5: 0x9fffffff8adfe2e0
gr6: 0x9fffffff8ada9000
gr7: 0x9fffffff8ad7a000
gr8: 0x1
gr9: 0x9fffffff8adfd0f0
gr10: 0
gr11: 0xc000000000000690
gr12: 0x9fffffff8adfd140
gr13: 0x6000000001681510
gr14: 0x9fffffffbf7d8e98
gr15: 0x1a
gr16: 0x60000000044dac60
gr17: 0x1f
gr18: 0
gr19: 0x9fffffff8ad023f0
gr20: 0x9fffffff8adfd0e0
gr21: 0x60000000044dac70
gr22: 0x9fffffff5f668000
gr23: 0xd
gr24: 0x1
gr25: 0xc0000000004341f0
gr26: NaT
gr27: 0x63
gr28: 0xc00000000c5f801c
gr29: 0xc00000000029db20
gr30: 0xc00000000029db20
gr31: 0x288
gr32: 0x60000000044796d0
gr33: 0x6000000001a78910
gr34: 0x7e
gr35: 0x6000000001d03a90
gr36: 0x9fffffffbf716588
gr37: 0xc000000000000c9d
gr38: 0xc00000000c0c4f70
gr39: 0x9
gr40: 0x6000000004479b60
gr41: 0x6000000004479b58
gr42: 0xc00000000c5f8000
gr43: 0x9fffffffbf7144e0
gr44: 0x6000000004479b50
gr45: 0x6000000004479b68
gr46: 0x6000000001d03a90
gr47: 0x5
gr48: 0x6000000001a7892c
gr49: 0x9fffffff8adfe110
gr50: 0xc000000000000491
gr51: 0xc00000000c0c5520
gr52: 0xc00000000c07dd10
gr53: 0x9fffffff8adfe120
gr54: 0x9fffffff8adfe0a0
gr55: 0xc00000000000058e
gr56: 0xc00000000042be40
gr57: 0x39
gr58: 0x3
gr59: 0x33
gr60: 0
gr61: 0x9fffffffbf7d2598
gr62: 0x8
gr63: 0x9fffffffbf716588
gr64: 0xc000000000000f22
gr65: 0xc00000000c0c5610
This is an update to my previous post. Since I was furnished a copy
of the core file, I used gdb to examine the core file and executed
the following command:
1) bt
2) frame n <- the frame where the abort occurred
3) disas
And here are the results.
(gdb) bt
#0 0xc0000000001e5350:0 in _lwp_kill+0x30 ()
from /usr/lib/hpux64/libpthread.so.1
#1 0xc00000000014c7b0:0 in pthread_kill+0x9d0 ()
from /usr/lib/hpux64/libpthread.so.1
#2 0xc0000000002e4080:0 in raise+0xe0 () from /usr/lib/hpux64/libc.so.1
#3 0xc0000000003f47f0:0 in abort+0x170 () from /usr/lib/hpux64/libc.so.1
#4 0xc00000000e65e0d0:0 in os::abort ()
at /CLO/Components/JAVA_HOTSPOT/Src/src/os/hp-ux/vm/os_hp-ux.cpp:2033
#5 0xc00000000eb473e0:0 in VMError::report_and_die ()
at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/utilities/vmError.cpp:1008
#6 0xc00000000e66fc90:0 in os::Hpux::JVM_handle_hpux_signal ()
at /CLO/Components/JAVA_HOTSPOT/Src/src/os_cpu/hp-ux_ia64/vm/os_hp-ux_ia64.cpp:1051
#7 <signal handler called>
#8 0xc00000000c0c5670:1 in TFMTrace::PrintTrace () at tfmtrace.cpp:1064
#9 0xc00000000c0c4f70:0 in FMLogger::WriteLog () at fmlogger.cpp:90
...
(gdb) frame 8
#8 0xc00000000c0c5670:1 in TFMTrace::PrintTrace () at tfmtrace.cpp:1064
1064 if( dwDataLen <= dwViewLen ) {
Current language: auto; currently c++
(gdb) disas $pc-16*4 $pc+16*4
...
0xc00000000c0c5660:0 <TFMTrace::PrintTrace(...)+0xd0> : ld4.a r27=[r48] MII,
;;; 1064 if( dwDataLen <= dwViewLen ) {
0xc00000000c0c5660:1 <TFMTrace::PrintTrace(...)+0xd1> : adds r28=28,r42
0xc00000000c0c5660:2 <TFMTrace::PrintTrace(...)+0xd2> : cmp.ne.unc p6=r0,r46;;
0xc00000000c0c5670:0 <TFMTrace::PrintTrace(...)+0xe0> : ld4.sa r26=[r28] MMI,
0xc00000000c0c5670:1 <TFMTrace::PrintTrace(...)+0xe1> : (p6) ld4 r31=[r28]
0xc00000000c0c5670:2 <TFMTrace::PrintTrace(...)+0xe2> : adds r46=24,r42;;
0xc00000000c0c5680:0 <TFMTrace::PrintTrace(...)+0xf0> : (p6) st4 [r35]=r31 MI,I
0xc00000000c0c5680:1 <TFMTrace::PrintTrace(...)+0xf1> : adds r59=36,r42;;
0xc00000000c0c5680:2 <TFMTrace::PrintTrace(...)+0xf2> : nop.i 0x0
0xc00000000c0c5690:0 <TFMTrace::PrintTrace(...)+0x100>: ld4.c.clr r27=[r48] MIB,
;;; 1066 dwLen = dwTrcLen ;
0xc00000000c0c5690:1 <TFMTrace::PrintTrace(...)+0x101>: cmp4.eq.unc p6,p8=99,r27
0xc00000000c0c5690:2 <TFMTrace::PrintTrace(...)+0x102>: nop.b 0x0;;
0xc00000000c0c56a0:0 <TFMTrace::PrintTrace(...)+0x110>: (p8) ld4.c.clr r26=[r28] MMI
;;; 1067 }
0xc00000000c0c56a0:1 <TFMTrace::PrintTrace(...)+0x111>: (p6) st4 [r48]=r47
0xc00000000c0c56a0:2 <TFMTrace::PrintTrace(...)+0x112>: cmp4.geu.unc p7=r26,r27
End of assemb
A "normal" crash in native code causes a report like this:
C [libc.so.6+0x88368] strstr+0x64a
Note small offset from the function (strstr in this case) to the crash point.
In your case, JVM decided that the address oxc00000000f675671 is inside libtracejni.so, but the closest function it could find is very far from the crash point (0x5065eff9 == 1.2 GB away).
Is your library really that big?
If it really is that big, chances are you have stripped it, and so the symbol _NZ10TFM07PrintTraceEPi doesn't actually have anything to do with the problem (which is in the code that is 1.2GB away).
You need to find out what code was really at address oxc00000000f675671 at the time of the crash. Usually hs_err_pid*.log contains a list of load addresses for all the shared libraries. Find the load address of libtracejni.so, subtract it from pc. That should give you an address similar to 0x400...675671 which you should be able to lookup in your unstripped version of libtracejni.so.
Also note that crash address ends with ASCII "C8G", which may or may not be a coincidence.
Update 2011/08/05.
Now you know which instruction crashed:
0x4000000000099670:1 <TFMTrace::PrintTrace(...)+0xe1>: (p6) ld4 r31=[r28]
This is a load of 4-byte integer from memory pointed by r28.
The next questions are: what is the value of r28 at crash point (should be logged in hs_err*.log), and also where did it come from (complete disassembly of TFM::PrintTrace will tell you that).

Examining mmaped addresses using GDB

I'm using the driver I posted at Direct Memory Access in Linux to mmap some physical ram into a userspace address. However, I can't use GDB to look at any of the address; i.e., x 0x12345678 (where 0x12345678 is the return value of mmap) fails with an error "Cannot access memory at address 0x12345678".
Is there any way to tell GDB that this memory can be viewed? Alternatively, is there something different I can do in the mmap (either the call or the implementation of foo_mmap there) that will allow it to access this memory?
Note that I'm not asking about /dev/mem (as in the first snippet there) but about a mmap to memory acquired via ioremap(), virt_to_phys() and remap_pfn_range()
I believe Linux does not make I/O memory accessible via ptrace(). You could write a function that simply reads the mmap'ed address and have gdb invoke it. Here's a slightly modified version of your foo-user.c program along with the output from a gdb session.
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
char *mptr;
char peek(int offset)
{
return mptr[offset];
}
int main(void)
{
int fd;
fd = open("/dev/foo", O_RDWR | O_SYNC);
if (fd == -1) {
printf("open error...\n");
return 1;
}
mptr = mmap(0, 1 * 1024 * 1024, PROT_READ | PROT_WRITE,
MAP_FILE | MAP_SHARED, fd, 4096);
printf("On start, mptr points to 0x%lX.\n", (unsigned long) mptr);
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr,
*mptr);
mptr[0] = 'a';
mptr[1] = 'b';
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr,
*mptr);
close(fd);
return 0;
}
$ make foo-user CFLAGS=-g
$ gdb -q foo-user
(gdb) break 27
Breakpoint 1 at 0x804855f: file foo-user.c, line 27.
(gdb) run
Starting program: /home/me/foo/foo-user
On start, mptr points to 0xB7E1E000.
mptr points to 0xB7E1E000. *mptr = 0x61
Breakpoint 1, main () at foo-user.c:27
27 mptr[0] = 'a';
(gdb) n
28 mptr[1] = 'b';
(gdb) print peek(0)
$1 = 97 'a'
(gdb) print peek(1)
$2 = 98 'b'
I have the answer to your conundrum :) I've searched everywhere online without much help and finally debugged it myself.
This post was a good starting point for me. I wanted to achieve something in similar lines, I had implemented a char driver with MMAP to map my custom managed memory to a userspace process. When using GDB, ptrace PEEK calls access_process_vm() to access any memory in your VMA. This causes a EIO error since the generic access cannot get the PA of your memory. It turns out, you have to implement an access function for this memory by implementing .access of your VMA's vm_operations_struct. Below is an example:
//Below code needs to be implemented by your driver:
static struct vm_operations_struct custom_vm_ops = {
.access = custom_vma_access,
};
static inline int custom_vma_access(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write)
{
return custom_generic_access_phys(vma, addr, buf, len, write);
}
static int custom_generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write)
{
void __iomem *maddr;
//int offset = (addr & (PAGE_SIZE-1)) - vma->vm_start;
int offset = (addr) - vma->vm_start;
maddr = phys_to_virt(__pa(custom_mem_VA));
if (write)
memcpy_toio(maddr + offset, buf, len);
else
memcpy_fromio(buf, maddr + offset, len);
return len;
}
To access the mmapped memory, GDB will call ptrace, which will then call __access_remote_vm() to access the mmapped memory. If the memory is mapped with flags such as VMIO | VM_PFNMAP (for example, remap_pfn_range() sets them), GDB will access the memory though vm's access method defined by users.
Instead of writing our own implementation for access(), kernel already provides a generic version called generic_access_phys(), and this method could be easily linked via vm_operations_struct as the /dev/mem device did:
static const struct vm_operations_struct mmap_mem_ops = {
.access = generic_access_phys };
int mmap_mem()
{
.... ....
vma->vm_ops = &mmap_mem_ops;
.... ....
}
It is my understanding that GDB will be using ptrace to poke around in your process's memory. Perhaps you should write a simple program that just attaches to your process and uses ptrace to read from that memory. This might help narrow down what the underlying problem is. If that has no issues, then you know either I'm wrong :), or something else fishy is happening with GDB.
you go "info files"
(gdb) help info files
Names of targets and files being debugged.
Shows the entire stack of targets currently in use (including the exec-file,
core-file, and process, if any), as well as the symbol file name.
(gdb) info files
Symbols from "/bin/ls".
Unix child process:
Using the running image of child Thread 4160418656 (LWP 10729).
While running this, GDB does not access memory from...
Local exec file:
`/bin/ls', file type elf32-powerpc.
Entry point: 0x10002a10
0x10000134 - 0x10000141 is .interp
0x10000144 - 0x10000164 is .note.ABI-tag
0x10000164 - 0x100008f8 is .gnu.hash
0x100008f8 - 0x10001728 is .dynsym
0x10001728 - 0x100021f3 is .dynstr
0x100021f4 - 0x100023ba is .gnu.version
...
0x0ffa8300 - 0x0ffad8c0 is .text in /lib/libacl.so.1
0x0ffad8c0 - 0x0ffad8f8 is .fini in /lib/libacl.so.1
0x0ffad8f8 - 0x0ffadbac is .rodata in /lib/libacl.so.1
0x0ffadbac - 0x0ffadd58 is .eh_frame_hdr in /lib/libacl.so.1
0x0ffadd58 - 0x0ffae4d8 is .eh_frame in /lib/libacl.so.1
0x0ffbe4d8 - 0x0ffbe4e0 is .ctors in /lib/libacl.so.1
0x0ffbe4e0 - 0x0ffbe4e8 is .dtors in /lib/libacl.so.1
...
(gdb) info sh
From To Syms Read Shared Object Library
0xf7fcf960 0xf7fe81a0 Yes /lib/ld.so.1
0x0ffd0820 0x0ffd5d10 Yes /lib/librt.so.1
0x0ffa8300 0x0ffad8c0 Yes /lib/libacl.so.1
0x0ff6a840 0x0ff7f4f0 Yes /lib/libselinux.so.1
0x0fdfe920 0x0ff1ae70 Yes /lib/libc.so.6
0x0fda23d0 0x0fdb0db0 Yes /lib/libpthread.so.0
Failing this, you can use "mem" to configure memory ranges.
(gdb) mem 1 1414
(gdb) info mem
Num Enb Low Addr High Addr Attrs
1 y 0x00000001 0x00000586 rw nocache
(gdb) disable mem 1
(gdb) info mem
Num Enb Low Addr High Addr Attrs
1 n 0x00000001 0x00000586 rw nocache
I think that if that memory is not accessible by GDB then it is not mapped into your process address space and so you get "Cannot access memory at addresss 0x12345678". If that application were run normally you would get a segmentation fault. Also, maybe your driver is screwed and you should check you actually can access the memory from within the kernel. Try with example here:
#include <cstdio>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
int main() {
int fd = open("/dev/zero", O_RDONLY);
void* addr = mmap(NULL, 1024, PROT_READ, MAP_PRIVATE, fd, 0);
for (int x = 0; x < 10; x++) {
printf("%X\n", ((char*)addr)[x]);
}
close(fd);
return 0;
}
If you open an AF_PACKET socket and mmap it, gdb can't access this memory. So there isn't a problem with your driver. It's either a problem with ptrace or with gdb.