Simple MIPS Instructions and Compilers

Simple MIPS Instructions and Compilers - c++

Is it common for compilers (gcc for instance) to generate an instruction that loads some empty memory element into a register? Like... lw at,0(sp) where memory[sp + 0] = 0. This basically just places 0 into $at ($R1.) I ask because I'm looking through an executable file's hex dump (executable file is the result of the compilation of a c++ file) and I'm manually verifying it and if I start at the objdump state entry point I run into an instruction that does this. I'm not sure whether I should take this to be an error if it's just a common compiler action. It seems like a poor way to zero a register. ADDU $at,$0,$0 would be better. Or SLL $at,$0,$0..
The entry point is 400890. The jump target of the jal at the end is an empty memory location (tells me something is probably wrong...) Note that my previous example was purposefully arbitrated.
And just to be clear, -32636+gp is an empty memory location. I can post the memory contents at the point if you'd like proof :).
00400890 <__start>:
400890: 03e00021 move zero,ra
400894: 04110001 bal 40089c <__start+0xc>
400898: 00000000 nop
40089c: 3c1c0fc0 lui gp,0xfc0
4008a0: 279c7864 addiu gp,gp,30820
4008a4: 039fe021 addu gp,gp,ra
4008a8: 0000f821 move ra,zero
4008ac: 8f848034 lw a0,-32716(gp)
4008b0: 8fa50000 lw a1,0(sp)
4008b4: 27a60004 addiu a2,sp,4
4008b8: 2401fff8 li at,-8
4008bc: 03a1e824 and sp,sp,at
4008c0: 27bdffe0 addiu sp,sp,-32
4008c4: 8f878054 lw a3,-32684(gp)
4008c8: 8f888084 lw t0,-32636(gp)<------ this instruction
4008cc: 00000000 nop
4008d0: afa80010 sw t0,16(sp)
4008d4: afa20014 sw v0,20(sp)
4008d8: afbd0018 sw sp,24(sp)
4008dc: 8f998068 lw t9,-32664(gp)
4008e0: 00000000 nop
4008e4: 0320f809 jalr t9
4008e8: 00000000 nop
Jal target is 4010c0.
4010c0: 8f998010 lw t9,-32752(gp)
4010c4: 03e07821 move t7,ra
4010c8: 0320f809 jalr t9

Perhaps it's being placed after a jump statement? If so, that statement is run before the jump occurs and could be a do nothing instruction (nop). Beyond that, it could just be the compiler on a lower optimization setting. Another possibility is that the compiler is preserving the CPU flags field. Shift and Add play with flags while a load I don't believe does.

This looks like CRT code. I think this code is loading some parameters passed by the OS in to $a0 and $a1 registers. Probably some larger structure is passed on the stack and code is loading that structure in to correct stack location.
This code is probably not generated by C compiler but hand coded in assembly.

Related

ARM Cortex-M HardFault exception on writting halfword to flash using C++

I've written a project using C++ to run on ARM Cortex-M (STM32F0) but I had some problems with accessing defined buffers as class members though I resolved that by defining them as global vars.
But now I'm completely stuck with this new problem which I don't know what to do with it.
I've a code to unlock flash and write something into it and close it. If I implement it in C file and run it through C nature (call from main.c) it works perfect. but calling that through C++ files (whether written inside C or C++ source file) it will throw a HardFault Exception.
static uint32_t waitForLastOperation(uint32_t msDelay)
{
while (READ_BIT(FLASH->SR, FLASH_SR_BSY) && msDelay)
{
LL_mDelay(1);
msDelay--;
}
/* Check FLASH End of Operation flag */
if (READ_BIT((FLASH->SR), (FLASH_SR_EOP)))
{
/* Clear FLASH End of Operation pending bit */
(FLASH->SR) = (FLASH_SR_EOP);
}
if (READ_BIT((FLASH->SR),
(FLASH_SR_WRPERR)) || READ_BIT((FLASH->SR), (FLASH_SR_PGERR)))
{
FLASH->SR = 0U;
return 0;
}
/* There is no error flag set */
return 1;
}
uint32_t programHalfWord(uint16_t data, uint32_t address)
{
uint32_t status;
/* Proceed to program the new data */
SET_BIT(FLASH->CR, FLASH_CR_PG);
/* Write data in the address */
*(__IO uint16_t*) address = data;
/* Wait for last operation to be completed */
status = waitForLastOperation(FLASH_TIMEOUT);
if (READ_BIT(FLASH->SR, FLASH_SR_EOP))
FLASH->SR = FLASH_SR_EOP;
/* If the program operation is completed, disable the PG Bit */
CLEAR_BIT(FLASH->CR, FLASH_CR_PG);
return status;
}
uint32_t flash_unlock()
{
if (READ_BIT(FLASH->CR, FLASH_CR_LOCK) == RESET)
return 1;
/* Authorize the FLASH Registers access */
WRITE_REG(FLASH->KEYR, FLASH_KEY1);
WRITE_REG(FLASH->KEYR, FLASH_KEY2);
/* Verify Flash is unlocked */
if (READ_BIT(FLASH->CR, FLASH_CR_LOCK) != RESET)
return 0;
return 1;
}
and this is how I use it:
if(flash_unlock())
{
programHalfWord(0x11, 0x8007C00);
}
It throws exception right after executing *(__IO uint16_t*) address = data;.
Flash is erased at this address, address is aligned (it's actually start of a sector). I've checked everything to make sure that flash is unlocked but it seems that there's something with the code compiled in C++.
I'm using arm-none-eabi-gcc and arm-none-eabi-g++ to compile my code.
Thanks in advance
Update:
Here's the list of flags being used with g++ compiler:
-mcpu=cortex-m0 -std=gnu++14 -g3 -DSTM32F030x6 -DHSE_STARTUP_TIMEOUT=100 -DLSE_STARTUP_TIMEOUT=5000 -DDEBUG -DLSE_VALUE=32768 -DDATA_CACHE_ENABLE=0 -DINSTRUCTION_CACHE_ENABLE=0 -DVDD_VALUE=3300 -DLSI_VALUE=40000 -DHSI_VALUE=8000000 -DUSE_FULL_LL_DRIVER -DPREFETCH_ENABLE=1 -DHSE_VALUE=2000000 -c -I../app/Inc -I../Inc -I../Drivers/STM32F0xx_HAL_Driver/Inc -I../Drivers/CMSIS/Include -I../Drivers/CMSIS/Device/ST/STM32F0xx/Include -I../app/Driver -Og -ffunction-sections -fdata-sections -fno-exceptions -fno-rtti -fno-threadsafe-statics -fno-use-cxa-atexit -Wall -fno-short-enums -fstack-usage --specs=nano.specs -mfloat-abi=soft -mthumb
And this is for gcc:
-mcpu=cortex-m0 -std=gnu11 -g3 -DSTM32F030x6 -DHSE_STARTUP_TIMEOUT=100 -DLSE_STARTUP_TIMEOUT=5000 -DDEBUG -DLSE_VALUE=32768 -DDATA_CACHE_ENABLE=0 -DINSTRUCTION_CACHE_ENABLE=0 -DVDD_VALUE=3300 -DLSI_VALUE=40000 -DHSI_VALUE=8000000 -DUSE_FULL_LL_DRIVER -DPREFETCH_ENABLE=1 -DHSE_VALUE=2000000 -c -I../app/Inc -I../Inc -I../Drivers/STM32F0xx_HAL_Driver/Inc -I../Drivers/CMSIS/Include -I../Drivers/CMSIS/Device/ST/STM32F0xx/Include -I../app/Driver -Og -ffunction-sections -fdata-sections -Wall -fno-short-enums -fstack-usage --specs=nano.specs -mfloat-abi=soft -mthumb
and g++ linker:
-mcpu=cortex-m0 -T"./STM32F030K6TX_FLASH.ld" -Wl,-Map="${ProjName}.map" -Wl,--gc-sections -static --specs=nano.specs -mfloat-abi=soft -mthumb -Wl,--start-group -lc -lm -lstdc++ -lsupc++ -Wl,--end-group

Since it is difficult to analyze the issue without having access to your hardware / software setup, I can only make wild guesses and provide some hints, after having some troubles with STM32 flash programming as well recently (on a different STM32 model (STM32F215RET6)). - But I'm not an expert in this area at all, and I've only used the vendor supplied HAL driver to access the internal flash so far.
The error might be caused by a memory bus error.
It would be interesting to verify if that's case with a debugger (e.g. by reading the flash status register (FLASH_SR), right after the error occurred).
The question is: Why does your C code work, when compiled with gcc and why not, when compiled with g++? I guess, it might have something to do with a technical detail, that the compiler "doesn't know" about the underlying restrictions of the architecture / memory model.
The STM32F030K6T reference manual (RM0360) says, in section "3.2.2 Flash program and erase operations, Main Flash memory programming":
The main Flash memory can be programmed 16 bits at a time. The program operation is started when the CPU writes a half-word into a main Flash memory address with the PG bit of the FLASH_CR register set. Any attempt to write data that are not half-word long will result in a bus error generating a Hard Fault interrupt.
So, 32-bit write access to the internal flash will cause a Hard Fault interrupt.
When you compile the project with assembly listing generation enabled, you could analyze what's exactly going in your C++ variant, and compare it to the generated machine code of the C variant.
Since I've been working on a STM32 flash related issue recently as well, I've looked up what's going on in the vendor supplied flash code in my case (stm32f2xx_hal_flash.c), and it turns out, that the main write operation to the flash (*(__IO uint16_t*)Address = Data;) is translated to the matching ARM half-word store instruction strh, like expected:
strh r1, [r0]
This could be verified by looking at the auto-generated assembly listings for the ST supplied FLASH_Program_HalfWord() function in stm32f2xx_hal_flash.c. It looks like that (compiled with GCC with no optimization and debugging information -Og):
662:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** static void FLASH_Program_HalfWord(uint32_t Address, uint16_t Data)
663:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** {
140 .loc 1 663 1 is_stmt 1 view -0
141 .cfi_startproc
142 # args = 0, pretend = 0, frame = 0
143 # frame_needed = 0, uses_anonymous_args = 0
144 # link register save eliminated.
664:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** /* Check the parameters */
665:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** assert_param(IS_FLASH_ADDRESS(Address));
145 .loc 1 665 3 view .LVU27
666:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c ****
667:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** /* If the previous operation is completed, proceed to program the new data */
668:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** CLEAR_BIT(FLASH->CR, FLASH_CR_PSIZE);
146 .loc 1 668 3 view .LVU28
147 0000 074B ldr r3, .L9
148 0002 1A69 ldr r2, [r3, #16]
149 0004 22F44072 bic r2, r2, #768
150 0008 1A61 str r2, [r3, #16]
669:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** FLASH->CR |= FLASH_PSIZE_HALF_WORD;
151 .loc 1 669 3 view .LVU29
152 .loc 1 669 13 is_stmt 0 view .LVU30
153 000a 1A69 ldr r2, [r3, #16]
154 000c 42F48072 orr r2, r2, #256
155 0010 1A61 str r2, [r3, #16]
670:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** FLASH->CR |= FLASH_CR_PG;
156 .loc 1 670 3 is_stmt 1 view .LVU31
157 .loc 1 670 13 is_stmt 0 view .LVU32
158 0012 1A69 ldr r2, [r3, #16]
159 0014 42F00102 orr r2, r2, #1
160 0018 1A61 str r2, [r3, #16]
671:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c ****
672:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** *(__IO uint16_t*)Address = Data;
161 .loc 1 672 3 is_stmt 1 view .LVU33
162 .loc 1 672 28 is_stmt 0 view .LVU34
163 001a 0180 strh r1, [r0] # movhi
673:Drivers/STM32F2xx_HAL_Driver/Src/stm32f2xx_hal_flash.c **** }
164 .loc 1 673 1 view .LVU35
165 001c 7047 bx lr
166 .L10:
167 001e 00BF .align 2
168 .L9:
169 0020 003C0240 .word 1073888256
170 .cfi_endproc
The generated machine code could be disassembled and inspected with objdump, without all the annotations, like that:
$ arm-none-eabi-objdump -d -j .text.FLASH_Program_HalfWord build/stm32f2xx_hal_flash.o
build/stm32f2xx_hal_flash.o: file format elf32-littlearm
Disassembly of section .text.FLASH_Program_HalfWord:
00000000 <FLASH_Program_HalfWord>:
0: 4b07 ldr r3, [pc, #28] ; (20 <FLASH_Program_HalfWord+0x20>)
2: 691a ldr r2, [r3, #16]
4: f422 7240 bic.w r2, r2, #768 ; 0x300
8: 611a str r2, [r3, #16]
a: 691a ldr r2, [r3, #16]
c: f442 7280 orr.w r2, r2, #256 ; 0x100
10: 611a str r2, [r3, #16]
12: 691a ldr r2, [r3, #16]
14: f042 0201 orr.w r2, r2, #1
18: 611a str r2, [r3, #16]
1a: 8001 strh r1, [r0, #0]
1c: 4770 bx lr
1e: bf00 nop
20: 40023c00 .word 0x40023c00
It would be interesting, if you could find out how it looks like in your object file compiled as C++. Is it also using the strh instruction?
By the way, all the ARM instructions are documented also be ST in the STM32F0xxx Cortex-M0 programming manual (PM0215):
The Cortex-M0 processor implements the ARMv6-M architecture, which is based on the 16-bit Thumb® instruction set and includes Thumb-2 technology.
STRHRt, [Rn, <Rm|#imm>] Store register as halfword
And as a reference, also in the ARM®v6-M Architecture Reference Manual of course.
Side note 1:
The reference manual says that address 0x8007C00 is right at the beginning of flash page 31, in flash sector 7, assuming a STM32F030K6Tx chip is used:
Forgetting about this could cause issues, if the sector is write protected via flash option bytes (but that obviously wasn't the case, since it works fine in the C variant). Just for the sake of completeness (you've already commented on that), a quote from the reference manual, "4.1.3 Write protection option byte":
This set of registers is used to write-protect the Flash memory.
Clearing a bit in WRPx field (and at the same time setting a
corresponding bit in nWRPx field) will write-protect the given memory
sector. For STM32F030x4, STM32F030x6, STM32F070x6, STM32F030x8 and
STM32F070xB devices, WRP bits from 0 to 31 are protecting the
Flash memory by sector of 4 kB.
(Possibly unrelated, but also worth mentioning: beware of the different conditions present when Read Protection (RDP) Level 2 or Level 3 is active. RDP is a different protection mechanism, separate from the sector protection via flash option bytes, or lock state of the flash. Reading the flash from a debugger or when executing form RAM will cause a Hard Fault when RDP Level 2 or 3 is used. Documented in the reference manual, section "3.3.1 Read protection".)
Side note 2:
You could try to mix the official HAL C driver code or your own tested flash related C code, and the new C++ parts of the project, and check if the problem still occurs.
(Be careful when mixing C and C++, and always take care of naming mangeling by using extern "C" { ... }, related post: https://stackoverflow.com/a/1041880/5872574)
Side note 3:
Like already mentioned, I've recently had an unrelated issue with flash programming as well. And saw strange bus errors (in the status register after a Hard Fault). I also made sure that the flash was unlocked, and not write protected. If I remember correctly, I had to add this in front of my erase / write operations (but I do not remember exactly and can't find it right now). It was a necessary but strange fix, because there was no operation in progress, besides regular program execution (from flash).
while (FLASH_WaitForLastOperation(100) != HAL_OK) {
HAL_IWDG_Refresh(&hiwdg);
}
This issue possibly had something to do with the way the STM32 uses the flash with a prefetch buffer / wait states / instruction cache and the data cache like described in the reference manual (see also: FLASH_ACR register). I didn't investigate the issue any further. Just make sure that there is no flash operation pending/active when a write/erase access is initiated.
Also interesting to note, program/erase operations will prevent any read access to the bus (flash memory), but they will not cause an error, like described in the reference manual, in section "3.2.2 Flash program and erase operations":
An ongoing Flash memory operation will not block the CPU as long as
the CPU does not access the Flash memory.
On the contrary, during a program/erase operation to the Flash memory,
any attempt to read the Flash memory will stall the bus. The read
operation will proceed correctly once the program/erase operation has
completed. This means that code or data fetches cannot be made while a
program/erase operation is ongoing.
For program and erase operations on the Flash memory (write/erase),
the internal RC oscillator (HSI) must be ON.
EDIT:
In order to check whether there's really enough flash memory left to write to, and that the area is really unused by the running binary itself, these commands could come in handy, meant as a future reference (using my test binary for an STM32F215RET here):
$ arm-none-eabi-strip build/prj.elf
$ arm-none-eabi-objdump -h build/prj.elf
build/prj.elf: file format elf32-littlearm
Sections:
Idx Name Size VMA LMA File off Algn
0 .isr_vector 00000184 08000000 08000000 00010000 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .text 000134a0 08000188 08000188 00010188 2**3
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .rodata 00002968 08013628 08013628 00023628 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .ARM 00000008 08015f90 08015f90 00025f90 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .init_array 00000004 08015f98 08015f98 00025f98 2**2
CONTENTS, ALLOC, LOAD, DATA
5 .fini_array 00000004 08015f9c 08015f9c 00025f9c 2**2
CONTENTS, ALLOC, LOAD, DATA
6 .data 000002c0 20000000 08015fa0 00030000 2**3
CONTENTS, ALLOC, LOAD, DATA
7 .bss 0000149c 200002c0 08016260 000302c0 2**3
ALLOC
8 ._user_heap_stack 00000604 2000175c 08016260 0003175c 2**0
ALLOC
9 .ARM.attributes 00000029 00000000 00000000 000302c0 2**0
CONTENTS, READONLY
10 .comment 0000001e 00000000 00000000 000302e9 2**0
CONTENTS, READONLY
0x08016260 marks the end of the used flash memory by the binary.
That can be verified with arm-none-eabi-size:
$ arm-none-eabi-size build/prj.elf
text data bss dec hex filename
90004 712 6816 97532 17cfc build/prj.elf
$ echo $((90004 + 712))
90716
$ echo $((0x08016260 - 0x08000000 - (90004 + 712)))
4
So, with 2**3 -> 8 byte alignment and a flash base address of 0x08000000, that means that 90720 bytes of flash memory are actually used by the binary.
To find out which of the flash sectors are left unused, it is now easy to look the address up directly in the "Flash memory organization" table in the reference manual.
In my case, the linker script was modified to make sure that only half of the flash is used, like that:
$ cat STM32F215RETx_FLASH.ld
(...)
MEMORY
{
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 256K /* keep 256K free at the end */
/* FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 512K */
}
(...)
That way you'll get a linker error if the binary gets too large.

I had the same issue. My mistake was invalid alignment.
I'm assured 90% of such a errors is alignment issues, especially on Cortex-M0.
I've made proper alignment of my strucrure inside C++ class and it was enough.
Note: functions executed in RAM, like write half page, should better be inside extern "C" {} expression.

exploiting mips stack based overflow (rop chain)

I'm trying to develop my first MIPS stack based exploit, using ROP chain technique with zero luck...I'm failing on the first ROP gadget and I can't figure out why. I'm following Bowcaster python framework and some blog posts I've googled. As the overflow I've found allowing me to overwrite too few S registers (S0 and S8 only) I need to jump to a function epilogue which will set S0-S8 from the stack. I was using IDA PRO with mipsrop plugin to locate the gadgets. I've tried gadgets from both vulnerable binary and libc, but with the same result. I'm seeing in GDB that the $RA register is set correctly to the address I've choose with mipsrop, but for some reason S0-S8 registers are not overwritten. The ASLR is not an issue here as I can confirm that by running ldd few times. I've choose a gadget with the address 000112EC, using IDA PRO, which looks like that : (I can't post more images - reputation)
LOAD:000112EC lw $ra, 0x48+var_4($sp)
LOAD:000112F0 lw $fp, 0x48+var_8($sp)
LOAD:000112F4 lw $s7, 0x48+var_C($sp)
LOAD:000112F8 lw $s6, 0x48+var_10($sp)
LOAD:000112FC lw $s5, 0x48+var_14($sp)
LOAD:00011300 lw $s4, 0x48+var_18($sp)
LOAD:00011304 lw $s3, 0x48+var_1C($sp)
LOAD:00011308 lw $s2, 0x48+var_20($sp)
LOAD:0001130C lw $s1, 0x48+var_24($sp)
LOAD:00011310 lw $s0, 0x48+var_28($sp)
LOAD:00011314 jr $ra
LOAD:00011318 addiu $sp, 0x48
LOAD:00011318 # End of function scandir
I've added the base libc address to it (echo 'obase=16;ibase=16;2AABE000+112EC' | bc) and I've get 0x2AACF2EC. As it's Big Endian processor I've send the string like this :
AAAA and the first gadget address
Below you can see the full output from GDB :
GDB output after the hit
As you can see only S0 and S8 was overwritten and none of S registers was restored from the stack.
What I'm doing wrong here? Please help :)
some checksec.sh output :
RELRO STACK CANARY NX PIE RPATH RUNPATH FILE
No RELRO No canary found NX enabled No PIE No RPATH No RUNPATH /bin/vulnbin
RELRO STACK CANARY NX PIE RPATH RUNPATH FILE
Full RELRO No canary found NX enabled DSO No RPATH No RUNPATH /lib/libc.so.0

You have gdb output after the instructions run but not before (or during)
You should be single stepping and manually checking each load with x/wx $sp+
Could be your string was corrupted and those values in s0-s7 were in fact pulled from the stack as you asked it to do, where your buffer used to be but is now corrupted.
Also, do you have additional bytes in your overflow string after the new $ra value or does that terminate the string? If it's terminating the string then the s0-s7 registers will be filled with garbage stack data.
bottom line: breakpoint the first load, manually x/wx $sp+X each load before it occurs and stepi and confirm the result
If you need a static gdbserver I have a ton for various ARM, MIPS and MIPSEL architectures/ABIs # embedded-toolkit

find where the interrupt happened on cortex-m4

I am trying to find where in my code a specific interrupt happened. In this case it is on a stm32f4 microcontroller and the interrupt is the SysTick_Handler.
What i want is basically to figure out from where the systick interrupt happened. I am using arm-none-eabi-gdb to try to find the backtrace, but the only information i am getting from there is:
(gdb) bt
#0 SysTick_Handler () at modules/profiling.c:66
#1 <signal handler called>
#2 0x55555554 in ?? () Backtrace stopped: previous frame identical to this frame (corrupt stack?)
How can I get some information about where the program was before the interrupt fired?
Looking at the arm documentation here, it seems I should be able to read the stack pointer, and get the PC from there. But then this is exactly what the unwinder in GDB is doing isnt it?

You were on the right track at the end of your question. The ARM Cortex-M cores have two stack pointers, the main stack pointer (MSP, used for interrupts) and the process stack pointer (PSP, used for tasks).
When an interrupt with priority comes in, the current register values (for most of the registers) are pushed onto the current stack (PSP if interrupting the background application, or MSP if interrupting a lower priority interrupt), and then the stack is switched to the MSP (if not already there).
When you first enter an interrupt, the link register (LR, return address) will have a value that is mostly F's rather than an actual return address. This value tells the core how to exit when branched to. Typically, you'll see a value of 0xFFFFFFFD if the background task was interrupted, or 0xFFFFFFF1 if a lower priority interrupt was interrupted. These values will differ if you are using the floating point unit. The magic in this value, though, is that bit 2 (0x4) tells you whether your stack frame is on the PSP or MSP.
Once you determine which stack your frame is on, you can find the address you were executing from by looking at the appropriate stack pointer minus 24 (6 32-bit locations). See Figure 2.3 in your link. This will point you to the PC from which you were interrupted.

As many of you commented, the PC would be in two different stacks, the way I solved it was by actually finding a HardFault_Handling code in assembly and taking what i needed from there. To get the PC value correctly I am using the following code.
register int *r0 __asm("r0");
__asm( "TST lr, #4\n"
"ITE EQ\n"
"MRSEQ r0, MSP\n"
"MRSNE r0, PSP\n" // stack pointer now in r0
"ldr r0, [r0, #0x18]\n" // stored pc now in r0
//"add r0, r0, #6\n" // address to stored pc now in r0
);
The value of where the interrupt happended can now be accessed by
uint32_t PC = *r0;
and can now be used for whatever I want it. Unfortunately I did not manage to get GDB to unwind the stack automatically for me. But at least I found out where the interrupt was firing, which was the goal.

We keep seeing this question in various forms and folks keep saying there are two stacks. So I tried it myself with the systick.
The documentation says that we are in thread mode out of reset, and if you halt with openocd it says that
target halted due to debug-request, current mode: Thread
I have some code to dump registers:
20000000 APSR
00000000 IPSR
00000000 EPSR
00000000 CONTROL
00000000 SP_PROCESS
20000D00 SP_PROCESS after I modified it
20000FF0 SP_MAIN
20000FF0 mov r0,sp
then I dump the stack up to 0x20001000 which is where I know my stack started
20000FF0 00000000
20000FF4 00000000
20000FF8 00000000
20000FFC 0100005F
I setup and wait for a systick interrupt, the handler dumps registers and ram and then goes into an infinite loop. bad practice in general but just debugging/learning here. Before the interrupt I prep some registers:
.thumb_func
.globl iwait
iwait:
mov r0,#1
mov r1,#2
mov r2,#3
mov r3,#4
mov r4,#13
mov r12,r4
mov r4,#15
mov r14,r4
b .
and in the handler I see
20000000 APSR
0000000F IPSR
00000000 EPSR
00000000 CONTROL
20000D00 SP_PROCESS
20000FC0 SP_MAIN
20000FC0 mov r0,sp
20000FC0 0000000F
20000FC4 20000FFF
20000FC8 00000000
20000FCC FFFFFFF9 this is our special lr (not one rjp mentioned)
20000FD0 00000001 this is r0
20000FD4 00000002 this is r1
20000FD8 00000003 this is r2
20000FDC 00000004 this is r3
20000FE0 0000000D this is r12
20000FE4 0000000F this is r14/lr
20000FE8 01000074 and this is where we were interrupted from
20000FEC 21000000 this is probably the xpsr mentioned
20000FF0 00000000 stuff that was there before
20000FF4 00000000
20000FF8 00000000
20000FFC 0100005F
01000064 <iwait>:
1000064: 2001 movs r0, #1
1000066: 2102 movs r1, #2
1000068: 2203 movs r2, #3
100006a: 2304 movs r3, #4
100006c: 240d movs r4, #13
100006e: 46a4 mov ip, r4
1000070: 240f movs r4, #15
1000072: 46a6 mov lr, r4
1000074: e7fe b.n 1000074 <iwait+0x10>
1000076: bf00 nop
So in this case, straight out of the ARM documentation, it is not using the sp_process it is using sp_main. It is pushing the items the manual says it is pushing including the interrupted/return address which is 0x1000074.
Now, if I set the SPSEL bit (be careful to set the PSP first), it appears that a mov r0,sp in application/thread mode uses the PSP not MSP. But then the handler uses msp for a mov r0,sp but appears to put the
before in thread/foreground
20000000 APSR
00000000 IPSR
00000000 EPSR
00000000 SP_PROCESS
20000D00 SP_PROCESS modified
00000000 CONTROL
00000002 CONTROL modified
20000FF0 SP_MAIN
20000D00 mov r0,sp
now in the handler
20000000 APSR
0000000F IPSR
00000000 EPSR
00000000 CONTROL (interesting!)
20000CE0 SP_PROCESS
20000FE0 SP_MAIN
20000FE0 mov r0,sp
dump of that stack
20000FE0 0000000F
20000FE4 20000CFF
20000FE8 00000000
20000FEC FFFFFFFD
20000FF0 00000000
20000FF4 00000000
20000FF8 00000000
20000FFC 0100005F
dump of sp_process stack
20000CE0 00000001
20000CE4 00000002
20000CE8 00000003
20000CEC 00000004
20000CF0 0000000D
20000CF4 0000000F
20000CF8 01000074 our return value
20000CFC 21000000
So to be in this position of dealing with the alternate stack that folks keep mentioning, you have to put yourself in that position (or some code you rely on). Why you would want to do that for simple bare metal programs, who knows, the control register of all zeros is nice and easy, can share one stack just fine.
I dont use gdb, but you need to get it to dump all the registers sp_process and sp_main then depending on what you find, then dump a dozen or so words at each and in there you should see the 0xFFFFFFFx as a marker then count down from that to see the return address. You can have your handler read the two stack pointers as well then you can look at gprs. With gnu assembler mrs rX,psp; mrs rX,msp; For the process and main stack pointers.

This is called DEBUGGING. The easiest way to get started is to just stick a bunch of printf() calls here and there throughout the code. Run the program. If it prints out:
got to point A
got to point B
got to point C
and dies, then you know it died between "C" and "D." You can now refine that downwards by festooning the code between "C" and "D" with more closely spaced printf() calls.
This is the best way for a beginner to get started. Many seasoned experts also prefer printf() for debugging. Debuggers can get in the way.

ATmega8 doesn't support JMP instruction

Now I'm writing bootloader which starts in the middle of memory, but after it finishes I need to go to the main app, thought to try jmp 0x00, however my chip doesn't support jmp, how should I start main app?

I would use RJMP:
Relative jump to an address within PC - 2K +1 and PC + 2K (words). In
the assembler, labels are used instead of relative operands.
For example:
entry:
rjmp reset
.org 512
reset:
rjmp foo
.org 3072
foo:
rjmp entry
By the way, there are several other jump instructions (RJMP, IJMP, RCALL, ICALL, CALL, RET, RETI etc.) See this relevant discussion.

Well take a look into RET instruction. It returns to previous location, so you can try:
push 0x00
push 0x00
ret
This should work because while entering into any function you push your current location, and RET makes you go back.
As far as I remember ATmege8 has 16-bit address line, but if I'm not right you may need more push 0x00

why not simply use IJMP?
set Z to 0x00 and use IJMP. may be faster than 2xpush and ret
EOR R30, R30 ; clear ZL
EOR R31, R31 ; clear ZH
IJMP ; set PC to Z
should be 4 cycles and 3 instruction words (6 Bytes program memory)

ORG alternative for C++

In assembly we use the org instruction to set the location counter to a specific location in the memory. This is particularly helpful in making Operating Systems. Here's an example boot loader (From wikibooks):
org 7C00h
jmp short Start ;Jump over the data (the 'short' keyword makes the jmp instruction smaller)
Msg: db "Hello World! "
EndMsg:
Start: mov bx, 000Fh ;Page 0, colour attribute 15 (white) for the int 10 calls below
mov cx, 1 ;We will want to write 1 character
xor dx, dx ;Start at top left corner
mov ds, dx ;Ensure ds = 0 (to let us load the message)
cld ;Ensure direction flag is cleared (for LODSB)
Print: mov si, Msg ;Loads the address of the first byte of the message, 7C02h in this case
;PC BIOS Interrupt 10 Subfunction 2 - Set cursor position
;AH = 2
Char: mov ah, 2 ;BH = page, DH = row, DL = column
int 10h
lodsb ;Load a byte of the message into AL.
;Remember that DS is 0 and SI holds the
;offset of one of the bytes of the message.
;PC BIOS Interrupt 10 Subfunction 9 - Write character and colour
;AH = 9
mov ah, 9 ;BH = page, AL = character, BL = attribute, CX = character count
int 10h
inc dl ;Advance cursor
cmp dl, 80 ;Wrap around edge of screen if necessary
jne Skip
xor dl, dl
inc dh
cmp dh, 25 ;Wrap around bottom of screen if necessary
jne Skip
xor dh, dh
Skip: cmp si, EndMsg ;If we're not at end of message,
jne Char ;continue loading characters
jmp Print ;otherwise restart from the beginning of the message
times 0200h - 2 - ($ - $$) db 0 ;Zerofill up to 510 bytes
dw 0AA55h ;Boot Sector signature
;OPTIONAL:
;To zerofill up to the size of a standard 1.44MB, 3.5" floppy disk
;times 1474560 - ($ - $$) db 0
Is it possible accomplish the task with C++? Is there any command, function etc. like org where i can change the location of the program?

No it's not possible to do in any C compiler that I know of. You can however create your own linker script that places the code/data/bss segments at specific addresses.

Just for clarity, the org directive does not load the code at the specified address, it merely informs the assembler that the code will be loaded at that address. The code shown appears to be for Nasm (or similar) - in AT&T syntax, the .org directive does something different: it pads the code to that address - similar to the times line in the Nasm code.. Nasm can do this because in -f bin mode, it "acts as it's own linker".
The important thing for the code to know is the address where Msg can be found. The jmps and jnes (and call and ret which your example doesn't have, but a compiler may generate) are relative addressing mode. We code jmp target but the bytes that are actually emitted say jmp distance_to_target (plus or minus) so the address doesn't matter.
Gas doesn't do this, it emits a linkable object file. To use ld without a linker script the command line looks something like:
ld -o boot.bin boot.o -oformat binary -T text=0x7C00
(don't quote me on that exact syntax but "something like that") If you can get a linkable object file from your (16-bit capable!) C++ compiler, you might be able to do the same.
In the case of a bootsector, the code is loaded by the BIOS (or fake BIOS) at 0x7C00 - one of the few things we can assume about the bootsector. The sane thing for a bootsector to do is not fiddle-faddle around printing a message, but to load something else. You'll need to know how to find the something else on the disk and where you want to load it to (perhaps where your C++ compiler wants to put it by default) - and jmp there. This jmp will want to be a far jmp, which does need to know the address.
I'm guessing it's going to be some butt-ugly C++!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js