I am using Visual Studio 2008 C++ for Windows Mobile 6 ARMV4I and I'm trying to learn to read the ARM assembly code generated by VS, so that I can minimize unnecessary buffer copies within an application. I've created a test application that looks like this:
#include <vector>
typedef std::vector< BYTE > Buf;
class Foo
{
public:
Foo( Buf b ) { b_.swap( b ); };
private:
Buf b_;
};
Buf Create()
{
Buf b( 1024 );
b[ 0 ] = 0x0001;
return b;
}
int _tmain( int argc, _TCHAR* argv[] )
{
Foo f( Create() );
return 0;
}
I'd like to understand if the buffer returned by Create is copied when given to the Foo constructor or if the compiler is able to optimize that copy away. In the Release build with optimizations turned on, this generates assembly like this:
class Foo
{
public:
Foo( Buf b ) { b_.swap( b ); };
0001112C stmdb sp!, {r4 - r7, lr}
00011130 mov r7, r0
00011134 mov r3, #0
00011138 str r3, this
0001113C str r3, [r7, #4]
00011140 str r3, [r7, #8]
00011144 ldr r3, this
00011148 ldr r2, this
0001114C mov r5, r7
00011150 mov r4, r1
00011154 str r3, this, #4
00011158 str r2, this, #4
0001115C mov r6, r1
00011160 ldr r2, this
00011164 ldr r3, this
00011168 mov lr, r7
0001116C str r3, this
00011170 str r2, this
00011174 ldr r2, [lr, #8]!
00011178 ldr r3, [r6, #8]!
0001117C str r3, this
00011180 str r2, this
00011184 ldr r3, this
00011188 movs r0, r3
0001118C beq |Foo::Foo + 0x84 ( 111b0h )|
00011190 ldr r3, [r1, #8]
00011194 sub r1, r3, r0
00011198 cmp r1, #0x80
0001119C bls |Foo::Foo + 0x80 ( 111ach )|
000111A0 bl 000112D4
000111A4 mov r0, r7
000111A8 ldmia sp!, {r4 - r7, pc}
000111AC bl |stlp_std::__node_alloc::_M_deallocate ( 11d2ch )|
000111B0 mov r0, r7
000111B4 ldmia sp!, {r4 - r7, pc}
--- ...\stlport\stl\_vector.h -----------------------------
// snip!
--- ...\asm_test.cpp
private:
Buf b_;
};
Buf Create()
{
00011240 stmdb sp!, {r4, lr}
00011244 mov r4, r0
Buf b( 1024 );
00011248 mov r1, #1, 22
0001124C bl |
b[ 0 ] = 0x0001;
00011250 ldr r3, [r4]
00011254 mov r2, #1
return b;
}
int _tmain( int argc, _TCHAR* argv[] )
{
00011264 str lr, [sp, #-4]!
00011268 sub sp, sp, #0x18
Foo f( Create() );
0001126C add r0, sp, #0xC
00011270 bl |Create ( 11240h )|
00011274 mov r1, r0
00011278 add r0, sp, #0
0001127C bl |Foo::Foo ( 1112ch )|
return 0;
00011280 ldr r0, argc
00011284 cmp r0, #0
00011288 beq |wmain + 0x44 ( 112a8h )|
0001128C ldr r3, [sp, #8]
00011290 sub r1, r3, r0
00011294 cmp r1, #0x80
00011298 bls |wmain + 0x40 ( 112a4h )|
0001129C bl 000112D4
000112A0 b |wmain + 0x44 ( 112a8h )|
000112A4 bl |stlp_std::__node_alloc::_M_deallocate ( 11d2ch )|
000112A8 mov r0, #0
}
What patterns can I look for in the assembly code to understand where the Buf structure is being copied?
Analyzing Create is fairly straightforward because the code is so short. NRVO has clearly been applied here: the return statement generated no instructions, and the return value is constructed in place in the storage that r0 points to.
The copy that would take place for Foo::Foo's pass-by-value parameter is slightly harder to analyze, but there's very little code between the calls to Create and Foo::Foo where the copy would have to take place, and nothing that would do a deep copy of a std::vector. So it looks like that copy has been eliminated as well. The other possibility is a custom calling convention for Foo::Foo where the argument is actually passed by reference and copied inside the function. You'd need someone capable of deeper ARM assembly analysis than I am to rule that out.
The buffer will be copied; you are using C++ pass-by-value semantics, and no compiler will optimize that for you. How it is copied depends on the copy constructor of std::vector.
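One way to check this empirically, without reading the assembly, is to count copy-constructor calls. The following is only a sketch: TrackedBuf and CountCopies are hypothetical names that are not in the question, and TrackedBuf stands in for Buf so that the copy constructor can be instrumented. The count you get depends on the compiler, language standard and optimization settings, so treat it as a measurement for your own build rather than a general answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Buf that counts how often it is deep-copied.
struct TrackedBuf {
    static int copies;                 // copy-constructor calls so far
    std::vector<unsigned char> data;

    TrackedBuf() {}
    explicit TrackedBuf(std::size_t n) : data(n) {}
    TrackedBuf(const TrackedBuf& other) : data(other.data) { ++copies; }
    void swap(TrackedBuf& other) { data.swap(other.data); }
};
int TrackedBuf::copies = 0;

class Foo {
public:
    Foo(TrackedBuf b) { b_.swap(b); }  // same swap idiom as in the question
    std::size_t size() const { return b_.data.size(); }
private:
    TrackedBuf b_;
};

TrackedBuf Create() {
    TrackedBuf b(1024);
    b.data[0] = 0x01;
    return b;                          // candidate for NRVO
}

// Builds a Foo from Create() and returns how many deep copies that took.
int CountCopies() {
    TrackedBuf::copies = 0;
    Foo f(Create());
    assert(f.size() == 1024);          // the buffer arrived intact
    return TrackedBuf::copies;
}
```

If CountCopies() returns 0, both the return from Create and the pass-by-value argument were elided in that build. A non-zero result tells you how many deep copies remain.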
I'm working on a baremetal application for a Cortex M1 on an FPGA. I'm seeing a HardFault during the __libc_init_array function call during startup of my program.
It seems the problem is, during __libc_init_array, it branches to a function "_init". This _init function, however, does not return and runs off into the weeds until it hits a hardfault. I'm not sure what I'm doing wrong that the function isn't returning.
I've distilled my program down to the bare minimum with the full source code in the github link below. I also have put the CFLAGS and LDFLAGS I'm using along with the relevant disassembled functions.
I'm using GCC version "gcc-arm-none-eabi-9-2020-q2-update" from arm.com. I've also tried the newer "gcc-arm-none-eabi-10-2020-q4-major" release and the GCC included with STM32IDE with the same result.
I've compared the compilation flags in STM32CUBEIDE with mine as well, and nothing is jumping out at me, but the elf produced by STM32CUBEIDE is returning from its _init...
Any and all help is appreciated, thank you!
https://github.com/rockybulwinkle/cortex-m1-example
CFLAGS = \
-mthumb \
-march=armv6-m \
-mcpu=cortex-m0 \
-Wall \
-Wextra \
-std=c11 \
-specs=nano.specs \
-O0 \
-fdebug-prefix-map=$(REPO_ROOT)= \
-g \
-ffreestanding \
-ffunction-sections \
-fdata-sections
LDFLAGS = \
-mthumb \
-march=armv6-m \
-mcpu=cortex-m0 \
-Wl,--print-memory-usage \
-Wl,-Map=$(BUILD_DIR)/$(PROJECT).map \
-T m1.ld \
-Wl,--gc-sections \
00000084 <Reset_Handler>:
/*
* This is the code that gets called on processor reset.
* To initialize the device, and call the main() routine.
*/
void Reset_Handler(void)
{
84: b580 push {r7, lr}
86: b082 sub sp, #8
88: af00 add r7, sp, #0
/* Initialize the data segment */
uint32_t *pSrc = &_etext;
8a: 4b13 ldr r3, [pc, #76] ; (d8 <Reset_Handler+0x54>)
8c: 607b str r3, [r7, #4]
uint32_t *pDest = &_sdata;
8e: 4b13 ldr r3, [pc, #76] ; (dc <Reset_Handler+0x58>)
90: 603b str r3, [r7, #0]
if (pSrc != pDest) {
92: 687a ldr r2, [r7, #4]
94: 683b ldr r3, [r7, #0]
96: 429a cmp r2, r3
98: d00c beq.n b4 <Reset_Handler+0x30>
for (; pDest < &_edata;) {
9a: e007 b.n ac <Reset_Handler+0x28>
*pDest++ = *pSrc++;
9c: 687a ldr r2, [r7, #4]
9e: 1d13 adds r3, r2, #4
a0: 607b str r3, [r7, #4]
a2: 683b ldr r3, [r7, #0]
a4: 1d19 adds r1, r3, #4
a6: 6039 str r1, [r7, #0]
a8: 6812 ldr r2, [r2, #0]
aa: 601a str r2, [r3, #0]
for (; pDest < &_edata;) {
ac: 683a ldr r2, [r7, #0]
ae: 4b0c ldr r3, [pc, #48] ; (e0 <Reset_Handler+0x5c>)
b0: 429a cmp r2, r3
b2: d3f3 bcc.n 9c <Reset_Handler+0x18>
}
}
/* Clear the zero segment */
for (pDest = &_sbss; pDest < &_ebss;) {
b4: 4b0b ldr r3, [pc, #44] ; (e4 <Reset_Handler+0x60>)
b6: 603b str r3, [r7, #0]
b8: e004 b.n c4 <Reset_Handler+0x40>
*pDest++ = 0;
ba: 683b ldr r3, [r7, #0]
bc: 1d1a adds r2, r3, #4
be: 603a str r2, [r7, #0]
c0: 2200 movs r2, #0
c2: 601a str r2, [r3, #0]
for (pDest = &_sbss; pDest < &_ebss;) {
c4: 683a ldr r2, [r7, #0]
c6: 4b08 ldr r3, [pc, #32] ; (e8 <Reset_Handler+0x64>)
c8: 429a cmp r2, r3
ca: d3f6 bcc.n ba <Reset_Handler+0x36>
}
/* Run constructors / initializers */
__libc_init_array();
cc: f000 f898 bl 200 <__libc_init_array>
/* Branch to main function */
main();
d0: f7ff ffb6 bl 40 <main>
/* Infinite loop */
while (1);
d4: e7fe b.n d4 <Reset_Handler+0x50>
d6: 46c0 nop ; (mov r8, r8)
d8: 00001370 .word 0x00001370
dc: 2000001c .word 0x2000001c
e0: 20000080 .word 0x20000080
e4: 20000000 .word 0x20000000
e8: 2000001c .word 0x2000001c
00000200 <__libc_init_array>:
200: b570 push {r4, r5, r6, lr}
202: 2600 movs r6, #0
204: 4d0c ldr r5, [pc, #48] ; (238 <__libc_init_array+0x38>)
206: 4c0d ldr r4, [pc, #52] ; (23c <__libc_init_array+0x3c>)
208: 1b64 subs r4, r4, r5
20a: 10a4 asrs r4, r4, #2
20c: 42a6 cmp r6, r4
20e: d109 bne.n 224 <__libc_init_array+0x24>
210: 2600 movs r6, #0
212: f001 f8a9 bl 1368 <_init>
216: 4d0a ldr r5, [pc, #40] ; (240 <__libc_init_array+0x40>)
218: 4c0a ldr r4, [pc, #40] ; (244 <__libc_init_array+0x44>)
21a: 1b64 subs r4, r4, r5
21c: 10a4 asrs r4, r4, #2
21e: 42a6 cmp r6, r4
220: d105 bne.n 22e <__libc_init_array+0x2e>
222: bd70 pop {r4, r5, r6, pc}
224: 00b3 lsls r3, r6, #2
226: 58eb ldr r3, [r5, r3]
228: 4798 blx r3
22a: 3601 adds r6, #1
22c: e7ee b.n 20c <__libc_init_array+0xc>
22e: 00b3 lsls r3, r6, #2
230: 58eb ldr r3, [r5, r3]
232: 4798 blx r3
234: 3601 adds r6, #1
236: e7f2 b.n 21e <__libc_init_array+0x1e>
...
Disassembly of section .init:
00001368 <_init>:
1368: b5f8 push {r3, r4, r5, r6, r7, lr}
136a: 46c0 nop ; (mov r8, r8)
Disassembly of section .fini:
0000136c <_fini>:
136c: b5f8 push {r3, r4, r5, r6, r7, lr}
136e: 46c0 nop ; (mov r8, r8)
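I found my answer while drafting the question. Essentially, I needed to add KEEP directives in my linker script for "init" and "fini". The disassembly above shows _init consisting of only a push and a nop, with no matching pop or return, which is consistent with part of the .init section having been dropped at link time.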
Before:
.text :
{
. = ALIGN(4);
_stext = .;
KEEP(*(.vectors .vectors.*))
*(.text .text.*)
*(.rodata .rodata*)
. = ALIGN(4);
} > rom
After:
.text :
{
. = ALIGN(4);
_stext = .;
KEEP(*(.vectors .vectors.*))
KEEP(*(.init))
KEEP(*(.fini))
*(.text .text.*)
*(.rodata .rodata*)
. = ALIGN(4);
} > rom
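For context, the order of calls visible in the __libc_init_array disassembly above (one table loop, the bl to _init, then a second table loop) can be modelled on a host machine as follows. This is a sketch, not the library source: every name in it (run_init_arrays, the fake_* functions, demo_init_order) is hypothetical, and the tables stand in for the arrays that the linker script normally provides.

```cpp
#include <string>
#include <vector>

typedef void (*InitFn)();

std::vector<std::string> g_call_log;  // records the order of calls

void fake_preinit() { g_call_log.push_back("preinit"); }
void fake_init()    { g_call_log.push_back("_init"); }
void fake_ctor_a()  { g_call_log.push_back("ctor_a"); }
void fake_ctor_b()  { g_call_log.push_back("ctor_b"); }

// Sketch of the startup sequence: first table, then _init, then second table.
void run_init_arrays(InitFn* preinit_begin, InitFn* preinit_end,
                     InitFn init,
                     InitFn* init_begin, InitFn* init_end) {
    for (InitFn* p = preinit_begin; p != preinit_end; ++p) (*p)();
    init();  // on the target this is the bl to _init, which must return
    for (InitFn* p = init_begin; p != init_end; ++p) (*p)();
}

// Runs the sequence over two small hypothetical tables and returns the log.
std::vector<std::string> demo_init_order() {
    InitFn preinit_table[] = { fake_preinit };
    InitFn init_table[]    = { fake_ctor_a, fake_ctor_b };
    g_call_log.clear();
    run_init_arrays(preinit_table, preinit_table + 1, fake_init,
                    init_table, init_table + 2);
    return g_call_log;
}
```

The point of the model is the middle call: if _init never returns, nothing in the second table runs and main is never reached, which matches the HardFault described in the question.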
I am trying to translate this C++ code:
y = y+ x*32;
z = y+ x*x;
I need to translate it to ARM assembly, assuming x is in register R1, y in R2, and z in R3, using only one assembly instruction for each statement.
I think MLA could be used, but I don't know how. Can you please help?
First, put your code snippet in a function to make a complete, compilable example.
void x(void){
volatile int x = 1, y = 2, z = 3;
y = y+ x*32;
z = y+ x*x;
}
Then, compile that on Compiler Explorer.
Result is:
x:
mov r1, #1
mov r2, #2
mov r3, #3
sub sp, sp, #16
str r1, [sp, #4]
str r2, [sp, #8]
str r3, [sp, #12]
ldr r2, [sp, #4]
ldr r3, [sp, #8]
add r3, r3, r2, lsl #5
str r3, [sp, #8]
ldr r0, [sp, #4]
ldr r1, [sp, #4]
ldr r2, [sp, #8]
mla r3, r1, r0, r2
str r3, [sp, #12]
add sp, sp, #16
bx lr
After that, work out which stack slot holds which variable from the initial values: [sp, #4] is x, [sp, #8] is y, and [sp, #12] is z.
Finally, map those slots back to the registers given in the question (x in r1, y in r2, z in r3) to construct the result.
The answer is:
add r2, r2, r1, lsl #5
mla r3, r1, r1, r2
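To double-check the two instructions, here is a small model of what each one computes. It is a sketch: the Regs struct and the function names are made up for illustration, and unsigned 32-bit arithmetic is assumed so that the shift and multiply wrap the same way the registers do.

```cpp
#include <cstdint>

struct Regs { std::uint32_t r1, r2, r3; };  // x, y, z as in the question

// add r2, r2, r1, lsl #5  ->  y = y + (x << 5), i.e. y + x*32
void add_lsl5(Regs& r) { r.r2 = r.r2 + (r.r1 << 5); }

// mla r3, r1, r1, r2  ->  z = x*x + y
void mla(Regs& r) { r.r3 = r.r1 * r.r1 + r.r2; }

// Executes both instructions in order, starting from the given x and y.
Regs run_both(std::uint32_t x, std::uint32_t y) {
    Regs r = { x, y, 0 };
    add_lsl5(r);
    mla(r);
    return r;
}
```

With x = 1 and y = 2, as in the volatile test function above, run_both leaves 34 in r2 and 35 in r3, which matches y = 2 + 1*32 and z = 34 + 1*1.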
I'm working with Keil ARMCompiler 6.15 (armclang.exe) and I doubt the correctness of the generated assembler code.
It seems to me that the attribute 'interrupt("IRQ")' is ignored.
I would expect r1 and r2 to be saved on the stack, too.
When I remove the attribute 'used', my complete function is removed (optimization).
Can anyone see the mistake I made or what I've forgotten?
Originally the code was created for gcc.
Attributes used for interrupt routines:
#define INTERRUPT_PROCEDURE __attribute__((interrupt("IRQ"),used,section(".IsrSection")))
#define ISR_VARIABLE __attribute__((section(".IsrSection")))
#define FAST_SHARED_DATA __attribute__((section(".FastSharedDataSection")))
C++ Code:
uint64_t volatile FAST_SHARED_DATA systick_value = uint64_t(0);
extern "C" {
void INTERRUPT_PROCEDURE SysTick_Handler()
{
systick_value++;
}
}
Assembler Code:
0x08001280 push {r4, r6, r7, lr}
0x08001282 add r7, sp, #8
0x08001284 mov r4, sp
0x08001286 bfc r4, #0, #3
0x0800128a mov sp, r4
0x0800128c movw r0, #8192 ; 0x2000
0x08001290 movt r0, #8192 ; 0x2000
0x08001294 ldrd r1, r2, [r0]
0x08001298 adds r1, #1
0x0800129a adc.w r2, r2, #0
0x0800129e strd r1, r2, [r0]
0x080012a2 sub.w r4, r7, #8
0x080012a6 mov sp, r4
0x080012a8 pop {r4, r6, r7, pc}
0x080012aa movs r0, r0
0x080012ac movs r0, r0
0x080012ae movs r0, r0
You do not need this attribute. It is needed only in the rare case where the hardware does not align the stack to 8 bytes (the STKALGN bit is not set) and you are going to use functions with 64-bit parameters (like uint64_t). On exception entry the core automatically saves R0-R3, R12, LR, PC and xPSR on the stack. If you use the FPU, you may want to enable stacking of the FPU registers as well.
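As an illustration, and assuming a Cortex-M core (which the SysTick handler and the STKALGN bit suggest), the basic frame that the hardware stacks on exception entry can be written down as a struct. The struct name is hypothetical and the layout covers only the case without FPU state. Because r1 and r2 are already in that frame, the handler can be a plain function.

```cpp
#include <cstddef>
#include <cstdint>

// Basic exception frame stacked by Cortex-M hardware on exception entry
// (no FPU state). The struct name is illustrative only.
struct BasicExceptionFrame {
    std::uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};
static_assert(sizeof(BasicExceptionFrame) == 32, "eight 32-bit words");
static_assert(offsetof(BasicExceptionFrame, pc) == 24, "return address is the seventh word");

// Same counter as in the question, without the section attribute.
volatile std::uint64_t systick_value = 0;

// With the frame stacked by hardware, the handler is an ordinary function.
extern "C" void SysTick_Handler() {
    systick_value = systick_value + 1;
}
```

The compiler still has to preserve the callee-saved registers it touches, which is why the listing in the question pushes r4, r6 and r7 but not r1 or r2.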
I have the following source code:
const ClassTwo g_classTwo;
void ClassOne::first()
{
g_classTwo.doSomething(1);
}
void ClassOne::second()
{
g_classTwo.doSomething(2);
}
Which produces the following objdump:
void ClassOne::first()
{
1089c50: e1a0c00d mov ip, sp
1089c54: e92dd800 push {fp, ip, lr, pc}
1089c58: e24cb004 sub fp, ip, #4
1089c5c: e24dd008 sub sp, sp, #8
1089c60: e50b0010 str r0, [fp, #-16]
g_classTwo.doSomething(1);
1089c64: e59f3014 ldr r3, [pc, #20] ; 1089c80 <ClassOne::first()+0x30>
1089c68: e08f3003 add r3, pc, r3
1089c6c: e1a00003 mov r0, r3
1089c70: e3a01001 mov r1, #1
1089c74: ebffffe2 bl 1089c04 <ClassTwo::doSomething(int) const>
}
1089c78: e24bd00c sub sp, fp, #12
1089c7c: e89da800 ldm sp, {fp, sp, pc}
1089c80: 060cd35c .word 0x060cd35c
01089c84 <ClassOne::second()>:
void ClassOne::second()
{
1089c84: e1a0c00d mov ip, sp
1089c88: e92dd800 push {fp, ip, lr, pc}
1089c8c: e24cb004 sub fp, ip, #4
1089c90: e24dd008 sub sp, sp, #8
1089c94: e50b0010 str r0, [fp, #-16]
g_classTwo.doSomething(2);
1089c98: e59f3014 ldr r3, [pc, #20] ; 1089cb4 <ClassOne::second()+0x30>
1089c9c: e08f3003 add r3, pc, r3
1089ca0: e1a00003 mov r0, r3
1089ca4: e3a01002 mov r1, #2
1089ca8: ebffffd5 bl 1089c04 <ClassTwo::doSomething(int) const>
}
1089cac: e24bd00c sub sp, fp, #12
1089cb0: e89da800 ldm sp, {fp, sp, pc}
1089cb4: 060cd328 .word 0x060cd328
Both methods are loading the address of g_classTwo with a pc relative offset: ldr r3, [pc, #20], which translates to 0x060cd35c and 0x060cd328 for the first and second method respectively.
Why are the addresses different even though they are both addressing the same global variable?
How do those addresses relate to the nm output for the same symbol: 07156fcc b g_classTwo?
In ClassOne::first() you have:
1089c64: e59f3014 ldr r3, [pc, #20] ; 1089c80 <ClassOne::first()+0x30>
1089c68: e08f3003 add r3, pc, r3
1089c6c: e1a00003 mov r0, r3
...
1089c80: 060cd35c .word 0x060cd35c
In ClassOne::second() you have:
1089c98: e59f3014 ldr r3, [pc, #20] ; 1089cb4 <ClassOne::second()+0x30>
1089c9c: e08f3003 add r3, pc, r3
1089ca0: e1a00003 mov r0, r3
...
1089cb4: 060cd328 .word 0x060cd328
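In both, r0 is the this pointer (g_classTwo). As you can see, the value loaded from the literal pool into r3 is not an address but an offset, which is then added to pc to get r0. In ARM state, reading pc yields the address of the current instruction plus 8, so the add at 0x01089c68 sees pc = 0x01089c70 and the add at 0x01089c9c sees pc = 0x01089ca4.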
In ClassOne::first(), you get r0 = pc + r3 = 0x01089c70 + 0x060cd35c = 0x07156fcc.
In ClassOne::second(), you get r0 = pc + r3 = 0x01089ca4 + 0x060cd328 = 0x07156fcc.
So for both the this pointer is 0x07156fcc, which is the address of g_classTwo.
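The same arithmetic as a small helper, in case you want to resolve other literal-pool entries from the objdump. This is a sketch: the function and constant names are hypothetical, and it assumes ARM (not Thumb) state, where pc reads as the instruction address plus 8.

```cpp
#include <cstdint>

// In ARM state, reading pc yields the instruction's own address plus 8.
const std::uint32_t kArmPcReadAhead = 8;

// add_instr_addr: address of the "add r3, pc, r3" instruction.
// literal:        the .word value that was loaded into r3 just before it.
std::uint32_t resolve_pc_relative(std::uint32_t add_instr_addr,
                                  std::uint32_t literal) {
    return add_instr_addr + kArmPcReadAhead + literal;
}
```

Feeding it (0x01089c68, 0x060cd35c) for first() and (0x01089c9c, 0x060cd328) for second() gives 0x07156fcc both times, the address nm reports for g_classTwo.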
This code (arm):
void blinkRed(void)
{
for(;;)
{
bb[0x0008646B] ^= 1;
sys.Delay_ms(14);
}
}
...is compiled to the following asm code:
08000470: ldr r4, [pc, #20] ; (0x8000488 <blinkRed()+24>) // r4 = 0x422191ac
08000472: ldr r6, [pc, #24] ; (0x800048c <blinkRed()+28>)
08000474: movs r5, #14
08000476: ldr r3, [r4, #0]
08000478: eor.w r3, r3, #1
0800047c: str r3, [r4, #0]
0800047e: mov r0, r6
08000480: mov r1, r5
08000482: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
08000486: b.n 0x8000476 <blinkRed()+6>
That is fine.
But if I just change the array index (by -0x400)...
void blinkRed(void)
{
for(;;)
{
bb[0x0008606B] ^= 1;
sys.Delay_ms(14);
}
}
...I get less optimized code:
08000470: ldr r4, [pc, #24] ; (0x800048c <blinkRed()+28>) // r4 = 0x42218000
08000472: ldr r6, [pc, #28] ; (0x8000490 <blinkRed()+32>)
08000474: movs r5, #14
08000476: ldr.w r3, [r4, #428] ; 0x1ac
0800047a: eor.w r3, r3, #1
0800047e: str.w r3, [r4, #428] ; 0x1ac
08000482: mov r0, r6
08000484: mov r1, r5
08000486: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
0800048a: b.n 0x8000476 <blinkRed()+6>
The difference is that in the first case r4 is loaded with the target address directly (0x422191ac) and memory is then accessed with 2-byte instructions, but in the second case r4 is loaded with an intermediate address (0x42218000) and memory is then accessed with 4-byte instructions that add an offset (+0x1ac) to reach the target address (0x422181ac).
Why does the compiler do this?
I use:
arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -g2 -Wall -O1 -std=gnu++14 -fno-exceptions -fno-use-cxa-atexit -fstrict-volatile-bitfields -c -DSTM32F100C6T6B -DSTM32F10X_LD_VL
bb is:
__attribute__ ((section(".bitband"))) volatile u32 bb[0x00800000];
In .ld it is defined as:
in MEMORY section:
BITBAND(rwx): ORIGIN = 0x42000000, LENGTH = 0x02000000
in SECTIONS section:
.bitband (NOLOAD) :
SUBALIGN(0x02000000)
{
KEEP(*(.bitband))
} > BITBAND
I would consider it an artefact/missing optimization opportunity of -O1.
It can be understood in more detail if we look at the code generated with -O0 to load bb[...]:
First case:
movw r2, #:lower16:bb
movt r2, #:upper16:bb
movw r3, #37292
movt r3, 33
adds r3, r2, r3
ldr r3, [r3, #0]
Second case:
movw r3, #:lower16:bb
movt r3, #:upper16:bb
add r3, r3, #2195456 ; 0x218000 = 4*0x86000
add r3, r3, #428
ldr r3, [r3, #0]
The code in the second case is better, and it can be done this way because the constant can be added with two add instructions (which is not the case if the index is 0x0008646B).
-O1 performs only optimizations that are not time consuming. So apparently it merges the add and the ldr early, and later misses the opportunity to load the whole address with one pc-relative ldr.
Compile with -O2 (or -fgcse) and the code looks as expected.
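To see why only one of the two indices splits into two add immediates, the offsets can be worked out by hand or with a few helper functions. This is a deliberately simplified sketch with hypothetical function names. It assumes bb starts at 0x42000000 (the ORIGIN of BITBAND in the .ld), and it tests only one split: a high part that must fit in an 8-bit value shifted into place, plus a 12-bit remainder. The real Thumb-2 immediate encodings allow a few more patterns, which this ignores.

```cpp
#include <cstdint>

const std::uint32_t kBitbandBase = 0x42000000u;  // ORIGIN of BITBAND in the .ld

// bb is an array of u32, so each index step is 4 bytes.
std::uint32_t byte_offset(std::uint32_t index) { return index * 4u; }

std::uint32_t high_part(std::uint32_t offset) { return offset & ~0xFFFu; }
std::uint32_t low_part(std::uint32_t offset)  { return offset & 0xFFFu; }

// Number of bit positions from the lowest to the highest set bit, inclusive.
int bit_span(std::uint32_t v) {
    if (v == 0) return 0;
    int lo = 0, hi = 31;
    while (((v >> lo) & 1u) == 0) ++lo;
    while (((v >> hi) & 1u) == 0) --hi;
    return hi - lo + 1;
}

// Simplified test: high part as one shifted 8-bit constant, low part in 12 bits.
bool fits_two_adds(std::uint32_t offset) {
    return bit_span(high_part(offset)) <= 8;
}
```

For index 0x0008606B the offset is 0x2181AC, which splits into 0x218000 (a 7-bit span) plus 428, the two add immediates in the second listing. For index 0x0008646B the offset is 0x2191AC, whose high part 0x219000 spans 10 bits, so this split does not work and the compiler builds the constant with movw/movt instead.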