Related
I have been assigned to convert this program to ARMv8 assembly.
int power(int x, int y){
    if (x == 0){
        return 0;
    }
    else if (y < 0){
        return 0;
    }
    else if (y == 0){
        return 1;
    }
    else {
        return x * power(x, y - 1);
    }
}
However, I'm not very familiar with ARM assembly language and would like to know where to start.
I have tried to research this a bit, but ultimately found very little about ARM on the internet.
The magic command is arm-linux-gnueabi-gcc -S -O2 -march=armv8-a power.c.
I used arm-linux-gnueabi-gcc since I work on an x86-64 machine and my native gcc does not have ARM targets available. If you are on an ARM system, you should be able to use regular gcc instead. If not, it will simply error out, with no harm done.
-S tells gcc to output assembly.
The -O2 is optional; it just optimizes the code slightly and reduces clutter in the result.
-march=armv8-a tells it to use an ARMv8 target while compiling. I chose armv8-a somewhat arbitrarily. According to the docs, the ARMv8 options are armv8-a, armv8.1-a, armv8.2-a, armv8.3-a, armv8.4-a, armv8.5-a, armv8.6-a, armv8-m.base, armv8-m.main, and armv8.1-m.main. I have no idea what the differences are, so you may want to choose a different one.
power.c just tells it which file to compile. Since we don't specify an output file (e.g. -o output.asm), the assembly is written to power.s.
If you are not compiling on an ARM machine where regular gcc provides the desired target, you can use arm-linux-gnueabi-gcc instead. If you do not have it installed, you can install it with:
sudo apt-get update
sudo apt-get install gcc-arm-linux-gnueabi binutils-arm-linux-gnueabi
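If you want to sanity-check the generated assembly, you can link power.s against a small driver and, on an x86-64 host, run the result under QEMU user-mode emulation (qemu-arm) if you have it installed, for example with arm-linux-gnueabi-gcc -static driver.c power.s -o test followed by qemu-arm ./test. The driver below is my own sketch, not part of the original question, and the file name driver.c is arbitrary.

#include <stdio.h>

int power(int x, int y);   /* implemented in power.s */

int main(void)
{
    printf("power(3, 4)  = %d\n", power(3, 4));   /* expect 81 */
    printf("power(0, 5)  = %d\n", power(0, 5));   /* expect 0  */
    printf("power(2, -1) = %d\n", power(2, -1));  /* expect 0  */
    return 0;
}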
Output
If anyone is curious, this is the output I received when I tried it on my machine.
.arch armv8-a
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "testing.c"
.text
.align 2
.global power
.syntax unified
.arm
.fpu softvfp
.type power, %function
power:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
clz r2, r0
mov r3, r0
lsr r2, r2, #5
orrs r2, r2, r1, lsr #31
bne .L4
cmp r1, #0
mov r0, #1
bxeq lr
.L3:
subs r1, r1, #1
mul r0, r3, r0
bne .L3
bx lr
.L4:
mov r0, #0
bx lr
.size power, .-power
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",%progbits
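As a side note, at -O2 GCC has already turned the recursion into a loop. The body above corresponds roughly to this iterative C, which is my own reconstruction of what the optimizer produced, not compiler output:

int power_iterative(int x, int y)
{
    if (x == 0 || y < 0)   /* the clz/orrs pair tests both conditions at once */
        return 0;
    int result = 1;        /* mov r0, #1; returned directly when y == 0 */
    while (y != 0) {       /* the .L3 loop: subs / mul / bne */
        result *= x;
        y--;
    }
    return result;
}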
How can I get started?
Here is my thought process for how I solved this problem. Looking at it, I could tell it would probably break down into two parts:
How can that code be compiled to assembly?
For the first part, I found this answer, which provided the command gcc -S -fverbose-asm -O2 foo.c. After testing it, I decided to drop -fverbose-asm since it only seemed to add clutter for a program this small.
How can I set the compiler target to ARM v8?
After a quick Google search I found that gcc lets you specify the target architecture with -march=xxx. My next step was to find a list of ARM architectures to select from. After finding gcc.gnu.org/onlinedocs/gcc/ARM-Options.html, I selected armv8-a since it sounded the most correct. When I tried it, gcc told me that the target architecture could not be found. This was not really a surprise, since I am on x86-64 and compilers usually ship with only the compatible targets to reduce the space required. I knew this likely meant I would need to find the apt package that provides the ARM targets, so I searched around until I found this answer, which filled in the rest of the information I needed.
Compiler Explorer is your friend in simple cases like this.
ARMv8-a Clang Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int): // #power(int, int)
stp x29, x30, [sp, #-32]! // 16-byte Folded Spill
str x19, [sp, #16] // 8-byte Folded Spill
mov x29, sp
mov w19, w0
mov w0, wzr
cbz w19, .LBB0_5
tbnz w1, #31, .LBB0_5
cbz w1, .LBB0_4
sub w1, w1, #1
mov w0, w19
bl power(int, int)
mul w0, w0, w19
b .LBB0_5
.LBB0_4:
mov w0, #1
.LBB0_5:
ldr x19, [sp, #16] // 8-byte Folded Reload
ldp x29, x30, [sp], #32 // 16-byte Folded Reload
ret
ARM GCC (linux) Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int):
push {r4, lr}
mov r4, r0
clz r0, r0
lsrs r0, r0, #5
orrs r3, r0, r1, lsr #31
it ne
movne r0, #0
beq .L6
.L1:
pop {r4, pc}
.L6:
movs r0, #1
cmp r1, #0
beq .L1
subs r1, r1, #1
mov r0, r4
bl power(int, int)
mul r0, r4, r0
b .L1
ARM GCC (none) Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int):
push {r4, lr}
mov r4, r0
rsbs r0, r0, #1
movcc r0, #0
orrs r3, r0, r1, lsr #31
movne r0, #0
beq .L6
.L1:
pop {r4, lr}
bx lr
.L6:
cmp r1, #0
moveq r0, #1
beq .L1
sub r1, r1, #1
mov r0, r4
bl power(int, int)
mul r0, r4, r0
b .L1
ARM64 GCC Assembly with the compiler option -O1 for keeping the recursion:
# Compilation provided by Compiler Explorer at https://godbolt.org/
power(int, int):
cmp w1, 0
ccmp w0, 0, 4, ge
bne .L9
mov w0, 0
ret
.L9:
stp x29, x30, [sp, -32]!
mov x29, sp
str x19, [sp, 16]
mov w19, w0
mov w0, 1
cbnz w1, .L10
.L1:
ldr x19, [sp, 16]
ldp x29, x30, [sp], 32
ret
.L10:
sub w1, w1, #1
mov w0, w19
bl power(int, int)
mul w0, w19, w0
b .L1
I'm working with Keil ARMCompiler 6.15 (armclang.exe) and I have doubts about the correctness of the generated assembler code.
It seems to me that the attribute 'interrupt("IRQ")' is ignored.
In my view, r1 and r2 should be saved on the stack, too.
When I remove the 'used' attribute, my entire function is removed (optimized away).
Can anyone see the mistake I made or what I've forgotten?
Originally the code was created for gcc.
Attributes used for interrupt routines:
#define INTERRUPT_PROCEDURE __attribute__((interrupt("IRQ"),used,section(".IsrSection")))
#define ISR_VARIABLE __attribute__((section(".IsrSection")))
#define FAST_SHARED_DATA __attribute__((section(".FastSharedDataSection")))
C++ Code:
uint64_t volatile FAST_SHARED_DATA systick_value = uint64_t(0);
extern "C" {
    void INTERRUPT_PROCEDURE SysTick_Handler()
    {
        systick_value++;
    }
}
Assembler Code:
0x08001280 push {r4, r6, r7, lr}
0x08001282 add r7, sp, #8
0x08001284 mov r4, sp
0x08001286 bfc r4, #0, #3
0x0800128a mov sp, r4
0x0800128c movw r0, #8192 ; 0x2000
0x08001290 movt r0, #8192 ; 0x2000
0x08001294 ldrd r1, r2, [r0]
0x08001298 adds r1, #1
0x0800129a adc.w r2, r2, #0
0x0800129e strd r1, r2, [r0]
0x080012a2 sub.w r4, r7, #8
0x080012a6 mov sp, r4
0x080012a8 pop {r4, r6, r7, pc}
0x080012aa movs r0, r0
0x080012ac movs r0, r0
0x080012ae movs r0, r0
You do not need this attribute. It is needed only in very rare circumstances, when the stack is not aligned to 8 bytes by the hardware (the STKALIGN bit is not set) and you are going to use functions with 64-bit parameters (like uint64_t). The ARM core automatically saves R0-R3 plus some other registers on the stack when entering the ISR handler. If you use the FPU, you may want to enable stacking of the FPU registers as well.
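As a minimal sketch of what that means in practice (my own code, assuming a Cortex-M target; the variable name is taken from the question, while ISR_PROCEDURE is my renamed variant of INTERRUPT_PROCEDURE with the interrupt attribute dropped): since the hardware already stacks R0-R3, R12, LR, PC and xPSR on exception entry, a plain function can serve as the handler.

#include <stdint.h>

#define ISR_PROCEDURE    __attribute__((used, section(".IsrSection")))
#define FAST_SHARED_DATA __attribute__((section(".FastSharedDataSection")))

volatile uint64_t FAST_SHARED_DATA systick_value = 0;

void ISR_PROCEDURE SysTick_Handler(void)
{
    /* r1/r2 used for the 64-bit increment are caller-saved and already stacked by hardware */
    systick_value++;
}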
I have a small application that compiles and runs well on my ARM Cortex-M4. But when I disassemble the binary file that I flash, here is what the first bytes look like:
00000000 <.data>:
0: 20020000 andcs r0, r2, r0
4: 080003b5 stmdaeq r0, {r0, r2, r4, r5, r7, r8, r9}
8: 08000345 stmdaeq r0, {r0, r2, r6, r8, r9}
c: 08000351 stmdaeq r0, {r0, r4, r6, r8, r9}
080003b5 should be the address of the Reset handler (I have .word Reset_Handler there), but disassembling the ELF shows that the Reset handler is actually located at 080003b4, which is 1 byte earlier:
080003b4 <Reset_Handler>:
80003b4: 2100 movs r1, #0
80003b6: e003 b.n 80003c0 <InitData>
(It's running in Thumb mode; I have 2-byte instructions.)
Even if I disassemble the binary file, it's located at 080003b4:
000003b4 <.data+0x3b4>:
3b4: 2100 movs r1, #0
3b6: e003 b.n 0x3c0
My question is: why does it point 1 byte past the handler? Surprisingly, this code works on the actual board. Even without disassembling, shouldn't instructions be aligned to 2 bytes? How can the address be 0x000003b5?
Answer: ARM uses it for switching to Thumb mode. Bit 0 of a branch target or vector-table entry selects the instruction set: when it is set, the core executes the code in Thumb state. The processor clears that bit to form the actual instruction address, so the vector entry 0x080003b5 refers to the Thumb code located at 0x080003b4.
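As a small sketch (my own, not from the question) of how the two addresses relate; the example values in the comments are the ones reported above:

#include <stdint.h>

extern void Reset_Handler(void);   /* the handler from the question */

void show_thumb_bit(void)
{
    uint32_t entry = (uint32_t)(uintptr_t)&Reset_Handler; /* e.g. 0x080003b5: bit 0 set marks Thumb code */
    uint32_t code  = entry & ~1u;                          /* e.g. 0x080003b4: where the instructions actually live */
    (void)code;
}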
I'm creating a C++ program for ARMv6 which crashes with BUS ERROR. Using GDB I have traced the problem to the following code
double d = *(double*)pData; pData += sizeof(int64_t); // char *pData
The program goes through a received message and has to extract some double values using the above code. The received message has several fields, some doubles some not.
On x86 architectures this works fine, but on ARM I get the 'bus error'. So, I suspect my problem is alignment of data -- the double fields have to be aligned to word boundaries in memory on the ARM architecture.
I have tried the following as a fix, which did not work (still got the error):
int64_t i = *(int64_t*)pData;
double d = *((double*)&i);
The following worked (so far):
double d = 0;
memcpy(&d, pData, sizeof(double));
Is using 'memcpy' the best approach? Or, is there a better way?
In my case I do not have control over the packing of the data in the buffer or the order of the fields in the message.
Related question: std::atomic<double> on Armv7 (RPi2) and alignment/bus errors
Is using 'memcpy' the best approach?
In general it's the only correct approach, unless you're targeting a single ABI in which no type requires greater than 1-byte alignment.
The C++ standard is rather verbose, so I'll quote the C standard expressing the same thing much more succinctly:
A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.
There it is: that ever-present spectre of undefined behaviour. Even an x86 compiler is perfectly well allowed to break into your house and rub jam into your hair while you sleep instead of loading that data the way you expect, if its ABI says so.
One thing to note, though, is that modern compilers tend to be clever enough that correctness doesn't necessarily come at the cost of performance. Let's flesh out that example code:
#include <string.h>
double func(char *data) {
    double d;
    memcpy(&d, data, sizeof d);
    return d;
}
...and throw it at a compiler:
$ clang -target arm -march=armv6 -mfpu=vfpv3 -mfloat-abi=hard -O1 -S test.c
...
func: # #func
.fnstart
# BB#0:
push {r4, r5, r11, lr}
sub sp, sp, #8
mov r2, r0
ldrb r1, [r0, #3]
ldrb r3, [r0, #2]
ldrb r12, [r0]
ldrb lr, [r0, #1]
ldrb r4, [r2, #4]!
orr r5, r3, r1, lsl #8
ldrb r3, [r2, #2]
ldrb r2, [r2, #3]
ldrb r0, [r0, #5]
orr r1, r12, lr, lsl #8
orr r2, r3, r2, lsl #8
orr r0, r4, r0, lsl #8
orr r1, r1, r5, lsl #16
orr r0, r0, r2, lsl #16
str r1, [sp]
str r0, [sp, #4]
vpop {d0}
pop {r4, r5, r11, pc}
OK, so it's playing things safe with a bytewise memcpy; at least it's inlined. But hey, ARMv6 does at least support unaligned word and halfword accesses if the CPU is configured appropriately - let's tell the compiler we're cool with that:
$ clang -target arm -march=armv6 -mfpu=vfpv3 -mfloat-abi=hard -O1 -S -munaligned-access test.c
...
func: # #func
.fnstart
# BB#0:
sub sp, sp, #8
ldr r1, [r0]
ldr r0, [r0, #4]
str r0, [sp, #4]
str r1, [sp]
vpop {d0}
bx lr
There we go, that's about the best you can do with just integer word loads. Now, what if we compile it for something a bit newer?
$ clang -target arm -march=armv7 -mfpu=neon-vfpv4 -mfloat-abi=hard -O1 -S test.c
...
func: # #func
.fnstart
# BB#0:
vld1.8 {d0}, [r0]
bx lr
I can guarantee that, even on a machine where it would "work", no undefined-behaviour-hackery would correctly load that unaligned double in fewer than one instruction. Note that NEON is the key player here - vld1 only requires the base address to be aligned to the element size, so for 8-bit elements it can never be unaligned. In the more general case (say, if it were a long long instead of a double) you might still need -munaligned-access to convince the compiler as before.
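For instance, here is my own variant of the same sketch for the long long case just mentioned; without -munaligned-access the compiler may again fall back to a bytewise copy:

#include <string.h>

long long ifunc(const char *data) {
    long long v;
    memcpy(&v, data, sizeof v);   /* the compiler picks a load sequence that is safe for the target */
    return v;
}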
For comparison, let's just see how everyone's favourite mutant-grandchild-of-a-1970s-calculator-chip fares as well:
clang -O1 -S test.c
...
func: # #func
# BB#0:
movl 4(%esp), %eax
fldl (%eax)
retl
Yup, the correct code still also looks like the best code.
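Coming back to the original scenario (several fields at arbitrary offsets in a received buffer), it can be convenient to keep the memcpy pattern in one small helper. This is only a sketch of mine; the cursor advance mirrors the pData += sizeof(int64_t) from the question:

#include <stdint.h>
#include <string.h>

/* Read a double from a possibly unaligned position and advance the cursor. */
static double read_double_field(const char **cursor)
{
    double d;
    memcpy(&d, *cursor, sizeof d);   /* the compiler emits a safe (and usually optimal) load */
    *cursor += sizeof(int64_t);
    return d;
}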
This code (arm):
void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008646B] ^= 1;
        sys.Delay_ms(14);
    }
}
...is compiled to the following asm code:
08000470: ldr r4, [pc, #20] ; (0x8000488 <blinkRed()+24>) // r4 = 0x422191ac
08000472: ldr r6, [pc, #24] ; (0x800048c <blinkRed()+28>)
08000474: movs r5, #14
08000476: ldr r3, [r4, #0]
08000478: eor.w r3, r3, #1
0800047c: str r3, [r4, #0]
0800047e: mov r0, r6
08000480: mov r1, r5
08000482: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
08000486: b.n 0x8000476 <blinkRed()+6>
It is ok.
But if I just change the array index (by -0x400)...
void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008606B] ^= 1;
        sys.Delay_ms(14);
    }
}
...I get less optimized code:
08000470: ldr r4, [pc, #24] ; (0x800048c <blinkRed()+28>) // r4 = 0x42218000
08000472: ldr r6, [pc, #28] ; (0x8000490 <blinkRed()+32>)
08000474: movs r5, #14
08000476: ldr.w r3, [r4, #428] ; 0x1ac
0800047a: eor.w r3, r3, #1
0800047e: str.w r3, [r4, #428] ; 0x1ac
08000482: mov r0, r6
08000484: mov r1, r5
08000486: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
0800048a: b.n 0x8000476 <blinkRed()+6>
The difference is that in the first case r4 is loaded with the target address directly (0x422191ac) and memory is then accessed with 2-byte instructions, while in the second case r4 is loaded with an intermediate address (0x42218000) and memory is accessed with a 4-byte instruction using an offset (+0x1ac) to reach the target address (0x422181ac).
Why does the compiler do this?
I use:
arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -g2 -Wall -O1 -std=gnu++14 -fno-exceptions -fno-use-cxa-atexit -fstrict-volatile-bitfields -c -DSTM32F100C6T6B -DSTM32F10X_LD_VL
bb is:
__attribute__ ((section(".bitband"))) volatile u32 bb[0x00800000];
In .ld it is defined as:
in MEMORY section:
BITBAND(rwx): ORIGIN = 0x42000000, LENGTH = 0x02000000
in SECTIONS section:
.bitband (NOLOAD) :
SUBALIGN(0x02000000)
{
    KEEP(*(.bitband))
} > BITBAND
I would consider it an artefact/missing optimization opportunity of -O1.
It can be understood in more detail if we look at the code generated at -O0 to load bb[...]:
First case:
movw r2, #:lower16:bb
movt r2, #:upper16:bb
movw r3, #37292
movt r3, 33
adds r3, r2, r3
ldr r3, [r3, #0]
Second case:
movw r3, #:lower16:bb
movt r3, #:upper16:bb
add r3, r3, #2195456 ; 0x218000 = 4*0x86000
add r3, r3, #428
ldr r3, [r3, #0]
The code in the second case is better and it can be done this way because the constant can be added with two add instructions (which is not the case if the index is 0x0008646B).
-O1 performs only optimizations that are not time consuming. Apparently it merges the add and the ldr early, and thus later misses the opportunity to load the whole address with one PC-relative ldr.
Compile with -O2 (or -fgcse) and the code looks like expected.
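For reference, the address arithmetic behind the two cases can be written out as a plain C sketch (my own, using the bb base 0x42000000 from the linker script and the indices from the question; the byte offset is index * 4 because bb is a u32 array):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t base = 0x42000000u;            /* ORIGIN of the BITBAND region */
    uint32_t a = base + 0x0008646Bu * 4u;   /* 0x422191ac: the offset 0x2191ac cannot be
                                               formed from two add-immediate operands, so
                                               gcc keeps the full address in the literal pool */
    uint32_t b = base + 0x0008606Bu * 4u;   /* 0x422181ac: 0x218000 + 0x1ac fits in two adds,
                                               so gcc builds the base address and merges the
                                               final 0x1ac into the load/store offset */
    printf("%08x %08x\n", a, b);
    return 0;
}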