CMP/BEQ not working, always branching (ARM) - if-statement

I'm getting very mad at this and can't figure out why my BEQ is always taken.
The program should replace each char located in memory (address in R0):
_ should become +
C should become A
A should become B
B should become C
This is what I have so far (comments translated from French):
MOV R11, #0 ; Initialize the count of chars processed
MOV R10, #43 ; R10 = +
MOV R9, #'_' ; R9 = _
MOV R8, #'A' ; R8 = A
MOV R7, #'B' ; R7 = B
MOV R6, #'C' ; R6 = C
TOP:
LDRB R5, [R0, R11] ; Copy element X into R5
CMP R5, R9
BEQ PLUS
CMP R5, R8
BEQ A
CMP R5, R7
BEQ B
CMP R5, R6
BEQ C
PLUS: ; Branch target if '_'
STRB R10, [R0, R11]
A: ; Branch target if 'A'
STRB R8, [R0, R11]
B: ; Branch target if 'B'
STRB R7, [R0, R11]
C: ; Branch target if 'C'
STRB R6, [R0, R11]
ADDS R11, R11, #1 ; ++count
CMP R11, R1 ; Check the loop condition
BNE TOP

Apparently it's not only C's switch() that confuses people...
So, what you're currently doing is the equivalent of
for (size_t i = 0; i < n; i++)
{
    switch (chararray[i])
    {
    default:
    case '_': chararray[i] = '+';
    case 'C': chararray[i] = 'A';
    case 'A': chararray[i] = 'B';
    case 'B': chararray[i] = 'C';
    }
}
You're missing the break; after every case.
Edit, because it seems I have to make it really obvious:
for (size_t i = 0; i < n; i++)
{
    switch (chararray[i])
    {
    default:
        break;
    case '_': chararray[i] = '+';
        break;
    case 'C': chararray[i] = 'A';
        break;
    case 'A': chararray[i] = 'B';
        break;
    case 'B': chararray[i] = 'C';
        break; // unnecessary, but I put it in for regularity
    }
}

To expand on EOF's answer, you can see what's going on by tracing through a sample execution instruction-by-instruction - a debugger always helps, but this is simple enough to do by hand. Let's consider a couple of different situations:
Instruction                   case char=='A'           case char=='Z'
----------------------------------------------------------------------------
...
LDRB R5, [R0, R11]            executes, r5='A'         executes, r5='Z'
CMP R5, R9                    executes, flags=ne       executes, flags=ne
BEQ PLUS                      flags!=eq, not taken     flags!=eq, not taken
CMP R5, R8                    executes, flags=eq       executes, flags=ne
BEQ A                         flags==eq, taken         flags!=eq, not taken
CMP R5, R7                         |                   executes, flags=ne
BEQ B                              |                   flags!=eq, not taken
CMP R5, R6                         |                   executes, flags=ne
BEQ C                              |                   flags!=eq, not taken
PLUS: STRB R10, [R0, R11]          V                   executes: oops!
A: STRB R8, [R0, R11]         executes                 executes: oops!
B: STRB R7, [R0, R11]         executes: oops!          executes: oops!
C: STRB R6, [R0, R11]         executes: oops!          executes: oops!
ADDS R11, R11, #1             executes                 executes
...
So whatever the input, everything ends up as 'C'! (Note there's also a register mixup for 'A', 'B', and 'C': if you match r8, you jump to a store of r8, and so on.) Implementing the equivalent of break is a matter of making sure instructions are skipped when you don't want them executed:
...
CMP R5, R6
BEQ C
B LOOP ; no match, skip everything
PLUS: STRB R10, [R0, R11]
B LOOP ; we've stored '_', skip 'B', 'C', and 'A'
A: STRB R7, [R0, R11]
B LOOP ; we've stored 'B', skip 'C' and 'A'
B: STRB R6, [R0, R11]
B LOOP ; we've stored 'C', skip 'A'
C: STRB R8, [R0, R11] ; nothing to skip, just fall through to the loop
LOOP: ADDS R11, R11, #1
...
However, note that unlike on most architectures, ARM's conditional execution applies to most instructions. So an alternative approach, given a small number of simple routines (1-3 instructions each), is to remove all the branches and let conditional execution take care of it:
...
LDRB R5, [R0, R11]
CMP R5, R9
STRBEQ R10, [R0, R11]
CMP R5, R8
STRBEQ R7, [R0, R11]
CMP R5, R7
STRBEQ R6, [R0, R11]
CMP R5, R6
STRBEQ R8, [R0, R11]
ADDS R11, R11, #1
...
That way, everything gets "executed", but any stores which fail their condition check just do nothing.
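In C terms, that conditional-store version behaves like a chain of independent ifs rather than an if/else chain - a minimal sketch, with buf and n standing in for the buffer address in R0 and the length in R1:

void rotateChars(unsigned char *buf, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        unsigned char c = buf[i]; // like R5: the tests compare c, which the stores never modify
        if (c == '_') buf[i] = '+';
        if (c == 'A') buf[i] = 'B';
        if (c == 'B') buf[i] = 'C';
        if (c == 'C') buf[i] = 'A';
    }
}

The chain works without else because each store changes only memory; c (like R5) still holds the original character for the remaining comparisons.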


libc init function is not returning

I'm working on a bare-metal application for a Cortex-M1 on an FPGA. I'm seeing a HardFault during the __libc_init_array call at startup.
The problem seems to be that __libc_init_array branches to a function _init, which never returns: execution runs off into the weeds until it hits a HardFault. I'm not sure what I'm doing wrong that keeps the function from returning.
I've distilled my program down to the bare minimum with the full source code in the github link below. I also have put the CFLAGS and LDFLAGS I'm using along with the relevant disassembled functions.
I'm using GCC version "gcc-arm-none-eabi-9-2020-q2-update" from arm.com. I've also tried the newer "gcc-arm-none-eabi-10-2020-q4-major" release and the GCC included with STM32IDE with the same result.
I've compared the compilation flags in STM32CUBEIDE with mine as well, and nothing is jumping out at me, but the elf produced by STM32CUBEIDE is returning from its _init...
Any and all help is appreciated, thank you!
https://github.com/rockybulwinkle/cortex-m1-example
CFLAGS = \
    -mthumb \
    -march=armv6-m \
    -mcpu=cortex-m0 \
    -Wall \
    -Wextra \
    -std=c11 \
    -specs=nano.specs \
    -O0 \
    -fdebug-prefix-map=$(REPO_ROOT)= \
    -g \
    -ffreestanding \
    -ffunction-sections \
    -fdata-sections
LDFLAGS = \
    -mthumb \
    -march=armv6-m \
    -mcpu=cortex-m0 \
    -Wl,--print-memory-usage \
    -Wl,-Map=$(BUILD_DIR)/$(PROJECT).map \
    -T m1.ld \
    -Wl,--gc-sections \
00000084 <Reset_Handler>:
/*
* This is the code that gets called on processor reset.
* To initialize the device, and call the main() routine.
*/
void Reset_Handler(void)
{
84: b580 push {r7, lr}
86: b082 sub sp, #8
88: af00 add r7, sp, #0
/* Initialize the data segment */
uint32_t *pSrc = &_etext;
8a: 4b13 ldr r3, [pc, #76] ; (d8 <Reset_Handler+0x54>)
8c: 607b str r3, [r7, #4]
uint32_t *pDest = &_sdata;
8e: 4b13 ldr r3, [pc, #76] ; (dc <Reset_Handler+0x58>)
90: 603b str r3, [r7, #0]
if (pSrc != pDest) {
92: 687a ldr r2, [r7, #4]
94: 683b ldr r3, [r7, #0]
96: 429a cmp r2, r3
98: d00c beq.n b4 <Reset_Handler+0x30>
for (; pDest < &_edata;) {
9a: e007 b.n ac <Reset_Handler+0x28>
*pDest++ = *pSrc++;
9c: 687a ldr r2, [r7, #4]
9e: 1d13 adds r3, r2, #4
a0: 607b str r3, [r7, #4]
a2: 683b ldr r3, [r7, #0]
a4: 1d19 adds r1, r3, #4
a6: 6039 str r1, [r7, #0]
a8: 6812 ldr r2, [r2, #0]
aa: 601a str r2, [r3, #0]
for (; pDest < &_edata;) {
ac: 683a ldr r2, [r7, #0]
ae: 4b0c ldr r3, [pc, #48] ; (e0 <Reset_Handler+0x5c>)
b0: 429a cmp r2, r3
b2: d3f3 bcc.n 9c <Reset_Handler+0x18>
}
}
/* Clear the zero segment */
for (pDest = &_sbss; pDest < &_ebss;) {
b4: 4b0b ldr r3, [pc, #44] ; (e4 <Reset_Handler+0x60>)
b6: 603b str r3, [r7, #0]
b8: e004 b.n c4 <Reset_Handler+0x40>
*pDest++ = 0;
ba: 683b ldr r3, [r7, #0]
bc: 1d1a adds r2, r3, #4
be: 603a str r2, [r7, #0]
c0: 2200 movs r2, #0
c2: 601a str r2, [r3, #0]
for (pDest = &_sbss; pDest < &_ebss;) {
c4: 683a ldr r2, [r7, #0]
c6: 4b08 ldr r3, [pc, #32] ; (e8 <Reset_Handler+0x64>)
c8: 429a cmp r2, r3
ca: d3f6 bcc.n ba <Reset_Handler+0x36>
}
/* Run constructors / initializers */
__libc_init_array();
cc: f000 f898 bl 200 <__libc_init_array>
/* Branch to main function */
main();
d0: f7ff ffb6 bl 40 <main>
/* Infinite loop */
while (1);
d4: e7fe b.n d4 <Reset_Handler+0x50>
d6: 46c0 nop ; (mov r8, r8)
d8: 00001370 .word 0x00001370
dc: 2000001c .word 0x2000001c
e0: 20000080 .word 0x20000080
e4: 20000000 .word 0x20000000
e8: 2000001c .word 0x2000001c
00000200 <__libc_init_array>:
200: b570 push {r4, r5, r6, lr}
202: 2600 movs r6, #0
204: 4d0c ldr r5, [pc, #48] ; (238 <__libc_init_array+0x38>)
206: 4c0d ldr r4, [pc, #52] ; (23c <__libc_init_array+0x3c>)
208: 1b64 subs r4, r4, r5
20a: 10a4 asrs r4, r4, #2
20c: 42a6 cmp r6, r4
20e: d109 bne.n 224 <__libc_init_array+0x24>
210: 2600 movs r6, #0
212: f001 f8a9 bl 1368 <_init>
216: 4d0a ldr r5, [pc, #40] ; (240 <__libc_init_array+0x40>)
218: 4c0a ldr r4, [pc, #40] ; (244 <__libc_init_array+0x44>)
21a: 1b64 subs r4, r4, r5
21c: 10a4 asrs r4, r4, #2
21e: 42a6 cmp r6, r4
220: d105 bne.n 22e <__libc_init_array+0x2e>
222: bd70 pop {r4, r5, r6, pc}
224: 00b3 lsls r3, r6, #2
226: 58eb ldr r3, [r5, r3]
228: 4798 blx r3
22a: 3601 adds r6, #1
22c: e7ee b.n 20c <__libc_init_array+0xc>
22e: 00b3 lsls r3, r6, #2
230: 58eb ldr r3, [r5, r3]
232: 4798 blx r3
234: 3601 adds r6, #1
236: e7f2 b.n 21e <__libc_init_array+0x1e>
...
Disassembly of section .init:
00001368 <_init>:
1368: b5f8 push {r3, r4, r5, r6, r7, lr}
136a: 46c0 nop ; (mov r8, r8)
Disassembly of section .fini:
0000136c <_fini>:
136c: b5f8 push {r3, r4, r5, r6, r7, lr}
136e: 46c0 nop ; (mov r8, r8)
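For reference, the __libc_init_array disassembly above corresponds to roughly this C - a paraphrase of newlib's implementation from memory, so treat it as a sketch:

#include <stddef.h>

extern void (*__preinit_array_start[])(void);
extern void (*__preinit_array_end[])(void);
extern void (*__init_array_start[])(void);
extern void (*__init_array_end[])(void);
extern void _init(void);

void __libc_init_array(void)
{
    size_t count, i;

    /* the first loop in the disassembly */
    count = __preinit_array_end - __preinit_array_start;
    for (i = 0; i < count; i++)
        __preinit_array_start[i]();

    _init(); /* the bl at 0x212 - the call that never returns here */

    /* the second loop in the disassembly */
    count = __init_array_end - __init_array_start;
    for (i = 0; i < count; i++)
        __init_array_start[i]();
}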
I found my answer while drafting the question. Essentially, I needed to add KEEP directives in my linker script for the .init and .fini sections. With -Wl,--gc-sections, the linker discards input sections that nothing references. _init is stitched together from fragments: the prologue (the push) comes from crti.o and the epilogue (the matching pop into pc) comes from crtn.o. Only the fragment containing the referenced _init symbol survived garbage collection, so _init had no epilogue and fell off the end of .init - which is exactly what the disassembly above shows: just a push and a nop. KEEP stops the linker from collecting those sections.
Before:
.text :
{
    . = ALIGN(4);
    _stext = .;
    KEEP(*(.vectors .vectors.*))
    *(.text .text.*)
    *(.rodata .rodata*)
    . = ALIGN(4);
} > rom
After:
.text :
{
    . = ALIGN(4);
    _stext = .;
    KEEP(*(.vectors .vectors.*))
    KEEP(*(.init))
    KEEP(*(.fini))
    *(.text .text.*)
    *(.rodata .rodata*)
    . = ALIGN(4);
} > rom

gdb - optimized value analysis

My CPU is ARM. How can I figure out a function parameter's value if it's optimized out?
For example:
status_t NuPlayer::GenericSource::setDataSource(
        int fd, int64_t offset, int64_t length) {
    resetDataSource();
    mFd = dup(fd);
    mOffset = offset;
    mLength = length;
The function above has 3 parameters. When I try to print the second parameter, offset, I get the result below:
Thread 4 "Binder:15082_3" hit Breakpoint 1, android::NuPlayer::GenericSource::setDataSource (this=0xae63bb40, fd=8, offset=<optimized out>, length=9384436) at frameworks/av/media/libmediaplayerservice/nuplayer/GenericSource.cpp:123
123 resetDataSource();
(gdb) x/i $pc
=> 0xb02aaa80 <android::NuPlayer::GenericSource::setDataSource(int, long long, long long)+12>: blx 0xb0282454 <_ZN7android8NuPlayer13GenericSource15resetDataSourceEv@plt>
(gdb) n
125 mFd = dup(fd);
(gdb) print offset
$1 = <optimized out>
(gdb) p $eax
$2 = void
(gdb) disassemble /m
Dump of assembler code for function android::NuPlayer::GenericSource::setDataSource(int, long long, long long):
122 int fd, int64_t offset, int64_t length) {
0xb02aaa74 <+0>: push {r4, r5, r6, r7, lr}
0xb02aaa76 <+2>: sub sp, #4
0xb02aaa78 <+4>: mov r4, r3
0xb02aaa7a <+6>: mov r5, r2
0xb02aaa7c <+8>: mov r6, r1
0xb02aaa7e <+10>: mov r7, r0
123 resetDataSource();
=> 0xb02aaa80 <+12>: blx 0xb0282454 <_ZN7android8NuPlayer13GenericSource15resetDataSourceEv@plt>
124
125 mFd = dup(fd);
0xb02aaa84 <+16>: mov r0, r6
0xb02aaa86 <+18>: blx 0xb027e5d8 <dup@plt>
0xb02aaa8a <+22>: ldrd r2, r1, [sp, #24]
0xb02aaa8e <+26>: str.w r0, [r7, #224] ; 0xe0
0xb02aaa92 <+30>: movs r0, #0
126 mOffset = offset;
0xb02aaa94 <+32>: strd r5, r4, [r7, #232] ; 0xe8
127 mLength = length;
0xb02aaa98 <+36>: strd r2, r1, [r7, #240] ; 0xf0
128
129 // delay data source creation to prepareAsync() to avoid blocking
130 // the calling thread in setDataSource for any significant time.
131 return OK;
0xb02aaa9c <+40>: add sp, #4
0xb02aaa9e <+42>: pop {r4, r5, r6, r7, pc}
End of assembler dump.
(gdb)
I guess it's in some register but the result of $eax is void.
There is no register called eax on ARM.
To know which register a parameter lives in, you need to know the calling convention - for 32-bit ARM, the AAPCS:
r0 to r3: used to hold argument values passed to a subroutine
A 64-bit argument such as int64_t occupies an even/odd register pair, so here this is in r0, fd in r1, offset in r2 (low word) and r3 (high word), and length spills to the stack - which matches the ldrd r2, r1, [sp, #24] in your disassembly. So do info registers, verify that r0 == 0xae63bb40 and r1 == 8, and reconstruct the offset from r2/r3, e.g. p ((long long)$r3 << 32) | $r2.
Separately, note that the example code assigns the parameters to member variables right away, so once these lines have executed, printing mOffset and mLength gives you exactly the values of the optimized-out parameters:
mOffset = offset;
mLength = length;
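If you can rebuild the code, a common workaround is to keep a volatile copy of the parameter: the compiler must then materialize the value in memory, where gdb can always read it. A standalone sketch with illustrative names (not the original Android source):

#include <stdint.h>
#include <stdio.h>

int64_t g_sink; // global, so the compiler cannot discard the computation

void setDataSourceDemo(int fd, int64_t offset, int64_t length)
{
    volatile int64_t offset_dbg = offset; // gdb: print offset_dbg
    g_sink = fd + offset_dbg + length;
}

int main()
{
    setDataSourceDemo(8, 42, 9384436);
    printf("%lld\n", (long long)g_sink);
    return 0;
}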

GCC generates different code depending on array index value

This code (arm):
void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008646B] ^= 1;
        sys.Delay_ms(14);
    }
}
...is compiled to the following asm code:
08000470: ldr r4, [pc, #20] ; (0x8000488 <blinkRed()+24>) // r4 = 0x422191ac
08000472: ldr r6, [pc, #24] ; (0x800048c <blinkRed()+28>)
08000474: movs r5, #14
08000476: ldr r3, [r4, #0]
08000478: eor.w r3, r3, #1
0800047c: str r3, [r4, #0]
0800047e: mov r0, r6
08000480: mov r1, r5
08000482: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
08000486: b.n 0x8000476 <blinkRed()+6>
That is fine. But if I just change the array index (by -0x400)...
void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008606B] ^= 1;
        sys.Delay_ms(14);
    }
}
...I get less optimized code:
08000470: ldr r4, [pc, #24] ; (0x800048c <blinkRed()+28>) // r4 = 0x42218000
08000472: ldr r6, [pc, #28] ; (0x8000490 <blinkRed()+32>)
08000474: movs r5, #14
08000476: ldr.w r3, [r4, #428] ; 0x1ac
0800047a: eor.w r3, r3, #1
0800047e: str.w r3, [r4, #428] ; 0x1ac
08000482: mov r0, r6
08000484: mov r1, r5
08000486: bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
0800048a: b.n 0x8000476 <blinkRed()+6>
The difference is that in the first case r4 is loaded with the target address directly (0x422191AC) and memory is then accessed with 2-byte instructions, while in the second case r4 is loaded with an intermediate address (0x42218000) and memory is accessed with 4-byte instructions using an offset (+0x1AC) to reach the target address (0x422181AC).
Why does the compiler do this?
I use:
arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -g2 -Wall -O1 -std=gnu++14 -fno-exceptions -fno-use-cxa-atexit -fstrict-volatile-bitfields -c -DSTM32F100C6T6B -DSTM32F10X_LD_VL
bb is:
__attribute__ ((section(".bitband"))) volatile u32 bb[0x00800000];
In .ld it is defined as:
in MEMORY section:
BITBAND(rwx): ORIGIN = 0x42000000, LENGTH = 0x02000000
in SECTIONS section:
.bitband (NOLOAD) :
SUBALIGN(0x02000000)
{
KEEP(*(.bitband))
} > BITBAND
I would consider it an artefact/missing optimization opportunity of -O1.
It can be understood in more detail if we look at the code generated with -O0 to load bb[...]:
First case:
movw r2, #:lower16:bb
movt r2, #:upper16:bb
movw r3, #37292
movt r3, 33
adds r3, r2, r3
ldr r3, [r3, #0]
Second case:
movw r3, #:lower16:bb
movt r3, #:upper16:bb
add r3, r3, #2195456 ; 0x218000 = 4*0x86000
add r3, r3, #428
ldr r3, [r3, #0]
The code in the second case is better, and it can be generated this way because the constant can be added with two add instructions (which is not possible when the index is 0x0008646B).
-O1 performs only optimizations that are not time-consuming. Apparently it merges the add into the ldr early, and so later misses the opportunity to load the whole address with a single PC-relative ldr.
Compile with -O2 (or -fgcse) and the code looks as expected.
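The constraint at work is ARM's limited immediate encoding. In the classic A32 encoding, an operand immediate must be an 8-bit value rotated right by an even amount (Thumb-2's scheme is similar: an 8-bit value shifted into position, plus a few replication patterns, and it accepts the same constants here). A quick sketch to check which constants qualify - my own helper, not part of the question's build:

#include <stdint.h>
#include <stdio.h>

// True if x fits the A32 modified-immediate encoding:
// an 8-bit value rotated right by an even amount.
static bool is_arm_imm(uint32_t x)
{
    for (int rot = 0; rot < 32; rot += 2) {
        // rotating x left by rot undoes a rotate-right of an 8-bit value
        uint32_t v = (x << rot) | (x >> ((32 - rot) & 31));
        if (v <= 0xFF) return true;
    }
    return false;
}

int main()
{
    // 4*0x0008606B = 0x2181AC splits into two encodable constants,
    // so two ADDs (or one ADD plus a load offset) reach the address:
    printf("%d %d\n", is_arm_imm(0x218000), is_arm_imm(0x1AC));  // 1 1
    // 4*0x0008646B = 0x2191AC does not split the same way:
    printf("%d %d\n", is_arm_imm(0x219000), is_arm_imm(0x91AC)); // 0 0
    return 0;
}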

Why is this code not efficient?

I want to improve the following code, which calculates the mean and standard deviation:
void calculateMeanStDev8x8Aux(cv::Mat* patch, int sx, int sy, int& mean, float& stdev)
{
    unsigned sum = 0;
    unsigned sqsum = 0;
    const unsigned char* aux = patch->data + sy*patch->step + sx;
    for (int j = 0; j < 8; j++) {
        const unsigned char* p = (const unsigned char*)(j*patch->step + aux); // pointer to the start of row j
        for (int i = 0; i < 8; i++) {
            unsigned f = *p++;
            sum += f;
            sqsum += f*f;
        }
    }
    mean = sum >> 6;
    int r = (sum*sum) >> 6;
    stdev = sqrtf(sqsum - r);
    if (stdev < .1) {
        stdev = 0;
    }
}
I also tried to improve the following inner loop with NEON intrinsics:
for (int i = 0; i < 8; i++) {
    unsigned f = *p++;
    sum += f;
    sqsum += f*f;
}
This is the NEON version of that loop:
int32x4_t vsum= { 0 };
int32x4_t vsum2= { 0 };
int32x4_t vsumll = { 0 };
int32x4_t vsumlh = { 0 };
int32x4_t vsumll2 = { 0 };
int32x4_t vsumlh2 = { 0 };
uint8x8_t f= vld1_u8(p); // VLD1.8 {d0}, [r0]
// widen u8 -> s16: 8 elements
int16x8_t val = (int16x8_t)vmovl_u8(f);
// widen s16 -> s32: 2 vectors of 4 elements
int32x4_t vall = vmovl_s16(vget_low_s16(val));
int32x4_t valh = vmovl_s16(vget_high_s16(val));
// update 4 partial sum of products vectors
vsumll2 = vmlaq_s32(vsumll2, vall, vall);
vsumlh2 = vmlaq_s32(vsumlh2, valh, valh);
// sum 4 partial sum of product vectors
vsum = vaddq_s32(vall, valh);
vsum2 = vaddq_s32(vsumll2, vsumlh2);
// do scalar horizontal sum across final vector
sum += vgetq_lane_s32(vsum, 0);
sum += vgetq_lane_s32(vsum, 1);
sum += vgetq_lane_s32(vsum, 2);
sum += vgetq_lane_s32(vsum, 3);
sqsum += vgetq_lane_s32(vsum2, 0);
sqsum += vgetq_lane_s32(vsum2, 1);
sqsum += vgetq_lane_s32(vsum2, 2);
sqsum += vgetq_lane_s32(vsum2, 3);
But it is roughly 30 ms slower. Does anyone know why?
All the code produces correct results.
To add to Lundin's answer: yes, on instruction sets like ARM, where you have a register-based index or some reach with an immediate index, you might benefit from encouraging the compiler to use indexing. Also, ARM for example can increment its pointer register in the load instruction - basically *p++ in one instruction.
It is always a toss-up between using p[i] or p[i++] and *p or *p++; on some instruction sets it is much more obvious which path to take.
Likewise your index: if you are not otherwise using it, counting down instead of up can save an instruction per loop, maybe more. Some compilers might do this:
inc reg
cmp reg,#7
bne loop_top
If you were counting down though you might save an instruction per loop:
dec reg
bne loop_top
or even one processor I know of
decrement_and_jump_if_not_zero loop_top
The compilers usually know this and you don't have to encourage them. BUT if you use the p[i] form where the memory read order is important, then the compiler can't - or at least shouldn't - arbitrarily change the order of the reads. So for that case you would want the code to count down.
So I tried all of these things:
unsigned fun1 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;
    unsigned f;
    sum = 0;
    sqsum = 0;
    for(i=0; i<8; i++)
    {
        f = *p++;
        sum += f;
        sqsum += f*f;
    }
    //to keep the compiler from optimizing stuff out
    x[0] = sum;
    return(sqsum);
}
unsigned fun2 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;
    unsigned f;
    sum = 0;
    sqsum = 0;
    for(i=8; i--;)
    {
        f = *p++;
        sum += f;
        sqsum += f*f;
    }
    //to keep the compiler from optimizing stuff out
    x[0] = sum;
    return(sqsum);
}
unsigned fun3 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;
    sum = 0;
    sqsum = 0;
    for(i=0; i<8; i++)
    {
        sum += (unsigned)p[i];
        sqsum += ((unsigned)p[i])*((unsigned)p[i]);
    }
    //to keep the compiler from optimizing stuff out
    x[0] = sum;
    return(sqsum);
}
unsigned fun4 ( const unsigned char *p, unsigned *x )
{
    unsigned sum;
    unsigned sqsum;
    int i;
    sum = 0;
    sqsum = 0;
    for(i=8; i; i--)
    {
        sum += (unsigned)p[i-1];
        sqsum += ((unsigned)p[i-1])*((unsigned)p[i-1]);
    }
    //to keep the compiler from optimizing stuff out
    x[0] = sum;
    return(sqsum);
}
with both gcc and llvm (clang). Of course both unrolled the loop, since the count was a constant. gcc produced the same code for each of the experiments, in some cases with a subtle register mix change - and I would argue a bug, as in at least one of them the reads were not in the order described by the code.
gcc's solution for all four was this, with some read reordering; notice the reads being out of order relative to the source code. If this were against hardware/logic that relied on the reads being in the order described by the code, you would have a big problem.
00000000 <fun1>:
0: e92d05f0 push {r4, r5, r6, r7, r8, sl}
4: e5d06001 ldrb r6, [r0, #1]
8: e00a0696 mul sl, r6, r6
c: e4d07001 ldrb r7, [r0], #1
10: e02aa797 mla sl, r7, r7, sl
14: e5d05001 ldrb r5, [r0, #1]
18: e02aa595 mla sl, r5, r5, sl
1c: e5d04002 ldrb r4, [r0, #2]
20: e02aa494 mla sl, r4, r4, sl
24: e5d0c003 ldrb ip, [r0, #3]
28: e02aac9c mla sl, ip, ip, sl
2c: e5d02004 ldrb r2, [r0, #4]
30: e02aa292 mla sl, r2, r2, sl
34: e5d03005 ldrb r3, [r0, #5]
38: e02aa393 mla sl, r3, r3, sl
3c: e0876006 add r6, r7, r6
40: e0865005 add r5, r6, r5
44: e0854004 add r4, r5, r4
48: e5d00006 ldrb r0, [r0, #6]
4c: e084c00c add ip, r4, ip
50: e08c2002 add r2, ip, r2
54: e082c003 add ip, r2, r3
58: e023a090 mla r3, r0, r0, sl
5c: e080200c add r2, r0, ip
60: e5812000 str r2, [r1]
64: e1a00003 mov r0, r3
68: e8bd05f0 pop {r4, r5, r6, r7, r8, sl}
6c: e12fff1e bx lr
The index for the loads and some subtle register mixing were the only differences between the functions from gcc; all of the operations were the same, in the same order.
llvm/clang:
00000000 <fun1>:
0: e92d41f0 push {r4, r5, r6, r7, r8, lr}
4: e5d0e000 ldrb lr, [r0]
8: e5d0c001 ldrb ip, [r0, #1]
c: e5d03002 ldrb r3, [r0, #2]
10: e5d08003 ldrb r8, [r0, #3]
14: e5d04004 ldrb r4, [r0, #4]
18: e5d05005 ldrb r5, [r0, #5]
1c: e5d06006 ldrb r6, [r0, #6]
20: e5d07007 ldrb r7, [r0, #7]
24: e08c200e add r2, ip, lr
28: e0832002 add r2, r3, r2
2c: e0882002 add r2, r8, r2
30: e0842002 add r2, r4, r2
34: e0852002 add r2, r5, r2
38: e0862002 add r2, r6, r2
3c: e0870002 add r0, r7, r2
40: e5810000 str r0, [r1]
44: e0010e9e mul r1, lr, lr
48: e0201c9c mla r0, ip, ip, r1
4c: e0210393 mla r1, r3, r3, r0
50: e0201898 mla r0, r8, r8, r1
54: e0210494 mla r1, r4, r4, r0
58: e0201595 mla r0, r5, r5, r1
5c: e0210696 mla r1, r6, r6, r0
60: e0201797 mla r0, r7, r7, r1
64: e8bd41f0 pop {r4, r5, r6, r7, r8, lr}
68: e1a0f00e mov pc, lr
This is much easier to read and follow - perhaps they were thinking about the cache and getting all the reads done in one shot. llvm also got the reads out of order in at least one case:
00000144 <fun4>:
144: e92d40f0 push {r4, r5, r6, r7, lr}
148: e5d0c007 ldrb ip, [r0, #7]
14c: e5d03006 ldrb r3, [r0, #6]
150: e5d02005 ldrb r2, [r0, #5]
154: e5d05004 ldrb r5, [r0, #4]
158: e5d0e000 ldrb lr, [r0]
15c: e5d04001 ldrb r4, [r0, #1]
160: e5d06002 ldrb r6, [r0, #2]
164: e5d00003 ldrb r0, [r0, #3]
Yes, for averaging some values from RAM, order is not an issue; moving on.
So the compilers chose the unrolled path and didn't bother with micro-optimizations. Because of the size of the loop, both chose to burn a bunch of registers, holding one loaded value per unrolled iteration, and then perform the adds and multiplies from those temporary reads. If we increased the size of the loop a little, I would expect to see sum and sqsum accumulations within the unrolled loop as the compiler runs out of registers, or the threshold would be reached where it chooses not to unroll the loop.
If I pass the length in and replace the 8s in the code above with that passed-in length, forcing the compiler to make an actual loop out of this, you sort of see the optimizations; instructions like this are used:
a4: e4d35001 ldrb r5, [r3], #1
And this being ARM, the loop register is modified in one place and the branch-if-not-equal comes a number of instructions later... because it can.
Granted, this is a math function, but using float is painful. Multiplies are painful and divides are much worse; fortunately a shift was used, and fortunately this was unsigned so a plain shift works (the compiler would/should have known to use an arithmetic shift, if available, had you divided a signed number).
So basically, focus on micro-optimizations of the inner loop, since it runs many times; see if the math can become shifts and adds, and if work can be hoisted out of the loop (if possible - don't waste copy loops elsewhere to do it). For instance, the multiply in this line:
const unsigned char* p = (const unsigned char*)(j*patch->step + aux );
could be hoisted, and you could get some speed. I didn't try it, but because it is a loop in a loop the compiler probably won't unroll that loop...
Long story short: you might get some gains depending on the instruction set against a dumber compiler, but this code is not really bad, so the compiler can optimize it about as well as you can.
First of all, you will probably get very good, detailed answers to questions like this if you post them on Code Review instead.
Some comments regarding efficiency and suspicious variable types:
unsigned f = *p++; - you will probably be better off accessing p through array indexing, using p[i] to access the data. This is highly dependent on the compiler, cache optimizations, etc. (some ARM guru can give better advice than I can in this matter).
Btw, the whole const char to int conversion looks highly suspicious. I take it those chars are to be regarded as 8-bit unsigned integers? Standard C uint8_t is likely a better type for this; char has various undefined-signedness issues that you want to avoid.
Also, why are you wildly mixing unsigned and int? You are asking for implicit integer-balancing bugs.
stdev < .1 - just a minor thing: change this to .1f, or you force an implicit promotion of your float to double, since .1 is a double literal.
As your data is being read in groups of 8 bytes, depending on your hardware bus and the alignment of the array itself, you can probably get some gains by doing each inner-loop pass as a single long long read. Then either manually split the value into separate bytes, or use ARM intrinsics to do the adds in parallel with some inline asm: the add8 instruction adds four 8-bit values at a time in one register, or with a touch of shifting you can use add16 to let the values overflow into 16 bits of space. There is also a dual signed multiply-accumulate instruction which supports your first accumulation loop nearly perfectly with just a little help. And if the incoming data could be massaged into 16-bit values, that could speed this up too.
As to why the NEON is slower, my guess is that the overhead of setting up the vectors, along with the extra data you are pushing around with larger types, is killing any performance gain on such a small set of data. The original code is very ARM-friendly to begin with, so the setup overhead is probably what hurts. When in doubt, look at the assembly output. That will tell you what's truly going on. Perhaps the compiler is pushing and popping data all over the place when trying to use the intrinsics - it wouldn't be the first time I've seen this sort of behavior.
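If you do stay with NEON, the usual remedy for that per-iteration overhead is to keep the accumulators in vector registers across all eight rows and do the horizontal reduction once at the end. A sketch along those lines (my own rewrite, with base/step standing in for the patch pointer and stride; untimed, so treat any speedup as an assumption to verify):

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

static void mean_sqsum_8x8(const uint8_t *base, size_t step,
                           unsigned *sum_out, unsigned *sqsum_out)
{
    uint32x4_t vsum   = vdupq_n_u32(0);
    uint32x4_t vsqsum = vdupq_n_u32(0);
    for (int j = 0; j < 8; j++) {
        uint8x8_t  row = vld1_u8(base + j * step);    // 8 pixels of row j
        uint16x8_t w   = vmovl_u8(row);               // widen to 16 bits
        uint16x4_t lo  = vget_low_u16(w);
        uint16x4_t hi  = vget_high_u16(w);
        vsum   = vaddq_u32(vsum, vaddl_u16(lo, hi));  // accumulate sums
        vsqsum = vmlal_u16(vsqsum, lo, lo);           // accumulate squares
        vsqsum = vmlal_u16(vsqsum, hi, hi);
    }
    // one horizontal reduction, outside the loop
    *sum_out   = vgetq_lane_u32(vsum, 0)   + vgetq_lane_u32(vsum, 1)
               + vgetq_lane_u32(vsum, 2)   + vgetq_lane_u32(vsum, 3);
    *sqsum_out = vgetq_lane_u32(vsqsum, 0) + vgetq_lane_u32(vsqsum, 1)
               + vgetq_lane_u32(vsqsum, 2) + vgetq_lane_u32(vsqsum, 3);
}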
Thanks to Lundin, dwelch and Michel.
I made the following improvement and it seems the best for my code.
I'm trying to decrease the number of cycles by improving cache access, since it only accesses the cache once.
int step = patch->step;
for (int j = 0; j < 8; j++) {
    p = (uint8_t*)(j*step + aux);
    i = 8;
    do {
        f = p[i-1]; // i runs 8..1, so this reads p[7]..p[0] and stays in bounds
        sum += f;
        sqsum += f*f;
    } while (--i);
}

Looking for unnecessary buffer copies in assembly code

I am using Visual Studio 2008 C++ for Windows Mobile 6 ARMV4I, and I'm trying to learn to read the ARM assembly code generated by VS so I can minimize unnecessary buffer copies within an application. So I've created a test application that looks like this:
#include <vector>
typedef std::vector< BYTE > Buf;
class Foo
{
public:
    Foo( Buf b ) { b_.swap( b ); };
private:
    Buf b_;
};
Buf Create()
{
    Buf b( 1024 );
    b[ 0 ] = 0x0001;
    return b;
}
int _tmain( int argc, _TCHAR* argv[] )
{
    Foo f( Create() );
    return 0;
}
I'd like to understand if the buffer returned by Create is copied when given to the Foo constructor or if the compiler is able to optimize that copy away. In the Release build with optimizations turned on, this generates assembly like this:
class Foo
{
public:
Foo( Buf b ) { b_.swap( b ); };
0001112C stmdb sp!, {r4 - r7, lr}
00011130 mov r7, r0
00011134 mov r3, #0
00011138 str r3, this
0001113C str r3, [r7, #4]
00011140 str r3, [r7, #8]
00011144 ldr r3, this
00011148 ldr r2, this
0001114C mov r5, r7
00011150 mov r4, r1
00011154 str r3, this, #4
00011158 str r2, this, #4
0001115C mov r6, r1
00011160 ldr r2, this
00011164 ldr r3, this
00011168 mov lr, r7
0001116C str r3, this
00011170 str r2, this
00011174 ldr r2, [lr, #8]!
00011178 ldr r3, [r6, #8]!
0001117C str r3, this
00011180 str r2, this
00011184 ldr r3, this
00011188 movs r0, r3
0001118C beq |Foo::Foo + 0x84 ( 111b0h )|
00011190 ldr r3, [r1, #8]
00011194 sub r1, r3, r0
00011198 cmp r1, #0x80
0001119C bls |Foo::Foo + 0x80 ( 111ach )|
000111A0 bl 000112D4
000111A4 mov r0, r7
000111A8 ldmia sp!, {r4 - r7, pc}
000111AC bl |stlp_std::__node_alloc::_M_deallocate ( 11d2ch )|
000111B0 mov r0, r7
000111B4 ldmia sp!, {r4 - r7, pc}
--- ...\stlport\stl\_vector.h -----------------------------
// snip!
--- ...\asm_test.cpp
private:
Buf b_;
};
Buf Create()
{
00011240 stmdb sp!, {r4, lr}
00011244 mov r4, r0
Buf b( 1024 );
00011248 mov r1, #1, 22
0001124C bl |
b[ 0 ] = 0x0001;
00011250 ldr r3, [r4]
00011254 mov r2, #1
return b;
}
int _tmain( int argc, _TCHAR* argv[] )
{
00011264 str lr, [sp, #-4]!
00011268 sub sp, sp, #0x18
Foo f( Create() );
0001126C add r0, sp, #0xC
00011270 bl |Create ( 11240h )|
00011274 mov r1, r0
00011278 add r0, sp, #0
0001127C bl |Foo::Foo ( 1112ch )|
return 0;
00011280 ldr r0, argc
00011284 cmp r0, #0
00011288 beq |wmain + 0x44 ( 112a8h )|
0001128C ldr r3, [sp, #8]
00011290 sub r1, r3, r0
00011294 cmp r1, #0x80
00011298 bls |wmain + 0x40 ( 112a4h )|
0001129C bl 000112D4
000112A0 b |wmain + 0x44 ( 112a8h )|
000112A4 bl |stlp_std::__node_alloc::_M_deallocate ( 11d2ch )|
000112A8 mov r0, #0
}
What patterns can I look for in the assembly code to understand where the Buf structure is being copied?
Analyzing Create is fairly straightforward, because the code is so short. NRVO has clearly been applied here: the return statement generated no instructions, and the return value is constructed in place through the pointer passed in r0.
The copy that would take place for Foo::Foo's pass-by-value parameter is slightly harder to analyze, but there's very little code between the calls to Create and Foo::Foo where the copy would have to take place, and nothing that would do a deep copy of a std::vector. So it looks like that copy has been eliminated as well. The other possibility is a custom calling convention for Foo::Foo where the argument is actually passed by reference and copied inside the function. You'd need someone capable of deeper ARM assembly analysis than I am to rule that out.
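One way to rule it out empirically, without deeper assembly analysis, is to instrument the element type and count copies at runtime: every un-elided deep copy of the vector copy-constructs its elements. A sketch (my own test harness, C++03-compatible for VS2008; note that in C++03 the fill constructor Buf b( 1024 ) itself copy-constructs 1024 elements from a default-constructed prototype, so 1024 is the baseline, and anything above it indicates a real buffer copy):

#include <stdio.h>
#include <vector>

struct Probe
{
    Probe() {}
    Probe( const Probe& ) { ++copies; } // counts every element copy
    static int copies;
};
int Probe::copies = 0;

typedef std::vector< Probe > Buf;

class Foo
{
public:
    Foo( Buf b ) { b_.swap( b ); }
private:
    Buf b_;
};

Buf Create()
{
    Buf b( 1024 ); // fill: 1024 element copies in C++03
    return b;
}

int main()
{
    Foo f( Create() );
    // 1024 -> all buffer copies elided; 2048 -> one deep copy happened; ...
    printf( "element copies: %d\n", Probe::copies );
    return 0;
}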
The buffer will be copied; you are using pass-by-value semantics of C++, and no compiler will optimize that for you. How it's copied will depend on the copy constructor of std::vector.