What's AArch64's way to conditionally load/store?

I know that ARMv7 can use condition codes for load/store, like ldrne/streq, but A64 does not allow instructions to be conditionally executed. So how can I achieve this in ARM64:
ands tmp1, dstend, 7 # set nzcv flag with ands
# if not zero, ldr w6, [srcend, -4]!, str w6, [dstend, -4]!
# else, do nothing and go on
...

Predication of every instruction was a feature that made high-performance ARM CPUs harder to implement, especially with out-of-order execution. It was intentionally removed for AArch64. (Why are conditionally executed instructions not present in later ARM instruction sets? quotes the vendor's own justification)
If you need something with side effects / possible faults like store and load to be conditional, you normally need to branch.
The only branchless option I can think of that seems worth considering would be csel with a pointer to a dummy location (e.g. on the stack) vs. the real location. Then you still actually load and store, but not to the location you care about. This is probably worse unless the branch mispredict penalty is high and the branch is hard to predict.
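For reference, here is a minimal C sketch of that csel-with-a-dummy-location idea (the names cond_store and dummy are made up for illustration; a compiler targeting AArch64 may lower the ternary to cmp + csel, so the store itself always executes and only the destination is selected):

void cond_store(int cond, unsigned int *real, unsigned int value)
{
    unsigned int dummy;                      /* scratch slot on the stack */
    unsigned int *p = cond ? real : &dummy;  /* may compile to cmp + csel */
    *p = value;                              /* the store is unconditional */
}

Whether this beats the plain branchy version depends on how predictable the condition is, as noted above.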

void fun0 ( unsigned int x, unsigned int *y )
{
    if(x) *y=0;
    else *y=1;
}
void fun1 ( unsigned int x, unsigned int *y )
{
    if(x) *y=10;
    else *y=11;
}
void fun2 ( unsigned int x, unsigned int *y, unsigned int a, unsigned int b )
{
    if(x) *y=a;
    else *y=b;
}
void fun5 ( unsigned int x, unsigned int *y, unsigned int *z )
{
    if(x) *y=123;
    else *z=123;
}
0000000000000000 <fun0>:
0: 7100001f cmp w0, #0x0
4: 1a9f17e0 cset w0, eq // eq = none
8: b9000020 str w0, [x1]
c: d65f03c0 ret
0000000000000010 <fun1>:
10: 7100001f cmp w0, #0x0
14: 1a9f17e0 cset w0, eq // eq = none
18: 11002800 add w0, w0, #0xa
1c: b9000020 str w0, [x1]
20: d65f03c0 ret
0000000000000024 <fun2>:
24: 7100001f cmp w0, #0x0
28: 1a831042 csel w2, w2, w3, ne // ne = any
2c: b9000022 str w2, [x1]
30: d65f03c0 ret
0000000000000034 <fun5>:
34: 34000080 cbz w0, 44 <fun5+0x10>
38: 52800f60 mov w0, #0x7b // #123
3c: b9000020 str w0, [x1]
40: d65f03c0 ret
44: 52800f60 mov w0, #0x7b // #123
48: b9000040 str w0, [x2]
4c: d65f03c0 ret


Why does C++ use a 32-bit register to store an 8-bit value?

I've tried the following C++ code:
void foo( ) {
    char c = 'a';
    c = c + 1;
}
Got the following results with x86-64 gcc 10.1, default flags:
mov BYTE PTR [rbp-1], 97
movzx eax, BYTE PTR [rbp-1] ; EAX here
add eax, 1
mov BYTE PTR [rbp-1], al
But! Got the following results with x86-64 djgpp 7.2.0, default flags:
mov BYTE PTR [ebp-1], 97
mov al, BYTE PTR [ebp-1] ; AL here
inc eax
mov BYTE PTR [ebp-1], al
Why does GCC use EAX instead of AL?
And why does djgpp use AL only?
Is it a performance issue?
If so, what kind of performance issue stands behind using a 32-bit register for an 8-bit value?
On AMD and recent Intel processors, loading a partial register requires the previous value of the whole register in order to combine it with the loaded value and produce the new register value.
If the full register is written, the old value is not required, and therefore, with register renaming, the write can happen before the previous write of the register has completed.
unsigned char fun ( unsigned char a, unsigned char b )
{
return(a+b);
}
Disassembly of section .text:
0000000000000000 <fun>:
0: 8d 04 3e lea (%rsi,%rdi,1),%eax
3: c3 retq
Disassembly of section .text:
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e20000ff and r0, r0, #255 ; 0xff
8: e12fff1e bx lr
Disassembly of section .text:
00000000 <fun>:
0: 1840 adds r0, r0, r1
2: b2c0 uxtb r0, r0
4: 4770 bx lr
Disassembly of section .text:
00000000 <fun>:
0: 952e add x10,x10,x11
2: 0ff57513 andi x10,x10,255
6: 8082 ret
Different targets, all from gcc.
This is a compiler choice, so you would really need to ask the compiler authors about it, not Stack Overflow. The compiler has to functionally implement the high-level language, and in these cases, all of which have 32-bit GPRs, the choice is: do you mask every operation (or at least mask the value before it is left around for later use), do you assume the register is dirty and mask it just before you use it, or do you rely on architectural features like EAX being accessible in smaller parts (AX, AL) and design around that? So long as it functionally works, any solution is perfectly fine.
One compiler may choose to use AL for 8-bit operations, another may choose EAX (which is likely more efficient from a performance perspective; there is material you can read up on about that topic). In both cases you have to design for the remaining bits in the rax/eax/ax register and not "oops" it later by using the larger register.
Where you don't have this option of partial register access, you pretty much need to functionally implement the code, and the easy way is to do the masking. That would match the C code in this case, and one could argue that the x86 code is buggy because it uses EAX but doesn't clip, so it does not return a clean unsigned char.
Make it signed though:
signed char fun ( signed char a, signed char b )
{
return(a+b);
}
Disassembly of section .text:
0000000000000000 <fun>:
0: 8d 04 3e lea (%rsi,%rdi,1),%eax
3: c3 retq
Disassembly of section .text:
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e1a00c00 lsl r0, r0, #24
8: e1a00c40 asr r0, r0, #24
c: e12fff1e bx lr
Same story: one compiler design handles the variable's size later, the other deals with it right there and then.
Force it to deal with the size in this function
signed char fun ( signed char a, signed char b )
{
if((a+b)>200) return(1);
return(0);
}
Disassembly of section .text:
0000000000000000 <fun>:
0: 40 0f be f6 movsbl %sil,%esi
4: 40 0f be ff movsbl %dil,%edi
8: 01 f7 add %esi,%edi
a: 81 ff c8 00 00 00 cmp $0xc8,%edi
10: 0f 9f c0 setg %al
13: c3 retq
Disassembly of section .text:
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e35000c8 cmp r0, #200 ; 0xc8
8: d3a00000 movle r0, #0
c: c3a00001 movgt r0, #1
10: e12fff1e bx lr
Because the ARM design knows the values passed in are already clipped, and this is a greater-than comparison, they chose not to clip the sum, possibly because I left this as signed. In the x86 case, though, because they don't clip on the way out, they clipped on the way into the operation.
unsigned char fun ( unsigned char a, unsigned char b )
{
if((a+b)>200) return(1);
return(0);
}
Disassembly of section .text:
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e35000c8 cmp r0, #200 ; 0xc8
8: d3a00000 movle r0, #0
c: c3a00001 movgt r0, #1
10: e12fff1e bx lr
Now that I would disagree with, because, for example, 0xFF + 0x01 = 0x00 as an unsigned char, which is not greater than 200, yet this code would report it as greater than 200. They also used the signed less-than and greater-than conditions on what is an unsigned comparison.
unsigned char fun ( unsigned char a, unsigned char b )
{
if(((unsigned char)(a+b))>200) return(1);
return(0);
}
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e20000ff and r0, r0, #255 ; 0xff
8: e35000c8 cmp r0, #200 ; 0xc8
c: 93a00000 movls r0, #0
10: 83a00001 movhi r0, #1
14: e12fff1e bx lr
Ahh, there you go: a C language integer promotion thing (just like float f; f=f+1.0; vs f=f+1.0F;).
And that changes the x86 results as well:
Disassembly of section .text:
0000000000000000 <fun>:
0: 01 fe add %edi,%esi
2: 40 80 fe c8 cmp $0xc8,%sil
6: 0f 97 c0 seta %al
9: c3 retq
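The promotion rule in play here is easy to check from plain C; a minimal sketch (the values are just picked for illustration):

#include <stdio.h>
int main(void)
{
    unsigned char a = 0xFF, b = 0x01;
    printf("%d\n", (a + b) > 200);                  /* 1: both operands promote to int, so the sum is 256 */
    printf("%d\n", ((unsigned char)(a + b)) > 200); /* 0: the cast wraps the sum back down to 0 */
    return 0;
}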
Why does GCC use EAX instead of AL?
And why does djgpp use AL only?
Is it a performance issue?
These are compiler design choices, not issues, and not necessarily about performance, but about overall compiler design: how to implement the high-level language with the target's instruction set. Each compiler is free to do that however it wishes. There is no reason to expect gcc, clang, djgpp and others to make the same design choices, and no reason to expect gcc version x.x.x and y.y.y to make the same choices either, so if you go far enough back perhaps it was done differently, perhaps not (and if it was, maybe the commit explains the "why", or developer mailing list traffic from that time would cover it).

If statement vs if-else statement, which is faster?

I argued with a friend the other day about those two snippets. Which is faster, and why?
value = 5;
if (condition) {
    value = 6;
}
and:
if (condition) {
    value = 6;
} else {
    value = 5;
}
What if value is a matrix ?
Note: I know that value = condition ? 6 : 5; exists and I expect it to be faster, but it wasn't an option.
Edit (requested by staff since question is on hold at the moment):
please answer by considering either x86 assembly generated by mainstream compilers (say g++, clang++, vc, mingw) in both optimized and non optimized versions or MIPS assembly.
when the assembly differs, explain why one version is faster and when (e.g. "better because there is no branching and branching has the following issue blahblah")
TL;DR: In unoptimized code, if without else seems negligibly more efficient, but with even the most basic level of optimization enabled the code is basically rewritten to value = condition + 5.
I gave it a try and generated the assembly for the following code:
int ifonly(bool condition, int value)
{
    value = 5;
    if (condition) {
        value = 6;
    }
    return value;
}

int ifelse(bool condition, int value)
{
    if (condition) {
        value = 6;
    } else {
        value = 5;
    }
    return value;
}
On gcc 6.3 with optimizations disabled (-O0), the relevant difference is:
mov DWORD PTR [rbp-8], 5
cmp BYTE PTR [rbp-4], 0
je .L2
mov DWORD PTR [rbp-8], 6
.L2:
mov eax, DWORD PTR [rbp-8]
for ifonly, while ifelse has
cmp BYTE PTR [rbp-4], 0
je .L5
mov DWORD PTR [rbp-8], 6
jmp .L6
.L5:
mov DWORD PTR [rbp-8], 5
.L6:
mov eax, DWORD PTR [rbp-8]
The latter looks slightly less efficient because it has an extra jump, but both have at least two and at most three assignments, so unless you really need to squeeze every last drop of performance (hint: unless you are working on a space shuttle you don't, and even then you probably don't) the difference won't be noticeable.
However, even with the lowest optimization level (-O1) both functions reduce to the same:
test dil, dil
setne al
movzx eax, al
add eax, 5
which is basically the equivalent of
return 5 + condition;
assuming condition is zero or one.
Higher optimization levels don't really change the output, except they manage to avoid the movzx by efficiently zeroing out the EAX register at the start.
Disclaimer: You probably shouldn't write 5 + condition yourself (even though the standard guarantees that converting true to an integer type gives 1) because your intent might not be immediately obvious to people reading your code (which may include your future self). The point of this code is to show that what the compiler produces in both cases is (practically) identical. Ciprian Tomoiaga states it quite well in the comments:
a human's job is to write code for humans and let the compiler write code for the machine.
The answer from CompuChip shows that for int they both are optimized to the same assembly, so it doesn't matter.
What if value is a matrix ?
I will interpret this in a more general way, i.e. what if value is of a type whose constructions and assignments are expensive (and moves are cheap).
then
T value = init1;
if (condition)
value = init2;
is sub-optimal because in case condition is true, you do the unnecessary initialization to init1 and then you do the copy assignment.
T value;
if (condition)
value = init2;
else
value = init3;
This is better. But still sub-optimal if default construction is expensive and if copy assignment is more expensive than initialization.
You have the conditional operator solution which is good:
T value = condition ? init1 : init2;
Or, if you don't like the conditional operator, you can create a helper function like this:
T create(bool condition)
{
if (condition)
return {init1};
else
return {init2};
}
T value = create(condition);
Depending on what init1 and init2 are you can also consider this:
auto final_init = condition ? init1 : init2;
T value = final_init;
But again I must emphasize that this is relevant only when construction and assignment are really expensive for the given type. And even then, only by profiling do you know for sure.
In pseudo-assembly language,
li #0, r0
test r1
beq L1
li #1, r0
L1:
may or may not be faster than
test r1
beq L1
li #1, r0
bra L2
L1:
li #0, r0
L2:
depending on how sophisticated the actual CPU is. Going from simplest to fanciest:
With any CPU manufactured after roughly 1990, good performance depends on the code fitting within the instruction cache. When in doubt, therefore, minimize code size. This weighs in favor of the first example.
With a basic "in-order, five-stage pipeline" CPU, which is still roughly what you get in many microcontrollers, there is a pipeline bubble every time a branch—conditional or unconditional—is taken, so it is also important to minimize the number of branch instructions. This also weighs in favor of the first example.
Somewhat more sophisticated CPUs—fancy enough to do "out-of-order execution", but not fancy enough to use the best known implementations of that concept—may incur pipeline bubbles whenever they encounter write-after-write hazards. This weighs in favor of the second example, where r0 is written only once no matter what. These CPUs are usually fancy enough to process unconditional branches in the instruction fetcher, so you aren't just trading the write-after-write penalty for a branch penalty.
I don't know if anyone is still making this kind of CPU anymore. However, the CPUs that do use the "best known implementations" of out-of-order execution are likely to cut corners on the less frequently used instructions, so you need to be aware that this sort of thing can happen. A real example is false data dependencies on the destination registers in popcnt and lzcnt on Sandy Bridge CPUs.
At the highest end, the OOO engine will wind up issuing exactly the same sequence of internal operations for both code fragments—this is the hardware version of "don't worry about it, the compiler will generate the same machine code either way." However, code size still does matter, and now you also should be worrying about the predictability of the conditional branch. Branch prediction failures potentially cause a complete pipeline flush, which is catastrophic for performance; see Why is it faster to process a sorted array than an unsorted array? to understand how much difference this can make.
If the branch is highly unpredictable, and your CPU has conditional-set or conditional-move instructions, this is the time to use them:
li #0, r0
test r1
setne r0
or
li #0, r0
li #1, r2
test r1
movne r2, r0
The conditional-set version is also more compact than any other alternative; if that instruction is available it is practically guaranteed to be the Right Thing for this scenario, even if the branch was predictable. The conditional-move version requires an additional scratch register, and always wastes one li instruction's worth of dispatch and execute resources; if the branch was in fact predictable, the branchy version may well be faster.
In unoptimised code, the first example assigns a variable always once and sometimes twice. The second example only ever assigns a variable once. The conditional is the same on both code paths, so that shouldn't matter. In optimised code, it depends on the compiler.
As always, if you are that concerned, generate the assembly and see what the compiler is actually doing.
What would make you think any of them, even the one-liner, is faster or slower?
unsigned int fun0 ( unsigned int condition, unsigned int value )
{
    value = 5;
    if (condition) {
        value = 6;
    }
    return(value);
}
unsigned int fun1 ( unsigned int condition, unsigned int value )
{
    if (condition) {
        value = 6;
    } else {
        value = 5;
    }
    return(value);
}
unsigned int fun2 ( unsigned int condition, unsigned int value )
{
    value = condition ? 6 : 5;
    return(value);
}
More lines of high-level code give the compiler more to work with, so if you want a general rule about it, give the compiler more code to work with. If the algorithm is the same, as in the cases above, then one would expect the compiler with minimal optimization to figure that out.
00000000 <fun0>:
0: e3500000 cmp r0, #0
4: 03a00005 moveq r0, #5
8: 13a00006 movne r0, #6
c: e12fff1e bx lr
00000010 <fun1>:
10: e3500000 cmp r0, #0
14: 13a00006 movne r0, #6
18: 03a00005 moveq r0, #5
1c: e12fff1e bx lr
00000020 <fun2>:
20: e3500000 cmp r0, #0
24: 13a00006 movne r0, #6
28: 03a00005 moveq r0, #5
2c: e12fff1e bx lr
Not a big surprise that it did the first function in a different order; same execution time though.
0000000000000000 <fun0>:
0: 7100001f cmp w0, #0x0
4: 1a9f07e0 cset w0, ne
8: 11001400 add w0, w0, #0x5
c: d65f03c0 ret
0000000000000010 <fun1>:
10: 7100001f cmp w0, #0x0
14: 1a9f07e0 cset w0, ne
18: 11001400 add w0, w0, #0x5
1c: d65f03c0 ret
0000000000000020 <fun2>:
20: 7100001f cmp w0, #0x0
24: 1a9f07e0 cset w0, ne
28: 11001400 add w0, w0, #0x5
2c: d65f03c0 ret
Hopefully you get the idea: you could have just tried this yourself if it wasn't obvious that the different implementations were not actually different.
As far as a matrix goes, not sure how that matters,
if(condition)
{
big blob of code a
}
else
{
big blob of code b
}
the compiler is just going to put the same if-then-else wrapper around the big blobs of code, be they value=5 or something more complicated. Likewise the comparison: even if it is a big blob of code, it still has to be computed, and equal-to or not-equal-to something is often compiled with the negative; if (condition) do something is often compiled as if (not condition) goto past it.
00000000 <fun0>:
0: 0f 93 tst r15
2: 03 24 jz $+8 ;abs 0xa
4: 3f 40 06 00 mov #6, r15 ;#0x0006
8: 30 41 ret
a: 3f 40 05 00 mov #5, r15 ;#0x0005
e: 30 41 ret
00000010 <fun1>:
10: 0f 93 tst r15
12: 03 20 jnz $+8 ;abs 0x1a
14: 3f 40 05 00 mov #5, r15 ;#0x0005
18: 30 41 ret
1a: 3f 40 06 00 mov #6, r15 ;#0x0006
1e: 30 41 ret
00000020 <fun2>:
20: 0f 93 tst r15
22: 03 20 jnz $+8 ;abs 0x2a
24: 3f 40 05 00 mov #5, r15 ;#0x0005
28: 30 41 ret
2a: 3f 40 06 00 mov #6, r15 ;#0x0006
2e: 30 41 ret
We just went through this exercise with someone else recently on Stack Overflow. Interestingly, in that case this MIPS compiler not only realized the functions were the same, but had one function simply jump to the other to save on code space. It didn't do that here though:
00000000 <fun0>:
0: 0004102b sltu $2,$0,$4
4: 03e00008 jr $31
8: 24420005 addiu $2,$2,5
0000000c <fun1>:
c: 0004102b sltu $2,$0,$4
10: 03e00008 jr $31
14: 24420005 addiu $2,$2,5
00000018 <fun2>:
18: 0004102b sltu $2,$0,$4
1c: 03e00008 jr $31
20: 24420005 addiu $2,$2,5
some more targets.
00000000 <_fun0>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 0bf5 0004 tst 4(r5)
8: 0304 beq 12 <_fun0+0x12>
a: 15c0 0006 mov $6, r0
e: 1585 mov (sp)+, r5
10: 0087 rts pc
12: 15c0 0005 mov $5, r0
16: 1585 mov (sp)+, r5
18: 0087 rts pc
0000001a <_fun1>:
1a: 1166 mov r5, -(sp)
1c: 1185 mov sp, r5
1e: 0bf5 0004 tst 4(r5)
22: 0204 bne 2c <_fun1+0x12>
24: 15c0 0005 mov $5, r0
28: 1585 mov (sp)+, r5
2a: 0087 rts pc
2c: 15c0 0006 mov $6, r0
30: 1585 mov (sp)+, r5
32: 0087 rts pc
00000034 <_fun2>:
34: 1166 mov r5, -(sp)
36: 1185 mov sp, r5
38: 0bf5 0004 tst 4(r5)
3c: 0204 bne 46 <_fun2+0x12>
3e: 15c0 0005 mov $5, r0
42: 1585 mov (sp)+, r5
44: 0087 rts pc
46: 15c0 0006 mov $6, r0
4a: 1585 mov (sp)+, r5
4c: 0087 rts pc
00000000 <fun0>:
0: 00a03533 snez x10,x10
4: 0515 addi x10,x10,5
6: 8082 ret
00000008 <fun1>:
8: 00a03533 snez x10,x10
c: 0515 addi x10,x10,5
e: 8082 ret
00000010 <fun2>:
10: 00a03533 snez x10,x10
14: 0515 addi x10,x10,5
16: 8082 ret
And different compilers: with this intermediate code (LLVM IR) one would expect the different targets to match as well.
define i32 @fun0(i32 %condition, i32 %value) #0 {
%1 = icmp ne i32 %condition, 0
%. = select i1 %1, i32 6, i32 5
ret i32 %.
}
; Function Attrs: norecurse nounwind readnone
define i32 @fun1(i32 %condition, i32 %value) #0 {
%1 = icmp eq i32 %condition, 0
%. = select i1 %1, i32 5, i32 6
ret i32 %.
}
; Function Attrs: norecurse nounwind readnone
define i32 @fun2(i32 %condition, i32 %value) #0 {
%1 = icmp ne i32 %condition, 0
%2 = select i1 %1, i32 6, i32 5
ret i32 %2
}
00000000 <fun0>:
0: e3a01005 mov r1, #5
4: e3500000 cmp r0, #0
8: 13a01006 movne r1, #6
c: e1a00001 mov r0, r1
10: e12fff1e bx lr
00000014 <fun1>:
14: e3a01006 mov r1, #6
18: e3500000 cmp r0, #0
1c: 03a01005 moveq r1, #5
20: e1a00001 mov r0, r1
24: e12fff1e bx lr
00000028 <fun2>:
28: e3a01005 mov r1, #5
2c: e3500000 cmp r0, #0
30: 13a01006 movne r1, #6
34: e1a00001 mov r0, r1
38: e12fff1e bx lr
fun0:
push.w r4
mov.w r1, r4
mov.w r15, r12
mov.w #6, r15
cmp.w #0, r12
jne .LBB0_2
mov.w #5, r15
.LBB0_2:
pop.w r4
ret
fun1:
push.w r4
mov.w r1, r4
mov.w r15, r12
mov.w #5, r15
cmp.w #0, r12
jeq .LBB1_2
mov.w #6, r15
.LBB1_2:
pop.w r4
ret
fun2:
push.w r4
mov.w r1, r4
mov.w r15, r12
mov.w #6, r15
cmp.w #0, r12
jne .LBB2_2
mov.w #5, r15
.LBB2_2:
pop.w r4
ret
Now technically there is a performance difference in some of these solutions. Sometimes the result-is-5 case has a jump over the result-is-6 code, and vice versa. Is a branch faster than falling through? One could argue, but the execution should vary. That is more a matter of if-condition vs if-not-condition in the code, resulting in the compiler doing the "if this, jump over; else, fall through" arrangement, and it is not necessarily due to the coding style but to the comparison and the if and else cases in whatever syntax.
OK, since assembly is one of the tags, I will just assume your code is pseudo-code (and not necessarily C) and translate it by hand into 6502 assembly.
1st Option (without else)
ldy #$00
lda #$05
dey
bmi false
lda #$06
false brk
2nd Option (with else)
ldy #$00
dey
bmi else
lda #$06
sec
bcs end
else lda #$05
end brk
Assumptions: the condition is in the Y register (set it to 0 or 1 on the first line of either option), and the result will be in the accumulator.
So, after counting cycles for both possibilities of each case, we see that the 1st construct is generally faster; 9 cycles when condition is 0 and 10 cycles when condition is 1, whereas option two is also 9 cycles when condition is 0, but 13 cycles when condition is 1. (cycle counts do not include the BRK at the end).
Conclusion: If only is faster than If-Else construct.
And for completeness, here is an optimized value = condition + 5 solution:
ldy #$00
lda #$00
tya
adc #$05
brk
This cuts our time down to 8 cycles (again not including the BRK at the end).

gdb - optimized value analysis

My CPU is ARM. How can I figure out a function parameter's value if it's optimized out?
For example:
status_t NuPlayer::GenericSource::setDataSource(
int fd, int64_t offset, int64_t length) {
resetDataSource();
mFd = dup(fd);
mOffset = offset;
mLength = length;
The above function has 3 parameters. When I try to print the second parameter, offset, I get the result below:
Thread 4 "Binder:15082_3" hit Breakpoint 1, android::NuPlayer::GenericSource::setDataSource (this=0xae63bb40, fd=8, offset=<optimized out>, length=9384436) at frameworks/av/media/libmediaplayerservice/nuplayer/GenericSource.cpp:123
123 resetDataSource();
(gdb) x/i $pc
=> 0xb02aaa80 <android::NuPlayer::GenericSource::setDataSource(int, long long, long long)+12>: blx 0xb0282454 <_ZN7android8NuPlayer13GenericSource15resetDataSourceEv#plt>
(gdb) n
125 mFd = dup(fd);
(gdb) print offset
$1 = <optimized out>
(gdb) p $eax
$2 = void
(gdb) disassemble /m
Dump of assembler code for function android::NuPlayer::GenericSource::setDataSource(int, long long, long long):
122 int fd, int64_t offset, int64_t length) {
0xb02aaa74 <+0>: push {r4, r5, r6, r7, lr}
0xb02aaa76 <+2>: sub sp, #4
0xb02aaa78 <+4>: mov r4, r3
0xb02aaa7a <+6>: mov r5, r2
0xb02aaa7c <+8>: mov r6, r1
0xb02aaa7e <+10>: mov r7, r0
123 resetDataSource();
=> 0xb02aaa80 <+12>: blx 0xb0282454 <_ZN7android8NuPlayer13GenericSource15resetDataSourceEv#plt>
124
125 mFd = dup(fd);
0xb02aaa84 <+16>: mov r0, r6
0xb02aaa86 <+18>: blx 0xb027e5d8 <dup#plt>
0xb02aaa8a <+22>: ldrd r2, r1, [sp, #24]
0xb02aaa8e <+26>: str.w r0, [r7, #224] ; 0xe0
0xb02aaa92 <+30>: movs r0, #0
126 mOffset = offset;
0xb02aaa94 <+32>: strd r5, r4, [r7, #232] ; 0xe8
127 mLength = length;
0xb02aaa98 <+36>: strd r2, r1, [r7, #240] ; 0xf0
128
129 // delay data source creation to prepareAsync() to avoid blocking
130 // the calling thread in setDataSource for any significant time.
131 return OK;
0xb02aaa9c <+40>: add sp, #4
0xb02aaa9e <+42>: pop {r4, r5, r6, r7, pc}
End of assembler dump.
(gdb)
I guess it's in some register but the result of $eax is void.
I guess it's in some register but the result of $eax is void.
There is no register called eax on ARM.
To know which register the parameter is in, you need to know the calling convention.
Looks like you are using 32-bit ARM. From the calling-convention documentation:
r0 to r3: used to hold argument values passed to a subroutine
So you should do info registers, verify that r0 == 0xae63bb40 (this) and r1 == 8 (fd), and find the offset in r2 (with its high word in r3, since offset is a 64-bit value).
It sounds like the example code has already assigned the parameters to member variables, so printing those values will give you exactly the same data as the optimized-out parameters:
mOffset = offset;
mLength = length;

Why is this code not efficient?

I want to improve the following code, which calculates the mean:
void calculateMeanStDev8x8Aux(cv::Mat* patch, int sx, int sy, int& mean, float& stdev)
{
    unsigned sum=0;
    unsigned sqsum=0;
    const unsigned char* aux=patch->data + sy*patch->step + sx;
    for (int j=0; j< 8; j++) {
        const unsigned char* p = (const unsigned char*)(j*patch->step + aux ); // pointer to the start of the matrix row
        for (int i=0; i<8; i++) {
            unsigned f = *p++;
            sum += f;
            sqsum += f*f;
        }
    }
    mean = sum >> 6;
    int r = (sum*sum) >> 6;
    stdev = sqrtf(sqsum - r);
    if (stdev < .1) {
        stdev=0;
    }
}
I also improved the following loop with NEON intrinsics:
for (int i=0; i<8; i++) {
    unsigned f = *p++;
    sum += f;
    sqsum += f*f;
}
This is the NEON code that replaces that loop:
int32x4_t vsum= { 0 };
int32x4_t vsum2= { 0 };
int32x4_t vsumll = { 0 };
int32x4_t vsumlh = { 0 };
int32x4_t vsumll2 = { 0 };
int32x4_t vsumlh2 = { 0 };
uint8x8_t f= vld1_u8(p); // VLD1.8 {d0}, [r0]
// int16: 16 bytes / 8 elements
int16x8_t val = (int16x8_t)vmovl_u8(f);
// int32: 4 elements * 2
int32x4_t vall = vmovl_s16(vget_low_s16(val));
int32x4_t valh = vmovl_s16(vget_high_s16(val));
// update 4 partial sum of products vectors
vsumll2 = vmlaq_s32(vsumll2, vall, vall);
vsumlh2 = vmlaq_s32(vsumlh2, valh, valh);
// sum 4 partial sum of product vectors
vsum = vaddq_s32(vall, valh);
vsum2 = vaddq_s32(vsumll2, vsumlh2);
// do scalar horizontal sum across final vector
sum += vgetq_lane_s32(vsum, 0);
sum += vgetq_lane_s32(vsum, 1);
sum += vgetq_lane_s32(vsum, 2);
sum += vgetq_lane_s32(vsum, 3);
sqsum += vgetq_lane_s32(vsum2, 0);
sqsum += vgetq_lane_s32(vsum2, 1);
sqsum += vgetq_lane_s32(vsum2, 2);
sqsum += vgetq_lane_s32(vsum2, 3);
But it is roughly 30 ms slower. Does anyone know why?
All the code works correctly.
To add to Lundin's answer: yes, on instruction sets like ARM, where you have a register-based index or some reach with an immediate index, you might benefit from encouraging the compiler to use indexing. Also, though, ARM for example can increment its pointer register in the load instruction, basically *p++ in one instruction.
It is always a toss-up between using p[i] or p[i++] vs *p or *p++; on some instruction sets it is much more obvious which path to take.
Likewise your index: if you are not otherwise using it, counting down instead of up can save an instruction per loop, maybe more. Some compilers might do this:
inc reg
cmp reg,#7
bne loop_top
If you were counting down though you might save an instruction per loop:
dec reg
bne loop_top
or even one processor I know of
decrement_and_jump_if_not_zero loop_top
The compilers usually know this and you don't have to encourage them. BUT if you use the p[i] form where the memory read order is important, then the compiler can't, or at least should not, arbitrarily change the order of the reads. So for that case you would want to have the code count down.
So I tried all of these things:
unsigned fun1 ( const unsigned char *p, unsigned *x )
{
unsigned sum;
unsigned sqsum;
int i;
unsigned f;
sum = 0;
sqsum = 0;
for(i=0; i<8; i++)
{
f = *p++;
sum += f;
sqsum += f*f;
}
//to keep the compiler from optimizing
//stuff out
x[0]=sum;
return(sqsum);
}
unsigned fun2 ( const unsigned char *p, unsigned *x )
{
unsigned sum;
unsigned sqsum;
int i;
unsigned f;
sum = 0;
sqsum = 0;
for(i=8;i--;)
{
f = *p++;
sum += f;
sqsum += f*f;
}
//to keep the compiler from optimizing
//stuff out
x[0]=sum;
return(sqsum);
}
unsigned fun3 ( const unsigned char *p, unsigned *x )
{
unsigned sum;
unsigned sqsum;
int i;
sum = 0;
sqsum = 0;
for(i=0; i<8; i++)
{
sum += (unsigned)p[i];
sqsum += ((unsigned)p[i])*((unsigned)p[i]);
}
//to keep the compiler from optimizing
//stuff out
x[0]=sum;
return(sqsum);
}
unsigned fun4 ( const unsigned char *p, unsigned *x )
{
unsigned sum;
unsigned sqsum;
int i;
sum = 0;
sqsum = 0;
for(i=8; i;i--)
{
sum += (unsigned)p[i-1];
sqsum += ((unsigned)p[i-1])*((unsigned)p[i-1]);
}
//to keep the compiler from optimizing
//stuff out
x[0]=sum;
return(sqsum);
}
with both gcc and llvm (clang). And of course both unrolled the loop since the count was a constant. gcc produced the same code for each of the experiments, in some cases with a subtle register-mix change. And I would argue there is a bug, as in at least one of them the reads were not in the order described by the code.
The gcc solution for all four was this, with some read reordering; notice the reads being out of order relative to the source code. If this were against hardware/logic that relied on the reads being in the order described by the code, you would have a big problem.
00000000 <fun1>:
0: e92d05f0 push {r4, r5, r6, r7, r8, sl}
4: e5d06001 ldrb r6, [r0, #1]
8: e00a0696 mul sl, r6, r6
c: e4d07001 ldrb r7, [r0], #1
10: e02aa797 mla sl, r7, r7, sl
14: e5d05001 ldrb r5, [r0, #1]
18: e02aa595 mla sl, r5, r5, sl
1c: e5d04002 ldrb r4, [r0, #2]
20: e02aa494 mla sl, r4, r4, sl
24: e5d0c003 ldrb ip, [r0, #3]
28: e02aac9c mla sl, ip, ip, sl
2c: e5d02004 ldrb r2, [r0, #4]
30: e02aa292 mla sl, r2, r2, sl
34: e5d03005 ldrb r3, [r0, #5]
38: e02aa393 mla sl, r3, r3, sl
3c: e0876006 add r6, r7, r6
40: e0865005 add r5, r6, r5
44: e0854004 add r4, r5, r4
48: e5d00006 ldrb r0, [r0, #6]
4c: e084c00c add ip, r4, ip
50: e08c2002 add r2, ip, r2
54: e082c003 add ip, r2, r3
58: e023a090 mla r3, r0, r0, sl
5c: e080200c add r2, r0, ip
60: e5812000 str r2, [r1]
64: e1a00003 mov r0, r3
68: e8bd05f0 pop {r4, r5, r6, r7, r8, sl}
6c: e12fff1e bx lr
The index for the loads and subtle register mixing were the only differences between the gcc functions; all of the operations were the same, in the same order.
llvm/clang:
00000000 <fun1>:
0: e92d41f0 push {r4, r5, r6, r7, r8, lr}
4: e5d0e000 ldrb lr, [r0]
8: e5d0c001 ldrb ip, [r0, #1]
c: e5d03002 ldrb r3, [r0, #2]
10: e5d08003 ldrb r8, [r0, #3]
14: e5d04004 ldrb r4, [r0, #4]
18: e5d05005 ldrb r5, [r0, #5]
1c: e5d06006 ldrb r6, [r0, #6]
20: e5d07007 ldrb r7, [r0, #7]
24: e08c200e add r2, ip, lr
28: e0832002 add r2, r3, r2
2c: e0882002 add r2, r8, r2
30: e0842002 add r2, r4, r2
34: e0852002 add r2, r5, r2
38: e0862002 add r2, r6, r2
3c: e0870002 add r0, r7, r2
40: e5810000 str r0, [r1]
44: e0010e9e mul r1, lr, lr
48: e0201c9c mla r0, ip, ip, r1
4c: e0210393 mla r1, r3, r3, r0
50: e0201898 mla r0, r8, r8, r1
54: e0210494 mla r1, r4, r4, r0
58: e0201595 mla r0, r5, r5, r1
5c: e0210696 mla r1, r6, r6, r0
60: e0201797 mla r0, r7, r7, r1
64: e8bd41f0 pop {r4, r5, r6, r7, r8, lr}
68: e1a0f00e mov pc, lr
Much easier to read and follow, perhaps thinking about a cache and getting the reads all done in one shot. llvm in at least one case got the reads out of order as well:
00000144 <fun4>:
144: e92d40f0 push {r4, r5, r6, r7, lr}
148: e5d0c007 ldrb ip, [r0, #7]
14c: e5d03006 ldrb r3, [r0, #6]
150: e5d02005 ldrb r2, [r0, #5]
154: e5d05004 ldrb r5, [r0, #4]
158: e5d0e000 ldrb lr, [r0]
15c: e5d04001 ldrb r4, [r0, #1]
160: e5d06002 ldrb r6, [r0, #2]
164: e5d00003 ldrb r0, [r0, #3]
Yes, for averaging some values from RAM the order is not an issue; moving on.
So the compiler chose the unrolled path and didn't care about the micro-optimizations. Because of the size of the loop, both chose to burn a bunch of registers, holding one loaded value per iteration, then performing either the adds from those temporary reads or the multiplies. If we increased the size of the loop a little, I would expect to see sum and sqsum accumulations within the unrolled loop as it runs out of registers, or the threshold would be reached where they choose not to unroll the loop.
If I pass the length in, and replace the 8s in the code above with that passed-in length, forcing the compiler to make a loop out of this, you sort of see the optimizations; instructions like this are used:
a4: e4d35001 ldrb r5, [r3], #1
And being ARM, they modify the loop register in one place and branch-if-not-equal a number of instructions later... because they can.
Granted this is a math function, but using float is painful. And using multiplies is painful; divides are much worse. Fortunately a shift was used, and fortunately this was unsigned so that a shift could be used (the compiler would/should have known to use an arithmetic shift, if available, had you used a divide on a signed number).
So basically, focus on micro-optimizations of the inner loop since it gets run multiple times, and see if this can be changed to become shifts and adds, if possible. Or arrange the data so you can take computations out of the loop (if possible, and don't waste other copy loops elsewhere to do this); for example, by hoisting this per-row address computation:
const unsigned char* p = (const unsigned char*)(j*patch->step + aux );
you could get some speed. I didn't try it, but because it is a loop within a loop the compiler probably won't unroll that loop...
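A minimal sketch of that hoisting idea (the helper name sum8x8 is made up; aux and step stand in for the same values as in the question's function): keep one row pointer and add step per row instead of multiplying j*step every outer iteration.

static void sum8x8(const unsigned char *aux, int step, unsigned *sum, unsigned *sqsum)
{
    const unsigned char *row = aux;
    for (int j = 0; j < 8; j++) {
        const unsigned char *p = row;
        for (int i = 0; i < 8; i++) {
            unsigned f = *p++;      /* one byte at a time, as in the original */
            *sum += f;
            *sqsum += f * f;
        }
        row += step;                /* one add per row instead of a multiply */
    }
}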
Long story short, you might get some gains depending on the instruction set against a dumber compiler, but this code is not really bad so the compiler can optimize it as well as you can.
First of all, you will probably get very good, detailed answers on stuff like this if you post at Code review instead.
Some comments regarding efficiency and suspicious variable types:
unsigned f = *p++; — you will probably be better off if you access p through array indexing, using p[i] to access the data. This is highly dependent on the compiler, cache memory optimizations etc. (some ARM guru can give better advice than me in this matter).
Btw, the whole const char to int conversion looks highly suspicious. I take it those chars are to be regarded as 8-bit unsigned integers? Standard C uint8_t is likely a better type for this; char has various implementation-defined signedness issues that you want to avoid.
Also, why are you doing wild mixing of unsigned and int? You are asking for implicit integer balancing bugs.
stdev < .1. Just a minor thing: change this to .1f or you enforce an implicit promotion of your float to double, since .1 is a double literal.
As your data is being read in groups of 8 bytes, depending on your hardware bus and the alignment of the array itself, you can probably get some gains by reading the inner loop in via a single long long read, then either manually splitting the numbers into separate values, or using ARM intrinsics to do the adds in parallel with some inline asm using the add8 instruction (adds 4 numbers at a time in one register), or doing a touch of shifting and using add16 to allow the values to overflow into 16 bits' worth of space. There is also a dual signed multiply-and-accumulate instruction which makes your first accumulation loop nearly perfectly supported by ARM with just a little help. Also, if the data coming in could be massaged to be 16-bit values, that could also speed this up.
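A rough C sketch of the "one wide read, then split" part of that suggestion (the add8/intrinsics route is left out; this assumes a little-endian target, the helper name sum_row8 is made up, and memcpy is used so the 64-bit read is safe regardless of alignment):

#include <stdint.h>
#include <string.h>

static void sum_row8(const unsigned char *p, unsigned *sum, unsigned *sqsum)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);                            /* one 64-bit read instead of eight byte reads */
    for (int k = 0; k < 8; k++) {
        unsigned f = (unsigned)(w >> (8 * k)) & 0xFFu;  /* peel off one byte */
        *sum += f;
        *sqsum += f * f;
    }
}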
As to why the NEON is slower, my guess is that the overhead of setting up the vectors, along with the extra data you are pushing around with the larger types, is killing any performance it might gain with such a small set of data. The original code is very ARM-friendly to begin with, which means the setup overhead is probably killing you. When in doubt, look at the assembly output. That will tell you what's truly going on. Perhaps the compiler is pushing and popping data all over the place when trying to use the intrinsics - it wouldn't be the first time I've seen this sort of behavior.
Thanks to Lundin, dwelch and Michel.
I made the following improvement and it seems the best for my code.
I'm trying to decrease the number of cycles by improving the cache access, because the cache is only accessed once.
int step=patch->step;
for (int j=0; j< 8; j++) {
    p = (uint8_t*)(j*step + aux);
    i=8;
    do {
        f=p[i-1];   // i runs 8..1, so read p[i-1] to cover p[0]..p[7]
        sum += f;
        sqsum += f*f;
    } while(--i);
}

is i=(i+1)&3 faster than i=(i+1)%4

I am optimizing some C++ code.
At one critical step, I want to implement the following function y=f(x):
f(0)=1
f(1)=2
f(2)=3
f(3)=0
Which one is faster: using a lookup table, i=(i+1)&3, or i=(i+1)%4? Or is there any better suggestion?
Almost certainly the lookup table is going to be slowest. In a lot of cases, the compiler will generate the same assembly for (i+1)&3 and (i+1)%4; however depending on the type/signedness of i, they may not be strictly equivalent and the compiler won't be able to make that optimization. For example for the code
int foo(int i)
{
    return (i+1)%4;
}
unsigned bar(unsigned i)
{
    return (i+1)%4;
}
on my system, gcc -O2 generates:
0000000000000000 <foo>:
0: 8d 47 01 lea 0x1(%rdi),%eax
3: 89 c2 mov %eax,%edx
5: c1 fa 1f sar $0x1f,%edx
8: c1 ea 1e shr $0x1e,%edx
b: 01 d0 add %edx,%eax
d: 83 e0 03 and $0x3,%eax
10: 29 d0 sub %edx,%eax
12: c3 retq
0000000000000020 <bar>:
20: 8d 47 01 lea 0x1(%rdi),%eax
23: 83 e0 03 and $0x3,%eax
26: c3 retq
So as you can see, because of the rules about signed modulus results, (i+1)%4 generates a lot more code in the first (signed) case.
Bottom line, you're probably best off using the (i+1)&3 version if that expresses what you want, because there's less chance for the compiler to do something you don't expect.
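For completeness, here is a sketch of the lookup-table variant the question mentions (the table name next is made up; per the reasoning above, the extra memory access is expected to make this the slowest option):

static const int next[4] = { 1, 2, 3, 0 };  /* f(0)=1, f(1)=2, f(2)=3, f(3)=0 */

static int f_table(int i)
{
    return next[i & 3];                     /* the mask keeps the index in range */
}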
I won't get into the discussion of premature optimization. But the answer is that they will be the same speed.
Any sane compiler will compile them to the same thing. Division/modulus by a power of two will be optimized to bitwise operations anyway.
So use whichever you find (or others will find) to be more readable.
EDIT: As Roland has pointed out, it does sometimes behave differently depending on the signedness:
Unsigned &:
int main(void)
{
unsigned x;
cin >> x;
x = (x + 1) & 3;
cout << x;
return 0;
}
mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push eax
Unsigned Modulus:
int main(void)
{
unsigned x;
cin >> x;
x = (x + 1) % 4;
cout << x;
return 0;
}
mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push eax
Signed &:
int main(void)
{
int x;
cin >> x;
x = (x + 1) & 3;
cout << x;
return 0;
}
mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, 3
push eax
Signed Modulus:
int main(void)
{
int x;
cin >> x;
x = (x + 1) % 4;
cout << x;
return 0;
}
mov eax, DWORD PTR _x$[ebp]
inc eax
and eax, -2147483645 ; 80000003H
jns SHORT $LN3#main
dec eax
or eax, -4 ; fffffffcH
Chances are good that you won't find any difference: any reasonably modern compiler knows how to optimize both into the same code.
Have you tried benchmarking it? As an offhand guess, I'll assume that the &3 version will be faster, as that's a simple addition and bitwise AND operation, both of which should be single-cycle operations on any modern-ish CPU.
The %4 could go a few different ways, depending on how smart the compiler is. It could be done via division, which is much slower than addition, or it could be translated into a bitwise AND operation as well and end up being just as fast as the &3 version.
Same as Mystical's answer, but in C and ARM:
int fun1 ( int i )
{
return( (i+1)&3 );
}
int fun2 ( int i )
{
return( (i+1)%4 );
}
unsigned int fun3 ( unsigned int i )
{
return( (i+1)&3 );
}
unsigned int fun4 ( unsigned int i )
{
return( (i+1)%4 );
}
creates:
00000000 <fun1>:
0: e2800001 add r0, r0, #1
4: e2000003 and r0, r0, #3
8: e12fff1e bx lr
0000000c <fun2>:
c: e2802001 add r2, r0, #1
10: e1a0cfc2 asr ip, r2, #31
14: e1a03f2c lsr r3, ip, #30
18: e0821003 add r1, r2, r3
1c: e2010003 and r0, r1, #3
20: e0630000 rsb r0, r3, r0
24: e12fff1e bx lr
00000028 <fun3>:
28: e2800001 add r0, r0, #1
2c: e2000003 and r0, r0, #3
30: e12fff1e bx lr
00000034 <fun4>:
34: e2800001 add r0, r0, #1
38: e2000003 and r0, r0, #3
3c: e12fff1e bx lr
For negative numbers the mask and the modulo are not equivalent; they only match for positive/unsigned numbers. For those cases your compiler should know that %4 is the same as &3 and use the less expensive one (&3), as gcc did above. clang/llc below:
fun3:
add r0, r0, #1
and r0, r0, #3
mov pc, lr
fun4:
add r0, r0, #1
and r0, r0, #3
mov pc, lr
Of course & is faster than %, which has been shown by many previous posts. Also, as i is a local variable, you can use ++i instead of i+1, as that is handled at least as well by most compilers; i+1 may (or may not) be optimized into ++i.
UPDATE: Perhaps I was not clear: I meant the function should just return((++i)&3);