The compare function is a function that takes two arguments a and b and returns an integer describing their order. If a is smaller than b, the result is some negative integer. If a is bigger than b, the result is some positive integer. Otherwise, a and b are equal, and the result is zero.
This function is often used to parameterize sorting and searching algorithms from standard libraries.
Implementing the compare function for characters is quite easy; you simply subtract the arguments:
int compare_char(char a, char b)
{
return a - b;
}
This works because the difference between two characters is generally assumed to fit into an integer. (Note that this assumption does not hold for systems where sizeof(char) == sizeof(int).)
This trick cannot work to compare integers, because the difference between two integers generally does not fit into an integer. For example, INT_MAX - (-1) = INT_MIN suggests that INT_MAX is smaller than -1 (technically, the overflow leads to undefined behavior, but let's assume modulo arithmetic).
So how can we implement the compare function efficiently for integers? Here is my first attempt:
int compare_int(int a, int b)
{
int temp;
int result;
__asm__ __volatile__ (
"cmp %3, %2 \n\t"
"mov $0, %1 \n\t"
"mov $1, %0 \n\t"
"cmovg %0, %1 \n\t"
"mov $-1, %0 \n\t"
"cmovl %0, %1 \n\t"
: "=r"(temp), "=r"(result)
: "r"(a), "r"(b)
: "cc");
return result;
}
Can it be done in less than 6 instructions? Is there a less straightforward way that is more efficient?
This one has no branches, and doesn't suffer from overflow or underflow:
return (a > b) - (a < b);
With gcc -O2 -S, this compiles down to the following six instructions:
xorl %eax, %eax
cmpl %esi, %edi
setl %dl
setg %al
movzbl %dl, %edx
subl %edx, %eax
Here's some code to benchmark various compare implementations:
#include <stdio.h>
#include <stdlib.h>
#define COUNT 1024
#define LOOPS 500
#define COMPARE compare2
#define USE_RAND 1
int arr[COUNT];
int compare1 (int a, int b)
{
if (a < b) return -1;
if (a > b) return 1;
return 0;
}
int compare2 (int a, int b)
{
return (a > b) - (a < b);
}
int compare3 (int a, int b)
{
return (a < b) ? -1 : (a > b);
}
int compare4 (int a, int b)
{
__asm__ __volatile__ (
"sub %1, %0 \n\t"
"jno 1f \n\t"
"cmc \n\t"
"rcr %0 \n\t"
"1: "
: "+r"(a)
: "r"(b)
: "cc");
return a;
}
int main ()
{
for (int i = 0; i < COUNT; i++) {
#if USE_RAND
arr[i] = rand();
#else
for (int b = 0; b < sizeof(arr[i]); b++) {
*((unsigned char *)&arr[i] + b) = rand();
}
#endif
}
int sum = 0;
for (int l = 0; l < LOOPS; l++) {
for (int i = 0; i < COUNT; i++) {
for (int j = 0; j < COUNT; j++) {
sum += COMPARE(arr[i], arr[j]);
}
}
}
printf("%d=0\n", sum);
return 0;
}
The results on my 64-bit system, compiled with gcc -std=c99 -O2, for positive integers (USE_RAND=1):
compare1: 0m1.118s
compare2: 0m0.756s
compare3: 0m1.101s
compare4: 0m0.561s
Out of C-only solutions, the one I suggested was the fastest. user315052's solution was slower despite compiling to only 5 instructions. The slowdown is likely because, despite having one less instruction, there is a conditional instruction (cmovge).
Overall, FredOverflow's 4-instruction assembly implementation was the fastest when used with positive integers. However, this code only benchmarked the integer range RAND_MAX, so the 4-instuction test is biased, because it handles overflows separately, and these don't occur in the test; the speed may be due to successful branch prediction.
With a full range of integers (USE_RAND=0), the 4-instruction solution is in fact very slow (others are the same):
compare4: 0m1.897s
The following has always proven to be fairly efficient for me:
return (a < b) ? -1 : (a > b);
With gcc -O2 -S, this compiles down to the following five instructions:
xorl %edx, %edx
cmpl %esi, %edi
movl $-1, %eax
setg %dl
cmovge %edx, %eax
As a follow-up to Ambroz Bizjak's excellent companion answer, I was not convinced that his program tested the same assembly code what was posted above. And, when I was studying the compiler output more closely, I noticed that the compiler was not generating the same instructions as was posted in either of our answers. So, I took his test program, hand modified the assembly output to match what we posted, and compared the resulting times. It seems the two versions compare roughly identically.
./opt_cmp_branchless: 0m1.070s
./opt_cmp_branch: 0m1.037s
I am posting the assembly of each program in full so that others may attempt the same experiment, and confirm or contradict my observation.
The following is the version with the cmovge instruction ((a < b) ? -1 : (a > b)):
.file "cmp.c"
.text
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "%d=0\n"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB20:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
pushq %rbx
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
movl $arr.2789, %ebx
subq $8, %rsp
.cfi_def_cfa_offset 32
.L9:
leaq 4(%rbx), %rbp
.L10:
call rand
movb %al, (%rbx)
addq $1, %rbx
cmpq %rbx, %rbp
jne .L10
cmpq $arr.2789+4096, %rbp
jne .L9
xorl %r8d, %r8d
xorl %esi, %esi
orl $-1, %edi
.L12:
xorl %ebp, %ebp
.p2align 4,,10
.p2align 3
.L18:
movl arr.2789(%rbp), %ecx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L15:
movl arr.2789(%rax), %edx
xorl %ebx, %ebx
cmpl %ecx, %edx
movl $-1, %edx
setg %bl
cmovge %ebx, %edx
addq $4, %rax
addl %edx, %esi
cmpq $4096, %rax
jne .L15
addq $4, %rbp
cmpq $4096, %rbp
jne .L18
addl $1, %r8d
cmpl $500, %r8d
jne .L12
movl $.LC0, %edi
xorl %eax, %eax
call printf
addq $8, %rsp
.cfi_def_cfa_offset 24
xorl %eax, %eax
popq %rbx
.cfi_def_cfa_offset 16
popq %rbp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE20:
.size main, .-main
.local arr.2789
.comm arr.2789,4096,32
.section .note.GNU-stack,"",#progbits
The version below uses the branchless method ((a > b) - (a < b)):
.file "cmp.c"
.text
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "%d=0\n"
.text
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB20:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
pushq %rbx
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
movl $arr.2789, %ebx
subq $8, %rsp
.cfi_def_cfa_offset 32
.L9:
leaq 4(%rbx), %rbp
.L10:
call rand
movb %al, (%rbx)
addq $1, %rbx
cmpq %rbx, %rbp
jne .L10
cmpq $arr.2789+4096, %rbp
jne .L9
xorl %r8d, %r8d
xorl %esi, %esi
.L19:
movl %ebp, %ebx
xorl %edi, %edi
.p2align 4,,10
.p2align 3
.L24:
movl %ebp, %ecx
xorl %eax, %eax
jmp .L22
.p2align 4,,10
.p2align 3
.L20:
movl arr.2789(%rax), %ecx
.L22:
xorl %edx, %edx
cmpl %ebx, %ecx
setg %cl
setl %dl
movzbl %cl, %ecx
subl %ecx, %edx
addl %edx, %esi
addq $4, %rax
cmpq $4096, %rax
jne .L20
addq $4, %rdi
cmpq $4096, %rdi
je .L21
movl arr.2789(%rdi), %ebx
jmp .L24
.L21:
addl $1, %r8d
cmpl $500, %r8d
jne .L19
movl $.LC0, %edi
xorl %eax, %eax
call printf
addq $8, %rsp
.cfi_def_cfa_offset 24
xorl %eax, %eax
popq %rbx
.cfi_def_cfa_offset 16
popq %rbp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE20:
.size main, .-main
.local arr.2789
.comm arr.2789,4096,32
.section .note.GNU-stack,"",#progbits
Okay, I managed to get it down to four instructions :) The basic idea is as follows:
Half the time, the difference is small enough to fit into an integer. In that case, just return the difference. Otherwise, shift the number one to the right. The crucial question is what bit to shift into the MSB then.
Let's look at two extreme examples, using 8 bits instead of 32 bits for the sake of simplicity:
10000000 INT_MIN
01111111 INT_MAX
---------
000000001 difference
00000000 shifted
01111111 INT_MAX
10000000 INT_MIN
---------
111111111 difference
11111111 shifted
Shifting the carry bit in would yield 0 for the first case (although INT_MIN is not equal to INT_MAX) and some negative number for the second case (although INT_MAX is not smaller than INT_MIN).
But if we flip the carry bit before doing the shift, we get sensible numbers:
10000000 INT_MIN
01111111 INT_MAX
---------
000000001 difference
100000001 carry flipped
10000000 shifted
01111111 INT_MAX
10000000 INT_MIN
---------
111111111 difference
011111111 carry flipped
01111111 shifted
I'm sure there's a deep mathematical reason why it makes sense to flip the carry bit, but I don't see it yet.
int compare_int(int a, int b)
{
__asm__ __volatile__ (
"sub %1, %0 \n\t"
"jno 1f \n\t"
"cmc \n\t"
"rcr %0 \n\t"
"1: "
: "+r"(a)
: "r"(b)
: "cc");
return a;
}
I have tested the code with one million random inputs plus every combination of INT_MIN, -INT_MAX, INT_MIN/2, -1, 0, 1, INT_MAX/2, INT_MAX/2+1, INT_MAX. All tests passed. Can you proove me wrong?
For what it's worth I put together an SSE2 implementation. vec_compare1 uses the same approach as compare2 but requires just three SSE2 arithmetic instructions:
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>
#define COUNT 1024
#define LOOPS 500
#define COMPARE vec_compare1
#define USE_RAND 1
int arr[COUNT] __attribute__ ((aligned(16)));
typedef __m128i vSInt32;
vSInt32 vec_compare1 (vSInt32 va, vSInt32 vb)
{
vSInt32 vcmp1 = _mm_cmpgt_epi32(va, vb);
vSInt32 vcmp2 = _mm_cmpgt_epi32(vb, va);
return _mm_sub_epi32(vcmp2, vcmp1);
}
int main ()
{
for (int i = 0; i < COUNT; i++) {
#if USE_RAND
arr[i] = rand();
#else
for (int b = 0; b < sizeof(arr[i]); b++) {
*((unsigned char *)&arr[i] + b) = rand();
}
#endif
}
vSInt32 vsum = _mm_set1_epi32(0);
for (int l = 0; l < LOOPS; l++) {
for (int i = 0; i < COUNT; i++) {
for (int j = 0; j < COUNT; j+=4) {
vSInt32 v1 = _mm_loadu_si128(&arr[i]);
vSInt32 v2 = _mm_load_si128(&arr[j]);
vSInt32 v = COMPARE(v1, v2);
vsum = _mm_add_epi32(vsum, v);
}
}
}
printf("vsum = %vd\n", vsum);
return 0;
}
Time for this is 0.137s.
Time for compare2 with the same CPU and compiler is 0.674s.
So the SSE2 implementation is around 4x faster, as might be expected (since it's 4-wide SIMD).
This code has no branches and uses 5 instructions. It may outperform other branch-less alternatives on recent Intel processors, where cmov* instructions are quite expensive. Disadvantage is non-symmetrical return value (INT_MIN+1, 0, 1).
int compare_int (int a, int b)
{
int res;
__asm__ __volatile__ (
"xor %0, %0 \n\t"
"cmpl %2, %1 \n\t"
"setl %b0 \n\t"
"rorl $1, %0 \n\t"
"setnz %b0 \n\t"
: "=q"(res)
: "r"(a)
, "r"(b)
: "cc"
);
return res;
}
This variant does not need initialization, so it uses only 4 instructions:
int compare_int (int a, int b)
{
__asm__ __volatile__ (
"subl %1, %0 \n\t"
"setl %b0 \n\t"
"rorl $1, %0 \n\t"
"setnz %b0 \n\t"
: "+q"(a)
: "r"(b)
: "cc"
);
return a;
}
Maybe you can use the following idea (in pseudo-code; didn't write asm-code because i am not comfortable with syntax):
Subtract the numbers (result = a - b)
If no overflow, done (jo instruction and branch prediction should work very well here)
If there was overflow, use any robust method (return (a < b) ? -1 : (a > b))
Edit: for additional simplicity: if there was overflow, flip the sign of the result, instead of step 3.
You could consider promoting the integers to 64bit values.
Related
I would like to know how to implement this lines of code into x86 masm assembly:
if (x >= 1 && x <= 100) {
printsomething1();
} else if (x >= 101 && x <= 200) {
printsomething2();
} else {
printsomething3();
}
I'd break it into contiguous ranges, (assuming x is unsigned) like:
x is 0, do printsomething3()
x is 1 to 100, do nothing printsomething1()
x is 101 to 200, do nothing printsomething2()
x is 201 or higher, do nothing printsomething3()
Then work from lowest to highest, like:
;eax = x;
cmp eax,0
je .printsomething3
cmp eax,100
jbe .printsomething1
cmp eax,200
jbe .printsomething2
jmp .printsomething3
If the only difference is the string they print (and not the code they use to print it) I'd go one step further:
mov esi,something3 ;esi = address of string if x is 0
cmp eax,0
je .print
mov esi,something1 ;esi = address of string if x is 1 to 100
cmp eax,100
jbe .print
mov esi,something2 ;esi = address of string if x is 101 to 200
cmp eax,200
jbe .print
mov esi,something3 ;esi = address of string if x is 201 or higher
jmp .print
If you have access to a decent C compiler, you can compile it into assembly language. For gcc use the -S flag:
gcc test.c -S
This creates the file test.s which contains the assembly language output which can be assembled and linked if needed.
For example, to make your code compile successfully, I rewrote it slightly to this:
#include <stdio.h>
#include <stdlib.h>
void printsomething (int y)
{
printf ("something %d", y);
}
void func (int x)
{
if (x >= 1 && x <= 100)
printsomething(1);
else
if (x >= 101 && x <= 200)
printsomething(2);
else
printsomething(3);
}
int main (int argc, char **argv)
{
int x = 0;
if (argc > 1)
x = atoi (argv [1]);
return 0;
}
It compiles into this assembler:
.file "s.c"
.text
.section .rodata
.LC0:
.string "something %d"
.text
.globl printsomething
.type printsomething, #function
printsomething:
.LFB5:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
nop
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE5:
.size printsomething, .-printsomething
.globl func
.type func, #function
func:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
cmpl $0, -4(%rbp)
jle .L3
cmpl $100, -4(%rbp)
jg .L3
movl $1, %edi
call printsomething
jmp .L4
.L3:
cmpl $100, -4(%rbp)
jle .L5
cmpl $200, -4(%rbp)
jg .L5
movl $2, %edi
call printsomething
jmp .L4
.L5:
movl $3, %edi
call printsomething
.L4:
nop
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE6:
.size func, .-func
.globl main
.type main, #function
main:
.LFB7:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $0, -4(%rbp)
cmpl $1, -20(%rbp)
jle .L7
movq -32(%rbp), %rax
addq $8, %rax
movq (%rax), %rax
movq %rax, %rdi
call atoi
movl %eax, -4(%rbp)
.L7:
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE7:
.size main, .-main
.ident "GCC: (GNU) 7.3.1 20180712 (Red Hat 7.3.1-6)"
.section .note.GNU-stack,"",#progbits
Examine the func: part of it and you'll see how it sets up the comparisons with 1, 100, 101, etc.
Let's say I have this code:
int v;
setV(&v);
for (int i = 0; i < v - 5; i++) {
// Do stuff here, but don't use v.
}
Will the operation v - 5 be run every time or will a modern compiler be smart enough to store it once and never run it again?
What if I did this:
int v;
setV(&v);
const int cv = v;
for (int i = 0; i < cv - 5; i++) {
// Do stuff here. Changing cv is actually impossible.
}
Would the second style make a difference?
Edit:
This was an interesting question for an unexpected reason. It's more a question of the compiler avoiding the obtuse case of an unintended aliasing of v. If the compiler can prove that this won't happen (version 2) then we get better code.
The lesson here is to be more concerned with eliminating aliasing than trying to do the optimiser's job for it.
Making the copy cv actually presented the biggest optimisation (elision of redundant memory fetches), even though at a first glance it would appear to be (slightly) less efficient.
original answer and demo:
Let's see:
given:
extern void setV(int*);
extern void do_something(int i);
void test1()
{
int v;
setV(&v);
for (int i = 0; i < v - 5; i++) {
// Do stuff here, but don't use v.
do_something(i);
}
}
void test2()
{
int v;
setV(&v);
const int cv = v;
for (int i = 0; i < cv - 5; i++) {
// Do stuff here. Changing cv is actually impossible.
do_something(i);
}
}
compile on gcc5.3 with -x c++ -std=c++14 -O2 -Wall
gives:
test1():
pushq %rbx
subq $16, %rsp
leaq 12(%rsp), %rdi
call setV(int*)
cmpl $5, 12(%rsp)
jle .L1
xorl %ebx, %ebx
.L5:
movl %ebx, %edi
addl $1, %ebx
call do_something(int)
movl 12(%rsp), %eax
subl $5, %eax
cmpl %ebx, %eax
jg .L5
.L1:
addq $16, %rsp
popq %rbx
ret
test2():
pushq %rbp
pushq %rbx
subq $24, %rsp
leaq 12(%rsp), %rdi
call setV(int*)
movl 12(%rsp), %eax
cmpl $5, %eax
jle .L8
leal -5(%rax), %ebp
xorl %ebx, %ebx
.L12:
movl %ebx, %edi
addl $1, %ebx
call do_something(int)
cmpl %ebp, %ebx
jne .L12
.L8:
addq $24, %rsp
popq %rbx
popq %rbp
ret
The second form is better on this compiler.
I have a simple function that computes a product of
two double arrays:
#include <stdlib.h>
#include <emmintrin.h>
struct S {
double *x;
double *y;
double *z;
};
void f(S& s, size_t n) {
for (int i = 0; i < n; i += 2) {
__m128d xs = _mm_load_pd(&s.x[i]);
__m128d ys = _mm_load_pd(&s.y[i]);
_mm_store_pd(&s.z[i], _mm_mul_pd(xs, ys) );
}
return;
}
int main(void) {
S s;
size_t size = 4;
posix_memalign((void **)&s.x, 16, sizeof(double) * size);
posix_memalign((void **)&s.y, 16, sizeof(double) * size);
posix_memalign((void **)&s.z, 16, sizeof(double) * size);
f(s, size);
return 0;
}
Note that the first argument of function f is passed in by reference.
Let's look at the resulting assembly of f() (I removed some irrelevant
pieces, inserted comments and put some labels):
$ g++ -O3 -S asmtest.cpp
.globl _Z1fR1Sm
_Z1fR1Sm:
xorl %eax, %eax
testq %rsi, %rsi
je .L1
.L5:
movq (%rdi), %r8 # array x (1)
movq 8(%rdi), %rcx # array y (2)
movq 16(%rdi), %rdx # array z (3)
movapd (%r8,%rax,8), %xmm0 # load x[0]
mulpd (%rcx,%rax,8), %xmm0 # multiply x[0]*y[0]
movaps %xmm0, (%rdx,%rax,8) # store to y
addq $2, %rax # and loop
cmpq %rax, %rsi
ja .L5
Notice that addresses of arrays x, y, z are loaded into general-purpose
registers on each iteration, see statements (1),(2),(3). Why doesn't gcc move
these instructions outside the loop?
Now make a local copy (not a deep copy) of the structure:
void __attribute__((noinline)) f(S& args, size_t n) {
S s = args;
for (int i = 0; i < n; i += 2) {
__m128d xs = _mm_load_pd(&s.x[i]);
__m128d ys = _mm_load_pd(&s.y[i]);
_mm_store_pd(&s.z[i], _mm_mul_pd(xs, ys) );
}
return;
}
Assembly:
_Z1fR1Sm:
.LFB525:
.cfi_startproc
xorl %eax, %eax
testq %rsi, %rsi
movq (%rdi), %r8 # (1)
movq 8(%rdi), %rcx # (2)
movq 16(%rdi), %rdx # (3)
je .L1
.L5:
movapd (%r8,%rax,8), %xmm0
mulpd (%rcx,%rax,8), %xmm0
movaps %xmm0, (%rdx,%rax,8)
addq $2, %rax
cmpq %rax, %rsi
ja .L5
.L1:
rep ret
Notice that unlike in the previous code,
loads (1), (2), (3) are now outside the loop.
I would appreciate an explanation why these two assembly
codes are different. Is memory aliasing relevant here?
Thanks.
$ gcc --version
gcc (Debian 5.2.1-21) 5.2.1 20151003
Yes, gcc is reloading s.x and s.y with each iteration of the loop because gcc does not know if &s.z[i] for some i aliases part of the S object passed by reference to f(S&, size_t).
With gcc 5.2.0, applying __restrict__ to S::z and the s reference parameter to f(), i.e.:
struct S {
double *x;
double *y;
double *__restrict__ z;
};
void f(S&__restrict__ s, size_t n) {
for (int i = 0; i < n; i += 2) {
__m128d xs = _mm_load_pd(&s.x[i]);
__m128d ys = _mm_load_pd(&s.y[i]);
_mm_store_pd(&s.z[i], _mm_mul_pd(xs, ys));
}
return;
}
.. causes gcc to generate:
__Z1fR1Sm:
LFB518:
testq %rsi, %rsi
je L1
movq (%rdi), %r8
xorl %eax, %eax
movq 8(%rdi), %rcx
movq 16(%rdi), %rdx
.align 4,0x90
L4:
movapd (%r8,%rax,8), %xmm0
mulpd (%rcx,%rax,8), %xmm0
movaps %xmm0, (%rdx,%rax,8)
addq $2, %rax
cmpq %rax, %rsi
ja L4
L1:
ret
With Apple Clang 700.1.76, only __restrict__ on the s reference is needed:
__Z1fR1Sm: ## #_Z1fR1Sm
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
testq %rsi, %rsi
je LBB0_3
## BB#1: ## %.lr.ph
movq (%rdi), %rax
movq 8(%rdi), %rcx
movq 16(%rdi), %rdx
xorl %edi, %edi
.align 4, 0x90
LBB0_2: ## =>This Inner Loop Header: Depth=1
movapd (%rax,%rdi,8), %xmm0
mulpd (%rcx,%rdi,8), %xmm0
movapd %xmm0, (%rdx,%rdi,8)
addq $2, %rdi
cmpq %rsi, %rdi
jb LBB0_2
LBB0_3: ## %._crit_edge
popq %rbp
retq
.cfi_endproc
I'm working on the classic "Reverse a String" problem.
Is a good idea to use the position of the null terminator for swap space? The idea is to save the declaration of one variable.
Specifically, starting with Kernighan and Ritchie's algorithm:
void reverse(char s[])
{
int length = strlen(s);
int c, i, j;
for (i = 0, j = length - 1; i < j; i++, j--)
{
c = s[i];
s[i] = s[j];
s[j] = c;
}
}
...can we instead do the following?
void reverseUsingNullPosition(char s[]) {
int length = strlen(s);
int i, j;
for (i = 0, j = length - 1; i < j; i++, j--) {
s[length] = s[i]; // Use last position instead of a new var
s[i] = s[j];
s[j] = s[length];
}
s[length] = 0; // Replace null character
}
Notice how the "c" variable is no longer needed. We simply use the last position in the array--where the null termination resides--as our swap space. When we're done, we simply replace the 0.
Here's the main routine (Xcode):
#include <stdio.h>
#include <string>
int main(int argc, const char * argv[]) {
char cheese[] = { 'c' , 'h' , 'e' , 'd' , 'd' , 'a' , 'r' , 0 };
printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
reverse(cheese);
printf("Cheese is: %s\n", cheese); //-> Cheese is: raddehc
reverseUsingNullPosition(cheese);
printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
}
Yes, this can be done. No, this is not a good idea, because it makes your program much harder to optimize.
When you declare char c in the local scope, the optimizer can figure out that the value is not used beyond the s[j] = c; assignment, and could place the temporary in a register. In addition to effectively eliminating the variable for you, the optimizer could even figure out that you are performing a swap, and emit a hardware-specific instruction. All this would save you a memory access per character.
When you use s[length] for your temporary, the optimizer does not have as much freedom. It is forced to emit the write into memory. This could be just as fast due to caching, but on embedded platforms this could have a significant effect.
First of all such microoptimizations are totally irrelevant until proven relevant. We're talking about C++, you have std::string, std::reverse, you shouldn't even think about such facts.
In any case if you compile both code with -Os on Xcode you obtain for reverse:
.cfi_startproc
Lfunc_begin0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
pushq %r14
pushq %rbx
Ltmp6:
.cfi_offset %rbx, -32
Ltmp7:
.cfi_offset %r14, -24
movq %rdi, %r14
Ltmp8:
callq _strlen
Ltmp9:
leal -1(%rax), %ecx
testl %ecx, %ecx
jle LBB0_3
Ltmp10:
movslq %ecx, %rcx
addl $-2, %eax
Ltmp11:
xorl %edx, %edx
LBB0_2:
Ltmp12:
movb (%r14,%rdx), %sil
movb (%r14,%rcx), %bl
movb %bl, (%r14,%rdx)
movb %sil, (%r14,%rcx)
Ltmp13:
incq %rdx
decq %rcx
cmpl %eax, %edx
leal -1(%rax), %eax
jl LBB0_2
Ltmp14:
LBB0_3:
popq %rbx
popq %r14
popq %rbp
ret
Ltmp15:
Lfunc_end0:
.cfi_endproc
and for reverseUsingNullPosition:
.cfi_startproc
Lfunc_begin1:
pushq %rbp
Ltmp19:
.cfi_def_cfa_offset 16
Ltmp20:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp21:
.cfi_def_cfa_register %rbp
pushq %rbx
pushq %rax
Ltmp22:
.cfi_offset %rbx, -24
movq %rdi, %rbx
Ltmp23:
callq _strlen
Ltmp24:
leal -1(%rax), %edx
testl %edx, %edx
Ltmp25:
movslq %eax, %rdi
jle LBB1_3
Ltmp26:
movslq %edx, %rdx
addl $-2, %eax
Ltmp27:
xorl %esi, %esi
LBB1_2:
Ltmp28:
movb (%rbx,%rsi), %cl
movb %cl, (%rbx,%rdi)
movb (%rbx,%rdx), %cl
movb %cl, (%rbx,%rsi)
movb (%rbx,%rdi), %cl
movb %cl, (%rbx,%rdx)
Ltmp29:
incq %rsi
decq %rdx
cmpl %eax, %esi
leal -1(%rax), %eax
jl LBB1_2
Ltmp30:
LBB1_3: ## %._crit_edge
movb $0, (%rbx,%rdi)
addq $8, %rsp
popq %rbx
Ltmp31:
popq %rbp
ret
Ltmp32:
Lfunc_end1:
.cfi_endproc
If you check the inner loop you have
movb (%r14,%rdx), %sil
movb (%r14,%rcx), %bl
movb %bl, (%r14,%rdx)
movb %sil, (%r14,%rcx)
vs
movb (%rbx,%rsi), %cl
movb %cl, (%rbx,%rdi)
movb (%rbx,%rdx), %cl
movb %cl, (%rbx,%rsi)
movb (%rbx,%rdi), %cl
movb %cl, (%rbx,%rdx)
So I wouldn't say you are saving so much overhead as you think (since you are accessing the array more times), maybe yes, maybe no. Which teaches you another thing: thinking that some code is more performant than other code is irrelevant, the only thing that matters is a well-done benchmark and profile of the code.
Legal: Yes
Good idea: No
The cost of an "extra" variable is zero so there is absolutely no reason to avoid it. The stack pointer needs to be changed anyway so it doesn't matter if it needs to cope with an extra int.
Further:
With compiler optimization turned on, the variable c in the original code will most likely not even exists. It will just be a register in the cpu.
With your code: Optimization will be more difficult so it is not easy to say how well the compiler will do. Maybe you'll get the same - maybe you'll get something worse. But you won't get anything better.
So just forget the idea.
We can use printf and the STL and also manually unroll things and use pointers.
#include <stdio.h>
#include <string>
#include <cstring>
void reverse(char s[])
{
char * b=s;
char * e=s+::strlen(s)-4;
while (e - b > 4)
{
std::swap(b[0], e[3]);
std::swap(b[1], e[2]);
std::swap(b[2], e[1]);
std::swap(b[3], e[0]);
b+=4;
e-=4;
}
e+=3;
while (b < e)
{
std::swap(*(b++), *(e--));
}
}
int main(int argc, const char * argv[]) {
char cheese[] = { 'c' , 'h' , 'e' , 'd' , 'd' , 'a' , 'r' , 0 };
printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
reverse(cheese);
printf("Cheese is: %s\n", cheese); //-> Cheese is: raddehc
}
Hard to tell if its faster with just the test case of "cheddar"
There is such code in C++:
#include <iostream>
int main(){
int a = 4;
while(a--){
std::cout << (a + 1) << '\n';
}
return 0;
}
and corresponding code of main function in assembly code produced by g++:
.globl main
.type main, #function
main:
.LFB957:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
andl $-16, %esp
subl $32, %esp
movl $4, 28(%esp) # int a = 4;
jmp .L2
.L3:
movl 28(%esp), %eax # std::cout << (a + 1) << '\n';
addl $1, %eax
movl %eax, 4(%esp)
movl $_ZSt4cout, (%esp)
call _ZNSolsEi
movl $10, 4(%esp)
movl %eax, (%esp)
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_c
.L2:
cmpl $0, 28(%esp)
setne %al
subl $1, 28(%esp) # a = a - 1
testb %al, %al
jne .L3
movl $0, %eax
leave
ret
.cfi_endproc
.LFE957:
.size main, .-main
What are used instructions setne and testb in following fragment for?
.L2:
cmpl $0, 28(%esp)
setne %al
subl $1, 28(%esp) # a = a - 1
testb %al, %al
jne .L3
Couldn't it be just so to check in while loop whether a is not zero and jump?
The while condition is formally the equivalent of:
while ( a -- != 0 )
(Omitting the comparison is a legal obfuscation.)
The compiler is generating code to compare a with 0, save the
results in register al, then decrement a, and then test the saved
results.
Because a-- means
tmpval=a;
a=a-1;
return tmpval;
so compiler needs to save the previous value of a.
In this program, the body part of while will be executed when a = 0 (after a--, so it will print 1).
It's been a long while since I did assembler, but I would assume it's some optimisation to keep the pipelines busy / optimise register use.