I have recently been reading CSAPP and I have a question about one of its assembly-code examples. The code is:
long pcount_goto(unsigned long x) {
    long result = 0;
 loop:
    result += x & 0x1;
    x >>= 1;
    if (x) goto loop;
    return result;
}
And the corresponding assembly code is:
movl $0, %eax # result = 0
.L2: # loop:
movq %rdi, %rdx
andl $1, %edx # t = x & 0x1
addq %rdx, %rax # result += t
shrq %rdi # x >>= 1
jne .L2 # if (x) goto loop
rep; ret
The questions I have may look naive since I am very new to assembly code, but I will be grateful if someone can help me with them.
What's the difference between %eax and %rax (likewise %edx and %rdx)? I have seen both occur in the assembly code, but they seem to refer to the same space/register. What's the point of using two different names?
In the code
andl $1, %edx # t = x & 0x1
I understand that %edx now stores t, but where does x go then?
In the code
shrq %rdi
I think
shrq 1, %rdi
would be better?
For
jne .L2 # if (x) goto loop
Where does the if (x) go? I can't see any comparison.
These are really basic questions; a little research of your own should have answered all of them. Anyway:
The e registers are the low 32 bits of the r registers. You pick one depending on what size you need. There are also 16 and 8 bit registers. Consult a basic architecture manual.
The and instruction modifies its destination operand; it's not a = b & c, it's a &= b.
That would be shrq $1, %rdi which is valid, and shrq %rdi is just an alias for it.
jne examines the zero flag, which shrq sets automatically when its result is zero.
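To make the first point concrete, here is a small sketch of my own (not from CSAPP): the compiler picks the 32-bit or 64-bit name of the same physical register depending on the operand size in the C code.

// Hypothetical example: same physical registers, different widths.
// For 32-bit operands GCC typically uses %edi/%eax, for 64-bit ones %rdi/%rax.
unsigned      inc32(unsigned x)      { return x + 1; } // e.g. leal 1(%rdi), %eax
unsigned long inc64(unsigned long x) { return x + 1; } // e.g. leaq 1(%rdi), %rax

Writing to a 32-bit name such as %eax also zeroes the upper 32 bits of %rax, which is why the mixed use of %edx and %rdx in the listing above is harmless.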
Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
// do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
// do something with arr[k]
}
???
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer, which is incremented by 1 each iteration, while in the second case you are computing an address from arr plus a successively larger offset. Maybe hardware has a special way of incrementing by 1, so the first method is faster? Or not? Just curious. Hope my question makes sense.
If the compiler is smart enough (and most compilers are), the performance of both loops should be roughly equal.
For example, I compiled the following code with GCC 5.1.0 and generated the assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
int sum = 0;
for(int i = 0; i < n; ++i)
{
sum += arr[i];
}
return sum;
}
int __attribute__ ((noinline)) compute2(int* arr, int n)
{
int sum = 0;
for(int * pa(arr), * pb(arr+n); pa != pb; ++pa)
{
sum += *pa;
}
return sum;
}
And the resulting assembly is:
compute1(int*, int):
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
leaq 4(%rdi,%rax,4), %rdx
xorl %eax, %eax
.L3:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdx, %rdi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
compute2(int*, int):
movslq %esi, %rsi
xorl %eax, %eax
leaq (%rdi,%rsi,4), %rdx
cmpq %rdx, %rdi
je .L10
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
.L10:
rep ret
main:
xorl %eax, %eax
ret
As you can see, the heaviest part (the loop) of both functions is identical:
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
But in more complex examples or with a different compiler the results might differ, so you should test and measure. Still, most compilers generate similar code for these two forms.
The full code sample: https://goo.gl/mpqSS0
This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.
Any reasonable compiler will generate identical code inside the loop for these two choices. I looked at the code generated for iterating over a std::vector, using a for-loop with an integer index or a for( auto i: vec ) style construct [std::vector internally holds two pointers for the begin and end of the stored values, just like your pa and pb]. Both gcc and clang generate identical code inside the loop itself [the exact details of the loop are subtly different between the compilers, but other than that, there's no difference]. The setup of the loop was subtly different, but unless you OFTEN run loops of fewer than 5 items [and if so, why do you worry?], the actual content of the loop is what matters, not the bit just before the loop.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, and processor make and model will make a difference to how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure the different alternatives and see what works best in your case.
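A minimal timing sketch along those lines (my own addition, not part of the original answer), assuming the compute1/compute2 functions shown above are linked in:

#include <chrono>
#include <cstdio>
#include <vector>

int compute1(int* arr, int n); // index-based loop, defined above
int compute2(int* arr, int n); // pointer-based loop, defined above

int main() {
    std::vector<int> arr(10000000, 1);
    volatile int sink = 0; // keeps the calls from being optimised away

    auto t0 = std::chrono::steady_clock::now();
    sink = sink + compute1(arr.data(), static_cast<int>(arr.size()));
    auto t1 = std::chrono::steady_clock::now();
    sink = sink + compute2(arr.data(), static_cast<int>(arr.size()));
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("index loop:   %lld us\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("pointer loop: %lld us\n",
                (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    return sink ? 0 : 1;
}

In a real measurement you would also repeat each call many times and interleave the order, since a single pass is dominated by memory traffic and noise.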
I'm writing (or at least trying to write) some high-performance C++ code. I've come across a part where I need to do a large number of integer comparisons, namely, checking whether a result is equal to zero.
Which is more efficient? That is, which requires fewer processor instructions?
if (i == 0) {
// do stuff
}
or
if (!i) {
// do stuff
}
I'm running it on an x86-64 architecture, if that makes any difference.
Let's look at the assembly (with no optimizations) generated for this code:
void foo(int& i)
{
if(!i)
i++;
}
void bar(int& i)
{
if(i == 0)
i++;
}
int main()
{
int i = 0;
foo(i);
bar(i);
}
foo(int&): # #foo(int&)
movq %rdi, -8(%rsp)
movq -8(%rsp), %rdi
cmpl $0, (%rdi)
jne .LBB0_2
movq -8(%rsp), %rax
movl (%rax), %ecx
addl $1, %ecx
movl %ecx, (%rax)
.LBB0_2:
ret
bar(int&): # #bar(int&)
movq %rdi, -8(%rsp)
movq -8(%rsp), %rdi
cmpl $0, (%rdi)
jne .LBB1_2
movq -8(%rsp), %rax
movl (%rax), %ecx
addl $1, %ecx
movl %ecx, (%rax)
.LBB1_2:
ret
main: # #main
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -8(%rbp), %rdi
movl $0, -4(%rbp)
movl $0, -8(%rbp)
callq foo(int&)
leaq -8(%rbp), %rdi
callq bar(int&)
movl -4(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
Bottom line:
The generated assembly is exactly identical (even without optimizations enabled), so it doesn't matter: choose the clearest, most readable syntax, which is probably if (i == 0) in your case.
In C++, you almost never need to care about such micro-optimizations; compilers/optimizers are very good at this game: trust them. If you don't, and if you have a performance bottleneck, profile and look at the assembly for your particular platform.
Note:
You can use godbolt.org to generate such assembly; it is a very handy tool.
You can also use the -S option on gcc to produce the assembly (other compilers have similar options).
Unless you have an insane compiler, they should compile identically. Having said that, for the sanity of future people looking at your code, only use i == 0 if i is a numeric type and !i if i is a bool type.
None of the better-known compilers will compile those to anything that differs enough to matter once you do what everyone must do before applying manual optimizations: measure.
Should one use dynamic memory allocation when one knows that a variable will not be needed before it goes out of scope?
For example in the following function:
void func(){
int i =56;
//do something with i, i is not needed past this point
for(int t = 0; t<1000000; t++){
//code
}
}
Say one only needs i for a small section of the function; is it worthwhile allocating i dynamically and deleting it, since it is not needed during the very long for loop?
As Borgleader said:
A) This is micro (and most probably premature) optimization, meaning don't worry about it. B) In this particular case, dynamically allocating i might even hurt performance. tl;dr: profile first, optimize later.
As an example, I compiled the following two programs to assembly (using the g++ -S flag with no optimisation enabled).
Creating i on the stack:
int main(void)
{
int i = 56;
i += 5;
for(int t = 0; t<1000; t++) {}
return 0;
}
Dynamically:
int main(void)
{
int* i = new int(56);
*i += 5;
delete i;
for(int t = 0; t<1000; t++) {}
return 0;
}
The first program compiled to:
movl $56, -8(%rbp) # Store 56 on stack (int i = 56)
addl $5, -8(%rbp) # Add 5 to i (i += 5)
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
And the second:
subq $16, %rsp # Allocate memory (new)
movl $4, %edi
call _Znwm
movl $56, (%rax) # Store 56 in *i
movq %rax, -16(%rbp)
movq -16(%rbp), %rax # Add 5
movl (%rax), %eax
leal 5(%rax), %edx
movq -16(%rbp), %rax
movl %edx, (%rax)
movq -16(%rbp), %rax # Free memory (delete)
movq %rax, %rdi
call _ZdlPv
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
In the above assembly output, you can see straight away that there is a significant difference in the number of instructions executed. If I compile the same programs with optimisation turned on, the first program produces:
xorl %eax, %eax # Equivalent to return 0;
The second produces:
movl $4, %edi
call _Znwm
movl $61, (%rax) # A smart compiler knows 56+5 = 61
movq %rax, %rdi
call _ZdlPv
xorl %eax, %eax
addq $8, %rsp
With optimisation on, the compiler becomes a pretty powerful tool for improving your code; in certain cases it can even detect that a program only returns 0 and get rid of all the unnecessary code. When you use dynamic memory as in the code above, the program still has to request and then free that memory, and the compiler can't optimise it out.
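If the goal is just to limit how long i is alive, a nested block scope does that without any heap traffic; here is a minimal sketch of my own (not from the original answer):

void func() {
    {
        int i = 56;
        // do something with i; it ceases to exist at the end of this block
    }
    for (int t = 0; t < 1000000; t++) {
        // code; the optimiser is free to reuse i's stack slot or register here
    }
}

With optimisation on, the stack version of i typically disappears entirely, exactly as in the first optimised listing above.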
Beating the dead horse here. A typical (and fast) way of doing integer powers in C is this classic:
int64_t ipow(int64_t base, int exp){
int64_t result = 1;
while(exp){
if(exp & 1)
result *= base;
exp >>= 1;
base *= base;
}
return result;
}
However I needed a compile time integer power so I went ahead and made a recursive implementation using constexpr:
constexpr int64_t ipow_(int base, int exp){
return exp > 1 ? ipow_(base, (exp>>1) + (exp&1)) * ipow_(base, exp>>1) : base;
}
constexpr int64_t ipow(int base, int exp){
return exp < 1 ? 1 : ipow_(base, exp);
}
The second function is only to handle exponents less than 1 in a predictable way. Passing exp<0 is an error in this case.
The recursive version is 4 times slower
I generate a vector of 10^6 random-valued bases and exponents in the range [0,15] and time both algorithms on the vector (after doing a non-timed run to try to remove any caching effects). Without optimization the recursive method is twice as fast as the loop, but with -O3 (GCC) the loop is 4 times faster than the recursive method.
My question to you guys is this: Can any one come up with a faster ipow() function that handles exponent and bases of 0 and can be used as a constexpr?
(Disclaimer: I don't need a faster ipow, I'm just interested to see what the smart people here can come up with).
A good optimizing compiler will transform tail-recursive functions to run as fast as imperative code. You can transform this function to be tail recursive with pumping. GCC 4.8.1 compiles this test program:
#include <cstdint>
constexpr int64_t ipow(int64_t base, int exp, int64_t result = 1) {
return exp < 1 ? result : ipow(base*base, exp/2, (exp % 2) ? result*base : result);
}
int64_t foo(int64_t base, int exp) {
return ipow(base, exp);
}
into a loop (See this at gcc.godbolt.org):
foo(long, int):
testl %esi, %esi
movl $1, %eax
jle .L4
.L3:
movq %rax, %rdx
imulq %rdi, %rdx
testb $1, %sil
cmovne %rdx, %rax
imulq %rdi, %rdi
sarl %esi
jne .L3
rep; ret
.L4:
rep; ret
vs. your while loop implementation:
ipow(long, int):
testl %esi, %esi
movl $1, %eax
je .L4
.L3:
movq %rax, %rdx
imulq %rdi, %rdx
testb $1, %sil
cmovne %rdx, %rax
imulq %rdi, %rdi
sarl %esi
jne .L3
rep; ret
.L4:
rep; ret
Instruction-by-instruction identical is good enough for me.
It seems that this is a standard problem with constexpr and template programming in C++. Due to the compile-time constraints, the constexpr version is slower than a normal version when executed at runtime, but overloading doesn't allow choosing the correct version. The standardization committee is working on this issue; see for example the following working document http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3583.pdf
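For what it's worth, the relaxed constexpr rules introduced in C++14 let you mark the fast iterative version itself as constexpr; a minimal sketch of my own, assuming a C++14 compiler:

#include <cstdint>

// Requires C++14: loops and mutable locals are allowed in constexpr functions.
constexpr int64_t ipow14(int64_t base, int exp) {
    int64_t result = 1;
    while (exp) {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}

static_assert(ipow14(3, 4) == 81, "evaluated at compile time");

At runtime this compiles to the same loop as the classic version, and at compile time it is still usable in constant expressions.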
After reading many of the comments on this question, there are a couple people (here and here) that suggest that this code:
int val = 5;
int r = (0 < val) - (val < 0); // this line here
will cause branching. Unfortunately, none of them give any justification or say why it would cause branching (tristopia suggests it requires a cmove-like instruction or predication, but doesn't really say why).
Are these people right that "a comparison used in an expression will not generate a branch" is actually a myth rather than fact (assuming you're not using some esoteric processor)? If so, can you give an example?
I would've thought there wouldn't be any branching (given that there's no logical "short circuiting"), and now I'm curious.
To simplify matters, consider just one part of the expression: val < 0. Essentially, this means “if val is negative, return 1, otherwise 0”; you could also write it like this:
val < 0 ? 1 : 0
How this is translated into processor instructions depends heavily on the compiler and the target processor. The easiest way to find out is to write a simple test function, like so:
int compute(int val) {
return val < 0 ? 1 : 0;
}
and review the assembler code that is generated by the compiler (e.g., with gcc -S -o - example.c). For my machine, it does it without branching. However, if I change it to return 5 instead of 1, there are branch instructions:
...
cmpl $0, -4(%rbp)
jns .L2
movl $5, %eax
jmp .L3
.L2:
movl $0, %eax
.L3:
...
So, “a comparison used in an expression will not generate a branch” is indeed a myth. (But “a comparison used in an expression will always generate a branch” isn’t true either.)
Addition in response to this extension/clarification:
I'm asking if there's any (sane) platform/compiler for which a branch is likely. MIPS/ARM/x86(_64)/etc. All I'm looking for is one case that demonstrates that this is a realistic possibility.
That depends on what you consider a “sane” platform. If the venerable 6502 CPU family is sane, I think there is no way to calculate val > 0 on it without branching. Most modern instruction sets, on the other hand, provide some type of set-on-X instruction.
(val < 0 can actually be computed without branching even on 6502, because it can be implemented as a bit shift.)
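For the record, this is what that shift trick looks like in C; a small sketch of my own, assuming a 32-bit int:

// Branch-free "is negative": move the sign bit down to bit 0.
// Assumes 32-bit int; portable code would shift by CHAR_BIT * sizeof(int) - 1.
int is_negative(int val) {
    return (unsigned)val >> 31;   // 1 if val < 0, otherwise 0
}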
Empiricism for the win:
int sign(int val) {
return (0 < val) - (val < 0);
}
compiled with optimisations. gcc (4.7.2) produces
sign:
.LFB0:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
setg %al
shrl $31, %edi
subl %edi, %eax
ret
.cfi_endproc
no branch. clang (3.2):
sign: # #sign
.cfi_startproc
# BB#0:
movl %edi, %ecx
shrl $31, %ecx
testl %edi, %edi
setg %al
movzbl %al, %eax
subl %ecx, %eax
ret
neither. (on x86_64, Core i5)
This is actually architecture-dependent. If there exists an instruction to set the value to 0/1 depending on the sign of another value, there will be no branching. If there's no such instruction, branching would be necessary.