I have a very long (in number of iterations) for loop, and I would like to make it possible to customize some of its parts. The code looks like the following:
void expensive_loop( void (*do_true)(int), void (*do_false)(int) ){
    for(int i=0; i<VeryLargeN; i++){
        int element=elements[i];
        // long computation that produces a boolean condition
        if (condition){
            do_true(element);
        }else{
            do_false(element);
        }
    }
}
Now, the problem is that every time do_true and do_false are called, there is call overhead (pushing and popping a stack frame) that ruins the high performance of the code.
To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. But that would make the code impossible to maintain.
So, how can the internal part of an iteration be customized while still maintaining high performance?
Note that the function accepts pointers to functions, so those get called through a pointer. The optimizer may inline those calls through the function pointers if the definitions of expensive_loop and those functions are available and the compiler inlining limits have not been breached.
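For illustration, here is a minimal single-file sketch (my own; the counter functions and names are hypothetical, not from the question). Because every definition is visible in one translation unit, an optimizer at -O2 can typically constant-propagate the function pointers passed from main and inline both calls:

// single_tu.cpp -- a hedged sketch; all names below are hypothetical.
static const int VeryLargeN = 100000;
static int elements[VeryLargeN];
static long trues, falses;

static void count_true(int element)  { trues  += element; }
static void count_false(int element) { falses += element; }

static void expensive_loop(void (*do_true)(int), void (*do_false)(int)) {
    for (int i = 0; i < VeryLargeN; i++) {
        int element = elements[i];
        bool condition = (element % 2 == 0); // stand-in for the long computation
        if (condition) do_true(element); else do_false(element);
    }
}

int main() {
    expensive_loop(count_true, count_false); // the pointers are compile-time constants here
    return 0;
}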
Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like standard algorithms do. This way the compiler may have more optimization opportunities. E.g.:
template<class DoTrue, class DoFalse>
void expensive_loop(DoTrue do_true, DoFalse do_false) {
// Original function body here.
}
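A call site could then pass lambdas, for example (a usage sketch of the template above; the counter variables are placeholders):

int main() {
    long trues = 0, falses = 0;
    // Each distinct pair of lambda types instantiates its own specialization,
    // so the compiler sees the bodies and can inline them into the loop.
    expensive_loop([&](int element) { trues += element; },
                   [&](int element) { falses += element; });
    return 0;
}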
There is -Winline compiler switch for g++:
-Winline
Warn if a function can not be inlined and it was declared as inline. Even with this option, the compiler will not warn about failures to inline functions declared in system headers.
The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.
It probably does not warn about a function not being inlined when it is called through a pointer.
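As an illustration (my own sketch, not from the answer), something like the following can provoke the warning; whether it actually fires depends on the compiler's inlining heuristics and optimization level:

// winline_demo.cpp -- compile with: g++ -O2 -Winline -c winline_demo.cpp
// The nested loop may push `big` past the inlining size limits, in which
// case -Winline reports that the inline request could not be honored.
inline int big(int x) {
    int acc = 0;
    for (int i = 0; i < x; ++i)
        for (int j = 0; j < i; ++j)
            acc += (i * j) ^ (acc >> 3);
    return acc;
}

int caller(int x) { return big(x) + big(x + 1); }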
The problem is that the function address (what is actually stored in do_true and do_false) is not resolved until link time, when there are not many opportunities for optimization.
If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can make your function a C++ template, so that the compiler knows exactly which functions you want to call at compile time.
struct function_one {
void operator()( int element ) {
}
};
extern int elements[];
extern bool condition();
template < typename DoTrue, typename DoFalse >
void expensive_loop(){
DoTrue do_true;
DoFalse do_false;
for(int i=0; i<50; i++){
int element=elements[i];
// long computation that produce a boolean condition
if (condition()){
do_true(element); // call DoTrue's operator()
}else{
do_false(element); // call DoFalse's operator()
}
}
}
int main( int argc, char* argv[] ) {
expensive_loop<function_one,function_one>();
return 0;
}
The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.
For the example shown, note how the functor's call operator is empty.
The compiler just strips away the function calls and leaves only the loop:
main:
push rbx
mov ebx, 50
.L2:
call condition()
sub ebx, 1
jne .L2
xor eax, eax
pop rbx
ret
See the example at https://godbolt.org/g/hV52Nn
Using function pointers as in your example may not inline the function calls. This is the assembler produced for main and expensive_loop in a program where expensive_loop
// File A.cpp
void foo( int arg );
void bar( int arg );
extern bool condition();
extern int elements[];
void expensive_loop( void (*do_true)(int), void (*do_false)(int)){
for(int i=0; i<50; i++){
int element=elements[i];
// long computation that produce a boolean condition
if (condition()){
do_true(element);
}else{
do_false(element);
}
}
}
int main( int argc, char* argv[] ) {
expensive_loop( foo, bar );
return 0;
}
and the functions passed by argument
// File B.cpp
#include <math.h>
int elements[50];
bool condition() {
return elements[0] == 1;
}
// Note: these definitions must match the declarations in A.cpp, and they must
// not be 'inline', since an inline function has to be defined in every
// translation unit that calls it.
void foo( int arg ) {
    elements[0] = arg % 3;
}
void bar( int arg ) {
    elements[1] = 1234 % (arg | 1); // avoid division by zero when arg == 0
}
are defined in different translation units.
0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
400620: 41 55 push %r13
400622: 49 89 fd mov %rdi,%r13
400625: 41 54 push %r12
400627: 49 89 f4 mov %rsi,%r12
40062a: 55 push %rbp
40062b: 53 push %rbx
40062c: bb 60 10 60 00 mov $0x601060,%ebx
400631: 48 83 ec 08 sub $0x8,%rsp
400635: eb 19 jmp 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
400637: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40063e: 00 00
400640: 48 83 c3 04 add $0x4,%rbx
400644: 41 ff d5 callq *%r13
400647: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx
40064e: 74 1d je 40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
400650: 8b 2b mov (%rbx),%ebp
400652: e8 79 ff ff ff callq 4005d0 <condition()>
400657: 84 c0 test %al,%al
400659: 89 ef mov %ebp,%edi
40065b: 75 e3 jne 400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
40065d: 48 83 c3 04 add $0x4,%rbx
400661: 41 ff d4 callq *%r12
400664: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx
40066b: 75 e3 jne 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
40066d: 48 83 c4 08 add $0x8,%rsp
400671: 5b pop %rbx
400672: 5d pop %rbp
400673: 41 5c pop %r12
400675: 41 5d pop %r13
400677: c3 retq
400678: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40067f: 00
You can see that the calls are still performed even at the -O3 optimization level:
400644: 41 ff d5 callq *%r13
Assembly included. This weekend I tried to get my own small library running without any C libs, and the thread-local stuff is giving me problems. Below you can see I created a struct called Try1 (because it's my first attempt!). If I set the thread-local variable and use it, the code seems to execute fine. If I call a const method on Try1 through a global variable, it also seems to run fine. But if I do both, it's not fine: it segfaults, despite me being able to access members and to run the function on a global variable. The code will print Hello and Hello2 but not Hello3.
I suspect the problem is the address of the variable. I tried using an if statement to print the first hello: if ((s64)&t1 > (s64)buf+1024*16). It was true, so the pointer isn't where I thought it was. It also isn't at -8 as gdb suggests (it's a signed compare, and I tried 0 instead of buf).
The assembly is below the C++ code. The first line shown is the first call to write.
//test.cpp
//clang++ or g++ -std=c++20 -g -fno-rtti -fno-exceptions -fno-stack-protector -fno-asynchronous-unwind-tables -static -nostdlib test.cpp -march=native && ./a.out
#include <immintrin.h>
#include <stddef.h> // size_t
#include <stdint.h> // int64_t
typedef unsigned long long int u64;
typedef long ssize_t; // normally from <sys/types.h>; spelled out here for the -nostdlib build
ssize_t my_write(int fd, const void *buf, size_t size) {
register int64_t rax __asm__ ("rax") = 1;
register int rdi __asm__ ("rdi") = fd;
register const void *rsi __asm__ ("rsi") = buf;
register size_t rdx __asm__ ("rdx") = size;
__asm__ __volatile__ (
"syscall"
: "+r" (rax)
: "r" (rdi), "r" (rsi), "r" (rdx)
: "cc", "rcx", "r11", "memory"
);
return rax;
}
void my_exit(int exit_status) {
register int64_t rax __asm__ ("rax") = 60;
register int rdi __asm__ ("rdi") = exit_status;
__asm__ __volatile__ (
"syscall"
: "+r" (rax)
: "r" (rdi)
: "cc", "rcx", "r11", "memory"
);
}
struct Try1
{
u64 val;
constexpr Try1() { val=0; }
u64 Get() const { return val; }
};
static char buf[1024*8]; // originally mmap, but let's reduce the code
static __thread u64 sanity_check;
static __thread Try1 t1;
static Try1 global;
extern "C"
int _start()
{
auto tls_size = 4096*2;
auto originalFS = _readfsbase_u64();
_writefsbase_u64((u64)(buf+4096));
global.val = 1;
global.Get(); //Executes fine
sanity_check=6;
t1.val = 7;
my_write(1, "Hello\n", sanity_check);
my_write(1, "Hello2\n", t1.val); //Still fine
my_write(1, "Hello3\n", t1.Get()); //crash! :/
my_exit(0);
return 0;
}
Asm:
4010b4: e8 47 ff ff ff call 401000 <_Z8my_writeiPKvm>
4010b9: 64 48 8b 04 25 f8 ff mov rax,QWORD PTR fs:0xfffffffffffffff8
4010c0: ff ff
4010c2: 48 89 c2 mov rdx,rax
4010c5: 48 8d 05 3b 0f 00 00 lea rax,[rip+0xf3b] # 402007 <_ZNK4Try13GetEv+0xeef>
4010cc: 48 89 c6 mov rsi,rax
4010cf: bf 01 00 00 00 mov edi,0x1
4010d4: e8 27 ff ff ff call 401000 <_Z8my_writeiPKvm>
4010d9: 64 48 8b 04 25 00 00 mov rax,QWORD PTR fs:0x0
4010e0: 00 00
4010e2: 48 05 f8 ff ff ff add rax,0xfffffffffffffff8
4010e8: 48 89 c7 mov rdi,rax
4010eb: e8 28 00 00 00 call 401118 <_ZNK4Try13GetEv>
4010f0: 48 89 c2 mov rdx,rax
4010f3: 48 8d 05 15 0f 00 00 lea rax,[rip+0xf15] # 40200f <_ZNK4Try13GetEv+0xef7>
4010fa: 48 89 c6 mov rsi,rax
4010fd: bf 01 00 00 00 mov edi,0x1
401102: e8 f9 fe ff ff call 401000 <_Z8my_writeiPKvm>
401107: bf 00 00 00 00 mov edi,0x0
40110c: e8 12 ff ff ff call 401023 <_Z7my_exiti>
401111: b8 00 00 00 00 mov eax,0x0
401116: c9 leave
401117: c3 ret
The ABI requires that fs:0 contains a pointer with the absolute address of the thread-local storage block, i.e. the value of fsbase. The compiler needs access to this address to evaluate expressions like &t1, which here it needs in order to compute the this pointer to be passed to Try1::Get().
It's tricky to recover this address on x86-64, since the TLS base address isn't in a convenient general register, but in the hidden fsbase. It isn't feasible to execute rdfsbase every time we need it (expensive instruction that may not be available) nor worse yet to call arch_prctl, so the easiest solution is to ensure that it's available in memory at a known address. See this past answer and sections 3.4.2 and 3.4.6 of "ELF Handling for Thread-Local Storage", which is incorporated by reference into the x86-64 ABI.
In your disassembly at 0x4010d9, you can see the compiler trying to load from address fs:0x0 into rax, then adding -8 (the offset of t1 in the TLS block) and moving the result into rdi as the hidden this argument to Try1::Get(). Obviously since you have zeros at fs:0 instead, the resulting pointer is invalid and you get a crash when Try1::Get() reads val, which is really this->val.
I would write something like
void *fsbase = buf+4096;
_writefsbase_u64((u64)fsbase);
*(void **)fsbase = fsbase;
(Or memcpy(fsbase, &fsbase, sizeof(void *)) might be more compliant with strict aliasing.)
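Slotted into the question's _start, the fix would look something like this (a sketch under the same assumptions as the original code):

extern "C"
int _start()
{
    // Reserve the second page of buf for TLS, and make fs:0 point to itself,
    // as the ABI requires (the TCB's first field holds the TLS block address).
    void *fsbase = buf + 4096;
    _writefsbase_u64((u64)fsbase);
    *(void **)fsbase = fsbase;

    sanity_check = 6;
    t1.val = 7;
    my_write(1, "Hello3\n", t1.Get()); // no longer crashes: &t1 can now be computed
    my_exit(0);
    return 0;
}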
I am trying to write code that is as efficient as possible, and I encountered the following situation:
int foo(int a, int b, int c)
{
return (a + b) % c;
}
All good! But what if I want to check whether the result of the expression is different from a constant, let's say myConst? Let's say I can afford a temporary variable.
Which of the following methods is the fastest:
int foo(int a, int b, int c)
{
return (((a + b) % c) != myConst) ? (a + b) % c : myException;
}
or
int foo(int a, int b, int c)
{
int tmp = (a + b) % c;
return (tmp != myConst) ? tmp : myException;
}
I can't decide. Where is the 'line' beyond which recalculation becomes more expensive than allocating and deallocating a temporary variable, or the other way around?
Don't worry about it, write concise code and leave micro-optimizations to the compiler.
In your example, writing the same calculation twice is error prone, so do not do it. In your specific example, the compiler is more than likely to avoid creating a temporary on the stack at all!
Your example can (and on my compiler does) produce the following assembly (I have replaced myConst with constexpr 42 and myException with 0):
foo(int, int, int):
leal (%rdi,%rsi), %eax # this adds a and b, puts result to eax
movl %edx, %ecx # loads c
cltd
idivl %ecx # performs division, puts result into edx
movl $0, %eax #, prepares to return exception value
cmpl $42, %edx #, compares result of division with magic const
cmovne %edx, %eax # overwrites pessimized exception if all is cool
ret
As you see, there is no temporary anywhere in sight!
Use the latter.
You're not computing the same value twice.
The code is clearer.
Creating local variables on the stack doesn't take any significant amount of time.
Check the assembler code this generates for both versions. You most likely want the highest optimization settings for your compiler.
You may very well find that the compiler itself can figure out that the intermediate value is used twice, but only inside the function, so it is safe to keep in a register.
To add to what has already been posted, ease of debugging is at least as important as code efficiency (if there is any effect on code efficiency at all which, as others have posted, is unlikely with optimization on).
Go with the easiest to follow, test and debug.
Use a temp var.
If more developers used simpler, non-compound expressions and more temp vars, there would be far fewer 'Help - I cannot debug my code!' posts to SO.
Hard-coded values
The following only applies to hard-coded values (even if they aren't const or constexpr).
In the following example on MSVC 2015, the calls were optimized away completely and replaced with just mov edx,1 (the computed result in this example):
#include <iostream>
#include <exception>
int myConst{4};
int myException{2};
int foo1(int a,int b,int c)
{
return (((a + b) % c) != myConst) ? (a + b) % c : myException;
}
int foo2(int a,int b,int c)
{
int tmp = (a + b) % c;
return (tmp != myConst) ? tmp : myException;
}
int main()
{
00007FF71F0E1000 48 83 EC 28 sub rsp,28h
auto test1{foo1(5,2,3)};
auto test2{foo2(5,2,3)};
std::cout << test1 <<'\n';
00007FF71F0E1004 48 8B 0D 75 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF71F0E3080h)]
00007FF71F0E100B BA 01 00 00 00 mov edx,1
00007FF71F0E1010 FF 15 72 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF71F0E3088h)]
00007FF71F0E1016 48 8B C8 mov rcx,rax
00007FF71F0E1019 E8 B2 00 00 00 call std::operator<<<std::char_traits<char> > (07FF71F0E10D0h)
std::cout << test2 <<'\n';
00007FF71F0E101E 48 8B 0D 5B 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF71F0E3080h)]
00007FF71F0E1025 BA 01 00 00 00 mov edx,1
00007FF71F0E102A FF 15 58 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF71F0E3088h)]
00007FF71F0E1030 48 8B C8 mov rcx,rax
00007FF71F0E1033 E8 98 00 00 00 call std::operator<<<std::char_traits<char> > (07FF71F0E10D0h)
return 0;
00007FF71F0E1038 33 C0 xor eax,eax
}
00007FF71F0E103A 48 83 C4 28 add rsp,28h
00007FF71F0E103E C3 ret
Others have pointed out that the optimization will not happen if the values are passed in as variables, or if the functions live in separate files; but it seems that even when the code is in separate compilation units, the optimization is still done and we get no instructions for these functions:
#include <iostream>
#include <exception>
#include "Header.h"
int main()
{
00007FF667BF1000 48 83 EC 28 sub rsp,28h
int var1{5},var2{2},var3{3};
auto test1{foo1(var1,var2,var3)};
auto test2{foo2(var1,var2,var3)};
std::cout << test1 <<'\n';
00007FF667BF1004 48 8B 0D 75 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF667BF3080h)]
00007FF667BF100B BA 01 00 00 00 mov edx,1
00007FF667BF1010 FF 15 72 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF667BF3088h)]
00007FF667BF1016 48 8B C8 mov rcx,rax
00007FF667BF1019 E8 B2 00 00 00 call std::operator<<<std::char_traits<char> > (07FF667BF10D0h)
std::cout << test2 <<'\n';
00007FF667BF101E 48 8B 0D 5B 20 00 00 mov rcx,qword ptr [__imp_std::cout (07FF667BF3080h)]
00007FF667BF1025 BA 01 00 00 00 mov edx,1
00007FF667BF102A FF 15 58 20 00 00 call qword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF667BF3088h)]
00007FF667BF1030 48 8B C8 mov rcx,rax
00007FF667BF1033 E8 98 00 00 00 call std::operator<<<std::char_traits<char> > (07FF667BF10D0h)
return 0;
00007FF667BF1038 33 C0 xor eax,eax
}
00007FF667BF103A 48 83 C4 28 add rsp,28h
00007FF667BF103E C3 ret
I have been learning C++ for the past couple of months. I know that with functions you first declare parameters like so:
int myFunc(int funcVar);
and then you can pass in an integer variable to that function like so:
int x = 5;
myFunc(x);
When passing an argument to a function I would usually think of it like assigning and copying the value of x into the parameter of myFunc, which in C++ would look like this:
funcVar = x;
However, I noticed that when declaring functions which take reference (or pointer) parameters:
int myFunc(int & funcVar);
that I can either pass in the variable x to myFunc:
myFunc(x);
which would look like (in my mind):
&funcVar = x;
or you can pass in an actual reference as the argument
int & rX = x;
myFunc(rX);
and the function works as well, which with my thinking would look like this statement in C++:
int & funcVar = rX;
which would not make sense, assigning a reference to a reference. My question is then: how does the compiler actually load arguments into a function? Should I not think of it as assigning the value of the variable to the parameter of the function?
When you call a function, each parameter of the function is initialized (not assigned). The rules for this are the same as the rules for any other copy-initialization. So if you have
int myFunc(int funcVar);
int x = 5;
myFunc(x);
then funcVar is initialized as though by a statement like this:
int funcVar = x;
and if you have
int myFunc(int & funcVar);
myFunc(x);
int & rX = x;
myFunc(rX);
then funcVar is initialized (and not assigned) as though by statements like this:
int & funcVar = x;
int & funcVar = rX;
The initialization of a reference binds it to the object or function denoted by the initializer. The second initialization does make sense: the expression rX denotes the object x, because rX is a reference bound to x. Therefore, initializing a reference with rX has the same effect as initializing a reference with x.
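A short self-contained illustration of that binding (my own sketch, not from the answer):

#include <iostream>

int main() {
    int x = 5;
    int& rX = x;        // rX binds to x
    int& funcVar = rX;  // funcVar binds to the object rX denotes, i.e. x itself
    funcVar = 7;        // assignment goes through the reference
    std::cout << x;     // prints 7
}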
Let us write some simple code and disassemble it.
int by_value(int x) { return x; }
int by_reference(int &x) { return x; }
int by_pointer(int *x) { return *x; }
int main()
{
int x = 1;
by_value(x);
by_reference(x);
by_pointer(&x);
return 0;
}
$ g++ -g -O0 a.cpp ; objdump -dS a.out
In my environment (x86_64, g++ (SUSE Linux) 4.8.3 20140627), the result is as follows.
(The full listing is here: http://ideone.com/Z5G8yz)
00000000004005dd <_Z8by_valuei>:
int by_value(int x) { return x; }
4005dd: 55 push %rbp
4005de: 48 89 e5 mov %rsp,%rbp
4005e1: 89 7d fc mov %edi,-0x4(%rbp)
4005e4: 8b 45 fc mov -0x4(%rbp),%eax
4005e7: 5d pop %rbp
4005e8: c3 retq
00000000004005e9 <_Z12by_referenceRi>:
int by_reference(int &x) { return x; }
4005e9: 55 push %rbp
4005ea: 48 89 e5 mov %rsp,%rbp
4005ed: 48 89 7d f8 mov %rdi,-0x8(%rbp)
4005f1: 48 8b 45 f8 mov -0x8(%rbp),%rax
4005f5: 8b 00 mov (%rax),%eax
4005f7: 5d pop %rbp
4005f8: c3 retq
00000000004005f9 <_Z10by_pointerPi>:
int by_pointer(int *x) { return *x; }
4005f9: 55 push %rbp
4005fa: 48 89 e5 mov %rsp,%rbp
4005fd: 48 89 7d f8 mov %rdi,-0x8(%rbp)
400601: 48 8b 45 f8 mov -0x8(%rbp),%rax
400605: 8b 00 mov (%rax),%eax
400607: 5d pop %rbp
400608: c3 retq
0000000000400609 <main>:
int main()
{
400609: 55 push %rbp
40060a: 48 89 e5 mov %rsp,%rbp
40060d: 48 83 ec 10 sub $0x10,%rsp
int x = 1;
400611: c7 45 fc 01 00 00 00 movl $0x1,-0x4(%rbp)
by_value(x);
400618: 8b 45 fc mov -0x4(%rbp),%eax
40061b: 89 c7 mov %eax,%edi
40061d: e8 bb ff ff ff callq 4005dd <_Z8by_valuei>
by_reference(x);
400622: 48 8d 45 fc lea -0x4(%rbp),%rax
400626: 48 89 c7 mov %rax,%rdi
400629: e8 bb ff ff ff callq 4005e9 <_Z12by_referenceRi>
by_pointer(&x);
40062e: 48 8d 45 fc lea -0x4(%rbp),%rax
400632: 48 89 c7 mov %rax,%rdi
400635: e8 bf ff ff ff callq 4005f9 <_Z10by_pointerPi>
return 0;
40063a: b8 00 00 00 00 mov $0x0,%eax
}
by_reference(x) is the same as by_pointer(&x)!
It makes perfect sense to assign a reference to another reference (when first defining it, i.e. at initialization), and that's what actually happens. A reference is just an alias, so when you initialize a reference from another reference you are just saying that the new one aliases the same object as the one you initialized it from. Example:
int x = 42;
int& rx = x;
int& ry = rx;
++ry;
std::cout << x; // displays 43
Live on Coliru
I have a vector:
vector<Body*> Bodies;
And it contains pointers to Body objects that I have defined.
I also have an unsigned int const that contains the number of Body objects I wish to have in Bodies.
unsigned int const NumParticles = 1000;
I have populated Bodies with NumParticles Body objects.
Now if I wish to loop over them, for example invoking each Body's Update() function, I have two choices for what I can do:
First:
for (unsigned int i = 0; i < NumParticles; i++)
{
Bodies.at(i)->Update();
}
Or second:
for (unsigned int i = 0; i < Bodies.size(); i++)
{
Bodies.at(i)->Update();
}
There are pros and cons to each. I would like to know which one (if either) would be the better practice, in terms of safety, readability and convention.
I expect that, given that the compiler (at least in this case) can inline all the relevant std::vector code, the generated code will be identical [aside from 1000 being a true constant literal in the machine code, while Bodies.size() will be a "variable" value].
Short summary of findings:
The compiler doesn't call a function for the size() of the vector on every iteration; it computes that at the beginning of the loop and uses it as a "constant value".
Actual code IN the loop is identical, only the preparation of the loop is different.
As always: if performance is highly important, measure on your system with your data and your compiler. Otherwise, write the code that makes the most sense for your design (I prefer using for(auto i : vec), as that is easy and straightforward [and works for all the containers]).
Supporting evidence:
After fetching coffee, I wrote this code:
#include <cstdlib> // rand
#include <vector>

class X
{
public:
void Update() { x++; }
operator int() { return x; }
private:
int x = rand();
};
extern std::vector<X*> vec;
const size_t vec_size = 1000;
void Process1()
{
for(auto i : vec)
{
i->Update();
}
}
void Process2()
{
for(size_t i = 0; i < vec.size(); i++)
{
vec[i]->Update();
}
}
void Process3()
{
for(size_t i = 0; i < vec_size; i++)
{
vec[i]->Update();
}
}
(along with a main function that fills the array and calls Process1(), Process2() and Process3(); main is in a separate file to keep the compiler from inlining everything and making it hard to tell what is what)
Here's the code generated by g++ 4.9.2:
0000000000401940 <_Z8Process1v>:
401940: 48 8b 0d a1 18 20 00 mov 0x2018a1(%rip),%rcx # 6031e8 <vec+0x8>
401947: 48 8b 05 92 18 20 00 mov 0x201892(%rip),%rax # 6031e0 <vec>
40194e: 48 39 c1 cmp %rax,%rcx
401951: 74 14 je 401967 <_Z8Process1v+0x27>
401953: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401958: 48 8b 10 mov (%rax),%rdx
40195b: 48 83 c0 08 add $0x8,%rax
40195f: 83 02 01 addl $0x1,(%rdx)
401962: 48 39 c1 cmp %rax,%rcx
401965: 75 f1 jne 401958 <_Z8Process1v+0x18>
401967: f3 c3 repz retq
0000000000401970 <_Z8Process2v>:
401970: 48 8b 35 69 18 20 00 mov 0x201869(%rip),%rsi # 6031e0 <vec>
401977: 48 8b 0d 6a 18 20 00 mov 0x20186a(%rip),%rcx # 6031e8 <vec+0x8>
40197e: 31 c0 xor %eax,%eax
401980: 48 29 f1 sub %rsi,%rcx
401983: 48 c1 f9 03 sar $0x3,%rcx
401987: 48 85 c9 test %rcx,%rcx
40198a: 74 14 je 4019a0 <_Z8Process2v+0x30>
40198c: 0f 1f 40 00 nopl 0x0(%rax)
401990: 48 8b 14 c6 mov (%rsi,%rax,8),%rdx
401994: 48 83 c0 01 add $0x1,%rax
401998: 83 02 01 addl $0x1,(%rdx)
40199b: 48 39 c8 cmp %rcx,%rax
40199e: 75 f0 jne 401990 <_Z8Process2v+0x20>
4019a0: f3 c3 repz retq
00000000004019b0 <_Z8Process3v>:
4019b0: 48 8b 05 29 18 20 00 mov 0x201829(%rip),%rax # 6031e0 <vec>
4019b7: 48 8d 88 40 1f 00 00 lea 0x1f40(%rax),%rcx
4019be: 66 90 xchg %ax,%ax
4019c0: 48 8b 10 mov (%rax),%rdx
4019c3: 48 83 c0 08 add $0x8,%rax
4019c7: 83 02 01 addl $0x1,(%rdx)
4019ca: 48 39 c8 cmp %rcx,%rax
4019cd: 75 f1 jne 4019c0 <_Z8Process3v+0x10>
4019cf: f3 c3 repz retq
Whilst the assembly code looks slightly different for each of those cases, in practice I'd say you'd be hard pushed to measure the difference between those loops; in fact, a run of perf on the code shows "the same time for all loops" [this is with 100000 elements and 100 calls to Process1, Process2 and Process3 in a loop; otherwise the time was dominated by new X in main]:
31.29% a.out a.out [.] Process1
31.28% a.out a.out [.] Process3
31.13% a.out a.out [.] Process2
Unless you think 1/10th of a percent is significant (it may be for something that takes a week to run, but this is only a few tenths of a second [0.163 seconds on my machine], and probably more measurement error than anything else), the difference is noise; the shortest time is actually the variant that in theory should be slowest, Process2, using vec.size(). I did another run with a higher loop count, and now the measurements for the loops are within 0.01% of each other; in other words, identical in time spent.
Of course, if you look carefully, you will see that the actual loop content for all three variants is essentially identical, except for the early part of Process3, which is simpler because the compiler knows that we will do at least one iteration; Process1 and Process2 have to check whether the vector is empty before the first iteration. This would make a difference for VERY short vector lengths.
I would vote for a range-based for:
for (auto* body : Bodies)
{
body->Update();
}
NumParticles is not a property of the vector; it is some constant external to the vector. I would prefer to use the vector's own size() property. That makes the code safer and clearer to the reader.
Usually, using some constant instead of size() signals to the reader that, in general, the constant can be unequal to the size.
Thus if you want to tell the reader that you are going to process all elements of the vector, it is better to use size(). Otherwise use the constant.
Of course there are exceptions to this implicit rule, when the emphasis is put on the constant. In that case it is better to use the constant. But it depends on the context.
I would suggest using the .size() function instead of defining a new constant.
Why?
Safety: since .size() does not throw any exceptions, it is perfectly safe to use.
Readability: IMHO, Bodies.size() conveys the size of the vector Bodies more clearly than NumParticles.
Convention: according to convention too, it is better to use .size(), as it is a property of the vector, instead of the variable NumParticles.
Performance: .size() is a constant-complexity member function, so there is no significant performance difference between using a const int and .size().
I prefer this form:
for (auto const& it : Bodies)
{
it->Update();
}
Is it possible to take the address of a function that would be found through ADL?
For example:
template<class T>
void (*get_swap())(T &, T &)
{
return & _________; // how do I take the address of T's swap() function?
}
int main()
{
typedef some_type T;
get_swap<T>();
}
Honestly, I don't know, but I tend towards saying that this is not possible.
Depending on what you want to achieve, I can suggest a workaround. More precisely, if you just need the address of a function that has the same semantics as swap called through ADL, then you can use this:
template <typename T>
void (*get_swap())(T&, T&) {
return [](T& x, T& y) { return swap(x, y); };
}
For instance, the following code:
#include <iostream>
#include <utility> // std::swap

namespace a {
struct b {
int i;
};
void swap(b& x, b& y) {
std::swap(x.i, y.i);
}
}
int main() {
auto f0 = (void (*)(a::b&, a::b&)) a::swap;
auto f1 = get_swap<a::b>();
std::cout << std::hex;
std::cout << (unsigned long long) f0 << '\n';
std::cout << (unsigned long long) f1 << '\n';
}
compiled with gcc 4.8.1 (-std=c++11 -O3) on my machine gave:
4008a0
4008b0
The relevant assembly code (objdump -dSC a.out) is
00000000004008a0 <a::swap(a::b&, a::b&)>:
4008a0: 8b 07 mov (%rdi),%eax
4008a2: 8b 16 mov (%rsi),%edx
4008a4: 89 17 mov %edx,(%rdi)
4008a6: 89 06 mov %eax,(%rsi)
4008a8: c3 retq
4008a9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
00000000004008b0 <void (*get_swap<a::b>())(a::b&, a::b&)::{lambda(a::b&, a::b&)#1}::_FUN(a::b&, a::b&)>:
4008b0: 8b 07 mov (%rdi),%eax
4008b2: 8b 16 mov (%rsi),%edx
4008b4: 89 17 mov %edx,(%rdi)
4008b6: 89 06 mov %eax,(%rsi)
4008b8: c3 retq
4008b9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
As one can see, the functions pointed to by f0 and f1 (located at 0x4008a0 and 0x4008b0, respectively) are binary identical. The same holds when compiled with clang 3.3.
If the linker can do identical COMDAT folding (ICF), I guess, we can even get f0 == f1. (For more on ICF see this post.)
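As an aside (my own note, not from the original answer): with the gold or lld linkers, ICF can be requested explicitly, e.g.:

$ g++ -std=c++11 -O3 -ffunction-sections main.cpp -fuse-ld=gold -Wl,--icf=all

Be aware that folding distinct functions whose addresses are taken is technically nonconforming, since C++ requires distinct functions to have distinct addresses.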