I need to optimize a program as much as possible. I came across this issue: I have a one-dimensional array which represents a texture as pixel data, and I need to manipulate that data. The array is indexed via the following expression:
(y * width) + x
to address x,y coordinates. Now the question is: which of these is the most optimized? I have considered the following two possibilities:
Inline:
inline int Coords(int x, int y) { return (y * width) + x; }
Macro:
#define COORDS(X,Y) ((Y)*width)+(X)
Which one is the best practice to use here? Or is there a way to get an even more optimized variant that I don't know of?
I wrote a little test program to see what the difference between the two approaches would be. Here it is:
#include <cstdint>
#include <algorithm>
#include <iterator>
#include <iostream>

using namespace std;

static constexpr int width = 100;

inline int Coords(int x, int y) { return (y * width) + x; }

#define COORDS(X,Y) ((Y)*width)+(X)

void fill1(uint8_t* bytes, int height)
{
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            bytes[Coords(x,y)] = 0;
        }
    }
}

void fill2(uint8_t* bytes, int height)
{
    for (int x = 0; x < width; ++x) {
        for (int y = 0; y < height; ++y) {
            bytes[COORDS(x,y)] = 0;
        }
    }
}

auto main() -> int
{
    uint8_t buf1[100 * 100];
    uint8_t buf2[100 * 100];

    fill1(buf1, 100);
    fill2(buf2, 100);

    // these are here to prevent the compiler from optimising away all the above code
    copy(begin(buf1), end(buf1), ostream_iterator<char>(cout));
    copy(begin(buf2), end(buf2), ostream_iterator<char>(cout));

    return 0;
}
I compiled it like this:
c++ -S -o intent.s -std=c++1y -O3 intent.cpp
and then looked at the generated assembly to see what the compiler had done.
As expected, the compiler completely ignores all attempts by the programmer to optimise, and instead looks solely at the expressed intent, side effects and possible aliasing. It then emits exactly the same code for both functions (which are, of course, inlined).
Relevant parts of the assembly:
.globl _main
.align 4, 0x90
_main: ## #main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp16:
.cfi_def_cfa_offset 16
Ltmp17:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp18:
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $20024, %rsp ## imm = 0x4E38
Ltmp19:
.cfi_offset %rbx, -56
Ltmp20:
.cfi_offset %r12, -48
Ltmp21:
.cfi_offset %r13, -40
Ltmp22:
.cfi_offset %r14, -32
Ltmp23:
.cfi_offset %r15, -24
movq ___stack_chk_guard@GOTPCREL(%rip), %r15
movq (%r15), %r15
movq %r15, -48(%rbp)
xorl %eax, %eax
xorl %ecx, %ecx
.align 4, 0x90
LBB2_1: ## %.lr.ph.us.i
## =>This Loop Header: Depth=1
## Child Loop BB2_2 Depth 2
leaq -10048(%rbp,%rcx), %rdx
movl $400, %esi ## imm = 0x190
.align 4, 0x90
LBB2_2: ## Parent Loop BB2_1 Depth=1
## => This Inner Loop Header: Depth=2
movb $0, -400(%rdx,%rsi)
movb $0, -300(%rdx,%rsi)
movb $0, -200(%rdx,%rsi)
movb $0, -100(%rdx,%rsi)
movb $0, (%rdx,%rsi)
addq $500, %rsi ## imm = 0x1F4
cmpq $10400, %rsi ## imm = 0x28A0
jne LBB2_2
## BB#3: ## in Loop: Header=BB2_1 Depth=1
incq %rcx
cmpq $100, %rcx
jne LBB2_1
## BB#4:
xorl %r13d, %r13d
.align 4, 0x90
LBB2_5: ## %.lr.ph.us.i10
## =>This Loop Header: Depth=1
## Child Loop BB2_6 Depth 2
leaq -20048(%rbp,%rax), %rcx
movl $400, %edx ## imm = 0x190
.align 4, 0x90
LBB2_6: ## Parent Loop BB2_5 Depth=1
## => This Inner Loop Header: Depth=2
movb $0, -400(%rcx,%rdx)
movb $0, -300(%rcx,%rdx)
movb $0, -200(%rcx,%rdx)
movb $0, -100(%rcx,%rdx)
movb $0, (%rcx,%rdx)
addq $500, %rdx ## imm = 0x1F4
cmpq $10400, %rdx ## imm = 0x28A0
jne LBB2_6
## BB#7: ## in Loop: Header=BB2_5 Depth=1
incq %rax
cmpq $100, %rax
jne LBB2_5
## BB#8:
movq __ZNSt3__14coutE@GOTPCREL(%rip), %r14
leaq -20049(%rbp), %r12
xorl %ebx, %ebx
.align 4, 0x90
LBB2_9: ## %_ZNSt3__116ostream_iteratorIccNS_11char_traitsIcEEEaSERKc.exit.us.i.i13
## =>This Inner Loop Header: Depth=1
movb -10048(%rbp,%r13), %al
movb %al, -20049(%rbp)
movl $1, %edx
movq %r14, %rdi
movq %r12, %rsi
callq __ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m
incq %r13
cmpq $10000, %r13 ## imm = 0x2710
jne LBB2_9
## BB#10:
movq __ZNSt3__14coutE@GOTPCREL(%rip), %r14
leaq -20049(%rbp), %r12
.align 4, 0x90
LBB2_11: ## %_ZNSt3__116ostream_iteratorIccNS_11char_traitsIcEEEaSERKc.exit.us.i.i
## =>This Inner Loop Header: Depth=1
movb -20048(%rbp,%rbx), %al
movb %al, -20049(%rbp)
movl $1, %edx
movq %r14, %rdi
movq %r12, %rsi
callq __ZNSt3__124__put_character_sequenceIcNS_11char_traitsIcEEEERNS_13basic_ostreamIT_T0_EES7_PKS4_m
incq %rbx
cmpq $10000, %rbx ## imm = 0x2710
jne LBB2_11
## BB#12: ## %_ZNSt3__14copyIPhNS_16ostream_iteratorIccNS_11char_traitsIcEEEEEET0_T_S7_S6_.exit
cmpq -48(%rbp), %r15
jne LBB2_14
## BB#13: ## %_ZNSt3__14copyIPhNS_16ostream_iteratorIccNS_11char_traitsIcEEEEEET0_T_S7_S6_.exit
xorl %eax, %eax
addq $20024, %rsp ## imm = 0x4E38
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
Note that without the two calls to copy(..., ostream_iterator...), the compiler surmised that the total effect of the program was nothing, and refused to emit any code at all other than to return 0 from main().
Moral of the story: stop trying to do the compiler's job. Get on with yours.
Your job is to express intent as elegantly as you can. That's all.
Inline function, for two reasons:
it's less prone to bugs,
it lets the compiler decide whether to inline or not, so you don't have to waste time worrying about such trivial things.
First job: fix the bugs in the macro. As written it lacks outer parentheses, so an expression such as COORDS(x,y) * 2 silently expands to ((y)*width) + ((x) * 2), which is not what was intended.
If you're that concerned, implement both ways using a compiler directive and profile the results.
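For example, one build switch can flip between the two implementations without touching the call sites (a sketch; USE_COORDS_MACRO is a made-up flag name):

#ifdef USE_COORDS_MACRO
    #define COORDS(X, Y) (((Y) * width) + (X))
#else
    inline int COORDS(int x, int y) { return (y * width) + x; }
#endif

Build one binary with -DUSE_COORDS_MACRO and one without, then profile both and let the numbers decide.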
Change inline int Coords(int x, int y) to inline int Coords(const int x, const int y) so that, if the macro version does turn out quicker, the inline build variant will fail to compile if the macro is ever refactored to modify its arguments.
My hunch is that the function will be no slower than the macro in a good optimised build. And a code base without macros is easier to maintain.
If you do end up settling for the macro, then I'd be inclined to pass width in as a macro argument too, for the sake of program stability.
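Something along these lines would do it (a sketch; note it also adds the outer parentheses the original macro is missing):

#define COORDS(X, Y, W) (((Y) * (W)) + (X))

Call sites then pass the width explicitly, so the macro no longer silently binds to whatever width happens to be in scope where it is expanded.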
I am surprised that no one has mentioned one major difference between a function and a macro: any compiler can inline the function, but very few (if any) can create a function out of a macro, even when that would benefit performance.
I would offer a diverging answer, in that this question seems to be looking at the wrong solutions. It is comparing two things that even the most basic optimizer from the 90s (maybe even the 80s) can optimize to the same degree: a trivial one-liner function versus a macro.
If you want to improve performance here, you have to compare between solutions that aren't so trivial for the compiler to optimize.
For example, let's say you access the texture sequentially. Then you don't need to address each pixel through (y*w) + x; you can simply iterate over the buffer in one flat loop:
for (int j=0; j < num_pixels; ++j)
// do something with pixels[j]
In practice I've seen performance benefits from these kinds of loops over the y/x double loop, even with the most modern compilers.
Now let's say you aren't accessing things perfectly sequentially, but can still access adjacent horizontal pixels within a scanline. In that case you might get a performance boost by doing:
// Given a particular y value:
Pixel* scanline = pixels + y*w;
for (int x=0; x < w; ++x)
// do something with scanline[x]
If you aren't doing either of these things and need completely random access to an image, see whether you can make your memory access pattern more uniform (touching more horizontal pixels, which are likely to share an L1 cache line, before that line is evicted).
Sometimes it can even be worth the cost of transposing the image, if that results in the bulk of your subsequent memory accesses being horizontal within a scanline rather than vertical across scanlines (due to spatial locality). It might seem crazy that the cost of a transpose (basically rotating the image 90 degrees and swapping rows with columns) can more than pay for the cheaper accesses afterwards, but accessing memory in an efficient, cache-friendly pattern is a huge deal, especially in image processing; it can be the difference between hundreds of millions of pixels per second and just millions of pixels per second.
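To illustrate, a straightforward out-of-place transpose can look like this (a sketch: Pixel stands for whatever texel type you use, dst is a separate buffer, and in-place transposition of non-square images is considerably trickier):

// dst becomes the w-by-h transpose of the h-by-w, row-major image in src.
void transpose(const Pixel* src, Pixel* dst, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            dst[x * h + y] = src[y * w + x]; // src is read sequentially
}

Afterwards, code that previously walked a column of src can walk a contiguous row of dst instead.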
If you can't do any of this, still need completely random access, and are facing profiler hotspots here, then it might help to split your texture into smaller tiles. That means rendering more textured quads/triangles, and possibly extra work to ensure seamless results at the boundaries of each tile, but the savings in texture processing can outweigh the extra geometry overhead. Reducing the size of the texture you process in a totally random-access way increases locality of reference: accesses are more likely to hit memory still cached in the faster, smaller levels of the hierarchy before eviction.
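On the processing side, the closely related loop-tiling (cache-blocking) technique looks roughly like this (a sketch: the tile size of 64 is an arbitrary starting point to tune, process() stands in for your per-pixel work, and std::min comes from <algorithm>):

const int tile = 64; // tune so one tile's working set fits in cache
for (int ty = 0; ty < h; ty += tile) {
    for (int tx = 0; tx < w; tx += tile) {
        const int ymax = std::min(ty + tile, h);
        const int xmax = std::min(tx + tile, w);
        for (int y = ty; y < ymax; ++y)
            for (int x = tx; x < xmax; ++x)
                process(pixels[y * w + x]); // all accesses stay within one tile
    }
}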
Any of these techniques can provide a boost in performance; trying to optimize a one-liner function by turning it into a macro is very unlikely to help anything except make the code harder to maintain. In the best-case scenario a macro might improve performance in a completely unoptimized debug build, but that rather defeats the purpose of a debug build, which is intended to be easy to debug; and macros are notoriously difficult to debug.
Related
I would like to know what my compiler does with the following code:
void design_grid::design_valid()
{
    auto valid_idx = [this](int row_num, int col_num) {
        if ((row_num < 0) || (col_num < 0))
        {
            return false;
        }
        if ((row_num >= this->num_rows) || (col_num >= this->num_rows))
        {
            return false;
        }
        return true;
    };

    /* some code that calls lambda function valid_idx() */
}
If I repeatedly call the class member function above (design_grid::design_valid), what exactly happens each time my program encounters the creation of valid_idx? Does the compiler inline the lambda's code where it is called later, at compile time, so that nothing actually needs to happen at the point where valid_idx is created?
UPDATE
A segment of the assembly code is below. If this is a little too much to read, I will post another, coloured batch of code later to illustrate which parts are which (I don't have a nice way to colour code segments at the present moment). Also note that I have updated the definitions of my member function and the lambda above to reflect what they are really named in my code (and thus in the assembly).
In any case, it appears that the lambda is defined separately from the main function. The lambda is represented by the _ZZN11design_grid12design_validEvENKUliiE_clEii function directly below. Directly below this function, in turn, starts the outer function (design_grid::design_valid), represented by _ZN11design_grid12design_validEv. Later in _ZN11design_grid12design_validEv, a call is made to _ZZN11design_grid12design_validEvENKUliiE_clEii. The line where the call is made looks like this:
call _ZZN11design_grid12design_validEvENKUliiE_clEii #
Correct me if I'm wrong, but this means that the compiler defined the lambda as a normal function outside the design_valid function, and then calls it as a normal function when needed? That is, it does not create a new object every time it encounters the statement which declares the lambda? The only trace I could see of the lambda at that particular location is the line commented # tmp85, valid_idx.__this in the second function, right after the base and stack pointers are adjusted at the start of the function, and that is just a simple movq operation.
.type _ZZN11design_grid12design_validEvENKUliiE_clEii, @function
_ZZN11design_grid12design_validEvENKUliiE_clEii:
.LFB4029:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp) # __closure, __closure
movl %esi, -12(%rbp) # row_num, row_num
movl %edx, -16(%rbp) # col_num, col_num
cmpl $0, -12(%rbp) #, row_num
js .L107 #,
cmpl $0, -16(%rbp) #, col_num
jns .L108 #,
.L107:
movl $0, %eax #, D.81546
jmp .L109 #
.L108:
movq -8(%rbp), %rax # __closure, tmp65
movq (%rax), %rax # __closure_4(D)->__this, D.81547
movl 68(%rax), %eax # _5->D.69795.num_rows, D.81548
cmpl -12(%rbp), %eax # row_num, D.81548
jle .L110 #,
movq -8(%rbp), %rax # __closure, tmp66
movq (%rax), %rax # __closure_4(D)->__this, D.81547
movl 68(%rax), %eax # _7->D.69795.num_rows, D.81548
cmpl -16(%rbp), %eax # col_num, D.81548
jg .L111 #,
.L110:
movl $0, %eax #, D.81546
jmp .L109 #
.L111:
movl $1, %eax #, D.81546
.L109:
popq %rbp #
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE4029:
.size _ZZN11design_grid12design_validEvENKUliiE_clEii,.-_ZZN11design_grid12design_validEvENKUliiE_clEii
.align 2
.globl _ZN11design_grid12design_validEv
.type _ZN11design_grid12design_validEv, @function
_ZN11design_grid12design_validEv:
.LFB4028:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
pushq %rbx #
subq $72, %rsp #,
.cfi_offset 3, -24
movq %rdi, -72(%rbp) # this, this
movq -72(%rbp), %rax # this, tmp85
movq %rax, -32(%rbp) # tmp85, valid_idx.__this
movl $0, -52(%rbp) #, active_count
movl $0, -48(%rbp) #, row_num
jmp .L113 #
.L128:
movl $0, -44(%rbp) #, col_num
jmp .L114 #
.L127:
movl -44(%rbp), %eax # col_num, tmp86
movslq %eax, %rbx # tmp86, D.81551
Closures (unnamed function objects) are to lambdas as objects are to classes. This means that a closure, here called lambda_func, is created repeatedly from the lambda:
[this]() {
/* some code here */
}
Just as an object could be created repeatedly from a class. Of course the compiler may optimize some steps away.
As for this part of the question:
Does the compiler inline the code where it is called later, at compile
time, so that it does not actually do anything where the creation of
lambda_func is encountered?
See:
are lambda functions in c++ inline? and
Why can lambdas be better optimized by the compiler than plain functions?.
Here is a sample program to test what might happen:
#include <iostream>
#include <random>
#include <algorithm>

class ClassA {
public:
    void repeatedly_called();
private:
    std::random_device rd{};
    std::mt19937 mt{rd()};
    std::uniform_int_distribution<> ud{0, 10};
};

void ClassA::repeatedly_called()
{
    auto lambda_func = [this]() {
        /* some code here */
        return ud(mt);
    };

    /* some code that calls lambda_func() */
    std::cout << lambda_func() * lambda_func() << '\n';
}

int main()
{
    ClassA class_a{};

    for (size_t i{0}; i < 100; ++i) {
        class_a.repeatedly_called();
    }

    return 0;
}
Testing it with optimisations enabled, we can see that in this particular case the function repeatedly_called does not make a call to the lambda (which is generating the random numbers): it has been inlined. (The assembly listing that showed this is omitted here.)
In the question's update, however, it appears that the lambda's instructions were not inlined. Theoretically the closure is created, and normally that would mean setting up some storage, but the compiler may optimize some of those steps away.
With only this captured, the lambda is similar to a member function.
Basically, what happens is that the compiler creates an unnamed class with a function call operator, storing the captured variables in the class as member variables. The compiler then uses this unnamed class to create an object, which is your lambda_func variable.
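To make that concrete, here is a hand-written approximation of what the compiler generates for the lambda in repeatedly_called (the name LambdaFunc is invented; the real closure type is unnamed, and because it is defined inside the member function it may legally touch ClassA's private members, which this standalone sketch could not):

class LambdaFunc {                        // stands in for the unnamed closure type
public:
    explicit LambdaFunc(ClassA* t) : __this(t) {}
    int operator()() const { return __this->ud(__this->mt); } // the lambda body
private:
    ClassA* __this;                       // the captured 'this' pointer
};

// auto lambda_func = [this]() { return ud(mt); };
// behaves approximately like:
// LambdaFunc lambda_func(this);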
In reviewing a large software project I came across two ways of doing essentially the same thing: pushing an initial entry onto a std::vector.
Consider a class like Foo:
class Foo
{
public:
    Foo() {} // needed so that "Foo bar;" below compiles
    Foo(int param) {
        m_param = param;
    }
    void setParam(int param) {
        m_param = param;
    }
private:
    int m_param;
};
Is there a preferred method between the following, considering whatever applicable metrics: speed, stability, etc.?
Foo bar;
int val = 5;
bar.setParam(val);
std::vector<Foo> fooVec(1, bar);
Versus
int val = 5;
std::vector<Foo> fooVec;
fooVec.push_back(Foo(val));
Is there a preferred method between the following, considering whatever applicable metrics: speed, stability, etc.?
It can be argued, without doubt, that this is poor style:
auto test1()
{
    Foo bar;                         // redundant default construction
    int val = 5;                     // redundant load
    bar.setParam(val);               // only now setting the value
    std::vector<Foo> fooVec(1, bar); // redundant copy
    return fooVec;
}
and that this is good style:
auto test2()
{
    return std::vector<Foo>(1, Foo(5));
}
What about performance? We all care about that, right? But what does it mean in reality, once you've enabled optimisations?
__Z5test1v: ## #_Z5test1v
.cfi_startproc
## BB#0: ## %_ZNSt3__16vectorI3FooNS_9allocatorIS1_EEEC2EmRKS1_.exit1
pushq %rbx
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbx, -16
movq %rdi, %rbx
movq $0, 16(%rbx)
movq $0, 8(%rbx)
movq $0, (%rbx)
movl $4, %edi
callq __Znwm
movq %rax, (%rbx)
leaq 4(%rax), %rcx
movq %rcx, 16(%rbx)
movl $5, (%rax)
movq %rcx, 8(%rbx)
movq %rbx, %rax
popq %rbx
retq
.cfi_endproc
.globl __Z5test2v
.align 4, 0x90
__Z5test2v: ## #_Z5test2v
.cfi_startproc
## BB#0: ## %_ZNSt3__16vectorI3FooNS_9allocatorIS1_EEEC2EmRKS1_.exit1
pushq %rbx
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbx, -16
movq %rdi, %rbx
movq $0, 16(%rbx)
movq $0, 8(%rbx)
movq $0, (%rbx)
movl $4, %edi
callq __Znwm
movq %rax, (%rbx)
leaq 4(%rax), %rcx
movq %rcx, 16(%rbx)
movl $5, (%rax)
movq %rcx, 8(%rbx)
movq %rbx, %rax
popq %rbx
retq
.cfi_endproc
Absolutely no difference whatsoever. The generated machine code is exactly the same in this case.
Unless you have a fairly specific reason to use one of these, like needing to support an older (pre-C++11) compiler, I'd just use:
std::vector<Foo> fooVec { 5 }; // or fooVec { Foo(5) }, if you really prefer
This is pretty much guaranteed to be as fast, stable, etc. as any of the others (and may be a tad faster, depending...)
My code looks like this:
#include <string>

template<typename F>
void printHello(F f)
{
    f("Hello!");
}

int main() {
    std::string buf;
    printHello([&buf](const char* msg) { buf += msg; }); // fine
    printHello([&buf]() { });                            // fails deep inside printHello
}
The question is: how can I restrict F to accept only callables with the signature void(const char*), so that the second call to printHello fails not at some obscure place inside printHello, but on the line that calls printHello incorrectly?
==EDIT==
I know that std::function can solve this particular case (it is what I'd use if I really just wanted to print 'hello'). But std::function is really something else, and it comes at a cost (however small that cost is; as of today, April 2016, GCC and MSVC cannot optimize away the virtual call). So my question can be seen as purely academic: is there a "template" way to solve it?
Unless you're using an ancient standard library, std::function will have optimisations for small function objects (of which yours is one). You will see no performance reduction whatsoever.
People who tell you not to use std::function for performance reasons are the very same people who 'optimise' code before measuring performance bottlenecks.
Write the code that expresses intent. If it becomes a performance bottleneck (it won't), then look at changing it.
I once worked on a financial forwards pricing system. Someone decided that it ran too slowly (64 cores, multiple server boxes, hundreds of thousands of discrete algorithms running in parallel in a massive DAG). So we profiled it.
What did we find?
The processing took almost no time at all. The program spent 99% of its time converting doubles to strings and strings to doubles at the boundaries of the IO, where we had to communicate with a message bus.
Using a lambda in place of a std::function for the callbacks would have made no difference whatsoever.
Write elegant code. Express your intent clearly. Compile with optimisations. Marvel as the compiler does its job and turns your 100 lines of c++ into 5 machine code instructions.
A simple demonstration:
#include <functional>

// external function forces an actual function call
extern void other_func(const char* p);

// try to force std::function to call polymorphically
void test_it(const std::function<void(const char*)>& f, const char* p)
{
    f(p);
}

int main()
{
    // make our function object
    auto f = std::function<void(const char*)>([](const char* p) { other_func(p); });

    const char* const data[] = {
        "foo",
        "bar",
        "baz"
    };

    // call it in a tight loop
    for (auto p : data) {
        test_it(f, p);
    }
}
Compile with Apple clang, -O2. The result:
.globl _main
.align 4, 0x90
_main: ## #main
Lfunc_begin1:
.cfi_startproc
.cfi_personality 155, ___gxx_personality_v0
.cfi_lsda 16, Lexception1
## BB#0: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i
#
# the normal stack frame stuff...
#
pushq %rbp
Ltmp13:
.cfi_def_cfa_offset 16
Ltmp14:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp15:
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %rbx
subq $72, %rsp
Ltmp16:
.cfi_offset %rbx, -40
Ltmp17:
.cfi_offset %r14, -32
Ltmp18:
.cfi_offset %r15, -24
movq ___stack_chk_guard@GOTPCREL(%rip), %rbx
movq (%rbx), %rbx
movq %rbx, -32(%rbp)
leaq -80(%rbp), %r15
movq %r15, -48(%rbp)
#
# take the address of std::function's vtable... we'll need it (once)
#
leaq __ZTVNSt3__110__function6__funcIZ4mainE3$_0NS_9allocatorIS2_EEFvPKcEEE+16(%rip), %rax
#
# here's the tight loop...
#
movq %rax, -80(%rbp)
leaq L_.str(%rip), %rdi
movq %rdi, -88(%rbp)
Ltmp3:
#
# oh look! std::function's call has been TOTALLY INLINED!!
#
callq __Z10other_funcPKc
Ltmp4:
LBB1_2: ## %_ZNSt3__110__function6__funcIZ4mainE3$_0NS_9allocatorIS2_EEFvPKcEEclEOS6_.exit
## =>This Inner Loop Header: Depth=1
#
# notice that the loop itself uses more instructions than the call??
#
leaq L_.str1(%rip), %rax
movq %rax, -88(%rbp)
movq -48(%rbp), %rdi
testq %rdi, %rdi
je LBB1_1
## BB#3: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i.1
## in Loop: Header=BB1_2 Depth=1
#
# destructor called once (constant time, therefore irrelevant)
#
movq (%rdi), %rax
movq 48(%rax), %rax
Ltmp5:
leaq -88(%rbp), %rsi
callq *%rax
Ltmp6:
## BB#4: ## in Loop: Header=BB1_2 Depth=1
leaq L_.str2(%rip), %rax
movq %rax, -88(%rbp)
movq -48(%rbp), %rdi
testq %rdi, %rdi
jne LBB1_5
#
# the rest of this function is exception handling. Executed at most
# once, in exceptional circumstances. Therefore, irrelevant.
#
LBB1_1: ## in Loop: Header=BB1_2 Depth=1
movl $8, %edi
callq ___cxa_allocate_exception
movq __ZTVNSt3__117bad_function_callE@GOTPCREL(%rip), %rcx
addq $16, %rcx
movq %rcx, (%rax)
Ltmp10:
movq __ZTINSt3__117bad_function_callE@GOTPCREL(%rip), %rsi
movq __ZNSt3__117bad_function_callD1Ev@GOTPCREL(%rip), %rdx
movq %rax, %rdi
callq ___cxa_throw
Ltmp11:
jmp LBB1_2
LBB1_9: ## %.loopexit.split-lp
Ltmp12:
jmp LBB1_10
LBB1_5: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i.2
movq (%rdi), %rax
movq 48(%rax), %rax
Ltmp7:
leaq -88(%rbp), %rsi
callq *%rax
Ltmp8:
## BB#6:
movq -48(%rbp), %rdi
cmpq %r15, %rdi
je LBB1_7
## BB#15:
testq %rdi, %rdi
je LBB1_17
## BB#16:
movq (%rdi), %rax
callq *40(%rax)
jmp LBB1_17
LBB1_7:
movq -80(%rbp), %rax
leaq -80(%rbp), %rdi
callq *32(%rax)
LBB1_17: ## %_ZNSt3__18functionIFvPKcEED1Ev.exit
cmpq -32(%rbp), %rbx
jne LBB1_19
## BB#18: ## %_ZNSt3__18functionIFvPKcEED1Ev.exit
xorl %eax, %eax
addq $72, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
Can we stop arguing about performance now please?
You can add compile-time checking of template parameters by defining constraints.
This lets you catch such errors early, and there is no runtime overhead: current compilers generate no code for a constraint.
For example, we can define such a constraint:
template<class F, class T> struct CanCall
{
    static void constraints(F f, T a) { f(a); }
    CanCall() { void (*p)(F, T) = constraints; }
};
CanCall checks (at compile time) that an F can be called with a T.
Usage:
template<typename F>
void printHello(F f)
{
    CanCall<F, const char*>();
    f("Hello!");
}
As a result, compilers also give readable error messages for a failed constraint.
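For instance, with the two calls from the question (reusing buf from its main), the bad call now fails where CanCall<F, const char*> is instantiated at the top of printHello, rather than at the f("Hello!") line; the exact diagnostic wording varies by compiler:

printHello([&buf](const char* msg) { buf += msg; }); // OK: constraints(F, T) can call f(a)
printHello([&buf]() { });  // error: f(a) inside CanCall::constraints is ill-formed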
Well... Just use SFINAE
#include <utility> // for std::declval

template<typename T>
auto printHello(T f) -> void_t<decltype(f(std::declval<const char*>()))> {
    f("hello");
}
And void_t is implemented as:
template<typename...>
using void_t = void;
The return type acts as a constraint on the parameter sent to your function. If the expression inside the decltype cannot be evaluated, substitution fails and the bad call produces an error at the call site.
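A quick sketch of the resulting behaviour, mirroring the question's calls:

std::string buf;
printHello([&buf](const char* msg) { buf += msg; }); // OK: the decltype expression is valid
// printHello([&buf]() { });  // error: no matching function for call to 'printHello'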
I'm working on the classic "Reverse a String" problem.
Is it a good idea to use the position of the null terminator as swap space? The idea is to save the declaration of one variable.
Specifically, starting with Kernighan and Ritchie's algorithm:
void reverse(char s[])
{
    int length = strlen(s);
    int c, i, j;

    for (i = 0, j = length - 1; i < j; i++, j--)
    {
        c = s[i];
        s[i] = s[j];
        s[j] = c;
    }
}
...can we instead do the following?
void reverseUsingNullPosition(char s[]) {
    int length = strlen(s);
    int i, j;

    for (i = 0, j = length - 1; i < j; i++, j--) {
        s[length] = s[i]; // use last position instead of a new var
        s[i] = s[j];
        s[j] = s[length];
    }
    s[length] = 0; // replace null character
}
Notice how the c variable is no longer needed. We simply use the last position in the array, where the null terminator resides, as our swap space. When we're done, we put the 0 back.
Here's the main routine (Xcode):
#include <stdio.h>
#include <cstring>

int main(int argc, const char* argv[]) {
    char cheese[] = { 'c', 'h', 'e', 'd', 'd', 'a', 'r', 0 };

    printf("Cheese is: %s\n", cheese);  //-> Cheese is: cheddar
    reverse(cheese);
    printf("Cheese is: %s\n", cheese);  //-> Cheese is: raddehc
    reverseUsingNullPosition(cheese);
    printf("Cheese is: %s\n", cheese);  //-> Cheese is: cheddar
}
Yes, this can be done. No, it is not a good idea, because it makes your program much harder to optimize.
When you declare char c in the local scope, the optimizer can figure out that the value is not used beyond the s[j] = c; assignment, and can keep the temporary in a register. In addition to effectively eliminating the variable for you, it may even figure out that you are performing a swap and emit a hardware-specific swap instruction. All of this saves you a memory access per character.
When you use s[length] for your temporary, the optimizer has far less freedom: it is forced to emit the write to memory. This may be just as fast thanks to caching, but on embedded platforms it can have a significant effect.
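For comparison, here is a sketch of the register-friendly version written with an explicit std::swap; the scratch value never needs a slot in the array, so the optimizer is free to keep it in a register:

#include <algorithm> // std::swap
#include <cstring>   // std::strlen

void reverse(char s[])
{
    char* b = s;
    char* e = s + std::strlen(s); // one past the last character
    while (e - b > 1) {
        --e;
        std::swap(*b, *e); // the temporary lives in a register, not in s[length]
        ++b;
    }
}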
First of all, such micro-optimizations are totally irrelevant until proven relevant. We're talking about C++: you have std::string and std::reverse, and you shouldn't normally even be thinking at this level.
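For the record, the idiomatic version is a one-liner:

#include <algorithm>
#include <string>

int main()
{
    std::string s = "cheddar";
    std::reverse(s.begin(), s.end()); // s == "raddehc"
}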
In any case, if you compile both versions with -Os in Xcode, you obtain for reverse:
.cfi_startproc
Lfunc_begin0:
pushq %rbp
Ltmp3:
.cfi_def_cfa_offset 16
Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp5:
.cfi_def_cfa_register %rbp
pushq %r14
pushq %rbx
Ltmp6:
.cfi_offset %rbx, -32
Ltmp7:
.cfi_offset %r14, -24
movq %rdi, %r14
Ltmp8:
callq _strlen
Ltmp9:
leal -1(%rax), %ecx
testl %ecx, %ecx
jle LBB0_3
Ltmp10:
movslq %ecx, %rcx
addl $-2, %eax
Ltmp11:
xorl %edx, %edx
LBB0_2:
Ltmp12:
movb (%r14,%rdx), %sil
movb (%r14,%rcx), %bl
movb %bl, (%r14,%rdx)
movb %sil, (%r14,%rcx)
Ltmp13:
incq %rdx
decq %rcx
cmpl %eax, %edx
leal -1(%rax), %eax
jl LBB0_2
Ltmp14:
LBB0_3:
popq %rbx
popq %r14
popq %rbp
ret
Ltmp15:
Lfunc_end0:
.cfi_endproc
and for reverseUsingNullPosition:
.cfi_startproc
Lfunc_begin1:
pushq %rbp
Ltmp19:
.cfi_def_cfa_offset 16
Ltmp20:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp21:
.cfi_def_cfa_register %rbp
pushq %rbx
pushq %rax
Ltmp22:
.cfi_offset %rbx, -24
movq %rdi, %rbx
Ltmp23:
callq _strlen
Ltmp24:
leal -1(%rax), %edx
testl %edx, %edx
Ltmp25:
movslq %eax, %rdi
jle LBB1_3
Ltmp26:
movslq %edx, %rdx
addl $-2, %eax
Ltmp27:
xorl %esi, %esi
LBB1_2:
Ltmp28:
movb (%rbx,%rsi), %cl
movb %cl, (%rbx,%rdi)
movb (%rbx,%rdx), %cl
movb %cl, (%rbx,%rsi)
movb (%rbx,%rdi), %cl
movb %cl, (%rbx,%rdx)
Ltmp29:
incq %rsi
decq %rdx
cmpl %eax, %esi
leal -1(%rax), %eax
jl LBB1_2
Ltmp30:
LBB1_3: ## %._crit_edge
movb $0, (%rbx,%rdi)
addq $8, %rsp
popq %rbx
Ltmp31:
popq %rbp
ret
Ltmp32:
Lfunc_end1:
.cfi_endproc
If you compare the inner loops, you have:
movb (%r14,%rdx), %sil
movb (%r14,%rcx), %bl
movb %bl, (%r14,%rdx)
movb %sil, (%r14,%rcx)
vs
movb (%rbx,%rsi), %cl
movb %cl, (%rbx,%rdi)
movb (%rbx,%rdx), %cl
movb %cl, (%rbx,%rsi)
movb (%rbx,%rdi), %cl
movb %cl, (%rbx,%rdx)
So I wouldn't say you are saving as much overhead as you think (you are accessing the array more times per swap); maybe it's faster, maybe it isn't. Which teaches you another thing: assuming that some code is more performant than other code is pointless. The only things that matter are a well-done benchmark and a profile of the code.
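If you do want numbers, a minimal harness along these lines is enough (a sketch: the iteration count is arbitrary, and the volatile sink only exists to stop the optimizer from deleting the work):

#include <chrono>
#include <cstdio>

void reverse(char s[]); // the function under test, from the question

volatile char sink;     // defeats dead-code elimination

int main()
{
    char buf[] = "cheddarcheddarcheddar";

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        reverse(buf);
        sink = buf[0];
    }
    auto t1 = std::chrono::steady_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%lld us for 1e6 reversals\n", us);
}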
Legal: Yes
Good idea: No
The cost of an "extra" variable is zero, so there is absolutely no reason to avoid it. The stack pointer needs to be adjusted anyway, so it doesn't matter whether it has to make room for one extra int.
Further:
With compiler optimization turned on, the variable c in the original code will most likely not even exist; it will just be a register in the CPU.
With your code, optimization is more difficult, so it is not easy to say how well the compiler will do. Maybe you'll get the same, maybe you'll get something worse, but you won't get anything better.
So just forget the idea.
We can use printf and the STL and also manually unroll things and use pointers.
#include <stdio.h>
#include <algorithm> // std::swap
#include <cstring>

void reverse(char s[])
{
    char* b = s;
    char* e = s + ::strlen(s) - 4; // points at the 4th character from the end

    // swap four pairs per iteration while enough characters remain
    while (e - b > 4)
    {
        std::swap(b[0], e[3]);
        std::swap(b[1], e[2]);
        std::swap(b[2], e[1]);
        std::swap(b[3], e[0]);
        b += 4;
        e -= 4;
    }
    e += 3; // e now points at the last character of the remaining middle

    // finish the few remaining pairs one at a time
    while (b < e)
    {
        std::swap(*(b++), *(e--));
    }
}

int main(int argc, const char* argv[]) {
    char cheese[] = { 'c', 'h', 'e', 'd', 'd', 'a', 'r', 0 };

    printf("Cheese is: %s\n", cheese); //-> Cheese is: cheddar
    reverse(cheese);
    printf("Cheese is: %s\n", cheese); //-> Cheese is: raddehc
}
Hard to tell if it's faster with just the test case of "cheddar".
Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The final compiled program really does perform that comparison every time, right?) A simple solution would be to write the loop code once for ptr == NULL and once for ptr != NULL, but this would increase the amount of code and make it harder to maintain, and it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Maybe use dynamically-valued constants and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett: the OP wants to know whether there is a better way to remove the pointer check here:
loop {
    A;
    if (ptr == 0 || ptr->calculate() > 5) B;
    C;
}
than duplicating the loop as shown here:
if (ptr == 0)
    loop {
        A;
        B;
        C;
    }
else
    loop {
        A;
        if (ptr->calculate() > 5) B;
        C;
    }
I just wanted to inform you that GCC can apparently do the requested hoisting in its optimizer (the transformation is known as loop unswitching). Here's a model loop (in C):
struct C
{
    int (*calculate)();
};

void sideeffect1();
void sideeffect2();
void sideeffect3();

void foo(struct C *ptr)
{
    int i;
    for (i = 0; i < 1000; i++)
    {
        sideeffect1();
        if (ptr == 0 || ptr->calculate() > 5) sideeffect2();
        sideeffect3();
    }
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, @function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although it is complete overkill, you can put the loop in a function template parameterized on the nullness of the pointer. This generates two copies of the loop body, but in each copy the check is a compile-time constant, so it is optimized out. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
    // x, y and ptr are assumed to be visible from the enclosing scope
    for (int i = x; i != y; ++i) {
        /**/
        if (ptr_is_null || ptr->calculate() > 5) {
            /**/
        }
        /**/
    }
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization"; the compiler will probably do the RightThing(TM) for you.
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases gives you almost twice as much code to write, test and, more importantly, maintain.
A "large loop" smells like an opportunity to refactor the loop body into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
Since
ptr is an object pointer set before the loop and never changed
can't you just check whether it is null once, before the loop, and not check it again, since you don't change it?
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
    // your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance actually becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization