Segfault sharing array between assembly and C++ - c++

I am writing a program that has a shared state between assembly and C++. I declared a global array in the assembly file and accessed that array in a function within C++. When I call that function from within C++, there are no issues, but then I call that same function from within assembly and I get a segmentation fault. I believe I preserved the right registers across function calls.
Strangely, when I change the type of the pointer within C++ to a uint64_t pointer, it correctly outputs the values but then segmentation faults again after casting it to a uint64_t.
In the following code, the array which keeps giving me errors is currentCPUState.
//CPU.cpp
extern uint64_t currentCPUState[6];
extern "C" {
void initInternalState(void* instructions, int indexSize);
void printCPUState();
}
void printCPUState() {
uint64_t b = currentCPUState[0];
printf("%d\n", b); //this line DOESNT crash ???
std::cout << b << "\n"; //this line crashes
//omitted some code for the sake of brevity
std::cout << "\n";
}
CPU::CPU() {
//set initial cpu state
currentCPUState[AF] = 0;
currentCPUState[BC] = 0;
currentCPUState[DE] = 0;
currentCPUState[HL] = 0;
currentCPUState[SP] = 0;
currentCPUState[PC] = 0;
printCPUState(); //this has no issues
initInternalState(instructions, sizeof(void*));
}
//cpu.s
.section .data
.balign 8
instructionArr:
.space 8 * 1024, 0
//stores values of registers
//used for transitioning between C and ASM
//uint64_t currentCPUState[6]
.global currentCPUState
currentCPUState:
.quad 0, 0, 0, 0, 0, 0
.section .text
.global initInternalState
initInternalState:
push %rdi
push %rsi
mov %rcx, %rdi
mov %rdx, %rsi
push %R12
push %R13
push %R14
push %R15
call initGBCpu
pop %R15
pop %R14
pop %R13
pop %R12
pop %rsi
pop %rdi
ret
//omitted unimportant code
//initGBCpu(rdi: void* instructions, rsi:int size)
//function initializes the array of opcodes
initGBCpu:
pushq %rdx
//move each instruction into the array in proper order
//also fill the instructionArr
leaq instructionArr(%rip), %rdx
addop inst0x00
addop inst0x01
addop inst0x02
addop inst0x03
addop inst0x04
call loadCPUState
call inst0x04 //inc BC
call saveCPUState
call printCPUState //CRASHES HERE
popq %rdx
ret
Additional details:
OS: Windows 64 bit
Compiler (MinGW64-w)
Architecture: x64
Any insight would be much appreciated
Edit:
addop is a macro:
//adds an opcode to the array of functions
.macro addop lbl
leaq \lbl (%rip), %rcx
mov %rcx, 0(%rdi)
mov %rcx, 0(%rdx)
add %rsi, %rdi
add %rsi, %rdx
.endm

Some of x86-64 calling conventions require that the stack have to be alligned to 16-byte boundary before calling functions.
After functions are called, a 8-byte return address is pushed on the stack, so another 8-byte data have to be added to the stack to satisfy this allignment requirement. Otherwise, some instruction with allignment requirement (like some of the SSE instructions) may crash.
Assumign that such calling conventions are applied, the initGBCpu function looks OK, but the initInternalState function have to add one more 8-byte thing to the stack before calling the initInternalState function.
For example:
initInternalState:
push %rdi
push %rsi
mov %rcx, %rdi
mov %rdx, %rsi
push %R12
push %R13
push %R14
push %R15
sub $8, %rsp // adjust stack allignment
call initGBCpu
add $8, %rsp // undo the stack pointer movement
pop %R15
pop %R14
pop %R13
pop %R12
pop %rsi
pop %rdi
ret

Related

Cause of EXC_BREAKPOINT crash

I've got a crash occurring on some users computers in a C++ audiounit component running inside Logic X. I can't repeat it locally unfortunately and in the process of trying to work out how it might occur I've got some questions.
Here's the relevant info from the crash dump:
Exception Type: EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Trace/BPT trap: 5
Termination Reason: Namespace SIGNAL, Code 0x5
Terminating Process: exc handler [0]
The questions are:
What might cause a EXC_BREAKPOINT in the situation I'm looking at. Is this information from Apple complete and accurate: "Similar to an Abnormal Exit, this exception is intended to give an attached debugger the chance to interrupt the process at a specific point in its execution. You can trigger this exception from your own code using the __builtin_trap() function. If no debugger is attached, the process is terminated and a crash report is generated."
Why would it occur on SharedObject + 200 (see disassembly)
Is RBX the 'this' pointer at the moment the crash occurs.
The crash occurs here:
juce::ValueTree::SharedObject::SharedObject(juce::ValueTree::SharedObject const&) + 200
The C++ is as follows:
SharedObject (const SharedObject& other)
: ReferenceCountedObject(),
type (other.type), properties (other.properties), parent (nullptr)
{
for (int i = 0; i < other.children.size(); ++i)
{
SharedObject* const child = new SharedObject (*other.children.getObjectPointerUnchecked(i));
child->parent = this;
children.add (child);
}
}
The disassembly:
-> 0x127167950 <+0>: pushq %rbp
0x127167951 <+1>: movq %rsp, %rbp
0x127167954 <+4>: pushq %r15
0x127167956 <+6>: pushq %r14
0x127167958 <+8>: pushq %r13
0x12716795a <+10>: pushq %r12
0x12716795c <+12>: pushq %rbx
0x12716795d <+13>: subq $0x18, %rsp
0x127167961 <+17>: movq %rsi, %r12
0x127167964 <+20>: movq %rdi, %rbx
0x127167967 <+23>: leaq 0x589692(%rip), %rax ; vtable for juce::ReferenceCountedObject + 16
0x12716796e <+30>: movq %rax, (%rbx)
0x127167971 <+33>: movl $0x0, 0x8(%rbx)
0x127167978 <+40>: leaq 0x599fe9(%rip), %rax ; vtable for juce::ValueTree::SharedObject + 16
0x12716797f <+47>: movq %rax, (%rbx)
0x127167982 <+50>: leaq 0x10(%rbx), %rdi
0x127167986 <+54>: movq %rdi, -0x30(%rbp)
0x12716798a <+58>: leaq 0x10(%r12), %rsi
0x12716798f <+63>: callq 0x12711cf70 ; juce::Identifier::Identifier(juce::Identifier const&)
0x127167994 <+68>: leaq 0x18(%rbx), %rdi
0x127167998 <+72>: movq %rdi, -0x38(%rbp)
0x12716799c <+76>: leaq 0x18(%r12), %rsi
0x1271679a1 <+81>: callq 0x12711c7b0 ; juce::NamedValueSet::NamedValueSet(juce::NamedValueSet const&)
0x1271679a6 <+86>: movq $0x0, 0x30(%rbx)
0x1271679ae <+94>: movl $0x0, 0x38(%rbx)
0x1271679b5 <+101>: movl $0x0, 0x40(%rbx)
0x1271679bc <+108>: movq $0x0, 0x48(%rbx)
0x1271679c4 <+116>: movl $0x0, 0x50(%rbx)
0x1271679cb <+123>: movl $0x0, 0x58(%rbx)
0x1271679d2 <+130>: movq $0x0, 0x60(%rbx)
0x1271679da <+138>: cmpl $0x0, 0x40(%r12)
0x1271679e0 <+144>: jle 0x127167aa2 ; <+338>
0x1271679e6 <+150>: xorl %r14d, %r14d
0x1271679e9 <+153>: nopl (%rax)
0x1271679f0 <+160>: movl $0x68, %edi
0x1271679f5 <+165>: callq 0x12728c232 ; symbol stub for: operator new(unsigned long)
0x1271679fa <+170>: movq %rax, %r13
0x1271679fd <+173>: movq 0x30(%r12), %rax
0x127167a02 <+178>: movq (%rax,%r14,8), %rsi
0x127167a06 <+182>: movq %r13, %rdi
0x127167a09 <+185>: callq 0x127167950 ; <+0>
0x127167a0e <+190>: movq %rbx, 0x60(%r13) // MY NOTES: child->parent = this
0x127167a12 <+194>: movl 0x38(%rbx), %ecx
0x127167a15 <+197>: movl 0x40(%rbx), %eax
0x127167a18 <+200>: cmpl %eax, %ecx
Update 1:
It looks like RIP is suggesting we are in the middle of the 'add' call which is this function, inlined:
/** Appends a new object to the end of the array.
This will increase the new object's reference count.
#param newObject the new object to add to the array
#see set, insert, addIfNotAlreadyThere, addSorted, addArray
*/
ObjectClass* add (ObjectClass* const newObject) noexcept
{
data.ensureAllocatedSize (numUsed + 1);
jassert (data.elements != nullptr);
data.elements [numUsed++] = newObject;
if (newObject != nullptr)
newObject->incReferenceCount();
return newObject;
}
Update 2:
At the point of crash register values of relevant registers:
this == rbx: 0x00007fe5bc37c950
&other == r12: 0x00007fe5bc348cc0
rax = 0
rcx = 0
There may be a few problems in this code:
like SM mentioned, other.children.getObjectPointerUnchecked(i) could return nullptr
in ObjectClass* add (ObjectClass* const newObject) noexcept, you check if newObject isn't null before calling incReferenceCount (which means a null may occur in this method call), but you don't null-check before adding this object during data.elements [numUsed++] = newObject;, so you may have a nullptr here, and if you call this array somewhere else without checking, you may have a crash.
we don't really know the type of data.elements (I assume an array), but there could be a copy operation occuring because of the pointer assignment (if the pointer is converted to object, if data.elements is not an array of pointer or there is some operator overload), there could be a crash here (but it's unlikely)
there is a circular ref between children and parent, so there may be a problem during object destruction
I suspect that the problem is that a shared, ref-counted object that should have been placed on the heap was inadvertently allocated on the stack. If the stack unwinds and is overwritten afterwards, and a reference to the ref-counted object still exists, then random things will happen at unpredictable times when that reference is accessed. (The proximity in address-space of this and &other also makes me suspect that both are on the stack.)

Repeatedly implementing lambda function

I would like to know what my compiler does with the following code
void design_grid::design_valid()
{
auto valid_idx = [this]() {
if ((row_num < 0) || (col_num < 0))
{
return false;
}
if ((row_num >= this->num_rows) || (col_num >= this->num_rows))
{
return false;
}
return true;
}
/* some code that calls lambda function valid_idx() */
}
If I repeatedly call the class member function above (design_grid::design_valid), then what exactly happens when my program encounters the creation of valid_idx every time? Does the compiler inline the code where it is called later, at compile time, so that it does not actually do anything where the creation of valid_idx is encountered?
UPDATE
A segment of the assembly code is below. If this is a little too much too read, I will post another batch of code later which is coloured, to illustrate which parts are which. (don't have a nice way to colour code segments with me at the present moment). Also note that I have updated the definition of my member function and the lambda function above to reflect what it is really named in my code (and thus, in the assembly language).
In any case, it appears that the lambda is defined separately from the main function. The lambda function is represented by the _ZZN11design_grid12design_validEvENKUliiE_clEii function directly below. Directly below this function, in turn, the outer function (design_grid::design_valid), represented by _ZN11design_grid12design_validEv starts. Later in _ZN11design_grid12design_validEv, a call is made to _ZZN11design_grid12design_validEvENKUliiE_clEii. This line where the call is made looks like
call _ZZN11design_grid12design_validEvENKUliiE_clEii #
Correct me if I'm wrong, but this means that the compiler defined the lambda as a normal function outside the design_valid function, then calls it as a normal function when it should? That is, it does not create a new object every time it encounters the statement which declares the lambda function? The only trace I could see of the lambda function in that particular location is in the line which is commented # tmp85, valid_idx.__this in the second function, right after the base and stack pointers readjusted at the start of the function, but this is just a simple movq operation.
.type _ZZN11design_grid12design_validEvENKUliiE_clEii, #function
_ZZN11design_grid12design_validEvENKUliiE_clEii:
.LFB4029:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp) # __closure, __closure
movl %esi, -12(%rbp) # row_num, row_num
movl %edx, -16(%rbp) # col_num, col_num
cmpl $0, -12(%rbp) #, row_num
js .L107 #,
cmpl $0, -16(%rbp) #, col_num
jns .L108 #,
.L107:
movl $0, %eax #, D.81546
jmp .L109 #
.L108:
movq -8(%rbp), %rax # __closure, tmp65
movq (%rax), %rax # __closure_4(D)->__this, D.81547
movl 68(%rax), %eax # _5->D.69795.num_rows, D.81548
cmpl -12(%rbp), %eax # row_num, D.81548
jle .L110 #,
movq -8(%rbp), %rax # __closure, tmp66
movq (%rax), %rax # __closure_4(D)->__this, D.81547
movl 68(%rax), %eax # _7->D.69795.num_rows, D.81548
cmpl -16(%rbp), %eax # col_num, D.81548
jg .L111 #,
.L110:
movl $0, %eax #, D.81546
jmp .L109 #
.L111:
movl $1, %eax #, D.81546
.L109:
popq %rbp #
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE4029:
.size _ZZN11design_grid12design_validEvENKUliiE_clEii,.-_ZZN11design_grid12design_validEvENKUliiE_clEii
.align 2
.globl _ZN11design_grid12design_validEv
.type _ZN11design_grid12design_validEv, #function
_ZN11design_grid12design_validEv:
.LFB4028:
.cfi_startproc
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
pushq %rbx #
subq $72, %rsp #,
.cfi_offset 3, -24
movq %rdi, -72(%rbp) # this, this
movq -72(%rbp), %rax # this, tmp85
movq %rax, -32(%rbp) # tmp85, valid_idx.__this
movl $0, -52(%rbp) #, active_count
movl $0, -48(%rbp) #, row_num
jmp .L113 #
.L128:
movl $0, -44(%rbp) #, col_num
jmp .L114 #
.L127:
movl -44(%rbp), %eax # col_num, tmp86
movslq %eax, %rbx # tmp86, D.81551
Closures (unnamed function objects) are to lambdas as objects are to classes. This means that a closure lambda_func is created repeatedly from the lambda:
[this]() {
/* some code here */
}
Just as an object could be created repeatedly from a class. Of course the compiler may optimize some steps away.
As for this part of the question:
Does the compiler inline the code where it is called later, at compile
time, so that it does not actually do anything where the creation of
lambda_func is encountered?
See:
are lambda functions in c++ inline? and
Why can lambdas be better optimized by the compiler than plain functions?.
Here is a sample program to test what might happen:
#include <iostream>
#include <random>
#include <algorithm>
class ClassA {
public:
void repeatedly_called();
private:
std::random_device rd{};
std::mt19937 mt{rd()};
std::uniform_int_distribution<> ud{0,10};
};
void ClassA::repeatedly_called()
{
auto lambda_func = [this]() {
/* some code here */
return ud(mt);
};
/* some code that calls lambda_func() */
std::cout << lambda_func()*lambda_func() << '\n';
};
int main()
{
ClassA class_a{};
for(size_t i{0}; i < 100; ++i) {
class_a.repeatedly_called();
}
return 0;
}
It was tested here.
We can see that in this particular case the function repeatedly_called does not make a call to the lambda (which is generating the random numbers) as it has been inlined:
In the question's update it appears that the lambda instructions were not inlined. Theoretically the closure is created and normally that would mean some memory allocation, but, the compiler may optimize and remove some steps.
With only the capture of this the lambda is similar to a member function.
Basically what happens is that the compiler creates an unnamed class with a function call operator, storing the captured variables in the class as member variables. Then the compiler uses this unnamed class to create an object, which is your lambda_func variable.

Why g++ still optimizes tail recursion when the recursion function result is multiplied?

They say, the tail recursion optimization works only when the the call is just before return from the function. So they show this code as example of what shouldn't be optimized by C compilers:
long long f(long long n) {
return n > 0 ? f(n - 1) * n : 1;
}
because there the recursive function call is multiplied by n which means the last operation is multiplication, not recursive call. However, it is even on -O1 level:
recursion`f:
0x100000930 <+0>: pushq %rbp
0x100000931 <+1>: movq %rsp, %rbp
0x100000934 <+4>: movl $0x1, %eax
0x100000939 <+9>: testq %rdi, %rdi
0x10000093c <+12>: jle 0x10000094e
0x10000093e <+14>: nop
0x100000940 <+16>: imulq %rdi, %rax
0x100000944 <+20>: cmpq $0x1, %rdi
0x100000948 <+24>: leaq -0x1(%rdi), %rdi
0x10000094c <+28>: jg 0x100000940
0x10000094e <+30>: popq %rbp
0x10000094f <+31>: retq
They say that:
Your final rules are therefore sufficiently correct. However, return n
* fact(n - 1) does have an operation in the tail position! This is the multiplication *, which will be the last thing the function does
before it returns. In some languages, this might actually be
implemented as a function call which could then be tail-call
optimized.
However, as we see from ASM listing, multiplication is still an ASM instruction, not a separate function. So I really struggle to see difference with accumulator approach:
int fac_times (int n, int acc) {
return (n == 0) ? acc : fac_times(n - 1, acc * n);
}
int factorial (int n) {
return fac_times(n, 1);
}
This produces
recursion`fac_times:
0x1000008e0 <+0>: pushq %rbp
0x1000008e1 <+1>: movq %rsp, %rbp
0x1000008e4 <+4>: testl %edi, %edi
0x1000008e6 <+6>: je 0x1000008f7
0x1000008e8 <+8>: nopl (%rax,%rax)
0x1000008f0 <+16>: imull %edi, %esi
0x1000008f3 <+19>: decl %edi
0x1000008f5 <+21>: jne 0x1000008f0
0x1000008f7 <+23>: movl %esi, %eax
0x1000008f9 <+25>: popq %rbp
0x1000008fa <+26>: retq
Am I missing something? Or it's just compilers became smarter?
As you see in the assembly code, the compiler is smart enough to turn your code into a loop that is basically equivalent to (disregarding the different data types):
int fac(int n)
{
int result = n;
while (--n)
result *= n;
return result;
}
GCC is smart enough to know that the state needed by each call to your original f can be kept in two variables (n and result) through the whole recursive call sequence, so that no stack is necessary. It can transform f to fac_times, and both to fac, so to say. This is most likely not only a result of tail call optimization in the strictest sense, but one of the loads of other heuristics that GCC uses for optimization.
(I can't go more into detail regarding the specific heuristics that are used here since I don't know enough about them.)
The non-accumulator f isn't tail-recursive. The compiler's options include turning it into a loop by transforming it, or call / some insns / ret, but they don't include jmp f without other transformations.
tail-call optimization applies in cases like this:
int ext(int a);
int foo(int x) { return ext(x); }
asm output from godbolt:
foo: # #foo
jmp ext # TAILCALL
Tail-call optimization means leaving a function (or recursing) with a jmp instead of a ret. Anything else is not tailcall optimization. Tail-recursion that's optimized with a jmp really is a loop, though.
A good compiler will do further transformations to put the conditional branch at the bottom of the loop when possible, removing the unconditional branch. (In asm, the do{}while() style of looping is the most natural).

Define a function on assembly in different compilers

Until now, I used inline asm with clobbering what it is not the best choice to get good performance. I am starting with assembly, but I am programing in my machine (GCC), but the result code is to run in other on (ICC), both in 64 bit (Sandy Bridge & Haswell).
To call a function without arguments we can do it with a CALL, but I do not understand too well how to call a function with arguments, and because of that I am trying use an inline __asm__ inside of all function. It is a good choice?
My function:
void add_N(size_t *cnum, size_t *ap, size_t *bp, long &n, unsigned int &c){
__asm__(
//Insert my code here
);
}
And when I see the disassembly (with GCC), I have:
add_N(unsigned long*, unsigned long*, unsigned long*, long&, unsigned int&):
0x100001ff0 <+0>: pushq %rbp
0x100001ff1 <+1>: movq %rsp, %rbp
0x100001ff4 <+4>: movq %rdi, -0x8(%rbp)
0x100001ff8 <+8>: movq %rsi, -0x10(%rbp)
0x100001ffc <+12>: movq %rdx, -0x18(%rbp)
0x100002000 <+16>: movq %rcx, -0x20(%rbp)
0x100002004 <+20>: movq %r8, -0x28(%rbp)
0x100002008 <+24>: popq %rbp
0x100002009 <+25>: retq
I understand what is happening.. Will different compilers/microarchitectures always associate the same registers addresses if the function signature be the same?
Then put some code inside of my function (NOT __ASM__ CODE), and the desassembly PUSH a lot of registers. Why did it happen? Why didn't I need to push %rax and %rsi (for example), and need to push r13, r14 and r15?
If I need to push the r** registers, can I do in the inline __asm__?
0x100001ea0 <+0>: pushq %rbp
0x100001ea1 <+1>: movq %rsp, %rbp
0x100001ea4 <+4>: pushq %r15
0x100001ea6 <+6>: pushq %r14
0x100001ea8 <+8>: pushq %r13
0x100001eaa <+10>: pushq %r12
0x100001eac <+12>: pushq %rbx
0x100001ead <+13>: movq %rdi, -0x30(%rbp)
0x100001eb1 <+17>: movq %rsi, -0x38(%rbp)
0x100001eb5 <+21>: movq %rdx, -0x40(%rbp)
0x100001eb9 <+25>: movq %rcx, -0x48(%rbp)
0x100001ebd <+29>: movq %r8, -0x50(%rbp)
For the last question - yes, it will use the same register for parameters as long as they use the same ABI. The Linux x86_64 ABI is defined here: http://www.x86-64.org/documentation/abi.pdf and all compilers have to conform to it. Specifically you are interested in page 16 - Parameters Passing.
Windows have slightly different ABI I believe. So you cannot run your program or library compiled on Linux and run on Windows for example (there are some additional reasons for that though).
For details about gcc inline assembly check some existing tutorial as it's quite long topic. This is good start: http://asm.sourceforge.net/articles/rmiyagi-inline-asm.txt

How to avoid writing multiple versions of the same loop

Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The current final program does that, right?) A simple solution would be to write the loop code once for (ptr == NULL) and once for (ptr != NULL). But this would increase the amount of code making it more difficult to maintain, plus it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Use dynamically-valued constants maybe and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett. The OP wants to know if there is a better way to remove the pointer check here:
loop {
A;
if (ptr==0 || ptr->calculate()>5) B;
C;
}
than duplicating the loop as shown here:
if (ptr==0)
loop {
A;
B;
C;
}
else loop {
A;
if (ptr->calculate()>5) B;
C;
}
I just wanted to inform you, that apparently GCC can do this requested hoisting in the optimizer. Here's a model loop (in C):
struct C
{
int (*calculate)();
};
void sideeffect1();
void sideeffect2();
void sideeffect3();
void foo(struct C *ptr)
{
int i;
for (i=0;i<1000;i++)
{
sideeffect1();
if (ptr == 0 || ptr->calculate()>5) sideeffect2();
sideeffect3();
}
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, #function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although completely overkill you can put the loop in a function and use a template. This will generate twice the body of the function, but eliminate the extra check which will be optimized out. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
for(int i = x; i != y; ++i) {
/**/
if(ptr_is_null || ptr->calculate() > 5) {
/**/
}
/**/
}
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization", the compiler will probably do the RightThing(TM) for you.
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases just gives you almost twice as much code to write, test and more importantly maintain.
A 'large loop' smells like an opportunity to refactor the loop into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
Since
ptr is an object pointer set before the loop and never changed
can't you just check if it is null before the loop and not check again... since you don't change it.
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
// your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization