I've got a crash occurring on some users computers in a C++ audiounit component running inside Logic X. I can't repeat it locally unfortunately and in the process of trying to work out how it might occur I've got some questions.
Here's the relevant info from the crash dump:
Exception Type: EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Trace/BPT trap: 5
Termination Reason: Namespace SIGNAL, Code 0x5
Terminating Process: exc handler [0]
The questions are:
What might cause a EXC_BREAKPOINT in the situation I'm looking at. Is this information from Apple complete and accurate: "Similar to an Abnormal Exit, this exception is intended to give an attached debugger the chance to interrupt the process at a specific point in its execution. You can trigger this exception from your own code using the __builtin_trap() function. If no debugger is attached, the process is terminated and a crash report is generated."
Why would it occur on SharedObject + 200 (see disassembly)
Is RBX the 'this' pointer at the moment the crash occurs.
The crash occurs here:
juce::ValueTree::SharedObject::SharedObject(juce::ValueTree::SharedObject const&) + 200
The C++ is as follows:
SharedObject (const SharedObject& other)
: ReferenceCountedObject(),
type (other.type), properties (other.properties), parent (nullptr)
{
for (int i = 0; i < other.children.size(); ++i)
{
SharedObject* const child = new SharedObject (*other.children.getObjectPointerUnchecked(i));
child->parent = this;
children.add (child);
}
}
The disassembly:
-> 0x127167950 <+0>: pushq %rbp
0x127167951 <+1>: movq %rsp, %rbp
0x127167954 <+4>: pushq %r15
0x127167956 <+6>: pushq %r14
0x127167958 <+8>: pushq %r13
0x12716795a <+10>: pushq %r12
0x12716795c <+12>: pushq %rbx
0x12716795d <+13>: subq $0x18, %rsp
0x127167961 <+17>: movq %rsi, %r12
0x127167964 <+20>: movq %rdi, %rbx
0x127167967 <+23>: leaq 0x589692(%rip), %rax ; vtable for juce::ReferenceCountedObject + 16
0x12716796e <+30>: movq %rax, (%rbx)
0x127167971 <+33>: movl $0x0, 0x8(%rbx)
0x127167978 <+40>: leaq 0x599fe9(%rip), %rax ; vtable for juce::ValueTree::SharedObject + 16
0x12716797f <+47>: movq %rax, (%rbx)
0x127167982 <+50>: leaq 0x10(%rbx), %rdi
0x127167986 <+54>: movq %rdi, -0x30(%rbp)
0x12716798a <+58>: leaq 0x10(%r12), %rsi
0x12716798f <+63>: callq 0x12711cf70 ; juce::Identifier::Identifier(juce::Identifier const&)
0x127167994 <+68>: leaq 0x18(%rbx), %rdi
0x127167998 <+72>: movq %rdi, -0x38(%rbp)
0x12716799c <+76>: leaq 0x18(%r12), %rsi
0x1271679a1 <+81>: callq 0x12711c7b0 ; juce::NamedValueSet::NamedValueSet(juce::NamedValueSet const&)
0x1271679a6 <+86>: movq $0x0, 0x30(%rbx)
0x1271679ae <+94>: movl $0x0, 0x38(%rbx)
0x1271679b5 <+101>: movl $0x0, 0x40(%rbx)
0x1271679bc <+108>: movq $0x0, 0x48(%rbx)
0x1271679c4 <+116>: movl $0x0, 0x50(%rbx)
0x1271679cb <+123>: movl $0x0, 0x58(%rbx)
0x1271679d2 <+130>: movq $0x0, 0x60(%rbx)
0x1271679da <+138>: cmpl $0x0, 0x40(%r12)
0x1271679e0 <+144>: jle 0x127167aa2 ; <+338>
0x1271679e6 <+150>: xorl %r14d, %r14d
0x1271679e9 <+153>: nopl (%rax)
0x1271679f0 <+160>: movl $0x68, %edi
0x1271679f5 <+165>: callq 0x12728c232 ; symbol stub for: operator new(unsigned long)
0x1271679fa <+170>: movq %rax, %r13
0x1271679fd <+173>: movq 0x30(%r12), %rax
0x127167a02 <+178>: movq (%rax,%r14,8), %rsi
0x127167a06 <+182>: movq %r13, %rdi
0x127167a09 <+185>: callq 0x127167950 ; <+0>
0x127167a0e <+190>: movq %rbx, 0x60(%r13) // MY NOTES: child->parent = this
0x127167a12 <+194>: movl 0x38(%rbx), %ecx
0x127167a15 <+197>: movl 0x40(%rbx), %eax
0x127167a18 <+200>: cmpl %eax, %ecx
Update 1:
It looks like RIP is suggesting we are in the middle of the 'add' call which is this function, inlined:
/** Appends a new object to the end of the array.
This will increase the new object's reference count.
#param newObject the new object to add to the array
#see set, insert, addIfNotAlreadyThere, addSorted, addArray
*/
ObjectClass* add (ObjectClass* const newObject) noexcept
{
data.ensureAllocatedSize (numUsed + 1);
jassert (data.elements != nullptr);
data.elements [numUsed++] = newObject;
if (newObject != nullptr)
newObject->incReferenceCount();
return newObject;
}
Update 2:
At the point of crash register values of relevant registers:
this == rbx: 0x00007fe5bc37c950
&other == r12: 0x00007fe5bc348cc0
rax = 0
rcx = 0
There may be a few problems in this code:
like SM mentioned, other.children.getObjectPointerUnchecked(i) could return nullptr
in ObjectClass* add (ObjectClass* const newObject) noexcept, you check if newObject isn't null before calling incReferenceCount (which means a null may occur in this method call), but you don't null-check before adding this object during data.elements [numUsed++] = newObject;, so you may have a nullptr here, and if you call this array somewhere else without checking, you may have a crash.
we don't really know the type of data.elements (I assume an array), but there could be a copy operation occuring because of the pointer assignment (if the pointer is converted to object, if data.elements is not an array of pointer or there is some operator overload), there could be a crash here (but it's unlikely)
there is a circular ref between children and parent, so there may be a problem during object destruction
I suspect that the problem is that a shared, ref-counted object that should have been placed on the heap was inadvertently allocated on the stack. If the stack unwinds and is overwritten afterwards, and a reference to the ref-counted object still exists, then random things will happen at unpredictable times when that reference is accessed. (The proximity in address-space of this and &other also makes me suspect that both are on the stack.)
Related
I am writing a program that has a shared state between assembly and C++. I declared a global array in the assembly file and accessed that array in a function within C++. When I call that function from within C++, there are no issues, but then I call that same function from within assembly and I get a segmentation fault. I believe I preserved the right registers across function calls.
Strangely, when I change the type of the pointer within C++ to a uint64_t pointer, it correctly outputs the values but then segmentation faults again after casting it to a uint64_t.
In the following code, the array which keeps giving me errors is currentCPUState.
//CPU.cpp
extern uint64_t currentCPUState[6];
extern "C" {
void initInternalState(void* instructions, int indexSize);
void printCPUState();
}
void printCPUState() {
uint64_t b = currentCPUState[0];
printf("%d\n", b); //this line DOESNT crash ???
std::cout << b << "\n"; //this line crashes
//omitted some code for the sake of brevity
std::cout << "\n";
}
CPU::CPU() {
//set initial cpu state
currentCPUState[AF] = 0;
currentCPUState[BC] = 0;
currentCPUState[DE] = 0;
currentCPUState[HL] = 0;
currentCPUState[SP] = 0;
currentCPUState[PC] = 0;
printCPUState(); //this has no issues
initInternalState(instructions, sizeof(void*));
}
//cpu.s
.section .data
.balign 8
instructionArr:
.space 8 * 1024, 0
//stores values of registers
//used for transitioning between C and ASM
//uint64_t currentCPUState[6]
.global currentCPUState
currentCPUState:
.quad 0, 0, 0, 0, 0, 0
.section .text
.global initInternalState
initInternalState:
push %rdi
push %rsi
mov %rcx, %rdi
mov %rdx, %rsi
push %R12
push %R13
push %R14
push %R15
call initGBCpu
pop %R15
pop %R14
pop %R13
pop %R12
pop %rsi
pop %rdi
ret
//omitted unimportant code
//initGBCpu(rdi: void* instructions, rsi:int size)
//function initializes the array of opcodes
initGBCpu:
pushq %rdx
//move each instruction into the array in proper order
//also fill the instructionArr
leaq instructionArr(%rip), %rdx
addop inst0x00
addop inst0x01
addop inst0x02
addop inst0x03
addop inst0x04
call loadCPUState
call inst0x04 //inc BC
call saveCPUState
call printCPUState //CRASHES HERE
popq %rdx
ret
Additional details:
OS: Windows 64 bit
Compiler (MinGW64-w)
Architecture: x64
Any insight would be much appreciated
Edit:
addop is a macro:
//adds an opcode to the array of functions
.macro addop lbl
leaq \lbl (%rip), %rcx
mov %rcx, 0(%rdi)
mov %rcx, 0(%rdx)
add %rsi, %rdi
add %rsi, %rdx
.endm
Some of x86-64 calling conventions require that the stack have to be alligned to 16-byte boundary before calling functions.
After functions are called, a 8-byte return address is pushed on the stack, so another 8-byte data have to be added to the stack to satisfy this allignment requirement. Otherwise, some instruction with allignment requirement (like some of the SSE instructions) may crash.
Assumign that such calling conventions are applied, the initGBCpu function looks OK, but the initInternalState function have to add one more 8-byte thing to the stack before calling the initInternalState function.
For example:
initInternalState:
push %rdi
push %rsi
mov %rcx, %rdi
mov %rdx, %rsi
push %R12
push %R13
push %R14
push %R15
sub $8, %rsp // adjust stack allignment
call initGBCpu
add $8, %rsp // undo the stack pointer movement
pop %R15
pop %R14
pop %R13
pop %R12
pop %rsi
pop %rdi
ret
I'm developing a terrain generator in OpenGL and encounter a strange bug. I can define how long the triangles that the terrain is made from are like this:
#define TRIANGLE_WIDTH 0.25f
#define INVERSE_TRIANGLE_WIDTH 4
With those sizes I create arrays, that are used to store the data of the terrain. I have made some typedefs for the array that look like this:
typedef std::array<std::array<glm::vec3, CHUNK_SIDE_LENGHT>, CHUNK_SIDE_LENGHT> MapDataVec3Type;
typedef std::array<std::array<glm::vec2, CHUNK_SIDE_LENGHT>, CHUNK_SIDE_LENGHT> MapDataVec2Type;
I am using OpenGLs index based rendering. To minimize memory use I automatically calculate if unsigned shorts are sufficient or if I need unsigned ints like this:
#define CHUNK_SIDE_LENGHT (CHUNK_WIDTH * INVERSE_TRIANGLE_WIDTH)
#define CHUNK_SIDE_LENGHT_FLOAT (float)(CHUNK_SIDE_LENGHT)
#define CHUNK_ARRAY_SIZE CHUNK_SIDE_LENGHT * CHUNK_SIDE_LENGHT
#if CHUNK_ARRAY_SIZE >= 65535
#define MAP_INDICES_ARRAY_TYPE unsigned int
#define MAP_INDICES_ARRAY_GL_TYPE GL_UNSIGNED_INT
#else
#define MAP_INDICES_ARRAY_TYPE unsigned short
#define MAP_INDICES_ARRAY_GL_TYPE GL_UNSIGNED_SHORT
#endif
This all works but when the TRIANGLE_WIDTH gets smaller then 0.125 I get a crash when running the program.
I am using Xcode and whenever a program crashes, Xcode shows me where the error is in the assembly code:
SDL-OpenGL-Tests-3`main:
0x100030760 <+0>: pushq %rbp
0x100030761 <+1>: movq %rsp, %rbp
0x100030764 <+4>: subq $0x807f60, %rsp ; imm = 0x807F60
0x10003076b <+11>: movq 0x5b936(%rip), %rax ; (void *)0x00007fff92ccdd40: __stack_chk_guard
0x100030772 <+18>: movq (%rax), %rax
0x100030775 <+21>: movq %rax, -0x8(%rbp)
0x100030779 <+25>: movl $0x0, -0x291c(%rbp)
0x100030783 <+35>: movl %edi, -0x2920(%rbp)
0x100030789 <+41>: movq %rsi, -0x2928(%rbp)
0x100030790 <+48>: leaq 0x4357b(%rip), %rsi ; "/dev/urandom"
0x100030797 <+55>: leaq -0x2948(%rbp), %rax
0x10003079e <+62>: movq %rax, %rdi
-> 0x1000307a1 <+65>: movq %rax, -0x807178(%rbp)
0x1000307a8 <+72>: callq 0x100030090 ; std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::basic_string<std::nullptr_t> at string:817
0x1000307ad <+77>: leaq -0x2930(%rbp), %rdi
The error is marked by the arrow. From my limited knowledge of assembly I think this part is corespondent to the start of my int main() that looks like this:
int main(int argc, const char * argv[]) {
std::random_device randomDevice;
std::mt19937 randomEngine(randomDevice());
bool capFps = true;
int fpsCap = 60;
double frameTimeCap = double(1e6) / double(fpsCap);
std::chrono::steady_clock::time_point start, end;
std::cout << "Main Thread ID: " << std::this_thread::get_id() << std::endl;
I really don't get how my terrain generation is affecting the random functions of C++ since I don't use the random engine in my terrain generation, I use it to randomly place some light sources.
Update:
Because elsamuko pointed out that I should interchange the C++ random function with some pseudo random number I removed the random functions and now just set all the random values to zero and it still crashes now with this code:
SDL-OpenGL-Tests-3`main:
0x10002fc30 <+0>: pushq %rbp
0x10002fc31 <+1>: movq %rsp, %rbp
0x10002fc34 <+4>: subq $0x807510, %rsp ; imm = 0x807510
0x10002fc3b <+11>: movsd 0x5aacd(%rip), %xmm0 ; typeinfo name for Sphere + 12, xmm0 = mem[0],zero
0x10002fc43 <+19>: movq 0x5c45e(%rip), %rax ; (void *)0x00007fff92ccdd40: __stack_chk_guard
0x10002fc4a <+26>: movq (%rax), %rax
0x10002fc4d <+29>: movq %rax, -0x8(%rbp)
0x10002fc51 <+33>: movl $0x0, -0x291c(%rbp)
0x10002fc5b <+43>: movl %edi, -0x2920(%rbp)
0x10002fc61 <+49>: movq %rsi, -0x2928(%rbp)
0x10002fc68 <+56>: movb $0x1, -0x2929(%rbp)
0x10002fc6f <+63>: movl $0x3c, -0x2930(%rbp)
0x10002fc79 <+73>: cvtsi2sdl -0x2930(%rbp), %xmm1
0x10002fc81 <+81>: divsd %xmm1, %xmm0
0x10002fc85 <+85>: movsd %xmm0, -0x2938(%rbp)
0x10002fc8d <+93>: leaq -0x2940(%rbp), %rdi
-> 0x10002fc94 <+100>: callq 0x100038310 ; std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::time_point at chrono:1338
0x10002fc99 <+105>: leaq -0x2948(%rbp), %rdi
0x10002fca0 <+112>: callq 0x100038310 ; std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >::time_point at chrono:1338
0x10002fca5 <+117>: movq 0x5c384(%rip), %rdi ; (void *)0x00007fff92814760: std::__1::cout
0x10002fcac <+124>: leaq 0x44007(%rip), %rsi ; "Main Thread ID: "
0x10002fcb3 <+131>: callq 0x10007132c ; symbol stub for: std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::operator<<<std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*)
It seems as the code just pics the first command it gets to crash.
The crash is due to a stack overflow that occurs when the gets too large. The simplest solution is to use a std::unique_ptr, because this array will be allocated on the heap.
I have a dynamically loadable bundle on Mac OS X and it crashes on Mac OS X Sierra. It works fine on earlier Mac OS X versions (at least on 64-bit) and also (this is a cross-platform plug-in and this code is shared) on Windows, both 32- and 64-bit. The bundle is compiled on Mac OS X 10.11 with clang in universal mode without optimizations against 10.11 SDK; the minimal OS version was set first to 10.4, then to 10.11 without any effect.
It crashes in a rather simple function. I don't have access to a computer with Sierra, so all I have is a crash log, the binary that crashes, and the source. I'm trying to read the assembly dump to understand what might be wrong. So far everything seems to be OK (with some redundant operations, but no error), but I'm not fluent with assembly. Below are the relevant parts:
The crash log:
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000140009f3cabc
Exception Note: EXC_CORPSE_NOTIFY
...
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 ... 0x0000000109f2825b Do_GetString2(unsigned char, unsigned long, unsigned short*) + 171
...
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0000000000000000 rbx: 0x00007fcb95adc528 rcx: 0x0000140009f3cabc rdx: 0x0000000000000000
rdi: 0x0000000000000082 rsi: 0x0000000000000100 rbp: 0x00007fff5b5f5f10 rsp: 0x00007fff5b5f5f10
r8: 0x00000000000014c0 r9: 0x0000000000000001 r10: 0x0000000106c823fc r11: 0x0000000106c82400
r12: 0x0000000000000082 r13: 0x0000000000000000 r14: 0x00007fcb95adc552 r15: 0xffffffffffffffff
rip: 0x0000000109f2825b rfl: 0x0000000000010287 cr2: 0x0000140009f3cabc
The Do_GetString2 is a simple function that copies a few short ASCII strings into wider UTF-16 strings:
static void
Do_GetString2(uint8_t type, uint64_t limit, uint16_t *target)
{
const char *source; uint64_t i;
switch (type) {
case 128 : source = strings[0]; break;
case 129 : source = strings[1]; break;
case 131 : source = strings[2]; break;
}
for (i = 0; i < limit && source[i] != '\0'; ++i)
target[i] = source[i];
}
strings is like that:
static const char* strings[] = {
"... A ...",
"... B ...",
"... C ..."
};
The assembly code is below. The crashing address, as far as I undersand, is (+171) 000000000000125b, which means it crashes when it tries to read source[i], the very first character (i is 0). The type parameter is 131, the limit is 256 (0x100). (I could be wrong here, as I said, I'm not that familiar with assembly.) This is probably the very first call to the library.
__ZL13Do_GetString2hmPt:
00000000000011b0 pushq %rbp
00000000000011b1 movq %rsp, %rbp
00000000000011b4 movb %dil, %al
00000000000011b7 movb %al, -0x1(%rbp)
00000000000011ba movq %rsi, -0x10(%rbp)
00000000000011be movq %rdx, -0x18(%rbp)
00000000000011c2 movzbl -0x1(%rbp), %edi
00000000000011c6 movl %edi, %ecx
00000000000011c8 subl $0x80, %ecx
00000000000011ce movl %edi, -0x2c(%rbp)
00000000000011d1 movl %ecx, -0x30(%rbp)
00000000000011d4 je 0x121b
00000000000011da jmp 0x11df
00000000000011df movl -0x2c(%rbp), %eax
00000000000011e2 subl $0x81, %eax
00000000000011e7 movl %eax, -0x34(%rbp)
00000000000011ea je 0x122b
00000000000011f0 jmp 0x11f5
00000000000011f5 movl -0x2c(%rbp), %eax
00000000000011f8 subl $0x83, %eax
00000000000011fd movl %eax, -0x38(%rbp)
0000000000001200 jne 0x1236
0000000000001206 jmp 0x120b
000000000000120b movq __ZL14plugin_strings(%rip), %rax ## plugin_strings
0000000000001212 movq %rax, -0x20(%rbp)
0000000000001216 jmp 0x1236
000000000000121b movq 0x15406(%rip), %rax
0000000000001222 movq %rax, -0x20(%rbp)
0000000000001226 jmp 0x1236
000000000000122b movq 0x153fe(%rip), %rax
0000000000001232 movq %rax, -0x20(%rbp)
0000000000001236 movq $0x0, -0x28(%rbp)
000000000000123e xorl %eax, %eax
0000000000001240 movb %al, %cl
0000000000001242 movq -0x28(%rbp), %rdx
0000000000001246 cmpq -0x10(%rbp), %rdx
000000000000124a movb %cl, -0x39(%rbp)
000000000000124d jae 0x126d
0000000000001253 movq -0x28(%rbp), %rax
0000000000001257 movq -0x20(%rbp), %rcx
000000000000125b movsbl (%rcx,%rax), %edx # CRASHES HERE
000000000000125f cmpl $0x0, %edx
0000000000001265 setne %sil
0000000000001269 movb %sil, -0x39(%rbp)
000000000000126d movb -0x39(%rbp), %al
0000000000001270 testb $0x1, %al
0000000000001272 jne 0x127d
0000000000001278 jmp 0x12ad
000000000000127d movq -0x28(%rbp), %rax
0000000000001281 movq -0x20(%rbp), %rcx
0000000000001285 movb (%rcx,%rax), %dl
0000000000001288 movsbl %dl, %esi
000000000000128b movw %si, %di
000000000000128e movq -0x28(%rbp), %rax
0000000000001292 movq -0x18(%rbp), %rcx
0000000000001296 movw %di, (%rcx,%rax,2)
000000000000129a movq -0x28(%rbp), %rax
000000000000129e addq $0x1, %rax
00000000000012a4 movq %rax, -0x28(%rbp)
00000000000012a8 jmp 0x123e
00000000000012ad popq %rbp
00000000000012ae retq
00000000000012af nop
Does anyone see any error here? What could I try to debug this crash? This is a GUI application, so I'm not sure I can debug it with gdb or something like that over SSH.
My code looks like this:
template<typename F>
void printHello(F f)
{
f("Hello!");
}
int main() {
std::string buf;
printHello([&buf](const char*msg) { buf += msg; });
printHello([&buf]() { });
}
The question is - how can I restrict the F type to accept only lambdas that have a signature void(const char*), so that the second call to printHello doesn't fail at some obscure place inside printHello but instead on the line that calls printHello incorrectly?
==EDIT==
I know that std::function can solve it in this particular case (is what I'd use if I really wanted to print 'hello'). But std::function is really something else and comes at a cost (however small that cost is, as of today, April 2016, GCC and MSVC cannot optimize away the virtual call). So my question can be seen as purely academic - is there a "template" way to solve it?
unless you're using an ancient standard library, std::function will have optimisations for small function objects (of which yours is one). You will see no performance reduction whatsoever.
People who tell you not to use std::function because of performance reasons are the very same people who 'optimise' code before measuring performance bottlenecks.
Write the code that expresses intent. IF it becomes a performance bottleneck (it won't) then look at changing it.
I once worked on a financial forwards pricing system. Someone decided that it ran too slowly (64 cores, multiple server boxes, hundreds of thousands of discrete algorithms running in parallel in a massive DAG). So we profiled it.
What did we find?
The processing took almost no time at all. The program spent 99% of its time converting doubles to strings and strings to doubles at the boundaries of the IO, where we had to communicate with a message bus.
Using a lambda in place of a std::function for the callbacks would have made no difference whatsoever.
Write elegant code. Express your intent clearly. Compile with optimisations. Marvel as the compiler does its job and turns your 100 lines of c++ into 5 machine code instructions.
A simple demonstration:
#include <functional>
// external function forces an actual function call
extern void other_func(const char* p);
// try to force std::function to call polymorphically
void test_it(const std::function<void(const char*)>& f, const char* p)
{
f(p);
}
int main()
{
// make our function object
auto f = std::function<void(const char*)>([](const char* p) { other_func(p); });
const char* const data[] = {
"foo",
"bar",
"baz"
};
// call it in a tight loop
for(auto p : data) {
test_it(f, p);
}
}
compile with apple clang, -O2:
result:
.globl _main
.align 4, 0x90
_main: ## #main
Lfunc_begin1:
.cfi_startproc
.cfi_personality 155, ___gxx_personality_v0
.cfi_lsda 16, Lexception1
## BB#0: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i
#
# the normal stack frame stuff...
#
pushq %rbp
Ltmp13:
.cfi_def_cfa_offset 16
Ltmp14:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp15:
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %rbx
subq $72, %rsp
Ltmp16:
.cfi_offset %rbx, -40
Ltmp17:
.cfi_offset %r14, -32
Ltmp18:
.cfi_offset %r15, -24
movq ___stack_chk_guard#GOTPCREL(%rip), %rbx
movq (%rbx), %rbx
movq %rbx, -32(%rbp)
leaq -80(%rbp), %r15
movq %r15, -48(%rbp)
#
# take the address of std::function's vtable... we'll need it (once)
#
leaq __ZTVNSt3__110__function6__funcIZ4mainE3$_0NS_9allocatorIS2_EEFvPKcEEE+16(%rip), %rax
#
# here's the tight loop...
#
movq %rax, -80(%rbp)
leaq L_.str(%rip), %rdi
movq %rdi, -88(%rbp)
Ltmp3:
#
# oh look! std::function's call has been TOTALLY INLINED!!
#
callq __Z10other_funcPKc
Ltmp4:
LBB1_2: ## %_ZNSt3__110__function6__funcIZ4mainE3$_0NS_9allocatorIS2_EEFvPKcEEclEOS6_.exit
## =>This Inner Loop Header: Depth=1
#
# notice that the loop itself uses more instructions than the call??
#
leaq L_.str1(%rip), %rax
movq %rax, -88(%rbp)
movq -48(%rbp), %rdi
testq %rdi, %rdi
je LBB1_1
## BB#3: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i.1
## in Loop: Header=BB1_2 Depth=1
#
# destructor called once (constant time, therefore irrelevant)
#
movq (%rdi), %rax
movq 48(%rax), %rax
Ltmp5:
leaq -88(%rbp), %rsi
callq *%rax
Ltmp6:
## BB#4: ## in Loop: Header=BB1_2 Depth=1
leaq L_.str2(%rip), %rax
movq %rax, -88(%rbp)
movq -48(%rbp), %rdi
testq %rdi, %rdi
jne LBB1_5
#
# the rest of this function is exception handling. Executed at most
# once, in exceptional circumstances. Therefore, irrelevant.
#
LBB1_1: ## in Loop: Header=BB1_2 Depth=1
movl $8, %edi
callq ___cxa_allocate_exception
movq __ZTVNSt3__117bad_function_callE#GOTPCREL(%rip), %rcx
addq $16, %rcx
movq %rcx, (%rax)
Ltmp10:
movq __ZTINSt3__117bad_function_callE#GOTPCREL(%rip), %rsi
movq __ZNSt3__117bad_function_callD1Ev#GOTPCREL(%rip), %rdx
movq %rax, %rdi
callq ___cxa_throw
Ltmp11:
jmp LBB1_2
LBB1_9: ## %.loopexit.split-lp
Ltmp12:
jmp LBB1_10
LBB1_5: ## %_ZNKSt3__18functionIFvPKcEEclES2_.exit.i.2
movq (%rdi), %rax
movq 48(%rax), %rax
Ltmp7:
leaq -88(%rbp), %rsi
callq *%rax
Ltmp8:
## BB#6:
movq -48(%rbp), %rdi
cmpq %r15, %rdi
je LBB1_7
## BB#15:
testq %rdi, %rdi
je LBB1_17
## BB#16:
movq (%rdi), %rax
callq *40(%rax)
jmp LBB1_17
LBB1_7:
movq -80(%rbp), %rax
leaq -80(%rbp), %rdi
callq *32(%rax)
LBB1_17: ## %_ZNSt3__18functionIFvPKcEED1Ev.exit
cmpq -32(%rbp), %rbx
jne LBB1_19
## BB#18: ## %_ZNSt3__18functionIFvPKcEED1Ev.exit
xorl %eax, %eax
addq $72, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
Can we stop arguing about performance now please?
You can add compile time checking for template parameters by defining constraints.
This'll allow to catch such errors early and you also won't have runtime overhead as no code is generated for a constraint using current compilers.
For example we can define such constraint:
template<class F, class T> struct CanCall
{
static void constraints(F f, T a) { f(a); }
CanCall() { void(*p)(F, T) = constraints; }
};
CanCall checks (at compile time) that a F can be called with T.
Usage:
template<typename F>
void printHello(F f)
{
CanCall<F, const char*>();
f("Hello!");
}
As a result compilers also give readable error messages for a failed constraint.
Well... Just use SFINAE
template<typename T>
auto printHello(T f) -> void_t<decltype(f(std::declval<const char*>()))> {
f("hello");
}
And void_t is implemented as:
template<typename...>
using void_t = void;
The return type will act as a constraint on parameter sent to your function. If the expression inside the decltype cannot be evaluated, it will result in an error.
Inside a large loop, I currently have a statement similar to
if (ptr == NULL || ptr->calculate() > 5)
{do something}
where ptr is an object pointer set before the loop and never changed.
I would like to avoid comparing ptr to NULL in every iteration of the loop. (The current final program does that, right?) A simple solution would be to write the loop code once for (ptr == NULL) and once for (ptr != NULL). But this would increase the amount of code making it more difficult to maintain, plus it looks silly if the same large loop appears twice with only one or two lines changed.
What can I do? Use dynamically-valued constants maybe and hope the compiler is smart? How?
Many thanks!
EDIT by Luther Blissett. The OP wants to know if there is a better way to remove the pointer check here:
loop {
A;
if (ptr==0 || ptr->calculate()>5) B;
C;
}
than duplicating the loop as shown here:
if (ptr==0)
loop {
A;
B;
C;
}
else loop {
A;
if (ptr->calculate()>5) B;
C;
}
I just wanted to inform you, that apparently GCC can do this requested hoisting in the optimizer. Here's a model loop (in C):
struct C
{
int (*calculate)();
};
void sideeffect1();
void sideeffect2();
void sideeffect3();
void foo(struct C *ptr)
{
int i;
for (i=0;i<1000;i++)
{
sideeffect1();
if (ptr == 0 || ptr->calculate()>5) sideeffect2();
sideeffect3();
}
}
Compiling this with gcc 4.5 and -O3 gives:
.globl foo
.type foo, #function
foo:
.LFB0:
pushq %rbp
.LCFI0:
movq %rdi, %rbp
pushq %rbx
.LCFI1:
subq $8, %rsp
.LCFI2:
testq %rdi, %rdi # ptr==0? -> .L2, see below
je .L2
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
call sideeffect1 # sideeffect1
xorl %eax, %eax
call *0(%rbp) # call p->calculate, no check for ptr==0
cmpl $5, %eax
jle .L3
xorl %eax, %eax
call sideeffect2 # ok, call sideeffect2
.L3:
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L4
addq $8, %rsp
.LCFI3:
xorl %eax, %eax
popq %rbx
.LCFI4:
popq %rbp
.LCFI5:
ret
.L2: # here's the loop with ptr==0
.LCFI6:
movl $1000, %ebx
.p2align 4,,10
.p2align 3
.L6:
xorl %eax, %eax
call sideeffect1 # does not try to call ptr->calculate() anymore
xorl %eax, %eax
call sideeffect2
xorl %eax, %eax
call sideeffect3
subl $1, %ebx
jne .L6
addq $8, %rsp
.LCFI7:
xorl %eax, %eax
popq %rbx
.LCFI8:
popq %rbp
.LCFI9:
ret
And so does clang 2.7 (-O3):
foo:
.Leh_func_begin1:
pushq %rbp
.Llabel1:
movq %rsp, %rbp
.Llabel2:
pushq %r14
pushq %rbx
.Llabel3:
testq %rdi, %rdi # ptr==NULL -> .LBB1_5
je .LBB1_5
movq %rdi, %rbx
movl $1000, %r14d
.align 16, 0x90
.LBB1_2:
xorb %al, %al # here's the loop with the ptr->calculate check()
callq sideeffect1
xorb %al, %al
callq *(%rbx)
cmpl $6, %eax
jl .LBB1_4
xorb %al, %al
callq sideeffect2
.LBB1_4:
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_2
jmp .LBB1_7
.LBB1_5:
movl $1000, %r14d
.align 16, 0x90
.LBB1_6:
xorb %al, %al # and here's the loop for the ptr==NULL case
callq sideeffect1
xorb %al, %al
callq sideeffect2
xorb %al, %al
callq sideeffect3
decl %r14d
jne .LBB1_6
.LBB1_7:
popq %rbx
popq %r14
popq %rbp
ret
In C++, although completely overkill you can put the loop in a function and use a template. This will generate twice the body of the function, but eliminate the extra check which will be optimized out. While I certainly don't recommend it, here is the code:
template<bool ptr_is_null>
void loop() {
for(int i = x; i != y; ++i) {
/**/
if(ptr_is_null || ptr->calculate() > 5) {
/**/
}
/**/
}
}
You call it with:
if (ptr==NULL) loop<true>(); else loop<false>();
You are better off without this "optimization", the compiler will probably do the RightThing(TM) for you.
Why do you want to avoid comparing to NULL?
Creating a variant for each of the NULL and non-NULL cases just gives you almost twice as much code to write, test and more importantly maintain.
A 'large loop' smells like an opportunity to refactor the loop into separate functions, in order to make the code easier to maintain. Then you can easily have two variants of the loop, one for ptr == null and one for ptr != null, calling different functions, with just a rough similarity in the overall structure of the loop.
Since
ptr is an object pointer set before the loop and never changed
can't you just check if it is null before the loop and not check again... since you don't change it.
If it is not valid for your pointer to be NULL, you could use a reference instead.
If it is valid for your pointer to be NULL, but if so then you skip all processing, then you could either wrap your code with one check at the beginning, or return early from your function:
if (ptr != NULL)
{
// your function
}
or
if (ptr == NULL) { return; }
If it is valid for your pointer to be NULL, but only some processing is skipped, then keep it like it is.
if (ptr == NULL || ptr->calculate() > 5)
{do something}
I would simply think in terms of what is done if the condition is true.
If "do something" is really the exact same stuff for (ptr == NULL) or (ptr->calculate() > 5), then I hardly see a reason to split up anything.
If "do something" contains particular cases for either condition, then I would consider to refactor into separate loops to get rid of extra special case checking. Depends on the special cases involved.
Eliminating code duplication is good up to a point. You should not care too much about optimizing until your program does what it should do and until performance becomes a problem.
[...] Premature optimization is the root of all evil
http://en.wikipedia.org/wiki/Program_optimization