NULL vs throw and performance - c++

I have a class:
class Vector {
public:
    element* get(int i);
private:
    element* getIfExists(int i);
};
get invokes getIfExists; if the element exists, it is returned, and if not, some action is performed. getIfExists can signal that element i is not present either by throwing an exception or by returning NULL.
Question: would there be any difference in performance? In one case, get will need to check == NULL; in the other, try... catch.

It's a matter of design, not performance. If it's an exceptional situation (like in your get function) then throw an exception, or even better fire an assert, since violation of a function precondition is a programming error. If it's an expected case (like in your getIfExists function) then don't throw an exception.
Regarding performance, zero-cost exception implementations exist (although not all compilers use that strategy). This means that the overhead is only paid when an exception is thrown, which should happen... well... exceptionally.
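As a rough sketch of that split (size_ and data_ are made-up members, and the assert-vs-throw line depends on whether a missing element counts as a precondition violation for get):
#include <cassert>
#include <stdexcept>

// Expected case: absence is signalled with NULL, no exception involved.
element* Vector::getIfExists(int i) {
    return (i >= 0 && i < size_) ? &data_[i] : NULL;
}

// Exceptional case: get() promises an element, so a miss is an error.
element* Vector::get(int i) {
    element* e = getIfExists(i);
    assert(e != NULL && "Vector::get precondition: element i must exist");
    // or, if callers may legitimately ask for missing elements:
    // if (e == NULL) throw std::out_of_range("Vector::get");
    return e;
}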

Modern compilers implement 'zero cost' exceptions - they only incur cost when thrown, and the cost is proportional to the cleanup plus the cache miss to fetch the list of things to clean up. Therefore, if exceptions are exceptional, they can indeed be faster than return codes. And if they are unexceptional, they may be slower. And if your error occurs in a function within a function within a function call, throwing can actually do much less work than propagating return codes back up. The details are fascinating and well worth googling.
But the cost is very marginal. In a tight loop it might make a difference, but generally not.
You should write the code that is easiest to reason about and maintain, and then profile it and revisit your decision only if it's a bottleneck.

(See comments!)
Without a doubt, the return-NULL variant has better performance.
You should almost never use exceptions when a return value can do the job.
Since the method is named get, I assume NULL won't be a valid result value, so returning NULL should be the best solution.
If the caller does not test the result value, it dereferences a null pointer, producing a SIGSEGV, which is appropriate too.
If the method is rarely called, you should not care about micro optimizations at all.
Which translated method looks easier to you?
$ g++ -Os -c test.cpp
#include <cstddef>

void *get_null(int i) throw ();
void *get_throwing(int i) throw (void*);

int one(int i) {
    void *res = get_null(i);
    if(res != NULL) {
        return 1;
    }
    return 0;
}

int two(int i) {
    try {
        void *res = get_throwing(i);
        return 1;
    } catch(void *ex) {
        return 0;
    }
}
$ objdump -dC test.o
0000000000000000 <one(int)>:
0: 50 push %rax
1: e8 00 00 00 00 callq 6 <one(int)+0x6>
6: 48 85 c0 test %rax,%rax
9: 0f 95 c0 setne %al
c: 0f b6 c0 movzbl %al,%eax
f: 5a pop %rdx
10: c3 retq
0000000000000011 <two(int)>:
11: 56 push %rsi
12: e8 00 00 00 00 callq 17 <two(int)+0x6>
17: b8 01 00 00 00 mov $0x1,%eax
1c: 59 pop %rcx
1d: c3 retq
1e: 48 ff ca dec %rdx
21: 48 89 c7 mov %rax,%rdi
24: 74 05 je 2b <two(int)+0x1a>
26: e8 00 00 00 00 callq 2b <two(int)+0x1a>
2b: e8 00 00 00 00 callq 30 <two(int)+0x1f>
30: e8 00 00 00 00 callq 35 <two(int)+0x24>
35: 31 c0 xor %eax,%eax
37: eb e3 jmp 1c <two(int)+0xb>

There will certainly be a difference in performance (maybe even a very big one if you give Vector::getIfExists a throw() specification, but I'm speculating a bit here). But IMO that's missing the forest for the trees.
The money question is: are you going to call this method so many times with an out-of-bounds parameter? And if yes, why?

Yes, there would be a difference in performance: returning NULL is less expensive than throwing an exception, and checking for NULL is less expensive than catching an exception.
Addendum: But performance is only relevant if you expect that this case will happen frequently, in which case it's probably not an exceptional case anyway. In C++, it's considered bad style to use exceptions to implement normal program logic, which this seems to be: I'm assuming that the point of get is to auto-extend the vector when necessary?
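If that is indeed the intent, the non-exceptional path might look something like this (resize is a hypothetical helper, just a sketch of the idea):
// Grow-on-miss variant: a missing element is not an error, so no exception.
element* Vector::get(int i) {
    element* e = getIfExists(i);
    if (e == NULL) {
        resize(i + 1);          // hypothetical: extend storage up to index i
        e = getIfExists(i);
    }
    return e;
}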

If the caller is going to be expecting to deal with the possibility of an item not existing, you should return in a way that indicates that without throwing an exception. If the caller is not going to be prepared, you should throw an exception. Of course, the called routine isn't likely to magically know whether the caller is prepared for trouble. A few approaches to consider:
1. Microsoft's pattern is to have a Get() method, which returns the object if it exists and throws an exception if it doesn't, and a TryGet() method, which returns a Boolean indicating whether the object existed, and stores the object (if it exists) to a Ref parameter. My big complaint with this pattern is that interfaces using it cannot be covariant.
2. A variation, which I often prefer for collections of reference types, is to have Get and TryGet methods, and have TryGet return null for non-existent items. Interface covariance works much better this way.
3. A slight variation on the above, which works even for value types or unconstrained generics, is to have the TryGet method accept a Boolean by reference, and store to that Boolean a success/fail indicator. In case of failure, the code can return an unspecified object of the appropriate type (most likely default(T)).
4. Another approach, which is particularly suitable for private methods, is to pass either a Boolean or an enumerated type specifying whether a routine should return null or throw an exception in case of failure. This approach can improve the quality of generated exceptions while minimizing duplicated code. For example, if one is trying to get a packet of data from a communications pipe and the caller isn't prepared for failure, and an error occurs in a routine that reads the packet header, an exception should probably be thrown by the packet-header-read routine. If, however, the caller will be prepared not to receive a packet, the packet-header-read routine should indicate failure without throwing. The cleanest way to allow for both possibilities would be for the read-packet routine to pass an "errors will be dealt with by caller" flag to the read-packet-header routine.
5. In some contexts, it may be useful for a routine's caller to pass a delegate to be invoked in case anticipated problems arise. The delegate could attempt to resolve the problem, and do something to indicate whether the operation should be retried, the caller should return with an error code, an exception should be raised, or something else entirely should happen. This can sometimes be the best approach, but it's hard to figure out what data should be passed to the error delegate and how it should be given control over the error handling.
In practice, I tend to use #2 a lot. I dislike #1, since I feel that the return value of a function should correspond with its primary purpose.
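Translated into C++ terms for the Vector in the question, pattern #2 might look roughly like this (element, size_ and data_ are placeholders, not the asker's real members):
#include <cstddef>
#include <stdexcept>

struct element { int value; };   // placeholder element type

class Vector {
public:
    // get(): the caller is not prepared for absence, so absence throws.
    element& get(int i) {
        element* e = tryGet(i);
        if (e == NULL) throw std::out_of_range("Vector::get");
        return *e;
    }
    // tryGet(): the caller is prepared for absence, so absence returns NULL.
    element* tryGet(int i) {
        return (i >= 0 && i < size_) ? &data_[i] : NULL;
    }
private:
    int size_;
    element* data_;
};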


_CrtMemDumpAllObjectsSince not returning expected results

I'm using _CrtMemCheckpoint and _CrtMemDumpAllObjectsSince to track possible memory leaks in my dll.
In DllMain when DLL_PROCESS_ATTACH is detected an init function is called which calls _CrtMemCheckpoint(&startState) on the global _CrtMemState variable startState. When DLL_PROCESS_DETACH is detected an exit function is called that calls _CrtMemDumpAllObjectsSince(&startState). This returns
ExitInstance()Dumping objects ->
{8706} normal block at 0x07088200, 8 bytes long.
Data: <p v > 70 FF 76 07 01 01 CD CD
{8705} normal block at 0x07084D28, 40 bytes long.
Data: < > 00 00 00 10 FF FF FF FF FF FF FF FF 00 00 00 00
{4577} normal block at 0x070845F0, 40 bytes long.
Data: <dbV > 64 62 56 0F 01 00 00 00 FF FF FF FF FF FF FF FF
{166} normal block at 0x028DD4B8, 40 bytes long.
Data: <dbV > 64 62 56 0F 01 00 00 00 FF FF FF FF FF FF FF FF
{87} normal block at 0x02889BA8, 12 bytes long.
Data: < P > DC 50 90 02 00 00 00 00 01 00 00 00
So far so good, except that the last three entries (4577, 166 and 87) are also in startState, i.e. if I run _CrtDumpMemoryLeaks() in both my Init function and my Exit function, those entries appear in both lists.
The documentation says this:
_CrtMemDumpAllObjectsSince uses the value of the state parameter to determine where to initiate the dump operation. To begin dumping from
a specified heap state, the state parameter must be a pointer to a
_CrtMemState structure that has been filled in by _CrtMemCheckpoint before _CrtMemDumpAllObjectsSince was called.
Which makes me believe that items tracked in startState would be excluded from the output. At the end of the Init function where _CrtMemCheckpoint is called there have been about 4700 allocation calls. Shouldn't _CrtMemDumpAllObjectsSince only dump objects allocated after that checkpoint call?
What have I missed?
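For reference, the setup described above boils down to something like this (a minimal sketch; the DllMain plumbing and the real allocations are omitted):
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

static _CrtMemState startState;

void Init() {                           // called on DLL_PROCESS_ATTACH
    // ... allocations made up to this point are part of startState ...
    _CrtMemCheckpoint(&startState);
}

void Exit() {                           // called on DLL_PROCESS_DETACH
    // Expectation: only blocks allocated after the checkpoint get dumped.
    _CrtMemDumpAllObjectsSince(&startState);
}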
Short
The behaviour only looks strange; it does the job (in part), just in an overzealous way.
These functions are decades old, so they are not buggy, but they are not completely well designed either.
The truth is that something in the "old" blocks changes after your "since" checkpoint.
So the question becomes: "yes, it reflects a change since the checkpoint, but is it an actual leak?"
This is frequent, and amplified, with the delayed initialization a DLL does.
It is also amplified by complex objects like map/string/array/list, which delay the allocation of their internal buffers.
The bad news is that nearly all complex objects declared "static" are in fact initialized on first use.
So those changes ought to be shown by _CrtMemDumpAllObjectsSince, because their memory allocations changed.
Unfortunately the display is so crude and unfiltered that it also shows many irrelevant (unmodified) blocks.
The typical biggest culprit is "realloc", which changes the state of an old allocation.
It can look even stranger because entries may disappear:
for example, when a genuine malloc is made after the state snapshot, it effectively resets the low-water marker used for the dump to a higher level, and that magically makes a bunch of your "extra" entries disappear.
Behaviour is even more erratic if you are doing multithreading, as it easily becomes non-repeatable.
Note:
The fact that the dump doesn't show a file name and line number is a sign that it is dealing with a pre-init allocation.
So the culprits are most likely static complex objects initialized BEFORE main() (or InitInstance()).
Long:
_CrtMemDumpAllObjectsSince is painful!
In fact, its output can be so cluttered with non-useful information that it defeats the purpose for simple day-to-day use (despite the good idea behind it).
Workaround
None is simple!
You may try doing a malloc followed by a free AFTER taking your "since" snapshot, in order to nudge that marker (see the sketch just below).
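A sketch of that nudge (startState is the checkpoint variable from the question):
_CrtMemCheckpoint(&startState);
void* nudge = malloc(1);   // a fresh allocation right after the checkpoint...
free(nudge);               // ...may move the dump's low-water marker forward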
But to be safer and more in control, I unfortunately saw no way around writing my own _MyCrtMemDumpAllObjectsSince, which dumps from the original MS structure.
It was inspired by static void __cdecl dump_all_object_since_nolock(_CrtMemState const* const state) throw()
(see "debug_heap.cpp", Copyright Microsoft!).
The code is available "as is", more for inspiration.
But first, some explanation of the way _CrtMemState works:
A _CrtMemState has a pointer 'pBlockHeader', which is a link in a doubly linked list of '_CrtMemBlockHeader*'.
This list is in fact more than a snapshot at the time it is built; it is a selection (unclear how) of all the memory blocks in use, arranged in such a way that the "current state" is pointed to directly by 'pBlockHeader'.
So, starting from there:
-> _block_header_prev allows the exploration of older blocks
-> _block_header_next allows the exploration of newer blocks (the juice you are looking for in the concept of "since", but very dangerous as there is no end marker)
The TRICKY part:
MS maintains a vital internal static _CrtMemBlockHeader* called __acrt_first_block.
This __acrt_first_block continuously changes during alloc and realloc.
However, the dump in _MyCrtMemDumpAllObjectsSince starts from this __acrt_first_block and goes forward (using _block_header_next) until it finds a NULL pointer.
The first block handled is decided by this __acrt_first_block, and the 'state' you passed is nothing more than a STOP marker for the dump.
In other words, _CrtMemDumpAllObjectsSince doesn't really dump "since" your state;
it dumps from __acrt_first_block, as it currently stands, down to your "since" state.
The 'for' loop overdoes it by showing blocks from a 'start' (since) to an 'end' (oldest modified since).
This makes sense, but it also ends up dumping blocks that have NOT been modified, showing things we don't care about.
The MS structure is clever and can be used directly, although it is not guaranteed that Microsoft will keep the vital _CrtMemBlockHeader structure unchanged in the future.
But over the last 15 years I haven't seen it change a bit (nor do I foresee any reason they would change such a strategic and critical structure).
I dislike copy/pasting MS code and wrestling the linker with my piggyback code,
so the workaround I used is based on the ability to intercept the text messages sent to the "Output" window, decoding and storing ALL the information in my own bank.
Structurally, the code below gives an idea of the intercept, using a static struct under a lock to store all the info:
_CrtSetReportHook2(_CRT_RPTHOOK_INSTALL,MyReportHookDumpFilter);
_CrtMemDumpAllObjectsSince(state); // CONTRARY to what it says, this thing seems to dump everything until old_state
_CrtSetReportHook2(_CRT_RPTHOOK_REMOVE,MyReportHookDumpFilter);
_MyReportHookDumpFilterCommand(_CUST_SORT,NULL);
_MyReportHookDumpFilterCommand(_CUST_DUMP,NULL);
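For reference, the hook itself has to have the shape _CrtSetReportHook2 expects; a skeleton (the actual parsing and storage are left out, MyReportHookDumpFilter is the answer's own name) could look like:
static int __cdecl MyReportHookDumpFilter(int reportType, char* message, int* returnValue)
{
    if (reportType == _CRT_WARN && message != NULL) {
        // parse lines such as "{8706} normal block at 0x07088200, 8 bytes long."
        // and store them in your own bank (under a lock) for later filtering
    }
    if (returnValue != NULL)
        *returnValue = 0;   // no debug break
    return 1;               // report handled: suppress the default output
}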
_MyReportHookDumpFilterCommand checks for pre-existing blocks that have NOT been modified at all and avoids displaying those during its dump phase.
Take it as inspiration for code that makes the display easier to work with.
If anybody has a simpler way to use it, please share!

Do macros in C++ improve performance?

I'm a beginner in C++ and I've just read that macros work by replacing text whenever needed. In this case, does this mean that it makes the .exe run faster? And how is this different than an inline function?
For example, if I have the following macro:
#define SQUARE(x) ((x) * (x))
and normal function:
int Square(const int& x)
{
    return x*x;
}
and inline function:
inline int Square(const int& x)
{
    return x*x;
}
What are the main differences between these three and especially between the inline function and the macro? Thank you.
You should avoid using macros if possible. Inline functions are always the better choice, as they are type safe. An inline function should be as fast as a macro (if it is indeed inlined by the compiler; note that the inline keyword is not binding but just a hint to the compiler, which may ignore it if inlining is not possible).
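A small illustration of the type-safety point (not from the question): the macro is plain text and works on whatever you feed it, while the function enforces its signature:
#define SQUARE(x) ((x) * (x))
inline int Square(int x) { return x * x; }

double a = SQUARE(1.5);   // 2.25: the macro happily squares a double
double b = Square(1.5);   // 1.0:  the 1.5 is silently converted to int 1 first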
PS: as a matter of style, avoid using const Type& for parameter types that are fundamental, like int or double. Simply use the type itself, in other words, use
int Square(int x)
since a copy won't hurt performance here (passing by reference can even make it worse); see e.g. this question for more details.
Macros translate to: dumb replacement of pattern A with pattern B. This means: everything happens before the compiler kicks in. Sometimes they come in handy; but in general, they should be avoided, because you can do a lot of things with them and, later on, in the debugger, you have no idea what is going on.
Besides: your approach to performance is, well, naive, to put it kindly. First you learn the language (which is hard for modern C++, because there are a ton of important concepts and things one absolutely needs to know and understand). Then you practice, practice, practice. And then, when you really reach a point where your existing application has performance problems, do profiling to understand the real issue.
In other words: if you are interested in performance, you are asking the wrong question. You should worry much more about architecture (like: potential bottlenecks), configuration (in the sense of latency between different nodes in your system), and so on. Of course, you should apply common sense; and not write code that is obviously wasting memory or CPU cycles. But sometimes a piece of code that runs 50% slower ... might be 500% easier to read and maintain. And if execution time is then 500ms, and not 250ms; that might be totally OK (unless that specific part is called a thousand times per minute).
The difference between a macro and an inlined function is that a macro is dealt with before the compiler sees it.
On my compiler (clang++) without optimisation flags the square function won't be inlined. The code it generates looks like this
4009f0: 55 push %rbp
4009f1: 48 89 e5 mov %rsp,%rbp
4009f4: 89 7d fc mov %edi,-0x4(%rbp)
4009f7: 8b 7d fc mov -0x4(%rbp),%edi
4009fa: 0f af 7d fc imul -0x4(%rbp),%edi
4009fe: 89 f8 mov %edi,%eax
400a00: 5d pop %rbp
400a01: c3 retq
the imul is the assembly instruction doing the work, the rest is moving data around.
code that calls it looks like
400969: e8 82 00 00 00 callq 4009f0 <_Z6squarei>
If I add the -O3 flag to inline it, that imul shows up in the main function, where the function is called from in the C++ code:
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol#plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
It's a reasonable thing to do to get a basic handle on assembly language for your machine and use gcc -S on your source, or objdump -D on your binary (as I did here) to see exactly what is going on.
Using the macro instead of the inlined function gets something very similar
0000000000400a10 <main>:
400a10: 41 56 push %r14
400a12: 53 push %rbx
400a13: 50 push %rax
400a14: 48 8b 7e 08 mov 0x8(%rsi),%rdi
400a18: 31 f6 xor %esi,%esi
400a1a: ba 0a 00 00 00 mov $0xa,%edx
400a1f: e8 9c fe ff ff callq 4008c0 <strtol#plt>
400a24: 48 89 c3 mov %rax,%rbx
400a27: 0f af db imul %ebx,%ebx
Note one of the many dangers here with macros: what does this do?
x = 5; std::cout << SQUARE(++x) << std::endl;
36? Nope, 42 (on my compiler). It becomes
std::cout << ++x * ++x << std::endl;
which here evaluates as 6 * 7. (Strictly speaking, incrementing x twice in one expression like this is undefined behaviour, so the macro version is allowed to do anything at all.)
Don't be put off by people telling you not to care about optimisation. Using C or C++ as your language is an optimisation in itself. Just try to work out if you're wasting time with it and be sensible.
Macros just perform text substitution to modify source code.
As such, macros don't inherently affect performance of code. The techniques you use to design and code obviously affect performance. So the only implication of macros on performance is based on what the macro does (i.e. what code you write the macro to emit).
The big danger of macros is that they do not respect scope. The changes they make are unconditional, cross function boundaries, and things like that. There are a lot of subtleties in writing macros to make them behave as intended (avoid unintended side effects in code, avoid undefined behaviour, etc). This means code which uses macros is harder to understand, and harder to get right.
At best, with modern compilers, the performance gain you can get using macros is the same as can be achieved with inline functions - at the expense of an increased chance of the code behaving incorrectly. You are therefore better off using inline functions - unlike macros they are type-safe and work consistently with other code.
Modern compilers might choose to not inline a function, even if you have specified it as inline. If that happens, you generally don't need to worry - modern compilers are able to do a better job than most modern programmers in deciding whether a function should be inlined.
Using such a macro only makes sense if its argument is itself a #define'd constant, as the computation will then be performed by the preprocessor. Even then, double-check that the result is the expected one.
When working on ordinary variables, the (inlined) function form should be preferred because:
It is type-safe;
It handles expressions used as arguments in a consistent way. This not only covers the pre/post-increment case quoted by Peter: when the argument is itself some computation-intensive expression, the macro form forces that argument to be evaluated twice (which may not necessarily produce the same value both times, by the way) versus only once for the function; see the sketch below.
I have to admit that I used to write such macros for quick prototyping of apparently simple functions, but the time they have cost me over the years finally changed my mind!
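The double-evaluation point in a nutshell (expensive() is a made-up stand-in for a costly call):
#include <iostream>

#define SQUARE(x) ((x) * (x))
inline int Square(int x) { return x * x; }

static int calls = 0;
static int expensive() { ++calls; return 3; }   // pretend this is costly

int main() {
    calls = 0;
    int a = SQUARE(expensive());    // expands to expensive() * expensive()
    std::cout << a << " took " << calls << " call(s)\n";   // 9 took 2 call(s)

    calls = 0;
    int b = Square(expensive());    // the argument is evaluated once
    std::cout << b << " took " << calls << " call(s)\n";   // 9 took 1 call(s)
}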

Data races, UB, and counters in C++11

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:
int num_times_done_it; // global

void doit() {
    ++num_times_done_it;
    // do something
}

void report_stats() {
    printf("called doit %i times\n", num_times_done_it);
    // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:
Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.
Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
atomic<int> num_times_done_it;

void doit() {
    num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                            memory_order_relaxed);
    // as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Count for each thread separately and sum up after the threads have joined. For intermediate results you may also sum up in between, though the result might be off. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywhere if you use it often; a sketch follows below.
And - depending on compiler & platform - atomics aren't that expensive (see Herb Sutter's "atomic weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2), but in your case they'll create problems with the caches, so they're not advisable.
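A minimal sketch of such a helper (the names are made up; each thread writes only its own slot, and the relaxed atomics keep the intermediate sum in report_stats() free of data races while staying cheap and contention-free):
#include <atomic>
#include <cstddef>
#include <vector>

struct PerThreadCounter {
    explicit PerThreadCounter(std::size_t threads) : slots(threads) {}

    // Called only by thread 'tid': no contention on the slot.
    void bump(std::size_t tid) {
        slots[tid].store(slots[tid].load(std::memory_order_relaxed) + 1,
                         std::memory_order_relaxed);
    }

    // May run concurrently with bump(); the total can be slightly stale.
    long total() const {
        long sum = 0;
        for (std::size_t i = 0; i < slots.size(); ++i)
            sum += slots[i].load(std::memory_order_relaxed);
        return sum;
    }

    std::vector<std::atomic<long>> slots;
};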
It seems that the memory_order_relaxed trick is the right way to do this.
This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.
I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
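For comparison, the exactly-right variant is a relaxed fetch_add; unlike the separate load/store above it cannot lose increments, but on x86 it compiles to a lock-prefixed add, which is exactly the cost the question is trying to avoid:
std::atomic<int> num_times_done_it;

void doit() {
    num_times_done_it.fetch_add(1, std::memory_order_relaxed);  // atomic read-modify-write
    // as before
}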

Exception handler

There is this code:
char text[] = "zim";
int x = 777;
If I look on stack where x and text are placed there output is:
09 03 00 00 7a 69 6d 00
Where:
09 03 00 00 = 0x309 = 777 <- int x = 777
7a 69 6d 00 = char text[] = "zim" (ASCII code)
There is now code with try..catch:
char text[] = "zim";
try{
int x = 777;
}
catch(int){
}
Stack:
09 03 00 00 **97 85 04 08** 7a 69 6d 00
Now a new 4-byte value is placed between text and x. If I add another catch, then there will be something like:
09 03 00 00 **97 85 04 08** **xx xx xx xx** 7a 69 6d 00
and so on. I think this value is connected with exception handling and is used during stack unwinding to find the appropriate catch when an exception is thrown in the try block. The question is: what exactly is this 4-byte value (maybe some address of an exception-handler structure, or some id)?
I use g++ 4.6 on 32 bit Linux machine.
AFAICT, that's a pointer to an "unwind table". Per the Itanium ABI implementation suggestions, the process "[uses] an unwind table, [to] find information on how to handle exceptions that occur at that PC, and in particular, get the address of the personality routine for that address range."
The idea behind unwind tables is that the data needed for stack unwinding is rarely used. Therefore, it's more efficient to put a pointer on the stack and store the rest of the data in another page. In the best case, that page can remain on disk and doesn't even need to be loaded into RAM. In comparison, C-style error handling often ends up in the L1 cache because it's all inline.
Needless to say, all this is platform-dependent, etc.
This may be an address. It may point to either a code section (some handler address), or data section (pointer to a build-time-generated structure with frame info), or the stack of the same thread (pointer to a run-time-generated table of frame info).
Or it may be garbage, left over due to an alignment requirement that EH may impose.
For instance, on Win32/x86 there is no such gap. For every function that uses exception handling (has either try/catch or __try/__except/__finally, or objects with destructors), the compiler generates an EXCEPTION_RECORD structure that is allocated on the stack (by the function prolog code). Then, whenever something changes within the function (an object is created/destroyed, a try/catch block is entered/exited), the compiler adds an instruction that modifies this structure (more correctly, modifies its extension). But nothing more is allocated on the stack.

Address of function is not actual code address

Debugging some code in Visual Studio 2008 (C++), I noticed that the address in my function pointer variable is not the actual address of the function itself. This is an extern "C" function.
int main() {
    void (*printaddr)(const char *) = &print; // debug shows printaddr == 0x013C1429
}

Address: 0x013C4F10

void print(const char *) {
    ...
}
The disassembly of taking the function address is:
void (*printaddr)(const char *) = &print;
013C7465 C7 45 BC 29 14 3C 01 mov dword ptr [printaddr],offset print (13C1429h)
EDIT: I viewed the code at address 013C4F10 and the compiler is apparently inserting a "jmp" instruction at that address.
013C4F10 E9 C7 3F 00 00 jmp print (013C1429h)
There is actually a whole jump table of every method in the .exe.
Can someone expound on why it does this? Is it a debugging "feature" ?
That is caused by 'Incremental Linking'. If you disable that in your compiler/linker settings the jumps will go away.
http://msdn.microsoft.com/en-us/library/4khtbfyf(VS.80).aspx
I'm going to hazard a guess here, but it's possibly there to enable Edit-and-Continue.
Say you need to recompile that function, you only need to change the indirection table, not all callers. That would dramatically reduce the amount of work to do when the Edit-and-Continue feature is being exercised.
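As a rough analogy (not what the linker literally emits): an extra level of indirection means callers never need to be patched when the target moves, only the table entry does:
#include <cstdio>

static void print_v1(const char* s) { std::printf("v1: %s\n", s); }
static void print_v2(const char* s) { std::printf("v2: %s\n", s); }

// The "thunk": every caller dispatches through this one slot.
static void (*print_slot)(const char*) = &print_v1;

int main() {
    print_slot("hello");      // goes through the slot to the current body
    print_slot = &print_v2;   // a "recompile" only has to update the slot
    print_slot("hello");
}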
The compiler is inserting a "jmp" instruction at that address to the real method.
013C4F10 E9 C7 3F 00 00 jmp print (013C1429h)
There is actually a whole jump table of every method in the .exe.
It is a Debugging feature. When I switch to release mode the jump table goes away and the address is indeed the actual function address.