Does C++ have any runtime metaprogramming functionality?

By "runtime metaprogramming" I mean changing the actual code at runtime.
For example, take this code:
while (true) {
    if (flag) {
        // do stuff
    }
    // do other stuff
}
Let's say something happens so that flag will always be false (or always true), so there is no need to keep checking its value. It would be nice to just "get rid" of the if statement. Obviously better design up front could handle this, but it's just an example.

C++ does not have facilities for modifying code at runtime like that. This is because C++ code is compiled to machine code, and modifying machine code is very difficult, insecure, and non-portable. If you're interested, see the Wikipedia article on Self-modifying code.
In theory, if you really needed to, you could hoist the test for flag further back by writing three sets of functions (one where flag is known to be always true, one where it's always false, and one where it could be either) and switching between those sets of functions at an earlier time. However, the code complexity and performance impact of maintaining three separate copies of the functions will not be worth the microscopic speedup from removing one easily predicted branch.
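To make the idea concrete, here is a minimal sketch of that approach (the enum, names, and dispatch mechanism are my own illustration, not part of the original code, and it assumes C++17 for if constexpr; a plain if on the constant would be folded just as well by an optimizing compiler). The loop body is compiled once per "state of knowledge" about flag, so the specialized instantiations contain no test at all:
#include <atomic>

enum class FlagKnowledge { Unknown, AlwaysTrue, AlwaysFalse };

template <FlagKnowledge K>
void run_loop(std::atomic<bool>& flag) {
    while (true) {
        bool take_branch;
        if constexpr (K == FlagKnowledge::Unknown)
            take_branch = flag.load(std::memory_order_relaxed); // real runtime test
        else
            take_branch = (K == FlagKnowledge::AlwaysTrue);     // folded to a constant
        if (take_branch) {
            // do stuff
        }
        // do other stuff
    }
}

void run(std::atomic<bool>& flag, bool settled, bool settled_value) {
    // Dispatch once, "at an earlier time", rather than testing every iteration.
    if (!settled)           run_loop<FlagKnowledge::Unknown>(flag);
    else if (settled_value) run_loop<FlagKnowledge::AlwaysTrue>(flag);
    else                    run_loop<FlagKnowledge::AlwaysFalse>(flag);
}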
If you're concerned about the performance of your application, you will find better opportunities for optimization elsewhere.

Does C++ have any runtime metaprogramming functionality?
The abstract machine that a compiler has to have in mind when optimizing and organizing instructions does not involve the volatile nature of someone flipping a bit somewhere in memory that it can't control. That "somewhere" may not even exist after the compiler has done its job.
In the case of the true check in your loop:
None of the compilers I know will actually produce code that checks if there is a true there. It's just an infinite loop.
Why? The compiler has to produce a program that acts "as if" it did everything you instructed it to do. If you instruct it to check whether true is true, the "as-if" rule lets the compiler rely with complete certainty on the fact that you want this loop to go on forever. No runtime check is necessary.
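As a tiny illustration (assuming optimizations are enabled), these two fragments compile to the same unconditional loop; the "test" in the second one simply disappears under the as-if rule:
while (true) {
    // work
}

while (1 == 1) {    // the constant check is removed entirely
    // work
}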

Related

When should you not use [[carries_dependency]]?

I've found questions (like this one) asking what [[carries_dependency]] does, and that's not what I'm asking here.
I want to know when you shouldn't use it, because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code. One comment said the code can be equal or slower, but the poster didn't elaborate.
I imagine appropriate places to use this are on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread, and that it shouldn't be used on callbacks or thread entry points.
Can someone comment on my understanding and elaborate on the subject in general, of when and when not to use it?
EDIT: I know there's this tome on the subject, should any other reader be interested; it may contain my answer, but I haven't had the chance to read through it yet.
In modern C++ you should generally not use std::memory_order_consume or [[carries_dependency]] at all. They're essentially deprecated while the committee comes up with a better mechanism that compilers can practically implement.
And that hopefully doesn't require sprinkling [[carries_dependency]] and kill_dependency all over the place.
2016-06 P0371R1: Temporarily discourage memory_order_consume
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire. The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the kill_dependency call, and from the frequent need for [[carries_dependency]] annotations. Details can be found in e.g. P0098R0.
Notably, in C++ x - x still carries a dependency, but most compilers would naturally break the dependency and replace that expression with a constant 0. Compilers also sometimes turn data dependencies into control dependencies if they can prove something about value ranges after a branch.
On modern compilers that just promote mo_consume to mo_acquire, fully aggressive optimizations can always happen; there's never anything to gain from [[carries_dependency]] and kill_dependency even in code that uses mo_consume, let alone in other code.
This strengthening to mo_acquire has potentially-significant performance cost (an extra barrier) for real use-cases like RCU on weakly-ordered ISAs like POWER and ARM. See this video of Paul E. McKenney's CppCon 2015 talk C++ Atomics: The Sad Story of memory_order_consume. (Link includes a summary).
If you want real dependency-ordering read-only performance, you have to "roll your own", e.g. by using mo_relaxed and checking the asm to verify it compiled to asm with a dependency. (Avoid doing anything "weird" with such a value, like passing it across functions.) DEC Alpha is basically dead and all other ISAs provide dependency ordering in asm without barriers, as long as the asm itself has a data dependency.
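For illustration, a minimal sketch of that "roll your own" pattern, with made-up names; whether the data dependency actually survives into the generated code is exactly what you would have to verify by inspecting the asm:
#include <atomic>

struct Node { int payload; };

std::atomic<Node*> g_ptr{nullptr};

// Writer: publish a fully constructed node with release ordering.
void publish(Node* n) {
    g_ptr.store(n, std::memory_order_release);
}

// Reader: a relaxed load followed by a data-dependent dereference. On ARM and
// POWER the hardware orders the dereference after the load because the address
// depends on the loaded value -- provided the compiler keeps that dependency,
// which the standard does not promise for mo_relaxed.
int read_payload() {
    Node* n = g_ptr.load(std::memory_order_relaxed);
    return n ? n->payload : 0;
}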
If you don't want to roll your own and live dangerously, it might not hurt to keep using mo_consume in "simple" use-cases where it should be able to work; perhaps some future mo_consume implementation will have the same name and work in a way that's compatible with C++11.
There is ongoing work on making a new consume, e.g. 2018's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html
because the answers I've read all make it sound like you can plaster
this code everywhere and magically you'd get equal or faster code
The only way you can get faster code is when that annotation allows the omission of a fence.
So the only case where it could possibly be useful is:
your program uses consume ordering on an atomic load operation, in an important frequently executed code;
the "consume value" isn't just used immediately and locally, but also passed to other functions;
the target CPU gives specific guarantees for consume operations (as strong as a given fence before that operation, but applying only to that operation);
the compiler writers take their job seriously: they manage to translate the high-level-language consume of a value into a CPU-level consume, to get the benefit of the CPU's guarantees.
That's a bunch of necessary conditions to possibly get measurably faster code.
(And the latest trend in the C++ community is to give up inventing a proper compiling scheme that's safe in all cases and to come up with a completely different way for the user to instruct the compiler to produce code that "consumes" values, with much more explicit, naively translatable, C++ code.)
One comment said the code can be equal or slower, but the poster
didn't elaborate.
Of course, annotations of the kind that you can randomly put on programs simply cannot make code more efficient in general! That would be too easy, and also self-contradictory.
Either an annotation specifies a constraint on your code, i.e. a promise to the compiler, and you can't put it anywhere it doesn't correspond to a guarantee in the code (like noexcept in C++ or restrict in C), or it breaks the code in various ways (an exception escaping a noexcept function terminates the program; aliasing of restrict-qualified pointers can cause funny miscompilation and bad behavior, formally undefined in that case). The compiler can then use the promise to optimize the code in specific ways.
Or the annotation doesn't constrain the code in any way, in which case the compiler can't count on anything and the annotation creates no additional optimization opportunity.
If an annotation gives you more efficient code in some cases, at no cost of breaking the program, then you must potentially get less efficient code in other cases. That's true in general and specifically true for consume semantics, which impose the previously described constraints on the translation of C++ constructs.
I imagine appropriate places to use this is on any function return or
parameter that is a pointer or reference and that will be passed or
returned within the calling thread
No, the one and only case where it might be useful is when the intended calling function will probably use consume memory order.
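For completeness, a hypothetical sketch of the only shape of code where the annotation was ever intended to matter: a consume load whose value is passed into another function. The type and function names are illustrative, and on today's compilers, which promote consume to acquire, the attribute changes nothing:
#include <atomic>

struct Config { int value; };

std::atomic<Config*> g_cfg{nullptr};

// The attribute claims the dependency chain continues through this parameter,
// so (in principle) no fence is needed at the call boundary.
int read_value(Config* cfg [[carries_dependency]]) {
    return cfg->value;
}

int consumer() {
    Config* c = g_cfg.load(std::memory_order_consume);
    return read_value(c);   // dependency is (notionally) carried into the callee
}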

C++ - put an expression into a register and use it in assembly

How can I evaluate an expression and place the result in a register, use that register in inline assembly, and then store the result back into a variable?
For example:
EAX = a[i];              // any expression that is valid in C++
__asm xor eax, 0xFFFF    // do something with it in assembly
b[i] = EAX;              // and then put the result in some variable
By the way, the reason is for performance.
Several compilers have compiler specific ways of accomplishing this. But it's almost never worth doing.
Here is a list of reasons why this is almost never worth doing:
The compiler will usually generate better code than you can write by hand.
Even if it doesn't, you can frequently tweak your code slightly to convince the compiler to write code that's at least as good as you could write, and still have your program remain portable.
The code that has the perceived performance issue is not actually critical to performance because the program spends 0.01% of its time there.
You want your program to stay standard C++ and don't want to clutter it with tons of #ifdef guards.
The example you've shown is not very compelling.
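That said, if you really must, here is a sketch of one compiler-specific way to do it, using GCC/Clang extended inline asm (AT&T syntax, x86 only; the function and variable names are just for illustration). MSVC's __asm blocks, as in the question, are a different mechanism and only exist for 32-bit x86 targets.
#include <cstdint>

// Let the compiler pick a register, XOR it with 0xFFFF in asm, and hand
// the result back to C++.
inline uint32_t xor_ffff(uint32_t v) {
    asm("xorl $0xFFFF, %0"   // operate on whatever register was chosen for v
        : "+r"(v));          // "+r": read-write operand kept in a register
    return v;
}

void transform(const uint32_t* a, uint32_t* b, int i) {
    b[i] = xor_ffff(a[i]);   // the compiler handles loading a[i] and storing b[i]
}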

Performance penalty for "if error then fail fast" in C++?

Is there any performance difference (in C++) between the two styles of writing if-else, as shown below (logically equivalent code) for the likely1 == likely2 == true path (likely1 and likely2 are meant here as placeholders for some more elaborate conditions)?
// Case (1):
if (likely1) {
    Foo();
    if (likely2) {
        Bar();
    } else {
        Alarm(2);
    }
} else {
    Alarm(1);
}
vs.
// Case (2):
if (!likely1) {
    Alarm(1);
    return;
}
Foo();
if (!likely2) {
    Alarm(2);
    return;
}
Bar();
I'd be very grateful for information on as many compilers and platforms as possible (but with gcc/x86 highlighted).
Please note I'm not interested in readability opinions on those two styles, neither in any "premature optimisation" claims.
EDIT: In other words, I'd like to ask if the two styles are at some point considered fully-totally-100% equivalent/transparent by a compiler (e.g. bit-by-bit equivalent AST at some point in a particular compiler), and if not, then what are the differences? For any (with a preference towards "modern" and gcc) compiler you know.
And, to make it more clear, I too don't suppose that it's going to give me much of a performance improvement, and that it usually would be premature optimization, but I am interested in whether and how much it can improve/degrade anything?
It depends greatly on the compiler, and the optimization settings. If the difference is crucial - implement both, and either analyze the assembly, or do benchmarks.
I have no answers for specific platforms, but I can make a few general points:
The traditional answer on non-modern processors without branch prediction, is that the first is likely to be more efficient since in the common case it takes fewer branches. But you seem interested in modern compilers and processors.
On modern processors, generally speaking, short forward branches are not expensive, whereas mispredicted branches may be expensive. By "expensive", of course, I mean a few cycles.
Quite aside from this, the compiler is entitled to order basic blocks however it likes provided it doesn't change the logic. So when you write if (blah) {foo();} else {bar();}, the compiler is entitled to emit code like:
    evaluate condition blah
    jump_if_true foo_label
    bar()
    jump endif_label
foo_label:
    foo()
endif_label:
On the whole, gcc tends to emit things in roughly the order you write them, all else being equal. There are various things which make all else not equal, for example if you have the logical equivalent of bar(); return in two different places in your function, gcc might well coalesce those blocks, emit only one call to bar() followed by return, and jump or fall through to that from two different places.
There are two kinds of branch prediction - static and dynamic. Static means that the CPU instructions for the branch specify whether the condition is "likely", so that the CPU can optimize for the common case. Compilers might emit static branch predictions on some platforms, and if you're optimizing for that platform you might write code to take account of that. You can take account of it either by knowing how your compiler treats the various control structures, or by using compiler extensions. Personally I don't think it's consistent enough to generalize about what compilers will do. Look at the disassembly.
Dynamic branch prediction means that in hot code, the CPU keeps statistics on how likely branches are to be taken, and optimizes for the common case. Modern processors use various different dynamic branch prediction techniques: http://en.wikipedia.org/wiki/Branch_predictor. Performance-critical code pretty much is hot code, and as long as the dynamic branch prediction strategy works, it very rapidly optimizes hot code. There might be certain pathological cases that confuse particular strategies, but in general you can say that anything in a tight loop with a bias towards taken/not taken will be correctly predicted most of the time.
Sometimes it doesn't even matter whether the branch is correctly predicted or not, since some CPUs in some cases will include both possibilities in the instruction pipeline while it's waiting for the condition to be evaluated, and ditch the unnecessary option. Modern CPUs get complicated. Even much simpler CPU designs have ways of avoiding the cost of branching, though, such as conditional instructions on ARM.
Calls out of line to other functions will upset all such guesswork anyway. So in your example there may be small differences, and those differences may depend on the actual code in Foo, Bar and Alarm. Unfortunately it's not possible to distinguish between significant and insignificant differences, or to account for details of those functions, without getting into the "premature optimization" accusations you're not interested in.
It's almost always premature to micro-optimize code that isn't written yet. It's very hard to predict the performance of functions named Foo and Bar. Presumably the purpose of the question is to discern whether there's any common gotcha that should inform coding style. To which the answer is that, thanks to dynamic branch prediction, there is not. In hot code it makes very little difference how your conditions are arranged, and where it does make a difference that difference isn't as easily predictable as "it's faster to take / not take the branch in an if condition".
If this question was intended to apply to just one single program with this code proven to be hot, then of course it can be tested, there's no need to generalize.
It is compiler dependent. Check out the gcc documentation on __builtin_expect. Your compiler may have something similar. Note that you really should be concerned about premature optimization.
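For reference, a minimal sketch of what that looks like with GCC/Clang (the macros and the reuse of the question's Foo/Bar/Alarm names are illustrative); C++20 later added the [[likely]] and [[unlikely]] attributes as a portable alternative. Whether the hint changes the generated code at all depends on the compiler and target.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

void Foo();
void Bar();
void Alarm(int);

void process(bool likely1, bool likely2) {
    if (UNLIKELY(!likely1)) { Alarm(1); return; }  // hint: error path is rare
    Foo();
    if (UNLIKELY(!likely2)) { Alarm(2); return; }
    Bar();
}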
The answer depends a lot on the type of "likely". If it is an integer constant expression, the compiler can optimize it and both cases will be equivalent. Otherwise, it will be evaluated at runtime and can't be optimized much.
Thus, case 2 is generally more efficient than case 1.
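A minimal sketch of that constant-expression case (illustrative; any mainstream compiler folds the branch away at -O1 and above, so the two arrangements compile to identical code):
constexpr bool likely1 = true;   // an integer constant expression

int f(int x) {
    if (likely1)        // no runtime test is emitted
        return x + 1;
    return x - 1;
}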
As input from real-time embedded systems, which I work with, your "case 2" is often the norm for code that is safety- and/or performance critical. Style guides for safety-critical embedded systems often allow this syntax so a function can quit quickly upon errors.
Generally, style guides will frown upon the "case 2" syntax, but make an exception to allow several returns in one function if either
1) the function needs to quit quickly and handle the error, or
2) a single return at the end of the function would lead to less readable code, which is often the case for various protocol and data parsers.
If you are this concerned about performance, I assume you are using profile guided optimization.
If you are using profile guided optimization, the two variants you have proposed are exactly the same.
In any event, the performance of what you are asking about is completely overshadowed by performance characteristics of things not evident in your code samples, so we really can not answer this. You have to test the performance of both.
Though I'm with everyone else here insofar as optimizing a branch makes no sense without having profiled and actually having found a bottleneck... if anything, it makes sense to optimize for the likely case.
Both likely1 and likely2 are likely, as their names suggest. Thus handling the (also likely) combination of both being true first would likely be fastest:
if (likely1 && likely2)
{
    ... // happens most of the time
}
else
{
    if (likely1)
        ...
    if (likely2)
        ...
    else if (!likely1 && !likely2) // happens almost never
        ...
}
Note that the second else is probably not necessary, a decent compiler will figure out that the last if clause cannot possibly be true if the previous one was, even if you don't explicitly tell it.

How to know what optimizations are done automatically by my compiler

I was going through this link Will it optimize and wondered how we can know what optimizations are done by a particular compiler.
Like does VC8.0 convert if-else statements to switch-case?
Is such information available on msdn?
Since everyone seems bent on telling the OP that he shouldn't worry about it, here is some useful (although not as specific as the OP requested) information about compiler optimization (options).
You'll have to figure out what flags you're using, especially for MSVC and Intel (GCC release build should default to -O2), but here are the links:
GCC
MSVC
Intel
This is about as close as you'll get before disassembling your binary after compilation.
It depends on the level of optimization you choose for the compiler.
You can find a very nice article about it here.
First of all, if optimization took place, your program should usually run faster. After that you could inspect the disassembly to find out what kinds of optimizations were performed.
I don't know anything about VC8.0, so I'm not sure how you would access that information. However, if you are generally interested in the kinds of optimisations that go on and want to experiment, I recommend you use LLVM. You can look at the unoptimised, disassembled byte code generated from the default C front end, and then run various optimiser passes over it to see what the effect is each time. Because it's a nicer, abstract assembly code, it tends to be a little easier to figure out what is an optimisation derivable from the code and what is a machine-specific optimisation.
Like does VC8.0 convert if-else statements to switch-case?
Compilers do not magically rewrite your source code. And even if they did, what would that tell you? What you really want to know is whether the compiler compiled it into a jump table or into multiple compare operations. Any disassembler will tell you that.
To clarify my point: Writing a switch-case statement does not necessarily imply that there will be a jump table in the binary. Not needing to worry about this is the whole point of having compilers.
Instead of figuring out which optimizations are done by the compiler in general, it's probably better to NOT have any dependencies on such compiler-specific knowledge.
Instead start out with a good design and algorithm, writing (as much as possible) portable code that's easy to follow. Then profile the code if it's too slow and fix the actual hotspots. Compiler optimizations are useful no doubt, but better is to apply some investigation to what's actually happening in the code. Algorithmic/design improvements at the source level will typically help performance more than the presence or absence of optimizations like transforming if/else into switch-case.
I'm not sure what "convert if/else to switch/case" means. My processor doesn't have a hardware switch/case instruction.
Typical compilers have several different ways to implement switch/case. A well-known one is using a jump table, but this is only done if appropriate.
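For example (illustrative only; whether a jump table is emitted depends on the compiler, the optimization level, and how dense the case values are):
// A dense, contiguous switch like this is a typical jump-table candidate;
// a sparse one (cases 1, 1000, 50000, ...) is more likely to become a chain
// of compares or a small binary search.
int classify(int opcode) {
    switch (opcode) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        case 4: return 50;
        case 5: return 60;
        default: return -1;
    }
}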
For if/else, certainly it is normal for compilers to analyse a digraph of execution flow. I would expect a compiler to notice if each condition references the same variable, and I would expect the compiler to treat equivalent forms of conditionals the same way in general. But this isn't something I'd worry about.
IIRC, the general policy in GCC is that regressions in optimisation are tolerable so long as preferred improvements result. Optimisation is complex and what is "generally" a good optimisation isn't always that great. Plus for perfect optimisation, the compiler would have to know things it can't know (e.g. what inputs it will encounter in real life).
The point is that it really isn't worthwhile knowing that much about specific optimisations unless you happen to be a compiler developer. If you depend on something being optimised by V8, that particular optimisation might not happen in V9 or V10.

Should I use a function in a situation where it would be called an extreme number of times?

I have a section of my program that contains a large amount of math with some rather long equations. It's long and unsightly, and I wish to replace it with a function. However, this chunk of code is used an extreme number of times in my code and also requires a lot of variables to be initialized.
If I'm worried about speed, is the cost of calling the function and initializing the variables negligible here, or should I stick to directly coding it in each time?
Most compilers are smart about inlining reasonably small functions to avoid the overhead of a function call. For functions big enough that the compiler won't inline them, the overhead for the call is probably a very small fraction of the total execution time.
Check your compiler documentation to understand its specific approach. Some older compilers required or could benefit from hints that a function is a candidate for inlining.
Either way, stick with functions and keep your code clean.
Are you asking if you should optimize prematurely?
Code it in a maintainable manner first; if you then find that this section is a bottleneck in the overall program, worry about tuning it at that point.
You don't know where your bottlenecks are until you profile your code. Anything you can assume about your code hot spots is likely to be wrong. I remember once I wanted to optimize some computational code. I ran a profiler and it turned out that 70% of the running time was spent zeroing arrays. Nobody would have guessed it by looking at the code.
So, first code clean, then run a profiler, then optimize the rough spots. Not earlier. If it's still slow, change algorithm.
Modern C++ compilers generally inline small functions to avoid function call overhead. As far as the cost of variable initialization, one of the benefits of inlining is that it allows the compiler to perform additional optimizations at the call site. After performing inlining, if the compiler can prove that you don't need those extra variables, the copying will likely be eliminated. (I assume we're talking about primitives, not things with copy constructors.)
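As a minimal sketch (the function is hypothetical, standing in for the question's long equations): a short helper like this is a prime inlining candidate at -O2, so both the call overhead and the temporary variables typically disappear at the call site.
// Hypothetical helper standing in for "a large amount of math".
inline double damped_value(double x, double k, double t) {
    const double decay = k * t;               // the "extra variable"
    return x * (1.0 - decay) + decay * 0.5;   // folded into the caller after inlining
}

double accumulate(const double* xs, int n, double k, double t) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += damped_value(xs[i], k, t);     // called an "extreme number" of times
    return sum;
}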
The only way to answer that is to test it. Without knowing more about the proposed function, nobody can really say whether the compiler can/will inline that code or not. This may/will also depend on the compiler and compiler flags you use. Depending on the compiler, if you find that it's really a problem, you may be able to use different flags, a pragma, etc., to force it to be generated inline even if it wouldn't be otherwise.
Without knowing how big the function would be, and/or how long it'll take to execute, it's impossible to guess how much effect on speed it'll have if it isn't generated inline.
With both of those being unknown, none of us can really guess at how much effect moving the code into a function will have. There might be none, or little or huge.