Dear all,
I have a C++ program that should be as fast as possible.
In particular, there is a crucial part that works as follows:
There is a variable D ranging over [0,255]; based on its value,
the program calls the function whose pointer is stored in an array F (with 255 elements),
i.e., F[D] gives the pointer to the function to call.
The functions are very simple; each executes a few expressions and/or assignments (no loops).
How can I do this faster?
I can replicate the code of the functions. I do not need the features of function calls (I used them only because they were the simplest way to do this).
My goal is to remove the inefficiencies due to the function calls.
I am considering using switch/case.
The code of each case is the code of the corresponding function.
Is there a faster way to do it?
A switch/case may be faster, as the compiler can make more specific optimizations, and there are other benefits such as code locality compared to calling through a function pointer. However, it's unlikely that you'll notice a substantial performance increase.
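To make the comparison concrete, here is a minimal sketch of the two dispatch styles (the handler names and bodies are made up for illustration):
using Handler = unsigned (*)(unsigned);

unsigned op_add(unsigned v) { return v + 1; }     // hypothetical handler
unsigned op_xor(unsigned v) { return v ^ 0x5A; }  // hypothetical handler

Handler F[256] = { op_add, op_xor /* , ...the remaining handlers... */ };

// Table version: one load plus an indirect call per dispatch.
unsigned dispatch_table(unsigned char D, unsigned v) { return F[D](v); }

// Switch version: the compiler sees every body and can inline and reorder them.
unsigned dispatch_switch(unsigned char D, unsigned v) {
    switch (D) {
        case 0:  return v + 1;     // body of op_add, inlined
        case 1:  return v ^ 0x5A;  // body of op_xor, inlined
        // ... one case per value of D ...
        default: return v;
    }
}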
Convert it to a switch
Make sure the tiny functions can be inlined (how to do this is compiler dependent - but explicitly marking them inline and placing them in the same compilation unit is a good bet)
Finally, you might try profile-guided optimization if (as seems quite possible) your switch statement takes its branches in a non-uniform fashion.
The aim is to avoid function-call overhead and let the compiler reorder the switch statement to reduce the number of branches typically encountered.
Edit: since there are some skeptical voices as to how much this helps, I gave it a shot. In the following program, the function-pointer loop needs 1.07 seconds, and the inlined switch statement takes 0.79 seconds on my machine - YMMV:
// 8 families of tiny operations; F<n%8,n> selects the family at compile time
template<unsigned n0,unsigned n> struct F { static inline unsigned func(unsigned val); };
template<unsigned n> struct F<0,n> { static inline unsigned func(unsigned val) { return val + n;} };
template<unsigned n> struct F<1,n> { static inline unsigned func(unsigned val) { return val - n; } };
template<unsigned n> struct F<2,n> { static inline unsigned func(unsigned val) { return val ^ n; } };
template<unsigned n> struct F<3,n> { static inline unsigned func(unsigned val) { return val * n; } };
template<unsigned n> struct F<4,n> { static inline unsigned func(unsigned val) { return (val << ( n %16)) + n*(n&0xff); } };
template<unsigned n> struct F<5,n> { static inline unsigned func(unsigned val) { return (val >> ( n %16)) + (n*(n&0xff) << 16); } };
template<unsigned n> struct F<6,n> { static inline unsigned func(unsigned val) { return val / (n|1) + val; } };
template<unsigned n> struct F<7,n> { static inline unsigned func(unsigned val) { return (val <<16) + (val>>16); } };
template<unsigned n> struct f { static inline unsigned func(unsigned val) { return F<n%8,n>::func(val); } };
typedef unsigned (*fPtr)(unsigned);
fPtr funcs[256];
// Recursively fill funcs[n0..n1) at compile time; splitting the range in half
// keeps the template instantiation depth logarithmic.
template<unsigned n0,unsigned n1> inline void fAssign() {
    if(n0==n1-1 || n0==n1) // ||n0==n1 just to avoid a compiler warning
        funcs[n0] = f<n0>::func;
    else {
        fAssign<n0,(n0 + n1)/2>();
        fAssign<(n0 + n1)/2,n1>();
    }
}
__forceinline unsigned funcSwitch(unsigned char type,unsigned val);//huge function elided
__declspec(noinline) unsigned doloop(unsigned val,unsigned start,unsigned end) {
    for(unsigned x=start;x<end;++x)
        val = funcs[x*37&0xff](val);
    return val;
}

__declspec(noinline) unsigned doloop2(unsigned val,unsigned start,unsigned end) {
    for(unsigned x=start;x<end;++x)
        val = funcSwitch(x*37&0xff,val);
    return val;
}
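(The body of funcSwitch is elided above; in shape it is just a 256-way switch whose cases invoke the same f<n>::func bodies so the compiler can inline them. A hypothetical reconstruction of the pattern, not the original code:)
__forceinline unsigned funcSwitch(unsigned char type, unsigned val) {
    switch(type) {
        case 0:   return f<0>::func(val);
        case 1:   return f<1>::func(val);
        // ... one case per value, presumably generated, mirroring fAssign ...
        case 255: return f<255>::func(val);
    }
    return val; // unreachable for 8-bit input
}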
I verified that all function calls are inlined except those to doloop, doloop2 and funcs[?] to ensure I'm not measuring odd compiler choices.
So, on this machine, with MSC 10, this (thoroughly artificial) benchmark shows that the huge-switch version is about a third faster than the function-pointer-lookup-based version. PGO slowed both versions down, probably because they're too small to exhibit cache effects and the program is small enough to be fully inlined/optimized even without PGO.
You're not going to beat the lookup-table-based indirect call with any other method, including switch. A switch will take at best logarithmic time (well, it might hash to get constant time, but that'll likely be slower than logarithmic time with ints for most input sizes).
This is assuming your code looks like this:
typedef void (*myProc_t)();
myProc_t functionArray[255] = { ... };
void CallSomepin(unsigned char D)
{
    functionArray[D]();
}
If you're creating the array each time the function is called, however, it might be a good idea to amortize the cost of construction by initializing the function array once rather than every time.
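For example, a function-local static is initialized exactly once (a sketch of the idea):
void CallSomepin(unsigned char D)
{
    // Initialized on first use only, not on every call (thread-safe since C++11).
    static const myProc_t functionArray[255] = { /* ... */ };
    functionArray[D]();
}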
EDIT: That said, the best way to avoid the inefficiency of the indirect call is simply to not do it. Look at your code and see if there are places you can replace the indirect lookup based call with a direct call.
There is a variable D ranging over [0,255]; based on its value, the program calls the function whose pointer is stored in an array F (with 255 elements)
Warning, this will result in a buffer overflow if D has the value 255.
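A minimal fix is to size the array so that every value of an unsigned char indexes in bounds:
myProc_t functionArray[256] = { ... }; // 256 entries: D == 255 now stays in bounds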
Related
I am amazed at how many topics on Stack Overflow deal with finding out the endianness of the system and converting endianness. I am even more amazed that there are hundreds of different answers to these two questions. All proposed solutions that I have seen so far are based on undefined behaviour, non-standard compiler extensions, or OS-specific header files. In my opinion, this question is only a duplicate if an existing answer gives a standard-compliant, efficient (e.g., compiles to x86 bswap), compile-time-enabled solution.
Surely there must be a standard-compliant solution available that I am unable to find in the huge mess of old "hacky" ones. It is also somewhat strange that the standard library does not include such a function. Perhaps the attitude towards such issues is changing, since C++20 introduced a way to detect endianness into the standard (via std::endian), and C++23 will probably include std::byteswap, which flips endianness.
In any case, my questions are these:
Starting at what C++ standard is there a portable standard-compliant way of performing host to network byte order conversion?
I argue below that it's possible in C++20. Is my code correct and can it be improved?
Should such a pure-c++ solution be preferred to OS specific functions such as, e.g., POSIX-htonl? (I think yes)
I think I can give a C++23 solution that is OS-independent, efficient (no OS-specific calls; compiles to x86 bswap), and portable to little-endian and big-endian systems (but not to mixed-endian systems):
// requires C++23. see https://gcc.godbolt.org/z/6or1sEvKn
#include <concepts>    // std::integral (was missing)
#include <type_traits>
#include <utility>
#include <bit>         // std::endian, std::byteswap
constexpr inline auto host_to_net(std::integral auto i) {
    static_assert(std::endian::native == std::endian::big || std::endian::native == std::endian::little);
    if constexpr (std::endian::native == std::endian::big) {
        return i;
    } else {
        return std::byteswap(i);
    }
}
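A quick compile-time sanity check of the function above (round-tripping must be the identity on any supported platform):
static_assert(host_to_net(host_to_net(0x12345678u)) == 0x12345678u);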
Since std::endian is available in C++20, one can give a C++20 solution for host_to_net by implementing byteswap manually. A solution is described here, quote:
// requires C++17
#include <climits>
#include <cstddef>     // std::size_t
#include <cstdint>
#include <type_traits>
#include <utility>     // std::index_sequence (was missing)

template<class T, std::size_t... N>
constexpr T bswap_impl(T i, std::index_sequence<N...>) {
    return ((((i >> (N * CHAR_BIT)) & (T)(unsigned char)(-1)) <<
             ((sizeof(T) - 1 - N) * CHAR_BIT)) | ...);
}                                        // ^~~~~ fold expression

template<class T, class U = typename std::make_unsigned<T>::type>
constexpr U bswap(T i) {
    return bswap_impl<U>(i, std::make_index_sequence<sizeof(T)>{});
}
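Since this bswap is constexpr, it can be checked entirely at compile time, e.g.:
static_assert(bswap(std::uint32_t{0x12345678}) == 0x78563412u);
static_assert(bswap(std::uint16_t{0xABCD}) == 0xCDAB);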
The linked answer also provides a C++11 byteswap, but that one seems to be less efficient (it is not compiled to x86 bswap). I think there should be an efficient C++11 way of doing this too (using either less template nonsense or even more), but I don't care about older C++ and didn't really try.
Assuming I am correct, the remaining question is: can one determine system endianness before C++20 at compile time in a standard-compliant and compiler-agnostic way? None of the answers here seem to achieve this. They use reinterpret_cast (not compile time), OS headers, union aliasing (which I believe is UB in C++), etc. Also, for some reason, they try to do it "at runtime", although a compiled executable will always run under the same endianness.
One could do it outside of a constexpr context and hope it's optimized away. Alternatively, one could use system-defined preprocessor definitions and account for all platforms, as seems to be the approach taken by Boost. Or maybe (although I would guess the other way is better?) use macros and pick platform-specific htonl-style functions from networking libraries (done, e.g., here (GitHub))?
compile-time-enabled solution.
Consider whether this is a useful requirement in the first place. The program isn't going to be communicating with another system at compile time. In what situation would you need to use the serialised integer in a compile-time constant context?
Starting at what C++ standard is there a portable standard-compliant way of performing host to network byte order conversion?
It's possible to write such a function in standard C++ since C++98. That said, later standards bring tasty template goodies that make this nicer.
There is no such function in the standard library as of the latest standard.
Should such a pure-c++ solution be preferred to OS specific functions such as, e.g., POSIX-htonl? (I think yes)
The advantage of POSIX is that it's less important to write tests to make sure it works correctly.
The advantage of a pure C++ function is that you don't need platform-specific alternatives for systems that don't conform to POSIX.
Also, the POSIX htonX functions exist only for 16-bit and 32-bit integers. You could use the htobeXX functions instead, which are available on some *BSDs and on Linux (glibc).
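For illustration, using those functions looks like this (an assumption about your platform: on Linux/glibc they live in <endian.h>, on several BSDs in <sys/endian.h>):
#include <endian.h>   // glibc; <sys/endian.h> on some BSDs
#include <stdint.h>

uint64_t to_wire(uint64_t host)   { return htobe64(host); } // host -> big endian
uint64_t from_wire(uint64_t wire) { return be64toh(wire); } // big endian -> host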
Here is what I have been using since C++17. Some notes beforehand:
Since endianness conversion is always[1] for purposes of serialisation, I write the result directly into a buffer. When converting to host endianness, I read from a buffer.
I don't use CHAR_BIT because the network doesn't know my byte size anyway. A network byte is an octet, and if your CPU's bytes are different, then these functions won't work. Correct handling of non-octet bytes is possible but unnecessary work unless you need to support network communication on such a system. Adding an assert might be a good idea.
I prefer to call it big endian rather than "network" endian. There's a chance that a reader isn't aware of the convention that the de facto endianness of networks is big.
Instead of checking "if native endianness is X, do Y, else do Z", I prefer to write a function that works with every native endianness. This can be done with bit shifts.
Yeah, it's constexpr. Not because it needs to be, but just because it can be. I haven't been able to produce an example where dropping constexpr would produce worse code.
#include <cstddef>      // std::size_t
#include <type_traits>  // std::decay_t, std::make_unsigned_t
#include <utility>      // std::declval, std::index_sequence

// helper to promote an integer type
template <class T>
using promote_t = std::decay_t<decltype(+std::declval<T>())>;
template <class T, std::size_t... I>
constexpr void
host_to_big_impl(
    unsigned char* buf,
    T t,
    [[maybe_unused]] std::index_sequence<I...>) noexcept
{
    using U = std::make_unsigned_t<promote_t<T>>;
    constexpr U lastI = sizeof(T) - 1u;  // index of the last (least significant) output byte
    constexpr U bits = 8u;               // network bytes are octets (see note above)
    U u = t;
    ( (buf[I] = u >> ((lastI - I) * bits)), ... );  // most significant byte goes to buf[0]
}
template <class T>
constexpr void
host_to_big(unsigned char* buf, T t) noexcept
{
    using Indices = std::make_index_sequence<sizeof(T)>;
    return host_to_big_impl<T>(buf, t, Indices{});
}
[1] In all use cases I've encountered. Integer-to-integer conversions can be implemented by delegating to these functions if you have such a case, although they cannot be constexpr due to the need for reinterpret_cast.
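For completeness, a sketch of the reading direction mentioned in the notes (my illustration, reusing promote_t from above), plus a usage example:
template <class T, std::size_t... I>
constexpr T big_to_host_impl(const unsigned char* buf,
                             std::index_sequence<I...>) noexcept
{
    using U = std::make_unsigned_t<promote_t<T>>;
    constexpr U lastI = sizeof(T) - 1u;
    constexpr U bits = 8u;
    // Reassemble the value: buf[0] holds the most significant byte.
    return T(((U(buf[I]) << ((lastI - I) * bits)) | ...));
}

template <class T>
constexpr T big_to_host(const unsigned char* buf) noexcept
{
    return big_to_host_impl<T>(buf, std::make_index_sequence<sizeof(T)>{});
}

// Usage: serialise a 32-bit value, then read it back.
// unsigned char buf[4];
// host_to_big(buf, 0x12345678u);
// unsigned v = big_to_host<unsigned>(buf); // v == 0x12345678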
I made a benchmark comparing my C++ solution from the question and the solution by eeroika from the accepted answer.
Looking at this is a complete waste of time, but now that I did it, I thought I might as well share it. The result is that (in the specific, not-quite-realistic use case I look at) they seem to be equivalent in terms of performance. This is despite my solution being compiled to use x86 bswap, while the solution by eeroika does it by just using mov.
The performance differs a lot (!!) between compilers, and the main thing I learned from these benchmarks is, again, that I'm just wasting my time...
// Benchmark comparing two stand-alone C++20 host-to-big-endian endianness conversions.
// Run at quick-bench.com! This is not a complete program. (https://quick-bench.com/q/2qnr4xYKemKLZupsicVFV_09rEk)
// To run locally, include Google benchmark header and a main method as required by the benchmarking library.
// Adapted from https://stackoverflow.com/a/71004000/9988487
#include <bit>
#include <climits>
#include <concepts>   // std::integral (was missing)
#include <cstddef>
#include <cstdint>
#include <cstring>    // std::memcpy (was missing)
#include <limits>     // std::numeric_limits (was missing)
#include <random>
#include <ranges>     // std::ranges::range (was missing)
#include <type_traits>
#include <utility>
#include <vector>     // std::vector (was missing)
/////////////////////////////// Solution 1 ////////////////////////////////
template <typename T> struct scalar_t { T t{}; /* no begin/end */ };
static_assert(not std::ranges::range< scalar_t<int> >);
template<class T, std::size_t... N>
constexpr T bswap_impl(T i, std::index_sequence<N...>) noexcept {
    constexpr auto bits_per_byte = 8u;
    static_assert(bits_per_byte == CHAR_BIT);
    return ((((i >> (N * bits_per_byte)) & (T)(unsigned char)(-1)) <<
             ((sizeof(T) - 1 - N) * bits_per_byte)) | ...);
}                                         // ^~~~~ fold expression

template<class T, class U = typename std::make_unsigned<T>::type>
constexpr U bswap(T i) noexcept {
    return bswap_impl<U>(i, std::make_index_sequence<sizeof(T)>{});
}

constexpr inline auto host_to_net(std::integral auto i) {
    static_assert(std::endian::native == std::endian::big || std::endian::native == std::endian::little);
    if constexpr (std::endian::native == std::endian::big) {
        return i;
    } else {
        return bswap(i); // replace by `std::byteswap` once it's available!
    }
}
/////////////////////////////// Solution 2 ////////////////////////////////
// helper to promote an integer type
template <class T>
using promote_t = std::decay_t<decltype(+std::declval<T>())>;
template <class T, std::size_t... I>
constexpr void
host_to_big_impl(
    unsigned char* buf,
    T t,
    [[maybe_unused]] std::index_sequence<I...>) noexcept {
    using U = std::make_unsigned_t<promote_t<T>>;
    constexpr U lastI = sizeof(T) - 1u;
    constexpr U bits = 8u;
    U u = t;
    ( (buf[I] = u >> ((lastI - I) * bits)), ... );
}
template <class T>
constexpr void
host_to_big(unsigned char* buf, T t) noexcept {
    using Indices = std::make_index_sequence<sizeof(T)>;
    return host_to_big_impl<T>(buf, t, Indices{});
}
//////////////////////// Benchmarks ////////////////////////////////////
template<std::integral T>
std::vector<T> get_random_vector(std::size_t length, unsigned int seed) {
    // NOTE: IT IS VERY SLOW TO RECREATE THE RNG EVERY TIME. Don't use in production code!
    std::mt19937_64 rng{seed};
    std::uniform_int_distribution<T> distribution(
        std::numeric_limits<T>::min(), std::numeric_limits<T>::max());
    std::vector<T> result(length);
    for (auto && val : result) {
        val = distribution(rng);
    }
    return result;
}
template<>
std::vector<bool> get_random_vector<bool>(std::size_t length, unsigned int seed) {
    // NOTE: IT IS VERY SLOW TO RECREATE THE RNG EVERY TIME. ONLY USE FOR TESTING!
    std::mt19937_64 rng{seed};
    std::bernoulli_distribution distribution{0.5};
    std::vector<bool> vec(length);
    for (auto && val : vec) {
        val = distribution(rng);
    }
    return vec;
}
constexpr std::size_t n_ints{1000};
// Note: this benchmark exercises host_to_big (the code under "Solution 2" above).
static void solution1(benchmark::State& state) {
    std::vector<int> intvec = get_random_vector<int>(n_ints, 0);
    std::vector<std::uint8_t> buffer(sizeof(int)*intvec.size());
    for (auto _ : state) {
        for (std::size_t i{}; i < intvec.size(); ++i) {
            host_to_big(buffer.data() + sizeof(int)*i, intvec[i]);
        }
        benchmark::DoNotOptimize(buffer);
        benchmark::ClobberMemory();
    }
}
BENCHMARK(solution1);
// Note: this benchmark exercises host_to_net (the code under "Solution 1" above).
static void solution2(benchmark::State& state) {
    std::vector<int> intvec = get_random_vector<int>(n_ints, 0);
    std::vector<std::uint8_t> buffer(sizeof(int)*intvec.size());
    for (auto _ : state) {
        for (std::size_t i{}; i < intvec.size(); ++i) {
            // Store every byte of the converted value; the original
            // `buffer[sizeof(int)*i] = host_to_net(intvec[i]);` kept only the low byte.
            const auto v = host_to_net(intvec[i]);
            std::memcpy(buffer.data() + sizeof(int)*i, &v, sizeof v);
        }
        benchmark::DoNotOptimize(buffer);
        benchmark::ClobberMemory();
    }
}
BENCHMARK(solution2);
I was curious how far I could push gcc as far as compile-time evaluation is concerned, so I made it compute the Ackermann function, specifically with input values of 4 and 1 (anything higher than that is impractical):
consteval unsigned int A(unsigned int x, unsigned int y)
{
    if(x == 0)
        return y+1;
    else if(y == 0)
        return A(x-1, 1);
    else
        return A(x-1, A(x, y-1));
}
unsigned int result = A(4, 1);
(I think the recursion depth is bounded at ~16K but just to be safe I compiled this with -std=c++20 -fconstexpr-depth=100000 -fconstexpr-ops-limit=12800000000)
Not surprisingly, this takes up an obscene amount of stack space (in fact, it causes the compiler to crash if run with the default process stack size of 8 MB) and takes several minutes to compute. However, it does eventually get there, so evidently the compiler could handle it.
After that I decided to try implementing the Ackermann function using templates, with metafunctions and partial specialization pattern matching. Amazingly, the following implementation only takes a few seconds to evaluate:
template<unsigned int x, unsigned int y>
struct A {
    static constexpr unsigned int value = A<x-1, A<x, y-1>::value>::value;
};

template<unsigned int y>
struct A<0, y> {
    static constexpr unsigned int value = y+1;
};

template<unsigned int x>
struct A<x, 0> {
    static constexpr unsigned int value = A<x-1, 1>::value;
};
unsigned int result = A<4,1>::value;
(compile with -ftemplate-depth=17000)
Why is there such a dramatic difference in evaluation time? Aren't these essentially equivalent? I guess I can understand the consteval solution requiring slightly more memory and evaluation time because semantically it consists of a bunch of function calls, but that doesn't explain why this exact same (non-consteval) function computed at runtime only takes slightly longer than the metafunction version (compiled without optimizations).
Why is consteval so slow? And how can the metafunction version be so fast? It's actually not much slower than optimized machine-code.
In the template version of A, when a particular specialization, say A<2,3>, is instantiated, the compiler remembers this type and never needs to instantiate it again. This comes from the fact that types are unique, so each "call" to this metafunction is really just computing a type.
The consteval function version is not optimized this way, so A(2,3) may be evaluated multiple times, depending on the control flow, resulting in the performance difference you observe. There's nothing stopping compilers from "caching" the results of consteval function calls; these optimizations likely just haven't been implemented yet.
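A small illustration of the memoization effect:
// Both constants below name the same specialization; the compiler computes
// A<3, 3> once and reuses it, whereas consteval calls A(3, 3) may each be
// evaluated from scratch.
constexpr unsigned int r1 = A<3, 3>::value;
constexpr unsigned int r2 = A<3, 3>::value; // no second instantiation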
While attempting to write the fastest factorial function that evaluates at run-time, I found myself questioning whether it is a good idea to declare the constant array f[] at the function level or at the unit level.
#include <stdint.h>
#include <cassert>
// uint64_t const f[21] = { 1,1,2,6,24,120,720,5040,40320,362880,3628800,39916800,479001600,6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000, 6402373705728000, 121645100408832000, 2432902008176640000 };
const uint64_t factorial(const uint8_t n) {
    static uint64_t const f[21] = { 1,1,2,6,24,120,720,5040,40320,362880,3628800,39916800,479001600,6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000, 6402373705728000, 121645100408832000, 2432902008176640000 };
    assert(n <= 20);
    return f[n];
}
What are the pros and cons of each placement, assuming that f[] will be used only by the factorial() function?
Is a constant declared at function level created and destroyed each time the function is executed, like a non-const variable, or do the compiler and linker collect all constants and place them in the .rodata section at build time?
In my humble opinion, the best approach is to let the compiler decide and do everything with constexpr. The compiler will make the best decision for you.
And because the number of factorials that fit into an unsigned 64-bit value is very low (21), a compile-time array will use only 21*8 = 168 bytes.
That number is so low that we can easily build a compile-time constexpr std::array and stop all further considerations.
Really, everything can be done at compile time.
We will first define the default approach for calculating a factorial as a constexpr function:
constexpr unsigned long long factorial(unsigned long long n) noexcept {
    return n == 0ull ? 1 : n * factorial(n - 1ull);
}
With that, factorials can easily be calculated at compile time. Then we fill a std::array with all factorials. We also make it constexpr and a template with a variadic parameter pack.
We use std::integer_sequence to create the factorials for indices 0,1,2,3,4,5, ....
That is straightforward and not complicated:
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
    return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
}
This function will be fed with an integer sequence 0,1,2,3,4,... and return a std::array<unsigned long long, ...> with the corresponding factorials.
We know that we can store at most 21 values. Therefore we write a next function that calls the above with the integer sequence 0,1,2,...,19,20, like so:
constexpr auto generateArray() noexcept {
    return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
And now, finally,
constexpr auto Factorial = generateArray();
will give us a compile-time std::array<unsigned long long, 21> named Factorial containing all the factorials. And if we need the i-th factorial, we can simply write Factorial[i]. There will be no calculation at runtime.
I do not think that there is a faster way to calculate a factorial.
Please see the complete program below:
#include <iostream>
#include <array>
#include <utility>
// ----------------------------------------------------------------------
// All the below will be calculated at compile time
// constexpr factorial function
constexpr unsigned long long factorial(unsigned long long n) noexcept {
    return n == 0ull ? 1 : n * factorial(n - 1ull);
}
// We will automatically build an array of factorials at compile time
// Generate a std::array with n elements
template <size_t... ManyIndices>
constexpr auto generateArrayHelper(std::integer_sequence<size_t, ManyIndices...>) noexcept {
    return std::array<unsigned long long, sizeof...(ManyIndices)>{ { factorial(ManyIndices)... } };
}
// Max index for factorials that fit in a 64-bit unsigned value
constexpr size_t MaxIndexFor64BitValue = 21;
// Generate the required number of elements
constexpr auto generateArray() noexcept {
    return generateArrayHelper(std::make_integer_sequence<size_t, MaxIndexFor64BitValue>());
}
// This is an constexpr array of all factorials numbers
constexpr auto Factorial = generateArray();
// All the above was compile time
// ----------------------------------------------------------------------
// Test function
int main() {
    for (size_t i{}; i < MaxIndexFor64BitValue; ++i)
        std::cout << i << '\t' << Factorial[i] << '\n';
    return 0;
}
Developed, compiled and tested with Microsoft Visual Studio Community 2019, Version 16.8.2
Additionally compiled and tested with gcc 10.2 and clang 11.0.1
Language: C++17
What assurances do I have that a core constant expression (as in [expr.const].2) possibly containing constexpr function calls will actually be evaluated at compile time and on which conditions does this depend?
The introduction of constexpr implicitly promises runtime performance improvements by moving computations into the translation stage (compile time).
However, the standard does not (and presumably cannot) mandate what code a compiler produces. (See [expr.const] and [dcl.constexpr]).
These two points appear to be at odds with each other.
Under which circumstances can one rely on the compiler resolving a core constant expression (which might contain an arbitrarily complicated computation) at compile time rather than deferring it to runtime?
At least under -O0, gcc appears to actually emit code for, and call, the constexpr function. Under -O1 and up it doesn't.
Do we have to resort to trickery such as the following, which forces the evaluation through the template system:
template <auto V>
struct compile_time_h { static constexpr auto value = V; };
template <auto V>
inline constexpr auto compile_time = compile_time_h<V>::value;
constexpr int f(int x) { return x; }
int main() {
    for (int x = 0; x < compile_time<f(42)>; ++x) {}
}
When a constexpr function is called and the output is assigned to a constexpr variable, it will always be run at compile time.
Here's a minimal example:
// Compile with -std=c++14 or later
constexpr int fib(int n) {
    int f0 = 0;
    int f1 = 1;
    for(int i = 0; i < n; i++) {
        int hold = f0 + f1;
        f0 = f1;
        f1 = hold;
    }
    return f0;
}

int main() {
    constexpr int blarg = fib(10);
    return blarg;
}
When compiled at -O0, gcc outputs the following assembly for main:
main:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], 55
    mov eax, 55
    pop rbp
    ret
Despite all optimization being turned off, there's never any call to fib in the main function itself.
This applies going all the way back to C++11; however, in C++11 the fib function would have to be rewritten to use recursion in order to avoid mutable variables.
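For reference, such a C++11-compatible rewrite might look like this (a sketch; C++11 constexpr bodies are essentially restricted to a single return statement):
// C++11-style constexpr Fibonacci: recursion instead of loops and mutation.
constexpr int fib11(int n) {
    return n < 2 ? n : fib11(n - 1) + fib11(n - 2);
}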
Why does the compiler sometimes include the assembly for fib in the executable? Because a constexpr function can be used at runtime, and when invoked at runtime it behaves like a regular function.
Used properly, constexpr can provide some performance benefits in specific cases, but the push to make everything constexpr is more about writing code that the compiler can check for Undefined Behavior.
What's an example of constexpr providing performance benefits? When implementing a function like std::visit, you need to create a lookup table of function pointers. Creating the lookup table every time std::visit is called would be costly, and assigning the lookup table to a static local variable would still result in measurable overhead because the program has to check if that variable's been initialized every time the function is run.
Thankfully, you can make the lookup table constexpr, and the compiler will actually inline the lookup table into the assembly code for the function, so that its contents are significantly more likely to be in the instruction cache when std::visit is run.
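A sketch of that pattern (hypothetical handlers, not the actual std::visit machinery):
using Handler = int (*)(int);

constexpr int times_two(int x) { return 2 * x; }
constexpr int negate(int x)    { return -x; }

// A constexpr table is baked into the binary's read-only data at compile
// time, so dispatch needs no "has this been initialized?" guard at runtime.
constexpr Handler table[] = { times_two, negate };

int dispatch(unsigned index, int value) {
    return table[index](value);
}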
Does C++20 provide any mechanism for guaranteeing that something runs at compile time?
If a function is consteval, then the standard specifies that every call to the function must produce a compile-time constant.
This can be trivially used to force the compile-time evaluation of any constexpr function:
template<class T>
consteval T run_at_compiletime(T value) {
    return value;
}
Anything given as a parameter to run_at_compiletime must be evaluated at compile-time:
constexpr int fib(int n) {
    int f0 = 0;
    int f1 = 1;
    for(int i = 0; i < n; i++) {
        int hold = f0 + f1;
        f0 = f1;
        f1 = hold;
    }
    return f0;
}

int main() {
    // fib(10) will definitely run at compile time
    return run_at_compiletime(fib(10));
}
Never; the C++ standard permits almost the entire compilation to occur at "runtime". Some diagnostics have to be done at compile time, but nothing prevents insanity on the part of the compiler.
Your binary could be a copy of the compiler with your source code appended, and C++ wouldn't say the compiler did anything wrong.
What you are looking at is a QoI (Quality of Implementation) issue.
In practice, constexpr variables tend to be computed at compile time, and template parameters are always computed at compile time.
consteval can also be used to mark up functions, forcing compile-time evaluation.
I'm using some template metaprogramming to solve a small problem, but the syntax is a little annoying - so I was wondering: in the example below, will overloading operators on the meta-class that has an empty constructor cause a (run-time) performance penalty? Will all the temporaries actually be constructed, or can it be assumed that they will be optimized out?
template<int value_>
struct Int {
    static const int value = value_;

    template<typename B>
    struct Add : public Int<value + B::value> { };

    template<typename B>
    Int<value + B::value> operator+(B const&) { return Int<value + B::value>(); }
};
int main()
{
    // Is doing this:
    int sum = Int<1>::Add<Int<2> >().value;
    // any more efficient (at runtime) than this:
    int sum = (Int<1>() + Int<2>()).value;
    return sum;
}
Alright, I tried my example under GCC.
For the Add version with no optimization (-O0), the resulting assembly just loads a constant into sum, then returns it.
For the operator+ version with no optimization (-O0), the resulting assembly does a bit more (it appears to be calling operator+).
However, with -O3, both versions generate the same assembly, which simply loads 3 directly into the return register; the temporaries, function calls, and sum had been optimized out entirely in both cases.
So, they're equally fast with a decent compiler (as long as optimizations are turned on).
Compare the assembly code generated by g++ -O3 -S for both solutions. It gives the same code for both; it actually optimizes the code down to simply returning 3.