Background: I am trying to repurpose some C++ code written for GCC in an MSVC project, and have been refactoring it to make it compatible with the MSVC compiler.
Simplified, one of the functions originally was this:
[[nodiscard]] constexpr int count() const noexcept {
return __builtin_popcountll(mask);//gcc-specific function
}
Where mask is a 64-bit member variable. The obvious conversion to MSVC is:
[[nodiscard]] constexpr int count() const noexcept {
return __popcnt64(mask); // MSVC replacement
}
However, it doesn't compile, because __popcnt64 is not allowed in a constexpr function.
I am using C++17, and I would prefer to avoid having to switch to C++20 if possible.
Is there a way to make it work?
You cannot make a non-constexpr function become a constexpr one. If the vendor doesn't declare the intrinsic constexpr, then that's it: you will have to write your own. The tricky part in C++17 is distinguishing compile-time evaluation from runtime.
It depends on the goal:
Just count bits at compile time (and possibly at runtime). Then implement your own constexpr bit counting and don't use __popcnt64. Wikipedia's Hamming weight article has several candidate algorithms.
Use the popcnt instruction at runtime. Then you need a compile-time/runtime distinction, so that different implementations are used in the two contexts.
For the compile-time/runtime distinction in C++20 you would write if (std::is_constant_evaluated()) { ... } else { ... }
In MSVC, std::is_constant_evaluated is implemented via the compiler magic __builtin_is_constant_evaluated(), which happens to compile and work properly in C++17 as well. So you can:
constexpr int popcount(unsigned long long x)
{
    if (__builtin_is_constant_evaluated())
    {
        // Portable constexpr fallback (Kernighan's method):
        // each iteration clears the lowest set bit.
        int count = 0;
        for (; x != 0; x &= x - 1)
            ++count;
        return count;
    }
    else
    {
        return __popcnt64(x);
    }
}
Note: __builtin_popcountll compiles into either the popcnt instruction or bit counting via bit hacks, depending on compilation flags. MSVC's __popcnt64 always compiles into the popcnt instruction. If the goal is to support older CPUs that do not have the popcnt instruction, you'd have to provide CPU detection (compile-time or runtime, again depending on the goal) and a fallback, or not use __popcnt64 at all.
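If runtime detection is the chosen route, here is a minimal sketch of one possible approach (my assumption, using MSVC's __cpuid intrinsic on x86/x64; POPCNT support is reported in CPUID leaf 1, ECX bit 23):

#include <intrin.h>

// Sketch: runtime check for the popcnt instruction on x86/x64.
inline bool has_popcnt() noexcept
{
    int info[4] = {};
    __cpuid(info, 1);                   // CPUID leaf 1: processor features
    return (info[2] & (1 << 23)) != 0;  // ECX bit 23 = POPCNT
}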
The question has already been answered, so just a side note.
Building your own efficient popcount function would probably be the best solution.
For that you may reuse the very old wisdom from the book "Hacker's Delight" by Henry S. Warren. This book is from the time when programmers and algorithm developers worked to minimize the number of precious assembler instructions, both for ROM consumption (yes, indeed) and for performance.
You will find many very efficient solutions there, some completely loop-free and with an astonishingly low instruction count, for example using the divide-and-conquer method. A sketch of that approach follows.
It is worth a visit.
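For illustration, here is the classic divide-and-conquer (SWAR) bit count from that family of techniques, written as a C++17 constexpr function:

// Classic SWAR popcount: count bits in pairs, then nibbles, then
// fold the per-byte sums together with one multiply.
constexpr int popcount64(unsigned long long x) noexcept
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<int>((x * 0x0101010101010101ULL) >> 56);
}

static_assert(popcount64(0xFFULL) == 8, "sanity check");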
I know the difference in requirements; I am mostly interested in what code-quality benefits it brings.
A few things I can think of:
a reader can tell from the function signature alone that the function is evaluated at compile time
the compiler may emit less code, since consteval functions are never invoked at runtime (this is speculative; I have no real data on it)
no need for helper variables to force compile-time evaluation; see the example at the end
note: if "code quality" is too vague, I understand some people might want to close this question; for me code quality is not really that vague a term, but...
An example where a constexpr failure is delayed to runtime:
#include <cassert>
#include <iostream>

constexpr int div_cx(int a, int b)
{
    assert(b != 0);
    return a / b;
}
int main()
{
static constexpr int result = div_cx(5,0); // compile time error, div by 0
std::cout << result;
std::cout << div_cx(5,0) ; // runtime error :(
}
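For contrast, a sketch of the C++20 consteval equivalent (not from the question; just to show the difference): the failing call can never slip through to runtime.

#include <iostream>

consteval int div_ce(int a, int b)
{
    return a / b; // b == 0 makes constant evaluation fail
}

int main()
{
    std::cout << div_ce(6, 2);    // OK: evaluated at compile time
    // std::cout << div_ce(5, 0); // ill-formed: there is no runtime fallback
}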
In order to have meaningful, significant static reflection (reflection at compile time), you need a way to execute code at compile time. The initial static reflection TS proposal used traditional template metaprogramming techniques, because those were the only effective tools for executing code at compile-time at all.
However, as constexpr code gained more features, it became increasingly more feasible to do compile-time static reflection through constexpr functions. One problem with such ideas is that static reflection values cannot be allowed to leak out into non-compile-time code.
We need to be able to write code that must only be executed at compile-time. It's easy enough to do that for small bits of code in the middle of a function; the runtime version of that code simply won't contain the reflection parts, only the results of them.
But what if you want to write a function that takes a reflection value and returns a reflection value? Or a list of reflection values?
That function cannot be constexpr, because a constexpr function must be able to be executed at runtime. You are allowed to do things like get pointers to constexpr functions and call them in ways that the compiler can't trace, thus forcing it to execute at runtime.
A function which takes a reflection value can't do that. It must execute only at compile-time. So constexpr is inappropriate for such functions.
Enter consteval: a function which is "required" to execute only at compile time. There are specific rules in place that make it impossible for pointers to such functions to leak out into runtime code and so forth.
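To make that concrete, here is a small sketch (my own, not from the original text) of how the two differ with respect to function pointers:

constexpr int twice(int x)  { return 2 * x; }
consteval int thrice(int x) { return 3 * x; }

int main()
{
    int (*fp)(int) = twice;     // OK: a constexpr function also exists at runtime
    // int (*gp)(int) = thrice; // ill-formed: the address of a consteval
                                // function must not leak into runtime code
    return fp(21);
}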
As such, consteval doesn't have much purpose at the moment. It gets used in a few places like source_location::current(), which fundamentally makes no sense to execute at runtime. But ultimately, the feature is a necessary building-block for further compile-time programming tools that don't exist yet.
This was laid down in the paper that originally proposed this feature:
The impetus for the present paper, however, is the work being done by SG7 in the realm of compile-time reflection. There is now general agreement that future language support for reflection should use constexpr functions, but since "reflection functions" typically have to be evaluated at compile time, they will in fact likely be immediate functions.
tl;dr: Can it be ensured somehow (e.g. by writing a unit test) that some things are optimized away, e.g. whole loops?
The usual approach to be sure that something is not included in the production build is to wrap it in #if...#endif. But I prefer to stay with C++ mechanics instead. Even there, instead of complicated template specializations I like to keep implementations simple and argue "hey, the compiler will optimize this out anyway".
The context is embedded automotive software (binary size matters) with often poor compilers. They are certified in the safety sense, but usually not good at optimization.
Example 1: In a container the destruction of elements is typically a loop:
for(size_t i = 0; i<elements; i++)
buffer[i].~T();
This also works for built-in types such as int, as the standard allows an explicit destructor call even for scalar types (C++11 12.4/15). In that case the loop does nothing and should be optimized out. GCC does so, but another compiler (for Aurix) did not: I saw a literally empty loop in the disassembly! That needed a template specialization to fix, along the lines of the sketch below.
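A hedged sketch of such a fix (names are illustrative): dispatch on std::is_trivially_destructible so the loop is never even emitted for trivial types, independent of optimizer quality.

#include <cstddef>
#include <type_traits>

// Trivial types: nothing to destroy, no loop in the source at all.
template <class T>
void destroy_range(T*, std::size_t, std::true_type) noexcept
{
}

// Non-trivial types: destroy each element explicitly.
template <class T>
void destroy_range(T* buffer, std::size_t elements, std::false_type) noexcept
{
    for (std::size_t i = 0; i < elements; ++i)
        buffer[i].~T();
}

template <class T>
void destroy_range(T* buffer, std::size_t elements) noexcept
{
    destroy_range(buffer, elements, std::is_trivially_destructible<T>{});
}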
Example 2: Code that is intended only for debugging, profiling, fault injection, etc.:
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
if( isDebugging ) {
// Albeit 'dead' section, it may not appear in production binary!
// (size, security, safety...)
// 'if constexpr..' not an option (C++11)
std::cout << "Arg was " << arg << std::endl;
}
// normal code here...
}
I can look at the disassembly, sure. But being upstream platform software, it's hard to control all the targets, compilers and options one might use. The fear is that for whatever reason a downstream project ends up with code bloat or a performance issue.
Bottom line: Is it possible to write the software in a way that certain code is known to be optimized away, as safely as #if would do? Or a unit test that fails if the optimization is not as expected?
[Timing tests come to mind for the first problem, but being bare-metal I don't have convenient tools yet.]
There may be a more elegant way, and it's not a unit test, but if you're just looking for that particular string, and you can make it unique,
strings $COMPILED_BINARY | grep "Arg was"
should show you whether the string made it into the binary.
if constexpr is the canonical C++ construct (since C++17) for this kind of test.
#include <iostream>

constexpr bool DEBUG = /*...*/;

int main() {
    if constexpr (DEBUG) {
        std::cerr << "We are in debugging mode!" << std::endl;
    }
}
If DEBUG is false, the code that prints to the console is not generated at all. So if you have things like log statements that you need while checking the behaviour of your code, but that must not appear in production, you can hide them inside if constexpr blocks to eliminate the code entirely once it moves to production.
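One conventional way to derive such a flag (an assumption on my part; any build-system define works just as well) is from the standard NDEBUG macro that release builds typically set:

#ifdef NDEBUG
constexpr bool DEBUG = false; // release build
#else
constexpr bool DEBUG = true;  // debug build
#endif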
Looking at your question, I see several (sub-)questions in it that require an answer. Not all answers might be possible with your bare-metal compilers as hardware vendors don't care that much about C++.
The first question is: How do I write code in a way that I'm sure it gets optimized. The obvious answer here is to put everything in a single compilation unit so the caller can see the implementation.
The second question is: How can I force the compiler to optimize? Here constexpr is a blessing. Depending on whether you have support for C++11, C++14, C++17 or even the upcoming C++20, you'll get different feature sets of what you can do in a constexpr function. For example:
constexpr char c = std::string_view{"my_very_long_string"}[7];
With the code above, c is defined as a constexpr variable. By applying constexpr to the variable, you require several things:
The compiler must evaluate c at compile time. This holds even for -O0 builds!
All functions used to calculate c must be constexpr and available (which, as a result, enforces the behaviour from the first question).
No undefined behaviour may be triggered in the calculation of c (for the given values).
The negative about this is: Your input needs to be known at compile time.
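For reference, here is a minimal complete version of the snippet above, with a static_assert confirming that the value really is computed at compile time:

#include <string_view>

// Index 7 of "my_very_long_string" is the second underscore.
constexpr char c = std::string_view{"my_very_long_string"}[7];
static_assert(c == '_', "evaluated at compile time");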
C++17 also provides if constexpr, which has a similar requirement: the condition must be evaluatable at compile time. The result is that the discarded branch is not instantiated (it may even contain code that would not compile for the type you are using).
Which then brings us to the question: How do I ensure sufficient optimization for my program to run fast enough, even if my compiler isn't well behaved? Here the only relevant answer is: create benchmarks and compare the results. Take the effort to set up a CI job that automates this for you. (And yes, you can even use external hardware, although that is not easy.) In the end you have requirements of the form: handling A should take less than X seconds. So do A several times and time it. Even if the benchmarks don't cover everything, as long as the results stay within the requirements, it's fine.
Note: as this is about debugging, you can most likely track the size of the executable as well. As soon as you start using streams, lots of conversions to string and so on, your executable size will grow. (And you'll find this a blessing, as you will immediately spot commits that add 10% to the image size.)
And then the final question: you have a buggy compiler that doesn't meet your requirements. Here the only answer is: replace it. In the end, you can use any compiler to compile your code to bare metal, as long as the linker scripts work. If you need a starting point, C++Now 2018: Michael Caisse “Modern C++ in Embedded Systems” gives you a very good idea of what it takes to use a different compiler. (Like a recent Clang or GCC, against which you can even file bugs if the optimization isn't good enough.)
Insert a reference to an external function into the block that should be verified to be optimised away. Like this:
extern void nop();
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
if( isDebugging ) {
nop();
std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
}
// normal code here...
}
In debug builds, link with an implementation of nop() in an extra compilation unit nop.cpp:
void nop() {}
In Release-Builds, don't provide an implementation.
Release builds will only link if the optimisable code is eliminated.
- kisch
Here's another nice solution using inline assembly.
This uses assembler directives only, so it might even be kind of portable (checked with clang).
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
if( isDebugging ) {
asm(".globl _marker\n_marker:\n");
std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
}
// normal code here...
}
This would leave an exported linker symbol in the compiled executable, if the code isn't optimised away. You can check for this symbol using nm(1).
clang can even stop the compilation right away:
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
if( isDebugging ) {
asm("_marker=1\n");
std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
}
asm volatile (
".ifdef _marker\n"
".err \"code not optimised away\"\n"
".endif\n"
);
// normal code here...
}
This is not an answer to "How to ensure some code is optimized away?" but to your summary line "Can a unit test be written that e.g. whole loops are optimized away?".
First, the answer depends on how far you see the scope of unit-testing - so if you put in performance tests, you might have a chance.
If in contrast you understand unit-testing as a way to test the functional behaviour of the code, you don't. For one thing, optimizations (if the compiler works correctly) shall not change the behaviour of standard-conforming code.
With incorrect code (code that has undefined behaviour) optimizers can do what they want. (Well, for code with undefined behaviour the compiler can do so in the non-optimizing case too, but sometimes only the deeper analyses performed during optimization let the compiler detect that some code has undefined behaviour.) Thus, if you write unit tests for some piece of code with undefined behaviour, the test results may differ when you run the tests with and without optimization. But, strictly speaking, this only tells you that the compiler translated the code differently the two times - it does not guarantee that the code is optimized in the way you want it to be.
Here's another different way that also covers the first example.
You can verify (at runtime) that the code has been eliminated, by comparing two labels placed around it.
This relies on the GCC extension "Labels as Values" https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
before:
for(size_t i = 0; i<elements; i++)
buffer[i].~T();
behind:
if (intptr_t(&&behind) != intptr_t(&&before)) abort();
It would be nice if you could check this in a static_assert(), but sadly the difference of two &&label expressions is not accepted as a compile-time constant.
GCC insists on inserting a runtime comparison, even though both labels are in fact at the same address.
Interestingly, if you compare the addresses (type void*) directly, without casting them to intptr_t, GCC falsely optimises away the if() as "always true", whereas clang correctly optimises away the complete if() as "always false", even at -O1.
Consider the following function:
template <class T, class Priority>
void MutableQueue<T, Priority>::update(const T& item, const Priority& priority)
{
...
}
Would modern x86-64 compilers be smart enough to pass the priority argument by value rather than reference if the priority type could fit within a register?
The compiler may do this optimization, but it is not mandatory.
To force passing the "best" type, you may use Boost:
http://www.boost.org/doc/libs/1_55_0/libs/utility/call_traits.htm
Replace const T& (where passing by value would be correct) with call_traits<T>::param_type.
So your code may become:
template <class T, class Priority>
void MutableQueue<T, Priority>::update(typename boost::call_traits<T>::param_type item,
                                       typename boost::call_traits<Priority>::param_type priority)
{
    ...
}
As #black mentioned, optimizations are compiler and platform dependent. That said, we typically expect a number of optimizations to happen day-to-day when using a good optimizing compiler. For instance, we count on function inlining, register allocation, converting constant multiplications and divisions to bit-shifts when possible, etc.
To answer your question
Would modern x86-64 compilers be smart enough to pass the priority argument by value rather than reference if the priority type could fit within a register?
I'll simply try it out. See for yourself:
GCC latest (without inlining)
CLANG 3.5.1 (without inlining)
This is the code:
template<typename T>
T square(const T& num) {
return num * num;
}
int sq(int x) {
return square(x);
}
GCC -O3, -O2, and -O1 reliably perform this optimization.
Clang 3.5.1, on the other hand, does not seem to perform this optimization.
Should you count on such optimization happening? Not always, and not absolutely--the C++ standard says nothing about when an optimization like this could take place. In practice, if you are using GCC, you can 'expect' the optimization to take place.
If you absolutely positively want to ensure that such optimization happens, you will want to use template specialization.
This is totally platform and compiler dependent and so is how the arguments are passed to a function.
These specifics are defined in the ABI of the system the program runs on; some ABIs have a large number of registers and therefore pass arguments mainly in them, some push them all onto the stack, and some mix the two up to the Nth parameter.
Again, it is something you cannot rely on; you can check it in a couple of ways, though. The C++ language has no concept of a register.
This is more of a philosophical question rather than practical code snippet, but perhaps C++ gurus can enlighten me (and apologies if it's been asked already).
I have been reading Item 15 in Meyers's "Effective Modern C++" book, as well as this thread: implicit constexpr? (plus a reasonable amount of googling). The item goes over usage of constexpr for expressions, namely that it defines functions that can return compile time values given compile time inputs.
Moreover, the StackOverflow thread I referred to shows that some compilers are perfectly capable of figuring out for themselves which function invocation results are known at compile time.
Hence the question: why was constexpr added to the standard, rather than the standard defining when compilers must derive and allow static/compile-time values on their own?
I realise it makes various compile-time-only definitions (e.g. std::array<T, constexpr>) less predictable, but on the other hand, as per Meyers's book, constexpr is a part of the interface, ..., if you remove it, you may cause arbitrarily large amounts of client code to stop compiling.
So not only does explicit constexpr require people to remember to add it, it also adds permanent semantics to the interface.
Clarification: This question is not about why constexpr should be used. I appreciate that the ability to programmatically derive compile-time values is very useful, and I have employed it myself on a number of occasions. It is a question about why it is mandatory in situations where the compiler could deduce compile-time behaviour on its own.
Clarification no. 2: Here is a code snippet showing that compilers do not deduce this automatically; I've used g++ in this case.
#include <array>
#include <cstddef>
size_t test()
{
return 42;
}
int main()
{
auto i = test();
std::array<int, i> arrayTst;
arrayTst[1] = 20;
return arrayTst[1];
}
The std::array declaration fails to compile because I have not declared test() as constexpr, which is of course per the standard. If the standard were different, nothing would have prevented gcc from figuring out on its own that test() always returns a constant expression.
This question does not ask "what does the standard define", but rather "why is the standard the way it is?"
Before constexpr, compilers could sometimes figure out a compile-time constant and use it. However, the programmer could never know when this would happen.
With constexpr, the programmer is immediately informed if an expression is not a compile-time constant and realizes the need to fix it.
I have a performance-critical inline function. It generates some data based on a parameter. I want the compiler to optimize the data generation for all invocations where the parameter is known at compile time. The problem is that I can't force the compiler to move the optimized data off the stack into a static constant, since marking the data static would break the case where the parameter is not a compile-time constant, and having constant data on the stack hurts performance. Is there a way to deduce (maybe using templates/boost::enable_if) that the parameter is a compile-time constant and choose the appropriate implementation of the data generation?
CLARIFICATION
Basically I have something like the following:
struct Data {
int d_[16];
};
inline Data fun(int param)
{ //param can sometimes be a compile-time constant
... //generate the data
Data res = {gen0, gen2, gen3, ..., gen15}; //put the data into result
return res;
}
So when param isn't compile-time constant, we just generate all the data and return.
When param is known, the compiler can optimize the data generation out. But then it fails to optimize the following line away and generates a lot of code that just sets the members of res to the known data (the data is embedded in the program code). I want the compiler to create a static constant and then copy it to the return object (that is faster than executing a lot of code with embedded data). Since this is an inline function, even the copy may be unnecessary.
Disclaimer
This question is not the same as How to use different overload of an inline function, depending on a compile time parameter?. This is a more generic problem.
I don't believe there is any way to do that; it's the compiler's responsibility to optimize the calls, not the language's... so there's no portable way to do that. :\
Did you actually profile your code and prove that passing constants to your (inline?) function(s) is the bottleneck?
If you did do the profiling, then you're going to have to help the compiler figure this one out, as there's no way to do it automatically. You'll have to manually call the template version of the function when you know the constant and the normal version otherwise, roughly as sketched below.
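A hedged sketch of that manual split (the generation logic is a placeholder): the compile-time path takes the parameter as a template argument and caches the generated table in a function-local static constant.

struct Data {
    int d_[16];
};

// Runtime path: generate the data from a value known only at runtime.
inline Data fun(int param)
{
    Data res = {};
    for (int i = 0; i < 16; ++i)
        res.d_[i] = param * i; // placeholder generation logic
    return res;
}

// Compile-time path: Param is a template argument, so the generated
// table can live in a static constant instead of on the stack.
template <int Param>
inline const Data& fun()
{
    static const Data cached = fun(Param);
    return cached;
}

Call fun<5>() where the argument is a literal and fun(x) everywhere else.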
This is not portable, but on GCC and possibly Clang there is a __builtin_constant_p compiler function. This lets you ask the compiler if it knows the value of a variable during compile time. You can use it like this:
void f(int arg) {
if (__builtin_constant_p(arg) && arg == 0) {
// Handle case where arg is 0 AND known at compile time.
} else {
// Generic code.
}
}
With this, the compiler will not generate the code in the else branch if arg is both known at compile time and is 0.
A useful trick to make this more portable might be to use a bit of macro hackery.
#ifdef __GNUC__
# define CONSTANT_P(x) __builtin_constant_p(x)
#else
# define CONSTANT_P(x) 0
#endif
Add other compilers that support something similar as needed, and you can now use this with no extra overhead on compilers that don't support it. That is, those compilers (if they are worth anything at all) will eliminate the CONSTANT_P branches, leaving only the generic code.
So it sounds like you have:
template <int N> void myfunc_const_N() { /*...*/ }
inline void myfunc_var_N(int n);
and you want to be able to type myfunc(n); and have the compiler call myfunc_const_N<n>(); if valid or myfunc_var_N(n); if not?
My guess is this impossible, but that's a difficult thing to prove.
But would it really gain you much if you could? How often do you not know at code-writing time whether a given expression is a compile-time constant or not? Why not just use the template version yourself if you do have a constant and the function parameter version if you don't?
If the function is inlined, then a reasonably good compiler will perform constant-folding optimizations where appropriate when it inlines it, as illustrated below.
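A small illustration of that point (the function and values are made up): with any mainstream optimizer, the call below typically folds to a constant once the function is inlined.

// After inlining, add_tax(100) typically folds to the constant 110
// at compile time, so no arithmetic remains at runtime.
inline int add_tax(int price)
{
    return price + price / 10;
}

int priced_item()
{
    return add_tax(100); // usually compiled as "return 110;"
}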