C++ template instantiation

I have a template class like below.
template<int S> class A
{
private:
char string[S];
public:
A()
{
for(int i =0; i<S; i++)
{
.
.
}
}
int MaxLength()
{
return S;
}
};
If I instantiate the above class with different values of S, will the compiler create different instances of the A() and MaxLength() functions? Or will it create one instance and pass S as some sort of argument?
How will it behave if I move the definitions of A() and MaxLength() to a different cpp file?

The template will be instantiated for each different values of S.
If you move the method implementations to a different file, you'll need to #include that file. (Boost for instance uses the .ipp convention for such source files that need to be #included).
If you want to minimise the amount of code that is generated with the template instantiation (and hence needs to be made available in the .ipp file) you should try to factor it out by removing the dependency on S. So for example you could derive from a (private) base class which provides member functions with S as a parameter.
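A minimal sketch of that factoring, assuming a hypothetical base class named ABase and a helper named Fill (neither is in the question). The loop body is compiled once and receives the size at run time, so only a thin constructor is generated per value of S:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-template base: the loop lives here and is compiled once.
class ABase
{
protected:
    static void Fill(char* buffer, std::size_t size, char value)
    {
        for (std::size_t i = 0; i < size; ++i)
        {
            buffer[i] = value;
        }
    }
};

template <int S>
class A : private ABase
{
    char s_[S];
public:
    A() { Fill(s_, S, 'A'); }            // thin wrapper, one per value of S
    int MaxLength() const { return S; }
    char At(int i) const { return s_[i]; }
};

inline void ExampleUsage()
{
    A<5> a5;
    A<25> a25;
    assert(a5.MaxLength() == 5);
    assert(a25.At(24) == 'A');
}
```

Whether this actually reduces code size depends on the optimizer; for a loop this small the compiler may well inline Fill into each constructor anyway.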

Actually this is fully up to the compiler. It's only required to generate correct code for its inputs, and to do so it must follow the C++ standard, which defines what is correct. In this case the standard says that the compiler must, at some step in the process, instantiate templates with different arguments as different types. Whether these types are later represented by the same code or not is fully up to the compiler.
It's most probable that the compiler would inline at least MaxLength() but possibly also your ctor. Otherwise it may very well generate a single instance of your ctor and pass/have it retrieve S from elsewhere. The only way to know for sure is to examine the output of the compiler.
So in order to know for sure I decided to list what VS2005 does in a release build. The file I have compiled looks like this:
template <int S>
class A
{
char s_[S];
public:
A()
{
for(int i = 0; i < S; ++i)
{
s_[i] = 'A';
}
}
int MaxLength() const
{
return S;
}
};
extern void useA(A<5> &a, int n); // to fool the optimizer
extern void useA(A<25> &a, int n);
void test()
{
A<5> a5;
useA(a5, a5.MaxLength());
A<25> a25;
useA(a25, a25.MaxLength());
}
The assembler output is the following:
?test@@YAXXZ PROC ; test, COMDAT
[snip]
; 25 : A<5> a5;
mov eax, 1094795585 ; 41414141H
mov DWORD PTR _a5$[esp+40], eax
mov BYTE PTR _a5$[esp+44], al
; 26 : useA(a5, a5.MaxLength());
lea eax, DWORD PTR _a5$[esp+40]
push 5
push eax
call ?useA@@YAXAAV?$A@$04@@H@Z ; useA
As you can see both the ctor and the call to MaxLength() are inlined. And as you may now guess it does the same with the A<25> type:
; 28 : A<25> a25;
mov eax, 1094795585 ; 41414141H
; 29 : useA(a25, a25.MaxLength());
lea ecx, DWORD PTR _a25$[esp+48]
push 25 ; 00000019H
push ecx
mov DWORD PTR _a25$[esp+56], eax
mov DWORD PTR _a25$[esp+60], eax
mov DWORD PTR _a25$[esp+64], eax
mov DWORD PTR _a25$[esp+68], eax
mov DWORD PTR _a25$[esp+72], eax
mov DWORD PTR _a25$[esp+76], eax
mov BYTE PTR _a25$[esp+80], al
call ?useA@@YAXAAV?$A@$0BJ@@@H@Z ; useA
It's very interesting to see the clever ways the compiler optimizes the for-loop. For all those premature optimizers out there using memset(), I would say fool on you.
How will it behave if I move the definitions of A() and MaxLength() to a different cpp file?
It will probably not compile (unless you only use A in that cpp-file).

If I instantiate the above class with different values of S, will the compiler create different instances of the A() and MaxLength() functions? Or will it create one instance and pass S as some sort of argument?
The compiler will instantiate a different copy of the class template for each different value of the parameter. Regarding the member functions, it will instantiate a different copy of each, for each different value of S. But unlike member functions of non-template classes, they will only be generated if they are actually used.
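The "only generated if used" part can be checked with a small experiment. This is a sketch, assuming a hypothetical member OnlyIfUsed() that would fail to compile if its body were ever instantiated:

```cpp
#include <cassert>

template <int S>
class A
{
public:
    int MaxLength() const { return S; }

    // The body is only checked for a given S when something calls it.
    // For any S >= 0 that would be a compile error.
    void OnlyIfUsed() const
    {
        static_assert(S < 0, "OnlyIfUsed() was instantiated");
    }
};

inline int ExampleUsage()
{
    A<5> a;                        // instantiates the class A<5>
    assert(a.MaxLength() == 5);    // instantiates A<5>::MaxLength() only
    return a.MaxLength();
}
```

This compiles because nothing calls A<5>::OnlyIfUsed(). An explicit instantiation such as template class A<5>; would instantiate every member and trigger the static_assert.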
How will it behave if I move the definitions of A() and MaxLength() to a different cpp file?
You mean if you put the definition of A into a header file, but define the member function MaxLength in a cpp file? Well, if users of your class template want to call MaxLength, the compiler wants to see its code, since it wants to instantiate a copy of it with the actual value of S. If it doesn't have the code available, it assumes the code is provided otherwise, and doesn't generate any code:
A.hpp
template<int S> class A {
public:
A() { /* .... */ }
int MaxLength(); /* not defined here in the header file */
};
A.cpp
template<int S> int
A<S>::MaxLength() { /* ... */ }
If you now include only A.hpp in the code using the class template A, the compiler won't see the code for MaxLength and won't generate any instantiation. You have two options:
Include the file A.cpp too, so the compiler sees the code for it, or
Provide an explicit instantiation for values of S you know you will use. For those values, you won't need to provide the code of MaxLength
For the second option, this is done by putting a line like the following inside A.cpp:
template class A<25>;
The compiler will be able to survive without seeing code for the member functions of A<25> now, since you explicitly instantiated a copy of your template for S=25. If you don't use either of the two options above, the linker will refuse to create a final executable, since there is still code missing that is needed.
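For reference, here is the second option squeezed into one listing. It is only a sketch: the comments mark which parts would live in A.hpp and which in A.cpp, and the body of MaxLength is an assumption based on the question.

```cpp
#include <cassert>

// --- would live in A.hpp ---
template <int S>
class A
{
public:
    int MaxLength() const;   // declared here, defined elsewhere
};

// --- would live in A.cpp ---
template <int S>
int A<S>::MaxLength() const
{
    return S;
}

// Explicit instantiation definition: emits all members of A<25>
// into the object file built from A.cpp.
template class A<25>;

inline void ExampleUsage()
{
    A<25> a;
    assert(a.MaxLength() == 25);
}
```

With the split across real files, code that only includes A.hpp can use A<25>, while something like A<7> would still fail at link time. Since C++11 the header can also carry extern template class A<25>; to state that the instantiation lives in another translation unit.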

A<S>::MaxLength() is so trivial it will be fully inlined. Hence, there will be 0 copies. A<S>::A() looks more complex, so it likely will cause multiple copies to be generated. The compiler may of course decide not to, as long as the code works as intended.
You might want to see if you can move the loop to A_base::A_base(int S).

Wherever S is used, a separate version of that function will get compiled into your code for each different S you instantiate.

It will create two different versions of A() and MaxLength() that will return compile-time constants. The simple return S; will be compiled efficiently and even inlined where possible.

Compiler will create different instances of the class if you instantiate it for different types or parameters.

Why does this bad universal initializer syntax compile and result in unpredictable behavior?

I have a bunch of code for working with hardware (FPGA) registers, which is roughly of the form:
struct SomeRegFields {
unsigned int lower : 16;
unsigned int upper : 16;
};
union SomeReg {
uint32_t wholeReg;
SomeRegFields fields;
};
(Most of these register types are more complex. This is illustrative.)
While cleaning up a bunch of code that set up registers in the following way:
SomeReg reg1;
reg1.wholeReg = 0;
// ... assign individual fields
card->writeReg(REG1_ADDRESS, reg1.wholeReg);
SomeReg reg2;
reg2.wholeReg = card->readReg(REG2_ADDRESS);
// ... do something with reg2 field values
I got a bit absent-minded and accidentally ended up with the following:
SomeReg reg1{ reg1.wholeReg = 0 };
SomeReg reg2{ reg2.wholeReg = card->readReg(REG2_ADDRESS) };
The reg1.wholeReg = part is wrong, of course, and should be removed.
What's bugging me is that this compiles on both MSVC and GCC. I would have expected a syntax error here. Moreover, sometimes it works fine and the value actually gets copied/assigned correctly, but other times, it will result in a 0 value even if the register value returned is non-0. It's unpredictable, but appears to be consistent between runs which cases work and which don't.
Any idea why the compilers don't flag this as bad syntax, and why it seems to work in some cases but breaks in others? I assume this is undefined behavior, of course, but why would it change behaviors between what often seem like nearly identical calls, often back-to-back?
Some compilation info:
If I run this through Compiler Explorer:
int main()
{
SomeReg myReg { myReg.wholeReg = 10 };
return myReg.fields.upper;
}
This is the code GCC trunk spits out for main with optimization off (-O0):
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 10
* mov eax, DWORD PTR [rbp-4]
* mov DWORD PTR [rbp-4], eax
movzx eax, WORD PTR [rbp-2]
movzx eax, ax
pop rbp
ret
The lines marked with * are the only difference between this version and a version without the bad myReg.wholeReg = part. MSVC gives similar results, though even with optimization off, it seems to be doing some. In this case, it just causes an extra assignment in and back out of a register, so it still works as intended, but given my accidental experimental results, it must not always compile this way in more complex cases, i.e. not assigning from a compile-time-deducible value.
reg1.wholeReg = card->readReg(REG2_ADDRESS)
This is simply treated as an expression. You are assigning the return value of card->readReg(REG2_ADDRESS) to reg1.wholeReg and then you use the result of this expression (a lvalue referring to reg1.wholeReg) to aggregate-initialize the first member of reg2 (i.e. reg2.wholeReg). Afterwards reg1 and reg2 should hold the same value, the return value of the function.
Syntactically the same happens in
SomeReg reg1{ reg1.wholeReg = 0 };
However, here it is technically undefined behavior since you are not allowed to access variables or class members before they are initialized. Practically speaking, I would expect this to usually work nonetheless, initializing reg1.wholeReg to 0 and then once again.
Referring to a variable in its own initializer is syntactically correct and may sometimes be useful (e.g. to pass a pointer to the variable itself). This is why there is no compilation error.
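As an illustration of the legitimate case, here is a sketch with a hypothetical Node type. Only the address of the variable is taken inside its own initializer; its value is never read before initialization completes:

```cpp
#include <cassert>

// Hypothetical type for illustration: a node that can point at itself.
struct Node
{
    Node* next;
    int value;
};

inline void ExampleUsage()
{
    // &self is fine here: the storage already exists, and nothing reads
    // self.next or self.value before the initialization has finished.
    Node self{ &self, 42 };
    assert(self.next == &self);
    assert(self.value == 42);
}
```

Because this form is valid, the compiler cannot reject the accidental reg1{ reg1.wholeReg = 0 } on syntax grounds alone.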
int main()
{
SomeReg myReg { myReg.wholeReg = 10 };
return myReg.fields.upper;
}
This has additional undefined behavior, even if you fix the initialization, because you can't use a union in C++ for type punning at all. That is always undefined behavior, although some compilers might allow it to the degree that is allowed in C. Still, the standard does not allow reading fields.upper if wholeReg is the active member of the union (meaning the last member to which a value was assigned).
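One way to avoid the union altogether is to extract the fields with shifts and masks. This is only a sketch: the helper names are made up, and it assumes lower occupies bits 0..15 and upper bits 16..31, which the bit-field layout in the question does not guarantee on every compiler.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: `lower` is bits 0..15 and `upper` is bits 16..31 of the register.
constexpr std::uint32_t LowerField(std::uint32_t reg)
{
    return reg & 0xFFFFu;
}

constexpr std::uint32_t UpperField(std::uint32_t reg)
{
    return (reg >> 16) & 0xFFFFu;
}

constexpr std::uint32_t MakeReg(std::uint32_t lower, std::uint32_t upper)
{
    return (lower & 0xFFFFu) | ((upper & 0xFFFFu) << 16);
}

// Mirrors the Compiler Explorer example: the value 10 has an upper half of 0.
static_assert(UpperField(10u) == 0u, "upper half of 10 should be 0");

inline void ExampleUsage()
{
    assert(LowerField(10u) == 10u);
    assert(MakeReg(0x5678u, 0x1234u) == 0x12345678u);
}
```

These helpers have defined behavior regardless of which union member is active, at the cost of spelling out the layout by hand.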

Is it a good idea to cache a raw pointer along with its owning shared_ptr for better access performance?

Consider this scenario:
class A
{
std::shared_ptr<B> _b;
B* _raw;
A(std::shared_ptr<B> b)
{
_b = b;
_raw = b.get();
}
void foo()
{
// Use _raw instead of _b
// avoid one extra indirection / memory jump
// and also avoid polluting cache
}
};
I know that technically it works and offers a slight performance advantage (I tried it). (EDIT: False conclusion).
But my question is: Is it conceptually wrong? Is it bad practice? And why?
And if not why is this hack not more commonly used?
Here is a minimal reproducible example comparing raw pointer access to shared_ptr access:
#include <chrono>
#include <iostream>
#include <memory>
struct timer final
{
timer()
: start{std::chrono::system_clock::now()}
{ }
void elapsed()
{
auto now = std::chrono::system_clock::now();
std::cout << std::chrono::duration<double>(now - start).count() << " seconds" << std::endl;
}
private:
std::chrono::time_point<std::chrono::system_clock> start;
};
struct A
{
size_t a [2097152];
};
int main()
{
size_t data_size = 2097152;
size_t count = 10000000000;
// Using Raw pointer
A * pa = new A();
timer t0;
for(size_t i = 0; i < count; i++)
pa->a[i % data_size] = i;
t0.elapsed();
// Using shared_ptr
std::shared_ptr<A> sa = std::make_shared<A>();
timer t1;
for(size_t i = 0; i < count; i++)
sa->a[i % data_size] = i;
t1.elapsed();
}
Output:
3.98586 seconds
4.10491 seconds
I ran this multiple times and the results are consistent.
EDIT: As per the consensus in the answers, the above experiment is invalid. Compilers are way smarter than they appear.
This answer proves your test is invalid (correct performance measurements in C++ are quite hard since there are lots of pitfalls) and as a result you come to invalid conclusions.
Take a look at this on godbolt.
for loop for first version:
.L39:
mov rdx, rax
and edx, 2097151
mov QWORD PTR [rbp+0+rdx*8], rax
add rax, 1
cmp rax, rcx
jne .L39
for loop for second version:
.L40:
mov rdx, rax
and edx, 2097151
mov QWORD PTR [rbp+16+rdx*8], rax
add rax, 1
cmp rax, rcx
jne .L40
I do not see a difference! The results should be exactly the same.
So I suspect that you took your measurements with a Debug build.
Here is a version where you can compare this.
What is more interesting is that clang is able to optimize away the for loop if the shared pointer is not used. It notices that the loop has no observable result and simply removes it. So if you had used a release configuration, the compiler would just have outsmarted you.
Bottom line:
dereferencing a shared_ptr adds no overhead
when checking performance you must compile with optimizations enabled
you must also make sure the test code was not optimized away, so that the results are valid.
Here is a proper test written using Google Benchmark; the results for both cases are exactly the same.
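If pulling in a benchmark library is not an option, a hand-rolled harness along these lines at least makes it harder for the optimizer to discard the loop. It is a rough sketch: Data, WriteAndSum and SecondsFor are hypothetical names, and it uses steady_clock, which is monotonic, rather than system_clock.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>

struct Data
{
    std::array<std::size_t, 1024> a{};
};

// Writes through any pointer-like object, then returns a checksum of the
// buffer. Using the checksum in the caller keeps the stores observable.
template <typename Ptr>
std::size_t WriteAndSum(Ptr& p, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        p->a[i % p->a.size()] = i;
    }
    std::size_t sum = 0;
    for (std::size_t v : p->a)
    {
        sum += v;
    }
    return sum;
}

// Times one run and hands the checksum back through `checksum`.
template <typename Ptr>
double SecondsFor(Ptr& p, std::size_t count, std::size_t& checksum)
{
    const auto start = std::chrono::steady_clock::now();
    checksum = WriteAndSum(p, count);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The same template is used for a raw Data* and for a std::shared_ptr<Data>, so both runs execute identical source. Print or compare the checksums afterwards, and build with optimizations enabled before reading anything into the timings.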
A shared_ptr internally looks roughly like this:
template <typename _Tp>
struct shared_ptr {
_Tp *_M_ptr;
control_block *cb;
};
So one member (_M_ptr) points to the managed object and the other points to the control block (which is used for reference counting and locking).
The operator-> will look something like this:
_Tp* operator->() const {
return _M_ptr;
}
Because the location of _M_ptr within shared_ptr is known, the compiler can directly retrieve the memory location from which to read the address stored in _M_ptr, without an extra indirection.
If _Tp *_M_ptr is the first member of shared_ptr (which is indeed the case for libstdc++), a compiler could change this code:
std::shared_ptr<A> sa = std::make_shared<A>();
sa->a[0] = 1;
To something that would be similar to this:
std::shared_ptr<A> sa = std::make_shared<A>();
(*reinterpret_cast<A**>(&sa))->a[0] = 1;
And reinterpret_cast is a no-op (it won't create any machine code).
So in most cases, there won't be a difference between dereferencing a shared_ptr or a raw pointer. You might be able to construct a case where the memory address of a raw pointer is solely stored in a register, and the same thing might not be possible for _M_ptr of shared_ptr, but that would IMHO be a really artificial example.
Regarding your performance test:
The shown code will be slightly slower for shared_ptr with no optimizations active because there the indirection over operator->() won't be optimized away. With optimizations on, the compiler might optimize - for the given example - everything away and your measurement can be meaningless.
And if not why is this hack not more commonly used?
In many cases, you use shared_ptr only to manage the ownership. When working on the object that is managed by a shared pointer, you often pass the object itself (by reference, or pointer) to another function that does not need ownership (void do_something_with_object(A *ptr) {…} and a call do_something_with_object(sa.get())) so even if there is a performance impact for a certain stdlib implementation, this won't manifest in most of the cases.
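A small sketch of that pattern, using a made-up struct A with a single int member. The shared_ptr is dereferenced once at the call site and the worker function takes a plain reference:

```cpp
#include <cassert>
#include <memory>

struct A
{
    int value = 0;
};

// No ownership needed here, so a plain reference is enough.
inline void do_something_with_object(A& object)
{
    object.value += 1;
}

inline int ExampleUsage()
{
    std::shared_ptr<A> sa = std::make_shared<A>();
    A& a = *sa;                      // dereference once, keep sa as the owner
    for (int i = 0; i < 3; ++i)
    {
        do_something_with_object(a);
    }
    assert(sa->value == 3);
    return sa->value;
}
```

The reference is a local, so there is no second member to keep in sync with the shared_ptr in copy or move operations.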
Is it conceptually wrong?
In general, it's not wrong a priori to have internal supporting objects/helpers in order to achieve required performance goals. Caches are probably the most popular example. But for your particular example, I tend to say it's conceptually wrong even if it gave slightly better performance (which I doubt, at least once you ask whether the difference is significant), because the raw pointer you cache is not an internal of the shared_ptr. The issue I see here is that you duplicate responsibilities, because you don't trust a well-established standard class, for the sake of a minimal performance gain. In software design, the keyword here, besides single responsibility, is proportionality. You have to duplicate semantics in all relevant places (copy/move constructors), you have to think twice about exception safety, you have to think about that aspect again and again as your class grows, and so on. And the entangled responsibilities make the code hard to understand for other developers who have to work with it.
A word about shared_ptr's performance:
SharedPtr-dereferencing within memory.h for my current VS 2019 looks like this:
template<class _Ty2 = _Ty,
enable_if_t<!is_array_v<_Ty2>, int> = 0>
_NODISCARD _Ty2 * operator->() const noexcept
{ // return pointer to resource
return (get());
}
And get() directly returns the raw pointer. So why should any common compiler's optimizer not be able to inline the hell out of this?

Have the compiler skip setting an argument register before calling a function

TL;DR; I am looking for a standard way to basically tell the compiler to pass whatever happened to be in a given register to the next function.
Basically I have a function int bar(int a, int b, int c). In some cases c is unused and I would like to be able to call bar in the cases where c is unused without modifying rdx in any way.
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
I would like the assembly to just be:
For a tailcall
jmp bar
or for a normal call
call bar
Note: clang generally produces what I am looking for. But I am unsure if this will always be the case in more complex functions and I am hoping to not have to check the assembly each time I build.
GCC produces:
For a tailcall
xorl %edx, %edx
jmp bar
or for a normal call
xorl %edx, %edx
call bar
I can get the results I want using inline assembly i.e changing foo (for tail calls) to
int foo(int a, int b) {
asm volatile("jmp bar" : : :);
__builtin_unreachable();
}
which compiles to just
jmp bar
I understand that the performance implications of an xorl %edx, %edx is about as close to 0 as possible but
I am wondering if there is a standard way to achieve this.
I.e I can probably find a hack for it for any given case. But that will require me verifying the assembly each time. I am looking for a method that you can basically tell the compiler "pass whatever happened to be in register".
See for examples: https://godbolt.org/z/eh1vK8
Edit: This is happening with -O3 set.
I am wondering if there is a standard way to achieve this.
I.e I can probably find a hack for it for any given case. But that
will require me verifying the assembly each time. I am looking for a
method that you can basically tell the compiler "pass whatever
happened to be in register".
No, there is no standard way to achieve it in either C or C++. Neither of these languages speak to any lower-level function call semantics, nor even acknowledge the existence of CPU registers,* and both languages require every function call to provide arguments corresponding to all non-optional parameters (which is simply "all declared parameters" in C).
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
... then you reap undefined behavior as a result of using the value of no_init while it is indeterminate. Whatever any particular C or C++ implementation that accepts that at all does with it is non-standard by definition.
If you want to call bar(), but you don't care what value is passed as the third argument, then why not just choose a convenient value to pass? Zero, for example:
return bar(a, b, 0);
*Even the register keyword does not do this as far as either language standard is concerned.
Note that if the called function does read its 3rd arg, leaving it unwritten risks creating a false dependency on whatever last used EDX. For example it might be the result of a cache-miss load, or a long chain of calculations.
GCC is careful to xor-zero to break false dependencies in a lot of cases, e.g. before cvtsi2ss (bad ISA design) or popcnt (Sandybridge-family quirk).
Usually the xor edx,edx is basically a wasted 2-byte NOP, but it does prevent possible coupling of otherwise-independent dependency chains (critical paths).
If you're sure you want to defeat the compiler's attempt to protect you from that, then Nate's asm("" :"=r"(var)); is a good way to do an integer version of _mm_undefined_ps() that actually leaves a register uninitialized. (Note that _mm_undefined_ps doesn't guarantee leaving an XMM reg unwritten; some compilers will xor-zero one for you instead of fully implementing the false-dependency recklessness that intrinsic was designed to allow for Intel's compiler.)
One approach that should work for gcc/clang on most platforms is to do
int no_init;
asm("" : "=r" (no_init));
return bar(a, b, no_init);
This way you don't have to lie to the compiler about the prototype of bar (which could break some calling conventions), and you fool the compiler into thinking no_init is really initialized.
I would wonder about an architecture like Itanium with its "trap bit" that causes a fault when an uninitialized register is accessed. This code would probably not be safe there.
There is no portable way to get this behavior that I know of, but you could ifdef it:
#ifdef __GNUC__
#define UNUSED_INT ({ int x; asm("" : "=r" (x)); x; })
#else
#define UNUSED_INT 0
#endif
// ...
bar(a, b, UNUSED_INT);
Then you can fall back to the (infinitesimally) less efficient but correct code when necessary.
It results in a bare jmp on gcc/x86-64, see https://godbolt.org/z/d3ordK. On x86-32 it is not quite optimal as it pushes an uninitialized register, instead of just adjusting an existing subtraction from esp. Note that a bare jmp/call is not safe on x86-32 because that third stack slot may contain something important, and the callee is allowed to overwrite it (even if the variable is unused on the path you have in mind, the compiler could be using it as scratch space).
One portable alternative would be to rewrite bar to be variadic. However, then it would need to use va_arg to retrieve the third argument when it is present, and that tends to be less efficient.
Cast the function to have the smaller signature (i.e. fewer parameters):
extern int bar(int, int, int);
int foo(int a, int b) {
return ((int (*)(int,int))bar)(a, b);
}
Maybe make a macro for 2 parameter bar, and even get rid of foo:
extern int bar3(int, int, int);
#define bar2(a,b) ((int (*)(int,int))bar3)(a,b)
int userOfBar(int a, int b) { return bar2 (a,b); }
https://godbolt.org/z/Gn4a69
Oddly, given the above gcc doesn't touch %edx, but clang does... oh, well.
(Still can't guarantee the compiler won't touch some registers, though, that's its domain.  Otherwise, you can write these functions directly in assembly and avoid the middleperson.)

How to test if constexpr is evaluated correctly

I have used constexpr to calculate hash codes at compile time. The code compiles and runs correctly. But I don't know whether the hash values are computed at compile time or at run time. If I trace the code at runtime, I don't step into the constexpr functions. But those are not traced even for runtime values (calculating the hash for a runtime-generated string with the same methods).
I have tried to look into the disassembly, but I don't quite understand it.
For debug purposes, my hash code is only string length, using this:
constexpr inline size_t StringLengthCExpr(const char * const str) noexcept
{
return (*str == 0) ? 0 : StringLengthCExpr(str + 1) + 1;
};
I have ID class created like this
class StringID
{
public:
constexpr StringID(const char * key);
private:
const unsigned int hashID;
};
constexpr inline StringID::StringID(const char * key)
: hashID(StringLengthCExpr(key))
{
}
If I do this in program main method
StringID id("hello world");
I got this disassembled code (part of it - there is a lot of more from inlined methods and other stuff in main)
;;; StringID id("hello world");
lea eax, DWORD PTR [-76+ebp]
lea edx, DWORD PTR [id.14876.0]
mov edi, eax
mov esi, edx
mov ecx, 4
mov eax, ecx
shr ecx, 2
rep movsd
mov ecx, eax
and ecx, 3
rep movsb
// another code
How can I tell from this that the "hash value" is computed at compile time? I don't see any constant like 11 moved to a register. I am not very good with ASM, so maybe it is correct, but I am not sure what to check or how to be sure from this code that the "hash code" values are compile-time and not computed at runtime.
(I am using Visual Studio 2013 + Intel C++ 15 Compiler - VS Compiler is not supporting constexpr)
Edit:
If I change my code and do this
const int ix = StringLengthCExpr("hello world");
mov DWORD PTR [-24+ebp], 11 ;55.15
I have got the correct result
Even with this
change private hashID to public
StringID id("hello world");
// mov DWORD PTR [-24+ebp], 11 ;55.15
printf("%i", id.hashID);
// some other ASM code
But If I use private hashID and add Getter
inline uint32 GetHashID() const { return this->hashID; };
to ID class, then I got
StringID id("hello world");
//see original "wrong" ASM code
printf("%i", id.GetHashID());
// some other ASM code
The most convenient way is to use your constexpr in a static_assert statement. The code will not compile when it is not evaluated during compile time and the static_assert expression will give you no overhead during runtime (and no unnecessary generated code like with a template solution).
Example:
static_assert(_StringLength("meow") == 4, "The length should be 4!");
This also checks whether your function is computing the result correctly or not.
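Applied to the function from the question, a sketch could look like this (only the name StringLengthCExpr comes from the question; the checks themselves are my own picks):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t StringLengthCExpr(const char* str) noexcept
{
    return (*str == 0) ? 0 : StringLengthCExpr(str + 1) + 1;
}

// Both checks are evaluated by the compiler; a wrong result or a
// non-constant evaluation would stop the build.
static_assert(StringLengthCExpr("hello world") == 11, "length should be 11");
static_assert(StringLengthCExpr("") == 0, "empty string should have length 0");

inline void ExampleUsage()
{
    // The same function still works on run-time values.
    const char* runtimeText = "meow";
    assert(StringLengthCExpr(runtimeText) == 4);
}
```

Note that this only proves the function can be evaluated at compile time for these arguments. It says nothing about what the compiler does at other call sites.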
If you want to ensure that a constexpr function is evaluated at compile time, use its result in something which requires compile-time evaluation:
template <size_t N>
struct ForceCompileTimeEvaluation { static constexpr size_t value = N; };
constexpr inline StringID::StringID(const char * key)
: hashID(ForceCompileTimeEvaluation<StringLength(key)>::value)
{}
Notice that I've renamed the function to just StringLength. Names which start with an underscore followed by an uppercase letter, or which contain two consecutive underscores, are not legal in user code. They're reserved for the implementation (compiler & standard library).
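One caveat worth keeping in mind with the template-argument trick: inside a function, a parameter such as key is not itself a constant expression, so the wrapper generally has to be applied at the call site, where the argument is a literal. A sketch of that variant, with a hypothetical macro STRING_LENGTH_CT:

```cpp
#include <cassert>
#include <cstddef>

template <std::size_t N>
struct ForceCompileTimeEvaluation
{
    static constexpr std::size_t value = N;
};

constexpr std::size_t StringLength(const char* str)
{
    return (*str == 0) ? 0 : StringLength(str + 1) + 1;
}

// Hypothetical helper macro: the literal is visible here, so the call is a
// constant expression and can be used as a template argument.
#define STRING_LENGTH_CT(literal) \
    (ForceCompileTimeEvaluation<StringLength(literal)>::value)

inline std::size_t ExampleUsage()
{
    return STRING_LENGTH_CT("hello world");
}
```

If the argument is not a constant expression, this fails to compile, which is exactly the guarantee being asked for.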
In C++20 and later you can use the consteval specifier to declare a function that must be evaluated at compile time, thus requiring a constant expression context.
The consteval specifier declares a function or function template to be
an immediate function, that is, every call to the function must
(directly or indirectly) produce a compile time constant expression.
An example from cppreference(see consteval):
consteval int sqr(int n) {
return n*n;
}
constexpr int r = sqr(100); // OK
int x = 100;
int r2 = sqr(x); // Error: Call does not produce a constant
consteval int sqrsqr(int n) {
return sqr(sqr(n)); // Not a constant expression at this point, but OK
}
constexpr int dblsqr(int n) {
return 2*sqr(n); // Error: Enclosing function is not consteval and sqr(n) is not a constant
}
There are a few ways to force compile-time evaluation. But these aren't as flexible and easy to set up as what you'd expect when using constexpr. And they don't help you find out whether the compile-time constants are actually being used.
What you want is for constexpr to work where you expect it to be beneficial. Therefore you try to meet its requirements. But then you need to test whether the code you expect to be generated at compile time has been generated, and whether the users actually consume the generated result or trigger the function at runtime.
I've found two ways to detect if a class or (member)function is using the compile-time or runtime evaluated path.
Using the property of constexpr functions returning true from the noexcept operator (bool noexcept( expression )) if evaluated at compile-time. Since the generated result will be a compile-time constant. This method is quite accessible and usable with Unit-testing.
(Be aware that marking these functions explicitly noexcept will break the test.)
Source: cppreference.com (2017/3/3)
Because the noexcept operator always returns true for a constant expression, it can be used to check if a particular invocation of a constexpr function takes the constant expression branch (...)
(less convenient) Using a debugger: By putting a break-point inside the function marked constexpr. Whenever the break-point isn't triggered, the compiler evaluated result was used. Not the easiest, but possible for incidental checking.
Source: Microsoft documentation (2017/3/3)
Note: In the Visual Studio debugger, you can tell whether a constexpr function is being evaluated at compile time by putting a breakpoint inside it. If the breakpoint is hit, the function was called at run-time. If not, then the function was called at compile time.
I've found both these methods to be useful while experimenting with constexpr. Although I haven't done any testing with environments outside VS2017. And haven't been able to find an explicit statement supporting this behaviour in the current draft of the standard.
The following trick can help to check if the constexpr function has been evaluated during compile time only:
With gcc you can compile the source file with assembly listing + c sources; given that both the constexpr and its calls are in source file try.cpp
gcc -std=c++11 -O2 -Wa,-a,-ad try.cpp | c++filt >try.lst
If the constexpr function has been evaluated at run time, then you will see the compiled function and a call instruction (call function_name on x86) in the assembly listing try.lst (note that the c++filt command has undecorated the linker names).
Interestingly, I always see a call if compiling without optimization (without the -O2 or -O3 option).
Simply put it in a constexpr variable.
constexpr StringID id("hello world");
constexpr int ix = StringLengthCExpr("hello world");
A constexpr variable is always a real constant expression. If it compiles, it is computed at compile time.
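Put together with the class from the question, a sketch might look like the following. Making GetHashID constexpr and adding the cast to unsigned int are my own changes, so that the getter can be checked in a static_assert:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t StringLengthCExpr(const char* str) noexcept
{
    return (*str == 0) ? 0 : StringLengthCExpr(str + 1) + 1;
}

class StringID
{
public:
    constexpr StringID(const char* key)
        : hashID(static_cast<unsigned int>(StringLengthCExpr(key)))
    {
    }

    // constexpr getter (an addition to the question's code)
    constexpr unsigned int GetHashID() const { return hashID; }

private:
    const unsigned int hashID;
};

// Must be initialized by a constant expression, or the build fails.
constexpr StringID id("hello world");
static_assert(id.GetHashID() == 11, "hash of \"hello world\" should be 11");

inline void ExampleUsage()
{
    assert(id.GetHashID() == 11);
}
```

A non-constexpr StringID built from the same literal may still be computed at run time, so the guarantee only covers objects declared this way.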

Unconventional Calls with Inline ASM

I'm working with a proprietary MCU that has a built-in library in metal (mask ROM). The compiler I'm using is clang, which uses GCC-like inline ASM. The issue I'm running into, is calling the library since the library does not have a consistent calling convention. While I found a solution, I've found that in some cases the compiler will make optimizations that clobber registers immediately before the call, I think there is just something wrong with how I'm doing things. Here is the code I'm using:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))(); //MASKROM_EchoByte is a 16-bit integer with the memory location of the function
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))();
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
}
This seems really ugly to me, and like there should be a better way. Also I'm worried that the compiler may do some nonsense in between, since the call itself has no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work as it will follow the C-convention, which puts arguments into R2+ (R1 is reserved for scratching)
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. In ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
push {lr}
mov r1, r0 // move our first parameter into r1
bl _MASKROM_EchoByte
pop {pc}
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
But if you really, really, really must do this via inline assembly in a C function, then (as @JasonD pointed out) all you need to do is add the link register lr to the clobber list.
void EchoByte(int hex)
{
register int r1 asm("r1") = hex;
asm volatile(
"bl _MASKROM_EchoByte"
:
: "r"(r1)
: "r1", "lr" // Compare the codegen with and without this "lr"!
);
}