C++ code with undefined behavior, compiler generates std::exception - c++

I came across an interesting secure coding rule in C++ which states:
Do not reenter a function during the initialization of a static variable declaration. If a function is reentered during the constant initialization of a static object inside that function, the behavior of the program is undefined. Infinite recursion is not required to trigger undefined behavior, the function need only recur once as part of the initialization.
The non_compliant example of the same is:
#include <stdexcept>
int fact(int i) noexcept(false) {
if (i < 0) {
// Negative factorials are undefined.
throw std::domain_error("i must be >= 0");
}
static const int cache[] = {
fact(0), fact(1), fact(2), fact(3), fact(4), fact(5),
fact(6), fact(7), fact(8), fact(9), fact(10), fact(11),
fact(12), fact(13), fact(14), fact(15), fact(16)
};
if (i < (sizeof(cache) / sizeof(int))) {
return cache[i];
}
return i > 0 ? i * fact(i - 1) : 1;
}
which according to the source gives the error:
terminate called after throwing an instance of '__gnu_cxx::recursive_init_error'
what(): std::exception
when executed in Visual Studio 2013. I tried similar code of my own and got the same error (compiled using g++ and executed, on Ubuntu).
I am not sure my understanding of this is correct, as I am not well-versed in C++. As I see it, the cache array is static and const, so it should be initialized only once and be read-only afterwards. But its initial values are the results of the comma-separated recursive calls, so the array seems to get initialized again and again, which contradicts how it was declared. That would give the undefined behavior stated in the rule.
What is a better explanation for this?

In order to execute fact(), you first need to statically initialize fact::cache[]. In order to initialize fact::cache[], you need to execute fact(). There's a circular dependency there, which leads to the behavior you see. cache will only be initialized once, but it requires itself to be initialized in order to initialize itself. Even typing this makes my head spin.
The right way to introduce a cache table like this is to separate it into a different function:
int fact(int i) noexcept(false) {
if (i < 0) {
// Negative factorials are undefined.
throw std::domain_error("i must be >= 0");
}
return i > 0 ? i * fact(i - 1) : 1;
}
int memo_fact(int i) noexcept(false) {
static const int cache[] = {
fact(0), fact(1), fact(2), fact(3), fact(4), fact(5),
fact(6), fact(7), fact(8), fact(9), fact(10), fact(11),
fact(12), fact(13), fact(14), fact(15), fact(16)
};
if (i < (sizeof(cache) / sizeof(int))) {
return cache[i];
}
else {
return fact(i);
}
}
Here, memo_fact::cache[] will only be initialized once - but its initialization is no longer dependent on itself. So we have no issue.
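Another way to avoid the re-entrancy is to fill the table with a loop, so that initializing the static never calls back into the function that owns it. The following is a minimal sketch, not the code from the rule or from this answer: the names make_fact_table and kTableSize are hypothetical, and the table stops at 12! on the assumption that int is 32 bits wide (13! no longer fits).

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Hypothetical table size: 0! through 12! fit in a 32-bit int.
constexpr std::size_t kTableSize = 13;

// Builds the table iteratively, so nothing here re-enters memo_fact().
std::array<int, kTableSize> make_fact_table() {
    std::array<int, kTableSize> table{};
    table[0] = 1;
    for (std::size_t n = 1; n < kTableSize; ++n) {
        table[n] = table[n - 1] * static_cast<int>(n);
    }
    return table;
}

int memo_fact(int i) {
    if (i < 0) {
        throw std::domain_error("i must be >= 0");
    }
    // Initialized once, on the first call that reaches this line;
    // the initializer never calls memo_fact().
    static const std::array<int, kTableSize> cache = make_fact_table();
    if (static_cast<std::size_t>(i) >= cache.size()) {
        throw std::overflow_error("i! does not fit in int");
    }
    return cache[static_cast<std::size_t>(i)];
}
```

With this layout, a call such as memo_fact(5) only reads the table, and out-of-range arguments are reported with an exception instead of overflowing.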

The C++ standard, §6.7/4, says the following about the initialisation of block-scope variables with static storage duration:
If control re-enters the declaration recursively while the variable is
being initialized, the behavior is undefined.
The following informative example is given:
int foo(int i) {
static int s = foo(2*i); // recursive call - undefined
return i+1;
}
This applies to your example as well. fact(0) is a recursive call, so the declaration of cache is re-entered. Undefined behaviour is invoked.
It's important to recall what undefined behaviour means. Undefined behaviour means that anything can happen, and "anything" quite naturally includes exceptions being thrown.
Undefined behaviour also means that you can no longer reason about anything else in the code, except when you really want to get down to compiler-implementation details. But then you are no longer talking about C++ in terms of using a programming language but in terms of how to implement that language.

Related

Returning reference to local variable without trivial conversion

I am quite a newbie to the C++ programming, but this question keeps on spinning in my head. I understand that returning reference to a local variable in a function is illegal, i.e. compiling this code snippet:
inline int& funref() {
int a = 8;
return a; // not OK!
}
results in a warning from the compiler and then a runtime error. But then, why does this piece of code get compiled without any warnings and run without error?
inline int& funref() {
int a = 8;
int& refa = a;
return refa; // OK!
}
int main() {
int& refa = funref();
cout << refa;
}
My compiler is g++ on Linux Fedora platform.
It's still wrong, it just happens to be working by (un)happy coincidence.
This code has undefined behaviour with all the usual caveats (it might always work, it might always work until it's too late to fix, it might set fire to your house and run away with your betrothed).
The compiler isn't required to issue a diagnostic (warning or error message) for every possible mistake, just because it isn't always possible to do so. Here, at least your current version of g++ hasn't warned. A different compiler, or a different version of g++, or even the same version with different flags, might warn you.
The reason you can't return a reference to a local variable is that the variable's lifetime ends when the function returns, so the reference would refer to an object that no longer exists. Simply put, the compiler's warning tries to keep you from referencing garbage data.
However, the compiler isn't bulletproof (as shown in your example #2).
It does work for retrieving a singleton instance, though.
inline int& funref()
{
static int* p_a = nullptr;
if (nullptr == p_a)
p_a = new int(8);
return *p_a;
}
This case is valid because the memory pointed to by p_a remains valid after the function returns.
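For comparison, here is a short sketch of two forms that do not dangle: returning by value, and returning a reference to a function-local static, which lives until the program exits. The function names by_value and by_static_ref are made up for illustration.

```cpp
// Returns a copy; the caller never refers to the local itself.
int by_value() {
    int a = 8;
    return a;
}

// The static object outlives the call, so the reference stays valid.
int& by_static_ref() {
    static int a = 8;
    return a;
}
```

Unlike the version that uses new, the static-local form needs no pointer check and allocates nothing on the heap.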

C++ understanding RVO (as compared to returning local variable reference)

It's my first year of using C++ and learning on the way. I'm currently reading up on Return Value Optimizations (I use C++11 btw). E.g. here https://en.wikipedia.org/wiki/Return_value_optimization, and immediately these beginner examples with primitive types spring to mind:
int& func1()
{
int i = 1;
return i;
}
//error, 'i' was declared with automatic storage (in practice on the stack(?))
//and is undefined by the time function returns
...and this one:
int func1()
{
int i = 1;
return i;
}
//perfectly fine, 'i' is copied... (to previous stack frame... right?)
Now, I get to this and try to understand it in the light of the other two:
Simpleclass func1()
{
return Simpleclass();
}
What actually happens here? I know most compilers will optimise this, what I am asking is not 'if' but:
how the optimisation works (the accepted response)
does it interfere with storage duration: stack/heap (Old: Is it basically random whether I've copied from stack or created on heap and moved (passed the reference)? Does it depend on created object size?)
is it not better to use, say, explicit std::move?
You won't see any effect of RVO when returning ints.
However, when returning large objects like this:
struct Huge { ... };
Huge makeHuge() {
Huge h { x, y, z };
h.doSomething();
return h;
}
The following code...
auto h = makeHuge();
... after RVO would be implemented something like this (pseudo code) ...
h_storage = allocate_from_stack(sizeof(Huge));
makeHuge(addressof(h_storage));
auto& h = *properly_aligned(h_storage);
... and makeHuge would compile to something like this...
void makeHuge(Huge* h_storage) // in fact this address can be
// inferred from the stack pointer
// (or just 'known' when inlining).
{
Huge* phuge = new (h_storage) Huge(x, y, z); // placement new into the caller's storage
phuge->doSomething();
}
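One way to observe this from inside a program is to count copy and move constructor calls. The sketch below uses a hypothetical Tracker type and is not taken from the answer above. Whether the named-local case is elided is up to the compiler, but it should not fall back to a copy when a move constructor is available, whereas wrapping the return value in std::move forces a move and rules out the elision.

```cpp
#include <utility>

// Hypothetical type that counts how often it is copied or moved.
struct Tracker {
    static int copies;
    static int moves;
    Tracker() = default;
    Tracker(const Tracker&) { ++copies; }
    Tracker(Tracker&&) noexcept { ++moves; }
};
int Tracker::copies = 0;
int Tracker::moves = 0;

// Returning a temporary: typically built directly in the caller's storage.
Tracker make_temporary() { return Tracker(); }

// Returning a named local: may be elided (NRVO), otherwise moved.
Tracker make_named() {
    Tracker t;
    return t;
}

// Explicit std::move: the return value is move-constructed, NRVO cannot apply.
Tracker make_moved() {
    Tracker t;
    return std::move(t);
}
```

Calling the three functions and then inspecting Tracker::copies and Tracker::moves shows that none of them copies, and that only the std::move variant is guaranteed to pay for a move.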

Can the compiler decide the noexcept'ness of a function?

Let's do an example
class X
{
int value;
public:
X (int def = 0) : value (def) {}
void add (int i)
{
value += i;
}
};
Clearly, the function void X::add (int) will never throw any exception.
My question is, can the compiler analyze the code and decide not to generate machine code to handle exceptions, even if the function is not marked as noexcept?
If the compiler can prove that a function will never throw, it is allowed by the "As-If" rule (§1.9, "Program execution" of the C++ standard) to remove the code to handle exceptions.
However it is not possible to decide if a function will never throw in general, as it amounts to solving the Halting Problem.
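This optimization is separate from what the language itself reports. A sketch, assuming a hypothetical marked twin add_noexcept (and a getter) added to the class for comparison: the noexcept operator consults only the declared exception specification, so the unmarked add counts as potentially throwing no matter how trivial its body is.

```cpp
#include <utility>

class X {
    int value;
public:
    X(int def = 0) : value(def) {}
    void add(int i) { value += i; }                    // no exception specification
    void add_noexcept(int i) noexcept { value += i; }  // hypothetical marked twin
    int get() const { return value; }
};

// The noexcept operator looks at the declaration only, never at the body.
static_assert(!noexcept(std::declval<X&>().add(1)),
              "unmarked function is treated as potentially throwing");
static_assert(noexcept(std::declval<X&>().add_noexcept(1)),
              "marked function is known not to throw");
```

So the compiler is free to drop exception-handling code for add when it can see the body, but code that queries noexcept (for example, a container deciding between move and copy) still sees it as potentially throwing unless it is marked.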

Un-initialize a variable in C/C++

This is more a theoretical question than a practical one, but I was wondering whether is it possible to un-initialize a variable in C (or C++). So let's say we have the following code:
void some_fun()
{
int a; // <-- Here a is an uninitialized variable; its value is whatever
// is found in memory at that location.
a = 7; // Now I have initialized it
// do something with a ...
// and here do something again with a such that it
// has the same (indeterminate) state as before initialization
}
(No, I don't want to put a random value in a because that would be also an initialization, nor 0, because that's a very nice value, ... I just want it to be again in that "I don't know anything about it" stage it was before initializing it).
(Yes I am aware of: What happens to a declared, uninitialized variable in C? Does it have a value?)
You can use setjmp() and longjmp() to get the behavior you want, with some rearrangement of your code. The code below initializes a to 1 so that the print statements do not invoke undefined behavior.
#include <setjmp.h>
#include <stdio.h>
jmp_buf jb;
void some_func (void)
{
int a = 1;
if (setjmp(jb) == 0) {
a = 7;
// do something
printf("a = %d (initialized)\n", a);
// use longjmp to make `a` not initialized
longjmp(jb, 1);
// NOTREACHED
} else {
printf("a = %d (not initialized)\n", a);
}
}
The longjmp() call returns back to the saved context of setjmp(), and moving to the else case means that a had not been initialized.
When compiled with GCC with optimizations, the above function outputs:
a = 7 (initialized)
a = 1 (not initialized)
If you want this behavior without optimizations enabled, try adding the register storage class to a's declaration.
A longer explanation
So, why did I think setjmp() and longjmp() would work? This is what C.11 §7.13 ¶1-2 has to say about it:
The header <setjmp.h> defines the macro setjmp, and declares one function and
one type, for bypassing the normal function call and return discipline.
The type declared is
jmp_buf
which is an array type suitable for holding the information needed to restore a calling
environment. The environment of a call to the setjmp macro consists of information
sufficient for a call to the longjmp function to return execution to the correct block and
invocation of that block, were it called recursively. It does not include the state of the
floating-point status flags, of open files, or of any other component of the abstract
machine.
This explains that a longjmp back to the context saved in the jmp_buf by a call to setjmp acts as if the code that ran up until the longjmp call was a recursive function call, and the longjmp acts like a return from that recursive call back to the setjmp. To me, this implies that the automatic variable would be "uninitialized".
int a;
// the following expression will be false if returning from `longjmp`
if (setjmp(jb) == 0) {
// this section of code can be treated like the `setjmp` call induced
// a recursive call on this function that led to the execution of the
// code in this body
a = 7;
//...
// assuming no other code modified `jb`, the following call will
// jump back to the `if` check above, but will behave like a recursive
// function call had returned
longjmp(jb, 1);
} else {
// `a` expected to be uninitialized here
}
But, there seems to be a catch. From C.11 §7.13.2 ¶3:
All accessible objects have values, and all other components of the abstract machine
have state, as of the time the longjmp function was called, except that the values of
objects of automatic storage duration that are local to the function containing the
invocation of the corresponding setjmp macro that do not have volatile-qualified type
and have been changed between the setjmp invocation and longjmp call are
indeterminate.
Since a is local, is not volatile-qualified, and has been changed between setjmp and longjmp calls, its value is indeterminate, even if it was properly initialized before calling setjmp!
So, using longjmp back to a local setjmp after an automatic non-volatile variable has been modified will always result in making those modified variables "uninitialized" after returning to the point of the setjmp.
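The flip side of the quoted paragraph is that a volatile-qualified local is exempt: it keeps the value it had when longjmp was called. A small sketch of that case, written as C++ with <csetjmp>; the function name value_after_longjmp is hypothetical, and the function deliberately has no objects with non-trivial destructors, since jumping past those would be undefined in C++.

```cpp
#include <csetjmp>

static std::jmp_buf jb;

int value_after_longjmp() {
    volatile int a = 1;       // volatile: not covered by the "indeterminate" exception
    if (setjmp(jb) == 0) {
        a = 7;                // modified between setjmp and longjmp
        std::longjmp(jb, 1);  // jump back to the setjmp above
    }
    return a;                 // still holds the value from the time of longjmp
}
```

In other words, volatile is the tool for when you want the opposite of "un-initializing": the variable is guaranteed to keep the 7.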
You can emulate this using boost::optional<T>:
#include <boost/optional.hpp>
#include <iostream>
int main()
{
boost::optional<int> a;
a = 7;
std::cout << a.is_initialized() << std::endl; // true
a.reset(); // "un-initialize"
std::cout << a.is_initialized() << std::endl; // false
}
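If C++17 is available, std::optional expresses the same idea as the Boost example without the extra dependency. A minimal sketch; the helper name demo_reset is hypothetical.

```cpp
#include <optional>

// Returns true if the optional held a value and is empty again after reset().
bool demo_reset() {
    std::optional<int> a;            // starts out empty ("uninitialized")
    a = 7;                           // now holds a value
    bool had_value = a.has_value();
    a.reset();                       // back to the empty state
    return had_value && !a.has_value();
}
```

The empty state here is a well-defined, testable state rather than a truly indeterminate value, which is usually what code that wants to "forget" a variable needs.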
I am curious why you want to do that. However, did you try the following?
void some_fun() {
int a;
int b = a; // Hoping compiler does not discard this.
a = 7;
// do something
a = b;
}
Another approach is along the lines of:
int a, olda;
memcpy(&olda, &a, sizeof(a));
a = 7;
//...
memcpy(&a, &olda, sizeof(a));
// a is "uninitialized"
This avoids the trap representation issues of using assignment, relying on the fact that unsigned char (which memcpy copies as) does not have any trap representations. It also benefits from being vastly simpler than using setjmp() and longjmp().

"missing return statement", but I know it is there

Assume I have the following function:
// Precondition: foo is '0' or 'MAGIC_NUMBER_4711'
// Returns: -1 if foo is '0'
// 1 if foo is 'MAGIC_NUMBER_4711'
int transmogrify(int foo) {
if (foo == 0) {
return -1;
} else if (foo == MAGIC_NUMBER_4711) {
return 1;
}
}
The compiler complains "missing return statement", but I know that foo never has different values than 0 or MAGIC_NUMBER_4711, or else my function shall have no defined semantics.
What are preferable solutions to this?
Is this really an issue, i.e. what does the standard say?
Sometimes, your compiler is not able to deduce that your function actually has no missing return. In such cases, several solutions exist:
Assume the following simplified code (modern compilers will see that no path is missing a return here; it only serves as an example):
if (foo == 0) {
return bar;
} else {
return frob;
}
Restructure your code
if (foo == 0) {
return bar;
}
return frob;
This works well if you can interpret the if-statement as a kind of firewall or precondition.
abort()
if (foo == 0) {
return bar;
} else {
return frob;
}
abort(); return -1; // unreachable
Return something else accordingly. The comment tells fellow programmers and yourself why this is there.
throw
#include <stdexcept>
if (foo == 0) {
return bar;
} else {
return frob;
}
throw std::runtime_error ("impossible");
Disadvantages of Single Function Exit Point
flow of control
Some fall back to one-return-per-function a.k.a. single-function-exit-point as a workaround. This might be seen as obsolete in C++ because you almost never know where the function will really exit:
void foo(int&);
int bar () {
int ret = -1;
foo (ret);
return ret;
}
Looks nice and looks like SFEP, but reverse engineering the 3rd party proprietary libfoo reveals:
void foo (int &) {
if (rand()%2) throw ":P";
}
This argument does not hold true if bar() is nothrow and so can only call nothrow functions.
complexity
Every mutable variable increases the complexity of your code and puts a higher burden on the mental capacity of your code's maintainer. It means more code and more state to test and verify, which in turn means more state the maintainer has to keep in mind, which in turn leaves less capacity for the important stuff.
missing default constructor
Some classes have no default construction and you would have to write really bogus code, if possible at all:
File mogrify() {
File f ("/dev/random"); // need bogus init because it requires readable stream
...
}
That's quite a hack just to get it declared.
In C89 and in C99, the return statement is never required, even in a function whose return type is not void.
C99 only says:
(C99, 6.9.1p12) "If the } that terminates a function is reached, and the value of the function call is used by the caller, the behavior is undefined."
In C++11, the Standard says:
(C++11, 6.6.3p2) "Flowing off the end of a function is equivalent to a return with no value; this results in undefined behavior in a value-returning function"
Just because you can tell that the input will only have one of two values doesn't mean the compiler can, so it's expected that it will generate such a warning.
You have a couple of options for helping the compiler figure this out.
You could use an enumerated type for which the two values are the only valid enumerated values. Then the compiler can tell immediately that one of the two branches has to execute and there's no missing return.
You could abort at the end of the function.
You could throw an appropriate exception at the end of the function.
Note that the latter two options are better than silencing the warning because they predictably show you when the preconditions are violated rather than allowing undefined behavior. Since the function takes an int and not a class or enumerated type, it's only a matter of time before someone calls it with a value other than the two allowed ones, and you want to catch such calls as early in the development cycle as possible rather than pushing them off as undefined behavior because they violated the function's requirements.
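The enumerated-type option can be combined with a throw at the end. This is a sketch under assumptions: the enum name Foo and its enumerators are hypothetical stand-ins for 0 and MAGIC_NUMBER_4711. Some compilers may still warn about a switch without a trailing return or throw, because an enum object can hold values other than its enumerators, so the final throw both quiets that warning and catches precondition violations.

```cpp
#include <stdexcept>

// Hypothetical enumeration replacing the two allowed int values.
enum class Foo { Zero, Magic };

int transmogrify(Foo foo) {
    switch (foo) {
        case Foo::Zero:  return -1;
        case Foo::Magic: return 1;
    }
    // Only reachable if foo holds a value that is not an enumerator.
    throw std::logic_error("transmogrify: invalid Foo value");
}
```

Callers now have to write transmogrify(Foo::Magic) rather than passing an arbitrary int, so most misuse is rejected at compile time and the rest fails loudly at run time.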
Actually the compiler is doing exactly what it should.
int transmogrify(int foo) {
if (foo == 0) {
return -1;
} else if (foo == MAGIC_NUMBER_4711) {
return 1;
}
// you know you shouldn't get here, but the compiler has
// NO WAY of knowing that. In addition, you are putting
// great potential for the caller to create a nice bug.
// Why don't you catch the error using an ELSE clause?
else {
error( "transmogrify had invalid value %d", foo ) ;
return 0 ;
}
}