Must constant evaluator reject undefined behavior (union example) in C++? - c++

As far as I know, undefined behavior is required to be a compile-time error during constant evaluation.
But if one takes an example of undefined behavior from the C++20 standard, class.union#6.3, with a minor modification to activate constant evaluation:
struct X { const int a; int b; };
union Y { X x; int k; };
constexpr bool g() {
    Y y = { { 1, 2 } }; // OK, y.x is active union member ([class.mem])
    int n = y.x.a;
    y.k = 4;   // OK: ends lifetime of y.x, y.k is active member of union
    y.x.b = n; // undefined behavior: y.x.b modified outside its lifetime,
               // S(y.x.b) is empty because X's default constructor is deleted,
               // so union member y.x's lifetime does not implicitly start
    return y.x.b > 0;
}
int main() {
    static_assert( g() );
}
then it is accepted by all compilers without any warnings. Demo: https://gcc.godbolt.org/z/W7o4n5KrG
Are all compilers wrong here, or is there no undefined behavior in the example, or is no diagnostic required?

In the original versions of the C and C++ Standards, the phrase "Undefined Behavior" was intended to mean nothing more nor less than "the Standard imposes no requirements". There was no perceived need for the Standard to classify every possible execution of every possible construct as either having unambiguously defined behavior or being readily and unambiguously recognizable as invoking Undefined Behavior.
Both the C and C++ drafts explicitly state that in cases where the Standard imposes no requirements, implementations may behave "in a documented manner characteristic of the environment". If there were some execution environment where cache lines were twice as large as int, and where storing an int value into the first half of a cache line and zeroing the rest would be faster than the read-modify-write sequence necessary to update just the first half while leaving the remainder undisturbed, an implementation for that platform might process the act of writing to y.k in a manner which would disturb the storage associated with y.x.b. On the other hand, for most environments the "characteristic behavior" of writing y.k would be to modify an int-sized chunk of storage, while leaving the remainder of the storage associated with the union undisturbed.
Treating the act of writing to y.k and then reading y.x.b as UB was intended to allow implementations to process the write to y.k in the fastest fashion, without having to consider whether code might care about the contents of y.x.b. It was not intended to require that implementations make any effort to prevent code from accessing y.x.b after writing y.k. Although C++ mandates that integer constant expressions within a template expansion be viewed as substitution failures in cases where they invoke certain actions upon which the Standard would otherwise impose no requirements, requiring that all such actions be treated as substitution failures would create contradictions where the Standard could be interpreted both as requiring that a compiler make a particular template substitution, and as requiring that it refrain from doing so.

Huh, I guess it is the compiler being a bit lax - but there is technically nothing undefined about this at compile time, as there is no way for y.x.a to ever be accessed. Indeed, if you change your definition of g to return y.x.a; instead of y.x.b > 0 then it does spit out an error message ("expression did not evaluate to a constant" on my machine).
When a compiler evaluates a constexpr expression, instead of compiling the relevant parts of the code it is (universally, as far as I'm aware, but don't quote me on that) delegated to an interpreter to evaluate, and the constant result is then given to the compiler to be compiled along with the rest of the non-constexpr code. Interpreters are generally far worse at catching what we would call "compile time errors", and so if nothing is actually undefined about the execution of the code, then this is probably good enough for the interpreter. For instance, there is some documentation on the clang interpreter which shows that the execution model is very different from how one would expect the compiled code to run.
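For reference, here is a sketch of the modified g the answer describes (only the final return is changed, so that it reads y.x.a, a member outside its lifetime); with this change the static_assert is rejected during constant evaluation, with a diagnostic along the lines of "expression did not evaluate to a constant", though the exact wording varies by compiler:
struct X { const int a; int b; };
union Y { X x; int k; };
constexpr bool g() {
    Y y = { { 1, 2 } };
    int n = y.x.a;
    y.k = 4;         // ends the lifetime of y.x
    y.x.b = n;
    return y.x.a;    // changed line: this read outside y.x's lifetime is diagnosed
}
int main() {
    static_assert( g() );  // expected to fail to compile with this change
}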

If this is undefined behavior then why is it given as a seemingly legitimate example?

In a Wikipedia article on type punning it gives an example of pointing an int type pointer at a float to extract the signed bit:
However, supposing that floating-point comparisons are expensive, and
also supposing that float is represented according to the IEEE
floating-point standard, and integers are 32 bits wide, we could
engage in type punning to extract the sign bit of the floating-point
number using only integer operations:
bool is_negative(float x) {
    unsigned int *ui = (unsigned int *)&x;
    return *ui & 0x80000000;
}
Is it true that pointing a pointer to a type not its own is undefined behavior? The article makes it seem as if this operation is a legitimate and common thing. What are the things that can possibly go wrong in this particular piece of code? I'm interested in both C and C++, if it makes any difference. Both have the strict aliasing rule, right?
Is it true that pointing a pointer to a type not its own is undefined behavior?
No, both C and C++ allow an object pointer to be converted to a different pointer type, with some caveats.
But with a few narrow exceptions, accessing the pointed-to object via the differently-typed pointer does have undefined behavior. Such undefined behavior arises from evaluating the expression *ui in the example function.
The article makes it seem as if this operation is a legitimate and common thing. What are the things that can possibly go wrong in this particular piece of code?
The behavior is undefined, so anything and everything within the power of the program to do is possible. In practice, the observed behavior might be exactly what the author(s) of the Wikipedia article expected, and if not, then the most likely misbehaviors are variations on the function computing incorrect results.
I'm interested in both C and C++, if it makes any difference. Both have the strict aliasing rule, right?
To the best of my knowledge, the example code has undefined behavior in both C and C++, for substantially the same reason.
The fact that it is technically undefined behaviour to call this is_negative function implies that compilers are legally allowed to "exploit" this fact, e.g., in the below code:
if (condition) {
    is_negative(bar);
} else {
    // do something
}
the compiler may "optimize out" the branch, by evaluating condition and then unconditionally proceeding to the else substatement even if the condition is true.
However, because this would break enormous amounts of existing code, "real" compilers are practically forced to treat is_negative as if it were legitimate. In legal C++, the author's intent is expressed as follows:
unsigned int ui;
memcpy(&ui, &x, sizeof(x));
return ui & 0x80000000;
So the reinterpret_cast approach to type punning, while undefined according to the standard in this case, is thought of by many people as "de facto implementation-defined" and equivalent to the memcpy approach.
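For completeness, here is a self-contained sketch of that memcpy approach as a full function, under the same assumptions as the original (IEEE float, 32-bit unsigned int):
#include <cstring>   // std::memcpy

bool is_negative(float x) {
    unsigned int ui;
    static_assert(sizeof ui == sizeof x, "requires same-sized types");
    std::memcpy(&ui, &x, sizeof x);  // copies the object representation; no aliasing violation
    return ui & 0x80000000u;
}
In C++20, std::bit_cast<unsigned int>(x) from <bit> expresses the same intent even more directly.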
Why
If this is undefined behavior then why is it given as a seemingly legitimate example?
This was a common practice before C was standardized and added the rules about aliasing, and it has unfortunately persisted in practice. Nonetheless, Wikipedia pages should not be offering it as an example.
Aliasing Via Pointer Conversions
Is it true that pointing a pointer to a type not its own is undefined behavior?
The rules are more complicated than that, but, yes, many uses of an object through an lvalue of a different type are not defined by the C or C++ standards, including this one. There are also rules about pointer conversions that may be violated.
The fact that many compilers support this behavior even though the C and C++ standards do not require them to is not a reason to do so, as there is a simple alternative defined by the standards (use memcpy, below).
Using Unions
In C, an object may be reinterpreted as another type using a union. C++ does not define this:
union { float f; unsigned int ui; } u = { .f = x };
unsigned int ui = u.ui;
or the new value may be obtained more tersely using a compound literal:
(union { float f; unsigned int ui; }) {x} .ui
Naturally, float and unsigned int should have the same size when using this.
Copying Bytes
Both C and C++ support reinterpreting an object by copying the bytes that represent it:
unsigned int ui;
memcpy(&ui, &x, sizeof ui);
Naturally, float and unsigned int should have the same size when using this. The above is C code; C++ requires std::memcpy or a suitable using declaration.
Accessing data through pointers (or unions) is pretty common in (embedded) C code, but it often requires extra knowledge:
If a float were smaller than an int, you would be accessing memory outside the defined object.
The code makes several assumptions about where and how the sign bit is stored (little vs. big endian, two's complement).
When the C Standard characterizes an action as invoking Undefined Behavior, that implies that at least one of the following is true:
The code is non-portable.
The code is erroneous.
The code is acting upon erroneous data.
One of the reasons the Standard leaves some actions Undefined is to, among other things, "identify areas of possible conforming language extension: the implementor may augment the language by providing a definition of the officially undefined behavior." A common extension, listed in the Standard as one of the ways implementations may process constructs that invoke "Undefined Behavior", is to process some such constructs by "behaving during translation or program execution in a documented manner characteristic of the environment".
I don't think the code listed in the example claims to be 100% portable. As such, the fact that it invokes Undefined Behavior does not preclude the possibility of it being non-portable but correct. Some compiler writers believe that the Standard was intended to deprecate non-portable constructs, but such a notion is contradicted by both the text of the Standard and the published Rationale. According to the published Rationale, the authors of the Standard wanted to give programmers a "fighting chance" [their term] to write portable code, and defined a category of maximally-portable programs, but they did not specify portability as a requirement for anything other than strictly conforming C programs, and they expressly did not wish to demean programs that were conforming but not strictly conforming.

What would break if control-reaches-end-of-body were to return nullopt?

C++14 gave us automatic return type deduction, and C++17 has an optional<T> type template (or type constructor, if you will). Now, true, optional lives within the standard library, not the language itself, but - why not use it as the return value of a non-void function when control reaches the end of the body? I would think that:
optional<int> foo(int x)
{
    if (x > 0) return 2 * x;
}
should be perfectly valid and compilable syntax for a partial function, returning optional<int>.
Now, I know this is a bit of a crazy idea. My question is not whether you like it or not, but rather - suppose everyone on the committee really liked it for some strange reason. What would it break / conflict with?
Note: Of course if you specify a non-optional return value this can't work, but that doesn't count as breakage.
Think of functions that end with abort(); or a custom function that has the same effect. If the compiler cannot statically prove functions never reach the closing }, this would force the compiler to generate dead code, and is therefore in conflict with one of the principles of C++, namely the zero overhead principle: what you don't use, you don't pay for.
Special casing std::optional is ridiculous here. Users should be able to write their own first-class equivalent to std::optional.
Which means falling off the end of a function needs to involve using some kind of magic to figure out what the implicit return value should be.
The easiest magic is that falling off the end becomes equivalent to a return {}; statement. In the case of optional, this is nullopt. If I read my standardese correctly, for int this is 0, which matches the behavior of falling off the end of main.
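Under that reading, the proposal would simply make the trailing return {}; implicit. A sketch of the explicit equivalent today, using the question's foo:
#include <optional>

std::optional<int> foo(int x)
{
    if (x > 0) return 2 * x;
    return {};  // what the proposal would insert implicitly; for optional<int>, {} is nullopt
}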
There are downsides. First, suppose you have a function:
int foo(bool condition) {
    if (condition) return 7;
    custom_abort(); // does not return, but not marked up with `[[noreturn]]`
}
This would cause the compiler to write a return {}; after custom_abort(); if it cannot prove that custom_abort() doesn't return. This has a cost (in binary size at the least). Currently, the compiler is free to omit any work required to return from foo after custom_abort() and to assume that custom_abort() will not return.
It is true that no valid programs will behave differently with this change, but what was previously undefined behavior becomes defined, and that can have costs.
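One existing mitigation, assuming whoever owns custom_abort can annotate it: mark it [[noreturn]], so the compiler may treat the point after the call as unreachable and, even under the proposed rule, would not have to emit an implicit return path. A sketch:
#include <cstdlib>

[[noreturn]] void custom_abort() { std::abort(); }  // promise: control never returns

int foo(bool condition) {
    if (condition) return 7;
    custom_abort();
    // unreachable given the [[noreturn]] annotation; no implicit return {} is needed here
}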
We could approach this in a slightly different way:
int foo(bool condition) {
    if (condition) return 7;
    custom_abort();
    this cannot be reached;
}
where we add in an explicit "this location cannot be reached" to C++.
Once added, we could then issue warnings for code paths that do not return, and in a later standard enforce the rule that all code paths must either assert they cannot be reached, or must return.
After such a change had been in place for a standard cycle or two, an implicit return {}; would be harmless, except for people who skipped over the "return cannot happen" phase of standardization.
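For what it's worth, C++23 later added std::unreachable() (in <utility>), which plays roughly the role of the hypothetical "this cannot be reached" statement; executing it is undefined behavior, so it is an assertion to the compiler rather than a runtime check. A sketch, again using a stand-in custom_abort that never returns:
#include <cstdlib>
#include <utility>   // std::unreachable, C++23

void custom_abort() { std::abort(); }  // never returns, but not marked [[noreturn]]

int foo(bool condition) {
    if (condition) return 7;
    custom_abort();
    std::unreachable();  // explicit "this location cannot be reached"; UB if it ever executes
}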
Up to now, it is full Undefined Behavior. That means no valid existing code contains this construct. Adding well-defined behavior will therefore break no valid code. As for code that was already broken, it may or may not remain broken if your proposal were accepted, but that's almost never a concern for WG21.
The main concern would be how it would interact with other language features. I don't see a conflict with constexpr; falling off the end of a constexpr function would give an empty constexpr optional<T>. The [[noreturn]] attribute obviously makes no sense. The [[nodiscard]] attribute affects the caller, not the implementation. Exceptions are not affected either. So on the whole, the proposal seems to stand on its own.
In a proposal to WG21, it might be worth suggesting a less radical alternative: make plain return; a valid alternative for return optional<T>{};

What makes this usage of pointers unpredictable?

I'm currently learning pointers and my professor provided this piece of code as an example:
//We cannot predict the behavior of this program!
#include <iostream>
using namespace std;
int main()
{
    char * s = "My String";
    char s2[] = {'a', 'b', 'c', '\0'};
    cout << s2 << endl;
    return 0;
}
He wrote in the comments that we can't predict the behavior of the program. What exactly makes it unpredictable though? I see nothing wrong with it.
The behaviour of the program is non-existent, because it is ill-formed.
char* s = "My String";
This is illegal. Prior to 2011, it had been deprecated for 12 years.
The correct line is:
const char* s = "My String";
Other than that, the program is fine. Your professor should drink less whiskey!
The answer is: it depends on what C++ standard you're compiling against. All the code is perfectly well-formed across all standards‡ with the exception of this line:
char * s = "My String";
Now, the string literal has type const char[10] and we're trying to initialize a non-const pointer to it. For anything other than the char family of string literals, such an initialization was always illegal. For example:
const int arr[] = {1};
int *p = arr; // nope!
However, in pre-C++11, for string literals, there was an exception in §4.2/2:
A string literal (2.13.4) that is not a wide string literal can be converted to an rvalue of type “pointer to char”; [...]. In either case, the result is a pointer to the first element of the array. This conversion is considered only when there is an explicit appropriate pointer target type, and not when there is a general need to convert from an lvalue to an rvalue. [Note: this conversion is deprecated. See Annex D. ]
So in C++03, the code is perfectly fine (though deprecated), and has clear, predictable behavior.
In C++11, that block does not exist - there is no such exception for string literals converted to char*, and so the code is just as ill-formed as the int* example I just provided. The compiler is obligated to issue a diagnostic, and ideally in cases such as this that are clear violations of the C++ type system, we would expect a good compiler to not just be conforming in this regard (e.g. by issuing a warning) but to fail outright.
The code should ideally not compile - but does on both gcc and clang (I assume because there's probably lots of code out there that would be broken with little gain, despite this type system hole being deprecated for over a decade). The code is ill-formed, and thus it does not make sense to reason about what the behavior of the code might be. But considering this specific case and the history of it being previously allowed, I do not believe it to be an unreasonable stretch to interpret the resulting code as if it were an implicit const_cast, something like:
const int arr[] = {1};
int *p = const_cast<int*>(arr); // OK, technically
With that, the rest of the program is perfectly fine, as you never actually touch s again. Reading a created-const object via a non-const pointer is perfectly OK. Writing a created-const object via such a pointer is undefined behavior:
std::cout << *p; // fine, prints 1
*p = 5; // will compile, but undefined behavior, which
// certainly qualifies as "unpredictable"
As there is no modification via s anywhere in your code, the program is fine in C++03, should fail to compile in C++11 but does anyway - and given that the compilers allow it, there's still no undefined behavior in it†. With allowances that the compilers are still [incorrectly] interpreting the C++03 rules, I see nothing that would lead to "unpredictable" behavior. Write to s though, and all bets are off. In both C++03 and C++11.
†Though, again, by definition ill-formed code yields no expectation of reasonable behavior
‡Except not, see Matt McNabb's answer
Other answers have covered that this program is ill-formed in C++11 due to the assignment of a const char array to a char *.
However the program was ill-formed prior to C++11 also.
The operator<< overloads are in <ostream>. The requirement for iostream to include ostream was added in C++11.
Historically, most implementations had iostream include ostream anyway, perhaps for ease of implementation or perhaps in order to provide a better QoI.
But it would be conforming for iostream to only define the ostream class without defining the operator<< overloads.
The only slightly wrong thing that I see with this program is that you're not supposed to assign a string literal to a mutable char pointer, though this is often accepted as a compiler extension.
Otherwise, this program appears well-defined to me:
The rules that dictate how character arrays become character pointers when passed as parameters (such as with cout << s2) are well-defined.
The array is null-terminated, which is a condition for operator<< with a char* (or a const char*).
#include <iostream> includes <ostream>, which in turn defines operator<<(ostream&, const char*), so everything appears to be in place.
You can't predict the behaviour of the compiler, for reasons noted above. (It should fail to compile, but may not.)
If compilation succeeds, then the behaviour is well-defined. You certainly can predict the behaviour of the program.
If it fails to compile, there is no program. In a compiled language, the program is the executable, not the source code. If you don't have an executable, you don't have a program, and you can't talk about behaviour of something that doesn't exist.
So I'd say your prof's statement is wrong. You can't predict the behaviour of the compiler when faced with this code, but that's distinct from the behaviour of the program. So if he's going to pick nits, he'd better make sure he's right. Or, of course, you might have misquoted him and the mistake is in your translation of what he said.
As others have noted, the code is illegitimate under C++11, although it was valid under earlier versions. Consequently, a compiler for C++11 is required to issue at least one diagnostic, but behavior of the compiler or the remainder of the build system is unspecified beyond that. Nothing in the Standard would forbid a compiler from exiting abruptly in response to an error, leaving a partially-written object file which a linker might think was valid, yielding a broken executable.
Although a good compiler should always ensure before it exits that any object file it is expected to have produced will be either valid, non-existent, or recognizable as invalid, such issues fall outside the jurisdiction of the Standard. While there have historically been (and may still be) some platforms where a failed compilation can result in legitimate-appearing executable files that crash in arbitrary fashion when loaded (and I've had to work with systems where link errors often had such behavior), I would not say that the consequences of syntax errors are generally unpredictable. On a good system, an attempted build will generally either produce an executable with a compiler's best effort at code generation, or won't produce an executable at all. Some systems will leave behind the old executable after a failed build, since in some cases being able to run the last successful build may be useful, but that can also lead to confusion.
My personal preference would be for disk-based systems to rename the output file, to allow for the rare occasions when that executable would be useful while avoiding the confusion that can result from mistakenly believing one is running new code, and for embedded-programming systems to allow a programmer to specify, for each project, a program that should be loaded if a valid executable is not available under the normal name [ideally something which safely indicates the lack of a usable program]. An embedded-systems tool-set would generally have no way of knowing what such a program should do, but in many cases someone writing "real" code for a system will have access to some hardware-test code that could easily be adapted to the purpose. I don't know that I've seen the renaming behavior, however, and I know that I haven't seen the indicated programming behavior.

Why is returning a reference to a function local value not a compile error?

The following code invokes undefined behaviour.
int& foo()
{
    int bar = 1234;
    return bar;
}
g++ issues a warning:
warning: reference to local variable ‘bar’ returned [-Wreturn-local-addr]
clang++ too:
warning: reference to stack memory associated with local variable 'bar' returned [-Wreturn-stack-address]
Why is this not a compile error (ignoring -Werror)?
Is there a case where returning a ref to a local var is valid?
EDIT As pointed out, the spec mandates this be compilable. So, why does the spec not prohibit such code?
I would say that requiring this to make the program ill-formed (that is, make this a compilation error) would complicate the standard considerably for little benefit. You'd have to exactly spell out in the standard when such cases shall be diagnosed, and all compilers would have to implement them.
If you specify too little, it will not be too useful. And compilers probably already check for this to emit warnings, and real programmers compile with -Wall_you_can_give_me -Werror anyway.
If you specify too much, it will be difficult (or impossible) for compilers to implement the standard.
Consider this class (for which you only have the header and a library):
class Foo
{
    int x;
public:
    int& getInteger();
};
And this code:
int& bar()
{
    Foo f;
    return f.getInteger();
}
Now, should the standard be written to make this ill-formed or not? Probably not; what if Foo is implemented like this:
#include "Foo.h"
int global;
int& Foo::getInteger()
{
return global;
}
At the same time, it could be implemented like this:
#include "Foo.h"
int& Foo::getInteger()
{
return x;
}
Which of course would give you a dangling reference.
My point is that the compiler cannot really know whether returning a reference is OK or not, except for a few trivial cases (returning a reference to a function-scope automatic variable or parameter of non-reference type). I don't think it's worth it to complicate the standard for that. Especially as most compilers already warn about this as a quality-of-implementation matter.
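As for the second part of the question (is there a case where returning a reference to a local variable is valid?): yes, when the local has static storage duration, because the referent outlives the call. A minimal sketch:
int& counter()
{
    static int value = 0;  // lives for the entire program, not on the stack
    return value;          // well-defined; callers may read and write it
}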
Also, because you may want to get the current stack pointer (whatever that means on your particular implementation).
This function:
void* get_stack_pointer (void) { int x; return &x; };
AFAIK, it is not undefined behavior if you don't dereference the resulting pointer.
is much more portable than this one:
void* get_stack_pointer (void) {
    register void* sp asm ("%esp");
    return sp;
}
As to why you may want to get the stack pointer: well, there are cases where you have a valid reason to get it: for instance the conservative Boehm garbage collector needs to scan the stack (so wants the stack pointer and the stack bottom).
And if you returned a C++ reference on which you would only take its address using the & unary operator, getting such an address is IIUC legal (it is IMHO the only licit operation you can do on it).
Another reason to get the stack pointer would be to get a non-NULL pointer address (which you could e.g. hash) different of any heap, local or static data. However, you could use (void*)1 or (void*)-1 for that purpose.
So the compiler is right in only warning against this.
I guess that a C++ compiler should accept
int& get_sp_ref(void) { int x; return x; }
void show_sp(void) {
    std::cout << (&(get_sp_ref())) << std::endl;
}
For the same reason C allows you to return a pointer to a memory block that's been freed.
It's valid according to the language specification. It's a horribly bad idea (and is nowhere close to being guaranteed to work) but it's still valid inasmuch as it's not forbidden.
If you're asking why the standard allows this, it's probably because, when references were introduced, that's the way they worked. Each iteration of the standard has certain guidelines to follow (such as minimising the possibility of "breaking changes", those that render existing well-formed programs invalid) and the standard is an agreement between user and implementer, with undoubtedly more implementers than users sitting on the committees :-)
It may be worth pushing that idea through as a potential change and seeing what ISO say but I suspect it would be considered one of those "breaking changes" and therefore very suspect.
To expand on the earlier answers, the ISO C++ standard does not capture the distinction between warnings and errors to begin with; it simply uses the term 'diagnostic' when referring to what a compiler must emit upon seeing an ill-formed program. Quoting N3337, 1.4, paragraphs 1 and 2:
The set of diagnosable rules consists of all syntactic and semantic rules in this International Standard except for those rules containing an explicit notation that “no diagnostic is required” or which are described as resulting in “undefined behavior.”
Although this International Standard states only requirements on C++ implementations, those requirements are often easier to understand if they are phrased as requirements on programs, parts of programs, or execution of programs. Such requirements have the following meaning:
If a program contains no violations of the rules in this International Standard, a conforming implementation shall, within its resource limits, accept and correctly execute that program.
If a program contains a violation of any diagnosable rule or an occurrence of a construct described in this Standard as “conditionally-supported” when the implementation does not support that construct, a conforming implementation shall issue at least one diagnostic message.
If a program contains a violation of a rule for which no diagnostic is required, this International Standard places no requirement on implementations with respect to that program.
Something not mentioned by other answers yet is that this code is OK if the function is never called.
The compiler isn't required to diagnose whether a function might ever be called or not. For example you might set up a program which looks for counterexamples to Fermat's Last Theorem, and calls this function if it finds one. It would be a mistake for the compiler to reject such a program.
Returning a reference to a local variable is a bad idea; however, some people may write code that relies on such a construct, so the compiler should only warn about it rather than reject structurally valid code as erroneous.
Angew already posted a sample where the "local" variable is actually global. However, there is another (IMHO better) sample.
Object& GetSmth()
{
    Object* obj = new Object();
    return *obj;
}
In this case the returned reference refers to a heap-allocated object, so it remains valid; after use, the caller is responsible for deallocating the memory.
IMPORTANT NOTE: I don't encourage or recommend this coding style, because it is bad: it is usually hard to understand what is going on, and it leads to problems such as memory leaks or crashes. It is just a sample that shows why this particular situation cannot be treated as an error.

Defining Undefined Behavior

Does there exist any implementation of C++ (and/or C) that guarantees that anytime undefined behavior is invoked, it will signal an error? Obviously, such an implementation could not be as efficient as a standard C++ implementation, but it could be a useful debugging/testing tool.
If such an implementation does not exist, then are there any practical reasons that would make it impossible to implement? Or is it just that no one has done the work to implement it yet?
Edit: To make this a little more precise: I would like to have a compiler that allows me to make the assertion, for a given run of a C++ program that ran to completion, that no part of that run involved undefined behavior.
Yes, and no.
I am fairly certain that for practical purposes, an implementation could make C++ a safe language, meaning every operation has well-defined behavior. Of course, this comes at a huge overhead and there are probably some cases where it's simply infeasible, such as race conditions in multithreaded code.
Now, the problem is that this can't guarantee your code is defined in other implementations! That is, it could still invoke UB. For instance, observe the following code:
#include <iostream>

int a;
int* b;

int foo() {
    a = 5;
    b = &a;
    return 0;
}

int bar() {
    *b = a;
    return 0;
}

int main() {
    std::cout << foo() << bar() << std::endl;
}
According to the standard, the order in which foo and bar are called is up to the implementation to decide. Now, in a safe implementation this order would have to be defined, likely being left-to-right evaluation. The problem is that evaluating right-to-left invokes UB, which wouldn't be caught until you ran it on an unsafe implementation. The safe implementation could simply compile each permutation of evaluation order or do some static analysis, but this quickly becomes infeasible and possibly undecidable.
So in conclusion, if such an implementation existed it would give you a false sense of security.
The new C standard has an interesting list in the new Annex L with the crude title "Analyzability". It talks about UB that is so-called critical UB. This includes among others:
An object is referred to outside of its lifetime (6.2.4).
A pointer is used to call a function whose type is not compatible with the referenced type
The program attempts to modify a string literal
All of these are UB that are impossible or very hard to capture, since they usually can't be completely checked at compile time. This is due to the fact that a valid C (or C++) program is composed of several compilation units that may not know much about each other. E.g., one part of the program might pass a pointer to a string literal into a function with a char* parameter, or, even worse, cast away const-ness from a static variable.
Two C interpreters that detect a large class of undefined behaviors for a large subset of sequential C are KCC and Frama-C's value analysis. They are both used to make sure that automatically generated, automatically reduced random C programs are appropriate to report bugs in C compilers.
From the webpage for KCC:
One of the main aims of this work is the ability to detect undefined
programs (e.g., programs that read invalid memory).
A third interpreter for a dialect of C is CompCert's interpreter mode (a writeup). This one detects all behaviors that are undefined in the input language of the certified C compiler CompCert. The input language of CompCert is essentially C, but it renders defined some behaviors that are undefined in the standard (signed arithmetic overflow is defined as computing 2's complement results, for instance).
In truth, all three of the interpreters mentioned in this answer have had difficult choices to make in the name of pragmatism.
The whole point of defining something as "undefined behaviour" is to avoid having to detect this situation in the compiler. It is defined that way, so that compilers can be built for a wide variety of platforms and architectures, and so that the hardware and software doesn't have to have specific features "just to detect undefined behaviour". Imagine that you have a memory subsystem that can't detect whether you are writing to real memory or not - how would the compiler or runtime system detect that you have just done somepointer = rand(); *somepointer = 42;
You can detect SOME situations. But to require that ALL are detected would make life very difficult.
Given the Edit in the original question: I still don't think this is plausible to achieve in C. There is so much freedom to do almost anything (you can make pointers to almost anything, and these pointers can be converted, indexed, recalculated, and manipulated in all manner of other ways), and all of that can cause all manner of undefined behaviour.
There is a list of all undefined behaviour in C here - it lists 186 different circumstances of undefined behaviour, ranging from a backslash as the last character of the file (likely to cause compiler error, but not defined as one) to "The comparison function called by the bsearch or qsort function returns ordering values inconsistently".
How on earth do you write a compiler to check that the function passed into bsearch or qsort is ordering values consistently? Of course, if the data passed into the comparison function is of a simple type, such as integers, then it's not that difficult, but if the data type is a complex type such as
struct {
    char name[20];
    char street[20];
    int age;
    char post_code[10];
};
and the programmer decides to sort the data based on ascending name, ascending street, descending age and ascending postcode, in that order? If that's what you want, but somehow the code got messed up and the post code comparison returns some inconsistent result, things will go wrong, but it's very hard to formally inspect that case. There are lots of others that are similarly obscure and complex. Sure, YOUR code may not sort names and addresses etc, but someone will probably write something like that at some point or another.
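To make the difficulty concrete, here is a sketch of the comparator just described, giving the struct the hypothetical name Record so it can be passed to qsort. Nothing about it is detectably "inconsistent" to a compiler, yet a single wrong branch would silently break the ordering contract qsort relies on:
#include <cstring>
#include <cstdlib>

struct Record {
    char name[20];
    char street[20];
    int age;
    char post_code[10];
};

// Ascending name, ascending street, descending age, ascending post code.
int compare_records(const void* lhs, const void* rhs)
{
    const Record* a = static_cast<const Record*>(lhs);
    const Record* b = static_cast<const Record*>(rhs);
    if (int c = std::strcmp(a->name, b->name)) return c;
    if (int c = std::strcmp(a->street, b->street)) return c;
    if (a->age != b->age) return (a->age > b->age) ? -1 : 1;  // descending age
    return std::strcmp(a->post_code, b->post_code);
}

// Usage: std::qsort(records, count, sizeof(Record), compare_records);
// If, say, the post-code line instead returned rand() % 3 - 1, the behavior of
// qsort would be undefined, and no compiler could realistically detect that.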