"Observable behaviour" and compiler freedom to eliminate/transform pieces c++ code - c++

After reading this discussion I realized that I almost totally misunderstand the matter :)
As the description of the C++ abstract machine is not rigorous enough (compared, for instance, with the JVM specification), a precise answer may not be possible; in that case I would like to get informal clarifications about the rules that a reasonable "good" (non-malicious) implementation should follow.
The key concept of part 1.9 of the Standard addressing implementation freedom is so called as-if rule:
an implementation is free to disregard any requirement of this
Standard as long as the result is as if the requirement had been
obeyed, as far as can be determined from the observable behavior of
the program.
The term "observable behavior", according to the standard (I cite n3092), means the following:
— Access to volatile objects are evaluated strictly according to the
rules of the abstract machine.
— At program termination, all data written into files shall be
identical to one of the possible results that execution of the program
according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take
place in such a fashion that prompting output is actually delivered
before a program waits for input. What constitutes an interactive
device is implementation-defined.
So, roughly speaking, the order and operands of volatile access operations and I/O operations should be preserved; an implementation may make arbitrary changes to the program as long as they preserve these invariants (compared with some allowed behaviour of the abstract C++ machine).
1. Is it reasonable to expect that a non-malicious implementation treats I/O operations widely enough (for instance, that any system call from user code is treated as such an operation)? (E.g. an RAII mutex lock/unlock wouldn't be thrown away by the compiler in case the RAII wrapper contains no volatiles.)
2. How deeply should the "behavioral observation" descend from the user-defined C++ program level into library/system calls? The question is, of course, only about library calls that are not intended to perform I/O or volatile access from the user's viewpoint (e.g. new/delete operations) but may (and usually do) access volatiles or I/O in the library/system implementation. Should the compiler treat such calls from the user's viewpoint (and consider such side effects not observable) or from the "library" viewpoint (and consider the side effects observable)?
3. If I need to prevent some code from being eliminated by the compiler, is it good practice not to ask all the questions above and simply add (possibly fake) volatile access operations (wrap the needed actions in volatile methods and call them on volatile instances of my own classes) in any case that seems suspicious?
4. Or am I totally wrong, and the compiler is not allowed to remove any C++ code except in cases explicitly mentioned by the standard (such as copy elision)?

The important bit is that the compiler must be able to prove that the code has no side effects before it can remove it (or determine which side effects it has and replace it with some equivalent piece of code). In general, and because of the separate compilation model, this means the compiler has limited knowledge of which library calls have observable behavior, and therefore of which can be eliminated.
As to how deep this goes, it depends on the library implementation. In gcc, the C standard library uses compiler attributes to inform the compiler of potential side effects (or the absence of them). For example, strlen is tagged with a pure attribute that allows the compiler to transform this code:
char p[] = "Hi there\n";
for ( int i = 0; i < strlen(p); ++i ) std::cout << p[i];
into
char p[] = "Hi there\n";
int __length = strlen(p);
for ( int i = 0; i < __length; ++i ) std::cout << p[i];
But without the pure attribute the compiler cannot know whether the function has side effects or not (unless it is inlining it, and gets to see inside the function), and cannot perform the above optimization.
That is, in general, the compiler will not remove code unless it can prove that it has no side effects, i.e. will not affect the outcome of the program. Note that this does not only relate to volatile and io, since any variable change might have observable behavior at a later time.
As to question 3, the compiler will only remove your code if the program behaves exactly as if the code was present (copy elision being an exception), so you should not even care whether the compiler removes it or not. Regarding question 4, the as-if rule stands: If the outcome of the implicit refactor made by the compiler yields the same result, then it is free to perform the change. Consider:
unsigned int fact = 1;
for ( unsigned int i = 1; i < 5; ++i ) fact *= i;
The compiler can freely replace that code with:
unsigned int fact = 24; // 1*2*3*4: the loop stops before i reaches 5
The loop is gone, but the behavior is the same: each loop iteration does not affect the outcome of the program, and the variable has the correct value at the end of the loop, i.e. if it is later used in some observable operation, the result will be as if the loop had been executed.
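The folding described above can be checked at compile time. The following is a minimal sketch, assuming C++14 or later; the function name product_below is hypothetical and not part of the original example. With the bound i < limit and limit equal to 5, the loop multiplies 1, 2, 3 and 4, so the value is 24 (reaching 120 would need the bound i <= 5).

```cpp
// Hypothetical helper: the same loop as above, wrapped in a constexpr
// function so the compiler is required to be able to evaluate it itself.
constexpr unsigned int product_below(unsigned int limit) {
    unsigned int fact = 1;
    for (unsigned int i = 1; i < limit; ++i) fact *= i;
    return fact;
}

// Evaluated entirely during compilation: no loop needs to exist at run time.
static_assert(product_below(5) == 24, "1*2*3*4 is 24");

// A run-time caller sees the same value, whether or not a loop was emitted.
unsigned int folded_value() { return product_below(5); }
```

Whether an optimizer actually folds a non-constexpr version of the loop is up to the implementation; the sketch only shows that the result is fully determined before the program runs.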
Don't worry too much about what observable behavior and the as-if rule mean; they basically mean that the compiler must yield the output that you programmed in your code, even if it is free to get to that outcome by a different path.
EDIT
@Konrad raises a really good point regarding the initial example I had with strlen: how can the compiler know that strlen calls can be elided? And the answer is that in the original example it cannot, and thus it could not elide the calls. There is nothing telling the compiler that the pointer returned from the get_string() function does not refer to memory that is being modified elsewhere. I have corrected the example to use a local array.
In the modified example, the array is local, and the compiler can verify that there are no other pointers that refer to the same memory. strlen takes a const pointer and so it promises not to modify the contained memory, and the function is pure so it promises not to modify any other state. The array is not modified inside the loop construct, and gathering all that information the compiler can determine that a single call to strlen suffices. Without the pure specifier, the compiler cannot know whether the result of strlen will differ in different invocations and has to call it.
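As a rough illustration of the attribute in user code, here is a sketch that assumes GCC or Clang. The names PURE_FN, my_length and count_visible are hypothetical; only the __attribute__((pure)) spelling comes from the compilers themselves. On other compilers the macro expands to nothing and the code still behaves the same, just without the hint.

```cpp
#include <cstddef>

// GCC/Clang extension; expands to nothing elsewhere (assumption: other
// compilers simply do not get the optimization hint).
#if defined(__GNUC__)
#define PURE_FN __attribute__((pure))
#else
#define PURE_FN
#endif

// The attribute promises: no side effects, and the result depends only on
// the arguments and on memory the function reads.
PURE_FN std::size_t my_length(const char* s);

// The loop condition calls my_length on every iteration in the source.
// Because of the promise above, the compiler may call it once instead.
std::size_t count_visible(const char* s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < my_length(s); ++i)
        if (s[i] != ' ') ++n;
    return n;
}

std::size_t my_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}
```

Either way the observable result is identical, which is exactly why the transformation is permitted.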

The abstract machine defined by the standard will, given a specific
input, produce one of a set of specific outputs. In general, all that is
guaranteed is that for that specific input, the compiled code will
produce one of the possible specific outputs. The devil is in the
details, however, and there are a number of points to keep in mind.
The most important of these is probably the fact that if the program has
undefined behavior, the compiler can do absolutely anything. All bets
are off. Compilers can and do use potential undefined behavior for
optimizing: for example, if the code contains something like *p = (*q) ++,
the compiler can conclude that p and q aren't aliases to the same
variable.
Unspecified behavior can have similar effects: the actual behavior may
depend on the level of optimization. All that is required is that the
actual output correspond to one of the possible outputs of the abstract
machine.
With regards to volatile, the standard does say that access to
volatile objects is observable behavior, but it leaves the meaning of
"access" up to the implementation. In practice, you can't really count
much on volatile these days; actual accesses to volatile objects may
appear to an outside observer in a different order than they occur in
the program. (This is arguably in violation of the intent of the
standard, at the very least. It is, however, the actual situation with
most modern compilers, running on a modern architecture.)
Most implementations treat all system calls as “IO”. With
regards to mutexes, of course: as far as C++03 is concerned, as soon as
you start a second thread, you've got undefined behavior (from the C++
point of view—Posix or Windows do define it), and in C++11,
synchronization primitives are part of the language, and constrain the
set of possible outputs. (The compiler can, of course, eliminate the
synchronizations if it can prove that they weren't necessary.)
The new and delete operators are special cases. They can be
replaced by user defined versions, and those user defined versions may
clearly have observable behavior. The compiler can only remove them if
it has some means of knowing either that they haven't been replaced, or
that the replacements have no observable behavior. In most systems,
replacement is defined at link time, after the compiler has finished its
work, so no changes are allowed.
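To make the point about replacement concrete, here is a sketch of a program-wide replacement whose effect is visible to the program. The names g_allocs, g_keep and count_one_allocation are hypothetical. The pointer is stored into a volatile object so that the allocation clearly escapes; that is an assumption made to keep the example robust, not something the standard requires.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

static std::size_t g_allocs = 0;      // hypothetical allocation counter
static void* volatile g_keep = nullptr; // hypothetical sink: lets the pointer escape

// Replacement for the global allocation function: counts every call.
void* operator new(std::size_t size) {
    ++g_allocs;
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc();
}

// Matching replacements for the deallocation functions.
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Performs one new/delete pair and reports how many times the
// replacement operator new ran while doing so.
std::size_t count_one_allocation() {
    const std::size_t before = g_allocs;
    int* p = new int(42);
    g_keep = p;  // volatile store of the pointer
    delete p;
    return g_allocs - before;
}
```

Since the counter could later be printed, dropping the call would change what the program can report, which is why the compiler has to be careful here.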
With regards to your third question: I think you're looking at it from
the wrong angle. Compilers don't “eliminate” code, and no
particular statement in a program is bound to a particular block of
code. Your program (the complete program) defines a particular
semantics, and the compiler must do something which produces an
executable program having those semantics. The most obvious solution
for the compiler writer is to take each statement separately, and
generate code for it, but that's the compiler writer's point of view,
not yours. You put source code in, and get an executable out; but lots
of statements don't result in any code, and even for those that do,
there isn't necessarily a one to one relationship. In this sense, the
idea of “preventing some code elimination” doesn't make
sense: your program has a semantics, specified by the standard, and all
you can ask for (and all that you should be interested in) is that the
final executable have those semantics. (Your fourth point is similar:
the compiler doesn't “remove” any code.)

I can't speak for what the compilers should do, but here's what some compilers actually do
#include <array>
int main()
{
    std::array<int, 5> a;
    for(size_t p = 0; p<5; ++p)
        a[p] = 2*p;
}
assembly output with gcc 4.5.2:
main:
xorl %eax, %eax
ret
replacing array with vector shows that new/delete are not subject to elimination:
#include <vector>
int main()
{
    std::vector<int> a(5);
    for(size_t p = 0; p<5; ++p)
        a[p] = 2*p;
}
assembly output with gcc 4.5.2:
main:
subq $8, %rsp
movl $20, %edi
call _Znwm # operator new(unsigned long)
movl $0, (%rax)
movl $2, 4(%rax)
movq %rax, %rdi
movl $4, 8(%rax)
movl $6, 12(%rax)
movl $8, 16(%rax)
call _ZdlPv # operator delete(void*)
xorl %eax, %eax
addq $8, %rsp
ret
My best guess is that if the implementation of a function call is not available to the compiler, it has to treat it as possibly having observable side-effects.
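For contrast, here is a sketch of the array example with its result routed into something observable. The volatile object g_sink and the function fill_and_publish are hypothetical names. Since a write to a volatile object counts as observable behavior in the standard text quoted in the question, the compiler may still fold the loops but has to produce the final value.

```cpp
#include <array>
#include <cstddef>

volatile int g_sink = 0;  // hypothetical sink; writes to it are observable

// Fills the array as in the example above, then publishes the sum.
int fill_and_publish() {
    std::array<int, 5> a{};
    for (std::size_t p = 0; p < 5; ++p)
        a[p] = static_cast<int>(2 * p);   // 0, 2, 4, 6, 8

    int sum = 0;
    for (int v : a) sum += v;

    g_sink = sum;  // the one store the compiler is not free to drop
    return sum;
}
```

The array itself may still never exist in the generated code; only the store of 20 has to happen.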

1. Is it reasonable to expect that non-malicious implementation treats io operations wide enough
Yes. Assuming side effects is the default. Beyond the default, compilers must prove things (except for copy elision).
2. How deeply the "behavioral observation" should immerse from user-defined c++ program level into library/system calls?
As deep as it can. With current standard C++ the compiler can't look inside a library in the sense of a static library, i.e. calls that target a function inside some ".a" or ".lib" file, so side effects are assumed.
Using the traditional compilation model with multiple object files, the compiler is even unable to look behind extern calls. Optimizations across
units of compilation may be done at link time, though.
Btw, some compilers have an extension to tell it about pure functions. From the gcc documentation:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure. For example,
int square (int) __attribute__ ((pure));
says that the hypothetical function square is safe to call fewer times than the program says.
Some of common examples of pure functions are strlen or memcmp. Interesting non-pure functions
are functions with infinite loops or those depending on volatile memory or other system resource,
that may change between two consecutive calls (such as feof in a multithreading environment).
Thinking about this poses an interesting question to me: If some chunk of code mutates a non-local variable, and calls an un-introspectible function,
will it assume that this extern function might depend on that non-local variable?
compilation-unit A:
int foo() {
extern int x;
return x;
}
compilation-unit B:
#include <iostream>
int foo(); // defined in compilation unit A
int x;
void bar() {
    for (x=0; x<10; ++x) {
        std::cout << foo() << '\n';
    }
}
The current standard has a notion of sequence points. I guess if a compiler does not see enough,
it can only optimize as far as to not break the ordering of dependent sequence points.
3. If I need to prevent some code from elimination by compiler
Except by looking at the object-dump, how could you judge whether something was removed?
And if you can't judge, then is this not equivalent to the impossibility of writing code that depends on its (non-)removal?
In that respect, compiler extensions (like for example OpenMP) help you in being able to judge. Some builtin mechanisms exist, too, like volatile variables.
Does a tree exist if nobody can observe it? Et hop, we are at quantum mechanics.
4. Or I'm totally wrong and the compiler is disallowed to remove any c++ code except of cases explicitly mentioned by the standard (as copy elimination)
No, removal is perfectly allowed. The compiler is also allowed to transform code as if it were a piece of slime.
(With the exception of copy elision, you couldn't tell anyway.)

One difference is that Java is designed to run on one platform only, the JVM. That makes it much easier to be "rigorous enough" in the specification, as there is only the platform to consider and you can document exactly how it works.
C++ is designed to be able to run on a wide selection of platforms and to do so natively, without an intervening abstraction layer, using the underlying hardware functionality directly. Therefore it has chosen to allow the functionality that actually exists on different platforms. For example, the result of some shift operations like int(1) << 33 is allowed to differ between systems (with a 32-bit int the standard leaves it undefined), because that's the way the hardware works.
The C++ standard describes the result you can expect from your program, not the way it has to be achieved. In some cases it says that you have to check your particular implementation, because the results may differ but still be what is expected there.
For example, on an IBM mainframe nobody expects floating point to be IEEE compatible because the mainframe series is much older than the IEEE standard. Still C++ allows the use of the underlying hardware while Java does not. Is that an advantage or a disadvantage for either language? It depends!
Within the restrictions and allowances of the language, a reasonable implementation must behave as if it did what you have coded in your program. If you make system calls like locking a mutex, the compiler either does not know what the calls do and therefore cannot remove them, or knows exactly what they do and therefore also knows whether they can be removed. The result is the same!
If you do calls to the standard library, the compiler can very well know exactly what the call does, as this is described in the standard. It then has the option of really calling a function, replace it with some other code, or skip it entirely if it has no effect. For example, std::strlen("Hello world!") can be replaced by 12. Some compilers do that, and you will not notice.
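The count of 12 can be verified without running anything. This is a small sketch; the names literal_length and runtime_length are hypothetical. It uses sizeof on the literal, which includes the terminating null character, hence the subtraction.

```cpp
#include <cstddef>
#include <cstring>

// sizeof of a string literal counts the trailing '\0', so subtract one.
constexpr std::size_t literal_length = sizeof("Hello world!") - 1;
static_assert(literal_length == 12, "the literal has 12 characters");

// The call a compiler is free to replace by the constant 12.
std::size_t runtime_length() { return std::strlen("Hello world!"); }
```

Whether a given compiler really performs the replacement is an implementation detail; the program cannot tell the difference.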

Related

C++ Statement Reordering

This is a question about Chandler's answer here (I didn't have a high enough rep to comment): Enforcing statement order in C++
In his answer, suppose foo() has no input or output. It's a black box that does work that is observable eventually, but won't be needed immediately (e.g. executes some callback). So we don't have input/output data locally handy to tell the compiler not to optimize. But I know that foo() will modify the memory somewhere, and the result will be observable eventually. Will the following prevent statement reordering and get the correct timing in this case?
#include <chrono>
#include <iostream>
//I believe this tells the compiler that all memory everywhere will be clobbered?
//(from his cppcon talk: https://youtu.be/nXaxk27zwlk?t=2441)
__attribute__((always_inline)) inline void DoNotOptimize() {
asm volatile("" : : : "memory");
}
// The compiler has full knowledge of the implementation.
static int ugly_global = 1; //we print this to screen sometime later
static void foo(void) { ugly_global *= 2; }
auto time_foo() {
using Clock = std::chrono::high_resolution_clock;
auto t1 = Clock::now(); // Statement 1
DoNotOptimize();
foo(); // Statement 2
DoNotOptimize();
auto t2 = Clock::now(); // Statement 3
return t2 - t1;
}
Will the following prevent statement reordering and get the correct timing in this case?
It should not be necessary because the calls to Clock::now should, at the language-definition level, enforce enough ordering. (That is, the C++11 standard says that the high resolution clock ought to get as much information as the system can give here, in the way that is most useful here. See "secondary question" below.)
But there is a more general case. It's worth thinking about the question: How does whoever provides the C++ library implementation actually write this function? Or, take C++ itself out of the equation. Given a language standard, how does an implementor—a person or group writing an implementation of that language—get you what you need? Fundamentally, we need to make a distinction between what the language standard requires and how an implementation provider goes about implementing the requirements.
The language itself may be expressed in terms of an abstract machine, and the C and C++ languages are. This abstract machine is pretty loosely defined: it executes some kind of instructions, which access data, but in many cases we don't know how it does these things, or even how big the various data items are (with some exceptions for fixed-size integers like int64_t), and so on. The machine may or may not have "registers" that hold things in ways that cannot be addressed as well as memory that can be addressed and whose addresses can be recorded in pointers:
p = &var
makes the value stored in p (in memory or a register) such that using *p accesses the value stored in var (in memory or a register; some machines, especially back in the olden days, have / had addressable registers). [1]
Nonetheless, despite all of this abstraction, we want to run real code on real machines. Real machines have real constraints: some instructions might require particular values in particular registers (think about all the bizarre stuff in the x86 instruction sets, or wide-result integer multipliers and dividers that use special-purpose registers, as on some MIPS processors), or cause CPU synchronizations, or whatever.
GCC in particular invented a system of constraints to express what you could or could not do on the machine itself, using the machine's instruction set. Over time, this evolved into user-accessible asm constructs with input, output, and clobber sections. The particular one you show:
__attribute__((always_inline)) inline void DoNotOptimize() {
asm volatile("" : : : "memory");
}
expresses the idea that "this instruction" (asm; the actual provided instruction is blank) "cannot be moved" (volatile) "and clobbers all of the computer's memory, but no registers" ("memory" as the clobber section).
This is not part of either C or C++ as a language. It's just a compiler construction, supported by GCC and now supported by clang as well. But it suffices to force the compiler to issue all stores-to-memory before the asm, and reload values from memory as needed after the asm, in case they changed when the computer executed the (nonexistent) instruction included in the asm line. There's no guarantee that this will work, or even compile at all, in some other compiler, but as long as we're the implementor, we choose the compiler we're implementing for/with.
C++ as a language now has support for ordered memory operations, which an implementor must implement. The implementor can use these asm volatile constructs to achieve the right result, provided they do actually achieve the right result. For instance, if we need to cause the machine itself to synchronize—to emit a memory barrier—we can stick the appropriate machine instruction, such as mfence or membar #sync or whatever it may be, in the asm's instruction-section clause. See also compiler reordering vs memory reordering as Klaus mentioned in a comment.
It is up to the implementor to find an appropriately effective trick, compiler-specific or not, to get the right semantics while minimizing any runtime slowdown: for instance, we might want to use lfence rather than mfence if that's sufficient, or membar #LoadLoad, or whatever the right thing is for the machine. If our implementation of Clock::now requires some sort of fancy inline asm, we write one. If not, we don't. We make sure that we produce what's required—and then all users of the system can just use it, without needing to know what sort of grubby implementation tricks we had to invoke.
There's a secondary question here: does the language specification really constrain the implementor the way we think/hope it does? Chris Dodd's comment says he thinks so, and he's usually right on these kinds of questions. A couple of other commenters think otherwise, but I'm with Chris Dodd on this one. I think it is not necessary. You can always compile to assembly, or disassemble the compiled program, to check, though!
If the compiler didn't do the right thing, that asm would force it to do the right thing, in GCC and clang. It probably wouldn't work in other compilers.
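Putting the pieces of this answer together, a timing helper might be sketched as follows. The names compiler_barrier, do_work, work_value and time_work are hypothetical. On GCC and Clang the barrier is the asm volatile construct discussed above; the fallback to std::atomic_signal_fence for other compilers is an assumption about what a comparable compiler-level barrier would be, not a guarantee that it affects timing the same way.

```cpp
#include <atomic>
#include <chrono>

// Compiler-level barrier: GCC/Clang inline asm where available, otherwise
// a standard signal fence (assumed here to be the closest portable analogue).
inline void compiler_barrier() {
#if defined(__GNUC__)
    asm volatile("" : : : "memory");
#else
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

static int g_work = 1;                  // state the timed code modifies
static void do_work() { g_work *= 2; }  // stands in for foo() in the question

int work_value() { return g_work; }

// Times one call to do_work(), with a barrier on each side of it.
std::chrono::steady_clock::duration time_work() {
    using Clock = std::chrono::steady_clock;
    const auto t1 = Clock::now();
    compiler_barrier();
    do_work();
    compiler_barrier();
    const auto t2 = Clock::now();
    return t2 - t1;
}
```

steady_clock is used instead of high_resolution_clock only because it is guaranteed to be monotonic; checking the generated assembly remains the way to confirm what a particular compiler did.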
[1] On the KA-10 in particular, the registers were just the first sixteen words of memory. As the Wikipedia page notes, this meant you could put instructions into there and call them. Because the first 16 words were the registers, these instructions ran much faster than other instructions.

When are type-punned pointers safe in practice?

A colleague of mine is working on C++ code that works with binary data arrays a lot. In certain places, he has code like
char *bytes = ...
T *p = (T*) bytes;
T v = p[i]; // UB
Here, T can be sometimes short or int (assume 16 and 32 bit respectively).
Now, unlike my colleague, I belong to the "no UB if at all possible" camp, while he is more along the lines of "if it works, it's OK". I am having a hard time trying to convince him otherwise.
Given that:
bytes really come from somewhere outside this compilation unit, being read from some binary file.
It's safe to assume that array really contains integers in the native endianness.
In practice, given mainstream C++ compilers like MSVC 2017 and gcc 4.8, and Intel x64 hardware, is such a thing really safe? I know it wouldn't be if T was, say, float (got bitten by it in the past).
char* can alias other entities without breaking strict aliasing rule.
Your code would be UB only if p + i didn't originally point to a T.
char* bytes = (char*) floats;
int *p = (int*) bytes;
int v = p[i]; // UB
but
char* bytes = (char*) floats;
float *p = (float*) bytes;
float v = p[i]; // OK
If the origin of bytes is "unknown", the compiler cannot benefit from UB for optimization; it should assume we are in the valid case and generate code accordingly.
But how do you guarantee it is unknown? Even outside the TU, something like Link-Time Optimization might provide the hidden information.
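A way to sidestep the question entirely is to copy the bytes into an object of the target type instead of casting the pointer. This is a commonly used alternative rather than anything proposed in the thread; the helper name load_at is hypothetical. It assumes, as the question does, that the buffer holds values in native endianness.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Reads the i-th T from a raw byte buffer without forming a T* into it.
// std::memcpy is defined for any trivially copyable T, whatever the
// buffer's original type and alignment were.
template <typename T>
T load_at(const char* bytes, std::size_t i) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "load_at requires a trivially copyable type");
    T value;
    std::memcpy(&value, bytes + i * sizeof(T), sizeof(T));
    return value;
}
```

Usage would look like load_at<int>(bytes, i) in place of ((int*) bytes)[i]. For small fixed sizes, optimizing compilers typically turn the copy into a plain load, though that is not guaranteed.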
Type-punned pointers are safe if one uses a construct which is recognized by the particular compiler one is using [i.e. any compiler that is configured to support quality semantics, if one is using straightforward constructs; neither gcc nor clang qualifies when optimizations are enabled, however, unless one uses -fno-strict-aliasing]. The authors of C89 were certainly aware that many applications required the use of various type-punning constructs beyond those mandated by the Standard, but thought the question of which constructs to recognize was best left as a quality-of-implementation issue. Given something like:
struct s1 { int objectClass; };
struct s2 { int objectClass; double x,y; };
struct s3 { int objectClass; char someData[32]; };
int getObjectClass(void *p) { return ((struct s1*)p)->objectClass; }
I think the authors of the Standard would have intended that the function be usable to read field objectClass of any of those structures [that is pretty much the whole purpose of the Common Initial Sequence rule] but there would be many ways by which compilers might achieve that. Some might recognize function calls as barriers to type-based aliasing analysis, while others might treat pointer casts in such a fashion. Most programs that use type punning would do several things that compilers might interpret as indications to be cautious with optimizations, so there was no particular need for a compiler to recognize any particular one of them. Further, since the authors of the Standard made no effort to forbid implementations that are "conforming" but are of such low quality as to be useless, there was no need to forbid compilers that somehow managed not to see any of the indications that storage might be used in interesting ways.
Unfortunately, for whatever reason, there hasn't been any effort by compiler vendors to find easy ways of recognizing common type-punning situations without needlessly impairing optimizations. While handling most cases would be fairly easy if compiler writers hadn't adopted designs that filter out the clearest and most useful evidence before applying optimization logic, both the designs of gcc and clang--and the mentalities of their maintainers--have evolved to oppose such a concept.
As far as I'm concerned, there is no reason why any "quality" implementation should have any trouble recognizing type punning in situations where all operations upon a byte of storage using a pointer converted to a pointer-to-PODS, or anything derived from that pointer, occur before the first time any of the following occurs:
That byte is accessed in conflicting fashion via means not derived from that pointer.
A pointer or reference is formed which will be used sometime in future to access that byte in conflicting fashion, or derive another that will.
Execution enters a function which will do one of the above before it exits.
Execution reaches the start of a bona fide loop [not, e.g. a do{...}while(0);] which will do one of the above before it exits.
A decently-designed compiler should have no problem recognizing those cases while still performing the vast majority of useful optimizations. Further, recognizing aliasing in such cases would be simpler and easier than trying to recognize it only in the cases mandated by the Standard. For those reasons, compilers that can't handle at least the above cases should be viewed as falling in the category of implementations that are of such low quality that the authors of the Standard didn't particularly want to allow, but saw no reason to forbid. Unfortunately, neither gcc nor clang offer any options to behave reasonably except by requiring that they disable type-based aliasing altogether. Unfortunately, the authors of gcc and clang would rather deride as "broken" any code needing features beyond what the Standard requires, than attempt a useful blend of optimization and semantics.
Incidentally, neither gcc nor clang should be relied upon to properly handle any situation in which storage that has been used as one type is later used as another, even when the Standard would require them to do so. Given something like:
union { struct s1 v1; struct s2 v2; } unionArr[100];
void test(int i)
{
int test = unionArr[i].v2.objectClass;
unionArr[i].v1.objectClass = test;
}
Both clang and gcc will treat it as a no-op even if it is executed between code which writes unionArr[i].v2.objectClass and code which happens to read member v1.objectClass of the same union object, thus causing them to ignore the possibility that the write to unionArr[i].v2.objectClass might affect v1.objectClass.

How does not specifying an exact order of evaluation of function arguments help C & C++ compilers generate optimized code?

#include <iostream>
int foo() {
std::cout<<"foo() is called\n";
return 9;
}
int bar() {
std::cout<<"bar() is called\n";
return 18;
}
int main() {
std::cout<<foo()<<' '<<bar()<<' '<<'\n';
}
// Above program's behaviour is unspecified
// clang++ evaluates function arguments from left to right: http://melpon.org/wandbox/permlink/STnvMm1YVrrSRSsB
// g++ & MSVC++ evaluate function arguments from right to left
// so either foo() or bar() can be called first depending upon compiler.
The output of the above program is compiler dependent. The order in which function arguments are evaluated is unspecified. The reason I've read for this is that it can result in highly optimized code. How does not specifying an exact order of evaluation of function arguments help the compiler generate optimized code?
AFAIK, the order of evaluation is strictly specified in languages such as Java, C#, D etc.
I think the whole premise of the question is wrong:
How does not specifying an exact order of evaluation of function arguments help C & C++ compilers generate optimized code?
It is not about optimizing code (though it does allow that). It is about not penalizing compilers because the underlying hardware has certain ABI constraints.
Some systems depend on parameters being pushed to the stack in reverse order while others depend on forward order. C++ runs on all kinds of systems with all kinds of constraints. If you enforce an order at the language level you will require some systems to pay a penalty to enforce that order.
The first rule of C++ is "If you don't use it then you should not have to pay for it". So enforcing an order would be a violation of the prime directive of C++.
It doesn't. At least, it doesn't today. Maybe it did in the past.
A proposal for C++17 suggests defining left-right evaluation order for function calls, operator<< and so on.
As described in Section 7 of that paper, this proposal was tested by compiling the Windows NT kernel, and it actually led to a speed increase in some benchmarks. The authors' comment:
It is worth noting that these results are for the worst case scenario where the optimizers have not yet been updated to be aware of, and take advantage of the new evaluation rules and they are blindly forced to evaluate function calls from left to right.
suggests that there is further room for speed improvement.
The order of evaluation is related to the way arguments are passed. If stack is used to pass the arguments, evaluating right to left helps performance, since this is how arguments are pushed into the stack.
For example, with the following code:
foo(bar(), baz());
Assuming the calling convention is 'passing arguments through the stack', the C calling convention requires arguments to be pushed onto the stack starting from the last one, so that when the callee reads them, it pops the first argument first and is able to support variadic functions. If the order of evaluation were left to right, the result of bar() would have to be saved in a temporary, then baz() called and its result pushed, followed by a push of the temporary. Right-to-left evaluation, however, allows the compiler to avoid the temporary.
If arguments are passed through registers, order of evaluation is not overly important.
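When the order does matter to the program, it can be fixed in the source regardless of calling convention by evaluating each call in its own statement. Here is a sketch; call_log and sum_in_fixed_order are hypothetical names, and foo and bar record their calls in a log instead of printing so that the order can be inspected.

```cpp
#include <string>
#include <vector>

// Hypothetical call recorder, used in place of printing to std::cout.
std::vector<std::string>& call_log() {
    static std::vector<std::string> log;
    return log;
}

int foo() { call_log().push_back("foo"); return 9; }
int bar() { call_log().push_back("bar"); return 18; }

// Each full statement is sequenced before the next one, so foo() is
// always called before bar(), whatever the argument-passing scheme is.
int sum_in_fixed_order() {
    const int first = foo();
    const int second = bar();
    return first + second;
}
```

The cost is exactly the named temporary discussed above, which the programmer now pays for explicitly and only where the order is needed.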
The original reason that the C and C++ standards didn't specify an order of evaluation for function arguments was to provide more optimization opportunities for the compiler. Unfortunately, this rationale was not backed up by extensive experimentation at the time these languages were initially designed. But it made sense.
This issue has been raised in the past few years. See this blog post by Herb Sutter and don't forget to go through the comments.
Proposal P0145R1 suggests that it's better to specify an order of evaluation for function arguments and for other operators. It says:
The order of expression evaluation, as
it is currently specified in the standard, undermines advices,
popular programming idioms, or the relative safety of standard library
facilities. The traps aren’t just for novices or the careless
programmer. They affect all of us indiscriminately, even when we know
the rules.
You can find more information about how this affects optimization opportunities in that document.
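Until such rules are available on your compiler, a common way to sidestep the trap is to evaluate each argument in its own statement. The sketch below uses hypothetical names (`first`, `second`, `combine`, `log_text`); it only relies on the fact that one full-expression finishes before the next one starts.

```cpp
#include <cassert>
#include <string>

static std::string log_text;  // hypothetical log used to show the order

int first()  { log_text += "first ";  return 10; }
int second() { log_text += "second "; return 20; }

int combine(int a, int b) { return a - b; }

// Each initialization is a full-expression, so first() has completely
// finished before second() starts, on every conforming compiler.
int combine_in_order() {
    log_text.clear();
    const int a = first();
    const int b = second();
    assert(log_text == "first second ");
    return combine(a, b);
}
```

Writing `combine(first(), second())` directly would give the same return value here, but the contents of `log_text` would then depend on the implementation.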
In the past few months, there has been a very extensive discussion about how this change in the language affects optimization, compatibility and portability. The thread begins here and continues here. You can find numerous examples there.

"If you've written a compiler test, you've written a call to main"

Calling main inside your program violates the C++ Standard
#include <iostream>
int main(); // declared early so that f() can name it
void f()
{
    main(); // an endless loop calling main ? No that's not allowed
}
int main()
{
    static int i = 0;
    std::cout << i++ << std::endl;
    f();
}
In a lecture, Chandler Carruth says (at about 22:40):
if you've written a compiler test you've written a call to main
How is this relevant, and how is the fact that the Standard doesn't allow it overcome?
The point here is that if you write compiler test code, you will probably want to test calling main with a few different parameter sets, and this is possible if you understand the compiler you are testing.
The standard forbids calls to main so that main can have magical code (e.g. the code to construct global objects or to initialize some data structure, zero out global uninitialized POD data, etc.). But if you are writing test code for a compiler, you probably will have an understanding of whether the compiler does this - and if so, what it actually does in such a step, and take that into account in your testing - you could for example "dirty" some global variable, and then call main again and see that this variable is indeed set to zero again. Or it could be that main is indeed not callable in this particular compiler.
Since Chandler is talking about LLVM (and in C terms, Clang), he knows how that compiler produces code for main.
This clearly doesn't apply to "black box testing" of compilers. In such a test-suite, you could not rely on the compiler doing anything in particular, or NOT doing something that would harm your test.
Like ALL undefined behaviour, it is not guaranteed to work in any particular way, but SOMETIMES, if you know the actual implementation of the compiler, it will be possible to exploit that behaviour - just don't consider it good programming, and don't expect it to work in a portable way.
As an example, on a PC, you can write to the text-screen (before the MMU has been configured at least) by doing this:
volatile char *ptr = (volatile char *)0xB8000; // VGA colour text buffer
ptr[0] = 'a';
ptr[1] = 7; // Determines the colour.
This, by the standard, is undefined behaviour, because the standard does say that you can only use pointers to allocations made inside the C or C++ runtime. But clearly, you can't allocate memory in the graphics card... So technically, it's UB, but guess what Linux and Windows do during early boot? Write directly to the VGA memory... [Or at least they used to some time ago, when I last looked at it]. And if you know your hardware, this should work with every compiler I'm aware of - if it doesn't, you probably can't use it to write low-level driver code. But it is undefined by the standard, and "UB sanitizer" will probably moan at the code.

Defining Undefined Behavior

Does there exist any implementation of C++ (and/or C) that guarantees that anytime undefined behavior is invoked, it will signal an error? Obviously, such an implementation could not be as efficient as a standard C++ implementation, but it could be a useful debugging/testing tool.
If such an implementation does not exist, then are there any practical reasons that would make it impossible to implement? Or is it just that no one has done the work to implement it yet?
Edit: To make this a little more precise: I would like to have a compiler that allows me to make the assertion, for a given run of a C++ program that ran to completion, that no part of that run involved undefined behavior.
Yes, and no.
I am fairly certain that, for practical purposes, an implementation could make C++ a safe language, meaning every operation has well-defined behavior. Of course, this comes at a huge overhead, and there are probably some cases where it is simply infeasible, such as race conditions in multithreaded code.
Now, the problem is that this can't guarantee your code is defined in other implementations! That is, it could still invoke UB. For instance, observe the following code:
#include <iostream>
int a;
int* b;
int foo() {
    a = 5;
    b = &a;
    return 0;
}
int bar() {
    *b = a;
    return 0;
}
int main() {
    std::cout << foo() << bar() << std::endl;
}
According to the standard (at least before C++17, which sequences the operands of << left to right), the order in which foo and bar are called is up to the implementation. In a safe implementation this order would have to be defined, most likely as left-to-right evaluation. The problem is that evaluating right-to-left invokes UB, because bar would dereference b while it is still a null pointer, and this wouldn't be caught until you ran the code on an unsafe implementation that picks that order. The safe implementation could compile every permutation of the evaluation order or do some static analysis, but this quickly becomes infeasible and possibly undecidable.
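To make the "try every permutation" idea concrete, here is a rough sketch that replays both call orders against a checked model of the program. `State`, `checked_foo`, `checked_bar` and `defined_orders` are hypothetical stand-ins, not part of any real tool, and the null check stands in for the dereference that would be UB.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Checked stand-ins for the globals a and b (hypothetical names).
struct State {
    int a = 0;
    int* b = nullptr;
};

void checked_foo(State& s) { s.a = 5; s.b = &s.a; }

// Returns false instead of dereferencing a null pointer.
bool checked_bar(State& s) {
    if (s.b == nullptr) return false;  // the real bar() would be UB here
    *s.b = s.a;
    return true;
}

// Runs both possible call orders and reports which ones stay defined.
std::vector<std::string> defined_orders() {
    std::vector<std::string> ok;
    {
        State s;
        checked_foo(s);
        if (checked_bar(s)) ok.push_back("foo,bar");
    }
    {
        State s;
        const bool bar_ok = checked_bar(s);
        checked_foo(s);
        if (bar_ok) ok.push_back("bar,foo");
    }
    assert(ok.size() <= 2);  // only two orders exist for two calls
    return ok;
}
```

With two calls there are only two orders to replay. With n unsequenced calls there are n! of them, which is why this approach stops scaling almost immediately.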
So in conclusion, if such an implementation existed it would give you a false sense of security.
The new C standard (C11) has an interesting list in the new Annex L with the crude title "Analyzability". It singles out some UB as so-called critical UB. This includes, among others:
An object is referred to outside of its lifetime (6.2.4).
A pointer is used to call a function whose type is not compatible with the referenced type.
The program attempts to modify a string literal.
All of these are kinds of UB that are impossible or very hard to catch, since they usually can't be fully checked at compile time. This is because a valid C (or C++) program is composed of several compilation units that may not know much about each other. For example, one unit may pass a pointer to a string literal into a function with a char* parameter that is defined elsewhere, or, even worse, cast away const-ness from a static variable.
Two C interpreters that detect a large class of undefined behaviors for a large subset of sequential C are KCC and Frama-C's value analysis. They are both used to make sure that automatically generated, automatically reduced random C programs are appropriate to report bugs in C compilers.
From the webpage for KCC:
One of the main aims of this work is the ability to detect undefined
programs (e.g., programs that read invalid memory).
A third interpreter for a dialect of C is CompCert's interpreter mode (a writeup). This one detects all behaviors that are undefined in the input language of the certified C compiler CompCert. The input language of CompCert is essentially C, but it renders defined some behaviors that are undefined in the standard (signed arithmetic overflow is defined as computing 2's complement results, for instance).
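As a small illustration of what detecting one such behavior involves, here is a hedged sketch of a signed addition that is checked before it is performed. `checked_add` and `add_or_abort` are hypothetical helpers written for this example; they are not how KCC, Frama-C or CompCert work internally.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: stores lhs + rhs in `out` and returns true,
// or returns false when the mathematical sum does not fit in an int.
bool checked_add(int lhs, int rhs, int& out) {
    const int max = std::numeric_limits<int>::max();
    const int min = std::numeric_limits<int>::min();
    if (rhs > 0 && lhs > max - rhs) return false;  // would overflow upward
    if (rhs < 0 && lhs < min - rhs) return false;  // would overflow downward
    out = lhs + rhs;  // safe: the checks above rule out overflow
    return true;
}

// Variant that signals an error (aborts in debug builds) instead of
// silently running into undefined behavior.
int add_or_abort(int lhs, int rhs) {
    int out = 0;
    const bool ok = checked_add(lhs, rhs, out);
    assert(ok && "signed overflow detected");
    (void)ok;  // silence unused warnings when NDEBUG is defined
    return out;
}
```

Every arithmetic operation in the program would need a guard like this, which gives a feel for the overhead a fully checked implementation has to pay.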
In truth, all three of the interpreters mentioned in this answer have had difficult choices to make in the name of pragmatism.
The whole point of defining something as "undefined behaviour" is to avoid having to detect this situation in the compiler. It is defined that way so that compilers can be built for a wide variety of platforms and architectures, and so that the hardware and software don't need specific features "just to detect undefined behaviour". Imagine that you have a memory subsystem that can't detect whether you are writing to real memory or not - how would the compiler or runtime system detect that you have just done somepointer = rand(); *somepointer = 42; ?
You can detect SOME situations. But to require that ALL are detected, would make life very difficult.
Given the Edit in the original question: I still don't think this is plausible to achieve in C. There is so much freedom to do almost anything (you can make pointers to almost anything, and these pointers can be converted, indexed, recalculated, and all manner of other things), and any of it can cause all manner of undefined behaviour.
There is a list of all undefined behaviour in C here - it lists 186 different circumstances of undefined behaviour, ranging from a backslash as the last character of the file (likely to cause compiler error, but not defined as one) to "The comparison function called by the bsearch or qsort function returns ordering values inconsistently".
How on earth do you write a compiler to check that the function passed into bsearch or qsort is ordering values consistently? Of course, if the data passed into the comparison function is of a simple type, such as integers, then it's not that difficult, but if the data type is a complex type such as
struct {
char name[20];
char street[20];
int age;
char post_code[10];
};
and the programmer decides to sort the data based on ascending name, ascending street, descending age and ascending postcode, in that order? If that's what you want, but somehow the code got messed up and the post code comparison returns some inconsistent result, things will go wrong, but it's very hard to formally inspect that case. There are lots of others that are similarly obscure and complex. Sure, YOUR code may not sort names and addresses etc, but someone will probably write something like that at some point or another.
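For what it's worth, here is a sketch of such a comparison together with a brute-force spot check. The struct name `Person` and the functions `person_less`, `looks_consistent` and `sort_people` are invented for this example. The check only covers the sample it is given, so passing it proves nothing about other inputs, which is the point being made above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

struct Person {
    char name[20];
    char street[20];
    int age;
    char post_code[10];
};

// Orders by ascending name, ascending street, descending age,
// then ascending post code.
bool person_less(const Person& x, const Person& y) {
    if (int c = std::strcmp(x.name, y.name)) return c < 0;
    if (int c = std::strcmp(x.street, y.street)) return c < 0;
    if (x.age != y.age) return x.age > y.age;  // descending
    return std::strcmp(x.post_code, y.post_code) < 0;
}

// Spot-checks irreflexivity, asymmetry and transitivity on a sample.
bool looks_consistent(const std::vector<Person>& sample) {
    for (const Person& a : sample) {
        if (person_less(a, a)) return false;
        for (const Person& b : sample) {
            if (person_less(a, b) && person_less(b, a)) return false;
            for (const Person& c : sample)
                if (person_less(a, b) && person_less(b, c) && !person_less(a, c))
                    return false;
        }
    }
    return true;
}

// Debug builds verify the sample before sorting; the check is O(n^3).
void sort_people(std::vector<Person>& people) {
    assert(looks_consistent(people));
    std::sort(people.begin(), people.end(), person_less);
}
```

A cubic-time check over the actual data is the best a runtime can do here, and it still cannot tell you whether the comparator misbehaves on data it has not seen.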