I'm currently learning pointers and my professor provided this piece of code as an example:
//We cannot predict the behavior of this program!
#include <iostream>
using namespace std;
int main()
{
char * s = "My String";
char s2[] = {'a', 'b', 'c', '\0'};
cout << s2 << endl;
return 0;
}
He wrote in the comments that we can't predict the behavior of the program. What exactly makes it unpredictable though? I see nothing wrong with it.
The behaviour of the program is non-existent, because it is ill-formed.
char* s = "My String";
This is illegal. Prior to 2011, it had been deprecated for 12 years.
The correct line is:
const char* s = "My String";
Other than that, the program is fine. Your professor should drink less whiskey!
The answer is: it depends on what C++ standard you're compiling against. All the code is perfectly well-formed across all standards‡ with the exception of this line:
char * s = "My String";
Now, the string literal has type const char[10] and we're trying to initialize a non-const pointer to it. For all other types other than the char family of string literals, such an initialization was always illegal. For example:
const int arr[] = {1};
int *p = arr; // nope!
However, in pre-C++11, for string literals, there was an exception in §4.2/2:
A string literal (2.13.4) that is not a wide string literal can be converted to an rvalue of type “pointer to char”; [...]. In either case, the result is a pointer to the first element of the array. This conversion is considered only when there is an explicit appropriate pointer target type, and not when there is a general need to convert from an lvalue to an rvalue. [Note: this conversion is deprecated. See Annex D. ]
So in C++03, the code is perfectly fine (though deprecated), and has clear, predictable behavior.
In C++11, that block does not exist - there is no such exception for string literals converted to char*, and so the code is just as ill-formed as the int* example I just provided. The compiler is obligated to issue a diagnostic, and ideally in cases such as this that are clear violations of the C++ type system, we would expect a good compiler to not just be conforming in this regard (e.g. by issuing a warning) but to fail outright.
The code should ideally not compile - but does on both gcc and clang (I assume because there's probably lots of code out there that would be broken with little gain, despite this type system hole being deprecated for over a decade). The code is ill-formed, and thus it does not make sense to reason about what the behavior of the code might be. But considering this specific case and the history of it being previously allowed, I do not believe it to be an unreasonable stretch to interpret the resulting code as if it were an implicit const_cast, something like:
const int arr[] = {1};
int *p = const_cast<int*>(arr); // OK, technically
With that, the rest of the program is perfectly fine, as you never actually touch s again. Reading a created-const object via a non-const pointer is perfectly OK. Writing a created-const object via such a pointer is undefined behavior:
std::cout << *p; // fine, prints 1
*p = 5; // will compile, but undefined behavior, which
// certainly qualifies as "unpredictable"
As there is no modification via s anywhere in your code, the program is fine in C++03, should fail to compile in C++11 but does anyway - and given that the compilers allow it, there's still no undefined behavior in it†. With allowances that the compilers are still [incorrectly] interpreting the C++03 rules, I see nothing that would lead to "unpredictable" behavior. Write to s though, and all bets are off. In both C++03 and C++11.
†Though, again, by definition ill-formed code yields no expectation of reasonable behavior
‡Except not, see Matt McNabb's answer
Other answers have covered that this program is ill-formed in C++11 due to the assignment of a const char array to a char *.
However the program was ill-formed prior to C++11 also.
The operator<< overloads are in <ostream>. The requirement for iostream to include ostream was added in C++11.
Historically, most implementations had iostream include ostream anyway, perhaps for ease of implementation or perhaps in order to provide a better QoI.
But it would be conforming for iostream to only define the ostream class without defining the operator<< overloads.
The only slightly wrong thing that I see with this program is that you're not supposed to assign a string literal to a mutable char pointer, though this is often accepted as a compiler extension.
Otherwise, this program appears well-defined to me:
The rules that dictate how character arrays become character pointers when passed as parameters (such as with cout << s2) are well-defined.
The array is null-terminated, which is a condition for operator<< with a char* (or a const char*).
#include <iostream> includes <ostream>, which in turn defines operator<<(ostream&, const char*), so everything appears to be in place.
You can't predict the behaviour of the compiler, for reasons noted above. (It should fail to compile, but may not.)
If compilation succeeds, then the behaviour is well-defined. You certainly can predict the behaviour of the program.
If it fails to compile, there is no program. In a compiled language, the program is the executable, not the source code. If you don't have an executable, you don't have a program, and you can't talk about behaviour of something that doesn't exist.
So I'd say your prof's statement is wrong. You can't predict the behaviour of the compiler when faced with this code, but that's distinct from the behaviour of the program. So if he's going to pick nits, he'd better make sure he's right. Or, of course, you might have misquoted him and the mistake is in your translation of what he said.
As others have noted, the code is illegitimate under C++11, although it was valid under earlier versions. Consequently, a compiler for C++11 is required to issue at least one diagnostic, but behavior of the compiler or the remainder of the build system is unspecified beyond that. Nothing in the Standard would forbid a compiler from exiting abruptly in response to an error, leaving a partially-written object file which a linker might think was valid, yielding a broken executable.
Although a good compiler should always ensure before it exits that any object file it is expected to have produced will be either valid, non-existent, or recognizable as invalid, such issues fall outside the jurisdiction of the Standard. While there have historically been (and may still be) some platforms where a failed compilation can result in legitimate-appearing executable files that crash in arbitrary fashion when loaded (and I've had to work with systems where link errors often had such behavior), I would not say that the consequences of syntax errors are generally unpredictable. On a good system, an attempted build will generally either produce an executable with a compiler's best effort at code generation, or won't produce an executable at all. Some systems will leave behind the old executable after a failed build, since in some cases being able to run the last successful build may be useful, but that can also lead to confusion.
My personal preference would be for disk-based systems to to rename the output file, to allow for the rare occasions when that executable would be useful while avoiding the confusion that can result from mistakenly believing one is running new code, and for embedded-programming systems to allow a programmer to specify for each project a program that should be loaded if a valid executable is not available under the normal name [ideally something which which safely indicates the lack of a useable program]. An embedded-systems tool-set would generally have no way of knowing what such a program should do, but in many cases someone writing "real" code for a system will have access to some hardware-test code that could easily be adapted to the purpose. I don't know that I've seen the renaming behavior, however, and I know that I haven't seen the indicated programming behavior.
Related
Why does the following code compile?
class Demo
{
public:
Demo() : a(this->a){}
int& a;
};
int main()
{
Demo d;
}
In this case, a is a reference to an integer. However, when I initialize Demo, I pass a reference to a reference of an integer which has not yet been initialized. Why does this compile?
This still compiles even if instead of int, I use a reference to a class which has a private default constructor. Why is this allowed?
Why does this compile?
Because it is syntactically valid.
C++ is not a safe programming language. There are several features that make it easy to do the right thing, but preventing someone from doing the wrong thing is not a priority. If you are determined to do something foolish, nothing will stop you. As long as you follow the syntax, you can try to do whatever you want, no matter how ludicrous the semantics. Keep that in mind: compiling is about syntax, not semantics.*
That being said, the people who write compilers are not without pity. They know the common mistakes (probably from personal experience), and they recognize that your compiler is in a good position to spot certain kinds of semantic mistakes. Hence, most compilers will emit warnings when you do certain things (not all things) that do not make sense. That is why you should always enable compiler warnings.
Warnings do not catch all logical errors, but for the ones they do catch (such as warning: 'Demo::a' is initialized with itself and warning: '*this.Demo::a' is used uninitialized), you've saved yourself a ton of debugging time.
* OK, there are some semantics involved in compiling, such as giving a meaning to identifiers. When I say compiling is not about semantics, I am referring to a higher level of semantics, such as the intended behavior.
Why does this compile?
Because there is no rule that would make the program ill-formed.
Why is this allowed?
To be clear, the program is well-formed, so it compiles. But the behaviour of the program is undefined, so from that perspective, the premise of your question is flawed. This isn't allowed.
It isn't possible to prove all cases where an indeterminate value is used, and it isn't easy to specify which of the easy cases should be detected by the compiler, and which would be considered to be too difficult. As such, the standard doesn't attempt to specify it, and leaves it up to the compiler to warn when it is able to detect it. For what it's worth, GCC is able to detect it in this case for example.
C++ allows you to pass a reference to a reference to uninitialized data because you might want to use the called function as the initializer.
A colleague of mine is working on C++ code that works with binary data arrays a lot. In certain places, he has code like
char *bytes = ...
T *p = (T*) bytes;
T v = p[i]; // UB
Here, T can be sometimes short or int (assume 16 and 32 bit respectively).
Now, unlike my colleague, I belong to the "no UB if at all possible" camp, while he is more along the lines of "if it works, it's OK". I am having a hard time trying to convince him otherwise.
Given that:
bytes really come from somewhere outside this compilation unit, being read from some binary file.
It's safe to assume that array really contains integers in the native endianness.
In practice, given mainstream C++ compilers like MSVC 2017 and gcc 4.8, and Intel x64 hardware, is such a thing really safe? I know it wouldn't be if T was, say, float (got bitten by it in the past).
char* can alias other entities without breaking strict aliasing rule.
Your code would be UB only if originally p + i wasn't a T originally.
char* byte = (char*) floats;
int *p = (int*) bytes;
int v = p[i]; // UB
but
char* byte = (char*) floats;
float *p = (float*) bytes;
float v = p[i]; // OK
If origin of byte is "unknown", compiler cannot benefit of UB for optimization and should assume we are in valid case and generate code according.
But how do you guaranty it is unknown ? Even outside the TU, something like Link-Time Optimization might allow to provide the hidden information.
Type-punned pointers are safe if one uses a construct which is recognized by the particular compiler one is using [i.e. any compiler that is configured support quality semantics if one is using straightforward constructs; neither gcc nor clang support quality semantics qualifies with optimizations are enabled, however, unless one uses -fno-strict-aliasing]. The authors of C89 were certainly aware that many applications required the use of various type-punning constructs beyond those mandated by the Standard, but thought the question of which constructs to recognize was best left as a quality-of-implementation issue. Given something like:
struct s1 { int objectClass; };
struct s2 { int objectClass; double x,y; };
struct s3 { int objectClass; char someData[32]; };
int getObjectClass(void *p) { return ((struct s1*)p)->objectClass; }
I think the authors of the Standard would have intended that the function be usable to read field objectClass of any of those structures [that is pretty much the whole purpose of the Common Initial Sequence rule] but there would be many ways by which compilers might achieve that. Some might recognize function calls as barriers to type-based aliasing analysis, while others might treat pointer casts in such a fashion. Most programs that use type punning would do several things that compilers might interpret as indications to be cautious with optimizations, so there was no particular need for a compiler to recognize any particular one of them. Further, since the authors of the Standard made no effort to forbid implementations that are "conforming" but are of such low-quality implementations as to be useless, there was no need to forbid compilers that somehow managed not to see any of the indications that storage might be used in interesting ways.
Unfortunately, for whatever reason, there hasn't been any effort by compiler vendors to find easy ways of recognizing common type-punning situations without needlessly impairing optimizations. While handling most cases would be fairly easy if compiler writers hadn't adopted designs that filter out the clearest and most useful evidence before applying optimization logic, both the designs of gcc and clang--and the mentalities of their maintainers--have evolved to oppose such a concept.
As far as I'm concerned, there is no reason why any "quality" implementation should have any trouble recognizing type punning in situations where all operations upon a byte of storage using a pointer converted to a pointer-to-PODS, or anything derived from that pointer, occur before the first time any of the following occurs:
That byte is accessed in conflicting fashion via means not derived from that pointer.
A pointer or reference is formed which will be used sometime in future to access that byte in conflicting fashion, or derive another that will.
Execution enters a function which will do one of the above before it exits.
Execution reaches the start of a bona fide loop [not, e.g. a do{...}while(0);] which will do one of the above before it exits.
A decently-designed compiler should have no problem recognizing those cases while still performing the vast majority of useful optimizations. Further, recognizing aliasing in such cases would be simpler and easier than trying to recognize it only in the cases mandated by the Standard. For those reasons, compilers that can't handle at least the above cases should be viewed as falling in the category of implementations that are of such low quality that the authors of the Standard didn't particularly want to allow, but saw no reason to forbid. Unfortunately, neither gcc nor clang offer any options to behave reasonably except by requiring that they disable type-based aliasing altogether. Unfortunately, the authors of gcc and clang would rather deride as "broken" any code needing features beyond what the Standard requires, than attempt a useful blend of optimization and semantics.
Incidentally, neither gcc nor clang should be relied upon to properly handle any situation in which storage that has been used as one type is later used as another, even when the Standard would require them to do so. Given something like:
union { struct s1 v1; struct s2 v2} unionArr[100];
void test(int i)
{
int test = unionArr[i].v2.objectClass;
unionArr[i].v1.objectClass = test;
}
Both clang and gcc will treat it as a no-op even if it is executed between code which writes unionArr[i].v2.objectClass and code which happens to reads member v1.objectClass of the same union object, thus causing them to ignore the possibility that the write to unionArr[i].v2.objectClass might affect v1.objectClass.
Calling maininside your program violates the C++ Standard
void f()
{
main(); // an endless loop calling main ? No that's not allowed
}
int main()
{
static int = 0;
std::cout << i++ << std::endl;
f();
}
In a lecture Chandler Carruth, at about '22.40' says
if you've written a compiler test you've written a call to main
How is this relevant or how the fact that the Standard doesn't allow is overcome ?
The point here is that if you write compiler test-code, you probably will want to test calling main with a few different parameter sets and that it is possible to do this, with the understanding of the compiler you are going to test on.
The standard forbids calls to main so that main can have magical code (e.g. the code to construct global objects or to initialize some data structure, zero out global uninitialized POD data, etc.). But if you are writing test code for a compiler, you probably will have an understanding of whether the compiler does this - and if so, what it actually does in such a step, and take that into account in your testing - you could for example "dirty" some global variable, and then call main again and see that this variable is indeed set to zero again. Or it could be that main is indeed not callable in this particular compiler.
Since Chandler is talking about LLVM (and in C terms, Clang), he knows how that compiler produces code for main.
This clearly doesn't apply to "black box testing" of compilers. In such a test-suite, you could not rely on the compiler doing anything in particular, or NOT doing something that would harm your test.
Like ALL undefined behaviour, it is not guaranteed to work in any particular way, but SOMETIMES, if you know the actual implementation of the compiler, it will be possible to exploit that behaviour - just don't consider it good programming, and don't expect it to work in a portable way.
As an example, on a PC, you can write to the text-screen (before the MMU has been configured at least) by doing this:
char *ptr = (char *)0xA0000;
ptr[0] = 'a';
ptr[1] = 7; // Determines the colour.
This, by the standard, is undefined behaviour, because the standard does say that you can only use pointers to allocations made inside the C or C++ runtime. But clearly, you can't allocate memory in the graphics card... So technically, it's UB, but guess what Linux and Windows do during early boot? Write directly to the VGA memory... [Or at least they used to some time ago, when I last looked at it]. And if you know your hardware, this should work with every compiler I'm aware of - if it doesn't, you probably can't use it to write low-level driver code. But it is undefined by the standard, and "UB sanitizer" will probably moan at the code.
The following code invokes undefined behaviour.
int& foo()
{
int bar = 1234;
return bar;
}
g++ issues a warning:
warning: reference to local variable ‘bar’ returned [-Wreturn-local-addr]
clang++ too:
warning: reference to stack memory associated with local variable 'bar' returned [-Wreturn-stack-address]
Why is this not a compile error (ignoring -Werror)?
Is there a case where returning a ref to a local var is valid?
EDIT As pointed out, the spec mandates this be compilable. So, why does the spec not prohibit such code?
I would say that requiring this to make the program ill-formed (that is, make this a compilation error) would complicate the standard considerably for little benefit. You'd have to exactly spell out in the standard when such cases shall be diagnosed, and all compilers would have to implement them.
If you specify too little, it will not be too useful. And compilers probably already check for this to emit warnings, and real programmers compile with -Wall_you_can_give_me -Werror anyway.
If you specify too much, it will be difficult (or impossible) for compilers to implement the standard.
Consider this class (for which you only have the header and a library):
class Foo
{
int x;
public:
int& getInteger();
};
And this code:
int& bar()
{
Foo f;
return f.getInteger();
}
Now, should the standard be written to make this ill-formed or not? Probably not, what if Foo is implemented like this:
#include "Foo.h"
int global;
int& Foo::getInteger()
{
return global;
}
At the same time, it could be implemented like this:
#include "Foo.h"
int& Foo::getInteger()
{
return x;
}
Which of course would give you a dangling reference.
My point is that the compiler cannot really know whether returning a reference is OK or not, except for a few trivial cases (returning a reference to a function-scope automatic variable or parameter of non-reference type). I don't think it's worth it to complicate the standard for that. Especially as most compilers already warn about this as a quality-of-implementation matter.
Also, because you may want to get the current stack pointer (whatever that means on your particular implementation).
This function:
void* get_stack_pointer (void) { int x; return &x; };
AFAIK, it is not undefined behavior if you don't dereference the resulting pointer.
is much more portable than this one:
void* get_stack_pointer (void) {
register void* sp asm ("%esp"); return sp; }
As to why you may want to get the stack pointer: well, there are cases where you have a valid reason to get it: for instance the conservative Boehm garbage collector needs to scan the stack (so wants the stack pointer and the stack bottom).
And if you returned a C++ reference on which you would only take its address using the & unary operator, getting such an address is IIUC legal (it is IMHO the only licit operation you can do on it).
Another reason to get the stack pointer would be to get a non-NULL pointer address (which you could e.g. hash) different of any heap, local or static data. However, you could use (void*)1 or (void*)-1 for that purpose.
So the compiler is right in only warning against this.
I guess that a C++ compiler should accept
int& get_sp_ref(void) { int x; return x; }
void show_sp(void) {
std::cout << (&(get_sp_ref())) << std::endl; }
For the same reason C allows you to return a pointer to a memory block that's been freed.
It's valid according to the language specification. It's a horribly bad idea (and is nowhere close to being guaranteed to work) but it's still valid inasmuch as it's not forbidden.
If you're asking why the standard allows this, it's probably because, when references were introduced, that's the way they worked. Each iteration of the standard has certain guidelines to follow (such as minimising the possibility of "breaking changes", those that render existing well-formed programs invalid) and the standard is an agreement between user and implementer, with undoubtedly more implementers than users sitting on the committees :-)
It may be worth pushing that idea through as a potential change and seeing what ISO say but I suspect it would be considered one of those "breaking changes" and therefore very suspect.
To expand on the earlier answers, the ISO C++ standard does not capture the distinction between warnings and errors to begin with; it simply uses the term 'diagnostic' when referring to what a compiler must emit upon seeing an ill-formed program. Quoting N3337, 1.4, paragraphs 1 and 2:
The set of diagnosable rules consists of all syntactic and semantic rules in this
International Standard except for those rules containing an explicit notation that “no
diagnostic is required” or which are described as resulting in “undefined behavior.”
Although this International Standard states only requirements on C++ implementations,
those requirements are often easier to understand if they are phrased as requirements on
programs, parts of programs, or execution of programs. Such requirements have the
following meaning:
If a program contains no violations of the rules in this International Standard, a
conforming implementation shall, within its resource limits, accept and correctly execute
that program.
If a program contains a violation of any diagnosable rule or an occurrence of a
construct described in this Standard as “conditionally-supported” when the
implementation does not support that construct, a conforming implementation shall issue
at least one diagnostic message.
If a program contains a violation of a rule for which no diagnostic is required, this
International Standard places no requirement on implementations with respect to that
program.
Something not mentioned by other answers yet is that this code is OK if the function is never called.
The compiler isn't required to diagnose whether a function might ever be called or not. For example you might set up a program which looks for counterexamples to Fermat's Last Theorem, and calls this function if it finds one. It would be a mistake for the compiler to reject such a program.
Returning reference into local variable is bad idea, however some people may create code which requires that, so compiler should only warn about that and don't determine valid (valid structure) code as erroneous.
Angew already posted sample with local variable that is actually global. However there is some other (IMHO better) sample.
Object& GetSmth()
{
Object* obj = new Object();
return *obj;
}
In this case reference to local object is valid and caller after usage should dealocate memory.
IMPORTANT NOTE I don't encourage and don't recommend to use such coding style, because it is bad, usually it is hard to understand what is going on and it leads in some kind of problems like memory leaks or crashes. It is just a sample which shows why this particular situation cannot be treated as error.
Does there exist any implementation of C++ (and/or C) that guarantees that anytime undefined behavior is invoked, it will signal an error? Obviously, such an implementation could not be as efficient as a standard C++ implementation, but it could be a useful debugging/testing tool.
If such an implementation does not exist, then are there any practical reasons that would make it impossible to implement? Or is it just that no one has done the work to implement it yet?
Edit: To make this a little more precise: I would like to have a compiler that allows me to make the assertion, for a given run of a C++ program that ran to completion, that no part of that run involved undefined behavior.
Yes, and no.
I am fairly certain that for practical purposes, an implementation could make C++ a safe language, meaning every operation has well-defined behavior. Of course, this comes at a huge overhead and there is probably some cases where it's simply unfeasible, such as race conditions in multithreaded code.
Now, the problem is that this can't guarantee your code is defined in other implementations! That is, it could still invoke UB. For instance, observe the following code:
int a;
int* b;
int foo() {
a = 5;
b = &a;
return 0;
}
int bar() {
*b = a;
return 0;
}
int main() {
std::cout << foo() << bar() << std::endl;
}
According to the standard, the order that foo and bar are called is up to the implementation to decide. Now, in a safe implementation this order would have to be defined, likely being left-to-right evaluation. The problem is that evaluating right-to-left invokes UB, which wouldn't be caught until you ran it on an unsafe implementation. The safe implementation could simply compile each permutation of evaluation order or do some static analysis, but this quickly becomes unfeasible and possibly undecidable.
So in conclusion, if such an implementation existed it would give you a false sense of security.
The new C standard has an interesting list in the new Annex L with the crude title "Analyzability". It talks about UB that is so-called critical UB. This includes among others:
An object is referred to outside of its lifetime (6.2.4).
A pointer is used to call a function whose type is not compatible with the referenced
type
The program attempts to modify a string literal
All of these are UB that are impossible or very hard to capture, since they usually can't be completely tested at compile time. This is due to the fact that a valid C (or C++) program is composed of several compilation units that may not know much of each other. E.g if one program passes a pointer to a string literal into a function with a char* parameter, or even worse, a program that casts away const-ness from a static variable.
Two C interpreters that detect a large class of undefined behaviors for a large subset of sequential C are KCC
and Frama-C's value analysis. They are both used to make sure that automatically generated, automatically reduced random C programs are appropriate to report bugs in C compilers.
From the webpage for KCC:
One of the main aims of this work is the ability to detect undefined
programs (e.g., programs that read invalid memory).
A third interpreter for a dialect of C is CompCert's interpreter mode (a writeup). This one detects all behaviors that are undefined in the input language of the certified C compiler CompCert. The input language of CompCert is essentially C, but it renders defined some behaviors that are undefined in the standard (signed arithmetic overflow is defined as computing 2's complement results, for instance).
In truth, all three of the interpreters mentioned in this answer have had difficult choices to make in the name of pragmatism.
The whole point of defining something as "undefined behaviour" is to avoid having to detect this situation in the compiler. It is defined that way, so that compilers can be built for a wide variety of platforms and architectures, and so that the hardware and software doesn't have to have specific features "just to detect undefined behaviour". Imagine that you have a memory subsystem that can't detect whether you are writing to real memory or not - how would the compiler or runtime system detect that you have just done somepointer = rand(); *somepointer = 42;
You can detect SOME situations. But to require that ALL are detected, would make life very difficult.
Given the Edit in the original question: I still don't think this is plausible to achieve in C. There is so much freedom to do almost anything (making pointers to almost anything, these pointers can be converted, indexed, recalculated, and all manner of other things), and will be able to cause all manner of undefined behaviour.
There is a list of all undefined behaviour in C here - it lists 186 different circumstances of undefined behaviour, ranging from a backslash as the last character of the file (likely to cause compiler error, but not defined as one) to "The comparison function called by the bsearch or qsort function returns ordering values inconsistently".
How on earth do you write a compiler to check that the function passed into bsearch or qsort is ordering values consistently? Of course, if the data passed into the comparison function is of a simple type, such as integers, then it's not that difficult, but if the data type is a complex type such as
struct {
char name[20];
char street[20];
int age;
char post_code[10];
};
and the programmer decides to sort the data based on ascending name, ascending street, descending age and ascending postcode, in that order? If that's what you want, but somehow the code got messed up and post code comparison returns some inconsistant result, things will go wrong, but it's very hard to formally inspect that case. There are lots of others that are similarly obscure and complex. Sure, YOUR code may not sort names and addresses etc, but someone will probably write somethng like that at some point or another.