I'm trying to understand what's going on internally with this (weird?) g++ behavior.
#include <iostream>
using namespace std;
int& f(void) {
    int a = 9;
    int& b = a;
    return a; // or: return b;
}
int main(void) {
    int& l = f();
    cout << ++l << '\n' << l << '\n';
}
When returning a itself and binding it to l, I get a warning (reference to local variable) and a segfault if I access it through l. But when returning b instead, not only do I get no segfault, I can even access the value once through l (UB, I'm guessing) before the value of l randomly changes. But what exactly happens here?
Aren't the two returns identical? Does g++ automatically mark a's area as unusable after the return, hence the segfault, while for some reason allowing b to live longer?
Your main question is why gcc doesn't issue a warning in one of the alternatives. Both alternatives are undefined behavior and the only difference is that in one case the compiler can detect it and warn you about it.
The C++ standard does not require a diagnostic for undefined behavior. Any diagnostic to that effect, from your compiler, is just an extra bonus; and although modern C++ compilers are very smart, they can't always figure out that the compiled code will result in demons flying out of your nose.
P.S. gcc 10.2 does issue a warning with the -O3 option for the return b; alternative. With -Wall only, gcc also issues a second warning for undefined behavior; you can discover what it is by yourself.
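For example (the file name is hypothetical, and the exact wording of the diagnostic varies between gcc versions):
g++ -O3 -Wall test.cpp   # with optimization, the dangling reference is flagged for return b; too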
No, C++ compilers do not mark areas as unusable.
Segfaults are just one of many ways undefined behavior can exhibit itself. It is in fact one of the friendlier ways, as it makes you notice it early.
You are in charge of lifetime. If you get it wrong, the result is undefined behavior. Not an exception. Not a segfault. Literally anything.
One possible symptom is "it appears to work". Another is a segfault. Others include literal time travel (where UB later in the program makes earlier code behave differently), your computer's hard drive being rendered unusable, someone getting your credit card information, your browser history being emailed to your contact list, etc.
Some compilers, in debug mode, mark deallocated memory with a bit pattern to aid debugging.
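As a hedged illustration (not something to rely on: reading freed memory is itself undefined behaviour, and the fill value below is specific to MSVC's debug CRT heap):
#include <cstdio>
int main() {
    int* p = new int(42);
    delete p;
    // UB: p is dangling. On an MSVC debug build the freed block is
    // typically filled with the byte 0xDD, so *p often reads back
    // as 0xDDDDDDDD instead of 42.
    std::printf("%x\n", static_cast<unsigned>(*p));
}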
But what exactly happens here?
Undefined behaviour.
Aren't the two returns identical?
The source code is clearly syntactically different. The programs don't have the same semantic meaning either, because neither program has any semantic meaning: the meaning of both programs is undefined. As such, the behaviour of the programs is not guaranteed to be the same.
Does g++ automatically mark a's area as unusable after the return
Perhaps. I wouldn't assume this to be the case based on that one observation but this may be true. See the source code of GCC to confirm.
Related
I've been studying undefined behavior examples for C++, and I've found following one:
int a = 0;
a = a++;
Tried it with g++ -Wall -Wextra and it gave me a warning about sequence points.
But then I thought about another situation using reference:
int a = 0;
int &b = a;
b = a++;
This one didn't warn me about sequence points at all, even though it seems almost obvious that it should.
Is there any good explanation why those two examples are treated differently by compiler?
That might seem like obvious UB, but you have to understand that there are uncountably many different ways to violate the sequencing rules. Proving whether any particular expression is in violation is a slow and complex process, and sometimes turns out to be impossible. This is why violating these rules has been specified as undefined behaviour in the standard, rather than making the program ill-formed, which would have required a diagnostic for every possible violation.
So, the compiler has to draw a line somewhere, and not spend resources to validate all expressions. Your test shows two expressions that are on opposite sides of that "line".
Calling main inside your program violates the C++ Standard
#include <iostream>

void f()
{
    main(); // an endless loop calling main? No, that's not allowed
}

int main()
{
    static int i = 0;
    std::cout << i++ << std::endl;
    f();
}
In a lecture, Chandler Carruth, at about 22:40, says
if you've written a compiler test you've written a call to main
How is this relevant, and how is the fact that the Standard doesn't allow it overcome?
The point here is that if you write compiler test code, you probably will want to test calling main with a few different parameter sets, and that this is possible given an understanding of the compiler you are going to test.
The standard forbids calls to main so that main can contain magical code (e.g. code to construct global objects, initialize some data structure, zero out global uninitialized POD data, etc.). But if you are writing test code for a compiler, you probably know whether the compiler does this, and if so what it actually does in such a step, and can take that into account in your testing. You could for example "dirty" some global variable and then call main again, and check that this variable is indeed set to zero again. Or it could be that main is simply not callable in this particular compiler.
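A hedged sketch of such a test (it deliberately violates the standard and relies entirely on knowledge of the implementation; with mainstream compilers, static initialization happens once in the startup code before main, so the global is not re-zeroed):
int g; // zero-initialized POD global
int main()
{
    static bool reentered = false;
    if (!reentered) {
        reentered = true;
        g = 42;        // "dirty" the global
        return main(); // UB per the standard
    }
    // If re-entering main re-ran the magical initialization code,
    // g would be 0 again here; on gcc/clang it stays 42.
    return g == 42 ? 0 : 1;
}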
Since Chandler is talking about LLVM (and in C terms, Clang), he knows how that compiler produces code for main.
This clearly doesn't apply to "black box testing" of compilers. In such a test-suite, you could not rely on the compiler doing anything in particular, or NOT doing something that would harm your test.
Like ALL undefined behaviour, it is not guaranteed to work in any particular way, but SOMETIMES, if you know the actual implementation of the compiler, it will be possible to exploit that behaviour - just don't consider it good programming, and don't expect it to work in a portable way.
As an example, on a PC, you can write to the text-screen (before the MMU has been configured at least) by doing this:
volatile char *ptr = (volatile char *)0xB8000; // colour text-mode buffer; 0xA0000 is the graphics-mode window
ptr[0] = 'a';
ptr[1] = 7; // attribute byte: determines the colour
This, by the standard, is undefined behaviour, because the standard does say that you can only use pointers to allocations made inside the C or C++ runtime. But clearly, you can't allocate memory in the graphics card... So technically, it's UB, but guess what Linux and Windows do during early boot? Write directly to the VGA memory... [Or at least they used to some time ago, when I last looked at it]. And if you know your hardware, this should work with every compiler I'm aware of - if it doesn't, you probably can't use it to write low-level driver code. But it is undefined by the standard, and "UB sanitizer" will probably moan at the code.
Is it possible in either gcc/g++ or MS C++ to set a flag which only allows defined behavior? So that something like the below gives me a warning, or preferably an error:
func(a++, a, ++a)
Undefined and unspecified behavior is designated so in the standard specifically because it could cause undue burden on the implementation to diagnose all examples of it (or it would be impossible to determine).
It's expected that the programmer take care to avoid those areas that are undefined.
For your stated example it should be fairly obvious to a programmer to just not write that code in the first place.
That being said, g++ -Wall will catch some bad code, such as missing return in a non-void function to give one example.
EDIT: #sehe also points out -Wsequence-point, which will catch this precise code construct. Note that (before C++17) there is no sequence point between the evaluation of different function arguments, and the order in which the arguments are evaluated is unspecified.
GNU C++ has the following
-Wsequence-point
Warn about code that may have undefined semantics because of violations of sequence point rules in the C and C++ standards.
This will correctly flag the invocation you showed.
Also worth a look:
-Wstrict-overflow
-fstrict-aliasing
-fstrict-overflow
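A sample invocation combining these (the file name is hypothetical; as far as I know, -Wstrict-overflow only has an effect while -fstrict-overflow is in force):
g++ -Wall -Wsequence-point -Wstrict-overflow -fstrict-aliasing -fstrict-overflow example.cpp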
HTH
No. For example, consider the following:
int badfunc(int &a, int &b) {
    return func(a++, b++); // func is declared elsewhere
}
This has undefined behavior if a and b have the same referent. In general the compiler cannot know what arguments will be passed to a function, so it can't reliably catch this case of undefined behavior. Therefore it can't catch all undefined behavior.
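For example, this call is harmless-looking in isolation but makes the body of badfunc undefined (a minimal sketch, assuming func is declared somewhere):
int x = 0;
badfunc(x, x); // a and b now both refer to x: the two increments are unsequenced, hence UB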
Compiler warnings serve to identify some instances of undefined behavior, but never all.
In theory you could write a C++ implementation that does vast numbers of checks at runtime to ensure that undefined behavior is always identified and dealt with in ways defined by that implementation. It still wouldn't tell you at compile time (see: halting problem), and in practice you'd probably be better off with C#, which was designed to make the necessary runtime checks reasonably efficient...
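(Something along these lines partially exists today: compiling with the sanitizers, e.g. clang++ -fsanitize=address,undefined -g test.cpp, adds runtime checks for many, though by no means all, kinds of undefined behaviour.)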
Even if you built that magical checking C++ implementation, it still might not tell you what you really want to know, which is whether your code is correct. Sometimes (hang on to your seats), it is implementation-defined whether or not behavior is undefined. For a simple example, tolower((char)-1); has defined behavior[*] if the char type is unsigned, but undefined behavior if the char type is signed.
So, unless your magical checking implementation makes all the same implementation choices as the "real" implementation that you want your code to run on, it won't tell you whether the code has defined behavior for the set of implementation choices made in the "real" implementation, only whether it has defined behavior for the implementation choices made in the magical checking implementation.
To know that your code is correct and portable, you need to know (for starters) that it produces no undefined behavior for any set of implementation choices. And, for that matter, for any input, not just the inputs used in your tests. You might think that this is a big deficiency in C++ compared to languages with no undefined behavior. Certainly it is inconvenient at times, and affects how you go about sandboxing programs for security. In practice, though, for you to consider your code correct you don't just need it to have defined behavior, you need the behavior to match the specification document. That's a much bigger problem, and in practice it isn't very much harder to write a bug in (say) Java or Python than it is in C++. I've written countless bugs in all three, and knowing that in Java or Python the behavior was defined but wrong didn't help me all that much.
[*] Well, the result is still implementation-defined, it depends on the execution character set, but the implementation has to return the correct result. If char is signed it's allowed to crash.
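The usual portable idiom sidesteps the question entirely, by converting to unsigned char before the call (a minimal sketch):
#include <cctype>
int safe_tolower(char c) {
    // The converted value is always representable as unsigned char,
    // whatever the signedness of char, so the call is always defined.
    return std::tolower(static_cast<unsigned char>(c));
}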
This gave me a good laugh. Sorry about that, didn't mean any offense; it's a good question.
There is no compiler on the planet that only allows 100% defined behavior. It's the undefined nature of things that makes it so hard. There are a lot of cases taken up in the standard, but they're often too vague to efficiently implement in a compiler.
I know Clang developers showed some interest to adding that functionality, but they haven't started as far as I know.
The only thing you can do now and in the near/far future is cranking up the warning level and strictness of your compiler. Sadly, even in recent versions, MSVC is a pain in that regard. On warning level 4 and up, it spits some stupid warnings that have nothing to do with code correctness, and you often have to jump through hoops to get them to go away.
GCC is better at that in my personal experience. I personally use these options, ensuring the strictest checks I currently know of:
-std=c++0x -pedantic -Wextra -Weffc++ -Wmissing-include-dirs -Wstrict-aliasing
I of course ensure zero warnings. If you want to enforce even that, just add -Werror to the line above and any warning will become a hard error. It's mostly the -std and -pedantic options that enforce Standard behavior; -Wextra catches some off-chance semi-errors.
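For instance (the file name is hypothetical):
g++ -std=c++0x -pedantic -Wextra -Weffc++ -Wmissing-include-dirs -Wstrict-aliasing -Werror -c example.cpp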
And of course, compile your code with different compilers if possible (and make sure they are correctly diagnosing the problem by asking here, where people know what the Standard says/means).
While I agree with Mark's answer, I just thought I should let you know...
#include <stdio.h>

int func(int a, int b, int c)
{
    return a + b + c;
}

int main()
{
    int a = 0;
    printf("%d\n", func(a++, a, ++a)); /* line 11 */
    return 0;
}
When compiling the code above with gcc -Wall, I get the following warnings:
test.c:11: warning: operation on ‘a’ may be undefined
test.c:11: warning: operation on ‘a’ may be undefined
because of a++ and ++a, I suppose. So to some degree, it's been implemented. But obviously we can't expect all undefined behavior to be recognized by the compiler.
I’m using g++ with warning level -Wall -Wextra and treating warnings as errors (-Werror).
Now I’m sometimes getting an error “variable may be used uninitialized in this function”.
By “sometimes” I mean that I have two independent compilation units that both include the same header file. One compilation unit compiles without error, the other gives the above error.
The relevant piece of code in the header files is as follows. Since the function is pretty long, I’ve only reproduced the relevant bit below.
The exact error is:
'cmpres' may be used uninitialized in this function
And I’ve marked the line with the error by * below.
for (; ;) {
    int cmpres; // *
    while (b <= c and (cmpres = cmp(b, pivot)) <= 0) {
        if (cmpres == 0)
            ::std::iter_swap(a++, b);
        ++b;
    }
    while (c >= b and (cmpres = cmp(c, pivot)) >= 0) {
        if (cmpres == 0)
            ::std::iter_swap(d--, c);
        --c;
    }
    if (b > c) break;
    ::std::iter_swap(b++, c--);
}
(cmp is a functor that takes two pointers x and y and returns -1, 0 or +1 if *x < *y, *x == *y or *x > *y, respectively. The other variables are pointers into the same array.)
This piece of code is part of a larger function but the variable cmpres is used nowhere else. Hence I fail to understand why this warning is generated. Furthermore, the compiler obviously understands that cmpres will never be read uninitialized (or at least, it doesn’t always warn, see above).
Now I have two questions:
Why the inconsistent behaviour? Is this warning generated by a heuristic? (This is plausible, since emitting this warning requires a control flow analysis that is undecidable in the general case and cannot always be performed exactly.)
Why the warning? Is my code unsafe? I have come to appreciate this particular warning because it has saved me from very hard to detect bugs in other cases – so this is a valid warning, at least sometimes. Is it valid here?
An algorithm that diagnoses uninitialized variables with no false negatives or positives must (as a subroutine) include an algorithm that solves the Halting Problem. Which means there is no such algorithm. It is impossible for a computer to get this right 100% of the time.
I don't know how GCC's uninitialized variable analysis works exactly, but I do know it's very sensitive to what early optimization passes have done to the code. So I'm not at all surprised you get false positives only sometimes. It does distinguish cases where it's certain from cases where it can't be certain --
int foo() { int a; return a; }
produces "warning: ‘a’ is used uninitialized in this function" (emphasis mine).
EDIT: I found a case where recent versions of GCC (4.3 and later) fail to diagnose an uninitialized variable:
int foo(int x)
{
    int a;
    return x ? a : 0;
}
Early optimizations notice that if x is nonzero, the function's behavior is undefined, so they assume x must be zero and replace the entire body of the function with "return 0;". This happens well before the pass that generates the used-uninitialized warnings, so there's no diagnostic. See GCC bug 18501 for the gory details.
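In effect, after early optimization the function behaves as if it had been written like this (a hedged reconstruction, not GCC's actual intermediate output):
int foo(int x)
{
    // The x != 0 path read 'a' uninitialized (undefined behaviour),
    // so the optimizer assumes that path is never taken and drops the test.
    (void)x;
    return 0;
}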
I bring this up partially to demonstrate that production-grade compilers can get uninitialized-variable diagnostics wrong both ways, and partially because it's a nice example of the point that undefined behavior can propagate backward in execution time. There's nothing undefined about testing x, but because code control-dependent on x has undefined behavior, a compiler is allowed to assume that the control dependency is never satisfied and discard the test.
There was an interesting discussion on clang dev-mailing list related to those heuristics this week.
The bottom line is: it's actually quite difficult to diagnose uninitialized values without getting exponential behavior...
Apparently (from the discussion), gcc uses a predicate-based approach, but given your experience it seems that it is not always sufficient.
I suspect it's got something to do with the fact that the assignment is mixed into the condition (and after a short-circuiting operator, at that...). Have you tried without it?
I think both the gcc and clang folks would be very interested by this example since it's relatively common practice in C or C++ and thus could benefit from some tuning.
The code is correct, but the compiler is failing to identify that the variable is never used without initialization.
I would suggest that it's likely a heuristic error; that's what the "may" is for. I suspect that not many loop conditions look quite like that. The code is not unsafe, because on all control paths cmpres is assigned before use. However, I certainly wouldn't find it wrong to initialize it first.
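For example, a minimal change to the flagged line (the initializer is redundant on every control path, but it silences the warning):
int cmpres = 0; // *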
You could, however, have some kind of variable shadowing going on here. That would be the only explanation I could think of for only one of the two translation units giving errors.
Can't a compiler warn (even better, throw an error) when it notices a statement with undefined/unspecified/implementation-defined behaviour?
Probably, to flag a statement as an error, the standard would have to say so, but it could at least warn the coder. Are there any technical difficulties in implementing such an option? Or is it simply impossible?
The reason I have this question is that in statements like a[i] = ++i; won't the compiler know that the code is referencing a variable and modifying it in the same statement, before a sequence point is reached?
It all boils down to:
Quality of Implementation: the more accurate and useful the warnings are, the better it is. A compiler that always printed "This program may or may not invoke undefined behavior" for every program, and then compiled it, would be pretty useless, but standards-compliant. Thankfully, no one writes compilers like that. :-)
Ease of determination: a compiler may not be easily able to determine undefined behavior, unspecified behavior, or implementation-defined behavior. Let's say you have a call stack that's 5 levels deep, with a const char * argument being passed from the top-level, to the last function in the chain, and the last function calls printf() with that const char * as the first argument. Do you want the compiler to check that const char * to make sure it is correct? (Assuming that the first function uses a literal string for that value.) How about when the const char * is read from a file, but you know that the file will always contain valid format specifier for the values being printed?
Success rate: A compiler may be able to detect many constructs that may or may not be undefined, unspecified, etc., but with a very low "success rate". In that case, the user doesn't want to see a lot of "may be undefined" messages; too many spurious warnings may hide the real ones, or prompt a user to compile at a "low-warning" setting. That is bad.
For your particular example, gcc gives a warning about "may be undefined". It even warns for printf() format mismatch.
But if your hope is for a compiler that issues a diagnostic for all undefined/unspecified cases, it is not clear if that should/can work.
Let's say you have the following:
#include <stdio.h>

void add_to(int *a, int *b)
{
    *a = ++*b;
}

int main(void)
{
    int i = 42;
    add_to(&i, &i); /* bad */
    printf("%d\n", i);
    return 0;
}
Should the compiler warn you about *a = ++*b; line?
As gf says in the comments, a compiler cannot check across translation units for undefined behavior. A classic example is declaring a variable as a pointer in one file and defining it as an array in another; see comp.lang.c FAQ 6.1.
Different compilers trap different conditions; most compilers have warning level options. GCC specifically has many, but -Wall -Werror will switch on most of the useful ones and coerce them into errors. Use /W4 /WX for similar protection in VC++.
In GCC you could use -ansi -pedantic, but -pedantic is exactly what it says, and will throw up many irrelevant issues and make it hard to use much third-party code.
Either way, because compilers catch different errors or produce different messages for the same error, it is useful to use multiple compilers: not necessarily for deployment, but as a poor man's static analysis. Another approach for C code is to attempt to compile it as C++; the stronger type checking of C++ generally results in better C code. But be sure that if you want the C compilation to keep working, you don't use the C++ compiler exclusively, or you are likely to introduce C++-specific features. Again, this need not be deployed as C++, but just used as an additional check.
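For example, this is legal C that any C++ compiler rejects (a minimal illustration of the stricter type checking):
#include <stdlib.h>

int main(void)
{
    int *p = malloc(sizeof *p); /* fine in C; C++ rejects the implicit void* conversion */
    free(p);
    return 0;
}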
Finally, compilers are generally built to balance performance against error checking; checking exhaustively would take time that many developers would not accept. For this reason static analysers exist: for C there is the traditional lint, and the open-source splint. C++ is more complex to analyse statically, and tools are often very expensive. One of the best I have used is QAC++ from Programming Research. I am not aware of any free or open-source C++ analysers of any repute.
gcc does warn in that situation (at least with -Wall):
#include <stdio.h>

int main(int argc, char *argv[])
{
    int a[5];
    int i = 0;

    a[i] = ++i;
    printf("%d\n", a[0]);
    return 0;
}
Gives:
$ make
gcc -Wall main.c -o app
main.c: In function ‘main’:
main.c:8: warning: operation on ‘i’ may be undefined
Edit:
A quick read of the man page shows that -Wsequence-point will do it, if you don't want -Wall for some reason.
On the contrary, compilers are not required to issue any sort of diagnostic for undefined behavior:
§1.4.1:
The set of diagnosable rules consists of all syntactic and semantic rules in this International Standard except for those rules containing an explicit notation that “no diagnostic is required” or which are described as resulting in “undefined behavior.”
Emphasis mine. While I agree it might be nice, compilers have enough problems trying to be standards-compliant, let alone teaching the programmer how to program.
GCC warns as much as it can when you do something outside the norms of the language while still being syntactically correct, but beyond a certain point you must be informed enough yourself.
You can call GCC with the -Wall flag to see more of that.
If your compiler won't warn of this, you can try a Linter.
Splint is free, but only checks C http://www.splint.org/
Gimpel Lint supports C++ but costs US $389 - maybe your company can be persuaded to buy a copy? http://www.gimpel.com/