C++ compilers happily compile this code, with no warning:
int ival = 8;
const char *strval = "x";
const char *badresult = ival + strval;
Here we add a char* pointer value (strval) to an int value (ival) and store the result in a char* pointer (badresult). Of course, the content of the badresult will be total garbage and the app might crash on this line or later when it is trying to use the badresult elsewhere.
The problem is that it is very easy to make such mistakes in real life. The one I caught in my code looked like this:
message += port + "\n";
(where message is a string type handling the result with its operator += function; port is an int and \n is obviously a const char pointer).
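For reference, a minimal sketch of what that line was presumably meant to do, assuming message behaves like std::string and C++11 is available. The name append_port is hypothetical and only serves the illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not from the original code base: appends "<port>\n".
std::string append_port(std::string message, int port) {
    // message += port + "\n";        // compiles, but offsets the literal's address by `port`
    message += std::to_string(port);  // convert the number to text first
    message += "\n";
    return message;
}

// Small self-check of the intended result.
void check_append_port() {
    assert(append_port("port: ", 8080) == "port: 8080\n");
}
```

Converting the integer explicitly keeps both operands of the concatenation as strings, so no pointer arithmetic is involved.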
Is there any way to disable this kind of behavior and trigger an error at compile time?
I don't see any normal use case for adding char* to int and I would like a solution to prevent this kind of mistake in my large code base.
When using classes, we can create private operators and use the explicit keyword to disable unneeded conversions/casts, however now we are talking about basic types (char* and int).
One solution is to use clang, which has a flag that enables a warning for this.
However, I can't use clang, so I am looking for a solution that triggers a compiler error (some kind of operator overload, some trickery with defines to prevent such constructs, or any other idea).
Is there any way to disable this kind of behavior and trigger an error at compile time?
Not in general, because your code is very similar to the following, legitimate, code:
int ival = 3;
const char *strval = "abcd";
const char *goodresult = ival + strval;
Here goodresult is pointing to the last letter d of strval.
BTW, on Linux, getpid(2) is known to return a positive integer. So you could imagine:
int ival = (getpid()>=0)?3:1000;
const char *strval = "abcd";
const char *goodresult = ival + strval;
which is morally the same as the previous example (so we humans know that ival is always 3). But teaching the compiler that getpid() does not return a negative value is tricky in practice (the return type pid_t of getpid is some signed integer, and has to be signed to be usable by fork(2), which could give -1). And you could imagine more weird examples!
You want compile-time detection of buffer overflow (or more generally of undefined behavior), and in general that is equivalent to the halting problem (it is an unsolvable problem). So it is impossible in general.
Of course, one could claim that a clever compiler could warn for your particular case, but then there is the question of which cases are actually worth warning about.
You might try some static source program analysis tools, perhaps Clang static analyzer or Frama-C (with its recent Frama-C++ variant) - or some costly proprietary tools like Coverity and many others. These tools don't detect all errors statically and take much more time to execute than an optimizing compiler.
You could (for example) write your own GCC plugin to detect such mistakes (that means developing your own static source code analyzer). You'll spend months in writing it. Are you sure it is worth the effort?
However I can't use clang,
Why? You could ask permission to use the clang static analyzer (or some other one), during development (not for production). If your manager refuses that, it becomes a management problem, not a technical one.
I don't see any normal use case for adding char* to int
You need more imagination. Think of something like
puts("non-empty" + (isempty(x)?4:0));
Ok that is not very readable code, but it is legitimate. In the previous century, when memory was costly, some people used to code that way.
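To make the trick concrete, here is a small sketch of the same idea wrapped in a function. The name emptiness_label is made up for this illustration, and the literal is stored in a named pointer first:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper mirroring the puts() trick: both labels share one literal.
const char* emptiness_label(bool is_empty) {
    const char* text = "non-empty";
    // Skipping the 4 characters of "non-" leaves a pointer to "empty".
    return text + (is_empty ? 4 : 0);
}

// Small self-check of both branches.
void check_emptiness_label() {
    assert(std::strcmp(emptiness_label(true), "empty") == 0);
    assert(std::strcmp(emptiness_label(false), "non-empty") == 0);
}
```

The offset stays inside the literal in both cases, which is what makes this legitimate pointer arithmetic.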
Today you'll code perhaps
if (isempty(x))
puts("empty");
else
puts("non-empty");
and the cute thing is that a clever compiler could probably optimize the latter into the equivalent of the former (according to the as-if rule).
No way. It is valid syntax, and very useful in many cases.
Suppose you meant to write int b=a+10 but typed int b=a+00 by mistake: the compiler has no way of knowing that this was an error.
However, you can consider using C++ classes. Most C++ classes are well designed to prevent such obvious mistakes.
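As a rough sketch of that idea, assuming you are free to replace raw const char* with a thin wrapper: the CStr class and the can_add trait below are hypothetical names, not part of any library. Because CStr has no operator+ and no implicit conversion back to a pointer, adding an int to it does not compile:

```cpp
#include <type_traits>
#include <utility>

// Hypothetical wrapper: holds a C string but offers no arithmetic at all.
class CStr {
public:
    explicit CStr(const char* s) : s_(s) {}
    const char* c_str() const { return s_; }  // explicit way back to the raw pointer
private:
    const char* s_;
};

// Detection trait: true when the expression A + B is well-formed.
template <typename A, typename B, typename = void>
struct can_add : std::false_type {};

template <typename A, typename B>
struct can_add<A, B, decltype(void(std::declval<A>() + std::declval<B>()))>
    : std::true_type {};

static_assert(can_add<int, const char*>::value, "a raw pointer accepts int arithmetic");
static_assert(!can_add<int, CStr>::value, "the wrapper rejects int + CStr");
static_assert(!can_add<CStr, int>::value, "the wrapper rejects CStr + int");
```

The cost is that every call site needing a raw pointer has to ask for it through c_str(), which may or may not be acceptable in a large code base.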
In the first example in your question, really, compilers should issue a warning. Compilers can trivially see that the addition resolves to 8 + "x" and clang does indeed optimise it to a constant. I see the fact it doesn't warn about this as a compiler bug. Although compilers are not required to warn about this, clang goes through great efforts to provide useful diagnostics, and it would be an improvement to diagnose this as well.
In the second example, as Matteo Italia pointed out, clang does already provide a warning option for this, enabled by default: -Wstring-plus-int. You can turn specific warnings into errors by using -Werror=<warning-option>, so in this case -Werror=string-plus-int.
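If changing the build flags is inconvenient, the same promotion can be sketched inside a source file with clang's diagnostic pragma. The guard keeps other compilers from seeing it. The function tail_of_abcd is a hypothetical example that uses array indexing, the form clang's note suggests for silencing the warning:

```cpp
#include <cassert>
#include <cstddef>

#if defined(__clang__)
// Promote clang's string-plus-int warning to an error for the code that follows.
#pragma clang diagnostic error "-Wstring-plus-int"
#endif

// Hypothetical helper: returns the tail of "abcd" starting at `offset`.
// Array indexing states the intent explicitly, unlike "abcd" + offset.
const char* tail_of_abcd(std::size_t offset) {
    assert(offset <= 4);  // stay within the literal, including its terminator
    return &"abcd"[offset];
}
```

This only helps when at least one build of the code base goes through clang. Other compilers skip the pragma because of the guard.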
Related
A colleague of mine is working on C++ code that works with binary data arrays a lot. In certain places, he has code like
char *bytes = ...
T *p = (T*) bytes;
T v = p[i]; // UB
Here, T can be sometimes short or int (assume 16 and 32 bit respectively).
Now, unlike my colleague, I belong to the "no UB if at all possible" camp, while he is more along the lines of "if it works, it's OK". I am having a hard time trying to convince him otherwise.
Given that:
bytes really come from somewhere outside this compilation unit, being read from some binary file.
It's safe to assume that array really contains integers in the native endianness.
In practice, given mainstream C++ compilers like MSVC 2017 and gcc 4.8, and Intel x64 hardware, is such a thing really safe? I know it wouldn't be if T was, say, float (got bitten by it in the past).
char* can alias other entities without breaking strict aliasing rule.
Your code would be UB only if what p + i points to wasn't originally a T.
char* bytes = (char*) floats;
int *p = (int*) bytes;
int v = p[i]; // UB
but
char* bytes = (char*) floats;
float *p = (float*) bytes;
float v = p[i]; // OK
If the origin of bytes is "unknown", the compiler cannot exploit the UB for optimization; it has to assume we are in the valid case and generate code accordingly.
But how do you guarantee that it is unknown? Even outside the TU, something like Link-Time Optimization might expose the hidden information.
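One way to avoid depending on what the compiler can or cannot see is to copy the bytes into a real T instead of dereferencing a cast pointer. The following is a sketch only; read_element is a hypothetical name, and it assumes the buffer holds values in native endianness, as stated in the question:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: reads the i-th T from a raw byte buffer by copying its bytes.
template <typename T>
T read_element(const char* bytes, std::size_t i) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    T value;
    std::memcpy(&value, bytes + i * sizeof(T), sizeof(T));
    return value;
}

// Small self-check: round-trip a few ints through a char buffer.
void check_read_element() {
    const int source[3] = {10, 20, 30};
    char bytes[sizeof source];
    std::memcpy(bytes, source, sizeof source);
    assert(read_element<int>(bytes, 1) == 20);
}
```

Mainstream optimizing compilers typically turn such a fixed-size memcpy into a plain load, so the defined version usually costs nothing extra.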
Type-punned pointers are safe if one uses a construct which is recognized by the particular compiler one is using [i.e. any compiler that is configured to support quality semantics, if one is using straightforward constructs; neither gcc nor clang qualifies when optimizations are enabled, however, unless one uses -fno-strict-aliasing]. The authors of C89 were certainly aware that many applications required the use of various type-punning constructs beyond those mandated by the Standard, but thought the question of which constructs to recognize was best left as a quality-of-implementation issue. Given something like:
struct s1 { int objectClass; };
struct s2 { int objectClass; double x,y; };
struct s3 { int objectClass; char someData[32]; };
int getObjectClass(void *p) { return ((struct s1*)p)->objectClass; }
I think the authors of the Standard would have intended that the function be usable to read field objectClass of any of those structures [that is pretty much the whole purpose of the Common Initial Sequence rule] but there would be many ways by which compilers might achieve that. Some might recognize function calls as barriers to type-based aliasing analysis, while others might treat pointer casts in such a fashion. Most programs that use type punning would do several things that compilers might interpret as indications to be cautious with optimizations, so there was no particular need for a compiler to recognize any particular one of them. Further, since the authors of the Standard made no effort to forbid implementations that are "conforming" but of such low quality as to be useless, there was no need to forbid compilers that somehow managed not to see any of the indications that storage might be used in interesting ways.
Unfortunately, for whatever reason, there hasn't been any effort by compiler vendors to find easy ways of recognizing common type-punning situations without needlessly impairing optimizations. While handling most cases would be fairly easy if compiler writers hadn't adopted designs that filter out the clearest and most useful evidence before applying optimization logic, both the designs of gcc and clang--and the mentalities of their maintainers--have evolved to oppose such a concept.
As far as I'm concerned, there is no reason why any "quality" implementation should have any trouble recognizing type punning in situations where all operations upon a byte of storage using a pointer converted to a pointer-to-PODS, or anything derived from that pointer, occur before the first time any of the following occurs:
That byte is accessed in conflicting fashion via means not derived from that pointer.
A pointer or reference is formed which will be used sometime in future to access that byte in conflicting fashion, or derive another that will.
Execution enters a function which will do one of the above before it exits.
Execution reaches the start of a bona fide loop [not, e.g. a do{...}while(0);] which will do one of the above before it exits.
A decently-designed compiler should have no problem recognizing those cases while still performing the vast majority of useful optimizations. Further, recognizing aliasing in such cases would be simpler and easier than trying to recognize it only in the cases mandated by the Standard. For those reasons, compilers that can't handle at least the above cases should be viewed as falling in the category of implementations that are of such low quality that the authors of the Standard didn't particularly want to allow, but saw no reason to forbid. Unfortunately, neither gcc nor clang offer any options to behave reasonably except by requiring that they disable type-based aliasing altogether. Unfortunately, the authors of gcc and clang would rather deride as "broken" any code needing features beyond what the Standard requires, than attempt a useful blend of optimization and semantics.
Incidentally, neither gcc nor clang should be relied upon to properly handle any situation in which storage that has been used as one type is later used as another, even when the Standard would require them to do so. Given something like:
union { struct s1 v1; struct s2 v2; } unionArr[100];
void test(int i)
{
int test = unionArr[i].v2.objectClass;
unionArr[i].v1.objectClass = test;
}
Both clang and gcc will treat it as a no-op even if it is executed between code which writes unionArr[i].v2.objectClass and code which happens to read member v1.objectClass of the same union object, thus causing them to ignore the possibility that the write to unionArr[i].v2.objectClass might affect v1.objectClass.
I was writing some code recently and found myself doing a lot of c-style casts, such as the following:
Client* client = (Client*)GetWindowLong(hWnd, GWL_USERDATA);
I thought to myself; why do we actually need to do these?
I can somewhat understand why this would be needed in circumstances where there is a lot of code where the compiler may not know what types can be converted to what, such as when using reflection.
But when casting from a long to a pointer where both types are of the same size, I don't understand why the compiler would not allow us to do this?
when casting from a long to a pointer where both types are of the same size, I don't understand why the compiler would not allow us to do this?
Ironically, this is the place where the compiler's intervention is most important!
In the vast majority of situations, converting between long and a pointer is a programming error which you don't want to go unnoticed, even if your platform allows it.
For example, when you write this
unsigned long *ptr = getLongPtr();
unsigned long val = ptr; // Probably an error
it is almost a certainty that you are missing an asterisk in front of ptr:
unsigned long val = *ptr; // This is what it should be
Finding errors like this without the compiler's help is very hard, hence the compiler wants you to tell it that you know what you are doing in conversions like that.
Moreover, something that is fine on one platform may not work on other platforms. For example, an integral type and a pointer may have the same size on 32-bit platforms, but have different sizes on a 64-bit platform. If you want to maintain any degree of portability, the compiler should warn you of the conversion even on the 32-bit platform, where the sizes are identical. The compiler warning will help you identify the error and switch to the portable pointer-as-integer type intptr_t.
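A minimal sketch of the intptr_t approach, assuming the platform provides std::intptr_t (it is optional in the standard but available on mainstream targets). The Client struct here is a stand-in for the type in the question, and the two helper names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the Client type in the question.
struct Client { int id; };

// Store a pointer in an integer type that is wide enough to hold it.
std::intptr_t to_user_data(Client* client) {
    return reinterpret_cast<std::intptr_t>(client);
}

// Recover the pointer; the explicit cast documents that this is intentional.
Client* from_user_data(std::intptr_t data) {
    return reinterpret_cast<Client*>(data);
}

// Small self-check: the pointer survives the round trip.
void check_round_trip() {
    Client c{7};
    assert(from_user_data(to_user_data(&c)) == &c);
    assert(from_user_data(to_user_data(&c))->id == 7);
}
```

Unlike long, std::intptr_t follows the pointer width on both 32-bit and 64-bit builds, so the round trip does not truncate.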
I think the idea is that we want the compiler to tell us when we are doing something dodgy and/or potentially unintended. That way we don't do it by accident. So the compiler complains unless we explicitly tell it that this is what we really, really want. We do that by using a cast.
Edited to add:
It might be better to ask why we are allowed to cast between types. Originally C was created as a strongly typed language. Although it allows promotion/conversion between related object types (like between ints and floats) it is supposed to prevent access and assignment to the wrong type as a language feature, a safety measure. However occasionally this is useful so casting was put in the language to allow us to circumvent the type rules on those occasions when we need to.
I have a large program that has a large number of fragments of the form
float t = amplitudes->read(current_element);
if ( *((uint32_t * ) &t) == 0xsomereservedvalue)
do_something
else
do_something_else
This compiles fine with gcc 3.4.x, with no warnings. Compiled with -Wstrict-aliasing=2, still no warnings. I recently tried compiling with gcc 4.4 and got a vast number of warnings about type-punned references. Can someone tell me whether there is any reasonable situation under which this sort of code can fail? As far as I can see, type punning is only a potential issue if optimizations arrange things such that something might be written back from a register after another line of code reads it, which, since we're on the return of a function here, simply can't happen. Am I missing something here, or is gcc being somewhat braindead?
Of course it warns. I just checked gcc 4.8.2, and it warns too. Imagine that the float size on the target machine is 16 bits and you are trying to read 32 bits, past the bounds of the data. It is UB.
By the way, I disagree with using reinterpret_cast in this case. The only useful form is *(reinterpret_cast<char *>(&t)); even *(reinterpret_cast<unsigned *>(&t)) will still break the aliasing rules, and you'll get the same warning. That is because, in C++, the compiler only knows that char is the smallest type, so any type may be accessed through it.
GCC has an attribute to say that a given type may alias. It is not so easy to use either.
unsigned * __attribute__ ((__may_alias__)) pu = (unsigned *) &t;
if (*pu == 0xsomereservedvalue)
do_something ();
else
do_something_else ();
But here things at least make sense. We are asking for an aliasable pointer. Cost: the code is no longer portable.
So, all things considered, I recommend just supplying the -fno-strict-aliasing option if you are really sure that you know what you are doing.
The “dereferencing type-punned pointer will break strict aliasing” warning tells you that the compiler may be making optimizations that may break your code.
The reason why this might happen is that the compiler is allowed (according to the language spec) to assume that you will only access an array through its own type or a char type.
To guarantee correctness, one must either use memcpy when type punning or build with strict aliasing disabled. (Unions are a bit unclear.)
Take a look here for another more detailed posting on strict aliasing.
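The memcpy route mentioned above can be sketched as follows. The names float_bits and is_reserved are hypothetical, and the sketch assumes a 32-bit float. The self-check additionally assumes IEEE 754 representation:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: returns the object representation of a float as 32 bits.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy bytes instead of aliasing
    return bits;
}

// Hypothetical sentinel test standing in for the comparison in the question.
bool is_reserved(float value, std::uint32_t reserved_bits) {
    return float_bits(value) == reserved_bits;
}

// Small self-check, valid only on IEEE 754 platforms.
void check_float_bits() {
    static_assert(std::numeric_limits<float>::is_iec559, "check assumes IEEE 754");
    assert(float_bits(1.0f) == 0x3F800000u);
    assert(is_reserved(0.0f, 0u));
}
```

Since the copy has a fixed, small size, optimizing compilers generally reduce it to a register move, so the well-defined version should not be slower than the cast.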
Why does C++ (and probably C as well) allow me to assign an int to a char without at least giving me a warning?
Is it okay to directly assign the value, like in
int i = 12;
char c = i;
i.e. do an implicit conversion, or shall I use a static_cast<>?
EDIT
Btw, I'm using gcc.
It was allowed in C before an explicit cast syntax was invented.
Then it remained a common practice, so C++ inherited it in order to not break a huge amount of code.
Actually, most compilers issue a warning. If yours doesn't, try changing its settings.
C as originally designed wasn't really a strongly-typed language. The general philosophy was that you the programmer must know what you are doing, and the compiler is just there to help you do it. If you asked to convert between float, int, and unsigned char six or seven times in a single expression, well that must be what you wanted.
C++ sort of picked that up just so that all the existing C code wouldn't be too much of a bear to port. They are slowly trying to make it stronger with each revision though. Today just about any C++ compiler will give you a warning for that if you turn the warning levels up (which I highly recommend you do).
Otherwise, perhaps you should look into true strongly-typed languages, like Java and Ada. The equivalent Ada code would not compile without an explicit conversion.
Short answer: It's okay (by the c++ standard) in your example.
Slightly longer answer: It's not okay if char is signed and you are trying to assign it a value outside its range. It's okay if it is unsigned, though (whether or not char is signed depends on your environment); then you'll get modulo arithmetic. Compilers usually have a switch to warn you because of the first case, but as long as you stay within bounds it's perfectly fine (however, an explicit cast to make your intentions clear does not hurt).
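If you would rather catch out-of-range values than rely on staying within bounds, a checked conversion can be sketched like this. The name narrow_to_char is hypothetical, and throwing std::out_of_range is just one possible policy:

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>

// Hypothetical helper: converts int to char, refusing values char cannot hold.
char narrow_to_char(int value) {
    if (value < std::numeric_limits<char>::min() ||
        value > std::numeric_limits<char>::max()) {
        throw std::out_of_range("value does not fit in char");
    }
    return static_cast<char>(value);  // explicit cast makes the intent visible
}

// Small self-check: 12 fits, 1000 does not (assuming an 8-bit char).
void check_narrow_to_char() {
    assert(narrow_to_char(12) == 12);
    bool rejected = false;
    try {
        narrow_to_char(1000);
    } catch (const std::out_of_range&) {
        rejected = true;
    }
    assert(rejected);
}
```

The range check uses numeric_limits<char>, so it adapts to whether char is signed or unsigned on the platform.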
char is narrower than int, just as short is, so there should be a warning about possible loss of information. Maybe you have warnings switched off; try adjusting your compiler/IDE settings.
Can't a compiler warn (even better if it throws errors) when it notices a statement with undefined/unspecified/implementation-defined behaviour?
To flag a statement as an error, the standard would probably have to say so, but the compiler could at least warn the coder. Are there any technical difficulties in implementing such an option? Or is it simply impossible?
The reason I ask is that, in statements like a[i] = ++i;, surely the compiler knows that the code is both referencing a variable and modifying it in the same statement, before a sequence point is reached.
It all boils down to
Quality of Implementation: the more accurate and useful the warnings are, the better it is. A compiler that always printed: "This program may or may not invoke undefined behavior" for every program, and then compiled it, is pretty useless, but is standards-compliant. Thankfully, no one writes compilers such as these :-).
Ease of determination: a compiler may not be easily able to determine undefined behavior, unspecified behavior, or implementation-defined behavior. Let's say you have a call stack that's 5 levels deep, with a const char * argument being passed from the top-level, to the last function in the chain, and the last function calls printf() with that const char * as the first argument. Do you want the compiler to check that const char * to make sure it is correct? (Assuming that the first function uses a literal string for that value.) How about when the const char * is read from a file, but you know that the file will always contain valid format specifier for the values being printed?
Success rate: A compiler may be able to detect many constructs that may or may not be undefined, unspecified, etc.; but with a very low "success rate". In that case, the user doesn't want to see a lot of "may be undefined" messages—too many spurious warning messages may hide real warning messages, or prompt a user to compile at "low-warning" setting. That is bad.
For your particular example, gcc gives a warning about "may be undefined". It even warns for printf() format mismatch.
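The printf() format checking can also be extended to your own wrappers on GCC-compatible compilers through the format attribute. In the sketch below, format_message and the PRINTF_LIKE macro are hypothetical names, and the 256-byte buffer is an arbitrary choice for the illustration:

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

#if defined(__GNUC__)
// Ask GCC-compatible compilers to check calls like they check printf().
#define PRINTF_LIKE(fmt_index, first_arg) \
    __attribute__((format(printf, fmt_index, first_arg)))
#else
#define PRINTF_LIKE(fmt_index, first_arg)
#endif

// Hypothetical wrapper: formats into a std::string (output truncated at 255 chars).
std::string format_message(const char* fmt, ...) PRINTF_LIKE(1, 2);

std::string format_message(const char* fmt, ...) {
    char buffer[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buffer, sizeof buffer, fmt, args);
    va_end(args);
    return std::string(buffer);
}

// Small self-check of a matching format/argument pair.
void check_format_message() {
    assert(format_message("%d-%s", 7, "x") == "7-x");
}
```

With the attribute in place, a call whose arguments do not match a literal format string should be diagnosed under -Wall. A format string read from a file still cannot be checked, which is the limitation described above.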
But if your hope is for a compiler that issues a diagnostic for all undefined/unspecified cases, it is not clear if that should/can work.
Let's say you have the following:
#include <stdio.h>
void add_to(int *a, int *b)
{
*a = ++*b;
}
int main(void)
{
int i = 42;
add_to(&i, &i); /* bad */
printf("%d\n", i);
return 0;
}
Should the compiler warn you about *a = ++*b; line?
As gf says in the comments, a compiler cannot check across translation units for undefined behavior. Classic example is declaring a variable as a pointer in one file, and defining it as an array in another, see comp.lang.c FAQ 6.1.
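Whatever the compiler decides to report, the add_to() example can be rewritten so that its result does not hinge on how a single expression is sequenced. This is only a sketch of one possible rewrite, assuming the intent was "increment *b, then store the new value in *a":

```cpp
#include <cassert>

// Variant of add_to() with the increment and the store as separate statements.
void add_to(int* a, int* b) {
    ++*b;     // first update *b
    *a = *b;  // then copy the new value, which is harmless even when a == b
}

// Small self-check covering both the aliased and the non-aliased call.
void check_add_to() {
    int i = 42;
    add_to(&i, &i);
    assert(i == 43);

    int x = 0, y = 5;
    add_to(&x, &y);
    assert(x == 6 && y == 6);
}
```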
Different compilers trap different conditions; most compilers have warning level options, GCC specifically has many, but -Wall -Werror will switch on most of the useful ones, and coerce them to errors. Use /W4 /WX for similar protection in VC++.
In GCC you could use -ansi -pedantic, but pedantic is what it says, and will throw up many irrelevant issues and make it hard to use much third party code.
Either way, because compilers catch different errors, or produce different messages for the same error, it is therefore useful to use multiple compilers, not necessarily for deployment, but as a poor-man's static analysis. Another approach for C code is to attempt to compile it as C++; the stronger type checking of C++ generally results in better C code; but be sure that if you want C compilation to work, don't use the C++ compilation exclusively; you are likely to introduce C++ specific features. Again this need not be deployed as C++, but just used as an additional check.
Finally, compilers are generally built with a balance of performance and error checking; to check exhaustively would take time that many developers would not accept. For this reason static analysers exist, for C there is the traditional lint, and the open-source splint. C++ is more complex to statically analyse, and tools are often very expensive. One of the best I have used is QAC++ from Programming Research. I am not aware of any free or open source C++ analysers of any repute.
gcc does warn in that situation (at least with -Wall):
#include <stdio.h>
int main(int argc, char *argv[])
{
int a[5];
int i = 0;
a[i] = ++i;
printf("%d\n", a[0]);
return 0;
}
Gives:
$ make
gcc -Wall main.c -o app
main.c: In function ‘main’:
main.c:8: warning: operation on ‘i’ may be undefined
Edit:
A quick read of the man page shows that -Wsequence-point will do it, if you don't want -Wall for some reason.
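Once the warning fires, the fix is to spell out which of the two possible readings was meant. A sketch of both, with hypothetical function names:

```cpp
#include <cassert>

// Reading 1: store the incremented value at the old index.
void store_at_old_index(int* a, int& i) {
    const int old_index = i;
    ++i;
    a[old_index] = i;
}

// Reading 2: increment first, then store at the new index.
void store_at_new_index(int* a, int& i) {
    ++i;
    a[i] = i;
}

// Small self-check of both readings on one array.
void check_store_variants() {
    int a[5] = {0, 0, 0, 0, 0};
    int i = 0;
    store_at_old_index(a, i);  // a[0] becomes 1, i becomes 1
    assert(a[0] == 1 && i == 1);
    store_at_new_index(a, i);  // i becomes 2, a[2] becomes 2
    assert(a[2] == 2 && i == 2);
}
```

Neither version triggers the "may be undefined" warning, because each statement modifies i at most once and never reads it in the same expression that modifies it.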
Contrarily, compilers are not required to make any sort of diagnosis for undefined behavior:
§1.4.1:
The set of diagnosable rules consists of all syntactic and semantic rules in this International Standard except for those rules containing an explicit notation that “no diagnostic is required” or which are described as resulting in “undefined behavior.”
Emphasis mine. While I agree it may be nice, compilers have enough problems trying to be standards compliant, let alone teaching the programmer how to program.
GCC warns as much as it can when you do something outside the norms of the language that is still syntactically correct, but beyond a certain point one must be informed enough.
You can call GCC with the -Wall flag to see more of that.
If your compiler won't warn of this, you can try a Linter.
Splint is free, but only checks C http://www.splint.org/
Gimpel Lint supports C++ but costs US $389 - maybe your company can be persuaded to buy a copy? http://www.gimpel.com/