Is there any tool for C++ which will check for common unspecified behavior?

Often one makes assumptions about a particular platform one is coding on, for example that signed integers use two's complement storage, or that (0xFFFFFFFF == -1), or things of that nature.
Does a tool exist which can check a codebase for the most common violations of these kinds of things (for those of us who want portable code but don't have strange non-two's-complement machines)?
(My examples above are specific to signed integers, but I'm interested in other errors, such as alignment or byte order, as well.)

There are various levels of compiler warnings that you may wish to have switched on, and you can treat warnings as errors.
If there are other assumptions you know you make at various points in the code you can assert them. If you can do that with static asserts you will get failure at compile time.
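For example, a minimal sketch of such compile-time checks, assuming C++11 static_assert is available (with an older compiler you'd use a homegrown STATIC_ASSERT macro instead); the particular assumptions asserted here are only examples:
#include <climits>

// Fail the build if the platform doesn't match the assumptions the code makes.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(int) == 4, "this code assumes 32-bit int");
static_assert((-1 & 3) == 3, "this code assumes two's-complement integers");
static_assert(static_cast<int>(0xFFFFFFFFu) == -1,
              "this code assumes unsigned-to-signed conversion wraps");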

I know that Clang is very actively developing a static analyzer (as a library).
The goal is to catch errors at analysis time, though the exact extent of the errors caught is not yet clear to me. The library is called "Checker" and T. Kremenek is responsible for it; you can ask about it on the clang-dev mailing list.
I don't have the impression that there is any kind of reference for the checks being performed, and I don't think it's mature enough yet for a production tool (given the rate of change going on), but it may be worth a look.

Maybe a static code analysis tool? I used one a few years ago and it reported errors like this. It was not perfect and still limited but maybe the tools are better now?
update:
Maybe one of these:
What open source C++ static analysis tools are available?
update2:
I tried FlexeLint on your example (you can try it online using the Do-It-Yourself Example at http://www.gimpel-online.com/OnlineTesting.html) and it complains about it, though perhaps not in the way you are looking for:
5 int i = -1;
6 if (i == 0xffffffff)
diy64.cpp 6 Warning 650: Constant '4294967295' out of range for operator '=='
diy64.cpp 6 Info 737: Loss of sign in promotion from int to unsigned int
diy64.cpp 6 Info 774: Boolean within 'if' always evaluates to False [Reference: file diy64.cpp: lines 5, 6]

Very interesting question. I think it would be quite a challenge to write a tool to flag these usefully, because so much depends on the programmer's intent/assumptions.
For example, it would be easy to recognize a construct like:
x &= -2; // round down to an even number
as being dependent on two's-complement representation, but what if the mask is a variable instead of a constant "-2"?
Yes, you could take it a step further and warn of any use of a signed int with bitwise &, any assignment of a negative constant to an unsigned int, and any assignment of a signed int to an unsigned int, etc., but I think that would lead to an awful lot of false positives.
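For concreteness, a hedged sketch of the kinds of constructs such warnings would fire on (these are illustrative cases only, not any real tool's rule set):
void flag_candidates(int mask, unsigned int u)
{
    int x = 10;
    x &= mask;            // bitwise & on a signed value: representation-dependent if mask is negative
    unsigned int y = -2;  // negative constant assigned to unsigned: wraps modulo 2^N
    u = x;                // signed assigned to unsigned: silently changes value if x is negative
    (void)x;
    (void)y;
}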
[ sorry, not really an answer, but too long for a comment ]

Which to use? int32_t vs uint32_t [duplicate]

When is it appropriate to use an unsigned variable over a signed one? What about in a for loop?
I hear a lot of opinions about this and I wanted to see if there was anything resembling a consensus.
for (unsigned int i = 0; i < someThing.length(); i++) {
    SomeThing var = someThing.at(i);
    // You get the idea.
}
I know Java doesn't have unsigned values, and that must have been a conscious decision on Sun Microsystems' part.
I was glad to find a good conversation on this subject, as I hadn't really given it much thought before.
In summary, signed is a good general choice - even when you're dead sure all the numbers are positive - if you're going to do arithmetic on the variable (like in a typical for loop case).
unsigned starts to make more sense when:
You're going to do bitwise things like masks, or
You're desperate to take advantage of the sign bit for that extra positive range.
Personally, I like signed because I don't trust myself to stay consistent and avoid mixing the two types (like the article warns against).
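To illustrate the kind of mixing that goes wrong, a hedged sketch (assuming a standard container whose size() returns an unsigned type):
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;  // deliberately empty
    // v.size() is unsigned, so v.size() - 1 wraps around to a huge value
    // instead of becoming -1; a loop written as "i < v.size() - 1" would
    // therefore run (and index out of bounds) rather than not run at all.
    std::cout << v.size() - 1 << "\n";  // prints e.g. 18446744073709551615
    return 0;
}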
In your example above, when 'i' will always be positive and a higher range would be beneficial, unsigned would be useful. Like if you're using #define constants, such as:
#define BIT1 ((unsigned int)1)
#define BIT32 ((unsigned int)reallybignumber)
Especially when these values will never change.
However, if you're doing an accounting program where the people are irresponsible with their money and are constantly in the red, you will most definitely want to use 'signed'.
I do agree with saint though that a good rule of thumb is to use signed, which C actually defaults to, so you're covered.
I would think that if your business case dictates that a negative number is invalid, you would want to have an error shown or thrown.
With that in mind, I only just recently found out about unsigned integers while working on a project processing data in a binary file and storing the data into a database. I was purposely "corrupting" the binary data, and ended up getting negative values instead of an expected error. I found that even though the value converted, the value was not valid for my business case.
My program did not error, and I ended up getting wrong data into the database. It would have been better if I had used uint and had the program fail.
C and C++ compilers will generate a warning when you compare signed and unsigned types; in your example code, you couldn't make your loop variable unsigned and have the compiler generate code without warnings (assuming said warnings were turned on).
Naturally, you're compiling with warnings turned all the way up, right?
And, have you considered compiling with "treat warnings as errors" to take it that one step further?
The downside of using signed numbers is that there's a temptation to overload them so that, for example, the values 0 to n are the menu selection and -1 means nothing is selected, rather than creating a class with two variables: one to indicate whether something is selected and another to store what that selection is. Before you know it, you're testing for negative one all over the place, and the compiler is complaining because you want to compare the menu selection against the number of menu selections you have - which is dangerous because they're different types. So don't do that.
size_t is often a good choice for this, or size_type if you're using an STL class.
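As a hedged sketch of that suggestion applied to a loop like the one in the question (using std::string as a stand-in for the container):
#include <string>

void visit_chars(const std::string& s)
{
    // size_type matches what length() returns, so there is no
    // signed/unsigned comparison warning and no risk of overflow
    // on containers larger than INT_MAX.
    for (std::string::size_type i = 0; i < s.length(); ++i) {
        char c = s.at(i);
        (void)c;  // ... use c ...
    }
}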

Advice on unsigned int (Gangnam Style edition)

The video "Gangnam Style" (I'm sure you've heard it) just exceeded 2 billion views on youtube. In fact, Google says that they never expected a video to be greater than a 32-bit integer... which alludes to the fact that Google used int instead of unsigned for their view counter. I think they had to re-write their code a bit to accommodate larger views.
Checking their style guide: https://google-styleguide.googlecode.com/svn/trunk/cppguide.html#Integer_Types
...they advise "don't use an unsigned integer type," and give one good reason why: unsigned could be buggy.
It's a good reason, but could be guarded against. My question is: is it bad coding practice in general to use unsigned int?
The Google rule is widely accepted in professional circles. The problem is that the unsigned integral types are sort of broken, and have unexpected and unnatural behavior when used for numeric values; they don't work well as a cardinal type. For example, an index into an array may never be negative, but it makes perfect sense to write abs(i1 - i2) to find the distance between two indices, which won't work if i1 and i2 have unsigned types.
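A hedged illustration of that failure mode (the index values are arbitrary):
#include <iostream>

int main()
{
    unsigned int i1 = 2, i2 = 5;
    // With unsigned indices the difference wraps instead of going negative,
    // so an abs(i1 - i2)-style distance silently becomes a huge number.
    unsigned int distance = i1 - i2;  // 4294967293 with 32-bit unsigned int, not 3
    std::cout << distance << "\n";
    return 0;
}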
As a general rule, this particular rule in the Google style guidelines corresponds more or less to what the designers of the language intended. Any time you see something other than int, you can assume a special reason for it. If it is because of the range, it will be long or long long, or even int_least64_t. Using unsigned types is generally a signal that you're dealing with bits, rather than the numeric value of the variable, or (at least in the case of unsigned char) that you're dealing with raw memory.
With regards to the "self-documentation" of using an unsigned: this doesn't hold up, since there are almost always a lot of values that the variable cannot (or should not) take, including many positive ones. C++ doesn't have sub-range types, and the way unsigned is defined means that it cannot really be used as one either.
This guideline is extremely misleading. Blindly using int instead of unsigned int won't solve anything. That simply shifts the problems somewhere else. You absolutely must be aware of integer overflow when doing arithmetic on fixed precision integers. If your code is written in a way that it does not handle integer overflow gracefully for some given inputs, then your code is broken regardless of whether you use signed or unsigned ints. With unsigned ints you must be aware of integer underflow as well, and with doubles and floats you must be aware of many additional issues with floating point arithmetic.
Just look at this article about a bug in the standard Java binary search algorithm, published by none other than Google, for why you must be aware of integer overflow. In fact, that very article shows C++ code casting to unsigned int in order to guarantee correct behavior. The article also starts out by presenting the bug in Java, which, guess what, doesn't have unsigned int; they still ran into a bug with integer overflow.
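For context, a hedged sketch of the midpoint computation that article discusses (the surrounding binary search is omitted, and the function names here are made up):
#include <climits>

// The classic buggy form: low + high can exceed INT_MAX and overflow.
int midpoint_buggy(int low, int high)
{
    return (low + high) / 2;
}

// Overflow-safe alternatives along the lines the article describes.
int midpoint_safe(int low, int high)
{
    return low + (high - low) / 2;  // keep the intermediate sum small
}

int midpoint_unsigned(int low, int high)
{
    // Do the sum in unsigned arithmetic, where wraparound is well defined
    // and, for non-negative low/high, cannot actually occur.
    return (int)(((unsigned int)low + (unsigned int)high) >> 1);
}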
Use the right type for the operations which you will perform. float wouldn't make sense for a counter. Nor does signed int. The normal operations on the counter are print and +=1.
Even if you had some unusual operations, such as printing the difference in view counts, you wouldn't necessarily have a problem. Sure, other answers mention the incorrect abs(i2 - i1), but it's not unreasonable to expect programmers to use the correct max(i2, i1) - min(i2, i1) instead, which for signed int does still have range issues (for unsigned it does not). There is no uniform solution here; programmers should understand the properties of the types they're working with.
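A hedged sketch of that max/min form, which stays exact for unsigned counters:
#include <algorithm>
#include <iostream>

int main()
{
    unsigned int views_a = 2, views_b = 5;
    // Subtracting the smaller from the larger never goes negative, and with
    // unsigned counters the result is exact across the type's full range.
    unsigned int diff = std::max(views_a, views_b) - std::min(views_a, views_b);
    std::cout << diff << "\n";  // 3
    return 0;
}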
Google states that: "Some people, including some textbook authors, recommend using unsigned types to represent numbers that are never negative. This is intended as a form of self-documentation."
I personally use unsigned ints as index parameters.
int foo(unsigned int index, int* myArray){
    return myArray[index];
}
Google suggests: "Document that a variable is non-negative using assertions. Don't use an unsigned type."
int foo(int index, int* myArray){
    assert(index >= 0);
    return myArray[index];
}
Pro for Google: if a negative number is passed in, my code will hopefully hit an out-of-bounds error, while Google's code is guaranteed to assert (in a debug build).
Pro for me: My code can support a greater size of myArray.
I think the actual deciding factor comes down to: how clean is your code? If you clean up all warnings, it will be clear when the compiler warns you that you're trying to assign a signed variable to an unsigned one. If your code already has a bunch of warnings, the compiler's warning is going to be lost on you.
A final note here: Google says, "Sometimes gcc will notice this bug and warn you, but often it will not." I haven't found that to be the case in Visual Studio; comparisons against negative numbers and assignments from signed to unsigned always produce warnings. But if you use gcc, you might need to take care.
Your specific question is "Is it bad practice to use unsigned?", to which the only correct answer can be no. It is not bad practice.
There are many style guides, each with a different focus, and while in some cases an organisation, given their typical toolchain and deployment platform, may choose not to use unsigned for their products, other toolchains and platforms almost demand its use.
Google seem to get a lot of deference because they have a good business model (and probably employ some smart people like everyone else).
CERT, IIRC, recommend unsigned for buffer indexes, because if you do overflow, at least you'll still be within your own buffer; there's some intrinsic security in that.
What do the language and standard library designers say? (Probably the best representation of accepted wisdom.) strlen returns a size_t, which is unsigned (only its width is platform dependent); other answers suggest this is an anachronism because shiny new computers have wide architectures, but this misses the point that C and C++ are general programming languages and should scale well on big and small platforms.
Bottom line is that this is one of many religious questions; certainly not settled, and in these cases, I normally go with my religion for green field developments, and go with the existing convention of the codebase for existing work. Consistency matters.

Why worry about 'undefined behavior' in >> of signed type?

My question is related to this one and will contain a few questions.
For me, the most obvious solution to the above problem (meaning one I would use in my own code) is just this:
uint8_t x = some value;
x = (int8_t)x >> 7;
Yes, yes, I hear you all... undefined behavior, and this is why I haven't posted this as an actual 'solution'.
I have a feeling (maybe it is only my sick mind) that the term 'undefined behavior' is overused on SO just to justify downvoting someone if the question is tagged C or C++.
So, let's (for a while) put aside the C/C++ standards and think about everyday programming, real compiler implementations, and the code they generate for contemporary hardware.
Taking into account the following:
As far as I remember all the hardware I had encountered had distinct instructions for arithmetic and logical shift.
All compilers that I know translate >> into arithmetic shift for signed types and logical shift for unsigned types.
I cannot recall any compiler ever emitting div-like low level instruction when >> was used in c/c++ code (and we are not talking about operator overloading here).
All the hardware I know of uses two's complement (U2).
So... is there anything (any contemporary compiler or hardware) that behaves differently than described above? Put simply, should I ever be worried about a right shift of a signed value not being translated into an arithmetic shift?
My 'solution' compiles to just one low level instruction on many platforms while others require multiple low level instructions. What would you use in your code?
Truth please ;-)
Why worry about 'undefined behavior' in >> of signed type?
Because it doesn't really matter how well defined any particular undefined behaviour is now; the point is that it may break at any point in the future. You're relying on a side-effect that may be optimized (or un-optimized) away at any point for any reason or no reason.
Also, I don't want to have to ask somebody with detailed knowledge of many different compiler's implementations before I use something I shouldn't use in the first place, so I skip it.
Yes, there are compilers which behave differently from what you assume.
In particular, optimization phases within compilers. These take advantage of the known possible values of variables, and will derive those possible values from the absence of UB. A pointer must be non-null if it's been dereferenced, an integer must be non-zero if it's been used as a divisor, and a value that has been left-shifted must be non-negative (left-shifting a negative value is undefined, while right-shifting one is 'only' implementation-defined, so its result can still vary between compilers).
And that reasoning works backwards in time:
if (x < 0) {
    printf("This is dead code\n");
}
x << 3;
What it really comes down to is, are you willing to take the risk?
"The standard doesn't guarantee yada yada" is nice and all, but let's be honest now, the risk isn't big. If you're going to run your code on some crazy platform, you generally know in advance. And if it takes you by surprise, well, that's the risk you took.
Also, the workaround is horrible. If you're not going to need it, it's just polluting your codebase with pointless "function calls instead of right shifts" that will be harder to maintain (and thus carry a cost). And you'll never be able to "paste and forget" code from other places into the project - you'd always have to check the code for the possibility of right-shifting negative signed integers.
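For reference, a hedged sketch of the kind of "function call instead of a raw shift" workaround being discussed, assuming the original snippet's goal is to broadcast the top bit of a uint8_t into all eight bits (the helper name is made up):
#include <cstdint>
#include <iostream>

// Spread the most significant bit of x into every bit using only unsigned
// arithmetic, so the result is identical on every conforming implementation
// (no reliance on how signed >> happens to behave).
inline uint8_t broadcast_msb(uint8_t x)
{
    return (uint8_t)(0u - (x >> 7));  // 0x00 if the MSB is clear, 0xFF if set
}

int main()
{
    std::cout << std::hex << (int)broadcast_msb(0x93) << "\n";  // ff
    std::cout << std::hex << (int)broadcast_msb(0x42) << "\n";  // 0
    return 0;
}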

Disable default numeric types in compiler

When creating custom typedefs for integers, is it possible for the compiler to warn you when you use a default numeric type?
For example,
typedef int_fast32_t kint;
int_fast32_t test=0;//Would be ok
kint test=0; //Would be ok
int test=0; //Would throw a warning or error
We're converting a large project, and the default int on the platform is 16 bits (INT_MAX is 32767), which is causing some issues. This warning would tell a user not to use plain int in the code.
If possible, it would be great if this would work on GCC and VC++2012.
I'm reasonably sure gcc has no such option, and I'd be surprised if VC did.
I suggest writing a program that detects references to predefined types in source code, and invoking that tool automatically as part of your build process. It would probably suffice to search for certain keywords.
Be sure you limit this to your own source files; predefined and third-party headers are likely to make extensive use of predefined types.
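As a hedged sketch of such a build-step tool (file handling is deliberately minimal, and a real version would need to skip comments, string literals, and third-party headers):
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char* argv[])
{
    if (argc < 2) {
        std::cerr << "usage: intcheck <source-file>...\n";
        return 1;
    }
    // Flag uses of the built-in integer keywords.
    std::regex builtin("\\b(short|int|long|signed|unsigned)\\b");
    int hits = 0;
    for (int a = 1; a < argc; ++a) {
        std::ifstream in(argv[a]);
        std::string line;
        for (int lineno = 1; std::getline(in, line); ++lineno) {
            if (std::regex_search(line, builtin)) {
                std::cout << argv[a] << ":" << lineno << ": " << line << "\n";
                ++hits;
            }
        }
    }
    return hits == 0 ? 0 : 2;  // non-zero exit makes the build step fail
}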
But I wouldn't make the prohibition absolute. There are a number of standard library functions that use predefined types. For example, in c = getchar() it makes no sense to declare c as anything other than int. And there's no problem with something like for (int i = 0; i <= 100; i++) ...
Ideally, the goal should be to use predefined types properly. The language has never guaranteed that an int can exceed 32767. (But "proper" use is difficult or impossible to verify automatically.)
I'd approach this by doing a replace-all first and then documenting this thoroughly.
You can use a preprocessor directive:
#define int use kint instead
Any use of int then expands into that nonsense token sequence and fails to compile. Note that technically defining a keyword as a macro is undefined behavior, and you'll run into trouble if the definition is in effect when standard or third-party headers are included.
I would recommend making a bulk replacement of int -> old_int_t at the very beginning of your port. This way you can continue modifying your code without facing major restrictions and at the same time have access to all places that are not yet updated.
Eventually, at the end of your work, all occurrences of old_int_t should be gone.
Even if one could somehow undefine the keyword int, that would do nothing to prevent usage of that type, since there are many cases where the compiler will end up using that type. Beyond the obvious cases of integer literals, there are some more subtle cases involving integer promotion. For example, if int happens to be 64 bits, operations between two variables of type uint32_t will be performed using type int rather than uint32_t. As nice as it would be to be able to specify that some variables represent numbers (which should be eagerly promoted when practical) while others represent members of a wrapping algebraic ring (which should not be promoted), I know of no facility to do such a thing. Consequently, int is unavoidable.
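A hedged illustration of that promotion effect, scaled down to uint16_t so it shows up on a typical platform where int is 32 bits (the same thing happens to uint32_t when int is 64 bits):
#include <cstdint>
#include <iostream>

int main()
{
    uint16_t a = 0, b = 1;
    // Both operands are promoted to (signed) int before the subtraction,
    // so the arithmetic is done in int, not in the unsigned 16-bit type.
    std::cout << (a - b) << "\n";            // prints -1
    std::cout << (uint16_t)(a - b) << "\n";  // prints 65535 after converting back
    return 0;
}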

How can elusive 64-bit portability issues be detected?

I found a snippet similar to this in some (C++) code I'm preparing for a 64-bit port.
int n;
size_t pos, npos;
/* ... initialization ... */
while ((pos = find(ch, start)) != npos)
{
    /* ... advance start position ... */
    n++; // this will overflow if the loop iterates too many times
}
While I seriously doubt this would actually cause a problem in even memory-intensive applications, it's worth looking at from a theoretical standpoint because similar errors could surface that will cause problems. (Change n to a short in the above example and even small files could overflow the counter.)
Static analysis tools are useful, but they can't detect this kind of error directly. (Not yet, anyway.) The counter n doesn't participate in the while expression at all, so this isn't as simple as other loops (where typecasting errors give the error away). Any tool would need to determine that the loop would execute more than 2^31 times, but that means it needs to be able to estimate how many times the expression (pos = find(ch, start)) != npos will evaluate as true - no small feat! Even if a tool could determine that the loop could execute more than 2^31 times (say, because it recognizes the find function is working on a string), how could it know that the loop won't execute more than 2^64 times, overflowing a size_t value, too?
It seems clear that to conclusively identify and fix this kind of error requires a human eye, but are there patterns that give away this kind of error so it can be manually inspected? What similar errors exist that I should be watchful for?
EDIT 1: Since short, int and long types are inherently problematic, this kind of error could be found by examining every instance of those types. However, given their ubiquity in legacy C++ code, I'm not sure this is practical for a large piece of software. What else gives away this error? Is each while loop likely to exhibit some kind of error like this? (for loops certainly aren't immune to it!) How bad is this kind of error if we're not dealing with 16-bit types like short?
EDIT 2: Here's another example, showing how this error appears in a for loop.
int i = 0;
for (iter = c.begin(); iter != c.end(); iter++, i++)
{
    /* ... */
}
It's fundamentally the same problem: loops are counting on some variable that never directly interacts with a wider type. The variable can still overflow, but no compiler or tool detects a casting error. (Strictly speaking, there is none.)
EDIT 3: The code I'm working with is very large. (10-15 million lines of code for C++ alone.) It's infeasible to inspect all of it, so I'm specifically interested in ways to identify this sort of problem (even if it results in a high false-positive rate) automatically.
Code reviews. Get a bunch of smart people looking at the code.
Use of short, int, or long is a warning sign, because the range of these types isn't defined in the standard. Most usage should be changed to the new int_fastN_t types in <stdint.h>; usage dealing with serialization should use the intN_t types. Well, actually these <stdint.h> types should be used to typedef new application-specific types.
This example really ought to be:
typedef int_fast32_t linecount_appt;
linecount_appt n;
This expresses a design assumption that linecount fits in 32 bits, and also makes it easy to fix the code if the design requirements change.
It's clear what you need is a smart "range" analyzer tool to determine what the range of computed values is versus the type in which those values are being stored. (Your fundamental objection is to that smart range analyzer being a person.) You might need some additional code annotations (manually well-placed typedefs or assertions that provide explicit range constraints) to enable a good analysis, and to handle otherwise apparently arbitrarily large user input.
You'd need special checks to handle the places where C/C++ says the arithmetic is legal but dumb (e.g., the assumption that you don't want [two's complement] overflows).
For your n++ example (equivalent to n_after = n_before + 1), n_before can be 2^31 - 1 (because of your observations about strings), so n_before + 1 can be 2^31, which is an overflow. (Standard C/C++ actually makes signed overflow undefined behavior, though many implementations just wrap silently.)
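As a hedged sketch of the kind of manually placed annotation/guard that makes such a range assumption explicit (the helper name is made up):
#include <cassert>
#include <climits>

// Increment a counter, asserting (in debug builds) that the increment cannot
// overflow; this is the sort of explicit constraint a reviewer or a range
// analyzer can then rely on instead of guessing.
inline int checked_increment(int n)
{
    assert(n < INT_MAX && "counter about to overflow");
    return n + 1;
}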
Our DMS Software Reengineering Toolkit in fact has range analysis machinery built in... but it is not presently connected to DMS's C++ front end; we can only pedal so fast :-{ [We have used it on COBOL programs for different problems involving ranges.]
In the absence of such range analysis, you could probably detect the existence of loops with such dependent flows; the value of n clearly depends on the loop count. I suspect this would flag every loop in the program that has a side effect, which might not be that much help.
Another poster suggests somehow redeclaring all the int-like declarations using application-specific types (e.g., linecount_appt) and then typedef'ing those to values that work for your application. To do this, I'd think you'd have to classify each int-like declaration into categories (e.g., "these declarations are all linecount_appt"). Doing this by manual inspection for 10M SLOC seems pretty hard and very error prone. Finding all declarations which receive (by assignment) values from the "same" value sources might be a way to get hints about where such application types are. You'd want to be able to mechanically find such groups of declarations, and then have some tool automatically replace the actual declarations with the designated application type (e.g., linecount_appt). This is likely somewhat easier than doing precise range analysis.
There are tools that help find such issues. I won't give any links here because the ones I know of are commercial but should be pretty easy to find.