Advice on unsigned int (Gangnam Style edition) - c++

The video "Gangnam Style" (I'm sure you've heard it) just exceeded 2 billion views on YouTube. In fact, Google says that they never expected a video to be greater than a 32-bit integer... which alludes to the fact that Google used int instead of unsigned for their view counter. I think they had to re-write their code a bit to accommodate larger views.
Checking their style guide: https://google-styleguide.googlecode.com/svn/trunk/cppguide.html#Integer_Types
...they advise "don't use an unsigned integer type," and give one good reason why: unsigned could be buggy.
It's a good reason, but could be guarded against. My question is: is it bad coding practice in general to use unsigned int?

The Google rule is widely accepted in professional circles. The problem
is that the unsigned integral types are sort of broken, and have
unexpected and unnatural behavior when used for numeric values; they
don't work well as a cardinal type. For example, an index into an array
may never be negative, but it makes perfect sense to write
abs(i1 - i2) to find the distance between two indices. Which won't work if
i1 and i2 have unsigned types.
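A small sketch of the problem, assuming std::size_t indices (the function names are made up for illustration). With unsigned operands the subtraction wraps before abs could ever see a negative value; with signed operands it gives the expected distance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: "distance" computed directly on unsigned indices.
// When i1 < i2 the subtraction wraps modulo 2^N, so the result is huge.
std::size_t naive_unsigned_distance(std::size_t i1, std::size_t i2) {
    return i1 - i2;
}

// The same computation on a signed type gives the intended answer.
std::ptrdiff_t signed_distance(std::ptrdiff_t i1, std::ptrdiff_t i2) {
    return std::abs(i1 - i2);
}
```

For indices 2 and 5, signed_distance returns 3, while naive_unsigned_distance returns the wrapped value of -3, which on a 64-bit size_t is a number close to 2^64.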
As a general rule, this particular rule in the Google style guidelines
corresponds more or less to what the designers of the language intended.
Any time you see something other than int, you can assume a special
reason for it. If it is because of the range, it will be long or
long long, or even int_least64_t. Using unsigned types is generally
a signal that you're dealing with bits, rather than the numeric value of
the variable, or (at least in the case of unsigned char) that you're
dealing with raw memory.
With regards to the "self-documentation" of using an unsigned: this
doesn't hold up, since there are almost always a lot of values that the
variable cannot (or should not) take, including many positive ones. C++
doesn't have sub-range types, and the way unsigned is defined means
that it cannot really be used as one either.

This guideline is extremely misleading. Blindly using int instead of unsigned int won't solve anything. That simply shifts the problems somewhere else. You absolutely must be aware of integer overflow when doing arithmetic on fixed precision integers. If your code is written in a way that it does not handle integer overflow gracefully for some given inputs, then your code is broken regardless of whether you use signed or unsigned ints. With unsigned ints you must be aware of integer underflow as well, and with doubles and floats you must be aware of many additional issues with floating point arithmetic.
Just take this article about a bug in the standard Java binary search algorithm published by none other than Google for why you must be aware of integer overflow. In fact, that very article shows C++ code casting to unsigned int in order to guarantee correct behavior. The article also starts out by presenting a bug in Java where guess what, they don't have unsigned int. However, they still ran into a bug with integer overflow.
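The overflow in question sits in the midpoint computation (low + high) / 2. Here is a rough sketch of two common repairs, assuming non-negative int indices with low <= high; the function names are illustrative only. The second one is the kind of unsigned cast the article shows for C and C++.

```cpp
#include <cassert>
#include <climits>

// The unsigned variant below relies on 2 * INT_MAX fitting in unsigned int,
// which holds on mainstream implementations.
static_assert(UINT_MAX / 2 >= static_cast<unsigned int>(INT_MAX),
              "sum of two non-negative ints must fit in unsigned int");

// Assumes 0 <= low <= high. high - low cannot overflow because both
// values are non-negative, so the sum low + high is never formed.
int midpoint_by_difference(int low, int high) {
    return low + (high - low) / 2;
}

// Same assumption. The addition is done in unsigned int, where it cannot
// overflow, and the halved result fits back into int.
int midpoint_by_unsigned_cast(int low, int high) {
    return static_cast<int>(
        (static_cast<unsigned int>(low) + static_cast<unsigned int>(high)) >> 1);
}
```

Both functions agree for valid inputs; the plain (low + high) / 2 would overflow for something like low = INT_MAX - 1, high = INT_MAX.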

Use the right type for the operations which you will perform. float wouldn't make sense for a counter. Nor does signed int. The normal operations on the counter are print and +=1.
Even if you had some unusual operations, such as printing the difference in viewcounts, you wouldn't necessarily have a problem. Sure, other answers mention the incorrect abs(i2-i1) but it's not unreasonable to expect programmers to use the correct max(i2,i1) - min(i2,i1). Which does have range issues for signed int. No uniform solution here; programmers should understand the properties of the types they're working with.
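As a minimal sketch of that max/min approach for unsigned counters (the function name is hypothetical):

```cpp
#include <algorithm>
#include <cassert>

// Absolute difference of two unsigned counters without going through a
// signed type: the larger minus the smaller can never wrap.
unsigned int view_count_difference(unsigned int a, unsigned int b) {
    return std::max(a, b) - std::min(a, b);
}
```

The same expression on signed int is where the range issue shows up: INT_MAX - INT_MIN does not fit in an int.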

Google states that: "Some people, including some textbook authors, recommend using unsigned types to represent numbers that are never negative. This is intended as a form of self-documentation."
I personally use unsigned ints as index parameters.
int foo(unsigned int index, int* myArray){
    return myArray[index];
}
Google suggests: "Document that a variable is non-negative using assertions. Don't use an unsigned type."
#include <cassert>
int foo(int index, int* myArray){
    assert(index >= 0);
    return myArray[index];
}
Pro for Google: If a negative number is passed in debug mode my code will hopefully return an out of bounds error. Google's code is guaranteed to assert.
Pro for me: My code can support a greater size of myArray.
I think the actual deciding factor comes down to this: how clean is your code? If you clean up all warnings, it will be obvious when the compiler warns you that you're trying to assign a signed variable to an unsigned variable. If your code already has a bunch of warnings, the compiler's warning is going to be lost on you.
A final note here: Google says: "Sometimes gcc will notice this bug and warn you, but often it will not." I haven't seen that to be the case in Visual Studio, where checks against negative numbers and assignments from signed to unsigned always produce a warning. But if you use gcc you might have a care.
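To make the trade-off concrete, here is a sketch of what an unsigned index parameter actually receives when a caller passes -1, plus a bounds check that needs the array length (neither function is from the question; both are illustrative):

```cpp
#include <cassert>
#include <climits>

// What an unsigned parameter receives when the caller passes a negative
// int: the value is converted modulo 2^N, so -1 arrives as UINT_MAX.
unsigned int as_received_by_unsigned_param(int value) {
    return static_cast<unsigned int>(value);
}

// Hypothetical bounds check: with an unsigned index, one comparison
// against the length rejects both "negative" and too-large values.
bool index_in_bounds(unsigned int index, unsigned int length) {
    return index < length;
}
```

Without a length to compare against, neither the signed version with an assert nor the unsigned version can detect an index that is simply too large.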

Your specific question is:
"Is it bad practice to use unsigned?" to which the only correct answer can be no. It is not bad practice.
There are many style guides, each with a different focus, and while in some cases an organisation, given their typical toolchain and deployment platform, may choose not to use unsigned for their products, other toolchains and platforms almost demand its use.
Google seem to get a lot of deference because they have a good business model (and probably employ some smart people like everyone else).
CERT IIRC recommend unsigned for buffer indexes, because if you do overflow, at least you'll still be in your own buffer, some intrinsic security there.
What do the language and standard library designers say (probably the best representation of accepted wisdom)? strlen returns a size_t, which is an unsigned type (its width is platform dependent). Other answers suggest this is an anachronism because shiny new computers have wide architectures, but this misses the point that C and C++ are general-purpose programming languages and should scale well on big and small platforms.
Bottom line is that this is one of many religious questions; certainly not settled, and in these cases, I normally go with my religion for green field developments, and go with the existing convention of the codebase for existing work. Consistency matters.

Related

Making unsigned integer underflow throw an exception

I understand that there are applications in which using unsigned integer over/underflow is a good way to get cheap modular arithmetic.
In my code, I use uint exclusively for indices to containers, so I never want this behaviour.
Is this a bad idea? Should I be using int everywhere instead? I do have to do some unsavoury things to get a for loop to count down to 0.
Is there a commonly used implementation of a less unsafe unsigned integer type? Something that throws an exception?
Do compilers (for me gcc, clang) provide a mechanism for less unsafe behaviour in the given compilation unit?
First, a terminology quibble: there is no such thing as unsigned integer underflow, precisely because of the way unsigned integers wrap around (using modulo arithmetic); "wrap-around" is probably the phrase you meant.
Second, is this a common scenario to be in? Yes, it is a bit. You're not the only one doing "unsavoury things" with loops for reverse counting, and I bet there are a ton of bugs out there where people haven't done "unsavoury things" and, as a result, their code has an unsavoury infinite loop hidden in it. Mind you, I'm not sure I'd go so far as to call unsigneds "unsafe" as a result; like anything, they are the right tool for a subset of infinite possible jobs, and within that subset they are perfectly safe.
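For reference, one widely used form of the reverse-counting loop looks roughly like this (the wrapper function is only there to make the sketch testable):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Visits indices n-1, n-2, ..., 0. The test "i-- > 0" compares before
// decrementing, so the loop body never sees a wrapped value.
// The obvious "for (i = n - 1; i >= 0; --i)" never terminates for an
// unsigned i, because i >= 0 is always true.
std::vector<std::size_t> indices_in_reverse(std::size_t n) {
    std::vector<std::size_t> visited;
    for (std::size_t i = n; i-- > 0;) {
        visited.push_back(i);
    }
    return visited;
}
```

The loop also handles n == 0 correctly: the condition fails immediately and the body never runs.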
There is debate over whether unsigned integers should be used for array indexes at all. Some standard committee members believe that their use in the standard library was a mistake; I know that several members of the c++ community here on Stack Overflow also hate unsigned values and wish they'd go away.
Personally I think having access to the full range of the integer by default is absolutely crucial (and losing that is not worth it for a single "-1" sentinel value or whatever), so I think that — while you're not alone in this requirement, and it's a sensible requirement — using unsigned array indexes by default is a good thing. (And what the heck is a negative array index? Semantics, people!)
But that doesn't help you in this scenario. So, what can you do about it? No, there's no trapping unsigned integer implementation (at least, not one that I'm aware of, let alone widespread) because that would literally violate the rules of the type as defined by C++: it would introduce well-defined underflow/overflow semantics to a type for which underflow/overflow shouldn't even be possible.
You will have to use signed integers and check for "logical underflow" (i.e. going out of your desired range, say -1) yourself. You could wrap this behaviour in a class.
I suppose you could actually just wrap an unsigned integer while you're at it, adding some extra logic to operator-- and operator-= to detect a wrap-around and throw.
But I guess my point is that, whatever you do, it's going to be in your "code space" and thus subject to decreased performance. You can't eke out this behaviour from the platform itself.
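A minimal sketch of such a wrapper, assuming only decrement and subtraction need guarding. CheckedIndex is a made-up name, not a standard or library type, and the choice of std::out_of_range is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical wrapper: an unsigned index whose decrement operations
// throw instead of wrapping past zero.
class CheckedIndex {
public:
    explicit CheckedIndex(std::size_t value) : value_(value) {}

    CheckedIndex& operator--() {
        if (value_ == 0) throw std::out_of_range("CheckedIndex: decrement below zero");
        --value_;
        return *this;
    }

    CheckedIndex& operator-=(std::size_t amount) {
        if (amount > value_) throw std::out_of_range("CheckedIndex: subtraction below zero");
        value_ -= amount;
        return *this;
    }

    std::size_t get() const { return value_; }

private:
    std::size_t value_;
};
```

Each check is an extra compare and branch per operation, which is the performance cost mentioned above.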

Which to use? int32_t vs uint32_t [duplicate]

When is it appropriate to use an unsigned variable over a signed one? What about in a for loop?
I hear a lot of opinions about this and I wanted to see if there was anything resembling a consensus.
for (unsigned int i = 0; i < someThing.length(); i++) {
    SomeThing var = someThing.at(i);
    // You get the idea.
}
I know Java doesn't have unsigned values, and that must have been a conscious decision on Sun Microsystems' part.
I was glad to find a good conversation on this subject, as I hadn't really given it much thought before.
In summary, signed is a good general choice - even when you're dead sure all the numbers are positive - if you're going to do arithmetic on the variable (like in a typical for loop case).
unsigned starts to make more sense when:
You're going to do bitwise things like masks, or
You're desperate to take advantage of the sign bit for that extra positive range.
Personally, I like signed because I don't trust myself to stay consistent and avoid mixing the two types (like the article warns against).
In your example above, where 'i' will always be positive and a higher range would be beneficial, unsigned would be useful. Likewise if you're using '#define' statements, such as:
#define BIT1 (1u)        /* lowest bit */
#define BIT32 (1u << 31) /* highest bit of a 32-bit unsigned int */
Especially when these values will never change.
However, if you're doing an accounting program where the people are irresponsible with their money and are constantly in the red, you will most definitely want to use 'signed'.
I do agree with saint though that a good rule of thumb is to use signed, which C actually defaults to, so you're covered.
I would think that if your business case dictates that a negative number is invalid, you would want to have an error shown or thrown.
With that in mind, I only just recently found out about unsigned integers while working on a project processing data in a binary file and storing the data into a database. I was purposely "corrupting" the binary data, and ended up getting negative values instead of an expected error. I found that even though the value converted, the value was not valid for my business case.
My program did not error, and I ended up getting wrong data into the database. It would have been better if I had used uint and had the program fail.
C and C++ compilers will generate a warning when you compare signed and unsigned types; in your example code, you couldn't make your loop variable unsigned and have the compiler generate code without warnings (assuming said warnings were turned on).
Naturally, you're compiling with warnings turned all the way up, right?
And, have you considered compiling with "treat warnings as errors" to take it that one step further?
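The reason this warning matters can be shown in a few lines. With gcc and clang, -Wall -Wextra turns on the sign-compare warning and -Werror makes it fatal. The sketch below spells out by hand the conversion the compiler applies silently in a mixed comparison (the function name is made up):

```cpp
#include <cassert>

// When an int is compared with an unsigned int, the int operand is
// converted to unsigned first. The cast below makes that step explicit.
bool less_than_after_usual_conversions(int lhs, unsigned int rhs) {
    return static_cast<unsigned int>(lhs) < rhs;
}
```

So a comparison such as -1 < 1u is false, because -1 becomes the largest unsigned value before the comparison happens.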
The downside with using signed numbers is that there's a temptation to overload them so that, for example, the values 0->n are the menu selection, and -1 means nothing's selected - rather than creating a class that has two variables, one to indicate if something is selected and another to store what that selection is. Before you know it, you're testing for negative one all over the place and the compiler is complaining about how you're wanting to compare the menu selection against the number of menu selections you have - but that's dangerous because they're different types. So don't do that.
size_t is often a good choice for this, or size_type if you're using an STL class.
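A rough sketch of the two-variable class described above, with hypothetical names; "nothing selected" is its own flag rather than a magic -1 stored in the index:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical menu state: one member says whether anything is selected,
// the other stores which item it is.
class MenuSelection {
public:
    void select(std::size_t item) {
        has_selection_ = true;
        item_ = item;
    }
    void clear() { has_selection_ = false; }

    bool has_selection() const { return has_selection_; }

    // Only meaningful while something is selected.
    std::size_t item() const {
        assert(has_selection_);
        return item_;
    }

private:
    bool has_selection_ = false;
    std::size_t item_ = 0;
};
```

In C++17 and later, std::optional<std::size_t> packages the same flag-plus-value pair without a hand-written class.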

Why worry about 'undefined behavior' in >> of signed type?

My question is related to this one and contains a few questions.
For me the most obvious solution to the above problem (meaning the one I would use in my code) is just this:
uint8_t x = some value;
x = (int8_t)x >> 7;
Yes, yes I hear you all .... undefined behavior and this is why I've not posted my 'solution'.
I have a feeling (maybe it is only my sick mind) that the term 'undefined behavior' is overused on SO just to justify downvoting someone if a question is tagged c/c++.
So - let's (for a while) put aside C/C++ standards and think about everyday life/programming, real compiler implementations and code they generate for contemporary hardware.
Taking into account the following:
As far as I remember all the hardware I had encountered had distinct instructions for arithmetic and logical shift.
All compilers that I know translate >> into arithmetic shift for signed types and logical shift for unsigned types.
I cannot recall any compiler ever emitting div-like low level instruction when >> was used in c/c++ code (and we are not talking about operator overloading here).
All the hardware I know uses two's complement (U2).
So ... is there anything (any contemporary compiler, hardware) that behaves differently than mentioned above? Put simply should I ever be worried about right shifting signed value not being translated to arithmetic shift?
My 'solution' compiles to just one low level instruction on many platforms while others require multiple low level instructions. What would you use in your code?
Truth please ;-)
Why worry about 'undefined behavior' in >> of signed type?
Because it doesn't really matter how well defined any particular undefined behaviour is now; the point is that it may break at any point in the future. You're relying on a side-effect that may be optimized (or un-optimized) away at any point for any reason or no reason.
Also, I don't want to have to ask somebody with detailed knowledge of many different compiler's implementations before I use something I shouldn't use in the first place, so I skip it.
Yes, there are compilers which behave differently from what you assume.
In particular, optimization phases within compilers. These take advantage of the known possible values of variables, and will derive those possible values from the absence of UB. A pointer must be non-null if it's been dereferenced, an integer must be non-zero if it's been used as a divisor, and a right-shifted value must be non-negative.
And that works back in time:
if (x<0) {
    printf("This is dead code\n");
}
x >> 3;
What it really comes down to is, are you willing to take the risk?
"The standard doesn't guarantee yada yada" is nice and all, but let's be honest now, the risk isn't big. If you're going to run your code on some crazy platform, you generally know in advance. And if it takes you by surprise, well, that's the risk you took.
Also, the workaround is horrible. If you're not going to need it, it's just polluting your codebase with pointless "function calls instead of right shifts" that will be harder to maintain (and thus carry a cost). And you'll never be able to "paste and forget" code from other places into the project - you'd always have to check the code for the possibility of right shifting negative signed integers.
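For the specific one-liner in the question, a workaround does not have to be a heavy function call. The sketch below (sign_mask is a made-up name) gets the same 0x00 or 0xFF result using only unsigned arithmetic. As a side note, the C++ standards before C++20 describe a right shift of a negative signed value as implementation-defined, and C++20 defines it as an arithmetic shift.

```cpp
#include <cassert>
#include <cstdint>

// Replicates the top bit of x into every bit position (0x00 or 0xFF)
// without shifting a signed value.
std::uint8_t sign_mask(std::uint8_t x) {
    const unsigned int top_bit = x >> 7;             // 0 or 1
    return static_cast<std::uint8_t>(0u - top_bit);  // 0x00 or 0xFF
}
```

Whether this compiles down to a single instruction like the signed shift is up to the compiler and target; that part would need measuring rather than assuming.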

`short int` vs `int`

Should I bother using short int instead of int? Is there any useful difference? Any pitfalls?
short vs int
Don't bother with short unless there is a really good reason such as saving memory on a gazillion values, or conforming to a particular memory layout required by other code.
Using lots of different integer types just introduces complexity and possible wrap-around bugs.
On modern computers it might also introduce needless inefficiency.
const
Sprinkle const liberally wherever you can.
const constrains what might change, making it easier to understand the code: you know that this beastie is not gonna move, so, can be ignored, and thinking directed at more useful/relevant things.
Top-level const for formal arguments is however by convention omitted, possibly because the gain is not enough to outweigh the added verbosity.
Also, in a pure declaration of a function top-level const for an argument is simply ignored by the compiler. But on the other hand, some other tools may not be smart enough to ignore them, when comparing pure declarations to definitions, and one person cited that in an earlier debate on the issue in the comp.lang.c++ Usenet group. So it depends to some extent on the toolchain, but happily I've never used tools that place any significance on those consts.
Cheers & hth.,
Absolutely not in function arguments. Few calling conventions are going to make any distinction between short and int. If you're making giant arrays you could use short if your data fits in short to save memory and increase cache effectiveness.
What Ben said. You will actually create less efficient code since all the registers need to strip out the upper bits whenever any comparisons are done. Unless you need to save memory because you have tons of them, use the native integer size. That's what int is for.
EDIT: Didn't even see your sub-question about const. Using const on intrinsic types (int, float) is useless, but any pointers/references should absolutely be const whenever applicable. Same for class methods as well.
The question "Should I use short int?" is technically malformed. The only good answer will be "I don't know, what are you trying to accomplish?".
But let's consider some scenarios:
You know the definite range of values that your variable can take.
The ranges for signed integers are:
signed char: -2⁷ to 2⁷-1
short: -2¹⁵ to 2¹⁵-1
int: -2¹⁵ to 2¹⁵-1
long: -2³¹ to 2³¹-1
long long: -2⁶³ to 2⁶³-1
We should note here that these are guaranteed ranges, they can be larger in your particular implementation, and often are. You are also guaranteed that the previous range cannot be larger than the next, but they can be equal.
You will quickly note that short and int actually have the same guaranteed range. This gives you very little incentive to use it. The only reason to use short given this situation becomes giving other coders a hint that the values will be not too large, but this can be done via a comment.
It does, however, make sense to use signed char, if you know that you can fit every potential value in the range -128 — 127.
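To see what a particular implementation actually provides, std::numeric_limits can be queried directly. A small sketch follows; fits_in is a hypothetical helper intended for signed integer types only.

```cpp
#include <cassert>
#include <limits>

// Minimum upper bounds every conforming implementation must provide;
// the actual limits are often larger.
static_assert(std::numeric_limits<short>::max() >= 32767, "short: at least 2^15 - 1");
static_assert(std::numeric_limits<int>::max() >= 32767, "int: at least 2^15 - 1");
static_assert(std::numeric_limits<long>::max() >= 2147483647L, "long: at least 2^31 - 1");

// Hypothetical helper: does value fit in signed integer type T here?
template <typename T>
bool fits_in(long long value) {
    return value >= std::numeric_limits<T>::min() &&
           value <= std::numeric_limits<T>::max();
}
```

For example, fits_in<signed char>(127) is true everywhere, while fits_in<int>(100000) depends on how wide int is on the platform in question.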
You don't know the exact range of potential values.
In this case you are in a rather bad position to attempt to minimise memory usage, and should probably use at least int. Although it has the same minimum range as short, on many platforms it may be larger, and this will help you out.
But the bigger problem is that you are trying to write a piece of software that operates on values, the range of which you do not know. Perhaps something wrong has happened before you have started coding (when requirements were being written up).
You have an idea about the range, but realise that it can change in the future.
Ask yourself how close to the boundary are you. If we are talking about something that goes from -1000 to +1000 and can potentially change to -1500 – 1500, then by all means use short. The specific architecture may pad your value, which will mean you won't save any space, but you won't lose anything. However, if we are dealing with some quantity that is currently -14000 – 14000, and can grow unpredictably (perhaps it's some financial value), then don't just switch to int, go to long right away. You will lose some memory, but will save yourself a lot of headache catching these roll-over bugs.
short vs int - If your data will fit in a short, use a short. Save memory. Make it easier for the reader to know how much data your variable may fit.
use of const - Great programming practice. If your data should be a const then make it const. It is very helpful when someone reads your code.

Is there any tool for C++ which will check for common unspecified behavior?

Often one makes assumptions about a particular platform one is coding on, for example that signed integers use two's complement storage, or that (0xFFFFFFFF == -1), or things of that nature.
Does a tool exist which can check a codebase for the most common violations of these kinds of things (for those of us who want portable code but don't have strange non-two's-complement machines)?
(My examples above are specific to signed integers, but I'm interested in other errors (such as alignment or byte order) as well)
There are various levels of compiler warnings that you may wish to have switched on, and you can treat warnings as errors.
If there are other assumptions you know you make at various points in the code you can assert them. If you can do that with static asserts you will get failure at compile time.
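A sketch of what such compile-time assertions might look like. The particular assumptions listed are examples for a hypothetical codebase, not a recommended set.

```cpp
#include <cassert>
#include <climits>

// Assumptions a (hypothetical) codebase relies on, checked at compile
// time. A platform that violates one fails to build instead of
// misbehaving at run time.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(int) == 4, "this code assumes a 32-bit int");

// In two's complement -1 has all bits set, so its low two bits are 11.
// Ones' complement would give 2 here and sign-magnitude would give 1.
constexpr bool is_twos_complement() { return (-1 & 3) == 3; }
static_assert(is_twos_complement(), "this code assumes two's complement integers");
```

Properties that cannot be evaluated in a constant expression, such as byte order before C++20, would need a run-time assert at startup instead.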
I know that Clang is very actively developing a static analyzer (as a library).
The goal is to catch errors at analysis time; however, the exact extent of the errors caught is not that clear to me yet. The library is called "Checker" and T. Kremenek is responsible for it; you can ask about it on the clang-dev mailing list.
I don't have the impression that there is any kind of reference about the checks being performed, and I don't think it's mature enough yet for production use (given the rate of changes going on) but it may be worth a look.
Maybe a static code analysis tool? I used one a few years ago and it reported errors like this. It was not perfect and still limited but maybe the tools are better now?
update:
Maybe one of these:
What open source C++ static analysis tools are available?
update2:
I tried FlexeLint on your example (you can try it online using the Do-It-Yourself Example on http://www.gimpel-online.com/OnlineTesting.html) and it complains about it but perhaps not in a way you are looking for:
5 int i = -1;
6 if (i == 0xffffffff)
diy64.cpp 6 Warning 650: Constant '4294967295' out of range for operator '=='
diy64.cpp 6 Info 737: Loss of sign in promotion from int to unsigned int
diy64.cpp 6 Info 774: Boolean within 'if' always evaluates to False [Reference: file diy64.cpp: lines 5, 6]
Very interesting question. I think it would be quite a challenge to write a tool to flag these usefully, because so much depends on the programmer's intent/assumptions.
For example, it would be easy to recognize a construct like:
x &= -2; // round down to an even number
as being dependent on twos-complement representation, but what if the mask is a variable instead of a constant "-2"?
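For comparison, the same rounding can be written so that it does not depend on how negative numbers are represented, by building the mask from an unsigned constant. A minimal sketch, assuming x itself is unsigned (the function name is illustrative):

```cpp
#include <cassert>

// Clears the lowest bit. ~1u is "all ones except bit 0" for unsigned int
// whatever the signed representation is.
unsigned int round_down_to_even(unsigned int x) {
    return x & ~1u;
}
```

A checking tool would have a much easier time with this form, since no signed value takes part in the bitwise operation.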
Yes, you could take it a step further and warn of any use of a signed int with bitwise &, any assignment of a negative constant to an unsigned int, and any assignment of a signed int to an unsigned int, etc., but I think that would lead to an awful lot of false positives.
[ sorry, not really an answer, but too long for a comment ]