Making unsigned integer underflow throw an exception - c++

I understand that there are applications in which using unsigned integer over/underflow is a good way to get cheap modular arithmetic.
In my code, I use uint exclusively for indices to containers, so I never want this behaviour.
Is this a bad idea? Should I be using int everywhere instead? I do have to do some unsavoury things to get a for loop to count down to 0.
Is there a commonly used implementation of a less unsafe unsigned integer type? Something that throws an exception?
Do compilers (for me, gcc and clang) provide a mechanism for less unsafe behaviour in a given compilation unit?

First, a terminology quibble: there is no such thing as unsigned integer underflow, precisely because unsigned types wrap around (using modular arithmetic); "wrap-around" is probably the term you meant.
Second, is this a common scenario to be in? Yes, somewhat. You're not the only one doing "unsavoury things" with loops for reverse counting, and I bet there are a ton of bugs out there where people haven't done those "unsavoury things" and, as a result, their code has an unsavoury infinite loop hidden in it. Mind you, I'm not sure I'd go so far as to call unsigned types "unsafe" as a result; like anything, they are the right tool for a subset of infinite possible jobs, and within that subset they are perfectly safe.
There is debate over whether unsigned integers should be used for array indexes at all. Some standard committee members believe that their use in the standard library was a mistake; I know that several members of the C++ community here on Stack Overflow also hate unsigned values and wish they'd go away.
Personally I think having access to the full range of the integer by default is absolutely crucial (and losing that is not worth it for a single "-1" sentinel value or whatever), so I think that — while you're not alone in this requirement, and it's a sensible requirement — using unsigned array indexes by default is a good thing. (And what the heck is a negative array index? Semantics, people!)
But that doesn't help you in this scenario. So, what can you do about it? No, there's no trapping unsigned integer implementation (at least, not one that I'm aware of, let alone widespread) because that would literally violate the rules of the type as defined by C++: it would introduce well-defined underflow/overflow semantics to a type for which underflow/overflow shouldn't even be possible.
You will have to use signed integers and check for "logical underflow" (i.e. going out of your desired range, say -1) yourself. You could wrap this behaviour in a class.
I suppose you could actually just wrap an unsigned integer while you're at it, adding some extra logic to operator-- and operator-= to detect a wrap-around and throw.
But I guess my point is that, whatever you do, it's going to be in your "code space" and thus subject to decreased performance. You can't eke out this behaviour from the platform itself.
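For illustration, here is a minimal sketch of that wrapper idea; the class name is made up, and only the decrementing operations are checked:

#include <cstddef>
#include <stdexcept>

// Illustrative only: a size_t-like index that throws instead of wrapping
// below zero. Only the operations relevant to counting down are checked.
class checked_index {
    std::size_t value_;
public:
    explicit checked_index(std::size_t v = 0) : value_(v) {}
    std::size_t get() const { return value_; }

    checked_index& operator--() {
        if (value_ == 0) throw std::underflow_error("checked_index: decrement below zero");
        --value_;
        return *this;
    }
    checked_index& operator-=(std::size_t n) {
        if (n > value_) throw std::underflow_error("checked_index: subtraction below zero");
        value_ -= n;
        return *this;
    }
};

A count-down loop over such an index would then throw instead of silently wrapping past zero, at the cost of a branch per decrement.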

Related

Why is std::ssize being forced to a minimum size for its signed size type?

In C++20, std::ssize is being introduced to obtain the signed size of a container for generic code. (And the reason for its addition is explained here.)
Somewhat peculiarly, the definition given there (combining with common_type and ptrdiff_t) has the effect of forcing the return value to be "either ptrdiff_t or the signed form of the container's size() return value, whichever is larger".
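Concretely, for a container with a size() member, the returned type works out roughly like this (the helper name is mine, just to illustrate the wording):

#include <cstddef>
#include <type_traits>

// Sketch mirroring the C++20 wording for std::ssize(c): the result type is
// common_type_t<ptrdiff_t, make_signed_t<decltype(c.size())>>.
template <class C>
constexpr auto ssize_sketch(const C& c) {
    using R = std::common_type_t<std::ptrdiff_t,
                                 std::make_signed_t<decltype(c.size())>>;
    return static_cast<R>(c.size());
}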
P1227R1 indirectly offers a justification for this ("it would be a disaster for std::ssize() to turn a size of 60,000 into a size of -5,536").
This seems to me like an odd way to try to "fix" that, however.
Containers which intentionally define a uint16_t size and are known to never exceed 32,767 elements will still be forced to use a larger type than required.
The same thing would occur for containers using a uint8_t size and 127 elements, respectively.
In desktop environments, you probably don't care; but this might be important for embedded or otherwise resource-constrained environments, especially if the resulting type is used for something more persistent than a stack variable.
Containers which use the default size_t size on 32-bit platforms but which nevertheless do contain between 2B and 4B items will hit exactly the same problem as above.
If there still exist platforms for which ptrdiff_t is smaller than 32 bits, they will hit the same problem as well.
Wouldn't it be better to just use the signed type as-is (without extending its size) and to assert that a conversion error has not occurred (eg. that the result is not negative)?
Am I missing something?
To expand on that last suggestion a bit (inspired by Nicol Bolas' answer): if it were implemented the way that I suggested, then this code would Just Work™:
void DoSomething(int16_t i, T const& item);
for (int16_t i = 0, len = std::ssize(rng); i < len; ++i)
{
    DoSomething(i, rng[i]);
}
With the current implementation, however, this produces warnings and/or errors unless static_casts are explicitly added to narrow the result of ssize, or unless int i is used instead and then narrowed in the function call (and the range indexing); neither of those seems like an improvement.
Containers which intentionally define a uint16_t size and are known to never exceed 32,767 elements will still be forced to use a larger type than required.
It's not like the container is storing the size as this type. The conversion happens via accessing the value.
As for embedded systems, embedded systems programmers already know about C++'s propensity to increase the size of small types. So if they expect a type to be an int16_t, they're going to spell that out in the code, because otherwise C++ might just promote it to an int.
Furthermore, there is no standard way to ask about what size a range is "known to never exceed". decltype(size(range)) is something you can ask for; sized ranges are not required to provide a max_size function. Without such an ability, the safest assumption is that a range whose size type is uint16_t can assume any size within that range. So the signed size should be big enough to store that entire range as a signed value.
Your suggestion is basically that any ssize call is potentially unsafe, since half of any size range cannot be validly stored in the return type of ssize.
Containers which use the default size_t size on 32-bit platforms but which nevertheless do contain between 2B and 4B items will hit exactly the same problem as above.
Assuming that it is valid for ptrdiff_t to not be a signed 64-bit integer on such platforms, there isn't really a valid solution to that problem. So yes, there will be cases where ssize is potentially unsafe.
ssize currently is potentially unsafe in cases where it is not possible to be safe. Your proposal would make ssize potentially unsafe in all cases.
That's not an improvement.
And no, merely asserting/contract checking is not a viable solution. The point of ssize is to make for(int i = 0; i < std::ssize(rng); ++i) work without the compiler complaining about signed/unsigned mismatch. To get an assert because of a conversion failure that didn't need to happen (and BTW, cannot be corrected without using std::size, which we are trying to avoid), one which is ultimately irrelevant to your algorithm? That's a terrible idea.
if it were implemented the way that I suggested, then this code would Just Work™:
Let us ignore the question of how often it is that a user would write this code.
The reason your compiler will expect/require you to use a cast there is because you are asking for an inherently dangerous operation: you are potentially losing data. Your code only "Just Works™" if the current size fits into an int16_t; that makes the conversion statically dangerous. This is not something that should implicitly take place, so the compiler suggests/requires you to explicitly ask for it. And users looking at that code get a big, fat eyesore reminding them that a dangerous thing is being done.
That is all to the good.
See, if your suggested implementation were how ssize behaved, then that means we must treat every use of ssize as just as inherently dangerous as the compiler treats your attempted implicit conversion. But unlike static_cast, ssize is small and easily missed.
Dangerous operations should be called out as such. Since ssize is small and difficult to notice by design, it therefore should be as safe as possible. Ideally, it should be as safe as size, but failing that, it should be unsafe only to the extent that it is impossible to make it safe.
Users should not look on ssize usage as something dubious or disconcerting; they should not fear to use it.

Advice on unsigned int (Gangnam Style edition)

The video "Gangnam Style" (I'm sure you've heard it) just exceeded 2 billion views on YouTube. In fact, Google says they never expected a video's view count to exceed a 32-bit integer... which alludes to the fact that Google used int instead of unsigned for their view counter. I think they had to rewrite their code a bit to accommodate the larger count.
Checking their style guide: https://google-styleguide.googlecode.com/svn/trunk/cppguide.html#Integer_Types
...they advise "don't use an unsigned integer type," and give one good reason why: unsigned could be buggy.
It's a good reason, but could be guarded against. My question is: is it bad coding practice in general to use unsigned int?
The Google rule is widely accepted in professional circles. The problem is that the unsigned integral types are sort of broken, and have unexpected and unnatural behavior when used for numeric values; they don't work well as a cardinal type. For example, an index into an array may never be negative, but it makes perfect sense to write abs(i1 - i2) to find the distance between two indices. Which won't work if i1 and i2 have unsigned types.
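A quick illustration of that pitfall (the printed value assumes a 32-bit unsigned int):

#include <cstdlib>
#include <iostream>

int main() {
    unsigned int i1 = 2, i2 = 5;
    std::cout << (i1 - i2) << '\n';          // wraps around: 4294967293 with 32-bit unsigned
    int s1 = 2, s2 = 5;
    std::cout << std::abs(s1 - s2) << '\n';  // 3, as intended
}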
As a general rule, this particular rule in the Google style guidelines corresponds more or less to what the designers of the language intended. Any time you see something other than int, you can assume a special reason for it. If it is because of the range, it will be long or long long, or even int_least64_t. Using unsigned types is generally a signal that you're dealing with bits, rather than the numeric value of the variable, or (at least in the case of unsigned char) that you're dealing with raw memory.
With regards to the "self-documentation" of using an unsigned: this doesn't hold up, since there are almost always a lot of values that the variable cannot (or should not) take, including many positive ones. C++ doesn't have sub-range types, and the way unsigned is defined means that it cannot really be used as one either.
This guideline is extremely misleading. Blindly using int instead of unsigned int won't solve anything. That simply shifts the problems somewhere else. You absolutely must be aware of integer overflow when doing arithmetic on fixed precision integers. If your code is written in a way that it does not handle integer overflow gracefully for some given inputs, then your code is broken regardless of whether you use signed or unsigned ints. With unsigned ints you must be aware of integer underflow as well, and with doubles and floats you must be aware of many additional issues with floating point arithmetic.
Just look at this article, published by none other than Google, about a bug in the standard Java binary search algorithm to see why you must be aware of integer overflow. In fact, that very article shows C++ code casting to unsigned int in order to guarantee correct behavior. The article also starts out by presenting a bug in Java where, guess what, they don't have unsigned int. Yet they still ran into a bug with integer overflow.
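The bug in question is the classic midpoint computation; here is a sketch of the problem and the usual fixes (the function names are illustrative):

// Midpoint of two non-negative indices low <= high.
int midpoint_buggy(int low, int high) {
    return (low + high) / 2;               // can overflow when low + high > INT_MAX
}
int midpoint_safe(int low, int high) {
    return low + (high - low) / 2;         // no overflow for 0 <= low <= high
}
int midpoint_unsigned(int low, int high) {
    // The style of fix shown in the article's C++ code: do the sum in
    // unsigned, where wraparound is well defined, then halve.
    return (int)(((unsigned)low + (unsigned)high) >> 1);
}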
Use the right type for the operations which you will perform. float wouldn't make sense for a counter. Nor does signed int. The normal operations on the counter are print and +=1.
Even if you had some unusual operations, such as printing the difference in viewcounts, you wouldn't necessarily have a problem. Sure, other answers mention the incorrect abs(i2-i1) but it's not unreasonable to expect programmers to use the correct max(i2,i1) - min(i2,i1). Which does have range issues for signed int. No uniform solution here; programmers should understand the properties of the types they're working with.
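For example, a small sketch:

#include <algorithm>
#include <iostream>

int main() {
    unsigned i1 = 2, i2 = 5;
    // Distance between two unsigned counters without abs():
    unsigned dist = std::max(i1, i2) - std::min(i1, i2);  // 3, no wraparound
    std::cout << dist << '\n';
}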
Google states that: "Some people, including some textbook authors, recommend using unsigned types to represent numbers that are never negative. This is intended as a form of self-documentation."
I personally use unsigned ints as index parameters.
int foo(unsigned int index, int* myArray){
    return myArray[index];
}
Google suggests: "Document that a variable is non-negative using assertions. Don't use an unsigned type."
int foo(int index, int* myArray){
    assert(index >= 0);
    return myArray[index];
}
Pro for Google: If a negative number is passed in debug mode, my code will only hopefully produce an out-of-bounds error, while Google's code is guaranteed to assert.
Pro for me: My code can support a greater size of myArray.
I think the actual deciding factor comes down to how clean your code is. If you clean up all warnings, it will be clear when the compiler warns you that you're trying to assign a signed variable to an unsigned variable. If your code already has a bunch of warnings, the compiler's warning is going to be lost on you.
A final note here: Google says: "Sometimes gcc will notice this bug and warn you, but often it will not." I haven't seen that to be the case on Visual Studio; checks against negative numbers and assignments from signed to unsigned are always warned about. But if you use gcc, you might want to take care.
Your specific question is: "Is it bad practice to use unsigned?" To which the only correct answer can be no. It is not bad practice.
There are many style guides, each with a different focus. While in some cases an organisation, given its typical toolchain and deployment platform, may choose not to use unsigned for its products, other toolchains and platforms almost demand its use.
Google seem to get a lot of deference because they have a good business model (and probably employ some smart people like everyone else).
CERT, IIRC, recommends unsigned for buffer indexes, because if you do overflow, at least you'll still be in your own buffer; there is some intrinsic security in that.
What do the language and standard library designers say (probably the best representation of accepted wisdom)? strlen returns a size_t, which is unsigned (its width is platform dependent); other answers suggest this is an anachronism because shiny new computers have wide architectures, but this misses the point that C and C++ are general-purpose programming languages and should scale well on big and small platforms.
Bottom line is that this is one of many religious questions; certainly not settled, and in these cases, I normally go with my religion for green field developments, and go with the existing convention of the codebase for existing work. Consistency matters.

Why worry about 'undefined behavior' in >> of signed type?

My question is related to this one and will contain a few questions.
For me, the most obvious solution (meaning I would use it in my code) to the above problem is just this:
uint8_t x = /* some value */;
x = (int8_t)x >> 7;
Yes, yes I hear you all .... undefined behavior and this is why I've not posted my 'solution'.
I have a feeling (maybe it is only my sick mind) that the term 'undefined behavior' is overused on SO just to justify downvoting someone if a question is tagged C/C++.
So - let's (for a while) put aside C/C++ standards and think about everyday life/programming, real compiler implementations and code they generate for contemporary hardware.
Taking into account the following:
As far as I remember, all the hardware I have encountered has distinct instructions for arithmetic and logical shifts.
All compilers that I know translate >> into an arithmetic shift for signed types and a logical shift for unsigned types.
I cannot recall any compiler ever emitting a div-like low-level instruction when >> was used in C/C++ code (and we are not talking about operator overloading here).
All the hardware I know uses two's complement (U2).
So ... is there anything (any contemporary compiler, hardware) that behaves differently than mentioned above? Put simply, should I ever be worried about a right shift of a signed value not being translated to an arithmetic shift?
My 'solution' compiles to just one low level instruction on many platforms while others require multiple low level instructions. What would you use in your code?
Truth please ;-)
Why worry about 'undefined behavior' in >> of signed type?
Because it doesn't really matter how well defined any particular undefined behaviour is now; the point is that it may break at any point in the future. You're relying on a side-effect that may be optimized (or un-optimized) away at any point for any reason or no reason.
Also, I don't want to have to ask somebody with detailed knowledge of many different compiler's implementations before I use something I shouldn't use in the first place, so I skip it.
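If all you want is to broadcast the top bit of an 8-bit value, the same effect is available using only unsigned arithmetic, which is fully defined; a sketch (the function name is illustrative):

#include <cstdint>
#include <cstdio>

// Spread the top bit of x to all eight bits: 0x00 or 0xFF.
// 0u - 1u wraps to UINT_MAX, which is well defined for unsigned types.
static uint8_t broadcast_top_bit(uint8_t x) {
    return static_cast<uint8_t>(0u - (x >> 7));
}

int main() {
    std::printf("%02X %02X\n", broadcast_top_bit(0x80), broadcast_top_bit(0x7F));  // FF 00
}

Compilers will often reduce this to the same couple of instructions as the cast-and-shift version.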
Yes, there are compilers which behave different from what you assume.
In particular, optimization phases within compilers. These take advantage of the known possible values of variables, and will derive those possible values from the absence of UB. A pointer must be non-null if it's been dereferenced, an integer must be non-zero if it's been used as a divider, and a right-shifted value must be non-negative.
And that works back in time:
if (x < 0) {
    printf("This is dead code\n");
}
x >> 3;
What it really comes down to is, are you willing to take the risk?
"The standard doesn't guarantee yada yada" is nice and all, but let's be honest now, the risk isn't big. If you're going to run your code on some crazy platform, you generally know in advance. And if it takes you by surprise, well, that's the risk you took.
Also, the workaround is horrible. If you're not going to need it, it's just polluting your codebase with pointless "function calls instead of right shifts" that will be harder to maintain (and thus carry a cost). And you'll never be able to "paste and forget" code from other places into the project - you'd always have to check the code for the possibility of right-shifting negative signed integers.

No useful and reliable way to detect integer overflow in C/C++?

No, this is not a duplicate of How to detect integer overflow?. The issue is the same but the question is different.
The gcc compiler can optimize away an overflow check (with -O2), for example:
int a, b;
b = abs(a); // will overflow if a = 0x80000000
if (b < 0) printf("overflow"); // optimized away
The gcc people argue that this is not a bug. Overflow is undefined behavior, according to the C standard, which allows the compiler to do anything. Apparently, anything includes assuming that overflow never happens. Unfortunately, this allows the compiler to optimize away the overflow check.
The safe way to check for overflow is described in a recent CERT paper. This paper recommends doing something like this before adding two integers:
if ( ((si1^si2) | (((si1^(~(si1^si2) & INT_MIN)) + si2)^si2)) >= 0) {
    /* handle error condition */
} else {
    sum = si1 + si2;
}
Apparently, you have to do something like this before every +, -, *, / and other operations in a series of calculations when you want to be sure that the result is valid. For example if you want to make sure an array index is not out of bounds. This is so cumbersome that practically nobody is doing it. At least I have never seen a C/C++ program that does this systematically.
Now, this is a fundamental problem:
Checking an array index before accessing the array is useful, but not reliable.
Checking every operation in the series of calculations with the CERT method is reliable but not useful.
Conclusion: There is no useful and reliable way of checking for overflow in C/C++!
I refuse to believe that this was intended when the standard was written.
I know that there are certain command line options that can fix the problem, but this doesn't alter the fact that we have a fundamental problem with the standard or the current interpretation of it.
Now my question is:
Are the gcc people taking the interpretation of "undefined behavior" too far when it allows them to optimize away an overflow check, or is the C/C++ standard broken?
Added note:
Sorry, you may have misunderstood my question. I am not asking how to work around the problem - that has already been answered elsewhere. I am asking a more fundamental question about the C standard. If there is no useful and reliable way of checking for overflow then the language itself is dubious. For example, if I make a safe array class with bounds checking then I should be safe, but I'm not if the bounds checking can be optimized away.
If the standard allows this to happen then either the standard needs revision or the interpretation of the standard needs revision.
Added note 2:
People here seem unwilling to discuss the dubious concept of "undefined behavior". The fact that the C99 standard lists 191 different kinds of undefined behavior (link) is an indication of a sloppy standard.
Many programmers readily accept the statement that "undefined behavior" gives the license to do anything, including formatting your hard disk. I think it is a problem that the standard puts integer overflow into the same dangerous category as writing outside array bounds.
Why are these two kinds of "undefined behavior" different? Because:
Many programs rely on integer overflow being benign, but few programs rely on writing outside array bounds when you don't know what is there.
Writing outside array bounds actually can do something as bad as formatting your hard disk (at least in an unprotected OS like DOS), and most programmers know that this is dangerous.
When you put integer overflow into the dangerous "anything goes" category, it allows the compiler to do anything, including lying about what it is doing (in the case where an overflow check is optimized away).
An error such as writing outside array bounds can be found with a debugger, but the error of optimizing away an overflow check cannot, because optimization is usually off when debugging.
The gcc compiler evidently refrains from the "anything goes" policy in case of integer overflow. There are many cases where it refrains from optimizing e.g. a loop unless it can verify that overflow is impossible. For some reason, the gcc people have recognized that we would have too many errors if they followed the "anything goes" policy here, but they have a different attitude to the problem of optimizing away an overflow check.
Maybe this is not the right place to discuss such philosophical questions. At least, most answers here are off the point. Is there a better place to discuss this?
The gcc developers are entirely correct here. When the standard says that the behavior is undefined that means exactly that there are no requirements on the compiler.
As a valid program cannot do anything that causes UB (as then it would not be valid anymore), the compiler can very well assume that UB doesn't happen. And if it still does, anything the compiler does would be ok.
For your problem with overflow, one solution is to consider what ranges the calculations are supposed to handle. For example, when balancing my bank account I can assume that the amounts would be well below 1 billion, so a 32-bit int will work.
For your application domain you can probably do similar estimates about exactly where an overflow could be possible. Then you can add checks at those points or choose another data type, if available.
int a, b;
b = abs(a); // will overflow if a = 0x80000000
if (b < 0) printf("overflow"); // optimized away
(You seem to be assuming 2s complement... let's run with that)
Who says abs(a) "overflows" if a has that binary pattern (more accurately, if a is INT_MIN)? The Linux man page for abs(int) says:
Trying to take the absolute value of the most negative integer is not defined.
Not defined doesn't necessarily mean overflow.
So, your premise that b could ever be less than 0, and that's somehow a test for "overflow", is fundamentally flawed from the start. If you want to test, you cannot do it on a result that may have undefined behaviour - do it before the operation instead!
If you care about this, you can use C++'s user-defined types (i.e. classes) to implement your own set of tests around the operations you need (or find a library that already does that). The language does not need inbuilt support for this, as it can be implemented equally efficiently in such a library, with the resulting semantics of use unchanged. That fundamental power is one of the great things about C++.
Ask yourself: how often do you actually need checked arithmetic? If you need it often you should write a checked_int class that overloads the common operators and encapsulate the checks into this class. Props for sharing the implementation on an Open Source website.
Better yet (arguably), use a big_integer class so that overflows can’t happen in the first place.
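For illustration, a minimal sketch of such a class, showing only addition; the name and interface are made up:

#include <limits>
#include <stdexcept>

// Illustrative checked_int: tests the operands before adding, so no
// signed overflow ever actually occurs.
class checked_int {
    int value_;
public:
    explicit checked_int(int v) : value_(v) {}
    int get() const { return value_; }

    checked_int operator+(checked_int rhs) const {
        if ((rhs.value_ > 0 && value_ > std::numeric_limits<int>::max() - rhs.value_) ||
            (rhs.value_ < 0 && value_ < std::numeric_limits<int>::min() - rhs.value_)) {
            throw std::overflow_error("checked_int: addition overflow");
        }
        return checked_int(value_ + rhs.value_);
    }
};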
Just use the correct type for b:
int a;
unsigned b = a;
if (b == (unsigned)INT_MIN) printf("overflow"); // never optimized away
else b = abs(a);
Edit: Testing for overflow in C can be safely done with unsigned types. Unsigned types just wrap around on arithmetic, and signed values are safely converted to them. So you can do any test on them that you like. On modern processors this conversion is usually just a reinterpretation of a register or so, so it comes at no runtime cost.
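A sketch of that idea (the helper name is illustrative): do the addition in unsigned, where wraparound is well defined, then inspect the sign bits.

#include <climits>

// True if x + y would overflow a signed int.
bool add_overflows(int x, int y) {
    unsigned ux = (unsigned)x, uy = (unsigned)y;
    unsigned usum = ux + uy;                     // well-defined modular addition
    // Overflow would occur iff x and y share a sign and the wrapped sum differs.
    return ((ux ^ usum) & (uy ^ usum) & ~(UINT_MAX >> 1)) != 0;
}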

`short int` vs `int`

Should I bother using short int instead of int? Is there any useful difference? Any pitfalls?
short vs int
Don't bother with short unless there is a really good reason such as saving memory on a gazillion values, or conforming to a particular memory layout required by other code.
Using lots of different integer types just introduces complexity and possible wrap-around bugs.
On modern computers it might also introduce needless inefficiency.
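For example, a typical wrap-around hazard when mixing widths (the printed value assumes a 16-bit short):

#include <iostream>

int main() {
    short total = 30000;
    short bonus = 10000;
    // The addition happens in int; assigning back to short quietly narrows it.
    total = total + bonus;        // 40000 doesn't fit in a 16-bit short
    std::cout << total << '\n';   // commonly prints -25536
}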
const
Sprinkle const liberally wherever you can.
const constrains what might change, making it easier to understand the code: you know that this beastie is not gonna move, so it can be ignored, and your thinking directed at more useful/relevant things.
Top-level const for formal arguments is, however, by convention omitted, possibly because the gain is not enough to outweigh the added verbosity.
Also, in a pure declaration of a function, top-level const for an argument is simply ignored by the compiler. On the other hand, some other tools may not be smart enough to ignore it when comparing pure declarations to definitions, and one person cited that in an earlier debate on the issue in the comp.lang.c++ Usenet group. So it depends to some extent on the toolchain, but happily I've never used tools that place any significance on those consts.
Cheers & hth.,
Absolutely not in function arguments. Few calling conventions are going to make any distinction between short and int. If you're making giant arrays you could use short if your data fits in short to save memory and increase cache effectiveness.
What Ben said. You will actually create less efficient code, since the upper bits of the registers need to be stripped out whenever comparisons are done. Unless you need to save memory because you have tons of them, use the native integer size. That's what int is for.
EDIT: Didn't even see your sub-question about const. Using const on intrinsic types (int, float) is useless, but any pointers/references should absolutely be const whenever applicable. Same for class methods as well.
The question is technically malformed "Should I use short int?". The only good answer will be "I don't know, what are you trying to accomplish?".
But let's consider some scenarios:
You know the definite range of values that your variable can take.
The ranges for signed integers are:
signed char: -2⁷ to 2⁷-1
short: -2¹⁵ to 2¹⁵-1
int: -2¹⁵ to 2¹⁵-1
long: -2³¹ to 2³¹-1
long long: -2⁶³ to 2⁶³-1
We should note here that these are guaranteed ranges, they can be larger in your particular implementation, and often are. You are also guaranteed that the previous range cannot be larger than the next, but they can be equal.
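You can always query what your particular implementation actually provides, for example:

#include <iostream>
#include <limits>

int main() {
    // Print the actual (not merely guaranteed) ranges on this platform.
    std::cout << "short: " << std::numeric_limits<short>::min() << " to "
              << std::numeric_limits<short>::max() << '\n';
    std::cout << "int:   " << std::numeric_limits<int>::min() << " to "
              << std::numeric_limits<int>::max() << '\n';
}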
You will quickly note that short and int actually have the same guaranteed range. This gives you very little incentive to use it. The only reason to use short given this situation becomes giving other coders a hint that the values will be not too large, but this can be done via a comment.
It does, however, make sense to use signed char, if you know that you can fit every potential value in the range -128 to 127.
You don't know the exact range of potential values.
In this case you are in a rather bad position to attempt to minimise memory usage, and should probably use at least int. Although it has the same minimum range as short, on many platforms it may be larger, and this will help you out.
But the bigger problem is that you are trying to write a piece of software that operates on values, the range of which you do not know. Perhaps something wrong has happened before you have started coding (when requirements were being written up).
You have an idea about the range, but realise that it can change in the future.
Ask yourself how close to the boundary you are. If we are talking about something that goes from -1000 to +1000 and can potentially change to -1500 to +1500, then by all means use short. The specific architecture may pad your value, which will mean you won't save any space, but you won't lose anything. However, if we are dealing with some quantity that is currently -14000 to +14000 and can grow unpredictably (perhaps it's some financial value), then don't just switch to int, go to long right away. You will lose some memory, but will save yourself a lot of headache catching these roll-over bugs.
short vs int - If your data will fit in a short, use a short. Save memory. Make it easier for the reader to know how much data your variable can hold.
use of const - Great programming practice. If your data should be a const then make it const. It is very helpful when someone reads your code.