Why is std::ssize() introduced in C++20? - c++

C++20 introduced the std::ssize() free function as below:
template <class C>
constexpr auto ssize(const C& c)
    -> std::common_type_t<std::ptrdiff_t,
                          std::make_signed_t<decltype(c.size())>>;
A possible implementation would use static_cast to convert the return value of the size() member function of class C into its signed counterpart.
Since the size() member function of C always returns non-negative values, why would anyone want to store them in signed variables? If one really wants to, it is a matter of a simple static_cast.
Why is std::ssize() introduced in C++20?

The rationale is described in this paper. A quote:
When span was adopted into C++17, it used a signed integer both as an index and a size. Partly this was to allow for the use of "-1" as a sentinel value to indicate a type whose size was not known at compile time. But having an STL container whose size() function returned a signed value was problematic, so P1089 was introduced to "fix" the problem. It received majority support, but not the 2-to-1 margin needed for consensus.
This paper, P1227, was a proposal to add non-member std::ssize and member ssize() functions. The inclusion of these would make certain code much more straightforward and allow for the avoidance of unwanted unsigned-ness in size computations. The idea was that the resistance to P1089 would decrease if ssize() were made available for all containers, both through std::ssize() and as member functions.

Gratuitously stolen from Eric Niebler:
'Unsigned types signal that a negative index/size is not sane' was
the prevailing wisdom when the STL was first designed. But logically,
a count of things need not be positive. I may want to keep a count in
a signed integer to denote the number of elements either added to or
removed from a collection. Then I would want to combine that with the
size of the collection. If the size of the collection is unsigned, now
I'm forced to mix signed and unsigned arithmetic, which is a bug farm.
Compilers warn about this, but because the design of the STL pretty
much forces programmers into this situation, the warning is so common
that most people turn it off. That's a shame because this hides real
bugs.
Use of unsigned ints in interfaces isn't the boon many people think it
is. If by accident a user passes a slightly negative number to the
API, it suddenly becomes a huge positive number. Had the API taken the
number as signed, then it can detect the situation by asserting the
number is greater than or equal to zero.
If we restrict our use of unsigned ints to bit twiddling (e.g., masks)
and use signed ints everywhere else, bugs are less likely to occur,
and easier to detect when they do occur.

Related

Integer comparison. Use same signedness vs c++20 std::cmp_

Integer comparison, although seemingly a simple matter, can involve some unexpected implications that are hard to notice in the code itself for the untrained eye. Take the following piece of code for example: -1 > 10U;. Applying the usual arithmetic conversion rules (which long predate C++20), it turns out to be equivalent to static_cast<unsigned>(-1) > 10U; (for a 32-bit unsigned integer, static_cast<unsigned>(-1) is 0xFFFFFFFF).
C++20 introduces std::cmp_* functions to achieve correct comparison behavior even between integer values of different signedness and size. Before C++20, when you had to write an algorithm that does some integer comparisons, either you had to write your own comparison functions, or use integer types of the same signedness (or play along with the implicit integral conversions, whose results depend on the sizes of the types involved).
This is where you are faced with the design choice of using the same signedness for all integers that compare together (read: between them), or use the new functions from C++20. For example, sometimes it makes perfect sense to use an unsigned type (like std::size_t) to represent sizes (that are never negative), but then you might need to calculate the difference between the sizes of two objects, or make sure they differ by at most some amount (again, never negative).
Using the same signedness for all types that compare together in this scenario would mean using signed integers, because you need to be able to compute the difference of two of these values without knowing which one is bigger (with 32-bit unsigned arithmetic, 1 - 2 wraps to 0xFFFFFFFF). But that means you lose half of the possible integer representations, paying for a feature (a sign bit) you never really use.
Using the C++20 comparison functions sacrifices some ease of reading, and also demands a little more code (x <= 7 vs std::cmp_less_equal(x, 7)). Apart from these facts, are there other differences or advantages that arise from the use of one alternative over the other? Are there any situations where one of them would be preferable? I'm especially interested in performance-critical code. What impact does this choice have on performance?

Why is the template-parameter for std::counting_semaphore<> a ptrdiff_t?

Why is the template-parameter for std::counting_semaphore<> a ptrdiff_t and not a size_t ? For me there's no sense in having negative maximum semaphore counting values.
size_t is semantically not a count, it is a size. Moreover, both Herb and Bjarne (et al.) have repeatedly spoken about how the existing use of unsigned integers in the STL was a mistake to begin with. Even if it semantically can make sense to use unsigned integers for representing numbers such as "size" and "count", it comes with all the costs and confusion of using unsigned integers (well-defined wrap-around) as compared to the optimizer-friendly signed integers.
Note however that this discussion (with the "unsigned was a mistake" view supported by e.g. Bjarne, Herb, and Meyers) has historically been a divider in the community.
See e.g. Bjarne's P1428R0 (Subscripts and sizes should be signed):
The Problem
[...] I will dig into the arguments and consider alternatives, but my conclusion stands:
The original use of unsigned for the STL was a bad mistake and should be corrected (eventually)
Why we have unsigned subscripts in the STL
As far as I remember (the STL is 25 years old so my memory may not be completely accurate) three reasons were given for the STL using unsigned types for subscripts
[...]
Basically, we were wrong on all three counts [...]
Span
[...] This was an opportunity to start an effort to convert the STL away from its mistaken use of unsigned for subscripts.
Unsigned sizes
Unfortunately, sizeof yields an unsigned (and it would be hard to change that), but we don’t have to follow that for all types with something to do with sizes.
There are ongoing activities (e.g. P1227R2: Signed ssize() functions, unsigned size() functions) to slowly correct this argued "mistake of the past", and particularly not blindly use the same argued mistake for new features.

Is std::ssize() still needed in C++20? [duplicate]


Why does Qt implement QFile::size() which returns a qint64 rather than quint64 [duplicate]

The question is clear.
I wonder why they even thought this would be handy, as clearly negative indices are unusable in the containers that would be used with them (see for example QList's docs).
I thought they wanted to allow that for some crazy form of indexing, but it seems unsupported?
It also generates a ton of (correct) compiler warnings about casting to and comparing of signed/unsigned types (on MSVC).
It just seems incompatible with the STL by design for some reason...
Although I am deeply sympathetic to Chris's line of reasoning, I will disagree here (at least in part, I am playing devil's advocate). There is nothing wrong with using unsigned types for sizes, and it can even be beneficial in some circumstances.
Chris's justification for signed size types is that they are naturally used as array indices, and you may want to do arithmetic on array indices, and that arithmetic may create temporary values that are negative.
That's fine, and unsigned arithmetic introduces no problem in doing so, as long as you make sure to interpret your values correctly when you do comparisons. Because the overflow behavior of unsigned integers is fully specified, temporary overflows into the negative range (or into huge positive numbers) do not introduce any error as long as they are corrected before a comparison is performed.
Sometimes, the overflow behavior is even desirable, as the overflow behavior of unsigned arithmetic makes certain range checks expressible as a single comparison that would require two comparisons otherwise. If I want to check if x is in the half-open range [a,b) and all the values are unsigned, I can simply do:
if (x - a < b - a) {
}
That doesn't work with signed variables; such range checks are pretty common with sizes and array offsets.
I mentioned before that a benefit is that overflow arithmetic has defined results. If your index arithmetic overflows a signed type, the behavior is undefined (only the conversion of an out-of-range value to a signed type is implementation-defined); there is no way to make your program portable. Use an unsigned type and this problem goes away. Admittedly this only applies to huge offsets, but it is a concern for some uses.
Basically, the objections to unsigned types are frequently overstated. The real problem is that most programmers don't really think about the exact semantics of the code they write, and for small integer values, signed types behave more nearly in line with their intuition. However, data sizes grow pretty fast. When we deal with buffers or databases, we're frequently way outside of the range of "small", and signed overflow is far more problematic to handle correctly than is unsigned overflow. The solution is not "don't use unsigned types", it is "think carefully about the code you are writing, and make sure you understand it".
Because, realistically, you usually want to perform arithmetic on indices, which means that you might want to create temporaries that are negative.
This is clearly painful when the underlying indexing type is unsigned.
The only appropriate time to use unsigned numbers is with modulus arithmetic.
Using "unsigned" as some kind of contract specifier ("a number in the range [0...") is just clumsy, and too coarse to be useful.
Consider: What type should I use to represent the idea that the number should be a positive integer between 1 and 10? Why is 0...2^x a more special range?

Is it a best practice to use unsigned data types to enforce non-negative and/or valid values?

Recently, during a refactoring session, I was looking over some code I wrote and noticed several things:
I had functions that used unsigned char to enforce values in the interval [0-255].
Other functions used int or long data types with if statements inside the functions to silently clamp the values to valid ranges.
Values contained in classes and/or declared as arguments to functions that had an unknown upper bound but a known and definite non-negative lower bound were declared as an unsigned data type (int or long depending on the possibility that the upper bound went above 4,000,000,000).
The inconsistency is unnerving. Is this a good practice that I should continue? Should I rethink the logic and stick to using int or long with appropriate non-notifying clamping?
A note on the use of "appropriate": There are cases where I use signed data types and throw notifying exceptions when the values go out of range, but these are reserved for divide by zero and constructors.
In C and C++, signed and unsigned integer types have certain specific characteristics.
Signed types have bounds far from zero, and operations that exceed those bounds have undefined behavior (or implementation-defined in the case of conversions).
Unsigned types have a lower bound of zero and an upper bound far from zero, and operations that exceed those bounds quietly wrap around.
Often what you really want is a particular range of values with some particular behavior when operations exceed those bounds (saturation, signaling an error, etc.). Neither signed nor unsigned types are entirely suitable for such requirements. And operations that mix signed and unsigned types can be confusing; the rules for such operations are defined by the language, but they're not always obvious.
Unsigned types can be problematic because the lower bound is zero, so operations with reasonable values (nowhere near the upper bound) can behave in unexpected ways. For example, this:
for (unsigned int u = 10; u >= 0; u --) {
// ...
}
is an infinite loop.
One approach is to use signed types for everything that doesn't absolutely require an unsigned representation, choosing a type wide enough to hold the values you need. This avoids problems with signed/unsigned mixed operations. Java, for example, enforces this approach by not having unsigned types at all. (Personally, I think that decision was overkill, but I can see the advantages of it.)
Another approach is to use unsigned types for values that logically cannot be negative, and be very careful with expressions that might underflow or that mix signed and unsigned types.
(Yet another is to define your own types with exactly the behavior you want, but that has costs.)
As John Sallay's answer says, consistency is probably more important than which particular approach you take.
I wish I could give a "this way is right, that way is wrong" answer, but there really isn't one.
The biggest benefit of unsigned is that it documents in your code that the values are always non-negative.
It doesn't really buy you any safety, as going outside the range of an unsigned is usually unintentional and can cause just as much frustration as if it were signed.
I had functions that used unsigned char to enforce values in the interval [0-255].
If you're relying on the wraparound then use uint8_t as unsigned char could possibly be more than 8 bits.
Other functions used int or long data types with if statements inside the functions to silently clamp the values to valid ranges.
Is this really the correct behavior?
Values contained in classes and/or declared as arguments to functions that had an unknown upper bound but a known and definite non-negative lower bound were declared as an unsigned data type (int or long depending on the possibility that the upper bound went above 4,000,000,000).
Where did you get an upper bound of 4,000,000,000 from? That bound lies between INT_MAX and UINT_MAX (you can also use std::numeric_limits). In C++11 you can use decltype to specify the type, which you can wrap into a template/macro:
decltype(4000000000) x; // x can hold at least 4000000000
I would probably argue that consistency is most important. If you pick one way and do it right then it will be easy for someone else to understand what you are doing at a later point in time. On the note of doing it right, there are several issues to think about.
First, it is common when checking if an integer variable n is in a valid range, say 0 to N to write:
if ( n >= 0 && n <= N ) ...
This comparison only makes sense if n is signed. If n is unsigned then it will never be less than 0 since negative values will wrap around. You could rewrite the above if as just:
if ( n <= N ) ...
If someone isn't used to seeing this, they might be confused and think you did it wrong.
Second, I would keep in mind that there is no guarantee of type size for integers in C++. Thus, if you want something to be bounded by 255, an unsigned char may not do the trick. If the variable has a specific meaning then it may be valuable to make a typedef to show that. For example, size_t is a type wide enough to represent the size of any object in memory, which means that you can use it with arrays and not have to worry about being on a 32- or 64-bit machine. I try to use such typedefs whenever possible because they clearly communicate why I am using the type. (size_t because I'm accessing an array.)
Third is back on the issue of wrap-around. What do you want to happen with an invalid number? In the case of an unsigned char, if you use the type to bound the data, then you won't be able to check if a value over 255 was entered. That may or may not be a problem.
This is a subjective issue but I'll give you my take.
Personally, if there isn't a type designated for the operation I am trying to carry out (i.e. std::size_t for sizes and indices, uintXX_t for specific bit depths, etc.), then I default to unsigned unless I need to use negative values.
So it isn't a case of using it to enforce positive values; rather, I have to select signedness explicitly.
Beyond this, if you are worried about boundaries, then you need to do your own bounds checking to ensure that you aren't overflowing.
As I said, more often than not your data type will be decided by your context, such as the return types of the functions you apply it to.