Integer comparison: use the same signedness vs. C++20 std::cmp_* - c++

Integer comparison, although seemingly a simple matter, can have unexpected implications that are hard to notice in the code itself for the untrained eye. Take the following piece of code for example: -1 > 10U;. Applying the usual arithmetic conversions (which long predate C++20; they come from C), this turns out to be equivalent to static_cast<unsigned>(-1) > 10U; (for a 32-bit unsigned integer, -1 converts to 0xFFFFFFFF), so the expression is true.
C++20 introduces the std::cmp_* family of functions (in <utility>) to get mathematically correct comparisons even between integer values of different signedness and size. Before C++20, when you had to write an algorithm that performs integer comparisons, you either had to write your own comparison functions or use integer types of the same signedness (or play along with the implicit integral conversions; note that converting an out-of-range value to a signed type was implementation-defined before C++20, while conversion to an unsigned type has always been well-defined modulo 2^N).
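For illustration, a minimal sketch contrasting the built-in comparison with std::cmp_*, assuming a C++20 compiler:

#include <utility>    // std::cmp_greater, std::cmp_less_equal (C++20)
#include <iostream>

int main() {
    std::cout << std::boolalpha;
    // Built-in comparison: -1 is converted to unsigned before comparing.
    std::cout << (-1 > 10U) << '\n';                     // true
    // C++20: compares the mathematical values, regardless of signedness.
    std::cout << std::cmp_greater(-1, 10U) << '\n';      // false
    std::cout << std::cmp_less_equal(-1, 10U) << '\n';   // true
}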
This is where you face a design choice: either use the same signedness for all integers that are compared with one another, or use the new functions from C++20. For example, sometimes it makes perfect sense to use an unsigned type (like std::size_t) to represent sizes (which are never negative), but then you might need to calculate the difference between the sizes of two objects, or make sure they differ by at most some amount (again, never negative).
Using the same signedness for all values that are compared in this scenario would mean using signed integers, because you need to be able to compute the difference of two of these values without knowing which one is bigger (with 32-bit unsigned arithmetic, 1 - 2 wraps around to 0xFFFFFFFF). But that means you lose half of the representable range, paying for a feature (a sign bit) you never really use.
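To make the trade-off concrete, a small sketch (the helper names size_diff and signed_diff are made up for illustration):

#include <cstddef>

// Stay unsigned: branch so the subtraction can never wrap around.
std::size_t size_diff(std::size_t a, std::size_t b) {
    return a > b ? a - b : b - a;
}

// Use the same signed type everywhere: the difference is naturally signed,
// but the representable size range is halved (and signed overflow is UB).
std::ptrdiff_t signed_diff(std::ptrdiff_t a, std::ptrdiff_t b) {
    return a - b;
}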
Using the C++20 comparison functions sacrifices some ease of reading and also demands a little more typing (x <= 7 vs std::cmp_less_equal(x, 7)). Apart from that, are there other differences or advantages that arise from using one alternative over the other? Are there situations where one of them would be preferable? I'm especially interested in performance-critical code. What impact does this choice have on performance?

Related

Should std::ssize() still be needed in C++20? [duplicate]

Why does Qt implement QFile::size() which returns a qint64 rather than quint64 [duplicate]

The question is clear.
I wonder why they even thought this would be handy, as clearly negative indices are unusable in the containers that would be used with them (see for example QList's docs).
I thought they wanted to allow that for some crazy form of indexing, but it seems unsupported?
It also generates a ton of (correct) compiler warnings about casts between and comparisons of signed and unsigned types (on MSVC).
It just seems incompatible with the STL by design for some reason...
Although I am deeply sympathetic to Chris's line of reasoning, I will disagree here (at least in part, I am playing devil's advocate). There is nothing wrong with using unsigned types for sizes, and it can even be beneficial in some circumstances.
Chris's justification for signed size types is that they are naturally used as array indices, and you may want to do arithmetic on array indices, and that arithmetic may create temporary values that are negative.
That's fine, and unsigned arithmetic introduces no problem in doing so, as long as you make sure to interpret your values correctly when you do comparisons. Because the wrap-around behavior of unsigned integers is fully specified, intermediate results that would mathematically be negative simply wrap into huge positive values, and that introduces no error as long as it is corrected before a comparison is performed.
Sometimes the wrap-around is even desirable, as the modular behavior of unsigned arithmetic makes certain range checks expressible as a single comparison that would otherwise require two. If I want to check whether x is in the range [a, b] and all the values are unsigned, I can simply do:
if (x - a <= b - a) {
    // x is within [a, b]
}
That doesn't work with signed variables; such range checks are pretty common with sizes and array offsets.
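For comparison, a minimal sketch of both forms (the function names are made up, and a <= b is assumed):

#include <cstdint>

// One comparison: relies on well-defined unsigned wrap-around when x < a.
bool in_range_unsigned(std::uint32_t x, std::uint32_t a, std::uint32_t b) {
    return x - a <= b - a;   // true iff a <= x <= b
}

// Signed equivalent: two comparisons are needed.
bool in_range_signed(std::int32_t x, std::int32_t a, std::int32_t b) {
    return a <= x && x <= b;
}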
I mentioned before that a benefit is that wrap-around arithmetic has defined results. If your index arithmetic overflows a signed type, the behavior is undefined; there is no way to make your program portable. Use an unsigned type and this problem goes away. Admittedly this only applies to huge offsets, but it is a concern for some uses.
Basically, the objections to unsigned types are frequently overstated. The real problem is that most programmers don't really think about the exact semantics of the code they write, and for small integer values, signed types behave more nearly in line with their intuition. However, data sizes grow pretty fast. When we deal with buffers or databases, we're frequently way outside of the range of "small", and signed overflow is far more problematic to handle correctly than is unsigned overflow. The solution is not "don't use unsigned types", it is "think carefully about the code you are writing, and make sure you understand it".
Because, realistically, you usually want to perform arithmetic on indices, which means that you might want to create temporaries that are negative.
This is clearly painful when the underlying indexing type is unsigned.
The only appropriate time to use unsigned numbers is with modulus arithmetic.
Using "unsgined" as some kind of contract specifier "a number in the range [0..." is just clumsy, and too coarse to be useful.
Consider: What type should I use to represent the idea that the number should be a positive integer between 1 and 10? Why is 0...2^x a more special range?

Why is std::ssize() introduced in C++20?

C++20 introduced the std::ssize() free function as below:
template <class C>
constexpr auto ssize(const C& c)
    -> std::common_type_t<std::ptrdiff_t,
                          std::make_signed_t<decltype(c.size())>>;
A possible implementation seems to use static_cast to convert the return value of the size() member function of class C into its signed counterpart.
Since the size() member function of C always returns non-negative values, why would anyone want to store them in signed variables? In case one really wants to, it is a matter of a simple static_cast.
Why is std::ssize() introduced in C++20?
The rationale is described in this paper. A quote:
When span was adopted into C++17, it used a signed integer both as an index and a size. Partly this was to allow for the use of "-1" as a sentinel value to indicate a type whose size was not known at compile time. But having an STL container whose size() function returned a signed value was problematic, so P1089 was introduced to "fix" the problem. It received majority support, but not the 2-to-1 margin needed for consensus.
This paper, P1227, was a proposal to add non-member std::ssize and member ssize() functions. The inclusion of these would make certain code much more straightforward and allow for the avoidance of unwanted unsigned-ness in size computations. The idea was that the resistance to P1089 would decrease if ssize() were made available for all containers, both through std::ssize() and as member functions.
Gratuitously stolen from Eric Niebler:
'Unsigned types signal that a negative index/size is not sane' was
the prevailing wisdom when the STL was first designed. But logically,
a count of things need not be positive. I may want to keep a count in
a signed integer to denote the number of elements either added to or
removed from a collection. Then I would want to combine that with the
size of the collection. If the size of the collection is unsigned, now
I'm forced to mix signed and unsigned arithmetic, which is a bug farm.
Compilers warn about this, but because the design of the STL pretty
much forces programmers into this situation, the warning is so common
that most people turn it off. That's a shame because this hides real
bugs.
Use of unsigned ints in interfaces isn't the boon many people think it
is. If by accident a user passes a slightly negative number to the
API, it suddenly becomes a huge positive number. Had the API taken the
number as signed, then it can detect the situation by asserting the
number is greater than or equal to zero.
If we restrict our use of unsigned ints to bit twiddling (e.g., masks)
and use signed ints everywhere else, bugs are less likely to occur,
and easier to detect when they do occur.
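As a minimal sketch of the "bug farm" being described, assuming a C++20 compiler (the variable names are made up):

#include <vector>
#include <iterator>   // std::ssize (C++20)
#include <cstddef>

int main() {
    std::vector<int> v{1, 2, 3};
    std::ptrdiff_t removed = 5;             // a count that may exceed the size

    // Mixed arithmetic: v.size() is unsigned, so instead of -2 the result
    // wraps around to a huge positive std::size_t value.
    auto wrong = v.size() - removed;

    // With std::ssize the whole computation stays signed.
    auto right = std::ssize(v) - removed;   // std::ptrdiff_t, value -2

    (void)wrong; (void)right;
}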

Why isn't there an endianness modifier in C++ like there is for signedness?

(I guess this question could apply to many typed languages, but I chose to use C++ as an example.)
Why is there no way to just write:
struct foo {
    little int x;    // little-endian
    big long int y;  // big-endian
    short z;         // native endianness
};
to specify the endianness for specific members, variables and parameters?
Comparison to signedness
I understand that the type of a variable not only determines how many bytes are used to store a value but also how those bytes are interpreted when performing computations.
For example, these two declarations each allocate one byte, and for both bytes, every possible 8-bit sequence is a valid value:
signed char s;
unsigned char u;
but the same binary sequence might be interpreted differently, e.g. 11111111 would mean -1 when assigned to s but 255 when assigned to u. When signed and unsigned variables are involved in the same computation, the compiler (mostly) takes care of proper conversions.
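A small sketch of that reinterpretation (two's complement is assumed, as C++20 requires):

#include <cstdio>

int main() {
    unsigned char u = 0b11111111;                    // 255
    signed char   s = static_cast<signed char>(u);   // same bit pattern: -1
    std::printf("%d %d\n", s, u);                    // prints: -1 255
}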
In my understanding, endianness is just a variation of the same principle: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
It seems obvious to have that feature in a typed language that allows low-level programming. However, this is not a part of C, C++ or any other language I know, and I did not find any discussion about this online.
Update
I'll try to summarize some takeaways from the many comments that I got in the first hour after asking:
signedness is strictly binary (either signed or unsigned) and will always be, in contrast to endianness, which also has two well-known variants (big and little), but also lesser-known variants such as mixed/middle endian. New variants might be invented in the future.
endianness matters when accessing multiple-byte values byte-wise. There are many aspects beyond just endianness that affect the memory layout of multi-byte structures, so this kind of access is mostly discouraged.
C++ aims to target an abstract machine and minimize the number of assumptions about the implementation. This abstract machine does not have any endianness.
Also, now I realize that signedness and endianness are not a perfect analogy, because:
endianness only defines how something is represented as a binary sequence, but not what can be represented. Both big int and little int would have the exact same value range.
signedness defines how bits and actual values map to each other, but also affects what can be represented, e.g. -3 can't be represented by an unsigned char and (assuming that char has 8 bits) 130 can't be represented by a signed char.
So changing the endianness of some variables would never change the behavior of the program (except for byte-wise access), whereas a change of signedness usually would.
What the standard says
[intro.abstract]/1:
The semantic descriptions in this document define a parameterized nondeterministic abstract machine.
This document places no requirement on the structure of conforming implementations.
In particular, they need not copy or emulate the structure of the abstract machine.
Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
C++ could not define an endianness qualifier since it has no concept of endianness.
Discussion
About the difference between signedness and endianness, the OP wrote
In my understanding, endianness is just a variation of the same principle [(signedness)]: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
I'd argue signedness has both a semantic and a representational aspect (1). What [intro.abstract]/1 implies is that C++ only cares about semantics and never addresses the way a signed number should be represented in memory (2). Actually, "sign bit" only appears once in the C++ specification and refers to an implementation-defined value.
On the other hand, endianness has only a representational aspect: endianness conveys no meaning.
With C++20, std::endian appears. It is still implementation-defined, but it lets us test the endianness of the host without depending on old tricks based on undefined behaviour.
(1) Semantic aspect: a signed integer can represent values below zero; representational aspect: one needs to, for example, reserve a bit to convey the positive/negative sign.
(2) In the same vein, C++ never describes how a floating-point number should be represented. IEEE-754 is often used, but this is a choice made by the implementation, not something enforced by the standard: [basic.fundamental]/8 "The value representation of floating-point types is implementation-defined".
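A minimal illustration of the C++20 facility mentioned above:

#include <bit>   // std::endian (C++20)

constexpr bool host_is_little = std::endian::native == std::endian::little;
constexpr bool host_is_big    = std::endian::native == std::endian::big;
// On a mixed-endian platform, std::endian::native equals neither value.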
In addition to YSC's answer, let's take your sample code, and consider what it might aim to achieve
struct foo {
    little int x;    // little-endian
    big long int y;  // big-endian
    short z;         // native endianness
};
You might hope that this would exactly specify the layout for architecture-independent data interchange (file, network, whatever).
But this can't possibly work, because several things are still unspecified:
data type size: you'd have to use little int32_t, big int64_t and int16_t respectively, if that's what you want
padding and alignment, which cannot be controlled strictly within the language: use #pragma or __attribute__((packed)) or some other compiler-specific extension
actual format (1s- or 2s-complement signedness, floating-point type layout, trap representations)
Alternatively, you might simply want to reflect the endianness of some specified hardware - but big and little don't cover all the possibilities here (just the two most common).
So, the proposal is incomplete (it doesn't distinguish all reasonable byte-ordering arrangements), ineffective (it doesn't achieve what it sets out to), and has additional drawbacks:
Performance
Changing the endianness of a variable from the native byte ordering would either have to disable arithmetic, comparisons, etc. (since the hardware cannot perform them correctly on this type), or silently inject more code, creating natively-ordered temporaries to work on.
The argument here isn't that manually converting to/from native byte order is faster, it's that controlling it explicitly makes it easier to minimise the number of unnecessary conversions, and much easier to reason about how code will behave, than if the conversions are implicit.
Complexity
Everything overloaded or specialized for integer types now needs twice as many versions, to cope with the rare event that it gets passed a non-native-endianness value. Even if that's just a forwarding wrapper (with a couple of casts to translate to/from native ordering), it's still a lot of code for no discernible benefit.
The final argument against changing the language to support this is that you can easily do it in code. Changing the language syntax is a big deal, and doesn't offer any obvious benefit over something like a type wrapper:
#include <algorithm>
#include <cstring>

// store T with reversed byte order
template <typename T>
class Reversed {
    T val_;
    // Portable byte reversal; real code might use platform byte-swap intrinsics.
    static T reverse(T t) {
        unsigned char bytes[sizeof(T)];
        std::memcpy(bytes, &t, sizeof(T));
        std::reverse(bytes, bytes + sizeof(T));
        std::memcpy(&t, bytes, sizeof(T));
        return t;
    }
public:
    explicit Reversed(T t) : val_(reverse(t)) {}
    Reversed(Reversed const &other) : val_(other.val_) {}
    // assignment, move, arithmetic, comparison etc. etc.
    operator T() const { return reverse(val_); }
};
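For illustration, a hypothetical use of the wrapper above (with the byte-reversal sketched in):

#include <cassert>
#include <cstdint>

int main() {
    Reversed<std::uint32_t> len(42);   // stored with its bytes swapped
    std::uint32_t host = len;          // converted back on read
    assert(host == 42);
}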
Integers (as a mathematical concept) have the concept of positive and negative numbers. This abstract concept of sign has a number of different implementations in hardware.
Endianness is not a mathematical concept. Little-endian is a hardware implementation trick to improve the performance of multi-byte twos-complement integer arithmetic on a microprocessor with 16 or 32 bit registers and an 8-bit memory bus. Its creation required using the term big-endian to describe everything else that had the same byte-order in registers and in memory.
The C abstract machine includes the concept of signed and unsigned integers, without details -- without requiring twos-complement arithmetic, 8-bit bytes or how to store a binary number in memory.
PS: I agree that binary data compatibility on the net or in memory/storage is a PIA.
That's a good question, and I have often thought something like this would be useful. However, you need to remember that C aims for platform independence, and endianness only matters when a structure like this is converted into some underlying memory layout. This conversion can happen when you cast a uint8_t buffer into an int, for example. While an endianness modifier looks neat, the programmer still needs to consider other platform differences such as int sizes and structure alignment and packing.
For defensive programming, when you want fine-grained control over how some variables or structures are represented in a memory buffer, it is best to code explicit conversion functions and then let the compiler optimiser generate the most efficient code for each supported platform.
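A sketch of such an explicit conversion function (the name load_be32 is made up):

#include <cstdint>

// Decode a big-endian 32-bit value from a byte buffer, independently of the
// host's endianness; optimisers typically turn this into a single load
// (plus a byte swap where needed).
std::uint32_t load_be32(const unsigned char* p) {
    return (std::uint32_t(p[0]) << 24) |
           (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) <<  8) |
            std::uint32_t(p[3]);
}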
Endianness is not inherently a part of a data type but rather of its storage layout.
As such, it would not be really akin to signed/unsigned but rather more like bit field widths in structs. Similar to those, they could be used for defining binary APIs.
So you'd have something like
int ip : big 32;
which would define both storage layout and integer size, leaving it to the compiler to do the best job of matching use of the field to its access. It's not obvious to me what the allowed declarations should be.
Short Answer: if it should not be possible to use objects in arithmetic expressions (with no overloaded operators) involving ints, then these objects should not be integer types. And there is no point in allowing addition and multiplication of big-endian and little-endian ints in the same expression.
Longer Answer:
As someone mentioned, endianness is processor-specific. Which really means that this is how numbers are represented when they are used as numbers in the machine language (as addresses and as operands/results of arithmetic operations).
The same is "sort of" true of signage. But not to the same degree. Conversion from language-semantic signage to processor-accepted signage is something that needs to be done to use numbers as numbers. Conversion from big-endian to little-endian and reverse is something that needs to be done to use numbers as data (send them over the network or represent metadata about data sent over the network such as payload lengths).
Having said that, this decision appears to be mostly driven by use cases. The flip side is that there is a good pragmatic reason to ignore certain use cases. The pragmatism arises out of the fact that endianness conversion is more expensive than most arithmetic operations.
If a language had semantics for keeping numbers as little-endian, it would allow developers to shoot themselves in the foot by forcing little-endianness of numbers in a program which does a lot of arithmetic. If developed on a little-endian machine, this enforcing of endianness would be a no-op. But when ported to a big-endian machine, there would be a lot of unexpected slowdowns. And if the variables in question were used both for arithmetic and as network data, it would make the code completely non-portable.
Not having these endian semantics, or forcing them to be explicitly compiler-specific, forces developers to go through the mental step of thinking of the numbers as being "read" or "written" to/from the network format. This makes code that converts back and forth between network and host byte order in the middle of arithmetic operations cumbersome, and a lazy developer is less likely to prefer writing it that way.
And since development is a human endeavor, making bad choices uncomfortable is a Good Thing(TM).
Edit: here's an example of how this can go badly:
Assume that little_endian_int32 and big_endian_int32 types are introduced. Then little_endian_int32(7) % big_endian_int32(5) is a constant expression. What is its result? Do the numbers get implicitly converted to the native format? If not, what is the type of the result? Worse yet, what is the value of the result (which in this case should probably be the same on every machine)?
Again, if multi-byte numbers are used as plain data, then char arrays are just as good. Even if they are "ports" (which are really lookup values into tables or their hashes), they are just sequences of bytes rather than integer types (on which one can do arithmetic).
Now if you limit the allowed arithmetic operations on explicitly-endian numbers to only those operations allowed for pointer types, then you might have a better case for predictability. Then myPort + 5 actually makes sense even if myPort is declared as something like little_endian_int16 on a big endian machine. Same for lastPortInRange - firstPortInRange + 1. If the arithmetic works as it does for pointer types, then this would do what you'd expect, but firstPort * 10000 would be illegal.
Then, of course, you get into the argument of whether the feature bloat is justified by any possible benefit.
From a pragmatic programmer perspective searching Stack Overflow, it's worth noting that the spirit of this question can be answered with a utility library. Boost has such a library:
http://www.boost.org/doc/libs/1_65_1/libs/endian/doc/index.html
The feature of the library most like the language feature under discussion is a set of arithmetic types such as big_int16_t.
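A rough sketch of how those arithmetic types are typically used (the struct and field names are made up; check the Boost.Endian documentation for the exact headers and type names):

#include <boost/endian/arithmetic.hpp>
#include <cstdint>

// On-wire header whose stored layout is fixed regardless of host endianness.
struct WireHeader {
    boost::endian::big_uint16_t type;     // stored big-endian
    boost::endian::big_uint32_t length;   // stored big-endian
};

std::uint32_t payload_length(const WireHeader& h) {
    return h.length;   // converts to host order on read
}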
Because nobody has proposed adding it to the standard, and/or because compiler implementers have never felt a need for it.
Maybe you could propose it to the committee. I do not think it would be difficult to implement in a compiler: compilers already provide fundamental types that the target machine does not natively support.
The development of C++ is an affair of all C++ coders.
@Schimmel: do not listen to people who justify the status quo! All the cited arguments to justify this absence are more than fragile. A student logician could find their inconsistency without knowing anything about computer science. Just propose it, and don't worry about pathological conservatives. (Advice: propose new types rather than a qualifier, because the unsigned and signed keywords are considered mistakes.)
Endianness is compiler-specific as a result of being machine-specific, not as a support mechanism for platform independence. The standard is an abstraction that has no interest in imposing rules that make things "easy"; its task is to create enough similarity between compilers that programmers can achieve "platform independence" for their code, if they choose to do so.
Initially, there was a lot of competition between platforms for market share and also -- compilers were most often written as proprietary tools by microprocessor manufacturers and to support operating systems on specific hardware platforms. Intel was likely not very concerned about writing compilers that supported Motorola microprocessors.
C was -- after all -- invented by Bell Labs to rewrite Unix.

In C/C++, what's the minimum type up-casting required for mixed-type integer math?

I have code that depends on data that is a mixture of uint16_t, int32_t / uint32_t and int64_t values. It also includes some larger bit shifted constants (e.g., 1<<23, even 1<<33).
In calculating an int64_t value, if I carefully cast each sub-part (e.g., up-casting uint16_t values to int64_t) it works; if I don't, the calculations often go awry.
I end up with code that looks like this:
int64_t sensDT = (int64_t)sensD2 - (int64_t)promV[PROM_C5] * (int64_t)(1 << 8);
temperatureC = (double)((2000 + sensDT * (int64_t)promV[PROM_C6] / (1 << 23)) / 100.0);
I wonder, though, if my sprinkling of casts here is too cluttered and too generous. I'm not sure whether the 1<<8 requires its cast (while 1<<23, despite not having one, doesn't lead to erroneous calculations), but perhaps they both do. How much is too much when it comes to up-casting values for a calculation like this?
Edit: So it's clear, I'm asking what the minimum proper amount of casting is - what's necessary for correct functionality (one can add more casts or modifiers for clarity, but from the compiler's perspective what's necessary to ensure correct calculations?)
Edit2: I'm using C-style casts as this is from an Arduino-type embedded code base (which itself used that style of casts already). From the perspective of having the desired effect they appear to be equivalent, thus I used the existing coding style.
Generally you can rely on the usual arithmetic conversions to give you the correct operation, as long as one of the operands of each operator has the correct size. So your first example could be simplified:
int64_t sensDT = sensD2 - (int64_t)promV[PROM_C5] * (1 << 8);
Be careful to consider the precedence rules to know what order the operators will be applied!
You might run into trouble if you're mixing signed and unsigned types of the same size, although either one converts cleanly when the other operand of the operator is a larger signed type.
You need to be careful with constants, because without any suffix they have type int (signed, and typically 32 bits). 1<<8 won't be a problem, but 1<<35 will be whenever int is 32 bits (the shift is undefined behaviour); you need 1LL<<35.
When in doubt, a few extra casts or parentheses won't hurt.
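A hedged sketch pulling the advice together (the values are made up and stand in for the question's variables):

#include <cstdint>
#include <cstdio>

int main() {
    std::uint16_t promC5 = 40000, promC6 = 30000;   // made-up calibration values
    std::uint32_t sensD2 = 8000000;                 // made-up raw reading

    // One int64_t operand per operator is enough; the other operand is
    // converted by the usual arithmetic conversions.
    std::int64_t sensDT = sensD2 - (std::int64_t)promC5 * (1 << 8);

    // Shift counts of 32 or more need a 64-bit left operand.
    std::int64_t big = (std::int64_t)1 << 35;       // not 1 << 35

    double temperatureC =
        (2000 + sensDT * (std::int64_t)promC6 / (1 << 23)) / 100.0;

    std::printf("%lld %lld %f\n",
                (long long)sensDT, (long long)big, temperatureC);
}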