Why not enforce 2's complement in C++? - c++

The new C++ standard still refuses to specify the binary representation of integer types. Is this because there are real-world implementations of C++ that don't use 2's complement arithmetic? I find that hard to believe. Is it because the committee feared that future advances in hardware would render the notion of 'bit' obsolete? Again hard to believe. Can anyone shed any light on this?
Background: I was surprised twice in one comment thread (Benjamin Lindley's answer to this question). First, from piotr's comment:
Right shift on signed type is undefined behaviour
Second, from James Kanze's comment:
when assigning to a long, if the value doesn't fit in a long, the results are
implementation defined
I had to look these up in the standard before I believed them. The only reason for them is to accommodate non-2's-complement integer representations. WHY?
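For reference, here is a minimal sketch of my own illustrating the two behaviours quoted above; the comments reflect the pre-C++20 rules:

#include <iostream>

int main() {
    int x = -8;
    // Pre-C++20, right-shifting a negative signed value gives an
    // implementation-defined result (and left-shifting one is undefined);
    // C++20 defines it as an arithmetic (sign-extending) shift.
    std::cout << (x >> 1) << '\n';

    long long big = 1LL << 40;
    // If long is 32 bits, the value doesn't fit: pre-C++20 the result of the
    // conversion is implementation-defined; since C++20 it is reduced modulo 2^N.
    long narrowed = static_cast<long>(big);
    std::cout << narrowed << '\n';
}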

(Edit: C++20 now mandates two's complement representation; note that overflow of signed arithmetic is still undefined, and shifts continue to have undefined or implementation-defined behaviour in some cases.)
A major problem with defining something which currently isn't defined is that compilers were built assuming it is undefined. Changing the standard won't change the existing compilers, and reviewing them to find out where the assumption was made is a difficult task.
Even on two's complement machines, you may have more variety than you think. Two examples: some don't have a sign-preserving right shift, just a right shift which introduces zeros; and a common feature in DSPs is saturating arithmetic, where assigning an out-of-range value clips it at the maximum rather than just dropping the high-order bits.
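For illustration, here is a rough sketch (a hypothetical helper of my own, not something a DSP toolchain actually provides under this name) of what saturating assignment does:

#include <algorithm>
#include <cstdint>
#include <limits>

// Mimics DSP-style saturating assignment: out-of-range values clip to the
// nearest representable value instead of dropping high-order bits.
std::int16_t saturate_to_int16(std::int32_t v) {
    return static_cast<std::int16_t>(std::clamp<std::int32_t>(
        v,
        std::numeric_limits<std::int16_t>::min(),
        std::numeric_limits<std::int16_t>::max()));
}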

I suppose it is because the Standard says, in 3.9.1[basic.fundamental]/7
this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types.
which, I am willing to bet, came along from the C programming language, which lists sign and magnitude, two's complement, and one's complement as the only allowed representations in 6.2.6.2/2. And there sure were 1's complement systems around when C was widespread: UNIVACs are the most often mentioned, it seems.

It seems to me that, even today, if you are writing a broadly-applicable C++ library that you expect to run on any machine, then 2's complement cannot be assumed. C++ is just too widely used to be making assumptions like that.
Most people don't write those sorts of libraries, though, so if you want to take a dependency on 2's complement you should just go ahead.

Many aspects of the language standard are as they are because the Standards Committee has been extremely loath to forbid compilers from behaving in ways that existing code may rely upon. If code exists which would rely upon one's complement behavior, then requiring that compilers behave as though the underlying hardware uses two's complement would make it impossible for the older code to run using newer compilers.
The solution, which the Standards Committee has alas not yet seen fit to implement, would be to allow code to specify the desired semantics for things in a fashion independent of the machine's word size or hardware characteristics. If support for code which relies upon ones'-complement behavior is deemed important, design a means by which code could expressly demand one's-complement behavior regardless of the underlying hardware platform. If desired, to avoid overly complicating every single compiler, specify that certain aspects of the standard are optional, but conforming compilers must document which aspects they support. Such a design would allow compilers for ones'-complement machines to support both two's-complement behavior and ones'-complement behavior depending upon the needs of the program. Further, it would make it possible to port the code to two's-complement machines with compilers that happened to include ones'-complement support.
I'm not sure exactly why the Standards Committee has as yet not allowed any way by which code can specify behavior in a fashion independent of the underlying architecture and word size (so that code wouldn't have some machines use signed semantics for comparisons where other machines would use unsigned semantics), but for whatever reason they have yet to do so. Support for ones'-complement representation is but a part of that.

Related

Does the way negative numbers are represented internally affect the working of programs?

I am not asking about any specific algorithm or program. But considering bit-manipulation programs and other tasks that involve 2's complement or 1's complement etc., what if the negative numbers are represented (in memory or wherever) in a way opposite to the assumptions of the programmer? Does this scenario even occur? If yes, then how can it be handled?
Does this scenario even occur?
Yes, before C++20 that was allowed, and there have been architectures in the past that don't use two's complement. Since C++20, however, two's complement representation is mandated.
If yes, then how can it be handled?
By not relying on operations with implementation-defined behavior or by asserting that your code may only be used on two's complement architectures.
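If you take the second route, a minimal compile-time sketch (my own illustration) of such an assertion for pre-C++20 code might be:

#include <climits>

// On two's complement, INT_MIN == -INT_MAX - 1; on ones' complement and
// sign-and-magnitude the range is symmetric, so INT_MIN == -INT_MAX.
static_assert(INT_MIN == -INT_MAX - 1,
              "this code assumes two's complement integers");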

Is there a C++ floating point data type which is guaranteed to be 64 bits long on every system? [duplicate]

In the stdint.h (C99), boost/cstdint.hpp, and cstdint (C++0x) headers there is, among others, the type int32_t.
Are there similar fixed-size floating point types? Something like float32_t?
Nothing like this exists in the C or C++ standards at present. In fact, there isn't even a guarantee that float will be a binary floating-point format at all.
Some compilers guarantee that the float type will be the IEEE-754 32-bit binary format; some do not. In practice, float is the IEEE-754 single type on most non-embedded platforms, though the usual caveats about some compilers evaluating expressions in a wider format apply.
There is a working group discussing adding C language bindings for the 2008 revision of IEEE-754, which could consider recommending that such a typedef be added. If this were added to C, I expect the C++ standard would follow suit... eventually.
If you want to know whether your float is the IEEE 32-bit type, check std::numeric_limits<float>::is_iec559. It's a compile-time constant, not a function.
If you want to be more bulletproof, also check std::numeric_limits<float>::digits to make sure they aren't sneakily using the IEEE standard double-precision for float. It should be 24.
When it comes to long double, it's more important to check digits because there are a couple IEEE formats which it might reasonably be: 128 bits (digits = 113) or 80 bits (digits = 64).
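Putting those checks together, a minimal compile-time sketch (my own, not from the answer) might be:

#include <climits>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "float is not an IEEE-754 type on this implementation");
static_assert(std::numeric_limits<float>::digits == 24,
              "float does not have the 24-bit significand of IEEE binary32");
static_assert(sizeof(float) * CHAR_BIT == 32,
              "float is not 32 bits wide");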
It wouldn't be practical to have float32_t as such because you usually want to use floating-point hardware, if available, and not to fall back on a software implementation.
If you think having typedefs such as float32_t and float64_t is impractical for any reason, you must be so accustomed to your familiar OS and compiler that you are unable to look outside your little nest.
There exists hardware which natively runs 32-bit IEEE floating point operations, and other hardware that does 64-bit. Sometimes such systems even have to talk to each other, in which case it is extremely important to know whether a double is 32 bits or 64 bits on each platform. If the 32-bit platform were to do excessive calculations based on the 64-bit values from the other, we may want to cast down to the lower precision depending on timing and speed requirements.
I personally feel uncomfortable using floats and doubles unless I know exactly how many bits they are on my platform. Even more so if I am to transfer these to another platform over some communications channel.
There is currently a proposal to add the following types into the language:
decimal32
decimal64
decimal128
which may one day be accessible through #include <decimal>.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3871.html

Creating a unsigned int with undefined overflow?

In a recent CppCon talk given by Chandler Carruth (link), around 39:16, he explains how leaving the overflow of a signed integer undefined allows compilers to generate optimized assembly.
This kind of optimization can also be found in this blog post by Krister Walfridsson, here
Having previously been bitten by bugs involving sizes overflowing past INT_MAX, I tend to be pedantic about the types I use in my code, but at the same time I don't want to lose fairly straightforward performance gains.
While I have limited knowledge of assembly, this left me wondering: what would it entail to create an unsigned integer with undefined overflow? This seems to be a recurring issue, but I didn't find any proposal to introduce one (and eventually update std::size_t). Has something like this ever been discussed?
This question is completely backward. There isn't some magic panacea by which a behaviour can be deemed undefined in order to give a compiler optimisation opportunities. There is always a trade-off.
For an operation to have undefined behaviour in some conditions, the C++ standard would need to describe no constraints on the resultant behaviour. That is the definition of undefined behaviour in the standard.
For the C++ standard (or any standard - undefined behaviour is a feature of standards, not just standards for programming languages) to do that, there would need to be more than one realistic way of implementing the operation, under a range of conditions, that produces different outcomes, advantages, and disadvantages. There would also need to be a realistic prospect of real-world implementation for more than one alternative. Lastly, there needs to be a realistic chance that each of those alternatives provides some value (e.g. desirable attributes of a system which uses them) - otherwise one approach can be specified, and there is no need for alternatives.
Arithmetic on signed integers has undefined behaviour on overflow because of a number of contributing factors. Firstly, there are different representations of a signed integer (e.g. ones-complement, twos-complement, etc). Second, the representation of a signed integer (by definition) includes representation of a sign, e.g. a sign bit. Third, there is no particular representation of a signed integer that is inherently superior to another (choosing one or the other involves engineering trade-offs, for example in the design of circuitry within a processor to implement an addition operation). Fourth, there are real-world implementations that use different representations. Because of these factors, operations on a signed integer that overflow may "wrap" with one CPU, but result in a hardware signal that must be cleared on another CPU. Each of these types of behaviour - or others - may be "optimal", by some measure, for some applications but not others. The standard has to allow for all of these possibilities - and the means by which it does that is to deem the behaviour undefined.
The reason arithmetic on unsigned integers has well-defined behaviour is because there aren't as many realistic ways of representing them or operations on them and - when such representations and operations on them are implemented in CPU circuitry, the results all come out the same (i.e. modulo arithmetic). There is no "sign bit" to worry about in creating circuits to represent and operate on unsigned integral values. Even if bits in an unsigned variable are physically laid out differently the implementation of operations (e.g. an adder circuit using NAND gates) causes a consistent behaviour on overflow for all basic math operations (addition, subtraction, multiplication, division). And, not surprisingly, all existing CPUs do it this way. There isn't one CPU that generates a hardware fault on unsigned overflow.
So, if you want overflow operations on unsigned values to have undefined behaviour, you would first need to find a way of representing an unsigned value in some way that results in more than one feasible/useful result/behaviour, make a case that your scheme is better in some way (e.g. performance, more easily fabricated CPU circuitry, application performance, etc). You would then need to convince some CPU designers to implement such a scheme, and convince system designers that scheme gives a real-world advantage. At the same time, you would need to leave some CPU designers and system designers with the belief that some other scheme has an advantage over yours for their purposes. In other words, real world usage of your approach must involve real trade-offs, rather than your approach having consistent advantage over all possible alternatives. Then, once you have multiple realisations in hardware of different approaches - which result in different behaviours on overflow on different platforms - you would need to convince the C++ standardisation committee that there is advantage in supporting your scheme in the standard (i.e. language or library features that exploit your approach), and that all of the possible behaviours on overflow need to be permitted. Only then will overflow on unsigned integers (or your variation thereof) have undefined behaviour.
I've described the above in terms of starting at the hardware level (i.e. having native support for your unsigned integer type in hardware). The same goes if you do it in software, but you would need to convince developers of libraries or operating systems instead.
Only then will you have introduced an unsigned integral type which has undefined behaviour if operations overflow.
More generally, as said at the start, this question is backward though. It is true that compilers exploit undefined behaviour (sometimes in highly devious ways) to improve performance. But, for the standard to deem that something has undefined behaviour, there needs to be more than one way of doing relevant operations, and implementations (compilers, etc) need to be able to analyse benefits and trade-offs of the alternatives, and then - according to some criteria - pick one. Which means there will always be a benefit (e.g. performance) and an unwanted consequence (e.g. unexpected results in some edge cases).
There is no such thing as an unsigned integer with undefined overflow. C++ is very specific that unsigned types do not overflow; they obey modulo arithmetic.
Could a future version of the language add an arithmetic type that does not obey modulo arithmetic, but also does not support signedness (and thus may use the whole range of its bits)? Maybe. But the alleged performance gains are not what they are with a signed value (which would otherwise have to consider correct handling of the sign bit, whereas an unsigned value has no "special" bits mandated) so I wouldn't hold your breath. In fact, although I'm no assembly expert, I can't imagine that this would be useful in any way.
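To make the contrast concrete, here is a small sketch of my own: unsigned arithmetic must wrap modulo 2^N, while a compiler may assume signed arithmetic never overflows:

#include <climits>
#include <iostream>

// The compiler may fold this to `return true`: signed overflow is undefined,
// so it is assumed never to happen.
bool always_greater(int x) { return x + 1 > x; }

// Here wraparound is well defined, so the comparison must really be done:
// for x == UINT_MAX, x + 1 is 0 and the function returns false.
bool sometimes_greater(unsigned x) { return x + 1 > x; }

int main() {
    unsigned u = UINT_MAX;
    ++u;                            // well defined: wraps to 0 (modulo 2^N)
    std::cout << u << '\n';         // prints 0
    std::cout << sometimes_greater(UINT_MAX) << '\n';  // prints 0
}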

Exotic architectures the standards committees care about

I know that the C and C++ standards leave many aspects of the language implementation-defined just because, if there were an architecture with other characteristics, a standard-conforming compiler for that architecture would need to emulate those parts of the language, resulting in inefficient machine code.
Surely, 40 years ago every computer had its own unique specification. However, I don't know of any architectures used today where:
CHAR_BIT != 8
signed is not two's complement (I heard Java had problems with this one).
Floating point is not IEEE 754 compliant (Edit: I meant "not in IEEE 754 binary encoding").
The reason I'm asking is that I often explain to people that it's good that C++ doesn't mandate any other low-level aspects like fixed-size types†. It's good because unlike 'other languages' it makes your code portable when used correctly (Edit: because it can be ported to more architectures without requiring emulation of low-level aspects of the machine, like e.g. two's complement arithmetic on a sign+magnitude architecture). But I feel bad that I cannot point to any specific architecture myself.
So the question is: what architectures exhibit the above properties?
† uint*_ts are optional.
Take a look at this one
Unisys ClearPath Dorado Servers
offering backward compatibility for people who have not yet migrated all their Univac software.
Key points:
36-bit words
CHAR_BIT == 9
one's complement
72-bit non-IEEE floating point
separate address space for code and data
word-addressed
no dedicated stack pointer
I don't know if they offer a C++ compiler, but they could.
And now a link to a recent edition of their C manual has surfaced:
Unisys C Compiler Programming Reference Manual
Section 4.5 has a table of data types with 9, 18, 36, and 72 bits.
None of your assumptions hold for mainframes. For starters, I don't know of a mainframe which uses IEEE 754: IBM uses base 16 floating point, and both of the Unisys mainframes use base 8. The Unisys machines are a bit special in many other respects: Bo has mentioned the 2200 architecture, but the MPS architecture is even stranger: 48-bit tagged words. (Whether the word is a pointer or not depends on a bit in the word.)
And the numeric representations are designed so that there is no real distinction between floating point and integral arithmetic: the floating point is base 8; it doesn't require normalization, and unlike every other floating point I've seen, it puts the decimal point to the right of the mantissa, rather than the left, and uses signed magnitude for the exponent (in addition to the mantissa). With the result that an integral floating point value has (or can have) exactly the same bit representation as a signed magnitude integer. And there are no floating point arithmetic instructions: if the exponents of the two values are both 0, the instruction does integral arithmetic; otherwise, it does floating point arithmetic. (A continuation of the tagging philosophy in the architecture.) Which means that while int may occupy 48 bits, 8 of them must be 0, or the value won't be treated as an integer.
Full IEEE 754 compliance is rare in floating-point implementations. And weakening the specification in that regard allows lots of optimizations.
For example, subnormal support differs between x87 and SSE.
Optimizations like fusing a multiplication and an addition that were separate in the source code also slightly change the results, but this is a nice optimization on some architectures.
Or on x86, strict IEEE compliance might require certain flags to be set, or additional transfers between floating point registers and normal memory, to force it to use the specified floating point type instead of its internal 80-bit floats.
And some platforms have no hardware floats at all and thus need to emulate them in software. Some of the requirements of IEEE 754 might be expensive to implement in software; in particular, the rounding rules might be a problem.
My conclusion is that you don't need exotic architectures in order to get into situations where you don't always want to guarantee strict IEEE compliance. For this reason, few programming languages guarantee strict IEEE compliance.
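One concrete example of such an optimisation is contracting a separate multiply and add into a fused multiply-add, which rounds only once; whether the compiler contracts a*b + c like this is itself controlled by flags such as -ffp-contract. A small sketch of my own showing how the results can differ:

#include <cmath>
#include <cstdio>

int main() {
    double a = 1e16;
    double b = 1.0000000000000002;    // 1 + 2^-52, the next double above 1
    double c = -1e16;
    double separate = a * b + c;      // product rounded before the addition
    double fused = std::fma(a, b, c); // single rounding at the end
    std::printf("%.17g\n%.17g\n", separate, fused);  // typically 2 vs ~2.22
}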
I found this link listing some systems where CHAR_BIT != 8. They include
some TI DSPs have CHAR_BIT == 16
the BlueCore-5 chip (a Bluetooth chip from Cambridge Silicon Radio) has CHAR_BIT == 16
And of course there is a question on Stack Overflow: What platforms have something other than 8-bit char
As for non two's-complement systems there is an interesting read on
comp.lang.c++.moderated. Summarized: there are platforms having ones' complement or sign and magnitude representation.
I'm fairly sure that VAX systems are still in use. They don't support IEEE floating-point; they use their own formats. Alpha supports both VAX and IEEE floating-point formats.
Cray vector machines, like the T90, also have their own floating-point format, though newer Cray systems use IEEE. (The T90 I used was decommissioned some years ago; I don't know whether any are still in active use.)
The T90 also had/has some interesting representations for pointers and integers. The C and C++ compilers had CHAR_BIT==8 (necessary because it ran Unicos, a flavor of Unix, and had to interoperate with other systems), but a native address could only point to a 64-bit word. All byte-level operations were synthesized by the compiler, and a void* or char* stored a byte offset in the high-order 3 bits of the word. And I think some integer types had padding bits.
IBM mainframes are another example.
On the other hand, these particular systems needn't necessarily preclude changes to the language standard. Cray didn't show any particular interest in upgrading its C compiler to C99; presumably the same thing applied to the C++ compiler. It might be reasonable to tighten the requirements for hosted implementations, such as requiring CHAR_BIT==8, IEEE format floating-point if not the full semantics, and 2's-complement without padding bits for signed integers. Old systems could continue to support earlier language standards (C90 didn't die when C99 came out), and the requirements could be looser for freestanding implementations (embedded systems) such as DSPs.
On the other other hand, there might be good reasons for future systems to do things that would be considered exotic today.
CHAR_BIT
According to gcc source code:
CHAR_BIT is 16 bits for 1750a, dsp16xx architectures.
CHAR_BIT is 24 bits for dsp56k architecture.
CHAR_BIT is 32 bits for c4x architecture.
You can easily find more by doing:
find $GCC_SOURCE_TREE -type f | xargs grep "#define CHAR_TYPE_SIZE"
or
find $GCC_SOURCE_TREE -type f | xargs grep "#define BITS_PER_UNIT"
if CHAR_TYPE_SIZE is appropriately defined.
IEEE 754 compliance
If the target architecture doesn't support floating point instructions, gcc may generate a software fallback which is not standard-compliant by default. Moreover, special options (like -funsafe-math-optimizations, which also disables sign preservation for zeros) can be used.
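As a small illustration of the kind of IEEE detail such options trade away (a sketch of my own; the exact effect depends on the compiler and flags): with -funsafe-math-optimizations (which implies -fno-signed-zeros) the distinction between +0.0 and -0.0 may be lost:

#include <cmath>
#include <cstdio>

int main() {
    double neg_zero = -0.0;
    // Under strict IEEE-754 semantics, 1.0 / -0.0 is -infinity and
    // signbit(-0.0) is true; with sign-of-zero preservation disabled the
    // compiler is free to treat -0.0 like +0.0.
    std::printf("%g\n", 1.0 / neg_zero);
    std::printf("%d\n", (int)std::signbit(neg_zero));
}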
IEEE 754 binary representation was uncommon on GPUs until recently, see GPU Floating-Point Paranoia.
EDIT: a question has been raised in the comments whether GPU floating point is relevant to the usual computer programming, unrelated to graphics. Hell, yes! Most high-performance computation done in industry today is done on GPUs; the list includes AI, data mining, neural networks, physical simulations, weather forecasting, and much, much more. One of the links in the comments shows why: the order-of-magnitude floating point advantage of GPUs.
Another thing I'd like to add, which is more relevant to the OP's question: what did people do 10-15 years ago when GPU floating point was not IEEE and when there was no API such as today's OpenCL or CUDA to program GPUs? Believe it or not, early GPU computing pioneers managed to program GPUs without an API to do that! I met one of them in my company. Here's what he did: he encoded the data he needed to compute as an image, with pixels representing the values he was working on, then used OpenGL to perform the operations he needed (such as "Gaussian blur" to represent a convolution with a normal distribution, etc.), and decoded the resulting image back into an array of results. And this was still faster than using the CPU!
Things like that are what prompted NVidia to finally make their internal data binary compatible with IEEE and to introduce an API oriented toward computation rather than image manipulation.

Compilers and negative numbers representations

Recently I was confused by this question. Maybe because I didn't read language specifications (it's my fault, I know).
The C99 standard doesn't say which representation of negative numbers should be used by the compiler. I always thought that the only right way to store negative numbers is two's complement (in most cases).
So here's my question: do you know of any present-day compiler that implements one's complement or sign-magnitude representation by default? Can we change the default representation with some compiler flag?
What is the simplest way to determine which representation is used?
And what about C++ standard?
I think it's not so much a question of what representation the compiler uses, but rather what representation the underlying machine uses. The compiler would be very stupid to pick a representation not supported by the target machine, since that would introduce loads of overhead for no benefit.
Some checksum fields in the IP protocol suite use one's complement, so perhaps dedicated "network accelerator"-type CPUs implement it.
While two's-complement representation is by far the most common, it is not the only one (see some). The C and C++ standardisation committees did not want to require non-two's-complement machines to emulate a non-native representation. Therefore neither C nor C++ requires a specific negative integer format.
This leads to the implementation-defined (and in some cases undefined) behaviour of bitwise operations on signed types.
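As for the question of how to determine which representation is used: before C++20 the classic trick is to inspect the low bits of -1, since the three permitted representations give different bit patterns (a sketch; since C++20 the answer is always two's complement):

#include <cstdio>

int main() {
    // two's complement:  -1 is all ones       -> (-1 & 3) == 3
    // ones' complement:  -1 is ...11111110    -> (-1 & 3) == 2
    // sign-magnitude:    -1 is 100...0001     -> (-1 & 3) == 1
    switch (-1 & 3) {
        case 3: std::puts("two's complement"); break;
        case 2: std::puts("ones' complement"); break;
        case 1: std::puts("sign and magnitude"); break;
    }
}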
The UNISYS 2200 series, which implements one's complement math, is still in use, with a fairly up-to-date compiler. You can read more about it in the questions below
Exotic architectures the standards committees care about
Are there any non-twos-complement implementations of C?