How to create a zero-overhead integer-like type that does not implicitly convert to other types? - c++

We have a fairly large C++ code base which uses signed 32-bit int as the default integer data type. Due to changing requirements, it is necessary that we switch to 64-bit integers for a particular data structure. Changing the default integer data type in the entire program is not viable due to the significant memory overhead. On the other hand, we need to prevent unaware developers from mixing 64-bit and 32-bit integers and creating problems that only occur when very large data sets are handled (and which are thus hard to detect and even harder to debug).
Question: How can I create a zero-overhead 64-bit integer type that does not implicitly convert to other types (specifically, 32-bit integers) that is still "convenient" to use?
Or - if the above is not possible or sensible - what would be a good alternative to my proposed approach?
Example
I'm thinking about creating a data structure like this:
class Int64 {
    std::int64_t value;  // std::int64_t from <cstdint>; unlike long, it is guaranteed to be 64 bits
};
And then add c'tors for implicit construction, assignment, and operator overloads for arithmetic operations etc. However, I was not able to find a good resource online that might explain how to go about something like this and what the caveats are. Any suggestions?
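A minimal sketch of what such a wrapper could look like, assuming the wrapped type is std::int64_t; only a couple of operators are shown, the rest follow the same pattern:

#include <cstdint>

class Int64 {
    std::int64_t value;
public:
    // implicit construction, so existing 32-bit values can be passed in directly
    constexpr Int64(std::int64_t v) : value(v) {}

    // conversion back to a built-in integer must be requested explicitly,
    // so an Int64 never silently narrows to a 32-bit int
    explicit constexpr operator std::int64_t() const { return value; }

    friend constexpr Int64 operator+(Int64 a, Int64 b) { return Int64(a.value + b.value); }
    friend constexpr bool operator==(Int64 a, Int64 b) { return a.value == b.value; }
    // ... remaining arithmetic, comparison and assignment operators follow the same pattern
};

// Int64 a = 5;                             // fine: widening is implicit
// Int64 b = a + 7;                         // fine: 7 becomes Int64 first
// int c = b;                               // error: no implicit conversion to a 32-bit int
// auto d = static_cast<std::int64_t>(b);   // explicit escape hatch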

Related

Does going through uintptr_t bring any safety when casting a pointer type to uint64_t?

Note that this is purely an academic question, from a language lawyer perspective. It's about the theoretically safest way to accomplish the conversion.
Suppose I have a void* and I need to convert it to a 64-bit integer. The reason is that this pointer holds the address of a faulting instruction; I wish to report this to my backend to be logged, and I use a fixed-size protocol - so I have precisely 64 bits to use for the address.
The cast will of course be implementation defined. I know my platform (64-bit Windows) allows this conversion, so in practice it's fine to just reinterpret_cast<uint64_t>(address).
But I'm wondering: from a theoretical standpoint, is it any safer to first convert to uintptr_t? That is: static_cast<uint64_t>(reinterpret_cast<uintptr_t>(address)). https://en.cppreference.com/w/cpp/language/reinterpret_cast says (emphasis mine):
Unlike static_cast, but like const_cast, the reinterpret_cast expression does not compile to any CPU instructions (except when converting between integers and pointers or on obscure architectures where pointer representation depends on its type).
So, in theory, pointer representation is not defined to be anything in particular; going from pointer to uintptr_t might theoretically perform a conversion of some kind to make the pointer representable as an integer. After that, I forcibly extract the lower 64 bits. Whereas just directly casting to uint64_t would not trigger the conversion mentioned above, and so I'd get a different result.
Is my interpretation correct, or is there no difference whatsoever between the two casts in theory as well?
FWIW, on a 32-bit system, apparently the widening conversion to unsigned 64-bit could sign-extend, as in this case. But on 64-bit I shouldn't have that issue.
You’re parsing that (shockingly informal, for cppreference) paragraph too closely. The thing it’s trying to get at is simply that other casts potentially involve conversion operations (float/int stuff, sign extension, pointer adjustment), whereas reinterpret_cast has the flavor of direct reuse of the bits.
If you reinterpret a pointer as an integer and the integer type is not large enough, you get a compile-time error. If it is large enough, you’re fine. There’s nothing magical about uintptr_t other than the guarantee that (if it exists) it’s large enough, and if you then re-cast to a smaller type you lose that anyway. Either 64 bits is enough, in which case you get the same guarantees with either type, or it’s not, and you’re screwed no matter what you do. And if your implementation is willing to do something weird inside reinterpret_cast, which might give different results than (say) bit_cast, neither method will guarantee nor prevent that.
That’s not to say the two are guaranteed identical, of course. Consider a DS9k-ish architecture with 32-bit pointers, where reinterpret_cast of a pointer to a uint64_t resulted in the pointer bits being duplicated in the low and high words. There you’d get both copies if you went directly to a uint64_t, and zeros in the top half if you went through a 32-bit uintptr_t. In that case, which one was “right” would be a matter of personal opinion.

Efficient 8-bits and 16-bits computations in OCaml

I am currently working on a project in OCaml where I have to manipulate unsigned 8-bit and 16-bit integers. In my context, things can get a little messy: I sometimes want to convert an 8-bit integer into a 16-bit one, or split a 16-bit integer into two 8-bit ones. I also want to use all the usual operations, like addition or the bitwise operations, on them. Since there is all this interaction between 8 and 16 bits, I really like the comfort of having separate types for each. However, I still want my program to compute reasonably efficiently, and I don't want to lose too much time casting an integer of one size to another. So my question is essentially: how should I go about this? I have two main options, but I don't know enough about the low-level representation OCaml uses to choose comfortably:
Option 1 : Use dedicated types
I figured that I can use the Stdint library that is available through opam and has an implementation of the types uint8 and uint16 which are exactly what I am looking for.
Pros
I get very good mileage from the typing and will definitely avoid silly bugs from this
Cons
I have to constantly use the functions Uint8.to_uint16 and Uint16.to_uint8, which might eventually add up to heavy memory usage and poor efficiency of the compiled program, depending on how these types are actually represented in memory
Option 2 : Encode everything within the type int
This means that all my integers will simply be of type int and I will have to program the addition of two 8-bits integers and of two 16-bits integers in this type, for instance.
Pros:
I think these operations can be programmed in a very efficient way using usual operations and the bitwise operations on the type int.
Cons:
I get essentially nothing from the typing and I have to trust myself to choose the right function at the right time.
Possible workaround
I could use two modules for defining 8-bit and 16-bit integers encoded in an int declared as private. I think that would essentially work like I presented with Option 2. The fact that the type is private would however mean that I cannot switch from one to the other without running into a typing mistake, thus forcing explicit casts and getting leverage from the type system. Still, I expect the casts to be very efficient, since the memory representation of the value won't change.
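For example, a minimal sketch of that workaround (the module and function names are mine, and masking on every operation is just one possible choice):

(* 8-bit and 16-bit unsigned integers carried in an ordinary int;
   the private type forces explicit casts but costs nothing at runtime *)
module U8 : sig
  type t = private int
  val of_int : int -> t
  val to_int : t -> int
  val add : t -> t -> t
end = struct
  type t = int
  let of_int x = x land 0xff
  let to_int x = x
  let add a b = (a + b) land 0xff
end

module U16 : sig
  type t = private int
  val of_int : int -> t
  val to_int : t -> int
  val of_u8 : U8.t -> t   (* widening cast: a no-op at runtime *)
end = struct
  type t = int
  let of_int x = x land 0xffff
  let to_int x = x
  let of_u8 x = (x : U8.t :> int)
end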
So, how would you go about this? Is it worth going through all the trouble, do you think one solution is clearly better, or are they reasonably equivalent?
Bonus
Every time I want to print (in hexadecimal) the value of a variable a of type uint8, I am writing
Printf.printf "a = %02x" (Uint8.to_int a)
There is again a cast that seems a bit silly to me. I could also use the Uint8.to_string_hex function directly, but it explicitly writes the 0x in front of the number, which I don't want. Ideally I would like to just write
Printf.printf "a = %02x" a
Is there a way to change the scopes and do some magic with the Printf to make it happen?
In the stdint library both int8 and int16 are represented as int, so there is no real tradeoff between Option 1 and Option 2:
type int8 = private int
(** Signed 8-bit integer *)
type int16 = private int
(** Signed 16-bit integer *)
The stdint library already provides you the best of both worlds: you get an efficient implementation and type safety. Yes, you need to do these translations, but they are no-ops and are there only for the typechecker.
Also, if you're looking for modular arithmetic (and, in general, modeling machine words and bitvectors) then you can look at our Bitvec library, which we developed as part of the Binary Analysis Platform. It is focused on performance while still providing type safety and a lot of operations. We modeled it on the latest SMT-LIB specification to give clear semantics to all operations. It uses the excellent Zarith library under the hood, which enables an efficient representation for both small and arbitrary-length integers.
Since modularity is not a property of a bitvector itself but a property of an operation, we do not encode the number of bits in the type and use the same type (and representation) for all bitvectors, from 1 bit to thousands of bits. However, it is impossible to mix and match widths incorrectly, because every operation names its modulus explicitly. E.g., you can use generic functions,
(x + y) mod m8
Or predefined modules for the specified modulus, e.g.,
M8.(x + y)
The library has a minimal number of dependencies, so try it by installing
opam install bitvec
There are also additional libraries such as bitvec-order and bitvec-sexp that enable further integration with the Core suite of libraries, if you need them.

Using std::string as a generic uint8_t buffer

I am looking through the source code of Chromium to study how they implemented MediaRecorder API that encodes/records raw mic input stream to a particular format.
I came across interesting codes from their source. In short:
bool DoEncode(float* data_in, std::string* data_out) {
  ...
  data_out->resize(MAX_DATA_BYTES_OR_SOMETHING);
  opus_encode_float(
      data_in,
      reinterpret_cast<uint8_t*>(base::data(*data_out))
  );
  ...
}
So DoEncode (C++ method) here accepts an array of float and converts it to an encoded byte stream, and the actual operation is done in opus_encode_float() (which is a pure C function).
The interesting part is that the Google Chromium team used std::string as a byte array instead of std::vector<uint8_t>, and they even manually cast to a uint8_t buffer.
Why would the Chromium team do it like this, and is there a scenario in which std::string is more useful as a generic byte buffer than alternatives like std::vector<uint8_t>?
The Chromium coding style (see below) forbids using unsigned integral types without a good reason, and an external API is not such a reason. Signed and unsigned char both have size 1, so why not.
I looked at the opus encoder API and it seems the earlier versions used signed char:
[out] data char*: Output payload (at least max_data_bytes long)
Although the API uses unsigned chars now, the description still refers to signed char. So std::string of chars was more convenient for the earlier API, and the Chromium team didn't change the already-used container after the API was updated; they used a one-line cast instead of updating tens of other lines.
Integer Types
You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.
If your code is a container that returns a size, be sure to use a type that will accommodate any possible usage of your container. When in doubt, use a larger type rather than a smaller type.
Use care when converting integer types. Integer conversions and promotions can cause undefined behavior, leading to security bugs and other problems.
On Unsigned Integers
Unsigned integers are good for representing bitfields and modular arithmetic. Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. In other cases, the defined behavior impedes optimization.
That said, mixing signedness of integer types is responsible for an equally large class of problems. The best advice we can provide: try to use iterators and containers rather than pointers and sizes, try not to mix signedness, and try to avoid unsigned types (except for representing bitfields or modular arithmetic). Do not use an unsigned type merely to assert that a variable is non-negative.
We can only theorize.
My speculation: they wanted to use the built-in SSO optimization that exists in std::string but might not be available for std::vector<uint8_t>.
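Whatever the original motivation, the pattern generalizes to any C API that writes into a caller-supplied byte buffer. A sketch of that pattern (fill_bytes and the 4000-byte cap are made up here, standing in for opus_encode_float and its max_data_bytes):

#include <cstdint>
#include <string>

// hypothetical C function that writes up to max_len bytes and returns the count written
extern "C" int fill_bytes(std::uint8_t* out, int max_len);

std::string Encode() {
    std::string buffer;
    buffer.resize(4000);  // worst-case output size
    const int written = fill_bytes(
        reinterpret_cast<std::uint8_t*>(&buffer[0]),  // char* and uint8_t* share size and alignment
        static_cast<int>(buffer.size()));
    buffer.resize(written > 0 ? written : 0);  // keep only the bytes actually produced
    return buffer;
}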

Why isn't there an endianness modifier in C++ like there is for signedness?

(I guess this question could apply to many typed languages, but I chose to use C++ as an example.)
Why is there no way to just write:
struct foo {
    little int x;   // little-endian
    big long int y; // big-endian
    short z;        // native endianness
};
to specify the endianness for specific members, variables and parameters?
Comparison to signedness
I understand that the type of a variable not only determines how many bytes are used to store a value but also how those bytes are interpreted when performing computations.
For example, these two declarations each allocate one byte, and for both bytes, every possible 8-bit sequence is a valid value:
signed char s;
unsigned char u;
but the same binary sequence might be interpreted differently, e.g. 11111111 would mean -1 when assigned to s but 255 when assigned to u. When signed and unsigned variables are involved in the same computation, the compiler (mostly) takes care of proper conversions.
In my understanding, endianness is just a variation of the same principle: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
It seems obvious to have that feature in a typed language that allows low-level programming. However, this is not a part of C, C++ or any other language I know, and I did not find any discussion about this online.
Update
I'll try to summarize some takeaways from the many comments that I got in the first hour after asking:
signedness is strictly binary (either signed or unsigned) and always will be, in contrast to endianness, which has two well-known variants (big and little) but also lesser-known variants such as mixed/middle endian. New variants might be invented in the future.
endianness matters when accessing multiple-byte values byte-wise. There are many aspects beyond just endianness that affect the memory layout of multi-byte structures, so this kind of access is mostly discouraged.
C++ aims to target an abstract machine and minimize the number of assumptions about the implementation. This abstract machine does not have any endianness.
Also, now I realize that signedness and endianness are not a perfect analogy, because:
endianness only defines how something is represented as a binary sequence, but not what can be represented. Both big int and little int would have the exact same value range.
signedness defines how bits and actual values map to each other, but also affects what can be represented, e.g. -3 can't be represented by an unsigned char and (assuming that char has 8 bits) 130 can't be represented by a signed char.
So changing the endianness of some variables would never change the behavior of the program (except for byte-wise access), whereas a change of signedness usually would.
What the standard says
[intro.abstract]/1:
The semantic descriptions in this document define a parameterized nondeterministic abstract machine.
This document places no requirement on the structure of conforming implementations.
In particular, they need not copy or emulate the structure of the abstract machine.
Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
C++ could not define an endianness qualifier since it has no concept of endianness.
Discussion
About the difference between signedness and endianness, OP wrote
In my understanding, endianness is just a variation of the same principle [(signedness)]: a different interpretation of a binary pattern based on compile-time information about the memory in which it will be stored.
I'd argue signedness has both a semantic and a representational aspect1. What [intro.abstract]/1 implies is that C++ only cares about the semantics and never addresses the way a signed number should be represented in memory2. Actually, "sign bit" only appears once in the C++ specs and refers to an implementation-defined value.
On the other hand, endianness has only a representational aspect: endianness conveys no meaning.
With C++20, std::endian appears. It is still implementation-defined, but it lets us test the endianness of the host without depending on old tricks based on undefined behaviour.
1) Semantic aspect: a signed integer can represent values below zero; representational aspect: one needs, for example, to reserve a bit to convey the positive/negative sign.
2) In the same vein, C++ never describes how a floating-point number should be represented; IEEE-754 is often used, but this is a choice made by the implementation, in no way enforced by the standard: [basic.fundamental]/8 "The value representation of floating-point types is implementation-defined".
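A small illustration of the std::endian test mentioned above (C++20):

#include <bit>

constexpr bool is_little = (std::endian::native == std::endian::little);
constexpr bool is_big    = (std::endian::native == std::endian::big);
// on a mixed-endian platform, neither of the above is true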
In addition to YSC's answer, let's take your sample code, and consider what it might aim to achieve
struct foo {
    little int x;   // little-endian
    big long int y; // big-endian
    short z;        // native endianness
};
You might hope that this would exactly specify layout for architecture-independent data interchange (file, network, whatever)
But this can't possibly work, because several things are still unspecified:
data type size: you'd have to use little int32_t, big int64_t and int16_t respectively, if that's what you want
padding and alignment, which cannot be controlled strictly within the language: use #pragma or __attribute__((packed)) or some other compiler-specific extension
actual format (1s- or 2s-complement signedness, floating-point type layout, trap representations)
Alternatively, you might simply want to reflect the endianness of some specified hardware - but big and little don't cover all the possibilities here (just the two most common).
So, the proposal is incomplete (it doesn't distinguish all reasonable byte-ordering arrangements), ineffective (it doesn't achieve what it sets out to), and has additional drawbacks:
Performance
Changing the endianness of a variable from the native byte ordering should either disable arithmetic, comparisons etc (since the hardware cannot correctly perform them on this type), or must silently inject more code, creating natively-ordered temporaries to work on.
The argument here isn't that manually converting to/from native byte order is faster, it's that controlling it explicitly makes it easier to minimise the number of unnecessary conversions, and much easier to reason about how code will behave, than if the conversions are implicit.
Complexity
Everything overloaded or specialized for integer types now needs twice as many versions, to cope with the rare event that it gets passed a non-native-endianness value. Even if that's just a forwarding wrapper (with a couple of casts to translate to/from native ordering), it's still a lot of code for no discernible benefit.
The final argument against changing the language to support this is that you can easily do it in code. Changing the language syntax is a big deal, and doesn't offer any obvious benefit over something like a type wrapper:
// store T with reversed byte order
template <typename T>
class Reversed {
    T val_;
    static T reverse(T); // platform-specific implementation
public:
    explicit Reversed(T t) : val_(reverse(t)) {}
    Reversed(Reversed const &other) : val_(other.val_) {}
    // assignment, move, arithmetic, comparison etc. etc.
    operator T() const { return reverse(val_); }
};
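The reverse helper left unimplemented above could be filled in portably with a plain byte swap; a sketch assuming T is a trivially copyable integer type:

#include <algorithm>
#include <cstring>

template <typename T>
T Reversed<T>::reverse(T t) {
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &t, sizeof t);      // copy the object representation out
    std::reverse(bytes, bytes + sizeof t); // swap the byte order
    std::memcpy(&t, bytes, sizeof t);      // copy it back
    return t;
}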
Integers (as a mathematical concept) have the concept of positive and negative numbers. This abstract concept of sign has a number of different implementations in hardware.
Endianness is not a mathematical concept. Little-endian is a hardware implementation trick to improve the performance of multi-byte twos-complement integer arithmetic on a microprocessor with 16 or 32 bit registers and an 8-bit memory bus. Its creation required using the term big-endian to describe everything else that had the same byte-order in registers and in memory.
The C abstract machine includes the concept of signed and unsigned integers, without details -- without requiring twos-complement arithmetic, 8-bit bytes or how to store a binary number in memory.
PS: I agree that binary data compatibility on the net or in memory/storage is a PIA.
That's a good question and I have often thought something like this would be useful. However, you need to remember that C aims for platform independence, and endianness only matters when a structure like this is converted into some underlying memory layout, for example when you cast a uint8_t buffer into an int. While an endianness modifier looks neat, the programmer still needs to consider other platform differences such as int sizes and structure alignment and packing.
For defensive programming, when you want fine-grained control over how some variables or structures are represented in a memory buffer, it is best to code explicit conversion functions and then let the compiler's optimiser generate the most efficient code for each supported platform.
Endianness is not inherently a part of a data type but rather of its storage layout.
As such, it would not be really akin to signed/unsigned but rather more like bit field widths in structs. Similar to those, they could be used for defining binary APIs.
So you'd have something like
int ip : big 32;
which would define both storage layout and integer size, leaving it to the compiler to do the best job of matching use of the field to its access. It's not obvious to me what the allowed declarations should be.
Short Answer: if it should not be possible to use objects in arithmetic expressions (with no overloaded operators) involving ints, then these objects should not be integer types. And there is no point in allowing addition and multiplication of big-endian and little-endian ints in the same expression.
Longer Answer:
As someone mentioned, endianness is processor-specific. Which really means that this is how numbers are represented when they are used as numbers in the machine language (as addresses and as operands/results of arithmetic operations).
The same is "sort of" true of signedness, but not to the same degree. Conversion from language-semantic signedness to processor-accepted signedness is something that needs to be done to use numbers as numbers. Conversion from big-endian to little-endian and back is something that needs to be done to use numbers as data (send them over the network or represent metadata about data sent over the network, such as payload lengths).
Having said that, this decision appears to be mostly driven by use cases. The flip side is that there is a good pragmatic reason to ignore certain use cases. The pragmatism arises out of the fact that endianness conversion is more expensive than most arithmetic operations.
If a language had semantics for keeping numbers as little-endian, it would allow developers to shoot themselves in the foot by forcing little-endianness of numbers in a program which does a lot of arithmetic. If developed on a little-endian machine, this enforcing of endianness would be a no-op. But when ported to a big-endian machine, there would be a lot of unexpected slowdowns. And if the variables in question were used both for arithmetic and as network data, it would make the code completely non-portable.
Not having these endian semantics or forcing them to be explicitly compiler-specific forces the developers to go through the mental step of thinking of the numbers as being "read" or "written" to/from the network format. This would make the code which converts back and forth between network and host byte order, in the middle of arithmetic operations, cumbersome and less likely to be the preferred way of writing by a lazy developer.
And since development is a human endeavor, making bad choices uncomfortable is a Good Thing(TM).
Edit: here's an example of how this can go badly:
Assume that little_endian_int32 and big_endian_int32 types are introduced. Then little_endian_int32(7) % big_endian_int32(5) is a constant expression. What is its result? Do the numbers get implicitly converted to the native format? If not, what is the type of the result? Worse yet, what is the value of the result (which in this case should probably be the same on every machine)?
Again, if multi-byte numbers are used as plain data, then char arrays are just as good. Even if they are "ports" (which are really lookup values into tables or their hashes), they are just sequences of bytes rather than integer types (on which one can do arithmetic).
Now if you limit the allowed arithmetic operations on explicitly-endian numbers to only those operations allowed for pointer types, then you might have a better case for predictability. Then myPort + 5 actually makes sense even if myPort is declared as something like little_endian_int16 on a big endian machine. Same for lastPortInRange - firstPortInRange + 1. If the arithmetic works as it does for pointer types, then this would do what you'd expect, but firstPort * 10000 would be illegal.
Then, of course, you get into the argument of whether the feature bloat is justified by any possible benefit.
From a pragmatic programmer perspective searching Stack Overflow, it's worth noting that the spirit of this question can be answered with a utility library. Boost has such a library:
http://www.boost.org/doc/libs/1_65_1/libs/endian/doc/index.html
The feature of the library most like the language feature under discussion is a set of arithmetic types such as big_int16_t.
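For illustration, a usage sketch with Boost.Endian's arithmetic types (the struct here is a made-up wire header):

#include <boost/endian/arithmetic.hpp>

// the fields behave like ordinary integers in expressions,
// but are stored big-endian in memory regardless of the host
struct PacketHeader {
    boost::endian::big_int16_t  type;
    boost::endian::big_uint32_t length;
};

// PacketHeader h;
// h.type = 3;           // converted to big-endian on store
// int t = h.type + 1;   // converted back to native order on load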
Because nobody has proposed to add it to the standard, and/or because compiler implementers have never felt a need for it.
Maybe you could propose it to the committee. I do not think it would be difficult to implement in a compiler: compilers already provide fundamental types that are not native to the target machine.
The development of C++ is an affair of all C++ coders.
@Schimmel: Do not listen to people who justify the status quo! All the cited arguments to justify this absence are more than fragile. A student logician could find their inconsistency without knowing anything about computer science. Just propose it, and don't care about pathological conservatives. (Advice: propose new types rather than a qualifier, because the unsigned and signed keywords are considered mistakes.)
Endianness is compiler specific as a result of being machine specific, not as a support mechanism for platform independence. The standard -- is an abstraction that has no regard for imposing rules that make things "easy" -- its task is to create similarity between compilers that allows the programmer to create "platform independence" for their code -- if they choose to do so.
Initially, there was a lot of competition between platforms for market share and also -- compilers were most often written as proprietary tools by microprocessor manufacturers and to support operating systems on specific hardware platforms. Intel was likely not very concerned about writing compilers that supported Motorola microprocessors.
C was -- after all -- invented by Bell Labs to rewrite Unix.

C++ How can I assign a datatype to a binary sequence?

I have a binary sequence. This sequence represents an arbitrary precision integer but as far as the computer is concerned, it's just a binary sequence. I'm working in C++, with the multiprecision library. I only know how to assign values to the arbitrary precision datatype:
mp::cpp_int A = 51684861532215151;
How can I take a binary sequence and directly assign it to the datatype mp::cpp_int? I realize I can go through each bit and add 2^bit where ever I hit a 1, but I'm trying to avoid doing this.
REPLY:
Galik: My compiler (visual studio 2013) isn't liking that for some reason.
mp::cpp_int A = 0b0010011;
It keeps putting the red squiggly after the first 0.
Also yup, boost multiprecision.
How to construct a particular type of big integer from a sequence of raw bits depends on that particular type, on the various constructors/methods that it offers for the purpose and/or what operator overloads are available.
The only generic mechanisms involve constructing a big integer with one word's worth of low-order bits (since such a constructor is almost universally available) and then using arithmetic to push the bits in, one bit at a time or one word's worth of bits at a time. This reduces the dependence on particulars of the given type to a minimum and it may work across a wide range of types completely unchanged, but it is rather cumbersome and not very efficient.
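As a concrete example of that generic "push the bits in with arithmetic" approach using cpp_int (this sketch assumes the bytes arrive most significant first):

#include <boost/multiprecision/cpp_int.hpp>
#include <cstdint>
#include <vector>

namespace mp = boost::multiprecision;

mp::cpp_int from_bytes_generic(const std::vector<std::uint8_t>& bytes) {
    mp::cpp_int a = 0;
    for (std::uint8_t b : bytes) {
        a <<= 8;   // make room for the next byte
        a |= b;    // push it in through ordinary arithmetic
    }
    return a;
}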
The particular type of big integer shown in your code snippet looks like boost::multiprecision::cpp_int, and Olaf Dietsche has already provided a link to its main documentation page. Conversion to and from raw binary formats for this type is documented on the page Importing and Exporting Data to and from cpp_int and cpp_bin_float, including code examples like initialising a cpp_int from a vector<byte>.
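With cpp_int specifically, the documented import_bits facility avoids the manual loop; a sketch (the most significant chunk comes first by default):

#include <boost/multiprecision/cpp_int.hpp>
#include <cstdint>
#include <vector>

namespace mp = boost::multiprecision;

mp::cpp_int from_bytes(const std::vector<std::uint8_t>& bytes) {
    mp::cpp_int a;
    mp::import_bits(a, bytes.begin(), bytes.end(), 8);  // 8 bits taken from each element
    return a;
}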