Can marshalling or packing be implemented by unions? - c++

In Beej's Guide to Network Programming there is a section on marshalling (or packing) data for serialization, where he describes various functions for packing and unpacking data (int, float, double, etc.).
Wouldn't it be easier to use a union (a similar one can be defined for float and double) as defined below, and transmit integer.pack as the packed version of integer.i, rather than using pack and unpack functions?
union _integer {
    char pack[4];
    int i;
} integer;
Can someone shed some light on why a union is a bad choice?
Is there any better method of packing data?

Different computers may lay the data out differently. The classic issue is endianness (in your example, whether pack[0] holds the MSB or the LSB). Using a union like this ties the data to the specific representation on the computer that generated it.
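For illustration, a minimal sketch (not from the original answer) that inspects the bytes of an integer via memcpy, which is well-defined; the output differs between little- and big-endian hosts, which is exactly why the raw bytes are not portable:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    std::uint32_t value = 0x01020304;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value); // well-defined byte inspection

    // A little-endian host prints "04 03 02 01", a big-endian host
    // prints "01 02 03 04": the same value, two different byte layouts.
    for (unsigned char b : bytes)
        std::printf("%02x ", b);
    std::printf("\n");
}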
If you want to see other ways to marshal data, check out Boost.Serialization and Google's Protocol Buffers.

The union trick is not guaranteed to work, although it usually does. It's perfectly valid (according to the standard) for you to set the char data and then read zeros when you attempt to read the int, or vice versa. union was designed as a memory micro-optimization, not a replacement for casting.
At this point, you usually either wrap up the conversion in a handy object or use reinterpret_cast. Slightly bulky, or ugly... but neither of those is necessarily a bad thing when you're packing data.
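As a sketch of the "wrap it up" route (the function names here are made up), packing with explicit shifts fixes the byte order regardless of the host's representation, which is essentially what the pack functions in Beej's guide do for integers:

#include <cstdint>

// Pack a 32-bit value into big-endian (network) byte order using shifts;
// the output bytes are identical on every host, unlike the union trick.
void pack_u32(std::uint32_t v, unsigned char out[4])
{
    out[0] = static_cast<unsigned char>(v >> 24);
    out[1] = static_cast<unsigned char>(v >> 16);
    out[2] = static_cast<unsigned char>(v >> 8);
    out[3] = static_cast<unsigned char>(v);
}

std::uint32_t unpack_u32(const unsigned char in[4])
{
    return (std::uint32_t(in[0]) << 24) | (std::uint32_t(in[1]) << 16)
         | (std::uint32_t(in[2]) << 8)  |  std::uint32_t(in[3]);
}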

Why not just reinterpret_cast to a char*, or memcpy into a char buffer? They're basically the same thing and less confusing.
Your idea would work, so go for it if you want, but I find that clean code is happy code. The easier it is to understand my work, the less likely it is that someone (like my future self) will break it.
Also note that only POD (plain old data) types can be placed in a union, which puts some limitations on the union approach that aren't there in a more intuitive one.
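A minimal sketch of the memcpy version (still host byte order, so both ends of the transfer must agree on endianness):

#include <cstdint>
#include <cstring>

int main()
{
    std::int32_t original = 123456;
    char buffer[sizeof original];

    std::memcpy(buffer, &original, sizeof original); // "pack"

    std::int32_t restored;
    std::memcpy(&restored, buffer, sizeof restored); // "unpack"
    return restored == original ? 0 : 1;
}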

Related

Advantages/Disadvantages of using __int16 (or int16_t) over int

As far as I understand, the number of bytes used for int is system dependent. Usually, 2 or 4 bytes are used for int.
As per Microsoft's documentation, __int8, __int16, __int32 and __int64 are Microsoft-specific keywords. Furthermore, __int16 uses 16 bits (i.e. 2 bytes).
Question: What are advantage/disadvantage of using __int16 (or int16_t)? For example, if I am sure that the value of my integer variable will never need more than 16 bits then, will it be beneficial to declare the variable as __int16 var (or int16_t var)?
UPDATE: I see that several comments/answers suggest using int16_t instead of __int16, which is a good suggestion but not really an advantage/disadvantage of using __int16. Basically, my question is: what is the advantage/disadvantage of saving 2 bytes by using a 16-bit integer instead of int?
Saving 2 bytes is almost never worth it. However, saving thousands of bytes is. If you have a large array containing integers, using a small integer type can save quite a lot of memory. This leads to faster code, because the less memory you use, the fewer cache misses you get (and cache misses are a major loss of performance).
TL;DR: this is beneficial in large arrays, but pointless for one-off variables.
The second use for these types is when dealing with binary files and messages. If you are reading a binary file that uses 16-bit integers, it's pretty convenient if you can represent that type exactly in your code.
BTW, don't use Microsoft's versions. Use the standard versions (std::int16_t).
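For instance, a rough sketch of the memory difference for a large array (assuming a typical platform where int is 4 bytes):

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
    const std::size_t n = 1000000; // a million values known to fit in 16 bits

    // Typically 4,000,000 bytes, since int is usually 4 bytes...
    std::printf("int array:     %zu bytes\n", n * sizeof(int));
    // ...versus a guaranteed 2,000,000 bytes: half the memory, and
    // correspondingly less cache traffic when scanning the array.
    std::printf("int16_t array: %zu bytes\n", n * sizeof(std::int16_t));
}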
It depends.
On x86, primitive types are generally aligned to their size, so 2-byte types are aligned on a 2-byte boundary. This is useful when you have more than one of these short variables, because you will be saving 50% of the space. That directly translates to better memory and cache utilization, and thus, theoretically, better performance.
On the other hand, doing arithmetic on shorter-than-int types usually involves a widening conversion to int. So if you do a lot of arithmetic on these types, using int might result in better performance (contrived example).
So if you care about performance of a critical section of code, profile it to find out for sure if using a certain data type is faster or slower.
A possible rule of thumb: if you're memory-bound (i.e. you have lots of variables, and especially arrays), use data types as short as possible. If not, don't worry about it and use int types.
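To make the widening conversion concrete, a small sketch:

#include <cstdint>
#include <type_traits>

int main()
{
    std::int16_t a = 1000, b = 2000;

    // Operands narrower than int are promoted to int before arithmetic,
    // so the type of a + b is int, not int16_t.
    auto sum = a + b;
    static_assert(std::is_same<decltype(sum), int>::value,
                  "a + b was widened to int");

    // Storing the result back into an int16_t needs an explicit
    // (potentially narrowing) conversion.
    std::int16_t back = static_cast<std::int16_t>(sum);
    return back == 3000 ? 0 : 1;
}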
If for some reason you just need a shorter integer type, the language already has one, called short. Unless you know you need exactly 16 bits, there's really no good reason not to stick with the size-agnostic short and int types. The broad idea is that these types should align well with the target architecture (for example, see word size).
That being said, there's no need to use the platform-specific type (__int16); you can just use the standard one:
int16_t
See https://en.cppreference.com/w/cpp/types/integer for more information on the standard types.
Even if you still insist on __int16, you probably want a type alias, something like:
using my_short = __int16;
Update
Your main question is:
What is the advantage/disadvantage of saving 2 bytes by using a 16-bit version of an integer instead of int?
If you have a lot of data (in the ballpark of at least 100,000-1,000,000 elements, as a rule of thumb), then there can be an overall performance saving from using less CPU cache. Overall, there's no disadvantage to using a smaller type, except for the obvious one and the possible conversions explained in this answer.
The main reason for using these types is to be sure about the size of your variable across different architectures and compilers. We call this "code reusability" and "portability".
In higher-level modern languages, all of this is handled by the compiler/interpreter/virtual machine/etc., so you don't need to worry about it, but that comes with some performance and memory usage costs.
When you have some kind of limitation, you may need to optimize everything. The best example is embedded systems, which have a very limited amount of memory and run at low frequencies. On top of that, there are lots of compilers out there with different implementations: some of them interpret int as a 16-bit value and some as a 32-bit one.
For example, say you receive a specific stream of values over a communication system and want to save them in a buffer or array; you want to make sure the input data is always interpreted as 16 bits, nothing else.
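A minimal sketch of that situation (the file name and buffer size are invented for the example):

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* f = std::fopen("samples.bin", "rb"); // hypothetical input
    if (!f)
        return 1;

    // int16_t is exactly 16 bits on every platform that provides it; with
    // plain int the element size, and thus the parsing, could differ.
    std::vector<std::int16_t> buffer(1024);
    std::size_t got = std::fread(buffer.data(), sizeof buffer[0],
                                 buffer.size(), f);
    std::fclose(f);

    std::printf("read %zu samples\n", got);
}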

Typical case in padding of structures in c++ [duplicate]

If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of the lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to distribute a C++ class as a DLL, one is faced with one of the fundamental weaknesses of C++, that is, lack of standardization at the binary level. Although the ISO/ANSI C++ Draft Working Paper attempts to codify which programs will compile and what the semantic effects of running them will be, it makes no attempt to standardize the binary runtime model of C++. The first time this problem will become evident is when a client tries to link against the FastString DLL's import library from a C++ development environment other than the one used to build the FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that: if you write two structs whose members are exactly the same, with the only difference being the order in which they're declared, the size of each struct can be (and often is) different.
For example, see this:
#include <iostream>
using namespace std;

struct A
{
    char c;
    char d;
    int i;
};

struct B
{
    char c;
    int i;
    char d;
};

int main()
{
    cout << sizeof(A) << endl;
    cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, the sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is to design it so that there is no need to insert pad bytes into it. The second trick is that you must handle differences in endianness.
I'll describe how to construct the struct using scalars, but you should be able to use nested structs, as long as you apply the same design to each included struct.
First, a basic fact in C and C++: the alignment of a type cannot exceed the size of the type. If it could, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Lay out the struct, starting with the largest types:
struct
{
    uint64_t alpha;
    uint32_t beta;
    uint32_t gamma;
    uint8_t  delta;
Next, pad the struct out manually, so that in the end you match up with the largest type:
    uint8_t  pad8[3]; // Match uint32_t
    uint32_t pad32;   // Even number of uint32_t
};
The next step is to decide whether the struct should be stored in little- or big-endian format. The best way is to "swap" all the elements in place before writing or after reading the struct, if the storage format does not match the endianness of the host system.
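For example, a minimal 32-bit swap helper (the name is made up; a 64-bit version for alpha would follow the same pattern):

#include <stdint.h>

// Reverse the bytes of a 32-bit value, converting between little- and
// big-endian representations. Apply it to each multi-byte member before
// writing (or after reading) when storage and host endianness differ.
uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}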
No, there's no safe way. In addition to padding, you have to deal with different byte ordering and different sizes of built-in types.
You need to define a file format and convert your struct to and from that format. Serialization libraries (e.g. Boost.Serialization, or Google's Protocol Buffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment requirements (3.9.1, 3.9.2). The alignment of a complete object type is an implementation-defined integer value representing a number of bytes; an object is allocated at an address that meets the alignment requirements of its object type.
But it goes on from there and winds off into many dark corners of the Standard. Alignment is "implementation-defined", meaning it can be different across different compilers, or even across address models (i.e. 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disk in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321", and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything, because that way they can avoid endianness and other issues.

Loading struct from file

I have read about problems with loading structs from file. There are problems with endianness and different variable sizes. But let us say that there is a structure like this one:
#include <cstdint>

struct Structure
{
    uint8_t  value1;
    uint16_t value2;
    uint32_t value3;
    uint64_t value4;
};
Let us say that the file is always written in little-endian format, so the application reads it in a strict way. In that case endianness should not cause any problems. (Let us assume that there is some kind of convertEndianness() function which is clever enough to handle the byte-order issue.) The second thing which I know must be considered is the variety of variable sizes. Here is my question: do fixed-size types manage to handle this problem, and what else should I consider in order to create a multiplatform binary file?
Do fixed-size types manage to handle this problem?
Not quite.
The fixed-size types have fixed sizes, but their alignment requirements (and therefore padding) may vary between platforms and/or ABI flavours. So your struct could still have a different layout on different platforms, even with the same endianness.
You can insist that there should be no padding, and use some compiler-specific and non-standard way to specify this (like #pragma pack or __attribute__((packed))). This can produce worse code for accessing misaligned members directly, though.
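For example, a sketch using the GCC/Clang spelling (MSVC would instead wrap the definition in #pragma pack(push, 1) / #pragma pack(pop)), with a static_assert to catch surprise padding at compile time:

#include <cstdint>

struct __attribute__((packed)) Record
{
    uint8_t  value1;
    uint16_t value2;
    uint32_t value3;
    uint64_t value4;
};

// Fails to compile if the compiler inserted any padding anyway.
static_assert(sizeof(Record) == 1 + 2 + 4 + 8, "unexpected padding");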
what else should I consider in order to create a multiplatform binary file?
If you choose an endianness, use fixed-size types and specify the alignment correctly, you're probably fine.
I'd strongly suggest adding a header and/or some framing information, with a version and possibly some metadata about the sizes and alignments you chose. Otherwise you can never change the file format in the future without things breaking in unpleasant ways.
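Something as small as this (the layout is hypothetical) is enough to let a reader verify the format and reject versions it cannot parse:

#include <cstdint>

struct FileHeader
{
    uint32_t magic;   // fixed constant identifying the file format
    uint16_t version; // bump this on any layout change
    uint16_t flags;   // e.g. bit 0: payload is little-endian
};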

union vs bit masking and bit shifting

What are the disadvantages of using unions when storing information like a series of bytes that can be accessed all at once or one byte at a time?
Example : A Color can be represented in RGBA. So a color type may be defined as,
typedef unsigned int RGBAColor;
Then we can use shifting and masking of bits to retrieve or set the red, green, blue and alpha values of an RGBAColor object (just as is done in Direct3D with macro functions such as D3DCOLOR_ARGB()).
But what if I used a union,
union RGBAColor
{
    unsigned int Color;
    struct RGBAColorComponents
    {
        unsigned char Red;
        unsigned char Green;
        unsigned char Blue;
        unsigned char Alpha;
    } Component;
};
Then I would not always need to do the shifting (<<) or masking (&) to read or write the color components. But is there a problem with this? (I suspect that there is, because I haven't seen anyone use such a method.)
Can endianness be a problem? If we always use Component for accessing the color components and use Color for accessing the whole thing (for copying, assigning, etc. as a whole), endianness should not be a problem, right?
-- EDIT --
I found an old post about the same problem, so I guess this question is kind of a repost :P sorry for that. Here is the link: Is it a good practice to use unions in C++?
According to the answers, it seems that the use of unions in the given example is OK in C++, because there is no change of data type there; it's just two ways to access the same data. Please correct me if I am wrong. Thanks. :)
This usage of unions is illegal in C++, where a union comprises overlapping but mutually exclusive objects. You are not allowed to write one member of a union and then read out another member.
It is legal in C where this is a recommended way of type punning.
This relates to the issue of (strict) aliasing, which is a difficulty the compiler faces when trying to determine whether two objects with different types are distinct. The language standards disagree because the experts are still figuring out what guarantees can safely be provided without sacrificing performance. Personally, I avoid all of this. What would the int actually be used for? The safe way to translate is to copy the bytes, e.g. with memcpy.
There is also the endianness issue, but whether that matters depends on what you want to do with the int.
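For reference, a sketch of the shifting-and-masking alternative from the question, which stays portable because it operates on values rather than on memory layout (the 0xAABBGGRR ordering here is an assumption for the example, not Direct3D's layout):

#include <cstdint>

// Extract components from a color packed as 0xAABBGGRR.
std::uint8_t red  (std::uint32_t c) { return  c        & 0xFFu; }
std::uint8_t green(std::uint32_t c) { return (c >> 8)  & 0xFFu; }
std::uint8_t blue (std::uint32_t c) { return (c >> 16) & 0xFFu; }
std::uint8_t alpha(std::uint32_t c) { return (c >> 24) & 0xFFu; }

// Build the packed value back up; the result is the same on every host.
std::uint32_t make_color(std::uint8_t r, std::uint8_t g,
                         std::uint8_t b, std::uint8_t a)
{
    return  std::uint32_t(r)
         | (std::uint32_t(g) << 8)
         | (std::uint32_t(b) << 16)
         | (std::uint32_t(a) << 24);
}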
I believe using the union solves any problems related to endianness, as most likely the RGBA order is defined in network order. Also, the fact that each component is a uint8_t or similar can help some compilers use sign/zero-extended loads, store the low 8 bits directly through an unaligned byte pointer, and even parallelize some byte operations (e.g. ARM has some packed 4x8-bit instructions).

byte representation of numbers

I use the following way to get the byte representation of numbers in C++:
template< class T >
union u_value
{
    T value;
    unsigned char bytes[ sizeof( T ) ];
};
Please tell me, is that the right way? And if not, why not, and how should I get it?
There is no "true" way. What you did is one way to do it. Generally speaking, stuff like this is discouraged, as it usually results in non-portable code. Sometimes there are good reasons for poking at internals like that, but since you didn't tell us what you're about to do with the "byte representation", there's little we can do to judge whether this approach is appropriate.
Edit: So networking is your subject here. In this case, either:
You are transferring POD types only (char, short, int, and the like). In this case, you might want to look into <netinet/in.h>, where you'll find the macros htons(), htonl(), ntohs() and ntohl(), which convert host to network byte order and vice versa for you.
You are transferring complex types (e.g. classes). In this case, you might want to look into Boost Serialization, because there's much more to be considered here than mere byte order.
Either way, it is advisable to use ready-made, well-documented and well-understood code instead of doing the byte-juggling yourself.
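A sketch of the POD route with the byte-order functions (declared in <arpa/inet.h> on POSIX systems; Windows provides them via <winsock2.h>):

#include <arpa/inet.h> // htonl / ntohl
#include <cstdint>
#include <cstring>

// Convert to network byte order before sending...
void put_u32(std::uint32_t host_value, unsigned char out[4])
{
    std::uint32_t net = htonl(host_value);
    std::memcpy(out, &net, sizeof net);
}

// ...and back to host order after receiving.
std::uint32_t get_u32(const unsigned char in[4])
{
    std::uint32_t net;
    std::memcpy(&net, in, sizeof net);
    return ntohl(net);
}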
Not the true way. This way is not portable and may often result in undefined behavior, because:
Padding (done by the compiler) is ignored
Extra space for virtual functions is not visibly accounted for (if T is polymorphic)
If you are transferring the bytes across the network, you need to be careful about endianness
This will work. Just be sure that anything with virtual functions, pointers and so on won't be accurate the next time you deserialize it.