union vs bit masking and bit shifting - c++

What are the disadvantages of using unions when storing some information like a series of bytes and being able to access them at once or one by one.
Example : A Color can be represented in RGBA. So a color type may be defined as,
typedef unsigned int RGBAColor;
Then we can use "shifting and masking" of bits to "retrieve or set" the red, green, blue, alpha values of a RGBAColor object ( just like it is done in Direct3D functions with the macro functions such as D3DCOLOR_ARGB() ).
But what if I used a union,
union RGBAColor
{
unsigned int Color;
struct RGBAColorComponents
{
unsigned char Red;
unsigned char Green;
unsigned char Blue;
unsigned char Alpha;
} Component;
};
Then I will not be needing to always do the shifting (<<) or masking (&) for reading or writing the color components. But is there problem with this? ( I suspect that this has some problem because I haven't seen anyone using such a method. )
Can Endianness Be a broblem? If we always use Component for accessing color components and use Color for accessing the whole thing ( for copying, assigning, etc.. as a whole ) the endianness should not be a problem, right?
-- EDIT --
I found an old post which is the same problem. So i guess this question is kinda repost :P sorry for that. here is the link : Is it a good practice to use unions in C++?
According to the answers it seems that the use of unions for the given example is OK in C++. Because there is no change of data type in there, its just two ways to access the same data. Please correct me if i am wrong. Thanks. :)

This usage of unions is illegal in C++, where a union comprises overlapping, but mutually exclusive objects. You are not allowed to write one member of a union, then read out another member.
It is legal in C where this is a recommended way of type punning.
This relates to the issue of (strict) aliasing, which is a difficulty faced by the compiler when trying to determine whether two objects with different types are distinct. The language standards disagree because the experts are still figuring out what guarantees can safely be provided without sacrificing performance. Personally, I avoid all of this. What would the int actually be used for? The safe way to translate is to copy the bytes, as by memcpy.
There is also the endianness issue, but whether that matters depends on what you want to do with the int.

I believe using the union solves any problems related to endianness, as most likely the RGBA order is defined in network order. Also the fact that each component will be uint8_t or such, can help some compilers to use sign/zero extended loads, storing the low 8 bits directly to a nonaligned byte pointer and being even able to parallelize some byte operations (e.g. arm has some packed 4x8 bit instructions).

Related

Typical case in padding of structures in c++ [duplicate]

If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to
distribute a C++ class as a DLL, one
is faced with one of the fundamental
weaknesses of C++, that is, lack of
standardization at the binary level.
Although the ISO/ANSI C++ Draft
Working Paper attempts to codify which
programs will compile and what the
semantic effects of running them will
be, it makes no attempt to standardize
the binary runtime model of C++. The
first time this problem will become
evident is when a client tries to link
against the FastString DLL's import library from
a C++ developement environment other
than the one used to build the
FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that if you write two structs whose members are exactly same, the only difference is that the order in which they're declared is different, then the size of each struct can be (and often is) different.
For example, see this,
struct A
{
char c;
char d;
int i;
};
struct B
{
char c;
int i;
char d;
};
int main() {
cout << sizeof(A) << endl;
cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there would be no need to insert pad bytes into it. the second trick is that you must handle differences in endianess.
I'll describe how to construct the struct using scalars, but the you should be able to use nested structs, as long as you would apply the same design for each included struct.
First, a basic fact in C and C++ is that the alignment of a type can not exceed the size of the type. If it would, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
uint64_t alpha;
uint32_t beta;
uint32_t gamma;
uint8_t delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
uint8_t pad8[3]; // Match uint32_t
uint32_t pad32; // Even number of uint32_t
}
Next step is to decide if the struct should be stored in little or big endian format. The best way is to "swap" all the element in situ before writing or after reading the struct, if the storage format does not match the endianess of the host system.
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment
requirements (3.9.1, 3.9.2). The
alignment of a complete object type is
an implementation-defined integer
value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements
of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.

Why is getting extra bytes in a convertion from Struct to array bytes? [duplicate]

If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to
distribute a C++ class as a DLL, one
is faced with one of the fundamental
weaknesses of C++, that is, lack of
standardization at the binary level.
Although the ISO/ANSI C++ Draft
Working Paper attempts to codify which
programs will compile and what the
semantic effects of running them will
be, it makes no attempt to standardize
the binary runtime model of C++. The
first time this problem will become
evident is when a client tries to link
against the FastString DLL's import library from
a C++ developement environment other
than the one used to build the
FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that if you write two structs whose members are exactly same, the only difference is that the order in which they're declared is different, then the size of each struct can be (and often is) different.
For example, see this,
struct A
{
char c;
char d;
int i;
};
struct B
{
char c;
int i;
char d;
};
int main() {
cout << sizeof(A) << endl;
cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there would be no need to insert pad bytes into it. the second trick is that you must handle differences in endianess.
I'll describe how to construct the struct using scalars, but the you should be able to use nested structs, as long as you would apply the same design for each included struct.
First, a basic fact in C and C++ is that the alignment of a type can not exceed the size of the type. If it would, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
uint64_t alpha;
uint32_t beta;
uint32_t gamma;
uint8_t delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
uint8_t pad8[3]; // Match uint32_t
uint32_t pad32; // Even number of uint32_t
}
Next step is to decide if the struct should be stored in little or big endian format. The best way is to "swap" all the element in situ before writing or after reading the struct, if the storage format does not match the endianess of the host system.
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment
requirements (3.9.1, 3.9.2). The
alignment of a complete object type is
an implementation-defined integer
value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements
of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.

Creating a simple portable bitmask and using it

This is my first time trying to create a bitmask, and although seemingly simple I have having trouble visualizing everything.
Keep in mind I cannot use std::bitset
First, I have read that accessing raw bits is undefined behavior. (so using a union of a char would be bad because the bits might be reversed for a different compiler).
Most code I've looked at uses a struct to define each bit, and this way of structuring data should be compiler independent because the first bit will always be the LSB. (I assume) Here is an example:
struct foo
{
unsigned char a : 1;
unsigned char b : 1;
unsigned char unused : 6;
};
Now the question is...could you use more than one bit for a variable in the struct AND have it still be comipiler independent? It seems like the answer is yes, but I have had some weird answers and want to be sure. Something like:
struct foo
{
unsigned char ab : 2;
unsigned char unused : 6;
};
It seems like regardless if the raw structure is reversed, the first bit accessed from the struct is always the LSB, so how many bits you use should not matter.
The C standard does not specify the ordering of fields within a unit -- there's no guarantee that a, in your example, is in the LSB. If you want fully portable behavior, you need to do the bit manipulation yourself, using unsigned integral types, and (if using unsigned integral types bigger than a byte) you need to worry about the endianness when reading/writing them from external sources.
The behaviour does not depend on the bit order. What you have written corresponds to the language standard and therefore behaves the same on all platforms.
Bitfields cannot be portably used to access specific bits in an external block of data (like a hardware register or data serialized in a stream of bytes). So bitfields aren't useful in this context - at least for portable code.
But if you're talking about using the bitfield within the program and not trying to have it model some external bit representation, then it's 100% portable. Not super useful, but portable.
I've spent a career twiddling bits in C/C++, and maybe because of this issue, I never see it done this way. We always use unsigned variables and apply bit masks to them:
#define BITMASK_A 0x01
#define BITMASK_B 0x02
unsigned char bitfield;
Then when you want to access a, you use (bitfield & BITMASK_A)
But to answer your question, there should be no logical difference between your two examples, if the compiler places ab at the low end, then the first example should also place a at the LSb.

Struct padding in C++

If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No. That is not possible. It's because of lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++)
C++ and Portability
Once the decision is made to
distribute a C++ class as a DLL, one
is faced with one of the fundamental
weaknesses of C++, that is, lack of
standardization at the binary level.
Although the ISO/ANSI C++ Draft
Working Paper attempts to codify which
programs will compile and what the
semantic effects of running them will
be, it makes no attempt to standardize
the binary runtime model of C++. The
first time this problem will become
evident is when a client tries to link
against the FastString DLL's import library from
a C++ developement environment other
than the one used to build the
FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that if you write two structs whose members are exactly same, the only difference is that the order in which they're declared is different, then the size of each struct can be (and often is) different.
For example, see this,
struct A
{
char c;
char d;
int i;
};
struct B
{
char c;
int i;
char d;
};
int main() {
cout << sizeof(A) << endl;
cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there would be no need to insert pad bytes into it. the second trick is that you must handle differences in endianess.
I'll describe how to construct the struct using scalars, but the you should be able to use nested structs, as long as you would apply the same design for each included struct.
First, a basic fact in C and C++ is that the alignment of a type can not exceed the size of the type. If it would, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
uint64_t alpha;
uint32_t beta;
uint32_t gamma;
uint8_t delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
uint8_t pad8[3]; // Match uint32_t
uint32_t pad32; // Even number of uint32_t
}
Next step is to decide if the struct should be stored in little or big endian format. The best way is to "swap" all the element in situ before writing or after reading the struct, if the storage format does not match the endianess of the host system.
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment
requirements (3.9.1, 3.9.2). The
alignment of a complete object type is
an implementation-defined integer
value representing a number of bytes;
an object is allocated at an address
that meets the alignment requirements
of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.

Can marshalling or packing be implemented by unions?

In beej's guide to networking there is a section of marshalling or packing data for Serialization where he describes various functions for packing and unpacking data (int,float,double ..etc).
It is easier to use union(similar can be defined for float and double) as defined below and transmit integer.pack as packed version of integer.i, rather than pack and unpack functions.
union _integer{
char pack[4];
int i;
}integer;
Can some one shed some light on why union is a bad choice?
Is there any better method of packing data?
Different computers may lay the data out differently. The classic issue is endianess (in your example, whether pack[0] has the MSB or LSB). Using a union like this ties the data to the specific representation on the computer that generated it.
If you want to see other ways to marshall data, check out the Boost serialization and Google protobuf.
The union trick is not guaranteed to work, although it usually does. It's perfectly valid (according to the standard) for you to set the char data, and then read 0s when you attempt to read the int, or vice-versa. union was designed to be a memory micro-optimization, not a replacement for casting.
At this point, usually you either wrap up the conversion in a handy object or use reinterpret_cast. Slightly bulky, or ugly... but neither of those are necessarily bad things when you're packing data.
Why not just do a reinterpret_cast to a char* or a memcpy into a char buffer? They're basically the same thing and less confusing.
Your idea would work, so go for it if you want, but I find that clean code is happy code. The easier it is to understand my work, the less likely it is that someone (like my future self) will break it.
Also note that only POD (plain old data) types can be placed in a union, which puts some limitations on the union approach that aren't there in a more intuitive one.