Creating a simple portable bitmask and using it - c++

This is my first time trying to create a bitmask, and although it seems simple, I am having trouble visualizing everything.
Keep in mind I cannot use std::bitset.
First, I have read that accessing raw bits is undefined behavior. (so using a union of a char would be bad because the bits might be reversed for a different compiler).
Most code I've looked at uses a struct to define each bit, and this way of structuring data should be compiler-independent because (I assume) the first bit will always be the LSB. Here is an example:
struct foo
{
unsigned char a : 1;
unsigned char b : 1;
unsigned char unused : 6;
};
Now the question is... could you use more than one bit for a variable in the struct AND have it still be compiler-independent? It seems like the answer is yes, but I have had some weird answers and want to be sure. Something like:
struct foo
{
unsigned char ab : 2;
unsigned char unused : 6;
};
It seems like, regardless of whether the raw structure is reversed, the first bit accessed from the struct is always the LSB, so how many bits you use should not matter.

The C standard does not specify the ordering of fields within a unit -- there's no guarantee that a, in your example, is in the LSB. If you want fully portable behavior, you need to do the bit manipulation yourself, using unsigned integral types, and (if using unsigned integral types bigger than a byte) you need to worry about the endianness when reading/writing them from external sources.
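For illustration, here is a minimal sketch of that approach (the field layout, the names pack/to_bytes, and the choice of little-endian storage are all just assumptions for the example):
#include <cstdint>

// Invented layout: bits 0-1 hold 'ab', bits 8-15 hold 'count'.
std::uint16_t pack(unsigned ab, unsigned count)
{
    return static_cast<std::uint16_t>((ab & 0x3u) | ((count & 0xFFu) << 8));
}

// Serialize in an explicitly chosen byte order (little-endian here),
// independent of the host's endianness.
void to_bytes(std::uint16_t v, unsigned char out[2])
{
    out[0] = static_cast<unsigned char>(v & 0xFFu);
    out[1] = static_cast<unsigned char>((v >> 8) & 0xFFu);
}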

The behaviour does not depend on the bit order. What you have written corresponds to the language standard and therefore behaves the same on all platforms.

Bitfields cannot be portably used to access specific bits in an external block of data (like a hardware register or data serialized in a stream of bytes). So bitfields aren't useful in this context - at least for portable code.
But if you're talking about using the bitfield within the program and not trying to have it model some external bit representation, then it's 100% portable. Not super useful, but portable.

I've spent a career twiddling bits in C/C++, and maybe because of this issue, I never see it done this way. We always use unsigned variables and apply bit masks to them:
#define BITMASK_A 0x01
#define BITMASK_B 0x02
unsigned char bitfield;
Then when you want to access a, you use (bitfield & BITMASK_A)
But to answer your question, there should be no logical difference between your two examples: if the compiler places ab at the low end, then the first example should also place a at the LSB.
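Spelling out the usual idioms with those masks (using the BITMASK_* names defined above, after initializing bitfield to 0):
bitfield |= BITMASK_A;            // set a
bitfield &= ~BITMASK_B;           // clear b
bitfield ^= BITMASK_A;            // toggle a
if (bitfield & BITMASK_B) { /* b is set */ }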

Why am I getting extra bytes in a conversion from struct to byte array? [duplicate]

If I have a struct in C++, is there no way to safely read/write it to a file that is cross-platform/compiler compatible?
Because if I understand correctly, every compiler 'pads' differently based on the target platform.
No, that is not possible. It's because of the lack of standardization of C++ at the binary level.
Don Box writes (quoting from his book Essential COM, chapter COM As A Better C++):
C++ and Portability
Once the decision is made to distribute a C++ class as a DLL, one is faced with one of the fundamental weaknesses of C++, that is, lack of standardization at the binary level. Although the ISO/ANSI C++ Draft Working Paper attempts to codify which programs will compile and what the semantic effects of running them will be, it makes no attempt to standardize the binary runtime model of C++. The first time this problem will become evident is when a client tries to link against the FastString DLL's import library from a C++ development environment other than the one used to build the FastString DLL.
Struct padding is done differently by different compilers. Even if you use the same compiler, the packing alignment for structs can be different based on what pragma pack you're using.
Not only that, if you write two structs whose members are exactly the same, and the only difference is the order in which they're declared, the size of each struct can be (and often is) different.
For example, see this,
#include <iostream>
using namespace std;

struct A
{
    char c;
    char d;
    int i;
};

struct B
{
    char c;
    int i;
    char d;
};

int main()
{
    cout << sizeof(A) << endl;
    cout << sizeof(B) << endl;
}
Compile it with gcc-4.3.4, and you get this output:
8
12
That is, sizes are different even though both structs have the same members!
The bottom line is that the standard doesn't talk about how padding should be done, and so the compilers are free to make any decision and you cannot assume all compilers make the same decision.
If you have the opportunity to design the struct yourself, it should be possible. The basic idea is that you should design it so that there is no need to insert pad bytes into it. The second trick is that you must handle differences in endianness.
I'll describe how to construct the struct using scalars, but you should be able to use nested structs as well, as long as you apply the same design to each included struct.
First, a basic fact in C and C++ is that the alignment of a type cannot exceed the size of the type. If it could, then it would not be possible to allocate memory using malloc(N*sizeof(the_type)).
Layout the struct, starting with the largest types.
struct
{
    uint64_t alpha;
    uint32_t beta;
    uint32_t gamma;
    uint8_t  delta;
Next, pad out the struct manually, so that in the end you will match up the largest type:
    uint8_t  pad8[3]; // Match uint32_t
    uint32_t pad32;   // Even number of uint32_t
};
The next step is to decide whether the struct should be stored in little-endian or big-endian format. The best way is to "swap" all the elements in place before writing or after reading the struct, if the storage format does not match the endianness of the host system.
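For example, a minimal sketch of the in-place swap for one of the 32-bit members (swap32 and fix_member are hypothetical helpers; whether you call them depends on whether the host byte order matches the chosen storage format):
#include <cstdint>

// Reverse the bytes of a 32-bit value; a 64-bit version follows the
// same pattern with more shifts.
static std::uint32_t swap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Swap a member in place just before writing (or just after reading)
// when the storage format does not match the host endianness.
void fix_member(std::uint32_t& beta)
{
    beta = swap32(beta);
}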
No, there's no safe way. In addition to padding, you have to deal with different byte ordering, and different sizes of builtin types.
You need to define a file format, and convert your struct to and from that format. Serialization libraries (e.g. boost::serialization, or google's protocolbuffers) can help with this.
Long story short, no. There is no platform-independent, Standard-conformant way to deal with padding.
Padding is called "alignment" in the Standard, and it begins discussing it in 3.9/5:
Object types have alignment requirements (3.9.1, 3.9.2). The alignment of a complete object type is an implementation-defined integer value representing a number of bytes; an object is allocated at an address that meets the alignment requirements of its object type.
But it goes on from there and winds off to many dark corners of the Standard. Alignment is "implementation-defined" meaning it can be different across different compilers, or even across address models (ie 32-bit/64-bit) under the same compiler.
Unless you have truly harsh performance requirements, you might consider storing your data to disc in a different format, like char strings. Many high-performance protocols send everything using strings when the natural format might be something else. For example, a low-latency exchange feed I recently worked on sends dates as strings formatted like this: "20110321" and times are sent similarly: "141055.200". Even though this exchange feed sends 5 million messages per second all day long, they still use strings for everything because that way they can avoid endian-ness and other issues.

Can I use data types like bool to compress data while improving readability?

My official question will be: "Is there a clean way to use data types to "encode and compress" data rather than using messy bit masking." The hopes would be to save space in the case of compressing, and I would like to use native data types, structures, and arrays in order to improve readability over bit masking. I am proficient in bit masking from my assembly background but I am learning C++ and OOP. We can store so much information in a 32 bit register by using individual bits and I feel that I am trying to get back to that low level environment while having the readability of C++ code.
I am attempting to save some space because I am working with huge resource requirements. I am still learning more about how C++ treats the bool data type. I realize that memory is stored in byte chunks and not individual bits. I believe that a bool usually uses one byte and is masked somehow. In my head I could use 8 bool values in one byte.
If I malloc an array of 2 bool elements in C++, does it allocate two bytes or just one?
Example: We will use DNA as an example since it can be encoded into two bits to represent A, C, G and T. If I make a struct with two bools called DNA_Base, then I make an array of 7 of those.
struct DNA_Base{ bool Bit_1; bool Bit_2; };
DNA_Base DNA_Sequence[7] = {false};
cout << sizeof(DNA_Base)<<sizeof(DNA_Sequence)<<endl;
//Yields a 2 and a 14.
//I would like this to say 1 and 2.
In my example I would also show the case where the DNA sequence can be 20 bases long which would require 40 bits to encode. GATTACA could only take up a maximum of 2 bytes? I suppose an alternative question would have been "How to make C++ do the bit masking for me in a more readable way" or should I just make my own data type and classes and implement the bit masking using classes and operator overloading.
Not fully what you want, but you can use a bitfield:
struct DNA_Base
{
unsigned char Bit_1 : 1;
unsigned char Bit_2 : 1;
};
DNA_Base DNA_Sequence[7];
So sizeof(DNA_Base) == 1 and sizeof(DNA_Sequence) == 7
But you have to pack several bases into one struct to avoid losing space to padding, something like:
struct DNA_Base_4
{
unsigned char base1 : 2; // may have value 0 1 2 or 3
unsigned char base2 : 2;
unsigned char base3 : 2;
unsigned char base4 : 2;
};
So sizeof(DNA_Base_4) == 1
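As a small sketch of how you might fill it in (encode_base is a hypothetical helper mapping a base letter to its 2-bit code):
// Hypothetical helper: map a base letter to its 2-bit code.
unsigned char encode_base(char b)
{
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3; // 'T'
    }
}

void example()
{
    // Pack "GATT" into one byte-sized DNA_Base_4.
    DNA_Base_4 gatt = { encode_base('G'), encode_base('A'),
                        encode_base('T'), encode_base('T') };
    (void)gatt; // gatt.base1 == 2, gatt.base2 == 0, and so on
}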
std::bitset is another alternative, but you have to do the interpretation job yourself.
An array of bools will be N-elements x sizeof(bool).
If your goal is to save space in registers, don't bother, because it is actually more efficient to use a word size for the processor in question than to use a single byte, and the compiler will prefer to use a word anyway, so in a struct/class the bool will usually be expanded to a 32-bit or 64-bit native word.
Now, if you like to save room on disk, or in RAM, due to needing to store LOTS of bools, go ahead, but it isn't going to save room in all cases unless you actually pack the structure, and on some architectures packing can also have performance impact because the CPU will have to perform unaligned or byte-by-byte access.
A bitmask (or bitfield), on the other hand, is performant and efficient and as dense as possible, and uses a single bitwise operation. I would look at one of the abstract data types that provide bit fields.
The standard library has bitset http://www.cplusplus.com/reference/bitset/bitset/ which can be as long as you want.
Boost also has something I'm sure.
Unless you are on a 4 bit machine, the final result will be using bit arithmetic. Whether you do it explicitly, have the compiler do it via bit fields, or use a bit container, there will be bit manipulation.
I suggest the following:
Use existing compression libraries.
Use the method that is most readable or understood by people other than yourself.
Use the method that is most productive (talking about development time).
Use the method with which you will inject the fewest defects.
Edit 1:
Write each method up as a separate function.
Tell the compiler to generate the assembly language for each function.
Compare the assembly language of each function to each other.
My belief is that they will be very similar, enough that wasting time discussing them is not worthwhile.
You can't operate on bits directly, but you can treat the smallest unit available to you as a store for multiple values, and define
enum class DNAx4 : uint8_t {
    AAAA = 0x00, AAAC = 0x01, AAAG = 0x02, AAAT = 0x03,
    // .... And the rest of them
    TTTA = 0xFC, TTTC = 0xFD, TTTG = 0xFE, TTTT = 0xFF
};
I'd actually go further, and create a structure DNAx16 or DNAx32 to efficiently use the native word size on your machine.
You can then define functions on the data type, which will have to use the underlying bit representation, but at least it allows you to encapsulate this and build higher level operations from these primitives.
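For instance, continuing the enum above, a sketch of one such primitive (base_at is a hypothetical name; it assumes the first letter of the enumerator name sits in the top two bits, which matches the values shown):
// Extract the i-th base (0 = first letter of the name) from a packed
// DNAx4 value; returns its 2-bit code (0=A, 1=C, 2=G, 3=T).
uint8_t base_at(DNAx4 packed, unsigned i)
{
    return static_cast<uint8_t>(
        (static_cast<uint8_t>(packed) >> (2 * (3 - i))) & 0x3u);
}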

How many options can I include in a bit mask?

In this question it was pointed out that:
Using int [for bit mask] is asking for trouble
I have been using an unsigned char to store bitmask flags, but it occurs to me that I will hit a low limit, since a char is only one byte, thus 8 bits, thus only 8 options in my mask?
enum options {
    k1 = 1 << 0,
    k2 = 1 << 1,
    // .... through to k8
};
unsigned char myOption = k2;
Do I simply need to make myOption an int or some other type for example if I wish it to store more than 8 possible options (and combinations of options, of course, hence why I am using the bit mask in the first place)? What's the best type?
If you need an unknown number of 'bits', you could use something like the std::vector<bool> class; see here:
http://www.cplusplus.com/reference/vector/vector-bool/
This is a specialization of the vector class which can pack the bool values using bits, so it is more space efficient than an array of bools (whether you need that extra efficiency is up to you).
Of course I don't know what your application is, there are many valid reasons for using bitfields. If you are simply storing a bunch of true and false values though, something like an array of bools or this vector of bools might be more easily maintained (it has downsides though of course, you can't test to see if say 3 bits are all set in one operation as you can with masking and bitfields, so it is application specific).
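A quick sketch of that approach:
#include <vector>

void example()
{
    std::vector<bool> flags(100, false); // 100 flags, typically packed about one bit each
    flags[42] = true;                    // set one flag
    bool is_set = flags[42];             // read it back
    (void)is_set;
}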
vector<bool> is somewhat controversial though, I think. See: http://howardhinnant.github.io/onvectorbool.html
#include <stdint.h>
This defines types with fixed sizes that are not compiler specific.
int16_t = 16 bits
uint16_t = 16 bits unsigned
int32_t = 32 bits
If you need more than 64 flags, you should consider ::std::vector<bool>, as Wayne Uroda suggested.
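For example, a sketch of a wider mask using a fixed-width type (the enum name and flag names here are invented):
#include <stdint.h>

enum options32 : uint32_t {
    j1  = 1u << 0,
    j2  = 1u << 1,
    // ...
    j9  = 1u << 8,    // already beyond what an unsigned char mask could hold
    j32 = 1u << 31
};

uint32_t myOptions = j2 | j9;  // combinations work exactly as before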

union vs bit masking and bit shifting

What are the disadvantages of using unions when storing some information like a series of bytes and being able to access them at once or one by one?
Example : A Color can be represented in RGBA. So a color type may be defined as,
typedef unsigned int RGBAColor;
Then we can use "shifting and masking" of bits to "retrieve or set" the red, green, blue, and alpha values of an RGBAColor object (just like it is done in Direct3D with macro functions such as D3DCOLOR_ARGB()).
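For example, a sketch of those shift-and-mask accessors using the RGBAColor typedef above (the byte layout, red in the low byte through alpha in the high byte, is just an assumption for the example):
unsigned char GetRed(RGBAColor c)   { return (unsigned char)(c & 0xFF); }
unsigned char GetGreen(RGBAColor c) { return (unsigned char)((c >> 8) & 0xFF); }
unsigned char GetBlue(RGBAColor c)  { return (unsigned char)((c >> 16) & 0xFF); }
unsigned char GetAlpha(RGBAColor c) { return (unsigned char)((c >> 24) & 0xFF); }

RGBAColor MakeColor(unsigned char r, unsigned char g,
                    unsigned char b, unsigned char a)
{
    return (RGBAColor)r | ((RGBAColor)g << 8) |
           ((RGBAColor)b << 16) | ((RGBAColor)a << 24);
}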
But what if I used a union,
union RGBAColor
{
unsigned int Color;
struct RGBAColorComponents
{
unsigned char Red;
unsigned char Green;
unsigned char Blue;
unsigned char Alpha;
} Component;
};
Then I would not always need to do the shifting (<<) or masking (&) for reading or writing the color components. But is there a problem with this? (I suspect that there is, because I haven't seen anyone use such a method.)
Can endianness be a problem? If we always use Component for accessing color components and use Color for accessing the whole thing (for copying, assigning, etc. as a whole), the endianness should not be a problem, right?
-- EDIT --
I found an old post about the same problem, so I guess this question is kind of a repost :P, sorry for that. Here is the link: Is it a good practice to use unions in C++?
According to the answers, it seems that the use of unions for the given example is OK in C++, because there is no change of data type there; it's just two ways to access the same data. Please correct me if I am wrong. Thanks. :)
This usage of unions is illegal in C++, where a union comprises overlapping, but mutually exclusive objects. You are not allowed to write one member of a union, then read out another member.
It is legal in C where this is a recommended way of type punning.
This relates to the issue of (strict) aliasing, which is a difficulty faced by the compiler when trying to determine whether two objects with different types are distinct. The language standards disagree because the experts are still figuring out what guarantees can safely be provided without sacrificing performance. Personally, I avoid all of this. What would the int actually be used for? The safe way to translate is to copy the bytes, as by memcpy.
There is also the endianness issue, but whether that matters depends on what you want to do with the int.
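As a minimal sketch of the memcpy route mentioned above (componentsToInt is a hypothetical name, reusing the RGBAColorComponents struct from the question):
#include <cstring>

// Instead of writing one union member and reading the other, copy the
// bytes; this is well-defined in both C and C++. Endianness still
// determines which component ends up in which byte of the int.
unsigned int componentsToInt(const RGBAColor::RGBAColorComponents& c)
{
    unsigned int color = 0;
    std::memcpy(&color, &c, sizeof c); // assumes sizeof c <= sizeof color
    return color;
}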
I believe using the union solves any problems related to endianness, as most likely the RGBA order is defined in network order. Also, the fact that each component will be uint8_t or similar can help some compilers use sign/zero-extended loads, store the low 8 bits directly through an unaligned byte pointer, and even parallelize some byte operations (e.g. ARM has some packed 4x8-bit instructions).

In C++ what is the proper term for splitting an int into bits

I see in some C++ code things like:
// Header
struct SomeStruct {
uint32_t nibble1:4, bitField1:1, bitField2:1, bitField3:1, bitField4:1,
padding:11, field5Bits:5, byteField:8;
};
What is this called? I typically like to google before asking here, but I have no idea what to even type in. I'm hoping to understand this when it comes to endianness - is bit order something to consider or just byte order? Also, what is the type of each field - bitFieldX should be a bool, while field5Bits should be a uint8_t. At least that's what I would think.
Thanks.
They are called bitfields (MSVC) (GCC)
Endianness usually refers to the order of bytes. However, bit order can be important; see the above links.
They behave as an unsigned int (uint32_t) in your case.
In general, the term for selecting several bits out of a larger binary integer representation is masking.
What you posted is a packed structure. The elements within the structure are known as bitfields, as others have posted. These are often used to represent communication protocol structures, where the protocol specifies fields that are less than one byte, or not aligned to the byte, half-word, or word boundaries that would normally apply.
Since there is only one type listed, each member of the structure is the same type, uint32_t.
Endianness does matter for anything that is part of a data type larger than 1 byte.