Bit setting question

Bit setting question - c++

In C or C++, it's apparently possible to restrict the number of bits a variable has, so for example:
unsigned char A:1;
unsigned char B:3;
I am unfamiliar however with how it works specifically, so a number of questions:
If I have a class with the following variables:
unsigned char A:1;
unsigned char B:3;
unsigned char C:1;
unsigned char D:3;
What is the above technique actually called?
Is above class four bytes in size, or one byte in size?
Are the variables treated as 1 (or 3) bits as shown, or as per the 'unsigned char', treated as a byte each?
Is there someway of combining the bits to a centralised byte? So for example:
.
unsigned char MainByte;
unsigned char A:1; //Can this be made to point at the first bit in MainByte?
unsigned char B:3; //Etc etc
unsigned char C:1;
unsigned char D:3;
Is there an article that covers this topic in more depth?
If 'A:1' is treated like an entire byte, what is the point/purple of it?
Feel free to mention any other considerations (like compiler restrictions or other limitations).
Thank you.

What is the above technique actually called?
Bitfields. And you're only supposed to use int (signed, unsigned or otherwise) as the "type", not char.
Is above class four bytes in size, or one byte in size?
Neither. It is probably sizeof(int) because the compiler generates a word-sized object. The actual bitfields will be stored within a byte, however. It'll just waste some space.
Are the variables treated as 1 (or 3) bits as shown, or as per the 'unsigned char', treated as a byte each?
They represent only the bits specified, and will be packed as tightly as possible.
Is there someway of combining the bits to a centralised byte? So for example:
Use a union:
struct bits {
unsigned A:1;
unsigned B:3;
unsigned C:1;
unsigned D:3;
};
union SplitByte {
struct bits Bits;
unsigned char Byte[sizeof(struct bits)];
/* the array is a trick so the two fields
are guaranteed to be the same size and
thus be aligned on the same boundary */
} SplitByteObj;
// access the byte
SplitByteObj.Byte[0]
// access a bitfield
SplitByteObj.Bits.B
Note that there are problems with bitfields, for example when using threads. Each bitfield cannot be accessed individually, so you may get errors if you try to use a mutex to guard each of them. Also, the order in which the fields are laid out is not clearly specified by the standard. Many people prefer to use bitwise operators to implement bitfields manually for that reason.
Is there an article that covers this topic in more depth?
Not many. The first few you'll get when you Google it are about all you'll find. They're not a widely used construct. You'll be best off nitpicking the standard to figure out exactly how they work so that you don't get bitten by a weird edge case. I couldn't tell you exactly where in the standard they're specified.
If 'A:1' is treated like an entire byte, what is the point/purple of it?
It's not, but I've addressed this already.

These are bit-fields.
The details of how these fields are arranged in memory are largely implementation-defined. Typically, you will find that the compiler packs them in some way. But it may take various alignment issues into account.

Related

Is there an actual 8-bit integer data type in C++

In c++, specifically the cstdint header file, there are types for 8-bit integers which turn out to be of the char data type with a typedef. Could anyone suggest an actual 8-bit integer type?

Yes, you are right. int8_t and uint8_t are typedef to char on platforms where 1 byte is 8 bits. On platforms where it is not, appropriate definition will be given.
Following answer is based on assumption that char is 8 bits
char holds 1 byte, which may be signed or unsigned based on implementation.
So int8_t is signed char and uint8_t is unsigned char, but this will be safe to use int8_t/uint8_t as actual 8-bit integer without relying too much on the implementation.
For a implementer's point of view, typedeffing where char is 8 bits makes sense.
Having seen all this, It is safe to use int8_t or uint8_t as real 8 bit integer.

C/C++ Why to use unsigned char for binary data?

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
both the printf's output 𤭢 correctly, where f0 a4 ad a2 is the encoding for the Unicode code-point U+24B62 (𤭢) in hex.
Even memcpy also correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seems to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to be working as fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes (1.7) making up the object can be copied into
an array of char or unsigned char. If the content of the array of char
or unsigned char is copied back into the object, the object shall
subsequently hold its original value.
In view of the above fact and that my original example was on Intel machine where char defaults to signed char, am still not convinced if unsigned char should be preferred over char.
Anything else?

In C the unsigned char data type is the only data type that has all the following three properties simultaneously
it has no padding bits, that it where all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is that access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
if these are the properties of a "binary" data type you are looking for, you definitively should use unsigned char.
For the second property we need a type that is unsigned. For these all conversion are defined with modulo arihmetic, here modulo UCHAR_MAX+1, 256 in most 99% of the architectures. All conversion of wider values to unsigned char thereby just corresponds to truncation to the least significant byte.
The two other character types generally don't work the same. signed char is signed, anyhow, so conversion of values that don't fit it is not well defined. char is not fixed to be signed or unsigned, but on a particular platform to which your code is ported it might be signed even it is unsigned on yours.

You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
printf("good\n");
}
else
{
printf("bad\n");
}
can print "bad", because, depending on your compiler, c[0] will be sign extended to -1, which is not any way the same as 0xff

The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc, int is always guaranteed to be signed.
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, this is undefined behavior in C, though practically you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbols tables larger than 8 bits.
Signedness is always two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.

Bytes are usually intended as unsigned 8 bit wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.
If I add a bit shift operation to the code you wrote, then I will have an undefined behaviour. The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during the compilation: if the char is signed then you are trying to assign the value 0xf0, which cannot be represented in the signed char (range -128 to +127), so it will be casted to a signed value (-16).
Declaring the char as unsigned will remove the warning, and is always good to have a clean build without any warning.

The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).

I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.

Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.

Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which not type-safe. That is, printf takes it's formatting cues from the format string and not from the data type. You could just as easily tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with c++ iostreams, the result will be different (depending on the signed-ness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed specifies that the most significant bit of the data (for unsigned char the 8-th bit) represents the sign. Since you obviously do not need that, you should specify your data is unsigned (the "sign" bit represents data, not the sign of the other bits).

Forcing unaligned bitfield packing in MSVC

I've a struct of bitfields that add up to 48 bits. On GCC this correctly results in a 6 byte structure, but in MSVC the structure comes out 8 bytes. I need to find some way to force MSVC to pack the struct properly, both for interoperability and because it's being used in a memory-critical environment.
The struct seen below consists of three 15-bit numbers, one 2-bit number, and a 1-bit sign. 15+15+15+2+1 = 48, so in theory it should fit into six bytes, right?
struct S
{
unsigned short a:15;
unsigned short b:15;
unsigned short c:15;
unsigned short d:2;
unsigned short e:1;
};
However, compiling this on both GCC and MSVC results in sizeof(S) == 8. Thinking that this might have to do with alignment, I tried using #pragma pack(1) before the struct declaration, telling the compiler to back to byte, not int, boundaries. On GCC, this worked, resulting in sizeof(S) == 6.
However, on MSVC05, the sizeof still came out to 8, even with pack(1) set! After reading this other SO answer, I tried replacing unsigned short d with unsigned char and unsigned short e with bool. The result is sizeof(S) == 7!
I found that if I split d into two one-bit fields and wedged them in between the other members, the struct finally packed properly.
struct S
{
unsigned short a:15;
unsigned short dHi : 1;
unsigned short b:15;
unsigned short dLo : 1;
unsigned short c:15;
unsigned short e:1;
};
printf( "%d\n", sizeof(S) ); // "6"
But having d split like that is cumbersome and causes trouble for me later on when I have to work on the struct. Is there some way I can force MSVC to pack this struct into 6 bytes, exactly as GCC does?

It is implementation defined how fields will be placed in the structure. Visual Studio will fit consecutive bitfields into an underlying type, if it can, and waste the leftover space. (C++ Bit Fields in VS)

If you use the type "unsigned __int64" to declare all elements of the structure, you'll get an object with sizeof(S)=8, but the last two bytes will be unused and the first six will contain the data in the format you want.
Alternatively, if you can accept some structure reordering, this will work
#pragma pack(1)
struct S3
{
unsigned int a:15;
unsigned int b:15;
unsigned int d:2;
unsigned short c:15;
unsigned short e:1;
};

I don't think so, and I think it's MSVC's behavior that is actually correct, and GCC that deviates from the standard.
AFAIK, the standard does not permit bitfields to cross word boundaries of the underlying type.

C++ struct containing unsigned char and int bug

Ok i have a struct in my C++ program that is like this:
struct thestruct
{
unsigned char var1;
unsigned char var2;
unsigned char var3[2];
unsigned char var4;
unsigned char var5[8];
int var6;
unsigned char var7[4];
};
When i use this struct, 3 random bytes get added before the "var6", if i delete "var5" it's still before "var6" so i know it's always before the "var6".
But if i remove the "var6" then the 3 extra bytes are gone.
If i only use a struct with a int in it, there is no extra bytes.
So there seem to be a conflict between the unsigned char and the int, how can i fix that?

The compiler is probably using its default alignment option, where members of size x are aligned on a memory boundary evenly divisible by x.
Depending on your compiler, you can affect this behaviour using a #pragma directive, for example:
#pragma pack(1)
will turn off the default alignment in Visual C++:
Specifies the value, in bytes, to be used for packing. The default value for n is 8. Valid values are 1, 2, 4, 8, and 16. The alignment of a member will be on a boundary that is either a multiple of n or a multiple of the size of the member, whichever is smaller.
Note that for low-level CPU performance reasons, it is usually best to try to align your data members so that they fall on an aligned boundary. Some CPU architectures require alignment, while others (such as Intel x86) tolerate misalignment with a decrease in performance (sometimes quite significantly).

Your data structure being aligned so that your int falls on word boundries, which for your target might be 32 or 64 bits.
You can reorganize your struct like so so that this won't happen:
struct thestruct
{
int var6;
unsigned char var1;
unsigned char var2;
unsigned char var3[2];
unsigned char var4;
unsigned char var5[8];
unsigned char var7[4];
};

Are you talking about padding bytes? That's not a bug. As allowed by the C++ standard, the compiler is adding padding to keep the members aligned. This is required for some architectures, and will greatly improve performance for others.

You're having a byte alignment problem. The compiler is adding padding to align the bytes. See this wikipedia article.

Read up on data structure alignment. Essentially, depending on the compiler and compile options, you'll get alignment onto different powers-of-2.
To avoid it, move multi-byte items (int or pointers) before single-byte (signed or unsigned char) items -- although it might still be there after your last item.

While rearranging the order you declare data members inside your struct is fine, it should be emphasized that overriding the default alignment by using #pragmas and such is a bad idea unless you know exactly what you're doing. Depending on your compiler and architecture, attempting to access unaligned data, particularly by storing the address in a pointer and later trying to dereference it, can easily give the dreaded Bus Error or other undefined behavior.

How can I get bitfields to arrange my bits in the right order?

To begin with, the application in question is always going to be on the same processor, and the compiler is always gcc, so I'm not concerned about bitfields not being portable.
gcc lays out bitfields such that the first listed field corresponds to least significant bit of a byte. So the following structure, with a=0, b=1, c=1, d=1, you get a byte of value e0.
struct Bits {
unsigned int a:5;
unsigned int b:1;
unsigned int c:1;
unsigned int d:1;
} __attribute__((__packed__));
(Actually, this is C++, so I'm talking about g++.)
Now let's say I'd like a to be a six bit integer.
Now, I can see why this won't work, but I coded the following structure:
struct Bits2 {
unsigned int a:6;
unsigned int b:1;
unsigned int c:1;
unsigned int d:1;
} __attribute__((__packed__));
Setting b, c, and d to 1, and a to 0 results in the following two bytes:
c0 01
This isn't what I wanted. I was hoping to see this:
e0 00
Is there any way to specify a structure that has three bits in the most significant bits of the first byte and six bits spanning the five least significant bits of the first byte and the most significant bit of the second?
Please be aware that I have no control over where these bits are supposed to be laid out: it's a layout of bits that are defined by someone else's interface.

(Note that all of this is gcc-specific commentary - I'm well aware that the layout of bitfields is implementation-defined).
Not on a little-endian machine: The problem is that on a little-endian machine, the most significant bit of the second byte isn't considered "adjacent" to the least significant bits of the first byte.
You can, however, combine the bitfields with the ntohs() function:
union u_Bits2{
struct Bits2 {
uint16_t _padding:7;
uint16_t a:6;
uint16_t b:1;
uint16_t c:1;
uint16_t d:1;
} bits __attribute__((__packed__));
uint16_t word;
}
union u_Bits2 flags;
flags.word = ntohs(flag_bytes_from_network);
However, I strongly recommend you avoid bitfields and instead use shifting and masks.

Usually you can't do strong assumptions on how the union will be packed, every compiler implementation may choose to pack it differently (to save space or align bitfields inside bytes).
I would suggest you to just work out with masking and bitwise operators..
from this link:
The main use of bitfields is either to allow tight packing of data or to be able to specify the fields within some externally produced data files. C gives no guarantee of the ordering of fields within machine words, so if you do use them for the latter reason, you program will not only be non-portable, it will be compiler-dependent too. The Standard says that fields are packed into ‘storage units’, which are typically machine words. The packing order, and whether or not a bitfield may cross a storage unit boundary, are implementation defined. To force alignment to a storage unit boundary, a zero width field is used before the one that you want to have aligned.

C/C++ has no means of specifying the bit by bit memory layout of structs, so you will need to do manual bit shifting and masking on 8 or 16bit (unsigned) integers (uint8_t, uint16_t from <stdint.h> or <cstdint>).
Of the good dozen of programming languages I know, only very few allow you to specify bit-by-bit memory layout for bit fields: Ada, Erlang, VHDL (and Verilog).
(Community wiki if you want to add more languages to that list.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js