Endianness inside a byte - C++

Recently I've been tracking down a bug that appears when the two sides of a network connection have different endianness. One side has already sent a telegram marking lastSegment, while the other side is still waiting endlessly for the last segment.
I read this code:
#ifndef kBigEndian
struct tTelegram
{
u8 lastSegment : 1;
u8 reserved: 7;
u8 data[1];
};
#else
struct tTelegram
{
u8 reserved: 7;
u8 lastSegment : 1;
u8 data[1];
};
#endif
I know endianness matters for multi-byte types, e.g. int, long, etc. But why does it matter in the code above? lastSegment and reserved live inside a single byte.
Is this a bug?

You have 16 bits in your struct. On a 32-bit or 64-bit architecture, depending on the endianness, data may come "before" reserved and lastSegment or it may come "after" when you look at the raw binary. That is, if we consider 32 bits, your struct may be packed along 32-bit boundaries. It might look like this:
padbyte1 padbyte2 data lastSegment+reserved
or it may look like this
lastSegment+reserved data padbyte1 padbyte2
So when you put those 16 bits over the wire and then reinterpret them on the other side, do you know whether you're getting data or lastSegment?
Your problem isn't within the byte, it's where data lies in relation to reserved and lastSegment.
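A portable way to avoid the ambiguity is to define the wire format explicitly and pack/unpack the flag octet by hand. A minimal sketch (not from the original code; it assumes u8 is an 8-bit unsigned type such as uint8_t and that, by convention, bit 0 of the flag octet carries lastSegment):

#include <cstdint>
typedef std::uint8_t u8; // assumption: u8 is an 8-bit unsigned type

// Pack the flag octet explicitly: bit 0 = lastSegment, bits 1..7 = reserved.
inline u8 packTelegramFlags(bool lastSegment, u8 reserved)
{
    return static_cast<u8>((lastSegment ? 1u : 0u) | ((reserved & 0x7Fu) << 1));
}

// Unpack on the receiving side using the same agreed layout.
inline bool telegramLastSegment(u8 flags)
{
    return (flags & 0x01u) != 0;
}

With this, both sides depend only on the documented bit positions, not on how their compilers lay out bit-fields.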

When it comes to bitfields, ordering is not guaranteed even between different compilers running on the same CPU. You could theoretically even get a change of order just by changing flags with the same compiler (though, in fairness, I have to add that I've never actually seen that happen).

Related

Struct Bit Packing and LSB / MSB ambiguity C++

I had to write C++ code for the following packet header:
[Diagram of the packet header format (image) omitted.]
Here is the struct code I wrote for the above packet format. I want to know whether the uint8_t and uint16_t bit fields are correct:
struct TelemetryTransferFramePrimaryHeader
{
//-- 6 Octets Long --//
//-- Master Channel ID (2 octets)--//
uint16_t TransferFrameVersionNumber : 2;
uint16_t SpacecraftID : 10;
uint16_t VirtualChannelID : 3;
uint16_t OCFFlag : 1;
//-----------------//
uint8_t MasterChannelFrameCount;
uint8_t VirtualChannelFrameCount;
//-- Transfer Frame Data Field Status (2 octets) --//
uint16_t TransferFrameSecondaryHeaderFlag : 1;
uint16_t SyncFlag : 1;
uint16_t PacketOrderFlag : 1;
uint16_t SegmentLengthID : 2;
uint16_t FirstHeaderPointer : 11;
//-----------------//
};
How do I ensure that the LSB -> MSB order is preserved in the struct?
I keep getting confused, and I've tried reading up on it, but that ends up confusing me even more.
PS: I am using a 32-bit processor.
Exactly how bits are mapped when using bit fields is implementation-specific. So it's very hard to say for sure whether you did it right; we'd need to know the exact CPU and compiler (and compiler version, of course).
In short: don't do this. Bit fields are not very usable for things like this.
Do it manually instead, by declaring the words as needed and setting the bits inside them.
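For example, a minimal sketch of that manual approach for the first two octets (the bit positions below are an assumption taken from the field widths in the question, with the header transmitted big-endian; check them against the actual packet spec):

#include <cstdint>

// Build the two Master Channel ID octets by hand instead of with bit-fields.
// Assumed layout: version (2 bits), spacecraft ID (10 bits),
// virtual channel ID (3 bits), OCF flag (1 bit), most significant octet first.
inline void packMasterChannelId(std::uint8_t out[2],
                                unsigned version, unsigned spacecraftId,
                                unsigned virtualChannelId, unsigned ocfFlag)
{
    std::uint16_t word = static_cast<std::uint16_t>(
        ((version          & 0x3u)   << 14) |
        ((spacecraftId     & 0x3FFu) << 4)  |
        ((virtualChannelId & 0x7u)   << 1)  |
        ( ocfFlag          & 0x1u));
    out[0] = static_cast<std::uint8_t>(word >> 8);   // first octet on the wire
    out[1] = static_cast<std::uint8_t>(word & 0xFFu);
}

The layout is then fixed by this code alone, independent of compiler, CPU, or endianness.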
IMHO anyone trying to construct a struct in this way is in a state of sin.
The C99 Standard, for example, says:
An implementation may allocate any addressable storage unit large enough to hold a bitfield. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.
Even if you could predict that your compiler would construct bit-fields in units of (say) uint32_t, and that the fields were arranged first-field-in-the-LS-bits... you still have endianness to deal with!
So... as unwind says... do it by hand!
I agree that you should not do this. However, STMicroelectronics uses bitfields to access the bits of its Cortex-M3/M4 microcontroller registers. So any compiler vendor that wants its users to be able to use the STMicroelectronics Cortex-M3/M4 libraries needs to support allocation of bitfields starting at the least significant bit. In my compiler this is the default, but it is also optional, so I could reverse it if I wanted to.

Modeling microcomputer registers in memory?

As a little side project I've been writing an emulator for an older microcomputer CPU, mostly based on the 8080's architecture. Its 8-bit general-purpose registers can (according to Wikipedia) be used "as three 16-bit register pairs" as well as in the normal 8-bit mode. And here's my problem.
My first try at modelling this was individual named bytes and shorts, which worked fine until I re-read the spec page and found that the 16-bit registers aren't actually their own thing. Oops.
What I'm trying now is an array of bytes, with one location for each 8-bit register and two locations reserved for the stack/instruction pointers. This works well for the 8-bit registers, and it's a lot less of a hassle to manage, but I don't actually know how to combine two of those bytes into a short in memory. Is that even possible? If not, do you have any suggestions on how else to do this?
Solved via casting the address of the first byte in the 16-bit register to a void pointer, and back to a short. Not very type-safe, but hey, it works. Apparently I was just googling the wrong things.
You can make a struct of two uint8_ts that is also accessible as a uint16_t by doing this:
union Register
{
uint16_t word;
struct
{
uint8_t lo, hi;
} byte;
};
This way, if you have a Register r in scope, then r.word accesses the contents as a single 16-bit value, and r.byte.lo and r.byte.hi access the first and the second 8-bit byte. (The first one is lo because the Intel 8080 is a little-endian architecture.)
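A short usage sketch (the register name is just an example; note that reading a union member other than the one last written is allowed in C but formally undefined behaviour in C++, even though common compilers support it):

Register bc;        // e.g. the 8080 B/C register pair
bc.word = 0x1234;   // write the 16-bit pair
// On a little-endian host: bc.byte.lo == 0x34, bc.byte.hi == 0x12.
bc.byte.hi = 0x56;  // update one 8-bit half
// bc.word is now 0x5634 on that host.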

Creating a simple portable bitmask and using it

This is my first time trying to create a bitmask, and although it seems simple, I am having trouble visualizing everything.
Keep in mind I cannot use std::bitset
First, I have read that accessing raw bits is undefined behavior (so using a union over a char would be bad, because the bits might be laid out differently by a different compiler).
Most code I've looked at uses a struct to define each bit, and (I assume) this way of structuring data should be compiler independent, because the first bit will always be the LSB. Here is an example:
struct foo
{
unsigned char a : 1;
unsigned char b : 1;
unsigned char unused : 6;
};
Now the question is... could you use more than one bit for a variable in the struct AND have it still be compiler independent? It seems like the answer is yes, but I have had some weird answers and want to be sure. Something like:
struct foo
{
unsigned char ab : 2;
unsigned char unused : 6;
};
It seems like, regardless of whether the raw structure is reversed, the first bit accessed from the struct is always the LSB, so how many bits you use should not matter.
The C standard does not specify the ordering of fields within a unit -- there's no guarantee that a, in your example, is in the LSB. If you want fully portable behavior, you need to do the bit manipulation yourself, using unsigned integral types, and (if using unsigned integral types bigger than a byte) you need to worry about the endianness when reading/writing them from external sources.
The behaviour does not depend on the bit order. What you have written corresponds to the language standard and therefore behaves the same on all platforms.
Bitfields cannot be portably used to access specific bits in an external block of data (like a hardware register or data serialized in a stream of bytes). So bitfields aren't useful in this context - at least for portable code.
But if you're talking about using the bitfield within the program and not trying to have it model some external bit representation, then it's 100% portable. Not super useful, but portable.
I've spent a career twiddling bits in C/C++, and maybe because of this issue, I never see it done this way. We always use unsigned variables and apply bit masks to them:
#define BITMASK_A 0x01
#define BITMASK_B 0x02
unsigned char bitfield;
Then when you want to access a, you use (bitfield & BITMASK_A).
But to answer your question: there should be no logical difference between your two examples. If the compiler places ab at the low end, then the first example should also place a at the LSB.
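For completeness, a sketch of the usual helpers around such masks (the function names are illustrative, not from the answer above):

// Set, clear and test bits through masks such as BITMASK_A / BITMASK_B.
inline void setBits(unsigned char &bits, unsigned char mask)   { bits |= mask; }
inline void clearBits(unsigned char &bits, unsigned char mask) { bits &= static_cast<unsigned char>(~mask); }
inline bool testBits(unsigned char bits, unsigned char mask)   { return (bits & mask) != 0; }

// Usage:
// unsigned char bitfield = 0;
// setBits(bitfield, BITMASK_A);
// if (testBits(bitfield, BITMASK_A)) { /* bit a is set */ }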

When does Endianness become a factor?

Endianness, from what I understand, is when the bytes that compose a multi-byte word differ in their order, at least in the most typical case, so that a 16-bit integer may be stored as either 0xHHLL or 0xLLHH.
Assuming I don't have that wrong, what I would like to know is when endianness becomes a major factor when sending information between two computers whose endianness may or may not differ.
If I transmit a short integer of 1, in the form of a char array and with no correction, is it received and interpreted as 256?
If I decompose and recompose the short integer using the following code, will endianness no longer be a factor?
// Sender:
for (unsigned n = 0; n < sizeof(uint16_t) * 8; ++n) {
    stl_bitset[n] = (value >> n) & 1;
}
// Receiver:
for (unsigned n = 0; n < sizeof(uint16_t) * 8; ++n) {
    value |= uint16_t(stl_bitset[n] & 1) << n;
}
Is there a standard way of compensating for endianness?
Thanks in advance!
Very abstractly speaking, endianness is a property of the reinterpretation of a variable as a char-array.
Practically, this matters precisely when you read() from and write() to an external byte stream (like a file or a socket). Or, speaking abstractly again, endianness matters when you serialize data (essentially because serialized data has no type system and just consists of dumb bytes); and endianness does not matter within your programming language, because the language only operates on values, not on representations. Going from one to the other is where you need to dig into the details.
To wit - writing:
uint32_t n = get_number();
unsigned char bytesLE[4] = { (unsigned char)n, (unsigned char)(n >> 8), (unsigned char)(n >> 16), (unsigned char)(n >> 24) }; // little-endian order
unsigned char bytesBE[4] = { (unsigned char)(n >> 24), (unsigned char)(n >> 16), (unsigned char)(n >> 8), (unsigned char)n }; // big-endian order
write(bytes..., 4);
Here we could just have said, reinterpret_cast<unsigned char *>(&n), and the result would have depended on the endianness of the system.
And reading:
unsigned char buf[4] = read_data();
uint32_t n_LE = buf[0] | (uint32_t(buf[1]) << 8) | (uint32_t(buf[2]) << 16) | (uint32_t(buf[3]) << 24); // little-endian
uint32_t n_BE = buf[3] | (uint32_t(buf[2]) << 8) | (uint32_t(buf[1]) << 16) | (uint32_t(buf[0]) << 24); // big-endian
Again, here we could have said, uint32_t n = *reinterpret_cast<uint32_t*>(buf), and the result would have depended on the machine endianness.
As you can see, with integral types you never have to know the endianness of your own system, only of the data stream, if you use algebraic input and output operations. With other data types such as double, the issue is more complicated.
For the record, if you're transferring data between devices you should pretty much always use network byte ordering with ntohl, htonl, ntohs, htons. It'll convert to the standard network byte order regardless of what your system and the destination system use. Of course, both systems should be programmed like this - but they usually are in networking scenarios.
No, though you do have the right general idea. What you're missing is the fact that even though it's normally a serial connection, a network connection (at least most network connections) still guarantees correct endianness at the octet (byte) level -- i.e., if you send a byte with a value of 0x12 on a little endian machine, it'll still be received as 0x12 on a big endian machine.
Looking at a short, it will probably help to view the number in hexadecimal. It starts out as 0x0001. You break it into two bytes: 0x00 0x01. Upon receipt, that'll be read as 0x0100, which turns out to be 256.
Since the network deals with endianness at the octet level, you normally only have to compensate for the order of bytes, not bits within bytes.
Probably the simplest method is to use htons/htonl when sending, and ntohs/ntohl when receiving. When/if that's not sufficient, there are many alternatives such as XDR, ASN.1, CORBA IIOP, Google protocol buffers, etc.
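A small usage sketch of the htons/ntohs pattern (the header shown is the POSIX one, <arpa/inet.h>; on Windows the same functions live in <winsock2.h>; the function names here are illustrative):

#include <cstdint>
#include <cstring>
#include <arpa/inet.h>   // htons / ntohs

// Sender side: convert to network (big-endian) order before writing.
std::uint16_t to_wire(std::uint16_t host_value)
{
    return htons(host_value);             // host -> network byte order
}

// Receiver side: copy the two received octets, then convert back.
std::uint16_t from_wire(const unsigned char buf[2])
{
    std::uint16_t wire;
    std::memcpy(&wire, buf, sizeof wire); // avoid alignment/aliasing issues
    return ntohs(wire);                   // network -> host byte order
}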
The "standard way" of compensating is that the concept of "network byte order" has been defined, almost always (AFAIK) as big endian.
Senders and receivers both know the wire protocol, and if necessary will convert before transmitting and after receiving, to give applications the right data. But this translation happens inside your networking layer, not in your applications.
Both endiannesses have an advantage that I know of:
Big-endian is conceptually easier to understand because it's similar to our positional numeral system: most significant to least significant.
Little-endian is convenient when reusing a memory reference for multiple sizes. Simply put, if you have an unsigned int* on a little-endian machine but you know the value stored there is < 256, you can cast your pointer to unsigned char* and read the same value.
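A small illustration of that point (reading an object through an unsigned char* is permitted in both C and C++):

unsigned int value = 42;   // known to be < 256
const unsigned char *p = reinterpret_cast<const unsigned char *>(&value);
// On a little-endian machine *p == 42, because the least significant byte
// comes first in memory; on a big-endian machine *p would be 0.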
Endianness is ALWAYS an issue. Some will say that if you know that every host connected to the network runs the same OS, etc, then you will not have problems. This is true until it isn't. You always need to publish a spec that details the EXACT format of on-wire data. It can be any format you want, but every endpoint needs to understand the format and be able to interpret it correctly.
In general, protocols use big-endian for numerical values, but this has limitations if everyone isn't IEEE 754 compatible, etc. If you can take the overhead, then use an XDR (or your favorite solution) and be safe.
Here are some guidelines for endian-neutral C/C++ code. Obviously these are written as "rules to avoid", so if code has these "features" it could be prone to endian-related bugs! (This is from my article on endianness published in Dr. Dobb's.)
Avoid using unions which combine different multi-byte datatypes.
(the layout of the unions may have different endian-related orders)
Avoid accessing byte arrays outside of the byte datatype.
(the order of the byte array has an endian-related order)
Avoid using bit-fields and byte-masks
(since the layout of the storage is dependent upon endianness, the masking of the bytes and selection of the bit fields is endian sensitive)
Avoid casting pointers from multi-byte types to other byte types; see the sketch after this list.
(when a pointer is cast from one type to another, the endianness of the source (i.e. the original target) is lost and subsequent processing may be incorrect)
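As an illustration of that last rule, a sketch of the endian-neutral alternative to casting a byte pointer to a multi-byte type (big-endian data assumed here; the function name is illustrative):

#include <cstdint>

// Assemble the value from individual bytes so the byte order of the *data*
// is stated explicitly, independent of the byte order of the *host*.
inline std::uint32_t load_be32(const unsigned char *p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}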
You shouldn't have to worry unless you're at the border of the system. Normally, if you're talking in terms of the STL, you have already passed that border.
It's the task of the serialization protocol to indicate/determine how a series of bytes can be transformed into the type you're sending, be it a built-in type or a custom type.
If you're talking about built-in types only, the machine abstraction provided by the tools in your environment may suffice.

C/C++: Force Bit Field Order and Alignment

I read that the order of bit fields within a struct is platform specific. What about if I use different compiler-specific packing options, will this guarantee data is stored in the proper order as they are written? For example:
struct Message
{
unsigned int version : 3;
unsigned int type : 1;
unsigned int id : 5;
unsigned int data : 6;
} __attribute__ ((__packed__));
On an Intel processor with the GCC compiler, the fields were laid out in memory as they are shown. Message.version was the first 3 bits in the buffer, and Message.type followed. If I find equivalent struct packing options for various compilers, will this be cross-platform?
No, it will not be fully portable. Packing options for structs are extensions, and are themselves not fully portable. In addition to that, C99 §6.7.2.1, paragraph 10 says: "The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined."
Even a single compiler might lay the bit field out differently depending on the endianness of the target platform, for example.
Bit fields vary widely from compiler to compiler, sorry.
With GCC, big endian machines lay out the bits big end first and little endian machines lay out the bits little end first.
K&R says "Adjacent [bit-]field members of structures are packed into implementation-dependent storage units in an implementation-dependent direction. When a field following another field will not fit ... it may be split between units or the unit may be padded. An unnamed field of width 0 forces this padding..."
Therefore, if you need machine independent binary layout you must do it yourself.
This last statement also applies to non-bitfields due to padding -- however all compilers seem to have some way of forcing byte packing of a structure, as I see you already discovered for GCC.
Bitfields should be avoided - they aren't very portable between compilers, even for the same platform. From the C99 standard, 6.7.2.1/10 - "Structure and union specifiers" (there's similar wording in the C90 standard):
An implementation may allocate any addressable storage unit large enough to hold a bitfield. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.
You cannot guarantee whether a bit field will 'span' an int boundary or not, and you can't specify whether a bitfield starts at the low end or the high end of the int (this is independent of whether the processor is big-endian or little-endian).
Prefer bitmasks. Use inlines (or even macros) to set, clear and test the bits.
Endianness is about byte order, not bit order. Nowadays it is 99% certain that bit order is fixed. However, when using bitfields, endianness should still be taken into account. See the example below.
#include <stdio.h>
typedef struct tagT{
int a:4;
int b:4;
int c:8;
int d:16;
}T;
int main()
{
char data[]={0x12,0x34,0x56,0x78};
T *t = (T*)data;
printf("a =0x%x\n" ,t->a);
printf("b =0x%x\n" ,t->b);
printf("c =0x%x\n" ,t->c);
printf("d =0x%x\n" ,t->d);
return 0;
}
//- big endian : mips24k-linux-gcc (GCC) 4.2.3 - big endian
a =0x1
b =0x2
c =0x34
d =0x5678
1 2 3 4 5 6 7 8
\_/ \_/ \_____/ \_____________/
a b c d
// - little endian : gcc (Ubuntu 4.3.2-1ubuntu11) 4.3.2
a =0x2
b =0x1
c =0x34
d =0x7856
7 8 5 6 3 4 1 2
\_____________/ \_____/ \_/ \_/
d c b a
Most of the time, probably, but don't bet the farm on it, because if you're wrong, you'll lose big.
If you really, really need to have identical binary information, you'll need to build the bit fields yourself with bitmasks - e.g. you use an unsigned short (16 bits) for Message, and then define things like versionMask = 0xE000 to represent the three topmost bits.
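A sketch of that mask-based approach for the Message layout above (the shift/mask values assume version occupies the three topmost bits, as in versionMask = 0xE000, followed by type, id and data, with bit 0 unused; the real layout is whatever your wire spec says):

#include <cstdint>

const std::uint16_t versionMask = 0xE000; const unsigned versionShift = 13; // 3 bits
const std::uint16_t typeMask    = 0x1000; const unsigned typeShift    = 12; // 1 bit
const std::uint16_t idMask      = 0x0F80; const unsigned idShift      = 7;  // 5 bits
const std::uint16_t dataMask    = 0x007E; const unsigned dataShift    = 1;  // 6 bits

inline std::uint16_t packMessage(unsigned version, unsigned type,
                                 unsigned id, unsigned data)
{
    return static_cast<std::uint16_t>(
        ((version << versionShift) & versionMask) |
        ((type    << typeShift)    & typeMask)    |
        ((id      << idShift)      & idMask)      |
        ((data    << dataShift)    & dataMask));
}

inline unsigned messageVersion(std::uint16_t m) { return (m & versionMask) >> versionShift; }

The 16-bit value can then be split into two explicit octets for transmission, exactly as in the earlier examples.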
There's a similar problem with alignment within structs. For instance, Sparc, PowerPC, and 680x0 CPUs are all big-endian, and the common default for Sparc and PowerPC compilers is to align struct members on 4-byte boundaries. However, one compiler I used for 680x0 only aligned on 2-byte boundaries - and there was no option to change the alignment!
So for some structs, the sizes on Sparc and PowerPC are identical, but smaller on 680x0, and some of the members are in different memory offsets within the struct.
This was a problem with one project I worked on, because a server process running on Sparc would query a client and find out it was big-endian, and assume it could just squirt binary structs out on the network and the client could cope. And that worked fine on PowerPC clients, and crashed big-time on 680x0 clients. I didn't write the code, and it took quite a while to find the problem. But it was easy to fix once I did.
Thanks @BenVoigt for your very useful comment starting "No, they were created to save memory."
Linux source does use a bit field to match to an external structure: /usr/include/linux/ip.h has this code for the first byte of an IP datagram
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u8 ihl:4,
version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
__u8 version:4,
ihl:4;
#else
#error "Please fix <asm/byteorder.h>"
#endif
    /* ... remaining IP header fields omitted ... */
};
However in light of your comment I'm giving up trying to get this to work for the multi-byte bit field frag_off.
Of course, the best answer is to use a class which reads/writes bit fields as a stream. Using the C bit field structure is just not guaranteed to be portable. Not to mention it is considered unprofessional/lazy/stupid to use this in real-world coding.
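A minimal sketch of such a bit-stream class (the names and the MSB-first convention are illustrative assumptions, not a standard API):

#include <cstdint>
#include <vector>

// Appends values bit by bit, MSB first, into a byte buffer, so the on-wire
// layout is defined entirely by this code rather than by compiler bit-field rules.
class BitWriter {
public:
    void write(std::uint32_t value, unsigned bits)   // append the low 'bits' bits of value
    {
        for (unsigned i = bits; i-- > 0; )
            putBit((value >> i) & 1u);
    }
    const std::vector<std::uint8_t>& bytes() const { return buffer_; }
private:
    void putBit(unsigned bit)
    {
        if (free_ == 0) { buffer_.push_back(0); free_ = 8; }
        --free_;
        buffer_.back() |= static_cast<std::uint8_t>(bit << free_);
    }
    std::vector<std::uint8_t> buffer_;
    unsigned free_ = 0;   // unused bits remaining in the last byte
};

// Usage, e.g. for the transfer frame header fields:
//   BitWriter w;
//   w.write(version, 2); w.write(spacecraftId, 10); w.write(vcid, 3); w.write(ocf, 1);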