Is C++ abstraction endian neutral? - c++

Suppose I have a client and a server that exchange 16-bit numbers with each other via some network protocol, say for example Modbus TCP, but the protocol is not relevant here.
Now I know that the client is little-endian (my PC) and the server is big-endian (some PLC); the client is written entirely in C++ with Boost Asio sockets. With this setup, I thought I had to swap the bytes received from the server to correctly store the number in a uint16_t variable, but that turned out to be wrong: when I swapped them, I read incorrect values.
My understanding so far is that the C++ abstraction stores the values into variables correctly without me having to care about swapping or endianness. Consider this snippet:
// received the bytes 0x02 0x01 (big-endian representation of 513)
uint8_t high { 0x02 }; // first byte
uint8_t low { 0x01 }; // second byte
// merge into 16 bit value (no swap)
uint16_t val = (static_cast<uint16_t>(high) << 8) | static_cast<uint16_t>(low);
std::cout << val; // correctly prints 513
This somewhat surprised me, also because when I look at the memory representation through a pointer, I find that the bytes are actually stored in little-endian order on the client:
// take the address of val and view it through a uint8_t pointer
auto addr = reinterpret_cast<uint8_t*>(&val);
// print the first and second bytes
printf("%d ", (int)addr[0]); // prints 1
printf("%d", (int)addr[1]);  // prints 2
So the question is:
As long as I don't mess with memory addresses and pointers, can C++ guarantee that the values I'm reading from the network are correct no matter the endianness of the server? Or am I missing something here?
EDIT:
Thanks for the answers. I want to add that I'm currently using boost::asio::write(socket, boost::asio::buffer(data)) to send data from the client to the server, where data is a std::vector<uint8_t>. So my understanding is that as long as I fill data in network order I should not have to care about the endianness of my system (or even of the server, for 16-bit data), because I'm operating on the "values" and not reading bytes directly from memory, right?
To use the htons family of functions I would have to change my underlying TCP layer to use memcpy (or similar) and a uint8_t* data buffer, which feels more C-esque than C++-ish. Why should I do it? Is there an advantage I'm not seeing?
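For example, this is roughly what I mean (just a sketch with a hypothetical send_register helper and an already-connected socket, not my real protocol code):
#include <boost/asio.hpp>
#include <cstdint>
#include <vector>

// Hypothetical helper: send one 16-bit value, most significant byte first.
void send_register(boost::asio::ip::tcp::socket& socket, uint16_t value)
{
    std::vector<uint8_t> data;
    data.push_back(static_cast<uint8_t>(value >> 8));   // high byte first = network order
    data.push_back(static_cast<uint8_t>(value & 0xFF)); // low byte second
    boost::asio::write(socket, boost::asio::buffer(data));
}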

(static_cast<uint16_t>(high) << 8) | static_cast<uint16_t>(low) has the same behaviour regardless of endianness: the "left" end of a number is always the most significant bit, and endianness only changes whether that bit lives in the first or the last byte in memory.
For example:
uint16_t input = 0x0201;
uint8_t leftByte = input >> 8; // same result regardless of endianness
uint8_t rightByte = input & 0xFF; // same result regardless of endianness
uint8_t data[2];
memcpy(data, &input, sizeof(input)); // data will be {0x02, 0x01} or {0x01, 0x02} depending on endianness
The same applies in the other direction:
uint8_t data[] = {0x02, 0x01};
uint16_t output1;
memcpy(&output1, data, sizeof(output1)); // will be 0x0102 or 0x0201 depending on endianness
uint16_t output2 = (data[0] << 8) | data[1]; // will be 0x0201 regardless of endianness
To ensure your code works on all platforms it's best to use the htons and ntohs family of functions:
uint16_t input = 0x0201; // input is in host order
uint16_t networkInput = htons(input);
uint8_t data[2];
memcpy(data, &networkInput, sizeof(networkInput));
// data is big endian or "network" order
uint16_t networkOutput;
memcpy(&networkOutput, data, sizeof(networkOutput));
uint16_t output = ntohs(networkOutput); // output is in host order

The first fragment of your code works correctly because you don't work directly with byte addresses. Such code produces the correct result independently of your platform's endianness, because of how the '<<' and '|' operators are defined by the C++ language.
The second fragment of your code proves this, by showing the actual values of the individual bytes on your little-endian system.
TCP/IP networking standardizes the use of big-endian ("network") byte order and provides the following utilities:
before sending multi-byte numeric values, use the standard functions htonl ("host-to-network-long") and htons ("host-to-network-short") to convert your values to the network representation;
after receiving multi-byte numeric values, use the standard functions ntohl ("network-to-host-long") and ntohs ("network-to-host-short") to convert them back to your platform-specific representation.
(Actually these 4 utilities perform conversions on little-endian platforms only and do nothing on big-endian platforms. But using them always makes your code platform-independent.)
With ASIO you have access to these utilities using:
#include <boost/asio.hpp>
You can read more by searching for 'man htonl' or 'msdn htonl' on Google.
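For example, a minimal sketch (append_u16 is a hypothetical helper; it fills the kind of std::vector<uint8_t> buffer mentioned in the question's edit):
#include <boost/asio.hpp> // makes htons/ntohs available on both Windows and POSIX
#include <cstdint>
#include <cstring>
#include <vector>

// Append a 16-bit value to the send buffer in network (big-endian) order.
void append_u16(std::vector<uint8_t>& buf, uint16_t host_value)
{
    const uint16_t net_value = htons(host_value); // byte swap on little-endian hosts, no-op on big-endian
    uint8_t bytes[sizeof(net_value)];
    std::memcpy(bytes, &net_value, sizeof(net_value));
    buf.insert(buf.end(), bytes, bytes + sizeof(net_value));
}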

About Modbus :
For 16-bit words Modbus sends the most significant byte first, which means it uses big-endian; so if the client or the server is little-endian, it will have to swap the bytes when sending or receiving.
Another problem is that Modbus does not define in what order 16-bit registers are sent for 32-bit types.
There are Modbus server devices that send the most significant 16-bit register first and others that do the opposite. The only solution to this is to have, in the client configuration, the possibility of swapping the 16-bit registers.
A similar problem can also happen when character strings are transmitted: some servers, instead of sending abcdef, send badcfe.
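For example, a client-side sketch of that register swap (decode_u32 is a hypothetical helper; swap_words would come from the client configuration):
#include <cstdint>

// reg0 and reg1 are the two 16-bit registers in the order they were received.
// swap_words is set when the server sends the least significant register first.
uint32_t decode_u32(uint16_t reg0, uint16_t reg1, bool swap_words)
{
    const uint16_t high = swap_words ? reg1 : reg0;
    const uint16_t low  = swap_words ? reg0 : reg1;
    return (static_cast<uint32_t>(high) << 16) | low;
}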

Related

endianness influence in C++ code

I know that this might be a silly question, but I am a newbie C++ developer and I need some clarification about endianness.
I have to implement a communication interface that relies on the SCTP protocol in order to communicate between two different machines (one ARM based, the other Intel based).
The aim is to:
encode messages into a stream of bytes to be sent on the socket (I used a vector of uint8_t, and positioned each byte of the different fields, taking care of splitting uint16/32/64 values into single bytes, following the big-endian convention)
send the byte stream via socket to the receiver (using SCTP)
retrieve the stream and parse it in order to fill the message object with the correct elements (represented by a header + TV information elements)
I am confused about where I could run into problems with the endianness of the underlying architectures of the 2 machines on which the interface will be used.
I think that taking care of splitting objects into single bytes and positioning them using big-endian ensures that the stream cannot be interpreted differently on arrival, right? Or am I missing something?
Also, I am in doubt about the role of C++ representation of multiple-byte variables, for example:
uint16_t var=0x0123;
//low byte 0x23
uint8_t low = (uint8_t)var;
//hi byte 0x01
uint8_t hi = (uint8_t)(var >> 8);
This piece of code is endianness dependent or not? I.e. if I work on a big-endian machine I suppose that the above code is OK, but if it is little-endian, will I pick up the bytes in a different order?
I've already searched for such questions but no one gave me a clear reply, so I still have doubts about this.
Thank you all in advance guys, have a nice day!
This piece of code is endianness dependent or not?
No, the code doesn't depend on the endianness of the target machine. Bitwise operations work the same way as e.g. mathematical operators do.
They are independent of the internal representation of the numbers.
Though if you're exchanging data over the wire, you need to have a defined byte order known on both sides. Usually that's network byte order (i.e. big-endian).
The functions of the htonx()/ntohx() family will help you encode/decode the (multi-byte) numbers correctly and transparently.
The code you presented is endian-independent, and likely the correct approach for your use case.
What won't work, and is not portable, is code that depends on the memory layout of objects:
// Don't do this!
uint16_t var=0x0123;
auto p = reinterpret_cast<char*>(&var);
uint8_t hi = p[0]; // 0x01 or 0x23 (probably!)
uint8_t lo = p[1]; // 0x23 or 0x01 (probably!)
(I've written probably in the comments to show that these are the likely real-world values, rather than anything specified by Standard C++)
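For the vector-based big-endian serialization described in the question, the portable version of the same idea is to build the bytes with shifts only (a sketch; appendBE16/appendBE32 are hypothetical helper names):
#include <cstdint>
#include <vector>

void appendBE16(std::vector<uint8_t>& out, uint16_t v)
{
    out.push_back(static_cast<uint8_t>(v >> 8)); // most significant byte first
    out.push_back(static_cast<uint8_t>(v));
}

void appendBE32(std::vector<uint8_t>& out, uint32_t v)
{
    appendBE16(out, static_cast<uint16_t>(v >> 16));
    appendBE16(out, static_cast<uint16_t>(v));
}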

Sending null byte over a socket to a Windows Machine from Linux - Will it be same

This is based on an earlier question I asked. Currently I am sending an octet of zero bits from a Linux machine to another Linux machine, like this, over a socket:
const char null_data(0);
send(newsockfd,&null_data,1,0);
My question is: will this be the same when sending to a Windows machine (64 bits)?
Or will I have to change the code?
The trick here is to use the uint*_t data types, insofar as feasible:
#include <cstdint>
/* ... */
#if !defined(__WIN64)
// *nix variant
typedef int socket_fd_t;
#else
// WinXX socket descriptor data type.
typedef SOCKET socket_fd_t;
#endif
void send_0_byte(socket_fd_t newsockfd)
{
    uint8_t zero_byte(0);
    // Windows' send() takes const char*, POSIX takes const void*, so the cast works on both.
    send(newsockfd, reinterpret_cast<const char*>(&zero_byte), 1, 0);
}
You probably want to add some error-checking code and include the correct platform socket header. uint8_t is an 8-bit quantity (an octet) by definition, which meets your requirement and avoids potential char size issues.
On the receiver side, you want to recv into a uint8_t buffer.
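For example, a receiving sketch (error handling kept minimal; recv_0_byte is a hypothetical counterpart to the sender above and reuses the same socket_fd_t typedef):
void recv_0_byte(socket_fd_t newsockfd)
{
    uint8_t byte = 0xFF;
    // recv takes char* on Windows and void* on POSIX, so the cast works on both.
    auto n = recv(newsockfd, reinterpret_cast<char*>(&byte), 1, 0);
    if (n != 1) {
        // handle error or closed connection
    }
    // byte now holds the received octet (expected to be 0)
}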
You send 1 char. A char is defined by the standard to be 1 byte in size. Fortunately, on all the C++ implementations I know of, a byte is 1 octet (8 bits), so you should always get the same result.
Note however that the standard does not define the size of a byte:
1.7/1: The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined.
This means that the sending and receiving machines/architectures/implementations do not necessarily have the same understanding of the number of octets to be sent/received. For example, if in the future some implementation defined a byte to be represented by 2 octets (perfectly valid according to the standard, although not probable), you could in theory get into trouble.
The real problems will start if you use larger integers, as you'll have to cope with potentially different endianness. It's even worse if you consider floating-point data, as its encoding is not specified by the standard.

C++ - Creating an integer of bits and nibbles

For a full background (you don't really need to understand this to understand the problem but it may help) I am writing a CLI program that sends data over Ethernet and I wish to add VLAN tags and priority tags to the Ethernet headers.
The problem I am facing is that I have a single 16-bit integer value that is built from three smaller values: PCP is 3 bits long (so 0 to 7), DEI is 1 bit, then VLANID is 12 bits long (0-4095). PCP and DEI together form the first 4-bit nibble, 4 bits from VLANID complete the first byte, and the remaining 8 bits of VLANID form the second byte of the integer.
11123333 33333333
1 == PCP bits, 2 == DEI bit, 3 == VLANID bits
Let's pretend PCP == 5, which in binary is 101, DEI == 0, and VLANID == 164, which in binary is 0000 10100100. Firstly I need to combine these values to form the following:
10100000 10100100
The problem I face is that when I copy this integer into a buffer to be encoded onto the wire (Ethernet medium), the byte ordering changes, as follows (I am printing my integer in binary before it gets copied to the wire, and using Wireshark to capture it on the wire to compare):
Bit order in memory: abcdefgh 87654321
Bit order on the wire: 87654321 abcdefgh
I have two problems here really:
The first is creating the 2-byte integer by "sticking" the three smaller values together.
The second is ensuring the order of the bytes is such that they will be encoded correctly onto the wire (so the bytes aren't in reverse order).
Obviously I have made an attempt at the code to get this far, but I'm really out of my depth and would like to see someone's suggestion from scratch, rather than posting what I have done so far and having someone suggest how to change it, in a possibly hard-to-read and long-winded fashion.
The issue is byte ordering, rather than bit ordering. Bits in memory don't really have an order because they are not individually addressable, and the transmission medium is responsible for ensuring that the discrete entities transmitted, octets in this case, arrive in the same shape they were sent in.
Bytes, on the other hand, are addressable and the transmission medium has no idea whether you're sending a byte string which requires that no reordering be done, or a four byte integer, which may require one byte ordering on the receiver's end and another on the sender's.
For this reason, network protocols have a declared 'byte ordering' to and from which all senders and receivers should convert their data. This way data can be sent and retrieved transparently by network hosts of different native byte orderings.
POSIX defines some functions for doing the required conversions:
#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
'n' and 'h' stand for 'network' and 'host'. So htonl converts a 32-bit quantity from the host's in-memory byte ordering to the network interface's byte ordering.
Whenever you're preparing a buffer to be sent across the network you should convert each value in it from the host's byte ordering to the network's byte ordering, and any time you're processing a buffer of received data you should convert the data in it from the network's ordering to the host's.
struct { uint32_t i; int8_t a, b; uint16_t s; } sent_data = {100000, 'a', 'b', 500};
sent_data.i = htonl(sent_data.i);
sent_data.s = htons(sent_data.s);
write(fd, &sent_data, sizeof sent_data);
// ---
struct { uint32_t i; int8_t a, b; uint16_t s; } received_data;
read(fd, &received_data, sizeof received_data);
received_data.i = ntohl(received_data.i);
received_data.s = ntohs(received_data.s);
assert(100000 == received_data.i && 'a' == received_data.a &&
       'b' == received_data.b && 500 == received_data.s);
The above code still makes some assumptions, though: that both the sender and receiver use compatible char encodings (e.g., that they both use ASCII), that they both use 8-bit bytes, that they have compatible number representations after accounting for byte ordering, and so on.
Programs that do not care about portability and inter-operate only with themselves on remote hosts may skip byte ordering in order to avoid the performance cost. Since all hosts will share the same byte ordering they don't need to convert at all. Of course if a program does this and then later needs to be ported to a platform with a different byte ordering then either the network protocol has to change or the program will have to handle a byte ordering that is neither the network ordering nor the host's ordering.
Today the only common byte orderings are simply reversals of each other, meaning that hton and ntoh both do the same thing and one could just as well use hton both for sending and receiving. However one should still use the proper conversion simply to communicate the intent of the code. And, who knows, maybe someday your code will run on a PDP-11 where hton and ntoh are not interchangeable.
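Applied to the VLAN tag in the question, a sketch could look like this (write_vlan_tci is a hypothetical helper; pcp, dei and vlan_id are assumed to already be in range):
#include <arpa/inet.h> // htons (winsock2.h on Windows)
#include <cstdint>
#include <cstring>

// Pack PCP (3 bits), DEI (1 bit) and VLAN ID (12 bits) into the 16-bit TCI
// field, then convert it to network byte order before copying it into the frame.
void write_vlan_tci(uint8_t* frame_pos, uint8_t pcp, uint8_t dei, uint16_t vlan_id)
{
    const uint16_t tci = static_cast<uint16_t>((pcp & 0x7) << 13)
                       | static_cast<uint16_t>((dei & 0x1) << 12)
                       | (vlan_id & 0x0FFF);
    const uint16_t wire = htons(tci); // most significant byte ends up first on the wire
    std::memcpy(frame_pos, &wire, sizeof wire);
}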

Endianness swap without ntohs

I am writing an ELF analyzer, but I'm having some trouble converting endianness properly. I have functions to determine the endianness of the analyzer and the endianness of the object file.
Basically, there are four possible scenarios:
A big-endian compiled analyzer run on a big-endian object file
nothing needs to be converted
A big-endian compiled analyzer run on a little-endian object file
the byte order needs to be swapped, but ntohs/l() and htons/l() are both null macros on a big-endian machine, so they won't swap the byte order. This is the problem.
A little-endian compiled analyzer run on a big-endian object file
the byte order needs to be swapped, so use htons() to swap the byte order
A little-endian compiled analyzer run on a little-endian object file
nothing needs to be converted
Is there a function I can use to explicitly swap byte order/change endianness, since ntohs/l() and htons/l() take the host's endianness into account and sometimes don't convert? Or do I need to find/write my own swap byte order function?
I think it's worth raising The Byte Order Fallacy article here, by Rob Pike (one of Go's authors).
If you do things right, i.e. you do not assume anything about your platform's byte order, then it will just work. All you need to care about is whether ELF format files are in little-endian or big-endian mode.
From the article:
Let's say your data stream has a little-endian-encoded 32-bit integer. Here's how to extract it (assuming unsigned bytes):
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
If it's big-endian, here's how to extract it:
i = (data[3]<<0) | (data[2]<<8) | (data[1]<<16) | (data[0]<<24);
And just let the compiler worry about optimizing the heck out of it.
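Applied to the ELF case, a 16-bit field can be read the same way, switching on the file's EI_DATA byte rather than on the host (a sketch; on Linux the ELFDATA2LSB/ELFDATA2MSB constants come from <elf.h>):
#include <cstdint>
#include <elf.h> // ELFDATA2LSB, ELFDATA2MSB

// data points at the two bytes of a 16-bit ELF field, in file order;
// ei_data is e_ident[EI_DATA] from the ELF header.
uint16_t read_elf_half(const uint8_t* data, unsigned char ei_data)
{
    if (ei_data == ELFDATA2LSB)
        return static_cast<uint16_t>(data[0] | (data[1] << 8)); // little-endian file
    return static_cast<uint16_t>((data[0] << 8) | data[1]);     // big-endian file
}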
On Linux there are several conversion functions in endian.h, which allow you to convert between arbitrary endiannesses:
uint16_t htobe16(uint16_t host_16bits);
uint16_t htole16(uint16_t host_16bits);
uint16_t be16toh(uint16_t big_endian_16bits);
uint16_t le16toh(uint16_t little_endian_16bits);
uint32_t htobe32(uint32_t host_32bits);
uint32_t htole32(uint32_t host_32bits);
uint32_t be32toh(uint32_t big_endian_32bits);
uint32_t le32toh(uint32_t little_endian_32bits);
uint64_t htobe64(uint64_t host_64bits);
uint64_t htole64(uint64_t host_64bits);
uint64_t be64toh(uint64_t big_endian_64bits);
uint64_t le64toh(uint64_t little_endian_64bits);
Edited: a less reliable solution. You can use a union to access the bytes in any order (note that in standard C++, reading a union member other than the one last written is not strictly well-defined). It's quite convenient:
union {
    short number;
    char bytes[sizeof(number)];
};
Do I need to find/write my own swap byte order function?
Yes you do. But, to make it easy, I refer you to this question: How do I convert between big-endian and little-endian values in C++? which gives a list of compiler specific byte order swap functions, as well as some implementations of byte order swap functions.
The ntoh functions can swap between more than just big- and little-endian. Some systems are also 'middle-endian', where the bytes are scrambled up rather than simply ordered one way or the other.
Anyway, if all you care about are big- and little-endian, then all you need to know is whether the host's and the object file's endianness differ. You'll have your own function which unconditionally swaps byte order, and you'll call it or not based on whether host_endianess()==objectfile_endianess().
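A minimal sketch of such an unconditional swap, written by hand without compiler intrinsics (swap16/swap32 are just illustrative names):
#include <cstdint>

uint16_t swap16(uint16_t v)
{
    return static_cast<uint16_t>((v >> 8) | (v << 8));
}

uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

// Usage: only swap when the host and the object file disagree, e.g.
//   if (host_endianess() != objectfile_endianess()) value = swap16(value);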
If I were thinking about a cross-platform solution that would work on Windows or Linux, I would write something like:
#include <algorithm>
// dataSize is the number of bytes to convert (it must be a compile-time constant for these array declarations).
char le[dataSize];// little-endian
char be[dataSize];// big-endian
// Fill contents in le here...
std::reverse_copy(le, le + dataSize, be);

When does Endianness become a factor?

Endianness, from what I understand, is when the bytes that compose a multi-byte word differ in their order, at least in the most typical case. So a 16-bit integer may be stored as either 0xHHLL or 0xLLHH.
Assuming I don't have that wrong, what I would like to know is when endianness becomes a major factor when sending information between two computers whose endianness may or may not be different.
If I transmit a short integer of 1, in the form of a char array and with no correction, is it received and interpreted as 256?
If I decompose and recompose the short integer using the following code, will endianness no longer be a factor?
// Sender:
for (n = 0; n < sizeof(uint16_t) * 8; ++n) {
    stl_bitset[n] = (value >> n) & 1;
}
// Receiver:
for (n = 0; n < sizeof(uint16_t) * 8; ++n) {
    value |= uint16_t(stl_bitset[n] & 1) << n;
}
Is there a standard way of compensating for endianness?
Thanks in advance!
Very abstractly speaking, endianness is a property of the reinterpretation of a variable as a char-array.
Practically, this matters precisely when you read() from and write() to an external byte stream (like a file or a socket). Or, speaking abstractly again, endianness matters when you serialize data (essentially because serialized data has no type system and just consists of dumb bytes); and endianness does not matter within your programming language, because the language only operates on values, not on representations. Going from one to the other is where you need to dig into the details.
To wit - writing:
uint32_t n = get_number();
unsigned char bytesLE[4] = { (unsigned char)n, (unsigned char)(n >> 8),
                             (unsigned char)(n >> 16), (unsigned char)(n >> 24) }; // little-endian order
unsigned char bytesBE[4] = { (unsigned char)(n >> 24), (unsigned char)(n >> 16),
                             (unsigned char)(n >> 8), (unsigned char)n }; // big-endian order
write(bytes..., 4);
Here we could just have said, reinterpret_cast<unsigned char *>(&n), and the result would have depended on the endianness of the system.
And reading:
unsigned char buf[4];
read_data(buf, sizeof buf); // fill buf with the next 4 bytes from the stream
uint32_t n_LE = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (uint32_t(buf[3]) << 24); // little-endian
uint32_t n_BE = buf[3] | (buf[2] << 8) | (buf[1] << 16) | (uint32_t(buf[0]) << 24); // big-endian
Again, here we could have said, uint32_t n = *reinterpret_cast<uint32_t*>(buf), and the result would have depended on the machine endianness.
As you can see, with integral types you never have to know the endianness of your own system, only of the data stream, if you use algebraic input and output operations. With other data types such as double, the issue is more complicated.
For the record, if you're transferring data between devices you should pretty much always use network byte ordering with ntohl, htonl, ntohs, htons. It'll convert to the network byte order standard for endianness regardless of what your system and the destination system use. Of course, both systems should be programmed like this, but they usually are in networking scenarios.
No, though you do have the right general idea. What you're missing is the fact that even though it's normally a serial connection, a network connection (at least most network connections) still guarantees correct endianness at the octet (byte) level -- i.e., if you send a byte with a value of 0x12 on a little endian machine, it'll still be received as 0x12 on a big endian machine.
Looking at a short, it will probably help to look at the number in hexadecimal. It starts out as 0x0001. You break it into two bytes: 0x00 0x01. Upon receipt, that'll be read as 0x0100, which turns out to be 256.
Since the network deals with endianness at the octet level, you normally only have to compensate for the order of bytes, not bits within bytes.
Probably the simplest method is to use htons/htonl when sending, and ntohs/ntohl when receiving. When/if that's not sufficient, there are many alternatives such as XDR, ASN.1, CORBA IIOP, Google protocol buffers, etc.
The "standard way" of compensating is that the concept of "network byte order" has been defined, almost always (AFAIK) as big endian.
Senders and receivers both know the wire protocol, and if necessary will convert before transmitting and after receiving, to give applications the right data. But this translation happens inside your networking layer, not in your applications.
Both endiannesses have an advantage that I know of:
Big-endian is conceptually easier to understand because it's similar to our positional numeral system: most significant to least significant.
Little-endian is convenient when reusing a memory reference for multiple memory sizes. Simply put, if you have a pointer to a little-endian unsigned int* but you know the value stored there is < 256, you can cast your pointer to unsigned char*.
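For example, a small sketch of that little-endian convenience (the printed value depends on the host byte order, which is exactly the point):
#include <cstdint>
#include <cstdio>

int main()
{
    unsigned int value = 200; // known to fit in a single byte
    const unsigned char* first_byte = reinterpret_cast<const unsigned char*>(&value);
    // On a little-endian host the first byte is the least significant one,
    // so this prints 200; on a big-endian host it would print 0.
    std::printf("%u\n", static_cast<unsigned>(*first_byte));
    return 0;
}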
Endianness is ALWAYS an issue. Some will say that if you know that every host connected to the network runs the same OS, etc, then you will not have problems. This is true until it isn't. You always need to publish a spec that details the EXACT format of on-wire data. It can be any format you want, but every endpoint needs to understand the format and be able to interpret it correctly.
In general, protocols use big-endian for numerical values, but this has limitations if everyone isn't IEEE 754 compatible, etc. If you can take the overhead, then use an XDR (or your favorite solution) and be safe.
Here are some guidelines for C/C++ endian-neutral code. Obviously these are written as "rules to avoid"... so if code has these "features" it could be prone to endian-related bugs!! (This is from my article on endianness published in Dr Dobbs.)
Avoid using unions which combine different multi-byte datatypes.
(the layout of the unions may have different endian-related orders)
Avoid accessing byte arrays outside of the byte datatype.
(the order of the byte array has an endian-related order)
Avoid using bit-fields and byte-masks.
(since the layout of the storage depends on endianness, the masking of the bytes and the selection of the bit fields are endian sensitive)
Avoid casting pointers from multi-byte types to other byte types.
(when a pointer is cast from one type to another, the endianness of the source (i.e. the original target) is lost and subsequent processing may be incorrect)
You shouldn't have to worry unless you're at the border of the system. Normally, if you're talking in terms of the STL, you have already passed that border.
It's the task of the serialization protocol to indicate/determine how a series of bytes can be transformed into the type you're sending, be it a built-in type or a custom type.
If you're talking about built-in types only, the machine abstraction provided by the tools of your environment may suffice.