For a full background (you don't really need to understand this to understand the problem but it may help) I am writing a CLI program that sends data over Ethernet and I wish to add VLAN tags and priority tags to the Ethernet headers.
The problem I am facing is that I have a single 16-bit integer value that is built from three smaller values: PCP is 3 bits long (so 0 to 7), DEI is 1 bit, and VLANID is 12 bits long (0-4095). PCP and DEI together form the first nibble (4 bits), the top 4 bits of VLANID complete the first byte, and the remaining 8 bits of VLANID form the second byte of the integer.
11123333 33333333
1 == PCP bits, 2 == DEI bit, 3 == VLANID bits
Let's pretend PCP == 5, which in binary is 101, DEI == 0, and VLANID == 164, which in binary is 0000 10100100. Firstly I need to combine these values to form the following:
10100000 10100100
The problem I face then is that when I copy this integer into a buffer to be encoded onto the wire (the Ethernet medium), the bit ordering changes as follows (I am printing out my integer in binary before it gets copied to the wire and using Wireshark to capture it on the wire to compare):
Bit order in memory: abcdefgh 87654321
Bit order on the wire: 87654321 abcdefgh
I have two problems here really:
The first is creating the 2 byte integer by "sticking" the three smaller ones together
The second is ensuring the order of bits is that which will be encoded correctly onto the wire (so the bytes aren't in the reverse order)
Obviously I have made an attempt at the code to get this far, but I'm really out of my depth and would like to see someone's suggestion from scratch, rather than posting what I have done so far and having someone suggest how to change it to perform the required functionality in a possibly hard-to-read and long-winded fashion.
The issue is byte ordering, rather than bit ordering. Bits in memory don't really have an order because they are not individually addressable, and the transmission medium is responsible for ensuring that the discrete entities transmitted, octets in this case, arrive in the same shape they were sent in.
Bytes, on the other hand, are addressable and the transmission medium has no idea whether you're sending a byte string which requires that no reordering be done, or a four byte integer, which may require one byte ordering on the receiver's end and another on the sender's.
For this reason, network protocols have a declared 'byte ordering' to and from which all senders and receivers should convert their data. This way data can be sent and retrieved transparently by network hosts of different native byte orderings.
POSIX defines some functions for doing the required conversions:
#include <arpa/inet.h>
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
'n' and 'h' stand for 'network' and 'host'. So htonl converts a 32-bit quantity from the host's in-memory byte ordering to the network interface's byte ordering.
Whenever you're preparing a buffer to be sent across the network you should convert each value in it from the host's byte ordering to the network's byte ordering, and any time you're processing a buffer of received data you should convert the data in it from the network's ordering to the host's.
struct { uint32_t i; int8_t a, b; uint16_t s; } sent_data = {100000, 'a', 'b', 500};
sent_data.i = htonl(sent_data.i);
sent_data.s = htons(sent_data.s);
write(fd, &sent_data, sizeof sent_data);
// ---
struct { uint32_t i; int8_t a, b; uint16_t s; } received_data;
read(fd, &received_data, sizeof received_data);
received_data.i = ntohl(received_data.i);
received_data.s = ntohs(received_data.s);
assert(100000 == received_data.i && 'a' == received_data.a &&
       'b' == received_data.b && 500 == received_data.s);
The above code still makes some assumptions, such as that both the sender and receiver use compatible char encodings (e.g., that they both use ASCII), that they both use 8-bit bytes, that they have compatible number representations after accounting for byte ordering, and so on.
Programs that do not care about portability and inter-operate only with themselves on remote hosts may skip byte ordering in order to avoid the performance cost. Since all hosts will share the same byte ordering they don't need to convert at all. Of course if a program does this and then later needs to be ported to a platform with a different byte ordering then either the network protocol has to change or the program will have to handle a byte ordering that is neither the network ordering nor the host's ordering.
Today the only common byte orderings are simply reversals of each other, meaning that hton and ntoh both do the same thing and one could just as well use hton both for sending and receiving. However one should still use the proper conversion simply to communicate the intent of the code. And, who knows, maybe someday your code will run on a PDP-11 where hton and ntoh are not interchangeable.
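To tie this back to the original VLAN question, here is a minimal sketch (C++, assuming the 3-bit PCP / 1-bit DEI / 12-bit VID layout described above; make_tci is just an illustrative helper, and for illustration it only prints the two bytes that would go on the wire):

#include <arpa/inet.h>  // htons (POSIX)
#include <cstdint>
#include <cstdio>
#include <cstring>

// Pack PCP (3 bits), DEI (1 bit) and VLAN ID (12 bits) into a 16-bit TCI.
uint16_t make_tci(uint8_t pcp, uint8_t dei, uint16_t vid)
{
    return static_cast<uint16_t>(((pcp & 0x7) << 13) |
                                 ((dei & 0x1) << 12) |
                                 (vid & 0xFFF));
}

int main()
{
    uint16_t tci  = make_tci(5, 0, 164);     // 0xA0A4 as a host value
    uint16_t wire = htons(tci);              // network (big-endian) order

    unsigned char frame[2];
    std::memcpy(frame, &wire, sizeof wire);  // bytes leave as 0xA0, 0xA4
    std::printf("%02x %02x\n", frame[0], frame[1]);
}

Because htons is applied before the memcpy, the capture should show the bytes as 0xA0 0xA4 regardless of the endianness of the machine that built the frame.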
Suppose I have a client and a server that communicate 16-bit numbers with each other via some network protocol, say ModbusTCP, but the protocol is not relevant here.
Now I know that the client is little-endian (my PC) and the server is big-endian (some PLC); the client is written entirely in C++ with Boost Asio sockets. With this setup I thought I had to swap the bytes received from the server to correctly store the number in a uint16_t variable, however this turns out to be wrong, because when I swap them I read incorrect values.
My understanding so far is that my C++ abstraction is storing the values into variables correctly without the need for me to actually care about swapping or endianness. Consider this snippet:
// received 0x0201 (513 in big endian)
uint8_t high { 0x02 }; // first byte
uint8_t low { 0x01 }; // second byte
// merge into 16 bit value (no swap)
uint16_t val = (static_cast<uint16_t>(high) << 8) | (static_cast<uint16_t>(low));
std::cout << val; // correctly prints 513
This somewhat surprised me, also because when I look into the memory representation with pointers, I find that the bytes are actually stored in little-endian order on the client:
// take the address of val, convert it to uint8_t pointer
auto addr = reinterpret_cast<uint8_t*>(&val);
// take the first and second bytes and print them
printf("%d ", (int)addr[0]); // prints 1
printf("%d", (int)addr[1]);  // prints 2
So the question is:
As long as I don't mess with memory addresses and pointers, can C++ guarantee me that the values I'm reading from the network are correct no matter the endianness of the server? Or am I missing something here?
EDIT:
Thanks for the answers, I want to add that I'm currently using boost::asio::write(socket, boost::asio::buffer(data)) to send data from the client to the server and data is a std::vector<uint8_t>. So my understanding is that as long as I fill data in network order I should not care about endianness of my system (or even of the server for 16 bit data), because I'm operating on the "values" and not reading bytes directly from memory, right?
To use the htons family of functions I would have to change my underlying TCP layer to use memcpy or similar with a uint8_t* data buffer, which is more C-esque than C++-ish. Why should I do it? Is there an advantage I'm not seeing?
(static_cast<uint16_t>(high) << 8) | (static_cast<uint16_t>(low)) has the same behaviour regardless of endianness: the "left" end of a number is always the most significant bit, and endianness only changes whether that bit ends up in the first or the last byte in memory.
For example:
uint16_t input = 0x0201;
uint8_t leftByte = input >> 8; // same result regardless of endianness
uint8_t rightByte = input & 0xFF; // same result regardless of endianness
uint8_t data[2];
memcpy(data, &input, sizeof(input)); // data will be {0x02, 0x01} or {0x01, 0x02} depending on endianness
The same applies in the other direction:
uint8_t data[] = {0x02, 0x01};
uint16_t output1;
memcpy(&output1, data, sizeof(output1)); // will be 0x0102 or 0x0201 depending on endianness
uint16_t output2 = data[0] << 8 | data[1]; // will be 0x0201 regardless of endianness
To ensure your code works on all platforms it's best to use the htons and ntohs family of functions:
uint16_t input = 0x0201; // input is in host order
uint16_t networkInput = htons(input);
uint8_t data[2];
memcpy(data, &networkInput, sizeof(networkInput));
// data is big endian or "network" order
uint16_t networkOutput;
memcpy(&networkOutput, data, sizeof(networkOutput));
uint16_t output = ntohs(networkOutput); // output is in host order
The first fragment of your code works correctly because you don't directly work with byte addresses. Such code compiles to an operation whose result is independent of your platform's endianness, due to the definition of the '<<' and '|' operators in the C++ language.
The second fragment of your code proves this, showing the actual values of the individual bytes on your little-endian system.
TCP/IP networking standardizes on the big-endian format and provides the following utilities:
before sending multi-byte numeric values, use the standard functions htonl ("host-to-network-long") and htons ("host-to-network-short") to convert your values to the network representation;
after receiving multi-byte numeric values, use the standard functions ntohl ("network-to-host-long") and ntohs ("network-to-host-short") to convert your values to your platform-specific representation.
(Actually these 4 utilities perform conversions on little-endian platforms only and do nothing on big-endian platforms. But using them always makes your code platform-independent.)
With ASIO you have access to these utilities using:
#include <boost/asio.hpp>
You can read more by searching for 'man htonl' or 'msdn htonl' on Google.
About Modbus:
For 16-bit words Modbus sends the most significant byte first, which means it is big-endian on the wire; if the client or the server is little-endian, it will have to swap the bytes when sending or receiving.
Another problem is that Modbus does not define in what order 16-bit registers are sent for 32-bit types.
There are Modbus server devices that send the most significant 16-bit register first and others that do the opposite. For this the only solution is to have, in the client configuration, the option of swapping the 16-bit registers.
A similar problem can also happen when character strings are transmitted: some servers send badcfe instead of abcdef.
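As an illustration of that 32-bit register-order problem, here is a small sketch (a hypothetical combine_registers helper; it assumes the two 16-bit registers have already been converted to host byte order) showing why the swap needs to be a configuration option:

#include <cstdint>

// Combine two 16-bit Modbus registers (already in host byte order) into one
// 32-bit value. 'high_word_first' is the client-side configuration switch:
// some servers send the most significant register first, others the opposite.
uint32_t combine_registers(uint16_t reg0, uint16_t reg1, bool high_word_first)
{
    if (high_word_first)
        return (static_cast<uint32_t>(reg0) << 16) | reg1;
    return (static_cast<uint32_t>(reg1) << 16) | reg0;
}

// Example: registers 0x0001 and 0x86A0 give 0x000186A0 == 100000 when the
// high word comes first, but 0x86A00001 if the words are not swapped back.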
I want to send audio data to the server using libwebsocket.
The audio data is in 16-bit, 16 kHz format.
But ws_service_callback on the server side looks like this:
static int ws_service_callback(
    struct lws *wsi,
    enum lws_callback_reasons reason, void *user,
    void *in, size_t len)
Here ws_service_callback is the callback function invoked on the server when it receives something from the client side.
void *in is the data from the client side, and it is in 8-bit format.
(Here the 8-bit and 16-bit formats mismatch.)
Should the client side split each 16-bit sample into two 8-bit values, then send them to the server side?
Then does the server side need to combine the two 8-bit values back into one 16-bit value?
Usually 16-bit audio data is stored in memory as a contiguous block of bits (and therefore bytes), so whether you treat it as 16-bit or 8-bit values doesn't matter during transmission of the whole block (or save/load to disk, etc.). I.e. 1234 16-bit values can be treated as 2468 8-bit values; in this context the network layer doesn't need to care.
But the processing on the server side, which will treat two consecutive 8-bit values as a single 16-bit value, needs to know how the original data should be interpreted, because there are two common ways: little-endian vs big-endian (one of these is for sure supported in a "native" way by the CPU; some CPUs even support both and can be switched at boot time). There is also an infinite number of uncommon/custom layouts, including things like compressed data, specifically designed bit orders, delta encoding, etc.; none of those are supported by the CPU at the single-machine-instruction level, i.e. they need to be implemented in code.
For example Intel CPUs are "little-endian", so when you write 16b value 0x1234 into memory in the "native" way, it will occupy two bytes in this order: 0x34, 0x12 (lower 8 bits are stored first, then higher 8 bits), or 32b value 0x12345678 is stored as four bytes 78 56 34 12.
On big-endian platforms the order is opposite, so high-order bytes are stored first, and 32b value 0x12345678 is stored (in CPU "native" way) as four bytes 12 34 56 78.
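A tiny sketch (plain C++, nothing platform-specific beyond the printed result) that makes this layout visible on whatever machine you run it on:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    uint32_t value = 0x12345678;
    unsigned char bytes[4];
    std::memcpy(bytes, &value, sizeof value);

    // Prints "78 56 34 12" on a little-endian CPU (e.g. x86),
    // "12 34 56 78" on a big-endian one.
    std::printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
}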
As you can see, two platforms talking through the network layer must be aware of each other's endianness and convert such data in order to interpret it correctly.
As you are basically designing a binary network protocol for this service, you should define what the expected byte order is. If both client and server are expected to run on a particular platform, then for best performance and the simplest implementation you can define the protocol to follow the native byte order of those platforms. If you then implement this protocol on a platform which has a different native order, you will have to add conversion of the data before sending it to the server (or, if you expect the clients to be low-power devices and the server to have an abundance of power, you can define a protocol which supports both common orders via a flag, and do the conversion on the server side for the "other" one).
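For example, a sketch of such a protocol for the 16-bit audio samples (this assumes you choose a little-endian wire format, which is what raw PCM commonly uses; the helper names are made up for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

// Client side: split each 16-bit sample into two bytes, low byte first
// (the little-endian wire format chosen for this hypothetical protocol).
std::vector<uint8_t> samples_to_bytes(const std::vector<int16_t>& samples)
{
    std::vector<uint8_t> out;
    out.reserve(samples.size() * 2);
    for (int16_t s : samples) {
        uint16_t u = static_cast<uint16_t>(s);
        out.push_back(static_cast<uint8_t>(u & 0xFF));         // low byte
        out.push_back(static_cast<uint8_t>((u >> 8) & 0xFF));  // high byte
    }
    return out;
}

// Server side: rebuild the 16-bit samples from the received byte buffer.
std::vector<int16_t> bytes_to_samples(const uint8_t* data, size_t len)
{
    std::vector<int16_t> out;
    out.reserve(len / 2);
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint16_t u = static_cast<uint16_t>(data[i]) |
                     (static_cast<uint16_t>(data[i + 1]) << 8);
        out.push_back(static_cast<int16_t>(u));
    }
    return out;
}

On the server side you would call something like bytes_to_samples(static_cast<const uint8_t*>(in), len) from the receive callback.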
This is based on an earlier question I asked. Currently I am sending an octet of zero bits from one Linux machine to another, like this, over a socket:
const char null_data(0);
send(newsockfd,&null_data,1,0);
My question is: will this be the same when sending to a Windows machine (64-bit)?
Or will I have to change the code?
The trick here is to use the uint*_t data types, insofar as feasible:
#include <cstdint>
/* ... */
#if !defined(__WIN64)
// *nix variant
typedef int socket_fd_t;
#else
// WinXX socket descriptor data type.
typedef SOCKET socket_fd_t;
#endif
void send_0_byte(socket_fd_t newsockfd)
{
    uint8_t zero_byte(0);
    send(newsockfd, &zero_byte, 1, 0);
}
You probably want to add some error-checking code and include the correct platform socket header. uint8_t is an 8-bit quantity (an octet) by definition, which meets your requirement and avoids potential char size issues.
On the receiver side, you want to recv into a uint8_t buffer.
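For illustration, a matching receive sketch (assuming a POSIX socket; error handling kept minimal, and on Windows the buffer pointer would need a char* cast):

#include <cstdint>
#include <sys/socket.h>  // recv() on POSIX; use the WinSock equivalent on Windows

bool receive_0_byte(int sockfd)
{
    uint8_t byte = 0xFF;
    // Expect exactly one octet with the value zero.
    return recv(sockfd, &byte, 1, 0) == 1 && byte == 0;
}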
You send 1 char. The char is defined in the standard to be 1 byte in size. Fortunately, on all the C++ implementations I know, the byte size is 1 octet (8 bits), so that you should always get the same result.
Note however that the standard does not define the size of a byte:
1.7/1 The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined.
This means that the sending/receiving machines/architectures/implementations do not necessarily have the same understanding of the number of octets to be sent/received. For example, if in the future some implementation were to define a byte as two octets (perfectly valid according to the standard, although not probable), you could in theory run into trouble.
The real problems will start if you use larger integers, as you'll have to cope with potentially different endianness. Even worse if you consider floating point data as the encoding is not specified by the standard.
Endianness, from what I understand, is when the bytes that compose a multibyte word differ in their order, at least in the most typical case, so that a 16-bit integer may be stored as either 0xHHLL or 0xLLHH.
Assuming I don't have that wrong, what I would like to know is when endianness becomes a major factor when sending information between two computers whose endianness may or may not be different.
If I transmit a short integer of 1, in the form of a char array and with no correction, is it received and interpreted as 256?
If I decompose and recompose the short integer using the following code, will endianness no longer be a factor?
// Sender:
for(n = 0; n < sizeof(uint16)*8; ++n) {
    stl_bitset[n] = (value >> n) & 1;
}
// Receiver:
for(n = 0; n < sizeof(uint16)*8; ++n) {
    value |= uint16(stl_bitset[n] & 1) << n;
}
Is there a standard way of compensating for endianness?
Thanks in advance!
Very abstractly speaking, endianness is a property of the reinterpretation of a variable as a char-array.
Practically, this matters precisely when you read() from and write() to an external byte stream (like a file or a socket). Or, speaking abstractly again, endianness matters when you serialize data (essentially because serialized data has no type system and just consists of dumb bytes); and endianness does not matter within your programming language, because the language only operates on values, not on representations. Going from one to the other is where you need to dig into the details.
To wit - writing:
uint32_t n = get_number();
unsigned char bytesLE[4] = { n, n >> 8, n >> 16, n >> 24 }; // little-endian order
unsigned char bytesBE[4] = { n >> 24, n >> 16, n >> 8, n }; // big-endian order
write(bytes..., 4);
Here we could just have said, reinterpret_cast<unsigned char *>(&n), and the result would have depended on the endianness of the system.
And reading:
unsigned char buf[4] = read_data();
uint32_t n_LE = buf[0] + (buf[1] << 8) + (buf[2] << 16) + (buf[3] << 24); // little-endian
uint32_t n_BE = buf[3] + (buf[2] << 8) + (buf[1] << 16) + (buf[0] << 24); // big-endian
Again, here we could have said, uint32_t n = *reinterpret_cast<uint32_t*>(buf), and the result would have depended on the machine endianness.
As you can see, with integral types you never have to know the endianness of your own system, only of the data stream, if you use algebraic input and output operations. With other data types such as double, the issue is more complicated.
For the record, if you're transferring data between devices you should pretty much always use network byte ordering with ntohl, htonl, ntohs, htons. It'll convert to the network byte order standard regardless of what your system and the destination system use. Of course, both systems should be programmed like this - but they usually are in networking scenarios.
No, though you do have the right general idea. What you're missing is the fact that even though it's normally a serial connection, a network connection (at least most network connections) still guarantees correct endianness at the octet (byte) level -- i.e., if you send a byte with a value of 0x12 on a little endian machine, it'll still be received as 0x12 on a big endian machine.
Looking at a short, it will probably help if you look at the number in hexadecimal. It starts out as 0x0001. You break it into two bytes: 0x00 0x01. Upon receipt, that'll be read as 0x0100, which turns out to be 256.
Since the network deals with endianness at the octet level, you normally only have to compensate for the order of bytes, not bits within bytes.
Probably the simplest method is to use htons/htonl when sending, and ntohs/ntohl when receiving. When/if that's not sufficient, there are many alternatives such as XDR, ASN.1, CORBA IIOP, Google protocol buffers, etc.
The "standard way" of compensating is that the concept of "network byte order" has been defined, almost always (AFAIK) as big endian.
Senders and receivers both know the wire protocol, and if necessary will convert before transmitting and after receiving, to give applications the right data. But this translation happens inside your networking layer, not in your applications.
Both endiannesses have an advantage that I know of:
Big-endian is conceptually easier to understand because it's similar to our positional numeral system: most significant to least significant.
Little-endian is convenient when reusing a memory reference for multiple operand sizes. Simply put, if you have an unsigned int* pointing to a little-endian value and you know the value stored there is < 256, you can cast your pointer to unsigned char* and read the same value.
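A short sketch of that little-endian convenience (reading through unsigned char* is legal aliasing; on a big-endian machine the same code would read the high-order byte instead):

#include <cstdio>

int main()
{
    unsigned int value = 200;   // known to fit in one byte
    unsigned int* ip = &value;

    // On a little-endian machine the low-order byte is stored first, so the
    // same address can simply be re-read through an unsigned char pointer.
    unsigned char* cp = reinterpret_cast<unsigned char*>(ip);
    std::printf("%u %u\n", *ip, static_cast<unsigned>(*cp));  // "200 200" on little-endian
}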
Endianness is ALWAYS an issue. Some will say that if you know that every host connected to the network runs the same OS, etc, then you will not have problems. This is true until it isn't. You always need to publish a spec that details the EXACT format of on-wire data. It can be any format you want, but every endpoint needs to understand the format and be able to interpret it correctly.
In general, protocols use big-endian for numerical values, but this has limitations if everyone isn't IEEE 754 compatible, etc. If you can take the overhead, then use an XDR (or your favorite solution) and be safe.
Here are some guidelines for C/C++ endian-neutral code. Obviously these are written as "rules to avoid"... so if code has these "features" it could be prone to endian-related bugs!! (This is from my article on endianness published in Dr. Dobb's.)
Avoid using unions which combine different multi-byte datatypes.
(the layout of the unions may have different endian-related orders; see the sketch after this list)
Avoid accessing byte arrays outside of the byte datatype.
(the order of the byte array has an endian-related order)
Avoid using bit-fields and byte-masks
(since the layout of the storage is dependent upon endianness, the masking of the bytes and selection of the bit fields is endian sensitive)
Avoid casting pointers from multi-byte types to other byte types.
(when a pointer is cast from one type to another, the endianness of the source (i.e. the original target) is lost and subsequent processing may be incorrect)
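To make the first guideline concrete, a short sketch of the kind of union it warns about (the value printed depends on the platform, and in C++ reading the inactive member is formally type-punning, which is another reason to avoid it):

#include <cstdint>
#include <cstdio>

union Overlay {
    uint32_t word;
    uint8_t  bytes[4];
};

int main()
{
    Overlay o;
    o.word = 0x11223344;
    // Prints "44" on a little-endian platform, "11" on a big-endian one,
    // which is exactly the endian-related order difference the rule warns about.
    std::printf("%02x\n", o.bytes[0]);
}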
You shouldn't have to worry unless you're at the border of the system. Normally, if you're talking in terms of the STL, you have already passed that border.
It's the task of the serialization protocol to indicate/determine how a series of bytes can be transformed into the type you're sending, be it a built-in type or a custom type.
If you're talking about built-in types only, the machine abstraction provided by the tools in your environment may suffice.
First of all, to clarify my goal: there exist two programs written in C in our laboratory. I am working on a proxy server (bidirectional) for them (which will also manipulate the data), and I want to write that proxy server in Python. It is important to know that I know close to nothing about these two programs; I only know the definition file of the packets.
Now: assuming a packet definition in one of the C++ programs reads like this:
unsigned char Packet[0x32]; // Packet[Length]
int z=0;
Packet[0]=0x00; // Spare
Packet[1]=0x32; // Length
Packet[2]=0x01; // Source
Packet[3]=0x02; // Destination
Packet[4]=0x01; // ID
Packet[5]=0x00; // Spare
for(z = 0; z <= 24; z += 8)
{
    Packet[9-z/8]=((int)(720000+armcontrolpacket->dof0_rot*1000)/(int)pow((double)2,(double)z));
    Packet[13-z/8]=((int)(720000+armcontrolpacket->dof0_speed*1000)/(int)pow((double)2,(double)z));
    Packet[17-z/8]=((int)(720000+armcontrolpacket->dof1_rot*1000)/(int)pow((double)2,(double)z));
    Packet[21-z/8]=((int)(720000+armcontrolpacket->dof1_speed*1000)/(int)pow((double)2,(double)z));
    Packet[25-z/8]=((int)(720000+armcontrolpacket->dof2_rot*1000)/(int)pow((double)2,(double)z));
    Packet[29-z/8]=((int)(720000+armcontrolpacket->dof2_speed*1000)/(int)pow((double)2,(double)z));
    Packet[33-z/8]=((int)(720000+armcontrolpacket->dof3_rot*1000)/(int)pow((double)2,(double)z));
    Packet[37-z/8]=((int)(720000+armcontrolpacket->dof3_speed*1000)/(int)pow((double)2,(double)z));
    Packet[41-z/8]=((int)(720000+armcontrolpacket->dof4_rot*1000)/(int)pow((double)2,(double)z));
    Packet[45-z/8]=((int)(720000+armcontrolpacket->dof4_speed*1000)/(int)pow((double)2,(double)z));
    Packet[49-z/8]=((int)armcontrolpacket->timestamp/(int)pow(2.0,(double)z));
}
if(SendPacket(sock, (char*)&Packet, sizeof(Packet)))
    return 1;
return 0;
What would be the easiest way to receive that data, convert it into a readable python format, manipulate them and send them forward to the receiver?
You can receive the packet's 50 bytes with a .recv call on a properly connected socket (it might actually take more than one call in the unlikely event the TCP packet gets fragmented, so check the incoming length until you have exactly 50 bytes in hand;-).
After that, understanding that C code is puzzling. The assignments of ints (presumably 4-bytes each) to Packet[9], Packet[13], etc, give the impression that the intention is to set 4 bytes at a time within Packet, but that's not what happens: each assignment sets exactly one byte in the packet, from the lowest byte of the int that's the RHS of the assignment. But those bytes are the bytes of (int)(720000+armcontrolpacket->dof0_rot*1000) and so on...
So must those last 44 bytes of the packet be interpreted as 11 4-byte integers (signed? unsigned?) or 44 independent values? I'll guess the former, and do...:
import struct
f = '>x4Bx11i'
values = struct.unpack(f, packet)
the format f indicates: big-endian, 4 unsigned-byte values surrounded by two ignored "spare" bytes, then 11 4-byte signed integers. The tuple values ends up with 15 values: the four single bytes (50, 1, 2, 1 in your example), then the 11 signed integers. You can use the same format string to pack a modified version of the tuple back into a 50-byte packet to resend.
Since you explicitly place the length in the packet, it may be that different packets have different lengths (though that's incompatible with the fixed-length declaration in your C sample), in which case you need to be a bit more careful in receiving and unpacking it; however such details depend on information you don't give, so I'll stop trying to guess;-).
Take a look at the struct module, specifically the pack and unpack functions. They work with format strings that allow you to specify what types you want to write or read and what endianness and alignment you want to use.