Type bit lengths and architecture-specific implementations - C++

I'm working in C++, but lately I've found that there are slight differences between architectures in how much data a type can accommodate, and that byte order is an issue as well.
Suppose I have a binary file in which I've encoded shorts that are 2 bytes in size. The file looks like:
FA C8 - data segment 1
BA 32 - data segment 2
53 56 - data segment 3
All is well up to this point. Now I want to read this data back, and there are two problems:
1. What data type should I choose to store these values?
2. How do I deal with the endianness of the target architecture?
The first problem is actually related to the second, because I will have to do bit shifts in order to swap the byte order.
I know that I could read the file byte by byte and combine every two bytes. But is there an approach that could ease that pain?
I'm sorry if I'm being ambiguous; the problem is hard to explain. I hope you get a glimpse of what I'm talking about. I just want to store this data internally.
I would appreciate any advice, or if you could share some of your experience on this topic.

If you use big-endian for the file that stores the data, then you can just rely on htons(), htonl(), ntohs(), and ntohl() to convert the integers to the right endianness before saving or after reading.
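For example, here is a minimal sketch of reading one of the 2-byte values above, assuming the file stores them big-endian (ntohs() comes from the POSIX header <arpa/inet.h>; Windows offers the same function via winsock2.h):

#include <arpa/inet.h> // ntohs()
#include <cstdint>
#include <cstdio>

// Read one big-endian (network order) short from the file and
// convert it to the host's byte order.
bool readShort(std::FILE* f, uint16_t& out)
{
    uint16_t raw;
    if (std::fread(&raw, sizeof raw, 1, f) != 1)
        return false;
    out = ntohs(raw); // no-op on big-endian hosts, byte swap on little-endian
    return true;
}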

There is no easy way to do this.
Rather than doing it yourself, you might want to look into serialization libraries (for example Protobuf or Boost.Serialization); they'll take care of a lot of that for you.
If you want to do it yourself, use fixed-width types (uint32_t and the like from <cstdint>), and endian conversion functions as appropriate. Either have a "prefix" in your file that determines what endianness it contains (a BOM/Byte Order Mark), or always store in either big or little endian, and systematically convert.
Be extra careful if you need to serialize strings; they have encoding problems of their own.
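As a sketch of the "systematically convert" option: if the file always stores values big-endian, decoding them with plain shifts works identically on any host (the function names here are illustrative, not from any library):

#include <cstdint>

// Decode fixed-width values stored big-endian in a byte buffer,
// independent of the host's own byte order.
uint16_t read_be16(const unsigned char* p)
{
    return static_cast<uint16_t>((p[0] << 8) | p[1]);
}

uint32_t read_be32(const unsigned char* p)
{
    return (static_cast<uint32_t>(p[0]) << 24) |
           (static_cast<uint32_t>(p[1]) << 16) |
           (static_cast<uint32_t>(p[2]) << 8)  |
            static_cast<uint32_t>(p[3]);
}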

Related

Is there a portable binary-serialisation schema in FlatBuffers/Protobuf that supports arbitrary 24-bit signed integer definitions?

We are sending data over UART serial at a high data rate, so data size is important. The optimal format for our data is Int24, which may be expressed under C/C++ as a packed C bit-field struct (GCC compiler):
#pragma pack(push, 1)
struct Int24
{
    int32_t value : 24;
};
#pragma pack(pop)

typedef std::array<Int24, 32> ArrayOfInt24;
This data is packaged with other data and shared among devices and cloud infrastructures. Basically, we need a binary serialization that can be sent between devices of different architectures and programming languages. We would like to use a schema-based binary serialisation such as ProtoBuffers or FlatBuffers so that client code doesn't need to handle the bit shifting and the recovery of the two's-complement sign bit itself. For example, reading the 24-bit value in a non-C language requires the following:
bool isSigned = (_b2 & (byte)0x80) != 0; // Sign extend negative quantities
int32_t value = _b0 | (_b1 << 8) | (_b2 << 16) | (isSigned ? 0xFF : 0x00) << 24;
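(For comparison, a sketch of the same recovery written out in C++, assuming the three bytes are in little-endian order as above:)

#include <cstdint>

// Assemble the three bytes and sign-extend bit 23 through the top byte.
int32_t int24ToInt32(uint8_t b0, uint8_t b1, uint8_t b2)
{
    uint32_t u = static_cast<uint32_t>(b0) |
                 (static_cast<uint32_t>(b1) << 8) |
                 (static_cast<uint32_t>(b2) << 16);
    if (u & 0x800000u)    // sign bit of the 24-bit value set?
        u |= 0xFF000000u; // extend it into bits 24-31
    return static_cast<int32_t>(u);
}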
If this doesn't already exist, which (if any) existing binary serialisation library could easily be modified to support it? We would be willing to contribute to any open-source project in this respect.
Depending on various things, you might like to look at ASN.1 and the unaligned Packed Encoding Rules (uPER). This is a binary serialisation that is widely used in telephony to minimise the number of transmitted bits. Tools are available for C, C++, C#, Java, and Python (I think they all cover uPER). A good starting point is Useful Old Technologies.
One of the reasons you might choose to use it is that uPER likely ends up doing better than anything else out there. Another benefit is constraints (on values and array sizes): you can express these in your schema, and the generated code will check data against them. This is something that can make a real difference to a project - automatic sanitisation of incoming data is a great way of resisting attacks - and it's something that GPB doesn't do.
Reasons not to use it are that the very best tools are commercial, and quite pricey. There are some open-source tools that are quite good, though they don't necessarily implement the entire ASN.1 standard (which is vast). It's also a learning curve, though at a basic level it's not so very different from Google Protocol Buffers. In fact, at the conference where Google announced GPB, someone asked "why not use ASN.1?". The Google bod hadn't heard of it; somewhat ironic that a search company, without searching the web for binary serialisation technologies, went right ahead and invented its own...
Protocol Buffers use a dynamically sized integer encoding called varint, so you can just use uint32 or sint32, and the encoded value will be four bytes or less for any 24-bit value, and three bytes or less for any value < 2^21 (the actual size of an encoded integer is ⌈HB/7⌉ bytes, where HB is the position of the highest set bit in the value).
Make sure not to use int32, as that uses a very inefficient encoding (10 bytes!) for negative values. For repeated values, just mark the field as repeated, so multiple values will be sent efficiently packed.
syntax = "proto3";

message Test {
    repeated sint32 data = 1;
}
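To make the ⌈HB/7⌉ rule above concrete, here is a sketch of the varint length computation (for the plain unsigned encoding, ignoring the zig-zag step that sint32 adds):

#include <cstdint>

// Each varint byte carries 7 payload bits, so an encoded value
// occupies ceil(HB / 7) bytes; zero still takes one byte.
int varintSize(uint32_t value)
{
    int bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}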
FlatBuffers doesn't support 24-bit ints. The only way to represent it would be something like:
struct Int24 { a:ubyte; b:ubyte; c:ubyte; }
which obviously doesn't do the bit-shifting for you, but would still allow you to pack multiple Int24 together in a parent vector or struct efficiently. It would also save a byte when stored in a table, though there you'd probably be better off with just a 32-bit int, since the overhead is higher.
One particularly efficient use of protobuf's varint format is to use it as a sort of compression scheme, by writing the deltas between values.
In your case, if there is any correlation between consecutive values, you could have a repeated sint32 values field. Then as the first entry in the array, write the first value. For all further entries, write the difference from the previous value.
This way, e.g. [100001, 100050, 100023, 95000] would get encoded as [100001, 49, -27, -5023]. As a packed varint array, the deltas take 3, 1, 1 and 2 bytes, 7 bytes in total, compared with 12 bytes for a fixed 24-bit encoding and also 12 bytes for non-delta varints.
Of course, this needs a bit of code on the receiving side to reverse the transform, but keeping a running sum is easy enough to implement in any language.
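A minimal sketch of that transform on the application side (the protobuf plumbing itself is omitted; the function names are illustrative):

#include <cstdint>
#include <vector>

// Turn absolute readings into deltas before serialisation.
std::vector<int32_t> toDeltas(const std::vector<int32_t>& values)
{
    std::vector<int32_t> deltas;
    deltas.reserve(values.size());
    int32_t prev = 0;
    for (int32_t v : values) {
        deltas.push_back(v - prev); // first entry stays the value itself
        prev = v;
    }
    return deltas;
}

// Reverse the transform after parsing: a running sum restores the values.
std::vector<int32_t> fromDeltas(const std::vector<int32_t>& deltas)
{
    std::vector<int32_t> values;
    values.reserve(deltas.size());
    int32_t prev = 0;
    for (int32_t d : deltas) {
        prev += d;
        values.push_back(prev);
    }
    return values;
}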

Endianness influence in C++ code

I know that this might be a silly question, but I am a newbie C++ developer and I need some clarifications about endianness.
I have to implement a communication interface that relies on the SCTP protocol in order to communicate between two different machines (one ARM-based, the other Intel-based).
The aim is to:
encode messages into a stream of bytes to be sent on the socket (I used a vector of uint8_t and positioned each byte of the different fields, taking care of splitting uint16/32/64 values into single bytes, following the big-endian convention)
send the byte stream via socket to the receiver (using SCTP)
retrieve the stream and parse it in order to fill the message object with the correct elements (represented by a header + TV information elements)
I am confused about where I could run into problems with the endianness of the underlying architectures of the two machines the interface will be used on.
I think that splitting objects into single bytes and positioning them in big-endian order prevents the stream from being represented differently on arrival, right? Or am I missing something?
Also, I have some doubts about how C++ represents multi-byte variables, for example:
uint16_t var = 0x0123;
uint8_t low = (uint8_t)var;        // low byte: 0x23
uint8_t hi  = (uint8_t)(var >> 8); // high byte: 0x01
Is this piece of code endianness-dependent or not? I.e., I suppose the above code is fine on a big-endian machine, but on a little-endian one, will I pick up the bytes in a different order?
I've already searched for similar questions, but no one gave me a clear reply, so I still have doubts about this.
Thank you all in advance, and have a nice day!
Is this piece of code endianness-dependent or not?
No, the code doesn't depend on the endianness of the target machine. Bitwise operations work the same way that mathematical operators do:
they are independent of the internal representation of the numbers.
Though if you're exchanging data over the wire, you do need a defined byte order that both sides know. Usually that's network byte order (i.e. big-endian).
The functions of the htonx()/ntohx() family will help you encode and decode the (multi-byte) numbers correctly and transparently.
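For the sending side the question describes, here is a minimal sketch of appending a 16-bit field to the outgoing byte stream in big-endian (network) order using only shifts (appendUint16 is an illustrative name):

#include <cstdint>
#include <vector>

// Append a 16-bit value high byte first; this behaves identically
// on both the ARM and the Intel machine.
void appendUint16(std::vector<uint8_t>& stream, uint16_t value)
{
    stream.push_back(static_cast<uint8_t>(value >> 8)); // high byte
    stream.push_back(static_cast<uint8_t>(value));      // low byte
}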
The code you presented is endian-independent, and likely the correct approach for your use case.
What won't work, and is not portable, is code that depends on the memory layout of objects:
// Don't do this!
uint16_t var = 0x0123;
auto p = reinterpret_cast<char*>(&var);
uint8_t hi = p[0]; // 0x01 or 0x23 (probably!)
uint8_t lo = p[1]; // 0x23 or 0x01 (probably!)
(I've written probably in the comments to show that these are the likely real-world values, rather than anything specified by Standard C++)

Is there a fast way/trick to add one bit at the beginning of a file?

For a special algorithm I have to add (or remove) a single bit at the beginning of a file, several times over. It must be one bit, not a whole byte like '0000 0001'.
Afterwards I don't have to write the new content back to the file, so it is sufficient to edit the file data in memory. For this algorithm I'm allowed to add one byte like '0000 0000' or '1000 0000' to the end of the file data.
You can summarize it as a bit shift over a whole file. I have already tried it myself: I read the file in 32-bit integers, shifted each one to the right, and transferred the last bit of the previous integer into the first position of the next.
But this method is definitely not fast enough. I also searched the internet, but couldn't find anything like this. Is there a way to do this faster?
The quick answer to your question is: there is no way to do this efficiently.
The long answer is actually a series of new questions: what do you really intend to achieve with this? What exactly do you mean by shifting one bit at the beginning of a file?
You mention reading the file in 32-bit chunks (int, or better uint32_t) and shifting them one at a time: there is a byte-ordering issue in doing it this way. It is not portable, as some CPUs read a uint32_t in little-endian order (Intel architectures) and others in big-endian order (Motorola, PowerPC, et al.).
Even the order of bits within bytes is somewhat confusing: by shifting a bit in at the beginning of the file, do you mean setting bit 0x80 of the first byte or bit 0x01 of the first byte? Bitmap files and graphics cards have conflicting conventions in this regard.
If this bit-level file format is specified outside of your program, you should be very careful about these details. If it is your own invention, a change of algorithm might help simplify the situation.
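If you do implement the shift yourself, a byte-wise version sidesteps the word byte-order issue entirely. Here is a minimal sketch, assuming the "bit 0x80 of the first byte" convention:

#include <cstdint>
#include <vector>

// Shift the whole buffer right by one bit: the new bit enters at
// bit 0x80 of the first byte, and the bit falling out of each byte
// carries into the next one.
void shiftInOneBit(std::vector<uint8_t>& buf, bool newBit)
{
    uint8_t carry = newBit ? 0x80 : 0x00;
    for (uint8_t& byte : buf) {
        uint8_t nextCarry = static_cast<uint8_t>((byte & 0x01) << 7);
        byte = static_cast<uint8_t>((byte >> 1) | carry);
        carry = nextCarry;
    }
    // 'carry' now holds the bit pushed out of the last byte; the
    // question's scheme would append it in a new trailing byte.
}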

Does endianness have an effect when copying bytes in memory?

Am I right in thinking that endianness is only relevant when we're talking about how to store a value, and not relevant when copying memory?
For example
if I have the value 0xf2fe0000 and store it on a little-endian system, the bytes get stored in the order 00, 00, fe, f2. But on a big-endian system the bytes get stored as f2, fe, 00, 00.
Now, if I simply want to copy these 4 bytes to another 4 bytes (on the same system), on a little-endian system am I going to end up with another 4 bytes containing 00, 00, fe, f2 in that order?
Or does endianness have an effect when copying these bytes in memory?
Endianness is only relevant in two scenarios:
1. When manually inspecting a byte dump of a multi-byte object, you need to know whether the bytes are ordered little-endian or big-endian to be able to interpret them correctly.
2. When the program communicates multi-byte values with the outside world, e.g. over a network connection or through a file. Then both parties need to agree on the endianness used in the communication and, if needed, convert between the internal and external byte orders.
Answering the question title: assume int to be 4 bytes.
union {
    unsigned int i;
    char a[4];
};

// elsewhere
i = 0x12345678;
cout << a[0]; // output depends on endianness; this is relevant when
              // porting code to different architectures
So it is not about copying (alone); it's about how you access the bytes.
It is also significant when transferring raw bytes over a network.
Here's how you can find the endianness programmatically:
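For instance, a common runtime check looks like this (C++20 also offers std::endian for a compile-time answer):

#include <cstdint>
#include <cstring>

// Inspect the lowest-addressed byte of a known multi-byte value.
bool isLittleEndian()
{
    uint32_t one = 1;
    uint8_t firstByte;
    std::memcpy(&firstByte, &one, 1);
    return firstByte == 1; // 1 => little-endian, 0 => big-endian
}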
memcpy doesn't know what it is copying. If it has to copy 43 61 74 00, it doesn't know whether it is copying 0x00746143, 0x43617400, a float, or "Cat".
No, when working on the same machine you don't have to worry about endianness - only when transferring binary data between little- and big-endian machines.
Basically, you have to worry about endianness only when you transfer binary data between architectures that differ in endianness.
However, when you transfer binary data between architectures, you will also have to worry about other things, like the sizes of integer types, the format of floating-point numbers, and other nasty headaches.
Yes, you are correct in thinking that you should be endianness-aware when storing or communicating binary values outside your current "scope".
Generally you don't need to worry as long as everything stays inside your own program.
If you copy memory, keep in mind what you are copying (you could get into trouble if you store long values and read back ints).
Have a look at htonl(3) or books about network programming for some good explanations.
Memcpy just copies bytes and doesn't care about endianness.
So if you want to copy one network stream to another, use memcpy.

Any way to read big endian data with little endian program?

An external group provides me with a file written on a big-endian machine, and they also provide a C++ parser for the file format.
I can only run the parser on a little-endian machine - is there any way to read the file using their parser without adding a swapbytes() call after each read?
Back in the early Iron Age, the Ancients encountered this issue when they tried to network primitive PDP-11 minicomputers with other primitive computers. The PDP-11 was the first little-endian computer, while most others at the time were big-endian.
To solve the problem once and for all, they developed the network byte order concept (always big-endian) and the corresponding network byte order macros ntohs(), ntohl(), htons(), and htonl(). Code written with those macros will always "get the right answer".
Lean on your external supplier to use the macros in their code, and the file they supply you will always be big-endian, even if they switch to a little-endian machine. Rewrite the parser they gave you to use the macros, and you will always be able to read their file, even if you switch to a big-endian machine.
A truly prodigious amount of programmer time has been wasted on this particular problem. There are days when I think a good argument could be made for hanging the PDP-11 designer who made the little-endian design decision.
Try persuading the parser team to include the following code:
int getInt(const unsigned char* bytes, int num)
{
    assert(num == 4); // from <cassert>

    // Use unsigned char so the bytes don't get sign-extended
    // before being shifted into place (big-endian input).
    int ret = bytes[0] << 24;
    ret |= bytes[1] << 16;
    ret |= bytes[2] << 8;
    ret |= bytes[3];
    return ret;
}
It might be more time-consuming than a plain int i = *reinterpret_cast<int*>(&myCharArray);, but it will always get the endianness right, on both big- and little-endian systems.
In general, there's no "easy" solution to this. You will have to modify the parser to swap the bytes of each and every integer read from the file.
It depends upon what you are doing with the data. If you are going to print the data out, you need to swap the bytes of all the numbers. If you are searching the file for one or more values, it may be faster to byte-swap your comparison value instead.
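A sketch of that idea: swap the comparison value once instead of swapping every value read from the file (byteSwap32 here is written out for illustration, not taken from a library):

#include <cstdint>

// Unconditionally reverse the four bytes of a 32-bit value.
uint32_t byteSwap32(uint32_t v)
{
    return (v << 24) |
           ((v & 0xFF00u) << 8) |
           ((v >> 8) & 0xFF00u) |
           (v >> 24);
}

// Usage sketch: on a little-endian host, compare byteSwap32(needle)
// against the raw 32-bit words read from the big-endian file.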
In general, Greg is correct, you'll have to do it the hard way.
The best approach is to simply define the endianness in the file format, and not say it's machine-dependent.
The writer then has to write the bytes in the defined order regardless of the CPU it's running on, and the reader has to do the same.
You could write a parser that wraps their parser and reverses the bytes, if you don't want to modify their parser.
Be conscious of the types of data being read in. A 4-byte int or float would need endianness correction; a 4-byte ASCII string would not.
In general, no.
If the read/write calls are not type-aware (which, for example, fread and fwrite are not), then they can't tell the difference between writing endian-sensitive data and endian-insensitive data.
Depending on how the parser is structured, you may be able to avoid some suffering: if the I/O functions it uses are aware of the types being read and written, you could modify those routines to apply the correct endianness conversions.
If you do have to modify all the read/write calls, then creating just such a routine would be a sensible course of action.
Your question somehow contains the answer: no!
I can only run the parser on a little-endian machine - is there any way to read the file using their parser without adding a swapbytes() call after each read?
If you read (and want to interpret) big-endian data on a little-endian machine, you must somehow and somewhere convert the data. You might do this after each read or after the whole file has been read (if the data read so far does not determine how to read further data) - but there is no way to omit the conversion.