Reversing long read from file? - c++

I'm trying to read a long (signed, 4 bytes) from a binary file in C++.
My main concerns are portability (longs are different sizes on different platforms) and byte order: when I read from a binary file with std::ifstream, the bytes come out reversed relative to my machine's endianness.
I understand for data types like unsigned int, you can simply use bitwise operators and shift and AND each byte to reverse the byte order after being read from a file.
I'm just not sure what I'd do for this:
Currently my code will give a nonsense value:
long value;
in.seekg(0x3c);
in.read(reinterpret_cast<char*>(&value), sizeof(long));
I'm not sure how I can achieve portability (I read something about unions and char*) and also reverse the signed long it reads in.
Thanks.

Rather than using long, use int32_t from <stdint.h> to directly specify a 32-bit integer. (or uint32_t for unsigned).
Use htonl and ntohl as appropriate to get to/from network byte order.
Better:
int32_t value;
in.seekg(0x3c);
in.read(reinterpret_cast<char*>(&value), sizeof(value));
value = ntohl(value); // convert from big endian to native endian
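If ntohl is not available on a target, the same read can be sketched with plain shifts. This is a minimal sketch, not part of the answer above: read_i32_be and demo_read_i32_be are hypothetical names, and an in-memory stream stands in for the file.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not from the answer above): reads a 4-byte big-endian
// signed integer from `in`. Returns false if 4 bytes could not be read.
bool read_i32_be(std::istream& in, std::int32_t& out)
{
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), sizeof b))
        return false;
    // Assemble from the file's byte order; the host's byte order never matters.
    const std::uint32_t u = (std::uint32_t{b[0]} << 24) |
                            (std::uint32_t{b[1]} << 16) |
                            (std::uint32_t{b[2]} << 8)  |
                             std::uint32_t{b[3]};
    out = static_cast<std::int32_t>(u); // wraps to negative on two's complement targets
    return true;
}

// Usage check on an in-memory stream standing in for the file.
void demo_read_i32_be()
{
    std::istringstream in(std::string("\xFF\xFF\xFF\xFE", 4));
    std::int32_t value = 0;
    assert(read_i32_be(in, value));
    assert(value == -2); // bytes FF FF FF FE
}
```

For a little-endian file the shift amounts would simply be mirrored (b[0] unshifted, b[3] shifted by 24).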

I'd suggest you use functions like htonl, htons, ntohl and ntohs. These are used in network programming to achieve just the same goal: portability and independence of endianness.

Since cross platform support is important to you I'd recommend using cstdint to specify the size of your types. You'll be able to say int32_t x (for example) and know you are getting 32 bits of data.
Regarding the endianness of the data, I'd recommend standardizing on a format (e.g., all data is written in little endian format), wrapping your I/O operations in a class, and using that class to read/write the data. Then use a #define to decide how to read the data:
#ifdef BIG_ENDIAN
// Read the data that is in little endian format and convert
#else
// We're in little endian mode so no need to convert data
#endif
Alternatively you could look at using something like Google Protobuf that will take care of all the encoding issues for you.
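As a rough illustration of the "wrap your I/O in a class" idea, here is a minimal sketch that assumes a file format fixed as little endian. LittleEndianReader and its members are hypothetical names. It swaps in a different technique from the #ifdef above: the value is built with shifts, so no compile-time endianness check is needed.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical wrapper: every multi-byte value in the stream is little endian.
class LittleEndianReader {
public:
    explicit LittleEndianReader(std::istream& in) : in_(in) {}

    std::uint16_t read_u16() { return static_cast<std::uint16_t>(read_bytes(2)); }
    std::uint32_t read_u32() { return static_cast<std::uint32_t>(read_bytes(4)); }
    std::int32_t  read_i32() { return static_cast<std::int32_t>(read_u32()); }
    bool ok() const { return !in_.fail(); }

private:
    // Reads `count` bytes (at most 8), least significant byte first.
    std::uint64_t read_bytes(int count)
    {
        assert(count >= 1 && count <= 8);
        std::uint64_t value = 0;
        for (int i = 0; i < count; ++i) {
            char c = 0;
            if (!in_.get(c)) return 0; // stream is now in the fail state; see ok()
            value |= std::uint64_t{static_cast<unsigned char>(c)} << (8 * i);
        }
        return value;
    }

    std::istream& in_;
};

// Usage check: 0x1234 followed by -2, both stored little endian.
void demo_little_endian_reader()
{
    std::istringstream in(std::string("\x34\x12\xFE\xFF\xFF\xFF", 6));
    LittleEndianReader reader(in);
    assert(reader.read_u16() == 0x1234);
    assert(reader.read_i32() == -2);
    assert(reader.ok());
}
```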

Related

C/C++ Little/Big Endian handler

There are two systems that communicate via TCP. One uses little endian and the other big endian. The ICD between the systems contains a lot of structs (fields). Swapping bytes for each field does not look like the best solution.
Is there any generic solution/practice for handling communication between systems with different endianness?
Each system may have a different architecture, but endianness should be defined by the communication protocol. If the protocol says "data must be sent as big endian", then that's how the system sends it and how the other system receives it.
I am guessing the reason why you're asking is because you would like to cast a struct pointer to a char* and just send it over the wire, and this won't work.
That is generally a bad idea. It's far better to create an actual serializer, so that your internal data is decoupled from the actual protocol, which also means you can easily add support for different protocols in the future, or different versions of the protocols. You also don't have to worry about struct padding, aliasing, or any implementation-defined issues that casting brings along.
(update)
So generally, you would have something like:
void Serialize(const struct SomeStruct *s, struct BufferBuilder *bb)
{
BufferBuilder_append_u16_le(bb, s->SomeField);
BufferBuilder_append_s32_le(bb, s->SomeOther);
...
BufferBuilder_append_u08(bb, s->SomeOther);
}
Where you would already have all these methods written in advance, like
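// append unsigned 16-bit value, little endian
void BufferBuilder_append_u16_le(struct BufferBuilder *bb, uint16_t value)
{
if (bb->remaining < sizeof(value))
{
return; // or some error handling, whatever
}
// Write the low byte first so the output is little endian on any host
const uint8_t bytes[2] = { (uint8_t)(value & 0xFF), (uint8_t)(value >> 8) };
memcpy(bb->buffer, bytes, sizeof(bytes));
bb->buffer += sizeof(bytes); // assumes buffer is a moving write pointer
bb->remaining -= sizeof(bytes);
}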
We use this approach because it's simpler to unit test these "appending" methods in isolation, and writing (de)serializers is then a matter of just calling them in succession.
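For comparison, a C++ flavoured sketch of the same idea might look like the following. It assumes a growable std::vector as the backing store instead of a fixed buffer with a remaining count, and the SomeStruct fields (id, offset, flags) are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical C++ counterpart of the BufferBuilder above, backed by std::vector.
class BufferBuilder {
public:
    void append_u08(std::uint8_t v) { bytes_.push_back(v); }

    void append_u16_le(std::uint16_t v)
    {
        append_u08(static_cast<std::uint8_t>(v & 0xFF)); // low byte first
        append_u08(static_cast<std::uint8_t>(v >> 8));
    }

    void append_s32_le(std::int32_t v)
    {
        const auto u = static_cast<std::uint32_t>(v);
        for (int shift = 0; shift < 32; shift += 8)
            append_u08(static_cast<std::uint8_t>((u >> shift) & 0xFF));
    }

    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
};

// Hypothetical message layout, for illustration only.
struct SomeStruct {
    std::uint16_t id;
    std::int32_t  offset;
    std::uint8_t  flags;
};

// The serializer is just the appenders called in wire order.
void Serialize(const SomeStruct& s, BufferBuilder& bb)
{
    bb.append_u16_le(s.id);
    bb.append_s32_le(s.offset);
    bb.append_u08(s.flags);
}

void demo_serialize()
{
    BufferBuilder bb;
    Serialize(SomeStruct{0x0102, -2, 7}, bb);
    const std::vector<std::uint8_t> expected = {0x02, 0x01, 0xFE, 0xFF, 0xFF, 0xFF, 0x07};
    assert(bb.bytes() == expected);
}
```

Note that the wire size (7 bytes) is independent of sizeof(SomeStruct), which usually includes padding.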
But of course, if you can pick any protocol and implement both systems, then you could simply use protobuf and avoid doing a bunch of plumbing.
Generally speaking, values transmitted over a network should be in network byte order, i.e. big endian. So values should be converted from host byte order to network byte order for transmission and converted back when received.
The functions htons and ntohs do this for 16 bit integer values and htonl and ntohl do this for 32 bit integer values. On little endian systems these functions essentially reverse the bytes, while on big endian systems they're a no-op.
So for example if you have the following struct:
struct mystruct {
char f1[10];
uint32_t f2;
uint16_t f3;
};
Then you would serialize the data like this:
// s points to the struct to serialize
// p should be large enough to hold the serialized struct
void serialize(struct mystruct *s, unsigned char *p)
{
memcpy(p, s->f1, sizeof(s->f1));
p += sizeof(s->f1);
uint32_t f2_tmp = htonl(s->f2);
memcpy(p, &f2_tmp, sizeof(f2_tmp));
p += sizeof(s->f2);
uint16_t f3_tmp = htons(s->f3);
memcpy(p, &f3_tmp, sizeof(f3_tmp));
}
And deserialize it like this:
// s points to a struct which will store the deserialized data
// p points to the buffer received from the network
void deserialize(struct mystruct *s, unsigned char *p)
{
memcpy(s->f1, p, sizeof(s->f1));
p += sizeof(s->f1);
uint32_t f2_tmp;
memcpy(&f2_tmp, p, sizeof(f2_tmp));
s->f2 = ntohl(f2_tmp);
p += sizeof(s->f2);
uint16_t f3_tmp;
memcpy(&f3_tmp, p, sizeof(f3_tmp));
s->f3 = ntohs(f3_tmp);
}
While you could use compiler specific flags to pack the struct so that it has a known size, allowing you to memcpy the whole struct and just convert the integer fields, doing so means that certain fields may not be aligned properly which can be a problem on some architectures. The above will work regardless of the overall size of the struct.
You mention one problem with struct fields. Transmitting structs also requires taking care of the alignment of fields, which causes gaps between fields and is typically controlled with compiler flags.
For binary data one can use Abstract Syntax Notation One (ASN.1) where you define the data format. There are some alternatives. Like Protocol Buffers.
In C, one can use macros to determine endianness and field offsets inside a struct, and then use such a struct description as the basis for a generic bytes-to-struct conversion. This would work independently of endianness and alignment.
You would need to create such a descriptor for every struct.
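A descriptor of that kind could be sketched as follows. This is only one possible shape, written in C++ with offsetof: Message, FieldDesc, kMessageFields and decode_be are hypothetical names, and the wire format is assumed to be big endian with no padding between fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Message {            // hypothetical struct to be filled from wire bytes
    std::uint16_t kind;
    std::uint32_t length;
};

struct FieldDesc {
    std::size_t offset;     // offset of the field inside the struct
    std::size_t size;       // field width in bytes (1, 2, 4 or 8)
};

// One descriptor table per struct, listed in wire order.
const FieldDesc kMessageFields[] = {
    { offsetof(Message, kind),   sizeof(std::uint16_t) },
    { offsetof(Message, length), sizeof(std::uint32_t) },
};

template <typename T>
void store_native(unsigned char* dst, std::uint64_t value)
{
    const T v = static_cast<T>(value);
    std::memcpy(dst, &v, sizeof v); // native representation, correct alignment not required
}

// Decodes big-endian wire bytes into `object`; returns the number of bytes consumed.
std::size_t decode_be(const unsigned char* wire, void* object,
                      const FieldDesc* fields, std::size_t field_count)
{
    std::size_t pos = 0;
    for (std::size_t f = 0; f < field_count; ++f) {
        std::uint64_t value = 0;
        for (std::size_t i = 0; i < fields[f].size; ++i)
            value = (value << 8) | wire[pos++]; // most significant byte first
        unsigned char* dst = static_cast<unsigned char*>(object) + fields[f].offset;
        switch (fields[f].size) {
            case 1: store_native<std::uint8_t>(dst, value);  break;
            case 2: store_native<std::uint16_t>(dst, value); break;
            case 4: store_native<std::uint32_t>(dst, value); break;
            case 8: store_native<std::uint64_t>(dst, value); break;
            default: assert(false && "unsupported field size");
        }
    }
    return pos;
}
```

Because the value is assembled with shifts and stored through memcpy, this variant never needs to know the host's endianness or the struct's padding.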
Alternatively a parser might generate code for bytes-to-struct conversion.
But then again you could use a language neutral solution like ASN.1.
C and C++ of course have no introspection/reflection capabilities like Java has, so those are the only solutions.
The fastest and most portable way is to use bit shifts.
These have the big advantage that you only need to know the network endianness, never the CPU endianness.
Example:
uint8_t buf[4] = { MS_BYTE, ... LS_BYTE}; // some buffer from TCP/IP = Big Endian
uint32_t my_u32 = ((uint32_t)buf[0] << 24) |
((uint32_t)buf[1] << 16) |
((uint32_t)buf[2] << 8) |
((uint32_t)buf[3] << 0) ;
Do not use (bit-field) structs/type punning directly on the input. They are poorly standardized, may contain padding/alignment requirements, and depend on endianness. It is fine to use structs if you have proper serialization/deserialization routines in between. A deserialization routine may contain the above bit shifts, for example.
Do not use pointer arithmetic to iterate across the input, or plain memcpy(). Neither one of these solves the endianness issue.
Do not use htons etc bloat libs. Because they are non-portable. But more importantly because anyone who can't write a simple bit shift like above without having some lib function holding their hand, should probably stick to writing high level code in a more family-friendly programming language.
There is no point in writing code in C if you don't have a clue about how to do efficient, close to the hardware programming, also known as the very reason you picked C for the task to begin with.
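The bit shifts shown earlier in this answer also cover the sending direction. A minimal sketch, with store_u32_be and load_u32_be as hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Writes `v` into `out` most significant byte first (big endian / network order).
void store_u32_be(std::uint32_t v, std::uint8_t out[4])
{
    out[0] = static_cast<std::uint8_t>(v >> 24);
    out[1] = static_cast<std::uint8_t>(v >> 16);
    out[2] = static_cast<std::uint8_t>(v >> 8);
    out[3] = static_cast<std::uint8_t>(v);
}

// Inverse of store_u32_be; same shifts as in the example above.
std::uint32_t load_u32_be(const std::uint8_t in[4])
{
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8)  |
            static_cast<std::uint32_t>(in[3]);
}

void demo_store_load_u32_be()
{
    std::uint8_t buf[4] = {};
    store_u32_be(0x11223344u, buf);
    assert(buf[0] == 0x11 && buf[1] == 0x22 && buf[2] == 0x33 && buf[3] == 0x44);
    assert(load_u32_be(buf) == 0x11223344u);
}
```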
EDIT
Helping hand for people who are confused over how C code gets translated to asm: https://godbolt.org/z/TT1MP7oc4. As we can see, the machine code is identical on x86 Linux. The htonl won't compile on a number of embedded targets, nor on MSVC, while leading to worse performance on Mips64.

endianness influence in C++ code

I know that this might be a silly question, but I am a newbie C++ developer and I need some clarifications about the endianness.
I have to implement a communication interface that relies on SCTP protocol in order to communicate between two different machines (one ARM based, and the other Intel based).
The aim is to:
encode messages into a stream of bytes to be sent on the socket (I used a vector of uint8_t, and positioned each byte of the different fields -taking care of splitting uint16/32/64 to single bytes- following big-endian convention)
send the bytestream via socket to the receiver (using SCTP)
retrieve the stream and parse it in order to fill the message object with the correct elements (represented by header + TV information elements)
I am confused about where I could have problems with the endianness of the underlying architectures of the two machines on which the interface will be used.
I think that splitting objects into single bytes and positioning them in big-endian order prevents the stream from being represented differently on arrival, right? Or am I missing something?
Also, I am in doubt about the role of C++ representation of multiple-byte variables, for example:
uint16_t var=0x0123;
//low byte 0x23
uint8_t low = (uint8_t)var;
//hi byte 0x01
uint8_t hi = (uint8_t)(var >> 8);
This piece of code is endianness dependent or not? i.e. if I work on a big-endian machine I suppose that the above code is ok, but if it is little-endian, will I pick up the bytes in different order?
I've searched already for such questions but no one gave me a clear reply, so I have still doubts on this.
Thank you all in advance guys, have a nice day!
This piece of code is endianness dependent or not?
No, the code doesn't depend on the endianness of the target machine. Bitwise operations work the same way as e.g. mathematical operators do.
They are independent of the internal representation of the numbers.
Though if you're exchanging data over the wire, you need to have a defined byte order known at both sides. Usually that's network byte ordering (i.e. big endian).
The functions of the htonx()/ntohx() family will help you encode and decode the (multibyte) numbers correctly and transparently.
The code you presented is endian-independent, and likely the correct approach for your use case.
What won't work, and is not portable, is code that depends on the memory layout of objects:
// Don't do this!
uint16_t var=0x0123;
auto p = reinterpret_cast<char*>(&var);
uint8_t hi = p[0]; // 0x01 or 0x23 (probably!)
uint8_t lo = p[1]; // 0x23 or 0x01 (probably!)
(I've written probably in the comments to show that these are the likely real-world values, rather than anything specified by Standard C++)
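The difference between the two approaches can be checked with a small sketch like the one below. The helper names are hypothetical. std::memcpy is used to look at the first byte in memory, and only that result depends on the host.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Value-based extraction: same result on every host.
std::uint8_t high_byte(std::uint16_t v) { return static_cast<std::uint8_t>(v >> 8); }
std::uint8_t low_byte(std::uint16_t v)  { return static_cast<std::uint8_t>(v); }

// Memory-based inspection: the answer depends on the host's byte order.
unsigned char first_byte_in_memory(std::uint16_t v)
{
    unsigned char first = 0;
    std::memcpy(&first, &v, 1);
    return first;
}

bool host_is_little_endian()
{
    return first_byte_in_memory(0x0123) == 0x23;
}

void demo_byte_extraction()
{
    const std::uint16_t var = 0x0123;
    assert(high_byte(var) == 0x01); // holds on big- and little-endian hosts
    assert(low_byte(var) == 0x23);
    // Only the memory view changes with the host:
    assert(first_byte_in_memory(var) == (host_is_little_endian() ? 0x23 : 0x01));
}
```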

C++ Binary Writing/Reading on 32bit to/from 64bit

Suppose you have a binary output stream and write integers to a file on a 32-bit Windows computer. Would you then be able to read the same integers from that same file on a 64-bit Windows computer?
My guess would be no, since an integer on a 32-bit computer is 4 bytes, whereas an integer on a 64-bit computer is 8 bytes.
So does the following code work, given that the files must be readable and writable on both 64-bit and 32-bit computers, regardless of OS, computer architecture and data type? If not, how would one do that, given that the files have to be in binary form?
Writing
std::ofstream ofs("example.bin", std::ios::binary);
int i = 128;
ofs.write((char*) (&i), sizeof(i));
ofs.close();
Reading
std::ifstream ifs("example.bin", std::ios::binary);
int i = 0;
ifs.read((char*) (&i), sizeof(i));
ifs.close();
While int is 4 bytes on almost all modern platforms (32bit and 64bit), there is no guarantee for its size. So for serializing data into a file or other binary streams, you should prefer fixed width integer types from the header <cstdint> which were introduced in C++11 (some compilers support it in C++03):
#include <cstdint>
...
int32_t i = 128;
ofs.write((char*)(&i), sizeof(i));
...
Another option is to enforce a certain type to have a certain size, e.g. int to have size 4. To make sure your program won't compile if this was not true, use static_assert:
...
int i = 128;
static_assert(sizeof(i) == 4, "Field i has to have size 4.");
ofs.write((char*)(&i), sizeof(i));
...
While this sounds stupid considering we have fixed width integers as above, this might be useful if you want to store a whole struct of which you made assumptions in a certain version of some library. Example: vec4 from glm is documented to contain four floats, so when serializing this struct, it's good to check this statically in order to catch future library changes (unlikely but possible).
Another very important thing to consider, however, is the endianness of integral types, which varies among platforms. Modern x86 desktop platforms use little endian for integral types, so I'd prefer this for your binary file format; but if the platform uses big endian you need to convert it (reverse the byte order).
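Combining both points (a fixed width and a fixed byte order), writing could be sketched like this. write_i32_le is a hypothetical helper, and std::ostringstream stands in for the std::ofstream from the question.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Writes exactly 4 bytes, least significant byte first, on any host.
void write_i32_le(std::ostream& out, std::int32_t value)
{
    const auto u = static_cast<std::uint32_t>(value);
    const unsigned char bytes[4] = {
        static_cast<unsigned char>(u & 0xFF),
        static_cast<unsigned char>((u >> 8) & 0xFF),
        static_cast<unsigned char>((u >> 16) & 0xFF),
        static_cast<unsigned char>((u >> 24) & 0xFF),
    };
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
}

void demo_write_i32_le()
{
    std::ostringstream out; // stands in for std::ofstream opened with std::ios::binary
    write_i32_le(out, 128);
    assert(out.str() == std::string("\x80\x00\x00\x00", 4));
}
```

The reader then reassembles the value from the four bytes in the same fixed order, so no explicit byte reversal is needed on big-endian hosts.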
There's no guarantee for the size of an int in C++. All you know is that it will be at least as big as a short int and no larger than a long int. The compiler is free to choose an appropriate size within these constraints. While most will choose 32-bits as the size of an int, some won't.
If you know your type is always 32 bits then you can use the int32_t type. Use
#include <stdint.h>
to get this type.

Endianness swap without ntohs

I am writing an ELF analyzer, but I'm having some trouble converting endianness properly. I have functions to determine the endianness of the analyzer and the endianness of the object file.
Basically, there are four possible scenarios:
A big endian compiled analyzer run on a big endian object file
nothing needs converted
A big endian compiled analyzer run on a little endian object file
the byte order needs swapped, but ntohs/l() and htons/l() are both null macros on a big endian machine, so they won't swap the byte order. This is the problem
A little endian compiled analyzer run on a big endian object file
the byte order needs swapped, so use htons() to swap the byte order
A little endian compiled analyzer run on a little endian object file.
nothing needs converted
Is there a function I can use to explicitly swap byte order/change endianness, since ntohs/l() and htons/l() take the host's endianness into account and sometimes don't convert? Or do I need to find/write my own swap byte order function?
I think it's worth raising The Byte Order Fallacy article here, by Rob Pike (one of Go's authors).
If you do things right -- i.e. you do not assume anything about your platform's byte order -- then it will just work. All you need to care about is whether ELF format files are in Little Endian or Big Endian mode.
From the article:
Let's say your data stream has a little-endian-encoded 32-bit integer. Here's how to extract it (assuming unsigned bytes):
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
If it's big-endian, here's how to extract it:
i = (data[3]<<0) | (data[2]<<8) | (data[1]<<16) | (data[0]<<24);
And just let the compiler worry about optimizing the heck out of it.
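Applied to the ELF case, the extraction could be sketched as below. The byte order comes from the file itself: ELF records it in e_ident[EI_DATA], where 1 is ELFDATA2LSB and 2 is ELFDATA2MSB. ByteOrder, order_from_ei_data, read_u16 and read_u32 are hypothetical names. The bytes are widened to unsigned types before shifting, which the quoted one-liners leave implicit.

```cpp
#include <cassert>
#include <cstdint>

enum class ByteOrder { Little, Big };

// Maps the ELF e_ident[EI_DATA] byte to a byte order (1 = LSB, 2 = MSB).
ByteOrder order_from_ei_data(unsigned char ei_data)
{
    assert(ei_data == 1 || ei_data == 2);
    return ei_data == 2 ? ByteOrder::Big : ByteOrder::Little;
}

std::uint16_t read_u16(const unsigned char* data, ByteOrder order)
{
    const unsigned lo = (order == ByteOrder::Little) ? data[0] : data[1];
    const unsigned hi = (order == ByteOrder::Little) ? data[1] : data[0];
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

std::uint32_t read_u32(const unsigned char* data, ByteOrder order)
{
    // Widen first so that shifting by 24 cannot overflow a signed int.
    const std::uint32_t b0 = data[0], b1 = data[1], b2 = data[2], b3 = data[3];
    return (order == ByteOrder::Little)
        ? (b0 | (b1 << 8) | (b2 << 16) | (b3 << 24))
        : (b3 | (b2 << 8) | (b1 << 16) | (b0 << 24));
}

void demo_read_by_file_order()
{
    const unsigned char bytes[4] = {0x01, 0x02, 0x03, 0x04};
    assert(read_u32(bytes, order_from_ei_data(1)) == 0x04030201u);
    assert(read_u32(bytes, order_from_ei_data(2)) == 0x01020304u);
}
```

The host's endianness never appears, so all four scenarios from the question reduce to two.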
On Linux there are several conversion functions in <endian.h>, which allow converting between host byte order and an explicitly named byte order:
uint16_t htobe16(uint16_t host_16bits);
uint16_t htole16(uint16_t host_16bits);
uint16_t be16toh(uint16_t big_endian_16bits);
uint16_t le16toh(uint16_t little_endian_16bits);
uint32_t htobe32(uint32_t host_32bits);
uint32_t htole32(uint32_t host_32bits);
uint32_t be32toh(uint32_t big_endian_32bits);
uint32_t le32toh(uint32_t little_endian_32bits);
uint64_t htobe64(uint64_t host_64bits);
uint64_t htole64(uint64_t host_64bits);
uint64_t be64toh(uint64_t big_endian_64bits);
uint64_t le64toh(uint64_t little_endian_64bits);
Edited, less reliable solution. You can use a union to access the bytes in any order. It's quite convenient, though reading a union member other than the one last written is undefined behavior in C++:
union {
short number;
char bytes[sizeof(number)];
};
Do I need to find/write my own swap byte order function?
Yes you do. But, to make it easy, I refer you to this question: How do I convert between big-endian and little-endian values in C++? which gives a list of compiler specific byte order swap functions, as well as some implementations of byte order swap functions.
The ntoh functions can swap between more than just big and little endian. Some systems are also 'middle endian' where the bytes are scrambled up rather than just ordered one way or another.
Anyway, if all you care about are big and little endian, then all you need to know is if the host's and the object file's endianness differ. You'll have your own function which unconditionally swaps byte order and you'll call it or not based on whether or not host_endianness()==objectfile_endianness().
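Such an unconditional swap, plus the conditional call around it, could look like this minimal sketch. swap16, swap32 and to_host32 are hypothetical names, and the two endianness flags are assumed to come from the detection functions the question already has.

```cpp
#include <cassert>
#include <cstdint>

// Always reverses the byte order, regardless of the host.
std::uint16_t swap16(std::uint16_t v)
{
    return static_cast<std::uint16_t>((v >> 8) | (v << 8));
}

std::uint32_t swap32(std::uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

// `raw` was copied verbatim from the file; swap only if the orders differ.
std::uint32_t to_host32(std::uint32_t raw, bool file_is_big_endian, bool host_is_big_endian)
{
    return (file_is_big_endian == host_is_big_endian) ? raw : swap32(raw);
}

void demo_swap()
{
    assert(swap16(0x1122) == 0x2211);
    assert(swap32(0x11223344u) == 0x44332211u);
    assert(to_host32(0x11223344u, true, true) == 0x11223344u);   // same order: untouched
    assert(to_host32(0x11223344u, true, false) == 0x44332211u);  // different order: swapped
}
```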
If I were to think about a cross-platform solution that would work on Windows or Linux, I would write something like:
#include <algorithm>
// dataSize is the number of bytes to convert.
char le[dataSize];// little-endian
char be[dataSize];// big-endian
// Fill contents in le here...
std::reverse_copy(le, le + dataSize, be);

Any way to read big endian data with little endian program?

An external group provides me with a file written on a Big Endian machine, and they also provide a C++ parser for the file format.
I can only run the parser on a little endian machine - is there any way to read the file using their parser without adding a swapbytes() call after each read?
Back in the early Iron Age, the Ancients encountered this issue when they tried to network primitive PDP-11 minicomputers with other primitive computers. The PDP-11 was the first little-Endian computer, while most others at the time were big-Endian.
To solve the problem, once and for all, they developed the network byte order concept (always big-Endian), and the corresponding network byte order macros ntohs(), ntohl(), htons(), and htonl(). Code written with those macros will always "get the right answer".
Lean on your external supplier to use the macros in their code, and the file they supply you will always be big-Endian, even if they switch to a little-Endian machine. Rewrite the parser they gave you to use the macros, and you will always be able to read their file, even if you switch to a big-Endian machine.
A truly prodigious amount of programmer time has been wasted on this particular problem. There are days when I think a good argument could be made for hanging the PDP-11 designer who made the little-Endian feature decision.
Try persuading the parser team to include the following code:
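#include <cassert>
#include <cstdint>
int getInt(const char* bytes, int num)
{
assert(num == 4);
// Go through unsigned char so that bytes >= 0x80 are not sign-extended
const unsigned char* b = reinterpret_cast<const unsigned char*>(bytes);
uint32_t ret = static_cast<uint32_t>(b[0]) << 24;
ret |= static_cast<uint32_t>(b[1]) << 16;
ret |= static_cast<uint32_t>(b[2]) << 8;
ret |= static_cast<uint32_t>(b[3]);
return static_cast<int>(ret);
}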
It might be more time consuming than a general int i = *(reinterpret_cast<int*>(&myCharArray)); but it will always get the endianness right on both big and little endian systems.
In general, there's no "easy" solution to this. You will have to modify the parser to swap the bytes of each and every integer read from the file.
It depends upon what you are doing with the data. If you are going to print the data out, you need to swap the bytes on all the numbers. If you are looking through the file for one or more values, it may be faster to byte swap your comparison value.
In general, Greg is correct, you'll have to do it the hard way.
The best approach is to just define the endianness in the file format, and not say it's machine dependent.
The writer will have to write the bytes in the correct order regardless of the CPU it's running on, and the reader will have to do the same.
You could write a parser that wraps their parser and reverses the bytes, if you don't want to modify their parser.
Be conscious of the types of data being read in. A 4-byte int or float would need endian correction. A 4-byte ASCII string would not.
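To make the distinction concrete, here is a minimal sketch of type-aware reads from a big-endian buffer. be_u32, be_f32 and ascii_tag are hypothetical names. The float case assumes a 4-byte IEEE 754 float whose byte order matches that of integers on the host, which holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// 4-byte big-endian integer: byte order matters.
std::uint32_t be_u32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}

// 4-byte big-endian float: decode the bit pattern, then reinterpret it.
float be_f32(const unsigned char* p)
{
    static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");
    const std::uint32_t bits = be_u32(p);
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// 4-byte ASCII tag: a sequence of single bytes, so no correction at all.
std::string ascii_tag(const unsigned char* p)
{
    return std::string(reinterpret_cast<const char*>(p), 4);
}

void demo_type_aware_reads()
{
    const unsigned char one[4] = {0x3F, 0x80, 0x00, 0x00}; // 1.0f in IEEE 754
    const unsigned char tag[4] = {'R', 'I', 'F', 'F'};
    assert(be_f32(one) == 1.0f);
    assert(ascii_tag(tag) == "RIFF");
}
```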
In general, no.
If the read/write calls are not type aware (which, for example fread and fwrite are not) then they can't tell the difference between writing endian sensitive data and endian insensitive data.
Depending on how the parser is structured, you may be able to avoid some suffering: if the I/O functions they use are aware of the types being read/written, then you could modify those routines to apply the correct endian conversions.
If you do have to modify all the read/write calls then creating just such a routine would be a sensible course of action.
Your question somehow contains the answer: No!
I can only run the parser on a little endian machine - is there any way to read the file using their parser without adding a swapbytes() call after each read?
If you read (and want to interpret) big endian data on a little endian machine, you must somehow and somewhere convert the data. You might do this after each read or after the whole file has been read (if the data read does not contain any information on how to read further data) - but there is no way to omit the conversion.