Detect endianness of binary file data

Detect endianness of binary file data - c++

Recently I was (again) reading about 'endian'ness. I know how to identify the endianness of host, as there are lots of post on SO, and also I have seen this, which I think is pretty good resource.
However, one thing I like to know is to how to detect the endianness of input binary file. For example, I am reading a binary file (using C++) like following:
ifstream mydata("mydata.raw", ios::binary);
short value;
char buf[sizeof(short)];
int dataCount = 0;
short myDataMat[DATA_DIMENSION][DATA_DIMENSION];
while (mydata.read(reinterpret_cast<char*>(&buf), sizeof(buf)))
{
memcpy(&value, buf, sizeof(value));
myDataMat[dataCount / DATA_DIMENSION][dataCount%DATA_DIMENSION] = value;
dataCount++;
}
I like to know how I can detect the endianness in the mydata.raw, and whether endianness affects this program anyway.
Additional Information:
I am only manipulating the data in myDataMat using mathematical operations, and no pointer operation or bitwise operation is done on the data).
My machine (host) is little endian.

It is impossible to "detect" the endianity of data in general. Just like it is impossible to detect whether the data is an array of 4 byte integers, or twice that many 2 byte integers. Without any knowledge about the representation, raw data is just a mass of meaningless bits.
However, with some extra knowledge about the data representation, it become possible. Some examples:
Most file formats mandate particular endianity, in which case this is never a problem.
Unicode text files may optionally start with a byte order mark. Same idea can be implemented by other data representations.
Some file formats contain a checksum. You can guess one endianity, and if the checksum does not match, try again with another endianity. It will be unlikely that the checksum matches with wrong interpretation of the data.
Sometimes you can make guesses based on the data. Is the temperature outside 33'554'432 degrees, or maybe 2? You can pick the endianity that represents sane data. Of course, this type of guesswork fails miserably, when the aliens invade and start melting our planet.

You can't tell.
The endianness transformation is essentially an operator E(x) on a number x such that x = E(E(x)). So you don't know "which way round" the x elements are in your file.

Related

How to convert Text to Binary (and Reverse) in C++?

Okay, this will be a very beginner question, Though I can´t seem to find a good resource on this topic.
What I want is simple. take a string (or char*) and convert it to a binary file that I can store somewhere on my system.
Then, at a later date, I want to be able to read that binary file and convert it back to a string (or char*).
Now...
Whenever I search for this I often get to the concept of Serialisation, which is basically what I want.
There´s a problem though, most often "Boost-Serialisation" is recommended. Which (IMO) is quite heavy for just converting simple text to binary and converting simple binary to text. (ok, I know it isn´t THAT easy, but you get the idea)
There has got to be an easier way to handle this. I hope you can help me find it. :D
Thank you very much in advance for your answers.

How to convert Text to Binary (and Reverse)
There's nothing to do. Text is already data, and the in-memory presentation of all data in any modern computer is always binary.
You need to know what you mean. If you just mean "write it to a file" (in any representation), then just do that:
std::string my_text;
std::ofstream ofs("myfile.bin", std::ios::binary);
ofs.write(my_text.data(), my_text.size());
If you need some specific representation (different character sets, encodings or even (archive) file formats) you might need to do that conversion.
Oh, lest I forget, to read-back:
std::ifstream ifs("myfile.bin", std::ios::binary);
std::string my_text(std::istreambuf_iterator<char>(ifs), {});

You just have to use bit manipulation tricks. C and C++ both have operators that allow you to run integer values through logic gates. So for example:
x = 3 & 1 ;
Will set x to 1 because when you do an AND operation you're taking every bit from the left side of the & and the corresponding bit from the right side of the & and putting them through an And operation.
You can also do bit shifting. Where you shift the bits over by some number. For example:
y = 1 << 2;
Will shift all the bits in the integer 1 over by two, and the new rightmost bits will be set to zero.
So the way to do this is: for every byte in the string, do an AND operation with 128 and then if the value from that is zero then that means the left-most bit is zero (and you print "0"), if not then the value is one (so you print "1"). Then shift it to the left by one and do the operation again. Do that eight times and you've converted one byte to binary.

You could use ofstream/ifstream. They work similar to cout/cin except it reads and writes files instead of the console. Maybe this link is helpful: https://www.cplusplus.com/doc/tutorial/files/

Is there a portable Binary-serialisation schema in FlatBuffers/Protobuf that supports arbitrary 24bit signed integer definitions?

We are sending data over UART Serial at a high data rate so data size is important. The most optimal format is Int24 for our data which may be simplified as a C bit-field struct (GCC compiler) under C/C++ to be perfectly optimal:
#pragma pack(push, 1)
struct Int24
{
int32_t value : 24;
};
#pragma pack(pop)
typedef std::array<Int24,32> ArrayOfInt24;
This data is packaged with other data and shared among devices and cloud infrastructures. Basically we need to have a binary serialization which is sent between devices of different architecture and programming languages. We would like to use a Schema based Binary serialisation such as ProtoBuffers or FlatBuffers to avoid the client codes needing to handle the respective bit-shifting and recovery of the twos-complement sign bit handling themselves. i.e. Reading the 24-bit value in a non-C language requires the following:
bool isSigned = (_b2 & (byte)0x80) != 0; // Sign extend negative quantities
int32_t value = _b0 | (_b1 << 8) | (_b2 << 16) | (isSigned ? 0xFF : 0x00) << 24;
If not already existing which (if any) existing Binary Serialisation library could be modified easily to extend support to this as we would be willing to add to any open-source project in this respect.

Depending on various things, you might like to look at ASN.1 and the unaligned Packed Encoding Rules (uPER). This is a binary serialisation that is widely used in telephony to easily minimise the number of transmitted bits. Tools are available for C, C++, C#, Java, Python (I think they cover uPER). A good starting point is Useful Old Technologies.
One of the reasons you might choose to use it is that uPER likely ends up doing better than anything else out there. Other benefits are contraints (on values and array sizes). You can express these in your schema, and the generated code will check data against them. This is something that can make a real difference to a project - automatic sanitisation of incoming data is a great way of resisting attacks - and is something that GPB doesn't do.
Reasons not to use it are that the very best tools are commercial, and quite pricey. Though there are some open source tools that are quite good but not necessarily implementing the entire ASN.1 standard (which is vast). It's also a learning curve, though (at a basic level) not so very different to Google Protocol Buffers. In fact, at the conference where Google announced GPB, someone asked "why not use ASN.1?". The Google bod hadn't heard of it; somewhat ironic, a search company not searching the web for binary serialisation technologies, went right ahead and invented their own...

Protocol Buffers use a dynamically sized integer encoding called varint, so you can just use uint32 or sint32, and the encoded value will be four bytes or less for all values and three bytes or less for any value < 2^21 (the actual size for an encoded integer is ⌈HB/7⌉ where HB is the highest bit set in the value).
Make sure not to use int32 as that uses a very inefficient fixed size encoding (10 bytes!) for negative values. For repeated values, just mark them as repeated, so multiple values will be sent efficiently packed.
syntax = "proto3";
message Test {
repeated sint32 data = 1;
}

FlatBuffers doesn't support 24-bit ints. The only way to represent it would be something like:
struct Int24 { a:ubyte; b:ubyte; c:ubyte; }
which obviously doesn't do the bit-shifting for you, but would still allow you to pack multiple Int24 together in a parent vector or struct efficiently. It would also save a byte when stored in a table, though there you'd probably be better off with just a 32-bit int, since the overhead is higher.

One particularly efficient use of protobuf's varint format is to use it as a sort of compression scheme, by writing the deltas between values.
In your case, if there is any correlation between consecutive values, you could have a repeated sint32 values field. Then as the first entry in the array, write the first value. For all further entries, write the difference from the previous value.
This way e.g. [100001, 100050, 100023, 95000] would get encoded as [100001, 49, -27, -5023]. As a packed varint array, the deltas would take 3, 1, 1 and 2 bytes, total of 7 bytes. Compared with a fixed 24-bit encoding taking 12 bytes or non-delta varint taking also 12 bytes.
Of course this also needs a bit of code on the receiving side to process. But adding up the previous value is easy enough to implement in any language.

Fastest way to read a vector<double> from file

I have 3 vector, each with exactly 256^3 ~ 16 million elements that i want to store in a file and read as fast as possible. I only care about reading performance, and the representation of the data in memory can be any.
I have taken a look at some serialization techniques as well as writing/ reading plain numbers to/ from a file with ofstream, however i wonder if there is a more direct and faster approach.
(i am pretty new to c++ and its concepts)

Assuming both systems, windows and android, are little endian, which is common in ARM and x86/x64 CPUs, you can do the following.
First: Determine the type with a sepcific size, so either double, with 64-bit, float with 32-bit, or uint64/32/16 or int64/32/16. Do NOT use stuff like int or long to determine your data type.
Second: Use the following method to write binary data:
std::vector<uint64_t> myVec;
std::ofstream f("outputFile.bin", std::ios::binary);
f.write(reinterpret_cast<char*>(myVec.data()), myVec.size()*sizeof(uint64_t));
f.close();
In this, you're take the raw data and writing its binary format to a file.
Now on other machine, make sure the data type you use has the same datatype size and same endianness. If both are the same, you can do this:
std::vector<uint64_t> myVec(sizeOfTheData);
std::ifstream f("outputFile.bin", std::ios::binary);
f.read(reinterpret_cast<char*>(&myVec.front()), myVec.size()*sizeof(uint64_t));
f.close();
Notice that you have to know the size of the data before reading it.
Note: This code is off my head. I haven't tested it, but it should work.
Now if the target system doesn't have the same endianness, you have to read the data in batches, flip the endianness, then put it in your vector. How to flip endianness was extensively discussed here.
To determine the endianness of your system, this was discussed here.
The penalty on performance will be proportional to how different these systems are. If they're both the same endianness and you choose the same data type and size, you're good and you have optimum performance. Otherwise, you'll have some penalty depending on how many conversion you have to do. This is the fastest that you can ever get.
Note from comments: If you're transferring doubles or floats, make sure both systems use IEEE 754 standard. It's very common to use these, way more than endianness, but just to be sure.
Now if these solutions don't fit you, then you have to use a proper serialization library to standardize the format for you. There are libraries that can do that, such as protobuf.

Write a C++ struct to a file and read file using another programming language?

I have a challenging situation; we will have programs on Mac, PC, iOS and Android receiving files in a legacy format and parsing data from those files. We cannot change how those files are created.
The files are produced by a C++ program filling a struct with numbers and Strings and then writing it out. Here's a sanitized version.
struct MyObject {
String Kfkj(MAXKYS);
String Oern(MAXKYS);
String Vdflj(MAXKYS, 9);
int Muic;
int Tdfkj;
int VdfkAsdk;
int SsdjsdDsldsk;
int Ndsoief;
String TdflsajPdlj;
String TdckjdfPas;
String AdsfakjIdd;
int IdkfjdKasdkj;
int AsadkjaKadkja(MAXKYS);
int Kasldsdkj;
bool Usadl;
String PsadkjOasdj(9);
String PasdkjOsdkj;
};
Primitives and Strings, as you can see.
Then here is how they write it out to a file:
MyInstance MyObject;
FileName = "C:\MyFile.ab2"
ofstream fout (FileName, ios::binary);
fout.write((char*)& MyInstance, sizeof(MyInstance));
There is no option for us to translate it once and then distribute the file to other platforms; we must translate it on each and every different platform, and this is what we have to work with. I'd appreciate any information on how C++ serializes data, so we know how to parse the file.
EDIT: solution
The feedback I received from multiple answers here was VERY helpful. Using that, I did extensive analysis with hex editors and discovered:
the elements come in the file one after another
a "String," in this case, starts with an int describing how many characters follow the int for that String. If the String does not exist, it will still have that int with a value of 0.
integers, for the files and machines I saw, are two bytes, little-endian, and MOSTLY unsigned (there were a few that were signed, just to keep me on my toes)
the boolean was two bytes, with apparently -1 (FF FF) representing "true"
So far we have not ran into issues with different padding or endianness on different devices, but those are very real concerns. The skilled notes and warnings in these answers provides us with more ammunition to try to convince the client to change to a less fragile alternative, such as XML or JSON, for transferring data online across platforms.
As for those of you asking if the developer was fired... well, let's just say their code is very old, but after multiple conversations we're still having trouble convincing them writing out the C++ struct and trying to read that on different platforms is not a good idea.

You're going to run into many problems.
C++ doesn't have a specific format for serializing data per se. It is highly dependent on the computer architecture/processor that you are running on.
The compiler is allowed to add padding to help alignment on systems. When we say alignment we basically are referring to an architecture/processor's affinity for having data lie on specific byte boundaries. For example, some processors vastly prefer floating point numbers to lie at 4 or 8 byte boundaries - if they don't the processor may work much slower or may not work at all.
So, you can't simply know what padding your system is adding magically.
What you can do is use #pragma pack(1) / #pragma pack(0) to stop your compiler from padding your numbers.
PS: you also have to worry about endianness. What if one computer is running on big-endian and one is little endian? They will interpret bytes differently without a conversion.
Simply put, you either have to fix the application generating the files so it uses a proper serialization scheme OR you need to look at it running on a SPECIFIC computer, look at exactly how it writes the files, and write a translator for every target platform (which is just silly).
Interesting Suggestion
If you're really stuck, write an app that monitors the folder where you write files. Have the app pick up the files (since it's on the same PC it'll be able to read their format without issue). Have it write the files back in XML or some other true serialization format and distribute those instead.

Whoa - that's crazy. So String objects don't contain any pointers? Must not- because you claim this is working code.
Anyway, that code isn't doing any serialization. Its just writing the structure out to file exactly the way it is laid out in memory. The only issue you have is that on some platforms padding and sizeof integral types like int may be different.
You'll have to find the size of the integral types, and use that information in reader/writer for newer platforms to make sure they get laid out the same way on the legacy platform.
You're running a real risk with that code though. As it is, a compiler change could suddenly cause the file layout to change.

The format of your data file is entirely down to the compiler that your C++ program is compiled with, and the definition of your String class. You can rely on the fields being in the order they're declared in, and in this case, I think you can rely on there not being any padding at the start, but that's about all. Some tips that might help you out in this case:-
You don't give the definition of the String class you're using. If it's a typedef for std::string, you're completely screwed, because the contents of the string aren't in the memory. I assume your C++ programmers are using some special local buffer, in which case I'll guess you will find the first bytes of the object are the string, and there is some amount of useless padding afterwards. I hope the struct contains an int at the start telling you how much data in it is useful.
You'll probably find the int fields are four bytes long.
You'll probably find the bool field is one byte long, followed by three bytes of useless padding. Only one bit, most likely the bottom bit, will be set.
That's about all the useful guesswork I can offer you. In your target language, make sure to read the whole file in as the closest thing to a byte array available in the language, and only after that, use the language features to convert it into the right kind of thing in your language. Don't try reading it in as integers, as that won't let you byte-swap if you're on a platform with different endianness to the C++ program. I suggest also looking through the file in a text editor to reverse-engineer it and help you find the offset of each field.
Last piece of advice: consider printing P45s (or pink slips, or whatever you have in your country) for whichever programmers or project managers thought this kind of 'serialization' was a good idea. This kind of sloppy work might have been acceptable in a life-or-death situation, but they have seriously screwed you over in a way you're going to find it very hard to recover from. Writing the code to read in these files will not be that hard, if it's only one struct like this, but keeping it reliable will be a world of pain, and they've effectively made it impossible for themselves to change compilers or compiler version safely.

The way it's done, the struct is written in raw form to a file. So basically what you need to know to parse this file is the binary layout of your struct.
Basically, the fields are just one after the other, so to read an int, you just read 4 bytes and cast that to an int, etc.
Strings are a particular case. It's not clear from your code whether this "String" type is an inline array of characters, or a pointer to such an array. In the first case, you need to know how many characters each string contains and simply read that number of characters sequentially. In the second case, you won't be able to get the string back, since it won't have been written to file. The pointer will be useless to you.
One last concern is whether the struct is packed or not. Since you gave no indication to that, by default struct fields are aligned to 4-bytes boundaries, so there may be space for instance after the boolean field that you need to account for. If the struct is packed, then each field comes directly after the previous.
So, to make a long story short, figure out your struct binary layout using its definition and, if all else fails, inspecting the memory at run-time with the debugger, or use a hex editor to study the output file. Then write that specification down somewhere and this will give you what you need to read from the file. It's impossible to tell exactly what that layout is simply by looking at the pseudo-definition you gave.

Writing in an ofstream does not serialize data. This code write the raw memory content of the struct as it was a string of char. Depending of your compiler, its version, its options and the system it is running on the content will be completely different.
Even the number of bits of a char is allowed to change between c++ implementation.
Data referenced by the object of the struct won't be written (forget the content of std::string).
If you cannot change the writer code. You must know the alignment policy, the size of base type and the data representation. You will have to analyze files produced by hand, for example with an hexadecimal editor like this one
http://www.physics.ohio-state.edu/~prewett/hexedit/
, and probably look at your compiler documentation.
If you can change the writer code. Use proper serialization like json, protocol buffer or simply xml.

No one has pointed out something that sticks out to me as particularly problematic (maybe because I've been bit by it). That problem: the data member bool Usadl;. sizeof(bool) varies across platforms, across compilers, and even across releases of the same compiler. Common values for sizeof(bool) are 4 and 1. This will bite you. It's getting hard to find a big endian machine nowadays, very, very hard to find a computer where CHAR_BIT is not 8 or sizeof(int) is not 4. This is not the case for sizeof(bool).
In agreement with everyone else, Chad's team needs to document the structure of the records in the file, and then make sure the program that produces the file writes this structure explicitly, including element sizes, padding, and endianness. Don't depend on class layout to do this for you. That's just asking for trouble.

The best way would probably be to use JSON or if you want a more robust solution go with something like Avro. Avro has a C++ API and a Java API, so it covers most of the cases you're encountering.

Any way to read big endian data with little endian program?

An external group provides me with a file written on a Big Endian machine, and they also provide a C++ parser for the file format.
I only can run the parser on a little endian machine - is there any way to read the file using their parser without add a swapbytes() call after each read?

Back in the early Iron Age, the Ancients encountered this issue when they tried to network primitive PDP-11 minicomputers with other primitive computers. The PDP-11 was the first little-Endian computer, while most others at the time were big-Endian.
To solve the problem, once and for all, they developed the network byte order concept (always big-Endia), and the corresponding network byte order macros ntohs(), ntohl(), htons(), and htonl(). Code written with those macros will always "get the right answer".
Lean on your external supplier to use the macros in their code, and the file they supply you will always be big-Endian, even if they switch to a little-Endian machine. Rewrite the parser they gave you to use the macros, and you will always be able to read their file, even if you switch to a big-Endian machine.
A truly prodigious amount of programmer time has been wasted on this particular problem. There are days when I think a good argument could be made for hanging the PDP-11 designer who made the little-Endian feature decision.

Try persuading the parser team to include the following code:
int getInt(char* bytes, int num)
{
int ret;
assert(num == 4);
ret = bytes[0] << 24;
ret |= bytes[1] << 16;
ret |= bytes[2] << 8;
ret |= bytes[3];
return ret;
}
it might be more time consuming than a general int i = *(reinterpret_cast<*int>(&myCharArray)); but will always get the endianness right on both big and small endian systems.

In general, there's no "easy" solution to this. You will have to modify the parser to swap the bytes of each and every integer read from the file.

It depends upon what you are doing with the data. If you are going to print the data out, you need to swap the bytes on all the numbers. If you are looking through the file for one or more values, it may be faster to byte swap your comparison value.
In general, Greg is correct, you'll have to do it the hard way.

the best approach is to just define the endianess in the file format, and not say it's machine dependent.
the writer will have to write the bytes in the correct order regardless of the CPU it's running on, and the reader will have to do the same.

You could write a parser that wraps their parser and reverses the bytes, if you don't want to modify their parser.
Be conscious of the types of data being read in. A 4-byte int or float would need endian correction. A 4-byte ASCII string would not.

In general, no.
If the read/write calls are not type aware (which, for example fread and fwrite are not) then they can't tell the difference between writing endian sensitive data and endian insensitive data.
Depending on how the parser is structured you may be able to avoid some suffering, if the I/O functions they use are aware of the types being read/written then you could modify those routines apply the correct endian conversions.
If you do have to modify all the read/write calls then creating just such a routine would be a sensible course of action.

Your question somehow conatins the answer: No!
I only can run the parser on a little endian machine - is there any way to read the file using their parser without add a swapbytes() call after each read?
If you read (and want to interpret) big endian data on a little endian machine, you must somehow and somewhere convert the data. You might do this after each read or after the whole file has been read (if the data read does not contain any information on how to read further data) - but there is no way in omitting the conversion.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js