Why is floating point byte swapping different from integer byte swapping? - c++

I have a binary file of doubles that I need to load using C++. However, my problem is that it was written in big-endian format but the fstream >> operator will then read the number wrong because my machine is little-endian. It seems like a simple problem to resolve for integers, but for doubles and floats the solutions I have found won't work. How can I (or should I) fix this?
I read this as a reference for integer byte swapping:
How do I convert between big-endian and little-endian values in C++?
EDIT: Though these answers are enlightening, I have found that my problem is with the file itself and not the format of the binary data. I believe my byte swapping does work, I was just getting confusing results. Thanks for your help!

The most portable way is to serialize in textual format so that you don't have byte order issues. This is how operator>> works, so you shouldn't be having any endian issues with >>. The principal problem with binary formats (which would explain endian problems) is that floating point numbers consist of a number of mantissa bits, a number of exponent bits and a sign bit. The exponent may use an offset. This means that a straight byte re-ordering may not be sufficient, depending on the source and target format.
If you are using an IEEE-754 format on both machines then you may be OK with a straight byte reversal, as this standard specifies a bit-string interchange format that should be portable (byte order issues aside).
If you have to convert between two machine architectures and you have to use a raw byte memory dump, then so long as the basic number format is the same (i.e. they have the same bit counts in each part of the number), you can read the data into an array of unsigned char, use some basic byte and bit swapping routines to correct the storage format and then copy the raw bytes into a variable of the appropriate type.
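For instance, a minimal sketch of that approach for a big-endian IEEE-754 double being read on a little-endian machine (the function name is made up for illustration):

#include <cstring> // std::memcpy

double read_big_endian_double(const unsigned char bytes[8])
{
    unsigned char swapped[8];
    for (int i = 0; i < 8; ++i)
        swapped[i] = bytes[7 - i];      // reverse the byte order
    double d;
    std::memcpy(&d, swapped, sizeof d); // copy the raw bytes into the double
    return d;
}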

The standard conversion operators do not work with binary data, so it's not clear how you got where you are.
However, since byte swapping operates on bytes, not numbers, you perform it on data destined to be floats just as you would on data destined to be integers.
And since text is so inefficient and floating-point data sets tend to be so large, it's entirely reasonable to want this.
int32_t raw_bytes;
stream.read( reinterpret_cast< char * >( & raw_bytes ), sizeof raw_bytes ); // not an int, just 32 bits of bytes
my_byte_swap( raw_bytes ); // swap 'em
float f;
std::memcpy( & f, & raw_bytes, sizeof f ); // copy the bits into a float (casting through float * would violate strict aliasing)

Parsing long strings of binary data with C++

I am looking for ideas on how to parse long binary data, for example: "10100011111000111001"
bits 0-4 are the id
bits 5-15 are the data
etc etc...
The binary data structure can change, so I need to build a kind of database that stores how to parse each string.
illustration (it could be ~200 bits): [image omitted]
Ideas how to implement it?
Thanks
Edit
What am I missing here?
struct Bitfield {
    uint16_t a : 10, b : 6;
};

void diag() {
    uint16_t t = 61455;          // 0xF00F
    struct Bitfield test = {t};
    cout << "a: " << test.a << endl;
    cout << "b: " << test.b << endl;
}
and the output is:
a: 15
b: 0
Options available
To manage a large structured set of bits, you have the following options:
C++ bit-fields: you define a structure with bitfield members. You can have as many members as you want, provided that each single one has no more bits than an unsigned long long.
It's super easy to use; the compiler manages the access to bits or groups of bits for you. The major inconvenience is that the bit layout is implementation-dependent. So this is not an option for writing portable code that exchanges data in a binary format.
Container of unsigned integral type: you define an array large enough to hold all the bits, and access bits or groups of bits using a combination of logical operations.
It requires you to be at ease with bitwise operations and is not practical if groups of bits are split across consecutive elements. For exchanging data in binary format with the outside world in a portable way, you'd need to either take care of differences between big- and little-endian architectures or use arrays of uint8_t.
std::vector<bool>: gives you total flexibility to manage your bits. The main constraint is that you need to address each bit separately. Moreover, there's no data() member that could give direct access to the binary data.
std::bitset: is very similar to vector<bool> for accessing bits. It has a fixed size at compile time, but offers useful features such as reading and writing binary in ASCII from strings or streams, converting from binary values of integral types, and logical operations on the full bitset.
A combination of these techniques
Make your choice
To communicate with the outside world in a portable way, the easiest approach is to use bitsets. Bitsets offer easy input/output/string conversion in a format using ASCII '0' or '1' (or any substitutes thereof):
bitset<msg_header_size> bh, bh2;
bitset<msg_body_size> bb, bb2;
cin >> bh >> bb;                         // reads a string of ASCII 0 and 1
cout << bh << "-" << bb << endl << endl; // writes a string of ASCII 0 and 1
You can also convert from/to binary data (but only via a single integral value, large enough for the bitset size):
bitset<8> b(static_cast<uint8_t>(c));
cout << b << endl;
cout << b.to_ulong() << endl;
For reading/writing large sets, you'd need to read small bitsets and use logical operators to aggregate them into a larger bitset. If this seems time-consuming, it's in fact very close to what you'd do with containers of integral types, but without having to care about byte boundaries.
In your case, with a fixed-size header and a maximum size, the bitset seems to be a good choice for exchanging binary data with the external world (be careful, however, because the variable part is right-justified).
For working with the data content, it's easy to access a specific bit, but you have to use some logical operations (shift, and) to access groups of bits. Moreover, if you want readable and maintainable code, it's better to abstract the bit layout.
Conclusion:
I would therefore strongly advise using a bit-field structure internally for working with the data, keeping a memory footprint comparable to the original data, and at the same time using bitsets just to convert from/to this structure for the purpose of external data exchanges.
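A minimal sketch of that combination, assuming a hypothetical 16-bit message with a 5-bit id and an 11-bit data field (bit 0 being the least significant):

#include <bitset>
#include <cstdint>

struct Message {          // internal working representation
    uint16_t id   : 5;
    uint16_t data : 11;
};

Message from_bits(const std::bitset<16>& b)
{
    unsigned long raw = b.to_ulong();  // the whole message as an integer
    Message m;
    m.id   = raw & 0x1Fu;              // mask out bits 0-4
    m.data = (raw >> 5) & 0x7FFu;      // shift and mask bits 5-15
    return m;
}

std::bitset<16> to_bits(const Message& m)
{
    return std::bitset<16>((static_cast<unsigned long>(m.data) << 5) | m.id);
}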
The "best way" depends on the details of the problem.
If the whole number fits into the largest integer type available (usually long long), convert the string into an integer first (for example with the stoi/stol/stoll functions, passing base 2 for a binary string, assuming C++11 is available). Then use bit-shifting combined with bitwise AND (&) to extract the sections of the value you are interested in.
If the whole number does not fit into the largest integer type available, chop it up as a string (using the substr function) and then convert the substrings into integers one by one.
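A minimal sketch of the string-chopping approach, assuming the id/data layout from the question above (with the leftmost character as the most significant bit):

#include <cstdint>
#include <iostream>
#include <string>

int main()
{
    std::string s = "10100011111000111001";
    uint64_t id   = std::stoull(s.substr(0, 5),  nullptr, 2); // first 5 bits, base 2
    uint64_t data = std::stoull(s.substr(5, 11), nullptr, 2); // next 11 bits
    std::cout << "id: " << id << ", data: " << data << '\n';
    return 0;
}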

Can I be sure that 32 bytes of binary data read from a file is equal to 256 bits?

I want to read a file 32 bytes at a time using a C/C++ program, but I want to be sure that the data will be 256 bits. In essence I am worried about leading bits in the "bytes" that I read from the file being off. Is that even a matter of concern?
Example: if I have the number 2, represented in binary as 10, that would be sufficient for me as a human.
How is that different as far as a computer is concerned if it's written as 00000010 to represent a char value of 1 byte? Would the leading zeros affect the bit count? Does that in turn affect operations like XOR?
I have trouble understanding the effects! Does that involve data loss? I really do not know!
Any help to clear up my misunderstanding will be appreciated!
Every routine in the C standard library that reads from a file or stream reads in units of bytes. Each byte read is a fixed number of bits; what is read from a file does not vary due to leading zeros or lack thereof in a byte. Some routines return a single character (which is a byte). Some routines put data read into a buffer and return a count of bytes read. Some routines, such as scanf, return a count of the number of items successfully converted. (You generally would not use these routines to read a fixed number of bytes.)
The number of bits in a byte is set by the C implementation. It is reported in CHAR_BIT, defined in <limits.h>. It is eight in all common C implementations.
Although the number of bits per byte does not vary, the number of bytes read from a stream can vary. If a stream is opened as a text stream, characters may be “added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment” (C 2018 7.21.2 2). To avoid this, you should open a stream as a binary stream.
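For example, a minimal sketch of reading 32 bytes from a binary stream (the file name is hypothetical):

#include <cstdio>

int main()
{
    // "rb": open as a binary stream, so bytes are read exactly as stored,
    // with no newline or character-set translation.
    std::FILE* f = std::fopen("data.bin", "rb");
    if (!f)
        return 1;
    unsigned char buf[32];
    size_t got = std::fread(buf, 1, sizeof buf, f); // up to 32 bytes
    std::fclose(f);
    return got == sizeof buf ? 0 : 1;
}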
The CHAR_BIT macro (defined in <climits>) will tell you how many bits make up a byte in the execution environment. However, I am not aware of any recent general-purpose hardware that uses bytes other than 8 bits. Some specialized processors (such as for digital signal processing) may use other sizes. Also, completely outdated equipment used a wide variety of sizes (the typical alternative to 8 bits being 9).
No
C++ allows a char to be any size a platform requires. However, the macro CHAR_BIT always has the number of bits in a char.
So, to find out the number of bits in 32 bytes, you would use the formula 32*CHAR_BIT.
C++17 has introduced the new type std::byte that is not a character type and is always CHAR_BIT bits, as explained in the SO question std::byte on odd platforms
In order to find the number of bytes needed to hold 256 bits, you have a problem, because CHAR_BIT isn't always a divisor of 256. So, you have to decide what you want and use a more complicated formula. For example, (255 + CHAR_BIT) / CHAR_BIT (equivalently, 1 + 255 / CHAR_BIT) will give you the number of bytes needed to hold 256 contiguous bits.
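A quick sketch that prints the result for the current platform:

#include <climits>
#include <cstdio>

int main()
{
    // Bytes needed to hold 256 contiguous bits, rounding up.
    int bytes_for_256_bits = (255 + CHAR_BIT) / CHAR_BIT;
    std::printf("CHAR_BIT = %d, bytes for 256 bits = %d\n",
                CHAR_BIT, bytes_for_256_bits);
    return 0;
}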

Byte order for packed images

So from http://en.wikipedia.org/wiki/RGBA_color_space, I learned that the byte order for ARGB is, from lowest address to highest address, BGRA, on a little endian machine in certain interpretations.
How does this affect the naming convention of packed data, e.g. a uint8_t ar[]={R,G,B,R,G,B,R,G,B}?
Little endian by definition stores the bytes of a number least significant byte first, i.e. in reverse order compared to how we write them. This is not strictly necessary if you are treating them as byte arrays; however, any vaguely efficient code base will actually treat the 4 bytes as a 32-bit unsigned integer. This will speed up software blitting by a factor of almost 4.
Now the real question is why. This comes from the fact that, when treating a pixel as a 32-bit int as described above, coders want to be able to run arithmetic and shifts in a predictable way. This relies on the bytes being in reverse order.
In short, this is not actually odd, as on little-endian machines the last byte (highest address) is actually the most significant byte and the first the least significant. Thus a field like this will naturally be in reverse order so it is the correct way around when treated as a number (as a number it will appear ARGB, but as a byte array it will appear BGRA).
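A small sketch illustrating this (the sample values are arbitrary):

#include <cstdint>
#include <cstdio>

int main()
{
    // Pack A, R, G, B samples into one 32-bit ARGB value with shifts;
    // the arithmetic is endianness-independent, the memory layout is not.
    uint32_t argb = (0xAAu << 24) | (0x11u << 16) | (0x22u << 8) | 0x33u;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&argb);
    // On a little-endian machine this prints "33 22 11 aa" (B, G, R, A from
    // lowest address to highest); a big-endian machine prints "aa 11 22 33".
    std::printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
    return 0;
}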
Sorry if this is unclear, but I hope it helps. If you do not understand or I have missed something please comment.
If you are storing data in a byte array like you have specified, you are using BGR format which is basically RGB reversed:
[image: BGR color space]

Integer Types in file formats

I am currently trying to learn some more in-depth stuff about file formats.
I have a spec for a 3D file format (U3D in this case) and I want to try to implement that. Nothing serious, just for the learning effect.
My problem starts very early, with the types that need to be defined. I have to define different integers (8-bit, 16-bit, 32-bit, unsigned and signed), and these then need to be converted to hex before writing them to a file.
How do I define these types, since I cannot just create an I16, for example?
Another problem for me is how to convert that I16 to a hex number with 8 digits
(i.e. 0001 0001).
Hex is just a representation of a number. Whether you interpret the number as binary, decimal, hex, octal etc is up to you. In C++ you have support for decimal, hex, and octal representations, but they are all stored in the same way.
Example:
int x = 0x1;
int y = 1;
assert(x == y);
Likely the file format wants you to store the files in normal binary format. I don't think the file format wants the hex numbers as a readable text string. If it does though then you could use std::hex to do the conversion for you. (Example: file << hex << number;)
If the file format talks about writing a type of more than one byte to the file, then be careful of the endianness of your architecture, i.e. whether you store the most significant byte of the multi-byte type first or last.
It is very common in file format specifications to show you how the binary should look for a given part of the file. Don't confuse this though with actually storing binary digits as strings. Likewise they will sometimes give a shortcut for this by specifying in hex how it should look. Again most of the time they don't actually mean text strings.
The smallest addressable unit in C++ is a char which is 1 byte. If you want to set bits within that byte you need to use bitwise operators like & and |. There are many tutorials on bitwise operators so I won't go into detail here.
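As an illustration, here is a minimal sketch of writing a 16-bit value most-significant-byte first using shifts, so the bytes written do not depend on the host architecture (assuming the format wants big-endian; the helper name is made up):

#include <cstdint>
#include <fstream>

void write_u16_be(std::ofstream& out, uint16_t v)
{
    unsigned char bytes[2] = {
        static_cast<unsigned char>(v >> 8),  // most significant byte first
        static_cast<unsigned char>(v & 0xFF) // least significant byte last
    };
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
}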
If you include <stdint.h> you will get types such as:
uint8_t
int16_t
uint32_t
First, let me understand.
The integers are stored AS TEXT in a file, in hexadecimal format, without prefix 0x?
Then, use this syntax:
fprintf(fp, "%08x", number);
For example, if number is 0x0abc1234, this will write 0abc1234 into the file.
As for "define different integers (8Bit, 16bit, 32bit unsigned and signed)", unless you roll your own, and concomitant math operations for them, you should stick to the types supplied by your system. See stdint.h for the typedefs available, such as int32_t.

Byte swap of a byte array into a long long

I have a program where I simply copy a byte array into a long long array. There are a total of 20 bytes, so I just needed an array of 3 long longs. The reason I copied the bytes into long longs was to make it portable on 64-bit systems.
I now just need to byte-swap before I populate that array, such that the values that go into it are reversed.
There is a byteswap.h which has an _int64 bswap_64(_int64) function that I think I can use. I was hoping for some help with the usage of that function given my long long array. Would I simply pass in the name of the long long and read it out into another long long array?
I am using C++, not .NET or C#.
update:
Clearly there are issues I am still confused about. For example, working with byte arrays that just happen to be populated with a 160-bit hex string, which then has to be output in decimal form, made me think that if I just did a simple assignment to a long (4-byte) array my worries would be over. Then I found out that this code would have to run on a 64-bit Sun box. Then I thought that since the sizes of data types can change from one environment to another, a simple assignment would not cut it. That made me think about just using a long long to make the code sort of immune to that size issue. However, then I read about endianness and how 64-bit reads MSB vs. 32-bit which reads LSB. So, taking my data and reversing it such that it is stored in my long long as MSB was the only solution that came to mind. Of course, there is the matter of the 4 extra bytes, which in this case does not matter, and I will simply take the decimal output and display any six digits I choose. However, programmatically, I guess it would be better to just work with 4-byte longs and not deal with that whole wasted-4-bytes issue.
Between this and your previous questions, it sounds like there are several fundamental confusions here:
If your program is going to be run on a 64-bit machine, it sounds like you should compile and unit-test it on a 64-bit machine. Running unit tests on a 32-bit machine can give you confidence the program is correct in that environment, but doesn't necessarily mean the code is correct for a 64-bit environment.
You seem to be confused about how 32- and 64-bit architectures relate to endianness. 32-bit machines are not always little-endian, and 64-bit machines are not always big-endian. They are two separate concepts and can vary independently.
Endianness only matters for single values consisting of multiple bytes; for example, the integer 305,419,896 (0x12345678) requires 4 bytes to represent, or a UTF-16 character (usually) requires 2 bytes to represent. For these, the order of storage matters because the bytes are interpreted as a single unit. It sounds like what you are working with is a sequence of raw bytes (like a checksum or hash). Values like this, where multiple bytes are not interpreted in groups, are not affected by the endianness of the processor. In your case, casting the byte array to a long long * actually creates a potential endianness problem (on a little-endian architecture, your bytes will now be interpreted in the opposite order), not the other way around.
Endianness also doesn't matter unless the little-endian and big-endian versions of your program actually have to communicate with each other. For example, if the little-endian program writes a file containing multi-byte integers without swapping and the big-endian program reads it in, the big-endian program will probably misinterpret the data. It sounds like you think your code that works on a little-endian platform will suddenly break on a big-endian platform even if the two never exchange data. You generally don't need to be worried about the endianness of the architecture if the two versions don't need to talk to each other.
Another point of confusion (perhaps a bit pedantic). A byte does not store a "hex value" versus a "decimal value," it stores an integer. Decimal and hexadecimal are just two different ways of representing (printing) a particular integer value. It's all binary in the computer's memory anyway, hexadecimal is just an easy conversion to and from binary and decimal is convenient to our brains since we have ten fingers.
Assuming what you're trying to do is print the value of each byte of the array as decimal, you could do this:
#include <cstdio>

int main()
{
    unsigned char bytes[] = {0x12, 0x34, 0x56, 0x78};
    for (size_t i = 0; i < sizeof bytes; ++i)
    {
        printf("%u ", (unsigned int)bytes[i]);
    }
    printf("\n");
    return 0;
}
Output should be something like:
18 52 86 120
I think you should look at htonl() and family:
http://beej.us/guide/bgnet/output/html/multipage/htonsman.html
This family of functions is used to encode/decode integers for transport between machines that have different sizes/endianness of integers.
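For instance, a minimal sketch of a round trip through network byte order (assuming a POSIX system for the header):

#include <arpa/inet.h> // htonl/ntohl (POSIX; Windows has them in winsock2.h)
#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t host = 0x12345678u;
    uint32_t net  = htonl(host); // host order -> network (big-endian) order
    // On a little-endian machine the bytes of net are reversed relative to
    // host; on a big-endian machine htonl is a no-op.
    std::printf("host: %08x, round trip: %08x\n", host, ntohl(net));
    return 0;
}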
Write your program in the clearest, simplest way. You shouldn't need to do anything to make it "portable."
Byte-swapping is done to translate data of one endianness to another. bswap_64 is for resolving incompatibility between different 64-bit systems such as Power and X86-64. It isn't for manipulating your data.
If you want to reverse bytes in C++, try searching the STL for "reverse." You will find std::reverse, a function which takes pointers or iterators to the first and one-past-last bytes of your 20-byte sequence and reverses it. It's in the <algorithm> header.
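A minimal sketch of that, assuming the 20 bytes live in a plain array (the contents here are placeholders):

#include <algorithm> // std::reverse
#include <cstdint>
#include <iterator>  // std::begin, std::end

int main()
{
    uint8_t bytes[20] = {0}; // e.g. a 160-bit hash, placeholder data here
    std::reverse(std::begin(bytes), std::end(bytes)); // reverse in place
    return 0;
}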