I'm trying to read some bytes from a file.
This is what I've done:
struct HeaderData {
    char format[2];
    char n_trks[2];
    char division[2];
};
HeaderData* header = new HeaderData;
Then, to get the data directly from the file into header, I do
file.read(reinterpret_cast<char*>(header), sizeof(HeaderData));
If the first two bytes are 00 06, header->format[0] will be 00 and header->format[1] will be 06. These two numbers combined represent the number 0x0006, which is 6 in decimal, the desired value.
When I do something like
*reinterpret_cast<unsigned*>(header->format) // In this case, the result is 0x0600
it returns the number 0x0600 instead, so it seems that the bytes are read in reverse order.
My question is: what is a workaround to correctly read these numbers as unsigned values?
This is going to be an endianness mismatch.
When you read from the file in that fashion, the bytes are placed into your structure in exactly the order they appear in the file.
When you read from the structure through an unsigned, the processor interprets those bytes in whatever order the architecture requires (most architectures are hardwired to one order, but some can be configured for either).
Or to put it another way:
These two numbers combined represent the number 0x0006, which is 6 in decimal.
That's not necessarily remotely true. It's perfectly permissible for the processor of your choice to represent 6 in decimal as 0x06 0x00; this would be the little-endian scheme, which is used on very common processors like x86. Representing it as 0x00 0x06 would be big-endian.
As M.M has stated in his comment, if your format explicitly defines the integer to be little-endian, you should explicitly read it as little-endian, e.g. format[0] + format[1] * 256, or if it is defined to be big-endian, you should read it as format[0] * 256 + format[1]. Don't rely on the processor's endianness happening to match the endianness of the data.
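For instance, here is a minimal sketch of such a big-endian read for the two-byte fields above, assuming the file format defines them as big-endian as the 00 06 example suggests (the cast to unsigned char guards against sign extension on platforms where char is signed):

// Decode a 2-byte big-endian field independently of the host's byte order.
unsigned read_be16(const char bytes[2]) {
    return (static_cast<unsigned char>(bytes[0]) << 8)
         |  static_cast<unsigned char>(bytes[1]);
}

unsigned fmt = read_be16(header->format); // 0x00 0x06 gives 6 on any host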
Related
I want to read a file 32 bytes at a time using a C/C++ program, but I want to be sure that the data will be 256 bits. In essence, I am worried about leading bits in the "bytes" that I read from the file being left off. Is that even a matter of concern?
Example: if I have the number 2, represented in binary as 10, that would be sufficient for me as a human.
How is that different as far as a computer is concerned, where it is written as 00000010 to represent a char value of 1 byte? Would the leading zeros affect the bit count? Does that in turn affect operations like XOR?
I have trouble understanding the effects. Does it involve data loss? I really do not know!
Any help clearing up my misunderstanding will be appreciated!
Every routine in the C standard library that reads from a file or stream reads in units of bytes. Each byte read is a fixed number of bits; what is read from a file does not vary due to leading zeros or lack thereof in a byte. Some routines return a single character (which is a byte). Some routines put data read into a buffer and return a count of bytes read. Some routines, such as scanf, return a count of the number of items successfully converted. (You generally would not use these routines to read a fixed number of bytes.)
The number of bits in a byte is set by the C implementation. It is reported in CHAR_BIT, defined in <limits.h>. It is eight in all common C implementations.
Although the number of bits per byte does not vary, the number of bytes read from a stream can vary. If a stream is opened as a text stream, characters may be “added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment” (C 2018 7.21.2 2). To avoid this, you should open a stream as a binary stream.
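For instance, a minimal sketch of a binary-mode read of 32 bytes in C++ (the file name here is made up; gcount reports how many bytes actually arrived):

#include <fstream>

int main() {
    std::ifstream file("data.bin", std::ios::binary); // binary stream: no text-mode translation
    unsigned char buffer[32];                         // 32 bytes, i.e. 32 * CHAR_BIT bits
    file.read(reinterpret_cast<char*>(buffer), sizeof buffer);
    return file.gcount() == static_cast<std::streamsize>(sizeof buffer) ? 0 : 1; // 0 only if all 32 bytes were read
}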
The CHAR_BIT macro (defined in <climits>) will tell you how many bits make up a byte in the execution environment. However, I am not aware of any recent general-purpose hardware that uses bytes of other than 8 bits. Some specialized processors (such as for digital signal processing) may use other sizes, and completely outdated equipment used a wide variety of sizes (the typical alternative to 8 bits being 9).
No
C++ allows a char to be any size a platform requires. However, the macro CHAR_BIT always has the number of bits in a char.
So, to find out the number of bits in 32 bytes, you would use the formula 32*CHAR_BIT.
C++17 introduced the new type std::byte, which is not a character type and is always CHAR_BIT bits wide, as explained in the SO question std::byte on odd platforms.
In order to find the number of bytes needed to hold 256 bits, you have a problem, because CHAR_BIT isn't always a divisor of 256. So you have to decide what you want and use a more complicated formula. For example, (255 + CHAR_BIT) / CHAR_BIT, i.e. 256/CHAR_BIT rounded up, will give you the number of bytes needed to hold 256 contiguous bits.
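As a quick illustration of both formulas (only CHAR_BIT from <climits> is needed):

#include <climits>
#include <cstdio>

int main() {
    unsigned bits_in_32_bytes  = 32 * CHAR_BIT;                 // 256 when CHAR_BIT == 8
    unsigned bytes_for_256_bits = (255 + CHAR_BIT) / CHAR_BIT;  // 256/CHAR_BIT rounded up: 32 for 8-bit bytes, 29 for 9-bit ones
    std::printf("%u bits, %u bytes\n", bits_in_32_bytes, bytes_for_256_bits);
}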
To start with, I have a char array that stores data:
unsigned short no = 509;  // type assumed here; the question only states that no == 509 (0x01FD)
unsigned char dat[3];
memset(dat, 0, sizeof(dat));
memcpy(dat, &no, 2);      // copy the two bytes of no into dat
When I inspect dat, it contains the bytes 0xfd 0x01.
Since the value of no is 509, I expected the bytes to be 0x01 0xfd.
I'm wondering whether I should be concerned about the byte order,
and whether I should change it. Many thanks.
Your system is little-endian. This is hardware-dependent: on a little-endian platform, the first byte is the least significant one when treated as part of a multi-byte value. Look up: https://en.wikipedia.org/wiki/Endianness
Essentially, if the CPU is little-endian, then the value 0x12345689 is represented as a sequence of bytes starting with 0x89. On a big-endian CPU it's the opposite order, and on a mixed-endian one it may even change at run time.
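If you want to check which order your platform uses at run time, a minimal sketch is to inspect the first in-memory byte of a known value:

#include <cstdint>
#include <cstring>

bool is_little_endian() {
    std::uint32_t one = 1;
    unsigned char first;
    std::memcpy(&first, &one, 1);  // the byte at the lowest address
    return first == 1;             // least significant byte stored first: little-endian
}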
The question really is: what do you want to do next? On your current (little-endian) hardware, this is simply how the system orders the bytes of a numeric value: the least significant byte comes first, 0xfd 0x01.
In case you really want to swap this byte order, for whatever reason, check out: How do I convert between big-endian and little-endian values in C++?
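If you do decide to swap, a 16-bit swap can be as small as this sketch (the linked question covers wider types):

#include <cstdint>

// Exchange the two bytes of a 16-bit value, e.g. 0x01FD <-> 0xFD01.
std::uint16_t swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}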
Let's say that I've encoded my Huffman tree along with the compressed file. So, as an example, I have this file output:
001A1C01E01B1D
I'm having an issue saving this string to the file bit by bit. I know that C++ can only write to a file one byte at a time, so I'm having trouble storing this string in bytes. Is it possible to write the first three bits without the program padding them out to a byte? If the traversal codes get padded to a byte, then my tree (and the codes) will be completely messed up. And if I chop this up one byte at a time, what happens if the tree isn't exactly a multiple of 8 bits long? What happens if the compressed file's bit length isn't exactly a multiple of 8?
Hopefully I've been clear enough.
The standard solution to this problem is padding. There are many possible padding schemes. Padding schemes pad up to a whole number of bytes (i.e., a multiple of 8 bits). Additionally, they encode either the length of the message in bits, or the number of padding bits (from which the message length in bits can be determined by subtraction). The latter solution obviously results in slightly more efficient padding.
Most simply, you can append the number of "unused" bits in the last byte as an additional byte value.
One level up, start by assuming that the number of padding bits fits in 3 bits. Define the last 3 bits of an encoded file to encode the number of padding bits. Now, if the message uses up no more than 5 bits of the last byte, the padding count fits nicely in that same byte. If an extra byte is necessary to contain the padding count, the maximum gap is 5+2=7 bits (5 from the unused high bits of the extra byte, plus 2, the maximum possible free space in the last byte, since with 3 or more bits free the 3-bit padding value would have fit there). Since 0-7 is representable in 3 bits, this works (it doesn't work with 2 bits, since the maximum gap is larger and the range of representable values is smaller).
By the way, one of the main advantages of placing the padding information at the end of the file (rather than as a header at the beginning of the file) is that the compression functions can then operate on a stream without having to know its length in advance. Decompression can be stream-based as well, with careful handling of EOF signals.
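As an illustration of the simplest scheme above, here is a sketch of an end-of-stream flush that zero-fills the partial byte and appends the count of padding bits as one trailing byte (the buffer/nbits writer state is hypothetical):

#include <cstdint>
#include <fstream>

// Hypothetical flush for a bit writer that keeps up to 7 pending bits
// in the low end of `buffer` (most significant bit first), counted by `nbits`.
void flush_with_padding(std::ofstream& out, std::uint8_t buffer, int nbits) {
    std::uint8_t padding = 0;
    if (nbits > 0) {                                   // partial byte: fill it up with zero bits
        padding = static_cast<std::uint8_t>(8 - nbits);
        out.put(static_cast<char>(buffer << padding));
    }
    out.put(static_cast<char>(padding));               // trailing byte: the number of padding bits
}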
Simply treat a sequence of n bytes as a sequence of 8n bits. Use the >>, <<, |, and & operators to assemble bytes from the sequence of variable-length bit codes.
The end of the stream is important to handle properly. You need an end of stream code so that the decoder knows to stop and not try to decode the final padding bits that complete the last byte.
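A minimal sketch of that assembly step follows; the code/length pairs would come from your Huffman table, and an end-of-stream code is emitted like any other code:

#include <cstdint>
#include <vector>

struct BitWriter {
    std::vector<std::uint8_t> bytes; // completed output bytes
    std::uint32_t acc = 0;           // pending bits, newest at the low end
    int nbits = 0;                   // number of pending bits (always < 8 after put)

    // Append the low `len` bits of `code`, most significant first (len <= 24).
    void put(std::uint32_t code, int len) {
        acc = (acc << len) | (code & ((1u << len) - 1u));
        nbits += len;
        while (nbits >= 8) {         // emit every completed byte
            nbits -= 8;
            bytes.push_back(static_cast<std::uint8_t>(acc >> nbits));
        }
        acc &= (1u << nbits) - 1u;   // keep only the still-pending bits
    }
};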
My algorithm produces a stream of 9-bit and 17-bit values, and I need to find a way to store this data in a file. But I can't just store the 9 bits as an int and the 17 bits as an int32_t.
For example, if my algorithm produces 10 x 9-bit and 5 x 17-bit values, the output file size needs to be 22 bytes.
Also, one of the big problems to solve is that the output file can be very big and its size is unknown.
The only idea I have now is to use bool *vector;
If you have to save a dynamic number of bits, then you should probably save two values: the first being either the number of bits (if the bits are consecutive from 0 to x) or a bitmask saying which bits are valid; the second being the 32-bit integer representing your bits.
Taking your example literally: if you want to store 175 bits consisting of an unknown mix of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what is ahead of you in the file; you need the lengths. If there are only two possible sizes, a single bit is enough: 0 means 9 bits, 1 means 17 bits.
|0|9bit|0|9bit|1|17bit|0|9bit|1|17bit|1|17bit|...
So for your example, you would need 10*(1+9) + 5*(1+17) = 190 bits, i.e. 24 bytes after rounding up. The outstanding 2 bits need to be padded with 0's so that you align at a byte boundary. The fact that you will go on reading the file as if there were another entity (because you said you don't know how long the file is) shouldn't be a problem, because such padding will always be less than 9 bits; upon reaching end of file, you can throw away the last incomplete reading.
This approach does require implementing bit-level manipulation on top of the byte-level stream, which means careful masking and logical operations. BASE64 is exactly that, only simpler than your case: it consists only of fixed 6-bit entities, stored in a text file.
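A sketch of the writing side, building on your bool vector idea, with one flag bit in front of each value (the helper names are made up; packing the vector into bytes, zero-padding the tail, would be the final step):

#include <cstdint>
#include <vector>

// Append the low `len` bits of `value`, most significant first.
void put_bits(std::vector<bool>& bits, std::uint32_t value, int len) {
    for (int i = len - 1; i >= 0; --i)
        bits.push_back(((value >> i) & 1u) != 0);
}

// One entity: a flag bit (0 = 9-bit value, 1 = 17-bit value), then the value itself.
void put_entity(std::vector<bool>& bits, std::uint32_t value, bool is17bit) {
    bits.push_back(is17bit);
    put_bits(bits, value, is17bit ? 17 : 9);
}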
I have a binary file of doubles that I need to load using C++. However, my problem is that it was written in big-endian format but the fstream >> operator will then read the number wrong because my machine is little-endian. It seems like a simple problem to resolve for integers, but for doubles and floats the solutions I have found won't work. How can I (or should I) fix this?
I read this as a reference for integer byte swapping:
How do I convert between big-endian and little-endian values in C++?
EDIT: Though these answers are enlightening, I have found that my problem is with the file itself and not the format of the binary data. I believe my byte swapping does work, I was just getting confusing results. Thanks for your help!
The most portable way is to serialize in a textual format, so that you don't have byte order issues; this is how operator>> works, so you shouldn't be having any endian issues with >>. The principal problem with binary formats (which would explain endian problems) is that floating point numbers consist of a number of mantissa bits, a number of exponent bits, and a sign bit. The exponent may use an offset. This means that a straight byte re-ordering may not be sufficient, depending on the source and target formats.
If you are using an IEEE-754 representation on both machines then you may be OK with a straight byte reversal, as this standard specifies a bit-string interchange format that should be portable (byte order issues aside).
If you have to convert between two machine architectures and you have to use a raw byte memory dump, then so long as the basic number format is the same (i.e. they have the same bit counts in each part of the number), you can read the data into an array of unsigned char, use some basic byte and bit swapping routines to correct the storage format and then copy the raw bytes into a variable of the appropriate type.
The standard conversion operators do not work with binary data, so it's not clear how you got where you are.
However, since byte swapping operates on bytes, not numbers, you perform it on data destined to be floats just as on data destined to be integers.
And since text is so inefficient and floating-point data sets tend to be so large, it's entirely reasonable to want this.
#include <cstring>  // std::memcpy

uint32_t raw_bytes;
stream.read( reinterpret_cast< char * >( & raw_bytes ), sizeof raw_bytes ); // not formatted input: just 32 bits of bytes
my_byte_swap( raw_bytes ); // swap 'em
float f;
std::memcpy( & f, & raw_bytes, sizeof f ); // read the swapped bits as a float without aliasing trouble
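For the doubles the question actually asks about, the same pattern extends to 64 bits. A sketch, assuming IEEE-754 on both machines and a little-endian host:

#include <algorithm>
#include <cstring>
#include <istream>

double read_be_double(std::istream& stream) {
    unsigned char bytes[sizeof(double)];                     // 8 raw bytes from the file
    stream.read(reinterpret_cast<char*>(bytes), sizeof bytes);
    std::reverse(bytes, bytes + sizeof bytes);               // big-endian file, little-endian host
    double d;
    std::memcpy(&d, bytes, sizeof d);                        // bit-for-bit copy avoids aliasing trouble
    return d;
}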