Does endianness affect writing an odd number of bytes? - c++

Imagine you had a uint64_t bytes and you know that you only need 7 bytes because the integers you store will not exceed the limit of 7 bytes.
When writing a file you could do something like
std::ofstream fout(fileName);
fout.write((char *)&bytes, 7);
to only write 7 bytes.
The question I'm trying to figure out is whether the endianness of a system affects the bytes that are written to the file. I know that endianness affects the order in which the bytes are written, but does it also affect which bytes are written? (Only for the case where you write fewer bytes than the integer has.)
For example, on a little endian system the first 7 bytes are written to the file, starting with the LSB. On a big endian system what is written to the file?
Or to put it differently, on a little endian system the MSB(the 8th byte) is not written to the file. Can we expect the same behavior on a big endian system?

Endianness only affects the way multi-byte integers (16, 32 or 64 bits) are written. If you are writing individual bytes (as in your case), they will be written in exactly the order in which you write them.
For example, this kind of writing will be affected by endianess:
std::ofstream fout(fileName);
int i = 67;
fout.write((char *)&i, sizeof(int));

uint64_t bytes = ...;
fout.write((char *)&bytes, 7);
This will write exactly 7 bytes starting from the address of &bytes. There is a difference between LE and BE systems how the eight bytes in memory are laid out, though (let's assume the variable is located at address 0xff00):
0xff00 0xff01 0xff02 0xff03 0xff04 0xff05 0xff06 0xff07
LE: [byte 0 (LSB!)][byte 1][byte 2][byte 3][byte 4][byte 5][byte 6][byte 7 (MSB)]
BE: [byte 7 (MSB!)][byte 6][byte 5][byte 4][byte 3][byte 2][byte 1][byte 0 (LSB)]
The starting address (0xff00) won't change when casting to char*, and you'll write out the byte at exactly this address plus the six following ones. In both cases (LE and BE), the byte at address 0xff07 won't be written. Now if you look at my memory table above, it should be obvious that on a BE system you lose the LSB while storing the MSB, which does not carry information...
On a BE system, you could instead write fout.write((char *)&bytes + 1, 7);. Be aware, though, that this still leaves a portability issue:
fout.write((char *)&bytes + isBE(), 7);
// ^ giving true/false, i. e. 1 or 0
// (such function/test existing is an assumption!)
This way, data written by a BE system would be misinterpreted by an LE system when read back, and vice versa. A safe version would be to decompose the value into single bytes, as geza did in his answer. To avoid multiple system calls, you might decompose the values into an array instead and write that out in one call.
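The isBE() test assumed above is not a standard function. As a minimal sketch, it could be written as a runtime check that inspects the byte stored at the lowest address of a known value (the name isBE and the probe value are only illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a standard function): returns true when the
// lowest-addressed byte of a multi-byte integer is its most significant byte.
bool isBE()
{
    const std::uint16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);  // copy the byte at the lowest address
    return first == 0x01;            // 0x01 first: big-endian, 0x02 first: little-endian
}
```

Note that this only selects which 7 bytes are written; the byte order inside the file still differs between BE and LE writers, which is the portability issue described above.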
If on linux/BSD, there's a nice alternative, too:
bytes = htole64(bytes); // will likely result in a no-op on LE system...
fout.write((char *)&bytes, 7);

The question I'm trying to figure out is whether the endianness of a system affects the bytes that are written to the file.
Yes, it affects which bytes are written to the file.
For example, on a little endian system the first 7 bytes are written to the file, starting with the LSB. On a big endian system what is written to the file?
The first 7 bytes are written to the file. But this time, starting with the MSB. So, in the end, the lowest byte is not written in the file, because on big endian systems, the last byte is the lowest byte.
So, this is not what you've wanted, because you lose information.
A simple solution is to convert uint64_t to little endian, and write the converted value. Or just write the value byte-by-byte in a way that a little endian system would write it:
uint64_t x = ...;
write_byte(uint8_t(x));
write_byte(uint8_t(x>>8));
write_byte(uint8_t(x>>16));
// you get the idea how to write the remaining bytes

Related

writing to a binary file in c++

My problem is that I have a string in hex representation say:
036e. I want to write it to a binary file in the same hex representation. I first converted the string to an integer using the strtoul() function. Then
I use the write() function to write that to a file. Here is the code that I wrote. I get the following output in my file after running this:
6e03 0000
#include <cstdlib> // strtoul
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main() {
ofstream fileofs("binary.bin", ios::binary | ios::out);
string s = "036e";
int x = strtoul(s.c_str(), NULL, 16);
fileofs.write((char*)&x, sizeof(int));
fileofs.close();
}
While the result that I expect is something like this:
036e
Can anybody explain exactly what I'm doing wrong over here?
Your problem has to do with endianness and also with the size of an integer.
The bytes are reversed because you are running on a little-endian system.
The two extra zero bytes appear because you are using a 32-bit compiler where an int has 4 bytes.
There is nothing wrong with your code as long as you are always going to use it on 32-bit, little-endian systems.
Provided that you keep the same system, if you read your integers from the file using similar code you'll get the right values (the first integer you read will be 0x36E).
To write the data as you wish, you could use exactly the same code with the minor changes noted below:
unsigned short x = htons(strtoul(s.c_str(), NULL, 16));
fileofs.write((char*)&x, sizeof(x));
But you must be aware that when you read back the data you must convert it to the host format using ntohs(). If you write your code this way it will work with any compiler and system, as the network byte order is always the same and the conversion functions only change the data if necessary. (htons() and ntohs() are declared in <arpa/inet.h> on POSIX systems and in <winsock2.h> on Windows.)
You can find more information on another thread here and in the linux.com man page for those functions.
If you want 16 bits, use a data type that is guaranteed to be 16 bits. uint16_t from cstdint should do the trick.
Next: endianness.
This is described in detail in many places. The TL;DR version is that some systems, including virtually every desktop PC you are likely to write code for, store their integers with the bytes BACKWARD. It is way out of scope to explain why, but when you get down into the common usage patterns it does make sense.
So what you see as 036e is two bytes, 03 and 6e, stored with the least significant byte, 6e, first. What the computer sees in memory is a two-byte value containing 6e03. This is what is written to the output file unless you take steps to force an ordering on the output.
There are tonnes of different ways to force an ordering, but let's focus on the one that both always works (even when porting to a system that is already big-endian) and is easy to read.
uint16_t in = 0x036e; // the value to write
uint8_t out[2];
out[0] = (in >> 8) & 0xFF; // put highest in byte in first out byte
out[1] = in & 0xFF; // put lowest in byte in second out byte
out is then written to the file.
Recommended supplementary reading: Serialization This will help explain the common next problem: "Why my strings crash my program after I read them back in?"

Converting a uint8_t to its binary representation

I have a variable of type uint8_t which I'd like to serialize and write to a file (which should be quite portable, at least for Windows, which is what I'm aiming at).
Trying to write it to a file in its binary form, I came across this working snippet:
uint8_t m_num = 3;
unsigned int s = (unsigned int)(m_num & 0xFF);
file.write((wchar_t*)&s, 1); // file = std::wofstream
First, let me make sure I understand what this snippet does - it takes my var (which is basically an unsigned char, 1 byte long), converts it into an unsigned int (which is 4 bytes long, and not so portable), and using & 0xFF "extracts" only the least significant byte.
Now, there are two things I don't understand:
1. Why convert it into unsigned int in the first place, why can't I simply do something like
file.write((wchar_t*)&m_num, 1); or reinterpret_cast<wchar_t *>(&m_num)? (Ref)
2. How would I serialize a longer type, say a uint64_t (which is 8 bytes long)? unsigned int may or may not be enough here.
uint8_t is 1 byte, the same as char.
wchar_t is 2 bytes on Windows and 4 bytes on Linux. Its representation in a file also depends on endianness. You should avoid wchar_t if portability is a concern.
You can just use std::ofstream. Windows has an additional version for std::ofstream which accepts UTF16 file name. This way your code is compatible with Windows UTF16 filenames and you can still use std::fstream. For example
int i = 123;
std::ofstream file(L"filename_in_unicode.bin", std::ios::binary);
file.write((char*)&i, sizeof(i)); //sizeof(int) is 4
file.close();
...
std::ifstream fin(L"filename_in_unicode.bin", std::ios::binary);
fin.read((char*)&i, 4); // output: i = 123
This is relatively simple because it's only storing integers. This will work on different Windows systems, because Windows is always little-endian, and int size is always 4.
But some systems are big-endian, and you would have to deal with that separately.
If you use standard I/O, for example fout << 123456, then the integer will be stored as the text "123456". Standard I/O is compatible, but it takes a little more disk space and can be a little slower.
It's compatibility versus performance. If you have large amounts of data (several megabytes or more) and you can deal with compatibility issues in the future, then go ahead with writing bytes. Otherwise it's easier to use standard I/O. The performance difference is usually not measurable.
It is impossible to write uint8_t values to a wofstream because a wofstream only writes wide characters and doesn't handle binary values at all.
If what you want to do is to write a wide character representing a code point between 0 and 255, then your code is correct.
If you want to write binary data to a file then your nearest equivalent is ofstream, which will allow you to write bytes.
To answer your questions:
wofstream::write writes wide characters, not bytes. If you reinterpret the address of m_num as the address of a wide character, you will be writing a 16-bit or 32-bit (depending on platform) wide character of which the first byte (that is, the least significant or most significant, depending on platform) is the value of m_num and the remaining bytes are whatever happens to occur in memory after m_num. Depending on the character encoding of the wide characters, this may not even be a valid character. Even if valid, it is largely nonsense. (There are other possible problems if wofstream::write expects a wide-character-aligned rather than a byte-aligned input, or if m_num is immediately followed by unreadable memory).
If you use wofstream then this is a mess, and I shan't address it. If you switch to a byte-oriented ofstream then you have two choices:
1. If you will only ever be reading the file on the same system, file.write((char*)&myint64value, sizeof(myint64value)) will work. The order in which the bytes of the 64-bit value are written depends on the platform, but the same order will be used when you read back, so this doesn't matter. Don't try to do something analogous with wofstream, because it's dangerous!
2. Extract each of the 8 bytes of myint64value separately (shift right by a multiple of 8 bits and then take the bottom 8 bits) and write them one at a time. This is fully portable because you control the order in which the bytes are written.

Size of binary representation of a number

We generally say that the number 5 can be represented as a 3 bit binary number. But, if we convert 5 to its binary representation i.e. 101 and print it into a text file, it actually takes 3 bytes as it is read as a character array. How can I create a file (not necessarily a text file) such that the size of that file is 3 bits?
You can logically represent 5 as three bits, but neither the filesystem nor the memory management system (for RAM) will let you address space in units smaller than one byte.
If you had eight of these numbers, you could pack them into 24 bits = 3 bytes and store those "efficiently" in memory or a file. Efficiently in quotes, because while you save some space, it becomes difficult to work with the packed data as you need to bit-shift things around a lot. CPU instructions, memory loads, array indexing etc all don't work with less-than-byte units.
The most practical way would be to just use a whole byte for your three bits and live with the overhead.
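For illustration, packing eight 3-bit numbers into 3 bytes could look like the sketch below. The function names pack3 and unpack3 and the bit order (first value in the most significant bits) are choices made for this example:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Packs eight values in the range 0..7 into 24 bits (3 bytes).
// The first value occupies the three most significant bits of the first byte.
std::array<std::uint8_t, 3> pack3(const std::array<std::uint8_t, 8>& values)
{
    std::uint32_t bits = 0;
    for (std::uint8_t v : values) {
        assert(v < 8);            // each value must fit into 3 bits
        bits = (bits << 3) | v;
    }
    return { static_cast<std::uint8_t>(bits >> 16),
             static_cast<std::uint8_t>(bits >> 8),
             static_cast<std::uint8_t>(bits) };
}

// Reverses pack3: extracts the eight 3-bit values again.
std::array<std::uint8_t, 8> unpack3(const std::array<std::uint8_t, 3>& bytes)
{
    const std::uint32_t bits = (std::uint32_t{bytes[0]} << 16)
                             | (std::uint32_t{bytes[1]} << 8)
                             | bytes[2];
    std::array<std::uint8_t, 8> values{};
    for (int i = 0; i < 8; ++i)
        values[i] = static_cast<std::uint8_t>((bits >> (3 * (7 - i))) & 0x7);
    return values;
}
```

The shifting and masking needed just to get one value back out shows the kind of extra work that packed data brings with it.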
I don't think you'll get a file system that will tell you the file is 3 bits. It will be at least a byte, plus storage for the file's extra information.
But you could simply open a file for writing and write 3 as binary.
unsigned char value = 3;
FILE *ptr = fopen("file", "wb");
if (ptr != NULL) {
    fwrite(&value, 1, 1, ptr); /* fwrite takes a pointer to the data */
    fclose(ptr);
}
You can use the following code as a starting point. It stores three numbers (5, 3 and 2) in a single byte, so the file occupies only one byte. In general, we cannot store data in partial bytes in files.
#include <stdio.h>

struct bits
{
    unsigned char first:3, second:3, third:2; /* 3 + 3 + 2 = 8 bits */
};

int main(void)
{
    struct bits b;
    FILE *f;
    b.first = 5;
    b.second = 3;
    b.third = 2;
    printf("\ninitial data:%u %u %u", b.first, b.second, b.third);
    /* storing in file ("wb": binary mode) */
    f = fopen("bitsfile", "wb");
    fwrite(&b, sizeof(b), 1, f);
    fclose(f);
    /* reading back from file */
    f = fopen("bitsfile", "rb");
    fread(&b, sizeof(b), 1, f);
    fclose(f);
    printf("\ndata read from file:%u %u %u", b.first, b.second, b.third);
    return 0;
}
You generally can't, since binary files have a minimal 'quantum' of one byte (8 bits).
There is something interesting about storing symbols with non-uniform bit lengths using Huffman encoding. Just to explain before you read the complete article: each symbol of your alphabet sits on a leaf of a binary tree. There is a single path of 1s (i.e. left) and 0s (i.e. right) starting from the root and ending at your symbol. If the tree is unbalanced (and it will be), different symbols can be represented uniquely with different bit lengths. Of course this takes some effort, because you always have to read the file at byte level and then unpack and handle the bits in your algorithm implementation.

size of char being written to file as a binary value in C++

What I understood about char type from a few questions asked here is that it is always 1 byte in C++, but number of bits can vary from system to system.
The sizeof operator uses char as its unit, so sizeof(char) is always 1 in C++ (a byte whose number of bits is that of the smallest addressable unit of the local machine). If, when using the file functions of fstream in binary mode, we read and write directly from/to the address of a variable in RAM, then the smallest unit of data written to the file should have the size of the value read from RAM, and vice versa for a read from the file. Can we then say that data may not be written in units of 8 bits if something like this is tried:
ofstream file;
file.open("blabla.bin",ios::out|ios::binary);
char a[]="asdfghjkkll";
file.seekp(0);
file.write((char*)a,sizeof(a)-1);
file.close();
If char does not always have the standard 8 bits, what happens if a lot of data is written to a file on a 16-bit machine and read on a 32-bit machine? Or should I use the OS-dependent text mode? If not, and I have misunderstood, what is the truth?
Edit : I have corrected my mistake.
Thanks for warning.
Edit2: My system is 64-bit, but I get 8 as the number of bits of the char type. What is wrong? Is the way I get the result of 8 wrong?
I got 000... by shifting a char variable by more than its possible size with bitwise operators. After guaranteeing that all bits of the variable are zero, I got 111... by inverting it, and then shifted that until it became zero. If we shift it as many times as it has bits, we get zero, so we can get the number of bits from the index of the loop below once it terminates.
char zero,test;
zero<<=64; //hoping that system is not more than 64 bit(most likely)
test=~zero; //we have a 111...
int i;
for(i=0; test!=zero; i++)
test=test<<1;
The value of i after the loop is the number of bits in the char type. According to this, the result is 8.
My last question is:
Are the filesystem byte and the char type different data types, because the way the computer addresses pointers in a file stream is different from the standard char type, which is at least 8 bits?
So, exactly what is going on the background?
Edit3: Why these minuses? What is my mistake? Isn't the question clear enough? Maybe my question is stupid but why there is no any response related to my question?
A language standard can't really specify what the filesystem does - it can only specify how the language interacts with it. The C and C++ standards also don't address anything to do with interoperability or communication between different implementations. In other words, there isn't a general answer to this question except to say that:
the VAST majority of systems use 8-bit bytes
the C and C++ standards require that char is at least 8 bits
it is very likely that greater-than-8-bit systems have mechanisms in place to somehow utilize (or at least transcode) 8-bit files.
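As a side note to the measuring loop in the question: the number of bits in a char is available directly as the CHAR_BIT macro from <climits>, and it describes the char type, not the 16/32/64-bit width of the machine. The loop in the question reads an uninitialized variable and shifts by more than the type's width, both of which are undefined behavior. A well-defined variant of the same idea might look like this sketch (count_bits_in_char is an invented name):

```cpp
#include <climits>

// Counts the bits of unsigned char by shifting an all-ones value
// to the left until nothing remains.
int count_bits_in_char()
{
    unsigned char test = static_cast<unsigned char>(~0u);  // all bits set
    int bits = 0;
    while (test != 0) {
        test = static_cast<unsigned char>(test << 1);  // drop the top bit
        ++bits;
    }
    return bits;  // equals CHAR_BIT
}
```

On the vast majority of systems both count_bits_in_char() and CHAR_BIT give 8.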

Writing binary data in c++

I am in the process of building an assembler for a rather unusual machine that me and a few other people are building. This machine takes 18 bit instructions, and I am writing the assembler in C++.
I have collected all of the instructions into a vector of 32 bit unsigned integers, none of which is any larger than what can be represented with an 18 bit unsigned number.
However, there does not appear to be any way (as far as I can tell) to output such an unusual number of bits to a binary file in C++. Can anyone help me with this?
(I would also be willing to use C's stdio and File structures. However there still does not appear to be any way to output such an arbitrary amount of bits).
Thank you for your help.
Edit: It looks like I didn't specify how the instructions will be stored in memory well enough.
Instructions are contiguous in memory. Say the instructions start at location 0 in memory:
The first instruction will be at 0. The second instruction will be at 18, the third instruction will be at 36, and so on.
There are no gaps and no padding between the instructions. There can be a few superfluous 0s at the end of the program if needed.
The machine uses big endian instructions. So an instruction stored as 3 should map to: 000000000000000011
Keep an eight-bit accumulator.
Shift bits from the current instruction into the accumulator until either:
    The accumulator is full; or
    No bits remain of the current instruction.
Whenever the accumulator is full:
    Write its contents to the file and clear it.
Whenever no bits remain of the current instruction:
    Move to the next instruction.
When no instructions remain:
    Shift zeros into the accumulator until it is full.
    Write its contents.
    End.
For n instructions, this will leave (8 - 18n mod 8) zero bits after the last instruction.
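The accumulator steps above can be sketched as follows. The function name pack18 is invented, the bits of each instruction are emitted most significant bit first as the question requests, and the result is returned as a byte vector so it can be written to a file in one call. One deviation from the steps as written: this sketch emits no padding byte when the instructions already end on a byte boundary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Packs the low 18 bits of each word, most significant bit first, into bytes.
std::vector<std::uint8_t> pack18(const std::vector<std::uint32_t>& instructions)
{
    std::vector<std::uint8_t> out;
    std::uint8_t acc = 0;  // eight-bit accumulator
    int filled = 0;        // number of bits currently in the accumulator
    for (std::uint32_t word : instructions) {
        assert(word < (1u << 18));  // must fit into 18 bits
        for (int bit = 17; bit >= 0; --bit) {
            acc = static_cast<std::uint8_t>((acc << 1) | ((word >> bit) & 1u));
            if (++filled == 8) {    // accumulator full: emit it and clear it
                out.push_back(acc);
                acc = 0;
                filled = 0;
            }
        }
    }
    if (filled > 0)  // pad the final partial byte with zero bits
        out.push_back(static_cast<std::uint8_t>(acc << (8 - filled)));
    return out;
}
```

The returned buffer can then be written with a single fout.write(reinterpret_cast<const char*>(buf.data()), buf.size()) on a stream opened in std::ios::binary mode.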
There are a lot of ways you can achieve the same end result (I am assuming the end result is a tight packing of these 18 bits).
A simple method would be to create a bit-packer class that accepts the 32-bit words, and generates a buffer that packs the 18-bit words from each entry. The class would need to do some bit shifting, but I don't expect it to be particularly difficult. The last byte can have a few zero bits at the end if the original vector length is not a multiple of 4. Once you give all your words to this class, you can get a packed data buffer, and write it to a file.
You could maybe represent your data in a bitset and then write the bitset to a file.
Wouldn't work with fstreams write function, but there is a way that is described here...
The short answer: Your C++ program should output the 18-bit values in the format expected by your unusual machine.
We need more information, specifically, that format that your "unusual machine" expects, or more precisely, the format that your assembler should be outputting. Once you understand what the format of the output that you're generating is, the answer should be straightforward.
One possible format — I'm making things up here — is that we could take two of your 18-bit instructions:
         instruction 1        instruction 2        ...
         MSB            LSB   MSB            LSB   ...
bits →   ABCDEFGHIJKLMNOPQR   abcdefghijklmnopqr   ...
...and write them in an 8-bits/byte file thus:
KLMNOPQR CDEFGHIJ 000000AB klmnopqr cdefghij 000000ab ...
...this is basically arranging the values in "little-endian" form, with 6 zero bits padding the 18-bit values out to 24 bits.
But I'm assuming: the padding, the little-endianness, the number of bits / byte, etc. Without more information, it's hard to say if this answer is even remotely near correct, or if it is exactly what you want.
Another possibility is a tight packing:
ABCDEFGH IJKLMNOP QRabcdef ghijklmn opqr0000
or
ABCDEFGH IJKLMNOP abcdefQR ghijklmn 0000opqr
...but I've made assumptions about where the corner cases go here.
Just output them to the file as 32 bit unsigned integers, just as you have in memory, with the endianness that you prefer.
And then, when the loader, EEPROM writer, JTAG or whatever method you use sends the code to the machine, for each 32-bit word that is read, just drop the 14 most significant bits and send the remaining 18 bits to the target.
Unless, of course, you have written a FAT driver for your machine...