writing to a binary file in c++

My problem is that I have a string in hex representation, say 036e, and I want to write it to a binary file in that same hex representation. I first convert the string to an integer using the strtoul() function, then I use the ofstream write() function to write it to a file. Here is the code that I wrote. I get the following output in my file after running it:
6e03 0000
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib> // for strtoul
using namespace std;
int main() {
ofstream fileofs("binary.bin", ios::binary | ios::out);
string s = "036e";
int x = strtoul(s.c_str(), NULL, 16);
fileofs.write((char*)&x, sizeof(int));
fileofs.close();
}
While the result that I expect is something like this:
036e
Can anybody explain exactly what I'm doing wrong over here?

Your problem has to do with endianness and also with the size of an integer.
The inverted bytes appear because you are running on a little-endian system.
The two extra zero bytes appear because you are using a 32-bit compiler, where an int is 4 bytes.
There is nothing wrong with your code as long as you always use it on 32-bit, little-endian systems.
Provided that you stay on such a system, if you read your integers back from the file using similar code you'll get the right values (your first integer read will be 0x36E).
To write the data as you wish, you could use exactly the same code with minor changes, as noted below:
unsigned short x = htons(strtoul(s.c_str(), NULL, 16));
fileofs.write((char*)&x, sizeof(x));
But you must be aware that when you read back the data you must convert it to the right format using ntohs(). If you write your code this way it will work with any compiler and system, as the network order is always the same and the conversion functions only change the data when necessary.
You can find more information on another thread here and in the linux.com man page for those functions.
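Putting the whole answer together, a minimal sketch might look like the following (assuming a POSIX system, where htons()/ntohs() come from <arpa/inet.h>; on Windows they live in <winsock2.h>):
#include <arpa/inet.h>  // htons, ntohs (POSIX; <winsock2.h> on Windows)
#include <cstdlib>      // strtoul
#include <fstream>
#include <string>

int main() {
    std::ofstream fileofs("binary.bin", std::ios::binary | std::ios::out);
    std::string s = "036e";
    // Parse the hex string, then convert to network (big-endian) byte order.
    unsigned short x = htons(static_cast<unsigned short>(std::strtoul(s.c_str(), nullptr, 16)));
    fileofs.write(reinterpret_cast<char*>(&x), sizeof(x));
    fileofs.close();

    // Reading it back: convert from network order to host order with ntohs().
    std::ifstream fileifs("binary.bin", std::ios::binary);
    unsigned short y = 0;
    fileifs.read(reinterpret_cast<char*>(&y), sizeof(y));
    y = ntohs(y);  // y == 0x036e again
}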

If you want 16 bits, use a data type that is guaranteed to be 16 bits. uint16_t from <cstdint> should do the trick.
Next, endianness.
This is described in detail in many places. The TL;DR version is that some systems, including virtually every desktop PC you are likely to write code for, store their integers with the bytes backwards. It is way out of scope to explain why, but once you get down into the common usage patterns it does make sense.
So what you see as 036e is two bytes, 03 and 6e, stored with the least significant byte, 6e, first. What the computer sees is therefore a two-byte value containing 6e03, and this is what is written to the output file unless you take steps to force an ordering on the output.
There are tons of different ways to force an ordering, but let's focus on the one that both always works (even when porting to a system that is already big-endian) and is easy to read.
uint16_t in;
uint8_t out[2];
out[0] = (in >> 8) & 0xFF; // highest byte of in goes into the first out byte
out[1] = in & 0xFF; // lowest byte of in goes into the second out byte
out is then written to the file.
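A self-contained sketch of that approach, using the 0x036e value from the question (the file name is just a placeholder):
#include <cstdint>
#include <fstream>

int main() {
    uint16_t in = 0x036e;          // the value parsed from the hex string
    uint8_t out[2];
    out[0] = (in >> 8) & 0xFF;     // highest byte of in into the first out byte
    out[1] = in & 0xFF;            // lowest byte of in into the second out byte

    std::ofstream ofs("binary.bin", std::ios::binary);
    ofs.write(reinterpret_cast<const char*>(out), sizeof(out));
    // The file now contains 03 6e no matter the endianness of the host.
}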
Recommended supplementary reading: Serialization. This will help explain the common next problem: "Why do my strings crash my program after I read them back in?"

Related

Does endianness affect writing an odd number of bytes?

Imagine you have a uint64_t bytes and you know that you only need 7 bytes, because the integers you store will never exceed what fits in 7 bytes.
When writing a file you could do something like
std::ofstream fout(fileName);
fout.write((char *)&bytes, 7);
to only write 7 bytes.
The question I'm trying to figure out is whether the endianness of a system affects the bytes that are written to the file. I know that endianness affects the order in which the bytes are written, but does it also affect which bytes are written? (Only for the case where you write fewer bytes than the integer actually has.)
For example, on a little endian system the first 7 bytes are written to the file, starting with the LSB. On a big endian system what is written to the file?
Or to put it differently, on a little endian system the MSB(the 8th byte) is not written to the file. Can we expect the same behavior on a big endian system?
Endianness only affects the way multi-byte integers (16-, 32-, 64-bit) are written. If you are writing single bytes (as in your case), they will be written in exactly the order you write them.
For example, this kind of writing will be affected by endianess:
std::ofstream fout(fileName);
int i = 67;
fout.write((char *)&i, sizeof(int));
uint64_t bytes = ...;
fout.write((char *)&bytes, 7);
This will write exactly 7 bytes starting from the address &bytes. There is a difference between LE and BE systems in how the eight bytes are laid out in memory, though (let's assume the variable is located at address 0xff00):
0xff00 0xff01 0xff02 0xff03 0xff04 0xff05 0xff06 0xff07
LE: [byte 0 (LSB!)][byte 1][byte 2][byte 3][byte 4][byte 5][byte 6][byte 7 (MSB)]
BE: [byte 7 (MSB!)][byte 6][byte 5][byte 4][byte 3][byte 2][byte 1][byte 0 (LSB)]
The starting address (0xff00) doesn't change when casting to char*, and you'll write out the byte at exactly this address plus the six following ones – in both cases (LE and BE), the byte at address 0xff07 is not written. Now if you look at the memory table above, it should be obvious that on a BE system you lose the LSB while storing the MSB, which carries no information...
On a BE system, you could instead write fout.write((char *)&bytes + 1, 7);. Be aware, though, that this still leaves a portability issue:
fout.write((char *)&bytes + isBE(), 7);
// ^ giving true/false, i. e. 1 or 0
// (such function/test existing is an assumption!)
This way, data written by a BE system would be misinterpreted by an LE system when read back, and vice versa. A safe version would be to decompose the value into single bytes as geza did in his answer. To avoid multiple write calls, you might decompose the value into an array instead and write that out in one go.
If you are on Linux/BSD, there's a nice alternative, too:
bytes = htole64(bytes); // will likely result in a no-op on LE system...
fout.write((char *)&bytes, 7);
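A fuller sketch of that variant (assuming glibc, where htole64() is declared in <endian.h>; the BSDs put it in <sys/endian.h>, and the value below is just a placeholder that fits in 7 bytes):
#include <endian.h>   // htole64 on glibc; <sys/endian.h> on the BSDs
#include <cstdint>
#include <fstream>

int main() {
    uint64_t bytes = 0x00123456789abcdeULL;  // fits in 7 bytes, the MSB is zero
    bytes = htole64(bytes);                  // likely a no-op on a little-endian host
    std::ofstream fout("data.bin", std::ios::binary);
    fout.write(reinterpret_cast<const char*>(&bytes), 7);
}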
The question I'm trying to figure out is whether the endianness of a system affects the bytes that are written to the file.
Yes, it affects which bytes are written to the file.
For example, on a little endian system the first 7 bytes are written to the file, starting with the LSB. On a big endian system what is written to the file?
The first 7 bytes are written to the file, but this time starting with the MSB. So, in the end, the lowest byte is not written to the file, because on big-endian systems the last byte is the lowest byte.
So, this is not what you've wanted, because you lose information.
A simple solution is to convert uint64_t to little endian, and write the converted value. Or just write the value byte-by-byte in a way that a little endian system would write it:
uint64_t x = ...;
write_byte(uint8_t(x));
write_byte(uint8_t(x>>8));
write_byte(uint8_t(x>>16));
// you get the idea how to write the remaining bytes
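A minimal sketch of that byte-by-byte idea, with write_byte spelled out against a std::ofstream (the extra stream parameter, names, and the sample value are illustrative only):
#include <cstdint>
#include <fstream>

// Hypothetical helper: append a single byte to the stream.
void write_byte(std::ofstream& out, uint8_t b) {
    out.put(static_cast<char>(b));
}

int main() {
    uint64_t x = 0x00123456789abcdeULL;           // known to fit in 7 bytes
    std::ofstream fout("data.bin", std::ios::binary);
    for (int i = 0; i < 7; ++i)
        write_byte(fout, uint8_t(x >> (8 * i)));  // LSB first, same order on any host
}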

Detect endianness of binary file data

Recently I was (again) reading about endianness. I know how to identify the endianness of the host, as there are lots of posts on SO, and I have also seen this, which I think is a pretty good resource.
However, one thing I'd like to know is how to detect the endianness of an input binary file. For example, I am reading a binary file (using C++) like the following:
ifstream mydata("mydata.raw", ios::binary);
short value;
char buf[sizeof(short)];
int dataCount = 0;
short myDataMat[DATA_DIMENSION][DATA_DIMENSION];
while (mydata.read(reinterpret_cast<char*>(&buf), sizeof(buf)))
{
memcpy(&value, buf, sizeof(value));
myDataMat[dataCount / DATA_DIMENSION][dataCount%DATA_DIMENSION] = value;
dataCount++;
}
I'd like to know how I can detect the endianness of the data in mydata.raw, and whether endianness affects this program at all.
Additional Information:
I am only manipulating the data in myDataMat using mathematical operations; no pointer or bitwise operations are done on the data.
My machine (host) is little endian.
It is impossible to "detect" the endianity of data in general. Just like it is impossible to detect whether the data is an array of 4 byte integers, or twice that many 2 byte integers. Without any knowledge about the representation, raw data is just a mass of meaningless bits.
However, with some extra knowledge about the data representation, it become possible. Some examples:
Most file formats mandate particular endianity, in which case this is never a problem.
Unicode text files may optionally start with a byte order mark. Same idea can be implemented by other data representations.
Some file formats contain a checksum. You can guess one endianity, and if the checksum does not match, try again with another endianity. It will be unlikely that the checksum matches with wrong interpretation of the data.
Sometimes you can make guesses based on the data. Is the temperature outside 33'554'432 degrees, or maybe 2? You can pick the endianity that represents sane data. Of course, this type of guesswork fails miserably, when the aliens invade and start melting our planet.
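For instance, a rough sketch of the byte order mark idea for 16-bit data (this assumes the file actually begins with a U+FEFF mark; real code would also have to handle files without one):
#include <cstdint>
#include <fstream>

// Reads the first two bytes and decides the byte order from a 0xFEFF mark:
// FF FE on disk means little-endian, FE FF means big-endian.
bool looks_little_endian(std::ifstream& in) {
    uint8_t bom[2] = {0, 0};
    in.read(reinterpret_cast<char*>(bom), 2);
    return bom[0] == 0xFF && bom[1] == 0xFE;
}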
You can't tell.
The endianness transformation is essentially an operator E(x) on a number x such that x = E(E(x)). So you don't know "which way round" the x elements are in your file.
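For a 16-bit value, E(x) is just a byte swap, and applying it twice gives back the original, so the stored bytes alone cannot tell you which way round they are. A tiny illustration (using uint16_t for concreteness):
#include <cassert>
#include <cstdint>

uint16_t E(uint16_t x) {                  // swap the two bytes
    return uint16_t((x << 8) | (x >> 8));
}

int main() {
    uint16_t x = 0x036e;
    assert(E(E(x)) == x);                 // E is its own inverse
}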

Converting a uint8_t to its binary representation

I have a variable of type uint8_t which I'd like to serialize and write to a file (which should be quite portable, at least for Windows, which is what I'm aiming at).
Trying to write it to a file in its binary form, I came across this working snippet:
uint8_t m_num = 3;
unsigned int s = (unsigned int)(m_num & 0xFF);
file.write((wchar_t*)&s, 1); // file = std::wofstream
First, let me make sure I understand what this snippet does - it takes my var (which is basically an unsigned char, 1 byte long), converts it into an unsigned int (which is 4 bytes long, and not so portable), and using & 0xFF "extracts" only the least significant byte.
Now, there are two things I don't understand:
Why convert it into unsigned int in the first place, why can't I simply do something like
file.write((wchar_t*)&m_num, 1); or reinterpret_cast<wchar_t *>(&m_num)? (Ref)
How would I serialize a longer type, say a uint64_t (which is 8 bytes long)? unsigned int may or may not be enough here.
uint8_t is 1 byte, same as char
wchar_t is 2 bytes on Windows and 4 bytes on Linux. It also depends on endianness. You should avoid wchar_t if portability is a concern.
You can just use std::ofstream. Windows has an additional std::ofstream constructor which accepts a UTF-16 file name. This way your code is compatible with Windows UTF-16 file names and you can still use std::fstream. For example:
int i = 123;
std::ofstream file(L"filename_in_unicode.bin", std::ios::binary);
file.write((char*)&i, sizeof(i)); //sizeof(int) is 4
file.close();
...
std::ifstream fin(L"filename_in_unicode.bin", std::ios::binary);
fin.read((char*)&i, 4); // output: i = 123
This is relatively simple because it's only storing integers. This will work on different Windows systems, because Windows is always little-endian, and int size is always 4.
But some systems are big-endian; you would have to deal with that separately.
If you use standard formatted I/O, for example fout << 123456, then the integer will be stored as the text "123456". Formatted I/O is portable, but it takes a little more disk space and can be a little slower.
It's compatibility versus performance. If you have large amounts of data (several megabytes or more) and you can deal with compatibility issues in the future, then go ahead and write raw bytes. Otherwise it's easier to use formatted I/O. The performance difference is usually not measurable.
It is impossible to write uint8_t values to a wofstream because a wofstream only writes wide characters and doesn't handle binary values at all.
If what you want to do is to write a wide character representing a code point between 0 and 255, then your code is correct.
If you want to write binary data to a file then your nearest equivalent is ofstream, which will allow you to write bytes.
To answer your questions:
wofstream::write writes wide characters, not bytes. If you reinterpret the address of m_num as the address of a wide character, you will be writing a 16-bit or 32-bit (depending on platform) wide character of which the first byte (that is, the least significant or most significant, depending on platform) is the value of m_num and the remaining bytes are whatever happens to occur in memory after m_num. Depending on the character encoding of the wide characters, this may not even be a valid character. Even if valid, it is largely nonsense. (There are other possible problems if wofstream::write expects a wide-character-aligned rather than a byte-aligned input, or if m_num is immediately followed by unreadable memory).
If you use wofstream then this is a mess, and I shan't address it. If you switch to a byte-oriented ofstream then you have two choices.
1. If you will only ever be reading the file back on the same system, file.write((char*)&myint64value, sizeof(myint64value)) will work. The sequence in which the bytes of the 64-bit value are written is unspecified, but the same sequence will be used when you read them back, so this doesn't matter. Don't try to do something analogous with wofstream because it's dangerous!
2. Extract each of the 8 bytes of myint64value separately (shift right by a multiple of 8 bits and then take the bottom 8 bits) and write it. This is fully portable because you control the order in which the bytes are written.
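A sketch of option 2, writing the 8 bytes in a fixed (here little-endian) order and reassembling them the same way when reading back (the function names are made up for the example):
#include <cstdint>
#include <fstream>

void write_u64(std::ofstream& out, uint64_t v) {
    for (int i = 0; i < 8; ++i)
        out.put(char(uint8_t(v >> (8 * i))));        // LSB first, same on every host
}

uint64_t read_u64(std::ifstream& in) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        char c = 0;
        in.get(c);
        v |= uint64_t(uint8_t(c)) << (8 * i);        // reassemble in the same order
    }
    return v;
}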

size of char being written to file as a binary value in C++

What I understood about the char type from a few questions asked here is that it is always 1 byte in C++, but the number of bits can vary from system to system.
The sizeof() operator uses char as its unit, so sizeof(char) is always 1 in C++'s notion of a byte (which has the number of bits of the smallest addressable unit of the local machine). When using fstream in binary mode, we read and write directly from/to the address of a variable in RAM, so the smallest unit of data written to the file should have the size of the value read from RAM, and vice versa for a read from the file. Can we then say that data may not be written 8 bits at a time if something like this is tried:
ofstream file;
file.open("blabla.bin",ios::out|ios::binary);
char a[]="asdfghjkkll";
file.seekp(0);
file.write((char*)a,sizeof(a)-1);
file.close();
Unless char is always a standard 8-bit byte, what happens if a heap of data is written to a file on a 16-bit machine and then read on a 32-bit machine? Or should I use the OS-dependent text mode instead? If I have misunderstood something, what is the truth?
Edit: I have corrected my mistake.
Thanks for the warning.
Edit 2: My system is 64-bit, but I get the number of bits of the char type as 8. What is wrong? Is the way I obtained the result of 8 flawed?
I got all zeros (00000...) by shifting a char variable with bitwise operators by more than its possible size. After guaranteeing that all bits of the variable are zero, I got all ones (111...) by inverting it, and then shifted until it became zero again. If we shift it as many times as its size in bits, we get zero, so we can obtain the number of bits from the index at which the loop below terminates.
char zero,test;
zero<<=64; //hoping that system is not more than 64 bit(most likely)
test=~zero; //we have a 111...
int i;
for(i=0; test!=zero; i++)
test=test<<1;
The value of i after the loop is the number of bits in the char type. According to this, the result is 8.
My last question is:
Are the filesystem's byte and the char type different data types, because the way the computer addresses pointers in a file stream differs from the standard char type, which is at least 8 bits?
So, exactly what is going on the background?
Edit 3: Why the downvotes? What is my mistake? Isn't the question clear enough? Maybe my question is stupid, but why is there no response related to it?
A language standard can't really specify what the filesystem does - it can only specify how the language interacts with it. The C and C++ standards also don't address anything to do with interoperability or communication between different implementations. In other words, there isn't a general answer to this question except to say that:
the VAST majority of systems use 8-bit bytes
the C and C++ standards require that char is at least 8 bits
it is very likely that greater-than-8-bit systems have mechanisms in place to somehow utilize (or at least transcode) 8-bit files.
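As an aside on the bit-counting experiment in the question: the portable way to ask how many bits a char has is the CHAR_BIT macro from <climits>, rather than a shifting loop:
#include <climits>
#include <iostream>

int main() {
    std::cout << "bits in a char: " << CHAR_BIT << '\n';  // 8 on virtually all current systems
}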

c++: working with bytes

My problem is that I need to load a binary file and work with single bits from the file. After that I need to save it out as bytes, of course.
My main problem is: what data type should I choose to work in – char or long int? Can I somehow work with chars?
Unless performance is mission-critical here, use whatever makes your code easiest to understand and maintain.
Before beginning to code anything, make sure you understand endianness, C++ type sizes, and how strange they might be.
unsigned char is the only type with a fixed size (the natural byte of the machine, normally 8 bits). So if you design for portability, that is a safe bet. But it isn't hard to just use an unsigned int or even a long long to speed up the process, and use sizeof to find out how many bits you are getting in each read, although the code gets more complex that way.
You should know that for true portability none of the built-in types of C++ has a fixed size. An unsigned char might have 9 bits, and an int might cover a range as small as 0 to 65535, as noted in this and this answer.
Another alternative, as user1200129 suggests, is to use the Boost integer library to remove all these uncertainties, provided Boost is available on your platform. Although, if going for external libraries, there are many serialization libraries to choose from.
But first and foremost, before you even start optimizing, make something simple that works. Then you can start profiling when you start experiencing timing issues.
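If you do go with unsigned char, a small sketch of loading a file into bytes, poking at individual bits, and writing the bytes back out (file names are placeholders):
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("input.bin", std::ios::binary);
    std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    if (data.empty()) return 0;

    bool bit3 = (data[0] >> 3) & 1u;   // read bit 3 of the first byte
    data[0] |= (1u << 3);              // set bit 3 of the first byte
    (void)bit3;                        // silence unused-variable warnings in this sketch

    std::ofstream out("output.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(data.size()));
}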
It really just depends on what you are wanting to do, but I would say in general, the best speed will be to stick with the size of integers that your program is compiled in. So if you have a 32 bit program, then choose 32 bit integers, and if you have 64 bit, choose 64 bit.
This could be different if there are some bytes in your file, or if there are integers. Without knowing the exact structure of your file, it's difficult to determine what the optimal value is.
Your sentences are not really correct English, but as far as I can interpret the question, you would be better off using the unsigned char type (which is a byte) to be able to modify each byte separately.
Edit: changed according to comment.
If you are dealing with bytes then the best way to do this is to use a size specific type.
#include <algorithm>
#include <iterator>
#include <cstdint>
#include <vector>
#include <fstream>
int main()
{
    std::vector<int8_t> file_data;
    std::ifstream file("file_name", std::ios::binary);
    file >> std::noskipws;  // don't skip whitespace bytes while reading
    //read
    std::copy(std::istream_iterator<int8_t>(file),
              std::istream_iterator<int8_t>(),
              std::back_inserter(file_data));
    //write
    std::ofstream out("outfile", std::ios::binary);
    std::copy(file_data.begin(), file_data.end(),
              std::ostream_iterator<int8_t>(out));
}
EDIT fixed bug
If you need to enforce how many bits are in an integer type, you need to be using the <stdint.h> header (<cstdint> in C++). It is present in both C and C++. It defines types such as uint8_t (an 8-bit unsigned integer), which are guaranteed to resolve to the proper type on the platform. It also tells other programmers who read your code that the number of bits is important.
If you're worrying about performance, you might want to use larger-than-8-bit types, such as uint32_t. However, when reading and writing files, you will need to pay attention to the endianness of your system. Notably, if you have a little-endian system (e.g. x86, most ARM), then the 32-bit value 0x12345678 will be written to the file as the four bytes 0x78 0x56 0x34 0x12, while if you have a big-endian system (e.g. SPARC, PowerPC, Cell, some ARM, and the Internet), it will be written as 0x12 0x34 0x56 0x78. (The same goes for reading.) You can, of course, work with 8-bit types and avoid this issue entirely.
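A small sketch that shows which of those two byte sequences your own machine would produce for 0x12345678:
#include <cstdint>
#include <iomanip>
#include <iostream>

int main() {
    uint32_t v = 0x12345678;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&v);
    // Little-endian machines print "78 56 34 12", big-endian machines "12 34 56 78".
    for (unsigned i = 0; i < sizeof(v); ++i)
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << unsigned(p[i]) << ' ';
    std::cout << '\n';
}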