I am working on some homework for Huffman coding. I already have the Huffman algorithm completed, but need to slightly alter it to work with binary files. I have spent some time reading related questions, and, perhaps due to my lack of understanding of data types and binary files, I am still struggling a bit, so hopefully I am not repeating a prior question (I won't be posting code related to the Huffman part of the program).
Here is the key phrase: "You can assume that each symbol, which will be mapped to a codeword, is a 4-byte binary string." What I think I know is that a char represents one byte and an unsigned int represents four bytes, so I am guessing I should be reading the input four bytes at a time into an unsigned int buffer and then collecting my data for the Huffman part of the program.
#include <fstream>
using namespace std;

int main() {
    unsigned int buffer;
    fstream input;
    input.open("test.txt", ios::in | ios::binary);
    // test the read itself, so a failed read at end-of-file is not processed
    while (input.read(reinterpret_cast<char *>(&buffer), 4)) {
        // if buffer does not exist as a unique symbol in the collection of data, add it
        // if buffer exists, update the statistics of the symbol
    }
    input.close();
}
Does this look like a good way to handle the data? How should I handle the very end of the file if there are only 1, 2, or 3 bytes left? So then I am just storing the buffer as an unsigned int in a struct. Just out of curiosity, how would I recast the buffer to a string of characters?
Edit: What's the best way to store the header of a Huffman-compressed file?
Does this look like a good way to handle the data?
Instead of casting a pointer, I would suggest using a union of int and char[4] and passing a pointer to the char array, as you should be. I don't know what the rest of the logic is, so I can't say whether the actual handling (which is not in the code you posted) is done in a good way, but it seems rather trivial to me.
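For example, a minimal sketch of the union approach (the names are illustrative, not from the question):

union Symbol {
    unsigned int value; // the 4-byte symbol viewed as one integer
    char bytes[4];      // the same 4 bytes, usable directly with read()
};

Symbol sym;
input.read(sym.bytes, sizeof(sym.bytes)); // no pointer cast needed
// sym.value now holds the symbol (note: reading the member that was not
// last written is technically undefined behavior in C++, though most
// compilers support it for types like these)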
How should I handle the very end of the file if there are only 1,2, or 3 bytes left?
Assuming each symbol is 4 bytes long, I would not expect that to be valid input.
So then I am just storing buffer as unsigned int in a struct. Just out of curiosity how would I recast buffer to a string of characters?
Why would you do that? In your data, a "character" is 4 bytes. But you can just cast to an array of bytes if you want (or, better, use bitwise operations to extract the actual bytes, if the order matters).
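For example, a sketch of extracting the four bytes with bitwise operations, most significant byte first (the symbol value here is made up for illustration):

unsigned int buffer = 0x41424344; // the stored 4-byte symbol
char chars[5];
chars[0] = (buffer >> 24) & 0xFF;
chars[1] = (buffer >> 16) & 0xFF;
chars[2] = (buffer >> 8) & 0xFF;
chars[3] = buffer & 0xFF;
chars[4] = '\0'; // nul terminator so chars can be used as a C string
// chars now contains "ABCD"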
Related
My problem is that I have a string in hex representation say:
036e. I want to write it to a binary file in the same hex representation. I first converted the string to an integer using the strtoul() function. Then
I use fwrite() function to write that to a file. Here is the code that I wrote. I get the following output in my file after running this:
6e03 0000
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib> // strtoul
using namespace std;

int main() {
    ofstream fileofs("binary.bin", ios::binary | ios::out);
    string s = "036e";
    int x = strtoul(s.c_str(), NULL, 16);
    fileofs.write((char*)&x, sizeof(int));
    fileofs.close();
}
While the result that I expect is something like this:
036e
Can anybody explain exactly what I'm doing wrong over here?
Your problem has to do with endianness and also with the size of an integer.
For the reversed bytes, the explanation is that you are running on a little-endian system.
For the two extra zero bytes, the explanation is that you are using a 32-bit compiler where ints have 4 bytes.
There is nothing wrong with your code as long as you are always going to use it on 32-bit, little-endian systems.
Provided that you keep the system, if you read your integers from the file using similar code you'll get the right values (your first read integer will be 0x36E).
To write the data as you wish you could use exactly the same code with minor changes, as noted below:
// htons() is declared in <arpa/inet.h> on POSIX systems (<winsock2.h> on Windows)
unsigned short x = htons(strtoul(s.c_str(), NULL, 16));
fileofs.write((char*)&x, sizeof(x));
But you must be aware that when you read back the data you must convert it to the right format using ntohs(). If you write your code this way it will work with any compiler on any system, as the network order is always the same and the converting functions only perform changes when necessary.
You can find more information on another thread here and in the linux.com man page for those functions.
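Reading the value back would then look something like this (a sketch; binary.bin and the 16-bit width follow the code above):

#include <arpa/inet.h> // ntohs (POSIX; use <winsock2.h> on Windows)
#include <cstdint>
#include <fstream>
using namespace std;

int main() {
    ifstream fileifs("binary.bin", ios::binary | ios::in);
    uint16_t x = 0;
    fileifs.read((char*)&x, sizeof(x));
    x = ntohs(x); // back to host order; x == 0x036e on any system
}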
If you want 16 bits, use a data type that is guaranteed to be 16 bits. uint16_t from <cstdint> should do the trick.
Next: endianness.
This is described in detail in many places. The TL;DR version is that some systems, including virtually every desktop PC you are likely to write code for, store their integers with the bytes BACKWARD. It is way out of scope to explain why, but when you get down into the common usage patterns it does make sense.
So what you see as 036e is two bytes, 03 and 6e, stored with the least significant byte, 6e, first. What the computer sees is a two-byte value containing 6e03. This is what is written to the output file unless you take steps to force an ordering on the output.
There are tons of different ways to force an ordering, but let's focus on the one that both always works (even when porting to a system that is already big-endian) and is easy to read.
uint16_t in;
uint8_t out[2];
out[0] = (in >> 8) & 0xFF; // highest byte of in goes into the first out byte
out[1] = in & 0xFF;        // lowest byte of in goes into the second out byte
out is then written to the file.
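The matching decode when reading the file back is just the reverse (a sketch, not from the original answer):

uint8_t bytes[2];
// ... read the two bytes back from the file ...
uint16_t value = (static_cast<uint16_t>(bytes[0]) << 8) | bytes[1]; // high byte first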
Recommended supplementary reading: Serialization. This will help explain the common next problem: "Why do my strings crash my program after I read them back in?"
Recently I was (again) reading about endianness. I know how to identify the endianness of the host, as there are lots of posts on SO, and I have also seen this, which I think is a pretty good resource.
However, one thing I'd like to know is how to detect the endianness of an input binary file. For example, I am reading a binary file (using C++) like the following:
ifstream mydata("mydata.raw", ios::binary);
short value;
char buf[sizeof(short)];
int dataCount = 0;
short myDataMat[DATA_DIMENSION][DATA_DIMENSION];
// buf is already a char array, so no cast is needed; the bounds check
// prevents overrunning myDataMat if the file holds more values than expected
while (dataCount < DATA_DIMENSION * DATA_DIMENSION &&
       mydata.read(buf, sizeof(buf)))
{
    memcpy(&value, buf, sizeof(value));
    myDataMat[dataCount / DATA_DIMENSION][dataCount % DATA_DIMENSION] = value;
    dataCount++;
}
I'd like to know how I can detect the endianness of the data in mydata.raw, and whether endianness affects this program in any way.
Additional Information:
I am only manipulating the data in myDataMat using mathematical operations; no pointer operations or bitwise operations are done on the data.
My machine (host) is little endian.
It is impossible to "detect" the endianness of data in general. Just as it is impossible to detect whether the data is an array of 4-byte integers, or twice that many 2-byte integers. Without any knowledge about the representation, raw data is just a mass of meaningless bits.
However, with some extra knowledge about the data representation, it becomes possible. Some examples:
Most file formats mandate a particular endianness, in which case this is never a problem.
Unicode text files may optionally start with a byte order mark. The same idea can be implemented by other data representations.
Some file formats contain a checksum. You can guess one endianness, and if the checksum does not match, try again with the other. It is unlikely that the checksum will match under the wrong interpretation of the data.
Sometimes you can make guesses based on the data. Is the temperature outside 33'554'432 degrees, or maybe 2? You can pick the endianness that represents sane data. Of course, this type of guesswork fails miserably when the aliens invade and start melting our planet.
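For example, a sketch of the magic-number idea (the format and the value 0x00010203 are hypothetical):

#include <cstdint>
#include <fstream>

// Returns 0 if the file matches host byte order, 1 if every value needs
// swapping, and -1 if the magic number is not recognized at all.
int checkByteOrder(std::ifstream& in) {
    uint32_t magic = 0;
    in.read(reinterpret_cast<char*>(&magic), sizeof(magic));
    if (magic == 0x00010203u) return 0; // matches host order
    if (magic == 0x03020100u) return 1; // byte-reversed: swap all values
    return -1;                          // unknown format
}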
You can't tell.
The endianness transformation is essentially an operator E(x) on a number x such that x = E(E(x)). So you don't know "which way round" the x elements are in your file.
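To illustrate with a 16-bit sketch: byte swapping is an involution, so applying it twice returns the original value, and the bytes alone cannot tell you whether they have already been swapped.

#include <cstdint>

uint16_t E(uint16_t x) {
    return static_cast<uint16_t>((x << 8) | (x >> 8)); // swap the two bytes
}
// E(E(x)) == x for every x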
I'm running a physics simulation and I'd like to improve the way it handles its data. I'm saving and reading files that contain one float, then two ints, followed by 512*512 = 262144 values of +1 or -1, weighing in at 595 kB per data file. All these numbers are separated by a single space.
I'm saving hundreds of thousands of these files, so it quickly adds up to gigabytes of storage. I'd like to know if there is a quick (hopefully CPU-light) way of compressing and decompressing this kind of data on the go (I mean without tarring/untarring before/after use).
How much could I expect to save in the end?
If you want relatively fast read-write, you would probably want to store and read them in "binary" format, i.e. native to the way they are internally stored in bytes. A float uses 4 bytes of data, and you do not need any kind of "separator" when storing a large sequence of them.
To do this you might consider Boost's Serialization library.
Note that using data compression methods (zlib etc.) will save you bytes stored but will be relatively slow to zip and unzip for each use.
Storing in binary format will not only use less disk storage (than storing in text format) but should also be more performant, not just because there is less file I/O but also because there is no string writing/parsing going on.
Note that when you input/output to a binary_iarchive or binary_oarchive you pass in an underlying istream or ostream, and if this is a file, you need to open it with the ios::binary flag, because of the issue of line endings potentially being converted.
Even if you do decide that data-compression (zlib or some other library) is the way to go, it is still worth using boost::serialize to get your data into a "blob" to compress. In that case you would probably use std::ostringstream as your output stream to create the blob.
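A minimal sketch of that blob approach (the SimData struct and its fields are assumptions standing in for the question's float, two ints, and weights):

#include <sstream>
#include <string>
#include <boost/archive/binary_oarchive.hpp>

struct SimData {
    float f;
    int a, b;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & f & a & b; // the weights would be serialized here too
    }
};

std::string makeBlob(const SimData& d) {
    std::ostringstream oss(std::ios::binary);
    boost::archive::binary_oarchive oa(oss);
    oa << d;          // serialize into the in-memory stream
    return oss.str(); // the blob to hand to zlib or similar
}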
Incidentally, if you have 2^18 "boolean" values that can only be 1 or -1, you only need 1 bit for each one (they would be physically stored as 1 or 0, but you would logically translate that). That would come to 2^15 bytes, which is 32K, not 595K.
Given the extra info about the valid data, define your class like this:-
class Data
{
float m_float_value;
int m_int_value_1, m_int_value_2;
unsigned m_weights [8192];
};
Then use binary file IO to stream this class to and from a file, don't convert to text!
The weights are stored as Boolean values, packed into unsigned integers.
To get the weight, add an accessor:-
int Data::GetWeight (size_t index)
{
return m_weights [index >> 5] & (1 << (index & 31)) ? 1 : -1;
}
This gives you a data file size of 32780 bytes (5.4% of the original) assuming no padding is added to the class data.
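A matching setter might look like this (a sketch, not part of the original answer):

void Data::SetWeight (size_t index, int weight)
{
    if (weight == 1)
        m_weights [index >> 5] |= (1u << (index & 31));  // +1 stored as a set bit
    else
        m_weights [index >> 5] &= ~(1u << (index & 31)); // -1 stored as a clear bit
}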
I would suggest that if you are concerned about size, a binary format would be the most useful way to "compress" your data. It sounds like you are dealing with something like the following:
#include <fstream>

struct data {
    float a;
    int b, c;
    signed char d[512][512];
};

void someFunc() {
    data* someData = new data;
    std::ifstream inFile("inputData.bin", std::ifstream::binary);
    std::ofstream outFile("outputData.bin", std::ofstream::binary);
    // Read from file (read/write operate on char*, hence the casts)
    inFile.read(reinterpret_cast<char*>(someData), sizeof(data));
    inFile.close();
    // Write to file
    outFile.write(reinterpret_cast<const char*>(someData), sizeof(data));
    outFile.close();
    delete someData;
}
I should also mention that if you encode your +1/-1 values as bits you should get a lot of space savings (another factor of 8 on top of what I'm showing here), as sketched below.
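A rough sketch of that packing, building on the struct and streams from the snippet above (+1 becomes a set bit, -1 a clear bit, 8 values per byte):

unsigned char packed[512 * 512 / 8] = {0}; // 32768 bytes instead of 262144
for (int i = 0; i < 512; ++i)
    for (int j = 0; j < 512; ++j)
        if (someData->d[i][j] == 1)
            packed[(i * 512 + j) / 8] |= 1u << ((i * 512 + j) % 8);
outFile.write(reinterpret_cast<const char*>(packed), sizeof(packed));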
For that amount of data, anything homemade isn't going to perform anywhere near as well as good-quality open-source binary-storage libraries. Try Boost.Serialization or, for this type of storage requirement, HDF5. I've used HDF5 successfully on a few projects with very large amounts of double, float, long and int data. I found it useful that one can control the compression rate vs. CPU effort on the fly per "file". Also useful is storing millions of "files" in a single hierarchically-structured "disk" file. NASA (probably ripping off my style ;)) also uses it.
I'm reading up on the write method of basic_ostream objects and this is what I found on cppreference:
basic_ostream& write( const char_type* s, std::streamsize count );
Behaves as an UnformattedOutputFunction. After constructing and checking the sentry object, outputs the characters from successive locations in the character array whose first element is pointed to by s. Characters are inserted into the output sequence until one of the following occurs:
exactly count characters are inserted
inserting into the output sequence fails (in which case setstate(badbit) is called)
So I get that it writes a chunk of characters from a buffer into the stream, and the number of characters written is given by count. But there are a few things of which I'm not sure. These are my questions:
Should I use write only when I want to specify how many bytes I want written to a stream? Because normally when you print a char array it will print the entire array until it reaches the null byte, but with write you can specify how many characters you want written.
char greeting[] = "Hello World";
std::cout << greeting; // prints the entire string
std::cout.write(greeting, 5); // prints "Hello"
But maybe I'm misinterpreting something with this one.
And I often see this in code samples that use write:
stream.write(reinterpret_cast<char*>(buffer), sizeof(buffer));
Why is the reinterpret_cast to char* being used? When should I know to do something like that when writing to a stream?
If anyone can help me with these two questions it would be greatly appreciated.
Should I use write only when I want to specify how many bytes I want to write to a stream?
Yes - you should use write when there's a specific number of bytes of data arranged contiguously in memory that you'd like written to the stream in order. But sometimes you might want a specific number of bytes and need to get them another way, such as by formatting a double's ASCII representation to have specific width and precision.
Other times you might use <<, but that has to be user-defined for non-builtin types, and when it is defined (normally for the better, but it may be worse for your purposes) it prints whatever the class designer chose, including potentially data that's linked from the object via pointers or references, static data of interest, and/or values calculated on the fly. It may change the data representation: say, converting binary doubles to ASCII representations, or ensuring a network byte order regardless of the host's endianness. It may also omit some of the object's data, such as cache entries, counters used to manage but not logically part of the data, array elements that aren't populated, etc.
Why is the reinterpret_cast to char* being use? When should I know to do something like that when writing to a stream?
The write() function signature expects a const char* argument, so this conversion is being done. You'll need to use a cast whenever you can't otherwise get a char* to the data.
The cast reflects the way write() treats the data starting at the first byte of the object: as 8-bit values, without any consideration of the actual pre-cast type of the data. This ties in with being able to do things like, say, a write() of the last byte of a float and the first 3 bytes of a double appearing next in the same structure - all the data boundaries and interpretation are lost after the reinterpret_cast<>.
(You've actually got to be more careful of this when doing a read() of bytes from an input stream... say you read data that constituted a double when written into memory that's not aligned appropriately for a double; if you then try to use it as a double, you may get a SIGBUS or similar alignment exception from your CPU, or degraded performance, depending on your system.)
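A sketch of the alignment-safe pattern (assuming in is an open input stream): read into a char buffer first, then memcpy into a properly aligned object.

#include <cstring> // std::memcpy

char raw[sizeof(double)];
in.read(raw, sizeof(raw));       // raw bytes; alignment is irrelevant here
double d;
std::memcpy(&d, raw, sizeof(d)); // d is properly aligned, so no SIGBUS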
basic_ostream::write and its counterpart basic_istream::read are used to perform unformatted I/O on a data stream. Typically this is raw binary data, which may or may not contain printable ASCII characters.
The main difference between read/write and the formatted operators like <<, >>, getline etc. is that the former make no assumptions about the data being worked on -- you have full control over what bytes get read from and written to the stream. The latter may skip over whitespace, discard or ignore it, and so on.
To answer your second question, the reinterpret_cast<char *> is there to satisfy the function signature and to work with the buffer a byte at a time. Don't let the type char fool you: the reason char is used is that it's the smallest builtin primitive type provided by the language. Perhaps a better name would be something like uint8 to indicate it's really an unsigned byte type.
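To make the contrast concrete, a small sketch of formatted vs. unformatted output of the same value:

#include <cstdint>
#include <iostream>

int main() {
    uint32_t n = 65;
    std::cout << n << '\n'; // formatted: prints the two characters "65"
    // unformatted: writes the four raw bytes of n; on a little-endian
    // machine that is 0x41 0x00 0x00 0x00, i.e. 'A' followed by three NULs
    std::cout.write(reinterpret_cast<const char*>(&n), sizeof(n));
}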
This is my function which creates a binary file
void writefile()
{
    ofstream myfile ("data.abc", ios::out | ios::binary);
    streamoff offset = 1;
    if (myfile.is_open())
    {
        char c = 'A';
        myfile.write(&c, offset);
        c = 'B';
        myfile.write(&c, offset);
        c = 'C';
        myfile.write(&c, offset);
        myfile.write(StartAddr, streamoff(16));
        myfile.close();
    }
    else
        cout << "Some error" << endl;
}
The value of StartAddr is 1000, hence the expected output file is:
A B C 1000 NUL NUL NUL
However, strangely my output file appends this: data.abc
So the final outcome is: A B C 1000 NUL NUL NUL data.abc
Please help me out with this. How should I deal with it, and why is this strange behavior happening?
I recommend you quit with binary writing and work on writing the data in a textual format. You've already encountered some of the problems with writing binary data; there are still issues to come across with reading it and with portability. Expect more pain if you continue down this route.
Use textual representations. For simplicity, you can put one field per line and use std::getline to read each one in. The textual representation allows you to view the data easily in any text editor. Try using Notepad to view a binary file!
Oh, but binary data is soo much faster and takes up less space in the file. You've already wasted more time and money than you would gain by using binary data. The speed of computers and huge memory capacities (disk and RAM) make binary representations a thing of the past (except in extreme cases).
As a learning tool, go ahead and use binary. For ease of development and quick schedules (IOW, finishing early), use textual representations.
Search Stack Overflow for "C++ micro optimization" for the justifications.
There are several issues with this code.
For starters, if you want to write individual characters to a stream, you don't need to use ostream::write. Instead, just use ostream::put, as shown here:
myfile.put('A');
Second, if you want to write out a string into a file stream, just use the stream insertion operator:
myfile << StartAddr;
This is perfectly safe, even in binary mode.
As for the particular problem you're reporting, I think the issue is that you're trying to write out a string of length four (StartAddr), but you've told the stream to write out sixteen bytes. This means that you're writing out the four bytes of the string contents, then the null terminator, and then eleven bytes of whatever happens to be in memory after the buffer. In your case, that is two more null bytes, then the meaningless text that you saw. To fix this, either change your code to write fewer bytes or, if StartAddr is a string, just write it using <<.
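A sketch of either fix, assuming StartAddr is a nul-terminated C string (strlen requires <cstring>):

myfile.write(StartAddr, strlen(StartAddr)); // write only the bytes the string holds
// or, more simply, let operator<< measure the string itself:
myfile << StartAddr;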
With the line myfile.write(StartAddr,streamoff (16) ); you are instructing the myfile object to write 16 bytes to the stream starting at the address StartAddr. Imagine that StartAddr is an array of 16 bytes:
char StartAddr[16] = "1000\0\0\0data.abc"; // 15 characters plus the implicit nul terminator
myfile.write(StartAddr, sizeof(StartAddr));
Would generate the output that you see. Without seeing the declaration / definition of StartAddr I cannot say for certain, but it appears you are writing out a five-byte nul-terminated string "1000" followed by whatever happens to reside in the next 11 bytes after StartAddr. In this case, it appears a couple of nul bytes followed by the constant nul-terminated string "data.abc" (which the compiler must put somewhere in memory) are what follow StartAddr.
Regardless, it is clear that you overread a buffer.
If you are trying to write a 16-bit integer type to a stream you have a couple of options, both based on the fact that there are typically 8 bits in a byte. The 'cleanest' one would be something like:
char x = (StartAddr & 0xFF); // low byte first
myfile.put(x);               // put() writes a single character
x = (StartAddr >> 8);        // then the high byte
myfile.put(x);
This assumes StartAddr is a 16 bit integer type and does not take into account any translation that might occur (such as potential conversion of a value of 10 [a linefeed] into a carriage return / linefeed sequence).
Alternatively, you could write something like:
myfile.write(reinterpret_cast<char*>(&StartAddr), sizeof(StartAddr));