Compressing/decompressing data on the go - C++

I'm running a physics simulation and I'd like to improve the way it handles its data. I'm saving and reading files that contain one float, then two ints, followed by 512*512 = 262144 values of +1 or -1, all separated by a single space; each data file ends up weighing about 595 KB.
I'm saving hundreds of thousands of these files, so it quickly adds up to gigabytes of storage. I'd like to know if there is a quick (hopefully light on CPU effort) way of compressing and decompressing this kind of data on the go (I mean without tarring/untarring before/after use).
How much could I expect to save in the end?

If you want relatively fast read-write, you would probably want to store and read them in "binary" format, i.e. native to the way they are internally stored in bytes. A float uses 4 bytes of data, and you do not need any kind of "separator" when storing a large sequence of them.
To do this you might consider boost's "serialize" library.
Note that using data compression methods (zlib etc) will save you on bytes stored but will be relatively slow to zip and unzip them for usage.
Storing in binary format will not only use less disk storage (than storing in text format) but should also be more performant, not just because there is less file I/O but also because there is no string writing/parsing going on.
Note that when you input/output to binary_iarchive or binary_oarchive you pass in an underlying istream or ostream and if this is a file, you need to open it with ios::binary flag because of the issue of line-endings being potentially converted.
Even if you do decide that data-compression (zlib or some other library) is the way to go, it is still worth using boost::serialize to get your data into a "blob" to compress. In that case you would probably use std::ostringstream as your output stream to create the blob.
Incidentally, if you have 2^18 "boolean" values that can only be 1 or -1, you only need 1 bit for each one (they would be physically stored as 1 or 0, but you would logically translate that). That would come to 2^15 bytes, which is 32 KB, not 595 KB.
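For illustration, a minimal sketch of that bit-packing idea with raw binary streams; the Record layout, names and the plain-fstream approach here are my own assumptions, not part of the answer:

#include <bitset>
#include <cstddef>
#include <cstdint>
#include <fstream>

// One record: a float, two ints, and 512*512 signs packed one bit each.
struct Record {
    float f;
    std::int32_t a, b;
    std::bitset<512 * 512> signs;   // bit set -> +1, bit clear -> -1
};

void save(const Record& r, const char* path) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&r.f), sizeof r.f);
    out.write(reinterpret_cast<const char*>(&r.a), sizeof r.a);
    out.write(reinterpret_cast<const char*>(&r.b), sizeof r.b);
    // Pack the bitset into bytes by hand; std::bitset has no guaranteed layout.
    for (std::size_t i = 0; i < r.signs.size(); i += 8) {
        unsigned char byte = 0;
        for (int j = 0; j < 8; ++j)
            byte |= static_cast<unsigned char>(r.signs[i + j]) << j;
        out.put(static_cast<char>(byte));
    }
}

Reading back is the mirror image; each record then takes roughly 32 KB instead of 595 KB, before any general-purpose compression is applied on top.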

Given the extra info about the valid data, define your class like this:-
class Data
{
    float m_float_value;
    int m_int_value_1, m_int_value_2;
    unsigned m_weights [8192];   // 512*512 = 262144 bits, packed 32 to an unsigned
};
Then use binary file IO to stream this class to and from a file, don't convert to text!
The weights are stored as Boolean values, packed into unsigned integers.
To get the weight, add an accessor:-
int Data::GetWeight (size_t index)
{
    return m_weights [index >> 5] & (1 << (index & 31)) ? 1 : -1;
}
This gives you a data file size of 32780 bytes (about 5.4% of the original), assuming the compiler adds no padding to the class data.
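A matching setter (my own sketch, not part of the original answer) could look like this:

void Data::SetWeight (size_t index, int weight)
{
    // weight is expected to be +1 or -1; +1 sets the bit, -1 clears it
    unsigned mask = 1u << (index & 31);
    if (weight > 0)
        m_weights [index >> 5] |= mask;
    else
        m_weights [index >> 5] &= ~mask;
}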

I would suggest that if you are concerned about size, a binary format would be the most useful way to "compress" your data. It sounds like you are dealing with something like the following:
struct data {
    float a;
    int b, c;
    signed char d[512][512];
};

void someFunc() {
    data* someData = new data;
    std::ifstream inFile("inputData.bin", std::ifstream::binary);
    std::ofstream outFile("outputData.bin", std::ofstream::binary);
    // Read from file
    inFile.read(reinterpret_cast<char*>(someData), sizeof(data));
    inFile.close();
    // Write to file
    outFile.write(reinterpret_cast<const char*>(someData), sizeof(data));
    outFile.close();
    delete someData;
}
I should also mention that if you encode your +1/-1 values as bits you should get a lot of extra space savings (another factor of 8 on top of what I'm showing here).

For that amount of data, anything homemade isn't going to perform anywhere near as well as good-quality open-source binary-storage libraries. Try boost serialize or - for this type of storage requirement - HDF5. I've used HDF5 successfully on a few projects with very large amounts of double, float, long and int data. I found it useful that one can control the compression rate vs CPU effort on the fly, per "file". Also useful is storing millions of "files" in a hierarchically-structured single "disk" file. NASA - probably ripping my style ;) - also uses it.
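To give an idea of what that looks like, here is a rough sketch against the HDF5 C++ API; the file name, dataset name, and deflate level are arbitrary choices of mine, not anything from the answer:

#include "H5Cpp.h"

void save_weights(const signed char (&w)[512][512])
{
    H5::H5File file("simulation.h5", H5F_ACC_TRUNC);

    hsize_t dims[2] = {512, 512};
    H5::DataSpace space(2, dims);

    // Chunking must be enabled before the built-in deflate (zlib) filter can be used.
    H5::DSetCreatPropList props;
    props.setChunk(2, dims);
    props.setDeflate(6);                 // zlib level 0-9: CPU effort vs size

    H5::DataSet ds = file.createDataSet("weights", H5::PredType::NATIVE_SCHAR,
                                        space, props);
    ds.write(w, H5::PredType::NATIVE_SCHAR);
}

Many datasets (one per simulation snapshot, say) can live in the same .h5 file, grouped hierarchically, which is the "millions of files in one disk file" point above.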

Related

Portability for Binary File in C++?

I have a question regarding binary I/O and the portability of the binary file.
Let's say the PC running my software uses 8 bytes for storing a double variable.
The binary file generated will have 8 bytes for a double variable.
Now say the file is being opened in a PC which uses 6 bytes for a double variable (just assuming).
Then the application will read only 6 bytes from the file and store it in the double variable in memory.
Not only does this result in underflow/overflow of data, but the data read after the double will also be incorrect because of the 2-byte offset created by under-reading.
I want to support my application not only on 32/64-bit, but also on Windows and Ubuntu PCs.
So how do you make sure that the data read from the same file on any PC is the same?
In general, you should wrap the data to be stored in binaries in your own data structures and implement platform-independent read/write operations for those data structures - basically, the size of a binary data structure written to disk should be the same for all platforms (the maximum possible size of the elementary data over all supported platforms).
When writing data on a platform with a smaller data size, the data should be padded with extra 0 bytes to ensure the size of the recorded data stays the same.
When reading, the whole data can be read in fixed blocks of known size, and conversion should be performed depending on the platform it was written on/is being read on. This should take care of endianness too. You may want to include some header indicating the sizes of the data to distinguish between files recorded on different platforms when reading them.
This gives truly platform-independent serialization for binary files.
Example for doubles
#include <fstream>

class CustomDouble
{
public:
    double val;
    static const int DISK_SIZE;

    void toFile(std::ofstream &file)
    {
        int bytesWritten(0);
        file.write(reinterpret_cast<const char*>(&val), sizeof(val));
        bytesWritten += sizeof(val);
        // Pad with zero bytes so every platform writes exactly DISK_SIZE bytes.
        while (bytesWritten < CustomDouble::DISK_SIZE)
        {
            char byte(0);
            file.write(&byte, sizeof(byte));
            bytesWritten += sizeof(byte);
        }
    }
};

const int CustomDouble::DISK_SIZE = 8;
This ensures you always write 8 bytes regardless of the size of double on your platform. When you read the file, you always read those 8 bytes, still as binary, and do conversions if necessary depending on which platform it was written on / is being read on (you will probably add some small header to the file to identify the platform it was recorded on).
While the custom conversion does add some overhead, it is far less than that of storing values as text, and normally you will only perform conversions for incompatible platforms, while for the same platform there will be no overhead.
cstdint includes type definitions that are a fixed size, so int32_t will always be 4 bytes long. You can use these in place of regular types when the size of the type is important to you.
Use Google Protocol Buffers or any other cross-platform serialization library. You can also roll your own solution, based on the fact that char is guaranteed to be 1 byte (i.e. serialize everything into char arrays).
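As a small illustration of the roll-your-own route (my own sketch, with little-endian chosen arbitrarily as the on-disk order), using <cstdint> fixed-width types and a char buffer:

#include <cstdint>
#include <vector>

// Append a 32-bit value to a byte buffer in little-endian order,
// independent of the host's endianness or the host's size of int.
void put_u32(std::vector<char>& buf, std::uint32_t v)
{
    buf.push_back(static_cast<char>( v        & 0xFF));
    buf.push_back(static_cast<char>((v >> 8)  & 0xFF));
    buf.push_back(static_cast<char>((v >> 16) & 0xFF));
    buf.push_back(static_cast<char>((v >> 24) & 0xFF));
}

// Reassemble a 32-bit value from 4 bytes stored little-endian.
std::uint32_t get_u32(const char* p)
{
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(p[0]))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(p[1])) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(p[2])) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(p[3])) << 24);
}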

Read large amount of ASCII numbers and write in binary form

I have data files with about 1.5 Gb worth of floating-point numbers stored as ASCII text separated by whitespace, e.g., 1.2334 2.3456 3.4567 and so on.
Before processing such numbers I first translate the original file to binary format. This is helpful because I can choose whether to use float or double, reduce file size (to about 800 MB for double and 400 MB for float), and read in chunks of the appropriate size once I am processing the data.
I wrote the following function to make the ASCII-to-binary translation:
template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
    RealType value;
    std::fstream src(fsrc.c_str(), std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst.c_str(), std::fstream::out | std::fstream::binary);
    while(src >> value){
        dst.write((char*)&value, sizeof(RealType));
    }
    // RAII closes both files
}
I would like to speed up ascii_to_binary, and I seem unable to come up with anything. I tried reading the file in chunks of 8192 bytes, and then processing the buffer in another subroutine. This seems very complicated because the last few characters in the buffer may be whitespace (in which case all is good), or a truncated number (which is very bad) - the logic to handle the possible truncation hardly seems worth it.
What would you do to speed up this function? I would rather rely on standard C++ (C++11 is OK) with no additional dependencies, like boost.
Thank you.
Edit:
@DavidSchwartz:
I tried to implement your suggestion as follows:
template<typename RealType=float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst){
    std::vector<RealType> buffer;
    typedef typename std::vector<RealType>::iterator VectorIterator;
    buffer.reserve(65536);
    std::fstream src(fsrc, std::fstream::in | std::fstream::binary);
    std::fstream dst(fdst, std::fstream::out | std::fstream::binary);
    while(true){
        size_t k = 0;
        while(k<65536 && src >> buffer[k]) k++;
        dst.write((char*)&buffer[0], buffer.size());
        if(k<65536){
            break;
        }
    }
}
But it does not seem to be writing the data! I'm working on it...
I did exactly the same thing, except that my fields were separated by tab '\t' and I had to also handle non-numeric comments on the end of each line and header rows interspersed with the data.
Here is the documentation for my utility.
And I also had a speed problem. Here are the things I did to improve performance by around 20x:
Replace explicit file reads with memory-mapped files. Map two blocks at once. When you are in the second block after processing a line, remap with the second and third blocks. This way a line that straddles a block boundary is still contiguous in memory. (Assumes that no line is larger than a block, you can probably increase blocksize to guarantee this.)
Use SIMD instructions such as _mm_cmpeq_epi8 to search for line endings or other separator characters. In my case, any line containing an '=' character was a metadata row that needed different processing.
Use a barebones number parsing function (I used a custom one for parsing times in HH:MM:SS format, strtod and strtol are perfect for grabbing ordinary numbers). These are much faster than istream formatted extraction functions.
Use the OS file write API instead of the standard C++ API.
If you dream of throughput in the 300,000 lines/second range, then you should consider a similar approach.
Your executable also shrinks when you don't use C++ standard streams. I've got 205KB, including a graphical interface, and only dependent on DLLs that ship with Windows (no MSVCRTxx.dll needed). And looking again, I still am using C++ streams for status reporting.
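To illustrate the barebones-parsing point, a minimal sketch (mine, not the utility's actual code) that pulls doubles out of an in-memory, NUL-terminated text buffer with strtod; how the buffer is filled (mmap or otherwise) is left out:

#include <cstdlib>
#include <vector>

// Parse every whitespace-separated number in a NUL-terminated text buffer.
std::vector<double> parse_all(const char* text)
{
    std::vector<double> out;
    const char* p = text;
    char* end = nullptr;
    for (double v = std::strtod(p, &end); end != p; v = std::strtod(p, &end)) {
        out.push_back(v);
        p = end;             // strtod skips leading whitespace itself
    }
    return out;
}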
Aggregate the writes into a fixed buffer, using a std::vector of RealType. Your logic should work like this:
Allocate a std::vector<RealType> with 65,536 default-constructed entries.
Read up to 65,536 entries into the vector, replacing the existing entries.
Write out as many entries as you were able to read in.
If you read in exactly 65,536 entries, go to step 2.
Stop, you are done.
This will prevent you from alternating reads and writes to two different files, minimizing the seek activity significantly. It will also allow you to make far fewer write calls, reducing copying and buffering logic.
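For reference, a sketch of how that buffered loop might look (a hedged example of my own that also fixes the reserve/size issues in the edit above):

#include <fstream>
#include <string>
#include <vector>

template<typename RealType = float>
void ascii_to_binary(const std::string& fsrc, const std::string& fdst)
{
    std::ifstream src(fsrc);
    std::ofstream dst(fdst, std::ofstream::binary);
    std::vector<RealType> buffer(65536);          // default-constructed entries

    while (true) {
        std::size_t k = 0;
        while (k < buffer.size() && src >> buffer[k])
            ++k;
        dst.write(reinterpret_cast<const char*>(buffer.data()),
                  k * sizeof(RealType));          // write the entries actually read
        if (k < buffer.size())
            break;                                // short read: end of input
    }
}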

Can I make a single binary write for a C++ struct which contains a vector

I am trying to build and write a binary request and have an "is this possible" type question. It might be important for me to mention that the recipient of the request is not aware of the data structure I have included below; it's just expecting a sequence of bytes, but using a struct seemed like a handy way to prepare the pieces of the request, then write them easily.
Writing the header and footer is fine as they are fixed size, but I'm running into problems with the struct "Details", because of the vector. For now I'm writing to a file so I can check the request is to spec, but the intention is to eventually write to a PLC using a boost asio serial port.
I can use syntax like so to write a struct, but that writes pointer addresses rather than values when it gets to the vector
myFile.write((char*) &myDataRequest, drSize);
I can use this syntax to write a vector by itself, but I must include the indexer at 0 to write the values
myFile.write((char*) &myVector[0], vectorSize);
Is there an elegant way to binary-write a struct containing a vector (or other suitable collection), doing it in one go? Say, for example, if I declared the vector differently - or am I resigned to making multiple writes for the content inside the struct? If I replace the vector with an array I can send the struct in one go (without needing to include any indexer), but I don't know the required size until run time, so I don't think it is suitable.
My Struct
struct Header
{ ... };

struct Details
{
    std::vector<DataRequest> DRList;
};

struct DataRequest
{
    short numAddresses;             // Number of operands to be read   Bytes 0-1
    unsigned char operandType;      // Byte 2
    unsigned char Reserved1;        // Should be 0xFF                  Byte 3
    std::vector<short> addressList; // either a starting address (for sequential), or a list of addresses (for non-sequential)
};

struct Footer
{ ... };
It's not possible, because the std::vector object doesn't actually contain an array but rather a pointer to a block of memory. However, I'm tempted to claim that being able to write a raw struct like that is not desirable:
I believe that by treating a struct as a block of memory you may end up sending padding bytes; I don't think this is desirable.
Depending on what you write to, you may find that writes are buffered anyway, so multiple write calls aren't actually less efficient.
Chances are that you want to do something with the fields being sent over, in particular with the numeric values. This requires enforcing a byte order which both sides of the transmission agree on. You should explicitly convert the byte order to make sure that your software is portable (if this is required).
To make a long story short: I suspect writing out each field one by one is not less efficient, it also is more correct.
This is not really a good strategy, since even if you could do this you're copying memory content directly to a file. If you change the architecture/processor, your client will get different data. If you write a method taking your struct and a filename, which writes the struct's values individually and iterates over the vector writing out its content, you'll have full control over the binary format your client expects and are not dependent on the compiler's current memory representation.
If you want convenience for marshalling/unmarshalling you should take a look at the boost::serialization library. They do offer a binary archive (besides text and xml), but it has its own format (e.g. it has a version number saying which version of the serialization lib was used to dump the data), so it is probably not what your client wants.
What exactly is the format expected at the other end? You have to write that, period. You can't just write any random bytes. The probability that just writing an std::vector like you're doing will work is about as close to 0 as you can get. But the probability that writing a struct with only int will work is still less than 50%. If the other side is expecting a specific sequence of bytes, then you have to write that sequence, byte by byte. To write an int, for example, you must still write four (or whatever the protocol requires) bytes, something like:
byte[0] = (value >> 24) & 0xFF;
byte[1] = (value >> 16) & 0xFF;
byte[2] = (value >> 8) & 0xFF;
byte[3] = (value ) & 0xFF;
(Even here, I'm supposing that your internal representation of negative numbers corresponds to that of the protocol. Usually the case, but not always.)
Typically, of course, you build your buffer in a std::vector<char>, and then write &buffer[0], buffer.size(). (The fact that you need a reinterpret_cast for the buffer pointer should signal that your approach is wrong.)
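To make that concrete, here is a sketch of my own (not from the answer) that marshals one DataRequest from the question into a std::vector<char>, field by field; big-endian is an assumption here - use whatever byte order and layout the PLC protocol actually specifies:

#include <cstdint>
#include <vector>

// Append a 16-bit value in big-endian order.
void put_be16(std::vector<char>& buf, std::uint16_t v)
{
    buf.push_back(static_cast<char>((v >> 8) & 0xFF));
    buf.push_back(static_cast<char>( v       & 0xFF));
}

std::vector<char> marshal(const DataRequest& dr)
{
    std::vector<char> buf;
    put_be16(buf, static_cast<std::uint16_t>(dr.numAddresses)); // bytes 0-1
    buf.push_back(static_cast<char>(dr.operandType));           // byte 2
    buf.push_back(static_cast<char>(dr.Reserved1));             // byte 3
    for (short a : dr.addressList)                               // address list
        put_be16(buf, static_cast<std::uint16_t>(a));
    return buf;
}

// The finished buffer can then go out in a single call:
// myFile.write(buf.data(), static_cast<std::streamsize>(buf.size()));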

write data structure into a file using binary mode

The code looks like this:
#include <cstdio>
#include <string>
using std::string;

struct Dog {
    string name;
    unsigned int age;
};

int main()
{
    Dog d = {.name = "Lion", .age = 3};
    FILE *fp = fopen("dog.txt", "wb");
    fwrite(&d, sizeof(d), 1, fp); // write d into dog.txt
    fclose(fp);
}
My problem is: what's the point of writing a data object or structure into a binary file? I assume it is for making the data generated by a running program persistent, right? If yes, then how can I get the data back? Using fread?
This makes me think of database-like stuff; does a database write data to disk the same way?
You can do it, but you will have a lot of issues to take care of:
structure types: all your data really needs to live inside the struct, otherwise you are just writing a pointer to some other place.
structure changes: if you need to change your structure, you will need to write a converter to read the old struct and write the new one.
language interoperability: it will be hard to access the data using another language.
It was a common practice in the early days, before relational databases became popular. You can make index files pointing to a record number.
However, nowadays I would advise you to use serialization and write strings instead of raw binary.
NOTE:
if string is something like char[40] your code may survive... but if your question is about C++ and string is a class, then kill your child before it grows up! The string object's characters are not inside your struct but on the heap.
Writing data in binary is extremely useful and much faster than reading/writing in text. Take for instance video games (although not every video game does this): when the game is saved, all of the necessary structures/classes and other data are written into a save file in binary.
It is just one use of binary, but the major reason for doing this is speed.
And to read the data back, you will need to know the format that you saved it in. For instance, as a simple example, if I saved an integer, a char array of n size, and a boolean, I would need to read the binary file back as an integer, a char array of n size, and a boolean. Otherwise the data is read improperly and will not be very useful at all.
Be careful. The type of the field 'name' in your structure is 'string'. This class contains dynamically allocated data, so writing 'string' data to a file this way will only write pointers, not the data itself.
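One way around that (a sketch of my own, not from this answer) is to write a length prefix followed by the string's characters, and give the integer a fixed width:

#include <cstdint>
#include <cstdio>
#include <string>

struct Dog {
    std::string name;
    std::uint32_t age;
};

void write_dog(const Dog& d, std::FILE* fp)
{
    std::uint32_t len = static_cast<std::uint32_t>(d.name.size());
    std::fwrite(&len, sizeof len, 1, fp);              // length prefix
    std::fwrite(d.name.data(), 1, len, fp);            // the characters, not a pointer
    std::fwrite(&d.age, sizeof d.age, 1, fp);          // fixed-width age
}

bool read_dog(Dog& d, std::FILE* fp)
{
    std::uint32_t len = 0;
    if (std::fread(&len, sizeof len, 1, fp) != 1) return false;
    d.name.resize(len);
    if (len && std::fread(&d.name[0], 1, len, fp) != len) return false;
    return std::fread(&d.age, sizeof d.age, 1, fp) == 1;
}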
The C++ Middleware Writer supports binary serialization to/from files.
From a marshalling perspective the "unsigned int age" member of your struct is a potential problem. I'd consider changing the type to uint32_t.

What's the 'best' method of writing data out to a file, to be read back in again later?

What is the best way of storing data out to a file on a network, which will later be read back in programmatically? The target platform for the program is Linux (Fedora), but it will need to write out the file to a Windows (XP) machine.
This needs to be in C++, there will be a high number of write / read events so it needs to be efficient, and the data needs to be written out in such a way that it can be read back in easily.
The whole file may not be being read back in, I'll need to search for a specific block of data in the file and read that back in.
Will a simple binary stream writer do?
How should I store the data - XML?
Anything else I need to worry about?
UPDATE: To clarify, here are some answers to peterchen's points.
Do you only append blocks, or do you also need to remove / update them?
I only need to append to the end of the file, but I will need to search through it and retrieve from any point in it.
Are all blocks of the same size?
No, the data will vary in size - some will be free text comments (like a post here), others will be specific object-like data (sets of parameters).
Is it necessary to be a single file?
No, but desirable.
By which criteria do you need to locate blocks?
By data type and by timestamp. For example, if I periodically write out a specific set of parameters, in amongst other data like free text, and I want to find the value of those parameters at a certain date/time, I'll need to search for the time I wrote out those parameters nearest that date and read them back in.
Must the data be readable by other applications?
No.
Do you need concurrent access?
Yes, I may be continuing to write as I read, but should only ever do one write at a time.
Amount of data (per block / total) - kilo, mega, giga, tera?
The amount of data will be low per write... from a number of bytes to a couple hundred bytes - the total should see no more than a few hundred kilobytes, possibly a few megabytes. (Still unsure as yet.)
If you need all of this, rolling your own will be a challenge; I would definitely recommend using a database. If you need less than that, please specify so we can recommend.
A database would overcomplicate the system, so that is not an option unfortunately.
Your question is too general. I would first define my needs, then a record structure for the file, and then use a textual representation to save it. Take a look at Eric Steven Raymond's data metaformat, at JSON, and maybe CSV or XML. All of peterchen's points seem relevant.
there will be a high number of write / read events so it needs to be efficient,
That will not be efficient.
I did a lot of timing on this back in the Win2K days, when I had to implement a program that essentially had a file copy in it. What I found was that by far the biggest bottleneck in my program seemed to be the overhead in each I/O operation. The single most effective thing I found in reducing total runtime was to reduce the number of I/O operations I requested.
I started out doing plain stream I/O, but that was no good because the stupid compiler was issuing an I/O operation for every single character. Its performance compared to the shell "copy" command was just pitiful. Then I tried writing out an entire line at a time, but that was only marginally better.
Eventually I ended up writing the program to attempt to read the entire file into memory so that in most cases there would be only 2 I/Os: one to read it in and another to write it out. This is where I saw the huge savings. The extra code involved in dealing with the manual buffering was more than made up for in less time waiting for I/Os to complete.
Of course this was 7 years or so ago, so I suppose things may be much different now. Time it yourself if you want to be sure.
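For what it's worth, a minimal modern sketch of that read-it-all-at-once idea (my own illustration, not the original program):

#include <fstream>
#include <iterator>
#include <string>

// Read an entire file into memory with (essentially) one I/O request.
std::string slurp(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return data;
}

// Write it back out with one more.
void dump(const std::string& path, const std::string& data)
{
    std::ofstream out(path, std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
}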
You should probably keep another index file that can be read into a vector of fixed-size records.
struct structBlockInfo
{
    int  iTimeStamp;  // TimeStamp
    char cBlockType;  // Type of data (parameters or simple text)
    long vOffset;     // Position in the real file
};
Every time you added a new block you would also add the corresponding information to this vector and save it.
Now if you wanted to read a specific block you could do a search on this vector, position yourself in the "real file" with fseek (or whatever) at the corresponding offset, and read X bytes (from this offset to the start of the next block, or to the end of the file).
Then do a cast to something depending on the cBlockType, for example:
struct structBlockText
{
    char cComment[];
};

struct structBlockValuesExample1
{
    int iValue1;
    int iValue2;
};

struct structBlockValuesExample2
{
    int  iValue1;
    int  iValue2;
    long lValue1;
    char cLittleText[];
};
Read some Bytes....
fread(cBuffer, 1, iTotalBytes, p_File);
If it was a structBlockText...
structBlockText* p_stBlock = (structBlockText*) cBuffer;
If it was a structBlockValuesExample1...
structBlockValuesExample1* p_stBlock = (structBlockValuesExample1*) cBuffer;
Note that cBuffer can hold more than one block.
You'll need to look at the kind of data you are writing out. Once you are dealing with objects instead of PODs, simply writing out the binary representation of the object will not necessarily result in anything that you can deserialise successfully.
If you are "only" writing out text, reading the data back in should be comparatively easy if you are writing out in the same text representation. If you are trying to write out more complex data types you'll probably need to look at something like boost::serialization.
Your application sounds like it needs a database. If you can afford one, use one. But don't use an embedded database engine like SQLite for a file over network storage, since it may be too unstable for your purposes. If you still want to use something like it, you have to access it through your own reader/writer process with your own access protocol. The same stability concerns apply if you use a text-based file format like XML instead, so you will have to do the same for that.
I can't be certain without knowing your workload though.
If you are only talking about a few megabytes, I wouldn't store it on disk at all. Have a process on the network that accepts data and stores it internally, and also accepts queries on that data. If you need a record of the data, this process can also write it to disk. Note that this sounds a lot like a database, and indeed this may be the best way to do it. I don't see how this complicates the system. In fact, it makes it much easier. Just write a class that abstracts the database, and have the rest of the code use that.
I went through this same process myself in the past, including dismissing a database as too complicated. It started off fairly simple, but after a couple of years we had written our own, poorly implemented, buggy, hard to use database. At that point, we junked our code and moved to postgres. We've never regretted the change.
This is what I have for reading/writing of data:
#include <fstream>

template<class T>
int write_pod( std::ofstream& out, T& t )
{
    out.write( reinterpret_cast<const char*>( &t ), sizeof( T ) );
    return sizeof( T );
}

template<class T>
void read_pod( std::ifstream& in, T& t )
{
    in.read( reinterpret_cast<char*>( &t ), sizeof( T ) );
}
This doesn't work for vectors, deques, etc., but it is easy to handle them by simply writing out the number of items followed by the data:
struct object {
    std::vector<small_objects> values;

    template <class archive>
    void deserialize( archive& ar ) {
        size_t size;
        read_pod( ar, size );
        values.resize( size );
        for ( size_t i = 0; i < size; ++i ) {
            values[i].deserialize( ar );
        }
    }
};
Of course you will need to implement the serialize & deserialize functions but they are easy to implement...
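For completeness, the matching serialize member might look like this (again a sketch of my own, mirroring the deserialize above and assuming the write_pod helper from earlier):

struct object {
    std::vector<small_objects> values;

    template <class archive>
    void serialize( archive& ar ) {
        // Write the element count first, then each element in turn.
        size_t size = values.size();
        write_pod( ar, size );
        for ( size_t i = 0; i < size; ++i ) {
            values[i].serialize( ar );
        }
    }
    // deserialize( archive& ar ) as shown above
};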
I would check out the Boost Serialization library
One of their examples is:
#include <fstream>

// include headers that implement an archive in simple text format
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>

/////////////////////////////////////////////////////////////
// gps coordinate
//
// illustrates serialization for a simple type
//
class gps_position
{
private:
    friend class boost::serialization::access;
    // When the class Archive corresponds to an output archive, the
    // & operator is defined similar to <<. Likewise, when the class Archive
    // is a type of input archive the & operator is defined similar to >>.
    template<class Archive>
    void serialize(Archive & ar, const unsigned int version)
    {
        ar & degrees;
        ar & minutes;
        ar & seconds;
    }
    int degrees;
    int minutes;
    float seconds;
public:
    gps_position(){};
    gps_position(int d, int m, float s) :
        degrees(d), minutes(m), seconds(s)
    {}
};

int main() {
    // create and open a character archive for output
    std::ofstream ofs("filename");

    // create class instance
    const gps_position g(35, 59, 24.567f);

    // save data to archive
    {
        boost::archive::text_oarchive oa(ofs);
        // write class instance to archive
        oa << g;
        // archive and stream closed when destructors are called
    }

    // ... some time later restore the class instance to its original state
    gps_position newg;
    {
        // create and open an archive for input
        std::ifstream ifs("filename");
        boost::archive::text_iarchive ia(ifs);
        // read class state from archive
        ia >> newg;
        // archive and stream closed when destructors are called
    }
    return 0;
}
Store it as binary if you're not doing text storage. Text is hideously inefficient; XML is even worse. The inefficiency of the storage format means larger file transfers, which means more time. If you have to store text, filter it through a zip library.
Your main issue is going to be file locking and concurrency. Everything starts to get messy when you have to write/read/write in a concurrent fashion. At that point, get a DB of some sort installed and BLOB the file up or something, because otherwise you'll be writing your own DB, and no one wants to reinvent that wheel (you know, unless they're running their own DB company, or are a PhD student, or have a strange hobby...).