cross-platform file structure handling - c++

Whenever I need to define a file structure, I use compiler-specific directives (like #pragma pack(1)) to ensure that I can safely read and write the file without worrying about padding issues.
However, is there any other way to achieve the same goal? I don't need to de-/serialize complex objects, just POD types.

It is impossible to define a cross-platform binary format that always nicely maps to the in-memory representation of types.
The two options for defining cross-platform file formats are:
Use text
Define a binary format in terms of what your favourite cross-platform serialisation library can provide and use that library to convert the file contents between their internal and external representation.
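To illustrate the second option in spirit, here is a minimal hand-rolled sketch that writes each field with a fixed width and a fixed byte order, so neither the host's padding nor its endianness matters. The Record type and its fields are made up for illustration:

#include <cstdint>
#include <cstdio>

// Hypothetical record layout, purely for illustration.
struct Record {
    std::uint32_t id;
    std::int16_t  temperature;
    std::uint8_t  flags;
};

// Emit fixed-width fields byte by byte in little-endian order, so the
// on-disk layout never depends on host padding or endianness.
static void write_u16_le(std::FILE* f, std::uint16_t v) {
    unsigned char b[2] = { static_cast<unsigned char>(v),
                           static_cast<unsigned char>(v >> 8) };
    std::fwrite(b, 1, sizeof b, f);
}

static void write_u32_le(std::FILE* f, std::uint32_t v) {
    unsigned char b[4] = { static_cast<unsigned char>(v),
                           static_cast<unsigned char>(v >> 8),
                           static_cast<unsigned char>(v >> 16),
                           static_cast<unsigned char>(v >> 24) };
    std::fwrite(b, 1, sizeof b, f);
}

static void write_record(std::FILE* f, const Record& r) {
    write_u32_le(f, r.id);
    write_u16_le(f, static_cast<std::uint16_t>(r.temperature));
    std::fwrite(&r.flags, 1, 1, f);   // a single byte has no ordering issue
}

A matching set of read_* functions does the reverse on load; that pair of routines is essentially what a serialisation library generates for you.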

The Boost Serialization library might be an option, if you want this solved quickly and without much ado.

Related

FILE IO to InMemory IO

I have a legacy C library which accepts a file, works on the file payload and writes the processed payload to an output file. The functions in the library are tightly coupled to FILE, i.e. FILE handles are passed around to the functions, and the functions do file IO to retrieve the necessary data.
I want to modify this library so that it works with in-memory data (no file IO), i.e. pass in a binary array and get back a binary array.
I have two solutions in mind:
Implement an in-memory file module (which implements all the operations of C's FILE) and override the default file operations with the new implementation using typedef or #define.
Pass a binary array around to all the functions of the library and retrieve the necessary data from it.
Which of these is better, or is there another, better way to solve the problem?
I would suggest not changing the legacy code if any other code depends on it.
If you are building for a somewhat POSIX-compliant platform, you can use fmemopen (a minimal sketch follows below): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fmemopen.html
For Windows, this might help:
C - create file in memory
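For the POSIX case, something along these lines would let the FILE-based library run over in-memory buffers unchanged. process_payload stands in for the real legacy entry point, which is an assumption here:

#include <cstddef>
#include <cstdio>

// Legacy entry point that reads from one FILE* and writes to another
// (a placeholder name for whatever the library actually exports).
extern void process_payload(std::FILE* in, std::FILE* out);

void process_in_memory(const unsigned char* input, std::size_t in_len,
                       unsigned char* output, std::size_t out_cap)
{
    // fmemopen (POSIX.1-2008) wraps a caller-supplied buffer in a FILE*.
    std::FILE* in  = fmemopen(const_cast<unsigned char*>(input), in_len, "rb");
    std::FILE* out = fmemopen(output, out_cap, "wb");
    if (in && out)
        process_payload(in, out);
    if (in)  std::fclose(in);
    if (out) std::fclose(out);
}

A real wrapper would also report how many bytes were written, e.g. by calling std::ftell(out) before closing the output stream.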
I don't know the exact purpose of changing the legacy code. The issue, as I understand it, is the overhead caused by reading and writing. There are several methods available to reduce that overhead:
As already mentioned, you can use fmemopen.
You may also use mmap, although for plain read/write it should make little difference; either way, everything goes through the filesystem cache/buffers (see the sketch after this list).
You can also use tmpfs, which backs (temporary) files with memory, essentially a RAM disk used as storage. Since the files are temporary by nature anyway, they are washed out easily.
Another solution: you can use an in-memory database (TimesTen, for example).
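A minimal sketch of the mmap route, mapping a whole file read-only so it can be handed to code that expects a plain byte array (the path and what you do with the bytes are placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int read_via_mmap(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          // the mapping stays valid after close
    if (data == MAP_FAILED) return -1;

    // ... hand static_cast<const unsigned char*>(data) and st.st_size to
    //     whatever wants an in-memory payload ...

    munmap(data, st.st_size);
    return 0;
}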

How to save a c++ readable .mat file

I am running DCT code in Matlab and I would like to read the compressed file (.mat) into C code. However, I am not sure this is right. I have not yet finished my code, but I would like to request an explanation of how to create a C++-readable file from my .mat file.
I am kind of confused when it comes to .mat, .txt and then binary/float details of files. Could someone please explain this to me?
It seems that you have a lot of options here, depending on your exact needs, time, and skill level (in both Matlab and C++). The obvious ones are:
ASCII files
You can generate ASCII files in Matlab either using the save(filename, variablename, '-ascii') syntax, or you can create a more custom format using C-style fprintf commands. Then, within a C or C++ program, the files are read using fscanf.
This is often easiest, and good enough in many cases. The fact that a human can read the files using notepad++, emacs, etc. is a nice sanity check (although this is often overrated).
There are two big downsides. First, the files are very large (an 8 byte double number requires about 19 bytes to store in ASCII). Second, you have to be very careful to minimize the inevitable loss of precision.
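Reading such a file back on the C/C++ side is only a few lines. A sketch, assuming a 32-by-32 matrix and a file name of data.txt (both made up):

#include <cstdio>

int main()
{
    // Assume the Matlab side wrote a 32x32 matrix with save(..., '-ascii').
    double m[32][32];
    std::FILE* f = std::fopen("data.txt", "r");
    if (!f) return 1;

    for (int r = 0; r < 32; ++r)
        for (int c = 0; c < 32; ++c)
            if (std::fscanf(f, "%lf", &m[r][c]) != 1) { std::fclose(f); return 1; }

    std::fclose(f);
    return 0;
}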
Bytes-on-a-disk
For a simple array of numbers (for example, a 32-by-32 array of doubles) you can simply use the fwrite Matlab function to write the array to disk. Then, within C/C++, use the parallel fread function.
This has no loss of precision, is pretty fast, and the files are relatively small on disk.
The downside with this approach is that complex Matlab structures cannot necessarily be saved.
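On the reading side this amounts to a single fread. A sketch, assuming the same 32-by-32 double array and a made-up file name; remember that Matlab writes column-major and that both sides must agree on byte order:

#include <cstddef>
#include <cstdio>

int main()
{
    // Matlab's fwrite(fid, A, 'double') stores the array column-major.
    double data[32 * 32];
    std::FILE* f = std::fopen("data.bin", "rb");
    if (!f) return 1;

    std::size_t got = std::fread(data, sizeof(double), 32 * 32, f);
    std::fclose(f);
    return got == 32 * 32 ? 0 : 1;
}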
Mathworks provided C library
Since this is a pretty common problem, the Mathworks has actually solved this by a direct C implementation of the functions needed to read/write to *.mat files. I have not used this particular library, but generally the libraries they provide are pretty easy to integrate. Some documentation can be found starting here: http://www.mathworks.com/help/matlab/read-and-write-matlab-mat-files-in-c-c-and-fortran.html
This should be a pretty robust solution, and relatively insensitive to changes, since it is part of the mainstream, supported Matlab toolset.
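A rough sketch of what using that library looks like, assuming a file results.mat containing a double variable x (the names are made up, and this needs to be linked against Matlab's libmat/libmx):

#include <cstdio>
#include "mat.h"   // MAT-File API header shipped with Matlab

int main()
{
    MATFile* mf = matOpen("results.mat", "r");
    if (!mf) return 1;

    mxArray* a = matGetVariable(mf, "x");
    if (a && mxIsDouble(a)) {
        double* p = mxGetPr(a);
        std::printf("first element: %g (%zu x %zu)\n",
                    p[0], (size_t)mxGetM(a), (size_t)mxGetN(a));
    }
    if (a) mxDestroyArray(a);
    matClose(mf);
    return 0;
}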
HDF5 based *.mat file
With recent versions of Matlab, you can use the notation save(filename, variablename, '-v7.3'); to force Matlab to save the file in an HDF5 based format. Then you can use tools from the HDF5 group to handle the file. Note a decent, java-based GUI viewer (http://www.hdfgroup.org/hdf-java-html/hdfview/index.html#download_hdfview) and libraries for C, C++ and Fortran.
This is a non-fragile method to store binary data. It is, however, a bit of work to get the libraries working in your code.
One downside is that the Mathworks may change the details of how they map Matlab data types into the HDF5 file. If you really want to be robust, you may want to try ...
Custom HDF5 file
Instead of just taking whatever format the Mathworks decides to use, it's not that hard to create an HDF5 file directly and push data into it from Matlab. This lets you control things like compression, chunk sizing, dataset hierarchy and names. It also insulates you from any future changes in the default *.mat file format. See the h5write command in Matlab.
It is still a bit of effort to get running from the C/C++ end, so I would only go down this path if your project warranted it.
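Whichever HDF5 route you take, the C/C++ side looks roughly like the sketch below, using the plain HDF5 C API with made-up file and dataset names. For a -v7.3 *.mat file, a variable saved as x typically appears as a dataset named /x, but verify with a viewer first:

#include <cstddef>
#include <vector>
#include <hdf5.h>   // HDF5 C API; link against libhdf5

int main()
{
    hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) return 1;

    hid_t dset = H5Dopen2(file, "/x", H5P_DEFAULT);
    if (dset < 0) { H5Fclose(file); return 1; }

    hid_t space = H5Dget_space(dset);
    hssize_t n = H5Sget_simple_extent_npoints(space);

    std::vector<double> data(static_cast<std::size_t>(n));
    H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}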
.mat is a special format for MATLAB itself.
What you can do is to load your .mat file in the MATLAB workspace:
load file.mat
Then use fopen and fprintf to write the data to file.txt and then you can read the content of that file in C.
You can also use Matlab's dlmwrite to write to a delimited ASCII file, which will be easy to read in C (and human-readable too), although it may not be as compact if that is central to the issue.
Adding to what has already been mentioned, you can save your data from MATLAB using -ascii.
save x.mat x
Becomes:
save x.txt x -ascii

Is it a good idea to save/load an array of struct?

I was wondering if it was a good idea to load/save an array of a certain type of structure using fstream. Note, I am talking about loading/saving to a binary file. Should I be loading/saving independent variables such as int, float and boolean rather than a struct? The reason I ask is that I've heard a structure might have some kind of padding which might throw the save/load off.
A structure may contain padding, which will be written to the file. This is no big deal if the file is going to be read back on the same platform, using code emitted by the same compiler that did the write. However, this is difficult to guarantee, and if you cannot guarantee it, you should normally write the data in some textual format, such as XML, json or whatever.
Without serialization, your binary data will not be portable across different platforms (and compilers). So if you need portability, then you need to serialize the data before storing it in a file as binary.
Have a look at these (a small sketch follows after the links):
Boost Serialization Tutorial
Boost Serializable Concept
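For example, a minimal Boost.Serialization sketch; the struct, field names and file name are made up, and the program needs to be linked against the compiled Boost.Serialization library:

#include <fstream>
#include <vector>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/vector.hpp>

// Illustrative type, serialised field by field rather than as raw bytes.
struct Sample {
    int   id;
    float value;

    template <class Archive>
    void serialize(Archive& ar, const unsigned /*version*/) {
        ar & id;
        ar & value;
    }
};

int main()
{
    std::vector<Sample> data{{1, 0.5f}, {2, 1.5f}};

    {   // Save to a portable text archive.
        std::ofstream ofs("samples.txt");
        boost::archive::text_oarchive oa(ofs);
        oa << data;
    }
    {   // Load it back.
        std::vector<Sample> loaded;
        std::ifstream ifs("samples.txt");
        boost::archive::text_iarchive ia(ifs);
        ia >> loaded;
    }
    return 0;
}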
It's not deprecated (it's not part of any formal spec, so where would it be deprecated?), but it's extremely non-portable and probably the worst way to go about serialising stuff. Use Boost.Serialization, or a similar library.
As you pointed out in your answer, this will happen when writing structs this way. If you want your files to be portable across platforms, e.g. a file written on Linux i686 to be opened by Solaris on SPARC, then even writing individual floats won't work.
Try writing your data to something like text or XML and then zip/tar the files to make one document of them.
As Neil said, prefer textual representation of data. The XML format may be overkill. Simpler versions are Comma Separated Value (CSV) and one value per text line.

Getting better error messages for iostreams

I implemented a small program that can extract (and mount via FUSE) a certain archive format. I use boost::filesystem::ifstream, but on error (e.g. the file a user wants to extract does not exist) I get very nondescript error messages. I wonder, is there a way to get better error messages for IO-related problems in C++?
On a related note, I wonder whether I should have used C's FILE*, or, in the case of the FUSE filesystem, just plain file descriptors, because strerror(errno) is way better than what iostreams are giving me.
We couldn't find any better way than using boost::iostreams and implementing our own file-based sink and source.
If you want, you can grab the source code here (Apache-licensed):
http://sourceforge.net/projects/cgatools/files/1.3.0/cgatools-1.3.0.9-source.tar.gz/download
the relevant files are:
cgatools/util/Streams.[ch]pp
Since you're using the filesystem library anyway, you could test whether the file exists prior to trying to access it with a stream. This would avoid your bloat concerns, but it would not operate in the same sense as what you're looking for, i.e. the stream itself would not perform the existence check.
However, since you are using boost::filesystem::ifstream, I'm assuming you chose it because you are using boost::filesystem::path. In boost's implementation of ifstream, they inherit from std::basic_ifstream and override two functions: the constructor and open. So, if you want better error reporting, you could simply do the same thing: inherit from boost's implementation and override those two functions to provide the checking you wish. Additional bloat: probably not a lot, and it incorporates the behavior you wish into the stream itself.
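A reduced sketch of the idea, written against std::ifstream for brevity. The helper name is made up, and it relies on the failing open(2) underneath setting errno, which is common in practice but not guaranteed by the C++ standard:

#include <cerrno>
#include <fstream>
#include <string>
#include <system_error>

// Open a file for reading or throw with the OS-level reason,
// i.e. the same text strerror(errno) would have given you.
std::ifstream open_or_throw(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::system_error(errno, std::generic_category(),
                                "cannot open '" + path + "'");
    return in;   // streams are movable since C++11
}

The same check could live in a class derived from boost::filesystem::ifstream, as described above.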

Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.
I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.
What I am looking for is to compress it with a good compression algorithm (like, say, gzip) with markers that would enable very occasional random access. Random access as in "read a byte from location [64-bit address] in the original (uncompressed) stream". This is a little different from the classic deflate libraries like zlib, which let you decompress the stream continuously. What I would like is random access with a latency of, say, as much as 1 MB of decompression work per byte read.
Of course, I hope to use an existing library rather than reinvent the NIH wheel.
If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.
Byte Pair Encoding allows random access to data.
You won't get as good compression with it, but you're sacrificing adaptive (variable) hash trees for a single tree, so you can access it.
However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index for every 1 MB. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.
One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.
If it's accessed rarely, you could compress the index with gzip and decode it when needed.
If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).
Edit: Yes, there is existing code to handle this. Recent versions of Info-ZIP's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file, and the name of one of the files in that archive that you want to retrieve. You then get a buffer holding the contents of that file. For updating, you'll need the api.c from zip 3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).
Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).
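The front-end arithmetic mentioned above is trivial; a sketch, with the chunk size and naming scheme as assumptions:

#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>

constexpr std::uint64_t kChunkSize = 1 << 20;   // 1 MB per archived piece

// Map an offset in the original stream to (chunk file name, offset within it).
std::pair<std::string, std::uint64_t> locate(std::uint64_t offset)
{
    const std::uint64_t chunk = offset / kChunkSize;
    char name[32];
    std::snprintf(name, sizeof name, "chunk_%06llu.bin",
                  static_cast<unsigned long long>(chunk));
    return { name, offset % kChunkSize };
}

// Extract the named piece from the archive (e.g. with UzpUnzipToMemory, or by
// shelling out to unzip), then read at the returned in-chunk offset.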
Take a look at my project - csio. I think it is exactly what you are looking for: stdio-like interface and multithreaded compressor included.
It is a library, written in C, which provides a CFILE structure and the functions cfopen, cfseek, cftello, and others. You can use it with regular (not compressed) files and with files compressed with the help of the dzip utility. This utility is included in the project and written in C++. It produces a valid gzip archive, which can be handled by standard utilities as well as by csio. dzip can compress in many threads (see the -j option), so it can compress very big files very fast.
Typical usage:
dzip -j4 myfile
...
CFILE file = cfopen("myfile.dz", "r");
off_t some_offset = 673820;
cfseek(file, some_offset);
char buf[100];
cfread(buf, 100, 1, file);
cfclose(file);
It is MIT licensed, so you can use it in your projects without restrictions. For more information, visit the project page on GitHub: https://github.com/hoxnox/csio
Compression algorithms usually work in blocks, I think, so you might be able to come up with something based on the block size.
I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.
The Boost library has been accepted by the C++ standards committee as part of TR2, so it will eventually be built into most compilers (under std::tr2::sys). It is also cross-platform compatible.
Boost Releases
Boost Getting Started Guide NOTE: Only some parts of boost::iostreams are header-only; the rest requires separately-compiled library binaries or special treatment when linking.
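A minimal sketch of reading a gzip-compressed file through Boost.Iostreams. The file name is made up; this gives sequential decompression, so for random access you would still combine it with a chunking/index scheme like the ones described above, and the gzip filter needs zlib plus the compiled part of Boost.Iostreams at link time:

#include <fstream>
#include <iostream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>

int main()
{
    std::ifstream raw("data.gz", std::ios::binary);
    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());   // decompress on the fly
    in.push(raw);                                     // ...from the raw file

    std::string line;
    while (std::getline(in, line))
        std::cout << line << '\n';
    return 0;
}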
Sort the big file first.
Divide it into chunks of your desired size (1 MB) with some sequence in the name (File_01, File_02, ..., File_NN).
Take the first ID from each chunk, plus the filename, and put both into another file.
Compress the chunks.
You will be able to search the ID file using whatever method you wish, perhaps a binary search, and open each file as you need.
If you need deeper indexing, you could use a B-tree algorithm, with the "pages" being the files.
Several implementations of this exist on the web, because the code is a little tricky.
You could use bzip2 and make your own API pretty easily, based on James Taylor's seek-bzip2.