Recover an object by lazy-loading the containing file - C++

I'm recovering an object from a binary file using boost::archive::binary_iarchive, but the file is too heavy (18 GB) and deserializing the object loads the entire file into memory. Is there a way to read the file in parts (a lazy load) to avoid the memory use?
What I have:
std::ifstream ifs(filename, std::ios::binary);
boost::archive::binary_iarchive ia(ifs);
MyObject obj;
ia >> obj;

Upgrading my comment to an answer:
@cmaster got really close to an approach that can work, but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put all the data in memory (into that vector, e.g.). So the only real solutions are:

1. Put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this. It is a lot of effort, but conceptually straightforward; a sketch follows below.
2. Modify the deserialization code to convert the data to a different on-disk format on the fly (instead of inserting into e.g. that vector), which would then allow mmap as cmaster suggested.

In other words, you'd "cannibalize" the Boost Serialization implementation to migrate the data away from Boost Serialization towards a raw binary format that affords using it directly in mapped memory.
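For the first option, here is a minimal sketch using Boost.Interprocess (the file name data.bin, the double element type, and the 20 GiB segment size are all illustrative, not taken from the question):

#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>

namespace bip = boost::interprocess;

// An allocator that carves elements out of the mapped file instead of the heap.
using mapped_allocator = bip::allocator<double, bip::managed_mapped_file::segment_manager>;
using mapped_vector    = bip::vector<double, mapped_allocator>;

int main() {
    // Open or create a file-backed segment; the size reserves address space,
    // and on most filesystems the backing file stays sparse until written.
    bip::managed_mapped_file seg(bip::open_or_create, "data.bin", 20ull << 30);

    // The vector lives inside the mapped file and survives program restarts.
    mapped_vector* v = seg.find_or_construct<mapped_vector>("records")(
        seg.get_segment_manager());

    v->push_back(42.0);  // goes straight to the mapping, not to heap RAM
}

Because the vector's storage lives in the mapped file, the kernel pages it in and out on demand; the process never needs the full 18 GB resident at once.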

You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition that must be met is that your process runs as a 64-bit process; otherwise your virtual address space will be too small to fit the file into it.
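For illustration, a minimal POSIX sketch of this approach (the file name is made up, and error handling is kept short):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("huge.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Nothing is read here; pages are faulted in from disk only when touched.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps its own reference to the file
    if (p == MAP_FAILED) return 1;

    // The whole file is now visible as one big char array.
    const char* data = static_cast<const char*>(p);
    std::printf("first byte: %d\n", data[0]);

    munmap(p, st.st_size);
}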

Related

Boost::knuth_morris_pratt over a std::istream

I would like to use boost::algorithm::knuth_morris_pratt over some huge files (several hundred gigabytes). This means I can't just read the whole file into memory, nor mmap it; I need to read it in chunks.
knuth_morris_pratt operates on an iterator, so I guess it is possible to make it read input data "lazily" (on-demand), it would be a matter of writing a "lazy" iterator for some file access class like ifstream, or better istream.
I would like to know if there is some adapter available (already written) that adapts istream to Boost's knuth_morris_pratt so that it won't read all file data all at once?
I know there is a boost::spirit::istream_iterator, but it lacks some methods (like operator+), so it would have to be modified to work.
On StackOverflow there's an implementation of a bidirectional_iterator here, but it still requires some work before it can be used with knuth_morris_pratt.
Are there any istream iterators that are already written, tested and working?
Update: I can't do mmap, because my software should work on multiple operating systems, both on 32-bit and 64-bit architectures. Also, very often I don't have the files anyway; they're being generated on the fly, which is why I'm looking for a solution that involves streams.
You should simply memory map it.
In practice, 64-bit processors usually have a 48-bit address space, which is enough for 256 terabytes of memory.
Last I checked, Linux allows 128TB of virtual address space per process on x86-64
(from https://superuser.com/a/168121/75721)
Spirit's istream_iterator is actually its multi_pass_adaptor, and it has different design goals. Unless you have a way to flush the buffer, it will grow indefinitely (by default allocating a deque buffer).
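To illustrate the memory-mapping suggestion, here's a sketch combining boost::iostreams::mapped_file_source with the searcher (assuming Boost >= 1.62, where the search returns an iterator pair; the file name and pattern are illustrative):

#include <boost/algorithm/searching/knuth_morris_pratt.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>
#include <string>

int main() {
    // The kernel pages the file in lazily as the search walks through it.
    boost::iostreams::mapped_file_source file("huge.log");
    const char* begin = file.data();
    const char* end   = begin + file.size();

    const std::string needle = "ERROR";

    // In Boost >= 1.62 the search returns a pair of iterators delimiting the
    // match; in older releases it returns just the first iterator.
    auto hit = boost::algorithm::knuth_morris_pratt_search(
        begin, end, needle.begin(), needle.end());

    if (hit.first != end)
        std::cout << "found at offset " << (hit.first - begin) << '\n';
}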

Shared-Named memory (Windows)

I have recently started a project that requires the use of Shared/Named memory. I have a working prototype - but before I commit to the current implementation, I would like to learn a little bit more on the subject.
I have checked the MSDN docs (and various other sources) and I grasp the basic principles behind how everything works, but I was not able to find answers to my questions below.
1) If you create a shared memory space and don't provide a valid file handle, it uses the System Page file for its storage. My question is - If I create my own file, and map the view to that file - will the performance be comparatively the same as when mapping to the system page file?
2) You can access the data in the shared memory space by using CopyMemory (which creates a copy of the data) or by casting the result of MapViewOfFile to the type that was written there in the first place. Let's assume we wrote a data structure "MyStruct" there. Is it safe to do the following?
auto pReferenceToSharedMemory = (MyStruct*)MapViewOfFile(....);
pReferenceToSharedMemory->SomeField = 12345;
pReferenceToSharedMemory->SomeField2 = ...;
...
Assuming the above is safe to do - it's surely more efficient to apply data changes directly to the data stored in the shared memory space than to copy the data out, change some values, and copy it back?
3) Lastly - how expensive are the OpenFileMapping and MapViewOfFile operations? I would think that ideally you should only execute OpenFileMapping once (at the beginning of your operation), execute MapViewOfFile once, and use the reference that it returns throughout the operation, rather than executing MapViewOfFile every time you would like to access the data?
Finally: Is it possible for the pointer returned by MapViewOfFile and the data stored in the mapping to become out of sync?
1) The choice between your own file and the system page file isn't performance; it's persistence. Changes to your file will still be there next time your program runs.
2) The proper solution would be `new (MapViewOfFile(...)) MyStruct`, if the mapping is backed by the page file and therefore still empty.
3) The expensive operations are reading and writing, not the meta-operations.
4) I've got no idea what that even means, so I'm fairly certain the answer is NO.
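To illustrate points 1 and 2 together, a minimal sketch (the mapping name Local\MySharedStruct and the MyStruct layout are illustrative; note this is only reasonable if MyStruct is trivially copyable and holds no pointers):

#include <windows.h>
#include <new>

struct MyStruct {
    int SomeField;
    int SomeField2;
};

int main() {
    // Backed by the system page file because no file handle is supplied (case 1).
    HANDLE hMap = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                     PAGE_READWRITE, 0, sizeof(MyStruct),
                                     L"Local\\MySharedStruct");
    if (hMap == nullptr) return 1;
    const bool creator = (GetLastError() != ERROR_ALREADY_EXISTS);

    void* view = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(MyStruct));
    if (view == nullptr) return 1;

    // The creating process constructs the object in place (case 2); later
    // processes just reinterpret the already-initialized bytes.
    MyStruct* p = creator ? new (view) MyStruct{} : static_cast<MyStruct*>(view);
    p->SomeField = 12345;  // edits the shared pages directly, no CopyMemory round trip

    UnmapViewOfFile(view);
    CloseHandle(hMap);
}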

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (size of one block = b*m), I can read the complete block at once using the read() system call (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason why I want to do this in the first place is to reduce disk I/O operations, since disk I/O is much more expensive, according to what I have learned.
Or will it take the same amount of time as when I read record by record from the file and put each directly into the vector, instead of reading block by block? Reading block by block, I will have only n disk I/Os, whereas reading record by record would take x.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that the file contents are simply mapped into your process's address space, as if you already had a pointer to them. By inspecting memory and treating it as an array, or by copying data using memcpy(), you will implicitly perform read operations, but only as necessary - the operating system's virtual memory subsystem is smart enough to do this very efficiently.
The only possible reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. But on a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is always better and faster to use it over read().
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, when you execute a binary, the executable is simply mmap-ed and executed from memory as if it were copied there by read, without actually reading it.
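As a sketch of the array-style access described above, assuming fixed-size, trivially copyable records (the Record layout is illustrative, and error handling is omitted for brevity):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

struct Record { char bytes[48]; };  // b bytes per record; layout is illustrative

std::vector<Record> load_all(const char* path) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);

    // Treat the mapping as an array of records; the copy below faults pages
    // in from disk only as the iterators walk over them.
    const Record* first = static_cast<const Record*>(p);
    std::vector<Record> v(first, first + st.st_size / sizeof(Record));

    munmap(p, st.st_size);
    return v;
}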
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible that reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of the buffer used by the stream, which is probably a lot easier (see the sketch below).
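For example, a sketch of enlarging the stream's buffer with pubsetbuf (the buffer size, record size, and file name are all illustrative):

#include <fstream>
#include <vector>

int main() {
    std::vector<char> buf(1 << 20);  // 1 MiB buffer instead of the small default
    std::ifstream in;

    // pubsetbuf must run before open() for the custom buffer to take effect
    // (calling it later is implementation-defined).
    in.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    in.open("records.bin", std::ios::binary);

    char record[64];  // b bytes per record
    while (in.read(record, sizeof record)) {
        // process one record; actual disk reads now happen ~1 MiB at a time
    }
}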

For file reading, when to use filebuf

I'm going to be doing random-access reading from a read-only binary file. The interface of ifstream seems simpler than filebuf's, but is there any use case where filebuf would give better performance?
More details: I have a file of fixed-length (48-byte) records, and will be doing random-access reads in sequence -- read 1 record, process, read 1 record (from elsewhere), process, .... (Traversing a tree.) The file never changes. Since the records are fixed-length, I may later use a "character-type" that is the 48-byte record, but I don't imagine that has any performance effect.
Maybe, if you are on Linux, using mmap would get around the whole problem of reading the file bit by bit.
Or Boost memory-mapped files?
http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/classes/mapped_file.html
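A sketch of the Boost mapped_file approach for the 48-byte-record use case (the file name is illustrative):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>

constexpr std::size_t kRecordSize = 48;

struct Record { char bytes[kRecordSize]; };

int main() {
    boost::iostreams::mapped_file_source file("tree.bin");

    // Random access is plain pointer arithmetic; each hop through the tree
    // touches at most a page or two, which the kernel faults in on demand.
    auto record_at = [&](std::size_t i) {
        return reinterpret_cast<const Record*>(file.data() + i * kRecordSize);
    };

    const Record* root = record_at(0);
    (void)root;  // traverse the tree from here
}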

Using the pagefile for caching?

I have to deal with a huge amount of data that usually doesn't fit into main memory. The way I access this data has high locality, so caching parts of it in memory looks like a good option. Is it feasible to just malloc() a huge array, and let the operating system figure out which bits to page out and which bits to keep?
Assuming the data comes from a file, you're better off memory mapping that file. Otherwise, what you end up doing is allocating your array, and then copying the data from your file into the array -- and since your array is mapped to the page file, you're basically just copying the original file to the page file, and in the process polluting the "cache" (i.e., physical memory) so other data that's currently active has a much better chance of being evicted. Then, when you're done you (typically) write the data back from the array to the original file, which (in this case) means copying from the page file back to the original file.
Memory mapping the file instead just creates some address space and maps it directly to the original file instead. This avoids copying data from the original file to the page file (and back again when you're done) as well as temporarily moving data into physical memory on the way from the original file to the page file. The biggest win, of course, is when/if there are substantial pieces of the original file that you never really use at all (in which case they may never be read into physical memory at all, assuming the unused chunk is at least a page in size).
If the data are in a large file, look into using mmap to read it. Modern computers have so much RAM, you might not have enough swap space available.