Boost::knuth_morris_pratt over a std::istream - c++

I would like to use boost::algorithm::knuth_morris_pratt over some huge files (serveral hundred gigabytes). This means I can't just read the whole file into memory nor mmap it, I need to read it in chunks.
knuth_morris_pratt operates on an iterator, so I guess it is possible to make it read input data "lazily" (on-demand), it would be a matter of writing a "lazy" iterator for some file access class like ifstream, or better istream.
I would like to know if there is some adapter available (already written) that adapts istream to Boost's knuth_morris_pratt so that it won't read all file data all at once?
I know there is a boost::spirit::istream_iterator, but it lacks some some methods (like operator+), so it would have to be modified to work.
On StackOverflow there's a implementation of bidirectional_iterator here, but it still requires some work before it can be used with knuth_morris_pratt.
Are there any istream iterators that are already written, tested and working?
Update: I can't do mmap, because my software should work on multiple operating systems, both on 32-bit and 64-bit architectures. Also very often I don't have the files anyway, they're being generated on-the-fly, that's why I search for a solution that involves streams.

You should simply memory map it.
In practice, 64-bit processors usually have 48-bit address space which is enough for 256 terabytes of memory.
Last I checked, Linux allows 128TB of virtual address space per process on x86-64
(from https://superuser.com/a/168121/75721)
Spirit's istream_iterator is actually it's multi_pass_adaptor and it has different design goals. Unless you have a way to flush the buffer, it will grow indefinitely (by default allocating a deque buffer).

Related

Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available

I'm trying to read the data contained in a .dat file with size ~1.1GB.
Because I'm doing this on a 16GB RAM machine, I though it would have not be a problem to read the whole file into memory at once, to only after process it.
To do this, I employed the slurp function found in this SO answer.
The problem is that the code sometimes, but not always, throws a bad_alloc exception.
Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.
Here is the code that reproduces this error
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
using namespace std;
int main()
{
ifstream file;
file.open("big_file.dat");
if(!file.is_open())
cerr << "The file was not found\n";
stringstream sstr;
sstr << file.rdbuf();
string text = sstr.str();
cout << "Successfully read file!\n";
return 0;
}
What could be causing this problem?
And what are the best practices to avoid it?
The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enought swap is available, or it might fail on a HPC node with 128GB of RAM – it's totally up to your Operating System to decide how much memory is available to you, here.
I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the allocated internal buffer becomes too small to contain the incoming bytes. Also, libc++/libc will probably also have their own allocators that will have some allocation overhead, here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file, and might work with it as if it was a plain array in memory, letting your OS take care of handling the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....
I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering on this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
Depends; I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables; possibly taking advantage of your OS'es pre-fetching SMP capabilities in that you don't even get that much of a speed advantage if you'd spawn separate threads for file reading and float conversion. std::copy, used to read from std::ifstream to a std::vector<float> should work fine, here.
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't this mean to read the file line-by-line, resulting in a large number of file IO operations?
Yes, and no: First, of course, you will have more context switches then you'd have if you just ordered for the whole to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libc's know quite well how to optimize reads, and thus will fetch a whole lot of file at once if you don't use extremely randomized read lengths. Also, you don't infer the penalty of trying to allocate a block of RAM at least 1.1GB in size -- that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float will still mean less of a performance hit, because most of the time, your read will pretty much immediately return, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: Performance seems to be important to you (is it really that important, here? You're using a brain-dead file format for interchanging floats, which is both bloaty, loses information, and on top of that is slow to parse), but you'd rather first read the whole file in at once and then start converting it to numbers. Frankly, if performance was of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilo- to Megabytes to be read up to \n boundaries and exchanged with a thread that creates the in-memory table of floats sounds like it would basically reduce your read+parse time down to read+non-measurable without sacrificing read performance, and without the need for Gigabytes of RAM just to parse a sequential file.
By the way, to give you an impression of how bad storing floats in ASCII is:
The typical 32bit single-precision IEEE753 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one ., typically one exponential divider, e.g. E, and on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE754 32bit floats:
-1.23456E-10
That's an average of 11 characters.
Add one , or \n after every number.
Now, your character is 1B, meaning that you blow up your 4B of actual data by a factor of 3, still losing precision.
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.
In a 32 bit executable, total memory address space is 4gb. Of that, sometimes 1-2 gb is reserved for system use.
To allocate 1 GB, you need 1 GB of contiguous space. To copy it, you need 2 1 GB blocks. This can easily fail, unpredictably.
There are two approaches. First, switch to a 64 bit executable. This will not run on a 32 bit system.
Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting it and or streaming it starts making a lot of sense. Done right you'll also be able to start to process it prior to finishing reading it.
There are many file io datastructures, from stxxl to boost, or you can roll your own.
The size of the heap (a pool of memory used for dynamic allocations) is limited independently on the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations which will probably force you to change the way you read from the file.
If you are running on UNIX based system you can check the function vmalloc or the VirtualAlloc function if you are running on Windows platform.

Recover object lazy-loading the containing file

I'm using a binary file to recover an object using boost::binary_iarchive_ia but it is too heavy (18GB) and that object loads the entire file to memory. Is there a way to read the file by parts (a lazy load) to avoid the memory use?
What I have:
std::ifstream ifs(filename);
boost::archive::binary_iarchive_ia(ifs);
MyObject obj;
ia >> obj;
Upgrading my comment to an answer:
#cmaster got really close to an approach that can workm but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put the data all in memory (the vector, e.g.). So the only real solutions would be to
is to put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this. This is a lot of effort, but relatively straight forward, conceptually.
one could modify the deserialization code to convert to a different on-disk format on the fly (instead of inserting into e.g. that vector), which would then allow mmap as cmaster suggested it.
In other words, you'd "canibalize" the boost serialization implementation to migrate the data away from boost serialization towards a raw binary format that affords using it directly in mapped memory.
You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition that must be met is that your process runs as a 64 bit process, otherwise your virtual address space will be too small to fit the file into it.

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records. Total number of blocks in file n=x/m. If I know the size of one record, say b bytes (size of one block = b*m), I can read the complete block at once using system command read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector.
The reason why I want to do this in the first place is to reduce the disk i/o operations. As the disk i/o operations are much more expensive according to what I have learned.
Or will it take the same amount of time as when I read record by record from file and directly put it into vectors instead of reading block by block? On reading block by block, I will have only n disk I/O's whereas x I/O's if I read record by record.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that you can treat file contents as simply mapped into your process space as if you already had a pointer into the file contents. By simply inspecting memory contents and treating it as an array, or by copying data using memcpy() you will implicitly perform read operations, but only as necessary - operating system virtual memory subsystem is smart enough to do it very efficiently.
The only possible reason to avoid mmap maybe if you are running on 32-bit OS and file size exceeds 2 gigabytes (or slightly less than that). In this case OS may have trouble allocating address space to your mmap-ed memory. But on 64-bit OS using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data, and size of the data is not known upfront. Other than that, it is always better and faster to use it over the read.
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).

For file reading, when to use filebuf

I'm going to be doing random-access reading from a read-only binary file. The interface to ifstream seems simpler than filebuf; but is there any use-case where filebuf would give better performance?
More details: I have a file of fixed-length (48-byte) records, and will be doing random-access reads in sequence -- read 1 record, process, read 1 record (from elsewhere), process, .... (Traversing a tree.) The file never changes. Since the records are fixed-length, I may later use a "character-type" that is the 48-byte record, but I don't imagine that has any performance effect.
May be if you are on Linux may be using mmap would get around the whole problem of reading the file bit by bit.
Or boost memory mapped files?
http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/classes/mapped_file.html

Which is faster in memory, ints or chars? And file-mapping or chunk reading?

Okay, so I've written a (rather unoptimized) program before to encode images to JPEGs, however, now I am working with MPEG-2 transport streams and the H.264 encoded video within them. Before I dive into programming all of this, I am curious what the fastest way to deal with the actual file is.
Currently I am file-mapping the .mts file into memory to work on it, although I am not sure if it would be faster to (for example) read 100 MB of the file into memory in chunks and deal with it that way.
These files require a lot of bit-shifting and such to read flags, so I am wondering that when I reference some of the memory if it is faster to read 4 bytes at once as an integer or 1 byte as a character. I thought I read somewhere that x86 processors are optimized to a 4-byte granularity, but I'm not sure if this is true...
Thanks!
Memory mapped files are usually the fastest operations available if you require your file to be available synchronously. (There are some asynchronous APIs that allow the O/S to reorder things for a slight speed increase sometimes, but that sounds like it's not helpful in your application)
The main advantage you're getting with the mapped files is that you can work in memory on the file while it is still being read from disk by the O/S, and you don't have to manage your own locking/threaded file reading code.
Memory reference wise, on the x86 memory is going to be read an entire line at a time no matter what you're actually working with. The extra time associated with non byte granular operations refers to the fact that integers need not be byte aligned. For example, performing an ADD will take more time if things aren't aligned on a 4 byte boundary, but for something like a memory copy there will be little difference. If you are working with inherently character data then it's going to be faster to keep it that way than to read everything as integers and bit shift things around.
If you're doing h.264 or MPEG2 encoding the bottleneck is probably going to be CPU time rather than disk i/o in any case.
If you have to access the whole file, it is always faster to read it to memory and do the processing there. Of course, it's also wasting memory, and you have to lock the file somehow so you won't get concurrent access by some other application, but optimization is about compromises anyway. Memory mapping is faster if you're skipping (large) parts of the file, because you don't have to read them at all then.
Yes, accessing memory at 4-byte (or even 8-byte) granularity is faster than accessing it byte-wise. Again it's a compromise - depending on what you have to do with the data afterwards, and how skilled you are at fiddling with the bits in an int, it might not be faster overall.
As for everything regarding optimization:
measure
optimize
measure
These are sequential bit-streams - you basically consume them one bit at a time without random-access.
You don't need to put a lot of effort into explicitly buffering reads and such in this scenario: the operating system will be buffering them for you anyway. I've written H.264 parsers before, and the time is completely dominated by the decoding and manipulation, not the IO.
My recommendation is to use a standard library and for parsing these bit-streams.
Flavor is such a parser, and the website even includes examples of MPEG-2 (PS) and various H.264 parts like M-Coder. Flavor builds native parsing code from a c++-like language; here's an quote from the MPEG-2 PS spec:
class TargetBackgroundGridDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 7
{
unsigned int(14) horizontal_size;
unsigned int(14) vertical_size;
unsigned int(4) aspect_ratio_information;
}
class VideoWindowDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 8
{
unsigned int(14) horizontal_offset;
unsigned int(14) vertical_offset;
unsigned int(4) window_priority;
}
Regarding to the best size to read from memory, I'm sure you will enjoy reading this post about memory access performance and cache effects.
One thing to consider about memory-mapping files is that a file with a size greater than the available address range will only be able to be map a portion of the file. To access the remainder of the file requires the first part to be unmapped and the next part to mapped in its place.
Since you're decoding mpeg streams you may want to use a double buffered approach with asynchronous file reading. It works like this:
blocksize = 65536 bytes (or whatever)
currentblock = new byte [blocksize]
nextblock = new byte [blocksize]
read currentblock
while processing
asynchronously read nextblock
parse currentblock
wait for asynchronous read to complete
swap nextblock and currentblock
endwhile