For file reading, when to use filebuf - c++

I'm going to be doing random-access reading from a read-only binary file. The interface to ifstream seems simpler than filebuf; but is there any use-case where filebuf would give better performance?
More details: I have a file of fixed-length (48-byte) records, and will be doing random-access reads in sequence -- read 1 record, process, read 1 record (from elsewhere), process, .... (Traversing a tree.) The file never changes. Since the records are fixed-length, I may later use a "character-type" that is the 48-byte record, but I don't imagine that has any performance effect.

If you are on Linux, using mmap might get around the whole problem of reading the file bit by bit.
Or Boost memory-mapped files?
http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/classes/mapped_file.html
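As a rough illustration of the mmap route for the 48-byte records in the question, here is a minimal sketch. It assumes a POSIX system (Linux, macOS); the names MappedRecords, Record and kRecordSize are made up for the example and do not come from any library. On Windows you would need MapViewOfFile or the Boost class linked above instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical record layout: 48 raw bytes, as described in the question.
constexpr std::size_t kRecordSize = 48;
struct Record { unsigned char bytes[kRecordSize]; };

// Read-only view of a file of fixed-length records (POSIX only).
class MappedRecords {
public:
    explicit MappedRecords(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed: " + path);
        struct stat st{};
        if (::fstat(fd, &st) != 0) {
            ::close(fd);
            throw std::runtime_error("fstat failed: " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);
        if (size_ > 0) {  // mmap rejects zero-length mappings
            void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("mmap failed: " + path);
            }
            data_ = static_cast<const unsigned char*>(p);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedRecords() {
        if (data_) ::munmap(const_cast<unsigned char*>(data_), size_);
    }
    MappedRecords(const MappedRecords&) = delete;
    MappedRecords& operator=(const MappedRecords&) = delete;

    std::size_t count() const { return size_ / kRecordSize; }

    // Copies record i out of the mapping; the kernel pages it in on first touch.
    Record at(std::size_t i) const {
        if (i >= count()) throw std::out_of_range("record index");
        Record r;
        std::memcpy(r.bytes, data_ + i * kRecordSize, kRecordSize);
        return r;
    }

private:
    const unsigned char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

With this, a tree traversal becomes a series of at(index) calls with no explicit seek or read; whether it actually beats ifstream or filebuf for your access pattern is something you would have to measure.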

Related

Boost::knuth_morris_pratt over a std::istream

I would like to use boost::algorithm::knuth_morris_pratt over some huge files (several hundred gigabytes). This means I can neither read the whole file into memory nor mmap it; I need to read it in chunks.
knuth_morris_pratt operates on an iterator, so I guess it is possible to make it read input data "lazily" (on-demand), it would be a matter of writing a "lazy" iterator for some file access class like ifstream, or better istream.
I would like to know if there is some adapter available (already written) that adapts istream to Boost's knuth_morris_pratt so that it won't read all file data all at once?
I know there is a boost::spirit::istream_iterator, but it lacks some methods (like operator+), so it would have to be modified to work.
On StackOverflow there's an implementation of bidirectional_iterator here, but it still requires some work before it can be used with knuth_morris_pratt.
Are there any istream iterators that are already written, tested and working?
Update: I can't do mmap, because my software should work on multiple operating systems, both on 32-bit and 64-bit architectures. Also very often I don't have the files anyway, they're being generated on-the-fly, that's why I search for a solution that involves streams.
You should simply memory map it.
In practice, 64-bit processors usually have 48-bit address space which is enough for 256 terabytes of memory.
Last I checked, Linux allows 128TB of virtual address space per process on x86-64
(from https://superuser.com/a/168121/75721)
Spirit's istream_iterator is actually its multi_pass adaptor, and it has different design goals. Unless you have a way to flush the buffer, it will grow indefinitely (by default allocating a deque buffer).
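If mmap really is ruled out and the data arrives as a stream, one alternative is to search chunk by chunk and carry the last (pattern length - 1) bytes over to the next chunk, so a match that straddles a chunk boundary is still seen. The sketch below is only that: it uses plain std::search rather than Boost's knuth_morris_pratt, and the function name find_in_stream and its chunk_size parameter are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this out
#include <string>
#include <vector>

// Returns the byte offset of the first occurrence of `pattern` in `in`,
// or -1 if it does not occur. Memory use is bounded by chunk_size plus
// the pattern length, regardless of how long the stream is.
long long find_in_stream(std::istream& in, const std::string& pattern,
                         std::size_t chunk_size = 1 << 20) {
    if (pattern.empty()) return 0;
    const std::size_t overlap = pattern.size() - 1;
    std::vector<char> buf(overlap + chunk_size);
    std::size_t carried = 0;  // bytes kept from the previous chunk
    long long base = 0;       // stream offset of buf[0]
    for (;;) {
        in.read(buf.data() + carried, static_cast<std::streamsize>(chunk_size));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) return -1;  // end of stream (or read error)
        const std::size_t len = carried + got;
        const char* first = buf.data();
        const char* last = first + len;
        const char* hit = std::search(first, last, pattern.begin(), pattern.end());
        if (hit != last) return base + (hit - first);
        // Keep the tail so a match spanning two chunks is not missed.
        const std::size_t keep = std::min(overlap, len);
        std::memmove(buf.data(), buf.data() + (len - keep), keep);
        base += static_cast<long long>(len - keep);
        carried = keep;
    }
}
```

For example, with std::istringstream s("aaaaabcaaaa"), calling find_in_stream(s, "abc", 4) reports offset 4 even though the match crosses the first chunk boundary. Swapping std::search for a KMP searcher would only change the inner call, not the chunking logic.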

Recover object lazy-loading the containing file

I'm using a binary file to recover an object using boost::archive::binary_iarchive, but the file is too large (18GB) and loading the object pulls the entire file into memory. Is there a way to read the file in parts (a lazy load) to avoid the memory use?
What I have:
#include <fstream>
#include <boost/archive/binary_iarchive.hpp>
std::ifstream ifs(filename, std::ios::binary);
boost::archive::binary_iarchive ia(ifs);
MyObject obj;
ia >> obj;
Upgrading my comment to an answer:
@cmaster got really close to an approach that can work, but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put all the data in memory (the vector, e.g.). So the only real solutions would be to either:
1) put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this. This is a lot of effort, but relatively straightforward, conceptually.
2) modify the deserialization code to convert to a different on-disk format on the fly (instead of inserting into e.g. that vector), which would then allow mmap as cmaster suggested.
In other words, you'd "cannibalize" the Boost serialization implementation to migrate the data away from Boost serialization towards a raw binary format that can be used directly in mapped memory.
You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition is that your process runs as a 64-bit process; otherwise your virtual address space will be too small to fit the file into it.

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records. Total number of blocks in the file is n=x/m. If I know the size of one record, say b bytes (size of one block = b*m), I can read the complete block at once using the system call read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason I want to do this in the first place is to reduce disk I/O operations, which are much more expensive according to what I have learned.
Or will it take the same amount of time as when I read record by record from file and directly put it into vectors instead of reading block by block? On reading block by block, I will have only n disk I/O's whereas x I/O's if I read record by record.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that you can treat file contents as simply mapped into your process space as if you already had a pointer into the file contents. By simply inspecting memory contents and treating it as an array, or by copying data using memcpy() you will implicitly perform read operations, but only as necessary - operating system virtual memory subsystem is smart enough to do it very efficiently.
The only likely reason to avoid mmap is if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. But on a 64-bit OS using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is generally better and faster to use it than read().
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).
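To make the "block into a vector" part of the question concrete, here is a minimal sketch using std::ifstream::read. The 16-byte Record layout and the function name read_records are placeholders for illustration; the approach assumes the records were written on the same platform as raw, trivially copyable structs.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical fixed-size record of b = 16 bytes; substitute the real layout.
struct Record {
    char payload[16];
};
static_assert(std::is_trivially_copyable<Record>::value,
              "raw read() into Record requires a trivially copyable type");

// Reads the file one block (m records) at a time and appends every
// complete record to the result. A trailing partial record is dropped.
std::vector<Record> read_records(const std::string& path, std::size_t m) {
    if (m == 0) throw std::invalid_argument("block must hold at least 1 record");
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<Record> records;
    std::vector<Record> block(m);  // reused for every block
    while (in) {
        in.read(reinterpret_cast<char*>(block.data()),
                static_cast<std::streamsize>(m * sizeof(Record)));
        // gcount() tells us how much arrived; the last block may be short.
        const std::size_t whole =
            static_cast<std::size_t>(in.gcount()) / sizeof(Record);
        records.insert(records.end(), block.begin(),
                       block.begin() + static_cast<std::ptrdiff_t>(whole));
    }
    return records;
}
```

Because the stream buffers underneath anyway, the block size m here mostly controls how many read() calls your own code makes, not necessarily how many times the disk is touched.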

Getting information from a file without traversing its contents

This question made me search for what else can I get from a file without traversing its contents (means without inputting the contents using ifstream or getc etc).
Other than file size and number of characters, what other information can I gather? I searched fseek, I found I can use SEEK_SET, SEEK_CUR and SEEK_END, which only allow me to find the end of the file, start of the file and current pointer.
In order to make it a question, I specifically want to ask:
Can occurrences of some character or type of character (newline etc) be counted?
Can its contents be matched with a certain template?
Is using these methods faster than reading the file multiple times?
And I am asking about Microsoft Windows, not Linux.
1) No, because searching for something under unpredictable conditions requires a thorough examination of the contents. Examining is reading. Of course, you may collect some statistics beforehand, but you need to traverse your data at least once. You can use other applications to do this implicitly, but they will also traverse your file from the very beginning to the end. You may organize your file in some way to obtain the necessary info with a minimal number of read operations, but that all depends on your task, and there is no general approach (because any generality comes down to examining the whole source structure).
2) Also No (see above)
3) Yes. Store as much as possible (or required by task) in memory (that's called caching). For example, use mapping (See MapViewOfFile for Windows and mmap(2) on *nix systems), this uses some in-system caching mechanism.
1) No
2) No
3) Depends on whether there's an actual need to read the file multiple times.
There are no miracles here. The former question had a "shortcut" because the number of characters in the file equals its size in bytes (strictly speaking, an ANSI text file is considered a sequence of characters, each represented by a single byte).
The stat structure contains information about the file, including permissions, ownership, size, access and creation date info. As for metadata, maybe there's an API to tie into a Windows search database that might allow searching on other criteria, like content attributes (I'm a Linux guy, usually, so I don't know what Windows offers in this respect).
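A minimal sketch of pulling that metadata out through stat(), without opening or reading the file itself. It assumes a POSIX-style stat() is available (to my knowledge the Microsoft C runtime offers a similar _stat, but I have not tried this on Windows); FileInfo, inspect and describe are names invented for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>
#include <stdexcept>
#include <string>
#include <sys/stat.h>

// A few of the fields stat() reports, gathered in one place.
struct FileInfo {
    long long size_bytes;
    std::time_t modified;   // last modification time
    bool is_regular;        // regular file, as opposed to directory etc.
    unsigned permissions;   // low 9 permission bits (rwxrwxrwx)
};

// Queries the file system for metadata; the file contents are never read.
FileInfo inspect(const std::string& path) {
    struct stat st{};
    if (::stat(path.c_str(), &st) != 0)
        throw std::runtime_error("stat failed: " + path);
    FileInfo info;
    info.size_bytes = static_cast<long long>(st.st_size);
    info.modified = st.st_mtime;
    info.is_regular = S_ISREG(st.st_mode);
    info.permissions = static_cast<unsigned>(st.st_mode) & 0777u;
    return info;
}

// Prints a one-line summary of the metadata.
void describe(const std::string& path) {
    const FileInfo info = inspect(path);
    std::printf("%s: %lld bytes, mode %o, %s\n", path.c_str(),
                info.size_bytes, info.permissions,
                info.is_regular ? "regular file" : "not a regular file");
}
```

Anything beyond this kind of metadata (counting newlines, matching a template) still means reading the bytes, as the other answers say.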

Reading large txt efficiently in c++

I have to read a large text file (> 10 GB) in C++. This is a CSV file with variable-length lines. When I try to read line by line using ifstream it works but takes a long time; I guess this is because each time I read a line it goes to disk and reads, which makes it very slow.
Is there a way to read in buffers, for example read 250 MB in one shot (using the read method of ifstream) and then get lines from this buffer? I see a lot of issues with this solution, like the buffer can have incomplete lines, etc.
Is there a solution for this in C++ which handles all these cases? Are there any open source libraries that can do this, for example Boost?
Note: I want to avoid C-style FILE* pointers etc.
Try using the Windows memory-mapped file functions. The calls are buffered and you get to treat a file as if it's just memory.
memory mapped files
IOstreams already use buffers much as you describe (though usually only a few kilobytes, not hundreds of megabytes). You can use pubsetbuf to get it to use a larger buffer, but I wouldn't expect any huge gains. Most of the overhead in IOstreams stems from other areas (like using virtual functions), not from lack of buffering.
If you're running this on Windows, you might be able to gain a little by writing your own stream buffer, and having it call CreateFile directly, passing (for example) FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING. Under the circumstances, either of these may help your performance substantially.
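For reference, the pubsetbuf route mentioned above looks roughly like this. Treat it as a sketch: the effect of pubsetbuf on a file stream is implementation-defined, and with libstdc++ it only takes effect if called before the file is opened. The function name count_lines_big_buffer and the 1 MB default are arbitrary choices for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Counts lines using an ifstream whose filebuf was handed a larger buffer.
std::size_t count_lines_big_buffer(const std::string& path,
                                   std::size_t buffer_bytes = 1 << 20) {
    // Declared before the stream so it outlives it.
    std::vector<char> buffer(buffer_bytes);
    std::ifstream in;
    // Request the buffer before open(); an implementation may ignore this.
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) ++lines;
    return lines;
}
```

Timing this against the default buffer on your actual file is the only way to know whether it buys anything.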
If you want real speed, then you're going to have to stop reading lines into std::string, and start using char*s into the buffer. Whether you read that buffer using ifstream::read() or memory mapped files is less important, though read() has the disadvantage you note about potentially having N complete lines and an incomplete one in the buffer, and needing to recognise that (can easily do that by scanning the rest of the buffer for '\n' - perhaps by putting a NUL after the buffer and using strchr). You'll also need to copy the partial line to the start of the buffer, read the next chunk from file so it continues from that point, and change the maximum number of characters read such that it doesn't overflow the buffer. If you're nervous about FILE*, I hope you're comfortable with const char*....
As you're proposing this for performance reasons, I do hope you've profiled to make sure that it's not your CSV field extraction etc. that's the real bottleneck.
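The chunked read() approach with the partial-line handling described above could be sketched as follows. The names for_each_line and on_line are invented for the example; it uses memchr with an explicit length rather than the NUL-plus-strchr trick, and it grows the buffer if a single line is longer than the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Calls on_line(ptr, len) for every line in the file, without the '\n'.
// The pointer refers into the read buffer and is only valid during the call.
void for_each_line(const std::string& path, std::size_t buffer_bytes,
                   const std::function<void(const char*, std::size_t)>& on_line) {
    if (buffer_bytes == 0) throw std::invalid_argument("buffer must be non-empty");
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<char> buf(buffer_bytes);
    std::size_t have = 0;  // bytes of an unfinished line at the start of buf
    for (;;) {
        // A line longer than the whole buffer: make room for more of it.
        if (have == buf.size()) buf.resize(buf.size() * 2);
        in.read(buf.data() + have,
                static_cast<std::streamsize>(buf.size() - have));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;  // end of file

        const std::size_t end = have + got;
        std::size_t start = 0;
        while (start < end) {
            const void* nl = std::memchr(buf.data() + start, '\n', end - start);
            if (!nl) break;  // rest of the buffer is an incomplete line
            const std::size_t pos =
                static_cast<std::size_t>(static_cast<const char*>(nl) - buf.data());
            on_line(buf.data() + start, pos - start);
            start = pos + 1;
        }
        // Move the partial last line to the front; the next read appends to it.
        have = end - start;
        std::memmove(buf.data(), buf.data() + start, have);
    }
    if (have > 0) on_line(buf.data(), have);  // last line without trailing '\n'
}
```

The callback receives a pointer and a length, so CSV field splitting can work directly on the buffer without building a std::string per line.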
I hope this helps -
http://www.cppprog.com/boost_doc/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
BTW, you wrote that the buffer can end with an incomplete line. In that situation, how about reading 250 MB and then reading char by char until you reach the delimiter that completes the line?