Positioning an ifstream in very large files - c++

I have to process very large log files (hundreds of gigabytes), and in order to speed things up I want to split that processing across all the cores I have available. Using seekg and tellg I'm able to estimate the block sizes in relatively small files and position each thread at the beginning of its block, but when the files grow big the indexes overflow.
How can I position an index in very big files when using C++ ifstreams on Linux?
Best regards.

The easiest way would be to do the processing on a 64-bit OS, and write the code using a 64-bit compiler. This will (at least normally) give you a 64-bit type for file offsets, so the overflow no longer happens, and life is good.

You have two options:
Use a 64-bit OS (see the sketch below).
Use OS-specific functions.
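For illustration, here is a minimal sketch of the first option, assuming a 64-bit Linux build where std::streamoff is a 64-bit signed type; the file name and thread count are placeholders, not part of the original question.

#include <fstream>
#include <iostream>

int main()
{
    // On a 64-bit build, std::streamoff is a 64-bit signed type,
    // so offsets beyond 2 GB do not overflow.
    std::ifstream in("huge.log", std::ios::binary);   // placeholder file name

    in.seekg(0, std::ios::end);
    std::streamoff file_size = in.tellg();            // total size in bytes

    const int num_threads = 8;                        // placeholder core count
    std::streamoff block_size = file_size / num_threads;

    for (int i = 0; i < num_threads; ++i)
    {
        std::streamoff block_start = i * block_size;
        // each worker would open its own std::ifstream, seekg(block_start),
        // then scan forward to the next '\n' so it starts on a line boundary
        std::cout << "thread " << i << " starts at offset " << block_start << '\n';
    }
}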

Related

C++: will this disk seek take a very large performance hit?

I am using the STL fstream utilities to read from a file. However, what I would like to do is read a specified number of bytes and then seek back some bytes and read again from that position. So, it is sort of an overlapped read. In code, this would look as follows:
#include <fstream>
using namespace std;

ifstream fileStream;
fileStream.open("file.txt", ios::in);
size_t read_num = 0;
size_t window_size = 200;   // overlap between consecutive reads
while (read_num < total_num)   // total_num: total number of bytes to process
{
    char buffer[1024];
    fileStream.read(buffer, sizeof(buffer));
    size_t num_bytes_read = fileStream.gcount();   // read() returns the stream, not a count
    read_num += num_bytes_read - window_size;
    fileStream.seekg(read_num);
}
This is not the only way to solve my problem, but it will make multi-tasking a breeze (I have been looking at other data structures like circular buffers, but those would make multi-tasking difficult). I was wondering if I could have your input on how much of a performance hit these seek operations might take when processing very large files. I will only ever use one thread to read the data from the file.
The files contain only large sequences of characters from the set {A,D,C,G,F,T}. Would it also be advisable to open the file as a binary file rather than in text mode as I am doing?
Because the file is large, I am also opening it in chunks, with the chunk size set to a 32 MB block. Would this be too large to take advantage of any caching mechanism?
On POSIX systems (notably Linux, and probably MacOSX), the C++ streams are based on lower-level primitives (often, system calls) such as read(2) and write(2); the implementation buffers the data (in the standard C++ library, which would call read(2) on buffers of several kilobytes), and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most not-too-big files (e.g. files of a few hundred megabytes on a laptop with several gigabytes of RAM) stay in RAM for a while once they have been read or written. See also sync(2).
As commented by Hans Passant, reading from the middle of a textual file can be error-prone (in particular, because a UTF-8 character may span several bytes) if not done very carefully.
Notice that from a C (fopen) or C++ point of view, textual files and binary files differ notably in how they handle end-of-line characters.
If performance matters a lot to you, you could directly use low-level system calls like read(2), write(2), and lseek(2), but then be careful to use wide enough buffers (typically several kilobytes, e.g. 4 Kbytes to 512 Kbytes, or even several megabytes). Don't forget to use the returned read or written byte count (some IO operations can be partial, or fail, etc.). If possible, avoid (for performance reasons) repeatedly calling read(2) for only a dozen bytes at a time. You could instead memory-map the file (or a segment of it) using mmap(2) (before mmap-ing, use stat(2) to get metadata, notably the file size). And you could give hints to the kernel using posix_fadvise(2) or (for files mapped into virtual memory) madvise(2). Performance details are heavily system dependent (file system, hardware - SSDs and hard disks are different! - and system load).
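As a hedged sketch of that mmap route (Linux assumed, error handling abbreviated, file name a placeholder): query the size with fstat, map the whole file read-only, then hint sequential access with madvise.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("data.bin", O_RDONLY);              // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Tell the kernel we intend to read the mapping sequentially.
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    const char *data = static_cast<const char *>(p);
    // ... scan data[0 .. st.st_size-1] ...

    munmap(p, st.st_size);
    close(fd);
}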
Finally, you could consider using some higher-level library for binary files, such as indexed files à la GDBM or the sqlite library, or consider using real databases such as PostgreSQL, MongoDB, etc.
Apparently, your files contain genomics information. Probably you don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors). Perhaps free software libraries to parse them already exist. Otherwise, you might consider a two-pass approach: a first pass reads the entire file sequentially and remembers (in C++ containers like std::map) the interesting parts and their offsets; a second pass then uses direct access. You might even have some preprocessor converting your genomics file into SQLite or GDBM files, and have your application work on those. You probably should avoid opening these files as text (open them as binary files instead), because end-of-line processing is useless to you.
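A minimal sketch of the first pass of such a two-pass approach; the build_index helper, the one-record-per-line assumption, and the choice of the first 16 characters as lookup key are purely illustrative placeholders.

#include <fstream>
#include <map>
#include <string>

// First pass: remember the starting offset of every record so that a second
// pass (or other code) can seek straight to it later.
std::map<std::string, std::streamoff> build_index(const char *path)
{
    std::map<std::string, std::streamoff> index;
    std::ifstream in(path, std::ios::binary);

    std::string line;
    std::streamoff offset = in.tellg();       // offset of the line about to be read
    while (std::getline(in, line))
    {
        if (!line.empty())
            index[line.substr(0, 16)] = offset;   // key choice is illustrative only
        offset = in.tellg();                      // offset of the next line
    }
    return index;
}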
On a 64-bit system, if you handle only a few files (not thousands of them at once) of several dozen gigabytes each, memory-mapping them (with mmap) should make sense; then use madvise (but on a 32-bit system, you won't be able to mmap an entire file that size).
Plausibly, yes. Whenever you seek, the cached file data for that file is (likely to be) discarded, causing the extra overhead of, at least, a system call to fetch the data again.
Assuming the file isn't enormous, it MAY be a better choice to read the entire file into memory (or, if you don't need portability, use a memory mapped file, at which point caching of the file content is trivial - again assuming the entire file fits in (virtual) memory).
However, all this is implementation dependent, so measuring the performance of each method is essential - it's only possible to KNOW these things for a given system by measuring; it's not something you can read about and get precise information on from the internet (not even here on SO), because a whole bunch of factors affect the behaviour.

Which type of file access to use?

I'm building an IO framework for a custom file format, for Windows only. The files are anywhere between 100 MB and 100 GB. The reads/writes come in sequences of a few hundred KB to a couple of MB at unpredictable locations. Read speed is most critical, though CPU use might trump that, since I hear fstream can put a real dent in it when working with SSDs.
Originally I was planning to use fstream, but as I read a bit more into file IO, I discovered a bunch of other options. Since I have little experience with the topic, I'm torn as to which method to use. The options I've scoped out are fstream, FILE, and memory-mapped files.
In my research so far, all I've found is a lot of contradictory benchmark results depending on chunk sizes, buffer sizes, and other bottlenecks I don't understand. It would be helpful if somebody could clarify the advantages/drawbacks of those options.
In this case the bottleneck is mainly your hardware and not the library you use, since the blocks you are reading are relatively big (200 KB - 5 MB, compared to the sector size) and sequential (read in one go).
With hard disks (high seek time) it could make sense to read more data than needed to optimize caching. With SSDs I would not use big buffers but read only the exact required data, since seek time is not a big issue.
Memory-mapped files are convenient for completely random access to your data, especially in small chunks (even a few bytes), but it takes more code to set up a memory-mapped file. On 64-bit systems you could map the whole file (virtually) and then let the OS caching system optimize the reads (multiple accesses to the same data). You can then return just a pointer to the required memory location, without even needing temporary buffers or memcpy. It would be very simple.
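A hedged Win32 sketch of that approach (error handling abbreviated; the file name is a placeholder): open the file, create a read-only mapping, and map a view you can read through directly.

#include <windows.h>

int main()
{
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) return 1;

    // Map the whole file (size 0 = entire file) -- practical on 64-bit; on
    // 32-bit you would map smaller views at aligned offsets instead.
    const char *data = static_cast<const char *>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (!data) return 1;

    // ... read directly from data[...] without intermediate buffers ...

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}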
fstream gives you extra features compared to FILE, which I think aren't of much use in your case.

Reading big files sequentially

How to process (in read-only fashion) a big binary file in C/C++ on Linux as fast as possible? Via read or mmap? What buffer size? (No boost or anything.)
mmap is faster and optimal for read-only applications. See the answer here:
https://stackoverflow.com/a/258097/1094175
You could use madvise with mmap, and you might also call readahead (perhaps in a separate thread, since it is a blocking syscall).
If you read the file using ordinary read(2), consider using posix_fadvise(2) and pass buffers of 32kbytes to 1Mbytes to read(2)...
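A small sketch of that read(2) path (Linux assumed, error handling abbreviated; the file name and buffer size are placeholders):

#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main()
{
    int fd = open("big.bin", O_RDONLY);               // placeholder file name
    if (fd < 0) return 1;

    // Hint that the whole file will be read sequentially (0, 0 = entire file).
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    std::vector<char> buf(1 << 20);                   // 1 MB buffer
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0)
    {
        // ... process buf[0 .. n-1] ...
    }
    close(fd);
}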
Call mmap on big enough regions: at least several dozen megabytes (assuming you have more than 1 GB of RAM), and if you have a lot of available RAM, on bigger regions (up to perhaps 80% of available RAM).
Take care of resource limits, e.g. those set with setrlimit.
For not-too-big files (and not too many of them), you could mmap them entirely. You'll need to call e.g. stat to get their size. As a rule of thumb, when reading one (not several) big file on my desktop machine, I would mmap it in full if it is less than 3 GB.
If performance is important, take time to benchmark your application and your system, and to tune it accordingly. Making the parameters (like the mmap-ed region size) configurable makes sense.
The /proc/ filesystem, notably inside /proc/self/ from your application, gives several measures (e.g. /proc/self/status, /proc/self/maps, /proc/self/smaps, /proc/self/statm etc.)
GNU libc should use mmap for reading FILEs which you have fopen-ed with "rm" mode.

Best way to read 12-15GB ASCII file in C++

I am trying to count the number of lines in a huge file. This ASCII file is anywhere from 12-15 GB. Right now, I am using something along the lines of readline() to count each line of the file. But of course, this is extremely slow. I've also tried to implement lower-level reading using seekg() and tellg(), but due to the size of my file, I am unable to allocate a large enough array to store each character for a '\n' comparison (I have 8 GB of RAM). What would be a faster way of reading this ridiculously large file? I've looked through many posts here and most people don't seem to have trouble with the 32-bit system limitation, but here I see that as a problem (correct me if I'm wrong).
Also, if anyone can recommend me a good way of splitting something this large, that would be helpful as well.
Thanks!
Don't try to read the whole file at once. If you're counting lines, just read in chunks of a given size. A couple of MB should be a reasonable buffer size.
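A minimal sketch of that chunked approach (file name and chunk size are placeholders), counting '\n' a few MB at a time so the whole 12-15 GB never has to fit in memory:

#include <fstream>
#include <vector>
#include <algorithm>
#include <cstdint>
#include <iostream>

int main()
{
    std::ifstream in("huge.txt", std::ios::binary);   // placeholder file name
    std::vector<char> buf(4 * 1024 * 1024);           // 4 MB chunk
    std::uint64_t lines = 0;

    while (in)
    {
        in.read(buf.data(), buf.size());
        std::streamsize n = in.gcount();              // bytes actually read
        lines += std::count(buf.data(), buf.data() + n, '\n');
    }
    std::cout << lines << " lines\n";
}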
Try Boost Memory-Mapped Files: the same code works on both Windows and POSIX platforms.
Memory-mapping a file does not require that you actually have enough RAM to hold the whole file. I've used this technique successfully with files up to 30 GB (I think I had 4 GB of RAM in that machine). You will need a 64-bit OS and 64-bit tools (I was using Python on FreeBSD) in order to be able to address that much.
Using a memory mapped file significantly increased the performance over explicitly reading chunks of the file.
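For reference, a hedged sketch using Boost.Iostreams' mapped_file_source for the line-counting task (the file name is a placeholder):

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <iostream>

int main()
{
    boost::iostreams::mapped_file_source file("huge.txt");
    // data() / size() expose the mapping as a plain read-only byte range.
    auto lines = std::count(file.data(), file.data() + file.size(), '\n');
    std::cout << lines << " lines\n";
}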
What OS are you on? Is there no wc -l or equivalent command on that platform?

Memory mapped files performance - memory management when working with large data sets

I have a situation where I need to work with a number (15-30) of large (several hundred MB) data structures. They won't fit into memory all at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other, etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data when it's loaded into memory, and use memory-mapped files to access the data. I use mmap 'views' of, for example, 50 MB (50 MB of each file is loaded into memory at a time), so when I have 15 data sets, my process uses 750 MB of memory for the data. That was OK initially (for testing); when I have more data, I adjust the 50 MB down at the cost of some speed.
However, this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 MB' and then divide 500 by the number of data structures to arrive at an mmap view size. I have found that when I set this 'target memory usage' too high, the virtual memory manager's disk thrashing will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2 GB that I have (assuming 32-bit Windows XP and up, without /3GB for now), or try to keep my process size smaller so that my software won't hog the machine? When I have two Visual Studios, Outlook, and Firefox open on my machine, they easily use half a GB of virtual memory by themselves - if I let my software use 2 GB of virtual memory, the swapping will severely slow down the machine. But then how do I determine the 'best' process size?
What can I do to keep performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data real quick, causing the whole memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, again and again (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
Thanks.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If this is below your "too darn slow" threshold, bump the size up a little bit for the next file. Do this iteratively. When you go above the threshold, slowly back the size off, again iteratively.
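A rough sketch of such a feedback loop; the tune_view_size helper and every number in it are placeholder assumptions, not recommendations.

#include <cstddef>

// Grow the view while processing stays under a wall-clock threshold,
// shrink it when processing goes over.
std::size_t tune_view_size(std::size_t current_bytes, double seconds_taken)
{
    const double too_darn_slow = 2.0;                  // threshold in seconds (placeholder)
    const std::size_t min_bytes = 16u * 1024 * 1024;   // 16 MB floor
    const std::size_t max_bytes = 256u * 1024 * 1024;  // 256 MB ceiling

    if (seconds_taken < too_darn_slow && current_bytes < max_bytes)
        return current_bytes + current_bytes / 4;      // bump up by 25%
    if (seconds_taken > too_darn_slow && current_bytes > min_bytes)
        return current_bytes - current_bytes / 4;      // back off by 25%
    return current_bytes;
}

// The caller would time each pass (e.g. with std::chrono::steady_clock) and
// feed the elapsed seconds back into tune_view_size before remapping the next view.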
I think this is a good case for trying Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It will allow you to use more than 4 GB of memory by providing a sliding window. The drawback is that not all versions of Windows have it.
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (i.e., do one large read rather than a lot of small reads).
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
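A hedged sketch of the fseek/fread route mentioned above (MSVC CRT assumed, since the question is Windows-only; file name, offset, and chunk size are placeholders), using _fseeki64 so the offset isn't limited to 2 GB:

#include <cstdio>
#include <cstddef>
#include <vector>

int main()
{
    std::FILE *f = std::fopen("data.bin", "rb");       // placeholder file name
    if (!f) return 1;

    const long long offset = 5LL * 1024 * 1024 * 1024; // e.g. 5 GB into the file
    std::vector<char> chunk(50 * 1024 * 1024);         // one 50 MB read

    _fseeki64(f, offset, SEEK_SET);                    // 64-bit seek (MSVC CRT extension)
    std::size_t n = std::fread(chunk.data(), 1, chunk.size(), f);
    // ... process chunk[0 .. n-1] as one large block ...

    (void)n;
    std::fclose(f);
}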
It will probably be best to fix the size of the memory-mapped file to be some percentage of the total system memory, with probably a set minimum.
Remember that the operating system will effectively load a whole memory page when you access a single byte; this may well happen in the background, but it will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies that access your data speculatively before it is actually required. These are the same considerations you need when optimizing for memory-cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data, since this will give you better fine-grained control of what data is read into memory and when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
Perhaps you can use /LARGEADDRESSAWARE with the Visual Studio linker, and use bcdedit, so that your process can use more than 2 GB of memory.