Memory Mapped Files and Max Open Files Limit - c++

I am using boost::iostreams::mapped_file_source to create memory-mapped files, in excess of 1024 of them. To my surprise, when I have created around 1024 memory-mapped files my program throws an exception stating there are too many files open. After some research I found that Ubuntu uses a default limit of 1024 open files per process (found from ulimit -n). Unfortunately, I need all of the files to be open at the same time. Does anyone know a way around this? Is it possible to make the files not count towards the limit somehow? I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance. And I would also like to not modify the operating system by changing the value. Any pointers in the right direction are much appreciated!

Why do you need many mapped files open? That seems very inefficient. Maybe you can map (regions of) a single large file?
Q. I was thinking of trying to combine them into a single file; however, I would like to avoid that if possible due to the performance
That's ... nonsense. The performance could basically only increase.
One particular thing to keep in mind is to align the different regions inside your "big mapped file" to a multiple of your memory page/disk block size. 4k should be a nice starter for this coarse alignment.
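As a rough illustration of the single-file approach, here is a minimal sketch using Boost.Iostreams; the combined file name, the region size, and the assumption that the file is large enough are all hypothetical, the point being that each region's offset is a multiple of the mapping alignment:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <cstddef>
    #include <iostream>

    namespace io = boost::iostreams;

    int main() {
        // Page / allocation granularity reported by the OS.
        const std::size_t align = static_cast<std::size_t>(io::mapped_file_source::alignment());
        const std::size_t region_size = 256 * align;   // assumed size of each logical "file"

        // Region 0 of the combined file (hypothetical name; assumed to hold
        // at least 2 * region_size bytes).
        io::mapped_file_params p0("combined.bin");
        p0.offset = 0;
        p0.length = region_size;
        io::mapped_file_source region0(p0);

        // Region 1 starts at an offset that is a multiple of the alignment.
        io::mapped_file_params p1("combined.bin");
        p1.offset = static_cast<io::stream_offset>(region_size);
        p1.length = region_size;
        io::mapped_file_source region1(p1);

        std::cout << region0.size() + region1.size() << " bytes mapped from a single file\n";
    }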

Related

c++: how to optimize IO?

I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that loading all of this information in order to solve the actual problem is a major bottleneck.
More specifically:
I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several GB).
The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).
Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to load every piece of data from a (large) file.
Are there any strategies for making the IO faster?
I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.
Any strategies or optimizations?
Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?
Certainly the fastest (but the most fragile) solution would be to mmap the data to a fixed address. Slap it all in one big struct, and instantiate the std::map with an allocator which will allocate in a block attached to the end of the struct. It's not simple, but it will be fast; one call to mmap, and the data is in your (virtual) memory. And because you're forcing the address in mmap, you can even store the pointers, etc.
As mentioned above, in addition to requiring a fair amount of work, it's fragile. Recompile your application, and the targeted address might not be available, or the layout might be different, or whatever. But since it's really just an optimization, this might not be an issue; anytime a compatibility issue arises, just drop the old file and start over. It will make the first run after a change which breaks compatibility extremely slow, but if you don't break compatibility too often...
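A hedged POSIX sketch of the fixed-address mmap idea (the target address and the file name are assumptions; if the address is unavailable the scheme fails, which is exactly the fragility described above):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        // Assumed-free address; note that MAP_FIXED silently replaces anything
        // already mapped there.
        void* const wanted_addr = reinterpret_cast<void*>(0x600000000000ull);

        int fd = open("precomputed.bin", O_RDONLY);   // hypothetical pre-computed blob
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // Stored pointers inside the blob stay valid only if this lands at the
        // same address on every run.
        void* p = mmap(wanted_addr, static_cast<std::size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
        close(fd);   // the mapping outlives the descriptor
        if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

        // ... reinterpret p as the big struct written by the pre-compute step ...

        munmap(p, static_cast<std::size_t>(st.st_size));
    }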
The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.
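A minimal sketch of that "one contiguous chunk" approach, using cstdio as in the question (the Record type is a hypothetical stand-in for your real data; raw read()/write() work the same way):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Record {              // plain-old-data: fixed size, no pointers
        double probability;
        std::int32_t key;
        std::int32_t value;
    };

    bool save_records(const char* path, const std::vector<Record>& records) {
        std::FILE* f = std::fopen(path, "wb");
        if (!f) return false;
        std::uint64_t n = records.size();
        bool ok = std::fwrite(&n, sizeof n, 1, f) == 1                      // size header
               && std::fwrite(records.data(), sizeof(Record), n, f) == n;   // one bulk write
        std::fclose(f);
        return ok;
    }

    bool load_records(const char* path, std::vector<Record>& records) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return false;
        std::uint64_t n = 0;
        if (std::fread(&n, sizeof n, 1, f) != 1) { std::fclose(f); return false; }
        records.resize(n);
        bool ok = std::fread(records.data(), sizeof(Record), n, f) == n;    // one bulk read
        std::fclose(f);
        return ok;
    }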
The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the I/O as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and that you can convert back to the map easily and quickly. If, for example, your keys are ints and your values are of constant size, then you could make an array of keys and an array of values, copy your keys into the one array and your values into the other, and then write() the two arrays, possibly writing out their sizes as well. Again, you read things in with only two or three calls to read().
Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact and fast to read in. Three things make I/O slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in memory.
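A sketch of the map-flattening part, assuming int keys and int values as in the question's std::map<int,int> (the count header plus two bulk writes mirror the recipe above):

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    bool save_map(std::FILE* f, const std::map<int, int>& m) {
        std::vector<std::int32_t> keys, values;
        keys.reserve(m.size());
        values.reserve(m.size());
        for (const auto& kv : m) { keys.push_back(kv.first); values.push_back(kv.second); }

        std::uint64_t n = m.size();
        return std::fwrite(&n, sizeof n, 1, f) == 1                            // count
            && std::fwrite(keys.data(), sizeof(std::int32_t), n, f) == n        // all keys
            && std::fwrite(values.data(), sizeof(std::int32_t), n, f) == n;     // all values
    }

    bool load_map(std::FILE* f, std::map<int, int>& m) {
        std::uint64_t n = 0;
        if (std::fread(&n, sizeof n, 1, f) != 1) return false;
        std::vector<std::int32_t> keys(n), values(n);
        if (std::fread(keys.data(), sizeof(std::int32_t), n, f) != n) return false;
        if (std::fread(values.data(), sizeof(std::int32_t), n, f) != n) return false;
        for (std::uint64_t i = 0; i < n; ++i) m[keys[i]] = values[i];
        return true;
    }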
Some guidelines:
multiple calls to read() are more expensive than a single call
binary files are faster than text files
a single file is faster than multiple files, for large values of "multiple"
use memory-mapped files if you can
use a 64-bit OS to let the OS manage the memory for you
Ideally, I'd try to put all the long doubles into a memory-mapped file, and all the maps into binary files.
Divide and conquer: if 64 bits is not an option, try to break your data into large chunks such that chunks are never used together, and each chunk is needed in its entirety when it is needed. This way you can load the chunks when they are needed and discard them when they are not.
These suggestions to load the whole data set into RAM are good when two conditions are met:
The sum of all I/O time during the run is much greater than the cost of loading all the data into RAM
A relatively large portion of the data is accessed during the run
(they are usually met when an application runs for a long time processing different data)
However, for other cases other options might be considered.
For example, it is essential to understand whether the access pattern is truly random. If not, look into reordering the data so that items which are accessed together are close to each other. This ensures that OS caching performs at its best, and also reduces HDD seek times (not a concern for SSDs, of course).
If accesses are truly random, and the application does not run long enough to amortize the one-time data-loading cost, I would look into the architecture, e.g. extract this data manager into a separate module that keeps the data preloaded.
For Windows it might be a system service; for other OSes other options are available.
Cache, cache, cache. If it's only several GB it should be feasible to cache most if not all of your data in something like memcached. This is an especially good solution if you're using MPI across multiple machines rather than just multiple processors on the same machine.
If it's all running on the same machine, consider a shared memory cache if you have the memory available.
Also, make sure your file writes are being done on a separate thread. No need to block an entire process waiting for a file to write.
As was said, cache as much as you can in memory.
If you're finding that the amount you need to cache is larger than your memory will allow, try swapping the cached data between memory and disk, the way virtual memory pages are swapped to disk when needed. It is essentially the same problem.
One common method is the Least Recently Used (LRU) algorithm for determining which page to swap out.
It really depends on how much memory is available and what the access pattern is.
The simplest solution is to use memory-mapped files. This generally requires that the file has been laid out as if the objects were in memory, so you will need to use only POD data with no pointers (but you can use relative indexes).
You need to study your access pattern to see if you can group together the values that are often used together. This will help the OS cache those values better (i.e., keep them in memory for you, rather than always going to the disk to read them).
Another option is to split the file into several chunks, preferably in a logical way. It might be necessary to create an index file that maps a range of values to the file that contains them.
Then you can access only the set of files required.
Finally, for complex data structures (where memory-mapped files fail) or for sparse reading (when you only ever extract a small piece of information from a given file), it might be interesting to read about LRU caches.
The idea would be to use serialization and compression. You write several files, one of which is an index, and compress all of them (zip). Then, at launch time, you start by loading the index and keeping it in memory.
Whenever you need to access a value, you first try your cache; if it is not there, you access the file that contains it, decompress it in memory, and dump its content into your cache. Note: if the cache is too small, you have to be picky about what you dump in... or reduce the size of the files.
The frequently accessed values will stay in cache, avoiding unnecessary round-trips, and because the files are zipped there will be less IO.
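A bare-bones sketch of such an LRU cache (ChunkId, Chunk, and the load_and_decompress stub are hypothetical placeholders for your index lookup, file read, and unzip step):

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <utility>

    using ChunkId = int;
    struct Chunk { /* decompressed contents of one data file */ };

    // Stub: the real version would look the chunk up in the index, read its
    // (zipped) file, and decompress it into memory.
    inline Chunk load_and_decompress(ChunkId) { return Chunk{}; }

    class LruChunkCache {
    public:
        explicit LruChunkCache(std::size_t capacity) : capacity_(capacity) {}

        const Chunk& get(ChunkId id) {
            auto it = index_.find(id);
            if (it != index_.end()) {
                // Hit: move the chunk to the front (most recently used).
                order_.splice(order_.begin(), order_, it->second);
                return it->second->second;
            }
            // Miss: evict the least recently used chunk if the cache is full.
            if (order_.size() >= capacity_) {
                index_.erase(order_.back().first);
                order_.pop_back();
            }
            order_.emplace_front(id, load_and_decompress(id));
            index_[id] = order_.begin();
            return order_.front().second;
        }

    private:
        using List = std::list<std::pair<ChunkId, Chunk>>;
        std::size_t capacity_;
        List order_;                                        // front = most recently used
        std::unordered_map<ChunkId, List::iterator> index_;
    };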
Structure your data in a way that caching can be effective. For instance, when you are reading "certain pieces," if those are all contiguous, the disk won't have to seek around to gather all of them.
Reading and writing in batches, instead of record by record, will help if you are sharing disk access with another process.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
As far as I understand, the std::maps are also pre-calculated and there are no insert/remove operations, only lookups. How about replacing the maps with a hash map, something like std::unordered_map or sparsehash? In theory that can give a performance gain.
More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map, and much more - and save all this stuff to disk (several Gb).
Don't reinvent the wheel. I'd suggest using a key-value data store, such as Berkeley DB: http://docs.oracle.com/cd/E17076_02/html/gsg/C/concepts.html
This will enable saving and sharing the files, caching the parts you actually use a lot, and keeping the other parts on disk.
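A minimal sketch of what that looks like with the Berkeley DB C API from the linked guide (the database file name and the key/value layout are assumptions, and error handling is trimmed):

    #include <db.h>
    #include <cstring>

    int main() {
        DB* db = nullptr;
        if (db_create(&db, nullptr, 0) != 0) return 1;
        if (db->open(db, nullptr, "precomputed.db", nullptr, DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        // Store one pre-computed probability under an integer key (hypothetical layout).
        int key = 42;
        long double probability = 0.125L;

        DBT k, v;
        std::memset(&k, 0, sizeof k);
        std::memset(&v, 0, sizeof v);
        k.data = &key;          k.size = sizeof key;
        v.data = &probability;  v.size = sizeof probability;
        db->put(db, nullptr, &k, &v, 0);

        // Fetch it back; by default the library manages the returned buffer.
        DBT out;
        std::memset(&out, 0, sizeof out);
        db->get(db, nullptr, &k, &out, 0);

        db->close(db, 0);
    }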

logical operations on large binary file data in C/C++

I have two binary files (on the order of tens of MB) and I want to OR every bit of these files. And of course, I want it to be as efficient as possible.
So I have two ways in mind to do that, but I still think (I kinda feel) that there should be a more efficient way that I do not know of.
Given file a and b .. what I want to do is a = a|b
Load the two files, parse them into two huge std::bitsets, and OR them together
Load the two files byte by byte and OR them in a huge for loop...
Is there any other way to do that?
Don't go byte-by-byte. That'd be seriously slow. Instead, read the files in chunks. Find what the block size is for your system (4k? 8K? 64k?) and read the file using chunks of that size. Then you can loop through the byte streams in memory and do the OR operations there.
In logical terms, even though you might only be reading a byte at a time, the OS will still read an entire block's worth of data, then throw away all but the byte you wanted. Next time around that block will be cached, but it's still going through the full read motions for every byte you want. So... just suck the entire block into memory and save yourself that wasted overhead.
I would recommend loading the two files a chunk at a time, where a chunk is some appropriate portion of the data. The best size would depend on your operating system and filesystem, but it's usually something like the cluster size, or 2 * the cluster size, or so on... You would have to run some tests to determine the best buffer size.
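A hedged sketch of that chunked approach for a = a|b (the 64 KB chunk size is an assumption to be tuned to your block/cluster size; bytes beyond the shorter file are left untouched):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    bool or_file_into(const char* a_path, const char* b_path) {
        std::FILE* a = std::fopen(a_path, "r+b");   // read and update a in place
        std::FILE* b = std::fopen(b_path, "rb");
        if (!a || !b) { if (a) std::fclose(a); if (b) std::fclose(b); return false; }

        const std::size_t chunk = 64 * 1024;        // assumed block size; tune per filesystem
        std::vector<unsigned char> buf_a(chunk), buf_b(chunk);

        for (;;) {
            long pos = std::ftell(a);               // start of the current chunk in a
            std::size_t na = std::fread(buf_a.data(), 1, chunk, a);
            std::size_t nb = std::fread(buf_b.data(), 1, chunk, b);
            std::size_t n = na < nb ? na : nb;      // OR only the overlapping part
            if (n == 0) break;

            for (std::size_t i = 0; i < n; ++i) buf_a[i] |= buf_b[i];

            std::fseek(a, pos, SEEK_SET);           // switch from reading to writing
            std::fwrite(buf_a.data(), 1, n, a);
            std::fseek(a, pos + static_cast<long>(n), SEEK_SET);  // and back to reading
        }
        std::fclose(a);
        std::fclose(b);
        return true;
    }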
I don't think you would have any performance advantage either way (if in your "second option" you are going to load the file in big chunks), after all you'd be using a big stack-allocated buffer in both cases (which is what std::bitset boils down to), so go with the one you like best.
The only advantage I see in std::bitset::operator|=, besides clarity, is that it may be able to exploit some platform-specific trick to OR big sequences of bytes, but I think the compiler would be able to optimize your big "OR loop" anyway.

Best way to read 12-15GB ASCII file in C++

I am trying to count the number of lines in a huge file. This ASCII file is anywhere from 12-15 GB. Right now, I am using something along the lines of readline() to count each line of the file. But of course, this is extremely slow. I've also tried to implement lower-level reading using seekg() and tellg(), but due to the size of my file, I am unable to allocate a large enough array to store each character to run a '\n' comparison (I have 8 GB of RAM). What would be a faster way of reading this ridiculously large file? I've looked through many posts here, and most people don't seem to have trouble with the 32-bit system limitation, but here I see that as a problem (correct me if I'm wrong).
Also, if anyone can recommend me a good way of splitting something this large, that would be helpful as well.
Thanks!
Don't try to read the whole file at once. If you're counting lines, just read in chunks of a given size. A couple of MB should be a reasonable buffer size.
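A sketch of that chunked counter (the 4 MB buffer is an assumption; memchr does the scanning so the inner loop stays cheap):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    std::uint64_t count_lines(const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return 0;

        std::vector<char> buf(4 * 1024 * 1024);   // a few MB, as suggested above
        std::uint64_t lines = 0;
        std::size_t n;
        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
            const char* p = buf.data();
            const char* end = p + n;
            // Scan each chunk for newlines without reading the file byte by byte.
            while ((p = static_cast<const char*>(std::memchr(p, '\n', end - p))) != nullptr) {
                ++lines;
                ++p;
            }
        }
        std::fclose(f);
        return lines;
    }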
Try Boost Memory-Mapped Files: one codebase for both Windows and POSIX platforms.
Memory-mapping a file does not require that you actually have enough RAM to hold the whole file. I've used this technique successfully with files up to 30 GB (I think I had 4 GB of RAM in that machine). You will need a 64-bit OS and 64-bit tools (I was using Python on FreeBSD) in order to be able to address that much.
Using a memory mapped file significantly increased the performance over explicitly reading chunks of the file.
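A minimal sketch of the mapped-file version with Boost.Iostreams (requires a 64-bit build to map a 12-15 GB file in a single view, as noted above):

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        boost::iostreams::mapped_file_source src(argv[1]);
        // Count newlines directly over the mapped bytes; the OS pages data in as needed.
        std::ptrdiff_t lines = std::count(src.data(), src.data() + src.size(), '\n');
        std::cout << lines << " lines\n";
    }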
What OS are you on? Is there no wc -l or equivalent command on that platform?

Memory mapped files performance - memory management when working with large data sets

I have a situation where I need to work with a number (15-30) of large (several hundred MB) data structures. They won't fit into memory all at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other, etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data when it's loaded into memory, and use memory-mapped files to access the data. I use mmap 'views' of, for example, 50 MB (50 MB of each file is loaded into memory at a time), so when I have 15 data sets, my process uses 750 MB of memory for the data. That was OK initially (for testing); when I have more data I adjust the 50 MB down at the cost of some speed.
However, this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 MB' and then divide 500 by the number of data structures to arrive at an mmap view size. I have found that when I try to set this 'target memory usage' too high, disk thrashing by the virtual memory manager will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2 GB that I have (assuming 32-bit Windows XP and up, non-/3GB for now) or try to keep my process size smaller so that my software won't hog the machine? When I have two Visual Studios, Outlook and Firefox open on my machine, those easily use half a GB of virtual memory by themselves - if I let my software use 2 GB of virtual memory, the swapping will severely slow down the machine. But then how do I determine the 'best' process size?
What can I do to keep the performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means that it zips over hundreds of megabytes of data real quick, causing the whole set of memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, again and again (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
Thanks.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If this is below your "too darn slow" threshold, bump the size up a little bit for the next file. Do this iteratively. When you go above the threshold, slowly back the size off iteratively.
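A toy sketch of that feedback loop (the threshold, step, and minimum are assumptions; measure the wall-clock time of each pass and feed it back in):

    #include <chrono>
    #include <cstddef>

    // Returns the view size to use for the next file/pass, given how long the
    // last pass took. Grows while we are under the "too darn slow" threshold,
    // backs off slowly once we cross it.
    std::size_t next_view_size(std::size_t current, std::chrono::milliseconds elapsed) {
        const std::chrono::milliseconds too_darn_slow(2000);   // assumed threshold per pass
        const std::size_t step = 8 * 1024 * 1024;              // assumed 8 MB increment
        const std::size_t min_size = 16 * 1024 * 1024;         // assumed floor

        if (elapsed < too_darn_slow)
            return current + step;                              // still fast: try a bigger view
        return current > min_size + step ? current - step : min_size;
    }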
I think it's a good place to try Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It will allow you to use more than 4 GB of memory by providing a sliding window. The drawback is that not all versions of Windows have it.
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (i.e., do one large read rather than a lot of small reads).
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say, 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
It will probably be best to fix the size of the memory-mapped file to be some percentage of the total system memory, with probably a set minimum.
Remember that the operating system will effectively load a whole memory page when you access a single byte; this may well happen in the background, but it will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies, accessing your data speculatively before it is actually required. These are the same considerations you face when optimizing for memory cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data since this will give you better fine-grain control of what data is written to memory when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
Perhaps you can use /LARGEADDRESSAWARE in your Visual Studio linker settings, and use bcdedit, so that your process can use more than 2 GB of memory.

Are memory mapped files bad for constantly changing data?

I have a service that is responsible for collecting a constantly updating stream of data off the network. The intent is that the entire data set must be available for use (read only) at any time. This means that everything from the newest data message that arrives down to the oldest should be accessible to client code.
The current plan is to use a memory-mapped file on Windows, primarily because the data set is enormous, spanning tens of GiB. There is no way to know which part of the data will be needed, but when it's needed, the client might need to jump around at will.
Memory mapped files fit the bill. However I have seen it said (written) that they are best for data sets that are already defined, and not constantly changing. Is this true? Can the scenario that I described above work reasonably well with memory mapped files?
Or am I better off keeping a memory-mapped file for all the data up to some number of MB of recent data, so that the memory-mapped file holds almost 99% of the history of the incoming data, but I store the most recent, say, 100 MB in a separate memory buffer. Every time this buffer becomes full, I move it to the memory-mapped file and then clear it.
Any data set that is defined and doesn't change is best!
Memory-mapped files generally win over anything else - most OSes will cache the accesses in RAM anyway.
And the performance will be predictable, you don't fall off a cliff when you start to swap.
Sounds like a database fits your description. Paging is something most commercial ones do well out of the box.
From your problem statement, I see the following requirements:
data must always be available
data is written once; I assume it is append-only, never overwritten
data read access pattern is random, i.e. jumping around
there also appears to be an implicit latency requirement
It seems to me that a memory-mapped file was chosen to address 3) + 4). If your data can fit into memory, this may well be a reasonable solution. However, if your data is too large to fit in memory, a memory-mapped file may result in performance issues due to frequent page faults.
You did not describe how the "jumping around" is done. If it is possible to build an index, you may be able to save the data into multiple files, keep the index in memory, use the index to load and serve data, and also cache the most frequently used data. The basic idea is similar to a disk-based hash. This is probably a more scalable solution.
Since you tagged this Win32 I'm assuming you're working on a 32-bit machine, in which case you simply don't have enough address space to memory-map your entire data set. This means you will have to create and destroy mappings into the file as you "jump around", which is going to make this less efficient than you might expect.
In practice, you typically have a bit more than 1 GB of contiguous address space to memory-map the file into on a 32-bit Windows box, and you can end up with less if you fragment your address space.
That being said, doing this with memory maps does have a benefit if you are memory (not address space) constrained, since when you memory map a file as read only (as opposed to explicitly reading it into memory) the OS will not have a second copy in the file system cache.
The file can be mapped read-only in the thread that presents the data, while a background worker thread has the file mapped read-write to do the appending.
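A hedged Win32 sketch of that dual-view arrangement (the file name and a fixed mapping size are assumptions, and error handling plus the actual reader/appender threads are omitted):

    #include <windows.h>

    int main() {
        const DWORD mapping_size = 64 * 1024 * 1024;   // assumed fixed size for the sketch

        HANDLE file = CreateFileA("stream.dat", GENERIC_READ | GENERIC_WRITE,
                                  FILE_SHARE_READ, nullptr, OPEN_ALWAYS,
                                  FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE,
                                            0, mapping_size, nullptr);
        if (!mapping) return 1;

        // View handed to the thread that presents the data.
        const char* read_view = static_cast<const char*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

        // View used by the background worker that appends new records.
        char* write_view = static_cast<char*>(
            MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0));

        // ... reader reads from read_view while the worker writes into write_view ...

        UnmapViewOfFile(write_view);
        UnmapViewOfFile(read_view);
        CloseHandle(mapping);
        CloseHandle(file);
    }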