Write a large file to disk from RAM - c++

If I need to write a large file from allocated memory to disk, what is the most efficient way to do it?
Currently I use something along the lines of:
char* data = static_cast<char*>(operator new(0xF00000000)); // 60 GB
// Do something to fill `data` with data
std::ofstream("output.raw", std::ios::binary).
write(data, 0xF00000000);
But I am not sure if the most straightforward way is also the most efficient, taking into account various buffering mechanisms and alike.
I am using Windows 7 64-bit and Visual Studio 2012 RC compiler with 64-bit target.

For Windows, you should use CreateFile API. Have a good read of that page and any links from it mentioning optimization. There are some flags you pass in to turn off buffering. I did this in the past when I was collecting video at about 800MB per second, and having to write off small parts of it as fast as possible to a RAID array.
Now, for the flags - I think it's primarily these:
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
For reading, you may want to use FILE_FLAG_SEQUENTIAL_SCAN, although I think this has no effect if buffering is turned off.
Have a look at the Caching Behaviour section
There's a couple of things you need to do. Firstly, you should always write amounts of data that are a multiple of the sector size. This is (or at least was) 512 bytes almost universally, but you may want to consider up to 2048 in future.
Secondly, your memory has to be aligned to that sector size too. You can either use _aligned_malloc() or just allocate more buffer than you need and align manually.
There may be other memory optimization concerns, and you may want to limit individual write operations to a memory page size. I never went into that depth. I was still able to write data at speeds very close to the disk's limit. It was significantly faster than using stdio calls.
If you need to do this in the background, you can use overlapped I/O, but to be honest I never understood it. I made a background worker thread dedicated to writing out video buffer and controlled it externally.

The most promising thing that comes to mind is memory mapping the output file. Depending on how the data gets filled, you may even be able to have your existing program write directly to the disk via the pointer, and not need a separate write step at the end. That trusts the OS to efficiently page the file, which it may be having to do with the heap memory anyway... could potentially avoid a disk-to-disk copy.
I'm not sure how to do it in Windows specifically, but you can probably notify the OS of your intended memory access pattern to increase performance further.
(boost::asio has portable support for memory mapped files)

If you want to use std::ofstream you should make sure of the following:
No buffer is used by the file stream. The way to do this to call out.setbuf(0, 0).
Make sure that the std::locale used by stream doesn't do any character conversion, i.e., std::use_facet<std::codecvt<char, char> >(loc).always_noconv() yields true. The "C" locale does this.
With this, I would expect that std::ofstream is as fast as any other approach writing a large buffer. I would also expect it to be slower than using memory mapped I/O because memory mapped I/O should avoid paging sections of the memory when reading them just to write their content.

Open a file with CreateFile, use SetEndOfFile to preallocate the space for the file (to avoid too much fragmentation as you write), then call WriteFile with 2 MB sized buffers (this size works the best in most scenarios) in a loop until you write the entire file out.
FILE_FLAG_NO_BUFFERING may help in some situations and may make the situation worse in others, so no real need to use it, because normally Windows file system write cache is doing its work well.

Related

Do memory mapped files provide advantage for large buffers?

My program works with large data sets that need to be stored in contiguous memory (several Gigabytes). Allocating memory using std::allocator (i.e. malloc or new) causes system stalls as large portions of virtual memory are reserved and physical memory gets filled up.
Since the program will mostly only work on small portions at a time, my question is if using memory mapped files would provide an advantage (i.e. mmap or the Windows equivalent.) That is creating a large sparse temporary file and mapping it to virtual memory. Or is there another technique that would change the system's pagination strategy such that less pages are loaded into physical memory at a time.
I'm trying to avoid building a streaming mechanism that loads portions of a file at a time and instead rely on the system's vm pagination.
Yes, mmap has the potential to speed things up.
Things to consider:
Remember the VMM will page things in and out in page size blocked (4k on Linux)
If your memory access is well localised over time, this will work well. But if you do random access over your entire file, you will end up with a lot of seeking and thrashing (still). So, consider whether your 'small portions' correspond with localised bits of the file.
For large allocations, malloc and free will use mmap with MAP_ANON anyway. So the difference in memory mapping a file is simply that you are getting the VMM to do the I/O for you.
Consider using madvise with mmap to assist the VMM in paging well.
When you use open and read (plus, as erenon suggests, posix_fadvise), your file is still held in buffers anyway (i.e. it's not immediately written out) unless you also use O_DIRECT. So in both situations, you are relying on the kernel for I/O scheduling.
If the data is already in a file, it would speed up things, especially in the non-sequential case. (In the sequential case, read wins)
If using open and read, consider using posix_fadvise as well.
This really depends on your mmap() implementation. Mapping a file into memory has several advantages that can be exploited by the kernel:
The kernel knows that the contents of the mmap() pages is already present on disk. If it decides to evict these pages, it can omit the write back.
You reduce copying operations: read() operations typically first read the data into kernel memory, then copy it over to user space.
The reduced copies also mean that less memory is used to store data from the file, which means more memory is available for other uses, which can reduce paging as well.
This is also, why it is generally a bad idea to use large caches within an I/O library: Modern kernels already cache everything they ever read from disk, caching a copy in user space means that the amount of data that can be cached is actually reduced.
Of course, you also avoid a lot of headaches that result from buffering data of unknown size in your application. But that is just a convenience for you as a programmer.
However, even though the kernel can exploit these properties, it does not necessarily do so. My experience is that LINUX mmap() is generally fine; on AIX, however, I have witnessed really bad mmap() performance. So, if your goal is performance, it's the old measure-compare-decide stand by.

Release Memory Mapped Memory

I am memory mapping a large file (~200GB) into a single region/view and sequentially writing to it. Every now and then I perform a boost::interprocess::mapped_region::flush(last, current, false).
After a while the process uses up the entire system memory. Which, from what I understand, is normal as it will be releasing the memory as other process request memory.
This works well on Windows 8. However, running on Windows 7 it doesn't seem to play well with the drivers for AJA video cards and it starts affecting performance (dropping IO packets).
Is there any way I can force the Windows 7 to flush parts of the memory to disk (after the data is written it is only interesting for a few seconds, and remember I am writing sequentially through the entire file), as to not use up the entire available system memory?
Flushing has nothing to with reclamation, IYAM. It just makes sure dirty pages are written out (I think you still need a disk sync to make sure it actually /hit the disk/).
So, you're looking for a way to unmap.
Maybe you can use a function like
EmptyWorkingSet to evict as many pages as possible
SetProcessWorkingSetSize to temporarily reduce the allowed process working set.
Of course, in a more portable fashion, you might just get away with unmapping and remapping. If the access is to spinning HDD and remains sequential across remaps, there might not be a performance penalty (there might be though, if the kernel prefetched data e.g. due to madvise() or the windows equivalent thereof)

Fastest way of reading a file in Linux?

On Linux what would be the fastest way of reading a file in to an array of bytes/to process the bytes? This can include memory-mapping, sys calls etc. I am not familiar with the many Linux-specific functions.
In the past I have used boost memory mapping, but I need faster Linux-specific performance rather than portability.
mmap should be the fastest way to access the contents of a file if the file is large enough. There's an initial cost for setting up the memory mappings, but that's offset by not needing to copy the data from the page cache into userland. And if you want all the contents of the file, the cost to allocate the memory to your program should be more or less the same as the cost of mmap.
Your best bet, as always, is to test and benchmark.
Don't let yourself get fooled by lazy stuff like memory mapping. Rather focus on what you really need. Do you really need to read the whole file into memory? Then the straight-forward way of opening, reading chunks in a loop, and closing the file will be as fast as it can be done.
But often you don't really want that. Instead you might want to read specific parts, a block here, a block there, jump through the file, read a block at a specific position, etc.
Then still fseeking out those positions and freading the blocks won't have overheads worth mentioning. But it can be more convenient to use memory mapping to let the operating system or a library deal with stuff like memory allocation etc. It won't get the job done faster, though.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems both make heavy use of posix_advise: first is a mmaped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well but only for the moderate cases, the read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited, mmap is able to pre-cache data making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to what I am working and provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
if (pid = fork()) {
waitpid(pid, ...);
} else {
dup2(dest, 1);
dup2(source, 0);
execlp("cat", "cat");
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.**
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunk much bigger than a few kilobytes. Can't than be half a megabyte instead?
The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have call the os for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.

Manipulating data in Memory instead of file

consider the function below:
int Func(char* filename);
int Func(FILE* filepointer);
these two do the same, reads alot of data from the given file (by name or pointer), analyze he result, and returns it.
I wanna call this function with lots of different data. Therefore I should write data into file, then pass the new filename to Func. but data is huge and reading and writing in hard is very slow. actually the analyze time is much less than I/O.
can I get rid of save/load data all the time by any means?
for example by making a FILE* pointer which points somewhere in Memory?
Update: obviously I don't have the source code of Func! It's a DLL call.
You could use memory-mapped file technique or something like boost::iostreams with custom memory sinks / sources.
Actually, the second variant is a lot more flexible, but sometimes all that flexi- and versatibility is simply not needed.
In many operating systems you can use an in-memory filesystem such as tmpfs -- and in Windows "temporary files" (opened with the appropriate flags, then rewound rather than closed) behave similarly (i.e., can stay in memory).
However, there isn't all that much to be gained there compared to writing (with lots of buffering) and reading (ditto) sequentially from an un-fragmented disk, for large files -- tmpfs's performance advantages are mostly for small files. If your performance is very bad, either the disk is horribly fragmented, or (perhaps more likely these days of self-adjusting filesystems) you're not using buffering appropriately (possibly just not buffering enough). (of course, both factors could be in play). Modern devices and filesystems can have awesome performance when just streaming huge buffers to and from memory, after all.
For a given amount of RAM devoted to buffering, you can get better performance (for what from app level look like huge numbers of tiny writes and reads) if that RAM is in userland in your app's address space (rather than under kernel control e.g. in a tmpfs), simply because you'll need fewer context switches -- and switches from user to kernel mode and back tend to dominate runtime when the only other ops performed are copies of small amounts of memory back and forth. When you use very large buffers in your app's stdio library, your "I/O" amounts to userland memory-memory copies within your address space with very rare "streaming" ops that actually transfers those buffers back and forth.