Optimal method to mmap a file to RAM? - c++

I am using mmap to read a file, and I only recently found out that it does not actually load the file into RAM; it only creates a virtual address range for it. Any access to the data can therefore still hit the disk, which I want to avoid, so I want to read it all into RAM.
I am reading the file via:
char* cs_virt;
cs_virt = (char*)mmap(0, nchars, PROT_READ, MAP_PRIVATE, finp, offset);
and when I loop over the data after this, I see that the virtual memory for this process has indeed been blown up. I want to copy this into RAM, though, so I do the following:
char* cs_virt;
cs_virt = (char*)mmap(0, nchars, PROT_READ, MAP_PRIVATE, finp, offset);
cs = (char*)malloc(nchars*sizeof(char));
for(int ichar = 0; ichar < nchars; ichar++) {
    cs[ichar] = cs_virt[ichar];
}
Is this the best method? If not, what is a more efficient way to do it? This takes place in a function, and cs is declared outside the function. Once I exit the function I will retain cs, but does the mapping behind cs_virt need to be unmapped explicitly, or will it go away on its own since the pointer is declared locally in the function?

If you are using Linux, you may be able to use MAP_POPULATE:
MAP_POPULATE (since Linux 2.5.46)
    Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults. MAP_POPULATE is supported for private mappings only since Linux 2.6.23.
This may be useful if you have time to spare when you mmap() but your later accesses need to be responsive. Consider also MAP_LOCKED if you really need the file to be mapped in and never swapped back out.
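For illustration, a minimal sketch of the question's mapping with MAP_POPULATE added (reusing the question's finp, nchars and offset; error handling kept brief):
#include <sys/mman.h>
#include <cstdio>

// Sketch: same read-only mapping as in the question, but prefaulted so later
// accesses should not block on page faults. Add MAP_LOCKED (or call mlock on
// the returned range) if the pages must also never be swapped out.
char* cs_virt = static_cast<char*>(mmap(nullptr, nchars, PROT_READ,
                                        MAP_PRIVATE | MAP_POPULATE, finp, offset));
if (cs_virt == MAP_FAILED) {
    perror("mmap");
}
// ... use cs_virt ...
// The mapping is not released when the pointer goes out of scope;
// unmap it explicitly when you are done:
munmap(cs_virt, nchars);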

MPI and I/O is a murky issue. HDF5 seems to be the most common library that can help you with that, but it often needs tuning for the particular cluster, which is often impossible for mere users of the cluster. A colleague of mine had better success with SIONlib and was able to get his code working on nearly 1e6 cores on JUGENE with it, so I'd have a look at that.
In both cases you will probably need to adapt your file format. In my colleague's case it even paid off to write the data in a parallel fashion using SIONlib, and to later do a sequential postprocessing pass to "defragment" the holes left by the parallel access pattern that SIONlib chose. It might be similar for input.


Why is this loop destroying my memory?

I have this function in my MMF class
void Clear() {
    int size = SizeB();
    int iter = size / sysInfo.granB;
    for (int i = 0; i < iter; i++) {
        auto v = (char*)MapViewOfFile(hMmf, FILE_MAP_READ | (write ? FILE_MAP_WRITE : 0), 0, i * sysInfo.granB, sysInfo.granB);
        std::memset(v, 0, sysInfo.granB);
        UnmapViewOfFile(v);
    }
}
So what it does is go through the whole file in the smallest addressable chunks (64 KB in this case): map the view, write zeros, unmap, repeat. It works all right and is very quick, but when I use it there is some phantom memory usage going on.
According to Windows Task Manager, the process itself is using just a few megabytes, but the "physical memory usage" leaps up when I use it on larger files. For instance, using this on a 2 GB file is enough to put my laptop into a coma for a few minutes: physical memory usage goes to 99%, everything in Task Manager is frantically shedding memory, and everything freezes for a while.
The whole reason I'm trying to do this in 64k chunks is to keep memory usage down but the chunk size doesn't really matter in this case, any size chunks * n to cover the file does the same thing.
Couple of things I've tried:
flushing the view before unmapping - this makes things terribly slow; doing that 2 GB file in any size of chunks takes something like 10 minutes.
adding a hardcoded delay in the loop - it actually works really well; it still gets the job done in seconds and the memory usage stays down, but I just really don't like the concept of a hardcoded delay in any loop
writing 0's to just the end of the file - I don't actually need to clear the file but only to force it to be ready for usage. What I mean is - when I create a new file and just start with my random IO, I get ~1MB/s at best. If I open an existing file or force write 0's in the new file first, I get much better speeds. I'm not exactly sure why that is but a user in another thread suggested that writing something to the very end of the file after setting the file pointer would have the same effect as clearing but from testing, this is not true.
So currently I'm trying to solve this from the angle of clearing the file without destroying the computers memory. Does anybody know how to appropriately limit that loop?
So here's the thing. When you MapViewOfFile, it allocates the associated memory range but may mark it as swapped out (e.g., if it hasn't already been read into memory). If that's the case, you then get a page fault when you first access it (which will then cause the OS to read it in).
Then when you UnmapViewOfFile, the OS takes ownership of the associated memory range and writes the now-not-accessible-by-userspace data back to disk (assuming, of course, that you've written to it, which marks the page as "dirty", otherwise it's straight up deallocated). To quote the documentation (that I asked you to read in comments): modified pages are written "lazily" to disk; that is, modifications may be cached in memory and written to disk at a later time.
Unmapping the view of the file is not guaranteed to "un-commit" and write the data to disk. Moreover, even CloseHandle does not provide that guarantee either. It merely closes the handle to it. Because of caching mechanisms, the operating system is entirely allowed to write data back to disk on its own time if you do not call FlushViewOfFile. Even re-opening the same file may simply pull data back from the cache instead of from disk.
Ultimately the problem is
you memory map a file
you write to the memory map
writing to the memory map's address range causes the file's mapping to be read in from disk
you unmap the file
unmapping the file "lazily" writes the data back to disk
the OS may come under memory pressure, see that there is some unwritten data it can now write to disk, and force that write-back to happen in order to reclaim physical memory for new allocations; and because the OS flushes lazily, your I/O is no longer sequential, which drastically increases latency on a spinning disk
You see better performance when you're sleeping because you're giving the OS the opportunity to say "hey I'm not doing anything... let's go ahead and flush cache" which coerces disk IO to be roughly sequential.
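To illustrate the point about bounding the dirty data (a sketch only, not a drop-in fix): instead of a sleep, you can map and zero larger views and flush each one before unmapping it, so write-back happens in large, roughly sequential bursts and dirty pages never pile up. This reuses the question's hMmf and sysInfo.granB members as assumptions; the 16 MB view size is an arbitrary choice.
// Sketch: zero the file in 16 MB views, flushing each view before unmapping
// so the cache never accumulates gigabytes of dirty pages. Assumes hMmf and
// sysInfo.granB from the question's class, and a file size that is a
// multiple of the view size for brevity.
void ClearBounded(long long fileSize) {
    const long long viewSize = 256LL * sysInfo.granB;   // 256 * 64 KB = 16 MB
    for (long long off = 0; off < fileSize; off += viewSize) {
        char* v = (char*)MapViewOfFile(hMmf, FILE_MAP_WRITE,
                                       (DWORD)(off >> 32), (DWORD)(off & 0xFFFFFFFF),
                                       (SIZE_T)viewSize);
        if (!v) break;
        std::memset(v, 0, (size_t)viewSize);
        FlushViewOfFile(v, (SIZE_T)viewSize);           // start write-back of this range now
        UnmapViewOfFile(v);
    }
}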

dd - Understanding block size

I've used "dd" for creating test files and performing backups across HDDs. No problem.
Currently, I'm trying to use it to test NFS transfer rates. At first, I was varying the block size ("bs" argument)... But this got me thinking, why would I need to vary this argument?
A typical use-case that I want to simulate is:
Node X has a large data structure in memory
Node X wants to write it to a file located in a NFS-mounted directory
In this case, the typical C/C++ code for a 2D array would be:
FILE *ptr = fopen("path_to_nfs_area", "w");
for (int i = 0; i < data.size(); ++i)
    fwrite(data[i], sizeof(float), width, ptr);
...
So in this case, we're writing to a buffer in 32-bit increments (sizeof(float)) - and since this is a FILE object, it's probably being buffered as well (maybe that's not a good thing, but it might be irrelevant for this discussion).
I'm having a hard time making the jump from "dd" writing from if->of in "bs" chunks versus an application writing out variables from memory (and simulating this with dd).
Does it make sense to say that it is pointless to vary the value of "bs" less than the system PAGE_SIZE?
Here's my current understanding, so I don't see why changing the "dd" block size would matter:
You might get better answers on superuser.com, as this question is a bit off-topic here.
But consider the possibility that the NFS share is not mounted with the async flag - in that case, each single write would need to be confirmed by the NFS server before the next write can even start. So bs=1 would need about double the time compared to bs=2, and both would be MUCH slower than a sensible block size.
If the async flag is set on the NFS mount, your kernel might merge several small writes into one big one anyway, so the effect of setting bs should be negligible.
Anyway, if you're testing to set up an environment for a specific application, use that application for testing, nothing else. Performance can depend on so much application-specific behaviour that no generic tool will be able to reproduce it.
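To tie this back to the fwrite loop in the question: with stdio, the size of the writes that actually reach the kernel (and the NFS client) is governed by the FILE buffer, not by the sizeof(float) increments. A minimal sketch, where the file name, data shape and 1 MB buffer size are illustrative assumptions:
#include <cstdio>
#include <vector>

// Sketch: enlarge the stdio buffer so many small fwrite calls are merged
// into fewer, larger writes to the NFS-backed file.
int main() {
    FILE *ptr = fopen("path_to_nfs_area", "w");
    if (!ptr) return 1;
    std::vector<char> iobuf(1 << 20);                        // 1 MB user-supplied buffer
    setvbuf(ptr, iobuf.data(), _IOFBF, iobuf.size());        // must be called before any I/O
    std::vector<float> row(1024, 1.0f);
    for (int i = 0; i < 1000; ++i)
        fwrite(row.data(), sizeof(float), row.size(), ptr);  // flushed in ~1 MB chunks
    fclose(ptr);                                             // flushes and releases the stream
    return 0;
}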

Dynamic Memory Allocation in fast RAM

On a Windows machine (32-bit or 64-bit), I have to allocate memory to store large amounts of data that are streaming in live, around 1 GB in total. If I use malloc(), I obtain a virtual memory address, and accessing that address could actually cause paging to the hard drive depending on how much memory I have. I'm afraid the hard drive will hurt performance and cause data to be lost.
Is there a way to force memory to allocate only in RAM, even if it means that I get an error when not enough memory is available (so the user needs to close other things or use another machine)? I want to guarantee that all operations will be done in memory. If this fails, forcing the application to exit is acceptable.
I know that another process may come in and itself take some memory, but I am not worried because in this machine that is not happening (it'll be the only application on the machine to be doing this large allocation).
[Edit:]
My attempt so far has been to try to use VirtualLock as follows:
if (!SetProcessWorkingSetSize(this, 300000, 300008))
    printf("Error Changing Working Set Size\n");

// Allocate 1GB space
unsigned long sz = sizeof(unsigned char) * 1000000000;
unsigned char *m_buffer = (unsigned char *) malloc(sz);
if (m_buffer == NULL)
{
    printf("Memory Allocation failed\n");
}
else
{
    // Protect memory from being swapped
    if (!VirtualLock(m_buffer, sz))
    {
        printf("Memory swap protection failed\n");
    }
}
But the change to the working set size fails, and so does the VirtualLock; malloc does return a non-NULL pointer.
[Edit2]
I have tried also:
unsigned long sz = sizeof(unsigned char)*1000000000;
LPVOID lpvResult;
lpvResult = VirtualAlloc(NULL,sz, MEM_PHYSICAL|MEM_RESERVE, PAGE_NOCACHE);
But lpvResult is 0, so no luck there either.
You can use the mlock, mlockall, munlock, and munlockall functions to prevent pages from being swapped out (they are part of POSIX and also available in MinGW). Unfortunately, I have no experience with Windows, but it looks like VirtualLock does the same thing.
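A minimal POSIX sketch of that approach, using roughly the 1 GB size from the question (mlock will fail if the size exceeds RLIMIT_MEMLOCK or available RAM, which gives you the hard failure you want):
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>

// Sketch: allocate the buffer, then pin it so it cannot be paged out.
int main() {
    const size_t sz = 1000000000;              // ~1 GB, as in the question
    void* buf = std::malloc(sz);
    if (!buf) { std::perror("malloc"); return 1; }
    if (mlock(buf, sz) != 0) {                 // pin the pages in physical memory
        std::perror("mlock");                  // e.g. ENOMEM/EPERM: bail out
        return 1;
    }
    // ... stream data into buf ...
    munlock(buf, sz);
    std::free(buf);
    return 0;
}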
Hope it helps. Good Luck!
I think VirtualAlloc might get you some of what you want.
This problem really boils down to writing your own memory manager instead of using the CRT functions.
You need to use the undocumented NtLockVirtualMemory function with lock option 2 (LOCK_VM_IN_RAM); make sure you request and obtain the SE_LOCK_MEMORY_NAME privilege first, and be aware that it might not be granted (I'm not sure what the group policy defaults the privilege to, but it might very well be granted to nobody).
I suggest using VirtualLock as a fallback, and if that fails too, to use SetProcessWorkingSetSize. If that fails then just let it fail I guess...
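A hedged sketch of that fallback path for the ~1 GB buffer from the question. Note that VirtualLock can only lock as much as the process's minimum working set allows, so the working set has to be raised to at least the locked size first (the 300000/300008-byte request in the question is far too small for a 1 GB lock); the slack values below are illustrative, not tuned.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

// Sketch: grow the working set so a ~1 GB VirtualLock can succeed, then pin
// the buffer in RAM.
int main() {
    const SIZE_T sz = 1000000000;                        // ~1 GB, as in the question
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  sz + (64 << 20),       // minimum working set
                                  sz + (128 << 20))) {   // maximum working set
        std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return 1;
    }
    unsigned char* buffer = (unsigned char*)malloc(sz);
    if (!buffer) { std::printf("malloc failed\n"); return 1; }
    if (!VirtualLock(buffer, sz)) {                      // pin pages in physical memory
        std::printf("VirtualLock failed: %lu\n", GetLastError());
        return 1;                                        // fail hard, as the OP prefers
    }
    // ... stream data into buffer ...
    VirtualUnlock(buffer, sz);
    free(buffer);
    return 0;
}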
See this link for some nice discussion about this. One person says:
When you specify the LOCK_VM_IN_WSL flag, you just tell the Balance Set Manager that you don't want some particular page to get swapped to disk, and ask it to leave this page alone when trimming the working set of the target process. This is just an indication, so the target page may still get swapped out if the system is low on RAM. However, when you specify the LOCK_VM_IN_RAM flag, you issue a directive to the Memory Manager to treat this page as non-pageable (i.e. do what a driver does when it calls MmProbeAndLockPages() to lock the pages described by an MDL), so that the page in question is guaranteed to be loaded in RAM all the time.
Edit:
Read this.
One option would be to create a RAM disk out of your host's memory. While there is no longer native support for this in the distributed Windows code, you can still find the necessary drivers for free or through commercial products. For instance, Dataram provides a free driver for personal use and a commercially licensed product for business use at: http://memory.dataram.com/products-and-services/software/ramdisk
There is also ImDisk Virtual Driver available at: http://www.ltr-data.se/opencode.html/#ImDisk It is open sourced and free for commercial use. It is digitally signed with a trusted certificate from Microsoft.
For more information concerning RAM Drives on Windows, check out ServerFault.com.
You should take a look at Address Windowing Extensions (AWE). It sounds like it matches the memory constraints you have (emphasis mine):
AWE uses physical nonpaged memory and window views of various portions of this physical memory within a 32-bit virtual address space.

how to manage large arrays

I have a C++ program that uses several very large arrays of doubles, and I want to reduce the memory footprint of this particular part of the program. Currently, I'm allocating 100 of them, and they can be 100 MB each.
Now, I do have the advantage that parts of these arrays eventually become obsolete during later parts of the program's execution, and there is little need to ever have the whole of any one of them in memory at any one time.
My question is this:
Is there any way of telling the OS, after I have created the array with new or malloc, that a part of it is not needed anymore?
I'm coming to the conclusion that the only way to achieve this is to declare an array of pointers, each of which may point to a chunk, say 1 MB, of the desired array, so that old chunks that are no longer needed can be reused for new parts of the array. This feels like writing a custom memory manager, which seems like a bit of a sledgehammer and is going to create a bit of a performance hit as well.
I can't move the data in the arrays because that would cause too many thread-contention issues. The arrays may be accessed by any one of a large number of threads at any time, though only one thread ever writes to any given array.
It depends on the operating system. POSIX - including Linux - has the madvise system call to improve memory performance. From the man page:
The madvise() system call advises the kernel about how to handle paging input/output in the address range beginning at address addr and with size length bytes. It allows an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. This call does not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. The kernel is free to ignore the advice.
See the man page of madvise for more information.
Edit: Apparently, the above description was not clear enough. So, here are some more details, and some of them are specific to Linux.
You can use mmap to allocate a block of memory (directly from the OS instead of the libc), that is not backed by any file. For large chunks of memory, malloc is doing exactly the same thing. You have to use munmap to release the memory - regardless of the usage of madvise:
void* data = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ...
::munmap(data, size);
If you want to get rid of some parts of this chunk, you can use madvise to tell the kernel to do so:
madvise(static_cast<unsigned char*>(data) + 7 * page_size,
        3 * page_size, MADV_DONTNEED);
The address range is still valid, but it is no longer backed - neither by physical RAM nor by storage. If you access the pages later, the kernel will allocate some new pages on the fly and re-initialize them to zero. Be aware that the MADV_DONTNEED pages still count toward the virtual memory size of the process. It might be necessary to make some configuration changes to the virtual memory management, e.g. activating over-commit.
It would be easier to answer if we had more details.
1°) The answer to the question "Is there any way of telling the OS after I have created the array with new or malloc that a part of it is unnecessary any more?" is "not really". That's the point of C and C++, and of any language that lets you handle memory manually.
2°) If you're using C++ and not C, you should not be using malloc.
3°) Nor arrays, unless for a very specific reason. Use a std::vector.
4°) Preferably, if you need to change the contents of the array often and reduce the memory footprint, use a linked list (std::list), though it will be more expensive to access individual elements of the list (but it will be almost as fast if you only iterate through it).
A std::deque with pointers to std::array<double,LARGE_NUMBER> may do the job, but it is better to wrap the deque in a dedicated container, so you can remap the indexes and, most importantly, define when entries are no longer used.
The dedicated container can also contain a read/write lock, so it can be used in a thread-safe way.
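A minimal sketch of such a dedicated container, assuming 1 MB chunks of doubles and a std::shared_mutex for the one-writer/many-readers pattern described in the question (all names and sizes here are illustrative, not from the original post):
#include <array>
#include <deque>
#include <memory>
#include <shared_mutex>

// Sketch: a chunked "array" of doubles where individual chunks can be
// dropped once they become obsolete.
constexpr std::size_t kChunkDoubles = (1 << 20) / sizeof(double);  // ~1 MB per chunk

class ChunkedArray {
public:
    explicit ChunkedArray(std::size_t n_chunks) : chunks_(n_chunks) {}

    double& at(std::size_t i) {
        std::unique_lock lock(mtx_);                      // single writer
        auto& chunk = chunks_[i / kChunkDoubles];
        if (!chunk) chunk = std::make_unique<std::array<double, kChunkDoubles>>();
        return (*chunk)[i % kChunkDoubles];
    }

    double read(std::size_t i) const {
        std::shared_lock lock(mtx_);                      // many readers
        const auto& chunk = chunks_[i / kChunkDoubles];
        return chunk ? (*chunk)[i % kChunkDoubles] : 0.0;
    }

    void release_chunk(std::size_t chunk_index) {         // give the memory back
        std::unique_lock lock(mtx_);
        chunks_[chunk_index].reset();
    }

private:
    mutable std::shared_mutex mtx_;
    std::deque<std::unique_ptr<std::array<double, kChunkDoubles>>> chunks_;
};
Releasing a chunk hands its memory back to the allocator immediately, while readers of other chunks are unaffected apart from the brief lock.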
You could try using lists instead of arrays. Of course a list is 'heavier' than an array, but on the other hand it is easy to reconstruct a list so that you can throw away a part of it when it becomes obsolete. You could also use a wrapper which only contains indexes saying which part of the list is up to date and which part may be reused.
This will help you improve performance, but will require a little bit more (reusable) memory.
Allocating by chunk, and delete[]-ing and new[]-ing along the way, seems like a good solution. It keeps the amount of memory management you do yourself to a minimum: do not reuse chunks yourself, simply deallocate old ones and allocate new chunks as needed.

How do you pre-allocate space for a file in C/C++ on Windows?

I'm adding some functionality to an existing code base that uses pure C functions (fopen, fwrite, fclose) to write data out to a file. Unfortunately I can't change the actual mechanism of file i/o, but I have to pre-allocate space for the file to avoid fragmentation (which is killing our performance during reads). Is there a better way to do this than to actually write zeros or random data to the file? I know the ultimate size of the file when I'm opening it.
I know I can use fallocate on linux, but I don't know what the windows equivalent is.
Thanks!
Programmatically, on Windows you have to use Win32 API functions to do this:
SetFilePointerEx() followed by SetEndOfFile()
You can use these functions to pre-allocate the clusters for the file and avoid fragmentation. This works much more efficiently than pre-writing data to the file. Do this prior to doing your fopen().
If you want to avoid the Win32 API altogether, you can also do it non-programmatically by using the system() function to issue the following command:
fsutil file createnew filename filesize
You can use the SetFileValidData function to extend the logical length of a file without having to write out all that data to disk. However, because it can expose disk data that you would not otherwise be privileged to read, it requires the SE_MANAGE_VOLUME_NAME privilege. Carefully read the Remarks section of the documentation.
I'd recommend instead just writing out the 0's. You can also use SetFilePointerEx and SetEndOfFile to extend the file, but doing so still requires writing out zeros to disk (unless the file is sparse, but that defeats the point of reserving disk space). See Why does my single-byte write take forever? for more info on that.
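For completeness, a hedged sketch of the SetFileValidData route (only worth considering if the SE_MANAGE_VOLUME_NAME privilege is available and the security caveat above is acceptable); the file has to be extended with SetEndOfFile before the valid data length can be raised, and error handling is trimmed:
#include <windows.h>

// Sketch: enable SeManageVolumePrivilege, extend the file, then mark the
// whole extent as valid so no zero-filling happens on later writes.
static bool EnableManageVolumePrivilege() {
    HANDLE token;
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token))
        return false;
    TOKEN_PRIVILEGES tp = {};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValue(NULL, SE_MANAGE_VOLUME_NAME, &tp.Privileges[0].Luid);
    BOOL ok = AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(token);
    return ok && GetLastError() == ERROR_SUCCESS;
}

bool PreallocateValid(const wchar_t* path, LONGLONG bytes) {
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return false;
    LARGE_INTEGER size; size.QuadPart = bytes;
    bool ok = SetFilePointerEx(h, size, NULL, FILE_BEGIN)
           && SetEndOfFile(h)                       // reserve the clusters
           && EnableManageVolumePrivilege()
           && SetFileValidData(h, bytes);           // skip the zero-fill on first write
    CloseHandle(h);
    return ok;
}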
Sample code; note that it isn't necessarily faster, especially with smart filesystems like NTFS.
HANDLE handle;
if (INVALID_HANDLE_VALUE != (handle = CreateFile(fileName, GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, NULL))) {
    // preallocate a 2 GB disk file
    LARGE_INTEGER size;
    size.QuadPart = 2048LL * 0x100000;   // 2048 MB
    ::SetFilePointerEx(handle, size, 0, FILE_BEGIN);
    ::SetEndOfFile(handle);
    ::SetFilePointer(handle, 0, 0, FILE_BEGIN);
}
You could use the _chsize() function.
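A small sketch of that route, which fits the existing fopen-based code because _chsize operates on the descriptor behind the FILE*; for sizes above 2 GB, _chsize_s would be needed (function names here are the MSVC CRT ones):
#include <cstdio>
#include <io.h>      // _chsize, _fileno (MSVC CRT)

// Sketch: pre-extend the file through the CRT so the existing
// fopen/fwrite/fclose code can be kept unchanged afterwards.
bool preallocate(const char* path, long bytes) {
    FILE* fp = fopen(path, "wb");
    if (!fp) return false;
    bool ok = (_chsize(_fileno(fp), bytes) == 0);   // extend (zero-filled) to 'bytes'
    fclose(fp);
    return ok;
}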
Check out this example on Code Project. It looks pretty straightforward to set the file size when the file is initially created.
http://www.codeproject.com/Questions/172979/How-to-create-a-fixed-size-file.aspx
FILE *fp = fopen("C:\\myimage.jpg", "ab");
fseek(fp, 0, SEEK_END);
long size = ftell(fp);
char *buffer = (char*)calloc(500*1024 - size, 1);
fwrite(buffer, 500*1024 - size, 1, fp);
free(buffer);
fclose(fp);
The following article from Raymond Chen may help.
How can I preallocate disk space for a file without it being reported as readable?
Use the SetFileInformationByHandle function, passing function code FileAllocationInfo and a FILE_ALLOCATION_INFO structure. "Note that this will decrease fragmentation, but because each write is still updating the file size there will still be synchronization and metadata overhead caused at each append."
The effect of setting the file allocation info lasts only as long as you keep the file handle open. When you close the file handle, all the preallocated space that you didn't use will be freed.
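A hedged sketch of that call, assuming an already-open writable handle and a known target size (error handling trimmed):
#include <windows.h>

// Sketch: ask the filesystem to reserve clusters for the file up front via
// FileAllocationInfo. The reservation lasts only while this handle is open;
// unused space is released when the handle is closed.
bool ReserveClusters(HANDLE h, LONGLONG bytes) {
    FILE_ALLOCATION_INFO alloc = {};
    alloc.AllocationSize.QuadPart = bytes;
    return SetFileInformationByHandle(h, FileAllocationInfo,
                                      &alloc, sizeof(alloc)) != FALSE;
}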