Are Windows Memory Mapped File contents always zeroed by default? - c++

I've determined empirically that, on my system, a memory mapped file created to be a certain size is always completely zeroed by default. For example, using the call
HANDLE hMM = CreateFileMapping(h,
                               NULL,
                               PAGE_READWRITE,
                               0,
                               0x01400000, // 20 MB
                               NULL);
... and writing into a mapped view of that file always results in a 20 MB file that is completely zeroed, except where I have written non-zero data.
I'm wondering if uninitialized parts of the file can be assumed to be zeros. Is this behavior guaranteed on Windows in general?

The CreateFileMapping documentation (Remarks section) explicitly states that
If the file is extended, the contents of the file between the old end of the file and the new end of the file are not guaranteed to be zero; the behavior is defined by the file system.
so, if your file on disk starts out empty, the extended portion is not guaranteed to be zeroed (since you are expanding it). I don't think that file system drivers would take the risk of leaking potentially sensitive information that way, but who knows; maybe some file system driver recycles pages already used by your process (which shouldn't be a security risk).
On the other hand, I don't know whether file systems that offer no security at all (e.g. FAT) would be as careful about the content of the clusters they happen to allocate for the new part of the file.
If, instead, you are creating a memory section not backed by a file on disk but by the paging file it's guaranteed that the memory you get is all zeroed:
The initial contents of the pages in a file mapping object backed by the operating system paging file are 0 (zero).
This is probably guaranteed because, for a section backed only by the paging file, the memory manager has complete control over what's going on, and it hands out pages from the pool of already-zeroed pages.
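For reference, a minimal sketch of that case (error handling omitted; the 1 MB size is arbitrary). Passing INVALID_HANDLE_VALUE instead of a file handle creates a section backed by the paging file, whose contents are documented to start out zero:

#include <windows.h>
#include <cstdio>

int main()
{
    // INVALID_HANDLE_VALUE instead of a file handle: the section is backed
    // by the system paging file rather than by a file on disk.
    HANDLE hMM = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                   PAGE_READWRITE,
                                   0, 0x00100000, // 1 MB, high/low DWORDs
                                   NULL);
    unsigned char *p = (unsigned char *)MapViewOfFile(hMM, FILE_MAP_WRITE, 0, 0, 0);

    // Per the documentation, the initial contents of these pages are 0.
    printf("first byte: %d, last byte: %d\n", p[0], p[0x000FFFFF]);

    UnmapViewOfFile(p);
    CloseHandle(hMM);
    return 0;
}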

All newly allocated pages are zeroed before they are made accessible to user-mode, because otherwise sensitive information could be leaked from kernel-mode or other processes. This applies to things like NtAllocateVirtualMemory/VirtualAlloc and NtCreateSection/CreateFileMapping.
I imagine the same concept extends to files, because any decent file system wouldn't want to leak information in this way.
EDIT: However, take that last paragraph with a grain of salt - both the documentation for CreateFileMapping and SetEndOfFile claim that the extended portion of the file is not defined. I'll do some more investigation.
EDIT 2: OK, the Win32 MSDN documentation is definitely wrong. The documentation for ZwSetInformationFile states:
If you set FileInformationClass to FileEndOfFileInformation, and the EndOfFile member of FILE_END_OF_FILE_INFORMATION specifies an offset beyond the current end-of-file mark, ZwSetInformationFile extends the file and pads the extension with zeros.
So there you go. The extended portion is guaranteed to be zero.
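If you want to check this for yourself, here is a rough sketch (error handling omitted; "test.bin" and the sizes are arbitrary) that extends an empty file with SetEndOfFile and reads back a byte from the middle of the new region:

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE h = CreateFileA("test.bin", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    // Extend the empty file to 1 MB; per ZwSetInformationFile the new
    // region is padded with zeros.
    LARGE_INTEGER size;
    size.QuadPart = 0x00100000;
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);

    // Read a byte from the middle of the extended region; expect 0.
    LARGE_INTEGER mid;
    mid.QuadPart = 0x00080000;
    SetFilePointerEx(h, mid, NULL, FILE_BEGIN);
    unsigned char b = 0xFF;
    DWORD got = 0;
    ReadFile(h, &b, 1, &got, NULL);
    printf("byte at 0x80000: %d\n", b);

    CloseHandle(h);
    return 0;
}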

Yes, as pointed out by wj32. This is related to the C2 security requirements, which NT has met since its birth. However, depending on what you are trying to do, you should probably look into sparse files.
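For example, a hedged sketch of the sparse-file route (NTFS only; "big.bin" is an arbitrary name): marking the file sparse before extending it means the zeroed regions do not consume disk space until they are actually written:

#include <windows.h>
#include <winioctl.h>

int main()
{
    HANDLE h = CreateFileA("big.bin", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    // Mark the file as sparse (requires a file system that supports it,
    // e.g. NTFS).
    DWORD bytes = 0;
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

    // Extending the file now allocates clusters only for regions that are
    // actually written to; everything else reads back as zeros.
    LARGE_INTEGER size;
    size.QuadPart = 0x01400000; // 20 MB, matching the question
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);

    CloseHandle(h);
    return 0;
}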

Related

std::ofstream::open will it read the entire file into memory?

I'm writing data from memory to disk in order to free up memory.
I wonder: each time I call open() and append new elements to the end of the file, will it read the entire file into memory, or does it just keep a pointer to the end of the file?
The fstream implementation doesn't specify exactly what happens if you use the ofstream::app, ios::app, ofstream::ate or ios::ate mode to open the file.
But in any sane implementation, the file is not read into memory, all that happens is that the fstream implementation positions the "current position" to the end of the file.
Reading the entire file into memory would be rather terrible if you have a system with 2 GB of RAM and you want to append to a file that is bigger than 2 GB.
Being very pedantic, when writing something to a text file, the filesystem inside the operating system will likely read the last few (kilo)bytes of the file, because most hard disks and similar storage require data to be written in fixed-size "blocks" (e.g. 512 bytes or 4 kilobytes). So, unless the current file size falls exactly on a block boundary, the filesystem must read the last block of the file and write it back together with the additional data you asked to write.
If you are worried about appending to a log file that gets very large, no, it's not an issue. If you are worried about memory safety because your file holds secret data that you don't want stored in memory, then it may be a problem, because a portion of it will probably be loaded into memory, and there is nothing you can do to control that.
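To illustrate, a minimal sketch of such an append ("log.txt" is an arbitrary name). With std::ios::app the stream simply writes at the end of the file; nothing forces the existing contents into memory:

#include <fstream>

int main()
{
    // ios::app positions every write at the current end of the file.
    std::ofstream out("log.txt", std::ios::app | std::ios::binary);
    out << "another record\n";
    return 0;
}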

Getting information from a file without traversing its contents

This question made me wonder what else I can get from a file without traversing its contents (meaning without reading the contents via ifstream or getc, etc.).
Other than the file size and the number of characters, what other information can I gather? I looked into fseek and found I can use SEEK_SET, SEEK_CUR and SEEK_END, which only let me reach the start of the file, the end of the file, and the current position.
In order to make it a question, I specifically want to ask:
Can occurrences of some character or type of character (newline etc) be counted?
Can its contents be matched with a certain template?
Is using these methods faster than reading the file multiple times?
And I am asking about Microsoft Windows, not Linux.
1) No, because searching for something under unpredictable conditions requires a thorough examination of the contents, and examining means reading. Of course, you may collect some statistics beforehand, but you still need to traverse your data at least once. You can use other applications to do this implicitly, but they will also traverse your file from the very beginning to the end. You may organize your file in a way that yields the necessary info with a minimal number of read operations, but that is entirely up to your task; there is no general approach (because any general approach comes down to examining the whole source structure).
2) Also no (see above).
3) Yes. Store as much as possible (or as much as the task requires) in memory (that's called caching). For example, use memory mapping (see MapViewOfFile on Windows and mmap(2) on *nix systems); this takes advantage of the system's own caching mechanism.
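As a concrete illustration of (3) on Windows, here is a rough sketch (error handling omitted; "data.txt" is an arbitrary name, and the file is assumed to fit in the address space) that counts newlines through a mapped view instead of reading into a buffer:

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE h = CreateFileA("data.txt", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);

    HANDLE hMap = CreateFileMapping(h, NULL, PAGE_READONLY, 0, 0, NULL);
    const char *p = (const char *)MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);

    // The file is still traversed -- the OS pages it in on demand -- but
    // the extra copy into a user buffer is avoided.
    long long newlines = 0;
    for (long long i = 0; i < size.QuadPart; ++i)
        if (p[i] == '\n')
            ++newlines;
    printf("newlines: %lld\n", newlines);

    UnmapViewOfFile(p);
    CloseHandle(hMap);
    CloseHandle(h);
    return 0;
}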
No
No
Depends on whether there's an actual need to read the file multiple times.
There are no miracles here. The former question had a "shortcut" because the number of characters in the file equals its size in bytes (strictly speaking, an ANSI text file is treated as a sequence of characters, each represented by a single byte).
The stat structure contains information about the file, including permissions, ownership, size, and access and creation times. As for metadata, maybe there's an API to tie into a Windows search database that might allow searching on other criteria, like content attributes (I'm a Linux guy, usually, so I don't know what Windows offers in this respect).
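For instance, a minimal sketch of pulling that metadata without touching the contents ("data.txt" is an arbitrary name; on Windows with MSVC the same stat()/struct stat calls are available):

#include <sys/stat.h>
#include <cstdio>
#include <ctime>

int main()
{
    struct stat st;
    if (stat("data.txt", &st) == 0)
    {
        // None of this requires reading the file's data.
        printf("size:     %lld bytes\n", (long long)st.st_size);
        printf("modified: %s", ctime(&st.st_mtime));
        printf("mode:     %o\n", (unsigned)st.st_mode);
    }
    return 0;
}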

Fast resize of a mmap file

I need a copy-free re-size of a very large mmap file while still allowing concurrent access to reader threads.
The simple way is to use two MAP_SHARED mappings (grow the file, then create a second mapping that includes the grown region) in the same process over the same file and then unmap the old mapping once all readers that could access it are finished. However, I am curious if the scheme below could work, and if so, is there any advantage to it.
1) mmap a file with MAP_PRIVATE
2) do read-only access to this memory in multiple threads
3) either acquire a mutex for the file and write to the memory (assume this is done in a way that the readers, which may be reading that memory, are not messed up by it)
4) or acquire the mutex, but increase the size of the file and use mremap to move it to a new address (resizing the mapping without copying or unnecessary file I/O)
The crazy part comes in at (4). If you move the memory the old addresses become invalid, and the readers, which are still reading it, may suddenly have an access violation. What if we modify the readers to trap this access violation and then restart the operation (i.e. don't re-read the bad address, re-calculate the address given the offset and the new base address from mremap.) Yes I know that's evil, but to my mind the readers can only successfully read the data at the old address or fail with an access violation and retry. If sufficient care is taken, that should be safe. Since re-sizing would not happen often, the readers would eventually succeed and not get stuck in a retry loop.
A problem could occur if that old address space is re-used while a reader still has a pointer to it. Then there will be no access violation, but the data will be incorrect and the program enters the unicorn and candy filled land of undefined behavior (wherein there are usually neither unicorns nor candy).
But if you controlled allocations completely and could make certain that any allocations that happen during this period do not ever re-use that old address space, then this shouldn't be a problem and the behavior shouldn't be undefined.
Am I right? Could this work? Is there any advantage to this over using two MAP_SHARED mappings?
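For concreteness, step (4) would look roughly like this (a sketch only; fd, old_size and new_size are assumed to be tracked elsewhere, and error handling is omitted):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for mremap() and MREMAP_MAYMOVE
#endif
#include <sys/mman.h>
#include <unistd.h>

void *grow_mapping(void *old_addr, int fd, size_t old_size, size_t new_size)
{
    // Extend the backing file first so the new pages have storage behind them.
    ftruncate(fd, (off_t)new_size);

    // MREMAP_MAYMOVE lets the kernel relocate the mapping if it cannot be
    // extended in place; the old addresses become invalid afterwards, which
    // is exactly the situation the readers would have to trap and retry.
    return mremap(old_addr, old_size, new_size, MREMAP_MAYMOVE);
}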
It is hard for me to imagine a case where you don't know the upper bound on how large the file can be. Assuming that's true, you could "reserve" the address space for the maximum size of the file by providing that size when the file is first mapped in with mmap(). Of course, any accesses beyond the actual size of the file will cause an access violation, but that's how you want it to work anyway -- you could argue that reserving the extra address space ensures the access violation rather than leaving that address range open to being used by other calls to things like mmap() or malloc().
Anyway, the point is with my solution, you never move the address range, you only change its size and now your locking is around the data structure that provides the current valid size to each thread.
My solution doesn't work if you have so many files that the maximum mapping for each file runs you out of address space, but this is the age of the 64-bit address space so hopefully your maximum mapping size is no problem.
(Just to make sure I wasn't forgetting something stupid, I did write a small program to convince myself creating the larger-than-file-size mapping gives an access violation when you try to access beyond the file size, and then works fine once you ftruncate() the file to be larger, all with the same address returned from the first mmap() call.)
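That test program was roughly along these lines (a sketch; "data.bin" and the 1 GB reservation are arbitrary, and error handling is omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    const size_t max_size = (size_t)1 << 30; // reserve 1 GB of address space
    const size_t initial  = 4096;            // the file starts at one page

    int fd = open("data.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, (off_t)initial);

    // Map the maximum size up front; this address never changes afterwards.
    char *p = (char *)mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

    p[0] = 'x';                          // fine: within the current file size
    // p[initial] = 'x';                 // would fault: beyond the end of the file

    ftruncate(fd, (off_t)(2 * initial)); // grow the file...
    p[initial] = 'y';                    // ...and the same mapping now works

    printf("mapping at %p stays valid across the resize\n", (void *)p);
    munmap(p, max_size);
    close(fd);
    return 0;
}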

mmap(): what happens if underlying file changes (shrinks)?

Suppose you memory map a file using mmap(), but then the underlying file is truncated to a much smaller size. What happens if you access a memory offset that was shaved off from the file?
IBM says it is undefined http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=%2Fapis%2Fmmap.htm
If the size of the mapped file is decreased after mmap(), attempts to reference beyond the end of the file are undefined and may result in an MCH0601 exception.
If the size of the file increases after the mmap() function completes, then the whole pages beyond the original end of file will not be accessible via the mapping.
The same is said in SingleUnixSpecification: http://pubs.opengroup.org/onlinepubs/7908799/xsh/mmap.html
If the size of the mapped file changes after the call to mmap() as a result of some other operation on the mapped file, the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified.
'Undefined' or 'unspecified' means the OS is allowed to do anything, up to starting to format the disk. The most probable outcome is that your application is killed with SIGSEGV (or SIGBUS).
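A small sketch that reproduces the scenario ("data.bin" is an arbitrary name and is assumed to start at least two pages long; error handling is omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("data.bin", O_RDWR);
    size_t len = 2 * 4096;           // map two pages of the file

    char *p = (char *)mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    ftruncate(fd, 4096);             // the file shrinks behind the mapping's back

    printf("%d\n", p[0]);            // still fine: page 0 is still backed
    printf("%d\n", p[4096]);         // beyond the new EOF: typically a fatal
                                     // signal (SIGBUS on Linux)

    munmap(p, len);
    close(fd);
    return 0;
}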
It depends on what flags you gave to mmap. From the man page:
MAP_SHARED: Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
and
MAP_PRIVATE: Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
So with MAP_PRIVATE it doesn't matter: each writer effectively has a "private" copy (though pages are only actually copied when a mutating write occurs).
I would think that if you use MAP_SHARED, then no other process would be allowed to open the file with write privileges. But that's a guess.
EDIT: ninjalj is right, the file can be modified even when you mmap with MAP_SHARED.
According to the man pages, mmap returns an EINVAL error when you try to access an address that is too large for the current file mapping.
"dnotify" and "inotify" are the current file change notification services in the Linux kernel.
Presumably, they would inform the mmap subsystem of changes to the file.
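A minimal inotify sketch, in case it helps ("data.bin" is an example path; a real program would loop and parse every struct inotify_event in the buffer):

#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = inotify_init();
    int wd = inotify_add_watch(fd, "data.bin", IN_MODIFY | IN_ATTRIB);

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf); // blocks until the file changes
    if (n > 0)
        printf("file changed\n");

    inotify_rm_watch(fd, wd);
    close(fd);
    return 0;
}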

Using the pagefile for caching?

I have to deal with a huge amount of data that usually doesn't fit into main memory. The way I access this data has high locality, so caching parts of it in memory looks like a good option. Is it feasible to just malloc() a huge array, and let the operating system figure out which bits to page out and which bits to keep?
Assuming the data comes from a file, you're better off memory mapping that file. Otherwise, what you end up doing is allocating your array, and then copying the data from your file into the array -- and since your array is mapped to the page file, you're basically just copying the original file to the page file, and in the process polluting the "cache" (i.e., physical memory) so other data that's currently active has a much better chance of being evicted. Then, when you're done you (typically) write the data back from the array to the original file, which (in this case) means copying from the page file back to the original file.
Memory mapping the file instead just creates some address space and maps it directly to the original file instead. This avoids copying data from the original file to the page file (and back again when you're done) as well as temporarily moving data into physical memory on the way from the original file to the page file. The biggest win, of course, is when/if there are substantial pieces of the original file that you never really use at all (in which case they may never be read into physical memory at all, assuming the unused chunk is at least a page in size).
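A hedged sketch of that suggestion on Windows ("data.bin" is an arbitrary name and is assumed to already contain the records; error handling is omitted): instead of malloc()ing a huge array and copying the file into it, map the file and use the view as the array:

#include <windows.h>

int main()
{
    HANDLE h = CreateFileA("data.bin", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE hMap = CreateFileMapping(h, NULL, PAGE_READWRITE, 0, 0, NULL);

    // The view acts as the "huge array"; the OS pages pieces of it in and
    // out of physical memory as the access pattern demands.
    double *data = (double *)MapViewOfFile(hMap, FILE_MAP_WRITE, 0, 0, 0);

    data[42] += 1.0;

    UnmapViewOfFile(data);
    CloseHandle(hMap);
    CloseHandle(h);
    return 0;
}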
If the data are in a large file, look into using mmap to read it. Modern computers have so much RAM that you might not have enough swap space available.