I did not know binary files could be read with mmap(). I used to think mmap() could only be used for IPC (interprocess communication) in Linux, to exchange data between unrelated processes.
Can someone explain how files are read with mmap()? I have heard it has a huge advantage when binary files are accessed randomly.
Well, mmapping a file is done the same way as mapping shared memory for IPC or mapping anonymous memory. In the case of anonymous memory, the parts that have not been written to are filled with zeroed pages on demand.
In the case of a mapped file, the pages that correspond to file contents are read from the file (or the buffer cache) when they are first accessed or written. Reading or writing outside the size of the file results in SIGBUS. Essentially, the pointer returned by mmap can be treated much like the pointer returned by malloc, except that, up to the end of the mapping (or the end of the file, whichever comes first), the bytes are transparently read from, and possibly written back to, the backing file.
Example:
fd = open("input.txt", O_RDWR, 0666);
fstat(fd, &stat_result);
char* contents = mmap(0, stat_result->st_size,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
(error checking omitted)
After executing that, you can treat contents as a pointer to the first byte of a character array of stat_result.st_size characters: you can use it just like an ordinary array, and the operating system will transparently write your changes back to the file.
With mmap, the operating system also has a better view of which parts of the file should be kept in memory (the buffer cache) and which should not.
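For instance, a minimal sketch building on the snippet above (the newline-replacing loop is just a placeholder for whatever you want to do with the data): the mapping is read and written like an ordinary array, and msync() / munmap() flush and drop it.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.txt", O_RDWR);               /* error checking omitted */
    struct stat stat_result;
    fstat(fd, &stat_result);

    char *contents = mmap(NULL, stat_result.st_size,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Read and modify the file through the pointer, just like an array. */
    for (off_t i = 0; i < stat_result.st_size; i++)
        if (contents[i] == '\n')
            contents[i] = ' ';

    msync(contents, stat_result.st_size, MS_SYNC);     /* flush changes to the file */
    munmap(contents, stat_result.st_size);
    close(fd);
    return 0;
}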
Is it possible – and if so prudent – to use sendfile() (or its Darwin/BSD cousin fcopyfile()) to shuttle data directly between a shared-memory object and a file?
Functions like sendfile() and fcopyfile() can perform all of the mechanistic necessities underpinning such transfers of data entirely without leaving kernel-space – you pass along two open descriptors, a source and a destination, when calling these functions, and they take it from there.
Other means of copying data will invariably require one to manually maneuver across the boundary between kernel-space and user-space; such context-switches are inherently quite costly, performance-wise.
I can’t find anything definitive on the subject of using a shared-memory descriptor as an argument thusly: no articles for or against the practice; nothing in the respective man-pages; no tweets publicly considering sendfile()-ing shared-memory descriptors harmful; &c… But so, I am thinking I should be able to do something like this:
char const* name = "/yo-dogg-i-heard-you-like-shm"; /// only one slash, at zero-index
int len = A_REASONABLE_POWER_OF_TWO; /// valid per shm_open()
int descriptor = shm_open(name, O_RDWR | O_CREAT, 0600);
int destination = open("/tmp/yodogg.block", O_RDWR | O_CREAT, 0644);
void* memory = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, descriptor, 0);
off_t bytescopied = 0;
sendfile(destination, descriptor, &bytescopied, len);
/// --> insert other stuff with memset(…), memcpy(…) &c. here, possibly
munmap(memory, len);
close(descriptor); close(destination);
shm_unlink(name);
… Is this misguided, or a valid technique?
And if the latter, can one adjust the size of the in-memory shared map before copying the data?
EDIT: I am developing the project to which this inquiry pertains on macOS 10.12.4; I am aiming for it to work on Linux, with eventual FreeBSD interoperability.
Copying data between two "things" mapped in memory - like in the example above - will indeed involve copying between kernel space and user space. And no, I'm afraid you can't really use the sendfile(2) system call here: on macOS the destination descriptor must be a socket, so it won't send into a regular file.
But you should be able to do it like this:
Create the shared memory object (or a file, really; due to the second step it will be shared in memory anyway)
Map it in memory, with MAP_SHARED; you'll get a pointer
Open the destination file
write(destination_fd, source_pointer, source_length)
In this case there is no extra staging copy in user space: the buffer you pass to write() is the shared mapping itself. I'm not sure what the actual performance characteristics will be, though. Smart use of madvise(2) might help.
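A minimal sketch of those four steps, reusing the names from the question (the object size, fill pattern and destination path are placeholders; error checking omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/yo-dogg-i-heard-you-like-shm";
    const size_t len = 1 << 20;                          /* 1 MiB, for example */

    /* 1. Create the shared memory object and give it a size. */
    int descriptor = shm_open(name, O_RDWR | O_CREAT, 0600);
    ftruncate(descriptor, len);

    /* 2. Map it with MAP_SHARED; stores through the pointer land in the object. */
    void *memory = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, descriptor, 0);
    memset(memory, 0xAB, len);                           /* fill it with some data */

    /* 3. Open the destination file. */
    int destination = open("/tmp/yodogg.block", O_RDWR | O_CREAT | O_TRUNC, 0644);

    /* 4. write() straight from the mapping; no extra staging buffer in user space. */
    write(destination, memory, len);

    munmap(memory, len);
    close(destination);
    close(descriptor);
    shm_unlink(name);
    return 0;
}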
While studying some code that deals with an FPGA, I came across munmap and mmap.
I went through the manual provided here, but I still do not understand the purpose of these functions. What exactly do they do?
mmap() is a system call that supports memory-mapped I/O. It creates a mapping in the calling process's virtual address space so that the application can access the mapped memory directly.
mmap() returns a pointer to the mapped area which can be used to access the memory.
Similarly, munmap() removes the mapping, after which accessing the previously mapped memory is no longer legal.
These are lower-level calls, behaviourally similar to what memory-allocator functions like malloc() / free() offer at a higher level. However, these system calls give fine-grained control over the mapped region's behaviour, such as (see the sketch after this list):
memory protection of the mapping (read, write, execute permission)
(approximate) location of the mapping (see MAP_FIXED flag)
the initial content of the mapped area (see MAP_UNINITIALIZED flag)
etc.
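For example, a small sketch of that control (an anonymous mapping, not tied to the FPGA code in question): one page is mapped read/write, then made read-only with mprotect(), then released with munmap().

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    const size_t len = 4096;

    /* Ask the kernel for one page of anonymous, read/write memory. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    ((char *)p)[0] = 42;                 /* writable for now */

    mprotect(p, len, PROT_READ);         /* drop write permission */
    /* Writing through p after this point would raise SIGSEGV. */

    munmap(p, len);                      /* remove the mapping */
    return 0;
}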
You can also refer to the Wikipedia article if you think an alternate wording would help.
It maps a chunk of the disk cache into the process's address space so that the mapped file can be manipulated at the byte level, instead of requiring the application to go through the VFS with read(), write(), et alia.
The manual is clear:
mmap() creates a new mapping in the virtual address space of the calling process
In short, it maps a chunk of a file, of device memory, or of something else into the process's address space, so that the process can access the content by simply accessing that memory.
For example:
fd = open("xxx", O_RDONLY);
mem = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
Will map the file's content to mem, reading mem is just like reading the content of the file xxx.
If fd refers to an FPGA's device memory, then mem exposes the contents of that device memory.
It is very convenient to use and efficient in some cases.
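For the device case, a hedged sketch (the device node /dev/uio0, the window size and the register offsets are purely hypothetical; a real FPGA driver documents its own):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device node exported by an FPGA driver (e.g. a UIO device). */
    int fd = open("/dev/uio0", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const size_t map_size = 4096;                 /* assumed size of the register window */
    volatile uint32_t *regs = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    uint32_t id = regs[0];                        /* read a (hypothetical) ID register */
    regs[1] = 0x1;                                /* poke a (hypothetical) control register */
    printf("device id: 0x%08x\n", id);

    munmap((void *)regs, map_size);
    close(fd);
    return 0;
}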
If I have a huge file (e.g. 1 TB, or any size that does not fit into RAM; the file is stored on disk), and it is delimited by spaces, and my RAM is only 8 GB, can I read that file with ifstream? If not, how do I read a block of the file (e.g. 4 GB) at a time?
There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file live into your memory. The best thing would be for you to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of that file into it, rinse and repeat:
ifstream bigFile("mybigfile.dat");
constexpr size_t bufferSize = 1024 * 1024;
unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
bigFile.read(buffer.get(), bufferSize);
// process bigFile.gcount() bytes of data in buffer
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated from another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging in and out memory may not behave in a way that advantages you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
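As a sketch of that second solution on Linux/OS X (assuming a 64-bit build; the file name reuses the one from the ifstream example and the space-counting loop is just a placeholder for real processing):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("mybigfile.dat", O_RDONLY);     /* error checking omitted */
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file read-only; pages are faulted in on demand. */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    /* Hint that we will scan it front to back, so the kernel can read ahead
       and drop pages behind us. */
    madvise((void *)data, st.st_size, MADV_SEQUENTIAL);

    size_t spaces = 0;
    for (off_t i = 0; i < st.st_size; i++)        /* e.g. count the delimiters */
        if (data[i] == ' ')
            spaces++;
    printf("%zu spaces\n", spaces);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}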
I am sure you don't have to keep the whole file in memory. Typically one wants to read and process the file in chunks. If you want to use ifstream, you can do something like this:
ifstream is("/path/to/file");
char buf[4096];
do {
is.read(buf, sizeof(buf));
process_chunk(buf, is.gcount());
} while(is);
A more advanced approach is, instead of reading the whole file (or chunks of it) into memory, to map it into memory using platform-specific APIs:
Under Windows: CreateFileMapping(), MapViewOfFile()
Under Linux: open(2) / creat(2), shm_open, mmap
You will need to build a 64-bit application to make this work with a file that large.
For more details see here: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
You can use fread:
char buffer[size];               // size must be a chunk that fits in RAM, not the whole 1 TB
fread(buffer, 1, size, fp);      // reads up to size bytes; returns the number actually read
Or, if you want to use C++ fstreams you can use read as buratino said.
Also bear in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.
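Putting that together, a chunked fread() loop might look like this (the file name and the 1 MiB chunk size are placeholders):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("mybigfile.dat", "rb");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    enum { CHUNK = 1 << 20 };                     /* 1 MiB per read */
    char *buffer = malloc(CHUNK);                 /* heap, not the stack */

    size_t n;
    while ((n = fread(buffer, 1, CHUNK, fp)) > 0) {
        /* process n bytes in buffer here */
    }

    free(buffer);
    fclose(fp);
    return 0;
}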
I'm writing things from my memory to the disk in order to free my memory.
I wonder: each time I call open() and append new elements to the end of the file, will it read the entire file into memory, or does it just get a pointer to the end of the file?
The fstream implementation doesn't specify exactly what happens if you use the ofstream::app, ios::app, ofstream::ate or ios::ate mode to open the file.
But in any sane implementation, the file is not read into memory, all that happens is that the fstream implementation positions the "current position" to the end of the file.
To read the entire file into memory would be rather terrible if you have a system with 2GB of RAM and you wanted to append to a file that is bigger than 2GB.
Being very pedantic, when writing something to a text file, it is likely that the filesystem (which is part of the operating system) will read back the last few (kilo)bytes of the file, because most hard disks and similar storage require data to be written in a fixed-size "block" (e.g. 512 bytes or 4 kilobytes). So, unless the current file size falls exactly on a block boundary, the filesystem must read the last block of the file and write it back together with the additional data you asked to write.
If you are worried about appending to a log file that gets very large, no, it's not an issue. If you are worried about memory safety because your file contains secret data that you don't want held in memory, then it may be a problem, because a portion of that data will probably be loaded into memory, and there is nothing you can do to control that.
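For illustration with the POSIX open() that the question mentions (the file name is a placeholder): O_APPEND makes the kernel position every write() at the current end of the file, and the existing contents are never read back in.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_APPEND: every write() goes to the current end of the file. */
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);

    const char *line = "one more record\n";
    write(fd, line, strlen(line));     /* appends without reading the file */

    close(fd);
    return 0;
}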
If you memory-map a file using mmap(), but the underlying file then shrinks to a much smaller size, what happens if you access a memory offset that was shaved off the end of the file?
IBM says it is undefined http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=%2Fapis%2Fmmap.htm
If the size of the mapped file is decreased after mmap(), attempts to reference beyond the end of the file are undefined and may result in an MCH0601 exception.
If the size of the file increases after the mmap() function completes, then the whole pages beyond the original end of file will not be accessible via the mapping.
The same is said in the Single UNIX Specification: http://pubs.opengroup.org/onlinepubs/7908799/xsh/mmap.html
If the size of the mapped file changes after the call to mmap() as a result of some other operation on the mapped file, the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified.
'Undefined' or 'unspecified' means the OS is allowed to do anything - start formatting the disk, whatever. The most probable outcome is a SIGBUS (or SIGSEGV) killing your application.
It depends on what flags you gave to mmap. From the man page:
MAP_SHARED: Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
and
MAP_PRIVATE: Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
So for MAP_PRIVATE it doesn't matter: each writer effectively has a "private" copy (though pages are only copied when a mutating operation occurs).
I would think that if you use MAP_SHARED, then no other process would be allowed to open the file with write privileges. But that's a guess.
EDIT: ninjalj is right, the file can be modified even when you mmap it with MAP_SHARED.
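A small sketch of the difference (the file name is a placeholder, and the file is assumed to already exist and be at least one byte long): the MAP_PRIVATE store stays in this process's copy-on-write pages, while the MAP_SHARED store is carried through to the file.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);            /* error checking omitted */
    struct stat st;
    fstat(fd, &st);

    /* Copy-on-write: the first store gives this process its own private page. */
    char *priv = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);
    priv[0] = 'X';                                /* never written back to data.bin */

    /* Shared: the same kind of store is visible to other mappers and reaches the file. */
    char *shared = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    shared[0] = 'Y';                              /* carried through to data.bin */

    munmap(priv, st.st_size);
    munmap(shared, st.st_size);
    close(fd);
    return 0;
}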
According to the man pages, mmap() itself returns an EINVAL error when the requested length or offset is too large for the file mapping; errors from accessing the mapped memory later are reported as signals, not as error codes.
"dnotify" and "inotify" are the current file change notification services in the Linux kernel.
Presumably, they would inform the mmap subsystem of changes to the file.
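For example, on Linux a program that has a file mapped could watch for such changes itself with inotify; a minimal sketch (the watched path is a placeholder):

#include <limits.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int in_fd = inotify_init();
    /* IN_MODIFY fires for write(2) and truncate(2) on the watched file. */
    int wd = inotify_add_watch(in_fd, "data.bin", IN_MODIFY);
    if (wd < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
    read(in_fd, buf, sizeof(buf));    /* blocks until the file changes */
    printf("data.bin changed; re-check its size before touching a mapping\n");

    close(in_fd);
    return 0;
}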