Why use MPI_File_open instead of fopen? - c++

After reading the MPI documentation, it doesn't sound like this gives you any additional functionality at all. I had assumed that it coordinated network traffic such that all file operations work with the given file on the executing system (the one issuing an mpirun command), as opposed to using the local filesystem on each individual host. This would be useful. Instead, the "user" needs to ensure that they all end up at the same file. Clearly they're not communicating that much about this file... are they?
What does MPI_File_open actually do, and how is it beneficial? Why should I not just use fopen?

Sure. MPI_File_open lets you seek and read or write at particular offsets, as you would with fopen, and in that case each process has its own private file pointer. The differences from fopen include nonblocking I/O calls (e.g. MPI_File_iread and MPI_File_iwrite), which let your program continue executing without waiting for the operation to complete. MPI also supports shared file pointers (e.g. MPI_File_read_shared), although using a shared pointer obviously carries a synchronization overhead.

Related

Can I call posix_fadvise when another thread is reading/writing the fd?

I'm working on a project under Linux that needs to read and write the same fd from multiple threads, and I want to use posix_fadvise to free the page cache.
Can I call posix_fadvise when another thread is reading or writing the same fd?
Read posix_fadvise(2) and syscalls(2). Since posix_fadvise is a genuine syscall (e.g. wraps fadvise64 having its __NR_fadvise64 in <asm/unistd.h>...) you should be able to call it while another thread is writing the same fd, exactly as you may have two threads doing write(2) to the same file descriptor (but what happens then is perhaps non-deterministic).
I imagine that the kernel is internally locking the kernel file object referenced by a file descriptor.
BTW, the man page of posix_fadvise says:
Programs can use posix_fadvise() to announce an intention to access
file data in a specific pattern in the future, thus allowing the
kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at
offset and extending for len bytes (or until the end of the file if
len is 0) within the file referred to by fd. The advice is not
binding; it merely constitutes an expectation on behalf of the
application.
Hence I guess that the kernel may follow the posix_fadvise later (or not at all)...
So I think you can do that, but I believe you should avoid, at least for readability reasons (and because of the non-determinism), to have several threads working on the same file descriptor. My feeling is that your code may have some design issues, but something will perhaps happen...
Generally, I would avoid having several threads doing I/O on the same file descriptor (or at the very least, use pwrite(2) or lock the I/O with a mutex...). So while you could do what you are asking, I would avoid doing that.
Remember that I/O operations on a disk file system are much, much slower (they may take many milliseconds) than ordinary computations. The cost of locking them with a mutex should not be significant, and the lock will give you more determinism.
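To make the mutex suggestion concrete, here is a minimal sketch for Linux. The names g_io_mutex, locked_write and locked_drop_cache are made up for illustration. It also assumes you call fdatasync before POSIX_FADV_DONTNEED, since the man page notes that pages not yet written out are unaffected by that advice.

```cpp
#include <fcntl.h>    // posix_fadvise, POSIX_FADV_DONTNEED
#include <unistd.h>   // pwrite, fdatasync
#include <cerrno>
#include <mutex>
#include <string>

// One mutex guards every I/O call on the shared descriptor (hypothetical name).
std::mutex g_io_mutex;

// Writes `data` at `offset`. Returns true only if a single pwrite call
// wrote every byte (short writes are reported as failure in this sketch).
bool locked_write(int fd, const std::string& data, off_t offset) {
    std::lock_guard<std::mutex> lock(g_io_mutex);
    ssize_t n = pwrite(fd, data.data(), data.size(), offset);
    return n == static_cast<ssize_t>(data.size());
}

// Flushes dirty pages, then advises the kernel that the cached pages of the
// whole file (offset 0, len 0) are no longer needed.
// Returns 0 on success or an error number. The advice is not binding.
int locked_drop_cache(int fd) {
    std::lock_guard<std::mutex> lock(g_io_mutex);
    if (fdatasync(fd) != 0) return errno;
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  // returns the error number directly
}
```

Every thread would go through these two functions instead of calling write or posix_fadvise on the fd itself, so the order of operations is at least well defined.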

Can you open the same file multiple times for writing?

We are writing a multi threaded application that does a bunch of bit twiddling and writes the binary data to disk. Is it possible to have each thread std::fopen the same file for writing at the same time? The reasoning would be each thread could do its work and have its own access to the writable file.
std::fstream has functionality defined in terms of the C stdio library. I would be surprised if it were actually specified, but the most likely behavior from opening the same file twice is multiple internal buffers bound to the same file descriptor.
The usual way to simultaneously write to multiple points in the same file is POSIX pwrite or writev. This functionality is not wrapped by C stdio, and by extension not by C++ iostreams either. But, having multiple descriptors to the same filesystem file might work too.
Edit: POSIX open called twice on the same file in Mac OS X produces different file descriptors. So, it might work on your platform, but it's probably not portable.
A definitive answer would require connecting these dots:
Where the C++ standard specifies that fstream works like a C (stdio) stream.
Where the C standard defines when a stream is created (fopen is only defined to associate a stream with a newly-opened file).
Where the POSIX standard defines its requirements for C streams.
Where POSIX defines the case of opening the same file twice.
This is a bit more research than I'm up for at the moment, but I'm sure someone out there has done the legwork.
I've written some high speed multi-threaded data capture utilities, but the output went to separate files on separate hard drives, and then were post-processed.
I seem to recall that you can have fopen not lock the file so in theory that would allow different threads to all write to the same file with independent handles. In practice you're going to run into other issues, namely concurrency. Your threads are almost certainly going to step all over each other and scramble the results unless you implement some synchronization. And if you have to do that, why not just use one handle across all the threads?
I/O access is not a parallelizable task (it can't be: you simply can't send two or more data addresses over the device bus at the same time), so you'd be better off implementing a queue to which many threads post their chunks of data while one single consumer actually writes them to disk.

Safe access to file from two processes

Suppose I have two processes. One always resides in memory and periodically reads some settings from a file on disk. If it detects that the settings have changed, it applies them.
The other process is run from the command line on demand and modifies the settings. Thus the first process only reads the file and never writes to it, while the second only writes to the file.
Should I synchronize access to the file to ensure that the first process always gets consistent settings, i.e. the contents before or after a modification and not some intermediate state? If yes, what is the simplest way to do this in C++?
I'm interested mainly in cross-platform ways. But also curious about Windows- and/or Linux-specific ones.
Use a named semaphore and require each process to hold the semaphore before accessing the file on disk. Any running application can connect to a named semaphore.
Look at man 7 sem_overview for more information on named semaphores on linux machines.
The closest equivalent for windows I can find is http://msdn.microsoft.com/en-us/library/windows/desktop/ms682438(v=vs.85).aspx
You are using C++ so your first option should be to check through the usual cross-platform libs - POCO, Boost, ACE, and so forth to see if there is anything that already does what you require.
You really have two separate issues: (1) file synchronization and (2) notification.
On Linux, to avoid having your daemon constantly poll to see whether the file has changed, you can use the inotify calls and set up events that tell you when the command-line program has changed the file. It might be simplest to look for IN_CLOSE_WRITE events, since a command-line program will presumably open, change, and close the file.
For synchronization, since you are in control of both programs, you can just use file or record locking e.g. lockf, flock or fcntl.
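For the locking part, something along these lines might do on POSIX systems, using flock with a shared lock for the reader and an exclusive lock for the writer. LockedFile, write_settings and read_settings are names I made up. Note that the writer truncates only after it holds the lock, so a reader never sees a half-emptied file. The locks are advisory, so both programs have to cooperate.

```cpp
#include <fcntl.h>
#include <sys/file.h>   // flock, LOCK_SH, LOCK_EX, LOCK_NB
#include <unistd.h>
#include <string>

// Opens `path` and takes an flock on it; closing the descriptor releases the lock.
class LockedFile {
public:
    LockedFile(const std::string& path, int open_flags, int lock_op)
        : fd_(open(path.c_str(), open_flags, 0644)) {
        if (fd_ >= 0 && flock(fd_, lock_op) != 0) {
            close(fd_);
            fd_ = -1;
        }
    }
    ~LockedFile() { if (fd_ >= 0) close(fd_); }

    LockedFile(const LockedFile&) = delete;
    LockedFile& operator=(const LockedFile&) = delete;

    bool ok() const { return fd_ >= 0; }
    int fd() const { return fd_; }

private:
    int fd_;
};

// Writer side: exclusive lock, then truncate and rewrite the whole file.
bool write_settings(const std::string& path, const std::string& text) {
    LockedFile f(path, O_WRONLY | O_CREAT, LOCK_EX);
    if (!f.ok()) return false;
    if (ftruncate(f.fd(), 0) != 0) return false;
    ssize_t n = write(f.fd(), text.data(), text.size());
    return n == static_cast<ssize_t>(text.size());
}

// Reader side: shared lock, so several readers may read at the same time.
std::string read_settings(const std::string& path) {
    LockedFile f(path, O_RDONLY, LOCK_SH);
    std::string out;
    if (!f.ok()) return out;
    char buf[4096];
    ssize_t n;
    while ((n = read(f.fd(), buf, sizeof buf)) > 0) out.append(buf, n);
    return out;
}
```

Passing LOCK_NB together with LOCK_SH or LOCK_EX makes the lock attempt fail immediately instead of blocking, which is handy if the daemon would rather skip one polling cycle than wait.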
The most obvious solution is to open the file in exclusive mode. If the file can not be opened, wait some time and try to open the file again. This will prevent possible access/modification conflicts.
The benefit of this approach is that it's simple and doesn't have significant drawbacks.
Of course you could use synchronization primitives (a mutex or semaphore, depending on the OS), but that would be overkill in your scenario, where a speedy response is not required (waiting 200 ms between open attempts is fine, and writing the config file won't take longer than that).

C++ how to check if file is in use - multi-threaded multi-process system

C++:
Is there a way to check if a file has been opened for writing by another process/ class/ device ?
I am trying to read files from a folder that may be accessed by other processes for writing. If I read a file that is simultaneously being written to, both the read and the write process give me errors (the writing is incomplete, I might only get a header).
So I must check for some type of condition before I decide whether to open that specific file.
I have been using boost::filesystem to get my file list. I want compatibility with both Unix and Windows.
You must use a file advisory lock. In Unix, this is flock, in Windows it is LockFile.
However, the fact that your reading process is erroring probably indicates that you have not opened the file in read-only mode in that process. You must specify the correct flags for read-only access or from the OS' perspective you have two writers.
Both operating systems support reader-writer locks, where unlimited readers are allowed, but only in the absence of writers, and only at most one writer at a time will have access.
Since you say your system is multi-process (ie, not multi thread), you can't use a condition variable (unless it's in interprocess shared memory). You also can't use a single writer as a coordinator unless you're willing to shuttle your data there via sockets or shared memory.
From what I understand about boost::filesystem, you're not going to get the granularity you need from that feature-set in order to perform the tasks you're requesting. In general, there are two different approaches you can take:
Use a synchronization mechanism such as a named semaphore visible at the file-system level
Use file-locks (i.e., fcntl or flock on POSIX systems)
Unfortunately both approaches are going to be platform-specific, or at least specific to POSIX vs. Win32.
A very nice solution can be found here using Sutter's active object https://sites.google.com/site/kjellhedstrom2/active-object-with-cpp0x
This is quite advanced, but it scaled really well on many cores.

c++ fstream concurrent access

What will happen if files are accessed concurrently from different processes/threads?
I understand there is no standard way of locking a file, only os specific functions.
In my case files will be read often and written seldom.
Now suppose A opens a file for reading (ifstream) and starts reading chunks, and B opens the same file for writing (ofstream) and starts writing. What will happen? Is there defined behavior?
edit
My goal is concurrent read, write access to many files. But write access will not occur very often. I would be content if the fstreams guarantee that file content doesn't get mixed up.
E.g.:
Processes 1 and 2 write to file A. If they write concurrently, I don't care whether the version from 1 or from 2 is written to disk, as long as it is a consistent version of the file.
If a process reads a file and another writes to it at the same time, I want the reading process to get the "old" version of the file.
If fstreams don't handle this I will use a database.
There is certainly no portable way to do efficient file sharing (with simultaneous access) using C++.
You can share files using a "lock" file. Before opening "foo.dat", try to create file "foo.lock". Keep looping until you succeed. After access, delete foo.lock. That allows serial access, but not concurrent access.
You can use byte-level locking in platform-specific ways. Windows has LockFileEx(). POSIX has fcntl and flock. If you need multi-platforms you will need separate implementations. You can encapsulate them in a class and use #if to handle the platform-specific bits.
This is the most efficient (fastest) by a lot, but it involves very complex programming and is prone to bugs.
You can use a DBMS.
A DBMS will be the simplest by a lot, but it does tie you to an external product, which may or may not be a problem. Byte-wise locking is much faster than anything else, but it will add a lot to development and maintenance costs.
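The "foo.lock" option could look roughly like this on POSIX. LockFile is a hypothetical class name, and the retry count and delay are placeholders. It relies on open with O_CREAT | O_EXCL failing with EEXIST when the file already exists, which makes creation the atomic "acquire" step.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>
#include <utility>

class LockFile {
public:
    // Tries to create `path` exclusively, up to max_attempts times,
    // sleeping `delay` between attempts.
    LockFile(std::string path, int max_attempts, std::chrono::milliseconds delay)
        : path_(std::move(path)) {
        for (int i = 0; i < max_attempts && !held_; ++i) {
            int fd = open(path_.c_str(), O_CREAT | O_EXCL | O_WRONLY, 0644);
            if (fd >= 0) {
                close(fd);
                held_ = true;                 // we created it, so we own the lock
            } else if (errno != EEXIST) {
                break;                        // unexpected error: give up
            } else if (i + 1 < max_attempts) {
                std::this_thread::sleep_for(delay);
            }
        }
    }

    // Deleting the lock file releases the lock.
    ~LockFile() { if (held_) unlink(path_.c_str()); }

    LockFile(const LockFile&) = delete;
    LockFile& operator=(const LockFile&) = delete;

    bool held() const { return held_; }

private:
    std::string path_;
    bool held_ = false;
};
```

You would construct a LockFile for "foo.lock", check held(), and only then open "foo.dat". One caveat with this scheme: if a process crashes while holding the lock, the stale lock file stays behind and has to be cleaned up somehow.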
What is your goal? Are you trying to prevent concurrent read/write operations to files or do you want to implement some form of IPC via files?
Either way, look at Boost.Interprocess: it lets you use file locks (and other cool stuff for IPC), and it has the added advantage of being portable!