Writing concurrently to a file - c++

I have this tool in which a single log-like file is written to by several processes.
What I want to achieve is to have the file truncated when it is first opened, and then have all writes done at the end by the several processes that have it open.
All writes are systematically flushed and mutex-protected so that I don't get jumbled output.
First, a process creates the file, then starts a sequence of other processes, one at a time, that then open the file and write to it (the master sometimes chimes in with additional content; a slave process may or may not have the file open and be writing to it at any given moment).
I'd like, as much as possible, not to use more IPC than what already exists (all I'm doing now is writing to a popen-created pipe). I have no access to external libraries other than the CRT and Win32 API, and I would rather not start writing serialization code.
Here is some code that shows where I've gone:
// open the file. Truncate it if we're the 'master', append to it if we're a 'slave'
std::ofstream blah(filename, ios::out | (isClient ? ios::app : 0));
// do stuff...
// write stuff
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();
Well, this does not work: although the output of the slave process is ordered as expected, what the master writes is either bunched together or at the wrong place, when it exists at all.
I have two questions: is the flag combination given to the ofstream's constructor the right one? And am I going about this the right way at all?

If you'll be writing a lot of data to the log from multiple threads, you'll need to rethink the design, since all threads will block trying to acquire the mutex, and in general you don't want your threads blocked from doing work just so they can log. In that case, you'd want your worker threads to log entries to a queue (which just requires moving stuff around in memory), and have a dedicated thread pull entries off the queue and write them to the output. That way your worker threads are blocked for as short a time as possible.
You can do even better than this by using async I/O, but that gets a bit more tricky.
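A minimal sketch of that queue-based design (assuming C++11; the LogQueue class name and the file path are illustrative, not from the question): worker threads only move a string into a queue under a short lock, and a single dedicated thread does all the slow file I/O.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>

class LogQueue {
public:
    void push(std::string line) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(line));              // workers only touch memory here
        }
        cv_.notify_one();
    }
    void run(const char* path) {                   // executed by the dedicated logger thread
        std::ofstream out(path, std::ios::app);
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return !q_.empty() || done_; });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop();
                lock.unlock();                     // do the slow I/O without holding the lock
                out << line << '\n' << std::flush;
                lock.lock();
            }
            if (done_) break;
        }
    }
    void stop() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
};

Worker threads call push(); one thread runs run("app.log"); stop() unblocks it at shutdown.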

As suggested by reinier, the problem was not in the way I use the files but in the way the programs behave.
The fstreams do just fine.
What I had missed was the synchronization between the master and the slave (the former was assuming a particular operation was synchronous when it was not).
edit: Oh well, there was still a problem with the open flags. The process that opened the file with ios::out did not move the file pointer as needed (erasing text other processes were writing), and using seekp() completely screwed the output when writing to cout, as another part of the code uses cerr.
My final solution is to keep the mutex and the flush, and, for the master process, open the file in ios::out mode (to create or truncate the file), close it and reopen it using ios::app.
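In code, that final arrangement might look roughly like this (a sketch, error handling omitted; the slave processes simply open with std::ios::app from the start):

// master: create or truncate the file, then reopen it for appending
std::ofstream blah(filename, std::ios::out | std::ios::trunc);
blah.close();
blah.open(filename, std::ios::app);   // every subsequent write lands at the current end

// later, as before, writes stay mutex-protected and flushed
myMutex.acquire();
blah << "stuff to write" << std::flush;
myMutex.release();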

I made a little log system that has its own process to handle the writing. The idea is quite simple: the processes that use the log just send entries to a pending queue, which the log process then writes to a file. It's like batch processing in any real-time rendering app. This way you get rid of too many open/close file operations. If I can, I'll add some sample code.

How do you create that mutex?
For this to work, it needs to be a named mutex so that both processes actually lock on the same thing.
You can check that your mutex is actually working correctly with a small piece of code that locks it in one process while another process tries to acquire it.
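For reference, a minimal Win32 sketch of such a named mutex (the name "MyLogMutex" is only an example; the master and the slaves must use the exact same name):

#include <windows.h>

HANDLE hMutex = CreateMutexW(nullptr, FALSE, L"MyLogMutex"); // opens the existing mutex if another process created it first
if (hMutex != nullptr) {
    WaitForSingleObject(hMutex, INFINITE);   // acquire
    // ... write to the file and flush ...
    ReleaseMutex(hMutex);                    // release
}

If CreateMutexW finds an existing mutex with that name it returns a handle to it and GetLastError() reports ERROR_ALREADY_EXISTS, which is exactly what you want here.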

I suggest blocking such that the text is completely written to the file before releasing the mutex. I've had instances where the text from one task is interrupted by text from a higher priority thread; doesn't look very pretty.
Also, use a comma-separated format, or some other format that can be easily loaded into a spreadsheet. Include the thread ID and a timestamp. The interlacing of the text lines shows how the threads are interacting. The ID column allows you to sort by thread. Timestamps can be used to show sequential access as well as duration. Writing in a spreadsheet-friendly format will allow you to analyze the log file with an external tool without writing any conversion utilities. This has helped me greatly.
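For illustration, one way to build such a spreadsheet-friendly line in C++ (the column order timestamp, thread id, message is just an assumption):

#include <chrono>
#include <sstream>
#include <string>
#include <thread>

std::string makeLogLine(const std::string& msg) {
    using namespace std::chrono;
    auto ms = duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
    std::ostringstream line;
    line << ms << ',' << std::this_thread::get_id() << ',' << msg;  // timestamp,thread id,message
    return line.str();
}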

One option is to use ACE::logging. It has an efficient implementation of concurrent logging.

Related

Concurrent state file manipulation with multiple processes beyond our control for Linux and Windows

The question below may sound a bit long and complex, but it's actually quite a simple, generic and common problem of three processes working on the same file. In the text below I try to decompose the problem into a set of particular requirements with some illustrative examples.
Task preamble
There is a text file, called index, which contains some metadata.
There is an application (APP), which understands the file format and performs meaningful changes on it.
The file is stored under version control system (VCS), which is a source of changes performed on the same file by other users.
We need to design an application (APP) that will work with the file in a reasonable way, preferably without interfering much with the VCS, as it's assumed that the VCS is used to keep a large project with the index file being just a small part of it, and the user may want to update from the VCS at any point without considering any ongoing operations within the APP. In that case the APP should handle the situation gracefully, in a way that prevents any possible loss of data.
Preamble remarks
Please note that VCS is unspecified, it could be perforce, git, svn, tarballs, flash drives or your favourite WWII Morse-based radio and a text editor.
The text file could be binary; that doesn't change things much. But with VCS storage in mind, it is prone to being merged, and therefore a text/human-readable format is most adequate.
Possible examples of such things are: complex configurations (AI behaviour trees, game object descriptions), resource listings, other things that are not meant to be edited by hand, related to the project at hand, but whose history matters.
Note that, unless you are keen to implement your own version control system, "outsourcing" most of the configuration into some external, client-server based solution does not solve the problem - you still have to keep a reference file within the version control system with a reference to the matching version of the configuration in the database. Which means that you still have the same problem, but at a slightly smaller scale - a single text line in a file instead of a dozen.
The task itself
A generic APP in a vacuum may process the index in three phases: read, modify, write. The read phase reads and de-serializes the file, the modify phase changes an in-memory state, and the write phase serializes the state and writes it to the file.
There are three kinds of generic workflows for such an application:
read -> <present an information>
read -> <present an information and await user's input> -> modify -> write
read -> modify -> write
The first workflow is for read-only "users", like a game client, which reads data once and forgets about the file.
The second workflow is for an editing application. With external updates being a rather rare occurrence, and it being improbable that a user will edit the same file in several editing applications at the same time, it's only reasonable to assume that a generic editing application will want to read the state only once (especially if it's a resource-consuming operation) and re-read it only in case of external updates.
The third workflow is for an automated cli usage - build servers, scripts and such.
With that in mind, it's reasonable to treat read and modify + write separately. Let's call an operation that performs only the read phase and prepares some information a read operation. And a write operation would be an operation that modifies a state from a read operation and writes it to the disk.
As workflows one and two may be run at the same time by different application instances, it's also reasonable to allow multiple read operations running at the same time. Some read operations, like reads for editing applications, may want to wait until any existing write operations are finished, to read the most recent and up-to-date state. Other read operations, like those in a game client, may want to read the current state, whatever it is, without being blocked at all.
On the other hand, it's only reasonable for write operations to detect any other write operations running and abort. Write operations should also detect any external changes made to the index file and abort. Rationale: there is no point in performing (and waiting for) any work that would be thrown away because it was based on a possibly out-of-date state.
For a robust application, the possibility of a critical failure of galactic scale should be assumed at every single point of the application. Under no circumstances should such a failure leave the index file inconsistent.
Requirements
file reads are consistent - under no circumstances should we read one half of the file from before it has been changed and the other half from after.
write operations are exclusive - no other write operations are allowed at the same time with the same file.
write operations are robustly waitable - we should be able to wait for a write operation to complete or fail.
write operations are transactional - under no circumstances should the file be left in partially changed or otherwise inconsistent state or based on an out-of-date state. Any change to the index file prior or during the operation should be detected and operation should be aborted as soon as possible.
Linux
A read operation (a condensed code sketch follows the steps):
Obtain a shared lock, if requested - open(2) (O_CREAT | O_RDONLY) and flock(2) (LOCK_SH) the "lock" file.
open(2) (O_RDONLY) the index file.
Create contents snapshot and parse it.
close(2) the index file.
Unlock - flock(2) (LOCK_UN) and close(2) the "lock" file
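A condensed sketch of those read steps (file names assumed, error handling mostly omitted):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int lock_fd = open("index.lock", O_CREAT | O_RDONLY, 0644);
flock(lock_fd, LOCK_SH);               // shared: any number of readers may hold it

int idx_fd = open("index", O_RDONLY);
// ... read the whole file into a buffer (the snapshot) and parse it ...
close(idx_fd);

flock(lock_fd, LOCK_UN);
close(lock_fd);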
A write operation (a rough code sketch follows below):
Obtain an exclusive lock - open(2) (O_CREAT | O_RDONLY) and flock(2) (LOCK_EX) the "lock" file.
open(2) (O_RDONLY) the index file.
fcntl(2) (F_SETLEASE, F_RDLCK) the index file - we are only interested in writes, hence the F_RDLCK lease.
Check if the state is up-to-date, do things, change the state, write it to a temporary file nearby.
rename(2) the temporary file to the index - it's atomic, and if we haven't got a lease break so far, we won't at all - this will be a different file, not the one we've got the lease on.
fcntl(2) (F_SETLEASE, F_UNLCK) the index file.
close(2) the index file (the "old" one, with no reference in the filesystem left)
Unlock - close(2) the "lock" file
If a signal from the lease is received - abort and cleanup, no rename. rename(2) has no mention that it might be interrupted and POSIX requires it to be atomic, so once we've got to it - we've made it.
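A rough sketch of the write steps, with the lease break reduced to a flag set from the signal handler (F_SETLEASE is Linux-specific and delivers SIGIO by default when another process opens the index for writing; the paths and function name are assumptions):

#include <csignal>
#include <cstdio>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken = 0;
static void on_lease_break(int) { lease_broken = 1; }

bool write_index(const char* index_path, const char* tmp_path) {
    std::signal(SIGIO, on_lease_break);

    int lock_fd = open("index.lock", O_CREAT | O_RDONLY, 0644);
    flock(lock_fd, LOCK_EX);                          // one writer at a time

    int idx_fd = open(index_path, O_RDONLY);
    fcntl(idx_fd, F_SETLEASE, F_RDLCK);               // get told about outside writers

    // ... check the state is up to date, modify it, serialize it to tmp_path ...

    bool ok = !lease_broken && std::rename(tmp_path, index_path) == 0;  // atomic swap

    fcntl(idx_fd, F_SETLEASE, F_UNLCK);
    close(idx_fd);                                    // the "old" file, now without a name
    flock(lock_fd, LOCK_UN);
    close(lock_fd);
    return ok;
}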
I know there are shared-memory mutexes and named semaphores (instead of an advisory locking for cooperation between application instances), but I think we all agree, that they are needlessly complex for the task at hand and have their own problems.
Windows
A read operation:
Obtain a shared lock, if requested - CreateFile (OPEN_ALWAYS, GENERIC_READ, FILE_SHARE_READ) and LockFileEx (1 byte) the "lock" file
CreateFile (OPEN_EXISTING, GENERIC_READ, FILE_SHARE_READ) the index file
Read file contents
CloseHandle the index
Unlock - CloseHandle the "lock" file
A write operation:
Obtain an exclusive lock - CreateFile (OPEN_ALWAYS, GENERIC_READ, FILE_SHARE_READ) and LockFileEx (LOCKFILE_EXCLUSIVE_LOCK, 1 byte) the "lock" file
CreateFile (OPEN_EXISTING, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE ) the index file
ReadDirectoryChanges (FALSE, FILE_NOTIFY_CHANGE_LAST_WRITE) on the index file directory, with OVERLAPPED structure and an event
Check the state is up-to-date. Modify the state. Write it to a temporary file
Replace the index file with a temporary
CloseHandle the index
Unlock - CloseHandle the "lock" file
During the modification part check for the event from the OVERLAPPED structure with WaitForSingleObject (zero timeout). If there are events for the index - abort the operation. Otherwise - fire the watch again, check if we are still up-to-date and if so - continue. (A partial code sketch of the locking part follows below.)
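A partial Win32 sketch of the locking part (steps 1 and 7); the change watch and the temporary-file swap are elided, and the file names are examples:

#include <windows.h>

HANDLE hLock = CreateFileW(L"index.lock", GENERIC_READ, FILE_SHARE_READ,
                           nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
OVERLAPPED ov = {};                                  // lock the first byte, exclusively
LockFileEx(hLock, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &ov);

// ... open the index, start ReadDirectoryChangesW on its directory,
//     modify the state, write the temporary file, then swap it in, e.g.
//     ReplaceFileW(L"index", L"index.tmp", nullptr, 0, nullptr, nullptr); ...

UnlockFileEx(hLock, 0, 1, 0, &ov);
CloseHandle(hLock);                                  // also releases any remaining lock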
Remarks
The Windows version uses locking instead of the Linux version's notification mechanism, which may interfere with outside processes making writes, but there is seemingly no other way on Windows.
In Linux, you can also use mandatory file locking.
See "Semantics" section:
If a process has locked a region of a file with a mandatory read lock, then other processes are permitted to read from that region. If any of these processes attempts to write to the region it will block until the lock is released, unless the process has opened the file with the O_NONBLOCK flag in which case the system call will return immediately with the error status EAGAIN.
and:
If a process has locked a region of a file with a mandatory write lock, all attempts to read or write to that region block until the lock is released, unless a process has opened the file with the O_NONBLOCK flag in which case the system call will return immediately with the error status EAGAIN.
With this approach, the APP may set a read or write lock on the file, and the VCS will be blocked until the lock is released.
Note that neither mandatory locks nor file leases will work well if the VCS can unlink() the index file or replace it using rename():
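For completeness, taking such a lock uses the ordinary fcntl(2) record-locking call; whether it behaves as mandatory depends on the filesystem being mounted with "mand" and on the file's mode bits (set-group-ID set, group execute cleared). A minimal sketch:

#include <fcntl.h>
#include <unistd.h>

int fd = open("index", O_RDWR);
struct flock fl = {};
fl.l_type   = F_WRLCK;      // or F_RDLCK for a read lock
fl.l_whence = SEEK_SET;
fl.l_start  = 0;
fl.l_len    = 0;            // 0 means "up to the end of the file"
fcntl(fd, F_SETLKW, &fl);   // blocks until the lock is granted

// ... read / modify / write the locked region ...

fl.l_type = F_UNLCK;
fcntl(fd, F_SETLK, &fl);
close(fd);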
If you use mandatory locks, VCS won't be blocked.
If you use file leases, APP won't get notification.
You also can't establish locks or leases on a directory. What you can do in this case:
After a read operation, the APP can manually check that the file still exists and has the same i-node.
But that's not enough for write operations. Since the APP can't atomically check the file's i-node and modify the file, it can accidentally overwrite changes made by the VCS without being able to detect it. You can probably detect this situation using inotify(7).
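A small sketch of that manual check (the function and parameter names are made up; the i-node and device numbers would be remembered from an fstat(2) done while the file was open for reading):

#include <sys/stat.h>

bool index_unchanged(const char* path, ino_t ino_at_read, dev_t dev_at_read) {
    struct stat st;
    if (stat(path, &st) != 0)
        return false;                              // file is gone (unlinked or renamed away)
    return st.st_ino == ino_at_read && st.st_dev == dev_at_read;
}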

fortran: wait to open a file until closed by another application

I have a Fortran code which needs to read a series of ASCII data files (which all together are about 25 GB). Basically the code opens a given ascii file, reads the information and use it to do some operations, and then close it. Then opens another file, reads the information, do some operations, and close it again. And so on with the rest of ascii files.
Overall each complete run takes about 10h. I usually need to run several independent calculations with different parameters, and the way I do is to run each independent calculation sequentially, so that at the end if I have 10 independent calculations, the total CPU time is 100h.
A more rapid way would be to run the 10 independent calculations at the same time using different processors on a cluster machine, but the problem is that if a given calculation needs to open and read data from a given ascii file which has been already opened and is being used by another calculation, then the code obviously gives an error.
I wonder whether there is a way to verify if a given ascii file is already being used by another calculation, and if so to ask the code to wait until the ascii file is finally closed.
Any help would be of great help.
Many thanks in advance.
Obamakoak.
Two processes should be able to read the same file. Perhaps action="read" on the open statement might help. Must the files be human readable? The I/O would very likely be much faster with unformatted (sometimes called binary) files.
P.S. If your OS doesn't support multiple-read access, you might have to create your own lock system. Create a master file that a process opens to check which files are in use or not, and to update said list. Immediately closing after a check or update. To handle collisions on this read/write file, use iostat on the open statement and retry after a delay if there is an error.
I know this is an old thread but I've been struggling with the same issue for my own code.
My first attempt was creating a variable on a certain process (e.g. the master) and accessing this variable exclusively using one-sided passive MPI. This is fancy and works well, but only with newer versions of MPI.
Also, my code seemed happy to open (with READWRITE status) files that were also open in other processes.
Therefore, the easiest workaround, if your program has file access, is to make use of an external lock file, as described here. In your case, the code might look something like this:
A process checks whether the lock file exists by opening it with STATUS='NEW', which fails if the file already exists. It will look something like:
file_exists = .true.
do while (file_exists)
   open(STATUS='NEW',unit=11,file=lock_file_name,iostat=open_stat)
   if (open_stat.eq.0) then
      file_exists = .false.
      open(STATUS='OLD',ACTION='READWRITE',unit=12,file=data_file_name,iostat=ierr)
      if (ierr.ne.0) stop
   else
      call sleep(1)
   end if
end do
The file is now opened exclusively by the current process. Do the operations you need to do, such as reading, writing.
When you are done, close the data file and finally the lock file
close(12,iostat=ierr)
if (ierr.ne.0) stop
close(11,status='DELETE',iostat=ierr)
if (ierr.ne.0) stop
The data file is now again unlocked for the other processes.
I hope this may be useful for other people who have the same problem.

Reading file that changes over time C++

I am going to read a file in C++. The reading itself is happening in a while-loop, and is reading from one file.
When the function reads information from the file, it is going to push this information up some place in the system. The problem is that this file may change while the loop is ongoing.
How may I catch that new information in the file? I tried out std::ifstream reading and changing my file manually on my computer while the endless loop (with a sleep(2) between iterations) was ongoing, but as expected -- nothing happened.
EDIT: the file will overwrite itself at each new entry of data to the file.
Help?
Running on virtual box Ubuntu Linux 12.04, if that may be useful info. And this is not homework.
The usual solution is something along the lines of what MichaelH proposes: the writing process opens the file in append mode, and always writes to the end. The reading process does what MichaelH suggests.
This works fine for small amounts of data in each run. If the processes are supposed to run a long time, and generate a lot of data, the file will eventually become too big, since it will contain all of the processed data as well. In this case, the solution is to use a directory, generating numbered files in it, one file per data record. The writing process will write each data set to a new file (incrementing the number), and the reading process will try to open the new file, and delete it when it has finished. This is considerably more complex than the first suggestion, but will work even for processes generating large amounts of data per second and running for years.
EDIT:
Later comments by the OP say that the device is actually a FIFO. In that case:
you can't seek, so MichaelH's suggestion can't be used literally, but
you don't need to seek, since data is automatically removed from the FIFO whenever it has been read, and
depending on the size of the data, and how it is written, the writes may be atomic, so you don't have to worry about partial records, even if you happen to read exactly in the middle of a write.
With regards to the latter: make sure that both the read and write buffers are large enough to contain a complete record, and that the writer flushes after each record. And make sure that the records are smaller than the size needed to guarantee atomicity. (Historically, on the early Unix I know, this was 4096, but I would be surprised if it hasn't increased since then. Although... Under POSIX, this is defined by PIPE_BUF, which is only guaranteed to be at least 512, and is only 4096 under modern Linux.)
Just read the file, rename the file, open the renamed file, do the processing of the data into your system, and at the end of the loop close the file. After a sleep, re-open the file at the top of the while loop, rename it and repeat.
That's the simplest way to approach the problem and saves having to write code to process dynamic changes to the file during the processing stage.
To be absolutely sure you don't get any corruption it's best to rename the file. This guarantees that any changes from another process do not affect the processing. It may not be necessary to do this - it depends on the processing and how the file is updated. But it's a safer approach. A move or rename operation is guaranteed to be atomic - so there should be no concurrency issues if using this approach.
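A small sketch of that rename-then-process loop body (file names are examples; std::rename on POSIX maps to the atomic rename(2)):

#include <cstdio>
#include <fstream>
#include <string>

const char* live = "data.txt";
const char* work = "data.processing";

if (std::rename(live, work) == 0) {      // take the current contents out from under the writer
    std::ifstream in(work);
    std::string line;
    while (std::getline(in, line)) {
        // ... push the data into the system ...
    }
    in.close();
    std::remove(work);                   // done with this snapshot
}
// sleep, then repeat from the rename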
You can use inotify to watch file changes.
If you need a simpler solution, read the file attributes (with stat()) and check the last_write_time of the file.
However, you still may miss some file modifications while you are opening and re-reading the file. So if you have control over the application which writes to the file, I'd recommend using something else to communicate between these processes, pipes for example.
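A hedged sketch of the inotify variant (the watched path is an example):

#include <sys/inotify.h>
#include <unistd.h>

int infd = inotify_init();
int wd   = inotify_add_watch(infd, "data.txt",
                             IN_MODIFY | IN_MOVE_SELF | IN_DELETE_SELF);

char buf[4096];
ssize_t n = read(infd, buf, sizeof buf);   // blocks until at least one event arrives
if (n > 0) {
    // something happened to the file: re-open and re-read it
}

inotify_rm_watch(infd, wd);
close(infd);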
To be more explicit, if you want tail-like behavior you'll want to:
Open the file, read in the data. Save the length. Close the file.
Wait for a bit.
Open the file again, attempt to seek to the last read position, read the remaining data, close.
Rinse and repeat. (A rough sketch of this loop follows.)
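A rough sketch of that tail-like loop, assuming the writer appends (if the file is rewritten from scratch, the size check below resets the position):

#include <chrono>
#include <fstream>
#include <string>
#include <thread>

std::streamoff last_pos = 0;
for (;;) {
    std::ifstream in("data.txt");
    if (in) {
        in.seekg(0, std::ios::end);
        std::streamoff end = in.tellg();
        if (end < last_pos)
            last_pos = 0;                  // file shrank: it was overwritten, start over
        in.seekg(last_pos, std::ios::beg);
        std::string line;
        while (std::getline(in, line)) {
            // ... handle the newly appeared line ...
        }
        in.clear();                        // clear eofbit so tellg() is usable
        last_pos = in.tellg();
    }
    std::this_thread::sleep_for(std::chrono::seconds(2));
}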

update a file simultaneously without locking file

Problem: multiple processes want to update a file simultaneously. I do not want to use file locking functionality, as in a highly loaded environment a process may block for a while, which I don't want. I want something like all processes sending data to a queue or some shared place or something else, and one master process that will keep taking data from there and writing it to the file, so that no process gets blocked.
One possibility is using socket programming. All the processes will send data to a single port and the master keeps listening on this single port and stores the data to the file. But what if the master goes down for a few seconds? If that happens I may write to some file based on a timestamp and then sync later. But I am putting this on hold and looking for some other solution. (No data loss.)
Another possibility may be taking a lock on the particular segment of the file to which the process wants to write. Basically each process will write a line. I am not sure how good it will be for a highly loaded system.
Please suggest some solution for this problem.
Have a 0mq instance handle the writes (as you initially proposed for the socket) and have the workers connect to it and add their writes to the queue (example in many languages).
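A hedged ZeroMQ sketch of that layout (PUSH workers, a single PULL master; the endpoint and file name are examples):

#include <fstream>
#include <zmq.h>

// master process: drain the queue and do all the file I/O
void* ctx  = zmq_ctx_new();
void* pull = zmq_socket(ctx, ZMQ_PULL);
zmq_bind(pull, "tcp://127.0.0.1:5555");

std::ofstream out("app.log", std::ios::app);
char buf[4096];
for (;;) {
    int n = zmq_recv(pull, buf, sizeof buf, 0);   // one message per entry
    if (n > 0) {
        if (n > (int)sizeof buf) n = sizeof buf;  // longer messages were truncated to the buffer
        out.write(buf, n).flush();
    }
}

// each worker process:
//   void* push = zmq_socket(zmq_ctx_new(), ZMQ_PUSH);
//   zmq_connect(push, "tcp://127.0.0.1:5555");
//   zmq_send(push, "one log line\n", 13, 0);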
Each process can write to its own file (pid.temp) and periodically rename the file (pid-0.data, pid-1.data, ...) for a master process that can grab all these files.
You may not need to construct something like this. If you do not want processes to get blocked, just use the LOCK_NB flag of Perl's flock. Periodically try to flock. If that does not succeed, continue processing and store the values in an array. Once the file can be locked, write the buffered data to it from the array.
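The same idea expressed with the POSIX flock(2) call (a sketch; the fd is assumed to be open with O_WRONLY | O_APPEND):

#include <string>
#include <sys/file.h>
#include <unistd.h>
#include <vector>

std::vector<std::string> pending;            // entries we could not write yet

void try_write(int fd, const std::string& entry) {
    pending.push_back(entry);
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) { // got the lock without blocking
        for (const std::string& s : pending)
            write(fd, s.data(), s.size());
        pending.clear();
        flock(fd, LOCK_UN);
    }
    // otherwise keep the data buffered and try again on the next call
}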

In C/C++ I want to write to the same pipe multiple times

I have a program that creates pipes between two processes. One process constantly monitors the output of the other and when specific output is encountered it gives input through the other pipe with the write() function. The problem I am having, though, is that the contents of the pipe don't go through to the other process's stdin stream until I close() the pipe. I want this program to loop infinitely and react every time it encounters the output it is looking for. Is there any way to send the input to the other process without closing the pipe?
I have searched a bit and found that named pipes can be reopened after closing them, but I wanted to find out if there was another option since I have already written the code to use unnamed pipes and I haven't yet learned to use named pipes.
Take a look at using fflush.
How are you reading the other end? Are you expecting complete strings? You aren't sending terminating NULs in the snippet you posted. Perhaps sending strlen(string)+1 bytes will fix it. Without seeing the code it's hard to tell.
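For what it's worth, a sketch of what "sending strlen()+1 bytes" looks like on the writing side (the function name is made up; real code should also handle EINTR):

#include <cstring>
#include <unistd.h>

void send_command(int write_fd, const char* cmd) {
    size_t len = std::strlen(cmd) + 1;      // include the terminating '\0'
    const char* p = cmd;
    while (len > 0) {                       // write() may write fewer bytes than asked
        ssize_t n = write(write_fd, p, len);
        if (n <= 0) break;
        p += n;
        len -= n;
    }
}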
Use fsync. http://pubs.opengroup.org/onlinepubs/007908799/xsh/fsync.html
From http://www.delorie.com/gnu/docs/glibc/libc_239.html:
Once write returns, the data is enqueued to be written and can be read back right away, but it is not necessarily written out to permanent storage immediately. You can use fsync when you need to be sure your data has been permanently stored before continuing. (It is more efficient for the system to batch up consecutive writes and do them all at once when convenient. Normally they will always be written to disk within a minute or less.) Modern systems provide another function fdatasync which guarantees integrity only for the file data and is therefore faster. You can use the O_FSYNC open mode to make write always store the data to disk before returning.