Is there a way to dump a buffer to file atomically?
By "atomically" I mean: if for example someone terminates my application during writing, I'd like to have file in either before- or after-writing state, but not in a corrupted intermediate state.
If the answer is "no", could it at least be done with really small buffers?
For example, can I dump two consecutive int32_t variables with a single 8-byte fwrite (on an x64 platform) and be sure that both of those int32s are written, or neither of them, but never just one?
I recommend writing to a temporary file and then doing a rename(2) on it.
ofstream o("file.tmp"); // Write to a temporary file
o << "my data";
o.close();
if (o) // Only replace the real file if the write actually succeeded
    // Perform an atomic move operation... needed so readers can't open a partially written file
    rename("file.tmp", "file.real");
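For crash safety you usually also want the file's contents flushed to disk before the rename, so a power loss cannot leave the new name pointing at an empty file. A minimal sketch, POSIX-specific (the function name `write_atomically` is mine, not a library call):

```cpp
#include <cstdio>   // std::rename
#include <fstream>
#include <fcntl.h>  // open
#include <unistd.h> // fsync, close

// Write `data` to `tmp`, fsync it, then atomically rename it to `real`.
bool write_atomically(const char *tmp, const char *real, const char *data) {
    {
        std::ofstream o(tmp, std::ios::binary | std::ios::trunc);
        o << data;
        if (!o) return false;   // write failed; leave the real file untouched
    }                           // destructor flushes and closes the stream
    // Push the temporary file's contents to the device before the rename.
    int fd = ::open(tmp, O_RDONLY);
    if (fd < 0) return false;
    bool synced = (::fsync(fd) == 0);
    ::close(fd);
    if (!synced) return false;
    return std::rename(tmp, real) == 0; // atomic replacement on POSIX filesystems
}
```

Note that `rename` replacing an existing target atomically is a POSIX guarantee; on Windows you would need `ReplaceFile` or `MoveFileEx` with `MOVEFILE_REPLACE_EXISTING` instead.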
Thread 1 (T1) creates the file using
FILE *MyFile = tmpfile();
Thread 2 (T2) then starts writing to the file. While thread 2 is writing, thread 1 occasionally reads from the file.
I set it up such that T2 is temporarily suspended when T1 is reading but, as T1 is only ever reading part of the file T2 won't be writing to (the file is written sequentially), I'm wondering if suspending T2 is necessary. I know this would be OK if FILE was replaced by fixed size array / vector. Just wondering how disc differs from memory.
Edit.
The writes are done using fseek and fwrite. The reads are done using fseek and fread. I assumed that was a given, but maybe not from some of the comments. I suppose if T1 fseeks to position X at the same time as T2 fseeks to position Y, then who knows where the next read or write will start from. I'll take a look at pipes. Thanks for the help.
Mixing reads and writes on a FILE is not even safe when dealing with a single thread. From the manpage of fopen:
Reads and writes may be intermixed on read/write streams in any order. Note that ANSI C
requires that a file positioning function intervene between output and input, unless an input
operation encounters end-of-file. (If this condition is not met, then a read is allowed to
return the result of writes other than the most recent.) Therefore it is good practice (and
indeed sometimes necessary under Linux) to put an fseek(3) or fgetpos(3) operation between
write and read operations on such a stream. This operation may be an apparent no-op (as in
fseek(..., 0L, SEEK_CUR) called for its synchronizing side effect).
So don't assume reads and writes are magically synchronized for you and protect access to the FILE with a mutex.
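The manpage rule above bites even in a single thread. A small sketch of the required positioning calls (file name `demo.txt` is arbitrary):

```cpp
#include <cstdio>
#include <cstring>

// Demonstrates the ANSI C rule: a positioning function must separate
// output from input (and input from output) on an update stream.
// Returns true if the read-back matches what was written.
bool update_stream_demo() {
    std::FILE *f = std::fopen("demo.txt", "w+");
    if (!f) return false;
    std::fputs("hello", f);
    // Required before switching from writing to reading:
    std::fseek(f, 0L, SEEK_SET);
    char buf[16] = {0};
    std::fread(buf, 1, 5, f);
    bool ok = (std::strcmp(buf, "hello") == 0);
    // Switching back from reading to writing also needs a positioning call;
    // fseek(f, 0, SEEK_CUR) is the classic "apparent no-op" for this.
    std::fseek(f, 0L, SEEK_CUR);
    std::fputs(" world", f);
    std::fclose(f);
    std::remove("demo.txt");
    return ok;
}
```

With multiple threads you need this *plus* a mutex, since the positioning call and the read/write must execute as one unit.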
I'm trying to do random writes (a benchmark test) to a file using multiple threads (pthreads). It looks like if I comment out the mutex lock, the created file size is less than expected, as if some writes are getting lost (always some multiple of the chunk size). But if I keep the mutex, it's always the exact size.
Does my code have a problem somewhere else and the mutex is not really required (as suggested by @evan), or is the mutex necessary here?
void *DiskWorker(void *threadarg) {
    FILE *theFile = fopen(fileToWrite, "a+");
    ....
    for (long i = 0; i < noOfWrites; ++i) {
        //pthread_mutex_lock(&mutexsum);
        // For random access:
        fseek(theFile, randomArray[i] * chunkSize, SEEK_SET);
        fputs(data, theFile);
        // Or for sequential access (in that case the 2 lines above would not be here):
        fprintf(theFile, "%s", data);
        // sequential access end
        fflush(theFile);
        //pthread_mutex_unlock(&mutexsum);
    }
    .....
}
You are opening a file using "append mode". According to C11:
Opening a file with append mode ('a' as the first character in the
mode argument) causes all subsequent writes to the file to be forced
to the then current end-of-file, regardless of intervening calls to
the fseek function.
The C standard does not specify exactly how this should be implemented, but on POSIX systems it is usually implemented using the O_APPEND flag of the open function, while flushing data is done using the write function. Note that the fseek call in your code should have no effect.
I think POSIX requires this, as it describes how redirecting output in append mode (>>) is done by the shell:
Appended output redirection shall cause the file whose name results
from the expansion of word to be opened for output on the designated
file descriptor. The file is opened as if the open() function as
defined in the System Interfaces volume of POSIX.1-2008 was called
with the O_APPEND flag. If the file does not exist, it shall be
created.
And since most programs use the FILE interface to send data to stdout, this probably requires fopen to use open with O_APPEND and write (and not functions like pwrite) when writing data.
So if on your system fopen with 'a' mode uses O_APPEND, flushing is done using write, and your kernel and filesystem correctly implement the O_APPEND flag, using a mutex should have no effect, as the writes do not interleave:
If the O_APPEND flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Note that not all filesystems support this behavior. Check this answer.
As for my answer to your previous question, my suggestion was to remove the mutex, as it should have no effect on the size of the file (and it didn't have any effect on my machine).
Personally, I never really used O_APPEND and would be hesitant to do so, as its behavior might not be supported at some level, plus its behavior is weird on Linux (see "bugs" section of pwrite).
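To make the O_APPEND mechanism concrete, here is a minimal sketch using the raw POSIX calls the answer describes (function name `append_record` is mine). The key property: the seek-to-end and the write happen as one atomic step inside the kernel, so concurrent appenders on a local filesystem cannot overwrite each other:

```cpp
#include <fcntl.h>   // open, O_APPEND
#include <unistd.h>  // write, close
#include <cstring>

// Appends one record to `path`; returns bytes written, or -1 on error.
ssize_t append_record(const char *path, const char *rec) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;
    // With O_APPEND the file offset is set to EOF immediately before
    // each write, with no intervening modification possible.
    ssize_t n = ::write(fd, rec, std::strlen(rec));
    ::close(fd);
    return n;
}
```

This is exactly what an `fopen(path, "a")` + `fflush` pair boils down to on a typical POSIX libc, which is why the mutex made no difference to the file size in the append case.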
You definitely need a mutex because you are issuing several different file commands. The underlying file subsystem can't possibly know how many file commands you are going to call to complete your whole operation.
So you need the mutex.
In your situation you may find you get better performance putting the mutex outside the loop. The reason being that, otherwise, switching between threads may cause excessive skipping between different parts of the disk. Hard disks take about 10ms to move the read/write head so that could potentially slow things down a lot.
So it might be a good idea to benchmark that.
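The essential point of this answer is that fseek and fwrite form one logical operation, so the lock must cover both. A sketch with C++'s std::mutex (names `g_file_mutex` and `write_chunk_at` are mine, not from the question):

```cpp
#include <cstdio>
#include <cstring>
#include <mutex>

std::mutex g_file_mutex;

// The seek+write pair must be a single critical section; otherwise another
// thread can move the shared file position between the two calls.
bool write_chunk_at(std::FILE *f, long offset, const char *data) {
    std::lock_guard<std::mutex> lock(g_file_mutex);
    if (std::fseek(f, offset, SEEK_SET) != 0) return false;
    size_t len = std::strlen(data);
    return std::fwrite(data, 1, len, f) == len;
}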
The title is not so clear but what I mean is this:
std::fstream filestream("abc.dat", std::ios::out);
double write_to_file;
while (some_condition) {
    write_to_file = 1.345; // this number will be different in each loop iteration
    filestream.seekp(345); // seekp, not seekg: we are moving the write position
    filestream << std::setw(5) << write_to_file << std::flush;
    // write the number to replace the number written in the previous iteration
    system("./Some_app ./abc.dat"); // run an application on unix,
                                    // which uses "abc.dat" as the input file
}
filestream.close();
That's the rough idea: each iteration rewrites the number in the file and flushes. I'm hoping not to open and close the file in each iteration, in order to save computing time (I'm also not sure of the cost of open and close). Is it OK to do this?
On unix, std::flush does not necessarily write to the physical device. Typically, it does not. std::ofstream::flush calls rdbuf->pubsync(), which in turn calls rdbuf->sync(), which in turn "synchronizes the controlled sequences with the arrays." What are those "controlled sequences"? Typically they're not the underlying physical device. In a modern OS such as unix, there are lots of things in between high-level I/O constructs such as C++'s concept of an I/O buffer and the bits on the device.
Even the low-level POSIX function fsync() does not necessarily guarantee that bits are written to the device. Even closing and reopening the output file does not necessarily guarantee that bits are written to the device.
You might want to rethink your design.
You need at least to flush the C++ stream buffer with filestream.flush() before calling system (but you did that with << std::flush;)
I am assuming that ./Someapp is not writing the file, and is opening it for reading only.
But in your case, better open and close the file at each iteration, since the system call is obviously a huge bottleneck.
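A sketch of the open-per-iteration variant this answer suggests (the function name `rewrite_value` is mine; the `system("./Some_app ./abc.dat")` call from the question would follow each call):

```cpp
#include <fstream>
#include <iomanip>

// Rewrite the whole file with the current value. Opening truncates the
// old contents; the destructor flushes and closes before we return, so
// an external reader launched afterwards sees a complete file.
void rewrite_value(const char *path, double value) {
    std::ofstream out(path);
    out << std::setw(5) << value;
}
```

Opening an std::ofstream costs microseconds; a system() call costs at least a fork+exec plus whatever ./Some_app does, so the per-iteration open is lost in the noise.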
I am working on a program which uses multiple std::ifstreams for reading a binary file, one std::ifstream per thread. Now I need to know whether std::ofstream is thread-safe on Windows and Linux when writing to the same file. I am using only one std::ofstream, shared by multiple threads.
I am reading a different block with each thread and writing those blocks to the output file using seekp() and write(). It currently works for me, but I wonder whether it becomes problematic for big files.
Is std::ofstream thread safe?
If I haven't misunderstood you - no, nothing in the standard library is thread-safe in this sense (except the std::thread-related facilities from C++11 onwards, of course). You need additional synchronization.
Even more - if there are several processes, reading from/writing to these files, you need to lock the files, to sync the access.
From C++ standards (Input/Output Library Thread Safety):
27.1.3 Thread safety [iostreams.thread-safety]
Concurrent access to a stream object [string.streams, file.streams], stream buffer object
[stream.buffers], or C Library stream [c.files] by multiple threads may result in a data
race [intro.multithread] unless otherwise specified [iostream.objects]. [Note: Data races
result in undefined behavior [intro.multithread]. — end note]
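For the questioner's seekp()+write() pattern, the standard quote above means the two calls must be serialized as one unit. A minimal sketch (the `SharedWriter` name and block layout are mine, for illustration):

```cpp
#include <fstream>
#include <mutex>

// One ofstream shared by all threads, with a mutex guarding each
// seekp+write pair so no other thread can move the put position
// between the two calls.
struct SharedWriter {
    std::ofstream out;
    std::mutex m;

    explicit SharedWriter(const char *path)
        : out(path, std::ios::binary) {}

    bool write_block(std::streampos offset, const char *data,
                     std::streamsize n) {
        std::lock_guard<std::mutex> lock(m);
        out.seekp(offset);
        out.write(data, n);
        return static_cast<bool>(out);
    }
};
```

An alternative that avoids the shared put position entirely is to give each thread its own file descriptor and use pwrite(), which takes an explicit offset per call.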
I have written a little program to verify the thread-safety of std::ifstream and std::ofstream: https://github.com/rmspacefish/rpi-file-concurrency . I have tested it on a Linux desktop host and a Raspberry Pi. The program starts two threads, a writer thread and a reader thread, and there is a textual mode and a binary mode.
In the textual mode, I am writing two lines in the writer thread, and the reader thread attempts to read two lines. For the textual mode, I get the following output:
Concurrent file access test
Write Open Fail Count: 0
Write Count: 191090
Write Fail Count: 0
Read Open Fail Count: 0
Read Count: 93253
Read One Line Count: 93253
Read Both Lines Count: 93253
Faulty Read Count: 0
EOF Count: 0
Fail Count: 0
Finished.
So this appears to be thread-safe on Linux. For the binary mode, I am writing a binary block in the form of a struct consisting of multiple fields (char arrays, integers of various sizes, etc.). I have two states which are written in alternating cycles. In the reader thread, I check the consistency of the data (inconsistent states or, worse, wrong values). Here I get the following results:
Concurrent file access test
Write Open Fail Count: 0
Write Count: 0
Write Fail Count: 0
Read Open Fail Count: 0
Read Count: 0
Blob in state one read: 25491
Blob in state two read: 24702
Blob in invalid state: 0
Faulty Read Count: 0
EOF Count: 91295
Fail Count: 91295
Finished.
I checked the error flags after calling read (and this is important). If there are no error flags, the state is read in a consistent manner. It looks thread-safe to me.
The thread-safety might still be implementation dependent, but at least for Linux/GCC, file access seems to be thread-safe. I will still test this with MSVC on Windows, but Microsoft specified this should be thread-safe as well.
Yes. It is.
For Windows:
it is safe to write to fstream from multiple threads on windows. Please see the msdn document: Thread Safety in the C++ Standard Library
For Linux:
In short, it is. From the documentation of libstdc++: "if your platform's C library is threadsafe, then your fstream I/O operations will be threadsafe at the lowest level". Is your platform's C library thread-safe? Yes. The POSIX standard requires that C stdio FILE* operations (such as fread/fwrite) are atomic, and glibc implements this.
When I run my small-scale parallel codes, I typically output N files (N being number of processors) in the form fileout.dat.xxx where xxx is the processor number (using I3.3) and then just cat them into a single fileout.dat file after the code is finished.
My question is can I use ACCESS='append' or POSITION='append' in the OPEN statement and have all processors write to the same file?
In practice, no. POSITION='append' merely says that the file pointer will be at the end of the file after the open statement is executed. It is, however, possible to change the file position afterwards, e.g. with BACKSPACE, REWIND, or similar statements. Thus, Fortran's POSITION='append' does not correspond to the POSIX O_APPEND, and hence a POSIX OS cannot ensure that all writes only append to the file and never overwrite older data.
Furthermore, in case you run your code on a cluster, be aware that O_APPEND does not work on many networked file systems such as NFS.
In order to do parallel I/O with several processes/threads writing to a single file, use ACCESS='direct' or ACCESS='stream' and have the processes agree on which records/byte ranges to write to.
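The "agree on byte ranges" idea maps directly onto POSIX pwrite(), which takes an explicit offset and never touches a shared file position; the same scheme underlies Fortran ACCESS='direct' writes. A sketch in C++ (the fixed record size and rank-based layout are illustrative assumptions):

```cpp
#include <fcntl.h>   // open
#include <unistd.h>  // pwrite, close

// Writer `rank` owns the byte range [rank*record_size, (rank+1)*record_size),
// so concurrent writers never overlap and need no locking.
bool write_my_record(int fd, int rank, const char *record, size_t record_size) {
    off_t offset = static_cast<off_t>(rank) * static_cast<off_t>(record_size);
    // pwrite writes at an absolute offset, independent of any seek.
    return ::pwrite(fd, record, record_size, offset) ==
           static_cast<ssize_t>(record_size);
}
```

Each process opens the shared output file itself and calls pwrite on its own range; the cat step at the end of the run then becomes unnecessary.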