Modifying and reading a big .txt file with MPI/C++? - c++

I am using MPI together with C++. I want to read information from one file, modify it by some rule, and then write the modified content back to the same file. I am using a temporary file where I store the modified content, and at the end I overwrite the original file with these commands:
temp_file.open("temporary.txt", ios::in);      // reopen the temporary file for reading
ofstream output_file(output_name, ios::out);   // truncate the original output file
output_file << temp_file.rdbuf();              // copy the whole temporary file into it
output_file.flush();
temp_file.close();
output_file.close();
remove("temporary.txt");
The function that modifies the file is executed by the MPI process with rank 0. After returning from the function, MPI_Barrier(MPI_COMM_WORLD); is called to ensure synchronization.
Then all MPI processes should read the modified file and perform some computations. The problem is that, since the file is very big, the data are not completely written to the file by the time the function finishes, and I get wrong results. I also tried inserting a sleep() call, but sometimes it works and sometimes it doesn't (it depends on the node where I run the computation). Is there a general way to solve this problem?
I put MPI as a tag, but I think this problem is inherently connected with the C++ standard and how it deals with storage. How do I handle the latency between writing into a buffer and the data actually reaching the file on the storage medium?

Fun topic. You are dealing with two or maybe three consistency semantics here.
POSIX consistency says, essentially, that when a byte is written to a file, it's visible to subsequent reads.
NFS consistency says "woah, that's way too hard. you write to this file and I'll make it visible whenever I feel like it. "
MPI-IO consistency semantics (which you aren't using, but are good to know) say that data is visible after specific synchronization events occur. Those two events are "close a file and reopen it" or "sync file, barrier, sync file again".
If you are using NFS, give up now. NFS is horrible. There are a lot of good parallel file systems you can use, several of which you can set up entirely in userspace (such as PVFS).
If you use MPI-IO here, you'll get more well-defined behavior, but the MPI-IO routines are more like C system calls than C++ iostream operators, so think more like open(2), read(2), write(2), and close(2). Text files are usually a headache to deal with, but in your case, where modifications are appended to the file, that shouldn't be too bad.
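For illustration, here is a minimal sketch of that second recipe ("sync, barrier, sync") in the setting of the question; the file name and the short buffer are placeholders and this is not a drop-in replacement for the original code:

#include <mpi.h>
#include <cstring>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "modified.txt",
                  MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

    if (rank == 0) {
        const char data[] = "modified content\n";     // placeholder content
        MPI_File_write_at(fh, 0, data, (int)std::strlen(data),
                          MPI_CHAR, MPI_STATUS_IGNORE);
    }

    // The MPI-IO consistency recipe: sync, barrier, sync.
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    // Now every rank can safely read what rank 0 wrote.
    char buf[64] = {0};
    MPI_File_read_at(fh, 0, buf, 17, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}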

Related

Thread-safe file updates

I need to learn how to update a file concurrently without blocking other threads. Let me explain how it should work, what the requirements are, and how I think it should be implemented; then I will ask my questions.
Here is how the worker works:
The worker is multithreaded.
There is one very large file (6 terabytes).
Each thread is updating part of this file.
Each write is equal to one or more disk blocks (4096 bytes).
No two workers write to the same block (or the same group of blocks) at the same time.
Needs:
Threads should not block each other (no lock on the file, or the minimum possible number of locks should be used)
In case of (any kind of) failure, it is acceptable if the block being updated is corrupted.
In case of (any kind of) failure, blocks that are not being updated must not be corrupted.
If a file write was successful, we must be sure that the data is not merely buffered but has actually been written to disk (fsync)
I can convert this large file into as many smaller files as needed (down to 4 KB files), but I prefer not to do that. Handling that many files is difficult and requires a lot of file-handle open/close operations, which has a negative impact on performance.
How I think it should be implemented:
I'm not very familiar with file manipulation and how it works at the operating-system level, but I think writing to a single block should not corrupt other blocks when errors happen. So I think this code should work as needed, without any change:
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>   /* fileno, fsync */

char write_value[4096] = "...4096 bytes of data...";
off_t write_block = 12345;
off_t block_size = 4096;
FILE *fp;
fp = fopen("file.txt", "r+");                     /* "w+" would truncate the existing file */
fseeko(fp, write_block * block_size, SEEK_SET);   /* off_t avoids 32-bit overflow on a 6 TB file */
fwrite(write_value, 1, block_size, fp);           /* write exactly one 4096-byte block */
fsync(fileno(fp));                                /* fsync takes a file descriptor, not a FILE* */
fclose(fp);
Questions:
Obviously, I'm trying to understand how it should be implemented, so any suggestions are welcome. Especially:
If writing to one block of a large file fails, what is the chance of corrupting other blocks of data?
In short, what should be considered to improve the code above (with respect to the previous question)?
Is it possible to replace one block of data with another file/block atomically? (like how the rename() system call replaces one file with another atomically, but at the block level; something like replacing the next-block address of the previous block in the file system, or whatever else).
Any device/file-system/operating-system-specific notes? (This code will run on CentOS or FreeBSD (not decided yet), but I can change the OS if there is a better alternative for this problem. The file is on one 8 TB SSD.)
Threads should not block each other (no lock on the file, or the minimum possible number of locks should be used)
Your code sample uses fseek followed by fwrite. Without locking between those two calls, you have a race condition, because another thread could jump in between. There are three reasonable solutions:
Use flockfile, followed by regular fseek and fwrite (or the GNU extension fwrite_unlocked), then funlockfile. flockfile and funlockfile are POSIX.1-2001 standard
Use separate file handles per thread
Use pread and pwrite to do IO without having to worry about the seek position
Option 3 is the best for you.
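For illustration, a minimal sketch of option 3 under the 4096-byte-block assumption; the helper name and file handling are placeholders, not a definitive implementation:

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

// Write one 4096-byte block at a given block index, with no shared seek position.
bool write_block_at(int fd, const char* block, std::int64_t block_index) {
    const off_t offset = static_cast<off_t>(block_index) * 4096;
    ssize_t written = pwrite(fd, block, 4096, offset);
    if (written != 4096)
        return false;
    return fsync(fd) == 0;   // force the data to disk, per the requirements above
}

// Usage sketch: int fd = open("file.dat", O_RDWR); write_block_at(fd, buffer, 12345); close(fd);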
You could also use the asynchronous IO from <aio.h> to handle the multithreading. It basically works with a thread-pool calling pwrite on most Unix implementations.
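A hedged sketch of what that might look like; the helper name, block size, and error handling are placeholder choices, and on Linux you may need to link with -lrt:

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Queue one 4096-byte block write asynchronously, then wait for it to complete.
ssize_t async_write_block(int fd, char* block, long long block_index) {
    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = block;
    cb.aio_nbytes = 4096;
    cb.aio_offset = static_cast<off_t>(block_index) * 4096;

    if (aio_write(&cb) != 0)               // returns immediately once queued
        return -1;

    const struct aiocb* list[1] = { &cb };
    aio_suspend(list, 1, nullptr);         // block until this request completes
    return aio_return(&cb);                // bytes written, or -1 on error
}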
In case of (any kind of) failure, it is acceptable if the block being updated is corrupted.
I understand this to mean that there should be no file corruption in any failure state. To the best of my knowledge, that is not possible when you overwrite data. When the system fails in the middle of a write command, there is no way to guarantee how many bytes were written, at least not in a file-system agnostic version.
What you can do instead is similar to a database transaction: You write the new content to a new location in the file. Then you do an fsync to ensure it is on disk. Then you overwrite a header to point to the new location. If you crash before the header is written, your crash recovery will see the old content. If the header gets written, you see the new content. However, I'm not an expert in this field. That final header update is a bit of a hand-wave.
In case of (any kind of) failure, blocks that are not being updated must not be corrupted.
Should be fine
If a file write was successful, we must be sure that the data is not merely buffered but has actually been written to disk (fsync)
Your sample code calls fsync, but forgets to call fflush before it. Alternatively, set the stream to unbuffered mode using setvbuf.
I can convert this large file into as many smaller files as needed (down to 4 KB files), but I prefer not to do that. Handling that many files is difficult and requires a lot of file-handle open/close operations, which has a negative impact on performance.
Many calls to fsync will kill your performance anyway. Short of reimplementing database transactions, the following pattern seems to be your best bet for maximum crash safety. It is well documented and understood:
Create a new temporary file on the same file system as the data you want to overwrite
Read-Copy-Update the old content to the new temporary file
Call fsync
Rename the new file to the old file
Renaming within a single file system is atomic. Therefore this procedure ensures that after a crash you either get the old data or the new data.
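A hedged sketch of that whole pattern; the helper name and the ".tmp" path are placeholders, and error handling is minimal:

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Atomically replace the contents of `path`, assuming the temporary file
// is created on the same file system.
bool replace_file(const std::string& path, const char* data, size_t len) {
    std::string tmp = path + ".tmp";                       // step 1: temp file
    int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = write(fd, data, len) == (ssize_t)len         // step 2: copy/update
           && fsync(fd) == 0;                              // step 3: force to disk
    close(fd);
    if (!ok) { unlink(tmp.c_str()); return false; }
    return std::rename(tmp.c_str(), path.c_str()) == 0;    // step 4: atomic rename
}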
If writing to one block of a large file fails, what is the chance of corrupting other blocks of data?
None.
Is it possible to replace one block of data with another file/block atomically? (like how the rename() system call replaces one file with another atomically, but at the block level; something like replacing the next-block address of the previous block in the file system, or whatever else).
No.

What happens if someone overwrites a file after I open it?

When I open a file in C, I get a file descriptor. If I have not yet read its contents and someone then modifies the file, will I read the old contents or the new ones?
Let's say a file has lots of lines. What happens if, while I am reading the file, someone edits the beginning? Will this somehow corrupt how my program reads the file?
How do programs avoid getting corrupted data while the file is being read? Is it the OS that takes care of this problem? If I can still read the old data, where is that data being stored?
The man page of open has some information about its internals, but it is not very clear to me.
The C language standard doesn't acknowledge the existence of other processes nor specify interaction between them and the program (nor does C++). The behaviour depends on the operating system and / or the file system.
Generally, it is safest to assume that file operations are not atomic and therefore accessing a file while another process is editing it would be an example of race condition. Some systems may provide some stricter guarantees.
A general approach to attempt avoiding problems is file locking. The standard C library does not have an API for file locking, but multitasking operating systems generally do.
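For illustration, a sketch of POSIX advisory locking with fcntl, which is an OS facility rather than part of the C or C++ standard; the path and the surrounding function are placeholders:

#include <fcntl.h>
#include <unistd.h>

// Take an advisory (voluntary) write lock on the whole file, do the work,
// then release it. Other processes are only affected if they also use locks.
void locked_update(const char* path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return;

    struct flock lk = {};
    lk.l_type   = F_WRLCK;     // exclusive (write) lock
    lk.l_whence = SEEK_SET;
    lk.l_start  = 0;
    lk.l_len    = 0;           // 0 means "to end of file"
    fcntl(fd, F_SETLKW, &lk);  // blocks until the lock is granted

    // ... read or modify the file ...

    lk.l_type = F_UNLCK;       // release the lock
    fcntl(fd, F_SETLK, &lk);
    close(fd);
}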
All this depends heavily on the OS; it is not handled at the C++ level. On Windows, for example, opening the file with CreateFile allows you to lock it against subsequent access, but that is not something the language itself provides.
You must decide based on the specific OS you work with. There are no safe assumptions; it all depends on the documentation you are given.
Generally, C++-level documentation is not much use for such problems, because there can never be a full standard for something as low-level as file access (even the filesystem library was only recently added to C++), and there is little point in trying to write "portable" code for it. You must make it a habit to immerse yourself in the OS-specific documentation and libraries.

With what API do you perform a read-consistent file operation in OS X, analogous to Windows Volume Shadow Service

We're writing a C++/Objective C app, runnable on OSX from versions 10.7 to present (10.11).
Under Windows, there is the concept of a shadow copy, which allows you to read a file as it existed at a certain point in time, without having to worry about other processes writing to that file in the interim.
However, I can't find any documentation or online articles discussing a similar feature in OS X. I know that OS X will not lock a file when it's being written to, so is it necessary to do something special to make sure I don't pick up a file that is in the middle of being modified?
Or does the journaled file system make any special handling unnecessary? I'm concerned that if one process is creating or modifying files (within the single context of, say, one fopen call - obviously I can't be guaranteed "completeness" if the writing process opens and closes a file repeatedly during what should be an atomic operation), a reading process will end up getting a "half-baked" file.
And if the journaled file system does guarantee that readers only see "whole" files, does this extend to FAT32 volumes that may be mounted as external drives?
A few things:
On Unix, once you open a file, if it is replaced (as opposed to modified), your file descriptor continues to access the file you opened, not its replacement.
Many apps will replace rather than modify files, using things like -[NSData writeToFile:atomically:] with YES for atomically:.
Cocoa and the other high-level frameworks do, in fact, lock files when they write to them, but that locking is advisory not mandatory, so other programs also have to opt in to the advisory locking system to be affected by that.
The modern approach is File Coordination. Again, this is a voluntary system that apps have to opt in to.
There is no feature quite like what you described on Windows. If the standard approaches aren't sufficient for your needs, you'll have to build something custom. For example, you could make a copy of the file that you're interested in and, after your copy is complete, compare it to the original to see if it was being modified as you were copying it. If the original has changed, you'll have to start over with a fresh copy operation (or give up). You can use File Coordination to at least minimize the possibility of contention from cooperating programs.
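For example, a rough sketch of such a consistency check, using the source file's size and modification time as a cheap stand-in for a full byte-by-byte comparison; the helper and the paths are hypothetical:

#include <sys/stat.h>

// Record the source file's metadata before copying, then re-check it afterwards.
// If either value changed, the copy may be "half-baked" and should be redone.
bool source_unchanged(const char* src_path, const struct stat& before) {
    struct stat after;
    if (stat(src_path, &after) != 0)
        return false;
    return before.st_mtime == after.st_mtime &&
           before.st_size  == after.st_size;
}

// Usage sketch:
//   struct stat before;  stat("source.dat", &before);
//   ... copy "source.dat" to a private location ...
//   if (!source_unchanged("source.dat", before)) { /* retry or give up */ }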

Is it necessary to close files after reading (only) in any programming language?

I read that a program should close files after writing to them in case there is still data in the write buffer not yet physically written to it. I also read that some languages such as Python automatically close all files that go out of scope, such as when the program ends.
But if I'm merely reading a file and not modifying it in any way, maybe except the OS changing its last-access date, is there ever a need to close it (even if the program never terminates, such as a daemon that monitors a log file)?
(Why is it necessary to close a file after using it? asks about file access in general, not only for reading.)
In general, you should always close a file after you are done using it.
Reason number 1: File descriptors are not unlimited
(or, on Windows, the conceptually similar HANDLEs).
Every time you open a file resource, even just for reading, you reduce the number of handles (or FDs) available to other processes.
Every time you close a handle, you release it and make it available to other processes.
Now consider the consequences of a loop that opens a file, reads it, but doesn't close it (see the sketch after the links below)...
http://en.wikipedia.org/wiki/File_descriptor
https://msdn.microsoft.com/en-us/library/windows/desktop/aa364225%28v=vs.85%29.aspx
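For example, a sketch of that leak next to the RAII alternative; the file name and iteration count are arbitrary:

#include <fcntl.h>
#include <unistd.h>
#include <fstream>
#include <string>

void leaky() {
    for (int i = 0; i < 100000; ++i) {
        int fd = open("log.txt", O_RDONLY);   // eventually fails with errno == EMFILE:
        if (fd < 0) break;                    // the process has run out of descriptors
        // ... read from fd ... but never close(fd)
    }
}

void raii() {
    for (int i = 0; i < 100000; ++i) {
        std::ifstream in("log.txt");          // closed automatically when `in`
        std::string line;                     // goes out of scope each iteration
        std::getline(in, line);
    }
}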
Reason number 2: If you are doing anything other than just reading a file, there can be race conditions when multiple processes or threads access the same file.
To avoid this, file locks may be used.
http://en.wikipedia.org/wiki/File_locking
If you are reading a file and not closing it afterward, other applications that try to obtain an exclusive file lock may be denied access.
Oh, and the file can't be deleted by anyone who doesn't have the rights to kill your process.
Reason number 3: There is absolutely no reason to leave a file unclosed, in any language. That is why Python helps lazy programmers and automatically closes a handle that goes out of scope, in case the programmer forgot.
Yes, it's better to close the file after reading is completed.
That's necessary because other software might request exclusive access to that file. If the file is still open, such a request will fail.
Not closing a file will result in unnecessary resources being taken from the system (file descriptors on Unix and handles on Windows). This matters especially when a bug occurs in some sort of loop, or when the system is never restarted. Some languages close unclosed files themselves when they, for example, go out of scope; others don't, or only do so at some unpredictable time (like the garbage collector in Java).
Imagine you have some sort of system that needs to run forever, for example a server. Then unclosed files can consume more and more resources, until ultimately all of them are used by unclosed files.
In order to read a file you have to open it, so regardless of what you do with a file, resources will be reserved for it. So far I have tried to explain the importance of closing a file for resource reasons; it's also important that you, as a programmer, know when an object (file) can be closed because it will no longer be needed. I think it's bad practice not to be at least aware of unclosed files, and good practice to close files once they are no longer required.
Some applications also require exclusive access to a file, meaning no other application may have the file open. For example, this happens when you try to empty your recycle bin or move a file that you still have open on Windows (this is referred to as file locking). While you still have the file open, Windows won't let you delete or move it. This is just an example of when it is annoying that a file is open when it shouldn't be; it happens to me daily.

Redirect cout to a file vs writing to a file directly in linux

For C++/linux programs, how does writing to cout (when cout has been redirected to a file during program launch) compare against writing to the target file directly? (via say fstream)
Does the system do the appropriate magic at the start of the program to make these two cases exactly equivalent, or is the latter going to be better than the former?
Thanks!
They are basically equivalent. In both cases, the underlying stream buffer will end up calling the write() system call, for the same effect.
Note however that by default, cout is synchronized to stdio, for backwards compatibility (so you can use C-style standard output as well as cout, and have it work as expected). This additional synchronization can slow down C++ output. If this is important, then you can use std::ios_base::sync_with_stdio(false) to unlink them. Then, a file-redirected cout and an fstream should have essentially identical performance and function.
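For instance, a minimal sketch of the redirected-cout case with the synchronization turned off; the program and file names are placeholders:

#include <iostream>

int main() {
    // If you only use C++ streams, drop the stdio synchronization.
    std::ios_base::sync_with_stdio(false);

    // Run as "./prog > out.txt": the stream buffer ends up issuing write()
    // calls on file descriptor 1, just as an std::ofstream("out.txt") would.
    for (int i = 0; i < 1000000; ++i)
        std::cout << i << '\n';
    return 0;
}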
The former fits better with the philosophy of UNIX tools, which is feeding one program with the output of another.
Let's say your program prints numbers and you need to sort them. You feed the sort tool with the output of your command and then write the result to a file, still using output redirection.
On the contrary, if you wrote directly to a file, you couldn't do that.
Of course, if you don't plan for your application to do this sort of thing, you can write directly to a file. But if I were you, I'd let the user decide, perhaps with a command-line argument.
I don't think the latter is necessarily "better" or "worse". It certainly requires far less code when you simply redirect cout/stdout from the shell, and it allows for simple text output (via printf/fprintf/cout).
I prefer using simple cout calls for quick and dirty logging and "printf" style debugging.
In my experience, I use syslog for things that absolutely have to be logged. For example, error cases where a file fails to open or you run out of resources or whatever.
I use printf/fprintf for the other "simple" logging tasks.
Some years ago, I developed a simple debugging system that now I just plug into my new Linux applications. I can then just call the appropriate functions in that code. It's similar to syslog in that it has debug "levels". For example, level 1 always writes to stdout, level 2 writes to stderr, level 4 writes to syslog, level 5 may create a new file and write messages to that, etc.
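As a rough sketch of that kind of level-routed logger (the level numbers mirror the description above; the function name and log file are placeholders):

#include <cstdio>
#include <syslog.h>

void debug_log(int level, const char* msg) {
    switch (level) {
    case 1: std::fprintf(stdout, "%s\n", msg); break;   // level 1: stdout
    case 2: std::fprintf(stderr, "%s\n", msg); break;   // level 2: stderr
    case 4: syslog(LOG_DEBUG, "%s", msg);      break;   // level 4: syslog
    case 5: {                                           // level 5: dedicated file
        if (std::FILE* f = std::fopen("debug.log", "a")) {
            std::fprintf(f, "%s\n", msg);
            std::fclose(f);
        }
        break;
    }
    default: break;
    }
}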
Yes. If the spawning process has arranged that file descriptor 1 (standard output) is the result of an open() call on a disk file, then I/O from the child process to that descriptor is exactly equivalent to the same I/O done to a file it had opened manually. The same is true for processes spawned with a pipe or socket (i.e. equivalent behavior to having opened their own pipe or socket). In fact, the kernel can't even tell the difference. All that matters is how the file descriptor was created (i.e. using what syscall, and on what filesystem), not which process did it.
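To illustrate, here is roughly what a shell-style redirection looks like at the system-call level before the child program is exec'ed; the program and file names are placeholders:

#include <fcntl.h>
#include <unistd.h>

int main() {
    // Roughly what a shell does for "./prog > out.txt" before exec'ing the child.
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;
    dup2(fd, STDOUT_FILENO);                     // fd 1 now refers to out.txt
    close(fd);
    execlp("./prog", "./prog", (char*)nullptr);  // the child just writes to fd 1
    return 1;                                    // only reached if exec fails
}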