Data written to disk callback - C++

How can I get callbacks once data has been successfully written to disk in Linux?
I would like to have my program's db file mapped into memory for read/write operations and receive callbacks once a write has successfully hit disk. Kind of like what the old VMS systems used to do.

You need to call fdatasync (or fsync if you really need the metadata to be synchronised as well) and wait for it to return.
You could do this from another thread, but if one thread writes to the file while another thread is calling fdatasync(), it won't be clear which of the writes are guaranteed to be persistent.
Databases that want to store transaction logs in a guaranteed-durable way need to call fdatasync.
Databases (such as InnoDB) typically use direct IO (as well as their own data caching, rather than relying on the OS cache) on their main data files, so that they know the data will be written in a predictable manner.
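For illustration, here is a minimal, hedged sketch of the write-then-fdatasync pattern; the file name and record contents are invented, and error handling is kept to a bare minimum:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Placeholder file name; error handling kept minimal.
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char record[] = "transaction committed\n";
        if (write(fd, record, sizeof record - 1) < 0) { perror("write"); return 1; }

        // Blocks until the file data (not necessarily all metadata) is on disk.
        if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

        // Only after fdatasync() returns is the record durable; this is the
        // point at which a database would acknowledge the transaction.
        close(fd);
        return 0;
    }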

As far as I know, you cannot get any notification when the actual synchronization of a file (or an mmapped region) to disk happens; not even the file's timestamps will change. You can, however, force the synchronization of the file (or region) by using fsync.
It is also hard to see why you would want that; file IO is supposed to be opaque.
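Since the question mentions a memory-mapped db file: for a mapping, the closely related msync() call forces write-back of the mapped pages and, with MS_SYNC, blocks until that completes. A rough sketch, assuming the file already exists and is at least one page long (file name and mapping size are made up):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main() {
        const size_t len = 4096;              // assumed mapping size
        int fd = open("settings.db", O_RDWR); // assumed existing file
        if (fd < 0) { perror("open"); return 1; }

        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        std::memcpy(p, "new value", 9);       // modify the mapped page

        // MS_SYNC blocks until the dirty pages have been written back; when it
        // returns, the change has reached the disk as far as the OS knows.
        if (msync(p, len, MS_SYNC) < 0) { perror("msync"); return 1; }

        munmap(p, len);
        close(fd);
        return 0;
    }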

Related

According to POSIX, when will my writes be visible to other processes?

If I open a file with O_CREAT | O_WRONLY and write to it, does POSIX say that 1) other apps can see the file in the directory (without an fsync) and 2) can see what I wrote?
Same questions after I do a close without an fsync.
And finally, will they once my program ends? I understand that fsync will confirm my write is on disk, but I don't need my file to be on disk; I need it to be visible to other processes.
Yes, other processes see your writes immediately. You do not need to close or fsync.
https://pubs.opengroup.org/onlinepubs/009695399/functions/write.html
Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. A similar requirement applies to multiple write operations to the same file position. This is needed to guarantee the propagation of data from write() calls to subsequent read() calls.
This means, for instance, that if the OS caches your write instead of actually writing it to disk, it needs to ensure that any other reads of that file are fulfilled from that same cache.
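To illustrate, a small sketch (error handling omitted): the parent writes without fsync or close, and a freshly forked child reading the same file sees the data immediately. The file name is a placeholder:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <cstdio>

    int main() {
        int fd = open("shared.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        write(fd, "hello\n", 6);              // no fsync, no close yet

        if (fork() == 0) {                    // child: a "different process"
            char buf[16] = {0};
            int rfd = open("shared.txt", O_RDONLY);
            ssize_t n = read(rfd, buf, sizeof buf - 1);
            printf("child read %zd bytes: %s", n, buf);   // prints "hello"
            close(rfd);
            _exit(0);
        }
        wait(nullptr);
        close(fd);
        return 0;
    }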

Is stat() atomic with respect to the file system

I have a collection of executables that regularly update a collection of files every couple of minutes, 24/7. I am thinking about writing a single monitoring program that will continuously check the last write time (using stat()) of all these files, so that if any have not been updated recently enough it can raise an alarm. My concern, though, is that the very act of calling stat() might cause a program that is attempting to write to that file to fail. Need I worry? And if so, is there an alternative way to achieve my goal?
Yes, a stat call can be thought of as atomic, in that all the information it returns is guaranteed to be consistent. If you call stat at the same instant some other process is writing to the file, there should be no possibility that, say, the other process's write is reflected in st_mtime but not st_size.
And in any case, there's certainly no possibility that calling stat at the same instant some other process is writing to the file could cause that other process to fail. (That would be a serious and quite unacceptable bug in the operating system -- one of an OS's main jobs is to ensure that unrelated processes can't accidentally interact with each other in such ways. This lack-of-interference property isn't usually what we mean by "atomic", though.)
With that said, though, the usual way to monitor a process is via its process ID. And there are probably plenty of prewritten packages out there to help you manage one or more processes that are supposed to run continuously, giving you clean start/stop and monitoring capabilities. (See s6 as an example. I know nothing about this package and am not recommending it; it's just the first one I came across in a web search.)
Another possibility, if you have any kind of IPC mechanism set up between your processes, is to set up a periodic heartbeat that each one publishes, so that a watchdog timer somewhere can detect a process dying.
If you want to keep monitoring your processes by the timeliness of the files they write, though, that sounds like a perfectly fine technique also.
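For reference, a rough sketch of the polling monitor described in the question, based on st_mtime; the paths, staleness threshold, and poll interval are made-up values:

    #include <sys/stat.h>
    #include <ctime>
    #include <cstdio>
    #include <unistd.h>
    #include <vector>
    #include <string>

    int main() {
        const std::vector<std::string> files = {"/var/data/a.dat", "/var/data/b.dat"};
        const time_t max_age = 5 * 60;        // alarm if older than 5 minutes

        for (;;) {
            time_t now = time(nullptr);
            for (const auto& f : files) {
                struct stat st;
                if (stat(f.c_str(), &st) != 0) {
                    fprintf(stderr, "cannot stat %s\n", f.c_str());
                } else if (now - st.st_mtime > max_age) {
                    fprintf(stderr, "ALARM: %s is stale\n", f.c_str());
                }
            }
            sleep(60);                        // poll once a minute
        }
    }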

Is using cat command to merge files created by multiple threads efficient?

I have a multi-threaded C++11 program in which each thread produces a large amount of data that needs to be written to disk. All the data need to be written to one file. At the moment, I use a mutex that protects access to the file from multiple threads. My friend suggested that I use one file per thread and then, at the end, merge the files into one with the cat command, invoked from the C++ code using system().
I'm thinking that if the cat command is going to read all the data back from the disk and then write it to the disk again, this time into a single file, it's not going to be any better. I have googled but couldn't find implementation details for the cat command. How does it work, and is it going to speed up the whole procedure?
Edit:
The chronology of events is not important, and there's no ordering constraint on the contents of the files. Both methods will do what I want.
You don't specify whether you have any ordering or structuring constraints on the content of the file. Generally that is the case, so I'll treat it as such, but the solution should work either way.
The classical programmatic approach
The idea is to offload the work of writing to disk to a dedicated IO thread, with a multiple-producer/single-consumer queue holding all the write commands. Each worker thread simply formats its output as a string and pushes it onto the queue. The IO thread pops batches of messages from the queue into a buffer and issues the write commands.
Alternatively, you could add a field to your messages indicating which worker emitted the write command, and have the IO thread write to different files if needed.
For better performance, it is also worth looking into asynchronous versions of the IO system primitives (read/write), if your host OS supports them. The IO thread would then be able to monitor several concurrent IO operations and feed the OS new ones as soon as one terminates.
As has been advised in the comments, you will have to monitor the IO thread for congestion and tune the number of workers accordingly. The "natural" feedback-based mechanism is simply to make the queue bounded, so that workers wait on the lock until space frees up. This lets you control the amount of produced data at any point in the process's life, which matters in memory-constrained scenarios.
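A rough C++11 sketch of this design, assuming a bounded queue of pre-formatted strings, a handful of worker threads, and a single IO thread writing everything to one std::ofstream (the capacity, worker count, and file name are arbitrary):

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    class BoundedQueue {
        std::queue<std::string> q_;
        std::mutex m_;
        std::condition_variable not_full_, not_empty_;
        const size_t cap_ = 1024;             // arbitrary bound (back-pressure)
        bool closed_ = false;
    public:
        void push(std::string s) {
            std::unique_lock<std::mutex> lk(m_);
            not_full_.wait(lk, [&] { return q_.size() < cap_; });
            q_.push(std::move(s));
            not_empty_.notify_one();
        }
        bool pop(std::string& out) {          // false once drained and closed
            std::unique_lock<std::mutex> lk(m_);
            not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return false;
            out = std::move(q_.front());
            q_.pop();
            not_full_.notify_one();
            return true;
        }
        void close() {
            std::lock_guard<std::mutex> lk(m_);
            closed_ = true;
            not_empty_.notify_all();
        }
    };

    int main() {
        BoundedQueue queue;
        std::thread io([&] {                  // the single consumer / IO thread
            std::ofstream out("combined.txt");
            std::string msg;
            while (queue.pop(msg)) out << msg;
        });

        std::vector<std::thread> workers;
        for (int w = 0; w < 4; ++w)
            workers.emplace_back([&, w] {
                for (int i = 0; i < 1000; ++i)
                    queue.push("worker " + std::to_string(w) +
                               " record " + std::to_string(i) + "\n");
            });

        for (auto& t : workers) t.join();
        queue.close();
        io.join();
    }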
Your cat concerns
As for cat, this command-line tool simply reads whatever is written to its input channel (usually stdin) and duplicates it to its output (stdout). It's as simple as that, and you can clearly see the similarity with the solution advocated above. The difference is that cat doesn't understand the file's internal structure (if any); it only deals with byte streams, which means that if several processes write concurrently to a cat input without synchronization, the resulting output would probably be completely mixed up. Another issue is the atomicity (or lack thereof) of IO primitives.
NB: On some systems, there's a neat little feature called a fork, which lets you have several "independent" streams of data multiplexed in a single file. If you happen to work on a platform supporting that feature, you could have all your data streams bundled in a single file, but separately reachable.

Safe access to file from two processes

Suppose I have two processes. One always resides in memory and periodically reads some settings from a file on disk. If it detects that the settings have changed, it applies them.
The other process is run from the command line on demand and modifies the settings. Thus the first process only reads the file and never writes to it, while the second only writes to it.
Should I synchronize access to the file to ensure that the first process always gets consistent settings, i.e. the contents before or after a modification, not some intermediate state? If so, what is the simplest way to do this in C++?
I'm mainly interested in cross-platform approaches, but I'm also curious about Windows- and/or Linux-specific ones.
Use a named semaphore and require either process to hold the semaphore before editing the file on disk. Named semaphores can be connected to by any running application.
Look at man 7 sem_overview for more information on named semaphores on Linux machines.
The closest equivalent for windows I can find is http://msdn.microsoft.com/en-us/library/windows/desktop/ms682438(v=vs.85).aspx
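For the Linux side, a minimal sketch of the named-semaphore idea; the semaphore name and the settings path are invented for illustration, and both programs would wrap their file access in the same way (depending on the platform you may need to link with -pthread):

    #include <semaphore.h>
    #include <fcntl.h>
    #include <cstdio>

    int main() {
        // Either process can create-or-open the same semaphore by name.
        sem_t* sem = sem_open("/settings_lock", O_CREAT, 0644, 1);
        if (sem == SEM_FAILED) { perror("sem_open"); return 1; }

        sem_wait(sem);                        // enter the critical section
        // ... read or rewrite the settings file here ...
        sem_post(sem);                        // leave the critical section

        sem_close(sem);
        return 0;
    }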
You are using C++ so your first option should be to check through the usual cross-platform libs - POCO, Boost, ACE, and so forth to see if there is anything that already does what you require.
You really have two separate issues: (1) file synchronization and (2) notification.
On Linux, to avoid having your daemon constantly poll to see whether the file has changed, you can use inotify calls and set up events that will tell you when the file has been changed by the command-line program. It might be simplest to look for IN_CLOSE_WRITE events, since the command-line program will presumably be opening, changing, and closing the file.
For synchronization, since you are in control of both programs, you can just use file or record locking e.g. lockf, flock or fcntl.
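Putting those two pieces together, here is a rough Linux-only sketch for the daemon side: block on inotify until the settings file is closed after a write, then read it under a shared flock() (the writer would take an exclusive lock). The directory, file name, and buffer size are illustrative:

    #include <sys/inotify.h>
    #include <sys/file.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int in = inotify_init();
        inotify_add_watch(in, "/etc/myapp", IN_CLOSE_WRITE);

        char events[4096];
        for (;;) {
            if (read(in, events, sizeof events) <= 0) break;   // wait for a change

            int fd = open("/etc/myapp/settings.conf", O_RDONLY);
            if (fd < 0) continue;
            flock(fd, LOCK_SH);               // the writer should take LOCK_EX
            // ... parse the settings file here ...
            flock(fd, LOCK_UN);
            close(fd);
        }
        return 0;
    }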
The most obvious solution is to open the file in exclusive mode. If the file cannot be opened, wait some time and try to open it again. This will prevent possible access/modification conflicts.
The benefit of this approach is that it's simple and doesn't have significant drawbacks.
Of course, you could use synchronization primitives (a mutex or semaphore, depending on the OS), but that would be overkill in your scenario, where a speedy response is not required (waiting 200 ms between open attempts is fine, and writing a config file won't take longer than that).

How do I concurrently download and convert a binary file using threads?

I have a program that downloads a binary file, from another PC.
I also have a another standalone program that can convert this binary file to a human readable CSV.
I would like to bring the conversion tool "into" the download tool, creating a thread in the download tool that kicks off the conversion code (so it can start converting while the download is still in progress, reducing the total time compared with downloading and then converting).
I believe I can successfully kick off another thread, but how do I synchronize the conversion thread with the main download?
i.e. the conversion catches up with the download, needs to wait for more data to be downloaded, then starts converting again, etc.
Is this similar to Synchronizing Execution of Multiple Threads? If so, does this mean the downloaded binary needs to be a resource accessed by semaphores?
Am I on the right path, or should I be pointed in another direction before I start?
Any advice is appreciated.
Thank You.
This is a classic case of the producer-consumer problem with the download thread as the producer and the conversion thread as the consumer.
Google around and you'll find an implementation for your language of choice. Here are some from MSDN: How to: Implement Various Producer-Consumer Patterns.
Instead of downloading to a file, you should write the downloaded data to a pipe. The convert thread can read from the pipe and then write the converted output to a file. That will automatically synchronize them.
If you need the original file as well as the converted one, just have the download thread write the data to the file then write the same data to the pipe.
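A hedged sketch of that pipe-based approach using pipe() and two threads; the "download" and "convert" steps are stand-ins for the real code:

    #include <unistd.h>
    #include <thread>
    #include <cstdio>

    int main() {
        int fds[2];
        if (pipe(fds) != 0) { perror("pipe"); return 1; }

        std::thread downloader([&] {
            char block[4096] = {};
            // In the real program, loop: receive a block from the network, then:
            write(fds[1], block, sizeof block);   // blocks if the converter falls behind
            close(fds[1]);                        // EOF tells the converter it is done
        });

        std::thread converter([&] {
            char block[4096];
            ssize_t n;
            while ((n = read(fds[0], block, sizeof block)) > 0) {
                // ... convert 'n' bytes and append CSV to the output file ...
            }
            close(fds[0]);
        });

        downloader.join();
        converter.join();
        return 0;
    }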
Yes, you undoubtedly need semaphores (or something similar such as an event or critical section) to protect access to the data.
My immediate reaction would be to think primarily in terms of a sequence of blocks though, not an entire file. Second, I almost never use a semaphore (or anything similar) directly. Instead, I would normally use a thread-safe queue, so when the network thread has read a block, it puts a structure into the queue saying where the data is and such. The processing thread waits for an item in the queue, and when one arrives it pops and processes that block.
When it finishes processing a block, it'll typically push the result onto another queue for the next stage of processing (e.g., writing to a file), and (quite possibly) put a descriptor for the processed block onto another queue, so the memory can be re-used for reading another block of input.
At least in my experience, this type of design eliminates a large percentage of thread synchronization issues.
Edit: I'm not sure about guidelines about how to design a thread-safe queue, but I've posted code for a simple one in a previous answer.
As far as design patterns go, I've seen this called at least "pipeline" and "production line" (though I'm not sure I've seen the latter in much literature).
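To make the block-queue idea concrete, here is a hedged sketch (not the answerer's original code): one thread-safe queue carries filled blocks from the reader to the processor, and a second queue returns empty buffers for reuse. The block size, block count, and the work done per block are arbitrary:

    #include <condition_variable>
    #include <cstddef>
    #include <memory>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Block {
        std::vector<char> data;
        std::size_t used = 0;                 // bytes actually filled
        bool last = false;                    // marks end of stream
    };

    template <typename T>
    class SafeQueue {                         // minimal thread-safe queue
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    int main() {
        SafeQueue<std::shared_ptr<Block>> filled, free_blocks;
        for (int i = 0; i < 8; ++i) {         // pre-allocate reusable buffers
            auto b = std::make_shared<Block>();
            b->data.resize(64 * 1024);
            free_blocks.push(b);
        }

        std::thread reader([&] {              // stands in for the download thread
            for (int i = 0; i < 100; ++i) {
                auto b = free_blocks.pop();   // reuse an empty buffer
                b->used = b->data.size();     // ... fill it from the network ...
                b->last = (i == 99);
                filled.push(b);
            }
        });

        std::thread processor([&] {           // stands in for the convert thread
            for (;;) {
                auto b = filled.pop();
                bool done = b->last;
                // ... process b->used bytes of b->data here ...
                free_blocks.push(b);          // hand the buffer back
                if (done) break;
            }
        });

        reader.join();
        processor.join();
    }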