How to ensure data is written to file - c++

I'm writing a logging program for a microcontroller running Linux. It also has a calculation function whose results must be stored on the HDD and loaded again when the logger is restarted.
My problem is that if I cut power to the µC while it is overwriting some data, the data being overwritten can be lost.
So how can I overwrite data while ensuring that either the old data or the newly written data remains consistent if power is cut during the overwrite?
The programming language is C++, so I would love a Boost library or, even better, an STL type.

Use stream << flush; to flush the C++ output buffer to the OS, and use Linux fsync() to flush from the OS buffer to disk.
The latter requires a Unix file descriptor, so you'll need to use an implementation-dependent method to get the FD from the C++ stream. See Retrieving file descriptor from a std::fstream
For additional protection you need to use a fault-resistant filesystem with journaling. See https://www.ibm.com/developerworks/library/l-journaling-filesystems/index.html for an example.
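A pattern often combined with fsync() to get the "either old or new" behaviour asked about is to write the new contents to a temporary file, sync it, and then rename it over the old file. This is not part of the answer above; it is a minimal sketch assuming a POSIX/Linux system where rename() within one filesystem replaces the target atomically. The helper names (write_all, replace_file_durably) and the ".tmp" suffix are made up for illustration.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>     // std::rename
#include <string>
#include <fcntl.h>    // open, O_* flags
#include <unistd.h>   // write, fsync, close, unlink

// Write all bytes, retrying on short writes and EINTR.
static bool write_all(int fd, const char* data, std::size_t len) {
    while (len > 0) {
        ssize_t n = ::write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        data += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}

// Hypothetical helper: replace dir/name with `content` so that after a power
// cut the file holds either the complete old or the complete new contents.
bool replace_file_durably(const std::string& dir, const std::string& name,
                          const std::string& content) {
    const std::string path = dir + "/" + name;
    const std::string tmp = path + ".tmp";   // assumed unused by anything else

    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = write_all(fd, content.data(), content.size())
              && ::fsync(fd) == 0;           // new data reaches the device first
    if (::close(fd) != 0) ok = false;
    if (!ok) { ::unlink(tmp.c_str()); return false; }

    // Swap the new file into place; the old file stays intact until this point.
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        ::unlink(tmp.c_str());
        return false;
    }

    // Sync the directory so the rename itself survives a power cut.
    int dfd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return false;
    ok = ::fsync(dfd) == 0;
    ::close(dfd);
    return ok;
}
```

A call might look like replace_file_durably("/var/lib/logger", "state.bin", data), where the directory and file name are placeholders. The disk's own write cache is outside the control of this code, so the journaling filesystem advice still applies.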

Related

Does fwrite block until data has been written to disk?

Does the fwrite() function return after the data to be written to disk has been handed over to the operating system or does it return only after the data is actually physically written to the disk?
For my case, I'm hoping that it's the first case since I don't want to wait until all the data is physically written to the disk. I'm hoping that another OS thread transfers it in the background.
I'm curious about behavior on Windows 10 in this particular case.
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fwrite
There are several places where data is buffered to improve efficiency when using fwrite(): buffering within the C runtime, buffering in the operating system's file system interface, and buffering within the actual disk hardware.
The default for each of these is to delay the actual physical writing of data to disk until there is an explicit request to flush buffers, unless the appropriate flags are set to perform physical writes as the write requests are made.
If you want to change the buffering behavior of fwrite(), take a look at the setbuf() function: see the setbuf() Linux man page and the Microsoft documentation on setbuf().
And if you look at the documentation for the underlying Windows CreateFile() function you will see there are a number of flags which include flags as to whether buffering of data should be done or not.
FILE_FLAG_NO_BUFFERING 0x20000000
The file or device is being opened with no system caching for data
reads and writes. This flag does not affect hard disk caching or
memory mapped files.
There are strict requirements for successfully working with files
opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for
details see File Buffering.
And see the Microsoft documentation topic File Buffering.
In a simple example, the application would open a file for write
access with the FILE_FLAG_NO_BUFFERING flag and then perform a call to
the WriteFile function using a data buffer defined within the
application. This local buffer is, in these circumstances, effectively
the only file buffer that exists for this operation. Because of
physical disk layout, file system storage layout, and system-level
file pointer position tracking, this write operation will fail unless
the locally-defined data buffers meet certain alignment criteria,
discussed in the following section.
Take a look at this discussion about settings at the OS level for what looks to be Linux https://superuser.com/questions/479379/how-long-can-file-system-writes-be-cached-with-ext4
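To illustrate the first layer of buffering mentioned above (the one inside the C runtime), here is a small sketch that switches a stream to unbuffered mode with setvbuf(), a sibling of setbuf(). It assumes a POSIX system for the stat() call, and the function name unbuffered_write_size is made up. It only removes the runtime's buffer; the OS and disk caches are unaffected.

```cpp
#include <cstddef>
#include <cstdio>
#include <sys/stat.h>   // stat (POSIX)

// Hypothetical helper: write `len` bytes through an unbuffered FILE* and
// return the file size the OS reports right after fwrite(), or -1 on error.
long long unbuffered_write_size(const char* path, const char* data, std::size_t len) {
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return -1;

    // _IONBF: no stdio buffer, so fwrite() hands the data straight to the OS.
    if (std::setvbuf(fp, nullptr, _IONBF, 0) != 0) {
        std::fclose(fp);
        return -1;
    }

    long long size = -1;
    if (std::fwrite(data, 1, len, fp) == len) {
        struct stat st;
        if (::stat(path, &st) == 0) size = static_cast<long long>(st.st_size);
    }
    std::fclose(fp);
    return size;
}
```

With the default (fully buffered) mode, the size reported at the same point would usually still be 0 for a small write, because the data would be sitting in the runtime's buffer.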
Does the fwrite(fp,...) function return after the data to be written to disk has been handed over to the operating system or does it return only after the data is actually physically written to the disk?
No. In fact, it does not even (necessarily) wait until data has been handed over to the OS -- fwrite may just put the data in its internal buffer and return immediately without actually writing anything.
To force data to the OS, you need to use fflush(fp) on the FILE pointer, but that still does not necessarily write data to the disk, though it will generally queue it for writing. But it does not wait for those queued writes to finish.
So to guarantee the data is written to disk, you need to do an OS level call to wait until the queued writes complete. On POSIX systems (such as Linux), that is fsync(fileno(fp)). You'll need to study the Windows documentation to figure out how to do the equivalent on Windows.
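The three steps described in this answer (fwrite, then fflush, then fsync) can be sketched as follows for POSIX systems. The function name write_and_sync is made up for illustration, and error handling is kept minimal.

```cpp
#include <cstddef>
#include <cstdio>
#include <unistd.h>   // fsync (POSIX)

// Hypothetical helper: write a buffer to `path` and return only after the
// OS reports that the data has been pushed to the storage device.
bool write_and_sync(const char* path, const void* data, std::size_t len) {
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return false;

    // Step 1: may only copy the data into the stdio buffer.
    bool ok = std::fwrite(data, 1, len, fp) == len;
    // Step 2: stdio buffer -> OS (page cache); queued, not yet on disk.
    ok = ok && std::fflush(fp) == 0;
    // Step 3: block until the OS has written the queued data out.
    ok = ok && ::fsync(fileno(fp)) == 0;

    if (std::fclose(fp) != 0) ok = false;
    return ok;
}
```

Only step 3 waits for the device, so a caller that does not want to block can stop after step 2 and accept that the data is still in the OS cache.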

What's the mechanism of iostreams with buffer?

To start with, cplusplus.com says that every stream object has an associated std::streambuf.
And C++ Primer (5th edition) says:
Each output stream manages a buffer, which it uses to hold the data that the program reads and writes. For example, when the following code is executed
os << "please enter a value: ";
the literal string might be printed immediately, or the operating system might store the data in a buffer to be printed later
There are several conditions that cause the buffer to be flushed (that is, to be written) to the actual output device or file: the program completes normally, a manipulator such as endl is used, etc.
In my understanding, the sentence "the operating system might store the data in a buffer" (a buffer, not the buffer) in the context above means that both the stream object and the OS use their own buffers: one in the process address space, the other in kernel space managed by the OS.
And here is my question:
Why does every process/object (like cout) manage its own buffer? Why not just make a system call and hand the data directly to the OS buffer?
Furthermore, does the term 'flushed' apply to the object's buffer or to the OS buffer? I guess that flushing actually makes a system call that tells the OS to immediately put the data in the OS buffer onto the screen.
Why does every process/object (like cout) manage its own buffer? Why not just make a system call and hand the data directly to the OS buffer?
As a bit of a pre-answer, you could always rewrite the stream buffer so that it always passes output (or input) straight to an OS system call. In fact, your system may already do this; it just depends on the implementation. The design allows buffering at the level of the iostreams library but, as far as I remember, doesn't require it.
As for buffering, it is not always most efficient to send or read data byte by byte. For cout and cin on many systems this may be better handled by the OS, but you could adapt iostreams to handle input and output streams that read from and write to sockets (I/O over internet connections). With sockets, you could send each individual character in its own packet to your target over the internet, but this could become really slow depending on the type of link and how busy it is. When you read from a socket, a message can be split across packets, so you need to buffer the input until you hit 'critical mass'. There are potentially ways to do this buffering at the OS level, but I found I could get much better performance by handling most of it myself (since message sizes usually had a large standard deviation over the runtime). So the buffering within iostreams was a useful way to manage input and output to optimize performance, and it especially helped when juggling I/O from multiple connections at the same time.
But you can't always assume the OS will do the right thing. I remember once we were using this FUSE module that allowed us to have a distributed file system across multiple computer nodes. It had a really weird problem when writing and reading single characters. Whereas reading or writing a long sequence of single characters would take at most seconds on a normal hard disk using an ext4 system, the same operation would take days on the FUSE system (ignoring for the moment why we did it this way in the first place). Through debugging, we found the hang was at the level of I/O, and reading and writing individual characters exacerbated this run-time problem. We had to re-write the code to buffer our reads and writes. The best we could figure out is that the OS on ext4 did its own buffering but this FUSE file system didn't do a similar buffering when reading and writing to the hard disk.
In any case, the OS may do its own buffering, but there are a number of cases where this buffering is non-existent or minimal. Buffering on the iostream end could help your performance.
Furthermore, does the term 'flushed' apply to the object's buffer or to the OS buffer? I guess that flushing actually makes a system call that tells the OS to immediately put the data in the OS buffer onto the screen.
I believe most texts talk about 'flushed' in terms of the standard I/O streams in C++, that is, the stream object's own buffer: flushing hands the buffered data over to the OS. Your program probably doesn't have direct control over how the OS handles its I/O from there. But in general I think the I/O of the OS and your program will be in sync for most systems.
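To make the distinction concrete, here is a small sketch that watches the file size the OS reports while writing through a std::ofstream. It assumes a POSIX system for stat(), and the names size_on_os, Sizes and observe_stream_buffering are made up. It compares a default stream, an explicit std::flush, and the standard std::unitbuf flag, which asks the stream to flush after every output operation.

```cpp
#include <fstream>
#include <ios>
#include <sys/stat.h>   // stat (POSIX)

// Size of the file as the OS currently sees it, or -1 on error.
long long size_on_os(const char* path) {
    struct stat st;
    return ::stat(path, &st) == 0 ? static_cast<long long>(st.st_size) : -1;
}

struct Sizes {
    long long buffered;       // after operator<< on a default stream
    long long after_flush;    // after an explicit std::flush
    long long unit_buffered;  // after operator<< with std::unitbuf set
};

// Hypothetical demo: path_a and path_b are two scratch files.
Sizes observe_stream_buffering(const char* path_a, const char* path_b) {
    Sizes s{};

    std::ofstream a(path_a, std::ios::trunc);
    a << "hello";                       // usually stays in the stream's own buffer
    s.buffered = size_on_os(path_a);    // often 0, but implementation-dependent
    a << std::flush;                    // stream buffer -> OS
    s.after_flush = size_on_os(path_a);

    std::ofstream b(path_b, std::ios::trunc);
    b << std::unitbuf << "hello";       // flushed to the OS after each insertion
    s.unit_buffered = size_on_os(path_b);

    return s;
}
```

In both flushed cases the data has only reached the OS; nothing here forces it onto the physical device.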

Flushing only file metadata

We're developing on a new ACID database system that focuses more on data integrity than throughput. Its storage engine accesses secondary storage devices directly with flags like O_DIRECT or FILE_FLAG_WRITE_THROUGH & FILE_FLAG_NO_BUFFERING.
In some cases we only change file metadata using kernel functions like fallocate() or SetFileValidData() - in these cases I would like to flush only the metadata and not all pending file I/O to leverage execution performance as the call blocks until the device reports that the transfer has completed - even if no file buffering is in use it still only applies to application data and the file system may still cache file metadata.
I've so far found that fsync() or FlushFileBuffers() flushes metadata, but unfortunately it also flushes all pending I/O. Anyone know of a way of only flushing the file metadata? This problem applies to Linux, UNIX, and Windows.
I am a newbie to file systems, but if you go through the implementation of any physical file system (ext4, ext3, etc.), you will find that they don't expose such functionality to the upper layers. Internally, however, the fsync() implementation only updates the file's metadata itself and delegates the remaining work to generic_block_fdatasync().
You might want to write a hack for your requirement of flushing only metadata.
Anyone know of a way of only flushing the file metadata?
No. Based on my understanding, no operating system provides such an interface/API. The file system provides two interfaces through which a user-mode application can control when data gets written to disk:
fsync: A call to fsync() ensures that all dirty data associated with the file referred to by the file descriptor fd is written back to disk. This call writes back both data and metadata.
fdatasync: This system call does the same thing as fsync(), except that it flushes only the data, plus any metadata needed to read that data back (such as the file size).
This means there is a way to do the opposite of what this question asks for. However, reading your question, it appears that you want this in order to get optimal performance together with data consistency. In my understanding we should not worry too much about execution performance here, as modern file systems implement "delayed write" and various other mechanisms to avoid unnecessary disk writes.
The main cost here is the switch between user mode and kernel mode, which is comparatively expensive. This might be the reason that kernel developers have not provided an interface that only updates the metadata of a particular file. It could also be due to limitations of the file system, and I guess there is little we can do here to gain more efficiency.
For complete information on the internal algorithms, you may want to refer to the classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.
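As a rough illustration of the metadata-changing case from the question, here is a sketch that grows a file with posix_fallocate() and then calls fsync(), which is the portable way to persist that change even though it also flushes pending data. The function name preallocate_and_sync is made up, and whether fdatasync() would be enough here depends on what counts as "needed" metadata on a given filesystem, so the sketch sticks to fsync().

```cpp
#include <fcntl.h>      // open, posix_fallocate
#include <sys/stat.h>   // fstat
#include <unistd.h>     // fsync, close

// Hypothetical helper: make sure `path` has at least `size` bytes allocated,
// sync the change, and return the resulting file size (or -1 on error).
off_t preallocate_and_sync(const char* path, off_t size) {
    int fd = ::open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;

    off_t result = -1;
    // posix_fallocate() returns an error number directly (0 on success).
    if (::posix_fallocate(fd, 0, size) == 0) {
        // fsync() writes back data and metadata; there is no
        // metadata-only variant among these calls.
        if (::fsync(fd) == 0) {
            struct stat st;
            if (::fstat(fd, &st) == 0) result = st.st_size;
        }
    }
    if (::close(fd) != 0) result = -1;
    return result;
}
```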

synchronized write operation in C

I am working on a smart camera that runs Linux. I capture images from the camera streaming software and write them to an SD card (attached to the camera). To write the individual JPEG images, I used the C functions fopen and fwrite. To synchronize the disk write, I use fflush(pointer) to flush the buffers and write the data to the SD card. But it seems to have no effect: the write operation uses system memory, and free memory decreases after every write. I also tried the low-level open and write functions together with fsync(filedesc), but that had no effect either.
The buffers are only flushed when I unmount the SD card, and only then is the memory freed. How can I disable this write caching so that data goes to the SD card instead? Or how can I force the data to be written to the SD card immediately instead of being held in system memory?
sync(2) is probably your best bet:
SYNC(2) Linux Programmer's Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
BUGS
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually wait. (This still
does not guarantee data integrity: modern disks have large caches.)
You can set the O_SYNC flag if you open the file using open(), or use sync() as suggested above.
With fopen(), you can use fsync() on the descriptor returned by fileno(), or use a combination of fileno() and ioctl() to set options on the descriptor.
For more details see this very similar post: How can you flush a write using a file descriptor?
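A minimal sketch of the O_SYNC variant, assuming a POSIX system; the function name write_with_o_sync is made up. With O_SYNC set, each write() returns only after the OS reports the data as written to the device, so no separate fsync() call is needed, at the cost of slower writes.

```cpp
#include <cstddef>
#include <fcntl.h>    // open, O_SYNC
#include <unistd.h>   // write, close

// Hypothetical helper: write `len` bytes to `path` using synchronous writes.
bool write_with_o_sync(const char* path, const char* data, std::size_t len) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0) return false;

    bool ok = true;
    while (len > 0) {
        ssize_t n = ::write(fd, data, len);   // blocks until the data is written out
        if (n < 0) { ok = false; break; }     // EINTR handling omitted for brevity
        data += n;
        len -= static_cast<std::size_t>(n);
    }
    if (::close(fd) != 0) ok = false;
    return ok;
}
```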
Check out fsync(2) when working with specific files.
There may be nothing that you can really do. Many file systems are heavily cached in memory so a write to a file may not immediately be written to disk. The only way to guarantee a write in this scenario is to actually unmount the drive.
When mounting the disk, you might want to specify the sync option (either using the -o flag of mount or on your fstab line). This ensures that at least your writes are performed synchronously. This is what you should always use for removable media.
Just because it's still taking up memory doesn't mean it hasn't also been written out to storage - a clean (identical to the copy on physical storage) copy of the data will stay in the page cache until that memory is needed for something else, in case an application later reads that data back.
Note that fflush() doesn't ensure the data has been written to storage - if you are using stdio, you must first use fflush(f), then fsync(fileno(f)).
If you know that you will not need to read that data again in the foreseeable future (as seems likely for this case), you can use posix_fadvise() with the POSIX_FADV_DONTNEED flag before closing the file.
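Putting the last three paragraphs together, a sketch for the camera case might look like this, assuming Linux/POSIX. The function name save_image_and_drop_cache is made up, and posix_fadvise() is only a hint, so its result is deliberately ignored.

```cpp
#include <cstddef>
#include <cstdio>
#include <fcntl.h>    // posix_fadvise, POSIX_FADV_DONTNEED
#include <unistd.h>   // fsync

// Hypothetical helper: write one image, force it to storage, then tell the
// kernel that the cached copy will not be read again.
bool save_image_and_drop_cache(const char* path, const unsigned char* data,
                               std::size_t len) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;

    bool ok = std::fwrite(data, 1, len, f) == len;
    ok = ok && std::fflush(f) == 0;        // stdio buffer -> page cache
    const int fd = fileno(f);
    ok = ok && ::fsync(fd) == 0;           // page cache -> SD card

    if (ok) {
        // Offset 0 and length 0 mean "the whole file". The pages are clean
        // after fsync(), so the kernel is free to drop them.
        (void)::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }
    if (std::fclose(f) != 0) ok = false;
    return ok;
}
```

The order matters: the advice is given after fsync() because the kernel can only discard pages that have already been written out.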

How does the Linux buffer cache behave when an application crashes?

Let's say I am using C++ file streams asynchronously, by which I mean never using std::flush or std::endl. My application writes a lot of data to a file and then abruptly crashes.
Is the data remaining in the cache system flushed to the disk, or discarded (and lost)?
Complicating this problem is that there are multiple 'caches' in play.
C++ streams have their own internal buffering mechanism. Streams don't ask the OS to write to disk until either (a) you've sent enough data into the buffer that the streams library thinks the write wouldn't be wasted, (b) you ask for a flush specifically, or (c) the stream is in line-buffering mode and you've sent along the endl. Any data in these buffers is lost when the program crashes.
The OS will buffer writes to make best use of the limited amount of disk IO available. Writes will typically be flushed within five to thirty seconds; sooner if the programmer (or libraries) calls fdatasync(2) or fsync(2) or sync(2) (which asks for all dirty data to be flushed). Any data in the OS buffers are written to disk (eventually) when the program crashes, lost if the kernel crashes.
The hard drive will buffer writes to try to make the best use of its slow head, rotational latency, etc. Data arrives in this buffer when the OS flushes its caches. Data in these buffers are written to disk when the program crashes, will probably be written to disk if the kernel crashes, and might be written to disk if the power is suddenly removed from the drive. (Some have enough power to continue writing their buffers, typically this would take less than a second anyway.)
The stuff in the library buffer (which you flush with std::flush or such) is lost, the data in the OS kernel buffers (which you can flush e.g. with fsync()) is not lost unless the OS itself crashes.
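The difference between the library buffer and the OS buffer can be observed with a small experiment, sketched here under the assumption of a POSIX system. A child process writes through a std::ofstream and then "crashes" via _exit(), which skips destructors and therefore never flushes the stream. All names (file_size, write_then_die, CrashResult, simulate_crash) are made up for illustration.

```cpp
#include <fstream>
#include <sys/stat.h>   // stat
#include <sys/wait.h>   // waitpid
#include <unistd.h>     // fork, _exit

// Size of the file as the OS sees it, or -1 if it cannot be read.
long long file_size(const char* path) {
    struct stat st;
    return ::stat(path, &st) == 0 ? static_cast<long long>(st.st_size) : -1;
}

// Child process writes "hello", optionally flushes, then dies without cleanup.
static void write_then_die(const char* path, bool flush_first) {
    pid_t pid = ::fork();
    if (pid == 0) {
        std::ofstream out(path, std::ios::trunc);
        out << "hello";
        if (flush_first) out << std::flush;   // library buffer -> OS
        ::_exit(1);   // no destructors run, similar to an abrupt crash
    }
    int status = 0;
    if (pid > 0) ::waitpid(pid, &status, 0);
}

struct CrashResult {
    long long unflushed;   // size after a crash without flushing
    long long flushed;     // size after a crash that followed std::flush
};

CrashResult simulate_crash(const char* path_a, const char* path_b) {
    write_then_die(path_a, false);
    write_then_die(path_b, true);
    return { file_size(path_a), file_size(path_b) };
}
```

On typical implementations the unflushed file ends up empty, while the flushed one contains the data even though the process died, because the OS still holds it and writes it out later.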