Flushing only file metadata - c++

We're developing on a new ACID database system that focuses more on data integrity than throughput. Its storage engine accesses secondary storage devices directly with flags like O_DIRECT or FILE_FLAG_WRITE_THROUGH & FILE_FLAG_NO_BUFFERING.
In some cases we only change file metadata using kernel functions like fallocate() or SetFileValidData() - in these cases I would like to flush only the metadata and not all pending file I/O to leverage execution performance as the call blocks until the device reports that the transfer has completed - even if no file buffering is in use it still only applies to application data and the file system may still cache file metadata.
I've so far found that fsync() or FlushFileBuffers() flushes metadata, but unfortunately it also flushes all pending I/O. Anyone know of a way of only flushing the file metadata? This problem applies to Linux, UNIX, and Windows.

I am a newbie to FS. But when you go through implementation of any physical FS (ext4/ext3/etc) they haven't exposed such functionality to upper layer. But internally in fsyc() implementation they only update metadata of the file and remaining task is delegated to generic_block_fdatasync().
You might want to write a hack for your requirement of flushing only metadata.

Anyone know of a way of only flushing the file metadata?
No, Based on my understanding, there is no interface/API provided by any operating system. There are two types of the interfaces provided by FileSystem through which application(User mode) program can control when data gets written/saved to disk.
fsync: A call to fsync( ) ensures that all dirty data associated with the file mapped by the file descriptor fd is written back to disk. This call writes back both data and metadata.
fdatasync: This system call does the same thing as fsync( ), except that it only flushes data.
This means there is a way to perform something opposite to the task mentioned in this question. However while reading your question,it appears to me that you want to achieve this to get optimal performance and data consistency. With my understanding we should not think much about the execution performance as modern FileSystem implements the "delayed write" and various other mechanism to avoid unnecessary disk writes.
The main intention over here is to switch between User Mode and Kernel Mode as it is more expensive compared to anything else. This might be reason that kernel developer has not provided such interface which can only be used to update the meta data of that particular file. This could be due to limitation of the FileSystem and I guess here we can do little to achieve more efficiency.
For complete information on internal algorithm you may want to refer the great great classic book "The Design Of UNIX Operating System" By Maurice J Bach which describes these concepts and the implementation in detailed way.

Related

High performance ways to stream local files as they're being written to network

Today a system exists that will write packet-capture files to the local disk as they come in. Dropping these files to local disk as the first step is deemed desirable for fault-tolerance reasons. If a client dies and needs to reconnect or be brought up somewhere else, we enjoy the ability to replay from the disk.
The next step in the data pipeline is trying to get this data that was landed to disk out to remote clients. Assuming sufficient disk space, it strikes me as very convenient to use the local disk (and the page-cache on top of it) as a persistent boundless-FIFO. It is also desirable to use the file system to keep the coupling between the producer and consumer low.
In my research, I have not found a lot of guidance on this type of architecture. More specifically, I have not seen well-established patterns in popular open-source libraries/frameworks for reading the file as it is being written to stream out.
My questions:
Is there a flaw in this architecture that I am not noting or indirectly downplaying?
Are there recommendations for consuming a file as it is being written, and efficiently blocking and/or asynchronously being notified when more data is available in the file?
A goal would be to either explicitly or implicitly have the consumer benefit from page-cache warmth. Are there any recommendations on how to optimize for this?
The file-based solution sounds clunky but could work. Similarly to how tail -f does it:
read the file until EOF, but not close it
setup an inode watch (with inotify), waiting for more writes
repeat
The difficulty is usually with file rotation and cleanup, i.e. you need to watch for new files and/or truncation.
Having said that, it might be more efficient to connect to the packet-capture interface directly, or setup a queue to which clients can subscribe.

Does fwrite block until data has been written to disk?

Does the fwrite() function return after the data to be written to disk has been handed over to the operating system or does it return only after the data is actually physically written to the disk?
For my case, I'm hoping that it's the first case since I don't want to wait until all the data is physically written to the disk. I'm hoping that another OS thread transfers it in the background.
I'm curious about behavior on Windows 10 in this particular case.
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fwrite
There are several places where there is buffering of data in order to improve efficiency when using fwrite(): buffering within the C++ Runtime and buffering in the operating system file system interface and buffering within the actual disk hardware.
The default for these are to delay the actual physical writing of data to disk until there is an actual request to flush buffers or if appropriate indicators are turned on to perform physical writes as the write requests are made.
If you want to change the behavior of fwrite() take a look at the setbuf() function setbuf redirection as well as setbuff() Linux man page and here is the Microsoft documentation on setbuf().
And if you look at the documentation for the underlying Windows CreateFile() function you will see there are a number of flags which include flags as to whether buffering of data should be done or not.
FILE_FLAG_NO_BUFFERING 0x20000000
The file or device is being opened with no system caching for data
reads and writes. This flag does not affect hard disk caching or
memory mapped files.
There are strict requirements for successfully working with files
opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for
details see File Buffering.
And see the Microsoft documentation topic File Buffering.
In a simple example, the application would open a file for write
access with the FILE_FLAG_NO_BUFFERING flag and then perform a call to
the WriteFile function using a data buffer defined within the
application. This local buffer is, in these circumstances, effectively
the only file buffer that exists for this operation. Because of
physical disk layout, file system storage layout, and system-level
file pointer position tracking, this write operation will fail unless
the locally-defined data buffers meet certain alignment criteria,
discussed in the following section.
Take a look at this discussion about settings at the OS level for what looks to be Linux https://superuser.com/questions/479379/how-long-can-file-system-writes-be-cached-with-ext4
Does the fwrite(fp,...) function return after the data to be written to disk has been handed over to the operating system or does it return only after the data is actually physically written to the disk?
No. In fact, it does not even (necessarily) wait until data has been handed over to the OS -- fwrite may just put the data in its internal buffer and return immediately without actually writing anything.
To force data to the OS, you need to use fflush(fp) on the FILE pointer, but that still does not necessarily write data to the disk, though it will generally queue it for writing. But it does not wait for those queued writes to finish.
So to guarentee the data is written to disk, you need to do an OS level call to wait until the queued writes complete. On POSIX systems (such as Linux), that is fsync(fileno(fp)). You'll need to study the Windows documentation to figure out how to do the equivalent on Windows.

synchronized write operation in C

I am working on a smart camera that runs linux. I capture images from the camera streaming software and writes the images on a SD card (attached with the camera). For writing the individual JPEG images, I used fopen and fwrite C functions. For synchronizing the disk write operation, I use fflulsh(pointer) to flush the buffers and write the data on the SD card. But it seems it has no effect as the write operation uses system memory and the memory gets decreased after every write operation. I also used low-level open and write functions in conjunction with fsync (filedesc), but it also has no effect.
The flushing of buffers take place only when I dismount the SD card and then the memory is freed. How can I disable this cache write instead of SD card write? or how can I force the data to be written on the SD card at the same time instead of using the system memory?
sync(2) is probably your best bet:
SYNC(2) Linux Programmer's Manual SYNC(2)
NAME
sync - commit buffer cache to disk
SYNOPSIS
#include <unistd.h>
void sync(void);
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
BUGS
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually wait. (This still
does not guarantee data integrity: modern disks have large caches.)
You can set the O_SYNC if you open the file using open(), or use sync() as suggested above.
With fopen(), you can use fsync(), or use a combination of fileno() and ioctl() to set options on the descriptor.
For more details see this very similar post: How can you flush a write using a file descriptor?
Check out fsync(2) when working with specific files.
There may be nothing that you can really do. Many file systems are heavily cached in memory so a write to a file may not immediately be written to disk. The only way to guarantee a write in this scenario is to actually unmount the drive.
When mounting the disk, you might want to specify the sync option (either using the -oflag in mount or on your fstab line. This will ensure that at least your writes are written synchronously. This is what you should always use for removable media.
Just because it's still taking up memory doesn't mean it hasn't also been written out to storage - a clean (identical to the copy on physical storage) copy of the data will stay in the page cache until that memory is needed for something else, in case an application later reads that data back.
Note that fflush() doesn't ensure the data has been written to storage - if you are using stdio, you must first use fflush(f), then fsync(fileno(f)).
If you know that you will not need to read that data again in the forseeable future (as seems likely for this case), you can use posix_fadvise() with the POSIX_FADV_DONTNEED flag before closing the file.

Is it safe to parse a /proc/ file?

I want to parse /proc/net/tcp/, but is it safe?
How should I open and read files from /proc/ and not be afraid, that some other process (or the OS itself) will be changing it in the same time?
In general, no. (So most of the answers here are wrong.) It might be safe, depending on what property you want. But it's easy to end up with bugs in your code if you assume too much about the consistency of a file in /proc. For example, see this bug which came from assuming that /proc/mounts was a consistent snapshot.
For example:
/proc/uptime is totally atomic, as someone mentioned in another answer -- but only since Linux 2.6.30, which is less than two years old. So even this tiny, trivial file was subject to a race condition until then, and still is in most enterprise kernels. See fs/proc/uptime.c for the current source, or the commit that made it atomic. On a pre-2.6.30 kernel, you can open the file, read a bit of it, then if you later come back and read again, the piece you get will be inconsistent with the first piece. (I just demonstrated this -- try it yourself for fun.)
/proc/mounts is atomic within a single read system call. So if you read the whole file all at once, you get a single consistent snapshot of the mount points on the system. However, if you use several read system calls -- and if the file is big, this is exactly what will happen if you use normal I/O libraries and don't pay special attention to this issue -- you will be subject to a race condition. Not only will you not get a consistent snapshot, but mount points which were present before you started and never stopped being present might go missing in what you see. To see that it's atomic for one read(), look at m_start() in fs/namespace.c and see it grab a semaphore that guards the list of mountpoints, which it keeps until m_stop(), which is called when the read() is done. To see what can go wrong, see this bug from last year (same one I linked above) in otherwise high-quality software that blithely read /proc/mounts.
/proc/net/tcp, which is the one you're actually asking about, is even less consistent than that. It's atomic only within each row of the table. To see this, look at listening_get_next() in net/ipv4/tcp_ipv4.c and established_get_next() just below in the same file, and see the locks they take out on each entry in turn. I don't have repro code handy to demonstrate the lack of consistency from row to row, but there are no locks there (or anything else) that would make it consistent. Which makes sense if you think about it -- networking is often a super-busy part of the system, so it's not worth the overhead to present a consistent view in this diagnostic tool.
The other piece that keeps /proc/net/tcp atomic within each row is the buffering in seq_read(), which you can read in fs/seq_file.c. This ensures that once you read() part of one row, the text of the whole row is kept in a buffer so that the next read() will get the rest of that row before starting a new one. The same mechanism is used in /proc/mounts to keep each row atomic even if you do multiple read() calls, and it's also the mechanism that /proc/uptime in newer kernels uses to stay atomic. That mechanism does not buffer the whole file, because the kernel is cautious about memory use.
Most files in /proc will be at least as consistent as /proc/net/tcp, with each row a consistent picture of one entry in whatever information they're providing, because most of them use the same seq_file abstraction. As the /proc/uptime example illustrates, though, some files were still being migrated to use seq_file as recently as 2009; I bet there are still some that use older mechanisms and don't have even that level of atomicity. These caveats are rarely documented. For a given file, your only guarantee is to read the source.
In the case of /proc/net/tcp, you can read it and parse each line without fear. But if you try to draw any conclusions from multiple lines at once -- beware, other processes and the kernel are changing it while you read it, and you are probably creating a bug.
Although the files in /proc appear as regular files in userspace, they are not really files but rather entities that support the standard file operations from userspace (open, read, close). Note that this is quite different than having an ordinary file on disk that is being changed by the kernel.
All the kernel does is print its internal state into its own memory using a sprintf-like function, and that memory is copied into userspace whenever you issue a read(2) system call.
The kernel handles these calls in an entirely different way than for regular files, which could mean that the entire snapshot of the data you will read could be ready at the time you open(2) it, while the kernel makes sure that concurrent calls are consistent and atomic. I haven't read that anywhere, but it doesn't really make sense to be otherwise.
My advice is to take a look at the implementation of a proc file in your particular Unix flavour. This is really an implementation issue (as is the format and the contents of the output) that is not governed by a standard.
The simplest example would be the implementation of the uptime proc file in Linux. Note how the entire buffer is produced in the callback function supplied to single_open.
/proc is a virtual file system : in fact, it just gives a convenient view of the kernel internals. It's definitely safe to read it (that's why it's here) but it's risky on the long term, as the internal of these virtual files may evolve with newer version of kernel.
EDIT
More information available in proc documentation in Linux kernel doc, chapter 1.4 Networking
I can't find if the information how the information evolve over time. I thought it was frozen on open, but can't have a definite answer.
EDIT2
According to Sco doc (not linux, but I'm pretty sure all flavours of *nix behave like that)
Although process state and
consequently the contents of /proc
files can change from instant to
instant, a single read(2) of a /proc
file is guaranteed to return a
``sane'' representation of state, that
is, the read will be an atomic
snapshot of the state of the process.
No such guarantee applies to
successive reads applied to a /proc
file for a running process. In
addition, atomicity is specifically
not guaranteed for any I/O applied to
the as (address-space) file; the
contents of any process's address
space might be concurrently modified
by an LWP of that process or any other
process in the system.
The procfs API in the Linux kernel provides an interface to make sure that reads return consistent data. Read the comments in __proc_file_read. Item 1) in the big comment block explains this interface.
That being said, it is of course up to the implementation of a specific proc file to use this interface correctly to make sure its returned data is consistent. So, to answer your question: no, the kernel does not guarantee consistency of the proc files during a read but it provides the means for the implementations of those files to provide consistency.
I have the source for Linux 2.6.27.8 handy since I'm doing driver development at the moment on an embedded ARM target.
The file ...linux-2.6.27.8-lpc32xx/net/ipv4/raw.c at line 934 contains, for example
seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
i, src, srcp, dest, destp, sp->sk_state,
atomic_read(&sp->sk_wmem_alloc),
atomic_read(&sp->sk_rmem_alloc),
0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
atomic_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops));
which outputs
[wally#zenetfedora ~]$ cat /proc/net/tcp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 017AA8C0:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 15160 1 f552de00 299
1: 00000000:C775 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 13237 1 f552ca00 299
...
in function raw_sock_seq_show() which is part of a hierarchy of procfs handling functions. The text is not generated until a read() request is made of the /proc/net/tcp file, a reasonable mechanism since procfs reads are surely much less common than updating the information.
Some drivers (such as mine) implement the proc_read function with a single sprintf(). The extra complication in the core drivers implementation is to handle potentially very long output which may not fit in the intermediate, kernel-space buffer during a single read.
I tested that with a program using a 64K read buffer but it results in a kernel space buffer of 3072 bytes in my system for proc_read to return data. Multiple calls with advancing pointers are needed to get more than that much text returned. I don't know what the right way to make the returned data consistent when more than one i/o is needed. Certainly each entry in /proc/net/tcp is self-consistent. There is some likelihood that lines side-by-side are snapshot at different times.
Short of unknown bugs, there are no race conditions in /proc that would lead to reading corrupted data or a mix of old and new data. In this sense, it's safe. However there's still the race condition that much of the data you read from /proc is potentially-outdated as soon as it's generated, and even moreso by the time you get to reading/processing it. For instance processes can die at any time and a new process can be assigned the same pid; the only process ids you can ever use without race conditions are your own child processes'. Same goes for network information (open ports, etc.) and really most of the information in /proc. I would consider it bad and dangerous practice to rely on any data in /proc being accurate, except data about your own process and potentially its child processes. Of course it may still be useful to present other information from /proc to the user/admin for informative/logging/etc. purposes.
When you read from a /proc file, the kernel is calling a function which has been registered in advance to be the "read" function for that proc file. See the __proc_file_read function in fs/proc/generic.c .
Therefore, the safety of the proc read is only as safe as the function the kernel calls to satisfy the read request. If that function properly locks all data it touches and returns to you in a buffer, then it is completely safe to read using that function. Since proc files like the one used for satisfying read requests to /proc/net/tcp have been around for a while and have undergone scrupulous review, they are about as safe as you could ask for. In fact, many common Linux utilities rely on reading from the proc filesystem and formatting the output in a different way. (Off the top of my head, I think 'ps' and 'netstat' do this).
As always, you don't have to take my word for it; you can look at the source to calm your fears. The following documentation from proc_net_tcp.txt tells you where the "read" functions for /proc/net/tcp live, so you can look at the actual code that is run when you read from that proc file and verify for yourself that there are no locking hazards.
This document describes the interfaces
/proc/net/tcp and /proc/net/tcp6.
Note that these interfaces are
deprecated in favor of tcp_diag.
These /proc interfaces provide information about currently active TCP
connections, and are implemented by
tcp4_seq_show() in net/ipv4/tcp_ipv4.c
and tcp6_seq_show() in
net/ipv6/tcp_ipv6.c, respectively.

Force ofstream file flush on Windows

I'm using an ofstream to write data to a file. I regularly call flush on the file but the backing file doesn't always get updated at that time. I assume this is related to an OS-level cache, or something inside the MSVC libraries.
I need a way to have the data properly flush at that point. Preferably written to disc, but at least enough such that a copy operation from another program would see all data up to the flush point.
What API can I use to do this?
FlushFileBuffers will flush the Windows write file cache and write it to a file. Be aware it can be very slow if called repeatedly.
I also found this KB article which describes the use of _commit(). This might be more useful to you since you are using ofstream.
CXXFileBuf.flush();
_commit(CXXFileBuf.rdbuf()->fd());
I used:
MyOfstreamObject.rdbuf()->pubsync();
I'm using stl_port on Win 7 with ICC 9.1.
I have not tested the solution extensively but it seems to work... Maybe it could solve the problem of the absence of fd() noticed by edA-qa mort-ora-y .
Just add commode.obj to Linker->Input->Additional Dependencies in the project's Property Pages in Visual Studio and call std::ostream::flush(). That way std::ostream's flush will link against another method which has the desired behavior. That's what helped to me.
If this is a windows-only solution, you might want to use FlushFileBuffers(). This means you will have to re-write some of your code to accomodate calls to CreateFile(), WriteFile(), etc. If your application depends on many different operator<< functions, you can write your own std::streambuf.
You also might want to read the remarks section carefully. In particular,
Due to disk caching interactions within the system, the FlushFileBuffers function can be inefficient when used after every write to a disk drive device when many writes are being performed separately. If an application is performing multiple writes to disk and also needs to ensure critical data is written to persistent media, the application should use unbuffered I/O instead of frequently calling FlushFileBuffers.