Is it safe to parse a /proc/ file? - c++

I want to parse /proc/net/tcp/, but is it safe?
How should I open and read files from /proc/ without being afraid that some other process (or the OS itself) will be changing them at the same time?

In general, no. (So most of the answers here are wrong.) It might be safe, depending on what property you want. But it's easy to end up with bugs in your code if you assume too much about the consistency of a file in /proc. For example, see this bug which came from assuming that /proc/mounts was a consistent snapshot.
For example:
/proc/uptime is totally atomic, as someone mentioned in another answer -- but only since Linux 2.6.30, which is less than two years old. So even this tiny, trivial file was subject to a race condition until then, and still is in most enterprise kernels. See fs/proc/uptime.c for the current source, or the commit that made it atomic. On a pre-2.6.30 kernel, you can open the file, read a bit of it, then if you later come back and read again, the piece you get will be inconsistent with the first piece. (I just demonstrated this -- try it yourself for fun.)
/proc/mounts is atomic within a single read system call. So if you read the whole file all at once, you get a single consistent snapshot of the mount points on the system. However, if you use several read system calls -- and if the file is big, this is exactly what will happen if you use normal I/O libraries and don't pay special attention to this issue -- you will be subject to a race condition. Not only will you not get a consistent snapshot, but mount points which were present before you started and never stopped being present might go missing in what you see. To see that it's atomic for one read(), look at m_start() in fs/namespace.c and see it grab a semaphore that guards the list of mountpoints, which it keeps until m_stop(), which is called when the read() is done. To see what can go wrong, see this bug from last year (same one I linked above) in otherwise high-quality software that blithely read /proc/mounts.
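For illustration, here is a minimal sketch (mine, not the bug report's code) of taking a consistent snapshot of /proc/mounts with a single read() call; it assumes the whole file fits in the buffer you pass, which a real program should verify:

#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

// Read an entire /proc file with one read() syscall, so that for files
// which are atomic per read() (like /proc/mounts) the result is a single
// consistent snapshot. Assumes the whole file fits in bufsize bytes; a
// real program should detect truncation and retry with a larger buffer.
std::string read_proc_once(const char* path, std::size_t bufsize = 1 << 20) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    std::string buf(bufsize, '\0');
    ssize_t n = read(fd, &buf[0], buf.size());  // one read() == one snapshot
    close(fd);
    if (n < 0) throw std::runtime_error("read failed");
    buf.resize(static_cast<std::size_t>(n));
    return buf;
}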
/proc/net/tcp, which is the one you're actually asking about, is even less consistent than that. It's atomic only within each row of the table. To see this, look at listening_get_next() in net/ipv4/tcp_ipv4.c and established_get_next() just below in the same file, and see the locks they take out on each entry in turn. I don't have repro code handy to demonstrate the lack of consistency from row to row, but there are no locks there (or anything else) that would make it consistent. Which makes sense if you think about it -- networking is often a super-busy part of the system, so it's not worth the overhead to present a consistent view in this diagnostic tool.
The other piece that keeps /proc/net/tcp atomic within each row is the buffering in seq_read(), which you can read in fs/seq_file.c. This ensures that once you read() part of one row, the text of the whole row is kept in a buffer so that the next read() will get the rest of that row before starting a new one. The same mechanism is used in /proc/mounts to keep each row atomic even if you do multiple read() calls, and it's also the mechanism that /proc/uptime in newer kernels uses to stay atomic. That mechanism does not buffer the whole file, because the kernel is cautious about memory use.
Most files in /proc will be at least as consistent as /proc/net/tcp, with each row a consistent picture of one entry in whatever information they're providing, because most of them use the same seq_file abstraction. As the /proc/uptime example illustrates, though, some files were still being migrated to use seq_file as recently as 2009; I bet there are still some that use older mechanisms and don't have even that level of atomicity. These caveats are rarely documented. For a given file, your only guarantee is to read the source.
In the case of /proc/net/tcp, you can read it and parse each line without fear. But if you try to draw any conclusions from multiple lines at once -- beware, other processes and the kernel are changing it while you read it, and you are probably creating a bug.
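As an illustration only, a small sketch of per-line parsing of /proc/net/tcp with standard C++ streams; each row is self-consistent, but nothing relates one row to the next:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream tcp("/proc/net/tcp");
    std::string line;
    std::getline(tcp, line);                    // skip the header row
    while (std::getline(tcp, line)) {
        std::istringstream row(line);
        std::string sl, local, remote, state;
        row >> sl >> local >> remote >> state;  // first few whitespace-separated fields
        // Each row is a consistent snapshot of one socket; do not assume
        // that two different rows describe the system at the same instant.
        std::cout << local << " -> " << remote << " state " << state << '\n';
    }
}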

Although the files in /proc appear as regular files in userspace, they are not really files but rather entities that support the standard file operations from userspace (open, read, close). Note that this is quite different than having an ordinary file on disk that is being changed by the kernel.
All the kernel does is print its internal state into its own memory using a sprintf-like function, and that memory is copied into userspace whenever you issue a read(2) system call.
The kernel handles these calls quite differently from those on regular files, which could mean that the entire snapshot of the data you will read is prepared at the time you open(2) it, with the kernel making sure that concurrent calls are consistent and atomic. I haven't read that anywhere, but it doesn't really make sense otherwise.
My advice is to take a look at the implementation of a proc file in your particular Unix flavour. This is really an implementation issue (as is the format and the contents of the output) that is not governed by a standard.
The simplest example would be the implementation of the uptime proc file in Linux. Note how the entire buffer is produced in the callback function supplied to single_open.

/proc is a virtual file system: in fact, it just gives a convenient view of the kernel's internals. It's definitely safe to read it (that's why it's there), but it's risky over the long term, as the internals of these virtual files may evolve with newer versions of the kernel.
EDIT
More information is available in the proc documentation in the Linux kernel docs, chapter 1.4 Networking.
I can't find how the information evolves over time. I thought it was frozen on open(), but I can't find a definite answer.
EDIT2
According to the SCO doc (not Linux, but I'm pretty sure all flavours of *nix behave like that):
Although process state and consequently the contents of /proc files can change from instant to instant, a single read(2) of a /proc file is guaranteed to return a "sane" representation of state, that is, the read will be an atomic snapshot of the state of the process. No such guarantee applies to successive reads applied to a /proc file for a running process. In addition, atomicity is specifically not guaranteed for any I/O applied to the as (address-space) file; the contents of any process's address space might be concurrently modified by an LWP of that process or any other process in the system.

The procfs API in the Linux kernel provides an interface to make sure that reads return consistent data. Read the comments in __proc_file_read. Item 1) in the big comment block explains this interface.
That being said, it is of course up to the implementation of a specific proc file to use this interface correctly to make sure its returned data is consistent. So, to answer your question: no, the kernel does not guarantee consistency of the proc files during a read but it provides the means for the implementations of those files to provide consistency.

I have the source for Linux 2.6.27.8 handy since I'm doing driver development at the moment on an embedded ARM target.
The file ...linux-2.6.27.8-lpc32xx/net/ipv4/raw.c at line 934 contains, for example
seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
        " %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d\n",
        i, src, srcp, dest, destp, sp->sk_state,
        atomic_read(&sp->sk_wmem_alloc),
        atomic_read(&sp->sk_rmem_alloc),
        0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
        atomic_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops));
which outputs
[wally@zenetfedora ~]$ cat /proc/net/tcp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 017AA8C0:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 15160 1 f552de00 299
1: 00000000:C775 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 13237 1 f552ca00 299
...
in the function raw_sock_seq_show(), which is part of a hierarchy of procfs handling functions. The text is not generated until a read() request is made of the /proc/net/tcp file, a reasonable mechanism since procfs reads are surely much less common than updates to the underlying information.
Some drivers (such as mine) implement the proc_read function with a single sprintf(). The extra complication in the core drivers' implementation is handling potentially very long output which may not fit in the intermediate, kernel-space buffer during a single read.
I tested that with a program using a 64K read buffer, but on my system it results in a 3072-byte kernel-space buffer for proc_read to return data. Multiple calls with advancing pointers are needed to get more than that much text returned. I don't know the right way to make the returned data consistent when more than one I/O is needed. Certainly each entry in /proc/net/tcp is self-consistent. There is some likelihood that lines side by side are snapshots taken at different times.
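To watch the chunking happen, here is a rough test of my own devising (not the driver code discussed above) that counts how many read() calls it takes to drain /proc/net/tcp with a deliberately small buffer:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/proc/net/tcp", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    char buf[512];                    // a small buffer forces several read() calls
    long total = 0;
    int calls = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        ++calls;
        total += n;
    }
    close(fd);
    // Rows returned by different read() calls may be snapshots taken at
    // different moments; only each individual row is self-consistent.
    std::printf("%ld bytes in %d read() calls\n", total, calls);
    return 0;
}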

Short of unknown bugs, there are no race conditions in /proc that would lead to reading corrupted data or a mix of old and new data. In this sense, it's safe. However, there's still the race condition that much of the data you read from /proc is potentially outdated as soon as it's generated, and even more so by the time you get around to reading and processing it. For instance, processes can die at any time and a new process can be assigned the same pid; the only process ids you can ever use without race conditions are those of your own child processes. The same goes for network information (open ports, etc.) and really most of the information in /proc. I would consider it bad and dangerous practice to rely on any data in /proc being accurate, except data about your own process and potentially its child processes. Of course it may still be useful to present other information from /proc to the user/admin for informative/logging/etc. purposes.

When you read from a /proc file, the kernel is calling a function which has been registered in advance to be the "read" function for that proc file. See the __proc_file_read function in fs/proc/generic.c.
Therefore, the safety of the proc read is only as safe as the function the kernel calls to satisfy the read request. If that function properly locks all data it touches and returns to you in a buffer, then it is completely safe to read using that function. Since proc files like the one used for satisfying read requests to /proc/net/tcp have been around for a while and have undergone scrupulous review, they are about as safe as you could ask for. In fact, many common Linux utilities rely on reading from the proc filesystem and formatting the output in a different way. (Off the top of my head, I think 'ps' and 'netstat' do this).
As always, you don't have to take my word for it; you can look at the source to calm your fears. The following documentation from proc_net_tcp.txt tells you where the "read" functions for /proc/net/tcp live, so you can look at the actual code that is run when you read from that proc file and verify for yourself that there are no locking hazards.
This document describes the interfaces /proc/net/tcp and /proc/net/tcp6. Note that these interfaces are deprecated in favor of tcp_diag. These /proc interfaces provide information about currently active TCP connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively.

Related

How does Linux handle the case when multiple processes try to replace the same file at the same time?

I know this is a bit of a theoretical question, but I haven't got any satisfactory answer yet, so I thought I'd put the question here.
I have multiple C++ processes (I'd also like to know the thread behaviour) which contend to replace the same file at the same time. How safe is this to do on Linux (using Ubuntu 14.04 and CentOS 7)? Do I need to use locks?
Thanks in advance.
The filesystems of Unix-based OS's like Linux are designed around the notion of inodes, which are internal records describing various metadata about the file. Normally these aren't interacted with directly by users or programs, but their presence gives these filesystems a level of indirection that allows them to provide some useful semantics that other OS's (read: Windows) cannot.
filename --> inode --> data
In particular, when a file gets deleted, what's actually happening is the separation of the file's inode from its filename; not (necessarily) the deletion of the file's data itself. That is, the file and its contents can continue to exist (albeit invisibly, from the user's point of view) until all processes have closed their file-handles that were open on that file; once the inode is no longer accessible to any process, only then will the filesystem actually mark the file's data-blocks as free and available-for-reuse. In the meantime, the filename becomes available for another file's inode (and data) to be associated with, even though the old file's inode/data still technically exists.
The upshot of that is that under Linux it's perfectly valid to delete (or rename) a file at any time, even if other threads/processes are in the middle of using it; your delete will succeed, and any other programs that have that file open at that instant can simply continue reading/writing/using it, exactly as if it hadn't been deleted. The only thing that is different is that the filename will no longer appear in its directory, and when they call fclose() (or close(), etc.) on the file, the file's data will go away.
Since doing mv new.txt old.txt is essentially the same as doing a rm old.txt ; mv new.txt old.txt, there should be no problems with doing this from multiple threads without any synchronization. (note that the slightly different situation of having multiple threads or processes opening the same file simultaneously and writing into it at the same time is a bit more perilous; nothing will crash, but it would be easy for them to overwrite each other's data and corrupt the file, if they aren't careful)
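For what it's worth, here is a sketch of the usual write-temp-then-rename pattern that relies on this behaviour (the .tmp suffix and error handling are illustrative choices, not a prescription):

#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

// Atomically replace 'path' with 'contents': readers that already have the
// old file open keep seeing the old data, new opens see the new data.
// Note the fixed ".tmp" suffix could collide if several writers use the
// same pattern at once; a unique temporary name avoids that.
void replace_file(const std::string& path, const std::string& contents) {
    std::string tmp = path + ".tmp";  // same directory, so rename() is atomic
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) throw std::runtime_error("cannot create temp file");
        out << contents;
        out.flush();                  // consider fsync() here if durability matters
    }
    if (std::rename(tmp.c_str(), path.c_str()) != 0)
        throw std::runtime_error("rename failed");
}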
It depends a lot on exactly what you're going to be doing and how you're using the files. In general, in Unix/Posix systems like Linux, all file calls will succeed if multiple processes make them, and the general way the OS handles contention is "the last one to do something wins". Essentially, all modifications to the filesystem are serialized, so the filesystem is always in a consistent state. But otherwise it's a free-for-all.
There are a lot of details here though. There are flags used when opening a file, like O_EXCL, that can result in failure if another process got there first (a sort of lock). There are advisory locking systems (i.e. nobody is forced by the OS to pay attention to them) like flock (try typing man 2 flock to learn more) for file contents. There is also a more Linux-specific mandatory locking system.
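As a small illustration of the advisory variety, here is a flock() sketch (the file name is made up); it only coordinates with other processes that also call flock() on the same file:

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("old.txt", O_RDWR | O_CREAT, 0644);  // illustrative file name
    if (fd < 0) { perror("open"); return 1; }
    if (flock(fd, LOCK_EX) != 0) {   // blocks until the exclusive lock is granted
        perror("flock");
        return 1;
    }
    // ... read/modify/write the file; other processes that also call flock()
    // on this file will wait here, everyone else is completely unaffected ...
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}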
And there are also details like "What happens if someone deleted a file I have open?" that the other answer explains correctly and well.
And lastly, there's a whole mess of detail surrounding whether it's guaranteed that any particular change to the filesystem is recorded for all eternity, or whether it has a chance of disappearing if someone flicks the power switch. And that's a mess-and-a-half once you really dive into it, from dodgy hardware that lies to the OS about things to the confusing morass of different Linux system calls covering different aspects of this problem, often entering Linux from different eras of Unix/Posix history and interacting with each other in strange and arcane ways.
So, an answer to your very general and open-ended question is going to have to necessarily be vague, abstract, and hand-wavey.

Thread Optimization [duplicate]

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads that have separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's not copyable). So, how do I handle this?
Immediately I can think of two ways:
Construct a new ifstream for the second thread, open it on the same file.
Share a single instance of an open ifstream across both threads (using for instance boost::shared_ptr<>). Seek to the appropriate file offset that current thread is currently interested in, when the thread gets a time slice.
Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.
Thanks.
Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.
If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
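A minimal sketch of that first option, assuming an arbitrary file name and offsets; each thread owns its own std::ifstream, so no synchronization is needed between them:

#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Each worker opens its own ifstream on the same file and reads a chunk
// starting at its own offset; the streams share nothing, so no locking.
void read_chunk(const std::string& path, std::streamoff offset, std::size_t len) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(offset);
    std::vector<char> buf(len);
    in.read(buf.data(), buf.size());
    std::cout << "read " << in.gcount() << " bytes at offset " << offset << '\n';
}

int main() {
    const std::string path = "big_input.dat";  // illustrative file name
    std::thread t1(read_chunk, path, 0, 1 << 20);
    std::thread t2(read_chunk, path, 1 << 20, 1 << 20);
    t1.join();
    t2.join();
}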
Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view between the files, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, the raw pointer or reference are fine.
Finally note that on the vast majority of hardware, the disk is the bottleneck, not CPU, when loading large files. Using two threads will make this worse because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100MB/s sequentially, but top out at 3 or 4 MB/s random access.
Other option:
Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this, istringstream is not).
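Roughly, on POSIX that might look like the following sketch (the file name is illustrative; std::istrstream is deprecated but still available in practice):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <strstream>   // std::istrstream is deprecated but still shipped
#include <iostream>
#include <string>

int main() {
    int fd = open("big_input.dat", O_RDONLY);   // illustrative file name
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(p);

    // Any number of istreams can be layered over the same mapping, each with
    // its own read position, and handed to different threads.
    std::istrstream first_half(data, st.st_size / 2);
    std::istrstream second_half(data + st.st_size / 2, st.st_size - st.st_size / 2);

    std::string word;
    first_half >> word;
    std::cout << "first word: " << word << '\n';

    munmap(p, st.st_size);
    close(fd);
}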
It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so it should definitely be avoided.
It might be worth experimenting to see how read-ahead works on your system: open the file, then read the first half of it sequentially, and see how long that takes. Then open it, seek to the middle, and read the second half sequentially. (On some systems I've seen in the past, a simple seek, at any time, will turn off read-ahead.) Finally, open it, then read every other record; this will simulate two threads using the same file descriptor. (For all of these tests, use fixed-length records, and open in binary mode. Also take whatever steps are necessary to ensure that any data from the file is purged from the OS's cache before starting the test; under Unix, copying a file of 10 or 20 gigabytes to /dev/null is usually sufficient for this.)
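A rough harness for the first two measurements could look like this sketch (the record size, counts, and file name are assumptions; remember to purge the OS cache between runs as described above):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Time how long it takes to read 'count' fixed-size records starting at
// 'offset'. Run it once from the start and once from the middle of the
// file (dropping the OS cache in between) and compare the timings.
double time_read(const std::string& path, std::streamoff offset,
                 std::size_t count, std::size_t record_size) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(offset);
    std::vector<char> rec(record_size);
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count && in.read(rec.data(), rec.size()); ++i) {
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // Illustrative numbers: 1,000,000 records of 512 bytes each.
    std::cout << "first half:  " << time_read("big_input.dat", 0, 500000, 512) << " s\n";
    std::cout << "second half: " << time_read("big_input.dat", 500000LL * 512, 500000, 512) << " s\n";
}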
That will give you some ideas, but to be really certain, the best solution would be to test the real cases. I'd be surprised if sharing a single ifstream (and thus a single file descriptor), and constantly seeking, won, but you never know.
I'd also recommend system specific solutions like mmap, but if you've got that much data, there's a good chance you won't be able to map it all in one go anyway. (You can still use mmap, mapping sections of it at a time, but it becomes a lot more complicated.)
Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)
My vote would be a single reader, which hands the data to multiple worker threads.
If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

With what API do you perform a read-consistent file operation in OS X, analogous to Windows Volume Shadow Service

We're writing a C++/Objective C app, runnable on OSX from versions 10.7 to present (10.11).
Under Windows, there is the concept of a shadow copy, which allows you to read a file as it exists at a certain point in time, without having to worry about other processes writing to that file in the interim.
However, I can't find any documentation or online articles discussing a similar feature in OS X. I know that OS X will not lock a file when it's being written to, so is it necessary to do something special to make sure I don't pick up a file that is in the middle of being modified?
Or does the Journaled Filesystem make any special handling unnecessary? I'm concerned that if I have one process that is creating or modifying files (within a single context of, say, an fopen call - obviously I can't be guaranteed of "completeness" if the writing process is opening and closing a file repeatedly during what should be an atomic operation), that a reading process will end up getting a "half-baked" file.
And if JFS does guarantee that readers only see "whole" files, does this extend to Fat32 volumes that may be mounted as external drives?
A few things:
On Unix, once you open a file, if it is replaced (as opposed to modified), your file descriptor continues to access the file you opened, not its replacement.
Many apps will replace rather than modify files, using things like -[NSData writeToFile:atomically:] with YES for atomically:.
Cocoa and the other high-level frameworks do, in fact, lock files when they write to them, but that locking is advisory not mandatory, so other programs also have to opt in to the advisory locking system to be affected by that.
The modern approach is File Coordination. Again, this is a voluntary system that apps have to opt in to.
There is no feature quite like what you described on Windows. If the standard approaches aren't sufficient for your needs, you'll have to build something custom. For example, you could make a copy of the file that you're interested in and, after your copy is complete, compare it to the original to see if it was being modified as you were copying it. If the original has changed, you'll have to start over with a fresh copy operation (or give up). You can use File Coordination to at least minimize the possibility of contention from cooperating programs.
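A crude sketch of that copy-and-compare idea (the file names are illustrative; comparing stat() size and mtime is a heuristic, not a guarantee):

#include <sys/stat.h>
#include <fstream>
#include <iostream>
#include <string>

// Copy 'src' to 'dst', then check whether 'src' changed while we copied.
// Returns true if the copy is probably a consistent snapshot.
bool copy_if_stable(const std::string& src, const std::string& dst) {
    struct stat before, after;
    if (stat(src.c_str(), &before) != 0) return false;

    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    out << in.rdbuf();                       // bulk copy
    out.flush();

    if (stat(src.c_str(), &after) != 0) return false;
    return before.st_mtime == after.st_mtime &&
           before.st_size == after.st_size;  // unchanged as far as we can tell
}

int main() {
    if (!copy_if_stable("data.plist", "data.snapshot"))   // illustrative names
        std::cout << "source changed during copy, try again\n";
}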

Flushing only file metadata

We're developing on a new ACID database system that focuses more on data integrity than throughput. Its storage engine accesses secondary storage devices directly with flags like O_DIRECT or FILE_FLAG_WRITE_THROUGH & FILE_FLAG_NO_BUFFERING.
In some cases we only change file metadata, using kernel functions like fallocate() or SetFileValidData(). In these cases I would like to flush only the metadata and not all pending file I/O, to improve performance, since the flush call blocks until the device reports that the transfer has completed. Even when no file buffering is in use, that only applies to application data; the file system may still cache file metadata.
I've so far found that fsync() or FlushFileBuffers() flushes metadata, but unfortunately it also flushes all pending I/O. Anyone know of a way of only flushing the file metadata? This problem applies to Linux, UNIX, and Windows.
I am a newbie to filesystems, but when you go through the implementation of any physical FS (ext4/ext3/etc.), you'll see they haven't exposed such functionality to the upper layer. Internally, though, the fsync() implementation only updates the metadata of the file, and the remaining task is delegated to generic_block_fdatasync().
You might want to write a hack for your requirement of flushing only metadata.
Anyone know of a way of only flushing the file metadata?
No. Based on my understanding, there is no such interface/API provided by any operating system. There are two interfaces provided by the filesystem through which an application (user-mode) program can control when data gets written/saved to disk:
fsync: A call to fsync( ) ensures that all dirty data associated with the file mapped by the file descriptor fd is written back to disk. This call writes back both data and metadata.
fdatasync: This system call does the same thing as fsync(), except that it only flushes data (plus the minimal metadata, such as the file size, that is needed to read the data back); it does not flush metadata such as timestamps.
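As a small aside, here is a sketch of the distinction in practice (the file name is illustrative; note that neither call flushes metadata alone, which is exactly what this question is asking for):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);  // illustrative name
    if (fd < 0) { perror("open"); return 1; }

    const char record[] = "a committed record\n";
    if (write(fd, record, sizeof record - 1) < 0) { perror("write"); return 1; }

    // fdatasync(): flush the data, plus only the metadata needed to read it
    // back later (e.g. file size); timestamps and the like are skipped.
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    // fsync(): flush the data and all of the file's metadata.
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}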
So there is a way to perform something like the opposite of the task mentioned in this question, but not the task itself. However, while reading your question, it appears to me that you want to achieve this for optimal performance as well as data consistency. To my understanding, we should not worry too much about execution performance here, as a modern filesystem implements "delayed write" and various other mechanisms to avoid unnecessary disk writes.
The main cost here is the switch between user mode and kernel mode, which is more expensive than almost anything else. This might be the reason the kernel developers have not provided an interface that updates only the metadata of a particular file. It could also be due to limitations of the filesystem, and I guess there is little we can do here to achieve more efficiency.
For complete information on the internal algorithms, you may want to refer to the classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.

Are POSIX' read() and write() system calls atomic?

I am trying to implement a database index based on the data structure (Blink tree) and algorithms suggested by Lehman and Yao in this paper. In page 2, the authors state that:
The disk is partitioned in sections of fixed size (physical pages; in this paper, these correspond to the nodes of the tree). These are the only units that can be read or written by a process. [emphasis mine] (...)
(...) a process is allowed to lock and unlock a disk page. This lock gives that process exclusive modification rights to that page; also, a process must have a page locked in order to modify that page. (...) Locks do not prevent other processes from reading the locked page. [emphasis mine]
I am not completely sure my interpretation is correct (I am not used to reading academic papers), but I think it can be concluded from the emphasized sentences that the authors mean the operations that read and write a page are assumed to be "atomic", in the sense that, if a process A has already begun reading (resp. writing) a page, another process B may not begin writing (resp. reading) that same page until A is done performing its read (resp. write) operation. Multiple processes simultaneously reading the same page is, of course, a legitimate condition, as is having multiple processes simultaneously performing arbitrary operations on exclusively different pages (process A on page P, process B on page Q, process C on page R, etc.).
Is my interpretation correct?
Can I assume POSIX' read() and write() system calls are "atomic" in the sense described above? Can I rely on these system calls having some internal logic to determine whether a specific read() or write() call should be temporarily blocked based on the position of the file descriptor and the specified size of the chunk to be read or written?
If the answer to the above questions is "No", how should I roll my own locking mechanism?
I don't believe the text you cite implies anything of the sort. It doesn't even mention read() or write() or POSIX. In fact, read() and write() cannot be relied on to be atomic. The only thing POSIX says is that write() must be atomic if the size of the write is less than PIPE_BUF bytes, and even that only applies to pipes.
I didn't read the context around the part of the paper you cited, but it sounds like the passage you cited is stating constraints which must be placed on an implementation in order for the algorithm to work correctly. In other words, it states that an implementation of this algorithm requires locking.
How you do that locking is up to you (the implementor). If we are dealing with a regular file and multiple independent processes, you might try fcntl(F_SETLKW)-style locking. If your data structure is in memory and you are dealing with multiple threads in the same process, something else might be appropriate.
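For the regular-file case, here is a minimal fcntl(F_SETLKW) sketch of locking one fixed-size page (the page size, file name, and helper names are illustrative, not taken from the paper):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// Take an advisory lock on page 'page_no' of the index file, where each
// page is 'page_size' bytes. F_SETLKW blocks until the lock is granted.
bool lock_page(int fd, off_t page_no, off_t page_size, bool exclusive) {
    struct flock fl{};
    fl.l_type = exclusive ? F_WRLCK : F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = page_no * page_size;
    fl.l_len = page_size;
    return fcntl(fd, F_SETLKW, &fl) == 0;
}

bool unlock_page(int fd, off_t page_no, off_t page_size) {
    struct flock fl{};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = page_no * page_size;
    fl.l_len = page_size;
    return fcntl(fd, F_SETLK, &fl) == 0;
}

int main() {
    int fd = open("index.db", O_RDWR | O_CREAT, 0644);   // illustrative file name
    if (fd < 0) { perror("open"); return 1; }
    if (lock_page(fd, 3, 4096, /*exclusive=*/true)) {
        // ... read-modify-write page 3 here ...
        unlock_page(fd, 3, 4096);
    }
    close(fd);
    return 0;
}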
Answers:
Reads that are concurrent with writes may see torn writes, depending on the OS, the filing system, and what flags you opened the file with. A quick summary by flags, OS and filing system is below.
You can lock byte ranges in a file before accessing them using fcntl() on POSIX or LockFile() on Windows.
No O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = 1 byte
Linux 4.2.6 with ext4: update atomicity = 1 byte
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite (*)
O_DIRECT/FILE_FLAG_NO_BUFFERING:
Microsoft Windows 10 with NTFS: update atomicity = up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in (*).
Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite (*). Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes, XFS certainly used to have custom locking but it looks like recent Linux has finally fixed this.
FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite (*)
You can see the raw empirical test results at https://github.com/BoostGSoC13/boost.afio/blob/master/fs_probe/fs_probe_results.yaml. The results were generated by a program written using asynchronous file i/o on all platforms. Note that we test for torn offsets only on 512-byte multiples, so I cannot say whether a partial update of a 512-byte sector would tear during the read-modify-write cycle.