I would like to write a multithreading-safe logger using a lock-free queue. Logging threads will push messages onto the queue, and a logger thread will pop them and send them to the output. I am considering how to solve one part of that: sending to the output.
I would like to avoid using mutexes/locks as far as possible.
So, let's assume that I am going to use C++ streams to write to the file/console. We can assume that the target system is Linux.
OK, writing to a stream must be just a wrapper (perhaps an advanced wrapper) for the write system call offered by Unix. From what I know, syscalls are atomic (only one process can execute a syscall at a time). So it is tempting not to use locks to make writing to the file safe.
But write is a system call, and it doesn't guarantee writing the "whole output". It returns the number of bytes that were successfully written to the file.
Basically, my question is:
How can I solve it? Is it possible to avoid a mutex? (I think it is not possible.) And please check my reasoning: am I wrong anywhere?
Igor is right: just have one thread do all the log writes. Keep in mind that the kernel has to do locking to synchronize access to the open file descriptor (which keeps track of the file position), so by doing writes from multiple cores you're causing contention inside the kernel. Even worse, you're making system calls from multiple cores, which means the kernel's code / data accesses will dirty your caches on multiple cores.
See this paper for more about the impact of making system calls on the performance of user-space code after the syscall completes (and about data/instruction cache misses inside the kernel for infrequent syscalls). It definitely makes sense to have one thread doing all the system calls, at least all the write system calls, to keep that part of your process's footprint isolated to one core, as well as to confine the locking contention inside the kernel.
That FlexSC paper is about an idea for batching system calls to reduce user->kernel->user transitions, but they also measure overhead for the normal synchronous system-call method. More important is the discussion of cache-pollution from making system calls.
Alternatively, if you can let multiple threads write to your log file, you could just do that and not use the queue at all.
It's not guaranteed that a large write will finish uninterrupted, but a small to medium sized write should (almost?) always copy its whole buffer on most OSes. Especially if you're writing to a file, not a pipe. IDK how Linux write() behaves when it's preempted, but I expect it usually resumes to finish the write instead of returning without having written all the requested bytes. Partial writes might be more likely when interrupted by a signal.
It is guaranteed that bytes from two write() system calls won't be mixed together; all the bytes from one will be before or after the bytes from the other. You're correct that partial writes are a potential problem, though. I forget if the glibc syscall wrapper will resume the call for you on EINTR. Although in that case, it means no bytes actually got written, or it would have returned success with a byte count.
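If you do hit a partial write (or EINTR), the usual fix is a retry loop. Here is a minimal sketch, assuming POSIX write(); the write_all helper name is mine:

#include <unistd.h>   // write()
#include <cerrno>     // errno, EINTR
#include <cstddef>    // size_t

// Keep calling write() until the whole buffer is written or a real error occurs.
bool write_all(int fd, const char* buf, size_t len)
{
    while (len > 0) {
        ssize_t n = ::write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               // interrupted before any bytes: retry
            return false;               // real error; inspect errno
        }
        buf += n;                       // partial write: advance past what landed
        len -= static_cast<size_t>(n);
    }
    return true;
}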
You should test this, for partial writes and for performance. kernel-space locking might be cheaper than the overhead of your lock-free queue, but making system calls from every thread that generates log messages might be worse for performance. (And when you test this, make sure you do it with some real work happening in your user-space process, not just a loop that only calls write.)
Related
I keep reading that asynchronous IO is better than synchronous IO, because with async IO your program can keep running, while with sync IO you're blocked until the operation is finished.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk; it doesn't happen by itself. The kernel needs CPU time in order to do it.
With async IO the kernel needs that CPU time as well, which might result in a context switch from my application to the kernel. So async IO isn't really free of blocking: CPU cycles are still needed to run the operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO, where you wait for the data to be written to disk, with async IO the time you'd spend waiting can be used to continue doing application processing, and the kernel's part of writing it to disk is small?
Let's say I have an application that does nothing but get info and write it into files. Is there any benefit to using async IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand it also offers zero copy, so that's a benefit)
spdk (should be the fastest, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use are a matter of what programming model you prefer, and don't determine whether your program will freeze waiting for I/O operations to finish. For certain, specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
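For illustration, here is a minimal sketch of option 1, an event loop around poll() (POSIX; it watches only stdin, and the per-fd buffering of partial input is elided). As just noted, this only helps for pipes, sockets, and terminals; regular files always report ready:

#include <poll.h>     // poll(), struct pollfd
#include <unistd.h>   // read()
#include <cstdio>     // perror()

int main()
{
    struct pollfd fds[1];
    fds[0].fd = 0;                      // stdin
    fds[0].events = POLLIN;             // wake up when input is available

    for (;;) {
        if (poll(fds, 1, -1) < 0) {     // block until some fd is ready
            perror("poll");
            return 1;
        }
        if (fds[0].revents & POLLIN) {
            char buf[4096];
            ssize_t n = read(fds[0].fd, buf, sizeof buf);
            if (n <= 0)
                break;                  // EOF or error: leave the loop
            // ...append to this fd's partial-input state, process complete records...
        }
    }
}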
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event-handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that does nothing but get info and write it into files. Is there any benefit to using async IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O, regardless of which programming model (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
The kernel needs CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... with async IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that does nothing but get info and write it into files. Is there any benefit to using async IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk; it doesn't happen by itself. The kernel needs CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000", and then the CPU can do anything else it likes while the disk controller does the transfer (and the CPU will be interrupted by an IRQ from the disk controller when it has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course the hardware is often more advanced (e.g. having an internal queue of operations itself, so the driver can tell it to do multiple things and it can start the next operation as soon as it finishes the previous one); and often drivers are more advanced too (e.g. having "IO priorities" to ensure that more important stuff is done first, rather than just a simple FIFO queue of pending operations).
Let's say I have an application that does nothing but get info and write it into files. Is there any benefit to using async IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
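As an illustration of that last alternative, here is a minimal sketch of three tasks connected by blocking queues; the Queue type and the integer "data" are placeholders, and real code would talk to deviceA and deviceB instead:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Queue {                          // simple blocking queue
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {                          // blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

int main()
{
    Queue<int> raw, cooked;            // deviceA -> CPU -> deviceB

    std::thread reader([&]{            // stage 1: fetch from "deviceA"
        for (int i = 0; i < 100; ++i) raw.push(i);
        raw.push(-1);                  // sentinel: no more data
    });
    std::thread worker([&]{            // stage 2: process on the CPU
        for (;;) {
            int v = raw.pop();
            if (v < 0) { cooked.push(-1); break; }
            cooked.push(v * 2);        // stand-in for real processing
        }
    });
    std::thread writer([&]{            // stage 3: write to "deviceB"
        for (;;) { if (cooked.pop() < 0) break; /* write here */ }
    });

    reader.join(); worker.join(); writer.join();
}

Note how the extra state (queues, locks, stacks) is exactly the overhead the paragraph above warns about.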
Context switching is necessary in any case. The kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require much processor work; the limiting factor is the disk response. The question is whether we wait for this response or do our own work in the meantime.
Let's say I have an application that does nothing but get info and write it into files. Is there any benefit to using async IO instead of sync IO?
If you implement a synchronous access, your sequence is following:
get information
write information
goto 1.
So you can't get information until write() completes. Suppose the information supplier is as slow as the disk you write to. In this case the program will be twice as slow as the asynchronous one.
If the information supplier can't wait and can't buffer the information while you are writing, you will lose portions of it during writes. Examples of such information sources are sensors for fast processes. In this case, you should synchronously read the sensors and asynchronously save the obtained values.
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal have the most to gain. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(And "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or for a network client to send a command.)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.
This is similar to, but a bit different from, existing questions. Say I have many threads that open the same file, but they all do their own fopen and maintain their own FILE pointer.
a) is it necessary to lock fwrite calls if they have their own FILE ptrs?
b) if it is necessary, is locking around fwrite enough, or will they potentially flush at different times and end up intermingling their output when they flush? If so, would locking around both fwrite and fflush cover it?
This question cannot be answered at the level of the programming language. As far as the programming language is concerned, those file handles are completely independent objects, and whatever you do with one has no effect whatsoever on another.
The question is about the operating system: can it handle multiple write operations to the same underlying file at the same time? In other words, are those writes atomic? I can't say for all of them, but on Linux, for example, writes of less than PIPE_BUF bytes (to a pipe or FIFO) are atomic.
As a quick fix, yes, you can put a lock around the I/O part. That will work, I guarantee it. As for flushing the I/O cache, I'd recommend against it. It's always best to let the OS handle I/O timing, because the kernel knows best what's going on. You won't have the data on disk immediately after calling flush anyway, because it's that complicated (just like other flush operations: Java GC, glFlush and so on). If you stick with this option, be mindful of the start and end points of the concurrent I/O operation. You don't want a case where the main thread closes the file while another worker thread is still trying to do I/O on it.
The general solution to this problem is to create a thread that handles the file exclusively. If other threads want to read from or write to the file, they must ask that thread to do it for them. This is tricky, I know. You'd need to compose a simple protocol and a sync mechanism, but in a nutshell it goes like this (see the sketch after this list):
Prepare a queue, a cv (condition variable), and a lock; create a thread and open the file. It doesn't matter who opens the file.
The thread spawns and waits for the queue to be filled.
Other threads send I/O requests to that thread. A request includes the data for the file and an opcode.
The thread handles the requests from the queue. This is where the real I/O happens.
You could use an anonymous FIFO instead of a queue, or skip the opcode part if the file is write-only.
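A minimal sketch of that protocol using C++ standard threads (the Request type and the function names are mine; error handling and the read path are elided):

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>

struct Request { std::string data; };   // add an opcode field if reads are needed too

std::queue<Request>     requests;
std::mutex              mtx;            // the lock
std::condition_variable cv;
bool                    done = false;

void io_thread(FILE* f)                 // this thread owns the file exclusively
{                                       // spawn with: std::thread t(io_thread, f);
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, []{ return done || !requests.empty(); });
        if (requests.empty())
            break;                      // told to finish and queue is drained
        Request r = std::move(requests.front());
        requests.pop();
        lk.unlock();                    // do the actual I/O outside the lock
        std::fwrite(r.data.data(), 1, r.data.size(), f);
    }
}

void submit(std::string s)              // called from any other thread
{
    {
        std::lock_guard<std::mutex> lk(mtx);
        requests.push(Request{std::move(s)});
    }
    cv.notify_one();
}

To shut down, set done to true under the lock and call cv.notify_one(); the thread exits once the queue is drained.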
Unlike network I/O, modern OSes can't do regular file I/O in a non-blocking manner, so expect significant blocking time (I/O wait). There's also the problem that the queue fills up too quickly and eats a lot of memory when the I/O is relatively slow, and there will be cases where the whole program has to wait for the I/O to complete before terminating. Not much you can do about that. You could close the file from another thread while I/O is in progress on Linux (close() is MT-safe); I don't know how that would work on other OSes.
There are alternatives like async file I/O or overlapped I/O, which involve signal handling or callbacks. Using these doesn't require creating a thread, but each has pros and cons, mostly regarding portability.
What would happen if you called read (or write, or both) from two different threads on the same file descriptor (let's say we are interested in a local file, or in a socket file descriptor), without explicitly using any synchronization mechanism?
read and write are syscalls, so on a single-core CPU it's probably unlikely that two reads would be executed "at the same time". But with multiple cores...
What the linux kernel will do?
And let's be a bit more general: is the behavior always the same for other kernels (like the BSDs)?
Edit: according to the close documentation, we should be sure that the file descriptor isn't being used by a syscall in another thread. So it seems that explicit synchronization would be required before closing a file descriptor (and therefore also around read/write, if threads that may call them are still running).
Any system level (syscall) file descriptor access is thread safe in all mainstream UNIX-like OSes.
Though, depending on their age, they are not necessarily signal-safe.
If you call read, write, accept or similar on a file descriptor from two different tasks then the kernel's internal locking mechanism will resolve contention.
Each byte can only be read once, though, and writes may land in any undefined order.
The stdio library functions fread, fwrite and co. also have internal locking on the control structures by default, though it is possible to disable that with flags.
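That stdio lock can also be taken explicitly, so a group of calls can't interleave with writes from other threads. A sketch using POSIX flockfile():

#include <cstdio>     // FILE, fputs; flockfile/funlockfile are POSIX additions

void write_record(FILE* f, const char* hdr, const char* body)
{
    flockfile(f);          // take this FILE's internal lock (recursive)
    fputs(hdr, f);         // these two writes now appear
    fputs(body, f);        // contiguously in the output
    funlockfile(f);
}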
The comment about close is because it doesn't make a lot of sense to close a file descriptor in any situation in which some other thread might be trying to use it. So while it is 'safe' as far as the kernel is concerned, it can lead to odd, hard to diagnose corner cases.
If a thread closes a file descriptor while a second thread is trying to read from it, the second thread may get an unexpected EBADF error. Worse, if a third thread is simultaneously opening a new file, that might reallocate the same fd, and the second thread might accidentally read from the new file rather than the one it was expecting...
Have a care for those who follow in your footsteps
It's perfectly normal to protect the file descriptor with a mutex semaphore. It removes any dependence on kernel behaviour, so your message boundaries are now certain. Then you don't have to cite the last paragraph at the bottom of a 15,489-line manpage which explains why the mutex isn't necessary (I exaggerated, but you get my meaning).
It also makes it clear to anyone reading your code that the file descriptor is being used by more than one thread.
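For example, a minimal sketch of such a wrapper (the names are mine):

#include <mutex>
#include <string>
#include <unistd.h>   // write()

std::mutex fd_mutex;   // documents that the descriptor is shared between threads

void locked_write(int fd, const std::string& msg)
{
    std::lock_guard<std::mutex> lk(fd_mutex);
    ::write(fd, msg.data(), msg.size());   // a real version should check the result
}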
Fringe Benefit
There is a fringe benefit to using a mutex that way. Suppose you've got different messages coming from the different threads and some of those messages are more important than others. All you need to do is set the thread priorities to reflect their messages' importance. That way the OS will ensure that your messages will be sent in order of importance for minimal effort on your part.
The result would depend on how the threads are scheduled to run at that particular instant in time.
One way to think about the potential for undefined behavior with multi-threading is to treat the file accesses like ordinary memory operations, e.g. updating a linked list or changing a variable: without synchronization, the outcome depends on scheduling.
If you use mutex/semaphores/lock or some other synchronization mechanism, it should work as intended.
Visual Studio's fread "locks out other threads." There is an alternate version _fread_nolock, which reads "without locking other threads", which should only be used "in thread-safe contexts such as single-threaded applications or where the calling scope already handles thread isolation."
Even after reading other somewhat relevant discussions of the two, I'm confused whether the locking fread implements is on a specific FILE struct, on a specific actual file, or across all fread calls on totally different files.
If you use the nolock versions, what level of locking do you need to provide? Can multiple threads in parallel be reading separate files without any locking? Can multiple threads in parallel be writing separate files without any locking? Or are there global or static variables involved that would be corrupted?
So, by using the nolock versions, can you potentially achieve better I/O throughput (assuming you aren't needlessly moving heads, e.g. reading off separate drives, or an SSD), or is the potential gain just reducing redundant locks to a single lock (which should be negligible)?
Does VS' ifstream.read function work just like the regular fread? (I don't see a nolock version of it.)
The MS standard library implementation fully supports multi-threading. The C++ standard explains this requirement:
27.2.3: Concurrent access to a stream object, stream buffer object, or C Library stream by multiple threads may result in a data race unless otherwise specified.
If one thread makes a library call a that writes a value to a stream and, as a result, another thread reads this value from the stream through a library call b such that this does not result in a data race, then a's write synchronizes with b's read.
This means that if you write to a stream, locking is done (not file locking, but locking of concurrent access to the in-memory stream data structure), to be sure that concurrency is well managed for all the other threads using the same stream.
This locking overhead is always there, even when it's not needed. This can have a performance cost, according to Microsoft:
the performance of the multithreaded libraries has been improved and is close to the performance of the now-eliminated single-threaded libraries. For those situations when even higher performance is required, there are several new features.
This is why the _nolock functions are provided. They access the stream directly, without thread locking. They must be used with extreme care, for example:
if your application is single-threaded (another process using the same stream has its own data structure, and the OS manages concurrency there);
if you're sure that no two threads use the same stream (for example, if you have only one reader thread and writing is done outside your program);
if you have some other synchronisation mechanism that protects a critical section of your code, for example a mutex lock, or a thread-safe non-blocking algorithm that makes use of atomics.
In such cases, the additional lock for stream access is not needed/redundant. For file-intensive functions, it could then be worth using the _nolock versions.
Note: as you've pointed out, it's only worth using the _nolock versions for intensive file access where you make millions of accesses.
_fread_nolock() appears to be meant for use once you've made sure that the file is locked by an external mechanism (some form of mutex, probably); you then use it to reduce overhead. Related: What's the intended use of _fread_nolock, _fseek_nolock?
This may also answer any further questions you might have: it may or may not be possible for your hard drive to actually perform more than one I/O operation at the same time, depending on what type of drive you have: https://superuser.com/questions/252959/which-is-faster-copying-everything-at-once-or-one-thing-at-a-time
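To make the external-locking pattern concrete, here is a minimal sketch assuming the MSVC CRT (which is what provides _fread_nolock; the mutex and the wrapper are mine):

#include <cstdio>
#include <mutex>

std::mutex file_mutex;   // the external locking the _nolock variants rely on

size_t locked_read(void* buf, size_t size, size_t count, FILE* f)
{
    std::lock_guard<std::mutex> lk(file_mutex);
    return _fread_nolock(buf, size, count, f);   // skips the CRT's per-FILE lock
}

This only pays off if one mutex acquisition replaces many internal locks, e.g. by batching several reads under a single lock_guard.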
I'm working on a project under Linux which needs to read/write the same fd from multiple threads. And I want to use posix_fadvise to free the page cache.
Can I call posix_fadvise when another thread is reading or writing the same fd?
Read posix_fadvise(2) and syscalls(2). Since posix_fadvise is a genuine syscall (e.g. it wraps fadvise64, which has its own __NR_fadvise64 in <asm/unistd.h>...), you should be able to call it while another thread is writing to the same fd, exactly as you may have two threads doing write(2) to the same file descriptor (but what happens then is perhaps non-deterministic).
I imagine that the kernel is internally locking the kernel file object referenced by a file descriptor.
BTW, the man page of posix_fadvise tells us:
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.
Hence I guess that the kernel may follow the posix_fadvise later (or not at all)...
So I think you can do that, but I believe you should avoid having several threads working on the same file descriptor, at least for readability reasons (and because of the non-determinism). My feeling is that your code may have some design issues, but perhaps it will work anyway...
Generally, I would avoid having several threads doing I/O on the same file descriptor (or at the very least, use pwrite(2) or lock the I/O with a mutex...). So while you could do what you are asking, I would avoid doing that.
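For instance, with pwrite(2) each thread supplies its own offset, so the shared file position is never touched. A sketch (the fixed-size record layout is an assumption of mine):

#include <unistd.h>   // pwrite()
#include <string>

// Each thread writes record `slot` at its own offset; no shared file position.
void write_slot(int fd, unsigned slot, const std::string& rec)
{
    const off_t kSlotSize = 512;   // assumed fixed-size records
    ::pwrite(fd, rec.data(), rec.size(), static_cast<off_t>(slot) * kSlotSize);
}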
Remember that I/O operations on a disk file system are much, much slower than ordinary computations (they may take many milliseconds). Locking them with a mutex should not be significant, and it will give you more determinism.