I keep reading that asynchronous IO is better than synchronous IO, the reasoning being that with a-sync IO your program can keep running, while with sync IO you're blocked until the operation is finished.
I do not understand this claim, because with sync IO (such as write()) it is the kernel that writes the data to the disk - it doesn't happen by itself. The kernel needs CPU time in order to do it.
So a-sync IO needs that CPU time as well, which might result in a context switch from my application to the kernel. So my application isn't blocked, but CPU cycles are still needed to run the operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (which, as I understand it, also supports zero copy, so that's a benefit)
spdk (supposedly the best, though I don't understand how to use it)
aio
Your understanding is partly right. Which tools you use is a matter of what programming model you prefer, and doesn't determine whether your program will freeze waiting for I/O operations to finish. For certain specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to the systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet. (A minimal sketch of this approach follows the note below.)
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
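To make the first option concrete, here is a minimal sketch of a poll-based event loop in C++ using plain POSIX calls. The two descriptors and the copy-style handling are assumptions made up for the example (and, per the note above, this only makes sense for pipes, sockets, terminals and the like, not regular files):

#include <poll.h>
#include <unistd.h>
#include <cstdio>

// Event-loop sketch: copy data from in_fd to out_fd without ever blocking
// on either one. The pending/offset pair is the explicit state for
// unfinished output that the description above mentions.
int run_event_loop(int in_fd, int out_fd) {
    char buf[4096];
    size_t pending = 0, offset = 0;

    for (;;) {
        struct pollfd fds[2] = {
            { in_fd,  static_cast<short>(pending == 0 ? POLLIN  : 0), 0 },
            { out_fd, static_cast<short>(pending > 0  ? POLLOUT : 0), 0 }
        };
        if (poll(fds, 2, -1) < 0) {              // block until something is ready
            std::perror("poll");
            return -1;
        }
        if ((fds[0].revents & POLLIN) && pending == 0) {
            ssize_t n = read(in_fd, buf, sizeof buf);
            if (n <= 0) return static_cast<int>(n);   // EOF or error ends the loop
            pending = static_cast<size_t>(n);
            offset = 0;
        }
        if ((fds[1].revents & POLLOUT) && pending > 0) {
            ssize_t n = write(out_fd, buf + offset, pending);
            if (n > 0) {                         // a partial write just leaves state behind
                offset += static_cast<size_t>(n);
                pending -= static_cast<size_t>(n);
            }
        }
    }
}

int main() {
    // Example use: pump stdin to stdout through the event loop.
    return run_event_loop(STDIN_FILENO, STDOUT_FILENO);
}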
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event-handler callbacks. I find it harder to work with and harder to reason about than either of the above methods, and it requires more boilerplate code, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts away the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O regardless of which programming model you pick (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
The kernel needs CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this claim, because with sync IO (such as write()) it is the kernel that writes the data to the disk - it doesn't happen by itself. The kernel needs CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For an example; the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when the disk controller has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course often hardware is more advanced (e.g. having an internal queue of operations itself, so driver can tell it to do multiple things and it can start the next operation as soon as it finished the previous operation); and often drivers are more advanced (e.g. having "IO priorities" to ensure that more important stuff is done first rather than just having a simple FIFO queue of pending operations).
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
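To make that multi-task alternative concrete, here is a hedged sketch of the three-stage pipeline using threads and a small blocking queue. fetch_from_deviceA(), process() and write_to_deviceB() are hypothetical stand-ins for the devices in the answer, with trivial stub bodies so the sketch compiles:

#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

using Chunk = std::vector<char>;

// Minimal blocking queue; an empty optional is the "no more data" sentinel.
class Channel {
public:
    void put(std::optional<Chunk> c) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(c)); }
        cv_.notify_one();
    }
    std::optional<Chunk> take() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        auto c = std::move(q_.front());
        q_.pop();
        return c;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<Chunk>> q_;
};

// Hypothetical device operations; trivial stubs so the sketch compiles.
static int chunks_left = 100;
std::optional<Chunk> fetch_from_deviceA() {                 // pretend read from deviceA
    if (chunks_left-- <= 0) return std::nullopt;
    return Chunk(4096, 'a');
}
Chunk process(Chunk in) { for (char &c : in) c ^= 0x5a; return in; }   // pretend CPU work
void write_to_deviceB(const Chunk &) { /* pretend write to deviceB */ }

int main() {
    Channel a_to_cpu, cpu_to_b;

    std::thread reader([&] {                                // keeps deviceA busy
        while (auto c = fetch_from_deviceA()) a_to_cpu.put(std::move(c));
        a_to_cpu.put(std::nullopt);
    });
    std::thread processor([&] {                             // keeps the CPU busy
        while (auto c = a_to_cpu.take()) cpu_to_b.put(process(std::move(*c)));
        cpu_to_b.put(std::nullopt);
    });
    std::thread writer([&] {                                // keeps deviceB busy
        while (auto c = cpu_to_b.take()) write_to_deviceB(*c);
    });

    reader.join(); processor.join(); writer.join();
    return 0;
}

Each stage blocks only on its own device or queue, so under load all three stay busy, which is the utilization argument above; the queues, the extra threads and their stacks are exactly the overhead the answer warns about.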
Context switching is necessary in any case; the kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require much processor work; the limiting factor is the disk's response. The question is whether we wait for that response or do our own work in the meantime.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If you implement synchronous access, your sequence is the following:
1. get information
2. write information
3. goto 1.
So you can't get new information until write() completes. Suppose the information supplier is as slow as the disk you write to; in that case the program will be twice as slow as the asynchronous one.
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of it during the write. Sensors monitoring fast processes are an example of such a source. In that case, you should synchronously read the sensors and asynchronously save the obtained values.
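As an illustration of that last point, here is a hedged sketch using POSIX aio (one of the interfaces listed in the question; link with -lrt on Linux) with two buffers: the program keeps reading "sensor" data into one buffer while the previous buffer is still being written out asynchronously. read_sensor() is a hypothetical data source (a read from stdin stands in for it here), and the file name and buffer size are assumptions for the example:

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

enum { BUF_SIZE = 4096 };

// Hypothetical sensor source; a blocking read from stdin stands in for it.
static size_t read_sensor(char *dst, size_t max) {
    ssize_t n = read(STDIN_FILENO, dst, max);
    return n > 0 ? static_cast<size_t>(n) : 0;
}

int main() {
    int fd = open("samples.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { std::perror("open"); return 1; }

    char bufs[2][BUF_SIZE];
    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;                 // O_APPEND: the kernel handles the offset
    bool write_in_flight = false;
    int cur = 0;

    for (;;) {
        // 1. Read the next chunk while the previous one may still be on its way to disk.
        size_t n = read_sensor(bufs[cur], BUF_SIZE);
        if (n == 0) break;

        // 2. Before reusing the control block, wait for the previous write to finish.
        if (write_in_flight) {
            const struct aiocb *list[1] = { &cb };
            aio_suspend(list, 1, nullptr);
            if (aio_return(&cb) < 0) { std::perror("aio_write"); break; }
        }

        // 3. Start writing the chunk we just read, then switch to the other buffer.
        cb.aio_buf = bufs[cur];
        cb.aio_nbytes = n;
        if (aio_write(&cb) != 0) { std::perror("aio_write"); break; }
        write_in_flight = true;
        cur ^= 1;
    }

    if (write_in_flight) {              // drain the last outstanding write
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, nullptr);
        aio_return(&cb);
    }
    close(fd);
    return 0;
}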
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal have the most benefit. Using asynchronous IO allows them to do useful work while waiting for the IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(And "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or for a network client to send a command.)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.
Related
I need to make different GET queries to a server to download a bunch of JSON files, write each download to disk, and I want to launch some threads to speed that up.
Each download and write of a file takes approximately 0.35 seconds.
I would like to know whether, under Linux at least (and under Windows since we are here), it is safe to write to disk in parallel, and how many threads I can launch taking into account the waiting time of each thread.
If it changes anything (I actually think so), the program doesn't write directly to disk. It just calls std::system to run wget, because that is currently easier than importing a library. So the waiting time is the time the system call takes to return.
So each write to disk is performed by a different process. I only wait for that program to finish, and I'm not actually bound by I/O but by the running time of an external process (each wget call creates and writes to a different file, so they are completely independent processes). Each thread just waits for one call to complete.
My machine has 4 CPUs.
Some kind of formula to get an ideal number of threads according to CPU concurrency and "waiting time" per thread would be welcome.
NOTE: The ideal approach would of course be to do some performance testing, but I could get banned from the server if I abuse it with too many requests.
It is safe to do concurrent file I/O from multiple threads, but if you are concurrently writing to the same file, some form of synchronization is necessary to ensure that the writes to the file don't become interleaved.
For what you describe as your problem, it is perfectly safe to fetch each JSON blob in a separate thread and write them to different, unique files (in fact, this is probably the sanest, simplest design). Given that you mention running on a 4-core machine, I would expect to see a speed-up well past the four concurrent thread mark; network and file I/O tends to do quite a bit of blocking, so you'll probably run into a bottleneck with network I/O (or on the server's ability to send) before you hit a processing bottleneck.
Write your code so that you can control the number of threads that are spawned, and benchmark different numbers of threads. I'll guess that your sweet spot will be somewhere between 8 and 16 threads.
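As a rough illustration of "control the number of threads", here is a hedged sketch of a fixed pool of worker threads that each run wget via std::system, as the question describes. The URL list, output file names and thread count are assumptions for the example:

#include <atomic>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::vector<std::string> urls = {
        "https://example.com/a.json",
        "https://example.com/b.json",
        // ...
    };

    const unsigned num_threads = 8;          // benchmark 4, 8, 16, ... and compare
    std::atomic<size_t> next{0};             // index of the next URL to fetch

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                size_t i = next.fetch_add(1);          // claim one URL
                if (i >= urls.size()) break;
                // Each call writes to its own output file, so no synchronization
                // between threads is needed for the files themselves.
                std::string cmd = "wget -q -O out" + std::to_string(i) + ".json '"
                                  + urls[i] + "'";
                std::system(cmd.c_str());
            }
        });
    }
    for (auto &w : workers) w.join();
    return 0;
}

The sweet spot depends mostly on how long each wget call blocks, so benchmarking a few values of num_threads, as suggested above, is the practical approach.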
We know that a synchronous logger writes the log message to the file and then continues program execution, while an asynchronous logger queues the log messages and writes them in a separate thread. I'm starting to use Log4CPlus in my project and a couple of things came to mind.
I can't initialize more LogObjects, because that would open more file handles, and we don't need that. (I know we should use feature-based logging objects, e.g. UploadLogObj, DownloadLogOb, WebReqLogObj, AuthLogObj, etc.) I suppose each additional log object may add a logging thread too.
Still, for argument's sake, if I use a single log object and push log messages from multiple threads, I suppose there must be some mutex lock protecting writes to the message queue. My question: won't this mutex lock slow down the process and create a performance issue?
I'm just wondering how asynchronous loggers work. I could look into the code, that's one way, but I hope the answers will be enlightening to a lot of people.
Yes, the mutex will slow down the process a bit, but if you are logging from multiple threads to the same destination you will need some form of synchronization anyway, since you don't want lines from different threads to be mixed up.
In the end it's a matter of deciding where to synchronize, not if. With asynchronous logging this happens when the object to be logged is pushed to the queue of the logging thread. In the synchronous case probably at the time the line is written (though it depends on the implementation).
In the first case the time spent inside the mutex will be much shorter and predictable, since no disk flushes happens while in the mutex. This means that you may have less performance degradation and better scaling than in the second case (plus the time that you didn't spend writing the actual data, because the other thread is taking care of it).
If you don't have a lot of threads competing for the mutex, it won't be a problem. I had the chance to write and use an asynchronous logger for a real-time system some time ago, and we hit disk-bandwidth-related issues long before synchronization issues.
One downside of asynchronous logging is more memory related: since you need to pass the data to be logged around you need to be careful and avoid unneeded allocations/deallocations.
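To illustrate where the synchronization happens in the asynchronous case, here is a hedged sketch of a minimal async logger: producers hold the mutex only long enough to push a string, and a single background thread does the slow file writes. This is a generic sketch, not Log4CPlus's actual implementation:

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const std::string &path)
        : out_(path, std::ios::app), worker_(&AsyncLogger::run, this) {}

    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();                       // drains any remaining messages
    }

    void log(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(m_);   // short, predictable critical section
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [&] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string msg = std::move(q_.front());
                q_.pop();
                lk.unlock();                  // do the slow write outside the mutex
                out_ << msg << '\n';
                lk.lock();
            }
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
    std::thread worker_;
};

Multiple threads can call log() concurrently; the contended region is just a queue push, which is the "shorter and more predictable time inside the mutex" mentioned above.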
Locking a mutex takes something like 40-60 nanoseconds on modern hardware (if the mutex is not already held by another thread). That is nothing compared to an IO operation, which can take seconds when writing a file to a slow HDD or a network drive.
Lock-free is a different thing - in that case you don't even have mutexes. However, there is a price for it: you'll have to write more complicated code.
From what I understand, you write your Linux daemon to listen for requests in an endless loop.
Something like..
int main() {
    while (1) {
        // do something...
    }
}
ref: http://www.thegeekstuff.com/2012/02/c-daemon-process/
I read that sleeping a program makes it go into waiting mode so it doesn't eat up resources.
1. If I want my daemon to check for a request every second, would the following be resource consuming?
int main() {
    while (1) {
        if (request) {
            // do something...
        }
        sleep(1);
    }
}
2. If I were to remove the sleep, does that mean the CPU consumption will go up to 100%?
3. Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop, or just sleep(1)?
Endless loops and CPU resources are a mystery to me.
Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop, or just sleep(1)?
There is a better option.
You can just use a semaphore, which remains blocked at the beginning of the loop, and signal the semaphore whenever you want the loop body to execute.
Note that this will not consume CPU while waiting.
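Here is a hedged sketch of that idea using a POSIX semaphore (compile with -pthread); handle_request() and the caller of notify_request() are assumptions for the example - in a real daemon the sem_post() would come from whatever code receives the request:

#include <semaphore.h>
#include <cstdio>

static sem_t request_sem;

static void handle_request() {
    std::puts("handling request");       // hypothetical request handler
}

// Called from whatever code receives a request (another thread, a signal
// handler, ...). sem_post is async-signal-safe.
void notify_request() {
    sem_post(&request_sem);
}

int main() {
    sem_init(&request_sem, /*pshared=*/0, /*value=*/0);
    for (;;) {
        sem_wait(&request_sem);   // sleeps in the kernel, using no CPU, until notified
        handle_request();
    }
}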
The poll and select calls (mentioned by Basile Starynkevitch in a comment) or a semaphore (mentioned by Als in an answer) are the correct ways to wait for requests, depending on circumstances. On operating systems without poll or select, there should be something similar.
Neither sleep, YieldProcessor, nor sched_yield are proper ways to do this, for the following reasons.
YieldProcessor and sched_yield merely move the process to the end of the runnable queue but leave it runnable. The effect is that they allow other processes at the same or higher priority to execute, but, when those processes are done (or if there are none), then the process that called YieldProcessor or sched_yield continues to run. This causes two problems. One is that lower priority processes still will not run. Another is that this causes the processor to be always running, using energy. We would prefer the operating system to recognize when no process needs to be running and to put the processor into a low-power state.
sleep may permit this low-power state, but it plays a guessing game about how long it will be until the next request comes in, it wakes the processor repeatedly when there is no need, and it makes the process less responsive to requests, since the process will continue sleeping until the expiration of the requested time even if there is a request to be serviced.
The poll and select calls are designed for exactly this situation. They tell the operating system that this process wants to service a request coming in on one of its I/O channels but otherwise has no work to do. This allows the operating system to mark the process as not runnable and to put the processor in a low-power state if suitable.
Using a semaphore provides the same behavior, except that the signal to wake the process comes from another process raising the semaphore instead of activity arising in an I/O channel. Semaphores are suitable when the signal to do some work arrives in this way; simply use whichever of poll or a semaphore is more appropriate for your situation.
The criticism that poll, select, or a semaphore causes a kernel-mode call is irrelevant, because the other methods also cause kernel-mode calls. A process cannot sleep on its own; it has to call the operating system to request it. Similarly, YieldProcessor and sched_yield make requests to the operating system.
The short answer is yes -- removing sleep gives 100% CPU -- but the answer does depend on some additional details. It consumes all CPU it can get, unless...
The loop body is trivial, and optimised away.
The loop contains a blocking operation (like a file or network operation). The link you provide suggests to avoid this, but it is often a good idea to block until something relevant happens.
EDIT : For your scenario, I support the suggestion made by #Als.
EDIT 2: I expect this answer has received a -1 because I claim blocking operations can actually be a good idea. [If you -1, you should leave a motivation in a comment so that we all may learn something.]
Current popular thinking is that non-block (event-based) IO is good and blocking is bad. This view is oversimplified because it assumes all software that performs IO can improve throughput by using non-blocking operations.
What? Am I really suggesting that using non-blocking IO can actually reduce throughput? Yes it can. When a process serves a single activity it is actually better to use blocking IO because blocking IO only burns resources that have already been paid for in the existence of the process.
In contrast, non-blocking IO can carry a greater fixed overhead than simple blocking IO. If the process isn't able to supply additional IO that can be interleaved, then there is nothing gained by paying for the non-blocking setup. (In practice, the greatest cost of inappropriate non-blocking IO is simply the added code complexity. Beyond that, this topic is largely a thought exercise.)
Under blocking IO we rely upon the operating system to schedule those processes that can make progress. That's what the OS is designed to do.
Under non-blocking IO we have greater setup costs but can share the resources of the process and its threads between interleaved work. Non-blocking IO is therefore ideal for any process that serves multiple independent activities, such as a web server. The throughput gained vastly exceeds the fixed cost overheads of non-blocking IO.
I am reading a single data item from a UDP port. It's essential that this read be the lowest latency possible. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card, and the callback method being invoked in my user code?
Boost is a very good library, but quite generic, is there a lower latency alternative?
All opinions on writing low-latency UDP network programs are very welcome.
EDIT: Another question, is there a relatively feasible way to estimate the latency that I'm experiencing between NIC and user mode?
Your latency will vary, but it will be far from the best you can get. Here are a few things that will stand in your way to better latency:
Boost.ASIO
It constantly allocates/deallocates memory to store "state" in order to invoke a callback function associated with your read operation.
It does unnecessary mutex locking/unlocking in order to support a broken mix of async and sync approaches.
Worst of all, it constantly adds and removes event descriptors from the underlying notification mechanism.
All in all, asio is a good library for high-level application developers, but it comes with a big price tag and a lot of CPU cycle eating gremlins. Another alternative is libevent, it is a lot better, but still aims to support many notification mechanisms and be platform-independent. Nothing can beat native mechanisms, i.e. epoll.
Other things
The UDP stack. It doesn't do a very good job for latency-sensitive applications. One of the most popular solutions is OpenOnload. It bypasses the stack and works directly with your NIC.
The scheduler. By default, the scheduler is optimized for throughput, not latency. You will have to tweak and tune your OS to make it latency oriented. Linux, for example, has a lot of "rt" patches for that purpose.
Watch out not to sleep. Once your process is sleeping, you will never get a good wakeup latency compared to constantly burning CPU and waiting for a packet to arrive.
Interference with other IRQs, processes etc.
I cannot give you exact numbers, but assuming that you won't be getting a lot of traffic, using Boost and a regular Linux kernel on regular hardware, your latency will range somewhere between ~50 microseconds and ~100 milliseconds. It will improve a bit as you get more data, then start dropping after some point, and it will always fluctuate. I'd say that if you are OK with those numbers, don't bother optimizing.
I think that using recv() in a "spin" loop thread and pinning the thread to a single CPU core (processor affinity) gives lower latency than using select(); in my tests the precision of select() varied from 1 to 10 microseconds, while the spin loop was at about 1 microsecond.
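A hedged sketch of that approach on Linux: a non-blocking UDP socket polled in a busy loop from a thread pinned to one core with pthread_setaffinity_np (compile with -pthread). The port number and core index are arbitrary assumptions, and busy-waiting trades a whole core for latency, as the earlier answer about sleeping points out:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <thread>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

int main() {
    // Error handling kept minimal; this is an illustrative sketch only.
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);              // arbitrary example port
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, reinterpret_cast<sockaddr *>(&addr), sizeof addr);
    fcntl(fd, F_SETFL, O_NONBLOCK);           // recv() returns immediately if no data

    std::thread rx([fd] {
        pin_to_core(2);                       // arbitrary example core
        char buf[2048];
        for (;;) {                            // spin: burns this core, minimizes wakeup latency
            ssize_t n = recv(fd, buf, sizeof buf, 0);
            if (n > 0) {
                // handle the datagram as fast as possible here
            } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                std::perror("recv");
                break;
            }
        }
    });
    rx.join();
    close(fd);
    return 0;
}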
I'm currently doing a simulation of a hard disk drive IOs in C++, and I'm using pthread threads and a mutex to do the reading on the disk.
However, I'm trying to optimize the reading time by ordering my threads. The problem is that if my disk is currently reading a sector and a bunch of read requests arrive, any of them may be executed next. What I want is to order them so that the request for the closest sector is executed next.
This way, the head of the virtual hard disk drive won't move excessively.
My question is : Is using Linux process priority system a good way to make sure that the closest reading request will be executed before the others? If not, what could I rely on to do this?
PS: Sorry for my english.
Thanks for your help.
It is very rarely a good idea to rely on the exact behaviour of process priority schemes, especially on a general-purpose operating system like Linux, because they don't really guarantee you any particular behaviour. Making something the very highest priority won't help if it references some address in memory or makes some I/O call that causes it to be held up for an instant - the operating system will then run some lower-priority process instead, and you will be unpleasantly surprised.
If you want to be sure of the order in which disk I/O requests are completed, or to simulate this, you could create a thread that keeps a list of pending I/O and asks for the requests to be executed one at a time, in an order it controls.
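A hedged sketch of that suggestion: a scheduler object that holds the pending requests and, each time the simulated disk is free, picks the one whose sector is closest to the current head position (shortest-seek-time-first). The Request struct and the sector model are assumptions for the example:

#include <condition_variable>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

struct Request {
    long sector;                     // hypothetical target sector
    // ... payload, completion callback, etc.
};

class SstfScheduler {
public:
    void submit(Request r) {
        std::lock_guard<std::mutex> lk(m_);
        pending_.push_back(r);
        cv_.notify_one();
    }

    // Run by the single "disk" thread: repeatedly pick the closest request.
    void run() {
        long head = 0;               // current head position (in sectors)
        for (;;) {
            Request next;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return !pending_.empty(); });
                // Choose the request with the smallest seek distance.
                size_t best = 0;
                for (size_t i = 1; i < pending_.size(); ++i)
                    if (std::labs(pending_[i].sector - head) <
                        std::labs(pending_[best].sector - head))
                        best = i;
                next = pending_[best];
                pending_.erase(pending_.begin() + best);
            }
            head = next.sector;
            // ... simulate the read of 'next' here (seek time ~ distance) ...
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<Request> pending_;
};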
The I/O schedulers in the Linux kernel can re-order and coalesce reads (and to some extent writes) so that their ordering is more favorable for the disk, just like you are describing. This affects the process scheduler (which takes care of threads too) in that the threads waiting for I/O also get "re-ordered" - their read or write requests complete in the order in which the disk served them, not in the order in which they made their request. (This is a very simplified view of what really happens.)
But if you're simulating disk I/O, i.e. if you're not actually doing real I/O, the I/O scheduler isn't involved at all. Only the process scheduler. And the process scheduler has no idea that you're "simulating" a hard disk - it has no information about what the processes are doing, just information about whether or not they're in need of CPU resources. (Again this is a simplified view of how things work).
So the process scheduler will not help you in re-ordering or coalescing your simulation of read requests. You need to implement that logic in your code. (Reading about I/O schedulers is a great idea.)
If you do submit real I/O, then doing the re-ordering yourself could improve performance in some situations, and indeed the I/O scheduler's algorithms for optimizing throughput or latency will affect the way your threads are scheduled (for blocking I/O anyway - asynchronous I/O makes it a bit more complicated still).