From what I understand, you write your Linux daemon to listen for requests in an endless loop.
Something like:
int main() {
    while (1) {
        // do something...
    }
}
ref: http://www.thegeekstuff.com/2012/02/c-daemon-process/
I read that sleeping a program makes it go into waiting mode so it doesn't eat up resources.
1. If I want my daemon to check for a request every second, would the following be resource-consuming?
int main() {
    while (1) {
        if (request) {
            // do something...
        }
        sleep(1);
    }
}
2. If I were to remove the sleep, does that mean CPU consumption will go up to 100%?
3. Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop back on itself, or just sleep(1)?
Endless loops and CPU resources are a mystery to me.
Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop back on itself, or just sleep(1)?
There is a better option.
You can just use a semaphore, which remains blocked at the beginning of the loop; you signal the semaphore whenever you want the loop body to execute.
Note that waiting on the semaphore will not consume any CPU.
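For example, a minimal sketch (assuming POSIX semaphores; the names work_available and worker_loop are just for illustration) of a loop that blocks on a semaphore and uses no CPU until it is signalled:

#include <semaphore.h>

static sem_t work_available;            // initialise once with sem_init(&work_available, 0, 0)

void* worker_loop(void*) {
    for (;;) {
        sem_wait(&work_available);      // sleeps here, zero CPU, until another thread signals
        // ...handle one request...
    }
    return nullptr;                     // never reached
}

// Whoever produces a request wakes the loop with:
//     sem_post(&work_available);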
The poll and select calls (mentioned by Basile Starynkevitch in a comment) or a semaphore (mentioned by Als in an answer) are the correct ways to wait for requests, depending on circumstances. On operating systems without poll or select, there should be something similar.
Neither sleep, YieldProcessor, nor sched_yield are proper ways to do this, for the following reasons.
YieldProcessor and sched_yield merely move the process to the end of the runnable queue but leave it runnable. The effect is that they allow other processes at the same or higher priority to execute, but, when those processes are done (or if there are none), then the process that called YieldProcessor or sched_yield continues to run. This causes two problems. One is that lower priority processes still will not run. Another is that this causes the processor to be always running, using energy. We would prefer the operating system to recognize when no process needs to be running and to put the processor into a low-power state.
sleep may permit this low-power state, but it plays a guessing game about how long it will be until the next request comes in, it wakes the processor repeatedly when there is no need, and it makes the process less responsive to requests, since the process will continue sleeping until the expiration of the requested time even if there is a request to be serviced.
The poll and select calls are designed for exactly this situation. They tell the operating system that this process wants to service a request coming in on one of its I/O channels but otherwise has no work to do. This allows the operating system to mark the process as not runnable and to put the processor in a low-power state if suitable.
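As a minimal sketch (assuming listen_fd is a socket set up elsewhere; the function name is hypothetical), this is roughly what a poll-based wait looks like: the process sleeps inside poll() with no CPU use until a request arrives.

#include <poll.h>

void serve_requests(int listen_fd) {
    struct pollfd pfd;
    pfd.fd = listen_fd;
    pfd.events = POLLIN;

    for (;;) {
        int ready = poll(&pfd, 1, -1);              // -1 timeout: wait indefinitely, not runnable
        if (ready > 0 && (pfd.revents & POLLIN)) {
            // ...accept/read and service the request...
        }
    }
}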
Using a semaphore provides the same behavior, except that the signal to wake the process comes from another process raising the semaphore instead of activity arising in an I/O channel. Semaphores are suitable when the signal to do some work arrives in this way; simply use whichever of poll or a semaphore is more appropriate for your situation.
The criticism that poll, select, or a semaphore causes a kernel-mode call is irrelevant, because the other methods also cause kernel-mode calls. A process cannot sleep on its own; it has to call the operating system to request it. Similarly, YieldProcessor and sched_yield make requests to the operating system.
The short answer is yes -- removing the sleep gives 100% CPU -- but the answer does depend on some additional details. It consumes all the CPU it can get, unless:
The loop body is trivial, and optimised away.
The loop contains a blocking operation (like a file or network operation). The link you provide suggests avoiding this, but it is often a good idea to block until something relevant happens.
EDIT: For your scenario, I support the suggestion made by Als.
EDIT 2: I expect this answer has received a -1 because I claim blocking operations can actually be a good idea. [If you -1, you should leave a motivation in a comment so that we all may learn something.]
Current popular thinking is that non-blocking (event-based) IO is good and blocking is bad. This view is oversimplified because it assumes all software that performs IO can improve throughput by using non-blocking operations.
What? Am I really suggesting that using non-blocking IO can actually reduce throughput? Yes it can. When a process serves a single activity it is actually better to use blocking IO because blocking IO only burns resources that have already been paid for in the existence of the process.
In contrast, non-blocking IO can carry a greater fixed overhead than simple blocking IO. If the process isn't able to supply additional IO that can be interleaved, then there is nothing gained by paying for the non-blocking setup. (In practice, the greatest cost of inappropriate non-blocking IO is simply the added code complexity. Beyond that, this topic is largely a thought exercise.)
Under blocking IO we rely upon the operating system to schedule those processes that can make progress. That's what the OS is designed to do.
Under non-blocking IO we have greater setup costs but can share the resources of the process and its threads between interleaved work. Non-blocking IO is therefore ideal for any process that serves multiple independent activities, such as a web server. The throughput gained vastly exceeds the fixed overheads of non-blocking IO.
Related
I keep reading about why asynchronous IO is better than synchronous IO, which is because in a-sync IO your program can keep running, while in sync IO you're blocked until the operation is finished.
I do not understand this saying because using sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
So in a-sync IO it needs it as well, which might result in a context switch from my application to the kernel. So it's not really blocking, but CPU cycles still need to be spent to run this operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand it supports zero copy as well, which is a benefit)
spdk (should be best, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use are a matter of what programming model you prefer, and don't determine whether your program will freeze waiting for I/O operations to finish. For certain, specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
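As a minimal sketch of the second option (one thread per I/O channel with plain blocking reads; handle_connection and conn_fd are hypothetical names), note how the partial-input state simply lives in the thread's local variables:

#include <thread>
#include <unistd.h>

void handle_connection(int conn_fd) {
    char buf[4096];
    ssize_t n;
    while ((n = read(conn_fd, buf, sizeof buf)) > 0) {   // blocks; thread uses no CPU while idle
        // ...process n bytes; any partial state persists in these locals...
    }
    close(conn_fd);
}

// For each accepted connection:
//     std::thread(handle_connection, conn_fd).detach();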
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about, than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O regardless of which programming model (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
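For that single-input/single-output case, "just read and write" can be as simple as the following sketch (a plain blocking copy loop; the function name is made up for illustration):

#include <unistd.h>

int copy_stream(int in_fd, int out_fd) {
    char buf[65536];
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof buf)) > 0) {   // blocks until data is available
        ssize_t off = 0;
        while (off < n) {                              // write() may be partial, so loop
            ssize_t w = write(out_fd, buf + off, n - off);
            if (w < 0) return -1;
            off += w;
        }
    }
    return n < 0 ? -1 : 0;
}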
The kernel does need CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this saying because using sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For an example; the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when the disk controller has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course often hardware is more advanced (e.g. having an internal queue of operations itself, so driver can tell it to do multiple things and it can start the next operation as soon as it finished the previous operation); and often drivers are more advanced (e.g. having "IO priorities" to ensure that more important stuff is done first rather than just having a simple FIFO queue of pending operations).
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
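For illustration, a minimal sketch of that multi-task alternative (the stage functions read_from_deviceA, process and write_to_deviceB, and the Block type, are hypothetical): three threads connected by small blocking queues, so reading, processing and writing overlap.

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class Channel {                                  // tiny blocking queue between two stages
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> q;
public:
    void put(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T take() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); }); // sleeps until a value arrives
        T v = std::move(q.front()); q.pop();
        return v;
    }
};

// Wiring the three stages together (std::thread objects; shutdown handling omitted):
//     Channel<Block> raw, cooked;
//     std::thread a([&] { for (;;) raw.put(read_from_deviceA()); });
//     std::thread b([&] { for (;;) cooked.put(process(raw.take())); });
//     std::thread c([&] { for (;;) write_to_deviceB(cooked.take()); });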
Context switching is necessary in any case; the kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require a lot of processor work. The limiting factor is the disk response. The question is whether we will wait for this response or do our own work in the meantime.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If you implement a synchronous access, your sequence is following:
get information
write information
goto 1.
So, you can't get the next piece of information until write() completes. Suppose the information supplier is as slow as the disk you write to. In this case the synchronous program will be twice as slow as the asynchronous one.
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of the information during the write. Examples of such information sources could be sensors for fast processes. In this case, you should read the sensors synchronously and save the obtained values asynchronously.
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal have the most benefit. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(And "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or a network client to send a command.)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.
I have an async API which wraps some IO library. The library uses C-style callbacks; the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build this API. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, more precisely, the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know to sync two threads, I just switched to using raw futexes, which somewhat improved the situation, but nothing drastic. The performance hovers somewhere around 200k futex WAKEs per second. Then I stumbled upon this article - Futex Scaling for Multi-core Systems - which quite matches the effect I observe with futexes. My question is, since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: user space, a simple synchronization primitive, shared between two threads only; only one thread sets the completion, only one thread waits for the completion.
EDIT001:
What if... Previously I said no spinning in busy wait. But the futex already spins in a busy wait, right? However, its implementation covers a more general case, which requires a global hash table to hold the futexes, queues for all subscribers, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics, no global data structures, and busy-wait on it like the futex already does?
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
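If you want to try that, here is a minimal sketch of pinning both threads to one core on Linux (using pthread_setaffinity_np; core 0 and the thread names are arbitrary choices for illustration):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                              // for pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

static void pin_to_core(pthread_t t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);   // restrict the thread to one core
}

// Usage, assuming waker and wakee are std::thread objects:
//     pin_to_core(waker.native_handle(), 0);
//     pin_to_core(wakee.native_handle(), 0);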
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter? Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)
I often see Sleep(N) after a thread starts, or sometimes I see Thread::Sleep(N);, where N is in milliseconds. Is it meant to put the current thread to sleep so that another thread can start?
I appreciate any response
Use of the sleep() function (and its friends) usually indicates a design flaw. The rare exceptions are sleeps used in debugging.
The common misguided usages of sleep include an attempt to time events (bad, because nothing guarantees that sleep will take exactly as long as prescribed on non-RT systems), an attempt to wait for some event (bad, because to wait for an event you should use the specific waiting functions available in your threading library, as in the sketch below), or an attempt to yield resources (bad, because if you have nothing to do, just exit the thread).
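For example, a minimal sketch of waiting for an event with the standard library's waiting primitives instead of polling with sleep (all names are hypothetical):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool event_happened = false;

void waiter() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return event_happened; });   // blocks until notified, no guessed sleep time
    // ...react to the event...
}

void notifier() {
    { std::lock_guard<std::mutex> lk(m); event_happened = true; }
    cv.notify_one();
}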
Sleep puts the thread in non-runnable state for the specified amount of time or until the process is woken up by a signal.
When a thread is in non-runnable state, the OS scheduler won't schedule the thread into the run queue, and the OS forces a thread/context switch so that another runnable task (thread/process) can run instead.
As far as I know, we use sleep() in the cases below:
1) Simulation. When you need to simulate some situation to test your code, you may use sleep().
For example, you are designing a module which is to be a server. Now you need to test your server with a case where a client sends a heavy request that takes 5 seconds. For this test, you don't need a real client; all you need is to simulate a client with sleep(5000).
2) Give other threads a chance to execute, as you mentioned. But please note: sleep() does not release any locks the thread is holding.
3) Save CPU resources.
For example, with a non-blocking socket, you might write code like this:
while (true)
{
    usleep(200 * 1000);                       // sleep 200 ms between polls
    res = accept(mysocket, NULL, NULL);       // mysocket was set O_NONBLOCK earlier
    if (res >= 0)
    {
        // got a connection/message: do something with it
    }
    else
    {
        // nothing pending (EAGAIN/EWOULDBLOCK): do something else, or just loop
    }
}
This is non-blocking mode: when it executes accept, it checks immediately whether there is anything pending. If there is nothing, it continues.
In this case, if we don't add a sleep() like the one above, this code will consume a lot of CPU for nothing (it is just an infinite loop). So we add a sleep() so that other threads or other processes can use the CPU. In other words, the CPU is used more efficiently.
By the way, the network card (and the kernel) buffers incoming data, so if a client sends a message while the code above is sleeping, it isn't a problem: when the sleep ends, the message is still there in the buffer and the code can get it. But if lots of clients send messages during that interval, the code can miss messages, because the sleep is too long to process them quickly. In short, you must make sure the received messages can't fill up the buffer within one sleep interval; otherwise, sleep for a shorter time or process messages faster.
Doing Thread::Sleep(N); only delays the program's execution, normally to wait for some external resource to become available. This was a very common approach in low-level/older environments such as assembly, where there was no OOP and programs ran sequentially. Nowadays we have interfaces, callbacks, multithreading, and notifications, and it is just a bad habit to use such techniques.
Thread sleep functions are usually meant to block a particular thread's execution for at least the provided time, usually in order to let other threads run without any kind of "interference" from it. It may block for more than the provided time due to scheduling or resource contention delays. It does not, however, make the sleeping thread release its locks, if any! They are frequently used for debugging purposes.
Sleep(0) will give up the current time slice to another process if there is a process with a greater priority that is ready to be scheduled. If there is no such process, the call returns immediately and the current process continues.
Sleep(n) where n > 0 will unconditionally give up the current time slice to another process.
We know that synchronous logging writes the log message to the file and then continues with program execution, while asynchronous loggers queue the log messages and write them in a separate thread. I'm starting to implement Log4CPlus in my project, and a couple of things came to my mind.
I can't initialize more log objects, because that will open more file handles and we don't need that. (I know we should use feature-based logging objects, for example UploadLogObj, DownloadLogOb, WebReqLogObj, AuthLogObj, etc.) I expect each additional log object may add a logging thread too.
Still, for argument's sake, if I use a single log object and push log messages from multiple threads, I suppose there must be some mutex lock protecting writes to the message queue. My question: won't this mutex lock slow down the process? Won't it create a performance issue?
I'm just wondering how asynchronous loggers work. I could look into the code, that's one way, but I hope the answers will be enlightening to a lot of people.
Yes, the mutex will slow down the process a bit, but if you are logging from multiple threads to the same destination you will need some form of synchronization anyway, since you don't want lines from different threads to be mixed up.
In the end it's a matter of deciding where to synchronize, not if. With asynchronous logging this happens when the object to be logged is pushed to the queue of the logging thread. In the synchronous case probably at the time the line is written (though it depends on the implementation).
In the first case the time spent inside the mutex will be much shorter and more predictable, since no disk flushes happen while holding the mutex. This means that you may have less performance degradation and better scaling than in the second case (plus the time that you didn't spend writing the actual data, because the other thread is taking care of it).
If you don't have a lot of threads competing for the mutex anyway, it won't be a problem. I had the chance to write and use an asynchronous logger for a real-time system some time ago, and we reached disk-bandwidth-related issues long before synchronization issues.
One downside of asynchronous logging is more memory related: since you need to pass the data to be logged around you need to be careful and avoid unneeded allocations/deallocations.
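To make the hand-off concrete, here is a minimal sketch of an asynchronous logger (not Log4CPlus itself; the class and file name are made up, and shutdown/flush handling is omitted): producers hold the mutex only long enough to push a string, and the background thread does the slow file writes outside the lock.

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> q;
    std::ofstream out{"app.log"};                 // hypothetical destination
    std::thread writer{[this] { run(); }};        // background writer thread (never joined here)

    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty(); });
            std::string line = std::move(q.front()); q.pop();
            lk.unlock();                          // the disk write happens outside the mutex
            out << line << '\n';
        }
    }

public:
    void log(std::string line) {                  // called from any thread
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(line)); }
        cv.notify_one();
    }
};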
A mutex lock takes something like 40-60 nanoseconds (if the mutex is not locked by another thread) on modern hardware. This is nothing compared to an IO operation, which in theory can take a few seconds to write a file to a slow HDD or network drive.
Lock-free is a different thing - in this case you don't even have mutexes. However, there is a price for it: you'll have to write more complicated code.
I have a UDP network application that reads packets sent to it and then processes them (same thread). The reads are non-blocking so I'm not using poll or select.
Packets received are grouped by sessions.
Work is governed by whether there is a session in progress. If there is no work to be done, i.e. there are no sessions or no packets to process, then I need to spin.
I've been looking at the hybrid algorithm found here:
http://www.1024cores.net/home/lock-free-algorithms/tricks/spinning
Been playing with it. I'm told it's more for busy waits. What methods do you use to prevent unnecessary processing and needlessly high CPU usage?
EDIT:
Thanks for all the answers and comments.
I'm now doing the following. When it comes to reading from the network, I look to see if there is other work to be done. If there is, I call poll with a timeout of zero, then read as many packets as I can and place them into an in-memory queue for processing. If there is no other work, I poll indefinitely (i.e. with a timeout of -1). It seems to work well: CPU is only high when things are busy, otherwise it drops to zero.
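For reference, a rough sketch of that strategy (have_other_work, drain_socket_into_queue and do_other_work stand in for the application's own logic): poll with a zero timeout when there is queued work, and block indefinitely when idle, so CPU drops to zero.

#include <poll.h>

bool have_other_work();                      // assumed provided by the application
void drain_socket_into_queue(int fd);        // read all currently available packets
void do_other_work();                        // process queued packets / sessions

void run(int sock_fd) {
    struct pollfd pfd;
    pfd.fd = sock_fd;
    pfd.events = POLLIN;

    for (;;) {
        int timeout = have_other_work() ? 0 : -1;   // 0 = just check, -1 = sleep until data arrives
        int ready = poll(&pfd, 1, timeout);
        if (ready > 0 && (pfd.revents & POLLIN)) {
            drain_socket_into_queue(sock_fd);
        }
        do_other_work();
    }
}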
If you have nothing to do, you should be blocking - if not on the socket itself (i.e. if it's an event loop that processes more than one network socket or event type), then on a gate that gets signaled when something happens (the design depends on how your OS does async I/O).
Spinning is something you should only be doing when you're waiting for a very short period of time (usually only in kernel mode).
How many packets per second are you processing? How long does it take to process those packets? If you use blocking threads, what is the average CPU usage you get?
Unless blocking wait is close to 100% usage, where shaving a few bits of performance from the blocking itself can help, spinning will not improve but rather worsen performance. By spinning, you lock one core that will not be available to run other code (possibly including the code that feeds you with work, i.e. the kernel code that reads the network and passes the packets up to your app), and you burn resources without performing any work at all...
Note that when the article says that it is harder to write blocking code than non-blocking spin waits, the author is not talking about operations for which the blocking version is implemented in the system, but rather about situations where one thread must wait on a condition triggered by other threads (a shared variable's value goes above/below a limit, a flag is changed...).
Also, if the cost of checking the condition is high, then spinning will incur that cost for each and every iteration of the loop, and that might well exceed the cost of checking once and performing an expensive wait.
Remember that spinning is an active wait; it does not make sense to ask how to actively wait while not consuming processor time, as the active-wait approach implies consuming it. What can you do to avoid needless CPU usage? Use a blocking call to get the next packet. In the particular case of reading a UDP packet, I doubt that two calls to the non-blocking read are cheaper in processing time than a single call to the blocking read operation.
Again, think about the questions at the beginning, which can be summed up as: Is blocking proven to be the bottleneck? Is this a scenario where active waits can actually help?
Since you have to read from a socket, you can just do a blocking read. Without a packet, you have no reason to be running, right?
If there is more than one socket, then the blocking read won't work, so you need pselect() to monitor multiple descriptors.
Am I missing something obvious?
It occurs to me that you may have some long-term processing after you do receive a datagram. If the reason you are going with non-blocking I/O is to avoid ignoring incoming traffic while working on a session, then in that case the obvious thing to do is to fork() the sessions. (Hmm, so I still think I must be missing something...)