Futex throughput on Linux - C++

I have an async API which wraps an IO library. The library uses C-style callbacks and the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build the API, with signatures like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I tested the implementation I saw that the bottleneck is the future/promise, or more precisely, the futex used to implement promise/future. Since the futex is, AFAIK, user space and the fastest mechanism I know to sync two threads, I switched to raw futexes, which improved the situation somewhat, but not drastically: performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article, Futex Scaling for Multi-core Systems, which matches the effect I observe with futexes quite well. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just a way to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: a user-space, simple synchronization primitive, shared between two threads only, where only one thread sets the completion and only one thread waits for it.
EDIT001:
What if... Previously I said no spinning in busy wait. But a futex implementation already spins in busy wait before sleeping, right? It just covers a much more general case, which requires a global hash table to hold the futexes, wait queues for all waiters, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics, no global data structures, busy waiting on it the way the futex already does?
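For illustration, a minimal sketch of that idea, assuming a completion flag shared by exactly two threads: spin briefly in user space, and only fall back to FUTEX_WAIT when the signal hasn't arrived yet. The class name and spin count are made up, and, unlike the "no atomics" wording above, the shared word does need to be atomic for the hand-off to be correct; there are still no locks and no global data structures.

#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// One word shared by exactly two threads: the IO thread signals completion,
// the caller waits for it.
class FutexEvent {
public:
    void Signal() {
        state_.store(1, std::memory_order_release);
        // Wake at most one waiter sleeping on the word.
        syscall(SYS_futex, &state_, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }

    void Wait() {
        // Spin first: completions arrive within tens of microseconds, so most
        // waits should finish here without ever entering the kernel.
        for (int i = 0; i < kSpinIterations; ++i)
            if (state_.exchange(0, std::memory_order_acquire) == 1)
                return;
        // Fall back to sleeping until Signal() flips the word to 1.
        while (state_.exchange(0, std::memory_order_acquire) != 1)
            syscall(SYS_futex, &state_, FUTEX_WAIT_PRIVATE, 0, nullptr, nullptr, 0);
    }

private:
    static constexpr int kSpinIterations = 4000;  // tuning knob, illustrative value
    std::atomic<std::uint32_t> state_{0};
};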

In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
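For reference, a minimal sketch of the pinning just described (Linux; pthread_setaffinity_np is a GNU extension that glibc exposes when _GNU_SOURCE is defined, which g++ does by default). Core 0 is an arbitrary choice and error handling is omitted.

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core; call this with the same core index at
// the start of both the waker thread and the wakee thread.
void pin_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}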
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter. Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly that, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this has yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, and pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (With TCP you get about 100k pings per second, about half the speed of a futex implementation.) Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the core you are currently on, and the system call to do this has too much overhead.)

Related

Should I just believe that std::thread is not implemented by creating user threads only?

I learned that all user threads mapped to a single kernel thread are blocked if one of them makes a blocking system call, such as an I/O system call.
If std::thread is implemented by creating only a user thread in some environment, then an I/O thread in a program could block a rendering thread.
So I think the user/kernel thread distinction is important, but the C++ standard does not make it.
Then how can I be sure that situations like the above will not occur in a particular environment (like Windows 10)?
I learned that all user threads mapped to a single kernel thread are blocked if one of them makes a blocking system call, such as an I/O system call.
Yes; however, it's rare for anything to use the kernel's system calls directly. Typically programs use a user-space library. For a normally blocking "system" call (e.g. the read() function in a standard C library) the library can emulate it using asynchronous functions (e.g. the aio_read() function in a standard C library) plus a user-space thread switch.
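As an illustration only, a rough sketch of that emulation using POSIX AIO (aio_read/aio_error/aio_return); the yield function stands in for the user-space library's thread switch and is supplied by the caller, it is not a real API.

#include <aio.h>
#include <cerrno>
#include <cstring>
#include <sys/types.h>

// Emulate a blocking read: submit with aio_read(), then keep switching to
// other user threads until the operation completes.
ssize_t emulated_blocking_read(int fd, void* buffer, size_t size, off_t offset,
                               void (*yield_to_other_user_thread)()) {
    struct aiocb cb;
    std::memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buffer;
    cb.aio_nbytes = size;
    cb.aio_offset = offset;

    if (aio_read(&cb) != 0)
        return -1;

    while (aio_error(&cb) == EINPROGRESS)
        yield_to_other_user_thread();  // only this user thread "blocks"

    return aio_return(&cb);
}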
So I think the user/kernel thread distinction is important, but the C++ standard does not make it.
It is important, but for a different reason.
The first problem with user-space threading is that the kernel isn't aware of thread priorities. Imagine a computer running 2 completely separate applications (with the user using "alt+tab" to switch between them), where each application has a high priority thread (for the user interface), a few medium priority threads (for general work), plus a few low priority threads (for doing things like prefetching and pre-calculating stuff in the background). You can end up with a situation where the kernel gives CPU time to one application (which uses that CPU time for low priority threads) because it doesn't know the other application needs CPU time for its higher priority threads.
In other words, for a multi-process environment, user-space threading has a high risk of wasting CPU time doing irrelevant work (in one process) while important work (in another process) waits.
The second problem with user-space threading is that (for modern systems) good scheduling decisions take into account differences between CPUs ("big.LITTLE", hyper-threading, which caches are shared by which CPUs, ...) and power management (e.g. for low priority threads it's reasonable to reduce CPU clock speed to increase battery life and/or reduce CPU temperatures, so the CPUs can run faster for longer when higher priority work needs to be done later). User-space has none of the information needed (and none of the ability to change CPU speeds, etc.) and cannot make good scheduling decisions.
Note that these problems could be "fixed" by having a huge amount of communication between user-space and the kernel (the user-space threading informing the kernel of the priorities of waiting and currently running threads, the kernel informing user-space of CPU differences and power management, etc.); but the whole point of user-space thread switching is to avoid the cost of kernel system calls, so this communication would make user-space thread switching pointless.
Then how can I be sure that situations like the above will not occur in a particular environment (like Windows 10)?
You can't. It's not your decision.
When you choose to use high level abstractions (std::thread in C++ rather than using the kernel directly from assembly language) you are deliberately delegating low level decisions to something else (the compiler and its run-time environment). The advantages (you no longer have to care about these decisions) are the disadvantages (you are no longer able to make these decisions).
Rephrasing my attempt to answer, after talking to the OP and understanding better what is really being asked.
Most I/O operations block at the thread level: if a thread starts one, only that thread is blocked, not the whole process.
The OP seems to intend to start a rendering operation in a thread and doesn't want it to be blocked by an I/O operation in the same thread. Two possible solutions are:
To spawn another thread to do the blocking I/O operation, and let the rendering thread proceed independently of the I/O (a sketch of this follows the list);
To use OS-specific facilities (which don't belong to C++) to start the same I/O operation in an asynchronous, non-blocking form.
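A minimal sketch of the first option, using std::async so the blocking read happens on a worker thread while rendering continues; "scene.dat" and render_frame() are illustrative placeholders, not something from the question.

#include <chrono>
#include <fstream>
#include <future>
#include <iterator>
#include <vector>

int main() {
    // The blocking file read runs on another thread.
    std::future<std::vector<char>> data = std::async(std::launch::async, [] {
        std::ifstream file("scene.dat", std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(file),
                                 std::istreambuf_iterator<char>());
    });

    // Each frame, check without blocking whether the read has finished.
    while (data.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        // render_frame();  // placeholder: one frame of rendering work
    }

    std::vector<char> bytes = data.get();  // the I/O is done; use the result
    (void)bytes;
}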
Lastly, to minimize blocking on OS-level I/O access, what an application developer can do is try to make sure that the same I/O device is not accessed from several places at the same time.
You can be assured that std::thread is not using "user threads" because that concept pretty much died around the turn of the century.
Modern hardware has multiple CPU cores, which work much better if there are sufficient kernel threads. Without enough kernel threads, CPU cores may sit idle.
The idea of "user threads" originated in an era when there was only a single CPU core, and people instead worried about having too many kernel threads.

Reduce Context Switches Between Threads With Same Priority

I am writing an application that uses a third-party library to perform heavy computations.
This library implements parallelism internally and spawns a given number of threads. I want to run several (a dynamic count of) instances of this library, and therefore end up heavily oversubscribing the CPU.
Is there any way I can increase the "time quantum" of all the threads in a process so that e.g. all the threads with normal priority rarely context switch (yield) unless they are explicitly yielded through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the cpu. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this is to perform thread scheduling manually.
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop which resumes/suspends the threads every e.g. 40 ms and makes sure no more threads than the current CPU count are running.
Any major drawbacks with this approach? I'm not sure what the overhead of suspending/resuming a thread is.
There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those. The scheduler only performs an unforced context switch when that has a benefit. The performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and related other tracking information.
More threads generally means more contention (for example, when memory is allocated) which can mean more forced context switches where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.
Any major drawbacks with this approach? I'm not sure what the overhead of suspending/resuming a thread is.
Yes: suspending/resuming a thread from user-mode code is a very dangerous activity, so it should (almost) never be used. Moreover, we should not use these mechanisms to achieve something any modern scheduler already does for us. This is also mentioned in the other answer to this question.
The above applies to any operating system, but from the post's tags it appears the question is about a Microsoft Windows based system. If we read about SuspendThread() on MSDN, we find the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider the scenario in which a thread has acquired some resource implicitly (i.e. not in your own code, but via a library or in kernel mode); if we suspend that thread, the result can be a mysterious deadlock, because other threads of the process end up waiting for that particular resource. The fact is that we can never be sure, at any point in our program, what resources a running thread has acquired, so suspending/resuming threads is not a good idea.

Does an endless while loop take up CPU resources?

From what I understand, you write your Linux daemon to listen for requests in an endless loop.
Something like:
int main() {
    while (1) {
        // do something...
    }
}
ref: http://www.thegeekstuff.com/2012/02/c-daemon-process/
I read that sleeping a program makes it go into waiting mode so it doesn't eat up resources.
1. If I want my daemon to check for a request every second, would the following be resource consuming?
int main() {
    while (1) {
        if (request) {
            // do something...
        }
        sleep(1);
    }
}
2. If I were to remove the sleep, does that mean the CPU consumption will go up to 100%?
3. Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop, or just calls sleep(1).
Endless loops and CPU resources are a mystery to me.
Is it possible to run an endless loop without eating resources? Say, if it does nothing but loop, or just calls sleep(1).
There is a better option.
You can just use a semaphore, which remains blocked at the beginning of the loop; signal the semaphore whenever you want the loop to execute an iteration.
Note that blocking on the semaphore does not consume CPU time.
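For illustration, a minimal sketch of that semaphore-driven loop using a POSIX semaphore; handle_request() and enqueue_request() are placeholders, not real APIs.

#include <semaphore.h>

static sem_t request_sem;  // initialise once with sem_init(&request_sem, 0, 0);

void daemon_loop() {
    while (true) {
        sem_wait(&request_sem);  // blocks here; no busy waiting, no polling
        // handle_request();     // placeholder for the real work
    }
}

void submit_request() {          // called from the producer side
    // enqueue_request(...);     // placeholder: store the actual request
    sem_post(&request_sem);      // wakes daemon_loop() for one iteration
}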
The poll and select calls (mentioned by Basile Starynkevitch in a comment) or a semaphore (mentioned by Als in an answer) are the correct ways to wait for requests, depending on circumstances. On operating systems without poll or select, there should be something similar.
Neither sleep, YieldProcessor, nor sched_yield are proper ways to do this, for the following reasons.
YieldProcessor and sched_yield merely move the process to the end of the runnable queue but leave it runnable. The effect is that they allow other processes at the same or higher priority to execute, but, when those processes are done (or if there are none), then the process that called YieldProcessor or sched_yield continues to run. This causes two problems. One is that lower priority processes still will not run. Another is that this causes the processor to be always running, using energy. We would prefer the operating system to recognize when no process needs to be running and to put the processor into a low-power state.
sleep may permit this low-power state, but it plays a guessing game about how long it will be until the next request comes in, it wakes the processor repeatedly when there is no need, and it makes the process less responsive to requests, since the process will continue sleeping until the expiration of the requested time even if there is a request to be serviced.
The poll and select calls are designed for exactly this situation. They tell the operating system that this process wants to service a request coming in on one of its I/O channels but otherwise has no work to do. This allows the operating system to mark the process as not runnable and to put the processor in a low-power state if suitable.
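For illustration, a minimal poll()-based request loop; request_fd is assumed to be an already open socket or pipe, and handle_request() is a placeholder.

#include <poll.h>
#include <unistd.h>

void serve_requests(int request_fd) {
    struct pollfd pfd;
    pfd.fd = request_fd;
    pfd.events = POLLIN;

    while (true) {
        // Sleeps with no CPU use until data arrives (or an error occurs).
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            char buf[512];
            ssize_t n = read(request_fd, buf, sizeof(buf));
            if (n <= 0)
                break;  // peer closed the channel or an error occurred
            // handle_request(buf, n);  // placeholder for real processing
        }
    }
}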
Using a semaphore provides the same behavior, except that the signal to wake the process comes from another process raising the semaphore instead of activity arising in an I/O channel. Semaphores are suitable when the signal to do some work arrives in this way; simply use whichever of poll or a semaphore is more appropriate for your situation.
The criticism that poll, select, or a semaphore causes a kernel-mode call is irrelevant, because the other methods also cause kernel-mode calls. A process cannot sleep on its own; it has to call the operating system to request it. Similarly, YieldProcessor and sched_yield make requests to the operating system.
The short answer is yes -- removing the sleep gives 100% CPU -- but the answer does depend on some additional details. It consumes all the CPU it can get, unless...
The loop body is trivial, and optimised away.
The loop contains a blocking operation (like a file or network operation). The link you provide suggests to avoid this, but it is often a good idea to block until something relevant happens.
EDIT: For your scenario, I support the suggestion made by @Als.
EDIT 2: I expect this answer has received a -1 because I claim blocking operations can actually be a good idea. [If you -1, you should leave a motivation in a comment so that we all may learn something.]
Current popular thinking is that non-blocking (event-based) IO is good and blocking is bad. This view is oversimplified because it assumes all software that performs IO can improve throughput by using non-blocking operations.
What? Am I really suggesting that using non-blocking IO can actually reduce throughput? Yes it can. When a process serves a single activity it is actually better to use blocking IO because blocking IO only burns resources that have already been paid for in the existence of the process.
In contrast, non-blocking IO can carry a greater fixed overhead than simple blocking IO. If the process isn't able to supply additional IO that can be interleaved, then there is nothing gained by paying for non-blocking setup. (In practice, the greatest cost of inappropriate non-blocking IO is simply the added code complexity. Beyond that, this topic is largely a thought exercise.)
Under blocking IO we rely upon the operating system to schedule those processes that can make progress. That's what the OS is designed to do.
Under non-blocking IO we have greater setup costs but can share the resources of the process and its threads between interleaved work. Non-blocking IO is therefore ideal for any process that serves multiple independent activities, such as a web server. The throughput gained is vastly superior to the fixed cost overheads of non-blocking IO.

Best way to slow down a thread? Is using Sleep() OK?

I've written a C++ library that does some seriously heavy CPU work (all of it math and calculations) and if left to its own devices, will easily consume 100% of all available CPU resources (it's also multithreaded to the number of available logical cores on the machine).
As such, I have a callback inside the main calculation loop that software using the library is supposed to call:
while (true)
{
    // do math here
    callback(percent_complete);
}
In the callback, the client calls Sleep(x) to slow down the thread.
Originally, the client-side code was a fixed Sleep(100) call, but this led to unreliable performance because some machines finish the math faster than others, while the sleep is the same on all machines. So now the client checks the system time, and if more than 1 second has passed (which == several iterations), it will sleep for half a second.
Is this an acceptable way of slowing down a thread? Should I be using a semaphore/mutex instead of Sleep() in order to maximize performance? Is sleeping x milliseconds for each 1 second of processing work fine or is there something wrong that I'm not noticing?
The reason I ask is that the machine still gets heavily bogged down even though taskman shows the process taking up ~10% of the CPU. I've already explored hard disk and memory contention to no avail, so now I'm wondering if the way I'm slowing down the thread is causing this problem.
Thanks!
Why don't you use a lower priority for the calculation threads? That will ensure other threads are scheduled while allowing your calculation threads to run as fast as possible if no other threads need to run.
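A minimal sketch of that suggestion on Windows (which the question targets): each calculation thread lowers its own priority, so it still uses every idle cycle but yields to interactive threads immediately.

#include <windows.h>

// Call at the start of each calculation thread.
void make_current_thread_low_priority() {
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
    // THREAD_PRIORITY_IDLE is gentler still; SetPriorityClass(GetCurrentProcess(),
    // BELOW_NORMAL_PRIORITY_CLASS) lowers the whole process instead.
}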
What is wrong with the CPU at 100%? That's what you should strive for, not try to avoid. These math calculations are important, no? Unless you're trying to avoid hogging some other resource not explicitly managed by the OS (a mutex, the disk, etc) and used by the main thread, generally trying to slow your thread down is a bad idea. What about on multicore systems (which almost all systems will be, going forward)? You'd be slowing down a thread for absolutely no reason.
The OS has a concept of a thread quantum. It will take care of ensuring that no important thread on your system is starved. And, as I mentioned, on multicore systems spiking one thread on one CPU does not hurt performance for other threads on other cores at all.
I also see in another comment that this thread is also doing a lot of disk I/O - these operations will already cause your thread to yield while it's waiting for the results, so the sleeps will do nothing.
In general, if you're calling Sleep(x), there is something wrong/lazy with your design, and if x==0, you're opening yourself up to live locks (the thread calling Sleep(0) can actually be rescheduled immediately, making it a noop).
Sleep should be fine for throttling an app, which from your comments is what you're after. Perhaps you just need to be more precise how long you sleep for.
The only software in which I use a feature like this is the BOINC client. I don't know what mechanism it uses, but it's open-source and multi-platform, so help yourself.
It has a configuration option ("limit CPU use to X%"). The way I'd expect to implement that is to use platform-dependent APIs like clock() or GetSystemTimes(), and compare processor time against elapsed wall clock time. Do a bit of real work, check whether you're over or under par, and if you're over par sleep for a while to get back under.
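As an illustration only, a rough sketch of that idea in portable C++, assuming std::clock() reports per-process CPU time (true on POSIX; on Windows GetProcessTimes()/GetSystemTimes() would be used instead). The 50% example target, the 100 ms sleep, and the work placeholder are arbitrary.

#include <chrono>
#include <ctime>
#include <thread>

// Run work chunks, backing off whenever the CPU-time / wall-time ratio exceeds
// the target (e.g. 0.5 for "limit CPU use to 50%").
void throttled_work(double cpu_fraction_target) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    while (true) {
        // do_some_work_chunk();  // placeholder for the actual computation

        const double cpu_used =
            double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
        const double wall_used = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - wall_start).count();

        // Over par: sleep until the ratio drops back under the target.
        if (wall_used > 0 && cpu_used / wall_used > cpu_fraction_target)
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}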
The BOINC client plays nicely with priorities, and doesn't cause any performance issues for other apps even at 100% max CPU. The reason I use the throttle is that otherwise, the client runs the CPU flat-out all the time, and drives up the fan speed and noise. So I run it at the level where the fan stays quiet. With better cooling maybe I wouldn't need it :-)
Another, not so elaborate, method could be to time one iteration and let the thread sleep for (x * t) milliseconds before the next iteration, where t is the millisecond time for one iteration and x is the chosen sleep-time fraction (between 0 and 1).
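A sketch of this per-iteration throttle; the work placeholder is illustrative.

#include <chrono>
#include <thread>

// After each iteration taking time t, sleep for x * t, with x between 0 and 1.
void throttled_loop(double x) {
    while (true) {
        const auto start = std::chrono::steady_clock::now();
        // do_one_iteration();  // placeholder for the real work
        const auto t = std::chrono::steady_clock::now() - start;
        std::this_thread::sleep_for(
            std::chrono::duration_cast<std::chrono::milliseconds>(t * x));
    }
}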
Have a look at cpulimit. It sends SIGSTOP and SIGCONT as required to keep a process below a given CPU usage percentage.
Even still, WTF at "crazy complaints and outlandish reviews about your software killing PC performance". I'd be more likely to complain that your software was slow and not making the best use of my hardware, but I'm not your customer.
Edit: on Windows, SuspendThread() and ResumeThread() can probably produce similar behaviour.

Allocate more processor cycles to my program

I've been working with Win32, C, and C++ for a while. I code in Visual Studio. Most of the time I see the System Idle Process using most of the CPU. Is there a way to allocate more processor cycles to my program to make it run faster? I understand there might be limitations from I/O; in those cases this question doesn't make sense.
OR
did I misunderstand the Task Manager numbers? I'm confused; please help me out.
And I want to do it in the program itself; btw, I would be happy with answers that are specific to Windows.
Thanks in advance
~calvin
If your program is the only program that has something to do (and is not waiting for IO), its thread will always be assigned to a processor core.
However, if you have a multi-core processor, and a single-threaded program, the CPU usage of your process displayed in the task manager will always be limited by 100/Ncores.
For example, if you have a quad-core machine, your process will be at 25% (using one core), and the idle process at around 75%. You can only get additional CPU power by dividing your tasks into chunks that can be worked on by separate threads, which will then run on the idle cores.
The idle process only "runs" when no other process needs to. If you want to use more CPU cycles, then use them.
If your program is idling, it doesn't do anything, i.e. there is nothing that could be done any faster. So the CPU is probably not the bottleneck in your case.
Are you maybe waiting for data coming from the disk or network?
In case your processor has multiple cores and your program uses only one core to its full extent, making your program multi-threaded could work.
In a multitasking/multithreading OS, processor time is split among threads.
If you want a specific thread to get a bigger chunk of time, you can raise its priority with the SetThreadPriority function, though it is not wise to do so.
Only special software (should) mess with those settings.
It's common for Windows applications to have a low CPU usage percentage (which we see in the Task Manager) because most of the time they just wait for messages.
Use threads to:
abstract away all the I/O waits.
assign work to all cores.
also, remove all sleep-wait states from the main thread.
Defer all I/O to a thread, so that wait states are confined within it. Keep the actual computations in the foreground thread, and use synchronization mechanisms that make the I/O thread wait for your main thread when communicating.
If your CPU is multi-core and your problem is parallelizable, create as many threads as you have cores, research "set affinity" functions to assign them between the cores, and still keep a separate thread for all I/O.
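A minimal sketch of that layout on Windows (which the question is about): one worker per core, each pinned with SetThreadAffinityMask; compute_chunk() is a placeholder for the per-core share of the work, and the separate I/O thread is left unpinned.

#include <windows.h>
#include <thread>
#include <vector>

int main() {
    const unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([i] {
            // Restrict this worker to core i.
            SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << i);
            // compute_chunk(i);  // placeholder for this core's slice of the work
        });
    }
    for (auto& t : workers)
        t.join();
}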
Also be careful not to wait in your main thread: usleep(1) doesn't put you to sleep for 1 microsecond, but for "no less than" that, which may mean anything between 1 ms and 100 ms, hardly ever less than that, and never anything close to a microsecond.