As far as I know, Crystal switches between fibers on IO: if one fiber is waiting for IO, Crystal will switch to another fiber.
What if we spawn two fibers, but one of them does constant computation in a loop with no IO?
For example, with the code below the server doesn't respond to any HTTP requests:
spawn do
  Kemal.run
end

spawn do
  # constant computation/loop with no IO
  some_func
end

Fiber.yield
# or sleep
By default, Crystal uses cooperative multitasking. To implement it, the Crystal runtime provides fibers. Due to their cooperative nature, you have to yield execution from time to time (e.g. with Fiber.yield):
Fibers are cooperative. That means execution can only be drawn from a fiber when it offers it. It can't be interrupted in its execution at random. In order to make concurrency work, fibers must make sure to occasionally provide hooks for the scheduler to swap in other fibers. [...]
When a computation-intensive task has none or only rare IO operations, a fiber should explicitly offer to yield execution from time to time using Fiber.yield to break up tight loops. The frequency of this call depends on the application and concurrency model.
Note that CPU-intensive operations are not the only way to starve other fibers. When calling C libraries that may block, the fiber will also wait for the operation to complete. An example would be a long-polling operation, which waits for the next event or eventually times out (e.g. rd_kafka_poll in Kafka). To prevent that, prefer an async API version (if available), or use a short polling interval (e.g. 0 for the Kafka poll) and shift the sleep operation to the Crystal runtime, so that the other fibers can run.
In 2019, Crystal introduced support for parallelism. By running multiple worker threads, you can also prevent one expensive computation from starving all other operations. However, you have to be careful, as the responsiveness (and maybe even correctness) of the program could then depend on the number of workers (e.g. with only one worker, it will still hang). Overall, yielding occasionally in long-running operations seems to be the better solution, even if you end up using multiple workers for the improved performance on multi-core machines.
Related
I want to briefly suspend multiple C++ std threads, running on Linux, at the same time.
It seems this is not supported by the OS.
The threads work on tasks that take an uneven and unpredictable amount of time (several seconds).
I want to suspend them when the CPU temperature rises above a threshold.
It is impractical to check for suspension within the tasks; it is only possible between tasks.
I would like to simply have all workers suspend operation for a few milliseconds.
How could that be done?
What I'm currently doing
I'm currently using a condition variable in a slim, custom binary semaphore class (think C++20 Semaphore).
A worker checks for suspension before starting the next task by acquiring and immediately releasing the semaphore.
A separate control thread occupies the control semaphore for a few milliseconds if the temperature is too high.
This often works well and the CPU temperature is stable.
I do not care much about a slight delay in suspending the threads.
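
To make this concrete, here is a simplified sketch of the gate (a condition variable stands in for my custom semaphore class; the names are illustrative, error handling omitted, C++17):

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified pause gate: workers call check_for_suspension() between
// tasks; the control thread closes the gate for a few milliseconds.
class PauseGate {
  std::mutex m_;
  std::condition_variable cv_;
  bool paused_ = false;
public:
  void pause()  { std::lock_guard lock(m_); paused_ = true; }
  void resume() {
    { std::lock_guard lock(m_); paused_ = false; }
    cv_.notify_all();
  }
  void check_for_suspension() {  // blocks while the gate is closed
    std::unique_lock lock(m_);
    cv_.wait(lock, [&] { return !paused_; });
  }
};

// Control thread, when the temperature is too high:
//   gate.pause();
//   std::this_thread::sleep_for(std::chrono::milliseconds(10));
//   gate.resume();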
However, when one task takes some seconds longer than the others, its thread will continue to run alone.
This activates CPU turbo mode, which is the opposite of what I want to achieve (it is comparatively power inefficient, thus bad for thermals).
I cannot deactivate CPU turbo as I do not control the hardware.
In other words, the tasks take too long to complete.
So I want to forcefully pause them from outside.
I want to suspend them when the CPU temperature rises above a threshold.
In general, that is putting the cart before the horse.
Properly designed hardware should have adequate cooling for maximum load and your program should not be able to exceed that cooling capacity.
In addition, since you are talking about Turbo, we can assume an Intel CPU, which will thermally throttle all on its own, making your program run slower without you doing anything.
In other words, the tasks take too long to complete
You could break the tasks into smaller parts, and check the semaphore more often.
A separate control thread occupies the control semaphore for a few milliseconds
It's really unlikely that your hardware can react to millisecond delays -- that's too short a timescale for anything thermal. You will probably be better off monitoring the temperature and simply reducing the number of tasks you are scheduling when the temperature is rising and getting close to your limits.
I've now implemented it with pthread_kill and SIGRT.
Note that suspending threads in unknown state (whatever the target task was doing at the time of signal receipt) is a recipe for deadlocks. The task may be inside malloc, may be holding arbitrary locks, etc. etc.
If your "control thread" also needs that lock, it will block and you lose. Your control thread must execute only direct system calls, may not call into libc, etc. etc.
This solution is ~impossible to test, and ~impossible to implement correctly.
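
For context, the mechanics of that approach usually look something like the following sketch (Linux; the signal numbers and setup are an illustrative choice, not a recommendation, given the deadlock caveats above):

#include <pthread.h>
#include <signal.h>

// SIGRTMIN pauses a thread, SIGRTMIN+1 resumes it.
static void suspend_handler(int /*sig*/) {
  sigset_t mask;
  sigfillset(&mask);
  sigdelset(&mask, SIGRTMIN + 1);  // only the resume signal gets through
  sigsuspend(&mask);               // async-signal-safe sleep
}

static void resume_handler(int /*sig*/) {}  // no-op; just wakes sigsuspend

void install_handlers() {
  struct sigaction sa = {};
  sa.sa_handler = suspend_handler;
  sigaction(SIGRTMIN, &sa, nullptr);
  sa.sa_handler = resume_handler;
  sigaction(SIGRTMIN + 1, &sa, nullptr);
}

// Control thread: pthread_kill(tid, SIGRTMIN) to pause,
//                 pthread_kill(tid, SIGRTMIN + 1) to resume.

The handler interrupts the target thread at an arbitrary instruction, which is exactly why the caveats above apply.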
I have an async API which wraps some IO library. The library uses C-style callbacks; the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build the API, something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, more precisely, the futex used to implement promise/future.

Since the futex, AFAIK, is user space and the fastest mechanism I know to sync two threads, I just switched to raw futexes, which somewhat improved the situation, but nothing drastic. The performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article, Futex Scaling for Multi-core Systems, which quite matches the effect I observe with futexes.

My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: a simple, user-space synchronization primitive, shared between two threads only, where only one thread sets the completion and only one thread waits for it.
EDIT001:
What if... Previously I said no spinning in a busy wait. But the futex already spins in a busy wait, right? The implementation just covers the more general case, which requires a global hash table to hold the futexes, queues for all subscribers, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics, no global data structures, and busy-wait on it like the futex already does?
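
For illustration, a minimal sketch of such a futex-based binary event (one waiter, one waker; Linux-only, error handling omitted). Note that the flag still has to be an atomic int for the cross-thread access to be well-defined, even in the spin path:

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(std::atomic<int>* uaddr, int op, int val) {
  return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
}

struct Event {
  std::atomic<int> state{0};

  void signal() {  // called by the IO completion thread
    if (state.exchange(1, std::memory_order_release) == 0)
      futex(&state, FUTEX_WAKE_PRIVATE, 1);  // wake the waiter if asleep
  }

  void wait() {    // called by the single waiting thread
    // spin briefly first -- completions arrive within tens of microseconds
    for (int i = 0; i < 4096; ++i)
      if (state.exchange(0, std::memory_order_acquire) == 1) return;
    // otherwise sleep in the kernel while state is still 0
    while (state.exchange(0, std::memory_order_acquire) != 1)
      futex(&state, FUTEX_WAIT_PRIVATE, 0);
  }
};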
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
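
The pinning experiment is easy to reproduce with something like this (Linux; compile with g++, which defines _GNU_SOURCE, or define it yourself):

#include <pthread.h>
#include <sched.h>

// Pin a thread to one core; call it for both the waker and the wakee.
static void pin_to_core(pthread_t t, int core) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(t, sizeof(set), &set);
}

// pin_to_core(waker, 0);
// pin_to_core(wakee, 0);  // same core: no CPU wakeups, no migrations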
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter? Well, Google proposed a solution to this with a FUTEX_SWAP api that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the waker's current core, and the system call to do this has too much overhead.)
In my app I will receive various events that I would like to process asynchronously in a prioritised order.
I could do this with a boost::asio::io_service, but my application is single-threaded. I don't want to pay for the locks and mallocs you might need in a multi-threaded program (the performance cost really is significant to me). I'm basically looking for a boost::asio::io_service that is written for single-threaded execution.
I'm pretty sure I could implement this myself using boost::coroutine, but before I do: does something like a boost::asio::io_service written for single-threaded execution already exist? I scanned the list of Boost libraries and nothing stood out to me.
Be aware that you have to pay for synchronization as soon as you use any non-blocking calls of Asio.
Even though you might use a single thread for scheduling work and processing the resulting callbacks, Asio might still have to spawn additional threads internally for executing asynchronous calls. Those will access the io_service concurrently.
Think of an async_read on a socket: As soon as the received data becomes available, the socket has to notify the io_service. This happens concurrent to your main thread, so additional synchronization is required.
For blocking I/O this problem goes away in theory, but since asynchronous I/O is sort of the whole point of the library, I would not expect to find too many optimizations for this case in the implementation.
As was pointed out in the comments already, the contention on the io_service will be very low with only one main thread, so unless profiling indicates a clear performance bottleneck there, you should not worry about it too much.
I suggest using boost::asio together with boost::coroutine via boost::asio::yield_context (which already does the coupling between coroutine and io_service). If you detect a task with a higher priority, you can suspend the current task and start processing the higher-priority one.
The catch is that you have to define/call certain checkpoints in your task's code in order to suspend the task when the condition (a higher-priority task enqueued) is met.
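
A minimal sketch of that coupling (not a prioritised scheduler, just the suspension mechanism; the two-argument spawn() overload shown here is the classic API and is deprecated in Boost 1.80+, which expects a completion token instead; link with -lboost_coroutine -lboost_context):

#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <iostream>

int main() {
  boost::asio::io_context io;  // io_service in older Boost versions

  boost::asio::spawn(io, [&](boost::asio::yield_context yield) {
    boost::asio::steady_timer timer(io, std::chrono::seconds(1));
    timer.async_wait(yield);  // suspends only this coroutine, not the thread
    std::cout << "task resumed\n";
  });

  io.run();  // a single thread drives all coroutines
}

The yield_context checkpoints are the places where a higher-priority task would get its chance to run.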
I am writing an application that use a third-party library to perform heavy computations.
This library implements parallelism internally and spawns a given number of threads. I want to run several (a dynamic count of) instances of this library and therefore end up quite heavily oversubscribing the CPU.
Is there any way I can increase the "time quantum" of all the threads in a process so that e.g. all the threads with normal priority rarely context switch (yield) unless they are explicitly yielded through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the cpu. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this is to perform thread scheduling manually.
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop which resumes/suspends the threads every e.g. 40 ms and makes sure no more threads than the current CPU count are running.
Are there any major drawbacks to this approach? I'm not sure what the overhead of suspending/resuming a thread is.
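
For illustration, a Win32 sketch of such a loop (names illustrative, error handling omitted):

#include <windows.h>
#include <tlhelp32.h>
#include <vector>

// Collect suspendable handles for all other threads in this process.
std::vector<HANDLE> OwnThreads() {
  std::vector<HANDLE> out;
  HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
  THREADENTRY32 te{sizeof(te)};
  for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te))
    if (te.th32OwnerProcessID == GetCurrentProcessId() &&
        te.th32ThreadID != GetCurrentThreadId())
      out.push_back(OpenThread(THREAD_SUSPEND_RESUME, FALSE, te.th32ThreadID));
  CloseHandle(snap);
  return out;
}

// Suspend everything, then rotate: at most `cores` threads per 40 ms slice.
void RoundRobin(std::vector<HANDLE>& th, size_t cores) {
  if (th.empty()) return;
  for (HANDLE h : th) SuspendThread(h);
  for (size_t next = 0;; next = (next + cores) % th.size()) {
    for (size_t i = 0; i < cores && i < th.size(); ++i)
      ResumeThread(th[(next + i) % th.size()]);
    Sleep(40);
    for (size_t i = 0; i < cores && i < th.size(); ++i)
      SuspendThread(th[(next + i) % th.size()]);
  }
}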
There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those. The scheduler only performs an unforced context switch when that has a benefit. The performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and related other tracking information.
More threads generally means more contention (for example, when memory is allocated) which can mean more forced context switches where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.
Are there any major drawbacks to this approach? I'm not sure what the overhead of suspending/resuming a thread is.
Yes, suspending/resuming threads is a very dangerous activity in the user mode of a program and should (almost) never be used. Moreover, we should not use these concepts to achieve something that any modern scheduler already does for us. This is also mentioned in the other answer to this question.
The above applies to any operating system, but from the post's tags it appears the question was asked about a Microsoft Windows based system. Now if we read about SuspendThread() on MSDN, we find the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider a scenario in which a thread has acquired some resource implicitly (i.e. not in your code, but via a library or in kernel mode). If we suspend that thread, the result can be a mysterious deadlock, as other threads of the process wait for that particular resource. Since we can never be sure, at any point in our program, what sort of resources are held by any running thread, suspending/resuming threads is not a good idea.
I've just come across the Get/SetThreadPriority methods and they got me wondering: can a thread priority meaningfully be set higher than the owning process priority (which I don't believe can be changed programmatically in the same way)?
Are there any pitfalls to using these APIs?
Yes, you can set the thread priority to any class, including a class higher than the one of the current process. In fact, these two values are complementary and provide the base priority of the thread. You can read about it in the Remarks section of the link you posted.
You can set the process priority using SetPriorityClass.
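
For example (Win32):

#include <windows.h>

int main() {
  // Raise the whole process one priority class...
  SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS);
  // ...and raise the current thread within that class. The class and the
  // thread level combine into the thread's base scheduling priority.
  SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
  return 0;
}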
Now that we got the technicalities out of the way, I find little use for manipulating the priority of a thread directly. The OS scheduler is sophisticated enough to boost the priority of threads blocked in I/O over threads doing CPU computations (to the point that an I/O thread will preempt a CPU thread when the I/O interrupt arrives). In fact, even I/O threads are differentiated, with keyboard I/O threads getting a priority boost over file I/O threads for example.
On Windows, the thread and process priorities are combined using an algorithm that decides the overall scheduling priority:
Windows priorities
Pitfalls? Well:
Raising the priority of a thread is likely to give the greatest overall gain if it is usually blocked on IO but must run ASAP after being signaled by its driver, e.g. video IO that must process buffers quickly.
Raising the priority of threads is likely to have the greatest overall negative impact if they are CPU-bound and raised to a high priority, preventing the running of normal-priority threads. If taken to extremes, OS threads and utilities like Task Manager will not run.