This question is about using CUDA streams to run many kernels.
In CUDA there are many synchronization commands:
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check whether a stream is empty.
I noticed when using the profiler that these synchronization commands introduce a large delay into the program. I was wondering if anyone knows any means of reducing this latency apart from, of course, using as few synchronization commands as possible.
Also, are there any figures for judging the most efficient synchronization method? That is, consider 3 streams used in an application, where two of them need to complete before I can launch a fourth stream: should I use two cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which incurs less overhead?
The main difference between the synchronization methods is "polling" versus "blocking."
"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to reach a certain value written by the GPU. It may return from the wait more quickly once the wait is resolved, but while waiting it burns a CPU core spinning on that memory location.
"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.
The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
I am using a quad-core embedded computer running Linux to control a robot system.
Basically, the project is a multi-threaded, single-process program written in C++.
Here is some background and here are the requirements:
Some tasks (signal processing and communication with hardware) require somewhat strict "real-time" operation.
The cycle period is 500us or 1000us (configurable).
The counterpart in this operation is so-called 'hard real-time HW' (one with a dedicated DSP).
Occasionally missing 1~2 cycles (possibly due to jitter) will cause hardly noticeable degradation of system operation.
Occasionally missing 3~9 cycles will cause quite noticeable degradation of system performance, but it is not fatal.
Missing more than 10 cycles even once will stop the whole system and is considered a fatal major malfunction.
The solution I have in mind is a combination of the following:
Put all the functions required for 'real-time operation' in a single thread and make it the 'real-time thread'.
Put the other functions in one or two 'non-real-time threads'.
Set the sched_priority of the real-time thread to around 95.
Reserve a CPU core for real-time operation by setting the CPU affinity of all the default Linux services to the other 3 cores, and set the CPU affinity of the 'real-time thread' to the reserved real-time core (so that critical tasks get minimal interference).
Inter-thread communication will be done with std::atomic variables using release/acquire memory ordering, and kept to a minimum.
Would this be good practice for the application? Are there any other practices for more stable operation?
If you need your 1ms periodic process to have a very low jitter, you simply cannot rely on the system timers. Xenomai or not, jitter will be significant.
The only reasonable method is to do the following (it's a typical approach in both robotics and low-latency financial applications):
Isolate one CPU core (isolcpus=...)
Build the kernel with NOHZ_FULL support and set up nohz_full for the isolated CPU core
Configure rcu_nocbs for this isolated core.
Use sched_setaffinity(...) to bind your process to your chosen isolated CPU
Use sched_setscheduler(...) to set the SCHED_FIFO policy for this process (or any other real-time scheduling policy; it will tell the kernel to actually respect the NOHZ_FULL setting for this core).
You must only have ONE process running on that core
Do not use any system calls in that real-time process; system calls result in context switches, and that is what you must avoid at all costs.
Use busy waiting - spin for all the time you need to wait for an event or a time period. Do not use system calls for sleeping. Your CPU load must always stay at 100%.
Use lock-free communication between your real-time process and the non-real-time supporting processes (see the sketch after this answer).
Be mindful of the DDR latency
Make sure that everything you do in your real-time process loop terminates before this 1ms timeout.
This way you will have a sub-microsecond jitter, as long as the other cores or devices in your system do not clog the bus too much.
Keep in mind that your I/O in this process must be entirely in user space. You should try to avoid using system calls to talk to the peripherals. It's possible in some cases if you can directly write to the control registers of the devices and use DMA to get the data out of them (i.e., just move the driver functionality from kernel to user space). More complicated if your peripherals rely on interrupts. An ideal implementation is if you're using an FPGA SoC (such as Xilinx Zynq UltraScale+, or Intel Cyclone V, etc.) - you can implement your own I/O peripherals that communicate via registers and DMA only and do not need interrupts.
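To make the affinity/scheduling/busy-wait steps above concrete, here is a minimal sketch under stated assumptions: core 3 is the isolated core (isolcpus=3 nohz_full=3 rcu_nocbs=3 on the kernel command line), the cycle is 1 ms, and the priority of 95 mirrors the question; none of these values come from the answer itself:

```cpp
// Minimal sketch of the steps above: pin to the isolated core, switch to
// SCHED_FIFO, then busy-wait on the cycle boundary and on a lock-free flag.
// Core index, period, and priority are placeholders.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <time.h>
#include <atomic>
#include <cstdint>

std::atomic<bool> g_new_command{false};  // set by the non-real-time side
                                         // (put it in shared memory if that
                                         // side is a separate process)

static uint64_t now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  // vDSO on most systems, no kernel entry
    return uint64_t(ts.tv_sec) * 1000000000ull + ts.tv_nsec;
}

int main() {
    // Bind this process to the isolated core (core 3 is an assumption).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    sched_setaffinity(0, sizeof(set), &set);

    // Real-time scheduling policy so the kernel honours the NOHZ_FULL isolation.
    sched_param sp{};
    sp.sched_priority = 95;
    sched_setscheduler(0, SCHED_FIFO, &sp);

    const uint64_t period_ns = 1000000;  // 1 ms cycle
    uint64_t next = now_ns() + period_ns;
    for (;;) {
        if (g_new_command.exchange(false, std::memory_order_acquire)) {
            // pick up work published by the non-real-time side
        }
        // ... do the real-time work here, well within the 1 ms budget ...

        while (now_ns() < next) { /* busy wait; no sleep syscalls */ }
        next += period_ns;
    }
}
```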
I've inherited a large code base that contains multiple serial interface classes for various hardware components. Each of these serial classes uses non-overlapped serial I/O for its communication. I have an issue where I get random CPU spikes to 100% which cause the threads to stall briefly, and then the CPU goes back to normal usage after ~10-20 seconds.
My theory is that, due to the blocking nature of non-overlapped serial I/O, there are times when multiple threads are calling ReadFile() and blocking each other.
My question is: if multiple threads are calling ReadFile() (or WriteFile()) at the same time, will they block each other? Based on my research I believe that's true, but I would like confirmation.
The platform is Windows XP with C++03, so I don't have many modern tools available.
"if multiple threads are calling readFile() (or writeFile()) at the same time will they block each other?"
As far as I'm concerned, they will block each other.
I suggest you could refer to the Doc:Synchronization and Overlapped Input and Output
When a function is executed synchronously, it does not return until
the operation has been completed. This means that the execution of the
calling thread can be blocked for an indefinite period while it waits
for a time-consuming operation to finish. Functions called for
overlapped operation can return immediately, even though the operation
has not been completed. This enables a time-consuming I/O operation to
be executed in the background while the calling thread is free to
perform other tasks.
Using the same event on multiple threads can lead to a race condition
in which the event is signaled correctly for the thread whose
operation completes first and prematurely for other threads using that
event.
And the operating system is in charge of the CPU. Your code only gets to run when the operating system schedules it. The OS will not bother running threads that are blocked. Blocking will not occupy the CPU. I suggest you try using the Windows Performance Toolkit to check CPU utilization.
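For reference, here is a minimal sketch of the overlapped alternative the quoted documentation describes, using one event per operation as it recommends. It sticks to the plain Win32 API so it fits C++03 on XP; the port name, buffer size, and wait timeout are placeholders, and comm timeouts plus most error handling are omitted:

```cpp
#include <windows.h>

// Minimal sketch of an overlapped read on a serial port (C++03-compatible).
int main() {
    HANDLE h = CreateFileA("\\\\.\\COM1",
                           GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED,   // enable async I/O on this handle
                           NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    char buf[256];
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);  // one event per operation

    DWORD read = 0;
    if (!ReadFile(h, buf, sizeof(buf), NULL, &ov)) {
        if (GetLastError() == ERROR_IO_PENDING) {
            // The call returned immediately; this thread is free to do other
            // work here instead of blocking inside ReadFile.
            WaitForSingleObject(ov.hEvent, 5000);       // wait (or poll) later
            GetOverlappedResult(h, &ov, &read, FALSE);  // fetch the byte count
        }
    } else {
        GetOverlappedResult(h, &ov, &read, FALSE);      // completed immediately
    }

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}
```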
I keep reading about why asynchronous IO is better than synchronous IO, namely that in a-sync IO your program can keep running, while in sync IO you're blocked until the operation is finished.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
So in a-sync IO the kernel needs that CPU time as well, which might result in a context switch from my application to the kernel. So it's not really blocking, but CPU cycles still need to be spent to run this operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand it also supports zero copy, so that's a benefit)
spdk (should be best, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use is a matter of what programming model you prefer, and doesn't determine whether your program will freeze waiting for I/O operations to finish. For certain specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to the systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
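As a hedged illustration of the first (event-loop) approach described above, here is a minimal poll() sketch of my own; it watches a single descriptor (stdin, standing in for any input channel) and echoes data:

```cpp
#include <poll.h>
#include <unistd.h>
#include <cstdio>

// Minimal sketch of the event-loop approach: poll() a descriptor and handle
// it when it becomes ready. A real program would add more entries to fds[]
// and keep per-descriptor state for partial reads/writes, as noted above.
int main() {
    struct pollfd fds[1];
    fds[0].fd = 0;            // stdin stands in for any input channel
    fds[0].events = POLLIN;

    char buf[4096];
    for (;;) {
        int n = poll(fds, 1, -1);      // block until something is ready
        if (n < 0) { perror("poll"); return 1; }

        if (fds[0].revents & POLLIN) {
            ssize_t got = read(fds[0].fd, buf, sizeof(buf));
            if (got <= 0) break;       // EOF or error: leave the loop
            // ... consume `got` bytes; stash leftovers you can't process yet
            //     (the explicit state mentioned above) ...
            write(1, buf, got);        // echo to stdout for the sketch
        }
    }
    return 0;
}
```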
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O regardless of the programming model (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
The kernel does need CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and it will be interrupted by an IRQ from the disk controller when the disk controller has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course often hardware is more advanced (e.g. having an internal queue of operations itself, so driver can tell it to do multiple things and it can start the next operation as soon as it finished the previous operation); and often drivers are more advanced (e.g. having "IO priorities" to ensure that more important stuff is done first rather than just having a simple FIFO queue of pending operations).
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
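To make the multi-task alternative concrete, here is a compressed sketch of my own (not from the answer) of the reader/processor/writer arrangement using std::thread and a tiny blocking queue; read_from_deviceA(), process(), and write_to_deviceB() are placeholder stand-ins for real devices and work:

```cpp
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>
#include <optional>
#include <cstdio>

// Three-task pipeline: reader -> processor -> writer, each stage blocking
// synchronously on "its" device while the other stages keep working.
template <class T>
class Channel {                       // tiny blocking queue for hand-off
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> q;
    bool closed = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_all();
    }
    std::optional<T> pop() {          // blocks until an item arrives or close()
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return !q.empty() || closed; });
        if (q.empty()) return std::nullopt;
        T v = std::move(q.front()); q.pop();
        return v;
    }
};

std::optional<int> read_from_deviceA() {          // placeholder device A
    static int n = 0;
    if (n >= 5) return std::nullopt;              // pretend the input ends
    return n++;
}
int process(int v) { return v * 2; }              // placeholder CPU work
void write_to_deviceB(int v) { std::printf("%d\n", v); }  // placeholder device B

int main() {
    Channel<int> raw, done;

    std::thread reader([&]{                       // task 1: fetch from deviceA
        while (auto v = read_from_deviceA()) raw.push(*v);
        raw.close();
    });
    std::thread worker([&]{                       // task 2: process
        while (auto v = raw.pop()) done.push(process(*v));
        done.close();
    });
    std::thread writer([&]{                       // task 3: write to deviceB
        while (auto v = done.pop()) write_to_deviceB(*v);
    });

    reader.join(); worker.join(); writer.join();
}
```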
Context switching is necessary in any case. The kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require a lot of processor work. The limiting factor is the disk response. The question is whether we wait for this response or do our work in the meantime.
Let's say I have an application that all it does is get info and write
it into files. Is there any benefit for using a-sync IO instead of
sync IO?
If you implement synchronous access, your sequence is the following:
get information
write information
goto 1.
So, you can't get information until write() completes. Suppose the information supplier is as slow as the disk you write to. In this case the program will be twice as slow as the asynchronous one.
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of the information during writes. Examples of such information sources could be sensors for fast processes. In this case, you should synchronously read the sensors and asynchronously save the obtained values.
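The question lists aio among the async options; as a hedged illustration of "read synchronously, save asynchronously", here is a minimal POSIX aio sketch of my own (the file name, loop count, and record size are placeholders, read_sensor() is a stand-in for the data source, and older glibc needs -lrt):

```cpp
#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cerrno>

// Kick off an asynchronous write of the previous sample and go straight back
// to reading the next one; only wait if the prior write has not finished by
// the time the next buffer is ready.
static void read_sensor(char* buf, size_t len) {  // placeholder "fast" source
    std::memset(buf, 'x', len);
}

int main() {
    int fd = open("samples.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    char buf[2][4096];                 // double buffer: read one, write the other
    aiocb cb;
    std::memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;

    off_t offset = 0;
    bool pending = false;
    for (int i = 0; i < 8; ++i) {
        char* cur = buf[i & 1];
        read_sensor(cur, sizeof(buf[0]));           // synchronous "get info"

        if (pending) {                              // previous write must land
            const aiocb* list[1] = { &cb };         // before we reuse the aiocb
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, nullptr);
            aio_return(&cb);                        // collect its result
        }

        cb.aio_buf    = cur;                        // queue this buffer
        cb.aio_nbytes = sizeof(buf[0]);
        cb.aio_offset = offset;
        aio_write(&cb);                             // returns immediately
        pending = true;
        offset += sizeof(buf[0]);
    }
    if (pending) {
        const aiocb* list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS) aio_suspend(list, 1, nullptr);
        aio_return(&cb);
    }
    close(fd);
    return 0;
}
```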
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal have the most benefit. Using asychronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(and "slow" just means that the time for an IO operation to finish is unbounded, it may ever never finish, eg when waiting for a user to press enter or a network client to send a command)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.
I have an async API which wraps some IO library. The library uses C-style callbacks, and the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build this API. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, more precisely, the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know to sync two threads, I just switched to using raw futexes, which somewhat improved the situation, but nothing drastic. The performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article - Futex Scaling for Multi-core Systems - which quite matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: a simple, user-space synchronization primitive shared between two threads only, where only one thread sets the completion and only one thread waits for it.
EDIT001:
What if... Previously I said no spinning in a busy wait. But the futex already spins in a busy wait, right? However, the implementation covers a more general case, which requires a global hash table to hold the futexes, queues for all waiters, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics, no global data structures, and to busy wait on it like the futex already does?
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter. Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this has yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)
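For what it's worth, here is a minimal sketch of the spin-then-futex binary semaphore the question's EDIT describes: spin briefly in user space and only fall back to the futex syscall when the completion has not arrived in time. It is an illustration, not a fix for the wake-up latency discussed above; SPIN_LIMIT is an arbitrary placeholder, and casting the std::atomic<int> address to int* for the syscall is the usual Linux-specific assumption:

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Binary "completion" for exactly one waiter and one signaller: spin for a
// short while in user space, then fall back to FUTEX_WAIT.
class Completion {
    // 0 = not signalled, 1 = signalled, -1 = waiter parked in the kernel
    std::atomic<int> state{0};
    static const int SPIN_LIMIT = 4000;   // arbitrary placeholder, tune it

    static long futex(std::atomic<int>* addr, int op, int val) {
        // Relies on std::atomic<int> having the layout of a plain int on Linux.
        return syscall(SYS_futex, reinterpret_cast<int*>(addr), op, val,
                       nullptr, nullptr, 0);
    }

public:
    // Called by the IO-completion callback (the signalling thread).
    void signal() {
        if (state.exchange(1, std::memory_order_release) == -1)
            futex(&state, FUTEX_WAKE_PRIVATE, 1);   // enter the kernel only if a waiter parked
    }

    // Called by the single waiting thread.
    void wait() {
        // Phase 1: short user-space spin, hoping the completion is imminent.
        for (int i = 0; i < SPIN_LIMIT; ++i)
            if (state.exchange(0, std::memory_order_acquire) == 1) return;

        // Phase 2: announce that we are parking; consume the flag if it just arrived.
        if (state.exchange(-1, std::memory_order_acquire) == 1) {
            state.store(0, std::memory_order_relaxed);
            return;
        }
        // Sleep until signal() flips the word away from -1, then consume it.
        while (state.load(std::memory_order_acquire) != 1)
            futex(&state, FUTEX_WAIT_PRIVATE, -1);
        state.store(0, std::memory_order_relaxed);
    }
};
```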
I have a simple chunk of deterministic work that only takes thirteen machine instructions to complete. Because the first instruction takes a homemade semaphore (spinlock) and the last instruction releases it, I am safe from all of the other threads running on the other cores as they are attempting to take and give the same semaphore.
The problem arises when some thread interrupts a thread holding the semaphore before it can finish its "critical section". In the worst case the interruption kills the thread while it is holding the semaphore, or, as can happen, one of the threads normally competing for the semaphore branches out into code that can generate the interrupt, causing a deadlock.
I don't have a way of synchronizing with these other threads when they branch into those parts of the code I can't control. I think I need to disable interrupts, like I used to do in my old VxWorks days when I was running in kernel mode. It's always thirteen instructions, and I am always completely safe if I can get all thirteen instructions done before I have to honor an interrupt. Oh, and it is all my own internal data; other than the homemade semaphore, there is nothing that locks anything else up.
I have read several answers that I think are close. Most have to do with Critical Section calls on the Windows API (wrong OS but maybe the right concept). Most of the wrong solutions assume that I can get all of the offending threads to use a mutex that I create with the pthread libraries.
I need this solution in C/C++ on Linux and Solaris.
Johnny Crash's question is very close
prevent linux thread from being interrupted by scheduler
KermitG also
Can I prevent a Linux user space pthread yielding in critical code?
Thanks for your consideration.
You cannot prevent preemption of a user-mode thread. Critical sections (and all other sync objects) prevent collisions between your threads; however, they by no means prevent them from being preempted by the OS.
If your other threads branch into something on a timeout, and that something may lead to a deadlock, you have a design problem.
A correct design should be the most pessimistic: preemption may occur anywhere and last for an indeterminate time.
Yes, yes - 7 years old - I need to do exactly this but for other reasons.
So I put this here for others to read in a historical context.
I am writing an emulation layer for an embedded RTOS where I need to emulate the embedded platform's CPU_irq_disable() and CPU_irq_restore(). The closest thing I can think of is to disable preemption in the scheduler.
Yes, the target does have an RTOS - sometimes it does not.
IO is emulated via sockets, i.e. a serial port is like a stream socket!
A GPIO pin (edge IRQ) can be a socket too. The current value is in a quasi-global in the driver, and waiting for a pin change is like waiting for a packet to arrive on a socket.
So the socket read thread acts like an IRQ when a packet shows up.
Thus, to emulate IRQ disable, it is reasonable to do so by disabling preemption within my own application.
Also at the embedded application layer, I need to emulate what would be a superloop.
No amount of mutex stuff is going to emulate the embedded platform reasonably.