Does CU_CTX_SCHED_BLOCKING_SYNC make kernels synchronous? - c++

Does creating a CUDA context with CU_CTX_SCHED_BLOCKING_SYNC make CUDA kernel launches actually synchronous (i.e. stalling the CPU thread as a normal CPU same-thread function would)?
The documentation only states:
CU_CTX_SCHED_BLOCKING_SYNC: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the GPU to finish work.
but I'm not sure I understood it right.

No.
These flags control how the host thread behaves when it waits on a host<->device synchronization API such as cuCtxSynchronize, cuEventSynchronize, or cuStreamSynchronize. Other, non-blocking API calls remain asynchronous in either case.
There are three scheduling models for the waiting host thread: spin, yield, and blocking sync. Spin (CU_CTX_SCHED_SPIN) means the calling host thread busy-waits, burning a CPU core until the call returns; yield (CU_CTX_SCHED_YIELD) lets that spinning thread yield to other runnable host threads; blocking sync (CU_CTX_SCHED_BLOCKING_SYNC) puts the thread to sleep on an OS synchronization primitive until the GPU signals completion.
If you want to enforce blocking behaviour on kernel launch, use the CUDA_LAUNCH_BLOCKING environment variable.
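For illustration, a minimal driver-API sketch (error handling omitted; the launch itself is left as a comment, since module loading is beside the point):

#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    // Create the context with the blocking-sync scheduling flag.
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);

    // A cuLaunchKernel() call here would still return immediately;
    // the flag does not make launches synchronous.

    // Only an explicit synchronization blocks, and with this flag the
    // host thread sleeps on an OS primitive instead of spinning:
    cuCtxSynchronize();

    cuCtxDestroy(ctx);
    return 0;
}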

Related

Non-Overlapped Serial - Do ReadFile() calls from separate threads block each other?

I've inherited a large code base that contains multiple serial interface classes for various hardware components. Each of these serial classes uses non-overlapped serial I/O for its communication. I have an issue where I get random CPU spikes to 100%, which cause the threads to stall briefly, and then the CPU goes back to normal usage after ~10-20 seconds.
My theory is that, due to the blocking nature of non-overlapped serial I/O, there are times when multiple threads are calling ReadFile() and blocking each other.
My question is: if multiple threads are calling ReadFile() (or WriteFile()) at the same time, will they block each other? Based on my research I believe that's true, but I would like confirmation.
The platform is Windows XP running C++03, so I don't have many modern tools available.
"if multiple threads are calling readFile() (or writeFile()) at the same time will they block each other?"
As far as I'm concerned, they will block each other.
I suggest you could refer to the Doc:Synchronization and Overlapped Input and Output
When a function is executed synchronously, it does not return until the operation has been completed. This means that the execution of the calling thread can be blocked for an indefinite period while it waits for a time-consuming operation to finish. Functions called for overlapped operation can return immediately, even though the operation has not been completed. This enables a time-consuming I/O operation to be executed in the background while the calling thread is free to perform other tasks.
Using the same event on multiple threads can lead to a race condition in which the event is signaled correctly for the thread whose operation completes first and prematurely for other threads using that event.
And the operating system is in charge of the CPU: your code only gets to run when the operating system schedules it, and the OS will not bother running threads that are blocked. Blocking does not occupy the CPU. I suggest you try the Windows Performance Toolkit to check CPU utilization.
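If the mutual blocking does turn out to be the problem, overlapped I/O is the usual escape hatch. A hedged sketch (C++03-compatible; "\\\\.\\COM3" and the function name are placeholders, and the handle must be opened with FILE_FLAG_OVERLAPPED):

#include <windows.h>

DWORD ReadSomeBytes(char *buf, DWORD len)
{
    // The handle must be opened with FILE_FLAG_OVERLAPPED.
    HANDLE h = CreateFileA("\\\\.\\COM3", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    DWORD bytesRead = 0;
    OVERLAPPED ov = {0};
    // One event per outstanding operation, per the race warning quoted above.
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    if (!ReadFile(h, buf, len, NULL, &ov) &&
        GetLastError() == ERROR_IO_PENDING)
    {
        // ReadFile returned immediately; this thread could do other work
        // before blocking here for completion.
        WaitForSingleObject(ov.hEvent, INFINITE);
        GetOverlappedResult(h, &ov, &bytesRead, FALSE);
    }

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return bytesRead;
}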

How to create a user space thread? [duplicate]

I have just started coding a device driver and am new to threading. I went through many documents to get an idea about threads, but I still have some doubts:
What is a kernel thread?
How does it differ from a user thread?
What is the relationship between the two kinds of threads?
How can I implement kernel threads?
Where can I see the output of the implementation?
Can anyone help me?
Thanks.
A kernel thread is a task_struct with no userspace components.
Besides the lack of userspace, it has different ancestors (the kthreadd kernel thread instead of the init process) and is created by a kernel-only API instead of by the usual fork/exec/clone sequence of system calls.
Any two kernel threads have kthreadd as a parent. Apart from that, kernel threads enjoy the same "independence" from one another that userspace processes do.
Use the kthread_run function/macro from the kthread.h header. You will most probably have to write a kernel module in order to call this function, so you should take a look at the Linux Device Drivers book.
If you are referring to the text output of your implementation (via printk calls), you can see this output in the kernel log using the dmesg command.
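To make that concrete, a minimal module sketch built around kthread_run (names like worker_fn are ours, not from any particular driver):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    /* Run until someone calls kthread_stop() on us. */
    while (!kthread_should_stop()) {
        printk(KERN_INFO "hello from a kernel thread\n");
        msleep(1000);
    }
    return 0;
}

static int __init demo_init(void)
{
    /* kthread_run() = kthread_create() + wake_up_process(). */
    worker = kthread_run(worker_fn, NULL, "demo-kthread");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(worker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

After insmod, the thread shows up in ps as [demo-kthread], and its printk output lands in the kernel log, which answers the "where can I see the output" question.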
A kernel thread is a kernel task running only in kernel mode; it usually has not been created by the fork() or clone() system calls. Examples are kworker and kswapd.
You probably should not implement kernel threads if you don't know what they are.
Google gives many pages about kernel threads, e.g. Frey's page.
user threads & stack:
Each thread has its own stack so that it can use its own local variables, thread’s share global variables which are part of .data or .bss sections of linux executable.
Since threads share global variables i.e we use synchronization mechanisms like mutex when we want to access/modify global variables in multi threaded application. Local variables are part of thread individual stack, so no need of any synchronization.
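A tiny pthread sketch of that point, with an assumed global counter:

#include <pthread.h>

/* A shared global (in .bss) needs a lock; thread locals do not. */
static int counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int i; /* lives on this thread's own stack: no locking needed */
    for (i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);
        counter++; /* access to the shared global is serialized */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}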
Kernel threads
Kernel threads emerged from the need to run kernel code in process context. Kernel threads are the basis of the workqueue mechanism. Essentially, a kernel thread is a thread that only runs in kernel mode and has no user address space or other user attributes.
To create a kernel thread, use kthread_create():
#include <linux/kthread.h>
struct task_struct *kthread_create(int (*threadfn)(void *data),
                                   void *data, const char namefmt[], ...);
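A thread created this way starts out stopped; a short usage sketch (threadfn stands for your own function):

#include <linux/kthread.h>

static int threadfn(void *data)
{
    /* ... do work, checking kthread_should_stop() periodically ... */
    return 0;
}

static void start_worker(void)
{
    struct task_struct *t = kthread_create(threadfn, NULL, "my-worker");
    if (!IS_ERR(t))
        wake_up_process(t); /* kthread_run() combines these two steps */
}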
kernel threads & stack:
Kernel threads are used to do post processing tasks for kernel like pdf flush threads, workq threads etc.
Kernel threads are basically new process only without address space(can be created using clone() call with required flags), means they can’t switch to user-space. kernel threads are schedulable and preempt-able as normal processes.
kernel threads have their own stacks, which they use to manage local info.
More about kernel stacks:
https://www.kernel.org/doc/Documentation/x86/kernel-stacks
Since you're comparing kernel threads with user[land] threads, I assume you mean something like the following.
The normal way of implementing threads nowadays is to do it in the kernel, so those can be considered "normal" threads. It is, however, also possible to do it in userland, using signals such as SIGALRM, whose handler saves the current process state (mostly registers) and switches to another, previously saved one. Several OSes used this as a way to implement threads before they got proper kernel thread support. Such threads can be faster, since you don't have to cross into kernel mode, but in practice they have faded away.
There are also cooperative userland threads, where one thread runs until it calls a special function (usually called yield), which then switches to another thread in a similar way to the SIGALRM handler above; a small sketch follows below. The advantage here is that the program is in total control, which can be useful when you have timing concerns (a game, for example). You also don't have to care much about thread safety. The big disadvantage is that only one thread can run at a time, which is why this method has also become uncommon now that processors have multiple cores.
Kernel threads are implemented in the kernel. Perhaps you meant how to use them? The most common way is to call pthread_create.
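To make the cooperative model above concrete, a minimal sketch using the POSIX ucontext API (long deprecated, but it shows the explicit hand-off that yield performs):

#include <ucontext.h>
#include <stdio.h>

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void coroutine(void)
{
    printf("coroutine: first run\n");
    swapcontext(&co_ctx, &main_ctx); /* "yield" back to main */
    printf("coroutine: resumed\n");
}

int main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx; /* where to go when coroutine() returns */
    makecontext(&co_ctx, coroutine, 0);

    swapcontext(&main_ctx, &co_ctx); /* run coroutine until it yields */
    printf("main: coroutine yielded\n");
    swapcontext(&main_ctx, &co_ctx); /* resume it until it finishes */
    printf("main: coroutine finished\n");
    return 0;
}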

Kworker threads getting blocked by SCHED_RR userspace threads

We have a Linux system using kernel 3.14.17, PREEMPT RT. It is a single core system.
For latency reasons, our application has some of its threads' scheduling policy set to SCHED_RR. However, this causes the kworkers in the kernel to be blocked, as they only run as SCHED_OTHER. This can cause a kind of priority inversion, as a low-priority SCHED_RR thread can block a higher-priority SCHED_RR thread from receiving data from the driver.
It is the TTY driver that is being blocked. It uses a work queue in the function tty_flip_buffer_push. Possibly in more calls, but that is the one we've identified.
Is there any way to easily fix this problem of an RT application depending on a kworker? We are hoping we don't have to hack the driver/kernel ourselves. Are there any kernel config options in the RT kernel for this kind of thing? Can we,
set a SCHED_RR priority for the kworkers?
disable work queues for a specific driver?
If we'd have to hack the driver, we'd probably give it its own work queue, with a SCHED_RR kworker.
Of course, any other solution is also of interest. We can upgrade to a later kernel version if there is some new feature.
The root cause of this behaviour is tty_flip_buffer_push().
In kernel/drivers/tty/tty_buffer.c:518,
tty_flip_buffer_push() schedules an asynchronous task, which is then executed by a kworker thread.
However, if realtime threads keep the system busy, the chance that the kworker thread will execute soon is very low. Only once the RT threads relinquish the CPU, or RT throttling kicks in, does the kworker thread eventually get a chance to execute.
Older kernels support the low_latency flag within the TTY sub-system.
Prior to Linux kernel v3.15 tty_flip_buffer_push() honored the low_latency flag of the tty port.
If the low_latency flag was set by the UART driver as follows (typically in its .startup() function),
t->uport.state->port.tty->low_latency = 1;
then tty_flip_buffer_push() performed a synchronous copy in the context of the current call itself. It thus automatically inherited the priority of the current task, i.e. there was no chance of a priority inversion incurred by asynchronously scheduling a work task.
Note: if the serial driver sets the low_latency flag, it must avoid calling tty_flip_buffer_push() within an ISR (interrupt context). With the low_latency flag set, tty_flip_buffer_push() does NOT use a separate workqueue but calls the flush functions directly, so an ISR that calls it takes longer to execute, which increases the latency of other parts of the kernel/system. Also, under certain conditions (depending on how much data is available in the serial buffer) tty_flip_buffer_push() may attempt to sleep (acquire a mutex), and sleeping within an ISR is a kernel bug.
With the workqueue implementation within the Linux kernel having migrated to CMWQ, it is no longer possible to deterministically obtain independent execution contexts (i.e. separate threads) for individual workqueues. All workqueues are backed by the kworker/* threads in the system.
NOTE: THIS SECTION IS OBSOLETE!!
Leaving the following intact as a reference for older versions of the Linux kernel.
Customisations for low-latency/real-time UART/TTY:
1. Create and use a personal workqueue for the TTY layer.
Create a new workqueue in tty_init().
A workqueue created with create_workqueue() will have 1 worker thread for each CPU on the system.
struct workqueue_struct *create_workqueue(const char *name);
Using create_singlethread_workqueue() instead creates a workqueue with a single kworker thread:
struct workqueue_struct *create_singlethread_workqueue(const char *name);
2. Use the private workqueue.
Queue the flip-buffer work on the above private workqueue instead of the kernel's global workqueue.
int queue_work(struct workqueue_struct *queue, struct work_struct *work);
Replace schedule_work() with queue_work() in functions called by tty_flip_buffer_push().
3. Tweak the execution priority of the private workqueue.
Upon boot, the kworker thread backing the TTY layer's workqueue can be identified by the string name used when creating it. Set an appropriately high RT priority on this thread using chrt, as required by the system design.
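Putting steps 1 and 2 together, a hedged sketch (the queue name "tty-rt" and the function names are ours, not from the actual TTY layer):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *tty_wq; /* private queue for the TTY work */
static struct work_struct flush_work;

static void flush_fn(struct work_struct *w)
{
    /* ... push the flip buffer to the line discipline ... */
}

static int __init my_tty_init(void)
{
    /* One dedicated kworker thread, named after the queue. */
    tty_wq = create_singlethread_workqueue("tty-rt");
    if (!tty_wq)
        return -ENOMEM;
    INIT_WORK(&flush_work, flush_fn);
    return 0;
}

/* Wherever the driver used schedule_work(&flush_work): */
static void kick_flush(void)
{
    queue_work(tty_wq, &flush_work);
}

On the older kernels this section targets, the backing thread is named after the queue, so step 3 reduces to something like chrt -r -p <prio> <pid-of-tty-rt> after boot.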

How to reduce CUDA synchronize latency / delay

This question is related to using cuda streams to run many kernels
In CUDA there are many synchronization commands
cudaStreamSynchronize,
cudaDeviceSynchronize,
cudaThreadSynchronize,
and also cudaStreamQuery to check if streams are empty.
I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency, apart from, of course, using as few synchronisation commands as possible.
Also, are there any figures for judging the most efficient synchronisation method? That is: consider 3 streams used in an application, two of which need to complete before I launch work in a fourth stream. Should I use 2 cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which will incur less loss?
The main difference between the synchronize methods is "polling" versus "blocking."
"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return from the wait more quickly once the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.
"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or by creating an event with cudaEventCreateWithFlags() and cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as the default polling methods do.
The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
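For illustration, a runtime-API sketch of opting in to blocking waits (error checking omitted; function names are ours):

#include <cuda_runtime.h>

// Process-wide: must run before the runtime is otherwise initialized
// on this device.
void enable_blocking_sync()
{
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
}

// Per-wait alternative: only waits on this event sleep instead of polling.
void wait_for_stream(cudaStream_t stream)
{
    cudaEvent_t done;
    // cudaEventDisableTiming makes the event cheaper if you don't time it.
    cudaEventCreateWithFlags(&done,
                             cudaEventBlockingSync | cudaEventDisableTiming);
    cudaEventRecord(done, stream);
    cudaEventSynchronize(done); // sleeps on an OS primitive, no busy-wait
    cudaEventDestroy(done);
}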

concurrent kernel execution

Is it possible to launch kernels from different threads of a (host) application and have them run concurrently on the same GPGPU device? If not, do you know of any plans (of Nvidia) to provide this capability in the future?
The programming guide http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf says:
3.2.7.3 Concurrent Kernel Execution
Some devices of compute capability 2.0 can execute multiple kernels concurrently. Applications may query this capability by calling cudaGetDeviceProperties() and checking the concurrentKernels property.
The maximum number of kernel launches that a device can execute concurrently is sixteen.
So the answer is: it depends. It actually depends only on the device; host threads won't make a difference in any way. Kernel launches are serialized if the device doesn't support concurrent kernel execution, and if the device does, kernel launches issued to different streams can execute concurrently.
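A small sketch of that device-dependent behaviour: launch the same kernel into several streams and look at the timeline in a profiler. On a device with concurrentKernels set, the launches may overlap; otherwise they run back to back. The spin kernel is an assumption of ours, just a way to make overlap visible:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void spin_kernel(clock_t cycles)
{
    // Busy-wait roughly `cycles` clocks so any overlap is visible.
    clock_t start = clock();
    while (clock() - start < cycles) { }
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("concurrentKernels = %d\n", prop.concurrentKernels);

    const int N = 4;
    cudaStream_t streams[N];
    for (int i = 0; i < N; i++)
        cudaStreamCreate(&streams[i]);

    // One kernel per stream; these launches all return immediately.
    for (int i = 0; i < N; i++)
        spin_kernel<<<1, 1, 0, streams[i]>>>((clock_t)1 << 20);

    cudaDeviceSynchronize();
    for (int i = 0; i < N; i++)
        cudaStreamDestroy(streams[i]);
    return 0;
}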