What is the difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield? - concurrency

As said here: How to reduce CUDA synchronize latency / delay
There are two approaches to waiting for a result from the device:
"Polling" - burn CPU in a spin loop - to decrease latency while waiting for the result
"Blocking" - the thread sleeps until an interrupt occurs - to increase overall performance
For "polling" you need to use cudaDeviceScheduleSpin.
But for "blocking", which do I need to use: cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync?
What difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield?
cudaDeviceScheduleYield, as documented at http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g18074e885b4d89f5a0fe1beab589e0c8.html:
"Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device." - i.e., it waits for the result without burning CPU in a spin loop - i.e., "blocking". And cudaDeviceScheduleBlockingSync also waits for the result without burning CPU in a spin loop. So what is the difference?

To my understanding, both approaches use polling to synchronize. In pseudo-code, cudaDeviceScheduleSpin is:
while (!IsCudaJobDone())
{
    // spin: keep the CPU busy to minimize latency
}
whereas cudaDeviceScheduleYield is:
while (!IsCudaJobDone())
{
    std::this_thread::yield();  // offer the rest of the time slice to other threads
}
i.e., cudaDeviceScheduleYield tells the operating system that it may interrupt the polling thread and activate another thread doing other work. This increases the performance of other threads on the CPU but also increases latency, in case the CUDA job finishes while a thread other than the polling one happens to be active.
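For completeness, a minimal sketch of how one of these scheduling policies is selected, using the runtime call cudaSetDeviceFlags (error handling elided):

#include <cuda_runtime.h>

int main()
{
    // The flag must be set before the CUDA context for the device is created.
    // Pick ONE of: cudaDeviceScheduleSpin, cudaDeviceScheduleYield,
    // cudaDeviceScheduleBlockingSync (sleep on a synchronization primitive).
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    // ... launch kernels; with the blocking-sync flag,
    // cudaDeviceSynchronize() waits without spinning ...
    cudaDeviceSynchronize();
}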

Related

Is it really impossible to suspend two std/posix threads at the same time?

I want to briefly suspend multiple C++ std threads, running on Linux, at the same time.
It seems this is not supported by the OS.
The threads work on tasks that take an uneven and unpredictable amount of time (several seconds).
I want to suspend them when the CPU temperature rises above a threshold.
It is impractical to check for suspension within the tasks, only in between tasks.
I would like to simply have all workers suspend operation for a few milliseconds.
How could that be done?
What I'm currently doing
I'm currently using a condition variable in a slim, custom binary semaphore class (think C++20 Semaphore).
A worker checks for suspension before starting the next task by acquiring and immediately releasing the semaphore.
A separate control thread occupies the control semaphore for a few milliseconds if the temperature is too high.
This often works well and the CPU temperature is stable.
I do not care much about a slight delay in suspending the threads.
However, when one task takes some seconds longer than the others, its thread will continue to run alone.
This activates CPU turbo mode, which is the opposite of what I want to achieve (it is comparatively power inefficient, thus bad for thermals).
I cannot deactivate CPU turbo as I do not control the hardware.
In other words, the tasks take too long to complete.
So I want to forcefully pause them from outside.
I want to suspend them when the CPU temperature rises above a threshold.
In general, that is putting the cart before the horse.
Properly designed hardware should have adequate cooling for maximum load and your program should not be able to exceed that cooling capacity.
In addition, since you are talking about Turbo, we can assume an Intel CPU, which will thermally throttle all on its own, making your program run slower without you doing anything.
In other words, the tasks take too long to complete
You could break the tasks into smaller parts, and check the semaphore more often.
A separate control thread occupies the control semaphore for a few milliseconds
It's really unlikely that your hardware can react to millisecond delays -- that's too short a timescale for anything thermal. You will probably be better off monitoring the temperature and simply reducing the number of tasks you are scheduling when the temperature is rising and getting close to your limits.
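A minimal sketch of such a check point, assuming a shared gate that the control thread can close (the Gate name and interface are illustrative, not from the question's code):

#include <condition_variable>
#include <mutex>

class Gate
{
    std::mutex m;
    std::condition_variable cv;
    bool open = true;

public:
    void close()
    {
        std::lock_guard<std::mutex> l(m);
        open = false;
    }
    void reopen()
    {
        { std::lock_guard<std::mutex> l(m); open = true; }
        cv.notify_all();
    }
    // Workers call this between (sub-)tasks: it returns immediately while
    // the gate is open and blocks the caller while it is closed.
    void pass()
    {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return open; });
    }
};

The control thread calls close() when the temperature exceeds the threshold and reopen() once it has dropped; the finer the task granularity, the faster all workers actually pause.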
I've now implemented it with pthread_kill and SIGRT.
Note that suspending threads in an unknown state (whatever the target task was doing at the time of signal receipt) is a recipe for deadlocks. The task may be inside malloc, may be holding arbitrary locks, etc.
If your "control thread" also needs that lock, it will block and you lose. Your control thread must execute only direct system calls, may not call into libc, etc. etc.
This solution is ~impossible to test, and ~impossible to implement correctly.

Measure time with same result

I want to measure time (very precisely, in milliseconds) and start other threads (four in total) at a certain time. I tried it with:
#include <ctime>

std::clock_t start;

double Time()
{
    // convert elapsed processor time from clock ticks to milliseconds
    return 1000.0 * (std::clock() - start) / CLOCKS_PER_SEC;
}

//...
start = std::clock();
while (Time() < 1000)
{
    //Start thread...
    //...
}
It works, but in every experiment I received a different result (a small difference). Is that even possible? Does it depend on how many programs run in the background (they slow down my computer)? If it is possible, what should I use? Thanks
(sorry for my English)
The operating system runs in quanta - little chunks of processing time that are below our level of perception.
Within a single quantum, the CPU should behave reasonably stably. If your task needs more than one quantum of time, then the operating system is free to use slices of time for other tasks.
Using a condition variable, you can notify_all to wake up any waiting threads.
So start the required number of threads, but before they are measured and start working, have them wait on a condition_variable. Then, when notify_all is called on the condition_variable, the threads become runnable. If they are started at the same time, you should get synchronized, stable results.
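A minimal, self-contained sketch of that start gate (the go flag and worker function are illustrative names):

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
bool go = false;

void worker()
{
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return go; });  // block until released
    }
    // ... the measured work starts here, for all threads at once ...
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker);

    {
        std::lock_guard<std::mutex> lock(m);
        go = true;
    }
    cv.notify_all();  // release all workers at (nearly) the same time

    for (auto& t : threads)
        t.join();
}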
Variance occurs:
Not scheduled - the cores of your CPU are doing other things, so one or more threads miss the quantum.
Blocked on I/O - if a thread needs to interact with the disk, that can block it until data is available.
Blocked on a mutex - if the threads modify a shared resource, waiting for the resource to become free adds time.
Cache behavior - some operations cause the caches of all CPUs to be flushed, which affects the performance of all threads.
Whether data is in the cache or not - the CPU runs faster from L1 cache than from main memory. If the threads read the same data, they help each other get that data cached and run at the same(ish) speed.

Which threads exactly are CPU bound

I heard that the optimal number of threads depends on whether they are CPU bound or not. But what exactly does that mean?
Suppose that most of the time my threads sleep via the Sleep function from WinAPI. Should I consider such threads non-CPU-bound and increase their number beyond the CPU core count?
A thread is bound by a resource if it spends most of its time using it, and thus its speed is bound by the speed of that resource.
Given the above definition, a thread is CPU bound if its most used resource is the computing power of the CPU, that is, it's a thread that does heavy computation. You gain nothing from running more of these than there are available cores, because they will compete for CPU time.
You can, instead, run more threads than available cores when the threads are bound by other resources (most commonly files), because they will spend most of their time waiting for those to be ready, and thus leave the CPU available for other threads.
A thread that spends most time sleeping does not use the CPU very much, and thus it is not CPU bound.
EDIT: examples of non-CPU bound threads are threads that read files, wait for network connections, talk to PCI connected devices, spend most time waiting on condition variables and GUI threads that wait for user input.
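As a rule-of-thumb sketch for sizing a pool of CPU-bound threads (the fallback value is an arbitrary assumption):

#include <thread>

int main()
{
    // One thread per core is a sensible cap for CPU-bound work; threads
    // that mostly sleep or wait on I/O can outnumber the cores.
    unsigned cores = std::thread::hardware_concurrency();  // may be 0 if unknown
    unsigned cpu_bound_threads = (cores != 0) ? cores : 4; // fallback is a guess
    (void)cpu_bound_threads;
}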

Do deadlocks cause high CPU utilization?

Do deadlocks put processes into a high rate of CPU usage, or do these two processes both "sleep", waiting on the other to finish?
I am trying to debug a multithreaded program written in C++ on a Linux system. I have noticed excessive CPU utilization from one particular process, and am wondering if it could be due to a deadlock issue. I have identified that one process consistently uses more of the CPU than I would anticipate (using top), and the process works, but it works slowly. If deadlocks cause the processes to sleep and do not cause high CPU usage, then at least I know this is not a deadlocking issue.
A deadlock typically does not cause high CPU usage, at least not if the deadlock occurs in synchronization primitives that are backed by the OS such that processes sleep while they wait.
If the deadlock occurs with, e.g., lockless synchronization mechanisms (such as compare-exchange in a busy loop), CPU usage will go up.
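For illustration, a minimal test-and-set spin lock of the kind that pegs a core when the lock can never be acquired (a sketch, not taken from the question's program):

#include <atomic>

std::atomic_flag locked = ATOMIC_FLAG_INIT;

void spin_lock()
{
    // If the holder never releases, this loop burns 100% of a core:
    // it never sleeps and never yields to the scheduler.
    while (locked.test_and_set(std::memory_order_acquire))
        ;  // busy-wait
}

void spin_unlock()
{
    locked.clear(std::memory_order_release);
}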
Also, there is the notion of a livelock, which occurs when a program with multiple threads is unable to advance to some intended state because some condition (that depends on interaction between threads) cannot be fulfilled, even though none of the threads is explicitly waiting for something.
It depends on the type of lock. A lock that is implemented as a spin loop could drive CPU usage up to 100% in a deadlock situation.
On the other hand, a signalling lock such as a kernel mutex does not consume CPU cycles while waiting, so a deadlock on such a lock would not peg the CPU at 100%.
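By contrast, a classic lock-order inversion on kernel-backed mutexes leaves both threads asleep at roughly 0% CPU; a minimal sketch:

#include <chrono>
#include <mutex>
#include <thread>

std::mutex a, b;

void t1()
{
    std::lock_guard<std::mutex> la(a);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> lb(b);  // blocks forever: t2 holds b
}

void t2()
{
    std::lock_guard<std::mutex> lb(b);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> la(a);  // blocks forever: t1 holds a
}

int main()
{
    std::thread x(t1), y(t2);
    x.join();  // never returns; both threads sleep in the kernel
    y.join();
}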

What is the difference between a busy loop with Sleep(0) and the PAUSE instruction?

I would like to wait for an event in my app which is supposed to happen immediately, so I don't want to put my thread to sleep and wake it up later.
I wonder what the differences are between using Sleep(0) and the hardware PAUSE instruction.
I cannot see any difference in CPU utilization for the following program. My question isn't about power-saving considerations.
#include <windows.h>

volatile bool t = false;  // volatile, so the loop actually re-reads the flag

int main()
{
    while (t == false)
    {
        __asm { pause }
        //Sleep(0);
    }
}
Windows Sleep(0) vs The PAUSE instruction
Let me quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual.
In multi-threading implementations, a popular construct in thread synchronization, and for yielding scheduling quanta to another thread waiting to carry out its task, is to sit in a loop issuing SLEEP(0). These are typically called "sleep loops" (see Example #1). It should be noted that a SwitchToThread call can also be used. The "sleep loop" is common in locking algorithms and thread pools, as the threads are waiting on work.
This construct of sitting in a tight loop and calling the Sleep() service with a parameter of 0 is actually a polling loop with side effects:
Each call to Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles.
It also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.
When there is no other thread waiting to take possession of control, this sleep loop behaves to the OS as a highly active task demanding CPU resources, preventing the OS from putting the CPU into a low-power state.
Example #1. Unoptimized Sleep Loop
while (!acquire_lock())
{
    Sleep(0);
}
do_work();
release_lock();
Example #2. Power Consumption Friendly Sleep Loop Using PAUSE
ATTEMPT_AGAIN:   /* label implied in the manual's listing; added for completeness */
if (!acquire_lock())
{
    /* Spin on PAUSE max_spin_count times before backing off to sleep */
    for (int j = 0; j < max_spin_count; ++j)
    {
        _mm_pause();  /* intrinsic for the PAUSE instruction */
        if (read_volatile_lock())
        {
            if (acquire_lock())
                goto PROTECTED_CODE;
        }
    }
    /* The PAUSE loop didn't work; sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
}
PROTECTED_CODE:
do_work();
release_lock();
Example #2 shows the technique of using the PAUSE instruction to make the sleep loop power-friendly.
By slowing down the "spin-wait" with the PAUSE instruction, the multi-threading software gains:
Performance, by making it easier for waiting tasks to acquire resources from a busy wait.
Power savings, by using fewer parts of the pipeline while spinning.
Elimination of the great majority of unnecessarily executed instructions caused by the overhead of a Sleep(0) call.
In one case study, this technique achieved a 4.3x performance gain, which translated to 21% power savings at the processor and 13% power savings at the platform level.
Pause Latency in Skylake Microarchitecture
The PAUSE instruction is typically used with software threads executing on two logical processors located in the same processor core, waiting for a lock to be released. Such short wait loops tend to last between tens and a few hundreds of cycles, so performance-wise it is more beneficial to wait while occupying the CPU than yielding to the OS. When the wait loop is expected to last for thousands of cycles or more, it is preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject on Windows OS.
The PAUSE instruction is intended to:
Temporarily provide the sibling logical processor (ready to make forward progress exiting the spin loop) with competitively shared hardware resources. The competitively-shared microarchitectural resources that the sibling logical processor can utilize in the Skylake microarchitecture are: (1) More front end slots in the Decode ICache, LSD and IDQ; (2) More execution slots in the RS.
Save power consumed by the processor core compared to executing equivalent spin loop instruction sequence in the following configurations: (1) One logical processor is inactive (e.g. entering a C-state); (2) Both logical processors in the same core execute the PAUSE instruction; (3) HT is disabled (e.g. using BIOS options).
The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles.
The increased latency (allowing more effective utilization of competitively-shared microarchitectural resources to the logical processor ready to make forward progress) has a small positive performance impact of 1-2% on highly threaded applications. It is expected to have negligible impact on less threaded applications if forward progress is not blocked on executing a fixed number of looped PAUSE instructions.
There's also a small power benefit in 2-core and 4-core systems. As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.
You can find more information on this issue in the "Intel 64 and IA-32 Architectures Optimization Reference Manual" and "Intel 64 and IA-32 Architectures Software Developer’s Manual", along with the code samples.
My Opinion
It is better to structure the program's logic flow so that neither Sleep(0) nor the PAUSE instruction is ever needed. In other words, avoid "spin-wait" loops altogether. Instead, use high-level synchronization functions like WaitForMultipleObjects(), SetEvent(), and so on. Such high-level synchronization functions are the best way to write programs. If you analyze the tools at your disposal in terms of performance, efficiency, and power saving, the higher-level functions are the best choice. Although they also incur expensive context switches and ring 3 to ring 0 transitions, these costs are infrequent and more than reasonable compared to what you would have spent in total on all the "spin-wait" PAUSE cycles combined, or on the cycles burned by Sleep(0).
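A minimal sketch of that event-driven style using the Win32 event API mentioned above (error handling elided):

#include <windows.h>

int main()
{
    // Auto-reset event, initially non-signalled.
    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    // A producer (typically another thread) signals when the data is ready:
    SetEvent(evt);

    // The waiter sleeps in the kernel - no spinning, no Sleep(0) loop -
    // until the event is signalled.
    WaitForSingleObject(evt, INFINITE);

    CloseHandle(evt);
}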
On a processor supporting hyper-threading, "spin-wait" loops can consume a significant portion of the execution bandwidth of the processor. One logical processor executing a spin-wait loop can severely impact the performance of the other logical processor. That is why disabling hyper-threading can sometimes improve performance, as some people have pointed out.
Consistently polling for device, file, or state changes in the program's workflow can cause the computer to consume more power, put stress on memory and the bus, and produce unnecessary page faults (use the Task Manager on Windows to see which applications produce the most page faults while idle, waiting for user input in the background - these are the most inefficient applications, since they rely on the polling mentioned above). Minimize polling (including spin loops) whenever possible and use an event-driven design and/or framework if available - this is the best practice that I highly recommend. Your application should literally sleep all the time, waiting for multiple events set up in advance.
A good example of an event-driven application is Nginx, initially written for Unix-like operating systems. Operating systems provide various functions and methods to notify your application; use these notifications instead of polling for device state changes. Just let your program sleep until a notification or user input arrives. Such a technique reduces the overhead of polling the data source's status, because the code can be notified asynchronously when status changes happen.
Sleep is a system call, which allows the OS to reschedule the CPU time to any other process, if available, before allowing the caller to continue (even if the parameter is 0).
__asm { pause } is not portable.
Well, Sleep is not portable either - though the issue there is not at the CPU level but at the system-library level.
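For what it's worth, the _mm_pause() intrinsic (already used in Example #2 above) emits the same PAUSE instruction and compiles on MSVC, GCC, and Clang for x86, so it is a more portable spelling than the inline assembly; a sketch:

#include <immintrin.h>  // _mm_pause

volatile bool done = false;

void wait_briefly()
{
    while (!done)
        _mm_pause();  // same PAUSE instruction, without MSVC-only __asm
}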