Using Nsight to determine bank conflicts and coalescing - c++

How can I know the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Moreover, what should I look at when I use Nsight as a profiler? What are the important fields that may cause my program to slow down?

I don't use Nsight, but the typical fields you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically with CUDA, you'll want to pay close attention to your GPU's occupancy.
Other interesting values are where the compiler has placed your local variables: in registers or in local memory.
Finally, you'll check the time spent transferring data to and from the GPU, and compare it with the computation time.
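For that last point, here is a minimal sketch of how the transfer time can be compared against the kernel time using CUDA events (the dummy kernel and sizes are made-up placeholders, and error checking is omitted):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                       // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n](), *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the host-to-device transfer.
    cudaEventRecord(start);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msCopy = 0.0f;
    cudaEventElapsedTime(&msCopy, start, stop);

    // Time the kernel.
    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msKernel = 0.0f;
    cudaEventElapsedTime(&msKernel, start, stop);

    printf("copy: %.3f ms, kernel: %.3f ms\n", msCopy, msKernel);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    delete[] h;
    return 0;
}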

For bank conflicts, you need to watch the warp serialization counters.
For memory coalescing, basically you just need to watch the Global Memory Loads/Stores counters (Coalesced vs. Uncoalesced) and flag the uncoalesced ones.
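For intuition about what those counters flag, here is an illustrative kernel sketch (not from the question; the array names, the stride, and the 32-threads-per-block launch are assumptions). A profiler would report the strided global read as uncoalesced and the strided shared-memory access as warp serialization caused by bank conflicts:
__global__ void accessPatterns(const float* in, float* out, int stride) {
    // Assumes blockDim.x == 32 (one warp per block) and "in" large enough for the strided read.
    __shared__ float tile[32 * 32];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float a = in[i];               // coalesced: consecutive threads read consecutive addresses
    float b = in[i * stride];      // uncoalesced for large strides: one warp touches many memory segments

    tile[threadIdx.x]      = a;    // conflict-free: consecutive threads hit different shared memory banks
    tile[threadIdx.x * 32] = b;    // 32-way bank conflict: every thread in the warp hits the same bank
    __syncthreads();

    out[i] = tile[threadIdx.x] + tile[threadIdx.x * 32];
}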

M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
As for the question of which fields/things to look at (when using the Nsight profiler) that may cause your program to slow down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring but your application threads (Thread State) are Green
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory row. If you are spending a lot of time in memory copies, then you should consider using CUDA streams to pipeline your application. This allows you to overlap memory transfers and kernels (see the sketch after this list). Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.
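As a sketch of the streams idea from point b (the process() kernel, buffer names, and chunk count are made-up assumptions): splitting the work into chunks and giving each chunk its own stream lets the copy for one chunk overlap with the kernel for another. Pinned host memory is required for the asynchronous copies to actually overlap:
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                    // placeholder work
}

int main() {
    const int N = 1 << 22, NCHUNKS = 4, chunk = N / NCHUNKS;

    float *h_in = nullptr, *d_in = nullptr;
    cudaMallocHost(&h_in, N * sizeof(float));      // pinned host memory
    cudaMalloc(&d_in, N * sizeof(float));

    cudaStream_t streams[NCHUNKS];
    for (int s = 0; s < NCHUNKS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NCHUNKS; ++s) {
        int offset = s * chunk;
        // The copy for chunk s can overlap with the kernel for chunk s-1.
        cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + offset, chunk);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NCHUNKS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}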

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

In the NVIDIA developer blog post An Even Easier Introduction to CUDA, the writer explains:
To compute on the GPU, I need to allocate memory accessible by the GPU. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call cudaMallocManaged(), which returns a pointer that you can access from host (CPU) code or device (GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device (GPU) code.
For this to be true, it seems like cudaMallocManaged() must be syncing 2 buffers across VRAM and RAM. Is this the case? Or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned the distinct performance difference between passing VRAM based buffers (textures in WebGL) from kernel to kernel (keeping the buffer on the GPU, highly performant) and retrieving the buffer value outside of the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, taking a performance hit since buffers in VRAM on the GPU don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated to (i.e. moved to, and given a physical allocation on) the processor that attempts to access it.
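For reference, the basic usage pattern the blog describes looks roughly like this minimal sketch (the addOne kernel and sizes are made up): the same pointer is dereferenced in device code and in host code, and the runtime migrates the pages on demand:
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // one pointer, visible to both CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 0.0f;    // touched on the host first

    addOne<<<(n + 255) / 256, 256>>>(data, n);     // pages migrate to the GPU on access
    cudaDeviceSynchronize();                       // required before touching the data on the host again

    float first = data[0];                         // pages migrate back to the host on access
    cudaFree(data);
    return first == 1.0f ? 0 : 1;
}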
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is managed by the runtime, not compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No you don't have it all wrong. Yes we are talking about VRAM.
The blog you reference barely touches on managed memory, which is a fairly involved subject, and there are numerous online resources to learn more about it that you might want to review. There are good GTC presentations on managed memory, and there is also an entire section of the CUDA programming guide covering managed memory.

How to eager commit allocated memory in C++?

The General Situation
An application that is extremely intensive on bandwidth, CPU usage, and GPU usage needs to transfer about 10-15 GB per second from one GPU to another. It's using the DX11 API to access the GPU, so uploads to the GPU can only happen through buffers that require mapping for each single upload. The upload happens in chunks of 25 MB at a time, and 16 threads are writing to the mapped buffers concurrently. There's not much that can be done about any of this. The actual concurrency level of the writes should be lower, if it weren't for the following bug.
It's a beefy workstation with 3 Pascal GPUs, a high-end Haswell processor, and quad-channel RAM. Not much can be improved on the hardware. It's running a desktop edition of Windows 10.
The Actual Problem
Once I pass ~50% CPU load, something in MmPageFault() (inside the Windows kernel, called when accessing memory which has been mapped into your address space, but was not committed by the OS yet) breaks horribly, and the remaining 50% CPU load is being wasted on a spin-lock inside MmPageFault(). The CPU becomes 100% utilized, and the application performance completely degrades.
I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second, and which is also completely unmapped from the process every time the DX11 buffer is unmapped. Correspondingly, it's actually thousands of calls to MmPageFault() per second, happening sequentially as memcpy() writes sequentially into the buffer, one for each single uncommitted page encountered.
Once the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel protecting the page management completely degrades performance-wise.
Considerations
The buffer is allocated by the DX11 driver. Nothing can be tweaked about the allocation strategy. Use of a different memory API and especially re-use is not possible.
Calls to the DX11 API (mapping/unmapping the buffers) all happen from a single thread. The actual copy operations potentially happen multi-threaded, across more threads than there are virtual processors in the system.
Reducing the memory bandwidth requirements is not possible. It's a real-time application. In fact, the hard limit is currently the PCIe 3.0 16x bandwidth of the primary GPU. If I could, I would already need to push further.
Avoiding multi-threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.
The spin-lock performance degradation appears to be so rare (because the use case is pushing it that far) that on Google, you won't find a single result for the name of the spin-lock function.
Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short-term fix. Switching to a better OS kernel is currently not an option for the same reason.
Reducing the CPU load doesn't work either; there is too much work which needs to be done other than the (usually trivial and inexpensive) buffer copy.
The Question
What can be done?
I need to reduce the number of individual pagefaults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.
How can I ensure that the memory is committed with the least amount of transactions possible?
Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.
The current state
// In the processing threads
{
    DX11DeferredContext->Map(..., &buffer);
    std::memcpy(buffer, source, size);      // soft-faults on every single uncommitted page
    DX11DeferredContext->Unmap(...);        // unmapping de-allocates the pages again
}
Current workaround, simplified pseudo code:
// During startup
{
    // Force SIZE_T arithmetic: 2*1024*1024*1024 would overflow a 32-bit int.
    SetProcessWorkingSetSize(GetCurrentProcess(), SIZE_T(2) * 1024 * 1024 * 1024, SIZE_T(-1));
}
// In the DX11 render loop thread
{
    DX11context->Map(..., &resource);
    VirtualLock(resource.pData, resource.size);   // commit the whole mapped range in one go
    notify();
    wait();
    DX11context->Unmap(...);
}
// In the processing threads
{
    wait();
    std::memcpy(buffer, source, size);            // pages are already resident, no soft faults
    signal();
}
VirtualLock() forces the kernel to back the specified address range with RAM immediately. The call to the complementary VirtualUnlock() function is optional; it happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about one third of the locking cost.)
In order for VirtualLock() to work at all, SetProcessWorkingSetSize() needs to be called first, as the sum of all memory regions locked by VirtualLock() cannot exceed the minimum working set size configured for the process. Setting the "minimum" working set size to something higher than the baseline memory footprint of your process has no side effects unless your system is actually close to swapping; your process will still not consume more RAM than its actual working set size.
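As a standalone sketch of that mechanism outside of DX11 (the sizes here are arbitrary assumptions): VirtualLock() is what forces the physical pages in, and SetProcessWorkingSetSize() merely raises the limit so that the lock is allowed to succeed:
#include <windows.h>

int main() {
    const SIZE_T size = SIZE_T(256) * 1024 * 1024;            // 256 MB, arbitrary

    // Raise the minimum working set so VirtualLock() may pin this much memory.
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size + 64 * 1024 * 1024,
                             SIZE_T(1) * 1024 * 1024 * 1024);

    // Reserve and commit address space; physical pages would normally arrive
    // one soft fault at a time on first touch.
    void* p = VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!p) return 1;

    // Back the whole range with RAM up front, in a single call, instead of
    // thousands of individual page faults during memcpy().
    VirtualLock(p, size);

    // ... write to the range at full speed ...

    VirtualUnlock(p, size);                                   // optional, see above
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}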
Just the use of VirtualLock(), albeit in individual threads and using deferred DX11 contexts for the Map/Unmap calls, instantly decreased the performance penalty from 40-50% to a slightly more acceptable 15%.
Discarding the use of a deferred context, and triggering both all soft faults and the corresponding de-allocations on unmapping exclusively on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.
Summary?
When you expect soft faults on Windows, do what you can to keep them all in the same thread. Performing the memcpy itself in parallel is unproblematic, and in some situations even necessary to fully utilize the memory bandwidth. However, that only holds if the memory is already committed to RAM, and VirtualLock() is the most efficient way to ensure that.
(Unless you are working with an API like DirectX which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc, your memory is pooled and recycled inside your process anyway, so soft faults are rare.)
Just make sure to avoid any form of concurrent page faults when working with Windows.

Concurrent memory access slowing down system

I have a particle system that needs to be visualized, but the visualization should have effectively no effect on the simulation itself. The current way this happens is by letting a second thread read the state of the particle system without any synchronization. This will of course cause the visualization to show some glitches, but that is not an issue.
However, what seems to happen is that the faster the renderer renders, the slower the particle system becomes. The measured time step of the simulation shows spikes and almost doubles on average. I am fairly certain this is due to the renderer accessing the memory used by the particle system from a different thread.
Now the question is: is it somehow possible to disturb the particle system to a lesser extent? Accuracy of the visualization is not an issue at all. I could theoretically imagine some way to instruct the compiler that the renderer is purely read-only and/or that it does not need "recent" versions of the data, but I have no clue how to approach this.
PS. Language: C++, IDE: Visual Studio
PPS. Of course keeping the FPS of the renderer low already helps due to less memory access, but the measured time step of the simulation still spikes and slows down.
Your system slows down because, when you just run the simulation, your data is most probably in the L1/L2 cache. The cache lines are in the "Modified" state, and every read and write to those cache lines is a cache hit with no bus transactions (i.e. fast).
Once you run another thread accessing the same data, the changes made by the simulation need to be propagated to the point of coherency, so the visualization process (running on another CPU core) can read them. So the state of the cache line transitions from "Modified" to "Shared".
Then, once the simulation thread wants to modify that shared data again, the cache line transitions from "Shared" back to "Modified" and a bus transaction is generated, so the copy of the cache line in the other caches gets invalidated.
So even reading from another thread slows down the simulation, because cache lines jump between states and a lot of bus traffic is going on underneath. On Intel, the cache coherency protocol is called MESI(F); you can find more on that on Wikipedia:
https://en.wikipedia.org/wiki/MESI_protocol
Regarding how to deal with the issue: basically, you should avoid reading and writing the same data at the same time. That is hard, but you might want to consider the following:
You might try to modify the simulation so that it operates on two banks of data: one bank is used for the simulation, while the other is used to visualize previously calculated data (a sketch follows below).
You might try to simply copy the data in the visualization thread in one go, right after a simulation loop has finished. This will fix the glitches and might improve the overall performance.
To synchronize between the simulation and visualization threads, you should use busy waiting (like spin locks), not kernel objects like mutexes.
All that said, there is no guarantee that any of those techniques will help in your case. It all depends on your data, CPU, cache sizes, etc.
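As a minimal sketch of the first suggestion, two banks of data with an atomic index that publishes the most recently finished one (Particle, stepSimulation() and draw() are made-up stand-ins for the real simulation and renderer):
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

struct Particle { float x, y, vx, vy; };

void stepSimulation(std::vector<Particle>& p) { /* advance the particles */ }
void draw(const std::vector<Particle>& p)     { /* render the particles */ }

std::vector<Particle> banks[2];        // two copies of the particle state
std::atomic<int>  frontBank{0};        // bank the renderer is allowed to read
std::atomic<bool> running{true};

void simulationLoop() {
    while (running) {
        int back = 1 - frontBank.load(std::memory_order_relaxed);
        stepSimulation(banks[back]);                       // write only the back bank
        frontBank.store(back, std::memory_order_release);  // publish it for the renderer
    }
}

void renderLoop() {
    while (running) {
        int front = frontBank.load(std::memory_order_acquire);
        draw(banks[front]);   // read-mostly access; with only two banks the renderer can still
                              // occasionally overlap with the writer (the tolerated "glitches");
                              // a third bank removes that overlap entirely
    }
}

int main() {
    banks[0].resize(100000);
    banks[1].resize(100000);
    std::thread sim(simulationLoop), render(renderLoop);
    std::this_thread::sleep_for(std::chrono::seconds(5));
    running = false;
    sim.join();
    render.join();
    return 0;
}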

At what code complexity does an OpenACC kernel lose efficiency on common GPU?

At roughly what code complexity do OpenACC kernels lose efficiency on a common GPU, i.e. when do registers, shared memory operations, or some other aspect start to bottleneck performance?
Also, is there some point where too little work per kernel makes the overhead of transferring data to the GPU and launching on its cores the bottleneck?
Would cache sizes, and whether the code fits into them, indicate the optimal amount of work per kernel, or is it something else?
Roughly how big is the OpenACC overhead per kernel compared to the potential performance, and does it vary a lot with the various directives?
I would refrain from using the complexity of the code as an indication of performance. You can have a highly complex code run very efficiently on a GPU and a simple code run poorly. Instead, I would look at the following factors:
Data movement between the device and host. Limit the frequency of data movement and try to transfer data in contiguous chunks. Use OpenACC unstructured data regions to match the host allocation on the device (i.e. use "enter data" at the same time as you allocate data via "new" or "malloc"). Move as much compute to the GPU as you can and only use the OpenACC update directive to synchronize host and device data when absolutely necessary (see the sketch after this list). In cases where data movement is unavoidable, investigate using the "async" clause to interleave the data movement with compute.
Data access on the device and limiting memory divergence. Be sure to lay out your data so that the stride-1 (contiguous) dimension of your arrays is accessed contiguously across the vectors.
Have a high compute intensity, which is the ratio of computation to data movement. The more compute and the less data movement, the better. However, lower compute intensity loops are fine if there are other high-intensity loops and the cost of moving the data to the host would offset the cost of running the kernel on the device.
Avoid allocating data on the device, since it forces threads to serialize. This includes using Fortran "automatic" arrays and declaring C++ objects whose constructors perform allocation.
Avoid atomic operations. Atomic operations are actually quite efficient when compared to host atomics, but still should be avoided if possible.
Avoid subroutine calls. Try to inline routines when possible.
Occupancy. Occupancy is the ratio of the number of threads that can potentially be running on the GPU to the maximum number of threads that could be running. Note that 100% occupancy does not guarantee high performance, but you should try to get above 50% if possible. The limiters to occupancy are the number of registers used per thread (vector) and the shared memory used per block (gang). Assuming you're using the PGI compiler, you can see the limits of your device by running the PGI "pgaccelinfo" utility. The number of registers used will depend upon the number of local scalars used (explicitly declared by the programmer and temporaries created by the compiler to hold intermediate calculations), and the amount of shared memory used will be determined by the OpenACC "cache" and "private" directives when "private" is used on a "gang" loop. You can see how much each kernel uses by adding the flag "-ta=tesla:ptxinfo". You can limit the number of registers used per thread via "-ta=tesla:maxregcount:". Reducing the number of registers will increase the occupancy but also increase the number of register spills. Spills are fine so long as they only spill to L1/L2 cache; spilling to global memory will hurt performance. It's often better to suffer lower occupancy than to spill to global memory.
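As a sketch of the first point in this list, unstructured data regions tied to the host allocation (the array, loop, and computation are made-up placeholders; assumes an OpenACC compiler such as PGI with -ta=tesla):
int main() {
    const int n = 1 << 20, nsteps = 100;
    double *a = new double[n];
    #pragma acc enter data create(a[0:n])        // device copy created alongside the host allocation

    for (int step = 0; step < nsteps; ++step) {
        #pragma acc parallel loop present(a)     // compute stays resident on the device, no implicit copies
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * i + step;               // placeholder computation
    }

    #pragma acc update self(a[0:n])              // copy back to the host only when the host really needs it
    double checksum = a[0] + a[n - 1];

    #pragma acc exit data delete(a[0:n])         // device copy removed alongside the host deallocation
    delete[] a;
    return checksum >= 0.0 ? 0 : 1;
}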
Note that I highly recommend using a profiler (PGPROF, NVprof, Score-P, TAU, Vampir, etc.) to help discover a program's performance bottlenecks.

How to write a C or C++ program to act as a memory and CPU cycle filler?

I want to test a program's memory management capabilities. For example (say the program's name is director):
What happens if some other processes take up too much memory and there is too little memory left for director to run? How does director behave?
What happens if too many of the CPU cycles are used by some other program while director is running?
What happens if the memory used by the other programs is freed after some time? How does director reclaim the memory and start working at full capacity?
etc.
I'll be doing these experiments on a Unix machine. One way is to limit the amount of memory available to the process using ulimit, but there is no good way to have control over the CPU cycle utilization.
I have another idea. What if I write some program in C or C++ that acts as a dynamic memory and CPU filler, i.e. does nothing useful but eats up memory and/or CPU cycles anyway?
I need some ideas on how such a program should be structured. I need to have dynamic (runtime) control over the memory and CPU used.
I think that creating a lot of threads would be a good way to clog up the CPU cycles. Is that right?
Is there a better approach that I can use?
Any ideas/suggestions/comments are welcome.
http://weather.ou.edu/~apw/projects/stress/
Stress is a deliberately simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system. It is written in C, and is free software licensed under the GPLv2.
The functionality you seek overlaps the feature set of "test tools". So also check out http://ltp.sourceforge.net/tooltable.php.
If you have a single core, a simple busy loop is enough to put stress on a CPU:
volatile unsigned long long x = 0;   // volatile, so the compiler can't optimize the loop away
while (true) {
    x++;
}
If you have lots of cores then you need a thread per core.
You make it variably hungry by adding a few sleeps.
As for memory, just allocate lots.
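Putting those pieces together, a rough sketch of such a filler (all parameters are arbitrary assumptions): it pins a configurable amount of memory by touching every page and burns a configurable fraction of each core with a spin/sleep duty cycle:
#include <chrono>
#include <cstring>
#include <thread>
#include <vector>

// Spin for busyMs, then sleep for idleMs, forever: the ratio sets the per-core CPU load.
void burnCpu(int busyMs, int idleMs) {
    volatile unsigned long long x = 0;
    for (;;) {
        auto until = std::chrono::steady_clock::now() + std::chrono::milliseconds(busyMs);
        while (std::chrono::steady_clock::now() < until)
            x = x + 1;                                  // busy work the optimizer can't drop
        std::this_thread::sleep_for(std::chrono::milliseconds(idleMs));
    }
}

int main() {
    const std::size_t megabytes = 512;                  // amount of memory to hog, arbitrary
    const int busyMs = 70, idleMs = 30;                 // roughly 70% load per thread

    // Allocate and touch every byte so the memory is actually resident, not just reserved.
    std::vector<char> hog(megabytes * 1024 * 1024);
    std::memset(hog.data(), 1, hog.size());

    // One spinning thread per hardware core.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < cores; ++i)
        threads.emplace_back(burnCpu, busyMs, idleMs);
    for (auto& t : threads)
        t.join();                                       // never returns; kill the process to stop
    return 0;
}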
There are several problems with such a design:
In a virtual memory system, memory size is effectively unlimited. (Well, it's limited by your hard disk...) In practice, systems usually run out of address space well before they run out of backing store -- and address space is a per-process resource.
Any reasonable (non realtime) operating system is going to limit how much CPU time and memory your process can use relative to other processes.
It's already been done.
More importantly, I don't see why you would ever want to do this.
For dynamic memory control, you could just allocate or free buffers of a certain size to use more or less memory. As for CPU utilization, you will have to use an OS function to measure the current load, check it periodically, and see whether you need to do more work.