CUDA zero-copy performance - C++

Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (reference: Default Pinned Memory Vs Zero-Copy Memory)?
I have a kernel that uses the zero-copy feature and with NVVP I see the following:
Running the kernel on an average problem size, I get an instruction replay overhead of 0.7%, so nothing major, and all of this 0.7% is global memory replay overhead.
When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.
However, the global load efficiency and global store efficiency for both the normal problem size kernel run and the very very large problem size kernel run are the same. I'm not really sure what to make of this combination of metrics.
The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero copy feature. Any ideas of what type of statistics I should be looking at?

Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:
The memory operation was for a size specifier (vector type) that requires multiple transactions in order to perform the address divergence calculation and communicate data to/from the L1 cache.
The memory operation had thread address divergence requiring access to multiple cache lines.
The memory transaction missed the L1 cache. When the miss data is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
The LSU resources are full and the instruction needs to be replayed when the resources are available.
The latency to:
L2 is 200-400 cycles
device memory (DRAM) is 400-800 cycles
zero-copy memory over PCIe is 1000s of cycles
The replay overhead increases because of the increase in misses and because the increased latency creates more contention for LSU resources.
The global load efficiency is not increasing as it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache line boundary (32-bit operation is 1 cache line, 64-bit operation is 2 cache lines, 128-bit operation is 4 cache lines). Accessing zero copy is slower and less efficient but it does not increase or change the amount of data transferred.
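As a rough worked example (assuming the 128-byte cache lines of Fermi/Kepler L1): a warp of 32 threads each loading a 4-byte value ideally touches 32 x 4 = 128 bytes, i.e. one cache line, for 100% load efficiency. If the same accesses are scattered over four cache lines, 4 x 128 = 512 bytes are moved and the efficiency drops to 128/512 = 25%. Zero-copy only changes where the bytes come from (PCIe instead of device DRAM), not how many are requested, which is why the efficiency metrics stay flat while the replay overhead grows.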
The profiler exposes the following counters:
gld_throughput
l1_cache_global_hit_rate
dram_{read, write}_throughput
l2_l1_read_hit_rate
In the zero copy case all of these metrics should be much lower.
The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).
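For reference, here is a minimal sketch of how a zero-copy (mapped pinned) buffer is typically set up; the kernel, sizes, and names below are illustrative rather than taken from the question, and error checking is omitted:

#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];               // with zero-copy, every access crosses PCIe
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped pinned memory before the context is created

    const int n = 1 << 20;
    float *h_in, *h_out;                     // host pointers to mapped pinned memory
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);

    float *d_in, *d_out;                     // device aliases of the same allocations
    cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();                 // the kernel reads/writes host memory over PCIe

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}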

Related

At what code complexity does an OpenACC kernel lose efficiency on common GPU?

At about what code complexity do OpenACC kernels lose efficiency on a common GPU, and at what point do registers, shared memory operations, or some other aspect start to bottleneck performance?
Also, is there some point where too few tasks make the overhead of transferring work to the GPU and its cores a bottleneck?
Would cache sizes, and whether the code fits into them, indicate the optimal amount of work per kernel, or something else?
About how big is the OpenACC overhead per kernel compared to the potential performance, and does it vary a lot with different directives?
I would refrain from using the complexity of the code as an indication of performance. You can have a highly complex code run very efficiently on a GPU and a simple code run poorly. Instead, I would look at the following factors:
Data movement between the device and host. Limit the frequency of data movement and try to transfer data in contiguous chunks. Use OpenACC unstructured data regions to match the host allocation on the device (i.e. use "enter data" at the same time as you allocate data via "new" or "malloc"); see the sketch after this answer. Move as much compute to the GPU as you can and only use the OpenACC update directive to synchronize host and device data when absolutely necessary. In cases where data movement is unavoidable, investigate using the "async" clause to interleave the data movement with compute.
Data access on the device and limiting memory divergence. Be sure to lay out your data so that the stride-1 (contiguous) dimension of your arrays is accessed contiguously across the vector (thread) dimension.
Have a high compute intensity, which is the ratio of computation to data movement. The more compute and the less data movement, the better. However, lower compute-intensity loops are fine if there are other high-intensity loops and the cost of moving the data to the host would offset the cost of running the kernel on the device.
Avoid allocating data on the device since it forces threads to serialize. This includes using Fortran "automatic" arrays and declaring C++ objects whose constructors include allocation.
Avoid atomic operations. Atomic operations are actually quite efficient when compared to host atomics, but still should be avoided if possible.
Avoid subroutine calls. Try to inline routines when possible.
Occupancy. Occupancy is the ratio of the number of threads that can potentially be running on the GPU to the maximum number of threads that could be running. Note that 100% occupancy does not guarantee high performance, but you should try to get above 50% if possible. The limiters to occupancy are the number of registers used per thread (vector) and the shared memory used per block (gang). Assuming you're using the PGI compiler, you can see the limits of your device by running the PGI "pgaccelinfo" utility. The number of registers used will depend upon the number of local scalars used (explicitly declared by the programmer and temporaries created by the compiler to hold intermediate calculations), and the amount of shared memory used will be determined by the OpenACC "cache" and "private" directives when "private" is used on a "gang" loop. You can see how much each kernel uses by adding the flag "-ta=tesla:ptxinfo". You can limit the number of registers used per thread via "-ta=tesla:maxregcount:<n>". Reducing the number of registers will increase the occupancy but also increase the number of register spills. Spills are fine so long as they only spill to L1/L2 cache. Spilling to global memory will hurt performance. It's often better to suffer lower occupancy than to spill to global memory.
Note that I highly recommend using a profiler (PGPROF, NVprof, Score-P, TAU, Vampir, etc.) to help discover a program's performance bottlenecks.
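As a concrete illustration of the first point above (unstructured data regions, minimal transfers, update only when needed), here is a minimal C++ OpenACC sketch; the array size, loop body, and iteration count are made up for the example:

int main()
{
    const int n = 1 << 20;
    double *a = new double[n];
    double *b = new double[n];

    for (int i = 0; i < n; ++i) a[i] = 1.0;     // host initialization

    // Create device copies at the same time as the host allocation
    // (unstructured data region) so later kernels can reuse them.
    #pragma acc enter data copyin(a[0:n]) create(b[0:n])

    for (int step = 0; step < 100; ++step)
    {
        // All compute stays on the device; no per-step transfers.
        #pragma acc parallel loop present(a[0:n], b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];
    }

    // Copy results back only when the host actually needs them.
    #pragma acc update self(b[0:n])

    #pragma acc exit data delete(a[0:n], b[0:n])
    delete[] a;
    delete[] b;
    return 0;
}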

How does hardware affect the performance of malloc\new on Windows

I am working on the performance of a C++ application on Windows 7 which does a lot of computation and a lot of small allocations. Basically I observed a bottleneck using the Visual Studio sampling profiler, and it came down to the parsing of a file and the creation of a huge tree structure of the type
class TreeStruct : std::map<key, TreeStructPtr>
{
    SomeMetadata p;
    int* buff;
    int buffsize;
};
There are tens of thousands of these structures created during the parsing.
The buffer is not that big, 1 byte to a few hundred bytes.
The profiler reports that the most costly functions are
free (13 000 exclusive samples, 38% Exclusive Samples)
operator new (13 000 exclusive samples, 38% Exclusive Samples)
realloc (4000 exclusive samples, 13% Exclusive Samples)
I managed to optimize and reduce the allocations to
operator new (2200 exclusive samples, 48% Exclusive Samples)
free (1770 exclusive samples, 38% Exclusive Samples)
some function (73 exclusive samples, 1.5% Exclusive Samples)
When I measure the client waiting time (i.e. a client waits for the action to complete, timed with a stopwatch), the installed version on my machine went from 85 s of processing time to 16 s of processing time, which is great. I proceeded to test on the most powerful machine we have and was stunned that the non-optimized version took only 3.5 s while the optimized one took around 2 s. Same executable, same operating system...
Question: How is such a disparity possible on two modern machines?
Here are the specs:
85s to 16s machine
3.5s to 2s machine
The processing is mono-threaded.
As others have commented, frequent small allocations are a waste of time and memory.
For every allocation, there is overhead:
Function call preparation.
Function call (a break in the execution path; possible reload of the execution pipeline).
Algorithm to find a memory block (perhaps searching).
Allocating the memory (marking the block as unavailable).
Placing the address into a register.
Returning from the function (another break in sequential execution).
Regardless of your machine's speed, the above process is a lot of execution to allocate a small block of memory.
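To make the cost concrete, here is a minimal, deliberately simplified arena sketch that hands out many small buffers from one big allocation instead of calling operator new for each; the class and sizes are hypothetical, and a production pool would also need alignment, growth, and thread-safety handling:

#include <cstddef>
#include <vector>

// One big upfront allocation; small buffers are handed out by bumping an offset.
class Arena
{
public:
    explicit Arena(std::size_t bytes) : storage_(bytes), offset_(0) {}

    void* allocate(std::size_t bytes)
    {
        if (offset_ + bytes > storage_.size())
            return nullptr;              // real code would grow the arena or throw
        void* p = storage_.data() + offset_;
        offset_ += bytes;
        return p;
    }

    void reset() { offset_ = 0; }        // release everything at once after parsing

private:
    std::vector<char> storage_;
    std::size_t offset_;
};

// Usage (hypothetical): Arena arena(64 * 1024 * 1024);
// int* buff = static_cast<int*>(arena.allocate(buffsize * sizeof(int)));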
Modern processors love to keep their data close (as in a data cache). Their performance increases when they can fetch data from the cache rather than from outside the processor (access times slow down the further away the values are: memory on the chip but outside the core; memory off the chip on the same board; memory on other boards; memory on devices such as Flash and hard drives). Reallocating memory defeats the effectiveness of the data cache.
The Operating System may get involved and slow down your program. In the allocation or delete functions, the O.S. may check for paging. Paging, in a simple form, is the swapping of memory areas with areas on the hard drive. This may occur when other higher priority tasks are running and demand more memory.
An algorithm for speeding up data access:
Load data from memory into local variables (registers if possible).
Process the data in the local variables (registers).
Store the finished data.
If you can, place data into structures. Load all the structure members at once. Structures allow for data to be placed into contiguous memory (which reduces the need to reload the cache).
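A minimal sketch of the load/process/store pattern above, with hypothetical struct and field names:

struct Sample
{
    float value;
    float weight;
};

float weighted_sum(const Sample* samples, int count)
{
    float total = 0.0f;                 // kept in a register across the loop
    for (int i = 0; i < count; ++i)
    {
        // Load the contiguous struct members once, work on locals, then store.
        const float v = samples[i].value;
        const float w = samples[i].weight;
        total += v * w;
    }
    return total;                       // single store of the finished result
}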
Lastly, reduce branching or changes in execution. Research "loop unrolling". Your compiler may perform this optimization at higher optimization settings.

How to clear L1, L2 and L3 caches?

I am doing some cache performance measuring and I need to ensure the caches are empty of "useful" data before timing.
Assuming an L3 cache of 10 MB, would it suffice to create a vector of 10M/4 = 2,500,000 floats, iterate through the whole of this vector, and sum the numbers? Would that empty the cache of any data which was in it prior to the iteration?
Yes, that should be sufficient for flushing the L3 cache of useful data.
I have done similar types of measurements and cross-verified by using Intel's cache counters to verify that I incur the expected number of L3 cache misses during my tests.
If you want to be absolutely sure, you should also use the counters. In particular, you can measure last-level cache misses by using Event select 2EH, Umask 41H in most Intel architectures.
See the Intel Manual for details on these counters.
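A minimal sketch of the sweep described in the question; the buffer is made a couple of times larger than L3 to leave some margin (see the second answer below for why), and the pass count is arbitrary:

#include <cstddef>
#include <numeric>
#include <vector>

volatile float g_sink;                       // prevents the sum from being optimized away

// Sweep a buffer larger than the last-level cache so that previously cached
// "useful" data is evicted before the timed measurement starts.
void flush_llc(std::size_t llc_bytes)
{
    std::vector<float> junk(2 * llc_bytes / sizeof(float), 1.0f);
    float sum = 0.0f;
    for (int pass = 0; pass < 4; ++pass)     // a few linear passes over the buffer
        sum += std::accumulate(junk.begin(), junk.end(), 0.0f);
    g_sink = sum;
}

// Usage: flush_llc(10 * 1024 * 1024);       // e.g. a 10 MB L3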
It depends on how insane you are trying to be to get your guarantee.
x86_64 L3 cache is physically indexed, and while a 10MiB chunk that's linear in virtual space is almost definitely going to be physically contiguous on a lightly mem-loaded machine, it's not guaranteed.
Sandy and Ivy Bridge, for example, have L3 cache in 2MiB slices with 16-way set associativity (128kiB stride), so you could guarantee physical coverage by doing a MAP_HUGETLB mmap() call, assuming standard 2-4MiB huge pages.
Also, since each slice (on newer Sandy/Ivy Bridge at least) is attached to a different core, and which slice a given physical address resides on is determined by a hash of some low/middle-order address bits, you might have to make the array slightly larger than the size of L3 to compensate for slightly uneven coverage.
At this point, scrubbing your array a few times linearly should do the trick.
Another option is to use dedicated cache invalidation instructions that some ISAs provide. x86, for example, has wbinvd for this purpose (or clflush for a single line).
http://x86.renejeschke.de/html/file_module_x86_id_325.html
One problem is that it requires ring-0 permissions. Another one is that it doesn't guarantee that the flush is completed prior to any serialization point, so it's not good enough to guarantee system non-volatility, but it may be enough for benchmarking as long as you can prevent the ensuing WBs from eating up your memory bandwidth.
If you can overcome these issues, it may be a better solution in some cases than going over some large data structure just to make sure the cache is flushed. Some CPUs may decide to avoid caching fetches they believe would not be reused in the future (there are several papers about these options, and there are at least some claims that this is implemented in real CPUs).
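wbinvd needs ring-0, but clflush can be issued from user mode on a buffer you control; here is a minimal sketch using the compiler intrinsic, assuming 64-byte cache lines:

#include <cstddef>
#include <emmintrin.h>   // _mm_clflush and _mm_mfence (SSE2)

// Flush every cache line of a user-space buffer; assumes 64-byte lines.
void flush_buffer(const void* buf, std::size_t bytes)
{
    const char* p = static_cast<const char*>(buf);
    for (std::size_t off = 0; off < bytes; off += 64)
        _mm_clflush(p + off);
    _mm_mfence();        // order the flushes before subsequent loads/stores
}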

More TLB misses when process memory size larger?

I have a program which I have written in C++. On Linux, the process is allocated a certain amount of memory: part is the stack, part the heap, part text, and part BSS.
Is the following true:
The larger the amount of memory allocated to the heap component of my process, the greater the chance of Translation Lookaside Buffer (TLB) misses?
And generally speaking- the more memory my application process consumes, the greater the chance of TLB misses?
I think there is no direct relationship between the amount of memory allocated and the TLB miss rate. As far as I know, as long as your program has good locality, TLB misses will remain low.
There are several reasons that can lead to a high TLB miss rate:
1. Not enough memory and too many running processes;
2. Low locality in your program;
3. Inefficient patterns of visiting array elements in the loops in your code.
Programs are usually divided into phases that exhibit completely different memory and execution characteristics - your code may allocate a huge chunk of memory at some point, then be off doing some other unrelated computations. In that case, your TLBs (that are basically just caches for address translation) would age away the unused pages and eventually drop them. While you're not using these pages, you shouldn't care about that.
The real question is: when you get to some performance-critical phase, are you going to work with more pages than your TLBs can sustain simultaneously? On one hand, modern CPUs have large TLBs, often with 2 levels of caching - the L2 TLB of a modern Intel CPU should have (IIRC) 512 entries - that's 2 MB worth of data if you're using 4k pages (with large pages that would have been more, but TLBs usually don't like to work with them due to potential conflicts with smaller pages..).
It's quite possible for an application to work with more than 2 MB of data, but you should avoid doing this at the same time if possible - either by doing cache tiling or by changing the algorithms. That's not always possible (e.g. when streaming from memory or from IO), but then the TLB misses are probably not your main bottleneck. When working with the same set of data and accessing the same elements multiple times, you should always attempt to keep them cached as close as possible.
It's also possible to use software prefetches to make the CPU perform the TLB misses (and the following page walks) earlier in time, preventing them from blocking your progress. On some CPUs, hardware prefetchers are already doing this for you.
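As an illustration of software prefetching, a minimal GCC/Clang sketch; the prefetch distance of 8 iterations is a made-up tuning parameter:

// Gather elements through an index array; prefetching a few iterations ahead
// lets TLB misses and page walks overlap with useful work.
double gather_sum(const double* data, const int* index, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
    {
        if (i + 8 < n)
            __builtin_prefetch(&data[index[i + 8]], 0, 1);  // read, low temporal locality
        sum += data[index[i]];
    }
    return sum;
}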

Using Nsight to determine bank conflicts and coalescing

How can I know the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Moreover, what should I look at when I use Nsight as a profiler? What are the important fields that may cause my program to slow down?
I don't use Nsight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll want to pay attention to your GPU's occupancy.
Other interesting values are where the compiler has placed your local variables: in registers or in local memory.
Finally, you'll check the time spent transferring data to and from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescing - basically you just need to watch the Global Memory Loads/Stores - Coalesced/Uncoalesced counters and flag the uncoalesced ones.
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
For the question of which important fields/things to look at (when using the Nsight profiler) that may cause your program to slow down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring, but your application threads (Thread State) are green.
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory row. If you are spending a lot of time in memory copies then you should consider using CUDA streams to pipeline your application (see the sketch after this list). This can allow you to overlap memory transfers and kernels. Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.
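For the memory-bound case (b), here is a minimal sketch of pipelining with CUDA streams; the kernel, chunk count, and sizes are hypothetical, error checking is omitted, and the host buffer must be pinned (cudaHostAlloc/cudaHostRegister) for the copies to overlap:

#include <cuda_runtime.h>

__global__ void process(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                 // placeholder work
}

void pipelined(float* h_data, int n, int chunks)   // h_data assumed pinned
{
    const int chunk = n / chunks;            // assume n divides evenly for brevity
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    for (int c = 0; c < chunks; ++c)
    {
        cudaStream_t s = streams[c % 2];
        float* h = h_data + c * chunk;
        float* d = d_data + c * chunk;

        // Copy-in, kernel, and copy-out for chunk c are issued on one stream,
        // so they can overlap with the neighbouring chunks on the other stream.
        cudaMemcpyAsync(d, h, chunk * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<(chunk + 255) / 256, 256, 0, s>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(float), cudaMemcpyDeviceToHost, s);
    }

    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_data);
}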