How to initialize CUDA so I can make valid execution time measurements? - c++

In my application I have implemented the same algorithm for the CPU and for the GPU with CUDA, and I have to measure the time each version needs. I noticed that the GPU version spends some time on CUDA initialization, so I added cudaFree(0); at the beginning of the program, as recommended here for CUDA initialization. Even so, the first execution of the GPU algorithm still takes more time than the second one.
Is there any other CUDA-related state that has to be initialized at the beginning so that I measure the actual algorithm execution time correctly?

The heuristics of lazy context initialisation in the CUDA runtime API have changed subtly since the answer you linked to was written, in two ways that I am aware of:
cudaSetDevice() now initiates a context, whereas earlier on it did not (hence the need for the cudaFree() call discussed in that answer).
Some device-code-related initialisation which the runtime API used to perform at context initialisation is now done at the first call to the kernel.
The only solution I am aware of for the second item is to run the CUDA kernel code you want to time once as a "warm up" to absorb the setup latency, and then time the subsequent runs for benchmarking purposes.
Alternatively, you can use the driver API and have much finer grained control over when latency will occur during application start up.
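The warm-up approach described above can be sketched in plain C++ as follows. This is only a minimal illustration: TimingResult and time_with_warmup are hypothetical names, and the callable passed in is assumed to return only once the work has really finished (for a CUDA kernel that would mean ending it with a synchronizing call such as cudaDeviceSynchronize()).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Timings in milliseconds for one warm-up call and the timed repetitions.
struct TimingResult {
    double warmup_ms;  // first call: includes any one-time setup latency
    double mean_ms;    // average over the timed repetitions only
};

// Calls `work` once as a warm-up, then times `reps` further calls.
// `work` must block until its work is complete, otherwise only the
// launch cost is measured.
template <typename Work>
TimingResult time_with_warmup(Work&& work, std::size_t reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto w0 = clock::now();
    work();  // warm-up: absorbs lazy initialisation
    const auto w1 = clock::now();

    const auto t0 = clock::now();
    for (std::size_t i = 0; i < reps; ++i) {
        work();
    }
    const auto t1 = clock::now();

    return {ms(w1 - w0).count(), ms(t1 - t0).count() / static_cast<double>(reps)};
}
```

Reporting warmup_ms separately from mean_ms makes the one-time setup cost visible instead of letting it distort the average.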

Related

OpenCL Profiling timestamps are not consistent in duration compared to CPU clock

I am creating a custom tool interface with my application to profile the performance of OpenCL kernels while also integrating CPU profiling points. I'm currently working with this code on Linux using Ubuntu, and am testing using the 3 OpenCL devices in my machine: Intel CPU, Intel IGP, and Nvidia Quadro.
I am using this code std::chrono::high_resolution_clock::now().time_since_epoch().count() to produce a timestamp on the CPU, and of course for the OpenCL profiling time points, they are 64-bit nanoseconds provided from the OpenCL profiling events API. The purpose of the tool I made is to consume log output (specially formatted and generated so as not to impact performance much) from the program and generate a timeline chart to aid performance analysis.
So far in my visualization interface I had made the assumption that nanoseconds are uniform. I've realized now after getting my visual interface working and checking a few assumptions that this condition more or less does hold to a standard deviation of 0.4 microsecond for the CPU OpenCL device (which indicates that the CPU device could be implemented using the same time counter, as it has no drift), but does not hold for the two GPU devices! This is perhaps not the most surprising thing in the world, but it affects the core design of my tool, so this was an unforeseen risk.
I'll provide some eye candy since it is very interesting and it does prove to me that this is indeed happening.
This is zoomed into the beginning of the profile where the GPU has the corresponding mapBuffer for poses happening around a millisecond before the CPU calls it (impossible!)
Toward the end of the profile we see the same shapes but reversed relationship, clearly showing that GPU seconds seem to count for a little bit less compared to CPU seconds.
Because I had assumed that a GPU nanosecond is indeed a CPU nanosecond, the visualization currently works by computing the average of the deltas between the values given to me by the CPU and the GPU. Since I implemented it this way from the start, perhaps I was at least subconsciously expecting an issue like this one. Anyway, what I did was establish a sync point at the kernel dispatch by recording a CPU timestamp immediately before calling clEnqueueNDRangeKernel and then comparing this against the CL_PROFILING_COMMAND_QUEUED OpenCL profiling event time. On further inspection, this delta showed the time drift:
This screenshot from the chrome console shows me logging the array of delta values I collected from these two GPU devices; they are showing BigInts to avoid losing integer precision: in both cases the GPU reported timestamp deltas are trending down.
Compare with the numbers from the CPU:
My questions:
What might be a practical way to deal with this issue? I am currently leaning toward the use of sync points when dispatching OpenCL kernels, and these sync points could be used to either locally piecewise stretch the OpenCL Profiling timestamps, or to locally sync at the beginning of, say, a kernel dispatch and just ignore the discrepancy we have, assuming it will be insignificant during the period. In particular it is unclear whether it'd be a good idea to maximize granularity by implementing a sync point for every single profiling event I want to use.
What might be some other time measuring systems I can or should use on the CPU-side to see if maybe they will end up aligning better? I don't really have much hope in this at this point because I can imagine that the profiling times being provided to me are actually generated and timed on the GPU device itself. The fluctuations would then be affected by such things as dynamic GPU clock scaling, and there would be no hope of stumbling upon a different better timekeeping scheme on the CPU.
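The "piecewise stretch" idea from the first question could look roughly like the sketch below. It assumes sync points have already been collected as pairs of device and host timestamps, sorted by device time; SyncPoint and device_to_host_ns are hypothetical names, and linear interpolation between neighbouring sync points is an assumption, not something the OpenCL API provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// One sync point: the same instant observed on both clocks, in nanoseconds.
struct SyncPoint {
    std::int64_t device_ns;  // e.g. the CL_PROFILING_COMMAND_QUEUED value
    std::int64_t host_ns;    // CPU timestamp taken just before the enqueue call
};

// Maps a device timestamp onto the host timeline by linear interpolation
// between the two surrounding sync points. Outside the covered range the
// nearest segment is extrapolated. `points` must hold at least two entries
// with strictly increasing device_ns.
std::int64_t device_to_host_ns(const std::vector<SyncPoint>& points,
                               std::int64_t device_ns) {
    assert(points.size() >= 2);
    // First sync point whose device time is greater than the query.
    auto hi = std::upper_bound(
        points.begin(), points.end(), device_ns,
        [](std::int64_t t, const SyncPoint& p) { return t < p.device_ns; });
    if (hi == points.begin()) hi = points.begin() + 1;  // before first segment
    if (hi == points.end()) hi = points.end() - 1;      // after last segment
    const SyncPoint& a = *(hi - 1);
    const SyncPoint& b = *hi;
    // Local stretch factor: host nanoseconds per device nanosecond.
    const double scale = static_cast<double>(b.host_ns - a.host_ns) /
                         static_cast<double>(b.device_ns - a.device_ns);
    const double offset = scale * static_cast<double>(device_ns - a.device_ns);
    return a.host_ns + static_cast<std::int64_t>(std::llround(offset));
}
```

With more sync points the segments get shorter, so drift within a segment matters less; whether one sync point per profiling event is worth the overhead would have to be measured.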

What is the difference between 'GPU activities' and 'API calls' in the results of 'nvprof'?

What is the difference between 'GPU activities' and 'API calls' in the results of 'nvprof'?
I don't know why there's a time difference in the same function.
For example, [CUDA memcpy DtoH] and cuMemcpyDtoH.
So I don't know what the right time is.
I have to write a measurement, but I don't know which one to use.
Activities are actual usage of the GPU for some particular task.
An activity might be running a kernel, or it might be using GPU hardware to transfer data from Host to Device or vice-versa.
The duration of such an "activity" is the usual sense of duration: when did this activity start using the GPU, and when did it stop using the GPU.
API calls are calls made by your code (or by other CUDA API calls made by your code) into the CUDA driver or runtime libraries.
The two are related of course. You perform an activity on the GPU by initiating it with some sort of API call. This is true for data copying and running kernels.
However there can be a difference in "duration" or reported times. If I launch a kernel, for example, there may be many reasons (e.g. previous activity that is not yet complete in the same stream) why the kernel does not "immediately" begin executing. The kernel "launch" may be outstanding from an API perspective for a much longer time than the actual runtime duration of the kernel.
This applies to many other facets of API usage as well. For example, cudaDeviceSynchronize() can appear to require a very long time or a very short time, depending on what is happening (activities) on the device.
You may get a better sense of the difference between these two categories of reporting by studying the timeline in the NVIDIA visual profiler (nvvp).
Let's use your specific case as an example. This appears to be an app associated with the driver API, and you apparently have a kernel launch and (I would guess) a D->H memcpy operation immediately after the kernel launch:
multifrag_query_hoisted_kernels (kernel launch - about 479ms)
cuMemcpyDtoH (data copy D->H, about 20us)
In that situation, because CUDA kernel launches are asynchronous, the host code will launch the kernel and it will then proceed to the next code line, which is a cuMemcpyDtoH call, which is a blocking call. This means the call causes the CPU thread to wait there until the previous CUDA activity is complete.
The activity portion of the profiler tells us the kernel duration is around 479ms and the copy duration is around 20us (much much shorter). From the standpoint of activity duration, these are the times that are relevant. However, as viewed from the host CPU thread, the time it required the host CPU thread to "launch" the kernel was much shorter than 479ms, and the time it required the host CPU thread to complete the call to cuMemcpyDtoH and proceed to the next line of code was much longer than 20us, because it had to wait there at that library call, until the previously issued kernel was complete. Both of these are due to the asynchronous nature of CUDA kernel launches, and the "blocking" or synchronous nature of cuMemcpyDtoH.
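The relationship between the two sets of numbers can be reproduced with a toy timeline model. The sketch below does not call CUDA at all: Timeline, launch_async and call_blocking are hypothetical names, it models a single host thread feeding one in-order device queue, and any launch overhead passed in is an assumed figure rather than a measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// All times are microseconds on one shared, idealized clock.
struct Span {
    std::int64_t start_us;
    std::int64_t end_us;
    std::int64_t duration_us() const { return end_us - start_us; }
};

struct Record {
    Span api;       // time the host thread spent inside the call
    Span activity;  // time the device actually spent on the work
};

// Toy model: one host thread, one in-order device queue.
class Timeline {
public:
    // Asynchronous call: the host returns after `overhead_us`; the device
    // starts once the call is issued and all earlier work is done.
    Record launch_async(std::int64_t overhead_us, std::int64_t work_us) {
        Span api{host_us_, host_us_ + overhead_us};
        host_us_ = api.end_us;
        Span activity = enqueue(api.end_us, work_us);
        return {api, activity};
    }

    // Blocking call: the host does not return until the device has finished
    // this work, which itself waits for everything queued before it.
    Record call_blocking(std::int64_t work_us) {
        Span activity = enqueue(host_us_, work_us);
        Span api{host_us_, activity.end_us};
        host_us_ = api.end_us;
        return {api, activity};
    }

private:
    Span enqueue(std::int64_t issued_us, std::int64_t work_us) {
        const std::int64_t start = std::max(issued_us, device_free_us_);
        device_free_us_ = start + work_us;
        return {start, device_free_us_};
    }

    std::int64_t host_us_ = 0;
    std::int64_t device_free_us_ = 0;
};
```

Calling launch_async(10, 479000) followed by call_blocking(20) gives a 10 us API span for the launch and a 479020 us API span for the copy, while the activity spans stay at 479000 us and 20 us, which is the pattern described above.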

slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice

I understand CUDA will do initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice call.
The Test program:
The same program was built with CUDA 7.0 (compute_35) + Visual Studio 2012 + NSight 4.5, then run on 2 separate machines (no rebuilding).
Before the 1st cudaMalloc, I’ve called “cudaSetDevice”
on my PC: Win7 + Tesla K20, 1st cudaMalloc takes 150ms
on my server: Win2012+ Tesla K40, it takes 1100ms!!
For both machines, subsequent cudaMalloc are much faster.
My questions are:
1, Why the K40 takes a much longer time (1100ms vs 150ms) for the 1st cudaMalloc? As K40 is supposed to be better than K20
2, I thought "cudaSetDevice" can capture the Init time? e.g. This Answer from talonmies
3, If the initialization is unavoidable, can process A maintain its status(or context) in GPU while process B is running in the same GPU? I understand I better run GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize GPU again later?
Thanks in advance
1, Why the K40 takes a much longer time (1100ms vs 150ms) for the 1st cudaMalloc? As K40 is supposed to be better than K20
The details of the initialization process are not specified. However, by observation, the amount of system memory affects initialization time. CUDA initialization usually includes establishment of UVM, which involves harmonizing the device and host memory maps. If your server has more system memory than your PC, that is one possible explanation for the disparity in initialization time. The OS may have an effect as well, and so may the memory size of the GPU.
2, I thought "cudaSetDevice" can capture the Init time? e.g. This Answer from talonmies
The CUDA initialization process is a "lazy" initialization. That means that just enough of the initialization process will be completed in order to support the requested operation. If the requested operation is cudaSetDevice, this may require less of the initialization to be complete (which means the apparent time required may be shorter) than if the requested operation is cudaMalloc. That means that some of the initialization overhead may be absorbed into the cudaSetDevice operation, while some additional initialization overhead may be absorbed into a subsequent cudaMalloc operation.
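As a rough illustration of how lazy initialization spreads its cost across calls, consider the toy model below. The stages and the LazyRuntime class are entirely hypothetical: the real CUDA runtime does not document its internal stages, so this only mirrors the behaviour described above.

```cpp
#include <cassert>

// Hypothetical initialization stages, in the order they must complete.
enum class Stage { None = 0, DeviceSelected = 1, MemoryReady = 2 };

class LazyRuntime {
public:
    // Each call brings the runtime up to the stage it needs and returns
    // how many stages had to be initialized during this particular call.
    int set_device() { return ensure(Stage::DeviceSelected); }
    int allocate() { return ensure(Stage::MemoryReady); }

private:
    int ensure(Stage needed) {
        int performed = 0;
        while (static_cast<int>(current_) < static_cast<int>(needed)) {
            // A real runtime would do its expensive setup work here.
            current_ = static_cast<Stage>(static_cast<int>(current_) + 1);
            ++performed;
        }
        return performed;
    }

    Stage current_ = Stage::None;
};
```

In this model set_device() pays for one stage, the first allocate() pays for the remaining one, and later allocate() calls pay for none, which matches the observation that only the first cudaMalloc is slow.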
3, If the initialization is unavoidable, can process A maintain its status(or context) in GPU while process B is running in the same GPU? I understand I better run GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize GPU again later?
Independent host processes will generally spawn independent CUDA contexts. A CUDA context has the initialization requirement associated with it, so the fact that another, separate cuda context may be already initialized on the device will not provide much benefit if a new CUDA context needs to be initialized (perhaps from a separate host process). Normally, keeping a process active involves keeping an application running in that process. Applications have various mechanisms to "sleep" or suspend behavior. As long as the application has not terminated, any context established by that application should not require re-initialization (excepting, perhaps, if cudaDeviceReset is called).
In general, some benefit may be obtained on systems that allow the GPUs to go into a deep idle mode by setting GPU persistence mode (using nvidia-smi). However, this will not be relevant for GeForce GPUs, nor will it generally be relevant on a Windows system.
Additionally, on multi-GPU systems, if the application does not need multiple GPUs, some initialization time can usually be avoided by using the CUDA_VISIBLE_DEVICES environment variable, to restrict the CUDA runtime to only use the necessary devices.
Depending on the target architecture that the code is compiled for and the architecture that is running the code, the JIT compilation can kick in with the first cudaMalloc (or any other) call. "If binary code is not found but PTX is available, then the driver compiles the PTX code." Some more details:
http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/

Parallelize a method from inside a CUDA device function / kernel

I've got an already parallelized CUDA kernel that does some tasks which require frequent interpolation.
So there's a kernel
__global__ void complexStuff(...)
which calls one or more times this interpolation device function:
__device__ void interpolate(...)
The interpolation algorithm does a WENO interpolation successively over three dimensions. This is a highly parallelizable task which I urgently would like to parallelize!
It is clear that the kernel complexStuff() can easily be parallelized by calling it from host code using the <<<...>>> syntax. It is also important that complexStuff() is already parallelized.
But it's not clear to me how to parallelize something / create new threads from inside a CUDA device function ... is this even possible? Does anyone know?
You might want to consider Dynamic Parallelism (some resources here, here, and here) in order to call a CUDA kernel from inside another CUDA kernel. It requires your device compute capability to be 3.5 or higher. It comes with a number of restrictions and limitations that may degrade the performance (mentioned in 3rd link).
My suggestion is to first consider launching your kernel with the amount of work in complexStuff(...) multiplied by the amount of work in interpolate(...). In other words, statically estimate the maximum number of fine-grained parallel jobs you need, then configure the kernel launch so that those fine-grained jobs are spread across the threads of your blocks. Note that this is speculation, since I have not seen your program code.
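The index arithmetic behind that suggestion can be sketched in plain C++. The sketch assumes every outer task has the same number of interpolation sub-tasks; WorkItem, decompose and blocks_needed are hypothetical helper names, and in an actual kernel the flat index would come from blockIdx.x * blockDim.x + threadIdx.x.

```cpp
#include <cassert>
#include <cstddef>

struct WorkItem {
    std::size_t outer;  // which complexStuff-level task
    std::size_t inner;  // which interpolation sub-task within it
};

// Splits a flat thread index into (outer, inner), assuming every outer
// task has exactly `inner_count` sub-tasks.
WorkItem decompose(std::size_t flat_index, std::size_t inner_count) {
    assert(inner_count > 0);
    return {flat_index / inner_count, flat_index % inner_count};
}

// Number of blocks needed to cover `total_threads`, rounding up.
std::size_t blocks_needed(std::size_t total_threads, std::size_t block_size) {
    assert(block_size > 0);
    return (total_threads + block_size - 1) / block_size;
}
```

For 4 outer tasks with 3 sub-tasks each, 12 threads are needed in total; with a block size of 8 that means 2 blocks, so threads whose flat index is 12 or more must return early.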

Why doesn't this C++ code reach 100% usage of one core?

I just made some benchmarks for this super question/answer Why is my program slow when looping over exactly 8192 elements?
I want to benchmark on one core, so the program is single-threaded. But it doesn't reach 100% usage of one core; it uses 60% at most. So my tests are not accurate.
I'm using Qt Creator, compiling using MinGW release mode.
Are there any parameters to set for better performance? Is it normal that I can't use the full CPU power? Is it Qt related? Are there interruptions or something else preventing the code from running at 100%?
Here is the main loop
// horizontal sums for first two lines
for(i=1;i<SIZE*2;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// rest of the computation
for(;i<totalSize;i++){
    // compute horizontal sum for next line
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
    // final result
    resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}
This is run 10 times on an array of SIZE*SIZE float with SIZE=8193, the array is on the heap.
There could be several reasons why Task Manager isn't showing 100% CPU usage on 1 core:
You have a multiprocessor system and the load is getting spread across multiple CPUs (most OSes will do this unless you specify a more restrictive CPU affinity);
The run isn't long enough to span a complete Task Manager sampling period;
You have run out of RAM and are swapping heavily, meaning lots of time is spent waiting for disk I/O when reading/writing memory.
Or it could be a combination of all three.
Also, Let_Me_Be's comment on your question is right: nothing here is Qt's fault, since no Qt functions are being called (assuming that the objects being read and written are just simple numeric data types, not fancy C++ objects with an overloaded operator=() or something). The only activities taking place in this region of the code are purely CPU-based (the CPU will spend some time waiting for data to be sent to or from RAM, but that is counted as CPU-in-use time), so you would expect to see 100% CPU utilisation except under the conditions given above.
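If the run is too short to span a sampling period, one option is to repeat the workload until a minimum wall-clock time has passed. The sketch below is a minimal illustration of that; RunStats and run_for_at_least are hypothetical names, and the right minimum duration depends on how often the monitoring tool samples, which is assumed here to be on the order of seconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

struct RunStats {
    std::size_t iterations;  // how many times the workload ran
    double elapsed_s;        // total wall-clock time in seconds
};

// Repeats `work` until at least `min_time` of wall-clock time has passed.
// The workload always runs at least once.
template <typename Work>
RunStats run_for_at_least(Work&& work, std::chrono::duration<double> min_time) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::size_t iterations = 0;
    std::chrono::duration<double> elapsed{0.0};
    do {
        work();
        ++iterations;
        elapsed = clock::now() - start;
    } while (elapsed < min_time);
    return {iterations, elapsed.count()};
}
```

Dividing elapsed_s by iterations gives a per-run time, and a run lasting several seconds gives a utilisation monitor enough samples to show a stable figure.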