slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice - c++

I understand that CUDA does initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice call.
The Test program:
The same program was built with CUDA 7.0 (compute_35) + Visual Studio 2012 + NSight 4.5, then run on 2 separate machines (no rebuilding).
Before the 1st cudaMalloc, I’ve called “cudaSetDevice”
On my PC (Win7 + Tesla K20), the 1st cudaMalloc takes 150 ms.
On my server (Win2012 + Tesla K40), it takes 1100 ms!!
For both machines, subsequent cudaMalloc calls are much faster.
My questions are:
1. Why does the K40 take a much longer time (1100 ms vs 150 ms) for the 1st cudaMalloc? The K40 is supposed to be better than the K20.
2. I thought "cudaSetDevice" could absorb the init time? e.g. this answer from talonmies
3. If the initialization is unavoidable, can process A maintain its status (or context) on the GPU while process B is running on the same GPU? I understand I should run the GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize the GPU again later?
Thanks in advance

1. Why does the K40 take a much longer time (1100 ms vs 150 ms) for the 1st cudaMalloc? The K40 is supposed to be better than the K20.
The details of the initialization process are not specified; however, by observation, the amount of system memory affects initialization time. CUDA initialization usually includes establishment of UVM, which involves harmonizing the device and host memory maps. If your server has more system memory than your PC, that is one possible explanation for the disparity in initialization time. The OS may have an effect as well, and finally the memory size of the GPU may have an effect.
2. I thought "cudaSetDevice" could absorb the init time? e.g. this answer from talonmies
The CUDA initialization process is a "lazy" initialization. That means that just enough of the initialization process will be completed in order to support the requested operation. If the requested operation is cudaSetDevice, this may require less of the initialization to be complete (which means the apparent time required may be shorter) than if the requested operation is cudaMalloc. That means that some of the initialization overhead may be absorbed into the cudaSetDevice operation, while some additional initialization overhead may be absorbed into a subsequent cudaMalloc operation.
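As a rough way to see this split on your own machine, here is a minimal timing sketch (my own illustration, not your test program; it assumes the runtime API and a host-side timer, and the exact numbers will vary by system):

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    // helper: milliseconds elapsed since t0
    static double ms_since(std::chrono::steady_clock::time_point t0) {
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        cudaSetDevice(0);                 // absorbs part of the lazy initialization
        printf("cudaSetDevice:  %8.2f ms\n", ms_since(t0));

        void *p = nullptr, *q = nullptr;
        t0 = std::chrono::steady_clock::now();
        cudaMalloc(&p, 1 << 20);          // the 1st cudaMalloc absorbs the rest
        printf("1st cudaMalloc: %8.2f ms\n", ms_since(t0));

        t0 = std::chrono::steady_clock::now();
        cudaMalloc(&q, 1 << 20);          // subsequent allocations are fast
        printf("2nd cudaMalloc: %8.2f ms\n", ms_since(t0));

        cudaFree(p);
        cudaFree(q);
        return 0;
    }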
3. If the initialization is unavoidable, can process A maintain its status (or context) on the GPU while process B is running on the same GPU? I understand I should run the GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize the GPU again later?
Independent host processes will generally spawn independent CUDA contexts. A CUDA context has the initialization requirement associated with it, so the fact that another, separate CUDA context may already be initialized on the device will not provide much benefit if a new CUDA context needs to be initialized (perhaps from a separate host process). Normally, keeping a process active involves keeping an application running in that process. Applications have various mechanisms to "sleep" or suspend behavior. As long as the application has not terminated, any context established by that application should not require re-initialization (except, perhaps, if cudaDeviceReset is called).
In general, some benefit may be obtained on systems that allow the GPUs to go into a deep idle mode by setting GPU persistence mode (using nvidia-smi). However, this will not be relevant for GeForce GPUs, nor will it generally be relevant on a Windows system.
Additionally, on multi-GPU systems, if the application does not need multiple GPUs, some initialization time can usually be avoided by using the CUDA_VISIBLE_DEVICES environment variable to restrict the CUDA runtime to only the necessary devices.
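For example, on a multi-GPU system you might restrict the runtime to a single device before the first CUDA call is made. A minimal sketch (the device index "0" is just an example; the variable can equally be set in the shell or the system environment instead of in code):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        // Must happen before the first CUDA runtime call in this process.
    #ifdef _WIN32
        _putenv_s("CUDA_VISIBLE_DEVICES", "0");
    #else
        setenv("CUDA_VISIBLE_DEVICES", "0", 1);
    #endif

        int count = 0;
        cudaGetDeviceCount(&count);   // now reports only the visible device(s)
        printf("visible devices: %d\n", count);
        return 0;
    }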

Depending on the target architecture that the code is compiled for and the architecture that is running the code, JIT compilation can kick in with the first cudaMalloc (or any other) call. "If binary code is not found but PTX is available, then the driver compiles the PTX code." Some more details:
http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
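One way to check whether the running GPU is executing SASS that was compiled for it, or PTX that the driver JIT-compiled, is to query the function attributes. A sketch (my_kernel is a placeholder for one of your own kernels):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { }   // placeholder for one of your own kernels

    int main() {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, my_kernel);
        // binaryVersion: architecture of the SASS actually loaded (e.g. 35 for sm_35)
        // ptxVersion:    version of the embedded PTX it may have been JIT-compiled from
        printf("binaryVersion: %d, ptxVersion: %d\n", attr.binaryVersion, attr.ptxVersion);
        return 0;
    }

If the fat binary only embeds PTX (e.g. code=compute_35 without code=sm_35), the driver will JIT at first use, which adds to the first-call latency; embedding SASS for the target architecture avoids that.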

Related


What is the difference between 'GPU activities' and 'API calls' in the results of 'nvprof'?
I don't know why there's a time difference reported for the same operation.
For example, [CUDA memcpy DtoH] and cuMemcpyDtoH.
So I don't know which time is the right one.
I have to report a measurement, but I don't know which one to use.
Activities are actual usage of the GPU for some particular task.
An activity might be running a kernel, or it might be using GPU hardware to transfer data from Host to Device or vice-versa.
The duration of such an "activity" is the usual sense of duration: when did this activity start using the GPU, and when did it stop using the GPU.
API calls are calls made by your code (or by other CUDA API calls made by your code) into the CUDA driver or runtime libraries.
The two are related of course. You perform an activity on the GPU by initiating it with some sort of API call. This is true for data copying and running kernels.
However, there can be a difference in "duration" or reported times. If I launch a kernel, for example, there may be many reasons (e.g. previous activity that is not yet complete in the same stream) why the kernel does not "immediately" begin executing. The kernel "launch" may be outstanding from an API perspective for a much longer time than the actual runtime duration of the kernel.
This applies to many other facets of API usage as well. For example, cudaDeviceSynchronize() can appear to require a very long time or a very short time, depending on what is happening (activities) on the device.
You may get a better sense of the difference between these two categories of reporting by studying the timeline in the NVIDIA visual profiler (nvvp).
Let's use your specific case as an example. This appears to be an app associated with the driver API, and you apparently have a kernel launch and (I would guess) a D->H memcpy operation immediately after the kernel launch:
multifrag_query_hoisted_kernels (kernel launch - about 479ms)
cuMemcpyDtoH (data copy D->H, about 20us)
In that situation, because CUDA kernel launches are asynchronous, the host code will launch the kernel and it will then proceed to the next code line, which is a cuMemcpyDtoH call, which is a blocking call. This means the call causes the CPU thread to wait there until the previous CUDA activity is complete.
The activity portion of the profiler tells us the kernel duration is around 479ms and the copy duration is around 20us (much, much shorter). From the standpoint of activity duration, these are the times that are relevant. However, as viewed from the host CPU thread, the time required for the host CPU thread to "launch" the kernel was much shorter than 479ms, and the time required for the host CPU thread to complete the call to cuMemcpyDtoH and proceed to the next line of code was much longer than 20us, because it had to wait there at that library call until the previously issued kernel was complete. Both of these are due to the asynchronous nature of CUDA kernel launches and the "blocking" or synchronous nature of cuMemcpyDtoH.
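A minimal runtime-API sketch of the same pattern (the kernel and sizes here are placeholders, not the ones from your trace) that shows why host-side API times differ from activity times:

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void long_kernel(int *d, int n) {
        // placeholder work: spin so the kernel takes a noticeable amount of time
        for (int i = 0; i < n; ++i) d[threadIdx.x] += i;
    }

    static double ms_since(std::chrono::steady_clock::time_point t0) {
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        int *d = nullptr, h[256];
        cudaMalloc(&d, sizeof(h));
        cudaMemset(d, 0, sizeof(h));

        auto t0 = std::chrono::steady_clock::now();
        long_kernel<<<1, 256>>>(d, 1 << 20);                  // asynchronous: returns almost immediately
        printf("launch call: %8.3f ms\n", ms_since(t0));

        t0 = std::chrono::steady_clock::now();
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);  // blocking: also waits for the kernel
        printf("memcpy call: %8.3f ms\n", ms_since(t0));

        cudaFree(d);
        return 0;
    }

The profiler would report the kernel activity as long and the copy activity as tiny, while the host-side API times show the opposite pattern, for the reasons described above.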

How to initialize CUDA so I can make valid execution time measurements?

In my application I have implemented the same algorithm for the CPU and for the GPU with CUDA, and I have to measure the time needed to perform the algorithm on each. I've noticed that there's some time spent on CUDA initialization in the GPU version of the algorithm, and I added cudaFree(0); at the beginning of the program code, as recommended here for CUDA initialization, but the first GPU CUDA algorithm execution still takes more time than the second one.
Is there anything else CUDA-related that has to be initialized at the beginning so that I can measure the actual algorithm execution time correctly?
The heuristics of lazy context initialisation in the CUDA runtime API have changed subtly since the answer you linked to was written, in two ways I am aware of:
cudaSetDevice() now initiates a context, whereas earlier on it did not (hence the need for the cudaFree() call discussed in that answer)
Some device code related initialisation which the runtime API used to perform at context initialisation is now done at the first call to a kernel
The only solution I am aware of for the second item is to run the CUDA kernel code you want to time once as a "warm up" to absorb the setup latency, and then perform your timing on the code for benchmarking purposes.
Alternatively, you can use the driver API and have much finer grained control over when latency will occur during application start up.
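A minimal sketch of the warm-up approach (my_kernel and its launch configuration are placeholders for your own algorithm):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data, int n) {   // placeholder for your algorithm
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));

        // Warm-up launch: absorbs setup latency, result is discarded
        my_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();

        // Timed launch, using CUDA events
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        my_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }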

Locking a process to Cuda core

I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example, you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core that essentially runs forever - reading/writing from its own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offload CPU usage, or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what I've said requires some interpretation. I'm not that used to GPU stuff yet.
Thanks.
All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
A process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system's definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), will in many cases stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception to this today, in the case of CUDA Dynamic Parallelism, so in the most general sense it is not possible to "lock" a CUDA thread of execution to a CUDA core (using "core" here to refer to that thread of execution forever remaining on a given warp lane within an SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from its own memory to system memory
Yes, that's possible, extending the train of thought. Writing to its own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism.
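A rough sketch of the combination of these two ideas (a persistent kernel plus mapped/zero-copy host memory); this is deliberately simplified, with no real work queue and no proper fencing, so treat it as an illustration rather than production code:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Persistent kernel: polls a host-visible flag and "works" until told to quit.
    __global__ void persistent_kernel(volatile int *quit_flag, int *counter) {
        while (*quit_flag == 0) {
            atomicAdd(counter, 1);   // placeholder work the host can observe
        }
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped (zero-copy) host memory

        int *h_quit = nullptr, *h_counter = nullptr;
        cudaHostAlloc(&h_quit,    sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocMapped);
        *h_quit = 0;
        *h_counter = 0;

        int *d_quit = nullptr, *d_counter = nullptr;
        cudaHostGetDevicePointer(&d_quit,    h_quit,    0);
        cudaHostGetDevicePointer(&d_counter, h_counter, 0);

        persistent_kernel<<<1, 1>>>(d_quit, d_counter);   // runs "forever"

        // ... host does other work here, possibly feeding a real work queue ...

        *h_quit = 1;                 // ask the persistent kernel to exit
        cudaDeviceSynchronize();
        printf("iterations observed: %d\n", *h_counter);

        cudaFreeHost(h_quit);
        cudaFreeHost(h_counter);
        return 0;
    }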
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offload CPU usage, or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.

Windows multitasking breaks OpenCL performance

I'm writing a Qt application with a simple idea: there are several OpenCL-capable devices, and each of them gets its own control thread which prepares data, executes the OpenCL kernel and processes the results. The OpenCL code is actually a bitcoin mining kernel (for now it's this one, but it doesn't matter).
When working with 2 GPUs everything is ok.
When I use the GPU and the CPU there is a problem. The CPU works at a reasonable speed, but the GPU slows down to zero performance.
There is no such problem under Linux. Under Windows, poclbm behaves the same way: when starting multiple instances (1 for GPU, 1 for CPU), GPU performance is 0.
I'm not sure which part of the code I should post so that it will be helpful. I can only mention that the thread is a QThread subclass with run() reimplemented as a busy loop: while( !_stop ) { mineBitcoins(); }. The logic of that loop is pretty much copied from poclbm's BitcoinMiner::mining_thread (here).
In which direction should I dig? Thanks.
upd:
I'm using QtOpenCL with AMD APP SDK.
If you run the kernel on the CPU with full utilization of all cores, the threads that handle the other devices might not be able to keep up with the GPU, effectively limiting performance.
Try decreasing the number of threads running the kernel on the CPU, e.g. if your program runs on a quad-core with hyper threading, limit the threads to 7.
Don't use the host device as an OpenCL device. If you really have to, restrict the number of compute units (of the CPU used as host) allocated for CL by creating a sub-device.
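For example (a sketch assuming an OpenCL 1.2 implementation, with error checking omitted; on older OpenCL 1.1 stacks the cl_ext device fission extension offers similar functionality), the CPU device can be partitioned so that CL only gets some of its compute units:

    #include <cstdio>
    #include <CL/cl.h>

    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);

        cl_device_id cpu;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);

        // Give OpenCL only 7 of the CPU's compute units (example count),
        // leaving headroom for the host thread that feeds the GPU.
        const cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_BY_COUNTS, 7,
            CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
        };

        cl_device_id sub;
        cl_uint num_sub = 0;
        clCreateSubDevices(cpu, props, 1, &sub, &num_sub);

        // Use 'sub' instead of 'cpu' when creating the context/queue for CPU work.
        printf("created %u CPU sub-device(s)\n", num_sub);
        return 0;
    }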
I don't know if you are using both devices in the same context. But if that is the case, memory consistency inside a context can be your problem, along with how the different OpenCL implementations handle it.
OpenCL tries to keep the memory inside a context updated (at least on Windows), and this can cause the GPU to continuously copy the memory it uses back to the "CPU side".
I tried that long ago and, as in your case, it resulted in "~=0 performance on the GPU".

Using Nsight to determine bank conflicts and coalescing

How can I know the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Moreover, what should I look at when I use Nsight as a profiler? What are the important fields that may indicate why my program is slow?
I don't use NSight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll pay attention to your GPU's occupancy.
Another interesting value is where the compiler has placed your local variables: in registers or in local memory.
Finally, you'll check the time spent to transfer data to and back from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescing: basically you just need to watch Global Memory Loads/Stores - Coalesced/Uncoalesced and flag the uncoalesced ones.
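For reference, here is the kind of shared memory access pattern that shows up as warp serialization (bank conflicts), together with the usual padding fix. This is a generic illustration, not code from the question:

    #include <cuda_runtime.h>

    // Column-wise access into a 32x32 shared array: the 32 threads of a warp
    // hit the same bank -> bank conflicts (reported as warp serialization).
    __global__ void conflicted(float *out) {
        __shared__ float tile[32][32];
        tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;   // row-wise store: fine
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];  // column-wise load: conflicts
    }

    // Padding each row by one element spreads the columns across banks
    // and removes the conflict.
    __global__ void padded(float *out) {
        __shared__ float tile[32][33];                          // note the +1 padding
        tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
    }

    int main() {
        float *d_out = nullptr;
        cudaMalloc(&d_out, 32 * 32 * sizeof(float));
        conflicted<<<1, dim3(32, 32)>>>(d_out);   // profile this: warp serialization is high
        padded<<<1, dim3(32, 32)>>>(d_out);       // profile this: the conflicts disappear
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }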
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
For the question of what important fields/things to look at (when using the Nsight profiler) that may explain why your program slows down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring, but your application threads (Thread State) are green
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory row. If you are spending a lot of time in memory copies then you should consider using CUDA streams to pipeline your application. This can allow you to overlap memory transfers and kernels (see the sketch after this list). Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.
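For item (b), a minimal sketch of the stream pipelining idea (the kernel and chunk sizes are placeholders; pinned host memory is required for the copies and kernels to actually overlap):

    #include <cuda_runtime.h>

    __global__ void process(float *d, int n) {                 // placeholder kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        const int nStreams = 4, chunk = 1 << 20;
        float *h = nullptr, *d = nullptr;
        cudaHostAlloc(&h, nStreams * chunk * sizeof(float), cudaHostAllocDefault);  // pinned
        cudaMalloc(&d, nStreams * chunk * sizeof(float));

        cudaStream_t s[nStreams];
        for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

        // Each chunk's H2D copy, kernel, and D2H copy go into its own stream,
        // so the copies of one chunk can overlap with the kernel of another.
        for (int i = 0; i < nStreams; ++i) {
            float *hp = h + i * chunk, *dp = d + i * chunk;
            cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, s[i]);
            process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(dp, chunk);
            cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }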