CUDA 4.0 vs 3.2 - C++

Is CUDA 4.0 faster than 3.2?
I am not interested in the new features of CUDA 4.0, but rather in whether memory allocation and transfers will be faster if I use CUDA 4.0.
Thanks

Memory allocation and transfer depend more (if not exclusively) on the hardware's capabilities (more efficient pipelines, cache size) than on the version of CUDA.

Even while on CUDA 3.2, you can install the CUDA 4.0 drivers (270.x) -- drivers are backward compatible -- so you can test the new driver without recompiling your application. It is true that there are driver-level optimizations that affect run-time performance.
While that has generally worked fine on Linux, I have noticed some hiccups on Mac OS X.
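If it helps, here is a minimal sketch (assuming the CUDA runtime headers are available) for confirming which driver and runtime versions an application actually sees; with the 270.x driver installed, the driver version should report 4000 even if the app is still built against the 3.2 toolkit:

    // Minimal sketch: report the driver's supported CUDA version vs. the runtime
    // the application was built against (e.g. 4000 = CUDA 4.0, 3020 = CUDA 3.2).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int driverVersion = 0, runtimeVersion = 0;
        cudaDriverGetVersion(&driverVersion);    // version the installed driver supports
        cudaRuntimeGetVersion(&runtimeVersion);  // version of the runtime linked into the app
        std::printf("driver: %d, runtime: %d\n", driverVersion, runtimeVersion);
        return 0;
    }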

Yes, I have a fairly substantial application which ran ~10% faster once I switched from 3.2 to 4.0, without any code changes to take advantage of new features.
I also have a GTX 480, if that matters.
Note that the performance gains may be due to the fact that I'm using a newer version of the dev drivers (installed automatically when you upgrade). I imagine NVIDIA may well be tweaking for CUDA performance the way they do for blockbuster games like Crysis.

Performance of memory allocation mostly depends on host platform (because the driver models differ) and driver implementation. For large amounts of device memory, allocation performance is unlikely to vary from one CUDA version to the next; for smaller amounts (say less than 128K), policy changes in the driver suballocator may affect performance.
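If you want to check this on your own setup, a rough micro-benchmark along these lines will show whether small allocations behave differently from large ones across driver/toolkit versions (a sketch, not a rigorous benchmark; sizes and iteration counts are arbitrary):

    // Rough sketch: time cudaMalloc/cudaFree for a small block (likely handled by
    // the driver's suballocator) vs. a large block (a full driver allocation).
    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    static double avgAllocMicroseconds(size_t bytes, int iterations) {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            void* p = nullptr;
            cudaMalloc(&p, bytes);
            cudaFree(p);
        }
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::micro>(stop - start).count() / iterations;
    }

    int main() {
        cudaFree(0);  // force context creation so it isn't counted in the first timing
        std::printf("64 KB: %.1f us/alloc\n", avgAllocMicroseconds(64 << 10, 1000));
        std::printf("64 MB: %.1f us/alloc\n", avgAllocMicroseconds(64 << 20, 100));
        return 0;
    }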
For pinned memory, CUDA 4.0 is a special case because it introduced some major policy changes on UVA-capable systems. First, on initialization the driver makes some huge virtual address reservations. Second, all pinned memory is portable, so it must be mapped for every GPU in the system.
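For reference, this is the kind of allocation being described; before 4.0 you had to ask for portable pinned memory explicitly, whereas on a UVA-capable system under 4.0 pinned allocations behave this way by default (a sketch, error checking omitted):

    // Sketch: explicitly portable pinned host memory, mapped for every GPU in the
    // system -- the behaviour CUDA 4.0 applies to pinned allocations under UVA.
    #include <cuda_runtime.h>

    int main() {
        void* host = nullptr;
        cudaHostAlloc(&host, 16 << 20, cudaHostAllocPortable);  // 16 MB, pinned, portable
        // ... usable as a fast staging buffer by any device ...
        cudaFreeHost(host);
        return 0;
    }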
Performance of PCI Express transfers is mostly an artifact of the platform, and usually there is not much a developer can do to control it. (For small CUDA memcpys, driver overhead may vary from one CUDA version to another.) One issue is that on systems with multiple I/O hubs, nonlocal DMA accesses go across the HT/QPI link and so are much slower. If you're targeting such systems, use the NUMA APIs to steer memory allocations (and threads) onto the same CPU that the GPU is plugged into.
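On Linux that steering can be done with libnuma; a hedged sketch follows (node 0 is just an assumed placeholder for whichever node the GPU's PCIe slot is attached to, and cudaHostRegister requires CUDA 4.0 or later):

    // Sketch: allocate a staging buffer on the NUMA node local to the GPU, keep
    // the calling thread there too, then pin the buffer for fast DMA.
    // Build with -lnuma on Linux.
    #include <numa.h>
    #include <cuda_runtime.h>

    int main() {
        if (numa_available() < 0) return 1;      // no NUMA support on this system

        const int gpuLocalNode = 0;              // assumption: node the GPU hangs off
        const size_t bytes = 64 << 20;

        numa_run_on_node(gpuLocalNode);          // run this thread on the GPU-local node
        void* staging = numa_alloc_onnode(bytes, gpuLocalNode);  // page-aligned, node-local

        cudaHostRegister(staging, bytes, cudaHostRegisterPortable);  // pin it (CUDA 4.0+)
        // ... cudaMemcpyAsync() to/from this buffer avoids the cross-IOH hop ...
        cudaHostUnregister(staging);
        numa_free(staging, bytes);
        return 0;
    }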

The answer is yes, because CUDA 4.0 reduces system memory usage and CPU memcpy() overhead.

Related

Identical code using more than double the RAM on different computers

I'm creating a Minecraft clone in C++ with OpenGL.
I noticed today that when debugging the program on my laptop, the RAM usage is way higher than on my desktop PC (~1.3 GB vs ~500 MB). I'm getting these memory numbers from Visual Studio's diagnostic tools.
I'm using GitHub and even with the same branch, same commit, literally the exact same code, the laptop uses more RAM. I tried cleaning the solution, rebuilding, cloning again, nothing works.
The memory usage is different on the Windows Task Manager, too.
I'm out of ideas as to what could be happening. The computers are on different platforms (the laptop is 10th-gen Intel, the desktop is Ryzen 3000), and the laptop has less RAM (8 GB vs 16 GB). Both are running the latest Windows 10. I'm using Visual Studio Community 2019.
I'm not sure if a platform difference could cause such a huge impact on memory allocation.
Many laptop architectures use something called unified memory. That is to say, there is only one big pool of memory that is shared between the CPU and GPU (or the equivalent portions on an APU).
On such architectures, allocating video memory is essentially the same thing as allocating RAM. It's all hidden away by the graphics drivers though.
So a graphics-heavy application using more RAM on a laptop than on a desktop with a discrete GPU is not surprising. However, it's not so much that it uses more memory, just that the memory it uses gets tabulated differently.
Assuming both platforms run at the same resolution and the same assets are loaded, you'd expect GPU Memory + RAM usage on desktop would be roughly equivalent to RAM usage on the laptop.
Emphasis on the word roughly. Different graphic architectures/drivers use memory differently, so don't expect a 1-to-1 match here. For example:
A single 1080p framebuffer takes a few megabytes at minimum; depending on how the driver interacts with the actual screen, it is rarely obvious how many of these are around.
Tiled architectures can completely bypass the need for large chunks of memory altogether.
That's the most likely scenario here.

OS versions that support the system allocator for CUDA Unified Memory?

From the slides posted here, it seems that using the system allocator through calls to malloc or new, instead of Nvidia's cudaMallocManaged, is only supported on Linux kernel versions 4.14 or newer. If so, is there a way to query the Nvidia driver or the CUDA runtime to know whether the system allocator can be used to properly allocate a memory block for use within the CUDA unified memory model? Or would this have to be something where you keep a whitelist of operating systems/kernel versions and fall back to the traditional cudaMallocManaged if the detected operating system is not on the whitelist? If the latter, does anyone know of an approved whitelist of operating system versions?
So I ran into this problem, because I thought my system was supported, but it's not. Long story short, since this "feature" seems to require a perfect storm of compatible hardware and software, I am sticking to the old API. I know that is probably not the answer you wanted.
If you really want to use malloc or new, I would say your query would be a combination of:
Asking the OS what kernel version it's running. See the uname syscall.
Running deviceQuery (or similar) to check for CUDA version (8.0 or better) and GPU (looking for compute capability greater than or equal to 6.0).
More info regarding OS support can be found here: https://www.phoronix.com/scan.php?page=news_item&px=HMM-In-Linux-4.14. It does seem that kernel version 4.14 or greater should have this feature.
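As a sketch of what that query could look like (assuming I'm reading the docs right: the pageableMemoryAccess device attribute reports whether the GPU, with the installed driver and kernel, can directly access plain malloc/new memory), combined with a uname check for logging:

    // Sketch: decide at run time whether plain malloc/new pointers can be used
    // with the CUDA unified memory model, falling back to cudaMallocManaged.
    #include <cuda_runtime.h>
    #include <sys/utsname.h>
    #include <cstdio>

    bool systemAllocatorUsable(int device) {
        utsname info{};
        uname(&info);
        std::printf("kernel release: %s\n", info.release);  // e.g. 4.14 or newer for HMM

        int pageable = 0;
        cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, device);
        return pageable == 1;  // 1 means the GPU can coherently access pageable host memory
    }

    int main() {
        std::printf("system allocator usable: %s\n",
                    systemAllocatorUsable(0) ? "yes" : "no (use cudaMallocManaged)");
        return 0;
    }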

OpenCL and OpenCV 3.0 Beta

Why is there no OpenCL (ocl) module in OpenCV 3.0 beta?
I heard that the new OpenCV uses OpenCL transparently, but when I test this on Windows on an Intel Core i5 (HD400 GPU), I cannot see any speed improvement from running on the GPU.
Am I missing something here?
The ocl module of OpenCV was intentionally removed. Developers are no longer expected to use ocl::Canny-style invocations; these methods are invoked internally by OpenCV. Developers are expected to use the UMat structure, as explained in the presentation. UMat wraps cl_mem when OpenCL is available; otherwise it falls back to the CPU. See ocl.cpp.
Regarding speed, I would check the following (a short sketch follows this list):
In cvconfig.h in build directory, check if OpenCL flag is ON or OFF
In code, ocl::setUseOpenCL(true)
In code, Use UMat in place of Mat
Then check FPS with and without the call to ocl::setUseOpenCL(true);
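A minimal sketch of that workflow (the file name and Canny thresholds are just placeholders):

    // Sketch: OpenCV 3.0 T-API -- use UMat and let OpenCV dispatch to OpenCL when
    // it is available; toggle setUseOpenCL(false) to measure the CPU baseline.
    #include <opencv2/core/ocl.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>
    #include <iostream>

    int main() {
        std::cout << "OpenCL available: " << cv::ocl::haveOpenCL() << std::endl;
        cv::ocl::setUseOpenCL(true);            // set to false for the CPU comparison run

        cv::UMat src, edges;
        cv::imread("input.png", cv::IMREAD_GRAYSCALE).copyTo(src);  // upload into a UMat
        cv::Canny(src, edges, 50, 150);         // dispatched to the OpenCL kernel if enabled

        cv::Mat result;
        edges.copyTo(result);                   // download only when the result is needed on the CPU
        return 0;
    }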
What I would expect to see is not a drastic FPS increase. Even assuming the GPU is used, there can be cases where data has to be copied back and forth between CPU and GPU memory, and this affects end-to-end performance. I would expect to see processing offloaded to the GPU and less burden on the CPU, not necessarily a speed increase.
With regard to tools, you could use AMD's CodeXL to observe the behavior of OpenCV/OpenCL. You can see the sequence of OpenCL API calls, the kernels used, their performance and source code, data buffers and their contents, etc. Of course, all this only works on AMD hardware. I think Parallel Nsight can do the same for NVIDIA. For Intel, I do not know which tool can help.

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing an OpenCL/C++ program on my desktop PC, but when I travel somewhere I continue the work on my old laptop. Programming C++ on this laptop works fine, but I can't try the OpenCL parts of my code, so at the moment I am writing OpenCL code without knowing whether it is correct or not.
Is there a way, to virtualize an OpenCL compatible CPU/GPU? I don't want to get high performance, I just want to try my code, doesn't matter if it is very slow (slower than if I run it 1-thread on my Celeron CPU).
I guess, the answer is no.
(BTW, my plan is, there will be an option in my program, and you can run it with or without OpenCL. This is also needed to measure performance, and compare OpenCL CPU/GPU, and CPU in 1-thread mode without OpenCL.)
almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
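If you'd rather check programmatically than with CPU-Z, GCC and Clang expose the CPUID feature flags directly (a sketch; x86-only and compiler-specific):

    // Sketch: query CPUID feature flags at run time (GCC/Clang builtin, x86 only).
    #include <cstdio>

    int main() {
        __builtin_cpu_init();  // populate the CPU model/feature data
        std::printf("SSE2: %d\n", __builtin_cpu_supports("sse2"));
        std::printf("SSE3: %d\n", __builtin_cpu_supports("sse3"));  // minimum for AMD's CPU OpenCL driver
        return 0;
    }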
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which just squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should hopefully be enough to force the scheduler to execute the maximum number of kernels and work items as efficiently as possible, making use of the available cores / processors.
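For concreteness, the sizing above comes from device queries like the following (a sketch; error checking omitted, and the scalar multiple of 8 is arbitrary):

    // Sketch: query MAX_COMPUTE_UNITS and MAX_WORK_GROUP_SIZE, then derive the
    // local and global work sizes described above.
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

        cl_uint computeUnits = 0;
        size_t maxWorkGroup = 0;
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(computeUnits), &computeUnits, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(maxWorkGroup), &maxWorkGroup, nullptr);

        const size_t local  = maxWorkGroup;              // work_group_size
        const size_t global = 8 * computeUnits * local;  // amount_of_work
        std::printf("CUs: %u, local: %zu, global: %zu\n",
                    (unsigned)computeUnits, local, global);
        return 0;
    }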
For a CPU, you can check cpuid, or sched_getcpu, or GetProcessorNumber in order to check which core / processor the current thread is currently executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetProcessorNumber for GPUs, for core-usage monitoring and the like? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Capeverde architecture, 7700M series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (like the manuals that exist for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch for heat rising. If it heats up like FurMark, it is using the cores. If it is not heating, you can also disable serial operations in the kernels (gid == 0) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it to a known benchmark's temperature limits.
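A sketch of what that stripped-down loop might look like (the queue, kernel, and work sizes are assumed to be already set up; no reads back, no visualisation):

    // Sketch: launch the kernel repeatedly with no host<->device copies, so any
    // sustained heat/Vdroop comes from the compute units actually doing work.
    #include <CL/cl.h>

    void hammerKernel(cl_command_queue queue, cl_kernel kernel,
                      const size_t* global, const size_t* local) {
        for (int i = 0; i < 100000; ++i) {
            clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                                   global, local, 0, nullptr, nullptr);
            if (i % 1000 == 0)
                clFinish(queue);  // drain periodically so the queue doesn't grow unboundedly
        }
        clFinish(queue);
    }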
Something similar exists for the CPU: using float4 heats it more than plain floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone all the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop; more cores in use means more drop, and more load per core also means more drop.
Whatever you do, it comes down to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you are using 1% of the card's true potential.
Alternatively, simply look at the temperature difference of the computer case between the starting state and the equilibrium state. If the delta-T is around 50 with OpenCL and 5 without, then OpenCL is parallelising work, though you can't tell how much.