CUDA C++: How to write a program to benchmark shared memory bandwidth?

I'm looking for a way to benchmark shared memory and the L1/L2 caches. However, the benchmark results I have found vary a lot depending on the source.
In the paper Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, they report a shared memory bandwidth of about 12000GB/s on a Tesla V100, but they don't explain how they measured that number. If I use gpumembench on an NVIDIA A30, I only get ~5000GB/s.
Is there any other sample programs I can use to benchmark shared memory?
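The rough approach I understand these microbenchmarks to take is something like the following sketch (all names and sizes here are mine; the result is very sensitive to unrolling, block count, and clock state, which may be part of why published figures differ so much):

    // smem_bw.cu -- rough shared-memory read-bandwidth sketch (sizes and names are arbitrary)
    // Build: nvcc -O3 -arch=sm_80 smem_bw.cu -o smem_bw   (adjust -arch for your GPU)
    #include <cstdio>

    constexpr int SMEM_FLOATS = 1024;   // 4 KiB of shared memory per block
    constexpr int ITERS       = 4096;   // repetitions of the shared-memory sweep per block

    __global__ void smem_bw_kernel(float *sink)
    {
        __shared__ float buf[SMEM_FLOATS];
        int tid = threadIdx.x;

        // initialise shared memory cooperatively
        for (int i = tid; i < SMEM_FLOATS; i += blockDim.x)
            buf[i] = static_cast<float>(i);
        __syncthreads();

        // repeatedly read shared memory; accumulate so the loads cannot be optimised away
        float acc = 0.0f;
        for (int it = 0; it < ITERS; ++it)
            for (int i = tid; i < SMEM_FLOATS; i += blockDim.x)
                acc += buf[i];

        // one write per block keeps the compiler honest without polluting the timing much
        if (tid == 0)
            sink[blockIdx.x] = acc;
    }

    int main()
    {
        const int blocks = 512, threads = 256;
        float *d_sink;
        cudaMalloc(&d_sink, blocks * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        smem_bw_kernel<<<blocks, threads>>>(d_sink);   // warm-up launch
        cudaEventRecord(start);
        smem_bw_kernel<<<blocks, threads>>>(d_sink);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // shared-memory bytes read per block = ITERS * SMEM_FLOATS * sizeof(float)
        double bytes = double(blocks) * ITERS * SMEM_FLOATS * sizeof(float);
        printf("shared-memory read bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));

        cudaFree(d_sink);
        return 0;
    }

With -O3 the compiler may still restructure the inner loop, so it is worth checking the generated SASS (cuobjdump -sass) to confirm the shared-memory loads actually survive.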

Related

Is cv::dft transform multithreaded?

I need to boost cv::dft performance in a multithreaded environment. I've done a simple test on Windows 10 on a Core i5 Intel processor:
Here I see that the CPU is not fully loaded (only 50% usage). The individual threads are loaded equally and are also far from 100%. Why is that, and how can I fix it? Can DFT be easily parallelized? Is it implemented that way in the OpenCV library? Are there special build flags to enable it (and which ones)?
UPDATE: Running this code on Linux gives a slightly different result, but utilization is also below 100%:
First of all, the behaviour of cv::dft depends on the OpenCV build flags; for example, if you set WITH_IPP, it will use the Intel Integrated Performance Primitives to speed up the computation. FFT is memory-bound, so if you simply launch more threads you most probably won't benefit much from the parallelism, because the threads end up waiting for each other's memory accesses; I've observed this on both Linux and Windows. To gain more performance you should use FFTW3, which has a sophisticated algorithm for multi-threaded mode (the library has to be ./configure'd with the threading flag). I observed up to a 7x speedup with 8 threads. But FFTW's business-friendly license is paid-only; the free version imposes the GPL on your software. I have not found any other open-source component that handles FFT parallelism in a smart way.
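If you do go the FFTW3 route, its threaded mode is only a few extra calls; here is a minimal sketch, assuming the library was built with --enable-threads and you link against fftw3_threads (the transform size is arbitrary):

    // fftw_threads_demo.cpp -- sketch only
    // Build: g++ fftw_threads_demo.cpp -lfftw3_threads -lfftw3 -lpthread
    #include <fftw3.h>

    int main()
    {
        const int N = 4096;              // arbitrary 2-D size
        fftw_init_threads();             // must precede any other FFTW call
        fftw_plan_with_nthreads(8);      // plans created from now on use 8 threads

        fftw_complex *in  = fftw_alloc_complex(N * N);
        fftw_complex *out = fftw_alloc_complex(N * N);

        fftw_plan plan = fftw_plan_dft_2d(N, N, in, out, FFTW_FORWARD, FFTW_MEASURE);

        // ... fill `in` with data here (FFTW_MEASURE clobbers it during planning) ...

        fftw_execute(plan);              // this transform runs multi-threaded

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        fftw_cleanup_threads();
        return 0;
    }

Note that fftw_plan_with_nthreads() only affects plans created after it is called, which is why it comes before fftw_plan_dft_2d().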

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing an OpenCL/C++ program on my desktop PC. But when I travel somewhere, I continue the work on my old laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code. So at the moment I am writing OpenCL code without knowing whether it is correct or not.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess, the answer is no.
(BTW, my plan is to have an option in my program so it can run with or without OpenCL. This is also needed to measure performance and to compare OpenCL on the CPU/GPU against a single-threaded CPU run without OpenCL.)
Almost an answer, but not quite what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
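If you just want to know whether any of these implementations will actually load on a given machine, a quick enumeration of platforms and devices with the standard host API is enough; a sketch (error handling mostly omitted):

    // cl_probe.cpp -- list whatever OpenCL platforms/devices the installed drivers expose
    // Build (Linux): g++ cl_probe.cpp -lOpenCL
    #include <CL/cl.h>
    #include <cstdio>

    int main()
    {
        cl_uint nplat = 0;
        if (clGetPlatformIDs(0, nullptr, &nplat) != CL_SUCCESS || nplat == 0) {
            printf("no OpenCL platform found\n");
            return 1;
        }
        if (nplat > 16) nplat = 16;
        cl_platform_id plats[16];
        clGetPlatformIDs(nplat, plats, nullptr);

        for (cl_uint p = 0; p < nplat; ++p) {
            char pname[256];
            clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof pname, pname, nullptr);

            cl_uint ndev = 0;
            clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
            if (ndev > 16) ndev = 16;
            cl_device_id devs[16];
            clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, nullptr);

            for (cl_uint d = 0; d < ndev; ++d) {
                char dname[256];
                clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof dname, dname, nullptr);
                printf("platform '%s' : device '%s'\n", pname, dname);
            }
        }
        return 0;
    }

If this prints nothing on the laptop, no amount of SSE will help; if it lists a CPU device, your kernels should at least run there, however slowly.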

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully this is enough to force the scheduler to execute the maximum number of kernels and work items as efficiently as possible, making use of all the available cores / processors.
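For reference, the kernel looks roughly like this; I've added a (hypothetical) extra buffer that records get_group_id(0) per work item, which at least lets me dump how the work was split into work-groups, though it says nothing about which compute unit actually ran each group:

    // square.cl
    __kernel void square(__global const int *in,
                         __global int *out,
                         __global int *group_of)   // extra buffer, my addition, one int per item
    {
        size_t gid = get_global_id(0);
        out[gid]      = in[gid] * in[gid];          // the actual work: square the input
        group_of[gid] = (int)get_group_id(0);       // record which work-group produced this result
    }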
For a CPU, you can check cpuid, or sched_getcpu, or GetProcessorNumber in order to check which core / processor the current thread is currently executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetProcessorNumber for GPUs, for core-usage monitoring etc.? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, based on AMD's Cape Verde architecture, 7700M-series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are for x86), that would be a possible start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch the temperature rise. If it heats up like FurMark, it is using the cores. If it does not heat up, you can also disable the serial operations in the kernels (the gid == 0 branches) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000-series GPU over 70°C in minutes, and to 90°C with a poor cooler. Compare that against a known benchmark's temperature limits.
Something similar exists for the CPU: using float4 heats it more than plain floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead: more load means more voltage drop; more active cores mean more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you are using 1% of the card's true potential.
Simply look at the temperature difference of the computer case between its starting state and its equilibrium state. If the delta-T is around 50 with OpenCL and 5 without it, OpenCL is parallelising something, though you can't tell how much.

Cuda 4.0 vs 3.2

Is CUDA 4.0 faster than 3.2?
I am not interested in the additions of CUDA 4.0 but rather in knowing if memory allocation and transfer will be faster if I used CUDA 4.0.
Thanks
Memory allocation and transfer performance depend more (if not exclusively) on the hardware's capabilities (more efficient pipelines, cache sizes) than on the version of CUDA.
Even while on CUDA 3.2, you can install the CUDA 4.0 drivers (270.x) -- drivers are backward compatible. So you can test that apart from re-compiling your application. It is true that there are driver-level optimizations that affect run-time performance.
While generally that has worked fine on Linux, I have noticed some hiccups on MacOSX.
Yes, I have a fairly substantial application which ran ~10% faster once I switched from 3.2 to 4.0. This is without any code changes to take advantage of new features.
I also have a GTX480 if that matters any.
Note that the performance gains may be due to the fact that I'm using a newer version of the dev drivers (installed automatically when you upgrade). I imagine NVIDIA may well be tweaking drivers for CUDA performance the way they do for blockbuster games like Crysis.
Performance of memory allocation mostly depends on host platform (because the driver models differ) and driver implementation. For large amounts of device memory, allocation performance is unlikely to vary from one CUDA version to the next; for smaller amounts (say less than 128K), policy changes in the driver suballocator may affect performance.
For pinned memory, CUDA 4.0 is a special case because it introduced some major policy changes on UVA-capable systems. First of all, on initialization the driver does some huge virtual address reservations. Secondly, all pinned memory is portable, so must be mapped for every GPU in the system.
Performance of PCI Express transfers is mostly an artifact of the platform, and usually there is not much a developer can do to control it. (For small CUDA memcpy's, driver overhead may vary from one CUDA version to another.) One issue is that on systems with multiple I/O hubs, nonlocal DMA accesses go across the HT/QPI link and so are much slower. If you're targeting such systems, use NUMA APIs to steer memory allocations (and threads) onto the same CPU that the GPU is plugged into.
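As a sketch of that last point, on Linux you could bind the allocating thread to the GPU-local NUMA node with libnuma before allocating pinned buffers; the node number below is an assumption, so check the real topology first (e.g. with numactl --hardware):

    // numa_pinned.cu -- sketch: bind to the NUMA node nearest the GPU before allocating pinned memory
    // Build (Linux): nvcc numa_pinned.cu -lnuma
    #include <numa.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int gpu_node = 0;             // assumed: the node whose I/O hub the GPU hangs off
        if (numa_available() >= 0) {
            numa_run_on_node(gpu_node);     // keep this thread on that node's cores
            numa_set_preferred(gpu_node);   // and prefer that node's memory for new allocations
        }

        cudaSetDevice(0);
        float *h_buf;
        cudaHostAlloc((void **)&h_buf, 64 << 20, cudaHostAllocDefault);  // 64 MB pinned, node-local

        // ... cudaMemcpyAsync() to/from h_buf now avoids the cross-QPI/HT hop ...

        cudaFreeHost(h_buf);
        return 0;
    }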
The answer is yes, because CUDA 4.0 reduces system memory usage and the CPU memcpy() overhead.

Is there any online compiler with an executor that can compile and run apps that use GPU-specific C/C++ code?

Generally I need some online compiler that can compile and execute a provided program and output its execution speed and other statistics. The whole program can be in one C file, and it would use any GPU C/C++ library provided. I want to compile at least C code. Does any GPU vendor provide such a compiler? Actually my problem is this: I have a powerful CPU and a weak GPU in my machine. I need to test some algorithms that are specific to GPUs and get statistics on their execution. I would like to test my programs any way possible, so if there is no such online GPU service, maybe there is an emulator that can output the time and other statistics that I would get on some real GPU? (Meaning I would give it a program, it would execute it on my CPU, but it would somehow count the time as if it were running on a GPU.)
So is it possible, in any way, to test GPU-specific programs without having a GPU card, meaning on emulation software or somewhere in an internet cloud?
Amazon EC2 recently added support for "GPU instances", which are normal HPC instances which come with two NVIDIA Tesla “Fermi” M2050 GPUs. You can SSH into these instances, install a compiler, and go to town with them.
It'll cost $2.10/hour (or $0.74/hour if you get a Reserved Instance for a longer block of time)
If it's an option at all, I'd strongly consider just getting the GPU card(s).
The low end of any given GPU family is usually pretty cheap, and you can make some reasonable performance extrapolations from that to the high end.
If you get the CUDA developer tools and SDK from NVIDIA, you can build and run CUDA programs in emulation mode, where they simply run on the host CPU instead of on the GPU. This is a great way to learn the basics of GPU programming before you start trying to get code to run on an actual GPU card.
UPDATE
Apparently emulation was removed in CUDA 3.1.