Run OpenCL without compatible hardware?

I have two PCs:
- a new high-end desktop PC with an OpenCL-compatible CPU and GPU and 32 GB RAM
- a very old laptop with an Intel Celeron CPU, 512 MB RAM, and an ATI M200 GPU
I am writing OpenCL/C++ software on my desktop PC. But when I travel somewhere, I continue the work on my old-school laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code. So at the moment I am writing OpenCL code without knowing whether it is good or not.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it runs very slowly (even slower than single-threaded on my Celeron CPU).
I guess the answer is no.
(BTW, my plan is to add an option to my program so you can run it with or without OpenCL. I also need this to measure performance and compare OpenCL on the CPU/GPU against a single-threaded CPU run without OpenCL.)
Almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html

For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4s.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
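If you would rather check programmatically than with CPU-Z, here is a minimal sketch using the CPUID instruction (leaf 1 reports SSE2 in EDX bit 26 and SSE3 in ECX bit 0; it assumes GCC or Clang on x86 for the <cpuid.h> intrinsic):

    #include <cpuid.h>   // GCC/Clang intrinsic header for __get_cpuid
    #include <cstdio>

    int main() {
        unsigned int eax, ebx, ecx, edx;
        // CPUID leaf 1: feature flags come back in ECX and EDX.
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 1 not supported\n");
            return 1;
        }
        printf("SSE2: %s\n", (edx & (1u << 26)) ? "yes" : "no");
        printf("SSE3: %s\n", (ecx & (1u <<  0)) ? "yes" : "no");
        return 0;
    }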
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.


How to differentiate between dedicated and integrated GPU cards via C++?

I want to know, via WMI or other means in C++, whether the user has an integrated or a dedicated GPU.
I have gone over Win32_VideoController and could not find anything that would help me differentiate between the two.
Thanks in advance.
It is surprising that no one has brought this idea up after so many years. Most people said it is impossible, and it is true that Windows natively does not provide any means to detect whether a GPU is an iGPU or a dGPU. However, I managed to make it work to a certain extent, if you can bear with the limitations.
The general idea is that you can use wmic to get the name of the installed CPU and maintain a list of all CPUs that have integrated graphics (which may be very short, depending on what you need this feature for).
For newer CPU models, like 9th-gen or newer Intel desktop processors and AMD Ryzen 1000-series and newer, you can simply tell by the CPU naming: Intel desktop processors without integrated graphics end with the letter F, while AMD processors with integrated graphics end with the letter G. Then you can use wmic to get the list of all GPUs (including the iGPU), and by counting the number of GPUs installed you can easily tell whether the user has an iGPU or a dGPU, as it is impossible to have more than one iGPU.
Thus, if you detect that the CPU comes with an iGPU and there is only one GPU reported by wmic, you know that it is definitely the iGPU. On the other hand, if there are multiple GPUs reported, you know that one of them is definitely a dGPU. Of course, this does not work if the user has manually disabled the iGPU, which is why I say this approach has limitations.
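A minimal C++ sketch of this approach (assuming a Windows host; it shells out to wmic via _popen and just counts devices, while the CPU-name-to-iGPU lookup is left as the table described above):

    #include <cctype>
    #include <cstdio>
    #include <iostream>
    #include <string>
    #include <vector>

    // Run a command and collect its non-empty output lines.
    static std::vector<std::string> runCommand(const char* cmd) {
        std::vector<std::string> lines;
        FILE* pipe = _popen(cmd, "r");  // Windows-specific popen
        if (!pipe) return lines;
        char buf[512];
        while (fgets(buf, sizeof(buf), pipe)) {
            std::string line(buf);
            while (!line.empty() && std::isspace(static_cast<unsigned char>(line.back())))
                line.pop_back();
            if (!line.empty()) lines.push_back(line);
        }
        _pclose(pipe);
        return lines;
    }

    int main() {
        // The first output line is the column header "Name"; the rest are device names.
        auto cpu  = runCommand("wmic cpu get name");
        auto gpus = runCommand("wmic path win32_VideoController get name");

        std::size_t gpuCount = gpus.size() > 1 ? gpus.size() - 1 : 0;
        std::cout << "CPU: " << (cpu.size() > 1 ? cpu[1] : "unknown") << "\n";
        std::cout << "GPUs reported: " << gpuCount << "\n";
        // Combine gpuCount with your CPU -> iGPU lookup table:
        //   1 GPU reported + CPU has an iGPU -> definitely the iGPU
        //   more than 1 GPU reported         -> at least one is a dGPU
        return 0;
    }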

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that the work groups will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully this is enough to force the scheduler to execute as many kernels and work items in parallel as possible, making use of the available cores / processors.
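For reference, a minimal sketch of that kind of setup (error checking omitted; the variable names are mine, and the constants are queried via clGetDeviceInfo rather than hard-coded):

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    // OpenCL C kernel: square each input integer in place.
    static const char* kSrc =
        "__kernel void square(__global int* data) {\n"
        "    size_t gid = get_global_id(0);\n"
        "    data[gid] = data[gid] * data[gid];\n"
        "}\n";

    int main() {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        size_t max_wg; cl_uint max_cu;
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(max_cu), &max_cu, NULL);

        // amount_of_work = scalar multiple of MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE,
        // with work_group_size maxed out, as described above.
        const size_t multiple = 4;  // arbitrary scalar multiple
        size_t global_size = multiple * max_cu * max_wg;
        size_t local_size  = max_wg;

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kern = clCreateKernel(prog, "square", NULL);

        std::vector<cl_int> data(global_size, 3);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    data.size() * sizeof(cl_int), data.data(), NULL);
        clSetKernelArg(kern, 0, sizeof(buf), &buf);
        clEnqueueNDRangeKernel(q, kern, 1, NULL, &global_size, &local_size, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, data.size() * sizeof(cl_int),
                            data.data(), 0, NULL, NULL);
        printf("data[0] = %d\n", data[0]);  // expect 9
        return 0;
    }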
For a CPU, you can use cpuid, sched_getcpu, or GetCurrentProcessorNumber to check which core / processor the current thread is executing on.
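For example, a tiny cross-platform sketch (sched_getcpu needs glibc on Linux; GetCurrentProcessorNumber needs Windows Vista or later):

    #include <cstdio>
    #ifdef _WIN32
    #include <windows.h>   // GetCurrentProcessorNumber
    #else
    #include <sched.h>     // sched_getcpu (glibc)
    #endif

    int main() {
    #ifdef _WIN32
        printf("running on core %lu\n", GetCurrentProcessorNumber());
    #else
        printf("running on core %d\n", sched_getcpu());
    #endif
        return 0;
    }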
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built-in function... or perhaps do the vendors' compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core usage monitoring and the like? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M-series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (like the manuals that exist for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will eventually also need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualizations and leave only the kernel executions intact. Then put it in a tight loop and watch the heat rise. If it heats up like FurMark, then it is using the cores. If it is not heating up, you can also disable the serial operations in the kernels (the gid == 0 paths), then try again. For example, a simple n-body simulator pushes a well-cooled HD 7000-series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it against a known benchmark's temperature limits.
Something similar exists for the CPU: using float4 generates more heat than plain floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone all the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop: more active cores mean more droop, and more load per core also means more droop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because such tools show 100% usage even when you are using 1% of the card's true potential.
Simply look at the temperature difference of the computer case between the starting state and the equilibrium state. If the delta-T is around 50 with OpenCL and 5 without it, OpenCL is parallelizing something, though you can't know how much.

Can one precompile OpenCL kernels for devices s/he doesn't have?

We are currently investigating switching from CUDA to OpenCL. I have pre-built the OpenCL kernels like you can in CUDA (using CL_PROGRAM_BINARIES). My quick question: is it possible to compile byte code for a device you don't have? (If I install the AMD driver, for example, can I then compile for a set of Radeon cards, despite us only having NVIDIA cards in house?)
I know this would be torturous to maintain and is not suggested; I just want to know whether it is currently even possible.
AFAIK AMD does support this, but as a counterexample, you can't install the NVIDIA GPU driver (which provides their OpenCL support) without an NVIDIA GPU, so for NVIDIA targets the physical hardware is needed. Honestly, this does not seem to be an intended use case. Instead, loading and saving binaries is meant for caching a kernel so that it only gets compiled the first time your app is run (and again after any hardware change or driver update).
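For context, the caching flow looks roughly like this (a sketch for the single-device case; error checking omitted):

    #include <CL/cl.h>
    #include <vector>

    // After clBuildProgram succeeds, fetch the device binary for caching on disk.
    std::vector<unsigned char> getProgramBinary(cl_program prog) {
        size_t binSize = 0;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
        std::vector<unsigned char> binary(binSize);
        unsigned char* ptr = binary.data();
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, NULL);
        return binary;
    }

    // On a later run, rebuild the program from the cached binary
    // (key the cache on device name and driver version).
    cl_program loadProgramBinary(cl_context ctx, cl_device_id dev,
                                 const std::vector<unsigned char>& binary) {
        const unsigned char* ptr = binary.data();
        size_t size = binary.size();
        cl_int binStatus, err;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                    &ptr, &binStatus, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return prog;
    }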

Cuda 4.0 vs 3.2

Is CUDA 4.0 faster than 3.2?
I am not interested in the additions of CUDA 4.0, but rather in knowing whether memory allocation and transfer will be faster if I use CUDA 4.0.
Thanks
Memory allocation and transfer depend more (if not exclusively) on the hardware capabilities (more efficient pipelines, cache sizes) than on the version of CUDA.
Even while on CUDA 3.2, you can install the CUDA 4.0 drivers (270.x); drivers are backward compatible. So you can test that separately, without re-compiling your application. It is true that there are driver-level optimizations that affect run-time performance.
While generally that has worked fine on Linux, I have noticed some hiccups on MacOSX.
Yes, I have a fairly substantial application which ran ~10% faster once I switched from 3.2 to 4.0. This is without any code changes to take advantage of new features.
I also have a GTX480 if that matters any.
Note that the performance gains may be due to the fact that I'm using a newer version of the dev drivers (installed automatically when you upgrade). I imagine NVIDIA may well be tweaking for CUDA performance the way they do for blockbuster games like Crysis.
Performance of memory allocation mostly depends on host platform (because the driver models differ) and driver implementation. For large amounts of device memory, allocation performance is unlikely to vary from one CUDA version to the next; for smaller amounts (say less than 128K), policy changes in the driver suballocator may affect performance.
For pinned memory, CUDA 4.0 is a special case because it introduced some major policy changes on UVA-capable systems. First of all, on initialization the driver does some huge virtual address reservations. Secondly, all pinned memory is portable, so must be mapped for every GPU in the system.
Performance of PCI Express transfers is mostly an artifact of the platform, and usually there is not much a developer can do to control it. (For small CUDA memcpy's, driver overhead may vary from one CUDA version to another.) One issue is that on systems with multiple I/O hubs, nonlocal DMA accesses go across the HT/QPI link and so are much slower. If you're targeting such systems, use NUMA APIs to steer memory allocations (and threads) onto the same CPU that the GPU is plugged into.
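A minimal sketch of that steering on Linux with libnuma (link with -lnuma; the node number is an assumption, since which node the GPU hangs off is system-specific):

    #include <numa.h>    // libnuma
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }
        const int gpu_node = 0;      // assumption: the GPU is attached to node 0
        numa_run_on_node(gpu_node);  // keep this thread on the GPU's socket

        // Allocate host memory on the same node so DMA stays local
        // instead of crossing the HT/QPI link.
        size_t size = 64u << 20;     // 64 MiB
        void* buf = numa_alloc_onnode(size, gpu_node);
        // ... pin it (e.g. with cudaHostRegister) and use it for transfers ...
        numa_free(buf, size);
        return 0;
    }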
The answer is yes, because CUDA 4.0 reduces system memory usage and CPU memcpy() overhead.

Is there any online compiler with executor that would compile apps that use GPU-specific C/C++ code?

Generally, I need some online compiler that can compile and execute a provided program and output its execution speed and other statistics. The whole program can be in one C file, and it would use whatever GPU C/C++ library is provided. I want to compile at least C code. Does any GPU vendor provide such a compiler? Actually, my problem is this: I have a powerful CPU and a weak GPU on my machine. I need to test some algorithms that are specific to GPUs and get statistics on their execution. I would like to test my programs any way possible, so if there is no such online GPU service, maybe there is an emulator that can output the time and other statistics I would get on some real GPU? (Meaning I would give it a program, it would execute it on my CPU, but somehow count the time as if it were running on some GPU.)
So, is it possible in any way to test GPU-specific programs without having a GPU card, meaning on emulation software or somewhere in an internet cloud?
Amazon EC2 recently added support for "GPU instances", which are normal HPC instances that come with two NVIDIA Tesla "Fermi" M2050 GPUs. You can SSH into these instances, install a compiler, and go to town with them.
It'll cost $2.10/hour (or $0.74/hour if you get a Reserved Instance for a longer block of time).
If it's an option at all, I'd strongly consider just getting the GPU card(s).
The low end of any given GPU family is usually pretty cheap, and you can make some reasonable performance extrapolations from that to the high end.
If you get the CUDA developer tools and SDK from NVIDIA, then you can build and run CUDA programs in emulation mode, where they just run on the host CPU instead of on the GPU. This is a great way to learn GPU programming basics before you start trying to get code to run on an actual GPU card.
UPDATE
Apparently emulation was removed in CUDA 3.1.
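For reference, in the toolkits that still shipped it, emulation mode was enabled through an nvcc flag; to the best of my recollection it looked like this (treat the exact spelling as an assumption):

    nvcc -deviceemu myprogram.cu -o myprogram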