How is an if statement executed on NVIDIA GPUs? - c++

As far as I know, GPU cores are very simple and can only execute basic mathematical instructions.
If I have a kernel with an if statement, then what executes that if statement? The FP32, FP64 and INT32 units can only execute operations on floats, doubles and integers, not a COMPARE instruction, or am I wrong? And what happens if I have a printf call in a kernel? Who executes that?

Compare instructions are arithmetic instructions: a comparison can be implemented as a subtraction plus a flag register, and GPGPUs do have them.
But they are often not advertised as much as the number-crunching capability of the whole GPU.
NVIDIA doesn't publish machine-code documentation for its GPUs, nor the ISA of the corresponding assembly language (called SASS).
Instead, NVIDIA maintains the PTX language which is designed to be more portable across different generations while still being very close to the actual machine code.
PTX is a predicated architecture. The setp instruction (which, again, is essentially a subtraction with a few caveats) sets the value of the specified predicate registers, and those registers are used to conditionally execute other instructions, including the bra instruction, which is a branch; this makes conditional branches possible.
One could argue that PTX is not SASS, but this predicated approach does seem to be what NVIDIA GPUs, at least, used to do.
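To make this concrete, here is a minimal hypothetical CUDA C++ kernel with a branch (the kernel and its names are made up for illustration):

__global__ void clamp_negative(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f)   // both comparisons are "just arithmetic"
        x[i] = 0.0f;
}

Compiling it with nvcc -ptx typically shows the comparisons lowered to setp instructions that write a predicate register, and the branch emitted as a bra guarded by that predicate (or the guarded instructions simply executed under the predicate); the exact output depends on the compiler version and optimization level.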
AMD GPUs seem to use the traditional approach to branching: there are comparison instructions (e.g. S_CMP_EQ_U64) and conditional branches (e.g. S_CBRANCH_SCCZ).
Intel GPUs also rely on predication but have different instructions for divergent vs non-divergent branches.
So GPGPUs do have branch instructions, in fact, their SIMT model has to deal with the branch divergence problem.
Before c. 2006 GPUs were not fully programmable and programmers had to rely on other tricks (like data masking or branchless code) to implement their kernels.
Keep in mind that at the time it was not widely accepted that one could execute arbitrary programs or create arbitrary shading effects on a GPU. GPUs relaxed their programming constraints over time.
Putting a printf in a CUDA kernel probably won't work, because there is no C runtime on the GPU (remember the GPU is an entirely different executor from the CPU) and, I would guess, the linking would fail.
You could theoretically force a GPU implementation of the CRT and design a mechanism to issue syscalls from GPU code, but that would be unimaginably slow, since GPUs are not designed for this kind of work.
EDIT: Apparently NVIDIA actually did implement a printf on the GPU that prints to a buffer shared with the host.
The problem here is not the presence of branches but the very nature of printf.
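For completeness, a minimal sketch of how that device-side printf is used (it has been available since compute capability 2.0; the host must synchronize so the device-side buffer gets flushed and printed):

#include <cstdio>

__global__ void hello()
{
    // Each thread appends a line to the device-side printf buffer.
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();   // flushes the buffer to the host's stdout
    return 0;
}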

Related

Can the compiler optimize unrelated commands to be executed on different cores?

The compiler can change the order of unrelated commands as an optimization.
Can it also silently optimize them to be executed on different cores?
For example:
...
for (...)
{
//...
int a = a1+a2;
int b = b1+b2;
int c = c1+c2;
int d = d1+d2;
//...
}
...
Could it happen that, as an optimization, not only the order of execution is changed but also the number of cores used? Does the standard place any restrictions on the compiler here?
UPD: I'm not asking how to parallelize the code; I'm asking whether, if it was not parallelized explicitly, it can still be parallelized by the compiler.
There is more to this than meets the eye. Most likely the instructions (in your example) will end up being run in parallel, but not in the way you think.
There are many levels of hardware parallelism in a CPU, multiple cores being just the highest one 1). Inside a CPU core there are other levels of hardware parallelization that are mostly transparent 2) (you don't control them via software and you don't actually see them, only perhaps their side effects sometimes). Pipelines, extra bus lanes, and multiple ALUs (Arithmetic Logic Units) and FPUs (Floating Point Units) per core are some of them.
Different stages of your instructions will be run in parallel in the pipeline (modern x86 processors have over a dozen pipeline stages), and different instructions may run in parallel in different ALUs (modern x86 CPUs have around 5 ALUs per core).
All this happens without the compiler doing anything 2), and it's free (given the hardware; it was not free to add these capabilities to the hardware). Executing the instructions on different cores is not free: creating threads is costly, moving the data so it is available to the other cores is costly, and synchronizing to wait for the execution on the other cores is costly. There is a lot of overhead associated with creating and synchronizing threads, and it is simply not worth it for small instructions like these. The cases that would really benefit from multi-threading would require an analysis that is far too complicated today, so it is practically not feasible. Someday in the future we will have compilers that can recognize that your serial algorithm is actually a sort and parallelize it efficiently and correctly. Until then we have to rely on language support, library support and/or developer support for parallelizing algorithms.
1) well, actually hyper-threading is.
2) As pointed out by MSalters:
modern compilers are very much aware of the various ALU's and will do work to benefit from them. In particular, register assignments are optimized so you don't have ALU's compete for the same register, something which may not be apparent from the abstract sequential model.
All of this indirectly influences the execution to benefit from the hardware architecture; there are no explicit instructions or declarations.
Yes, the compiler can do things in any order (including not doing them at all), so long as the observable behaviour generated matches what the observable behaviour of the code should be. Assembly instructions, runtime, thread count, etc. are not observable behaviour.
I should add that it's unlikely a compiler would decide to do this without explicit instruction from the programmer to do so; even though the standard allows it, the compiler exists to help the programmer, and randomly starting extra threads would be unexpected in many cases.
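To make the contrast concrete, here is a minimal sketch with made-up values (not the asker's loop): the hardware is free to overlap the two independent additions through pipelining and multiple ALUs on a single core, but running them on different cores only happens when the programmer asks for it, for example with std::thread:

#include <iostream>
#include <thread>

int main()
{
    int a1 = 1, a2 = 2, b1 = 3, b2 = 4;
    int a = 0, b = 0;

    // Independent work, safe to run concurrently, but far too small for
    // the thread-creation and synchronization overhead to ever pay off.
    std::thread t1([&] { a = a1 + a2; });
    std::thread t2([&] { b = b1 + b2; });
    t1.join();
    t2.join();

    std::cout << a << " " << b << "\n";
}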

How can I break down the memory-only time and computation-only time for a program on Xeon Phi?

Modern processors overlap memory accesses with computations. I want to study this overlap on the Intel Xeon Phi. A conventional way to do so is to modify the code and make two versions: memory-only and computation-only, like the approach used in these slides for GPUs: http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010.pdf.
However, my program has complex control flow and data dependencies. It's very hard for me to make two such versions.
Is there any convenient way to measure this overlap? I'm considering the VTune profiler, but I'm still not sure which HW counters I should look at.
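For readers unfamiliar with that technique, here is a hypothetical sketch (not the asker's program) of what the split looks like on a trivial loop; the difficulty described in the question is doing this when the control flow and data dependencies are complex:

#include <vector>

// Original loop: memory traffic and arithmetic are mixed.
void original(float a, const std::vector<float>& x, std::vector<float>& y)
{
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}

// Memory-only version: keep the loads and stores, drop the arithmetic.
void memory_only(const std::vector<float>& x, std::vector<float>& y)
{
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = x[i];
}

// Compute-only version: keep the arithmetic but work on registers,
// storing once at the end so the compiler cannot remove the loop.
void compute_only(float a, const std::vector<float>& x, std::vector<float>& y)
{
    float xr = x[0], yr = y[0];
    for (std::size_t i = 0; i < y.size(); ++i)
        yr = a * xr + yr;
    y[0] = yr;
}

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    original(2.0f, x, y);      // in a real measurement, time each version separately
    memory_only(x, y);
    compute_only(2.0f, x, y);
}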

Applications well suited for Xeon-phi many-core architecture

From this https://software.intel.com/en-us/videos/purpose-of-the-mic-architecture I understand that applications with complex or numerous random memory accesses are not well suited for the Intel Xeon Phi. This is because the architecture uses 61 cores and 8 memory controllers, and in case of L1 and L2 cache misses it can take hundreds of cycles to fetch the line from memory and get it ready for use by the CPU. Such applications are called latency-bound.
Then, the tutorial mentions that the many-core architecture (the Xeon Phi coprocessor only) is well suited for highly parallel homogeneous code. Two questions from there:
What is referred to as homogeneous code?
What are real-world applications which can fully benefit from the MIC architecture?
I see the Intel MIC architecture as an "x86-based GPGPU", and if you are familiar with the concept of GPGPU you will find yourself familiar with the Intel MIC.
A homogeneous cluster is a system infrastructure with multiple execution units (i.e. CPUs) that all have the same features. For example, a multiprocessor system with four Intel Xeon processors is homogeneous.
A heterogeneous cluster is a system infrastructure with multiple execution units with different features (e.g. a CPU and a GPU). For example, my Lenovo Z510 with its Intel i7 Haswell (4 cores), its Nvidia GT740M (GPU) and its Intel HD Graphics 4600 (GPU) is a heterogeneous system.
An example of heterogeneous code could be a video game.
A video game has control code, executed by one core of one CPU, that controls what the other agents do; it sends shaders to execute on the GPUs, physics computations to be performed on other cores or GPUs, and so on.
In this example you need to write code that runs on the CPU (so it is "CPU aware") and code that runs on the GPU (so it is "GPU aware"). This is actually done by using different tools, different programming languages and different programming models!
Homogeneous code is code that doesn't need to be aware of n different programming models, one for each kind of agent. It is just the same programming model, language and tools.
Take a look at this very simple sample code for the MPI library.
The code is all written in C; it is the same program that just takes a different flow.
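The linked sample is not reproduced here, but a generic minimal MPI program in the same spirit shows the point: every process runs the same binary, and the flow diverges only on the process rank.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes are there?

    if (rank == 0)
        printf("rank 0: coordinating %d processes\n", size);
    else
        printf("rank %d: doing my share of the work\n", rank);

    MPI_Finalize();
    return 0;
}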
About applications: well, that's really a broad question...
As said above, I see the Intel MIC as a GPGPU based on the x86 ISA (part of it, at least).
An SDK particularly useful (and listed in the video you linked) for working with clustered systems is OpenCL; it can be used for fast processing of images and computer vision, and basically for anything that needs the same algorithm to be run billions of times with different inputs (like cryptography applications / brute forcing).
If you search for OpenCL-based projects on the web you will get an idea.
To answer your second question, it is better to ask ourselves "What could not take advantage of the MIC architecture?", and we will soon find that the further an algorithm is from the concept of stream processing and its related topics, including that of a kernel, the less suitable it is for the MIC.
First, a straightforward answer to your direct question: to get the most out of the coprocessor, your code should be able to use a large number of threads and should vectorize. How many threads? Well, you have 60 cores (+/- depending on which version you get) and 4 threads per core, with a sweet spot around 2 threads per core on many codes. Sometimes you can get good performance even if you don't use every single core. But vectorization is extremely important; the long (512-bit) vectors are a big source of speed on the coprocessor.
Now, on to programming. The Intel Xeon Phi coprocessor uses two different kinds of programming - offload and native.
In the offload model, you write a program, determine which parts of that code have enough parallelism to make use of the large number of cores on the coprocessor and mark those sections with offload directives. Then inside those offloaded sections, you write the code using some form of parallelism, like OpenMP. (Heterogeneous)
In native code, you do not use any offload directives but, instead, compile with the -mmic option. Then you run the code directly on the coprocessor. The code you write will use some form of parallelism, like OpenMP, to make use of the large number of cores the coprocessor has. (Homogeneous)
Another variation on these programming models is to use MPI, often in addition to OpenMP. You can use the offload programming model, in which case the nodes in your MPI system will be the host nodes in your system. (Hybrid) Alternately, you can use the native programming model, in which case you treat the coprocessor as just another node in your system. (Heterogeneous if hosts and coprocessors are both nodes; homogeneous if only coprocessors are used.)
You may have noticed that nothing I have said implies a separate programming style for the host and coprocessor. There are some optimizations you can make that will keep code written for the coprocessor from running on the processor as well but, in general, the code you write for the coprocessor can also be compiled for and run on the host by just changing the compiler options.
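A hedged sketch of the offload model, assuming the Intel compiler's #pragma offload extension and OpenMP (the function and variable names are made up for illustration):

#include <cstddef>

void scale(float* data, std::size_t n, float factor)
{
    // Offload model: copy data to the coprocessor, run the parallel loop
    // there, then copy the result back.
    #pragma offload target(mic) inout(data : length(n))
    {
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= factor;   // many threads, each feeding the 512-bit vector units
    }
}

The inner omp parallel for simd loop is exactly the kind of code that, compiled with -mmic instead, runs natively on the coprocessor without any offload directive.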
As far as real world apps, see https://software.intel.com/en-us/mic-developer/app-catalogs

How to use the GPUs of a dual AMD FirePro D300 in my C++ calculations on macOS?

I have a Mac Pro with dual AMD FirePro D300 GPUs in it, and I want to use those GPUs to speed up my calculations in C++ on macOS.
Can you provide me with some useful information on this subject? I need to boost my C++ calculations on my Mac Pro. This is my C++ code; I can change it as needed to achieve the acceleration. But what should I read first to use the GPU of the AMD FirePro D300 on macOS? What should I know before I start to learn this hard work?
Before starting what you call the hard work, you should know the basic concept of using a GPU as opposed to a CPU. In a very abstract way, I will try to give this concept.
Programming is giving data and instructions to a processor, so the processor will work on your data with those instructions.
If you give one instruction and some data to a CPU, the CPU will work on your data step by step, one piece after another. For example, the CPU will execute the same instruction on each element of an array in a loop.
In a GPU you have hundreds of little CPUs that execute one instruction concurrently. Again, as an example, if you have the same array of data and the same instruction, the GPU will take your array, split it between those little CPUs, and execute your instruction on all the data concurrently.
A CPU is really fast at executing a single instruction stream.
One thread of a GPU is much slower at it. (Like comparing a Ferrari to a bus.)
What I am implying is that you will see the benefits of the GPU only if you have a huge amount of independent calculations to do in parallel.
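Since the machine in question exposes its FirePros through OpenCL, which Apple ships with the OS, that is the natural place to start reading. Below is a minimal, hedged sketch of the whole round trip in C++; the "square" kernel, the sizes and the names are placeholders, and error handling is omitted:

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <cstdio>
#include <vector>

// OpenCL C kernel source: one work item squares one element.
static const char* kSource =
"__kernel void square(__global const float* in, __global float* out) {\n"
"    size_t i = get_global_id(0);\n"
"    out[i] = in[i] * in[i];\n"
"}\n";

int main()
{
    const size_t n = 1 << 20;
    std::vector<float> in(n, 3.0f), out(n, 0.0f);

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_int err = 0;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "square", &err);

    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), in.data(), &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);

    size_t global = n;   // one work item per array element
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), out.data(), 0, nullptr, nullptr);

    printf("out[0] = %f\n", out[0]);   // expect 9.0

    clReleaseMemObject(d_in);  clReleaseMemObject(d_out);
    clReleaseKernel(kernel);   clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}

The key mental shift is exactly the one described above: the kernel is written once for a single element, and the runtime launches it over the whole index range.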

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which simply squares the input integers.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that the work groups fit on the compute units and OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully this is enough to force the scheduler to execute the maximum number of kernels + work items as efficiently as possible, making use of the available cores / processors.
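For what it's worth, a small sketch of that sizing, assuming a cl_device_id named device has already been obtained (the helper name and the oversubscription factor are made up):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

// Pick a global work size that oversubscribes the device, as described above.
size_t pick_amount_of_work(cl_device_id device)
{
    cl_uint max_compute_units = 0;
    size_t  max_work_group_size = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(max_compute_units), &max_compute_units, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_work_group_size), &max_work_group_size, nullptr);

    // A scalar multiple of MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, as in the question.
    return 8 * max_compute_units * max_work_group_size;
}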
For a CPU, you can use cpuid, sched_getcpu, or GetProcessorNumber to check which core / processor the current thread is executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C built-in function... or perhaps do the vendors' compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetProcessorNumber for GPUs, for core usage monitoring, etc.? Perhaps something vendor or architecture specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Capeverde architecture, 7700M series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (like the manuals that exist for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch the heat rising. If it heats up like FurMark, then it is using the cores. If it is not heating up, you can also disable the serial operations in the kernels (the work guarded by gid == 0) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it to a known benchmark's temperature limits.
A similar thing exists for the CPU: using float4 heats it more than simple floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead: more load means more voltage drop; more cores means more drop, and more load per core also means more drop.
Whatever you do, it's up to the compiler and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful because they show 100% usage even when you use 1% of the card's true potential.
Simply look at the temperature difference of the computer case between its starting state and its equilibrium state. If the delta-T is something like 50 with OpenCL and 5 without, then OpenCL is parallelising the work, though you can't know by how much.