Inconsistency between OpenGL and CUDA maximum number of threads

My GPU is NVIDIA GeForce GT440, whose compute capability version is 2.x. NVIDIA's official CUDA_C_Programming_Guide points out
Limit 1. Maximum number of threads per block = 1024
Limit 2. Maximum number of resident threads per multiprocessor = 1536
However, one of the OpenGL compute shader implementation limits is
Limit 3. GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS = 1536
My questions are
1. Why is Limit 1 not equal to Limit 2 and Limit 3?
2. Should the real threads/block (invocations/workgroup) be 1024 or 1536?

Why is Limit 1 not equal to Limit 2 and Limit 3?
Because they aren't the same thing. Blocks are a logical construct in CUDA and are limited to a maximum of 1024 threads. But a multiprocessor can run multiple blocks concurrently (up to 8 in the case of your hardware). So an SM can have up to 1536 concurrent threads on your hardware, but not all of those threads can come from a single block.
Should the real threads/block be 1024 or 1536?
1024 for all the reasons outlined above. You can see a complete summary of the capabilities of all supported hardware here.
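A minimal sketch of the arithmetic behind this answer, assuming only the compute capability 2.x limits quoted in this thread (1024 threads per block, 1536 resident threads per SM, 8 resident blocks per SM). The function name is hypothetical, and the model deliberately ignores register and shared memory limits.

```python
# Compute capability 2.x limits quoted in the question and answer above.
MAX_THREADS_PER_BLOCK = 1024   # Limit 1: a single block cannot be larger
MAX_THREADS_PER_SM = 1536      # Limit 2: resident threads on one SM
MAX_BLOCKS_PER_SM = 8          # concurrent blocks per SM on this hardware


def resident_blocks_per_sm(threads_per_block):
    """Blocks of the given size that can be resident on one SM at once."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size must be between 1 and 1024 threads")
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)


for size in (1024, 512, 192):
    blocks = resident_blocks_per_sm(size)
    # Columns: block size, resident blocks, resident threads on the SM
    print(size, blocks, blocks * size)
```

With 1024-thread blocks only one block fits, so the SM holds 1024 threads; reaching 1536 resident threads requires several smaller blocks.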

Related

C++ CUDA Gridsize meaning clarification

I am new to CUDA programming. I am currently in the process of doing Monte Carlo simulations on a large number of large data samples.
I'm trying to dynamically calculate and maximize the number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear on is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition was as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is what the grid size refers to:
The maximum total number of threads that can be submitted to the GPU?
(in which case I would calculate the number of blocks as MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK)
The maximum number of blocks of 1024 threads? (in which case I would simply use MAX_GRID_SIZE)
The last one seems kind of insane to me, since MAX_GRID_SIZE = 2^31-1 (2147483647), so the maximum number of threads would be (2^31-1)*1024 = ~2.2 trillion threads. That is why I tend to think the first option is correct, but I am looking for outside input.
I have found many discussions about calculating the number of blocks, but almost all of them were specific to one GPU rather than a general way of calculating it or thinking about it.
On Nvidia CUDA the grid size signifies the number of blocks (not the number of threads), which are sent to the GPU in one kernel invocation.
The maximum grid size can be and is huge, as the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps to run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks, the threads in a block can cooperate (especially with shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
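Since the grid size counts blocks, the usual way to size a launch is a ceiling division of the work items by the block size. The sketch below only does that host-side arithmetic; the function name and the default of 256 threads per block are assumptions, and the two limits are taken from the device query in the question.

```python
MAX_THREADS_PER_BLOCK = 1024    # "Maximum threads per block" from the device query
MAX_GRID_SIZE_X = 2147483647    # "Maximum Grid size" from the device query (in blocks)


def blocks_needed(num_items, threads_per_block=256):
    """Number of blocks so that every item gets one thread (ceiling division)."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 1024")
    blocks = (num_items + threads_per_block - 1) // threads_per_block
    if blocks > MAX_GRID_SIZE_X:
        # Too much work for one launch: loop inside the kernel or launch several times.
        raise ValueError("grid would exceed the maximum grid size")
    return blocks


print(blocks_needed(1_000_000))      # blocks of 256 threads covering 1,000,000 items
print(blocks_needed(1025, 1024))     # one extra block for the single leftover item
```

The last block is usually only partly used, so a kernel launched this way needs a bounds check on its global index.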
If you want to optimize the occupancy (parallel efficiency) of your GPU to the maximum, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM x the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation); compute capability 3.0 allows a maximum of 2048 resident threads per SM, although a single block is still limited to 1024 threads. So for maximum occupancy you can run a multiple of 7x2048 threads.
There are other limiting factors for the number of resident threads per multiprocessor, mainly the amount of registers and shared memory each of your threads or blocks needs. The GTX 670 has 48 KB shared memory per SM and 65536 32-bit registers per SM. So if your kernel uses up to 32 registers per thread, all 2048 threads can be resident; at 64 registers per thread, only 1024 threads (for example, one full-size block) fit on an SM.
Sometimes one runs kernels with fewer threads per block than the maximum, e.g. 256 threads per block. The GTX 670 can keep up to 16 blocks resident per SM at the same time, but you cannot get more than 2048 threads per SM altogether, so small blocks do not raise that ceiling.
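A rough sketch of the register limit described above, using the 65536 registers per SM and the 7 SMs mentioned in this thread. The helper name is hypothetical. It models registers only: the real bound is also capped by the hardware's resident-thread limit per SM and by shared memory, and it ignores register allocation granularity.

```python
REGISTERS_PER_SM = 65536   # 32-bit registers per SM, as stated above
NUM_SMS = 7                # "Multiprocessor Count" from the device query


def threads_allowed_by_registers(regs_per_thread, regs_per_sm=REGISTERS_PER_SM):
    """Upper bound on resident threads per SM imposed by register usage alone."""
    if regs_per_thread < 1:
        raise ValueError("a thread needs at least one register")
    return regs_per_sm // regs_per_thread


per_sm = threads_allowed_by_registers(64)
print(per_sm)              # register-limited threads on one SM at 64 registers/thread
print(per_sm * NUM_SMS)    # the same bound across all 7 SMs
```

Halving the register usage per thread doubles this bound, which is why the compiler's register count matters when tuning occupancy.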
To optimize the kernel itself, or to get graphical and numeric feedback about its efficiency and bottlenecks, use the NVIDIA Nsight Compute tool (if there is a version that still supports the compute capability 3.0 Kepler generation).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp are running in perfect lockstep. Additionally you should try to replace accesses to global memory with accesses to shared memory (be careful about bank conflicts).

OpenCL launch concurrent kernels

As far as I understand, to execute concurrent kernels (in my case the same kernel but with different I/O data), each must be launched on its own compute unit (streaming multiprocessor, SM), apparently with its own workgroups.
For example, the GTX 960M has 5 SMs (compute units in OpenCL). Will launching clEnqueueNDRangeKernel 5 times asynchronously and out of order, each with its own 16x16 (2D) workgroup, use all 5 compute units to execute them concurrently? The local memory reported is 64 KB. Is that for all compute units together, or does each one have its own 64 KB?
Each CU has either 4 (Maxwell/Pascal) or 2 (Turing/Ampere, AMD) Warps. A Warp is a group of 32 CUDA cores / stream processors in hardware.
All threads running in one Warp execute the same instruction at the same time. Threads within a Warp cannot follow different branches simultaneously (divergent branches are executed one after the other). Two Warps within a CU can handle different branches, but not different kernels at the same time.
If you execute two kernels in different queues in parallel on your 960m with 5 CUs, Kernel 1 can have for example 3 CUs and kernel 2 can have the remaining 2. But a CU cannot be split to run multiple kernels at the same time.
In OpenCL you can set the workgroup size to some multiple of the Warp size (32). Either 4 (workgroup size 32), 2 (workgroup size 64) or 1 (workgroup size 128 or greater) OpenCL workgroups can be executed at one moment on one Maxwell CU.
The amount of local memory, in your case 64 KB, is per CU. So if you have a large workgroup of, for example, 256 threads, each thread has less local memory available than with a workgroup size of 64, because all threads in the workgroup share the same local memory of the one CU they run on.
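To put numbers on the last point, here is a small sketch that divides the 64 KB reported in the question evenly across one workgroup. It assumes a single workgroup uses all of the CU's local memory, which is a simplification; the helper name is hypothetical.

```python
LOCAL_MEM_PER_CU = 64 * 1024   # bytes, the 64 KB reported in the question


def local_bytes_per_work_item(workgroup_size, local_mem=LOCAL_MEM_PER_CU):
    """Even share of local memory per work-item if one workgroup owns it all."""
    if workgroup_size < 1:
        raise ValueError("workgroup size must be positive")
    return local_mem // workgroup_size


for size in (64, 256):
    print(size, local_bytes_per_work_item(size))
```

Going from a workgroup of 64 to one of 256 cuts the even share per work-item by a factor of four.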

Calculating mflop/s of a HPC application using memory bandwidth info

I want to calculate the MFLOPS (millions of floating-point operations per second per processor) of an HPC application (NAS benchmark) without running the application. I have measured the memory bandwidth of each core of my system (a supercomputer) using the STREAM benchmark. I'm wondering how I can get the MFLOPS per processor of the application from the memory bandwidth of the cores.
My node has 64 GiB of memory (16 cores on 2 sockets) and 58 GiB/s aggregate bandwidth using all physical cores. The memory bandwidth of my cores varies from 2728.1204 MB/s to 10948.8962 MB/s for the Triad function, which must be because of the NUMA architecture.
Any help would be appreciated.
You can't estimate the MFLOPS/GFLOPS of a benchmark from STREAM memory bandwidth results alone. You need to know two more parameters: the peak MFLOPS/GFLOPS of your CPU core (better expressed as the maximum FLOPs per clock cycle for each variant of vector instructions, together with the CPU frequency limits: min, mean, max), and the GFLOPS/GByte ratio (flops-to-bytes ratio, arithmetic intensity) of every program you need to estimate (every NAS benchmark).
The STREAM benchmark has very low arithmetic intensity (0 DP = FP64 flops per two double operands = 2*8 bytes in Copy, 1 flop per 16 bytes in Scale, 1 flop per 24 bytes in Add, and 2 flops per 24 bytes in Triad). So the STREAM benchmark is limited by memory bandwidth in correct runs (and by cache bandwidth in incorrect runs). Many benchmarks may have higher arithmetic intensity.
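The flops-per-byte figures quoted for the four STREAM kernels can be written out as exact fractions. This is only a restatement of the numbers in the paragraph above (8-byte doubles); the dictionary and function names are made up for the illustration.

```python
from fractions import Fraction

# (flops per loop iteration, bytes moved per loop iteration) with 8-byte doubles
STREAM_KERNELS = {
    "Copy": (0, 16),    # c[i] = a[i]
    "Scale": (1, 16),   # b[i] = s * c[i]
    "Add": (1, 24),     # c[i] = a[i] + b[i]
    "Triad": (2, 24),   # a[i] = b[i] + s * c[i]
}


def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic intensity in flops per byte, kept exact as a fraction."""
    return Fraction(flops, bytes_moved)


for name, (flops, nbytes) in STREAM_KERNELS.items():
    ai = arithmetic_intensity(flops, nbytes)
    print(name, ai, float(ai))
```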
With this data (memory bandwidth, max gflops/GHz on different vectorization levels, normal/maximal/low frequency of cpu, arithmetic intensity of the test) you can start to use roofline performance model https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
With roofline, the x axis is flops/byte and the y axis is GFlop/s (both on a logarithmic scale). The line of the "roof" consists of two parts for every CPU (or machine).
The first part is inclined and corresponds to low arithmetic intensity. Applications in this part have to wait for data to be loaded from memory; they do not have enough data to operate on at the full GFlop/s speed of the CPU, so they are limited by memory. This line is defined by the STREAM benchmark.
The second part of the line is horizontal; it corresponds to higher intensity. Tasks here are not limited by memory bandwidth; they are limited by available FLOPS. On a modern CPU, all flops are available only with wide vector instructions (data-level parallelism), and not all tasks can use the widest vectors.
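The two parts of the roof combine into a single formula: attainable GFLOP/s = min(peak GFLOP/s, arithmetic intensity x memory bandwidth). The sketch below applies it with the 58 GiB/s aggregate bandwidth from the question. The peak value is a purely hypothetical placeholder, since the question does not give the CPU's peak, and the function name is an assumption.

```python
def roofline_gflops(intensity, bandwidth_gb_s, peak_gflops):
    """Attainable GFLOP/s under the basic roofline model."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


# 58 GiB/s aggregate STREAM bandwidth from the question, converted to GB/s.
BANDWIDTH_GB_S = 58 * 2**30 / 1e9
# HYPOTHETICAL peak for the node; replace with the real figure for your CPUs.
PEAK_GFLOPS = 300.0

# Intensity at which the inclined part meets the horizontal part of the roof.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GB_S

triad_bound = roofline_gflops(2 / 24, BANDWIDTH_GB_S, PEAK_GFLOPS)
print(f"ridge point: {ridge_point:.2f} flops/byte")
print(f"Triad-like code (2 flops / 24 bytes): {triad_bound:.2f} GFLOP/s")
```

Any code whose intensity is below the ridge point sits on the inclined part and is bounded by bandwidth; above it, the hypothetical peak is the bound.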

Cuda block or thread preference

The algorithm that I'm implementing has a number of things that need to be done in parallel. My question is: if I'm not going to use shared memory, should I prefer, for performance, more blocks with fewer threads per block or more threads per block with fewer blocks, so that the total number of threads adds up to the number of parallel things I need to do?
I assume the "set number of things" is a small number or you wouldn't be asking this question. Attempting to expose more parallelism might be time well spent.
CUDA GPUs group execution activity and the resultant memory accesses into warps of 32 threads. So at a minimum, you'll want to start by creating at least one warp per threadblock.
You'll then want to create at least as many threadblocks as you have SMs in your GPU. If you have 4 SMs, then your next scaling increment above 32 would be to create 4 threadblocks of 32 threads each.
If you have more than 128 "number of things" in this hypothetical example, then you will probably want to increase both warps per threadblock as well as threadblocks. You might start with threadblocks until you get to some number, perhaps around 16 or so, that would allow your code to scale up on GPUs larger than your hypothetical 4-SM GPU. But there are limits to the number of threadblocks that can be open on a single SM, so pretty quickly after 16 or so threadblocks you'll also want to increase the number of warps per threadblock beyond 1 (i.e. beyond 32 threads).
These strategies for small problems will allow you to take advantage of all the hardware on the GPU as quickly as possible as your problem scales up, while still allowing opportunities for latency hiding if your problem is large enough (e.g. more than one warp per threadblock, or more than one threadblock resident per SM).
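One possible reading of this scaling strategy, written as host-side arithmetic only. The cap of 16 blocks comes from the "around 16 or so" remark above and is interpreted here as a total block count; the doubling rule, the 1024-thread ceiling, and the function name are assumptions made for the sketch, not a prescribed method.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024   # assumed per-block limit


def pick_launch_config(num_items, block_cap=16):
    """Return (blocks, threads_per_block): grow blocks first, then block size."""
    threads = WARP_SIZE                       # start with one warp per block
    blocks = -(-num_items // threads)         # ceiling division
    # Once the block count passes the cap, add warps to each block instead.
    while blocks > block_cap and threads < MAX_THREADS_PER_BLOCK:
        threads *= 2
        blocks = -(-num_items // threads)
    return blocks, threads


for n in (128, 1000, 100000):
    print(n, pick_launch_config(n))
```

For 128 items this gives 4 blocks of 32 threads, matching the 4-SM example above; larger problems end up with bigger blocks and, eventually, more than 16 blocks.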

CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or the execution unit for the next instruction is busy. On each cycle, each warp scheduler will pick a warp from the list of eligible warps and issue 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease performance.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX480 (CC 2.0 device) a good starting point when designing a kernel is 50-66% Theoretical Occupancy. CC 2.0 SM can have a maximum of 48 warps. A 50% occupancy means 24 warps or 768 threads per SM.
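The occupancy figures in this paragraph follow from two constants: 48 warps per SM on a CC 2.0 device and 32 threads per warp. A short sketch of that arithmetic, with hypothetical helper names:

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48   # CC 2.0, as stated above


def warps_for_occupancy(occupancy):
    """Active warps per SM needed for a target theoretical occupancy (0..1)."""
    return int(occupancy * MAX_WARPS_PER_SM)


def theoretical_occupancy(active_warps):
    """Ratio of active warps to the maximum warps the SM supports."""
    return active_warps / MAX_WARPS_PER_SM


warps = warps_for_occupancy(0.5)
print(warps, warps * WARP_SIZE)            # warps and threads per SM at 50%
print(round(theoretical_occupancy(32), 3)) # occupancy reached with 32 active warps
```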
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
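As a back-of-the-envelope check of this pipeline argument, multiply the 480 cores from the question by the roughly 20-clock pipeline depth quoted here. Both numbers come from this thread; the variable names are illustrative, and the result is only an order-of-magnitude estimate.

```python
CUDA_CORES = 480        # 15 SMs * 32 SPs, from the question
CORES_PER_SM = 32
PIPELINE_DEPTH = 20     # approximate Fermi pipeline length quoted above
WARP_SIZE = 32

threads_to_fill_gpu = CUDA_CORES * PIPELINE_DEPTH
threads_to_fill_sm = CORES_PER_SM * PIPELINE_DEPTH
warps_to_fill_sm = threads_to_fill_sm // WARP_SIZE

print(threads_to_fill_gpu)   # threads in flight to keep every pipeline stage busy
print(threads_to_fill_sm, warps_to_fill_sm)
```

This is far more than one thread per core, which is consistent with performance improving well beyond 32 threads per block.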
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache. Most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the greater the chance that one of them will be runnable at any given time.
CUDA also has zero-cost thread switching to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead when switching from one thread to the next, because it needs to store all of the register values of the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs just have tons and tons of registers, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, each SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.