Calculating MFLOP/s of an HPC application using memory bandwidth info - Fortran

I want to calculate the MFLOPS (millions of floating-point operations per second per processor) of an HPC application (a NAS benchmark) without running the application. I have measured the memory bandwidth of each core of my system (a supercomputer) using the STREAM benchmark. I'm wondering how I can get the MFLOPS per processor of the application from the memory bandwidth info of the cores.
My node has 64 GiB of memory (16 cores across 2 sockets) and 58 GiB/s aggregate bandwidth when using all physical cores. The memory bandwidth of my cores varies from 2728.1204 MB/s to 10948.8962 MB/s for the Triad function, which must be due to the NUMA architecture.
Any help would be appreciated.

You can't estimate the MFLOPS/GFLOPS of a benchmark from STREAM memory bandwidth results alone. You need to know two more parameters: the peak MFLOPS/GFLOPS of your CPU core (better expressed as the maximum FLOP operations per clock cycle for each variant of vector instructions, together with the CPU frequency limits: min, typical, max), and the GFLOP/GByte ratio (the flops-to-bytes ratio, i.e. the arithmetic intensity) of every program you need to estimate (every NAS benchmark).
The STREAM benchmark has very low arithmetic intensity (0 DP=FP64 flops per two double operands = 2*8 bytes in Copy, 1 flop per 16 bytes in Scale, 1 flop per 24 bytes in Add and 2 flops per 24 bytes in Triad). So the STREAM benchmark is limited by memory bandwidth in correct runs (and by cache bandwidth in incorrect runs where the arrays fit in cache). Many benchmarks have much higher arithmetic intensity.
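To see where those flops-per-byte numbers come from, the Triad kernel is essentially the following loop (a sketch of what STREAM measures, not the benchmark's actual Fortran/C source):

    #include <cstddef>

    // STREAM Triad body (sketch): per iteration, 1 multiply + 1 add = 2 flops,
    // while reading b[i] and c[i] and writing a[i] = 3 * 8 bytes = 24 bytes of traffic.
    void triad(double *a, const double *b, const double *c, double q, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + q * c[i];
    }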
With this data (memory bandwidth, max GFLOPS per GHz at different vectorization levels, the nominal/maximum/minimum CPU frequency, and the arithmetic intensity of the test) you can start to use the Roofline performance model: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
In the Roofline plot the x axis is flops/byte and the y axis is GFLOP/s (both on a logarithmic scale). The line of the "roof" consists of two parts for every CPU (or machine).
The first part is the sloped one and corresponds to low arithmetic intensity. Applications in this region have to wait for data to be loaded from memory; they never have enough data to operate on at the full GFLOP/s speed of the CPU, so they are limited by memory. This part of the line is defined by the STREAM bandwidth.
The second part of the line is flat; it corresponds to higher arithmetic intensity. Tasks here are not limited by memory bandwidth, they are limited by the available FLOPS. On a modern CPU the full FLOPS are only reachable with wide vector (SIMD) instructions, and not every task can use the widest vectors, so there may be several compute roofs (scalar, SSE, AVX, FMA).
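Once you have those numbers, the estimate itself is just the minimum of the two roofs. A minimal sketch (the peak GFLOP/s below is a placeholder assumption; the bandwidth is the best per-core Triad figure from the question):

    #include <algorithm>
    #include <cstdio>

    // Roofline estimate: attainable GFLOP/s = min(compute roof, bandwidth * arithmetic intensity).
    double roofline_gflops(double peak_gflops, double bandwidth_gbs, double flops_per_byte) {
        return std::min(peak_gflops, bandwidth_gbs * flops_per_byte);
    }

    int main() {
        double peak_gflops   = 16.0;        // placeholder: per-core peak, e.g. 8 FLOP/cycle * 2 GHz
        double bandwidth_gbs = 10.9;        // best per-core STREAM Triad result, ~10.9 GB/s
        double ai_triad      = 2.0 / 24.0;  // Triad: 2 flops per 24 bytes
        std::printf("Triad roofline estimate: %.2f GFLOP/s\n",
                    roofline_gflops(peak_gflops, bandwidth_gbs, ai_triad));
        return 0;
    }

For each NAS kernel you would substitute its own arithmetic intensity and the bandwidth of the cores it actually runs on (which, as your Triad numbers show, varies strongly across the NUMA domains).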

C++ CUDA Gridsize meaning clarification

I am new to CUDA programming. I am currently in the process of doing Monte Carlo simulations on a high number of large data samples.
I'm trying to dynamically calculate and maximize the number of blocks to submit to the GPU. The issue I have is that I am unclear on how to calculate the maximum number of blocks I can submit to my GPU at one time.
Here is the output of my GPU when querying it:
-----------------------------------------------
CUDA Device #: 0
Name: NVIDIA GeForce GTX 670
Revision number: 3.0
Warp size: 32
Maximum threads per block: 1024
Maximum Grid size: 2147483647
Multiprocessor Count: 7
-----------------------------------------------
What I am unclear on is that the maximum number of threads per block is clearly defined as 1024, but the grid size is not (at least to me). When I looked around in the documentation and online, the definition is as follows:
int cudaDeviceProp::maxGridSize[3] [inherited]
Maximum size of each dimension of a grid
What I want to know is what the grid size is referring to:
The maximum total number of threads that can be submitted to the GPU?
(therefore I would calculate the number of blocks like so: MAX_GRID_SIZE / MAX_THREAD_PER_BLOCK)
The maximum number of blocks of 1024 threads (therefore I would simply use MAX_GRID_SIZE)
The last one seems kind of insane to me, since MAX_GRID_SIZE = 2^31-1 (2147483647), so the maximum number of threads would be (2^31-1)*1024 ≈ 2.2 trillion threads. Which is why I tend to think the first option is correct. But I am looking for outside input.
I have found many discussions about calculating the number of blocks, but almost all of them were specific to one GPU and not about the general way of calculating or thinking about it.
On Nvidia CUDA the grid size signifies the number of blocks (not the number of threads), which are sent to the GPU in one kernel invocation.
The maximum grid size can be and is huge, as the CUDA programming model does not (normally) give any guarantee that blocks run at the same time. This helps to run the same kernels on low-end and high-end hardware of different generations. So the grid is for independent tasks, the threads in a block can cooperate (especially with shared memory and synchronization barriers).
So a very large grid is more or less the same as an automatic loop around your kernel invocation or within your kernel around your code.
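One common pattern that makes this concrete is a grid-stride loop: launch however many blocks you like and let each thread loop over the data. A minimal sketch (the kernel name, scaling factor and problem size are placeholders):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *data, float factor, size_t n) {
        // Grid-stride loop: each thread strides by the total number of threads in the grid,
        // so any grid size covers any n.
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] *= factor;
    }

    int main() {
        size_t n = 1 << 24;                              // placeholder problem size
        float *d = nullptr;
        cudaMalloc((void **)&d, n * sizeof(float));
        int threads = 256;
        int blocks = (int)((n + threads - 1) / threads); // enough blocks to cover n...
        scale<<<blocks, threads>>>(d, 2.0f, n);          // ...but a smaller grid would also work
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }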
If you want to optimize the occupancy (parallel efficiency) of your GPU to the maximum, you should calculate how many threads can run at the same time.
The typical maximum is the maximum number of resident threads per SM x the number of SMs. The GTX 670 has 7 SMs (called SMX for that generation), each of which can hold up to 2048 resident threads (the per-block limit is 1024). So for maximum occupancy you can run a multiple of 7x2048 threads.
There are other factors limiting the number of resident threads per multiprocessor, mainly the amount of registers and shared memory each of your threads or blocks needs. The GTX 670 has 48 KB of shared memory per SM and 65536 32-bit registers per SM. So if you limit your threads to 64 registers per thread, a block of 1024 threads still fits in the register file; to keep the full 2048 threads resident you would need at most 32 registers per thread.
Sometimes one runs kernels with fewer threads per block than the maximum, e.g. 256 threads per block. The GTX 670 can run up to 16 blocks per SM at the same time, but you still cannot exceed the resident-thread limit per SM altogether, so nothing is gained beyond that.
To optimize your kernel itself, or to get nice graphical and numeric feedback about the efficiency and bottlenecks of your kernel, use the Nvidia Nsight Compute tool (if there is a version that still supports the 3.0 Kepler generation).
To get full speed, it is typically important to optimize memory accesses (coalescing) and to make sure that the 32 threads within a warp are running in perfect lockstep. Additionally you should try to replace accesses to global memory with accesses to shared memory (be careful about bank conflicts).
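If you would rather not hard-code these per-SM limits, the CUDA occupancy API can compute them for your specific kernel (a sketch with a placeholder kernel; the real kernel's register and shared-memory usage is what determines the result):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void dummy(float *p) { if (p) p[threadIdx.x] = 0.f; }  // placeholder kernel

    int main() {
        int device = 0, smCount = 0, maxBlocksPerSM = 0, threads = 256;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
        // How many 256-thread blocks of this kernel can be resident on one SM,
        // given its register and shared-memory usage?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, dummy, threads, 0);
        std::printf("Up to %d resident blocks per SM -> launch at least %d blocks to fill the GPU\n",
                    maxBlocksPerSM, maxBlocksPerSM * smCount);
        return 0;
    }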

What weights for a multi-part benchmark?

I'm writing a benchmark for a school project. It's very simple but I am wondering, in real life, what are the typical weights used for the various types of benchmarks? For instance, if I am combining an integer test, a cache test, a floating point test, should they be equally weighted in the final "score"? My hunch is that for many things, the cache test matters more than raw arithmetic, and that for many things, the RAM speed is a big factor. Is there a consensus?
There is no universal set of weights.
Different real-world workloads have different bottlenecks, or different weightings.
There is no single number that can tell you how fast a computer is. It's possible (and happens in real life) that program X runs faster on computer A than on computer B, while program Y runs faster on computer B.
Choosing a set of weights for microbenchmarks totally comes down to what you want your number to mean, and what kind of workload you want it to be a rough indicator for.
e.g. a dense matmul can usually saturate FMA execution unit throughput because it does O(N^3) work over N^2 data. With careful cache-blocking you can get mostly L1d cache hits, and avoid doing more than 1 SIMD vector load per FMA. DRAM / cache bandwidth has to be high enough to keep up, but most of the stores/reloads hit in L1d cache (which of course also has to be able to keep up).
But other workloads might bottleneck on memory bandwidth or latency and not care about FPU throughput at all. e.g. AMD Ryzen 1 can do 1x 128-bit FMA per clock while Intel Haswell and later can do 2x 256-bit FMA per clock. But Ryzen is faster or nearly equal clock-for-clock for some other workloads.
And on multi-core systems some programs are single-threaded and care only about single-core throughput, while others scale well and get a big speedup on a machine with lots of slower cores. Or they might care about inter-core latency vs. aggregate memory bandwidth.
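If you do end up publishing a single number, one possible (by no means canonical) way to combine sub-scores is a weighted geometric mean of speedups over a reference machine, similar in spirit to how SPEC aggregates its ratios. The weights and scores below are made-up placeholders:

    #include <cmath>
    #include <cstdio>

    // Weighted geometric mean of per-test speedups relative to a reference machine.
    double combined_score(const double *speedup, const double *weight, int n) {
        double logSum = 0.0, weightSum = 0.0;
        for (int i = 0; i < n; ++i) {
            logSum    += weight[i] * std::log(speedup[i]);
            weightSum += weight[i];
        }
        return std::exp(logSum / weightSum);
    }

    int main() {
        double speedup[3] = {1.8, 1.1, 0.9};  // integer, cache, floating-point sub-tests (placeholders)
        double weight[3]  = {1.0, 2.0, 1.0};  // e.g. weight the cache test twice as heavily
        std::printf("combined score: %.3f\n", combined_score(speedup, weight, 3));
        return 0;
    }

Whatever weights you pick, report the individual sub-scores too, since a single number hides exactly the workload-dependent differences described above.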

Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA
for the Skylake architecture the instruction has Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
So as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1 500 000 000 instructions in one second.
Is it right? If so, what could be the reason that I am observing a slightly higher performance?
A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion per second. You said you are only able to achieve a throughput that is slightly larger than 1.5 billion. This can happen due to one or more of the following reasons:
The frontend is delivering fewer than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: fewer than 2 FMAs are ready in the RS every single cycle. Latency comes into play when there are dependencies.
Some of the FMAs use memory operands; if these are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained.
The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
Other reasons depending on how exactly your loop works and how you are measuring throughput.
Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each others reciprocal, the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.
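A minimal sketch of what exposing that parallelism looks like in code (scalar std::fma for simplicity rather than the _mm256_fmadd_ps intrinsic; the same reasoning applies per vector lane):

    #include <cmath>
    #include <cstddef>

    // Latency-bound: every FMA depends on the previous one, so at best one FMA
    // finishes every 4 cycles, no matter what the 0.5 CPI throughput allows.
    float dot_one_chain(const float *a, const float *b, std::size_t n) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            acc = std::fma(a[i], b[i], acc);
        return acc;
    }

    // Throughput-bound: 8 independent accumulator chains (latency 4 x 2 FMAs/cycle = 8 in flight)
    // give the scheduler enough ready FMAs to issue 2 per cycle.
    float dot_eight_chains(const float *a, const float *b, std::size_t n) {
        float acc[8] = {0};
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)
            for (int k = 0; k < 8; ++k)
                acc[k] = std::fma(a[i + k], b[i + k], acc[k]);
        float total = 0.0f;
        for (; i < n; ++i)
            total = std::fma(a[i], b[i], total);
        for (int k = 0; k < 8; ++k)
            total += acc[k];
        return total;
    }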

Increased Execution Time even with increased number of CPUs, Why?

I have run the same C++ problem size with different numbers of CPUs on an HPC cluster, but what I found is that as the number of CPUs increased, the execution time also increased. I was expecting a significant decrease in execution time. Can anyone shed some light on this issue?
Below are my execution times per # of CPUs
Number of CPUs    Problem size    Time (seconds)
1                 3000000         15.48
2                 3000000         18.2
4                 3000000         21.73
8                 3000000         40.55
16                3000000         60.14
32                3000000         98.75
My thoughts:
Too much communication between the CPUs leads to the increased execution time.
Hope this explains it:
"There are two major factors that influence performance: the speed of the CPUs themselves, and the speed of their access to memory. In a cluster, it’s fairly obvious that a given CPU will have fastest access to the RAM within the same computer (node). Perhaps more surprisingly, similar issues are relevant on a typical multicore laptop, due to differences in the speed of main memory and the cache. Consequently, a good multiprocessing environment should allow control over the “ownership” of a chunk of memory by a particular CPU."

CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or if the execution unit for the next instruction is busy. On each cycle, each warp scheduler picks a warp from the list of eligible warps and issues 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are sufficient active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may even decrease it.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps to maximum warps supported by a launch configuration is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX480 (CC 2.0 device) a good starting point when designing a kernel is 50-66% Theoretical Occupancy. CC 2.0 SM can have a maximum of 48 warps. A 50% occupancy means 24 warps or 768 threads per SM.
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency of main memory. This results in a large percentage of chip resources being devoted to cache; most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to packing in tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the more chance one of them will be runnable at any given time.
CUDA also has zero-cost thread switching to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead to switch from the execution of one thread to the next, due to the need to store all of the register values of the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs just have tons and tons of registers, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.
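A quick way to see the effect from the original question yourself is to time the same memory-bound kernel with increasing threads per block: with only 32 threads per block far fewer warps are resident per SM, so memory latency is hidden poorly. This is a self-contained sketch with placeholder sizes; exact timings depend on the card:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Memory-bound kernel: each thread streams through memory with a grid-stride loop.
    __global__ void axpy(float a, const float *x, float *y, size_t n) {
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        size_t n = 1 << 26;  // placeholder problem size
        float *x, *y;
        cudaMalloc((void **)&x, n * sizeof(float));
        cudaMalloc((void **)&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        for (int threads = 32; threads <= 1024; threads *= 2) {
            int blocks = (int)((n + threads - 1) / threads);
            cudaEventRecord(start);
            axpy<<<blocks, threads>>>(2.0f, x, y, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.f;
            cudaEventElapsedTime(&ms, start, stop);
            std::printf("%4d threads/block: %.3f ms\n", threads, ms);
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }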