Understanding FMA performance - c++

I would like to understand how to compute FMA performance. If we look into the description here:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA
for the Skylake architecture, the instruction has Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1,500,000,000 such instructions in one second.
Is that right? If so, what could be the reason that I am observing slightly higher performance?

A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion per second. You said you are only able to achieve a throughput that is slightly higher than 1.5 billion. This can happen due to one or more of the following reasons:
The frontend is delivering fewer than 2 FMA uops every single cycle due to a frontend bottleneck (in the DSB path or the MITE path).
There are data dependencies between the FMAs or with other instructions (that are perhaps part of the looping mechanics). This can be stated alternatively as follows: fewer than 2 FMAs are ready in the RS every single cycle. Latency comes into play when there are dependencies (see the sketch after this list).
Some of the FMAs use memory operands, and if these are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained.
The core frequency drops below 3 GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
Other reasons, depending on how exactly your loop works and how you are measuring throughput.
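To make the dependency point concrete, here is a minimal sketch (my own illustration, not your code) of a loop that cannot reach 2 FMAs per cycle no matter how wide the machine is, because every FMA depends on the previous one; built with -mavx2 -mfma, it runs at roughly one FMA every 4 cycles, i.e. the latency:

    // Latency-bound: each _mm256_fmadd_ps depends on the previous value of acc,
    // so the FMAs form a single serial chain and the 0.5 CPI throughput is unreachable.
    #include <immintrin.h>
    #include <cstddef>

    float dot_one_chain(const float* a, const float* b, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (std::size_t i = 0; i < n; i += 8) {    // n assumed to be a multiple of 8
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);     // serial dependency on acc
        }
        float tmp[8];
        _mm256_storeu_ps(tmp, acc);                 // simple horizontal sum
        float s = 0.0f;
        for (float x : tmp) s += x;
        return s;
    }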

Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each other's reciprocal; the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.
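As an illustration (a sketch with assumed names, not anyone's actual code), a dot-product kernel needs at least latency × throughput = 8 independent accumulators before the FMA units can be kept fully busy; compare it with the single-chain version above:

    // Throughput-oriented: 8 independent accumulator chains cover
    // latency (4 cycles) x throughput (2 per cycle) = 8 FMAs in flight.
    #include <immintrin.h>
    #include <cstddef>

    float dot_eight_chains(const float* a, const float* b, std::size_t n) {
        __m256 acc[8];
        for (int k = 0; k < 8; ++k) acc[k] = _mm256_setzero_ps();

        for (std::size_t i = 0; i < n; i += 64) {   // n assumed to be a multiple of 64
            for (int k = 0; k < 8; ++k) {
                __m256 va = _mm256_loadu_ps(a + i + 8 * k);
                __m256 vb = _mm256_loadu_ps(b + i + 8 * k);
                acc[k] = _mm256_fmadd_ps(va, vb, acc[k]);  // chains are independent
            }
        }

        __m256 total = acc[0];                      // combine partial sums
        for (int k = 1; k < 8; ++k) total = _mm256_add_ps(total, acc[k]);
        float tmp[8];
        _mm256_storeu_ps(tmp, total);
        float s = 0.0f;
        for (float x : tmp) s += x;
        return s;
    }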

Related

How does OpenMP actually reduce clock cycles?

It might be a silly question, but with OpenMP you can distribute the operations across all the cores your CPU has. Of course, it is going to be faster 99% of the time, because you went from a single core doing N operations to K cores doing the same amount of work at the same time.
Despite this, the total number of clock cycles should be the same, right? Because the number of operations is the same. Or am I wrong?
This question boils down more or less to the difference between CPU time and elapsed time. Indeed, we see here, more often than not, questions that start with "my code doesn't scale, why?", for which the first answer is "how did you measure the time?" (I'll let you do a quick search and I'm sure you'll find many results.)
But to illustrate how things work, let's imagine you have a fixed-size problem for which you have an algorithm that is perfectly parallelized. You have 120 actions to do, each taking 1 second. Then 1 CPU core would take 120 s, 2 cores would take 60 s, 3 cores 40 s, etc.
It is the elapsed time that decreases. However, 2 cores running for 60 seconds in parallel will still consume 120 s of CPU time. This means that the overall number of clock cycles is not reduced compared to having only one CPU core running.
In summary, for a perfectly parallelized problem, you expect to see your elapsed time scaling down perfectly with the number of cores used, and the CPU time to remain constant.
In reality, what you often see is the elapsed time scaling down less than expected, due to parallelization overheads and/or imperfect parallelization. In the meantime, you see the CPU time slightly increasing with the number of cores used, for the same reasons.
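To see the distinction for yourself, here is a minimal sketch (illustrative only; the workload and numbers are placeholders) that prints both wall time and process CPU time around an OpenMP loop. As the thread count grows, the wall time shrinks while the CPU time stays roughly constant (or rises slightly):

    // Wall time vs. CPU time around a parallel loop.
    // omp_get_wtime() measures elapsed (wall) time; std::clock() measures the
    // CPU time consumed by the whole process, summed over all threads (on POSIX systems).
    #include <cmath>
    #include <cstdio>
    #include <ctime>
    #include <omp.h>

    int main() {
        const long N = 200000000L;      // placeholder amount of work
        double sum = 0.0;

        double wall0 = omp_get_wtime();
        std::clock_t cpu0 = std::clock();

        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; ++i)
            sum += std::sqrt(static_cast<double>(i));

        double wall = omp_get_wtime() - wall0;
        double cpu  = double(std::clock() - cpu0) / CLOCKS_PER_SEC;

        std::printf("threads=%d  wall=%.2fs  cpu=%.2fs  (checksum %g)\n",
                    omp_get_max_threads(), wall, cpu, sum);
        return 0;
    }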
I think the answer depends on how you define the total amount of clock cycles. If you define it as the sum of all the clock cycles from the different cores then you are correct and there will not be fewer clock cycles. But if you define it as the amount of clock cycles for the "main" core between initiating and completing the distributed operations then it is hopefully fewer.

What weights for a multi-part benchmark?

I'm writing a benchmark for a school project. It's very simple but I am wondering, in real life, what are the typical weights used for the various types of benchmarks? For instance, if I am combining an integer test, a cache test, a floating point test, should they be equally weighted in the final "score"? My hunch is that for many things, the cache test matters more than raw arithmetic, and that for many things, the RAM speed is a big factor. Is there a consensus?
There is no universal set of weights.
Different real-world workloads have different bottlenecks, or different weightings.
There is no single number that can tell you how fast a computer is. It's possible (and happens in real life) that program X runs faster on computer A than on computer B, but program Y runs faster on computer B.
Choosing a set of weights for microbenchmarks totally comes down to what you want your number to mean, and what kind of workload you want it to be a rough indicator for.
e.g. a dense matmul can usually saturate FMA execution unit throughput because it does O(N^3) work over N^2 data. With careful cache-blocking you can get mostly L1d cache hits, and avoid doing more than 1 SIMD vector load per FMA. DRAM / cache bandwidth has to be high enough to keep up, but most of the stores/reloads hit in L1d cache (which of course also has to be able to keep up).
But other workloads might bottleneck on memory bandwidth or latency and not care about FPU throughput at all. e.g. AMD Ryzen 1 can do 1x 128-bit FMA per clock while Intel Haswell and later can do 2x 256-bit FMA per clock. But Ryzen is faster or nearly equal clock-for-clock for some other workloads.
And on multi-core systems some programs are single-threaded and care only about single-core throughput, while others scale well and get a big speedup on a machine with lots of slower cores. Or they might care about inter-core latency vs. aggregate memory bandwidth.
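If you do need a single score, one common choice (SPEC, for example, combines per-benchmark ratios with a geometric mean) is a weighted geometric mean of the sub-test results. The sub-tests and weights in this sketch are placeholders; choosing them is exactly the judgement call described above:

    // Combine sub-benchmark scores into one composite number with a
    // weighted geometric mean. Scores are speedups relative to a reference
    // machine (higher is better); weights must sum to 1.0.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double scores[]  = {1.8, 1.2, 0.9};   // integer, cache, floating-point tests
        const double weights[] = {0.3, 0.5, 0.2};   // hypothetical weighting

        double logSum = 0.0;
        for (int i = 0; i < 3; ++i)
            logSum += weights[i] * std::log(scores[i]);

        std::printf("composite score: %.3f\n", std::exp(logSum));
        return 0;
    }

A geometric mean of ratios has the nice property that the ranking of machines does not depend on which machine you pick as the reference.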

Increased Execution Time even with increased number of CPUs, Why?

I have run the same C++ problem with the same problem size on different numbers of CPUs on an HPC cluster, but what I found is that as the number of CPUs increased, the execution time also increased. I was expecting a significant decrease in execution time. Can anyone shed some light on this issue?
Below are my execution times per # of CPUs
Number of CPUs    Problem size    Time (seconds)
 1                3000000          15.48
 2                3000000          18.2
 4                3000000          21.73
 8                3000000          40.55
16                3000000          60.14
32                3000000          98.75
My thoughts:
Too much communication between the CPUs leads to the increased execution time.
Hope this explains it:
"There are two major factors that influence performance: the speed of the CPUs themselves, and the speed of their access to memory. In a cluster, it’s fairly obvious that a given CPU will have fastest access to the RAM within the same computer (node). Perhaps more surprisingly, similar issues are relevant on a typical multicore laptop, due to differences in the speed of main memory and the cache. Consequently, a good multiprocessing environment should allow control over the “ownership” of a chunk of memory by a particular CPU."

How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of thread, similar to this graph:
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion, the only reliable measurement is to actually get an 8-, 16- or more-core machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.
I don't think there is a real way to do this either, but one thing that comes to mind is that you could use a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, but I am fairly confident there are hacks that can go beyond that, and other virtual machines such as VMware might even support more out of the box.
bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
It's true that there are a lot of variables at work in any program -- however, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied the thread count for a fixed size and the input size per thread, and was able to reasonably calculate a rough measure of how well the multithreaded code scaled.
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to heavily threaded loads. But this may not be an issue if your application is only used on machines with small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation, though, this won't be the case.
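As a concrete starting point, here is a sketch of a strong-scaling harness (the workload and names are placeholders of my own): fix the total problem size, vary the thread count, and report the speedup relative to one thread. For weak scaling you would instead multiply n by the thread count and expect the time to stay flat.

    // Strong-scaling measurement: fixed total work, varying thread count.
    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void work(long long begin, long long end, double* out) {
        double s = 0.0;
        for (long long i = begin; i < end; ++i) s += std::sqrt(static_cast<double>(i));
        *out = s;                                   // keep the result so it isn't optimized away
    }

    int main() {
        const long long n = 100000000LL;            // fixed problem size (strong scaling)
        const unsigned maxThreads = std::thread::hardware_concurrency();
        double t1 = 0.0;

        for (unsigned k = 1; k <= maxThreads; ++k) {
            std::vector<std::thread> pool;
            std::vector<double> partial(k, 0.0);

            auto start = std::chrono::steady_clock::now();
            for (unsigned t = 0; t < k; ++t)
                pool.emplace_back(work, t * n / k, (t + 1) * n / k, &partial[t]);
            for (auto& th : pool) th.join();
            double secs = std::chrono::duration<double>(
                              std::chrono::steady_clock::now() - start).count();

            if (k == 1) t1 = secs;
            std::printf("threads=%u  time=%.2fs  speedup=%.2fx\n", k, secs, t1 / secs);
        }
        return 0;
    }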
I don't believe this is possible, since there are too many variables to be able to accurately extrapolate performance, even assuming you are 100% parallel. There are other factors like bus speed and cache misses that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined by measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? So six threads running at, say, 50% utilization each means you have 3 equivalent processors running. Dividing that by, say, four processors means that your methods are achieving 75% utilization. Comparing that utilization against the clock-time of actual speedup tells you how much of your utilization is new overhead, and how much is real speedup. Isn't that what you are really interested in?
Processor utilization can be computed in real time in a couple of different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to non-blocking machine cycles to compute utilization. A real-time multithreading instrumentation package I developed uses such methods, and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
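As an illustration of the thread-time idea (a sketch assuming Linux/POSIX clocks, not the instrumentation package mentioned above), each thread can compare its own CPU time with wall-clock time to estimate its utilization:

    // Per-thread utilization = thread CPU time / wall-clock time.
    // CLOCK_THREAD_CPUTIME_ID is the POSIX per-thread CPU clock.
    #include <chrono>
    #include <cstdio>
    #include <ctime>
    #include <thread>

    static double thread_cpu_seconds() {
        timespec ts;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void worker(int id) {
        auto wall0 = std::chrono::steady_clock::now();
        double cpu0 = thread_cpu_seconds();

        volatile double x = 0.0;                    // placeholder workload: half busy...
        for (long i = 0; i < 200000000L; ++i) x = x + 1.0;
        std::this_thread::sleep_for(std::chrono::milliseconds(500));  // ...half blocked

        double cpu  = thread_cpu_seconds() - cpu0;
        double wall = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - wall0).count();
        std::printf("thread %d: utilization = %.0f%%\n", id, 100.0 * cpu / wall);
    }

    int main() {
        std::thread a(worker, 0), b(worker, 1);
        a.join();
        b.join();
        return 0;
    }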

How do you measure the effect of branch misprediction?

I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipeline after a mispredicted branch. Suppose N = 10; then you are losing 2 clocks per iteration on aggregate. Unless you have a very small inner loop, this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
On an in-order CPU, you may be able to calculate the approximate mispredict cost as the product of the number of mispredicts and the mispredict cost (which is generally a function of the depth of some part of the pipeline).
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc, or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: one such paper found penalties ranging from 9 to 35 cycles averaged across entire benchmarks; if you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark input to have a range of behavior, starting with always following the same branch pattern and going all the way to a random pattern, you can plot the misprediction count versus the runtime degradation. If you do, share your results!
1. Hundreds of instructions in flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
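As a sketch of the "control the mispredictions and measure" idea (my own toy example, not your binary search), the same branchy loop can be timed on shuffled data, where the branch is mispredicted roughly half the time, and on sorted data, where it is almost always predicted; the runtime gap divided by the extra mispredictions approximates the per-miss penalty. Be aware that at high optimization levels the compiler may turn the branch into a conditional move, which hides the effect.

    // Same loop, predictable vs. unpredictable branch.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    static double time_sum(const std::vector<int>& v, const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int x : v)
            if (x >= 128) sum += x;                 // data-dependent branch
        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: sum=%lld time=%.3fs\n", label, sum, secs);
        return secs;
    }

    int main() {
        std::vector<int> v(50000000);
        std::mt19937 rng(42);
        for (int& x : v) x = rng() % 256;           // random values: branch taken ~50% of the time

        double random_t = time_sum(v, "random");
        std::sort(v.begin(), v.end());              // sorted: branch pattern becomes predictable
        double sorted_t = time_sum(v, "sorted");

        std::printf("extra cost attributable to mispredictions: %.3fs\n",
                    random_t - sorted_t);
        return 0;
    }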
Look at your specs for that info, and if that fails, run it a billion times and time it externally to your program (stopwatch or something). Then run it without a miss and compare.