I'm writing a benchmark for a school project. It's very simple, but I am wondering: in real life, what are the typical weights used for the various types of benchmarks? For instance, if I am combining an integer test, a cache test, and a floating point test, should they be equally weighted in the final "score"? My hunch is that for many workloads the cache test matters more than raw arithmetic, and that RAM speed is often a big factor. Is there a consensus?
There is no universal set of weights.
Different real-world workloads have different bottlenecks, and therefore imply different weightings.
There is no single number that can tell you how fast a computer is. It's possible (and happens in real life) that program X runs faster on computer A than on computer B, while program Y runs faster on computer B.
Choosing a set of weights for microbenchmarks totally comes down to what you want your number to mean, and what kind of workload you want it to be a rough indicator for.
For example, a dense matmul can usually saturate FMA execution-unit throughput because it does O(N^3) work over O(N^2) data. With careful cache-blocking you can get mostly L1d cache hits, and avoid doing more than 1 SIMD vector load per FMA. DRAM / cache bandwidth has to be high enough to keep up, but most of the stores/reloads hit in L1d cache (which of course also has to be able to keep up).
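As a rough illustration of the cache-blocking idea (my own sketch, not from the answer above; the tile size and the plain scalar inner loop are assumptions, and a real kernel would be vectorized and tuned much further), the point is that each TILE x TILE block of the matrices is small enough to stay in L1d while the inner loops reuse it many times:

#include <stddef.h>

#define TILE 32  /* assumed tile size: a few 32x32 float tiles (~12 KB) fit in a 32 KB L1d */

/* C must be zero-initialized by the caller; matrices are n x n, row-major. */
void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}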
But other workloads might bottleneck on memory bandwidth or latency and not care about FPU throughput at all. For example, AMD Ryzen 1 can do one 128-bit FMA per clock while Intel Haswell and later can do two 256-bit FMAs per clock. But Ryzen is faster or nearly equal, clock for clock, for some other workloads.
And on multi-core systems some programs are single-threaded and care only about single-core throughput, while others scale well and get a big speedup on a machine with lots of slower cores. Or they might care about inter-core latency vs. aggregate memory bandwidth.
Related
I would like to understand how to compute FMA performance. If we look into the description here:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA
for the Skylake architecture the instruction has Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
So, as far as I understand, if the max (turbo) clock frequency is 3 GHz, then on a single core I can execute 1,500,000,000 instructions in one second.
Is that right? If so, what could be the reason that I am observing slightly higher performance?
A throughput of 0.5 means that the processor can execute two independent FMAs per cycle. So at 3 GHz, the maximum FMA throughput is 6 billion per second. You said you are only able to achieve a throughput that is slightly larger than 1.5B. This can happen due to one or more of the following reasons:
The frontend is delivering fewer than 2 FMA uops every single cycle due to a frontend bottleneck (the DSB path or the MITE path).
There are data dependencies between the FMAs or with other instructions (perhaps ones that are part of the looping mechanics). This can be stated alternatively as follows: there are fewer than 2 FMAs ready in the RS (reservation station) every single cycle. Latency comes into play when there are dependencies.
Some of the FMAs use memory operands; if these are not found in the L1D cache when they are needed, a throughput of 2 FMAs per cycle cannot be sustained.
The core frequency becomes less than 3GHz during the experiment. This factor only impacts the throughput per second, not per cycle.
Other reasons depending on how exactly your loop works and how you are measuring throughput.
Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These two numbers are each other's reciprocal; the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.
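To make that concrete, here is a minimal sketch (mine, not from the answer; it assumes an AVX+FMA-capable x86 CPU, arrays whose length is a multiple of 64 floats, and compilation with something like -O2 -march=haswell) of a dot-product loop that keeps 8 independent accumulators, i.e. latency 4 x 2 FMAs per cycle worth of work in flight:

#include <immintrin.h>
#include <stddef.h>

/* Dot product of a and b; n is assumed to be a multiple of 64 for simplicity. */
float fma_dot(const float *a, const float *b, size_t n) {
    __m256 acc[8];
    for (int k = 0; k < 8; ++k)
        acc[k] = _mm256_setzero_ps();

    for (size_t i = 0; i < n; i += 8 * 8) {
        for (int k = 0; k < 8; ++k) {
            __m256 va = _mm256_loadu_ps(a + i + 8 * k);
            __m256 vb = _mm256_loadu_ps(b + i + 8 * k);
            acc[k] = _mm256_fmadd_ps(va, vb, acc[k]);  /* 8 independent dependency chains */
        }
    }

    /* Reduce the 8 vector accumulators to a single scalar. */
    __m256 total = acc[0];
    for (int k = 1; k < 8; ++k)
        total = _mm256_add_ps(total, acc[k]);
    float tmp[8];
    _mm256_storeu_ps(tmp, total);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)
        sum += tmp[k];
    return sum;
}

With a single accumulator the loop would be limited by the 4-cycle FMA latency (one FMA every 4 cycles); with 8 independent accumulators it can, in principle, issue 2 FMAs per cycle.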
I'm developing a low-latency HFT trading application.
I'm using a single-CPU machine because it's much easier to configure and maintain (no need to tune NUMA). Also, obviously, assuming we have enough resources, it should definitely not be slower than a dual-CPU setup, and will likely be a little bit faster because there is no QPI/NUMA latency.
HFT requires a lot of resources, and now I realize I want many more cores. Also, colocating two 1U single-CPU machines is much more expensive than colocating one 1U dual-CPU machine, so even assuming I can "split" my program in two, it still makes sense to use a 1U dual-CPU machine.
So how bad is QPI/NUMA latency? If I move my application from a single-CPU machine to a dual-CPU machine, how much slower can it be? At most I can afford a delay of several microseconds, but not more. Can QPI/NUMA introduce a significant delay if not tuned correctly, and how significant would that delay be?
Is it possible to write an application which runs much slower (more than several microseconds slower) on a dual-CPU setup than on a single-CPU setup? I.e., one that runs much slower on a faster computer? (Of course assuming we have the same processors, memory, network card and everything else.)
This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?
Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?
How do tasks get scheduled? How and when does the OS decide to move threads from one CPU socket to the other?
Does the code and data fit in cache?
Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".
As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situation. Benchmark your application, check performance counters and tweak based on that. [Given the price for a machine of the specs you describe in comments above, I'd expect the supplier would allow some sort of test, demo, "try before you buy", etc.]
Assume a worst-case scenario: a memory access straddles two cache lines (an unaligned access of an 8-byte value, for example) that are split between your worst-placed CPUs, the TLB entries need reloading, and each of the relevant page-table entries also lives on the worst possible CPU. Since the two halves of the value are in different places, you need a new TLB entry for each of the two 4-byte reads used to load your 64-bit value. (Each TLB entry is a separate location.)
This means something like 2 x 4 x n, where n is a cross-socket memory access costing roughly 100-200 ns. So one memory access could, at least in theory, take around 1600 ns, i.e. 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation. The overhead is a lot less than, for example, swapping to disk, which can add milliseconds to your execution time.
It is not very hard to write code that updates the same cache line from multiple CPUs and thus causes a dramatic reduction in performance. I remember, a long time back when I first had an Athlon SMP system running a simple benchmark, the author did exactly this in a Dhrystone benchmark:
int numberOfRuns[MAX_CPUS];   /* per-CPU counters packed into one array, so adjacent elements share a cache line */
Now, numberOfRuns is the outer loop counter, and updating it on each loop iteration, on either CPU, caused "false sharing" (so each time the counter was updated, the other CPU had to flush that cache line).
Running this on a 2-core SMP system gave 30% of the single-CPU performance. So 3 times SLOWER than one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so my memory may be a little "off" on the exact details, but the essence of the story is still true - a badly written application can run slower on multiple cores than on a single core.)
I'd expect at least that bad performance on a modern system where you have false sharing of commonly used variables.
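For illustration, here is a minimal false-sharing demo along the same lines (my own sketch, not the original Dhrystone code; the struct names and iteration count are made up): two threads bump counters that either share a cache line or are padded onto separate lines. On typical hardware the padded version is several times faster.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine { std::atomic<long> a{0}; std::atomic<long> b{0}; };   // both counters in one cache line
struct PaddedLine { alignas(64) std::atomic<long> a{0};
                    alignas(64) std::atomic<long> b{0}; };               // one counter per cache line

template <class Counters>
static void run(const char *name) {
    Counters c;
    auto bump = [](std::atomic<long> &n) {
        for (long i = 0; i < 50000000; ++i)
            n.fetch_add(1, std::memory_order_relaxed);
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("%-14s %lld ms\n", name, (long long)ms);
}

int main() {
    run<SharedLine>("false sharing:");
    run<PaddedLine>("padded:");
}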
In comparison, well-written code should run nearly N times faster if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded calculator for weird numbers, which gives a near n-times performance gain both on my single-socket system at home and on my two-socket system at work.
$ time ./weird -t 1 -e 100000
real 0m22.641s
user 0m22.660s
sys 0m0.003s
$ time ./weird -t 6 -e 100000
real 0m5.096s
user 0m25.333s
sys 0m0.005s
So about 11% overhead in CPU time (25.3 s of user time vs. 22.7 s). That is with one shared variable [the current number] being atomically updated between threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.
I want to thoroughly measure and tune my C/C++ code to perform better with caches on an x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine), but I'm wondering how one would measure instructions per cycle or reads/writes per cycle with respect to the working set.
How should I proceed to measure these values?
Modern processors (i.e., anything not very constrained and less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given suitable instruction ordering). The latest x86 processors translate the CISC instructions into internal RISC-like micro-operations, reorder them and execute the result, and even have several register banks, so instructions using "the same registers" can be executed in parallel. There isn't any reasonable way to define "the time an instruction takes to execute" today.
Current CPUs are also much faster than memory (a few hundred instructions' worth of time is the typical cost of accessing main memory), so they are all heavily dependent on cache for performance. And then you have all kinds of funny effects of cores sharing (or not) parts of the cache, and so on.
Tuning code for maximal performance starts with the software architecture, goes on to program organization and algorithm and data structure selection (where a modicum of cache/virtual-memory awareness is useful too), then careful programming, and only then (as the most extreme measure to squeeze out the last 2% of performance) considerations like the ones you mention (and the other favorite, "rewrite in assembly"). The ordering is that one because the earlier levels give more performance for the same cost. Measure before digging in; programmers are notoriously unreliable at finding bottlenecks. And consider the cost of reorganizing code for performance: the work itself, convincing yourself the more complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes sense (perhaps for heavily travelled code paths in popular operating systems, or for common code sequences generated by a compiler, but almost nowhere else).
If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There are tools like VTune from Intel, CodeAnalyst from AMD, and oprofile for Linux (which works with both AMD and Intel processors).
There is a whole range of different registers that count the number of instructions actually completed, the number of cycles the processor spends waiting, and so on. You can also get counts of things like "number of memory reads", "number of cache misses", "number of TLB misses" and "number of FPU instructions".
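On Linux you can also read such a counter directly via the perf_event_open system call, which is the mechanism tools like oprofile/perf sit on top of. A minimal sketch (assuming Linux; the event chosen here is last-level-cache misses, and error handling is kept to a minimum):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* perf_event_open has no glibc wrapper, so call it via syscall(). */
static long perf_event_open(struct perf_event_attr *attr,
                            pid_t pid, int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* last-level-cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}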
The next, more tricky part is of course to try to fix whatever these counters point at, and as mentioned in another answer, programmers aren't always good at tweaking these sorts of things - and it's certainly time consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y. (There were some tuning tricks for the early Pentium 4 that worked VERY badly on AMD processors - if, on the other hand, you tuned that code for AMD processors of that age, you got code that also ran well on the same-generation Intel processors!)
You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
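For example, with GCC/Clang you can use the __rdtsc() intrinsic from <x86intrin.h> (MSVC provides the same intrinsic in <intrin.h>). This is my own quick-and-dirty sketch; note that on recent CPUs the TSC counts reference cycles at a fixed rate rather than actual core clock cycles:

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    unsigned long long start = __rdtsc();

    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += i * 0.5;               /* the work being timed */

    unsigned long long end = __rdtsc();
    printf("elapsed reference cycles: %llu\n", end - start);
    return 0;
}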
However, I'd suggest simply measuring with QueryPerformanceCounter. It is rare that the actual number of cycles is important; to tune code you typically only need to compare relative time measurements, and rdtsc has many pitfalls (though probably not applicable to the situation you described):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust their frequency, which changes the relationship between elapsed time and elapsed cycles.
What would be the best way to measure the speedup of my program, given that I only have 4 cores? Obviously I could measure it up to 4 threads; however, it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup for each number of threads, as in a typical speedup-vs-threads graph.
Is there any way I can do this? Perhaps a method of simulating multiple cores?
I'm sorry, but in my opinion the only reliable measurement is to actually get an 8-, 16- or more-core machine and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
caches - size, number of layers, shared / non-shared
memory bandwidth
number of cores vs. number of processors, i.e., is it an 8-core machine or a dual quad-core machine
interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.
I don't think there is a real way to do this either, but one thing that comes to mind is that you could use a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, but I am fairly confident there are hacks to go beyond that, and other virtual machines like VMware might even support more out of the box.
bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (n = 10k/thread) and varying the number of threads available for computation.
It's true that there are a lot of variables at work in any program - however, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied the thread count for a fixed problem size, and varied the input size per thread, and was able to get a reasonable rough measure of how well the multithreaded code scaled.
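A minimal strong-scaling measurement might look like the sketch below (my own illustration; the array-squaring workload, the problem size and the power-of-two thread sweep are arbitrary assumptions): fix n, run the same parallel loop with 1, 2, 4, ... threads, and report the speedup relative to the 1-thread run.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static double work_chunk(const std::vector<double> &v, size_t lo, size_t hi) {
    double s = 0.0;
    for (size_t i = lo; i < hi; ++i)
        s += v[i] * v[i];
    return s;
}

int main() {
    const size_t n = 1 << 26;                 // fixed problem size (strong scaling)
    std::vector<double> v(n, 1.5);
    unsigned max_threads = std::max(1u, std::thread::hardware_concurrency());
    double t1 = 0.0;

    for (unsigned threads = 1; threads <= max_threads; threads *= 2) {
        std::vector<double> partial(threads);
        auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t)
            pool.emplace_back([&, t] {
                // each thread sums its own contiguous slice of the array
                partial[t] = work_chunk(v, n / threads * t, n / threads * (t + 1));
            });
        for (auto &th : pool) th.join();

        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - start).count();
        if (threads == 1) t1 = secs;
        std::printf("%u threads: %.3f s, speedup %.2f\n", threads, secs, t1 / secs);
    }
}

For weak scaling you would instead grow n in proportion to the thread count and look at how far the runtime stays flat.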
Since you only have 4 cores, you can only feasibly measure the scaling up to 4 threads. This severely limits your ability to see how well it scales to heavily threaded loads. But this may not be an issue if your application is only used on machines with small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with an increasing number of threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation, though, this won't be the case.
I don't believe this is possible, since there are too many variables to be able to accurately extrapolate performance, even assuming you are 100% parallel. There are other factors, like bus speed and cache misses, that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined through measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? So six threads, each running at say 50% utilization, amount to 3 equivalent processors running. Dividing that by, say, four processors means that your methods are achieving 75% utilization. Comparing that utilization against the clock-time of actual speedup tells you how much of your utilization is new overhead, and how much is real speedup. Isn't that what you are really interested in?
The processor utilization can be computed in real time in a couple of different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to non-blocking machine cycles to compute utilization. A real-time multithreading instrumentation package I developed uses such methods, and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
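As a sketch of the "ratio of thread CPU time to wall time" idea (my own illustration, assuming a POSIX system with clock_gettime; on Windows, GetThreadTimes plus QueryPerformanceCounter plays the same role):

#include <stdio.h>
#include <time.h>

static double secs(struct timespec ts) {
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Utilization of the calling thread over the duration of work():
   CPU time consumed divided by wall-clock time elapsed. */
double thread_utilization(void (*work)(void)) {
    struct timespec cpu0, cpu1, wall0, wall1;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu0);
    clock_gettime(CLOCK_MONOTONIC, &wall0);

    work();                                    /* the region being measured */

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &wall1);

    return (secs(cpu1) - secs(cpu0)) / (secs(wall1) - secs(wall0));
}

Summing these per-thread utilizations gives the "equivalent processors" figure described above.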
I'm currently profiling an implementation of binary search. Using some special instructions to measure this, I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS-based architecture.
You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipeline after a mispredicted branch. Suppose N = 10; then that means you are losing 2 clocks per iteration on aggregate. Unless you have a very small inner loop, this is probably not going to be a significant performance hit.
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.
On an in-order CPU you may be able to calculate the approximate mispredict cost as the product of the number of mispredicts and the mispredict cost (which is generally a function of the length of some part of the pipeline).
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency-bound by one or more chains of dependent instructions, or it may be throughput-bound on resources like execution units, renaming throughput, etc., or it may be somewhere in between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: one such paper found penalties ranging from 9 to 35 cycles averaged across entire benchmarks; if you look at some small piece of code, the range is even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark inputs to have a range of behavior, starting with always following the same branch pattern and going all the way to a random pattern, you can plot the misprediction count versus the runtime degradation (a rough sketch of this approach follows below). If you do, share your result!
1Hundreds of instructions in-flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
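Here is a rough sketch of that "control the mispredictions and measure the difference" approach (my own illustration in generic C++, not MIPS-specific; the array size, repeat count and the assumption of ~50% mispredictions on random data are all rough):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static long long scan_ns(const std::vector<int> &v, int repeats) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int r = 0; r < repeats; ++r)
        for (int x : v)
            if (x >= 128)        // data-dependent branch (beware: some compilers emit a branchless cmov here)
                sum += x;
    auto t1 = std::chrono::steady_clock::now();
    std::printf("(sum=%lld) ", sum);   // keep the loop from being optimized away
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    const int repeats = 2000;
    std::vector<int> v(32768);
    std::mt19937 rng(42);
    for (int &x : v) x = rng() % 256;

    std::vector<int> sorted = v;
    std::sort(sorted.begin(), sorted.end());

    long long random_ns = scan_ns(v, repeats);       // branch mispredicts roughly half the time
    long long sorted_ns = scan_ns(sorted, repeats);  // branch is almost always predicted
    double branches = double(v.size()) * repeats;
    std::printf("\n~%.1f ns per mispredicted branch\n",
                (random_ns - sorted_ns) / (0.5 * branches));
}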
Look at your specs for that info, and if that fails, run it a billion times and time it externally to your program (stopwatch or something). Then run it without a miss and compare.