Speed ratio of algorithm versus precompiled reference implementation differs across computers

Speed ratio of algorithm versus precompiled reference implementation differs across computers - c++

We have a small C++ project with the following architecture.
These two were compiled into a DLL:
An algorithm
A tester for the algorithm which checks the correctness of the result and measures the execution speed.
Then another implementation of the same algorithm is written by someone else.
The main() function does this:
Invoke the tester on both implementations of the algorithm and measure their execution speed. This is done several times, so that averages can be taken later.
Compute the speed ratio between them (measured time/measured reference time). This is referred to as the score.
We found that running the very same code and DLL on different computers returned quite different speed ratios. On one computer an implementation scored 6.4, and the very same implementation scored 2.8 on another machine. How could that be?

There could be tons of factors, but here are a few:
CPU cache can be a big one. Different processors have different caches (and not just in terms of raw cache size, but also caching strategies). One might be "smarter" than the other, or perhaps one just happens to work better than another in this specific situation.
CPU pipelining. Instructions these days are interleaved in the CPU, even in a single thread of execution. The way the CPU pipeline works varies from CPU to CPU, and one CPU might be able to two particular things at once, while another CPU can't. If one of the implementations exploit this, then it gets a speed boost (or if they both do, then they both get closer to the same speed).
CPU instruction execution times may vary. So one CPU executing the exact same instructions as another CPU might be able to do each one faster than the other CPU. If one computer's CPU takes a longer time to use a particular instruction (and one of the implementations happens to use that instruction), while another CPU has been improved to speed up that instruction's execution time, then there will be a larger time discrepancy.
Branch prediction models in the CPUs might be different, and one implementation might be more or less friendly to a particular CPU's branch prediction model.
Operating systems can affect this in many ways, from memory allocation strategies (maybe one OS has a memory allocation strategy that causes a bigger discrepancy in times, while another OS has a different allocation strategy that minimizes the discrepancy), to CPU time slice management (are the algorithms multithreaded, for example?).

Related

Will multithreading improve performance significantly if I have a fixed amount of calculations that are independet from each other?

I am programming a raycasting game engine.
Each ray can be calculated without knowing anything about the other rays (I'm only calculating distances).
Since there is no waiting time between calculations, I wonder whether it's worth the effort to make the ray calculations multithreaded or not.
Is it likely that there will be a performance boost?

Mostly likely multi-threading will improve performance if done correctly. The way you've stated your problem, it is a perfect candidate for multi-threading since the computations are independent, reducing the need for coordination between threads to a minimum.
Some reasons you still might not get a speed up, or may not get the full speed up you expect could include:
1) The bottleneck may not be on-die CPU execution resources (e.g., ALU-bound operations), but rather something shared like memory or shared LLC bandwidth.
For example, on some architectures, a single thread may be able to saturate memory bandwidth, so adding more cores may not help. A more common case is that a single core can saturate some fraction, 1/N < 1 of main memory bandwidth, and this value is larger than 1/C where C is the core count. For instance, on a 4 core box, one core may be able to consume 50% of the bandwidth. Then, for a memory-bound computation, you'll get good scaling to 2 cores (using 100% of bandwidth), but little to none above that.
Other resources which are shared among cores include disk and network IO, GPU, snoop bandwidth, etc. If you have a hyper-threaded platform, this list increases to include all levels of cache and ALU resources for logical cores sharing the same physical core.
2) Contention "in practice" between operations which are "theoretically" independent.
You mention that your operations are independent. Typically this means that they are logically independent - they don't share any data (other than perhaps immutable input) and they can write to separate output areas. That doesn't exclude the possibility, however, than any given implementation actually has some hidden sharing going on.
One classic example is false-sharing - where independent variables fall in the same cache line, so logically independent writes to different variables from different threads end up thrashing the cache line between cores.
Another example, frequently encountered in practice, is contention via library - if your routines use malloc heavily, you may find that all the threads spend most of their time waiting on a lock inside the allocator as malloc is shared resource. This can be remedied by reducing reliance on malloc (perhaps via fewer, larger mallocs) or with a good concurrent malloc such as hoard or tcmalloc.
3) Implementation of the distribution and collection of the computation across threads may overwhelm the advantage you get from multiple threads. For example, if you spin up a new thread for every individual ray, the thread creation overhead would dominate your runtime and you would likely see a negative benefit. Even if you use a thread-pool of persistent threads, choosing a "work unit" that is too fine grained will impose a lot of coordination overhead which may eliminate your benefits.
Similarly, if you have to copy the input data to and from the worker threads, you may not see the scaling you expect. Where possible, use pass-by-reference for read-only data.
4) You don't have more than 1 core, or you do have more than 1 core but they are already occupied running other threads or processes. In these cases, the effort to coordinate multiple threads is pure overhead.

In general, it depends. Given that the calculations are independent, it sounds like this is a good candidate for potential performance improvements due to threading. Ray calculations typically can benefit from this.
However, there are many other factors, such as memory access requirements, as well as the underlying system on which this runs, which will have a tremendous impact on this. It's often possible to have multithreaded versions run slower than single threaded versions if not written correctly, so profiling is the only way to answer this definitively.

Probably yes, multithreading (e.g. with pthreads) could improve performance; but you surely want to benchmark (and you might be disappointed if your program is memory bound, not CPU bound). And you could also consider OpenCL (to run some regular numeric computations on the GPGPU) and OpenMP (to explicitly ask the compiler, thru pragmas, to parallelize some of your code).
Maybe Open-MPI might be considered to run on several communicating processes. And if you are brave (or crazy) you could mix several approaches.
In reality, it depends upon the algorithm and the system (both hardware and operating system), and you should benchmark (e.g. some micro-prototype related to your needs).
If on some particular system the bottleneck is the memory bandwidth (not the CPU), multi-threading or multi-processing won't help much (and probably could degrade performance).
Also, the cost of synchronization may vary widely (e.g. locking a mutex can be very fast on some systems and 50x slower on others).

Very likely. Independent calculations are a perfect candidate for parallelization. In the case of raycasting, there is so many of them that they would spread nicely across as many parallel threads as the hardware permits.
An unexpected bottleneck for calculations that would otherwise have perfect data-independence can be concurrent writes to nearby locations (false sharing of cache lines).

Measuring performance/throughput of fast code ignoring processor speed?

Is there a way I could write a "tool" which could analyse the produced x86 assembly language from a C/C++ program and measure the performance in such a way, that it wouldnt matter if I ran it on a 1GHz or 3GHz processor?
I am thinking more along the lines of instruction throughput? How could I write such a tool? Would it be possible?

I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.

Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. IF you want static analysis you'll be in "undecidable land" as the other answer pointed out (you can't even detect if a given code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.

It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions, and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in L1 data cache and have 4 cycle latency; all instructions are in L1 instruction cache but in trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details for all the rest that differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they do take (and the resources they occupy during this time) vary also. Accuracy of branch prediction is hard to predict, and it takes different numbers of cycles for processors to recover. Caches are different sizes, take different times to access, and have different algorithms for decided what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. And the more you can narrow down the processor you are targetting, and the more you constrain the code you are evaluating, the better you can predict how code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly unknown) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.

How to measure read/cycle or instructions/cycle?

I want to thoroughly measure and tune my C/C++ code to perform better with caches on a x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine) but I'm wondering how would one measure the instructions per cycle or reads/write per cycle with respect to the working set.
How should I proceed to measure these values?

Modern processors (i.e., those not very constrained that are less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given correct instruction ordering). Latest x86 processors translate the CISC instructions into internal RISC instructions, reorder them and execute the result, have even several regster banks so instructions using "the same registers" can be done in parallel. There isn't any reasonable way to define the "time the instruction execution takes" today.
The current CPUs are much faster than memory (a few hundred instructions is the typical cost of accessing memory), they are all heavily dependent on cache for performance. And then you have all kinds of funny effects of cores sharing (or not) parts of cache, ...
Tuning code for maximal performance starts with the software architecture, goes on to program organization, algorithm and data structure selection (here a modicum of cache/virtual memory awareness is useful too), careful programming and (as te most extreme measures to squeeze out the last 2% of performance) considerations like the ones you mention (and the other favorite, "rewrite in assembly"). And the ordering is that one because the first levels give more performance for the same cost. Measure before digging in, programmers are notoriously unreliable in finding bottlenecks. And consider the cost of reorganizing code for performance, both in the work itself, in convincing yourself this complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes any sense (perhaps for heavily travelled code paths in popular operating systems, in common code paths generated by a compiler, but almost nowhere else).

If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There is tools like VTune from Intel, CodeAnalyst from AMD and oprofile for Linux (works with both AMD and Intel processors).
There are a whole range of different registers that count the number of instructions actually completed, the number of cycles the processor is waiting for . You can also get a count of things like "number of memory reads", "number of cache misses", "number of TLB misses", "number of FPU instructions".
The next, more tricky part, is of course to try to fix any of these sort of issues, and as mentioned in another answer, programmers aren't always good at tweaking these sort of things - and it's certainly time consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y (there were some tuning tricks for early Pentium 4 that works VERY badly on AMD processors - if on the other hand, you tune that code for AMD processors of that age, you get code that runs well on the same generation Intel processor too!)

You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
However, I'd suggest simply measuring using QueryPerformanceCounter. It is rare that the actual number of cycles is important, to tune code you typically only need to be able to compare relative time measurements, and rdtsc has many pitfalls (though probably not applicable to the situation you described):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust the frequency, changing the rate of change in time with respect to the rate of change in cycles.

benchmark a piece of code independent of CPU performance?

My Objective is : I want to test a piece of code (or function) performance, just like how I test the correctness of that function in a unit-test, let say that the output of this benchmarking process is a "function performance index" which is "portable"
My Problem is : we usually benchmarking a code by using a timer to count elapsed time during execution of that code. and that method is depend on the hardware or O/S or other thing.
My Question is : is there a method to get a "function performance index" that is independent to the performance of the host (CPU/OS/etc..), or if not "independent" lets say it is "relative" to some fixed value. so that somehow the value of "function performance index" is still valid on any platform or hardware performance.
for example: that FPI value is could be measured in
number of arithmetic instruction needed to execute a single call
float value compared to benchmark function, for example function B has rating index of 1.345 (which is the performance is slower 1.345 times than the benchmark function)
or other value.
note that the FPI value doesn't need to be scientifically correct, exact or accurate, I just need a value to give a rough overview of that function performance compared to other function which was tested by the same method.

I think you are in search of the impossible here, because the performance of a modern computer is a complex blend of CPU, cache, memory controller, memory, etc.
So one (hypothetical) computer system might reward the use of enormous look-up tables to simplify an algorithm, so that there were very few cpu instructions processed. Whereas another system might have memory much slower relative to the CPU core, so an algorithm which did a lot of processing but touched very little memory would be favoured.
So a single 'figure of merit' for these two algorithms could not even convey which was the better one on all systems, let alone by how much it was better.

Probably what you really need is a tcov-like tool.
man tcov says:
Each basic block of code (or each
line if the -a option to tcov is specified) is prefixed with
the number of times it has been executed; lines that have
not been executed are prefixed with "#####". A basic block
is a contiguous section of code that has no branches: each
statement in a basic block is executed the same number of
times.

No, there is no such thing. Different hardware performs differently. You can have two different pieces of code X and Y such that hardware A runs X faster than Y but hardware B runs Y faster than X. There is no absolute scale of performance, it depends entirely on the hardware (not to mention other things like the operating system and other environmental considerations).

It sounds like what you want is a program that calculates the Big-O Notation of a piece of code. I don't know if it's possible to do that in an automated fashion (Halting problem, etc).

Like others have mentioned this is not a trivial task and may be impossible to get any sort of accurate results from. Considering a few methods:
Benchmark Functions -- While this seems promising I think you'll find that it won't work well as you try to compare different types of functions. For example, if your benchmark function is 100% CPU bound (as in some complex math computation) then it will compare/scale well with other CPU bound functions but fail when compared with, say, I/O or memory bound functions. Carefully matching a benchmark function to a small set of similar functions may work but is tedious/time consuming.
Number of Instructions -- For a very simple processor it may be possible to count the cycles of each instruction and get a reasonable value for the total number of cycles a block of code will take but with today's modern processors are anything but "simple". With branch prediction and parallel pipelines you can can't just add up instruction cycles and expect to get an accurate result.
Manual Counting -- This might be your best bet and while it is not automatic it may give better results faster than the other methods. Just look at things like the O() order of the code, how much memory the function reads/writes, how many file bytes are input/output etc.... By having a few stats like this for each function/module you should be able to get a rough comparison of their complexity.

How do you measure the effect of branch misprediction?

I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.

You're losing 0.2 * N cycles per iteration, where N is the number of cycles that it takes to flush the pipelines after a mispredicted branch. Suppose N = 10 then that means you are losing 2 clocks per iteration on aggregate. Unless you have a very small inner loop then this is probably not going to be a significant performance hit.

Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it's a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is probably a reasonable guess.

On an in-order CPU you may be able to calculate the approximate mispredict cost as a product of the number of mispredicts and the mispredict cost (which is generally a function of some part of the pipeline)
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight1, only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc, or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: that one found a penalty size of ranging from 9 to 35 cycles averaged across entire benchmarks: if you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the 100s of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well a simple approach is just to control the number of mispredictions and measure the difference! If you set up your benchmark input have a range of behavior, starting with always following the same branch pattern, all the way to having a random pattern, you can plot the misprediction count versus runtime degradation. If you do, share your result!
1Hundreds of instructions in-flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.

Look at your specs for that info and if that fails, run it a billion times and time it external to your program (stop watch of something.) Then run it with without a miss and compare.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js