Efficient Kullback–Leibler calculation - C++

I am looking to implement KL divergence in C++ efficiently (CPU only for now).
Much like AES or FFT (fast Fourier transform), where use of a common function has led to hardware-level optimizations (Intel AES and Intel FFT).
Is there anything similar for natural log, or slightly higher-level efficiencies (ASM/C) that prevent the bottleneck of executing many natural-log calls in succession (if they exist)?
Some use-case examples:
- Many parallel and independent node executions, each one performing ~20 KL calculations from localized (not shared or pointer-referenced) memory.
- Scheduled KL: executing in stepped-parallel, where the hardware is set up expecting uses of, for example, tables (assuming tables are used at this lower level; AES implementations indicate this is highly probable).
I have found a possible candidate, but I'm unsure: ALTFP_LOG

You can use SSE instructions to calculate the logarithm of many values in parallel. But whether you can actually make use of those instructions depends heavily on how the rest of the calculations you are going to perform depend on the logarithms you calculate, so it is not possible to give a more specific answer.
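For a concrete starting point, here is a minimal scalar sketch of discrete KL divergence, assuming the distributions arrive as plain arrays (the function name is illustrative). With -O3 and a vectorized math library (for example glibc's libmvec or Intel SVML), the compiler can substitute a SIMD log and process several elements per iteration:

```cpp
#include <cmath>
#include <cstddef>

// Minimal scalar sketch of discrete KL divergence D(P || Q).
// A vectorizing compiler plus a SIMD math library can turn the
// std::log call into a packed logarithm.
double kl_divergence(const double* p, const double* q, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] > 0.0 && q[i] > 0.0)   // skip zero-probability bins
            sum += p[i] * std::log(p[i] / q[i]);
    }
    return sum;
}
```

The per-node workload in the question (roughly 20 KL calculations over local buffers) would simply call this in a loop; since the nodes are independent, they map naturally onto threads.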

Related

Performance of custom bitwise vs native CPU operations

Hello everyone! I have been trying to create my own big-integer class for an RSA implementation in C++ (for practice purposes only). The only way I see such a thing being implemented well in terms of performance is by using C++'s built-in bitwise operations (& | ^), meaning implementing custom full adders for addition, binary multipliers for multiplication, etc. The thing I am interested in can be formulated as follows: would custom-made emulations of hardware circuits for number arithmetic (like full adders and multipliers) built from bitwise C++ operations be slower than the native instructions? In other words, if I make my own unsigned integer class of 64 bits, and make it "ideal" in terms of the number of bitwise operations needed to perform addition, multiplication, and division on it, can it have the same performance as a built-in unsigned long long? Can it be implemented to be this fast in any programming language at all, or will you never be able to make any operation faster than those in the CPU's intrinsic instruction set?
Please note that I am not interested in answers regarding implementations of RSA, but only in the performance comparison of native and hand-made arithmetic.
Thank you in advance!
Software implementations cannot even come close to arithmetic circuits that are typically used in hardware. It's not even a question of benchmarking or of different results depending on the system (assuming we're talking about hardware that isn't prehistoric), it's a hands-down win for hardware circuits, guaranteed, every time, by a huge factor. A big difference between hardware and software is this: hardware has almost unlimited parallelism, so the number of bit-operations doesn't matter as much, speed depends primarily on the "depth" of a circuit. Software also has parallelism, but it's very limited.
Consider the typical fast multipliers used in modern hardware. They're based on some parallel reduction scheme, such as a Dadda multiplier (the actual circuit doesn't necessarily follow Dadda's algorithm to the letter, but it will use a similar parallel reduction). Hardware has almost unlimited parallelism to do that parallel reduction with. As a result, 64-bit multiplication takes 3 cycles on many modern machines (not all of them, but for example both Apple M1 and all current Intel and AMD x64 processors). Granted, in 3 cycles you could squeeze in more than 3 bitwise operations, but it's just not even a contest: you cannot implement multiplication in just a handful of bitwise operations.
Even just addition is already unbeatable. It's already as fast as bitwise operations are, or perhaps more accurately, bitwise operations are as slow as addition is. The time a bitwise operation takes at the software level has little to do with the latency of the corresponding gate, it's more a property of how the processor was designed in general. By the way you may also be interested in this other question: Why is addition as fast as bit-wise operations in modern processors?
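To make the depth argument concrete, here is a sketch of addition built purely from bitwise operations (illustrative only, not the carry-lookahead circuit real hardware uses). Each iteration propagates carries one step further, so the dependent-operation chain is far longer than a single-cycle ADD:

```cpp
#include <cstdint>

// Software adder from bitwise operations only. The worst case (a long
// carry chain) loops up to 64 times, each pass costing several
// dependent operations, versus one cycle for the hardware ADD.
uint64_t add_bitwise(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t carry = (a & b) << 1;  // positions that generate a carry
        a ^= b;                         // sum without carries
        b = carry;                      // feed the carries back in
    }
    return a;
}
```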
As P Kramer said in his comment, it totally depends on your system, instruction set, and compiler. Modern compilers are pretty good at finding optimizations for your specific CPU when you ask them to, but it's impossible to know through theory alone whether they'll do as well as or better than the native instruction.
As usual in this case, I suggest A/B testing (don't forget to use -march and -mtune if using gcc/clang) to check which implementation is the fastest on your machine and by how much.

Where is integer addition and subtraction event count from intel Vtune?

I am using intel VTune to profile my program.
The CPU I am using is Ivy Bridge.
All the hardware instruction events can be found here:
https://software.intel.com/en-us/node/589933
FP_COMP_OPS_EXE.X87
Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTs, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a separate FADD instruction.
FP_COMP_OPS_EXE.X87 seems to include integer multiplication and integer division; however, there is no integer addition or integer subtraction there. I cannot find those two kinds of instructions on the above website either.
Can anyone tell me which event counts integer addition and integer subtraction instructions?
I'm reading a lot into your question, but here goes:
It's possible that if your code is computationally bound you could find ways to infer the significance of integer adds and subs without measuring them directly. For example, UOPS_RETIRED.ALL - FP_COMP_OPS_EXE.ALL would give you a very rough estimate of adds and subs, assuming that you've already done something to establish that your code is compute bound.
Have you? If not, it might help to start with VTune's basic analysis and then eliminate memory, cache and front end bottlenecks. If you've already done this, you have a few more options:
Cross-reference UOPS_DISPATCHED_PORT with an Ivy Bridge block diagram, or even better, a list of which specific types of arithmetic can execute on which ports (which I can't find).
Modify your program source, compiler flags or assembly, rerun a coarser-grained profile like basic analysis, and see whether you see an impact at the level of a measure like INST_RETIRED.ANY / CPU_CLK_UNHALTED.
Sorry there doesn't seem to be a more direct answer.

How to measure FLOPS

How do I measure FLOPS or IOPS? If I measure the time for an ordinary floating-point addition or multiplication, is that equivalent to FLOPS?
FLOPS means floating-point operations per second. To measure FLOPS, you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all the floating-point operations performed and divide that by the measured wall time. You should count all ordinary operations: additions, subtractions, multiplications, divisions (yes, even though they are slower and better avoided, they are still FLOPs). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure, you will likely have to look at the assembly.
FLOPS is not the same as operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, it still counts as two FLOPs. Similarly with SSE instructions: you count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to somebody else's FLOPS, especially the hardware vendors'. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations, so unless your code has those, you will never reach this peak. Either rethink the algorithm, or scale the peak hardware FLOPS by a correction factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplications, you would divide the peak by 2. Counting right might take your code from looking suboptimal to quite efficient without changing a single line of code.
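To make the counting concrete, here is a minimal timing sketch in which the FLOP count is known by construction (one multiply and one add per element, so 2*N FLOPs per pass). Verify against the generated assembly that the compiler hasn't changed the count:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Estimate FLOPS by timing a loop whose FLOP count we know by
// construction. Printing the accumulator keeps the loop from being
// optimized away entirely; still, check the assembly to be sure.
int main() {
    const std::size_t N = 1 << 24;
    std::vector<float> x(N, 1.000001f);
    float acc = 0.0f;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i)
        acc += x[i] * 1.5f;                 // 1 mul + 1 add = 2 FLOPs
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.3g GFLOPS (acc=%f)\n", 2.0 * N / s / 1e9, acc);
}
```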
You can use the CPU performance counters to get the CPU itself to count the number of floating-point operations it executes for your particular program. Then it is a simple matter of dividing this by the run time. On Linux the perf tools make this very easy; I have a write-up on the details on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOPs are not well defined: a mul FLOP is different from an add FLOP. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "Dhrystone MIPS" and floating point in "Linpack megaFLOPS". Here, "Dhrystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations per second. They're much the same, though in this case there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture-specific question. For a naive starting point, I would recommend finding out how many operations one multiplication takes on your specific hardware, then doing a large matrix multiplication and seeing how long it takes. From that you can easily estimate the FLOPS of your particular hardware.
The industry standard for measuring FLOPS is the well-known Linpack, or HPL (High-Performance Linpack); try looking at the source or running those yourself.
I would also refer to this answer as an excellent reference.

Speed of C++ operators / simple math

I'm working on a physics engine and feel it would help to have a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out unnecessary computations, but at what point are the computations small enough that comparative checks aren't necessary?
e.g.: Testing whether two line segments intersect. Should there be a check on whether they're near each other before going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take?
e.g.: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
e.g.: >, <, ==
You'll have to do profiling.
Basic operations, like additions or multiplications, should only take one asm instruction.
EDIT: As per the comments, although they take one asm instruction, multiplications can expand to several microinstructions.
Logarithms take longer.
Inequality checks also take one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will result in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance then, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.
As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
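For instance, here is a minimal SSE sketch of the vectorization idea, assuming for brevity that n is a multiple of 4 and the pointers are 16-byte aligned:

```cpp
#include <immintrin.h>

// Add two float arrays four lanes at a time with SSE intrinsics.
void add_arrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             // load 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 adds at once
    }
}
```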
However, as @Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm and making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.
Well, this depends on your hardware. Very nice tables of instruction latencies are at http://www.agner.org/optimize/instruction_tables.pdf
1. It depends on the code a lot. Also don't forget it doesn't depend only on the computations, but on how well the comparison results can be predicted.
2. Generally, addition/subtraction is very fast and multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it; see the sketch after this list). Library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example, sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozen cycles. Current processors do branch prediction on conditions. If the prediction is right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have code where the condition is met most of the time, it will be fast. Similarly, if you have code where the condition is not met most of the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
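A small illustration of the divide-by-constant advice from point 2 (the function name is made up for the example):

```cpp
#include <cstddef>

// Hoist the reciprocal out of the loop and multiply instead of
// dividing. Strict IEEE semantics differ slightly, which is why
// compilers only do this transformation themselves under -ffast-math.
void scale_by(float* v, std::size_t n, float c) {
    const float inv_c = 1.0f / c;   // one division, up front
    for (std::size_t i = 0; i < n; ++i)
        v[i] *= inv_c;              // cheap multiply per element
}
```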
This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around a lot, or are they static? You will have to run tests. Possible data structures for fast proximity checks are kd-trees or locality-sensitive hashes (there may be others). I am not sure whether these are appropriate for your application; you'd have to check that the maintenance of the data structure and the lookup cost are acceptable for you.
You will have to run tests. Consider checking whether you can use vectorization, or whether you can even run some of the computations on a GPU using CUDA or something like that.
Same as above - you have to test.
You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
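A hedged sketch of that rejection test (Point and Segment are hypothetical types for illustration): discard pairs whose x or y ranges don't overlap before running the full intersection math.

```cpp
#include <algorithm>

struct Point   { double x, y; };
struct Segment { Point a, b; };

// Axis-aligned bounding boxes overlap iff both the x-intervals and the
// y-intervals overlap: max(lo1, lo2) <= min(hi1, hi2) on each axis.
bool boxes_overlap(const Segment& s, const Segment& t) {
    return std::max(std::min(s.a.x, s.b.x), std::min(t.a.x, t.b.x)) <=
           std::min(std::max(s.a.x, s.b.x), std::max(t.a.x, t.b.x)) &&
           std::max(std::min(s.a.y, s.b.y), std::min(t.a.y, t.b.y)) <=
           std::min(std::max(s.a.y, s.b.y), std::max(t.a.y, t.b.y));
}
```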
As far as I know, all inequality checks take the same time.
Regarding the rest of the calculations, I would advise you to run some tests like the following (a code sketch follows after these notes):
- take time stamp A
- make 1,000,000 "+" calculations (or any other operation)
- take time stamp B
- calculate the diff between A and B
Then you can compare the calculations.
Keep in mind:
- using a different math lib may change the result (some math libs are more performance-oriented, some more precision-oriented)
- compiler optimizations may change it
- each processor does it differently
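Putting those steps into code, a rough sketch (the loop count and the operations are illustrative; results vary with compiler flags, math library, and CPU):

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Time 1,000,000 applications of an operation between two time stamps.
// The volatile sink keeps the compiler from optimizing the loop away.
template <class F>
double time_op(F op) {
    volatile double sink = 0.0;
    auto a = std::chrono::steady_clock::now();           // time stamp A
    for (int i = 1; i <= 1000000; ++i)
        sink = sink + op(static_cast<double>(i));
    auto b = std::chrono::steady_clock::now();           // time stamp B
    return std::chrono::duration<double>(b - a).count(); // diff of A and B
}

int main() {
    std::printf("add: %.4fs\n", time_op([](double x) { return x + 1.0; }));
    std::printf("log: %.4fs\n", time_op([](double x) { return std::log(x); }));
}
```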

Typical time of execution for elementary functions

It is well known that the processor instruction for multiplication takes several times more time than addition, and division is even worse (UPD: which is no longer true, see below). What about more complex operations like the exponential? How difficult are they?
Motivation: I am interested because it would help in algorithm design to estimate the performance-critical parts of algorithms at an early stage. Suppose I want to apply a set of filters to an image. One of them operates on the 3×3 neighborhood of each pixel, sums them, and takes atan. Another one sums more neighbouring pixels but does not use complicated functions. Which one would take longer to execute?
So, ideally I want approximate relative execution times of elementary operations, like "multiplication typically takes 5 times more time than addition; an exponential is about 100 multiplications". Of course, this is a matter of orders of magnitude, not exact values. I understand that it depends on the hardware and on the arguments, so let's say we measure the average time (in some sense) for floating-point operations on modern x86/x64. For operations that are not implemented in hardware, I am interested in the typical running time in the C++ standard library.
Have you seen any sources where such a thing was analyzed? Does this question make sense at all? Or can no rules of thumb like this be applied in practice?
First off, let's be clear. This:
It is well-known that processor instruction for multiplication takes several times more time than addition
is no longer true in general. It hasn't been true for many, many years, and needs to stop being repeated. On most common architectures, integer multiplies are a couple cycles and integer adds are single-cycle; floating-point adds and multiplies tend to have nearly equal timing characteristics (typically around 4-6 cycles latency, with single-cycle throughput).
Now, to your actual question: it varies with both the architecture and the implementation. On a recent architecture, with a well written math library, simple elementary functions like exp and log usually require a few tens of cycles (20-50 cycles is a reasonable back-of-the-envelope figure). With a lower-quality library, you will sometimes see these operations require a few hundred cycles.
For more complicated functions, like pow, typical timings range from high tens into the hundreds of cycles.
You shouldn't be concerned about this. Even if I tell you that a typical C library implementation of transcendental functions tends to take around 10 times a single floating-point addition/multiplication (or 50 floating-point additions/multiplications), and around 5 times a floating-point division, this wouldn't be useful to you.
Indeed, the way your processor schedules memory accesses will interfere badly with any premature optimization you'd do.
If, after profiling, you find that a particular implementation using transcendental functions is too slow, you can contemplate setting up a polynomial interpolation scheme. This will include a table and therefore will incur extra cache issues, so make sure to measure and not guess.
This will likely involve Chebyshev approximation. Read up on it; it is a particularly useful technique in this kind of domain.
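For illustration, a minimal sketch of such a scheme without the table: a degree-9 odd polynomial for atan on [-1, 1], with coefficients from a classic handbook fit (accurate to roughly 1e-5). For real use, derive coefficients with a Chebyshev/Remez fit for your own accuracy target:

```cpp
// Polynomial approximation of atan(x), valid for -1 <= x <= 1.
// Handle |x| > 1 via the identity atan(x) = pi/2 - atan(1/x).
inline double atan_approx(double x) {
    const double x2 = x * x;
    return x * (0.9998660 + x2 * (-0.3302995 + x2 * (0.1801410
             + x2 * (-0.0851330 + x2 * 0.0208351))));
}
```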
I have been told that compilers are quite bad at optimizing floating-point code. You may want to write custom assembly code.
Also, Intel Performance Primitives (if you are on Intel CPU) is something good to own if you are ready to trade off some accuracy for speed.
You could always start a second thread and time the operations. Most elementary operations don't have that much difference in execution time. The big difference is how many times they are executed. The big-O complexity is generally what you should be thinking about.