Profiling computations in Turbo Pascal and Turbo C

I've recently been doing some tasks for university which involve using Turbo Profiler (the software is mandated by the assignment, sadly) to profile C and Pascal implementations of Simpson's rule for numerical integration. I came across a very strange case, where Pascal is suspiciously much faster than C.
Pascal:
i: integer; lower, delta_x: real;
....
(0.0000 seconds) (30 times) x:=lower+delta_x*(2.0*i-1.0);
C:
long i; double lower, delta_x;
....
(0.0549 seconds) (30 times) double x = lower + delta_x * (2.0 * i - 1.0);
So, what could it be: the difference between real and double (and integer and long), or is Pascal's compiler just better at processing math operations?

Pascal's REAL is like FLOAT in C: an alias for the fastest floating-point type on the given system.
So the two fragments are not equivalent: in Pascal the most suitable type is used, while in C you hardcode double, the highest-precision type (if we ignore 80-bit floats).
In TP, real by default means a 48-bit software floating-point type, but many later programs add {$N+}, which maps it onto the x87 double.
I don't know Turbo C that well, but it could be that your (64-bit) double type is emulated (depending on settings), which would explain the slowdown, since emulating a floating-point type with more significant digits is obviously slower. Or worse, you are benchmarking a hardware FPU against software emulation somewhere.

Don't believe those numbers. If you want to measure time, put a loop of 10^6 or 10^9 iterations around the top subroutine, and count seconds. If you want to see what fraction of time goes into that statement, use stack-sampling.
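For example, a minimal sketch of that kind of measurement for the C statement above (plain clock()-based timing; the loop counts and the running sink that keeps the work from being optimized away are my own choices, not part of the assignment):

#include <stdio.h>
#include <time.h>

int main(void) {
    double lower = 0.0, delta_x = 0.01, sink = 0.0;
    clock_t t0 = clock();
    for (long rep = 0; rep < 10000000L; rep++) {
        for (long i = 1; i <= 30; i++) {
            double x = lower + delta_x * (2.0 * i - 1.0);
            sink += x;   /* consume x so the compiler cannot drop the loop body */
        }
    }
    printf("%.3f s, sink = %g\n", (double)(clock() - t0) / CLOCKS_PER_SEC, sink);
    return 0;
}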

Related

Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

I have made a function g that approximates another function; it gives results accurate to 5 decimals (1.23456xxxxxxxxxxxx, where the x positions are just rounding errors / junk).
To avoid spreading error to other computations that will use the results of g, I would like to just set all the x positions to zero, or better yet, set to 0 everything after the 5th decimal.
I haven't found anything in the X87 and SSE literature that lets me play with IEEE 754 bits or their representation the way I would like to.
There is an old reference to the FISTP instruction for X87, apparently mirrored in the SSE world by FISTTP, with the benefit that FISTTP doesn't necessarily modify the control word and is therefore faster.
I have noticed that FISTTP was called "chopping mode", but in more modern literature it is just "rounding toward zero" or "truncate", and this confuses me because "chopping" means removing something altogether, whereas "rounding toward zero" doesn't necessarily mean the same thing to me.
I don't need to round to zero, I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory; how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?
As several people commented, rounding earlier doesn't help the final result be more accurate. If you want to read more about floating-point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the series' index:
We’ve finally reached the point in this series that I’ve been waiting for. In this post I am going to share the most crucial piece of floating-point math knowledge that I have. Here it is:
[Floating-point] math is hard.
You just won’t believe how vastly, hugely, mind-bogglingly hard it is. I mean, you may think it’s difficult to calculate when trains from Chicago and Los Angeles will collide, but that’s just peanuts to floating-point math.
(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)
How you could actually implement your bad idea:
There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.
Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.
So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down (like chux's round()-based function, but I'd recommend C99 double rint(double) instead of round(); round has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code).
The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.
FISTP (Floating-point Integer STore, and Pop the x87 stack) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (the default is round to nearest), which is the same semantics as those ISO C standard functions.
FISTTP (Floating-point Integer STore with Truncation, and Pop the x87 stack) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.
This is what the "not modify the control word" stuff is about. Neither instruction does that, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.
Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.
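To make that concrete, here are the portable C/C++ spellings that compilers turn into those instructions (a sketch only; the exact instructions chosen depend on the target and compiler flags):

#include <cmath>
#include <cstdio>

int main() {
    double x = 2.7;
    long   a = std::lrint(x);        // current rounding mode (default nearest) -> 3; may become FIST or CVTSD2SI
    long   b = (long)x;              // C truncation toward zero                -> 2; may become FISTTP or CVTTSD2SI
    double c = std::nearbyint(x);    // double -> double rounding, no integer round-trip
    std::printf("%ld %ld %g\n", a, b, c);
    return 0;
}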
PS. if you didn't recognize Bruce's HHGTG reference, the original is:
Space is big. Really big. You just won’t believe how vastly hugely mindbogglingly big it is. I mean you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.
how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?
The following uses neither X87 nor SSE. I've included it as a community reference for general-purpose code. If anything, it can be used to test an X87 solution.
Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerably so, as OP said "To avoid spreading error to other computations ...".
It is unclear whether OP wants "accurate results up to 5 decimals" to mean absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). I will assume absolute precision.
Since float and double are most often binary floating point, any "chop" will yield the FP number nearest to a multiple of 0.00001.
Text Method:
// needs <stdio.h>, <stdlib.h>, <float.h>
// - x xxx...xxx . xxxxx \0
char buf[1 + 1 + DBL_MAX_10_EXP + 1 + 5 + 1];
sprintf(buf, "%.5f", x);
x = atof(buf);
round() rint() method:
// needs <math.h>, <float.h>
#define SCALE 100000.0
if (fabs(x) < DBL_MAX / SCALE) {
    x = x * SCALE;
    x = rint(x) / SCALE;
}
Direct bit manipulation of x. Simply zero select bits in the significand.
TBD code.

Are C/C++ library functions and operators the most optimal ones?

So, at the divide & conquer course we were taught:
Karatsuba multiplication
Fast exponentiation
Now, given 2 positive integers a and b, is operator::* faster than karatsuba(a,b), and is pow(a,b) faster than
int fast_expo(int Base, int exp)
{
    if (exp == 0) {
        return 1;
    }
    if (exp == 1) {
        return Base;
    }
    int half = fast_expo(Base, exp / 2);   // compute the half power once, so the work stays O(log exp)
    if (exp % 2 == 0) {
        return half * half;
    }
    else {
        return Base * half * half;
    }
}
I ask this because I wonder whether these serve just a teaching purpose, or whether they are already implemented under the hood in C/C++.
Karatsuba multiplication is a special technique for large integers. It is not comparable to the built in C++ * operator which multiplies together operands of basic type like int and double.
To take advantage of Karatsuba, you have to be using multi-precision integers made up of at least around 8 words. (512 bits, if these are 64 bit words). The break-even point at which Karatsuba becomes advantageous is at somewhere between 8 and 24 machine words, according to the accepted answer to this question.
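For reference, here's what the technique itself looks like. This is a toy sketch on ordinary 64-bit integers, splitting by powers of 10 purely to illustrate the recursion; real libraries split multi-word integers by machine words, and at this size a plain * is far faster:

#include <cstdint>
using std::uint64_t;

uint64_t karatsuba(uint64_t x, uint64_t y) {
    if (x < 10 || y < 10) return x * y;                     // base case: a single-digit operand
    int n = 0;                                              // digit count of the larger operand
    for (uint64_t t = (x > y ? x : y); t > 0; t /= 10) ++n;
    uint64_t p = 1;                                         // split point: 10^(n/2)
    for (int i = 0; i < n / 2; ++i) p *= 10;
    uint64_t xh = x / p, xl = x % p;                        // x = xh*p + xl
    uint64_t yh = y / p, yl = y % p;                        // y = yh*p + yl
    uint64_t z0 = karatsuba(xl, yl);
    uint64_t z2 = karatsuba(xh, yh);
    uint64_t z1 = karatsuba(xl + xh, yl + yh) - z0 - z2;    // three recursive multiplies instead of four
    return z2 * p * p + z1 * p + z0;                        // only valid while the product fits in 64 bits
}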
The pow function which works with a pair of floating-point operands of type double, is not comparable to your fast_expo, which works with operands of type int. They are different functions with different requirements. With pow, you can calculate the cube root of 5: pow(5, 1/3.0). If that's what you would like to calculate, then fast_expo is of no use, no matter how fast.
There is no guarantee that your compiler or C library's pow is absolutely the fastest way for your machine to exponentiate two double-precision floating-point numbers.
Optimization claims in floating-point can be tricky, because it often happens that multiple implementations of the "same" function do not give exactly the same results down to the last bit. You can probably write a fast my_pow that is only good to five decimal digits of precision, and in your application, that approximation might be more than adequate. Have you beat the library? Hardly; your fast function doesn't meet the requirements that would qualify it as a replacement for the pow in the library.
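As an illustration of that trade-off, here is a hypothetical my_pow (the name and approach are mine, not a library facility): a fast, low-precision power built from single-precision exp/log, valid only for positive bases and nowhere near the accuracy the library pow guarantees:

#include <cmath>
#include <cstdio>

// hypothetical: roughly float-accurate, and only defined for base > 0
static float my_pow(float base, float exponent) {
    return std::exp(exponent * std::log(base));   // a^b = e^(b * ln a)
}

int main() {
    std::printf("%f vs %f\n", my_pow(5.0f, 1.0f / 3.0f), std::pow(5.0, 1.0 / 3.0));
    return 0;
}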
operator::* and other standard operators usually map to the primitives provided by the hardware. In case such primitives don't exist (e.g. 64-bit long long on IA32), the compiler emulates them at a performance penalty (gcc does that in libgcc).
Same for std::pow. It is part of the standard library and isn't mandated to be implemented in a certain way. GNU libc implements pow(a,b) as exp(log(a) * b). exp and log are quite long and written for optimal performance with IEEE754 floating point in mind.
As for your suggestions:
Karatsuba multiplication for smaller numbers isn't worth it. The multiply machine instruction provided by the processor is already optimized for speed and power usage for the standard data types in use. With bigger numbers, 10-20 times the register capacity, it starts to pay off:
In the GNU MP Bignum Library, there used to be a default KARATSUBA_THRESHOLD as high as 32 for non-modular multiplication (that is, Karatsuba was used when n>=32w with typically w=32); the optimal threshold for modular exponentiation tending to be significantly higher. On modern CPUs, Karatsuba in software tends to be non-beneficial for things like ECDSA over P-256 (n=256, w=32 or w=64), but conceivably useful for much wider modulus as used in RSA.
Here is a list of the multiplication algorithms GNU MP uses and their respective thresholds.
Fast exponentiation doesn't apply to non-integer powers, so it's not really comparable to pow.
A good way to check the speed of an operation is to measure it. If you run through the calculation a billion or so times and see how long it takes to execute, you have your answer.
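A minimal sketch of such a measurement, assuming the fast_expo from the question is pasted in above (the loop count, the i % 5 trick to keep the compiler from constant-folding, and the volatile sink are just benchmarking hygiene, not part of the comparison):

#include <chrono>
#include <cmath>
#include <cstdio>

int fast_expo(int Base, int exp);                   // definition: see the question above

int main() {
    using clk = std::chrono::steady_clock;
    volatile long long sink = 0;                    // keeps the loops from being optimized away

    auto t0 = clk::now();
    for (int i = 0; i < 100000000; i++)
        sink += (long long)std::pow(double(2 + i % 5), 7.0);
    auto t1 = clk::now();
    for (int i = 0; i < 100000000; i++)
        sink += fast_expo(2 + i % 5, 7);
    auto t2 = clk::now();

    std::printf("pow:       %.3f s\n", std::chrono::duration<double>(t1 - t0).count());
    std::printf("fast_expo: %.3f s\n", std::chrono::duration<double>(t2 - t1).count());
    return 0;
}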
One thing to note: I'm led to believe that % is fairly expensive. There is a much faster way to check if something is divisible by 2:
int check_div_two(int number)
{
    return (number & 0x01) == 0;   // even numbers have a clear low bit
}
This way you've just done a single bit test against a mask, which I'd assume is a less expensive op. (In practice, compilers already turn exp % 2 == 0 into exactly this kind of bit test.)
The * operator for built-in types will almost certainly be implemented as a single CPU multiplication instruction. So ultimately this is a hardware question, not a language question. Longer code sequences, perhaps function calls, might be generated in cases where there's no direct hardware support.
It's safe to assume that chip manufacturers (Intel, AMD, et al) expend a great deal of effort making arithmetic operations as efficient as possible.

Integer division, or float multiplication?

If one has to calculate a fraction of a given int value, say:
int j = 78;
int i = 5* j / 4;
Is this faster than doing:
int i = 1.25*j; // ?
If it is, is there a conversion factor one could use to decide which to use, as in: how many int divisions can be done in the same time as one float multiplication?
Edit: I think the comments make it clear that the floating point math will be slower, but the question is, by how much? If I need to replace each float multiplication by N int divisions, for what N will this not be worth it anymore?
You've said all the values are dynamic, which makes a difference. For the specific values 5 * j / 4, the integer operations are going to be blindingly fast, because pretty much the worst case is that the compiler optimises them to two shifts and one addition, plus some messing around to cope with the possibility that j is negative. If the CPU can do better (single-cycle integer multiplication or whatever) then the compiler typically knows about it. The limits of compilers' abilities to optimize this kind of thing basically come when you're compiling for a wide family of CPUs (generating lowest-common-denominator ARM code, for example), where the compiler doesn't really know much about the hardware and therefore can't always make good choices.
I suppose that if a and b are fixed for a while (but not known at compile time), then it's possible that computing k = double(a) / b once and then int(k * x) for many different values of x, might be faster than computing a * x / b for many different values of x. I wouldn't count on it.
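A sketch of that idea, just to make the two variants concrete (the function and container are arbitrary; whether the hoisted double version actually wins is exactly what you'd have to measure):

#include <vector>

// scale every x by a/b, where a and b stay fixed for the whole batch
void scale_all(std::vector<int>& xs, int a, int b) {
    const double k = double(a) / b;     // one division, done once
    for (int& x : xs)
        x = int(k * x);                 // one multiply plus conversions per element
        // integer alternative, per element: x = a * x / b;  (watch for overflow in a*x)
}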
If all the values vary each time, then it seems unlikely that the floating-point division to compute the 1.25, followed by floating-point multiplication, is going to be any faster than the integer multiplication followed by integer division. But you never know, test it.
It's not really possible to give simple relative timings for this on modern processors, it really depends a lot on the surrounding code. The main costs in your code often aren't the "actual" ops: it's "invisible" stuff like instruction pipelines stalling on dependencies, or spilling registers to stack, or function call overhead. Whether or not the function that does this work can be inlined might easily make more difference than how the function actually does it. As far as definitive statements of performance are concerned you can basically test real code or shut up. But the chances are that if your values start as integers, doing integer ops on them is going to be faster than converting to double and doing a similar number of double ops.
It is impossible to answer this question out of context. Additionally 5*j/4 does not generally produce the same result as (int) (1.25*j), due to properties of integer and floating-point arithmetic, including rounding and overflow.
If your program is doing mostly integer operations, then the conversion of j to floating point, multiplication by 1.25, and conversion back to integer might be free because it uses floating-point units that are not otherwise engaged.
Alternatively, on some processors, the operating system might mark the floating-point state to be invalid, so that the first time a process uses it, there is an exception, the operating system saves the floating-point registers (which contain values from another process), restores or initializes the registers for your process, and returns from the exception. This would take a great deal of time, relative to normal instruction execution.
The answer also depends on characteristics of the specific processor model the program is executing on, as well as the operating system, how the compiler translates the source into assembly, and possibly even what other processes on the system are doing.
Also, the performance difference between 5*j/4 and (int) (1.25*j) is most often too small to be noticeable in a program unless it or operations like it are repeated a great many times. (And, if they are, there may be huge benefits to vectorizing the code, that is, using the Single Instruction Multiple Data [SIMD] features of many modern processors to perform several operations at once.)
In your case, 5*j/4 would be much faster than 1.25*j, because division by a power of 2 can be done with a simple right shift, and 5*j can be done in a single instruction on many architectures, such as LEA on x86 or ADD with a shifted operand on ARM. Most others would require at most 2 instructions, like j + (j >> 2), but even then it's probably still faster than a floating-point multiplication. Moreover, doing int i = 1.25*j needs two conversions (int to double and back) plus cross-domain data movement, which is generally quite costly.
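For illustration, here's that spelled out for non-negative values (a sketch only; the compiler already emits the sign-correct equivalent for 5*j/4 by itself, so there is normally no reason to write this by hand):

// 5*j/4 without multiply or divide instructions, valid for unsigned values
// that don't overflow; signed values need an extra rounding adjustment
unsigned scale_5_over_4(unsigned j) {
    unsigned five_j = j + (j << 2);   // 5*j as shift+add (a single LEA on x86)
    return five_j >> 2;               // /4 as a right shift
}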
In other cases, when the fraction is not exactly representable in binary floating point (like 3*j/10), using integer multiply/divide would be more correct (because 0.3 isn't exactly 0.3 in floating point), and most probably faster (because the compiler can optimize division by a constant into a multiplication).
In cases where i and j are of a floating-point type, multiplying by another floating-point value might be faster, because, as said above, moving values between the float and int domains and converting between int and float both take time.
An important difference is that 5*j/4 will overflow if j is too large, but 1.25*j doesn't.
That said, there's no general answer to "which is faster" and "how much faster", as it depends on the specific architecture and the specific context. You must measure on your system and decide. But if an expression is applied repeatedly to a lot of values, then it's time to move to SIMD.
See also
Why is int * float faster than int / int?
Should I use multiplication or division?
Floating point division vs floating point multiplication

Double versus float

I have a constant for pi in my code:
const float PI = acos(-1);
Would it be better to declare it as a double? An answer to another question on this site said floating point operations aren't exactly precise, and I'd like the constant to be accurate.
"precise" is not a boolean concept. float provides a certain amount of precision. Whether or not that amount is sufficient for your application depends on, well, your application.
Most applications don't need more precision than float provides, though many prefer to use double to (try and) gloss over problems with unstable algorithms, or "just because", due to misconceptions like "floating point operations aren't exactly precise".
In most cases when a float is "not precise enough", the problem is not float, it's the code that uses it.
Edit: That being said, most modern CPUs only do calculations in double precision or greater anyway, so you might as well use double unless you're working with large arrays and memory usage is an issue.
From the standard:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
Of the three (notice that this goes hand in hand with the 3 versions of acos), you should choose long double if what you are aiming for is precision (but you should also know that beyond a certain point, further precision may be redundant).
So you should use this to get the most precise result from acos:
long double result = acos(-1.0L);
(Note: There might be some platform specific types or some user defined types which provide more precision)
I'd like the constant to be accurate.
There is no such thing as a perfectly accurate floating point value. They cannot be stored with perfect precision, because of their representation in memory; that is only possible with integers. double gives you double the precision a float offers (who would have guessed). double should fit your needs in almost every case.
I would recommend using M_PI from <cmath>, which should be available in all POSIX compliant implementations of the standard.
It depends exactly how precise you need to be. I've never had to use doubles because floats were not precise enough.
The most accurate representation of pi is M_PI from math.h
The question boils down to: how much accuracy do you need?
Let's quote wikipedia:
For example, the decimal representation of π truncated to 11 decimal places is good enough to estimate the circumference of any circle that fits inside the Earth with an error of less than one millimetre, and the decimal representation of π truncated to 39 decimal places is sufficient to estimate the circumference of any circle that fits in the observable universe with precision comparable to the radius of a hydrogen atom.
I've written a small java program, here's its output:
As string: 3.14159265358979323846264338327950288419716939937510
As double: 3.141592653589793
As float: 3.1415927
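(Not the poster's Java program, but an equivalent C++ snippet that prints the same comparison:)

#include <cstdio>

int main() {
    const long double pi = 3.14159265358979323846264338327950288L;
    std::printf("long double: %.21Lf\n", pi);
    std::printf("double     : %.21f\n", (double)pi);
    std::printf("float      : %.21f\n", (double)(float)pi);   // narrow to float first, then print
    return 0;
}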
Remember that if you want the full precision of a double, all the numbers you're calculating with also need to be doubles. (That is not entirely true, but it is close enough.)
For most applications, float would do just fine for PI. Double definitely has more precision, but it doesn't guarantee exactness any more than float can. By that I mean that a value such as 0.1 has no exact binary representation; if you try to represent it, you'll only succeed out to the nth digit, where n is determined by how many bits you use to represent that number.
Unfortunately to contain many digits of PI, you'd probably need to hold it in a string. Though now we're talking about some impressive number crunching here that you might see in molecule simulations. You're probably not going to need that level of precision.
As this site says, there are three overloaded versions of the acos function.
Therefore the call acos(-1) is ambiguous.
Having said that, you should declare PI as long double to avoid any loss of precision, by using
const long double PI = acos(-1.0L);

Is using double faster than float?

Double values store higher precision and are double the size of a float, but are Intel CPUs optimized for floats?
That is, are double operations just as fast or faster than float operations for +, -, *, and /?
Does the answer change for 64-bit architectures?
There isn't a single "Intel CPU", especially in terms of which operations are optimized relative to others! But most of them, at the CPU level (specifically within the FPU), are such that the answer to your question:
are double operations just as fast or faster than float operations for +, -, *, and /?
is "yes" -- within the CPU, except for division and sqrt which are somewhat slower for double than for float. (Assuming your compiler uses SSE2 for scalar FP math, like all x86-64 compilers do, and some 32-bit compilers depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division were just as slow for double).
For example, Haswell has a divsd throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv is 8 to 18 cycle throughput. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)
The float versions of many library functions like logf(float) and sinf(float) will also be faster than log(double) and sin(double), because they have many fewer bits of precision to get right. They can use polynomial approximations with fewer terms to get full precision for float vs. double
However, taking up twice the memory for each number clearly implies heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.
Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integers-only), especially suitable for simple ops on lots of data ("SIMD", single instruction / multiple data), where each vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.
In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large;-) benchmarks, you'll find advantage to sticking with single precision (assuming of course that you don't need the extra bits of precision!-).
If all floating-point calculations are performed within the FPU, then, no, there is no difference between a double calculation and a float calculation because the floating point operations are actually performed with 80 bits of precision in the FPU stack. Entries of the FPU stack are rounded as appropriate to convert the 80-bit floating point format to the double or float floating-point format. Moving sizeof(double) bytes to/from RAM versus sizeof(float) bytes is the only difference in speed.
If, however, you have a vectorizable computation, then you can use the SSE extensions to run four float calculations in the same time as two double calculations. Therefore, clever use of the SSE instructions and the XMM registers can allow higher throughput on calculations that only use floats.
Another point to consider is whether you are using a GPU (the graphics card). I work on a project that is numerically intensive, yet we do not need the precision that double offers. We use GPU cards to help further speed up the processing. CUDA GPUs need a special package to support double, and the amount of local RAM on a GPU is quite fast, but quite scarce. As a result, using float also doubles the amount of data we can store on the GPU.
Yet another point is memory. Floats take half as much RAM as doubles. If you are dealing with VERY large datasets, this can be a really important factor. If using double means you have to cache to disk vs. pure RAM, your difference will be huge.
So for the application I am working with, the difference is quite important.
I just want to add to the already existing great answers that the __m256? family of single-instruction-multiple-data (SIMD) C++ intrinsic functions operates on either 4 doubles in parallel (e.g. _mm256_add_pd), or 8 floats in parallel (e.g. _mm256_add_ps).
I'm not sure whether this translates into an actual speed-up, but it does mean twice as many floats can be processed per instruction when SIMD is used.
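A minimal sketch of that width difference (it needs an AVX-capable CPU and -mavx or /arch:AVX; the values are arbitrary):

#include <immintrin.h>
#include <cstdio>

int main() {
    __m256  f = _mm256_add_ps(_mm256_set1_ps(1.5f), _mm256_set1_ps(2.5f));   // 8 float adds in one instruction
    __m256d d = _mm256_add_pd(_mm256_set1_pd(1.5),  _mm256_set1_pd(2.5));    // 4 double adds in one instruction

    float  fout[8];
    double dout[4];
    _mm256_storeu_ps(fout, f);
    _mm256_storeu_pd(dout, d);
    std::printf("%f %f\n", fout[0], dout[0]);   // both 4.0, but 8 vs 4 lanes were computed
    return 0;
}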
In experiments of adding 3.3 for 2000000000 times, results are:
Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double
So double is faster and the default in C and C++. It's more portable and the default across all C and C++ library functions. Also, double has significantly higher precision than float.
Even Stroustrup recommends double over float:
"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."
Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, because float is smaller: double is 8 bytes and float is 4 bytes.
The only really useful answer is: only you can tell. You need to benchmark your scenarios. Small changes in instruction and memory patterns could have a significant impact.
It will certainly matter whether you are using the FPU or SSE-type hardware (the former does all its work with 80-bit extended precision, so double will be closer; the latter is natively 32-bit, i.e. float).
Update: s/MMX/SSE/ as noted in another answer.
Alex Martelli's answer is good enough, but I want to mention a wrong but somewhat popular test method that may have misled some people:
#include <cstdio>
#include <ctime>

int main() {
    const auto start_clock = clock();
    float a = 0;
    for (int i = 0; i < 256000000; i++) {
        // bad latency benchmark that includes as much division as other operations
        a += 0.11;   // note the implicit conversions of a to double to match 0.11
        a -= 0.13;   // rather than 0.11f
        a *= 0.17;
        a /= 0.19;
    }
    printf("c++ float duration = %.3f\n",
           (double)(clock() - start_clock) / CLOCKS_PER_SEC);
    printf("%.3f\n", a);
    return 0;
}
It's wrong! C++ defaults to double: if you replace += 0.11 with += 0.11f, float will usually be faster than double, on an x86 CPU.
By the way, on the modern SSE instruction set, float and double have the same speed, except for the division operation, in the CPU core itself. float, being smaller, may cause fewer cache misses if you have arrays of them.
And if the compiler can auto-vectorize, float vectors work on twice as many elements per instruction as double.
Floating point is normally an extension to one's general purpose CPU. The speed will therefore be dependent on the hardware platform used. If the platform has floating point support, I will be surprised if there is any difference.
Previous answers are missing a factor that can cause a big difference (> 4x) between float and double: denormals.
Avoiding denormal values in C++
Since double has a much wider normal range, for a problem that contains many small values there is a much higher probability of falling into the denormal range with float than with double, so float can be much slower than double in this case.
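A tiny illustration of why float gets there first (not a timing benchmark, just the classification; the constants are arbitrary small values):

#include <cmath>
#include <cstdio>

int main() {
    float  f = 1e-37f;   // still a normal float (FLT_MIN is about 1.18e-38)
    double d = 1e-37;    // comfortably inside double's normal range
    f /= 16;             // now subnormal (denormal) as a float ...
    d /= 16;             // ... still a perfectly normal double
    std::printf("float : %g  subnormal? %d\n", (double)f, std::fpclassify(f) == FP_SUBNORMAL);
    std::printf("double: %g  subnormal? %d\n", d, std::fpclassify(d) == FP_SUBNORMAL);
    return 0;
}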
In addition, here is some real benchmark data to get a glimpse:
For Intel 3770k, GCC 9.3.0 -O2 [3]
Run on (8 X 3503 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
BM_FloatCreation 0.281 ns 0.281 ns 1000000000
BM_DoubleCreation 0.284 ns 0.281 ns 1000000000
BM_Vector3FCopy 0.558 ns 0.562 ns 1000000000
BM_Vector3DCopy 5.61 ns 5.62 ns 100000000
BM_Vector3F_CopyDefault 0.560 ns 0.546 ns 1000000000
BM_Vector3D_CopyDefault 5.57 ns 5.56 ns 112178768
BM_Vector3F_Copy123 0.841 ns 0.817 ns 897430145
BM_Vector3D_Copy123 5.59 ns 5.42 ns 112178768
BM_Vector3F_Add 0.841 ns 0.834 ns 897430145
BM_Vector3D_Add 5.59 ns 5.46 ns 100000000
BM_Vector3F_Mul 0.842 ns 0.782 ns 897430145
BM_Vector3D_Mul 5.60 ns 5.56 ns 112178768
BM_Vector3F_Compare 0.840 ns 0.800 ns 897430145
BM_Vector3D_Compare 5.61 ns 5.62 ns 100000000
BM_Vector3F_ARRAY_ADD 3.25 ns 3.29 ns 213673844
BM_Vector3D_ARRAY_ADD 3.13 ns 3.06 ns 224357536
where operations on 3 floats (F) or 3 doubles (D) are compared, and
- BM_Vector3XCopy is the pure copy of a (1,2,3)-initialized vector, not re-initialized before each copy,
- BM_Vector3X_CopyDefault is with default initialization repeated before every copy,
- BM_Vector3X_Copy123 is with repeated initialization of (1,2,3),
- Add/Mul each initialize 3 vectors (1,2,3) and add/multiply the first and second into the third,
- Compare checks two initialized vectors for equality,
- ARRAY_ADD sums up vector(1,2,3) + vector(3,4,5) + vector(6,7,8) via std::valarray, which in my case leads to SSE instructions.
Remember that these are isolated tests and the results differ with compiler settings, from machine to machine or architecture to architecture.
With caching (issues) and real world use-cases this may be completely different. So the theory can greatly differ from reality.
The only way to find out is a practical test such as with google-benchmark[1] and checking the result of the compiler output for your particular problem solution[2].
[1] https://github.com/google/benchmark
[2] https://sourceware.org/binutils/docs/binutils/objdump.html -> objdump -S
[3] https://github.com/Jedzia/oglTemplate/blob/dd812b72d846ae888238d6f726d503485b796b68/benchmark/Playground/BM_FloatingPoint.cpp