Faster way to multiply floats

Faster way to multiply floats - c++

We can do left shift operators in C/C++ for faster way to multiply integers with powers of 2.
But we cannot use left shift operators for floats or doubles because they are represented in different way, having an exponent component and a mantissa component.
My questions is that,
Is there any way? Like left shift operators for integers to faster multiply float numbers? Even with powers of 2??

No, you can't. But depending on your problem, you might be able to use SIMD instructions to perform one operation on several packed variables.. Read about the SSE2 instruction set.
http://en.wikipedia.org/wiki/SSE2
http://softpixel.com/~cwright/programming/simd/sse2.php
In any event, if you are optimizing floating-point multiplications, you are in 99% of the cases looking in the wrong place. Without going on a major rant regarding premature optimization, at least justify it by performing proper profiling.

You could do this:
float f = 5.0;
int* i = (int*)&f;
*i += 0x00800000;
But then you have the overhead of moving the float out of the register, into memory, then back into a different register, only to be flushed back to memory ... about 15 or so cycles more than if you'd just done fmul. Of course, that's even assuming your system has IEEE floats at all.
Don't try to optimize this. You should look at the rest of your program to find algorithmic optimizations instead of trying to discover ways to microoptimize things like floats. It will only end in blood and tears.

Truly, any decent compiler would recognize static-time power-of-two constants and use the smartest operation.

In Microsoft Visual C++, don't forget the "floating point model" switch. The default is /fp:precise but you can change it to /fp:fast. The fast model trades some floating point accuracy for more speed. In some cases, the speedups can be drastic (the blog post referenced below notes speedups as high as x5 in some cases). Note that Xbox games are compiled with the /fp:fast switch by default.
I just switched from /fp:precise to /fp:fast on a math-heavy application of mine (with many float multiplications) and got an immediate 27% speedup with almost no loss in accuracy across my test suite.
Read the Microsoft blog post regarding the details of this switch here. It seems that the main reasons to not enable this would be if you need all the accuracy available (eg, games with large worlds, long-running simulations where errors may accumulate) or you need robust double or float NaN processing.
Lastly, also consider enabling the SSE2 instruction extensions. This gave an extra 3% boost in my application. The effects of this will vary depending on the number of operands in your arithmetic—for example, these extensions can provide speedup in cases where you are adding or multiplying more than 2 numbers together at a time.

Related

Built-in type efficiency

Under The most efficient types second here
...and when defining an object to store a floating point number, use the double type, ... The double type is two to three times less efficient than the float type...
Seems like it's contradicting itself?
And I read elsewhere (can't remember where) that computations involving ints are faster than shorts on many machines because they are converted to ints to perform the operations? Is this true? Any links on this?

One can always argue about the quality of the contents on the site you link to. But the two quotes you refer to:
...and when defining an object to store a floating point number, use the double type, ...
and
... The double type is two to three times less efficient than the float type...
Refer to two different things, the first hints that using doubles will give much less problems due to the increased precision, while the other talks about performance. But honestly I wouldn't pay too much attention to that, chance is that if your code performs suboptimal it is due to incorrect choice of algorithm rather than wrong choice of primitive data type.
Here is a quote about performance comparison of single and double precision floats from one of my old teachers: Agner Fog, who has a lot of interesting reads over at his website: http://www.agner.org about software optimizations, if you are really interested in micro optimizations go take a look at it:
In most cases, double precision calculations take no more time than single precision. When the floating point registers are used, there is simply no difference in speed between single and double precision. Long double precision takes only slightly more time. Single precision division, square root and mathematical functions are calculated faster than double precision when the XMM registers are used, while the speed of addition, subtraction, multiplication, etc. is still the same regardless of precision on most processors (when vector operations are not used).
source: http://agner.org/optimize/optimizing_cpp.pdf
While there might be different variations for different compilers, and different processors, the lesson one should learn from it, is that most likely you do not need to worry about optimizations at this level, look at choice of algorithm, even data container, not the primitive data type.

These optimizations are negligible unless you are writing software for space shuttle launches (which recently have not been doing too well). Correct code is far more important than fast code. If you require the precision, using doubles will barely affect the run time.
Things that affect execution time way more than type definitions:
Complexity - The more work there is to do, the more slowly the code will run. Reduce the amount of work needed, or break it up into smaller, faster tasks.
Repetition - Repetition can often be avoided and will inevitably ruin code performance. It comes in many guises-- for example, failing to cache the results of expensive calculations or of remote procedure calls. Every time you recompute, you waste efficiency. They also extend the executable size.
Bad Design - Self explanatory. Think before you code!
I/O - A program whose execution is blocked waiting for input or output (to and from the user, the disk, or a network connection) is bound to perform badly.
There are many more reasons, but these are the biggest. Personally, bad design is where I've seen most of it happen. State machines that could have been stateless, dynamic allocation where static would have been fine, etc. are the real problems.

Depending on the hardware, the actual CPU (or FPU if you like) performance of double is somewhere between half the speed and same speed on modern CPU's [for example add or subtract is probably same speed, multiply or divide may be different for larger type], when compared to float.
On top of that, there are "fewer per cache-line", so if when there is a large number of them, it gets slower still because memory speed is slower. Per cache-line, there are half as many double values -> about half the performance if the application is fully memory bound. It will be much less of a factor in a CPU-bound application.
Similarly, if you use SSE or similar SIMD technologies, the double will take up twice as much space, so the number of actual calculation with be half as many "per instruction", and typically, the CPU will allow the same number of instructions per cycle for both float and double - except for some operations that take longer for double. Again, leading to about half the performance.
So, yes, I think the page in the link is confusing and mixing up the ideal performance setup between double and float. That is, from a pure performance perspective. It is often much easier to get noticeable calculation errors when using float - which can be a pain to track down - so starting with double and switching to float if it's deemed necessary because you have identified it as a performance issue (either from experience or measurements).
And yes, there are several architectures where only one size integer exists - or only two sizes, such as 8-bit char and 32-bit int, and 16-bit short would be simulated by performing the 32-bit math, and then dropping the top part of the value. For example MIPS has only got 32-bit operations, but can store and load 16-bit values to memory. It doesn't necessarily make it slower, but it certainly means that it's "not faster".

Convert all doubles to integers for better performance, is it just a rumor?

I have a very complicated and sophisticated data fitting program which uses the Levenverg-Marquardt algorithm to do fitting in double precision (basically the fitting class is templatized, but I use instantiate it to doubles). The fitting process involves:
Calculating an error function (chi-square)
Solving a system of linear equations (I use lapack for that)
calculating the derivatives of a function with respect to the parameters, which I want to fit to the data (usually 20+ parameters)
calculating the function value continuously: the function is a complicated combination of a sinusoidal and exponential functions with a few harmonics.
A colleague of mine has suggested that I use integers for at least 10 times faster at least. My questions are:
Is that true that I will get that kind of improvement?
Is it safe to convert everything to integers? And what are the drawbacks to this?
What advice would you have for this whole issue? What would you do?
The program is developed to calculate some parameters from the signal online, which means that the program must be as fast as possible, but I'm wondering whether it's worth it to start the project of converting everything to integers.

The amount of improvement depends on your platform. For example, if your platform has a fast floating point coprocessor, performing arithmetic in floating point may be faster than integral arithmetic.
You may be able to get more performance gain by optimizing your algorithms rather than switching to integer arithmetic.
Another method for boosting performance is to reduce data cache hits and also reducing branches and loops.
I would measure performance of the program to find out where the bottlenecks are and then review the sections that where most of the performance takes place. For example, in my embedded system, micro-optimizations like what you are suggesting, saved 3 microseconds. This gain is not worth the effort to retest the entire system. If it works, don't fix it. Concentrate on correctness and robustness first.

The bottom line here is that you have to test it and decide for yourself. Profile a release build using real data.
1- Is that true that I will get that kind of improvement?
Maybe yes, maybe no. It depends on a number of factors, such as
How long it takes to convert from double to int
How big a word is on your machine
What platform/toolset you're using and what optimizations you have enabled
(Maybe) how big a cache line is on your platform
How fast your memory is
How fast your platform computes floating-point versus integer.
And who knows what else. In short, too many complex variables for anyone to be able to say for sure if you will or will not improve performance.
But I would be highly skeptical about your friend's claim, "at least 10 times faster at least."
2- Is it safe to convert everything to integers? And what are the
drawbacks to this?
It depends on what you're converting and how. Obviously converting a value like 123.456 to an integer is decidedly unsafe.
Drawbacks include loss of precision, loss of accuracy, and the expense in terms of space and time to actually do the conversions. Another significant drawback is the fact that you have to write a substantial amount of code, and every line of code you write is a probable source of new bugs.
3- What advice would you have for this whole issue? What would you do?
I would step back & take a deep breath. Profile your code under real-world conditions. Identify the sources of the bottlenecks. Find out what the real problems are, and if there even are any.
Identify inefficiencies in your algorithms, and fix them.
Throw hardware at the problem.
Then you can endeavor to start micro-optimizing. This would be my last resort, especially if the optimization technique you are considering would require writing a lot of code.

First, this reeks of attempting to optimize unnecessarily.
Second, doubles are a minimum of 64-bits. ints on most systems are 32-bits. So you have a couple of choices: truncate the double (which reduces your precision to a single), or store it in the space of 2 integers, or store it as an unsigned long long (which is at least 64-bits as well). For the first 2 options, you are facing a performance hit as you must convert the numbers back and forth between the doubles you are operating on and the integers you are storing it as. For the third option, you are not gaining any performance increase (in terms of memory usage) as they are basically the same size - so you'd just be converting them to integers for no reason.
So, to get to your questions:
1) Doubtful, but you can try it to see for yourself.
2) The problem isn't storage as the bits are just bits when they get into memory. The problem is the arithmetic. Since you stated you need double precision, attempting to do those operations on an integer type will not give you the results you are looking for.
3) Don't optimize until it has been proven something needs to have a performance improvement. And always remember Amdahl's Law: Make the common case fast and the rare case correct.

What I would do is:
First tune it in single-thread mode (by the random-pausing method) until you can't find any way to reduce cycles. The kinds of things I've found are:
a large fraction of time spent in library functions like sin, cos, exp, and log where the arguments were often unchanged, so the answers would be the same. The solution for that is called "memoizing", where you figure out a place to store old values of arguments and results, and check there first before calling the function.
In calling library functions like DGEMM (lapack matrix-multiply) that one would assume are optimized to the teeth, they are actually spending a large fraction of time calling a function to determine if the matrices are upper or lower triangle, square, symmetric, or whatever, rather than actually doing the multiplication. If so, the answer is obvious - write a special routine just for your situation.
Don't say "but I don't have those problems". Of course - you probably have different problems - but the process of finding them is the same.
Once you've made it as fast as possible in single-thread, then figure out how to parallelize it. Multi-threading can have high overhead, so it's best not to tightly-couple the threads.
Regarding your question about converting from doubles to integers, the other answers are right on the money. It only makes sense in very particular situations.

How to measure FLOPS

How do I measure FLOPS or IOPS? If I do measure time for ordinary floating point addition / multiplication , is it equivalent to FLOPS?

FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all floating point operations and divide that over the measured wall time. You should count all ordinary operations like additions,subtractions,multiplications,divisions (yes, even though they are slower and better avoided, they are still FLOPs..). Be careful how you count! What you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure you will likely have to look at the assembly..
FLOPS is not the same as Operations per second. So even though some architectures have a single MAD (multiply-and-add) instruction, those still count as two FLOPs. Similarly the SSE instructions. You count them as one instruction, though they perform more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to sb. elses FLOPS, especially the hardware vendors. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will not ever get this performance. Either rethink the algorithm, or modify the peak hardware FLOPS by a correct factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplication, you would divide it by 2. Counting right might get your code from suboptimal to quite efficient without changing a single line of code..

You can use the CPU performance counters to get the CPU to itself count the number of floating point operations it uses for your particular program. Then it is the simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily, I have a writeup on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html

FLOP's are not well defined. mul FLOPS are different than add FLOPS. You have to either come up with your own definition or take the definition from a well-known benchmark.

Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "drystone MIPS" and floating point in "Linpack megaFLOPS". In these, "drystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).

This is a highly architecture specific question, for a naive/basic/start start I would recommend to find out how many Operations 1 multiplication take's on your specific hardware then do a large matrix multiplication , and see how long it takes. Then you can eaisly estimate the FLOP of your particular hardware
the industry standard of measuring flops is the well known Linpack or HPL high performance linpack, try looking at the source or running those your self
I would also refer to this answer as an excellent reference

fp:precise vs. fp:strict performance

I detected some differences in my program results between Release and Debug versions. After some research I realized that some floating point optimizations are causing those differences. I have solved the problem by using the fenv_access pragma for disabling some optimizations for some critical methods.
Thinking about it, I realized that it is probably better to use the fp:strict model instead of fp:precise in my program because of its characteristics, but I am worried about performance. I have tried to find some information about the performance issues of fp:strict or the differences in performance between precise and strict, model but I have found very little information.
Does anyone know anything about this??
Thanks in advance.

This happens because you are compiling in 32-bit mode, it uses the x86 floating point processor. The code optimizer removes redundant moves from the FPU registers to memory and back, leaving intermediary results in the FPU stack. A pretty important optimization.
Problem is, the FPU stores doubles with 80 bits of precision. Instead of the 64 bits of precision of a double. Intel originally assumed this was a feature, producing more accurate intermediary calculations but it is really a bug. They didn't make the same mistake when they designed the SSE2 instruction set, used by 64 bit compilers to do floating point math. The XMM registers are 64 bits.
So in the release mode build you get subtly different results since the calculations are performed with more bits. This should never be a problem in a program that uses floating point values to calculate, a double can only store 15 significant digits. What's different are the noise digits, the ones beyond the first 15 digits. But sometimes less if your calculation loses significant digits badly. Like calculating 1 - 3 * (1/3.0).
But yeah, you can use fp:precise to get consistent noise digits. It forces the intermediate values to be flushed to memory so they cannot remain in the FPU with 80 bits of precision. It makes your code slow of course.

I am not sure if this is a solution but is what I have :)
As I have post previously I have wrote a test program that performs floating point operations that is said to be optimized under fp:precise and not under fp:strict and then measure performance. I run it 10000 times and, in average, fp:strict is 2.85% slower than fp:precise.

Just offering my two cents:
I have an image processing program that autovectorizes, the aim was to compare the performance and accuracy taking matlab as a gold standard.
Using VS2012 and an Intel i950.
Critical region error & runtime
2.3328196e-02 465 ms with strict
7.1277611e-02 182 ms with precise
7.1277611e-02 188 ms with fast
strict did not vecotrization
Using strict slowed the code down by 2x. Which was not acceptable.

It's absolutely normal to see performance difference between a Debug and Release version.
The compiler and run-times will do a lot more additional sanity checks in debug version; don't compare one to the other, especially in regards to performance; compare release vs. release with different compiler switches.
On the other hand, if the results are different between the 2 versions, then you will have to go in and check for programming errors (most probably).
Max.

x86 4byte floats vs. 8byte doubles (vs. long long)?

We have a measurement data processing application and currently all data is held as C++ float which means 32bit/4byte on our x86/Windows platform. (32bit Windows Application).
Since precision is becoming an issue, there have been discussions to move to another datatype. The options currently discussed are switching to double (8byte) or implementing a fixed decimal type on top of __int64 (8byte).
The reason the fixed-decimal solution using __int64 as underlying type is even discussed is that someone claimed that double performance is (still) significantly worse than processing floats and that we might see significant performance benefits using a native integer type to store our numbers. (Note that we really would be fine with fixed decimal precision, although the code would obviously become more complex.)
Obviously we need to benchmark in the end, but I would like to ask whether the statement that doubles are worse holds any truth looking at modern processors? I guess for large arrays doubles may mess up cache hits more that floats, but otherwise I really fail to see how they could differ in performance?

It depends on what you do. Additions, subtractions and multiplies on double are just as fast as on float on current x86 and POWER architecture processors. Divisions, square roots and transcendental functions (exp, log, sin, cos, etc.) are usually notably slower with double arguments, since their runtime is dependent on the desired accuracy.
If you go fixed point, multiplies and divisions need to be implemented with long integer multiply / divide instructions which are usually slower than arithmetic on doubles (since processors aren't optimized as much for it). Even more so if you're running in 32 bit mode where a long 64 bit multiply with 128 bit results needs to be synthesized from several 32-bit long multiplies!
Cache utilization is a red herring here. 64-bit integers and doubles are the same size - if you need more than 32 bits, you're gonna eat that penalty no matter what.

Look it up. Both and Intel publish the instruction latencies for their CPUs in freely available PDF documents on their websites.
However, for the most part, performance won't be significantly different, or a couple of reasons:
when using the x87 FPU instead of SSE, all floating point operations are calculated at 80 bits precision internally, and then rounded off, which means that the actual computation is equally expensive for all floating-point types. The only cost is really memory-related then (in terms of CPU cache and memory bandwidth usage, and that's only an issue in float vs double, but irrelevant if you're comparing to int64)
with or without SSE, nearly all floating-point operations are pipelined. When using SSE, the double instructions may (I haven't looked this up) have a higher latency than their float equivalents, but the throughput is the same, so it should be possible to achieve similar performance with doubles.
It's also not a given that a fixed-point datatype would actually be faster either. It might, but the overhead of keeping this datatype consistent after some operations might outweigh the savings. Floating-point operations are fairly cheap on a modern CPU. They have a bit of latency, but as mentioned before, they're generally pipelined, potentially hiding this cost.
So my advice:
Write some quick tests. It shouldn't be that hard to write a program that performs a number of floating-point ops, and then measure how much slower the double version is relative to the float one.
Look it up in the manuals, and see for yourself if there's any significant performance difference between float and double computations

I've trouble the understand the rationale "as double as slower than float we'll use 64 bits int". Guessing performance has always been an black art needing much of experience, on today hardware it is even worse considering the number of factors to take into account. Even measuring is difficult. I know of several cases where micro-benchmarks lent to one solution but in context measurement showed that another was better.
First note that two of the factors which have been given to explain the claimed slower double performance than float are not pertinent here: bandwidth needed will the be same for double as for 64 bits int and SSE2 vectorization would give an advantage to double...
Then consider than using integer computation will increase the pressure on the integer registers and computation units when apparently the floating point one will stay still. (I've already seen cases where doing integer computation in double was a win attributed to the added computation units available)
So I doubt that rolling your own fixed point arithmetic would be advantageous over using double (but I could be showed wrong by measures).

Implementing 64 fixed points isn't really fun. Especially for more complex functions like Sqrt or logarithm. Integers will probably still a bit faster for simple operations like additions. And you'll need to deal with integer overflows. And you need to be careful when implementing rounding, else errors can easily accumulate.
We're implementing fixed points in a C# project because we need determinism which floatingpoint on .net doesn't guarantee. And it's relatively painful. Some formula contained x^3 bang int overflow. Unless you have really compelling reasons not to, use float or double instead of fixedpoint.
SIMD instructions from SSE2 complicate the comparison further, since they allow operation on several floating point numbers(4 floats or 2 doubles) at the same time. I'd use double and try to take advantage of these instructions. So double will probably be significantly slower than floats, but comparing with ints is difficult and I'd prefer float/double over fixedpoint is most scenarios.

It's always best to measure instead of guess. Yes, on many architectures, calculations on doubles process twice the data as calculations on floats (and long doubles are slower still). However, as other answers, and comments on this answer, have pointed out, the x86 architecture doesn't follow the same rules as, say, ARM processors, SPARC processors, etc. On x86 floats, doubles and long doubles are all converted to long doubles for computation. I should have known this, because the conversion causes x86 results to be more accurate than SPARC and Sun went through a lot of trouble to get the less accurate results for Java, sparking some debate (note, that page is from 1998, things have since changed).
Additionally, calculations on doubles are built in to the CPU where calculations on a fixed decimal datatype would be written in software and potentially slower.
You should be able to find a decent fixed sized decimal library and compare.

With various SIMD instruction sets you can perform 4 single precision floating point operations at the same cost as one, essentially you pack 4 floats into a single 128 bit register. When switching to doubles you can only pack 2 doubles into these registers and hence you can only do two operations at the same time.

As many people have said, a 64bit int is probably not worth it if double is an option. At least when SSE is available. This might be different on micro controllers of various kinds but I guess that is not your application. If you need additional precision in long sums of floats, you should keep in mind that this operation is sometimes problematic with floats and doubles and would be more exact on integers.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js