I was wondering whether double might be faster than float on some machines.
However, the operations I am performing really only require the precision of a float. They are image-processing operations, and I want to use the fastest type possible.
Can I use float everywhere and trust that the optimizing VC++ 2008 compiler will convert it to double if it deems that more appropriate? I don't see how this would break code.
Thanks in advance!
No, the compiler will not change a fundamental type like float to a double for optimization.
If you think this is likely, use a typedef for your floating-point type in a common header, e.g. typedef float FASTFLOAT;, and use FASTFLOAT (or whatever you name it) throughout your code. You can then change that one central typedef to change the type throughout your code, as sketched below.
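For example, a minimal sketch (the header name and the names FASTFLOAT/blend are placeholders, not from the question):

    // fastfloat.h -- shared by all the image-processing code
    #ifndef FASTFLOAT_H
    #define FASTFLOAT_H

    typedef float FASTFLOAT;   // change to double here if it ever turns out to be faster

    // Example of a routine written against the typedef rather than a concrete type.
    inline FASTFLOAT blend(FASTFLOAT a, FASTFLOAT b)
    {
        return (a + b) * FASTFLOAT(0.5);
    }

    #endif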
My own experience is that float and double are basically comparable in performance on x86/x64 platforms now for math operations, and I tend to prefer double. If you are processing a lot of data (and hitting memory bandwidth issues, instead of computationally bound), you may get some performance benefit from the fact that floats are half the size of doubles.
You will also want to explore the effects of the various optimization flags. Depending on your target platform requirements, you may be able to optimize more aggressively.
Firstly, the compiler doesn't change float types unless it has to, and never in storage declarations.
float will be no slower than double, but if you really want fast processing, you need to look into either using a compiler that can generate SSE2 or SSE3 code or you need to write your heavy-processing routines using those instructions. IIRC, there are tools that can help you micromanage the processor's pipeline if necessary. Last I messed with this (years ago), Intel had a library called IPP that could help as well by vectorizing your math.
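As a rough illustration of the kind of inner loop that benefits, here is a sketch only (it assumes 16-byte-aligned data and a pixel count that is a multiple of 4; the function name is made up):

    #include <cstddef>
    #include <xmmintrin.h>   // SSE

    // Scale a buffer of float pixels, four at a time.
    void scale_pixels(float* pixels, float factor, std::size_t count)
    {
        __m128 f = _mm_set1_ps(factor);
        for (std::size_t i = 0; i < count; i += 4)
        {
            __m128 p = _mm_load_ps(pixels + i);            // load 4 floats
            _mm_store_ps(pixels + i, _mm_mul_ps(p, f));    // multiply, store back
        }
    }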
I have never heard of an architecture where float was slower than double, if only for the fact that memory bandwidth requirements double if you use double. Any FPU that can do a single-cycle double operation can do a single-cycle float operation with a little modification at most.
Mark's got a good idea, though: profile your code if you think it's slow. You might find the real problem is somewhere else, like hidden typecasts or function-call overhead from something you thought was inlined not getting inlined.
When the code needs to store a variable in memory, on most architectures a float will take 32 bits and a double 64 bits. Silently converting between those storage sizes would get in the way of optimization rather than help it.
Are you sure that the floating point math is the bottleneck in your application? Perhaps profiling would reveal another possible source of improvement.
I have to multiply a vector of integers with another vector of integers, and then add the result (so a vector of integers) to a vector of floating-point values.
Should I use MMX or SSE4 for the integers, or can I just use SSE for all of these values (even the integers), putting the integers in __m128 registers?
Indeed, I often keep integers in __m128 registers, and I don't know whether I am wasting time (implicitly converting values) or whether it makes no difference.
I am compiling with -O3 option.
You should probably just use SSE for everything (MMX is just a very outdated precursor to SSE). If you're going to be targeting mainly newer CPUs then you might even consider AVX/AVX2.
Start by implementing everything cleanly and robustly in scalar code, then benchmark it. It's possible that a scalar implementation will be fast enough, and you won't need to do anything else. Furthermore, gcc and other compilers (e.g. clang, ICC, even Visual Studio) are getting reasonably good at auto-vectorization, so you may get SIMD-vectorized code "for free" that meets your performance needs. However if you still need better performance at this point then you can start to convert your scalar code to SSE. Keep the original scalar implementation for validation and benchmarking purposes though - it's very easy to introduce bugs when optimising code, and it's useful to know how much faster your optimised code is than the baseline code (you're probably looking for somewhere between 2x and 4x faster for SSE versus scalar code).
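A hedged sketch of what that inner loop might look like once you do convert it to SSE (this assumes SSE4.1, 16-byte-aligned pointers, and a count that is a multiple of 4; the function name is mine):

    #include <cstddef>
    #include <smmintrin.h>   // SSE4.1

    // Multiply two int vectors, convert the product to float, add into a float vector.
    void mul_int_add_float(const int* a, const int* b, float* acc, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 4)
        {
            __m128i ia   = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
            __m128i ib   = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
            __m128i prod = _mm_mullo_epi32(ia, ib);      // SSE4.1 32-bit integer multiply
            __m128  fp   = _mm_cvtepi32_ps(prod);        // explicit int -> float conversion
            __m128  sum  = _mm_add_ps(_mm_load_ps(acc + i), fp);
            _mm_store_ps(acc + i, sum);
        }
    }

Note that the integers stay in __m128i registers and are converted to float explicitly; keeping integer bit patterns in __m128 does not do that conversion for you.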
While the previous answer is reasonable, there is one significant difference: data organization. For direct SSE use, the data is better organized as Structure-of-Arrays (SoA). Typically, scalar code is built around an Array-of-Structures (AoS) layout; if that is the case, converting from the scalar to the vectorized form will be difficult (a small illustration follows after the link below).
More reading https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions
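As a small illustration of the difference (names and layout assumptions are mine): with AoS data you typically have to transpose registers before you can do SoA-style math, whereas with SoA storage the loads already give you pure x/y/z/w vectors and the transpose disappears.

    #include <xmmintrin.h>

    // Load four AoS points (x,y,z,w interleaved) and transpose them into
    // SoA-style registers. With SoA storage this transpose step is not needed.
    void load_four_points_aos(const float* p, __m128& xs, __m128& ys,
                              __m128& zs, __m128& ws)
    {
        xs = _mm_loadu_ps(p +  0);   // x0 y0 z0 w0
        ys = _mm_loadu_ps(p +  4);   // x1 y1 z1 w1
        zs = _mm_loadu_ps(p +  8);   // x2 y2 z2 w2
        ws = _mm_loadu_ps(p + 12);   // x3 y3 z3 w3
        _MM_TRANSPOSE4_PS(xs, ys, zs, ws);   // xs = x0..x3, ys = y0..y3, ...
    }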
I'm writing a program that depends a lot on complex additions and multiplications. I wanted to know whether I should use gsl_complex or std::complex.
I don't seem to find a comparison online of how much better GSL complex arithmetic is as compared to std::complex. A rudimentary google search didn't help me find a benchmarks page for GSL complex either.
I wrote a 20-line program that generates two random arrays of complex numbers (1e7 of them) and then checked how long addition and multiplication took, using clock() from <ctime>. Using this method (without compiler optimisation), I found that gsl_complex_add and gsl_complex_mul are almost twice as fast as std::complex<double>'s + and * respectively. But I've never done this sort of thing before, so is this even the way you check which is faster?
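A stripped-down sketch of that kind of timing test (not the exact 20-line program; the GSL calls assume the standard gsl_complex_math.h API, and you would link with -lgsl -lgslcblas):

    #include <complex>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>
    #include <gsl/gsl_complex.h>
    #include <gsl/gsl_complex_math.h>

    int main()
    {
        const std::size_t N = 1000000;   // the real test used 1e7
        std::vector<std::complex<double> > a(N), b(N), c(N);
        std::vector<gsl_complex> ga(N), gb(N), gc(N);

        for (std::size_t i = 0; i < N; ++i) {
            double re = std::rand() / double(RAND_MAX);
            double im = std::rand() / double(RAND_MAX);
            a[i]  = std::complex<double>(re, im);  b[i]  = std::complex<double>(im, re);
            ga[i] = gsl_complex_rect(re, im);      gb[i] = gsl_complex_rect(im, re);
        }

        std::clock_t t0 = std::clock();
        for (std::size_t i = 0; i < N; ++i) c[i] = a[i] * b[i];
        std::clock_t t1 = std::clock();
        for (std::size_t i = 0; i < N; ++i) gc[i] = gsl_complex_mul(ga[i], gb[i]);
        std::clock_t t2 = std::clock();

        // Print something derived from the results so the loops cannot be discarded.
        std::printf("check: %f %f\n", c[N - 1].real(), GSL_REAL(gc[N - 1]));
        std::printf("std::complex mul: %f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
        std::printf("gsl_complex  mul: %f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }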
Any links or suggestions would be helpful. Thanks!
EDIT:
Okay, so I tried again with a -O3 flag, and now the results are extremely different! std::complex<float>::operator+ is more than twice as fast as gsl_complex_add, while gsl_complex_mul is about 1.25 times as fast as std::complex<float>::operator*. If I use double, gsl_complex_add is about 30% faster than std::complex<double>::operator+ while std::complex<double>::operator* is about 10% faster than gsl_complex_mul. I only need float-level precision, but I've heard that double is faster (and memory is not an issue for me)! So now I'm really confused!
Turn on optimisations.
Any library or set of functions that you link against will have been compiled WITH optimisation (unless the developers are named Kermit, the Swedish Chef, Miss Piggy (project manager) and Cookie Monster (tester) - in other words, unless the development team is a bunch of Muppets).
Since std::complex uses templates, it is compiled with whatever compiler settings you give, so that code will be unoptimized if you don't enable optimisation. So your question really becomes "Why is function X faster than function Y that does the same thing, when function X is compiled with optimisation and Y is compiled without?" - and the answer should be obvious: "Optimisation works nearly all of the time!" (If optimisation didn't work most of the time, compiler developers would have a MUCH easier time.)
Edit: So my above point has just been proven. Note that since templates can inline the code, it is often more efficient than an external library (because the compiler can just insert the instructions straight into the flow, rather than calling out to another function).
As to float vs. double, the only time that float is slower than double is if there is ONLY double hardware available, with two functions added to "shorten" and "lengthen" between float and double. I'm not aware of any such hardware. double has more bits, so it SHOULD take longer.
Edit2:
When it comes to choosing "one solution over another", there are so many factors. Performance is one (and in some cases, the most important, in other cases not). Other aspects are "ease of use", "availability", "fit for the project", etc.
If you look ONLY at performance, you can sometimes run simple benchmarks to determine that one solution is better or worse than another. But for complex libraries [not "real-and-imaginary" complex numbers, but rather "complicated" ones], there are sometimes optimisations for dealing with large amounts of data; a less sophisticated solution will not achieve the same performance on large data, because less effort has been spent on solving the "big data" type of problem. So, if you have a "simple" benchmark that does some basic calculations on a small dataset, and in reality you are going to run much bigger datasets, the small benchmark MAY not reflect reality.
And there is no way that I, or anyone else, can tell you which solution will give the best performance on YOUR system with YOUR datasets, unless we have access to your datasets, know exactly which calculations you are performing (that is, pretty much have your code), and have experience running that with both "packages".
And going on to the rest of the criteria ("ease of use", etc), those are much more "personal opinion" based, so wouldn't be a good fit for an SO question in the first place.
The answer depends not only on the optimization flags, but also on the compiler used to compile the GSL library and your particular code. Example: if you compile GSL with gcc and your program with icc, then you may see a (significant) difference (I have done this test with std::pow vs gsl_pow). Also, the standard makefile generated by ./configure does not compile GSL with aggressive floating-point optimizations (for example, it does not include gcc's -ffast-math flag) because some GSL routines (the differential equation solver, for example) fail their stringent accuracy tests when these optimizations are present.
One of the great points about GSL is the modularity of the library. If you don't need double accuracy, you can compile gsl_complex.h, gsl_complex_math.h and math.c separately with aggressive floating-point optimizations (however, you need to delete the line #include <config.h> in math.c). Another strategy is to compile a separate version of the whole library with aggressive floating-point optimizations and test whether the accuracy is acceptable for your particular problem (that is my favorite approach).
EDIT: I forgot to mention that gsl_complex.h also has a float version of gsl_complex
    typedef struct
    {
        float dat[2];
    }
    gsl_complex_float;
I have been reading game engine books since I was 14 (at that time I didn't understand a thing :P).
Now, quite some years later, I want to start programming the mathematical basis for my game engine. I've been thinking for a long time about how to design this 'library' (by which I mean an "organized set of files"). Every few years new SIMD instruction sets come out, and I wouldn't want them to go to waste. (Tell me if I am wrong about this.)
I wanted to at least have the following properties:
Making it able to check at runtime whether SIMD is available, and use the SIMD version if it is and the plain C++ version if it isn't. (This might have some call overhead; is it worth it?)
Making it able to compile for either SIMD or plain C++ when the target is already known at compile time, so the calls can be inlined and are suitable for cross-optimisation because the compiler knows whether SIMD or C++ is used.
EDIT - I want to make the source code portable, so it can also run on devices other than x86(-64).
So I thought it would be a good solution to use function pointers, which I would make static and initialize at the start of the program, and through which the suitable functions (for example matrix/vector multiplication) would be called.
What do you think are the advantages and disadvantages of this design (which outweighs the other?), and is it even possible to create it with both of the properties described above?
Christian
It's important to get the right granularity at which you make decision on which routine to call. If you do this at too low a level then function dispatch overhead becomes a problem, e.g. a small routine which just has a few instructions could become very inefficient if called via some kind of function pointer dispatch mechanism rather than say just being inlined. Ideally the architecture-specific routines should be processing a reasonable amount of data so that function dispatch cost is negligible, without being so large that you get significant code bloat due to compiling additional non-architecture-specific code for each supported architecture.
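For instance, a minimal sketch of coarse-grained dispatch (all names here are placeholders; the CPU check itself is whatever mechanism you end up using):

    #include <cstddef>

    // The pointer selects a routine that processes a whole batch, so the cost
    // of the indirect call is paid once per batch rather than once per vector.
    typedef void (*TransformFn)(float* dst, const float* src, std::size_t count);

    void transform_scalar(float* dst, const float* src, std::size_t count);
    void transform_simd(float* dst, const float* src, std::size_t count);

    static TransformFn g_transform = 0;

    void init_dispatch(bool have_simd)   // result of your runtime CPU-feature check
    {
        g_transform = have_simd ? transform_simd : transform_scalar;
    }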
The simplest way to do this is to compile your game twice, once with SIMD enabled, once without. Create a tiny little launcher app that performs the _may_i_use_cpu_feature checks, and then runs the correct build.
The double indirection caused by calling a matrix multiply (for example) via a function pointer is not going to be nice. Instead of inlining trivial maths functions, it'll introduce function calls all over the shop, and those calls will be forced to save/restore a lot of registers to boot (because the code behind the pointer is not known until runtime).
At that point, a non-optimised version without the double indirection will massively outperform the SSE version with function pointers.
As for supporting multiple platforms, this can be easy, and it can also be a real bother. ARM NEON is similar enough to SSE4 to make it worth wrapping the instructions behind some macros; however, NEON is also different enough to be really annoying!
    #if CPU_IS_INTEL
    #include <immintrin.h>
    typedef __m128 f128;            // 4 packed floats in an SSE register
    #define add4f _mm_add_ps
    #else
    #include <arm_neon.h>
    typedef float32x4_t f128;       // 4 packed floats in a NEON register
    #define add4f vaddq_f32
    #endif
The MAJOR problem with starting on, say, Intel and porting to ARM later is that a lot of the nice things don't exist. Shuffling is possible on ARM, but it's also a bother. Division, dot product, and sqrt don't exist on ARM (only reciprocal estimates, on which you'll need to run your own Newton iteration).
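For example, a hedged sketch of how a four-wide divide is usually built on NEON (hardware estimate plus Newton-Raphson refinement; the helper name is mine):

    #include <arm_neon.h>

    static inline float32x4_t div4f(float32x4_t num, float32x4_t den)
    {
        float32x4_t recip = vrecpeq_f32(den);                // rough estimate of 1/den
        recip = vmulq_f32(recip, vrecpsq_f32(den, recip));   // first Newton step
        recip = vmulq_f32(recip, vrecpsq_f32(den, recip));   // second Newton step
        return vmulq_f32(num, recip);                        // num * (1/den)
    }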
If you are thinking about SIMD like this:
    struct Vec4
    {
        float x;
        float y;
        float z;
        float w;
    };
Then you may just be able to wrap SSE and NEON behind a semi-ok wrapper. When it comes to AVX512 and AVX2 though, you'll probably be screwed.
If however you are thinking about SIMD using structure-of-array formats:
    struct Vec4SOA
    {
        float x[BIG_NUM];
        float y[BIG_NUM];
        float z[BIG_NUM];
        float w[BIG_NUM];
    };
Then there is a chance you'll be able to produce an AVX2/AVX512 version. However, working with code organised like that is not the easiest thing in the world.
I have a series of C++ signal-processing classes which use 32-bit floats as their primary sample datatype. For example, all the oscillator classes return floats for every sample that's requested. This is the same for all the classes: all sample calculations are done in floating point.
I am porting these classes to iOS, and for performance reasons I want to operate in 8.24 fixed point to get the most out of the processor; word has it there are major performance advantages on iOS to crunching integers instead of floats. I'm currently doing all the calculations in floats and then converting to SInt32 at the final stage before output, which means every sample has to be converted at that final stage.
Do I simply change the datatype used inside my classes from Float to SInt32, so that my oscillators and filters etc. calculate in fixed point by passing SInt32s around internally instead of floats?
Is it really this simple, or do I have to completely rewrite all the different algorithms?
Is there any other voodoo I need to understand before taking on this mission?
Many thanks to anyone who finds the time to comment on this; it's much appreciated.
It's mostly a myth. Floating-point performance used to be slow if you compiled for armv6 in Thumb mode; this is not an issue on armv7, which supports Thumb-2 (I'll avoid further discussion of armv6, which is no longer supported in Xcode). You also want to avoid using doubles, since floats can use the faster NEON (a.k.a. Advanced SIMD) unit; it's easy to use doubles accidentally, so try enabling -Wshorten.
I also doubt you'll get significantly better performance from an 8.24 multiply, especially compared with making use of the NEON unit. Changing float to int/int32_t/SInt32 will also not automatically do the necessary shifts for an 8.24 multiply (see the sketch at the end of this answer).
If you know that converting floats to ints is the slow bit, consider using some of the functions in Accelerate.framework, namely vDSP_vfix16() or vDSP_vfixr16().
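To make the shift point concrete, here is a minimal sketch of what an 8.24 multiply involves (type and function names are illustrative, not from any Apple API):

    #include <cstdint>

    typedef int32_t fixed824;   // 8 integer bits, 24 fraction bits

    // An 8.24 multiply needs a 64-bit intermediate product and a shift back by
    // the 24 fraction bits; swapping float for SInt32 will not add this for you.
    static inline fixed824 fixed_mul(fixed824 a, fixed824 b)
    {
        return (fixed824)(((int64_t)a * (int64_t)b) >> 24);
    }

    static inline fixed824 float_to_fixed(float f)
    {
        return (fixed824)(f * (float)(1 << 24));   // 1.0f -> 0x01000000
    }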
What is faster (performance-wise):
    __int64 x, y;
    x = y;
or
    int x, y, a, b;
    x = a;
    y = b;
?
Or are they equal?
__int64 is a non-standard compiler extension, so whilst it may or may not be faster, you don't want to use it if you want cross-platform code. Instead, you should consider #include <cstdint> and using uint64_t etc. These come from the C99 standard, which provides stdint.h and inttypes.h for fixed-width integer arithmetic.
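For example, a minimal sketch of the portable spellings:

    #include <cstdint>

    std::uint64_t x = 0;       // exactly 64 bits, unsigned
    std::int64_t  y = -1;      // exactly 64 bits, signed
    std::int_fast32_t n = 0;   // at least 32 bits, whichever width is fastest here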
In terms of performance, it depends on the system you are on; on x86_64, for example, you should not see any performance difference between adding 32-bit and 64-bit integers, since the add instruction can handle 32-bit or 64-bit registers.
However, if you're running code on a 32-bit platform or compiling for a 32-bit architecture, adding 64-bit integers will actually require adding two 32-bit registers instead of one. So if you don't need the extra space, it would be wasteful to allocate it.
I have no idea if compilers can or do optimise the types down to a smaller size if necessary. I expect not, but I'm no compiler engineer.
I hate this sort of question.
1) If you don't know how to measure for yourself then you almost certainly don't need to know the answer.
2) On modern processors it's very hard to predict how fast something will be based on single instructions, it's far more important to understand your program's cache usage and overall performance than to worry about optimising silly little snippets of code that use an assignment. Let the compiler worry about that and spend your time improving the algorithm used or other things that will have a much bigger impact.
So in short, you probably can't tell and it probably doesn't matter. It's a silly question.
The compiler's optimizer will remove all the code in your example, so in that way there is no difference. I think what you really want to know is whether it is faster to move data 32 bits at a time or 64 bits at a time. If your data is aligned to 8 bytes and you are on a 64-bit machine, then it should be faster to move data 8 bytes at a time. There are several caveats to that, however. You may find that your compiler is already doing this optimization for you (you would have to look at the emitted assembly code to be sure), in which case you would see no difference. Also, consider using memcpy instead of rolling your own if you are moving a lot of data. And if you are considering casting an array of 32-bit ints to 64-bit in order to copy faster or do some other operation in half the number of instructions, be sure to Google the strict aliasing rule first.
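To illustrate the memcpy suggestion (a hedged sketch; the function name is mine):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Copy via memcpy rather than casting int32_t* to a 64-bit pointer type.
    // The compiler expands this to the widest moves the target supports,
    // and there is no strict-aliasing violation.
    void copy_samples(std::int32_t* dst, const std::int32_t* src, std::size_t n)
    {
        std::memcpy(dst, src, n * sizeof(std::int32_t));
    }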
__int64 should be faster on most platforms, but be careful: some architectures require it to be aligned to 8 bytes for this to take effect, and some will even crash your app if it isn't aligned.