Integer/floating-point values with SSE - C++

I have to multiply a vector of integers with another vector of integers, and then add the result (so a vector of integers) to a vector of floating-point values.
Should I use MMX or SSE4 for the integers, or can I just use SSE for all of these values (even the integers), putting the integers in __m128 registers?
In fact, I already often put integers in __m128 registers, and I don't know whether I am wasting time (through implicit conversions) or whether it makes no difference.
I am compiling with -O3 option.

You should probably just use SSE for everything (MMX is just a very outdated precursor to SSE). If you're going to be targeting mainly newer CPUs then you might even consider AVX/AVX2.
Start by implementing everything cleanly and robustly in scalar code, then benchmark it. It's possible that a scalar implementation will be fast enough, and you won't need to do anything else. Furthermore, gcc and other compilers (e.g. clang, ICC, even Visual Studio) are getting reasonably good at auto-vectorization, so you may get SIMD-vectorized code "for free" that meets your performance needs. However if you still need better performance at this point then you can start to convert your scalar code to SSE. Keep the original scalar implementation for validation and benchmarking purposes though - it's very easy to introduce bugs when optimising code, and it's useful to know how much faster your optimised code is than the baseline code (you're probably looking for somewhere between 2x and 4x faster for SSE versus scalar code).
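If you do end up hand-vectorizing this particular loop, the inner step could look something like the sketch below. It assumes SSE4.1 is available (for _mm_mullo_epi32; plain SSE2 would need a workaround), the function and array names are made up for illustration, and the length is assumed to be a multiple of four.
#include <emmintrin.h>   // SSE2: integer loads/stores, int -> float conversion
#include <smmintrin.h>   // SSE4.1: _mm_mullo_epi32

// Multiply two int vectors element-wise, convert the products to float,
// and add them to a float vector. n is assumed to be a multiple of 4.
void mul_int_add_float(const int* a, const int* b, const float* c,
                       float* out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i va    = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb    = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i prod  = _mm_mullo_epi32(va, vb);         // 4 x 32-bit integer multiply (SSE4.1)
        __m128  fprod = _mm_cvtepi32_ps(prod);           // convert the four int32 products to float
        __m128  vc    = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(fprod, vc));   // add the float vector and store
    }
}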

While the previous answer is reasonable, there is one significant difference: data organization. For direct SSE use, data is better organized as Structure-of-Arrays (SoA). Typically, your scalar code will have the data laid out as Array-of-Structures (AoS). If that is the case, converting from the scalar to the vectorized form will be difficult (a minimal sketch of the two layouts follows below).
More reading: https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions
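To make the difference concrete, here is a minimal sketch of the two layouts (the struct and field names are made up for illustration):
// Array-of-Structures (AoS): the layout scalar code usually ends up with.
// The fields of one element sit next to each other, so loading four x
// values means gathering them from four separate structs.
struct ParticleAoS { float x, y, z, w; };
// ParticleAoS particles[N];

// Structure-of-Arrays (SoA): the layout SSE code wants.
// Each field is contiguous, so four x values can be loaded in one go.
struct ParticlesSoA {
    float* x;   // N x-values, contiguous
    float* y;   // N y-values, contiguous
    float* z;   // N z-values, contiguous
    float* w;
};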

Basic ways to speed up a simple Eigen program

I'm looking for the fastest way to do simple operations using Eigen. There are so many data structures available, it's hard to tell which is the fastest.
I've tried to predefine my data structures, but even then my code is being outperformed by similar Fortran code. I've guessed that Eigen::Vector3d is the fastest for my needs (since it's predefined), but I could easily be wrong. Using -O3 optimization at compile time gave me a big boost, but I'm still running 4x slower than a Fortran implementation of the same code.
I make use of an 'Atom' structure, which is then stored in an 'atoms' vector defined by the following:
struct Atom {
    std::string element;
    // double x, y, z;
    Eigen::Vector3d coordinate;
};
std::vector<Atom> atoms;
The slowest part of my code is the following:
distance = atoms[i].coordinate - atoms[j].coordinate;
distance_norm = distance.norm();
Is there a faster data structure I could use? Or is there a faster way to perform these basic operations?
As you pointed out in your comment, adding the -fno-math-errno compiler flag gives you a huge increase in speed. As to why that happens, your code snippet shows that you're doing a sqrt via distance_norm = distance.norm();.
The flag tells the compiler not to set errno after each sqrt (that's a saved write to a thread-local variable), which is faster and enables vectorization of any loop that does this repeatedly. The only disadvantage is that strict IEEE/ISO adherence for math functions is lost. See the gcc man page.
Another thing you might want to try is adding -march=native and adding -mfma if -march=native doesn't turn it on for you (I seem to remember that in some cases it wasn't turned on by native and had to be turned on by hand - check here for details). And as always with Eigen, you can disable bounds checking with -DNDEBUG.
SoA instead of AoS!!! If performance is actually a real problem, consider using a single 4xN matrix to store the positions (and have Atom keep the column index instead of the Eigen::Vector3d). It shouldn't matter too much in the small code snippet you showed, but depending on the rest of your code, it may give you another huge increase in performance.
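For illustration, a minimal sketch of that layout (the member and variable names here are assumptions; the 4xN variant mentioned above just pads an extra row for alignment, a 3xN matrix is used below to keep the norm straightforward):
#include <string>
#include <vector>
#include <Eigen/Dense>

struct Atom {
    std::string element;
    int col;                      // column index into 'positions'
};

Eigen::Matrix3Xd positions;       // 3 x N, one column per atom, contiguous storage
std::vector<Atom> atoms;

double distance_between(const Atom& a, const Atom& b)
{
    // Same computation as before, but on columns of one big matrix,
    // which keeps the coordinate data contiguous and vectorization-friendly.
    return (positions.col(a.col) - positions.col(b.col)).norm();
}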
Given you are ~4x off, it might be worth checking that you have enabled vectorization such as AVX or AVX2 at compile time. There are of course also SSE2 (~2x) and AVX512 (~8x) when dealing with doubles.
Either try another compiler like the Intel C++ compiler (free for academic and non-profit usage), or use libraries like Intel MKL (far faster than your own code) or other BLAS/LAPACK implementations for dense matrices, or PARDISO or SuperLU (not sure if it still exists) for sparse matrices.

Float to SInt32

I have a series of C++ signal processing classes which use 32-bit floats as their primary sample datatype. For example, all the oscillator classes return floats for every sample that's requested. This is the same for all the classes; all sample calculations are done in floating point.
I am porting these classes to iOS, and for performance reasons I want to operate in 8.24 fixed point to get the most out of the processor; word has it there are major performance advantages on iOS to crunching integers instead of floats. I'm currently doing all the calculations in floats, then converting to SInt32 at the final stage before output, which means every sample needs to be converted at the final stage.
Do I simply change the datatype used inside my classes from Float to SInt32, so that my oscillators, filters etc. calculate in fixed point by passing SInt32s around internally instead of floats?
Is it really this simple, or do I have to completely rewrite all the different algorithms?
Is there any other voodoo I need to understand before taking on this mission?
Many thanks to anyone who finds the time to comment on this. It's much appreciated.
It's mostly a myth. Floating point performance used to be slow if you compiled for armv6 in Thumb mode; this is not an issue in armv7, which supports Thumb-2 (I'll avoid further discussion of armv6, which is no longer supported in Xcode). You also want to avoid using doubles, since floats can use the faster NEON (a.k.a. Advanced SIMD Instructions) unit - this is easy to do accidentally; try enabling -Wshorten.
I also doubt you'll get significantly better performance from an 8.24 multiply, especially compared with making use of the NEON unit. Changing float to int/int32_t/SInt32 will also not automatically do the necessary shifts for an 8.24 multiply.
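To illustrate that last point, here is a minimal sketch of an 8.24 multiply (the type and function names are made up, not from any Apple API): a plain integer multiply of two 8.24 values is scaled by 2^48, so the intermediate has to be widened to 64 bits and shifted back down, which a simple type swap will not do for you.
#include <cstdint>

typedef int32_t Fixed824;                  // 8 integer bits, 24 fractional bits

inline Fixed824 fixed_mul(Fixed824 a, Fixed824 b)
{
    // Widen to 64 bits so the intermediate doesn't overflow,
    // then shift right by 24 to return to 8.24 scaling.
    return (Fixed824)(((int64_t)a * (int64_t)b) >> 24);
}

inline Fixed824 float_to_fixed(float x)    { return (Fixed824)(x * (float)(1 << 24)); }
inline float    fixed_to_float(Fixed824 x) { return (float)x / (float)(1 << 24); }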
If you know that converting floats to ints is the slow bit, consider using some of the functions in Accelerate.framework, namely vDSP_vfix16() or vDSP_vfixr16().

What is faster? two ints or __int64?

What is faster: (Performance)
__int64 x,y;
x=y;
or
int x,y,a,b;
x=a;
y=b;
?
Or are they equal?
__int64 is a non-standard compiler extension, so whilst it may or may not be faster, you don't want to use it if you want cross-platform code. Instead, you should consider using #include <cstdint> and using int64_t/uint64_t etc. These derive from the C99 standard, which provides stdint.h and inttypes.h for fixed-width integer arithmetic.
In terms of performance, it depends on the system you are on - on x86_64, for example, you should not see any performance difference between adding 32-bit and 64-bit integers, since the add instruction can handle 32- or 64-bit registers.
However, if you're running code on a 32-bit platform or compiling for a 32-bit architecture, adding 64-bit integers will actually require two 32-bit register additions (an add plus an add-with-carry), as opposed to one. So if you don't need the extra space, it would be wasteful to allocate it.
I have no idea if compilers can or do optimise the types down to a smaller size if necessary. I expect not, but I'm no compiler engineer.
I hate these sort of questions.
1) If you don't know how to measure for yourself then you almost certainly don't need to know the answer.
2) On modern processors it's very hard to predict how fast something will be based on single instructions, it's far more important to understand your program's cache usage and overall performance than to worry about optimising silly little snippets of code that use an assignment. Let the compiler worry about that and spend your time improving the algorithm used or other things that will have a much bigger impact.
So in short, you probably can't tell and it probably doesn't matter. It's a silly question.
The compiler's optimizer will remove all the code in your example, so in that sense there is no difference. I think you want to know whether it is faster to move data 32 bits at a time or 64 bits at a time. If your data is aligned to 8 bytes and you are on a 64-bit machine, then it should be faster to move data 8 bytes at a time. There are several caveats to that, however. You may find that your compiler is already doing this optimization for you (you would have to look at the emitted assembly code to be sure), in which case you would see no difference. Also, consider using memcpy instead of rolling your own if you are moving a lot of data. If you are considering casting an array of 32-bit ints to 64-bit in order to copy faster or do some other operation faster (i.e. in half the number of instructions), be sure to Google for the strict aliasing rule.
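For example (a sketch with made-up names), the safe way to move the data is to let memcpy do it rather than reinterpreting the buffer yourself:
#include <cstdint>
#include <cstring>

void copy_ints(int32_t* dst, const int32_t* src, std::size_t count)
{
    // memcpy is allowed to alias anything, and compilers routinely turn it
    // into wide (64-bit or SIMD) moves, so this is both legal and fast.
    std::memcpy(dst, src, count * sizeof(int32_t));

    // By contrast, copying through a reinterpret_cast<const int64_t*>(src)
    // "to move 8 bytes at a time" breaks the strict aliasing rule and is
    // undefined behaviour, even though it often appears to work.
}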
__int64 should be faster on most platforms, but be careful - some architectures require 8-byte alignment for this to take effect, and some would even crash your app if the data isn't aligned.

Can C++ compiler automatically optimize float to double for me?

I was wondering whether double might be faster than float on some machines.
The operations I am performing really only require the precision of floats. However, they are in image processing and I would like to use the fastest possible type.
Can I use float everywhere and trust that the optimizing VC++ 2008 compiler will convert it to double if it deems it is more appropriate? I don't see how this would break code.
Thanks in advance!
No, the compiler will not change a fundamental type like float to a double for optimization.
If you want to keep that option open, use a typedef for your floating-point type in a common header, e.g. typedef float FASTFLOAT; and use FASTFLOAT (or whatever you name it) throughout your code. You can then change one central typedef to change the type throughout your code.
My own experience is that float and double are basically comparable in performance for math operations on x86/x64 platforms now, and I tend to prefer double. If you are processing a lot of data (and hitting memory bandwidth issues rather than being computationally bound), you may get some performance benefit from the fact that floats are half the size of doubles.
You will also want to explore the effects of the various optimization flags. Depending on your target platform requirements, you may be able to optimize more aggressively.
Firstly, the compiler doesn't change float types unless it has to, and never in storage declarations.
float will be no slower than double, but if you really want fast processing, you need to look into either using a compiler that can generate SSE2 or SSE3 code or writing your heavy-processing routines with those instructions yourself. IIRC, there are tools that can help you micromanage the processor's pipeline if necessary. Last I messed with this (years ago), Intel had a library called IPP that could help as well by vectorizing your math.
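For example, a hand-written SSE inner loop for a float-heavy image operation might look like this sketch (the function and buffer names are illustrative, and the pixel count is assumed to be a multiple of four):
#include <xmmintrin.h>   // SSE

void scale_pixels(float* pixels, float factor, int count)
{
    __m128 vfactor = _mm_set1_ps(factor);                    // broadcast the scale factor
    for (int i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(pixels + i);                 // load 4 floats
        _mm_storeu_ps(pixels + i, _mm_mul_ps(v, vfactor));   // 4 multiplies in one instruction
    }
}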
I have never heard of an architecture where float was slower than double, if only for the fact that memory bandwidth requirements double if you use double. Any FPU that can do a single-cycle double operation can do a single-cycle float operation with a little modification at most.
Mark's got a good idea, though: profile your code if you think it's slow. You might find the real problem is somewhere else, like hidden typecasts or function-call overhead from something you thought was inlined not getting inlined.
When the code needs to store the variable in memory, chances are that on most architectures it will take 32 bits for a float and 64 bits for a double. Having to change the storage size like that prevents the compiler from fully applying such a conversion.
Are you sure that the floating point math is the bottleneck in your application? Perhaps profiling would reveal another possible source of improvement.

What do SSE instructions optimize in practice, and how does the compiler enable and use them?

SSE and/or 3DNow! have vector instructions, but what do they optimize in practice? Are 8-bit characters processed 4 at a time instead of 1 at a time, for example? Are there optimisations for some arithmetical operations? Does the word size have any effect (16 bits, 32 bits, 64 bits)?
Do all compilers use them when they are available?
Does one really have to understand assembly to use SSE instructions? Does knowing about electronics and gate logic help in understanding this?
Background: SSE has both vector and scalar instructions. 3DNow! is dead.
It is uncommon for any compiler to extract a meaningful benefit from vectorization without the programmer's help. With programming effort and experimentation, one can often approach the speed of pure assembly, without actually mentioning any specific vector instructions. See your compiler's vector programming guide for details.
There are a couple of portability tradeoffs involved. If you code for GCC's vectorizer, you might be able to work with non-Intel architectures such as PowerPC and ARM, but not with other compilers. If you use Intel intrinsics to make your C code more like assembly, then you can use other compilers but not other architectures.
Electronics knowledge will not help you. Learning the available instructions will.
In the general case, you can't rely on compilers to use vectorized instructions at all. Some do (Intel's C++ compiler does a reasonable job of it in many simple cases, and GCC attempts to do so too, with mixed success).
But the idea is simply to apply the same operation to four 32-bit values at once (or two 64-bit values in some cases).
So instead of the traditional add instruction, which adds together the values from two different 32-bit registers, you can use a vectorized add, which uses special 128-bit wide registers containing four 32-bit values and adds them together as a single operation.
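As a concrete sketch (the function and array names are made up), the scalar and SSE2 versions of that add look like this:
#include <emmintrin.h>   // SSE2

void add4_scalar(const int* a, const int* b, int* out)
{
    out[0] = a[0] + b[0];   // four separate 32-bit adds...
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}

void add4_sse(const int* a, const int* b, int* out)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    // ...versus one packed add that handles all four 32-bit lanes at once.
    _mm_storeu_si128((__m128i*)out, _mm_add_epi32(va, vb));
}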
Duplicate of other questions:
Using SSE instructions
In short, SSE stands for Streaming SIMD Extensions, where SIMD = Single Instruction, Multiple Data. This is useful for performing a single mathematical or logical operation on many values at once, as is typically done for matrix or vector math operations.
The compiler can target this instruction set as part of its optimizations (research your /O options); however, you typically have to restructure your code and either write SSE manually or use a library like Intel Performance Primitives to really take advantage of it.
If you know what you are doing, you might get a huge performance boost. See for example here, where this guy improved the performance of his algorithm 6 times.