Intrinsics: using __m128 registers - C++

I am playing with SIMD and am thinking of using it for vector operations in 3D math.
Instead of having
class Vec4f
{
    float val[4];
    // + operators here
};
I could have
class SimdVec4f
{
    __m128 val; // + operators
};
But since there are only 8 available registers for __m128, what will happen if I want to have more than 8 instances of this class? Does the compiler handle the loading from memory to registers and vice versa on its own, as for usual variables?
Thanks for your time and for giving me some insight into this.

It's exactly the same as when you have more int variables than there are integer registers: the compiler may have to spill them to memory if too many are live at the same time, and reload them later. Register allocation for vector registers is done pretty much the same way as register allocation for integer regs, analysing the data flow of a function and figuring out which variables are alive at the same time.
You should think of the _mm_load_ps/loadu and store/storeu intrinsics as describing the type-punning to/from vector types, not as being the only things that can compile to a vector load or store instruction, nor as always compiling to a load/store.
And BTW, x86-64 has xmm0..15. Compile for 64-bit if you want code that needs several registers to be efficient.
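For illustration, a minimal sketch of the wrapper from the question (illustrative only, not a complete library): there are no explicit loads or stores, and the compiler keeps val in an XMM register when it can and spills it to the stack when it can't, exactly as described above.

#include <xmmintrin.h>

// Minimal sketch of the wrapper from the question (illustrative only):
// no explicit loads or stores; the compiler decides when val lives in an
// XMM register and when it gets spilled to memory.
class SimdVec4f
{
public:
    __m128 val;

    SimdVec4f() : val(_mm_setzero_ps()) {}
    explicit SimdVec4f(__m128 v) : val(v) {}

    SimdVec4f operator+(const SimdVec4f& other) const
    {
        return SimdVec4f(_mm_add_ps(val, other.val)); // one addps, register to register when possible
    }
};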
SSE for 3D vectors:
Generally avoid keeping a single direction/geometry vector in a SIMD vector. You can add efficiently, but any cross- or dot-products or length calculations will require shuffling.
It's better if you can use a vector of 4 x values, a vector of 4 y values, etc., so you can compute 4 lengths in parallel. See https://stackoverflow.com/tags/sse/info for more, especially these slides:
SIMD at Insomniac Games (GDC 2015) which show how to lay out your data for efficient SIMD. (Struct of arrays, not array of structs).
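To make the struct-of-arrays idea concrete, here is a rough sketch (function name and layout are just for illustration) that computes four lengths at once with no horizontal shuffles:

#include <xmmintrin.h>

// Rough illustration of struct-of-arrays: x[i], y[i], z[i] are the components
// of the i-th geometry vector, so one pass computes four lengths in parallel.
void lengths4(const float* x, const float* y, const float* z, float* out)
{
    __m128 xs = _mm_loadu_ps(x);
    __m128 ys = _mm_loadu_ps(y);
    __m128 zs = _mm_loadu_ps(z);
    __m128 sq = _mm_add_ps(_mm_add_ps(_mm_mul_ps(xs, xs), _mm_mul_ps(ys, ys)),
                           _mm_mul_ps(zs, zs));
    _mm_storeu_ps(out, _mm_sqrt_ps(sq)); // four square roots from a single sqrtps
}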
See also Parallel programming using Haswell architecture
Sometimes you can get a minor benefit for a single vector in cases where you can't reorganize to compute lots of things in parallel. _mm_setr_ps() can be slow if the source data isn't contiguous, though.
There are already several C++ wrapper libraries for SIMD, such as Agner Fog's VectorClass (GPL in older versions, Apache-licensed now), and some others.

Related

Does int32_t have lower latency than int8_t, int16_t and int64_t?

(I'm referring to Intel CPUs, and mainly with GCC, but possibly ICC or MSVC)
Is it true that using int8_t, int16_t or int64_t is less efficient compared with int32_t due to additional instructions generated to convert between the CPU word size and the chosen variable size?
I would be interested if anybody has any examples or best practices for this. I sometimes use smaller variable sizes to reduce cacheline loads, but say I only consumed 50 bytes of a cacheline, with one variable being an 8-bit int: might it be quicker to use the remaining cacheline space and promote the 8-bit int to a 32-bit int, etc.?
You can stuff more uint8_ts into a cache line, so loading N uint8_ts will be faster than loading N uint32_ts.
In addition, if you are using a modern Intel chip with SIMD instructions, a smart compiler will vectorize what it can. Again, using a small variable in your code will allow the compiler to stuff more lanes into a SIMD register.
I think it is best to use the smallest size you can, and leave the details up to the compiler. The compiler is probably smarter than you (and me) when it comes to stuff like this. For many operations (say unsigned addition), the compiler can use the same code for uint8, uint16 or uint32 (and just ignore the upper bits), so there is no speed difference.
The bottom line is that a cache miss is WAY more expensive than any arithmetic or logical operation, so it is nearly always better to worry about cache (and thus data size) than simple arithmetic.
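As a rough sketch of the lane-count point (whether the compiler actually vectorizes these loops depends on flags and target):

#include <cstdint>
#include <cstddef>

// Illustrative only: with vectorization enabled, the uint8_t version can
// process 16 elements per 128-bit register, the uint32_t version only 4,
// and it also pulls 4x fewer cache lines for the same element count.
uint32_t sum_u8(const uint8_t* data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

uint32_t sum_u32(const uint32_t* data, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}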
(It used to be true, a long time ago, that on Sun workstations using double was significantly faster than float, because the hardware only supported double. I don't think that is true any more for modern x86, as the SIMD hardware (SSE, etc.) has direct support for both single and double precision.)
Mark Lakata's answer points in the right direction.
I would like to add some points.
A wonderful resource for understanding and making optimization decisions is the set of Agner Fog documents.
The Instruction Tables document lists the latencies of the most common instructions. You can see that some of them perform better in their native-size version.
A mov, for example, may be eliminated; a mul may have lower latency.
However, here we are talking about gaining a single clock cycle; we would have to execute a lot of instructions to compensate for one cache miss.
If this were the whole story, it would not be worth worrying about.
The real problem comes with the decoders.
When you use some length-changing prefixes (and you will, when using non-native word sizes) the decoder takes extra cycles.
The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes.
On microarchitectures that are no longer recent but still in use, the penalty was severe, especially with some kinds of arithmetic instructions.
On later microarchitectures this has been mitigated, but the penalty is still present.
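As a hedged illustration of the length-changing-prefix case (exact instruction selection depends on the compiler): a 16-bit store of an immediate typically needs the 66h operand-size prefix followed by a 16-bit immediate, which is exactly the pattern Agner warns about.

#include <cstdint>

// Illustrative only; actual codegen depends on the compiler and options.
void store16(uint16_t* p) { *p = 0x1234; } // typically: mov word ptr [rdi], 0x1234  (66h prefix + imm16 -> LCP stall risk)
void store32(uint32_t* p) { *p = 0x1234; } // typically: mov dword ptr [rdi], 0x1234 (no operand-size prefix)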
Another aspect to consider is that using non-native sizes requires prefixing the instructions, thereby generating larger code.
This is as close as it gets to the statement "additional instructions [are] generated to convert between the CPU word size and the chosen variable size", since Intel CPUs can handle non-native word sizes.
With other CPUs, especially RISC ones, this is not generally true, and more instructions may be generated.
So while you are making optimal use of the data cache, you are also making poor use of the instruction cache.
It is also worth noting that on the common x64 ABI the stack must be aligned on a 16-byte boundary, and that compilers usually store local variables in the native word size or one close to it (e.g. a DWORD on a 64-bit system).
Only if you are allocating a sufficient number of local variables, or if you are using arrays or packed structs, can you gain benefits from using small variable sizes.
If you declare a single uint16_t variable, it will probably take the same stack space as a single uint64_t, so it is best to go for the fastest size.
Furthermore, when it comes to the data cache, it is locality that matters, rather than the data size alone.
So, what to do?
Luckily, you don't have to decide between having small data or small code.
If you have a considerable quantity of data this is usually handled with arrays or pointers and by the use of intermediate variables. An example is this line of code:
t = my_big_data[i];
Here my approach is:
Keep the external representation of the data, i.e. the my_big_data array, as small as possible. For example, if that array stores temperatures, use a coded uint8_t for each element.
Keep the internal representation of the data, i.e. the t variable, as close as possible to the CPU word size. For example, t could be a uint32_t or uint64_t.
This way your program optimizes for both caches and uses the native word size.
As a bonus, you may later decide to switch to SIMD instructions without having to repack the my_big_data memory layout.
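A minimal sketch of that approach (the array size and the decoding step are just placeholders):

#include <cstdint>
#include <cstddef>

// Small external representation, native-width working variable.
uint8_t my_big_data[100000];        // e.g. coded temperatures, one byte each

uint32_t process(size_t i)
{
    uint32_t t = my_big_data[i];    // widen once, then work in the native word size
    return t * 2 + 10;              // hypothetical decoding/processing step
}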
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
D. Knuth
When you design your structures' memory layout, be problem driven. For example, age values need 8 bits, city distances in miles need 16 bits.
When you code the algorithms, use the fastest type the compiler is known to have for that scope. For example, integers are faster than floating-point numbers, and uint_fast8_t is no slower than uint8_t.
When it is then time to improve performance, start by changing the algorithm (by using faster types, eliminating redundant operations, and so on) and only then, if needed, the data structures (by aligning, padding, packing and so on).

Understanding how the intrinsic functions for SSE use memory

Before I ask my question, just a little background information.
In C languages, when you assign to a variable, you can conceptually assume you just modified a little piece of memory in RAM.
int a = rand(); //conceptually, you created and assigned variable a in RAM
In assembly language, to do the same thing, you essentially need the result of rand() stored in a register, and a pointer to "a". You would then do a store instruction to get the register content into ram.
When you program in C++, for example, when you assign and manipulate value-type objects, you usually don't even have to think about their addresses or how or when they will be stored in registers.
Using SSE intrinsics is strange because they appear to be somewhere in between coding in C and assembly, in terms of the conceptual memory model.
You can call load/store functions, and they return objects. A math operation like _mm_add will return an object, yet it's unclear to me whether the result will actually be stored in the object unless you call _mm_store.
Consider the following example:
inline void block(float* y, const float* x) const {
    // load 4 data elements at a time
    __m128 X = _mm_loadu_ps(x);
    __m128 Y = _mm_loadu_ps(y);
    // do the computations
    __m128 result = _mm_add_ps(Y, _mm_mul_ps(X, _mm_set1_ps(a)));
    // store the results
    _mm_storeu_ps(y, result);
}
There are a lot of temporary objects here. Do the temporary objects actually not exist? Is it all just syntactic sugar for calling assembly instructions in a C-like way? What happens if, instead of doing the store command at the end, you just kept the result: would the result then be more than syntactic sugar, and actually hold data?
TL:DR How am I supposed to think about memory when using SSE intrinsics?
An __m128 variable may be in a register and/or memory. It's much the same as with simple float or int variables - the compiler will decide which variables belong in registers and which must be stored in memory. In general the compiler will try to keep the "hottest" variables in registers and the rest in memory. It will also analyse the lifetimes of variables so that a register may be used for more than one variable within a block. As a programmer you don't need to worry about this too much, but you should be aware of how many registers you have, i.e. 8 XMM registers in 32-bit mode and 16 in 64-bit mode. Keeping your variable usage below these numbers will help to keep everything in registers as far as possible. Having said that, the penalty for accessing an operand in L1 cache is not that much greater than accessing a register operand, so you shouldn't get too hung up on keeping everything in registers if it proves difficult to do so.
Footnote: this vagueness about whether SSE variables are in registers or memory when using intrinsics is actually quite helpful, and makes it much easier to write optimised code than doing it with raw assembler - the compiler does the grunt work of keeping track of register allocation and other optimisations, allowing you to concentrate on making the code work correctly.
Vector variables aren't special. They will be spilled to memory and re-loaded when needed later, if the compiler runs out of registers when optimizing a loop (or across a function call to a function the compiler can't "see" to know that it doesn't touch the vector regs).
gcc -O0 actually does tend to store to RAM when you set them, instead of keeping __m128i variables only in registers, IIRC.
You could write all your intrinsic-using code without ever using any load or store intrinsics, but then you'd be at the mercy of the compiler to decide how and when to move data around. (You actually still are, to some degree these days, thanks to compilers being good at optimizing intrinsics, and not just literally spitting out a load wherever you use a load intrinsic.)
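For instance, a sketch with no load or store intrinsics at all (it assumes the pointer is 16-byte aligned, since dereferencing a __m128* implies alignment); how and when the data moves is entirely the compiler's call:

#include <xmmintrin.h>

// Sketch: no explicit load/store intrinsics; data movement is entirely up to
// the compiler. Dereferencing a __m128* assumes 16-byte alignment.
void scale_in_place(__m128* v, float s)
{
    *v = _mm_mul_ps(*v, _mm_set1_ps(s));
}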
Compilers will fold loads into memory operands for following instructions, if the value isn't needed as an input to something else as well. However, this is only safe if the data is at a known-aligned address, or an aligned-load intrinsic was used.
The way I currently think about load intrinsics is as a way of communicating alignment guarantees (or lack thereof) to the compiler. The "regular" SSE (non-AVX / non-VEX-encoded) versions of vector instructions fault if used with an unaligned 128b memory operand. (Even on CPUs supporting AVX, FWIW.) For example, note that even punpckl* lists its memory operand as a m128, and thus has alignment requirements, even if it only actually reads the low 64b. pmovzx lists its operand as a m128.
Anyway, using load instead of loadu tells the compiler that it can fold the load into being a memory operand for another instruction, even if it can't otherwise prove that it comes from an aligned address.
Compiling for an AVX target machine will allow the compiler to fold even unaligned loads into other operations, to take advantage of uop micro-fusion.
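A rough illustration of the folding point (the actual asm depends on compiler and target):

#include <xmmintrin.h>

// Illustration only: with _mm_load_ps the compiler may fold one load into the
// add's memory operand (e.g. "addps xmm0, [rsi]"), because the aligned form
// would fault on a misaligned address anyway. With _mm_loadu_ps and no AVX it
// has to keep a separate movups.
__m128 sum_aligned(const float* a, const float* b)
{
    return _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b));
}

__m128 sum_unaligned(const float* a, const float* b)
{
    return _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
}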
This came up in comments on How to specify alignment with _mm_mul_ps.
The store intrinsics apparently have two purposes:
To tell the compiler whether it should use the aligned or unaligned asm instruction.
To remove the need for a cast from __m128d to double * (doesn't apply to the integer case).
Just to confuse things, AVX2 introduced things like _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a), which stores the high/low halves to different addresses. It probably compiles to a vmovdqu / vextracti128 ..., 1 sequence. Incidentally, I guess they made vextracti128 with AVX512 in mind, since using it with 0 as the immediate is the same as vmovdqu, but slower and longer-to-encode.

Why should you not access the __m128i fields directly?

I was reading this on MSDN, and it says
You should not access the __m128i fields directly. You can, however, see these types in the debugger. A variable of type __m128i maps to the XMM[0-7] registers.
However, it doesn't explain why. Why is it? For example, is the following "bad":
void func(unsigned short x, unsigned short y)
{
    __m128i a;
    a.m128i_i64[0] = x;
    __m128i b;
    b.m128i_i64[0] = y;
    // Now do something with a and b ...
}
Instead of doing the assignments like in the example above, should one use some sort of load function?
The m128i_i64 field and its relatives are Microsoft-specific compiler extensions. They don't exist in most other compilers.
Nevertheless, they are useful for testing purposes.
The real reason for avoiding their use is performance. The hardware cannot efficiently access individual elements of a SIMD vector.
There are no instructions that let you directly access individual elements with a runtime-variable index. (SSE4.1 added extract/insert instructions, but they require a compile-time constant index.)
Going through memory may incur a very large penalty due to failure of store forwarding.
AVX and AVX2 don't extend the SSE4.1 instructions to allow accessing elements in a 256-bit vector. And as far as I can tell, AVX512 will not have it for 512-bit vectors.
Likewise, the set intrinsics (such as _mm256_set_pd()) suffer from the same issue. They are implemented either as a series of data-shuffling operations, or by going through memory and taking the store-forwarding stalls.
Which begs the question: Is there an efficient way to populate a SIMD vector from scalar components? (or separate a SIMD vector into scalar components)
Short Answer: Not really. When you use SIMD, you're expected to do a lot of the work in the vectorized form. So the initialization overhead should not matter.
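For completeness, a hedged sketch of the portable alternative to the MSVC-only fields for the question's function: use a set intrinsic (or a load from memory) and let the compiler pick the shuffles or stores. The final _mm_add_epi64 is just a placeholder for "do something with a and b".

#include <emmintrin.h>

// Portable version of the question's initialization (sketch): _mm_set_epi64x
// takes the high element first, so passing (0, x) puts x in element 0.
__m128i make_ab(unsigned short x, unsigned short y)
{
    __m128i a = _mm_set_epi64x(0, x);
    __m128i b = _mm_set_epi64x(0, y);
    return _mm_add_epi64(a, b); // placeholder for "now do something with a and b"
}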

Speed up float 5x5 matrix * vector multiplication with SSE

I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
Do I need the Intel compiler to do it?
Can you point out some references?
The Eigen C++ template library for vectors, matrices, ... has both optimised code for small fixed-size matrices (as well as dynamically sized ones) and optimised code that uses SSE, so you should give it a try.
In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M. Define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors:
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need to read the rows of U, but in memory U is stored as xyzwtxyzwtxyzwtxyzwt, so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE. Once we have this format, the matrix product is very efficient.
Instead of taking O(5x5x4) operations with scalar code, it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speedup with SSE and an 8x with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction-level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 with Haswell due to FMA3.
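To make the layout concrete, here is a rough sketch (not the exact code from above, and the function name is made up) of multiplying the fixed 5x5 matrix by four vectors at once, once the data is already in xxxxyyyyzzzzwwwwtttt (SoA) form:

#include <xmmintrin.h>

// Rough sketch: process four 5D vectors at once. xs, ys, zs, ws, ts each hold
// one component of the four input vectors; out[0..4] get the corresponding
// components of the four results. M is the fixed 5x5 matrix, row-major.
void mat5x5_times_4vecs(const float M[5][5],
                        __m128 xs, __m128 ys, __m128 zs, __m128 ws, __m128 ts,
                        __m128 out[5])
{
    for (int row = 0; row < 5; ++row)
    {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[row][0]), xs);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[row][1]), ys));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[row][2]), zs));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[row][3]), ws));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[row][4]), ts));
        out[row] = acc; // e.g. out[0] holds the x components of all four results
    }
}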
I would suggest using Intel IPP and abstracting yourself from dependence on these techniques.
If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
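A minimal example of the kind of loop GCC's auto-vectorizer handles well (flags and results vary by version and target):

// With -O3 (or -O2 -ftree-vectorize) GCC will typically turn this into
// SSE/AVX code on its own; no intrinsics needed.
void axpy(float* y, const float* x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}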
This should be easy, especially when you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about 2.4 megabytes of vectors per second, when memory bandwidths are in single-digit gigabytes per second.
What is known about the vector? Since the matrix is fixed, and if there is a limited number of values that the vector can take, then I'd suggest that you pre-compute the calculations and access them using a table look-up.
The classic optimization technique to trade memory for cycles...
I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level 2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM. Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
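Here is a sketch of that batched idea using a CBLAS-style SGEMM call (MKL, ACML and others expose this interface; the column-major layout and leading dimensions are assumptions for the example):

#include <cblas.h>

// Y = A * X, where A is the fixed 5x5 matrix and X packs n input vectors as
// the columns of a 5 x n matrix, so one SGEMM call replaces n SGEMV calls.
// Column-major storage is assumed in this sketch.
void multiply_block(const float* A, const float* X, float* Y, int n)
{
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                5, n, 5,        // M, N, K
                1.0f, A, 5,     // alpha, A, lda
                X, 5,           // B, ldb
                0.0f, Y, 5);    // beta, C, ldc
}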
Hope this helps.
If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but it should be able to keep up without too much problem.
5x5 is a funny size, but you could do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better; I've heard legends about how much better it is with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.

How much effort do you have to put in to get gains from using SSE?

Case One
Say you have a little class:
class Point3D
{
private:
    float x, y, z;
public:
    Point3D &operator+=(const Point3D &other);
    // ...etc
};

Point3D &Point3D::operator+=(const Point3D &other)
{
    this->x += other.x;
    this->y += other.y;
    this->z += other.z;
    return *this;
}
A naive use of SSE would simply replace these function bodies with a few intrinsics. But would we expect this to make much difference? MMX used to involve costly state changes IIRC; does SSE, or are the SSE instructions just like other instructions? And even if there's no direct "use SSE" overhead, would moving the values into SSE registers and back out again really make it any faster?
Case Two
Instead, you're working with a less OO-based code base. Rather than an array/vector of Point3D objects, you simply have a big array of floats:
float coordinateData[NUM_POINTS*3];

void add(int i, int j) // yes it's unsafe, no overlap check... example only
{
    for (int x = 0; x < 3; ++x)
    {
        coordinateData[i*3+x] += coordinateData[j*3+x];
    }
}
What about use of SSE here? Any better?
In conclusion
Is trying to optimise single vector operations using SSE actually worthwhile, or is it really only valuable when doing bulk operations?
In general you will need to take additional steps to get the best out of SSE (or any other SIMD architecture):
data needs to be 16 byte aligned (ideally)
data needs to be contiguous
you need enough data to make the SIMD operation worthwhile
you need to coalesce as many operations as you can to mitigate the costs of loads/stores
you need to be aware of the cache/memory hierarchy and its performance impact (e.g. use strip-mining/tiling)
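For instance, a hedged sketch of what the "Case Two" layout buys you once the data is contiguous and processed in bulk (it assumes n is a multiple of 4 and both pointers are 16-byte aligned):

#include <xmmintrin.h>

// Bulk version of "Case Two": add two contiguous float arrays 4 elements at a
// time. Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
void add_points(float* dst, const float* src, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 a = _mm_load_ps(dst + i);
        __m128 b = _mm_load_ps(src + i);
        _mm_store_ps(dst + i, _mm_add_ps(a, b));
    }
}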
This Gamasutra article shows what it takes to make fast SSE-based code. It covers your "Case 1" in detail.
The source code is available from the author's homepage.
Also Slides + text: SIMD at Insomniac Games (GDC 2015) discuss why using a SIMD vector to hold a single x,y,z,(padding) geometry vector is not efficient. (Because you'll need horizontal shuffles and a scalar sqrt to do things like the length of a vector, sqrt(sum of squares), compared to doing 4 lengths in parallel from 3 vectors of x0,x1,x2,x3, y0-3, z0-3.)
See also other links in the SSE tag wiki.
It is valuable if your case is that you do a lot of the same calculation on a range of data. For example, you calculate the square roots of many, many equations: you can load 4 values into an SSE register and call the operation once. This will increase performance by roughly a factor of 4.
And there are libraries that already have all the SSE optimization inside them; don't reinvent the wheel.
I tried Case One at work a couple of years ago and the performance gain was barely measurable. In the end I decided to skip it since all the hassle with aligning all Point3D on 16 byte boundaries made it not worthwhile.
As you've correctly guessed SSE is most suited to bulk operations where they can give a pretty good speed up. Before you go ahead and use the SSE intrinsics check what code the compiler is already generating. I know from experience that for instance Visual Studio is pretty good at using SSE-optimizations.