Union type punning with SSE types - C++

In cases where the SSE intrinsics lack certain operations I wanted to add default fallbacks. Currently I am assuming that it's better to do so via unions, as Visual Studio 2013 is the primary compiler for now - and I have noticed it still generates better code if the naked SSE types __m128/__m128i are used rather than unions when SSE operations do exist. I don't know if that is better in VS2015.
So I am attempting something like this:
template<class _SIMD>
union VUnion
{
    _SIMD vector;
    float lane[sizeof(_SIMD)/sizeof(float)]; // Can assert size makes sense etc
};

template<class _SIMD>
void __vectorcall Func(_SIMD& r, const _SIMD& a, const _SIMD& b)
{
    const VUnion<_SIMD>& va{a};
    const VUnion<_SIMD>& vb{b};
    VUnion<_SIMD>& vr{r};
    for(int i = 0; i < sizeof(_SIMD)/sizeof(float); ++i)
        vr.lane[i] = LaneFunc(va.lane[i], vb.lane[i]);
}
This works and allows me to involve unions only in operations where there is no direct SSE equivalent. But given strict aliasing rules etc. I worry how correct this really is.
I am also not sure whether this is language-safe only for integer SIMD vectors and not for float ones?
If this isn't safe, then I suspect the only language-safe way that is compatible with VS2013 is to use the extract intrinsics to get each lane, and the set ones to reconstruct all lanes and set the whole vector at once - which is a PITA really, and I am not convinced it will do good things to code generation.
Also it's unclear how smart compilers are regarding vectorising such fallback per lane functions. Although I am sure GCC/Clang are ahead of the curve there.
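For reference, here is a minimal sketch of the store/load alternative mentioned above (names like PerLaneFallback and lane_func are just illustrative, and this is untested against VS2013): spilling the lanes to a local float array with the store intrinsics and reloading the result avoids the aliasing question entirely, at the cost of a round trip through memory.

#include <xmmintrin.h> // SSE

template<class LaneOp>
__m128 PerLaneFallback(__m128 a, __m128 b, LaneOp lane_func)
{
    float la[4], lb[4], lr[4];
    _mm_storeu_ps(la, a);        // spill lanes to memory (well-defined)
    _mm_storeu_ps(lb, b);
    for (int i = 0; i < 4; ++i)
        lr[i] = lane_func(la[i], lb[i]);
    return _mm_loadu_ps(lr);     // reload the computed lanes
}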

Related

Is this a proper way to extract a byte from a NEON uint8x16_t vector?

I am a beginner to NEON intrinsics, and I wanted to work with uint8x16_t and also uint8x16x4_t.
While working with it I came across a situation where I wanted to extract a byte from a uint8x16_t. Being naive to the details, I accidentally began extracting bytes from it using the [] operator at runtime. But my compiler Clang happily compiled the code, gave no errors or warnings, and I got the desired output.
I searched through the ARM reference guides and I never seemed to find any reference on using the [] operator on a uint8x16_t vector, after all it's a 128 bit register and not an array!? (Please correct me if I'm wrong).
Therefore, to bring light to the issue, I tracked the origin of the vector uint8x16_t in the header file arm_neon.h and I found this:
typedef __attribute__((neon_vector_type(16))) uint8_t uint8x16_t;
How is this stored in computer memory ?
Why am I able to use the [] operator on it directly, where I should
be using:
uint8_t fetch(uint8x16_t *r, int index) {
    unsigned char u[16];
    vst1q_u8(u, *r);
    return u[index];
}
instead of:
uint8_t fetch(uint8x16_t *r, int index) {
    return (*r)[index];
} // This is much faster in performance!
Any help will be greatly appreciated!
Why am I able to use the [] operator on it directly
Because gcc / clang define it in terms of GNU C native vectors (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), which do have well-defined rules for operators.
ARM's docs probably don't guarantee that [] works, and there are probably some ARM compilers where it won't work.
It's stored in memory (or not, if just in a register or optimized away) the same as any other type. The object-representation has the lowest element at the lowest address. uint8x16_t objects are like int objects in most ways, in terms of the compiler being able to decide where to keep them, and optimize them away, etc.
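For illustration (not part of the original answer), reading a compile-time-constant lane is what the vgetq_lane_u8 intrinsic is for, while the [] form relies on the GNU vector extension and therefore on GCC/Clang:

#include <arm_neon.h>

// Portable intrinsic: the lane index must be a compile-time constant.
uint8_t third_byte(uint8x16_t v)
{
    return vgetq_lane_u8(v, 2);
}

// GNU C vector extension: subscripting works, even with a runtime index,
// because uint8x16_t is defined as a native vector type on GCC/Clang.
uint8_t runtime_byte(uint8x16_t v, int index)
{
    return v[index];
}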

C++ vector::size_type: signed vs unsigned; int vs. long

I have been doing some testing of my application by compiling it on different platforms, and the shift from a 64-bit system to a 32-bit system is exposing a number of issues.
I make heavy use of vectors, strings, etc., and as such need to count them. However, my functions also make use of 32-bit unsigned numbers because in many cases I need to explicitly consume a positive integer.
I'm having issues with seemingly simple tasks such as std::min and std::max, which may be more systemic. Consider the following code:
uint32_t getmax()
{
return _vecContainer.size();
}
Seems simple enough: I know that a vector can't have a negative number of elements, so returning an unsigned integer makes complete sense.
void setRowCol(const uint32_t &r_row, const uint32_t &r_col)
{
myContainer_t mc;
mc.row = r_row;
mc.col = r_col;
_vecContainer.push_back(mc);
}
Again, simple enough.
Problem:
uint32_t foo(const uint32_t &r_row)
{
return std::min(r_row, _vecContainer.size());
}
This gives me errors such as:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:2589:1: note: candidate template ignored: deduced conflicting types for parameter '_Tp' ('unsigned long' vs. 'unsigned int')
min(const _Tp& __a, const _Tp& __b)
I did a lot of digging, and on one platform vector::size_type is an 8 byte number. However, by design I am using unsigned 4-byte numbers. This is presumably causing things to be wacky because you cannot implicitly convert from an 8-byte number to a 4-byte number.
The solution was to do this the old-fashioned way:
#define MIN_M(a,b) a < b ? a : b
return MIN_M(r_row, _vecContainer.size());
Which works dandy. But the systemic issue remains: when planning for multiple platform support, how do you handle instances like this? I could use size_t as my standard size, but that adds other complications (e.g. moving from one platform which supports 64 bit numbers to another which supports 32 bit numbers at a later date). The bigger issue is that size_t is unsigned, so I can't update my signatures:
size_t foo(const size_t &r_row)
// bad, this allows -1 to be passed, which I don't want
Any suggestions?
EDIT: I had read somewhere that size_t was signed, and I've since been corrected. So far it looks like this is a limitation of my own design (e.g. 32-bit numbers vs. using std::vector::size_type and/or size_t).
One way to deal with this is to use
std::vector<Type>::size_type
as the underlying type of your function parameters/returns, or auto returns if using C++14.
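A minimal sketch of that approach, reusing the names from the question (the surrounding class is assumed):

#include <cstdint>
#include <vector>

struct myContainer_t { std::uint32_t row, col; };

class Grid
{
    std::vector<myContainer_t> _vecContainer;
public:
    // Spell out the container's own size type...
    std::vector<myContainer_t>::size_type getmax() const
    {
        return _vecContainer.size();   // no conversion, no warning
    }

    // ...or, in C++14, let the return type be deduced.
    auto getmax14() const { return _vecContainer.size(); }
};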
An answer in the form of a set of tidbits:
Instead of relying on the compiler to deduce the type, you can explicitly specify the type when using function templates like std::min<T>. For example: std::min<std::uint32_t>(4, my_vec.size());
Turn on all the compiler warnings related to signed versus unsigned comparisons and implicit narrowing conversions. Use brace initialization where you can, as it will treat narrowing conversions as errors.
If you explicitly want to use 32-bit values like std::uint32_t, I'd try to find the minimal number of places to explicitly convert (i.e., static_cast) the "sizes" to the smaller types. You don't want casts everywhere, but if you're using library container sizes internally and you want your API to use std::uint32_t, explicitly cast at the API boundaries so that a user of your class never has to worry about doing the conversion themselves. If you can keep the conversions to just a couple places, it becomes practical to add run-time checks (i.e., assertions) that the size has not actually outgrown the range of the smaller type.
If you don't care about the exact size, use std::size_t, which is almost certainly identical to std::XXX::size_type for all of the standard containers. It's theoretically possible for them to be different, but it doesn't happen in practice. In most contexts, std::size_t is less verbose than std::vector::size_type, so it makes a good compromise.
Lots of people (including many people on the C++ standards committee) will tell you to avoid unsigned values even for sizes and indexes. I understand and respect their arguments, but I don't find them persuasive enough to justify the extra friction at the interface with the standard library. Whether or not it's an historical artifact that std::size_t is unsigned, the fact is that the standard library uses unsigned sizes extensively. If you use something else, your code ends up littered with implicit conversions, all of which are potential bugs. Worse, those implicit conversions make turning on the compiler warnings impractical, so all those latent bugs remain relatively invisible. (And even if you know your sizes will never exceed the smaller type, being forced to turn off the compiler warnings for signedness and narrowing means you could miss bugs in completely unrelated parts of the code.) Match the types of the APIs you're using as much as possible, assert and explicitly convert when necessary, and turn on all the warnings.
Keep in mind that auto is not a panacea. for (auto i = 0; i < my_vec.size(); ++i) ... is just as bad as for (int i .... But if you generally prefer algorithms and iterators to raw loops, auto will get you pretty far.
You must never divide unless you know the denominator is not 0. Similarly, with unsigned integral types, you must never subtract unless you know the subtrahend is less than or equal to the original value. If you can make that a habit, you can avoid the bugs that the always-use-a-signed-type folks are concerned about.
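As a concrete illustration of the "convert at the API boundary" tidbit above (the function name is made up), the cast and the run-time check live in exactly one place:

#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

std::uint32_t size_u32(const std::vector<int>& v)
{
    // Document (and check) the assumption that the size fits in 32 bits.
    assert(v.size() <= std::numeric_limits<std::uint32_t>::max());
    return static_cast<std::uint32_t>(v.size());
}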

How to avoid SSE pipeline flush?

I've been encountering a very subtle issue with SSE. Here is the case: I want to optimise my ray tracer with SSE so that I can get a basic feeling for how to improve performance with SSE.
I'd like to start with this very function.
Vector3f Add( const Vector3f& v0 , Vector3f& v1 );
(Actually I tried to optimise CrossProduct first; addition is shown here for simplicity, and I know it is not the bottleneck of my ray tracer.)
Here is a part of the definition of the struct:
struct Vector3f
{
    union
    {
        struct { float x; float y; float z; float reserved; };
        __m128 data;
    };
    // ...
};
The issue is that there will be SSE register flushes with this declaration; the compiler is not smart enough to hold those SSE registers for further use.
And with the following declaration, it avoids the flushing.
__m128 Add( __m128 v0_data, __m128 v1_data );
I can go this way in this case; however, it would be an ugly design for a Matrix, which holds four __m128 values. And you can't have operators that work on the Vector3f itself, only on its data :(.
The most disturbing thing is that you will have to change your higher-level code everywhere to adapt to the change. And this way of optimisation through SSE is definitely not an option for something large like a huge game engine; you'd change a huge amount of code before it works.
Without avoiding the SSE register flushing, its power will be drained away by those useless flushing instructions, which renders SSE useless, I guess.
It seems that a union is a bad thing to use here. As long as a compiler sees __m128 placed in a union with something else, it has problems understanding when to update values, leading to excessive memory operations.
MSVC is not the worst-performing compiler in this situation. Just check the code generated by GCC 5.1.0: on my machine it runs 12 times slower than the code generated by MSVC2013 (which is the version with register spilling), and 20+ times slower than the optimal code.
It is interesting that most compilers start doing silly things only when you really use the x, y, z members to access your data. For instance, MSVC2013 spills registers only when you read them via scalar members after computation (I guess to make sure these members are up to date). The terrible behavior of GCC seen above disappears if you set initial values with _mm_setr_ps instead of writing them directly into the members.
It is better to avoid unions in this case. It seems that the OP has come to the same decision (see the current Vector3fv code). Making it harder to access a single coordinate has a good "psychological" performance effect: a person would think twice before writing scalar code. You can easily write setters/getters either with extract/insert intrinsics (which makes the compiler generate those instructions), or with simple pointer arithmetic (which lets the compiler choose how):
float getX() const { return ((float*)&data)[0]; }
When I remove the union and simply use __m128, the generated code becomes better on all compilers. However, MSVC2013 still has unnecessary moves: one useless register move per arithmetic operation. I suppose this is an inefficiency in the compiler's inlining algorithm. You can remove these moves in MSVC2013 by declaring all your functions as __vectorcall. Note that using this new calling convention also allows you to avoid register spilling in case your SIMD functions have not been inlined at all.
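A sketch of the union-free direction described above (the exact layout of the OP's Vector3fv is assumed here, not quoted):

#include <xmmintrin.h>

struct Vector3fv
{
    __m128 data;

    // Scalar access through pointer arithmetic, as suggested above.
    float getX() const { return ((float*)&data)[0]; }
};

// Pass and return by value with __vectorcall so that MSVC2013 keeps the
// values in XMM registers even when the function is not inlined.
inline Vector3fv __vectorcall Add(Vector3fv a, Vector3fv b)
{
    Vector3fv r;
    r.data = _mm_add_ps(a.data, b.data);
    return r;
}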

When does it make sense to typedef basic data types?

A company's internal C++ coding standards document states that even for basic data types like int, char, etc. one should define one's own typedefs, like "typedef int Int". This is justified by the supposed advantage of portability of the code.
However, are there general considerations/advice about when (that is, for which types of projects) it really makes sense?
Thanks in advance.
Typedefing int to Int offers almost no advantage at all (it provides no semantic benefit, and leads to absurdities like typedef long Int on other platforms to remain compatible).
However, typedefing int to e.g. int32_t (along with long to int64_t, etc.) does offer an advantage, because you are now free to choose the data-type with the relevant width in a self-documenting way, and it will be portable (just switch the typedefs on a different platform).
In fact, most compilers offer a stdint.h which contains all of these definitions already.
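For example, with <cstdint> (C++11) or stdint.h the fixed-width types are already there:

#include <cstdint>

std::int32_t  counter   = 0;      // exactly 32 bits, signed
std::uint64_t big_value = 0;      // exactly 64 bits, unsigned
std::uint_fast16_t flag = 0;      // at least 16 bits, whatever is fastest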
That depends. The example you cite:
typedef int Int;
is just plain dumb. It's a bit like defining a constant:
const int five = 5;
Just as there is zero chance of the variable five ever becoming a different number, the typedef Int can only possibly refer to the primitive type int.
OTOH, a typedef like this:
typedef unsigned char byte;
makes life easier on the fingers (though it has no portability benefits), and one like this:
typedef unsigned long long uint64;
Is both easier to type and more portable, since, on Windows, you would write this instead (I think):
typedef unsigned __int64 uint64;
Rubbish.
"Portability" is non-sense, because int is always an int. If they think they want something like an integer type that's 32-bits, then the typedef should be typedef int int32_t;, because then you are naming a real invariant, and can actually ensure that this invariant holds, via the preprocessor etc.
But this is, of course, a waste of time, because you can use <cstdint>, either in C++0x, or by extensions, or use Boost's implementation of it anyway.
Typedefs can help describing the semantics of the data type. For instance, if you typedef float distance_t;, you're letting the developer in on how the values of distance_t will be interpreted. For instance you might be saying that the values may never be negative. What is -1.23 kilometers? In this scenario, it might just not make sense with negative distances.
Of course, typedefs do not in any way constrain the domain of the values. It is just a way to make code readable (or at least it should be), and to convey extra information.
The portability issue your workplace seems to be referring to arises when you want to ensure that a particular datatype is always the same size, no matter which compiler is used. For instance
#ifdef TURBO_C_COMPILER
typedef long int32;
#elif defined(MSVC_32_BIT_COMPILER)
typedef int int32;
#elif ...
...
#endif
typedef int Int is a dreadful idea... people will wonder if they're looking at C++, it's hard to type, visually distracting, and the only vaguely imaginable rationalisation for it is flawed, but let's put it out there explicitly so we can knock it down:
if one day say a 32-bit app is being ported to 64-bit, and there's lots of stupid code that only works for 32-bit ints, then at least the typedef can be changed to keep Int at 32 bits.
Critique: if the system is littered with code that's so badly written (i.e. not using an explicitly 32-bit type from cstdint), it's overwhelmingly likely to have other parts of the code where it will now need to be using 64-bit ints that will get stuck at 32-bit via the typedef. Code that interacts with library/system APIs using ints is likely to be given Ints, resulting in truncated handles that work until they happen to be outside the 32-bit range etc.. The code will need a complete reexamination before being trustworthy anyway. Having this justification floating around in people's minds can only discourage them from using explicitly-sized types where they are actually useful ("what are you doing that for?" "portability?" "but Int's for portability, just use that").
That said, the coding rules might be meant to encourage typedefs for things that are logically distinct types, such as temperatures, prices, speeds, distances etc.. In that case, typedefs can be vaguely useful in that they allow an easy way to recompile the program to, say, upgrade from float precision to double, downgrade from a real type to an integral one, or substitute a user-defined type with some special behaviours. It's quite handy for containers too, so that there's less work and less client impact if the container is changed, although such changes are usually a little painful anyway: the container APIs are designed to be a bit incompatible so that the important parts must be reexamined rather than compiling but not working or silently performing dramatically worse than before.
It's essential to remember though that a typedef is only an "alias" to the actual underlying type, and doesn't actually create a new distinct type, so people can pass any value of that same type without getting any kind of compiler warning about type mismatches. This can be worked around with a template such as:
template <typename T, int N>
struct Distinct
{
Distinct(const T& t) : t_(t) { }
operator T&() { return t_; }
operator const T&() const { return t_; }
T t_;
};
typedef Distinct<float, 42> Speed;
But, it's a pain to make the values of N unique... you can perhaps have a central enum listing the distinct values, or use __LINE__ if you're dealing with one translation unit and no multiple typedefs on a line, or take a const char* from __FILE__ as well, but there's no particularly elegant solution I'm aware of.
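A quick usage sketch (setSpeed and the Distance typedef are made up) of what the wrapper buys you:

typedef Distinct<float, 42> Speed;
typedef Distinct<float, 43> Distance;

void setSpeed(Speed) { }

void example()
{
    Speed v(1.5f);
    Distance d(100.0f);

    setSpeed(v);      // fine
    // setSpeed(d);   // error: Distinct<float, 43> is not Distinct<float, 42>

    float raw = v;    // still implicitly convertible when you want the number
}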
(One classic article from 10 or 15 years ago demonstrated how you could create templates for types that knew of several orthogonal units, keeping counters of the current "power" in each, and adjusting the type as multiplications, divisions etc were performed. For example, you could declare something like Meters m; Time t; Acceleration a = m / t / t; and have it check all the units were sensible at compile time.)
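A minimal sketch of that idea, stripped down to metres and seconds (names invented): the exponents live in template parameters and division adjusts them, so unit mistakes fail to compile.

template <int M, int S>            // powers of metres and seconds
struct Unit
{
    explicit Unit(double v) : value(v) { }
    double value;
};

template <int M1, int S1, int M2, int S2>
Unit<M1 - M2, S1 - S2> operator/(Unit<M1, S1> a, Unit<M2, S2> b)
{
    return Unit<M1 - M2, S1 - S2>(a.value / b.value);
}

typedef Unit<1, 0>  Meters;        // m
typedef Unit<0, 1>  Time;          // s
typedef Unit<1, -2> Acceleration;  // m / s^2

// Acceleration a = Meters(100.0) / Time(10.0) / Time(10.0);  // compiles
// Acceleration b = Meters(100.0) / Time(10.0);               // compile error: Unit<1, -1>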
Is this a good idea anyway? Most people clearly consider it overkill, as almost nobody ever does it. Still, it can be useful and I have used it on several occasions where it was easy and/or particularly dangerous if values were accidentally misassigned.
I suppose that the main reason is portability of your code. For example, once you assume a 32-bit integer type in the program, you need to be sure that int on the other platform is also 32 bits long. A typedef in a header helps you localize the changes to your code in one place.
I would like to point out that it could also be used for people who speak a different language. For instance, if you speak Spanish and your code is all in Spanish, wouldn't you want a type definition in Spanish? Just something to consider.

Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND and if-else conditions. My question is: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions processor specific?
SSE instructions are processor specific. You can look up which processor supports which SSE version on wikipedia.
Whether SSE code will be faster or not depends on many factors. The first is of course whether the problem is memory-bound or CPU-bound. If the memory bus is the bottleneck, SSE will not help much. Try simplifying your integer calculations: if that makes the code faster, it's probably CPU-bound, and you have a good chance of speeding it up.
Be aware that writing SIMD code is a lot harder than writing C++ code, and that the resulting code is much harder to change. Always keep the C++ code up to date: you'll want it as a comment and to check the correctness of your assembler code.
Think about using a library like the IPP, that implements common low-level SIMD operations optimized for various processors.
SIMD, of which SSE is an example, allows you to do the same operation on multiple chunks of data. So, you won't get any advantage to using SSE as a straight replacement for the integer operations, you will only get advantages if you can do the operations on multiple data items at once. This involves loading some data values that are contiguous in memory, doing the required processing and then stepping to the next set of values in the array.
Problems:
1 If the code path is dependent on the data being processed, SIMD becomes much harder to implement. For example:
a = array[index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
    a += 2;
    array[index] = a;
}
++index;
is not easy to do as SIMD (a mask-based SSE2 sketch follows this list):
a1 = array [index] a2 = array [index+1] a3 = array [index+2] a4 = array [index+3]
a1 &= mask a2 &= mask a3 &= mask a4 &= mask
a1 >>= shift a2 >>= shift a3 >>= shift a4 >>= shift
if (a1<somevalue) if (a2<somevalue) if (a3<somevalue) if (a4<somevalue)
// help! can't conditionally perform this on each column, all columns must do the same thing
index += 4
2 If the data is not contiguous then loading the data into the SIMD instructions is cumbersome
3 The code is processor specific. SSE is only on IA32 (Intel/AMD) and not all IA32 CPUs support SSE.
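Purely as an illustration of point 1 (a sketch, not drop-in code: it assumes 32-bit signed elements, a count that is a multiple of 4, and SSE2), the if becomes a compare that produces a per-lane mask, and the mask selects between the updated and the original values:

#include <emmintrin.h>  // SSE2

void process(int* array, int n, int mask, int shift, int somevalue)
{
    const __m128i vmask  = _mm_set1_epi32(mask);
    const __m128i vlimit = _mm_set1_epi32(somevalue);
    const __m128i vtwo   = _mm_set1_epi32(2);
    const __m128i vshift = _mm_cvtsi32_si128(shift);

    for (int index = 0; index < n; index += 4)
    {
        __m128i orig = _mm_loadu_si128(reinterpret_cast<const __m128i*>(array + index));
        __m128i a    = _mm_and_si128(orig, vmask);
        a            = _mm_srl_epi32(a, vshift);         // logical shift right

        __m128i lt      = _mm_cmplt_epi32(a, vlimit);    // all-ones where a < somevalue
        __m128i updated = _mm_add_epi32(a, vtwo);

        // Keep the original where the condition is false, the update where true.
        __m128i result = _mm_or_si128(_mm_and_si128(lt, updated),
                                      _mm_andnot_si128(lt, orig));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(array + index), result);
    }
}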
You need to analyse the algorithm and the data to see if it can be SSE'd and that requires knowing how SSE works. There's plenty of documentation on Intel's website.
This kind of problem is a perfect example of where a good low level profiler is essential. (Something like VTune) It can give you a much more informed idea of where your hotspots lie.
My guess, from what you describe, is that your hotspot will probably be branch prediction failures resulting from min/max calculations using if/else. Therefore, using SIMD intrinsics should allow you to use the min/max instructions; however, it might be worth just trying to use a branchless min/max calculation instead. This might achieve most of the gains with less pain.
Something like this:
inline int
minimum(int a, int b)
{
    // (a - b) >> 31 is all ones when a < b (assuming an arithmetic shift
    // and no overflow in a - b), all zeros otherwise.
    int mask = (a - b) >> 31;
    return ((a & mask) | (b & ~mask));
}
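And, as a hedged sketch of the intrinsic min/max route for the masked loop from the question (the function name is invented; _mm_min_epi32/_mm_max_epi32 need SSE4.1, on plain SSE2 they would have to be emulated with a compare and a blend; n is assumed to be a multiple of 4):

#include <smmintrin.h>  // SSE4.1
#include <climits>

void masked_min_max(const int* data, int n, int mask, int& out_min, int& out_max)
{
    const __m128i vmask = _mm_set1_epi32(mask);
    __m128i vmin = _mm_set1_epi32(INT_MAX);
    __m128i vmax = _mm_set1_epi32(INT_MIN);

    for (int i = 0; i < n; i += 4)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        v = _mm_and_si128(v, vmask);
        vmin = _mm_min_epi32(vmin, v);   // per-lane running minimum
        vmax = _mm_max_epi32(vmax, v);   // per-lane running maximum
    }

    // Reduce the four lanes with plain scalar code.
    int mins[4], maxs[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(mins), vmin);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(maxs), vmax);
    out_min = INT_MAX;
    out_max = INT_MIN;
    for (int i = 0; i < 4; ++i)
    {
        if (mins[i] < out_min) out_min = mins[i];
        if (maxs[i] > out_max) out_max = maxs[i];
    }
}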
If you use SSE instructions, you're obviously limited to processors that support these.
That means x86, dating back to the Pentium III (introduced in 1999, so a long time ago).
SSE2, which, as far as I can recall, is the one that offers integer operations, is somewhat more recent (it came with the Pentium 4, and the first AMD Athlon processors didn't support it).
In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea. That makes it virtually impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembler).
Alternatively, use the intrinsics available with your compiler (if memory serves, they're usually defined in xmmintrin.h)
But again, the performance may not improve. SSE code poses additional requirements of the data it processes. Mainly, the one to keep in mind is that data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128 bit SSE register can hold 4 ints. Adding the first and the second one together is not optimal. But adding all four ints to the corresponding 4 ints in another register will be fast)
It may be tempting to use a library that wraps all the low-level SSE fiddling, but that might also ruin any potential performance benefit.
I don't know how good SSE's integer operation support is, so that may also be a factor that can limit performance. SSE is mainly targeted at speeding up floating point operations.
If you intend to use Microsoft Visual C++, you should read this:
http://www.codeproject.com/KB/recipes/sseintro.aspx
We have implemented some image processing code, similar to what you describe but on a byte array, in SSE. The speedup compared to C code is considerable, depending on the exact algorithm more than a factor of 4, even with respect to the Intel compiler. However, as you already mentioned, you have the following drawbacks:
Portability. The code will run on every Intel-like CPU, so also AMD, but not on other CPUs. That is not a problem for us because we control the target hardware. Switching compilers, or even moving to a 64-bit OS, can also be a problem.
You have a steep learning curve, but I found that after you grasp the principles, writing new algorithms is not that hard.
Maintainability. Most C or C++ programmers have no knowledge of assembly/SSE.
My advice to you will be to go for it only if you really need the performance improvement, and you can't find a function for your problem in a library like the intel IPP, and if you can live with the portability issues.
I can tell from my experience that SSE brings a huge (4x and up) speedup over a plain C version of the code (no inline asm, no intrinsics used), but hand-optimized assembler can beat compiler-generated assembly if the compiler can't figure out what the programmer intended (believe me, compilers don't cover all possible code combinations, and they never will).
Also, the compiler cannot always lay out the data so that it runs at the fastest possible speed.
But you need a lot of experience to get a speedup over an Intel compiler (if that is possible at all).
SSE instructions were originally just on Intel chips, but recently (since Athlon?) AMD supports them as well, so if you do code against the SSE instruction set, you should be portable to most x86 procs.
That being said, it may not be worth your time to learn SSE coding unless you're already familiar with assembler on x86's - an easier option might be to check your compiler docs and see if there are options to allow the compiler to autogenerate SSE code for you. Some compilers do very well vectorizing loops in this way. (You're probably not surprised to hear that the Intel compilers do a good job of this :)
Write code that helps the compiler understand what you are doing. GCC will understand and optimize SSE code such as this:
typedef union Vector4f
{
// Easy constructor, defaulted to black/0 vector
Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f):
X(a), Y(b), Z(c), W(d) { }
// Cast operator, for []
inline operator float* ()
{
return (float*)this;
}
// Const cast operator, for const []
inline operator const float* () const
{
return (const float*)this;
}
// ---------------------------------------- //
inline Vector4f operator += (const Vector4f &v)
{
for(int i=0; i<4; ++i)
(*this)[i] += v[i];
return *this;
}
inline Vector4f operator += (float t)
{
for(int i=0; i<4; ++i)
(*this)[i] += t;
return *this;
}
// Vertex / Vector
// Lower case xyzw components
struct {
float x, y, z;
float w;
};
// Upper case XYZW components
struct {
float X, Y, Z;
float W;
};
};
Just don't forget to have -msse -msse2 on your build parameters!
Although it is true that SSE is specific to some processors (SSE may be relatively safe, SSE2 much less in my experience), you can detect the CPU at runtime, and load the code dynamically depending on the target CPU.
SIMD intrinsics (such as SSE2) can speed this sort of thing up but take expertise to use correctly. They are very sensitive to alignment and pipeline latency; careless use can make performance even worse than it would have been without them. You'll get a much easier and more immediate speedup from simply using cache prefetching to make sure all your ints are in L1 in time for you to operate on them.
Unless your function needs a throughput of better than 100,000,000 integers per second, SIMD probably isn't worth the trouble for you.
Just to add briefly to what has been said before about different SSE versions being available on different CPUs: This can be checked by looking at the respective feature flags returned by the CPUID instruction (see e.g. Intel's documentation for details).
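A small sketch of what such a runtime check can look like (the helper functions are placeholders; __builtin_cpu_supports is GCC/Clang-specific, on MSVC you would query __cpuid instead):

void add_scalar(const float* a, const float* b, float* out, int n);
void add_sse2(const float* a, const float* b, float* out, int n);   // compiled with SSE2 enabled

void add_dispatch(const float* a, const float* b, float* out, int n)
{
    if (__builtin_cpu_supports("sse2"))
        add_sse2(a, b, out, n);
    else
        add_scalar(a, b, out, n);
}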
Have a look at inline assembler for C/C++, here is a DDJ article. Unless you are 100% certain your program will run on a compatible platform you should follow the recommendations many have given here.
I agree with the previous posters. Benefits can be quite large, but getting them may require a lot of work. Intel's documentation on these instructions is over 4K pages. You may want to check out EasySSE (a C++ wrapper library over intrinsics, plus examples), free from Ocali Inc.
I assume my affiliation with this EasySSE is clear.
I don't recommend doing this yourself unless you're fairly proficient with assembly. Using SSE will, more than likely, require careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.
It would probably be much better for you to write very small loops and keep your data very tightly organized and just rely on the compiler doing this for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code, and will probably do a better job than you. (Just add -ftree-vectorize to your CXXFLAGS.)
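For example, a loop like the one below (plain indexing, no data-dependent branches) is exactly the shape auto-vectorizers handle well; compile with something like g++ -O2 -ftree-vectorize (or just -O3) and inspect the generated code. Marking the pointers __restrict can help the compiler prove there is no aliasing.

void add_arrays(const float* __restrict a, const float* __restrict b,
                float* __restrict out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}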
Edit: Another thing I should mention is that several compilers support assembly intrinsics, which would probably, IMO, be easier to use than the asm() or __asm{} syntax.