Conditional structures in SSE - C++

I have some trouble with a "special" kind of conditional structure in SSE/C++. The following pseudo code illustrates what I want to do:
for-loop ...
    // some SSE calculations
    __m128i a = ... // a contains four 32-bit ints
    __m128i b = ... // b contains four 32-bit ints
    if any of the four ints in a is less than its corresponding int in b
        vector.push_back(e.g. first component of a)
So I do quite a few SSE calculations and as the result of these calculations, I have two __m128i values, each containing four 32-bit integers. This part is working fine. But now I want to push something into a vector if at least one of the four ints in a is less than the corresponding int in b. I have no idea how I can achieve this.
I know the _mm_cmplt_epi32 function, but so far I failed to use it to solve my specific problem.
EDIT:
Yeah, actually I'm searching for a clever solution. I have a solution, but that looks very, very strange.
for-loop ...
    // some SSE calculations
    __m128i a = ... // a contains four 32-bit ints
    __m128i b = ... // b contains four 32-bit ints

    long long i[2] __attribute__((aligned (16)));
    __m128i cmp = _mm_cmplt_epi32(a, b);
    _mm_store_si128(reinterpret_cast<__m128i*>(i), cmp);
    if (i[0] || i[1]) {
        vector.push_back(...);
    }
I hope there is a better way...

You want to use the _mm_movemask_ps function, which will return an appropriate bitmask that you can test. Note that cmp is an integer vector, so it has to be reinterpreted as a float vector first:
__m128i cmp = _mm_cmplt_epi32(a, b);
if (_mm_movemask_ps(_mm_castsi128_ps(cmp)))
{
    vector.push_back(...);
}
Documented here: http://msdn.microsoft.com/en-us/library/4490ys29%28v=vs.90%29.aspx
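As a minor variation (my addition, not part of the original answer), _mm_movemask_epi8 performs the same any-lane test directly on the integer vector, avoiding the cast:
if (_mm_movemask_epi8(cmp)) // 16 bits, one per byte; nonzero iff any lane compared true
{
    vector.push_back(...);
}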

I did something similar to this to find prime numbers; see Finding lists of prime numbers with SIMD - SSE/AVX.
This is only going to be useful with SSE if the result of the comparison is false most of the time. Otherwise you should just use scalar code. Let me try and lay out the code.
__m128i cmp = _mm_cmplt_epi32(a, b);
if (_mm_movemask_epi8(cmp)) {
    int out[4] __attribute__((aligned (16)));
    // keep only the lanes of a where a < b, zeroing the rest
    _mm_store_si128(reinterpret_cast<__m128i*>(out), _mm_and_si128(cmp, a));
    for (int i = 0; i < 4; i++) if (out[i]) vector.push_back(out[i]);
}
You could store the comparison result instead of using the logical AND. Additionally, you could test the bits of the movemask lane by lane and skip the masking; a sketch of that variant follows. Either way, what really matters is that the movemask is zero most of the time; otherwise SSE won't be helpful.
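As a rough sketch (my reconstruction, not the original answer's code), the movemask-per-lane variant might look like this; it also handles lanes of a that happen to be zero:
__m128i cmp = _mm_cmplt_epi32(a, b);
int mask = _mm_movemask_ps(_mm_castsi128_ps(cmp)); // one bit per 32-bit lane
if (mask) {
    int out[4] __attribute__((aligned (16)));
    _mm_store_si128(reinterpret_cast<__m128i*>(out), a); // store a unmasked
    for (int i = 0; i < 4; i++)
        if (mask & (1 << i)) vector.push_back(out[i]); // only lanes where a < b
}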
In my case a was a list of numbers I wanted to test for primality and b was a list of divisors. Since I knew that most of the time the values of a were not prime, this gave me a boost of about 3x (out of a maximum of 4x with SSE).

Related

How to construct a Bitset from a vector of int

It's easy to construct a bitset<64> from a uint64_t:
uint64_t flags = ...;
std::bitset<64> bs{flags};
But is there a good way to construct a bitset<64 * N> from a uint64_t[N], such that flags[0] would refer to the lowest 64 bits?
uint64_t flags[3];
// ... some assignments
std::bitset<192> bs{flags}; // this very unhelpfully compiles
// yet is totally invalid
Or am I stuck having to call set() in a loop?
std::bitset has no range constructor, so you will have to loop, but setting every bit individually with std::bitset::set() is underkill. std::bitset has support for binary operators, so you can at least set 64 bits in bulk:
std::bitset<192> bs;
for(int i = 2; i >= 0; --i) {
bs <<= 64;
bs |= flags[i];
}
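For completeness, here is the same bulk approach as a self-contained function (a sketch; the name from_words is mine):
#include <bitset>
#include <cstdint>

std::bitset<192> from_words(const std::uint64_t (&flags)[3]) {
    std::bitset<192> bs;
    for (int i = 2; i >= 0; --i) {
        bs <<= 64;      // make room for the next 64-bit word
        bs |= flags[i]; // implicit conversion via bitset's unsigned long long constructor
    }
    return bs; // flags[0] ends up in the lowest 64 bits
}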
Update: In the comments, @icando raises the valid concern that bitshifts are O(N) operations for std::bitsets. For very large bitsets, this will ultimately eat up the performance boost of bulk processing. In my benchmarks, the break-even point for a std::bitset<N * 64>, in comparison to a simple loop that sets the bits individually and does not mutate the input data:
int pos = 0;
for(auto f : flags) {
for(int b = 0; b < 64; ++b) {
bs.set(pos++, f >> b & 1);
}
}
is somewhere around N == 200 (gcc 4.9 on x86-64 with libstdc++ and -O2). Clang performs somewhat worse, breaking even around N == 160. Gcc with -O3 pushes it up to N == 250.
Taking the lower end, this means that if you want to work with bitsets of 10000 bits or larger, this approach may not be for you. On 32-bit platforms (such as common ARMs), the threshold will probably lie lower, so keep that in mind when you work with 5000-bit bitsets on such platforms. I would argue, however, that somewhere far before this point, you should have asked yourself if a bitset is really the right choice of container.
If initializing from a range is important, you might consider using std::vector instead; it does have a constructor taking a pair of iterators.
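A hedged sketch of that idea (to_bits is my name; note that constructing a std::vector<bool> directly from a pair of uint64_t iterators would give one bool per word, not one per bit, so the words still have to be expanded):
#include <cstdint>
#include <vector>

std::vector<bool> to_bits(const std::uint64_t* words, std::size_t n) {
    std::vector<bool> bits;
    bits.reserve(n * 64);
    for (std::size_t i = 0; i < n; ++i)
        for (int b = 0; b < 64; ++b)
            bits.push_back((words[i] >> b) & 1u); // lowest word first, LSB first
    return bits;
}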


Fastest way to multiply two vectors of 32bit integers in C++, with SSE

I have two unsigned vectors, both of size 4:
vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};
Now I want to multiply these two vectors and get a new one:
vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};
What is the SSE operation to use? Is it cross-platform, or only available on certain platforms?
Adding:
If my goal is addition rather than multiplication, I can do this super fast:
__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);
Using the set intrinsics such as _mm_set_epi32 for all elements is inefficient. It's better to use the load intrinsics; see this discussion for more on that: Where does the SSE instructions outperform normal instructions. If the arrays are 16-byte aligned you can use either _mm_load_si128 or _mm_loadu_si128 (for aligned memory they have nearly the same efficiency); otherwise use _mm_loadu_si128. But aligned memory is much more efficient. To get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so you can use ordinary free.
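A minimal sketch of the aligned-load approach (my illustration; p1 and p2 are placeholder names):
#include <emmintrin.h>

unsigned* p1 = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
unsigned* p2 = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
// ... fill p1 and p2 with the input values ...
__m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(p1)); // aligned load
__m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(p2));
// ... multiply as described below ...
_mm_free(p1);
_mm_free(p2);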
To answer the rest of your question, let's assume you have your two vectors loaded in the SSE registers __m128i a and __m128i b.
For SSE versions >= SSE4.1, use
_mm_mullo_epi32(a, b);
Without SSE4.1:
This code is copied from Agner Fog's Vector Class Library (and was originally reproduced here without attribution):
// from Vec4i operator * (Vec4i const & a, Vec4i const & b)
__m128i a13 = _mm_shuffle_epi32(a, 0xF5);            // (-,a3,-,a1)
__m128i b13 = _mm_shuffle_epi32(b, 0xF5);            // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);            // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02, prod13); // (-,-,a1*b1,a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02, prod13); // (-,-,a3*b3,a2*b2)
__m128i prod = _mm_unpacklo_epi64(prod01, prod23);   // (ab3,ab2,ab1,ab0)
There is _mm_mul_epu32 which is SSE2 only and uses the pmuludq instruction. Since it's an SSE2 instruction 99.9% of all CPUs support it (I think the most modern CPU that doesn't is an AMD Athlon XP).
It has a significant downside in that it only multiplies two integers at a time, because it returns 64-bit results, and you can only fit two of those in a register. This means you'll probably need to do a bunch of shuffling which adds to the cost.
Probably _mm_mullo_epi32 is what you need, although its intended use is for signed integers. This should not cause problems as long as the values in v1 and v2 are small enough that the most significant bit is never set. It's SSE 4.1. As an alternative you might want to consider _mm_mul_epu32.
You can (if SSE 4.1 is available) use
__m128i _mm_mullo_epi32 (__m128i a, __m128i b);
to multiply packed 32bit integers.
Otherwise you'd have to shuffle both packs in order to use _mm_mul_epu32 twice. See @user2088790's answer for explicit code.
Note that you could also use _mm_mul_epi32 but that is SSE4 so you'd rather use _mm_mullo_epi32 anyway.
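A hedged usage sketch combining this with the question's data (assumes SSE4.1 and that both vectors hold at least four elements):
#include <smmintrin.h>
#include <vector>

std::vector<unsigned> v1 = {2, 4, 6, 8};
std::vector<unsigned> v2 = {1, 10, 11, 13};
__m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v1.data()));
__m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v2.data()));
__m128i c = _mm_mullo_epi32(a, b); // lane-wise product, low 32 bits of each result
unsigned out[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(out), c); // out == {2, 40, 66, 104}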
std::transform applies the given function to a range and stores the result in another range:
std::vector<unsigned> result(v1.size());
std::transform(v1.begin(), v1.end(), v2.begin(), result.begin(), std::multiplies<unsigned>());

SSE extracting integers from a __m128 for indexing an array

In some code I have converted to SSE I perform some ray tracing, tracing 4 rays at a time using __m128 data types.
In the method where I determine which objects are hit first, I loop through all objects, test for intersection, and create a mask representing which rays had an intersection earlier than any previously found.
I also need to maintain data on the id of the objects which correspond to the best hit times. I do this by maintaining a __m128 data type called objectNo and I use the mask determined from the intersection times to update objectNo as follows:
objectNo = _mm_blendv_ps(objectNo,_mm_set1_ps((float)pobj->getID()),mask);
Where pobj->getID() will return an integer representing the id of the current object. Making this cast and using the blend seemed to be the most efficient way of updating the objectNo for all 4 rays.
After all intersections are tested I try to extract the objectNo's individually and use them to access an array to register the intersection. Most commonly I have tried this:
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
However this crashes with EXC_BAD_ACCESS as extracting a float with value 1.0 converts to an int of value 1065353216.
How do I correctly unpack the __m128 into ints which can be used to index an array?
There are two SSE2 conversion intrinsics which seem to do what you want:
_mm_cvtps_epi32()
_mm_cvttps_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_conversion.htm
These will convert 4 single-precision FP to 4 32-bit integers. The first one does it with rounding. The second one uses truncation.
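A small illustration of the difference (my example; _mm_cvtps_epi32 rounds according to the current MXCSR rounding mode, round-to-nearest-even by default):
__m128 v = _mm_set_ps(3.7f, 2.5f, 1.2f, 0.9f); // lanes 3..0
__m128i r = _mm_cvtps_epi32(v);  // lanes 0..3: {1, 1, 2, 4} (2.5 rounds to even)
__m128i t = _mm_cvttps_epi32(v); // lanes 0..3: {0, 1, 2, 3} (truncation toward zero)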
So they can be used like this:
int o0 = _mm_extract_epi32(_mm_cvtps_epi32(objectNo), 0);
prv_noHits[o0]++;
EDIT: Based on what you're trying to do, I feel this can be better optimized as follows:
__m128i ids = _mm_set1_epi32(pobj->getID());
// The mask will need to change: a variable blend on integer data is
// _mm_blendv_epi8, which takes an integer-vector mask, not an immediate.
objectNo = _mm_blendv_epi8(objectNo, ids, mask);
int o0 = _mm_extract_epi32(objectNo, 0);
prv_noHits[o0]++;
This version gets rid of the unnecessary conversions. But you will need to use a different mask vector.
EDIT 2: Here's a way so that you won't have to change your mask:
__m128 ids = _mm_castsi128_ps(_mm_set1_epi32(pobj->getID()));
objectNo = _mm_blendv_ps(objectNo,ids,mask);
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
Note that the _mm_castsi128_ps() intrinsic doesn't map any instruction. It's just a bit-wise datatype conversion from __m128i to __m128 to get around the "typeness" in C/C++.
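For illustration (my example), the cast is a pure reinterpretation and round-trips losslessly:
__m128i bits = _mm_castps_si128(objectNo); // view the float vector as integers
__m128  same = _mm_castsi128_ps(bits);     // identical bit pattern to objectNo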

Why does SSE set (_mm_set_ps) reverse the order of arguments

I recently noticed that
__m128 m = _mm_set_ps(0,1,2,3);
puts the 4 floats into reverse order when cast to a float array:
float* p = (float*)(&m);
// p[0] == 3
// p[1] == 2
// p[2] == 1
// p[3] == 0
The same happens with a union { __m128 m; float a[4]; } as well.
Why do SSE operations use this ordering? It's not a big deal but slightly confusing.
And a follow-up question:
When accessing elements in the array by index, should one access in the order 0..3 or the order 3..0 ?
It's just a convention; they had to pick some order, and it really doesn't matter what the order is as long as everyone follows it. Intel happens to like little-endianness.
As far as accessing by index goes... the best thing is to try to avoid doing it. Nothing kills vector performance like element-wise accesses. If you must, try set things up so that the indexing matches the hardware vector lanes; that's what most vector programmers (in my experience) will expect.
Depending on what you would like to do, you can use either _mm_set_ps or _mm_setr_ps.
__m128 _mm_setr_ps (float z, float y, float x, float w )
Sets the four SP FP values to the four inputs in reverse order.
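A small example contrasting the two (my addition):
__m128 m1 = _mm_set_ps(0, 1, 2, 3);  // stored in memory as 3, 2, 1, 0
__m128 m2 = _mm_setr_ps(0, 1, 2, 3); // stored in memory as 0, 1, 2, 3
float out[4];
_mm_storeu_ps(out, m2); // out[0] == 0, out[1] == 1, out[2] == 2, out[3] == 3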
Isn't that consistent with the little-endian nature of x86 hardware? It matches the way x86 stores the bytes of a long long.