SSE extracting integers from a __m128 for indexing an array - c++

In some code I have converted to SSE I perform some ray tracing, tracing 4 rays at a time using __m128 data types.
In the method where I determine which objects are hit first, I loop through all objects, test for intersection and create a mask representing which rays had an intersection earlier than previously found.
I also need to maintain data on the id of the objects which correspond to the best hit times. I do this by maintaining a __m128 data type called objectNo and I use the mask determined from the intersection times to update objectNo as follows:
objectNo = _mm_blendv_ps(objectNo,_mm_set1_ps((float)pobj->getID()),mask);
Where pobj->getID() will return an integer representing the id of the current object. Making this cast and using the blend seemed to be the most efficient way of updating the objectNo for all 4 rays.
After all intersections are tested I try to extract the objectNo's individually and use them to access an array to register the intersection. Most commonly I have tried this:
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
However this crashes with EXC_BAD_ACCESS: _mm_extract_ps returns the raw bit pattern of the lane, so extracting a float with value 1.0 gives an int of value 1065353216 (0x3F800000).
How do I correctly unpack the __m128 into ints which can be used to index an array?

There are two SSE2 conversion intrinsics which seem to do what you want:
_mm_cvtps_epi32()
_mm_cvttps_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_conversion.htm
These convert 4 single-precision floats to 4 32-bit integers. The first does it with rounding (to nearest, under the default rounding mode); the second uses truncation toward zero, matching the behaviour of a C cast.
So they can be used like this:
int o0 = _mm_extract_epi32(_mm_cvtps_epi32(objectNo), 0);
prv_noHits[o0]++;
EDIT: Based on what you're trying to do, I feel this can be better optimized as follows:
__m128i ids = _mm_set1_epi32(pobj->getID());
// The mask will need to change
objectNo = _mm_blendv_epi8(objectNo, ids, mask); // variable blend; _mm_blend_epi16 only takes an immediate mask
int o0 = _mm_extract_epi32(objectNo, 0);
prv_noHits[o0]++;
This version gets rid of the unnecessary conversions. But you will need to use a different mask vector.
EDIT 2: Here's a way so that you won't have to change your mask:
__m128 ids = _mm_castsi128_ps(_mm_set1_epi32(pobj->getID()));
objectNo = _mm_blendv_ps(objectNo,ids,mask);
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
Note that the _mm_castsi128_ps() intrinsic doesn't map to any instruction. It's just a bit-wise datatype conversion from __m128i to __m128 to get around the "typeness" in C/C++.
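For illustration, extracting all four lane IDs after the loop might look like this (a sketch, not from the original answer; it stores the lanes in one go and assumes prv_noHits is sized to cover every object ID):
// The lanes of objectNo hold raw integer IDs (stored via the cast above),
// so reinterpret back to __m128i and store all four at once.
alignas(16) int ids[4];
_mm_store_si128(reinterpret_cast<__m128i*>(ids), _mm_castps_si128(objectNo));
for (int k = 0; k < 4; ++k)
    prv_noHits[ids[k]]++;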

Related

Can OpenMP's SIMD directive vectorize indexing operations?

Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the column at the same position in SIG. I would like to index into the appropriate value in SIG using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
               const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
               double& bfVal) {
    Eigen::VectorXd bfPTVec(idxt.size());
    #pragma omp simd
    for (int j = 0; j < idxt.size(); j++) {
        int vIDX = static_cast<int>(idxt(j));
        double interp1 = vIDX + 1 - idxt(j);
        double interp2 = idxt(j) - vIDX;
        bfPTVec(j) = (SIG(vIDX,j)*interp1 + SIG(vIDX+1,j)*interp2);
    }
    bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do, is not a vectorizable operation.
Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime, I'm not clear whether the kind of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big-picture description of my problem, where each idxt corresponds to the column at the same "position" in SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values that I don't want contributing to the final summation in idxt are set to zero when idxt is initialized outside of this method. Hence the last line in the example given above.
Theoretically, it should be possible, assuming the processor supports this operation. However, in practice, this is not the case, for many reasons.
First of all, mainstream x86-64 processors supporting the AVX2 (or AVX-512) instruction set do have instructions for that: gather SIMD instructions. Unfortunately, they are quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, gathers are not yet implemented very efficiently on mainstream processors: they fetch each item separately, which is no faster than scalar code. They can still be useful if the rest of the code is vectorized, since reading many scalar values to fill a SIMD register manually tends to be a bit less efficient (although manual filling was surprisingly faster on old processors, due to a quite inefficient early implementation of the gather instructions). Note that if the SIG matrix is big, cache misses will significantly slow down the code.
Additionally, AVX2 is not enabled by default by mainstream compilers, because not all x86-64 processors support it. Thus, you need to enable it (e.g. using -mavx2) so that compilers can vectorize the loop efficiently. Unfortunately, this is not enough: most compilers currently fail to detect automatically when this instruction can/should be used. Even when they can, the fact that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally does not help them generate efficient code (although it should be fine here). Note that you can tell your compiler to assume that operations are associative and that you only use finite/basic real numbers (e.g. using -ffast-math, which can be unsafe). The same applies to Eigen types/operators if compilers fail to completely inline all the functions (which is the case for ICC).
To speed up the code, you can try changing the type of the SIG variable to a matrix reference containing int32_t items. Another possible optimization is to split the loop into small fixed-size chunks (e.g. 32 items) and compute the indirection in a separate loop, so that compilers can vectorize at least some of the loops; a sketch of this follows. Some compilers, like Clang, are able to do that automatically for you: they generate a fast SIMD implementation for part of the loop and do the indirections using scalar instructions. If this is not enough (which appears to be the case so far), then you will certainly need to vectorize the loop yourself using SIMD intrinsics (or possibly use a SIMD library that does that for you).
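A minimal sketch of that chunked structure, reusing the question's calcPoint names (CHUNK is an arbitrary choice; std::min needs <algorithm>; only the second, arithmetic-only loop is expected to vectorize):
constexpr int CHUNK = 32;
const int n = static_cast<int>(idxt.size());
for (int j0 = 0; j0 < n; j0 += CHUNK) {
    const int m = std::min(CHUNK, n - j0);
    double s0[CHUNK], s1[CHUNK];   // SIG values gathered for this chunk
    for (int k = 0; k < m; k++) {  // scalar gather: the indirection
        const int j = j0 + k;
        const int v = static_cast<int>(idxt(j));
        s0[k] = SIG(v, j);
        s1[k] = SIG(v + 1, j);
    }
    #pragma omp simd
    for (int k = 0; k < m; k++) {  // arithmetic only: vectorizable
        const int j = j0 + k;
        const int v = static_cast<int>(idxt(j));
        bfPTVec(j) = s0[k] * (v + 1 - idxt(j)) + s1[k] * (idxt(j) - v);
    }
}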
Probably not, but I would expect a manually vectorized version to be faster.
Below is an example of that inner loop, untested. It doesn't use AVX, only SSE up to 4.1, and should be compatible with these Eigen matrices you have there.
The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column major. That's normally the default memory layout in Eigen, but it's possible to override with template arguments, and probably with macros as well.
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
    // Load the index value into both lanes of the vector
    __m128d idx = _mm_loaddup_pd( pIndex );
    // Convert into int32 with truncation; this assumes the number there ain't negative.
    const int iFloor = _mm_cvttsd_si32( idx );
    // Compute fractional part
    idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
    // Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
    idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
    // Load two int16_t values from sequential addresses
    const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
    // Upcast them to int32, then to fp64
    const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
    // Compute the result
    __m128d res = _mm_mul_pd( idx, signs );
    res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
    // The above 2 lines (3 instructions) can be replaced with the following one:
    // const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
    // It may or may not be better, the dppd instruction is not particularly fast.
    return _mm_cvtsd_f64( res );
}
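Wiring it into the original loop might look roughly like this (a sketch, keeping the question's filtering of non-positive indices and assuming column-major SIG):
Eigen::VectorXd bfPTVec(idxt.size());
for (int j = 0; j < idxt.size(); j++)
    bfPTVec(j) = (idxt(j) > 0.0) ? computePoint(idxt.data() + j, &SIG(0, j)) : 0.0;
bfVal = bfPTVec.sum();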

How is it possible to get the float value from XMVECTOR? (DirectXMath)

I would like to get the dot product of two 3D vectors as a float. But unfortunately the result is a vector, not a float. I tried to access its elements using vector4_f32, but I get an error that it's not a member of __m128:
float res = XMVector3Dot(a, b).vector4_f32[0];
The [] operator is not defined on XMVECTOR
You can access individual elements of XMVECTOR by using XMVectorGetX, XMVectorGetY, XMVectorGetZ and XMVectorGetW. But remember, these are likely to be expensive operations, as DirectXMath uses the SIMD instruction set and moving a single lane out of a SIMD register into a scalar float has a cost. For more info:
1: XMVector3Dot performance
2: Expensive than expected
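For example, a minimal sketch: XMVector3Dot replicates the dot product into all four lanes, so reading the X lane yields the scalar.
#include <DirectXMath.h>

float Dot3(DirectX::FXMVECTOR a, DirectX::FXMVECTOR b)
{
    // XMVector3Dot broadcasts the result to every component;
    // XMVectorGetX moves one lane out of the SIMD register into a float.
    return DirectX::XMVectorGetX(DirectX::XMVector3Dot(a, b));
}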

Conditional structures in SSE

I have some trouble with a "special" kind of conditional structure in SSE/C++. The following pseudo code illustrates what I want to do:
for-loop ...
    // some SSE calculations
    __m128i a = ... // a contains four 32-bit ints
    __m128i b = ... // b contains four 32-bit ints
    if any of the four ints in a is less than its corresponding int in b
        vector.push_back(e.g. first component of a)
So I do quite a few SSE calculations and as the result of these calculations, I have two __m128i values, each containing four 32-bit integer. This part is working fine. But now I want to push something into a vector, if at least one of the four ints in a is less than the corresponding int in b. I have no idea how I can achieve this.
I know the _mm_cmplt_epi32 function, but so far I failed to use it to solve my specific problem.
EDIT:
Yeah, actually I'm searching for a clever solution. I have a solution, but it looks very, very strange.
for-loop ...
    // some SSE calculations
    __m128i a = ... // a contains four 32-bit ints
    __m128i b = ... // b contains four 32-bit ints
    long long i[2] __attribute__((aligned (16)));
    __m128i cmp = _mm_cmplt_epi32(a, b);
    _mm_store_si128(reinterpret_cast<__m128i*>(i), cmp);
    if(i[0] || i[1]) {
        vector.push_back(...)
    }
I hope there is a better way...
You want to use the _mm_movemask_ps function, which will return an appropriate bitmask which you can test. Since the comparison result is a __m128i, it first needs a bit-wise cast to __m128:
__m128i cmp = _mm_cmplt_epi32(a, b);
if(_mm_movemask_ps(_mm_castsi128_ps(cmp)))
{
    vector.push_back(...);
}
Documented here: http://msdn.microsoft.com/en-us/library/4490ys29%28v=vs.90%29.aspx
I did something similar to this to find prime numbers: Finding lists of prime numbers with SIMD - SSE/AVX
This is only going to be useful with SSE if the result of the comparison is false most of the time. Otherwise you should just use scalar code. Let me try and lay out the code.
__m128i cmp = _mm_cmplt_epi32(a, b);
if(_mm_movemask_epi8(cmp)) {
    int out[4] __attribute__((aligned (16)));
    // Keep only the lanes of a where the comparison was true
    _mm_store_si128(reinterpret_cast<__m128i*>(out), _mm_and_si128(cmp, a));
    for(int i = 0; i < 4; i++) if(out[i]) vector.push_back(out[i]);
}
You could store the comparison instead of using the logical AND. Additionally, you could test the lanes via the movemask bits and skip storing the comparison, as sketched below. Either way you do it, what really matters is that the movemask is zero most of the time; otherwise SSE won't be helpful.
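For instance, a sketch of that variant (GCC/Clang __builtin_ctz assumed, matching the __attribute__ style above): store a once, then walk the set bits of the movemask.
int av[4] __attribute__((aligned (16)));
_mm_store_si128(reinterpret_cast<__m128i*>(av), a);
// One movemask bit per 32-bit lane:
int m = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmplt_epi32(a, b)));
while (m) {
    vector.push_back(av[__builtin_ctz(m)]);  // lowest matching lane
    m &= m - 1;                              // clear that bit
}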
In my case a was a list of numbers I wanted to test for primality and b was a list of divisors. Since I knew that most of the time the values of a were not prime, this gave me a boost of about 3x (out of a maximum 4x with SSE).

Look-Up Table using SIMD

I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the following vanilla C++ code:
//outside loop
const float LUT_RATIO = 1000.0F;
//in loop
float v = ... //input value
v = myLookupTable[static_cast<int>(v * LUT_RATIO)];
What I'm trying:
//outside loop
const __m128 LUT_RATIO = _mm_set1_ps(1000.0F);
//in loop
__m128 v = _mm_set_ps(v1, v2, v3, v4); //input values
__m128i vI = _mm_cvtps_epi32(_mm_mul_ps(v, LUT_RATIO)); //multiply and convert to integers
v = ??? // how to get vI indices of myLookupTable?
edit: ildjarn makes a point that demands clarification on my part. I am not trying to achieve a speedup for the lookup-table code itself; I am simply trying to avoid having to store the registers back to floats specifically for doing the lookup, as this part is sandwiched between 2 other parts which could theoretically benefit from SSE.
If you can wait until next year then Intel's Haswell CPUs will have AVX2, which includes instructions for gathered loads. This enables you to do e.g. 8 parallel LUT lookups in one instruction (see e.g. VGATHERDPS). Other than that, you're out of luck, unless your LUTs are quite small (e.g. 16 elements), in which case you can use PSHUFB.
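For reference, the gathered load might look like this once AVX2 hardware is available (a sketch using the question's names; this is the 4-wide __m128 form, while the ymm form of VGATHERDPS does 8 lookups; requires compiling with AVX2 enabled, e.g. -mavx2):
__m128i vI = _mm_cvtps_epi32(_mm_mul_ps(v, LUT_RATIO)); // scale and convert to indices
v = _mm_i32gather_ps(myLookupTable, vI, 4);             // v[k] = myLookupTable[vI[k]], scale = 4 bytes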

Optimization of Point to Voxel mapping

I used a profiler to look over some code which does not yet run fast enough. It found that the following function took most of the time, and half of the time in this function was spent in floor. Now, there are two possibilities: optimizing this function or going one level above and reducing the calls to this function. I wonder if the first one is possible.
int Sph::gridIndex (Vector3 position) const {
    int mx = ((int)floor(position.x / _gridIntervalSize) % _gridSize);
    int my = ((int)floor(position.y / _gridIntervalSize) % _gridSize);
    int mz = ((int)floor(position.z / _gridIntervalSize) % _gridSize);
    if (mx < 0) {
        mx += _gridSize;
    }
    if (my < 0) {
        my += _gridSize;
    }
    if (mz < 0) {
        mz += _gridSize;
    }
    int x = mx * _gridSize * _gridSize;
    int y = my * _gridSize;
    int z = mz * 1;
    return x + y + z;
}
Vector3 is just some simple class which stores three floats and provides some overloaded operators. _gridSize is of type int and _gridIntervalSize is a float. There are _gridSize ^ 3 buckets.
The purpose of the function is to provide hash table support. Every 3d-point is mapped to an index, and points which lie in the same voxel of size _gridIntervalSize ^ 3 should land in the same bucket.
First rule of optimization when there is math involved: Eliminate division, square roots, and trig functions.
// Do this only once, not once per call:
inverse_size = 1.0f / _gridIntervalSize;

int mx = ((int)floor(position.x * inverse_size) % _gridSize);
int my = ((int)floor(position.y * inverse_size) % _gridSize);
int mz = ((int)floor(position.z * inverse_size) % _gridSize);
I would also recommend dropping the mod operation because that's another division - if your grid size is a power of 2 you can use & (_gridSize - 1), which will also allow you to delete the conditional code at the bottom, which is another big saving. A sketch of both savings together follows.
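Assuming _gridSize is a power of two (an assumption, not stated in the question), the masking version might look like this; two's-complement AND yields the non-negative residue even when the floored value is negative:
const int mask = _gridSize - 1;  // valid only when _gridSize is a power of 2
int mx = (int)floor(position.x * inverse_size) & mask;
int my = (int)floor(position.y * inverse_size) & mask;
int mz = (int)floor(position.z * inverse_size) & mask;
// No sign fix-up needed; combine mx, my, mz as before.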
On another note, using overloaded operators may be hurting you. This is a touchy subject here so I'll let you experiment with it and decide for yourself.
I assume you use floor because negative values are possible, and because you don't want an anomaly due to the default truncation when you cast to int (values rounding toward zero from both sides, making some oversized voxels).
If you can specify a safe most-negative value for each value in the vector, you could subtract that (negative) value, or rather the nearest more-negative multiple of _gridIntervalSize, before the cast, and drop the floor.
Using fmod may ensure you have a safe most-negative value and replace the integer %, but it's probably an anti-optimisation. Still, as a quick change, it may be worth checking; a sketch follows.
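A sketch of the fmod variant, per axis (the exact boundary may still need care due to floating-point rounding):
const float extent = _gridSize * _gridIntervalSize;  // grid extent along one axis
float fx = fmod(position.x, extent);                 // result is in (-extent, extent)
if (fx < 0) fx += extent;                            // now in [0, extent)
int mx = (int)(fx * inverse_size);                   // truncation is safe once fx >= 0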
Also, check whether your platform supports vector instructions, and whether your compiler can easily be encouraged to use them. x86 chips certainly have integer vector instructions as well as float (the old Pentium 1 MMX instructions, for a start) and might be able to handle this much more efficiently than the "normal" CPU instruction set. This may even be a case for digging out the list of vector instruction intrinsics for your compiler and doing some hand-optimisation. Just check what the compiler can do for you first - I'm not sure how much of this kind of optimisation compilers will do for you already.
One probably trivial piece of micro-optimisation...
return (mx * _gridSize + my) * _gridSize + mz;
Saves one integer multiplication. Trivial, of course, and the compiler may catch it anyway, but this is an old habitual thing.
Oh - watch the leading underscores. Those are reserved identifiers. Not likely to cause a problem, but you can't complain if they do.
EDIT
Another way to avoid the floor is to handle positive and negative separately, if you are willing to accept that items bang on the edge of a grid cell may end up in the wrong cell (possible anyway, since floats should be considered approximate). Just apply a -1 offset in the negative case, to pull it away from zero by almost exactly the right amount to compensate for the truncation. You might consider a bit-fiddling increment-the-mantissa afterwards (to get already-integer values in the cell you'd expect), but this is probably unnecessary.
If you can impose power-of-two limitations to your sizes, there may be a bit-fiddling way to efficiently extract the grid position from a float, avoiding some or all of the multiply, floor and % for each of x, y and z, assuming a standard floating point representation (ie this is non-portable). Again, handle positive and negative separately. Extract the exponent, bit-shift the mantissa accordingly, then mask out unwanted bits.
I think you need to look higher up the hierarchy to get real speed improvements. That is, is storing points in a hash-map really the most efficient solution? I assume you have an array of Vector3 arrays, i.e:
Vector3 *points [size][size][size]
where each element in the 3D array is an array of Vector3.
The algorithm you're using doesn't guarantee uniform distribution of points in each Vector3 array, which may be a problem. A cluster of points within _gridIntervalSize will map to the same array.
An alternative method would be to use oct-trees, which are like binary trees but each node has eight child nodes. Each node requires the min/max x/y/z values to define the volume the node covers. To add values to the tree (a minimal node layout is sketched after these steps):
Recursively search the tree to find the smallest node that can contain the point
Add the point to that node
If the number of points in the node exceeds the upper limit for a node,
    create child nodes and move the points into them
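A minimal node layout for illustration (names invented here; Vector3 as in the question):
#include <memory>
#include <vector>

struct OctNode {
    Vector3 minCorner, maxCorner;          // volume this node covers
    std::vector<Vector3> points;           // points stored at this node
    std::unique_ptr<OctNode> children[8];  // null until the node splits
};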
You may want to use quad-trees if there is little variation in values along a particular axis. Another method is to use BSPs - divide the world into two halves and recurse to find the container to add your point to. Again, these can be dynamic.
Converting the floats to ints and having the division planes lie on integer values will speed up the process as well.
Googling the above terms will lead you to more in depth analysis of the algorithms.
Finally, using floats (or doubles) for co-ordinates in an infinite plane is a bad idea - the further you get from (0,0,0), the less precision you have (the gaps between representable floating-point values increase as the values increase). You will need to 'reset' the floating-point values to keep the precision. One method is to 'tile' the space and change the co-ordinates to use an integer and a floating-point part: the integer part defines the 'tile' and the floating-point part defines the position within the tile. This gets you a much simpler hashing method - just use the integer parts, with no call to floor and only integer calculations required; a small sketch follows. Another approach is to use fixed-point values rather than floating-point values, but this would constrain your precision; it would make calculations across tile boundaries much easier, though.
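A small sketch of the tile idea along one axis (names invented here):
#include <cmath>

// Hypothetical tiled coordinate: integer tile index plus local offset.
struct TiledCoord {
    int   tile;   // which tile along this axis
    float local;  // position within the tile, in [0, tileSize)
};

TiledCoord renormalize(TiledCoord c, float tileSize) {
    // Move whole tiles out of the floating-point part to preserve precision.
    const int shift = (int)std::floor(c.local / tileSize);
    return { c.tile + shift, c.local - shift * tileSize };
}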
If you could expand on what the top-level requirements of your coordinate system are, there are probably better algorithms available to you.