I would like to get the dot product of two 3D vectors as a float. Unfortunately the result is a vector, not a float. I tried to access its elements using vector4_f32, but I get an error saying that it's not a member of __m128.
float res = XMVector3Dot(a, b).vector4_f32[0];
The [] operator is not defined on XMVECTOR
You can access individual elements of an XMVECTOR with XMVectorGetX, XMVectorGetY, XMVectorGetZ and XMVectorGetW, e.g. XMVectorGetX(XMVector3Dot(a, b)). Keep in mind that these accessors can be relatively expensive, because DirectXMath keeps its data in SIMD registers and extracting a scalar means leaving them. For more info:
1: XMVector3Dot performance
2: Expensive than expected
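For illustration only, here is a minimal sketch in plain SSE (not DirectXMath itself) of a 3D dot product followed by extraction of the scalar from lane 0. The helper name dot3 is hypothetical, and the sketch assumes an x86 target where XMVECTOR is a __m128 with x in the lowest lane.

```cpp
#include <xmmintrin.h>

// Hypothetical helper: dot product of the x, y, z lanes; the w lane is ignored.
inline float dot3(__m128 a, __m128 b) {
    const __m128 m = _mm_mul_ps(a, b);                            // [ax*bx, ay*by, az*bz, aw*bw]
    const __m128 y = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane 1
    const __m128 z = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast lane 2
    const __m128 s = _mm_add_ss(_mm_add_ss(m, y), z);             // lane 0 = x + y + z products
    return _mm_cvtss_f32(s);                                      // move lane 0 into a scalar float
}
```

If the result is consumed by further vector math, it is usually better to keep it in a register and only extract a float at the very end.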
Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the same position column in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
double& bfVal) {
Eigen::VectorXd bfPTVec(idxt.size());
#pragma omp simd
for (int j = 0; j < idxt.size(); j++) {
int vIDX = static_cast<int>(idxt(j));
double interp1 = vIDX + 1 - idxt(j);
double interp2 = idxt(j) - vIDX;
bfPTVec(j) = (SIG(vIDX,j)*interp1 + SIG(vIDX+1,j)*interp2);
}
bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do is not a vectorizable operation.
Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime I'm not clear if the type of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big picture description of my problem where each idxt corresponds to the same "position" column as SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values that I don't want contributing to the final summation in idxt are set to zero when idxt is initialized outside of this method. Hence the last line in the example given above.
Theoretically, it should be possible, assuming the processor supports this operation. In practice, however, this is not the case, for several reasons.
First of all, mainstream x86-64 processors that support the AVX2 (or AVX-512) instruction set do have instructions for this: the SIMD gather instructions. Unfortunately, they are quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, they are not implemented very efficiently on mainstream processors yet. Indeed, they fetch every item separately, which is no faster than scalar code. They can still be useful when the rest of the code is vectorized, since reading many scalar values to fill a SIMD register manually tends to be a bit less efficient (although it was surprisingly faster on old processors because of a rather inefficient early implementation of the gather instructions). Note that if the SIG matrix is big, cache misses will significantly slow down the code.
Additionally, AVX2 is not enabled by default in mainstream compilers because not all x86-64 processors support it. Thus, you need to enable AVX2 (e.g. using -mavx2) so that compilers can vectorize the loop efficiently. Unfortunately, this is not enough. Indeed, most compilers currently fail to automatically detect when this instruction can/should be used. Even if they could, the fact that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally does not help them generate efficient code (although it should be fine here). Note that you can tell your compiler to assume that operations are associative and that you only use finite/basic real numbers (e.g. using -ffast-math, which can be unsafe). The same applies to Eigen types/operators if compilers fail to completely inline all the functions (which is the case for ICC).
To speed up the code, you can try changing the type of the SIG variable to a matrix reference containing int32_t items. Another possible optimization is to split the loop into small fixed-size chunks (e.g. 32 items) and to split the body into several parts, computing the indirection in a separate loop, so that compilers can vectorize at least some of the loops. Some compilers, like Clang, are able to do that automatically for you: they generate a fast SIMD implementation for part of the loop and perform the indirections with scalar instructions. If this is not enough (which appears to be the case so far), then you certainly need to vectorize the loop yourself using SIMD intrinsics (or possibly use a SIMD library that does that for you).
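The chunking idea can be sketched roughly as follows, using raw pointers instead of Eigen types. This is only a sketch under assumptions: the function name calcPointChunked and its parameters are hypothetical, SIG is assumed to be column-major with rows rows, and every index is assumed to satisfy 0 <= idxt[j] < rows - 1.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical free function; sig is column-major, one column per entry of idxt.
double calcPointChunked(const double* idxt, const std::int16_t* sig,
                        std::size_t rows, std::size_t n) {
    constexpr std::size_t kChunk = 32;          // small fixed-size chunk
    double lo[kChunk], hi[kChunk], frac[kChunk];
    double sum = 0.0;
    for (std::size_t base = 0; base < n; base += kChunk) {
        const std::size_t len = (n - base < kChunk) ? n - base : kChunk;
        // Stage 1: the data-dependent loads (indirection), done with scalar code.
        for (std::size_t k = 0; k < len; ++k) {
            const double t = idxt[base + k];
            const int i = static_cast<int>(t);  // truncation, as in the original loop
            const std::int16_t* col = sig + (base + k) * rows;
            lo[k] = col[i];
            hi[k] = col[i + 1];
            frac[k] = t - i;
        }
        // Stage 2: arithmetic on contiguous arrays, which compilers can vectorize more easily.
        for (std::size_t k = 0; k < len; ++k) {
            const double v = lo[k] * (1.0 - frac[k]) + hi[k] * frac[k];
            sum += (idxt[base + k] > 0.0) ? v : 0.0;  // same masking as the select() in the question
        }
    }
    return sum;
}
```

Whether the second loop is actually vectorized depends on the compiler and flags; the running sum in particular may stay scalar unless reassociation is allowed.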
Probably no, but I would expect manually vectorized version to be faster.
Below is an untested example of that inner loop. It doesn't use AVX, only SSE up to 4.1, and should be compatible with the Eigen matrices you have there.
The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column major. It’s normally the default memory layout in Eigen but possible to override with template arguments, and probably with macros as well.
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
// Load the index value into both lanes of the vector
__m128d idx = _mm_loaddup_pd( pIndex );
// Convert into int32 with truncation; this assumes the number there ain't negative.
const int iFloor = _mm_cvttsd_si32( idx );
// Compute fractional part
idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
// Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
// Load two int16_t values from sequential addresses
const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
// Upcast them to int32, then to fp64
const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
// Compute the result
__m128d res = _mm_mul_pd( idx, signs );
res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
// The above 2 lines (3 instructions) can be replaced with the following one:
// const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
// It may or may not be better, the dppd instruction is not particularly fast.
return _mm_cvtsd_f64( res );
}
I'm calling slerp() from the Eigen library as follows:
Eigen::MatrixXf Rtime = (Eigen::Quaternionf::Identity().slerp(timer, quarts[i])).toRotationMatrix();
where timer is a float and quarts is declared as
std::vector<Eigen::Quaternionf> quarts;
This call to slerp causes a Read Access Violation only sometimes (about 50% of the time), which confuses me.
Looking at the stack frame,
I can see that the code reaches Eigen::internal::pload until it breaks.
Generally I'd think that my indices are incorrect but it crashes even when
i = 0 and quarts.size() = 1. I declare the only quaternion in the vector:
Eigen::Matrix3f rotMatrix;
rotMatrix = U * V;
Eigen::Quaternionf temp;
temp = rotMatrix;
quarts.push_back(temp);
where U and V come from a computation of Singular Value Decomposition, so maybe there's something wrong with the way I declare the quaternion? Or storing it in a vector in some way affects it? I'm not sure.
The problem is that Quaternionf requires 16-byte alignment, which is not guaranteed by std::vector. More details there. The solutions are either to use an aligned allocator, e.g.:
std::vector<Quaternionf,Eigen::aligned_allocator<Quaternionf>> quats;
or to use non-aligned quaternions within the vector:
std::vector<Quaternion<float,Eigen::DontAlign>> quats;
I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the following vanilla C++ code:
//outside loop
const float LUT_RATIO = 1000.0F;
//in loop
float v = ... //input value
v = myLookupTable[static_cast<int>(v * LUT_RATIO)];
What I'm trying:
//outside loop
const __m128 LUT_RATIO = _mm_set1_ps(1000.0F);
//in loop
__m128 v = _mm_set_ps(v1, v2, v3, v4); //input values
__m128i vI = _mm_cvtps_epi32(_mm_mul_ps(v, LUT_RATIO)); //multiply and convert to integers
v = ??? // how to get vI indices of myLookupTable?
edit: ildjarn makes a point that demands clarification on my part. I am not trying to achieve speedup for the lookup table code, I am simply trying to avoid having to store the registers back to floats specifically for doing the lookup, as this part is sandwiched between 2 other parts which could theoretically benefit from SSE.
If you can wait until next year then Intel's Haswell CPUs will have AVX2 which includes instructions for gathered loads. This enables you to do e.g. 8 parallel LUT lookups in one instruction (see e.g. VGATHERDPS). Other than that, you're out of luck, unless your LUTs are quite small (e.g. 16 elements), in which case you can use PSHUFB.
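Without gather instructions, the usual fallback is to spill only the indices, do four scalar lookups and rebuild the vector. A minimal SSE2 sketch, assuming the table is a plain float array and all indices are in range (the helper name lookup4 is hypothetical):

```cpp
#include <emmintrin.h>

// Hypothetical helper: looks up table[int(v[i] * ratio[i])] for the four lanes of v.
inline __m128 lookup4(const float* table, __m128 v, __m128 ratio) {
    // Truncating conversion, matching static_cast<int> in the scalar code
    // (_mm_cvtps_epi32 would round to nearest instead).
    const __m128i vi = _mm_cvttps_epi32(_mm_mul_ps(v, ratio));
    alignas(16) int idx[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(idx), vi);
    // _mm_set_ps takes its arguments from the highest lane down to the lowest.
    return _mm_set_ps(table[idx[3]], table[idx[2]], table[idx[1]], table[idx[0]]);
}
```

This does not make the lookup itself faster, but it keeps the code before and after it in SSE registers.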
In some code I have converted to SSE I perform some ray tracing, tracing 4 rays at a time using __m128 data types.
In the method where I determine which objects are hit first, I loop through all objects, test for intersection and create a mask representing which rays had an intersection earlier than previously found.
I also need to maintain data on the id of the objects which correspond to the best hit times. I do this by maintaining a __m128 data type called objectNo and I use the mask determined from the intersection times to update objectNo as follows:
objectNo = _mm_blendv_ps(objectNo,_mm_set1_ps((float)pobj->getID()),mask);
Where pobj->getID() will return an integer representing the id of the current object. Making this cast and using the blend seemed to be the most efficient way of updating the objectNo for all 4 rays.
After all intersections are tested I try to extract the objectNo's individually and use them to access an array to register the intersection. Most commonly I have tried this:
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
However this crashes with EXC_BAD_ACCESS as extracting a float with value 1.0 converts to an int of value 1065353216.
How do I correctly unpack the __m128 into ints which can be used to index an array?
There are two SSE2 conversion intrinsics which seem to do what you want:
_mm_cvtps_epi32()
_mm_cvttps_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_conversion.htm
These will convert 4 single-precision FP to 4 32-bit integers. The first one does it with rounding. The second one uses truncation.
So they can be used like this:
int o0 = _mm_extract_epi32(_mm_cvtps_epi32(objectNo), 0);
prv_noHits[o0]++;
EDIT : Based on what you're trying to do, I feel this can be better optimized as follows:
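__m128i ids = _mm_set1_epi32(pobj->getID());
// objectNo must now be declared as __m128i; the mask is reinterpreted as an integer vector.
// This assumes each mask lane is all-ones or all-zeros, as produced by SSE comparisons.
objectNo = _mm_blendv_epi8(objectNo, ids, _mm_castps_si128(mask));
int o0 = _mm_extract_epi32(objectNo, 0);
prv_noHits[o0]++;
This version gets rid of the unnecessary conversions. But objectNo has to be declared as __m128i and the mask has to be passed as an integer vector. (_mm_blendv_epi8 is used rather than _mm_blend_epi16 because the latter only accepts a compile-time immediate as its mask.)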
EDIT 2: Here's a way so that you won't have to change your mask:
__m128 ids = _mm_castsi128_ps(_mm_set1_epi32(pobj->getID()));
objectNo = _mm_blendv_ps(objectNo,ids,mask);
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
Note that the _mm_castsi128_ps() intrinsic doesn't map to any instruction. It's just a bit-wise datatype conversion from __m128i to __m128 to get around the "typeness" in C/C++.
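The difference between reinterpreting the bits and converting the value can be seen in a small SSE2-only sketch (the helper names lane0Bits and lane0Converted are hypothetical):

```cpp
#include <emmintrin.h>

// Reinterpret lane 0 of a float vector as an int: no value conversion takes place.
inline int lane0Bits(__m128 v) {
    return _mm_cvtsi128_si32(_mm_castps_si128(v));
}

// Convert lane 0 of a float vector to an int, rounding to nearest.
inline int lane0Converted(__m128 v) {
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(v));
}
```

With a vector holding 1.0f, lane0Bits yields 1065353216 (the IEEE-754 bit pattern 0x3F800000) while lane0Converted yields 1. If the IDs were stored with _mm_castsi128_ps in the first place, lane0Bits returns the original integer unchanged.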
I recently noticed that
__m128 m = _mm_set_ps(0,1,2,3);
puts the 4 floats into reverse order when cast to a float array:
float* p = (float*)(&m);
// p[0] == 3
// p[1] == 2
// p[2] == 1
// p[3] == 0
The same happens with a union { __m128 m; float a[4]; } also.
Why do SSE operations use this ordering? It's not a big deal but slightly confusing.
And a follow-up question:
When accessing elements in the array by index, should one access in the order 0..3 or the order 3..0 ?
It's just a convention; they had to pick some order, and it really doesn't matter what the order is as long as everyone follows it. Intel happens to like little-endianness.
As far as accessing by index goes... the best thing is to try to avoid doing it. Nothing kills vector performance like element-wise accesses. If you must, try to set things up so that the indexing matches the hardware vector lanes; that's what most vector programmers (in my experience) will expect.
Depending on what you would like to do, you can use either _mm_set_ps or _mm_setr_ps.
__m128 _mm_setr_ps (float z, float y, float x, float w )
Sets the four SP FP values to the four inputs in reverse order.
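A short sketch of the difference, storing each vector to memory so that the element order is visible (the helper name toArray is hypothetical):

```cpp
#include <xmmintrin.h>
#include <array>

// Hypothetical helper: copies the four lanes to an array, lane 0 first.
inline std::array<float, 4> toArray(__m128 v) {
    std::array<float, 4> out;
    _mm_storeu_ps(out.data(), v);  // unaligned store, lowest lane at the lowest address
    return out;
}
```

toArray(_mm_set_ps(0, 1, 2, 3)) gives {3, 2, 1, 0}, because the last argument lands in lane 0, whereas toArray(_mm_setr_ps(0, 1, 2, 3)) gives {0, 1, 2, 3}, matching memory order.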
Isn't that consistent with the little-endian nature of x86 hardware? The way it stores the bytes of a long long.