I have a big pixel processing function which I am currently trying to optimize using intrinsic functions.
Being an SSE novice, I am not sure how to tackle the part of the code which involves lookup tables.
Basically, I am trying to vectorize the following vanilla C++ code:
//outside loop
const float LUT_RATIO = 1000.0F;
//in loop
float v = ... //input value
v = myLookupTable[static_cast<int>(v * LUT_RATIO)];
What I'm trying:
//outside loop
const __m128 LUT_RATIO = _mm_set1_ps(1000.0F);
//in loop
__m128 v = _mm_set_ps(v1, v2, v3, v4); //input values
__m128i vI = _mm_cvtps_epi32(_mm_mul_ps(v, LUT_RATIO)); //multiply and convert to integers
v = ??? // how to get vI indices of myLookupTable?
edit: ildjarn makes a point that demands clarification on my part. I am not trying to achieve speedup for the lookup table code, I am simply trying to avoid having to store the registers back to floats specifically for doing the lookup, as this part is sandwiched between 2 other parts which could theoretically benefit from SSE.
If you can wait until next year then Intel's Haswell CPUs will have AVX2 which includes instructions for gathered loads. This enables you to do e.g. 8 parallel LUT lookups in one instruction (see e.g. VGATHERDPS). Other than that, you're out of luck, unless your LUTs are quite small (e.g. 16 elements), in which case you can use PSHUFB.
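For illustration, here is a hypothetical sketch of what that gather-based lookup could look like with AVX2 intrinsics once such hardware exists (lookup8 is a made-up name; myLookupTable and the 1000.0f ratio come from the question, and all scaled indices are assumed to be in range):
#include <immintrin.h>
// Hypothetical AVX2 sketch: 8 LUT reads in one vgatherdps.
__m256 lookup8(__m256 v, const float* myLookupTable)
{
    const __m256 ratio = _mm256_set1_ps(1000.0f);
    __m256i vI = _mm256_cvtps_epi32(_mm256_mul_ps(v, ratio)); // scale and convert to int32
    return _mm256_i32gather_ps(myLookupTable, vI, 4);         // 4 = sizeof(float) scale
}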
Related
Say I have an MxN matrix (SIG) and an Nx1 list of fractional indices (idxt). Each fractional index in idxt corresponds uniquely to the column at the same position in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value, and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
               const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
               double& bfVal) {
    Eigen::VectorXd bfPTVec(idxt.size());
    #pragma omp simd
    for (int j = 0; j < idxt.size(); j++) {
        int vIDX = static_cast<int>(idxt(j));
        double interp1 = vIDX + 1 - idxt(j);
        double interp2 = idxt(j) - vIDX;
        bfPTVec(j) = SIG(vIDX, j) * interp1 + SIG(vIDX + 1, j) * interp2;
    }
    bfVal = ((idxt.array() > 0.0).select(bfPTVec, 0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly (as the first line does) and implicitly (as some of the mathematical operations do), is not vectorizable.
Additionally, since the access to SIG depends on values in idxt that are calculated at runtime, I'm not clear whether the kind of memory read/write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big-picture description of my problem, where each entry of idxt corresponds to the column at the same position in SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values in idxt that I don't want contributing to the final summation are set to zero when idxt is initialized outside this method. Hence the last line in the example given above.
Theoretically it should be possible, assuming the processor supports the operation. In practice, however, it does not work out, for several reasons.
First of all, mainstream x86-64 processors supporting AVX2 (or AVX-512) do have instructions for that: SIMD gather instructions. Unfortunately, they are quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, gathers are not implemented very efficiently on mainstream processors yet: they fetch each item separately, which is no faster than scalar code. They can still be useful when the rest of the code is vectorized, since manually reading many scalar values to fill a SIMD register tends to be a bit less efficient (although, surprisingly, the manual approach was faster on older processors due to their quite inefficient early gather implementations). Note that if the SIG matrix is big, cache misses will significantly slow down the code.
Additionally, AVX2 is not enabled by default because not all x86-64 processors support it, so you need to enable it explicitly (e.g. with -mavx2) before compilers can vectorize the loop efficiently. Unfortunately, this is not enough: most compilers currently fail to detect automatically when gather instructions can or should be used. Even when they can, the facts that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally do not help them generate efficient code (although it should be fine here). You can tell your compiler that operations may be assumed associative and that you only use finite, ordinary real numbers (e.g. with -ffast-math, which can be unsafe). The same applies to Eigen types/operators if the compiler fails to inline all the functions completely (which is the case for ICC).
To speed up the code, you can try changing the type of SIG to a matrix reference containing int32_t items. Another possible optimization is to process the data in small fixed-size chunks (e.g. 32 items) and split the loop into several passes, computing the indirection in a separate loop so compilers can vectorize at least some of the passes. Some compilers, like Clang, can do this automatically: they generate a fast SIMD implementation for part of the loop and perform the indirections with scalar instructions. If this is not enough (which appears to be the case so far), you will likely need to vectorize the loop yourself using SIMD intrinsics (or use a SIMD library that does it for you). A sketch of the loop-splitting idea follows.
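As a minimal, untested sketch of that splitting, reusing the names from the question (CHUNK and the scratch arrays are hypothetical):
#include <algorithm>
// Inside calcPoint, replacing the single loop: pass 1 is plain arithmetic
// the compiler can vectorize; pass 2 keeps the gather-like loads scalar.
constexpr int CHUNK = 32; // hypothetical chunk size
int vIDX[CHUNK];
double w1[CHUNK], w2[CHUNK];
const int n = static_cast<int>(idxt.size());
for (int base = 0; base < n; base += CHUNK) {
    const int len = std::min(CHUNK, n - base);
    for (int k = 0; k < len; ++k) { // pass 1: indices and weights
        const int j = base + k;
        vIDX[k] = static_cast<int>(idxt(j));
        w1[k] = vIDX[k] + 1 - idxt(j);
        w2[k] = idxt(j) - vIDX[k];
    }
    for (int k = 0; k < len; ++k) { // pass 2: scalar indirection
        const int j = base + k;
        bfPTVec(j) = SIG(vIDX[k], j) * w1[k] + SIG(vIDX[k] + 1, j) * w2[k];
    }
}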
Probably not, but I would expect a manually vectorized version to be faster.
Below is an example of that inner loop, untested. It doesn't use AVX, only SSE up to 4.1, and should be compatible with the Eigen matrices you have there.
The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column-major; that's normally the default memory layout in Eigen, but it is possible to override with template arguments, and probably with macros as well.
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
    // Load the index value into both lanes of the vector
    __m128d idx = _mm_loaddup_pd( pIndex );
    // Convert into int32 with truncation; this assumes the number there isn't negative.
    const int iFloor = _mm_cvttsd_si32( idx );
    // Compute the fractional part
    idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
    // Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
    idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
    // Load two int16_t values from sequential addresses
    const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
    // Upcast them to int32, then to fp64
    const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
    // Compute the result
    __m128d res = _mm_mul_pd( idx, signs );
    res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
    // The above 2 lines (3 instructions) can be replaced with the following one:
    // const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
    // It may or may not be better; the dppd instruction is not particularly fast.
    return _mm_cvtsd_f64( res );
}
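A hypothetical driver loop matching the original masking-and-sum logic (assuming column-major SIG, as required above) might look like:
double bfVal = 0.0;
for (int j = 0; j < idxt.size(); ++j)
    if (idxt(j) > 0.0) // mirrors the (idxt.array() > 0.0).select(...) masking
        bfVal += computePoint(&idxt(j), &SIG(0, j));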
I have a struct that represents a vector. The vector consists of two one-byte integers, which I use to keep values from 0 to 255.
typedef unsigned char uint8_T;
struct Vector
{
    uint8_T x;
    uint8_T y;
};
Now, the main use case in my program is to multiply both elements of the vector by a 32-bit float value:
typedef float real32_T;
Vector Vector::operator * ( const real32_T f ) const {
    return Vector( (uint8_T)(x * f), (uint8_T)(y * f) );
}
This needs to be performed very often. Is there a way these two multiplications can be performed simultaneously, e.g. by vectorization with SSE or similar? Or is the Visual Studio compiler already doing this?
Another use case is to interpolate between two Vectors:
Vector Vector::interpolate(const Vector& rhs, real32_T z) const
{
    return Vector(
        (uint8_T)(x + z * (rhs.x - x)),
        (uint8_T)(y + z * (rhs.y - y))
    );
}
This already uses an optimized interpolation approach (https://stackoverflow.com/a/4353537/871495).
But again the values of the vectors are multiplied by the same scalar value.
Is there a possibility to improve the performance of these operations?
Thanks
(I am using Visual Studio 2010 with the 64-bit compiler.)
In my experience, Visual Studio (especially an older version like VS2010) does not do much vectorization on its own. It has improved in newer versions, so if you can, see whether a change of compiler speeds up your code.
Depending on the code that uses these functions and the optimization the compiler does, it may not even be the calculations that slow down your program. Function calls and cache misses may hurt a lot more.
You could try the following:
- If not already done, define the functions in the header file, so the compiler can inline them.
- If you use these functions in a tight loop, try doing the calculations 'by hand' without any function calls (temporarily expose the variables) and see if it makes a speed difference.
- If you have a lot of vectors, look at how they are laid out in memory. Store them contiguously to minimize cache misses.
- For SSE to work really well, you'd have to work with 4 values at once - so multiply 2 vectors with 2 floats. In a loop, use a step of 2 and write a static function that calculates 2 vectors at once using SSE instructions (see the sketch after this list). Because your vectors are not aligned (and hardly ever will be with 8-bit variables), the code could even run slower than what you have now, but it's worth a try.
- If applicable, and if you don't depend on the truncation that occurs with your cast from float to uint8_T (e.g. if your floats are in range [0,1]), try using float everywhere. This may allow the compiler to do far better optimization.
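To make the SSE suggestion above concrete, here is an untested, hypothetical sketch of scaling two adjacent Vectors by two floats in one pass (scaleTwo is a made-up name; note it saturates to 255 instead of wrapping when a product goes out of range):
#include <emmintrin.h> // SSE2
#include <cstring>
static void scaleTwo(Vector* v, real32_T f0, real32_T f1)
{
    int packed;
    std::memcpy(&packed, v, 4);                       // bytes: x0 y0 x1 y1
    __m128i bytes = _mm_cvtsi32_si128(packed);
    const __m128i zero = _mm_setzero_si128();
    __m128i w16 = _mm_unpacklo_epi8(bytes, zero);     // widen to 16-bit
    __m128i w32 = _mm_unpacklo_epi16(w16, zero);      // widen to 32-bit
    __m128  vf  = _mm_cvtepi32_ps(w32);               // convert to float
    vf = _mm_mul_ps(vf, _mm_set_ps(f1, f1, f0, f0));  // [x0*f0, y0*f0, x1*f1, y1*f1]
    __m128i r32 = _mm_cvttps_epi32(vf);               // truncate, like the cast
    __m128i r16 = _mm_packs_epi32(r32, zero);         // narrow to 16-bit
    __m128i r8  = _mm_packus_epi16(r16, zero);        // narrow to 8-bit (unsigned)
    packed = _mm_cvtsi128_si32(r8);
    std::memcpy(v, &packed, 4);
}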
You haven't shown the full algorithm, but conversions between integer and float numbers are very slow operations. Eliminating them and using only one type (preferably integers, if possible) can greatly improve performance.
Alternatively, you can use lrint() to do the conversion, as explained here.
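For example (a sketch; lrint rounds with the current FP rounding mode rather than truncating, and MSVC only added it in versions newer than VS2010):
#include <cmath>
// One fast convert instruction on x86 instead of a slow truncating cast:
inline uint8_T scale_u8(uint8_T v, real32_T f)
{
    return static_cast<uint8_T>(std::lrint(v * f));
}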
I'm trying to learn about vectorisation, and rather than reinvent the wheel I'm using Agner Fog's vector class library.
Here's my original C++/STL code:
#include <algorithm>
#include <vector>
#include <vectorclass.h>
#include <boost/assign/std/vector.hpp>
template<typename T>
double mean_v1(T begin, T end) {
    float mean = 0;
    std::for_each(begin, end, [&mean](const double& d) { mean += d; });
    return mean / std::distance(begin, end);
}
template<typename T>
double mean_v2(T begin, T end) {
    float mean = 0;
    const int distance = std::distance(begin, end); // This is expensive
    const int loop = (distance >> 2) + 1;           // divide by 4
    const int partial = distance & 3;               // remainder mod 4
    Vec4f vec;
    for (int i = 0; i < loop; ++i) {
        if (i == (loop - 1)) {
            vec.load_partial(partial, &*begin);
            mean += horizontal_add(vec);
        }
        else {
            vec.load(&*begin);
            mean += horizontal_add(vec);
            begin += 4; // This is expensive
        }
    }
    return mean / distance;
}
int main(int argc, char** argv) {
    using namespace boost::assign;
    std::vector<float> numbers;
    // Note 13 numbers, which won't fit into an SSE register perfectly
    numbers += 39.57, 39.57, 39.604, 39.58, 39.61, 31.669, 31.669,
               31.669, 31.65, 32.09, 33.54, 32.46, 33.45;
    const float mean1 = mean_v1(numbers.begin(), numbers.end());
    const float mean2 = mean_v2(numbers.begin(), numbers.end());
    return 0;
}
Both v1 and v2 work correctly and they both take about the same time. However, profiling shows that std::distance() and moving the iterator along take almost 45% of the total time. The vector adds are just 0.8% of it, which is significantly faster than v1.
Searching the web, all the examples seem to deal with a number of values that fits precisely into the SSE registers. How do people deal with odd numbers of values, e.g. in this example, where setting up the loop takes a lot longer than the calculation?
I'm thinking there must be best practices or ideas on how to deal with this scenario.
Assume I can't change the interface of mean() to take float[], but must use iterators
You're mixing float and double unnecessarily. Especially since you don't let your accumulator be double, your precision is totally destroyed and won't be close to satisfactory for larger series.
As the arithmetic is super lightweight, what's destroying your performance here is most likely memory access; read up on memory cache lines and how they work. Basically, what you need to do is probe ahead: some processors have explicit instructions for pulling data into your cache, and otherwise you can perform a load at a memory location ahead of time. Create another level of nesting in your loop and, at regular intervals, prime the cache with data you know you will get to in a few iterations.
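A minimal, hypothetical sketch of that idea using SSE's prefetch hint (the distance of 16 floats is an arbitrary placeholder to tune; the tail loop also shows one way to handle the odd remainder):
#include <xmmintrin.h>
float sum_with_prefetch(const float* data, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i + 4 <= n; i += 4) {
        // Hint the line a fixed distance ahead into L1 cache.
        _mm_prefetch(reinterpret_cast<const char*>(data + i + 16), _MM_HINT_T0);
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (int i = n & ~3; i < n; ++i) // scalar tail for the leftover 0-3 values
        sum += data[i];
    return sum;
}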
What people do to maximize performance is spend a lot of time actually designing their data layout. You shouldn't need to do an intermediate transformation on your data. So what people do is allocate aligned memory (most SIMD instruction sets either require alignment or impose severe penalties for reading/writing unaligned memory), and then try to aggregate data in such a way that it fits the instruction set. In fact, it's often a win to pad your data up to whatever register size the instruction set supports. So if, say, you're going to process 3-dimensional vectors, padding with an extra unused element will almost always be a big win.
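For instance, a small illustration of that padding idea (assuming a C++11 compiler for alignas):
// Pad a 3-float vector to 16 bytes so each element maps onto exactly
// one SSE register and every element of an array stays aligned.
struct alignas(16) Vec3Padded {
    float x, y, z;
    float pad; // unused; keeps the stride at 16 bytes
};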
In some code I have converted to SSE I perform some ray tracing, tracing 4 rays at a time using __m128 data types.
In the method where I determine which object is hit first, I loop through all objects, test for intersection, and create a mask representing which rays had an intersection earlier than previously found.
I also need to maintain data on the IDs of the objects corresponding to the best hit times. I do this by maintaining a __m128 data type called objectNo, and I use the mask determined from the intersection times to update objectNo as follows:
objectNo = _mm_blendv_ps(objectNo,_mm_set1_ps((float)pobj->getID()),mask);
Where pobj->getID() will return an integer representing the id of the current object. Making this cast and using the blend seemed to be the most efficient way of updating the objectNo for all 4 rays.
After all intersections are tested I try to extract the objectNo's individually and use them to access an array to register the intersection. Most commonly I have tried this:
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
However this crashes with EXC_BAD_ACCESS, because extracting a float with value 1.0 yields its bit pattern: an int of value 1065353216.
How do I correctly unpack the __m128 into ints which can be used to index an array?
There are two SSE2 conversion intrinsics which seem to do what you want:
_mm_cvtps_epi32()
_mm_cvttps_epi32()
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_conversion.htm
These will convert 4 single-precision FP to 4 32-bit integers. The first one does it with rounding. The second one uses truncation.
So they can be used like this:
int o0 = _mm_extract_epi32(_mm_cvtps_epi32(objectNo), 0);
prv_noHits[o0]++;
EDIT: Based on what you're trying to do, I feel this can be better optimized as follows:
__m128i ids = _mm_set1_epi32(pobj->getID());
// The mask will need to change to an integer vector
objectNo = _mm_blendv_epi8(objectNo, ids, mask);
int o0 = _mm_extract_epi32(objectNo, 0);
prv_noHits[o0]++;
This version gets rid of the unnecessary conversions, but you will need to use a different mask vector (a __m128i), and objectNo becomes a __m128i as well. (The variable blend on integer data is _mm_blendv_epi8; _mm_blend_epi16 only takes a compile-time immediate mask.)
EDIT 2: Here's a way so that you won't have to change your mask:
__m128 ids = _mm_castsi128_ps(_mm_set1_epi32(pobj->getID()));
objectNo = _mm_blendv_ps(objectNo,ids,mask);
int o0 = _mm_extract_ps(objectNo, 0);
prv_noHits[o0]++;
Note that the _mm_castsi128_ps() intrinsic doesn't map to any instruction. It's just a bit-wise datatype conversion from __m128i to __m128 to get around the "typeness" in C/C++.
I'm currently trying to do, as efficiently as possible, an in-place multiplication of an array of complex numbers (memory-aligned the same way std::complex would be, but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays of hundreds of millions of elements, so it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (register int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
fcomplex is defined as follows:
struct fcomplex
{
    float real;
    float imag;
};
I've tried manually unrolling the loop, as my final loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar, thinking I'd save one access, and that proved equal to what the compiler was already doing. I've tried STL and transform, which gave close results, but still worse. I've also tried casting to std::complex and letting it use the overloaded operator for scalar * complex, but this ultimately produced the same results.
So, does anyone have any ideas? Much appreciation is given for your time in considering this! The target platform is Windows. I'm using Visual Studio 2008. The product cannot contain GPL code either! Thanks so much.
You can do this fairly easily with SSE, e.g.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 2)
    {
        __m128 vc = _mm_load_ps((float *)&values[idx]);
        __m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
        vc = _mm_mul_ps(vc, vk);
        _mm_store_ps((float *)&values[idx], vc);
    }
}
Note that values and scalar need to be 16-byte aligned.
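If the buffers aren't already aligned, a hypothetical allocation with MSVC's _aligned_malloc (available in the asker's VS2008; n is a placeholder element count) could look like:
#include <malloc.h> // _aligned_malloc / _aligned_free (MSVC)
fcomplex* values = static_cast<fcomplex*>(_aligned_malloc(n * sizeof(fcomplex), 16));
float*    scalar = static_cast<float*>(_aligned_malloc(n * sizeof(float), 16));
// ... fill and process the buffers ...
_aligned_free(scalar);
_aligned_free(values);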
Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 4)
    {
        __m128 vc0 = _mm_load_ps((float *)&values[idx]);
        __m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);
        __m128 vk  = _mm_load_ps(&scalar[idx]);
        __m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50); // [s0, s0, s1, s1]
        __m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa); // [s2, s2, s3, s3]
        vc0 = _mm_mul_ps(vc0, vk0);
        vc1 = _mm_mul_ps(vc1, vk1);
        _mm_store_ps((float *)&values[idx], vc0);
        _mm_store_ps((float *)&values[idx + 2], vc1);
    }
}
Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.
One problem I see is that inside the function it's hard for the compiler to prove that the scalar pointer does not point into the middle of the complex array (scalar could, in theory, point at the real or imaginary part of one of the complex numbers).
This aliasing possibility actually forces the order of evaluation.
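One hedged way to promise the compiler there is no aliasing is MSVC's non-standard __restrict keyword (GCC/Clang spell it __restrict__), e.g.:
void cmult_scalar_inplace(fcomplex* __restrict values,
                          const int start, const int end,
                          const float* __restrict scalar)
{
    // Same body as before; the compiler may now reorder loads and stores.
    for (int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}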
Another problem I see is that the computation here is so simple that other factors will dominate the raw speed. If you really care about performance, the only solution, in my opinion, is to implement several variations and test them at runtime on the user's machine to discover which is fastest.
What I'd consider is trying different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence on caching effects).
For the problem of the unwanted serialization, one option is to look at the code generated for something like
float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;
because here the optimizer has in theory a little bit more freedom.
Do you have access to Intel's Integrated Performance Primitives (IPP)? They have a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.