Counter-intuitive results while playing with intrinsics

Counter-intuitive results while playing with intrinsics - c++

I'm new to the world of intrinsics, and I got here because I saw a way to achieve transparent code compilation i.e. what you see is what you get. Also, reproducibility. For a system supporting e.g. AVX2 I know I'll end up with the same instructions at the end, given I use AVX2 intrinsics. This is an important step towards writing HPC libraries which make use of SIMD. Feel free to correct me in my way of thinking.
Now, I have implemented a 3D vector dot product function in three variants in a micro-benchmarking setting. The code has been compiled using the GNU compiler v11.1.0 and run on a machine with a Intel(R) Core(TM) i5-8400 CPU # 2.80GHz chip and 32 GiB of DDR4 RAM. Single thread read-write memory bandwidth of said system has been measured at ~34 GiB/s by running a DAXPY benchmark.
First, let me present the elementary structures.
struct vector3
{
float data[3] = {};
inline float& operator()(const std::size_t& index) { return data[index]; }
inline const float& operator()(const std::size_t& index) const { return data[index]; }
inline float l2_norm_sq() const { return data[0] * data[0] + data[1] * data[1] + data[2] * data[2]; }
};
// strictly speaking, the following is a class of its own that implements a subset of
// the functionality of the std::vector. The motivation is to be able to allocate memory
// without "touching" the data, a requirement that is crucial for "cold" microbenchmarking.
template<class Treal_t>
using vector3_array = std::vector<vector3<Treal_t>>;
The first is my scalar code. I'm compiling it with the flag "-O0".
void dot_product_novec(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
for (auto index = std::size_t{}; index < varray.size(); index += inc, dot_product_index += dot_products_per_inc)
{
dot_products[dot_product_index] = float_stream[index] * float_stream[index] + float_stream[index + 1] * float_stream[index + 1]
+ float_stream[index + 2] * float_stream[index + 2];
dot_products[dot_product_index + 1] = float_stream[index + 3] * float_stream[index + 3]
+ float_stream[index + 4] * float_stream[index + 4] + float_stream[index + 5] * float_stream[index + 5];
}
for (auto index = dot_product_index; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Next up is my auto-vectorized loop. I'm strongly recommending auto-vectorization using the corresponding directive of OpenMP 4.0. Compiled with flags "-O3;-ffast-math;-march=native;-fopenmp".
void dot_product_auto(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
#pragma omp simd safelen(16)
for (auto index = std::size_t{}; index < varray.size(); ++index)
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
Finally, here's my version which has been vectorized using intrinsics. Compiled using "-O3;-ffast-math;-march=native;-mfma;-mavx2".
void dot_product(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
static constexpr auto inc = 6;
static constexpr auto dot_products_per_inc = inc / 3;
const auto stream_size_div = varray.size() * 3 / inc * inc;
const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
auto dot_product_index = std::size_t{};
static const auto load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
static const auto permute_mask0 = _mm256_setr_epi32(0, 1, 2, 7, 3, 4, 5, 6);
static const auto permute_mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 4, 0);
static const auto store_mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
{
// 1. load and permute the vectors
const auto point_packed = _mm256_maskload_ps(float_stream + index, load_mask);
const auto point_permuted_packed = _mm256_permutevar8x32_ps(point_packed, permute_mask0);
// 2. do a multiply
const auto point_permuted_elementwise_sq_packed = _mm256_mul_ps(point_permuted_packed, point_permuted_packed);
// 3. do 2 horizontal additions
const auto hadd1 = _mm256_hadd_ps(point_permuted_elementwise_sq_packed, point_permuted_elementwise_sq_packed);
const auto hadd2 = _mm256_hadd_ps(hadd1, hadd1);
// 4. permute to target position
const auto result_packed = _mm256_permutevar8x32_ps(hadd2, permute_mask1);
// 4. store
_mm256_maskstore_ps(&dot_products[dot_product_index], store_mask, result_packed);
}
for (auto index = dot_product_index; index < varray.size(); ++index) // no opt for remainder loop
{
dot_products[index] = varray[index].l2_norm_sq();
}
}
I've tested the code, so I know it works.
Now, brief details about the microbenchmarking:
I use a small library which I've written for this purpose: https://gitlab.com/anxiousprogrammer/tixl.
20 warm up runs, 100 timed runs.
fresh allocations in each run for cold microbenchmarking, first touch (zeroing of the first datum in each memory page) of test data prevents measuring of page-faults.
I'm modelling the dot product so: 5 * size FLOPs after 5 * size * sizeof(float) transfers i.e code-balance of 4 or computational intensity of 0.25. Using this information, here are the performance results in terms of effective bandwidth:
no-vec: 18.6 GB/s
auto-vec: 21.3 GB/s
intrinsic-vec: 16.4 GB/s
Questions:
Is my motivation (mentioned in paragraph 1) a sensible one?
Why is my version slower than the scalar code?
Why are they all far from the peak read-write BW of 34 GiB/s?
Please excuse the lack of a minimum reproducer, the amount of code would be too much. Thanks a lot for your thoughts and inputs.

Your manually-vectorized code is not particularly efficient.
Try to benchmark the following 2 versions instead.
This one is simpler, and only requires SSE 4.1 instruction set.
inline __m128 loadFloat3( const float* rsi )
{
__m128 xy = _mm_castpd_ps( _mm_load_sd( (const double*)rsi ) );
// Compilers should merge following 2 lines into single INSERTPS with a memory operand
__m128 z = _mm_load_ss( rsi + 2 );
return _mm_insert_ps( xy, z, 0x20 );
}
// Simple version which uses DPPS instruction from SSE 4.1 set
void dotProductsSimple( float* rdi, size_t length, const float* rsi )
{
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
// Load complete 16 byte vector, discard the W
__m128 v = _mm_loadu_ps( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
// For the last vector, load exactly 12 bytes.
// Avoids potential crash when loading from out of bounds
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}
This one is more complicated, and requires AVX1 support. Probably, going to be slightly faster on most processors.
void dotProductTransposed( float* rdi, size_t length, const float* rsi )
{
constexpr size_t maskAlign8 = ~(size_t)7;
const float* const rsiEndAligned = rsi + ( length & maskAlign8 ) * 3;
const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
const float* const rsiEnd = rsi + length * 3;
while( rsi < rsiEndAligned )
{
// Load lower halves
__m256 m03, m14, m25;
m03 = _mm256_castps128_ps256( _mm_loadu_ps( rsi ) );
m14 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 4 ) );
m25 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 8 ) );
// Load upper halves; VINSERTF128 supports memory operand for the second argument.
m03 = _mm256_insertf128_ps( m03, _mm_loadu_ps( rsi + 12 ), 1 );
m14 = _mm256_insertf128_ps( m14, _mm_loadu_ps( rsi + 16 ), 1 );
m25 = _mm256_insertf128_ps( m25, _mm_loadu_ps( rsi + 20 ), 1 );
rsi += 24;
// Transpose these SIMD vectors
__m256 xy = _mm256_shuffle_ps( m14, m25, _MM_SHUFFLE( 2, 1, 3, 2 ) );
__m256 yz = _mm256_shuffle_ps( m03, m14, _MM_SHUFFLE( 1, 0, 2, 1 ) );
__m256 x = _mm256_shuffle_ps( m03, xy, _MM_SHUFFLE( 2, 0, 3, 0 ) );
__m256 y = _mm256_shuffle_ps( yz, xy, _MM_SHUFFLE( 3, 1, 2, 0 ) );
__m256 z = _mm256_shuffle_ps( yz, m25, _MM_SHUFFLE( 3, 0, 3, 1 ) );
// Now we have 3 SIMD vectors with gathered x/y/z fields of 8 source 3D vectors
// Compute squares
x = _mm256_mul_ps( x, x );
y = _mm256_mul_ps( y, y );
z = _mm256_mul_ps( z, z );
// Add squares
x = _mm256_add_ps( x, y );
x = _mm256_add_ps( x, z );
// Store 8 values
_mm256_storeu_ps( rdi, x );
rdi += 8;
}
// Handle the remainder
for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
{
__m128 v = _mm_loadu_ps( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
if( rsi < rsiEnd )
{
__m128 v = loadFloat3( rsi );
v = _mm_dp_ps( v, v, 0b01110001 );
_mm_store_ss( rdi, v );
}
}

Related

How to make use of SIMD capability for sum of squared differences between 8-bit components of RGBA pixels?

The below code is trying to extract the red, green and blue channel of a pixel value and performing an arithmetic with another set of RGB values.
It seems that code is slow around the logic where its trying to perform the squaring and addition.
What would be the possibility to replace it with a faster version as this logic doesn't seems to be using SIMD capabilities at all.
typedef struct {
unsigned char b, g, r, a;
} pixel;
register pixel *pPixel;
register int i, red1, green1, blue1, alpha1;
register int red2, green2, blue2, alpha2;
register long oldD, newD;
red1 = GetRed( *pPixel );
green1 = GetGreen( *pPixel );
blue1 = GetBlue( *pPixel );
alpha1 = GetAlpha( *pPixel );
oldD = 2000000000;
for ( i = 0; i < newcolors; ++i ) {
red2 = GetRed( mycolormap[i].acolor );
green2 = GetGreen( mycolormap[i].acolor );
blue2 = GetBlue( mycolormap[i].acolor );
alpha2 = GetAlpha( mycolormap[i].acolor );
newD = ( red1 - red2 ) * ( red1 - red2 ) +
( green1 - green2 ) * ( green1 - green2 ) +
( blue1 - blue2 ) * ( blue1 - blue2 ) +
( alpha1 - alpha2 ) * ( alpha1 - alpha2 );
if ( newD < oldD ) {
oldD = newD;
}
}
Below section of code seems to be requiring improvement
newD = ( red1 - red2 ) * ( red1 - red2 ) +
( green1 - green2 ) * ( green1 - green2 ) +
( blue1 - blue2 ) * ( blue1 - blue2 ) +
( alpha1 - alpha2 ) * ( alpha1 - alpha2 );

It’s harder than it seems. Unfortunately for you, automatic vectorizers in C++ compilers are very rarely doing a good job for integer arithmetic, like you have there.
The following implementation only needs SSE4.1. If you have AVX2 possible to improve substantially by upgrading all these vectors to 32-byte ones, however this will complicate a couple things, remainder and final reduction.
I assumed not only you want the minimum dot product, also the index of the pixel. If you only want the minimum dot product, remove bestIndices field and the code which handles that field.
struct alignas( 4 ) Pixel
{
uint8_t b, g, r, a;
};
// Define __SSE4_1__ macro when building with MSVC for AVX1 or newer ISA
#if defined( _MSC_VER ) && defined( __AVX__ ) && !defined( __SSE4_1__ )
#define __SSE4_1__ 1
#endif
size_t findClosestPixel( const Pixel& ref, const Pixel* rsi, size_t length, int& bestValue )
{
if( 0 == length )
{
bestValue = INT_MAX;
return ~(size_t)0;
}
class Acc
{
// The reference pixel we're after, broadcasted and split into low/high pieces in 16-bit lanes
__m128i lowRef, highRef;
// The best dot product so far
__m128i bestSquares = _mm_set1_epi32( INT_MAX );
// Index of the pixels currently in bestSquares
__m128i bestIndices = _mm_set1_epi32( -1 );
const __m128i lowMask = _mm_set1_epi16( 0xFF );
// For lanes where dp < bestSquares, update bestSquares and bestIndices vectors
void updateFields( __m128i dp, __m128i indices )
{
const __m128i lt = _mm_cmplt_epi32( dp, bestSquares );
#ifndef __SSE4_1__
bestSquares = _mm_or_si128( _mm_and_si128( lt, dp ), _mm_andnot_si128( lt, bestSquares ) );
bestIndices = _mm_or_si128( _mm_and_si128( lt, indices ), _mm_andnot_si128( lt, bestIndices ) );
#else
bestSquares = _mm_min_epi32( dp, bestSquares );
bestIndices = _mm_blendv_epi8( bestIndices, indices, lt );
#endif
}
public:
Acc( const Pixel& ref )
{
__m128i tmp = _mm_set1_epi32( *(const int*)( &ref ) );
lowRef = _mm_and_si128( tmp, lowMask );
highRef = _mm_srli_epi16( tmp, 8 );
}
// Update the accumulator with another 4 pixels
void update( __m128i pixels, __m128i indices )
{
// Split into two vectors with 16-bit lanes:
// low contains blue and red channels, high contains green and alpha
__m128i low = _mm_and_si128( pixels, lowMask );
__m128i high = _mm_srli_epi16( pixels, 8 );
// Compute difference with the reference value we're after
low = _mm_sub_epi16( low, lowRef );
high = _mm_sub_epi16( high, highRef );
// Compute squares as 32-bit numbers, add adjacent pairs
low = _mm_madd_epi16( low, low );
high = _mm_madd_epi16( high, high );
// Adding them results in the dot product (sum of squares) for all 4 channels
__m128i dp = _mm_add_epi32( low, high );
// Update the state
updateFields( dp, indices );
}
// Compute horizontal minimum across lanes in these accumulators
uint32_t reduce( int& bestDp )
{
// Swap low/high halves
__m128i s2 = _mm_shuffle_epi32( bestSquares, _MM_SHUFFLE( 1, 0, 3, 2 ) );
__m128i i2 = _mm_shuffle_epi32( bestIndices, _MM_SHUFFLE( 1, 0, 3, 2 ) );
updateFields( s2, i2 );
// Swap even/odd lanes
s2 = _mm_shuffle_epi32( bestSquares, _MM_SHUFFLE( 2, 3, 0, 1 ) );
i2 = _mm_shuffle_epi32( bestIndices, _MM_SHUFFLE( 2, 3, 0, 1 ) );
updateFields( s2, i2 );
// Return lowest lanes from both vectors
bestDp = _mm_cvtsi128_si32( bestSquares );
return (uint32_t)_mm_cvtsi128_si32( bestIndices );
}
};
Acc impl{ ref };
const size_t lengthAligned = ( length / 4 ) * 4;
size_t i;
__m128i currentIndices = _mm_setr_epi32( 0, 1, 2, 3 );
for( i = 0; i < lengthAligned; i += 4 )
{
// Load 4 source pixels
__m128i src = _mm_loadu_si128( ( const __m128i* )( rsi + i ) );
// Update things
impl.update( src, currentIndices );
// Increment index vector by 4 pixels
currentIndices = _mm_add_epi32( currentIndices, _mm_set1_epi32( 4 ) );
}
const size_t remainder = length % 4;
if( remainder == 0 )
{
// The input was a multiple of 4 pixels
return impl.reduce( bestValue );
}
const int* const pi = (const int*)( rsi + i );
__m128i rv;
if( lengthAligned > 0 )
{
// We had at least 4 elements on input, can do unaligned load with negative offset
size_t offset = 4 - remainder;
currentIndices = _mm_sub_epi32( currentIndices, _mm_set1_epi32( (int)offset ) );
rv = _mm_loadu_si128( ( const __m128i* )( pi - offset ) );
}
else
{
// Less than 4 elements on input, doing partial load and broadcasting the last element
const size_t remainder = length % 4;
switch( remainder )
{
case 1:
rv = _mm_set1_epi32( pi[ 0 ] );
break;
case 2:
rv = _mm_loadl_epi64( ( const __m128i* )pi );
rv = _mm_shuffle_epi32( rv, _MM_SHUFFLE( 1, 1, 1, 0 ) );
break;
case 3:
rv = _mm_loadl_epi64( ( const __m128i* )pi );
#ifndef __SSE4_1__
rv = _mm_unpacklo_epi64( rv, _mm_set1_epi32( pi[ 2 ] ) );
#else
rv = _mm_insert_epi32( rv, pi[ 2 ], 2 );
rv = _mm_shuffle_epi32( rv, _MM_SHUFFLE( 2, 2, 1, 0 ) );
#endif
break;
}
}
impl.update( rv, currentIndices );
return impl.reduce( bestValue );
}

Understand efficient implementation of median filter with SSE2 / SSSE3 Instruction Set

I'm recently stuck on understanding some legacy codes about median filter, it is totally implemented by SSE2 Instruction Set. My task is to improve its efficiency based on the legacy codes, which are as follows:
/* sort pixel to acquire median value quickly. */
inline void sortSwap(__m128& a, __m128& b)
{
__m128 temp = a;
a = _mm_min_ps(a, b);
b = _mm_max_ps(temp, b);
}
void medianDispOpt(float* dispImg, float* destDisp, const uint16_t width, const uint16_t height)
{
float* dispImgTemp = destDisp;
float* line1 = dispImg;
float* line2 = dispImg + width;
float* line3 = dispImg + 2 * width;
float* end = dispImg + width*height;
destDisp += width;
__m128 lastMedian = _mm_setzero_ps();
do {
const __m128 line1_reg = _mm_load_ps(line1);
const __m128 line1_reg_next = _mm_load_ps(line1 + 4);
__m128 store0 = line1_reg;
__m128 store1 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(line1_reg_next), _mm_castps_si128(line1_reg), 4));
__m128 store2 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(line1_reg_next), _mm_castps_si128(line1_reg), 8));
const __m128 line2_reg = _mm_load_ps(line2);
const __m128 line2_reg_next = _mm_load_ps(line2 + 4);
__m128 store3 = line2_reg;
__m128 store4 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(line2_reg_next), _mm_castps_si128(line2_reg), 4));
__m128 store5 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(line2_reg_next), _mm_castps_si128(line2_reg), 8));
const __m128 line3_reg = _mm_load_ps(line3);
const __m128 line3_reg_next = _mm_load_ps(line3 + 4);
__m128 store6 = line3_reg;
__m128 store7 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(line3_reg_next), _mm_castps_si128(line3_reg), 4));
__m128 store8 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(line3_reg_next), _mm_castps_si128(line3_reg), 8));
// find median
sortSwap(store1, store2);
sortSwap(store4, store5);
sortSwap(store7, store8);
sortSwap(store0, store1);
sortSwap(store3, store4);
sortSwap(store6, store7);
sortSwap(store1, store2);
sortSwap(store4, store5);
sortSwap(store7, store8);
sortSwap(store0, store3);
sortSwap(store5, store8);
sortSwap(store4, store7);
sortSwap(store3, store6);
sortSwap(store1, store4);
sortSwap(store2, store5);
sortSwap(store4, store7);
sortSwap(store4, store2);
sortSwap(store6, store4);
sortSwap(store4, store2);
const __m128i c = _mm_alignr_epi8(_mm_castps_si128(store4), _mm_castps_si128(lastMedian), 12);
_mm_store_si128((__m128i*)destDisp, c);
lastMedian = store4;
destDisp += 4; line1 += 4; line2 += 4; line3 += 4;
} while (line3 + 4 + 4 <= end);
memcpy(dispImgTemp, dispImg, sizeof(float)*(width + 1));
memcpy(dispImgTemp + width*height - width - 1 - 3, dispImg + width*height - width - 1 - 3, sizeof(float)*(width + 1 + 3));
}
the input image dispImg is of width * height with float pixel, the image data array is arranged by the image row, and each row is compactly laid(i.e. no redundant pixels between rows).
Now, suppose the line1_reg,line1_reg_next,line2_reg,line2_reg_next,line3_reg,line3_reg_next of first three rows are briefly represented by following numbers:
line1_reg= {0 1 2 3}, line1_reg_next = {4 5 6 7}
line2_reg= {8 9 10 11}, line2_reg_next = {12 13 14 15}
line3_reg= {16 17 18 19}, line2_reg_next = {20 21 22 23}
so storens are:
store0 = {0,1,2,3}, store1={1,2,3,4},store2={2,3,4,5}
store3 = {8,9,10,11}, store4={9,10,11,12},store5={10,11,12,13}
store6 = {16,17,18,19}, store7={17,18,19,20},store8={18,19,20,21}
After the first three sortSwaps, changed storens are as follows:
store1 = {min(1,2),min(2,3),min(3,4),min(4,5)}
store2 = {max(1,2),max(2,3),max(3,4),max(4,5)}
store4 = {min(9,10),min(10,11),min(11,12),min(12,13)}
store5 = {max(9,10),max(10,11),max(11,12),max(12,13)}
store7 = {min(17,18),min(18,19),min(19,20),min(20,21)}
store8 = {max(17,18),max(18,19),max(19,20),max(20,21)}
Then after the second three sortSwaps, changed storen are as follows:
store0 = {min(0,1,2),min(1,2,3),min(2,3,4),min(3,4,5)}
store1 = {max(0,1,2),max(1,2,3),max(2,3,4),max(3,4,5)}
store3 = {min(8,9,10),min(9,10,11),min(10,11,12),min(11,12,13)}
store4 = {max(8,9,10),max(9,10,11),max(10,11,12),max(11,12,13)}
store6 = {min(16,17,18),min(17,18,19),min(18,19,20),min(19,20,21)}
store7 = {max(16,17,18),max(17,18,19),max(18,19,20),max(19,20,21)}
Next, after the third three sortSwaps, changed storen are as follows:
store1 = {max(1,2),max(2,3),max(3,4),max(4,5)}
store2 = {max(0,1,2),max(1,2,3),max(2,3,4),max(3,4,5)}
store4 = {max(9,10),max(10,11),max(11,12),max(12,13)}
store5 = {max(8,9,10),max(9,10,11),max(10,11,12),max(11,12,13)}
store7 = {max(17,18),max(18,19),max(19,20),max(20,21)}
store8 = {max(16,17,18),max(17,18,19),max(18,19,20),max(19,20,21)}
...
Now start the sixth three sortSwaps, changed storen are as follows:
store4[0] = median(max(1,2),max(9,10),max(17,18))
store4[1] = median(max(2,3),max(10,11),max(18,19))
store4[2] = median(max(3,4),max(11,12),max(19,20))
store4[3] = median(max(4,5),max(12,13),max(20,21))
...
Above is my stiff thread, I cannot continue my inference on calculating the sixth three sortSwaps. So, can anyone give me a clearer interpretation of above codes?

SSE inline assembly and possible g++ optimization bug

Let's start with the code. I have two structures, one for vectors, and other for matrices.
struct AVector
{
explicit AVector(float x=0.0f, float y=0.0f, float z=0.0f, float w=0.0f):
x(x), y(y), z(z), w(w) {}
AVector(const AVector& a):
x(a.x), y(a.y), z(a.z), w(a.w) {}
AVector& operator=(const AVector& a) {x=a.x; y=a.y; z=a.z; w=a.w; return *this;}
float x, y, z, w;
};
struct AMatrix
{
// Row-major
explicit AMatrix(const AVector& a=AVector(), const AVector& b=AVector(), const AVector& c=AVector(), const AVector& d=AVector())
{row[0]=a; row[1]=b; row[2]=c; row[3]=d;}
AMatrix(const AMatrix& m) {row[0]=m.row[0]; row[1]=m.row[1]; row[2]=m.row[2]; row[3]=m.row[3];}
AMatrix& operator=(const AMatrix& m) {row[0]=m.row[0]; row[1]=m.row[1]; row[2]=m.row[2]; row[3]=m.row[3]; return *this;}
AVector row[4];
};
Next, code performing calculations on those structures. Dot product using inlined ASM and SSE instructions:
inline AVector AVectorDot(const AVector& a, const AVector& b)
{
// XXX
/*const double v=a.x*b.x+a.y*b.y+a.z*b.z+a.w*b.w;
return AVector(v, v, v, v);*/
AVector c;
asm volatile(
"movups (%1), %%xmm0\n\t"
"movups (%2), %%xmm1\n\t"
"mulps %%xmm1, %%xmm0\n\t" // xmm0 -> (a1+b1, , , )
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0xB1, %%xmm1, %%xmm1\n\t" // 0xB1 = 10110001
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x, y, z, w)+(y, x, w, z)=(x+y, x+y, z+w, z+w)
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0x0A, %%xmm1, %%xmm1\n\t" // 0x0A = 00001010
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x+y+z+w, , , )
"movups %%xmm0, %0\n\t"
: "=m"(c)
: "r"(&a), "r"(&b)
);
return c;
}
Matrix transposition:
inline AMatrix AMatrixTranspose(const AMatrix& m)
{
AMatrix c(
AVector(m.row[0].x, m.row[1].x, m.row[2].x, m.row[3].x),
AVector(m.row[0].y, m.row[1].y, m.row[2].y, m.row[3].y),
AVector(m.row[0].z, m.row[1].z, m.row[2].z, m.row[3].z),
AVector(m.row[0].w, m.row[1].w, m.row[2].w, m.row[3].w));
// XXX
/*printf("AMcrix c:\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n",
c.row[0].x, c.row[0].y, c.row[0].z, c.row[0].w,
c.row[1].x, c.row[1].y, c.row[1].z, c.row[1].w,
c.row[2].x, c.row[2].y, c.row[2].z, c.row[2].w,
c.row[3].x, c.row[3].y, c.row[3].z, c.row[3].w);*/
return c;
}
Matrix-matrix multiplication - transpose first matrix, because when I have it stored as column major, and second one as row major, then I can perform multiplication using dot-products.
inline AMatrix AMatrixMultiply(const AMatrix& a, const AMatrix& b)
{
AMatrix c;
const AMatrix at=AMatrixTranspose(a);
// XXX
/*printf("AMatrix at:\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n",
at.row[0].x, at.row[0].y, at.row[0].z, at.row[0].w,
at.row[1].x, at.row[1].y, at.row[1].z, at.row[1].w,
at.row[2].x, at.row[2].y, at.row[2].z, at.row[2].w,
at.row[3].x, at.row[3].y, at.row[3].z, at.row[3].w);*/
for(int i=0; i<4; ++i)
{
c.row[i].x=AVectorDot(at.row[0], b.row[i]).w;
c.row[i].y=AVectorDot(at.row[1], b.row[i]).w;
c.row[i].z=AVectorDot(at.row[2], b.row[i]).w;
c.row[i].w=AVectorDot(at.row[3], b.row[i]).w;
}
return c;
}
Now time for main (pun intended) part:
int main(int argc, char *argv[])
{
AMatrix a(
AVector(0, 1, 0, 0),
AVector(1, 0, 0, 0),
AVector(0, 0, 0, 1),
AVector(0, 0, 1, 0)
);
AMatrix b(
AVector(1, 0, 0, 0),
AVector(0, 2, 0, 0),
AVector(0, 0, 3, 0),
AVector(0, 0, 0, 4)
);
AMatrix c=AMatrixMultiply(a, b);
printf("AMatrix c:\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n [%5.2f %5.2f %5.2f %5.2f]\n",
c.row[0].x, c.row[0].y, c.row[0].z, c.row[0].w,
c.row[1].x, c.row[1].y, c.row[1].z, c.row[1].w,
c.row[2].x, c.row[2].y, c.row[2].z, c.row[2].w,
c.row[3].x, c.row[3].y, c.row[3].z, c.row[3].w);
AVector v(1, 2, 3, 4);
AVector w(1, 1, 1, 1);
printf("Dot product: %f (1+2+3+4 = 10)\n", AVectorDot(v, w).w);
return 0;
}
In the above code I make two matrices, multiply them and print the resulting matrix.
It works fine if I don't use any of the compiler optimizations (g++ main.cpp -O0 -msse). With optimizations enabled (g++ main.cpp -O1 -msse) resulting matrix is empty (all fields are zeroes).
Uncommenting any block marked with XXX makes program write correct result.
It seems to me that GCC optimizes-out matrix at from AMatrixMultiply function, because it wrongly assumes it's not used in AVectorDot, which is written using SSE inlines.
Last few lines check if dot-product function really works, and yes, it does.
So, the question is: did I do or understand something wrong, or is this some kind of bug in GCC? My guess is 7:3 mix of above.
I'm using GCC version 5.1.0 (tdm-1).

This is also a very inefficient way of multiplying matrices using SSE. I'd be surprised if it was much faster than a scalar implementation with so much floating-point throughput available on modern CPUs. A better method is outlined here, no explicit transpose needed:
AMatrix & operator *= (AMatrix & m0, const AMatrix & m1)
{
__m128 r0 = _mm_load_ps(& m1[0][x]);
__m128 r1 = _mm_load_ps(& m1[1][x]);
__m128 r2 = _mm_load_ps(& m1[2][x]);
__m128 r3 = _mm_load_ps(& m1[3][x]);
for (int i = 0; i < 4; i++)
{
__m128 ti = _mm_load_ps(& m0[i][x]), t0, t1, t2, t3;
t0 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(0, 0, 0, 0));
t1 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(1, 1, 1, 1));
t2 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(2, 2, 2, 2));
t3 = _mm_shuffle_ps(ti, ti, _MM_SHUFFLE(3, 3, 3, 3));
ti = t0 * r0 + t1 * r1 + t2 * r2 + t3 * r3;
_mm_store_ps(& m0[i][x], ti);
}
return m0;
}
On modern compilers, like gcc and clang, t0 * r0 + t1 * r1 + t2 * r2 + t3 * r3 is actually operating on __m128 types; though you can replace these with _mm_mul_ps and _mm_add_ps intrinsics if you want.
Return by value is then just a matter of adding a function like:
inline AMatrix operator * (const AMatrix & m0, const AMatrix & m1)
{
AMatrix lhs (m0); return (lhs *= m1);
}
Personally, I'd just replace the float x, y, z, w; with alignas (16) float _s[4] = {}; or similar - so you get a 'zero-vector' by default, or a defaulted constructor:
constexpr AVector () = default;
as well as nice constructors, like:
constexpr Vector (float x, float y, float z, float w)
: _s {x, y, z, w} {}

Your inline assembly lacks some constraints:
asm volatile(
"movups (%1), %%xmm0\n\t"
"movups (%2), %%xmm1\n\t"
"mulps %%xmm1, %%xmm0\n\t" // xmm0 -> (a1+b1, , , )
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0xB1, %%xmm1, %%xmm1\n\t" // 0xB1 = 10110001
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x, y, z, w)+(y, x, w, z)=(x+y, x+y, z+w, z+w)
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = xmm0
"shufps $0x0A, %%xmm1, %%xmm1\n\t" // 0x0A = 00001010
"addps %%xmm1, %%xmm0\n\t" // xmm1 -> (x+y+z+w, , , )
"movups %%xmm0, %0\n\t"
: "=m"(c)
: "r"(&a), "r"(&b)
);
GCC does not know that this assembler fragment clobbers %xmm0 and %xmm1, so it might not reload those registers to their previous values after the fragment has run. Some additional clobbers might be missing as well.

Using less matrices with BLAS

I'm quite new to BLAS (using OpenBLAS with C++ and VisualStudio)
I know dgemm performs C <- alpha * op(A) * op(B) + beta * C
I was trying to save some allocation doing this: B <- 1 * op(A) * op(B) + 0 * B
In other words, putting the result in the B matrix,
BUT making beta = 0 and repeating B in the position of C, results in a zero answer.
Is there a way to make it right?
The code that I'm using:
double* A = new double [3*3]; //3 rows x 3 columns
A[0] = 8;
A[1] = 3;
A[2] = 4;
A[3] = 1;
A[4] = 5;
A[5] = 9;
A[6] = 6;
A[7] = 7;
A[8] = 2;
double* v = new double[3]; //3 rows x 1 column
v[0] = 3;
v[1] = 5;
v[2] = 2;
double* foo = new double[3]; //3 rows x 1 column
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
3, 1, 3,
1,
A, 3,
v, 3,
0,
foo, 3); // makes foo = [41 ; 48 ; 61], **right**
cblas_dgemm(CblasColMajor, CblasTrans, CblasTrans,
3, 1, 3,
1,
A, 3,
v, 3,
0,
v, 3); // makes v = [0 ; 0 ; 0], **wrong**

BLAS dgemm function documentation states that only the C matrix parameter is for both input and output, being overwritten by the operation result. As B is defined just for input, BLAS implementations can assume that it shouldn't be modified.
Setting B and C to the same data pointer could be triggering some error verification on the implementation you're using, returning the zeroed result to indicate that.

Bilinear interpolation in C/C++ and CUDA

I want to emulate the behavior of CUDA bilinear interpolation on CPU, but I found that the return value of tex2D seems not fit to the bilinear formula.
I guess that casting the interpolation coefficients from float to 9-bit fixed point format with 8 bits of fractional value[1] results in different values.
According to the conversion fomula [2, line 106], the result of the conversion will be the same as the input float when the coeffient is 1/2^n, with n=0,1,..., 8, but I still (not always) receive weird values.
Below I report an example of weird values. In this case, weird values always happen when id = 2*n+1, could anyone tell me why?
Src Array:
Src[0][0] = 38;
Src[1][0] = 39;
Src[0][1] = 118;
Src[1][1] = 13;
Texture Definition:
static texture<float4, 2, cudaReadModeElementType> texElnt;
texElnt.addressMode[0] = cudaAddressModeClamp;
texElnt.addressMode[1] = cudaAddressModeClamp;
texElnt.filterMode = cudaFilterModeLinear;
texElnt.normalized = false;
Kernel Function:
static __global__ void kernel_texElnt(float* pdata, int w, int h, int c, float stride/*0.03125f*/) {
const int gx = blockIdx.x*blockDim.x + threadIdx.x;
const int gy = blockIdx.y*blockDim.y + threadIdx.y;
const int gw = gridDim.x * blockDim.x;
const int gid = gy*gw + gx;
if (gx >= w || gy >= h) {
return;
}
float2 pnt;
pnt.x = (gx)*(stride)/*1/32*/;
pnt.y = 0.0625f/*1/16*/;
float4 result = tex2D( texElnt, pnt.x + 0.5, pnt.y + 0.5f);
pdata[gid*3 + 0] = pnt.x;
pdata[gid*3 + 1] = pnt.y;
pdata[gid*3 + 2] = result.x;
}
Bilinear Result of CUDA
id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.0000000
1 0.03125 0.0625 42.6171875
2 0.06250 0.0625 42.6484375
3 0.09375 0.0625 42.2656250
4 0.12500 0.0625 42.2968750
5 0.15625 0.0625 41.9140625
6 0.18750 0.0625 41.9453125
7 0.21875 0.0625 41.5625000
8 0.25000 0.0625 41.5937500
9 0.28125 0.0625 41.2109375
0 0.31250 0.0625 41.2421875
10 0.34375 0.0625 40.8593750
11 0.37500 0.0625 40.8906250
12 0.40625 0.0625 40.5078125
13 0.43750 0.0625 40.5390625
14 0.46875 0.0625 40.1562500
15 0.50000 0.0625 40.1875000
16 0.53125 0.0625 39.8046875
17 0.56250 0.0625 39.8359375
18 0.59375 0.0625 39.4531250
19 0.62500 0.0625 39.4843750
20 0.65625 0.0625 39.1015625
21 0.68750 0.0625 39.1328125
22 0.71875 0.0625 38.7500000
23 0.75000 0.0625 38.7812500
24 0.78125 0.0625 38.3984375
25 0.81250 0.0625 38.4296875
26 0.84375 0.0625 38.0468750
27 0.87500 0.0625 38.0781250
28 0.90625 0.0625 37.6953125
29 0.93750 0.0625 37.7265625
30 0.96875 0.0625 37.3437500
31 1.00000 0.0625 37.3750000
CPU Result:
// convert coefficient ((1-α)*(1-β)), (α*(1-β)), ((1-α)*β), (α*β) to fixed point format
id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.00000000
1 0.03125 0.0625 43.23046875
2 0.06250 0.0625 42.64843750
3 0.09375 0.0625 42.87890625
4 0.12500 0.0625 42.29687500
5 0.15625 0.0625 42.52734375
6 0.18750 0.0625 41.94531250
7 0.21875 0.0625 42.17578125
8 0.25000 0.0625 41.59375000
9 0.28125 0.0625 41.82421875
0 0.31250 0.0625 41.24218750
10 0.34375 0.0625 41.47265625
11 0.37500 0.0625 40.89062500
12 0.40625 0.0625 41.12109375
13 0.43750 0.0625 40.53906250
14 0.46875 0.0625 40.76953125
15 0.50000 0.0625 40.18750000
16 0.53125 0.0625 40.41796875
17 0.56250 0.0625 39.83593750
18 0.59375 0.0625 40.06640625
19 0.62500 0.0625 39.48437500
20 0.65625 0.0625 39.71484375
21 0.68750 0.0625 39.13281250
22 0.71875 0.0625 39.36328125
23 0.75000 0.0625 38.78125000
24 0.78125 0.0625 39.01171875
25 0.81250 0.0625 38.42968750
26 0.84375 0.0625 38.66015625
27 0.87500 0.0625 38.07812500
28 0.90625 0.0625 38.30859375
29 0.93750 0.0625 37.72656250
30 0.96875 0.0625 37.95703125
31 1.00000 0.0625 37.37500000
I leave a simple code on my github [3], after running the program you will got two files in D:\.
Edit 2014/01/20
I run the program with different increments and found the specification of tex2D "when alpha multiplied beta is less than 0.00390625, the return of tex2D does not match the bilinear interpolation formula"

Already satisfactory answers have been provided to this question, so now I just want to give a compendium of hopefully useful information on bilinear interpolation, how can it be implemented in C++ and the different ways it can be done in CUDA.
Maths behind bilinear interpolation
Assume that the original function T(x, y) is sampled at the Cartesian regular grid of points (i, j) with 0 <= i < M1, 0 <= j < M2 and i and j integers. For each value of y, one can first use 0 <= a < 1 to represent an arbitrary point i + a comprised between i and i + 1. Then, a linear interpolation along the y = j axis (which is parallel to the x axis) at that point can be performed obtaining
where r(x,y) is the function interpolating the samples of T(x,y). The same can be done for the line y = j + 1, obtaining
Now, for each i + a, an interpolation along the y axis can be performed on the samples r(i+a,j) and r(i+a,j+1). Accordingly, if one uses 0 <= b < 1 to represent an arbitrary point j + b located between j and j + 1, then a linear interpolation along the x = i + a axis (which is parallel to the y axis) can be worked out, so getting the final result
Note that the relations between i, j, a, b, x and y are the following
C/C++ implementation
Let me stress that this implementation, as well as the following CUDA ones, assume, as done at the beginning, that the samples of T are located on the Cartesian regular grid of points (i, j) with 0 <= i < M1, 0 <= j < M2 and i and j integers (unit spacing). Also, the routine is provided in single precision, complex (float2) arithmetics, but it can be easily cast in other arithmetics of interest.
void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, float2 * __restrict__ h_data,
float * __restrict__ h_xout, float * __restrict__ h_yout,
const int M1, const int M2, const int N1, const int N2){
float2 result_temp1, result_temp2;
for(int k=0; k<N2; k++){
for(int l=0; l<N1; l++){
const int ind_x = floor(h_xout[k*N1+l]);
const float a = h_xout[k*N1+l]-ind_x;
const int ind_y = floor(h_yout[k*N1+l]);
const float b = h_yout[k*N1+l]-ind_y;
float2 h00, h01, h10, h11;
if (((ind_x) < M1)&&((ind_y) < M2)) h00 = h_data[ind_y*M1+ind_x]; else h00 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y) < M2)) h10 = h_data[ind_y*M1+ind_x+1]; else h10 = make_float2(0.f, 0.f);
if (((ind_x) < M1)&&((ind_y+1) < M2)) h01 = h_data[(ind_y+1)*M1+ind_x]; else h01 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y+1) < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else h11 = make_float2(0.f, 0.f);
result_temp1.x = a * h10.x + (-h00.x * a + h00.x);
result_temp1.y = a * h10.y + (-h00.y * a + h00.y);
result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
result_temp2.y = a * h11.y + (-h01.y * a + h01.y);
h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
}
}
}
The if/else statements within the above code are simply boundary checkings. If the sample falls outside the [0, M1-1] x [0, M2-1], then it is set to 0.
Standard CUDA implementation
This is a "standard" CUDA implementation tracing the above CPU one. No usage of texture memory.
__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data,
const float * __restrict__ d_xout, const float * __restrict__ d_yout,
const int M1, const int M2, const int N1, const int N2)
{
const int l = threadIdx.x + blockDim.x * blockIdx.x;
const int k = threadIdx.y + blockDim.y * blockIdx.y;
if ((l<N1)&&(k<N2)) {
float2 result_temp1, result_temp2;
const int ind_x = floor(d_xout[k*N1+l]);
const float a = d_xout[k*N1+l]-ind_x;
const int ind_y = floor(d_yout[k*N1+l]);
const float b = d_yout[k*N1+l]-ind_y;
float2 d00, d01, d10, d11;
if (((ind_x) < M1)&&((ind_y) < M2)) d00 = d_data[ind_y*M1+ind_x]; else d00 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y) < M2)) d10 = d_data[ind_y*M1+ind_x+1]; else d10 = make_float2(0.f, 0.f);
if (((ind_x) < M1)&&((ind_y+1) < M2)) d01 = d_data[(ind_y+1)*M1+ind_x]; else d01 = make_float2(0.f, 0.f);
if (((ind_x+1) < M1)&&((ind_y+1) < M2)) d11 = d_data[(ind_y+1)*M1+ind_x+1]; else d11 = make_float2(0.f, 0.f);
result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
result_temp1.y = a * d10.y + (-d00.y * a + d00.y);
result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
result_temp2.y = a * d11.y + (-d01.y * a + d01.y);
d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
}
}
CUDA implementation with texture fetch
This is the same implementation as above, but the global memory is now accessed by the texture cache. For example, T[i,j] is accessed as
tex2D(d_texture_fetch_float,ind_x,ind_y);
(where, of course ind_x = i and ind_y = j, and d_texture_fetch_float is assumed to be a global scope variable) instead of
d_data[ind_y*M1+ind_x];
Note that the hard-wired texture filtering capabilities are not exploited here. The routine below has the same precision as the above one and could result somewhat faster than that on old CUDA architectures.
__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result,
const float * __restrict__ d_xout, const float * __restrict__ d_yout,
const int M1, const int M2, const int N1, const int N2)
{
const int l = threadIdx.x + blockDim.x * blockIdx.x;
const int k = threadIdx.y + blockDim.y * blockIdx.y;
if ((l<N1)&&(k<N2)) {
float2 result_temp1, result_temp2;
const int ind_x = floor(d_xout[k*N1+l]);
const float a = d_xout[k*N1+l]-ind_x;
const int ind_y = floor(d_yout[k*N1+l]);
const float b = d_yout[k*N1+l]-ind_y;
const float2 d00 = tex2D(d_texture_fetch_float,ind_x,ind_y);
const float2 d10 = tex2D(d_texture_fetch_float,ind_x+1,ind_y);
const float2 d11 = tex2D(d_texture_fetch_float,ind_x+1,ind_y+1);
const float2 d01 = tex2D(d_texture_fetch_float,ind_x,ind_y+1);
result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
result_temp1.y = a * d10.y + (-d00.y * a + d00.y);
result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
result_temp2.y = a * d11.y + (-d01.y * a + d01.y);
d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
}
}
Texture binding can be done according to
void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
size_t pitch;
float* data_d;
gpuErrchk(cudaMallocPitch((void**)&data_d,&pitch, M1 * sizeof(float2), M2));
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
gpuErrchk(cudaBindTexture2D(0,&d_texture_fetch_float,data_d,&desc,M1,M2,pitch));
d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
gpuErrchk(cudaMemcpy2D(data_d,pitch,data,sizeof(float2)*M1,sizeof(float2)*M1,M2,cudaMemcpyHostToDevice));
}
Note that now we need no if/else boundary checking, because the texture will automatically clamp to zero the samples falling outside the [0, M1-1] x [0, M2-1] sampling region, thanks to the instructions
d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
CUDA implementation with texture interpolation
This is the last implementation and uses the hard-wired capabilities of texture filtering.
__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result,
const float * __restrict__ d_xout, const float * __restrict__ d_yout,
const int M1, const int M2, const int N1, const int N2)
{
const int l = threadIdx.x + blockDim.x * blockIdx.x;
const int k = threadIdx.y + blockDim.y * blockIdx.y;
if ((l<N1)&&(k<N2)) { d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f); }
}
Note that the interpolation formula implemented by this feature is the same as derived above, but now
where x_B = x - 0.5 and y_B = y - 0.5. This explains the 0.5 offset in the instruction
tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f)
In this case, texture binding should be done as follows
void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
size_t pitch;
float* data_d;
gpuErrchk(cudaMallocPitch((void**)&data_d,&pitch, M1 * sizeof(float2), M2));
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
gpuErrchk(cudaBindTexture2D(0,&d_texture_interp_float,data_d,&desc,M1,M2,pitch));
d_texture_interp_float.addressMode[0] = cudaAddressModeClamp;
d_texture_interp_float.addressMode[1] = cudaAddressModeClamp;
d_texture_interp_float.filterMode = cudaFilterModeLinear; // --- Enable linear filtering
d_texture_interp_float.normalized = false; // --- Texture coordinates will NOT be normalized
gpuErrchk(cudaMemcpy2D(data_d,pitch,data,sizeof(float2)*M1,sizeof(float2)*M1,M2,cudaMemcpyHostToDevice));
}
Note that, as already mentioned in the other answers, a and b are stored in 9-bit fixed point format with 8 bits of fractional value, so this approach will be very fast, but less accurate than those above.

The UV interpolants are truncated to 9 bits, not the participating texel values. In Chapter 10 (Texturing) of The CUDA Handbook, this is described in detail (including CPU emulation code) for the 1D case. Code is open source and may be found at https://github.com/ArchaeaSoftware/cudahandbook/blob/master/texturing/tex1d_9bit.cu

Wrong formula of bilinear interpolation makes result of texture fetching weird.
Formula - 1: you can find it in cuda appendix or wiki easily
tex(x,y)=(1−α)(1−β)T[i,j] + α(1−β)T[i+1,j] + (1−α)βT[i,j+1] + αβT[i+1,j+1]
Formula - 2: reduce times of multiply
tex(x,y)=T[i,j] + α(T[i+1,j]-T[i,j]) + β(T[i,j+1]-T[i,j]) + αβ(T[i,j]+T[i+1,j+1] - T[i+1, j]-T[i,j+1])
If you use 9-bit fixed point format to formula 1, you will get mismatch result of texture fetching, but formula 2 works fine.
Conclusion :
If you want to emulate the bilinear interpolation implemented by cuda texture, you should use formula 3. Try it!
Formula - 3:
tex(x,y)=T[i,j] + frac(α)(T[i+1,j]-T[i,j]) + frac(β)(T[i,j+1]-T[i,j]) + frac(αβ)(T[i,j]+T[i+1,j+1] - T[i+1, j]-T[i,j+1])
// frac(x) turns float to 9-bit fixed point format with 8 bits of fraction values.
float frac( float x ) {
float frac, tmp = x - (float)(int)(x);
float frac256 = (float)(int)( tmp*256.0f + 0.5f );
frac = frac256 / 256.0f;
return frac;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Counter-intuitive results while playing with intrinsics - c++

Related

How to make use of SIMD capability for sum of squared differences between 8-bit components of RGBA pixels?

Understand efficient implementation of median filter with SSE2 / SSSE3 Instruction Set

SSE inline assembly and possible g++ optimization bug

Using less matrices with BLAS

Bilinear interpolation in C/C++ and CUDA

Categories

Resources