SSE, row major vs column major performance issue - c++

For personal interest and fun, I'm coding a geometry lib using SSE (4.1).
I spent the last 12 hours trying to understand a performance issue when dealing with row-major vs column-major stored matrices.
I know DirectX/OpenGL matrices are stored row major, so it would be better for me to keep my matrices stored in row-major order so that there is no conversion when storing/loading matrices to/from the GPU/shaders.
But I did some profiling, and I get faster results with column major.
To transform a point with a transform matrix in row major, it's P' = P * M, and in column major it's P' = M * P.
So in column major it's simply 4 dot products, i.e. only 4 SSE4.1 instructions ( _mm_dp_ps ), whereas in row major I must do those 4 dot products on the transposed matrix.
Performance results on 10M vectors:
(30/05/2014#08:48:10) Log : [5] ( Vec.Mul.Matrix ) = 76.216653 ms ( row major transform )
(30/05/2014#08:48:10) Log : [6] ( Matrix.Mul.Vec ) = 61.554892 ms ( column major transform )
I tried several ways to do the Vec * Matrix operation, with and without _MM_TRANSPOSE, and the fastest way I found is this:
mssFloat Vec4::operator|(const Vec4& v) const //-- Dot Product
{
    return _mm_dp_ps(m_val, v.m_val, 0xFF).m128_f32[0];
}

inline Vec4 operator*(const Vec4& vec, const Mat4& m)
{
    return Vec4( Vec4(m[0][0], m[1][0], m[2][0], m[3][0]) | vec
               , Vec4(m[0][1], m[1][1], m[2][1], m[3][1]) | vec
               , Vec4(m[0][2], m[1][2], m[2][2], m[3][2]) | vec
               , Vec4(m[0][3], m[1][3], m[2][3], m[3][3]) | vec
               );
}
My class Vec4 is simply a __m128 m_val; with optimizations enabled the vector construction is all done efficiently in SSE registers.
My first guess is that this multiplication is not optimal. I'm new to SSE, so I'm a bit puzzled about how to optimize this. My intuition tells me to use shuffle instructions, but I'd like to understand why that would be faster. Would loading via 4 shuffles of __m128 be faster than assigning ( __m128 m_val = _mm_set_ps(w, z, y, x); )?
From https://software.intel.com/sites/landingpage/IntrinsicsGuide/
I couldn't find performance info on _mm_set_ps.
EDIT: I double-checked the profiling method; each test is done in the same manner, so there are no memory-cache differences. To avoid local caching, I run the operations over a randomized big vector array, with the same seed for each test. Only one test per execution, to avoid a performance increase from the memory cache.

Don't use _mm_dp_ps for matrix multiplication! I already explained this in great detail at Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point? (incidentally this was my first post on SO).
You don't need anything more than SSE to do this efficiently (not even SSE2). Use this code to do 4x4 matrix multiplication efficiently. If the matrices are stored in row-major order then do gemm4x4_SSE(A,B,C). If the matrices are stored in column-major order then do gemm4x4_SSE(B,A,C).
void gemm4x4_SSE(float *A, float *B, float *C) {
    __m128 row[4], sum[4];
    // load the four rows of B
    for (int i = 0; i < 4; i++) row[i] = _mm_load_ps(&B[i*4]);
    for (int i = 0; i < 4; i++) {
        sum[i] = _mm_setzero_ps();
        // row i of C = sum over j of A[i][j] * (row j of B)
        for (int j = 0; j < 4; j++) {
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
        }
    }
    for (int i = 0; i < 4; i++) _mm_store_ps(&C[i*4], sum[i]);
}
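The same broadcast/multiply/add pattern also covers the vector * matrix case from the question without _mm_dp_ps. A minimal sketch (the function name is mine), assuming the 4x4 matrix is stored row-major as 16 contiguous, 16-byte-aligned floats:
#include <xmmintrin.h> // SSE1 is enough

void vec4_mul_mat4_rowmajor_SSE(const float *u, const float *M, float *v) {
    // v = u * M: broadcast each component of u and accumulate the scaled rows
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 4; i++) {
        __m128 ui  = _mm_set1_ps(u[i]);    // broadcast u[i] to all four lanes
        __m128 row = _mm_load_ps(&M[i*4]); // row i of M (16-byte aligned)
        acc = _mm_add_ps(acc, _mm_mul_ps(ui, row));
    }
    _mm_store_ps(v, acc);                  // v must also be 16-byte aligned
}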

We actually profiled 3x4 matrix pseudo-multiplication (as if it were a 4x4 affine) and found that in both SSE3 and AVX there was very little difference (<10%) between the column-major and row-major layouts, as long as both are optimized to the limit.
The benchmark
https://github.com/buildaworldnet/IrrlichtBAW/blob/master/examples_tests/19.SIMDmatrixMultiplication/main.cpp

Related

How to efficiently normalize vector C++

I want to know how to efficiently normalize a vector in C++. So far, this is what I have. Is there a way to make it more efficient and/or do it in a single pass?
std::array<float, MyClass::FEATURE_LENGTH> MyClass::normalize(const std::array<float, FEATURE_LENGTH>& arr) {
    std::array<float, MyClass::FEATURE_LENGTH> output{};
    double mod = 0.0;
    for (size_t i = 0; i < arr.size(); ++i) {
        mod += arr[i] * arr[i];
    }
    double mag = std::sqrt(mod);
    if (mag == 0) {
        throw std::logic_error("The input vector is a zero vector");
    }
    for (size_t i = 0; i < arr.size(); ++i) {
        output[i] = arr[i] / mag;
    }
    return output;
}
There are many ways to optimize implementations of this algorithm, depending on the particulars of your problem.
For all of your loops, you can use SIMD vectorization to increase throughput.
If your vectors are very wide then you can use multiple threads to compute the magnitude. Each would compute a partial sum, then some serial code would collect the results.
You can work entirely in floats, rather than doubles, if your values are within range.
You can compute the inverse square root of the magnitude by using intrinsics (such as RSQRTSS on x86) or using Quake's method if such intrinsics are unavailable. Then you would scale by that value.
Additionally, you can get much faster code by fusing operations with the normalization. Say you want to add two vectors and normalize the result. You can compute their sum and their magnitude in a single pass and then scale in a second.
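For example, here is a minimal scalar sketch of that fusion (the function name is mine, written in the same style as the question's code): the sum and its squared magnitude are accumulated in one pass, then a second pass scales by the reciprocal magnitude.
#include <array>
#include <cmath>
#include <cstddef>
#include <stdexcept>

template <std::size_t N>
std::array<float, N> addAndNormalize(const std::array<float, N>& a,
                                     const std::array<float, N>& b) {
    std::array<float, N> out{};
    float mod = 0.0f;
    for (std::size_t i = 0; i < N; ++i) { // pass 1: sum and accumulate squared magnitude
        out[i] = a[i] + b[i];
        mod += out[i] * out[i];
    }
    float mag = std::sqrt(mod);
    if (mag == 0.0f)
        throw std::logic_error("The summed vector is a zero vector");
    float inv = 1.0f / mag;
    for (std::size_t i = 0; i < N; ++i)   // pass 2: scale by the reciprocal magnitude
        out[i] *= inv;
    return out;
}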
How can you do it in a single pass? It seems obvious that you need to compute mag using all the items, and that you must have computed it before updating the items.
As a division usually takes longer than a multiplication, one possible optimization would be to add:
double mag_inv = 1.0 / mag;
Then you could multiply items like that:
output[i] = arr[i] * mag_inv;
If there is a relatively high probability that a vector is already normalized, you might want to check if mag is equal to 1.0.
In case someone needs it, here's an example of SIMD vectorization code:
#include <immintrin.h> // header for SIMD intrinsics

void Normalize(const float lpInput[4], float lpOutput[4]) {
    __m128 vInput = _mm_load_ps(lpInput);          // load input vector (x, y, z, a)
    __m128 vSquared = _mm_mul_ps(vInput, vInput);  // square the input values
    __m128 vHalfSum = _mm_hadd_ps(vSquared, vSquared);
    __m128 vSum = _mm_hadd_ps(vHalfSum, vHalfSum); // compute the sum of values
    float fInvSqrt; _mm_store_ss(&fInvSqrt, _mm_rsqrt_ss(vSum)); // compute the inverse sqrt
    __m128 vNormalized = _mm_mul_ps(vInput, _mm_set1_ps(fInvSqrt)); // normalize the input vector
    _mm_store_ps(lpOutput, vNormalized);           // store normalized vector (x, y, z, a)
}
In order to compile it properly you'll need the corresponding instruction sets enabled in the compiler options; the code above only requires up to SSE3 (for _mm_hadd_ps), e.g. -msse3 for gcc or clang (MSVC exposes these intrinsics without a dedicated /arch switch, and /arch:AVX also covers them).
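One more caveat: _mm_load_ps and _mm_store_ps require 16-byte-aligned pointers, so the caller has to provide aligned arrays. A minimal usage sketch (my own, assuming the Normalize function above is in scope):
#include <cstdio>

int main() {
    alignas(16) float in[4]  = { 1.0f, 2.0f, 3.0f, 4.0f }; // 16-byte-aligned input
    alignas(16) float out[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    Normalize(in, out);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}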

vector * matrix product efficiency issue

Just as Z boson recommended, I am using a column-major matrix format in order to avoid having to use the dot product. I don't see a feasible way to avoid it when multiplying a vector with a matrix, though. The matrix multiplication trick requires efficient extraction of rows (or columns, if we transpose the product). To multiply a vector by a matrix, we therefore transpose:
(b * A)^T = A^T * b^T
A is a matrix, b a row vector which, after being transposed, becomes a column vector. Its rows are just single scalars, and the vector * matrix product implementation becomes an inefficient implementation of dot products of columns of the (non-transposed) matrix A with b. Is there a way to avoid performing these dot products? The only way I see that could do it would involve row extraction, which is inefficient with the column-major matrix format.
This can be understood from my original post on this (my first on SO):
efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod
The rest of the discussion applies to 4x4 matrices.
Here are two methods to do matrix times vector (v = Mu, where v and u are column vectors):
method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4.
The first method is more familiar from math class while the second is more efficient for a SIMD computer. The second method uses vectorized math (like numpy) e.g.
u1*col1 = (u1*col1x, u1*col1y, u1*col1z, u1*col1w).
Now let's look at vector times matrix (v = uM where v and u are row vectors)
method 1) v1 = dot(col1, u), v2 = dot(col2, u), v3 = dot(col3, u), v4 = dot(col4, u)
method 2) v = u1*row1 + u2*row2 + u3*row3 + u4*row4.
Now the roles of columns and rows have swapped but method 2 is still the efficient method to use on a SIMD computer.
To do matrix times vector efficiently on a SIMD computer the matrix should be stored in column-major order. To do vector times matrix efficiently on a SIMD computer the matrix should be stored in row-major order.
As far as I understand, OpenGL uses column-major ordering and does matrix times vector, while DirectX uses row-major ordering and does vector times matrix.
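As an illustration of method 2, here is a minimal sketch (the name is mine) using only SSE1 intrinsics. Read with M stored column-major it computes v = M*u; read with the same 16 floats interpreted as a row-major matrix it computes v = u*M, which is exactly why neither convention is inherently faster:
#include <xmmintrin.h>

void mul_vec4_method2_SSE(const float *M, const float *u, float *v) {
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j < 4; j++) {
        __m128 uj   = _mm_set1_ps(u[j]);    // broadcast u[j]
        __m128 lane = _mm_load_ps(&M[j*4]); // column j (column-major) or row j (row-major)
        acc = _mm_add_ps(acc, _mm_mul_ps(uj, lane));
    }
    _mm_store_ps(v, acc);
}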
If you have three matrix transformations that you do in order M1 first then M2 then M3 with matrix times vector you write it as
v = M3*M2*M1*u //u and v are column vectors - OpenGL form
With vector times matrix you write
v = u*M1*M2*M3 //u and v are row vectors - DirectX form
Neither form is better than the other in terms of efficiency. It's just a question of notation (and causing confusion which is useful when you have competition).
It's important to note that for matrix*matrix row-major versus column-major storage is irrelevant.
If you want to know why the vertical SIMD instructions are faster than the horizontal ones, that's a separate question which should be asked, but in short the horizontal ones really act in serial rather than in parallel and are broken up into several micro-ops (which is why, ironically, dppd is faster than dpps).

How can I most efficiently map a kernel range for a hermitian (symmetric) matrix in OpenCL?

I'm working on an OpenCL project to generate very large hermitian (symmetric) matrices, and I am trying to determine the best way to generate the work IDs.
A hermitian matrix is symmetric along the diagonal, so that M(i,j) = M*(j,i).
In the brute force way, the for loop looks like:
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
    }
}
However, taking advantage of the hermitian property, the loop can be made twice as efficient by calculating only the upper triangular part of the matrix and duplicating the result into the lower triangular part:
for(int i = 0; i < N; i++)
{
    for(int j = i; j < N; j++)
    {
        complex<float> result = doSomeCalculation();
        M(i,j) = result;
        M(j,i) = conj(result);
    }
}
In both loops, doSomeCalculation() is an expensive operation, and each entry in the matrix is completely uncorrelated with every other entry (i.e. the problem is embarrassingly parallel).
My question is this:
How can I implement the second loop with doSomeCalculation as an OpenCL kernel so that the thread IDs are most efficiently used (i.e. so that the thread calculates both M(i,j) and M(j,i) without having to call doSomeCalculation() twice)?
You need to use a linear index, for example you can index every element of your matrix in this way:
0   1    2   ...   N-1
*   N   N+1  ...  2N-2
*   *  2N-1  ...  3N-4
            ...
*   *    *   ...  N(N+1)/2 - 1
That is, the index k is given by:
k = i*N - i*(i+1)/2 + j
Where N is the size of the matrix and (i,j) are respectively the 0-based indices of the row and the column.
This relationship can be inverted; see the answer to this question, which I report here for completeness:
i = floor( ( 2*N+1 - sqrt( (2*N+1)*(2*N+1) - 8*k ) ) / 2 );
j = k - N*i + i*(i+1)/2;
So you need to enqueue a 1D kernel with N(N+1)/2 work items, and you can decide by yourself the size of the workgroup (usually 64 items per work group is a good choice).
Then in the OpenCL code you can retrieve the index k by using:
int k = get_group_id(0)*64 + get_local_id(0);
and then use the two relationships above to get the indices (i,j) of the matrix element you need to compute.
Moreover, notice that you can also save space by representing your hermitian matrix as a linear vector with N(N+1)/2 elements.
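A small host-side sanity check of the forward mapping and its inversion (plain C++, my addition, using 0-based indices with j >= i and a hypothetical size N):
#include <cassert>
#include <cmath>

int main() {
    const int N = 7; // hypothetical matrix size
    for (int i = 0; i < N; ++i) {
        for (int j = i; j < N; ++j) {
            int k = i*N - i*(i+1)/2 + j;                       // forward mapping
            int ii = (int)std::floor((2*N + 1 - std::sqrt((2.0*N + 1)*(2.0*N + 1) - 8.0*k)) / 2.0);
            int jj = k - N*ii + ii*(ii + 1)/2;                 // inverse mapping
            assert(ii == i && jj == j);                        // inversion recovers (i,j)
        }
    }
    return 0;
}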
If your matrices are really big, then you can dice up your NxN matrix into (N/k)x(N/k) tiles, each of size kxk. Since you only need half of the data, you create a 1D NDRange of size roughly local_group_size * (N/k)*(N/k)/2.
Every tile of the matrix is processed by one work-group (the work-group size is your choice). The idea is that you create an array on the host side which contains the position of every work-group in the matrix. The kernel stub should look as follows:
void __kernel myKernel(
    __global int* coords,
    ....)
{
    int2 WorkGroupPositionInMatrix = vload2(get_group_id(0), coords);
    ...
    DoCalculation();
    ...
    WriteResultTwice();
    ...
    return;
}
What you need to handle by hand are those work-groups which will be placed on the matrix diagonal. If the matrix size is big, then the overhead for the work-groups placed on the diagonal is negligible.
A right triangle can be cut in half vertically and the smaller portion rotated to fit with the larger portion to form a rectangle of equal area. Therefore it is easy to make your triangular global work area into one that is rectangular, which fits OpenCL.
See my answer here: OpenCL efficient way to group a lower triangular matrix

CUDA Thread IDs

I'm new to CUDA programming and I have the following problem.
If I use the following code to perform matrix multiplication, given that CUDA uses Cartesian indexing for threads and C/C++ uses row-major indexing for matrices, wouldn't that influence the accuracy of the calculation?
__global__ void gpuMM(float *A, float *B, float *C, int N)
{
    // Matrix multiplication for NxN matrices C = A*B
    // Each thread computes a single element of C
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}
CUDA doesn't imply any memory storage structure. You can say CUDA C is row-major for matrix storage, but that is due to C, not CUDA. (CUDA Fortran would be column-major.) Thread indexing dimensions are arbitrary. They do not imply a data storage order in memory.
Implications about data storage order in memory of course arise as you write your code. From a correctness standpoint, it does not matter if we assign row indices based on x thread dimensions or on y thread dimensions. You can write correct code for this matrix multiply example using either approach (either row based on x, or else row based on y).
However, from a coalescing standpoint, we generally want adjacent executing threads to read or write adjacent cells in memory. Adjacent threads (for execution) typically are grouped in x first. Therefore this is preferable (for your kernel code):
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
because it will allow the read of B[] and the write of C[] to coalesce.
This is easy to prove to yourself. Try it both ways, and measure the execution time of the kernel. The results are correct (match the results produced using a host-based matrix multiply) either way, but one formulation runs significantly faster than the other.
This is especially easy to try, since your kernel code implies square matrices.
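For the comparison, a minimal host-side reference (plain C++, my addition) matching the indexing used in the kernel above: row-major C = A*B for NxN matrices.
void hostMM(const float *A, const float *B, float *C, int N)
{
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.f;
            for (int n = 0; n < N; ++n)
                sum += A[row*N + n] * B[n*N + col];
            C[row*N + col] = sum;
        }
    }
}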

Calculate squared Euclidean distance matrix on GPU

Let p be a matrix of the first set of locations, where each row gives the coordinates of a particular point. Similarly, let q be a matrix of the second set of locations, where each row gives the coordinates of a particular point.
Then formula for pairwise squared Euclidean distance is:
k(i,j) = (p(i,:) - q(j,:))*(p(i,:) - q(j,:))',
where p(i,:) denotes i-th row of matrix p, and p' denotes the transpose of p.
I would like to compute matrix k on CUDA-enabled GPU (NVidia Tesla) in C++. I have OpenCV v.2.4.1 with GPU support but I'm open to other alternatives, like Thrust library. However, I'm not too familiar with GPU programming. Can you suggest an efficient way to accomplish this task? What C++ libraries should I use?
The problem looks simple enough to make a library overkill.
Without knowing the range of i and j, I'd suggest you partition k into blocks of a multiple of 32 threads each and in each block, compute
float sum, temp, myp[d];               // d is assumed to be a compile-time constant
int i = blockIdx.x*blockDim.x + threadIdx.x;
for ( int kk = 0 ; kk < d ; kk++ )     // cache point p(i,:) per thread
    myp[kk] = p(i,kk);

for ( int j = blockIdx.y*blockDim.y ; j < (blockIdx.y+1)*blockDim.y ; j++ ) {
    sum = 0.0f;
    #pragma unroll
    for ( int kk = 0 ; kk < d ; kk++ ) {
        temp = myp[kk] - q(j,kk);
        sum += temp*temp;
    }
    k(i,j) = sum;
}
where I am assuming that your data has d dimensions and writing p(i,k), q(j,k) and k(i,j) to mean an access to a two-dimensional array. I also took the liberty of assuming that your data is of type float.
Note that depending on how k is stored, e.g. row-major or column-major, you may want to loop over i per thread instead to get coalesced writes to k.
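If you want to validate the GPU results, a minimal host-side reference (plain C++, my addition) with p (n1 x d), q (n2 x d) and k (n1 x n2) all stored row-major could look like this:
void squaredDistancesHost(const float *p, const float *q, float *k,
                          int n1, int n2, int d)
{
    for (int i = 0; i < n1; ++i) {
        for (int j = 0; j < n2; ++j) {
            float sum = 0.0f;
            for (int kk = 0; kk < d; ++kk) {
                float diff = p[i*d + kk] - q[j*d + kk];
                sum += diff * diff;
            }
            k[i*n2 + j] = sum; // k(i,j)
        }
    }
}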