vector * matrix product efficiency issue - c++

Just as Z boson recommended, I am using a column-major matrix format in order to avoid having to use the dot product. I don't see a feasible way to avoid it when multiplying a vector with a matrix, though. The matrix multiplication trick requires efficient extraction of rows (or columns, if we transpose the product). To multiply a vector by a matrix, we therefore transpose:
(b * A)^T = A^T * b^T
A is a matrix, b a row vector, which, after being transposed, becomes a column vector. Its rows are just single scalars and the vector * matrix product implementation becomes an inefficient implementation of dot products of columns of (non-transposed) matrix A with b. Is there a way to avoid performing these dot products? The only way I see that could do it, would involve row extraction, which is inefficient with the column-major matrix format.

This can be understood from the original post on this (my first on SO), efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod. The rest of the discussion applies to 4x4 matrices.
Here are two methods to do matrix times vector (v = Mu where v and u are column vectors)
method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4.
The first method is more familiar from math class, while the second is more efficient on a SIMD computer. The second method uses vectorized math (like numpy), with the scalar u1 broadcast to every lane, e.g.
u1*col1 = (u1*col1x, u1*col1y, u1*col1z, u1*col1w).
Now let's look at vector times matrix (v = uM where v and u are row vectors)
method 1) v1 = dot(col1, u), v2 = dot(col2, u), v3 = dot(col3, u), v4 = dot(col4, u)
method 2) v = u1*row1 + u2*row2 + u3*row3 + u4*row4.
Now the roles of columns and rows have swapped but method 2 is still the efficient method to use on a SIMD computer.
To do matrix times vector efficiently on a SIMD computer the matrix should be stored in column-major order. To do vector times matrix efficiently on a SIMD computer the matrix should be stored in row-major order.
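A minimal sketch of the two methods for matrix times vector with a column-major 4x4 matrix, in plain C++ rather than intrinsics. The names Vec4, Mat4, elem, mul_dot and mul_columns are made up for this illustration. In mul_columns each inner loop touches four contiguous floats, which is the shape a broadcast followed by a vertical SIMD multiply-add needs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;
// Column-major storage: element (row r, column c) lives at index c*4 + r,
// so each column is four contiguous floats.
using Mat4 = std::array<float, 16>;

// Read element (r, c) from the column-major buffer.
inline float elem(const Mat4& m, std::size_t r, std::size_t c) {
    assert(r < 4 && c < 4);
    return m[c * 4 + r];
}

// Method 1: v[r] = dot(row r, u). Each row is read with a stride of 4.
Vec4 mul_dot(const Mat4& m, const Vec4& u) {
    Vec4 v{};
    for (std::size_t r = 0; r < 4; ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < 4; ++c)
            sum += elem(m, r, c) * u[c];
        v[r] = sum;
    }
    return v;
}

// Method 2: v = u1*col1 + u2*col2 + u3*col3 + u4*col4.
// The inner loop is an element-wise multiply-add over one contiguous column.
Vec4 mul_columns(const Mat4& m, const Vec4& u) {
    Vec4 v{};
    for (std::size_t c = 0; c < 4; ++c) {
        const float* col = &m[c * 4];
        for (std::size_t r = 0; r < 4; ++r)
            v[r] += u[c] * col[r];
    }
    return v;
}
```

Both functions return the same vector; only the memory access pattern differs. For vector times matrix with a row-major matrix the same loops apply with rows in place of columns.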
As far as I understand OpenGL uses column major ordering and does matrix times vector and DirectX uses row-major ordering and does vector times matrix.
If you have three matrix transformations that you do in order M1 first then M2 then M3 with matrix times vector you write it as
v = M3*M2*M1*u //u and v are column vectors - OpenGL form
With vector times matrix you write
v = u*M1*M2*M3 //u and v are row vectors - DirectX form
Neither form is better than the other in terms of efficiency. It's just a question of notation (and causing confusion which is useful when you have competition).
It's important to note that for matrix*matrix row-major versus column-major storage is irrelevant.
If you want to know why the vertical SIMD instructions are faster than the horizontal ones, that's a separate question which should be asked. In short, the horizontal ones really act in serial rather than in parallel and are broken up into several micro-ops (which is why, ironically, dppd is faster than dpps).

Related

2D FFT what to do after converting both matrix into FFT-ed form?

Assume that I have two matrices, image and filter, of size MxM and NxN.
My regular convolution looks like this and produces an output matrix of size (M-N+1)x(M-N+1). Basically, it places the top-left corner of the filter on a pixel, convolves, then assigns the sum to that pixel:
for (int i=0; i<=M-N; i++)      // i<=M-N gives the M-N+1 output rows stated above
    for (int j=0; j<=M-N; j++)  // likewise M-N+1 output columns
    {
        float sum = 0;
        for (int u=0; u<N; u++)
            for (int v=0; v<N; v++)
                sum += image[i+u][j+v] * filter[u][v];
        output[i][j] = sum;
    }
Next, to perform FFT:
Apply zero-padding to both image and filter on the right and bottom (that is, add more zero columns to the right and zero rows to the bottom). Now both have size (M+N)x(M+N); the original image is at
image[0->M-1][0->M-1].
(Do the same for both matrices) Calculate the FFT of each row into a new matrix, then calculate the FFT of each column of that new matrix.
Now, I have 2 matrices imageFreq and filterFreq, both size (M+N)x(M+N), which is the FFT-ed form of the image and the filter.
But how can I get the convolution values that I need (as described in the sample code) from them?
Convolution between A,B using FFT is done by per-element multiplication in the frequency domain, so in 1D it looks something like this:
convert A,B by FFT
Assuming the sizes are N,M for A[N],B[M], first zero pad both to a common size Q, which is a power of 2 and at least M+N, and then apply FFT:
Q = exp2(ceil(log2(M+N)));
zeropad(A,Q);
zeropad(B,Q);
a = FFT(A);
b = FFT(B);
convolute
in frequency domain use just element wise multiplication:
for (i=0;i<Q;i++) a[i]*=b[i];
reconstruct result
Simply apply IFFT (the inverse of FFT):
AB = IFFT(a); // crop to first N (real) elements
and use only the first N elements (unless the algorithm you use needs more; that depends on what you are doing).
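The three 1D steps above can be sketched as follows. To keep the example self-contained, a naive O(Q^2) DFT stands in for FFT() and IFFT(); the names cd, dft and convolve_dft are invented for this sketch. It returns the full N+M-1 result so the caller can crop as needed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive DFT used as a stand-in for FFT/IFFT.
// sign = -1 is the forward transform, sign = +1 the inverse (scaled by 1/Q).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    const std::size_t q = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(q);
    for (std::size_t k = 0; k < q; ++k) {
        cd sum = 0.0;
        for (std::size_t n = 0; n < q; ++n) {
            const double angle = sign * 2.0 * pi * double(k * n % q) / double(q);
            sum += x[n] * cd(std::cos(angle), std::sin(angle));
        }
        out[k] = (sign > 0) ? sum / double(q) : sum;
    }
    return out;
}

// Linear convolution of a and b through the frequency domain.
std::vector<double> convolve_dft(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    std::size_t q = 1;
    while (q < a.size() + b.size()) q *= 2;      // power of 2, at least N+M
    std::vector<cd> fa(q), fb(q);                // zero-padded copies
    for (std::size_t i = 0; i < a.size(); ++i) fa[i] = a[i];
    for (std::size_t i = 0; i < b.size(); ++i) fb[i] = b[i];
    fa = dft(fa, -1);
    fb = dft(fb, -1);
    for (std::size_t i = 0; i < q; ++i) fa[i] *= fb[i];   // element-wise product
    fa = dft(fa, +1);
    std::vector<double> out(a.size() + b.size() - 1);
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = fa[i].real();
    return out;
}
```

For example, convolve_dft({1, 2, 3}, {0, 1, 0.5}) gives {0, 1, 2.5, 4, 1.5} up to rounding error. A real FFT only changes how dft is computed, not the steps around it.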
For 2D you can either convolve directly in 2D (using 2 nested for loops) or convolve each axis separately. Beware that separating the axes also requires normalizing the result by some constant (which depends on the dimensionality, resolution and kernel used).
So when put together (also assuming the same resolution NxN and MxM) first zero pad to (QxQ) and then:
Q = exp2(ceil(log2(M+N)));
zeropad(A,Q,Q);
zeropad(B,Q,Q);
a = FFT(A);
b = FFT(B);
for (i=0;i<Q;i++)
for (j=0;j<Q;j++) a[i][j]*=b[i][j];
AB = IFFT(a); // crop to first NxN (real) elements
And again crop AB to NxN size (unless ...). For more info see:
How to compute Discrete Fourier Transform?
and all the sublinks there. Also, at the end of this one there is a 1D convolution example using NTT (a special form of FFT) to compute bignum multiplication:
Modular arithmetics and NTT (finite field DFT) optimizations
Also if you want real result then just use only the real parts of the result (ignore imaginary part).
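One detail worth checking against the sample loop in the question: that loop does not reverse the filter, so it matches a convolution with the filter reversed. Below is a 1D sketch of which entries of the full convolution then line up with the loop's output. It uses a direct convolution instead of an FFT to stay short, and the function names are made up for illustration. In 2D the same idea would apply per axis, which is an assumption to verify on your own data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Full linear convolution (length M+N-1), the quantity the FFT route yields.
std::vector<double> full_convolution(const std::vector<double>& x,
                                     const std::vector<double>& h) {
    assert(!x.empty() && !h.empty());
    std::vector<double> c(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t u = 0; u < h.size(); ++u)
            c[i + u] += x[i] * h[u];
    return c;
}

// 1D version of the question's loop: out[i] = sum_u image[i+u] * filter[u].
std::vector<double> sliding_sum(const std::vector<double>& image,
                                const std::vector<double>& filter) {
    assert(!filter.empty() && filter.size() <= image.size());
    std::vector<double> out(image.size() - filter.size() + 1, 0.0);
    for (std::size_t i = 0; i < out.size(); ++i)
        for (std::size_t u = 0; u < filter.size(); ++u)
            out[i] += image[i + u] * filter[u];
    return out;
}

// Same values from a full convolution: reverse the filter, convolve,
// then keep entries N-1 .. M-1 (M = image size, N = filter size).
std::vector<double> sliding_sum_via_convolution(const std::vector<double>& image,
                                                const std::vector<double>& filter) {
    assert(!filter.empty() && filter.size() <= image.size());
    const std::vector<double> flipped(filter.rbegin(), filter.rend());
    const std::vector<double> c = full_convolution(image, flipped);
    const auto first = static_cast<std::ptrdiff_t>(filter.size() - 1);
    const auto last = static_cast<std::ptrdiff_t>(image.size());
    return std::vector<double>(c.begin() + first, c.begin() + last);
}
```

With image {1, 2, 3, 4, 5} and filter {2, 1, -1}, both routes return {1, 3, 5}.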

Is there a something like a sparse cube in armadillo or some way of using sparse matrices as slices in a cube?

I am using Armadillo's sparse matrices. But now I would like to use something like a "sparse cube", which does not exist in Armadillo. Writing sparse matrices into a cube with cube.slice(some_sparse_matrix) converts everything back to a dense cube.
I am using sparse matrices in order to multiply them with a vector. For larger vectors/matrices the sparse variant is much faster. Now I have to sum up the products of several sparse matrices with several vectors.
Would a std::vector be a way?
In my experience it is faster to use Armadillo's functions (for example a subvector, arma::span() or arma::sum()) than to write loops myself. So I was wondering what would be the fastest way of doing this.
It's possible to approximate a sparse cube using the field class, like so.
arma::uword number_of_matrices = 10;
arma::uword number_of_rows = 5000;
arma::uword number_of_cols = 5000;
arma::field<arma::sp_mat> F(number_of_matrices);
F.for_each( [&](arma::sp_mat& X) { X.set_size(number_of_rows, number_of_cols); } );
F(0)(1,2) = 456.7; // write to element (1,2) in matrix 0
F(1)(2,3) = 567.8; // write to element (2,3) in matrix 1
F.print("F:"); // show all matrices
Your compiler must support at least C++11 for this to work.

sparse sparse product A^T*A optim in Eigen lib

When multiplying a matrix matA by its own transpose, like
matA.transpose()*matA,
you don't have to compute the whole product, because the result matrix is symmetric; in my specific case it is always symmetric and square.
So it's enough to compute, e.g., only the lower triangular part and copy the rest, because the product of the 2nd row with the 3rd column is the same as that of the 3rd row with the 2nd column, and so on.
So my question is: is there a way to tell Eigen to compute only the lower part, and optionally to store only the lower triangular part of the product?
DATA = SparseMatrix<double>((SparseMatrix<double>(matA.transpose()) * matA).pruned()).toDense();
According to the documentation, you can evaluate the lower triangle of a matrix with:
m1.triangularView<Eigen::Lower>() = m2 + m3;
or in your case:
m1.triangularView<Eigen::Lower>() = matA.transpose()*matA;
(where it says "Writing to a specific triangular part: (only the referenced triangular part is evaluated)"). Otherwise, in the line you've written, Eigen will calculate the entire sparse matrix matA.transpose()*matA.
Regarding saving the resulting m1 matrix, it is the same as saving whatever type of matrix it is (Eigen::MatrixXt or Eigen::SparseMatrix<t>). If m1 is sparse, then it will be only half the size of a straightforward matA.transpose()*matA. If m1 is dense, then it will be the full square matrix.
https://eigen.tuxfamily.org/dox/classEigen_1_1SparseSelfAdjointView.html
The symmetric rank update is defined as:
B = B + alpha * A * A^T
where alpha is a scalar. In your case, you are doing A^T * A, so you should pass the transposed matrix instead. The resulting matrix will only store the upper or lower portion of the matrix, whichever you prefer. For example:
SparseMatrix<double> B;
B.selfadjointView<Lower>().rankUpdate(A.transpose());

matrix order in skeletal animation using assimp

I had followed this tutorial and got the output animation for a rigged model as expected. The tutorial uses assimp, glsl and c++ to load a rigged model from a file. However, there were things that I couldn't figure out.
First thing is assimp's transformation matrix are row major matrices and the tutorial uses a Matrix4f class which uses those transformation matrices just as they are i.e. row major order. The constructor of that Matrix4f class is as given:
Matrix4f(const aiMatrix4x4& AssimpMatrix)
{
m[0][0] = AssimpMatrix.a1; m[0][1] = AssimpMatrix.a2; m[0][2] = AssimpMatrix.a3; m[0][3] = AssimpMatrix.a4;
m[1][0] = AssimpMatrix.b1; m[1][1] = AssimpMatrix.b2; m[1][2] = AssimpMatrix.b3; m[1][3] = AssimpMatrix.b4;
m[2][0] = AssimpMatrix.c1; m[2][1] = AssimpMatrix.c2; m[2][2] = AssimpMatrix.c3; m[2][3] = AssimpMatrix.c4;
m[3][0] = AssimpMatrix.d1; m[3][1] = AssimpMatrix.d2; m[3][2] = AssimpMatrix.d3; m[3][3] = AssimpMatrix.d4;
}
However, in the tutorial for calculating the final node transformation, the calculations are done expecting the matrices to be in column major order, which is shown below:
Matrix4f NodeTransformation;
NodeTransformation = TranslationM * RotationM * ScalingM; //note here
Matrix4f GlobalTransformation = ParentTransform * NodeTransformation;
if(m_BoneMapping.find(NodeName) != m_BoneMapping.end())
{
unsigned int BoneIndex = m_BoneMapping[NodeName];
m_BoneInfo[BoneIndex].FinalTransformation = m_GlobalInverseTransform * GlobalTransformation * m_BoneInfo[BoneIndex].BoneOffset;
m_BoneInfo[BoneIndex].NodeTransformation = GlobalTransformation;
}
Finally, since the calculated matrices are in row-major order, this is specified when passing the matrices to the shader by setting the GL_TRUE flag in the following function. OpenGL then knows they are in row-major order, as OpenGL itself uses column-major order.
void SetBoneTransform(unsigned int Index, const Matrix4f& Transform)
{
glUniformMatrix4fv(m_boneLocation[Index], 1, GL_TRUE, (const GLfloat*)Transform);
}
So, how does the calculation, written as if for column-major order,
transformation = translation * rotation * scale * vertices
yield a correct output? I expected that, for the calculation to hold true, each matrix should first be transposed to change it to column-major order, followed by the above calculation, and the result finally transposed again to get back a row-major matrix, which is also discussed in this link. However, doing so produced a horrible output. Is there something that I am missing here?
You are confusing two different things:
the layout the data has in memory (row vs. column major order)
the mathematical interpretation of the operations (things like multiplication order)
It is often claimed that when working with row-major vs. column-major order, things have to be transposed and the matrix multiplication order has to be reversed. But this is not true.
What is true is that mathematically, transpose(A*B) = transpose(B) * transpose(A). However, that is irrelevant here, because the matrix storage order is independent of, and orthogonal to, the mathematical interpretation of the matrices.
What I mean by this is: In math, it is exactly defined what a row and a column of a matrix is, and each element can be uniquely addressed by these two "coordinates". All the matrix operations are defined based on this convention. For example, in C=A*B, the element in the first row and the first column of C, is calculated as the dot product of the first row of A (transposed to a column vector) and the first column of B.
Now, the matrix storage order just defines how the matrix data is laid out in memory. As a generalization, we could define a function f(row,col) mapping each (row, col) pair to some memory address. We could now write our matrix functions using f, and we could change f to adapt to row-major, column-major or something else entirely (like a Z order curve, if we want some fun).
It doesn't matter what f we actually use (as long as the mapping is bijective), the operation C=A*B will always have the same result. What changes is just the data in memory, but we also have to use f to interpret that data. We could just write a simple print function, also using f, to print the matrix as the 2D array in columns x rows as a typical human would expect.
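As a rough sketch of that idea, here is a tiny matrix type whose only contact with memory is a layout function f. The names Layout, row_major, col_major, LayoutMatrix and multiply are invented for this illustration and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// f(row, col) for a rows x cols matrix: maps a mathematical position to an offset.
using Layout = std::size_t (*)(std::size_t row, std::size_t col,
                               std::size_t rows, std::size_t cols);

std::size_t row_major(std::size_t r, std::size_t c, std::size_t, std::size_t cols) {
    return r * cols + c;
}
std::size_t col_major(std::size_t r, std::size_t c, std::size_t rows, std::size_t) {
    return c * rows + r;
}

struct LayoutMatrix {
    std::size_t rows, cols;
    Layout f;
    std::vector<double> data;

    LayoutMatrix(std::size_t r, std::size_t c, Layout layout)
        : rows(r), cols(c), f(layout), data(r * c, 0.0) {}

    // All element access goes through f, so the math never sees the layout.
    double& at(std::size_t r, std::size_t c) {
        assert(r < rows && c < cols);
        return data[f(r, c, rows, cols)];
    }
    double at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[f(r, c, rows, cols)];
    }
};

// C = A * B written purely in terms of (row, col); works for any bijective f.
LayoutMatrix multiply(const LayoutMatrix& a, const LayoutMatrix& b, Layout out_layout) {
    assert(a.cols == b.rows);
    LayoutMatrix c(a.rows, b.cols, out_layout);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < b.cols; ++j)
            for (std::size_t k = 0; k < a.cols; ++k)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
    return c;
}
```

Filling the same mathematical A and B once with row_major and once with col_major gives different data buffers, but multiply returns a product whose at(i, j) values agree in both cases.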
The confusion comes from this fact when you use a matrix in a different layout than the implementation of the matrix functions is designed on.
If you have a matrix library which internally assumes column-major layout, and you pass in data in row-major format, it is as if you had transposed that matrix beforehand - and only at this point do things get screwed up.
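A small illustration of that mismatch, with hypothetical helper names: the same flat buffer of 16 floats is read under both conventions, and the column-major reading of element (r, c) is the row-major reading of element (c, r).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Buffer16 = std::array<float, 16>;

// Interpret the buffer as a row-major 4x4 matrix.
float read_row_major(const Buffer16& d, std::size_t r, std::size_t c) {
    assert(r < 4 && c < 4);
    return d[r * 4 + c];
}

// Interpret the very same buffer as a column-major 4x4 matrix.
float read_col_major(const Buffer16& d, std::size_t r, std::size_t c) {
    assert(r < 4 && c < 4);
    return d[c * 4 + r];
}

// True if reading the buffer as column-major yields the transpose of
// what the row-major reading yields (this holds for any buffer).
bool col_major_view_is_transpose(const Buffer16& d) {
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            if (read_col_major(d, r, c) != read_row_major(d, c, r))
                return false;
    return true;
}
```

The bytes never change; only the interpretation does, which is why one transpose at the API boundary is enough to reconcile the two conventions.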
To confuse things even more, there is another issue related to this: the matrix * vector vs. vector * matrix issue. Some people like to write x' = x * M (with x' and x being row vectors), while others like to write y' = N * y (with column vectors). It is clear that mathematically, M*x = transpose(transpose(x) * transpose(M)), so people often also confuse this with row- vs. column-major order effects - but it is also totally independent of that. It is just a matter of convention whether you use the one or the other.
So, to finally answer your question:
The transformation matrices created there are written for the convention of multiplying matrix * vector, so that Mparent * Mchild is the correct matrix multiplication order.
Up to this point, the actual data layout in memory does not matter at all. It only begins to matter because now, we are interfacing a different API, with its own conventions. GL's default order is column-major. The matrix class in use is written for row-major memory layout. So you just transpose at this point, so that GL's interpretation of that matrix matches your other library's.
The alternative would be to not convert them and to account for that by incorporating the implicit operation created by this into the system - either by changing the multiplication order in the shader, or by adjusting the operations which created the matrix in the first place. However, I would not recommend going down that path, because the resulting code will be totally unintuitive: in the end, this would mean working with column-major matrices in a matrix class using a row-major interpretation.
Yes, the memory layout is similar for glm and assimp: data.html
But, according to the doc page: classai_matrix4x4t
The assimp matrix is always row-major whereas the glm matrix is always col-major, meaning you need to transpose on conversion:
inline static Mat4 Assimp2Glm(const aiMatrix4x4& from)
{
return Mat4(
(double)from.a1, (double)from.b1, (double)from.c1, (double)from.d1,
(double)from.a2, (double)from.b2, (double)from.c2, (double)from.d2,
(double)from.a3, (double)from.b3, (double)from.c3, (double)from.d3,
(double)from.a4, (double)from.b4, (double)from.c4, (double)from.d4
);
}
inline static aiMatrix4x4 Glm2Assimp(const Mat4& from)
{
return aiMatrix4x4(from[0][0], from[1][0], from[2][0], from[3][0],
from[0][1], from[1][1], from[2][1], from[3][1],
from[0][2], from[1][2], from[2][2], from[3][2],
from[0][3], from[1][3], from[2][3], from[3][3]
);
}
PS: The abcd stands for row and 1234 stands for col in assimp.

Image reconstruction using SVD Decomposition

I have performed a block SVD decomposition of an image and stored the results.
Now I need to reconstruct the image from these results. I found a few examples, all written in Matlab, which is a mystery to me.
I only need a formula from which I can reconstruct my picture, or an example written in C.
Matrix A is equal to U*S*V'. How will the formula look, e.g. for calculating the first five singular values (the product of which rows and columns)? Please provide the formula with indexes in C-like style. U and V' are matrices and S is a vector (not a matrix).
Not sure if I get your question right, but if you just need to know singular values, they are the diagonal values of the middle matrix S. S in general is a diagonal matrix, which is stored here as a vector. I mean, only the diagonal is stored, you should imagine it as a matrix if you're thinking in matrix calculations.
Those diagonal values are your singular values; if you need the five biggest singular values, just take the 5 biggest values of the vector S.
Quoting from Wikipedia:
The diagonal entries Σi,i of Σ are known as the singular values of M.
The m columns of U and the n columns of V are called the left-singular
vectors and right-singular vectors of M, respectively.
In the above quote, sigma is your S, and M is the original matrix.
You have asked for C code, yet my hope is that pseudocode will suffice (it's late, I'm tired). The target matrix A has m rows, n columns and rank rho. The variable p = min(m,n).
One strategy is to first form the intermediate matrix product B = US. This is trivial due to the diagonal-like nature of the matrix of singular values. Assume you have rho ( = 5 ) singular values. You must enforce rho <= p.
Replace column vector u_1 with s_1 u_1.
Replace column vector u_2 with s_2 u_2.
...
Replace column vector u_{rho} with s_{rho} u_{rho}.
Replace column vector u_{rho+1} with a zero vector of length m.
Replace column vector u_{rho+2} with a zero vector of length m.
...
Replace column vector u_p with a zero vector of length m.
Next form the new image matrix A = B V^T. The matrix element in row r and column c is the dot product of the rth row vector (length rho) of B with the cth column vector (length rho) of V^T.
Another strategy is to jump to the form where the matrix elements of A in row r and column c are
a_{r,c} = sum ( s_k u_{r,k} v_{c,k}, { k, 1, rho } )
The row counter r runs from 1 to m; the column counter c runs from 1 to n.
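Since C-like indexing was requested, here is a hedged sketch of the second strategy with 0-based indices. The names Mat and reconstruct are made up, and the code assumes V itself is stored (n rows, one column per singular value) with the singular values in s ordered largest first. If you stored V' instead, index it as Vt[k][c] in place of V[c][k].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// A[r][c] = sum over k < rho of s[k] * U[r][k] * V[c][k]
// U: m rows, at least rho columns. V: n rows, at least rho columns.
// s: singular values, assumed sorted from largest to smallest.
Mat reconstruct(const Mat& U, const std::vector<double>& s, const Mat& V,
                std::size_t rho) {
    assert(rho <= s.size());               // cannot use more values than stored
    const std::size_t m = U.size();
    const std::size_t n = V.size();
    Mat A(m, std::vector<double>(n, 0.0));
    for (std::size_t r = 0; r < m; ++r) {
        assert(U[r].size() >= rho);
        for (std::size_t c = 0; c < n; ++c) {
            assert(V[c].size() >= rho);
            double sum = 0.0;
            for (std::size_t k = 0; k < rho; ++k)
                sum += s[k] * U[r][k] * V[c][k];
            A[r][c] = sum;
        }
    }
    return A;
}
```

Passing rho equal to the number of stored singular values rebuilds the block from everything that was stored, while rho = 5 keeps only the five leading terms of the sum.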