I'm writing a library where I want some basic NxN matrix functionality with no dependencies; it's a bit of a learning project. I'm comparing my performance against Eigen. I've been able to roughly match it, and even beat it on a couple of fronts with SSE2, and with AVX2 I beat it on quite a few fronts (it only uses SSE2, so that's not super surprising).
My issue is that I'm using Gaussian elimination to reduce the matrix to upper triangular form and then multiplying the diagonal to get the determinant. I beat Eigen for N < 300, but after that Eigen blows me away, and it only gets worse as the matrices get bigger. Given that all the memory is accessed sequentially and the compiler's disassembly doesn't look terrible, I don't think it is an optimization issue.
There is more optimization that can be done, but the timings look much more like an algorithmic time-complexity issue, or there is a major SSE advantage I'm not seeing. Simply unrolling the loops a bit hasn't done much for me.
Is there a better algorithm for calculating determinants?
Scalar code
/*
    Warning: Creates Temporaries!
*/
template<typename T, int ROW, int COLUMN> MML_INLINE T matrix<T, ROW, COLUMN>::determinant(void) const
{
    /*
        This method assumes square matrix
    */
    assert(row() == col());
    /*
        We need to create a temporary
    */
    matrix<T, ROW, COLUMN> temp(*this);
    /* We convert the temporary to upper triangular form */
    uint N = row();
    T det = T(1);
    for (uint c = 0; c < N; ++c)
    {
        det = det * temp(c, c);
        for (uint r = c + 1; r < N; ++r)
        {
            T ratio = temp(r, c) / temp(c, c);
            for (uint k = c; k < N; k++)
            {
                temp(r, k) = temp(r, k) - ratio * temp(c, k);
            }
        }
    }
    return det;
}
AVX2
template<> float matrix<float>::determinant(void) const
{
    /*
        This method assumes square matrix
    */
    assert(row() == col());
    /*
        We need to create a temporary
    */
    matrix<float> temp(*this);
    /* We convert the temporary to upper triangular form */
    float det = 1.0f;
    const uint N = row();
    const uint Nm8 = N - 8;
    const uint Nm4 = N - 4;
    uint c = 0;
    for (; c < Nm8; ++c)
    {
        det *= temp(c, c);
        float8 Diagonal = _mm256_set1_ps(temp(c, c));
        for (uint r = c + 1; r < N; ++r)
        {
            float8 ratio1 = _mm256_div_ps(_mm256_set1_ps(temp(r, c)), Diagonal);
            uint k = c + 1;
            for (; k < Nm8; k += 8)
            {
                float8 ref = _mm256_loadu_ps(temp._v + c*N + k);
                float8 r0 = _mm256_loadu_ps(temp._v + r*N + k);
                /* r0 - ratio1*ref; fnmadd keeps the sign consistent with the scalar tail below */
                _mm256_storeu_ps(temp._v + r*N + k, _mm256_fnmadd_ps(ratio1, ref, r0));
            }
            /* We go scalar for the last few elements to handle non-multiples of 8 */
            for (; k < N; ++k)
            {
                _mm_store_ss(temp._v + index(r, k), _mm_sub_ss(_mm_set_ss(temp(r, k)), _mm_mul_ss(_mm256_castps256_ps128(ratio1), _mm_set_ss(temp(c, k)))));
            }
        }
    }
    for (; c < Nm4; ++c)
    {
        det *= temp(c, c);
        float4 Diagonal = _mm_set1_ps(temp(c, c));
        for (uint r = c + 1; r < N; ++r)
        {
            float4 ratio = _mm_div_ps(_mm_set1_ps(temp[r*N + c]), Diagonal);
            uint k = c + 1;
            for (; k < Nm4; k += 4)
            {
                float4 ref = _mm_loadu_ps(temp._v + c*N + k);
                float4 r0 = _mm_loadu_ps(temp._v + r*N + k);
                _mm_storeu_ps(temp._v + r*N + k, _mm_sub_ps(r0, _mm_mul_ps(ref, ratio)));
            }
            float fratio = _mm_cvtss_f32(ratio);
            for (; k < N; ++k)
            {
                temp(r, k) = temp(r, k) - fratio * temp(c, k);
            }
        }
    }
    for (; c < N; ++c)
    {
        det *= temp(c, c);
        float Diagonal = temp(c, c);
        for (uint r = c + 1; r < N; ++r)
        {
            float ratio = temp[r*N + c] / Diagonal;
            for (uint k = c + 1; k < N; ++k)
            {
                temp(r, k) = temp(r, k) - ratio * temp(c, k);
            }
        }
    }
    return det;
}
Algorithms to reduce an n by n matrix to upper (or lower) triangular form by Gaussian elimination generally have complexity of O(n^3) (where ^ represents "to power of").
There are alternative approaches for computing the determinant, such as evaluating the set of eigenvalues (the determinant of a square matrix is equal to the product of its eigenvalues). For general matrices, computation of the complete set of eigenvalues is also, practically, O(n^3).
In theory, however, calculation of the set of eigenvalues has complexity of n^w where w is between 2 and 2.376 - which means for (much) larger matrices it will be faster than using Gaussian elimination. Have a look at an article "Fast linear algebra is stable" by James Demmel, Ioana Dumitriu, and Olga Holtz in Numerische Mathematik, Volume 108, Issue 1, pp. 59-91, November 2007. If Eigen uses an approach with complexity less than O(n^3) for larger matrices (I don't know, never having had reason to investigate such things) that would explain your observations.
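As a quick sanity check of the "determinant equals the product of eigenvalues" identity mentioned above, here is a minimal sketch using Eigen (assumed to be available, since it is already part of your benchmark); the size is kept modest and in double precision so the raw determinant does not over/underflow:

// Sketch: determinant equals the product of the eigenvalues.
#include <Eigen/Dense>
#include <complex>
#include <iostream>

int main()
{
    const int n = 100;
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(n, n);
    // A general real matrix can have complex eigenvalues, so take the
    // product in complex arithmetic; its imaginary part should be ~0.
    Eigen::VectorXcd ev = m.eigenvalues();
    std::complex<double> prod(1.0, 0.0);
    for (int i = 0; i < ev.size(); ++i) prod *= ev(i);
    std::cout << "det(m)              = " << m.determinant() << "\n";
    std::cout << "prod of eigenvalues = " << prod.real() << "\n";
    return 0;
}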
The answer most places seem to give is to use block LU factorization to create the lower-triangular and upper-triangular matrices in the same memory space. It is ~O(n^2.5) in practice, depending on the block size you use.
Here is a PowerPoint presentation from Rice University that explains the algorithm.
www.caam.rice.edu/~timwar/MA471F03/Lecture24.ppt
Division by a matrix means multiplication by its inverse.
The idea seems to be to increase the number of O(n^2) operations significantly but reduce the number of O(m^3) operations, which in effect lowers the complexity of the algorithm since m is of a fixed, small size.
It's going to take a little while to write this up efficiently, since doing so requires 'in place' algorithms I don't have written yet.
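In the meantime, here is a rough sketch (my own illustration, not the slides' code) of a blocked, in-place LU determinant. It omits pivoting for brevity, so it assumes the leading blocks are nonsingular; step 4 is the big matrix-matrix update that dominates the work and is what you would block and vectorize hardest:

// Hypothetical sketch: blocked, in-place LU without pivoting on a raw
// row-major N x N buffer; the determinant is the product of U's diagonal.
#include <vector>
#include <algorithm>

double determinant_blocked_lu(std::vector<double>& a, int N, int B = 64)
{
    auto at = [&](int r, int c) -> double& { return a[r * N + c]; };

    for (int p = 0; p < N; p += B) {
        int pe = std::min(p + B, N);
        // 1. Unblocked LU on the diagonal block a[p:pe, p:pe], storing multipliers below the diagonal.
        for (int c = p; c < pe; ++c)
            for (int r = c + 1; r < pe; ++r) {
                double l = at(r, c) / at(c, c);
                at(r, c) = l;
                for (int k = c + 1; k < pe; ++k)
                    at(r, k) -= l * at(c, k);
            }
        // 2. Apply the same row operations to the block row on the right (U12 = L11^-1 * A12).
        for (int c = p; c < pe; ++c)
            for (int r = c + 1; r < pe; ++r)
                for (int k = pe; k < N; ++k)
                    at(r, k) -= at(r, c) * at(c, k);
        // 3. Compute the block column below (L21 = A21 * U11^-1), storing the multipliers.
        for (int r = pe; r < N; ++r)
            for (int c = p; c < pe; ++c) {
                double l = at(r, c) / at(c, c);
                at(r, c) = l;
                for (int k = c + 1; k < pe; ++k)
                    at(r, k) -= l * at(c, k);
            }
        // 4. Trailing update A22 -= L21 * U12: the cache-friendly matrix-matrix product that dominates the work.
        for (int r = pe; r < N; ++r)
            for (int c = p; c < pe; ++c) {
                double l = at(r, c);
                for (int k = pe; k < N; ++k)
                    at(r, k) -= l * at(c, k);
            }
    }
    double det = 1.0;
    for (int i = 0; i < N; ++i)
        det *= at(i, i);
    return det;
}

With B = 1 this degenerates to the unblocked elimination above; the benefit comes from step 4 reusing blocks of memory that fit in cache.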
Related
I am making a simple Gaussian blur function for a 2D array that is supposed to represent an image. The function just prints out the array values at the end (no actual image processing going on here). I was pretty sure that I had implemented everything correctly, but the values I am getting for (N=3, sigma=1.5) are much lower than expected based on this calculator: http://dev.theomader.com/gaussian-kernel-calculator/
I am following this equation:
void gaussian_filter(int N, double sigma) {
    double k[N][N];

    for (int i = 0; i < N; i++) {       // Initialize kernel to 0
        for (int j = 0; j < N; j++) {
            k[i][j] = 0;
        }
    }

    double sum = 0.0;                   // There is an issue somewhere in this block of code
    int change = (N / 2);
    double r, s = change * sigma * sigma;
    for (int x = -change; x <= change; x++) {
        for (int y = -change; y <= change; y++) {
            r = sqrt(x * x + y * y);
            k[x + change][y + change] = (exp(-(r * r) / s)) / (M_PI * s);
            sum += k[x + change][y + change];
        }
    }

    for (int i = 0; i < N; ++i) {       // Normalize
        for (int j = 0; j < N; ++j) {
            k[i][j] /= sum;
        }
    }

    for (int i = 0; i < N; ++i) {       // Print out array
        for (int j = 0; j < N; ++j)
            cout << k[i][j] << "\t";
        cout << endl;
    }
}
Here is the expected output for N=3 and Sigma=1.5
Here is the current broken output for N=3 and Sigma=1.5
Why does s depend on change? I think you should do:
double r, s = 2 * sigma * sigma;
// instead of
// double r, s = change * sigma * sigma;
That website computes Gaussian kernels in an unorthodox manner:
The weights are calculated by numerical integration of the continuous gaussian distribution over each discrete kernel tap.
That is, it samples a continuous Gaussian kernel that has been convolved with a uniform (“box”) filter 1 pixel wide. The resulting Gaussian is wider than advertised; I advise against this method.
The proper way to create a Gaussian kernel is to just sample the Gaussian function at given integer locations, for example x = [-3, -2, -1, 0, 1, 2, 3].
Do note that a 3-pixel kernel is not wide enough to represent a Gaussian. It is important to sample the tail of the curve, without it, the kernel doesn’t have the good properties of the Gaussian kernel. I recommend sampling up to 3 sigma to each side, leading to 2*ceil(3*sigma)+1 pixels. 2 sigma is the bare minimum, useful only when speed is more important than good results.
Do also note that the Gaussian is separable, you can apply two 1D kernels in succession, rather than a single 2D kernel. For the 9x9 kernel you get for sigma=1.5, this translates to 9+9=18 multiplications and additions, compared to 9x9=81 for the 2D kernel. This is a significant saving!
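To make the sampling approach concrete, here is a minimal sketch (an illustrative helper, not from the question) that builds a normalized 1D kernel by sampling the Gaussian at integer offsets out to 3 sigma:

// Build a normalized 1D Gaussian kernel sampled at integer offsets.
#include <cmath>
#include <vector>

std::vector<double> gaussian_kernel_1d(double sigma)
{
    int radius = (int)std::ceil(3.0 * sigma);        // sample out to 3 sigma on each side
    std::vector<double> k(2 * radius + 1);
    double sum = 0.0;
    for (int x = -radius; x <= radius; ++x) {
        k[x + radius] = std::exp(-(x * x) / (2.0 * sigma * sigma));
        sum += k[x + radius];
    }
    for (double& v : k) v /= sum;                    // normalize so the weights sum to 1
    return k;
}

Convolving the image with this kernel once along the rows and once along the columns gives the same result as the full 2D kernel, at a fraction of the cost.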
I want to add constraints to my CPLEX model that ensure that a bunch of arrays are pairwise different. That is, at least one entry should differ between the two.
(To clarify: The IloNumVarArray h represents an n x m matrix and the constraints should ensure that no two rows are identical)
My code below has two errors (at least) that I can't seem to solve:
- First, there is 'no suitable conversion function from IloNumVar to IloNum',
- Second, it is not allowed to use the != operator to compare IloNumArrays.
IloNumVarArray h(env, n*m);
IloNumArray temp1(env, m);
IloNumArray temp2(env, m);

for (int i = 0; i < n - 1; i++) {
    temp1.clear();
    temp2.clear();
    for (int k = 0; k < n - i; k++)
        for (int j = 0; j < m; j++) {
            temp1[j] = h[j + i * m];
            temp2[j] = h[j + (i + k) * m];
        }
    model.add(temp1 != temp2);
}
So how can I change temp1 and temp2 such that it is possible to copy from h and compare the two?
(Or do it completely differently.)
I am quite new to CPLEX and I would appreciate any help/suggestions.
You could use logical constraints.
Let me give you an example in OPL CPLEX that you could adapt to C++:
int n=3;
int m=2;
range N=1..n;
range M=1..m;
float epsilon=0.0001;
dvar float temp1[N][M] in 0..10;
dvar float temp2[N][M] in 0..10;
minimize sum(i in N,j in M) (temp1[i][j]+temp2[i][j]);
subject to
{
// at least for one (i,j) the 2 arrays are different
1<=sum(i in N,j in M) (abs(temp1[i][j]-temp2[i][j])>=epsilon);
}
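For the C++ side, an untested Concert sketch of the same idea might look like the following. It assumes env, model, h, n and m are set up as in your question, relies on CPLEX logical constraints (IloOr, IloAbs), and the epsilon is illustrative; treat it as a starting point rather than working code:

// Untested sketch: for every pair of rows (i, k) of h, require that at
// least one column differs by at least eps, via a logical disjunction.
IloNum eps = 1e-4;
for (int i = 0; i < n; ++i) {
    for (int k = i + 1; k < n; ++k) {
        IloOr rowsDiffer(env);
        for (int j = 0; j < m; ++j)
            rowsDiffer.add(IloAbs(h[i * m + j] - h[k * m + j]) >= eps);
        model.add(rowsDiffer);   // "rows i and k differ somewhere"
    }
}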
This is an optimized implementation of matrix multiplication, and the routine performs the following operation:
C := C + A * B (where A, B, and C are n-by-n matrices stored in column-major format)
On exit, A and B maintain their input values.
void matmul_optimized(int n, int *A, int *B, int *C)
{
    // For the bitwise variant, the matrices are stored as plain ints and
    // the product/sum are replaced by & and ^ below.
    int i, j, k;
    int cij;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            cij = C[i + j * n]; // start from the existing C entry and accumulate into a local variable
            for (k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n]; // "multiply" each pair of terms with &, "add" with ^
            }
            C[i + j * n] = cij; // write the final result back into C
        }
    }
}
How can I speed up this matrix multiplication further, based on the function/method above?
The function is tested with matrices up to 2048 by 2048.
The function matmul_optimized is benchmarked against matmul_reference in the harness below.
#include <stdio.h>
#include <stdlib.h>

#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"

int main()
{
    int i, j;
    int n = 1024; // Number of rows or columns in the square matrices
    int *A, *B;   // Input matrices
    int *C1, *C2; // Output matrices from the reference and optimized implementations

    // Performance and correctness measurement declarations
    long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
    long int COUNTER, REPEAT = 5;
    int difference;
    float speedup;

    // Allocate memory for the matrices
    A = malloc(n * n * sizeof(int));
    B = malloc(n * n * sizeof(int));
    C1 = malloc(n * n * sizeof(int));
    C2 = malloc(n * n * sizeof(int));

    // Fill bits in A, B, C1
    fill(A, n * n);
    fill(B, n * n);
    fill(C1, n * n);

    // Initialize C2 = C1
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C2[i * n + j] = C1[i * n + j];

    // Measure performance of the reference implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_reference(n, A, B, C1);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_ref = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);

    // Measure performance of the optimized implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_optimized(n, A, B, C2);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_opt = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);

    speedup = (float)CLOCK_ref / (float)CLOCK_opt;

    // Check correctness by comparing C1 and C2
    difference = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            difference = difference + C1[i * n + j] - C2[i * n + j];

    if (difference == 0)
        printf("Speedup factor = %.2f\n", speedup);
    if (difference != 0)
        printf("Reference and optimized implementations do not match\n");

    //print(C2, n);
    free(A);
    free(B);
    free(C1);
    free(C2);

    return 0;
}
You can try algorithms like Strassen or Coppersmith-Winograd, and here is also a good example.
Or maybe try parallel computing, for example std::async (std::future) or std::thread; a rough sketch follows below.
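As a rough sketch of the std::thread suggestion (assuming the matmul_optimized signature from the question), you can give each thread a disjoint, interleaved set of output columns so no locking is needed:

// Sketch: split the output columns of C across hardware threads.
#include <algorithm>
#include <thread>
#include <vector>

void matmul_parallel(int n, int *A, int *B, int *C)
{
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([=]() {
            for (int j = (int)t; j < n; j += (int)nthreads) {   // interleaved columns per thread
                for (int i = 0; i < n; ++i) {
                    int cij = C[i + j * n];
                    for (int k = 0; k < n; ++k)
                        cij ^= A[i + k * n] & B[k + j * n];
                    C[i + j * n] = cij;                         // each thread owns its own columns of C
                }
            }
        });
    }
    for (auto& th : pool) th.join();
}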
Optimizing matrix-matrix multiplication requires careful attention to be paid to a number of issues:
First, you need to be able to use vector instructions. Only vector instructions can access parallelism inherent in the architecture. So, either your compiler needs to be able to automatically map to vector instructions, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).
Next, you need to pay careful attention to the memory hierarchy. Your performance can easily drop to less than 5% of peak if you don't do this.
Once you do this right, you will hopefully have broken the computation up into small enough computational chunks that you can also parallelize via OpenMP or pthreads.
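To illustrate the memory-hierarchy point on the code from the question (a sketch under the same column-major layout, not a tuned kernel): even a plain loop reordering helps, because with i innermost every access to A and C is stride-1 and B[k + j*n] becomes loop-invariant, which both the cache and the compiler's auto-vectorizer like:

// Same GF(2)-style product in j-k-i order: A[:,k] and C[:,j] are walked
// contiguously in column-major storage; B[k + j*n] is hoisted out of the
// inner loop. XOR accumulation is order-independent, so the result matches.
void matmul_jki(int n, int *A, int *B, int *C)
{
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k) {
            int b = B[k + j * n];
            for (int i = 0; i < n; ++i)
                C[i + j * n] ^= A[i + k * n] & b;
        }
}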
A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (This is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).
(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of these notes.)
You may find the following thread relevant: How does BLAS get such extreme performance?
We have implemented the DFT and wanted to test it against OpenCV's implementation. The results are different.
Our DFT's results are ordered from smallest to biggest, whereas OpenCV's results are not in any particular order.
The first (0th) value is the same for both calculations, as in this case the complex part is 0 (since e^0 = 1 in the formula). The other values are different; for example, OpenCV's results contain negative values, whereas ours do not.
This is our implementation of DFT:
// complex number
std::complex<float> j;
j = -1;
j = std::sqrt(j);

std::complex<float> result;
std::vector<std::complex<float>> fourier; // output

// this->N = length of contour, 512 in our case
// foreach fourier descriptor
for (int n = 0; n < this->N; ++n)
{
    // Summation in formula
    for (int t = 0; t < this->N; ++t)
    {
        result += (this->centroidDistance[t] * std::exp((-j * PI2 * ((float)n) * ((float)t)) / ((float)N)));
    }
    fourier.push_back((1.0f / this->N) * result);
}
and this is how we calculate the DFT with OpenCV:
std::vector<std::complex<float>> fourierCV; // output
cv::dft(std::vector<float>(centroidDistance, centroidDistance + this->N), fourierCV, cv::DFT_SCALE | cv::DFT_COMPLEX_OUTPUT);
The variable centroidDistance is calculated in a previous step.
Note: please avoid answers saying use OpenCV instead of your own implementation.
You forgot to initialise result for each iteration of n:
for (int n = 0; n < this->N; ++n)
{
    result = 0.0f; // initialise `result` to 0 here <<<

    // Summation in formula
    for (int t = 0; t < this->N; ++t)
    {
        result += (this->centroidDistance[t] * std::exp((-j * PI2 * ((float)n) * ((float)t)) / ((float)N)));
    }
    fourier.push_back((1.0f / this->N) * result);
}
EDIT: You can check out my implementation on GitHub: https://github.com/Sheljohn/WalshHadamard
I am looking for an implementation, or indications on how to implement, the sequency-ordered Fast Walsh Hadamard transform (see this and this).
I slightly adapted a very nice implementation found online:
// (a,b) -> (a+b,a-b) without overflow
void rotate( long& a, long& b )
{
    static long t;
    t = a;
    a = a + b;
    b = t - b;
}

// Integer log2
long ilog2( long x )
{
    long l2 = 0;
    for (; x; x >>= 1) ++l2;
    return l2;
}

/**
 * Fast Walsh-Hadamard transform
 */
void fwht( std::vector<long>& data )
{
    const long l2 = ilog2(data.size()) - 1;

    for (long i = 0; i < l2; ++i)
    {
        for (long j = 0; j < (1 << l2); j += 1 << (i+1))
        for (long k = 0; k < (1 << i); ++k)
            rotate( data[j + k], data[j + k + (1 << i)] );
    }
}
but it does not compute the WHT in sequency order (the natural Hadamard matrix is used implicitly). Note that in the code above (and if you try it), the size of data needs to be a power of 2.
My question is: is there a simple adaptation of this implementation that gives the sequency-ordered FWHT?
A possible solution would be to write a small function to compute dynamically the elements of Hn (the Hadamard matrix of order n), count the number of zero crossings, and create a ranking of the rows, but I am wondering whether there is a smarter way. Thanks in advance for any input! Cheers
As indicated here (linked from within your reference):
The sequency ordering of the rows of the Walsh matrix can be derived from the ordering of the Hadamard matrix by first applying the bit-reversal permutation and then the Gray code permutation.
There are various implementations of bit-reversal algorithm such as this:
// Bit-reversal
// adapted from http://www.idi.ntnu.no/~elster/pubs/elster-bit-rev-1989.pdf
void bitrev(int t, std::vector<long>& c)
{
    long n = 1 << t;
    long L = 1;
    c[0] = 0;

    for (int q = 0; q < t; ++q)
    {
        n /= 2;
        for (long j = 0; j < L; ++j)
        {
            c[L + j] = c[j] + n;
        }
        L *= 2;
    }
}
The Gray code can be obtained from here:
/*
    The purpose of this function is to convert an unsigned
    binary number to reflected binary Gray code.
    The operator >> is shift right. The operator ^ is exclusive or.
*/
unsigned int binaryToGray(unsigned int num)
{
    return (num >> 1) ^ num;
}
These can be combined to yield the final permutation:
// Compute a permutation of size 2^order
// to reorder the Fast Walsh-Hadamard transform's output
// into the Walsh-ordered (sequency-ordered) form
void sequency_permutation(long order, std::vector<long>& p)
{
    long n = 1 << order;
    std::vector<long> tmp(n);
    bitrev(order, tmp);

    p.resize(n);
    for (long i = 0; i < n; ++i)
    {
        p[i] = tmp[binaryToGray(i)];
    }
}
All that's left to do is to apply the permutation to the normal Walsh-Hadamard Transform output.
void permuted_fwht(std::vector<long>& data, const std::vector<long>& permutation)
{
    std::vector<long> tmp = data;
    fwht(tmp);

    for (long i = 0; i < data.size(); ++i)
    {
        data[i] = tmp[permutation[i]];
    }
}
Note that the permutation is fixed for a given data size, so it only needs to be computed once (assuming you are processing multiple blocks of data). So, putting it all together you would get something such as:
std::vector<long> p;
const long order = ilog2(data_block_size) - 1;
sequency_permutation(order, p);
permuted_fwht( data_block_1, p);
permuted_fwht( data_block_2, p);
//...