I am making a simple Gaussian blur function for a 2D array that is supposed to represent an image. The function just prints out the array values at the end (no actual image processing going on here). I was pretty sure that I had implemented everything correct, but the values I am getting for (N=3, sigma=1.5) are much lower than expected based on this calculator: http://dev.theomader.com/gaussian-kernel-calculator/
I am following this equation:
void gaussian_filter(int N, double sigma) {
double k[N][N];
for(int i=0; i<N; i++) { //Initialize kernal to 0
for(int j=0; j<N; j++) {
k[i][j] = 0;
}
}
double sum = 0.0; //There is an issue somewhere in this block of code
int change = (N/2);
double r, s = change * sigma * sigma;
for (int x = -change; x <= change; x++) {
for(int y = -change; y <= change; y++) {
r = sqrt(x*x + y*y);
k[x + change][y + change] = (exp(-(r*r)/s))/(M_PI * s);
sum += k[x + change][y + change];
}
}
for(int i = 0; i < N; ++i) { //Normalize
for(int j = 0; j < N; ++j) {
k[i][j] /= sum;
}
}
for(int i = 0; i < N; ++i) { //Print out array
for (int j = 0; j < N; ++j)
cout<<k[i][j]<<"\t";
}
cout<<endl;
}
}
Here is the expected output for N=3 and Sigma=1.5
Here is the current broken output for N=3 and Sigma=1.5
Why does s depend on change? I think you should do:
double r, s = 2 * sigma * sigma;
// instead of
// double r, s = change * sigma * sigma;
That website computes Gaussian kernels in an unorthodox manner:
The weights are calculated by numerical integration of the continuous gaussian distribution over each discrete kernel tap.
That is, it samples a continuous Gaussian kernel that has been convolved with a uniform (“box”) filter of 1 pixel wide. The resulting Gaussian is wider than advertised. I advise against this method.
The proper way to create a Gaussian kernel is to just sample the Gaussian function at given integer locations, for example x = [-3, -2, -1, 0, 1, 2, 3].
Do note that a 3-pixel kernel is not wide enough to represent a Gaussian. It is important to sample the tail of the curve, without it, the kernel doesn’t have the good properties of the Gaussian kernel. I recommend sampling up to 3 sigma to each side, leading to 2*ceil(3*sigma)+1 pixels. 2 sigma is the bare minimum, useful only when speed is more important than good results.
Do also note that the Gaussian is separable, you can apply two 1D kernels in succession, rather than a single 2D kernel. For the 9x9 kernel you get for sigma=1.5, this translates to 9+9=18 multiplications and additions, compared to 9x9=81 for the 2D kernel. This is a significant saving!
Related
I'm writing a library where I want to have some basic NxN matrix functionality that doesn't have any dependencies and it is a bit of a learning project. I'm comparing my performance to Eigen. I've been able to be pretty equal and even beat its performance on a couple front with SSE2 and with AVX2 beat it on quite a few fronts (it only uses SSE2 so not super surprising).
My issue is I'm using Gaussian Elimination to create an Upper Diagonalized matrix then multiplying the diagonal to get the determinant.I beat Eigen for N < 300 but after that Eigen blows me away and it just gets worse as the matrices get bigger. Given all the memory is accessed sequentially and the compiler dissassembly doesn't look terrible I don't think it is an optimization issue.
There is more optimization that can be done but the timings look much more like an algorithmic timing complexity issue or there is a major SSE advantage I'm not seeing. Simply unrolling the loops a bit hasn't done much for me when trying that.
Is there a better algorithm for calculating determinants?
Scalar code
/*
Warning: Creates Temporaries!
*/
template<typename T, int ROW, int COLUMN> MML_INLINE T matrix<T, ROW, COLUMN>::determinant(void) const
{
/*
This method assumes square matrix
*/
assert(row() == col());
/*
We need to create a temporary
*/
matrix<T, ROW, COLUMN> temp(*this);
/*We convert the temporary to upper triangular form*/
uint N = row();
T det = T(1);
for (uint c = 0; c < N; ++c)
{
det = det*temp(c,c);
for (uint r = c + 1; r < N; ++r)
{
T ratio = temp(r, c) / temp(c, c);
for (uint k = c; k < N; k++)
{
temp(r, k) = temp(r, k) - ratio * temp(c, k);
}
}
}
return det;
}
AVX2
template<> float matrix<float>::determinant(void) const
{
/*
This method assumes square matrix
*/
assert(row() == col());
/*
We need to create a temporary
*/
matrix<float> temp(*this);
/*We convert the temporary to upper triangular form*/
float det = 1.0f;
const uint N = row();
const uint Nm8 = N - 8;
const uint Nm4 = N - 4;
uint c = 0;
for (; c < Nm8; ++c)
{
det *= temp(c, c);
float8 Diagonal = _mm256_set1_ps(temp(c, c));
for (uint r = c + 1; r < N;++r)
{
float8 ratio1 = _mm256_div_ps(_mm256_set1_ps(temp(r,c)), Diagonal);
uint k = c + 1;
for (; k < Nm8; k += 8)
{
float8 ref = _mm256_loadu_ps(temp._v + c*N + k);
float8 r0 = _mm256_loadu_ps(temp._v + r*N + k);
_mm256_storeu_ps(temp._v + r*N + k, _mm256_fmsub_ps(ratio1, ref, r0));
}
/*We go Scalar for the last few elements to handle non-multiples of 8*/
for (; k < N; ++k)
{
_mm_store_ss(temp._v + index(r, k), _mm_sub_ss(_mm_set_ss(temp(r, k)), _mm_mul_ss(_mm256_castps256_ps128(ratio1),_mm_set_ss(temp(c, k)))));
}
}
}
for (; c < Nm4; ++c)
{
det *= temp(c, c);
float4 Diagonal = _mm_set1_ps(temp(c, c));
for (uint r = c + 1; r < N; ++r)
{
float4 ratio = _mm_div_ps(_mm_set1_ps(temp[r*N + c]), Diagonal);
uint k = c + 1;
for (; k < Nm4; k += 4)
{
float4 ref = _mm_loadu_ps(temp._v + c*N + k);
float4 r0 = _mm_loadu_ps(temp._v + r*N + k);
_mm_storeu_ps(temp._v + r*N + k, _mm_sub_ps(r0, _mm_mul_ps(ref, ratio)));
}
float fratio = _mm_cvtss_f32(ratio);
for (; k < N; ++k)
{
temp(r, k) = temp(r, k) - fratio*temp(c, k);
}
}
}
for (; c < N; ++c)
{
det *= temp(c, c);
float Diagonal = temp(c, c);
for (uint r = c + 1; r < N; ++r)
{
float ratio = temp[r*N + c] / Diagonal;
for (uint k = c+1; k < N;++k)
{
temp(r, k) = temp(r, k) - ratio*temp(c, k);
}
}
}
return det;
}
Algorithms to reduce an n by n matrix to upper (or lower) triangular form by Gaussian elimination generally have complexity of O(n^3) (where ^ represents "to power of").
There are alternative approaches for computing determinate, such as evaluating the set of eigenvalues (the determinant of a square matrix is equal to the product of its eigenvalues). For general matrices, computation of the complete set of eigenvalues is also - practically - O(n^3).
In theory, however, calculation of the set of eigenvalues has complexity of n^w where w is between 2 and 2.376 - which means for (much) larger matrices it will be faster than using Gaussian elimination. Have a look at an article "Fast linear algebra is stable" by James Demmel, Ioana Dumitriu, and Olga Holtz in Numerische Mathematik, Volume 108, Issue 1, pp. 59-91, November 2007. If Eigen uses an approach with complexity less than O(n^3) for larger matrices (I don't know, never having had reason to investigate such things) that would explain your observations.
The answer most places seem to use Block LU Factorization to create an Lower triangle and Upper triangle matrix in the same memory space. It is ~O(n^2.5) depending on the size of block you use.
Here is a power point from Rice University that explains the algorithm.
www.caam.rice.edu/~timwar/MA471F03/Lecture24.ppt
Division by a matrix means multiplication by its inverse.
The idea seems to be to increase the number of n^2 operations significantly but reduce the number m^3 which in effect lowers the complexity of the algorithm since m is of a fixed small size.
Going to take a little bit to write this up in an efficient manner since to do it efficiently requires 'in place' algorithms I don't have written yet.
We have implemented DFT and wanted to test it with OpenCV's implementation. The results are different.
our DFT's results are in order from smallest to biggest, whereas OpenCV's results are not in any order.
the first (0th) value is the same for both calculations, as in this case, the complex part is 0 (since e^0 = 1, in the formula). The other values are different, for example OpenCV's results contain negative values, whereas ours does not.
This is our implementation of DFT:
// complex number
std::complex<float> j;
j = -1;
j = std::sqrt(j);
std::complex<float> result;
std::vector<std::complex<float>> fourier; // output
// this->N = length of contour, 512 in our case
// foreach fourier descriptor
for (int n = 0; n < this->N; ++n)
{
// Summation in formula
for (int t = 0; t < this->N; ++t)
{
result += (this->centroidDistance[t] * std::exp((-j*PI2 *((float)n)*((float)t)) / ((float)N)));
}
fourier.push_back((1.0f / this->N) * result);
}
and this is how we calculate the DFT with OpenCV:
std::vector<std::complex<float>> fourierCV; // output
cv::dft(std::vector<float>(centroidDistance, centroidDistance + this->N), fourierCV, cv::DFT_SCALE | cv::DFT_COMPLEX_OUTPUT);
The variable centroidDistance is calculated in a previous step.
Note: please avoid answers saying use OpenCV instead of your own implementation.
You forgot to initialise result for each iteration of n:
for (int n = 0; n < this->N; ++n)
{
result = 0.0f; // initialise `result` to 0 here <<<
// Summation in formula
for (int t = 0; t < this->N; ++t)
{
result += (this->centroidDistance[t] * std::exp((-j*PI2 *((float)n)*((float)t)) / ((float)N)));
}
fourier.push_back((1.0f / this->N) * result);
}
I am working on a project that detects some features of two input images(handwritten signatures) and compares those two features using cosine similarity. Here When I mean two input images, one is an original image, and other is duplicate image.
Say I am extracting 15 such features of one image(original image) and storing it in one array(Say, Array_ORG), and features of other image is stored in Array_DUP similarly.
Now, I am trying to calculate the cosine similarity between these two arrays. These arrays are of double datatype.
I am listing down two methods that I followed:
1)Manual calculation of cosine similarity:
main(){
for(int i=0;i<15;i++)
sum_org += (Array_org[i]*Array_org[i]);
for(int i=0;i<15;i++)
sum_dup += (Array_dup[i]*Array_dup[i]);
double magnitude = sqrt(sum_org +sum_dup );
double cosine_similarity = dot_product(Array_org, Array_dup, sizeof(Array_org)/sizeof(Array_org[0]))/magnitude;
}
double dot_product(double *a, double* b, size_t n){
double sum = 0;
size_t i;
for (i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
2)Storing the values into a Mat and calling dot function:
Mat A = Mat(1,15,CV_32FC1,&Array_org);
Mat B = Mat(1,15,CV_32FC1,&Array_dup);
double similarity = cal_theta(A,B);
double cal_theta(Mat A, Mat B){
double ab = A.dot(B);
double aa = A.dot(A);
double bb = B.dot(B);
return -ab / sqrt(aa*bb);
}
I have read that cosine similarity value ranges from -1 to 1, with -1 saying both are exxactly opposite, and 1, saying both are equal. But first function gives me values in 1000's and second function gives me values more than 1.
Please guide me which process is right, and why?
Also how do I infer the similarity if cosine similarity values are more than 1?
The correct definition of cosine similarity is :
Your code does not compute the denominator, hence the values are wrong.
double cosine_similarity(double *A, double *B, unsigned int Vector_Length)
{
double dot = 0.0, denom_a = 0.0, denom_b = 0.0 ;
for(unsigned int i = 0u; i < Vector_Length; ++i) {
dot += A[i] * B[i] ;
denom_a += A[i] * A[i] ;
denom_b += B[i] * B[i] ;
}
return dot / (sqrt(denom_a) * sqrt(denom_b)) ;
}
Just adding a method that with Opencv(C++) to calculate to feature vectors cosine similarity:
float cosSim = f1.dot(f2) / (cv::norm(f1) * cv::norm(f2));
where f1 and f2 are both 1-dimension cv::Mat with size (1, xx).
I'm trying to write a function that runs a loop in C++ from R using Rcpp.
I have a matrix Z which is one row shorter than the matrix OUT that the function is supposed to return because each position of first row of OUT will be given by the scalar sigma_0.
The function is supposed to implement a differential equation. Each iteration depends on a value from the matrix Z as well as a previously generated value of the matrix OUT.
What I've got is this:
cppFunction('
NumericMatrix sim(NumericMatrix Z, long double sigma_0, long double delta, long double omega, long double gamma) {
int nrow = Z.nrow() + 1, ncol = Z.ncol();
NumericMatrix out(nrow, ncol);
for(int q = 0; q < ncol; q++) {
out(0, q) = sigma_0;
}
for(int i = 0; i < ncol; i++) {
for(int j = 1; j < nrow; j++) {
long double z = Z(j - 1, i);
long double sigma = out(j - 1, i);
out(j, i) = pow(abs(z * sigma) - gamma * z * sigma, delta);
}
}
return out;
}
')
Unfortunately I'm fairly certain it doesn't work. The function runs but the values calculated are incorrect - I've checked with simple examples in Excel and plain R-coding. I've stripped the main differentialequation apart trying to build it up step by step to see when the implementation i Excel and R using C++ starts to differ. Which seems to be when I start using the abs() function and power() function but I simply can't narrow the problem down. Any help would be greatly appreciated - also I might mention this is the first time for me using C++ and C++ along with R.
I think you want fabs rather than abs. abs operates on ints, while fabs operates on doubles / floats.
#include <vector>
using std::vector;
#include <complex>
using std::complex;
using std::polar;
typedef complex<double> Complex;
#define Pi 3.14159265358979323846
// direct Fourier transform
vector<Complex> dF( const vector<Complex>& in )
{
const int N = in.size();
vector<Complex> out( N );
for (int k = 0; k < N; k++)
{
out[k] = Complex( 0.0, 0.0 );
for (int n = 0; n < N; n++)
{
out[k] += in[n] * polar<double>( 1.0, - 2 * Pi * k * n / N );
}
}
return out;
}
// inverse Fourier transform
vector<Complex> iF( const vector<Complex>& in )
{
const int N = in.size();
vector<Complex> out( N );
for (int k = 0; k < N; k++)
{
out[k] = Complex( 0.0, 0.0 );
for (int n = 0; n < N; n++)
{
out[k] += in[n] * polar<double>(1, 2 * Pi * k * n / N );
}
out[k] *= Complex( 1.0 / N , 0.0 );
}
return out;
}
Who can say, what is wrong???
Maybe i don't understand details of implementation this algorithm... But i can't find it )))
also, i need to calculate convolution.
But i can't find test example.
UPDATE
// convolution. I suppose that x0.size == x1.size
vector convolution( const vector& x0, const vector& x1)
{
const int N = x0.size();
vector<Complex> tmp( N );
for ( int i = 0; i < N; i++ )
{
tmp[i] = x0[i] * x1[i];
}
return iF( tmp );
}
I really don't know exactly what your asking, but your DFT and IDFT algorithms look correct to me. Convolution can be performed using the DFT and IDFT using the circular convolution theorem which basically states that f**g = IDFT(DFT(f) * DFT(g)) where ** is circular convolution and * is simple multiplication.
To compute linear convolution (non-circular) using the DFT, you must zero-pad each of the inputs so that the circular wrap-around only occurs for zero-valued samples and does not affect the output. Each input sequence needs to be zero padded to a length of N >= L+M-1 where L and M are the lengths of the input sequences. Then you perform circular convolution as shown above and the first L+M-1 samples are the linear convolution output (samples beyond this should be zero).
Note: Performing convolution with the DFT and IDFT algorithms you have shown is much more inefficient than just computing it directly. The advantage only comes when using an FFT and IFFT(O(NlogN)) algorithm in place of the DFT and IDFT (O(N^2)).
Check FFTW library "for computing the discrete Fourier transform (DFT)" and its C# wrapper;) Maybe this too;)
Good luck!
The transforms look fine, but there's nothing in the program that is doing convolution.
UPDATE: the convolution code needs to forward transform the inputs first before the element-wise multiplication.