convert non-contiguous dot product to neon assembly - c++
Instead of a normal dot product, my application requires a slightly modified version. Here is the original C++ code:
for (int m = 0; m < k; m++) {
    for (int n = 0; n < l; n++) {
        for (int t = 0; t < dims2[2]; t++) {
            for (int dm = 0; dm < dims2[1]; dm++) {
                for (int dn = 0; dn < dims2[0]; dn++) {
                    int ai = (n + dn) + (m + dm) * dims1[0] + t * dims1[0] * dims1[1];
                    int bi = dn + dm * dims2[0] + t * dims2[0] * dims2[1];
                    total += A[ai] * B[bi];
                }
            }
        }
    }
}
Feel free to change the order of the loops. How do I convert this to NEON assembly?
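A NEON-intrinsics version of the same loop nest might look like the sketch below (it assumes AArch64, float data, and arrays laid out exactly as in the scalar code; the function name and signature are only illustrative). Only the innermost dn loop walks both A and B contiguously, so that is the loop to vectorise, with a scalar remainder for widths that are not a multiple of 4; the compiler will usually turn intrinsics like these into the vector instructions you would otherwise write by hand.

#include <arm_neon.h>

// Sketch: vectorise the innermost dn loop, 4 floats at a time.
float dot_noncontiguous_neon(const float* A, const float* B,
                             int k, int l,
                             const int dims1[3], const int dims2[3])
{
    float32x4_t acc = vdupq_n_f32(0.0f);   // 4 running partial sums
    float total = 0.0f;                    // scalar remainder accumulator

    for (int m = 0; m < k; m++)
        for (int n = 0; n < l; n++)
            for (int t = 0; t < dims2[2]; t++)
                for (int dm = 0; dm < dims2[1]; dm++) {
                    const float* a = A + n + (m + dm) * dims1[0] + t * dims1[0] * dims1[1];
                    const float* b = B +     dm * dims2[0]       + t * dims2[0] * dims2[1];
                    int dn = 0;
                    for (; dn + 4 <= dims2[0]; dn += 4)   // vectorised part
                        acc = vfmaq_f32(acc, vld1q_f32(a + dn), vld1q_f32(b + dn));
                    for (; dn < dims2[0]; dn++)           // scalar tail
                        total += a[dn] * b[dn];
                }

    return total + vaddvq_f32(acc);        // horizontal sum of the 4 lanes (AArch64 only)
}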
Related
Why does OMP nested parallelism produce different output than linear execution?
I'm attempting to compare execution times when detecting edges of an image in a linear way and in a parallel way. Everything works fine in the linear version, but in the parallel version the written image has too many white pixels in one part of the image. To better show what I mean, see the image below: the left image is the output of the code executed linearly, and the right one uses parallelism. You can see the edges of the buildings in both images, and the bottom part of the right image close to its border doesn't have the same issue as the rest of it. I cropped the "critical" part of the code that does this task, in the hope that someone may know what may be causing this.

omp_set_nested(1);
#pragma omp parallel
while(col<cols-1) {
    line = 1;
    #pragma omp parallel
    while(line<lines-1) {
        gradient_x = 0;
        gradient_y = 0;
        for(int m = 0; m < mask_size; m++) {
            for(int n = 0; n < mask_size; n++) {
                int np_x = line + (m - 1);
                int np_y = col + (n - 1);
                float v = img(np_y,np_x);
                int mask_index = (m*3) + n;
                gradient_x = gradient_x + (x_mask[mask_index] * v);
                gradient_y = gradient_y + (y_mask[mask_index] * v);
            }
        }
        float gradient_sum = sqrt((gradient_x * gradient_x) + (gradient_y * gradient_y));
        if(gradient_sum >= 255) {
            gradient_sum = 255;
        } else if(gradient_sum <= 0) {
            gradient_sum = 0;
        }
        output(line, col) = gradient_sum;
        #pragma omp critical
        line++;
    }
    #pragma omp critical
    col++;
}

I defined the line and col variables as critical because they are the ones used for both reading and writing data, and I believe everything else is working properly.
Without more context, it is hard to tell. Nonetheless, those two nested parallel regions do not make sense, because you are not distributing tasks among threads; instead you are just executing the same code on multiple threads, with possible race conditions on the updates of the variables gradient_x and gradient_y, among others. Start with the following simpler parallel code:

omp_set_nested(0);
while(col<cols-1) {
    line = 1;
    while(line<lines-1) {
        gradient_x = 0;
        gradient_y = 0;
        #pragma omp parallel for reduction(+:gradient_x,gradient_y)
        for(int m = 0; m < mask_size; m++) {
            for(int n = 0; n < mask_size; n++) {
                int np_x = line + (m - 1);
                int np_y = col + (n - 1);
                float v = img(np_y,np_x);
                int mask_index = (m*3) + n;
                gradient_x = gradient_x + (x_mask[mask_index] * v);
                gradient_y = gradient_y + (y_mask[mask_index] * v);
            }
        }
        float gradient_sum = sqrt((gradient_x * gradient_x) + (gradient_y * gradient_y));
        if(gradient_sum >= 255) {
            gradient_sum = 255;
        } else if(gradient_sum <= 0) {
            gradient_sum = 0;
        }
        output(line, col) = gradient_sum;
        line++;
    }
    col++;
}

Once that works, you can try parallelizing the outer loops instead:

#pragma omp parallel for collapse(2)
for(int col = 0; col<cols-1; col++) {
    for(int line = 1; line<lines-1; line++) {
        float gradient_x = 0;
        float gradient_y = 0;
        for(int m = 0; m < mask_size; m++) {
            for(int n = 0; n < mask_size; n++) {
                int np_x = line + (m - 1);
                int np_y = col + (n - 1);
                float v = img(np_y,np_x);
                int mask_index = (m*3) + n;
                gradient_x = gradient_x + (x_mask[mask_index] * v);
                gradient_y = gradient_y + (y_mask[mask_index] * v);
            }
        }
        float gradient_sum = sqrt((gradient_x * gradient_x) + (gradient_y * gradient_y));
        if(gradient_sum >= 255) {
            gradient_sum = 255;
        } else if(gradient_sum <= 0) {
            gradient_sum = 0;
        }
        output(line, col) = gradient_sum;
    }
}

Of course, you still need to check for race conditions in the code that you have cropped.
Gaussian Filter with OpenCV and C++
My goal is to apply a Gaussian filter to an input image. I don't want to use the OpenCV function (I want to program it myself). I point out the error in the code below. Can somebody help me? The code is based on a mean filter example. I know the examples with the GaussianBlur function.

public:
Mat gaussianfilter(const Mat input, int n, float sigmaT, float sigmaS, const char* opt) {
    Mat kernel;
    int row = input.rows;
    int col = input.cols;
    int kernel_size = (2 * n + 1);
    int tempa;
    int tempb;
    float denom;
    float kernelvalue{};

    // Initializing kernel matrix
    kernel = Mat::zeros(kernel_size, kernel_size, CV_32F);

    denom = 0.0;
    for (int a = -n; a <= n; a++) { // Denominator in m(s,t)
        for (int b = -n; b <= n; b++) {
            float value1 = exp(-(pow(a, 2) / (2 * pow(sigmaS, 2))) - (pow(b, 2) / (2 * pow(sigmaT, 2))));
            kernel.at<float>(a + n, b + n) = value1;
            denom += value1;
        }
    }
    for (int a = -n; a <= n; a++) { // Normalize by the denominator
        for (int b = -n; b <= n; b++) {
            kernel.at<float>(a + n, b + n) /= denom;
        }
    }

    Mat output = Mat::zeros(row, col, input.type());

    for (int i = 0; i < row; i++) {
        for (int j = 0; j < col; j++) {
            float sum1 = 0.0;
            for (int a = -n; a <= n; a++) {
                for (int b = -n; b <= n; b++) {
                    /* Gaussian filter with zero-padding boundary process: Fill the code: */
                    if ((i + a <= row - 1) && (i + a >= 0) && (j + b <= col - 1) && (j + b >= 0)) { // if the pixel is not a border pixel
                        /* Here is the failure:
                           Severity Code Description Project File Line Suppression State
                           Error (active) E0349 no operator "+=" */
                        sum1 += kernel * (float)(input.at<G>(i + a, j + b));
                    }
                }
            }
            output.at<G>(i, j) = (G)sum1;
        }
    }
    return output;
}
sum1 is a float and kernel is a Mat. You can't use += between a float and a Mat. You need to index into the kernel:

sum1 += kernel.at<float>(a+n,b+n) * (float)(input.at<G>(i + a, j + b));

I think.
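Applied to the question's inner convolution loop, the fixed accumulation would look something like this (a sketch that reuses the same variable names; G is whatever pixel type the rest of the code uses):

float sum1 = 0.0f;
for (int a = -n; a <= n; a++) {
    for (int b = -n; b <= n; b++) {
        // zero-padding: contributions outside the image are simply skipped
        if (i + a >= 0 && i + a <= row - 1 && j + b >= 0 && j + b <= col - 1) {
            sum1 += kernel.at<float>(a + n, b + n) * (float)input.at<G>(i + a, j + b);
        }
    }
}
output.at<G>(i, j) = (G)sum1;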
FFTW Complex to Real Segmentation Fault
I am attempting to write a naive implementation of the Short-Time Fourier Transform using consecutive FFT frames in time, calculated using the FFTW library, but I am getting a segmentation fault and cannot work out why. My code is as below:

// load in audio
AudioFile<double> audioFile;
audioFile.load ("assets/example-audio/file_example_WAV_1MG.wav");
int N = audioFile.getNumSamplesPerChannel();

// make stereo audio mono
double fileDataMono[N];
if (audioFile.isStereo())
    for (int i = 0; i < N; i++)
        fileDataMono[i] = ( audioFile.samples[0][i] + audioFile.samples[1][i] ) / 2;

// setup stft
// (test transform, presently unoptimized)
int stepSize = 512;
int M = 2048; // fft size
int noOfFrames = (N-(M-stepSize))/stepSize;

// create Hamming window vector
double w[M];
for (int m = 0; m < M; m++) {
    w[m] = 0.53836 - 0.46164 * cos( 2*M_PI*m / M );
}

double* input;
// (pads input array if necessary)
if ( (N-(M-stepSize))%stepSize != 0) {
    noOfFrames += 1;
    int amountOfZeroPadding = stepSize - (N-(M-stepSize))%stepSize;
    double ipt[N + amountOfZeroPadding];
    for (int i = 0; i < N; i++) // copy values from fileDataMono into input
        ipt[i] = fileDataMono[i];
    for (int i = 0; i < amountOfZeroPadding; i++)
        ipt[N + i] = 0;
    input = ipt;
}
else {
    input = fileDataMono;
}

// compute stft
fftw_complex* stft[noOfFrames];
double frames[noOfFrames][M];
fftw_plan fftPlan;
for (int i = 0; i < noOfFrames; i++) {
    stft[i] = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * M);
    for (int m = 0; m < M; m++)
        frames[i][m] = input[i*stepSize + m] * w[m];
    fftPlan = fftw_plan_dft_r2c_1d(M, frames[i], stft[i], FFTW_ESTIMATE);
    fftw_execute(fftPlan);
}

// compute istft
double* outputFrames[noOfFrames];
double output[N];
for (int i = 0; i < noOfFrames; i++) {
    outputFrames[i] = (double*)fftw_malloc(sizeof(double) * M);
    fftPlan = fftw_plan_dft_c2r_1d(M, stft[i], outputFrames[i], FFTW_ESTIMATE);
    fftw_execute(fftPlan);
    for (int m = 0; i < M; m++) {
        output[i*stepSize + m] += outputFrames[i][m];
    }
}
fftw_destroy_plan(fftPlan);
for (int i = 0; i < noOfFrames; i++) {
    fftw_free(stft[i]);
    fftw_free(outputFrames[i]);
}

// output audio
AudioFile<double>::AudioBuffer outputBuffer;
outputBuffer.resize (1);
outputBuffer[0].resize(N);
outputBuffer[0].assign(output, output+N);
bool ok = audioFile.setAudioBuffer(outputBuffer);
audioFile.setAudioBufferSize (1, N);
audioFile.setBitDepth (16);
audioFile.setSampleRate (8000);
audioFile.save ("out/audioOutput.wav");

The segfault seems to be raised by the first fftw_malloc when computing the forward STFT. Thanks in advance!
The relevant bit of code is:

double* input;
if ( (N-(M-stepSize))%stepSize != 0) {
    double ipt[N + amountOfZeroPadding];
    //...
    input = ipt;
}
//...
input[i*stepSize + m];

Your input pointer points at memory that exists only inside the if statement. The closing brace denotes the end of the lifetime of the ipt array. When dereferencing the pointer later, you are addressing memory that no longer exists.
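One way to avoid the dangling pointer (a sketch, assuming <vector> is included and fileDataMono is as in the question) is to keep the possibly padded signal in a container whose lifetime spans the whole function:

std::vector<double> input(fileDataMono, fileDataMono + N);
if ((N - (M - stepSize)) % stepSize != 0) {
    noOfFrames += 1;
    int amountOfZeroPadding = stepSize - (N - (M - stepSize)) % stepSize;
    input.resize(input.size() + amountOfZeroPadding, 0.0);   // zero-pad in place
}
// later accesses such as input[i*stepSize + m] now stay valid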
has invalid type for ‘reduction’ openmp for c++
I have used OpenMP to parallelize my C++ code as below:

int shell_num = 50, grparallel[shell_num], grbot[shell_num];
double p_x, p_y, grp[shell_num];
for (int f = 0; f < shell_num; f++)
{
    grp[f] = 0;
    grparallel[f] = 0;
    grbot[f] = 0;
}
//some code...
#pragma omp parallel for reduction(+ : grp,grparallel,grbot)
for(int i = 0; i < N; i++){
    //some code
    for(int j = 0; j < N; j++){
        if (j==i) continue;
        double delta_x = x[i]-x[j],
               delta_y = y[i]-y[j],
               e_dot_e = e_x[i] * e_x[j] + e_y[i] * e_y[j],
               e_cross_e = e_x[i] * e_y[j] - e_y[i] * e_x[j];
        if (j > i) {
            double fasele = sqrt(dist(x[i],y[i],x[j],y[j],L));
            for (int h = 0; h < shell_num; h++) // determine which shell the periodic distance between i and j falls in
            {
                if( L * h / 100 < fasele && fasele < L * (h + 1) / 100 )
                {
                    grp[h] += e_dot_e;
                    double pdotr = abs(periodic(delta_x,L) * p_x + periodic(delta_y,L) * p_y)/fasele;
                    if (pdotr > 0.9659) {
                        grparallel[h] += 1;
                    } else if (pdotr < 0.2588) {
                        grbot[h] += 1;
                    }
                    break;
                }
            }
        }
    }
}

When I run the code in the terminal, there is an error:

‘grp’ has invalid type for ‘reduction’

The same error occurs for grparallel and grbot. How can I remove the error?
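One possible fix (a sketch, not verified against your compiler): the error typically means your OpenMP version does not accept plain array names in a reduction clause; OpenMP 4.5's array-section syntax is one way around it, and making shell_num a compile-time constant avoids the variable-length arrays, which some compilers also reject in reductions.

const int shell_num = 50;                        // constant size: ordinary arrays, not VLAs
double grp[shell_num] = {0.0};
int    grparallel[shell_num] = {0}, grbot[shell_num] = {0};

// Array-section reductions require OpenMP 4.5 or newer.
#pragma omp parallel for reduction(+ : grp[0:shell_num], grparallel[0:shell_num], grbot[0:shell_num])
for (int i = 0; i < N; i++) {
    // ... same loop body as above ...
}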
From mathematical function to C++ code
I am trying to implement this F(S) function. Below is my code, but it is not working:

double EnergyFunction::evaluate(vector<short> field) {
    double e = 0.0;
    for (int k = 1; k < field.size() - 1; k++){
        double c = 0.0;
        for (int i = 1; i < field.size() - k; i++) {
            c += field[i] * field[i + k];
        }
        e += pow(c, 2);
    }
    double f = pow(field.size(), 2) / ( 2 * e );
    return f;
}

For example, the F(S) function should return the value 8644 for this vector:

1,1,1,-1,-1,-1,1,-1,1,1,-1,1,-1,1,-1,1,-1,-1,1,1,1,1,-1,-1,-1,1,1,1,1,-1,1,-1,1,1,-1,-1,1,1,1,1,-1,-1,-1,1,-1,-1,1,-1,-1,1,1,-1,1,-1,-1,1,1,-1,1,-1,1,-1,1,-1,1,-1,1,1,-1,-1,-1,-1,-1,-1,1,-1,1,1,1,-1,1,1,-1,1,1,-1,1,-1,1,1,1,-1,-1,1,1,-1,-1,1,1,1,1,1,1,1,1,-1,1,-1,1,-1,1,-1,-1,1,-1,-1,1,-1,-1,1,-1,-1,-1,-1,-1,1,1,1,1,1,-1,-1,-1,1,-1,-1,1,-1,-1,1,-1,-1,1,-1,1,-1,-1,1,1,1,1,1,1,-1,1,-1,1,-1,1,1,1,1,1,1,-1,1,-1,-1,-1,1,-1,1,1,-1,-1,-1,-1,1,-1,-1,-1,1,1,-1,-1,1,1,1,-1,-1,1,1,1,1,-1,1,1,-1,1,-1,-1,1,-1,-1,-1,-1,1,-1,-1,-1,1,-1,-1,1,1,-1,-1,-1,-1,-1,1,-1,-1,-1,1,1,-1,1,1,-1,-1,-1,1,-1,-1,1,-1,-1,-1,1,1,1,-1,-1,-1,-1,1,1,1,-1,1,-1,-1,1,-1,1,1,-1,-1,-1,-1,1,-1,1,1,1,1,1,1,-1,1,1,1,-1,-1,-1,-1,1,-1,1,1,1,1,-1,1,1,1,1,1,-1,-1,-1,1,-1,-1,1,1,1,-1,1,1,1,-1,1,1

I need another pair of eyes to look at my code because I am a bit lost here. :)
After refactoring:

double EnergyFunction::evaluate(vector<short> field) {
    double e = 0.0;
    int l = field.size();
    for (int k = 1; k < l; k++){
        double c = 0.0;
        for (int i = 0, j = k; j < l; i++, j++) {
            c += field[i] * field[j];
        }
        e += c*c;
    }
    return l*l / ( e+e );
}

Explanation:
1. We need to iterate (L-1) times.
2. We shift the base and offset indexes until we reach the last one.
3. c*c and e+e are quicker and easier to read.
You are mapping variables into different ranges using the same names, which is always going to be confusing. It is better to keep the ranges and names the same as in the math, and only subtract one for the 0-based indexes at indexing time. You might as well use L explicitly:

int L = field.size();
for (int k = 1; k <= L-1; k++){
    ...
    for (int i = 1; i <= L-k; i++) {
        c += field[i -1] * field[i+k -1];
        ...
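Putting it together (a sketch that fills in the elided parts from the loop structure in the question):

int L = field.size();
double e = 0.0;
for (int k = 1; k <= L - 1; k++) {          // k runs over 1..L-1, as in the math
    double c = 0.0;
    for (int i = 1; i <= L - k; i++) {      // i runs over 1..L-k; subtract 1 only when indexing
        c += field[i - 1] * field[i + k - 1];
    }
    e += c * c;
}
double f = (double)L * L / (2 * e);         // F(S) = L^2 / (2 * sum over k of c_k^2)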