Multi-threading molecular simulations in C++

I am developing a molecular dynamics simulation code in C++, which essentially takes atom positions and other properties as input and simulates their motion under Newton's laws of motion. The core algorithm uses what's called the Velocity Verlet scheme and looks like:
// iterate through time (k=[1,#steps])
double Dt = 0.002; // time step
double Ttot = 1.0; // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
for (int i = 0; i < number_particles; i++)
vHalf[i] = p[i].velocity + F[i]*halfDt; // step 1
for (int i = 0; i < number_particles; i++)
p[i].position += vHalf[i]*Dt; // step 2
for (int i = 0; i < number_particles; i++)
F[i] = Force(p,i); // recalculate force on all particle i's
for (int i = 0; i < number_particles; i++)
p[i].velocity = vHalf[i] + F[i]*halfDt; // step 3
}
Where p is an array of class objects which store things like particle position, velocity, mass, etc. and Force is a function that calculates the net force on a particle using something like Lennard-Jones potential.
My question regards the time required to complete the calculation; all of my subroutines are optimized in terms of crunching numbers (e.g. using x*x*x to raise to the third power instead of pow(x,3)), but the main issue is the time loop will often be performed for millions of iterations and there are typically close to a million particles. Is there any way to implement this algorithm using multi-threading? From my understanding, multi-threading essentially opens another stream of data to and from a CPU core, which would allow me to run two different simulations at the same time; I would like to use multi-threading to make just one of these simulations run faster

I'd recommend using OpenMP.
Your specific use case is trivially parallelizable.
Parallelization should be as simple as:
double Dt = 0.002; // time step
double Ttot = 1.0; // total time
double halfDt = Dt/2.0;
for (int k = 1; k*Dt <= Ttot; k++){
    #pragma omp parallel for
    for (int i = 0; i < number_particles; i++) {
        vHalf[i] = p[i].velocity + F[i]*halfDt; // step 1
        p[i].position += vHalf[i]*Dt; // step 2
    }
    #pragma omp parallel for
    for (int i = 0; i < number_particles; i++) {
        F[i] = Force(p,i); // recalculate force on particle i (assumes Force only reads positions)
        p[i].velocity = vHalf[i] + F[i]*halfDt; // step 3
    }
}
Most popular compilers and platforms have support for OpenMP.
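For reference, here is a minimal self-contained sketch of the same loop structure that can be compiled and checked on its own. The Particle struct, the spring_force stand-in for Force(p,i) (a unit-mass harmonic spring, chosen only because its exact solution is known) and the velocity_verlet wrapper are assumptions made for illustration, not part of the question's code. If OpenMP is not enabled (e.g. -fopenmp on GCC/Clang), the pragmas are ignored and the loops simply run serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Particle {
    double position;  // 1D for brevity
    double velocity;
};

// Hypothetical stand-in for Force(p, i): unit-mass spring pulling towards 0.
// It only reads positions, which is what makes the second loop below safe.
double spring_force(const std::vector<Particle>& p, int i) {
    return -p[i].position;
}

void velocity_verlet(std::vector<Particle>& p, double dt, int steps) {
    assert(dt > 0.0 && steps >= 0);
    const int n = static_cast<int>(p.size());
    std::vector<double> F(n), vHalf(n);
    for (int i = 0; i < n; ++i) F[i] = spring_force(p, i);  // initial forces

    for (int k = 0; k < steps; ++k) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            vHalf[i] = p[i].velocity + F[i] * 0.5 * dt;  // step 1
            p[i].position += vHalf[i] * dt;              // step 2
        }
        // Implicit barrier here: all positions are updated before any force is read.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            F[i] = spring_force(p, i);                   // new force
            p[i].velocity = vHalf[i] + F[i] * 0.5 * dt;  // step 3
        }
    }
}
```

Starting every particle at position 1 with zero velocity, the exact solution is cos(t), so the result after integrating to t = 1 can be compared against std::cos(1.0). With a real pairwise force the structure stays the same, as long as Force does not write to shared data.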

Related

Faster mathematical operations over a vector using libsimdpp

Searching around for ways to improve my waveform generation code, I've come across SIMD and the libsimdpp library, but I have no idea how to use it. If I got it right, using raw SIMD would require me to write code for each architecture, while libsimdpp would handle that for me.
What I need to do is calculate the squared sum and RMS value of a chunk of samples. I managed to speed this up using vectorization, which worked perfectly until I introduced the same calculation for both the left and right channels of an audio file.
So my question is: how can I use libsimdpp (or any library that will make SIMD easier for me) to improve the code below?
// START: vector containing all the audio samples (left channel shown)
std::vector<double> samplesL;
int nb_samples = samplesL.size();
// END
// START: Loop through the samples vector, incrementing the index by samples_per_pixel each time
for (int i = 0; i < nb_samples; i += samples_per_pixel)
{
    // START: Create chunk of samples with the size of samples_per_pixel
    double* chunk = &samplesL[i];
    // END
    // START: Calculate rms and squared sum
    float sumL = 0;
    float squaredsumL = 0;
    /// there are multiple definitions of the above for both channels, but I won't include them
    //// to make the code easier to read
    for (int j = 0; j < samples_per_pixel; j++)
    {
        if (chunk[j] < 0)
            sumL += -chunk[j];
        else
            sumL += chunk[j];
        squaredsumL += chunk[j] * chunk[j];
    }
    /// average
    float average_point = (sumL * 2) / samples_per_pixel;
    // rms
    float meanL = squaredsumL / samples_per_pixel;
    rms_pointL = qSqrt(meanL);
    /// Drawing of both average point and rms
    //// [...]
    // END
}
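This is not a libsimdpp solution, but as a possible starting point, here is a minimal sketch of the per-chunk computation with the branch replaced by std::fabs, which leaves a plain reduction loop. Whether the compiler actually vectorizes it depends on the compiler and its flags. The names ChunkStats and chunk_stats are hypothetical, and the sketch returns the plain mean of the absolute values rather than the (sum * 2) / samples_per_pixel used above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical result type: mean absolute value and RMS of one chunk.
struct ChunkStats {
    double mean_abs;
    double rms;
};

// Computes both statistics in a single branch-free pass over the chunk.
ChunkStats chunk_stats(const double* chunk, std::size_t count) {
    assert(chunk != nullptr && count > 0);
    double sum = 0.0;
    double squared_sum = 0.0;
    for (std::size_t j = 0; j < count; ++j) {
        const double s = chunk[j];
        sum += std::fabs(s);   // replaces the if/else on the sign
        squared_sum += s * s;
    }
    const double n = static_cast<double>(count);
    return ChunkStats{sum / n, std::sqrt(squared_sum / n)};
}
```

For two channels, one option is to call this once per channel buffer, so that each call walks a single contiguous array.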

Optimizing the Vivado HLS code to reduce the latency for image processing algorithm

I am trying to implement an image processing algorithm for a gamut mapping filter in hardware using Vivado HLS. I have created a synthesizable version from Halide code, but it is taking far too long: for a 256x512 image it was taking around 135 seconds, which shouldn't be the case. I have used some optimization techniques such as pipelining the innermost loop. I set a target initiation interval of II=1 for the innermost loop, but the achieved II is 6. From the warnings thrown by the compiler, I understand that this is caused by the accesses to arrays like ctrl_pts and weights. From the tutorials I have seen, array partitioning and array reshaping should help with faster access to these arrays. The code I used for synthesis is below:
//header
#include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;
typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
float r;
float g;
float b;
};
struct pixel_8
{
uc r;
uc g;
uc b;
};
void gamut_transform(int rows,int cols,Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts);
//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows,int cols, Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
pixel_8 input;
pixel_8 new_pix;
bit_24 temp_in,temp_out;
pixel_f buff_1,buff_2,buff_3,buff_4,buff_5;
float dist;
for (int i = 0; i < 256; i++)
{
for (int j = 0; i < 512; i++)
{
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
for(int idx =0; idx < 3702; idx++)
{
buff_2.r = buff_1.r - ctrl_pts[idx][0];
buff_2.g = buff_1.g - ctrl_pts[idx][1];
buff_2.b = buff_1.b - ctrl_pts[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights[idx][0] * dist);
buff_3.g = buff_2.g + (weights[idx][1] * dist);
buff_3.b = buff_2.b + (weights[idx][2] * dist);
}
buff_4.r = buff_3.r + coefs[0][0] + buff_1.r* coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b* coefs[3][0];
buff_4.g = buff_3.g + coefs[0][1] + buff_1.r* coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b* coefs[3][1];
buff_4.b = buff_3.b + coefs[0][2] + buff_1.r* coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b* coefs[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
new_pix.r = (uc)buff_4.r;
new_pix.g = (uc)buff_4.g;
new_pix.b = (uc)buff_4.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
}
Even with the achieved II=6, the time taken is around 6 seconds; the target is a run time in the millisecond range. I tried pipelining the second innermost loop, but then I run out of resources on my board, because the loop nested inside it gets unrolled. I am using a Zynq UltraScale board, which has a fair amount of resources. Any suggestions on optimizing the code will be highly appreciated.
Also, can anyone suggest what type of interface would be best suited for ctrl_pts, weights and coefs? I understand that a streaming interface helps for reading the image, and that AXI-Lite is preferred for small values like the number of rows and columns. Is there a type of interface I can use for the arrays mentioned above that goes hand in hand with array partitioning and array reshaping?
Any suggestions will be highly appreciated,
Thanks in advance
Edit: I understand that a fixed-point representation can bring down the latency further, but my first goal is to get the best result with the floating-point representation and then analyze the performance with a fixed-point representation.
There are some steps you can take to optimize your design, but bear in mind that if you really need a floating-point square root operation, it will most likely carry a huge latency penalty (unless properly pipelined, of course).
Your code might have a typo in the second inner loop: the index should be j, right?
Data Locality
First off: ctrl_pts is read multiple times from the main memory (I assume). Since it's reused 256x512 times, it would be better to store it in a local buffer on the FPGA (like a BRAM, but it can be inferred), like so:
for(int i =0; i < 3702; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
ctrl_pts_local[i][j] = ctrl_pts[i][j];
}
}
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
// ...
buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
// ...
Same reasoning goes for coefs and weights, just store them in a local variable before running the rest of the code.
To access the arguments you can use a master AXI4 interface m_axi and configure it accordingly. Once the algorithm is dealing with the local buffers, HLS should be able to automatically partition the buffers accordingly. If not, you can place the ARRAY_PARTITION complete dim=0 pragmas to force it.
Dataflow
Because of the way your algorithm works, another thing you could try is to break down the main loops (256x512) into three smaller processes running in dataflow, and so in parallel (+3 if you include the setup ones)
The whole code will look something like this (I hope it renders correctly):
[Compute buff_1]-->[FIFO1]-->[compute buff_3]-->[FIFO2a]-->[compute buff_4 and buff_5 + stream out]
L-------------------------------->[FIFO2b]----^
One tricky thing would be to stream buff_1 to both the next processes.
Possible Code
I won't try this code, so there might be compilation errors along the way, but the whole accelerator code would look something like this:
for(int i =0; i < 3702; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
ctrl_pts_local[i][j] = ctrl_pts[i][j];
weights_local[i][j] = weights[i][j];
}
}
for(int i =0; i < 4; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
coefs_local[i][j] = coefs[i][j];
}
}
Process_1:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
fifo_1.write(buff_1); // <--- WRITE TO FIFOs
fifo_2b.write(buff_1);
}
}
Process_2:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
for(int idx =0; idx < 3702; idx++) {
#pragma HLS LOOP_FLATTEN // <-- It shouldn't be necessary, since the if statements already help
#pragma HLS PIPELINE II=1 // <-- The PIPELINE directive can go here
if (idx == 0) {
buff_1 = fifo_1.read(); // <--- READ FROM FIFO
}
buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
buff_2.g = buff_1.g - ctrl_pts_local[idx][1];
buff_2.b = buff_1.b - ctrl_pts_local[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights_local[idx][0] * dist);
buff_3.g = buff_2.g + (weights_local[idx][1] * dist);
buff_3.b = buff_2.b + (weights_local[idx][2] * dist);
if (idx == 3702 - 1) {
fifo_2a.write(buff_3); // <-- WRITE TO FIFO
}
}
}
}
Process_3:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
buff_3 = fifo_2a.read(); // <--- READ FROM FIFO
buff_1 = fifo_2b.read(); // <--- READ FROM FIFO
buff_4.r = buff_3.r + coefs_local[0][0] + buff_1.r* coefs_local[1][0] + buff_1.g * coefs_local[2][0] + buff_1.b* coefs_local[3][0];
buff_4.g = buff_3.g + coefs_local[0][1] + buff_1.r* coefs_local[1][1] + buff_1.g * coefs_local[2][1] + buff_1.b* coefs_local[3][1];
buff_4.b = buff_3.b + coefs_local[0][2] + buff_1.r* coefs_local[1][2] + buff_1.g * coefs_local[2][2] + buff_1.b* coefs_local[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
new_pix.r = (uc)buff_4.r;
new_pix.g = (uc)buff_4.g;
new_pix.b = (uc)buff_4.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
Be extremely careful when sizing the depth of the FIFOs, since Process 2 (the one with the sqrt operation) might have slower data consumption and production rates! FIFO 2b also needs to account for that delay. If the rates do not match, there will be deadlocks. Make sure to have a meaningful testbench and to cosimulate your design.
(The FIFO's depth can be changed with the pragma #pragma HLS STREAM variable=fifo_1 depth=N).
Final Thoughts
There might be further smaller/detailed optimizations that can be performed along the way, but I would start with the ones above, since they have the biggest impact. Just bear in mind that floating-point processing is not optimal on FPGAs (as you noted) and is usually avoided.
EDIT: I tried the code with the modifications above and achieved II=1 with decent resource usage.
Since the II is now one, the ideal number of cycles the accelerator would take is 256x512x3702 = 485,228,544, and I'm close to that with 485,228,587. One crazy idea I'd now propose is to split the innermost loop of Process_2 into two parallel branches (or even more than two), each feeding its own FIFO. Process_1 would supply the two branches, while an additional process/loop would alternately read the 256x512 elements from the two FIFOs and supply them in the correct order to Process_3. This way, the total number of cycles required should halve, since Process_2 is the slowest process in the dataflow (so improving it improves the whole design). One possible drawback of this approach is a higher area/resource requirement on the FPGA.
Good luck.
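As a quick sanity check on the cycle counts: with II=1, the flattened Process_2 loop starts one iteration per cycle, so its trip count is a lower bound on the total. For 256 x 512 pixels and 3702 control points that is 485,228,544 cycles, within 43 cycles of the measured 485,228,587. The helper below is only an illustrative sketch (ideal_cycles is a hypothetical name, and it ignores pipeline fill and FIFO stalls); it also shows what splitting Process_2 into parallel branches would ideally buy.

```cpp
#include <cassert>
#include <cstdint>

// Lower bound on cycles for a fully pipelined (II=1) loop nest whose
// iterations are split evenly across `branches` parallel copies.
std::uint64_t ideal_cycles(std::uint64_t rows, std::uint64_t cols,
                           std::uint64_t ctrl_pts, std::uint64_t branches) {
    assert(branches > 0);
    const std::uint64_t trips = rows * cols * ctrl_pts;
    return (trips + branches - 1) / branches;  // round up
}
```

With these numbers, ideal_cycles(256, 512, 3702, 1) gives 485,228,544 and two branches give 242,614,272; the clock frequency then decides whether that lands in the required time budget.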

Neural Network Overestimating output of handwritten digits

So, I am trying to create my own neural network. Something really simple.
My input is the MNIST database of handwritten digits.
Input: 28*28 neurons (Images).
Output: 10 neurons (0/1/2/3/4/5/6/7/8/9).
So my network is as follows: 28*28 -> 15 -> 10.
The problem lies in my estimated output. It seems I have a gradient explosion.
The output given by my network is here: https://pastebin.com/EFpBGAZd
As you can see, the first estimated output is wrong, so my network adjusts the weights through backpropagation. But it doesn't seem to update the weights correctly: the estimated output is too high compared to the second highest value.
So the first estimated output keeps being the best estimated output for the subsequent training steps (13 in my example).
My backpropagation code:
VOID BP(NETWORK &Network, double Target[OUTPUT_NEURONS]) {
double DeltaETotalOut = 0;
double DeltaOutNet = 0;
double DeltaErrorNet = 0;
double DeltaETotalWeight = 0;
double Error = 0;
double ErrorTotal = 0;
double OutputUpdatedWeights[OUTPUT_NEURONS*HIDDEN_NEURONS] = { 0 };
unsigned int _indexOutput = 0;
double fNetworkError = 0;
//Calculate Error
for (int i = 0; i < OUTPUT_NEURONS; i++) {
fNetworkError += 0.5*pow(Target[i] - Network.OLayer.Cell[i].Output, 2);
}
Network.Error = fNetworkError;
//Output Neurons
for (int i = 0; i < OUTPUT_NEURONS; i++) {
DeltaETotalOut = -(Target[i] - Network.OLayer.Cell[i].Output);
DeltaOutNet = ActivateSigmoidPrime(Network.OLayer.Cell[i].Output);
for (int j = 0; j < HIDDEN_NEURONS; j++) {
OutputUpdatedWeights[_indexOutput] = Network.OLayer.Cell[i].Weight[j] - 0.5 * DeltaOutNet*DeltaETotalOut* Network.HLayer.Cell[j].Output;
_indexOutput++;
}
}
//Hidden Neurons
for (int i = 0; i < HIDDEN_NEURONS; i++) {
ErrorTotal = 0;
for (int k = 0; k < OUTPUT_NEURONS; k++) {
DeltaETotalOut = -(Target[k] - Network.OLayer.Cell[k].Output);
DeltaOutNet = ActivateSigmoidPrime(Network.OLayer.Cell[k].Output);
DeltaErrorNet = DeltaETotalOut * DeltaOutNet;
Error = DeltaErrorNet * Network.OLayer.Cell[k].Weight[i];
ErrorTotal += Error;
}
DeltaOutNet = ActivateSigmoidPrime(Network.HLayer.Cell[i].Output);
for (int j = 0; j < INPUT_NEURONS; j++) {
DeltaETotalWeight = ErrorTotal * DeltaOutNet*Network.ILayer.Image[j];
Network.HLayer.Cell[i].Weight[j] -= 0.5 * DeltaETotalWeight;
}
}
//Update Weights
_indexOutput = 0;
for (int i = 0; i < OUTPUT_NEURONS; i++) {
for (int j = 0; j < HIDDEN_NEURONS; j++) {
Network.OLayer.Cell[i].Weight[j] = OutputUpdatedWeights[_indexOutput];
_indexOutput++;
}
}
}
How can I solve this issue?
I haven't worked on the hidden layer or the biases; could that be the cause?
Thanks
Well, backpropagation is notoriously hard to implement and especially to debug (I guess everyone who has done it can relate), and it's even harder to debug code written by others.
After a quick look at your code, I'm quite surprised that you calculate a negative delta term. Are you using ReLU or a sigmoid function? I'm quite sure there is more, but I'd suggest you stay away from MNIST until you get your network to solve XOR.
I've written a summary in pseudocode on how to implement backpropagation. I'm sure you'll be able to translate it into C++ quite easily:
Strange convergence in simple Neural Network
In my experience neural networks should really be implemented with matrix operations. This will make your code faster and easier to debug.
The way to debug backpropagation is to use finite differences. For a loss function J(theta), we can approximate the gradient in each dimension with (J(theta + epsilon*d) - J(theta))/epsilon, where d is a one-hot vector selecting that dimension (note the similarity to the definition of a derivative).
https://en.wikipedia.org/wiki/Finite_difference_method
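A minimal sketch of such a gradient check is below. It assumes the loss can be wrapped as a function of a flat parameter vector; numerical_gradient and its signature are hypothetical, not taken from the question's code. Instead of building the one-hot vector d explicitly, it perturbs one entry of theta at a time, which is equivalent.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Forward-difference estimate of the gradient of J at theta:
// grad[k] = (J(theta + eps * e_k) - J(theta)) / eps
std::vector<double> numerical_gradient(
    const std::function<double(const std::vector<double>&)>& J,
    std::vector<double> theta, double eps = 1e-6) {
    assert(eps > 0.0);
    const double base = J(theta);
    std::vector<double> grad(theta.size());
    for (std::size_t k = 0; k < theta.size(); ++k) {
        const double saved = theta[k];
        theta[k] = saved + eps;          // step along dimension k only
        grad[k] = (J(theta) - base) / eps;
        theta[k] = saved;                // restore before the next dimension
    }
    return grad;
}
```

To use it, compare each entry against the gradient your backpropagation computes for the same weights and input; for a correct implementation the two should agree to several digits. For example, with J(theta) = 0.5 * sum(theta_k^2) the estimate should be close to theta itself.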

_kmp huge overhead and spin time for unknown calls in OpenMP?

I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application's duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier, and they look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d and I have no idea if it's called by some of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspot analysis, I found out a little more about the first unknown functions:
And expanding the Function/Call Stack tab:
But again, I can't really understand what these functions are, why they take so long, and why only the master thread works during them while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
1. Compute some serial, non-parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
2. Start the parallel region, where three parallel for loops are used:
   - The first one calls hessianResponse.
   - A single thread adds the results to a shared vector.
   - The second parallel region, localfindAffineShapeArgs, generates the data used by the next parallel region. The two regions can't be merged because of load imbalance.
   - The third region generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the reason of spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
} // end of the parallel region
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}
If you consider a one-time cost of 50 ms (a fraction of the blink of an eye) to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with its threads and data structures) in a persistent way to increase the work done during each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, and your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, too few for a confident analysis. VTune might also have some form of OpenMP instrumentation that provides more accurate results at small time scales. In any case, I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact and method the measurement has.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
Shows an initial thread creation overhead between 3 ms and 200 ms on different systems / thread counts with the Intel OpenMP runtime.

How to fast calculate the normalized l1 and l2 norm of a vector in C++?

I have a matrix X that has n column data vectors in d dimensional space.
Given a vector xj, v[j] is its l1 norm (the sum of all abs(xji)), w[j] is the square of its l2 norm (the sum of all xji^2), and pj[i] is a weighted combination of abs(xji) divided by the l1 norm and xji^2 divided by the squared l2 norm. Finally, I need the outputs pj, v and w for subsequent applications.
// X = new double [d*n]; is the input.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
unsigned long jd = j*d; // offset of column j in X
for (unsigned long i=0; i<d; ++i) {
x_abs[i] = abs(X[i+jd]);
v[j] += x_abs[i];
x_2[i] = x_abs[i]*x_abs[i];
w[j] += x_2[i];
}
for (unsigned long i=0; i<d; ++i){
pj[i] = alpha*x_abs[i]/v[j]+(1-alpha)*x_2[i]/w[j];
}
// functionA(pj){ ... ...} for subsequent applications
}
// functionB(v, w){ ... ...} for subsequent applications
My algorithm above takes O(nd) flops/time. Can anyone help me speed it up by using built-in functions or a new implementation in C++? Reducing the constant factor in O(nd) would also be very helpful.
Let me guess: since you have performance problems, the dimension of your vectors is quite large. If this is the case, then it is worth considering "CPU cache locality" - there is some interesting info on this in a cppcon14 presentation.
If the data is not available in the CPU caches, then the cost of abs-ing or squaring it once it is available is dwarfed by the time the CPU spends just waiting for the data.
With this in mind, you may want to try the following solution (with no guarantee that it will improve performance - the compiler may already apply these techniques when optimizing the code):
for (unsigned long j=0; j<n; ++j) {
// use pointer arithmetic - at > -O0 the compiler will do it anyway
double *start=X+j*d, *end=X+(j+1)*d;
// this part avoids, as much as possible, competition
// on CPU caches between X and v/w.
// Don't store the norms in v/w yet, keep them in registers
double l1norm=0, l2norm=0;
for(double *src=start; src!=end; src++) {
double val=*src;
l1norm+=abs(val);
l2norm+= val*val;
}
// precompute the two scale factors so the second loop only multiplies
double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;
for(double *src=start, *dst=pj; src!=end; src++, dst++) {
// Yes, recomputing abs/sqr may actually save time by not
// creating competition on CPU caches with x_abs and x_2
double val=*src;
*dst = pl1*abs(val) + pl2*val*val;
}
// functionA(pj){ ... ...} for subsequent applications
// Think well if you really need v/w. If you really do,
// at least there are two values to be sent for storage into memory,
//meanwhile the CPU can actually load the next vector into cache
v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications
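For completeness, here is a self-contained sketch of the per-column work that can be compiled and tested in isolation. The function name column_norms and its signature are assumptions made for illustration. It uses std::fabs rather than an unqualified abs, since depending on the headers included, abs can resolve to the integer overload and silently truncate. It also assumes the column is not all zeros, otherwise both divisions are by zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// For one column x of length d, computes the l1 norm, the squared l2 norm,
// and pj[i] = alpha*|x[i]|/l1 + (1-alpha)*x[i]^2/l2sq.
void column_norms(const double* x, std::size_t d, double alpha,
                  double* pj, double& l1, double& l2sq) {
    assert(x != nullptr && pj != nullptr && d > 0);
    double s1 = 0.0, s2 = 0.0;  // kept in locals, written out once at the end
    for (std::size_t i = 0; i < d; ++i) {
        const double val = x[i];
        s1 += std::fabs(val);
        s2 += val * val;
    }
    const double pl1 = alpha / s1;          // one division each instead of d
    const double pl2 = (1.0 - alpha) / s2;
    for (std::size_t i = 0; i < d; ++i) {
        const double val = x[i];
        pj[i] = pl1 * std::fabs(val) + pl2 * val * val;
    }
    l1 = s1;
    l2sq = s2;
}
```

In the loop over j this would be called as column_norms(X + j*d, d, alpha, pj, v[j], w[j]). A handy check is that the entries of pj sum to 1, since each of the two terms sums to its own weight.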