What is the fastest way to do this in C++ (using OpenMP)?

I have an algorithm that I can write in pseudocode as follows:
for (int frame = 0; frame < 1000; frame++)
{
    Image *img = ReadFrame();
    mat processedImage = processImage(img);
    addtompeg(processedImage);
}
ProcessImage is time consuming and takes around 30 seconds. ReadFrame and AddToMpeg are not slow, but they need to be done sequentially (otherwise, frame 2 may be added to the output before frame 1).
How can I parallelize it using OpenMP?
I am using OpenCV for ReadFrame and AddToMpeg.

Technically, OpenMP lets you execute a portion of a for loop in the same order as if the program were sequential, using the ordered clause (see section 2.8.7 here). Anyhow, I would not suggest using this clause, for two reasons:
within one iteration of the loop, a thread must not execute more than one ordered region - which rules out your case, since both reading and writing would have to be ordered
in many implementations an ordered loop behaves much like a sequential loop, with detrimental effects on performance
Therefore, what I would suggest in your case is to process the loop in chunks:
Image *img[chunk];
mat processedImage[chunk];
/* ... (the loop below is assumed to run inside an enclosing #pragma omp parallel region) */
for (int frame = 0; frame < nframes; frame += chunk) {
    #pragma omp single
    { /* Frames are read in sequential order */
        for (int ii = frame; ii < frame + chunk; ii++) {
            img[ii % chunk] = ReadFrame();
        }
    } /* Implicit barrier here */
    #pragma omp for
    for (int ii = frame; ii < frame + chunk; ii++) {
        processedImage[ii % chunk] = processImage(img[ii % chunk]); /* Images are processed in parallel */
    } /* Implicit barrier here */
    #pragma omp single
    { /* Frames are added to the mpeg in sequential order */
        for (int ii = frame; ii < frame + chunk; ii++) {
            addtompeg(processedImage[ii % chunk]);
        }
    } /* Implicit barrier here */
}
The value of chunk depends mainly on memory considerations. If you think that memory will not be a problem, then you can completely remove the outer loop and let the inner loops go from 0 to nframes.
Of course, care must be taken to correctly handle the remainder iterations of the outer loop (which I have not shown in the snippet).

Building on the chunking idea of Massimiliano, a more elegant solution is to use the explicit tasking mechanism of OpenMP 3.0 and later (which means that it would not work with the C++ compiler from Visual Studio):
const int nchunks = 10;

#pragma omp parallel
{
    #pragma omp single
    {
        mat processedImage[nchunks];
        for (int frame = 0; frame < nframes; frame++)
        {
            Image *img = ReadFrame();
            #pragma omp task shared(processedImage)
            {
                processedImage[frame % nchunks] = processImage(img);
                disposeImage(img);
            }
            // nchunks frames read or the last frame reached
            if ((1 + frame) % nchunks == 0 || frame == nframes - 1)
            {
                #pragma omp taskwait
                int chunks = 1 + frame % nchunks;
                for (int i = 0; i < chunks; i++)
                    addtompeg(processedImage[i]);
            }
        }
    }
}
The code might look awkward, but it is conceptually very simple. If it weren't for the OpenMP constructs, it would be just like serial code that buffers up to nchunks processed frames before adding them, in a loop, to the output MPEG file. The magic happens in this block of code:
#pragma omp task shared(processedImage)
{
    processedImage[frame % nchunks] = processImage(img);
    disposeImage(img);
}
This creates a new OpenMP task that executes the two lines of code in the block. img and frame are captured by value, i.e. they are firstprivate, so it is not necessary for img to be an array of pointers. The producer thread passes ownership of img to the task, and the task therefore has to take care of disposing of the image object. It is important here that ReadFrame() allocates each frame in a separate buffer and does not reuse some internal memory each time (I've never used OpenCV and I don't know if this is the case or not). Tasks are queued and executed by idle threads waiting at some task scheduling point. The implicit barrier at the end of the single construct is such a scheduling point, so the remaining threads start executing the queued tasks. Once nchunks frames have been read or the end of the input has been reached, the producer thread waits for all queued tasks to be processed (that's what the taskwait is for) and then simply writes the chunk to the output.
Selecting the proper value of nchunks is important, otherwise some of the threads might end up idling. If ReadFrame and addtompeg are relatively fast, i.e. reading and writing num_threads frames takes less time than processImage, then nchunks should be an exact multiple of the number of threads. If processImage can take a varying amount of time, then you would need to set a really large value of nchunks in order to prevent load imbalance. In this case I would rather try to parallelise processImage itself and keep the processing loop serial.

Related

Delay function makes openFrameworks window freeze until specified time passes

I have a network of vertices and I want to change their color every second. I tried using the Sleep() function and am currently using a different delay function, but both give the same result. Let's say I want to color 10 vertices red with a 1-second pause between each. When I start the project, the window seems to freeze for 10 seconds and then shows every vertex already colored red.
This is my update function.
void ofApp::update() {
    for (int i = 0; i < vertices.size(); i++) {
        ofColor red(255, 0, 0);
        vertices[i].setColor(red);
        delay(1);
    }
}
Here is the draw function:
void ofApp::draw() {
    for (int i = 0; i < vertices.size(); i++) {
        for (int j = 0; j < G[i].size(); j++) {
            ofDrawLine(vertices[i].getX(), vertices[i].getY(), vertices[G[i][j]].getX(), vertices[G[i][j]].getY());
        }
        vertices[i].drawBFS();
    }
}

void vertice::drawBFS() {
    ofNoFill();
    ofSetColor(_color);
    ofDrawCircle(_x, _y, 20);
    ofDrawBitmapString(_id, _x - 3, _y + 3);
    ofColor black(0, 0, 0);
    ofSetColor(black);
}
This is my delay() function:
void ofApp::delay(int number_of_seconds) {
    // converting time into milliseconds
    int milli_seconds = 1000 * number_of_seconds;
    // storing the start time
    clock_t start_time = clock();
    // looping until the required time has passed
    while (clock() < start_time + milli_seconds)
        ;
}
There is no bug in your code, just a misconception about waiting. So this is no definitive answer, just a hint in the right direction.
You have probably just one thread. One thread can do exactly one thing at a time. When you call delay, all this one thread does is check the time over and over again until the given time has passed.
During this time the thread can do nothing else (it cannot magically skip instructions or detect your intentions). So it cannot issue drawing commands or swap buffers to display the vertices on screen. That's the reason why your application seems to freeze - 99.9% of the time it's checking whether the interval has passed. This also places a heavy load on the CPU.
The solution can be a bit tricky and requires threading. Usually you have a UI-Thread that regularly draws stuff, refreshes the display, maybe takes inputs and so on. This thread should never do heavy calculations to keep the UI responsive. A second thread will then manage heavier calculations or updating data.
If you want to run a task at an interval, you don't just loop until the time is over; you essentially "tell the OS" that the second thread should be inactive for a certain period. The OS will manage this far more efficiently than active waiting.
But that's quite a large topic, so I suggest you read on about Multithreading. C++ has a small thread library since C++11. May be worth a look.
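As a purely illustrative sketch of that idea using the standard thread library (not openFrameworks-specific; the coloredCount counter and the worker function are made up): the worker sleeps instead of busy-waiting, and draw()/update() only read the counter.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> coloredCount{0};   // how many vertices should currently be red

void colorWorker(int total) {
    for (int i = 0; i < total; ++i) {
        std::this_thread::sleep_for(std::chrono::seconds(1)); // sleeps, does not burn CPU
        ++coloredCount;                                        // picked up by draw() on its next frame
    }
}

// started once, e.g. from ofApp::setup():
// std::thread(colorWorker, 10).detach();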
// .h
float nextEventSeconds = 0;

// .cpp
void ofApp::draw() {
    float now = ofGetElapsedTimef();
    if (now > nextEventSeconds) {
        // do something here that should only happen
        // every 3 seconds
        nextEventSeconds = now + 3;
    }
}
Following this example I managed to solve my problem.
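For concreteness, a rough sketch of that timer pattern adapted to colouring one vertex per second (the nextEventSeconds and coloredVertices members are assumptions, not part of the original code):
// .h (assumed members)
float nextEventSeconds = 0;
int coloredVertices = 0;

// .cpp
void ofApp::update() {
    float now = ofGetElapsedTimef();
    if (now > nextEventSeconds && coloredVertices < (int) vertices.size()) {
        vertices[coloredVertices].setColor(ofColor(255, 0, 0)); // colour the next vertex
        coloredVertices++;
        nextEventSeconds = now + 1;                              // next vertex in one second
    }
}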

Optimal way to handle array indexing in parallel?

I have the following situation: I have a list of particles in a box of size L, where L is the length of one of the sides.
Next, I split the box into cells, where L/cell_dim = 7. So there are 7*7*7 cells.
Finally, I read through all the particles, note their position, and calculate which cell they are in.
I accomplish the above in an OpenMP parallel for loop. However, I need to capture the information in a thread-safe fashion so that I don't have to loop through all the particles for each cell. So I need some way to record an arbitrary subset of the particles into each cell, in parallel.
The method I have right now uses an OpenMP critical block. I have an array of size [7][7][7][max_particles], where max_particles is the highest number of particles per cell, which is much less than the total number of particles. I record the index of the last particle added in a counter array of size [7][7][7], and update the cell array according to the latest count in my parallel loop:
int cube[7][7][7][10];
int cube_counts[7][7][7] = {0};

#pragma omp parallel for num_threads(a lot)
for (int i = 0; i < num_particles; i++) {
    cell_x = // cell calculation;
    cell_y = // ditto;
    cell_z = // ...;
    #pragma omp critical
    {
        cube_counts[cell_x][cell_y][cell_z] += 1;
        // for readability
        int index = cube_counts[cell_x][cell_y][cell_z];
        cube[cell_x][cell_y][cell_z][index] = i;
    }
}
// rest in pseudocode:
foreach cell:
    adjacent_cell = cell2
    particle_countA = cube_counts[cellx][celly][cellz]
    particle_countB = cube_counts[cell2x][cell2y][cell2z]
    // these two for loops will cover ~2-4 particles,
    // so super small...as a result of the cell analysis above.
    for particle in cell:
        for particle in cell2:
            ...do stuff
Although this works, the code speeds up by a factor of more than 2 when I am able to eliminate the critical block (I am on an Intel coprocessor with 60 physical cores, 240 logical).
How would I accomplish this without the need for the critical block? I thought of using one big array... but then I lose everything I gained when I iterate through the 7*7*7*257 (where 257 is the particle count) array. Linked lists still have the race conditions.
Maybe some kind of unordered, thread-safe list...?
The idea of using a lock instead of the critical section can be taken further:
You may use atomic increment and atomic assignment pseudo-calls ("intrinsics") that the compiler will translate to the correct x86-specific assembly instructions. This is, however, platform- or even compiler-dependent.
If you use a modern C++ compiler (C++11), then std::atomic_* might be the best way to do it.
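As a rough, untested sketch, the critical section could also be replaced by an OpenMP 3.1 atomic capture, which atomically reserves a slot index for each particle (computeCell() is a hypothetical stand-in for the cell calculation):
int cube[7][7][7][10];
int cube_counts[7][7][7] = {0};

#pragma omp parallel for
for (int i = 0; i < num_particles; i++) {
    int cx, cy, cz;
    computeCell(i, cx, cy, cz);            // placeholder for the cell calculation

    int index;
    #pragma omp atomic capture
    index = cube_counts[cx][cy][cz]++;     // atomically fetch the old count and increment it

    cube[cx][cy][cz][index] = i;           // each thread writes to its own reserved slot
}
Note that, unlike the snippet in the question, this takes the index before the increment, so slot 0 is used and the last slot is not overrun.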

OpenMP - Poor performance when solving system of linear equations

I am trying to use OpenMP to parallelize a simple C++ code that solves a system of linear equations by Gauss elimination.
The relevant part of my code is:
#include <iostream>
#include <time.h>
using namespace std;

#define nl "\n"

void LinearSolve(double **& M, double *& V, const int N, bool parallel, int threads) {
    //...
    for (int i = 0; i < N; i++) {
        #pragma omp parallel for num_threads(threads) if(parallel)
        for (int j = i + 1; j < N; j++) {
            double aux, *Mi = M[i], *Mj = M[j];
            aux = Mj[i] / Mi[i];
            Mj[i] = 0;
            for (int k = i + 1; k < N; k++) {
                Mj[k] -= Mi[k] * aux;
            }
            V[j] -= V[i] * aux;
        }
    }
    //...
}

class Time {
    clock_t startC, endC;
    time_t startT, endT;
public:
    void start() { startC = clock(); time(&startT); }
    void end() { endC = clock(); time(&endT); }
    double timedifCPU() { return double(endC - startC) / CLOCKS_PER_SEC; }
    int timedif() { return int(difftime(endT, startT)); }
};

int main() {
    Time t;
    double **M, *V;
    int N = 5000;
    cout << "number of equations " << N << nl << nl;
    M = new double *[N];
    V = new double[N];
    for (int i = 0; i < N; i++) {
        M[i] = new double[N];
    }
    for (int m = 1; m <= 16; m = 2 * m) {
        cout << m << " threads" << nl;
        for (int i = 0; i < N; i++) {
            V[i] = i + 1.5 * i * i;
            for (int j = 0; j < N; j++) {
                M[i][j] = (j + 2.3) / (i - 0.2) + (i + 2) / (j + 3); // some function to get a regular matrix
            }
        }
        t.start();
        LinearSolve(M, V, N, m != 1, m);
        t.end();
        cout << "time " << t.timedif() << ", CPU time " << t.timedifCPU() << nl << nl;
    }
}
Since the code is extremely simple, I would expect the time to be inversely proportional to the number of threads. However, the typical result I get is (the code is compiled with gcc on Linux):
number of equations 5000
1 threads
time 217, CPU time 215.89
2 threads
time 125, CPU time 245.18
4 threads
time 80, CPU time 302.72
8 threads
time 67, CPU time 458.55
16 threads
time 55, CPU time 634.41
There is a decrease in time, but much less than I would like, and the CPU time mysteriously grows.
I suspect the problem is in memory sharing, but I have been unable to identify it. Access to row M[j] should not be a problem, since each thread writes to a different row of the matrix. There could be a problem in reading from row M[i], so I also tried to make a separate copy of this row for each thread by replacing the parallel loop with
#pragma omp parallel num_threads(threads) if(parallel)
{
    double Mi[N];
    for (int j = i; j < N; j++) Mi[j] = M[i][j];
    #pragma omp for
    for (int j = i + 1; j < N; j++) {
        double aux, *Mj = M[j];
        aux = Mj[i] / Mi[i];
        Mj[i] = 0;
        for (int k = i + 1; k < N; k++) {
            Mj[k] -= Mi[k] * aux;
        }
        V[j] -= V[i] * aux;
    }
}
Unfortunately it does not help at all.
I would very much appreciate any help.
Your problem is excessive OpenMP synchronization.
Having the #pragma omp parallel inside the outer loop means that every iteration of that loop pays the whole thread-synchronization overhead.
Take a look at the top-most chart on the image here (more detail can be found on the Allinea MAP OpenMP profiler introduction). The top line is application activity - and dark gray means "OpenMP synchronization" and green means "doing compute".
You can see a lot of dark gray in the right hand side of that top graph/chart - that is when the 16 threads are running. You're spending a lot of time synchronizing.
I also see a lot of time being spent in memory access (more than in compute) - so it is probably memory access that makes what should be a balanced workload highly unbalanced, which in turn causes the synchronization delay.
As the other respondent suggested - it's worth reading other literature for ideas here.
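As a rough, untested sketch of one way to reduce that overhead (keeping the original names): create the thread team once, outside the i loop, and work-share only the inner j loop; the implicit barrier at the end of the omp for still keeps the elimination steps in order.
void LinearSolve(double **& M, double *& V, const int N, bool parallel, int threads) {
    #pragma omp parallel num_threads(threads) if(parallel)
    {
        for (int i = 0; i < N; i++) {       // executed by every thread; i is private
            #pragma omp for
            for (int j = i + 1; j < N; j++) {
                double *Mi = M[i], *Mj = M[j];
                double aux = Mj[i] / Mi[i];
                Mj[i] = 0;
                for (int k = i + 1; k < N; k++)
                    Mj[k] -= Mi[k] * aux;
                V[j] -= V[i] * aux;
            }                               // implicit barrier: step i finishes before step i+1 starts
        }
    }
}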
I think the underlying problem may be that traditional Gaussian elimination may not be suitable for parallelization.
Gaussian elimination is a process whereby each subsequent step relies on the result of the previous step, i.e. each iteration of your linear solve loop depends on the results of the previous iteration, so it must be done serially. Try searching the literature for "parallel row reduction algorithms".
Also, glancing at your code, it looks like you will have a race condition.

Realtime audio application, improving performance

I am currently writing a C++ real-time audio application which roughly contains:
reading frames from a buffer
interpolating frames with the Hermite interpolation here
filtering every frame with two biquad filters (and updating their coefficients every frame)
a 3-band crossover containing 18 biquad calculations
a FreeVerb algorithm from the STK library here
I think this should be manageable for my PC, but I get buffer underflows every so often, so I would like to improve the performance of my application. I have a bunch of questions I hope you can answer. :)
1) Operator Overloading
Instead of working directly with my float samples and doing the calculations per sample,
I pack my floats in a Frame class which contains the left and the right sample. The class overloads some operators for addition, subtraction and multiplication with float.
The filters (mostly biquads) and the reverb work with floats and don't use this class, but the Hermite interpolator and every multiplication and addition for volume control and mixing use it.
Does this have an impact on performance, and would it be better to work with the left and right samples directly?
2) std::function
The callback function from the audio I/O library PortAudio calls a std::function. I use this to encapsulate everything related to PortAudio. So the "user" sets his own callback function with std::bind:
std::bind(&AudioController::processAudio,
          &(*this),
          std::placeholders::_1,
          std::placeholders::_2));
Since for every callback the right function has to be looked up by the CPU (however that works...), does this have an impact, and would it be better to define a class the user has to inherit from?
3) Virtual functions
I use a class called AudioProcessor which declares a virtual function:
virtual void tick(Frame *buffer, int frameCount) = 0;
This function always processes a number of frames at once - depending on the driver, from 200 up to 1000 frames per call.
Within the signal processing path, I call this function 6 times from multiple derived classes. I remember that this dispatch is done with lookup tables (vtables) so the CPU knows exactly which function it has to call. So does calling a virtual (overridden) function have an impact on performance?
The nice thing about this is the structure of the source code, but using inlines instead might perhaps give a performance improvement.
These are all my questions for now. I have some more about Qt's event loop, because I think my GUI uses quite a bit of CPU time as well, but that is another topic I guess. :)
Thanks in advance!
These are all the relevant function calls within the signal processing. Some of them are from the STK library.
The biquad functions are from STK and should perform fine. The same goes for the FreeVerb algorithm.
// ################################ AudioController Function ############################
void AudioController::processAudio(int frameCount, float *output) {
    // CALCULATE LEFT TRACK
    Frame *leftFrameBuffer = (Frame *) output;
    if (leftLoaded) { // the left processor is loaded
        leftProcessor->tick(leftFrameBuffer, frameCount); // (TrackProcessor::tick())
    } else {
        for (int i = 0; i < frameCount; i++) {
            leftFrameBuffer[i].leftSample = 0.0f;
            leftFrameBuffer[i].rightSample = 0.0f;
        }
    }

    // CALCULATE RIGHT TRACK
    if (rightLoaded) { // the right processor is loaded
        // the rightFrameBuffer is allocated once and ensured to have enough space for frameCount Frames
        rightProcessor->tick(rightFrameBuffer, frameCount); // (TrackProcessor::tick())
    } else {
        for (int i = 0; i < frameCount; i++) {
            rightFrameBuffer[i].leftSample = 0.0f;
            rightFrameBuffer[i].rightSample = 0.0f;
        }
    }

    // MIX
    for (int i = 0; i < frameCount; i++) {
        leftFrameBuffer[i] = volume * (leftRightMix * leftFrameBuffer[i] + (1.0 - leftRightMix) * rightFrameBuffer[i]);
    }
}
// ################################ TrackProcessor Function ############################
void TrackProcessor::tick(Frame *frames, int frameNum) {
    if (bufferLoaded && playback) {
        for (int i = 0; i < frameNum; i++) {
            // read from buffer
            frames[i] = bufferPlayer->tick();
            // filter coefficients
            caltulateFilterCoeffs(lowCutoffFilter->tick(), highCutoffFilter->tick());
            // filter
            frames[i].leftSample = lpFilterL->tick(hpFilterL->tick(frames[i].leftSample));
            frames[i].rightSample = lpFilterR->tick(hpFilterR->tick(frames[i].rightSample));
        }
    } else {
        for (int i = 0; i < frameNum; i++) {
            frames[i] = Frame(0, 0);
        }
    }

    // Effect 1, Equalizer
    if (effsActive[0]) {
        insEffProcessors[0]->tick(frames, frameNum);
    }

    // Effect 2, Reverb
    if (effsActive[1]) {
        insEffProcessors[1]->tick(frames, frameNum);
    }

    // Volume
    for (int i = 0; i < frameNum; i++) {
        frames[i].leftSample *= volume;
        frames[i].rightSample *= volume;
    }
}
// ################################ Equalizer ############################
void EqualizerProcessor::tick(Frame *frames, int frameNum) {
    if (active) {
        Frame lowCross;
        Frame highCross;
        for (int f = 0; f < frameNum; f++) {
            lowAmp = lowAmpFilter->tick();
            midAmp = midAmpFilter->tick();
            highAmp = highAmpFilter->tick();

            lowCross = highLPF->tick(frames[f]);
            highCross = highHPF->tick(frames[f]);

            frames[f] = lowAmp * lowLPF->tick(lowCross)
                      + midAmp * lowHPF->tick(lowCross)
                      + highAmp * lowAPF->tick(highCross);
        }
    }
}
// ################################ Reverb ############################
// This function just calls the stk::FreeVerb tick function for every frame.
// The FreeVerb implementation can't really be optimised, so I will take it as it is.
void ReverbProcessor::tick(Frame *frames, int frameNum) {
    if (active) {
        for (int i = 0; i < frameNum; i++) {
            frames[i].leftSample = reverb->tick(frames[i].leftSample, frames[i].rightSample);
            frames[i].rightSample = reverb->lastOut(1);
        }
    }
}
// ################################ Buffer Playback (BufferPlayer) ############################
Frame BufferPlayer::tick() {
    // adjust read position based on loop status
    if (inLoop) {
        while (readPos > loopEndPos) {
            readPos = loopStartPos + (readPos - loopEndPos);
        }
    }

    int x1 = readPos;
    float t = readPos - x1;
    Frame f = interpolate(buffer->frameAt(x1 - 1),
                          buffer->frameAt(x1),
                          buffer->frameAt(x1 + 1),
                          buffer->frameAt(x1 + 2),
                          t);
    readPos += stepSize;
    return f;
}

// interpolation:
Frame BufferPlayer::interpolate(Frame x0, Frame x1, Frame x2, Frame x3, float t) {
    Frame c0 = x1;
    Frame c1 = 0.5f * (x2 - x0);
    Frame c2 = x0 - (2.5f * x1) + (2.0f * x2) - (0.5f * x3);
    Frame c3 = (0.5f * (x3 - x0)) + (1.5f * (x1 - x2));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}

inline Frame BufferPlayer::frameAt(int pos) {
    if (pos < 0) {
        pos = 0;
    } else if (pos >= frames) {
        pos = frames - 1;
    }
    // get chunk and relative sample
    int chunk = pos / ChunkSize;
    int chunkSample = pos % ChunkSize;
    return Frame(leftChunks[chunk][chunkSample], rightChunks[chunk][chunkSample]);
}
Some suggestions on performance improvement:
Optimize Data Cache Usage
Review your functions that operate on a lot of data (e.g. arrays). The functions should load data into cache, operate on the data, then store back into memory.
The data should be organized to best fit into the data cache. Break up the data into smaller blocks if it doesn't fit. Search the web for "data driven design" and "cache optimizations".
In one project, performing data smoothing, I changed the layout of data and gained 70% performance.
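As a purely illustrative sketch (not taken from the post above): one common layout change in audio code is to keep each channel contiguous (struct-of-arrays) rather than interleaved (array-of-structs), so a per-channel pass streams through memory instead of skipping over the other channel.
#include <vector>

// Array-of-structs: left and right samples interleaved, like the Frame class above.
struct FrameAoS { float left; float right; };

// Struct-of-arrays: each channel contiguous, friendlier to the data cache for per-channel filters.
struct BufferSoA {
    std::vector<float> left;
    std::vector<float> right;
};

void applyGain(BufferSoA &buf, float gain) {
    for (float &s : buf.left)  s *= gain;   // sequential, cache-friendly accesses
    for (float &s : buf.right) s *= gain;
}
Which layout wins depends on the access pattern; if most passes touch both channels of the same frame together, the interleaved layout can be just as good.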
Use Multiple Threads
In the big picture, you may be able to use at least three dedicated threads: input, processing and output. The input thread obtains the data and stores it in buffer(s); search the Web for "double buffering". The second thread gets data from the input buffer, processes it, then writes to an output buffer. The third thread writes data from the output buffer to the file.
You may also benefit from using threads for left and right samples. For example, while one thread is processing the left sample, another thread could be processing the right sample. If you could put the threads on different cores, you may see even more performance benefit.
Use GPU Processing
A lot of modern Graphics Processing Units (GPUs) have many cores that can process floating point values. Maybe you could delegate some of the filtering or analysis functions to the cores of the GPU. Be aware that this introduces overhead, and to gain a benefit the processing part must be more compute-intensive than that overhead.
Reduce Branching
Processors prefer to manipulate data over branching. Branching stalls the execution as the processor has to figure out where to get and process the next instruction. Some have large instruction caches that can contain small loops, but there is still a penalty for branching to the top of the loop again. See "Loop Unrolling". Also check your compiler optimization settings and optimize for performance. Many compilers will unroll loops for you if the circumstances are right.
Reduce the Amount of Processing
Do you need to process the entire sample or portions of it? For example, in video processing, much of the frame doesn't change only small portions. So the entire frame doesn't need to be processed. Can the audio channels be isolated so only a few channels are processed rather than the entire spectrum?
Coding to Help the Compiler Optimize
You can help the compiler with optimizations by using the const modifier. The compiler may be able to use different algorithms for variables that don't change versus ones that do. For example, a const value can be placed in the executable code, but a non-const value must be placed in memory.
Using static and const can help too. The static usually implies only one instance. The const implies something that doesn't change. So if there is only one instance of the variable that doesn't change, the compiler can place it into the executable or read-only memory and perform a higher optimization of the code.
Loading multiple variables at the same time can help too. The processor can place the data into the cache. The compiler may be able to use specialized assembly instructions for fetching sequential data.

Function calls in OpenMP loop?

I have been trying to parallelize a particle simulation code I wrote. But in my parallelization, I came away with no increase in performance when moving from 1 processor to 12, and even worse, the code no longer returns accurate results. I have been banging my head against the wall and can't figure this out. Below is the loop being parallelized:
#pragma omp parallel
{
    omp_set_dynamic(1);
    omp_set_num_threads(12);
    #pragma omp for
    // Loop over azimuth ejection angle, from 0-360.
    for (int i = 0; i < 360; i++)
    {
        // Declare temporary variables
        double *y = new double[12];
        vector<double> ejecVel(3);
        vector<double> colLoc(7);
        double azimuth, inc;
        bool collision;

        // Loop over inclination ejection angle from 1-max_angle, increasing by 1 degree.
        for (int j = 1; j <= 15; j++)
        {
            // Update azimuth and inclination angle and get velocity direction vector.
            azimuth = (double) i;
            inc = (double) j;
            ejecVel = Jet::GetEjecVelocity(azimuth, inc);
            collision = false;

            // Update initial conditions.
            y[0] = m_parPos[0];
            y[1] = m_parPos[1];
            ... (define pointer values)

            // Simulate particle
            systemSolver.ParticleSim(simSteps, dt, y, collision, colLoc);
            if (collision == true)
            {
                cout << "Collision! " << endl;
            }
        }
        delete [] y;
    }
}
The goal is to loop through, simulating particles for different initial conditions over the loops, and store where they have gone and their state vector upon collision in the master variables densCount and collisionStates. The simulation takes place in a function of another class (systemSolver.ParticleSim()), and it seems like each solve from a different thread is not independent. Everything I've read suggests that it should be, but I can't figure out why else the result would be wrong only when OpenMP is enabled. Any thoughts are greatly appreciated.
-ben
SOLUTION: The simulation was modifying a member variable of a separate (systemSolver) class. Since I provided a single class object to all threads, they were all simultaneously modifying an important member variable. Thought I would post this in case any other n00bs encounter a similar problem.
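For anyone hitting the same issue, a rough sketch of that fix (assuming the solver class is called SystemSolver and is cheap to copy; otherwise construct one instance per thread):
#pragma omp parallel
{
    SystemSolver localSolver = systemSolver;   // each thread gets its own solver, so no shared member is mutated
    #pragma omp for
    for (int i = 0; i < 360; i++)
    {
        // ... same per-iteration setup as in the original loop ...
        localSolver.ParticleSim(simSteps, dt, y, collision, colLoc);
    }
}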
I believe one mistake is the call to the omp_set_* functions inside the parallel region. At best, they take effect only for subsequent parallel regions. Try reordering as follows:
omp_set_dynamic(1);
omp_set_num_threads(12);
#pragma omp parallel