I have little experience with parallel programming and was wondering if anyone could have a quick glance at a bit of code I've written and see if there are any obvious ways I can improve the efficiency of the computation.
The difficulty arises from the fact that I need to compute several matrix operations of unequal dimensionality, so I'm not sure of the most compact way to code the computation.
Below is my code. Note this code DOES work. The matrices I am working with are of dimension approx 700x700 [see int s below] or 700x30 [int n].
Also, I am using the Armadillo library for my sequential code. It may be the case that parallelizing with OpenMP while retaining the Armadillo matrix classes is slower than defaulting to the standard library; does anyone have an opinion on this (before I spend hours overhauling!)?
double start, end, dif;
int i,j,k; // iteration counters
int s,n; // matrix dimensions
mat B; B.load(...location of stored s*n matrix...); // input objects loaded from file
mat I; I.load(...s*s matrix...);
mat R; R.load(...s*n matrix...);
mat D; D.load(...n*n matrix...);
double e = 0.1; // scalar parameter
s = B.n_rows; n = B.n_cols;
mat dBdt; dBdt.zeros(s,n); // object for storing output of function
// 100X sequential computation using Armadillo linear algebraic functionality
start = omp_get_wtime();
for (int r=0; r<100; r++) {
    dBdt = B % (R - (I * B)) + (B * D) - (B * e);
}
end = omp_get_wtime();
dif = end - start;
cout << "Seq computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
// 100 * parallel computation using OpenMP
omp_set_num_threads(8);
start = omp_get_wtime();
for (int r=0; r<100; r++) {
    // parallel computation of I * B
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < s; k++) {
                dBdt(i, j) += I(i, k) * B(k, j);
            }
        }
    }
    // parallel computation of B % (R - (I * B))
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            dBdt(i, j) = R(i, j) - dBdt(i, j);
            dBdt(i, j) *= B(i, j);
            dBdt(i, j) -= B(i, j) * e;
        }
    }
    // parallel computation of B * D
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                dBdt(i, j) += B(i, k) * D(k, j);
            }
        }
    }
}
end = omp_get_wtime();
dif = end - start;
cout << "OMP computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
If I hyper-thread 4 cores (8 threads), I get the following output:
Seq computation: 5.54926e-10
relaxation time = 0.130031
OMP computation: 5.54926e-10
relaxation time = 2.611040
This suggests that although both methods produce the same result, the parallel formulation is roughly 20 times slower than the sequential one.
It is possible that for matrices of this size, the overheads involved in this 'variable-dimension' problem outweigh the benefits of parallelizing. Any insights would be much appreciated.
Thanks in advance,
Jack
If you use a compiler which corrects your badly ordered loop nests and fuses loops to improve memory locality for non-parallel builds, OpenMP will likely disable those optimizations. As recommended by others, you should consider an optimized library such as MKL or ACML. The default BLAS typically provided with distros is not multithreaded.
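To make that concrete: Armadillo forwards products such as I * B and B * D to whatever BLAS it is linked against, so linking against a multithreaded BLAS gives you parallel matrix products without changing the source. A minimal sketch, assuming OpenBLAS is installed (link flags and environment variable names depend on your particular build):
// Same expression as in the question, with random placeholder data instead of .load(...).
// Build e.g. with:  g++ -O3 dbdt.cpp -o dbdt -larmadillo -lopenblas
// The BLAS thread count is then controlled by OPENBLAS_NUM_THREADS (or OMP_NUM_THREADS,
// depending on how the BLAS was built), not by omp_set_num_threads() in this code.
#include <armadillo>
#include <iostream>
using namespace arma;

int main() {
    mat B(700, 30, fill::randu);     // placeholder for B.load(...)
    mat I(700, 700, fill::randu);    // placeholder for I.load(...)
    mat R(700, 30, fill::randu);     // placeholder for R.load(...)
    mat D(30, 30, fill::randu);      // placeholder for D.load(...)
    double e = 0.1;

    mat dBdt = B % (R - (I * B)) + (B * D) - (B * e);
    std::cout << dBdt(0, 0) << std::endl;
    return 0;
}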
The Art of HPC is exactly about efficiency (poor grants never get HPC cluster quota),
so the first hope is that your process never re-reads those matrices from file.
Why? This would be an HPC-killer:
"I need to repeat this computation many thousands of times"
Fair enough to say, this comment increases the need to completely review the approach and to re-design the solution so that it does not rely on a few tricks, but indeed gains from your case-specific arrangement.
Last but not least, the [PARALLEL] scheduling is not needed here; a "just"-[CONCURRENT] process scheduling is quite enough. There is no need to orchestrate any explicit inter-process synchronisation or any message-passing, and the process can simply be orchestrated for the best performance possible.
No "...quick glance at a bit of code..." will help here.
You first need to understand both your whole process and the hardware resources it will be executed on.
The CPU type tells you the available instruction-set extensions for advanced tricks; the L3- / L2- / L1-cache sizes plus the cache-line sizes help you decide on the most cache-friendly re-use of cheap data access (not paying hundreds of [ns] where one can operate smarter, in just a few [ns], on a not-yet-evicted NUMA-core-local copy).
The Maths first, implementation next:
As given dBdt = B % ( R - (I * B) ) + ( B * D ) - ( B * e )
On a closer look, one ought to realise the HPC / cache-alignment priorities and the wrong-looping traps:
dBdt = B % ( R - ( I * B ) )    ELEMENT-WISE OP, consuming B[s,n] COLUMN-WISE
     + ( B * D )                SUM.PRODUCT  OP, consuming B[s,n] ROW-WISE times a D[n,n] COLUMN
     - ( B * e )                ELEMENT-WISE OP, consuming B[s,n] ROW-WISE times a SCALAR
Operand shapes: dBdt[s,n] = B[s,n] % ( R[s,n] - I[s,s] . B[s,n] ) + B[s,n] . D[n,n] - B[s,n] * e
Having this in mind, efficient HPC loops will look much different.
Depending on the real CPU caches, the loop may very efficiently co-process the naturally B-row-aligned ( B * D ) - ( B * e ) in a single phase, together with the element-wise, longest-pipeline part B % ( R - ( I * B ) ), which has the highest re-use efficiency: there is a chance to re-use on the order of ~ 1000 x ( n - 1 ) cache hits on B-column-aligned data, which ought to fit quite well into the L1 data-cache footprint, so savings on the order of seconds can come just from cache-aligned loops.
Only after this cache-friendly loop alignment is finished may distributed processing help, not before.
So, an experimentation plan setup:
Step 0: The ground truth: ~ 0.13 [s] for dBdt[700,30] using Armadillo in the 100-iteration test loop.
Step 1: The manual-serial: test the rewards of the best cache-aligned code (not the posted one, but a math-equivalent, cache-line-re-use optimised one; see the sketch after this list). There ought to be no more than four 2-nested for(){...} code blocks, with the remaining two loops inside, so as to meet the linear-algebra rules without devastating the benefits of cache-line alignment. (There is some residual potential to gain a bit more in [PTIME] from a duplicated [PSPACE] data layout, keeping both a FORTRAN-order and a C-order copy for the respective re-reading strategies, since the matrices are miniature in size and the L2- / L1-data-cache available per CPU core is generously large.)
Step 2: The manual-omp( <= NUMA_cores - 1 ): test whether omp can indeed yield any "positive" Amdahl's-Law speedup (beyond the omp setup overhead costs). A careful process-to-CPU-core affinity mapping may help avoid cache eviction introduced by non-HPC threads spoiling the cache-friendly layout: reserve a set of ( NUMA_cores - 1 ) cores for the HPC process and affinity-map all other (non-HPC) processes onto the last, shared CPU core, which helps the HPC cores retain their cache lines un-evicted by any kernel/scheduler-injected non-HPC thread.
( As seen in Step 2, there are arrangements, derived from HPC best practices, that no compiler (even a magic-wand-equipped one) would ever be able to implement, so do not hesitate to ask your PhD tutor for a helping hand if your thesis needs some HPC expertise; it is not easy to build this on trial-and-error in such an expensive experimental domain when your primary domain is not linear algebra and/or ultimate CS-theoretic / HW-specific cache-strategy optimisation. )
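As a starting point for Step 1, here is a minimal sketch of the fused, single-pass evaluation, reusing the question's variable names; it is the plain fusion of the four terms, not yet the fully cache-blocked version described above (the inner products still walk rows of the column-major I and B, so pre-transposing those copies would help further):
// Hedged sketch of a fused, single-pass dB/dt evaluation (Step 1, manual-serial).
// Column-major outer traversal to match Armadillo's storage order; not cache-blocked.
#include <armadillo>

arma::mat dBdt_fused(const arma::mat& B, const arma::mat& I,
                     const arma::mat& R, const arma::mat& D, double e)
{
    const arma::uword s = B.n_rows, n = B.n_cols;
    arma::mat dBdt(s, n);
    for (arma::uword j = 0; j < n; ++j) {
        for (arma::uword i = 0; i < s; ++i) {
            double ib = 0.0;                        // (I * B)(i,j); walks a row of I
            for (arma::uword k = 0; k < s; ++k)
                ib += I(i, k) * B(k, j);
            double bd = 0.0;                        // (B * D)(i,j); walks a row of B
            for (arma::uword k = 0; k < n; ++k)
                bd += B(i, k) * D(k, j);
            dBdt(i, j) = B(i, j) * (R(i, j) - ib) + bd - B(i, j) * e;
        }
    }
    return dBdt;
}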
Epilogue:
Using smart tools in an inappropriate way does not bring anything more than additional overheads ( task-splits/joins + memory-translations ( worse with atomic-locking ( worst with blocking / fence / barriers ) ) ).
#include <iostream>
using namespace std;
int f(int n, int m){
    if (n==1)
        return 0;
    else
        return f(n-1, m) + m;
}
int main()
{
    cout << f(3874, 1000);
    cout << endl;
    return 0;
}
The result is 3873000. Why does this end up multiplying m by n minus 1, and how does the function work in detail?
The else block is executed at all levels of recursion, except the deepest one.
The number of levels in the recursion tree is n, so the else block is executed n-1 times.
This else block first makes the recursive call, then adds m to the result it gets back, and returns that sum to the caller, who will do the same, etc., until the walk back up the recursion tree is complete.
The original caller will thus see a base number (0) to which m was repeatedly added, exactly n-1 times.
So the function calculates m(n-1), provided that n is greater than 0. If not, the recursion will run into a stack overflow error.
Visualisation
To visualise this, let's split the second return statement into two parts, where first the result of the recursive call is stored in a variable, and then the sum is returned. Also, let's take a small value for n, like 3.
So this is then the code:
int f(int n, int m){
    if (n==1)
        return 0;
    else {
        int result = f(n-1, m);
        return result + m;
    }
}
int main()
{
    cout << f(3, 10);
}
We can imagine each function execution (starting with main) as a box (a frame), in which local variables live their lives. Each recursive call creates a new box, and when return is executed that box vanishes again.
So we can imagine the above code to execute like this:
+-[main]----------------------------+
| f(3, 10) ... |
| +-[f]-------------------------+ |
| | n = 3, m = 10 | |
| | f(3-1, 10) ... | |
| | +-[f]-------------------+ | |
| | | n = 2, m = 10 | | |
| | | f(2-1, 10) ... | | |
| | | +-[f]-------------+ | | |
| | | | n = 1, m = 10 | | | |
| | | | return 0 | | | |
| | | +-----------------+ | | |
| | | result = 0 | | |
| | | return 0 + 10 | | |
| | +-----------------------+ | |
| | result = 10 | |
| | return 10 + 10 | |
| +-----------------------------+ |
| cout << 20 |
+-----------------------------------+
I hope this clarifies it.
The algorithm solves the recurrence
F(n) = F(n-1) + m
with
F(1) = 0.
(I removed m as an argument, as its value is constant).
We have
F(n) = F(n-1) + m = F(n-2) + 2m = F(n-3) + 3m = ... = F(1) + (n-1)m.
As written elsewhere, the recursion depth is n, which is dangerous.
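If the recursion depth is a concern, the same value can be computed iteratively, or directly as m*(n-1); a minimal sketch:
#include <iostream>

// Iterative equivalent of f(n, m): adds m exactly (n - 1) times,
// so there is no recursion depth to worry about.
int f_iter(int n, int m) {
    int result = 0;
    for (int i = 1; i < n; ++i)
        result += m;
    return result;            // equals m * (n - 1) for n >= 1
}

int main() {
    std::cout << f_iter(3874, 1000) << std::endl;   // prints 3873000
}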
I am new to image processing. We have a requirement to get circle centers with sub-pixel accuracy from an image. I have used median blurring to reduce the noise. A portion of the image is shown below. The steps I followed for getting circle boundaries are given below:
Reduced the noise with medianBlur
Applied OTSU thresholding with threshold API
Identified circle boundaries with findContours method.
I get different results when I use different kernel sizes for medianBlur. I selected medianBlur to preserve edges. I tried kernel sizes 3, 5 and 7. Now I am unsure which kernel size is the right one for medianBlur.
How can I decide the right kernel size?
Is there any scientific approach to decide the right kernel size for medianBlur?
I will give you two suggestions here for how to find the centroids of these disks; you can pick one depending on the level of precision you need.
First of all, using contours is not the best method. Contours depend a lot on which pixels happen to fall within the object on thresholding; noise affects these a lot.
A better method is to find the center of mass (or rather, the first order moments) of the disks. Read Wikipedia to learn more about moments in image analysis. One nice thing about moments is that we can use pixel values as weights, increasing precision.
You can compute the moments of a binary shape from its contours, but you cannot use image intensities in this case. OpenCV has a function cv::moments that computes the moments for the whole image, but I don't know of a function that can do this for each object separately. So instead I'll be using DIPlib for these computations (I'm an author).
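For reference, the intensity-weighted centroid that these first-order moments give is simply

\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \bar{y} = \frac{\sum_i w_i y_i}{\sum_i w_i}

where the sums run over the pixels of one object and the weight w_i is 1 for a binary shape (Method 1 below) or the pixel's gray value / fuzzy membership (Method 2 below).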
Regarding the filtering:
Any well-behaved linear smoothing should not affect the center of mass of the objects, as long as the objects are far enough from the image edge. Being close to the edge will cause the blur to do something different on the side of the object closest to the edge compared to the other sides, introducing a bias.
Any non-linear smoothing filter has the ability to change the center of mass. Please avoid the median filter.
So, I recommend that you use a Gaussian filter, which is the most well-behaved linear smoothing filter.
Method 1: use binary shape's moments:
First I'm going to threshold without any form of blurring.
import diplib as dip
a = dip.ImageRead('/Users/cris/Downloads/Ef8ey.png')
a = a(1) # Use green channel only, simple way to convert to gray scale
_, t = dip.Threshold(a)
b = a<t
m = dip.Label(b)
msr = dip.MeasurementTool.Measure(m, None, ['Center'])
print(msr)
This outputs
| Center |
- | ----------------------- |
| dim0 | dim1 |
| (px) | (px) |
- | ---------- | ---------- |
1 | 18.68 | 9.234 |
2 | 68.00 | 14.26 |
3 | 19.49 | 48.22 |
4 | 59.68 | 52.42 |
We can now apply a smoothing to the input image a and compute again:
a = dip.Gauss(a,2)
_, t = dip.Threshold(a)
b = a<t
m = dip.Label(b)
msr = dip.MeasurementTool.Measure(m, None, ['Center'])
print(msr)
| Center |
- | ----------------------- |
| dim0 | dim1 |
| (px) | (px) |
- | ---------- | ---------- |
1 | 18.82 | 9.177 |
2 | 67.74 | 14.27 |
3 | 19.51 | 47.95 |
4 | 59.89 | 52.39 |
You can see there's some small change in the centroids.
Method 2: use gray scale moments:
Here we use the error function to apply a pseudo-threshold to the image. What this does is set object pixels to 1 and background pixels to 0, but pixels around the edges retain some intermediate value. Some people refer to this as a "fuzzy thresholding". These two images show the normal ("hard") threshold, and the error function clip ("fuzzy threshold"):
By using this fuzzy threshold, we retain more information about the exact (sub-pixel) location of the edges, which we can use when computing the first order moments.
import diplib as dip
a = dip.ImageRead('/Users/cris/Downloads/Ef8ey.png')
a = a(1) # Use green channel only, simple way to convert to gray scale
_, t = dip.Threshold(a)
c = dip.ContrastStretch(-dip.ErfClip(a, t, 30))
m = dip.Label(a<t)
m = dip.GrowRegions(m, None, -2, 2)
msr = dip.MeasurementTool.Measure(m, c, ['Gravity'])
print(msr)
This outputs
| Gravity |
- | ----------------------- |
| dim0 | dim1 |
| (px) | (px) |
- | ---------- | ---------- |
1 | 18.75 | 9.138 |
2 | 67.89 | 14.22 |
3 | 19.50 | 48.02 |
4 | 59.79 | 52.38 |
We can now apply a smoothing to the input image a and compute again:
a = dip.Gauss(a,2)
_, t = dip.Threshold(a)
c = dip.ContrastStretch(-dip.ErfClip(a, t, 30))
m = dip.Label(a<t)
m = dip.GrowRegions(m, None, -2, 2)
msr = dip.MeasurementTool.Measure(m, c, ['Gravity'])
print(msr)
| Gravity |
- | ----------------------- |
| dim0 | dim1 |
| (px) | (px) |
- | ---------- | ---------- |
1 | 18.76 | 9.094 |
2 | 67.87 | 14.19 |
3 | 19.50 | 48.00 |
4 | 59.81 | 52.39 |
You can see the differences are smaller this time, because the measurement is more precise.
In the binary case, the differences in centroids with and without smoothing are:
array([[ 0.14768417, -0.05677508],
[-0.256 , 0.01668085],
[ 0.02071882, -0.27547569],
[ 0.2137167 , -0.03472741]])
In the gray-scale case, the differences are:
array([[ 0.01277204, -0.04444567],
[-0.02842993, -0.0276569 ],
[-0.00023144, -0.01711335],
[ 0.01776011, 0.01123299]])
If the centroid measurement is given in µm rather than px, it is because your image file contains pixel size information. The measurement function will use this to give you real-world measurements (the centroid coordinate is w.r.t. the top-left pixel). If you do not desire this, you can reset the image's pixel size:
a.SetPixelSize(1)
The two methods in C++
This is a translation to C++ of the code above, including a display step to double-check that the thresholding produced the right result:
#include "diplib.h"
#include "dipviewer.h"
#include "diplib/simple_file_io.h"
#include "diplib/linear.h" // for dip::Gauss()
#include "diplib/segmentation.h" // for dip::Threshold()
#include "diplib/regions.h" // for dip::Label()
#include "diplib/measurement.h"
#include "diplib/mapping.h" // for dip::ContrastStretch() and dip::ErfClip()
int main() {
    auto a = dip::ImageRead("/Users/cris/Downloads/Ef8ey.png");
    a = a[1]; // Use green channel only, simple way to convert to gray scale
    dip::Gauss(a, a, {2});
    dip::Image b;
    double t = dip::Threshold(a, b);
    b = a < t; // Or: dip::Invert(b,b);
    dip::viewer::Show(a);
    dip::viewer::Show(b); // Verify that the segmentation is correct
    dip::viewer::Spin();
    auto m = dip::Label(b);
    dip::MeasurementTool measurementTool;
    auto msr = measurementTool.Measure(m, {}, { "Center"});
    std::cout << msr << '\n';
    auto c = dip::ContrastStretch(-dip::ErfClip(a, t, 30));
    dip::GrowRegions(m, {}, m, -2, 2);
    msr = measurementTool.Measure(m, c, {"Gravity"});
    std::cout << msr << '\n';
    // Iterate through the measurement structure:
    auto it = msr["Gravity"].FirstObject();
    do {
        std::cout << "Centroid coordinates = " << it[0] << ", " << it[1] << '\n';
    } while(++it);
}
This may not be elegant, chiefly because I am relatively new to C++, but this little program I am putting together is stumbling here.
I don't get it. Have I misunderstood arrays? The edited code is:
int diceArray [6][3][1] = {};
...
}else if (y >= xSuccess || x >= xSuccess){
// from here...
diceArray[2][1][0] = diceArray[2][1][0] + 1;
diceArray[2][1][1] = diceArray[2][1][1] + 1;
// ...to here, diceArray[2][2][0] increases by 1. I am not referencing that part of the array at all. Or am I?
}
By using comments I tracked the culprit down to the second expression. If I comment out the first one diceArray[2][2][0] does not change.
Why is diceArray[2][1][1] = diceArray[2][1][1] + 1 causing diceArray[2][2][0] to increment?
I tried..
c = diceArray[2][1][1] + 1;
diceArray[2][1][1] = c;
..as a workaround but it was just the same. It increased diceArray[2][2][0] by one.
You are indexing out of bounds. If I declare such an array
int data [3];
Then the valid indices are
data[0]
data[1]
data[2]
The analog to this is that you declare
int diceArray [6][3][1]
                     ^
But then try to assign to
diceArray[2][1][0]
                ^
diceArray[2][1][1] // This is out of range
                ^
Since you are assigning out of range, the pointer arithmetic of array indexing (the striding of each dimension) means you are actually writing into the next sub-array along, which is why diceArray[2][2][0] changes.
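If the innermost dimension is really meant to hold two counters (an assumption about your intent), the minimal fix is to declare it with size 2, which makes both assignments legal:
int diceArray [6][3][2] = {};   // innermost dimension now has the valid indices 0 and 1
...
diceArray[2][1][0] = diceArray[2][1][0] + 1;
diceArray[2][1][1] = diceArray[2][1][1] + 1;   // in range now; no longer touches diceArray[2][2][0]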
The variable is declared as:
int diceArray [6][3][1] = {};
This is how it looks in memory:
+---+ -.
| | <- diceArray[0][0] \
+---+ \
| | <- diceArray[0][1] > diceArray[0]
+---+ /
| | <- diceArray[0][2] /
+---+ -'
| | <- diceArray[1][0] \
+---+ \
| | <- diceArray[1][1] > diceArray[1]
+---+ /
| | <- diceArray[1][2] /
+---+ -'
. . .
. . .
. . .
+---+ -.
| | <- diceArray[5][0] \
+---+ \
| | <- diceArray[5][1] > diceArray[5]
+---+ /
| | <- diceArray[5][2] /
+---+ -'
The innermost component of diceArray is an array of size 1.
C/C++ arrays are always indexed starting from 0, which means the only valid index in an array of size 1 is 0.
During compilation, a reference to diceArray[x][y][z] is converted, using pointer arithmetic, to the offset x*3*1 + y*1 + z (in units of int) from the memory address of diceArray.
The code:
diceArray[2][1][1] = diceArray[2][1][1] + 1;
operates on offset 8 (= 2*3*1 + 1*1 + 1) inside diceArray. The same offset is computed for diceArray[2][2][0] (= 2*3*1 + 2*1 + 0), which is a legal access inside the array.
Modern compilers are usually able to detect this kind of error and warn you at compile time.
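A quick way to see the aliasing (illustration only; the out-of-range write itself is undefined behaviour) is to compare the two addresses, which coincide:
#include <iostream>

int main() {
    int diceArray[6][3][1] = {};
    // One past the end of diceArray[2][1] ...
    std::cout << static_cast<void*>(diceArray[2][1] + 1) << std::endl;
    // ... is the same address as the first element of diceArray[2][2]:
    std::cout << static_cast<void*>(diceArray[2][2]) << std::endl;
}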
How do I properly access org-mode table data in C++?
#+tblname: prob-calc
| a | 353.02 |
| b | 398.00 |
| c | 241.0 |
| d | 1 |
#+begin_src C++ :var tbl=prob-calc :includes <stdio.h> :results output
// in other languages, say python, you can just evaluate tbl to
// see the values (and of course access them in the usual python
// way. Same for R, Common Lisp. Is it possible with C++? My
// suspicion is that it can't be done in C++.
// What goes here to do it?
#+end_src
Thanks in advance
This seems to work:
#+tblname: prob-calc
| a | 353.02 |
| b | 398.00 |
| c | 241.0 |
| d | 1 |
#+begin_src C++ :var tbl=prob-calc :includes <iostream> <cstdlib> :results output
int row, col;
for (row=0; row < tbl_rows; row++) {
    for (col=0; col < tbl_cols; col++) {
        std::cout << tbl[row][col] << " ";
    }
    std::cout << "\n";
}
#+end_src
#+RESULTS:
: a 353.02
: b 398.0
: c 241.0
: d 1
On Linux, the source file is written in /tmp/babel-<mumble>/C-src-<mumble>.cpp, and the table declaration in that file looks like this:
const char* tbl[4][2] = {
{"a","353.02"},
{"b","398.0"},
{"c","241.0"},
{"d","1"}
};
const int tbl_rows = 4;
const int tbl_cols = 2;
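Since babel passes the table in as const char* strings, numeric columns still need an explicit conversion. A small variation of the block above (an untested sketch along the same lines) that sums the second column with std::atof:
#+begin_src C++ :var tbl=prob-calc :includes <iostream> <cstdlib> :results output
double total = 0.0;
for (int row = 0; row < tbl_rows; row++) {
    total += std::atof(tbl[row][1]);   // the numeric column arrives as a string
}
std::cout << "sum of column 2 = " << total << "\n";
#+end_src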
I need to write a function that takes as arguments an integer, which represents a row in a truth table, and a boolean array, in which it stores the values for that row of the truth table.
Here is an example truth table
Row| A | B | C |
1 | T | T | T |
2 | T | T | F |
3 | T | F | T |
4 | T | F | F |
5 | F | T | T |
6 | F | T | F |
7 | F | F | T |
8 | F | F | F |
Please note that a given truth table could have more or fewer rows than this table, since the number of possible variables can change.
A function prototype could look like this
getRow(int rowNum, bool boolArr[]);
If this function was called, for example, as
getRow(3, boolArr)
It would need to return an array with the following elements
|1|0|1| (or |T|F|T|)
The difficulty for me arises because the number of variables can change, therefore increasing or decreasing the number of rows. For instance, the list of variables could be A, B, C, D, E, and F instead of just A, B, and C.
I think the best solution would be to write a loop that counts up to the row number and essentially changes the elements of the array as if it were counting in binary, so that
1st loop iteration, array elements are 0|0|...|0|1|
2nd loop iteration, array elements are 0|0|...|1|0|
I can't for the life of me figure out how to do this, and can't find a solution elsewhere on the web. Sorry for all the confusion and thanks for the help
OK, now that you have rewritten your question to be much clearer: first, getRow needs to take an extra argument, the number of bits. Row 1 with 2 bits produces a different result than row 1 with 64 bits, so we need a way to differentiate that. Second, typically with C++, everything is zero-indexed, so I am going to shift your truth table down one row so that row "0" returns all trues.
The key here is to realize that the row number in binary is already what you want. Take this row (having shifted row 4 down to 3):
3 | T | F | F |
3 in binary is 011, which inverted is {true, false, false} - exactly what you want. We can express that using bitwise-and as the array:
{!(3 & 0x4), !(3 & 0x2), !(3 & 0x1)}
So it's just a matter of writing that as a loop:
void getRow(int rowNum, bool* arr, int nbits)
{
    int mask = 1 << (nbits - 1);
    for (int i = 0; i < nbits; ++i, mask >>= 1) {
        arr[i] = !(rowNum & mask);
    }
}
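A minimal usage sketch, assuming the three-variable table from the question (so nbits = 3):
#include <iostream>

// getRow as defined above
void getRow(int rowNum, bool* arr, int nbits)
{
    int mask = 1 << (nbits - 1);
    for (int i = 0; i < nbits; ++i, mask >>= 1) {
        arr[i] = !(rowNum & mask);
    }
}

int main() {
    const int nbits = 3;
    bool row[nbits];
    getRow(3, row, nbits);                         // zero-indexed row 3 -> T F F
    for (int i = 0; i < nbits; ++i)
        std::cout << (row[i] ? 'T' : 'F') << ' ';  // prints: T F F
    std::cout << std::endl;
}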