My algorithm computes the vertices of a shape in 3D space using a two-dimensional loop that iterates over the U and V segments.
for (LONG i = 0; i < info.useg + 1; i++) {
    // Calculate the u-parameter.
    u = info.umin + i * info.udelta;
    for (LONG j = 0; j < info.vseg + 1; j++) {
        // Calculate the v-parameter.
        v = info.vmin + j * info.vdelta;
        // Compute the point's position.
        point = calc_point(op, &info, u, v);
        // Set the point to the object and increase the point-index.
        points[point_i] = point;
        point_i++;
    }
}
The points array, however, is one-dimensional, which is why point_i is incremented on each iteration. I know that I could instead compute the index via point_i = i * info.vseg + j.
I want this loop to be multithreaded. My aim was to create a number of threads that each process a specific range of points. In each thread, I'd do something like:
for (LONG x = start; x <= end; x++) {
    LONG i = // ...
    LONG j = // ...
    Real u = info.umin + i * info.udelta;
    Real v = info.vmin + j * info.vdelta;
    points[x] = calc_point(op, &info, u, v);
}
The problem is calculating the i and j indices from the linear point-index. How can I compute i and j when (I believe):
point_i = i * vsegments + j
I'm unable to solve the math here.
point_i = i * vsegments + j gives you:
i = point_i / vsegments  (integer division)
j = point_i % vsegments
Of course, your loops actually do segments + 1 iterations each (indices 0 through segments), so you need to use vsegments + 1 in place of vsegments.
As a side note: do you actually need to merge the loops into one for multithreading? I would expect the outer loop to typically have enough iterations to saturate your available cores anyway.
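For illustration, here is a minimal sketch of the per-thread loop built on these formulas (start and end are hypothetical names for the thread's linear-index range):

// Per-thread work: process linear indices [start, end].
const LONG W = info.vseg + 1; // points per row (the inner loop runs vseg + 1 times)
for (LONG x = start; x <= end; x++) {
    LONG i = x / W; // row index (integer division)
    LONG j = x % W; // column index
    Real u = info.umin + i * info.udelta;
    Real v = info.vmin + j * info.vdelta;
    points[x] = calc_point(op, &info, u, v);
}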
Intro
For an N-body simulation I need to calculate the total force F_i that each body receives from the other bodies. This is an O(n^2) problem, because the pairwise forces must be computed for every body. By Newton's third law, f_ij = -f_ji, we can cut the number of force calculations in half.
Problem
How can I implement this law in my code to optimize the acceleration calculation?
What have I done so far?
I implemented the total received force for each body, but without that optimization.
std::vector<glm::vec3> SequentialAccelerationCalculationImpl::calcAccelerations(
    const std::vector<Body> &bodies,
    const float softening_factor
) {
    const float softening_factor_squared = softening_factor * softening_factor;
    const size_t num_bodies = bodies.size();
    std::vector<glm::vec3> accelerations(num_bodies);

    // O(n^2)
    for (size_t i = 0; i < num_bodies; ++i) {
        glm::vec3 received_total_force(0.0f);
        for (size_t j = 0; j < num_bodies; ++j) { // TODO this can be reduced to half
            if (i != j) {
                const glm::vec3 distance_vector = bodies[i].getCurrentPosition() - bodies[j].getCurrentPosition();
                float distance_squared =
                    (distance_vector.x * distance_vector.x) +
                    (distance_vector.y * distance_vector.y) +
                    (distance_vector.z * distance_vector.z);
                // to avoid zero in the following division
                distance_squared += softening_factor_squared;
                received_total_force += (bodies[j].getMass() / distance_squared) * glm::normalize(distance_vector);
            }
        }
        accelerations[i] = GRAVITATIONAL_CONSTANT * received_total_force;
    }
    return accelerations;
}
I also sketched the force matrix (image of the sketch omitted here).
The simple solution is to have the inner loop start from i + 1 instead of from 0:
for (size_t i = 0; i < num_bodies; ++i) {
    for (size_t j = i + 1; j < num_bodies; ++j) {
        // glm::distance2 comes from <glm/gtx/norm.hpp>
        float distance_squared = glm::distance2(bodies[i].getCurrentPosition(), bodies[j].getCurrentPosition());
        ...
    }
}
However, you now have to update both accelerations[i] and accelerations[j] inside the inner loop, instead of first accumulating everything in received_total_force:
accelerations[i] += bodies[j].getMass() / distance_squared * glm::normalize(distance_vector);
accelerations[j] -= bodies[i].getMass() / distance_squared * glm::normalize(distance_vector);
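Putting it together, a sketch of the halved calculation might look like the following. It is a sketch under the question's assumptions (the same Body interface and GRAVITATIONAL_CONSTANT; glm::length2 lives in <glm/gtx/norm.hpp>), not a drop-in replacement:

#include <glm/glm.hpp>
#include <glm/gtx/norm.hpp> // glm::length2
#include <vector>

std::vector<glm::vec3> calcAccelerationsHalved(
    const std::vector<Body> &bodies,
    const float softening_factor
) {
    const float softening_factor_squared = softening_factor * softening_factor;
    const size_t num_bodies = bodies.size();
    std::vector<glm::vec3> accelerations(num_bodies, glm::vec3(0.0f));

    // Visit only the upper triangle of the force matrix: n*(n-1)/2 pairs.
    for (size_t i = 0; i < num_bodies; ++i) {
        for (size_t j = i + 1; j < num_bodies; ++j) {
            const glm::vec3 distance_vector =
                bodies[i].getCurrentPosition() - bodies[j].getCurrentPosition();
            const float distance_squared =
                glm::length2(distance_vector) + softening_factor_squared;
            const glm::vec3 direction = glm::normalize(distance_vector);
            // Newton's third law: one pair evaluation updates both bodies.
            accelerations[i] += (bodies[j].getMass() / distance_squared) * direction;
            accelerations[j] -= (bodies[i].getMass() / distance_squared) * direction;
        }
    }
    for (size_t i = 0; i < num_bodies; ++i) {
        accelerations[i] *= GRAVITATIONAL_CONSTANT;
    }
    return accelerations;
}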
I profiled my code, and the most expensive part is the loop included in the post. I want to improve the performance of this loop using AVX. I have tried manually unrolling it, and while that does improve performance, the improvement is not satisfactory.
int N = 100000000;
int8_t* data = new int8_t[N];
for (int i = 0; i < N; i++) { data[i] = 1; }
std::array<float, 10> f = {1,2,3,4,5,6,7,8,9,10};
std::vector<float> output(N, 0);
int k = 0;
for (int i = k; i < N; i = i + 2) {
    for (int j = 0; j < 10; j++, k = j + 1) {
        output[i] += f[j] * data[i - k];
        output[i + 1] += f[j] * data[i - k + 1];
    }
}
Could I have some guidance on how to approach this?
I would assume that data is a large input array of signed bytes, f is a small array of 10 floats, and output is the large output array of floats. Your code reads out of bounds during the first 10 iterations over i, so I will start i from 10 instead. Here is a clean version of the original code:
int s = 10;
for (int i = s; i < N; i += 2) {
    for (int j = 0; j < 10; j++) {
        output[i] += f[j] * data[i-j-1];
        output[i+1] += f[j] * data[i-j];
    }
}
As it turns out, processing two iterations of i at a time changes nothing, so we can simplify further to:
for (int i = s; i < N; i++)
    for (int j = 0; j < 10; j++)
        output[i] += f[j] * data[i-j-1];
This version of the code (along with the declarations of the input/output data) should have been present in the question itself, without others having to clean up and simplify the mess.
Now it is obvious that this code applies a one-dimensional convolution filter, which is a very common operation in signal processing. For instance, it can be computed in Python using the numpy.convolve function. The kernel has a very small length (10), so the Fast Fourier Transform won't provide any benefit over the brute-force approach. Given that the problem is well known, you can read a lot of articles on vectorizing small-kernel convolution. I will follow the article by hgomersall.
First, let's get rid of the reverse indexing. Obviously, we can reverse the kernel before running the main algorithm. After that, we have to compute the so-called cross-correlation instead of a convolution. In simple terms, we slide the kernel array along the input array and compute the dot product between them for every possible offset.
std::reverse(f.data(), f.data() + 10);
for (int i = s; i < N; i++) {
    int b = i - 10;
    float res = 0.0;
    for (int j = 0; j < 10; j++)
        res += f[j] * data[b+j];
    output[i] = res;
}
In order to vectorize it, let's compute 8 consecutive dot products at once. Recall that we can pack eight 32-bit floats into one 256-bit AVX register. We vectorize the outer loop over i, which means that:
- The loop over i advances by 8 per iteration.
- Every value inside the outer loop becomes an 8-element pack, such that the k-th element of the pack holds that value for the (i+k)-th iteration of the scalar outer loop.
Here is the resulting code:
//reverse the kernel (the intrinsics below require <immintrin.h>)
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]); //every component holds the same value
//note: you have to compute the last 16 values separately!
for (size_t i = s; i + 16 <= N; i += 8) {
    int b = i - 10;
    __m256 res = _mm256_setzero_ps();
    for (size_t j = 0; j < 10; j++) {
        //load: data[b+j], data[b+j+1], ..., data[b+j+15]
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]);
        //sign-extend the first 8 bytes of the loaded 16-byte pack and convert them to 8 floats
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes));
        //compute res = res + floats * revKernel[j] elementwise (fused multiply-add)
        res = _mm256_fmadd_ps(revKernel[j], floats, res);
    }
    //store the 8 values packed in res into output[i], output[i+1], ..., output[i+7]
    _mm256_storeu_ps(&output[i], res);
}
For 100 million elements, this code takes about 120 ms on my machine, while the original scalar implementation took 850 ms. Beware: I have a Ryzen 1600 CPU, so results on Intel CPUs may be somewhat different.
Now if you really want to unroll something, the inner loop over the 10 kernel elements is the perfect place. Here is how it is done:
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]);
for (size_t i = s; i + 16 <= N; i += 8) {
    size_t b = i - 10;
    __m256 res = _mm256_setzero_ps();
    #define DOIT(j) { \
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]); \
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes)); \
        res = _mm256_fmadd_ps(revKernel[j], floats, res); \
    }
    DOIT(0);
    DOIT(1);
    DOIT(2);
    DOIT(3);
    DOIT(4);
    DOIT(5);
    DOIT(6);
    DOIT(7);
    DOIT(8);
    DOIT(9);
    _mm256_storeu_ps(&output[i], res);
}
It takes 110 ms on my machine (slightly better than the first vectorized version).
A simple copy of all elements (with conversion from bytes to floats) takes 40 ms for me, which means this code is not memory-bound yet, and there is still some room for improvement.
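As the code comments note, the last up-to-16 values must be computed separately, because loading 16 bytes near the end of the array would run out of bounds. A minimal scalar tail sketch (my addition; it assumes i is declared outside the vectorized loop so it keeps its final value, and that f has already been reversed):

size_t i = s;
for (; i + 16 <= N; i += 8) {
    // ... vectorized body from above ...
}
// scalar tail for the remaining elements
for (; i < (size_t)N; i++) {
    float res = 0.0f;
    for (int j = 0; j < 10; j++)
        res += f[j] * data[i - 10 + j];
    output[i] = res;
}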
There are several attempts to optimize the calculation of the HOG descriptor using SIMD instructions: OpenCV, Dlib, and Simd. All of them use scalar code to add the resulting magnitude to the HOG histogram:
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
for (size_t i = 0; i < size; ++i)
{
    histogram[y/8][x/8][idx[i]] += val[i]*ky[y]*kx[x];
    histogram[y/8][x/8 + 1][idx[i]] += val[i]*ky[y]*kx[x + 1];
    histogram[y/8 + 1][x/8][idx[i]] += val[i]*ky[y + 1]*kx[x];
    histogram[y/8 + 1][x/8 + 1][idx[i]] += val[i]*ky[y + 1]*kx[x + 1];
}
The value of size depends on the implementation, but the meaning is the same in general.
I know that the problem of histogram calculation with SIMD does not have a simple and effective solution. But in this case the histogram is small (18 bins). Can that help with SIMD optimization?
I have found a solution: a temporary buffer. First we accumulate the histogram into the temporary buffer (this operation can be vectorized). Then we add the sums from the buffer to the output histogram (this operation can also be vectorized):
float histogram[height/8][width/8][18];
float ky[height], kx[width];
int idx[size];
float val[size];
float buf[18][4] = {{0}}; // note: the buffer must start zeroed
for (size_t i = 0; i < size; ++i)
{
    buf[idx[i]][0] += val[i]*ky[y]*kx[x];
    buf[idx[i]][1] += val[i]*ky[y]*kx[x + 1];
    buf[idx[i]][2] += val[i]*ky[y + 1]*kx[x];
    buf[idx[i]][3] += val[i]*ky[y + 1]*kx[x + 1];
}
for (size_t i = 0; i < 18; ++i)
{
    histogram[y/8][x/8][i] += buf[i][0];
    histogram[y/8][x/8 + 1][i] += buf[i][1];
    histogram[y/8 + 1][x/8][i] += buf[i][2];
    histogram[y/8 + 1][x/8 + 1][i] += buf[i][3];
}
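To illustrate why the buffer helps: the four updates for one input element are contiguous in buf, so they become a single 4-wide operation. A minimal SSE sketch of one iteration of the accumulation loop (my illustration, not the exact library code):

#include <xmmintrin.h> // SSE

// buf[idx[i]][0..3] += val[i] * {ky[y]*kx[x], ky[y]*kx[x+1], ky[y+1]*kx[x], ky[y+1]*kx[x+1]}
// (_mm_set_ps takes its arguments from the highest lane down to the lowest)
__m128 weights = _mm_set_ps(ky[y + 1] * kx[x + 1], ky[y + 1] * kx[x],
                            ky[y] * kx[x + 1],     ky[y] * kx[x]);
__m128 sum = _mm_loadu_ps(buf[idx[i]]);
sum = _mm_add_ps(sum, _mm_mul_ps(_mm_set1_ps(val[i]), weights));
_mm_storeu_ps(buf[idx[i]], sum);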
You can do a partial optimisation by using SIMD to calculate all the (flattened) histogram indices and the bin increments, then process these in a scalar loop afterwards. You probably also want to strip-mine this so that you process one row at a time, in order to keep the temporary bin indices and increments in cache. It might appear that this would be inefficient due to the temporary intermediate buffers, but in practice I have seen a useful overall gain in similar scenarios.
uint32_t i = 0;
for (y = 0; y < height; ++y) // for each row
{
    uint32_t indices[width * 4]; // flattened histogram indices for this row
    float vals[width * 4];       // histogram bin increments for this row

    // SIMD loop for this row - calculate flattened histogram indices and bin
    // increments (scalar code shown for reference - converting this loop to
    // SIMD is left as an exercise for the reader...)
    for (x = 0; x < width; ++x, ++i)
    {
        indices[4*x]   = (y/8)*(width/8)*18 + (x/8)*18 + idx[i];
        indices[4*x+1] = (y/8)*(width/8)*18 + (x/8 + 1)*18 + idx[i];
        indices[4*x+2] = (y/8 + 1)*(width/8)*18 + (x/8)*18 + idx[i];
        indices[4*x+3] = (y/8 + 1)*(width/8)*18 + (x/8 + 1)*18 + idx[i];
        vals[4*x]   = val[i]*ky[y]*kx[x];
        vals[4*x+1] = val[i]*ky[y]*kx[x+1];
        vals[4*x+2] = val[i]*ky[y+1]*kx[x];
        vals[4*x+3] = val[i]*ky[y+1]*kx[x+1];
    }

    // scalar loop for this row
    float * const histogram_base = &histogram[0][0][0]; // pointer to flattened histogram
    for (x = 0; x < width * 4; ++x) // for each set of 4 indices/increments in this row
    {
        histogram_base[indices[x]] += vals[x]; // update the (flattened) histogram
    }
}
We have implemented the DFT and wanted to test it against OpenCV's implementation. The results are different:
- Our DFT's results are ordered from smallest to biggest, whereas OpenCV's results are not in any particular order.
- The first (0th) value is the same for both calculations, as in this case the complex part is 0 (since e^0 = 1 in the formula). The other values are different; for example, OpenCV's results contain negative values, whereas ours do not.
This is our implementation of DFT:
// complex number
std::complex<float> j;
j = -1;
j = std::sqrt(j);

std::complex<float> result;
std::vector<std::complex<float>> fourier; // output

// this->N = length of the contour, 512 in our case
// for each Fourier descriptor
for (int n = 0; n < this->N; ++n)
{
    // summation in the formula
    for (int t = 0; t < this->N; ++t)
    {
        result += this->centroidDistance[t] * std::exp((-j * PI2 * (float)n * (float)t) / (float)N);
    }
    fourier.push_back((1.0f / this->N) * result);
}
and this is how we calculate the DFT with OpenCV:
std::vector<std::complex<float>> fourierCV; // output
cv::dft(std::vector<float>(centroidDistance, centroidDistance + this->N), fourierCV, cv::DFT_SCALE | cv::DFT_COMPLEX_OUTPUT);
The variable centroidDistance is calculated in a previous step.
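For reference, the formula the loop implements (as implied by the code) is:

$$a_n = \frac{1}{N} \sum_{t=0}^{N-1} r(t)\, e^{-j 2\pi n t / N}, \qquad n = 0, \dots, N-1,$$

where $r(t)$ is the centroid distance of the t-th contour point and $j$ is the imaginary unit.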
Note: please avoid answers suggesting we use OpenCV instead of our own implementation.
You forgot to initialise result for each iteration of n:
for (int n = 0; n < this->N; ++n)
{
    result = 0.0f; // initialise `result` to 0 here <<<
    // summation in the formula
    for (int t = 0; t < this->N; ++t)
    {
        result += this->centroidDistance[t] * std::exp((-j * PI2 * (float)n * (float)t) / (float)N);
    }
    fourier.push_back((1.0f / this->N) * result);
}
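An equivalent and arguably cleaner variant (my suggestion, not part of the original answer) is to declare the accumulator inside the loop, so a stale value can never leak between iterations:

for (int n = 0; n < this->N; ++n)
{
    std::complex<float> result(0.0f); // fresh accumulator for each descriptor
    for (int t = 0; t < this->N; ++t)
    {
        result += this->centroidDistance[t] * std::exp((-j * PI2 * (float)n * (float)t) / (float)N);
    }
    fourier.push_back((1.0f / this->N) * result);
}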
I have a problem with the following code:
int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];

// initialize dist2
for (int i = 0; i < x.n; ++i) {
    dist2[i].first = std::numeric_limits<float>::max();
    dist2[i].second = i;
}

// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;

double begin, end;
double elapsed_secs;
while (ndx < k) {
    float sum_distribution = 0.0;
    // look for the point that is furthest from any center
    begin = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum_distribution)
    for (int i = 0; i < x.n; ++i) {
        int example = dist2[i].second;
        float d2 = 0.0, diff;
        for (int j = 0; j < x.d; ++j) {
            diff = x(example, j) - x(chosen_pts[ndx - 1], j);
            d2 += diff * diff;
        }
        if (d2 < dist2[i].first) {
            dist2[i].first = d2;
        }
        sum_distribution += dist2[i].first;
    }
    end = omp_get_wtime() - begin;
    std::cout << "center assigning -- "
              << ndx << " of " << k << " = "
              << (float)ndx / k * 100
              << "% is done. Elapsed time: " << (float)end << "\n";

    bool unique = true;
    do {
        // choose a random interval according to the new distribution
        float r = sum_distribution * (float)rand() / (float)RAND_MAX;
        float sum_cdf = dist2[0].first;
        int cdf_ndx = 0;
        while (sum_cdf < r) {
            sum_cdf += dist2[++cdf_ndx].first;
        }
        chosen_pts[ndx] = cdf_ndx;
        for (int i = 0; i < ndx; ++i) {
            unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
        }
    } while (!unique);
    ++ndx;
}
As you can see, I use OpenMP to parallelize the for loop. It works fine and I can achieve a significant speed-up. However, if I increase the value of x.n beyond 20000000, the function stops working after 8-10 iterations of the outer loop:
- It doesn't produce any output (std::cout).
- Only one core works.
- There is no error whatsoever.
If I comment out the do-while loop, it works again as expected: all cores are busy, there is output after each iteration, and I can increase x.n beyond 100 million, just as I need.
It's not the OpenMP parallel for that is getting stuck; the problem is obviously in your serial do-while loop.
One particular issue that I see is that there are no array bounds checks in the inner while loop accessing dist2. In theory, out-of-bounds access should never happen; in practice it can - see below for why. So first of all I would rewrite the calculation of cdf_ndx to guarantee that the loop ends once all elements have been inspected:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n) {
    sum_cdf += dist2[cdf_ndx].first;
    ++cdf_ndx;
}
Now, how can it happen that sum_cdf does not reach r? It is due to the specifics of floating-point arithmetic, and the fact that sum_distribution was computed in parallel while sum_cdf is computed serially. The problem is that the contribution of a single element to the sum can be below the accuracy of a float; in other words, when you sum two float values that differ by more than ~8 orders of magnitude, the smaller one does not affect the sum.
So, with 20M floats it might happen at some point that the next value to add is so small compared to the accumulated sum_cdf that adding it does not change the sum. On the other hand, sum_distribution was essentially computed as several independent partial sums (one per thread) that were then combined. It is therefore more accurate, and possibly bigger than sum_cdf can ever reach.
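A minimal standalone example of the effect (an illustration, not the original code):

#include <iostream>

int main() {
    float big = 3.0e7f; // accumulated sum
    float tiny = 0.5f;  // next element's contribution
    // Near 3e7 the spacing between adjacent floats is 2.0,
    // so adding 0.5 rounds back to the same value.
    std::cout << ((big + tiny) == big) << "\n"; // prints 1
}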
A solution is to compute sum_cdf in portions, using two nested loops. For example:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n) {
    float block_sum = 0;
    int block_end = std::min(cdf_ndx + 10000, x.n); // 10000 is an arbitrarily selected block size
    for (int i = cdf_ndx; i < block_end; ++i) {
        block_sum += dist2[i].first;
        if (sum_cdf + block_sum >= r) {
            block_end = i; // adjust so that cdf_ndx is computed correctly
            break;
        }
    }
    sum_cdf += block_sum;
    cdf_ndx = block_end;
}
And after the loop you need to check that cdf_ndx < x.n; otherwise repeat with a new random interval.
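An alternative worth considering (my suggestion, not part of the original answer) is compensated (Kahan) summation, which keeps the serial scan accurate enough that sum_cdf can reliably reach r:

// Kahan-compensated serial scan over dist2 (a sketch)
float sum_cdf = 0.0f, compensation = 0.0f;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n) {
    float y = dist2[cdf_ndx].first - compensation; // corrected addend
    float t = sum_cdf + y;                         // new (rounded) sum
    compensation = (t - sum_cdf) - y;              // rounding error, fed back next time
    sum_cdf = t;
    ++cdf_ndx;
}
// as above, check cdf_ndx < x.n after the loop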