How can I parallelize my code about deleting overlapping spheres? - c++

I'm trying to parallelize a piece of code. My code checks whether some spheres (defined by their coordinates xcentro, ycentro, zcentro and their radii r) overlap each other or not. If they overlap, I must delete them, but since I don't know how to delete an element of a vector (it gets messy with the indices), I just set their radii to zero and skip those spheres later.
My problem comes when I try to parallelize the code. If I don't parallelize, it works properly (although the code is not efficient at all and I need to run it with millions of spheres). If I try to parallelize it, I obtain several errors. For example, if I run the code exactly as written below, I obtain a segmentation fault. If I remove the private(...) part, I don't obtain any error, but I don't obtain the same results as without parallelization.
What can I be doing wrong?
Here's the code:
vector<double> xcentro, ycentro, zcentro, r;
r.reserve(34000000);
xcentro.reserve(34000000);
ycentro.reserve(34000000);
zcentro.reserve(34000000);
... read files and fill up xcentro ycentro zcentro r with data ...
//#pragma omp parallel for private(i, j, xcentro, ycentro, zcentro, d) shared(r)
for (size_t i = 0; i < r.size() - 1; i++)
{
    //#pragma omp parallel for private(i, j, xcentro, ycentro, zcentro, d) shared(r)
    for (size_t j = i + 1; j < r.size() - 1; j++)
    {
        auto dist_square = (xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
                         + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
                         + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j]);
        if ( dist_square < (r[i]+r[j])*(r[i]+r[j]) )
        {
            // set the radius of the j-th sphere to zero
            r[j] = 0;
            // set the radius of the i-th sphere to zero
            r[i] = 0;
        }
    }
}

Okay, let's first consider an algorithm which actually works, i.e. obtains the subset of spheres with no overlap. To this end, we don't remove a sphere (before checking whether it overlaps with another one) but merely record that it has overlaps.
struct sphere { double R, X, Y, Z; };
inline constexpr double square(double x) noexcept
{ return x*x; }
inline constexpr bool overlap(sphere const&a, sphere const&b) noexcept
{ return square(a.X-b.X) + square(a.Y-b.Y) + square(a.Z-b.Z) < square(a.R+b.R); }
std::vector<sphere> keep_non_overlapping(std::vector<sphere> const&S)
{
    std::vector<char> hasOverlap(S.size(), char(0));
    std::vector<sphere> result;
    for(size_t i=0; i<S.size(); ++i) {
        for(size_t j=i+1; j<S.size(); ++j)
            if((!hasOverlap[i] || !hasOverlap[j]) && overlap(S[i],S[j])) {
                hasOverlap[i] = 1;
                hasOverlap[j] = 1;
            }
        if(!hasOverlap[i])
            result.push_back(S[i]);
    }
    return result;
}
This algorithm loops over every pair of spheres once. Since the test between spheres k and l is done when i equals the smaller of k and l and j the larger, the executions of the loop over i are still not mutually independent: there is still a race condition. This can be removed by looping over each pair of spheres twice:
std::vector<sphere> keep_non_overlapping(std::vector<sphere> const&S)
{
    std::vector<char> hasOverlap(S.size(), char(0));
    #pragma omp parallel for
    for(size_t i=0; i<S.size(); ++i) {
        bool overlapping = false;
        for(size_t j=0; !overlapping && j<S.size(); ++j)
            if(j!=i && overlap(S[i],S[j]))
                overlapping = true;
        hasOverlap[i] = overlapping;
    }
    std::vector<sphere> result;
    for(size_t i=0; i<S.size(); ++i)
        if(!hasOverlap[i])
            result.push_back(S[i]);
    return result;
}
Note also that, depending on the distribution of spheres, it can make the execution significantly faster if you first order the spheres in descending radius (largest spheres first), as in
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.R > b.R; });
Note further that this naive O(N^2) algorithm is not optimal. There is likely an O(N ln N) algorithm which first arranges the spheres in some data structure (perhaps a spatial tree) in O(N ln N) time and then finds whether a sphere is overlapping in no more than O(ln N) time for each sphere.
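To make this more concrete, here is a minimal sketch of such a spatial approach using a uniform grid (cell list) instead of a tree. It reuses the sphere struct and overlap() from above; the cell size of twice the largest radius, the cell hashing, and the function names are my assumptions, and the expected cost is roughly linear for reasonably distributed spheres:
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>
// Pack a 3D cell coordinate into one key; masking may alias far-apart cells,
// which only adds extra candidates to check, never misses one.
inline std::uint64_t cell_key(std::int64_t cx, std::int64_t cy, std::int64_t cz) noexcept
{
    return (std::uint64_t(cx & 0x1fffff) << 42)
         | (std::uint64_t(cy & 0x1fffff) << 21)
         |  std::uint64_t(cz & 0x1fffff);
}
std::vector<char> find_overlaps_grid(std::vector<sphere> const& S)
{
    double maxR = 0;
    for (auto const& s : S) maxR = std::max(maxR, s.R);
    const double cell = 2 * maxR;    // assumes maxR > 0; two spheres can only overlap
                                     // if their centres are closer than 2*maxR
    auto coord = [cell](double x) { return std::int64_t(std::floor(x / cell)); };
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    for (std::size_t i = 0; i < S.size(); ++i)
        grid[cell_key(coord(S[i].X), coord(S[i].Y), coord(S[i].Z))].push_back(i);
    std::vector<char> hasOverlap(S.size(), 0);
    #pragma omp parallel for
    for (std::size_t i = 0; i < S.size(); ++i) {
        const std::int64_t cx = coord(S[i].X), cy = coord(S[i].Y), cz = coord(S[i].Z);
        // only the 27 neighbouring cells can contain a sphere that overlaps S[i]
        for (std::int64_t dx = -1; dx <= 1 && !hasOverlap[i]; ++dx)
            for (std::int64_t dy = -1; dy <= 1 && !hasOverlap[i]; ++dy)
                for (std::int64_t dz = -1; dz <= 1 && !hasOverlap[i]; ++dz) {
                    auto it = grid.find(cell_key(cx + dx, cy + dy, cz + dz));
                    if (it == grid.end()) continue;
                    for (std::size_t j : it->second)
                        if (j != i && overlap(S[i], S[j])) { hasOverlap[i] = 1; break; }
                }
    }
    return hasOverlap;    // sphere i is kept if hasOverlap[i] == 0
}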

Hereby, I answer your question asked in the comment:
How could I increase the speed of my program?
The best is to completely change the algorithm (as already suggested), but if you do not wish to change it for any reason, you can gain ca. 20% speed by parallelizing the outer loop:
// overlaps is assumed to be a zero-initialized std::vector<char> of size r.size()
#pragma omp parallel for schedule(dynamic, r.size()/500)
for (size_t i = 0; i < r.size(); ++i)
{
    for (size_t j = i + 1; j < r.size(); ++j)
    {
        if ((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
          + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
          + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])
          < (r[i] + r[j]) * (r[i] + r[j]))
        {
            #pragma omp atomic write
            overlaps[i] = 1;
            #pragma omp atomic write
            overlaps[j] = 1;
        }
    }
}
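Because the inner loop gets shorter as i grows, the iterations of the outer loop have very uneven cost, which is why the dynamic schedule above helps. An interleaved static schedule is another common choice for such triangular loops, for example (the chunk size of 1 is only an illustration):
#pragma omp parallel for schedule(static, 1)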
UPDATE:
Based on #Walter’s response and code, I created a simple algorithm that is significantly faster than your code. The basic idea is as follows: sort the data according to x values and determine the largest radius. For a given x value, it is not necessary to go through the entire range; it is enough to examine those x values that are closer than twice the largest radius. Thus, the number of loop cycles can be significantly reduced and the speed of the algorithm is increased by orders of magnitude. I tested the speed difference between your code and the new algorithm with the code below, using arrays filled with data from randomly created spheres. I wrote the algorithm so that you don't have to change the rest of your program: the new_algorithm function takes the data from the xcentro, ycentro, zcentro, r arrays and returns the indices of the overlapping spheres in the overlaps2 array. On Compiler Explorer a significant speed increase was observed:
size=20000
Runtime(your method)=1216 ms
Runtime(new algorithm)=13 ms
Note that this is a simple algorithm and it is easy to understand how it works, but better algorithms may be possible depending on your real data. Here is the code:
#include <iostream>
#include <vector>
#include <chrono>
#include <omp.h>
#include <algorithm>
#include <cstdlib>
using namespace std;
constexpr size_t N=10000;
std::vector<double> xcentro, ycentro, zcentro, r;
struct sphere { double X,Y,Z,R; size_t index; };
std::vector<sphere> Spheres;
inline constexpr double square(double x) noexcept
{ return x*x; }
inline constexpr bool overlap(sphere const&a, sphere const&b) noexcept
{ return square(a.X-b.X)+square(a.Y-b.Y)+square(a.Z-b.Z) < square(a.R+b.R); }
void new_algorithm(const std::vector<double>& x, const std::vector<double>& y, const std::vector<double>& z, const std::vector<double>& r, std::vector<char>& overlaps)
{
const auto start = std::chrono::high_resolution_clock::now();
std::vector<sphere> S;
S.reserve(r.size());
for (size_t i = 0; i < r.size(); i++)
{
S.push_back(sphere{x[i],y[i],z[i],r[i], i});
}
//Sort ascending X
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.X < b.X; });
// Clear overlaps and determine maximum r value
double maxr=-1;
for (size_t i = 0; i < S.size(); i++)
{
overlaps[i]=0;
if(S[i].R>maxr) maxr=S[i].R;
}
//Create a vector for maximum indices
std::vector<size_t> max_index(S.size(),0);
//Determine max_index: for each i, the first position j (in x order) that is too far away to overlap
size_t j=1;
for (size_t i = 0; i < S.size(); i++)
{
    while(j < S.size() && S[j].X - S[i].X < 2*maxr)
    {
        j++;
    }
    max_index[i]=j;
}
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i)
{
for(size_t j=i+1; j<max_index[i]; ++j)
if(overlap(S[i],S[j]))
{
#pragma omp atomic write
overlaps[S[i].index] = 1;
#pragma omp atomic write
overlaps[S[j].index] = 1;
}
}
const auto stop = std::chrono::high_resolution_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << "Runtime(new algorithm)=" << diff.count() << " ms\n";
}
void your_algorithm(std::vector<char>& overlaps)
{
size_t i,j;
const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
overlaps[i]=0;
}
for (i = 0; i < r.size(); i++)
{
#pragma omp parallel for
for (j = i + 1; j < r.size(); j++)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
overlaps[i] = 1;
overlaps[j] = 1;
}
}
}
const auto stop = std::chrono::high_resolution_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << "Runtime(your method)=" << diff.count() << " ms" << std::endl;
}
int main() {
std::vector<char> overlaps1, overlaps2;
r.reserve(N);
xcentro.reserve(N);
ycentro.reserve(N);
zcentro.reserve(N);
overlaps1.resize(N);
overlaps2.resize(N);
//fill the arrays with random numbers
for(size_t i=0; i<N; i++)
{
double x=(rand() % 1000)/10.0;
double y=(rand() % 1000)/10.0;
double z=(rand() % 1000)/10.0;
double R=(rand() % 10000)/((double)N ) + 0.1;
xcentro.push_back( x );
ycentro.push_back( y );
zcentro.push_back( z );
r.push_back(R);
}
std::cout << "size=" << r.size() << std::endl;
your_algorithm(overlaps1);
new_algorithm(xcentro,ycentro,zcentro,r,overlaps2);
// Check if array of overlap is the same for the 2 methods
for(size_t i=0; i<N; i++)
{
if(overlaps1[i]!=overlaps2[i])
{
cout << "error\n"; exit (-1);
}
}
cout << "OK\n";
}
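For reference, the benchmark above needs OpenMP enabled at compile time (without it the pragmas are silently ignored and everything runs serially); with g++ the build could look like this (file name and flags are assumptions):
g++ -std=c++17 -O3 -fopenmp spheres.cpp -o spheres
./spheres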
UPDATE2: Here is the code mentioned in the comments (sort by R and remove only the bigger sphere):
std::vector<sphere> S;
S.reserve(r.size());
for (size_t i = 0; i < r.size(); i++)
{
    overlaps[i]=0;
    S.push_back(sphere{x[i],y[i],z[i],r[i], i});
}
//Sort descending R
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.R > b.R; });
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i)
{
    for(size_t j=i+1; j<S.size(); ++j)
        if(overlap(S[i],S[j]))
        {
            overlaps[S[i].index] = 1;
            break;
        }
}

Let us first improve your serial code a bit by avoiding looping over already deleted spheres:
for(size_t i = 0; i < r.size(); ++i)
    if(r[i]>0) {
        for(size_t j=i+1; j<r.size(); ++j)
            if(r[j]>0 && (xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
                       + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
                       + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])
                       < (r[i]+r[j])*(r[i]+r[j]) ) {
                r[j] = 0;
                r[i] = 0;
            }
    }
This immediately shows you that the execution of the outer loop depends on all previous executions at smaller index, since these may have removed some of the spheres. This interdependence of the loop executions implies that your algorithm cannot be straightforwardly parallelized (in the way you attempted it).
Also, you have race conditions on the elements of r[], which are both read and written. Your naive parallelization didn't take care of that problem either.
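For illustration, here is a minimal sketch of how the two problems can be separated, using your variable names: a parallel detection phase in which iteration i writes only its own flag (so no atomics are needed), followed by a serial deletion phase. Note that this computes the order-independent result in which every sphere that overlaps any other is removed, not the result of the sequential greedy deletion:
std::vector<char> flagged(r.size(), 0);   // iteration i writes only flagged[i]
#pragma omp parallel for schedule(dynamic)
for (size_t i = 0; i < r.size(); ++i)
    for (size_t j = 0; j < r.size(); ++j)
        if (j != i
            && (xcentro[i]-xcentro[j])*(xcentro[i]-xcentro[j])
             + (ycentro[i]-ycentro[j])*(ycentro[i]-ycentro[j])
             + (zcentro[i]-zcentro[j])*(zcentro[i]-zcentro[j])
             < (r[i]+r[j])*(r[i]+r[j]))
        {
            flagged[i] = 1;
            break;
        }
// only now modify r, serially, once all checks are done
for (size_t i = 0; i < r.size(); ++i)
    if (flagged[i]) r[i] = 0;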

Just in case someone is still interested, I've improved my code and now it's more efficient (although not efficient enough for me yet), and now it does eliminate overlapping spheres:
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
    overlaps[i]=0;
}
cout << "overlaps set to zero..." << endl;
// What's left is to find which spheres overlap and remove them. First I check which spheres overlap,
// and afterwards I set the radius of those that overlap to zero.
double cero = 0.0;
for (i = 0; i < r.size(); i++)
{
    contador=0;
    #pragma omp parallel for reduction(+:contador)
    for (j = i + 1; j < r.size(); j++)
    {
        if ((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
          + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
          + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])
          < (r[i] + r[j]) * (r[i] + r[j]))
        {
            contador++;
            overlaps[i] = contador;
            overlaps[j] = contador;
        }
    }
}
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
    if(overlaps[i]!=0)
    {
        r[i]=0;
    }
}

Related

Why is multi-threading of matrix calculation not faster than single-core?

This is my first time using multi-threading to speed up a heavy calculation.
Background: The idea is to calculate a Kernel Covariance matrix, by reading a list of 3D points x_test and calculating the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculations by only calculating the lower triangular matrix. Since all the calculations are independent from each other, I tried to speed up the process (x_test.size() = 27000 in my case) by splitting the calculation of the matrix entries row-wise, assigning a range of rows to each thread.
On a single core the calculations took about 280 seconds each time, on 4 cores it took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreding calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
char buffer[8192];
std::ofstream out("lower_half_matrix_" + std::to_string(threadCounter) +".csv");
out.rdbuf()->pubsetbuf(buffer, 8192);
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
locking by simultaneous access to data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know this should prevent simultaneous data access since every thread has its own copy
Output -> every thread writes its part of the lower triangular matrix to its own file. My task manager doesn't indicate a full SSD utilization in the slightest
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
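As a side note, this version uses std::thread rather than OpenMP, so it needs the thread library at link time; a typical build (file name and flags are assumptions) would be:
g++ -std=c++17 -O3 -pthread main.cpp -o kmatrix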
Okay, I've written an implementation with optimized formatting.
Using #Nelfeal's code, the run took around 250 seconds on my system, with the write time taking the most by far; or rather, with the std::ofstream formatting taking most of the time.
I've written a C++20 version via std::format_to/format. It is a multi-threaded version that takes around 25-40 seconds to complete all the computations, formatting, and writing. If run on a single thread, it takes around 70 seconds on my system. The same performance should be achievable via the fmt library on C++11/14/17.
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here as it is the default for std::ofstream. More digits means more writing time to disk and lower performance.
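For reference, the corresponding calls with the {fmt} library would look roughly like this (assuming the library is installed; fmt::format_to mirrors std::format_to, and these two lines are drop-in replacements for the ones above):
#include <fmt/format.h>
#include <iterator>
// inside the loops, replacing the std::format_to calls:
fmt::format_to(std::back_inserter(buffer), "{:.6g}, ", kij);
fmt::format_to(std::back_inserter(buffer), "{:.6g}\n", kii);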

Constructing distance matrix in parallel in C++11 using OpenMP

I would like to construct a distance matrix in parallel in C++11 using OpenMP. I read various documentations, introductions, examples etc. Yet, I still have a few questions. To facilitate answering this post, I state my questions as assumptions numbered 1 through 7. This way, you can quickly browse through them and point out which ones are correct and which ones are not.
Let us begin with a simple serially executed function computing a dense Armadillo matrix:
// [[Rcpp::export]]
arma::mat compute_dist_mat(arma::mat &coordinates, unsigned int n_points) {
arma::mat dist_mat(n_points, n_points, arma::fill::zeros);
double dist {};
for(unsigned int i {0}; i < n_points; i++) {
for(unsigned int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
return dist_mat;
}
As a side note: this function is supposed to be called from R through the Rcpp interface - indicated by the // [[Rcpp::export]]. And accordingly the top of the file includes
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
using namespace arma;
However, the function should work also fine without the R interface.
In an attempt to parallelize the code, I replace the loops with
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
and add n_threads as an argument to the compute_dist_mat function.
1. This distributes the iterations of the outer loop across threads, with the iterations of the inner loop executed by the respective thread handling the outer loop.
2. The two loop levels cannot be combined because the inner loop depends on the outer one.
3. dist, i, and j are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
4. The # pragma line does not have any effect when n_threads = 1, inducing a serial execution.
Extending the dense matrix application, the following code block illustrates the serial sparse matrix case with batch insertion. To motivate the use of sparse matrices here, I set distances below a certain threshold to zero.
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold) {
std::vector<double> dists;
std::vector<unsigned int> dist_i;
std::vector<unsigned int> dist_j;
double dist {};
for(unsigned long int i {0}; i < n_points; i++) {
for(unsigned long int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists.push_back(dist);
dist_i.push_back(i);
dist_j.push_back(j);
}
}
}
unsigned int mat_size = dist_i.size();
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int j {};
for(unsigned int i {0}; i < mat_size; i++) {
j = i * 2;
index_mat.at(0, j) = dist_i[i];
index_mat.at(1, j) = dist_j[i];
index_mat.at(0, j + 1) = dist_j[i];
index_mat.at(1, j + 1) = dist_i[i];
dists_vec.at(j) = dists[i];
dists_vec.at(j + 1) = dists[i];
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
Because the function does not know ex ante how many distances are above the threshold, it first stores the non-zero values in standard vectors and then constructs the Armadillo objects from them.
I parallelize the function as follows:
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold, unsigned short int n_threads) {
std::vector<std::vector<double>> dists(n_points);
std::vector<std::vector<unsigned int>> dist_j(n_points);
double dist {};
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists[i].push_back(dist);
dist_j[i].push_back(j);
}
}
}
unsigned int vec_intervals[n_points + 1];
vec_intervals[0] = 0;
for (i = 0; i < n_points; i++) {
vec_intervals[i + 1] = vec_intervals[i] + dist_j[i].size();
}
unsigned int mat_size {vec_intervals[n_points]};
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int vec_begins_i {};
unsigned int vec_length_i {};
unsigned int k {};
# pragma omp parallel for private(i, j, k, vec_begins_i, vec_length_i) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
vec_begins_i = vec_intervals[i];
vec_length_i = vec_intervals[i + 1] - vec_begins_i;
for(j = 0; j < vec_length_i; j++) {
k = (vec_begins_i + j) * 2;
index_mat.at(0, k) = i;
index_mat.at(1, k) = dist_j[i][j];
index_mat.at(0, k + 1) = dist_j[i][j];
index_mat.at(1, k + 1) = i;
dists_vec.at(k) = dists[i][j];
dists_vec.at(k + 1) = dists[i][j];
}
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
5. Using dynamic vectors in the loop is thread-safe.
6. dist, i, j, k, vec_begins_i, and vec_length_i are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
7. Nothing has to be marked as a section.
Are any of the seven statements incorrect?
The following does not directly answer your question (it's just some dev code I copied from a personal GitHub repo), but it makes several points clear that may be of use in your application:
OpenMP automatically determines private members so long as you are not doing any dynamic memory allocation within the parallel loop
For sparse matrix distance calculations, it becomes important to move beyond a simple calculation of distance at each non-zero index and instead consider the structure of sparsity that is expected, and optimize for that. In the example below, I assume both matrices are very sparse and their intersection is less than their union. Thus, I "precondition" each distance calculation with squared column sums (for calculating Euclidean distance), and then adjust the calculation for the intersection only. This avoids complicated iterator structures and is very fast.
Using as few temporaries as possible is much to your benefit, and sparse matrix iterators do as good of a job of this as any alternative code anyone may ever write.
Eigen provides better vectorization than Armadillo (across the board, I might add) which means you want Eigen instead of Armadillo if those last 20% of performance gains are important to you.
This function calculates the Euclidean distance between all unique pairs of columns in an Eigen::SparseMatrix<double> object:
// sparse column-wise Euclidean distance between all columns
Eigen::MatrixXd distance(Eigen::SparseMatrix<double>& A) {
Eigen::MatrixXd dists(A.cols(), A.cols());
Eigen::VectorXd sq_colsums = Eigen::VectorXd::Zero(A.cols());
for (int col = 0; col < A.cols(); ++col)
for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
sq_colsums(col) += it.value() * it.value();
#pragma omp parallel for
for (unsigned int i = 0; i < (A.cols() - 1); ++i) {
for (unsigned int j = (i + 1); j < A.cols(); ++j) {
double dist = sq_colsums(i) + sq_colsums(j);
Eigen::SparseMatrix<double>::InnerIterator it1(A, i), it2(A, j);
while (it1 && it2) {
if (it1.row() < it2.row()) ++it1;
else if (it1.row() > it2.row()) ++it2;
else {
dist -= it1.value() * it1.value();
dist -= it2.value() * it2.value();
dist += std::pow(it1.value() - it2.value(), 2);
++it1; ++it2;
}
}
dists(i, j) = std::sqrt(dist);
dists(j, i) = dists(i, j);
}
}
dists.diagonal().array() = 1;
return dists;
}
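A small usage sketch, assuming the distance() function above is in scope and Eigen is available; the matrix size and fill pattern below are arbitrary:
#include <Eigen/Sparse>
#include <iostream>
#include <vector>
int main() {
    Eigen::SparseMatrix<double> A(100, 8);              // 100 rows, 8 columns
    std::vector<Eigen::Triplet<double>> triplets;
    for (int c = 0; c < 8; ++c)
        for (int row = c; row < 100; row += 7)          // arbitrary sparse pattern
            triplets.emplace_back(row, c, 0.5 * (row + 1) / (c + 1));
    A.setFromTriplets(triplets.begin(), triplets.end());
    Eigen::MatrixXd D = distance(A);                    // pairwise column distances
    std::cout << D(0, 1) << "\n";
}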
As Dirk and others have said, there are packages out there (e.g. ParallelDist) that seem to do everything you're after (for dense matrices). Look at wordspace for fast cosine distance calculations. See here for some comparisons. Cosine distance is easy to calculate efficiently in R without use of Rcpp using crossprod operations (see the qlcMatrix::cosSparse source code for algorithmic inspiration).

How to parallelize accumulative probability function with OpenMP?

I'm trying to make my code, which calculates the accumulative probability function, more efficient by parallelizing it. I have a vector<double> of radii called r and I need to count how many elements have a radius bigger than a given radius R. In addition, I need to calculate the accumulative probability function for the volume.
The code I have is the following one:
int i, j;
double aux, contar, contar1;
vector<double> r, contador, contador1, vol;
for (i = 0; i != r.size() - 1; i++)
{
    aux = r[i];
    contador[i] = 0;
    contador1[i] = 0;
    contar = 0;
    contar1 = 0;
    vol[i] = 0.0;
    for (j = 0; j != r.size() - 1; j++)
    {
        if(aux <= r[j])
        {
            contar++;
            #pragma omp atomic write
            vol[i] = vol[i] + 4.0 * 3.141592653589793 * r[j] * r[j] * r[j] / 3.0;
        }
        if(aux==r[j])
        {
            contar1++;
        }
    }
    #pragma omp atomic write
    contador[i]=contar;
    #pragma omp atomic write
    contador1[i]=contar1;
}
but it's not efficient at all. Any help in order to make it more efficient with OpenMP?
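One way to avoid the O(N^2) pair of loops entirely is to sort the radii once and then answer every query with prefix sums and binary search. Below is a rough sketch of that idea, not code from the post; it assumes contador[i] should count the radii >= r[i], contador1[i] the radii equal to r[i], and vol[i] the summed volume of the spheres with radius >= r[i]:
#include <algorithm>
#include <cstddef>
#include <vector>
void cumulative(const std::vector<double>& r,
                std::vector<double>& contador,   // count of r[j] >= r[i]
                std::vector<double>& contador1,  // count of r[j] == r[i]
                std::vector<double>& vol)        // summed volume of spheres with r[j] >= r[i]
{
    const double pi = 3.141592653589793;
    std::vector<double> sorted(r);
    std::sort(sorted.begin(), sorted.end());                  // ascending
    std::vector<double> prefix(sorted.size() + 1, 0.0);       // prefix[k] = volume of the k smallest spheres
    for (std::size_t k = 0; k < sorted.size(); ++k)
        prefix[k + 1] = prefix[k] + 4.0 * pi * sorted[k] * sorted[k] * sorted[k] / 3.0;
    contador.resize(r.size());
    contador1.resize(r.size());
    vol.resize(r.size());
    #pragma omp parallel for
    for (std::size_t i = 0; i < r.size(); ++i) {
        const std::size_t lo = std::lower_bound(sorted.begin(), sorted.end(), r[i]) - sorted.begin();
        const std::size_t hi = std::upper_bound(sorted.begin(), sorted.end(), r[i]) - sorted.begin();
        contador[i]  = double(sorted.size() - lo);             // how many radii are >= r[i]
        contador1[i] = double(hi - lo);                        // how many radii are == r[i]
        vol[i]       = prefix.back() - prefix[lo];             // their total volume
    }
}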

OpenMP Race Condition when finding Closest Pair

I'm doing an assignment to find the closest pair between two disjoint sets A and B. I'm using OpenMP to parallelize the recursion of the algorithm, but I am running into some data races. I am very new to OpenMP, so I think it has something to do with incorrect privating/sharing of variables. I have put the full algorithm below:
float OMPParticleSim::efficient_closest_pair(int n, vector<Particle> & p, vector<Particle> & q)
{
// brute force
if(n <= 3) {
float m = numeric_limits<float>::max();
for(int i = 0; i < n - 2; i++) {
for(int j = i + 1; j < n - 1; j++) {
if((set_A.find(p[i].id) != set_A.end() && set_A.find(p[j].id) != set_A.end()) || (set_B.find(p[i].id) != set_B.end() && set_B.find(p[j].id) != set_B.end())) {
continue;
}
float distsq = pow(p[i].x - p[j].x, 2) + pow(p[i].y - p[j].y, 2) + pow(p[i].z - p[j].z, 2);
pair<pair<Particle, Particle>, float> pa = make_pair(make_pair(p[i], p[j]), sqrt(distsq));
#pragma omp critical
insert(pa);
m = min(m, distsq);
}
}
return sqrt(m);
}
// copy first ceil(n/2) points of p to pl
vector<Particle> pl;
int ceiling = ceil(n/2);
for(int i = 0; i < ceiling; i++) {
pl.push_back(p[i]);
}
// copy first ceil(n/2) points of q to ql
vector<Particle> ql;
for(int i = 0; i < ceiling; i++) {
ql.push_back(q[i]);
}
// copy remaining floor(n/2) points of p to pr
vector<Particle> pr;
for(int i = ceiling; i < p.size(); i++) {
pr.push_back(p[i]);
}
// copy remaining floor(n/2) points of q to qr
vector<Particle> qr;
for(int i = ceiling; i < q.size(); i++) {
qr.push_back(p[i]);
}
float dl, dr, d;
#pragma omp task firstprivate(pl, ql, p, q, n) private(dl) shared(closest_pairs)
dl = efficient_closest_pair(ceil(n / 2), pl, ql);
#pragma omp task firstprivate(pl, ql, p, q, n) private(dr) shared(closest_pairs)
dr = efficient_closest_pair(ceil(n / 2), pr, qr);
#pragma omp taskwait
d = min(dl, dr);
float m = p[ceil(n / 2) - 1].x;
vector<Particle> s;
for(int i = 0; i < q.size(); i++) {
if(fabs(q[i].x - m) < d) {
s.push_back(Particle(q[i]));
}
}
int num = s.size();
float dminsq = d * d;
for (int i = 0; i < num - 2; i++) {
int k = i + 1;
while(k <= num - 1 && pow(s[k].y - s[i].y, 2) < dminsq) {
if((set_A.find(s[i].id) != set_A.end() && set_A.find(s[k].id) != set_A.end()) || (set_B.find(s[i].id) != set_B.end() && set_B.find(s[k].id) != set_B.end())) {
k++;
continue;
}
float dist = pow(s[k].x - s[i].x, 2) + pow(s[k].y - s[i].y, 2) + pow(s[k].z - s[i].z, 2);
pair<pair<Particle, Particle>, float> pa = make_pair(make_pair(s[i], s[k]), sqrt(dist));
#pragma omp critical
insert(pa);
dminsq = min(dist, dminsq);
k++;
}
}
return sqrt(dminsq);
}
The insert method looks like this:
void OMPParticleSim::insert(pair<pair<Particle, Particle>, float> & pair) {
if(closest_pairs.size() == 0) {
closest_pairs.push_back(pair);
return;
}
for(int i = 0; i < closest_pairs.size(); ++i) {
if(closest_pairs[i].second > pair.second) {
closest_pairs.insert(closest_pairs.begin() + i, 1, pair);
break;
}
}
if(closest_pairs.size() > k) {
closest_pairs.pop_back();
}
}
The start of the parallel region is here:
void OMPParticleSim::do_closest_pair(int num_threads) {
vector<Particle> p = set;
// presort on x
sort(p.begin(), p.end(), sortxomp);
vector<Particle> q = p;
// presort on y
sort(q.begin(), q.end(), sortyomp);
float cp;
#pragma omp parallel num_threads(num_threads)
{
#pragma omp single
{
cp = efficient_closest_pair(set.size(), p, q);
}
}
sort(closest_pairs.begin(), closest_pairs.end(), sortpairsomp);
}
All of the results are stored in a list closest_pairs and output to a file. The reason I know there are data races is because some of the Particle id's are negative (all of them start positive), and running the program multiple times results in different values being written to the file. Any help would be great!
The error was that dl and dr should have been shared between the tasks.
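In other words, something along these lines for the affected fragment (a sketch; the firstprivate lists follow the original code, only the data-sharing of dl and dr changes):
float dl, dr, d;
#pragma omp task firstprivate(pl, ql, n) shared(dl)
dl = efficient_closest_pair(ceil(n / 2), pl, ql);
#pragma omp task firstprivate(pr, qr, n) shared(dr)
dr = efficient_closest_pair(ceil(n / 2), pr, qr);
#pragma omp taskwait
d = min(dl, dr);
With private(dl) each task writes to its own uninitialized copy, so the value computed inside the task never reaches the d = min(dl, dr) line after the taskwait.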

Optimize outer loop with OpenMP and a reduction

I struggle a bit with a function. The calculation is wrong if I try to parallelize the outer loop with a
#pragma omp parallel reduction(+:det).
Can someone show me how to solve it and why it is failing?
// template<class T> using vector2D = std::vector<std::vector<T>>;
float Det(vector2DF &a, int n)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
for (int i = 0; i < n; i++)
{
int l = 0;
#pragma omp parallel for private(l)
for (int j = 1; j < n; j++)
{
l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
return det;
}
If you parallelize the outer loop, there is a race condition on this line:
m[j - 1][l] = a[j][k];
Also you likely want a parallel for reduction instead of just a parallel reduction.
The issue is that m is shared, even though that wouldn't be necessary given that it is completely overwritten in the inner loop. Always declare variables as locally as possible; this avoids issues with wrongly shared variables, e.g.:
float Det(vector2DF &a, int n)
{
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
#pragma omp parallel for reduction(+:det)
for (int i = 0; i < n; i++)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
for (int j = 1; j < n; j++)
{
int l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
return det;
}
Now that is correct, but since m can be expensive to allocate, performance could benefit from not doing it in each and every iteration. This can be done by splitting parallel and for directives as such:
float Det(vector2DF &a, int n)
{
if (n == 1) return a[0][0];
if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
float det = 0;
#pragma omp parallel reduction(+:det)
{
vector2DF m(n - 1, vector1DF(n - 1, 0));
#pragma omp for
for (int i = 0; i < n; i++)
{
for (int j = 1; j < n; j++)
{
int l = 0;
for (int k = 0; k < n; k++)
{
if (k == i) continue;
m[j - 1][l] = a[j][k];
l++;
}
}
det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
}
}
return det;
}
Now you could also just declare m as firstprivate, but that would assume that the copy constructor makes a completely independent deep-copy and thus make the code more difficult to reason about.
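For completeness, that alternative would look roughly like this (a sketch; it relies on std::vector's copy constructor performing exactly the deep copy mentioned above):
float Det(vector2DF &a, int n)
{
    if (n == 1) return a[0][0];
    if (n == 2) return a[0][0] * a[1][1] - a[1][0] * a[0][1];
    float det = 0;
    vector2DF m(n - 1, vector1DF(n - 1, 0));
    #pragma omp parallel for reduction(+:det) firstprivate(m)
    for (int i = 0; i < n; i++)
    {
        for (int j = 1; j < n; j++)
        {
            int l = 0;
            for (int k = 0; k < n; k++)
            {
                if (k == i) continue;
                m[j - 1][l] = a[j][k];
                l++;
            }
        }
        det += std::pow(-1.0, 1.0 + i + 1.0) * a[0][i] * Det(m, n - 1);
    }
    return det;
}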
Please be aware that you should always include expected output, actual output and a minimal complete and verifiable example.