Constructing distance matrix in parallel in C++11 using OpenMP - c++

I would like to construct a distance matrix in parallel in C++11 using OpenMP. I read various documentations, introductions, examples etc. Yet, I still have a few questions. To facilitate answering this post, I state my questions as assumptions numbered 1 through 7. This way, you can quickly browse through them and point out which ones are correct and which ones are not.
Let us begin with a simple serially executed function computing a dense Armadillo matrix:
// [[Rcpp::export]]
arma::mat compute_dist_mat(arma::mat &coordinates, unsigned int n_points) {
arma::mat dist_mat(n_points, n_points, arma::fill::zeros);
double dist {};
for(unsigned int i {0}; i < n_points; i++) {
for(unsigned int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
return dist_mat;
}
As a side note: this function is supposed to be called from R through the Rcpp interface - indicated by the // [[Rcpp::export]]. And accordingly the top of the file includes
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
using namespace arma;
However, the function should work also fine without the R interface.
In an attempt to parallelize the code, I replace the loops with
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
dist_mat.at(i, j) = dist;
dist_mat.at(j, i) = dist;
}
}
and add n_threads as an argument to the compute_dist_mat function.
This distributes the iterations of the outer loop across threads, with the iterations of the inner loop executed by the respective thread handling the outer loop.
The two loop levels cannot be combined because the inner loop depends on the outer one.
dist, i, and j are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
The # pragma line does not have any effect when n_treads = 1, inducing a serial execution.
Extending the dense matrix application, the following code block illustrates the serial sparse matrix case with batch insertion. To motivate the use of sparse matrices here, I set distances below a certain threshold to zero.
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold) {
std::vector<double> dists;
std::vector<unsigned int> dist_i;
std::vector<unsigned int> dist_j;
double dist {};
for(unsigned long int i {0}; i < n_points; i++) {
for(unsigned long int j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists.push_back(dist);
dist_i.push_back(i);
dist_j.push_back(j);
}
}
}
unsigned int mat_size = dist_i.size();
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int j {};
for(unsigned int i {0}; i < mat_size; i++) {
j = i * 2;
index_mat.at(0, j) = dist_i[i];
index_mat.at(1, j) = dist_j[i];
index_mat.at(0, j + 1) = dist_j[i];
index_mat.at(1, j + 1) = dist_i[i];
dists_vec.at(j) = dists[i];
dists_vec.at(j + 1) = dists[i];
}
arma::sp_mat dist_mat(index_mat, values_vec, n_points, n_points);
return dist_mat;
}
Because the function does ex ante not know how many distances are above the threshold, it first stores the non-zero values in standard vectors and then constructs the Armadillo objects from them.
I parallelize the function as follows:
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold, unsigned short int n_threads) {
std::vector<std::vector<double>> dists(n_points);
std::vector<std::vector<unsigned int>> dist_j(n_points);
double dist {};
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
for(j = i + 1; j < n_points; j++) {
dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
if(dist >= dist_threshold) {
dists[i].push_back(dist);
dist_j[i].push_back(j);
}
}
}
unsigned int vec_intervals[n_points + 1];
vec_intervals[0] = 0;
for (i = 0; i < n_points; i++) {
vec_intervals[i + 1] = vec_intervals[i] + dist_j[i].size();
}
unsigned int mat_size {vec_intervals[n_points]};
arma::umat index_mat(2, mat_size * 2);
arma::vec dists_vec(mat_size * 2);
unsigned int vec_begins_i {};
unsigned int vec_length_i {};
unsigned int k {};
# pragma omp parallel for private(i, j, k, vec_begins_i, vec_length_i) num_threads(n_threads) if(n_threads > 1)
for(i = 0; i < n_points; i++) {
vec_begins_i = vec_intervals[i];
vec_length_i = vec_intervals[i + 1] - vec_begins_i;
for(j = 0, j < vec_length_i, j++) {
k = (vec_begins_i + j) * 2;
index_mat.at(0, k) = i;
index_mat.at(1, k) = dist_j[i][j];
index_mat.at(0, k + 1) = dist_j[i][j];
index_mat.at(1, k + 1) = i;
dists_vec.at(k) = dists[i][j];
dists_vec.at(k + 1) = dists[i][j];
}
}
arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
return dist_mat;
}
Using dynamic vectors in the loop is thread-safe.
dist, i, j, k, vec_begins_i, and vec_length_i are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
Nothing has to be marked as a section.
Are any of the seven statements incorrect?

The following does not directly answer your question (it's just some dev code I copied from a personal GitHub repo), but it makes several points clear that may be of use in your application:
OpenMP automatically determines private members so long as you are not doing any dynamic memory allocation within the parallel loop
For sparse matrix distance calculations, it becomes important to move beyond a simple calculation of distance at each non-zero index and instead consider the structure of sparsity that is expected, and optimize for that. In the example below, I assume both matrices are very sparse and their intersection is less than their union. Thus, I "precondition" each distance calculation with squared column sums (for calculating Euclidean distance), and then adjust the calculation for the intersection only. This avoids complicated iterator structures and is very fast.
Using as few temporaries as possible is much to your benefit, and sparse matrix iterators do as good of a job of this as any alternative code anyone may ever write.
Eigen provides better vectorization than Armadillo (across the board, I might add) which means you want Eigen instead of Armadillo if those last 20% of performance gains are important to you.
This function calculates the Euclidean distance between all unique pairs of columns in an Eigen::SparseMatrix<double> object:
// sparse column-wise Euclidean distance between all columns
Eigen::MatrixXd distance(Eigen::SparseMatrix<double>& A) {
Eigen::MatrixXd dists(A.cols(), A.cols());
Eigen::VectorXd sq_colsums(A.cols());
for (int col = 0; col < A.cols(); ++col)
for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
sq_colsums(col) += it.value() * it.value();
#pragma omp parallel for
for (unsigned int i = 0; i < (A.cols() - 1); ++i) {
for (unsigned int j = (i + 1); j < A.cols(); ++j) {
double dist = sq_colsums(i) + sq_colsums(j);
Eigen::SparseMatrix<double>::InnerIterator it1(A, i), it2(A, j);
while (it1 && it2) {
if (it1.row() < it2.row()) ++it1;
else if (it1.row() > it2.row()) ++it2;
else {
dist -= it1.value() * it1.value();
dist -= it2.value() * it2.value();
dist += std::pow(it1.value() - it2.value(), 2);
++it1; ++it2;
}
}
dists(i, j) = std::sqrt(dist);
dists(j, i) = dists(i, j);
}
}
dists.diagonal().array() = 1;
return dists;
}
As Dirk and others have said, there are packages out there (i.e. ParallelDist) that seem to do everything you're after (for dense matrices). Look at wordspace for fast cosine distance calculations. See here for some comparisons. Cosine distance is easy to efficiently calculate in R without use of Rcpp using crossprod operations (see qlcMatrix::cosSparse source code for algorithmic inspiration).

Related

How to multiply a sparse matrix and a dense vector?

I am trying the following:
Eigen::SparseMatrix<double> bijection(2 * face_count, 2 * vert_count);
/* initialization */
Eigen::VectorXd toggles(2 * vert_count);
toggles.setOnes();
Eigen::SparseMatrix<double> deformed;
deformed = bijection * toggles;
Eigen is returning an error claiming:
error: static assertion failed: THE_EVAL_EVALTO_FUNCTION_SHOULD_NEVER_BE_CALLED_FOR_DENSE_OBJECTS
586 | EIGEN_STATIC_ASSERT((internal::is_same<Dest,void>::value),THE_EVAL_EVALTO_FUNCTION_SHOULD_NEVER_BE_CALLED_FOR_DENSE_OBJECTS);
According to the eigen documentaion
Sparse matrix and vector products are allowed. What am I doing wrong?
The problem is you have the wrong output type for the product.
The Eigen documentation states that the following type of multiplication is defined:
dv2 = sm1 * dv1;
Sparse matrix times dense vector equals dense vector.
If you actually do need a sparse representation, I think there is no better way of getting one than performing the multiplication as above and then converting the product to a sparse matrix with the sparseView member function. e.g.
Eigen::SparseMatrix<double> bijection(2 * face_count, 2 * vert_count);
/* initialization */
Eigen::VectorXd toggles(2 * vert_count);
toggles.setOnes();
Eigen::VectorXd deformedDense = bijection * toggles;
Eigen::SparseMatrix<double> deformedSparse = deformedDense.sparseView();
This can be faster than outputting to a dense vector if it is very sparse. Otherwise, 99/100 times the conventional product is faster.
void sparsem_densev_sparsev(const SparseMatrix<double>& A, const VectorX<double>& x, SparseVector<double>& Ax)
{
Ax.resize(x.size());
for (int j = 0; j < A.outerSize(); ++j)
{
if (A.outerIndexPtr()[j + 1] - A.outerIndexPtr()[j] > 0)
{
Ax.insertBack(j) = 0;
}
}
for (int j_idx = 0; j_idx < Ax.nonZeros(); j_idx++)
{
int j = Ax.innerIndexPtr()[j_idx];
for (int k = A.outerIndexPtr()[j]; k < A.outerIndexPtr()[j + 1]; ++k)
{
int i = A.innerIndexPtr()[k];
Ax.valuePtr()[j_idx] += A.valuePtr()[k] * x.coeff(i);
}
}
}
For a (probably not optimal) self-adjoint version (lower triangle), change the j_idx loop to:
for (int j_idx = 0; j_idx < Ax.nonZeros(); j_idx++)
{
int j = Ax.innerIndexPtr()[j_idx];
int i_idx = j_idx;//i>= j, trick to improve binary search
for (int k = A.outerIndexPtr()[j]; k < A.outerIndexPtr()[j + 1]; ++k)
{
int i = A.innerIndexPtr()[k];
Ax.valuePtr()[j_idx] += A.valuePtr()[k] * x.coeff(i);
if (i != j)
{
i_idx = std::distance(Ax.innerIndexPtr(), std::lower_bound(Ax.innerIndexPtr() + i_idx, Ax.innerIndexPtr() + Ax.nonZeros(), i));
Ax.valuePtr()[i_idx] += A.valuePtr()[k] * x.coeff(j);
}
}
}

How to parallelize accumulative probability function with OpenMP?

I'm trying to make more efficient via parallelizing my code that calculates the accumulative probability function. I have a vector<double> of radii called r and I need to count how many elements there are with a radius bigger than a given one > R. In addition, I need to calculate the accumulative probability function for the volume.
The code I have is the following one:
int i, j;
double aux, contar, contar1, aux;
vector<double> r, contador, contador1, vol,
for (i = 0; i != r.size() - 1; i++)
{
aux = r[i];
contador[i] = 0;
contador1[i] = 0;
contar = 0;
contar1 = 0;
vol[i] = 0.0;
for (j = 0; j != r.size() - 1; j++)
{
if(aux <= r[j])
{
contar++;
#pragma omp atomic write
vol[i] = vol[i] + 4.0 * 3.141592653589793 * r[j] * r[j] * r[j] / 3.0;
}
if(aux==r[j])
{
contar1++;
}
}
#pragma omp atomic write
contador[i]=contar;
#pragma omp atomic write
contador1[i]=contar1;
}
but it's not efficient at all. Any help in order to make it more efficient with OpenMP?

How can I parallelize my code about deleting overlapping spheres?

I'm trying to parallelize a piece of code. What my code does is checking if some spheres (defined by their coordinates xcentro, ycentro, zcentro and their radii r) overlapp each other or not. If they overlap, I must delete them, but as I don't know how to delete a component of a vector (it's a mess with the index and stuff) I just set the radii to zero and do not take them into account later.
My problem comes when I try to parallelize the code. If I don't do it, it works properly (although the code is not efficient at all and I need to run it with millions of spheres). And if I try to parallelize it, I obtain several errors. For example, if I try to run the code the exact way it is written below, I obtain segmentation fault. If I eliminate the private(...) part, I don't obtain any error, but don't obtain the same results as without parallelization.
What can I be doing wrong?
Here's the code:
vector<double> xcentro, ycentro, zcentro, r;
r.reserve(34000000);
xcentro.reserve(34000000);
ycentro.reserve(34000000);
zcentro.reserve(34000000);
... read files and fill up xcentro ycentro zcentro r with data ...
//#pragma omp parallel for private(i, j, xcentro, ycentro, zcentro, d) shared(r)
for (size_t i = 0; i < r.size() - 1; i++)
{
//#pragma omp parallel for private(i, j, xcentro, ycentro, zcentro, d) shared(r)
for (size_t j = i + 1; j < r.size() - 1; j++)
{
auto dist_square = (xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
+ (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
+ (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j]);
if ( dist_square < (r[i]+r[j])*(r[i]+r[j]) )
{
//hacer 0 el radio de la esfera j-esima
r[j] = 0;
//hacer 0 el radio de la esfera i-esima
r[i] = 0;
}
}
}
Okay, let's first consider an algorithm which actually works, i.e. obtain the subset of spheres with no overlap. To this end, we don't remove a sphere (before checking whether it overlaps with another one) but merely record that is has overlaps.
struct sphere { double R,X,Y,Z; };
inline constexpr double square(double x) noexcept
{ return x*x; }
inline constexpr bool overlap(sphere const&a, sphere const&b) noexcept
{ return square(a.X-b.X)+square(a.Y-b.Y)+square(a.Z-b.Z) > square(a.R+b.R); }
std::vector<sphere> keep_non_overlapping(std::vector<sphere> const&S)
{
std::vector<char> hasOverlap(Spheres.size(), char(0));
vector<sphere> result;
for(size_t i=0; i<S.size(); ++i) {
for(size_t j=i+1; j<S.size(); ++j)
if((!hasOverlap[i] || !hasOverlap[j]) && overlap(S[i],S[j])) {
hasOverlap[i] = 1;
hasOverlap[j] = 1;
}
if(!hasOverlap[i])
result.push_back(S[i]);
}
return result;
}
This algorithm loops every pair of spheres once. Since the test between spheres k and l is done when i equals the smaller of k and l and j the larger, the executions of the loop over i are still not mutually independent: there is still a race condition. This can be removed by looping over each pair of spheres twice:
std::vector<sphere> keep_non_overlapping(std::vector<sphere> const&S)
{
std::vector<char> hasOverlap(Spheres.size(), char(0));
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i) {
bool overlapping = false;
for(size_t j=0; !overlapping && j<S.size(); ++j)
if(j!=i && overlap(S[i],S[j])
overlapping = true;
hasOverlap[i] = !overlapping;
}
vector<sphere> result;
for(size_t i=0; i<S.size(); ++i)
if(!hasOverlap[i])
result.push_back(S[i]);
return result;
}
Note also that, depending on the distribution of spheres, it can make the execution significantly faster if you first order the sphere is descending radius (largest spheres first) as in
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.R > b.R; });
Note further that this naive O(N^2) algorithm is not optimal. There is likely a O(N ln(N)) algorithm which first arranges the spheres in some data structure (perhaps a spatial tree) in O(N ln(N)) time and then finds whether a sphere is overlapping in no more than O(ln N) time for each sphere.
Hereby, I answer your question asked in the comment:
How could I increase the speed of my program?
The best is to completely change the algorithm (as already suggested), but if you do not wish to change it for any reason, you can gain ca. 20% speed by parallelizing the outer loop:
#pragma omp parallel for schedule(dynamic, r.size()/500)
for (size_t i = 0; i < r.size(); ++i)
{
for (size_t j = i + 1; j < r.size(); ++j)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
#pragma omp atomic write
overlaps[i] = 1;
#pragma omp atomic write
overlaps[j] = 1;
}
}
}
UPDATE:
Based on #Walter’s response and code, I created a simple algorithm that is significantly faster than your code. The basic idea is as follows: Sort the data according to x values and determine the largest radius. For a given x value, it is not necessary to go through the entire range, it is enough to examine those x values that are closer than twice the largest radius. Thus, the number of loop cycles can be significantly reduced and the speed of the algorithm was increased by orders of magnitude. I tested the speed difference between your code and the new algorithm with the code below using arrays filled with data of randomly created spheres. I created the algorithm so that you don't have to change the rest of your program, the new_algorithm function takes the data from xcentro, ycentro, zcetro, r arrays and returns the indexes of the overlapping spheres in the overlay2 array. On compiler explorer significant speed increase was observed:
size=20000
Runtime(your method)=1216 ms
Runtime(new algorithm)=13 ms
Note that this is a simple algorithm and easy to understand how it works, but based on your real data better algorithms may be created. Here is the code:
#include <iostream>
#include <vector>
#include <chrono>
#include <omp.h>
#include <algorithm>
using namespace std;
constexpr size_t N=10000;
std::vector<double> xcentro, ycentro, zcentro, r;
struct sphere { double X,Y,Z,R; size_t index; };
std::vector<sphere> Spheres;
inline constexpr double square(double x) noexcept
{ return x*x; }
inline constexpr bool overlap(sphere const&a, sphere const&b) noexcept
{ return square(a.X-b.X)+square(a.Y-b.Y)+square(a.Z-b.Z) < square(a.R+b.R); }
void new_algorithm(const std::vector<double>& x, const std::vector<double>& y, const std::vector<double>& z, const std::vector<double>& r, std::vector<char>& overlaps)
{
const auto start = std::chrono::high_resolution_clock::now();
std::vector<sphere> S;
S.reserve(r.size());
for (size_t i = 0; i < r.size(); i++)
{
S.push_back(sphere{x[i],y[i],z[i],r[i], i});
}
//Sort ascending X
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.X < b.X; });
// Clear overlaps and determine maximum r value
double maxr=-1;
for (size_t i = 0; i < S.size(); i++)
{
overlaps[i]=0;
if(S[i].R>maxr) maxr=S[i].R;
}
//Create a vector for maximum indices
std::vector<size_t> max_index(S.size(),0);
//Determine maximum_index
size_t j=1;
for (size_t i = 0; i < S.size(); i++)
{
while(S[j].X-S[i].X<2*maxr)
{
if(j<r.size()) j++; else break;
}
max_index[i]=j;
}
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i)
{
for(size_t j=i+1; j<max_index[i]; ++j)
if(overlap(S[i],S[j]))
{
#pragma omp atomic write
overlaps[S[i].index] = 1;
#pragma omp atomic write
overlaps[S[j].index] = 1;
}
}
const auto stop = std::chrono::high_resolution_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << "Runtime(new algorithm)=" << diff.count() << " ms\n";
}
void your_algorithm(std::vector<char>& overlaps)
{
size_t i,j;
const auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
overlaps[i]=0;
}
for (i = 0; i < r.size(); i++)
{
#pragma omp parallel for
for (j = i + 1; j < r.size(); j++)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
overlaps[i] = 1;
overlaps[j] = 1;
}
}
}
const auto stop = std::chrono::high_resolution_clock::now();
auto diff = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
std::cout << "Runtime(your method)=" << diff.count() << " ms" << std::endl;
}
int main() {
std::vector<char> overlaps1, overlaps2;
r.reserve(N);
xcentro.reserve(N);
ycentro.reserve(N);
zcentro.reserve(N);
overlaps1.reserve(N);
overlaps2.reserve(N);
//fill the arrays with random numbers
for(size_t i=0; i<N; i++)
{
double x=(rand() % 1000)/10.0;
double y=(rand() % 1000)/10.0;
double z=(rand() % 1000)/10.0;
double R=(rand() % 10000)/((double)N ) + 0.1;
xcentro.push_back( x );
ycentro.push_back( y );
zcentro.push_back( z );
r.push_back(R);
}
std::cout << "size=" << r.size() << std::endl;
your_algorithm(overlaps1);
new_algorithm(xcentro,ycentro,zcentro,r,overlaps2);
// Check if array of overlap is the same for the 2 methods
for(size_t i=0; i<N; i++)
{
if(overlaps1[i]!=overlaps2[i])
{
cout << "error\n"; exit (-1);
}
}
cout << "OK\n";
}
UPDATE2: Here is the code mentioned in comment (sort by R and remove the bigger sphere only)
std::vector<sphere> S;
S.reserve(r.size());
for (size_t i = 0; i < r.size(); i++)
{
overlaps[i]=0;
S.push_back(sphere{x[i],y[i],z[i],r[i], i});
}
//Sort descending R
std::sort(S.begin(), S.end(), [](sphere const&a, sphere const&b) { return a.R > b.R; });
#pragma omp parallel for
for(size_t i=0; i<S.size(); ++i)
{
for(size_t j=i+1; j<S.size(); ++j)
if(overlap(S[i],S[j]))
{
overlaps[S[i].index] = 1;
break;
}
}
Let us first improve your serial code a bit by avoiding to loop over already deleted spheres:
for(size_t i = 0; i < r.size(); ++i)
if(r[i]>0) {
for(size_t j=i+1; j<r.size(); ++j)
if(r[j]>0 && (xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j])
+ (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j])
+ (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])
< (r[i]+r[j])*(r[i]+r[j]) ) {
r[j] = 0;
r[i] = 0;
}
}
This immediately shows you that the execution of the outer loop depends on all previous executions at smaller index, since these may have removed some of the spheres. This interdependence of the loop executions implies that your algorithm cannot be straightforwardly parallelized (in the way you attempted it).
Also, you have race conditions in the variables r[], which are read and written to. Your naive parallelization didn't take care of that problem either.
Just in case someone is still interested, I've improved my code and now it's more efficient (althought now enough for me yet) and now it does eliminate overlapping spheres:
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
overlaps[i]=0;
}
cout << "overlaps igualados a cero..." << endl;
//Me queda ver qué esferas se superponen y eliminarlas. Primero voy a ver comprobar qué esferas se superponen y posteriormente hago el radio de aquellas
//que se superponen igual a cero.
double cero = 0.0;
for (i = 0; i < r.size(); i++)
{
contador=0;
#pragma omp parallel for reduction(+:contador)
for (j = i + 1; j < r.size(); j++)
{
if ((((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j])) < (r[i] + r[j]) * (r[i] + r[j])))
{
contador++;
overlaps[i] = contador;
overlaps[j]=contador;
}
}
}
#pragma omp parallel for
for(i=0; i<r.size(); i++)
{
if(overlaps[i]!=0)
{
r[i]=0;
}
}

Using Eigen class to sum certain numbers in a vector

I am new to C++ and I am using the Eigen library. I was wondering if there was a way to sum certain elements in a vector. For example, say I have a vector that is a 100 by 1 and I just want to sum the first 10 elements. Is there a way of doing that using the Eigen library?
What I am trying to do is this: say I have a vector that is 1000 by 1 and I want to take the mean of the first 10 elements, then the next 10 elements, and so on and store that in some vector. Hence I will have a vector of size 100 of the averages. Any thoughts or suggestions are greatly appreciated.
Here is the beginning steps I have in my code. I have a S_temp4vector that is 1000 by 1. Now I intialize a new vector S_A that I want to have as the vector of the means. Here is my messy sloppy code so far: (Note that my question resides in the crudeMonteCarlo function)
#include <iostream>
#include <cmath>
#include <math.h>
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <random>
#include <time.h>
using namespace Eigen;
using namespace std;
void crudeMonteCarlo(int N,double K, double r, double S0, double sigma, double T, int n);
VectorXd time_vector(double min, double max, int n);
VectorXd call_payoff(VectorXd S, double K);
int main(){
int N = 100;
double K = 100;
double r = 0.2;
double S0 = 100;
double sigma = 0.4;
double T = 0.1;
int n = 10;
crudeMonteCarlo(N,K,r,S0,sigma,T,n);
return 0;
}
VectorXd time_vector(double min, double max, int n){
VectorXd m(n + 1);
double delta = (max-min)/n;
for(int i = 0; i <= n; i++){
m(i) = min + i*delta;
}
return m;
}
MatrixXd generateGaussianNoise(int M, int N){
MatrixXd Z(M,N);
static random_device rd;
static mt19937 e2(time(0));
normal_distribution<double> dist(0.0, 1.0);
for(int i = 0; i < M; i++){
for(int j = 0; j < N; j++){
Z(i,j) = dist(e2);
}
}
return Z;
}
VectorXd call_payoff(VectorXd S, double K){
VectorXd C(S.size());
for(int i = 0; i < S.size(); i++){
if(S(i) - K > 0){
C(i) = S(i) - K;
}else{
C(i) = 0.0;
}
}
return C;
}
void crudeMonteCarlo(int N,double K, double r, double S0, double sigma, double T, int n){
// Create time vector
VectorXd tt = time_vector(0.0,T,n);
VectorXd t(n);
double dt = T/n;
for(int i = 0; i < n; i++){
t(i) = tt(i+1);
}
// Generate standard normal Z matrix
//MatrixXd Z = generateGaussianNoise(N,n);
// Generate the log normal stock process N times to get S_A for crude Monte Carlo
MatrixXd SS(N,n+1);
MatrixXd Z = generateGaussianNoise(N,n);
for(int i = 0; i < N; i++){
SS(i,0) = S0;
for(int j = 1; j <= n; j++){
SS(i,j) = SS(i,j-1)*exp((double) (r - pow(sigma,2.0))*dt + sigma*sqrt(dt)*(double)Z(i,j-1));
}
}
// This long bit of code gives me my S_A.....
Map<RowVectorXd> S_temp1(SS.data(), SS.size());
VectorXd S_temp2(S_temp1.size());
for(int i = 0; i < S_temp2.size(); i++){
S_temp2(i) = S_temp1(i);
}
VectorXd S_temp3(S_temp2.size() - N);
int count = 0;
for(int i = N; i < S_temp2.size(); i++){
S_temp3(count) = S_temp2(i);
count++;
}
VectorXd S_temp4(S_temp3.size());
for(int i = 0; i < S_temp4.size(); i++){
S_temp4(i) = S_temp3(i);
}
VectorXd S_A(N);
S_A(0) = (S_temp4(0) + S_temp4(1) + S_temp4(2) + S_temp4(3) + S_temp4(4) + S_temp4(5) + S_temp4(6) + S_temp4(7) + S_temp4(8) + S_temp4(9))/(n);
S_A(1) = (S_temp4(10) + S_temp4(11) + S_temp4(12) + S_temp4(13) + S_temp4(14) + S_temp4(15) + S_temp4(16) + S_temp4(17) + S_temp4(18) + S_temp4(19))/(n);
int count1 = 0;
for(int i = 0; i < S_temp4.size(); i++){
S_A(count1) =
}
// Calculate payoff of Asian option
//VectorXd call_fun = call_payoff(S_A,K);
}
This question includes a lot of code, which makes it hard to understand the question you're trying to ask. Consider including only the code specific to your question.
In any case, you can use Eigen directly to do all of these things quite simply. In Eigen, Vectors are just matrices with 1 column, so all of the reasoning here is directly applicable to what you've written.
const Eigen::Matrix<double, 100, 1> v = Eigen::Matrix<double, 100, 1>::Random();
const int num_rows = 10;
const int num_cols = 1;
const int starting_row = 0;
const int starting_col = 0;
const double sum_of_first_ten = v.block(starting_row, starting_col, num_rows, num_cols).sum();
const double mean_of_first_ten = sum_of_first_ten / num_rows;
In summary: You can use .block to get a block object, .sum() to sum that block, and then conventional division to get the mean.
You can reshape the input using Map and then do all sub-summations at once without any loop:
VectorXd A(1000); // input
Map<MatrixXd> B(A.data(), 10, A.size()/10); // reshaped version, no copy
VectorXd res = B.colwise().mean(); // partial reduction, you can also use .sum(), .minCoeff(), etc.
The Eigen documentation at https://eigen.tuxfamily.org/dox/group__TutorialBlockOperations.html says an Eigen block is a rectangular part of a matrix or array accessed by matrix.block(i,j,p,q) where i and j are the starting values (eg 0 and 0) and p and q are the block size (eg 10 and 1). Presumably you would then iterate i in steps of 10, and use std::accumulate or perhaps an explicit summation to find the mean of matrix.block(i,0,10,1).

Optimization of C++ code - std::vector operations

I have this funcition (RotateSlownessTop) and it's called about 800 times computing the corresponding values. But the calculation is slow and is there a way I can make the computations faster.
The number of element in X/Y is 7202. (Fairly large set)
I did the performance analysis and the screenshot has been attached.
void RotateSlownessTop(vector <double> &XR1, vector <double> &YR1, float theta = 0.0)
{
Matrix2d a;
a(0,0) = cos(theta);
a(0,1) = -sin(theta);
a(1, 0) = sin(theta);
a(1, 1) = cos(theta);
vector <double> XR2(7202), YR2(7202);
for (size_t i = 0; i < X.size(); ++i)
{
XR2[i] = (a(0, 0)*X[i] + a(0, 1)*Y[i]);
YR2[i] = (a(1, 0)*X[i] + a(1, 1)*Y[i]);
}
size_t i = 0;
size_t j = 0;
while (i < YR2.size())
{
if (i > 0)
if ((XR2[i]>0) && (XR2[i-1]<0))
j = i;
if (YR2[i] > (-1e-10) && YR2[i]<0.0)
YR2[i] = 0.0;
if (YR2[i] < (1e-10) && YR2[i]>0.0)
YR2[i] = -YR2[i];
if ( YR2[i]<0.0)
{
YR2.erase(YR2.begin() + i);
XR2.erase(XR2.begin() + i);
--i;
}
++i;
}
size_t k = 0;
while (j < YR2.size())
{
YR1[k] = (YR2[j]);
XR1[k] = (XR2[j]);
YR2.erase(YR2.begin() + j);
XR2.erase(XR2.begin() + j);
++k;
}
size_t l = 0;
for (; k < XR1.size(); ++k)
{
XR1[k] = XR2[l];
YR1[k] = YR2[l];
l++;
}
}
Edit1: I have updated the code by replacing all push_back() with operator[], since I read somewhere that this is much faster.
However the whole program is still slow. Any suggestions are appreciated.
If the size is large, you can improve the push_back by pre-allocating the space needed. Add this before the loop:
XR2.reserve(X.size());
YR2.reserve(X.size());