Find similar distances between all values in vector and subset them - c++

Given a vector of double values, I want to know which pairs of elements of this vector have distances that are similar to each other. In the best case, the result is a vector of subsets of the original values, where each subset should have at least n members.
//given
vector<double> values = {1,2,3,4,8,10,12}; //with simple values as example
//some algorithm
//desired result as:
vector<vector<double> > subset;
//in case of above example I would expect some result like:
//subset[0] = {1,2,3,4}; //distance 1
//subset[1] = {8,10,12}; //distance 2
//subset[2] = {4,8,12}; // distance 4
//subset[3] = {2,4}; //also distance 2 but not connected with subset[1]
//subset[4] = {1,3}; //also distance 2 but not connected with subset[1] or subset[3]
//many others if n is just 2. If n is 3 (normally the minimum) these small subsets should be excluded.
This example is simplified: with integer values the possible distances could simply be enumerated and tested against the vector, which is not possible for double or float values.
My idea so far
I thought of calculating all pairwise distances and storing them in a vector, then building a matrix of the differences between those distances and thresholding that matrix with some tolerance to find similar distances.
//Calculate distances: result is a vector
vector<double> distances;
for (int i = 0; i < values.size(); i++)
for (int j = 0; j < values.size(); j++)
{
if (i >= j)
continue;
distances.push_back(std::abs(values[i] - values[j]));
}
//Calculate difference of these distances: result is a matrix
Mat DiffDistances = Mat::zeros(Size(distances.size(), distances.size()), CV_32FC1);
for (int i = 0; i < distances.size(); i++)
for (int j = 0; j < distances.size(); j++)
{
if (i >= j)
continue;
DiffDistances.at<float>(i,j) = std::abs(distances[i] - distances[j]);
}
//threshold this matrix with some tolerance in difference distances
threshold(DiffDistances, DiffDistances, maxDistTol, 255, CV_THRESH_BINARY_INV);
//get points with similar distances
vector<Point> DiffDistancePoints;
findNonZero(DiffDistances, DiffDistancePoints);
At this point I get stuck finding the original values that correspond to my similar distances. It should be possible to find them, but tracing back the indices seems very complicated, and I wonder if there isn't an easier way to solve the problem.
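One idea I have to keep that trace-back information would be to store the index pair together with each distance, instead of only the distance itself. A rough sketch of what I mean (my own illustration, the struct and function names are made up):
#include <cmath>
#include <vector>

// Sketch: remember which pair of values produced each distance, so that a hit in the
// thresholded difference matrix at (row, col) can be traced back to the original values.
struct PairDistance {
    std::size_t a, b;   // indices into 'values'
    double dist;        // absolute distance between values[a] and values[b]
};

std::vector<PairDistance> pairDistances(const std::vector<double>& values) {
    std::vector<PairDistance> result;
    for (std::size_t a = 0; a + 1 < values.size(); ++a)
        for (std::size_t b = a + 1; b < values.size(); ++b)
            result.push_back({a, b, std::abs(values[b] - values[a])});
    return result;
}
With this, a nonzero entry of DiffDistances at (i, j) would directly give the two index pairs result[i] and result[j], and from those the original values.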

Here is a solution that works as long as there are no branches, meaning that there are no values closer together than 2*threshold. That is the valid neighbor region, because neighboring distances should differ by less than the threshold, if I understood @Phann correctly.
The solution is definitely neither the fastest nor the nicest possible solution. But you might use it as a starting point:
#include <iostream>
#include <vector>
#include <algorithm>
int main(){
std::vector< double > values = {1,2,3,4,8,10,12};
const unsigned int nValues = values.size();
std::vector< std::vector< double > > distanceMatrix(nValues - 1);
// The distanceMatrix has a triangular shape
// First vector contains all distances to value zero
// Second row all distances to value one for larger values
// nth row all distances to value n-1 except those already covered
std::vector< std::vector< double > > similarDistanceSubsets;
double threshold = 0.05;
std::sort(values.begin(), values.end());
for (unsigned int i = 0; i < nValues-1; ++i) {
distanceMatrix.at(i).resize(nValues-i-1);
for (unsigned j = i+1; j < nValues; ++j){
distanceMatrix.at(i).at(j-i-1) = values.at(j) - values.at(i);
}
}
for (unsigned int i = 0; i < nValues-1; ++i) {
for (unsigned int j = i+1; j < nValues; ++j) {
std::vector< double > thisSubset;
double thisDist = distanceMatrix.at(i).at(j-i-1);
// This distance already belongs to another cluster
if (thisDist < 0) continue;
double minDist = thisDist - threshold;
double maxDist = thisDist + threshold;
thisSubset.push_back(values.at(i));
thisSubset.push_back(values.at(j));
//Indicate that this is already clustered
distanceMatrix.at(i).at(j-i-1) = -1;
unsigned int lastIndex = j;
for (unsigned int k = j+1; k < nValues; ++k) {
thisDist = distanceMatrix.at(lastIndex).at(k-lastIndex-1);
// This distance already belongs to another cluster
if (thisDist < 0) continue;
// Check if you found a new valid pair
if ((thisDist > minDist) && (thisDist < maxDist)){
// Update the valid distance interval
minDist = thisDist - threshold;
maxDist = thisDist + threshold;
// Add the newly found point
thisSubset.push_back(values.at(k));
// Indicate that this is already clustered
distanceMatrix.at(lastIndex).at(k-lastIndex-1) = -1;
// Continue the search from here
lastIndex = k;
}
}
if (thisSubset.size() > 2) {
similarDistanceSubsets.push_back(thisSubset);
}
}
}
for (unsigned int i = 0; i < similarDistanceSubsets.size(); ++i) {
for (unsigned int j = 0; j < similarDistanceSubsets.at(i).size(); ++j) {
std::cout << similarDistanceSubsets.at(i).at(j);
if (j != similarDistanceSubsets.at(i).size()-1) {
std::cout << " ";
}
else {
std::cout << std::endl;
}
}
}
}
The idea is to precompute the distances and then check, for every pair of values, starting from the smallest value and its larger neighbors, whether there is another valid pair above it. If so, all of these are collected in a subset, and this subset is added to the subset vector. For every new value the valid neighbor region has to be updated to ensure that neighboring distances differ by less than the threshold. Afterwards, the program continues with the next smallest value and its larger neighbors, and so on.

Here is an algorithm which is slightly different from yours; it is O(n^3) in the length n of the vector, so not very efficient.
It is based on the premise that you want to have subsets of at least size 2. So what you can do is consider all the two-element subsets of the vector, then find all other elements that also match.
So given a function
std::vector<int> findSubset(std::vector<int> v, int baseValue, int distance) {
// Find the subset of all elements in v that differ by a multiple of
// distance from the base value
}
you can do
std::vector<std::vector<int>> findSubsets(std::vector<int> v) {
std::vector<std::vector<int>> subsets;
for(int i = 0; i < v.size(); i++) {
for(int j = i + 1; j < v.size(); j++) {
subsets.push_back(findSubset(v, v[i], abs(v[i] - v[j])));
}
}
return subsets;
}
The only remaining problem is keeping track of duplicates; maybe you can keep a hashed list of (baseValue % distance, distance) pairs for all the subsets you have already found.
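Just to make the idea more concrete, a rough sketch of findSubset could look like this (my own illustration; for doubles the exact modulo test would have to be replaced by a comparison against some tolerance):
#include <vector>

// Sketch: collect all elements of v that differ from baseValue by a multiple of distance.
std::vector<int> findSubset(const std::vector<int>& v, int baseValue, int distance) {
    std::vector<int> subset;
    if (distance == 0) return subset;         // guard against duplicate values
    for (int x : v)
        if ((x - baseValue) % distance == 0)  // baseValue itself is included (0 is a multiple)
            subset.push_back(x);
    return subset;
}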

Related

Efficiently use Eigen for repeated sparse matrix assembly in nonlinear finite element code

I am trying to use Eigen to efficiently assemble a Stiffness matrix for non-linear finite element computations.
From my finite element discretization I can exactly extract my sparsity pattern. Therefore I can just use:
mat.reserve(nnz);
mat.setFromTriplets(TripletList.begin(), TripletList.end());
as proposed in http://eigen.tuxfamily.org/dox/group__SparseQuickRefPage.html.
My questions that arise here are:
Due to the non-linear nature I have to refill my matrix very often. Should I therefore again store all contributions in a triplet list and reuse mat.setFromTriplets(...) again and again?
If I reuse mat.setFromTriplets(...), can I somehow exploit the fact that I always evaluate my element matrices for the assembly in the same order, so that the indices in the triplet list never change but only the values? Then the "search in memory" could be circumvented, since I could maybe store the place where each value has to go in a separate array.
If mat.coeffRef(i,j) is faster, can I maybe exploit the aforementioned fact?
One extra question (lower priority): is it possible to store and assemble 3 matrices with the same sparsity pattern efficiently, i.e. if I have to do it in a loop? For example a matrix wrapper where I have one SparseMatrix container and get the matrices as M1=mat[0], M2=mat[1], M3=mat[2], where mat[i] returns the i-th matrix and M1, M2 and M3 are e.g. SparseMatrix<double> M1(1000,1000).
The general setup is the following (for question 1.-3. only M1 appears):
std::vector< Eigen::Triplet<double> > tripletListA; // triplets differ only in the values and not in the indices
std::vector< Eigen::Triplet<double> > tripletListB;
std::vector< Eigen::Triplet<double> > tripletListC;
SparseMatrix<double> M1(1000,1000);
SparseMatrix<double> M2(1000,1000);
SparseMatrix<double> M3(1000,1000);
//Reserve space in triplets
tripletListA.reserve(nnz);
tripletListB.reserve(nnz);
tripletListC.reserve(nnz);
//Reserve space in matrices
M1.reserve(nnz);
M2.reserve(nnz);
M3.reserve(nnz);
//fill triplet list with zeros
M1.setFromTriplets(tripletListA.begin(), tripletListA.end());
M2.setFromTriplets(tripletListB.begin(), tripletListB.end());
M3.setFromTriplets(tripletListC.begin(), tripletListC.end());
for (int i=0; i<1000; i++) {
//Fill triplets
M1.setFromTriplets(tripletListA.begin(), tripletListA.end()); //or use coeffRef?
M2.setFromTriplets(tripletListB.begin(), tripletListB.end());
M3.setFromTriplets(tripletListC.begin(), tripletListC.end());
//solve
//update
}
Thank you and regards,
Alex
UPDATE:
Thank you for your answers. Initially the order of my access to the nonzeros is quite arbitrary. But since I'm interested in an iterative scheme, I am thinking about documenting this random ordering and constructing an operator which takes care of it. This operator can be constructed (at least in my mind) from the initially constructed triplet list.
SparseMatrix<double> mat(rows,cols);
std::vector<double> valuevector(nnz);
//Initially construction
std::vector< Eigen::Triplet<double> > tripletList;
//naive fill of tripletList
//Sorting of entries and identifying double entries in tripletList from col and row values
//generating from this information operator P
for (int i=0; i<1000; i++)
{
//naive refill of tripletList
valuevector = P*tripletList.value(); //constructing vector in efficient ordering from values of triplets (the tripletList.value() call does not make sense for a std::vector, but I hope it is clear what I have in mind)
for (int k=0; k<mat.outerSize(); ++k)
for (SparseMatrix<double>::InnerIterator it(mat,k); it; ++it)
it.valueRef() =valuevector(it);
}
I think of the operator P just as a matrix with ones and zeros at the appropriate places.
The question remains whether this is even a more efficient procedure.
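For what it is worth, a concrete version of that write-back loop (just a sketch of my own idea, assuming valuevector is already ordered exactly like the matrix's internal column-major storage) could use a running counter instead of indexing with the iterator:
// Sketch: copy the pre-ordered values into the existing sparsity pattern.
int pos = 0;
for (int k = 0; k < mat.outerSize(); ++k)
    for (SparseMatrix<double>::InnerIterator it(mat, k); it; ++it)
        it.valueRef() = valuevector[pos++];
Since the matrix is in compressed mode after setFromTriplets, the same values can also be written directly through mat.coeffs(), which is what the benchmark below does.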
UPDATE-2: Benchmark:
I tried to put my ideas into a code snippet. I first generate a random triplet list. This list is constructed to get a sparsity of 95%, and additionally some values in the list are duplicated to mimic duplicates in the triplet list which write to the same position in the sparse matrix. These values are then inserted based on different concepts. The first one is the setFromTriplets approach, and the second and third try to exploit the known structure.
The second and third approach document the ordering of the triplet list. This information is then exploited to write the values directly into the raw mat1.coeffs() vector.
#include <iostream>
#include <Eigen/Sparse>
#include <random>
#include <fstream>
#include <chrono>
using namespace std::chrono;
using namespace Eigen;
using namespace std;
typedef Eigen::Triplet<double> T;
void findDuplicates(vector<pair<int, int> > &dummypair, Ref<VectorXi> multiplicity) {
// Iterate over the sorted pair list and count how often each distinct pair occurs
int pairCount = 0;
pair<int, int> currentPair;
for (int i = 0; i < multiplicity.size(); ++i) {
currentPair = dummypair[pairCount];
while (pairCount + multiplicity[i] < (int)dummypair.size() && currentPair == dummypair[pairCount + multiplicity[i]]) {
multiplicity[i]++;
}
pairCount += multiplicity[i];
}
}
typedef Matrix<duration<double, std::milli>, Dynamic, Dynamic> MatrixXtime;
int main() {
//init random generators
std::default_random_engine gen;
std::uniform_real_distribution<double> dist(0.0, 1.0);
int sizesForTest = 5;
int measures = 6;
MatrixXtime timeArray(sizesForTest, measures);
cout << "TripletTime NestetTime LNestedTime " << endl;
for (int m = 0; m < sizesForTest; ++m) {
int rows = pow(10, m + 1);
int cols = rows;
std::uniform_int_distribution<int> distentryrow(0, rows - 1);
std::uniform_int_distribution<int> distentrycol(0, cols - 1);
std::vector<T> tripletList;
SparseMatrix<double> mat1(rows, cols);
// SparseMatrix<double> mat2(rows,cols);
// SparseMatrix<double> mat3(rows,cols);
//generate sparsity pattern of matrix with about 5% fill-in
tripletList.emplace_back(3, 0, 15);
for (int i = 0; i < rows; ++i)
for (int j = 0; j < cols; ++j) {
auto value = dist(gen); //generate random number
auto value2 = dist(gen); //generate random number
auto value3 = dist(gen); //generate random number
if (value < 0.05) {
auto rowindex = distentryrow(gen);
auto colindex = distentrycol(gen);
tripletList.emplace_back(rowindex, colindex, value); //if below the threshold, insert it
//duplicate every third entry to mimic entries which appear more than once
if (value2 < 0.3333333333333333333333)
tripletList.emplace_back(rowindex, colindex, value);
//triple every fourth entry to mimic entries which appear more than once
if (value3 < 0.25)
tripletList.emplace_back(rowindex, colindex, value);
}
}
tripletList.emplace_back(3, 0, 9);
int numberOfValues = tripletList.size();
//initially set all matrices from triplet to allocate space and sparsity pattern
mat1.setFromTriplets(tripletList.begin(), tripletList.end());
// mat2.setFromTriplets(tripletList.begin(), tripletList.end());
// mat3.setFromTriplets(tripletList.begin(), tripletList.end());
int nnz = mat1.nonZeros();
//reset all entries back to zero to fill in later
mat1.coeffs().setZero();
// mat2.coeffs().setZero();
// mat3.coeffs().setZero();
//document sorting of entries for repeated insertion
VectorXi internalIndex(numberOfValues);
vector<pair<int, int> > dummypair(numberOfValues);
VectorXd valuelist(numberOfValues);
for (int l = 0; l < numberOfValues; ++l) {
valuelist(l) = tripletList[l].value();
}
//init internalindex and dummy pair
internalIndex = Eigen::VectorXi::LinSpaced(numberOfValues, 0.0, numberOfValues - 1);
for (int i = 0; i < numberOfValues; ++i) {
dummypair[i].first = tripletList[i].col();
dummypair[i].second = tripletList[i].row();
}
auto start = high_resolution_clock::now();
// sort the vector internalIndex based on the dummypair
sort(internalIndex.begin(), internalIndex.end(), [&](int i, int j) {
return dummypair[i].first < dummypair[j].first ||
(dummypair[i].first == dummypair[j].first && dummypair[i].second < dummypair[j].second);
});
auto stop = high_resolution_clock::now();
timeArray(m, 3) = (stop - start) / 1000;
start = high_resolution_clock::now();
sort(dummypair.begin(), dummypair.end());
stop = high_resolution_clock::now();
timeArray(m, 4) = (stop - start) / 1000;
start = high_resolution_clock::now();
VectorXi dublicatecount(nnz);
dublicatecount.setOnes();
findDuplicates(dummypair, dublicatecount);
stop = high_resolution_clock::now();
timeArray(m, 5) = (stop - start) / 1000;
dummypair.clear();
//calculate vector containing all indices of triplet
//therefore vector[k] is the VectorXi containing the entries of the triplets which should be written at dof k
int indextriplet = 0;
int multiplicity = 0;
vector<VectorXi> listofentires(mat1.nonZeros());
for (int k = 0; k < mat1.nonZeros(); ++k) {
multiplicity = dublicatecount[k];
listofentires[k] = internalIndex.segment(indextriplet, multiplicity);
indextriplet += multiplicity;
}
//========================================
//Here the nonlinear analysis should start and everything beforehand is preprocessing
//Test1 from triplets
start = high_resolution_clock::now();
mat1.setFromTriplets(tripletList.begin(), tripletList.end());
stop = high_resolution_clock::now();
timeArray(m, 0) = (stop - start) / 1000;
mat1.coeffs().setZero();
//Test2 use internalIndex but calculate listofentires on the fly
indextriplet = 0;
start = high_resolution_clock::now();
for (int k = 0; k < mat1.nonZeros(); ++k) {
multiplicity = dublicatecount[k];
mat1.coeffs()[k] += valuelist(internalIndex.segment(indextriplet, multiplicity)).sum();
indextriplet += multiplicity;
}
stop = high_resolution_clock::now();
timeArray(m, 1) = (stop - start) / 1000;
mat1.coeffs().setZero();
//Test3 directly use listofentires
start = high_resolution_clock::now();
for (int k = 0; k < mat1.nonZeros(); ++k)
mat1.coeffs()[k] += valuelist(listofentires[k]).sum();
stop = high_resolution_clock::now();
timeArray(m, 2) = (stop - start) / 1000;
std::ofstream file("test.txt");
if (file.is_open()) {
file << mat1 << '\n';
}
cout << "Size: " << rows << ": ";
for (int n = 0; n < measures; ++n)
cout << timeArray(m, n).count() << " ";
cout << endl;
}
return 0;
}
If I run this example on my i5-6600K 3.5GHz with 16GB RAM, I end up with the following results, which are the times in seconds.
Size Triplet Nested LessNested Sort_intIndex Sort_dum_pair findDuplica
10 1e-06 1e-06 2e-06 1e-06 1e-06 1e-06
100 2.8e-05 4e-06 1.4e-05 5e-05 4.2e-05 1e-05
1000 0.003 0.000416 0.001489 0.01012 0.00627 0.000635
10000 0.426 0.093911 0.48912 1.5389 0.780676 0.061881
100000 337.799 99.0801 37.3656 292.397 87.4488 0.79996
The first three columns denote the calculation time of the different approaches, and columns 4 to 6 denote the times for the different preprocessing steps.
For a size of 100000 rows and columns my RAM fills up relatively fast, and therefore the last table entry should be taken with care. Here the fastest method changes from the second to the third.
My questions here are: is this approach going in the right direction to improve efficiency? Or is it a completely wrong direction because, for example, an assembly time of 0.48s for a size of 10000 seems a bit high?
Additionally, the preprocessing steps are getting expensive very fast; is there a better way to construct the ordering of the matrix? Finally, as a last question, is the benchmarking done in the correct way?
Thanks for your time,
Alex

find the most similar value between two vectors in C++

I have two sorted vectors and I want to find the index of a value in vector1 that has the smallest difference (distance) to another value in vector2. My following code does the job; however, because the vectors I use are always sorted, I feel there must be a more efficient way to do the same thing. Any guidance? Thanks in advance.
#include<iostream>
#include<cmath>
#include<vector>
#include<limits>
std::vector<float> v1{2,3,6,7,9};
std::vector<float> v2{4,6.2,10};
int main(int argc, const char * argv[])
{
float mn=std::numeric_limits<float>::infinity();
float difference;
int index;
for(int i=0; i<v1.size(); i++){
for(int j=0; j<v2.size(); j++){
difference = std::abs(v1[i]-v2[j]);
if(difference < mn){
mn= difference;
index = i;
}
}
}
std::cout<< index;
// 2 is the wanted index because |6-6.2| is the smallest distance between the 2 vectors
return 0;
}
Indeed, there is a faster way. You only need to compare elements in v1 to those in v2 that are smaller or equal, or the first that is greater. Basically, the idea is to have two iterators, i and j, and advance j if v2[j] < v1[i], otherwise advance i. Here is a possible implementation:
for (int i = 0, j = 0; i < v1.size(); i++) {
while (true) {
difference = std::abs(v1[i] - v2[j]);
if (difference < mn) {
mn = difference;
index = i;
}
// Try the next item in v1 if the current item in v2 is bigger.
if (v2[j] > v1[i])
break;
// Otherwise, try the next item in v2, unless we are at the last item.
if (j + 1 < v2.size())
j++;
else
break;
}
}
While it still looks like a double loop, it only computes differences at most v1.size() + v2.size() times, instead of v1.size() * v2.size() times.
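If v1 is much shorter than v2, you could also binary-search v2 for each element of v1, since v2 is sorted. A sketch of that variant (my own illustration, not a drop-in replacement for the code above):
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Sketch: for each element of v1, binary-search the closest element of v2.
// O(|v1| * log |v2|), which can beat the linear scan when v1 is much shorter than v2.
int closestIndex(const std::vector<float>& v1, const std::vector<float>& v2) {
    float mn = std::numeric_limits<float>::infinity();
    int index = -1;
    for (int i = 0; i < (int)v1.size(); ++i) {
        auto it = std::lower_bound(v2.begin(), v2.end(), v1[i]);
        // Candidates are *it and the element just before it, if they exist.
        if (it != v2.end() && std::abs(v1[i] - *it) < mn) {
            mn = std::abs(v1[i] - *it);
            index = i;
        }
        if (it != v2.begin() && std::abs(v1[i] - *(it - 1)) < mn) {
            mn = std::abs(v1[i] - *(it - 1));
            index = i;
        }
    }
    return index;
}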

Sum up distances in a for loop

I have a vector of Points and I calculate the distances between every Point (P1P2, P1P3, P1P4,....P1PN, P2P1, ... ,PMPN).
Now I want to sum all the distances of Point 1 to every other point, then all the distances of Point 2 to every other point and so on (P1P2+P1P3+...+P1PN, P2P1+P2P3+...+P2PN), and put these sums into a vector. I am stuck in my for loop now:
Here is my code:
// Calculate mass centers
vector<Point2f> centroids_1;
// Calculate distances between all mass centers
vector<double> distance_vector;
for (int i = 0, iend = centroids_1.size(); i < iend; i++) {
for (int j = 0, jend = centroids_1.size(); j < jend; j++) {
double distance = norm(centroids_1[i] - centroids_1[j]);
distance_vector.push_back(distance);
// Here I tried many things with for loops and while loops but
// I couldn't find a proper solution
}
}
Use the standard library instead of raw loops. It will be easier to read and maintain. Plus, the indices are noise. They aren't required for iteration.
for(auto const& point : centroids_1)
distance_vector.push_back(std::accumulate(begin(centroids_1), end(centroids_1), 0.0,
[&](auto res, auto const& point2) { return res + norm(point - point2); }
));
Specifically, we used a range-based-for loop along with std::accumulate. This is the description of what you want to do. Store for each point the accumulated sum of distances between it and other points.
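For reference, here is a self-contained version of the same idea with a plain point type and Euclidean distance (cv::norm and centroids_1 come from the asker's OpenCV context, so the Point2 struct and dist helper below are just assumed stand-ins):
#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

struct Point2 { double x, y; };                  // stand-in for cv::Point2f

double dist(const Point2& a, const Point2& b) {  // Euclidean distance
    return std::hypot(a.x - b.x, a.y - b.y);
}

int main() {
    std::vector<Point2> centroids_1 = {{0, 0}, {3, 4}, {6, 8}};
    std::vector<double> distance_vector;
    for (const auto& point : centroids_1)
        distance_vector.push_back(std::accumulate(centroids_1.begin(), centroids_1.end(), 0.0,
            [&](double res, const Point2& point2) { return res + dist(point, point2); }));
    for (double d : distance_vector)
        std::cout << d << '\n';                  // prints 15, 10, 15
}
Note that std::accumulate lives in the <numeric> header.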
You are not adding up the distances anywhere. After the inner loop finishes its first full pass, the answer for the first point is ready and you can save it.
Also, you don't need the distance between a point and itself, so skip the case i == j:
for (int i = 0, iend = centroids_1.size(); i < iend; i++)
{
double distance=0.0;
for (int j = 0, jend = centroids_1.size(); j < jend; j++)
{
if(i==j)
continue;
distance += norm(centroids_1[i] - centroids_1[j]);
}
distance_vector.push_back(distance);
}

Prevent Cycles in Maximum Spanning Tree

I am trying to create a maximum spanning tree in C++ but am having trouble preventing cycles. The code I have works alright for some cases, but for the majority of cases there is a cycle. I am using an adjacency matrix to find the edges.
double maximumST( vector< vector<double> > adjacencyMatrix ) {
const int size = adjacencyMatrix.size();
vector <double> edges;
int edgeCount = 0;
double value = 0;
std::vector<std::vector<double>> matrix(size, std::vector<double>(size));
for (int i = 0; i < size; i++) {
for (int j = i; j < size; j++) {
if (adjacencyMatrix[i][j] != 0) {
edges.push_back(adjacencyMatrix[i][j]);
matrix[i][j] = adjacencyMatrix[i][j];
edgeCount++;
}
}
}
sort(edges.begin(), edges.end(), std::greater<double>());
for (int i = 0; i < (size - 1); i++) {
value += edges[i];
}
return value;
}
One way I've tried to find a cycle was by creating a new adjacency matrix for the edges and checking it before adding a new edge, but that did not perform as expected. I also tried to build a 3D matrix, but I could not get that to work either.
What's a new approach I should try to prevent cycles?
You should add the edge if the lowest common ancestor(LCA) of the two vertices corresponding to that edge is not root.
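A different, standard way to prevent cycles (not what the answer above describes, just an alternative worth knowing) is Kruskal's algorithm with a disjoint-set/union-find structure: sort the edges by descending weight and only accept an edge if its endpoints are not yet in the same component. A rough sketch:
#include <algorithm>
#include <functional>
#include <numeric>
#include <tuple>
#include <vector>

// Sketch: weight of a maximum spanning tree via Kruskal + union-find.
double maximumST(const std::vector<std::vector<double>>& adj) {
    const int n = adj.size();
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    std::function<int(int)> find = [&](int v) {      // find root with path compression
        return parent[v] == v ? v : parent[v] = find(parent[v]);
    };
    std::vector<std::tuple<double, int, int>> edges; // (weight, u, v)
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (adj[i][j] != 0)
                edges.emplace_back(adj[i][j], i, j);
    std::sort(edges.rbegin(), edges.rend());         // heaviest edges first
    double value = 0;
    for (const auto& [w, u, v] : edges) {
        int ru = find(u), rv = find(v);
        if (ru != rv) {                              // accepting this edge cannot form a cycle
            parent[ru] = rv;
            value += w;
        }
    }
    return value;
}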

creating matrix using 2-d vector c++

I'm trying to explain the problem I have.
I need a 2-D matrix which contains 233x233 rows and columns.
for (int i = 0; i < dimension; i++) {
for (int j = 0; j < dimension; j++) {
distance3 = sqrt(pow((apointCollection2[j].x - apointCollection[i].x1), 2) + pow((apointCollection2[j].y - apointCollection[i].y1), 2));
if (distance3 < Min)
{
Min = distance3;
station = busStation;
}
distance2 = sqrt(pow((apointCollection2[j].x - apointCollection[i].x2), 2) + pow((apointCollection2[j].y - apointCollection[i].y2), 2));
if (distance2 < Min2)
{
Min2 = distance2;
station1 = busStation;
}
}
}
So I find the minimum distance and the two stations with the minimum distance. The first station (station) corresponds to the row and the second one (station1) corresponds to the column. Then I need to increment the number of people this pair (it can be called a route) has.
Then I need to find station and station1 again after the second iteration, and if they are the same I just need to increment the people count and not add the same stations to the vector again.
Or, another variant I thought of:
I create a 2-D vector with 233x233 cells and a 0 value in each cell.
vector< vector<int> > m;
cout << "Filling matrix with test numbers.";
m.resize(233);
for (int i = 0; i < 233; i++)
{
m[i].resize(233);
for (int j = 0; j < 233; j++)
{
}
}
After the loop above I decided to create the following, where I find the min distance.
Here I want to increment somehow:
m[station][station1] = person;
if (find(m.begin(), m.end(), station, station1))
{
person++;
}
else
{
m[station][station1] = person;
}
I have an error in "find" because there is no matching instance of the function template. Another problem is that I don't add values to the vector; there is also a mistake when I want to add them.
This should be quite easy to do; I just need to find out the logic I should follow.
Thanks in advance