I want to generate a huge weighted undirected graph, represented by a huge adjacency matrix AJM, so that for all i and j:
AJM[i][j] = AJM[j][i]
AJM[i][i] = 0
The weights are generated as random doubles in some interval, say [0.01, 10.00]. If I have 10k vertices, the matrix would be 10k by 10k with double-type entries, which is a huge chunk of memory if I store it.
Now I want to set a target number of edges E (user-defined) and ignore all the edges with weight larger than some threshold T (where T is determined by E), storing only the E smallest-weight edges in a vector for later use. Could you give me some suggestions on how to achieve this efficiently? Ideally I would avoid storing the whole adjacency matrix at all and use a streaming structure instead. So how should I generate the matrix and do the thresholding?
I guess writing to and reading from a file is needed, right?
One approach would be: after some kind of manipulation with the file, I set the threshold E and do the following:
I read the elements of the matrix one by one, so I never read in the whole matrix (could you show some lines of C++ code for achieving this?), insert each weight into a min-heap, and store its corresponding edge index in a vector. I stop when the size of the heap reaches E, so that the vector of edge indices is what I want.
Do you think this is the right way to do it? Any other suggestions? Please point out any errors I may have made. Thank you so much!
If there is no need to keep the original thresholded graph, then it sounds like there is an easy way to save yourself a lot of work. You are given the number of vertices (V = 10,000), and the number of edges (E) is user-configurable. Just randomly select pairs of vertices until you have the required number of edges. Am I missing an obvious reason why this would not be equivalent?
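A minimal sketch of this shortcut, assuming E is well below the V(V-1)/2 possible pairs (the function name and the std::set used for duplicate rejection are my own choices):

```cpp
#include <random>
#include <set>
#include <utility>
#include <vector>

// Sample E distinct undirected edges (stored as i < j, no self-loops)
// with random weights, without ever materializing the V x V matrix.
std::vector<std::pair<std::pair<int, int>, double>>
randomEdges(int V, std::size_t E, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> vertex(0, V - 1);
    std::uniform_real_distribution<double> weight(0.01, 10.00);
    std::set<std::pair<int, int>> seen; // reject already-drawn pairs
    std::vector<std::pair<std::pair<int, int>, double>> edges;
    while (edges.size() < E) {
        int i = vertex(gen), j = vertex(gen);
        if (i == j) continue;           // no self-loops
        if (i > j) std::swap(i, j);     // canonical (min, max) form
        if (seen.insert({i, j}).second)
            edges.push_back({{i, j}, weight(gen)});
    }
    return edges;
}
```

Note the caveat in the answer still stands: the E weights drawn here follow the plain [0.01, 10.00] distribution, whereas the E smallest of all V(V-1)/2 weights would follow its lower order statistics, so the two constructions are only equivalent if the weight values themselves don't matter downstream.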
I filled in the edges of two networks.
One has about 4000 nodes and 80000 edges.
The other has about 80000 nodes and 1300000 edges.
The code is written like below:
SparseMatrix<int, Eigen::RowMajor> mat(nodenumber, nodenumber); // nodenumber is 4000 or 80000
mat.reserve(VectorXi::Constant(nodenumber, 50)); // reserve room for 50 non-zero elements per row
for (const auto& [i, j] : edges) {
    mat.insert(i, j) = 1;
    mat.insert(j, i) = 1;
}
(4000 nodes, 80000 edges) is done in 1.5 sec.
(80000 nodes, 1300000 edges) is done in 600 sec.
But I think the speed of filling the matrix should depend on the number of edges.
That would be 1.5 * 1300000 / 80000 ≈ 24 sec for the (80000 nodes, 1300000 edges) network.
Am I right or wrong?
How can I improve the speed of filling the matrix?
Thanks!
See this line: mat.reserve(VectorXi::Constant(nodenumber,50)); and this point in the Eigen documentation on sparse matrices:
Note that when calling reserve(), it is not required that nnz is the exact number of nonzero elements in the final matrix. However, an exact estimation will avoid multiple reallocations during the insertion phase.
Hence, consider replacing 50 with something larger than the maximum number of non-zeros you expect in a single row, so as to reduce repeated reallocations. Nevertheless, this will only slightly reduce the wall-clock time, as explained in the section Filling a sparse matrix:
Because of the special storage scheme of a SparseMatrix, special care has to be taken when adding new nonzero entries. For instance, the cost of a single purely random insertion into a SparseMatrix is O(nnz), where nnz is the current number of non-zero coefficients.
As a consequence, filling the whole matrix by random insertions is O(nnz^2/2). Indeed, compare the squared edge counts: (1300000/80000)^2 ≈ 264, which is not too far from the ratio of the execution times you reported, 600/1.5 = 400.
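This quadratic behaviour is not specific to Eigen; it is the price of keeping compressed storage sorted on every insertion. A toy model of the two strategies (plain ints standing in for stored non-zeros; illustrative only): one-by-one sorted insertion shifts O(nnz) entries per insert, while collecting everything and sorting once is O(nnz log nnz).

```cpp
#include <algorithm>
#include <vector>

// Mimics insertion into compressed storage: each insert keeps the array
// sorted by shifting O(nnz) existing entries -- O(nnz^2) overall.
std::vector<int> insertOneByOne(const std::vector<int>& keys) {
    std::vector<int> sorted;
    for (int k : keys) {
        auto pos = std::lower_bound(sorted.begin(), sorted.end(), k);
        sorted.insert(pos, k); // shifts everything after pos
    }
    return sorted;
}

// Mimics a batch build: collect everything, then sort once --
// O(nnz log nnz).
std::vector<int> batchBuild(std::vector<int> keys) {
    std::sort(keys.begin(), keys.end());
    return keys;
}
```

Both functions produce the same sorted array; only the cost differs, which mirrors the difference between repeated insert() calls and the setFromTriplets() batch path described next.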
To gain some time, you may be interested in batch insertion, that is, inserting all the edges at once. Read this part of the Eigen documentation: it is really worth it! Indeed, the piece of code provided there will likely help you:
typedef Eigen::Triplet<double> T;
std::vector<T> tripletList;
tripletList.reserve(nnz);
for (...)
{
    // ...
    tripletList.push_back(T(i, j, v_ij));
}
SparseMatrixType mat(rows, cols);
mat.setFromTriplets(tripletList.begin(), tripletList.end());
As an alternative, you can also reserve storage space for each column, if you know the maximum number of non-null elements per column and if it is not too big:
mat.reserve(VectorXi::Constant(cols,6));
I have a vector e whose elements are indices to edges in a 2D (surface) mesh. For whatever reason, I would like to reorder this vector, so that each edge is surrounded by edges that are closest to it (basically, similar to what the asker is trying to achieve in this question).
I don't need it to be an exact or perfect solution (there probably isn't one), but I'd like to get as close as possible.
Here are the steps I have taken:
Create an adjacency matrix B for the mesh edges,
Use an algorithm such as RCM to get a reordering of the adjacency matrix that reduces its bandwidth (I'm using PETSc's MatGetOrdering to do this),
Apply the new ordering to get a new, reshuffled adjacency matrix, B2.
At this point, I would like to reorder the original vector e of mesh edges, to get a new vector e2 whose adjacency matrix is now B2.
Is this possible? i.e. is there enough information above to achieve this?
Is this a good approach to do what I'm trying to achieve?
If not, what would be the most sensible and robust approach? (e.g. I was also playing around with trying to do this based on physical distances rather than edge connectivity, but I'm not sure which approach is more realistic / sensible / robust),
If yes, how do I accomplish the last step of reordering the edge vector based on the new adjacency matrix?
I'm fairly new to Stack Exchange so please let me know if I should be asking this on another sub-community. I am also fairly new to graph theory, so I may be missing something obvious.
Thanks!
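The mechanical part of that last step is just gathering through the permutation: if the reordering gives a permutation p such that row k of B2 is row p[k] of B, then e2[k] = e[p[k]]. A minimal sketch, assuming p has already been extracted from the PETSc index set into a plain index array:

```cpp
#include <cstddef>
#include <vector>

// Reorder `e` so that e2[k] = e[p[k]], i.e. e2 lists the edges in the
// order given by the permutation p returned by the reordering algorithm.
template <typename T>
std::vector<T> applyPermutation(const std::vector<T>& e,
                                const std::vector<std::size_t>& p) {
    std::vector<T> e2(e.size());
    for (std::size_t k = 0; k < p.size(); ++k)
        e2[k] = e[p[k]];
    return e2;
}
```

The only subtlety is the convention: some libraries return the inverse permutation (where should row k go, rather than where does it come from), so check which one you have before gathering.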
I am implementing this clustering algorithm http://www.sciencemag.org/content/344/6191/1492.full (free access version) in C in my software, and I need to build a distance matrix. But in some cases the size of the dataset (after redundancy removal) is huge (n > 1,500,000, and it gets even larger, up to 4,000,000 in more complex cases). My problem is, even allocating just the upper triangular matrix would take ((1500000 * 1500000) - 1500000) * 0.5 * sizeof(float) ≈ 4.5e12 bytes, about 4.5 TB. So memory allocation fails (even on our computing nodes with 256 GB of RAM), and writing to disk is not an option in this case.
Beside cutting down the size (which I will look) of the dataset to cluster, anybody has an idea of a technique I could use to approximate and store this amount of information ?
N.B. Like I said in the title, I am using C, and I can also use C++. Also, if anybody knows of another clustering algorithm (one where the number of clusters is determined by the algorithm itself), please suggest it to me.
Thanks in advance for your time,
You probably have to step back and reconsider your algorithm.
First, perhaps you don't need to have distance matrix between all pairs of data points. Perhaps you could group together similar data points into data bins and then create a matrix of distances between bins.
That is, start by computing pairwise distances between points, but keep only the relatively small distances and pointers to "the other" point: in effect, a very sparse matrix of shorter distances. This is straightforward to do in parallel.
Then create data bins that contain groups of points with mutually small distances between them. For example, if you choose the "short" distance threshold so that bins hold on average, say, 50 data points, you'd get 1,500,000 / 50 = 30,000 bins.
Then go through your data again and compute the distances between bins. That produces 30000^2 distances, a matrix of about 4 GB. In addition, you still have 30,000 bins with 50^2 distances each inside them, which is another 300 MB. This amount of data is quite manageable.
If replacing the distance between data points with a distance between the corresponding bins is sufficient precision for your application that would work. It all depends on the kind of data you are dealing with and the precision requirements of your application.
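One way the binning pass might look. This is a sketch only: 1-D points for brevity, a greedy first-fit rule, and using the first member of a bin as its representative are all my simplifications, not part of the answer above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Bin {
    std::vector<std::size_t> members; // indices of points in this bin
    double rep;                       // representative (first member)
};

// Greedy single pass: each point joins the first bin whose representative
// is within `threshold`, otherwise it opens a new bin. O(n * #bins) time;
// the n x n distance matrix is never formed.
std::vector<Bin> binPoints(const std::vector<double>& pts, double threshold) {
    std::vector<Bin> bins;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        bool placed = false;
        for (auto& b : bins) {
            if (std::fabs(pts[i] - b.rep) <= threshold) {
                b.members.push_back(i);
                placed = true;
                break;
            }
        }
        if (!placed) bins.push_back({{i}, pts[i]});
    }
    return bins;
}

// Bin-to-bin distance: distance between representatives. The small
// #bins x #bins table replaces the infeasible n x n point matrix.
double binDistance(const Bin& a, const Bin& b) {
    return std::fabs(a.rep - b.rep);
}
```

For real data you would use the actual feature-space distance instead of fabs and probably a spatial index instead of the linear scan over bins, but the structure (bin assignment pass, then a bin-level distance table) is the same.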
My program contains polygons which have the form of a vector containing points (2 dimensional double coordinates, stored in a self-made structure). I'm looking for a quick way of finding the smallest square containing my polygon (ie. knowing the maximal and minimal coordinates of all the points).
Is there a quicker way than just parsing all the points and storing the minimum and maximum values?
The algorithm you are describing is straightforward: iterate over all your points and keep track of the minimum and maximum for each coordinate. This is an O(n) algorithm, n being the number of points you have.
You can't do better, since you need to check every point at least once; otherwise the last one could be outside the square you found.
Now, the complexity is at best O(n), so all you can do is minimize the constant factors, and in this case they are already pretty small: a single loop over your vector, tracking two maximums and two minimums.
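That single pass might look like this (a sketch; the Point and BBox names and the non-empty-input assumption are mine, not from the question's self-made structure):

```cpp
#include <vector>

struct Point { double x, y; };

struct BBox { double xmin, xmax, ymin, ymax; };

// Single O(n) pass: track both minima and both maxima simultaneously.
// Assumes pts is non-empty.
BBox boundingBox(const std::vector<Point>& pts) {
    BBox b{pts[0].x, pts[0].x, pts[0].y, pts[0].y};
    for (const Point& p : pts) {
        if (p.x < b.xmin) b.xmin = p.x;
        if (p.x > b.xmax) b.xmax = p.x;
        if (p.y < b.ymin) b.ymin = p.y;
        if (p.y > b.ymax) b.ymax = p.y;
    }
    return b;
}
```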
You can either iterate through all points and find max and min values, or do some preprocessing, for example, store your points in treap (http://en.wikipedia.org/wiki/Treap).
There is no way, without some preprocessing, to do better than just iterating over all the points.
I'm not sure there can be any way faster than linear time to find the min & max values in an array. The only 'optimization' I can think of is to compute these values on one of the other occasions when you're iterating over the array (filling it / applying a function to all points), and then update them on every data change.
I have a graph with four nodes, each node represents a position and they are laid out like a two dimensional grid. Every node has a connection (an edge) to all (according to the position) adjacent nodes. Every edge also has a weight.
Here are the nodes represented by A,B,C,D and the weight of the edges is indicated by the numbers:
A 100 B
120 220
C 150 D
I want to design a container and an algorithm that switches the two nodes sharing the edge with the highest weight, then resets the weight of that edge. No node (position) can be switched more than once each time the algorithm is executed.
For example, processing the above, the highest weight is on edge BD, so we switch those two nodes. Since no node can be switched more than once, all edges involving either B or D are reset.
A D
120
C B
Then the next highest weight is on the only edge left (weight 120); switching those two nodes gives us the final layout: C, D, A, B.
I'm currently running a quite awful implementation of this. I store a long list of edges, each holding four values for the nodes it can connect, a value for its weight, and the position of the node itself. Every time anything is requested, I loop through the entire list.
I'm writing this in C++, could some parts of the STL help speed this up? Also, how to avoid the duplication of data? A node position is currently in five objects. The node itself that is there and the four nodes indicating a connection to it.
In short, I want help with:
Can this be structured in a way so that there is no data duplication?
Recognise the problem? If any of this has a name, tell me so I can google for more info on the subject.
Fast algorithms are always nice.
As for names, what you are building is a matching: a set of edges in which no two share a vertex. (It is closely related to vertex cover, which is NP-hard to solve optimally; the endpoints of a maximal matching give the classic 2-approximation for it.) Your problem is simpler than optimal matching, though: you're looking at a greedy pseudo-maximum under a tight selection criterion. Specifically, once an edge is selected, every adjacent edge is removed (representing the removal of the vertices to be swapped).
For example, here's a standard greedy approach:
0) sort the edges; retain adjacency information
while edges remain:
    1) select the heaviest remaining edge
    2) remove all adjacent edges from the list
endwhile
The list of edges selected gives you the vertices to swap.
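The greedy loop above can be sketched as follows (names are illustrative; vertices are assumed to be numbered 0..nNodes-1, and "remove adjacent edges" is implemented by marking endpoints as used):

```cpp
#include <algorithm>
#include <vector>

struct WEdge { int u, v; double w; };

// Greedy pass: sort edges by descending weight, take an edge only if
// neither endpoint has been used yet (i.e. neither node was swapped).
std::vector<WEdge> greedySwaps(std::vector<WEdge> edges, int nNodes) {
    std::sort(edges.begin(), edges.end(),
              [](const WEdge& a, const WEdge& b) { return a.w > b.w; });
    std::vector<bool> used(nNodes, false);
    std::vector<WEdge> selected;
    for (const WEdge& e : edges) {
        if (!used[e.u] && !used[e.v]) {
            used[e.u] = used[e.v] = true; // each node swaps at most once
            selected.push_back(e);
        }
    }
    return selected;
}
```

On the four-node example (A, B, C, D with weights 100, 120, 220, 150) this selects BD first and then AC, matching the hand-worked result above.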
Time complexity is O(sorting the edges + a linear pass over the edges), which in general boils down to O(sorting the edges), i.e. likely O(E*log(E)).
The method of retaining adjacency information depends on the graph properties; see your friendly local algorithms text. Feel free to start with an adjacency matrix for simplicity.
As with the adjacency information, most other speed improvements will apply best to graphs of a certain shape but come with a tradeoff of time versus space complexity.
For example, your problem statement seems to imply that the vertices are laid out in a square pattern, from which we could derive many interesting properties. For example, that system is very easily parallelized. Also, the adjacency information would be highly regular but sparse at large graph sizes (most vertices wouldn't be connected to each other), which makes an adjacency matrix high-overhead; you could instead store adjacency as an array of 4-tuples, which retains fast access but almost entirely eliminates the overhead.
If you have bigger graphs, look into the Boost Graph Library. It gives you good data structures for graphs and basic iterators for different kinds of graph traversal.