I have a large graph (the number of vertices can be in the range of 50,000-100,000, the adjacency matrix need not be sparse). Edges in the graph can be removed/added, and I want to update the resulting connected components structure after such changes. I have implemented this in a straightforward fashion with a BFS search myself in C++ (keeping track of unordered_maps of vertices to connected component ids and updating them), but I am wondering if there is a more efficient way to do this using Boost's graph library.
I was able to find some questions similar to this here in Stackoverflow, and came to know of filtered_graph (and the connected_components function) but I am worried about the overhead involved in creating such filtered instances, every time we add or remove an edge. (Or should this be a concern at all?!)
I believe your solution is essentially the best possible. If you are only allowed to add edges, then I believe the algorithm can be improved by keeping track of connected components in terms of vertices included, and then when an edge is included you check to see if the two vertices belong to different connected components, in which case you merge the two connected components. This will reduce the complexity from quadratic to best-case per edge added. However, if you are allowed to insert and delete edges, I don't see any asymptotically faster way to solve the problem other that what you described.
There are algorithms for maintaining connectivity under edge insertions and deletions that are faster than recalculating. This is called "dynamic graph connectivity". Here is a paper on experimental evaluations (some newer theoretical results have been found since, but it is unclear whether they have practical relevance).
Related
I need to find an optimal solution for a Travelling Salesman Problem on graphs with the small number of vertices (< 10). Since this is an NP hard problem I am ready to do the brute force approach, for the small number of vertices it should be doable in a very small time.
I have a slightly modified conditions for 2 problems:
(A)
The graph is bidirectional, with different weights in each direction.
All vertices are connected to all.
(Nice to have condition) You can visit the same vertices more than once, and travel the same paths more than once (however for eventual completeness you should not loop infinitely)
(B)
In addition to conditions of (A), here you need to visit a subset of vertices, while you still allow to travel through all other vertices of a graph. (given that it is a better solution).
A while back I have implemented a brute force solution and some heuristics like Lin–Kernighan (using simple matrix of weights), however I never used Graph data structures like in boost. And I was wounding if there is an existing implementation that I could use or a set of algorithms that could help me out to get optimal solution. Also I would appreciate if you could advise on how to get the part (B) right.
Thanks!
I am pretty knew to the boost graph library. I need to implement a visitor that would be able to revisit the same vertices multiple times. For the undirectedS graphs (or directed with 2 edges) this would result in the infinite search depth, however I am planning to put a finite constraint on the maximum depth.
I am considering to use breadth_first_search(), and I think what I need to do is to un-mark visited vertices in vis.black_target()? Would you know how to do it?
Thanks!
EDIT: Technically I would like to explore all possible paths of a constant depth from a starting vertex. (even if some of them revisit the same vertices several times)
Data structures for directed and undirected graphs are of fundamental importance. Well-known and widely-used implementations such as the Boost Graph Library and Lemon are designed such that the contiguous integer indices of nodes and edges are not exposed to the user via the interface.
Instead, the user identifies nodes and edges by (small) representative objects. One advantage is that these objects are updated automatically when the indices of nodes and edges change due to the removal of edges or nodes from the graph.
In my opinion (!), this advantage is overrated. Users will typically store the representative objects of nodes and/or edges in a container, e.g., an std::vector. Now, if nodes or edges are removed from the graph and their representative objects become invalid, the user needs to either ignore this or rearrange the vector so as to keep valid integer indices contiguous, i.e., do exactly the bookkeeping that the design was supposed to make unnecessary.
Hence, my question is: Does the design choice (of hiding the contiguous integer indices of nodes and edges from the user) have other advantages?
(I'm at home in the Java world, but hope that it is OK to give an answer that is not focussed on the particular libraries in question)
There are several possible advantages of such an abstraction. One of the most important ones was already mentioned in the question: The consistency when performing modifications in a graph is much harder to accomplish when indices have to be maintained.
The reason why this may be hard lies in the different possible graph representations: It may be easy to maintain consistent indices if the internal representation only (and always) consisted of random-access lists of Vertex and Edge objects. But for other representations, determining an index may be difficult.
This directly related to the second main advantage: One is free to use different implementations of the graph interface. The section "Graph Data Structures" in the Review of Elementary Graph Theory of the boost documentation lists several data structures that are already offered by the BGL (and everybody may add his own implementation). The running times for certain operations are given in Big-O-notation, and once can see that they vary greatly between the different data structures.
So one can easily imagine that different implementations are better suited for certain tasks. For example, consider an algorithm frequently has to check whether a particular vertex is contained in a graph. For an indexed (that is, list-based) vertex storage, this would require O(n) for each test. With a set-based storage of the vertices, this could be done in O(1) - but there simply are no sensible "indices" in this case.
Additionally, as mentioned in the Graph Concepts overview:
In fact, the BGL interface need not even be implemented using a data-structure, as for some problems it is easier or more efficient to define a graph implicitly based on some functions.
So suggesting that there is an "indexed access" even when the graph does not even exist in memory may hinder such a purely functional implementation.
I can't speak for Lemon graph, but for boost graph I think the main goal is to be generic. So abstracting away the vertex (edge) access helps to achieve that goal.
It is stated in the documentation that boost graph is based on Dietmar Kühl's Masters Thesis on generic graph algorithms. (See my answer to Do property maps remain necessary for BGL?). So the main goal behind the library is to be generic and extensible. The choice of encapsulating access is part of abstracting the algorithms from the graph representation. To me, continuous integer indices are an implementation detail.
Boost doesn't make any assumptions on how you will use the graph or what performance trade offs are important to you. It lets you choose (or implement) the container that will best fit your needs.
If you want to break this encapsulation, you are free to do so. In fact, my most common use of boost graph involves vecS containers and a vector of structs. I usually work with graphs where the size is fixed. I could just as easily use a map of vertex_descriptors (or edge_descriptors) to objects to achieve the same goal.
So in summary, I would say that this not so much a design choice, but rather a consequence of achieving the broader goal of being generic. So the hiding of access has the benefit of being more generic.
My program starts out by creating a graph (~1K-50K vertices) that usually consists of a few hundred connected components.
The program only needs to be able to manipulate and visualize individual components (using force-directed layout algorithm).
It would be great (but not essential) to have the capability of further splitting each connected component into connected subcomponents (by removing edges or vertices).
So my question is, can I use use subgraph or filtered_graph class templates to achieve the required functionality (maintain a collection of component graphs that can be individually manipulated and possibly further subdivided by removing edges/vertices)? Or is there an another, better approach?
I apologize if this question is too basic. I've just started to learn BGL and not comfortable with this library yet. Thanks in advance!
Use connected_components to assign a unique number to each component, storing it in a property of the nodes. Then you can use that property in the filtered_graph predicate to decide whether a given component belongs to the currently active graph or not. The vertex predicate would be straight-forward, whereas the edge predicate could simply look at either endpoint to make its choice. The number of the subgraph would be stored in the predicate objects themselves.
Whether a different approach would be beter depends on your use case. If the components don't change much, and you have to perform a lot of operations which iterate over all nodes, then having separate graph objects might be better. You could create them as copies of filtered graphs constructed as described above. If the graph gets modified a lot, some approach which will not have to scan the whole graph to update connected components would be useful. If no edges get removed, incremental_components might do the trick.
Anybody out there using BGL for large production servers?
How many node does your network consist of?
How do you handle community detection
Does BGL have any cool ways to detect communities?
Sometimes two communities might be linked together by one or two edges, but these edges are not reliable and can fade away. Sometimes there are no edges at all.
Could someone speak briefly on how to solve this problem.
Please open my mind and inspire me.
So far I have managed to work out if two nodes are on an island (in a community)
in a lest expensive manner, but now I need to work out which two nodes on separate islands are closest to each other. We can only make minimal use of unreliable geographical data.
If we figuratively compare it to a mainland and an island and take it out of social distance context. I want to work out which two bits of land are the closest together across a body of water.
I've used the BGL for graphs with millions of nodes, but the size of the graph you can use depends on what algorithm you are trying to run. You can quickly compute distances between nodes. There are 4 shortest path algorithms which are most applicable depending on your data: (single pairs of points, for all pairs of points, sparse and dense graphs,...).
As for community detection, there aren't any algorithms built-into the BGL specifically for that (but maybe you can contribute one when you are finished with your project). There are a few algorithms that might be helpful in building a community detection algorithm. The max-flow/min-cut algorithms are typically used in community detection (if there is a lot of flow possible between two nodes, then they are likely to be in the same community, if there isn't much flow, then the min-cut is likely to represent roads between communities). There are also heuristics to order the nodes of the graph to reduce bandwidth. Nodes making up "communities" are likely to be close to each other in such an ordering.
As far as I know BGL doesn't have any algorithms specifically for community detection.
By "island" do you mean a disconnected subgraph?
Also, graphs do not have any notion of 'distance'.
This 'social distance' is something that you are going to have to define. Once you've done that a large part of the work is done.
There are numerous methods listed on the page you linked to, most of those only require you to define something like a 'distance' metric, and then plug your definitions into the algorithm.
# David Nehme
Graphs without edge-weights are only about connectedness, they have no notion of distance. If you want to talk about a network then you can talk about distance. But a graph with no edge-weights does not have any distance, unless you want to assume an implied edge-weight of 1 for all edges. But this is really just turning the graph into a network.
Also, he is talking about the distance between two disconnected graphs. To model this, you have to introduce an external concept for distance between nodes, separate from the edge-distance.