splitting a boost graph into connected components - c++

My program starts out by creating a graph (~1K-50K vertices) that usually consists of a few hundred connected components.
The program only needs to be able to manipulate and visualize individual components (using force-directed layout algorithm).
It would be great (but not essential) to have the capability of further splitting each connected component into connected subcomponents (by removing edges or vertices).
So my question is, can I use use subgraph or filtered_graph class templates to achieve the required functionality (maintain a collection of component graphs that can be individually manipulated and possibly further subdivided by removing edges/vertices)? Or is there an another, better approach?
I apologize if this question is too basic. I've just started to learn BGL and not comfortable with this library yet. Thanks in advance!

Use connected_components to assign a unique number to each component, storing it in a property of the nodes. Then you can use that property in the filtered_graph predicate to decide whether a given component belongs to the currently active graph or not. The vertex predicate would be straight-forward, whereas the edge predicate could simply look at either endpoint to make its choice. The number of the subgraph would be stored in the predicate objects themselves.
Whether a different approach would be beter depends on your use case. If the components don't change much, and you have to perform a lot of operations which iterate over all nodes, then having separate graph objects might be better. You could create them as copies of filtered graphs constructed as described above. If the graph gets modified a lot, some approach which will not have to scan the whole graph to update connected components would be useful. If no edges get removed, incremental_components might do the trick.

Related

Designing around bundled properties in the Boost Graph Library

I am porting some graph code out of Python (networkx) and into C++ (BGL). In my Python code, the vertices and edges of the graph are client-defined objects implementing an established interface; I go on to call a bunch of methods on them. Everything is fine.
Naively, it would seem that BGL is meant to support a similar design pattern with "bundled properties." These basically allow one to define custom types for the vertices and edges by passing certain template parameters:
adjacency_list<OutEdgeList, VertexList,
Directed, VertexProperties,
EdgeProperties, GraphProperties,
EdgeList>
The custom vertex and edge types here are given by VertexProperties and EdgeProperties.
In working on this port I've noticed a couple of things that make me think that maybe BGL's bundled properties interface is really only meant to support (more-or-less) immutable types:
Edge and vertex "descriptors"
If you put something into the graph, you get back a "descriptor" that you have to use to reference it from there on. There are descriptors for edges and vertices, and they are the "keys" -- implemented as immutables -- to the graph. So if you have a vertex and you want to find neighboring vertices, you have to (a) get the current vertex descriptor, (b) use BGL methods to find the descriptors of its neighbors, and finally (c) reference each of the neighbors via their respective descriptors.
The net result of all this bookkeeping is an apparent need for additional containers -- std::map, say -- to provide reverse-lookup from vertices and edges (again, types VertexProperty and EdgeProperty) to their descriptors.
"BGL isn't meant to store pointers."
I spotted this claim in a similar SO question but haven't been able to verify it in the Boost docs anywhere. From the ensuing discussion I am left to speculate that the constraint may actually be a bit stronger: "BGL isn't meant to directly reference the heap." This doesn't seem entirely reasonable, though, as the container types are configurable (OutEdgeList and VertexList, above) and default standard things like vectors.
I'm a BGL n00b and am having trouble understanding what's intended with bundled properties. (And, frankly, I feel a bit overwhelmed by the programming model in general -- "properties", "concepts", "traits", "descriptors", AHHHH!) Questions:
Do BGL graphs efficiently support complex and possibly heap-bound types for VertexProperty and EdgeProperty? Or are these meant to be lightweight containers for immutable data?
If the former, is there any way to get around all the descriptor bookkeeping?
If the latter, what's the "right way" to deal with large, complicated things that we might want to stick in a BGL graph?
Do BGL graphs efficiently support complex and possibly heap-bound types for VertexProperty and EdgeProperty? Or are these meant to be lightweight containers for immutable data?
Of course. You can do it anyway you want. Point in case: you could make the bundle hold an (immutable or mutable) pointer to a heap allocated "complex" type that is - of course - entirely mutable. Now,
I'd suggest using your preferred ownership adaptor (unique_ptr, scoped_ptr, shared_ptr, what not). Look at std::string: it too is "heap based", but you didn't ever worry about using it in a property bundle, did you?
If the former, is there any way to get around all the descriptor bookkeeping?
There is not strictly any "descriptor bookkeeping". There might be depending on the graph model. But in general, descriptor is really an abstraction of the underlying container iterator model (not that this could be iterators across multiple containers, e.g. the edges(EdgeListgraph) as opposed to out_edges(v, IncidenceGraph) for adjacency_list<>.
The point is to decouple the logic from assumptions about the storage model. Even with some pretty unsightly void* passing-around, the compiler should compile it down to the same code as direct iterator access. In this sense the "bookkeeping" is usually perceptual, and you are likely picking up on mental load of having the extra conceptual layer.
If the latter, what's the "right way" to deal with large, complicated things that we might want to stick in a BGL graph?
Oops. I think I accidentally addressed this under 1. The absolute simplest thing to use might be refcounted sharing. Your specific situation might allow for more efficient solutions.
CAPITA SELECTA
"BGL isn't meant to store pointers."
Well, not as bundles, perhaps. Even here, it depends solely on how you manage the ownership/lifetime of the pointees.
I think the linked answer is excellent. It resonates with what I said above.
So if you have a vertex and you want to find neighboring vertices, you have to (a) get the current vertex descriptor, (b) use BGL methods to find the descriptors of its neighbors, and finally (c) reference each of the neighbors via their respective descriptors.
Most BGL algorithms rely on the presence of vertex_index_t property (or require one to be specified as a parameter) to make sure these are low-cost operations. In fact, if you use vecS then the vertex index is the index into the vertex vector, so reverse and forward lookups are pretty simple. (You can always look at the generated assembly with optimizations enabled to see whether there are any surprises).
This answer might be inspirational: Boost Graph Library: possible to combine Bundled Properties with Interior Properties?
TL;DR / Summary
I have a feeling coming from Python you might be understandably underestimating C++ optimizing compilers in template-heavy generic library code.
What bubbles up when you sift through the layers of generic machinery in practice evaporates at compile time. Of course, the downside of this is that can be hard to see through the abstractions. But it also means that power and abstraction level aren't compromised.
BGL is one library that allows you to switch to radically different graph models, storage layout etc. with very few code changes. The goal here wasn't "ease of use" or "do what I mean" (use Java or python for that).
The goal was to make the choice for C++ /not/ remove all agility by hard-coding every single implementation detail into your entire code base. Instead, work at the level of library concepts and benefit from the freedom you retain to experiment/change your approach as requirements evolve.

Why do C++ data structures for graphs hide contiguous integer indices?

Data structures for directed and undirected graphs are of fundamental importance. Well-known and widely-used implementations such as the Boost Graph Library and Lemon are designed such that the contiguous integer indices of nodes and edges are not exposed to the user via the interface.
Instead, the user identifies nodes and edges by (small) representative objects. One advantage is that these objects are updated automatically when the indices of nodes and edges change due to the removal of edges or nodes from the graph.
In my opinion (!), this advantage is overrated. Users will typically store the representative objects of nodes and/or edges in a container, e.g., an std::vector. Now, if nodes or edges are removed from the graph and their representative objects become invalid, the user needs to either ignore this or rearrange the vector so as to keep valid integer indices contiguous, i.e., do exactly the bookkeeping that the design was supposed to make unnecessary.
Hence, my question is: Does the design choice (of hiding the contiguous integer indices of nodes and edges from the user) have other advantages?
(I'm at home in the Java world, but hope that it is OK to give an answer that is not focussed on the particular libraries in question)
There are several possible advantages of such an abstraction. One of the most important ones was already mentioned in the question: The consistency when performing modifications in a graph is much harder to accomplish when indices have to be maintained.
The reason why this may be hard lies in the different possible graph representations: It may be easy to maintain consistent indices if the internal representation only (and always) consisted of random-access lists of Vertex and Edge objects. But for other representations, determining an index may be difficult.
This directly related to the second main advantage: One is free to use different implementations of the graph interface. The section "Graph Data Structures" in the Review of Elementary Graph Theory of the boost documentation lists several data structures that are already offered by the BGL (and everybody may add his own implementation). The running times for certain operations are given in Big-O-notation, and once can see that they vary greatly between the different data structures.
So one can easily imagine that different implementations are better suited for certain tasks. For example, consider an algorithm frequently has to check whether a particular vertex is contained in a graph. For an indexed (that is, list-based) vertex storage, this would require O(n) for each test. With a set-based storage of the vertices, this could be done in O(1) - but there simply are no sensible "indices" in this case.
Additionally, as mentioned in the Graph Concepts overview:
In fact, the BGL interface need not even be implemented using a data-structure, as for some problems it is easier or more efficient to define a graph implicitly based on some functions.
So suggesting that there is an "indexed access" even when the graph does not even exist in memory may hinder such a purely functional implementation.
I can't speak for Lemon graph, but for boost graph I think the main goal is to be generic. So abstracting away the vertex (edge) access helps to achieve that goal.
It is stated in the documentation that boost graph is based on Dietmar Kühl's Masters Thesis on generic graph algorithms. (See my answer to Do property maps remain necessary for BGL?). So the main goal behind the library is to be generic and extensible. The choice of encapsulating access is part of abstracting the algorithms from the graph representation. To me, continuous integer indices are an implementation detail.
Boost doesn't make any assumptions on how you will use the graph or what performance trade offs are important to you. It lets you choose (or implement) the container that will best fit your needs.
If you want to break this encapsulation, you are free to do so. In fact, my most common use of boost graph involves vecS containers and a vector of structs. I usually work with graphs where the size is fixed. I could just as easily use a map of vertex_descriptors (or edge_descriptors) to objects to achieve the same goal.
So in summary, I would say that this not so much a design choice, but rather a consequence of achieving the broader goal of being generic. So the hiding of access has the benefit of being more generic.

Removing edges and splitting a connected component if necessary (C++, Boost)

I have a large graph (the number of vertices can be in the range of 50,000-100,000, the adjacency matrix need not be sparse). Edges in the graph can be removed/added, and I want to update the resulting connected components structure after such changes. I have implemented this in a straightforward fashion with a BFS search myself in C++ (keeping track of unordered_maps of vertices to connected component ids and updating them), but I am wondering if there is a more efficient way to do this using Boost's graph library.
I was able to find some questions similar to this here in Stackoverflow, and came to know of filtered_graph (and the connected_components function) but I am worried about the overhead involved in creating such filtered instances, every time we add or remove an edge. (Or should this be a concern at all?!)
I believe your solution is essentially the best possible. If you are only allowed to add edges, then I believe the algorithm can be improved by keeping track of connected components in terms of vertices included, and then when an edge is included you check to see if the two vertices belong to different connected components, in which case you merge the two connected components. This will reduce the complexity from quadratic to best-case per edge added. However, if you are allowed to insert and delete edges, I don't see any asymptotically faster way to solve the problem other that what you described.
There are algorithms for maintaining connectivity under edge insertions and deletions that are faster than recalculating. This is called "dynamic graph connectivity". Here is a paper on experimental evaluations (some newer theoretical results have been found since, but it is unclear whether they have practical relevance).

How to create a named_graph with Boost Graph Library?

I am currenty working with the Boost Graph Library. I need unique edges and vertices. Unfortunately the boost graphes doesn't provide this feature. So I have to check manual every time before I am inserting an edge or a vertex.
Now I've found this: http://www.boost.org/doc/libs/1_49_0/boost/graph/named_graph.hpp
I am wondering if this would help me? Because the documentation says no word about named_graph I don't know how to use it. Maybe there is someone around who could give me a little example or explenation? This would help me a lot.
Thanks in advance.
The Boost Graph Library is very flexible and allows you to choose the internal representation for your vertices and edges. If you choose a container such as std::set then you could enforce unique vertices and edges directly. Details are here: Using Adjacency List
The named_graph type allows you to index your vertices by a property you can choose yourself (for example a "string" representing a name). It effectively wraps a standard adjacency_list in a map whose key is the named property and whose values are the nodes. There is a good example of how to use it in the boost source named_vertices_test.cpp.
Not sure what you're trying to do, but you could use a std::map/std::set yourself to map from some unique property to nodes in an adjacency_list. If you just need to ensure the graph has unique nodes/edges when you make it, then this approach is straightforward and simple, and is usually the best way.
You should think about the consequences of changing the backed containers to std::set- for example the performance of many algorithms will change. There is no simple answer to which is the best container to use.

Distinguish directed and undirected graph

I need to write a graph using C++ and I have a little problem. My graph should be directed or undirected, weighted or unweighted, based on matrix or list all on user's choice. And distinguishing matrix from list graph is not a big deal, since it's two different classes, I got some problem with other parameters. The most obvious way to distinguish them is to make two bool variables and check them on every adding and deleting of vertex. It is quite obvious and easy to understand, but I doubt it's efficiency, because every time I add or delete vertex I have to do additional if. I also could write subclasses for it, but I seriously doubt if it's worth it.
Every library is okay to use, if it's not representing graph itself.
For directed and undirected best case is using bool variable for your graph, however You can assume your graph is weighted and directed, but for undirected edges add one edge from a→b and one edge from b→a. Also if there isn't weight function set its weight to 1.
But if you looking for graph library it depends to your programming language, but I'd suggest graph boost library which implemented fully in c++, and too many other people implement it partially in other languages.