I have a database with information about many hypothetical people. This data is oriented for creating a graph connecting related people together. This graph is to represent a genealogical tree, in an appropriate c++ data structure. So far, so good. I have my data structure holding information about a family, with each person being a node, all connected appropriately in a tree.
Now, here is the problem, I am lost in how to go about generating visual data for this family graph. For any given family, I need to generate a typical graph as you would see in a traditional genealogical tree. I intend to render the data with OpenGL and have everything set up in order to do it. My only problem is how to generate the correct positions and sizes for each person's rectangle, so in the end there are no overlaps and every generation of people sits at the same vertical position. Then, I have to add the traditional lines connecting each node in visual data, but that shouldnt be a big problem.
Are there any lightweight libraries prepared to do this function or can someone help me achieve the algorithm to solve this? Thanks
I recommend AT&T GraphVis library. This library is used by Doxygen to draw its inheritance diagrams, and call trees.
You could also search for "c++ tree draw".
Another way to state the problem would be: how to determine node layout in a hierarchal directed graph such that there is no overlap and nodes at the same hierarchy are placed at the same level? (Here nodes correspond to family members and generation specifies the hierarchy).
While GraphVis can provide a layout solution, a newer approach than algorithms used in GraphVis is the Dig-CoLa (directed graph layout through constrained energy minimization) method described in Dwyer et al. 2005. The advantage of Dig-Cola is that it also deals with certain corner cases (e.g cycles) gracefully and avoids introducing hierarchy not induced by the original data in such cases.
The underlying idea of this method is to formulate and solve the layout problem as a constrained optimization problem by minimizing the stress (or energy) function based on the node positions under constraints induced by the hierarchy information.
Original Dig-Cola paper for more information:
Dwyer, Tim, and Yehuda Koren. "Dig-CoLa: directed graph layout through constrained energy minimization." IEEE Symposium on Information Visualization, 2005. INFOVIS 2005.. IEEE, 2005.
(Full text currently available here)
The author also provides some examples and code for this method (and later extensions) here.
Related
I am porting some graph code out of Python (networkx) and into C++ (BGL). In my Python code, the vertices and edges of the graph are client-defined objects implementing an established interface; I go on to call a bunch of methods on them. Everything is fine.
Naively, it would seem that BGL is meant to support a similar design pattern with "bundled properties." These basically allow one to define custom types for the vertices and edges by passing certain template parameters:
adjacency_list<OutEdgeList, VertexList,
Directed, VertexProperties,
EdgeProperties, GraphProperties,
EdgeList>
The custom vertex and edge types here are given by VertexProperties and EdgeProperties.
In working on this port I've noticed a couple of things that make me think that maybe BGL's bundled properties interface is really only meant to support (more-or-less) immutable types:
Edge and vertex "descriptors"
If you put something into the graph, you get back a "descriptor" that you have to use to reference it from there on. There are descriptors for edges and vertices, and they are the "keys" -- implemented as immutables -- to the graph. So if you have a vertex and you want to find neighboring vertices, you have to (a) get the current vertex descriptor, (b) use BGL methods to find the descriptors of its neighbors, and finally (c) reference each of the neighbors via their respective descriptors.
The net result of all this bookkeeping is an apparent need for additional containers -- std::map, say -- to provide reverse-lookup from vertices and edges (again, types VertexProperty and EdgeProperty) to their descriptors.
"BGL isn't meant to store pointers."
I spotted this claim in a similar SO question but haven't been able to verify it in the Boost docs anywhere. From the ensuing discussion I am left to speculate that the constraint may actually be a bit stronger: "BGL isn't meant to directly reference the heap." This doesn't seem entirely reasonable, though, as the container types are configurable (OutEdgeList and VertexList, above) and default standard things like vectors.
I'm a BGL n00b and am having trouble understanding what's intended with bundled properties. (And, frankly, I feel a bit overwhelmed by the programming model in general -- "properties", "concepts", "traits", "descriptors", AHHHH!) Questions:
Do BGL graphs efficiently support complex and possibly heap-bound types for VertexProperty and EdgeProperty? Or are these meant to be lightweight containers for immutable data?
If the former, is there any way to get around all the descriptor bookkeeping?
If the latter, what's the "right way" to deal with large, complicated things that we might want to stick in a BGL graph?
Do BGL graphs efficiently support complex and possibly heap-bound types for VertexProperty and EdgeProperty? Or are these meant to be lightweight containers for immutable data?
Of course. You can do it anyway you want. Point in case: you could make the bundle hold an (immutable or mutable) pointer to a heap allocated "complex" type that is - of course - entirely mutable. Now,
I'd suggest using your preferred ownership adaptor (unique_ptr, scoped_ptr, shared_ptr, what not). Look at std::string: it too is "heap based", but you didn't ever worry about using it in a property bundle, did you?
If the former, is there any way to get around all the descriptor bookkeeping?
There is not strictly any "descriptor bookkeeping". There might be depending on the graph model. But in general, descriptor is really an abstraction of the underlying container iterator model (not that this could be iterators across multiple containers, e.g. the edges(EdgeListgraph) as opposed to out_edges(v, IncidenceGraph) for adjacency_list<>.
The point is to decouple the logic from assumptions about the storage model. Even with some pretty unsightly void* passing-around, the compiler should compile it down to the same code as direct iterator access. In this sense the "bookkeeping" is usually perceptual, and you are likely picking up on mental load of having the extra conceptual layer.
If the latter, what's the "right way" to deal with large, complicated things that we might want to stick in a BGL graph?
Oops. I think I accidentally addressed this under 1. The absolute simplest thing to use might be refcounted sharing. Your specific situation might allow for more efficient solutions.
CAPITA SELECTA
"BGL isn't meant to store pointers."
Well, not as bundles, perhaps. Even here, it depends solely on how you manage the ownership/lifetime of the pointees.
I think the linked answer is excellent. It resonates with what I said above.
So if you have a vertex and you want to find neighboring vertices, you have to (a) get the current vertex descriptor, (b) use BGL methods to find the descriptors of its neighbors, and finally (c) reference each of the neighbors via their respective descriptors.
Most BGL algorithms rely on the presence of vertex_index_t property (or require one to be specified as a parameter) to make sure these are low-cost operations. In fact, if you use vecS then the vertex index is the index into the vertex vector, so reverse and forward lookups are pretty simple. (You can always look at the generated assembly with optimizations enabled to see whether there are any surprises).
This answer might be inspirational: Boost Graph Library: possible to combine Bundled Properties with Interior Properties?
TL;DR / Summary
I have a feeling coming from Python you might be understandably underestimating C++ optimizing compilers in template-heavy generic library code.
What bubbles up when you sift through the layers of generic machinery in practice evaporates at compile time. Of course, the downside of this is that can be hard to see through the abstractions. But it also means that power and abstraction level aren't compromised.
BGL is one library that allows you to switch to radically different graph models, storage layout etc. with very few code changes. The goal here wasn't "ease of use" or "do what I mean" (use Java or python for that).
The goal was to make the choice for C++ /not/ remove all agility by hard-coding every single implementation detail into your entire code base. Instead, work at the level of library concepts and benefit from the freedom you retain to experiment/change your approach as requirements evolve.
Data structures for directed and undirected graphs are of fundamental importance. Well-known and widely-used implementations such as the Boost Graph Library and Lemon are designed such that the contiguous integer indices of nodes and edges are not exposed to the user via the interface.
Instead, the user identifies nodes and edges by (small) representative objects. One advantage is that these objects are updated automatically when the indices of nodes and edges change due to the removal of edges or nodes from the graph.
In my opinion (!), this advantage is overrated. Users will typically store the representative objects of nodes and/or edges in a container, e.g., an std::vector. Now, if nodes or edges are removed from the graph and their representative objects become invalid, the user needs to either ignore this or rearrange the vector so as to keep valid integer indices contiguous, i.e., do exactly the bookkeeping that the design was supposed to make unnecessary.
Hence, my question is: Does the design choice (of hiding the contiguous integer indices of nodes and edges from the user) have other advantages?
(I'm at home in the Java world, but hope that it is OK to give an answer that is not focussed on the particular libraries in question)
There are several possible advantages of such an abstraction. One of the most important ones was already mentioned in the question: The consistency when performing modifications in a graph is much harder to accomplish when indices have to be maintained.
The reason why this may be hard lies in the different possible graph representations: It may be easy to maintain consistent indices if the internal representation only (and always) consisted of random-access lists of Vertex and Edge objects. But for other representations, determining an index may be difficult.
This directly related to the second main advantage: One is free to use different implementations of the graph interface. The section "Graph Data Structures" in the Review of Elementary Graph Theory of the boost documentation lists several data structures that are already offered by the BGL (and everybody may add his own implementation). The running times for certain operations are given in Big-O-notation, and once can see that they vary greatly between the different data structures.
So one can easily imagine that different implementations are better suited for certain tasks. For example, consider an algorithm frequently has to check whether a particular vertex is contained in a graph. For an indexed (that is, list-based) vertex storage, this would require O(n) for each test. With a set-based storage of the vertices, this could be done in O(1) - but there simply are no sensible "indices" in this case.
Additionally, as mentioned in the Graph Concepts overview:
In fact, the BGL interface need not even be implemented using a data-structure, as for some problems it is easier or more efficient to define a graph implicitly based on some functions.
So suggesting that there is an "indexed access" even when the graph does not even exist in memory may hinder such a purely functional implementation.
I can't speak for Lemon graph, but for boost graph I think the main goal is to be generic. So abstracting away the vertex (edge) access helps to achieve that goal.
It is stated in the documentation that boost graph is based on Dietmar Kühl's Masters Thesis on generic graph algorithms. (See my answer to Do property maps remain necessary for BGL?). So the main goal behind the library is to be generic and extensible. The choice of encapsulating access is part of abstracting the algorithms from the graph representation. To me, continuous integer indices are an implementation detail.
Boost doesn't make any assumptions on how you will use the graph or what performance trade offs are important to you. It lets you choose (or implement) the container that will best fit your needs.
If you want to break this encapsulation, you are free to do so. In fact, my most common use of boost graph involves vecS containers and a vector of structs. I usually work with graphs where the size is fixed. I could just as easily use a map of vertex_descriptors (or edge_descriptors) to objects to achieve the same goal.
So in summary, I would say that this not so much a design choice, but rather a consequence of achieving the broader goal of being generic. So the hiding of access has the benefit of being more generic.
I have a data set with about 700 000 entries, and each entry is a set of 3D coordinates with attributes such as name, timestamp, ID, and so on.
Right now I'm just reading the coordinates and render them as points in OpenGL. However I want to associate each point with its corresponding attributes and I want to be able to sort and pick them during runtime based on their attributes. How would I go about to achieve this in an efficient manner?
I know I can put I can put the data in a struct and use stl sort for sorting, but is that a good design choice or is there a more efficient/elegant way of handling the problem?
The way I tend to look at these design choices is to first use one of the standard library containers (btw, if you need to "just" do lookup you don't necessarily have to sort, but you need a container that allows lookup), then check if this an "efficient enough" solution for the problem.
You can usually come up with a custom solution that is more efficient and maybe more elegant but you tend to run into two issues with that:
1) You end up having to implement some type of a container, which will cost you time both in implementation and debugging compared to a well understood and tested container that is already out there. Most of the time you're better off trying to solve the problem at hand rather than make it bigger by adding more code.
2) If someone else will have to maintain your code at some point, chances are they are familiar with standard library components both from a design and implementation perspective, but they won't be familiar with your custom container, thus increasing the learning curve.
If you consider each attribute of your point class as a component of a vector, then your selection process is a region query. Your example of a string attribute being equal to something means that the region is actually a line in your data space. However, there won't be any sorting made on other attributes within that selection, you will have to implement it by yourself, but it should be relatively straightforward for octrees, which partition data in ordered regions.
As advocated in another answer, try existing standard solutions first. If you can find an of the shelf implementation of one of these data structures:
R-tree
KD tree
BSP
Octree, or more likely, a n dimensional version of the quadtree or octree principle (I will use the term octree herein to denote the general data structure)
then go for it. These are the data structures I recommend for spatial data management.
You could also use an embedded RDBMS capable of working with spatial data (they usually implement R-tree for spatial indexing), but it may not be interesting if your dataset isn't dynamic.
If your dataset falls within the 10000 entries range, then by today standards it isn't that large, so using simpler structures should suffice. In that perimeter, I would go first for a simple std::vector, and use std::sort and std::find to filter the data in smaller set and sort it afterward.
I would probably try an ordered set or map on the most queried attribute in a second attempt, then do some benchmarks to pick the more performing solution.
For a more efficient one dimensional indexing algorithm (in essence, that`s what sets and maps are), you might want to try B-trees: there's C++ implementation available from google.
My third attempt would go toward an OpenCL solution (although if you are doing heavy OpenGL rendering, you might prefer doing the work on the CPU instead, but that depends on your framerate needs).
If your dataset is much larger, as it seems to be, then consider one of the more complex solutions I listed initially.
At any rate, without more details about your dataset and how you plan to use it, it will be difficult to provide a good solution, so the only real advice we can give is: try everthing you can and benchmark.
If you're dealing with point clouds, take a look at PCL, it could save you a lot of time and effort without having to dig into the intricacies of spatial indexing yourself. It also includes visualisation.
I implemented an obstacle avoidance algorithm in Matlab which assigns every node in a graph a potential and tries to descend this potential (the goal of the pathplanning is in the global minimum). Now there might appear local minima, so the (global) planning needs a way to get out of these. I used the strategy to have a list of open nodes which are reachable from the already visited nodes. I visit the open node which has the smallest potential next.
I want to implement this in C++ and I am wondering if Boost Graph has such algorithms already. If not - is there any benefit from using this library if I have to write the algorithm myself and I will also have to create my own graph class because the graph is too big to be stored as adjacency list/edge list in memory.
Any advice appreciated!
boost::graph provides a list of Shortest Paths / Cost Minimization Algorithms. You might be interested in the followings: Dijkstra Shortest path, A*.
The algorithms can be easily customized. If that doesn't exactly fit your needs, take a look at the visitor concepts. It allows you to customize your algorithm at some predefined event point.
Finally Distributed BGL handles huge graph (potentially millions of nodes). It will work for you if your graph does not fit in memory.
You can find good overview of the Boost Graph Library here.
And of course, do not hesitate to ask more specific question about BGL on stackoverflow.
To my mind, boost::graph is really awesome for implementing new algorithms, because it provides various data holders, adaptors and commonly used stuff (which can obviously be used as parts of the newly constructed algorithms).
Last ones are also customizable due to usage of visitors and other smart patterns.
Actually, boost::graph may take some time to get used to, but in my opinion it's really worth it.
Anybody out there using BGL for large production servers?
How many node does your network consist of?
How do you handle community detection
Does BGL have any cool ways to detect communities?
Sometimes two communities might be linked together by one or two edges, but these edges are not reliable and can fade away. Sometimes there are no edges at all.
Could someone speak briefly on how to solve this problem.
Please open my mind and inspire me.
So far I have managed to work out if two nodes are on an island (in a community)
in a lest expensive manner, but now I need to work out which two nodes on separate islands are closest to each other. We can only make minimal use of unreliable geographical data.
If we figuratively compare it to a mainland and an island and take it out of social distance context. I want to work out which two bits of land are the closest together across a body of water.
I've used the BGL for graphs with millions of nodes, but the size of the graph you can use depends on what algorithm you are trying to run. You can quickly compute distances between nodes. There are 4 shortest path algorithms which are most applicable depending on your data: (single pairs of points, for all pairs of points, sparse and dense graphs,...).
As for community detection, there aren't any algorithms built-into the BGL specifically for that (but maybe you can contribute one when you are finished with your project). There are a few algorithms that might be helpful in building a community detection algorithm. The max-flow/min-cut algorithms are typically used in community detection (if there is a lot of flow possible between two nodes, then they are likely to be in the same community, if there isn't much flow, then the min-cut is likely to represent roads between communities). There are also heuristics to order the nodes of the graph to reduce bandwidth. Nodes making up "communities" are likely to be close to each other in such an ordering.
As far as I know BGL doesn't have any algorithms specifically for community detection.
By "island" do you mean a disconnected subgraph?
Also, graphs do not have any notion of 'distance'.
This 'social distance' is something that you are going to have to define. Once you've done that a large part of the work is done.
There are numerous methods listed on the page you linked to, most of those only require you to define something like a 'distance' metric, and then plug your definitions into the algorithm.
# David Nehme
Graphs without edge-weights are only about connectedness, they have no notion of distance. If you want to talk about a network then you can talk about distance. But a graph with no edge-weights does not have any distance, unless you want to assume an implied edge-weight of 1 for all edges. But this is really just turning the graph into a network.
Also, he is talking about the distance between two disconnected graphs. To model this, you have to introduce an external concept for distance between nodes, separate from the edge-distance.