How to do per node caching in a tree visitor - c++

I have an application where I want to calculate different representations (mesh, voxelization, signed distance function, ...) of a tree of primitives (leaf nodes) that are combined via boolean operations (inner nodes).
My first approach was to write an abstract base class with a virtual getter function for each of the different representations and to cache the intermediate results at the respective nodes as long as there was no change in their subtree (a change would flush their cache).
However, I was unsatisfied with the ugly coupling of the tree structure with each of the different representations. To alleviate this I removed the abstract base classes and instead set up a visitor for each of the representations.
This neatly decoupled the tree from the representations but left me with the problem that I now need to cache the intermediate results somewhere else and this is where my problem starts.
TL;DR
How do I cache (arbitrarily many, differently typed) intermediate values at inner nodes of the tree without making the tree dependent on the value type?
My Approaches
The requirements offer two choices:
store the data in the tree but with type erasure
store the data outside the tree and somehow "connect" it to the node
The first one raises an efficiency problem: I could easily add a container of boost::any (or something equivalent) to the nodes, but then each visitor would have to search the whole container for its own data.
The separation in the second one introduces the problem of keeping the cache in sync with the current tree. If there are changes in the tree (deletions, alterations of nodes), the cached values must at least be invalidated. My intuition was to use some hash function and an unordered_map, but I hit some problems there as well:
I cannot use the treenodes themselves as key, so I need to introduce another class that just references tree nodes and represents them in the tree
referencing tree nodes from the unordered_map's keys requires erasing all entries whose referents are deleted; otherwise we have a dangling reference (or pointer) in the unordered_map, which could get dereferenced on rehash
changes in the tree would require to reconstruct the unordered_map because keys might have changed
Am I missing some obvious solution to this?
Which approach would you favor (and why)?

I once had a similar problem and my solution was as follows:
Let each node have a unique identifier.
Let each node have a version number. Modifications that invalidate calculated values for the node just increase the version number.
Let each visitor have a caching map, where the node ID is the key, mapped to a version/value pair.
When (re-)walking the tree, look for a node's entry in the map. If the version is correct, use the cached value. If it is outdated, calculate the new value and replace the old version/value pair.
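The four steps above could be sketched roughly like this. This is only a minimal illustration: `Node`, `VersionedCache` and `usage_example` are made-up names, and the sketch assumes that whatever modifies a node also bumps the version of every ancestor whose cached value depends on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical node: the tree only carries an ID and a version, no cached values.
struct Node {
    std::uint64_t id;       // unique per node
    std::uint64_t version;  // bumped whenever a cached value for this node becomes stale
};

// One cache per visitor: node ID -> (version the value was computed for, value).
template <typename Value>
class VersionedCache {
public:
    // Returns the cached value if its version matches the node's version.
    // Otherwise calls compute(), stores the result under the current version
    // and returns it.
    template <typename Compute>
    const Value& get(const Node& node, Compute&& compute) {
        auto it = entries_.find(node.id);
        if (it != entries_.end() && it->second.first == node.version) {
            return it->second.second;  // cache hit
        }
        Value fresh = compute();
        if (it == entries_.end()) {
            it = entries_.emplace(node.id,
                                  std::make_pair(node.version, std::move(fresh))).first;
        } else {
            it->second = std::make_pair(node.version, std::move(fresh));
        }
        return it->second.second;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::unordered_map<std::uint64_t, std::pair<std::uint64_t, Value>> entries_;
};

// Usage: the second lookup is served from the cache until the version changes.
inline void usage_example() {
    VersionedCache<double> volume_cache;  // e.g. owned by a hypothetical volume visitor
    Node node{7, 0};
    int computations = 0;
    auto compute = [&] { ++computations; return 1.5; };
    volume_cache.get(node, compute);
    volume_cache.get(node, compute);
    assert(computations == 1);
    ++node.version;  // some modification invalidated the node
    volume_cache.get(node, compute);
    assert(computations == 2);
}
```

Since each visitor owns its own `VersionedCache<Value>`, the tree never needs to know any value type. Entries for deleted nodes are not removed in this sketch; they would have to be purged separately if memory matters.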
At first, I used the nodes' addresses as IDs, but for memory reasons I had to reuse subtrees, so I switched to using the path to the node as its ID. Such a path has the advantage that each visitor can compute it while walking the tree, so it need not be stored at the node. In my case, each node could have at most two children, so a path was merely a sequence of left/right decisions, which can be stored in a simple unsigned int with some bit-shifting (my trees never reached a depth of 32, so a 32-bit unsigned was more than enough as a key).
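A possible encoding of such a path key is sketched below. The leading sentinel bit is an assumption of this sketch, not something stated above: it makes paths of different lengths distinct (otherwise "left" and "left, left" would both be 0), at the cost of one bit, so a 32-bit key holds at most 31 decisions. `PathKey`, `kRootKey`, `left_child`, `right_child` and `depth` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

using PathKey = std::uint32_t;

// The root is the empty path: only the sentinel bit is set.
constexpr PathKey kRootKey = 1u;

// Append a "left" decision (bit 0) to the path.
inline PathKey left_child(PathKey parent) {
    assert((parent >> 31) == 0 && "path too deep for a 32-bit key");
    return parent << 1;
}

// Append a "right" decision (bit 1) to the path.
inline PathKey right_child(PathKey parent) {
    assert((parent >> 31) == 0 && "path too deep for a 32-bit key");
    return (parent << 1) | 1u;
}

// Number of left/right decisions stored in the key.
inline int depth(PathKey key) {
    int d = 0;
    while (key > 1u) {
        key >>= 1;
        ++d;
    }
    return d;
}
```

A visitor would start with `kRootKey` and pass `left_child(key)` or `right_child(key)` down as it recurses, using the key for the cache lookup at each node.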

Related

Traverse through a DAG-like structure to produce another DAG-like structure in Clojure

I have a DAG-like structure that is essentially a deeply-nested map. The maps in this structure can have common values, so the overall structure is not a tree but a directed acyclic graph. I'll refer to this structure as a DAG for brevity.
The nodes in this graph are of different but finite number of categories. Each category can have its own structure/keywords/number-of-children. There is one unique node that is the source of this DAG, meaning from this node we can reach all nodes in the DAG.
The task is to traverse through the DAG from the source node, and convert each node to another one or more nodes in a new constructed graph. I'll give an example for illustration.
The graph in the upper half is the input one. The lower half is the one after transformation. For simplicity, the transformation is only done on node A where it is split into node 1 and A1. The children of node A are also reallocated.
What I have tried (or in mind):
Write a function that converts one node, handling each of the different node types. Inside this function, recursively call itself to convert each of the node's children. This method suffers from the problem that the data are immutable: nodes in the transformed graph cannot be modified later to add children. To overcome this, I need to wrap every node in a ref/atom/agent.
Do a topological sort on the original graph, then convert the nodes in reverse order, i.e., bottom-up. This method requires an extra traversal of the graph, but at least the data need not be mutable. Regarding the topological sort algorithm, I'm considering the DFS-based method described on the wiki page, which requires neither knowledge of the full graph nor of a node's parents.
My question is:
Are there any other approaches you might consider, possibly more elegant/efficient/idiomatic?
I'm more in favour of the second method; are there any flaws or potential problems?
Thanks!
EDIT: On second thought, a topological sort is not necessary. The transformation can already be done in a post-order traversal.
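The post-order idea from the EDIT could look roughly like the following. It is written in C++ rather than Clojure purely as an illustration, and every name in it (`Node`, `NodePtr`, `make_node`, `transform`) is hypothetical. The memo table is an addition of this sketch: it ensures that a node shared by several parents is converted only once and stays shared in the output.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Immutable node: children are fixed at construction time.
struct Node {
    std::string label;
    std::vector<std::shared_ptr<const Node>> children;
};
using NodePtr = std::shared_ptr<const Node>;

inline NodePtr make_node(std::string label, std::vector<NodePtr> children) {
    return std::make_shared<Node>(Node{std::move(label), std::move(children)});
}

// Builds one output node from an input node and its already converted children.
using Convert = std::function<NodePtr(const Node&, std::vector<NodePtr>)>;

// Post-order rebuild: children are converted before their parent, so no output
// node ever has to be mutated after it was created.
inline NodePtr transform(const NodePtr& in, const Convert& convert,
                         std::unordered_map<const Node*, NodePtr>& memo) {
    assert(in != nullptr);
    auto it = memo.find(in.get());
    if (it != memo.end()) {
        return it->second;  // shared node, already converted
    }
    std::vector<NodePtr> new_children;
    new_children.reserve(in->children.size());
    for (const auto& child : in->children) {
        new_children.push_back(transform(child, convert, memo));
    }
    NodePtr out = convert(*in, std::move(new_children));
    memo.emplace(in.get(), out);
    return out;
}
```

Splitting one input node into several output nodes, as in the A to 1/A1 example, would happen inside the `convert` callback, which may build and return a small subgraph instead of a single node.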
This looks like a perfect application of Zippers. They have all the capabilities you described as needed and can produce the edited 'new' DAG. There are also a number of libraries that ease the search and replace capability using predicate threads.
I've used zippers when working with OWL ontologies defined in nested vector or map trees.
Another option would be to take a look at Walkers although I've found these a bit more tedious to use.

C++ augmenting an STL data structure

My requirement is to be able to quickly retrieve a minimum and maximum value in a tree. (Note, not the minimum/maximum key, but min/max of the satellite data).
The tree would be based on strings as keys, and each node would store an integer with it. This integer is bound to change and will be constantly updated. Keys remain fixed.
I was considering using an approach described here of augmenting a red-black tree, so that every node stores the maximum (the max of its left child's max, its right child's max and its own value), and similarly for the minimum.
So when I update a node, I would simply update the min/max of every node that was traversed to reach my current node.
What would be the best approach to avoid rewriting the STL implementation of a red-black tree?
You can't use an STL container (e.g. set, which doesn't technically even need to be a BST as far as I know), as it doesn't provide you with access to the underlying structure.
Your options are:
As you already mentioned, writing your own BST.
Simply using a secondary BST (or heap) which is ordered by your integer value.
Using Boost's multi_index_container.
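The second option (a secondary structure ordered by the integer value) could be sketched with standard containers only. `MinMaxStore` is a made-up name, and the choice of `std::multiset` as the secondary structure is one possibility among several, not the only way to do it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

class MinMaxStore {
public:
    // Inserts the key or updates its value, keeping the value index in sync.
    void set(const std::string& key, int value) {
        auto it = values_.find(key);
        if (it != values_.end()) {
            // Remove exactly one copy of the old value; other keys may hold the same value.
            ordered_.erase(ordered_.find(it->second));
            it->second = value;
        } else {
            values_.emplace(key, value);
        }
        ordered_.insert(value);
    }

    // Precondition for both: the store is not empty.
    int min() const {
        assert(!ordered_.empty());
        return *ordered_.begin();
    }
    int max() const {
        assert(!ordered_.empty());
        return *ordered_.rbegin();
    }

private:
    std::map<std::string, int> values_;  // primary index: key -> satellite value
    std::multiset<int> ordered_;         // secondary index: all values, sorted
};
```

An update costs two logarithmic operations instead of one, and min/max are read from the ends of the multiset. Boost's multi_index_container packages the same idea into a single container.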

Pure functional tree with parent pointer

I know that an RB tree with left and right child pointers can be implemented in a purely functional way without degrading the log n performance. Can a tree with parent pointers also be implemented with logarithmic-time operations? It seems that the cyclic references child->parent and parent->child require the whole tree to be cloned on every update, which takes linear time.
Fully persistent, purely functional data structures are tree shaped, but may utilize pointer sharing to become a directed acyclic graph. However, once you introduce a pointer cycle, you can't "change" part of that subgraph without copying that entire subgraph.
The solution is to add an indirection: You assign "identities" to values, which can be objects (like Clojure's atoms) or simple values used as lookup keys (such as numbers or symbols). You can think of an immutable pointer as an implementation of IDeref which always returns the same object. A cyclic graph can be represented as an adjacency graph, where "deref-ing" a node by name is the same as looking it up in the map of names to nodes.
For more information on representing fully persistent graphs, see Fully Persistent Graphs - Which One to Choose?.

How to store a tree on the disk and make add/delete/swap operations easy

All right, this question requires a bit of reading on your side. I'll try to keep this short and simple.
I have a tree (not a binary tree, just a tree) with data associated to each node (binary data, I don't know what they are AND I don't know how long they are)
Each node of the tree also has an index which isn't related to how it appears in the tree, to make it short it could be like that:
The index number represents the order the user WANTS the tree to be navigated and cannot be duplicated.
I need to store this structure in a file on the disk.
My problem is: how to design a flexible disk storing format that can make loading and working on the tree as easy as possible.
In fact the user should be allowed to
Create a child block to an element (and this should be easy enough, it's sufficient to add data to the file paying attention to avoiding duplicated indices)
Delete a child (I should prompt the user "do you want to delete all this node's children as well? or should I add its children to its parent?"). The tricky part about this is that deleting a node could also free up an index, and I can't let the user use that index again when adding another node (or the order he set could be messed up), I need to update the entire tree!
Swap an index with another one
I'm using C++ and Qt, and so far I have thought of a lot of structures with a lot of fields, like this one
struct dataToBeStoredInTheFile
{
    long data_size;
    byte *data; // ... the data here
    int index;
    int number_of_children;
    int *children_indices; // ... array of integers
};
This has the advantage of identifying each node by its respective index, but it's very slow when swapping indices between two nodes, or when deleting a node and updating every other node's index, because you have to traverse all the nodes and all their "children_indices" arrays.
Would using something like a "hash" to identify each node be more flexible?
Should I use two indices, one for the position in the tree and one for the user's index? If you have any better idea for storing the data, you're welcome.
I would suggest using something like boost.serialization; then you don't have to worry about the actual format when saving to disk, and can concentrate on an effective in-memory solution.
Edit: Re-reading your question I see you are using Qt; in that case it should have its own serialization framework that you can use.
If it doesn't have to be a SINGLE file, you could use the file/directory structure to represent your tree, where each node corresponds to a single file (w/ a directory for each interior node). Maybe not the most efficient, but incredibly easy to do.
Again, if you have some flexibility on the number of files (but not as much as above), you could have one file for the tree structure (so that each node is a fixed size, simplifying its manipulation) and a separate one for storing node contents. To speed up working with the "content file", you could treat it the way a garbage collecting system would: just keep adding new/updated nodes on the end, marking old nodes as no longer in use, and periodically clearing things out.
Better yet, follow @JoachimPileborg's advice :)
I don't think you should use the user-specified index to identify the nodes, as that's not directly related to the way you're storing the tree, and you don't have an efficient way of accessing the nodes by index. You should either keep two indices for each node - the user-specified one, and another one that's implementation dependent; or maintain an array mapping the user-specified index to one you're using for the implementation.
Also, it might be better if you use a different structure to store the tree. For each node, store the following:
the index of the parent
the index of the leftmost son
the index of the left brother
the index of the right brother
This way adding a node and swapping two nodes could be done with some simple pointer manipulations (I don't mean explicit pointers - the indices are somewhat like pointers anyway). Deleting a node would still probably be slow as you have to visit all the children.
As a bonus, if you use this structure, every node has a fixed size (unlike with the linked list you're proposing). This means that you can access a node directly by seeking in the file.
You should also maintain the smallest index the user can use for new nodes - so, for example, even if the largest index was 5 and it was deleted, you still keep 6 as the next free index so 5 cannot be reused.
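To make the fixed-size idea a bit more concrete, here is a rough sketch. All names (`NodeRecord`, `kNone`, `write_record`, `read_record`, `swap_user_indices`, `UserIndexCounter`) are invented for illustration, the variable-length node data is assumed to live elsewhere (for example in a separate content file), and dumping the struct as raw bytes is a shortcut that is not portable across platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <type_traits>
#include <utility>

constexpr std::int32_t kNone = -1;  // "no such node"

// Fixed-size record; its slot number in the file is the implementation index.
struct NodeRecord {
    std::int32_t parent = kNone;
    std::int32_t first_child = kNone;    // leftmost son
    std::int32_t left_sibling = kNone;
    std::int32_t right_sibling = kNone;
    std::int32_t user_index = kNone;     // order chosen by the user
};
static_assert(std::is_trivially_copyable<NodeRecord>::value,
              "records are written as raw bytes");

inline std::streamoff offset_of(std::int32_t slot) {
    assert(slot >= 0);
    return static_cast<std::streamoff>(slot) *
           static_cast<std::streamoff>(sizeof(NodeRecord));
}

// Works on any seekable stream, e.g. std::fstream or std::stringstream.
inline void write_record(std::ostream& out, std::int32_t slot, const NodeRecord& rec) {
    out.seekp(offset_of(slot));
    out.write(reinterpret_cast<const char*>(&rec), sizeof rec);
}

inline NodeRecord read_record(std::istream& in, std::int32_t slot) {
    NodeRecord rec;
    in.seekg(offset_of(slot));
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return rec;
}

// Swapping the user-visible order touches two records and nothing else.
inline void swap_user_indices(std::iostream& file, std::int32_t a, std::int32_t b) {
    NodeRecord ra = read_record(file, a);
    NodeRecord rb = read_record(file, b);
    std::swap(ra.user_index, rb.user_index);
    write_record(file, a, ra);
    write_record(file, b, rb);
}

// Hands out user indices; never decremented, so deleted indices are not reused.
struct UserIndexCounter {
    std::int32_t next = 0;
    std::int32_t allocate() { return next++; }
};
```

Because the tree links refer to slots rather than to user indices, swapping two user indices does not require visiting any other node.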

How can one create cyclic (and immutable) data structures in Clojure without extra indirection?

I need to represent directed graphs in Clojure. I'd like to represent each node in the graph as an object (probably a record) that includes a field called :edges that is a collection of the nodes that are directly reachable from the current node. Hopefully it goes without saying, but I would like these graphs to be immutable.
I can construct directed acyclic graphs with this approach as long as I do a topological sort and build each graph "from the leaves up".
This approach doesn't work for cyclic graphs, however. The one workaround I can think of is to have a separate collection (probably a map or vector) of all of the edges for an entire graph. The :edges field in each node would then have the key (or index) into the graph's collection of edges. Adding this extra level of indirection works because I can create keys (or indexes) before the things they (will) refer to exist, but it feels like a kludge. Not only do I need to do an extra lookup whenever I want to visit a neighboring node, but I also have to pass around the global edges collection, which feels very clumsy.
I've heard that some Lisps have a way of creating cyclic lists without resorting to mutation functions. Is there a way to create immutable cyclic data structures in Clojure?
You can wrap each node in a ref to give it a stable handle to point at (and to allow you to modify the reference, which can start as nil). It is then possible to build cyclic graphs that way. This does have "extra" indirection, of course.
I don't think this is a very good idea though. Your second idea is a more common implementation. We built something like this to hold an RDF graph and it is possible to build it out of the core data structures and layer indices over the top of it without too much effort.
I've been playing with this the last few days.
I first tried making each node hold a set of refs to edges, and each edge hold a set of refs to the nodes. I set them equal to each other in a (dosync... (ref-set...)) type of operation. I didn't like this because changing one node requires a large amount of updates, and printing out the graph was a bit tricky. I had to override the print-method multimethod so the repl wouldn't stack overflow. Also any time I wanted to add an edge to an existing node, I had to extract the actual node from the graph first, then do all sorts of edge updates and that sort of thing to make sure everyone was holding on to the most recent version of the other thing. Also, because things were in a ref, determining whether something was connected to something else was a linear-time operation, which seemed inelegant. I didn't get very far before determining that actually performing any useful algorithms with this method would be difficult.
Then I tried another approach, which is a variation of the matrix referred to elsewhere. The graph is a Clojure map, where the keys are the nodes (not refs to nodes), and the values are another map in which the keys are the neighboring nodes and the value for each key is the edge to that node, represented either as a numerical value indicating the strength of the edge, or as an edge structure which I defined elsewhere.
It looks like this, sort of, for 1->2, 1->3, 2->5, 5->2
(def graph {node-1 {node-2 edge12, node-3 edge13},
            node-2 {node-5 edge25},
            node-3 nil, ;; no edge leaves from node 3
            node-5 {node-2 edge52}}) ;; nodes 2 and 5 have an undirected edge
To access the neighbors of node-1 you go (keys (graph node-1)) or call the function defined elsewhere (neighbors graph node-1), or you can say ((graph node-1) node-2) to get the edge from 1->2.
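For readers who do not use Clojure, the same map-of-maps layout can be approximated in C++ as shown below. This is only an analogy: the names (`NodeId`, `Weight`, `Graph`, `neighbors`, `find_edge`) are invented, nodes are reduced to plain string IDs, edges to a numeric strength, and `std::map` is mutable with logarithmic lookup, unlike Clojure's persistent maps.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using NodeId = std::string;
using Weight = double;  // edge value; stands in for an edge structure
using Graph = std::map<NodeId, std::map<NodeId, Weight>>;

// Neighbors of a node, like (keys (graph node)); empty if the node is unknown.
inline std::vector<NodeId> neighbors(const Graph& g, const NodeId& node) {
    std::vector<NodeId> result;
    auto it = g.find(node);
    if (it != g.end()) {
        for (const auto& entry : it->second) {
            result.push_back(entry.first);
        }
    }
    return result;
}

// Edge from -> to, like ((graph from) to); nullptr if there is no such edge.
inline const Weight* find_edge(const Graph& g, const NodeId& from, const NodeId& to) {
    auto it = g.find(from);
    if (it == g.end()) return nullptr;
    auto edge = it->second.find(to);
    return edge == it->second.end() ? nullptr : &edge->second;
}

// Usage: the example graph 1->2, 1->3, 2->5, 5->2, including the 2 <-> 5 cycle.
inline void usage_example() {
    Graph g;
    g["1"]["2"] = 12.0;
    g["1"]["3"] = 13.0;
    g["2"]["5"] = 25.0;
    g["3"];  // node 3 exists but has no outgoing edges
    g["5"]["2"] = 52.0;
    assert(neighbors(g, "1").size() == 2);
    assert(find_edge(g, "2", "5") != nullptr);
    assert(find_edge(g, "5", "2") != nullptr);  // the cycle needs no pointers
}
```

The cycle between nodes 2 and 5 exists only as two entries in the outer map, which is why no node ever has to hold a reference to another node.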
Several advantages:
Constant time lookup of a node in the graph and of a neighboring node, or return nil if it doesn't exist.
Simple and flexible edge definition. A directed edge exists implicitly when you add a neighbor to a node entry in the map, and its value (or a structure for more information) is provided explicitly, or nil.
You don't have to look up the existing node to do anything to it. It's immutable, so you can define it once before adding it to the graph and then you don't have to chase it around getting the latest version when things change. If a connection in the graph changes, you change the graph structure, not the nodes/edges themselves.
This combines the best features of a matrix representation (the graph topology is in the graph map itself not encoded in the nodes and edges, constant time lookup, and non-mutating nodes and edges), and the adjacency-list (each node "has" a list of its neighboring nodes, space efficient since you don't have any "blanks" like a canonical sparse matrix).
You can have multiple edges between nodes, and if you accidentally define an edge which already exists exactly, the map structure takes care of making sure you are not duplicating it.
Node and edge identity is kept by clojure. I don't have to come up with any sort of indexing scheme or common reference point. The keys and values of the maps are the things they represent, not a lookup elsewhere or ref. Your node structure can be all nils, and as long as it's unique, it can be represented in the graph.
The only big-ish disadvantage I see is that for any given operation (add, remove, any algorithm), you can't just pass it a starting node. You have to pass the whole graph map and a starting node, which is probably a fair price to pay for the simplicity of the whole thing. Another minor disadvantage (or maybe not) is that for an undirected edge you have to define the edge in each direction. This is actually okay because sometimes an edge has a different value for each direction and this scheme allows you to do that.
The only other thing I see here is that because an edge is implicit in the existence of a key-value pair in the map, you cannot define a hyperedge (i.e. one which connects more than 2 nodes). I don't think this is necessarily a big deal, since most (all?) graph algorithms I've come across only deal with edges that connect 2 nodes.
I ran into this challenge before and concluded that it isn't possible using truly immutable data structures in Clojure at present.
However you may find one or more of the following options acceptable:
Use deftype with ":unsynchronized-mutable" to create a mutable :edges field in each node that you change only once during construction. You can treat it as read-only from then on, with no extra indirection overhead. This approach will probably have the best performance but is a bit of a hack.
Use an atom to implement :edges. There is a bit of extra indirection, but I've personally found reading atoms to be extremely efficient.