How to 'mark' a node in a Clojure data structure? - clojure

I have
a Clojure data structure, let's call it dom, a tree of vectors and
maps of indefinite depth;
a particular node in it, let's call it the focus node, referred to as
a path into the tree: a sequence of keys such as you could present to
get-in.
I will be deciding on the focussed node in one function and I want to somehow represent that choice of focussed node in a way that can be passed to another function in a way that does not violate immutability and is not in conflict with Clojure's persistent data structures.
When I traverse the tree, I want to treat the focus node differently: for example, if I was printing the tree, I might want to print the focus node in bold.
If I were using C or Java, I could save a pointer/reference to the focus node, which I could compare with the current node as I traversed the tree. I don't think that's the right way to do it in Clojure: it feels hacky, and I'm sure there's some way to do it that takes advantage of Clojure's persistent data structures.
The solution has to work in Clojure and ClojureScript.
The options I can think of are:
Store a reference and check against that.
Attach a marker to the node in question.
Simultaneously recurse into the tree and along the path to the marked node.
Option (1) is unattractive, as I've explained.
Option (2) seems best, and painless given persistent data structures.
Option (3) is similar to option (2), except that it combines the
marking and traversing steps.
I'm sure this is a common problem. Is there a standard solution to it?

I suggest you reconsider #MerceloMorales's suggestion: to use metadata. Your node object is to have an accidental attribute that doesn't affect its normal functions. That is what metadata is designed for. And it works in ClojureScript. The only reason I can think of for not using metadata is that the node value is not a Clojure object, but is, for example, a number.
In The Clojure Cookbook, 2.22. Keeping Multiple Values for a Key, Luke Vanderhart uses metadata to solve a similar problem: marking entries that need to be interpreted as collections rather than single values.
Another approach might be to use a zipper to traverse/modify the node tree. Zippers are implemented in terms of - you've guessed it - metadata.
I share your misgivings about metadata: it feels queasy to attach just any old stuff to your data - like infecting it with a parasite. However, it's just as immutable a part of the object as any other.
The suggestion to use zippers is naive: The standard clojure zippers are designed for a hierarchy of sequential containers, not associative ones.

See Brandon Bloom's Dendrology talk for some great overview on questions like this.
I believe the ease of "marking" or otherwise updating tree structured data underlies his strong recommendation to always represent nodes as nested maps rather than vectors (or a mixture of vectors and maps). A mark based on a path described by a vector of keys is then as simple as:
(update-in tree-data path assoc :is-focussed true)
Your original data structure is unchanged and the new one returned by update-in shares everything structurally with the original except the updated node which is now easily tested for the :is-focussed property upon traversal.

Related

Clojure Zippers - Associating meta-data with a specific node?

I'm using Clojure's zippers to implement what I thought wouldn't be particularly challenging but it seems I may be missing something.
Essentially what I want to do is, given some data structure as a list, e.g. (1 (2 3) 4)), I want to be able to associate some meta-data against a particular loc so that I can make decisions about that loc given a different loc.
For example, using seq-zip from the zipper library, when I hit the loc for 2 in the above list I want to associate some arbitrary data with that loc, then when I hit loc 3 I want to examine that data (using something like clojure.zip/prev to get there) and then make a decision based on whether or not that particular loc had some data associated with it.
However it doesn't seem particularly straight forward, I've tried assoc'ing some data against the loc, but after using clojure.zip/next, the data is still present in the loc map, which isn't what I wanted.
Unfortunately because the node values I'm working with can be numbers, I can't simply supplement the node values themselves with meta-data, unless I we're to box the values in some sort of wrapper, but this seems rather ugly, any ideas?
There is only one meaningful loc object at a time: it is your "cursor" into the zipped data structure. Therefore, you can only store metadata on it globally. Since your nodes have no room for metadata themselves, you can either wrap the nodes (e.g., just globally wrap everything in a map as Alan Thompson suggests), or else store in the loc a sufficiently powerful metadata map to contain everything you'd want to know about all nodes. For example, loc's metadata could contain a map whose keys are paths in the zipper and whose values are decorations for the node at that path. It's not ideal, because if you edit the structure of your tree, the metadata will be left dangling. But if you do a structure-preserving traversal of the tree, this could work. Personally, I like Alan's idea better. Represent data as data.

Traverse through a DAG-like structure to produce another DAG-like structure in Clojure

I have a DAG-like structure that is essentially a deeply-nested map. The maps in this structure can have common values, so the overall structure is not a tree but a direct acyclic graph. I'll refer to this structure as a DAG for brevity.
The nodes in this graph are of different but finite number of categories. Each category can have its own structure/keywords/number-of-children. There is one unique node that is the source of this DAG, meaning from this node we can reach all nodes in the DAG.
The task is to traverse through the DAG from the source node, and convert each node to another one or more nodes in a new constructed graph. I'll give an example for illustration.
The graph in the upper half is the input one. The lower half is the one after transformation. For simplicity, the transformation is only done on node A where it is split into node 1 and A1. The children of node A are also reallocated.
What I have tried (or in mind):
Write a function to convert one object for different types. Inside this function, recursively call itself to convert each of its children. This method suffers from the problem that data are immutable. The nodes in the transformed graph cannot be changed randomly to add children. To overcome this, I need to wrap every node in a ref/atom/agent.
Do a topological sort on the original graph. Then convert the nodes in the reversed order, i.e., bottom-up. This method requires a extra traverse of the graph but at least the data need not to be mutable. Regarding the topological sort algorithm, I'm considering DFS-based method as stated in the wiki page, which does not require the knowledge of the full graph nor a node's parents.
My question is:
Is there any other approaches you might consider, possibly more elegant/efficient/idiomatic?
I'm more in favour of the second method, is there any flaws or potential problems?
Thanks!
EDIT: On a second thought, a topological sorting is not necessary. The transformation can be done in the post-order traversal already.
This looks like a perfect application of Zippers. They have all the capabilities you described as needed and can produce the edited 'new' DAG. There are also a number of libraries that ease the search and replace capability using predicate threads.
I've used zippers when working with OWL ontologies defined in nested vector or map trees.
Another option would be to take a look at Walkers although I've found these a bit more tedious to use.

Advice on data-structure for represeting a Path in system

I have a system where i need to represent something similar as Path, a path just provides a route to reach a particular node. There can be multiple Path that can be used to reach same node.
I am currently representing a Path using vector of Nodes, I need to do operations like replaceSubpath, containsNode, containsSubPath, appendNode, getRootNode, getLeafNode (very similar operations as done for string). All of these operations can be done on vector but performance for a large path can suck.
I am looking at using boost::graph but have no experience with it, I would like to know if using boost::graph would be correct/good data structure for these and similar operations?
Any advices on using some other data structure would be helpful too, I am aware I can optimize my vector solution by keep (multi) map of node to iterator etc.
Essentially, the class adjacency_list<> from Boost.Graph is a vector of vertices. Vertex descriptor is an integer index in this vector.
Typically, a a tree or a path (path is a special case of a tree, right?) is represented as a predecessor map (like going backward from leaves to root or from target to source). In case of integer vertex descriptors, such predecessor map is simply vector<int>. I do not think you can represent a path or a tree in a more compact way.
Of course, such vector of predecessors can be substituted into string operations, esp. those from Boost.String_Algo, http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo.html
From what you describe it sounds like you are generating and editing paths in a graph, perhaps for optimizing routes etc.
I don't think that one data structure will give you what you want. I would keep the graph structure separate from the paths you are generating.
replaceSubpath: To me this would suggest a doubly linked list implementation. When you have the start and end of your path just paste it in and replace the subpath.
containsNode: Consider adding a map or set for fast containment checks.
containsSubPath: This could be tough depending on your other concerns and speed needs. If this is a very important operation consider a Suffix Tree to test sub paths quickly. Keep in mind its better if the path doesn't change much since constructing them is O(N)
appendNode: Linked list will be easy here
getRootNode: Hold a pointer to the current root node.
getLeafNode: Hold a pointer to the current leaf node.
I would make a custom data structure that can address these concerns based on your goals. Finding subpaths and replacing them quickly might be competing performance goals. Usually more search optimization = more construction overhead making them less dynamic.
Take a look at how some other code that you admire implements the need to manage paths. For example, you might look at several implementation of Dijkstra and choose the one that looks best, most convenient or just to your taste.
IMHO it is not a good idea to model a "path" as an object, but rather think of it as a property of the nodes in a graph.
In general, I would consider 'marking' nodes that are on the path. For example, the class you use to contain the properties of the nodes might have a flag indicating true if the node is on the path and an attribute with the index of the next node on the path.

How can one create cyclic (and immutable) data structures in Clojure without extra indirection?

I need to represent directed graphs in Clojure. I'd like to represent each node in the graph as an object (probably a record) that includes a field called :edges that is a collection of the nodes that are directly reachable from the current node. Hopefully it goes without saying, but I would like these graphs to be immutable.
I can construct directed acyclic graphs with this approach as long as I do a topological sort and build each graph "from the leaves up".
This approach doesn't work for cyclic graphs, however. The one workaround I can think of is to have a separate collection (probably a map or vector) of all of the edges for an entire graph. The :edges field in each node would then have the key (or index) into the graph's collection of edges. Adding this extra level of indirection works because I can create keys (or indexes) before the things they (will) refer to exist, but it feels like a kludge. Not only do I need to do an extra lookup whenever I want to visit a neighboring node, but I also have to pass around the global edges collection, which feels very clumsy.
I've heard that some Lisps have a way of creating cyclic lists without resorting to mutation functions. Is there a way to create immutable cyclic data structures in Clojure?
You can wrap each node in a ref to give it a stable handle to point at (and allow you to modify the reference which can start as nil). It is then possible to possible to build cyclic graphs that way. This does have "extra" indirection of course.
I don't think this is a very good idea though. Your second idea is a more common implementation. We built something like this to hold an RDF graph and it is possible to build it out of the core data structures and layer indices over the top of it without too much effort.
I've been playing with this the last few days.
I first tried making each node hold a set of refs to edges, and each edge hold a set of refs to the nodes. I set them equal to each other in a (dosync... (ref-set...)) type of operation. I didn't like this because changing one node requires a large amount of updates, and printing out the graph was a bit tricky. I had to override the print-method multimethod so the repl wouldn't stack overflow. Also any time I wanted to add an edge to an existing node, I had to extract the actual node from the graph first, then do all sorts of edge updates and that sort of thing to make sure everyone was holding on to the most recent version of the other thing. Also, because things were in a ref, determining whether something was connected to something else was a linear-time operation, which seemed inelegant. I didn't get very far before determining that actually performing any useful algorithms with this method would be difficult.
Then I tried another approach which is a variation of the matrix referred to elsewhere. The graph is a clojure map, where the keys are the nodes (not refs to nodes), and the values are another map in which the keys are the neighboring nodes and single value of each key is the edge to that node, represented either as a numerical value indicating the strength of the edge, or an edge structure which I defined elsewhere.
It looks like this, sort of, for 1->2, 1->3, 2->5, 5->2
(def graph {node-1 {node-2 edge12, node-3 edge13},
node-2 {node-5 edge25},
node-3 nil ;;no edge leaves from node 3
node-5 {node-2 edge52}) ;; nodes 2 and 5 have an undirected edge
To access the neighbors of node-1 you go (keys (graph node-1)) or call the function defined elsewhere (neighbors graph node-1), or you can say ((graph node-1) node-2) to get the edge from 1->2.
Several advantages:
Constant time lookup of a node in the graph and of a neighboring node, or return nil if it doesn't exist.
Simple and flexible edge definition. A directed edge exists implicitly when you add a neighbor to a node entry in the map, and its value (or a structure for more information) is provided explicitly, or nil.
You don't have to look up the existing node to do anything to it. It's immutable, so you can define it once before adding it to the graph and then you don't have to chase it around getting the latest version when things change. If a connection in the graph changes, you change the graph structure, not the nodes/edges themselves.
This combines the best features of a matrix representation (the graph topology is in the graph map itself not encoded in the nodes and edges, constant time lookup, and non-mutating nodes and edges), and the adjacency-list (each node "has" a list of its neighboring nodes, space efficient since you don't have any "blanks" like a canonical sparse matrix).
You can have multiples edges between nodes, and if you accidentally define an edge which already exists exactly, the map structure takes care of making sure you are not duplicating it.
Node and edge identity is kept by clojure. I don't have to come up with any sort of indexing scheme or common reference point. The keys and values of the maps are the things they represent, not a lookup elsewhere or ref. Your node structure can be all nils, and as long as it's unique, it can be represented in the graph.
The only big-ish disadvantage I see is that for any given operation (add, remove, any algorithm), you can't just pass it a starting node. You have to pass the whole graph map and a starting node, which is probably a fair price to pay for the simplicity of the whole thing. Another minor disadvantage (or maybe not) is that for an undirected edge you have to define the edge in each direction. This is actually okay because sometimes an edge has a different value for each direction and this scheme allows you to do that.
The only other thing I see here is that because an edge is implicit in the existence of a key-value pair in the map, you cannot define a hyperedge (ie one which connects more than 2 nodes). I don't think this is a big deal necessarily since most graph algorithms I've come across (all?) only deal with an edge that connects 2 nodes.
I ran into this challenge before and concluded that it isn't possible using truly immutable data structures in Clojure at present.
However you may find one or more of the following options acceptable:
Use deftype with ":unsynchronized-mutable" to create a mutable :edges field in each node that you change only once during construction. You can treat it as read-only from then on, with no extra indirection overhead. This approach will probably have the best performance but is a bit of a hack.
Use an atom to implement :edges. There is a bit of extra indirection, but I've personally found reading atoms to be extremely efficient.

In Clojure, when should trees of heterogenous node types be represented using records or vectors?

Which is better idiomatic clojure practice for representing a tree made up of different node types:
A. building trees out of several different types of records, that one defines using deftype or defrecord:
(defrecord node_a [left right])
(defrecord node_b [left right])
(defrecord leaf [])
(def my-tree (node_a. (node_b. (leaf.) (leaf.)) (leaf.)))
B. building trees out of vectors, with keywords designating the types:
(def my-tree [:node-a [:node-b :leaf :leaf] :leaf])
Most clojure code that I see seems to favor the usage of the general purpose data structures (vectors, maps, etc.), rather than datatypes or records. Hiccup, to take one example, represents html very nicely using the vector + keyword approach.
When should we prefer one style over the other?
You can put as many elements into a vector as you want. A record has a set number of fields. If you want to constrain your nodes to only have N sub-nodes, records might be good, e.g. making when a binary tree, where a node has to have only a Left and Right. But for something like HTML or XML, you probably want to support arbitrary numbers of sub-nodes.
Using vectors and keywords means that "extending" the set of supported node types is as simple as putting a new keyword into the vector. [:frob "foo"] is OK in Hiccup even if its author never heard of frobbing. Using records, you'd potentially have to define a new record for every node type. But then you get the benefit of catching typos and verifying subnodes. [:strnog "some bold text?"] isn't going to be caught by Hiccup, but (Strnog. "foo") would be a compile-time error.
Vectors being one of Clojure's basic data types, you can use Clojure's built-in functions to manipulate them. Want to extend your tree? Just conj onto it, or update-in, or whatever. You can build up your tree incrementally this way. With records, you're probably stuck with constructor calls, or else you have to write a ton of wrapper functions for the constructors.
Seems like this partly boils down to an argument of dynamic vs. static. Personally, I would go the dynamic (vector + keyword) route unless there was a specific need for the benefits of using records. It's probably easier to code that way, and it's more flexible for the user, at the cost of being easier for the user to end up making a mess. But Clojure users are likely used to having to handle dangerous weapons on a regular basis. Clojure being largely a dynamic language, staying dynamic is often the right thing to do.
This is a good question. I think both are appropriate for different kinds of problems. Nested vectors are a good solution if each node can contain a variable set of information - in particular templating systems are going to work well. Records are a good solution for a smallish number of fixed node types where nesting is far more constrained.
We do a lot of work with heterogeneous trees of records. Each node represents one of a handful of well-known types, each with a different set of known fixed keys. The reason records are better in this case is that you can pick the data out of the node by key which is O(1) (really a Java method call which is very fast), not O(n) (where you have to look through the node contents) and also generally easier to access.
Records in 1.2 are imho not quite "finished" but it's pretty easy to build that stuff yourself. We have a defrecord2 that adds constructor functions (new-foo), field validation, print support, pprint support, tree walk/edit support via zippers, etc.
An example of where we use this is to represent ASTs or execution plans where nodes might be things like Join, Sort, etc.
Vectors are going to be better for creating stuff like strings where an arbitrary number of things can be put in each node. If you can stuff 1+ <p>s inside a <div>, then you can't create a record that contains a :p field - that just doesn't make any sense. That's a case where vectors are far more flexible and idiomatic.