Let's say there is a function that receives a JSON string containing about 100KB of data. The string gets converted to a map, and then the function keeps associating new keys into that map, rebinding the same name in let bindings like below:
(defn myfn
  [str]
  (let [j (json/read-str str)
        j (assoc-in j [:key1 :subkey1] somedata1)
        j (assoc-in j [:key2 :subkey2] somedata2)
        ....
        j (assoc-in j [:key100 :subkey100] somedata100)]
    ... do something ...))
I know that after all those let bindings, j will have all the new keys added. This is just an example; I wonder what happens inside all those bindings to the same name.
I mean, what happens in memory? Would that copy the 100KB a hundred times, eating up 100KB * 100 = 10,000KB until execution leaves the function? Or is Clojure smart enough to keep adding the new keys in the same memory space?
If you could also recommend where in the Clojure reference I should look for an answer, that would be really nice.
Clojure uses a data structure called a trie, which is similar to a tree but only stores data at the leaf nodes. Most of Clojure's persistent structures are implemented as tries.
This excellent article really explains things in detail, using vectors, so I won't rehash it here. I know that on S.O. it's preferred to give the content rather than a link, but this isn't a topic that can be covered fully in an answer here, and the article does it best, so I'll just link to it.
In short, when a data structure is modified in some way, a new trie is created for the new "version"; instead of copying all the data over from the old structure to a new one with one change made, the nodes in the new structure point at the existing data. The article includes a visualization showing exactly this data sharing.
So, using this structure, we have shared data, but since it is only a binary trie it can get deep very quickly, so lookups could take a long time: for a vector of 1 billion elements, the depth to reach a leaf node is log2(1e9), which is about 30. To get around this, Clojure uses a 32-way branching factor instead of a 2-way one, yielding very shallow trees. So, the same vector that holds 1 billion elements in Clojure takes only log32(1e9), or 6 levels of indirection, to reach the leaves.
I encourage you to read the article, and also have a look at PersistentHashMap, and you will see references to shift + 5 in several places. This is a clever way to use bit-shifting to index into the trie (log2(32) = 5). See the second part of the article for more in-depth info on this.
To summarize, Clojure uses efficient data structures to achieve persistence, and any language which features immutability as a core feature must do this if it hopes to achieve usable performance.
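You can watch the sharing happen at the REPL with identical?, which compares object references (a minimal sketch; the nested map stands in for your 100KB of data):

(def m  {:a {:big "pretend this is 100KB"} :b 2})
(def m2 (assoc m :c 3))

;; assoc copied only the top-level spine of the map;
;; the nested value under :a is the very same object in both versions
(identical? (:a m) (:a m2))
;; => true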
In Python if I want to customize the way to define how to find the size of an object I define a __len__ method for the class, which alters the behavior of the len function. Is there any way to do this in Clojure with the count function? If so, how?
This is a reasonable question to ask when you are moving from one paradigm to another, i.e. OO to functional, but it is likely not an issue. In languages like Clojure, you will normally start by working out how to represent your data in one of the existing data structures (collections) rather than defining a new structure to represent your data. The idea is to have only a few different data structures with a large number of well defined and understood functions which operate on those structures in an efficient and reliable manner. As a consequence, once you have decided on how to represent your graphs, you will likely find that something like counting the number of vertices is trivial - you will probably just need reduce with a very simple function that in turn uses other core functions, such as the existing count (see the sketch below). A key to becoming proficient in Clojure is learning what the core functions are and what they do.
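For example, with a graph represented as a plain adjacency map (a sketch; the shape of g is made up for illustration), both counts fall out of the core functions:

;; a directed graph as a map of node -> vector of neighbours
(def g {:a [:b :c] :b [:c] :c []})

(count g)                        ;; number of vertices => 3
(reduce + (map count (vals g)))  ;; number of edges    => 3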
In an OO language, you encapsulate your data and hide it inside an object. This means you now need to define the methods to operate on the data inside that object. To support polymorphism, you will frequently do things like add an object-specific size or length method. In Clojure, the underlying data structure is not hidden and is frequently built on one of the standard collection types, so you don't need to write a size/length function; you can use the standard count function. When the standard collections are not suitable and you need more, then you tend to look at things like protocols, where you can define your own specialised functions, e.g. size (see the sketch below).
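A minimal sketch of that last point (the Sized protocol and Graph record here are hypothetical, purely for illustration):

(defprotocol Sized
  (size [this] "A domain-specific notion of size."))

(defrecord Graph [nodes edges]
  Sized
  ;; for this record, "size" means the number of nodes
  (size [_] (count nodes)))

(size (->Graph #{:a :b :c} #{[:a :b]}))
;; => 3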
In your case, a protocol or record is unlikely to be necessary - representing graphs is pretty much a natural fit for the standard collections, and I wouldn't be at all surprised if you could re-implement what you did in C or C++ in far fewer lines of Clojure, and likely in a much more declarative and cleaner manner. Start by looking at how the standard Clojure collection types could be used to represent your graphs. Think about how you want to operate on the graphs, and whether you are best representing the graph as nodes or vertices, and then look at how you would answer questions like "How many vertices are in this graph?" using just the available built-in functions.
You do need to think about things differently when you move to a functional paradigm. There will be a point you get to that is a real "aha" moment, as the penny drops. Once it does, you will likely be very surprised how nice it is, but until that moment, you will likely experience a bit of frustration and hair pulling. The battle is worth it: you will likely find even your old imperative programming skills benefit, and when you have to go back to C/C++ or Python, your code will be clearer and more concise. Try to avoid the temptation to reproduce what you did in C/Python in Clojure; instead, focus on the outcome you want to achieve and see how the supplied facilities of the language will help you do that.
Your comment says you are dealing with graphs. Taking on board the good advice to use the standard data structures, let's consider how to represent them.
You would normally represent a graph as a map Node -> Collection of Node. For example,
(def graph {:a [:b], :b [:a :c]})
Then
(count graph)
=> 2
However, if you make sure that every node has a map entry, even the ones that have no afferent (incoming) arcs, then all you have to do is count the graph's map. A function to add the empty entries is:
(defn add-empties
  "Ensure every node that appears as an edge target also has its own entry."
  [gm]
  (if (empty? gm)
    gm
    (let [EMPTY   (-> gm first val empty)  ; empty collection of the same type as the values
          missing (->> gm
                       vals
                       (apply concat)
                       (remove gm))]       ; targets with no entry of their own
      (reduce (fn [acc x] (assoc acc x EMPTY)) gm missing))))
For example,
(add-empties graph)
=> {:a [:b], :b [:a :c], :c []}
and
(count (add-empties graph))
=> 3
What does count mean for a graph?
What should count return for a graph? I can think of two equally obvious options -- the number of nodes in the graph, or the number of edges. So perhaps count isn't very intuitive for a graph.
Implementing Counted
Nevertheless, you certainly can define your own counted data structures in Clojure. The key is to implement the clojure.lang.Counted interface.
Now, if we represent a graph via the following deftype:
(deftype NodeToEdgeGraph [node->neighbors]
...)
we can make it counted:
(deftype NodeToEdgeGraph [node->neighbors]
clojure.lang.Counted
(count [this]
(count node->neighbors))
...)
This is if we are representing a graph as a map that maps each node to its set of "neighbors" (where a node is considered a "neighbor" if, and only if, there is an edge between the two), and we want count to return the number of nodes.
Alternatively, we can represent a graph as a set of pairs (either ordered, in the case of a directed graph; or unordered, in the case of an undirected graph). In this case, we have:
(deftype EdgeGraph [edges]
...)
And we can have count return the number of edges in the graph:
(deftype EdgeGraph [edges]
clojure.lang.Counted
(count [this]
(count edges))
...)
So far, we have been using count on the underlying structure to implement count for the graph. This works because the underlying structure conveniently has the same count as the way we are counting each respective graph representation. However, there's no reason we couldn't use either representation with either way of counting. Also, there are other ways of representing graphs that might not align so nicely with the way we want to count. In these cases, we can always maintain our own internal count:
(deftype ???Graph [cnt ???]
clojure.lang.Counted
(count [this]
cnt)
...)
Here, we rely on the implementation of the graph operations to maintain cnt appropriately.
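For instance, a sketch of keeping cnt in sync (CountedGraph and add-node are hypothetical names, not from any library):

(deftype CountedGraph [cnt nodes]
  clojure.lang.Counted
  (count [_] (int cnt)))

(defn add-node
  "Returns a graph that contains node n, keeping cnt accurate."
  [^CountedGraph g n]
  (if (contains? (.nodes g) n)
    g
    (CountedGraph. (inc (.cnt g)) (conj (.nodes g) n))))

(count (add-node (CountedGraph. 0 #{}) :a))
;; => 1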
Built-in or Custom Types?
Others have suggested using only the built-in data structures to represent your graphs. Indeed, if you take a look at NodeToEdgeGraph and EdgeGraph, you can see that count simply delegates to the underlying structure. So for these representations (and definitions of count), you could simply eliminate the deftype and implement your graph operations directly over maps / sets. However, let's take a look at some advantages to encapsulating graphs in a custom type:
Ease of implementing alternative representations. With custom types, you can have an unlimited number of graph representations. All you need to do is define some protocols and define graph operations in terms of those protocols. While you can also extend protocols to built-in types, you are limited to only one representation / implementation per built-in type. This might or might not be sufficient for you.
Maintenance of internal invariants. With custom types, you have better control over the internal structure. That is, you can implement protocols / interfaces in a way that maintains any necessary internal invariants. For example, if you are implementing undirected graphs, you can implement NodeToEdgeGraph in a way that ensures adding an edge from node A to node B also adds an edge from node B to node A, and likewise with removing an edge. With built-in types, these invariants can only be maintained by the graph functions you implement. The Clojure core functions will not maintain these invariants for you, because they know nothing about them. So if you hand your "graphs" (over built-in types) off to some function that calls non-graph functions on them, you have no assurance that you will get a valid graph back. On the other hand, custom types only allow the operations you implement, and will perform them the way you implement them. As long as you take care that all the operations you implement maintain the proper invariants, you can rest assured that those invariants will always hold.
On the other hand, sometimes it is appropriate to simply use the built-in types. For instance, if your application only makes light use of graph operations, it might be more convenient to use the built-in types; if your application is heavily graph-oriented and performs a lot of graph operations, it's probably better in the long run to implement a custom type.
Of course, you are not forced to choose one over the other. With protocols, you can implement graph operations for both built-in types as well as custom types. That way, you can choose between "quick and dirty" and "heavy but robust" depending on your application. Here "heavy" just means it might take a little more work to use graphs with functions implemented over Clojure collection interfaces. In other words, you might need to convert graphs to other types in some cases. This depends heavily on how much effort you put into implementing these interfaces for your custom graph types (and to the extent they make sense for graphs).
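A sketch of that dual approach (the GraphOps protocol here is made up for illustration):

(defprotocol GraphOps
  (nodes [g])
  (neighbors [g n]))

;; "quick and dirty": plain maps act as graphs directly
(extend-protocol GraphOps
  clojure.lang.IPersistentMap
  (nodes [g] (set (keys g)))
  (neighbors [g n] (get g n)))

(nodes {:a [:b] :b []})
;; => #{:a :b}

A custom type such as EdgeGraph above could implement the same protocol, so callers never need to know which representation they hold.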
By the way, you cannot override that function, either with with-redefs or with any related functionality. There is a hidden trick here: if you check the source code of the count function, you'll see an interesting :inline attribute in its metadata:
(defn count
  "Returns the number of items in the collection. (count nil) returns
  0. Also works on strings, arrays, and Java Collections and Maps"
  {:inline (fn [x] `(. clojure.lang.RT (count ~x)))
   :added "1.0"}
  [coll] (clojure.lang.RT/count coll))
This attribute means the function's body will be inserted as-is into the resulting bytecode at each call site, without calling the original function. Most of the general core functions have such an attribute. That's why you cannot override count, get, int, nth and so forth.
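You can see the effect at the REPL (a toy sketch only; don't rebind count in real code, as other threads rely on it). A direct call was inlined at compile time, so the rebinding is invisible to it, whereas a call through the var picks up the new value:

(with-redefs [count (constantly 42)]
  [(count [1 2 3])             ;; inlined to clojure.lang.RT/count => 3
   (apply count [[1 2 3]])])   ;; goes through the var             => 42
;; => [3 42]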
Sorry - not for existing collection types you don't control.
You can define your own count that is aware of your needs and use that in your code, though unfortunately Clojure does not use a universal protocol for counting, so there is nowhere for you to attach an implementation that would extend count to an existing collection.
If Counted were a protocol rather than an interface this would be easier, though it long predates protocols in the evolution of the language.
If you are making your own collection type, then you can of course implement count any way you want.
Every textbook says that Clojure data structures are 'immutable and persistent'. They go different lengths explaining the concept, but so far I failed to figure out what is the difference between immutability and persistence. Is there an entity persistent but mutable? or immutable but not persistent?
Immutable means that a value can't be changed. Persistent means that old versions of a structure remain available after it is "changed"; Clojure implements this with structural sharing. If the data doesn't exist, it's created; if it does exist, the new data builds on the old version of the data without altering or removing it.
Atoms are persistent but safely mutable:
user> (def +a+ (atom 0))
#'user/+a+
user> @+a+
0
user> (swap! +a+ inc)
1
user> @+a+
1
Transients are mutable but should be made persistent after mutation
user> (def t (transient []))
#'user/t
user> (conj! t 1)
#<TransientVector clojure.lang.PersistentVector$TransientVector@658ee462>
user> (persistent! t)
[1]
Understanding Clojure's Persistent Vectors, pt. 1 =>
http://hypirion.com/musings/understanding-persistent-vector-pt-1
Persistent data structure => https://en.wikipedia.org/wiki/Persistent_data_structure
Persistent Data Structures and Managed References =>
http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
Purely Functional Data Structures by Chris Okasaki refers to an article [1] which appears to contain the original definition of the term persistent:
Ordinary data structures are ephemeral in the sense that making a change to the structure destroys the old version, leaving only the new one. … We call a data structure persistent if it supports access to multiple versions. The structure is partially persistent if all versions can be accessed but only the newest version can be modified, and fully persistent if every version can be both accessed and modified.
[1] James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38(1):86–124, February 1989.
Immutable implies persistent, but persistent does not imply immutable. So you could have something that's persistent but not immutable.
An example of a mutable and persistent data structure is Java's CopyOnWriteArrayList.
Persistence does not imply shared structure, nor does it say anything about performance. Of course, shared structure and good performance are both highly desirable, and are both provided by Clojure's persistent data structures. But it would be quite possible to create something that had no structure sharing and awful performance (see CopyOnWriteArrayList, for example ;-)) but was still persistent.
Basically immutable == can't be changed, and persistent == immutable, with shared structure.
If I have a language where arrays can't be changed, then arrays are immutable. To "change" the array, I must create a new array and copy every element (except the one(s) to be changed) into the new array. This makes any update O(n), where n is the number of elements in the array. This is obviously inefficient for large n.
On the other hand, if I use a persistent data structure instead of an array, then instead of copying every element every time the data structure is "altered", the new version shares most of the same structure with the old one.
The details depend on the structure, but often there is a tree involved. If the tree is balanced, replacing an element means creating new copies of the nodes along the path from the root to the leaf containing the element. The rest of the nodes are shared with the original version. The length of this path is O(log n). Since the nodes are O(1) size, the entire operation takes O(log n) time and extra space.
Note that not all persistent structures support the same operations efficiently. In Clojure, for example, Lists are singly-linked lists. You can efficiently add and remove elements to/from the front, but that's about it. Vectors, on the other hand, allow you to efficiently get any element and add/remove elements at the back.
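For example, conj reflects this by adding wherever the structure is efficient:

;; lists are efficient at the front
(conj '(2 3) 1)   ;; => (1 2 3)

;; vectors are efficient at the back, plus fast random access
(conj [1 2] 3)    ;; => [1 2 3]
(nth [1 2 3] 1)   ;; => 2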
Functional programming languages often work on immutable data structures but stay efficient by structural sharing. E.g. you work on some map of information, if you insert an element, you will not modify the existing map but create a new updated version. To avoid massive copying and memory usage, the map will share (as good as possible) the unchanged data between both instances.
I would be interested to know whether there exists some template library providing such a map-like data structure for C++. I searched a bit and found nothing, besides internal classes in LLVM.
A copy-on-write B+tree sounds like what you're looking for. It basically creates a new snapshot of itself every time it gets modified, but it shares unmodified leaf nodes between versions. Most of the implementations I've seen tend to be baked into append-only database log files. CouchDB has a very nice write-up on them. They are, however, "relatively easy" to implement, as far as map data structures go.
You can use an ordinary map, but mark every element with a timestamp or "map version number". If you want to remove elements too, use two marks. If you might reinsert removed elements, then you need a list of values and pairs of marks per element (sketched below).
For example, you search for the key "foo", and you find that it had the value 5 in versions 0 to 3 (included), then it was "removed", and then it had the value -8 in versions 9 to current.
This eats a lot of memory and time, though.
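To make the idea concrete, here is a sketch (in Clojure rather than C++, since that's what the rest of this page uses; the representation is invented): each key maps to entries of [from-version to-version value]:

(def vmap {"foo" [[0 3 5] [9 :current -8]]})

(defn lookup
  "Value of key k as of the given version, or nil if absent then."
  [m k version]
  (some (fn [[from to v]]
          (when (and (<= from version)
                     (or (= to :current) (<= version to)))
            v))
        (get m k)))

(lookup vmap "foo" 2)   ;; => 5
(lookup vmap "foo" 5)   ;; => nil (removed in versions 4-8)
(lookup vmap "foo" 10)  ;; => -8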
I'd love to hear what advice the Clojure gurus here have about managing state in hierarchies. I find I'm often using {:structures {:like {:this {:with {:many 'levels}}}}}, and if I want to track changes in state at multiple levels, by throwing atoms around values (atom {:like (atom 'this)}), I find myself thinking this must be wrong. Is it generally better to use just one atom at the top level, and have none as values in a map?
Don't use nested atoms in a data structure if at all possible.
The main reason is that immutability is your friend. Clojure is a functional language that thrives on immutable data structures. Most libraries assume immutable data structures. Clojure's STM assumes immutable data structures to get the best possible concurrency. Immutability gives you the opportunity to take consistent snapshots of the entire state at any one instant. Pure functions that operate on immutable data are easy to develop and test.
If you put atoms inside your data structures then you lose all the advantages of immutability and risk making your code very complex - it's a lot harder to reason about a data structure if it contains a lot of mutable components.
Some suggested alternative approaches:
Put your entire data structure in a single ref or atom. This can be a huge data structure with no problem - I once wrote a game where the entire game map was held in a single atom without any difficulty.
Use the various methods that are designed for accessing and changing nested immutable data structures: assoc-in, get-in, update-in etc.
Use recursive functions to make navigating your data structure more manageable. If one node of your structure has sub-nodes of the same "type", it's usually a good hint that you should be using some form of recursive function.
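Combining the first two suggestions, a minimal sketch (the shape of the state map is made up):

;; the whole application state lives in one atom
(def state (atom {:players {:alice {:score 0}}}))

;; every change is a pure function applied atomically with swap!
(swap! state update-in [:players :alice :score] + 10)

(get-in @state [:players :alice :score])
;; => 10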
You can use assoc-in, get-in, update-in, and dissoc-in functions to work with nested structures.
They are very convenient, but they don't reach through atoms and the like directly. In the worst case you can interleave them with deref, e.g.:
(def m (atom {:like {:this {:nested (atom {:value 5})}}}))
@(get-in @m [:like :this :nested])
; => {:value 5}
(get-in @(get-in @m [:like :this :nested]) [:value])
; => 5
You can use -> to make this more readable:
(-> @m
    (get-in [:like :this :nested])
    deref
    (get-in [:value]))
; => 5
Regarding nested atoms/refs/agents, etc. I think it depends on what you're trying to achieve. It's certainly easier to reason about things, if there's just one of them at the top and the changes are synchronized.
On the other hand, if you don't need this synchronization, you're wasting time in doing it, and you'll be better off with nested atoms/refs/agents.
The bottom line is, I don't think either way is "the right way", they both have their usages.
I would prefer to use one atom at the top level, as that keeps things really simple and also indicates that the data represents a single state which an operation modifies all at once. If you put atoms at each level, it becomes far too complex to figure out what is going on. Also, if in your case the nesting is going very deep, I would suggest you sit back and think carefully about whether you need such a structure, or whether there is a better alternative, because deep nesting will certainly lead to complexity unless the nested data is recursive (i.e. the same structure at each level).
Which is better idiomatic clojure practice for representing a tree made up of different node types:
A. building trees out of several different types of records, that one defines using deftype or defrecord:
(defrecord node_a [left right])
(defrecord node_b [left right])
(defrecord leaf [])
(def my-tree (node_a. (node_b. (leaf.) (leaf.)) (leaf.)))
B. building trees out of vectors, with keywords designating the types:
(def my-tree [:node-a [:node-b :leaf :leaf] :leaf])
Most clojure code that I see seems to favor the usage of the general purpose data structures (vectors, maps, etc.), rather than datatypes or records. Hiccup, to take one example, represents html very nicely using the vector + keyword approach.
When should we prefer one style over the other?
You can put as many elements into a vector as you want. A record has a set number of fields. If you want to constrain your nodes to only have N sub-nodes, records might be good, e.g. when making a binary tree, where a node has to have exactly a left and a right. But for something like HTML or XML, you probably want to support arbitrary numbers of sub-nodes.
Using vectors and keywords means that "extending" the set of supported node types is as simple as putting a new keyword into the vector. [:frob "foo"] is OK in Hiccup even if its author never heard of frobbing. Using records, you'd potentially have to define a new record for every node type. But then you get the benefit of catching typos and verifying subnodes. [:strnog "some bold text?"] isn't going to be caught by Hiccup, but (Strnog. "foo") would be a compile-time error.
Vectors being one of Clojure's basic data types, you can use Clojure's built-in functions to manipulate them. Want to extend your tree? Just conj onto it, or update-in, or whatever. You can build up your tree incrementally this way. With records, you're probably stuck with constructor calls, or else you have to write a ton of wrapper functions for the constructors.
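For instance, the incremental building looks like this (using the my-tree shape from the question):

(def my-tree [:node-a [:node-b :leaf :leaf] :leaf])

;; add another child to the root
(conj my-tree :leaf)
;; => [:node-a [:node-b :leaf :leaf] :leaf :leaf]

;; replace a leaf deep in the tree by its index path
(assoc-in my-tree [1 2] [:node-c :leaf])
;; => [:node-a [:node-b :leaf [:node-c :leaf]] :leaf]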
Seems like this partly boils down to an argument of dynamic vs. static. Personally, I would go the dynamic (vector + keyword) route unless there was a specific need for the benefits of using records. It's probably easier to code that way, and it's more flexible for the user, at the cost of being easier for the user to end up making a mess. But Clojure users are likely used to having to handle dangerous weapons on a regular basis. Clojure being largely a dynamic language, staying dynamic is often the right thing to do.
This is a good question. I think both are appropriate for different kinds of problems. Nested vectors are a good solution if each node can contain a variable set of information - in particular templating systems are going to work well. Records are a good solution for a smallish number of fixed node types where nesting is far more constrained.
We do a lot of work with heterogeneous trees of records. Each node represents one of a handful of well-known types, each with a different set of known fixed keys. The reason records are better in this case is that you can pick the data out of the node by key which is O(1) (really a Java method call which is very fast), not O(n) (where you have to look through the node contents) and also generally easier to access.
Records in 1.2 are imho not quite "finished" but it's pretty easy to build that stuff yourself. We have a defrecord2 that adds constructor functions (new-foo), field validation, print support, pprint support, tree walk/edit support via zippers, etc.
An example of where we use this is to represent ASTs or execution plans where nodes might be things like Join, Sort, etc.
Vectors are going to be better for representing stuff like HTML, where an arbitrary number of things can be put in each node. If you can stuff 1+ <p>s inside a <div>, then you can't create a record that contains a :p field - that just doesn't make any sense. That's a case where vectors are far more flexible and idiomatic.