Clojure: immutability and persistence

Every textbook says that Clojure data structures are 'immutable and persistent'. They go to different lengths explaining the concept, but so far I have failed to figure out the difference between immutability and persistence. Is there an entity that is persistent but mutable? Or immutable but not persistent?

Immutable means that a value can't be changed once created. Persistent means that old versions of a structure remain accessible after an update: if the data doesn't exist, it's created; if it does exist, the new data builds on the old version without altering or removing it. Clojure implements this efficiently through structural sharing: a new version copies only the path to the changed value and shares the rest of the structure with the old version.
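To make that concrete, here is a small sketch (the names are arbitrary) showing that "updating" a persistent vector leaves the old version fully intact:

```clojure
(def v1 [1 2 3])
(def v2 (conj v1 4)) ; builds a new version on top of v1

v1 ;=> [1 2 3]   the old version is untouched
v2 ;=> [1 2 3 4] the new version shares structure with v1
```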
Atoms are persistent but safely mutable.
user> (def +a+ (atom 0))
#'user/+a+
user> @+a+
0
user> (swap! +a+ inc)
1
user> @+a+
1
Transients are mutable but should be made persistent after mutation
user> (def t (transient []))
#'user/t
user> (conj! t 1)
#<TransientVector clojure.lang.PersistentVector$TransientVector#658ee462>
user> (persistent! t)
[1]
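A common pattern is to build a collection with a transient for speed and call persistent! exactly once at the end; the helper name below is invented for the example:

```clojure
;; fast-into is a hypothetical name for this well-known idiom
(defn fast-into [to from]
  (persistent! (reduce conj! (transient to) from)))

(fast-into [] (range 5)) ;=> [0 1 2 3 4]
```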
Understanding Clojure's Persistent Vectors, pt. 1 =>
http://hypirion.com/musings/understanding-persistent-vector-pt-1
Persistent data structure => https://en.wikipedia.org/wiki/Persistent_data_structure
Persistent Data Structures and Managed References =>
http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey

Purely Functional Data Structures by Chris Okasaki refers to an article [1] which appears to contain the original definition of the term persistent:
Ordinary data structures are ephemeral in the sense that making a change to the structure destroys the old version, leaving only the new one. … We call a data structure persistent if it supports access to multiple versions. The structure is partially persistent if all versions can be accessed but only the newest version can be modified, and fully persistent if every version can be both accessed and modified.
[1] James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38(1):86–124, February 1989.

Immutable implies persistent, but persistent does not imply immutable. So you could have something that's persistent but not immutable.
An example of a mutable and persistent data structure is Java's CopyOnWriteArrayList.
Persistence does not imply shared structure, nor does it say anything about performance. Of course, shared structure and good performance are both highly desirable, and are both provided by Clojure's persistent data structures. But it would be quite possible to create something that had no structure sharing and awful performance (see CopyOnWriteArrayList, for example ;-)) but was still persistent.
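You can observe this from Clojure via Java interop: a CopyOnWriteArrayList is mutated in place, but an iterator obtained earlier keeps seeing the version that existed when it was created, so old versions remain accessible in that narrow sense:

```clojure
(def cow (java.util.concurrent.CopyOnWriteArrayList. [1 2 3]))
(def it (.iterator cow)) ; the iterator snapshots the current version
(.add cow 4)             ; mutates the list in place

(vec (iterator-seq it)) ;=> [1 2 3]   the old version, via the iterator
(vec cow)               ;=> [1 2 3 4] the current version
```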

Basically immutable == can't be changed, and persistent == immutable, with shared structure.
If I have a language where arrays can't be changed, then arrays are immutable. To "change" the array, I must create a new array and copy every element (except the one(s) to be changed) into the new array. This makes any update O(n), where n is the number of elements in the array. This is obviously inefficient for large n.
On the other hand, if I use a persistent data structure instead of an array, then instead of copying every element every time the data structure is "altered", the new version shares most of the same structure with the old one.
The details depend on the structure, but often there is a tree involved. If the tree is balanced, replacing an element means creating new copies of the nodes along the path from the root to the leaf containing the element; the rest of the nodes are shared with the original version. The length of this path is O(log n). Since each node is O(1) in size, the entire operation takes O(log n) time and extra space.
Note that not all persistent structures support the same operations efficiently. In Clojure, for example, Lists are singly-linked lists. You can efficiently add and remove elements to/from the front, but that's about it. Vectors, on the other hand, allow you to efficiently get any element and add/remove elements at the back.
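A quick sketch of that asymmetry: conj adds wherever the structure is efficient, which is the front for lists and the back for vectors:

```clojure
(conj '(1 2 3) 0) ;=> (0 1 2 3)  lists add at the front
(conj [1 2 3] 4)  ;=> [1 2 3 4]  vectors add at the back
(peek [1 2 3])    ;=> 3          efficient access at the vector's back
(nth [1 2 3] 1)   ;=> 2          efficient indexed lookup
```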

Related

How to get older versions for clojure data structures

Clojure uses persistent data structures; is there a way to access older versions of a vector or map, since it keeps them internally?
Let's say for a vector: since Clojure does not copy the full structure but keeps it in a tree internally (see https://hypirion.com/musings/understanding-persistent-vector-pt-1) and retains the older structure's values, is there a way to use this for scenarios like undo/redo or replay? Datomic uses the same principle to retrieve older versions of data, so I'm asking if this is possible in Clojure.
I am not sure that I understood the question, but just keep a reference on the old structure.
(def my-old-map {:a 1, :b 2, :c 3})
(def my-new-map (assoc my-old-map :b 7))
Any version of a persistent data structure survives so long as there is a live reference to it. Thereafter, it is subject to garbage collection.
The Clojure persistent vectors and maps are like copy-on-write file systems such as Btrfs, both in concept and in the sort of internal data structures they employ to create the illusion that each version of an entity is quite distinct.
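As a sketch (all names here are invented for the example), an undo stack falls out naturally from simply retaining each previous version:

```clojure
;; history holds every version, newest first
(def history (atom (list {})))

(defn change! [f & args]
  (swap! history (fn [h] (conj h (apply f (first h) args)))))

(defn undo! []
  (swap! history rest))

(change! assoc :a 1)
(first @history) ;=> {:a 1}
(undo!)
(first @history) ;=> {}
```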
As per Kris's comment, using add-watch seems to be the right answer.
David Nolen has described this approach here
https://swannodette.github.io/2013/12/31/time-travel
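A minimal sketch of the add-watch approach (names are arbitrary): record every superseded version of an atom as it changes:

```clojure
(def app-state (atom {:count 0}))
(def history (atom []))

(add-watch app-state :history
  (fn [_key _ref old-state _new-state]
    (swap! history conj old-state))) ; keep each previous version

(swap! app-state update :count inc)
(swap! app-state update :count inc)

@history ;=> [{:count 0} {:count 1}]
```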

Clojure, replacing vars in let bindings causes performance issue?

Let's say there is a function that receives a JSON string containing about 100KB of data. The string gets converted to a map, and then new keys keep being associated onto that map, rebinding the same name in let bindings like below:
(defn myfn
[str]
(let [j (json/read-str str)
j (assoc-in j [:key1 :subkey1] somedata1)
j (assoc-in j [:key2 :subkey2] somedata2)
....
j (assoc-in j [:key100 :subkey100] somedata100)]
... do something ...))
I know that after all those let bindings, j will have all those new keys added. This is just an example. I wonder what happens inside with all those bindings to the same name.
I mean, what happens in memory? Would that copy 100KB 100 times, eating up 100KB * 100 = 10,000KB until it gets out of that function? Or is Clojure smart enough that it actually keeps adding new keys in the same memory space?
If you could also recommend where I should look for in Clojure reference to find an answer to this, that would be really nice.
Clojure uses a data structure called a trie, which is similar to a tree but only stores data at the leaf nodes. Most of Clojure's persistent structures are implemented as tries.
This excellent article really explains things in detail and uses vectors, so I won't rehash it here. I know on S.O. it's preferred to give the content rather than a link, but it's not a topic that can be covered fully in an answer here, and the article does it best, so I'll just link to it.
In short, when a data structure is modified in some way, a new trie is created for the new "version". Instead of copying all the data from the old version to the new one with a single change made, the nodes in the new structure point to existing data; the article includes a visualization of this data sharing.
So, using this structure, we have shared data, but since it is only a binary trie, it can get deep very quickly, so lookups could take a long time (for a vector of 1 billion elements, the depth to a leaf node is log2(10^9), which is about 30). To get around this, Clojure uses a 32-way branching factor instead of a 2-way one, yielding very shallow trees. The same vector holding 1 billion elements in Clojure takes only log32(10^9), or 6 levels of indirection, to reach the leaves.
I encourage you to read the article, and also have a look at PersistentHashMap, where you will see references to shift + 5 in several places. This is a clever way to use bit-shifting to index into the trie (log2(32) = 5). See the second part of the article for more in-depth info on this.
To summarize, Clojure uses efficient data structures to achieve persistence, and any language that features immutability as a core feature must do this if it hopes to achieve usable performance.
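You can observe the sharing directly with identical?, which compares object references. In this sketch (keys and values are made up), a branch untouched by assoc-in is the very same object in both versions, so the 100 rebindings in the question create new tree paths rather than 100 copies of the whole map:

```clojure
(def j  {:key1 {:subkey1 :old}
         :big  {:payload (vec (range 1000))}})
(def j2 (assoc-in j [:key1 :subkey1] :new))

(identical? (:big j) (:big j2)) ;=> true  the untouched branch is shared
(get-in j2 [:key1 :subkey1])    ;=> :new
(get-in j  [:key1 :subkey1])    ;=> :old  the old version is intact
```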

What is the difference between clojure's APersistentMap implementations

I'm trying to figure out what the difference is between a PersistentHashMap, PersistentArrayMap, PersistentTreeMap, and PersistentStructMap.
Also, if I use {:a 1} it gives me a PersistentArrayMap, but can this change to any of the other ones if I give it objects or things other than keywords?
The four implementations you list fall into three groups:
"literal": PersistentArrayMap and PersistentHashMap: basic map types used when dealing with map literals (though constructor functions are also available with different behaviour around handling duplicate keys -- in Clojure 1.5.x literals throw exceptions when they discover duplicate keys, constructor functions work like left-to-right repeated conjing; this behaviour has been evolving from version to version). Array maps get promoted to hash maps when growing beyond a certain number of entries (9 IIRC). Array maps exist because they are faster for small maps; they also differ from hash maps in that they keep entries in insertion order prior to promotion to hash map (you can use clojure.core/array-map to get arbitrarily large array maps, which may be useful if you really know you'd benefit from insertion-order traversals and the map won't be too large, perhaps just a bit over the usual threshold; NB. a subsequent assoc to such an oversized array map will return a hash map). Array maps use arrays with keys and values interleaved; the PHM uses a persistent version of Phil Bagwell's hash array mapped trie with separate chaining for hash collisions and separate node types for mostly-empty and at-least-half-full nodes and is easily the most complex data structure in Clojure.
sorted: PersistentTreeMap instances are created by special request only (a call to sorted-map or sorted-map-by). They are implemented as red-black trees and maintain entries in a particular order, as specified by the default compare comparator if created with sorted-map or a user-supplied comparator if created with sorted-map-by.
special-purpose, probably deprecated: PersistentStructMap is not used very often and mostly viewed as deprecated in favour of records, although I actually can't remember right now if there ever was an official deprecation notice. The original purpose was to provide maps with particularly fast access to certain often-used keys. This can now be accomplished with records when using keywords for field access (with the keyword in the operator position: (:foo instance-of-some-record-with-field-foo)), though it's important to note that records are not = to regular maps with the same entries.
All these four built-in map types fall into the same "equality partition", that is, any two maps of one of the four classes mentioned above will be equal if (and only if) they contain the same keys (as determined by Clojure's =) with the same corresponding values. Records, as mentioned in 3. above, are map-like, but each record type forms its own equality partition.
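A short REPL sketch of these equality partitions:

```clojure
(= {:a 1} (hash-map :a 1))   ;=> true  built-in map types compare equal
(= {:a 1} (sorted-map :a 1)) ;=> true

(defrecord Point [x y])
(= (->Point 1 2) {:x 1 :y 2})           ;=> false  records are their own partition
(= {:x 1 :y 2} (into {} (->Point 1 2))) ;=> true   after conversion to a plain map
```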
They are different implementations of a persistent map (they all extend APersistentMap). A PersistentArrayMap uses an array as the underlying data structure to implement the persistent map, and similarly the other implementations use different underlying data structures.
The reason for the different implementations is that each provides different benefits in different situations (the efficiency of the implementation depends on the underlying data structure).
From a developer perspective, this is abstracted away, so you should not use the concrete classes directly; instead, work with the APersistentMap abstract class or the IPersistentMap interface (in case some type checking is required for a specific case).
Depending on the number of elements in the map the various implementations are used.
(type (into {} (map #(-> [% %]) (range 5))))
=> PersistentArrayMap
(type (into {} (map #(-> [% %]) (range 10))))
=> PersistentHashMap

Clojure states within states within states

I'd love to hear what advice the Clojure gurus here have about managing state in hierarchies. I find I'm often using {:structures {:like {:this {:with {:many 'levels}}}}}, and if I want to track changes in state at multiple levels by wrapping values in atoms ((atom {:like (atom 'this)})), I find myself thinking this must be wrong. Is it generally better to use just one atom at the top level, and have none as values in a map?
Don't use nested atoms in a data structure if at all possible.
The main reason is that immutability is your friend. Clojure is a functional language that thrives on immutable data structures. Most libraries assume immutable data structures. Clojure's STM assumes immutable data structures to get the best possible concurrency. Immutability gives you the opportunity to take consistent snapshots of the entire state at any one instant. Pure functions that operate on immutable data are easy to develop and test.
If you put atoms inside your data structures then you lose all the advantages of immutability and risk making your code very complex - it's a lot harder to reason about a data structure if it contains a lot of mutable components.
Some suggested alternative approaches:
Put your entire data structure in a single ref or atom. This can be a huge data structure with no problem - I once wrote a game where the entire game map was held in a single atom without any difficulty.
Use the various methods that are designed for accessing and changing nested immutable data structures: assoc-in, get-in, update-in etc.
Use recursive functions to make navigating your data structure more manageable. If one node of your structure has sub-nodes of the same "type", it's usually a good hint that you should be using some form of recursive function.
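Putting those suggestions together, a sketch with one top-level atom and the nested-update functions (all names invented for the example):

```clojure
(def world (atom {:player {:pos {:x 0 :y 0} :hp 100}}))

(swap! world update-in [:player :pos :x] inc) ; one atomic update of nested state
(swap! world assoc-in  [:player :hp] 95)

(get-in @world [:player :pos]) ;=> {:x 1 :y 0}
(get-in @world [:player :hp])  ;=> 95
```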
You can use the assoc-in, get-in, and update-in functions to work with nested structures (note that there is no dissoc-in in clojure.core, though one is available in libraries such as core.incubator).
They are very convenient, but I don't know if they can handle atoms and such directly. In the worst case you should be able to combine them with deref, e.g.:
(def m (atom {:like {:this {:nested (atom {:value 5})}}}))
@(get-in @m [:like :this :nested])
; => {:value 5}
(get-in @(get-in @m [:like :this :nested]) [:value])
; => 5
You can use -> to make this more readable:
(-> @m
    (get-in [:like :this :nested])
    deref
    (get-in [:value]))
; => 5
Regarding nested atoms/refs/agents, etc. I think it depends on what you're trying to achieve. It's certainly easier to reason about things, if there's just one of them at the top and the changes are synchronized.
On the other hand, if you don't need this synchronization, you're wasting time in doing it, and you'll be better off with nested atoms/refs/agents.
The bottom line is, I don't think either way is "the right way", they both have their usages.
I would prefer to use one atom at the top level, as that keeps things simple and also indicates that the data represents a state which is modified as a whole by an operation. If you put atoms at each level, it becomes far too complex to figure out what is going on. Also, if the nesting is getting very deep, I would suggest you sit back and think carefully about whether you need such a structure or whether a better alternative is possible, because deep nesting will certainly lead to complexity unless the nested data is recursive (i.e. the same structure at each level).

Erlang persistent data structures

As I understand it, when you create a new list with an expression like the following, Erlang doesn't copy L1, it just copies H.
L2 = [H|L1]
Does Erlang have a persistent data structure (see Persistent data structure) for dict, that is, one where adding/removing/modifying nodes in the tree copies only a few elements (like in Clojure)?
You have misunderstood the situation when you build a list using [H|T]. It is as you say that T is not copied but neither is H. All that happens is that a new list cell is prepended to T with a reference to H as its head (its tail is T). When working with lists the only bits which are created are the actual list cells and never the data in each cell.
The same happens when working with dict. When you modify (add/delete elements) in the dict only the actual dict structure is modified and not the actual data in the dict. Also it is smart so as to only copy as little of the dict structure as is necessary to make the modification.
So, yes, Erlang has persistent data structures. In that respect Clojure is like Erlang (we were around long before it).
In my experience, the data structures in these library modules do not degrade in performance or memory usage as they get larger.
For a dict, it uses a dynamic hash table as internal data structure and work is done essentially only on the bucket where the modification is done.
I also looked in the gb_trees module where I found the comment:
Behaviour is logaritmic (as it should be).
And gb_trees are generally pretty fast, so I'm quite sure not much copying is going on.
Generally, if you implement data structures like these in a language like Erlang, you take care of copying issues, so there is no need to worry about it for the general library functions.
I reread the article about persistent data structures: in the sense of this article, Erlang's data structures are fully persistent and also confluently persistent.