Erlang persistent data structures - clojure

As I've understood, when you create a new list with expression like the following, Erlang doesn't copy L1, it just copies H.
L2 = [H|L1]
Does Erlang have a persistent data structure (see Persistent data structure) for dict, that is, when you add/remove/modify nodes in the tree only a few elements are copied (like in Clojure)?

You have misunderstood the situation when you build a list using [H|T]. It is as you say that T is not copied but neither is H. All that happens is that a new list cell is prepended to T with a reference to H as its head (its tail is T). When working with lists the only bits which are created are the actual list cells and never the data in each cell.
The same happens when working with dict. When you modify (add/delete elements) in the dict only the actual dict structure is modified and not the actual data in the dict. Also it is smart so as to only copy as little of the dict structure as is necessary to make the modification.
So, yes, Erlang has persistent data structures. In that respect Clojure is like Erlang (we were around long before it).
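To make the cons-cell point concrete, here is a minimal sketch in Python rather than Erlang. The `Cons` class and `cons` function are hypothetical names made up for this illustration, and Python object identity (`is`) stands in for "same term on the heap"; it models the idea, not the actual Erlang runtime.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Cons:
    """One immutable list cell: a reference to a head value and to a tail list."""
    head: Any
    tail: Optional["Cons"]  # None stands for the empty list


def cons(head: Any, tail: Optional[Cons]) -> Cons:
    # Allocates exactly one new cell; neither head nor tail is copied.
    return Cons(head, tail)


big_value = ["some", "large", "term"]
l1 = cons(2, cons(3, None))
l2 = cons(big_value, l1)     # analogous to L2 = [H|L1]

print(l2.tail is l1)         # the old list is the tail of the new one
print(l2.head is big_value)  # the head is referenced, not copied
```

Both checks hold because the only thing created by `cons` is the new cell itself.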

In my experience, the data structures in the library modules do not degrade in performance or memory pressure when they get larger.
For a dict, it uses a dynamic hash table as internal data structure and work is done essentially only on the bucket where the modification is done.
I also looked in the gb_trees module where I found the comment:
Behaviour is logaritmic (as it should be).
And gb_trees are generally pretty fast, so I'm quite sure not much copying is going on.
Generally, if you implement data structures like these in a language like Erlang, you take care of copying issues, so there is no need to worry about it for the general library functions.
I reread the article about persistent data structures: in the sense of this article, Erlang's data structures are fully persistent and also confluently persistent.
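As a rough illustration of why logarithmic behaviour implies little copying, here is a sketch of path copying in an immutable binary search tree, written in Python. It is not how gb_trees is implemented (it does no balancing, and `Node` and `insert` are names invented here); it only shows that an insert rebuilds the nodes on the search path and shares every other subtree with the old version.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"]
    right: Optional["Node"]


def insert(node: Optional[Node], key: int, value: str) -> Node:
    """Return a new tree containing key; only nodes on the search path are rebuilt."""
    if node is None:
        return Node(key, value, None, None)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)  # same key: replace the value only


t1 = None
for k in (50, 30, 70, 20, 40):
    t1 = insert(t1, k, f"v{k}")

t2 = insert(t1, 60, "v60")       # new version; t1 is still intact

print(t2.left is t1.left)        # untouched left subtree is shared
print(t1.right.left is None)     # the old version does not see key 60
```

In a balanced tree the search path has O(log n) nodes, so that is also the number of nodes allocated per update.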

Related

Does immutability mean that huge collections get completely recreated every time they change?

I've been trying to get into functional programming lately. Specifically I've been interested in Clojure. I understand most of the arguments for immutability of data, but one thing just doesn't make sense to me. Suppose I have a very large collection like a map or an array containing hundreds of millions of items. If I can't change it, then does that mean that every time I want to add a new item I'm essentially recreating the entire collection? That sounds like a horrible idea. How is a situation like this handled in languages where everything is immutable?
In the case of Clojure, they are not recreated each time. Rich Hickey made sure to utilize persistent data structures, as he explains in an "Expert to Expert" interview.
You can think of it more like a linked list.
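class LinkedList<E> {
    final E data;
    final LinkedList<E> next;
    LinkedList(E data, LinkedList<E> next) {
        this.data = data;
        this.next = next;
    }
    // Returns a new list whose head is item and whose tail is this list (shared, not copied).
    LinkedList<E> cons(E item) {
        return new LinkedList<>(item, this);
    }
}
Each time you cons a value onto the front, you do not need to create an entirely new linked list; the new cell merely keeps a reference to the previous list.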
Similarly, even data structures like maps can be made to share large amounts of their data when they only differ by minor amounts. This helps reduce the size used.
Additionally, Clojure also provides transient data structures, which are mutable versions of its immutable collections, allowing you to make multiple changes to a data structure without the expense of copying/sharing data.

How to 'mark' a node in a Clojure data structure?

I have
a Clojure data structure, let's call it dom, a tree of vectors and maps of indefinite depth;
a particular node in it, let's call it the focus node, referred to as a path into the tree: a sequence of keys such as you could present to get-in.
I will be deciding on the focussed node in one function and I want to somehow represent that choice of focussed node in a way that can be passed to another function in a way that does not violate immutability and is not in conflict with Clojure's persistent data structures.
When I traverse the tree, I want to treat the focus node differently: for example, if I was printing the tree, I might want to print the focus node in bold.
If I were using C or Java, I could save a pointer/reference to the focus node, which I could compare with the current node as I traversed the tree. I don't think that's the right way to do it in Clojure: it feels hacky, and I'm sure there's some way to do it that takes advantage of Clojure's persistent data structures.
The solution has to work in Clojure and ClojureScript.
The options I can think of are:
(1) Store a reference and check against that.
(2) Attach a marker to the node in question.
(3) Simultaneously recurse into the tree and along the path to the marked node.
Option (1) is unattractive, as I've explained.
Option (2) seems best, and painless given persistent data structures.
Option (3) is similar to option (2), except that it combines the marking and traversing steps.
I'm sure this is a common problem. Is there a standard solution to it?
I suggest you reconsider @MerceloMorales's suggestion: to use metadata. Your node object is to have an accidental attribute that doesn't affect its normal functions. That is what metadata is designed for. And it works in ClojureScript. The only reason I can think of for not using metadata is that the node value is not a Clojure object, but is, for example, a number.
In The Clojure Cookbook, 2.22. Keeping Multiple Values for a Key, Luke Vanderhart uses metadata to solve a similar problem: marking entries that need to be interpreted as collections rather than single values.
Another approach might be to use a zipper to traverse/modify the node tree. Zippers are implemented in terms of - you've guessed it - metadata.
I share your misgivings about metadata: it feels queasy to attach just any old stuff to your data - like infecting it with a parasite. However, it's just as immutable a part of the object as any other.
The suggestion to use zippers is naive: The standard clojure zippers are designed for a hierarchy of sequential containers, not associative ones.
See Brandon Bloom's Dendrology talk for some great overview on questions like this.
I believe the ease of "marking" or otherwise updating tree structured data underlies his strong recommendation to always represent nodes as nested maps rather than vectors (or a mixture of vectors and maps). A mark based on a path described by a vector of keys is then as simple as:
(update-in tree-data path assoc :is-focussed true)
Your original data structure is unchanged and the new one returned by update-in shares everything structurally with the original except the updated node which is now easily tested for the :is-focussed property upon traversal.
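For readers who don't know Clojure, the effect of that update-in call can be sketched in Python. The helper `update_in` below is a hypothetical stand-in written for this illustration, with Python dicts and lists modelling maps and vectors. One caveat: a shallow copy of a Python container costs time proportional to its size, whereas Clojure's collections avoid even that, so the sketch only demonstrates which parts are shared and which are rebuilt.

```python
from typing import Any, Sequence


def update_in(tree: Any, path: Sequence[Any], key: Any, value: Any) -> Any:
    """Return a copy of tree with key=value set on the map found at path.

    Only the containers along path are shallow-copied; everything else is shared.
    """
    if not path:
        return {**tree, key: value}
    step, rest = path[0], path[1:]
    child = update_in(tree[step], rest, key, value)
    if isinstance(tree, dict):
        return {**tree, step: child}
    copy = list(tree)  # vectors are modelled as Python lists
    copy[step] = child
    return copy


dom = {"body": [{"tag": "p", "text": "one"}, {"tag": "p", "text": "two"}],
       "head": {"title": "demo"}}
marked = update_in(dom, ["body", 1], "is-focussed", True)

print("is-focussed" in dom["body"][1])      # original is untouched
print(marked["head"] is dom["head"])        # sibling subtree is shared
print(marked["body"][0] is dom["body"][0])  # sibling node is shared
```

While traversing `marked`, the focus node is simply the one whose map contains the `"is-focussed"` key.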

C++ - Map-like data structure with structural sharing/immutability

Functional programming languages often work on immutable data structures but stay efficient through structural sharing. For example, when you work on some map of information and insert an element, you do not modify the existing map but create a new, updated version. To avoid massive copying and memory usage, the map shares as much of the unchanged data as possible between both instances.
I would be interested to know whether there is a template library providing such a map-like data structure for C++. I searched a bit and found nothing besides internal classes in LLVM.
A copy-on-write B+tree sounds like what you're looking for. It basically creates a new snapshot of itself every time it gets modified, but it shares unmodified leaf nodes between versions. Most of the implementations I've seen tend to be baked into append-only database log files. CouchDB has a very nice write-up on them. They are, however, "relatively easy" to implement, as far as map data structures go.
You can use an ordinary map, but mark every element with a timestamp or "map version number". If you want to remove elements too, use two marks. If you might reinsert removed elements, then you need a list of values and pairs of marks per element.
For example, you search for the key "foo", and you find that it had the value 5 in versions 0 to 3 (included), then it was "removed", and then it had the value -8 in versions 9 to current.
This eats a lot of memory and time, though.
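A minimal sketch of that version-marking idea, in Python rather than C++ to keep it short. `VersionedMap` and its methods are hypothetical names. Instead of storing a pair of marks per value, each key keeps a list of (starting version, value) entries, and a removal is recorded as an entry holding a sentinel; the end of one interval is implied by the start of the next.

```python
from typing import Any, Dict, List, Optional, Tuple

_MISSING = object()  # sentinel meaning "key is absent from this version on"


class VersionedMap:
    """Ordinary dict whose entries keep a history of (first_version, value) marks."""

    def __init__(self) -> None:
        self.version = 0
        # key -> list of (version at which this state began, value or _MISSING)
        self._history: Dict[Any, List[Tuple[int, Any]]] = {}

    def put(self, key: Any, value: Any) -> int:
        self.version += 1
        self._history.setdefault(key, []).append((self.version, value))
        return self.version

    def remove(self, key: Any) -> int:
        return self.put(key, _MISSING)

    def get(self, key: Any, version: Optional[int] = None, default: Any = None) -> Any:
        if version is None:
            version = self.version
        result = _MISSING
        # Linear scan for clarity; a binary search would also work here.
        for since, value in self._history.get(key, []):
            if since > version:
                break
            result = value
        return default if result is _MISSING else result


m = VersionedMap()
v1 = m.put("foo", 5)
v2 = m.put("bar", 1)
v3 = m.remove("foo")
v4 = m.put("foo", -8)

print(m.get("foo", v2))  # value as of version v2
print(m.get("foo", v3))  # removed at v3, so the default is returned
print(m.get("foo"))      # current value
```

The cost mentioned above is visible here: nothing is ever discarded, and every lookup has to search the key's history.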

Scheme: Constant Access to the End of a List?

In C, you can have a pointer to the first and last element of a singly-linked list, providing constant time access to the end of a list. Thus, appending one list to another can be done in constant time.
As far as I am aware, Scheme does not provide this functionality (namely constant access to the end of a list) by default. To be clear, I am not looking for "pointer" functionality. I understand that is non-idiomatic in Scheme and (as I suppose) unnecessary.
Could someone either 1) demonstrate the ability to provide a way to append two lists in constant time or 2) assure me that this is already available by default in scheme or racket (e.g., tell me that append is in fact a constant operation if I am wrong to think otherwise)?
EDIT:
I should make myself clearer. I am trying to create an inspectable queue. I want to have a list that I can 1) push onto the front in constant time, 2) pop off the back in constant time, and 3) iterate over using Racket's foldr or something similar (a Lisp right fold).
Standard Lisp lists cannot be appended to in constant time.
However, if you make your own list type, you can do it. Basically, you can use a record type (or just a cons cell)---let's call this the "header"---that holds pointers to the head and tail of the list, and update it each time someone adds to the list.
However, be aware that if you do that, lists are no longer structurally inductive. i.e., a longer list isn't simply an extension of a shorter list, because of the extra "header" involved. Thus, you lose a great part of the simplicity of Lisp algorithms which involve recursing into the cdr of a list at each iteration.
In other words, the lack of easy appending is a tradeoff to enable recursive algorithms to be written much more easily. Most functional programmers will agree that this is the right tradeoff, since appending in a pure-functional sense means that you have to copy every cell in all but the last list---so it's no longer O(1), anyway.
ETA to reflect OP's edit
You can create a queue, but with the opposite behaviour: you add elements to the back, and retrieve elements in the front. If you are willing to work with that, such a data structure is easy to implement in Scheme. (And yes, it's easy to append two such queues in constant time.)
Racket also has a similar queue data structure, but it uses a record type instead of cons cells, because Racket cons cells are immutable. You can convert your queue to a list using queue->list (at O(n) complexity) for times when you need to fold.
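The "header" idea can be sketched as follows, in Python rather than Scheme. The `Queue` and `_Cell` classes are hypothetical names for this illustration; the header is the `Queue` object, which holds references to the first and last cells. Note that `append_queue` is destructive and assumes `other` is a different queue from `self`.

```python
class _Cell:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


class Queue:
    """Mutable singly-linked queue with a header holding first and last cells."""

    def __init__(self):
        self.first = None
        self.last = None

    def enqueue(self, value):
        """Add to the back in O(1)."""
        cell = _Cell(value)
        if self.last is None:
            self.first = cell
        else:
            self.last.next = cell
        self.last = cell

    def dequeue(self):
        """Remove from the front in O(1)."""
        if self.first is None:
            raise IndexError("dequeue from empty queue")
        cell = self.first
        self.first = cell.next
        if self.first is None:
            self.last = None
        return cell.value

    def append_queue(self, other):
        """Splice other onto the end in O(1); other is emptied (destructive)."""
        if other.first is None:
            return
        if self.last is None:
            self.first = other.first
        else:
            self.last.next = other.first
        self.last = other.last
        other.first = other.last = None

    def to_list(self):
        """O(n) conversion, for when you need to fold over the contents."""
        out, cell = [], self.first
        while cell is not None:
            out.append(cell.value)
            cell = cell.next
        return out


a, b = Queue(), Queue()
for x in (1, 2):
    a.enqueue(x)
for x in (3, 4):
    b.enqueue(x)
a.append_queue(b)
print(a.to_list())
```

The constant-time append works only because the cells are mutated in place, which is exactly the trade-off against structural induction described above.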
You want a FIFO queue. user448810 mentions the standard implementation for a purely-functional FIFO queue.
Your concern about losing the "key advantage of Lisp lists" needs to be unpacked a bit:
You can write combinators for custom data structures in Lisp. If you implement a queue type, you can easily write fold, map, filter and so on for it.
Scheme, however, does lack in the area of providing polymorphic sequence functions that can work on multiple sequence types. You do often end up either (a) converting your data structures back to lists in order to use the rich library of list functions, or (b) implementing your own versions of various of these functions for your custom types.
This is very much a shame, because singly-linked lists, while they are hugely useful for tons of computations, are not a do-all data structure.
But what is worse is that there are a lot of Lisp folk who like to pretend that lists are a "universal datatype" that can and should be used to represent any kind of data. I've programmed Lisp for a living, and oh my god I hate the code that these people produce; I call it "Lisp programmer's disease," and have much too often had to go in and fix a lot of n^2 code that uses lists to represent sets or dictionaries, changing it to use hash tables or search trees instead. Don't fall into that trap. Use proper data structures for the task at hand. You can always build your own opaque data types using record types and modules in Racket; you make them opaque by exporting the type but not the field accessors for the record type (you export your type's user-facing operations instead).
It sounds like you are looking for a deque, not a list. The standard idiom for a deque is to keep two lists, the front half of the list in normal order and the back half of the list in reverse order, thus giving access to both ends of the deque. If the half of the list that you want to access is empty, reverse the other half and swap the meaning of the two halves. Look here for a fuller explanation and sample code.
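Here is a small sketch of that two-list idiom, in Python rather than Scheme. Lists are encoded as nested pairs `(head, tail)` with `None` as the empty list, so pushing is O(1) as with cons; `Deque` and `_reverse` are hypothetical names. It follows the description literally (reverse the whole other half when one half runs out); more refined versions move only half of the elements.

```python
from typing import Any, NamedTuple


def _reverse(lst):
    """Reverse a pair-encoded list: None is empty, otherwise (head, tail)."""
    out = None
    while lst is not None:
        head, lst = lst
        out = (head, out)
    return out


class Deque(NamedTuple):
    front: Any = None  # front half, in normal order
    back: Any = None   # back half, in reverse order

    def push_front(self, x):
        return Deque((x, self.front), self.back)

    def push_back(self, x):
        return Deque(self.front, (x, self.back))

    def pop_front(self):
        """Return (value, new_deque); the original deque is unchanged."""
        front, back = self.front, self.back
        if front is None:
            # Front half is empty: reverse the back half and swap roles.
            front, back = _reverse(back), None
        if front is None:
            raise IndexError("pop from empty deque")
        head, rest = front
        return head, Deque(rest, back)

    def pop_back(self):
        """Return (value, new_deque); the original deque is unchanged."""
        front, back = self.front, self.back
        if back is None:
            # Back half is empty: reverse the front half and swap roles.
            front, back = None, _reverse(front)
        if back is None:
            raise IndexError("pop from empty deque")
        last, rest = back
        return last, Deque(front, rest)


d = Deque()
for x in (1, 2, 3):
    d = d.push_front(x)   # contents front-to-back: 3, 2, 1

oldest, d2 = d.pop_back()
print(oldest)             # the element pushed first
```

Because every operation returns a new `Deque`, older versions such as `d` remain usable after `d2` has been derived from them.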

List design in functional languages

I've noticed that in functional languages such as Haskell and OCaml you can do two things with lists. First, you can do x:xs, where x is an element and xs is a list, and the result is a new list in which x is prepended to the beginning of xs in constant time. Second is x++y, where both x and y are lists, and the result is a new list in which y is appended to the end of x in linear time with respect to the number of elements in x.
Now I'm no expert in how languages are designed and compilers are built, but this seems to me a lot like a simple implementation of a linked list with one pointer to the first item. If I were to implement this data structure in a language like C++ I would find it to be generally trivial to add a pointer to the last element. In this case if these languages were implemented this way (assuming they do use linked lists as described) adding a "pointer" to the last item would make it much more efficient to add items to the end of a list and would allow pattern matching with the last element.
My question is are these data structures really implemented as linked lists, and if so why do they not add a reference to the last element?
Yes, they really are linked lists. But they are immutable. The advantage of immutability is that you don't have to worry about who else has a pointer to the same list. You might choose to write x++y, but code somewhere else in the program might be relying on x remaining unchanged.
People who work on compilers for such languages (of whom I am one) don't worry about this cost because there are plenty of other data structures that provide efficient access:
A functional queue represented as two lists provides constant-time access to both ends and amortized constant time for put and get operations.
A more sophisticated data structure like a finger tree can provide several kinds of list access at very low cost.
If you just want constant-time append, John Hughes developed an excellent, simple representation of lists as functions, which provides exactly that. (In the Haskell library they are called DList.)
If you're interested in these sorts of questions you can get good info from Chris Okasaki's book Purely Functional Data Structures and from some of Ralf Hinze's less intimidating papers.
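The lists-as-functions representation mentioned above can be sketched in a few lines. This is Python rather than Haskell, and the names `from_iterable`, `append`, `empty` and `to_pylist` are made up for this illustration (they are not the API of the Haskell library). A list is represented by a function that prepends its elements onto whatever list it is given, so appending two such lists is just function composition.

```python
from typing import Any, Callable, Iterable, List

# Ordinary lists are pair-encoded: None is empty, otherwise (head, tail).
# A "difference list" is a function from such a list to such a list.
DList = Callable[[Any], Any]


def empty(rest):
    return rest


def from_iterable(xs: Iterable[Any]) -> DList:
    items = tuple(xs)

    def prepend(rest):
        # Cons the items onto rest, last item first, so order is preserved.
        for x in reversed(items):
            rest = (x, rest)
        return rest

    return prepend


def append(f: DList, g: DList) -> DList:
    # Constant time: no list cells are touched until the result is materialised.
    return lambda rest: f(g(rest))


def to_pylist(f: DList) -> List[Any]:
    out, cell = [], f(None)
    while cell is not None:
        head, cell = cell
        out.append(head)
    return out


a = from_iterable([1, 2])
b = from_iterable([3])
c = append(append(a, b), from_iterable([4, 5]))
print(to_pylist(c))
```

The linear work has not disappeared; it is deferred until `to_pylist` runs, where each element is consed exactly once regardless of how the appends were nested.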
You said:
Second is x++y where both x and y are lists and the resulting action is y gets appended to the end of x in linear time with respect to the number of elements in x.
This is not really true in a functional language like Haskell; y gets appended to a copy of x, since anything holding onto x is depending on it not changing.
If you're going to copy all of x anyway, holding onto its last node doesn't really gain you anything.
Yes, they are linked lists. In languages like Haskell and OCaml, you don't add items to the end of a list, period. Lists are immutable. There is one operation to create new lists: cons, the : operator you refer to earlier. It takes an element and a list, and creates a new list with the element as the head and the list as the tail. The reason x++y takes linear time is that it must cons the last element of x with y, and then cons the second-to-last element of x with that list, and so on with each element of x. None of the cons cells in x can be reused, because that would cause the original list to change as well. A pointer to the last element of x would not be very helpful here, since we still have to walk the whole list.
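That description of ++ translates almost directly into code. The sketch below uses Python with lists encoded as nested pairs `(head, tail)` and `None` as the empty list; `append` is a name chosen here, not a library function. Every cell of x is rebuilt, while y is reused as is.

```python
def append(x, y):
    """x ++ y on pair-encoded lists: rebuilds every cell of x, reuses y as is."""
    if x is None:
        return y
    head, tail = x
    return (head, append(tail, y))


x = (1, (2, None))
y = (3, (4, None))
z = append(x, y)

print(z[1][1] is y)  # the tail of the result is y itself, not a copy
print(z is x)        # the cells of x had to be rebuilt
```

The recursion visits each cell of x once and never looks inside y, which is why the cost depends only on the length of x.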
++ is just one of dozens of "things you can do with lists". The reality is that lists are so versatile that one rarely uses other collections. Also, we functional programmers almost never feel the need to look at the last element of a list - if we need to, there is a function last.
However, just because lists are convenient this does not mean that we do not have other data structures. If you're really interested, have a look at this book http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf (Purely Functional Data Structures). You'll find trees, queues, lists with O(1) append of an element at the tail, and so forth.
Here's a bit of an explanation on how things are done in Clojure:
The easiest way to avoid mutating state is to use immutable data structures. Clojure provides a set of immutable lists, vectors, sets and maps. Since they can't be changed, 'adding' or 'removing' something from an immutable collection means creating a new collection just like the old one but with the needed change. Persistence is a term used to describe the property wherein the old version of the collection is still available after the 'change', and that the collection maintains its performance guarantees for most operations. Specifically, this means that the new version can't be created using a full copy, since that would require linear time. Inevitably, persistent collections are implemented using linked data structures, so that the new versions can share structure with the prior version. Singly-linked lists and trees are the basic functional data structures, to which Clojure adds a hash map, set and vector both based upon array mapped hash tries.
(emphasis mine)
So basically it looks like you're mostly correct, at least as far as Clojure is concerned.