Are all immutable data structures in Elixir persistent? If not, which of them are and which not? Also, how do they compare to the persistent data structures in Clojure?
Most of them are persistent data structures, yes; the main exception is tuples (see below).
For example, Elixir lists are linked lists, and a linked list is a degenerate tree (it has only one branch):
Elixir: list = [1, 2, 3, 4]
Tree: 1 -> 2 -> 3 -> 4
Every time you prepend an element to a list, the new list shares the old list as its tail:
Elixir: [0|list]
Tree: 0 -> (1 -> 2 -> 3 -> 4)
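Since the question also asks about Clojure: Clojure's lists share their tails the same way. A quick REPL sketch:

user> (def xs '(1 2 3 4))
#'user/xs
user> (def ys (cons 0 xs))        ; prepend, as in [0|list]
#'user/ys
user> (identical? xs (rest ys))   ; the new list's tail is the old list itself
true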
Elixir's HashSet and HashDict implementations are based on Clojure's persistent data structures and are effectively trees. There is some write-up on Joseph's blog.
Maps are also persistent data structures, and they are particularly interesting because their representation changes based on the number of keys. When you have a small map, say:
%{:foo => 1, :bar => 2, :baz => 3}
It is represented as:
        +------------- (:foo, :bar, :baz)   <- keys tuple
        |
(map, keys, values)
        |
        +------ (1, 2, 3)                   <- values tuple
So every time you update one key, the keys tuple is shared and only the values tuple is replaced. This is very efficient for small maps, but once a map grows past a certain size (32 keys as of Erlang/OTP 18), its representation changes to a Hash Array Mapped Trie, which is similar to Clojure's maps too.
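Clojure makes a very similar trade-off: small maps are array maps, promoted to a HAMT-backed hash map as they grow, which you can check in a REPL:

user> (type {:foo 1, :bar 2, :baz 3})
clojure.lang.PersistentArrayMap
user> (type (into {} (for [i (range 20)] [i i])))
clojure.lang.PersistentHashMap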
Note that tuples are not persistent, though (they occupy a contiguous chunk of memory). Once you change one element of a tuple, a whole new tuple is created. This makes them great for holding and accessing a few elements, as well as for pattern matching on them, but you definitely don't want to use them to hold many elements.
Related
I am working on something where I update many items that are kept in order inside a Clojure atom. I can store the items either as a vector or as an indexed map. There may be millions of appends to the items, so I want to choose the most memory-efficient structure.
My gut feeling is that, over millions of iterations, adding a new item to a map uses less memory than adding to a vector, but I would like a definitive answer:
So with a vector it could be:
["a" "b" ... "y"] -> ["a" "b" ... "y" "z"]
And with a map it would be:
{0 "a" 1 "b" ... 25 "y"} -> {0 "a" 1 "b" ... 25 "y" 26 "z"}
So which would use less memory?
In Clojure both vectors and hash maps use tries as their fundamental implementation technique.
Clojure's vectors use the element's index directly as the key to walk the trie and find the value. Bit partitioning splits the index into chunks of bits that are used as the key at each level.
Clojure's hash maps, on the other hand, hash the provided key (in your case, the index) to produce the value used to walk the trie. Bit partitioning is applied to that hash rather than to the index directly.
The actual key used to traverse the trie for both vectors and hash maps will be a 32-bit int.
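To make the bit partitioning concrete, here is a small sketch (trie-path is an illustrative helper, not part of Clojure's API) of how an index is split into 5-bit chunks for a 32-way trie:

;; The 5-bit key used at each level of a 32-way trie, root level first.
(defn trie-path [idx levels]
  (for [level (range (dec levels) -1 -1)]
    (bit-and (unsigned-bit-shift-right idx (* 5 level)) 0x1f)))

(trie-path 1234 3) ;=> (1 6 18), since 1234 = 1*1024 + 6*32 + 18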
I would expect the difference in memory usage between vectors and hash maps to be negligible. The hash map should use slightly more memory, since it has to cater for hash collisions and therefore carries the overhead of hash buckets.
There is a more in-depth discussion on the implementation details of both vectors and hash maps available here.
Every textbook says that Clojure data structures are 'immutable and persistent'. They go to different lengths explaining the concept, but so far I have failed to figure out the difference between immutability and persistence. Is there an entity that is persistent but mutable? Or immutable but not persistent?
Immutable means that the value can't be changed; persistent means that when a new version is built from an existing value, only the path to the changed part is copied. Clojure uses this as part of its structural sharing implementation. If the data doesn't exist, it is created. If the data exists, the new data builds on the old version without altering or removing it.
Atoms are persistent but safely mutable:
user> (def +a+ (atom 0))
#'user/+a+
user> @+a+
0
user> (swap! +a+ inc)
1
user> @+a+
1
Transients are mutable, but should be made persistent again once you are done mutating:
user> (def t (transient []))
#'user/t
user> (conj! t 1)
#<TransientVector clojure.lang.PersistentVector$TransientVector#658ee462>
user> (persistent! t)
[1]
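The idiomatic pattern is to batch the mutations and call persistent! once at the end, for example:

user> (persistent! (reduce conj! (transient []) (range 5)))
[0 1 2 3 4]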
Understanding Clojure's Persistent Vectors, pt. 1 =>
http://hypirion.com/musings/understanding-persistent-vector-pt-1
Persistent data structure => https://en.wikipedia.org/wiki/Persistent_data_structure
Persistent Data Structures and Managed References =>
http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey
Purely Functional Data Structures by Chris Okasaki refers to an article [1] which appears to contain the original definition of the term persistent:
Ordinary data structures are ephemeral in the sense that making a change to the structure destroys the old version, leaving only the new one. … We call a data structure persistent if it supports access to multiple versions. The structure is partially persistent if all versions can be accessed but only the newest version can be modified, and fully persistent if every version can be both accessed and modified.
[1] James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38(1):86–124, February 1989.
Immutable implies persistent, but persistent does not imply immutable. So you could have something that's persistent but not immutable.
An example of a mutable and persistent data structure is Java's CopyOnWriteArrayList.
Persistence does not imply shared structure, nor does it say anything about performance. Of course, shared structure and good performance are both highly desirable, and are both provided by Clojure's persistent data structures. But it would be quite possible to create something that had no structure sharing and awful performance (see CopyOnWriteArrayList, for example ;-)) but was still persistent.
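You can observe this from Clojure via Java interop: a CopyOnWriteArrayList is mutated in place, yet an iterator keeps reading the snapshot it was created from, so the old version remains accessible. A sketch:

user> (def cow (java.util.concurrent.CopyOnWriteArrayList. [1 2 3]))
#'user/cow
user> (def old-version (.iterator cow))  ; the iterator captures a snapshot
#'user/old-version
user> (.add cow 4)                       ; mutate in place
true
user> (iterator-seq old-version)         ; the old version survives
(1 2 3)
user> (seq cow)
(1 2 3 4)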
Basically immutable == can't be changed, and persistent == immutable, with shared structure.
If I have a language where arrays can't be changed, then arrays are immutable. To "change" the array, I must create a new array and copy every element (except the one(s) to be changed) into the new array. This makes any update O(n), where n is the number of elements in the array. This is obviously inefficient for large n.
On the other hand, if I use a persistent data structure instead of an array, then instead of copying every element every time the data structure is "altered", the new version shares most of the same structure with the old one.
The details depend on the structure, but usually there is a tree involved. If the tree is balanced, replacing an element means creating new copies of the nodes along the path from the root to the leaf containing the element; the rest of the nodes are shared with the original version. The length of this path is O(log n). Since the nodes are O(1) size, the entire operation takes O(log n) time and extra space.
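A quick demonstration with a Clojure vector: updating one element of a million-element vector copies only the nodes on one root-to-leaf path and leaves the original fully intact:

user> (def v (vec (range 1000000)))
#'user/v
user> (def v2 (assoc v 500000 :changed))
#'user/v2
user> (nth v 500000)   ; the old version is untouched
500000
user> (nth v2 500000)
:changed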
Note that not all persistent structures support the same operations efficiently. In Clojure, for example, Lists are singly-linked lists. You can efficiently add and remove elements to/from the front, but that's about it. Vectors, on the other hand, allow you to efficiently get any element and add/remove elements at the back.
Sorry if this seems like an obvious question.
I was creating a Data.Map of lists (actually of tuples of an integer and a list, (Integer, [(Integer, Integer)])) to implement a priority queue plus adjacency list for graph algorithms like Dijkstra's and Prim's.
Data.Map is implemented using binary trees (I read that), so I just want to confirm: when doing the map operations (I believe they will be rotations), the implementation does not make deep copies of the lists, just shallow copies of the references to them, right?
I am doing this to implement Prim's algorithm in Haskell, which should run in O(n log n + m log n) time, where n = number of vertices and m = number of edges, in a purely functional way.
If the lists are stored in the priority queue, the algorithm will run in that time. Most Haskell implementations I found online don't achieve this complexity.
Thanks in advance.
You are correct that the lists will not be copied every time you create a new Map, at least if you're using GHC (other implementations probably do this correctly as well). This is one of the advantages of a purely functional language: because data can't be rewritten, data structures don't need to be copied to avoid problems you might have in an imperative language. Consider this snippet of Lisp:
(setf a (list 1 2 3 4 5))
(setf b a)
; a and b are now both '(1 2 3 4 5).
(setf (cadr a) 0)
; a is now '(1 0 3 4 5).
; Careful though! a and b point to the same piece of memory,
; so b is now also '(1 0 3 4 5). This may or may not be what you expected.
In Haskell, the only way to have mutable state like this is to use an explicitly mutable data structure, such as something in the State monad (and even this is sort of faking it (which is a good thing)). This potentially surprising aliasing issue is unthinkable in Haskell, because once you declare that a is a particular list, it is that list now and forever. Since it is guaranteed never to change, there is no danger in reusing memory for things that are supposed to be equal, and in fact GHC will do exactly that. Therefore, when you make a new Map with the same values, only pointers to the values are copied, not the values themselves.
For more information, read about the difference between Boxed and Unboxed types.
1) Integer is slower than Int.
2) If you have [(Integer, [(Integer, Integer)])], you could build with Data.Map not only Map Integer [(Integer, Integer)] but also Map Integer (Map Integer Integer).
3) If you use Int instead of Integer, you can use the somewhat faster IntMap from Data.IntMap: IntMap (IntMap Int).
4) The complexity of each operation is stated in its description, both for Data.IntMap.Strict and for Data.IntMap.Lazy, for example:
map :: (a -> b) -> IntMap a -> IntMap b
O(n). Map a function over all values in the map.
Imagine a data structure that manages some contiguous container and allows quick retrieval of the contiguous ranges of indices within this array that contain data (and probably of the free ranges too). Let's call these ranges "blocks". Each block knows its head and tail index:
struct Block
{
    size_t begin;
    size_t end;
};
As we manipulate the array, our data structure updates the blocks:
array view blocks [begin, end]
--------------------------------------------------------------
0 1 2 3 4 5 6 7 8 9 [0, 9]
pop 2 block 1 split
0 1 _ 3 4 5 6 7 8 9 [0, 1] [3, 9]
pop 7, 8 block 2 split
0 1 _ 3 4 5 6 _ _ 9 [0, 1] [3, 6] [9, 9]
push 7 changed end of block 3
0 1 _ 3 4 5 6 7 _ 9 [0, 1] [3, 7] [9, 9]
push 5 error: already in
0 1 _ 3 4 5 6 7 _ 9 [0, 1] [3, 7] [9, 9]
push 2 blocks 1, 2 merged
0 1 2 3 4 5 6 7 _ 9 [0, 7] [9, 9]
Even before profiling, we know that block retrieval speed will be the cornerstone of application performance.
Basically, the usage pattern is:
very frequent retrieval of contiguous blocks
quite rare insertions/deletions
keeping the number of blocks minimal most of the time (to prevent fragmentation)
What we have already tried:
std::vector<bool> + std::list<Block*>. On every change: write true/false to the vector, then traverse it in a for loop and regenerate the list. On every query of blocks, return the list. Slower than we wanted.
std::list<Block*>, updating the list directly, so no traversal. Return the list. Much code to debug/test.
Questions:
Does this data structure have a generic name?
Are there existing implementations of it (already debugged and tested)?
If not, what would you advise for a fast and robust implementation of such a data structure?
Sorry if my explanation is not quite clear.
Edit
Typical applications for this container are managing buffers, either in system or in GPU memory. In the GPU case we can store huge amounts of data in a single vertex buffer and then update/invalidate some regions. On each draw call we must know the first and last index of each valid block in the buffer to draw (very often, tens to hundreds of times per second), and sometimes (about once a second) we must insert/remove blocks of data.
Another application is a custom "block memory allocator". For that purpose, a similar data structure is implemented in Andrei Alexandrescu's Modern C++ Design via an intrusive linked list. I'm looking for better options.
What I see here is a simple binary tree.
You have pairs (blocks) with begin and end indices, that is, pairs (a, b) where a <= b. So the set of blocks can easily be ordered and stored in a binary search tree.
Searching for the block that corresponds to a given number is easy (just the typical binary-tree search). So when you delete a number from the array, you need to find the block that corresponds to that number and split it into two new blocks. Note that all blocks are leaves; the internal nodes are the intervals that their two child nodes form.
Insertion, on the other hand, means finding the block and testing its siblings to see whether they have to be merged. This should be done recursively up through the tree.
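The lookup step is easy to prototype. As a language-neutral sketch (written here in Clojure, the language used elsewhere in this document, with a sorted map standing in for the balanced tree and containing-block an invented helper):

;; Blocks keyed by their begin index; finding the block that holds
;; index i is a "greatest begin <= i" lookup plus a bounds check.
(def blocks (sorted-map 0 {:begin 0 :end 1}
                        3 {:begin 3 :end 7}
                        9 {:begin 9 :end 9}))

(defn containing-block [blocks i]
  (when-let [[_ b] (first (rsubseq blocks <= i))]
    (when (<= i (:end b)) b)))

(containing-block blocks 5) ;=> {:begin 3, :end 7}
(containing-block blocks 2) ;=> nil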
You may want to try a tree like structure, either a simple red-black tree or a B+ tree.
Your first solution (vector of bools + list of blocks) seems like a good direction, but note that you don't need to regenerate the list completely from scratch (or go over the entire vector): you just need to traverse the list until you find where the newly changed index falls, and split/merge the appropriate blocks on the list.
If the list traversal proves too long, you could instead implement a vector of blocks, where each block is mapped to its start index, and each hole holds a block saying where the hole ends. You can traverse this vector as fast as a list, since you always jump to the next block (one O(1) lookup to determine the end of the block, another O(1) lookup to determine the beginning of the next block). The benefit, however, is that you can also access indices directly (for push/pop) and figure out their enclosing block with a binary search.
To make it work, you'll have to do some maintenance on the "holes" (merge and split them like real blocks), but that should also be O(1) on any insertion/deletion. The important invariant is that there is always a single hole between consecutive blocks, and vice versa.
Why are you using a list of blocks? Do you need stable iterators AND stable references? boost::stable_vector may help. If you don't need stable references, maybe what you want is a wrapper container holding a std::vector of blocks and a secondary map of size blocks.capacity(), which maps from an iterator index (kept inside the returned iterators) to the real offset in the blocks vector, plus a list of currently unused iterator indices.
Whenever you erase members from blocks, you repack blocks and shuffle the map accordingly for increased cache coherence, and when you want to insert, just push_back to blocks.
With block packing you get cache coherence when iterating, at the cost of deletion speed, while maintaining relatively fast insertion times.
Alternatively, if you need stable references and iterators, or if the container is very large, then at the cost of some access speed, iteration speed, and cache coherency, you can wrap each entry in the vector in a simple structure that contains the real entry and an offset to the next valid one, or just store pointers in the vector and set them to null on deletion.
In Clojure, lists grow from the left and vectors grow from the right, so:
user> (conj '(1 2 3) 4)
(4 1 2 3)
user> (conj [1 2 3] 4)
[1 2 3 4]
What's the most efficient method of inserting values both into the front and the back of a sequence?
You need a different data structure to support fast inserting at both start and end. See https://github.com/clojure/data.finger-tree
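For example, with data.finger-tree's double-list (assuming the org.clojure/data.finger-tree dependency is on the classpath):

user> (require '[clojure.data.finger-tree :refer [double-list conjl]])
nil
user> (def dl (double-list 1 2 3))
#'user/dl
user> (conjl dl 0)  ; amortized O(1) at the front
(0 1 2 3)
user> (conj dl 4)   ; amortized O(1) at the back
(1 2 3 4)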
As I understand it, a sequence is just a generic abstraction, so it depends on the specific implementation you are working with.
For a data structure that supports random access (e.g. a vector), adding at the back should take effectively constant time, O(1).
For a list, I would expect inserting at the front with a cons operation to take constant time, but inserting at the back will take O(n), since you have to traverse the entire structure to get to the end.
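For instance, with Clojure's own lists:

user> (cons 0 '(1 2 3))       ; O(1): the new cell just points at the old list
(0 1 2 3)
user> (concat '(1 2 3) '(4))  ; O(n) once realized: the whole list must be walked
(1 2 3 4)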
There are, of course, many other data structures that can in principle back a sequence (e.g. trees), each with its own big-O characteristics.