Clojure sequences and collections

In Lisp, all data structures are built from cons cells, i.e. they are essentially linked lists or binary trees or both (correct me if I'm wrong). Clojure's data structures are lists, vectors, maps and sets. Clojure provides two general abstractions over these data structures: collections and sequences. The sequence abstraction defines the first, rest and cons operations, whereas the collection abstraction defines collection-specific operations such as conj and into.
Clojure core functions such as map and filter operate on the sequence abstraction, but accept any data structure and perform an implicit conversion. These functions are also lazy. Does this mean that, by default, Clojure internally stores data in more efficient data structures such as indexed arrays and only switches to linked lists as needed? How does Clojure actually convert collections to sequences? Is the sequence built from the collection using an iterator in a streaming fashion, or built as a whole and then passed to the consumer?

The only data structure in Clojure that is a singly-linked list is an actual list like:
(list 1 2 3)
Everything else is an efficient data structure (i.e. vector, map).
A lazy sequence is (nominally) composed of the current value and a recipe for generating the next value. Once computed, elements are cached and are not recomputed.
Conversion of a collection to a sequence is an implementation detail and is not normally important to the end user.
The original map and filter functions are lazy, as are many others. However, this was enough of a headache (unpredictable time of realization) that the eager versions mapv and filterv, which return vectors, were added to the language.
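The "current value plus a recipe for the next value, cached once computed" idea can be sketched outside Clojure. The following is a minimal conceptual model in Python, not Clojure's actual implementation (which, among other things, may realize elements in chunks). The names LazySeq, squares_from and take are hypothetical and exist only for this illustration.

```python
class LazySeq:
    """A lazy cell: a realized first value plus a thunk that builds the rest."""

    def __init__(self, first, rest_thunk):
        self.first = first
        self._rest_thunk = rest_thunk  # the "recipe" for the next cell
        self._rest = None
        self._realized = False

    @property
    def rest(self):
        # Run the recipe at most once, then serve the cached cell.
        if not self._realized:
            self._rest = self._rest_thunk()
            self._rest_thunk = None
            self._realized = True
        return self._rest


calls = []  # records which elements have actually been computed


def squares_from(n):
    """Hypothetical infinite lazy sequence n*n, (n+1)*(n+1), ..."""
    calls.append(n)
    return LazySeq(n * n, lambda: squares_from(n + 1))


def take(k, seq):
    """Collect the first k values, realizing only as many cells as needed."""
    out = []
    while k > 0 and seq is not None:
        out.append(seq.first)
        k -= 1
        if k > 0:
            seq = seq.rest
    return out


s = squares_from(1)
first_three = take(3, s)  # realizes the cells for 1, 2 and 3 only
again = take(3, s)        # served from the cache, nothing is recomputed
```

After both calls to take, the calls list still holds only 1, 2 and 3: the second traversal reuses the cached cells instead of rerunning the recipe.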

Related

Why does Elixir have so many similar list types in the standard library?

I'm doing the Elixir koans, and already I've worked through something like five different listy data types:
List
Char list
Word list
Tuple
Keyword list
Map
MapSet
Struct
Some of these I buy, but all of them at the same time? Does anyone actually use all of these lists for strictly separated purposes?

Short answer is: yes.
Long answer is:
Lists - are a basic data structure you use everywhere. Lists are ordered and allow duplicates. The main use case is: homogeneous variable-length collections;
Charlists - where Elixir uses strings (based on binaries), Erlang usually uses charlists (lists of integer codepoints). They are mainly a compatibility interface;
Word lists - I've never heard of those;
Tuples - are another basic data structure you use everywhere. The main use case is: heterogeneous fixed-length collections;
Keyword lists - are very common, mainly used for options. They are a simple abstraction on top of lists and tuples (a list of two-element tuples). They allow duplicate keys and maintain key order; because they are ordered, pattern matching on them is very impractical;
Maps - are common too. They allow easy pattern matching on keys, but do not allow duplicate keys and are not ordered;
MapSet - sets are a basic data structure: an unordered collection of unique elements;
Structs - are the main mechanism for polymorphism in Elixir (through protocols) and allow creating more rigid structures whose key set is enforced at compile time.
With functional programming, choosing the right data structure to represent your data is often half of the problem. That's why you get so many different structures with different characteristics. Each one has its use cases and is useful in different ways.

@michalmuskala provided a great answer here; I'll just extend it a bit.
Lists are the workhorse in Elixir. There are plenty of problems that you will solve with lists. Lists are not arrays, where random access is the best way to get values; instead, lists in Elixir are linked data structures and you traverse them by splitting them into head and tail (if you know Lisp, Prolog or Erlang, you will feel right at home).
Charlists are just lists, but narrowed to lists of integers.
Tuples - usually they contain two to four elements. They are a common way to pass additional data while still sending a single parameter. Common behaviours such as GenServer use them as the expected reply.
Keyword lists are lists of two-element tuples, and you can use them when you need to store more than one value for one key. The keyword syntax is syntactic sugar:
instead of a = [{:name, "Patryk"}] you can write a = [name: "Patryk"] and access it with a[:name].
Maps are associative arrays, hashes, dicts etc. One key holds one value and keys are unique.
Set - think of mathematical sets: an unordered collection of unique values.
Struct - as @michalmuskala wrote, they are used with protocols and are checked by the compiler. Under the hood they are maps defined for a module.

The answers are to be read from the bottom to the top :)
@michalmuskala provided a great answer here, and @patnowak extended it perfectly. I am here mostly to answer the question “Does anyone actually use all of these lists for strictly separated purposes?”
Elixir (as well as Erlang) is all about pattern matching. Having different types of lists makes it easy to narrow the pattern matching in each particular case:
List is used mostly in recursion; Erlang has no loops, so one makes recursive calls instead. It is highly optimized when used properly (tail recursion). It usually matches as [head | tail].
Charlist is used in “string” pattern matching, whatever that means. A check for “the first letter of his name is ‘A’” would, Erlang-style, be done with the pattern match [?A | rest] = "Aleksei" |> List.Chars.to_charlist()
Tuple is used in pattern matching of different instances of the more-or-less same entity. Fail/Success would be returned as tuples {:ok, result} and {:error, message} respectively and pattern matched afterwards. GenServer simplifies handling of different messages that way as well.
Map is to be pattern matched as %{name: "Aleksei"} = generic_input to immediately extract the name. Keywords are more or less the same.
etc.

What is the difference between clojure's APersistentMap implementations

I'm trying to figure out what the difference is between a PersistentHashMap, PersistentArrayMap, PersistentTreeMap, and PersistentStructMap.
Also if I use {:a 1} it gives me a PersistentArrayMap but can this change to any of the other ones if I give it objects or things other than keys?

The four implementations you list fall into three groups:
"literal": PersistentArrayMap and PersistentHashMap: basic map types used when dealing with map literals (though constructor functions are also available with different behaviour around handling duplicate keys -- in Clojure 1.5.x literals throw exceptions when they discover duplicate keys, constructor functions work like left-to-right repeated conjing; this behaviour has been evolving from version to version). Array maps get promoted to hash maps when growing beyond a certain number of entries (9 IIRC). Array maps exist because they are faster for small maps; they also differ from hash maps in that they keep entries in insertion order prior to promotion to hash map (you can use clojure.core/array-map to get arbitrarily large array maps, which may be useful if you really know you'd benefit from insertion-order traversals and the map won't be too large, perhaps just a bit over the usual threshold; NB. a subsequent assoc to such an oversized array map will return a hash map). Array maps use arrays with keys and values interleaved; the PHM uses a persistent version of Phil Bagwell's hash array mapped trie with separate chaining for hash collisions and separate node types for mostly-empty and at-least-half-full nodes and is easily the most complex data structure in Clojure.
sorted: PersistentTreeMap instances are created by special request only (a call to sorted-map or sorted-map-by). They are implemented as red-black trees and maintain entries in a particular order, as specified by the default compare comparator if created with sorted-map or a user-supplied comparator if created with sorted-map-by.
special-purpose, probably deprecated: PersistentStructMap is not used very often and mostly viewed as deprecated in favour of records, although I actually can't remember right now if there ever was an official deprecation notice. The original purpose was to provide maps with particularly fast access to certain often-used keys. This can now be accomplished with records when using keywords for field access (with the keyword in the operator position: (:foo instance-of-some-record-with-field-foo)), though it's important to note that records are not = to regular maps with the same entries.
All four of these built-in map types fall into the same "equality partition", that is, any two maps of one of the four classes mentioned above will be equal if (and only if) they contain the same keys (as determined by Clojure's =) with the same corresponding values. Records, as mentioned in the third group above, are map-like, but each record type forms its own equality partition.

They are different implementations of a persistent map (they all extend APersistentMap). A PersistentArrayMap uses an array as the underlying data structure to implement a persistent map, and similarly the other implementations use different underlying data structures.
The reason for having different implementations is that they provide different benefits in different situations (the efficiency of an implementation depends on its underlying data structure).
From a developer's perspective this is abstracted away, so you should not use the concrete classes directly; work with the APersistentMap abstract class or the IPersistentMap interface instead (in case some type checking etc. is required for some specific case).
Which implementation is used depends on the number of elements in the map:
(type (into {} (map #(-> [% %]) (range 5))))
=> clojure.lang.PersistentArrayMap
(type (into {} (map #(-> [% %]) (range 10))))
=> clojure.lang.PersistentHashMap
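To make the "array map for small maps, hash map beyond a threshold" idea concrete, here is a rough sketch in Python rather than Clojure. It is only a model under stated assumptions: the class names, the THRESHOLD value of 8 and the copy-on-write HashMap stand-in are all illustrative choices, and the real PersistentHashMap uses a hash array mapped trie instead of copying a dictionary.

```python
THRESHOLD = 8  # assumed promotion point, chosen for illustration only


class ArrayMap:
    """Small immutable map: keys and values interleaved in one tuple."""

    def __init__(self, items=()):
        self._items = tuple(items)  # (k0, v0, k1, v1, ...)

    def __len__(self):
        return len(self._items) // 2

    def get(self, key, default=None):
        # Linear scan over the keys; cheap when there are only a few entries.
        for i in range(0, len(self._items), 2):
            if self._items[i] == key:
                return self._items[i + 1]
        return default

    def keys(self):
        return list(self._items[0::2])  # insertion order is preserved

    def assoc(self, key, value):
        """Return a new map with key set to value; self is left unchanged."""
        for i in range(0, len(self._items), 2):
            if self._items[i] == key:
                return ArrayMap(self._items[:i + 1] + (value,) + self._items[i + 2:])
        if len(self) >= THRESHOLD:
            # Too big for linear scans: promote to the hash-based map.
            promoted = dict(zip(self._items[0::2], self._items[1::2]))
            promoted[key] = value
            return HashMap(promoted)
        return ArrayMap(self._items + (key, value))


class HashMap:
    """Stand-in for the hash-based map; copies on write instead of using a trie."""

    def __init__(self, data):
        self._data = dict(data)

    def __len__(self):
        return len(self._data)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def assoc(self, key, value):
        new = dict(self._data)
        new[key] = value
        return HashMap(new)


m = ArrayMap()
for i in range(8):
    m = m.assoc(i, i)
small_type = type(m).__name__            # still an ArrayMap at 8 entries
big_type = type(m.assoc(8, 8)).__name__  # adding a 9th entry promotes it
```

The caller only ever uses get and assoc, which is the point made above: the switch between implementations is invisible unless you ask for the type.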

When to use a sequence in F# as opposed to a list?

I understand that a list actually contains values, and a sequence is an alias for IEnumerable<T>. In practical F# development, when should I be using a sequence as opposed to a list?
Here are some reasons I can see for when a sequence would be better:
When interacting with other .NET languages or libraries that require IEnumerable<T>.
Need to represent an infinite sequence (probably not really useful in practice).
Need lazy evaluation.
Are there any others?

I think your summary for when to choose Seq is pretty good. Here are some additional points:
Use Seq by default when writing functions, because then they work with any .NET collection
Use Seq if you need advanced functions like Seq.windowed or Seq.pairwise
I think choosing Seq by default is the best option, so when would I choose a different type?
Use List when you need recursive processing using the head::tail patterns
(to implement some functionality that's not available in standard library)
Use List when you need a simple immutable data structure that you can build step-by-step
(for example, if you need to process the list on one thread, to show some statistics, and concurrently continue building the list on another thread as you receive more values, e.g. from a network service)
Use List when you work with short lists - list is the best data structure to use if the value often represents an empty list, because it is very efficient in that scenario
Use Array when you need large collections of value types
(arrays store data in a flat memory block, so they are more memory efficient in this case)
Use Array when you need random access or more performance (and cache locality)

Also prefer seq when:
You don't want to hold all elements in memory at the same time.
Performance is not important.
You need to do something before and after enumeration, e.g. connect to a database and close connection.
You are not concatenating (repeated Seq.append will stack overflow).
Prefer list when:
There are few elements.
You'll be prepending and decapitating a lot.
Neither seq nor list is good for parallelism, but that does not necessarily mean they are bad either. For example, you could use either to represent a small bunch of separate work items to be done in parallel.

Just one small point: Seq and Array are better than List for parallelism.
You have several options: PSeq from F# PowerPack, Array.Parallel module and Async.Parallel (asynchronous computation). List is awful for parallel execution due to its sequential nature (head::tail composition).

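A list is more functional and math-friendly: two lists are equal when their corresponding elements are equal.
Sequences are not compared that way: two separately created sequences with the same elements are not equal.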
let list1 = [1..3]
let list2 = [1..3]
printfn "equal lists? %b" (list1=list2)
let seq1 = seq {1..3}
let seq2 = seq {1..3}
printfn "equal seqs? %b" (seq1=seq2)

You should always expose Seq in your public APIs. Use List and Array in your internal implementations.

Scheme: Constant Access to the End of a List?

In C, you can have a pointer to the first and last element of a singly-linked list, providing constant time access to the end of a list. Thus, appending one list to another can be done in constant time.
As far as I am aware, scheme does not provide this functionality (namely constant access to the end of a list) by default. To be clear, I am not looking for "pointer" functionality. I understand that is non-idiomatic in scheme and (as I suppose) unnecessary.
Could someone either 1) demonstrate the ability to provide a way to append two lists in constant time or 2) assure me that this is already available by default in scheme or racket (e.g., tell me that append is in fact a constant operation if I am wrong to think otherwise)?
EDIT:
I should make myself clearer. I am trying to create an inspectable queue. I want to have a list that I can 1) push onto the front in constant time, 2) pop off the back in constant time, and 3) iterate over using Racket's foldr or something similar (a Lisp right fold).

Standard Lisp lists cannot be appended to in constant time.
However, if you make your own list type, you can do it. Basically, you can use a record type (or just a cons cell)---let's call this the "header"---that holds pointers to the head and tail of the list, and update it each time someone adds to the list.
However, be aware that if you do that, lists are no longer structurally inductive. i.e., a longer list isn't simply an extension of a shorter list, because of the extra "header" involved. Thus, you lose a great part of the simplicity of Lisp algorithms which involve recursing into the cdr of a list at each iteration.
In other words, the lack of easy appending is a tradeoff to enable recursive algorithms to be written much more easily. Most functional programmers will agree that this is the right tradeoff, since appending in a pure-functional sense means that you have to copy every cell in all but the last list---so it's no longer O(1), anyway.
ETA to reflect OP's edit
You can create a queue, but with the opposite behaviour: you add elements to the back, and retrieve elements in the front. If you are willing to work with that, such a data structure is easy to implement in Scheme. (And yes, it's easy to append two such queues in constant time.)
Racket also has a similar queue data structure, but it uses a record type instead of cons cells, because Racket cons cells are immutable. You can convert your queue to a list using queue->list (at O(n) complexity) for times when you need to fold.
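The "header" idea described above (one record holding pointers to the first and last cells) can be sketched as follows. This is an illustration in Python rather than Scheme, with hypothetical names (Cell, Queue, enqueue, dequeue, append, to_list); it relies on mutation, which is exactly the tradeoff the answer describes.

```python
class Cell:
    """One link of a singly-linked list."""

    def __init__(self, value):
        self.value = value
        self.next = None


class Queue:
    """Header record holding pointers to the first and last cells."""

    def __init__(self):
        self.head = None
        self.tail = None

    def enqueue(self, value):
        """Add to the back in O(1) by following the tail pointer."""
        cell = Cell(value)
        if self.tail is None:
            self.head = self.tail = cell
        else:
            self.tail.next = cell
            self.tail = cell

    def dequeue(self):
        """Remove from the front in O(1)."""
        if self.head is None:
            raise IndexError("dequeue from empty queue")
        cell = self.head
        self.head = cell.next
        if self.head is None:
            self.tail = None
        return cell.value

    def append(self, other):
        """Splice other's cells onto the end in O(1); other is emptied."""
        if other.head is None:
            return
        if self.tail is None:
            self.head = other.head
        else:
            self.tail.next = other.head
        self.tail = other.tail
        other.head = other.tail = None

    def to_list(self):
        """Walk the cells in O(n), e.g. before folding over the contents."""
        out, cell = [], self.head
        while cell is not None:
            out.append(cell.value)
            cell = cell.next
        return out


a, b = Queue(), Queue()
for x in (1, 2):
    a.enqueue(x)
for x in (3, 4):
    b.enqueue(x)
a.append(b)           # constant time: only two pointers change
joined = a.to_list()
```

Note that append shares cells between the two queues instead of copying them, so it is only safe because the second queue is emptied afterwards.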

You want a FIFO queue. user448810 mentions the standard implementation for a purely-functional FIFO queue.
Your concern about losing the "key advantage of Lisp lists" needs to be unpacked a bit:
You can write combinators for custom data structures in Lisp. If you implement a queue type, you can easily write fold, map, filter and so on for it.
Scheme, however, does lack in the area of providing polymorphic sequence functions that can work on multiple sequence types. You do often end up either (a) converting your data structures back to lists in order to use the rich library of list functions, or (b) implementing your own versions of various of these functions for your custom types.
This is very much a shame, because singly-linked lists, while they are hugely useful for tons of computations, are not a do-all data structure.
But what is worse is that there are a lot of Lisp folk who like to pretend that lists are a "universal datatype" that can and should be used to represent any kind of data. I've programmed Lisp for a living, and oh my god I hate the code that these people produce; I call it "Lisp programmer's disease," and have much too often had to go in and fix a lot of O(n^2) code that uses lists to represent sets or dictionaries, replacing them with hash tables or search trees. Don't fall into that trap. Use proper data structures for the task at hand. You can always build your own opaque data types using record types and modules in Racket; you make them opaque by exporting the type but not the field accessors for the record type (you export your type's user-facing operations instead).

It sounds like you are looking for a deque, not a list. The standard idiom for a deque is to keep two lists, the front half of the list in normal order and the back half of the list in reverse order, thus giving access to both ends of the deque. If the half of the list that you want to access is empty, reverse the other half and swap the meaning of the two halves. Look here for a fuller explanation and sample code.
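The two-list idiom can be sketched like this. It is a minimal illustration in Python, using nested pairs (value, rest) with None as the empty list to stand in for cons cells; the names Deque, push_front, push_back, pop_front and pop_back are hypothetical. Nothing is mutated: every operation returns a new deque. This simple version reverses the whole opposite half when one half runs dry, which is cheap on average for queue-style use but can be slow if you alternate between both ends.

```python
from typing import NamedTuple, Optional


def reverse(lst):
    """Reverse a cons-style list built from (value, rest) pairs."""
    out = None
    while lst is not None:
        out = (lst[0], out)
        lst = lst[1]
    return out


class Deque(NamedTuple):
    front: Optional[tuple] = None  # front half, in normal order
    back: Optional[tuple] = None   # back half, in reverse order


def push_front(d, x):
    return Deque((x, d.front), d.back)


def push_back(d, x):
    return Deque(d.front, (x, d.back))


def pop_front(d):
    """Return (value, new_deque); reverse the back half if the front is empty."""
    front, back = d.front, d.back
    if front is None:
        front, back = reverse(back), None
    if front is None:
        raise IndexError("pop from empty deque")
    return front[0], Deque(front[1], back)


def pop_back(d):
    """Return (value, new_deque); reverse the front half if the back is empty."""
    front, back = d.front, d.back
    if back is None:
        front, back = None, reverse(front)
    if back is None:
        raise IndexError("pop from empty deque")
    return back[0], Deque(front, back[1])


d = Deque()
for x in (1, 2, 3):
    d = push_front(d, x)   # logical contents, front to back: 3 2 1
oldest, d = pop_back(d)    # the first element pushed comes out first
next_oldest, d = pop_back(d)
newest, d = pop_front(d)
```

Pushing onto the front and popping off the back, as the question asks, gives FIFO behaviour here; to fold over the contents you would walk the front half and then the reversed back half.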

In Clojure, when should trees of heterogenous node types be represented using records or vectors?

Which is better idiomatic clojure practice for representing a tree made up of different node types:
A. building trees out of several different types of records, that one defines using deftype or defrecord:
(defrecord node_a [left right])
(defrecord node_b [left right])
(defrecord leaf [])
(def my-tree (node_a. (node_b. (leaf.) (leaf.)) (leaf.)))
B. building trees out of vectors, with keywords designating the types:
(def my-tree [:node-a [:node-b :leaf :leaf] :leaf])
Most clojure code that I see seems to favor the usage of the general purpose data structures (vectors, maps, etc.), rather than datatypes or records. Hiccup, to take one example, represents html very nicely using the vector + keyword approach.
When should we prefer one style over the other?

You can put as many elements into a vector as you want. A record has a set number of fields. If you want to constrain your nodes to only have N sub-nodes, records might be good, e.g. when making a binary tree, where a node has to have only a Left and a Right. But for something like HTML or XML, you probably want to support arbitrary numbers of sub-nodes.
Using vectors and keywords means that "extending" the set of supported node types is as simple as putting a new keyword into the vector. [:frob "foo"] is OK in Hiccup even if its author never heard of frobbing. Using records, you'd potentially have to define a new record for every node type. But then you get the benefit of catching typos and verifying subnodes. [:strnog "some bold text?"] isn't going to be caught by Hiccup, but (Strnog. "foo") would be a compile-time error.
Vectors being one of Clojure's basic data types, you can use Clojure's built-in functions to manipulate them. Want to extend your tree? Just conj onto it, or update-in, or whatever. You can build up your tree incrementally this way. With records, you're probably stuck with constructor calls, or else you have to write a ton of wrapper functions for the constructors.
Seems like this partly boils down to an argument of dynamic vs. static. Personally, I would go the dynamic (vector + keyword) route unless there was a specific need for the benefits of using records. It's probably easier to code that way, and it's more flexible for the user, at the cost of being easier for the user to end up making a mess. But Clojure users are likely used to having to handle dangerous weapons on a regular basis. Clojure being largely a dynamic language, staying dynamic is often the right thing to do.

This is a good question. I think both are appropriate for different kinds of problems. Nested vectors are a good solution if each node can contain a variable set of information - in particular templating systems are going to work well. Records are a good solution for a smallish number of fixed node types where nesting is far more constrained.
We do a lot of work with heterogeneous trees of records. Each node represents one of a handful of well-known types, each with a different set of known fixed keys. The reason records are better in this case is that you can pick the data out of the node by key which is O(1) (really a Java method call which is very fast), not O(n) (where you have to look through the node contents) and also generally easier to access.
Records in 1.2 are imho not quite "finished" but it's pretty easy to build that stuff yourself. We have a defrecord2 that adds constructor functions (new-foo), field validation, print support, pprint support, tree walk/edit support via zippers, etc.
An example of where we use this is to represent ASTs or execution plans where nodes might be things like Join, Sort, etc.
Vectors are going to be better for creating stuff like strings where an arbitrary number of things can be put in each node. If you can stuff 1+ <p>s inside a <div>, then you can't create a record that contains a :p field - that just doesn't make any sense. That's a case where vectors are far more flexible and idiomatic.
