Rules of thumb for function arguments ordering in Clojure

Rules of thumb for function arguments ordering in Clojure - clojure

What (if any) are the rules for deciding the order of the parameters functions in Clojure core?
Functions like map and filter expect a data structure as the last
argument.
Functions like assoc and select-keys expect a data
structure as the first argument.
Functions like map and filter expect a function as the first
argument.
Functions like update-in expect a function as the last argument.
This can cause pains when using the threading macros (I know I can use as-> ) so what is the reasoning behind these decisions? It would also be nice to know so my functions can conform as closely as possible to those written by the great man.

Functions that operate on collections (and so take and return data structures, e.g. conj, merge, assoc, get) take the collection first.
Functions that operate on sequences (and therefore take and return an abstraction over data structures, e.g. map, filter) take the sequence last.
Becoming more aware of the distinction [between collection functions and sequence functions] and when those transitions occur is one of the more subtle aspects of learning Clojure.
(Alex Miller, in this mailing list thread)
This is important part of working intelligently with Clojure's sequence API. Notice, for instance, that they occupy separate sections in the Clojure Cheatsheet. This is not a minor detail. This is central to how the functions are organized and how they should be used.
It may be useful to review this description of the mental model when distinguishing these two kinds of functions:
I am usually very aware in Clojure of when I am working with concrete
collections or with sequences. In many cases I find the flow of data
starts with collections, then moves into sequences (as a result of
applying sequence functions), and then sometimes back to collections
when it comes to rest (via into, vec, or set). Transducers have
changed this a bit as they allow you to separate the target collection
from the transformation and thus it's much easier to stay in
collections all the time (if you want to) by apply into with a
transducer.
When I am building up or working on collections, typically the code
constructing it is "close" and the collection types are known and
obvious. Generally sequential data is far more likely to be vectors
and conj will suffice.
When I am thinking in "sequences", it's very rare for me to do an
operation like "add last" - instead I am thinking in whole collection
terms.
If I do need to do something like that, then I would probably convert
back to collections (via into or vec) and use conj again.
Clojure's FAQ has a few good rules of thumb and visualization techniques for getting an intuition of collection/first-arg versus sequence/last-arg.

Rather than have this be a link-only question, I'll paste a quote of Rich Hickey's response to the Usenet question "Argument order rules of thumb":
One way to think about sequences is that they are read from the left,
and fed from the right:
<- [1 2 3 4]
Most of the sequence functions consume and produce sequences. So one
way to visualize that is as a chain:
map<- filter<-[1 2 3 4]
and one way to think about many of the seq functions is that they are
parameterized in some way:
(map f)<-(filter pred)<-[1 2 3 4]
So, sequence functions take their source(s) last, and any other
parameters before them, and partial allows for direct parameterization
as above. There is a tradition of this in functional languages and
Lisps.
Note that this is not the same as taking the primary operand last.
Some sequence functions have more than one source (concat,
interleave). When sequence functions are variadic, it is usually in
their sources.
I don't think variable arg lists should be a criteria for where the
primary operand goes. Yes, they must come last, but as the evolution
of assoc/dissoc shows, sometimes variable args are added later.
Ditto partial. Every library eventually ends up with a more order-
independent partial binding method. For Clojure, it's #().
What then is the general rule?
Primary collection operands come first.That way one can write -> and
its ilk, and their position is independent of whether or not they have
variable arity parameters. There is a tradition of this in OO
languages and CL (CL's slot-value, aref, elt - in fact the one that
trips me up most often in CL is gethash, which is inconsistent with
those).
So, in the end there are 2 rules, but it's not a free-for-all.
Sequence functions take their sources last and collection functions
take their primary operand (collection) first. Not that there aren't
are a few kinks here and there that I need to iron out (e.g. set/
select).
I hope that helps make it seem less spurious,
Rich
Now, how one distinguishes between a "sequence function" and a "collection function" is not obvious to me. Perhaps others can explain this.

Related

Can I overload the count function in Clojure

In Python if I want to customize the way to define how to find the size of an object I define a __len__ method for the class, which alters the behavior of the len function. Is there any way to do this in Clojure with the count function? If so, how?

This is a reasonable question to ask when you are moving from one paradigm to
another i.e. OO to functional, but likely is not an issue. In languages like
Clojure, you will normally start by working out how to represent your data in
one of the existing data structures (Collections) rather than define a new
structure to represent your data. The idea is to have only a few different data
structures with a large number of well defined and understood functions which
operate on those structures in an efficient and reliable manner. As a
consequence, once you have decided on how to represent your graphs, you will
likely find that doing something like counting the number of vertices is
trivial - you will probably just need to use reduce with a very simple function
that in turn uses other core functions, such as the existing count(). A key to
becoming proficient in Clojure is in learning what the core functions are and
what they do.
In an OO language, you encapsulate your data and hide it inside an object. This
means you now need to define the methods to operate on the data inside that
object. To support polymorphism, you will frequently do things like add an
object specific size or length method. In Clojure, the underlying data structure
is not hidden and is frequently built on one of the standard collection types,
so you don't need tow rite a size/length function, you can use the standard
count function. When the standard collections are not suitable and you need
more, then you tend to look at things like protocols, where you can define your
own specialised functions e.g. size.
In your case, a protocol or record is unlikely to be necessary - representing
graphs is pretty much a natural fit for the standard collections and I woldn't
be at all surprised if you could re-implement what you did in C or C++ with
Clojure in a lot fewer lines and likely in a much more declarative and cleaner
manner. Start by looking at how the standard Clojure collection types could be
used to represent your graphs. Think about how you want to operate on the graphs
and whether you are best representing the graph as nodes or verticies and then
look at how you would answer questions like 'How many verticies are in this
graph?" and see how you would get that answer just using the available built-in
functions.
You do need to think about things differently when you move to a functional
paradigm. There will be a point you get to that is a real "Aha" moment as that
penny drops. Once it does, you will likely be very surprised how nice it is, but
until that moment, you will likely experience a bit of frustration and hair
pulling. The battle is worth it as you will likely find even your old imparative
programming skills benefit and when you have to go back to C/C++ or Python, your
code is even clearer and more concise. Try to avoid the temptation to reproduce
what you did in C/Python in Clojure. instead, focus on the outcome you want to
achieve and see how the supplied facilities of the language will help you do
that.

Your comment says you are dealing with graphs. Taking on board the good advice to use the standard data structures, let's consider how to represent them.
You would normally represent a graph as a map Node -> Collection of Node. For example,
(def graph {:a [:b], :b [:a :c]})
Then
(count graph)
=> 2
However, if you make sure that every node has a map entry, even the ones that have no afferent arcs, then all you have to do is count the graph's map. A function add the empty entries is ...
(defn add-empties [gm]
(if (empty? gm)
gm
(let [EMPTY (-> gm first val empty)
missing (->> gm
vals
(apply concat)
(remove gm))]
(reduce (fn [acc x] (assoc acc x EMPTY)) gm missing))))
For example,
(add-empties graph)
=> {:a [:b], :b [:a :c], :c []}
and
(count(add-empties graph))
=> 3

What does count mean for a graph?
What should count return for a graph? I can think of two equally obvious options -- the number of nodes in the graph or the number of edges. So perhaps count isn't very intuitive for graph.
Implementing Counted
Nevertheless, you certainly can define your own counted data structures in Clojure. The key is to implement the clojure.lang.Counted interface.
Now, if represent a graph via the following deftype:
(deftype NodeToEdgeGraph [node->neighbors]
...)
we can make it counted:
(deftype NodeToEdgeGraph [node->neighbors]
clojure.lang.Counted
(count [this]
(count node->neighbors))
...)
This is if we are representing a graph as a map that maps each node to its set of "neighbors" (where a node is considered a "neighbor" if, and only if, there is an edge between the two), and we want count to return the number of nodes.
Alternatively, we can represent a graph as a set of pairs (either ordered, in the case of a directed graph; or unordered, in the case of an undirected graph). In this case, we have:
(deftype EdgeGraph [edges]
...)
And we can have count return the number of edges in the graph:
(deftype EdgeGraph [edges]
clojure.lang.Counted
(count [this]
(count edges))
...)
So far, we have been using count on the underlying structure to implement count for the graph. This works because the underlying structure conveniently has the same count as the way we are counting each respective graph representation. However, there's no reason we couldn't use either representation with either way of counting. Also, there are other ways of representing graphs that might not align so nicely with the way we want to count. In these cases, we can always maintain our own internal count:
(deftype ???Graph [cnt ???]
clojure.lang.Counted
(count [this]
cnt)
...)
Here, we rely on the implementation of the graph operations to maintain cnt appropriately.
Built-in or Custom Types?
Others have suggested using the built-in datastructures only to represent your graphs. Indeed, if you take a look at NodeGraph and EdgeGraph, you can see that count simply delegates to the underlying structure. So for these representations (and definitions of count), you could simply eliminate the deftype and implement your graph operations directly over maps / sets. However, let's take a look at some advantages to encapsulating graphs in a custom type:
Ease of implementing alternative representations. With custom types, you can have an unlimited number of graph representations. All you need to do is define some protocols and define graph operations in terms of those protocols. While you can also extend protocols to built-in types, you are limited to only one representation / implementaton per built-in type. This might or might not be sufficient for you.
Maintenance of internal invariants. With custom types, you have better control over the internal structure. That is, you can implement protocols / interfaces in a way that maintains any necessary internal invariants. For example, if you are implementing undirected graphs, you can implement NodeGraph in a way that ensures adding an edge from node A to node B also adds an edge from node B to node A. Likewise with removing an edge. With built-in types, these invariants can only be maintained by the graph functions you implement. The Clojure core functions will not maintain these invariants for you, because they know nothing about them. So if you hand your "graphs" (over built-in types) off to some function that calls non-graph functions on them, you have no assurance that you will get a valid graph back. On the other have, custom types only allow the operations you implement, and will perform them the way you implement them. As long as you take care that all the operations you implement maintain the proper invariants, you can rest assured that these invariants will always be maintained.
On the other hand, sometimes it is appropriate to simply use the built-in types. For instance, if your application only makes light use of graph operations, it might be more convenient to simply use built-in types. On the other hand, if your application is heavily graph-oriented and makes a lot of graph operations, it's probably better in the long run to implement a custom type.
Of course, you are not forced to choose one over the other. With protocols, you can implement graph operations for both built-in types as well as custom types. That way, you can choose between "quick and dirty" and "heavy but robust" depending on your application. Here "heavy" just means it might take a little more work to use graphs with functions implemented over Clojure collection interfaces. In other words, you might need to convert graphs to other types in some cases. This depends heavily on how much effort you put into implementing these interfaces for your custom graph types (and to the extent they make sense for graphs).

By the way, you cannot override that function neither with with-redefs nor any related functionality. There is a hidden trick here: if you check the source code of the count function, you'll see an interesting inline attribute in its metadata:
(defn count
"Returns the number of items in the collection. (count nil) returns
0. Also works on strings, arrays, and Java Collections and Maps"
{
:inline (fn [x] `(. clojure.lang.RT (count ~x)))
:added "1.0"}
[coll] (clojure.lang.RT/count coll))
This attribute means the function's code will be inserted as-is into the resulting bytecode without calling the original function. Most of the general core functions have such attribute. So that's why you cannot override count, get, int, nth and so forth.

sorry, not for existing collection types you don't control.
You can define your own count that is aware of your needs and use that in your code, though unfortunately clojure does not use a universal protocol for counting so there is nowhere for you to attach that will extend count on an existing collection.
If counted where a protocol rather than an interface this would be easier for you, though it long predates protocols in the evolution of the language.
If you are making your own collection type then you can of course implement count anyway you want.

Intermediate representation for a Lisp / Clojure DSL

I'm designing a DSL in Clojure which is used to drive a code generator (in this case for procedural image synthesis - clisk) and am having trouble working out the best representation for intermediate values.
Originally the DSL consisted of functions that returned one or more forms, e.g. (illustrative)
(v+ 1.0 [1.0 'y])
=> ['(+ 1.0 1.0) '(+ 1.0 y)]
These functions could then be composed to build up larger blocks of code.
This was simple and the resulting forms could be fed directly into the code generator. However I've now identified what appear to be some weaknesses with this approach, for example if there is a need to pass around some auxillary data (e.g. objects that cannot be encoded in forms like BufferedImages, metadata useful for optimisation etc.).
I'm sure this is a solved problem in the Lisp world - What would typically be the best intermediate representation for this kind of DSL?

Anytime you need an intermediate representation that will be used to generate code, the most obvious thing that comes to my mind is an abstract syntax tree (AST). Your example representation is lists, which in my experience is not as flexible of a form. For any thing more than trivial code generation, I wouldn't beat around the bush, and just go with a full blown AST representation. By using lists, you push off more work to the generation side to parse out the information such as types and what the first item means. Moving to an AST representation will give you more flexibility and decouples more of the system, at the cost of more work from the parsing side (or more work from the functions that generate the forms). The generation side will be doing more work as well, but the many of those components can be decoupled since their inputs will be more structured.
In terms of what should the AST look like, I would copy either Christophe Grand's enlive, where he uses {:tag <tag name> :attrs <map of attrs> :content <some collection>}
or what clojure script uses, {:op <some operator> :children <some collection>}.
This makes it quite general since you can define arbitrary walkers that peek into :children and can traverse any structure without knowing exactly about what the :op's or :tag's are.
Then for atomic components, you can wrap it in a map and give it some type information (with respect to the semantics of your DSL) that's independent of the actual type of the object. {:atom <the object> :type :background-image}.
On the code generation side, when encountering an atom, your code can then dispatch on the :type, and then, if you want, further dispatch on the actual type of the object. Generation from collection forms is also easy, dispatch on :op/:tag, and then recur with the children. For what collection to use for children, I'd read more up on the discussion on the google groups. Their conclusions were enlightening to me.
https://groups.google.com/forum/#!topic/clojure-dev/vZLVKmKX0oc/discussion
To summarize, for children, if there were semantic ordering importance such as in an if statement, then use a map {:conditional z :then y :else x}. If it was just an argument list, then you could use a vector.

I guess I don't understand. I'd just use lists or structures myself.
In Lisp, lists can contain, well, anything. I should say, a CONS cell can point to anything and thus list can contain anything. So can pretty much any other data structure (structures, arrays, maps, etc.).
Now, these structures not be able to be renderable, by PRINT, or rendered in to something readable (by READ), but that doesn't mean they can't be stored and manipulated.
Is there some reason you need to externalize this representation?

Not really an answer, because I don't know how Clojure works in this regard, but in CL there're reader macros specifically designed for this case: i.e. you would define your function for printing unprintable objects + a reader macro that reads them from the way you have printed them. To define the way objects are printed you'd define a new method print-object which specializes on the type of object you need and set-macro-character to add a function to the readtable that knows how to read object of your design.
There are many things to be aware of, but some that usually act like timer bombs are the cases, when objects are allowed to recursively reference themselves, in which case printing needs to consider previously printed objects.

Scheme: Constant Access to the End of a List?

In C, you can have a pointer to the first and last element of a singly-linked list, providing constant time access to the end of a list. Thus, appending one list to another can be done in constant time.
As far as I am aware, scheme does not provide this functionality (namely constant access to the end of a list) by default. To be clear, I am not looking for "pointer" functionality. I understand that is non-idiomatic in scheme and (as I suppose) unnecessary.
Could someone either 1) demonstrate the ability to provide a way to append two lists in constant time or 2) assure me that this is already available by default in scheme or racket (e.g., tell me that append is in fact a constant operation if I am wrong to think otherwise)?
EDIT:
I should make myself clearer. I am trying to create an inspectable queue. I want to have a list that I can 1) push onto the front in constant time, 2) pop off the back in constant time, and 3) iterate over using Racket's foldr or something similar (a Lisp right fold).

Standard Lisp lists cannot be appended to in constant time.
However, if you make your own list type, you can do it. Basically, you can use a record type (or just a cons cell)---let's call this the "header"---that holds pointers to the head and tail of the list, and update it each time someone adds to the list.
However, be aware that if you do that, lists are no longer structurally inductive. i.e., a longer list isn't simply an extension of a shorter list, because of the extra "header" involved. Thus, you lose a great part of the simplicity of Lisp algorithms which involve recursing into the cdr of a list at each iteration.
In other words, the lack of easy appending is a tradeoff to enable recursive algorithms to be written much more easily. Most functional programmers will agree that this is the right tradeoff, since appending in a pure-functional sense means that you have to copy every cell in all but the last list---so it's no longer O(1), anyway.
ETA to reflect OP's edit
You can create a queue, but with the opposite behaviour: you add elements to the back, and retrieve elements in the front. If you are willing to work with that, such a data structure is easy to implement in Scheme. (And yes, it's easy to append two such queues in constant time.)
Racket also has a similar queue data structure, but it uses a record type instead of cons cells, because Racket cons cells are immutable. You can convert your queue to a list using queue->list (at O(n) complexity) for times when you need to fold.

You want a FIFO queue. user448810 mentions the standard implementation for a purely-functional FIFO queue.
Your concern about losing the "key advantage of Lisp lists" needs to be unpacked a bit:
You can write combinators for custom data structures in Lisp. If you implement a queue type, you can easily write fold, map, filter and so on for it.
Scheme, however, does lack in the area of providing polymorphic sequence functions that can work on multiple sequence types. You do often end up either (a) converting your data structures back to lists in order to use the rich library of list functions, or (b) implementing your own versions of various of these functions for your custom types.
This is very much a shame, because singly-linked lists, while they are hugely useful for tons of computations, are not a do-all data structure.
But what is worse is that there's a lot of Lisp folk who like to pretend that lists are a "universal datatype" that can and should be used to represent any kind of data. I've programmed Lisp for a living, and oh my god I hate the code that these people produce; I call it "Lisp programmer's disease," and have much too often had to go in and fix a lot of n^2 that uses lists to represent sets or dictionaries to use hash tables or search trees instead. Don't fall into that trap. Use proper data structures for the task at hand. You can always build your own opaque data types using record types and modules in Racket; you make them opaque by exporting the type but not the field accessors for the record type (you export your type's user-facing operations instead).

It sounds like you are looking for a deque, not a list. The standard idiom for a deque is to keep two lists, the front half of the list in normal order and the back half of the list in reverse order, thus giving access to both ends of the deque. If the half of the list that you want to access is empty, reverse the other half and swap the meaning of the two halves. Look here for a fuller explanation and sample code.

D naming conventions: How is Phobos organized?

I'm making my own little library of handy functions and I'm trying to follow Phobos's naming convention but I'm getting really confused. How do I know where things would fit?
Example:
If there was a function like foldRight in Phobos (basically reduce in the reverse direction), which module would I find it in?
I can think of several:
std.algorithm: Because it's expressing an algorithm
std.array: Because I'm likely going to use it on arrays
std.container: Because it's used on containers, rather than single objects
std.functional: Because it's used mainly in functional programming
std.range: Because it operates on ranges as well
but I have no idea which one would be a good choice -- I could give a convincing argument for at least 3 of them.
What's the convention?

std.algorithm: yep and you can implement it like reduce!fun(retro(r))
this module specifies algorithms that run on sequences
std.array: no because it can also run on other ranges
these are helper functions that run only on build-in arrays
std.container: no because it doesn't define a data structure (like a treeset)
this defines data structures that are not build in into the language (for now a linked list, a binary tree and a deterministic array in terms of memory management)
std.functional: no because it doesn't operate on a function but on a range
this one takes a function and returns a different one
std.range: no because it doesn't define a range or provide a different way to iterate over one
the lack of a clear structure is one of my gripes with the phobos library TBH but really reading the first paragraph of the docs should tell you quite a bit of where to put the function

In Clojure, when should trees of heterogenous node types be represented using records or vectors?

Which is better idiomatic clojure practice for representing a tree made up of different node types:
A. building trees out of several different types of records, that one defines using deftype or defrecord:
(defrecord node_a [left right])
(defrecord node_b [left right])
(defrecord leaf [])
(def my-tree (node_a. (node_b. (leaf.) (leaf.)) (leaf.)))
B. building trees out of vectors, with keywords designating the types:
(def my-tree [:node-a [:node-b :leaf :leaf] :leaf])
Most clojure code that I see seems to favor the usage of the general purpose data structures (vectors, maps, etc.), rather than datatypes or records. Hiccup, to take one example, represents html very nicely using the vector + keyword approach.
When should we prefer one style over the other?

You can put as many elements into a vector as you want. A record has a set number of fields. If you want to constrain your nodes to only have N sub-nodes, records might be good, e.g. making when a binary tree, where a node has to have only a Left and Right. But for something like HTML or XML, you probably want to support arbitrary numbers of sub-nodes.
Using vectors and keywords means that "extending" the set of supported node types is as simple as putting a new keyword into the vector. [:frob "foo"] is OK in Hiccup even if its author never heard of frobbing. Using records, you'd potentially have to define a new record for every node type. But then you get the benefit of catching typos and verifying subnodes. [:strnog "some bold text?"] isn't going to be caught by Hiccup, but (Strnog. "foo") would be a compile-time error.
Vectors being one of Clojure's basic data types, you can use Clojure's built-in functions to manipulate them. Want to extend your tree? Just conj onto it, or update-in, or whatever. You can build up your tree incrementally this way. With records, you're probably stuck with constructor calls, or else you have to write a ton of wrapper functions for the constructors.
Seems like this partly boils down to an argument of dynamic vs. static. Personally, I would go the dynamic (vector + keyword) route unless there was a specific need for the benefits of using records. It's probably easier to code that way, and it's more flexible for the user, at the cost of being easier for the user to end up making a mess. But Clojure users are likely used to having to handle dangerous weapons on a regular basis. Clojure being largely a dynamic language, staying dynamic is often the right thing to do.

This is a good question. I think both are appropriate for different kinds of problems. Nested vectors are a good solution if each node can contain a variable set of information - in particular templating systems are going to work well. Records are a good solution for a smallish number of fixed node types where nesting is far more constrained.
We do a lot of work with heterogeneous trees of records. Each node represents one of a handful of well-known types, each with a different set of known fixed keys. The reason records are better in this case is that you can pick the data out of the node by key which is O(1) (really a Java method call which is very fast), not O(n) (where you have to look through the node contents) and also generally easier to access.
Records in 1.2 are imho not quite "finished" but it's pretty easy to build that stuff yourself. We have a defrecord2 that adds constructor functions (new-foo), field validation, print support, pprint support, tree walk/edit support via zippers, etc.
An example of where we use this is to represent ASTs or execution plans where nodes might be things like Join, Sort, etc.
Vectors are going to be better for creating stuff like strings where an arbitrary number of things can be put in each node. If you can stuff 1+ <p>s inside a <div>, then you can't create a record that contains a :p field - that just doesn't make any sense. That's a case where vectors are far more flexible and idiomatic.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js