how to spec a lazy-seq generating function? - clojure

I wish to use spec in my pre and post conditions of a generator function. A simplified example of what I wish to do is described below:
(defn positive-numbers
  ([]
   {:post [(s/valid? (s/+ int?) %)]}
   (positive-numbers 1))
  ([n]
   {:post [(s/valid? (s/+ int?) %)]}
   (lazy-seq (cons n (positive-numbers (inc n))))))
(->> (positive-numbers) (take 5))
However, defining the generator function like that seems to cause a stack overflow, apparently because spec eagerly tries to realize the whole (infinite) lazy sequence, or something like that.
Is there another way of using spec to describe the :post result of a generator function like the one above (without causing stack-overflow)?

The theoretically correct answer is that in general you cannot check whether a lazy sequence matches a spec without realizing all of it.
In the case of your specific example of (s/+ int?), given a lazy sequence, how would one establish merely by observing the sequence whether all its elements are integers? However many elements you examine, the next one could always be a keyword.
This is the sort of thing that a type system like, say, core.typed may be able to prove, but a runtime-predicate-based assertion won't be able to check.
Now, in addition to s/+ and s/*, spec (as of Clojure 1.9.0-alpha14) also has a combinator called s/every, whose docstring says this:
Note that 'every' does not do exhaustive checking, rather it samples *coll-check-limit* elements.
So we have e.g.
(s/valid? (s/* int?) (concat (range 1000) [:foo]))
;= false
but
(s/valid? (s/every int?) (concat (range 1000) [:foo]))
;= true
(with the default *coll-check-limit* value of 101).
This actually isn't an immediate fix to your example – plugging in s/every in place of s/+ won't work, because each recursive call will want to validate its own return value, which will involve realizing more of the sequence, which will involve more recursive calls etc. But you could factor out the sequence-building logic to a helper function with no postconditions and then have positive-numbers declare the postcondition and call that helper function:
(defn positive-numbers* [n]
  (lazy-seq (cons n (positive-numbers* (inc n)))))

(defn positive-numbers [n]
  {:post [(s/valid? (s/every int? :min-count 1) %)]}
  (positive-numbers* n))
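A quick sanity check of the refactored version (a sketch, assuming the two definitions above and assertions enabled):
(->> (positive-numbers 1) (take 5))
;= (1 2 3 4 5)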
Note the caveats:

1. this will still realize a good chunk of your sequence, which may wreak havoc with your application's performance profile;

2. the only watertight guarantee here is that the prefix actually examined is as desired; if the seq has a weird item at position 123456, that will go unnoticed.

Because of (1), this is something that makes more sense as a test-only assertion. (2) may be acceptable – you'll still catch some silly typos, and the documentation value of the spec is there anyway; if it isn't, and you do want an absolutely watertight guarantee that your return type is as desired, then again, core.typed (perhaps used locally just for a handful of namespaces) may be the better bet.
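On caveat (1), *coll-check-limit* is a dynamic var, so you can tune how many elements s/every samples; a small sketch (assuming clojure.spec.alpha is required as s):
(require '[clojure.spec.alpha :as s])

;; With only 10 elements sampled, the bad element at index 1000 goes unnoticed:
(binding [s/*coll-check-limit* 10]
  (s/valid? (s/every int?) (concat (range 1000) [:foo])))
;= true

;; Raising the limit past the prefix we care about makes the check catch it:
(binding [s/*coll-check-limit* 2000]
  (s/valid? (s/every int?) (concat (range 1000) [:foo])))
;= false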


testing if something is an empty list

Which way should I prefer to test if an object is an empty list in Clojure? Note that I want to test just this and not if it is empty as a sequence. If it is a "lazy entity" (LazySeq, Iterate, ...), I don't want it to get realized.
Below I give some possible tests for x.
;0
(= clojure.lang.PersistentList$EmptyList (class x))
;1
(and (list? x) (empty? x))
;2
(and (list? x) (zero? (count x)))
;3
(identical? () x)
Test 0 is a little low level and relies on "implementation details". My first version of it was (instance? clojure.lang.PersistentList$EmptyList x), which gives an IllegalAccessError. Why is that so? Shouldn't such a test be possible?
Tests 1 and 2 are higher level and more general, since list? checks if something implements IPersistentList. I guess they are slightly less efficient too. Notice that the order of the two sub-tests is important as we rely on short-circuiting.
Test 3 works under the assumption that every empty list is the same object. The tests I have done confirm this assumption but is it guaranteed to hold? Even if it is so, is it a good practice to rely on this fact?
All this may seem trivial but I was a bit puzzled not finding a completely straightforward solution (or even a built-in function) for such a simple task.
Update
Perhaps I did not formulate the question very well. In retrospect, I realized that what I wanted to test was if something is a non-lazy empty sequence. The most crucial requirement for my use case is that, if it is a lazy sequence, it does not get realized, i.e. no thunk gets forced.
Using the term "list" was a little confusing. After all what is a list? If it is something concrete like PersistentList, then it is non-lazy. If it is something abstract like IPersistentList (which is what list? tests and probably the correct answer), then non-laziness is not exactly guaranteed. It just so happens that Clojure's current lazy sequence types do not implement this interface.
So first of all I need a way to test if something is a lazy sequence. The best solution I can think of right now is to use IPending to test for laziness in general:
(def lazy? (partial instance? clojure.lang.IPending))
Although there are some lazy sequence types (e.g. chunked sequences like Range and LongRange) that do not implement IPending, it seems reasonable to expect that lazy sequences implement it in general. LazySeq does so and this is what really matters in my specific use case.
Now, relying on short-circuiting to prevent realization by empty? (and to prevent giving it an unacceptable argument), we have:
(defn empty-eager-seq? [x] (and (not (lazy? x)) (seq? x) (empty? x)))
Or, if we know we are dealing with sequences like in my case, we can use the less restrictive:
(defn empty-eager? [x] (and (not (lazy? x)) (empty? x)))
Of course we can write safe tests for more general types like:
(defn empty-eager-coll? [x] (and (not (lazy? x)) (coll? x) (empty? x)))
(defn empty-eager-seqable? [x] (and (not (lazy? x)) (seqable? x) (empty? x)))
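A quick sketch of how empty-eager-seq? behaves (assuming the lazy? and empty-eager-seq? definitions above):
(empty-eager-seq? ())          ;=> true
(empty-eager-seq? (list 1 2))  ;=> false
;; A lazy seq fails the (not (lazy? x)) check before empty? can realize anything:
(empty-eager-seq? (lazy-seq (println "realized!")))  ;=> false, prints nothing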
That being said, the recommended test 1 also works for my case, thanks to short-circuiting and the fact that LazySeq does not implement IPersistentList. Given this and that the question's formulation was suboptimal, I will accept Lee's succinct answer and thank Alan Thompson for his time and for the helpful mini-discussion we had with an upvote.
Option 0 should be avoided since it relies on a class within clojure.lang that is not part of the public API for the package. From the javadoc for clojure.lang:
The only class considered part of the public API is IFn. All other
classes should be considered implementation details.
Option 1 uses functions from the public API and avoids iterating the entire input sequence if it is non-empty.
Option 2 iterates the entire input sequence to obtain the count which is potentially expensive.
Option 3 does not appear to be guaranteed and can be circumvented with reflection:
(identical? '() (.newInstance (first (.getDeclaredConstructors (class '()))) (into-array [{}])))
=> false
Given these I'd prefer option 1.
Just use choice (1):
(ns tst.demo.core
  (:use tupelo.core tupelo.test))

(defn empty-list? [arg]
  (and (list? arg)
       (not (seq arg))))

(dotest
  (isnt (empty-list? (range)))
  (isnt (empty-list? [1 2 3]))
  (isnt (empty-list? (list 1 2 3)))
  (is (empty-list? (list)))
  (isnt (empty-list? []))
  (isnt (empty-list? {}))
  (isnt (empty-list? #{})))
with result:
-------------------------------
Clojure 1.10.1 Java 13
-------------------------------
Testing tst.demo.core
Ran 2 tests containing 7 assertions.
0 failures, 0 errors.
As you can see by the first test with (range), the infinite lazy seq didn't get realized by empty?.
Update
Choice 0 depends on implementation details (unlikely to change, but why bother?). Also, it is noisier to read.
Choice 2 will blow up for infinite lazy seq's.
Choice 3 is not guaranteed to work. You could have more than one list with zero elements.
Update #2
OK, you are correct re (2). We get:
(type (range)) => clojure.lang.Iterate
Notice that it is not a LazySeq, as both you and I had expected it to be.
So you are relying on a (non-obvious) detail to prevent getting to count, which will blow up for an infinite lazy seq. Too subtle for my taste. My motto: keep it as obvious as possible.
Re choice (3), again it relies on the implementation detail of (the current release of) Clojure. I could almost make it fail except that clojure.lang.PersistentList$EmptyList is a package-protected inner class, so I would have to really try hard (subvert Java inheritance) to make a duplicate instance of the class, which would then fail.
However, I can come close:
(import 'java.util.LinkedList)

(defn el3? [arg] (identical? () arg))

(dotest
  (spyx (type (range)))
  (isnt (el3? (range)))
  (isnt (el3? [1 3 3]))
  (isnt (el3? (list 1 3 3)))
  (is (el3? (list)))
  (isnt (el3? []))
  (isnt (el3? {}))
  (isnt (el3? #{}))
  (is (el3? ()))
  (is (el3? '()))
  (is (el3? (list)))
  (is (el3? (spyxx (rest [1]))))

  (let [jull (LinkedList.)]
    (spyx jull)
    (spyx (type jull))
    (spyx (el3? jull)))) ; ***** contrived, but it fails *****
with result:
jull => ()
(type jull) => java.util.LinkedList
(el3? jull) => false
So, I again make a plea to keep it obvious and simple.
There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies.
-- C.A.R. Hoare

What are side-effects in predicates and why are they bad?

I'm wondering what is considered to be a side-effect in predicates for fns like remove or filter. There seems to be a range of possibilities. Clearly, if the predicate writes to a file, this is a side-effect. But consider a situation like this:
(def *big-var-that-might-be-garbage-collected* ...)

(let [my-ref *big-var-that-might-be-garbage-collected*]
  (defn my-pred
    [x]
    (some-operation-on my-ref x)))
Even if some-operation-on is merely a query that does not change state, the fact that my-pred retains a reference to *big... changes the state of the system in that the big var cannot be garbage collected. Is this also considered to be side-effect?
In my case, I'd like to write to a logging system in a predicate. Is this a side effect?
And why are side-effects in predicates discouraged exactly? Is it because filter and remove and their friends work lazily so that you cannot determine when the predicates are called (and - hence - when the side-effects happen)?
GC is not typically considered when evaluating if a function is pure or not, although many actions that make a function impure can have a GC effect.
Logging is a side effect, as is changing any state in the program or the world. A pure function takes data and returns data, without modifying anything else.
https://softwareengineering.stackexchange.com/questions/15269/why-are-side-effects-considered-evil-in-functional-programming covers why side effects are avoided in functional languages.
I found this link helpful
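As a small sketch of the pure-versus-side-effecting distinction (the predicate names here are made up for illustration):
;; Pure: the result depends only on the argument and nothing else is touched.
(defn small? [x]
  (< x 10))

;; Impure: same answer, but every call also writes a log line, and with a lazy
;; filter/remove you cannot predict when, or how many times, that will happen.
(defn noisy-small? [x]
  (println "checking" x)   ; side effect
  (< x 10))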
The problem is determining when, or even whether, the side-effects will occur on any given call to the function.
If you only care that the same inputs return the same answer, you are fine. Side-effects are dependent on how the function is executed.
For example,
(first (filter odd? (range 20)))
; 1
But if we arrange for odd? to print its argument as it goes:
(first (filter #(do (print %) (odd? %)) (range 20)))
It will print 012345678910111213141516171819 before returning 1!
The reason is that filter, where it can, deals with its sequence argument in chunks of 32 elements.
If we take the limit off the range:
(first (filter #(do (print %) (odd? %)) (range)))
... we get a full-size chunk printed: 012345678910111213141516171819202122232425262728293031
Just printing the argument is confusing. If the side effects are significant, things could go seriously awry.

Idiomatic no-op/"pass"

What's the (most) idiomatic Clojure representation of no-op? I.e.,
(def r (ref {}))
...
(let [der @r]
  (match [(:a der) (:b der)]
    [nil nil] (do (fill-in-a) (fill-in-b))
    [_ nil]   (fill-in-b)
    [nil _]   (fill-in-a)
    [_ _]     ????))
Python has pass. What should I be using in Clojure?
ETA: I ask mostly because I've run into places (cond, e.g.) where not supplying anything causes an error. I realize that "most" of the time, an equivalent of pass isn't needed, but when it is, I'd like to know what's the most Clojuric.
I see the keyword :default used in cases like this fairly commonly.
It has the nice property of being recognizable in the output and or logs. This way when you see a log line like: "process completed :default" it's obvious that nothing actually ran. This takes advantage of the fact that keywords are truthy in Clojure so the default will be counted as a success.
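Plugged into the question's match expression (reusing its hypothetical fill-in-* functions), that would look something like this:
(match [(:a der) (:b der)]
  [nil nil] (do (fill-in-a) (fill-in-b))
  [_ nil]   (fill-in-b)
  [nil _]   (fill-in-a)
  ;; nothing to do: the truthy keyword documents itself in logs and REPL output
  [_ _]     :default)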
There are no "statements" in Clojure, but there are an infinite number of ways to "do nothing". An empty do block, (do), literally indicates that one is "doing nothing" and evaluates to nil. Also, I agree with the comment that the question itself indicates that you are not using Clojure in an idiomatic way, regardless of this specific stylistic question.
The most analogous thing that I can think of in Clojure to a "statement that does nothing" from imperative programming would be a function that does nothing. There are a couple of built-ins that can help you here: identity is a single-arg function that simply returns its argument, and constantly is a higher-order function that accepts a value, and returns a function that will accept any number of arguments and return that value. Both are useful as placeholders in situations where you need to pass a function but don't want that function to actually do much of anything. A simple example:
(defn twizzle [x]
  (let [f (cond (even? x)       (partial * 4)
                (= 0 (rem x 3)) (partial + 2)
                :else           identity)]
    (f (inc x))))
Rewriting this function to "do nothing" in the default case, while possible, would require an awkward rewrite without the use of identity.
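Along the same lines, constantly can stand in where an API insists on a callback you don't actually need; a small sketch (the handler name is hypothetical):
;; constantly builds a function that ignores its arguments and always returns
;; the given value, so it works as a "do nothing" placeholder of any arity.
(def no-op-handler (constantly nil))

(no-op-handler :event {:payload 42})
;=> nil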

Why does Clojure's gensym increase by three on each call?

Fairly new to lisps, but in looking into sequential integer generating code, I noticed that repeated calls to (gensym) would increase the number provided after the prefix by 3. I'm curious why that is the case.
user=> (gensym)
G__662
user=> (gensym)
G__665
user=> (gensym)
G__668
user=> (gensym)
G__671
user=> (gensym)
G__674
user=> (gensym)
G__677
I've seen and understand the combined use of atom and inc, but I'm new to the gensym function.
There are a number of correct answers here. One is: it doesn't!
user> (take 5 (repeatedly gensym))
(G__2173 G__2174 G__2175 G__2176 G__2177)
Another is: gensym doesn't make any guarantees as to the form of the symbols it generates, so you really shouldn't care whether they're sequential or not (or even if they contain numbers at all). You certainly shouldn't hijack gensym to produce an integer sequence.
Lastly: why does it increase by three in your example? Because each time you evaluate a form in the repl, the compiler has to create some gensyms of its own. Apparently, for the form (gensym), the number it needs to create is two.
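If sequential integers are what you are actually after, the atom-and-inc approach the question mentions is the straightforward way to get them; a minimal sketch (hypothetical names):
(def counter (atom 0))

(defn next-id
  "Returns 1, 2, 3, ... across calls; swap! makes the increment atomic."
  []
  (swap! counter inc))

(next-id) ;=> 1
(next-id) ;=> 2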
It doesn't!
=> (str (gensym) (gensym))
"G__4027G__4028"
Looking at the source of gensym we can see that it uses clojure.lang.RT/nextID.
(defn gensym
  ([prefix-string]
   (. clojure.lang.Symbol (intern (str prefix-string (str (. clojure.lang.RT (nextID))))))))
The nextID function is also used in the LispReader. So when you repeatedly evaluate (gensym), the reader is probably using two IDs.
I clearly have something else going on in my process too: if I wait any length of time between evaluations, more IDs are consumed and the gap between successive gensyms grows beyond 3.
https://github.com/clojure/clojure/search?q=nextid

In Clojure, are lazy seqs always chunked?

I was under the impression that the lazy seqs were always chunked.
=> (take 1 (map #(do (print \.) %) (range)))
(................................0)
As expected, 32 dots are printed because the lazy seq returned by range is chunked into 32-element chunks. However, when instead of range I try this with my own function get-rss-feeds, the lazy seq is no longer chunked:
=> (take 1 (map #(do (print \.) %) (get-rss-feeds r)))
(."http://wholehealthsource.blogspot.com/feeds/posts/default")
Only one dot is printed, so I guess the lazy-seq returned by get-rss-feeds is not chunked. Indeed:
=> (chunked-seq? (seq (range)))
true
=> (chunked-seq? (seq (get-rss-feeds r)))
false
Here is the source for get-rss-feeds:
(defn get-rss-feeds
  "returns a lazy seq of urls of all feeds; takes an html-resource from the enlive library"
  [hr]
  (map #(:href (:attrs %))
       (filter #(rss-feed? (:type (:attrs %))) (html/select hr [:link]))))
So it appears that chunkiness depends on how the lazy seq is produced. I peeked at the source for the function range and there are hints of it being implemented in a "chunky" manner. So I'm a bit confused as to how this works. Can someone please clarify?
Here's why I need to know.
I have the following code: (get-rss-entry (get-rss-feeds h-res) url)
The call to get-rss-feeds returns a lazy sequence of URLs of feeds that I need to examine.
The call to get-rss-entry looks for a particular entry (whose :link field matches the second argument of get-rss-entry). It examines the lazy sequence returned by get-rss-feeds. Evaluating each item requires an http request across the network to fetch a new rss feed. To minimize the number of http requests it's important to examine the sequence one-by-one and stop as soon as there is a match.
Here is the code:
(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) feeds))))
entry-with-url returns a lazy sequence of matches or an empty sequence if there is no match.
I tested this and it seems to work correctly (evaluating one feed url at a time). But I am worried that somewhere, somehow it will start behaving in a "chunky" way and it will start evaluating 32 feeds at a time. I know there is a way to avoid chunky behavior as discussed here, but it doesn't seem to even be required in this case.
Am I using lazy seq non-idiomatically? Would loop/recur be a better option?
You are right to be concerned. Your get-rss-entry will indeed call entry-with-url more than strictly necessary if the feeds parameter is a collection that returns chunked seqs. For example if feeds is a vector, map will operate on whole chunks at a time.
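You can see the effect directly by mapping a side-effecting function over a vector (a small sketch standing in for the feeds collection):
;; A vector's seq is chunked, so map realizes a whole 32-element chunk
;; even though only the first result is consumed:
(first (map #(do (print \.) %) (vec (range 100))))
;; prints 32 dots and returns 0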
This problem is addressed directly in Fogus' Joy of Clojure, with the function seq1 defined in chapter 12:
(defn seq1 [s]
  (lazy-seq
    (when-let [[x] (seq s)]
      (cons x (seq1 (rest s))))))
You could use this right where you know you want the most laziness possible, right before you call entry-with-url:
(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) (seq1 feeds)))))
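Wrapping the same kind of chunked source in seq1 restores one-at-a-time evaluation; a quick check (again using a plain vector as the source):
(first (map #(do (print \.) %) (seq1 (vec (range 100)))))
;; prints a single dot and returns 0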
Lazy seqs are not always chunked - it depends on how they are produced.
For example, the lazy seq produced by this function is not chunked:
(defn integers-from [n]
  (lazy-seq (cons n (do (print \.) (integers-from (inc n))))))
(take 3 (integers-from 3))
=> (..3 .4 5)
But many other Clojure built-in functions do produce chunked seqs for performance reasons (e.g. range).
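A quick way to check whether a given seq is chunked (a sketch contrasting a plain vector with the integers-from definition above):
(chunked-seq? (seq (vec (range 100))))
;=> true   ; vector seqs are chunked
(chunked-seq? (seq (integers-from 3)))
;=> false  ; the hand-rolled lazy seq hands out one cons cell at a time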
Depending on the vagaries of chunking seems unwise, as you mention above. Explicitly "un-chunking" in cases where you really need the sequence not to be chunked is also wise, because then if at some other point your code changes in a way that chunkifies it, things won't break. On another note, if you need actions to be sequential, agents are a great tool: you could send the download functions to an agent, and they will then be run one at a time, and only once, regardless of how you evaluate the function. At some point you may want to pmap your sequence, and then even un-chunking will not work, though using an atom will continue to work correctly.
I have discussed this recently in Can I un-chunk lazy sequences to realize one element at a time? and the conclusion is that if you need to control when items are produced/consumed, you should not use lazy sequences.
For processing you can use transducers, where you control when the next item is processed.
For producing the elements, the ideal approach is to reify ISeq. A practical approach is to use lazy-seq with a single cons call in it whose rest is a recursive call. But notice that this relies on an implementation detail of lazy-seq.
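As a sketch of the transducer approach mentioned above (not the original poster's code): the early-terminating (take 1) transducer stops the reduction as soon as one item has been produced, so the side-effecting step runs only as many times as needed, with no chunk-sized buffering:
(transduce (comp (map #(do (print \.) %))   ; side-effecting step, for visibility
                 (filter odd?)
                 (take 1))                  ; terminates the reduction early
           conj
           (range 100))
;; prints only ".." and returns [1]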