Using Filter on vectors - clojure

I am trying to use the filter function over a vector called dataset that is defined like so:
AK,0.89,0.98
AR,0.49,0.23
AN,0.21,0.78
...
And I want to get all of the values that contain a certain string, something like this:
(filter (contains "AK") dataset)
Which would return:
AK,0.89,0.98
Is it possible to do this using the filter function?
I already iterate over the vector using doseq but I'm required to use filter at some point in my code.
Thanks :)

The basic answer is yes, you can use filter to do this. Filter expects a
predicate function, i.e. a function which returns true or false. The filter
function will iterate over the elements in the collection you pass in and pass
each element from that collection to the predicate. What you do inside the
predicate function is totally up to you (though you should make sure to avoid
side effects). Filter will collect all the elements where the predicate returned
true into a new lazy sequence.
Essentially, you have (long form)
(filter (fn [element]
          ;; some test returning true/false
          )
        col)
where col is your collection. The result will be a LAZY SEQUENCE of elements
where the predicate function returned true. It is important to understand that
things like filter and map return lazy sequences and know what that really means.
The critical bit to understand is the structure of your collection. In your
description, you stated
I am trying to use the filter function over a vector called dataset
that is defined like so:
AK,0.89,0.98 AR,0.49,0.23 AN,0.21,0.78 ...
Unfortunately, your description is a little ambiguous. If your dataset structure
is actually a vector of vectors (not simply one flat vector), then things are very
straightforward, because each 'element' passed to the predicate function will be
one of your 'inner' vectors. The definition is more accurately represented as
[["AK" 0.89 0.98]
 ["AR" 0.49 0.23]
 ["AN" 0.21 0.78]
 ...]
and what will be passed to the predicate is a vector of 3 elements. If you just want
to select all the vectors where the first element is 'AK', then the predicate
function could be as simple as
(fn [el]
  (if (= "AK" (first el))
    true
    false))
So the full line would be something like
(filter (fn [el]
          (if (= "AK" (first el))
            true
            false))
        [["AK" 0.89 0.98] ["AR" 0.49 0.23] ["AN" 0.21 0.78]])
and that is just the verbose starting version. There is a lot you can do to
make this shorter, e.g.
(filter #(= "AK" (first %)) [..])
If on the other hand, you really do just have a single vector, then things
become a little more complicated because you would need to somehow group the
values. This could be done by using the partition function to break up your
vector into groups of 3 items before passing them to filter e.g.
(filter pred (partition 3 col))
which would group the elements in your original vector into groups of 3 and pass
each group to the predicate function. This is where the real power of map,
filter, reduce, etc come into play - you can transform the data, passing it
through a pipeline of functions, each of which somehow manipulates the data and
a final result pops out the end.
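For example, a minimal sketch assuming the data really is one flat vector of
alternating state codes and numbers:
;; hypothetical flat layout of the dataset
(def dataset ["AK" 0.89 0.98 "AR" 0.49 0.23 "AN" 0.21 0.78])

(filter #(= "AK" (first %)) (partition 3 dataset))
;; => (("AK" 0.89 0.98))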
The key point is to understand what filter (and other functions like this, such
as map or reduce) will treat as an 'element' of your input collection.
Basically, this is the same as what would be returned by calling first on the
collection; that is what gets passed to the predicate function in filter.
There are a lot of assumptions here. One of the main ones is that your data is
strictly ordered i.e. the value you are looking to test is always the first
element in each group. If this is not the case, then more work will need to be
done. Likewise, we assume the data is always in groups of 3. If it isn't, then other approaches will be needed.

Related

Non-map collection predicate?

Is there a Clojure predicate that means "collection, but not a map"?
Such a predicate is/would be valuable because there are many operations that can be performed on all collections except maps. For example, (apply + ...) or (reduce + ...) can be used with vectors, lists, lazy sequences, and sets, but not maps, since the elements of a map in such a context end up as clojure.lang.MapEntrys. It's sets and maps that cause the problem with the predicates I know of:
sequential? is true for vectors, lists, and lazy sequences, but it's false for both maps and sets. (seq? is similar but it's false for vectors.)
coll? and seqable? are true for both sets and maps, as well as for every other kind of collection I can think of.
Of course I can define such a predicate, e.g. like this:
(defn coll-but-not-map?
  [xs]
  (and (coll? xs)
       (not (map? xs))))
or like this:
(defn sequential-or-set?
  [xs]
  (or (sequential? xs)
      (set? xs)))
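For illustration, here is how these two user-defined predicates behave at the
REPL (assuming the definitions above):
(map coll-but-not-map? [[1 2] '(1 2) #{1 2} {:a 1} "abc"])
;; => (true true true false false)
(map sequential-or-set? [[1 2] '(1 2) #{1 2} {:a 1} "abc"])
;; => (true true true false false)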
I'm wondering whether there's a built-in clojure.core (or contributed library) predicate that does the same thing.
This question is related to this one and this one but isn't answered by their answers. (If my question is a duplicate of one I haven't found, I'm happy to have it marked as such.)
For example (apply + ...) or (reduce + ...) can be used with vectors, lists, lazy sequences, and sets, but not maps
I don't think this is really about collections. In your case, the problem is not with apply or reduce in general, but with the particular + function: (apply + [:a :b :c]) won't work either, even though we are using a vector there.
My point is that you are trying to solve a very domain-specific problem, which is why there is no generic solution in Clojure itself. So use whatever proper predicate you can think of.
There's nothing that I've found or used that fits this description. I think your own predicate function is clear, simple, and easy to include in your code if you find it useful.
Maybe you are writing code that has to be very generic, but it's usually the case that a function both accepts and returns a consistent type of data. There are cases where this is not true, but it's usually the case that if a function can be the jack of all trades, it's doing too much.
Using your example -- it makes sense to add a vector of numbers, a list of numbers, or a set of numbers. But a map of numbers? It doesn't make sense, unless maybe it's the values contained in the map, and in this case, it's not reasonable for a single piece of code to be expected to handle adding both sequential data and associative data. The function should be handed something it expects, and it should return something consistent. This kind of reminds me of Stuart Sierra's blog post discussing consistency in this regard. Without more information I'm only guessing as to your use case, but it's something to consider.

IndexedSeq VS. PersistentVector

Can somebody explain the difference between 'IndexedSeq' and 'PersistentVector' to me?
I bumped into this, when updating a vector in my data structure via 'rest'. Here's a REPL excerpt that shows the transformation.
=> (def xs [1 2 3])
...
=> (type xs)
cljs.core/PersistentVector
=> (def xs2 (rest xs))
...
=> (type xs2)
cljs.core/IndexedSeq
I'm holding a list in an app-state atom, which needs to be shifted once in a while, so the first item must disappear. Would be really cool, if anybody could give me a hint about which data structure might be preferable here in terms of performance.
Sometimes elements get pushed to the end of the list as well, so I guess it's a LIFO mechanism that I'm creating here.
From your last paragraph, it sounds like you're using this as a stack. Taken together, pop, peek, and conj form a stack interface that can be used with either lists or vectors (working on the front of a list or the end of a vector). I would use those.
If you're just using those functions, I don't think there should be any significant performance differences (all three functions should be constant time).
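A minimal sketch of that stack interface on a vector held in an atom (as with
the question's app-state; conj, peek, and pop all work at the end of a vector):
(def app-state (atom [1 2 3]))

(peek @app-state)        ;; => 3, the end of the vector
(swap! app-state pop)    ;; => [1 2]
(swap! app-state conj 4) ;; => [1 2 4]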
Looking at the superinterfaces here: http://static.javadoc.io/org.clojure/clojure/1.7.0/clojure/lang/IndexedSeq.html
I would guess it is not the most efficient choice here, since it is just a seq
with no guaranteed constant-time access to the nth member. To keep vector
semantics you should probably use subvec to remove the first element.
In general, if you don't need random access to elements, performance-wise it
should be enough to use concat to add an element to the end (it produces a lazy
sequence, won't consume the whole collection, and runs in constant time) and
rest to remove the first element (also constant time), giving you the FIFO queue
you describe. (This is still not the best variant, since it may lead to a stack
overflow if you do a lot of pushes without realizing the sequence.)
But it is probably better to use vectors, so the combination of conj, first, and
subvec should be your choice.
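A sketch of that vector-based approach (push-item and shift-item are
hypothetical helper names):
(defn push-item [v x] (conj v x))   ;; add to the end
(defn shift-item [v] (subvec v 1))  ;; drop the first element; result is still a vector

(-> [1 2 3]
    (push-item 4)
    shift-item)
;; => [2 3 4]
Note that subvec shares structure with the original vector, so the dropped head
is retained in memory until the subvector is copied into a fresh vector.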

Clojure reduce transducer

I am looking for a simple example of transducers with a reducing function.
I was hoping that the following would return a transducing function, since (filter odd?) works that way:
(def sum (reduce +))
clojure.lang.ArityException: Wrong number of args (1) passed to: core$reduce
My current understanding of transducers is that by omitting the collection argument, we get back a transducing function that we can compose with other transducing functions. Why is it different for filter and reduce ?
The reduce function doesn't return a transducer.
It's that way because reduce is a function that returns a single value, which
may or may not be a sequence. Other functions like filter or map always return
sequences (possibly empty), and that is what allows them to be composed.
To combine something with a reduce step, you can use the reducers library,
which gives functionality similar to what you want to do (if I understood this
correctly).
UPD:
OK, my earlier answer was a bit confusing, so let me expand.
First of all, let's take a look at how filter, map, and many other sequence
functions work: unsurprisingly, they can all be built on reduce, since each of
them is a reduction in the sense that it never creates a collection bigger than
its input. So, if every transformation is expressed as a reduction, you can
combine them and pass each value from the source collection through all of the
transformation steps to get the final value. This is a great way to improve
performance, because some values will be dropped along the way (as part of the
transformation), and there is only one logical cycle (I mean loop: you iterate
over the sequence only one time and pass every value through all the
transformations).
So, why is the reduce function itself so different, if everything is built on it?
All transducers are based on a simple and effective idea which, in my opinion,
looks like a sieve. But a reduction to a single value can only be the last step,
because it produces only one value as its result. So the only way you can use
reduce here is to provide a collection and a reducing function.
In the sieve analogy, reduce is like the funnel under the sieve: you take your
collection and throw it through functions like map, filter, and take, and as you
can see, the size of the new collection resulting from the transformation will
never be bigger than the input collection. The last step may then be a reduce,
which takes the sieved collection and produces one value based on everything
that was done.
There is also a function which lets you combine transducers with a reducing
step: transduce. It also requires a reducing function, because that function is
the last step, the funnel under our sieve.
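For example, transduce composes a transducer with a final reducing function,
which is the closest core analogue of what the question was reaching for:
(transduce (comp (filter odd?) (map inc)) + [1 2 3 4 5])
;; odd? keeps 1 3 5, inc turns them into 2 4 6, + sums them => 12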
The reducers library is similar to transducers and also allows reduce only as
the last step; it is just another approach to achieve the same thing as a
transducer.
To achieve what you want, you can use partial instead. It would be like this:
(def sum
  (partial reduce +'))

(sum [1 2 3])
will return the obvious answer of 6.

Why clojure collections don't implement ISeq interface directly?

Every collection in Clojure is said to be "seqable" but only list and cons are actually seqs:
user> (seq? {:a 1 :b 2})
false
user> (seq? [1 2 3])
false
All other seq functions first convert a collection to a sequence and only then operate on it.
user> (class (rest {:a 1 :b 2}))
clojure.lang.PersistentArrayMap$Seq
I cannot do things like:
user> (:b (rest {:a 1 :b 2}))
nil
user> (:b (filter #(-> % val (= 1)) {:a 1 :b 1 :c 2}))
nil
and have to coerce back to concrete data type. This looks like bad design to me, but most likely I just don't get it as yet.
So, why don't Clojure collections implement the ISeq interface directly, and why don't the seq functions return an object of the same class as their input?
This has been discussed on the Clojure google group; see for example the thread map semantics from February of this year. I'll take the liberty of reusing some of the points I made in my message to that thread below while adding several new ones.
Before I go on to explain why I think the "separate seq" design is the correct one, I would like to point out that a natural solution for the situations where you'd really want to have an output similar to the input without being explicit about it exists in the form of the function fmap from the contrib library algo.generic. (I don't think it's a good idea to use it by default, however, for the same reasons for which the core library design is a good one.)
Overview
The key observation, I believe, is that the sequence operations like map, filter etc. conceptually divide into three separate concerns:
some way of iterating over their input;
applying a function to each element of the input;
producing an output.
Clearly 2. is unproblematic if we can deal with 1. and 3. So let's have a look at those.
Iteration
For 1., consider that the simplest and most performant way to iterate over a collection typically does not involve allocating intermediate results of the same abstract type as the collection. Mapping a function over a chunked seq over a vector is likely to be much more performant than mapping a function over a seq producing "view vectors" (using subvec) for each call to next; the latter, however, is the best we can do performance-wise for next on Clojure-style vectors (even in the presence of RRB trees, which are great when we need a proper subvector / vector slice operation to implement an interesting algorithm, but would make traversals terrifyingly slow if we used them to implement next).
In Clojure, specialized seq types maintain traversal state and extra functionality such as (1) a node stack for sorted maps and sets (apart from better performance, this has better big-O complexity than traversals using dissoc / disj!), (2) current index + logic for wrapping leaf arrays in chunks for vectors, (3) a traversal "continuation" for hash maps. Traversing a collection through an object like this is simply faster than any attempt at traversing through subvec / dissoc / disj could be.
Suppose, however, that we're willing to accept the performance hit when mapping a function over a vector. Well, let's try filtering now:
(->> some-vector (map f) (filter p?))
There's a problem here -- there's no good way to remove elements from a vector. (Again, RRB trees could help in theory, but in practice all the RRB slicing and concatenating involved in producing "real vector" for filtering operations would absolutely destroy performance.)
Here's a similar problem. Consider this pipeline:
(->> some-sorted-set (filter p?) (map f) (take n))
Here we benefit from laziness (or rather, from the ability to stop filtering and mapping early; there's a point involving reducers to be made here, see below). Clearly take could be reordered with map, but not with filter.
The point is that if it's ok for filter to convert to seq implicitly, then it is also ok for map; and similar arguments can be made for other sequence functions. Once we've made the argument for all -- or nearly all -- of them, it becomes clear that it also makes sense for seq to return specialized seq objects.
Incidentally, filtering or mapping a function over a collection without producing a similar collection as a result is very useful. For example, often we care only about the result of reducing the sequence produced by a pipeline of transformations to some value or about calling a function for side effect at each element. For these scenarios, there is nothing whatsoever to be gained by maintaining the input type and quite a lot to be lost in performance.
Producing an output
As noted above, we do not always want to produce an output of the same type as the input. When we do, however, often the best way to do so is to do the equivalent of pouring a seq over the input into an empty output collection.
In fact, there is absolutely no way to do better for maps and sets. The fundamental reason is that for sets of cardinality greater than 1 there is no way to predict the cardinality of the output of mapping a function over a set, since the function can "glue together" (produce the same outputs for) arbitrary inputs.
Additionally, for sorted maps and sets there is no guarantee that the input set's comparator will be able to deal with outputs from an arbitrary function.
So, if in many cases there is no way to, say, map significantly better than by doing a seq and an into separately, and considering how both seq and into make useful primitives in their own right, Clojure makes the choice of exposing the useful primitives and letting users compose them. This lets us map and into to produce a set from a set, while leaving us the freedom to not go on to the into stage when there is no value to be gained by producing a set (or another collection type, as the case may be).
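As a small illustration of composing seq operations with into (and of how
mapping can "glue together" elements of a set):
;; six inputs collapse to three outputs
(into #{} (map #(mod % 3) #{1 2 3 4 5 6}))
;; => #{0 1 2}

;; ...or skip the into stage entirely when no set is needed
(reduce + (map #(mod % 3) #{1 2 3 4 5 6}))
;; => 6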
Not all is seq; or, consider reducers
Some of the problems with using the collection types themselves when mapping, filtering etc. don't apply when using reducers.
The key difference between reducers and seqs is that the intermediate objects produced by clojure.core.reducers/map and friends only produce "descriptor" objects that maintain information on what computations need to be performed in the event that the reducer is actually reduced. Thus, individual stages of the computation can be merged.
This allows us to do things like
(require '[clojure.core.reducers :as r])
(->> some-set (r/map f) (r/filter p?) (into #{}))
Of course we still need to be explicit about our (into #{}), but this is just a way of saying "the reducers pipeline ends here; please produce the result in the form of a set". We could also ask for a different collection type (a vector of results perhaps; note that mapping f over a set may well produce duplicate results and we may in some situations wish to preserve them) or a scalar value ((reduce + 0)).
Summary
The main points are these:
the fastest way to iterate over a collection typically doesn't involve producing intermediate results similar to the input;
seq uses the fastest way to iterate;
the best approach to transforming a set by mapping or filtering involves using a seq-style operation, because we want to iterate very fast while accumulating an output;
thus seq makes a great primitive;
map and filter, in their choice to deal with seqs, may (depending on the scenario) avoid performance penalties that would bring no upside, benefit from laziness etc., yet can still be used to produce a collection result with into;
thus they too make great primitives.
Some of these points may not apply to a statically typed language, but of course Clojure is dynamic. Additionally, when we do want a return type that matches the input type, we're simply forced to be explicit about it, and that, in itself, may be viewed as a good thing.
Sequences are a logical list abstraction. They provide access to a (stable) ordered sequence of values. They are implemented as views over collections (except for lists where the concrete interface matches the logical interface). The sequence (view) is a separate data structure that refers into the collection to provide the logical abstraction.
Sequence functions (map, filter, etc) take a "seqable" thing (something which can produce a sequence), call seq on it to produce the sequence, and then operate on that sequence, returning a new sequence. It is up to you whether you need to or how to re-collect that sequence back into a concrete collection. While vectors and lists are ordered, sets and maps are not and thus sequences over these data structures must compute and retain the order outside the collection.
Specialized functions like mapv, filterv, reduce-kv allow you to stay "in the collection" when you know you want the operation to return a collection at the end instead of sequence.
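For instance, a quick REPL sketch of the difference:
(map inc [1 2 3])        ;; => (2 3 4) -- a seq, not a vector
(mapv inc [1 2 3])       ;; => [2 3 4] -- stays a vector
(filterv odd? [1 2 3 4]) ;; => [1 3]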
Seqs are ordered structures, whereas maps and sets are unordered. Two maps that are equal in value may have a different internal ordering. For example:
user=> (seq (array-map :a 1 :b 2))
([:a 1] [:b 2])
user=> (seq (array-map :b 2 :a 1))
([:b 2] [:a 1])
It makes no sense to ask for the rest of a map, because it's not a sequential structure. The same goes for a set.
So what about vectors? They're sequentially ordered, so we could potentially map across a vector, and indeed there is such a function: mapv.
You may well ask: why is this not implicit? If I pass a vector to map, why doesn't it return a vector?
Well, first that would mean making an exception for ordered structures like vectors, and Clojure isn't big on making exceptions.
But more importantly you'd lose one of the most useful properties of seqs: laziness. Chaining together seq functions, such as map and filter is a very common operation, and without laziness this would be much less performant and far more memory-intensive.
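For example, laziness is what lets a chained pipeline run over an infinite seq
and realize only what it needs:
(->> (range)        ;; infinite lazy seq
     (map #(* % %))
     (filter even?)
     (take 3))
;; => (0 4 16)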
The collection classes follow a factory pattern, i.e. instead of implementing ISeq they implement Seqable: you can create an ISeq from the collection, but the collection itself is not an ISeq.
Now even if these collections implemented ISeq directly, I am not sure how that would solve your problem of having general-purpose sequence functions return the original type of object. That would not make sense, because these general-purpose functions are supposed to work on ISeq; they have no idea which object gave them the ISeq.
Example in java:
interface ISeq {
    ....
}

class A implements ISeq {
}

class B implements ISeq {
}

static class Helpers {
    /*
      Filter can only work with ISeq, that's what makes it general purpose.
      There is no way it could return A or B objects.
    */
    public static ISeq filter(ISeq coll, ...) { }
    ...
}

Idiomatic way to get first element of a lazy seq in clojure

When processing each element in a seq I normally use first and rest.
However these will cause a lazy-seq to lose its "laziness" by calling seq on the argument. My solution has been to use (first (take 1 coll)) and (drop 1 coll) in their place when working with lazy-seqs, and while I think drop 1 is just fine, I don't particularly like having to call first and take to get the first element.
Is there a more idiomatic way to do this?
The docstrings for first and rest say that these functions call seq on their arguments to convey the idea that you don't have to call seq yourself when passing in a seqable collection which is not in itself a seq, like, say, a vector or set. For example,
(first [1 2 3])
;= 1
would not work if first didn't call seq on its argument; you'd have to say
(first (seq [1 2 3]))
instead, which would be inconvenient.
Both take and drop also call seq on their arguments, otherwise you couldn't call them on vectors and the like, as explained above. In fact this is true of all standard seq functions -- those which do not call seq directly are built upon lower-level components which do.
In no way does this impair the laziness of lazy seqs. The forcing / realization which happens as a result of a first / rest call is the smallest amount possible to obtain the requested result. (How much that is depends on the type of the argument; if it is not in fact lazy, there is no extra realization involved in the first call; if it is partly lazy -- that is, chunked -- there will be some extra realization (up to 32 initial elements will be computed at once); if it's fully lazy, only the first element will be computed.)
Clearly first, when passed a lazy seq, must force the realization of its first element -- that's the whole point. rest is actually somewhat lazy in that it doesn't force the realization of the "rest" part of the seq (in contrast to next, which is basically equivalent to (seq (rest ...))). The fact that it does force the first element to be realized so that it can skip over it immediately is a conscious design choice which avoids unnecessary layering of lazy seq objects and holding on to the head of the original seq; you could say something like (lazy-seq (rest xs)) to defer even this initial realization, at the cost of holding on to xs until the lazy seq wrapper is realized.
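A small REPL sketch of these differences, assuming an unchunked source so that
realization happens one element at a time (a chunked source like range would
realize up to 32 elements at once); the println is only there to make
realization visible:
(defn noisy []
  (map #(do (println "realized" %) %) (iterate inc 0)))

(def a (rest (noisy)))
;; prints: realized 0 -- rest eagerly realizes the head in order to skip it

(def b (lazy-seq (rest (noisy))))
;; prints nothing yet; the deferred rest runs only when b is forced

(first b)
;; prints: realized 0, then realized 1; returns 1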