I'm trying to come up with a data structure for exploring data that have been marked with key terms, like "systems theory" or "Internet", using some set and lattice theory concepts I like. I thought maybe I could extend the way that hash maps work. I wrote some tests for the behavior I want, and then I realized I don't really understand how to work, or to do work, with types and protocols.
Here's the idea. I want to index a collection of data by sets of strings. E.g.,
(def data { #{"systems theory" "internet"} [1 2 3]
#{"systems theory" "biology"} [4 5]
#{"systems theory"} [6 7 8] })
For free, I get
(data #{"systems theory"})
;=> [6 7 8]
which is good.
But it would also be slick to be able to do something like
(data "biology")
;=> { #{"systems theory"} [4 5] }
When I thought of this I figured it wouldn't be difficult to tell the get method of PersistentHashMap to act as normal, unless it's being asked to use a String as a key, in which case, do whatever is necessary to get the new data structure. But when it came to writing code I just had a mess, and I don't actually know how to design this thing.
I have my copy of Fogus's The Joy of Clojure and I'm going to read about types and protocols and extend-type and such and see if I can make sense of how and where built-in functions are defined and changed. But I would also love a hint.
I would not create a new specialized map implementation but create a simple index map from the original data:
(def data {#{"systems theory" "internet"} [1 2 3]
#{"systems theory" "biology"} [4 5]
#{"systems theory"} [6 7 8] })
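(def cats (->> data
               (map (fn [[cats val]]
                      (->> cats
                           (map (juxt identity #(hash-map (disj cats %) val)))
                           (into {}))))
               ;; merge-with merge keeps every entry when a term occurs in several key sets
               (apply merge-with merge)))
(get cats "internet")
;=> {#{"systems theory"} [1 2 3]}
(get cats "biology")
;=> {#{"systems theory"} [4 5]}
(get cats "systems theory")
;=> {#{"internet"} [1 2 3], #{"biology"} [4 5], #{} [6 7 8]} (entry order may vary)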
You could also merge both of them if you want to:
(def full-index (merge data cats))
(get full-index "internet") ;=> {#{"systems theory"} [1 2 3]}
(get full-index #{"systems theory"}) ;=> [6 7 8]
If you still want to create the specialized map implementation, you might want to take a look into the following:
PersistentHashMap implementation
sorted.clj: "An implementation of Clojure's sorted collections written in Clojure". For instance, see the code for PersistentTreeMap, which is used to implement sorted-map
data.avl: "Persistent sorted maps and sets with log-time rank queries"
data.priority-map: "A priority map is very similar to a sorted map, but whereas a sorted map produces a sequence of the entries sorted by key, a priority map produces the entries sorted by value." Perhaps the code is easier to understand than the others.
It won't be easy if you want to keep the semantics of a hash-map (for example count should return the sum of the original map count plus the new keys count). You might want to use collection-check to test your implementation.
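As a hint for the specialized route, here is a minimal sketch, not a full map implementation. It assumes a hypothetical TermIndex type (the name is made up for illustration) that wraps an ordinary map and implements only clojure.lang.ILookup and the one-argument arity of clojure.lang.IFn. That is enough for get and direct invocation, but count, seq, assoc and friends are deliberately left out.

```clojure
(def data {#{"systems theory" "internet"} [1 2 3]
           #{"systems theory" "biology"}  [4 5]
           #{"systems theory"}            [6 7 8]})

;; TermIndex is a hypothetical wrapper type, not part of any library.
(deftype TermIndex [m]
  clojure.lang.ILookup
  (valAt [this k] (.valAt this k nil))
  (valAt [_ k not-found]
    (if (string? k)
      ;; String key: collect every entry whose key set contains the term,
      ;; with the term itself removed from the key.
      (into {} (for [[terms v] m
                     :when (contains? terms k)]
                 [(disj terms k) v]))
      ;; Any other key: behave like the wrapped map.
      (get m k not-found)))
  clojure.lang.IFn
  (invoke [this k] (.valAt this k)))

(def idx (TermIndex. data))

(get idx #{"systems theory"}) ;=> [6 7 8]
(idx "biology")               ;=> {#{"systems theory"} [4 5]}
```

Making such a type behave like a real map everywhere (printing, equality, assoc, count) means implementing several more interfaces, which is where the libraries listed above are useful reading.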
What you're describing may be possible, but I think you would be better off just writing a function to filter through your list for any sets containing your search terms.
Also, consider the access patterns you are going to be using. I suspect having the strings as keys and the document ids in a set may be more efficient and more flexible.
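Both suggestions can be sketched in a few lines. The helper names by-term and inverted-index are hypothetical, and the sketch assumes the vectors in data hold document ids.

```clojure
(def data {#{"systems theory" "internet"} [1 2 3]
           #{"systems theory" "biology"}  [4 5]
           #{"systems theory"}            [6 7 8]})

;; Plain function instead of a custom map type: keep the entries whose
;; key set contains term, dropping term from each key.
(defn by-term [m term]
  (into {}
        (for [[terms v] m
              :when (contains? terms term)]
          [(disj terms term) v])))

(by-term data "biology")
;=> {#{"systems theory"} [4 5]}

;; Inverted index: each term maps to the set of ids tagged with it.
(defn inverted-index [m]
  (apply merge-with into
         (for [[terms ids] m
               term terms]
           {term (set ids)})))

(get (inverted-index data) "biology")
;=> #{4 5}
```

With the inverted index, a query for several terms becomes a set intersection of the per-term id sets (clojure.set/intersection).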
I am in the second chapter of the Programming Clojure book, and I came across this paragraph -
Because Clojure data structures are immutable and implement hashCode
correctly, any Clojure data structure can be a key in a map.
I cannot understand how the feature mentioned in the above quote would be advantageous.
I would appreciate it if someone could help me understand this using an example or point me to the right resources.
This can be useful when forming a data structure that has composite keys i.e. keys that consist of more than one piece of information.
As a simple example, say we have a graph with vertices :a :b and :c and we wish to have a data structure which enables lookup of the cost metric associated with any edge. We can use a Clojure map where each key is a set:
(def cost {#{:a :b} 5
#{:b :c} 6
#{:c :a} 2})
We can now look up the cost associated with any edge:
(get cost #{:c :b}) ; => 6
In order to answer your question there are two things we need to discuss:
Hash Tables
Value vs Reference
In a hash table (and I'm grossly oversimplifying here) you take a "key" and you run it through a hashing function that converts that key into a unique* identifier that is then associated with an address in memory which holds a particular "value". This is the underlying abstraction for higher-level associative data structures like Clojure maps or Python dictionaries: you give me the key, I hash it, look up the address, give you back the thing.
In order for this scheme to work, the same key has to always hash to the same value for some definition of "same", otherwise you couldn't get the thing back out of the data structure.
In Java (and many other languages) the definition of "same" boils down to reference:
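public class Foo {
    int x = 5;

    public static void main(String[] args) {
        Foo myObj = new Foo();
        Foo myOtherObj = new Foo();
        System.out.println(myObj == myOtherObj); // false: different references
        myObj.x = 6;
        System.out.println(myObj == myObj); // true: same reference, even though x changed
    }
}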
Even though myObj and myOtherObj both hold the value 5 at the point of comparison, they are not equal according to the rules of Java's ==, because they are different references. That last comparison, which looks nonsensical if you've never worked in a different model, highlights the problem: when we create myObj it has a value of 5, but at a later point in time it has a value of 6. Is it still the same? Again, Java says yes.
And when we get to hashing something that is a potential key for a hash table that distinction matters: how do we maintain a consistent thing to feed the hash function? Is it the value (x) the container holds or the container (Foo instance)?
Python takes an interesting approach here:
ls = [1, 2] # list
tp = (1, 2) # tuple
st = set() # empty set
st.add(tp) # fine
st.add((1, 2)) # still only has 1 element, (1, 2) == (1, 2)
st.add(ls) # Error: unhashable type list!
In Python, you can't use mutable built-in containers such as lists as set members or dictionary keys: a mutable object is saying "the meaning of this thing changes", so it is unsuitable as a hashed key. But you can use any immutable type as a hash key (even an immutable container, as long as its contents are hashable). Note that (1, 2) == (1, 2) is true, unlike Java's ==, which compares two containers holding the same values by reference. Python's == compares lists and tuples alike by value; what differs is that only the immutable tuple is hashable.
But in Clojure, everything** is immutable. Everything can be used as a hash key, because everything has a consistent value through time that can be fed to the hash function:
(def x [1 2])
(def y { x 5 })
(get y x) ; 5
(get y [1 2]) ; 5
When we look up the vector bound to x in the map bound to y we get 5. Since vectors are immutable, we don't have to worry about identity: we don't have to pass around a reference like we do in Java, we can just create the value and use it as a lookup key.
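A small sketch of the same point, using only core functions: a key that is built separately but has the same value hashes the same and finds the same entry, and this also holds for nested collections. The names y and nested are just for illustration.

```clojure
;; Two separately created vectors with the same contents hash identically.
(= (hash [1 2]) (hash (vec (range 1 3))))
;=> true

(def y {[1 2] 5})

;; The lookup key is a fresh vector, not the one used to build the map.
(get y (vec (range 1 3)))
;=> 5

;; Maps and sets work as keys too, because they are values as well.
(def nested {{:x 1} :map-key, #{:a :b} :set-key})

(get nested (hash-map :x 1)) ;=> :map-key
(get nested (set [:b :a]))   ;=> :set-key
```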
* They're not entirely unique: per the pigeonhole principle, unless the hashed output is at least as large as the input you will have collisions where two keys hash to the same value. But for our purposes here, they're unique.
** Not quite everything, but close enough.
For its own data structures Clojure uses hasheq to create the hash of an object. The behaviour is consistent with =, which means that, for example, the list '(1 2 3) is equal to the vector [1 2 3] and also hashes to the same value.
All Clojure data structures implement IHashEq, which means they have a hasheq function. If we compare the implementations of hasheq for PersistentList (which extends ASeq) and APersistentVector, we see that both compute an ordered Murmur3 hash over the hasheq of their elements, so sequential collections with equal contents get the same hash. For primitive types, hasheq returns a consistent value for the same values using a hashing algorithm called Murmur3.
If you look at the implementation of hasheq for longs, we see that hashLong of the Murmur3 class is used, and we get consistent hash code values:
user=> (clojure.lang.Murmur3/hashLong 123)
823512154
user=> (clojure.lang.Murmur3/hashLong 123)
823512154
Similar for other primitive types.
Note that Java's hashcode function has similar behaviour for two types of ordered lists:
user=> (.hashCode (java.util.ArrayList. [1 2 3]))
30817
user=> (.hashCode (java.util.LinkedList. [1 2 3]))
30817
So an ArrayList and a LinkedList have the same hash code as long as they have the same contents.
But a standard Java array returns different hashcodes not based on its contents, but specific to the instance:
user=> (.hashCode (to-array [1 2 3]))
1736458419
user=> (.hashCode (to-array [1 2 3]))
739210872
so two instances of similar arrays are not equal:
user=> (= (to-array [1 2 3]) (to-array [1 2 3]))
false
Now, why is this relevant when using Clojure data structures as keys in a map?
If we create a map with a vector or list as the key, we can look up this value:
user=> (def m {[1 2 3] :foo})
#'user/m
user=> (get m [1 2 3])
:foo
user=> (m [1 2 3])
:foo
also when using another data structure that is equal (=):
user=> (m '(1 2 3))
:foo
This is not possible with Java arrays that have an implementation for their hash codes that is based on the instance, not on the content:
user=> (def m {(to-array [1 2 3]) :foo})
#'user/m
user=> (m (to-array [1 2 3]))
nil
When using Clojure most things are coded using its data structures and it's advantageous that the lookups work based on the content of the data structure ((get-in {[1 2] {#{2 3} :foo}} ['(1 2) #{3 2}]) ;; => :foo).
Input
(def my-cat "meaw")
(def my-dog "baw")
(def my-pets {my-cat "Luna"
my-dog "Lucky"})
Output
(get my-pets my-cat) ;=> "Luna"
(:key my-pets my-cat) ;=> "meaw" (my-pets has no :key entry, so the default, my-cat, is returned)
(get my-pets "baw") ;=> "Lucky"
(get my-pets "meaw") ;=> "Luna"
I was wondering the best way to accomplish this:
Pass along metadata important for the pipeline, but not actually part of the value
Here's what I have:
; Attach the metadata
^{:doc "How obj works!"} [1 2 3]
; [1 2 3]
; ensure that it's attached
(meta ^{:doc "How obj works!"} [1 2 3])
; {:doc "How obj works!"}
; map over the values
(map inc ^{:doc "How obj works!"} [1 2 3])
; (2 3 4)
; try and get the metadata after the map
(meta (map inc ^{:doc "How obj works!"} [1 2 3]))
; nil
I'm pretty sure I know why this is happening, but I'd like to know if there's a good way to do this, or if there's a better way to approach this problem.
Regarding a better way, there probably is. If I had an object substantial enough to require a metadata docstring explaining how it works, I think I would define records or types, and maybe a protocol.
Without knowing what you're doing or why, however, I have no informed opinion on the matter.
If you are certain you want to map a function and preserve metadata, you could try something along the outline of the following:
(defn meta-preserving-map
[f & cs]
(let [data (apply merge (map meta cs))]
(with-meta (apply map f cs) data)))
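If the result should stay a vector as well as keep its metadata, a variant built on mapv is one option. This is only a sketch for the single-collection case, and meta-preserving-mapv is a hypothetical name.

```clojure
;; Hypothetical single-collection variant: mapv returns a vector, and
;; with-meta copies the input's metadata onto it.
(defn meta-preserving-mapv
  [f coll]
  (with-meta (mapv f coll) (meta coll)))

(def v (with-meta [1 2 3] {:doc "How obj works!"}))

(meta-preserving-mapv inc v)
;=> [2 3 4]

(meta (meta-preserving-mapv inc v))
;=> {:doc "How obj works!"}
```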
I am totally new to clojure (started learning yesterday) and functional programming so please excuse my ignorance. I've been trying to read a lot of the clojure documentation, but much of it is totally over my head.
I'm trying to iterate over an ArrayMap of this set up:
{city1 ([[0 0] [0 1] [1 1] [1 0]]), city2 ([[3 3] [3 4] [4 4] [4 3]]), city3 ([[10 10] [10 11] [11 11] [11 10]])}
(^hopefully that syntax is correct, that is what it looks like my terminal is printing)
where the city name is mapped to a vector of vectors that define the points that make up that city's borders. I need to compare all of these points with an outside point in order to determine if the outside point is in one of these cities and if so which city it is in.
I'm using the Ray Casting Algorithm detailed here to determine if an outside point is within a vector of vectors.
Maps are seqable (they implement clojure.lang.Seqable), which means that you can use all the higher-level sequence operations on them. The single elements are pairs of the form [key value], so, to find the first element that matches a predicate in-city? you could e.g. use some:
(some
(fn [[city-name city-points]] ;; the current entry of the map
(when (in-city? the-other-point city-points) ;; check the borders
city-name)) ;; return the name of a matching city
cities)
You might also use keep to find all elements that match the predicate but I guess there is no overlap between cities in your example.
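To make the snippet above runnable, here is one possible in-city? based on the usual even-odd ray casting test. It is a sketch under assumptions: each city maps directly to a vector of [x y] border points, city names are strings, and find-city is a hypothetical wrapper. Points lying exactly on a border are not treated specially.

```clojure
(def cities {"city1" [[0 0] [0 1] [1 1] [1 0]]
             "city2" [[3 3] [3 4] [4 4] [4 3]]
             "city3" [[10 10] [10 11] [11 11] [11 10]]})

(defn in-city?
  "Even-odd rule: cast a horizontal ray from [px py] to the right and
  count the polygon edges it crosses; an odd count means inside."
  [[px py] points]
  (let [edges (map vector points (concat (rest points) [(first points)]))]
    (odd?
     (count
      (filter (fn [[[x1 y1] [x2 y2]]]
                (and (not= (> y1 py) (> y2 py)) ; edge spans the ray's y
                     (< px (+ x1 (/ (* (- py y1) (- x2 x1))
                                    (- y2 y1))))))
              edges)))))

(defn find-city
  "Name of the first city containing p, or nil."
  [p]
  (some (fn [[city-name city-points]]
          (when (in-city? p city-points)
            city-name))
        cities))

(find-city [0.5 0.5]) ;=> "city1"
(find-city [3.5 3.5]) ;=> "city2"
(find-city [5 5])     ;=> nil
```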
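Update: Let's back off a little bit, since working with sequences is fun. I'm not gonna dive into all the sequence types and just use vectors ([1 2 3 ...]) for examples; results are written as vectors too, although the REPL prints lazy sequences in parentheses, e.g. (3 4 5).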
Okay, for a start, let's access our vector:
(first [1 2 3]) ;; => 1
(rest [1 2 3]) ;; => [2 3]
(last [1 2 3]) ;; => 3
(nth [1 2 3] 1) ;; => 2
The great thing about functional programming is that functions are just values which you can pass to other functions. For example, you might want to apply a function (let's say "add 2 to a number") to each element in a sequence. This can be done via map:
(map
(fn [x]
(+ x 2))
[1 2 3])
;; => [3 4 5]
If you haven't seen it yet, there is a shorthand for function values where % is the first parameter, %2 is the second, and so on:
(map #(+ % 2) [1 2 3]) ;; => [3 4 5]
This is concise and useful and you'll probably see it a lot in the wild. Of course, if your function has a name or is stored in a var (e.g. by using defn) you can use it directly:
(map pos? [-1 0 1]) ;; => [false false true]
Using the predicate like this does not make a lot of sense since you lose the actual values that produce the boolean result. How about the following?
(filter pos? [-1 0 1]) ;; => [1]
(remove pos? [-1 0 1]) ;; => [-1 0]
This selects or discards the values matching your predicate. Here, you should be able to see the connection to your city-border example: You want to find all the cities in a map that include a given point p. But maps are not sequences, are they? They can be treated as sequences of [key value] pairs:
(seq {:a 0 :b 1}) ;; => [[:a 0] [:b 1]]
Oh my, the possibilities!
(map first {:a 0 :b 1}) ;; => [:a :b]
(filter #(pos? (second %)) {:a 0 :b 1}) ;; => [[:b 1]]
filter retrieves all the matching cities (and their coordinates) but since you are only interested in the names - which are stored as the first element of every pair - you have to extract it from every element, similarly to the following (simpler) example:
(map first (filter #(pos? (second %)) {:a 0 :b 1}))
;; => [:b]
There actually is a function that combines map and filter. It's called keep and returns every non-nil value its function produces. You can thus check the second element of every pair and then return the first:
(keep
(fn [pair]
(when (pos? (second pair))
(first pair)))
{:a 0 :b 1})
;; => [:b]
Every time you see yourself using a lot of firsts and seconds, maybe a few rests in between, you should think of destructuring. It helps you access parts of values in an easy way. I'll not go into detail here, but it can be used with sequences quite intuitively:
(keep
(fn [[a b]] ;; instead of the name 'pair' we give the value's shape!
(when (pos? b)
a))
{:a 0 :b 1})
;; => [:b]
If you're only interested in the first result you can, of course, directly access it and write something like (first (keep ...)). But, since this is a pretty common use case, Clojure offers some for it. It's like keep but will not look beyond the first match. Let's dive into your city example whose solution should begin to make sense by now:
(some
(fn [[city-name city-points]]
(when (in-city? p city-points)
city-name))
all-cities)
So, I hope this can be useful to you.
Let's say I have a data structure like so:
[[1 2 3] [4 5 6] [[7 8 9] [10 11 12]]]
And what I want to end up with is:
[[1 2 3] [4 5 6] [7 8 9] [10 11 12]]
Is there any function that does this automatically?
Basically I'm converting/transforming a SQL result set to CSV, and there are some rows that will transform to 2 rows in the CSV. So my map function in the normal case returns a vector, but sometimes returns a vector of vectors. clojure.data.csv needs a list of vectors only, so I need to flatten out the rows that got pivoted.
Mapcat is useful for mapping where each element can expand into 0 or more output elements, like this:
(mapcat #(if (vector? (first %)) % [%]) data)
Though I'm not sure if (vector? (first %)) is a sufficient test for your data.
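Applied to the data from the question, the mapcat approach can be tried like this. The names data and semi-flatten are illustrative, and the sketch assumes every row is non-empty and nested at most one level deep.

```clojure
(def data [[1 2 3] [4 5 6] [[7 8 9] [10 11 12]]])

;; A row whose first element is a vector is treated as a group of rows and
;; spliced in; any other row is wrapped so that mapcat keeps it whole.
(defn semi-flatten [rows]
  (mapcat #(if (vector? (first %)) % [%]) rows))

(semi-flatten data)
;=> ([1 2 3] [4 5 6] [7 8 9] [10 11 12])

;; Use vec if a vector of rows is needed rather than a lazy sequence.
(vec (semi-flatten data))
;=> [[1 2 3] [4 5 6] [7 8 9] [10 11 12]]
```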
A different approach using tree-seq:
(def a [[1 2 3] [4 5 6] [[7 8 9] [10 11 12]]])
(filter (comp not vector? first)
(tree-seq (comp vector? first) seq a))
I am stretching to use tree-seq here. Would someone with more experience care to comment on whether there is a better way to return only the children other than using what is effectively a filter of (not branch?)?
Clojure: Semi-Flattening a nested Sequence answers your question, but I don't want to mark this question as a duplicate of that one, since you're really asking a different question than he was; I wonder if it's possible to move his answer over here.
I was asking about the peculiarity of zipmap construct to only discover that I was apparently doing it wrong. So I learned about (map vector v u) in the process. But prior to this case I had used zipmap to do (map vector ...)'s work. Did it work then because the resultant map was small enough to be sorted out?
And to the actual question: what is zipmap for, and how/when should it be used? And when should (map vector ...) be used instead?
My original problem required the original order, so mapping anything wouldn't be a good idea. But basically -- apart from the order of the resulting pairs -- these two methods are equivalent, because the seq'd map becomes a sequence of vectors.
(for [pair (map vector v (rest v))]
( ... )) ;do with (first pair) and (last pair)
(for [pair (zipmap v (rest v))]
( ... )) ;do with (first pair) and (last pair)
Use (zipmap ...) when you want to directly construct a hashmap from separate sequences of keys and values. The output is a hashmap:
(zipmap [:k1 :k2 :k3] [10 20 40])
=> {:k3 40, :k2 20, :k1 10}
Use (map vector ...) when you are trying to merge multiple sequences. The output is a lazy sequence of vectors:
(map vector [1 2 3] [4 5 6] [7 8 9])
=> ([1 4 7] [2 5 8] [3 6 9])
Some extra notes to consider:
Zipmap only works on two input sequences (keys + values) whereas map vector can work on any number of input sequences. If your input sequences are not key value pairs then it's probably a good hint that you should be using map vector rather than zipmap
zipmap will be more efficient and simpler than doing map vector and then subsequently creating a hashmap from the key/value pairs - e.g. (into {} (map vector [:k1 :k2 :k3] [10 20 40])) is quite a convoluted way to do zipmap
map vector is lazy - so it brings a bit of extra overhead but is very useful in circumstances where you actually need laziness (e.g. when dealing with infinite sequences)
You can do (seq (zipmap ....)) to get a sequence of key-value pairs rather like (map vector ...), however be aware that this may re-order the sequence of key-value pairs (since the intermediate hashmap is unordered)
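A short sketch of the notes above, using only core functions; ks and vs are illustrative names.

```clojure
(def ks [:k1 :k2 :k3])
(def vs [10 20 40])

;; zipmap and the into/map vector detour build the same map.
(= (zipmap ks vs)
   (into {} (map vector ks vs)))
;=> true

;; map vector is lazy, so it can pair up infinite sequences as long as
;; only a finite part is consumed.
(take 3 (map vector (range) (repeat :x)))
;=> ([0 :x] [1 :x] [2 :x])

;; zipmap is eager, but it stops at the shorter input, so one infinite
;; input is still fine.
(zipmap (range) [:a :b])
;=> {0 :a, 1 :b}
```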
The methods are more or less equivalent. When you use zipmap you get a map with key/value pairs. When you iterate over this map you get [key value] vectors. The order of the map is however not defined. With the 'map' construct in your first method you create a list of vectors with two elements. The order is defined.
Zipmap might be a bit less efficient in your example. I would stick with the 'map'.
Edit: Oh, and zipmap isn't lazy. So another reason not to use it in your example.
Edit 2: use zipmap when you really need a map, for example for fast random key-based access.
The two may appear similar but in reality are very different.
zipmap creates a map
(map vector ...) creates a LazySeq of n-tuples (vectors of size n)
These are two very different data structures.
While a lazy sequence of 2-tuples may appear similar to a map, they behave very differently.
Say we are mapping two collections, coll1 and coll2. Consider the case coll1 has duplicate elements. The output of zipmap will only contain the value corresponding to the last appearance of the duplicate keys in coll1. The output of (map vector ...) will contain 2-tuples with all values of the duplicate keys.
A simple REPL example:
=> (zipmap [:k1 :k2 :k3 :k1] [1 2 3 4])
{:k3 3, :k2 2, :k1 4}
=>(map vector [:k1 :k2 :k3 :k1] [1 2 3 4])
([:k1 1] [:k2 2] [:k3 3] [:k1 4])
With that in mind, it is trivial to see the danger in assuming the following:
But basically -- apart from the order of the resulting pairs -- these two methods are equivalent, because the seq'd map becomes a sequence of vectors.
The seq'd map becomes a sequence of vectors, but not necessarily the same sequence of vectors as the results from (map vector ...)
For completeness, here are the seq'd vectors sorted:
=> (sort (seq (zipmap [:k1 :k2 :k3 :k1] [1 2 3 4])))
([:k1 4] [:k2 2] [:k3 3])
=> (sort (seq (map vector [:k1 :k2 :k3 :k1] [1 2 3 4])))
([:k1 1] [:k1 4] [:k2 2] [:k3 3])
I think the closest we can get to a statement like the above is:
The set of the result of (zipmap coll1 coll2) will be equal to the set of the result of (map vector coll1 coll2) if coll1 contains no duplicates (for example, if coll1 is itself a set).
That is a lot of qualifiers for two operations that are supposedly very similar.
That is why special care must be taken when deciding which one to use.
They are very different, serve different purposes and should not be used interchangeably.
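The qualified statement can be checked at the REPL. In this sketch same-pairs? is a hypothetical helper that compares the two results as sets of pairs; it relies on map entries being equal to two-element vectors with the same contents.

```clojure
;; Hypothetical helper: do zipmap and (map vector ...) yield the same
;; pairs once order and repetition are ignored?
(defn same-pairs? [coll1 coll2]
  (= (set (zipmap coll1 coll2))
     (set (map vector coll1 coll2))))

;; No duplicate keys: the two results agree as sets.
(same-pairs? [:k1 :k2 :k3] [1 2 3])
;=> true

;; Duplicate key :k1: zipmap keeps only [:k1 4], map vector keeps both.
(same-pairs? [:k1 :k2 :k3 :k1] [1 2 3 4])
;=> false
```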
(zipmap k v) takes two seqs and returns a map (which does not preserve the order of the elements)
(map vector s1 s2 ...) takes any count of seqs and returns seq
use the first, when you want to zip two seqs into a map.
use the second, when you want to apply vector (or list or any other seq-creating form) to multiple seqs.
there is some similarity to option "collate" when you print several copies of a document :)