How to use set/intersection with big result sets from MongoDB [duplicate] - clojure

This question already has an answer here:
MongoDB: Query has implicit limit(256)?
(1 answer)
Closed 4 years ago.
I've a function photos-with-keyword-starting that gets lists of photos for a given keyword from a MongoDB instance using monger, and another that finds subsets of these photos using set/intersection.
(defn photos-with-keywords-starting [stems]
  (apply set/intersection
         (map set
              (map photos-with-keyword-starting stems))))
Previously I thought this worked fine, but since adding more records the intersection doesn't work as expected -- it misses lots of records that have both keywords.
I notice that calls to the function photos-with-keyword-starting always return a maximum of 256 results:
=> (count (photos-with-keyword-starting "lisa"))
256
Here's the code of that function:
(defn photos-with-keyword-starting [stem]
  (with-db (q/find {:keywords {$regex (str "^" stem)}})
           (q/sort {:datetime 1})))
So because calls to find records in MongoDB don't return all records if there are more than 256, I don't get the right subsets when specifying more than one keyword.
How do I increase this limit?

You could simply convert the datetime in your function photos-with-keyword-starting to a string, for instance, if you can live with that.
Alternatively you could remove logical duplicates from your output, for instance like this:
(->> -your-result-
     (group-by #(update % :datetime str))
     (map (comp first val)))

Related

building a hashmap from an array in clojure

First off, I am a student in week 5 of 12 at The Iron Yard studying Java backend engineering. The course is composed of roughly 60% Java, 25% JavaScript and 15% Clojure.
I have been given the following problem (outlined in the comment):
;; Given an ArrayList of words, return a HashMap<String, ArrayList<String>> containing a key for every
;; word's first letter. The value for the key will be an ArrayList of all
;; words in the list that start with that letter. An empty string has no first
;; letter so don't add a key for it.
(defn index-words [word-list]
  (loop [word (first word-list)
         index {}]
    (if (contains? index (subs word 0 1))
      (assoc index (subs word 0 1) (let [words (index (subs word 0 1))
                                         word word]
                                     (conj words word)))
      (assoc index (subs word 0 1) (conj nil word)))
    (if (empty? word-list)
      index
      (recur (rest word-list) index))))
I was able to get a similar problem working using zipmap but I am positive that I am missing something with this one. The code compiles but fails to run.
Specifically, I am failing to update my hashmap index in the false clause of the 'if'.
I have tested all of the components of this function in the REPL, and they work in isolation, but I am struggling to put them all together.
For your reference, here is the code that calls index-words.
(let [word-list ["aardvark" "apple" "zamboni" "phone"]]
  (printf "index-words(%s) -> %s\n" word-list (index-words word-list)))
Rather than getting a working solution from the community, my hope is for a few pointers to get my brain moving in the right direction.
The function assoc does not modify index. You need to work with the new value that assoc returns. The same is true for conj: it does not modify the collection you pass it.
I hope this answer is of the nature you expected to get: just a pointer to where your problem is.
BTW: If you can do with a PersistentList this becomes a one-liner when using reduce instead of loop and recur. An interesting function for you could be update-in.
Have fun with Clojure.
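The two pointers above (assoc returns a new map, and reduce with update-in can replace loop/recur) can be sketched as follows. This is only one possible shape, assuming that lists are acceptable as values instead of ArrayLists; the sample words are taken from the question plus an empty string.

```clojure
;; assoc returns a new map; the map it was given is left untouched.
(def index {})
(def index-2 (assoc index "a" (conj nil "aardvark")))
;; index is still {}, index-2 is {"a" ("aardvark")}

;; Sketch of the reduce + update-in variant.
;; Values are persistent lists, so newer words end up at the front.
(defn index-words [word-list]
  (reduce (fn [index word]
            (if (empty? word)
              index ; an empty string has no first letter, so add no key
              (update-in index [(subs word 0 1)] conj word)))
          {}
          word-list))

(index-words ["aardvark" "apple" "zamboni" "phone" ""])
;; => {"a" ("apple" "aardvark"), "z" ("zamboni"), "p" ("phone")}
```

The accumulator passed to the reducing function plays the role that index was meant to play in the loop: each step returns the new map instead of trying to change the old one.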
The group-by function does what you require.
You can use first as its discriminating function argument. It returns the first character of a string, or nil if there isn't one.
(first word) is simpler than (subs word 0 1).
Use dissoc to remove the entry for key nil.
You seldom need to use explicit loops in Clojure. Most common control patterns have been captured in functions like group-by. Such functions take a function argument and usually one or more collection arguments. The most common examples are map and reduce. The Clojure cheat sheet is a very useful guide to them.
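A minimal sketch of the group-by approach described above. The function name index-by-first-letter is made up for illustration. Note that first yields a character, so the keys here are characters such as \a, not one-letter strings as with subs.

```clojure
(defn index-by-first-letter [word-list]
  ;; group-by collects the words under (first word);
  ;; the empty string lands under nil, which dissoc then removes.
  (dissoc (group-by first word-list) nil))

(index-by-first-letter ["aardvark" "apple" "zamboni" "phone" ""])
;; => {\a ["aardvark" "apple"], \z ["zamboni"], \p ["phone"]}
```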

Compare values in a list of maps in clojure

I have a list of maps like
(def listofmaps
  '({:directory_path "/some/path/1", :directory_size "8.49 GB"} {:directory_path "/user/dod/yieldbook/yb_sec_char", :directory_size "14.1 MB"}))
containing many values and size can be in gb or mb.
Also I have a limitlistofmaps like
(def limitlistofmaps
  '({:directory_path "/some/path/8", :directory_size "15.2 GB"} {:directory_path "some/path/3", :directory_size "2.1 GB"}
    {:directory_path "/some/path/1", :directory_size "17.2 GB"}))
with many values..
I need to print "limit exceeded" if any map in list of maps had the same :directory_path as in limitlistofmaps but :directory_size exceeds the value specified. The problem is that size is in string format and unit has to be considered.
Can you help me with a way to do this in clojure?
I don't get why people down voted your question. I think as a Clojure community we are much better. I'm also a Clojure newb and I have nothing but great things to say about the community. I would like that it stays this way.
Firstly, why not have all the directory sizes in the same unit, say KB? That way it's easier to compare them.
Here is one version of a function that would transform any string like "100.23 Gb", "12 B", "123.3443 MB" to a Double representing Kilobytes.
(require '[clojure.string :as str])
(defn convert-to-kb
  "Converts a string 'number (B|KB|MB|GB)' to Double representing KBytes"
  [s] ; named s rather than str, to avoid shadowing clojure.core/str
  (let [[number-str unit] (map str/lower-case (str/split (str/trim s) #"\s+"))
        number (Double/parseDouble number-str)]
    (condp = unit
      "b" (/ number 1000)
      "kb" number
      "mb" (* number 1000)
      "gb" (* number 1000000))))
Secondly, I would suggest you put the directory size limits in the same map data that lives inside your listofmaps, so you don't have state duplication like you have now in limitlistofmaps. But if for some reason you need this second list, here is a piece of ugly code that returns a list of maps that are the same as in your listofmaps with two added key/val entries, :max_size_kb and :directory_size_kb.
(for [dir listofmaps :let [{:keys [directory_path directory_size]} dir]]
  (let [limit-map (first
                    (get (group-by :directory_path limitlistofmaps) directory_path))
        ;; a directory with no entry in limitlistofmaps gets nil as its limit
        max-size-kb (some-> limit-map :directory_size convert-to-kb)]
    (-> dir
        (assoc :max_size_kb max-size-kb)
        (assoc :directory_size_kb (convert-to-kb directory_size)))))
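To get from the converted sizes to the "limit exceeded" message the question asks for, something along these lines might work. It is a self-contained sketch: size->kb and exceeded-paths are names made up here, the sample data is hypothetical (chosen so that one directory actually exceeds its limit), and decimal units (1 MB = 1000 KB) are assumed as in the function above.

```clojure
(require '[clojure.string :as str])

(defn size->kb
  "Parses strings such as \"8.49 GB\" into a number of kilobytes."
  [s]
  (let [[number-str unit] (str/split (str/trim s) #"\s+")
        number (Double/parseDouble number-str)]
    (case (str/lower-case unit)
      "b"  (/ number 1000)
      "kb" number
      "mb" (* number 1000)
      "gb" (* number 1000000))))

(defn exceeded-paths
  "Returns the paths in dirs whose size is above the limit for the same path."
  [dirs limits]
  (let [limit-kb (into {}
                       (map (juxt :directory_path
                                  (comp size->kb :directory_size)))
                       limits)]
    (for [{:keys [directory_path directory_size]} dirs
          :let [limit (limit-kb directory_path)]
          ;; directories without a configured limit are skipped
          :when (and limit (> (size->kb directory_size) limit))]
      directory_path)))

;; Hypothetical sample data
(def dirs
  [{:directory_path "/some/path/1", :directory_size "18.0 GB"}
   {:directory_path "/some/path/2", :directory_size "14.1 MB"}
   {:directory_path "/some/path/9", :directory_size "3 GB"}])

(def limits
  [{:directory_path "/some/path/1", :directory_size "17.2 GB"}
   {:directory_path "/some/path/2", :directory_size "1 GB"}])

(doseq [path (exceeded-paths dirs limits)]
  (println "limit exceeded:" path))
;; prints: limit exceeded: /some/path/1
```

Building the path-to-limit map once avoids calling group-by on the limits for every directory.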

clojure: how to get values from lazy seq?

I am new to Clojure and need some help getting a value out of a lazy sequence.
You can have a look at my full data structure here: http://pastebin.com/ynLJaLaP
What I need is the content of the title:
{: _content AlbumTitel2}
I managed to get a list of all _content values:
(def albumtitle (map #(str (get % :title)) photosets))
(println albumtitle)
and the result is:
({:_content AlbumTitel2} {:_content test} {:_content AlbumTitel} {:_content album123} {:_content speciale} {:_content neues B5 Album} {:_content Album Nr 2})
But how can I get the value of every :_content?
Any help would be appreciated!
Thanks!
You could simply do this
(map (comp :_content :title) photosets)
Keywords work as functions, so the composition with comp will first retrieve the :title value of each photoset and then further retrieve the :_content value of that value.
Alternatively this could be written as
(map #(get-in % [:title :_content]) photosets)
A semi-alternative solution is to do
(->> data
     (map :title)
     (map :_content))
This takes advantage of the fact that keywords are functions, and of the so-called thread-last macro. What the macro does is inject the result of the first expression as the last argument of the second, and so on.
The above code gets converted to
(map :_content (map :title data))
Clearly not as readable, and not easy to expand later either.
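The three variants above should all give the same result. Here is a small check, using hypothetical sample data shaped like the output shown in the question (the real structure from the pastebin link is not reproduced here):

```clojure
;; Hypothetical photosets, modelled on the printed output in the question
(def photosets
  [{:id "1" :title {:_content "AlbumTitel2"}}
   {:id "2" :title {:_content "test"}}
   {:id "3" :title {:_content "neues B5 Album"}}])

(def titles-comp   (map (comp :_content :title) photosets))
(def titles-get-in (map #(get-in % [:title :_content]) photosets))
(def titles-thread (->> photosets
                        (map :title)
                        (map :_content)))

(println titles-comp)
;; prints: (AlbumTitel2 test neues B5 Album)
```

println shows the strings without quotes, which is why the values in the question's output look like bare symbols.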
PS: I assume something went wrong when the data was pasted to the web, because:
{: _content AlbumTitel2}
Is not Clojure syntax, this however is:
{:_content "AlbumTitel2"}
Note that there is no whitespace after the :, and there are "" around the text. Just in case you want to paste some Clojure another time.

Clojure stack overflow using recur, lazy seq?

I've read other people's questions about having stack overflow problems in Clojure, and the problem tends to be a lazy sequence being built up somewhere. That appears to be the problem here, but for the life of me I can't figure out where.
Here is the code and after the code is a bit of explanation:
(defn pare-all
  "writes to disk, return new counts map"
  []
  (loop [counts (counted-origlabels)
         songindex 0]
    (let [[o g] (orig-gen-pair songindex)]
      (if (< songindex *song-count*) ; if we are not done processing list
        (if-not (seq o) ; if there are no original labels
          (do
            (write-newlabels songindex g) ; then use the generated ones
            (recur counts (inc songindex)))
          (let [{labels :labels new-counts :countmap} (pare-keywords o g counts)] ; else pare the pairs
            (write-newlabels songindex labels)
            (recur new-counts (inc songindex))))
        counts))))
There is a map stored in "counts", originally retrieved from the function "counted-origlabels". The map has string keys and integer values. It is 600 or so items long, and the values are updated during the iteration but the length stays the same; I've verified this.
The "orig-gen-pair" function reads from a file and returns a short pair of sequences, 10 or so items each.
The "write-newlabels" function just writes the passed sequence to the disk and doesn't have any other side effect, nor does it return a value.
"Pare-keywords" returns a short sequence and an updated version of the "counts" map.
I just don't see what lazy sequence could be causing the problem here!
Any tips would be very much appreciated!
----EDIT----
Hello all, I've updated my function to be (hopefully) a little more idiomatic Clojure. But my original problem still remains. First, here is the new code:
(defn process-song [counts songindex]
  (let [[o g] (orig-gen-pair songindex)]
    (if-not (seq o) ;; if no original labels
      (do
        (write-newlabels songindex g) ; then use the generated ones
        counts)
      (let [{labels :labels new-counts :countmap} (pare-keywords o g counts)] ; else pare the pairs
        (write-newlabels songindex labels)
        new-counts))))
(defn pare-all []
  (reduce process-song (counted-origlabels) (range *song-count*)))
This still ends with java.lang.StackOverflowError (repl-1:331). The stack trace doesn't mean much to me other than it sure seems to indicate lazy sequence mayhem going on. Any more tips? Do I need to post the code to the functions that process-song calls? Thanks!
I cannot quite grasp what you are trying to do without a little more concrete sample data, but it's very evident you're trying to iterate over your data using recursion. You're making things way more painful on yourself than you need to.
If you can write a function, let's call it do-the-thing, that operates correctly on a single entry in your map, then you can call (map do-the-thing (counted-origlabels)), and it will apply do-the-thing to each map entry in (counted-origlabels), passing a single map entry to do-the-thing as its sole argument and returning a seq of the return values from do-the-thing.
It also looks like you need indexes; this is easily solved as well. You can pass the lazy sequence (range) to map as a further collection, so that do-the-thing receives an index as its second argument for each map entry; however, maps in Clojure are not sorted by default, so unless you are using a sorted map, this index value is relatively meaningless.
Trying to abstract away what you've written so far, try something like:
(defn do-the-thing [entry index counts]
  (let [[o g] (orig-gen-pair index)]
    (if-not (seq o)
      (write-newlabels index g)
      (let [{labels :labels new-counts :countmap} (pare-keywords o g counts)]
        (write-newlabels index labels)))))
;; repeat (not constantly) supplies the same counts map for every entry;
;; dorun forces the lazy seq so that the writes actually happen
(dorun (map do-the-thing (counted-origlabels) (range) (repeat (counted-origlabels))))
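How map lines up several collections may be easier to see with a toy example. The function describe and the data below are hypothetical stand-ins for do-the-thing and the real map entries; they only illustrate the argument passing.

```clojure
;; describe is a hypothetical stand-in for do-the-thing
(defn describe [entry index shared]
  {:entry entry :index index :shared shared})

;; A vector of pairs stands in for the entries of the counts map
(def entries [["rock" 3] ["jazz" 5]])

;; map stops at the shortest collection, so the infinite
;; (range) and (repeat ...) sequences are safe here.
(def described
  (doall (map describe entries (range) (repeat :same-for-all))))
;; => ({:entry ["rock" 3], :index 0, :shared :same-for-all}
;;     {:entry ["jazz" 5], :index 1, :shared :same-for-all})
```

Because map is lazy, doall (or dorun, when the results are not needed) is what makes any side effects in the mapped function run.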

How to use "Update-in" in Clojure?

I'm trying to use Clojure's update-in function but I can't seem to understand why I need to pass in a function?
update-in takes a function, so you can update a value at a given position depending on the old value more concisely. For example instead of:
(assoc-in m [list of keys] (inc (get-in m [list of keys])))
you can write:
(update-in m [list of keys] inc)
Of course if the new value does not depend on the old value, assoc-in is sufficient and you don't need to use update-in.
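A concrete version of the comparison above, with a made-up nested map in place of m and [:a :b] in place of the list of keys:

```clojure
(def m {:a {:b 1}})

;; Reading the old value and writing the new one by hand
(def via-assoc-in (assoc-in m [:a :b] (inc (get-in m [:a :b]))))

;; update-in does the lookup and passes the old value to the function
(def via-update-in (update-in m [:a :b] inc))
;; both => {:a {:b 2}}

;; Extra arguments after the function are passed along: (+ 1 10)
(def plus-ten (update-in m [:a :b] + 10))
;; => {:a {:b 11}}
```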
This isn't a direct answer to your question, but one reason why a function like update-in could exist would be for efficiency—not just convenience—if it were able to update the value in the map "in-place". That is, rather than
seeking the key in the map,
finding the corresponding key-value tuple,
extracting the value,
computing a new value based on the current value,
seeking the key in the map,
finding the corresponding key-value tuple,
and overwriting the value in the tuple or replacing the tuple with a new one
one can instead imagine an algorithm that would omit the second search for the key:
seek the key in the map,
find the corresponding key-value tuple,
extract the value,
compute a new value based on the current value,
and overwrite the value in the tuple
Unfortunately, the current implementation of update-in does not do this "in-place" update. It uses get for the extraction and assoc for the replacement. Unless assoc is using some caching of the last looked up key and the corresponding key-value tuple, the call to assoc winds up having to seek the key again.
I think the short answer is that the function passed to update-in lets you update values in a single step, rather than 3 (lookup, calculate new value, set).
Coincidentally, just today I ran across this use of update-in in a Clojure presentation by Howard Lewis Ship:
(def in-str "this is this")
(reduce
  (fn [m k] (update-in m [k] #(inc (or % 0))))
  {}
  (seq in-str))
==> {\space 2, \s 3, \i 3, \h 2, \t 2}
Each call to update-in takes a letter as a key, looks it up in the map, and if it's found there increments the letter count (else sets it to 1). The reduce drives the process by starting with an empty map {} and repeatedly applies the update-in with successive characters from the input string. The result is a map of letter frequencies. Slick.
Note 1: clojure.core/frequencies is similar but uses assoc! rather than update-in.
Note 2: You can replace #(inc (or % 0)) with (fnil inc 0).
Here is a practical example.
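For illustration, here is the same reduction with fnil swapped in, as a sketch; char-counts is a name chosen here, not something from the presentation.

```clojure
(defn char-counts [s]
  ;; (fnil inc 0) behaves like inc, but treats a nil argument as 0,
  ;; so a character seen for the first time gets the count 1.
  (reduce (fn [m k] (update-in m [k] (fnil inc 0)))
          {}
          s))

(char-counts "this is this")
;; the same counts as (frequencies "this is this")
```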
Type this snippet (in your REPL):
(def my-map {:useless-key "key"})
;;{:useless-key "key"}
(def my-map (update-in my-map [:yourkey] #(cons 1 %)))
;;{:yourkey (1), :useless-key "key"}
Note that :yourkey is new, so the value of :yourkey passed to the lambda is nil. cons then creates a list with 1 as its single element. Now do the following:
(def my-map (update-in my-map [:yourkey] #(cons 25 %)))
;;{:yourkey (25 1), :useless-key "key"}
And that is it. In the second part, the anonymous function takes the list (the value for :yourkey) as its argument and just conses 25 onto it.
Since my-map is immutable, update-in always returns a new version of the map, letting you compute the new value from the old value of the given key.
Hope it helped!