What is the most idiomatic way to write a data structure to disk in Clojure, so I can read it back with edn/read? I tried the following, as recommended in the Clojure cookbook:
(with-open [w (clojure.java.io/writer "data.clj")]
  (binding [*out* w]
    (pr large-data-structure)))
However, this will only write the first 100 items, followed by "...". I tried (prn (doall large-data-structure)) as well, yielding the same result.
I've managed to do it by writing line by line with (doseq [i large-data-structure] (pr i)), but then I have to manually add the parens at the beginning and end of the sequence to get the desired result.
You can control the number of items in a collection that are printed via *print-length*.
Consider using spit instead of manually opening the writer and pr-str instead of manually binding to *out*.
(binding [*print-length* false]
  (spit "data.clj" (pr-str large-data-structure)))
Edit from comment:
(with-open [w (clojure.java.io/writer "data.clj")]
  (binding [*print-length* false
            *out* w]
    (pr large-data-structure)))
Note: *print-length* has a root binding of nil so you should not need to bind it in the example above. I would check the current binding at the time of your original pr call.
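To read the data back, here is a minimal sketch using clojure.edn, assuming the file was written with pr-str as above:

(require '[clojure.edn :as edn])

;; slurp the whole file and parse it as edn
(def restored (edn/read-string (slurp "data.clj")))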
I am a newbie to Clojure. I am invoking a Clojure function from Java, and I want to record the time a particular line of Clojure code takes to execute:
Suppose my Clojure function is:
(defn sampleFunction [sampleInput]
  (fun1 (fun2 sampleInput)))
I invoke the function above from Java; it returns a String value, and I want to record the time it takes to execute fun2.
I have another function, say logTime, which writes the parameter passed to it into a database:
(defn logTime [time]
.....
)
My question is: how can I modify my sampleFunction to invoke logTime, recording the time it took to execute fun2?
Thank you in advance.
I'm not entirely sure how the different pieces of your code fit together and interoperate with Java, but here's something that could work with the way you described it.
To get the execution time of a piece of code, there's a core function called time. However, this function doesn't return the execution time, it just prints it... So given that you want to log that time into a database, we need to write a macro to capture both the return value of fun2 as well as the time it took to execute:
(defmacro time-execution
  [& body]
  `(let [s# (new java.io.StringWriter)]
     (binding [*out* s#]
       (hash-map :return (time ~@body)
                 :time (.replaceAll (str s#) "[^0-9\\.]" "")))))
What this macro does is bind standard output to a Java StringWriter, so that we can use it to store whatever the time function prints. To return both the result of fun2 and the time it took to execute, we package the two values in a hash-map (could be some other collection too - we'll end up destructuring it later). Notice that the code whose execution we're timing is wrapped in a call to time, so that we trigger the printing side effect and capture it in s#. Finally, the .replaceAll is just to ensure that we're only extracting the actual numeric value (in milliseconds), since time prints something of the form "Elapsed time: 0.014617 msecs".
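For example, at the REPL the macro returns something like the following (the timing value is illustrative):

(time-execution (Thread/sleep 100))
;; => {:return nil, :time "100.185812"}  ; :time is a string of milliseconds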
Incorporating this into your code, we need to rewrite sampleFunction like so:
(defn sampleFunction [sampleInput]
  (let [{:keys [return time]} (time-execution (fun2 sampleInput))]
    (logTime time)
    (fun1 return)))
We're simply destructuring the hash-map to access both the return value of fun2 and the time it took to execute, then we log the execution time using logTime, and finally we finish by calling fun1 on the return value of fun2.
The tupelo.profile library gives you many options if you want to capture execution time for one or more functions and accumulate it over multiple calls. An example:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.profile :as prof]))

(defn add2 [x y] (+ x y))

(prof/defnp fast [] (reduce add2 0 (range 10000)))
(prof/defnp slow [] (reduce add2 0 (range 10000000)))

(dotest
  (prof/timer-stats-reset)
  (dotimes [i 10000] (fast))
  (dotimes [i 10] (slow))
  (prof/print-profile-stats))
with result:
--------------------------------------
Clojure 1.10.2-alpha1 Java 14
--------------------------------------
Testing tst.demo.core
---------------------------------------------------------------------------------------------------
Profile Stats:
Samples TOTAL MEAN SIGMA ID
10000 0.955 0.000096 0.000045 :tst.demo.core/fast
10 0.905 0.090500 0.000965 :tst.demo.core/slow
---------------------------------------------------------------------------------------------------
If you want detailed timing for a single method, the Criterium library is what you need. Start off with the quick-bench function.
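A minimal sketch of using Criterium (the dependency coordinates and version are an assumption; check the project page for the current release):

;; add [criterium "0.4.6"] to your dependencies, then:
(require '[criterium.core :as crit])

(crit/quick-bench (reduce add2 0 (range 10000)))
;; prints the mean execution time, standard deviation, etc. to *out*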
Since the accepted answer has some shortcomings, such as swallowing anything else that gets printed (e.g. log output) while the body runs, here is perhaps a simpler solution:
(defmacro time-execution [body]
  `(let [st# (System/currentTimeMillis)
         return# ~body
         se# (System/currentTimeMillis)]
     {:return return#
      :time (double (/ (- se# st#) 1000))}))
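Usage mirrors the accepted answer's sampleFunction; a small sketch (the sum below is just an illustrative body, and note that :time here is elapsed seconds as a double):

(let [{:keys [return time]} (time-execution (reduce + (range 1000000)))]
  (logTime time)   ; log the elapsed seconds
  return)
;; => 499999500000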
I am launching a few hundred concurrent http-kit.client/get requests, provided with a callback to write results to a single file.
What would be a good way to deal with thread-safety? Using chan and <!! from core.async?
Here's the code I'm considering:
(defn launch-async [channel url]
  (http/get url {:timeout 5000
                 :user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:10.0) Gecko/20100101 Firefox/10.0"}
            (fn [{:keys [status headers body error]}]
              (if error
                (put! channel (json/generate-string {:url url :headers headers :status status}))
                (put! channel (json/generate-string body))))))
(defn process-async [channel func]
  (when-let [response (<!! channel)]
    (func response)))

(defn http-gets-async [func urls]
  (let [channel (chan)]
    (doall (map #(launch-async channel %) urls))
    (process-async channel func)))
Thanks for your insights.
Since you are already using core.async in your example, I thought I'd point out a few issues and how you can address them. The other answer mentions using a more basic approach, and I agree wholeheartedly that a simpler approach is just fine. However, with channels, you have a simple way of consuming the data which does not involve mapping over a vector, which will also grow large over time if you have many responses. Consider the following issues and how we can fix them:
(1) Your current version will crash if your url list has more than 1024 elements. There's an internal buffer for puts and takes that are asynchronous (i.e., put! and take! don't block but always return immediately), and the limit is 1024. This is in place to prevent unbounded asynchronous usage of the channel. To see for yourself, call (http-gets-async println (repeat 1025 "http://blah-blah-asdf-fakedomain.com")).
What you want to do is to only put something on the channel when there's room to do so. This is called back-pressure. Taking a page from the excellent wiki on go block best practices, one clever way to do this from your http-kit callback is to use the put! callback option to launch your next http get; this will only happen when the put! immediately succeeds, so you will never have a situation where you can go beyond the channel's buffer:
(defn launch-async
  [channel [url & urls]]
  (when url
    (http/get url {:timeout 5000
                   :user-agent "Mozilla"}
              (fn [{:keys [status headers body error]}]
                (let [put-on-chan (if error
                                    (json/generate-string {:url url :headers headers :status status})
                                    (json/generate-string body))]
                  (put! channel put-on-chan (fn [_] (launch-async channel urls))))))))
(2) Next, you seem to be only processing one response. Instead, use a go-loop:
(defn process-async
  [channel func]
  (go-loop []
    (when-let [response (<! channel)]
      (func response)
      (recur))))
(3) Here's your http-gets-async function. I see no harm in adding a buffer here, as it should help you fire off a nice burst of requests at the beginning:
(defn http-gets-async
  [func urls]
  (let [channel (chan 1000)]
    (launch-async channel urls)
    (process-async channel func)))
Now, you have the ability to process an infinite number of urls, with back-pressure. To test this, define a counter, and then make your processing function increment this counter to see your progress. Use a localhost URL that is easy to bang on (I wouldn't recommend firing off hundreds of thousands of requests to, say, google):
(def responses (atom 0))

(http-gets-async (fn [_] (swap! responses inc))
                 (repeat 1000000 "http://localhost:8000"))
As this is all asynchronous, your function will return immediately and you can watch @responses grow.
One other interesting thing you can do: instead of running your processing function in process-async, you can apply it as a transducer on the channel itself.
(defn process-async
  [channel]
  (go-loop []
    (when-let [_ (<! channel)]
      (recur))))

(defn http-gets-async
  [func urls]
  (let [channel (chan 10000 (map func))] ;; <-- transducer on channel
    (launch-async channel urls)
    (process-async channel)))
There are many ways to do this, including constructing it so that the channel closes (note that above, it stays open). You have java.util.concurrent primitives to help in this regard if you like, and they are quite easy to use. The possibilities are very numerous.
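For instance, here is a hedged sketch of the closing variant: if launch-async closes the channel once the url list is exhausted, the when-let in process-async eventually sees nil and the go-loop terminates on its own (this assumes close! is referred from core.async alongside chan, put! and <!):

(defn launch-async
  [channel [url & urls]]
  (if url
    (http/get url {:timeout 5000}
              (fn [{:keys [status headers body error]}]
                (let [put-on-chan (if error
                                    (json/generate-string {:url url :headers headers :status status})
                                    (json/generate-string body))]
                  (put! channel put-on-chan (fn [_] (launch-async channel urls))))))
    ;; no more urls: close the channel so the consumer's go-loop can drain and finish
    (close! channel)))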
This is simple enough that I wouldn't use core.async for it. You can do this with an atom storing a vector of the responses, then have a separate thread reading the contents of the atom until it has seen all of the responses. Then, in your http-kit callback, you can just swap! the response into the atom directly.
If you do want to use core.async, I'd recommend a buffered channel to keep from blocking your http-kit thread pool.
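A minimal sketch of that simpler, atom-based approach (urls stands for your seq of urls, and the http/get and json helpers are the same ones used in the question):

(def responses (atom []))

(defn launch [url]
  (http/get url {:timeout 5000}
            (fn [{:keys [body error]}]
              (when-not error
                ;; swap! is atomic, so concurrent callbacks are safe
                (swap! responses conj (json/generate-string body))))))

;; fire off all requests; a separate thread can watch (count @responses)
;; and spit the accumulated vector to a file once everything has arrived
(doseq [url urls] (launch url))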
I am building an event collector in Clojure for Snowplow (using Ring/Compojure) and am having some trouble serving a transparent pixel with Ring. This is my code for sending the pixel:
(ns snowplow.clojure-collector.responses
  (:import (org.apache.commons.codec.binary Base64)
           (java.io ByteArrayInputStream)))

(def pixel-bytes (Base64/decodeBase64 (.getBytes "R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==")))
(def pixel (ByteArrayInputStream. pixel-bytes))

(defn send-pixel
  []
  {:status 200
   :headers {"Content-Type" "image/gif"}
   :body pixel})
When I start up my server, the first time I hit the path for send-pixel, the pixel is successfully delivered to my browser. But the second time - and every time afterwards - Ring sends no body (and content-length 0). Restart the server and it's the same pattern.
A few things it's not:
I have replicated this using wget, to confirm the intermittency isn't a browser caching issue
I generated the "R0lGOD..." base64 string at the command line (cat original.gif | base64), so I know there is no issue there
When the pixel is successfully sent, I have verified its contents are correct (diff original.gif received-pixel.gif)
I'm new to Clojure - my guess is there's some embarrassing dynamic gremlin in my code, but I need help spotting it!
I figured out the problem in the REPL shortly after posting:
user=> (import (org.apache.commons.codec.binary Base64) (java.io ByteArrayInputStream))
java.io.ByteArrayInputStream
user=> (def pixel-bytes (Base64/decodeBase64 (.getBytes "R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==")))
#'user/pixel-bytes
user=> (def pixel (ByteArrayInputStream. pixel-bytes))
#'user/pixel
user=> (slurp pixel-bytes)
"GIF89a!�\n,L;"
user=> (slurp pixel-bytes)
"GIF89a!�\n,L;"
user=> (slurp pixel)
"GIF89a!�\n,L;"
user=> (slurp pixel)
""
So basically the problem was that the ByteArrayInputStream was getting emptied after the first call. Mutable data structures!
I fixed the bug by generating a new ByteArrayInputStream for each response, with:
:body (ByteArrayInputStream. pixel-bytes)}))
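In full, the fixed handler looks like this (same namespace and pixel-bytes as above):

(defn send-pixel
  []
  {:status  200
   :headers {"Content-Type" "image/gif"}
   ;; build a fresh InputStream per response so the body can be read every time
   :body    (ByteArrayInputStream. pixel-bytes)})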
The problem is that your pixel variable holds a stream. Once it has been read, there is no way to re-read it.
Moreover, you do not need to deal with encoding issues. Ring serves static files as well. Just return:
(file-response "/path/to/pixel.gif")
It handles non-existing files as well. See the docs also.
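A minimal sketch (file-response lives in ring.util.response):

(require '[ring.util.response :refer [file-response]])

(defn send-pixel []
  ;; returns a 200 response with the file as the body, or nil if the file is missing
  (file-response "/path/to/pixel.gif"))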
I have a binary file that contains an X by X matrix. The file itself is a sequence of single-precision floats (little-endian). What I would like to do is parse it, and stuff it into some reasonable clojure matrix data type.
Thanks to this question, I see I can parse a binary file with gloss. I now have code that looks like this:
(ns foo.core
  (:require gloss.core)
  (:require gloss.io)
  (:use [clojure.java.io])
  (:use [clojure.math.numeric-tower]))

(gloss.core/defcodec mycodec
  (gloss.core/repeated :float32 :prefix :none))

(def buffer (byte-array (* 1200 1200)))

(.read (input-stream "/path/to/binaryfile") buffer)

(gloss.io/decode mycodec buffer)
This takes a while to run, but eventually dumps out a big list of numbers. Unfortunately, the numbers are all wrong. Upon further investigation, the numbers were read as big-endian.
Assuming there is some way to read these binary files as little-endian, I'd like to stuff the results into a matrix. This question seems to have settled on using Incanter with its Parallel Colt representation, however, that question was from '09, and I'm hoping to stick to clojure 1.4 and lein 2. Somewhere in my frenzy of googling, I saw other recommendations to use jblas or mahout. Is there a "best" matrix library for clojure these days?
EDIT: Reading a binary file is tantalizingly close. Thanks to this handy nio wrapper, I am able to get a memory mapped byte buffer as a short one-liner, and even reorder it:
(ns foo.core
  (:require [clojure.java.io :as io])
  (:require [nio.core :as nio])
  (:import [java.nio ByteOrder]))

(def buffer (nio/mmap "/path/to/binaryfile"))

(class buffer) ;; java.nio.DirectByteBuffer

(.order buffer java.nio.ByteOrder/LITTLE_ENDIAN)
;; #<DirectByteBuffer java.nio.DirectByteBuffer[pos=0 lim=5760000 cap=5760000]>
However, reordering without doing the intermediate (def) step fails:
(.order (nio/mmap f) java.nio.ByteOrder/LITTLE_ENDIAN)
;; clojure.lang.Compiler$CompilerException: java.lang.IllegalArgumentException: Unable to resolve classname: MappedByteBuffer, compiling:(/Users/peter/Developer/foo/src/foo/core.clj:12)
;; at clojure.lang.Compiler.analyzeSeq (Compiler.java:6462)
;; clojure.lang.Compiler.analyze (Compiler.java:6262)
;; etc...
I'd like to be able to create the reordered byte buffer inside a function without defining a global variable, but right now it seems to not like that.
Also, once I've got it reordered, I'm not entirely sure what to do with my DirectByteBuffer, as it doesn't seem to be iterable. Perhaps for the remaining step of reading this buffer object (into a JBLAS matrix), I will create a second question.
EDIT 2: I am marking the answer below as accepted, because I think my original question combined too many things. Once I figure out the remainder of this I will try to update this question with complete code that starts with this ByteBuffer and that reads into a JBLAS matrix (which appears to be the right data structure).
In case anyone was interested, I was able to create a function that returns a properly ordered bytebuffer as follows:
;; This works!
(defn readf [^String file]
  (.order
    (.map
      (.getChannel
        (java.io.RandomAccessFile. file "r"))
      java.nio.channels.FileChannel$MapMode/READ_ONLY 0 (* 1200 1200))
    java.nio.ByteOrder/LITTLE_ENDIAN))
The nio wrapper I found looks to simplify / prettify this quite a lot, but it would appear I'm either not using it correctly, or there is something wrong. To recap my findings with the nio wrapper:
;; this works
(def buffer (nio/mmap "/bin/file"))
(def buffer (.order buffer java.nio.ByteOrder/LITTLE_ENDIAN))
(def buffer (.asFloatBuffer buffer))

;; this fails
(def buffer
  (.asFloatBuffer
    (.order
      (nio/mmap "/bin/file")
      java.nio.ByteOrder/LITTLE_ENDIAN)))
Sadly, this is a clojure mystery for another day, or perhaps another StackOverflow question.
Open a FileChannel, then get a memory-mapped buffer. There are lots of tutorials on the web for this step.
Switch the order of the buffer to little-endian by calling order(endian-ness) (not the no-arg version of order). Finally, the easiest way to extract floats is to call asFloatBuffer() on it and use the resulting buffer to read the floats.
After that you can put the data into whatever structure you need.
Edit: Here's an example of how to use the API.
;; first, I created a 96 byte file, then I started the repl
;; put some little endian floats in the file and close it
user=> (def file (java.io.RandomAccessFile. "foo.floats", "rw"))
#'user/file
user=> (def channel (.getChannel file))
#'user/channel
user=> (def buffer (.map channel java.nio.channels.FileChannel$MapMode/READ_WRITE 0 96))
#'user/buffer
user=> (.order buffer java.nio.ByteOrder/LITTLE_ENDIAN)
#<DirectByteBuffer java.nio.DirectByteBuffer[pos=0 lim=96 cap=96]>
user=> (def fbuffer (.asFloatBuffer buffer))
#'user/fbuffer
user=> (.put fbuffer 0 0.0)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.put fbuffer 1 1.0)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.put fbuffer 2 2.3)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.close channel)
nil
;; memory map the file, try reading the floats w/o changing the endianness of the buffer
user=> (def file2 (java.io.RandomAccessFile. "foo.floats" "r"))
#'user/file2
user=> (def channel2 (.getChannel file2))
#'user/channel2
user=> (def buffer2 (.map channel2 java.nio.channels.FileChannel$MapMode/READ_ONLY 0 96))
#'user/buffer2
user=> (def fbuffer2 (.asFloatBuffer buffer2))
#'user/fbuffer2
user=> (.get fbuffer2 0)
0.0
user=> (.get fbuffer2 1)
4.6006E-41
user=> (.get fbuffer2 2)
4.1694193E-8
;; change the order of the buffer and read the floats
user=> (.order buffer2 java.nio.ByteOrder/LITTLE_ENDIAN)
#<DirectByteBufferR java.nio.DirectByteBufferR[pos=0 lim=96 cap=96]>
user=> (def fbuffer2 (.asFloatBuffer buffer2))
#'user/fbuffer2
user=> (.get fbuffer2 0)
0.0
user=> (.get fbuffer2 1)
1.0
user=> (.get fbuffer2 2)
2.3
user=> (.close channel2)
nil
user=>
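From there, one way to pull the floats out into a Clojure vector is a small helper like this (hypothetical, not part of the session above):

(defn floats->vec
  [^java.nio.FloatBuffer fb]
  ;; read by absolute index so the buffer's position is left untouched
  (mapv #(.get fb (int %)) (range (.limit fb))))

;; (floats->vec fbuffer2) => [0.0 1.0 2.3 ...]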
I am using clojure.contrib.sql to fetch some records from an SQLite database.
(defn read-all-foo []
  (with-connection *db*
    (with-query-results res ["select * from foo"]
      (into [] res))))
Now, I don't really want to realize the whole sequence before returning from the function (i.e. I want to keep it lazy), but if I return res directly or wrap it in some kind of lazy wrapper (for example, I want to apply a certain map transformation to the result sequence), the SQL-related bindings will be reset and the connection will be closed after I return, so realizing the sequence will throw an exception.
How can I enclose the whole function in a closure and return a kind of iterator block (like yield in C# or Python)?
Or is there another way to return a lazy sequence from this function?
The resultset-seq that with-query-results returns is probably already as lazy as you're going to get. Laziness only works as long as the handle is open, as you said. There's no way around this. You can't read from a database if the database handle is closed.
If you need to do I/O and keep the data after the handle is closed, then open the handle, slurp it in fast (defeating laziness), close the handle, and work with the results afterward. If you want to iterate over some data without keeping it all in memory at once, then open the handle, get a lazy seq on the data, doseq over it, then close the handle.
So if you want to do something with each row (for side-effects) and discard the results without eating the whole resultset into memory, then you could do this:
(defn do-something-with-all-foo [f]
  (let [sql "select * from foo"]
    (with-connection *db*
      (with-query-results res [sql]
        (doseq [row res]
          (f row))))))
user> (do-something-with-all-foo println)
{:id 1}
{:id 2}
{:id 3}
nil
;; transforming the data as you go
user> (do-something-with-all-foo #(println (assoc % :bar :baz)))
{:id 1, :bar :baz}
{:id 2, :bar :baz}
{:id 3, :bar :baz}
If you want your data to hang around long-term, then you may as well slurp it all in using your read-all-foo function above (thus defeating laziness). If you want to transform the data, then map over the results after you've fetched it all. Your data will all be in memory at that point, but the map call itself and your post-fetch data transformations will be lazy.
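In code, that last suggestion looks like this, reusing read-all-foo from the question:

;; realize the rows while the connection is still open...
(def all-foo (read-all-foo))

;; ...then transform lazily afterwards; the map itself is lazy even though
;; the underlying data is already fully in memory
(def transformed (map #(assoc % :bar :baz) all-foo))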
It is in fact possible to add a "terminating side-effect" to a lazy sequence, to be executed once, when the entire sequence is consumed for the first time:
(def s (lazy-cat (range 10) (do (println :foo) nil)))
(first s)
; => returns 0, prints out nothing
(doall (take 10 s))
; => returns (0 1 2 3 4 5 6 7 8 9), prints nothing
(last s)
; => returns 9, prints :foo
(doall s)
; => returns (0 1 2 3 4 5 6 7 8 9), prints :foo
; or rather, prints :foo if it's the first time s has been
; consumed in full; you'll have to redefine it if you called
; (last s) earlier
I'm not sure I'd use this to close a DB connection, though -- I think it's considered best practice not to hold on to a DB connection indefinitely, and putting your connection-closing call at the end of your lazy sequence of results would not only hold on to the connection longer than strictly necessary, but also open up the possibility that your programme will fail for an unrelated reason without ever closing the connection. Thus for this scenario, I would normally just slurp in all the data. As Brian says, you can store it all somewhere unprocessed, then perform any transformations lazily, so you should be fine as long as you're not trying to pull in a really huge dataset in one chunk.
But then I don't know your exact circumstances, so if it makes sense from your point of view, you can definitely call a connection-closing function at the tail end of your result sequence. As Michiel Borkent points out, you wouldn't be able to use with-connection if you wanted to do this.
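As a sketch only (run-query and close-connection are placeholder names here, not real clojure.contrib.sql functions), the tail-end close would look like:

(defn lazy-foo [conn]
  (lazy-cat (run-query conn "select * from foo")   ; assumed to return a lazy seq of rows
            (do (close-connection conn) nil)))     ; runs once, when the seq is fully consumed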
I have never used SQLite with Clojure before, but my guess is that with-connection closes the connection when its body has been evaluated. So you need to manage the connection yourself if you want to keep it open, and close it when you finish reading the elements you're interested in.
There is no way to create a function or macro "on top" of with-connection and with-query-results to add laziness. Both close their Connection and ResultSet, respectively, when control flow leaves the lexical scope.
As Michal said, it would be no problem to create a lazy seq that closes its ResultSet and Connection lazily. As he also said, it wouldn't be a good idea unless you can guarantee that the sequences are eventually fully consumed.
A feasible solution might be:
(def ^:dynamic *deferred-resultsets* nil)

(defmacro with-deferred-close [& body]
  `(binding [*deferred-resultsets* (atom #{})]
     (let [ret# (do ~@body)]
       ;;; close resultsets
       ret#)))

(defmacro with-deferred-results [bind-form sql & body]
  `(let [resultset# (execute-query ...)]
     (swap! *deferred-resultsets* conj resultset#)
     ;;; execute body, similar to with-query-results
     ;;; but leave resultset open
     ))
This would allow for e.g. keeping the resultsets open until the current request is finished.