md5 hash for big files in Clojure

How can I modify this code to cope with larger files (2 GB)? In Java I would use a small buffer and update(); how do I do that in Clojure?
(defn md5 [io-factory]
  (let [bytes'
        (with-open [xin  (clojure.java.io/input-stream io-factory)
                    xout (java.io.ByteArrayOutputStream.)]
          (clojure.java.io/copy xin xout)
          (.toByteArray xout))
        algorithm (java.security.MessageDigest/getInstance "MD5")
        raw       (.digest algorithm bytes')]
    (format "%032x" (BigInteger. 1 raw))))
; Execution error (OutOfMemoryError) at java.util.Arrays/copyOf (Arrays.java:3236).
; Java heap space
Thank you for your answers.

You can use a DigestInputStream to calculate a hash without holding all the bytes in memory at once, since it computes the hash incrementally as you consume bytes from the source stream.
(require '[clojure.java.io :as io])
(import '(java.security DigestInputStream MessageDigest))

(defn copy+md5 [source sink]
  (let [digest (MessageDigest/getInstance "MD5")]
    ;; every byte copied to the sink also updates the digest
    (with-open [input-stream  (io/input-stream source)
                digest-stream (DigestInputStream. input-stream digest)
                output-stream (io/output-stream sink)]
      (io/copy digest-stream output-stream))
    (format "%032x" (BigInteger. 1 (.digest digest)))))
If you're not doing anything with the contents of the source other than computing a hash, you could use the /dev/null equivalent, (OutputStream/nullOutputStream) (available since Java 11), as the sink.

clj-digest uses a small buffer to calculate MD5 and other message digests.
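For comparison, the "small buffer and update()" approach mentioned in the question can be written directly in Clojure. This is a minimal sketch, not clj-digest's actual implementation; the name md5-hex and the 8192-byte buffer size are assumptions chosen for illustration.

```clojure
(require '[clojure.java.io :as io])
(import 'java.security.MessageDigest)

(defn md5-hex
  "Hypothetical helper: streams `source` through MD5 in fixed-size chunks,
  so memory use stays constant regardless of the input size."
  [source]
  (let [digest (MessageDigest/getInstance "MD5")
        buffer (byte-array 8192)]            ; buffer size is an arbitrary choice
    (with-open [in (io/input-stream source)]
      (loop []
        (let [n (.read in buffer)]           ; -1 signals end of stream
          (when (pos? n)
            (.update digest buffer 0 n)      ; feed only the bytes actually read
            (recur)))))
    (format "%032x" (BigInteger. 1 (.digest digest)))))
```

Anything accepted by clojure.java.io/input-stream (a file name, File, URL, or byte array) can be passed as source, for example (md5-hex "/path/to/big-file").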

Related

How can I record time for function call in clojure

I am a newbie to Clojure. I am invoking a Clojure function from Java, and I want to record the time a particular line of Clojure code takes to execute.
Suppose my Clojure function is:
(defn sampleFunction [sampleInput]
  (fun1 (fun2 sampleInput)))
I invoke the function above from Java; it returns some String value, and I want to record the time it takes to execute fun2.
I have another function, say logTime, which writes the parameter passed to it to some database:
(defn logTime [time]
.....
)
My question is: how can I modify sampleFunction to invoke logTime to record the time it took to execute fun2?
Thank you in advance.
I'm not entirely sure how the different pieces of your code fit together and interoperate with Java, but here's something that could work with the way you described it.
To get the execution time of a piece of code, there's a core macro called time. However, it doesn't return the execution time, it just prints it. So, given that you want to log that time into a database, we need to write a macro that captures both the return value of fun2 as well as the time it took to execute:
(defmacro time-execution
  [& body]
  `(let [s# (new java.io.StringWriter)]
     (binding [*out* s#]
       ;; time accepts a single expression, so wrap the body in do
       (hash-map :return (time (do ~@body))
                 :time   (.replaceAll (str s#) "[^0-9\\.]" "")))))
What this macro does is bind standard output to a Java StringWriter, so that we can use it to store whatever time prints. To return both the result of fun2 and the time it took to execute, we package the two values in a hash-map (it could be some other collection too; we'll end up destructuring it later). Notice that the code whose execution we're timing is wrapped in a call to time, so that we trigger the printing side effect and capture it in s#. Finally, the .replaceAll is just to ensure that we extract only the numeric value (in milliseconds, as a string), since time prints something of the form "Elapsed time: 0.014617 msecs".
Incorporating this into your code, we need to rewrite sampleFunction like so:
(defn sampleFunction [sampleInput]
  (let [{:keys [return time]} (time-execution (fun2 sampleInput))]
    (logTime time)
    (fun1 return)))
We're simply destructuring the hash-map to access both the return value of fun2 and the time it took to execute, then we log the execution time using logTime, and finally we finish by calling fun1 on the return value of fun2.
The Tupelo library's tupelo.profile namespace gives you many options if you want to capture execution time for one or more functions and accumulate it over multiple calls. An example:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.profile :as prof]))
(defn add2 [x y] (+ x y))
(prof/defnp fast [] (reduce add2 0 (range 10000)))
(prof/defnp slow [] (reduce add2 0 (range 10000000)))
(dotest
  (prof/timer-stats-reset)
  (dotimes [i 10000] (fast))
  (dotimes [i 10] (slow))
  (prof/print-profile-stats))
with result:
--------------------------------------
Clojure 1.10.2-alpha1 Java 14
--------------------------------------
Testing tst.demo.core
---------------------------------------------------------------------------------------------------
Profile Stats:
Samples TOTAL MEAN SIGMA ID
10000 0.955 0.000096 0.000045 :tst.demo.core/fast
10 0.905 0.090500 0.000965 :tst.demo.core/slow
---------------------------------------------------------------------------------------------------
If you want detailed timing for a single method, the Criterium library is what you need. Start off with the quick-bench function.
The accepted answer has some shortcomings, such as swallowing anything printed to *out* while the code is being timed. A simpler alternative:
(defmacro time-execution [body]
  `(let [st#     (System/currentTimeMillis)
         return# ~body
         se#     (System/currentTimeMillis)]
     {:return return#
      ;; elapsed wall-clock time in seconds
      :time   (double (/ (- se# st#) 1000))}))
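The same idea also works without a macro if the code to be timed is passed as a zero-argument function. The following is only a sketch: timed and sample-function are hypothetical names, and fun1, fun2 and log-time are passed in as parameters here so the snippet stands alone (in the question they are existing functions).

```clojure
(defn timed
  "Hypothetical helper: calls the zero-argument function `f` and returns a map
  with its result and the elapsed wall-clock time in milliseconds."
  [f]
  (let [start  (System/nanoTime)
        result (f)
        end    (System/nanoTime)]
    {:return result
     :time   (/ (- end start) 1e6)}))   ; nanoseconds -> milliseconds

(defn sample-function
  "Times only the call to fun2, logs it, then passes fun2's result to fun1."
  [fun1 fun2 log-time sample-input]
  (let [{:keys [return time]} (timed #(fun2 sample-input))]
    (log-time time)
    (fun1 return)))
```

System/nanoTime is used instead of System/currentTimeMillis because it is intended for measuring elapsed intervals; nothing is printed, so output from the timed code is left alone.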

Closing a channel at the producer end when all the jobs are finished

For my Mandelbrot explorer project, I need to run several expensive jobs, ideally in parallel. I decided to try chunking the jobs and running each chunk in its own thread, and ended up with something like
(defn point-calculator [chunk-size points]
  (let [out-chan (chan (count points))
        chunked  (partition chunk-size points)]
    (doseq [chunk chunked]
      (thread
        (let [processed-chunk (expensive-calculation chunk)]
          (>!! out-chan processed-chunk))))
    out-chan))
Where points is a list of [real, imaginary] coordinates to be tested, and expensive-calculation is a function that takes the chunk, and tests each point in the chunk. Each chunk can take a long time to finish (potentially a minute or more depending on the chunk size and the number of jobs).
On my consumer end, I'm using
(loop []
  (when-let [proc-chunk (<!! result-chan)]
    ; Do stuff with chunk
    (recur)))
To consume each processed chunk. Right now, this blocks when the last chunk is consumed since the channel is still open.
I need a way of closing the channel when the jobs are done. This is proving difficult because of asynchronicity of the producer loop. I can't simply put a close! after the doseq since the loop doesn't block, and I can't just close when the last-indexed job is done, since the order is indeterminate.
The best idea I could come up with was maintaining a (atom #{}) of jobs, and disj each job as it finishes. Then I could either check for the set size in the loop, and close! when it's 0, or attach a watch to the atom and check there.
This seems very hackish though. Is there a more idiomatic way of dealing with this? Does this scenario suggest I'm using async incorrectly?
I would take a look at the take function from core.async. Its documentation says:
"Returns a channel that will return, at most, n items from ch. After n items
have been returned, or ch has been closed, the return channel will close."
This leads to a simple fix: instead of returning out-chan, wrap it in take:
(clojure.core.async/take (count chunked) out-chan)
That should work.
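Put together, the producer might look like the sketch below. It assumes core.async is on the classpath; expensive-calculation is passed in as a parameter so the snippet is self-contained, and partition-all is used instead of partition so that a trailing partial chunk is not dropped (a deliberate change from the question's code).

```clojure
(require '[clojure.core.async :as async :refer [chan thread >!! <!!]])

(defn point-calculator
  "Sketch: processes `points` in chunks on separate threads and returns a
  channel that closes once one result per chunk has been delivered."
  [expensive-calculation chunk-size points]
  (let [chunked  (partition-all chunk-size points)
        n        (count chunked)
        out-chan (chan n)]                    ; room for every result, so puts never block
    (doseq [chunk chunked]
      (thread
        (>!! out-chan (expensive-calculation chunk))))
    ;; take closes the returned channel after n items have passed through
    (async/take n out-chan)))
```

The consumer loop from the question can stay as it is: once the last chunk has been taken, <!! returns nil and the when-let ends the loop.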
I would also recommend rewriting your example from blocking put/take (>!!, <!!) to the parking versions (>!, <!), and from thread to go / go-loop, which is the more idiomatic usage for core.async.
You may want to use async/pipeline (or pipeline-blocking) to control the parallelism, and async/onto-chan to close the input channel automatically once all the chunks have been copied onto it.
For example, the code below shows a 16x improvement in elapsed time when the parallelism is set to 16.
(require '[clojure.core.async
           :refer [chan go-loop <! <!! pipeline-blocking onto-chan]])

(defn expensive-calculation [pts]
  (Thread/sleep 100)
  (reduce + pts))
(time
  (let [points     (take 10000 (repeatedly #(rand 100)))
        chunk-size 500
        inp-chan   (chan)
        out-chan   (chan)]
    (go-loop []
      (when-let [res (<! out-chan)]
        ;; do stuff with chunk
        (recur)))
    (pipeline-blocking 16 out-chan (map expensive-calculation) inp-chan)
    (<!! (onto-chan inp-chan (partition-all chunk-size points)))))

write large data structures as EDN to disk in clojure

What is the most idiomatic way to write a data structure to disk in Clojure, so I can read it back with edn/read? I tried the following, as recommended in the Clojure cookbook:
(with-open [w (clojure.java.io/writer "data.clj")]
  (binding [*out* w]
    (pr large-data-structure)))
However, this will only write the first 100 items, followed by "...". I tried (prn (doall large-data-structure)) as well, yielding the same result.
I've managed to do it by writing line by line with (doseq [i large-data-structure] (pr i)), but then I have to manually add the parens at the beginning and end of the sequence to get the desired result.
You can control the number of items in a collection that are printed via *print-length*.
Consider using spit instead of manually opening the writer, and pr-str instead of manually binding *out*.
(binding [*print-length* false]
  (spit "data.clj" (pr-str large-data-structure)))
Edit from comment:
(with-open [w (clojure.java.io/writer "data.clj")]
  (binding [*print-length* false
            *out* w]
    (pr large-data-structure)))
Note: *print-length* has a root binding of nil so you should not need to bind it in the example above. I would check the current binding at the time of your original pr call.
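To check that the file really reads back with edn/read, a round trip along these lines can help. It is a sketch under stated assumptions: write-edn! and read-edn are hypothetical helper names, and *print-level* is also bound to nil on the assumption that deep nesting should not be truncated either.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn write-edn!
  "Hypothetical helper: prints `data` to `file` with truncation disabled."
  [file data]
  (with-open [w (io/writer file)]
    (binding [*print-length* nil   ; no limit on items per collection
              *print-level*  nil   ; no limit on nesting depth
              *out*          w]
      (pr data))))

(defn read-edn
  "Hypothetical helper: reads a single EDN value back from `file`."
  [file]
  (with-open [r (java.io.PushbackReader. (io/reader file))]
    (edn/read r)))
```

Usage: after (write-edn! "data.edn" large-data-structure), the expression (= large-data-structure (read-edn "data.edn")) should be true for plain EDN data such as numbers, strings, keywords, vectors, maps and sets.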

Intermittent error serving a binary file with Clojure/Ring

I am building an event collector in Clojure for Snowplow (using Ring/Compojure) and am having some trouble serving a transparent pixel with Ring. This is my code for sending the pixel:
(ns snowplow.clojure-collector.responses
  (:import (org.apache.commons.codec.binary Base64)
           (java.io ByteArrayInputStream)))
(def pixel-bytes (Base64/decodeBase64 (.getBytes "R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==")))
(def pixel (ByteArrayInputStream. pixel-bytes))
(defn send-pixel
  []
  {:status 200
   :headers {"Content-Type" "image/gif"}
   :body pixel})
When I start up my server, the first time I hit the path for send-pixel, the pixel is successfully delivered to my browser. But the second time - and every time afterwards - Ring sends no body (and content-length 0). Restart the server and it's the same pattern.
A few things it's not:
I have replicated this using wget, to confirm the intermittent-ness isn't a browser caching issue
I generated the "R0lGOD..." base64 string at the command line (cat original.gif | base64), so I know there is no issue there
When the pixel is successfully sent, I have verified its contents are correct (diff original.gif received-pixel.gif)
I'm new to Clojure - my guess is there's some embarrassing dynamic gremlin in my code, but I need help spotting it!
I figured out the problem in the REPL shortly after posting:
user=> (import (org.apache.commons.codec.binary Base64) (java.io ByteArrayInputStream))
java.io.ByteArrayInputStream
user=> (def pixel-bytes (Base64/decodeBase64 (.getBytes "R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==")))
#'user/pixel-bytes
user=> (def pixel (ByteArrayInputStream. pixel-bytes))
#'user/pixel
user=> (slurp pixel-bytes)
"GIF89a!�\n,L;"
user=> (slurp pixel-bytes)
"GIF89a!�\n,L;"
user=> (slurp pixel)
"GIF89a!�\n,L;"
user=> (slurp pixel)
""
So basically the problem was that the ByteArrayInputStream was getting emptied after the first call. Mutable data structures!
I fixed the bug by generating a new ByteArrayInputStream for each response, with:
:body (ByteArrayInputStream. pixel-bytes)}))
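For completeness, the whole handler with that fix applied might look like the sketch below. One swap is made so the snippet runs without extra dependencies: it uses java.util.Base64 (Java 8+) instead of commons-codec. That substitution is an assumption for illustration, not what the original collector used.

```clojure
(import '(java.io ByteArrayInputStream)
        '(java.util Base64))

;; Decoded once at load time; the array is never modified, so it can be shared.
(def pixel-bytes
  (.decode (Base64/getDecoder)
           "R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw=="))

(defn send-pixel
  "Builds a fresh stream per response, since a stream can only be consumed once."
  []
  {:status  200
   :headers {"Content-Type" "image/gif"}
   :body    (ByteArrayInputStream. pixel-bytes)})
```

Only the immutable-in-practice byte array lives in a def; the stateful stream is created inside the function, so every request gets its own.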
The problem is that your pixel var holds a stream. Once it has been read to the end it is exhausted, so every later read returns nothing.
Moreover, you do not need to deal with encoding issues at all. Ring can serve static files as well. Just return:
(file-response "/path/to/pixel.gif")
It also handles missing files: file-response returns nil when the file does not exist. See the docs as well.

Parse a little-endian binary file, stuffing into a matrix

I have a binary file that contains an X by X matrix. The file itself is a sequence of single-precision floats (little-endian). What I would like to do is parse it, and stuff it into some reasonable clojure matrix data type.
Thanks to this question, I see I can parse a binary file with gloss. I now have code that looks like this:
(ns foo.core
  (:require gloss.core)
  (:require gloss.io)
  (:use [clojure.java.io])
  (:use [clojure.math.numeric-tower]))
(gloss.core/defcodec mycodec
  (gloss.core/repeated :float32 :prefix :none))
(def buffer (byte-array (* 1200 1200)))
(.read (input-stream "/path/to/binaryfile") buffer)
(gloss.io/decode mycodec buffer)
This takes a while to run, but eventually dumps out a big list of numbers. Unfortunately, the numbers are all wrong. Upon further investigation, the numbers were read as big-endian.
Assuming there is some way to read these binary files as little-endian, I'd like to stuff the results into a matrix. This question seems to have settled on using Incanter with its Parallel Colt representation, however, that question was from '09, and I'm hoping to stick to clojure 1.4 and lein 2. Somewhere in my frenzy of googling, I saw other recommendations to use jblas or mahout. Is there a "best" matrix library for clojure these days?
EDIT: Reading a binary file is tantalizingly close. Thanks to this handy nio wrapper, I am able to get a memory mapped byte buffer as a short one-liner, and even reorder it:
(ns foo.core
(:require [clojure.java.io :as io])
(:require [nio.core :as nio])
(:import [java.nio ByteOrder]))
(def buffer (nio/mmap "/path/to/binaryfile"))
(class buffer) ;; java.nio.DirectByteBuffer
(.order buffer java.nio.ByteOrder/LITTLE_ENDIAN)
;; #<DirectByteBuffer java.nio.DirectByteBuffer[pos=0 lim=5760000 cap=5760000]>
However, reordering without doing the intermediate (def) step, fails:
(.order (nio/mmap f) java.nio.ByteOrder/LITTLE_ENDIAN)
;; clojure.lang.Compiler$CompilerException: java.lang.IllegalArgumentException: Unable to resolve classname: MappedByteBuffer, compiling:(/Users/peter/Developer/foo/src/foo/core.clj:12)
;; at clojure.lang.Compiler.analyzeSeq (Compiler.java:6462)
;; clojure.lang.Compiler.analyze (Compiler.java:6262)
;; etc...
I'd like to be able to create the reordered byte buffer inside a function without defining a global variable, but right now it doesn't seem to like that.
Also, once I've got it reordered, I'm not entirely sure what to do with my DirectByteBuffer, as it doesn't seem to be iterable. Perhaps for the remaining step of reading this buffer object (into a JBLAS matrix), I will create a second question.
EDIT 2: I am marking the answer below as accepted, because I think my original question combined too many things. Once I figure out the remainder of this I will try to update this question with complete code that starts with this ByteBuffer and that reads into a JBLAS matrix (which appears to be the right data structure).
In case anyone was interested, I was able to create a function that returns a properly ordered bytebuffer as follows:
;; This works!
(defn readf [^String file]
  (.order
    (.map
      (.getChannel
        (java.io.RandomAccessFile. file "r"))
      ;; 1200 x 1200 floats at 4 bytes each (matches lim=5760000 above)
      java.nio.channels.FileChannel$MapMode/READ_ONLY 0 (* 4 1200 1200))
    java.nio.ByteOrder/LITTLE_ENDIAN))
The nio wrapper I found looks to simplify / prettify this quite a lot, but it would appear I'm either not using it correctly, or there is something wrong. To recap my findings with the nio wrapper:
;; this works
(def buffer (nio/mmap "/bin/file"))
(def buffer (.order buffer java.nio.ByteOrder/LITTLE_ENDIAN))
(def buffer (.asFloatBuffer buffer))
;; this fails
(def buffer
  (.asFloatBuffer
    (.order
      (nio/mmap "/bin/file")
      java.nio.ByteOrder/LITTLE_ENDIAN)))
Sadly, this is a clojure mystery for another day, or perhaps another StackOverflow question.
Open a FileChannel(), then get a memory mapped buffer. There are lots of tutorials on the web for this step.
Switch the order of the buffer to little endian by calling order(endian-ness) (not the no-arg version of order). Finally, the easiest way to extract floats would be to call asFloatBuffer() on it and use the resulting buffer to read the floats.
After that you can put the data into whatever structure you need.
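The three steps above (map the file, switch the byte order, view it as floats) can be sketched as a single function. The name read-le-floats is hypothetical, and returning a plain float array is just one possible choice; a matrix library could be fed from the FloatBuffer directly.

```clojure
(import '(java.io RandomAccessFile)
        '(java.nio ByteOrder FloatBuffer)
        '(java.nio.channels FileChannel$MapMode))

(defn read-le-floats
  "Hypothetical helper: memory-maps the file at `path` and returns its contents
  as a float array, reading the bytes as little-endian float32 values."
  [^String path]
  (with-open [file (RandomAccessFile. path "r")]
    (let [channel (.getChannel file)
          ^FloatBuffer fbuf
          (-> (.map channel FileChannel$MapMode/READ_ONLY 0 (.size channel))
              (.order ByteOrder/LITTLE_ENDIAN)   ; must happen before asFloatBuffer
              (.asFloatBuffer))
          out (float-array (.remaining fbuf))]
      (.get fbuf out)                            ; bulk copy into the array
      out)))
```

For an X by X matrix stored row by row, element (i, j) is then (aget result (+ (* i X) j)), assuming row-major layout, which the question does not state.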
Edit: here's an example of how to use the API.
;; first, I created a 96 byte file, then I started the repl
;; put some little endian floats in the file and close it
user=> (def file (java.io.RandomAccessFile. "foo.floats", "rw"))
#'user/file
user=> (def channel (.getChannel file))
#'user/channel
user=> (def buffer (.map channel java.nio.channels.FileChannel$MapMode/READ_WRITE 0 96))
#'user/buffer
user=> (.order buffer java.nio.ByteOrder/LITTLE_ENDIAN)
#<DirectByteBuffer java.nio.DirectByteBuffer[pos=0 lim=96 cap=96]>
user=> (def fbuffer (.asFloatBuffer buffer))
#'user/fbuffer
user=> (.put fbuffer 0 0.0)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.put fbuffer 1 1.0)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.put fbuffer 2 2.3)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.close channel)
nil
;; memory map the file, try reading the floats w/o changing the endianness of the buffer
user=> (def file2 (java.io.RandomAccessFile. "foo.floats" "r"))
#'user/file2
user=> (def channel2 (.getChannel file2))
#'user/channel2
user=> (def buffer2 (.map channel2 java.nio.channels.FileChannel$MapMode/READ_ONLY 0 96))
#'user/buffer2
user=> (def fbuffer2 (.asFloatBuffer buffer2))
#'user/fbuffer2
user=> (.get fbuffer2 0)
0.0
user=> (.get fbuffer2 1)
4.6006E-41
user=> (.get fbuffer2 2)
4.1694193E-8
;; change the order of the buffer and read the floats
user=> (.order buffer2 java.nio.ByteOrder/LITTLE_ENDIAN)
#<DirectByteBufferR java.nio.DirectByteBufferR[pos=0 lim=96 cap=96]>
user=> (def fbuffer2 (.asFloatBuffer buffer2))
#'user/fbuffer2
user=> (.get fbuffer2 0)
0.0
user=> (.get fbuffer2 1)
1.0
user=> (.get fbuffer2 2)
2.3
user=> (.close channel2)
nil
user=>