Parse a little-endian binary file, stuffing into a matrix - clojure

I have a binary file that contains an X by X matrix. The file itself is a sequence of single-precision floats (little-endian). What I would like to do is parse it, and stuff it into some reasonable clojure matrix data type.
Thanks to this question, I see I can parse a binary file with gloss. I now have code that looks like this:
(ns foo.core
  (:require gloss.core)
  (:require gloss.io)
  (:use [clojure.java.io])
  (:use [clojure.math.numeric-tower]))
(gloss.core/defcodec mycodec
  (gloss.core/repeated :float32 :prefix :none))

(def buffer (byte-array (* 1200 1200 4))) ;; 4 bytes per single-precision float
(.read (input-stream "/path/to/binaryfile") buffer)
(gloss.io/decode mycodec buffer)
This takes a while to run, but eventually dumps out a big list of numbers. Unfortunately, the numbers are all wrong. Upon further investigation, the numbers were read as big-endian.
Assuming there is some way to read these binary files as little-endian, I'd like to stuff the results into a matrix. This question seems to have settled on using Incanter with its Parallel Colt representation, however, that question was from '09, and I'm hoping to stick to clojure 1.4 and lein 2. Somewhere in my frenzy of googling, I saw other recommendations to use jblas or mahout. Is there a "best" matrix library for clojure these days?
EDIT: Reading a binary file is tantalizingly close. Thanks to this handy nio wrapper, I am able to get a memory mapped byte buffer as a short one-liner, and even reorder it:
(ns foo.core
  (:require [clojure.java.io :as io]
            [nio.core :as nio])
  (:import [java.nio ByteOrder]))
(def buffer (nio/mmap "/path/to/binaryfile"))
(class buffer) ;; java.nio.DirectByteBuffer
(.order buffer java.nio.ByteOrder/LITTLE_ENDIAN)
;; #<DirectByteBuffer java.nio.DirectByteBuffer[pos=0 lim=5760000 cap=5760000]>
However, reordering without doing the intermediate (def) step, fails:
(.order (nio/mmap f) java.nio.ByteOrder/LITTLE_ENDIAN)
;; clojure.lang.Compiler$CompilerException: java.lang.IllegalArgumentException: Unable to resolve classname: MappedByteBuffer, compiling:(/Users/peter/Developer/foo/src/foo/core.clj:12)
;; at clojure.lang.Compiler.analyzeSeq (Compiler.java:6462)
;; clojure.lang.Compiler.analyze (Compiler.java:6262)
;; etc...
I'd like to be able to create the reordered byte buffer inside a function without defining a global variable, but right now the compiler seems to not like that.
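If the unresolved MappedByteBuffer is coming from an unqualified type hint that the nio wrapper emits into the calling namespace (an assumption on my part, not something verified against nio.core's source), one possible workaround is to import that class yourself:

```clojure
(ns foo.core
  (:require [nio.core :as nio])
  ;; MappedByteBuffer is imported on the assumption that an
  ;; unqualified ^MappedByteBuffer hint from the nio wrapper
  ;; needs to resolve in this namespace
  (:import [java.nio ByteOrder MappedByteBuffer]))

(defn little-endian-mmap
  "Memory-map file and flip its byte order, without an intermediate def."
  [file]
  (.order (nio/mmap file) ByteOrder/LITTLE_ENDIAN))
```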
Also, once I've got it reordered, I'm not entirely sure what to do with my DirectByteBuffer, as it doesn't seem to be iterable. Perhaps for the remaining step of reading this buffer object (into a JBLAS matrix), I will create a second question.
EDIT 2: I am marking the answer below as accepted, because I think my original question combined too many things. Once I figure out the remainder of this I will try to update this question with complete code that starts with this ByteBuffer and that reads into a JBLAS matrix (which appears to be the right data structure).
In case anyone was interested, I was able to create a function that returns a properly ordered bytebuffer as follows:
;; This works!
(defn readf [^String file]
  (.order
   (.map
    (.getChannel
     (java.io.RandomAccessFile. file "r"))
    java.nio.channels.FileChannel$MapMode/READ_ONLY 0 (* 1200 1200 4)) ;; 4 bytes per float
   java.nio.ByteOrder/LITTLE_ENDIAN))
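From here, one way to get the data into a matrix-shaped structure (a sketch assuming a square n-by-n file of 4-byte floats; `buffer->rows` is my own name) is to view the buffer as a FloatBuffer and bulk-copy each row; the resulting vector of float arrays can then be handed to a matrix library:

```clojure
(defn buffer->rows
  "Read an n x n matrix of little-endian floats from file,
   returning a vector of float-arrays, one per row."
  [^String file n]
  (let [bb (.order (.map (.getChannel (java.io.RandomAccessFile. file "r"))
                         java.nio.channels.FileChannel$MapMode/READ_ONLY
                         0 (* n n 4))                 ;; 4 bytes per float
                   java.nio.ByteOrder/LITTLE_ENDIAN)
        fb (.asFloatBuffer bb)]
    (vec (for [_ (range n)]
           (let [row (float-array n)]
             (.get fb row)                            ;; bulk-copy the next n floats
             row)))))
```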
The nio wrapper I found looks to simplify / prettify this quite a lot, but it would appear I'm either not using it correctly, or there is something wrong. To recap my findings with the nio wrapper:
;; this works
(def buffer (nio/mmap "/bin/file"))
(def buffer (.order buffer java.nio.ByteOrder/LITTLE_ENDIAN))
(def buffer (.asFloatBuffer buffer))

;; this fails
(def buffer
  (.asFloatBuffer
   (.order
    (nio/mmap "/bin/file")
    java.nio.ByteOrder/LITTLE_ENDIAN)))
Sadly, this is a clojure mystery for another day, or perhaps another StackOverflow question.

Open a FileChannel, then get a memory-mapped buffer; there are lots of tutorials on the web for this step.
Switch the byte order of the buffer to little-endian by calling order(endian-ness) (not the no-arg version of order). Finally, the easiest way to extract floats is to call asFloatBuffer() on it and use the resulting FloatBuffer to read the floats.
After that you can put the data into whatever structure you need.
edit Here's an example of how to use the API.
;; first, I created a 96 byte file, then I started the repl
;; put some little endian floats in the file and close it
user=> (def file (java.io.RandomAccessFile. "foo.floats", "rw"))
#'user/file
user=> (def channel (.getChannel file))
#'user/channel
user=> (def buffer (.map channel java.nio.channels.FileChannel$MapMode/READ_WRITE 0 96))
#'user/buffer
user=> (.order buffer java.nio.ByteOrder/LITTLE_ENDIAN)
#<DirectByteBuffer java.nio.DirectByteBuffer[pos=0 lim=96 cap=96]>
user=> (def fbuffer (.asFloatBuffer buffer))
#'user/fbuffer
user=> (.put fbuffer 0 0.0)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.put fbuffer 1 1.0)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.put fbuffer 2 2.3)
#<DirectFloatBufferU java.nio.DirectFloatBufferU[pos=0 lim=24 cap=24]>
user=> (.close channel)
nil
;; memory map the file, try reading the floats w/o changing the endianness of the buffer
user=> (def file2 (java.io.RandomAccessFile. "foo.floats" "r"))
#'user/file2
user=> (def channel2 (.getChannel file2))
#'user/channel2
user=> (def buffer2 (.map channel2 java.nio.channels.FileChannel$MapMode/READ_ONLY 0 96))
#'user/buffer2
user=> (def fbuffer2 (.asFloatBuffer buffer2))
#'user/fbuffer2
user=> (.get fbuffer2 0)
0.0
user=> (.get fbuffer2 1)
4.6006E-41
user=> (.get fbuffer2 2)
4.1694193E-8
;; change the order of the buffer and read the floats
user=> (.order buffer2 java.nio.ByteOrder/LITTLE_ENDIAN)
#<DirectByteBufferR java.nio.DirectByteBufferR[pos=0 lim=96 cap=96]>
user=> (def fbuffer2 (.asFloatBuffer buffer2))
#'user/fbuffer2
user=> (.get fbuffer2 0)
0.0
user=> (.get fbuffer2 1)
1.0
user=> (.get fbuffer2 2)
2.3
user=> (.close channel2)
nil
user=>

Related

easiest way to use a i/o callback within concurrent http-kit/get instances

I am launching a few hundreds concurrent http-kit.client/get requests provided with a callback to write results to a single file.
What would be a good way to deal with thread-safety? Using chan and <!! from core.async?
Here's the code I would consider :
(defn launch-async [channel url]
  (http/get url {:timeout 5000
                 :user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:10.0) Gecko/20100101 Firefox/10.0"}
            (fn [{:keys [status headers body error]}]
              (if error
                (put! channel (json/generate-string {:url url :headers headers :status status}))
                (put! channel (json/generate-string body))))))

(defn process-async [channel func]
  (when-let [response (<!! channel)]
    (func response)))

(defn http-gets-async [func urls]
  (let [channel (chan)]
    (doall (map #(launch-async channel %) urls))
    (process-async channel func)))
Thanks for your insights.
Since you are already using core.async in your example, I thought I'd point out a few issues and how you can address them. The other answer mentions using a more basic approach, and I agree wholeheartedly that a simpler approach is just fine. However, with channels, you have a simple way of consuming the data which does not involve mapping over a vector, which will also grow large over time if you have many responses. Consider the following issues and how we can fix them:
(1) Your current version will crash if your url list has more than 1024 elements. There's an internal buffer for puts and takes that are asynchronous (i.e., put! and take! don't block but always return immediately), and the limit is 1024. This is in place to prevent unbounded asynchronous usage of the channel. To see for yourself, call (http-gets-async println (repeat 1025 "http://blah-blah-asdf-fakedomain.com")).
What you want to do is to only put something on the channel when there's room to do so. This is called back-pressure. Taking a page from the excellent wiki on go block best practices, one clever way to do this from your http-kit callback is to use the put! callback option to launch your next http get; this will only happen when the put! immediately succeeds, so you will never have a situation where you can go beyond the channel's buffer:
(defn launch-async
  [channel [url & urls]]
  (when url
    (http/get url {:timeout 5000
                   :user-agent "Mozilla"}
              (fn [{:keys [status headers body error]}]
                (let [put-on-chan (if error
                                    (json/generate-string {:url url :headers headers :status status})
                                    (json/generate-string body))]
                  (put! channel put-on-chan (fn [_] (launch-async channel urls))))))))
(2) Next, you seem to be only processing one response. Instead, use a go-loop:
(defn process-async
  [channel func]
  (go-loop []
    (when-let [response (<! channel)]
      (func response)
      (recur))))
(3) Here's your http-gets-async function. I see no harm in adding a buffer here, as it should help you fire off a nice burst of requests at the beginning:
(defn http-gets-async
  [func urls]
  (let [channel (chan 1000)]
    (launch-async channel urls)
    (process-async channel func)))
Now, you have the ability to process an infinite number of urls, with back-pressure. To test this, define a counter, and then make your processing function increment this counter to see your progress. Use a localhost URL that is easy to bang on (I wouldn't recommend firing off hundreds of thousands of requests at, say, google):
(def responses (atom 0))

(http-gets-async (fn [_] (swap! responses inc))
                 (repeat 1000000 "http://localhost:8000"))
As this is all asynchronous, your function will return immediately and you can watch @responses grow.
One other interesting thing you can do is instead of running your processing function in process-async, you could optionally apply it as a transducer on the channel itself.
(defn process-async
  [channel]
  (go-loop []
    (when-let [_ (<! channel)]
      (recur))))

(defn http-gets-async
  [func urls]
  (let [channel (chan 10000 (map func))] ;; <-- transducer on channel
    (launch-async channel urls)
    (process-async channel)))
There are many ways to do this, including constructing it so that the channel closes (note that above, it stays open). You have java.util.concurrent primitives to help in this regard if you like, and they are quite easy to use. The possibilities are very numerous.
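One way to get that completion signal (a sketch building on the code above; closing the channel on exhaustion is my own addition, not part of the original answer) is to close! the channel once the url list is exhausted, and have the consuming go-loop return a count that you can block on:

```clojure
(defn launch-async
  [channel [url & urls]]
  (if url
    (http/get url {:timeout 5000}
              (fn [{:keys [body error] :as resp}]
                (put! channel
                      (if error (pr-str resp) body)
                      ;; launch the next request only after this put succeeds
                      (fn [_] (launch-async channel urls)))))
    ;; no urls left: close, so consumers know production is finished
    (async/close! channel)))

(defn process-async
  [channel func]
  ;; returns a channel that yields the total count once `channel` closes
  (go-loop [n 0]
    (if-let [response (<! channel)]
      (do (func response)
          (recur (inc n)))
      n)))

;; blocking until everything has been processed:
;; (<!! (process-async channel func))
```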
This is simple enough that I wouldn't use core.async for it. You can do this with an atom storing a vector of the responses, then have a separate thread read the contents of the atom until it has seen all of the responses. Then, in your http-kit callback, you just swap! the response into the atom directly.
If you do want to use core.async, I'd recommend a buffered channel to keep from blocking your http-kit thread pool.
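A minimal sketch of that simpler atom-based approach (the names, the polling loop, and the output file are my own assumptions):

```clojure
(def responses (atom []))          ;; vector of completed responses

(defn fetch-all [urls]
  (doseq [url urls]
    (http/get url {:timeout 5000}
              ;; http-kit callback: just swap! the result in directly
              (fn [resp] (swap! responses conj resp))))
  ;; separate thread polls until it has seen all of the responses
  (future
    (while (< (count @responses) (count urls))
      (Thread/sleep 50))
    (spit "results.edn" (pr-str @responses))))
```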

Detect non-empty STDIN in Clojure

How do you detect non-empty standard input (*in*) without reading from it in a non-blocking way in Clojure?
At first, I thought calling the java.io.Reader#ready() method would do, but (.ready *in*) returns false even when standard input is provided.
Is this what you are looking for? InputStream .available
(defn -main [& args]
  (if (> (.available System/in) 0)
    (println "STDIN: " (slurp *in*))
    (println "No Input")))
$ echo "hello" | lein run
STDIN: hello
$ lein run
No Input
Update: It does seem that .available is subject to a race condition when checking STDIN. An alternative is to have a fixed timeout for STDIN to become available, and otherwise assume no data is coming from STDIN.
Here is an example of using core.async to attempt to read the first byte from STDIN and append it to the rest of the STDIN or timeout.
(ns stdin.core
  (:require
   [clojure.core.async :as async :refer [go >! timeout chan alt!!]])
  (:gen-class))

(defn -main [& args]
  (let [c (chan)]
    (go (>! c (.read *in*)))
    (if-let [ch (alt!! (timeout 500) nil
                       c ([ch] (if-not (< ch 0) ch)))]
      (do
        (.unread *in* ch)
        (println (slurp *in*)))
      (println "No STDIN"))))
Have you looked at PushbackReader? You can use it like:
Read a byte (blocking). Returns char read or -1 if stream is closed.
When returns, you know a byte is ready.
If the byte is something you're not ready for, put it back
If stream is closed (-1 return val), exit.
Repeat.
https://docs.oracle.com/javase/8/docs/api/index.html?java/io/PushbackReader.html
If you need it to be non-blocking stick it into a future, a core.async channel, or similar.
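Those steps might be sketched like this (read-until and its predicate argument are my own names; the loop assumes the reader is a PushbackReader, which Clojure's *in* already is):

```clojure
(defn read-until
  "Blockingly read chars from rdr until pred matches or the stream
   closes; pushes back the first non-matching char and returns the
   chars read so far."
  [^java.io.PushbackReader rdr pred]
  (loop [acc []]
    (let [c (.read rdr)]                       ;; blocks; -1 means closed
      (cond
        (neg? c)        acc                    ;; stream closed: exit
        (pred (char c)) (do (.unread rdr c)    ;; not ready for it: put it back
                            acc)
        :else           (recur (conj acc (char c)))))))
```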

Waiting for n channels with core.async

In the same way alt! waits for one of n channels to get a value, I'm looking for the idiomatic way to wait for all n channels to get a value.
I need this because I "spawn" n go blocks to work on async tasks, and I want to know when they are all done. I'm sure there is a very beautiful way to achieve this.
Use the core.async map function:
(<!! (a/map vector [ch1 ch2 ch3]))
;; [val-from-ch1 val-from-ch2 val-from-ch3]
You can say (mapv #(async/<!! %) channels).
If you wanted to handle individual values as they arrive, and then do something special after the final channel produces a value, you can exploit the fact that alts! / alts!! take a vector of channels, and they are functions, not macros, so you can easily pass in dynamically constructed vectors.
So, you can use alts!! to wait on your initial collection of n channels, then use it again on the remaining channels etc.
(def c1 (async/chan))
(def c2 (async/chan))

(def out
  (async/thread
    (loop [cs [c1 c2] vs []]
      (let [[v p] (async/alts!! cs)
            cs (filterv #(not= p %) cs)
            vs (conj vs v)]
        (if (seq cs)
          (recur cs vs)
          vs)))))

(async/>!! c1 :foo)
(async/>!! c2 :bar)
(async/<!! out)
;= [:foo :bar]
If instead you wanted to take all values from all the input channels and then do something else when they all close, you'd want to use async/merge:
clojure.core.async/merge
([chs] [chs buf-or-n])
Takes a collection of source channels and returns a channel which
contains all values taken from them. The returned channel will be
unbuffered by default, or a buf-or-n can be supplied. The channel
will close after all the source channels have closed.
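For example, merging the inputs and collecting everything with async/into (a sketch; the channel names are mine):

```clojure
(let [c1     (async/chan)
      c2     (async/chan)
      merged (async/merge [c1 c2])
      ;; into returns a channel that yields the collection once
      ;; `merged` closes, i.e. once c1 and c2 have both closed
      all    (async/into [] merged)]
  (async/>!! c1 :a)
  (async/>!! c2 :b)
  (async/close! c1)
  (async/close! c2)
  (async/<!! all))
;; => a vector containing :a and :b (order not guaranteed)
```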

write large data structures as EDN to disk in clojure

What is the most idiomatic way to write a data structure to disk in Clojure, so I can read it back with edn/read? I tried the following, as recommended in the Clojure cookbook:
(with-open [w (clojure.java.io/writer "data.clj")]
  (binding [*out* w]
    (pr large-data-structure)))
However, this will only write the first 100 items, followed by "...". I tried (prn (doall large-data-structure)) as well, yielding the same result.
I've managed to do it by writing line by line with (doseq [i large-data-structure] (pr i)), but then I have to manually add the parens at the beginning and end of the sequence to get the desired result.
You can control the number of items in a collection that are printed via *print-length*
Consider using spit instead of manually opening the writer, and pr-str instead of manually binding *out*:
(binding [*print-length* false]
  (spit "data.clj" (pr-str large-data-structure)))
Edit from comment:
(with-open [w (clojure.java.io/writer "data.clj")]
  (binding [*print-length* false
            *out* w]
    (pr large-data-structure)))
Note: *print-length* has a root binding of nil so you should not need to bind it in the example above. I would check the current binding at the time of your original pr call.
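To verify the round trip, you can read the file back with clojure.edn (a sketch; the file name and test data are mine):

```clojure
(require '[clojure.edn :as edn])

(let [data {:a (vec (range 200))}]           ;; more than 100 items
  (spit "data.edn" (pr-str data))
  ;; read it back and confirm nothing was elided to "..."
  (= data (edn/read-string (slurp "data.edn"))))
;; => true when *print-length* is nil during pr-str
```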

Intermittent error serving a binary file with Clojure/Ring

I am building an event collector in Clojure for Snowplow (using Ring/Compojure) and am having some trouble serving a transparent pixel with Ring. This is my code for sending the pixel:
(ns snowplow.clojure-collector.responses
  (:import (org.apache.commons.codec.binary Base64)
           (java.io ByteArrayInputStream)))

(def pixel-bytes (Base64/decodeBase64 (.getBytes "R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==")))
(def pixel (ByteArrayInputStream. pixel-bytes))

(defn send-pixel
  []
  {:status 200
   :headers {"Content-Type" "image/gif"}
   :body pixel})
When I start up my server, the first time I hit the path for send-pixel, the pixel is successfully delivered to my browser. But the second time - and every time afterwards - Ring sends no body (and content-length 0). Restart the server and it's the same pattern.
A few things it's not:
I have replicated this using wget, to confirm the intermittent-ness isn't a browser caching issue
I generated the "R01GOD..." base64 string at the command-line (cat original.gif | base64) so know there is no issue there
When the pixel is successfully sent, I have verified its contents are correct (diff original.gif received-pixel.gif)
I'm new to Clojure - my guess is there's some embarrassing dynamic gremlin in my code, but I need help spotting it!
I figured out the problem in the REPL shortly after posting:
user=> (import (org.apache.commons.codec.binary Base64) (java.io ByteArrayInputStream))
java.io.ByteArrayInputStream
user=> (def pixel-bytes (Base64/decodeBase64 (.getBytes "R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==")))
#'user/pixel-bytes
user=> (def pixel (ByteArrayInputStream. pixel-bytes))
#'user/pixel
user=> (slurp pixel-bytes)
"GIF89a!�\n,L;"
user=> (slurp pixel-bytes)
"GIF89a!�\n,L;"
user=> (slurp pixel)
"GIF89a!�\n,L;"
user=> (slurp pixel)
""
So basically the problem was that the ByteArrayInputStream was getting emptied after the first call. Mutable data structures!
I fixed the bug by generating a new ByteArrayInputStream for each response, with:
:body (ByteArrayInputStream. pixel-bytes)}))
The problem is that your pixel var holds a stream. Once it has been read, it cannot be re-read.
Moreover, you do not need to deal with encoding issues. Ring serves static files as well. Just return:
(file-response "/path/to/pixel.gif")
It handles non-existent files as well; see the docs.
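For instance (a sketch assuming ring.util.response is available and the pixel is stored at the given path, which is my own placeholder):

```clojure
(ns snowplow.clojure-collector.responses
  (:require [ring.util.response :refer [file-response]]))

(defn send-pixel []
  ;; a fresh response per request; no shared mutable stream to drain
  (file-response "resources/pixel.gif"))
```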