Reading thousands of files in Clojure

I'm working on a script that needs to read tens of thousands of files from the disk. I'm trying to understand the best way to do this. I've run into a problem when I use map to do this using two packages clj-glob and clojure-mail:
(def sent-mail-paths
  (->> (str maildir-path "/*/_sent_mail/*")
       (glob)      ;; returns files using clojure.java.io/as-file
       (map str))) ;; I just want the paths
(def msgs
  (->> sent-mail-paths ;; 30K+ paths
       (map mail/file->message)))
Here, the glob function in the first block comes from clj-glob and uses as-file to return a set of file objects. I only want the path strings, so I do (map str). The mail/file->message function in the second block uses with-open along with the Java FileInputStream class to read the files.
The trouble is that this code throws an error the moment I try to process the files in the resulting lazy sequence, even with something as simple as:
(count msgs)
The error is:
(Too many open files in system)
The only way I've been able to get the job done is to use doseq:
(def msgs (->> list-of-paths ;; 30K+ paths
               (map mail/file->message)))
(def final (atom []))
(doseq [x list-of-paths] ;; iterate over the paths, not the lazy msgs seq
  (swap! final conj (mail/file->message x)))
My question is whether this is the best (only?) way to do this without opening thousands and thousands of files at once. I don't fully understand why I can't use the lazy sequence that map returns. Why does that end up opening tons of files?
One thing, incidentally, that I noticed is that clj-glob, which is not a well-maintained package, does not use with-open when it calls as-file...

Even if you open and close the files correctly, there's a chance that at some point during execution you hit a system-defined limit on the number of file descriptors that your program can have open (this is common in long-lived programs such as microservices).
You can read here on how to look up what that limit is currently and how to increase it: https://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
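For the "open/close the files correctly" part, here is a minimal sketch, assuming each file can be read in full before moving on to the next. read-file is a hypothetical stand-in for mail/file->message, and read-all is a made-up name. Each stream is closed by with-open before the next file is opened, so this code holds at most one descriptor at a time. Whether that makes the original error go away depends on what the libraries involved actually do with their streams.

```clojure
(require '[clojure.java.io :as io])

(defn read-file
  "Hypothetical stand-in for mail/file->message: reads the whole file
  and closes the stream before returning."
  [path]
  (with-open [in (io/input-stream path)]
    (slurp in)))

(defn read-all
  "Eagerly reads every path into a vector, one file at a time.
  mapv is eager, so nothing is left unrealized after it returns."
  [paths]
  (mapv read-file paths))
```

Usage would look like (read-all sent-mail-paths), which gives back a plain vector without the atom and doseq.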

Related

How do you mock file reading in Clojure

I have the following code that loads edn data from my resources folder:
(defn load-data
  []
  (->> (io/resource "news.edn")
       slurp
       edn/read-string))
I'd like to test this by mocking out the file reading part, so far I have this:
(deftest loading-data
  (is (= (edn/read-string (prn-str {:articles [{:title "ASAP Rocky released" :url "http://foo.com"}]}))
         (load-data))))
But I know this is a very flaky test: if the edn file's name or contents change, the test will fail. Any ideas?

You can mock out a call to a function with with-redefs, which "Temporarily redefines Vars while executing the body". E.g.,
(deftest load-data-test
  (with-redefs [slurp (constantly "{:a \"b\"}")]
    (is (= (load-data) {:a "b"}))))
This way the slurp in load-data in the scope of with-redefs returns "{:a \"b\"}".

What about this function are you trying to gain confidence in? Are you worried that news.edn won't exist? Are you worried that slurping and/or edn reading won't work on a resource?
My advice is to test your separate concerns separately:
If you're worried about news.edn not existing, assert its existence.
If you're worried about the rest of the function not converting from edn, add a new signature that accepts a resource, then provide another resource at test time to assert against.
If you're worried about the shape of the file, maybe have a test that runs against the data before it's put in news.edn.
Then when you come back to these tests years later, you'll see clear reasons for failures instead of a test that fell over for any of N possible reasons that are unknown until debug time.

As others have suggested: mock fewer things.
It might be a reflex you learned from building Java tests.
If you have composed your functions correctly, you can test them individually without the need for side-effects (like reading a file).
If you mock slurp in your example, you are not testing anything meaningful: You would essentially test if the standard function edn/read-string works as intended.

Rewrite your load-data to accept the name of the file to load as an argument (and later call it with news.edn in your "main"). This makes it much more functional, and you can test load-data the way you test it right now, except that you pass in some test-news.edn from your test resources. There is no need to mock anything for the happy path.
This way you can also write tests for other scenarios: what if the file is missing? What if the .edn file is malformed? What if you pass some resource that loads forever?
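A minimal sketch of that suggestion, assuming load-data takes anything slurp can read rather than only a file name. The source parameter and the load-news wrapper are names made up for illustration, not part of the original code.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

(defn load-data
  "Reads EDN from source, which can be anything slurp accepts
  (a resource URL, a file, a reader). `source` is an assumed parameter."
  [source]
  (-> source slurp edn/read-string))

(defn load-news
  "Hypothetical production entry point that keeps the original behaviour."
  []
  (load-data (io/resource "news.edn")))

;; In a test, pass in-memory EDN instead of redefining slurp.
(def sample (load-data (java.io.StringReader. "{:articles []}")))
```

With this shape the test data lives next to the test, and a renamed or edited news.edn no longer breaks it.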

Cider debug -- how to evaluate stuff while debugging

Cider debug instructions tell me I can press e to evaluate something while debugging. This gives me a little one-line space in Emacs mini-buffer at the bottom.
Is there a way to switch to the full REPL while in the middle of
debugging a function, with access to all the locals, etc.? Currently
the REPL is hung/frozen while debugging. I'm thinking of something in
the style of how PyCharm or Matlab allow full REPL while in the middle of something.

It does appear that the jacked-in REPL is tied up during debugging.
But there are a few options available through the debugger that may
give you nearly as much as you'd get out of the REPL. A handy one is
to inject a new value for the result you're about to produce.
So you're actually changing the data on-the-fly.
You can inspect the full list of local vars with l, then see more
about a particular var with inspect, specifying which one.
You can also use e (eval) to enter an arbitrary expression just like
you would in the REPL (as you've mentioned). That seems to be a
single-line full REPL, with history, editing, etc. Is there something
you'd want to do in the REPL that you can't do with e or discover
with l or p?

One thing I find really frustrating is that I can't edit a function while the debugger is stopped in it and then re-run it with the initial arguments. In CIDER, if you try to edit a function being debugged, Emacs opens the debugger in a new buffer with the original code. Alternatively, you have the e command that evals things in the minibuffer, which I don't think is a great experience. The closest I've come to this is the following:
Imagine you have some function that crashes and you need to debug:
(defn some-fn
  [complex-data more-data]
  ; block of code with some bug
  )
I'll create atoms in the namespace and set their values inside the given function:
(def c (atom nil))
(def d (atom nil))
(defn some-fn
  [complex-data more-data]
  (reset! c complex-data)
  (reset! d more-data)
  ; block of code with some bug
  )
Then I'll just iterate on some-fn using the args I now have available in the namespace.
(some-fn @c @d)
I think it's a much better approach than using the eval command and the minibuffer from the cider debugger.

Clojure file-system portability

I want to write a simple program for playing sound clips. I want to deploy it on Windows, Linux and Mac OS X. The thing that still puzzles me is the location of the configuration file and the folder with sound clips on different operating systems. I am a Clojure noob. I am aware that Common Lisp has a special file-system portability library called CL-FAD. How is this done in Clojure? How can I write a portable Clojure program that handles the different file system conventions on different systems?

You can use clojure.java.io/file to build paths in a (mostly) platform-neutral way, similarly to how you would with os.path.join in Python or File.join in Ruby.
(require '[clojure.java.io :as io])
;; On Linux
(def home "/home/jbm")
(io/file home "media" "music") ;=> #<File /home/jbm/media/music>
;; On Windows
(def home "c:\\home\\jbm")
(io/file home "media" "music") ;=> #<File c:\home\jbm\media\music>
clojure.java.io/file returns a java.io.File. If you need to get back to a string you can always use .getPath:
(-> home
    (io/file "media" "music")
    (.getPath))
;=> "/home/jbm/media/music"
Is that the sort of thing you had in mind?
In addition to clojure.java.io (and, of course, the methods on java.io.File), raynes.fs is a popular file system utility library.

Note that Windows perfectly supports the forward slash as a path separator (which is awesome because that way you don't have to escape backslashes all the time).
The only significant difficulty you'll run into is that the "standard" locations (home folder, etc.) are different on Windows and UNIX systems. So you need to get those from the system properties (see the getProperty method in http://docs.oracle.com/javase/7/docs/api/java/lang/System.html).
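As a rough sketch of the system-properties idea: "user.home" is a standard JVM system property, while the ".soundclips" directory and the function names below are made-up examples, not an established convention for any platform.

```clojure
(require '[clojure.java.io :as io])

(defn user-home
  "Home directory as reported by the JVM on the current platform."
  []
  (System/getProperty "user.home"))

(defn app-file
  "Builds a java.io.File under a per-user application directory.
  The \".soundclips\" directory name is a hypothetical example."
  [& parts]
  (apply io/file (user-home) ".soundclips" parts))
```

For example, (app-file "config.edn") and (app-file "clips" "bell.wav") each return a java.io.File rooted in the current user's home directory, with the separator chosen by the JVM.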

For a platform-independent approach, you can find the canonical path from a path relative to the project and then join it with the filename.
(require '[clojure.java.io :as io])
(defn file-dir
  "Returns the canonical path of a given path"
  [path]
  (.getCanonicalPath (io/file path)))
(-> "./resources" ;; relative
    (file-dir)
    (io/file "filename.txt")) ;;=> /path/to/project/resources/filename.txt

How to compile ClojureScript inside Clojure

I want to compile ClojureScript inside Clojure and am having some problems. I would like to do something like this:
(def x '(map (fn [n] (* n n n)) [1 2 3 4]))
(cljs->js x)
where cljs->js returns JavaScript code. I guess Himera does something similar (first reading ClojureScript from a string), but I don't know enough about ClojureScript to figure it out.
Is there a simple solution to this?

Have you looked at the Himera code? The code sent by the UI is compiled by calling cljs.compiler from the ClojureScript project. Note that Himera is probably a lot more complex than what you are asking for; you probably just need to get the "compilation" function working.

Once you have the ClojureScript dependencies sorted out (which is its own question), you can just call the ClojureScript emit function. This is used in the Clutch project (CouchDB for Clojure + ClojureScript). It basically looks like this:
(js/emit (aget doc "_id") nil)

Code sharing between server and client in Clojurescript/Clojure

Say I wanted to factor out some common code between my client-side *.cljs and my server-side *.clj, e.g. various data structures and common operations. Can I do that? Does it make sense to do it?

I wrote the cljx Leiningen plugin specifically to handle Clojure/ClojureScript code sharing for a Clojure data visualization library.
95% of non-host-interop code looks the same, and cljx lets you automatically rewrite that last 5% by specifying rewrite rules using core.logic.
Most of the time, though, it's simple symbol substitutions; clojure.lang.IFn in Clojure is just IFn in ClojureScript, for instance.
You can also use metadata to annotate forms to be included or excluded when code is generated for a specific platform.

Update: as of Clojure 1.7, check out Clojure reader conditionals or cljc. I've used cljc with great success to share a lot of code between server and browser very easily.
Great question! I've been thinking a lot about this as well lately and have written a few apps to experiment.
Here's my list of the types of things you might want to share and the pros/cons of each:
Most of my client cljs files contains code that manipulates the dom. So, it wouldn't make sense to share any of that with server
Most of the server side stuff deals with filesystem and database calls. I suppose you might want to call the database from the client (especially if you're using one of the no-sql db's that support javascript calls). But, even then, I feel like you should choose to either call db from client or call db from server and, therefore, it doesn't make much sense to share the db code either.
One area where sharing is definitely valuable is being able to share and pass clojure data structures (nested combinations of lists, vectors, sets, etc) between client and server. No need to convert to json (or xml) and back. For example, being able to pass hiccup-style representations of the dom back and forth is very convenient. In gwt, I've used gilead to share models between client and server. But, in clojure, you can simply pass data structures around, so there's really no need to share class definitions like in gwt.
One area that I feel I need to experiment more is sharing state between client and server. In my mind there are a few strategies: store state on client (single page ajax type applications) or store state on server (like legacy jsp apps) or a combo of both. Perhaps the code responsible for updating state (the atoms, refs, agents or whatever) could be shared and then state could be passed back and forth over request and response to keep the two tiers in synch? So far, simply writing server using REST best practices and then having state stored on client seems to work pretty well. But I could see how there might be benefits to sharing state between client and server.
I haven't needed to share Constants and/or Properties yet, but this might be something that would be good to reuse. If you put all your app's global constants in a clj file and then wrote a script to copy it over to cljs whenever you compiled the clojurescript, that should work fine, and might save a bit of duplication of code.
Hope these thoughts are useful, I'm very interested in what others have found so far!

The new lein-cljsbuild plugin for Leiningen has built-in support for sharing pure Clojure code.

Wrote a quick bit of code to copy a subset of my server clojure code over to my clojurescript code, renaming as .cljs before building:
(ns clj-cljs.build
  (:require [clojure.java.io :refer [as-file copy file make-parents]]
            [clojure.string :as str]
            [cljs.closure :as cljsc]))
(defn list-files [path]
  (.listFiles (as-file path)))
(defn copy-file* [from to]
  ;; (println "copying" from "to" to)
  (make-parents to)
  (copy from to))
(defn rename
  "Builds the target path for f, swapping the .clj extension for .cljs."
  [to-path common-path f]
  (str to-path common-path (str/replace (.getName f) #"\.clj$" ".cljs")))
(defn clj-cljs*
  "Copies every .clj file in files to to-path as .cljs, recursing into subdirectories."
  [files common-path to-path]
  (doseq [i (filter #(.endsWith (.getName %) ".clj") files)]
    (copy-file* i (file (rename to-path common-path i))))
  (doseq [i (filter #(.isDirectory %) files)]
    (clj-cljs* (list-files i) (str common-path (.getName i) "/") to-path)))
(defn build [{:keys [common-path clj-path cljs-path js-path module-name]}]
  (clj-cljs* (list-files (str clj-path common-path)) common-path cljs-path)
  (cljsc/build cljs-path
               {:output-dir js-path
                :output-to (str js-path module-name ".js")}))
(defn build-default []
  (build {:clj-path "/home/user/projects/example/code/src/main/clojure/"
          :cljs-path "/home/user/projects/example/code/src/main/cljs/"
          :js-path "/home/user/projects/example/code/public/js/cljs/"
          :common-path "example/common/" ; the root of your common server-client code
          :module-name "example"}))

This question predates cljc, but since I stumbled upon it, I thought I would mention Clojure reader conditionals.
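To give a feel for reader conditionals, here is a small sketch. In a real project the form would live in a .cljc file; it is held in a string here only so the snippet can be evaluated from a plain .clj file or a REPL. The parse-int function and the var names are made up for illustration.

```clojure
;; Contents of a hypothetical shared .cljc file, kept in a string.
(def shared-source
  "(defn parse-int [s]
     #?(:clj  (Long/parseLong s)
        :cljs (js/parseInt s 10)))")

;; On the JVM the reader keeps only the :clj branch;
;; the ClojureScript reader would keep the :cljs branch instead.
(def jvm-form
  (read-string {:read-cond :allow} shared-source))
```

Here jvm-form ends up as the plain form (defn parse-int [s] (Long/parseLong s)), which is what the Clojure compiler sees when it loads the shared file.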
