I'm still a newbie in Clojure, and I'm trying to build an application which reads two files and writes the difference to a JSON file.
(defn read-csv
  "reads data."
  []
  (with-open [rdr (io/reader "resources/staples_data.csv")]
    (doseq [line (rest (line-seq rdr))]
      (println (vec (re-seq #"[^,]+" line))))))
(defn read-psv
  "reads data."
  []
  (with-open [rdr (io/reader "resources/external_data.psv")]
    (doseq [line (rest (line-seq rdr))]
      ; (println (vec (re-seq #"[^|]+" line)))
      (doall (vec (re-seq #"[^|]+" line))))))
(defn process-content []
  (let [csv-records (agent read-csv)
        psv-records (agent read-psv)]
    (json/write-str {"my-data" @csv-records "other-data" @psv-records})))
I'm getting an exception: Exception Don't know how to write JSON of class $read_csv clojure.data.json/write-generic (json.clj:385)
I'd appreciate some help with an explanation; thanks in advance!
You are giving the agent a function as its initial value. Perhaps you meant to make an asynchronous call to that function instead? In that case, a future is a better match for your scenario, as shown below. agent itself is synchronous; it's send and send-off that are async, and they assume you are propagating some state across calls, which doesn't match your usage here.
(defn process-content []
  (let [csv-records (future-call read-csv)
        psv-records (future-call read-psv)]
    (json/write-str {"my-data" @csv-records "other-data" @psv-records})))
The problem after that is that doseq is only for side effects and always returns nil. If you want the results read from the CSV files (evaluated eagerly, so that you are still inside the scope of the with-open call), use (doall (for ...)) as a replacement for (doseq ...). Also, the println in read-csv will need to be removed, or replaced with (doto (vec (re-seq #"[^,]+" line)) println), because println always returns nil, and I assume you want the actual data from the file, not a list of nils.
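Putting those fixes together, a minimal sketch of read-csv (same file path and regex as in the question; read-psv would change the same way):

(defn read-csv
  "Reads the CSV and returns the parsed rows, realized eagerly
  so the reader is still open while the lines are consumed."
  []
  (with-open [rdr (io/reader "resources/staples_data.csv")]
    (doall
     (for [line (rest (line-seq rdr))]
       (vec (re-seq #"[^,]+" line))))))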
I'm trying to process an HTTP stream using Clojure.
I am able to write the stream to a file, but I'm trying to process the messages using core.async.
I followed this answer here:
Processing a stream of messages from a http server in clojure
However, when I call line-seq on the java.io.BufferedReader, it freezes on me.
(defn trades-stream
  []
  (let [session (new-session)
        {:keys [url sessionid]} (:stream session)
        dump-url (str url "?sessionid=" sessionid "&symbols=mu")
        lines (-> dump-url
                  (client/get {:as :stream})
                  :body
                  io/reader)]
    (line-seq lines)))
Any idea how I would remedy this? Thanks!
Note that line-seq is lazy and won't do anything until the sequence is actually realized (forced into strings). Perhaps try
(println (first (line-seq lines)))
or
(reduce conj [] (line-seq lines)) ; then print something
You can also use (slurp <input-stream>) to get the contents as a string.
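If the endpoint streams continuously, a minimal sketch that processes each message as it arrives (assuming new-session and client/get behave as in the question) could look like:

(defn trades-stream
  []
  (let [session (new-session)
        {:keys [url sessionid]} (:stream session)
        dump-url (str url "?sessionid=" sessionid "&symbols=mu")
        lines (-> dump-url
                  (client/get {:as :stream})
                  :body
                  io/reader)]
    ;; doseq realizes the lazy seq one line at a time, so each message
    ;; is handled as soon as it arrives instead of never being forced
    (doseq [line (line-seq lines)]
      (println line))))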
I have had an issue with my code due to a very twisted behaviour of a function.
I use the Google API to stream data into BigQuery. In Java, you create an object called Bigquery.Tabledata.InsertAll (a request) and then you execute it:
TableDataInsertAllResponse response = request.execute();
(sample code from Google)
But as you can see, the execution is something that has a side effect while also returning a response.
I reproduced it in Clojure (in a chunked fashion):
(defn load-bq
  [client project-id dataset-id table-id data & {:keys [chunk-size] :or {chunk-size 500}}]
  (let [chunks (partition-all chunk-size data)
        _ (str "Now streaming data into table : [" project-id ":" dataset-id "." table-id "]")]
    (map (partial atomic-load-bq client project-id dataset-id table-id) chunks)))
If I try to stream from the REPL, it works fine. But surprisingly, a doall does not work in code, e.g. inside a let or a do.
Here is a function to illustrate the principle:
(defn load-this-table
  [... data]
  (let [_ (doall (load-bq ... data))]
    (load-bq ... data)
    (do (load-bq ... data))))
Here, nothing will be loaded.
I found a trick that works, though it is a bit far-fetched:
(defn load-this-table
  [... data]
  (let [_ (println (load-bq ... data))]
    (println (load-bq ... data))))
Here both lines will execute. Of course I need only one streaming call, so I kept a single one here.
How can I force evaluation of this code without having to use println? I could use whatever forces evaluation behind println, or any more general core function.
I have the impression that the difference is not really linked to laziness, but to a more fundamental difference between Clojure and Java, and maybe that the response has to be "taken" by the client.
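A sketch of what I'm considering: making load-bq itself eager with mapv (a strict, vector-returning map), so callers never need to force it, assuming atomic-load-bq performs the actual request:

(defn load-bq
  [client project-id dataset-id table-id data & {:keys [chunk-size] :or {chunk-size 500}}]
  (let [chunks (partition-all chunk-size data)]
    ;; mapv is strict: the atomic-load-bq calls run immediately
    (mapv (partial atomic-load-bq client project-id dataset-id table-id) chunks)))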
Thanks!
If you have been following my questions over the day: I am doing a class project in Clojure and having difficulty reading a file, parsing it, and creating a graph from its content. I have managed to open and read a file, and to parse the lines as needed. The issue I face now is creating a graph structure from the data that was read in.
Some background first. In other functions I have implemented in this project, I have used a for statement to "build up" a list of values, like so:
...
(let [rem-list (remove nil? (for [j (range (count (graph n)))]
(cond (< (rand) 0.5)
[n (nth (seq (graph n)) j)])))
...
This for would build up a list of edges to remove from a graph, after it was done I could then use rem-list in a reduce to remove all of the edges from some graph structure.
Back to my issue. I figured that if I were to read a file line by line, I could "build up" a list in the same manner, so I implemented the function below:
(defn readGraphFile [filename, numnodes]
  (let [edge-list
        (with-open [rdr (io/reader filename)]
          (doseq [line (line-seq rdr)]
            (lineToEdge line)))]
    (edge-list)))
Though if I run this function, I end up with a null pointer exception, as if nothing was ever "added" to edge-list. So, being the lazy/good? programmer I am, I quickly thought of another way, though it still relies on my mental model of how the for built the list.
In this function I first bind graph to an empty graph with the known number of nodes. Then, each time a line is read, I simply add that edge to the graph (each line in the file is an edge), in effect "building up" my graph. The function is shown below:
(defn readGraph [filename, numnodes]
  (let [graph (empty-graph numnodes)]
    (with-open [rdr (io/reader filename)]
      (doseq [line (line-seq rdr)]
        (add-edge graph (lineToEdge line))))
    graph))
Here lineToEdge returns a pair of numbers (e.g. [1 2]), which is proper input for the add-edge function:
finalproject.core> (add-edge (empty-graph 5) (lineToEdge "e 1 2"))
[#{} #{2} #{1} #{} #{}]
The issue with this function, though, is that it seems to never actually add an edge to the graph:
finalproject.core> (readGraph "/home/eccomp/finalproject/resources/11nodes.txt" 11)
[#{} #{} #{} #{} #{} #{} #{} #{} #{} #{} #{}]
So I guess my issue lies with how doseq is different from for? Is it different, or is my implementation incorrect?
doseq differs from for in that it is intended for traversing a sequence purely for its side effects.
If you look at the documentation for doseq:
(https://clojuredocs.org/clojure.core/doseq)
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil
So, regardless of any processing you're doing, nil will just be returned.
You can switch doseq to for, and it should work. However, line-seq is lazy, so you may have to wrap the for in a doall to ensure it reads all the lines while the file is still open.
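Applied to readGraphFile from the question, that fix might look like this sketch (also dropping the extra parens around edge-list, which were calling nil as a function and causing the NullPointerException):

(defn readGraphFile [filename numnodes]
  (let [edge-list
        (with-open [rdr (io/reader filename)]
          ;; doall forces the lazy for while the reader is still open
          (doall
           (for [line (line-seq rdr)]
             (lineToEdge line))))]
    edge-list))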
Also, your second readGraph function will only return an empty graph:
(defn readGraph [filename, numnodes]
  (let [graph (empty-graph numnodes)]
    (with-open [rdr (io/reader filename)]
      (doseq [line (line-seq rdr)]
        (add-edge graph (lineToEdge line))))
    graph))
The final line is just the empty graph you bound with let. Since Clojure's data structures are immutable, the graph binding is never updated; add-edge returns a new graph rather than modifying the existing one, so you need to step through the lines while passing along the graph you are building up.
I know there must be a better way to do this, but I'm not as good at Clojure as I would like to be. Something like:
(defn readGraph
[filename numnodes]
(with-open [rdr (io/reader filename)]
(let [edge-seq (line-seq rdr)]
(loop [cur-line (first edge-seq)
rem-line (rest edge-seq)
graph (empty-graph numnodes)]
(if-not cur-line
graph
(recur (first rem-line)
(rest rem-line)
(add-edge graph (lineToEdge cur-line))))))))
Might give you something closer to what you're after.
Thinking about it a little more, you could try using reduce, so:
(defn readGraph
  [filename numnodes]
  (with-open [rdr (io/reader filename)]
    (reduce add-edge (cons (empty-graph numnodes)
                           ;; lineToEdge turns each line into an edge pair
                           ;; before add-edge sees it
                           (doall (map lineToEdge (line-seq rdr)))))))
reduce will go through a sequence, applying the function you pass in to the first two elements, then passing the result of that as the first argument to the next call. The cons is there so we can be sure an empty graph is the first argument passed in.
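For what it's worth, reduce also takes an explicit initial value as its second argument, which avoids the cons (same bindings assumed as in the function above):

(reduce add-edge
        (empty-graph numnodes)
        (doall (map lineToEdge (line-seq rdr))))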
You could easily find an answer to your question in the Clojure documentation.
Complete documentation for all core functions is available on the clojuredocs.org website, or you can simply run (doc <function name>) in your Clojure REPL.
Here is what the doseq documentation says:
=> (doc doseq)
(doc doseq)
-------------------------
clojure.core/doseq
([seq-exprs & body])
Macro
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
In other words, it always returns nil. So, the only way you can use it is to cause some side effects (e.g. repeatedly printing something to your console).
And here is what the for documentation says:
=> (doc for)
(doc for)
-------------------------
clojure.core/for
([seq-exprs body-expr])
Macro
List comprehension. Takes a vector of one or more
binding-form/collection-expr pairs, each followed by zero or more
modifiers, and yields a lazy sequence of evaluations of expr.
Collections are iterated in a nested fashion, rightmost fastest,
and nested coll-exprs can refer to bindings created in prior
binding-forms. Supported modifiers are: :let [binding-form expr ...],
:while test, :when test.
(take 100 (for [x (range 100000000) y (range 1000000) :while (< y x)] [x y]))
So, the for macro produces a lazy sequence which you can bind to some variable and use later in your code.
Note that the produced sequence is lazy. That means the elements of this sequence will not be computed until you try to use (or print) them. For example, the following function:
(defn noop []
  (for [i (range 10)]
    (println i))
  nil)
won't print anything, since the result of the for loop is never used and thus never computed. You can force evaluation of a lazy sequence with the doall function.
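For example, forcing the same sequence makes the printlns run:

(doall (for [i (range 10)]
         (println i)))
;; prints 0 through 9 and returns a sequence of ten nils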
I'm trying to import data from StackOverflow into Neo4j using Clojure and the neocons library. Excuse me for being a bit of a newbie.
Here's my main function in Leiningen:
(defn -main
  [& args]
  (let [neo4j-conn (nr/connect "http://localhost:7777/db/data/")]
    (cypher/tquery neo4j-conn "MATCH n OPTIONAL MATCH n-[r]-() DELETE n, r")
    (for [page (range 1 6)]
      (let [data (parse-string (stackoverflow-get-questions page))
            questions (data "items")
            has-more (data "has_more")
            question-ids (map #(%1 "question_id") questions)
            answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]
        (map #(import-question %1 neo4j-conn) questions)
        (map #(import-answer %1 neo4j-conn) answers)))))
I've defined import-question and import-answer functions, and those work fine independently. In fact, what's weird is that I can remove either one of those import-* lines and the other will work just fine.
Can anybody see if I'm doing something simple that's wrong?
Both map and for are lazy, and will do nothing at all unless you consume their results.
The first map call ends up being a no-op because there is no way for anything to consume its output. Try wrapping the for and at least the first map call in a call to dorun, or in doall if you plan on consuming the result.
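For example, forcing just the first map for its side effects (using the bindings from the question) would look like:

(dorun (map #(import-question %1 neo4j-conn) questions))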
Also, you can replace for with doseq, which is identical except that it returns nil, eagerly consumes its input, and can contain multiple forms in its body.
Here is what your code could look like using doseq:
(defn -main
[& args]
(let [neo4j-conn (nr/connect "http://localhost:7777/db/data/")]
(cypher/tquery neo4j-conn "MATCH n OPTIONAL MATCH n-[r]-() DELETE n, r")
(doseq [page (range 1 6)
:let [data (parse-string (stackoverflow-get-questions page))
questions (data "items")
has-more (data "has_more")
question-ids (map #(%1 "question_id") questions)
answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]]
(doseq [q questions]
(import-question q neo4j-conn))
(doseq [a answers]
(import-answer a neo4j-conn)))))
I posted before about a huge XML file - it's a 287 GB XML Wikipedia dump that I want to put into a CSV file (revision authors and timestamps). I managed to do that up to a point: before, I got a StackOverflowError, but now, after solving the first problem, I get a java.lang.OutOfMemoryError: Java heap space error.
My code (partly taken from Justin Kramer's answer) looks like this:
(defn process-pages
[page]
(let [title (article-title page)
revisions (filter #(= :revision (:tag %)) (:content page))]
(for [revision revisions]
(let [user (revision-user revision)
time (revision-timestamp revision)]
(spit "files/data.csv"
(str "\"" time "\";\"" user "\";\"" title "\"\n" )
:append true)))))
(defn open-file
[file-name]
(let [rdr (BufferedReader. (FileReader. file-name))]
(->> (:content (data.xml/parse rdr :coalescing false))
(filter #(= :page (:tag %)))
(map process-pages))))
I don't show the article-title, revision-user and revision-timestamp functions, because they simply take data from a specific place in the page or revision hash. Could anyone help me with this? I'm really new to Clojure and don't understand the problem.
Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
That said, a couple things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.
Here's what I recommend:
1) Add (dorun) to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.
2) Change the for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.
As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening & writing (spit) for every Wikipedia entry.
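For instance, here is a sketch of recommendation 1 applied to open-file from the question (the reader setup is unchanged):

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages)
         ;; dorun realizes the whole seq without retaining its head
         (dorun))))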
UPDATE:
Here's a rewrite which attempts to separate concerns more clearly:
(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))

;; eager
(defn transform [in out]
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    (binding [*out* w] ; bind *out* to the writer, not the output filename
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          ;; println supplies the trailing newline itself
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))
(transform "dump.xml" "data.csv")
I don't see anything here that would cause excessive memory use.
Unfortunately, data.xml/parse is not lazy: it attempts to read the whole file into memory and then parse it.
Instead, use a lazy XML library which holds only the part it is currently working on in RAM. You will then need to restructure your code to write the output as it reads the input, instead of gathering all the XML and then outputting it.
Your line

(:content (data.xml/parse rdr :coalescing false))

will load all of the XML into memory and then request the content key from it, which will blow the heap.
A rough outline of a lazy answer would look something like this:
(with-open [input (java.io.FileInputStream. "/tmp/foo.xml")
            output (java.io.FileWriter. "/tmp/foo.csv")]
  ;; dorun forces the lazy map; write-to-file, is-the-tag-i-want?
  ;; and parse are placeholders
  (dorun
   (map #(write-to-file output %)
        (filter is-the-tag-i-want? (parse input)))))
Have patience, working with (> data ram) always takes time :)
I don't know about Clojure, but in plain Java one could use a SAX event-based parser like http://docs.oracle.com/javase/1.4.2/docs/api/org/xml/sax/XMLReader.html, which doesn't need to load the whole XML into RAM.
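For completeness, a minimal sketch of that SAX approach driven from Clojure, using only JDK classes (the page-counting handler is just an invented example):

(import '[org.xml.sax.helpers DefaultHandler]
        '[javax.xml.parsers SAXParserFactory])

(defn count-pages
  "Streams the file through a SAX parser; only the current
  event is ever held in memory."
  [xml-file]
  (let [n (atom 0)
        handler (proxy [DefaultHandler] []
                  (startElement [uri local-name q-name attrs]
                    (when (= q-name "page")
                      (swap! n inc))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance))
            (java.io.File. xml-file)
            handler)
    @n))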