I posted before about a huge XML file - it's a 287GB Wikipedia dump that I want to put into a CSV file (revision authors and timestamps). I managed to do that up to a point. Before, I got a StackOverflowError, but now, after solving that first problem, I get: java.lang.OutOfMemoryError: Java heap space.
My code (partly taken from Justin Kramer's answer) looks like this:
(defn process-pages
  [page]
  (let [title (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n")
              :append true)))))

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages))))
I don't show the article-title, revision-user and revision-timestamp functions, because they simply take data from a specific place in the page or revision hash. Could anyone help me with this? I'm really new to Clojure and don't understand the problem.
Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
That said, a couple things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.
Here's what I recommend:
1) Add (dorun) to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.
2) Change for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.
As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening & writing (spit) for every Wikipedia entry.
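For concreteness, here is a minimal sketch of changes 1) and 2) applied to the code from the question (helper functions assumed as given; the spit-per-entry write is kept for brevity, and the UPDATE below addresses it):

(defn process-pages
  [page]
  (let [title (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    ;; doseq, not for: we're here for the side effects
    (doseq [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n")
              :append true)))))

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages)
         ;; force full realization without holding onto the head
         (dorun))))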
UPDATE:
Here's a rewrite which attempts to separate concerns more clearly:
(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))

;; eager
(defn transform [in out]
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    (binding [*out* w]
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))
(transform "dump.xml" "data.csv")
I don't see anything here that would cause excessive memory use.
Unfortunately data.xml/parse is not lazy; it attempts to read the whole file into memory and then parse it.
Instead, use this (lazy) XML library, which holds only the part it is currently working on in RAM. You will then need to restructure your code to write the output as it reads the input, instead of gathering all the XML and then outputting it.
Your line
(:content (data.xml/parse rdr :coalescing false))
will load all the XML into memory and then request the content key from it, which will blow the heap.
A rough outline of a lazy answer would look something like this:
(with-open [input (java.io.FileInputStream. "/tmp/foo.xml")
            output (java.io.FileOutputStream. "/tmp/foo.csv")]
  (dorun (map #(write-to-file output %)
              (filter is-the-tag-i-want? (parse input)))))
Have patience, working with (> data ram) always takes time :)
I don't know about Clojure, but in plain Java one could use a SAX event-based parser like http://docs.oracle.com/javase/1.4.2/docs/api/org/xml/sax/XMLReader.html, which doesn't need to load the whole XML into RAM.
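For the curious, here is a rough Clojure interop sketch of that SAX idea; the "revision" element name and the handler body are placeholders, not taken from the question:

(import '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax.helpers DefaultHandler))

;; the parser pushes events at us as it streams the file; only the
;; current event is held in memory
(def handler
  (proxy [DefaultHandler] []
    (startElement [uri local-name q-name attrs]
      (when (= q-name "revision") ; hypothetical element of interest
        ;; write one CSV row here
        nil))))

(defn sax-parse [file]
  (.parse (.newSAXParser (SAXParserFactory/newInstance))
          (java.io.File. file)
          handler))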
I am reading a csv file, processing the input, appending the output to the input and writing the results to an output csv. Seems pretty straightforward. I am using Clojure data.csv. However, I ran into a nuance in the output that does not fit with anything I've run into before with Clojure, and I cannot figure it out. The output will contain 0 to N lines for each input, and I cannot figure out how to stream this down to the calling fn.
Here is the form that is processing the file:
(defn process-file
  [from to]
  (let [ctr (atom 0)]
    (with-open [r (io/reader from)
                w (io/writer to)]
      (some->> (csv/read-csv r)
               (map #(process-line % ctr))
               (csv/write-csv w)))))
And here is the form that processes each line (that returns 0 to N lines that each need to be written to the output csv):
(defn process-line
  [line ctr]
  (swap! ctr inc)
  (->> (apps-for-org (first line))
       (reduce #(conj %1 (add-results-to-input line %2)) [])))
Honestly, I didn't fully understand your question, but judging from your comment, I seem to have answered it.
If you want to run csv/write-csv for each row returned, you could just map over the rows:
(some->> (csv/read-csv r)
         (map #(process-line % ctr))
         (mapv #(csv/write-csv w %)))
Note my use of mapv since you're running side effects. Depending on the context, if you used just map, the laziness may prevent the writes from happening.
It would be arguably more correct to use doseq however:
(let [rows (some->> (csv/read-csv r)
                    (map #(process-line % ctr)))]
  (doseq [row rows]
    (csv/write-csv w row)))
doseq makes it clear that the intent of the iteration is to carry out side effects and not to produce a new (immutable) list.
I think the problem you're running into is that your map function process-line returns a collection of zero-to-many rows, so when you map over the input lines you get a collection of collections, when you just want one collection (of rows) to send to write-csv. If that's true, the fix is simply changing this line to use mapcat:
(mapcat #(process-line % ctr))
mapcat is like map, but it will "merge" the resulting collections of each process-line call into a single collection.
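For instance, where map would nest, mapcat flattens one level, so the pipeline in process-file becomes (a sketch reusing the question's names):

;; (map    #(vector % %) [1 2]) ;=> ([1 1] [2 2])
;; (mapcat #(vector % %) [1 2]) ;=> (1 1 2 2)
(some->> (csv/read-csv r)
         (mapcat #(process-line % ctr))
         (csv/write-csv w))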
If you have been following my questions over the day, you'll know I am doing a class project in Clojure and having difficulty reading a file, parsing it, and creating a graph from its contents. I have managed to open and read a file, and to parse the lines as needed. The issue I face now is creating a graph structure from the data that was read in.
Some background first: in other functions I have implemented in this project, I have used a for statement to "build up" a list of values, like so:
...
(let [rem-list (remove nil? (for [j (range (count (graph n)))]
                              (cond (< (rand) 0.5)
                                    [n (nth (seq (graph n)) j)])))
...
This for would build up a list of edges to remove from a graph; after it was done, I could then use rem-list in a reduce to remove all of those edges from some graph structure.
Back to my issue. I figured that if I were to read a file line by line, I could "build up" a list in the same manner, so I implemented the function below:
(defn readGraphFile [filename, numnodes]
  (let [edge-list
        (with-open [rdr (io/reader filename)]
          (doseq [line (line-seq rdr)]
            (lineToEdge line)))]
    (edge-list)))
Though if I run this function, I end up with a null pointer exception, as if nothing was ever "added" to edge-list. So, being the lazy/good? programmer I am, I quickly thought of another way, though it still somewhat relies on my mental model of how the for built up the list.
In this function I first let [graph be equal to an empty graph with the known number of nodes. Then, each time a line was read, I would simply add that edge (each line in the file is an edge) to the graph, in effect "building up" my graph. The function is shown below:
(defn readGraph [filename, numnodes]
  (let [graph (empty-graph numnodes)]
    (with-open [rdr (io/reader filename)]
      (doseq [line (line-seq rdr)]
        (add-edge graph (lineToEdge line))))
    graph))
Here lineToEdge returns a pair of numbers (e.g. [1 2]), which is proper input for the add-edge function:
finalproject.core> (add-edge (empty-graph 5) (lineToEdge "e 1 2"))
[#{} #{2} #{1} #{} #{}]
The issue with this function, though, is that it seems to never actually add an edge to the graph:
finalproject.core> (readGraph "/home/eccomp/finalproject/resources/11nodes.txt" 11)
[#{} #{} #{} #{} #{} #{} #{} #{} #{} #{} #{}]
So I guess my issue lies with how doseq is different from for? Is it different, or is my implementation incorrect?
doseq differs from for in that it is intended for running a function on a sequence just for the side effects.
If you look at the documentation for doseq:
(https://clojuredocs.org/clojure.core/doseq)
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil
So, regardless of any processing you're doing, nil will just be returned.
You can switch doseq with for, and it should work. However, line-seq is lazy, so what you might have to do is wrap it in a doall to ensure that it reads all the lines while the file is still open.
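A minimal sketch of that fix, keeping the question's helper names:

(defn readGraphFile [filename numnodes]
  (with-open [rdr (io/reader filename)]
    ;; doall realizes the lazy line-seq while the reader is still open
    (doall
     (for [line (line-seq rdr)]
       (lineToEdge line)))))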
Also, your second readGraph function will only return an empty graph:
(defn readGraph [filename, numnodes]
  (let [graph (empty-graph numnodes)]
    (with-open [rdr (io/reader filename)]
      (doseq [line (line-seq rdr)]
        (add-edge graph (lineToEdge line))))
    graph))
The final line is just the empty graph you set with let. Since Clojure's data structures are immutable, the graph reference is never updated. Because you have a function that takes an existing graph and adds an edge to it, you need to step through the list of lines while passing along the graph you're building up.
I know there must be a better way to do this, and I'm not as good at Clojure as I would like, but something like:
(defn readGraph
  [filename numnodes]
  (with-open [rdr (io/reader filename)]
    (let [edge-seq (line-seq rdr)]
      (loop [cur-line (first edge-seq)
             rem-line (rest edge-seq)
             graph (empty-graph numnodes)]
        (if-not cur-line
          graph
          (recur (first rem-line)
                 (rest rem-line)
                 (add-edge graph (lineToEdge cur-line))))))))
Might give you something closer to what you're after.
Thinking about it a little more, you could try using reduce, so:
(defn readGraph
  [filename numnodes]
  (with-open [rdr (io/reader filename)]
    (reduce add-edge (cons (empty-graph numnodes)
                           (map lineToEdge (doall (line-seq rdr)))))))
Reduce will go through a sequence, applying the function you pass in to the first two items, then passing the result of that as the first argument to the next call. The cons is there so we can be sure an empty graph is the first argument passed in.
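In other words, a two-edge sketch using the question's own helpers:

;; (reduce f (cons init [a b])) is (f (f init a) b), so here:
(reduce add-edge (cons (empty-graph 5) [[1 2] [2 3]]))
;; same as (add-edge (add-edge (empty-graph 5) [1 2]) [2 3])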
You can easily find an answer to your question in the Clojure documentation. Complete documentation for all core functions is available on the clojuredocs.org website, or you can simply run (doc <function name>) in your Clojure REPL.
Here is what the doseq documentation says:
=> (doc doseq)
(doc doseq)
-------------------------
clojure.core/doseq
([seq-exprs & body])
Macro
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
In other words, it always returns nil. So, the only way you can use it is to cause some side effects (e.g. repeatedly printing something to your console).
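For example:

(doseq [i (range 3)]
  (println i))
;; prints 0, 1 and 2, then returns nil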
And here is what the for documentation says:
=> (doc for)
(doc for)
-------------------------
clojure.core/for
([seq-exprs body-expr])
Macro
List comprehension. Takes a vector of one or more
binding-form/collection-expr pairs, each followed by zero or more
modifiers, and yields a lazy sequence of evaluations of expr.
Collections are iterated in a nested fashion, rightmost fastest,
and nested coll-exprs can refer to bindings created in prior
binding-forms. Supported modifiers are: :let [binding-form expr ...],
:while test, :when test.
(take 100 (for [x (range 100000000) y (range 1000000) :while (< y x)] [x y]))
So, the for function produces a lazy sequence which you can bind to some variable and use later in your code.
Note that the produced sequence is lazy. This means that the elements of this sequence will not be computed until you try to use (or print) them. For example, the following function:
(defn noop []
  (for [i (range 10)]
    (println i))
  nil)
won't print anything, since the result of the for loop is never used and thus never computed. You can force evaluation of a lazy sequence using the doall function.
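For instance, wrapping the for from the example above in doall makes the printing happen (noop-forced is just an illustrative name):

(defn noop-forced []
  (doall
   (for [i (range 10)]
     (println i)))
  nil)
;; prints 0 through 9, because doall realizes the whole lazy seq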
I'm still a newbie in Clojure, and I'm trying to build an application which reads two files and writes the difference to a JSON file:
(defn read-csv
  "reads data."
  []
  (with-open [rdr (io/reader "resources/staples_data.csv")]
    (doseq [line (rest (line-seq rdr))]
      (println (vec (re-seq #"[^,]+" line))))))
(defn read-psv
  "reads data."
  []
  (with-open [rdr (io/reader "resources/external_data.psv")]
    (doseq [line (rest (line-seq rdr))]
      ;; (print (vec (re-seq #"[^|]+" line)))
      (doall (vec (re-seq #"[^|]+" line))))))
(defn process-content []
  (let [csv-records (agent read-csv)
        psv-records (agent read-psv)]
    (json/write-str {"my-data" @csv-records "other-data" @psv-records})))
I'm getting an exception: Exception Don't know how to write JSON of class $read_csv clojure.data.json/write-generic (json.clj:385)
Please, some help with an explanation. Thanks in advance!
You are giving the agent a function as its initial value. Perhaps you meant to make an asynchronous call to that function instead? In that case, a future is a better match for your scenario as shown. agent is synchronous; it's send and send-off that are async, and they assume you are propagating some state across calls, which doesn't match your usage here.
(defn process-content []
  (let [csv-records (future-call read-csv)
        psv-records (future-call read-psv)]
    (json/write-str {"my-data" @csv-records "other-data" @psv-records})))
The problem after that is that doseq is only for side effects and always returns nil. If you want the results read from the csv files (evaluated eagerly, so that you are still inside the scope of the with-open call), use (doall (for ...)) as a replacement for (doseq ...). Also, the println in read-csv will need to be removed, or replaced with (doto (vec (re-seq #"[^,]+" line)) println), because println always returns nil, and I assume you want the actual data from the file, not a list of nils.
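Putting those two changes together, a sketch of the corrected reader (same path as in the question, println dropped) might be:

(defn read-csv
  "reads data."
  []
  (with-open [rdr (io/reader "resources/staples_data.csv")]
    ;; (doall (for ...)) realizes the rows while the reader is still
    ;; open, and returns them instead of nil
    (doall
     (for [line (rest (line-seq rdr))]
       (vec (re-seq #"[^,]+" line))))))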
Hi all, I want to parse big log files using Clojure.
The structure of each line record is "UserID,Latitude,Longitude,Timestamp".
My implemented steps are:
----> Read the log file & get the top-n user list
----> Find each top-n user's records and store them in a separate log file (UserID.log).
The implementation source code:
;======================================================
(defn parse-file
  ""
  [file n]
  (with-open [rdr (io/reader file)]
    (println "001 begin with open ")
    (let [lines (line-seq rdr)
          res (parse-recur lines)
          sorted (into (sorted-map-by (fn [key1 key2]
                                        (compare [(get res key2) key2]
                                                 [(get res key1) key1])))
                       res)]
      (println "Statistic result : " res)
      (println "Top-N User List : " sorted)
      (find-write-recur lines sorted n))))
(defn parse-recur
  ""
  [lines]
  (loop [ls lines
         res {}]
    (if ls
      (recur (next ls)
             (update-res res (first ls)))
      res)))
(defn update-res
  ""
  [res line]
  (let [params (string/split line #",")
        id (if (> (count params) 1) (params 0) "0")]
    (if (res id)
      (update-in res [id] inc)
      (assoc res id 1))))
(defn find-write-recur
  "Get each user's records and store them in a separate log file"
  [lines sorted n]
  (loop [x n
         sd sorted
         id (first (keys sd))]
    (if (and (> x 0) sd)
      (do (create-write-file id
                             (find-recur lines id))
          (recur (dec x)
                 (rest sd)
                 (nth (keys sd) 1))))))
(defn find-recur
  ""
  [lines id]
  (loop [ls lines
         res []]
    (if ls
      (recur (next ls)
             (update-vec res id (first ls)))
      res)))
(defn update-vec
  ""
  [res id line]
  (let [params (string/split line #",")
        id_ (if (> (count params) 1) (params 0) "0")]
    (if (= id id_)
      (conj res line)
      res)))
(defn create-write-file
  "Create a new file and write information into the file."
  ([file info-lines]
   (with-open [wr (io/writer (str MAIN-PATH file))]
     (doseq [line info-lines]
       (.write wr (str line "\n")))))
  ([file info-lines append?]
   (with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
     (doseq [line info-lines]
       (.write wr (str line "\n"))))))
;======================================================
I tested this clj in the REPL with the command (parse-file "./DATA/log.log" 3), and got these results:
Records      Size     Time     Result
1,000        42KB     <1s      OK
10,000       420KB    <1s      OK
100,000      4.3MB    3s       OK
1,000,000    43MB     15s      OK
6,000,000    258MB    >20min   "OutOfMemoryError Java heap space java.lang.String.substring (String.java:1913)"
======================================================
Here are my questions:
1. How can I fix the error when I try to parse a big log file, like > 200MB?
2. How can I optimize the function to run faster?
3. How can the function deal with logs of more than 1GB in size?
I am still new to Clojure; any suggestion or solution will be appreciated.
Thanks
As a direct answer to your questions, from a little Clojure experience:
The quick and dirty fix for running out of memory boils down to giving the JVM more memory. You can try adding this to your project.clj:
:jvm-opts ["-Xmx1G"] ;; or more
That will make Leiningen launch the JVM with a higher memory cap.
This kind of work is going to use a lot of memory no matter how you work it. @Vidya's suggestion to use a library is definitely worth considering. However, there's one optimization you can make that should help a little.
Whenever you're dealing with your (line-seq ...) object (a lazy sequence), you should make sure to maintain it as a lazy seq. Calling next on it realizes the next element right away (to check whether the seq is empty); use rest instead, which defers that work. Take a look at the Clojure site, especially the section on laziness:
(rest aseq) - returns a possibly empty seq, never nil
[snip]
a (possibly) delayed path to the remaining items, if any
You may even want to traverse the log twice: once to pull just the username from each line as a lazy seq, and again to filter out those users' records. This will minimize the amount of the file you're holding onto at any one time.
Making sure your function is lazy should reduce the sheer overhead that having the file as a sequence in memory creates. Whether that's enough to parse a 1G file, I don't think I can say.
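As a rough sketch of what staying lazy looks like for the first pass (counting records per user without retaining the head; the function name is mine, not from the question):

(defn user-counts [file]
  (with-open [rdr (clojure.java.io/reader file)]
    ;; reduce walks the lazy line-seq one line at a time and never
    ;; holds onto the head of the sequence
    (reduce (fn [counts line]
              (let [id (first (clojure.string/split line #","))]
                (update-in counts [id] (fnil inc 0))))
            {}
            (line-seq rdr))))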
You definitely don't need Cascalog or Hadoop simply to parse a file which doesn't fit into your Java heap. This SO question provides some working examples of how to process large files lazily. The main point is that you need to keep the file open while you traverse the lazy seq. Here is what worked for me in a similar situation:
(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
            (lazy-seq
             (if-let [line (.readLine rdr)]
               (cons line (helper rdr))
               (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader file))))
You can map, reduce, count, etc. over this lazy sequence:
(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
The parsing is a separate, simpler problem.
I am also relatively new to Clojure, so there are no obvious optimizations I can see. Hopefully others more experienced can offer some advice. But I feel like this is simply a matter of the data size being too big for the tools at hand.
For that reason, I would suggest using Cascalog, an abstraction over Hadoop or your local machine using Clojure. I think the syntax for querying big log files would be pretty straightforward for you.
I'm trying to read a file that (may or may not) have YAML frontmatter line by line using Clojure, and return a hashmap with two vectors: one containing the frontmatter lines and one containing everything else (i.e., the body).
An example input file would look like this:
---
key1: value1
key2: value2
---
Body text paragraph 1
Body text paragraph 2
Body text paragraph 3
I have functioning code that does this, but to my (admittedly inexperienced with Clojure) nose, it reeks of code smell.
(defn process-file [f]
  (with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
    (loop [lines (line-seq rdr) in-fm 0 frontmatter [] body []]
      (if-not (empty? lines)
        (let [line (string/trim (first lines))]
          (cond
            (zero? (count line))
            (recur (rest lines) in-fm frontmatter body)
            (and (< in-fm 2) (= line "---"))
            (recur (rest lines) (inc in-fm) frontmatter body)
            (= in-fm 1)
            (recur (rest lines) in-fm (conj frontmatter line) body)
            :else
            (recur (rest lines) in-fm frontmatter (conj body line))))
        (hash-map :frontmatter frontmatter :body body)))))
Can someone point me to a more elegant way to do this? I'm going to be doing a decent amount of line-by-line parsing in this project, and I'd like a more idiomatic way of going about it if possible.
Firstly, I'd put line-processing logic in its own function to be called from a function actually reading in the files. Better yet, you can make the function dealing with IO take a function to map over the lines as an argument, perhaps along these lines:
(require '[clojure.java.io :as io])
(defn process-file-with [f filename]
  (with-open [rdr (io/reader (io/file filename))]
    (f (line-seq rdr))))
Note that this arrangement makes it the duty of f to realize as much of the line seq as it needs before it returns (because afterwards with-open will close the underlying reader of the line seq).
Given this division of responsibilities, the line processing function might look like this, assuming the first --- must be the first non-blank line and all blank lines are to be skipped (as they would be when using the code from the question text):
(require '[clojure.string :as string])
(defn process-lines [lines]
  (let [ls (->> lines
                (map string/trim)
                (remove string/blank?))]
    (if (= (first ls) "---")
      (let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
        {:front (vec front) :body (vec (next sep-and-body))})
      {:body (vec ls)})))
Note the calls to vec which cause all the lines to be read in and returned in a vector or pair of vectors (so that we can use process-lines with process-file-with without the reader being closed too soon).
Because reading lines from an actual file on disk is now decoupled from processing a seq of lines, we can easily test the latter part of the process at the REPL (and of course this can be made into a unit test):
;; could input this as a single string and split, of course
(def test-lines
  ["---"
   "key1: value1"
   "key2: value2"
   "---"
   ""
   "Body text paragraph 1"
   ""
   "Body text paragraph 2"
   ""
   "Body text paragraph 3"])
Calling our function now:
user> (process-lines test-lines)
{:front ["key1: value1" "key2: value2"],
 :body ["Body text paragraph 1"
        "Body text paragraph 2"
        "Body text paragraph 3"]}
Actually, the idiomatic way to do this in Clojure would be to avoid returning 'a hashmap with two vectors' and instead treat the file as a (lazy) sequence of lines.
Then the function that processes the sequence of lines decides whether the file has a YAML frontmatter or not.
Something like this:
(use '[clojure.java.io :only (reader)])

(let [s (line-seq (reader "YOURFILENAMEHERE"))]
  (if (= "---" (first (line-seq (reader "YOURFILENAMEHERE"))))
    (process-seq-with-frontmatter s)
    (process-seq-without-frontmatter s)))
By the way, this is a quick and dirty solution; two things to improve:
Notice I'm creating two seqs for the same file; it would be better to create just one and inspect its first line without consuming it from the seq that gets processed (like a peek instead of a pop).
I think it would be cleaner to have a multimethod 'process-seq' (with a better name, of course) that would dispatch based on the content of the first line of the seq, as sketched below.
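A rough sketch of that multimethod idea, reusing the hypothetical process-seq-with-frontmatter / process-seq-without-frontmatter functions from above:

(defmulti process-seq
  ;; dispatch on whether the first line opens a YAML frontmatter block
  (fn [lines] (if (= "---" (first lines)) :with-frontmatter :plain)))

(defmethod process-seq :with-frontmatter [lines]
  (process-seq-with-frontmatter lines))

(defmethod process-seq :plain [lines]
  (process-seq-without-frontmatter lines))

;; usage: (process-seq (line-seq (reader "YOURFILENAMEHERE")))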