What is the best way to turn this line of CSV for column 3 to a Clojure list?
357302041352401, 2012-08-27 19:59:32 -0700, 100, ["SNIA34", "M33KLC", "M34KLC", "W35REK", "SRBT", "MODE", "BFF21S", "CC12", "RCV56V", "NBA1", "RESP", "A0NTC", "PRNK", "WAYS", "HIRE", "BITE", "INGA1", "M32MOR", "TFT99W", "TBF5P", "NA3NR"]
Assuming you can already read the csv file...
You can use read-string in combination with into:
user=> (def your_csv_column "[\"SNIA34\", \"M33KLC\", \"M34KLC\"]")
#'user/your_csv_column
user=> (into '() (read-string your_csv_column))
("M34KLC" "M33KLC" "SNIA34")
You can use clojure-csv to do that.
Your data is interesting: it appears to include a traditional comma-separated line, followed by data in brackets. I could not quite tell whether the bracketed data was the representation you had in the .csv file or what you wanted after reading, but either way, this is how I read a .csv file:
My library's project.clj that uses clojure-csv:
(defproject util "1.0.4-SNAPSHOT"
  :description "A general purposes Clojure library"
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [clojure-csv/clojure-csv "1.3.2"]]
  :aot [util.core]
  :omit-source true)
My library's core.clj header:
(ns util.core
  ^{:author "Charles M. Norton",
    :doc "util is a Clojure utilities directory containing things
          most Clojure programs need, like cli routines.
          Created on April 4, 2012"}
  (:require [clojure.string :as cstr])
  (:import java.util.Date)
  (:import java.io.File)
  (:use clojure-csv.core))
My library's function that returns a .csv file parsed as a vector of vectors.
(defn ret-csv-data
  "Returns a lazy sequence generated by parse-csv.
   Uses open-file, which will return nil if
   there is an exception in opening fnam.
   parse-csv is called on the non-nil file, and that
   data is returned."
  [fnam]
  (let [csv-file (open-file fnam)
        inter-csv-data (if-not (nil? csv-file)
                         (parse-csv csv-file)
                         nil)
        ;; keep only non-empty rows; the original predicate
        ;; (and pos? (count %) ...) was always truthy
        csv-data (vec (filter #(pos? (count %)) inter-csv-data))]
    ;; removes blank sequence at EOF
    (pop csv-data)))
(defn fetch-csv-data
  "This function accepts a csv file name, and returns parsed csv data,
   or returns nil if file is not present."
  [csv-file]
  (let [csv-data (ret-csv-data csv-file)]
    csv-data))
What I have found to be very helpful is avoiding nth (useful advice from SO and other sources). Since most of my .csv data comes from database queries, I zipmap column names onto each .csv sequence (row), and then operate on that data by map key. It simplifies things for me.
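For example, a minimal sketch (these column names are made up for illustration):
(def columns [:id :timestamp :score])
(def row ["357302041352401" "2012-08-27 19:59:32 -0700" "100"])

(zipmap columns row)
;= {:id "357302041352401", :timestamp "2012-08-27 19:59:32 -0700", :score "100"}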
I'm trying to open a file that is too large to slurp. I then want to edit the file to remove all characters except numbers, and write the data to a new file.
So far I have:
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; note: :jvm-opts ["-Xmx2G"] belongs in project.clj, not in source

(with-open [rdr (io/reader "/Myfile.txt")
            wrt (io/writer "/Myfile2.txt")]
  (doseq [line (line-seq rdr)]
    (.write wrt (str line "\n"))))
This reads and writes, but I'm unsure of the best way to go about the editing. Any help is much appreciated. I'm very new to the language.
Looks like you just need to modify the line value before writing it. If you want to modify a string to remove all non-numeric characters, a regular expression is a pretty easy route. You could make a function to do this:
(defn numbers-only [s]
  (clojure.string/replace s #"[^\d]" ""))
(numbers-only "this is 4 words")
=> "4"
Then use that function in your example:
(str (numbers-only line) "\n")
Alternatively, you could map numbers-only over the output of line-seq, and because both map and line-seq are lazy you'll get the same lazy/on-demand behavior:
(map numbers-only (line-seq rdr))
And then your doseq would stay the same. I would probably opt for this approach as it keeps your "stream" processing together, and your imperative/side-effect loop is only concerned with writing its inputs.
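Putting it together with your original snippet (same file names as above):
(with-open [rdr (io/reader "/Myfile.txt")
            wrt (io/writer "/Myfile2.txt")]
  (doseq [line (map numbers-only (line-seq rdr))]
    (.write wrt (str line "\n"))))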
I need to read a large file (~1GB), process it, and save it to a db. My solution looks like this:
data.txt
format: [id],[title]\n
1,Foo
2,Bar
...
code
(ns test.core
  (:require [clojure.java.io :as io]
            [clojure.string :refer [split]]))

(defn parse-line
  [line]
  (let [values (split line #",")]
    (zipmap [:id :title] values)))

(defn run
  []
  (with-open [reader (io/reader "~/data.txt")]
    (insert-batch (map parse-line (line-seq reader)))))
;; insert-batch just saves a vector of records into the database
But this code does not work well, because it first parses all the lines and then sends them to the database.
I think the ideal solution would be read line -> parse line -> collect 1000 parsed lines -> batch insert them into database -> repeat until there is no lines. Unfortunately, I have no idea how to implement this.
One suggestion:
Use line-seq to get a lazy sequence of lines,
use map to parse each line,
(so far this matches what you are doing)
use partition-all to partition your lazy sequence of parsed lines into batches, and then
use insert-batch with doseq to write each batch to the database.
And an example:
(->> (line-seq reader)
     (map parse-line)
     (partition-all 1000)
     (#(doseq [batch %]
         (insert-batch batch))))
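If you're on Clojure 1.7 or newer, run! is a tidier way to write that final side-effecting step; it eagerly applies insert-batch to each batch:
(->> (line-seq reader)
     (map parse-line)
     (partition-all 1000)
     (run! insert-batch))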
How do I read a tab-delimited file using Clojure? A line may contain whitespace (spaces inside fields) that does not correspond to a tab.
E.g.: transform
some field another-field a third field
into
["some field" "another-field" "a third field"]
You can use the data.csv Contrib library:
;; in your :dependencies
[org.clojure/data.csv "0.1.2"]
;; at the REPL
(require '[clojure.data.csv :as csv])
(csv/read-csv
  (java.io.StringReader. "some field\tanother-field\ta third field")
  :separator \tab)
;= (["some field" "another-field" "a third field"])
(Use something like (with-open [rdr (clojure.java.io/reader f)] (vec (csv/read-csv rdr :separator \tab))) to read data from the TSV file f.)
If you don't want to do it by hand you could use a CSV library, e.g.:
https://github.com/clojure/data.csv
https://github.com/davidsantiago/clojure-csv
Then you'd be on the safe side if your requirements change (e.g. you want to allow spaces in values, the delimiter changes, you want quoting, ...), since you could easily adapt. However, directly splitting single lines works too:
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(with-open [rd (io/reader (io/file "/path/to/file"))]
  (->> (line-seq rd)
       (map #(.split ^String % "\t"))
       (mapv vec)))
Still, I'd go with a library if I were you.
I know there are a lot of related questions; I have read them, but still have not gained a fundamental understanding of how to read-process-write. Take the following function for example, which uses the clojure-csv library to parse a line:
(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (reader fname)]
    (doseq [line (line-seq file)]
      (let [record (parse-csv line)]))))
What I would like to obtain is the data read into some collection as a result of (def data (take-csv "file.csv")), to process later. So basically my question is: how do I return a record, or rather a list of records?
"doseq" is often used for operations with side effect. In your case to create collection of records you can use "map":
(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (reader fname)]
    (doall (map (comp first csv/parse-csv) (line-seq file)))))
Better yet, parse the whole file at once to reduce code:
(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (reader fname)]
    (csv/parse-csv (slurp file))))
You can also use clojure.data.csv instead of clojure-csv.core; you only need to swap parse-csv for its read-csv in the previous function.
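For reference, a sketch of that clojure.data.csv variant (its read-csv is lazy and accepts a reader directly, so doall keeps realization inside with-open):
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (io/reader fname)]
    (doall (csv/read-csv file))))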
(defn put-csv [fname table]
  (with-open [file (writer fname)]
    (csv/write-csv file table)))
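Usage, assuming csv here is clojure.data.csv and table is a sequence of rows, each a sequence of strings:
(put-csv "out.csv" [["id" "title"] ["1" "Foo"] ["2" "Bar"]])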
With all the things you can do with .csv files, I suggest using clojure-csv or clojure.data.csv. I mostly use clojure-csv to read in a .csv file.
Here are some code snippets from a utility library I use with most of my Clojure programs.
from util.core
(ns util.core
  ^{:author "Charles M. Norton",
    :doc "util is a Clojure utilities directory"}
  (:require [clojure.string :as cstr])
  (:import java.util.Date)
  (:import java.io.File)
  (:use clojure-csv.core))
(defn open-file
  "Attempts to open a file and complains if the file is not present."
  [file-name]
  (let [file-data (try
                    (slurp file-name)
                    (catch Exception e (println (.getMessage e))))]
    file-data))
(defn ret-csv-data
  "Returns a lazy sequence generated by parse-csv.
   Uses open-file, which will return nil if
   there is an exception in opening fnam.
   parse-csv is called on the non-nil file, and that
   data is returned."
  [fnam]
  (let [csv-file (open-file fnam)
        inter-csv-data (if-not (nil? csv-file)
                         (parse-csv csv-file)
                         nil)
        ;; keep only non-empty rows; the original predicate
        ;; (and pos? (count %) ...) was always truthy
        csv-data (vec (filter #(pos? (count %)) inter-csv-data))]
    (if-not (empty? csv-data)
      (pop csv-data)
      nil)))
(defn fetch-csv-data
  "This function accepts a csv file name, and returns parsed csv data,
   or returns nil if file is not present."
  [csv-file]
  (let [csv-data (ret-csv-data csv-file)]
    csv-data))
Once you've read in a .csv file, then what you do with its contents is another matter. Usually, I am taking .csv "reports" from one financial system, like property assessments, and formatting the data to be uploaded into a database of another financial system, like billing.
I will often either zipmap each .csv row so I can extract data by column name (having read in the column names), or even make a sequence of zipmap'ped .csv rows.
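A sketch of that second pattern, assuming the first row of the parsed .csv holds the column names (rows->maps is a hypothetical helper, not part of either library):
(defn rows->maps
  "Turns parsed csv data (header row first) into
  a seq of maps keyed by column name."
  [[header & rows]]
  (map #(zipmap (map keyword header) %) rows))

(rows->maps [["id" "title"] ["1" "Foo"] ["2" "Bar"]])
;= ({:id "1", :title "Foo"} {:id "2", :title "Bar"})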
Just to add to these good answers, here is a full example.
First, add clojure-csv to your dependencies:
(ns scripts.csvreader
  (:require [clojure-csv.core :as csv]
            [clojure.java.io :as io]))

(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (io/reader fname)]
    (-> file
        (slurp)
        (csv/parse-csv))))
usage
(take-csv "/path/youfile.csv")
I posted before about a huge XML file - it's a 287GB XML Wikipedia dump that I want to put into a CSV file (revision authors and timestamps). I managed to do that up to a point. Before, I was getting a StackOverflowError, but now, after solving the first problem, I get a java.lang.OutOfMemoryError: Java heap space error.
My code (partly taken from Justin Kramer's answer) looks like this:
(defn process-pages
  [page]
  (let [title (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n")
              :append true)))))

(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages))))
I don't show the article-title, revision-user and revision-timestamp functions, because they just take data from a specific place in the page or revision hash. Could anyone help me with this - I'm really new to Clojure and don't understand the problem.
Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
That said, a couple things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.
Here's what I recommend:
1) Add dorun to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head.
2) Change the for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.
As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening & writing (spit) for every Wikipedia entry.
UPDATE:
Here's a rewrite which attempts to separate concerns more clearly:
(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))
;; eager
(defn transform [in out]
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    (binding [*out* w]  ; bind *out* to the writer, not the out file name
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          ;; println appends its own newline
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))

(transform "dump.xml" "data.csv")
I don't see anything here that would cause excessive memory use.
Unfortunately data.xml/parse is not lazy; it attempts to read the whole file into memory and then parse it.
Instead, use this (lazy) xml library, which holds only the part it is currently working on in RAM. You will then need to restructure your code to write the output as it reads the input, instead of gathering all the xml and then outputting it.
your line
(:content (data.xml/parse rdr :coalescing false))
will load all the xml into memory and then request the content key from it, which will blow the heap.
a rough outline of a lazy answer would look something like this:
(with-open [input (java.io.FileInputStream. "/tmp/foo.xml")
            output (java.io.FileWriter. "/tmp/foo.csv")]
  ;; dorun forces the lazy map before with-open closes the files
  (dorun (map #(write-to-file output %)
              (filter is-the-tag-i-want? (parse input)))))
Have patience, working with (> data ram) always takes time :)
I don't know about Clojure, but in plain Java one could use a SAX event-based parser like http://docs.oracle.com/javase/1.4.2/docs/api/org/xml/sax/XMLReader.html that doesn't need to load the XML into RAM.
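For completeness, a minimal sketch of driving a SAX parser from Clojure via Java interop (the element name and handler body are made up; a real handler would accumulate the fields you care about):
(import '[javax.xml.parsers SAXParserFactory]
        '[org.xml.sax.helpers DefaultHandler])

(defn sax-parse [file-name]
  (let [handler (proxy [DefaultHandler] []
                  (startElement [uri local-name q-name attrs]
                    ;; called once per element; the document is never
                    ;; held in memory as a whole
                    (when (= q-name "revision")
                      (println "saw a revision element"))))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance))
            (java.io.File. ^String file-name)
            handler)))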