I have a large CSV file that contains independent items that take a fair bit of effort to process. I'd like to be able to process each line item in parallel. I found a sample piece of code for processing a CSV file on SO here:
Newbie transforming CSV files in Clojure
The code is:
(use '(clojure.contrib duck-streams str-utils)) ;;'
(with-out-writer "coords.txt"
  (doseq [line (read-lines "coords.csv")]
    (let [[x y z p] (re-split #"," line)]
      (println (str-join \space [p x y z])))))
This was able to print out data from my CSV file which was great - but it only used one CPU. I've tried various different things, ending up with:
(pmap println (read-lines "foo"))
This works okay in interactive mode but does nothing when running from the command line. From a conversation on IRC, this is because stdout isn't available by default to threads.
Really what I'm looking for is a way to idiomatically apply a function to each line of the CSV file and do so in parallel. I'd also like to print some results to stdout during testing if at all possible.
Any ideas?
If you want the results in the output to be in the same order as in the input, then printing from pmap might not be a good idea. I would recommend creating a (lazy) sequence of the input lines, pmap over that, and then print the results of the pmap.
Something like this should work:
(dorun (map println (pmap expensive-computation (read-lines "coords.csv"))))
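For completeness, here is a minimal sketch of the same pattern using the modern clojure.java.io and clojure.string namespaces (clojure.contrib has long been deprecated); expensive-computation here is just a placeholder for whatever per-line work you actually need to do:
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
;; placeholder for the real per-line work: split on commas and reorder the fields
(defn expensive-computation [line]
  (let [[x y z p] (str/split line #",")]
    (str/join " " [p x y z])))
;; dorun forces the lazy pmap results before the reader is closed
(with-open [rdr (io/reader "coords.csv")]
  (dorun (map println (pmap expensive-computation (line-seq rdr)))))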
If you want to do this at speed you might want to look at this article on how Alex Osborne solved the Widefinder 2 challenge posed by Tim Bray. Alex goes into all aspects of parsing, processing and collecting the results (in the Widefinder 2 case the file is a very large Apache log). The actual code used is here.
I would be extremely surprised if that code can be sped up by using more cores. I'm 99% certain that the actual speed limit here is the file I/O, which should be a couple of orders of magnitude slower than any single core you can throw at the problem.
And that's aside from the overhead you'll introduce when splitting these very minimal tasks over multiple CPUs. pmap isn't exactly free.
If you're sure that disk IO isn't going to be a problem and you've got a lot of CSV parsing to do, simply parsing multiple files in their own threads is going to gain you a lot more for a lot less effort.
I'm learning Clojure by following the Hackerrank 30 days of code, and I lost some hours due to a behavior I neither understand nor could find any documentation or explanation for:
(read) returns a symbol:
user=> (def key-read (read))
sam
#'user/key-read
user=> (type key-read)
clojure.lang.Symbol
(read-line) returns a string
user=> (def key-line (read-line))
sam
#'user/key-line
user=> (type key-line)
java.lang.String
As a result, parsing a line with (read) (read) to get map keys and values results in the keys being symbols, which will never be matched by a further (read-line).
Why is this so? And also, where is the return value documented? (It is not mentioned in (doc read).)
TL;DR
clojure.core/read is used to read "code" by Clojure itself
clojure.edn/read is used to read "data" (EDN)
read-line is used to read text lines as strings; it's your problem to decipher them
What can read do for you
read does not only read symbols but anything that Clojure uses to represent code. If you give it a symbol to parse, it will give you a symbol back:
(type (read))
test
clojure.lang.Symbol
But also other things
(type (read))
5
java.lang.Long
(type (read))
{:a 42}
clojure.lang.PersistentArrayMap
(type (read))
"hello"
java.lang.String
So you can get back a string with read too, if you feed it a string.
real-world use of read
Usually read is used by Clojure itself and that's it. Reading EDN is usually done using clojure.edn/read, which does not allow code execution and therefore poses no security risk when handling EDN from untrusted sources.
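As a small illustration, the string-reading sibling clojure.edn/read-string behaves like read for plain data but cannot execute code (the input strings here are just examples):
(require '[clojure.edn :as edn])
(edn/read-string "{:a 42}")   ;=> {:a 42}
(edn/read-string "sam")       ;=> the symbol sam
(edn/read-string "\"sam\"")   ;=> the string "sam"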
docs
For good measure, here are the docs:
(doc read)
-------------------------
clojure.core/read
([] [stream] [stream eof-error? eof-value] [stream eof-error? eof-value recursive?] [opts stream])
Reads the next object from stream, which must be an instance of
java.io.PushbackReader or some derivee. stream defaults to the
current value of *in*.
Opts is a persistent map with valid keys:
:read-cond - :allow to process reader conditionals, or
:preserve to keep all branches
:features - persistent set of feature keywords for reader conditionals
:eof - on eof, return value unless :eofthrow, then throw.
if not specified, will throw
Note that read can execute code (controlled by *read-eval*),
and as such should be used only with trusted sources.
For data structure interop use clojure.edn/read
(doc read-line)
-------------------------
clojure.core/read-line
([])
Reads the next line from stream that is the current value of *in* .
You can find the Clojure API documentation at https://clojure.github.io/clojure/clojure.core-api.html. Both read and read-line are there.
Your specific goal isn't quite clear, but in general, application software prefers read-line and parses the results in whatever way makes sense... perhaps with re-matches for regular expressions. Clojure itself reads program code with read.
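For instance, here is a hedged sketch of the read-line route for the original problem (the line format "sam 99912222" is just an assumed example): read the whole line, split it yourself, and keep the key as a string so it can be matched against later read-line input.
(require '[clojure.string :as str])
(let [line (read-line)                 ; e.g. "sam 99912222"
      [k v] (str/split line #"\s+" 2)]
  {k v})                               ;=> {"sam" "99912222"}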
The other 2 answers are good. I didn't even know that clojure.core/read existed!
I only wanted to add in a list of my favorite documentation sources.
Please review & study the Clojure CheatSheet, which links to examples on clojuredocs.org.
Unfortunately, the API docs at clojure.org are not as descriptive and it is harder to find things unless you already know the name and location.
For an assignment I need to create a map from a text file in Clojure, which I am new to. I'm specifically using a hash-map... but it's possible I should be using another type of map; I'm hoping someone here can answer that for me. I did try changing my hash-map to a sorted-map, but it gave me the same problem.
The first character in every line in the file is the key and the whole line is the value. The key is a number from 0-9999. There are 10,000 lines and each number after the first number in a line is a random number between 0 and 9999.
I've created the hash-map successfully, I think. At least, it's not giving me an error when I just run that code. However, when I try to iterate through it, printing every value for keys 0-9999, it gives me a stack overflow error right in the middle of line 2764 (in the text file). I'm hoping someone can tell me why it's doing this, and a better way to do it?
Here's my code:
(ns clojure-project-441.core
  (:gen-class))

(defn -main
  [& args]
  (def pages (def hash-map (file)))
  (iter 0))

(-main)

(defn file []
  (with-open [rdr (clojure.java.io/reader "pages.txt")]
    (reduce conj [] (line-seq rdr))))

(defn iter [n]
  (doseq [keyval (pages n)] (print keyval))
  (if (< n 10000)
    (iter (inc n))))
Here's a screenshot of my output.
If it's relevant at all I'm using repl.it as my IDE.
Here are some screenshots of the text file, for clarity.
beginning of text file
where the error is being thrown
Thanks.
I think the specific problem that causes the exception is that iter calls itself recursively too many times before hitting the 10,000-line limit.
There are some issues in your code that are very common for people learning Clojure; I'll try to explain:
def is used to define top-level names. They correspond to the concept of constants in the global scope in other programming languages. Think of using def in the same way you would use defn to define functions. In your code, you probably want to use let to give names to intermediate results, like:
(let [uno 1
      dos 2]
  (+ uno dos)) ;; returns 3
You are using the name hash-map to bind it to some result, but that will get in the way if you want to use the function hash-map that is used to create maps. Try renaming it to my-map or similar.
To call a function recursively without blowing the stack you'll need to use recur for reasons that are a bit long to explain. See the factorial example here: https://clojuredocs.org/clojure.core/recur
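For example, here is a minimal sketch (not a drop-in fix of the original code) of the same iteration written with loop/recur, so the call stack never grows:
(loop [n 0]
  (when (< n 10000)
    ;; do the per-key work here, e.g. print the value stored under key n
    (recur (inc n))))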
My advice would be to think of this assignment as a pipeline composed of the following small functions:
A function that reads the lines from the file (you already have this)
A function that, given a line, returns a pair: the first element of the pair is the first number of the line, the second element is the whole line (the input parameter) OR
A function that reads the first number of the line
To build the map, you have a few options; two off the top of my mind:
Use a loop construct and, for each line, "update" the hash-map to include a new key-value pair (the key is the first number, the value is the whole line), then return the whole hash-map you've built
Use a reduce operation: you create a collection of key-value pairs, then tell reduce to merge, one step at a time, into the original hash-map. The result is the hash-map you want
I think the key is to get familiar with the functions that you can use and build small functions that you can test in isolation and try to group them conveniently to solve your problem. Try to get familiar with functions like hash-map, assoc, let, loop and recur. There's a great documentation site at https://clojuredocs.org/ that also includes examples that will help you understand each function.
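To make the reduce option concrete, here is a minimal sketch, assuming each line starts with its numeric key followed by whitespace (build-page-map is an illustrative name, not part of the assignment):
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(defn build-page-map [path]
  (with-open [rdr (io/reader path)]
    (reduce (fn [m line]
              (let [k (Integer/parseInt (first (str/split line #"\s+")))]
                (assoc m k line)))          ; key = first number, value = whole line
            {}
            (line-seq rdr))))
;; usage: no recursion needed to look a line up
;; (get (build-page-map "pages.txt") 2764)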
I am currently working on a big dataset (approximately a billion data points) and I have decided to use C++ over R in particular for convenience in memory allocation.
However, there does not seem to be an equivalent of RStudio for C++ that would let me "store" the data set and avoid having to read the data every time I run the program, which is extremely time consuming...
What kind of techniques do C++ users use for big data in order to read the data "once for all" ?
Thanks for your help!
If I understand what you are trying to achieve, i.e. load some data into memory once and use the same data (in memory) across multiple runs of your code, with possible modifications to that code, then there is no such IDE, as IDEs are not meant to store any data.
What you can do is first load your data into some in-memory database and write your C++ program to read the data from that database instead of reading it directly from the data source.
how avoid multiple reads of big data set. What kind of techniques do C++ users use for big data in order to read the data "once for all"?
I do not know of any C++ tool with such capabilities, though I doubt I have ever searched for one... that seems like something you might do. Keywords appear to be 'data frame' and 'statistical analysis' (and C++).
If you know the 'data set' format, and wish to process raw data no more than one time, you might consider using Posix shared memory.
I can imagine that (a) the 'extremely time consuming' effort could (read and) transform the 'raw' data, and write into a 'data set' (a file) suitable for future efforts (i.e. 'once and for all').
Then (b) future efforts can 'simply' "map" the created 'data set' (a file) into the program's memory space, all ready for use with no (or at least much reduced) time consuming effort.
Expanding the memory map of your program is about using 'Posix' access to shared memory. (Ubuntu 17.10 has it, I have 'gently' used it in C++) Terminology includes, shm_open, mmap, munmap, shm_unlink, and a few others.
From 'man mmap':
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in ...
how avoid multiple reads of big data set. What kind of techniques do C++ users use for big data in order to read the data "once for all"?
I recently retried my hand at measuring std::thread context switch duration (on my Ubuntu 17.10, 64 bit desktop). My app captured <30 million entries over 10 seconds of measurement time. I also experimented with longer measurement times, and with larger captures.
As part of debugging info capture, I decided to write intermediate results to a text file, for a review of what would be input to the analysis.
The code spent only about 2.3 seconds to save this info to the capture text file. My original software would then proceed with analysis.
But this delay to get on with testing the analysis results (> 12 sec = 10 + 2.3) quickly became tedious.
I found the analysis effort otherwise challenging, and recognized I might save time by capturing intermediate data, and thus avoiding most (but not all) of the data measurement and capture effort. So the debug capture to intermediate file became a convenient split to the overall effort.
Part 2 of the split app reads the <30 million byte intermediate file in somewhat less than 0.5 seconds, very much reducing the analysis development cycle (edit-compile-link-run-evaluate), which was (usually) no longer burdened with the 12+ second measurement and data generation.
While 28 M Bytes is not BIG data, I valued the time savings for my analysis code development effort.
FYI - My intermediate file contained a single letter for each 'thread entry into the critical section event'. With 10 threads, the letters were 'A', 'B', ... 'J'. (reminds me of dna encoding)
For each thread, my analysis supported splitting counts per thread. Where vxWorks would 'balance' the threads blocked at a semaphore, Linux does NOT ... which was new to me.
Each thread ran a different number of times through the single critical section, but each thread got about 10% of the opportunities.
Technique: simple encoded text file with captured information ready to be analyzed.
Note: I was expecting to test piping the output of app part 1 into app part 2. Still could, I guess. WIP.
I'm currently using Datomic in one of my projects, and a question is bothering me.
Here is a simplified version of my problem:
I need to parse a list of small English sentences and insert both the full sentence and its words into Datomic.
the file that contains the list of sentences is quite big (> 10 GB)
the same sentence can occur multiple times in the file, and their words can also occur multiple times across sentences
during the insertion process, an attribute will be set to associate each sentence with its corresponding words
To ease the insertion process, I'm tempted to write the same datoms multiple times (i.e. not check if a record already exists in the database). But I'm afraid about the performance impact.
What happens in Datomic when the same datoms are added multiple times ?
Is it worth checking that a datom has already been added prior to the transaction ?
Is there a way to prevent Datomic from overriding previous datoms (i.e. if a record already exists, skip the transaction)?
Thank you for your help
What happens in Datomic when the same datoms are added multiple times ?
Is it worth checking that a datom has already been added prior to the transaction ?
Logically, a Datomic database is a sorted set of datoms, so adding the same datom several times is idempotent. However, when you're asserting a datom with a tempid, you may create a new datom for representing the same information as an old datom. This is where :db/unique comes in.
To ensure an entity does not get stored several times, you want to set the :db/unique attribute property to :db.unique/identity for the right attributes. For instance, if your schema consists of 3 attributes :word/text, :sentence/text, and :sentence/words, then :word/text and :sentence/text should be :db.unique/identity, which yields the following schema installation transaction:
[{:db/cardinality :db.cardinality/one,
  :db/fulltext true,
  :db/index true,
  :db.install/_attribute :db.part/db,
  :db/id #db/id[:db.part/db -1000777],
  :db/ident :sentence/text,
  :db/valueType :db.type/string,
  :db/unique :db.unique/identity}
 {:db/cardinality :db.cardinality/one,
  :db/fulltext true,
  :db/index true,
  :db.install/_attribute :db.part/db,
  :db/id #db/id[:db.part/db -1000778],
  :db/ident :word/text,
  :db/valueType :db.type/string,
  :db/unique :db.unique/identity}
 {:db/cardinality :db.cardinality/many,
  :db/fulltext true,
  :db/index true,
  :db.install/_attribute :db.part/db,
  :db/id #db/id[:db.part/db -1000779],
  :db/ident :sentence/words,
  :db/valueType :db.type/ref}]
Then the transaction for inserting a sentence looks like:
[{:sentence/text "Hello World!"
  :sentence/words [{:word/text "hello"
                    :db/id (d/tempid :db.part/user)}
                   {:word/text "world"
                    :db/id (d/tempid :db.part/user)}]
  :db/id (d/tempid :db.part/user)}]
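With unique identities in place, resubmitting the same data is harmless, and entities can be read back through their unique attribute. A hedged sketch, assuming conn is your Datomic connection and tx-data is the transaction above:
(require '[datomic.api :as d])
@(d/transact conn tx-data)
@(d/transact conn tx-data)   ; second transact upserts onto the same entities
;; a lookup ref resolves the sentence via its unique attribute
(d/pull (d/db conn) '[* {:sentence/words [:word/text]}] [:sentence/text "Hello World!"])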
Regarding performance:
You may not need to optimize at all, but in my view, the potential performance bottlenecks of your import process are:
time spent building the transaction in the Transactor (which includes index lookups for unique attributes etc.)
time spent building the indexes.
To improve 2: when the data you insert is sorted, indexing is faster, so one option would be to insert the words and sentences in sorted order. You can use Unix tools to sort large files even if they don't fit in memory. So the process would be:
sort sentences, insert them (:sentence/text)
extract words, sort them, insert them (:word/text)
insert word-sentence relationship (:sentence/words)
To improve 1: indeed, it could put less pressure on the transactor to use entity ids for words that are already stored instead of the whole word text (which requires an index lookup to ensure uniqueness). One idea could be to perform that lookup on the Peer, either by leveraging parallelism and/or only for frequent words (for instance, you could insert the words from the first 1000 sentences, then retrieve their entity ids and keep them in a hash map), as sketched below.
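A sketch of that Peer-side cache, assuming conn is your Datomic connection (word->eid is an illustrative name):
(require '[datomic.api :as d])
(def word->eid
  (into {}
        (d/q '[:find ?text ?w
               :where [?w :word/text ?text]]
             (d/db conn))))
;; later transactions can then reference words by entity id, e.g.
;; {:sentence/text "..." :sentence/words (mapv word->eid ["hello" "world"])}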
Personally, I would not go through these optimizations until experience has shown they're necessary.
You are not at the point you need to worry about pre-optimization like this. Retail computer stores sell hard disks for about $0.05/GB, so you are talking about 50 cents worth of storage here. With Datomic's built-in storage compression, this will be even smaller. Indexes & other overhead will increase the total a bit, but it's still too small to worry about.
As with any problem, it is best to build up a solution incrementally. So, maybe do an experiment with the first 1% of your data and time the simplest possible algorithm. If that's pretty fast, try 10%. You now have a pretty good estimate of how long the whole problem will take to load the data. I'm betting that querying the data will be even faster than loading.
If you run into a roadblock after the first 1% or 10%, then you can think about reworking the design. Since you have built something concrete, you have been forced to think about the problem & solution in more detail. This is much better than hand-waving arguments & whiteboard designing. You now know a lot more about your data and possible solution implementations.
If it turns out the simplest solution won't work at a larger scale, the 2nd solution will be much easier to design & implement having had the experience of the first solution. Very rarely does the final solution spring full-formed from your mind. It is very important for any significant problem to plan on repeated refinement of the solution.
One of my favorite chapters from the seminal book
The Mythical Man Month by Fred Brooks is entitled, "Plan To Throw One Away".
What happens in Datomic when the same datoms are added multiple times?
If you are adding the word/sentence with a unique identity (:db.unique/identity), then Datomic will keep only one copy of it in storage (i.e. a single entity).
Is it worth checking that a datom has already been added prior to the transaction?
Is there a way to prevent Datomic from overriding previous datoms (i.e. if a record already exists, skip the transaction)?
Again, use :db.unique/identity, then you don't need to query for the entity id to check its existence.
For more information, please see here.
I'm looking for a way of keeping a filename that's associated with the tuples/data that originate from that particular file. I've searched around and found that hfs-wholefile works really well at getting filenames but it then returns a large chunk of binary information. Is it possible to take this binary information and turn it back into tuples that I can then processes as if I had gotten them from hfs-textline?
(defn file-name-with-data
  "Process a file and associate a filename with it"
  [file]
  (<- [?file-name ?data1 ?data2 ?data3 ?data4]
      ((hfs-wholefile file) ?file-name ?binary-data)
      (function-that-im-looking-for ?binary-data :> ?data1 ?data2 ?data3 ?data4)))
The example above is ideally what I would like to use to process this information. In Cascalog/Cascading is there a way to turn the bytes into regular variables I can use in queries?
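I am not certain of the exact tap contract across Cascalog versions, but assuming hfs-wholefile emits each file's contents as a Hadoop BytesWritable, a custom defmapcatop could play the role of the missing function: decode the bytes and emit one tuple per CSV line. bytes->lines and the 4-field layout below are purely illustrative, not a confirmed API:
(use 'cascalog.api)
(require '[clojure.string :as str])
(import '[org.apache.hadoop.io BytesWritable])
;; decode a whole file's bytes and emit one 4-field tuple per CSV line
(defmapcatop bytes->lines [^BytesWritable bw]
  (let [text (String. (.getBytes bw) 0 (.getLength bw) "UTF-8")]
    (for [line (str/split-lines text)]
      (vec (take 4 (str/split line #","))))))
;; hypothetical use inside the query:
;; (bytes->lines ?binary-data :> ?data1 ?data2 ?data3 ?data4)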