Can you create a map out of a text file? - clojure

I have a text file that contains 10,000 lines of numbers like this:
0 1076 1198 1722 1318 1642 9118
1 6367 461 4772 1324 1735 487 5668
2 4412 1028 209 3130 6902 8397 4373 905 3833 2403
3 5103 1203 7063 4590 5866 445 5498 6217 6498 7298
4 5544 1377 2284 3187 7931 5280 9572 7221 1916 9608
5 2598 9480 7989 1904 845 6514 1200 8699 6214 3216 942 7870 6685 4430 5532 3128 9298
6 9770 1223 8758 6103 9560 356 8469 3570 1178 3626 2985 8780
I want to use the number at index 0 as the key, and the rest of the numbers on the same line as the values assigned to that key. I thought I could make the program read the file line by line, then manually assign index 0 as the key, but I am unsure of the Clojure syntax for it.

I would propose the following (quick and dirty) approach:
We read the file line by line, parsing each line with the EDN reader.
Each line is processed like this:
(defn process-line [line-str]
  (let [[x & xs] (clojure.edn/read-string (str "[" line-str "]"))]
    [x (vec xs)]))
user> (process-line "1 2 3 4 5")
;;=> [1 [2 3 4 5]]
Now we just need to read and process every line, and then assemble it all into a map:
user> (->> "/home/leetwin/dev/input.txt"
           clojure.java.io/reader
           line-seq
           (map process-line)
           (into {}))
Output:
{0 [1076 1198 1722 1318 1642 9118],
 1 [6367 461 4772 1324 1735 487 5668],
 2 [4412 1028 209 3130 6902 8397 4373 905 3833 2403],
 3 [5103 1203 7063 4590 5866 445 5498 6217 6498 7298],
 4 [5544 1377 2284 3187 7931 5280 9572 7221 1916 9608],
 5 [2598 9480 7989 1904 845 6514 1200 8699 6214 3216 942 7870 6685 4430 5532 3128 9298],
 6 [9770 1223 8758 6103 9560 356 8469 3570 1178 3626 2985 8780]}
Also, you may want to filter out invalid or empty lines, which you can do by plugging a filtering step into the pipeline:
user> (->> "/home/leetwin/dev/input.txt"
           clojure.java.io/reader
           line-seq
           (remove clojure.string/blank?)
           (map process-line)
           (into {}))
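One caveat not in the original answer: the reader opened above is never explicitly closed. Since (into {}) realizes the whole pipeline eagerly, it is safe to wrap it in with-open:

(with-open [rdr (clojure.java.io/reader "/home/leetwin/dev/input.txt")]
  (->> (line-seq rdr)
       (remove clojure.string/blank?)
       (map process-line)
       (into {})))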

Another take, assuming the whole file has been slurped into the string lines and clojure.string is required as str:
(->> lines
     str/split-lines
     (map str/trim)
     (remove str/blank?)           ; drop empty lines before parsing
     (map #(str/split % #"\s+"))
     (map (fn [line] (map #(Integer/parseInt %) line)))
     (reduce (fn [state [head & tail]] (assoc state head tail)) {}))
Split the input into lines so we get a sequence of strings
Trim each line to remove stray whitespace
Drop blank lines so parsing can't fail on them
Split on whitespace so each line is now a sequence of number strings
Parse each line into an Integer sequence
Reduce the lines into a map by taking the head of each sequence
as the key and the rest as the values.
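For reference, a hypothetical end-to-end wrapper for this pipeline (the name file->map and the path are illustrative, not from the original answer):

(require '[clojure.string :as str])

(defn file->map [path]
  (->> (slurp path)
       str/split-lines
       (map str/trim)
       (remove str/blank?)
       (map #(str/split % #"\s+"))
       (map (fn [line] (map #(Integer/parseInt %) line)))
       (reduce (fn [state [head & tail]] (assoc state head tail)) {})))

(file->map "input.txt")
;; => {0 (1076 1198 1722 1318 1642 9118), 1 (6367 461 4772 1324 1735 487 5668), ...}

Note that the values come out as seqs here; wrap tail in vec if you want vectors.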

Here is one way to do it:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [tupelo.core :as t]
    [schema.core :as s]
    [clojure.string :as str]
    [tupelo.string :as ts]
    [tupelo.parse :as tp]))
(def src-string
"0 1076 1198 1722 1318 1642 9118
1 6367 461 4772 1324 1735 487 5668
2 4412 1028 209 3130 6902 8397 4373 905 3833 2403
3 5103 1203 7063 4590 5866 445 5498 6217 6498 7298
4 5544 1377 2284 3187 7931 5280 9572 7221 1916 9608
5 2598 9480 7989 1904 845 6514 1200 8699 6214 3216 942 7870 6685 4430 5532 3128 9298
6 9770 1223 8758 6103 9560 356 8469 3570 1178 3626 2985 8780
" )
(defn line->map
  [line]
  (let [tokens-str (str/split line #"\s+")
        tokens-num (mapv tp/parse-int tokens-str)
        key        (first tokens-num)
        vals       (vec (rest tokens-num))
        result-map {key vals}]
    ; (spyx tokens-str)
    ; (spyx tokens-num)
    ; (spyx result-map)
    result-map))
(dotest
  (let [filename  "/tmp/dummy.txt"
        >>        (spit filename src-string)
        str-in    (slurp filename)
        >>        (assert (= src-string str-in))
        lines     (remove str/blank?
                    (mapv str/trim
                      (str/split-lines str-in)))
        line-maps (mapv line->map lines)
        result    (into {} line-maps)]
    ; (spyx-pretty lines)
    (spyx-pretty result)
    result))
with result:
result =>
{0 [1076 1198 1722 1318 1642 9118],
 1 [6367 461 4772 1324 1735 487 5668],
 2 [4412 1028 209 3130 6902 8397 4373 905 3833 2403],
 3 [5103 1203 7063 4590 5866 445 5498 6217 6498 7298],
 4 [5544 1377 2284 3187 7931 5280 9572 7221 1916 9608],
 5 [2598 9480 7989 1904 845 6514 1200 8699 6214 3216 942 7870 6685 4430 5532 3128 9298],
 6 [9770 1223 8758 6103 9560 356 8469 3570 1178 3626 2985 8780]}

Related

Clojure - StackOverflowError while iterating over lazy collection

I am currently implementing a solution to one of the Project Euler problems, namely the Sieve of Eratosthenes (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes), in Clojure. Here's my code:
(defn cross-first-element [coll]
  (filter #(not (zero? (rem % (first coll)))) coll))

(println
  (last
    (map first
      (take-while
        (fn [[primes sieve]] (not (empty? sieve)))
        (iterate
          (fn [[primes sieve]] [(conj primes (first sieve)) (cross-first-element sieve)])
          [[] (range 2 2000001)])))))
The basic idea is to have two collections - the primes already retrieved from the sieve, and the remaining sieve itself. We start with empty primes, and until the sieve is empty, we pick its first element, append it to the primes, and then cross out its multiples from the sieve. When the sieve is exhausted, we know the primes contain all prime numbers below two million.
Unfortunately, while it works well for a small sieve bound (say 1000), it causes a java.lang.StackOverflowError with a long stack trace containing a repeating sequence of:
...
clojure.lang.RT.seq (RT.java:531)
clojure.core$seq__5387.invokeStatic (core.clj:137)
clojure.core$filter$fn__5878.invoke (core.clj:2809)
clojure.lang.LazySeq.sval (LazySeq.java:42)
clojure.lang.LazySeq.seq (LazySeq.java:51)
...
Where is the conceptual error in my solution? How do I fix it?
The reason for this is the following: since the filter call in your cross-first-element is lazy, it doesn't actually filter your collection on every iterate step; rather, it 'stacks' filter calls. This means that when you actually need a resulting element, the whole pile of test functions is executed, roughly like this:
(#(not (zero? (rem % (first coll1))))
  (#(not (zero? (rem % (first coll2))))
    (#(not (zero? (rem % (first coll3))))
      ;; and 2000000 more calls
leading to stack overflow.
The simplest solution in your case is to make the filtering eager. You can do that by simply using filterv instead of filter, or by wrapping the call in (doall (filter ...)).
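For example, a minimal eager rewrite of cross-first-element (filterv realizes the whole result vector immediately, so lazy filter wrappers no longer pile up across iterations):

(defn cross-first-element [coll]
  ;; filterv is eager and returns a vector
  (filterv #(not (zero? (rem % (first coll)))) coll))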
Still, your solution is really slow. I would rather use loop and native arrays for that; see the sketch below.
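A rough sketch of that loop-and-arrays idea (names and structure are mine, not from the original answer):

(defn sieve-primes [limit]
  (let [composite? (boolean-array (inc limit))] ; index i true => i is crossed out
    (loop [i 2, primes (transient [])]
      (if (> i limit)
        (persistent! primes)
        (if (aget composite? i)
          (recur (inc i) primes)
          (do
            ;; cross out multiples of i, starting from i*i
            (loop [j (* i i)]
              (when (<= j limit)
                (aset composite? j true)
                (recur (+ j i))))
            (recur (inc i) (conj! primes i))))))))

(count (sieve-primes 2000000))
;; => 148933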
You have (re-)discovered that having nested lazy sequences can sometimes be problematic. Here is one example of what can go wrong (it is non-intuitive).
If you don't mind using a library, the problem is much simpler with a single lazy wrapper around an imperative loop. That is what lazy-gen and yield give you (a la "generators" in Python):
(ns tst.demo.core
  (:use demo.core tupelo.test)
  (:require [tupelo.core :as t]))

(defn unprime? [primes-so-far candidate]
  (t/has-some? #(zero? (rem candidate %)) primes-so-far))

(defn primes-generator []
  (let [primes-so-far (atom [2])]
    (t/lazy-gen
      (t/yield 2)
      (doseq [candidate (drop 3 (range))] ; 3..inf
        (when-not (unprime? @primes-so-far candidate)
          (t/yield candidate)
          (swap! primes-so-far conj candidate))))))
(def primes (primes-generator))
(dotest
  (is= (take 33 primes)
    [2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 101 103 107 109 113 127 131 137])
  ; first prime over 10,000
  (is= 10007 (first (drop-while #(< % 10000) primes)))
  ; the 10,000'th prime (https://primes.utm.edu/lists/small/10000.txt)
  (is= 104729 (nth primes 9999))) ; about 12 sec to compute
We could also use loop/recur to control the loop, but it's easier to read with an atom to hold the state.
Unless you really, really need a lazy & infinite solution, the imperative solution is so much simpler:
(defn primes-upto [limit]
  (let [primes-so-far (atom [2])]
    (doseq [candidate (t/thru 3 limit)]
      (when-not (unprime? @primes-so-far candidate)
        (swap! primes-so-far conj candidate)))
    @primes-so-far))
(dotest
  (is= (primes-upto 100)
    [2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97]))
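For comparison, a hypothetical loop/recur variant of primes-upto that threads the state explicitly instead of holding it in an atom:

(defn primes-upto-loop [limit]
  (loop [primes    [2]
         candidate 3]
    (if (> candidate limit)
      primes
      (recur (if (unprime? primes candidate)
               primes
               (conj primes candidate))
             (inc candidate)))))

(primes-upto-loop 100)
;; => [2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97]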

What does sequence do with a transducer?

Two related questions about sequence:
Given a transducer, e.g. (def xf (comp (filter odd?) (map inc))),
What's the relationship between (into [] xf (range 10)) or (into () xf (range 10)) and (sequence xf (range 10))? Is it just that there's no literal syntax for a lazy sequence that could be used as the target collection of into, so we need a separate function, sequence, for this purpose? (I know that sequence has another, non-transducer use: coercing a collection into a sequence of one kind or another.)
The Clojure transducers page says, about uses of sequence like the one above,
The resulting sequence elements are incrementally computed. These sequences will consume input incrementally as needed and fully realize intermediate operations. This behavior differs from the equivalent operations on lazy sequences.
To me that sounds as if sequence doesn't return a lazy sequence, yet the docstring for sequence says "When a transducer is supplied, returns a lazy sequence of applications of the transform to the items in coll(s), ....", and in fact (class (sequence xf (range 10))) returns clojure.lang.LazySeq. I think I don't understand the last sentence quoted above from the Clojure transducers page.
(sequence xform from) creates a lazy seq (via RT.chunkIteratorSeq) over a TransformerIterator, to which xform and from are passed. When the next value is requested, xform (the composition of transformations) is invoked on the next value from from.
This behavior differs from the equivalent operations on lazy sequences.
What would the equivalent operations on lazy sequences be? With your xf as an example:
applying filter odd? to (range 10) to produce an intermediate lazy sequence, then applying map inc to that intermediate sequence to produce the final lazy sequence.
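Concretely, that equivalent lazy-sequence version of xf would be the classic nested calls:

(map inc (filter odd? (range 10)))
;; => (2 4 6 8 10)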
I would say that (into to xform from) is similar to (into to (sequence xform from)) when from is a collection which does not implement IReduceInit. Internally, into uses (transduce xform conj to from), which does the same as (reduce (xform conj) to from); in the end clojure.core.protocols/coll-reduce is called:
(into [] (sequence xf (range 10)))
;[2 4 6 8 10]
(into [] xf (range 10))
;[2 4 6 8 10]
(transduce xf conj [] (range 10))
;[2 4 6 8 10]
(reduce (xf conj) [] (range 10))
;[2 4 6 8 10]
I modified your transducer a bit:
(defn hof-pr
  "Prints char c on each invocation of function f within higher order function"
  ([hof f c]
   (hof (fn [e] (print c) (f e))))
  ([hof f c coll]
   (hof (fn [e] (print c) (f e)) coll)))

(def map-inc-pr    (partial hof-pr map inc \m))
(def filter-odd-pr (partial hof-pr filter odd? \f))
(def xf (comp (filter-odd-pr) (map-inc-pr)))
so that it prints a character on each transformation step.
Create s1 in the REPL as follows:
(def s1 (into [] xf (range 10)))
ffmffmffmffmffm
s1 is eagerly evaluated (f printed for each filtering step, m for each mapping step). No further evaluation happens when s1 is requested again:
s1
[2 4 6 8 10]
Let's create s2:
(def s2 (sequence xf (range 10)))
ffm
Only the first item in s2 is evaluated. The next items will be evaluated when requested:
s2
ffmffmffmffm(2 4 6 8 10)
Additionally, create s3 the old way:
(def s3 (map-inc-pr (filter-odd-pr (range 10))))
s3
ffffffffffmmmmm(2 4 6 8 10)
As you can see, no evaluation happens when s3 is defined. When s3 is requested, filtering is applied over all 10 elements, and after that mapping is applied over the remaining 5 elements, producing the final sequence.
I didn't find the current answer clear enough, so here goes...
sequence does return a LazySeq, but it is a chunked one, so when you play around with it in the REPL you will often have the impression it is eager, because your collection will probably be too small and the chunking will make it look eager. The chunk size is, I think, somewhat dynamic, and the chunks won't always be exactly the same size, but in general it seems to be 32. So your transducer will be applied to the input collection 32 elements at a time, lazily.
Here's a simple transducer that just prints the elements it reduces over and returns them untouched:
(defn printer
  [xf]
  (fn
    ([] (xf))
    ([result] (xf result))
    ([result input]
     (println input)
     (xf result input))))
If we create a sequence s of 100 elements with it:
(def s
  (sequence
    printer
    (range 100)))
;;> 0
We see that it prints 0, but nothing else. On the call to sequence, the first element is thus consumed from (range 100) and passed to the xf chain to be transformed, which in our case just prints it. No elements other than the first have been consumed yet.
Now if we take one element from s:
(take 1 s)
;;> 0
;;> 1
;;> 2
;;> 3
;;> 4
;;> 5
;;> 6
;;> 7
;;> 8
;;> 9
;;> 10
;;> 11
;;> 12
;;> 13
;;> 14
;;> 15
;;> 16
;;> 17
;;> 18
;;> 19
;;> 20
;;> 21
;;> 22
;;> 23
;;> 24
;;> 25
;;> 26
;;> 27
;;> 28
;;> 29
;;> 30
;;> 31
;;> 32
We see that it printed the first 32 elements. This is the normal behavior of chunked lazy sequences in Clojure. You can think of it as semi-lazy, in that it consumes chunk-size elements at a time, instead of 1 at a time.
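As a quick REPL check of the chunking (my addition, not from the original answer), chunked-seq? reports that the seq realized from such a sequence is chunked:

(chunked-seq? (seq (sequence (map inc) (range 100))))
;; => true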
Now if we try to take any element from 1 to 32, nothing else will be printed, because the first 32 elements have already been processed:
(take 1 s)
;; => (0)
(take 10 s)
;; => (0 1 2 3 4 5 6 7 8 9)
(take 24 s)
;; => (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23)
(take 32 s)
;; => (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31)
Nothing gets printed, and each take returns the expected results. I'm using ;; => for return values, and ;;> for printed output.
Okay, now if we take the 33rd element, we expect to see the next chunk of 32 elements being printed:
(take 33 s)
;;> 33
;;> 34
;;> 35
;;> 36
;;> 37
;;> 38
;;> 39
;;> 40
;;> 41
;;> 42
;;> 43
;;> 44
;;> 45
;;> 46
;;> 47
;;> 48
;;> 49
;;> 50
;;> 51
;;> 52
;;> 53
;;> 54
;;> 55
;;> 56
;;> 57
;;> 58
;;> 59
;;> 60
;;> 61
;;> 62
;;> 63
;;> 64
Awesome! So once more, we see that only the next 32 were taken, which brings us to a total of 64 elements now processed.
Well, this demonstrates that sequence called with a transducer does in fact create a lazy chunked sequence, where elements are only processed when needed (chunk-size at a time).
So what's this about?:
The resulting sequence elements are incrementally computed. These sequences will consume input incrementally as needed and fully realize intermediate operations. This behavior differs from the equivalent operations on lazy sequences.
This is about the order in which the operations happen. With sequence and a transducer:
(sequence (comp A B C) coll)
will, for each element in the chunk, have it go through A -> B -> C, so you get:
A(e1) -> B(e1) -> C(e1)
A(e2) -> B(e2) -> C(e2)
...
A(e32) -> B(e32) -> C(e32)
While for a normal lazy seq like:
(->> coll A B C)
will first have all the chunked elements go through A, and then have them all go through B, and then C:
A(e1)
A(e2)
...
A(e32)
|
B(e1)
B(e2)
...
B(e32)
|
C(e1)
C(e2)
...
C(e32)
This requires an intermediate collection between each step, as the results of A have to be collected into a collection that is then looped over to apply B, etc.
We can see this with our previous example:
(def s
  (sequence
    (comp (filter odd?)
          printer
          (map vector)
          printer)
    (range 10)))
(take 1 s)
;;> 1
;;> [1]
;;> 3
;;> [3]
;;> 5
;;> [5]
;;> 7
;;> [7]
;;> 9
;;> [9]
(def l
  (->> (range 10)
       (filter odd?)
       (map #(do (println %) %))
       (map vector)
       (map #(do (println %) %))))
(take 1 l)
;;> 1
;;> 3
;;> 5
;;> 7
;;> 9
;;> [1]
;;> [3]
;;> [5]
;;> [7]
;;> [9]
See how the first interleaves filter -> vector -> filter -> vector, etc., while the second does all the filtering, then all the vectoring. This is what the quote from the doc means.
Now one more thing: there is also a difference between the two in how the chunking is applied. With sequence and a transducer, elements are processed until the transduced result has chunk-size elements. In the lazy-seq case, each level processes in chunks until every step has enough for what it needs to do.
Here's what I mean:
(def s
  (sequence
    (comp printer
          (filter odd?))
    (range 100)))
(take 1 s)
;;> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
(def l
  (->> (range 100)
       (map #(do (print % "") %))
       (filter odd?)))
(take 1 l)
;;> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Here I modified the printing logic to be on the same line, so it doesn't take as much space. And if you look closely, s processed 66 elements of the input range, while l only consumed 32 elements.
The reason for this is what I said above. With sequence, we will continue taking in chunks until we have chunk-size number of results. In this case, the chunk-size is 32, and since we filter on odd?, it takes us two chunks to reach 32 results.
With a lazy seq, it doesn't try to grab a full chunk of results, only enough chunks from the input to satisfy the logic; in this case a single chunk of 32 input elements is enough to find one odd number to take.

How to wrap a string in an input-stream?

How can I wrap a string in an input stream in such a way that I can test the function below?
(defn parse-body [body]
  (cheshire/parse-stream (clojure.java.io/reader body) true))

(deftest test-parse-body
  (testing "read body"
    (let [body "{\"age\": 28}"] ;; must wrap string
      (is (= (parse-body body) {:age 28})))))
It is straightforward to construct an InputStream from a String using host interop, by converting to a byte-array first:
(defn string->stream
  ([s] (string->stream s "UTF-8"))
  ([s encoding]
   (-> s
       (.getBytes encoding)
       (java.io.ByteArrayInputStream.))))
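With that helper, the test from the question can wrap its string before calling parse-body:

(deftest test-parse-body
  (testing "read body"
    (let [body (string->stream "{\"age\": 28}")]
      (is (= (parse-body body) {:age 28})))))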
As another stream and byte interop example, here's a function that returns a vector of the bytes produced when encoding a String to a given format:
(defn show-bytes
  [s encoding]
  (let [buf    (java.io.ByteArrayOutputStream.)
        stream (string->stream s encoding)
        ;; worst case, 8 bytes per char?
        data   (byte-array (* (count s) 8))
        size   (.read stream data 0 (count data))]
    (.write buf data 0 size)
    (.flush buf)
    (apply vector-of :byte (.toByteArray buf))))
user=> (string->stream "hello")
#object[java.io.ByteArrayInputStream 0x39b43d60 "java.io.ByteArrayInputStream#39b43d60"]
user=> (isa? (class *1) java.io.InputStream)
true
user=> (show-bytes "hello" "UTF-8")
[104 101 108 108 111]
user=> (show-bytes "hello" "UTF-32")
[0 0 0 104 0 0 0 101 0 0 0 108 0 0 0 108 0 0 0 111]
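As a side note (an assumption on my part, not in the original answer): since parse-body only needs something clojure.java.io/reader can coerce, a java.io.StringReader also works, with no byte round-trip:

(parse-body (java.io.StringReader. "{\"age\": 28}"))
;; => {:age 28}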

How to filter decreasing elements in an increasing vector?

For example
[1 2 3 40 7 30 31 32 41]
after filtering should be
[1 2 3 30 31 32 41]
The problem doesn't seem very simple because I'd like to maximize the size of the resulting vector, so that if the starting vector is
[1 2 3 40 30 31 32 41 29]
I prefer this result
[1 2 3 30 31 32 41]
rather than just
[1 2 3 29]
Your problem is known as the longest increasing subsequence.
Via Rosetta Code:
(defn place [piles card]
  (let [[les gts] (->> piles (split-with #(<= (ffirst %) card)))
        newelem   (cons card (->> les last first))
        modpile   (cons newelem (first gts))]
    (concat les (cons modpile (rest gts)))))

(defn a-longest [cards]
  (let [piles (reduce place '() cards)]
    (->> piles last first reverse)))
(a-longest [1 2 3 40 30 31 32 41 29])
;; => (1 2 3 30 31 32 41)
Could probably be optimized to use transients if you care about performance.

How to combine sequences in clojure?

I have the following sequences
(def a [1 2 3 4])
(def b [10 20 30 40])
(def c [100 200 300 400])
I want to combine the sequences element by element:
(... + a b c)
To give me:
[111 222 333 444]
Is there a standard function available to do so? Or alternatively what is a good idiomatic way to do so?
If you use Clojure 1.4.0 or above, you can use mapv:
user> (mapv + [1 2 3 4] [10 20 30 40] [100 200 300 400])
[111 222 333 444]
The function you are looking for is map.
(map + [1 2 3 4] [10 20 30 40] [100 200 300 400])
;=> (111 222 333 444)
Note that map returns a lazy sequence, and not a vector as shown in your example. But you can pour the lazy sequence into an empty vector by using the into function.
(into [] (map + [1 2 3 4] [10 20 30 40] [100 200 300 400]))
;=> [111 222 333 444]
Also (for completeness, as noted in another answer), in Clojure 1.4.0+ you can use mapv (with the same arguments as map) to obtain a vector result.
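One small addition (not from the original answers): if the sequences themselves arrive in a collection, apply combines naturally with map or mapv:

(apply mapv + [[1 2 3 4] [10 20 30 40] [100 200 300 400]])
;; => [111 222 333 444]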