I have this function to read a file and convert it to a list of two-elements lists:
(def f1 "/usr/example")
(defn read-file [file]
(let [f
(with-open [rdr (clojure.java.io/reader file)]
(doall (map list (line-seq rdr))))]
(cond
(= file f1) (map #(map read-string (split (first %) #" ")) f)
:else (map #(map read-string (split (first %) #"\t")) f))))
I use cond to split the file correctly(I have two types of files, the first separates elements by spaces and the second, with tabs).
The first type of file would be like:
"1.3880896237218878E9 0.4758112837388654
1.3889631620596328E9 0.491845185928218"
while the second is:
'1.3880896237218878E9\t0.4758112837388654
1.3889631620596328E9\t0.491845185928218"
I get the result I want, for example:
((1.3880896237218878E9 0.4758112837388654) (1.3889631620596328E9 0.491845185928218))
But I wonder if there's a cleaner way to do that, maybe using less map functions or doing it without cond
This returns a vector of vectors, splitting individual lines on arbitrary whitespace and using Double/parseDouble to read in the individual doubles. What it doesn't handle are any single or double quote characters in the files; if they are part of the actual input, I suppose I'd just preprocess it with a regex to get rid of them (see below).
(require '[clojure.java.io :as io] '[clojure.string :as string])
(defn read-file [f]
(with-open [rdr (io/reader f)]
(mapv (fn [line]
(mapv #(Double/parseDouble %) (string/split line #"\s+")))
(line-seq rdr))))
As for the aforementioned preprocessing, you could use #(string/replace % #"['\"]" "") to remove all single quotes. That would be appropriate if they occur at the beginning and end of the input, or perhaps the individual lines. (If the individual numbers are quoted, then you need to make sure you're not removing all delimiters between them -- in such a case it may be better to replace with a single space and then use string/trim to remove any whitespace from the ends of the string.)
Related
I am trying to replace bad characters from a input string.
Characters should be valid UTF-8 characters (tabs, line breaks etc. are ok).
However I was unable to figure out how to replace all found bad characters.
My solution works for the first bad character.
Usually there are none bad characters. 1/50 cases there is one bad character. I'd just want to make my solution foolproof.
(defn filter-to-utf-8-string
"Return only good utf-8 characters from the input."
[input]
(let [bad-characters (set (re-seq #"[^\p{L}\p{N}\s\p{P}\p{Sc}\+]+" input))
filtered-string (clojure.string/replace input (apply str (first bad-characters)) "")]
filtered-string))
How can I make replace work for all values in sequence not just for the first one?
Friend of mine helped me to find workaround for this problem:
I created a filter for replace using re-pattern.
Within let code is currently
filter (if (not (empty? bad-characters))
(re-pattern (str "[" (clojure.string/join bad-characters) "]"))
#"")
filtered-string (clojure.string/replace input filter "")
Here is a simple version:
(ns xxxxx
(:require
[clojure.string :as str]
))
(def all-chars (str/join (map char (range 32 80))))
(println all-chars)
(def char-L (str/join (re-seq #"[\p{L}]" all-chars)))
(println char-L)
(def char-N (str/join (re-seq #"[\p{N}]" all-chars)))
(println char-N)
(def char-LN (str/join (re-seq #"[\p{L}\p{N}]" all-chars)))
(println char-LN)
all-chars => " !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNO"
char-L => "ABCDEFGHIJKLMNO"
char-N => "0123456789"
char-LN => "0123456789ABCDEFGHIJKLMNO"
So we start off with all ascii chars in the range of 32-80. We first print only the letter, then only the numbers, then either letters or numbers. It seems this should work for your problem, although instead of rejecting non-members of the desired set, we keep the members of the desired set.
In Clojure I could use something like this solution: Compact Clojure code for regular expression matches and their position in string, i.e., creating a re-matcher and extracted the information from that, but re-matcher doesn't appear to be implemented in ClojureScript. What would be a good way to accomplish the same thing in ClojureScript?
Edit:
I ended up writing a supplementary function in order to preserve the modifiers of the regex as it is absorbed into re-pos:
(defn regex-modifiers
"Returns the modifiers of a regex, concatenated as a string."
[re]
(str (if (.-multiline re) "m")
(if (.-ignoreCase re) "i")))
(defn re-pos
"Returns a vector of vectors, each subvector containing in order:
the position of the match, the matched string, and any groups
extracted from the match."
[re s]
(let [re (js/RegExp. (.-source re) (str "g" (regex-modifiers re)))]
(loop [res []]
(if-let [m (.exec re s)]
(recur (conj res (vec (cons (.-index m) m))))
res))))
You can use the .exec method of JS RegExp object. The returned match object contains an index property that corresponds to the index of the match in the string.
Currently clojurescript doesn't support constructing regex literals with the g mode flag (see CLJS-150), so you need to use the RegExp constructor. Here is a clojurescript implementation of the re-pos function from the linked page:
(defn re-pos [re s]
(let [re (js/RegExp. (.-source re) "g")]
(loop [res {}]
(if-let [m (.exec re s)]
(recur (assoc res (.-index m) (first m)))
res))))
cljs.user> (re-pos "\\w+" "The quick brown fox")
{0 "The", 4 "quick", 10 "brown", 16 "fox"}
cljs.user> (re-pos "[0-9]+" "3a1b2c1d")
{0 "3", 2 "1", 4 "2", 6 "1"}
I'm trying to convert a hyphenated string to CamelCase string. I followed this post: Convert hyphens to camel case (camelCase)
(defn hyphenated-name-to-camel-case-name [^String method-name]
(clojure.string/replace method-name #"-(\w)"
#(clojure.string/upper-case (first %1))))
(hyphenated-name-to-camel-case-name "do-get-or-post")
==> do-Get-Or-Post
Why I'm still getting the dash the output string?
You should replace first with second:
(defn hyphenated-name-to-camel-case-name [^String method-name]
(clojure.string/replace method-name #"-(\w)"
#(clojure.string/upper-case (second %1))))
You can check what argument clojure.string/upper-case gets by inserting println to the code:
(defn hyphenated-name-to-camel-case-name [^String method-name]
(clojure.string/replace method-name #"-(\w)"
#(clojure.string/upper-case
(do
(println %1)
(first %1)))))
When you run the above code, the result is:
[-g g]
[-o o]
[-p p]
The first element of the vector is the matched string, and the second is the captured string,
which means you should use second, not first.
In case your goal is just to to convert between cases, I really like the camel-snake-kebab library. ->CamelCase is the function-name in question.
inspired by this thread, you could also do
(use 'clojure.string)
(defn camelize [input-string]
(let [words (split input-string #"[\s_-]+")]
(join "" (cons (lower-case (first words)) (map capitalize (rest words))))))
I'm working on writing a function in Clojure that will process a file character by character. I know that Java's BufferedReader class has the read() method that reads one character, but I'm new to Clojure and not sure how to use it. Currently, I'm just trying to do the file line-by-line, and then print each character.
(defn process_file [file_path]
(with-open [reader (BufferedReader. (FileReader. file_path))]
(let [seq (line-seq reader)]
(doseq [item seq]
(let [words (split item #"\s")]
(println words))))))
Given a file with this text input:
International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.
My output looks like this:
[International donations are gratefully accepted, but we cannot make]
[any statements concerning tax treatment of donations received from]
[outside the United States. U.S. laws alone swamp our small staff.]
Though I would expect it to look like:
["international" "donations" "are" .... ]
So my question is, how can I convert the function above to read character by character? Or even, how to make it work as I expect it to? Also, any tips for making my Clojure code better would be greatly appreciated.
(with-open [reader (clojure.java.io/reader "path/to/file")] ...
I prefer this way to get a reader in clojure. And, by character by character, do you mean in file access level, like read, which allow you control how many bytes to read?
Edit
As #deterb pointed out, let's check the source code of line-seq
(defn line-seq
"Returns the lines of text from rdr as a lazy sequence of strings.
rdr must implement java.io.BufferedReader."
{:added "1.0"
:static true}
[^java.io.BufferedReader rdr]
(when-let [line (.readLine rdr)]
(cons line (lazy-seq (line-seq rdr)))))
I faked a char-seq
(defn char-seq
[^java.io.Reader rdr]
(let [chr (.read rdr)]
(if (>= chr 0)
(cons chr (lazy-seq (char-seq rdr))))))
I know this char-seq reads all chars into memory[1], but I think it shows that you can directly call .read on BufferedReader. So, you can write your code like this:
(let [chr (.read rdr)]
(if (>= chr 0)
;do your work here
))
How do you think?
[1] According to #dimagog's comment, char-seq not read all char into memory thanks to lazy-seq
I'm not familiar with Java or the read() method, so I won't be able to help you out with implementing it.
One first thought is maybe to simplify by using slurp, which will return a string of the text of the entire file with just (slurp filename). However, this would get the whole file, which maybe you don't want.
Once you have a string of the entire file text, you can process any string character by character by simply treating it as though it were a sequence of characters. For example:
=> (doseq [c "abcd"]
(prntln c))
a
b
c
d
=> nil
Or:
=> (remove #{\c} "abcd")
=> (\a \b \d)
You could use map or reduce or any sort of sequence manipulating function. Note that after manipulating it like a sequence, it will now return as a sequence, but you could easily wrap the outer part in (reduce str ...) to return it back to a string at the end--explicitly:
=> (reduce str (remove #{\c} "abcd"))
=> "abd"
As for your problem with your specific code, I think the problem lies with what words is: a vector of strings. When you print each words you are printing a vector. If at the end you replaced the line (println words) with (doseq [w words] (println w))), then it should work great.
Also, based on what you say you want your output to look like (a vector of all the different words in the file), you wouldn't want to only do (println w) at the base of your expression, because this will print values and return nil. You would simply want w. Also, you would want to replace your doseqs with fors--again, to avoid return nil.
Also, on improving your code, it looks generally great to me, but--and this is going with all the first change I suggest above (but not the others, because I don't want to draw it all out explicitly)--you could shorten it with a fun little trick:
(doseq [item seq]
(let [words (split item #"\s")]
(doseq [w words]
(println w))))
;//Could be rewritten as...
(doseq [item s
:let [words (split item #"\s")]
w words]
(println w))
You're pretty close - keep in mind that Strings are a sequence. (concat "abc" "def") results in the sequence (\a \b \c \d \e \f).
mapcat is another really useful function for this - it will lazily concatenate the results of applying the mapping fn to the sequence. This means that mapcating the result of converting all of the line strings to a seq will be the lazy sequence of characters you're after.
I did this as (mapcat seq (line-seq reader)).
For other advice:
For creating the reader, I would recommend using the clojure.java.io/reader function instead of directly creating the classes.
Consider breaking apart the reading the file and the processing (in this case printing) of the strings from each other. While it is important to keep the full file parsing inside the withopen clause, being able to test the actual processing code outside of the file reading code is quite useful.
When navigating multiple (potentially nested) sequences consider using for. for does a nice job handling nested for loop type cases.
(take 100 (for [line (repeat "abc") char (seq line)] (prn char)))
Use prn for debugging output. It gives you real output, as compared to user output (which hides certain details which users don't normally care about).
Why do you have to use "first" in get-word-ids, and what's the right way to do this?
(defn parts-of-speech []
(lazy-seq (. POS values)))
(defn index-words [pos]
(iterator-seq (. dict getIndexWordIterator pos)))
(defn word-ids [word]
(lazy-seq (. word getWordIDs)))
(defn get-word [word-id]
(. dict getWord word-id))
(defn get-index-words []
(lazy-seq (map index-words (parts-of-speech))))
(defn get-word-ids []
(lazy-seq (map word-ids (first (get-index-words)))))
;; this works, but why do you have to use "first" in get-word-ids?
(doseq [word-id (get-word-ids)]
(println word-id))
The short answer: remove all the references to lazy-seq.
as for your original question, it is worth explaining even if it's not a idomatic use of lazy-seq. you have to use first because the get-word-ids function is returning a lazy sequence with one entry. that entry is the lazy sequences you are looking for.
looks like this
( (word1 word2 word3) )
so first returns the sequence you want:
(word1 word2 word3)
It is very likely that the only time you will use lazy-seq will be in this pattern:
(lazy-seq (cons :something (function-call :produces :the :next :element)))
I have never seen lazy-seq used in any other pattern. The purpose of lazy-seq is to generate new sequences of original data. If code exists to produce the data then it's almost always better to use something like iterate map, or for to produce your lazy sequence.
This seems wrong:
(defn get-index-words []
(lazy-seq (map index-words (parts-of-speech))))
(index-words pos) returns a seq. which is why you need a (first) in get-word-ids.
also map is already lazy, so there's no need to wrap a (map ...) in a lazy-seq, and it would be almost pointless to use lazy-seq around map if map wasn't lazy. it would probably be useful if you'd read up a bit more on (lazy) sequences in clojure.