I have a string that is in Windows-1252 encoding, but needs to be converted to UTF-8.
This is for a program that fixes a UTF-8 file that has fields containing Russian text encoded in quoted-printable Windows-1252. Here's the code that decodes the quoted-printable:
(defn reencode
[line]
(str/replace line #"=([0-9A-Fa-f]{2})=([0-9A-Fa-f]{2})"
(fn [match] (apply str
(map #(char (Integer/parseInt % 16)) (drop 1 match))))))
Here's the final code:
(defn reencode
[line]
(str/replace line #"(=([0-9A-Fa-f]{2}))+"
(fn [[match ignore]]
(String.
(byte-array (map
#(Integer/parseInt (apply str (drop 1 %)) 16)
(partition 3 match)))
"Windows-1252"))))
It fixes the encoding using (String. ... "Encoding") on all consecutive runs of quoted-printable-encoded characters. The original function was trying to decode pairs, so it would skip things like =3D, which is the quoted-printable entity for =.
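For anyone who wants to try the approach at a REPL, here is a minimal restatement of it. The name decode-qp-runs and the charset parameter are made up for this sketch, and unchecked-byte is used only to make the int-to-byte narrowing explicit; it is not the original function.

```clojure
(require '[clojure.string :as str])

(defn decode-qp-runs
  "Decodes each run of consecutive =XX escapes in line as bytes in charset.
  Hypothetical helper name; a sketch of the same idea as reencode above."
  [line charset]
  (str/replace line #"(?:=[0-9A-Fa-f]{2})+"
               (fn [match]
                 ;; No capturing groups, so match is the whole run as a string.
                 (String.
                  (byte-array
                   (map #(unchecked-byte (Integer/parseInt (subs % 1) 16))
                        (re-seq #"=[0-9A-Fa-f]{2}" match)))
                  ^String charset))))
```

Calling (decode-qp-runs "a=3Db" "windows-1252") should give "a=b", since a single escape is also a run of length one.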
The best way to convert a Windows-1252 string from disk is to use the underlying Java primitives.
(def my-string (String. bytes-from-file "Windows-1252"))
will return you a Java String which has decoded the bytes with the Windows-1252 Charset. From there you can spit bytes back out with UTF-8 encoding with
(.getBytes my-string "UTF-8")
Addressing your question more closely, if you have a file with mixed encodings then you could work out what delimits each encoding and read each set of bytes in separately using the method above.
Edit: The Windows-1252 string has been encoded with quoted-printable. You will first need to decode it, using either your function or, preferably, Apache Commons Codec's QuotedPrintableCodec, passing it the Windows-1252 Charset. Decoding returns a Java String which you can operate on directly with no further transformation.
N.B. for some measure of type safety, you should probably use Java Charset objects rather than Strings when specifying the charset to use (the String class can take either).
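A minimal sketch of the byte-level conversion with Charset objects instead of charset-name strings. The names windows-1252 and cp1252->utf8 are made up for illustration, and it assumes the JVM ships the windows-1252 charset (standard JDKs do).

```clojure
(import 'java.nio.charset.Charset
        'java.nio.charset.StandardCharsets)

;; Looked up once; Charset/forName throws if the charset is not available.
(def windows-1252 (Charset/forName "windows-1252"))

(defn cp1252->utf8
  "Decodes bs as Windows-1252 and returns the same text as UTF-8 bytes.
  Hypothetical helper name."
  [^bytes bs]
  (.getBytes (String. bs ^Charset windows-1252) StandardCharsets/UTF_8))
```

A mistyped charset name then fails at the single forName call rather than at every String construction.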
I am trying to replace bad characters in an input string.
Characters should be valid UTF-8 characters (tabs, line breaks etc. are OK).
However, I was unable to figure out how to replace all of the bad characters found.
My solution works only for the first bad character.
Usually there are no bad characters. In about 1 in 50 cases there is one bad character. I just want to make my solution foolproof.
(defn filter-to-utf-8-string
"Return only good utf-8 characters from the input."
[input]
(let [bad-characters (set (re-seq #"[^\p{L}\p{N}\s\p{P}\p{Sc}\+]+" input))
filtered-string (clojure.string/replace input (apply str (first bad-characters)) "")]
filtered-string))
How can I make replace work for all values in sequence not just for the first one?
A friend of mine helped me find a workaround for this problem:
I created a filter for replace using re-pattern.
Within the let, the code is currently:
filter (if (not (empty? bad-characters))
(re-pattern (str "[" (clojure.string/join bad-characters) "]"))
#"")
filtered-string (clojure.string/replace input filter "")
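A possibly simpler variant, offered as a sketch rather than a drop-in replacement: the negated character class from the question can be handed to replace directly, which removes every match without building a second pattern from the characters found. That also sidesteps a risk in the workaround, where a bad character such as ^ would change the meaning of the generated character class. The name strip-disallowed is made up for this example.

```clojure
(require '[clojure.string :as str])

(defn strip-disallowed
  "Removes every run of characters outside the allowed classes.
  Hypothetical helper name; uses the same regex as the question."
  [input]
  (str/replace input #"[^\p{L}\p{N}\s\p{P}\p{Sc}\+]+" ""))
```

With a regex pattern, clojure.string/replace replaces all matches, not just the first.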
Here is a simple version:
(ns xxxxx
(:require
[clojure.string :as str]
))
(def all-chars (str/join (map char (range 32 80))))
(println all-chars)
(def char-L (str/join (re-seq #"[\p{L}]" all-chars)))
(println char-L)
(def char-N (str/join (re-seq #"[\p{N}]" all-chars)))
(println char-N)
(def char-LN (str/join (re-seq #"[\p{L}\p{N}]" all-chars)))
(println char-LN)
all-chars => " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNO"
char-L => "ABCDEFGHIJKLMNO"
char-N => "0123456789"
char-LN => "0123456789ABCDEFGHIJKLMNO"
So we start off with all ASCII chars in the range 32-79 (the upper bound of range is exclusive). We first print only the letters, then only the numbers, then either letters or numbers. It seems this should work for your problem, although instead of rejecting non-members of the desired set, we keep the members of the desired set.
For the code below I'm reading input from stdin. Basically it's just some numbers delimited by spaces or line breaks. Specifically I'm trying to complete this challenge.
My goal is to create a list of numbers (without the first number) from the input. When I run the code below at hackerrank I get a list of a single number: (5)
Not sure what's going on, or how to fix. Would anyone know?
(map read-string (rest (line-seq (java.io.BufferedReader. *in*))))
line-seq gives one string for each line. read-string reads from a string, returning the first complete object found. Thus, you only get the first item on the line.
You could either use clojure.string/split to break up the string and use read-string on each part, or loop, accumulating the results of calling read on a PushbackReader made from the BufferedReader until you get no more input.
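A minimal sketch of the split-based option. The name parse-numbers is made up for this example, and it swaps read-string for Long/parseLong on each token, which avoids evaluating arbitrary reader input but only accepts integers.

```clojure
(require '[clojure.string :as str])

(defn parse-numbers
  "Takes a seq of input lines, skips the first one (the count),
  and returns every whitespace-separated integer on the remaining lines.
  Hypothetical helper name."
  [lines]
  (->> (rest lines)
       (mapcat #(str/split (str/trim %) #"\s+"))
       (remove str/blank?)
       (map #(Long/parseLong %))))
```

It could be called as (parse-numbers (line-seq (java.io.BufferedReader. *in*))).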
Since your input is
Input Format
The first line contains a single integer N.
The next line contains N integers: a0, a1,...aN-1 separated by space...
Sample Input
6
5 4 4 2 2 8
And you don't need to worry about validation / security, you can just
(let [n (read-string (read-line))
v (read-string (str "[" (read-line) "]"))]
(assert (== n (count v))) ;if you like
(comment solution here...))
I have a sequence of sequences and each sequence is similar to the following:
("9990999" "43" "ROADWAY" "MORRISON, VAN X DMD" "43 ROADWAY" "SOMETHINGTON" "XA" "00000" "501" "18050" "2500" "1180" "14370" "0")
clojure-csv won't help me here, because it -- as it should -- quotes fields with embedded commas. I want pipe-delimited output without quotes around each field, some of which contain embedded commas.
I have looked at a number of ways to remove the double quote characters including the following, but the quotes stay put.
(filter (fn [x] (not (= (str (first (str x))) (str (first (str \")))))) d1)
where d1 is the sequence above.
In addition to an answer, I am more interested in a pointer to documentation. I have been playing with this but to no avail.
As far as I understand you have a sequence of strings. Clojure provides a very specific toString implementation for sequences, you can see it here.
If you do (str d1) or simply type d1 in the REPL and press enter, you'll see more or less what you typed: a sequence of strings (a String is printed as a sequence of characters in double quotes).
Now if you want to concatenate all the strings you can do this:
(apply str d1)
If you want to print it separated with commas you could do this:
(apply str (interpose "," d1))
To output in CSV format, I would recommend using clojure-csv.
Finally if you simply want to print the list but without the double quotes around strings you could do this:
(print d1)
Hope this helps.
EDIT1 (update due to changes in the question):
This can easily be achieved with:
(apply str (interpose "|" d1))
Please don't pay attention to the double quotes around the entire result. If you print it or spit it into a file you won't see them; this is just how Clojure prints strings readably.
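An equivalent way to write this is clojure.string/join, which takes the separator first. This is just a sketch; pipe-line is a made-up name.

```clojure
(require '[clojure.string :as str])

(defn pipe-line
  "Joins the fields of one row with | and adds no quoting at all.
  Hypothetical helper name."
  [fields]
  (str/join "|" fields))
```

Embedded commas pass through untouched, because nothing here treats the comma as special.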
Alternatively if you have multiple sequences like that that you want to output at once you can still use clojure-csv but with different separator:
(ns csv-test.core
(:require [clojure-csv.core :as csv]))
(def d1 (list "9990999" "43" "ROADWAY" "MORRISON, VAN X DMD" "43 ROADWAY" "SOMETHINGTON" "XA" "00000" "501" "18050" "2500" "1180" "14370" "0"))
(print (csv/write-csv [d1] :delimiter "|"))
;;prints:
;;9990999|43|ROADWAY|MORRISON, VAN X DMD|43 ROADWAY|SOMETHINGTON|XA|00000|501|18050|2500|1180|14370|0
I have lines of data in a sequence of sequences and each sequence is different but follows the general pattern as follows:
("44999" "186300" "194300" "0" "380600" "325" "57" "0")
When I write the sequence of sequences out to a file using
(defn write-csv-file
"Writes a csv file using a key and an s-o-s"
[out-sos out-file]
(if (= dbg 1)
(println (first out-sos), "\n", out-file))
(spit out-file "" :append false)
(with-open [out-data (io/writer out-file)]
(csv/write-csv out-data out-sos)))
.
.
.
(write-csv-file out-re "re_values.csv")
the data comes out like this
44999,186300,194300,0,380600,325,57,0
That is exactly the way I want it (unquoted), except that I'd like an unquoted ',' at the end of each sequence.
I've tried (concat one-row (list \,)) and tried adding a ',' at the end of each sequence in a (list function, but I cannot get an unquoted ',' at the end of each sequence. How can I do this?
As a workaround, I can run files like this through sed to add the trailing comma, but I'd like to do it all in Clojure.
I think that you do not want to "add a comma" but add an empty field (which is then separated by a comma). So, you should simply add an empty string to your line sequences.
Maybe concat an empty string to each sequence inside of out-sos. Concat is lazy, so shouldn't be expensive.
(with-open [out-data (io/writer out-file)]
(csv/write-csv out-data (map #(concat % [""]) out-sos))))
Not sure what the csv library would do with an empty field at the end though. Hopefully you would just get your empty element.
Did you try setting :end-of-line to ",\n"?
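To see why an empty last field produces the trailing comma, here is a sketch that leaves the csv library out entirely and just joins the fields by hand. row->line is a made-up name, and it does no quoting, so it only suits data like the numeric rows in the question.

```clojure
(require '[clojure.string :as str])

(defn row->line
  "Appends an empty field and joins with commas, so the line ends in a comma.
  Hypothetical helper name; performs no CSV quoting."
  [row]
  (str/join "," (concat row [""])))
```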
This is what the documentation says:
:end-of-line
A string containing the end-of-line character for writing CSV files.
Default value: \n
This is what I tried:
(csv/write-csv data :end-of-line ",\n")
I'm working on writing a function in Clojure that will process a file character by character. I know that Java's BufferedReader class has the read() method that reads one character, but I'm new to Clojure and not sure how to use it. Currently, I'm just trying to do the file line-by-line, and then print each character.
(defn process_file [file_path]
(with-open [reader (BufferedReader. (FileReader. file_path))]
(let [seq (line-seq reader)]
(doseq [item seq]
(let [words (split item #"\s")]
(println words))))))
Given a file with this text input:
International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.
My output looks like this:
[International donations are gratefully accepted, but we cannot make]
[any statements concerning tax treatment of donations received from]
[outside the United States. U.S. laws alone swamp our small staff.]
Though I would expect it to look like:
["international" "donations" "are" .... ]
So my question is, how can I convert the function above to read character by character? Or even, how to make it work as I expect it to? Also, any tips for making my Clojure code better would be greatly appreciated.
(with-open [reader (clojure.java.io/reader "path/to/file")] ...
I prefer this way of getting a reader in Clojure. And by "character by character", do you mean at the file-access level, like read, which lets you control how many bytes to read?
Edit
As @deterb pointed out, let's check the source code of line-seq:
(defn line-seq
"Returns the lines of text from rdr as a lazy sequence of strings.
rdr must implement java.io.BufferedReader."
{:added "1.0"
:static true}
[^java.io.BufferedReader rdr]
(when-let [line (.readLine rdr)]
(cons line (lazy-seq (line-seq rdr)))))
I faked a char-seq
(defn char-seq
[^java.io.Reader rdr]
(let [chr (.read rdr)]
(if (>= chr 0)
(cons chr (lazy-seq (char-seq rdr))))))
I know this char-seq reads all chars into memory[1], but I think it shows that you can directly call .read on BufferedReader. So, you can write your code like this:
(let [chr (.read rdr)]
(if (>= chr 0)
;do your work here
))
What do you think?
[1] According to @dimagog's comment, char-seq does not read all chars into memory, thanks to lazy-seq.
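One thing to be aware of: .read returns an int, so the char-seq above yields numbers rather than characters. A variant that converts each value with char might look like the sketch below. lazy-char-seq is a made-up name, and the whole body is wrapped in lazy-seq so that nothing is read until the sequence is consumed.

```clojure
(defn lazy-char-seq
  "Lazy sequence of the characters read from rdr.
  Hypothetical helper name; stops when .read returns -1 at end of stream."
  [^java.io.Reader rdr]
  (lazy-seq
   (let [c (.read rdr)]
     (when (>= c 0)
       (cons (char c) (lazy-char-seq rdr))))))
```

As with line-seq, the sequence has to be consumed while the reader is still open.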
I'm not familiar with Java or the read() method, so I won't be able to help you out with implementing it.
One first thought is maybe to simplify by using slurp, which will return a string of the text of the entire file with just (slurp filename). However, this would get the whole file, which maybe you don't want.
Once you have a string of the entire file text, you can process any string character by character by simply treating it as though it were a sequence of characters. For example:
=> (doseq [c "abcd"]
(println c))
a
b
c
d
=> nil
Or:
=> (remove #{\c} "abcd")
=> (\a \b \d)
You could use map or reduce or any sort of sequence manipulating function. Note that after manipulating it like a sequence, it will now return as a sequence, but you could easily wrap the outer part in (reduce str ...) to return it back to a string at the end--explicitly:
=> (reduce str (remove #{\c} "abcd"))
=> "abd"
As for the problem with your specific code, I think it lies in what words is: a vector of strings. When you print words, you are printing a whole vector. If you replaced (println words) with (doseq [w words] (println w)), it should work.
Also, based on what you say you want your output to look like (a vector of all the different words in the file), you wouldn't want to only do (println w) at the base of your expression, because this will print values and return nil. You would simply want w. Also, you would want to replace your doseqs with fors, again to avoid returning nil.
Also, on improving your code: it looks generally fine to me, but you could shorten it with a fun little trick. This builds on the first change I suggested above (but not the others, because I don't want to spell them all out explicitly):
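A rough sketch of that for-based version, written against a plain seq of lines so it can be tried without a file. lines->words is a made-up name, and splitting on \s+ with a blank check is an assumption added here to drop empty strings caused by leading or repeated whitespace.

```clojure
(require '[clojure.string :as str])

(defn lines->words
  "Returns one vector of all the words in lines, instead of printing them.
  Hypothetical helper name."
  [lines]
  (vec (for [line lines
             w (str/split line #"\s+")
             :when (seq w)]   ; skip empty strings from leading whitespace
         w)))
```

Inside with-open you would call it on (line-seq reader); vec forces the result before the reader closes.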
(doseq [item seq]
(let [words (split item #"\s")]
(doseq [w words]
(println w))))
;; Could be rewritten as...
(doseq [item seq
:let [words (split item #"\s")]
w words]
(println w))
You're pretty close. Keep in mind that strings can be treated as sequences of characters: (concat "abc" "def") results in the sequence (\a \b \c \d \e \f).
mapcat is another really useful function for this - it will lazily concatenate the results of applying the mapping fn to the sequence. This means that mapcating the result of converting all of the line strings to a seq will be the lazy sequence of characters you're after.
I did this as (mapcat seq (line-seq reader)).
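A small sketch of the same expression applied to an in-memory reader, so it can be tried without a file. lines->chars is a made-up name. Note that line-seq drops the line terminators, so the newlines do not appear in the result.

```clojure
(defn lines->chars
  "Lazy sequence of all characters in lines, in order.
  Hypothetical helper name wrapping (mapcat seq ...)."
  [lines]
  (mapcat seq lines))

;; Example with a StringReader standing in for a file:
(with-open [rdr (java.io.BufferedReader. (java.io.StringReader. "ab\ncd"))]
  (doall (lines->chars (line-seq rdr))))
```

The doall matters in the example: without it the lazy result could escape with-open and be read after the reader has closed.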
For other advice:
For creating the reader, I would recommend using the clojure.java.io/reader function instead of directly creating the classes.
Consider separating the reading of the file from the processing (in this case printing) of the strings. While it is important to keep the full file parsing inside the with-open clause, being able to test the actual processing code outside of the file reading code is quite useful.
When navigating multiple (potentially nested) sequences consider using for. for does a nice job handling nested for loop type cases.
(take 100 (for [line (repeat "abc") char (seq line)] (prn char)))
Use prn for debugging output. It gives you real output, as compared to user output (which hides certain details which users don't normally care about).
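One way the advice above might be put together, as a sketch only: the reading lives in a small wrapper and the processing is an ordinary function that can be tested on any sequence of characters. process-source is a made-up name, and it assumes the function passed in consumes the sequence eagerly (frequencies does), since the reader is closed when with-open exits.

```clojure
(require '[clojure.java.io :as io])

(defn process-source
  "Opens source with clojure.java.io/reader, applies f to the seq of its
  characters, and returns the result. Hypothetical helper name; f must
  realize whatever it needs before returning."
  [source f]
  (with-open [rdr (io/reader source)]
    (f (mapcat seq (line-seq rdr)))))

;; The processing step can be tested on its own, without any file:
(frequencies "abca")
```

For a file, the call would look like (process-source "path/to/file" frequencies).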