Getting the same instance of RandomAccessFile in Clojure

This piece of code runs on the server; it detects changes to a file and sends them to the client. It works the first time, but after that the file length is not updated even though I changed the file and saved it. I guess Clojure's immutability is the reason here. How can I make this work?
(def clients (atom {}))
(def rfiles (atom {}))
(def file-pointers (atom {}))

(defn get-rfile [filename]
  (let [rdr ((keyword filename) @rfiles)]
    (if rdr
      rdr
      (let [rfile (RandomAccessFile. filename "rw")]
        (swap! rfiles assoc (keyword filename) rfile)
        rfile))))
(defn send-changes [changes]
  (go (while true
        (let [[op filename] (<! changes)
              rfile (get-rfile filename)
              ignore (println (.. rfile getChannel size))
              prev ((keyword filename) @file-pointers)
              start (if prev prev 0)
              end (.length rfile) ;; file length is not getting updated even if I changed the file externally
              array (byte-array (- end start))]
          (do
            (println (str "str" start " end" end))
            (.seek rfile start)
            (.readFully rfile array)
            (swap! file-pointers assoc (keyword filename) end)
            (doseq [client @clients]
              (send! (key client) (json/write-str
                                    {:changes (apply str (map char array))
                                     :fileName filename}))
              false))))))

There is no immutability issue here: in the rfiles atom you store standard Java objects, which are mutable.
This code works well only if data is appended to the end of the file, so that the size is always increasing.
If there is an update/addition (of length +N) anywhere other than at the end of the file, the pointers start and end won't point at the modified data, but just at the last N characters, and you will send the wrong content to the clients.
If there is a delete, or any change that decreases the length,
array (byte-array (- end start))
will throw a NegativeArraySizeException that you don't see (eaten by the go block?). You can add a (try ... (catch ...)) or test that (- end start) is always positive or zero, so you can manage it and do the appropriate thing: resetting the pointers, etc.
Are you sure the files you scan for changes are only ever changed by appending data? If not, you need to handle this case by resetting or updating the pointers accordingly, as in the sketch below.
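For example, a minimal guard (a sketch only; the read-new-bytes name is ours, and resyncing the pointer to the new end is just one possible policy):

;; returns the new bytes to send, or nil if the file shrank (pointer is resynced)
(defn read-new-bytes [rfile filename start end]
  (if (< end start)
    (do (swap! file-pointers assoc (keyword filename) end) ; file shrank: resync
        nil)
    (let [array (byte-array (- end start))]
      (.seek rfile start)
      (.readFully rfile array)
      array)))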
I hope this helps.
EDIT: test environment.
I defined the following. There is no change to the code you provided.
;; define the changes channel
;; (this assumes the ns requires clojure.core.async (chan, go, >!, <!) and
;;  clojure.data.json as json, and imports java.io.RandomAccessFile)
(def notif-chan (chan))

;; define some clients
(def clients (atom {:foo "foo" :bar "bar"}))

;; helper function to post a change notification on the channel
(defn notify-a-change [op filename]
  (go (>! notif-chan [op filename])))

;; mock of the send! function used in send-changes
(defn send! [client message]
  (println client message))

;; main loop
(defn -main [& args]
  (send-changes notif-chan))
in a repl, I ran:
repl> (-main)
in a shell (I tested with an editor too):
sh> echo 'hello there' >> ./foo.txt
in the repl:
repl> (notify-a-change "x" "./foo.txt")
str0 end12
:bar {"changes":"hello there\n","fileName":".\/foo.txt"}
:foo {"changes":"hello there\n","fileName":".\/foo.txt"}
repl> (notify-a-change "x" "./foo.txt")
str12 end12
:bar {"changes":"","fileName":".\/foo.txt"}
:foo {"changes":"","fileName":".\/foo.txt"}
in a shell:
sh> echo 'bye bye' >> ./foo.txt
in a repl:
repl> (notify-a-change "x" "./foo.txt")
str12 end20
:bar {"changes":"bye bye\n","fileName":".\/foo.txt"}
:foo {"changes":"bye bye\n","fileName":".\/foo.txt"}


Using proper functional style in a file processing task

I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print a summary of how many lines of each type were in the input file. My actual task involves generating different SQL based on the input line type, but to keep the example code focused, I have kept the processing in the function proc-line simple. The function func determines what type an input line is; again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
  (:gen-class)
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn func [x]
  (rand-nth [true false]))

(defn proc-line [line writer]
  (if (func line)
    (do (.write writer (str line "\n")) :new)
    (do (.write writer (str (reverse line) "\n")) :old)))

(defn generate-report [from to]
  (with-open
    [reader (io/reader from)
     writer (io/writer to)]
    (->> (csv/read-csv reader)
         (rest)
         (map #(proc-line % writer))
         (frequencies)
         (doall))))
I'd try to separate data processing from side-effects like reading/writing files. Hopefully this would allow the IO operations to stay at opposite boundaries of the pipeline, and the "middle" processing logic is agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))

(defn proc-line [line]
  (if (rand-bool)
    [line :new]
    [(reverse line) :old]))
proc-line no longer takes a writer, it only cares about the line and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
  (->> (csv/read-csv reader)
       (rest)
       (map proc-line)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (process-lines reader)]
      (csv/write-csv writer (map first lines))
      (frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [freqs (atom {})]
      (->> (csv/read-csv reader)
           ;; processing starts
           (rest)
           (map (fn [line]
                  (let [[row tag] (proc-line line)]
                    (swap! freqs update tag (fnil inc 0))
                    row)))
           ;; processing ends
           (csv/write-csv writer))
      @freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
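A quick check at the REPL might look like this (file names and counts are purely illustrative, since rand-bool decides the tags):

(generate-report "input.csv" "output.csv")
;; => {:old 23, :new 31}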
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
  (rand-nth [:new :old]))

(defn row->product [row]
  (let [p (zipmap [:id :name :price] row)]
    (assoc p :type (product-type p))))

(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)

(defn generate-report [from to]
  (with-open [rdr (io/reader from)
              wrtr (io/writer to)]
    (->> (rest (csv/read-csv rdr))
         (map row->product)
         (map #(do (.write wrtr (to-csv %)) %))
         (map :type)
         (frequencies)
         (doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way the data dictates what the pipeline is doing, not proc-line.
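Purely for illustration, hypothetical bodies for the two methods (they assume the :id/:name/:price keys from row->product above and a clojure.string require):

;; hypothetical implementations; a real one would match your actual output format
(defmethod to-csv :new [product]
  (str (clojure.string/join "," ((juxt :id :name :price) product)) "\n"))

(defmethod to-csv :old [product]
  (str (clojure.string/join "," (reverse ((juxt :id :name :price) product))) "\n"))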
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses a RNG, so we rewrite it to be deterministic by actually looking at the input. While we're at it, we rename it to new?, which is more representative of the problem:
(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
  (is (= {:old 2, :new 3}
         (generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, replacing map with reduce and getting rid of frequencies and doall. Also, we rename "line" to "row", since that is what a line is called in the CSV format:
(defn generate-report [from to]                     ; 1
  (let [[old new _]                                 ; 2
        (with-open [reader (io/reader from)         ; 3
                    writer (io/writer to)]          ; 4
          (->> (csv/read-csv reader)                ; 5
               (rest)                               ; 6
               (reduce process-row [0 0 writer])))] ; 7
    {:old old :new new}))                           ; 8
The new process-row (originally process-line) becomes:
(defn process-row [[old new writer] row]
  (if (new? row)
    (do (.write writer (str row "\n")) [old (inc new) writer])
    (do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, like any function passed to reduce, has two arguments: the first argument [old new writer] is the accumulator, a (destructured) vector of the two counters and the I/O writer; the second argument row is one element of the collection being reduced. It returns the new accumulator vector, which at the end of the collection is destructured on line 2 of generate-report and used on line 8 to create a hash map equivalent to the one previously returned by frequencies.
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
  (let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
    (.write writer out-row)
    [old new writer]))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
  (if (new? row)
    [(str row "\n") old (inc new)]
    [(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
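Since process-row-pure is pure, a test needs no files at all; a sketch (the test name and sample row are ours):

;; assumes (:require [clojure.test :refer [deftest is]])
(deftest process-row-pure-test
  (let [[out-row old new] (process-row-pure 0 0 ["2001" "E"])]
    (is (= [0 1] [old new]))      ; Year >= 2000 counts as :new
    (is (.endsWith out-row "\n"))))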
IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          frequencies (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      frequencies)))

Clojure - Is it possible to increment a variable within a doseq statement?

I am trying to iterate over a list of files in a given directory, and add an incrementing variable i = {1,2,3.....} to their names.
Here is the code I have for iterating through the files and changing each file's name:
(defn addCounterToExtIn [d]
  (def i 0)
  (doseq [f (.listFiles (file d))]                       ; make a sequence of all files in d
    (if (and (not (.isDirectory f))                      ; if file is not a directory and
             (= '(\. \i \n) (take-last 3 (.getName f)))) ; if it ends with .in
      (fs/rename f (str d '/ i (.getName f))))))         ; add i to the start of its name
I don't know how to increment i as doseq iterates through each file. Alternatively, is there a better loop to use to achieve the desired result?
use file-seq and map-indexed:
(require '[clojure.java.io :as io])

(dorun
  (->> (file-seq (io/file "/home/eduard/Downloads"))
       (filter #(re-find #".+\.pdf$" (.getName %)))
       (map-indexed (fn [i v] [i v]))))
Change the function in map-indexed to do the rename and you're done; a sketch follows the sample output below.
The sample output for pdf files:
([0 #<File /home/eduard/Downloads/some.pdf>] ...)
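A sketch of that final step, adapted to the question's .in files (the target naming scheme is an assumption; .renameTo is plain java.io):

;; renames foo.in to 1foo.in, bar.in to 2bar.in, ... in place
(dorun
  (->> (file-seq (io/file "/path/to/d"))
       (filter #(and (not (.isDirectory %))
                     (.endsWith (.getName %) ".in")))
       (map-indexed (fn [i f]
                      (.renameTo f (io/file (.getParent f)
                                            (str (inc i) (.getName f))))))))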
This is the first approach off the top of my head. It's not ideal, but certainly more idiomatic than what the question proposes.
(defn rename-one-file! [file counter]
  (if (and (not (.isDirectory file))
           (= ".in" (apply str (take-last 3 (.getName file)))))
    (fs/rename file (io/file (.getParent file)
                             (str counter (.getName file))))))

(defn iterate-files-with-counter [f dir]
  (loop [counter 0
         remaining-files (.listFiles (io/file dir))]
    (when (seq remaining-files)
      (f (first remaining-files) counter)
      (recur (inc counter) (rest remaining-files)))))

(def add-counter-to-ext-in-dir
  (partial iterate-files-with-counter rename-one-file!))
Note that the work of actually performing the rename was split off from the work of iterating over the files. Having a large number of small functions is better than a small number of large functions in general, and making those functions reusable/independent unless you choose to use them together is even better than that.
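Usage would then look like this (the directory path is illustrative):

(add-counter-to-ext-in-dir "/path/to/d")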

How can I deserialize a record structure from a file, already saved to the file with print-dup?

I have the following code:
(use 'clojure.java.io)

(defrecord Member [id name salary role])
(defrecord Role [id name])

(def member-records (ref ()))

(defn add-member [member]
  (dosync (alter member-records conj member)))

;; Test-data -->
(def dev-r (->Role 1 "Developer"))
(def test-member1 (->Member 1 "Kirill" 70000.00 dev-r))
;; Test-data <--

(defn save-data-2-file []
  (with-open [wrtr (writer "C:/Platform/Work/test.cdf")]
    (print-dup @member-records wrtr)))

(defn process-line [line]
  (println line))

;; Test line content:
;; #BTC.pcost.Member{:id 1, :name "Kirill", :salary 70000.0, :role #BTC.pcost.Role{:id 1, :name "Developer"}})

(defn load-data-from-file []
  (with-open [rdr (reader "C:/Platform/Work/test.cdf")]
    (doseq [line (line-seq rdr)]
      (process-line line))))
I want to recreate the records after reading the file, but I can't figure out how to do it. Yes, I know that I could parse the text and fill my structures from the elements of each parsed line, but that would be difficult, since I have a lot of structs like Member and Role. Can anyone suggest a way to do this?
You can use read-string and slurp to pull the records out of the file. read-string is limited to reading the first form of a string but, from your sample, you are only storing a single form: a list of records.
(defn load-data-from-file [file]
  (read-string (slurp file)))
Lazy Reading
If you need more than the first form, or cannot read the entire stream into memory, you can use read directly, to make a lazy reader.
(defn lazy-read
  ([rdr] (let [eof (Object.)] (lazy-read rdr (read rdr false eof) eof)))
  ([rdr data eof]
   (if (not= eof data)
     (cons data (lazy-seq (lazy-read rdr (read rdr false eof) eof))))))

(defn load-all-data [file]
  (with-open [rdr (java.io.PushbackReader. (reader file))]
    (doall (lazy-read rdr))))

(load-all-data "C:/Platform/Work/test.cdf")
Security
Also, it is good to mention security when loading code with read-string or read. You should only use them with trusted sources, because, using #= or a Java constructor, the source can execute arbitrary code inside your application. For a longer explanation, take a look at the documentation for read.
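For instance (illustrative only), with *read-eval* left at its default of true, reading this string executes code at read time:

;; do NOT do this with untrusted input
(read-string "#=(println \"executed at read time\")")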
Setting *read-eval* to false would prevent the issue, but it would also prevent the reconstruction of the records in your sample. To avoid the issue altogether, you can use the clojure.edn/read and clojure.edn/read-string functions with a whitelist of readers.
(defn edn-read [eof rdr]
  (clojure.edn/read {:eof eof :readers {'BTC.pcost.Role map->Role
                                        'BTC.pcost.Member map->Member}}
                    rdr))

(defn lazy-edn-read
  ([rdr] (let [eof (Object.)] (lazy-edn-read rdr (edn-read eof rdr) eof)))
  ([rdr data eof]
   (if (not= eof data)
     (cons data (lazy-seq (lazy-edn-read rdr (edn-read eof rdr) eof))))))

(defn load-all-data [file]
  (with-open [rdr (java.io.PushbackReader. (reader file))]
    (doall (take-while (complement nil?) (lazy-edn-read rdr)))))

(load-all-data "C:/Platform/Work/test.cdf")
You can use read.
This function will read one object from a file:
(defn load-data-from-file [filename]
  (with-open [rdr (java.io.PushbackReader. (reader filename))]
    (read rdr)))
Or this will read all objects from the file:
(defn load-all-data-from-file [filename]
  (let [eof (Object.)]
    (with-open [rdr (java.io.PushbackReader. (reader filename))]
      (doall
        (take-while #(not= % eof)
                    (repeatedly #(read rdr nil eof)))))))
Here's the API documentation for read.
This is a small variation that will read all objects from a string:
(defn load-all-data-from-string [string]
  (let [eof (Object.)]
    (with-open [rdr (-> string java.io.StringReader. java.io.PushbackReader.)]
      (doall
        (take-while #(not= % eof)
                    (repeatedly #(read rdr nil eof)))))))
This is, as far as I know, not possible to do using read-string. Instead we use read with a java.io.StringReader.
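For example (illustrative input):

(load-all-data-from-string "{:a 1} {:b 2}")
;; => ({:a 1} {:b 2})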

How to expand a sequence (var-args) into distinct items

I want to send var-args of a function to a macro, still as var-args.
Here is my code:
(defmacro test-macro
  [& args]
  `(println (str "count=" ~(count args) "; args=" ~@args)))

(defn test-fn-calling-macro
  [& args]
  (test-macro args))
The output of (test-macro "a" "b" "c") is what I want: count=3; args=abc
The output of (test-fn-calling-macro "a" "b" "c") is: count=1; args=("a" "b" "c") because args is sent as a single argument to the macro. How can I expand args in my function in order to call the macro with the 3 arguments?
I guess I'm just missing a simple core function but I'm not able to find it. Thanks
EDIT 2 - My "real" code, shown in EDIT section below is not a valid situation to use this technique.
As pointed out by @Brian, the macro xml-to-cass can be replaced with a function like this:
(defn xml-to-cass
  [zipper table key attr & path]
  (doseq [v (apply zf/xml-> zipper path)] (cass/set-attr! table key attr v)))
EDIT - the following section goes beyond my original question but any insight is welcome
The code above is just the simplest I could come up with to pinpoint my problem. My real code deals with clj-cassandra and zip-filter. It may also look over-engineered, but it's just a toy project and I'm trying to learn the language at the same time.
I want to parse some XML found on mlb.com and insert values found into a cassandra database. Here is my code and the thinking behind it.
Step 1 - Function which works fine but contains code duplication
(ns stats.importer
  (:require
    [clojure.xml :as xml]
    [clojure.zip :as zip]
    [clojure.contrib.zip-filter.xml :as zf]
    [cassandra.client :as cass]))

(def root-url "http://gd2.mlb.com/components/game/mlb/year_2010/month_05/day_01/")

(def games-table (cass/mk-cf-spec "localhost" 9160 "mlb-stats" "games"))

(defn import-game-xml-1
  "Import the content of xml into cassandra"
  [game-dir]
  (let [url (str root-url game-dir "game.xml")
        zipper (zip/xml-zip (xml/parse url))
        game-id (.substring game-dir 4 (- (.length game-dir) 1))]
    (doseq [v (zf/xml-> zipper (zf/attr :type))] (cass/set-attr! games-table game-id :type v))
    (doseq [v (zf/xml-> zipper (zf/attr :local_game_time))] (cass/set-attr! games-table game-id :local_game_time v))
    (doseq [v (zf/xml-> zipper :team [(zf/attr= :type "home")] (zf/attr :name_full))] (cass/set-attr! games-table game-id :home_team v))))
The parameter to import-game-xml-1 can be, for example, "gid_2010_05_01_colmlb_sfnmlb_1/". I remove the "gid_" and the trailing slash to make it the key of the ColumnFamily games in my database.
I found that the 3 doseq forms were a lot of duplication (and there would be more than 3 in the final version), so code templating with a macro seemed appropriate here (correct me if I'm wrong).
Step 2 - Introducing a macro for code templating (still works)
(defmacro xml-to-cass
  [zipper table key attr & path]
  `(doseq [v# (zf/xml-> ~zipper ~@path)] (cass/set-attr! ~table ~key ~attr v#)))
(defn import-game-xml-2
  "Import the content of xml into cassandra"
  [game-dir]
  (let [url (str root-url game-dir "game.xml")
        zipper (zip/xml-zip (xml/parse url))
        game-id (.substring game-dir 4 (- (.length game-dir) 1))]
    (xml-to-cass zipper games-table game-id :type (zf/attr :type))
    (xml-to-cass zipper games-table game-id :local_game_time (zf/attr :local_game_time))
    (xml-to-cass zipper games-table game-id :home_team :team [(zf/attr= :type "home")] (zf/attr :name_full))))
I believe that's an improvement, but I still see some duplication in always reusing the same 3 parameters in my calls to xml-to-cass. That's where I introduced an intermediate function to take care of those.
Step 3 - Adding a function to call the macro (the problem is here)
(defn import-game-xml-3
  "Import the content of xml into cassandra"
  [game-dir]
  (let [url (str root-url game-dir "game.xml")
        zipper (zip/xml-zip (xml/parse url))
        game-id (.substring game-dir 4 (- (.length game-dir) 1))
        save-game-attr (fn [key path] (xml-to-cass zipper games-table game-id key path))]
    (save-game-attr :type (zf/attr :type)) ; works well because path has only one element
    (save-game-attr :local_game_time (zf/attr :local_game_time))
    (save-game-attr :home :team [(zf/attr= :type "home")] (zf/attr :name_full)))) ; FIXME this final line doesn't work
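For reference, with the xml-to-cass function from EDIT 2 above (not the macro), this last case can be made to work by giving the helper var-args and forwarding them with apply; a sketch (the import-game-xml-4 name is ours):

(defn import-game-xml-4
  "Import the content of xml into cassandra"
  [game-dir]
  (let [url (str root-url game-dir "game.xml")
        zipper (zip/xml-zip (xml/parse url))
        game-id (.substring game-dir 4 (- (.length game-dir) 1))
        save-game-attr (fn [key & path]
                         (apply xml-to-cass zipper games-table game-id key path))]
    (save-game-attr :type (zf/attr :type))
    (save-game-attr :local_game_time (zf/attr :local_game_time))
    (save-game-attr :home :team [(zf/attr= :type "home")] (zf/attr :name_full))))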
Here's some simple code which may be illuminating.
Macros are about code generation. If you want that to happen at runtime, for some reason, then you have to build and evaluate the code at runtime. This can be a powerful technique.
(defmacro test-macro
  [& args]
  `(println (str "count=" ~(count args) "; args=" ~@args)))

(defn test-fn-calling-macro
  [& args]
  (test-macro args))

(defn test-fn-expanding-macro-at-runtime
  [& args]
  (eval (cons `test-macro args)))

(defmacro test-macro-expanding-macro-at-compile-time
  [& args]
  (cons `test-macro args))

;; using the splicing notation
(defmacro test-macro-expanding-macro-at-compile-time-2
  [& args]
  `(test-macro ~@args))

(defn test-fn-expanding-macro-at-runtime-2
  [& args]
  (eval `(test-macro ~@args)))
(test-macro "a" "b" "c") ;; count=3; args=abc nil
(test-fn-calling-macro "a" "b" "c") ;; count=1; args=("a" "b" "c") nil
(test-fn-expanding-macro-at-runtime "a" "b" "c") ; count=3; args=abc nil
(test-macro-expanding-macro-at-compile-time "a" "b" "c") ; count=3; args=abc nil
(test-macro-expanding-macro-at-compile-time-2 "a" "b" "c") ; count=3; args=abc nil
(test-fn-expanding-macro-at-runtime "a" "b" "c") ; count=3; args=abc nil
If contemplation of the above doesn't prove enlightening, might I suggest a couple of my own blog articles?
In this one I go through macros from scratch, and how clojure's work in particular:
http://www.learningclojure.com/2010/09/clojure-macro-tutorial-part-i-getting.html
And in this one I show why run-time code generation might be useful:
http://www.learningclojure.com/2010/09/clojure-faster-than-machine-code.html
The typical way to use a collection as individual arguments to a function is to use (apply function my-list-o-args)
(defn test-not-a-macro [& args]
  (print args))

(defn calls-the-not-a-macro [& args]
  (apply test-not-a-macro args))
Though you won't be able to use apply here, because test-macro is a macro. To solve this problem you will need to wrap test-macro in a function call so you can apply on it:
(defmacro test-macro [& args]
  `(println ~@args))

(defn calls-test-macro [& args]
  (eval (concat '(test-macro) args))) ; you almost never need eval

(defn calls-calls-test-macro [& args]
  (apply calls-test-macro args))
This is actually a really good example of one of the ways macros are hard to compose. (Some would say they can't be composed cleanly, though I think that's an exaggeration.)
Macros are not magic. They are a mechanism to convert code at compile-time to equivalent code; they are not used at run-time. The pain you are feeling is because you are trying to do something you should not be trying to do.
I don't know the library in question, but if cass/set-attr! is a function, I see no reason why the macro you defined has to be a macro; it could be a function instead. You can do what you want to do if you can rewrite your macro as a function instead.
Your requirements aren't clear. I don't see why a macro is necessary here for test-macro, unless you're trying to print the unevaluated forms supplied to your macro.
These functions provide your expected results, but that's because your sample data was self-evaluating.
(defn test-args
  [& args]
  (println (format "count=%d; args=%s"
                   (count args)
                   (apply str args))))
or
(defn test-args
  [& args]
  (print (format "count=%d; args=" (count args)))
  (doseq [a args]
    (pr a))
  (newline))
You can imagine other variations to get to the same result.
Try calling that function with something that doesn't evaluate to itself, and note the result:
(test-args (+ 1 2) (+ 3 4))
Were you looking to see the arguments printed as "37" or "(+ 1 2)(+ 3 4)"?
If you were instead trying to learn about macros and their expansion in general, as opposed to solving this particular problem, please tune your question to probe further.

URL Checker in Clojure?

I have a URL checker that I use in Perl. I was wondering how something like this would be done in Clojure. I have a file with thousands of URLs and I'd like the output file to contain the URL (minus http://, https://) and a simple :1 for valid and :0 for invalid. Ideally, I could check each site concurrently, considering that this is one of Clojure's strengths.
Input
http://www.google.com
http://www.cnn.com
http://www.msnbc.com
http://www.abadurlisnotgood.com
Output
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
I assume by "valid URL" you mean HTTP response 200. This might work. It requires clojure-contrib. Change map to pmap to attempt to make it parallel, like Arthur Ulfeldt mentioned.
(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)http:/+" "" url)
       ":"
       (try
         (let [c (cast HttpURLConnection
                       (.openConnection (URL. url)))]
           (if (= 200 (.getResponseCode c))
             1
             0))
         (catch UnknownHostException _
           0))))

(defn check-urls-from-file [filename]
  (doseq [line (map check-url
                    (read-lines (as-file filename)))]
    (println line)))
Given your example as input:
user> (check-urls-from-file "urls.txt")
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
Write a small function that appends a ":1" or ":0" to a url and then use pmap to apply it in parallel to all the urls.
(defn check-a-url [url] .... )

(pmap #(if (check-a-url %) (str % ":1") (str % ":0")) urls)
Clojure now has an as-url function in clojure.java.io:
(as-url "http://google.com") ;;=> #object[java.net.URL 0x5dedf9bd "http://google.com"]
(str (as-url "http://google.com")) ;;=> "http://google.com"
(as-url "notanurl") ;; java.net.MalformedURLException
Based on that we could write a function like so:
(defn check-url
  "checks if the url is well formed"
  [url]
  (str (clojure.string/replace-first url #"(http://|https://)" "")
       ":"
       (try
         (as-url url) ;; built-in, does not perform an actual request, and does very little validation
         1
         (catch Exception e 0))))

(defn check-urls-from-file
  "from Brian Carper's answer"
  [filename]
  (doseq [line (map check-url (read-lines (as-file filename)))]
    (println line)))
Instead of pmap, I used agents with send-off in conjunction with the above solution. I think this is better when there is blocking I/O. I believe pmap has limited concurrency too. Here's what I have so far. I wonder how this will scale with thousands of URLs.
(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)http:/+" "" url)
       ":"
       (try
         (let [c (cast HttpURLConnection
                       (.openConnection (URL. url)))]
           (if (= 200 (.getResponseCode c))
             1
             0))
         (catch UnknownHostException _
           0))))
(def urls (read-lines "urls.txt"))

(def agents (for [url urls] (agent url)))

(doseq [agent agents]
  (send-off agent check-url))

(apply await agents)

(def x '())
(doseq [url (filter deref agents)]
  (def x (cons @url x)))
(prn x)

(shutdown-agents)
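As an aside, the final collection step can be written without redefining a var inside a loop; a sketch of the same thing:

;; after (apply await agents), each agent holds its "url:0/1" result string
(doseq [result (map deref agents)]
  (println result))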