URL Checker in Clojure? - clojure

I have a URL checker that I use in Perl. I was wondering how something like this would be done in Clojure. I have a file with thousands of URLs and I'd like the output file to contain the URL (minus http://, https://) and a simple :1 for valid and :0 for false. Ideally, I could check each site concurrently, considering that this is one of Clojure's strengths.
Input
http://www.google.com
http://www.cnn.com
http://www.msnbc.com
http://www.abadurlisnotgood.com
Output
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0

I assume by "valid URL" you mean HTTP response 200. This might work. It requires clojure-contrib. Change map to pmap to attempt to make it parallel, like Arthur Ulfeldt mentioned.
(use '(clojure.contrib duck-streams
java-utils
str-utils))
(import '(java.net URL
URLConnection
HttpURLConnection
UnknownHostException))
(defn check-url [url]
(str (re-sub #"^(?i)http:/+" "" url)
":"
(try
(let [c (cast HttpURLConnection
(.openConnection (URL. url)))]
(if (= 200 (.getResponseCode c))
1
0))
(catch UnknownHostException _
0))))
(defn check-urls-from-file [filename]
(doseq [line (map check-url
(read-lines (as-file filename)))]
(println line)))
Given your example as input:
user> (check-urls-from-file "urls.txt")
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0

Write a small function that appends a ":1" or ":0" to a url and then use pmap to apply it in parallel to all the urls.
(defn check-a-url [url] .... )
(pmap #(if (check-a-url %) (str url ":1") (str url ":0")))

Clojure now has a as-url function in clojure.java.io:
(as-url "http://google.com") ;;=> #object[java.net.URL 0x5dedf9bd "http://google.com"]
(str (as-url "http://google.com")) ;;=> "http://google.com"
(as-url "notanurl") ;; java.net.MalformedURLException
Based on that we could write a function like so:
(defn check-url
"checks if the url is well formed"
[url]
(str (clojure.string/replace-first url #"(http://|https://)" "")
":"
(try (as-url url) ;; built-in, does not perform an actual request, and does very little validation
1
(catch Exception e 0))))
(defn check-urls-from-file
"from Brian Carper answer"
[filename]
(doseq [line (map check-url (read-lines (as-file filename)))]
(println line)))

Instead of pmap, I used agents with send-off in conjunction with the above solution. I think this is better when there is blocking I/O. I believe pmap has limited concurrency too. Here's what I have so far. I wonder how this will scale with thousands of URLs.
(use '(clojure.contrib duck-streams
java-utils
str-utils))
(import '(java.net URL
URLConnection
HttpURLConnection
UnknownHostException))
(defn check-url [url]
(str (re-sub #"^(?i)http:/+" "" url)
":"
(try
(let [c (cast HttpURLConnection
(.openConnection (URL. url)))]
(if (= 200 (.getResponseCode c))
1
0))
(catch UnknownHostException _
0))))
(def urls (read-lines "urls.txt"))
(def agents (for [url urls] (agent url)))
(doseq [agent agents]
(send-off agent check-url))
(apply await agents)
(def x '())
(doseq [url (filter deref agents)]
(def x (cons #url x)))
(prn x)
(shutdown-agents)

Related

Getting the same instance of RandomAccessFile in clojure

This piece of code runs on the server and it detects the changes to a file and sends it to the client. This is working for the first time and after that the file length is not getting updated even the I changed the file and saved it. I guess the clojure immutability is the reason here. How can I make this work?
(def clients (atom {}))
(def rfiles (atom {}))
(def file-pointers (atom {}))
(defn get-rfile [filename]
(let [rdr ((keyword filename) #rfiles)]
(if rdr
rdr
(let [rfile (RandomAccessFile. filename "rw")]
(swap! rfiles assoc (keyword filename) rfile)
rfile))))
(defn send-changes [changes]
(go (while true
(let [[op filename] (<! changes)
rfile (get-rfile filename)
ignore (println (.. rfile getChannel size))
prev ((keyword filename) #file-pointers)
start (if prev prev 0)
end (.length rfile) // file length is not getting updated even if I changed the file externally
array (byte-array (- end start))]
(do
(println (str "str" start " end" end))
(.seek rfile start)
(.readFully rfile array)
(swap! file-pointers assoc (keyword filename) end)
(doseq [client #clients]
(send! (key client) (json/write-str
{:changes (apply str (map char array))
:fileName filename}))
false))))))
There is no immutability here. In the rfiles atom, you store standard Java objects that are mutable.
This code works well only if data are appended to the end of the file, and the size is always increasing.
If there is an update/addition (of length +N) in the file other than at the end, the pointers start and end won't point to the modified data, but just to the last N characters and you will send dummy stuff to the clients.
If there is a delete or any change that decrease the length,
array (byte-array (- end start))
will throw a NegativeArraySizeException you don't see (eaten by the go bloc?). You can add some (try (...) catch (...)) or test that (- end start) is alway positive or null, to manage it and do the appropriate behaviour: resetting the pointers?,...
Are you sure the files you scan for changes are only changed by appending data? If not, you need to handle this case by resetting or updating the pointers accordingly.
I hope it will help.
EDIT test environment.
I defined the following. There is no change to the code you provided.
;; define the changes channel
(def notif-chan (chan))
;; define some clients
(def clients (atom {:foo "foo" :bar "bar"}))
;; helper function to post a notif of change in the channel
(defn notify-a-change [op filename]
(go (>! notif-chan [op filename])))
;; mock of the send! function used in send-changes
(defn send! [client message]
(println client message))
;; main loop
(defn -main [& args]
(send-changes notif-chan))
in a repl, I ran:
repl> (-main)
in a shell (I tested with an editor too):
sh> echo 'hello there' >> ./foo.txt
in the repl:
repl> (notify-a-change "x" "./foo.txt")
str0 end12
:bar {"changes":"hello there\n","fileName":".\/foo.txt"}
:foo {"changes":"hello there\n","fileName":".\/foo.txt"}
repl> (notify-a-change "x" "./foo.txt")
str12 end12
:bar {"changes":"","fileName":".\/foo.txt"}
:foo {"changes":"","fileName":".\/foo.txt"}
in a shell:
sh> echo 'bye bye' >> ./foo.txt
in a repl:
repl> (notify-a-change "x" "./foo.txt")
str12 end20
:bar {"changes":"bye bye\n","fileName":".\/foo.txt"}
:foo {"changes":"bye bye\n","fileName":".\/foo.txt"}

clojure dynamic var conflicts with for in a weird way

I use a dynamic var to make an easy way to use ssh. However, it suddenly stops working in a multi-parameter for!
So, here is my core.clj (which is kind of sketchy now):
(use 'clj-ssh.ssh)
(def the-agent (ssh-agent {}))
(def ^:dynamic *session* nil)
(defmacro on-host [host & body]
`(binding [*session* (clj-ssh.ssh/session the-agent ~host {})]
~#body))
(defn cmd [& args]
(split (:out (ssh *session* {:cmd (join " " args)})) #"\n"))
(defn attempt-1 []
(cmd "ls -a"))
(defn attempt-2 []
(for [f (cmd "ls -a")]
f))
(defn attempt-3 []
(for [r (range 3)
f (cmd "ls -a")]
[r f]))
For some reason, first two trial functions work, and the third doesn't (the hosts and files are censured):
user=> (on-host "(some host)" (attempt-1))
["." ".." ".ackrc" ...]
user=> (on-host "(some host)" (attempt-2))
("." ".." ".ackrc" ...)
user=> (on-host "(some host)" (attempt-3))
IllegalArgumentException No implementation of method: :connected? of protocol: #'clj-ssh.ssh.protocols/Session found for class: nil clojure.core/-cache-protocol-fn (core_deftype.clj:544)
Just in case you need a stacktrace:
user=> (use 'clojure.stacktrace)
nil
user=> (print-stack-trace *e 7)
java.lang.IllegalArgumentException: No implementation of method: :connected? of protocol: #'clj-ssh.ssh.protocols/Session found for class: nil
at clojure.core$_cache_protocol_fn.invoke (core_deftype.clj:544)
clj_ssh.ssh.protocols$eval1554$fn__1566$G__1543__1571.invoke (protocols.clj:4)
clj_ssh.ssh$connected_QMARK_.invoke (ssh.clj:411)
clj_ssh.ssh$ssh.invoke (ssh.clj:712)
census.core$cmd.doInvoke (core.clj:15)
clojure.lang.RestFn.invoke (RestFn.java:408)
census.core$attempt_3$iter__1949__1955$fn__1956.invoke (core.clj:29)
nil
I'm really not sure what it is all about. Can you help me? Thank you!
Use doseq, not for
What you have there is a lazy sequence that will be evaluated after the binding form has returned. doseq forces evaluation.
For further reading:
Difference between doseq and for in Clojure
http://cemerick.com/2009/11/03/be-mindful-of-clojures-binding/

Using with-redefs inside a doseq

I have a Ring handler which uses several functions to build the response. If any of these functions throw an exception, it should be caught so a custom response body can be written before returning a 500.
I'm writing the unit test for the handler, and I want to ensure that an exception thrown by any of these functions will be handled as described above. My instinct was to use with-redefs inside a doseq:
(doseq [f [ns1/fn1 ns1/fn2 ns2/fn1]]
(with-redefs [f (fn [& args] (throw (RuntimeException. "fail!"))]
(let [resp (app (request :get "/foo")))))]
(is (= (:status resp) 500))
(is (= (:body resp) "Something went wrong")))))
Of course, given that with-redefs wants to change the root bindings of vars, it is treating f as a symbol. Is there any way to get it to rebind the var referred to by f? I suspect that I'm going to need a macro to accomplish this, but I'm hoping that someone can come up with a clever solution.
with-redefs is just sugar over repeated calls to alter-var-root, so you can just write the desugared form yourself, something like:
(doseq [v [#'ns1/fn1 #'ns1/fn2 #'ns2/fn1]]
(let [old #v]
(try
(alter-var-root v (constantly (fn [& args]
(throw (RuntimeException. "fail!")))))
(let [resp (app (request :get "/foo")))
(is (= (:status resp) 500))
(is (= (:body resp) "Something went wrong")))
(finally (alter-var-root v (constantly old))))))
Following on from amalloy's great answer, here's a utility function that I wrote to use in my tests:
(defn with-mocks
"Iterates through a list of functions to-mock, rebinding each to mock-fn, and calls
test-fn with the optional test-args.
Example:
(defn foobar [a b]
(try (+ (foo a) (bar b))
(catch Exception _ 1)))
(deftest test-foobar
(testing \"Exceptions are handled\"
(with-mocks
[#'foo #'bar]
(fn [& _] (throw (RuntimeException. \"\")))
(fn [a b] (is (= 1 (foobar a b)))) 1 2)))"
[to-mock mock-fn test-fn & test-args]
(doseq [f to-mock]
(let [real-fn #f]
(try
(alter-var-root f (constantly mock-fn))
(apply test-fn test-args)
(finally
(alter-var-root f (constantly real-fn)))))))

clojure.core/map isn't working

I am trying to figure out why one of my map calls isn't working. I am building a crawler with the purpose of learning Clojure.
(use '[clojure.java.io])
(defn md5
"Generate a md5 checksum for the given string"
[token]
(let [hash-bytes
(doto (java.security.MessageDigest/getInstance "MD5")
(.reset)
(.update (.getBytes token)))]
(.toString
(new java.math.BigInteger 1 (.digest hash-bytes)) ; Positive and the size of the number
16)))
(defn full-url [url base]
(if (re-find #"^http[s]{0,1}://" url)
url
(apply str "http://" base (if (= \/ (first url))
url
(apply str "/" url)))))
(defn get-domain-from-url [url]
(let [matcher (re-matcher #"http[s]{0,1}://([^/]*)/{0,1}" url)
domain-match (re-find matcher)]
(nth domain-match 1)))
(defn crawl [url]
(do
(println "-----------------------------------\n")
(if (.exists (clojure.java.io/as-file (apply str "theinternet/page" (md5 url))))
(println (apply str url " already crawled ... skiping \n"))
(let [domain (get-domain-from-url url)
text (slurp url)
matcher (re-matcher #"<a[^>]*href\s*=\s*[\"\']([^\"\']*)[\"\'][^>]*>(.*)</a\s*>" text)]
(do
(spit (apply str "theinternet/page" (md5 url)) text)
(loop [urls []
a-tag (re-find matcher)]
(if a-tag
(let [u (nth a-tag 1)]
(recur (conj urls (full-url u domain)) (re-find matcher)))
(do
(println (apply str "parsed: " url))
(println (apply str (map (fn [u]
(apply str "-----> " u "\n")) urls)))
(map crawl urls)))))))))
(defn -main
"I don't do a whole lot ... yet."
[& args]
(crawl "http://www.example.com/"))
First call to map works:
(println (apply str (map (fn [u]
(apply str "-----> " u "\n")) urls)))
But the second call (map crawl urls) seems to be ignored.
The crawl function is working as intended, slurping the url, parsing with the regex for a tags for fetching the href and the accumulation in the loop works as intended, but when i call map with crawl and the urls that have been found on the page, the call to map is ignored.
Also if I try to call (map crawl ["http://www.example.com"]) this call is, again, ignored.
I have started my Clojure adventure a couple of weeks ago so any suggestions/criticisms are most welcomed.
Thank you
In Clojure, map is lazy. From the docs, map:
Returns a lazy sequence consisting of the result of applying f to the
set of first items of each coll, followed by applying f to the set
of second items in each coll, until any one of the colls is
exhausted.
Your crawl function is a function with side effects - you're spit-ing some results to a file, and println-ing to report on progress. But, because map returns a lazy sequence, none of these things will happen - the result sequence is never explicitly realized so it can stay lazy.
There are a number of ways of realizing a lazy sequence (that has been created e.g. using map), but in this case, as you want to iterate over a sequence using a function that has side-effects, it's probably best to use doseq:
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
If you replace the call to (map crawl urls) with (doseq [u urls] (crawl u)), you should get the desired result.
Note: your first call to map works as expected because you are realizing the results using (apply str). There is no way to (apply str) without evaluating the sequence.

How to parse URL parameters in Clojure?

If I have the request "size=3&mean=1&sd=3&type=pdf&distr=normal" what's the idiomatic way of writing the function (defn request->map [request] ...) that takes this request and
returns a map {:size 3, :mean 1, :sd 3, :type pdf, :distr normal}
Here is my attempt (using clojure.walk and clojure.string):
(defn request-to-map
[request]
(keywordize-keys
(apply hash-map
(split request #"(&|=)"))))
I am interested in how others would solve this problem.
Using form-decode and keywordize-keys:
(use 'ring.util.codec)
(use 'clojure.walk)
(keywordize-keys (form-decode "hello=world&foo=bar"))
{:foo "bar", :hello "world"}
Assuming you want to parse HTTP request query parameters, why not use ring? ring.middleware.params contains what you want.
The function for parameter extraction goes like this:
(defn- parse-params
"Parse parameters from a string into a map."
[^String param-string encoding]
(reduce
(fn [param-map encoded-param]
(if-let [[_ key val] (re-matches #"([^=]+)=(.*)" encoded-param)]
(assoc-param param-map
(codec/url-decode key encoding)
(codec/url-decode (or val "") encoding))
param-map))
{}
(string/split param-string #"&")))
You can do this easily with a number of Java libraries. I'd be hesitant to try to roll my own parser unless I read the URI specs carefully and made sure I wasn't missing any edge cases (e.g. params appearing in the query twice with different values). This uses jetty-util:
(import '[org.eclipse.jetty.util UrlEncoded MultiMap])
(defn parse-query-string [query]
(let [params (MultiMap.)]
(UrlEncoded/decodeTo query params "UTF-8")
(into {} params)))
user> (parse-query-string "size=3&mean=1&sd=3&type=pdf&distr=normal")
{"sd" "3", "mean" "1", "distr" "normal", "type" "pdf", "size" "3"}
Can also use this library for both clojure and clojurescript: https://github.com/cemerick/url
user=> (-> "a=1&b=2&c=3" cemerick.url/query->map clojure.walk/keywordize-keys)
{:a "1", :b "2", :c "3"}
Yours looks fine. I tend to overuse regexes, so I would have solved it as
(defn request-to-keywords [req]
(into {} (for [[_ k v] (re-seq #"([^&=]+)=([^&]+)" req)]
[(keyword k) v])))
(request-to-keywords "size=1&test=3NA=G")
{:size "1", :test "3NA=G"}
Edit: try to stay away from clojure.walk though. I don't think it's officially deprecated, but it's not very well maintained. (I use it plenty too, though, so don't feel too bad).
I came across this question when constructing my own site and the answer can be a bit different, and easier, if you are passing parameters internally.
Using Secretary to handle routing: https://github.com/gf3/secretary
Parameters are automatically extracted to a map in :query-params when a route match is found. The example given in the documentation:
(defroute "/users/:id" [id query-params]
(js/console.log (str "User: " id))
(js/console.log (pr-str query-params)))
(defroute #"/users/(\d+)" [id {:keys [query-params]}]
(js/console.log (str "User: " id))
(js/console.log (pr-str query-params)))
;; In both instances...
(secretary/dispach! "/users/10?action=delete")
;; ... will log
;; User: 10
;; "{:action \"delete\"}"
You can use ring.middleware.params. Here's an example with aleph:
user=> (require '[aleph.http :as http])
user=> (defn my-handler [req] (println "params:" (:params req)))
user=> (def server (http/start-server (wrap-params my-handler)))
wrap-params creates an entry in the request object called :params. If you want the query parameters as keywords, you can use ring.middleware.keyword-params. Be sure to wrap with wrap-params first:
user=> (require '[ring.middleware.params :refer [wrap-params]])
user=> (require '[ring.middleware.keyword-params :refer [wrap-keyword-params])
user=> (def server
(http/start-server (wrap-keyword-params (wrap-params my-handler))))
However, be mindful that this includes a dependency on ring.