I am trying to figure out why one of my map calls isn't working. I am building a crawler with the purpose of learning Clojure.
(use '[clojure.java.io])

(defn md5
  "Generate a md5 checksum for the given string"
  [token]
  (let [hash-bytes
        (doto (java.security.MessageDigest/getInstance "MD5")
          (.reset)
          (.update (.getBytes token)))]
    (.toString
     (new java.math.BigInteger 1 (.digest hash-bytes)) ; Positive and the size of the number
     16)))
(defn full-url [url base]
  (if (re-find #"^http[s]{0,1}://" url)
    url
    (apply str "http://" base (if (= \/ (first url))
                                url
                                (apply str "/" url)))))
(defn get-domain-from-url [url]
  (let [matcher (re-matcher #"http[s]{0,1}://([^/]*)/{0,1}" url)
        domain-match (re-find matcher)]
    (nth domain-match 1)))
(defn crawl [url]
  (do
    (println "-----------------------------------\n")
    (if (.exists (clojure.java.io/as-file (apply str "theinternet/page" (md5 url))))
      (println (apply str url " already crawled ... skipping \n"))
      (let [domain (get-domain-from-url url)
            text (slurp url)
            matcher (re-matcher #"<a[^>]*href\s*=\s*[\"\']([^\"\']*)[\"\'][^>]*>(.*)</a\s*>" text)]
        (do
          (spit (apply str "theinternet/page" (md5 url)) text)
          (loop [urls []
                 a-tag (re-find matcher)]
            (if a-tag
              (let [u (nth a-tag 1)]
                (recur (conj urls (full-url u domain)) (re-find matcher)))
              (do
                (println (apply str "parsed: " url))
                (println (apply str (map (fn [u]
                                           (apply str "-----> " u "\n")) urls)))
                (map crawl urls)))))))))
(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (crawl "http://www.example.com/"))
First call to map works:
(println (apply str (map (fn [u]
                           (apply str "-----> " u "\n")) urls)))
But the second call (map crawl urls) seems to be ignored.
The crawl function otherwise works as intended: it slurps the URL, the regex parses the <a> tags to fetch each href, and the accumulation in the loop works as expected. But when I call map with crawl and the URLs that were found on the page, the call to map is ignored.
Also, if I try to call (map crawl ["http://www.example.com"]) directly, this call is again ignored.
I started my Clojure adventure a couple of weeks ago, so any suggestions/criticisms are most welcome.
Thank you
In Clojure, map is lazy. From the docs, map:
Returns a lazy sequence consisting of the result of applying f to the
set of first items of each coll, followed by applying f to the set
of second items in each coll, until any one of the colls is
exhausted.
Your crawl function is a function with side effects: you're spit-ing results to a file and println-ing to report progress. But because map returns a lazy sequence, none of these things will happen: the result sequence is never explicitly realized, so it stays lazy.
There are a number of ways of realizing a lazy sequence (that has been created e.g. using map), but in this case, as you want to iterate over a sequence using a function that has side-effects, it's probably best to use doseq:
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
If you replace the call to (map crawl urls) with (doseq [u urls] (crawl u)), you should get the desired result.
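For example, the tail of the loop in crawl would then look like this (a sketch, with the rest of the function unchanged):

(do
  (println (apply str "parsed: " url))
  (println (apply str (map (fn [u]
                             (apply str "-----> " u "\n")) urls)))
  (doseq [u urls]
    (crawl u)))

On Clojure 1.7 or later, (run! crawl urls) is an equivalent one-liner: it eagerly applies crawl to each URL and returns nil.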
Note: your first call to map works as expected because you are realizing the results using (apply str). There is no way to (apply str) without evaluating the sequence.
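You can see the difference at the REPL (a minimal demonstration):

;; the REPL prints the result, which forces the lazy sequence,
;; so the side effects appear to happen:
(map println [1 2 3])

;; inside a function whose return value is discarded, nothing
;; forces the sequence, so nothing is printed:
(defn quiet [] (map println [1 2 3]) nil)
(quiet)

;; dorun forces a lazy sequence purely for its side effects:
(defn noisy [] (dorun (map println [1 2 3])) nil)
(noisy)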
Related
I'm trying to import data from StackOverflow to Neo4j using clojure and the neocons library. Excuse me for being a bit of a newbie.
Here's my main function in Leiningen:
(defn -main
  [& args]
  (let [neo4j-conn (nr/connect "http://localhost:7777/db/data/")]
    (cypher/tquery neo4j-conn "MATCH n OPTIONAL MATCH n-[r]-() DELETE n, r")
    (for [page (range 1 6)]
      (let [data (parse-string (stackoverflow-get-questions page))
            questions (data "items")
            has-more (data "has_more")
            question-ids (map #(%1 "question_id") questions)
            answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]
        (map #(import-question %1 neo4j-conn) questions)
        (map #(import-answer %1 neo4j-conn) answers)))))
I've defined import-question and import-answer functions, and those work fine independently. In fact, what's weird is that I can remove either one of those import-* lines and the other will work just fine.
Can anybody see if I'm doing something simple that's wrong?
Both map and for are lazy, and will do nothing at all unless you consume their results.
The first map call ends up being a noop because there is no way for anything to consume its output. Try wrapping the for and at least the first map call in a call to dorun, or doall if you plan on consuming the result.
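A minimal sketch of that approach, replacing the for form inside your -main (each inner map needs its own dorun, because only the last expression of the let body flows out to the outer dorun):

(dorun
 (for [page (range 1 6)]
   (let [data (parse-string (stackoverflow-get-questions page))
         questions (data "items")
         question-ids (map #(%1 "question_id") questions)
         answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]
     (dorun (map #(import-question %1 neo4j-conn) questions))
     (dorun (map #(import-answer %1 neo4j-conn) answers)))))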
Also, you can replace for with doseq, which is identical except that it returns nil, eagerly consumes its input, and can contain multiple forms in its body.
Here is what your code could look like using doseq:
(defn -main
  [& args]
  (let [neo4j-conn (nr/connect "http://localhost:7777/db/data/")]
    (cypher/tquery neo4j-conn "MATCH n OPTIONAL MATCH n-[r]-() DELETE n, r")
    (doseq [page (range 1 6)
            :let [data (parse-string (stackoverflow-get-questions page))
                  questions (data "items")
                  has-more (data "has_more")
                  question-ids (map #(%1 "question_id") questions)
                  answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]]
      (doseq [q questions]
        (import-question q neo4j-conn))
      (doseq [a answers]
        (import-answer a neo4j-conn)))))
I'm trying to read in a file line by line and concatenate a new string to the end of each line. For testing I've done this:
(defn read-file
  [filename]
  (with-open [rdr (clojure.java.io/reader filename)]
    (doall (line-seq rdr))))

(apply str ["asdfasdf" (doall (take 1 (read-file filename)))])
If I just evaluate (take 1 (read-file filename)) in the REPL, I get the first line of the file. However, when I try to evaluate what I did above, I get "asdfasdfclojure.lang.LazySeq@4be5d1db".
Can anyone explain how to forcefully evaluate take to get it to not return the lazy sequence?
The take function is lazy by design, so you may have to realize the values you want, using first, next, or nth, or operate on the entire seq with functions like apply, reduce, vec, or into.
In your case, it looks like you are trying to do the following:
(apply str ["asdfasdf" (apply str (take 1 (read-file filename)))])
Or:
(str "asdfasdf" (first (read-file filename)))
You can also realize the entire lazyseq using doall. Just keep in mind, a realized lazy seq is still a seq.
(realized? (take 1 (read-file filename))) ;; => false
(type (take 1 (read-file filename))) ;; => clojure.lang.LazySeq
(realized? (doall (take 1 (read-file filename)))) ;; => true
(type (doall (take 1 (read-file filename)))) ;; => clojure.lang.LazySeq
A better option would be to apply your transformations lazily, using something like map, and select the values you want from the resulting seq. (Like stream processing.)
(first (map #(str "prefix" % "suffix")
            (read-file filename)))
Note: map is lazy, so it will return an unrealized LazySeq.
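Note also why the doall in the original snippet did not help: str calls .toString on the LazySeq object itself, which prints the class name and hash code whether or not the seq has been realized. The elements have to be turned into a string explicitly:

(str "asdfasdf" (doall (take 1 ["line1" "line2"])))
;; => "asdfasdfclojure.lang.LazySeq@..." (realized, but still a seq object)

(str "asdfasdf" (apply str (take 1 ["line1" "line2"])))
;; => "asdfasdfline1"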
I have a map representing information about a subversion commit.
Example contents:
(def commit
  {:repository "/var/group1/project1"
   :revision-number "1234"
   :author "toolkit"
   etc..}
I would like to change the repository based on a prefix match, so that:
/var/group1 maps to http://foo/group1
/var/group2 maps to http://bar/group2
I have created 2 functions like:
(defn replace-fn [prefix replacement]
  (fn [str]
    (if (.startsWith str prefix)
      (.replaceFirst str prefix replacement)
      str)))

(def replace-group1 (replace-fn "/var/group1" "http://foo/group1"))
(def replace-group2 (replace-fn "/var/group2" "http://bar/group2"))
And now I have to apply them:
(defn fix-repository [{:keys [repository] :as commit}]
  (assoc commit :repository
         (replace-group1
          (replace-group2 repository))))
But this means I have to add an extra wrapper in my fix-repository for each new replacement.
I would like to simply:
Given a commit map
Extract the :repository value
Loop through a list of replacement prefixes
If any prefix matches, replace :repository value with the new string
Otherwise, leave the :repository value alone.
I can't seem to build the right loop, reduce, or other solution to this.
You can use function composition:
(def commit
  {:repository "/var/group2/project1"
   :revision-number "1234"
   :author "toolkit"})

(defn replace-fn [prefix replacement]
  (fn [str]
    (if (.startsWith str prefix)
      (.replaceFirst str prefix replacement)
      str)))

(def replacements
  (comp (replace-fn "/var/group1" "http://foo/group1")
        (replace-fn "/var/group2" "http://bar/group2")))

(defn fix-repository [commit replacements]
  (update-in commit [:repository] replacements))

(fix-repository commit replacements)
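If the list of replacements grows, the composed function can also be built from a seq of prefix/replacement pairs instead of being written out by hand (a sketch using the replace-fn above):

(def replacements
  (->> [["/var/group1" "http://foo/group1"]
        ["/var/group2" "http://bar/group2"]]
       (map (fn [[p r]] (replace-fn p r)))
       (apply comp)))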
How about something like this?
(defn replace-any-prefix [replacements-list string]
  (or (first
       (filter identity
               (map (fn [[p r]]
                      (when (.startsWith string p)
                        (.replaceFirst string p r)))
                    replacements-list)))
      string))
(update-in commit
           [:repository]
           (partial replace-any-prefix
                    [["/var/group1" "http://foo/group1"]
                     ["/var/group2" "http://bar/group2"]]))
Documentation for update-in: http://clojuredocs.org/clojure_core/clojure.core/update-in
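Since the question mentions reduce, the same idea can also be written as a fold over the replacement pairs (a sketch):

(defn replace-any-prefix [replacements-list string]
  (reduce (fn [s [p r]]
            (if (.startsWith s p)
              (.replaceFirst s p r)
              s))
          string
          replacements-list))

Unlike the filter/first version, this applies every matching replacement in turn, which makes no difference as long as at most one prefix can match.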
I am working in clojure with a java class which provides a retrieval API for a domain specific binary file holding a series of records.
The java class is initialized with a file and then provides a .query method which returns an instance of an inner class which has only one method .next, thus not playing nicely with the usual java collections API. Neither the outer nor inner class implements any interface.
The .query method may return null instead of the inner class. The .next method returns a record string or null if no further records are found, it may return null immediately upon the first call.
How do I make this java API work well from within clojure without writing further java classes?
The best I could come up with is:
(defn get-records
  [file query-params]
  (let [tr (JavaCustomFileReader. file)]
    (if-let [inner-iter (.query tr query-params)] ; .query may return null
      (loop [it inner-iter
             results []]
        (if-let [record (.next it)]
          (recur it (conj results record))
          results))
      [])))
This gives me a vector of results to work with the clojure seq abstractions. Are there other ways to expose a seq from the java API, either with lazy-seq or using protocols?
Without dropping to lazy-seq:
(defn record-seq
  [q]
  (take-while (complement nil?) (repeatedly #(.next q))))
Instead of (complement nil?) you could also just use identity if .next does not return boolean false.
(defn record-seq
  [q]
  (take-while identity (repeatedly #(.next q))))
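(On Clojure 1.6 and later, some? expresses the same check a bit more directly: (take-while some? (repeatedly #(.next q))).)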
I would also restructure the entry points a little bit.
(defn query
  [rdr params]
  (when-let [q (.query rdr params)]
    (record-seq q)))

(defn query-file
  [file params]
  (with-open [rdr (JavaCustomFileReader. file)]
    (doall (query rdr params))))
Seems like a good fit for lazy-seq:
(defn query [file query]
  (.query (JavaCustomFileReader. file) query))

(defn record-seq [query]
  (when query
    (when-let [v (.next query)]
      (cons v (lazy-seq (record-seq query))))))

;; usage:
(record-seq (query "filename" "query params"))
Your code is not lazy, as it would be if you were using an Iterable, but you can fill the gap with lazy-seq as follows.
(defn query-seq [q]
  (lazy-seq
   (when-let [val (.next q)]
     (cons val (query-seq q)))))
Maybe you should wrap the query method to protect yourself from the first null value as well.
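Such a wrapper could look like this (a sketch; the name is illustrative):

(defn safe-query-seq [rdr params]
  (when-let [q (.query rdr params)]
    (query-seq q)))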
I have a URL checker that I use in Perl. I was wondering how something like this would be done in Clojure. I have a file with thousands of URLs and I'd like the output file to contain the URL (minus http://, https://) and a simple :1 for valid and :0 for false. Ideally, I could check each site concurrently, considering that this is one of Clojure's strengths.
Input
http://www.google.com
http://www.cnn.com
http://www.msnbc.com
http://www.abadurlisnotgood.com
Output
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
I assume by "valid URL" you mean HTTP response 200. This might work. It requires clojure-contrib. Change map to pmap to make it parallel, as Arthur Ulfeldt mentioned.
(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)http:/+" "" url)
       ":"
       (try
         (let [c (cast HttpURLConnection
                       (.openConnection (URL. url)))]
           (if (= 200 (.getResponseCode c))
             1
             0))
         (catch UnknownHostException _
           0))))

(defn check-urls-from-file [filename]
  (doseq [line (map check-url
                    (read-lines (as-file filename)))]
    (println line)))
Given your example as input:
user> (check-urls-from-file "urls.txt")
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
Write a small function that appends a ":1" or ":0" to a url and then use pmap to apply it in parallel to all the urls.
(defn check-a-url [url] .... )
(pmap #(if (check-a-url %) (str % ":1") (str % ":0")) urls)
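Note that pmap also returns a lazy sequence, so the caveat from the earlier answers still applies: force the result (for example with doall or dorun) if you run it only for its side effects.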
Clojure now has an as-url function in clojure.java.io:
(as-url "http://google.com") ;;=> #object[java.net.URL 0x5dedf9bd "http://google.com"]
(str (as-url "http://google.com")) ;;=> "http://google.com"
(as-url "notanurl") ;; java.net.MalformedURLException
Based on that we could write a function like so:
(defn check-url
  "checks if the url is well formed"
  [url]
  (str (clojure.string/replace-first url #"(http://|https://)" "")
       ":"
       (try (as-url url) ;; built-in, does not perform an actual request, and does very little validation
            1
            (catch Exception e 0))))
(defn check-urls-from-file
  "from Brian Carper's answer"
  [filename]
  (doseq [line (map check-url (read-lines (as-file filename)))]
    (println line)))
Instead of pmap, I used agents with send-off in conjunction with the above solution. I think this is better when there is blocking I/O. I believe pmap has limited concurrency too. Here's what I have so far. I wonder how this will scale with thousands of URLs.
(use '(clojure.contrib duck-streams
                       java-utils
                       str-utils))

(import '(java.net URL
                   URLConnection
                   HttpURLConnection
                   UnknownHostException))

(defn check-url [url]
  (str (re-sub #"^(?i)http:/+" "" url)
       ":"
       (try
         (let [c (cast HttpURLConnection
                       (.openConnection (URL. url)))]
           (if (= 200 (.getResponseCode c))
             1
             0))
         (catch UnknownHostException _
           0))))
(def urls (read-lines "urls.txt"))

(def agents (for [url urls] (agent url)))

(doseq [agent agents]
  (send-off agent check-url))

(apply await agents)

(def x '())
(doseq [url (filter deref agents)]
  (def x (cons @url x)))

(prn x)

(shutdown-agents)
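As a side note, the collection step at the end can be simplified: after await, the results can be gathered directly with map and deref instead of rebuilding x with def:

(apply await agents)
(prn (map deref agents))
(shutdown-agents)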