I am trying to scrape a website using Clojure's Enlive library. The corresponding CSS selector is:
body > table:nth-child(2) > tbody > tr > td:nth-child(3) > table > tbody > tr > td > table > tbody > tr:nth-child(n+3)
I have tested the above selector with jQuery, and it works, but I don't know how to translate it into Enlive's selector syntax. I have tried something along these lines:
(ns vimindex.core
  (:gen-class)
  (:require [net.cgrand.enlive-html :as html]))
(def ^:dynamic *vim-org-url* "http://www.vim.org/scripts/script_search_results.php?order_by=creation_date&direction=descending")
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))
(defn scrape-vimorg []
  (println "Scraping vimorg")
  (println
    (html/select (fetch-url *vim-org-url*)
                 [:body :> [:table (html/nth-child 2)] :> :tbody :> :tr :> [:td (html/nth-child 3)] :> :table :> :tbody :> :tr :> :td :> :table :> :tbody :> [:tr (html/nth-child 1 3)]])))
; body > table:nth-child(2) > tbody > tr > td:nth-child(3) > table > tbody > tr > td > table > tbody > tr:nth-child(n + 3)
; Above selector works with jquery
(defn -main
  [& args]
  (scrape-vimorg))
But I get an empty response. Could you please tell me how to translate the above CSS selector into Enlive's syntax?
Thanks a lot.
Edited: To include the full code.
The syntax you are missing is an additional set of brackets around each element that uses a pseudo-selector. So you want something like this:
[:body :> [:table (html/nth-child 2)] :> :tbody :> :tr :>
 [:td (html/nth-child 3)] :> :table :> :tbody :> :tr :> :td :>
 :table :> :tbody :> [:tr (html/nth-child 1 3)]]
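To see why (html/nth-child 1 3) is the counterpart of CSS nth-child(n+3), it can help to list which child positions a formula a*n + b picks out. The sketch below is plain Clojure and does not call Enlive. The helper names matches-an+b? and matching-positions are hypothetical, and it assumes the two-argument form follows the CSS an+b convention with n = 0, 1, 2, and so on.

```clojure
;; Hypothetical helpers, not part of Enlive.
;; True when the 1-based position pos equals a*n + b for some integer n >= 0.
(defn matches-an+b? [a b pos]
  (let [d (- pos b)]
    (if (zero? a)
      (zero? d)
      (and (zero? (rem d a))
           (>= (quot d a) 0)))))

;; All 1-based positions among `total` children selected by a*n + b.
(defn matching-positions [a b total]
  (filter #(matches-an+b? a b %) (range 1 (inc total))))

(matching-positions 1 3 6) ; n+3 on six rows  => (3 4 5 6)
(matching-positions 2 1 6) ; 2n+1 on six rows => (1 3 5)
```

Under that assumption, a = 1 and b = 3 skips the first two rows and keeps every row from the third onward, which is what tr:nth-child(n+3) asks for.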
It looks like browsers (at least my version of Firefox) add a tbody tag to their DOM representation even if it's not in the actual source.
Enlive does not do so, so your code should work when you omit the tbody parts.
This is a program to parse some sites. The first site is site1. All the logic to parse that particular site is located under (-> config :site1).
(ns program.core
  (:require [net.cgrand.enlive-html :as html]))
(def config
  {:site1
   {:site-url
    ["http://www.site1.com/page/1"
     "http://www.site1.com/page/2"
     "http://www.site1.com/page/3"
     "http://www.site1.com/page/4"]
    :url-encoding "iso-8859-1"
    :parsing-index
    {:date
     {:selector
      [[:td.PadMed (html/nth-of-type 1)] :table [:tr (html/nth-of-type 2)]
       [:td (html/nth-of-type 3)] [:span]]
      :trimming-fn
      (comp first :content) ; (first) to remove extra parentheses
      }
     :title
     {:selector
      [[:td.PadMed (html/nth-of-type 1)] :table :tr [:td (html/nth-of-type 2)] [:a]]
      :trimming-fn
      (comp first :content first :content)
      }
     :url
     {:selector
      [[:td.PadMed (html/nth-of-type 1)] :table :tr [:td (html/nth-of-type 2)] [:a]]
      :trimming-fn
      #(str "http://www.site.com" (:href (:attrs %)))
      }
     }
    }})
;=== Fetch fn ===;
(defn fetch-encoded-url
  ([url] (fetch-encoded-url url "utf-8"))
  ([url encoding] (-> url java.net.URL.
                      .getContent
                      (java.io.InputStreamReader. encoding)
                      html/html-resource)))
Now I want to parse the pages contained in (-> config :site1 :site-url). In this example I use only the first URL, but how can I design this to do a kind of master for over all the URLs?
(defn parse-element [element]
  (into [] (map (-> config :site1 :parsing-index element :trimming-fn)
                (html/select
                  (fetch-encoded-url
                    (-> config :site1 :site-url first)
                    (-> config :site1 :url-encoding))
                  (-> config :site1 :parsing-index element :selector)))))
(def element-lists
  (apply map vector
         (map parse-element (-> config :site1 :parsing-index keys))))
(def tagged-lists
  (into [] (for [element-list element-lists]
             (zipmap [:date :title :url] element-list))))
;==== Fn call ====
(println tagged-lists)
Pass the site key (e.g. :site1) as an argument to parse-element, and turn element-lists into a function that takes it as well.
(defn parse-element [site element]
  (into [] (map (-> config site :parsing-index element :trimming-fn)
                (html/select
                  (fetch-encoded-url
                    (-> config site :site-url first)
                    (-> config site :url-encoding))
                  (-> config site :parsing-index element :selector)))))
(defn element-lists [site]
  (apply map vector
         (map (partial parse-element site) (-> config site :parsing-index keys))))
And then map over :site1 :site2… keys.
Addendum in answer to the further question in the comments.
You could wrap the html/select in a map over the URLs in :site-url. Something like:
(defn parse-element [site element]
  (let [site-urls (-> config site :site-url)]
    (into [] (map (-> config site :parsing-index element :trimming-fn)
                  (map
                    #(html/select
                       (fetch-encoded-url
                         %
                         (-> config site :url-encoding))
                       (-> config site :parsing-index element :selector))
                    site-urls)))))
(I hope I got the parens right.)
Then you'll probably need to check the :trimming-fn, in order for it to handle the nesting. An apply should suffice.
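One way to handle the nesting, as an alternative to adjusting :trimming-fn, is to flatten the per-URL results with mapcat before trimming. This is a minimal sketch with no network access: fake-pages, select-nodes and parse-all are hypothetical names, and the node maps are made-up stand-ins for what html/select would return.

```clojure
;; Hypothetical stand-in for the nodes html/select would return for each URL.
(def fake-pages
  {"http://www.site1.com/page/1" [{:tag :span :content ["date-1"]}
                                  {:tag :span :content ["date-2"]}]
   "http://www.site1.com/page/2" [{:tag :span :content ["date-3"]}]})

;; Hypothetical replacement for (html/select (fetch-encoded-url url ...) selector).
(defn select-nodes [url]
  (get fake-pages url))

;; Same shape as the :date trimming function in the config.
(def trimming-fn (comp first :content))

;; map would yield one sequence of nodes per URL; mapcat concatenates them,
;; so trimming-fn still receives one node at a time.
(defn parse-all [urls]
  (mapv trimming-fn (mapcat select-nodes urls)))

(parse-all ["http://www.site1.com/page/1" "http://www.site1.com/page/2"])
;; => ["date-1" "date-2" "date-3"]
```

With this shape the existing trimming functions stay unchanged, at the cost of losing track of which URL each item came from.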
I want to create a function that lets me pull content from a feed. Here's what I have (zf is clojure.contrib.zip-filter.xml):
(:require
[clojure.zip :as z]
[clojure.data.zip.xml :only (attr text xml->)]
[clojure.xml :as xml ]
[clojure.contrib.zip-filter.xml :as zf]
)
(def data-url "http://api.eventful.com/rest/events/search?app_key=4H4Vff4PdrTGp3vV&keywords=music&location=Belgrade&date=Future")
(defn zipp [data] (z/xml-zip data))
(defn contents[cont & tags]
(assert (= (zf/xml-> (zipp(parsing cont)) (seq tags) text))))
but when I call it
(contents data-url :events :event :title)
I get an error
java.lang.RuntimeException: java.lang.ClassCastException: clojure.lang.ArraySeq cannot be cast to clojure.lang.IFn (NO_SOURCE_FILE:0)
(Updated in response to the comments: see end of answer for ready-made function parameterized by the tags to match.)
The following extracts the titles from the XML pointed at by the URL from the question text (tested at a Clojure 1.5.1 REPL with clojure.data.xml 0.0.7 and clojure.data.zip 0.1.1):
(require '[clojure.zip :as zip]
'[clojure.data.xml :as xml]
'[clojure.data.zip.xml :as xz]
'[clojure.java.io :as io])
(def data-url "http://api.eventful.com/rest/events/search?app_key=4H4Vff4PdrTGp3vV&keywords=music&location=Belgrade&date=Future")
(def data (-> data-url io/reader xml/parse))
(def z (zip/xml-zip data))
(mapcat (comp :content zip/node)
        (xz/xml-> z
                  (xz/tag= :events)
                  (xz/tag= :event)
                  (xz/tag= :title)))
;; value of the above right now:
("Belgrade Early Music Festival, Gosta / Purcell: Dido & Aeneas"
"Belgrade Early Music Festival, Gosta / Purcell: Dido & Aeneas"
"Belgrade Early Music Festival, Gosta / Purcell: Dido & Aeneas"
"VIII Early Music Festival, Belgrade 2013"
"Kevlar Bikini"
"U-Recken - Tree of Life Pre event"
"Green Day"
"Smallman - Vrane Kamene (Crows Of Stone)"
"One Direction"
"One Direction in Serbia")
Some comments:
The clojure.contrib.* namespaces are all deprecated. xml-> now lives in clojure.data.zip.xml.
xml-> accepts a zip loc and a bunch of "predicates"; in this context, however, the word "predicate" has an unusual meaning of a filtering function working on zip locs. See clojure.data.zip.xml source for several functions which return such predicates; for an example of use, see above.
If you want to define a list of predicates separately, you can do that too, then use xml-> with apply:
(def loc-preds [(xz/tag= :events) (xz/tag= :event) (xz/tag= :title)])
(mapcat (comp :content zip/node) (apply xz/xml-> z loc-preds))
;; value returned as above
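The reason apply is needed is that xml-> takes its predicates as separate arguments. A sequence passed as a single argument is treated as one predicate and ends up being called like a function, which appears to be what produced the ClassCastException in the question. The sketch below shows the same effect with a hypothetical variadic function, chain, standing in for xml->, so it runs without any XML libraries.

```clojure
;; Hypothetical stand-in for xml->: threads a value through each step in turn.
(defn chain [start & steps]
  (reduce (fn [acc step] (step acc)) start steps))

(def steps [inc inc str])

;; apply spreads the collection into separate arguments.
(apply chain 1 steps)
;; => "3"

;; Without apply, the whole sequence arrives as a single "step" and is
;; called as if it were a function, which throws ClassCastException.
(defn call-without-apply []
  (try
    (chain 1 (seq steps))
    (catch ClassCastException e
      :class-cast-exception)))
```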
Update: Here's a function which takes the url and keywords naming tags as arguments and returns the content found at the tags:
(defn get-content-from-tags [url & tags]
  (mapcat (comp :content zip/node)
          (apply xz/xml->
                 (-> url io/reader xml/parse zip/xml-zip)
                 (for [t tags]
                   (xz/tag= t)))))
Calling it like so:
(get-content-from-tags data-url :events :event :title)
gives the same result as the mapcat form above.
Given the below function -
(defn ^:export hi [] (+ 2 3))
I would like to write a macro that does this -
(defex hi [] (+ 2 3))
The macro defex just adds the ^:export metadata in front of the function. How do I do that?
Edit: I checked the function at the REPL with (meta hi) and it gives nil. So most probably I don't want to add metadata, but rather to define the function in the above manner.
Thanks,
Murtaza
You don't want the meta on the function itself, you want it on the var (or whatever clojurescript's equivalent of that is):
user> (defmacro defex [name & defn-args]
        `(defn ~(vary-meta name assoc :export true) ~@defn-args))
#'user/defex
user> (defex hi [] "hi")
#'user/hi
user> (meta #'hi)
{:arglists ([]), :ns #<Namespace user>, :name hi, :export true, :line 1, :file "NO_SOURCE_FILE"}
You can use a basic template macro that builds a function and uses def to store it in a var:
user> (defmacro defex [name args & body] `(def ~name ^{:export true} (fn ~args ~@body)))
#'user/defex
user> (defex hi [] (+ 2 3))
#'user/hi
user> (meta hi)
{:export true}
user>
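The two answers put the metadata in different places: the first attaches it to the var, the second to the function value. The sketch below contrasts them side by side. The names defex-var, defex-fn, hi-var and hi-fn are hypothetical, chosen only so both variants can coexist in one namespace, and the second macro spells out with-meta explicitly rather than using the ^{...} reader form.

```clojure
;; Variant 1: metadata on the var (the symbol handed to defn carries it).
(defmacro defex-var [name & defn-args]
  `(defn ~(vary-meta name assoc :export true) ~@defn-args))

;; Variant 2: metadata on the function object stored in the var.
(defmacro defex-fn [name args & body]
  `(def ~name (with-meta (fn ~args ~@body) {:export true})))

(defex-var hi-var [] (+ 2 3))
(defex-fn hi-fn [] (+ 2 3))

(:export (meta #'hi-var)) ; => true, found on the var
(:export (meta hi-fn))    ; => true, found on the function value
(hi-var)                  ; => 5
(hi-fn)                   ; => 5
```

Which variant is appropriate depends on who reads the flag: tooling that inspects vars needs the first form, while code that only holds the function value can see the second.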
I am trying to scrape some data from a page with a table based layout. So, to get some of the data I need to get something like 3rd table inside 2nd table inside 5th table inside 1st table inside body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other selector steps. To make matters worse, the page in question has a single top level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?
For nth-of-type, does the following example help?
user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
"<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
No idea about the second issue. Your approach seems to work with a naive test:
user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})
Any chance of looking at your actual HTML?
Update: (in response to the comment)
Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:
user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})
The -> macro seems to be powerful, yet I'm failing to apply it to anything but silly examples. Can you show me some real uses of it?
Thanks!
Compare:
user> (:baz (:bar (:foo {:foo {:bar {:baz 123}}})))
123
user> (java.io.BufferedReader. (java.io.FileReader. "foo.txt"))
#<BufferedReader java.io.BufferedReader@6e1f8f>
user> (vec (reverse (.split (.replaceAll (.toLowerCase "FOO,BAR,BAZ") "b" "x") ",")))
["xaz" "xar" "foo"]
to:
user> (-> {:foo {:bar {:baz 123}}} :foo :bar :baz)
123
user> (-> "foo.txt" java.io.FileReader. java.io.BufferedReader.)
#<BufferedReader java.io.BufferedReader@7a6c34>
user> (-> "FOO,BAR,BAZ" .toLowerCase (.replaceAll "b" "x") (.split ",") reverse vec)
["xaz" "xar" "foo"]
-> is used when you want a concise way to nest calls. It lets you list the calls in the order they'll be called rather than inside-out, which can be more readable. In the third example, notice how much distance is between some of the arguments and the function they belong to; -> lets you group arguments and function calls a bit more cleanly. Because it's a macro it also works for Java calls, which is nice.
-> isn't that powerful, it just saves you a few parens now and then. Using it or not is a question of style and readability.
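Since -> is only a rewrite of the form, expanding it shows exactly which nested calls it stands for. This is a small sketch using clojure.walk/macroexpand-all from the standard library; the names expanded, threaded and nested are just for illustration.

```clojure
(require '[clojure.walk :as walk])

;; -> rewrites the threaded form into ordinary nested calls at macroexpansion time.
(def expanded (walk/macroexpand-all '(-> 5 inc (* 2) str)))
;; expanded is the nested form (str (* (inc 5) 2))

;; The threaded and hand-nested versions therefore evaluate to the same value.
(def threaded (-> 5 inc (* 2) str))
(def nested (str (* (inc 5) 2)))
;; both are "12"
```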
Look at the bottom of clojure.zip for extreme examples of how this is helpful.
(-> dz next next next next next next next next next remove up (append-child 'e) root)
Taken from the wiki I've always found this example impressive:
user=> (import '(java.net URL) '(java.util.zip ZipInputStream))
user=> (-> "http://clojure.googlecode.com/files/clojure_20081217.zip"
URL. .openStream ZipInputStream. .getNextEntry bean :name)
As Brian said, it isn't 'useful' so much as a 'different style'. For Java interop I find this form of 'start with X, then do Y and Z' more readable than 'do Z to Y of X'.
Basically you have 4 options:
; imperative style named steps:
(let [X something
      b (Y X)
      c (Z b)] c)
; nested calls
(Z (Y X))
; threaded calls
(-> X Y Z)
; functional composition
((comp Z Y) X)
I find -> really shines for java interop but avoid it elsewhere.
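The four options can be tried on a concrete value to confirm that they are interchangeable. In this sketch, clojure.string/trim plays the role of Y and clojure.string/upper-case the role of Z; the var names are hypothetical.

```clojure
(require '[clojure.string :as string])

(def x "  hello  ")

;; imperative style, named steps
(def via-let
  (let [trimmed (string/trim x)
        upper   (string/upper-case trimmed)]
    upper))

;; nested calls
(def via-nesting (string/upper-case (string/trim x)))

;; threaded calls
(def via-threading (-> x string/trim string/upper-case))

;; functional composition (comp applies its functions right to left)
(def via-comp ((comp string/upper-case string/trim) x))

;; all four are "HELLO"
```

Note that comp lists the functions in the opposite order from ->, which is one reason the threaded form can read more naturally for step-by-step pipelines.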
(defn search-tickets-for [term]
  (-> term search zip-soup first :content
      ((partial filter #(= :body (:tag %)))) first :content
      ((partial filter #(= :div (:tag %))))
      ((partial filter #(= "content" ((comp :id :attrs) %))))
      ((partial map :content)) first ((partial map :content))
      ((partial map first)) ((partial filter #(= :ul (:tag %)))) first :content
      ((partial map :content))
      ((partial map first))
      ((partial mapcat :content))
      ((partial filter #(= :h4 (:tag %))))
      ((partial mapcat :content))
      ((partial filter #(= :a (:tag %))))
      ((partial mapcat :content))))
clojurebot from #clojure uses this to search assembla tickets