Rescraping data with Enlive


I wrote a function that scrapes <h3> and <table> tags from an HTML page whose URL I pass in, and this works as it should: I get a sequence of <h3> and <table> elements. But when I try to use the select function to extract only the table or h3 tags from the resulting sequence, I get (), and if I try to map over those tags I get (nil nil nil ...).
Could you please help me resolve this issue, or explain what I am doing wrong?
Here is the code:
(ns Test2
  (:require [net.cgrand.enlive-html :as html])
  (:require [clojure.string :as string]))
(defn get-page
  "Gets the html page from passed url"
  [url]
  (html/html-resource (java.net.URL. url)))
(defn h3+table
  "returns sequence of <h3> and <table> tags"
  [url]
  (html/select (get-page url)
               {[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
                [:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}))
(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
This line gives me a headache:
(html/select (h3+table url) [:table])
Could you please tell me what I am doing wrong?
Just to clarify my question: is it possible to use Enlive's select function to extract only the table tags from the result of (h3+table url)?

As @Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) to the raw HTML. It seems you are trying to apply html/select multiple times, but this doesn't work. html/select parses HTML into a Clojure data structure, so you can't apply it to that data structure again.
I found that parsing the website was actually a little involved, but I thought it might be a nice use case for multimethods, so I hacked something together; maybe this will get you started:
(The code is ugly here; you can also check out this gist)
(ns tutorial.scrape1
(:require [net.cgrand.enlive-html :as html]))
(def *url* "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
(defn get-page [url]
(html/html-resource (java.net.URL. url)))
(defn content->string [content]
(cond
(nil? content) ""
(string? content) content
(map? content) (content->string (:content content))
(coll? content) (apply str (map content->string content))
:else (str content)))
(derive clojure.lang.PersistentStructMap ::Map)
(derive clojure.lang.PersistentArrayMap ::Map)
(derive java.lang.String ::String)
(derive clojure.lang.ISeq ::Collection)
(derive clojure.lang.PersistentList ::Collection)
(derive clojure.lang.LazySeq ::Collection)
(defn tag-type [node]
(case (:tag node)
:tr ::CompoundNode
:table ::CompoundNode
:th ::TerminalNode
:td ::TerminalNode
:h3 ::TerminalNode
:tbody ::IgnoreNode
::IgnoreNode))
(defmulti parse-node
(fn [node]
(let [cls (class node)] [cls (if (isa? cls ::Map) (tag-type node) nil)])))
(defmethod parse-node [::Map ::TerminalNode] [node]
(content->string (:content node)))
(defmethod parse-node [::Map ::CompoundNode] [node]
(map parse-node (:content node)))
(defmethod parse-node [::Map ::IgnoreNode] [node]
(parse-node (:content node)))
(defmethod parse-node [::String nil] [node]
node)
(defmethod parse-node [::Collection nil] [node]
(map parse-node node))
(defn h3+table [url]
(let [ws-content (get-page url)
h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
[:div#prospekt_container :table]})]
(for [node h3s+tables] (parse-node node))))
A few words on what's going on:
content->string takes a data structure and collects its content into a string and returns that so you can apply this to content that may still contain nested subtags (like <br/>) that you want to ignore.
The derive statements establish an ad hoc hierarchy which we will later use in the multi-method parse-node. This is handy because we never quite know which data structures we're going to encounter and we could easily add more cases later on.
The tag-type function is actually a hack that mimics the hierarchy statements - AFAIK you can't create a hierarchy out of non-namespace qualified keywords, so I did it like this.
The multi-method parse-node dispatches on the class of the node and if the node is a map additionally on the tag-type.
Now all we have to do is define the appropriate methods: If we're at a terminal node we convert the contents to a string, otherwise we either recur on the content or map the parse-node function on the collection we're dealing with. The method for ::String is actually not even used, but I left it in for safety.
The h3+table function is pretty much what you had before, I simplified the selectors a bit and put them into a set, not sure if putting them into a map as you did works as intended.
Happy scraping!

Your question is hard to understand, but I think your last line should simply be
(h3+table url)
This will return a deeply nested data structure containing scraped HTML that you can then dive into with the usual Clojure sequence APIs. Good luck.
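As a minimal sketch of that approach, assuming the scraped result is a flat sequence of Enlive-style node maps (:tag, :attrs, :content), plain filter is enough to keep only the tables. The sample nodes below are made up for illustration, and nodes-with-tag is a hypothetical helper, not part of Enlive.

```clojure
;; Hypothetical sample data shaped like Enlive nodes; not real scraped output.
(def sample-nodes
  [{:tag :h3    :attrs nil :content ["Basic data"]}
   {:tag :table :attrs nil :content [{:tag :tr :attrs nil :content []}]}
   {:tag :h3    :attrs nil :content ["Trading data"]}
   {:tag :table :attrs nil :content []}])

(defn nodes-with-tag
  "Keeps only the top-level nodes whose :tag equals `tag`."
  [tag nodes]
  (filter #(= tag (:tag %)) nodes))

(map :tag (nodes-with-tag :table sample-nodes))
;; => (:table :table)
```

If the result turns out to be a sequence of fragments (sequences of sibling nodes) rather than single nodes, which I believe is what a map selector produces, concatenating them first with (apply concat fragments) should give the flat sequence assumed here.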

Related

Clojure kebab case on selected keywords

I want to change certain keys in a large map in Clojure.
These keys can be present at any level in the map, but will always be within a :required-key map.
I was looking at using the camel-snake-kebab library, but I need it to change only a given set of keys inside the :required-key map. It doesn't matter whether the change is made in the JSON or in the map.
(def my-map {:allow_kebab_or-snake {:required-key {:must_be_kebab ""}}
:allow_kebab_or-snake2 {:optional-key {:required-key {:must_be_kebab ""}}}})
I am currently using walk/postwalk-replace, but I fear it may change keys that are not nested within the :required-key map:
(walk/postwalk-replace {:must_be_kebab :must-be-kebab} my-map)

Could you clarify: do you want to change the keys of the map, or their associated values?
Off-topic: your map above is not correct, since it has two identical keys, :allow_kebab_or_snake (I'm assuming you're just illustrating the point and not showing the actual data).
postwalk-replace WILL replace every occurrence of the key, wherever it appears.
So if you know the exact map structure, you could first select your sub-structure with get-in and then use postwalk-replace:
(walk/postwalk-replace {:must_be_kebab :must-be-kebab}
(get-in my-map [:allow_kebab_or-snake :required-key]))
But then you'll have to assoc the result back into your initial map.
You should also consider the walk function and write your own traversal if the nested data structure is too complex.
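The select-then-assoc steps can probably be collapsed into a single update-in. This is only a sketch, assuming the path to the sub-map is known in advance; my-map is copied from the question and renamed is a name introduced here for illustration.

```clojure
(require '[clojure.walk :as walk])

(def my-map
  {:allow_kebab_or-snake  {:required-key {:must_be_kebab ""}}
   :allow_kebab_or-snake2 {:optional-key {:required-key {:must_be_kebab ""}}}})

;; Rewrite keys only inside the sub-map found at a known path;
;; everything outside that path is left untouched.
(def renamed
  (update-in my-map [:allow_kebab_or-snake :required-key]
             #(walk/postwalk-replace {:must_be_kebab :must-be-kebab} %)))

renamed
;; => {:allow_kebab_or-snake  {:required-key {:must-be-kebab ""}}
;;     :allow_kebab_or-snake2 {:optional-key {:required-key {:must_be_kebab ""}}}}
```

Note that the second branch keeps its snake-case key, because only the one path passed to update-in is rewritten.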

Here is a solution. Since you need to control when the conversion does/doesn't occur, you can't just use postwalk. You need to implement your own recursion and change the context from non-convert -> convert when your condition is found.
(ns tst.clj.core
(:use clj.core clojure.test tupelo.test)
(:require
[clojure.string :as str]
[clojure.pprint :refer [pprint]]
[tupelo.core :as t]
[tupelo.string :as ts]
))
(t/refer-tupelo)
(t/print-versions)
(def my-map
{:allow_kebab_or-snake {:required-key {:must_be_kebab ""}}
:allow_kebab_or-snake2 {:optional-key {:required-key {:must_be_kebab ""}}}})
(defn children->kabob? [kw]
(= kw :required-key))
(defn proc-child-maps
[ctx map-arg]
(apply t/glue
(for [curr-key (keys map-arg)]
(let [curr-val (grab curr-key map-arg)
new-ctx (if (children->kabob? curr-key)
(assoc ctx :snake->kabob true)
ctx)
out-key (if (grab :snake->kabob ctx)
(ts/kw-snake->kabob curr-key)
curr-key)
out-val (if (map? curr-val)
(proc-child-maps new-ctx curr-val)
curr-val)]
{out-key out-val}))))
(defn nested-keys->snake
[arg]
(let [ctx {:snake->kabob false}]
(if (map? arg)
(proc-child-maps ctx arg)
arg)))
The final result is shown in the unit test:
(is= (nested-keys->snake my-map)
{:allow_kebab_or-snake
{:required-key
{:must-be-kebab ""}},
:allow_kebab_or-snake2
{:optional-key
{:required-key
{:must-be-kebab ""}}}} ))
For this solution I used some of the convenience functions in the Tupelo library.
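If you would rather not pull in a library, the same context-switching recursion can be sketched in plain Clojure. This is an alternative under assumptions, not the Tupelo code above: snake->kebab and kebab-under are hypothetical names, only unqualified keyword keys are converted, and maps nested inside vectors or lists are not visited.

```clojure
(require '[clojure.string :as str])

(defn snake->kebab
  "Turns :must_be_kebab into :must-be-kebab. Non-keyword keys pass through.
  Assumes unqualified keywords (a namespace would be dropped)."
  [k]
  (if (keyword? k)
    (keyword (str/replace (name k) "_" "-"))
    k))

(defn kebab-under
  "Walks nested maps and converts keys to kebab case only below `marker`."
  ([marker m] (kebab-under marker false m))
  ([marker convert? m]
   (if (map? m)
     (into {}
           (for [[k v] m]
             ;; The marker key itself is left alone; only its descendants
             ;; are converted, so the flag is switched on for the value.
             [(if convert? (snake->kebab k) k)
              (kebab-under marker (or convert? (= k marker)) v)]))
     m)))

(kebab-under :required-key
             {:allow_kebab_or-snake  {:required-key {:must_be_kebab ""}}
              :allow_kebab_or-snake2 {:optional-key {:required-key {:must_be_kebab ""}}}})
;; => {:allow_kebab_or-snake  {:required-key {:must-be-kebab ""}}
;;     :allow_kebab_or-snake2 {:optional-key {:required-key {:must-be-kebab ""}}}}
```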

Just a left-field suggestion which may or may not work. This is a problem that can come up when dealing with SQL databases, because '-' is parsed as the minus operator and cannot be used in unquoted identifiers. However, it is common to use '-' in keywords in Clojure. Many abstraction layers used when working with SQL in Clojure take maps as arguments/bindings for prepared statements etc.
Ideally, what is needed is another layer of abstraction which converts between kebab and snake case as needed, depending on the direction you are going, i.e. to SQL or from SQL. The advantage of this approach is that you're not walking through maps making conversions: you do the conversion "on the fly" when it is needed.
Have a look at https://pupeno.com/2015/10/23/automatically-converting-case-between-sql-and-clojure/
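To make the idea concrete, here is a rough sketch of converting at the boundary. It is not taken from the linked article; ->sql-name, ->clj-key and map-keys are hypothetical helpers, and it only touches top-level keys.

```clojure
(require '[clojure.string :as str])

(defn ->sql-name
  "Keyword to SQL-style column name, e.g. :first-name becomes first_name."
  [k]
  (str/replace (name k) "-" "_"))

(defn ->clj-key
  "SQL column name to idiomatic keyword, e.g. first_name becomes :first-name."
  [s]
  (keyword (str/replace s "_" "-")))

(defn map-keys
  "Applies f to every top-level key of m."
  [f m]
  (into {} (map (fn [[k v]] [(f k) v])) m))

;; Going to SQL: convert just before building the statement.
(map-keys ->sql-name {:first-name "Ada" :last-name "Lovelace"})
;; => {"first_name" "Ada", "last_name" "Lovelace"}

;; Coming from SQL: convert each row as it is read.
(map-keys ->clj-key {"first_name" "Ada" "last_name" "Lovelace"})
;; => {:first-name "Ada", :last-name "Lovelace"}
```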

Clojure: Custom functions inside Enlive selectors?

Here is an example where I use html/text directly inside a selector vector.
(:use [net.cgrand.enlive-html :as html])
(defn fetch-url [url]
(html/html-resource (java.net.URL. url)))
(defn parse-test []
(html/select
(fetch-url "https://news.ycombinator.com/")
[:td.title :a html/text]))
Calling (parse-test) returns a data structure containing Hacker News Headlines :
("In emergency cases a passenger was selected and thrown out of the plane. [2004]"
"“Nobody expects privacy online”: Wrong."
"The SCUMM Diary: Stories behind one of the greatest game engines ever made" ...)
Cool!
Would it be possible to end the selector vector with a custom function that gives me back the list of article URLs?
Something like: [:td.title :a #(str "https://news.ycombinator.com/" (:href (:attrs %)))]
EDIT:
Here is a way to achieve this. We could write our own select function:
(defn select+ [coll selector+]
  (map
    (peek selector+)
    ;; select on the nodes passed in as coll instead of fetching the page again
    (html/select
      coll
      (pop selector+))))
(def href
(fn [node] (:href (:attrs node))))
(defn parse-test []
(select+
(fetch-url "https://news.ycombinator.com/")
[:td.title :a href]))
(parse-test)

As you suggest in your comment, I think it's clearest to keep the selection and the transformation of nodes separate.
Enlive itself provides both selectors and transformers. Selectors to find nodes, and transformers to, um, transform them. If your intended output was html, you could probably use a combination of a selector and a transformer to achieve your desired result.
However, seeing as you are just looking for data (a sequence of maps, perhaps?) - you can skip the transform bit, and just use a sequence comprehension, like this:
(defn parse-test []
(for [s (html/select
(fetch-url "https://news.ycombinator.com/")
[:td.title :a])]
{:title (first (:content s))
:link (:href (:attrs s))}))
(take 2 (parse-test))
;; => ({:title " \tStartup - Bill Watterson, a cartoonist's advice ",
:link "http://www.zenpencils.com/comic/128-bill-watterson-a-cartoonists-advice"}
{:title "Drug Agents Use Vast Phone Trove Eclipsing N.S.A.’s",
:link "http://www.nytimes.com/2013/09/02/us/drug-agents-use-vast-phone-trove-eclipsing-nsas.html?hp&_r=0&pagewanted=all"})

In Clojure, how to call xml-> with arbitrary preds

I want to create a function that allows me to pull contents from some feed, here's what I have... zf is from here
(:require
[clojure.zip :as z]
[clojure.data.zip.xml :only (attr text xml->)]
[clojure.xml :as xml ]
[clojure.contrib.zip-filter.xml :as zf]
)
(def data-url "http://api.eventful.com/rest/events/search?app_key=4H4Vff4PdrTGp3vV&keywords=music&location=Belgrade&date=Future")
(defn zipp [data] (z/xml-zip data))
(defn contents[cont & tags]
(assert (= (zf/xml-> (zipp(parsing cont)) (seq tags) text))))
but when I call it
(contents data-url :events :event :title)
I get an error
java.lang.RuntimeException: java.lang.ClassCastException: clojure.lang.ArraySeq cannot be cast to clojure.lang.IFn (NO_SOURCE_FILE:0)

(Updated in response to the comments: see end of answer for ready-made function parameterized by the tags to match.)
The following extracts the titles from the XML pointed at by the URL from the question text (tested at a Clojure 1.5.1 REPL with clojure.data.xml 0.0.7 and clojure.data.zip 0.1.1):
(require '[clojure.zip :as zip]
'[clojure.data.xml :as xml]
'[clojure.data.zip.xml :as xz]
'[clojure.java.io :as io])
(def data-url "http://api.eventful.com/rest/events/search?app_key=4H4Vff4PdrTGp3vV&keywords=music&location=Belgrade&date=Future")
(def data (-> data-url io/reader xml/parse))
(def z (zip/xml-zip data))
(mapcat (comp :content zip/node)
(xz/xml-> z
(xz/tag= :events)
(xz/tag= :event)
(xz/tag= :title)))
;; value of the above right now:
("Belgrade Early Music Festival, Gosta / Purcell: Dido & Aeneas"
"Belgrade Early Music Festival, Gosta / Purcell: Dido & Aeneas"
"Belgrade Early Music Festival, Gosta / Purcell: Dido & Aeneas"
"VIII Early Music Festival, Belgrade 2013"
"Kevlar Bikini"
"U-Recken - Tree of Life Pre event"
"Green Day"
"Smallman - Vrane Kamene (Crows Of Stone)"
"One Direction"
"One Direction in Serbia")
Some comments:
The clojure.contrib.* namespaces are all deprecated. xml-> now lives in clojure.data.zip.xml.
xml-> accepts a zip loc and a bunch of "predicates"; in this context, however, the word "predicate" has an unusual meaning of a filtering function working on zip locs. See clojure.data.zip.xml source for several functions which return such predicates; for an example of use, see above.
If you want to define a list of predicates separately, you can do that too, then use xml-> with apply:
(def loc-preds [(xz/tag= :events) (xz/tag= :event) (xz/tag= :title)])
(mapcat (comp :content zip/node) (apply xz/xml-> z loc-preds))
;; value returned as above
Update: Here's a function which takes the url and keywords naming tags as arguments and returns the content found at the tags:
(defn get-content-from-tags [url & tags]
(mapcat (comp :content zip/node)
(apply xz/xml->
(-> url io/reader xml/parse zip/xml-zip)
(for [t tags]
(xz/tag= t)))))
Calling it like so:
(get-content-from-tags data-url :events :event :title)
gives the same result as the mapcat form above.

Conditionals in Hiccup, can I make this more idiomatic?

Clojure beginner here! I added flash message support to my Hiccup code in a Noir project.
What I'm trying to do is check whether the message string for each specific flash type was set or not. If there's no message, then I don't want to display the flash element for that message.
(defpartial success-flash [msg]
[:div.alert.notice.alert-success
[:a.close {:data-dismiss "alert"} "x"]
[:div#flash_notice msg]])
(defpartial error-flash [msg]
[:div.alert.notice.alert-error
[:a.close {:data-dismiss "alert"} "x"]
[:div#flash_notice msg]])
[..]
(defpartial layout [& content]
(html5
[:head
[...]
[:body
(list
[...]
[:div.container
(let [error-msg (session/flash-get :error-message)
error-div (if (nil? error-msg) () (error-flash error-msg))
success-msg (session/flash-get :success-message)
success-div (if (nil? success-msg) () (success-flash success-msg))
warning-msg (session/flash-get :warning-message)
warning-div (if (nil? warning-msg) () (warning-flash warning-msg))]
(list error-div success-div warning-div content))])]))
Disclaimer: I completely agree that you won't likely ever be in a situation where you'll need more than one of those specific flashes on at once, but indulge me in my attempt at figuring out a better and more functional way of implementing this.
I'm confident that there's a pattern out there for handling similar situations. Basically I check the value of several expressions, do a bunch of stuff with those values, and then act based on the results. You could pull this off with a progressively more and more monstrous (cond), but my solution is at least somewhat cleaner.
Tips?

You could also use when-let.
(defpartial layout
[& contents]
(html5
[:body
(when-let [msg (session/flash-get :error-message)]
(error-flash msg))
(when-let [msg (session/flash-get :warning-message)]
(warning-flash msg))
(when-let [msg (session/flash-get :success-message)]
(success-flash msg))
contents]))
I'm not a hiccup expert, but I think this should work. I find it a little clearer on what's going on, although it's slightly more verbose.

This pattern is mapping over values and then filtering out the empty results. The example below uses keep, which does both: it applies a function to each [flash-key partial] pair and drops the nil results.
(def flash-message
[[:error-message error-flash]
[:success-message success-flash]
[:warning-message warning-flash]])
;; some->> (in clojure.core since 1.5) replaces -?>> from the deprecated clojure.contrib.core
(keep (fn [[m f]] (some->> m (session/flash-get) (f))) flash-message)
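Here is a self-contained sketch of the same idea that can be tried at a REPL without Noir or Hiccup. The flash store, flash-get and the three partials are stand-ins I made up for illustration; in the real layout you would call session/flash-get and the defpartial functions instead.

```clojure
;; Stand-ins for session/flash-get and the Hiccup partials (assumptions for
;; this sketch): the flash store is a plain map, each partial returns a vector.
(def flash {:error-message "Disk full"})   ; only one flash is set
(defn flash-get [k] (get flash k))
(defn error-flash   [msg] [:div.alert.alert-error msg])
(defn success-flash [msg] [:div.alert.alert-success msg])
(defn warning-flash [msg] [:div.alert.alert-warning msg])

(def flash-renderers
  [[:error-message   error-flash]
   [:success-message success-flash]
   [:warning-message warning-flash]])

(defn flash-divs
  "Renders one element per flash key that has a message; unset keys are skipped."
  []
  (keep (fn [[k render]]
          (when-let [msg (flash-get k)]
            (render msg)))
        flash-renderers))

(flash-divs)
;; => ([:div.alert.alert-error "Disk full"])
```

Adding a fourth flash type then only means adding one more pair to flash-renderers.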

How to parse URL parameters in Clojure?

If I have the request "size=3&mean=1&sd=3&type=pdf&distr=normal" what's the idiomatic way of writing the function (defn request->map [request] ...) that takes this request and
returns a map {:size 3, :mean 1, :sd 3, :type pdf, :distr normal}
Here is my attempt (using clojure.walk and clojure.string):
(defn request-to-map
[request]
(keywordize-keys
(apply hash-map
(split request #"(&|=)"))))
I am interested in how others would solve this problem.

Using form-decode and keywordize-keys:
(use 'ring.util.codec)
(use 'clojure.walk)
(keywordize-keys (form-decode "hello=world&foo=bar"))
{:foo "bar", :hello "world"}

Assuming you want to parse HTTP request query parameters, why not use ring? ring.middleware.params contains what you want.
The function for parameter extraction goes like this:
(defn- parse-params
"Parse parameters from a string into a map."
[^String param-string encoding]
(reduce
(fn [param-map encoded-param]
(if-let [[_ key val] (re-matches #"([^=]+)=(.*)" encoded-param)]
(assoc-param param-map
(codec/url-decode key encoding)
(codec/url-decode (or val "") encoding))
param-map))
{}
(string/split param-string #"&")))

You can do this easily with a number of Java libraries. I'd be hesitant to try to roll my own parser unless I read the URI specs carefully and made sure I wasn't missing any edge cases (e.g. params appearing in the query twice with different values). This uses jetty-util:
(import '[org.eclipse.jetty.util UrlEncoded MultiMap])
(defn parse-query-string [query]
(let [params (MultiMap.)]
(UrlEncoded/decodeTo query params "UTF-8")
(into {} params)))
user> (parse-query-string "size=3&mean=1&sd=3&type=pdf&distr=normal")
{"sd" "3", "mean" "1", "distr" "normal", "type" "pdf", "size" "3"}
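For comparison, here is a small hand-rolled sketch that covers two of those edge cases, repeated parameters and percent-encoded values, using only clojure.string and java.net.URLDecoder. query->map and decode are hypothetical names, and this is not a full implementation of the URI spec.

```clojure
(require '[clojure.string :as str])
(import 'java.net.URLDecoder)

(defn decode
  "Percent-decodes s as UTF-8 (also turns + into a space)."
  [^String s]
  (URLDecoder/decode s "UTF-8"))

(defn query->map
  "Parses a query string into a map of keyword -> value. A key that appears
  more than once gets a vector of all its values, in order."
  [query]
  (reduce
   (fn [acc pair]
     (let [[k v] (str/split pair #"=" 2)   ; limit 2 keeps '=' inside values
           k     (keyword (decode k))
           v     (decode (or v ""))]
       (update acc k (fn [old]
                       (cond
                         (nil? old)    v
                         (vector? old) (conj old v)
                         :else         [old v])))))
   {}
   (remove str/blank? (str/split query #"&"))))

(query->map "size=3&type=pdf&tag=a&tag=b&q=x%3Dy")
;; => {:size "3", :type "pdf", :tag ["a" "b"], :q "x=y"}
```

Values stay strings here; turning "3" into a number is left to the caller, since the query string itself carries no type information.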

Can also use this library for both clojure and clojurescript: https://github.com/cemerick/url
user=> (-> "a=1&b=2&c=3" cemerick.url/query->map clojure.walk/keywordize-keys)
{:a "1", :b "2", :c "3"}

Yours looks fine. I tend to overuse regexes, so I would have solved it as
(defn request-to-keywords [req]
(into {} (for [[_ k v] (re-seq #"([^&=]+)=([^&]+)" req)]
[(keyword k) v])))
(request-to-keywords "size=1&test=3NA=G")
{:size "1", :test "3NA=G"}
Edit: try to stay away from clojure.walk though. I don't think it's officially deprecated, but it's not very well maintained. (I use it plenty too, though, so don't feel too bad).

I came across this question when constructing my own site and the answer can be a bit different, and easier, if you are passing parameters internally.
Using Secretary to handle routing: https://github.com/gf3/secretary
Parameters are automatically extracted to a map in :query-params when a route match is found. The example given in the documentation:
(defroute "/users/:id" [id query-params]
(js/console.log (str "User: " id))
(js/console.log (pr-str query-params)))
(defroute #"/users/(\d+)" [id {:keys [query-params]}]
(js/console.log (str "User: " id))
(js/console.log (pr-str query-params)))
;; In both instances...
(secretary/dispatch! "/users/10?action=delete")
;; ... will log
;; User: 10
;; "{:action \"delete\"}"

You can use ring.middleware.params. Here's an example with aleph:
user=> (require '[aleph.http :as http])
user=> (require '[ring.middleware.params :refer [wrap-params]])
user=> (defn my-handler [req] (println "params:" (:params req)))
user=> (def server (http/start-server (wrap-params my-handler) {:port 8080})) ; port chosen arbitrarily
wrap-params creates an entry in the request map called :params. If you want the query parameters as keywords, you can use ring.middleware.keyword-params. wrap-params has to run first, so it must be the outermost wrapper:
user=> (require '[ring.middleware.keyword-params :refer [wrap-keyword-params]])
user=> (def server
  (http/start-server (wrap-params (wrap-keyword-params my-handler)) {:port 8081})) ; port chosen arbitrarily
However, be mindful that this includes a dependency on ring.
