Clojure, using Enlive to extract raw HTML from a selector? - clojure

I need to retrieve some raw HTML from a certain part of an HTML page.
I wrote the scraper and it grabs the appropriate div, but it returns a map of tags.
(require '[net.cgrand.enlive-html :as html])
(defn fetch-url [url]
(html/html-resource (java.net.URL. url)))
(defn parse-test []
(let [url "http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000928/"
url-data (fetch-url url)
id "a693025"]
(first (html/select url-data [(keyword (str "#" id "-why"))]))))
This outputs:
{:tag :div, :attrs {:class "section", :id "a693025-why"}, :content ({:tag :h2, :attrs nil, :content ({:tag :span, :attrs {:class "title"}, :content ("Why is this medication prescribed?")})} {:tag :p, :attrs nil, :content ("Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.")})}
How do I convert this to raw html? I couldn't find any enlive function to do this.

(apply str (html/emit* [(parse-test)]))
; => "<div class=\"section\" id=\"a693025-why\"><h2><span class=\"title\">Why is this medication prescribed?</span></h2><p>Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.</p></div>"
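For context, emit* returns a sequence of string fragments, which is why the answer joins them with apply str. The sketch below is not Enlive's implementation. It is a hand-rolled illustration (escape-html and node->html are hypothetical names made up here) of how a {:tag :attrs :content} map corresponds to markup. It assumes every element gets a closing tag, and it ignores void elements such as <br>, comments, and doctypes.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: escape the characters that are special in HTML.
(defn escape-html [s]
  (-> s
      (string/replace "&" "&amp;")
      (string/replace "<" "&lt;")
      (string/replace ">" "&gt;")
      (string/replace "\"" "&quot;")))

;; Hypothetical helper: render an Enlive-style node (or seq of nodes) as a string.
(defn node->html [node]
  (cond
    ;; Text nodes are plain strings.
    (string? node) (escape-html node)
    ;; Element nodes are maps with :tag, :attrs and :content.
    (map? node)
    (let [{:keys [tag attrs content]} node]
      (str "<" (name tag)
           (apply str (for [[k v] attrs]
                        (str " " (name k) "=\"" (escape-html v) "\"")))
           ">"
           (apply str (map node->html content))
           "</" (name tag) ">"))
    ;; Anything else is treated as a sequence of nodes.
    :else (apply str (map node->html node))))

(def sample
  {:tag :div
   :attrs {:class "section"}
   :content [{:tag :h2 :attrs nil :content ["Why?"]} "a < b"]})

(node->html sample)
;; => "<div class=\"section\"><h2>Why?</h2>a &lt; b</div>"
```

For real pages, prefer emit* as shown in the answer, since it handles the cases this sketch skips.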

Related

How to obtain paths to all the child nodes in a tree that only have leaves using clojure zippers?

Say I have a tree like this. I would like to obtain the paths to child nodes that only contain leaves and not non-leaf child nodes.
So for this tree
root
├── leaf123
├── level_a_node1
│   └── leaf456
├── level_a_node2
│   ├── level_b_node1
│   │   └── leaf987
│   └── level_b_node2
│       └── level_c_node1
│           └── leaf654
├── leaf789
└── level_a_node3
    └── leaf432
The result would be
[["root" "level_a_node1"]
["root" "level_a_node2" "level_b_node1"]
["root" "level_a_node2" "level_b_node2" "level_c_node1"]
["root" "level_a_node3"]]
I've attempted to go down to the bottom nodes and check whether the (lefts) and the (rights) are not branches, but that doesn't quite work.
(z/vector-zip ["root"
["level_a_node3" ["leaf432"]]
["level_a_node2" ["level_b_node2" ["level_c_node1" ["leaf654"]]] ["level_b_node1" ["leaf987"]] ["leaf789"]]
["level_a_node1" ["leaf456"]]
["leaf123"]])
edit: my data is actually coming in as a list of paths and I'm converting that into a tree. But maybe that is an overcomplication?
[["root" "leaf"]
["root" "level_a_node1" "leaf"]
["root" "level_a_node2" "leaf"]
["root" "level_a_node2" "level_b_node1" "leaf"]
["root" "level_a_node2" "level_b_node2" "level_c_node1" "leaf"]
["root" "level_a_node3" "leaf"]]
Hiccup-style structures are a nice place to visit, but I wouldn't want to live there. That is, they're very succinct to write, but a giant pain to manipulate programmatically, because the semantic nesting structure is not reflected in the physical structure of the nodes. So, the first thing I would do is convert to Enlive-style tree representation (or, ideally, generate Enlive to begin with):
(def hiccup
["root"
["level_a_node3" ["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456"]]
["leaf123"]])
(defn hiccup->enlive [x]
(when (vector? x)
{:tag (first x)
:content (map hiccup->enlive (rest x))}))
(def enlive (hiccup->enlive hiccup))
;; Yielding...
{:tag "root",
:content
({:tag "level_a_node3", :content ({:tag "leaf432", :content ()})}
{:tag "level_a_node2",
:content
({:tag "level_b_node2",
:content
({:tag "level_c_node1",
:content ({:tag "leaf654", :content ()})})}
{:tag "level_b_node1", :content ({:tag "leaf987", :content ()})}
{:tag "leaf789", :content ()})}
{:tag "level_a_node1", :content ({:tag "leaf456", :content ()})}
{:tag "leaf123", :content ()})}
Having done this, the last thing getting in your way is your desire to use zippers. They are a good tool for targeted traversals, where you care a lot about the structure near the node you are working on. But if all you care about is the node and its children, it is much easier to just write a simple recursive function to traverse the tree:
(defn paths-to-leaves [{:keys [tag content] :as root}]
(when (seq content)
(if (every? #(empty? (:content %)) content)
[(list tag)]
(for [child content
path (paths-to-leaves child)]
(cons tag path)))))
The ability to write recursive traversals like this is a skill that will serve you many times throughout your Clojure career (for example, a similar question I recently answered on Code Review). It turns out that a huge number of functions on trees are just: call yourself recursively on each child, and somehow combine the results, usually in a possibly-nested for loop. The hard part is just figuring out what your base case needs to be, and the correct sequence of maps/mapcats to combine the results without introducing undesired levels of nesting.
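To make the "recurse on each child, then combine" shape concrete, here are two more functions over the same Enlive-style maps. Both are illustrative sketches rather than part of the original answer; the names tree-depth and all-paths and the small sample tree are made up for this example.

```clojure
;; A tiny Enlive-style tree, used only for illustration.
(def sample-tree
  {:tag "root"
   :content [{:tag "a" :content [{:tag "x" :content []}]}
             {:tag "b" :content []}]})

;; Base case: a node without children has depth 1.
;; Combine step: take the deepest child and add one for this node.
(defn tree-depth [{:keys [content]}]
  (if (empty? content)
    1
    (inc (apply max (map tree-depth content)))))

;; Base case: a childless node yields a single path holding its own tag.
;; Combine step: prepend this node's tag to every path found below it.
(defn all-paths [{:keys [tag content]}]
  (if (empty? content)
    [(list tag)]
    (for [child content
          path  (all-paths child)]
      (cons tag path))))

(tree-depth sample-tree) ;; => 3
(all-paths sample-tree)  ;; => (("root" "a" "x") ("root" "b"))
```

Only the base case and the combining step differ between these and paths-to-leaves; the traversal itself is the same.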
If you insist on sticking with Hiccup, you can de-mangle it at the use site without too much pain:
(defn hiccup-paths-to-leaves [node]
(when (vector? node)
(let [tag (first node), content (next node)]
(if (and content (every? #(= 1 (count %)) content))
[(list tag)]
(for [child content
path (hiccup-paths-to-leaves child)]
(cons tag path))))))
But it's noticeably messier, and is work you'll have to repeat every time you work with a tree. Again I encourage you to use Enlive-style trees for your internal data representation.
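Regarding the edit in the question: if the data really arrives as a list of paths, it may be possible to skip building a tree altogether. The following is a sketch under two assumptions that are not confirmed by the question: every path ends in a leaf, and a node has non-leaf children exactly when its path is a strict prefix of some other leaf's parent path. The name leaf-only-parents is hypothetical.

```clojure
(def paths
  [["root" "leaf"]
   ["root" "level_a_node1" "leaf"]
   ["root" "level_a_node2" "leaf"]
   ["root" "level_a_node2" "level_b_node1" "leaf"]
   ["root" "level_a_node2" "level_b_node2" "level_c_node1" "leaf"]
   ["root" "level_a_node3" "leaf"]])

(defn leaf-only-parents [paths]
  (let [;; Drop the trailing leaf to get the path of each leaf's parent.
        parents (distinct (map (comp vec butlast) paths))
        ;; True when p is a proper prefix of q.
        strict-prefix? (fn [p q]
                         (and (< (count p) (count q))
                              (= p (subvec q 0 (count p)))))]
    ;; Keep parents that are not a prefix of any deeper parent,
    ;; i.e. nodes whose children are all leaves.
    (vec (remove (fn [p] (some #(strict-prefix? p %) parents))
                 parents))))

(leaf-only-parents paths)
;; => [["root" "level_a_node1"]
;;     ["root" "level_a_node2" "level_b_node1"]
;;     ["root" "level_a_node2" "level_b_node2" "level_c_node1"]
;;     ["root" "level_a_node3"]]
```

This compares every parent against every other parent, so it is quadratic in the number of distinct parents. That is fine for small inputs; for large ones a tree or a prefix set would likely be a better fit.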
You can definitely use the file api to navigate the directory. If using zipper, you can do this:
(loop [loc (vector-zip ["root"
["level_a_node3"
["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456" "leaf456b"]]
["leaf123"]])
ans nil]
(if (end? loc)
ans
(recur (next loc)
(cond->> ans
(contains-leaves-only? loc)
(cons (->> loc down path (map node)))))))
which will output this:
(("root" "level_a_node1")
("root" "level_a_node2" "level_b_node1")
("root" "level_a_node2" "level_b_node2" "level_c_node1")
("root" "level_a_node3"))
With the way you define the tree, the helper functions can be implemented as:
(def is-leaf? #(-> % down nil?))
(defn contains-leaves-only?
[loc]
(some->> loc
down ;; branch name
right ;; children list
down ;; first child
(iterate right) ;; with other siblings
(take-while identity)
(every? is-leaf?)))
UPDATE - add a lazy sequence version
(->> ["root"
["level_a_node3"
["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456" "leaf456b"]]
["leaf123"]]
vector-zip
(iterate next)
(take-while (complement end?))
(filter contains-leaves-only?)
(map #(->> % down path (map node))))
It is because zippers have so many limitations that I created the Tupelo Forest library for processing tree-like data structures. Your problem then has a simple solution:
(ns tst.tupelo.forest-examples
(:use tupelo.core tupelo.forest tupelo.test))
(with-forest (new-forest)
(let [data ["root"
["level_a_node3" ["leaf"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf"]]]
["level_b_node1" ["leaf"]]]
["level_a_node1" ["leaf"]]
["leaf"]]
root-hid (add-tree-hiccup data)
leaf-paths (find-paths-with root-hid [:** :*] leaf-path?)]
with a tree that looks like:
(hid->bush root-hid) =>
[{:tag "root"}
[{:tag "level_a_node3"}
[{:tag "leaf"}]]
[{:tag "level_a_node2"}
[{:tag "level_b_node2"}
[{:tag "level_c_node1"}
[{:tag "leaf"}]]]
[{:tag "level_b_node1"}
[{:tag "leaf"}]]]
[{:tag "level_a_node1"}
[{:tag "leaf"}]]
[{:tag "leaf"}]])
and a result like:
(format-paths leaf-paths) =>
[[{:tag "root"} [{:tag "level_a_node3"} [{:tag "leaf"}]]]
[{:tag "root"} [{:tag "level_a_node2"} [{:tag "level_b_node2"} [{:tag "level_c_node1"} [{:tag "leaf"}]]]]]
[{:tag "root"} [{:tag "level_a_node2"} [{:tag "level_b_node1"} [{:tag "leaf"}]]]]
[{:tag "root"} [{:tag "level_a_node1"} [{:tag "leaf"}]]]
[{:tag "root"} [{:tag "leaf"}]]]))))
There are many choices after this depending on the next steps in the processing chain.

Clojure - ajax.core POST

I am having some trouble with POST from ajax.
I want to add a user to my database, so I am using POST. The data I want to send has the form {:id id :pass pass}. This is my POST:
(defn add-user! [user]
(POST "/add-user!"
{:params user}))
All I want to do is pass data in the form specified above to this POST so that it is sent to the database. I know that the argument to POST is in the right form, and that the database queries and my routes are correct, but I've made a mistake with the POST and I cannot figure out what it is.
I am calling add-user! by
(defonce fields (atom {}))
(defn add-user! [user]
(POST "/add-user!"
{:params user}))
(defn content
[]
[:div
[:div
[:p "Enter Name:"
[:input
{:type :text
:name :name
:on-change #(swap! fields assoc :id (-> % .-target .-value))
:value (:id @fields)}]]
[:p "Enter Pass:"
[:input
{:type :text
:name :pass
:on-change #(swap! fields assoc :pass (-> % .-target .-value))
:value (:pass @fields)}]]
[:input
{:type :submit
:on-click #(do
(add-user! @fields))
:value "Enter"}]]
[:div
[:p "Id is " (:id @fields)]
[:p "Pass is " (:pass @fields)]]])
My query to the database in a clj file is
(defn add-user! [user]
(sql/insert! db :users user))
where sql is [clojure.java.jdbc :as sql]
There is not really enough information here to help you debug this fully, but I suspect that you need to modify your POST to:
(defn add-user! [user]
(POST "/add-user!"
{:format :json
:params user}))
If you don't provide :format, cljs-ajax defaults to sending Transit data, which would definitely confuse a server expecting JSON.
:format - specifies the format for the body of the request (Transit, JSON, etc.). Also sets the appropriate Content-Type header. Defaults to :transit if not provided. - JulianBirch/cljs-ajax#getpostput
Happened to me with this code:
(POST "/admin/tests/load"
{:params {:test-id "83"}
:headers {"x-csrf-token" csrf-field}
:handler (fn [r] (do (.log js/console r) (swap! test-state r)))
:format :json
:response-format :json
:error-handler (fn [r] (prn r))})))
"params" always showed up empty "{}". Then I tried:
(POST "/admin/tests/load"
{:params {:test-id "83"}
:headers {"x-csrf-token" csrf-field}} )
and everything started working, even after adding the other options back. I know, weird.

I can't figure out how to log/infof the string value of a LazySeq [duplicate]

LazySeq is kicking my butt when I try to log its value.
(require '[clojure.tools.logging :as log])
(def layer->multipart [{:name "layer-name" :content "veg"} {:name "layer-tag" :content "abs"}])
(def field->multipart [{:name "field-id" :content "12345"} {:name "field-version" :content "v1"}])
(log/infof "concat is %s" (concat layer->multipart field->multipart))
; => 2016-02-16 16:31:11,707 level=INFO [nREPL-worker-38] user:288 - concat is clojure.lang.LazySeq@87177bed
; WTF is clojure.lang.LazySeq@87177bed?
I've checked the "How to convert lazy sequence to non-lazy in Clojure" answer, and it suggests that all I need to do is doall and all my dreams will come true. But alas...no.
(log/infof "concat is %s" (doall (concat layer->multipart field->multipart)))
; => 2016-02-16 16:31:59,958 level=INFO [nREPL-worker-40] user:288 - concat is clojure.lang.LazySeq@87177bed
; still clojure.lang.LazySeq@87177bed is not what I wanted
I've observed that (pr-str (concat layer->multipart field->multipart)) does what I want, but it makes no sense: the docs for pr-str say something about "pr to a string", and the docs for pr say "Prints the object(s) to the output stream that is the current value of *out*." I don't want anything going to *out*; I just want the string value returned so the logger can use it!
(log/infof "concat is %s" (pr-str (concat layer->multipart field->multipart)))
; => 2016-02-16 16:42:02,927 level=INFO [nREPL-worker-1] user:288 - concat is ({:content "veg", :name "layer-name"} {:content "abs", :name "layer-tag"} {:content "12345", :name "field-id"} {:content "v1", :name "field-version"})
; this is what I wanted but I don't want anything going to *out*...or do I?
What do I have to do to get the effect of the pr-str variant without worrying about anything inadvertently getting dumped to stdout (I'm guessing that is what *out* is)? I want the lazy sequence to be fully realized for logging (it never gets too big...it only ends up as lazy as an accident of concat).
How can I log the full value of my LazySeq?
The problem is that behind the scenes the logger is calling .toString on your lazy sequence. Try this:
user=> (.toString (concat layer->multipart field->multipart))
;; "clojure.lang.LazySeq@87177bed"
What you really want is to convert the contents of the sequence into a string. For example:
(log/infof "concat is %s" (apply str (concat layer->multipart field->multipart)))
;; Feb 16, 2016 5:10:19 PM clojure.tools.logging$eval420$fn__424 invoke
;; INFO: concat is {:name "layer-name", :content "veg"}{:name "layer-tag", :content "abs"}{:name "field-id", :content "12345"}{:name "field-version", :content "v1"}
By the way, pr-str is fine too. As its name says, it prints to a string, not to *out*, and that string is what you pass to the logger.
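The differences can be checked at a REPL without a logger. The snippet below is a small sketch using a shortened stand-in for the question's data; the var names (xs, via-str, and so on) are made up for illustration.

```clojure
;; concat returns a LazySeq, just like in the question.
(def xs (concat [{:name "layer-name"}] [{:name "field-id"}]))

;; str calls .toString, which LazySeq does not override:
;; the result is the class name followed by @ and a hash.
(def via-str (str xs))

;; doall realizes the elements but returns the same LazySeq object,
;; so its string form is still the class name and hash.
(def via-doall (str (doall xs)))

;; pr-str renders the elements in reader syntax and returns the string.
(def via-pr-str (pr-str xs))
;; => "({:name \"layer-name\"} {:name \"field-id\"})"

;; Pouring the elements into a vector also works, because a vector's
;; .toString prints its contents.
(def via-vec (str (vec xs)))
;; => "[{:name \"layer-name\"} {:name \"field-id\"}]"

;; pr-str captures its output internally, so nothing reaches *out*.
(def leaked (with-out-str (pr-str xs)))
;; => ""
```

Either pr-str or (vec ...) can be passed to log/infof with %s; pr-str keeps the parenthesized sequence notation, while vec switches to square brackets.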

How to get the href attribute value using enlive

I am new to Clojure and Enlive.
I have html like this
<SPAN CLASS="f10">....</SPAN></DIV><DIV CLASS="p5"><SPAN CLASS="f10">.....</SPAN>
I tried this
(html/select (fetch-url base-url) [:span.f10 [:a (html/attr? :href)]])
but it returns this
({:tag :a,
:attrs
{:target "detail",
:title
"...",
:href
"value1"},
:content ("....")}
{:tag :a,
:attrs
{:target "detail",
:title
"....",
:href
"value2"},
:content
("....")}
What I want is just value1 and value2 in the output. How can I accomplish that?
select returns the matched nodes, but you still need to extract their href attributes. To do that, you can use attr-values:
(mapcat #(html/attr-values % :href)
(html/select (html/html-resource "sample.html") [:span.f10 (html/attr? :href)]))
I use this little function because the Enlive attr functions don't return the values. You are basically just walking the hash to get the value.
user=> (def data {:tag :a, :attrs {:target "detail", :title "...", :href "value1"}})
#'user/data
user=> (defn- get-attr [node attr]
#_=> (some-> node :attrs attr))
#'user/get-attr
user=> (get-attr data :href)
"value1"
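Since select returns plain maps, the same lookup can be applied to the whole result with ordinary sequence functions. The following is a minimal sketch on hand-written node maps shaped like the output in the question; hrefs is a hypothetical helper name, and no Enlive call is involved.

```clojure
;; Stand-in for the result of html/select: a seq of node maps.
(def nodes
  [{:tag :a :attrs {:target "detail" :href "value1"} :content ["...."]}
   {:tag :a :attrs {:target "detail" :href "value2"} :content ["...."]}
   ;; A node without attributes, to show that it is skipped.
   {:tag :a :attrs nil :content []}])

;; keep drops the nil results for nodes that have no :href.
(defn hrefs [nodes]
  (keep #(get-in % [:attrs :href]) nodes))

(hrefs nodes)
;; => ("value1" "value2")
```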

How to select nth element of particular type in enlive?

I am trying to scrape some data from a page with a table based layout. So, to get some of the data I need to get something like 3rd table inside 2nd table inside 5th table inside 1st table inside body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other selector steps. To make matters worse, the page in question has a single top level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?
For nth-of-type, does the following example help?
user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
"<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
No idea about the second issue. Your approach seems to work with a naive test:
user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})
Any chance of looking at your actual HTML?
Update: (in response to the comment)
Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:
user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})