How to get the href attribute value using enlive - clojure

I am new to Clojure and Enlive.
I have html like this
<SPAN CLASS="f10">....</SPAN></DIV><DIV CLASS="p5"><SPAN CLASS="f10">.....</SPAN>
I tried this
(html/select (fetch-url base-url) [:span.f10 [:a (html/attr? :href)]])
but it returns this
({:tag :a,
:attrs
{:target "detail",
:title
"...",
:href
"value1"},
:content ("....")}
{:tag :a,
:attrs
{:target "detail",
:title
"....",
:href
"value2"},
:content
("....")}
What I want is just value1 and value2 in the output. How can I accomplish this?

select returns the matched nodes, but you still need to extract their href attributes. To do that, you can use attr-values:
(mapcat #(html/attr-values % :href)
        (html/select (html/html-resource "sample.html")
                     [:span.f10 (html/attr? :href)]))

I use this little function because the Enlive attr functions don't return the values. You are basically just walking the hash to get the value.
user=> (def data {:tag :a, :attrs {:target "detail", :title "...", :href "value1"}})
#'user/data
user=> (defn- get-attr [node attr]
#_=> (some-> node :attrs attr))
#'user/get-attr
user=> (get-attr data :href)
"value1"
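Applied to the whole result of select, the same map lookup can be sketched as follows. The node literals below stand in for what select returns in the question, and hrefs is a hypothetical helper name, not part of Enlive.

```clojure
;; Stand-in for the seq of nodes returned by html/select in the question.
(def nodes
  [{:tag :a
    :attrs {:target "detail" :title "..." :href "value1"}
    :content ["...."]}
   {:tag :a
    :attrs {:target "detail" :title "...." :href "value2"}
    :content ["...."]}])

(defn hrefs
  "Returns the :href of every node that has one (hypothetical helper)."
  [nodes]
  (keep #(get-in % [:attrs :href]) nodes))

(hrefs nodes)
;; => ("value1" "value2")
```

keep drops nodes without an :href, so the helper is safe to use on a less specific selector too.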

Related

How to obtain paths to all the child nodes in a tree that only have leaves using clojure zippers?

Say I have a tree like this. I would like to obtain the paths to child nodes that only contain leaves and not non-leaf child nodes.
So for this tree
root
├──leaf123
├──level_a_node1
│  └──leaf456
├──level_a_node2
│  ├──level_b_node1
│  │  └──leaf987
│  └──level_b_node2
│     └──level_c_node1
│        └──leaf654
├──leaf789
└──level_a_node3
   └──leaf432
The result would be
[["root" "level_a_node1"]
["root" "level_a_node2" "level_b_node1"]
["root" "level_a_node2" "level_b_node2" "level_c_node1"]
["root" "level_a_node3"]]
I've attempted to go down to the bottom nodes and check whether the (lefts) and the (rights) are not branches, but that doesn't quite work.
(z/vector-zip ["root"
["level_a_node3" ["leaf432"]]
["level_a_node2" ["level_b_node2" ["level_c_node1" ["leaf654"]]] ["level_b_node1" ["leaf987"]] ["leaf789"]]
["level_a_node1" ["leaf456"]]
["leaf123"]])
edit: my data is actually coming in as a list of paths and I'm converting that into a tree. But maybe that is an overcomplication?
[["root" "leaf"]
["root" "level_a_node1" "leaf"]
["root" "level_a_node2" "leaf"]
["root" "level_a_node2" "level_b_node1" "leaf"]
["root" "level_a_node2" "level_b_node2" "level_c_node1" "leaf"]
["root" "level_a_node3" "leaf"]]
Hiccup-style structures are a nice place to visit, but I wouldn't want to live there. That is, they're very succinct to write, but a giant pain to manipulate programmatically, because the semantic nesting structure is not reflected in the physical structure of the nodes. So, the first thing I would do is convert to Enlive-style tree representation (or, ideally, generate Enlive to begin with):
(def hiccup
["root"
["level_a_node3" ["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456"]]
["leaf123"]])
(defn hiccup->enlive [x]
  (when (vector? x)
    {:tag (first x)
     :content (map hiccup->enlive (rest x))}))
(def enlive (hiccup->enlive hiccup))
;; Yielding...
{:tag "root",
:content
({:tag "level_a_node3", :content ({:tag "leaf432", :content ()})}
{:tag "level_a_node2",
:content
({:tag "level_b_node2",
:content
({:tag "level_c_node1",
:content ({:tag "leaf654", :content ()})})}
{:tag "level_b_node1", :content ({:tag "leaf987", :content ()})}
{:tag "leaf789", :content ()})}
{:tag "level_a_node1", :content ({:tag "leaf456", :content ()})}
{:tag "leaf123", :content ()})}
Having done this, the last thing getting in your way is your desire to use zippers. They are a good tool for targeted traversals, where you care a lot about the structure near the node you are working on. But if all you care about is the node and its children, it is much easier to just write a simple recursive function to traverse the tree:
(defn paths-to-leaves [{:keys [tag content] :as root}]
  (when (seq content)
    (if (every? #(empty? (:content %)) content)
      [(list tag)]
      (for [child content
            path (paths-to-leaves child)]
        (cons tag path)))))
The ability to write recursive traversals like this is a skill that will serve you many times throughout your Clojure career (for example, a similar question I recently answered on Code Review). It turns out that a huge number of functions on trees are just: call yourself recursively on each child, and somehow combine the results, usually in a possibly-nested for loop. The hard part is just figuring out what your base case needs to be, and the correct sequence of maps/mapcats to combine the results without introducing undesired levels of nesting.
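As a quick check of the base case, here is the same pair of functions run on a smaller tree. The tree itself is made up for illustration; the functions are the ones defined above, repeated so the snippet runs on its own.

```clojure
(defn hiccup->enlive [x]
  (when (vector? x)
    {:tag (first x)
     :content (map hiccup->enlive (rest x))}))

(defn paths-to-leaves [{:keys [tag content]}]
  (when (seq content)                              ; a leaf contributes no paths
    (if (every? #(empty? (:content %)) content)    ; all children are leaves
      [(list tag)]
      (for [child content
            path (paths-to-leaves child)]
        (cons tag path)))))

;; Illustrative tree: "a" and "c" have only leaves; "b" and "root" are mixed.
(def small-tree
  ["root"
   ["a" ["leaf1"]]
   ["b" ["c" ["leaf2"]] ["leaf3"]]
   ["leaf4"]])

(paths-to-leaves (hiccup->enlive small-tree))
;; => (("root" "a") ("root" "b" "c"))
```

Note that "b" is not reported: it has a leaf child, but also a non-leaf child, so it falls through to the recursive branch.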
If you insist on sticking with Hiccup, you can de-mangle it at the use site without too much pain:
(defn hiccup-paths-to-leaves [node]
  (when (vector? node)
    (let [tag (first node), content (next node)]
      (if (and content (every? #(= 1 (count %)) content))
        [(list tag)]
        (for [child content
              path (hiccup-paths-to-leaves child)]
          (cons tag path))))))
But it's noticeably messier, and is work you'll have to repeat every time you work with a tree. Again I encourage you to use Enlive-style trees for your internal data representation.
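If the data really does arrive as a list of root-to-leaf paths, as the edit to the question says, the tree can arguably be skipped altogether. A minimal sketch, assuming every path is a vector ending in a leaf name; leaf-only-parents is a hypothetical name introduced here:

```clojure
(defn leaf-only-parents
  "Given root-to-leaf paths (vectors), returns the distinct parent paths
   whose children are all leaves."
  [paths]
  (let [parents (distinct (map #(vec (butlast %)) paths))
        ;; A parent has a non-leaf child if some path continues at least
        ;; two steps below it.
        has-branch-child?
        (fn [parent]
          (some (fn [p]
                  (and (> (count p) (inc (count parent)))
                       (= parent (subvec p 0 (count parent)))))
                paths))]
    (remove has-branch-child? parents)))

(leaf-only-parents
 [["root" "leaf"]
  ["root" "level_a_node1" "leaf"]
  ["root" "level_a_node2" "leaf"]
  ["root" "level_a_node2" "level_b_node1" "leaf"]
  ["root" "level_a_node2" "level_b_node2" "level_c_node1" "leaf"]
  ["root" "level_a_node3" "leaf"]])
;; => (["root" "level_a_node1"]
;;     ["root" "level_a_node2" "level_b_node1"]
;;     ["root" "level_a_node2" "level_b_node2" "level_c_node1"]
;;     ["root" "level_a_node3"])
```

This compares every parent against every path, which is fine for small inputs; for large ones, building the tree once and traversing it is likely the better trade.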
You can definitely use the file api to navigate the directory. If using zipper, you can do this:
(loop [loc (vector-zip ["root"
["level_a_node3"
["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456" "leaf456b"]]
["leaf123"]])
ans nil]
(if (end? loc)
ans
(recur (next loc)
(cond->> ans
(contains-leaves-only? loc)
(cons (->> loc down path (map node)))))))
which will output this:
(("root" "level_a_node1")
("root" "level_a_node2" "level_b_node1")
("root" "level_a_node2" "level_b_node2" "level_c_node1")
("root" "level_a_node3"))
With the way you define the tree, the helper functions can be implemented as:
(def is-leaf? #(-> % down nil?))
(defn contains-leaves-only?
  [loc]
  (some->> loc
           down                  ;; branch name
           right                 ;; children list
           down                  ;; first child
           (iterate right)       ;; with its other siblings
           (take-while identity)
           (every? is-leaf?)))
UPDATE - add a lazy sequence version
(->> ["root"
["level_a_node3"
["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456" "leaf456b"]]
["leaf123"]]
vector-zip
(iterate next)
(take-while (complement end?))
(filter contains-leaves-only?)
(map #(->> % down path (map node))))
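The snippets in this answer call the clojure.zip functions unqualified (next, in particular, would otherwise resolve to clojure.core/next). Below is a namespaced sketch of the same lazy version, using a smaller made-up tree. leaf-only-paths is a hypothetical wrapper name, and the path is built with first on each ancestor vector rather than the (map node) trick.

```clojure
(require '[clojure.zip :as z])

(defn leaf? [loc]
  (nil? (z/down loc)))

(defn contains-leaves-only? [loc]
  (some->> loc
           z/down                 ; branch name
           z/right                ; first vector after the name
           z/down                 ; its first element
           (iterate z/right)      ; ...and the siblings to its right
           (take-while identity)
           (every? leaf?)))

(defn leaf-only-paths [tree]
  (->> (z/vector-zip tree)
       (iterate z/next)
       (take-while (complement z/end?))
       (filter contains-leaves-only?)
       ;; z/path yields the ancestor vectors; each one's first item is its name
       (map (fn [loc] (mapv first (z/path (z/down loc)))))))

(leaf-only-paths
 ["root"
  ["a" ["leaf1"]]
  ["b" ["c" ["leaf2"]] ["leaf3"]]
  ["leaf4"]])
;; => (["root" "a"] ["root" "b" "c"])
```

One caveat: the helper only inspects the first vector after the branch name, so it depends on non-leaf children being listed before leaf children, as they are in the trees used here.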
It is because zippers have so many limitations that I created the Tupelo Forest library for processing tree-like data structures. Your problem then has a simple solution:
(ns tst.tupelo.forest-examples
(:use tupelo.core tupelo.forest tupelo.test))
(with-forest (new-forest)
(let [data ["root"
["level_a_node3" ["leaf"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf"]]]
["level_b_node1" ["leaf"]]]
["level_a_node1" ["leaf"]]
["leaf"]]
root-hid (add-tree-hiccup data)
leaf-paths (find-paths-with root-hid [:** :*] leaf-path?)]
with a tree that looks like:
(hid->bush root-hid) =>
[{:tag "root"}
[{:tag "level_a_node3"}
[{:tag "leaf"}]]
[{:tag "level_a_node2"}
[{:tag "level_b_node2"}
[{:tag "level_c_node1"}
[{:tag "leaf"}]]]
[{:tag "level_b_node1"}
[{:tag "leaf"}]]]
[{:tag "level_a_node1"}
[{:tag "leaf"}]]
[{:tag "leaf"}]])
and a result like:
(format-paths leaf-paths) =>
[[{:tag "root"} [{:tag "level_a_node3"} [{:tag "leaf"}]]]
[{:tag "root"} [{:tag "level_a_node2"} [{:tag "level_b_node2"} [{:tag "level_c_node1"} [{:tag "leaf"}]]]]]
[{:tag "root"} [{:tag "level_a_node2"} [{:tag "level_b_node1"} [{:tag "leaf"}]]]]
[{:tag "root"} [{:tag "level_a_node1"} [{:tag "leaf"}]]]
[{:tag "root"} [{:tag "leaf"}]]]))))
There are many choices after this depending on the next steps in the processing chain.

Clojure tools.analyzer - identifying last leaf node?

I'm struggling to come up with a reliable solution to a problem I'm having using tools.analyzer.
What I'm trying to achieve is, given an AST node, to find the last/furthest node in the tree. E.g. if the following code were analyzed: (def a (do (+ 1 2) 3))
Is there a reliable way of marking the node that has the value "3" as the last node in this tree? Essentially what I'm attempting to do is work out which form will eventually be bound to the var a.
Questions like the above are the reason I created the tupelo.forest library a few years ago.
You may wish to view:
The home page of the lib
The API docs
The Clojure/Conj talk
Many live examples
To get started, let's add the data as a tree, after putting it into hiccup format. Here is an outline of how to proceed, with unit tests showing the result of the operations:
(ns tst.tupelo.forest-examples
(:use tupelo.core tupelo.forest tupelo.test))
(dotest-focus
(hid-count-reset)
(with-forest (new-forest)
(let [data-orig (quote (def a (do (+ 1 2) 3)))
data-vec (unlazy data-orig)
root-hid (add-tree-hiccup data-vec)
all-paths (find-paths root-hid [:** :*])
max-len (apply max (mapv #(count %) all-paths))
paths-max-len (keep-if #(= max-len (count %)) all-paths)]
Here is the result:
(is= data-vec (quote [def a [do [+ 1 2] 3]]))
(is= (hid->bush root-hid)
(quote
[{:tag def}
[{:tag :tupelo.forest/raw, :value a}]
[{:tag do}
[{:tag +}
[{:tag :tupelo.forest/raw, :value 1}]
[{:tag :tupelo.forest/raw, :value 2}]]
[{:tag :tupelo.forest/raw, :value 3}]]]))
(is= all-paths
[[1007]
[1007 1001]
[1007 1006]
[1007 1006 1004]
[1007 1006 1004 1002]
[1007 1006 1004 1003]
[1007 1006 1005]])
(is= paths-max-len
[[1007 1006 1004 1002]
[1007 1006 1004 1003]])
(nl)
(is= (format-paths paths-max-len)
(quote [[{:tag def}
[{:tag do} [{:tag +} [{:tag :tupelo.forest/raw, :value 1}]]]]
[{:tag def}
[{:tag do} [{:tag +} [{:tag :tupelo.forest/raw, :value 2}]]]]])))))
Depending on your specific goal, you can continue the processing further.
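Note that the longest paths above end at 1 and 2, the deepest leaves, rather than at 3. If "last" means the form in tail position, i.e. the one whose value ends up bound to the var, a minimal sketch on the raw form (not on a tools.analyzer AST) could look like this. tail-form is a hypothetical helper and only understands def and do; other forms are returned as they are.

```clojure
(defn tail-form
  "Follows def and do down to the form in tail position.
   Any other form is returned unchanged."
  [form]
  (if (seq? form)
    (case (first form)
      (def do) (tail-form (last form))
      form)
    form))

(tail-form '(def a (do (+ 1 2) 3)))
;; => 3

(tail-form '(def b (do (println "hi") (+ 1 2))))
;; => (+ 1 2)
```

Extending the case with let, if, and similar forms follows the same pattern, though if has two tail positions, so the function would then need to return a collection.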

Clojure, using Enlive to extract raw HTML from a selector?

I need to retrieve some raw HTML from a certain part of an HTML page.
I wrote the scraper and it grabs the appropriate div, but it returns a map of tags.
(:use [net.cgrand.enlive-html :as html])
(defn fetch-url [url]
(html/html-resource (java.net.URL. url)))
(defn parse-test []
(let [url "http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000928/"
url-data (fetch-url url)
id "a693025"]
(first (html/select url-data [(keyword (str "#" id "-why"))]))))
This outputs:
{:tag :div, :attrs {:class "section", :id "a693025-why"}, :content ({:tag :h2, :attrs nil, :content ({:tag :span, :attrs {:class "title"}, :content ("Why is this medication prescribed?")})} {:tag :p, :attrs nil, :content ("Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.")})}
How do I convert this to raw html? I couldn't find any enlive function to do this.
(apply str (html/emit* [(parse-test)]))
; => "<div class=\"section\" id=\"a693025-why\"><h2><span class=\"title\">Why is this medication prescribed?</span></h2><p>Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.</p></div>"

combine multiple html fragment files with enlive, clojure

I have multiple HTML files that need to be combined into a single HTML file. These files are fragments such as the header and footer, which are shared by several pages. I'm using Enlive's html-resource function, but it inserts the missing html tags into the final file, which I don't want.
Following is the output map,
({:tag :html, :attrs nil, :content (
{:tag :head, :attrs nil, :content (
{:tag :meta, :attrs {:content text/html; charset=utf-8, :http-equiv Content-Type}, :content ()}
{:tag :title, :attrs nil, :content (HewaniLife | Changing The Way You Live)}
{:tag :link, :attrs {:href styles/main.css, :rel stylesheet, :type text/css}, :content ()} )}
{:tag :body, :attrs nil, :content (
{:tag :html, :attrs nil, :content ({:tag :body, :attrs nil, :content ({:tag :div, :attrs {:id header}, :content (
{:tag :h1, :attrs nil, :content ({:tag :a, :attrs {:href index.xhtml, :id logo}, :content (
{:tag :span, :attrs {:class img-replace}, :content (hewaniLife)})})}
{:tag :div, :attrs {:id main-nav}, :content (
{:tag :ul, :attrs nil, :content (
{:tag :li, :attrs nil, :content ({:tag :a, :attrs {:href login.xhtml, :id btn-login}, :content (
{:tag :span, :attrs {:class img-replace}, :content (Login)})})}
{:tag :li, :attrs nil, :content ({:tag :a, :attrs {:href index.xhtml, :id btn-home}, :content (
{:tag :span, :attrs {:class img-replace}, :content (Home)})})}
{:tag :li, :attrs nil, :content ({:tag :a, :attrs {:href search.xhtml, :id btn-search}, :content (
{:tag :span, :attrs {:class img-replace}, :content (Search)})})})})}
{:type :comment, :data end of div#main-nav }
{:tag :br, :attrs {:class clear-all}, :content nil})} {:type :comment, :data end of div#header })})})})}
Here, you can see the html tags nested when I insert the files.
Is there any way to insert these files without that?
Has anybody used another method for this?
You should use defsnippet instead and specify which parts are of interest to you.
All your fragments can even reside in a single page, and defsnippet will pluck the different fragments out.
html-snippet is mainly intended for playing at the REPL.
I have found a function in Enlive named html-snippet. You can use it to combine multiple HTML fragments.

How to select nth element of particular type in enlive?

I am trying to scrape some data from a page with a table based layout. So, to get some of the data I need to get something like 3rd table inside 2nd table inside 5th table inside 1st table inside body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other selector steps. To make matters worse, the page in question has a single top level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?
For nth-of-type, does the following example help?
user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
"<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
No idea about the second issue. Your approach seems to work with a naive test:
user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})
Any chance of looking at your actual HTML?
Update: (in response to the comment)
Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:
user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})