enlive: smashing vectors of nodes together - clojure

So I have finally realized that I can use selectors to limit the portions of the page nodes that enlive transforms; that way I can create vectors of non-intersecting nodes.
Lots of words to say:
(defn b-content-transform []
  (let [b-area (eh/select global-page [:.b])] ;; cuts out all irrelevant nodes
    (eh/transform b-area [:.b]
                  (eh/clone-for [i (range numberOfB)]
                                (eh/content (b-sample-content i))))))
So this returns something like..
[{:tag :div, :attrs {:class "b"}, :content ({:tag :div, :attrs {:id "b0", :class "topB"}]
Which is excellent, enlive nodes yay!
Now I have several transforms that act the same way.
My question is: how can I mash all the resultant vectors (?) together?

Well, it turns out there is a knee-slappingly simple solution:
(concat transform1 transform2 transform3)
then pass the result to enlive-html/emit*.
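A minimal sketch of the idea, using hand-written node maps in place of real transform results (the names `b-nodes` and `c-nodes` are hypothetical, and the node contents are made up for illustration). Since each transform returns a sequence of plain node maps, `concat` is enough to join them:

```clojure
;; Hypothetical stand-ins for the results of two separate transforms.
(def b-nodes [{:tag :div :attrs {:class "b"} :content ["b content"]}])
(def c-nodes [{:tag :div :attrs {:class "c"} :content ["c content"]}])

;; concat yields one lazy seq containing the nodes of both vectors, in order.
(def all-nodes (concat b-nodes c-nodes))

;; With Enlive on the classpath, the combined seq can then be rendered:
;; (apply str (net.cgrand.enlive-html/emit* all-nodes))
```

The rendering call is left as a comment so that the snippet runs with clojure.core alone.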

Related

How to obtain paths to all the child nodes in a tree that only have leaves using clojure zippers?

Say I have a tree like this. I would like to obtain the paths to child nodes that only contain leaves and not non-leaf child nodes.
So for this tree
root
├──leaf123
├──level_a_node1
│   ├──leaf456
├──level_a_node2
│   ├──level_b_node1
│   │   └──leaf987
│   └──level_b_node2
│       └──level_c_node1
│           └──leaf654
├──leaf789
└──level_a_node3
    └──leaf432
The result would be
[["root" "level_a_node1"]
["root" "level_a_node2" "level_b_node1"]
["root" "level_a_node2" "level_b_node2" "level_c_node1"]
["root" "level_a_node3"]]
I've attempted to go down to the bottom nodes and check if the (lefts) and the (rights) are not branches, but that doesn't quite work.
(z/vector-zip ["root"
["level_a_node3" ["leaf432"]]
["level_a_node2" ["level_b_node2" ["level_c_node1" ["leaf654"]]] ["level_b_node1" ["leaf987"]] ["leaf789"]]
["level_a_node1" ["leaf456"]]
["leaf123"]])
edit: my data is actually coming in as a list of paths and I'm converting that into a tree. But maybe that is an overcomplication?
[["root" "leaf"]
["root" "level_a_node1" "leaf"]
["root" "level_a_node2" "leaf"]
["root" "level_a_node2" "level_b_node1" "leaf"]
["root" "level_a_node2" "level_b_node2" "level_c_node1" "leaf"]
["root" "level_a_node3" "leaf"]]
Hiccup-style structures are a nice place to visit, but I wouldn't want to live there. That is, they're very succinct to write, but a giant pain to manipulate programmatically, because the semantic nesting structure is not reflected in the physical structure of the nodes. So, the first thing I would do is convert to Enlive-style tree representation (or, ideally, generate Enlive to begin with):
(def hiccup
["root"
["level_a_node3" ["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456"]]
["leaf123"]])
(defn hiccup->enlive [x]
  (when (vector? x)
    {:tag (first x)
     :content (map hiccup->enlive (rest x))}))
(def enlive (hiccup->enlive hiccup))
;; Yielding...
{:tag "root",
:content
({:tag "level_a_node3", :content ({:tag "leaf432", :content ()})}
{:tag "level_a_node2",
:content
({:tag "level_b_node2",
:content
({:tag "level_c_node1",
:content ({:tag "leaf654", :content ()})})}
{:tag "level_b_node1", :content ({:tag "leaf987", :content ()})}
{:tag "leaf789", :content ()})}
{:tag "level_a_node1", :content ({:tag "leaf456", :content ()})}
{:tag "leaf123", :content ()})}
Having done this, the last thing getting in your way is your desire to use zippers. They are a good tool for targeted traversals, where you care a lot about the structure near the node you are working on. But if all you care about is the node and its children, it is much easier to just write a simple recursive function to traverse the tree:
(defn paths-to-leaves [{:keys [tag content] :as root}]
  (when (seq content)
    (if (every? #(empty? (:content %)) content)
      [(list tag)]
      (for [child content
            path (paths-to-leaves child)]
        (cons tag path)))))
The ability to write recursive traversals like this is a skill that will serve you many times throughout your Clojure career (for example, a similar question I recently answered on Code Review). It turns out that a huge number of functions on trees are just: call yourself recursively on each child, and somehow combine the results, usually in a possibly-nested for loop. The hard part is just figuring out what your base case needs to be, and the correct sequence of maps/mapcats to combine the results without introducing undesired levels of nesting.
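To illustrate the same pattern on a slightly different problem, here is a small sketch that collects the path to every leaf rather than to every leaf-only parent. The tree and the name `all-leaf-paths` are made up for this example; only the base case differs from the function above.

```clojure
;; Enlive-style nodes: {:tag ..., :content (child nodes)}
(def tree
  {:tag "root"
   :content [{:tag "a" :content [{:tag "leaf1" :content []}]}
             {:tag "b" :content [{:tag "c" :content [{:tag "leaf2" :content []}]}
                                 {:tag "leaf3" :content []}]}]})

(defn all-leaf-paths
  "Base case: a node without children is a leaf and yields one path, itself.
  Recursive case: prefix this node's tag to every path found under each child."
  [{:keys [tag content]}]
  (if (empty? content)
    [[tag]]
    (for [child content
          path  (all-leaf-paths child)]
      (into [tag] path))))

(all-leaf-paths tree)
;; => (["root" "a" "leaf1"] ["root" "b" "c" "leaf2"] ["root" "b" "leaf3"])
```

The two-binding `for` does the work of a mapcat here: it flattens the per-child results into one sequence of paths without adding a level of nesting.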
If you insist on sticking with Hiccup, you can de-mangle it at the use site without too much pain:
(defn hiccup-paths-to-leaves [node]
  (when (vector? node)
    (let [tag (first node), content (next node)]
      (if (and content (every? #(= 1 (count %)) content))
        [(list tag)]
        (for [child content
              path (hiccup-paths-to-leaves child)]
          (cons tag path))))))
But it's noticeably messier, and is work you'll have to repeat every time you work with a tree. Again I encourage you to use Enlive-style trees for your internal data representation.
You can definitely use the file api to navigate the directory. If using zipper, you can do this:
(loop [loc (vector-zip ["root"
["level_a_node3"
["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456" "leaf456b"]]
["leaf123"]])
ans nil]
(if (end? loc)
ans
(recur (next loc)
(cond->> ans
(contains-leaves-only? loc)
(cons (->> loc down path (map node)))))))
which will output this:
(("root" "level_a_node1")
("root" "level_a_node2" "level_b_node1")
("root" "level_a_node2" "level_b_node2" "level_c_node1")
("root" "level_a_node3"))
With the tree defined this way, the helper functions can be implemented as follows (the zipper functions such as down, right, next, path, node and end? are assumed to be referred unqualified from clojure.zip):
(def is-leaf? #(-> % down nil?))
(defn contains-leaves-only?
  [loc]
  (some->> loc
           down ;; branch name
           right ;; children list
           down ;; first child
           (iterate right) ;; with the other siblings
           (take-while identity)
           (every? is-leaf?)))
UPDATE - add a lazy sequence version
(->> ["root"
["level_a_node3"
["leaf432"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf654"]]]
["level_b_node1"
["leaf987"]]
["leaf789"]]
["level_a_node1"
["leaf456" "leaf456b"]]
["leaf123"]]
vector-zip
(iterate next)
(take-while (complement end?))
(filter contains-leaves-only?)
(map #(->> % down path (map node))))
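For readers who want to paste this into a REPL, the following is a self-contained variant of the lazy version. It is a sketch that aliases clojure.zip as `z` so that `next` and friends do not shadow clojure.core; the names `leaf?`, `leaves-only?` and `leaf-parent-paths` are chosen here for illustration.

```clojure
(require '[clojure.zip :as z])

(defn leaf? [loc]
  ;; z/down returns nil for anything that is not a non-empty branch
  (nil? (z/down loc)))

(defn leaves-only? [loc]
  (some->> loc
           z/down              ;; branch name
           z/right             ;; children vector
           z/down              ;; first child
           (iterate z/right)   ;; the child and its right siblings
           (take-while some?)
           (every? leaf?)))

(defn leaf-parent-paths [tree]
  (->> (z/vector-zip tree)
       (iterate z/next)
       (take-while (complement z/end?))
       (filter leaves-only?)
       ;; z/path returns the ancestor nodes (vectors); the first element of each is its name
       (mapv #(mapv first (z/path (z/down %))))))

(leaf-parent-paths
 ["root"
  ["level_a_node3" ["leaf432"]]
  ["level_a_node2"
   ["level_b_node2" ["level_c_node1" ["leaf654"]]]
   ["level_b_node1" ["leaf987"]]
   ["leaf789"]]
  ["level_a_node1" ["leaf456" "leaf456b"]]
  ["leaf123"]])
;; => [["root" "level_a_node3"]
;;     ["root" "level_a_node2" "level_b_node2" "level_c_node1"]
;;     ["root" "level_a_node2" "level_b_node1"]
;;     ["root" "level_a_node1"]]
```

The paths come out in depth-first traversal order, which is the reverse of the loop version, since that one conses each result onto the front of the accumulator.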
It is because zippers have so many limitations that I created the Tupelo Forest library for processing tree-like data structures. Your problem then has a simple solution:
(ns tst.tupelo.forest-examples
(:use tupelo.core tupelo.forest tupelo.test))
(with-forest (new-forest)
(let [data ["root"
["level_a_node3" ["leaf"]]
["level_a_node2"
["level_b_node2"
["level_c_node1"
["leaf"]]]
["level_b_node1" ["leaf"]]]
["level_a_node1" ["leaf"]]
["leaf"]]
root-hid (add-tree-hiccup data)
leaf-paths (find-paths-with root-hid [:** :*] leaf-path?)]
with a tree that looks like:
(hid->bush root-hid) =>
[{:tag "root"}
[{:tag "level_a_node3"}
[{:tag "leaf"}]]
[{:tag "level_a_node2"}
[{:tag "level_b_node2"}
[{:tag "level_c_node1"}
[{:tag "leaf"}]]]
[{:tag "level_b_node1"}
[{:tag "leaf"}]]]
[{:tag "level_a_node1"}
[{:tag "leaf"}]]
[{:tag "leaf"}]])
and a result like:
(format-paths leaf-paths) =>
[[{:tag "root"} [{:tag "level_a_node3"} [{:tag "leaf"}]]]
[{:tag "root"} [{:tag "level_a_node2"} [{:tag "level_b_node2"} [{:tag "level_c_node1"} [{:tag "leaf"}]]]]]
[{:tag "root"} [{:tag "level_a_node2"} [{:tag "level_b_node1"} [{:tag "leaf"}]]]]
[{:tag "root"} [{:tag "level_a_node1"} [{:tag "leaf"}]]]
[{:tag "root"} [{:tag "leaf"}]]]))))
There are many choices after this depending on the next steps in the processing chain.

Deep data structure match & replace first

I'm trying to figure out an idiomatic, performant, and/or highly functional way to do the following:
I have a sequence of maps that looks like this:
({:_id "abc" :related ({:id "123"} {:id "234"})}
{:_id "bcd" :related ({:id "345"} {:id "456"})}
{:_id "cde" :related ({:id "234"} {:id "345"})})
The :id fields can be assumed to be unique within any one :_id.
In addition, I have two sets:
ids like ("234" "345") and
substitutes like ({:id "111"} {:id "222"})
Note that the fact that substitutes only has :id in this example doesn't mean it can be reduced to a collection of ids. This is a simplified version of a problem and the real data has other key/value pairs in the map that have to come along.
I need to return a new sequence that is the same as the original but with the values from substitutes replacing the first occurrence of the matching id from ids in the :related collections of all of the items. So what the final collection should look like is:
({:_id "abc" :related ({:id "123"} {:id "111"})}
{:_id "bcd" :related ({:id "222"} {:id "456"})}
{:_id "cde" :related ({:id "234"} {:id "345"})})
I'm sure I could eventually code up something that involves nesting maps and conditionals (thinking in iterative terms about loops of loops) but that feels to me like I'm not thinking functionally or cleverly enough given the tools I might have available, either in clojure.core or extensions like match or walk (if those are even the right libraries to be looking at).
Also, it feels like it would be much easier without the requirement to limit it to a particular strategy (namely, subbing on the first match only, ignoring others), but that's a requirement. And ideally, a solution would be adaptable to a different strategy down the line (e.g. a single, but randomly positioned match). The one invariant to strategy is that each id/sub pair should be used only once. So:
Replace one, and one only, occurrence of a :related value whose :id matches a value from ids with the corresponding value from substitutes, where the one occurrence is the first (or nth or rand-nth...) occurrence.
(def id-mapping (zipmap ids
(map :id substitutes)))
;; id-mapping -> {"345" "222", "234" "111"}
(clojure.walk/prewalk-replace id-mapping original)
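To see what this does on the sample data, here is a runnable sketch (vectors are used instead of lists purely for readability). Note that prewalk-replace swaps every occurrence of a matching id string, so the third map changes too, and only the id values are swapped rather than whole substitute maps; it does not by itself satisfy the first-occurrence-only requirement.

```clojure
(require '[clojure.walk :as walk])

(def original
  [{:_id "abc" :related [{:id "123"} {:id "234"}]}
   {:_id "bcd" :related [{:id "345"} {:id "456"}]}
   {:_id "cde" :related [{:id "234"} {:id "345"}]}])

(def id-mapping {"234" "111", "345" "222"})

;; Every form equal to a key of id-mapping is replaced by the mapped value.
(def replaced (walk/prewalk-replace id-mapping original))
;; => [{:_id "abc" :related [{:id "123"} {:id "111"}]}
;;     {:_id "bcd" :related [{:id "222"} {:id "456"}]}
;;     {:_id "cde" :related [{:id "111"} {:id "222"}]}]
```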
Assuming that the collection is called results:
(require '[clojure.zip :as z])
(defn modify-related
  "Replaces the first item of related whose :id equals id with sub."
  [related id sub]
  (loop [loc (z/down (z/seq-zip (seq related)))]
    (cond
      ;; nothing matched: return the collection unchanged
      (or (nil? loc) (z/end? loc)) related
      (= id (:id (z/node loc))) (z/root (z/replace loc sub))
      :else (recur (z/next loc)))))
(defn modify-results
  "Applies modify-related to the first result whose :related contains id."
  [results id sub]
  (loop [loc (z/down (z/seq-zip (seq results)))]
    (cond
      ;; no result contains id: return the collection unchanged
      (or (nil? loc) (z/end? loc)) results
      (some #(= id (:id %)) (:related (z/node loc)))
      (z/root (z/edit loc update :related modify-related id sub))
      :else (recur (z/next loc)))))
(defn sub-for-first
  [results ids substitutes]
  (let [subs (zipmap ids substitutes)]
    (reduce-kv modify-results results subs)))

Available Clojure XML parsing libraries that complement clojure.xml/parse

Are there secondary Clojure XML parsing projects that could be used after or in conjunction with clojure.xml/parse, and, if so, what are they?
clojure.xml/parse works wonderfully, but the map returned by clojure.xml/parse is deeply nested, at least after parsing one of our water cuts/tampers xml files. I am wondering if a secondary library exists that would allow me to parse further.
Here is just part of our xml file deliberately folded so you do not have to scroll.
:content [{:tag :Header, :attrs nil, :content [{:tag :ExportType,
:attrs nil, :content ["Tamper Export"]}
{:tag :CurrentDateTime, :attrs nil, :content ["
Notice the vector with embedded maps.
I can certainly develop something that could be used to parse this further, but I was just wondering if a module already exists.
Thank You.
The library to "parse" the content further is clojure.core. The functions and macros there can do a very good job of transforming the data structure generated from the XML into something useful. My personal favorite technique is using the two threading macros while making use of first and the keyword functions. If I need to do more than just digging deep, I'll write a quick function I can use map on.
The data structure you get back from clojure.xml/parse is just as deep as the XML: each element has one map with three items, the content being a vector of child elements and strings. It may look a little bit deeper, but it's just an open representation of what would be stored, say, in the Java XML objects. Its biggest advantage is that you don't need a special API to work with it; the functions you use on normal data work on the XML just as well. If anything, you write a few functions to translate into your domain and that's it.
Say you have something like the following (I'm leaving out attrs for brevity):
{:tag :stuff
 :content [{:tag :item
            :content [{:tag :key :content ["Key one"]}
                      {:tag :value :content ["Item one"]}]}
           {:tag :item
            :content [{:tag :key :content ["Key two"]}
                      {:tag :value :content ["Item two"]}]}]}
It's nested, but make a utility function for transforming each item into something usable.
(defn transform-item [item]
  (let [key-element (-> item :content first)
        value-element (-> item :content second)]
    [(-> key-element :content first)
     (-> value-element :content first)]))
And then map that on the content of the root element.
(defn transform-stuff [stuff-xml]
  (into {} (map transform-item (:content stuff-xml))))
And you should end up with some data which actually represents your domain.
{"Key one" "Item one", "Key two" "Item two"}
The key is to not think of it as parsing, but just as translating one data structure into another.
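As an end-to-end sketch of that translation, starting from an XML string rather than a hand-written map: the sample document and the names `parse-str`, `item->pair` and `stuff->map` are assumptions made for this example, and the XML is written without whitespace between elements so that no whitespace-only strings end up in :content.

```clojure
(require '[clojure.xml :as xml])

;; Hypothetical sample document mirroring the structure above.
(def sample-xml
  (str "<stuff>"
       "<item><key>Key one</key><value>Item one</value></item>"
       "<item><key>Key two</key><value>Item two</value></item>"
       "</stuff>"))

(defn parse-str
  "clojure.xml/parse accepts an InputStream, so wrap the string's bytes in one."
  [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn item->pair
  "Turns one :item element into a [key value] pair of strings."
  [item]
  (let [[key-el value-el] (:content item)]
    [(-> key-el :content first)
     (-> value-el :content first)]))

(defn stuff->map
  "Maps item->pair over the root element's children and collects into a map."
  [stuff]
  (into {} (map item->pair) (:content stuff)))

(stuff->map (parse-str sample-xml))
;; => {"Key one" "Item one", "Key two" "Item two"}
```

Positional destructuring keeps `item->pair` short, but it assumes the key element always comes before the value element; filtering the children by :tag would be more robust for real files.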

Clojure, using Enlive to extract raw HTML from a selector?

I need to retrieve some raw HTML from a certain part of an HTML page.
I wrote the scraper and it grabs the appropriate div, but it returns a map of tags.
(require '[net.cgrand.enlive-html :as html])
(defn fetch-url [url]
  (html/html-resource (java.net.URL. url)))
(defn parse-test []
  (let [url "http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000928/"
        url-data (fetch-url url)
        id "a693025"]
    (first (html/select url-data [(keyword (str "#" id "-why"))]))))
This outputs:
{:tag :div, :attrs {:class "section", :id "a693025-why"}, :content ({:tag :h2, :attrs nil, :content ({:tag :span, :attrs {:class "title"}, :content ("Why is this medication prescribed?")})} {:tag :p, :attrs nil, :content ("Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.")})}
How do I convert this to raw html? I couldn't find any enlive function to do this.
(apply str (html/emit* [(parse-test)]))
; => "<div class=\"section\" id=\"a693025-why\"><h2><span class=\"title\">Why is this medication prescribed?</span></h2><p>Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.</p></div>"

How to select nth element of particular type in enlive?

I am trying to scrape some data from a page with a table based layout. So, to get some of the data I need to get something like 3rd table inside 2nd table inside 5th table inside 1st table inside body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other selector steps. To make matters worse, the page in question has a single top level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?
For nth-of-type, does the following example help?
user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
"<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
No idea about the second issue. Your approach seems to work with a naive test:
user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})
Any chance of looking at your actual HTML?
Update: (in response to the comment)
Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:
user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})