Enlive - extract original HTML - clojure

Is it possible to retrieve the original HTML (with its quirks and formatting) using enlive selectors?
(def data "<div class=\"foo\"><p>some text <br> some more text</p></div>")
(apply str
(enlive/emit* (enlive/select (enlive/html-snippet data)
[:.foo :> enlive/any-node])))
=> "<p>some text <br /> some more text</p>"
In this example, enlive has transformed the <br> tag into a self-closing tag, unlike the original input snippet.
I suspect that enlive is transforming it into a hiccup-like list of tags, such that the original information is unfortunately lost.

Your suspicion is correct, enlive consumes this information in it's efforts to provide a consistent abstraction over the HTML. I don't think this was a feature it was designed to offer.

Although this is perhaps only tangentially related, if you use "append" you can preserve information (such as comments) that would otherwise be thrown out by net.cgrand.enlive-html/html-resource
https://github.com/cgrand/enlive/wiki/Table-and-Layout-Tutorial%2C-Part-3%3A-Simple-Transformations
<div id="wrapper">
<!--body-->
</div>
jcrit.server=> (pprint
(transform layout [:#wrapper]
(append page-content)))
({:tag :html,
{:tag :div,
:attrs {:id "wrapper"},
:content
("\n "
{:type :comment, :data "body"} ; <<== Still there.
"\n "
{:tag :p, :content ("Hi, mom!")})}
"\n")}
"\n\n")})

Related

enlive: smashing vectors of nodes together

So I have finally realized that I can use selectors to limit the portions of the page nodes that enlive transforms, that way I can create vectors of non-intersecting nodes.
Lots of words to say:
(defn b-content-transform []
(def b-area (eh/select global-page [:.b])) ;;cuts out all irrelevant nodes
(eh/transform b-area [:.b]
(eh/clone-for [i (range numberOfB)]
(eh/content (b-sample-content i)))))
So this returns something like..
[{:tag :div, :attrs {:class "b"}, :content ({:tag :div, :attrs {:id "b0", :class "topB"}]
Which is excellent, enlive nodes yay!
Now I have several transforms that act the same way.
My question is: how can I mash all the resultant vectors (?) together?
Well it turns out there is a knee-slappingly simple solution:
(concat transform1 transform2 transform3)
then enlive-html/emit* .

Available Clojure XML parsing libraries that complement clojure-xml/parse

Are there secondary clojure xml parsing projects that could be used after or in conjunction with clojure-xml/parse, and, if so, what are they?
clojure-xml/parse works wonderfully, but the map returned by clojure-xml/parse is deeply nested, at least after parsing one of our water cuts/tampers xml files. I am wondering if a secondary library exists that would allow me to parse further.
Here is just part of our xml file deliberately folded so you do not have to scroll.
:content [{:tag :Header, :attrs nil, :content [{:tag :ExportType,
:attrs nil, :content ["Tamper Export"]}
{:tag :CurrentDateTime, :attrs nil, :content ["
Notice the vector with embedded maps.
I can certainly develop something that could be used to parse this further, but I was just wondering if a module already exists.
Thank You.
The library to "parse" the content further is clojure.core. The functions and macros there can do a very good job of transforming the data structure generated from the XML into something useful. My personal favorite technique is using the two threading macros while making use of first and the keyword functions. If I need to do more than just digging deep, I'll write a quick function I can use map on.
The data structure you get back from the clojure.xml/parse is just as deep as the xml - each element has one map with three items, the content being a vector of child elements and strings. It may look a little bit deeper, but it's just an open representation of what would be stored, say, in the Java XML objects. It's biggest advantage is you don't need a special API to work with it - the functions you use on normal data work on the XML just as well. If anything, you write a few functions to translate into your domain and that's it.
Say you have something like the following (I'm leaving out attrs for brevity):
{:tag :stuff
:content [{:tag item
:content [{:tag :key :content ["Key one"]}
{:tag :value :content ["Item one"]}]}
{:tag item
:content [{:tag :key :content ["Key two"]}
{:tag :value :content ["Item two"]}]}]}
It's nested, but make a utility function for transforming each item into something usable.
(defn transform-item [item]
(let [key-element (-> item :content first)
value-element (-> item :content second)]
[(-> key-element :content first)
(-> value-element :content first)]))
And then map that on the content of the root element.
(defn transform-stuff [stuff-xml]
(into {} (map transform-item (:content stuff-xml)))
And you should end up with some data which actually represents your domain.
{"Key one" "Item One", "Key two" "Item 2"}
The key is to not think of it as parsing, but just as translating one data structure into another.

Clojure, using Enlive to extract raw HTML from a selector?

I need to retrieve some some raw HTML from a certain part of an HTML page.
I wrote the scraper and it grabs the appropriate div, but it returns a map of tags.
(:use [net.cgrand.enlive-html :as html])
(defn fetch-url [url]
(html/html-resource (java.net.URL. url)))
(defn parse-test []
(let [url "http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0000928/"
url-data (fetch-url url)
id "a693025"]
(first (html/select url-data [(keyword (str "#" id "-why"))]))))
This outputs:
{:tag :div, :attrs {:class "section", :id "a693025-why"}, :content ({:tag :h2, :attrs nil, :content ({:tag :span, :attrs {:class "title"}, :content ("Why is this medication prescribed?")})} {:tag :p, :attrs nil, :content ("Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.")})}
How do I convert this to raw html? I couldn't find any enlive function to do this.
(apply str (html/emit* [(parse-test)]))
; => "<div class=\"section\" id=\"a693025-why\"><h2><span class=\"title\">Why is this medication prescribed?</span></h2><p>Zolpidem is used to treat insomnia (difficulty falling asleep or staying asleep). Zolpidem belongs to a class of medications called sedative-hypnotics. It works by slowing activity in the brain to allow sleep.</p></div>"

How to scrape data from specified tag with Enlive?

could someone explain me how to scrape content from <td> tags where the <th> has content value (actually in this case I need content of <b> tag for matching operation) "Row1 title", but without scraping <th> tag (or any of its content) in process? Here is my test HTML:
<table class="table_class">
<tbody>
<tr>
<th>
<b>
Row1 title
</b>
</th>
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
</tr>
<tr>
<th>
Row2 title
</th>
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
</tr>
</tbody>
</table>
Data which I want to extract should come from these tags:
<td>2.660.784</td>
<td>2.944.552</td>
<td>Correct, has 3 td elements</td>
I have managed to create function which returns entire content of the table, but I would like to exclude the <th> node from result, and to return only data from <td> nodes, which content I can use for further parsing. Can anyone help me with this?
With enlive something like this
(ns tutorial.so-scrape
(:require [net.cgrand.enlive-html :as html])
(defn parse-tds [url]
(html/select (html/html-resource (java.net.URL. url)) [:table :td]))
should give you a sequence of all the td nodes, something of the form {:tag :td :attrs {...} :content (...)}. I am not aware that enlive gives you the possibility to get the content of those nodes directly. I could be wrong.
You could then extract the content of the sequence for something along the lines of
(for [line ws-content] (apply str (:content line)))
In regard to the question you posted yesterday (I am assuming you are still working with that page) - the solution I gave there was a little complex - but its also flexible. For example if you change the tag-type function like this
(defn tag-type [node]
(case (:tag node)
:td ::TerminalNode
::IgnoreNode)
(change the return value of all nodes to ::IgnoreNode except for :td then it just gives you a sequence of the content of the :tds which is probably close to what you want. Let me know if you need more help.
EDIT (in reply to comments below)
I don't think selecting nodes based on their :content is possible with enlive alone - but you can certainly do so with Clojure.
for example you could do something like
(for [line ws-content :when (re-find (re-pattern "WHAT YOU WANT TO MATCH") (:content line))]
(:content line))
could work. (you might have to tweak the (:content line) form a little..

How to select nth element of particular type in enlive?

I am trying to scrape some data from a page with a table based layout. So, to get some of the data I need to get something like 3rd table inside 2nd table inside 5th table inside 1st table inside body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other selector steps. To make matters worse, the page in question has a single top level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?
For nth-of-type, does the following example help?
user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
"<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
No idea about the second issue. Your approach seems to work with a naive test:
user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})
Any chance of looking at your actual HTML?
Update: (in response to the comment)
Here's another example where "the second <p> inside the <div> inside the second <div> inside whatever" is returned:
user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})