Hi all, I'm trying to parse/extract HTML data with Clojure and Enlive (are there any better choices?).
I am trying to get all the ul > li tags that are NOT part of the
<nav> tag. I think I should use the (html/but) function from Enlive,
but I can't seem to make it work.
;; test-enlive.clj
(defn get-tags [dom tag-list]
  (let [tags
        (mapv
          #(vec (html/select dom %1))
          tag-list)]
    tags))
;;Gives NO tags
(get-tags test-dom [[[(html/but :nav) :ul :> :li]]])
;;Gives ALL the LI-tags
(get-tags test-dom [[:ul :> :li]])
<!-- test.html -->
<html>
<head><title>Test page</title> </head>
<body>
<div>
<nav>
<ul>
<li>
skip these navs-li
</li>
</ul>
</nav>
<h1>Hello World<h1>
<ul><li>get only these li's</li>
</ul>
</div>
</body></html>
If you had valid XHTML, you could use XPath from Sigel:
(require '[sigel.xpath.core :as xpath])
(let [data "<html><head><title>Test page</title></head>
<body><div><nav><ul><li>skip these navs-li</li></ul></nav>
<h1>Hello World</h1>
<ul><li>get only these li's</li></ul>
</div></body></html>"]
(xpath/select data "//li[not(ancestor::nav)]"))
I was able to select the target li with Hickory, so if you don't mind changing your library:
Dependency: [hickory "0.7.1"]
Require: [hickory.core :as h] [hickory.select :as s]
(s/select (s/and
            (s/descendant (s/tag :ul)
                          (s/tag :li))
            (s/not (s/descendant (s/tag :nav)
                                 (s/tag :li))))
          (h/as-hickory (h/parse (slurp "resources/site.html"))))
=> [{:type :element, :attrs nil, :tag :li, :content ["get only these li's"]}]
You could do this with the Tupelo Forest library. Watch the video and see the examples in the unit tests.
Here is one way to solve your problem:
(ns tst.tupelo.forest-examples
(:use tupelo.core tupelo.forest tupelo.test)
(:require. ... ))
<snip>
(verify
(let [html-data "<html>
<head><title>Test page</title> </head>
<body>
<div>
<nav>
<ul>
<li>
skip these navs-li
</li>
</ul>
</nav>
<h1>Hello World<h1>
<ul><li>get only these li's</li>
</ul>
</div>
</body>
</html> "]
and the interesting part comes next.
(hid-count-reset)
(with-forest (new-forest)
(let [root-hid (add-tree-html html-data)
out-hiccup (hid->hiccup root-hid)
result-1 (find-paths root-hid [:html :body :div :ul :li])
li-hid (last (only result-1))
li-hiccup (hid->hiccup li-hid)]
(is= out-hiccup [:html
[:head [:title "Test page"]]
[:body
[:div
[:nav
[:ul
[:li
"\n skip these navs-li\n "]]]
[:h1 "Hello World"]
[:ul [:li "get only these li's"]]]]])
(is= result-1 [[1011 1010 1009 1008 1007]])
(is= li-hid 1007)
(is= li-hiccup [:li "get only these li's"])))))
The above code can be seen live in the examples.
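If you would rather stay with Enlive, here is a minimal sketch of one workaround. As far as I understand Enlive's selector syntax, a vector nested inside a selector is an intersection of tests on a single element, so [[(html/but :nav) :ul :> :li]] asks for one element that is at once "not nav", a ul, and a li, which matches nothing. Also, html/but negates a test on one element; it does not exclude ancestors. The sketch below selects all the li elements and removes those that are also found under nav. The name li-outside-nav is made up for this example, and the comparison is by value, so an li outside nav that is identical to one inside nav would be dropped as well.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Inline copy of the test page (whitespace removed, h1 closed properly).
(def test-dom
  (html/html-resource
    (java.io.StringReader.
      (str "<html><head><title>Test page</title></head><body><div>"
           "<nav><ul><li>skip these navs-li</li></ul></nav>"
           "<h1>Hello World</h1>"
           "<ul><li>get only these li's</li></ul>"
           "</div></body></html>"))))

(defn li-outside-nav
  "Hypothetical helper: returns the ul > li nodes that are not
   also matched under a nav element (compared by value)."
  [dom]
  (let [in-nav (set (html/select dom [:nav :ul :> :li]))]
    (remove in-nav (html/select dom [:ul :> :li]))))

(def result (li-outside-nav test-dom))
```

With the page above, result should contain only the li whose content is "get only these li's".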
Related
I am just getting started using Enlive for an HTML screen scraping task. If I wanted the text from the second and fourth TD nodes of the following table, how would I specify the selector? I read through the tutorial but didn't find any examples of how to specify what in XPath would be:
html/body/table/tr/td[2] and /td[4] (assuming a one-based index)
<html>
<body>
<table width="100%" border="0" cellspacing="3" cellpadding="2">
<tr>
<td width="15%" class="labels">Part No</td>
<td class="datafield">I2013-00007</td>
<td class="labels"><div align="right">Parcel No</div></td>
<td colspan="3" class="datafield">07-220-12-03-01-2-00-000</td>
</tr>
</table>
</body>
</html>
I need to capture the text value from those two TD nodes.
You can use nth-of-type like this:
user> (require '[net.cgrand.enlive-html :as html])
nil
user> (def test-html
"<html><body><table width='100%' border='0' cellspacing='3' cellpadding='2'><tr><td width='15%' class='labels'>Part No</td><td class='datafield'>I2013-00007</td><td class='labels'><div align='right'>Parcel No</div></td><td colspan='3' class='datafield'>07-220-12-03-01-2-00-000</td></tr></table></body></html>")
#'user/test-html
user> (:content (first (html/select (html/html-resource (java.io.StringReader. test-html)) [[:td (html/nth-of-type 2)]])))
("I2013-00007")
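The call above returns only the second TD. To get both the second and the fourth, one option is to run the same selector once per index and take the text of each match. This is a sketch rather than the only way to do it; td-text is a name made up for this example, and html/text is used so that nested markup inside a cell is flattened to a string.

```clojure
(require '[net.cgrand.enlive-html :as html])

(def test-html
  (str "<html><body><table><tr>"
       "<td width='15%' class='labels'>Part No</td>"
       "<td class='datafield'>I2013-00007</td>"
       "<td class='labels'><div align='right'>Parcel No</div></td>"
       "<td colspan='3' class='datafield'>07-220-12-03-01-2-00-000</td>"
       "</tr></table></body></html>"))

(def dom (html/html-resource (java.io.StringReader. test-html)))

(defn td-text
  "Hypothetical helper: text of the first td that is the n-th td
   among its siblings (one-based, as with nth-of-type)."
  [dom n]
  (html/text (first (html/select dom [[:td (html/nth-of-type n)]]))))

;; Collect the second and fourth cells in order.
(def values (mapv #(td-text dom %) [2 4]))
```

Here values should hold the part number followed by the parcel number.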
Hi, I am a novice with Python and Django. I am following a tutorial to develop a blog in Django.
I have synchronized the database and run the server.
My admin page is working fine, but my application page is showing a problem.
I have created an HTML file, "blog.html":
(% extends "base.html" %)
(% block content %)
(% for post in object_list %)
<h3>{{ post.title}}</h3>
<div class="post_meta">
on {{post.date}}
</div>
<div class= "post_body">
{{post.body|safe|linebreaks}}
</div>
(%endfor %)
(%endblock %)
When I run my Django server, it shows this code instead of the actual blog page.
Django's template language uses {% and %} for template tags, not (% and %) as in your template file.
Problem: Enlive snippet making funky HTML
Visual reference of problem: http://i.imgur.com/FIOzgZv.png
See bottom of code snippet for strange HTML in question
(ns notebook.handler
  (:require [compojure.core :refer :all]
            [compojure.handler :as handler]
            [compojure.route :as route]
            [net.cgrand.enlive-html :as html]))

(html/defsnippet nav "templates/nav.html" [:*]
  [])

(html/deftemplate home-page "templates/base.html"
  []
  [:body] (html/prepend (nav)))

(defroutes app-routes
  (GET "/" [] (home-page))
  (route/resources "/")
  (route/not-found "Not Found"))

(def app
  (handler/site app-routes))
Contents of base.html:
<html>
<head>
<link rel=stylesheet href="css/base.css">
</head>
<body>
</body>
</html>
Contents of nav.html:
<nav>
<ul>
<li>FlatNotes</li>
</ul>
</nav>
HTML when localhost:3000 is visited:
<html>
<head>
<link href="css/base.css" rel="stylesheet" />
</head>
<body><nav>
<ul>
<li>FlatNotes</li>
</ul>
</nav><ul>
<li>FlatNotes</li>
</ul><li>FlatNotes</li>
</body>
</html>
(reduce str (html/emit* (nav))) shows the same strange HTML, meaning the problem occurs in defsnippet, before deftemplate:
"<nav>\n <ul>\n\t<li>FlatNotes</li>\n </ul>\n\n</nav><ul>\n\t<li>FlatNotes</li>\n </ul><li>FlatNotes</li>"
Maybe I'm mistaken about what [:*] does, or there's a fundamental misunderstanding, or there's a gotcha I'm unaware of. I've already reduced the code down to as minimal as I can so lolidk.
:* represents the universal selector. It matches every element in nav.html - nav, ul, and li - which means the nav snippet is:
<nav>
<ul>
<li>FlatNotes</li>
</ul>
</nav>
<ul>
<li>FlatNotes</li>
</ul>
<li>FlatNotes</li>
The selector you pass to the snippet definition should point to the single, top-level element of your snippet. If you change :* to match a single element (e.g. :nav), it ought to give you the snippet you're looking for.
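The difference can be checked at the REPL without touching the template files. The sketch below parses an inline copy of the nav markup (an assumption made so the example is self-contained; your real code would keep using defsnippet with "templates/nav.html") and compares what [:*] and [:nav] select from it.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Inline page wrapping the nav markup, standing in for nav.html.
(def page
  (html/html-resource
    (java.io.StringReader.
      "<html><body><nav><ul><li>FlatNotes</li></ul></nav></body></html>")))

;; Start from the nav element only, as the snippet file would.
(def nav-nodes (html/select page [:nav]))

;; Universal selector: every element in the nav subtree is a separate match.
(def all-tags (mapv :tag (html/select nav-nodes [:*])))

;; Single top-level element: one match, which already contains ul and li.
(def nav-tags (mapv :tag (html/select nav-nodes [:nav])))
```

all-tags should list nav, ul and li as three separate matches, which is why each of them is emitted again on its own, while nav-tags should contain just nav. In the handler that corresponds to (html/defsnippet nav "templates/nav.html" [:nav] []).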
I started programming recently and I have this problem. I have the HTML snippet below. I want to parse the src attribute of each img, normalize it with urly's path normalization, and prepend a new path to the src.
<html>
<body>
<div class="content">lorem ipsum
<img style="margin-top: -5px;" src="/img/car.png" />
</div>
<img style="margin-top: -5px;" src="/img/chair.png" />
</body>
</html>
It should become this:
<html>
<body>
<div class="content">lorem ipsum
<img style="margin-top: -5px;" src="/path1/img/car.png" />
</div>
<img style="margin-top: -5px;" src="/path1/img/chair.png" />
</body>
</html>
I thought of this approach, but I just can't find a way to get at the src value:
(html/deftemplate template-about "../resources/public/build/about/index.html"
[]
[:img] (html/set-attr :src (str "path1" (urly/path-of ("the src value")))
)
You're looking for an update-attr function, which has been discussed before.
As in:
(html/deftemplate template-about "../resources/public/build/about/index.html"
  []
  ;; img elements carry their URL in :src (not :href)
  [:img] (fn [node]
           (let [src (-> node :attrs :src)]
             (assoc-in node [:attrs :src] (urly/path-of src)))))
Or, taking the generic path:
(defn update-attr [attr f & args]
  (fn [node]
    (apply update-in node [:attrs attr] f args)))
and then
(update-attr :src urly/path-of)
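Since an Enlive transformation is just a function from node to node, the generic helper can be tried on a plain map without Enlive or urly loaded. The sketch below is an illustration only: the img map is hand-written, and the "/path1" prefixing function stands in for whatever normalization you want (in your case something built around urly/path-of).

```clojure
;; Same helper as above, repeated so the example is self-contained.
(defn update-attr [attr f & args]
  (fn [node]
    (apply update-in node [:attrs attr] f args)))

;; Hand-written node in the {:tag :attrs :content} shape Enlive uses.
(def img {:tag :img
          :attrs {:style "margin-top: -5px;" :src "/img/car.png"}
          :content nil})

;; Stand-in transformation: prepend a fixed path to the src value.
(def prefixed ((update-attr :src #(str "/path1" %)) img))

(get-in prefixed [:attrs :src])
```

The last expression should return "/path1/img/car.png", and the other attributes are left untouched. In a template the same function goes on the right-hand side of the [:img] selector.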
Compojure does not bind the fields in a POST form. This is my route def:
(defroutes main-routes
  (POST "/query" {params :params}
    (debug (str "|" params "|"))
    "OK..."))
When I post a form with fields in it, I get |{}|, i.e. there are no parameters. Incidentally, when I go to http://localhost/query?param1=value1, params is not empty, and the values get printed on the server console.
Is there another binding for form fields??
Ensure your input fields have a name="zzz" attribute, not only id="zzz".
An HTML form collects all its inputs and posts them using the name attribute.
my_post.html
<form action="my_post_route" method="post">
<label for="id">id</label> <input type="text" name="id" id="id" />
<label for="aaa">aaa</label> <input type="text" name="aaa" id="aaa" />
<button type="submit">send</button>
</form>
my_routes.clj
(defroutes default-handler
;,,,,
(POST "/my_post_route" {params :params}
(str "POST id=" (params "id") " params=" params))
;,,,,
This produces a response like:
id=21 params={"aaa" "aoeu", "id" "21"}
This is a great example of how to handle parameters
(ns example2
  (:use [ring.adapter.jetty :only [run-jetty]]
        [compojure.core :only [defroutes GET POST]]
        [ring.middleware.params :only [wrap-params]]))

(defroutes routes
  (POST "/" [name] (str "Thanks " name))
  (GET "/" [] "<form method='post' action='/'> What's your name? <input type='text' name='name' /><input type='submit' /></form>"))

(def app (wrap-params routes))

(run-jetty app {:port 8080})
https://github.com/heow/compojure-cookies-example
See under Example 2 - Middleware is Features
Note:
(params "id") returns nil for me; I get the correct value with (params :id).
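Whether the keys are strings or keywords most likely depends on which middleware wraps the routes: wrap-params on its own leaves string keys, while adding wrap-keyword-params (which compojure.handler/site also applies) turns them into keywords. Here is a small sketch that shows both, calling the middleware directly on a hand-built request map instead of starting a server. form-post and echo-params are names made up for this example, and the handler returns the params map itself rather than a real response.

```clojure
(require '[ring.middleware.params :refer [wrap-params]]
         '[ring.middleware.keyword-params :refer [wrap-keyword-params]])

(defn form-post
  "Hypothetical helper: a minimal Ring request for a urlencoded form POST."
  [uri body]
  {:request-method :post
   :uri uri
   ;; Content type given both as a key and as a header, on the assumption
   ;; that Ring versions differ in where they look for it.
   :content-type "application/x-www-form-urlencoded"
   :headers {"content-type" "application/x-www-form-urlencoded"}
   :body (java.io.ByteArrayInputStream. (.getBytes body "UTF-8"))})

;; Demo handler: hands back the parsed params instead of a response map.
(defn echo-params [request]
  (:params request))

;; wrap-params only: keys stay strings.
(def string-keys-app (wrap-params echo-params))

;; wrap-params runs first (outermost), then wrap-keyword-params.
(def keyword-keys-app (-> echo-params wrap-keyword-params wrap-params))

(def string-result (string-keys-app (form-post "/query" "id=21&aaa=aoeu")))
(def keyword-result (keyword-keys-app (form-post "/query" "id=21&aaa=aoeu")))
```

With the first app, (get string-result "id") finds the value; with the second, (:id keyword-result) does. If params comes back completely empty on a POST, as in the question, it is worth checking that wrap-params (or handler/site) is wrapped around the routes at all.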