How to elegantly parse XML in Clojure

I have this piece of code that builds up sentences from XML, shown below. I wonder whether there is an alternative that stays readable once it has been hacked into working.
(mapcat
  (fn [el]
    (map special-join
         (map
           (fn [el] (map zip-xml/text (zip-xml/xml-> el :word)))
           (zip-xml/xml-> el :sentence))))
  (zip-xml/xml-> root :document))
The above code is not very readable, given the repeated inline function definitions combined with the nested probing, but tearing them apart into standalone functions, as in this official tutorial, really doesn't make sense to me for such simple cases.
For completeness, here's the repeating XML structure that this is parsing:
<document>
  <sentence id="1">
    <word id="1.1">Foo</word>
    <word id="1.2">bar</word>
  </sentence>
</document>

Zippers may be overkill in this situation. clojure.xml/parse will give you a simple data structure representing the XML.
(require '[clojure.xml :as xml] '[clojure.string :as string])
(def doc
(->
"<document>
<sentence id=\"1\">
<word id=\"1.1\">
Foo
</word>
<word id=\"1.2\">
bar
</word>
</sentence>
</document>
" .getBytes java.io.ByteArrayInputStream. xml/parse))
Then you can use xml-seq to get all the <sentence> tags and their children, gathering the children's text content, trimming whitespace, and joining with spaces.
(->> doc
     xml-seq
     (filter (comp #{:sentence} :tag))
     (map :content)
     (map #(transduce
             (comp
               (mapcat :content)
               (map string/trim)
               (interpose " "))
             str %)))
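The same idea can be written as a pair of small functions. This is only a minimal sketch using the clojure.xml and clojure.string namespaces that ship with Clojure; the names parse-xml-str and sentences are made up for illustration, and it assumes every word of interest sits in a <word> element directly under a <sentence>.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.string :as string])

(defn parse-xml-str
  "Parses an XML string into clojure.xml's nested-map representation
  (maps with :tag, :attrs and :content keys)."
  [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn sentences
  "Returns one space-joined string per <sentence> element in doc."
  [doc]
  (for [el (xml-seq doc)
        :when (= :sentence (:tag el))]
    (->> (:content el)
         (filter #(= :word (:tag %)))   ; keep only <word> children
         (mapcat :content)              ; their text nodes
         (map string/trim)
         (string/join " "))))

(sentences
  (parse-xml-str
    (str "<document><sentence id=\"1\">"
         "<word id=\"1.1\">Foo</word><word id=\"1.2\">bar</word>"
         "</sentence></document>")))
;; => ("Foo bar")
```

Using for with :when keeps the "find sentences" step and the "join words" step visually separate, which may read more easily than nested anonymous functions.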

I do not like the way zippers work in Clojure, and I've not looked at clojure.zip/xml-zip or clojure.data.zip.xml/xml-> (confusing that they are two separate libs!).
Instead, may I suggest you try out the tupelo.forest library? Here is an overview from the 2017 Clojure/Conj.
Below is a live solution using tupelo.forest. I added a second sentence to make it more interesting:
(dotest
(with-forest (new-forest)
(let [xml-str (ts/quotes->double
"<document>
<sentence id='1'>
<word id='1.1'>foo</word>
<word id='1.2'>bar</word>
</sentence>
<sentence id='2'>
<word id='2.1'>beyond</word>
<word id='2.2'>all</word>
<word id='2.3'>recognition</word>
</sentence>
</document>")
root-hid (add-tree-xml xml-str)
>> (remove-whitespace-leaves)
bush-no-blanks (hid->bush root-hid)
sentence-hids (find-hids root-hid [:document :sentence])
sentences (forv [sentence-hid sentence-hids]
(let [word-hids (hid->kids sentence-hid)
words (mapv #(grab :value (hid->leaf %)) word-hids)
sentence-text (str/join \space words)]
sentence-text))
]
(is= bush-no-blanks
[{:tag :document}
[{:id "1", :tag :sentence}
[{:id "1.1", :tag :word, :value "foo"}]
[{:id "1.2", :tag :word, :value "bar"}]]
[{:id "2", :tag :sentence}
[{:id "2.1", :tag :word, :value "beyond"}]
[{:id "2.2", :tag :word, :value "all"}]
[{:id "2.3", :tag :word, :value "recognition"}]]])
(is= sentences
["foo bar"
"beyond all recognition"]))))
The idea is to find the hid (Hex ID, like a pointer) for each sentence. In the forv loop, we find the child nodes for each sentence, extract the :value of each, and join them into a string. The unit tests show the tree structure as parsed from XML (after deleting blank nodes) and the final result. Note that we ignore the id fields and use only the tree structure to understand the sentences.
Documentation for tupelo.forest is still a work in progress, but you can see many live examples here.
The Tupelo project lives on GitHub.
Update
I have been thinking about the streaming data problem, and have added a new function proc-tree-enlive-lazy to enable lazy processing of large data sets. Here is an example:
(let [xml-str (ts/quotes->double
"<document>
<sentence id='1'>
<word id='1.1'>foo</word>
<word id='1.2'>bar</word>
</sentence>
<sentence id='2'>
<word id='2.1'>beyond</word>
<word id='2.2'>all</word>
<word id='2.3'>recognition</word>
</sentence>
</document>")
(let [enlive-tree-lazy (clojure.data.xml/parse (StringReader. xml-str))
doc-sentence-handler (fn [root-hid]
(remove-whitespace-leaves)
(let [sentence-hid (only (find-hids root-hid [:document :sentence]))
word-hids (hid->kids sentence-hid)
words (mapv #(grab :value (hid->leaf %)) word-hids)
sentence-text (str/join \space words)]
sentence-text))
result-sentences (proc-tree-enlive-lazy enlive-tree-lazy
[:document :sentence] doc-sentence-handler)]
(is= result-sentences ["foo bar" "beyond all recognition"])) ))
The idea is that you process successive subtrees, in this case whenever you get a subtree path of [:document :sentence]. You pass in a handler function, which will receive the root-hid of a tupelo.forest tree. The return value of the handler is then placed onto an output lazy sequence returned to the caller.

Related

Wrap HTML tags around pretty-printed Clojure forms

Clojure's pretty printer (clojure.pprint) takes unformatted code like this:
(defn fib ([n] (fib n 1 0)) ([n a b] (if (= n 0) a (fib (dec n) (+ a b) a))))
And makes it nice, like this.
(defn fib
([n] (fib n 1 0))
([n a b]
(if (= n 0)
a
(fib (dec n) (+ a b) a))))
I'd like to put some source in a web page, so I'd like it to be pretty-printed. But I'd also like to wrap each form in a set of <span> tags with a unique ID so I can manipulate the representation with JavaScript. That is, I want to turn
(foo bar baz)
into
<span id="001">(<span id="002">foo</span> <span id="003">bar</span> <span id="004">baz</span>)</span>
But I still want the resulting forms to be indented like the pretty printer would, so that the code that actually gets displayed looks right.
Some of the documentation for the pretty printer mentions that it can take custom dispatch functions, but I can't find anything about what they do or how to define them. Is it possible to do what I want with such a beast, and if so can someone provide me with some information on how to do it?
There are ways to pretty print XML, as you can see here:
https://nakkaya.com/2010/03/27/pretty-printing-xml-with-clojure/
That person used
(defn ppxml [xml]
(let [in (javax.xml.transform.stream.StreamSource.
(java.io.StringReader. xml))
writer (java.io.StringWriter.)
out (javax.xml.transform.stream.StreamResult. writer)
transformer (.newTransformer
(javax.xml.transform.TransformerFactory/newInstance))]
(.setOutputProperty transformer
javax.xml.transform.OutputKeys/INDENT "yes")
(.setOutputProperty transformer
"{http://xml.apache.org/xslt}indent-amount" "2")
(.setOutputProperty transformer
javax.xml.transform.OutputKeys/METHOD "xml")
(.transform transformer in out)
(-> out .getWriter .toString)))
So if you pass in your HTML string (bearing in mind that HTML is not exactly a subset of XML), you would get:
(ppxml "<root><child>aaa</child><child/></root>")
output:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>aaa</child>
<child/>
</root>
In Clojure, using Compojure, you can build HTML/XML tags in a very lispy syntax.
You can use them too:
(ppxml (html
[:html
[:head
[:title "Hello World"]]
[:body "Hello World!"]]))
With the output of:
<html>
<head>
<title>Hello World</title>
</head>
<body>Hello World!</body>
</html>
You see also suggestions here:
Compojure HTML Formatting
It's possible, but I think it's much more work than you expect. You'll work with the source code in pprint/dispatch and modify functions that are already there.
You'll surely need with-pprint-dispatch. This macro uses the given dispatch function while executing its body:
(with-pprint-dispatch code-dispatch
(pprint '(foo bar baz)))
(foo bar baz)
=> nil
Look for the function code-dispatch and see its definition:
(defmulti
code-dispatch
"The pretty print dispatch function for pretty printing Clojure code."
{:added "1.2" :arglists '[[object]]}
class)
(use-method code-dispatch clojure.lang.ISeq pprint-code-list)
(use-method code-dispatch clojure.lang.Symbol pprint-code-symbol)
;; The following are all exact copies of simple-dispatch
(use-method code-dispatch clojure.lang.IPersistentVector pprint-vector)
(use-method code-dispatch clojure.lang.IPersistentMap pprint-map)
(use-method code-dispatch clojure.lang.IPersistentSet pprint-set)
(use-method code-dispatch clojure.lang.PersistentQueue pprint-pqueue)
(use-method code-dispatch clojure.lang.IDeref pprint-ideref)
(use-method code-dispatch nil pr)
(use-method code-dispatch :default pprint-simple-default)
As you can see, there is a special function for each collection type. I just picked list and vector, and my dispatch function looks like this:
(defmulti
my-code-dispatch
class)
(use-method my-code-dispatch clojure.lang.ISeq my-pprint-code-list)
(use-method my-code-dispatch clojure.lang.IPersistentVector my-pprint-vector)
(use-method my-code-dispatch :default my-pprint-simple-default)
Now, look for pprint-code-list, pprint-vector and pprint-simple-default. Two of them use pprint-logical-block with the keywords :prefix and :suffix; that's the place where you insert the additional string (the rest of the function stays the same). Don't forget to define a counter for span numbering:
(in-ns 'clojure.pprint)
(def id (atom 0))
(defn- my-pprint-vector [avec]
(pprint-meta avec)
(pprint-logical-block :prefix (format "<span id=\"%03d\">[" (swap! id inc))
:suffix "]</span>"
...)
(defn- my-pprint-simple-default [obj]
(cond
(.isArray (class obj)) (pprint-array obj)
(and *print-suppress-namespaces* (symbol? obj)) (print (name obj))
:else (cl-format true "<span id=\"~3,'0d\">~s</span>"
(swap! id inc)
obj)))
(defn- my-pprint-simple-code-list [alis]
(pprint-logical-block :prefix (format "<span id=\"%03d\">(" (swap! id inc))
:suffix ")</span>"
...)
(defn- my-pprint-code-list [alis]
(if-not (pprint-reader-macro alis)
(if-let [special-form (*code-table* (first alis))]
(special-form alis)
(my-pprint-simple-code-list alis))))
With all this setup, I called:
(with-pprint-dispatch my-code-dispatch
(pprint '(foo bar baz)))
<span id="001">(<span id="002">foo</span>
<span id="003">bar</span>
<span id="004">baz</span>)</span>
=> nil
Or you can print it into string:
(with-out-str (with-pprint-dispatch my-code-dispatch
(pprint '(foo bar baz))))
=>
"<span id=\"001\">(<span id=\"002\">foo</span>\r
<span id=\"003\">bar</span>\r
<span id=\"004\">baz</span>)</span>\r
"
And I have to mention again that for printing real code, you would have to modify the functions for all data types. So, is it possible? Yes. Worth the effort? I doubt it.
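For comparison, here is a minimal sketch of the same idea that stays outside the clojure.pprint namespace and uses only its public pieces (with-pprint-dispatch, pprint-logical-block, pprint-newline, write-out). The names span-id, open-tag and span-dispatch are hypothetical, and the sketch makes strong simplifying assumptions: only lists get a logical block, everything else is wrapped as a single token, and special forms, reader macros, vectors and maps get no code-style layout.

```clojure
(require '[clojure.pprint :as pp])

;; Counter used to number the spans (hypothetical helper).
(def span-id (atom 0))

(defn open-tag
  "Returns an opening <span> tag with the next zero-padded id."
  []
  (format "<span id=\"%03d\">" (swap! span-id inc)))

(defn span-dispatch
  "Dispatch function for with-pprint-dispatch: wraps lists and atoms in spans."
  [obj]
  (if (seq? obj)
    ;; Lists: open a logical block so the pretty printer can break lines.
    (pp/pprint-logical-block :prefix (str (open-tag) "(") :suffix ")</span>"
      (loop [items (seq obj)]
        (when items
          (pp/write-out (first items))       ; recurses through this dispatch
          (when (next items)
            (.write ^java.io.Writer *out* " ")
            (pp/pprint-newline :linear)
            (recur (next items))))))
    ;; Everything else: print as one wrapped token.
    (do (print (open-tag))
        (pr obj)
        (print "</span>"))))

(reset! span-id 0)
(pp/with-pprint-dispatch span-dispatch
  (pp/pprint '(foo bar baz)))
```

Because the tags count toward the line width, the printer is likely to break lines earlier than it would for the bare code; raising clojure.pprint/*print-right-margin* is one way to compensate.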

how to retrieve a string that is in the first position of a list that is a hash-map value

If I evaluate:
(:content {:foo "bar" :biz "baf" :content ("Happy Happy Joy Joy")})
I get:
java.lang.String cannot be cast to clojure.lang.IFn
If I wanted the "Happy Happy Joy Joy" string, how do I get it?
In my case, the hash-map is what I have to work with... I didn't create the string value inside a list. I understand Clojure treats the string as a function because it's in the calling position.
If you're defining that list literally in your code, you'll need to "quote" it so that it isn't evaluated as a function:
user=> (:content {:foo "bar" :biz "baf" :content '("Happy Happy Joy Joy")})
("Happy Happy Joy Joy")
The only difference here is the ' character before the opening list parenthesis. You could also use the list function.
If you want just the first item in the :content list, you can then use first:
user=> (first (:content {:foo "bar" :biz "baf" :content '("Happy Happy Joy Joy")}))
"Happy Happy Joy Joy"
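As a small illustration of the list alternative mentioned above (the var names m-quoted and m-listed are made up for this sketch), both spellings produce the same data, and the string can be pulled out with first:

```clojure
;; Quoting the list literal stops it from being evaluated as a call.
(def m-quoted {:foo "bar" :biz "baf" :content '("Happy Happy Joy Joy")})

;; Building the list with the list function avoids the quote entirely.
(def m-listed {:foo "bar" :biz "baf" :content (list "Happy Happy Joy Joy")})

(= m-quoted m-listed)
;; => true

(first (:content m-listed))
;; => "Happy Happy Joy Joy"

;; The same lookup written with the thread-first macro.
(-> m-quoted :content first)
;; => "Happy Happy Joy Joy"
```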
What you type at the REPL has to include quote (') literals to prevent the error message you are getting, whereas data returned from a function will not have quotes in it. So just play with it a bit for the real (non-REPL) case.
(def x '(:content {:foo "bar" :biz "baf" :content '("Happy Happy Joy Joy")}))
(-> x second :content second first)
;;=> "Happy Happy Joy Joy"
For the real case (-> x second :content first) might be what you want, where of course x is the function call.
If as you say it is only the hash-map (m) you are concerned with then (-> m :content first) should do the trick.
One solution to the mismatch between the REPL and reality is to just use vectors instead of lists:
(def x [:content {:foo "bar" :biz "baf" :content ["Happy Happy Joy Joy"]}])
Here (-> x second :content first) will indeed work.
The other answers did not fully clarify the effect of quote. Please see this code:
(ns tst.demo.core
(:use tupelo.test)
(:require
[tupelo.core :as t] ))
; Note:
; (def data {:foo "bar" :biz "baf" :content ("Happy Happy Joy Joy")})
; => exception
(def data-1 '{:foo "bar" :biz "baf" :content ("Happy Happy Joy Joy")})
(def data-2 {:foo "bar" :biz "baf" :content '("Happy Happy Joy Joy")})
(def data-3 (quote {:foo "bar" :biz "baf" :content ("Happy Happy Joy Joy")}))
(dotest
(is= data-1 data-2 data-3)
(is= "Happy Happy Joy Joy" (first (:content data-1)))
(is= "Happy Happy Joy Joy" (first (:content data-2)))
(is= "Happy Happy Joy Joy" (first (:content data-3))))
So, data-1 shows we can quote the entire expression at the outer level, and data-2 shows we can also quote each list expression (stuff in parens) to suppress the "function call" interpretation of a "list" type in Clojure.
data-3 shows that the single-quote char ' is just short for the special form (quote ...) in Clojure.
Once you get the data literal form right, we see that data-1 and data-2 and data-3 actually result in identical data structures after being processed by the reader.
The last 3 tests show the proper syntax for extracting the string of interest from any of the 3 data structures.
P.S. The testing stuff dotest and is= is from the Tupelo library.

How do I iterate through a nested dict/hash-map in Clojure to custom-flatten/transform my data structure?

I have something that looks like this:
{:person-123 {:xxx [1 5]
:zzz [2 3 4]}
:person-456 {:yyy [6 7]}}
And I want to transform it so it looks like this:
[{:person "123" :item "xxx"}
{:person "123" :item "zzz"}
{:person "456" :item "yyy"}]
This is a flatten-like problem, and I know I can convert the keywords into strings by calling name on them, but I couldn't come across a convenient way to do this.
This is how I did it, but it seems inelegant (nested for loops, I'm looking at you):
(require '[clojure.string :refer [split]])
(into []
(apply concat
(for [[person nested-data] input-data]
(for [[item _] nested-data]
{:person (last (split (name person) #"person-"))
:item (name item)}))))
Your solution is not too bad. As for the nested for loops: for actually supports nested loops, so you could write it as:
(vec
(for [[person nested-data] input-data
[item _] nested-data]
{:person (last (clojure.string/split (name person) #"person-"))
:item (name item)}))
Personally, I tend to use for exclusively for that purpose (nested loops); otherwise I am usually more comfortable with map et al. But that's just a personal preference.
I also very much agree with @amalloy's comment on the question: I would put some effort into having a better-looking map structure to begin with.
(let [x {:person-123 {:xxx [1 5]
:zzz [2 3 4]}
:person-456 {:yyy [6 7]}}]
(clojure.pprint/pprint
(mapcat
(fn [[k v]]
(map (fn [[k1 v1]]
{:person (clojure.string/replace (name k) #"person-" "") :item (name k1)}) v))
x))
)
I am not sure if there is a single higher-order function, at least in the core, that does what you want in one go.
On the other hand, similar methods exist in the GNU R reshape library, which, by the way, has been recreated for Clojure:
https://crossclj.info/ns/grafter/0.8.6/grafter.tabular.melt.html#_melt-column-groups which might interest you.
This is how it works in GNU R: http://www.statmethods.net/management/reshape.html
Lots of good solutions so far. All I would add is a simplification with keys:
(vec
(for [[person nested-data] input-data
item (map name (keys nested-data))]
{:person (clojure.string/replace-first
(name person)
#"person-" "")
:item item}))
Note btw the near universal preference for replace over last/split. Guessing the spirit of the transformation is "lose the leading person- prefix", replace says that better. If OTOH the spirit is "find the number and use that", a bit of regex to isolate the digits would be truer.
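If the intent really is "find the number and use that", a minimal sketch with re-find could look like the following. The function name person-items is made up for illustration, and the sketch assumes each outer key contains exactly one run of digits.

```clojure
(def input-data
  {:person-123 {:xxx [1 5]
                :zzz [2 3 4]}
   :person-456 {:yyy [6 7]}})

(defn person-items
  "Flattens data into a vector of {:person ... :item ...} maps,
  taking the first run of digits in each outer key as the person."
  [data]
  (vec
    (for [[person nested-data] data
          item (keys nested-data)]
      {:person (re-find #"\d+" (name person))  ; nil if the key has no digits
       :item (name item)})))

(person-items input-data)
;; => [{:person "123", :item "xxx"}
;;     {:person "123", :item "zzz"}
;;     {:person "456", :item "yyy"}]
```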
(reduce-kv (fn [ret k v]
(into ret (map (fn [v-k]
{:person (last (str/split (name k) #"-"))
:item (name v-k)})
(keys v))))
[]
{:person-123 {:xxx [1 5] :zzz [2 3 4]}
:person-456 {:yyy [6 7]}})
=> [{:person "123", :item "xxx"}
{:person "123", :item "zzz"}
{:person "456", :item "yyy"}]
Here are three solutions.
The first solution uses Python-style lazy generator functions via lazy-gen and yield functions from the Tupelo library. I think this method is the simplest since the inner loop produces maps and the outer loop produces a sequence. Also, the inner loop can run zero, one, or multiple times for each outer loop. With yield you don't need to think about that part.
(ns tst.clj.core
(:use clj.core clojure.test tupelo.test)
(:require
[clojure.string :as str]
[clojure.walk :as walk]
[clojure.pprint :refer [pprint]]
[tupelo.core :as t]
[tupelo.string :as ts]
))
(t/refer-tupelo)
(def data
{:person-123 {:xxx [1 5]
:zzz [2 3 4]}
:person-456 {:yyy [6 7]}})
(defn reformat-gen [data]
(t/lazy-gen
(doseq [[outer-key outer-val] data]
(let [int-str (str/replace (name outer-key) "person-" "")]
(doseq [[inner-key inner-val] outer-val]
(let [inner-key-str (name inner-key)]
(t/yield {:person int-str :item inner-key-str})))))))
If you really want to be "pure", the following is another solution. However, with this solution I made a couple of errors and required many, many debug printouts to fix. This version uses tupelo.core/glue instead of concat since it is "safer" and verifies that the collections are all maps, all vectors/list, etc.
(defn reformat-glue [data]
(apply t/glue
(forv [[outer-key outer-val] data]
(let [int-str (str/replace (name outer-key) "person-" "")]
(forv [[inner-key inner-val] outer-val]
(let [inner-key-str (name inner-key)]
{:person int-str :item inner-key-str}))))))
Both methods give the same answer:
(newline) (println "reformat-gen:")
(pprint (reformat-gen data))
(newline) (println "reformat-glue:")
(pprint (reformat-glue data))
reformat-gen:
({:person "123", :item "xxx"}
{:person "123", :item "zzz"}
{:person "456", :item "yyy"})
reformat-glue:
[{:person "123", :item "xxx"}
{:person "123", :item "zzz"}
{:person "456", :item "yyy"}]
If you wanted to be "super-pure", here is a third solution (although I think this one is trying too hard!). Here we use the ability of the for macro to have nested elements in a single expression. for can also embed let expressions inside itself, although here that leads to duplicate evaluation of int-str.
(defn reformat-overboard [data]
(for [[outer-key outer-val] data
[inner-key inner-val] outer-val
:let [int-str (str/replace (name outer-key) "person-" "") ; duplicate evaluation
inner-key-str (name inner-key)]]
{:person int-str :item inner-key-str}))
(newline)
(println "reformat-overboard:")
(pprint (reformat-overboard data))
reformat-overboard:
({:person "123", :item "xxx"}
{:person "123", :item "zzz"}
{:person "456", :item "yyy"})
I would probably stick with the first one since it is (at least to me) much simpler and more bulletproof. YMMV.
Update:
Notice that the 3rd method yields a single sequence of maps, even though there are 2 nested for iterations happening. This is different than having two nested for expressions, which would yield a sequence of a sequence of maps.
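A small illustration of that difference, using made-up inputs rather than the data above:

```clojure
;; One for with two bindings: a single flat sequence.
(def flat
  (for [x [1 2]
        y [:a :b]]
    [x y]))
;; flat => ([1 :a] [1 :b] [2 :a] [2 :b])

;; Two nested for expressions: a sequence of sequences.
(def nested
  (for [x [1 2]]
    (for [y [:a :b]]
      [x y])))
;; nested => (([1 :a] [1 :b]) ([2 :a] [2 :b]))

;; Concatenating the inner sequences recovers the flat form.
(= flat (apply concat nested))
;; => true
```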

Deep data structure match & replace first

I'm trying to figure out an idiomatic, performant, and/or highly functional way to do the following:
I have a sequence of maps that looks like this:
({:_id "abc" :related ({:id "123"} {:id "234"})}
{:_id "bcd" :related ({:id "345"} {:id "456"})}
{:_id "cde" :related ({:id "234"} {:id "345"})})
The :id fields can be assumed to be unique within any one :_id.
In addition, I have two sets:
ids like ("234" "345") and
substitutes like ({:id "111"} {:id "222"})
Note that the fact that substitutes only has :id in this example doesn't mean it can be reduced to a collection of ids. This is a simplified version of a problem and the real data has other key/value pairs in the map that have to come along.
I need to return a new sequence that is the same as the original but with the values from substitutes replacing the first occurrence of the matching id from ids in the :related collections of all of the items. So what the final collection should look like is:
({:_id "abc" :related ({:id "123"} {:id "111"})}
{:_id "bcd" :related ({:id "222"} {:id "456"})}
{:_id "cde" :related ({:id "234"} {:id "345"})})
I'm sure I could eventually code up something that involves nesting maps and conditionals (thinking in iterative terms about loops of loops) but that feels to me like I'm not thinking functionally or cleverly enough given the tools I might have available, either in clojure.core or extensions like match or walk (if those are even the right libraries to be looking at).
Also, it feels like it would be much easier without the requirement to limit it to a particular strategy (namely, subbing on the first match only, ignoring others), but that's a requirement. And ideally, a solution would be adaptable to a different strategy down the line (e.g. a single, but randomly positioned match). The one invariant to strategy is that each id/sub pair should be used only once. So:
Replace one, and one only, occurrence of a :related value whose :id matches a value from ids with the corresponding value from substitutes, where the one occurrence is the first (or nth or rand-nth...) occurrence.
(def id-mapping (zipmap ids
(map :id substitutes)))
;; id-mapping -> {"345" "222", "234" "111"}
(clojure.walk/prewalk-replace id-mapping original)
Assuming that the collection is called results:
(require '[clojure.zip :as z])
(defn modify-related
[results id sub]
(loop [loc (z/down (z/seq-zip results))
done? false]
(if (= done? true)
(z/root loc)
(let [change? (->> loc z/node :_id (= id))]
(recur (z/next (cond change?
(z/edit loc (fn [_] identity sub))
:else loc))
change?)))))
(defn modify-results
[results id sub]
(loop [loc (z/down (z/seq-zip results))
done? false]
(if (= done? true)
(z/root loc)
(let [related (->> loc z/node :related)
change? (->> related (map :_id) set (#(contains? % id)))]
(recur (z/next (cond change?
(z/edit loc #(assoc % :related (modify-related related id sub)))
:else loc))
change?)))))
(defn sub-for-first
[results ids substitutes]
(let [subs (zipmap ids substitutes)]
(reduce-kv modify-results results subs)))
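The "first occurrence only" requirement can also be expressed without zippers. What follows is a minimal sketch under stated assumptions, not a tuned solution: the names sub-first and sub-all are hypothetical, ids and substitutes are assumed to line up positionally, and only the first-match strategy is covered. It threads a "done" flag through a reduce so that each id/sub pair is applied at most once.

```clojure
(def original
  '({:_id "abc" :related ({:id "123"} {:id "234"})}
    {:_id "bcd" :related ({:id "345"} {:id "456"})}
    {:_id "cde" :related ({:id "234"} {:id "345"})}))

(def ids ["234" "345"])
(def substitutes [{:id "111"} {:id "222"}])

(defn sub-first
  "Replaces the first :related entry (scanning docs in order) whose :id
  equals id with sub; later matches are left alone."
  [docs id sub]
  (first
    (reduce
      (fn [[out done?] doc]
        (let [[before [hit & after]]
              (split-with #(not= id (:id %)) (:related doc))]
          (if (and (not done?) hit)
            [(conj out (assoc doc :related (concat before [sub] after))) true]
            [(conj out doc) done?])))
      [[] false]
      docs)))

(defn sub-all
  "Applies sub-first once for each id/substitute pair."
  [docs ids substitutes]
  (reduce (fn [ds [id sub]] (sub-first ds id sub))
          docs
          (map vector ids substitutes)))

(sub-all original ids substitutes)
;; => [{:_id "abc", :related ({:id "123"} {:id "111"})}
;;     {:_id "bcd", :related ({:id "222"} {:id "456"})}
;;     {:_id "cde", :related ({:id "234"} {:id "345"})}]
```

A different strategy (nth or random match) would need a different sub-first, for example one that first collects the positions of all matches and then picks one of them; sub-all could stay the same.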

Clojure: Custom functions inside Enlive selectors?

Here is an example where I use html/text directly inside a selector vector.
(require '[net.cgrand.enlive-html :as html])
(defn fetch-url [url]
(html/html-resource (java.net.URL. url)))
(defn parse-test []
(html/select
(fetch-url "https://news.ycombinator.com/")
[:td.title :a html/text]))
Calling (parse-test) returns a data structure containing Hacker News headlines:
("In emergency cases a passenger was selected and thrown out of the plane. [2004]"
"“Nobody expects privacy online”: Wrong."
"The SCUMM Diary: Stories behind one of the greatest game engines ever made" ...)
Cool!
Would it be possible to end the selector vector with a custom function that would give me back the list of article URLs?
Something like: [:td.title :a #(str "https://news.ycombinator.com/" (:href (:attrs %)))]
EDIT:
Here is a way to achieve this. We could write our own select function:
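(defn select+ [coll selector+]
  (map
    (peek selector+)       ; the trailing function in the selector vector
    (html/select
      coll                 ; use the nodes passed in instead of fetching again
      (pop selector+))))   ; the real Enlive selector, minus the function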
(def href
(fn [node] (:href (:attrs node))))
(defn parse-test []
(select+
(fetch-url "https://news.ycombinator.com/")
[:td.title :a href]))
(parse-test)
As you suggest in your comment, I think it's clearest to keep the selection and the transformation of nodes separate.
Enlive itself provides both selectors and transformers. Selectors to find nodes, and transformers to, um, transform them. If your intended output was html, you could probably use a combination of a selector and a transformer to achieve your desired result.
However, seeing as you are just looking for data (a sequence of maps, perhaps?) - you can skip the transform bit, and just use a sequence comprehension, like this:
(defn parse-test []
(for [s (html/select
(fetch-url "https://news.ycombinator.com/")
[:td.title :a])]
{:title (first (:content s))
:link (:href (:attrs s))}))
(take 2 (parse-test))
;; => ({:title " \tStartup - Bill Watterson, a cartoonist's advice ",
:link "http://www.zenpencils.com/comic/128-bill-watterson-a-cartoonists-advice"}
{:title "Drug Agents Use Vast Phone Trove Eclipsing N.S.A.’s",
:link "http://www.nytimes.com/2013/09/02/us/drug-agents-use-vast-phone-trove-eclipsing-nsas.html?hp&_r=0&pagewanted=all"})