Clojure: Custom functions inside Enlive selectors? - clojure

Here is an example where I use html/text directly inside a selector vector.
(:use [net.cgrand.enlive-html :as html])
(defn fetch-url [url]
(html/html-resource (java.net.URL. url)))
(defn parse-test []
(html/select
(fetch-url "https://news.ycombinator.com/")
[:td.title :a html/text]))
Calling (parse-test) returns a data structure containing Hacker News Headlines :
("In emergency cases a passenger was selected and thrown out of the plane. [2004]"
"“Nobody expects privacy online”: Wrong."
"The SCUMM Diary: Stories behind one of the greatest game engines ever made" ...)
Cool!
Would it be possible to end the selector vector with a custom function that would give me back the list of article URLs.
Something like: [:td.title :a #(str "https://news.ycombinator.com/" (:href (:attrs %)))]
EDIT:
Here is a way to achieve this. We could write our own select function:
(defn select+ [coll selector+]
(map
(peek selector+)
(html/select
(fetch-url "https://news.ycombinator.com/")
(pop selector+))))
(def href
(fn [node] (:href (:attrs node))))
(defn parse-test []
(select+
(fetch-url "https://news.ycombinator.com/")
[:td.title :a href]))
(parse-test)

As you suggest in your comment, I think it's clearest to keep the selection and the transformation of nodes separate.
Enlive itself provides both selectors and transformers. Selectors to find nodes, and transformers to, um, transform them. If your intended output was html, you could probably use a combination of a selector and a transformer to achieve your desired result.
However, seeing as you are just looking for data (a sequence of maps, perhaps?) - you can skip the transform bit, and just use a sequence comprehension, like this:
(defn parse-test []
(for [s (html/select
(fetch-url "https://news.ycombinator.com/")
[:td.title :a])]
{:title (first (:content s))
:link (:href (:attrs s))}))
(take 2 (parse-test))
;; => ({:title " \tStartup - Bill Watterson, a cartoonist's advice ",
:link "http://www.zenpencils.com/comic/128-bill-watterson-a-cartoonists-advice"}
{:title "Drug Agents Use Vast Phone Trove Eclipsing N.S.A.’s",
:link "http://www.nytimes.com/2013/09/02/us/drug-agents-use-vast-phone-trove-eclipsing-nsas.html?hp&_r=0&pagewanted=all"})

Related

How can I filter JSON data by a given date in Clojure?

I have many JSON objects, and I am trying to filter those objects by the date. These objects are being parsed from several JSON files using Cheshire.core, meaning that the JSON objects are in a collection. The date is being passed in in the following format "YYYY-MM-DD" (eg. 2015-01-10). I have tried using the filter and contains? functions to do this, but I am having no luck so far. How can I filter these JSON objects by my chosen date?
Current Clojure code:
(def filter-by-date?
(fn [orders-data date-chosen]
(contains? (get (get orders-data :date) :date) date-chosen)))
(prn (filter (filter-by-date? orders-data "2017-12-25")))
Example JSON object:
{
"id":"05d8d404-b3f6-46d1-a0f9-dbdab7e0261f",
"date":{
"date":"2015-01-10T19:11:41.000Z"
},
"total":{
"GBP":57.45
}
}
JSON after parsing with Cheshire:
[({:id "05d8d404-b3f6-46d1-a0f9-dbdab7e0261f",
:date {:date "2015-01-10T19:11:41.000Z"},
:total {:GBP 57.45}}) ({:id "325bd04-b3f6-46d1-a0f9-dbdab7e0261f",
:date {:date "2015-02-23T10:15:14.000Z"},
:total {:GBP 32.90}})]
First, I'm going to assume you've parsed the JSON first into something like this:
(def parsed-JSON {:id "05d8d404-b3f6-46d1-a0f9-dbdab7e0261f",
:date {:date "2015-01-10T19:11:41.000Z"},
:total {:GBP 57.45}})
The main problem is the fact that the date as stored in the JSON contains time information, so you aren't going to be able to check it directly using equality.
You can get around this by using clojure.string/starts-with? to check for prefixes. I'm using s/ here as an alias for clojure.string:
(defn filter-by-date [date jsons]
(filter #(s/starts-with? (get-in % [:date :date]) date)
jsons))
You were close, but I made a few changes:
You can't use contains? like that. From the docs of contains?: Returns true if key is present in the given collection, otherwise returns false. It can't be used to check for substrings; it's used to test for the presence of a key in a collection.
Use -in postfix versions to access nested structures instead of using multiple calls. I'm using (get-in ...) here instead of (get (get ...)).
You're using (def ... (fn [])) which makes things more complicated than they need to be. This is essentially what defn does, although defn also adds some more stuff as well.
To address the new information, you can just flatten the nested sequences containing the JSONs first:
(->> nested-json-colls ; The data at the bottom of the question
(flatten)
(filter-by-date "2015-01-10"))
#!/usr/bin/env boot
(defn deps [new-deps]
(merge-env! :dependencies new-deps))
(deps '[[org.clojure/clojure "1.9.0"]
[cheshire "5.8.0"]])
(require '[cheshire.core :as json]
'[clojure.string :as str])
(def orders-data-str
"[{
\"id\":\"987654\",
\"date\":{
\"date\":\"2015-01-10T19:11:41.000Z\"
},
\"total\":{
\"GBP\":57.45
}
},
{
\"id\":\"123456\",
\"date\":{
\"date\":\"2016-01-10T19:11:41.000Z\"
},
\"total\":{
\"GBP\":23.15
}
}]")
(def orders (json/parse-string orders-data-str true))
(def ret (filter #(clojure.string/includes? (get-in % [:date :date]) "2015-01-") orders))
(println ret) ; ({:id 987654, :date {:date 2015-01-10T19:11:41.000Z}, :total {:GBP 57.45}})
You can convert the date string to Date object using any DateTime library like joda-time and then do a proper filter if required.
clj-time has functions for parsing strings and comparing date-time objects. So you could do something like:
(ns filter-by-time-example
(:require [clj-time.coerce :as tc]
[clj-time.core :as t]))
(def objs [{"id" nil
"date" {"date" "2015-01-12T19:11:41.000Z"}
"total" nil}
{"id" "05d8d404-b3f6-46d1-a0f9-dbdab7e0261f"
"date" {"date" "2015-01-10T19:11:41.000Z"}
"total" {"GBP" :57.45}}
{"id" nil
"date" {"date" "2015-01-11T19:11:41.000Z"}
"total" nil}])
(defn filter-by-day
[objs y m d]
(let [start (t/date-time y m d)
end (t/plus start (t/days 1))]
(filter #(->> (get-in % ["date" "date"])
tc/from-string
(t/within? start end)) objs)))
(clojure.pprint/pprint (filter-by-day objs 2015 1 10)) ;; Returns second obj
If you're going to repeatedly do this (e.g. for multiple days) you could parse all dates in your collection into date-time objects with
(map #(update-in % ["date" "date"] tc/from-string) objs)
and then just work with that collection to avoid repeating the parsing step.
(ns filter-by-time-example
(:require [clj-time.format :as f]
[clj-time.core :as t]
[cheshire.core :as cheshire]))
(->> json-coll
(map (fn [json] (cheshire/parse-string json true)))
(map (fn [record] (assoc record :dt-date (f/format (get-in record [:date :date])))))
(filter (fn [record] (t/after? (tf/format "2017-12-25") (:dt-date record))))
(map (fn [record] (dissoc record :dt-date))))
Maybe something like this? You might need to change the filter for your usecase but as :dt-time is now a jodo.DateTime you can leverage all the clj-time predicates.

How do I loop through a subscribed collection in re-frame and display the data as a list-item?

Consider the following clojurescript code where the specter, reagent and re-frame frameworks are used, an external React.js grid component is used as a view component.
In db.cls :
(def default-db
{:cats [{:id 0 :data {:text "ROOT" :test 17} :prev nil :par nil}
{:id 1 :data {:text "Objects" :test 27} :prev nil :par 0}
{:id 2 :data {:text "Version" :test 37} :prev nil :par 1}
{:id 3 :data {:text "X1" :test 47} :prev nil :par 2}]})
In subs.cls
(register-sub
:cats
(fn [db]
(reaction
(select [ALL :data] (t/tree-visitor (get #db :cats))))))
result from select:
[{:text "ROOT", :test 17}
{:text "Objects", :test 27}
{:text "Version", :test 37}
{:text "X1", :test 47}]
In views.cls
(defn categorymanager []
(let [cats (re-frame/subscribe [:cats])]
[:> Reactable.Table
{:data (clj->js #cats)}]))
The code above works as expected.
Instead of displaying the data with the react.js component I want to go through each of the maps in the :cats vector and display the :text items in html ul / li.
I started as follows:
(defn categorymanager2 []
(let [cats (re-frame/subscribe [:cats])]
[:div
[:ul
(for [category #cats]
;;--- How to continue here ?? ---
)
))
Expected output:
ROOT
Objects
Version
X1
How do I loop through a subscribed collection in re-frame and display the data as a list-item? ( = question for title ).
First, be clear why you use key...
Supplying a key for each item in a list is useful when that list is quite dynamic - when new list items are being regularly added and removed, especially if that list is long, and the items are being added/removed near the top of the list.
keys can deliver big performance gains, because they allow React to more efficiently redraw these changeable lists. Or, more accurately, it allows React to avoid redrawing items which have the same key as last time, and which haven't changed, and which have simply shuffled up or down.
Second, be clear what you should do if the list is quite static (it does not change all the time) OR if there is no unique value associated with each item...
Don't use :key at all. Instead, use into like this:
(defn categorymanager []
(let [cats (re-frame/subscribe [:cats])]
(fn []
[:div
(into [:ul] (map #(vector :li (:text %)) #cats))])))
Notice what has happened here. The list provided by the map is folded into the [:ul] vector. At the end of it, no list in sight. Just nested vectors.
You only get warnings about missing keys when you embed a list into hiccup. Above there is no embedded list, just vectors.
Third, if your list really is dynamic...
Add a unique key to each item (unique amoung siblings). In the example given, the :text itself is a good enough key (I assume it is unique):
(defn categorymanager []
(let [cats (re-frame/subscribe [:cats])]
(fn []
[:div
[:ul (map #(vector :li {:key (:text %)} (:text %)) #cats)]])))
That map will result in a list which is the 1st parameter to the [:ul]. When Reagent/React sees that list it will want to see keys on each item (remember lists are different to vectors in Reagent hiccup) and will print warnings to console were keys to be missing.
So we need to add a key to each item of the list. In the code above we aren't adding :key via metadata (although you can do it that way if you want), and instead we are supplying the key via the 1st parameter (of the [:li]), which normally also carries style data.
Finally - part 1 DO NOT use map-indexed as is suggested in another answer.
key should be a unique value associated with each item. Attaching some arb integer does nothing useful - well, it does get rid of the warnings in the console, but you should use the into technique above if that's all you want.
Finally - part 2 there is no difference between map and for in this context.
They both result in a list. If that list has keys then no warning. But if keys are missing, then lots of warnings. But how the list was created doesn't come into it.
So, this for version is pretty much the same as the map version. Some may prefer it:
(defn categorymanager []
(let [cats (re-frame/subscribe [:cats])]
(fn []
[:div
[:ul (for [i #cats] [:li {:key (:text i)} (:text i)])]])))
Which can also be written using metadata like this:
(defn categorymanager []
(let [cats (re-frame/subscribe [:cats])]
(fn []
[:div
[:ul (for [i #cats] ^{:key (:text i)}[:li (:text i)])]])))
Finally - part 3
mapv is a problem because of this issue:
https://github.com/Day8/re-frame/wiki/Using-%5Bsquare-brackets%5D-instead-of-%28parentheses%29#appendix-2
Edit: For a much more coherent and technically correct explanation of keys and map, see Mike Thompson's answer!
Here's how I would write it:
(defn categorymanager2 []
(let [cats (re-frame/subscribe [:cats])]
(fn []
[:div
[:ul
(map-indexed (fn [n cat] ;;; !!! See https://stackoverflow.com/a/37186230/500207 !!!
^{:key n}
[:li (:text cat)])
#cats)]])))
(defn main-panel []
[:div
[categorymanager2]])
A few points:
See the re-frame readme's Subscribe section, near the end, which says:
subscriptions can only be used in Form-2 components and the subscription must be in the outer setup function and not in the inner render function. So the following is wrong (compare to the correct version above)…
Therefore, your component was ‘wrong’ because it didn't wrap the renderer inside an inner function. The readme has all the details, but in short, not wrapping a component renderer that depends on a subscription inside an inner function is bad because this causes the component to rerender whenever db changes—not what you want! You want the component to only rerender when the subscription changes.
Edit: seriously, see Mike Thompson's answer. For whatever reason, I prefer using map to create a seq of Hiccup tags. You could use a for loop also, but the critical point is that each [:li] Hiccup vector needs a :key entry in its meta-data, which I add here by using the current category's index in the #cats vector. If you don't have a :key, React will complain in the Dev Console. Note that this key should somehow uniquely tie this element of #cats to this tag: if the cats subscription changes and gets shuffled around, the result might not be what you expect because I just used this very simple key. If you can guarantee that category names will be unique, you can just use the :test value, or the :test value, or something else. The point is, the key must be unique and must uniquely identify this element.
(N.B.: don't try and use mapv to make a vector of Hiccup tags—re-frame hates that. Must be a seq like what map produces.)
I also included an example main-panel to emphasize that
parent components don't need the subscriptions that their children component need, and that
you should call categorymanager2 component with square-brackets instead of as a function with parens (see Using [] instead of ()).
Here's an ul / li example:
(defn phone-component
[phone]
[:li
[:span (:name #phone)]
[:p (:snippet #phone)]])
(defn phones-component
[]
(let [phones (re-frame/subscribe [:phones])] ; subscribe to the phones value in our db
(fn []
[:ul (for [phone in #phones] ^{:key phone} [phone-component phone] #phones)])))
I grabbed that code from this reframe tutorial.
Also map is preferable to for when using Reagent. There is a technical reason for this, it is just that I don't know what it is.

Clojure working with records

I have a set of values in Clojure that I want to structure similar to record. I'm trying to figure out the best way of handling a set of those records.
So I have a for example a record:
(defrecord Link [page url])
Whats the best datastructure to hold a collection of these records that I can step through recursively, while continuously updating the collection?
Previously I've done this on a single value using sequence, then concat'ing new links on the end as i process them recursively. But now I want to hold more information about each link.
Edit For Clarity
I had previously been using maps, however I think I've been confusing myself by trying to use a nested map with like
#{:rootlink "http://www.google.co.uk" :links nestedmapoflinks}
which is confusing me when i try to re curse through it.
Below is the code that I've been using, below is what currently works with a sequence of links but no other info about the link.
(defn get-links
[url]
(map :href (map :attrs (html/select (fetch-url url) [:a])))))
(defn process-links
[links]
(if (not (empty? links))
(do
(if (not (is-working (first links)))
(do
(println (str (first links) " is not working"))
(recur (rest links)))
(do
(println (str (first links) " is working"))
(recur (concat (rest links) (get-links (first links)))))))))
I think I have to add each item into the map with
{:rootlink "http://www.google.co.uk" :link "http://someurlontherootlinkpage.com"}
instead of trying to work with the nested map.
However, the reason I mentioned records, because I was struggling to merge two maps together using the first method of map creation. I'm still a bit confused about the best structure to use to recurse through the map.
Final Update
Ok, so after much wrangling I finally came up with this below piece of code which returns a seq of vectors made up of:
["root link address" "link"]
["http://www.google.co.uk" "http://www.google.co.uk/examplelink"]
Code:
(defn get-links
[url]
(map #(vector url %)(map :href (map :attrs (html/select (fetch-url url) [:a])))))
The code is now on my github available in my profile.
I think you are getting confused between using Tree type structure or a flat structure.
Lets say you have a list of links as a vector of maps:
[ {:root nil :link "A.COM"} {:root nil :link "B.COM"} ]
Now you map over it and using your get-link method you get:
[ [ {:root nil :link "A.COM"} {:root "A.COM" :link "Aa.COM"} {:root "A.COM" :link "Ab.COM"} ] [ {:root nil :link "B.COM"} {:root "B.COM" :link "Ba.COM"} {:root "B.COM" :link "Bb.COM"}] ]
Now you can call flatten on this result to get a flat list of link instead of nested map in vector.
You can repeat this process recursively till you exit condition met.

Rescraping data with Enlive

I tried to create function to scrape and tags from HTML page, whose URL I provide to a function, and this works as it should. I get sequence of <h3> and <table> elements, when I try to use select function to extract only table or h3 tags from resulting sequence,
I get (), or if I try to map those tags I get (nil nil nil ...).
Could you please help me to resolve this issue, or explain me what am I doing wrong?
Here is the code:
(ns Test2
(:require [net.cgrand.enlive-html :as html])
(:require [clojure.string :as string]))
(defn get-page
"Gets the html page from passed url"
[url]
(html/html-resource (java.net.URL. url)))
(defn h3+table
"returns sequence of <h3> and <table> tags"
[url]
(html/select (get-page url)
{[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :h3]
[:div#wrap :div#middle :div#content :div#prospekt :div#prospekt_container :table]}
))
(def url "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
This line gives me headache :
(html/select (h3+table url) [:table])
Could you please tell me what am I doing wrong?
Just to clarify my question: is it possible to use enlive's select function to extract only table tags from result of (h3+table url) ?
As #Julien pointed out, you will probably have to work with the deeply nested tree structure that you get from applying (html/select raw-html selectors) on the raw html. It seems like you try to apply html/select multiple times, but this doesn't work. html/select parses html into a clojure datastructure, so you can't apply it on that datastructure again.
I found that parsing the website was actually a little involved, but I thought that it might be a nice use case for multimethods, so I hacked something together, maybe this will get you started:
(The code is ugly here, you can also checkout this gist)
(ns tutorial.scrape1
(:require [net.cgrand.enlive-html :as html]))
(def *url* "http://www.belex.rs/trgovanje/prospekt/VZAS/show")
(defn get-page [url]
(html/html-resource (java.net.URL. url)))
(defn content->string [content]
(cond
(nil? content) ""
(string? content) content
(map? content) (content->string (:content content))
(coll? content) (apply str (map content->string content))
:else (str content)))
(derive clojure.lang.PersistentStructMap ::Map)
(derive clojure.lang.PersistentArrayMap ::Map)
(derive java.lang.String ::String)
(derive clojure.lang.ISeq ::Collection)
(derive clojure.lang.PersistentList ::Collection)
(derive clojure.lang.LazySeq ::Collection)
(defn tag-type [node]
(case (:tag node)
:tr ::CompoundNode
:table ::CompoundNode
:th ::TerminalNode
:td ::TerminalNode
:h3 ::TerminalNode
:tbody ::IgnoreNode
::IgnoreNode))
(defmulti parse-node
(fn [node]
(let [cls (class node)] [cls (if (isa? cls ::Map) (tag-type node) nil)])))
(defmethod parse-node [::Map ::TerminalNode] [node]
(content->string (:content node)))
(defmethod parse-node [::Map ::CompoundNode] [node]
(map parse-node (:content node)))
(defmethod parse-node [::Map ::IgnoreNode] [node]
(parse-node (:content node)))
(defmethod parse-node [::String nil] [node]
node)
(defmethod parse-node [::Collection nil] [node]
(map parse-node node))
(defn h3+table [url]
(let [ws-content (get-page url)
h3s+tables (html/select ws-content #{[:div#prospekt_container :h3]
[:div#prospekt_container :table]})]
(for [node h3s+tables] (parse-node node))))
A few words on what's going on:
content->string takes a data structure and collects its content into a string and returns that so you can apply this to content that may still contain nested subtags (like <br/>) that you want to ignore.
The derive statements establish an ad hoc hierarchy which we will later use in the multi-method parse-node. This is handy because we never quite know which data structures we're going to encounter and we could easily add more cases later on.
The tag-type function is actually a hack that mimics the hierarchy statements - AFAIK you can't create a hierarchy out of non-namespace qualified keywords, so I did it like this.
The multi-method parse-node dispatches on the class of the node and if the node is a map additionally on the tag-type.
Now all we have to do is define the appropriate methods: If we're at a terminal node we convert the contents to a string, otherwise we either recur on the content or map the parse-node function on the collection we're dealing with. The method for ::String is actually not even used, but I left it in for safety.
The h3+table function is pretty much what you had before, I simplified the selectors a bit and put them into a set, not sure if putting them into a map as you did works as intended.
Happy scraping!
Your question is hard to understand, but I think your last line should simply be
(h3+table url)
This will return a deeply nested data structure containing scraped HTML that you can then dive into with the usual Clojure sequence APIs. Good luck.

How to parse URL parameters in Clojure?

If I have the request "size=3&mean=1&sd=3&type=pdf&distr=normal" what's the idiomatic way of writing the function (defn request->map [request] ...) that takes this request and
returns a map {:size 3, :mean 1, :sd 3, :type pdf, :distr normal}
Here is my attempt (using clojure.walk and clojure.string):
(defn request-to-map
[request]
(keywordize-keys
(apply hash-map
(split request #"(&|=)"))))
I am interested in how others would solve this problem.
Using form-decode and keywordize-keys:
(use 'ring.util.codec)
(use 'clojure.walk)
(keywordize-keys (form-decode "hello=world&foo=bar"))
{:foo "bar", :hello "world"}
Assuming you want to parse HTTP request query parameters, why not use ring? ring.middleware.params contains what you want.
The function for parameter extraction goes like this:
(defn- parse-params
"Parse parameters from a string into a map."
[^String param-string encoding]
(reduce
(fn [param-map encoded-param]
(if-let [[_ key val] (re-matches #"([^=]+)=(.*)" encoded-param)]
(assoc-param param-map
(codec/url-decode key encoding)
(codec/url-decode (or val "") encoding))
param-map))
{}
(string/split param-string #"&")))
You can do this easily with a number of Java libraries. I'd be hesitant to try to roll my own parser unless I read the URI specs carefully and made sure I wasn't missing any edge cases (e.g. params appearing in the query twice with different values). This uses jetty-util:
(import '[org.eclipse.jetty.util UrlEncoded MultiMap])
(defn parse-query-string [query]
(let [params (MultiMap.)]
(UrlEncoded/decodeTo query params "UTF-8")
(into {} params)))
user> (parse-query-string "size=3&mean=1&sd=3&type=pdf&distr=normal")
{"sd" "3", "mean" "1", "distr" "normal", "type" "pdf", "size" "3"}
Can also use this library for both clojure and clojurescript: https://github.com/cemerick/url
user=> (-> "a=1&b=2&c=3" cemerick.url/query->map clojure.walk/keywordize-keys)
{:a "1", :b "2", :c "3"}
Yours looks fine. I tend to overuse regexes, so I would have solved it as
(defn request-to-keywords [req]
(into {} (for [[_ k v] (re-seq #"([^&=]+)=([^&]+)" req)]
[(keyword k) v])))
(request-to-keywords "size=1&test=3NA=G")
{:size "1", :test "3NA=G"}
Edit: try to stay away from clojure.walk though. I don't think it's officially deprecated, but it's not very well maintained. (I use it plenty too, though, so don't feel too bad).
I came across this question when constructing my own site and the answer can be a bit different, and easier, if you are passing parameters internally.
Using Secretary to handle routing: https://github.com/gf3/secretary
Parameters are automatically extracted to a map in :query-params when a route match is found. The example given in the documentation:
(defroute "/users/:id" [id query-params]
(js/console.log (str "User: " id))
(js/console.log (pr-str query-params)))
(defroute #"/users/(\d+)" [id {:keys [query-params]}]
(js/console.log (str "User: " id))
(js/console.log (pr-str query-params)))
;; In both instances...
(secretary/dispach! "/users/10?action=delete")
;; ... will log
;; User: 10
;; "{:action \"delete\"}"
You can use ring.middleware.params. Here's an example with aleph:
user=> (require '[aleph.http :as http])
user=> (defn my-handler [req] (println "params:" (:params req)))
user=> (def server (http/start-server (wrap-params my-handler)))
wrap-params creates an entry in the request object called :params. If you want the query parameters as keywords, you can use ring.middleware.keyword-params. Be sure to wrap with wrap-params first:
user=> (require '[ring.middleware.params :refer [wrap-params]])
user=> (require '[ring.middleware.keyword-params :refer [wrap-keyword-params])
user=> (def server
(http/start-server (wrap-keyword-params (wrap-params my-handler))))
However, be mindful that this includes a dependency on ring.