I want to expand URLs from Twitter data and (simultaneously-ish) extract their domains. I tried doing this before in Python using requests, but I guess I screwed up somewhere, because the vast majority of the URLs are still in 'short' form (bit.ly, goo.gl, etc.). I've got the Twitter data stored as JSON, and I'm using clj-http (required as [clj-http.client :as client]) to resolve the URLs. My code so far looks like this:
(defn expand-urls [urls]
  (for [url-str urls]
    (and url-str (last (:trace-redirects (client/get url-str))))))
(def ^:dynamic *domain-pat* (re-pattern #"https?://([\w\.]+)/.*"))
(defn get-domains [urls]
  (for [url urls]
    (first (filter #(not= url %1) (re-find *domain-pat* url)))))
I've got the Twitter data formatted as [tweet-id [{tweet-data-map} {user-data-map}]], so (get-in json-data [1 0 "urls"]) returns the URLs and (get-in json-data [1 0 "domains"]) returns the domains.
When I try something like (update-in (update-in js-line [1 0 "urls"] expand-urls) [1 0 "domains"] get-domains), the domains come back as (nil). I've independently verified that my regex works, so I suspect the issue is that the lazy sequence returned by expand-urls isn't realized by the time get-domains gets called. Frustratingly, (type (doall (expand-urls some-urls))) returns clojure.lang.LazySeq, as does (type (doall (doall (expand-urls some-urls)))). I've tried doall, and I've tried adding vec inside expand-urls. Neither seems to work.
Is this really a laziness issue, or am I missing something else?
You can rewrite your solution as:
(defn expand-urls [urls]
  (mapv #(last (:trace-redirects (client/get %)))
        (remove nil? urls)))
assuming you don't want nils in your result. mapv is the strict counterpart of the lazy map and always returns a vector.
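A quick REPL check of the difference (a sketch):
(type (map inc [1 2 3]))   ;; => clojure.lang.LazySeq
(type (mapv inc [1 2 3]))  ;; => clojure.lang.PersistentVector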
Solved it! The key was adding a doall into expand-urls:
(defn expand-urls [urls]
  (vec (doall (for [url-str urls]
                (and url-str (last (:trace-redirects (client/get url-str))))))))
(The vec isn't actually necessary, but I'm planning on re-serializing this stuff, and didn't want to worry about how org.clojure/data.json translates lists.)
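For what it's worth, a quick check (a sketch, assuming org.clojure/data.json) shows that vectors and seqs both come out as JSON arrays anyway:
(require '[clojure.data.json :as json])
(json/write-str ["http://example.com/a" "http://example.com/b"])
;; => "[\"http://example.com/a\",\"http://example.com/b\"]"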
Thanks for the help, everybody! I'm glad there's a vibrant Clojure support community here :)
I'm developing a mini-social media API where the user is allowed to insert a new profile, connect two profiles together (like friends) and then receive recommendations based on the "friends of my friends" rule.
Right now I'm trying to create the API for Profile.
I have an atom that holds a list of maps, one for each profile.
(def profiles (atom ()))

(defn create [request]
  (swap! profiles conj {:id (get-in request [:body "id"])
                        :name (get-in request [:body "name"])
                        :age (get-in request [:body "age"])
                        :recommendable (get-in request [:body "recommendable"])
                        :friends (list)})
  (created ""))
I was trying to develop find-by-id for the API's GET verb when I stumbled into a problem: how can I get at the maps inside that list so I can apply functions to them?
For instance, here I was trying to use filter to return only the maps that contain a given id, but I keep getting an error:
(defn find-by-id [id]
  (filter #(= (:id %) id) profiles))
Don't know how to create ISeq from: clojure.lang.Atom
It seems to me that filter is not applicable to an Atom.
Same thing happens to remove:
(defn delete-by-id [id]
  (swap! profiles (remove #(= (:id %) id) profiles)))
When I try it with @profiles I get an empty array as a result. And to make things worse, when I tried the filter function in the REPL it worked just fine.
Which leaves me wondering what I'm missing here.
Could anyone please tell me what's going on?
Thanks in advance.
The first one fails because, as the error says, an atom isn't a sequence, which is what filter expects.
You need to get the sequence out of the atom before you can filter it:
; I'm dereferencing the atom using @ to get the list of profiles it holds
(defn find-by-id [id]
  (filter #(= (:id %) id) @profiles))
Note though, this isn't optimal. You're relying on the state of profiles, which can change at seemingly random times (if you have asynchronous processes swap!ping it). That can complicate debugging, since you can't get a good handle on the data before it's passed to filter. It also isn't good for the function to rely on profiles being an atom, since that's irrelevant to what it does, and you may change your design later. It would be more future-proof to make this function rely purely on its parameters and have no knowledge of the atom:
(defn find-by-id [id profiles]
  (filter #(= (:id %) id) profiles))

; Then call it like this. I renamed your atom here
(find-by-id some-id @profile-atom)
Your second example fails because swap! accepts a function as its second argument. I think you meant to use reset!, which changes the value of the atom regardless of what it was before:
(defn delete-by-id [id]
  (reset! profiles (remove #(= (:id %) id) @profiles)))
Although, this isn't optimal either. If you want to update an atom based on a previous state, use swap! instead and supply an updating function:
(defn delete-by-id [id]
  (swap! profile-atom (fn [profiles] (remove #(= (:id %) id) profiles))))
Or, slightly more succinctly:
(defn delete-by-id [id]
  (swap! profile-atom (partial remove #(= (:id %) id))))
I'm partially applying remove to make a function. The old state of the atom is passed as the last argument to remove.
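A quick REPL illustration of that last version (with some made-up profiles):
(def profile-atom (atom [{:id 1 :name "ana"} {:id 2 :name "bob"}]))
(delete-by-id 1)
@profile-atom
;; => ({:id 2, :name "bob"})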
I am using the taoensso.carmine Redis client and want to achieve the following: given a sequence s, get all of its elements that don't exist in Redis (i.e., those for which Redis's EXISTS command returns false).
At first I thought to do the following:
(wcar conn
(remove #(car/exists %) s))
but it returns a sequence of car/exists responses rather than filtering my sequence by them.
(remove #(wcar conn (car/exists %)) s)
This does the job, but it takes a lot of time because there's no pipelining and a new connection is opened each time.
So I ended up with the somewhat tangled map manipulation below, but I believe there should be a simpler way to achieve this. How?
(let [s (range 1 100)
      existance (wcar conn
                  (doall
                    (for [i s]
                      (car/exists i))))
      existance-map (zipmap s existance)]
  (mapv first (remove (fn [[k v]] (= v 1)) existance-map)))
Your remove function is lazy, so it won't do anything. You also can't do data manipulation inside the wcar macro, so I'd do something like this:
(let [keys ["exists" "not-existing"]]
(zipmap keys
(mapv pos?
(car/wcar redis-db
(mapv (fn [key]
(car/exists key))
keys)))))
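And if what you ultimately want is just the keys that are missing from Redis, a possible follow-up (a sketch, with redis-db as the connection spec, as above) is to keep only the keys whose EXISTS reply was 0:
(let [ks ["exists" "not-existing"]
      replies (car/wcar redis-db (mapv (fn [k] (car/exists k)) ks))]
  (->> (map vector ks replies)
       (remove (fn [[_ reply]] (pos? reply)))
       (mapv first)))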
Could you reexamine your first solution? I don't know what wcar does, but this example shows that you're on the right track:
> (remove #(odd? %) (range 9))
(0 2 4 6 8)
The anonymous function #(odd? %) returns either true or false results which are used to determine which numbers to keep. However, it is the original numbers that are returned by (remove...), not true/false.
Consider a dataset like this:
(def data [{:url "http://www.url1.com" :type :a}
{:url "http://www.url2.com" :type :a}
{:url "http://www.url3.com" :type :a}
{:url "http://www.url4.com" :type :b}])
The contents of those URLs should be requested in parallel. Depending on the item's :type value, those contents should be parsed by corresponding functions. The parsing functions return collections, which should be concatenated once all the responses have arrived.
So let's assume that there are functions parse-a and parse-b, which both return a collection of strings when they are passed a string containing HTML content.
It looks like core.async could be a good tool for this. One could either have separate channels for each item or one single channel. I'm not sure which way would be preferable here. With several channels one could use transducers for the post-processing/parsing. There is also a special promise-chan which might be appropriate here.
Here is a code sketch; I'm using a callback-based http-kit function. Unfortunately, I could not find a generic solution inside the go block.
(defn f [data]
  (let [chans (map (fn [{:keys [url type]}]
                     (let [c (promise-chan (map ({:a parse-a :b parse-b} type)))]
                       (http/get url {} #(put! c %))
                       c))
                   data)
        result-c (promise-chan)]
    (go (put! result-c (concat (<! (nth chans 0))
                               (<! (nth chans 1))
                               (<! (nth chans 2))
                               (<! (nth chans 3)))))
    result-c))
The result can be read like so:
(go (prn (<! (f data))))
I'd say that promise-chan does more harm than good here. The problem is that most of the core.async API (a/merge, a/reduce, etc.) relies on the fact that channels will close at some point; promise-chans, in turn, never close.
So, if sticking with core.async is crucial for you, the better solution is not to use promise-chan but an ordinary channel instead, which is closed after the first put!:
...
(let [c (chan 1 (map ({:a parse-a :b parse-b} type)))]
  (http/get url {} #(do (put! c %) (close! c)))
  c)
...
At this point, you're working with closed channels and things become a bit simpler. To collect all values you could do something like this:
;; (go (put! result-c (concat (<! (nth chans 0))
;; (<! (nth chans 1))
;; (<! (nth chans 2))
;; (<! (nth chans 3)))))
;; instead of above, now you can do this:
(->> chans
     async/merge
     (async/reduce into []))
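async/reduce itself returns a channel that delivers that single collected value once all the inputs have closed, so you can read it the same way as before, e.g. (a sketch):
(go (prn (<! (->> chans
                  async/merge
                  (async/reduce into [])))))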
UPD (below are my personal opinions):
It seems that using core.async channels as promises (either in the form of promise-chan or a channel that closes after a single put!) is not the best approach. When things grow, it turns out that the core.async API overall is (as you may have noticed) not as pleasant as it could be. There are also several unsupported constructs that may force you to write less idiomatic code than you'd like. In addition, there is no built-in error handling (if an error occurs within a go block, the go block will silently return nil), and to address this you'll need to come up with something of your own (reinvent the wheel). Therefore, if you need promises, I'd recommend using a dedicated library for that, for example manifold or promesa.
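For comparison, here's a minimal manifold sketch of the same fan-out (request-deferred is a hypothetical function of yours that fires the HTTP request for one item and returns a manifold deferred of the parsed collection):
(require '[manifold.deferred :as d])

;; d/zip waits for all the deferreds and yields a vector of their results;
;; d/chain then concatenates the parsed collections, mirroring the concat above.
(d/chain (apply d/zip (map request-deferred data))
         #(apply concat %))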
I wanted this functionality as well because I really like core.async, but I also wanted to use it in certain places like traditional JavaScript promises. I came up with a solution using macros. In the code below, <? is the same thing as <! except that it throws if there's an error. It behaves like Promise.all() in that it returns a vector of all the values returned from the channels if they all succeed; otherwise it returns the first error (since <? will cause it to throw that value).
(defmacro <<? [chans]
  `(let [res# (atom [])]
     (doseq [c# ~chans]
       (swap! res# conj (serverless.core.async/<? c#)))
     @res#))
If you'd like to see the full context of the function it's located on GitHub. It's heavily inspired from David Nolen's blog post.
Use pipeline-async from core.async to launch asynchronous operations like http/get concurrently while delivering the results in the same order as the input:
(let [result (chan)]
  (pipeline-async
    20 result
    (fn [{:keys [url type]} ch]
      (let [parse ({:a parse-a :b parse-b} type)
            ;; close ch once the put has completed; pipeline-async expects
            ;; the async function to close ch when it's done
            callback #(put! ch (parse %) (fn [_] (close! ch)))]
        (http/get url {} callback)))
    (to-chan data))
  result)
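pipeline-async closes result once the input channel from to-chan is exhausted, so one way to collect everything afterwards (a sketch, reusing the async alias from the earlier answer and assuming result is the channel returned above) is:
(go (prn (<! (async/reduce into [] result))))
;; => one collection containing all the parsed items, in the same order as data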
If anyone is still looking at this, adding on to the answer by @OlegTheCat:
You can use a separate channel for errors.
(:require [cljs.core.async :as async]
          [cljs-http.client :as http])
(:require-macros [cljs.core.async.macros :refer [go]])

(go (as-> [(http/post <url1> <params1>)
           (http/post <url2> <params2>)
           ...]
          chans
      (async/merge chans (count chans))
      (async/reduce conj [] chans)
      (async/<! chans)
      (<callback> chans)))
I'm trying to import data from StackOverflow to Neo4j using clojure and the neocons library. Excuse me for being a bit of a newbie.
Here's my main function in Leiningen:
(defn -main
  [& args]
  (let [neo4j-conn (nr/connect "http://localhost:7777/db/data/")]
    (cypher/tquery neo4j-conn "MATCH n OPTIONAL MATCH n-[r]-() DELETE n, r")
    (for [page (range 1 6)]
      (let [data (parse-string (stackoverflow-get-questions page))
            questions (data "items")
            has-more (data "has_more")
            question-ids (map #(%1 "question_id") questions)
            answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]
        (map #(import-question %1 neo4j-conn) questions)
        (map #(import-answer %1 neo4j-conn) answers)))))
I've defined import-question and import-answer functions and those work fine independently. In fact, what's weird is that I can remove either one of those import-* lines and the other will work just fine.
Can anybody see if I'm doing something simple that's wrong?
Both map and for are lazy, and will do nothing at all unless you consume their results.
The first map call ends up being a no-op because there is no way for anything to consume its output. Try wrapping the for and at least the first map call in a call to dorun, or doall if you plan on consuming the result.
Also, you can replace for with doseq, which is identical except that it returns nil, eagerly consumes its input, and can contain multiple forms in its body.
Here is what your code could look like using doseq:
(defn -main
  [& args]
  (let [neo4j-conn (nr/connect "http://localhost:7777/db/data/")]
    (cypher/tquery neo4j-conn "MATCH n OPTIONAL MATCH n-[r]-() DELETE n, r")
    (doseq [page (range 1 6)
            :let [data (parse-string (stackoverflow-get-questions page))
                  questions (data "items")
                  has-more (data "has_more")
                  question-ids (map #(%1 "question_id") questions)
                  answers ((parse-string (stackoverflow-get-answers question-ids)) "items")]]
      (doseq [q questions]
        (import-question q neo4j-conn))
      (doseq [a answers]
        (import-answer a neo4j-conn)))))
I posted before about a huge XML file - it's a 287 GB XML Wikipedia dump that I want to put into a CSV file (revision authors and timestamps). I managed to get that working up to a point. Before, I got a StackOverflowError, but now, after solving the first problem, I get: java.lang.OutOfMemoryError: Java heap space.
My code (partly taken from Justin Kramer's answer) looks like this:
(defn process-pages
  [page]
  (let [title (article-title page)
        revisions (filter #(= :revision (:tag %)) (:content page))]
    (for [revision revisions]
      (let [user (revision-user revision)
            time (revision-timestamp revision)]
        (spit "files/data.csv"
              (str "\"" time "\";\"" user "\";\"" title "\"\n")
              :append true)))))
(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages))))
I don't show the article-title, revision-user and revision-timestamp functions, because they just take data from a specific place in the page or revision hash. Could anyone help me with this - I'm really new to Clojure and don't get the problem.
Just to be clear, (:content (data.xml/parse rdr :coalescing false)) IS lazy. Check its class or pull the first item (it will return instantly) if you're not convinced.
That said, a couple things to watch out for when processing large sequences: holding onto the head, and unrealized/nested laziness. I think your code suffers from the latter.
Here's what I recommend:
1) Add dorun to the end of the ->> chain of calls. This will force the sequence to be fully realized without holding onto the head (see the sketch just below).
2) Change the for in process-pages to doseq. You're spitting to a file, which is a side effect, and you don't want to do that lazily here.
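Applied to your open-file, point 1) is one extra step at the end of the ->> chain (a sketch; point 2) would be changing the for inside process-pages to doseq):
(defn open-file
  [file-name]
  (let [rdr (BufferedReader. (FileReader. file-name))]
    (->> (:content (data.xml/parse rdr :coalescing false))
         (filter #(= :page (:tag %)))
         (map process-pages)
         dorun)))   ;; realize everything without keeping the head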
As Arthur recommends, you may want to open an output file once and keep writing to it, rather than opening & writing (spit) for every Wikipedia entry.
UPDATE:
Here's a rewrite which attempts to separate concerns more clearly:
(defn filter-tag [tag xml]
  (filter #(= tag (:tag %)) xml))

;; lazy
(defn revision-seq [xml]
  (for [page (filter-tag :page (:content xml))
        :let [title (article-title page)]
        revision (filter-tag :revision (:content page))
        :let [user (revision-user revision)
              time (revision-timestamp revision)]]
    [time user title]))
;; eager
(defn transform [in out]
  (with-open [r (io/input-stream in)
              w (io/writer out)]
    (binding [*out* w]   ;; bind *out* to the writer, not the output path
      (let [xml (data.xml/parse r :coalescing false)]
        (doseq [[time user title] (revision-seq xml)]
          (println (str "\"" time "\";\"" user "\";\"" title "\"")))))))
(transform "dump.xml" "data.csv")
I don't see anything here that would cause excessive memory use.
Unfortunately data.xml/parse is not lazy; it attempts to read the whole file into memory and then parse it.
Instead, use this (lazy) XML library, which holds only the part it is currently working on in RAM. You will then need to restructure your code to write the output as it reads the input, instead of gathering all the XML and then outputting it.
Your line
(:content (data.xml/parse rdr :coalescing false))
will load all the XML into memory and then request the content key from it, which will blow the heap.
A rough outline of a lazy answer would look something like this:
(with-open [input (java.io.FileInputStream. "/tmp/foo.xml")
            output (java.io.FileOutputStream. "/tmp/foo.csv")]
  ;; dorun forces the lazy map so the writes actually happen
  ;; before with-open closes the streams
  (dorun
    (map #(write-to-file output %)
         (filter is-the-tag-i-want? (parse input)))))
Have patience, working with (> data ram) always takes time :)
I don't know about Clojure, but in plain Java one could use a SAX event-based parser like http://docs.oracle.com/javase/1.4.2/docs/api/org/xml/sax/XMLReader.html that doesn't need to load the whole XML into RAM.
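The same idea is reachable from Clojure through interop. Here's a rough sketch using the JDK's built-in SAX parser (the function name and callback are made up for illustration); the handler sees one element event at a time, so the whole document never sits in memory:
(import '(javax.xml.parsers SAXParserFactory)
        '(org.xml.sax.helpers DefaultHandler))

(defn sax-parse [file-name on-start-element]
  (let [handler (proxy [DefaultHandler] []
                  (startElement [uri local-name q-name attrs]
                    (on-start-element q-name attrs)))]
    (.parse (.newSAXParser (SAXParserFactory/newInstance))
            (java.io.File. file-name)
            handler)))

;; e.g. count <page> elements without holding the document in RAM:
;; (let [n (atom 0)]
;;   (sax-parse "dump.xml" (fn [tag _] (when (= tag "page") (swap! n inc))))
;;   @n)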