Extracting string from clojure collections using regex - regex

can you suggest me the shortest and easiest way for extracting substring from string sequence? I'm getting this collection from using enlive framework, which takes content from certain web page, and here is what I am getting as result:
("background-image:url('http://s3.mangareader.net/cover/gantz/gantz-r0.jpg')"
"background-image:url('http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg')"
"background-image:url('http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg')" )
What I would like is to get some help in extracting the URL from the each string in the sequence.i tried something with partition function, but with no success. Can anyone propose a regex, or any other approach for this problem?
Thanks

re-seq to the resque!
(map #(re-seq #"http.*jpg" %) d)
(("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg")
("http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg")
("http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg"))
user>
re-find is even better:
user> (map #(re-find #"http.*jpg" %) d)
("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg"
"http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg"
"http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
because it doesn't add an extra layer of seq.

Would something simple like this work for you?
(defn extract-url [s]
(subs s (inc (.indexOf s "'")) (.lastIndexOf s "'")))
This function will return a string containing all the characters between the first and last single quotes.
Assuming your sequence of strings is named ss, then:
(map extract-url ss)
;=> ("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg"
; "http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg"
; "http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
This is definitely not a generic solution, but it fits the input you have provided.

Related

Clojure join fails to create a string from the result of filter function

I'm trying to write a function that takes a string and returns a result of a filter function (I'm working through 4clojure problems). The result must be a string too.
I've written this:
(fn my-caps [s]
(filter #(Character/isUpperCase %) s))
(my-caps "HeLlO, WoRlD!")
Result: (\H \L \O \W \R \D)
Now I'm trying to create a string out of this list, using clojure.string/join, like this:
(fn my-caps [s]
(clojure.string/join (filter #(Character/isUpperCase %) s)))
The result is however the same. I've also tried using apply str, with no success.
You have to convert the lazy sequence returned by filter into a string, by applying the str function. Also, use defn to define a new function - here's how:
(defn my-caps [s]
(apply str (filter #(Character/isUpperCase %) s)))
It works as expected:
(my-caps "HeLlO, WoRlD!")
=> "HLOWRD"
The last code snippet you pasted works fine. join indeed does return a string.
Try this:
(defn my-caps [s]
(->> (filter #(Character/isUpperCase %) s)
(apply str)))
filter function returns a lazy sequence. If you want to get a string, you should transform the sequence to string by applying str function.

Processing a file character by character in Clojure

I'm working on writing a function in Clojure that will process a file character by character. I know that Java's BufferedReader class has the read() method that reads one character, but I'm new to Clojure and not sure how to use it. Currently, I'm just trying to do the file line-by-line, and then print each character.
(defn process_file [file_path]
(with-open [reader (BufferedReader. (FileReader. file_path))]
(let [seq (line-seq reader)]
(doseq [item seq]
(let [words (split item #"\s")]
(println words))))))
Given a file with this text input:
International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.
My output looks like this:
[International donations are gratefully accepted, but we cannot make]
[any statements concerning tax treatment of donations received from]
[outside the United States. U.S. laws alone swamp our small staff.]
Though I would expect it to look like:
["international" "donations" "are" .... ]
So my question is, how can I convert the function above to read character by character? Or even, how to make it work as I expect it to? Also, any tips for making my Clojure code better would be greatly appreciated.
(with-open [reader (clojure.java.io/reader "path/to/file")] ...
I prefer this way to get a reader in clojure. And, by character by character, do you mean in file access level, like read, which allow you control how many bytes to read?
Edit
As #deterb pointed out, let's check the source code of line-seq
(defn line-seq
"Returns the lines of text from rdr as a lazy sequence of strings.
rdr must implement java.io.BufferedReader."
{:added "1.0"
:static true}
[^java.io.BufferedReader rdr]
(when-let [line (.readLine rdr)]
(cons line (lazy-seq (line-seq rdr)))))
I faked a char-seq
(defn char-seq
[^java.io.Reader rdr]
(let [chr (.read rdr)]
(if (>= chr 0)
(cons chr (lazy-seq (char-seq rdr))))))
I know this char-seq reads all chars into memory[1], but I think it shows that you can directly call .read on BufferedReader. So, you can write your code like this:
(let [chr (.read rdr)]
(if (>= chr 0)
;do your work here
))
How do you think?
[1] According to #dimagog's comment, char-seq not read all char into memory thanks to lazy-seq
I'm not familiar with Java or the read() method, so I won't be able to help you out with implementing it.
One first thought is maybe to simplify by using slurp, which will return a string of the text of the entire file with just (slurp filename). However, this would get the whole file, which maybe you don't want.
Once you have a string of the entire file text, you can process any string character by character by simply treating it as though it were a sequence of characters. For example:
=> (doseq [c "abcd"]
(prntln c))
a
b
c
d
=> nil
Or:
=> (remove #{\c} "abcd")
=> (\a \b \d)
You could use map or reduce or any sort of sequence manipulating function. Note that after manipulating it like a sequence, it will now return as a sequence, but you could easily wrap the outer part in (reduce str ...) to return it back to a string at the end--explicitly:
=> (reduce str (remove #{\c} "abcd"))
=> "abd"
As for your problem with your specific code, I think the problem lies with what words is: a vector of strings. When you print each words you are printing a vector. If at the end you replaced the line (println words) with (doseq [w words] (println w))), then it should work great.
Also, based on what you say you want your output to look like (a vector of all the different words in the file), you wouldn't want to only do (println w) at the base of your expression, because this will print values and return nil. You would simply want w. Also, you would want to replace your doseqs with fors--again, to avoid return nil.
Also, on improving your code, it looks generally great to me, but--and this is going with all the first change I suggest above (but not the others, because I don't want to draw it all out explicitly)--you could shorten it with a fun little trick:
(doseq [item seq]
(let [words (split item #"\s")]
(doseq [w words]
(println w))))
;//Could be rewritten as...
(doseq [item s
:let [words (split item #"\s")]
w words]
(println w))
You're pretty close - keep in mind that Strings are a sequence. (concat "abc" "def") results in the sequence (\a \b \c \d \e \f).
mapcat is another really useful function for this - it will lazily concatenate the results of applying the mapping fn to the sequence. This means that mapcating the result of converting all of the line strings to a seq will be the lazy sequence of characters you're after.
I did this as (mapcat seq (line-seq reader)).
For other advice:
For creating the reader, I would recommend using the clojure.java.io/reader function instead of directly creating the classes.
Consider breaking apart the reading the file and the processing (in this case printing) of the strings from each other. While it is important to keep the full file parsing inside the withopen clause, being able to test the actual processing code outside of the file reading code is quite useful.
When navigating multiple (potentially nested) sequences consider using for. for does a nice job handling nested for loop type cases.
(take 100 (for [line (repeat "abc") char (seq line)] (prn char)))
Use prn for debugging output. It gives you real output, as compared to user output (which hides certain details which users don't normally care about).

Getting all matches for a regexp on clojure

I'm trying to parse an HTML file and get all href's inside it.
So far, the code I'm using is:
(map
#(println (str "Match: " %))
(re-find #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))
str_response being the string with the HTML code inside it. According to my basic understanding of Clojure, that code should print a list of matches, but so far, no luck.
It doens't crash, but it doens't match anything either.
I've tried using re-seq instead of re-find, but with no luck. Any help?
Thanks!
it is generally though that you cannot parse html with a regex (entertaining answer), though just finding all occurances of one tag should be dooable.
once you figure out the proper regex re-seq is the function you want to use:
user> (re-find #"aa" "aalkjkljaa")
"aa"
user> (re-seq #"aa" "aalkjkljaa")
("aa" "aa")
this is not crashing for you because re-find is returning nil which map is interpreting as an empty list and doing nothing
This really looks like an HTML scraping problem in which case, I would advise using enlive.
Something like this should work
(ns test.foo
(:require [net.cgrand.enlive-html :as html]))
(let [url (html/html-resource
(java.net.URL. "http://www.nytimes.com"))]
(map #(-> % :attrs :href) (html/select url [:a])))
I don't think there is anything wrong with your code. Perhapsstr_responseis the suspect. The following works with http://google.com with your regex:
(let [str_response (slurp "http://google.com")]
(map #(println (str "Match: " %))
(re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))
Note ref-find also works though it only returns one match.

What's the right way to create a Clojure function that returns a new sequence based on another sequence?

Why do you have to use "first" in get-word-ids, and what's the right way to do this?
(defn parts-of-speech []
(lazy-seq (. POS values)))
(defn index-words [pos]
(iterator-seq (. dict getIndexWordIterator pos)))
(defn word-ids [word]
(lazy-seq (. word getWordIDs)))
(defn get-word [word-id]
(. dict getWord word-id))
(defn get-index-words []
(lazy-seq (map index-words (parts-of-speech))))
(defn get-word-ids []
(lazy-seq (map word-ids (first (get-index-words)))))
;; this works, but why do you have to use "first" in get-word-ids?
(doseq [word-id (get-word-ids)]
(println word-id))
The short answer: remove all the references to lazy-seq.
as for your original question, it is worth explaining even if it's not a idomatic use of lazy-seq. you have to use first because the get-word-ids function is returning a lazy sequence with one entry. that entry is the lazy sequences you are looking for.
looks like this
( (word1 word2 word3) )
so first returns the sequence you want:
(word1 word2 word3)
It is very likely that the only time you will use lazy-seq will be in this pattern:
(lazy-seq (cons :something (function-call :produces :the :next :element)))
I have never seen lazy-seq used in any other pattern. The purpose of lazy-seq is to generate new sequences of original data. If code exists to produce the data then it's almost always better to use something like iterate map, or for to produce your lazy sequence.
This seems wrong:
(defn get-index-words []
(lazy-seq (map index-words (parts-of-speech))))
(index-words pos) returns a seq. which is why you need a (first) in get-word-ids.
also map is already lazy, so there's no need to wrap a (map ...) in a lazy-seq, and it would be almost pointless to use lazy-seq around map if map wasn't lazy. it would probably be useful if you'd read up a bit more on (lazy) sequences in clojure.

Reverse a string (simple question)

Is there a better way to do this in Clojure?
daniel=> (reverse "Hello")
(\o \l \l \e \H)
daniel=> (apply str (vec (reverse "Hello")))
"olleH"
Do you have to do the apply $ str $ vec bit every time you want to reverse a string back to its original form?
You'd better use clojure.string/reverse:
user=> (require '[clojure.string :as s])
nil
user=> (s/reverse "Hello")
"olleH"
UPDATE: for the curious, here follow the source code snippets for clojure.string/reverse in both Clojure (v1.4) and ClojureScript
; clojure:
(defn ^String reverse
"Returns s with its characters reversed."
{:added "1.2"}
[^CharSequence s]
(.toString (.reverse (StringBuilder. s))))
; clojurescript
(defn reverse
"Returns s with its characters reversed."
[s]
(.. s (split "") (reverse) (join "")))
OK, so it would be easy to roll your own function with apply inside, or use a dedicated version of reverse that works better (but only) at strings. The main things to think about here though, is the arity (amount and type of parameters) of the str function, and the fact that reverse works on a collection.
(doc reverse)
clojure.core/reverse
([coll])
Returns a seq of the items in coll in reverse order. Not lazy.
This means that reverse not only works on strings, but also on all other collections. However, because reverse expects a collection as parameter, it treats a string as a collection of characters
(reverse "Hello")
and returns one as well
(\o \l \l \e \H)
Now if we just substitute the functions for the collection, you can spot the difference:
(str '(\o \l \l \e \H) )
"(\\o \\l \\l \\e \\H)"
while
(str \o \l \l \e \H )
"olleH"
The big difference between the two is the amount of parameters. In the first example, str takes one parameter, a collection of 5 characters. In the second, str takes 5 parameters: 5 characters.
What does the str function expect ?
(doc str)
-------------------------
clojure.core/str
([] [x] [x & ys])
With no args, returns the empty string. With one arg x, returns
x.toString(). (str nil) returns the empty string. With more than
one arg, returns the concatenation of the str values of the args.
So when you give in one parameter (a collection), all str returns is a toString of the collection.
But to get the result you want, you need to feed the 5 characters as separate parameters to str, instead of the collection itself. Apply is the function that is used to 'get inside' the collection and make that happen.
(apply str '(\o \l \l \e \H) )
"olleH"
Functions that handle multiple separate parameters are often seen in Clojure, so it's good to realise when and why you need to use apply. The other side to realize is, why did the writer of the str function made it accept multiple parameters instead of a collection ? Usually, there's a pretty good reason. What's the prevalent use case for the str function ? Not concatenating a collection of separate characters surely, but concatenating values, strings and function results.
(let [a 1 b 2]
(str a "+" b "=" (+ a b)))
"1+2=3"
What if we had a str that accepted a single collection as parameter ?
(defn str2
[seq]
(apply str seq)
)
(str2 (reverse "Hello"))
"olleH"
Cool, it works ! But now:
(let [a 1 b 2]
(str2 '(a "+" b "=" (+ a b)))
)
"a+b=(+ a b)"
Hmmm, now how to solve that ? :)
In this case, making str accept multiple parameters that are evaluated before the str function is executed gives str the easiest syntax. Whenever you need to use str on a collection, apply is a simple way to convert a collection to separate parameters.
Making a str that accepts a collection and have it evaluate each part inside would take more effort, help out only in less common use cases, result in more complicated code or syntax, or limit it's applicability. So there might be a better way to reverse strings, but reverse, apply and str are best at what they do.
Apply, like reverse, works on any seqable type, not just vectors, so
(apply str (reverse "Hello"))
is a little shorter. clojure.string/reverse should be more efficient, though.