Getting all matches for a regexp on clojure - regex

I'm trying to parse an HTML file and get all href's inside it.
So far, the code I'm using is:
(map
#(println (str "Match: " %))
(re-find #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))
str_response being the string with the HTML code inside it. According to my basic understanding of Clojure, that code should print a list of matches, but so far, no luck.
It doesn't crash, but it doesn't match anything either.
I've tried using re-seq instead of re-find, but with no luck. Any help?
Thanks!

It is generally thought that you cannot parse HTML with a regex (entertaining answer), though just finding all occurrences of one tag should be doable.
Once you figure out the proper regex, re-seq is the function you want to use:
user> (re-find #"aa" "aalkjkljaa")
"aa"
user> (re-seq #"aa" "aalkjkljaa")
("aa" "aa")
This is not crashing for you because re-find is returning nil, which map treats as an empty sequence, so it does nothing.
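One more thing worth checking is that map is lazy, so println calls placed inside it may never run unless something consumes the result (the REPL does, a script usually does not). The following is a minimal sketch of collecting the captured URLs with re-seq and printing them eagerly with doseq. The names sample-html and extract-hrefs are made up for illustration and do not come from the question.

```clojure
;; sample-html and extract-hrefs are illustrative names, not from the question.
(def sample-html
  (str "<a href=\"http://example.com/a\">A</a>\n"
       "<a href=\"http://example.com/b\">B</a>"))

(defn extract-hrefs
  "Returns the first capture group of every href match in html."
  [html]
  ;; With a capture group, re-seq yields vectors of [whole-match group-1],
  ;; so second picks out just the URL.
  (map second (re-seq #"href=\"([a-zA-Z.:/]+)\"" html)))

;; doseq is eager, so the printing happens immediately.
(doseq [href (extract-hrefs sample-html)]
  (println "Match:" href))
```

Keeping the extraction (a pure function) apart from the printing also makes the regex easier to test on small strings.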

This really looks like an HTML scraping problem in which case, I would advise using enlive.
Something like this should work
(ns test.foo
(:require [net.cgrand.enlive-html :as html]))
(let [url (html/html-resource
(java.net.URL. "http://www.nytimes.com"))]
(map #(-> % :attrs :href) (html/select url [:a])))

I don't think there is anything wrong with your code. Perhaps str_response is the suspect. The following works with http://google.com with your regex:
(let [str_response (slurp "http://google.com")]
(map #(println (str "Match: " %))
(re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response)))
Note that re-find also works, though it only returns one match.

Related

replace multiple bad characters in clojure

I am trying to replace bad characters from a input string.
Characters should be valid UTF-8 characters (tabs, line breaks etc. are ok).
However I was unable to figure out how to replace all found bad characters.
My solution works for the first bad character.
Usually there are no bad characters. In about 1 in 50 cases there is one bad character. I just want to make my solution foolproof.
(defn filter-to-utf-8-string
"Return only good utf-8 characters from the input."
[input]
(let [bad-characters (set (re-seq #"[^\p{L}\p{N}\s\p{P}\p{Sc}\+]+" input))
filtered-string (clojure.string/replace input (apply str (first bad-characters)) "")]
filtered-string))
How can I make replace work for all values in sequence not just for the first one?
A friend of mine helped me find a workaround for this problem:
I created a filter for replace using re-pattern.
Within the let, the code is currently
filter (if (not (empty? bad-characters))
(re-pattern (str "[" (clojure.string/join bad-characters) "]"))
#"")
filtered-string (clojure.string/replace input filter "")
Here is a simple version:
(ns xxxxx
(:require
[clojure.string :as str]
))
(def all-chars (str/join (map char (range 32 80))))
(println all-chars)
(def char-L (str/join (re-seq #"[\p{L}]" all-chars)))
(println char-L)
(def char-N (str/join (re-seq #"[\p{N}]" all-chars)))
(println char-N)
(def char-LN (str/join (re-seq #"[\p{L}\p{N}]" all-chars)))
(println char-LN)
all-chars => " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNO"
char-L => "ABCDEFGHIJKLMNO"
char-N => "0123456789"
char-LN => "0123456789ABCDEFGHIJKLMNO"
So we start off with all ASCII chars produced by (range 32 80), i.e. codes 32 through 79. We first print only the letters, then only the numbers, then either letters or numbers. It seems this should work for your problem, although instead of rejecting non-members of the desired set, we keep the members of the desired set.
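Since clojure.string/replace replaces every match when given a regex, the negated character class from the question can also be passed to it directly, with no need to collect the bad characters first. A minimal sketch, assuming the same character class as the question; the function name strip-bad-characters and the sample input are made up for illustration:

```clojure
(require '[clojure.string :as str])

;; strip-bad-characters is a hypothetical name; the regex is the one from the question.
(defn strip-bad-characters
  "Removes every run of characters that is not a letter, number,
  whitespace, punctuation, currency symbol, or plus sign."
  [input]
  (str/replace input #"[^\p{L}\p{N}\s\p{P}\p{Sc}\+]+" ""))

;; Build a sample string containing two control characters (codes 0 and 1).
(def sample-input (str "abc" (char 0) "def" (char 1) " 42!"))

(prn (strip-bad-characters sample-input))
;; prints "abcdef 42!"
```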

Processing a file character by character in Clojure

I'm working on writing a function in Clojure that will process a file character by character. I know that Java's BufferedReader class has the read() method that reads one character, but I'm new to Clojure and not sure how to use it. Currently, I'm just trying to do the file line-by-line, and then print each character.
(defn process_file [file_path]
(with-open [reader (BufferedReader. (FileReader. file_path))]
(let [seq (line-seq reader)]
(doseq [item seq]
(let [words (split item #"\s")]
(println words))))))
Given a file with this text input:
International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.
My output looks like this:
[International donations are gratefully accepted, but we cannot make]
[any statements concerning tax treatment of donations received from]
[outside the United States. U.S. laws alone swamp our small staff.]
Though I would expect it to look like:
["international" "donations" "are" .... ]
So my question is, how can I convert the function above to read character by character? Or even, how to make it work as I expect it to? Also, any tips for making my Clojure code better would be greatly appreciated.
(with-open [reader (clojure.java.io/reader "path/to/file")] ...
I prefer this way of getting a reader in Clojure. And by "character by character", do you mean at the file access level, like read, which lets you control how many bytes to read?
Edit
As @deterb pointed out, let's check the source code of line-seq
(defn line-seq
"Returns the lines of text from rdr as a lazy sequence of strings.
rdr must implement java.io.BufferedReader."
{:added "1.0"
:static true}
[^java.io.BufferedReader rdr]
(when-let [line (.readLine rdr)]
(cons line (lazy-seq (line-seq rdr)))))
I faked a char-seq
(defn char-seq
[^java.io.Reader rdr]
(let [chr (.read rdr)]
(if (>= chr 0)
(cons chr (lazy-seq (char-seq rdr))))))
I know this char-seq reads all chars into memory[1], but I think it shows that you can directly call .read on BufferedReader. So, you can write your code like this:
(let [chr (.read rdr)]
(if (>= chr 0)
;do your work here
))
What do you think?
[1] According to @dimagog's comment, char-seq does not read all chars into memory, thanks to lazy-seq.
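The char-seq above yields the integer codes returned by .read. A small variant that yields characters instead might look like the sketch below. The name lazy-char-seq is made up here, and a StringReader stands in for a file reader so the example is self-contained.

```clojure
;; lazy-char-seq is a hypothetical name; it mirrors char-seq above but
;; converts each int returned by .read into a character.
(defn lazy-char-seq
  [^java.io.Reader rdr]
  (lazy-seq
    (let [c (.read rdr)]
      ;; .read returns -1 at end of stream.
      (when-not (neg? c)
        (cons (char c) (lazy-char-seq rdr))))))

;; doall forces the sequence while the reader is still open.
(prn (with-open [rdr (java.io.StringReader. "hi\n")]
       (doall (lazy-char-seq rdr))))
;; prints (\h \i \newline)
```

Because the sequence is lazy, it has to be consumed (or forced with doall) inside with-open, before the reader is closed.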
I'm not familiar with Java or the read() method, so I won't be able to help you out with implementing it.
One first thought is maybe to simplify by using slurp, which will return a string of the text of the entire file with just (slurp filename). However, this would get the whole file, which maybe you don't want.
Once you have a string of the entire file text, you can process any string character by character by simply treating it as though it were a sequence of characters. For example:
=> (doseq [c "abcd"]
(println c))
a
b
c
d
=> nil
Or:
=> (remove #{\c} "abcd")
=> (\a \b \d)
You could use map or reduce or any sort of sequence manipulating function. Note that after manipulating it like a sequence, it will now return as a sequence, but you could easily wrap the outer part in (reduce str ...) to return it back to a string at the end--explicitly:
=> (reduce str (remove #{\c} "abcd"))
=> "abd"
As for the problem with your specific code, I think it lies in what words is: a vector of strings. When you print words, you are printing a vector. If you replaced the line (println words) with (doseq [w words] (println w)), it should work great.
Also, based on what you say you want your output to look like (a vector of all the different words in the file), you wouldn't want to only do (println w) at the base of your expression, because this will print values and return nil. You would simply want w. You would also want to replace your doseqs with fors, again to avoid returning nil.
As for improving your code, it looks generally great to me, but you could shorten it with a fun little trick (this builds on the first change I suggested above, but not the others, because I don't want to draw it all out explicitly):
(doseq [item seq]
(let [words (split item #"\s")]
(doseq [w words]
(println w))))
;; Could be rewritten as...
(doseq [item seq
:let [words (split item #"\s")]
w words]
(println w))
You're pretty close - keep in mind that Strings are a sequence. (concat "abc" "def") results in the sequence (\a \b \c \d \e \f).
mapcat is another really useful function for this - it will lazily concatenate the results of applying the mapping fn to the sequence. This means that mapcating the result of converting all of the line strings to a seq will be the lazy sequence of characters you're after.
I did this as (mapcat seq (line-seq reader)).
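As a rough, self-contained illustration of that expression, the sketch below wraps it in a function and feeds it a StringReader instead of a real file. The name line-chars is made up. Note that line-seq is built on readLine, which strips line terminators, so newlines do not appear in the result.

```clojure
;; line-chars is a hypothetical name for the mapcat expression above.
(defn line-chars
  "Lazy sequence of the characters of every line read from rdr."
  [^java.io.BufferedReader rdr]
  (mapcat seq (line-seq rdr)))

;; A StringReader stands in for a file here; doall forces the lazy
;; sequence before with-open closes the reader.
(prn (with-open [rdr (java.io.BufferedReader. (java.io.StringReader. "ab\ncd"))]
       (doall (line-chars rdr))))
;; prints (\a \b \c \d)
```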
For other advice:
For creating the reader, I would recommend using the clojure.java.io/reader function instead of directly creating the classes.
Consider separating the reading of the file from the processing (in this case printing) of the strings. While it is important to keep the full file parsing inside the with-open form, being able to test the actual processing code outside of the file reading code is quite useful.
When navigating multiple (potentially nested) sequences consider using for. for does a nice job handling nested for loop type cases.
(take 100 (for [line (repeat "abc") char (seq line)] (prn char)))
Use prn for debugging output. It gives you real output, as compared to user output (which hides certain details which users don't normally care about).
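To make the println versus prn point concrete, here is a small sketch that builds the vector of lower-cased words the question asks for and prints it both ways. The names line-words and sample-lines are made up, and the input is a shortened stand-in for the file's lines.

```clojure
(require '[clojure.string :as str])

;; line-words and sample-lines are illustrative names.
(defn line-words
  "Vector of lower-cased, whitespace-separated words across all lines."
  [lines]
  (vec (mapcat #(str/split (str/lower-case %) #"\s+") lines)))

(def sample-lines ["International donations are" "gratefully accepted"])

;; println shows the strings without quotes, which is why the original
;; output looked like one long unsplit line inside brackets.
(println (line-words sample-lines))
;; prints [international donations are gratefully accepted]

;; prn shows the readable form, with each word quoted.
(prn (line-words sample-lines))
;; prints ["international" "donations" "are" "gratefully" "accepted"]
```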

Escaping brackets in Clojure

If I try this
(import java.util.regex.Pattern)
(Pattern/compile ")!##$%^&*()")
or this
(def p #")!##$%^&*()")
I have Clojure complaining that there is an unmatched / unclosed ). Why are brackets evaluated within this simple string? How to escape them? Thanks
EDIT: While escaping works in the Clojure-specific syntax (#""), it doesn't work with the Pattern/compile syntax that I need, because I have to compile the regex pattern dynamically from a string.
I've tried with re-pattern, but I can't escape properly for some reason:
(re-pattern "\)!##$%^&*\(\)")
java.lang.Exception: Unsupported escape character: \)
java.lang.Exception: Unable to resolve symbol: ! in this context (NO_SOURCE_FILE:0)
java.lang.Exception: No dispatch macro for: $
java.lang.Exception: Unable to resolve symbol: % in this context (NO_SOURCE_FILE:0)
java.lang.IllegalArgumentException: Metadata can only be applied to IMetas
EDIT 2 This little function may help:
(defn escape-all [x]
(str "\\" (reduce #(str %1 "\\" %2) x)))
I got it working by double escaping everything. Oh the joys of double escaping.
=> (re-pattern "\\)\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)")
=> #"\)\!\#\#\$\%\^\&\*\(\)"
=> (re-find (re-pattern "\\)\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)")
")!##$%^&*()")
=> ")!##$%^&*()"
I would recommend writing a helper function str-to-pattern (or whatever you want to call it), that takes a string, double escapes everything it needs to, and then calls re-pattern on it.
Edit: making a string to pattern function
There are plenty of ways to do this, below is just one example. I start by making an smap of regex escape chars to their string replacement. An "smap" isn't an actual type, but functionally it's a map we will use to swap "old values" with "new values", where "old values" are members of the keys of the smap, and "new values" are corresponding members of the vals of smap. In our case, this smap looks like {\( "\\(", \) "\\)" ...}.
(def regex-char-esc-smap
(let [esc-chars "()*&^%$#!"]
(zipmap esc-chars
(map #(str "\\" %) esc-chars))))
Next is the actual function. I use the above smap to replace items in the string passed to it, then convert that back into a string and make a regex pattern out of it. I think the ->> macro makes the code more readable, but that's just a personal preference.
(defn str-to-pattern
[string]
(->> string
(replace regex-char-esc-smap)
(reduce str)
re-pattern))
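A similar helper can be sketched with clojure.string/escape, which takes a map from characters to replacement strings and applies it to each character of the input. The names regex-specials, escape-map and literal-pattern below are made up for illustration, and the list of special characters is my own assumption of a reasonable set rather than something taken from the answer above.

```clojure
(require '[clojure.string :as str])

;; Assumed set of regex metacharacters to escape: \ . [ ] { } ( ) * + ? ^ $ |
(def regex-specials "\\.[]{}()*+?^$|")

;; Maps each special character to itself prefixed with a backslash.
(def escape-map
  (zipmap regex-specials (map #(str "\\" %) regex-specials)))

(defn literal-pattern
  "Builds a pattern that matches s literally (hypothetical helper)."
  [s]
  (re-pattern (str/escape s escape-map)))

(prn (re-find (literal-pattern ")!##$%^&*()") ")!##$%^&*()"))
;; prints ")!##$%^&*()"
```

Because the dot is escaped as well, (literal-pattern "a.c") matches "a.c" but not "abc".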
Are you sure the error is from the reader (i.e. from Clojure itself)?
Regexps use parentheses, and they have to match there too. I would guess the error is coming from the code trying to compile the regexp.
If you want to escape a paren in a regexp, use a backslash: (def p #"\)!##$%^&*\(\)")
[update] Ah, sorry, you probably need double escapes, as Omri says.
All of the versions of Java that Clojure supports recognize \Q to start a quoted region and \E to end the quoted region. This allows you to do something like this:
(re-find #"\Q)!##$%^&*()\E" ")!##$%^&*()")
If you're using (re-pattern) then this will work:
(re-find (re-pattern "\\Q)!##$%^&*()\\E") ")!##$%^&*()")
If you're assembling a regular expression from a string whose content you don't know then you can use the quote method in java.util.regex.Pattern:
(re-find (re-pattern (java.util.regex.Pattern/quote some-str)) some-other-str)
Here's an example of this from my REPL:
user> (def the-string ")!##$%^&*()")
#'user/the-string
user> (re-find (re-pattern (java.util.regex.Pattern/quote the-string)) the-string)
")!##$%^&*()"

Extracting string from clojure collections using regex

Can you suggest the shortest and easiest way of extracting a substring from each string in a sequence? I'm getting this collection using the enlive framework, which takes content from a certain web page, and here is what I am getting as the result:
("background-image:url('http://s3.mangareader.net/cover/gantz/gantz-r0.jpg')"
"background-image:url('http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg')"
"background-image:url('http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg')" )
What I would like is some help extracting the URL from each string in the sequence. I tried something with the partition function, but with no success. Can anyone propose a regex, or any other approach to this problem?
Thanks
re-seq to the rescue!
(map #(re-seq #"http.*jpg" %) d)
(("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg")
("http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg")
("http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg"))
user>
re-find is even better:
user> (map #(re-find #"http.*jpg" %) d)
("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg"
"http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg"
"http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
because it doesn't add an extra layer of seq.
Would something simple like this work for you?
(defn extract-url [s]
(subs s (inc (.indexOf s "'")) (.lastIndexOf s "'")))
This function will return a string containing all the characters between the first and last single quotes.
Assuming your sequence of strings is named ss, then:
(map extract-url ss)
;=> ("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg"
; "http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg"
; "http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
This is definitely not a generic solution, but it fits the input you have provided.
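A regex with a capture group is another option that does not depend on the file extension or on the position of the quotes. The sketch below assumes every string has the url('...') shape shown in the question; the names css-url and ss are illustrative only.

```clojure
;; css-url is a hypothetical helper name.
(defn css-url
  "Returns the text between url(' and ') in s, or nil if there is no match."
  [s]
  ;; re-find returns [whole-match group-1] when the pattern has a group.
  (second (re-find #"url\('([^']*)'\)" s)))

(def ss
  ["background-image:url('http://s3.mangareader.net/cover/gantz/gantz-r0.jpg')"
   "background-image:url('http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg')"])

(prn (map css-url ss))
;; prints ("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg" "http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
```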

Iterating through a map with doseq

I'm new to Clojure and I'm doing some basic stuff from labrepl, now I want to write a function that will replace certain letters with other letters, for example: elosska → elößkä.
I wrote this:
(ns student.dialect (:require [clojure.string :as str]))
(defn germanize
[sentence]
(def german-letters {"a" "ä" "u" "ü" "o" "ö" "ss" "ß"})
(doseq [[original-letter new-letter] german-letters]
(str/replace sentence original-letter new-letter)))
but it doesn't work as I expect. Could you help me, please?
Here is my take,
(def german-letters {"a" "ä" "u" "ü" "o" "ö" "ss" "ß"})
(defn germanize [s]
(reduce (fn[sentence [match replacement]]
(str/replace sentence match replacement)) s german-letters))
(germanize "elosska")
There are 2 problems here:
doseq is meant for side effects: it discards the value of its body and always returns nil, so you won't get any results back.
str/replace returns a new string each time and never modifies sentence, so the four calls produce four separate results. You can check this by replacing doseq with for, which gives you a list with 4 entries.
Your code could be rewritten the following way:
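The second point can be checked with a small sketch like the one below, where doseq is swapped for for and the results are kept. The name separate-results is made up; the map is the one from the question.

```clojure
(require '[clojure.string :as str])

(def german-letters {"a" "ä" "u" "ü" "o" "ö" "ss" "ß"})

;; separate-results is an illustrative name. Each element is the original
;; word with only ONE of the replacements applied, because every call to
;; str/replace starts again from the unchanged input string.
(def separate-results
  (for [[original-letter new-letter] german-letters]
    (str/replace "elosska" original-letter new-letter)))

;; Prints four strings; their order follows the map's iteration order.
(prn separate-results)
```

None of the four strings is "elößkä", which is why the replacements have to be threaded through one another, as with reduce or loop/recur.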
(def german-letters {"a" "ä" "u" "ü" "o" "ö" "ss" "ß"})
(defn germanize [sentence]
(loop [text sentence
letters german-letters]
(if (empty? letters)
text
(let [[original-letter new-letter] (first letters)]
(recur (str/replace text original-letter new-letter)
(rest letters))))))
In this case, intermediate results are collected, so all replacements are applied to same string, producing correct string:
user> (germanize "elosska")
"elößkä"
P.S. it's also not recommended to use def in the function - it's better to use it for top-level forms
Alex has of course already correctly answered the question with respect to the original issue using doseq... but I found the question interesting and wanted to see what a more "functional" solution would look like. And by that I mean without using a loop.
I came up with this:
(ns student.dialect (:require [clojure.string :as str]))
(defn germanize [sentence]
(let [letters {"a" "ä" "u" "ü" "o" "ö" "ss" "ß"}
regex (re-pattern (apply str (interpose \| (keys letters))))]
(str/replace sentence regex letters)))
Which yields the same result:
student.dialect=> (germanize "elosska")
"elößkä"
The regex (re-pattern... line simply evaluates to a pattern like #"ss|a|o|u" (the order of the alternatives follows the map's key order), which would have been cleaner and simpler to read if entered as an explicit literal, but I thought it best to have only one definition of the German letters.