Phone number regular expressions - clojure

Included is some code that can find a single number such as "555-555-5555" in a string. But I'm not quite sure how to extend the code to find all phone numbers within a string. The code stops after it has found the first number...
(defn foo [x]
(re-find (re-matcher #"((\d+)-(\d+)-(\d+))" x)))
Is there a way to extend this code to find all numbers within a string?

re-seq returns a sequence of all the matches to a regex in a string:
user> (defn foo [x] (re-seq #"\d+-\d+-\d+" x))
#'user/foo
user> (foo "111-222-3333 555-666-7777")
("111-222-3333" "555-666-7777")
user> (foo "phone 1: 111-222-3333 phone 2: 555-666-7777")
("111-222-3333" "555-666-7777")
So it will keep going until it finds all the phone numbers in the string.

I you are interested in searching all possible phone numbers, depending on the region / country code and other parameters, check the phone-number library:
https://github.com/randomseed-io/phone-number
There is a function find-numbers for that purpose:
https://randomseed.io/software/phone-number/phone-number.core#var-find-numbers

Related

How to correctly check if a string is equal to another string in Clojure?

I am looking for better ways to check if two strings are equal in Clojure!
Given a map 'report' like
{:Result Pass}
, when I evaluate
(type (:Result report))
I get : Java.Lang.String
To write a check for the value of :Result, I first tried
(if (= (:Result report) "Pass") (println "Pass"))
But the check fails.
So I used the compare method, which worked:
(if (= 0 (compare (:Result report) "Pass")) (println "Pass"))
However, I was wondering if there is anything equivalent to Java's .equals() method in Clojure. Or a better way to do the same.
= is the correct way to do an equality check for Strings. If it's giving you unexpected results, you likely have whitespace in the String like a trailing newline.
You can easily check for whitespace by using vec:
(vec " Pass\n")
user=> [\space \P \a \s \s \newline]
As #Carcigenicate wrote, use = to compare strings.
(= "hello" "hello")
;; => true
If you want to be less strict, consider normalizing your string before you compare. If we have a leading space, the strings aren't equal.
(= " hello" "hello")
;; => false
We can then define a normalize function that works for us.
In this case, ignore leading and trailing whitespace and
capitalization.
(require '[clojure.string :as string])
(defn normalize [s]
(string/trim
(string/lower-case s)))
(= (normalize " hellO")
(normalize "Hello\t"))
;; => true
Hope that helps!

replace multiple bad characters in clojure

I am trying to replace bad characters from a input string.
Characters should be valid UTF-8 characters (tabs, line breaks etc. are ok).
However I was unable to figure out how to replace all found bad characters.
My solution works for the first bad character.
Usually there are none bad characters. 1/50 cases there is one bad character. I'd just want to make my solution foolproof.
(defn filter-to-utf-8-string
"Return only good utf-8 characters from the input."
[input]
(let [bad-characters (set (re-seq #"[^\p{L}\p{N}\s\p{P}\p{Sc}\+]+" input))
filtered-string (clojure.string/replace input (apply str (first bad-characters)) "")]
filtered-string))
How can I make replace work for all values in sequence not just for the first one?
Friend of mine helped me to find workaround for this problem:
I created a filter for replace using re-pattern.
Within let code is currently
filter (if (not (empty? bad-characters))
(re-pattern (str "[" (clojure.string/join bad-characters) "]"))
#"")
filtered-string (clojure.string/replace input filter "")
Here is a simple version:
(ns xxxxx
(:require
[clojure.string :as str]
))
(def all-chars (str/join (map char (range 32 80))))
(println all-chars)
(def char-L (str/join (re-seq #"[\p{L}]" all-chars)))
(println char-L)
(def char-N (str/join (re-seq #"[\p{N}]" all-chars)))
(println char-N)
(def char-LN (str/join (re-seq #"[\p{L}\p{N}]" all-chars)))
(println char-LN)
all-chars => " !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNO"
char-L => "ABCDEFGHIJKLMNO"
char-N => "0123456789"
char-LN => "0123456789ABCDEFGHIJKLMNO"
So we start off with all ascii chars in the range of 32-80. We first print only the letter, then only the numbers, then either letters or numbers. It seems this should work for your problem, although instead of rejecting non-members of the desired set, we keep the members of the desired set.

Processing a file character by character in Clojure

I'm working on writing a function in Clojure that will process a file character by character. I know that Java's BufferedReader class has the read() method that reads one character, but I'm new to Clojure and not sure how to use it. Currently, I'm just trying to do the file line-by-line, and then print each character.
(defn process_file [file_path]
(with-open [reader (BufferedReader. (FileReader. file_path))]
(let [seq (line-seq reader)]
(doseq [item seq]
(let [words (split item #"\s")]
(println words))))))
Given a file with this text input:
International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.
My output looks like this:
[International donations are gratefully accepted, but we cannot make]
[any statements concerning tax treatment of donations received from]
[outside the United States. U.S. laws alone swamp our small staff.]
Though I would expect it to look like:
["international" "donations" "are" .... ]
So my question is, how can I convert the function above to read character by character? Or even, how to make it work as I expect it to? Also, any tips for making my Clojure code better would be greatly appreciated.
(with-open [reader (clojure.java.io/reader "path/to/file")] ...
I prefer this way to get a reader in clojure. And, by character by character, do you mean in file access level, like read, which allow you control how many bytes to read?
Edit
As #deterb pointed out, let's check the source code of line-seq
(defn line-seq
"Returns the lines of text from rdr as a lazy sequence of strings.
rdr must implement java.io.BufferedReader."
{:added "1.0"
:static true}
[^java.io.BufferedReader rdr]
(when-let [line (.readLine rdr)]
(cons line (lazy-seq (line-seq rdr)))))
I faked a char-seq
(defn char-seq
[^java.io.Reader rdr]
(let [chr (.read rdr)]
(if (>= chr 0)
(cons chr (lazy-seq (char-seq rdr))))))
I know this char-seq reads all chars into memory[1], but I think it shows that you can directly call .read on BufferedReader. So, you can write your code like this:
(let [chr (.read rdr)]
(if (>= chr 0)
;do your work here
))
How do you think?
[1] According to #dimagog's comment, char-seq not read all char into memory thanks to lazy-seq
I'm not familiar with Java or the read() method, so I won't be able to help you out with implementing it.
One first thought is maybe to simplify by using slurp, which will return a string of the text of the entire file with just (slurp filename). However, this would get the whole file, which maybe you don't want.
Once you have a string of the entire file text, you can process any string character by character by simply treating it as though it were a sequence of characters. For example:
=> (doseq [c "abcd"]
(prntln c))
a
b
c
d
=> nil
Or:
=> (remove #{\c} "abcd")
=> (\a \b \d)
You could use map or reduce or any sort of sequence manipulating function. Note that after manipulating it like a sequence, it will now return as a sequence, but you could easily wrap the outer part in (reduce str ...) to return it back to a string at the end--explicitly:
=> (reduce str (remove #{\c} "abcd"))
=> "abd"
As for your problem with your specific code, I think the problem lies with what words is: a vector of strings. When you print each words you are printing a vector. If at the end you replaced the line (println words) with (doseq [w words] (println w))), then it should work great.
Also, based on what you say you want your output to look like (a vector of all the different words in the file), you wouldn't want to only do (println w) at the base of your expression, because this will print values and return nil. You would simply want w. Also, you would want to replace your doseqs with fors--again, to avoid return nil.
Also, on improving your code, it looks generally great to me, but--and this is going with all the first change I suggest above (but not the others, because I don't want to draw it all out explicitly)--you could shorten it with a fun little trick:
(doseq [item seq]
(let [words (split item #"\s")]
(doseq [w words]
(println w))))
;//Could be rewritten as...
(doseq [item s
:let [words (split item #"\s")]
w words]
(println w))
You're pretty close - keep in mind that Strings are a sequence. (concat "abc" "def") results in the sequence (\a \b \c \d \e \f).
mapcat is another really useful function for this - it will lazily concatenate the results of applying the mapping fn to the sequence. This means that mapcating the result of converting all of the line strings to a seq will be the lazy sequence of characters you're after.
I did this as (mapcat seq (line-seq reader)).
For other advice:
For creating the reader, I would recommend using the clojure.java.io/reader function instead of directly creating the classes.
Consider breaking apart the reading the file and the processing (in this case printing) of the strings from each other. While it is important to keep the full file parsing inside the withopen clause, being able to test the actual processing code outside of the file reading code is quite useful.
When navigating multiple (potentially nested) sequences consider using for. for does a nice job handling nested for loop type cases.
(take 100 (for [line (repeat "abc") char (seq line)] (prn char)))
Use prn for debugging output. It gives you real output, as compared to user output (which hides certain details which users don't normally care about).

Extracting string from clojure collections using regex

can you suggest me the shortest and easiest way for extracting substring from string sequence? I'm getting this collection from using enlive framework, which takes content from certain web page, and here is what I am getting as result:
("background-image:url('http://s3.mangareader.net/cover/gantz/gantz-r0.jpg')"
"background-image:url('http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg')"
"background-image:url('http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg')" )
What I would like is to get some help in extracting the URL from the each string in the sequence.i tried something with partition function, but with no success. Can anyone propose a regex, or any other approach for this problem?
Thanks
re-seq to the resque!
(map #(re-seq #"http.*jpg" %) d)
(("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg")
("http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg")
("http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg"))
user>
re-find is even better:
user> (map #(re-find #"http.*jpg" %) d)
("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg"
"http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg"
"http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
because it doesn't add an extra layer of seq.
Would something simple like this work for you?
(defn extract-url [s]
(subs s (inc (.indexOf s "'")) (.lastIndexOf s "'")))
This function will return a string containing all the characters between the first and last single quotes.
Assuming your sequence of strings is named ss, then:
(map extract-url ss)
;=> ("http://s3.mangareader.net/cover/gantz/gantz-r0.jpg"
; "http://s3.mangareader.net/cover/deadman-wonderland/deadman-wonderland-r0.jpg"
; "http://s3.mangareader.net/cover/12-prince/12-prince-r1.jpg")
This is definitely not a generic solution, but it fits the input you have provided.

Clojure: get list of regex matches

Perhaps I'm going about this all wrong, but I'm trying to get all the matches in a string for a particular regex pattern. I'm using re-matcher to get a Match object, which I pass to re-find, giving me (full-string-match, grouped-text) pairs. How would I get a sequence of all the matches produced by the Match object?
In Clojuresque Python, it would look like:
pairs = []
match = re-matcher(regex, line)
while True:
pair = re-find(match)
if not pair: break
pairs.append(pair)
Any suggestions?
You probably want to use the built in re-seq and Clojure's built in regex literal. Don't mess with underlying java objects unless you really have too.
(doc re-seq)
clojure.core/re-seq
([re s])
Returns a lazy sequence of successive matches of pattern in string,
using java.util.regex.Matcher.find(), each such match processed with
re-groups.
For example:
user> (re-seq #"the \w+" "the cat sat on the mat")
("the cat" "the mat")
In answer to the follow-up comment, group captures will result in a vector of strings with an element for each part of the group in a match:
user> (re-seq #"the (\w+(t))" "the cat sat on the mat")
(["the cat" "cat" "t"] ["the mat" "mat" "t"])
You can extract a specific element by taking advantage of the elegant fact that vectors are functions of their indices.
user> (defn extract-group [n] (fn [group] (group n)))
#'user/extract-group
user> (let [matches (re-seq #"the (\w+(t))" "the cat sat on the mat")]
(map (extract-group 1) matches))
("cat" "mat")
Or you can destructure the matches (here using a for macro to go over all the matches but this could also be done in a let or function argument binding):
user> (dorun
(for [[m1 m2 m3] (re-seq #"the (\w+(t))" "the cat sat on the mat")]
(do (println "m1:" m1)
(println "m2:" m2)
(println "m3:" m3))))
m1: the cat
m2: cat
m3: t
m1: the mat
m2: mat
m3: t