Group a list of strings by the first n letters - clojure

I have a collection of strings like
["snowy10" "catty20" "manny20" "snowy20" "catty10" "snowy20" "catty30" "manny10" "snowy20" "manny30"]
I would like to convert it to a collection of collections, grouped by the first five characters of each string:
[["snowy10" "snowy20" "snowy20"] ["catty10" "catty20" "catty30"] ["manny10" "manny20" "manny20"]]
Looking for a solution in Clojure.

The group-by function is helpful here:
clojure.core/group-by
([f coll])
Returns a map of the elements of coll keyed by the result of
f on each element. The value at each key will be a vector of the
corresponding elements, in the order they appeared in coll.
In other words, group-by uses the given function f to produce a key for each element in coll, and the value associated with that key is a vector of accumulated elements for that key.
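As a minimal illustration of that behaviour, here is a sketch that uses count as the key function. The input vector is made up for the example and is not from the question:

```clojure
;; Group made-up strings by their length; `count` is the key function.
(def by-length (group-by count ["a" "bb" "c" "dd" "eee"]))

by-length
;; => {1 ["a" "c"], 2 ["bb" "dd"], 3 ["eee"]}
```

Each value keeps its elements in the order they appeared in the input.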
In your example, if we know all input strings are guaranteed to have at least 5 characters, then we can use subs. But it's easier to construct a robust solution that is more general using take:
(def strings ["snowy10" "catty20" "manny20" "snowy20" "catty10" "snowy20" "catty30" "manny10" "snowy20" "manny30"])
(group-by (partial take 5) strings)
gives us:
{(\s \n \o \w \y) ["snowy10" "snowy20" "snowy20" "snowy20"]
(\c \a \t \t \y) ["catty20" "catty10" "catty30"]
(\m \a \n \n \y) ["manny20" "manny10" "manny30"]}
This isn't quite what we want -- we just want the map values. For that, we use vals:
(-> (group-by (partial take 5) strings)
(vals))
and we get:
(["snowy10" "snowy20" "snowy20" "snowy20"]
["catty20" "catty10" "catty30"]
["manny20" "manny10" "manny30"])
Changing the grouping criteria is as simple as changing the "key" function we provide to group-by. For example, we can group by the last two characters in each string by using take-last:
(-> (group-by (partial take-last 2) strings)
(vals))
which gives:
(["snowy10" "catty10" "manny10"]
["catty20" "manny20" "snowy20" "snowy20" "snowy20"]
["catty30" "manny30"])
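Note that take produces sequences of characters as keys. If string keys are preferred, one possible sketch is a small subs-based helper that clamps the end index, so short strings do not throw. The name prefix is hypothetical, not a core function:

```clojure
(defn prefix
  "Hypothetical helper: the first n characters of s, or all of s when it is shorter."
  [n s]
  (subs s 0 (min n (count s))))

(def strings ["snowy10" "catty20" "manny20" "snowy20" "catty10"
              "snowy20" "catty30" "manny10" "snowy20" "manny30"])

(def grouped (group-by (partial prefix 5) strings))

(grouped "catty") ;; => ["catty20" "catty10" "catty30"]
(prefix 5 "cat")  ;; => "cat" (no StringIndexOutOfBoundsException)
```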

user> (def v ["snowy10" "catty20" "manny20" "snowy20" "catty10" "snowy20" "catty30" "manny10" "snowy20" "manny30"])
#'user/v
user> (vals (group-by #(subs % 0 5) v))
(["snowy10" "snowy20" "snowy20" "snowy20"]
["catty20" "catty10" "catty30"]
["manny20" "manny10" "manny30"])

How about splitting the string on \d, like this:
user=> (def v ["snowy10" "catty20" "manny20" "snowy20" "catty10" "snowy20" "catty30" "manny10" "snowy20" "manny30"])
#'user/v
user=> (vals (group-by #(first (clojure.string/split % #"\d")) v))
(["snowy10" "snowy20" "snowy20" "snowy20"]
["catty20" "catty10" "catty30"]
["manny20" "manny10" "manny30"])
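The regex approach also helps when the name part is not always five characters long. A sketch with made-up input, using re-find to take the leading run of non-digit characters as the key (name-part is a hypothetical helper name):

```clojure
(defn name-part
  "Hypothetical helper: the leading non-digit characters of s."
  [s]
  (re-find #"^\D*" s))

(group-by name-part ["ann1" "bobby20" "ann30" "cy7"])
;; => {"ann" ["ann1" "ann30"], "bobby" ["bobby20"], "cy" ["cy7"]}
```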

Related

How to get all trigrams of a string in clojure

Suppose I have a string "This is a string". The tri-grams would be "Thi", "his", "is ", "s i" etc. I want to return a vector of all the tri-grams. How can I do that?
You can use partition or partition-all, depending on whether you are
also interested in the trailing, shorter "non-tri-grams":
user=> (doc partition)
-------------------------
clojure.core/partition
([n coll] [n step coll] [n step pad coll])
Returns a lazy sequence of lists of n items each, at offsets step
apart. If step is not supplied, defaults to n, i.e. the partitions
do not overlap. If a pad collection is supplied, use its elements as
necessary to complete last partition upto n items. In case there are
not enough padding elements, return a partition with less than n items.
user=> (doc partition-all)
-------------------------
clojure.core/partition-all
([n] [n coll] [n step coll])
Returns a lazy sequence of lists like partition, but may include
partitions with fewer than n items at the end. Returns a stateful
transducer when no collection is provided.
E.g.
user=> (partition 3 1 "This is a string")
((\T \h \i)
(\h \i \s)
(\i \s \space)
(\s \space \i)
(\space \i \s)
(\i \s \space)
(\s \space \a)
(\space \a \space)
(\a \space \s)
(\space \s \t)
(\s \t \r)
(\t \r \i)
(\r \i \n)
(\i \n \g))
To get the strings back, join the chars:
user=> (map clojure.string/join (partition 3 1 "This is a string"))
("Thi"
"his"
"is "
"s i"
" is"
"is "
"s a"
" a "
"a s"
" st"
"str"
"tri"
"rin"
"ing")
Or replace with partition-all accordingly:
user=> (map clojure.string/join (partition-all 3 1 "This is a string"))
("Thi"
; ...
"rin"
"ing"
"ng" ; XXX
"g") ; XXX

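Since the trigram question asks for a vector, the partition approach can be wrapped in a small function that uses mapv. This is only a sketch, and trigrams is a hypothetical name:

```clojure
(defn trigrams
  "Hypothetical helper: all overlapping 3-character substrings of s, as a vector."
  [s]
  (mapv #(apply str %) (partition 3 1 s)))

(trigrams "This is")
;; => ["Thi" "his" "is " "s i" " is"]

(trigrams "ab")
;; => [] (too short for a single trigram)
```
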
Comparing two strings and returning the number of matched words

I'm fairly new to Clojure, and to programming in general.
Is there a way I can compare two strings word by word and return the number of words that match in both strings? Also, how can I count the number of words in a string?
Ex:
Comparing string1 "Hello Alan and Max" with string2 "Hello Alan and Bob" should return 3 (since "Hello", "Alan", and "and" are the words matched in both strings),
and finding the number of words in string1 should result in 4.
Thank you
Let's break it down into some smaller problems:
compare two strings word by word
First we'll need a way to take a string and return its words. One way to do this is to assume any whitespace is separating words, so we can use a regular expression with clojure.string/split:
(defn string->words [s]
(clojure.string/split s #"\s+"))
(string->words "Hello world, it's me, Clojure.")
=> ["Hello" "world," "it's" "me," "Clojure."]
return the number of matched words in both strings
The easiest way I can imagine doing this is to build two sets, one for the words of each sentence, and find the intersection of the two sets:
(set (string->words "a b c a b c d e f"))
=> #{"d" "f" "e" "a" "b" "c"} ;; #{} represents a set
And we can use the clojure.set/intersection function to find the intersection of two sets:
(require 'clojure.set) ; clojure.set may not be loaded by default
(defn common-words [a b]
  (let [a (set (string->words a))
        b (set (string->words b))]
    (clojure.set/intersection a b)))
(common-words "say you" "say me")
=> #{"say"}
To get the count of (matching) words, we can use the count function with the output of the above functions:
(count (common-words "say you" "say me")) ;; => 1
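Putting the pieces together for the strings from the question, a self-contained sketch might look like this. The namespace aliases set and str are my choice:

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn string->words [s]
  (str/split s #"\s+"))

(defn common-words [a b]
  (set/intersection (set (string->words a))
                    (set (string->words b))))

;; Number of distinct words shared by both strings.
(count (common-words "Hello Alan and Max" "Hello Alan and Bob")) ;; => 3

;; Number of words in a single string.
(count (string->words "Hello Alan and Max"))                     ;; => 4
```

Because sets discard duplicates, a word that occurs several times in both strings is counted once.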
What you need to do is compare the items of the two word sequences pairwise and count the number of items until the first mismatch. Here is an almost word-for-word translation of this:
(defn mismatch-idx [s1 s2]
  (let [w #"\S+"]
    (->> (map = (re-seq w s1) (re-seq w s2))
         (take-while true?)
         count)))
user> (mismatch-idx "a b c" "qq b c")
;;=> 0
user> (mismatch-idx "a b c" "a x c")
;;=> 1
user> (mismatch-idx "a b c" "a b x")
;;=> 2
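If words that agree after the first mismatch should count as well, a hypothetical variant (not part of the answer above) keeps the same pairwise comparison but swaps take-while for filter:

```clojure
(defn positional-matches
  "Hypothetical variant: counts the positions at which both strings have the same word."
  [s1 s2]
  (let [w #"\S+"]
    (->> (map = (re-seq w s1) (re-seq w s2))
         (filter true?)
         count)))

(positional-matches "a b c" "a x c")                           ;; => 2
(positional-matches "Hello Alan and Max" "Hello Alan and Bob") ;; => 3
```

Unlike the set-based approach, this only counts a word when it appears at the same position in both strings.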

take-while on clojure.string does not work

I'm wondering why Clojure does not treat a string as an array, as Scala or Haskell do.
I want a take-while function on strings, as in Scala below:
scala> "chich and chong".takeWhile(_ != ' ')
res1: String = chich
But take-while in Clojure does not seem to work with strings:
user=> (take-while #(not= % " ") "chich and chong")
(\c \h \i \c \h \space \a \n \d \space \c \h \o \n \g)
Just to make sure char/string equality works in clojure,
user=> (= " " " ")
true
user=> (not= 'A " ")
true
take-while seems to work only with vectors:
user=> (take-while #(< % 0) [-3 -2 -1 0 1 2 3])
(-3 -2 -1)
I tried converting the string to a vector as well, but it returns the same as the input:
user=> (vec "apple")
[\a \p \p \l \e]
user=> (take-while #(not= % "p") (vec "apple"))
(\a \p \p \l \e)
How can I use take-while with a string in Clojure?
You should use a character literal instead of a string containing a space:
user=> (take-while #(not= % \space) "chich and chong")
=> (\c \h \i \c \h)
That is because:
" " is a java.lang.String
\space is a java.lang.Character
More info: \ - Character literal
Just to point out that your code would work in ClojureScript, because the host platform (JavaScript) has no character type, so characters are represented as one-character strings. On the JVM, though, characters are their own type.
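To get a string back, as the Scala version does, the character sequence can be joined again. A short sketch:

```clojure
(require '[clojure.string :as str])

;; take-while yields a seq of characters; apply str joins them into a string.
(apply str (take-while #(not= % \space) "chich and chong"))
;; => "chich"

;; For this particular case, splitting on whitespace gives the same first word.
(first (str/split "chich and chong" #"\s+"))
;; => "chich"
```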

Write a function to print (non-negative) integer numbers in full words in Clojure

(defn num-as-words [n]
  (let [words '("zero" "one" "two" "three" "four"
                "five" "six" "seven" "eight" "nine")]
    (clojure.string/join "-"
      (map (fn [x] (nth words (Integer. (re-find #"\d" (str x))))) (str n)))))
I've written this function, called num-as-words, which takes an integer and displays it as full words. For example, given 123 it returns "one-two-three".
I've done it using map, but I was wondering if there is another way of doing it. I was also wondering if there is another way to connect the words besides clojure.string/join. I was initially using interpose but didn't like its output, which looked like ("one" "-" "two" "-" "three").
Any help would be greatly appreciated, thank you.
user=> (clojure.pprint/cl-format          ; formatted printing
         nil                              ; ... to a string
         "~{~R~^-~}"                      ; format (see below)
         (map                             ; map over characters
           (fn [x] (Integer. (str x)))    ; convert char to integer
           (str 123)))                    ; convert number to string
"one-two-three"
First, we take the input number, hard-coded as 123 in the example, coerce it to a string, and iterate over the resulting string's characters with map. For each character, we build a string containing that character and parse it as an Integer. Thus, we obtain a list of digits.
More precisely, (fn [x] ...) is a function taking one argument. You should probably name it char instead (sorry), because we iterate over characters. When we evaluate (str x), we obtain a string containing one char, namely x. For example, if the character is \2, the resulting string is "2". The (Integer. string) form (notice the dot!) calls the constructor of the Integer class, which parses a string as an integer. To continue with our example, (Integer. "2") would yield the integer 2.
We use cl-format to print the list of digits into a fresh string (as requested by the nil argument). In order to do that, we specify the format as follows:
~{...~} iterates over a list and executes the format inside the braces for each element.
~R prints a number as an English word (1 => one, etc.)
~^ escapes the iteration made by ~{...~} when there are no remaining arguments. So when we print the last digit, the part that follows ~^ is not printed.
What follows ~^ is simply the character -. It separates the words, but we had to take care not to print a dash on every iteration of the loop, otherwise the resulting string would have ended with a dash.
If any character cannot be parsed as an Integer, the function will report an error. You might want to check first that the input really is a non-negative integer before converting it to a string.
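One way to add that check is to wrap the cl-format call in a function that rejects bad input up front. This is a sketch under my own assumptions: digits-as-words is a hypothetical name, and Integer/parseInt stands in for the Integer. constructor used above:

```clojure
(require '[clojure.pprint :as pp])

(defn digits-as-words
  "Hypothetical wrapper around the cl-format call; rejects anything that is
  not a non-negative integer before formatting."
  [n]
  (when-not (and (integer? n) (not (neg? n)))
    (throw (IllegalArgumentException.
             (str "Expected a non-negative integer, got: " n))))
  (pp/cl-format nil "~{~R~^-~}"
                (map #(Integer/parseInt (str %)) (str n))))

(digits-as-words 123)
;; => "one-two-three"
```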
I'd implement it like this:
(defn num-as-words [n]
  (let [words ["zero" "one" "two" "three" "four" "five" "six" "seven" "eight" "nine"]]
    (->> (str n)
         (map #(Character/getNumericValue %))
         (map words)
         (clojure.string/join "-"))))
Using a vector simplifies the implementation, because a vector can be called as a function of its index (hence (map words)).
Instead of splitting the number string with a regular expression, you can treat it as a sequence of characters. In that case, use Character/getNumericValue to convert each char to an integer.
You can use the ->> macro to thread the steps.
Using clojure.string/join looks fine.
interpose returns a lazy sequence, which is why the result looks like ("one" "-" "two" ...). Apply str to the result, (apply str (interpose ...)), to convert it to a string.
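A short sketch of the points about vectors and interpose, using a shortened word vector for the example:

```clojure
(require '[clojure.string :as str])

;; A vector is a function of its index, so it can be passed to map directly.
(map ["zero" "one" "two"] [2 0 1])
;; => ("two" "zero" "one")

;; interpose only inserts the separator; the result is still a sequence.
(interpose "-" ["one" "two" "three"])
;; => ("one" "-" "two" "-" "three")

;; apply str concatenates the elements, matching what join produces.
(apply str (interpose "-" ["one" "two" "three"]))
;; => "one-two-three"

(str/join "-" ["one" "two" "three"])
;; => "one-two-three"
```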
If you want to handle negative numbers, you can modify the code like this:
(defn num-as-words [n]
  (if (< n 0)
    (str "-" (num-as-words (- n)))
    (let [words ["zero" "one" "two" "three" "four" "five" "six" "seven" "eight" "nine"]]
      (->> (str n)
           (map #(Character/getNumericValue %))
           (map words)
           (clojure.string/join "-")))))
This prepends - to the result. If you would rather throw an error, you can use a precondition:
(defn num-as-words [n]
  {:pre [(<= 0 n)]}
  (let [words ["zero" "one" "two" "three" "four" "five" "six" "seven" "eight" "nine"]]
    ...
This throws an AssertionError when it receives a negative number.
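To see the precondition mechanism in isolation, here is a sketch with a deliberately tiny function. The function half is hypothetical and exists only for this illustration:

```clojure
(defn half
  "Hypothetical example function guarded by a precondition."
  [n]
  {:pre [(<= 0 n)]}
  (/ n 2))

(half 10)
;; => 5

;; A failed precondition surfaces as a java.lang.AssertionError.
(try
  (half -10)
  (catch AssertionError _
    :rejected))
;; => :rejected
```

Keep in mind that preconditions are assertions: they are compiled out when *assert* is bound to false, so they suit sanity checks better than validation of untrusted input.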

How to update-in ignoring the first level in Clojure

When using update-in, we need to provide the full path to an element. But what if I want to update ALL elements whose second-level key is :MaxInclude?
E.g. the input is
(def a {:store {:type "varchar"},
:amount {:nullable true, :length nil, :type "float", :MaxInclude "100.02"},
:unit {:type "int"},
:unit-uniform {:type "int" :MaxInclude "100"}
})
The required output is (converting :MaxInclude from string to float/int based on its type):
{:store {:type "varchar"},
:amount {:nullable true, :length nil, :type "float", :MaxInclude 100.02},
:unit {:type "int"},
:unit-uniform {:type "int" :MaxInclude 100}
}
I was thinking it would be nice to have a function like update-in that matches on key predicate functions instead of exact key values. This is what I came up with:
(defn update-all
  "Like update-in except the second parameter is a vector of predicate
  functions taking keys as arguments. Updates all values contained at a
  matching path. Looks for keys in maps only."
  [m [key-pred & key-preds] update-fn]
  (if (map? m)
    (let [matching-keys (filter key-pred (keys m))
          f (fn [acc k]
              (update-in acc [k] (if key-preds
                                   #(update-all %
                                                key-preds
                                                update-fn)
                                   update-fn)))]
      (reduce f m matching-keys))
    m))
With this in place, all you need to do is:
(update-all a [= #{:MaxInclude}] read-string)
The = is used as the first key matching function because it always returns true when passed one argument. The second is using the fact that a set is a function. This function uses non-optimised recursion but the call stack will only be as deep as the number of matching map levels.
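The two key predicates can be tried on their own at the REPL. A small sketch:

```clojure
;; = called with a single argument is always true, so it matches every key.
(= :store)                   ;; => true
(= :amount)                  ;; => true

;; A set looks its argument up and returns it (truthy) or nil (falsey).
(#{:MaxInclude} :MaxInclude) ;; => :MaxInclude
(#{:MaxInclude} :type)       ;; => nil

;; Which is why a set works as a predicate for filter.
(filter #{:MaxInclude} [:nullable :length :type :MaxInclude])
;; => (:MaxInclude)
```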
(into {}
      (map (fn [[k v]]
             {k (if (contains? v :MaxInclude)
                  (update-in v [:MaxInclude] read-string)
                  v)})
           a))
Here I am mapping over the key-value pairs and destructuring each into k and v. Then I use update-in on the value if it contains :MaxInclude. Finally, I pour the pairs from a list into a hash map.
Notes:
This will throw on contains? if any of the main map's values is of a type that contains? does not support (for example, a number).
I use read-string as a convenient way to convert the string to a number in the same way the Clojure reader would do when compiling the string that is your number literal. There may be disadvantages to this approach.
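Both notes can be addressed in a variant of the same idea. The sketch below is one possible approach, not the method of either answer: it checks map? before calling contains?, and it uses clojure.edn/read-string, because clojure.core/read-string can evaluate code embedded in its input (via #= when *read-eval* is enabled). The name parse-max-include and the :note entry are made up for the example:

```clojure
(require '[clojure.edn :as edn])

(defn parse-max-include
  "Hypothetical helper: parses the :MaxInclude string of every map-valued entry."
  [m]
  (reduce-kv (fn [acc k v]
               (assoc acc k (if (and (map? v) (contains? v :MaxInclude))
                              (update v :MaxInclude edn/read-string)
                              v)))
             {}
             m))

(parse-max-include {:store        {:type "varchar"}
                    :amount       {:type "float" :MaxInclude "100.02"}
                    :unit-uniform {:type "int" :MaxInclude "100"}
                    :note         "not a map"})
;; => {:store {:type "varchar"},
;;     :amount {:type "float", :MaxInclude 100.02},
;;     :unit-uniform {:type "int", :MaxInclude 100},
;;     :note "not a map"}
```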