clojure.string/replace does not match pattern matched by re-seq - regex

Why does clojure.string/replace not match the \"[^\"]+\" pattern while re-seq does?
(re-seq #"\"[^\"]+\"" "ab,\"helo,bro\",yo")
=> ("\"helo,bro\"")
(clojure.string/replace "ab,\"helo,bro\",yo" #"\"[^\"]+\”" "")
=> "ab,\"helo,bro\",yo"
I would expect replace to delete the matched pattern. What am I missing here?
Thanks for insight.

Your regexes are (probably unintentionally) different: in the replace call the pattern ends with \” (a typographic closing quote) instead of \".
If you use exactly the same regex, it works as expected:
(def r #"\"[^\"]+\"")
(re-seq r "ab,\"helo,bro\",yo")
=> ("\"helo,bro\"")
(clojure.string/replace "ab,\"helo,bro\",yo" r "")
=> "ab,,yo"
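A quick way to spot this kind of typo is to list any non-ASCII characters in the pattern source. This is only a minimal sketch, assuming JVM Clojure; non-ascii-chars is a hypothetical helper name, not part of clojure.string:

```clojure
;; Hypothetical helper: report the code point of every non-ASCII
;; character found in a regex's source text.
(defn non-ascii-chars [re]
  (for [c (str re)                ; on the JVM, (str re) is the pattern source
        :when (> (int c) 127)]
    (format "U+%04X" (int c))))

(non-ascii-chars #"\"[^\"]+\"")   ; the working pattern: nothing to report
(non-ascii-chars #"\"[^\"]+\”")   ; the pattern from the question
```

The first call returns an empty sequence; the second reports U+201D, the right double quotation mark.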

Related

In ClojureScript, how to split string around a regex and keep the matches in the result, without using lookaround?

I want to split a string on an arbitrary regular expression (similar to clojure.string/split) but keep the matches in the result. One way to do this is with lookaround in the regex but this doesn't work well in ClojureScript because it's not supported by all browsers.
In my case, the regex is #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}"
So for example, foo {{bar}} baz should be split into ("foo " "{{bar}}" " baz").
Thanks!
One possible solution is to choose a special character as a delimiter, wrap each match in it during replace, and then split on that character. Here I used an exclamation mark; note that this only works if the delimiter never occurs in the input:
Require: [clojure.string :as s]
(-> "foo {{bar}} baz"
    (s/replace #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}" "!$0!")
    (s/split #"!"))
=> ["foo " "{{bar}}" " baz"]
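If no character is safe to use as a delimiter, another option is to combine split and re-seq and weave the results together. The following is a sketch under stated assumptions, written for a JVM Clojure REPL: split-keeping-matches is a hypothetical helper, and it assumes the regex has no capture groups and never matches the empty string.

```clojure
(require '[clojure.string :as s])

;; Hypothetical helper. Assumes `re` has no capture groups (so re-seq
;; returns plain strings) and never matches the empty string.
(defn split-keeping-matches [re text]
  (let [pieces  (s/split text re -1)   ; -1 keeps trailing empty strings
        matches (re-seq re text)]
    ;; pieces has exactly one more element than matches, so weave them
    ;; together and drop the empty strings left next to adjacent matches.
    (remove empty?
            (cons (first pieces)
                  (interleave matches (rest pieces))))))

(split-keeping-matches #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}" "foo {{bar}} baz")
;; => ("foo " "{{bar}}" " baz")
```

The result is a lazy sequence rather than a vector; wrap it in vec if a vector is needed.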

re-seq not returning all regex matches

I'm using re-seq to get all the sequences that match a regex in clojure like so:
(re-seq #"(.*)\&(.*)" "((a & b) & c)")
And I'm getting the following result:
(["(a&b)&c" "(a&b)" "c"])
Whereas I expect the sequence to contain all such regex matches like so:
(["(a&b)&c" "(a&b)" "c"] ["(a&b)&c" "(a" "b)&c"])
How to fix this and what am I doing wrong?
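As Jas alluded to, the star is greedy. Also, re-seq only returns successive non-overlapping matches, so a pattern that consumes the whole string yields a single match. In this case the non-greedy star (which Java calls a "reluctant quantifier") should work: .*?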
(re-seq #"(.*?)\&(.*)" "((a&b)&c)")
gives
(["((a&b)&c)" "((a" "b)&c)"])
This can be challenging to apply in more complicated situations.
Source: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
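Because re-seq never returns overlapping matches, no single pattern will produce every split the question asks for. One alternative is to skip the regex and enumerate the positions of "&" directly. This is a minimal sketch: splits-at is a hypothetical helper, and the input "(a&b)&c" is assumed from the output shown in the question.

```clojure
;; Hypothetical helper: one [whole before after] triple for every
;; position where the character `ch` occurs in `s`.
(defn splits-at [s ch]
  (for [i (range (count s))
        :when (= ch (nth s i))]
    [s (subs s 0 i) (subs s (inc i))]))

(splits-at "(a&b)&c" \&)
;; => (["(a&b)&c" "(a" "b)&c"] ["(a&b)&c" "(a&b)" "c"])
```

The triples come out in left-to-right order of the "&" positions, so the order differs from the one listed in the question.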

Replace end of string with String/replace and re-pattern - Clojure

I want to remove a substring at the end of a string containing some code.
I have a vector a containing the expression "c=c+1"
My goal is to remove the expression "c=c+1;" at the end of my expression.
I have used the $ symbol indicating that the substring to replace must be at the end of my code.
Here is the code and the output :
project.core=> (def a [:LangFOR [:before_for "a=0; b="] [:init_var "c=a+1;"] [:loop_condition_expression "c-10;"] [:loop_var_step "c=c+1"] [:statements_OK "a=2*c;"] [:after_for " b+c;"]])
#'project.core/a
project.core=> (prn (str "REGEX debug : " (clojure.string/replace "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;" (re-pattern (str "# "(get-in a [4 1]) ";$")) "")))
"REGEX debug : b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;"
nil
The expected output is :
"REGEX debug : b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;"
How can I correct my (re-pattern) function?
Thanks.
The string you're using to build the regex pattern has some characters in it that have special meaning in a regular expression. The + in c+1 is interpreted as one or more occurrences of c followed by 1. Java's Pattern class provides a function to escape/quote strings so they can be used literally in regex patterns. You could use it directly, or define a wrapper function:
(defn re-quote [s]
  (java.util.regex.Pattern/quote s))
(re-quote "c=c+1")
=> "\\Qc=c+1\\E"
This function simply wraps the input string in the markers \Q and \E, which tell the regex engine to treat everything between them as literal text.
Now you can use that literal string to build a regex pattern:
(clojure.string/replace
  "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;"
  (re-pattern (str (re-quote "c=c+1;") "$"))
  "")
=> "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;"
I removed the leading "# " from the pattern in your example to make this work, because that doesn't appear in the input.
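To tie this back to the vector from the question, the quoted pattern can be built from the value returned by get-in. This is only a sketch: strip-suffix is a hypothetical helper name, and it calls Pattern/quote directly rather than going through a wrapper.

```clojure
(require '[clojure.string :as str])
(import 'java.util.regex.Pattern)

(def a [:LangFOR [:before_for "a=0; b="] [:init_var "c=a+1;"]
        [:loop_condition_expression "c-10;"] [:loop_var_step "c=c+1"]
        [:statements_OK "a=2*c;"] [:after_for " b+c;"]])

;; Hypothetical helper: remove `suffix` only when it ends `code`.
;; Pattern/quote makes the suffix literal; "$" anchors it at the end.
(defn strip-suffix [code suffix]
  (str/replace code (re-pattern (str (Pattern/quote suffix) "$")) ""))

(strip-suffix "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;"
              (str (get-in a [4 1]) ";"))
;; => "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;"
```

Only the final occurrence is removed; the earlier "c=c+1;" occurrences are untouched because of the "$" anchor.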

why is re-find returning a vector with two elements?

When using re-find in this way:
(re-find #"(\d{3})" "abc1245")
I get:
["124" "124"]
when I expect just one value. What's going on?
It is because the parentheses create a regex "group". See
https://clojuredocs.org/clojure.core/re-find
for examples. Here's the difference:
(re-find #"(\d{3})" "abc1245") => ["124" "124"] ; #1
(re-find #"\d{3}" "abc1245") => "124" ; #2
(re-seq #"\d{3}" "abc1245") => ("124") ; #3
(re-seq #"\d{3}" "abc12345678") => ("123" "456") ; #4
So, #1 gives you both the result and the "group result". #2 gives you just the matched substring.
#3 gives you a sequence of all matches. Since there are only 4 digits, the remaining "5" isn't enough to match 3 digits.
#4 has 8 digits in total, so we get "123" and "456" as matches, with 7 and 8 left over since we only want triples of digits.
Round brackets in your regex (\d{3}) define a capture group, so re-find returns the whole match and all the match groups.
As per re-find doc:
Returns the next regex match, if any, of string to pattern, using
java.util.regex.Matcher.find(). Uses re-groups to return the groups.
If you remove the brackets, you get only the single match you expected:
=> (re-find #"\d{3}" "abc1245")
=> "124"
=> (re-find #"(\d{3})" "abc1245")
=> ["124" "124"]
You can check it yourself in this online repl.
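A few more cases may help show how groups shape the result. These are small illustrative calls on the same input, not taken from the answers above:

```clojure
;; Two capture groups: whole match first, then one entry per group.
(re-find #"(\d)(\d{2})" "abc1245")      ; => ["124" "1" "24"]

;; A non-capturing group (?:...) groups without adding an entry,
;; so the result is a plain string again.
(re-find #"(?:\d{3})" "abc1245")        ; => "124"

;; Destructuring picks out the pieces when the groups are wanted.
(let [[whole first-digit] (re-find #"(\d)\d{2}" "abc1245")]
  [whole first-digit])                  ; => ["124" "1"]
```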

Regexp doesn't evaluate meta-character \w

Fiddling around in Racket I'm trying to write a simple lexer that uses regular expressions to handle patterns, but it doesn't seem to want to work with the meta-character \w.
#lang racket
(define (tokenize-broken str)
  (match str
    ["\"" 'StringDelim]
    [(regexp #rx"#\\w+") 'Message]
    [_ 'Undefined]))
(define (tokenize-working str)
  (match str
    ["\"" 'StringDelim]
    [(regexp #rx"#[a-zA-Z_]+") 'Message]
    [_ 'Undefined]))
Now when I try to run them in the repl I get this:
> (tokenize-broken "#msg")
'Undefined
> (tokenize-working "#msg")
'Message
So what's going on here? Why can't I get \w to work? It works fine in other languages supporting regular expressions, so why not here?
I believe that \w is not supported in the regexp (#rx) syntax. Try pregexp (i.e., "Perl-style" regexp) and use #px instead of #rx.
(define (tokenize-fixed str)
  (match str
    ["\"" 'StringDelim]
    [(pregexp #px"#\\w+") 'Message]
    [_ 'Undefined]))
> (tokenize-fixed "#msg")
'Message
It works: http://pasterack.org/pastes/19596