I'm using re-seq to get all the sequences that match a regex in Clojure, like so:
(re-seq #"(.*)\&(.*)" "(a&b)&c")
And I'm getting the following result:
(["(a&b)&c" "(a&b)" "c"])
Whereas I expect the sequence to contain all such regex matches like so:
(["(a&b)&c" "(a&b)" "c"] ["(a&b)&c" "(a" "b)&c"])
How do I fix this, and what am I doing wrong?
As Jas alluded to, the star is greedy. In this case the non-greedy star, .*? (which Java calls a "reluctant quantifier"), should work:
(re-seq #"(.*?)\&(.*)" "((a&b)&c)")
gives
(["((a&b)&c)" "((a" "b)&c)"])
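Reluctant quantifiers can be hard to apply in more complicated patterns. Note also that re-seq returns successive non-overlapping matches, so neither version returns both splits in a single call.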
Source: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html
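To list every way of splitting the string around an &, one option is to skip the regex and try each & position in turn. A minimal sketch in Python (used here purely for illustration; all_splits is a hypothetical helper, not a library function, and the same loop ports directly to Clojure):

```python
def all_splits(text, sep="&"):
    """Return [whole, left, right] for every occurrence of sep in text."""
    return [
        [text, text[:i], text[i + len(sep):]]
        for i in range(len(text))
        if text.startswith(sep, i)
    ]


for split in all_splits("(a&b)&c"):
    print(split)
```

Each result mirrors the shape of a re-seq match vector: the whole string, then the text on either side of one particular &.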
I want to split a string on an arbitrary regular expression (similar to clojure.string/split) but keep the matches in the result. One way to do this is with lookaround in the regex but this doesn't work well in ClojureScript because it's not supported by all browsers.
In my case, the regex is #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}".
So for example, foo {{bar}} baz should be split into ("foo " "{{bar}}" " baz").
Thanks!
One possible solution is to choose some special character as a delimiter, insert it around each match with replace, and then split on that character. This assumes the delimiter never occurs in the input. Here I used an exclamation mark:
Require: [clojure.string :as s]
(-> "foo {{bar}} baz"
    (s/replace #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}" "!$0!")
    (s/split #"!"))
=> ["foo " "{{bar}}" " baz"]
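If the input may itself contain the delimiter character, another option is to walk over the matches and collect the text between them, which needs neither lookaround nor a reserved character. A minimal sketch in Python (for illustration only; split_keeping_matches and TEMPLATE_VAR are hypothetical names, and the pattern is the one from the question):

```python
import re


def split_keeping_matches(pattern, text):
    """Split text on pattern, keeping each matched delimiter as its own item."""
    parts = []
    pos = 0  # end of the previous match
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            parts.append(text[pos:m.start()])  # text before this match
        parts.append(m.group(0))               # the match itself
        pos = m.end()
    if pos < len(text):
        parts.append(text[pos:])               # trailing text after last match
    return parts


TEMPLATE_VAR = r"\{\{\s*[A-Za-z0-9_.]+?\s*\}\}"
print(split_keeping_matches(TEMPLATE_VAR, "foo {{bar}} baz"))
```

Unlike the replace-then-split approach, this sketch omits empty strings between adjacent matches; whether that is desirable depends on the caller.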
Fiddling around in Racket I'm trying to write a simple lexer that uses regular expressions to handle patterns, but it doesn't seem to want to work with the meta-character \w.
#lang racket
(define (tokenize-broken str)
  (match str
    ["\"" 'StringDelim]
    [(regexp #rx"#\\w+") 'Message]
    [_ 'Undefined]))
(define (tokenize-working str)
  (match str
    ["\"" 'StringDelim]
    [(regexp #rx"#[a-zA-Z_]+") 'Message]
    [_ 'Undefined]))
Now when I try to run them in the repl I get this:
> (tokenize-broken "#msg")
'Undefined
> (tokenize-working "#msg")
'Message
So what's going on here? Why can't I get \w to work? It works fine in other languages supporting regular expressions, so why not here?
\w is not supported in the #rx (regexp) syntax; it is only available in the #px (pregexp, i.e. "Perl-style") syntax. Use pregexp and write #px instead of #rx:
(define (tokenize-fixed str)
  (match str
    ["\"" 'StringDelim]
    [(pregexp #px"#\\w+") 'Message]
    [_ 'Undefined]))
> (tokenize-fixed "#msg")
'Message
It works: http://pasterack.org/pastes/19596
split in both Clojure and Java takes a regular expression as the separator, but I just want to split on a plain character. The character passed in could be "|", ",", " ", etc. How do I split a line by that character?
I need a function like (split string a-char). It will be called at very high frequency, so it needs to perform well. Is there a good solution?
There are a few features in the java.util.regex.Pattern class that support treating strings as literal patterns, which is useful for cases like this one. @cgrand already alluded to (Pattern/quote s) in a comment to another answer. Another such feature is the LITERAL flag (documented here), which can be passed when compiling a pattern so that the whole string is matched literally. Remember that #"foo" in Clojure is essentially syntax sugar for (Pattern/compile "foo"). Putting it all together we have:
(import 'java.util.regex.Pattern)
(clojure.string/split "foo[]bar" (Pattern/compile "[]" Pattern/LITERAL))
;; ["foo" "bar"]
Just make your character a regex by properly escaping special characters and use the default regex split (which is fastest by far).
This version will make a regexp that automatically escapes every character or string within it
(defn char-to-regex
  [c]
  (re-pattern (java.util.regex.Pattern/quote (str c))))
This version will make a regexp that escapes a single character if it's within the special character range of regexps
(defn char-to-regex
  [c]
  (if ((set "<([{\\^-=$!|]})?*+.>") c)
    (re-pattern (str "\\" c))
    ;; re-pattern expects a string, so convert the char first
    (re-pattern (str c))))
Make sure to bind the regex, so you don't call char-to-regex over and over again if you need to do multiple splits
(let [break (char-to-regex \|)]
  (clojure.string/split "This is | the string | to | split" break))
=> ["This is " " the string " " to " " split"]
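For comparison, the same "quote once, reuse many times" idea can be sketched in Python, where re.escape plays the role that Pattern/quote plays on the JVM. This is only an illustrative analogue, not a benchmark or a claim about relative performance; char_to_regex and break_re are hypothetical names:

```python
import re


def char_to_regex(c):
    """Compile a pattern that matches the string c literally."""
    return re.compile(re.escape(c))


# Build the pattern once, then reuse it for every split.
break_re = char_to_regex("|")
print(break_re.split("This is | the string | to | split"))
```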
I want to create a regex in Emacs that captures text between double square brackets.
I found this regex. It finds the string between square brackets, but the match includes the square brackets:
"\\[\\[\\([^][]+\\)\\]\\(\\[\\([^][]+\\)\\]\\)?\\]"
How can I extract the string between double square brackets, excluding the square brackets?
This regexp will select the square brackets but by using group 1 you will be able to get only the content: "\\[\\[\\(.*\\)\\]\\]"
You cannot. Emacs' regexp engine does not support look-ahead/look-behind assertions.
As a workaround, just group the part you're interested in and access the subgroup.
To extract data from a string, use string-match and match-string, like this:
(let ((my-string "[[some text][some more text]]")
      (my-regexp "\\[\\[\\([^][]+\\)\\]\\(\\[\\([^][]+\\)\\]\\)?\\]"))
  (and (string-match my-regexp my-string)
       (list (match-string 1 my-string) (match-string 3 my-string))))
which evaluates to:
("some text" "some more text")
To extract data from a buffer, use search-forward-regexp and drop the string argument to match-string:
(and
 (search-forward-regexp "\\[\\[\\([^][]+\\)\\]\\(\\[\\([^][]+\\)\\]\\)?\\]" nil :no-error)
 (list (match-string 1) (match-string 3)))
Note that this moves point to the match.
I wrote a program that can generate regexes like a(b|)c, which means (abc)|(ac). Is a(b|)c an acceptable regex for any regex engine? Or is there an alternative with the same meaning?
Further question: is there any tool that can convert it to a "normal" representation, e.g. convert a(b|(c|))d to a(b|(c)?)d?
It isn't illegal, but it's an extremely odd formation. ? is more "idiomatic" for the purpose (by which I mean it will be clearer to and more readily understood by "speakers" of regex).
ab?c, or ab{0,1}c would make more sense. An a, followed by at most one b, followed by a c.
Use this regular expression: ab?c
Yes, this is a valid regular expression. Proof in Ruby:
irb(main):003:0> "fooacbar".match( /a(b|)c/ )
#=> #<MatchData "ac" 1:"">
irb(main):004:0> "fooabcbar".match( /a(b|)c/ )
#=> #<MatchData "abc" 1:"b">
Proof in JavaScript:
console.log( "fooabcbar".match(/a(b|)c/) )
//-> ["abc", "b"]
console.log( "fooacbar".match(/a(b|)c/) )
//-> ["ac", ""]
As others have shown, however, it is more idiomatic to write:
/ab?c/ # If you have just one character optional
/a(foo)?c/ # If you have an arbitrary string optional
Also note that many regex engines allow you to specify that the parentheses are non-capturing (which may provide slight performance benefits):
/a(?:foo)?c/ # Optional arbitrary string that you don't need to save
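When rewriting generated patterns such as a(b|)c into the ? form, a quick sanity check is to confirm that the old and new patterns accept the same strings on a set of samples. A minimal sketch in Python (same_matches and the sample list are hypothetical; agreement on samples is evidence, not a proof of equivalence):

```python
import re


def same_matches(p1, p2, samples):
    """True if both patterns fully match exactly the same sample strings."""
    return all(
        bool(re.fullmatch(p1, s)) == bool(re.fullmatch(p2, s))
        for s in samples
    )


samples = ["", "ac", "abc", "abbc", "ad", "abd", "acd", "abcd"]
print(same_matches(r"a(b|)c", r"ab?c", samples))
print(same_matches(r"a(b|(c|))d", r"a(b|(c)?)d", samples))
```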