Tokenizing a string in Clojure - regex

I am trying to tokenize a string using Clojure. The basic tokenization rules require the string to be split into separate symbols as follows:
1. String literals of the form "hello world" are a single token.
2. Every word that is not part of a string literal is a single token.
3. Every non-word character is a separate token.
For example, given the string:
length=Keyboard.readInt("HOW MANY NUMBERS? ");
I would like it to be tokenized as:
["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
I have been able to write a function that splits a string according to rules 2 and 3 above, but I am having trouble fulfilling the first rule.
Currently the above string is split as follows:
["length" "=" "Keyboard" "." "readInt" "(" "\"HOW" "MANY" "NUMBERS?" "\"" ")" ";"]
Here is my function:
(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
          (-> (string/trim LineOfJackFile)
              ; get rid of all comments
              (string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
              ; split on whitespace and around symbols, using zero-width look-behind and look-ahead
              (string/split #"\s+|(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))
How can I write a function that will split a string into tokens following all three of the above rules? Alternatively, what other approach should I take to achieve the desired tokenization? Thank you.

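Removing the initial \s+| from your split pattern makes it work the way you want for this input. That alternative is what splits the string on whitespace characters, including the ones inside the string literal. Note that without it, whitespace outside string literals no longer separates tokens either.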
(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
          (-> (clojure.string/trim LineOfJackFile)
              ; get rid of all comments
              (clojure.string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
              ; split into tokens using 0-width look-ahead
              (clojure.string/split #"(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))
(def input "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
(TokenizeJackLine input)
Produces this output:
("length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";")
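An alternative that keeps whitespace as a token separator outside string literals is to match tokens instead of splitting around them. The following is only a minimal sketch under a few assumptions: comments have already been stripped, string literals contain no escaped quotes, and the names token-pattern and tokenize-line are made up for illustration.

```clojure
;; Alternatives are tried left to right at each position:
;;   "[^"]*"   a double-quoted string literal (no escaped quotes assumed)
;;   \w+       a word
;;   [^\w\s]   any single character that is neither a word character nor whitespace
(def token-pattern #"\"[^\"]*\"|\w+|[^\w\s]")

(defn tokenize-line
  "Returns a vector of the tokens found in line; whitespace between tokens is skipped."
  [line]
  (vec (re-seq token-pattern line)))

(tokenize-line "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
;; => ["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
```

Because whitespace matches none of the alternatives, an input such as "let x = 1;" should come out as ["let" "x" "=" "1" ";"], while spaces inside a string literal stay part of that literal.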

Related

In ClojureScript, how to split string around a regex and keep the matches in the result, without using lookaround?

I want to split a string on an arbitrary regular expression (similar to clojure.string/split) but keep the matches in the result. One way to do this is with lookaround in the regex, but this doesn't work well in ClojureScript because lookaround is not supported by all browsers.
In my case, the regex is #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}"
So for example, "foo {{bar}} baz" should be split into ("foo " "{{bar}}" " baz").
Thanks!
One possible solution is to choose some special character as a delimiter, wrap each match in it during replace, and then split on that delimiter. This assumes the delimiter never occurs in the input itself. Here I used an exclamation mark:
Require: [clojure.string :as s]
(-> "foo {{bar}} baz"
(s/replace #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}" "!$0!")
(s/split #"!"))
=> ["foo " "{{bar}}" " baz"]
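The "$0" reference is Java's replacement syntax; JavaScript's String.prototype.replace uses "$&" for the whole match instead, so the string form may behave differently under ClojureScript. A replacement function sidesteps that difference. The sketch below is one hedged way to write the delimiter approach: the name split-keeping-matches and the choice of U+0001 as delimiter are assumptions made for illustration, and the regex is assumed to have no capture groups (so the function receives the match as a plain string).

```clojure
(require '[clojure.string :as s])

;; Hypothetical delimiter: a control character assumed never to occur in the input.
(def delimiter "\u0001")

(defn split-keeping-matches
  "Splits text around matches of re and keeps the matches in the result.
  Assumes re has no capture groups and text does not contain the delimiter."
  [text re]
  (->> (s/split (s/replace text re #(str delimiter % delimiter))
                (re-pattern delimiter))
       (remove empty?) ; a match at the start, or adjacent matches, leave empty pieces
       vec))

(split-keeping-matches "foo {{bar}} baz" #"\{\{\s*[A-Za-z0-9_\.]+?\s*\}\}")
;; => ["foo " "{{bar}}" " baz"]
```

Dropping empty strings is a design choice here; keep them if the positions of adjacent matches matter to the caller.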

simple character replacement regex, tokenizing jsonpath

I am trying to tokenize a jsonpath string. For example, given a string like the following:
$.sensor.subsensor[0].foo[1][2].temp
I want ["$", "sensor", "subsensor", "0", "foo", "1", "2", "temp"]
I am terrible with regexes, but I managed to come up with the following, which according to RegExr matches ".", "[" and "]". Assume the jsonpath string does not contain slices, wildcards, unions, or recursive descent.
[\.\[\]]
I am planning to match all ".", "[" and "]" characters and replace them with ";". Then I will split on ";".
The problem with the above regex is that in certain cases I get ";;":
$;sensor;subsensor;0;;foo;1;;2;;temp
Is there a way to replace ".", "[" and "]", as well as ".[", "]." and "][", with ";" in a single regex? Do I need to check for these groups explicitly, or do I need to run the string through two regexes?
There is no need to transform ".", "[" and "]" into ";" first; just split directly on runs of those characters:
console.log('$.sensor.subsensor[0].foo[1][2].temp'.split(/[.[\]]+/));
You can use this code to avoid the doubled semicolons:
console.log(
'$.sensor.subsensor[0].foo[1][2].temp'.replace(/[\.\[\]]+/g, ';')
)
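I found a decent solution:
/(\.|\].|\[)/g
A character class written with [] matches only a single character, which is why sequences like "]." become ";;". A group written with () and alternation can match multi-character sequences, and the group above just enumerates the possibilities. Note that the unescaped . in \]. matches any character after the ], and that a ] at the very end of the string is not matched at all.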
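For comparison with the other Clojure snippets on this page, the direct-split idea from the first answer can be sketched in Clojure as well. This is only an illustration under the question's own assumption (no slices, wildcards, unions, or recursive descent); the name tokenize-jsonpath is made up here.

```clojure
(require '[clojure.string :as string])

(defn tokenize-jsonpath
  "Splits a simple jsonpath on runs of '.', '[' and ']'."
  [path]
  ;; The + makes sequences such as "]." or "][" count as one separator.
  (string/split path #"[.\[\]]+"))

(tokenize-jsonpath "$.sensor.subsensor[0].foo[1][2].temp")
;; => ["$" "sensor" "subsensor" "0" "foo" "1" "2" "temp"]
```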

Replace end of string with String/replace and re-pattern - Clojure

I want to remove a substring at the end of a string containing some code.
I have a vector a containing the expression "c=c+1".
My goal is to remove the expression "c=c+1;" from the end of my code string.
I have used the $ symbol to indicate that the substring to replace must be at the end of my code.
Here is the code and the output:
project.core=> (def a [:LangFOR [:before_for "a=0; b="] [:init_var "c=a+1;"] [:loop_condition_expression "c-10;"] [:loop_var_step "c=c+1"] [:statements_OK "a=2*c;"] [:after_for " b+c;"]])
#'project.core/a
project.core=> (prn (str "REGEX debug : " (clojure.string/replace "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;" (re-pattern (str "# "(get-in a [4 1]) ";$")) "")))
"REGEX debug : b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;"
nil
The expected output is:
"REGEX debug : b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;"
How can I correct my (re-pattern) call?
Thanks.
The string you're using to build the regex pattern contains characters that have special meaning in a regular expression. The + in c+1 is interpreted as "one or more occurrences of c", followed by 1. Java's Pattern class provides a static method to escape/quote strings so they can be used literally in regex patterns. You could use it directly, or define a wrapper function:
(defn re-quote [s]
  (java.util.regex.Pattern/quote s))
(re-quote "c=c+1")
=> "\\Qc=c+1\\E"
This function simply wraps the input string in the special markers \Q and \E, which tell the regex engine where to start and stop treating the contents literally.
Now you can use that literal string to build a regex pattern:
(clojure.string/replace
  "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;"
  (re-pattern (str (re-quote "c=c+1;") "$"))
  "")
=> "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;"
I removed the leading "# " from the pattern in your example to make this work, because that doesn't appear in the input.
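To tie this back to the vector from the question, here is a small sketch that builds the quoted pattern from (get-in a [4 1]). The helper name remove-trailing is hypothetical, and the unquoted pattern is included only to show why it finds nothing.

```clojure
(require '[clojure.string :as string])
(import 'java.util.regex.Pattern)

(def a [:LangFOR [:before_for "a=0; b="] [:init_var "c=a+1;"]
        [:loop_condition_expression "c-10;"] [:loop_var_step "c=c+1"]
        [:statements_OK "a=2*c;"] [:after_for " b+c;"]])

(defn remove-trailing
  "Removes expr from the end of code, treating expr as literal text."
  [code expr]
  ;; Pattern/quote makes + and * match literally; $ anchors at the end of the input.
  (string/replace code (re-pattern (str (Pattern/quote expr) "$")) ""))

(def code "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;c=c+1;")

;; Unquoted, "c+1" means "one or more c, then 1", so nothing matches:
(re-find (re-pattern (str (get-in a [4 1]) ";$")) code)
;; => nil

(remove-trailing code (str (get-in a [4 1]) ";"))
;; => "b=0;c=a+1;a=2*c;c=c+1;c=c+1;a=2*c;"
```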

how to split a string in clojure not in regular expression mode

The split functions in both Clojure and Java take a regular expression as the parameter to split on, but I just want to split on a plain character. The character passed in could be "|", ",", " ", etc. How can I split a line by that character?
I need a function like (split string a-char). This function will be called at very high frequency, so it needs good performance. Is there a good solution?
There are a few features in the java.util.regex.Pattern class that support treating strings as literal regular expressions, which is useful for cases such as this one. @cgrand already alluded to (Pattern/quote s) in a comment on another answer. Another such feature is the LITERAL flag (documented in the Pattern Javadoc), which can be used when compiling literal patterns. Remember that #"foo" in Clojure is essentially syntax sugar for (Pattern/compile "foo"). Putting it all together, we have:
(import 'java.util.regex.Pattern)
(clojure.string/split "foo[]bar" (Pattern/compile "[]" Pattern/LITERAL))
;; ["foo" "bar"]
Just make your character a regex by properly escaping special characters and use the default regex split (which is fastest by far).
This version makes a regex that quotes whatever character or string it is given, so every character in it is matched literally:
(defn char-to-regex
  [c]
  (re-pattern (java.util.regex.Pattern/quote (str c))))
This version makes a regex that escapes a single character only if it is one of the regex special characters:
(defn char-to-regex
  [c]
  (if ((set "<([{\\^-=$!|]})?*+.>") c)
    (re-pattern (str "\\" c))
    ;; re-pattern needs a string, so convert the character first
    (re-pattern (str c))))
Make sure to bind the regex once, so that you don't call char-to-regex over and over again when you need to do multiple splits:
(let [break (char-to-regex \|)]
  (clojure.string/split "This is | the string | to | split" break))
=> ["This is " " the string " " to " " split"]
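Since the function is meant to be called very often, one way to combine the LITERAL flag with the "compile once" advice is to return a closure over the compiled pattern. This is a sketch rather than a benchmarked solution; literal-splitter and split-on-pipe are names invented for the example.

```clojure
(require '[clojure.string :as string])
(import 'java.util.regex.Pattern)

(defn literal-splitter
  "Returns a function that splits its argument on delim, taken literally.
  delim may be a character or a string; the pattern is compiled only once."
  [delim]
  (let [pattern (Pattern/compile (str delim) Pattern/LITERAL)]
    (fn [line] (string/split line pattern))))

(def split-on-pipe (literal-splitter \|))

(split-on-pipe "This is | the string | to | split")
;; => ["This is " " the string " " to " " split"]
```

Whether this is fast enough for a given workload is something to measure; the sketch only avoids recompiling the pattern on every call.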

Clojure string replace strange behaviour

I think I found a bug in Clojure. Can anyone explain why the backslash is missing from the output?
(clojure.string/replace "The color? is red." #"[?.]" #(str "zzz\\j" %1 %1))
=> "The colorzzzj?? is redzzzj.."
This isn't a bug. The string returned by the function passed as the third argument is parsed for escape sequences, the same way a replacement string is, so that you can do things like this:
(clojure.string/replace "The color? is red." #"([?.])" "\\$1$1")
; => "The color$1? is red$1."
Notice how the first $ is escaped by the backslash, whereas the second serves as the identifier for a capture group. Change your code to use four backslashes and it works:
(clojure.string/replace "The color? is red." #"[?.]" #(str "zzz\\\\j" %1 %1))
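Newer Clojure releases appear to quote the function's return value (via java.util.regex.Matcher/quoteReplacement), so there the original two-backslash version should already keep the backslash.
Please check the function documentation at: http://clojuredocs.org/clojure_core/clojure.string/replace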
Specifically:
Note: When replace-first or replace have a regex pattern as their
match argument, dollar sign ($) and backslash (\) characters in the
replacement string are treated specially.
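The special treatment of $ and \ in a replacement string can be seen in a short sketch. It assumes a Clojure version that ships clojure.string/re-quote-replacement (added in 1.5, as far as I know); the var names are only for illustration.

```clojure
(require '[clojure.string :as string])

;; "\\\\" is a string of two backslash characters; the matcher reads them as
;; one escaped backslash, so a single backslash ends up in the output.
(def doubled
  (string/replace "The color? is red." #"[?.]" "\\\\"))
;; => "The color\\ is red\\"   (printed form; each \\ is one backslash)

;; re-quote-replacement escapes $ and \ so the replacement is used literally.
(def quoted
  (string/replace "The color? is red." #"[?.]"
                  (string/re-quote-replacement "zzz\\j$")))
;; => "The colorzzz\\j$ is redzzz\\j$"
```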