Match and split by regular expression from start of string

I'm trying to make a terminal parser (for a parser combinator) from scratch. My approach is to use regexp-match-positions* on the input string and if the pattern is found at the first position, then we output the split string.
This is what I've got, so far:
#lang racket/base
(require racket/match)
(define (make-terminal-parser pattern)
  (define (regexp-match-from-start pattern input)
    (match (regexp-match-positions* pattern input)
      [(list (cons 0 x) ...)
       (let ([index (car x)])
         (values (substring input 0 index)
                 (substring input index)))]
      [_ (error "Not found!")]))
  (lambda (input)
    (regexp-match-from-start pattern input)))
(define ALPHA (make-terminal-parser #rx"[a-zA-Z]"))
(ALPHA "hello")
My ALPHA doesn't seem to work and I think it's because of the pattern matching not equating with anything. In the REPL, (regexp-match-positions* #rx"[a-zA-Z]" "hello") outputs what I would expect ('((0 . 1) (1 . 2) etc.)), so I don't really understand why that doesn't match with (list (cons 0 x) ...). If I change the regular expression to #rx"h", then it correctly splits the string; but obviously this is too specific.
(On a related note: I don't understand why I need to (car x) to get the actual index value out of the matched cons.)

It turns out the problem was indeed with my pattern matching. I was attempting to match on (list (cons 0 x) ...), but according to the documentation that pattern only matches a list in which every element (zero or more of them) has the form (0 . x), where x is arbitrary. That's not what I want.
A list is a chain of cons cells, so I changed my pattern to (cons (cons 0 x) _), which only inspects the first element, and that gives me what I want.
That also explains why I had to (car x) in my previous attempt. The x in (list (cons 0 x) ...) is bound to the right-hand element of every cons in the list, so it is itself a list. For example, '((0 . 1) (0 . 2) (0 . 3)) would have matched and x would equal '(1 2 3).
So, my fixed code is:
(define (make-terminal-parser pattern)
  (define (regexp-match-from-start pattern input)
    (match (regexp-match-positions pattern input)
      [(cons (cons 0 index) _)
       (values (substring input 0 index)
               (substring input index))]
      [_ (error "Not found!")]))
  (lambda (input)
    (regexp-match-from-start pattern input)))
N.B. I also don't need the starred version: regexp-match-positions returns only the first match, which is all this pattern inspects.
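For comparison, here is a minimal sketch of the same "match at the start, then split" idea in Python. It is not a translation of the Racket code; it relies on re.match, which only succeeds when the pattern matches at position 0. The names make_terminal_parser and ALPHA are reused from the question purely for illustration.

```python
import re

def make_terminal_parser(pattern):
    """Return a parser that splits its input after a match at position 0."""
    compiled = re.compile(pattern)

    def parse(text):
        # re.match only succeeds if the pattern matches at the start of text.
        m = compiled.match(text)
        if m is None:
            raise ValueError("Not found!")
        return text[:m.end()], text[m.end():]

    return parse

ALPHA = make_terminal_parser(r"[a-zA-Z]")
print(ALPHA("hello"))  # ('h', 'ello')
```

Anchoring the match at the start avoids having to inspect the position of a match found elsewhere in the string.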

Related

How to use regexp in Elisp to match ',' in the line but not inside quotation mark

How could I write a regexp to match , in the line but not inside ""?
For example:
`uvm_info("body", $sformatf("Value: a = %d, b = %d, c = %d", a, b, c), UVM_MEDIUM)
Hope to match those with ^ under it:
`uvm_info("body", $sformatf("Value: a = %d, b = %d, c = %d", a, b, c), UVM_MEDIUM)
                ^                                          ^  ^  ^   ^
The following function doesn't use a regular expression but rather parses a region of the buffer as sexps and returns a list of buffer positions of all commas excluding those within strings, or nil if there are no such commas.
(defun find-commas (start end)
  (save-excursion
    (goto-char start)
    (let (matches)
      (while (< (point) end)
        (cond ((= (char-after) ?,)
               (push (point) matches)
               (forward-char))
              ((looking-at "[]\\[{}()]")
               (forward-char))
              (t
               (forward-sexp))))
      (nreverse matches))))
It works for the example you show, but might need tweaking for other examples or languages. If your example is in a buffer by itself, calling
(find-commas (point-min) (point-max))
returns
(17 60 63 66 70)
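Try this:
"[^"]+"|(,)
The commas you want are the ones in capture group 1. In Emacs regexp syntax, where alternation and groups are backslash-escaped, the same pattern is written as the Lisp string "\"[^\"]+\"\\|\\(,\\)".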
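A quick way to check this pattern is to run it outside Emacs. The sketch below uses Python's re module (an assumption made for illustration, not part of the answer) and converts the 0-based offsets to 1-based buffer positions so they can be compared with the find-commas result above.

```python
import re

line = '`uvm_info("body", $sformatf("Value: a = %d, b = %d, c = %d", a, b, c), UVM_MEDIUM)'

# Alternative 1 consumes a whole quoted string; alternative 2 captures a comma.
pattern = re.compile(r'"[^"]+"|(,)')

# Keep only matches where group 1 participated; +1 converts Python's 0-based
# offsets to Emacs-style 1-based buffer positions.
positions = [m.start(1) + 1 for m in pattern.finditer(line) if m.group(1)]
print(positions)  # [17, 60, 63, 66, 70]
```

The trick is that the quoted-string alternative swallows any commas inside quotes, so they never reach the capturing alternative.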
You can use the fact that font-lock first fontifies comments and strings, then applies your font-lock keywords.
The standard solution is to replace your regexp with a function that searches for the regexp and skips any occurrences in comments and strings.
The following is from my package lisp-extra-font-lock (a package that highlights variables bound by let, quoted expressions, etc.). It searches for quotes and backquotes, but the principle is the same:
(defun lisp-extra-font-lock-is-in-comment-or-string ()
  "Return non-nil if point is in comment or string.
This assumes that Font Lock is active and has fontified comments
and strings."
  (let ((props (text-properties-at (point)))
        (faces '()))
    (while props
      (let ((pr (pop props))
            (value (pop props)))
        (if (eq pr 'face)
            (setq faces value))))
    (unless (listp faces)
      (setq faces (list faces)))
    (or (memq 'font-lock-comment-face faces)
        (memq 'font-lock-string-face faces)
        (memq 'font-lock-doc-face faces))))
(defun lisp-extra-font-lock-match-quote-and-backquote (limit)
  "Search for quote and backquote in code.
Set match data 1 if character matched is backquote."
  (let (res)
    (while (progn (setq res (re-search-forward "\\(?:\\(`\\)\\|'\\)" limit t))
                  (and res
                       (or
                        (lisp-extra-font-lock-is-in-comment-or-string)
                        ;; Don't match ?' and ?`.
                        (eq (char-before (match-beginning 0)) ??)))))
    res))
The font-lock keyword is as follows:
(;; Quote and backquote.
;;
;; Matcher: Set match-data 1 if backquote.
lisp-extra-font-lock-match-quote-and-backquote
(1 lisp-extra-font-lock-backquote-face nil t)
;; ...)

Return value from cond

I'm trying to interpret a list of keywords and integers to obtain an expression. If "input" is, say '(add 5 5), the incoming list would contain 3 pieces -> add 5 5
(define (evaluate input)
  (if (integer? input)
      input
      (cond ((integer? (car input))
             (car input))
            ((equal? (car input) "add")
             (+ (evaluate (cdr input))
                (evaluate (cddr input))))
            ~more~
I'm using the 'if' because cond doesn't like just returning a value. My questions are: Does equal? actually compare strings properly and should the (+ evaluate (...) evaluate(...)) actually return 10 in this case?
In the last line,
(+ (evaluate (cdr input)) (evaluate (cddr input))))
has to be
(+ (evaluate (cadr input)) (evaluate (caddr input))))
This is because cadr and caddr extract the operands themselves (cdr and cddr return sublists), so the integer? test in the if can return the number directly. Of course, you can also move that test into the cond; you don't need a separate conditional for it.
To compare strings, the best choice is the string=? function.
More info:
add is actually a Scheme symbol, not a string, so you can just use eq?:
(define (evaluate input)
  (cond
    [(integer? input)
     input]
    [(integer? (car input))
     (car input)]
    [(eq? 'add (car input))
     (+ (evaluate (cadr input))
        (evaluate (caddr input)))]))
BTW, it looks like what you're really trying to do is "destructure" the input when it matches a pattern: that is, extract stuff that's stored inside the input. There's a nice little macro by Oleg Kiselyov called pmatch that does this for you. Download it from http://www.cs.indiana.edu/cgi-pub/c311/lib/exe/fetch.php?media=pmatch.scm . Then you can write the following, which handles all that cdr/cadr/caddr/etc. stuff automatically, and supports any number of arguments to add, and doesn't need the case where an integer is enclosed alone in parentheses:
(define (evaluate input)
  (pmatch input
    [,n (guard (integer? n))
     n]
    [(add . ,operands)
     (apply + (map evaluate operands))]))
pmatch expects a series of clauses, like cond, except the first expression in the clause is a pattern, which can contain variables. The variables are indicated by preceding them with a comma (just as in a backquote expression). Any symbols in a pattern that aren't preceded by a comma must match literally. When pmatch finds a matching pattern, it binds the variables to the parts of the input that occupy the corresponding positions in the pattern. n and operands are variables in the patterns above.
You can also stick a guard clause after a pattern if you want to require a condition beyond just matching the pattern, like checking if the extracted variable is an integer.
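The same shape (a guard for bare integers, then destructuring into a literal head and a list of operands) can be sketched in Python. This is only an illustration: it represents the symbol add as the string "add" and the input list as a tuple, both of which are assumptions rather than anything in the original code.

```python
def evaluate(expr):
    """Evaluate an int or a nested tuple such as ("add", 5, 5)."""
    # Guard-style clause: a bare integer evaluates to itself.
    if isinstance(expr, int):
        return expr
    # Destructure into a literal head and the remaining operands.
    head, *operands = expr
    if head == "add":
        return sum(evaluate(operand) for operand in operands)
    raise ValueError(f"unknown operator: {head!r}")

print(evaluate(("add", 5, 5)))              # 10
print(evaluate(("add", 1, ("add", 2, 3))))  # 6
```

As with the pmatch version, any number of operands is accepted and nested expressions are evaluated recursively.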

Lisp S-Expressions and Lists Length/Size

I am trying to write a function using Common Lisp functions only that will count how many s-expressions are in an s-expression. For example:
((x = y)(z = 1)) ;; returns 2
and
((x - y)) ;; returns 1
Nested expressions are possible so:
((if x then (x = y)(z = w))) ;; returns 3
I wrote a function which finds the length and it works if no nested expressions are there. It is:
(define (length exp)
  (cond
    ((null? exp) 0)
    (#t (+ 1 (length (cdr exp))))))
Now I modified this in an attempt to support nested expressions to the following:
(define (length exp)
  (cond
    ((null? exp) 0)
    ((list? (car exp)) (+ 1 (length (cdr exp))))
    (#t (length (cdr exp)))))
This works for expressions with no nests, but is always 1 less than the answer for nested expressions. This is because taking the example above, ((if x then (x = y)(z = w))), this will look at if at first and which satisfies the third condition, returning the cdr (the rest of the expression as a list) into length. The same happens up until (x=y) is reached, at which point a +1 is returned. This means that the expression (if x then .... ) has not been counted.
In what ways would I be able to account for it? Adding +2 will over-count un-nested expressions.
I will need this to work in one function as nesting can happen anywhere, so:
((x = y) (if y then (z = w)))
At first sight, your code only recurses to the right (cdr-side) and not to the left (car-side), so that definitely is a problem there.
On second sight, this is a little trickier than that, because you are not exactly counting conses; you need to differentiate the case where a cons starts a proper list from the case where it is the cdr of a list. If you were to recurse into both the car and the cdr, that information would be lost. We need to iterate over the sexp as a proper list:
(defun count-proper-list (sexp)
  (cond ((atom sexp) 0)
        (t (1+ (reduce #'+ (mapcar #'count-proper-list sexp))))))
But this also counts the top-level list, and therefore always returns one more than what you seem to want. So perhaps:
(defun count-proper-sublist (sexp)
  (1- (count-proper-list sexp)))
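To check the counting rule against the examples in the question, here is a small Python sketch of the same recursion. It models s-expressions as nested Python lists with strings for symbols (an assumption for illustration), and treats the empty list as an atom, as Common Lisp does. The function names are hypothetical.

```python
def count_lists(sexp):
    """Count sexp itself (if it is a non-empty list) plus every nested list."""
    # Atoms, and the empty list (an atom in Common Lisp), contribute nothing.
    if not isinstance(sexp, list) or not sexp:
        return 0
    return 1 + sum(count_lists(item) for item in sexp)

def count_sublists(sexp):
    """Count nested lists, excluding the top-level list."""
    return count_lists(sexp) - 1

print(count_sublists([["x", "=", "y"], ["z", "=", 1]]))                          # 2
print(count_sublists([["if", "x", "then", ["x", "=", "y"], ["z", "=", "w"]]]))  # 3
```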

How can I capture the results of splitting a string in elisp?

I am working in elisp and I have a string that represents a list of items. The string looks like
"apple orange 'tasty things' 'my lunch' zucchini 'my dinner'"
and I'm trying to split it into
("apple" "orange" "tasty things" "my lunch" "zucchini" "my dinner")
This is a familiar problem. My obstacles to solving it are less about the regex, and more about the specifics of elisp.
What I want to do is run a loop like :
(while (> (length my-string) 0) do-work)
where that do-work is:
applying the regex \('[^']*?'\|[[:alnum:]]+)\([[:space:]]*\(.+\) to my-string
appending \1 to my results list
re-binding my-string to \2
However, I can't figure out how to get split-string or replace-regexp-in-string to do that.
How can I split this string into values I can use?
(alternatively: "which built-in emacs function that does this have I not yet found?")
Something similar, but w/o regexp:
(require 'cl-lib) ;; for cl-incf and cl-coerce
(defun parse-quotes (string)
  (let ((i 0) result current quotep escapedp word)
    (while (< i (length string))
      (setq current (aref string i))
      (cond
       ;; an unquoted space ends the current word
       ((and (char-equal current ?\s)
             (not quotep))
        (when word (push word result))
        (setq word nil escapedp nil))
       ;; opening quote
       ((and (char-equal current ?\')
             (not escapedp)
             (not quotep))
        (setq quotep t escapedp nil))
       ;; closing quote
       ((and (char-equal current ?\')
             (not escapedp))
        (push word result)
        (setq quotep nil word nil escapedp nil))
       ((char-equal current ?\\)
        (when escapedp (push current word))
        (setq escapedp (not escapedp)))
       (t (setq escapedp nil)
          (push current word)))
      (cl-incf i))
    (when quotep
      (error "Unbalanced quotes at %d"
             (- (length string) (length word))))
    ;; flush a trailing unquoted word
    (when word (push word result))
    (mapcar (lambda (x) (cl-coerce (reverse x) 'string))
            (reverse result))))
(parse-quotes "apple orange 'tasty things' 'my lunch' zucchini 'my dinner'")
("apple" "orange" "tasty things" "my lunch" "zucchini" "my dinner")
(parse-quotes "apple orange 'tasty thing\\'s' 'my lunch' zucchini 'my dinner'")
("apple" "orange" "tasty thing's" "my lunch" "zucchini" "my dinner")
(parse-quotes "apple orange 'tasty things' 'my lunch zucchini 'my dinner'")
;; Debugger entered--Lisp error: (error "Unbalanced quotes at 52")
Bonus: it also allows escaping the quotes with "\" and will report it if the quotes aren't balanced (reached the end of the string, but didn't find the match for the opened quote).
Here is a straightforward way to implement your algorithm using a temporary buffer. I don't know if there would be a way to do this using replace-regexp-in-string or split-string.
(defun my-split (string)
  (with-temp-buffer
    (insert string " ")     ;; insert the string in a temporary buffer
    (goto-char (point-min)) ;; go back to the beginning of the buffer
    (let ((result nil))
      ;; search for the regexp (and just return nil if nothing is found)
      (while (re-search-forward "\\('[^']*?'\\|[[:alnum:]]+\\)\\([[:space:]]*\\(.+\\)\\)" nil t)
        ;; (match-string 1) is "\1"
        ;; append it after the current list
        (setq result (append result (list (match-string 1))))
        ;; go back to the beginning of the second part
        (goto-char (match-beginning 2)))
      result)))
Example:
(my-split "apple orange 'tasty things' 'my lunch' zucchini 'my dinner'")
==> ("apple" "orange" "'tasty things'" "'my lunch'" "zucchini" "'my dinner'")
You might like to take a look at split-string-and-unquote.
If you manipulate strings often, you should install the s.el library via the package manager; it provides a large set of string utility functions under a consistent API. For this task you need the function s-match, whose optional third argument is the starting position. Then you need a correct regexp; try:
(concat "\\b[a-z]+\\b" "\\|" "'[a-z ]+'")
Here \| separates two alternatives: a sequence of letters that forms a word (\b is a word boundary), or a sequence of letters and spaces inside quotes. Then use a loop:
;; let s = given string, r = regex
(loop for start = 0 then (+ start (length match))
      for match = (car (s-match r s start))
      while match
      collect match)
For educational purposes, I also implemented the same functionality with a recursive function:
;; labels is Common Lisp's local function definition macro
(labels
    ((i
      (start result)
      ;; s-match searches from start
      (let ((match (car (s-match r s start))))
        (if match
            ;; recursive call
            (i (+ start (length match))
               (cons match result))
          ;; push/nreverse idiom
          (nreverse result)))))
  ;; recursive helper function
  (i 0 '()))
As Emacs lacks tail call optimization, running the recursive version over a big list can overflow the stack. You can therefore rewrite it with the do* macro:
(do* ((start 0)
      (match (car (s-match r s start)) (car (s-match r s start)))
      (result '()))
    ((not match) (reverse result))
  (push match result)
  (incf start (length match)))
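For readers who want to try the "search from a moving start position" idea outside Emacs, here is a hedged Python sketch using the same two regexp alternatives. It differs from the loops above in one respect, which is my own choice: it resumes from the end position reported by the match object instead of adding the match length to the previous start, so the separators between tokens are skipped as well. The function name split_quoted is hypothetical.

```python
import re

# Same two alternatives as the regexp above: a bare word, or a quoted run.
TOKEN = re.compile(r"\b[a-z]+\b|'[a-z ]+'")

def split_quoted(text):
    """Collect tokens left to right, stripping the surrounding quotes."""
    tokens = []
    start = 0
    while True:
        m = TOKEN.search(text, start)
        if m is None:
            return tokens
        tokens.append(m.group(0).strip("'"))
        # Resume at the end of this match, so separators are skipped too.
        start = m.end()

print(split_quoted("apple orange 'tasty things' 'my lunch' zucchini 'my dinner'"))
# ['apple', 'orange', 'tasty things', 'my lunch', 'zucchini', 'my dinner']
```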

Problem with list in Lisp

I am trying to write a simple procedure in Lisp to insert an element into binary search tree.
I represented the tree as a list:
the first element in the tree is the root
the second element is the left sub-tree
the third element is the right sub-tree
This is my code:
(define Insert
  (lambda (x T)
    (if (null? T)
        (list x '() '())
        (if (> x (car T))
            (list (car T)
                  (cadr T)
                  (Insert x (list (caddr T))))
            (list (car T)
                  (Insert x (cadr T))
                  (list (caddr T)))))))
When I call the procedure like this: (Insert 2 '(4 '(2 '() '() ) '())), I get a problem with ">" because the second argument isn't a real number, but I don't know why.
The exception:
>: expects type <real number> as 2nd argument, given: quote; other arguments were: 2
However, when I call the procedure like this: (Insert 2 ( list 4 (list 2 '() '() ) '())),
it works successfully.
Why?
I thought that '(1 '() '()) and (list 1 '() '()) were equal; aren't they?
No, quote and list are not the same at all. The meaning of 'foo is (quote foo).
'(a b c) is exactly (quote (a b c)), that is, a list literal (the quote operator prevents the list from being evaluated). It is comparable to "a b c", which is a string literal or 5, which is a number literal. Operations that modify literals have undefined effects, as you may recognize immediately when you see some nonsense like (setf 3 4).
(list a b c), on the other hand, creates ("conses") a new list from the values of a, b, and c.
I am sure that if you clear up your confusion about quote and list, you will be able to correct your code.
'(1 '()) is equivalent to (list 1 (list 'quote nil)). I suspect that if you drop the "internal" quote characters, you will be fine.
So, the quoted expression that generates an expression that is equal to (list 1 '() '()) is '(1 () ()).
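The difference can be made concrete by modelling the two data structures. The sketch below is in Python and is only an analogy: it represents the symbol quote as the string "quote", and the helper quote is a hypothetical stand-in for what the reader produces for 'datum.

```python
def quote(datum):
    """Model what the reader produces for 'datum: the list (quote datum)."""
    return ["quote", datum]

# '(4 '(2 '() '()) '()): the outer quote returns its datum unevaluated,
# so the inner quotes stay in the data as (quote ...) lists.
quoted_tree = [4, quote([2, quote([]), quote([])]), quote([])]

# (list 4 (list 2 '() '()) '()): every argument is evaluated, so each
# inner '() becomes a plain empty list.
built_tree = [4, [2, [], []], []]

left_subtree = quoted_tree[1]   # what (cadr T) returns
print(left_subtree[0])          # quote  <- the value that > complained about
print(built_tree[1][0])         # 2
```

In the quoted version, the root of the left subtree is the symbol quote rather than the number 2, which is exactly what the error message reports.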