Regex to extract S expression? - regex

I'm wondering if it's possible to do a pass on parsing of a define expression in lisp with a single regular expression, for example with the following input:
#lang sicp
(define (square x) (* x x))
(define (average x y) (/ (+ x y) 2))
; using block scope
(define (sqrt x)
  ; x is always the same -- "4" or whatever we pass to it, so we don't need that
  ; in every single function that we define, we can just inherit it from above.
  (define (improve guess) (average guess (/ x guess)))
  (define (good-enough? guess) (< (abs (- (square guess) x)) 0.001 ))
  (define (sqrt-iter guess) (if (good-enough? guess) guess (sqrt-iter (improve guess))))
  (sqrt-iter 1.0)
)
(sqrt 4)
I want to match only the three top-level procedures above that start with define (and none of the procedures nested inside sqrt). If I were to do it iteratively, the process I have in mind would be:
Remove comments.
Grab the start of the define with \(\s*define
Consume balanced parentheses up until the unbalanced ) that finishes our procedure. For a regex, something like: (?:\([^)]*\))*, though I'm sure it gets much more complex with the greediness of the *'s.
And this wouldn't even be taking into account I could have a string "( define )" that we'd also want to ignore.
Would it be possible to build a regex for this, or too complicated? Here is my starting point, which is a long way from complete: https://regex101.com/r/MlPmOd/1.

As a preface, there is a famous quote due to Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
The one word answer to your question is 'no'. Regular languages – the languages that regular expressions can recognise – are a proper subset of context-free languages, and the written form of s-expressions is context-free but not regular. So no regular expression can recognise the written form of an s-expression.
To see this consider a very tiny subset of s-expressions:
n = () | ( n)
So n consists of the set {(), (()), ((())), ...}, where the number of left parens and right parens in each string are equal. Such a language can't be recognised by a regular expression because you need to count parens.
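To make the counting argument concrete, here is a minimal Python sketch of a recogniser for this tiny language (the function name is made up for illustration). It relies on an integer counter that can grow without bound, which is exactly what a finite automaton, and hence a true regular expression, does not have.

```python
def is_nested_parens(s):
    """Recognise (), (()), ((())), ...: k opening parens, then k closing ones."""
    depth = 0            # unbounded counter: the part a finite automaton lacks
    seen_close = False
    for ch in s:
        if ch == "(":
            if seen_close:       # an opener after a closer, e.g. "()()"
                return False
            depth += 1
        elif ch == ")":
            seen_close = True
            depth -= 1
            if depth < 0:        # more closers than openers so far
                return False
        else:
            return False         # only parens belong to this language
    return depth == 0 and seen_close
```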
Notes
Some instances of what are called 'regular expressions' in various programming languages are in fact more powerful than regular expressions and can therefore recognise classes of languages larger than regular languages. jwz's quote still applies: just because, perhaps, you can does not mean you should.
All programmers should in my opinion learn enough formal language theory to be dangerous. I don't know what a good modern reference is, but I learnt it from the Cinderella book: Hopcroft & Ullman, Introduction to Automata Theory, Languages, and Computation.
All Lisp programmers should in my opinion write a toy reader for s-expressions, as this is a good way of learning about how the real reader works, and doesn't take long.
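As a rough illustration of how little a toy scanner needs, here is a Python sketch that pulls out the top-level forms of a source string and keeps those whose first token is define. The function names are hypothetical, and the sketch assumes only ; line comments and double-quoted strings with backslash escapes; it does not handle character literals such as #\( or block comments, so treat it as a starting point rather than a real reader.

```python
def top_level_forms(source):
    """Return the text of each top-level parenthesised form in `source`."""
    forms = []
    depth = 0
    start = 0
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == ";":
            # Line comment: skip to the end of the line.
            while i < len(source) and source[i] != "\n":
                i += 1
            continue
        if ch == '"':
            # String literal: skip to the closing quote, honouring escapes.
            i += 1
            while i < len(source) and source[i] != '"':
                i += 2 if source[i] == "\\" else 1
            i += 1
            continue
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                forms.append(source[start:i + 1])
        i += 1
    return forms


def top_level_defines(source):
    """Keep only the top-level forms whose first token is `define`."""
    return [f for f in top_level_forms(source)
            if f[1:].split(None, 1)[0] == "define"]
```

Because parentheses inside comments and strings are skipped before counting, a stray "( define )" in either place is ignored, and nested defines never start a new form since the depth is above zero when they are seen.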

Related

Why are <e> inside if and cond designed to be handled differently in Scheme?

(if <predicate> <consequent> <alternative>)
(cond (<p1> <e1>)
      (<p2> <e2>)
      ..........
      (<pn> <en>))
A minor difference between if and cond is that the cond's expression part of each
cond clause may be a sequence of expressions. -- SICP
I wonder why the designers of the scheme language made the expression for if and cond different.
What's the purpose of that design?
In a language like Scheme which is not purely functional, it is often useful to allow a sequence of expressions where there is room for it in the syntax: for instance in the body of a procedure and so on. So for instance a purely functional Lisp might have a syntax for functions which was
(λ (<arg> ...)
<expression>)
But Scheme allows
(λ (<arg> ...)
<expression-1>
<expression-2>
...)
Where the values of all but the last expression are ignored: they just happen for side-effect. And since there is room for this in the syntax, Scheme allows it.
However there is simply no room in the syntax of if for this to be the case (see below).
It would be possible to design multiway conditional expression where there was also no room in the syntax for it, which might look like:
(kond
a 1
b 2
c 3
else 4)
for instance (here else is magic to kond: I can't remember if SICP's Scheme has that).
But if you consider what cond's syntax actually is:
(cond
(a 1)
(b 2)
(c 3)
(else 4))
Then there is obviously now room in the syntax to write a sequence of expressions in the result position of each clause. And Scheme does therefore allow that because there simply is no reason not to do so. So instead of
(cond
(<t> <e>)
...)
You can write
(cond
(<t> <e1> <e2> ...)
...)
For instance:
(cond
  (world-has-ended
   (displayln "The world has ended: rain of fire imminent")
   (rain-fire-from-sky 'yes-really))
  ...)
In fact, Scheme has an operator, begin whose whole purpose is to allow a sequence of expressions where only one is allowed. So for instance if you want to have a sequence of expressions where there would naturally only be one, you can use begin:
(if world-has-ended
    (begin
      (displayln "The world has ended: rain of fire imminent")
      (rain-fire-from-sky 'yes-really))
    (begin
      (displayln "World has not yet ended, sorry for the frogs")
      (rain-frogs-from-sky)))
You can then think of cond as defined in terms of if and begin:
(cond
  (world-has-ended
   (displayln "The world has ended: rain of fire imminent")
   (rain-fire-from-sky 'also-rocks))
  (world-has-nearly-ended
   (displayln "The world has nearly ended")
   (rain-frogs-from-sky 'also-some-fire)))
is the same as
(if world-has-ended
    (begin
      (displayln "The world has ended: rain of fire imminent")
      (rain-fire-from-sky 'also-rocks))
    (if world-has-nearly-ended
        (begin
          (displayln "The world has nearly ended")
          (rain-frogs-from-sky 'also-some-fire))
        #f))
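The rewriting of cond into if and begin can be sketched mechanically. The following Python function is only an illustration, not how any Scheme implements cond: it assumes forms are modelled as nested Python lists with symbols as strings, the name expand_cond is made up, and it uses #f for the case where no clause matches, mirroring the expansion shown here.

```python
def expand_cond(clauses):
    """Rewrite cond clauses into nested if/begin forms.

    Each clause is [test, expr1, expr2, ...]. Forms are nested Python lists
    and symbols are strings (a representation chosen for this sketch).
    """
    if not clauses:
        return "#f"                       # no clause matched
    test, *body = clauses[0]
    if test == "else":
        return ["begin", *body]           # else needs no further test
    return ["if", test, ["begin", *body], expand_cond(clauses[1:])]
```

Each clause contributes one if whose consequent is a begin wrapping all of the clause's expressions, which is why a cond clause can hold a sequence while a bare if cannot.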
As to why the syntax of cond was designed so there is room for multiple expressions: that's a question for history. Two things I think help to explain it:
A syntax like (cond <t1> <e1> <t2> <e2> ...) is a pain to read since you need to count the number of forms in the body, while (cond (<t1> ...) (<t2> ...) ...) is much easier.
In the early days of Lisp people who had exposure only to FORTRAN and assembler (because that's pretty much all there was) tended to write programs which were, well, like programs in FORTRAN, and had a lot of imperative operations, side-effects, sequencing and so on. And the syntax of cond allows for that.
If you allow any number of expressions in if, how do you tell when the true ones end and the false ones begin?¹ With cond, each case is a list where the first element is the test expression and the rest of the list gets evaluated when true. There's no ambiguity.
1: With begin:
(if <test>
    (begin <true exprs> ...)
    (begin <false exprs> ...))
cond will typically be a macro that expands to
(if <p1>
    (begin <e1>)
    (if <p2>
        (begin <e2>)
        ...))
cond has an implicit begin in Scheme because of an accident of design in its parent language from the 1960s, LISP!
It was not the designers of Scheme who designed cond or if. cond is the original conditional of McCarthy's paper, and his eval does not support more than one consequent expression: if you were to write more than one consequent, it would just evaluate the first. if did not exist in that paper, and it didn't exist in Fortran at the time either, which had the arithmetic if instead.
Wrapping each clause in parentheses is what made it possible for newer versions of Lisp to allow more than one consequent, with the last one being the tail expression. Imagine he hadn't done that but had suggested this instead:
; a and c are predicates
(cond a b
      c d
      t e)
The problem with this is that only my formatting helps determine what is what. Written on one line, even this short, simple code would be almost impossible to read:
(cond a b c d t e)
Adding parens was a way to group the things that belong together. When if came later, it did not support more than two branches in the same form. When Lisp became more imperative and progn was introduced, cond got an implicit progn by accident of design, while if needed an explicit progn form, so it kept only three operands.
You might be interested in The roots of lisp
Btw. My flat cond example is very similar to the implementation of if in Paul Graham's lisp language Arc:
(if p1 c1
    p2 c2
    c3)
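With a flat form like this, whoever reads it (a person or an evaluator) has to pair the operands up by position. A small Python sketch of that pairing, with a made-up function name and symbols modelled as strings, looks like this:

```python
def pair_flat_clauses(forms):
    """Group a flat operand list into (test, consequent) pairs.

    An odd trailing form is treated as the default branch, as in the
    Arc-style `if` shown above; otherwise the default is None.
    """
    pairs = [(forms[i], forms[i + 1]) for i in range(0, len(forms) - 1, 2)]
    default = forms[-1] if len(forms) % 2 == 1 else None
    return pairs, default
```

The parenthesised clauses of cond make this grouping explicit in the source, so no counting is needed and each group is free to hold more than two forms.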

SICP lisp 'if' conditional expression explanation

I'm working through SICP and have gotten to the part about the square root code. I understood that 'if' statements could only be followed by single expressions. However, in the code,
(define (sqrt-iter guess x)
  (if (good-enough? guess x)
      guess
      (sqrt-iter (improve guess x)
                 x)))
I don't understand how the 3rd, 4th, and 5th lines are valid when the 'guess' and 'x' have already been stated as the consequent expressions for 'if'.
In some Scheme interpreters an if special form can be followed by one or two expressions after the condition; in others (for example, Racket) the condition must be followed by exactly two expressions. But in your code there are two expressions after the condition! It's more of an indentation problem; see:
(define (sqrt-iter guess x)
  (if (good-enough? guess x)        ; condition
      guess                         ; first expression (consequent)
      (sqrt-iter (improve guess x)  ; second expression (alternative)
                 x)))
To clarify: guess and x are not the consequent and alternative of the condition, they are the arguments for the good-enough? procedure in the expression (good-enough? guess x), which is just the condition part. Remember that the general structure of an if expression looks like this:
(if <condition>
    <consequent>
    <alternative>)
Where each part is an expression. For further details please refer to the documentation.
guess and x are arguments to the good-enough? predicate, "if" is selecting between the following guess and (sqrt-iter ...) expressions.
No. In Scheme, an 'if' can be followed by two or three expressions, not only one.
(if test-exp then-exp else-exp)
In some Scheme implementations, 'if' MUST even be followed by three expressions: the 'else-exp' cannot be omitted.
For more details, read:
http://classes.soe.ucsc.edu/cmps112/Spring03/languages/scheme/SchemeTutorialA.html#condexp

List to String in Racket

I've got a list defined like this:
(define testlist '((Dog <=> Cat)
                   (Anne <=> Dodd)))
Is there any way to turn: (car testlist) into a string so I can use regexp on it to search for "<=>"?
Let me start with this extremely relevant Jamie Zawinski quote:
Some people, when confronted with a problem, think, “I know, I'll use regular expressions.” Now they have two problems.
You really really don't want to use regular expressions here. For one thing, a regexp-based solution will break when you have identifiers with <=> in the middle of them.
For another, it's really easy to solve this problem without using regular expressions.
There are a whole bunch of "right answers" here, depending on what exactly you're trying to do, but let me start by pointing out that you can use the "member" function to see whether a list contains the symbol '<=> :
#lang racket
(define testlist '((Dog <=> Cat)
                   (Anne <=> Dodd)))
(cond [(member '<=> (car testlist)) "yep"]
      [else "nope"])
I suspect that you're trying to parse these as logical equivalences, in which case you'll need to define the possible structures of the statements, and go from there, but let's just start by NOT USING REGULAR EXPRESSIONS :).
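For readers who think more easily in Python, here is a small sketch of the same contrast, with symbols modelled as plain strings and both function names invented for the example. The membership test compares whole elements, while the regexp version flattens the list to text and gives a false positive as soon as some identifier contains <=>.

```python
import re

def has_equiv(statement):
    """True only if the symbol <=> is itself an element of the list."""
    return "<=>" in statement

def has_equiv_regex(statement):
    """Regexp over the printed form: also matches <=> inside identifiers."""
    return re.search(r"<=>", " ".join(statement)) is not None

plain = ["Dog", "<=>", "Cat"]
tricky = ["a<=>b", "and", "Cat"]   # one identifier merely contains <=>
```

Here has_equiv(tricky) is False but has_equiv_regex(tricky) is True, which is the breakage described above.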

Why does Clojure allow (eval 3) although 3 is not quoted?

I'm learning Clojure and trying to understand reader, quoting, eval and homoiconicity by drawing parallels to Python's similar features.
In Python, one way to avoid (or postpone) evaluation is to wrap the expression between quotes, eg. '3 + 4'. You can evaluate this later using eval, eg. eval('3 + 4') yielding 7. (If you need to quote only Python values, you can use repr function instead of adding quotes manually.)
In Lisp you use quote or ' for quoting and eval for evaluating, eg. (eval '(+ 3 4)) yielding 7.
So in Python the "quoted" stuff is represented by a string, whereas in Lisp it's represented by a list which has quote as its first item.
My question, finally: why does Clojure allow (eval 3) although 3 is not quoted? Is it just the matter of Lisp style (trying to give an answer instead of error wherever possible) or are there some other reasons behind it? Is this behavior essential to Lisp or not?
The short answer would be that numbers (and keywords, and strings, for example) evaluate to themselves. Quoting instructs Lisp (the reader) to pass along whatever follows the quote unevaluated. eval then gets that list as you wrote it, but without the quote, and then evaluates it (in the case of (eval '(+ 3 4)), eval will evaluate a function call (+) over two arguments).
What happens with that last expression is the following:
When you hit enter, the expression is evaluated. It contains a normal function call (eval) and some arguments.
The arguments are evaluated. The first argument contains a quote, which tells the reader to produce what is after the quote (the actual (+ 3 4) list).
There are no more arguments, and the actual function call is evaluated. This means calling the eval function with the list (+ 3 4) as argument.
The eval function does the same steps again, finding the normal function + and the arguments, and applies it, obtaining the result.
Other answers have explained the mechanics, but I think the philosophical point is in the different ways lisp and python look at "code". In python, the only way to represent code is as a string, so of course attempting to evaluate a non-string will fail. Lisp has richer data structures for code: lists, numbers, symbols, and so forth. So the expression (+ 1 2) is a list, containing a symbol and two numbers. When evaluating a list, you must first evaluate each of its elements.
So, it's perfectly natural to need to evaluate a number in the ordinary course of running lisp code. To that end, numbers are defined to "evaluate to themselves", meaning they are the same after evaluation as they were before: just a number. The eval function applies the same rules to the bare "code snippet" 3 that the compiler would apply when compiling, say, the third element of a larger expression like (+ 5 3). For numbers, that means leaving it alone.
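A toy evaluator makes the rule easy to see. This Python sketch is an assumption-laden model, not Clojure's eval: forms are nested Python lists, symbols are strings looked up in a small hypothetical ENV table, and only quote is treated specially.

```python
import operator

# Hypothetical global environment for the sketch: symbol name -> function.
ENV = {"+": operator.add, "*": operator.mul}

def evaluate(form):
    """Evaluate a form modelled as numbers, strings (symbols) and lists."""
    if isinstance(form, (int, float)):
        return form                        # numbers evaluate to themselves
    if isinstance(form, str):
        return ENV[form]                   # symbols are looked up
    if form and form[0] == "quote":
        return form[1]                     # quote returns its operand unevaluated
    fn, *args = [evaluate(x) for x in form]  # evaluate every element first
    return fn(*args)                       # then apply the function
```

With this model, evaluate(["+", 5, 3]) has to call evaluate(3) on the way, so refusing bare numbers at the top level would mean special-casing them everywhere else. Likewise, evaluate(["quote", ["+", 3, 4]]) yields the list itself, and evaluating that list again yields 7.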
What should 3 evaluate to? It makes the most sense that Lisp evaluates a number to itself. Would we want to require numbers to be quoted in code? That would not be very convenient and extremely problematic:
Instead of
(defun add-fourtytwo (n)
(+ n 42))
we would have to write
(defun add-fourtytwo (n)
(+ n '42))
Every number in code would need to be quoted. A missing quote would trigger an error. That's not something one would want to use.
As a side note, imagine what happens when you want to use eval in your code.
(defun example ()
(eval 3))
Above would be wrong. Numbers would need to be quoted.
(defun example ()
(eval '3))
The above would be accepted, but it would generate an error at runtime. Lisp evaluates '3 to the number 3, but then calling eval on that number would be an error, since numbers would need to be quoted.
So we would need to write:
(defun example ()
(eval ''3))
That's not very useful...
Numbers have always been self-evaluating throughout Lisp history. But in earlier Lisp implementations some other data objects, like arrays, were not self-evaluating. Again, since this is a huge source of errors, Lisp dialects like Common Lisp have defined that all data types (other than lists and symbols) are self-evaluating.
To answer this question we need to look at the definition of eval in Lisp. For example, the CLHS gives this definition:
Syntax: eval form => result*
Arguments and Values:
form - a form.
results - the values yielded by the evaluation of form.
Where form is
any object meant to be evaluated.
a symbol, a compound form, or a self-evaluating object.
(for an operator, as in "<<operator>> form") a compound form having that operator as its first element. "A quote form is a constant form."
In your case the number 3 is a self-evaluating object: a form that is neither a symbol nor a cons is defined to be a self-evaluating object. I believe that for Clojure we can just replace cons by list in this definition.
In Clojure, only lists are interpreted by eval as calls. Symbols are resolved to their values, and objects such as numbers, strings and keywords evaluate to themselves.
'(+ 3 4) is equal to (list '+ 3 4). ' (transformed by the reader into the quote special form) just prevents evaluation of the given form. So in the expression (eval '(+ 3 4)), eval takes the list data structure (+ 3 4) as its argument.

Pattern matching functions in Clojure?

I have used erlang in the past and it has some really useful things like pattern matching functions or "function guards". Example from erlang docs is:
fact(N) when N>0 ->
N * fact(N-1);
fact(0) ->
1.
But this could be expanded to a much more complex example where the form of parameter and values inside it are matched.
Is there anything similar in clojure?
There is ongoing work towards doing this with unification in the core.match ( https://github.com/clojure/core.match ) library.
Depending on exactly what you want to do, another common way is to use defmulti/defmethod to dispatch on arbitrary functions. See http://clojuredocs.org/clojure_core/clojure.core/defmulti (at the bottom of that page is the factorial example)
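To show what "dispatch on arbitrary functions" means, here is a rough Python sketch of the idea behind defmulti/defmethod. It is not Clojure's implementation, and make_multi and register are names invented for this example: a dispatch function computes a key from the arguments, and the method registered under that key is called.

```python
def make_multi(dispatch):
    """Build a function whose behaviour is chosen by `dispatch(*args)`."""
    methods = {}

    def call(*args):
        # Look up the method registered for the computed dispatch value.
        return methods[dispatch(*args)](*args)

    def register(key):
        def decorator(fn):
            methods[key] = fn
            return fn
        return decorator

    call.register = register
    return call


# The dispatch function plays the role of the Erlang guards.
fact = make_multi(lambda n: "zero" if n == 0 else "positive" if n > 0 else "negative")

@fact.register("zero")
def _fact_zero(n):
    return 1

@fact.register("positive")
def _fact_positive(n):
    return n * fact(n - 1)
```

No method is registered for "negative" here, so a negative argument raises KeyError, loosely analogous to a function clause error in Erlang.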
I want to introduce defun, a macro for defining functions with pattern matching just like Erlang. It is based on core.match. The fact function above can be written as:
(use 'defun)
(defun fact
  ([0] 1)
  ([(n :guard #(> % 0))]
   (* n (fact (dec n)))))
Another example, an accumulator from zero to positive number n:
(defun accum
  ([0 ret] ret)
  ([n ret] (recur (dec n) (+ n ret)))
  ([n] (recur n 0)))
For more information, please see https://github.com/killme2008/defun
core.match is a full-featured and extensible pattern matching library for Clojure. With a little macro magic, you can probably get a pretty close approximation to what you're looking for.
Also, if you only want to take apart simple structures like vectors and maps (anything that is a sequence or a map, e.g. a record, in fact), you could also use destructuring bind. This is a weaker form of pattern matching, but it is still very useful. Although it is documented in the section on let, it can be used in many contexts, including function definitions.